Table 4 Model configuration.

From: Evaluating Mandarin tone pronunciation accuracy for second language learners using a ResNet-based Siamese network

| Layer | ResNet-18 | VGG-16 | AlexNet | Baseline |
|---|---|---|---|---|
| 1 | Conv: 5\(\times\)5, 64k, 2p, 1s | Conv\(\times\)2: 3\(\times\)3, 64c, 1p, 1s | Conv: 1\(\times\)1, 48k, 1p, 1s | Conv: 1\(\times\)1, 64k, 1p, 1s |
| | Max-pooling: 3\(\times\)3, 1p, 2s | | Max-pooling: 3\(\times\)3, 1p, 2s | Max-pooling: 3\(\times\)3, 2s |
| 2 | Conv\(\times\)2: 3\(\times\)3, 64k, 1p, 1s | | Conv: 5\(\times\)5, 128k, 1p, 1s | Conv: 5\(\times\)5, 128k, 2p, 1s |
| | | Max-pooling: 2\(\times\)2, 1p, 2s | Max-pooling: 3\(\times\)3, 1p, 2s | |
| 3 | rConv: 64k, 1s | Conv\(\times\)2: 3\(\times\)3, 128c, 1p, 1s | Conv\(\times\)2: 3\(\times\)3, 192k, 1p, 1s | Conv: 1\(\times\)1, 1k, 1p, 1s |
| | | | | Ave-pooling: 3\(\times\)3, 1p, 2s |
| 4 | Conv\(\times\)2: 3\(\times\)3, 64k, 1p, 1s | | | Bi-LSTM: 256 |
| | | Max-pooling: 2\(\times\)2, 1p, 2s | | |
| 5 | rConv: 64k, 1s | Conv\(\times\)3: 3\(\times\)3, 256c, 1p, 1s | Conv: 3\(\times\)3, 128k, 1p, 1s | FC-1: 256 |
| | | | Max-pooling: 3\(\times\)3, 2s | |
| 6 | Conv: 3\(\times\)3, 128k, 1p, 2s | | FC-1: 2048 | FC-2: 32 |
| 7 | Conv: 3\(\times\)3, 128k, 1p, 1s | | FC-2: 2048 | |
| | rConv: 128k, 2s | Max-pooling: 2\(\times\)2, 1p, 2s | | |
| 8 | Conv\(\times\)2: 3\(\times\)3, 128k, 1p, 1s | Conv\(\times\)3: 3\(\times\)3, 512c, 1p, 1s | FC-3: 32 | |
| 9 | rConv: 128k, 1s | | | |
| 10 | Conv: 3\(\times\)3, 256k, 1p, 2s | | | |
| | | Max-pooling: 2\(\times\)2, 1p, 2s | | |
| 11 | Conv: 3\(\times\)3, 256k, 1p, 1s | Conv\(\times\)3: 3\(\times\)3, 512c, 1p, 1s | | |
| | rConv: 256k, 2s | | | |
| 12 | Conv\(\times\)2: 3\(\times\)3, 256k, 1p, 1s | | | |
| 13 | rConv: 256k, 1s | | | |
| | | Max-pooling: 2\(\times\)2, 1p, 2s | | |
| 14 | Conv: 3\(\times\)3, 512k, 1p, 2s | FC-1: 4096 | | |
| 15 | Conv: 3\(\times\)3, 512k, 1p, 1s | FC-2: 4096 | | |
| | rConv: 512k, 2s | | | |
| 16 | Conv\(\times\)2: 3\(\times\)3, 512k, 1p, 1s | FC-3: 1000 | | |
| 17 | rConv: 512k, 1s | FC-4: 32 | | |
| | Ave-pooling: 1\(\times\)1, 1s | | | |
| 18 | FC-1: 1000 | | | |
| 19 | FC-2: 32 | | | |

In the table, “Conv” denotes a convolutional layer, “rConv” a residual convolutional layer, “FC” a fully connected layer, “Max-pooling” a max-pooling layer, and “Ave-pooling” an average pooling layer. “\(\times\)2” means two identical layers, “n\(\times\)n” refers to a kernel size of n, “k” represents the number of convolutional kernels, “c” refers to the number of convolutional channels, “p” stands for padding, and “s” denotes the stride.
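
To make the shorthand concrete, the following is a minimal sketch that parses one table entry into its fields and applies the standard convolution and pooling size formula, out = floor((in + 2p - n) / s) + 1. The names `LayerSpec`, `parse_layer`, and `output_size` are hypothetical, the input width of 64 is an arbitrary assumption (the table does not state the input size), and treating an omitted “p” as 0 and an omitted “s” as 1 is likewise an assumption rather than something the table specifies.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class LayerSpec:
    kind: str                # e.g. "Conv", "Max-pooling", "FC-1"
    repeat: int              # "x2" means two identical layers
    kernel: Optional[int]    # n from "n x n"; None for FC, rConv, Bi-LSTM entries
    channels: Optional[int]  # number in front of "k" or "c"
    padding: int             # "p"; assumed 0 when the entry omits it
    stride: int              # "s"; assumed 1 when the entry omits it


def parse_layer(entry: str) -> LayerSpec:
    """Parse one table cell such as 'Conv: 5\\(\\times\\)5, 64k, 2p, 1s'."""
    # Normalise both the LaTeX and the Unicode multiplication sign to "x".
    text = entry.replace(r"\(\times\)", "x").replace("×", "x")
    head, _, params = text.partition(":")
    head = head.strip()
    rep = re.fullmatch(r"(.+?)\s*x(\d+)", head)
    kind, repeat = (rep.group(1), int(rep.group(2))) if rep else (head, 1)

    kernel = channels = None
    padding, stride = 0, 1
    for token in (t.strip() for t in params.split(",")):
        if re.fullmatch(r"\d+x\d+", token):
            kernel = int(token.split("x")[0])
        elif re.fullmatch(r"\d+[kc]", token):
            channels = int(token[:-1])
        elif re.fullmatch(r"\d+p", token):
            padding = int(token[:-1])
        elif re.fullmatch(r"\d+s", token):
            stride = int(token[:-1])
    return LayerSpec(kind, repeat, kernel, channels, padding, stride)


def output_size(size: int, spec: LayerSpec) -> int:
    """Spatial size along one axis after applying the layer `repeat` times."""
    if spec.kernel is None:
        raise ValueError(f"{spec.kind} has no kernel size in the table")
    for _ in range(spec.repeat):
        size = (size + 2 * spec.padding - spec.kernel) // spec.stride + 1
    return size


# First three ResNet-18 entries from the table, on an assumed 64-wide input.
size = 64
for entry in (
    r"Conv: 5\(\times\)5, 64k, 2p, 1s",
    r"Max-pooling: 3\(\times\)3, 1p, 2s",
    r"Conv\(\times\)2: 3\(\times\)3, 64k, 1p, 1s",
):
    size = output_size(size, parse_layer(entry))
    print(entry, "->", size)
```

With the assumed width of 64, the three entries give sizes 64, 32, and 32: the 5\(\times\)5 convolution with padding 2 and the 3\(\times\)3 convolutions with padding 1 preserve the size, while the stride-2 pooling halves it.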