Id | Subject | Object | Predicate | Lexical cue |
T418 | 0-214 | Sentence | denotes | Convolutional neural network (CNN) architectures use multiple blocks of successive convolution and pooling operations for feature learning and down sampling along the time and feature dimensions, respectively [71]. |
T419 | 215-309 | Sentence | denotes | The VGG16 is a pre-trained CNN [72] used as a base model for transfer learning (Table 6) [73]. |
T420 | 310-494 | Sentence | denotes | VGG16 is a famous CNN architecture that uses multiple stacks of small kernel filters (3 by 3) instead of the shallow architecture of two or three layers with large kernel filters [74]. |
T421 | 495-659 | Sentence | denotes | Using multiple stacks of small kernel filters increases the network’s depth, which results in improving complex feature learning while decreasing computation costs. |
T422 | 660-738 | Sentence | denotes | VGG16 architecture includes 13 convolutional and three fully connected layers. |
T423 | 739-993 | Sentence | denotes | Audio-based risky behavior detection is based on complex features and distinguishable behaviors (e.g., coughing, sneezing, background noise), which requires a deeper CNN model than shallow architecture (i.e., two or three-layer architecture) offers [75]. |
T424 | 994-1096 | Sentence | denotes | VGG16 has been adopted for audio event detection and demonstrated significant literature results [71]. |
T425 | 1097-1200 | Sentence | denotes | The feature maps were flattened to obtain the fully connected layer after the last convolutional layer. |
T426 | 1201-1334 | Sentence | denotes | For most CNN-based architectures, only the last convolutional layer activations are connected to the final classification layer [76]. |
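
The claim in T421, that stacks of small 3 by 3 filters add depth while lowering cost, can be checked with simple arithmetic. The sketch below is illustrative only and is not taken from the annotated paper: the helper names and the channel width of 64 are assumptions, biases are ignored, stride is taken as 1, and every layer is assumed to have the same number of input and output channels.

```python
def stacked_conv_stats(num_layers, kernel=3, channels=64):
    """Receptive field and weight count for a stack of stride-1 convolutions.

    Hypothetical helper: assumes equal input/output channels and no biases.
    """
    # Each extra stride-1 layer widens the receptive field by (kernel - 1).
    receptive_field = 1 + num_layers * (kernel - 1)
    # Each layer holds kernel * kernel * channels_in * channels_out weights.
    weights = num_layers * kernel * kernel * channels * channels
    return receptive_field, weights


def single_conv_weights(kernel, channels=64):
    """Weight count of one convolution layer with a kernel x kernel filter."""
    return kernel * kernel * channels * channels


if __name__ == "__main__":
    for layers in (2, 3):
        rf, stacked = stacked_conv_stats(layers)
        single = single_conv_weights(rf)
        print(f"{layers} stacked 3x3 layers: receptive field {rf}x{rf}, "
              f"{stacked} weights vs {single} for a single {rf}x{rf} layer")
```

Under these assumptions, three stacked 3 by 3 layers cover the same 7 by 7 region as one large filter with 27 rather than 49 weights per channel pair, and they apply three nonlinearities instead of one.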