Id |
Subject |
Object |
Predicate |
Lexical cue |
T406 |
0-4 |
Sentence |
denotes |
5.6. |
T407 |
5-41 |
Sentence |
denotes |
Audio-Based Risky Behavior Detection |
T408 |
42-185 |
Sentence |
denotes |
This section examines an audio classification algorithm that recognizes coughing and sneezing using an audio sensor with an embedded DL engine. |
T409 |
186-244 |
Sentence |
denotes |
The methodology for audio detection is shown in Figure 13. |
T410 |
245-410 |
Sentence |
denotes |
This figure shows the four main steps of the audio DL process. The recording first needs to be preprocessed to remove noise before sound features can be extracted.
T411 |
411-551 |
Sentence |
denotes |
The most commonly known time-frequency features are the short-time Fourier transform (STFT) [67], the Mel spectrogram [68], and the wavelet spectrogram [69].
T412 |
552-744 |
Sentence |
denotes |
The Mel spectrogram is based on a nonlinear frequency scale motivated by human auditory perception and provides a more compact spectral representation of sounds than the STFT [3].
T413 |
745-832 |
Sentence |
denotes |
To compute a Mel spectrogram, we first convert the sampled audio files into time series.
T414 |
833-926 |
Sentence |
denotes |
Next, the magnitude spectrogram of each time series is computed and then mapped onto the Mel scale with a power of 2.
T415 |
927-974 |
Sentence |
denotes |
The end result is a Mel spectrogram [70].
T416 |
975-1069 |
Sentence |
denotes |
The last preprocessing step is to convert the Mel spectrograms into log Mel spectrograms.
T417 |
1070-1164 |
Sentence |
denotes |
The resulting images are then used as input to the deep learning modelling process.
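The preprocessing chain described above (magnitude spectrogram, mapping onto the Mel scale with power 2, then taking the log) can be sketched with NumPy alone. The FFT size, hop length, number of Mel bands, and the O'Shaughnessy Mel formula below are illustrative assumptions, not the paper's settings:

```python
import numpy as np

def hz_to_mel(f):
    # O'Shaughnessy formula, as used by common audio libraries (an assumption here)
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def log_mel_spectrogram(y, sr, n_fft=512, hop=128, n_mels=40):
    # 1) Frame the signal and compute the power (magnitude squared) STFT
    window = np.hanning(n_fft)
    n_frames = 1 + (len(y) - n_fft) // hop
    frames = np.stack([y[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    spec = np.abs(np.fft.rfft(frames, axis=1)) ** 2   # "power 2" spectrogram

    # 2) Build a triangular Mel filter bank
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2)
    bin_freqs = np.fft.rfftfreq(n_fft, 1.0 / sr)
    fb = np.zeros((n_mels, len(bin_freqs)))
    for i in range(n_mels):
        lo, ctr, hi = (mel_to_hz(mels[i]), mel_to_hz(mels[i + 1]),
                       mel_to_hz(mels[i + 2]))
        up = (bin_freqs - lo) / (ctr - lo)
        down = (hi - bin_freqs) / (hi - ctr)
        fb[i] = np.clip(np.minimum(up, down), 0.0, None)

    # 3) Map the power spectrogram onto the Mel scale, then take the log (dB)
    mel_spec = spec @ fb.T                            # shape: (frames, n_mels)
    return 10.0 * np.log10(np.maximum(mel_spec, 1e-10))

# Usage: a 1 s, 440 Hz tone sampled at 16 kHz
sr = 16000
t = np.arange(sr) / sr
logmel = log_mel_spectrogram(np.sin(2 * np.pi * 440.0 * t), sr)
print(logmel.shape)   # one log-Mel "image" per recording
```

The resulting 2-D array is what gets treated as an image by the CNN in the next step; the energy of the 440 Hz tone concentrates in the Mel band whose center is closest to 440 Hz.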
T418 |
1165-1379 |
Sentence |
denotes |
Convolutional neural network (CNN) architectures use multiple blocks of successive convolution and pooling operations for feature learning and downsampling along the time and feature dimensions, respectively [71].
T419 |
1380-1474 |
Sentence |
denotes |
VGG16 is a pre-trained CNN [72] used as the base model for transfer learning (Table 6) [73].
T420 |
1475-1659 |
Sentence |
denotes |
VGG16 is a well-known CNN architecture that uses multiple stacks of small (3 × 3) kernel filters instead of a shallow architecture of two or three layers with large kernel filters [74].
T421 |
1660-1824 |
Sentence |
denotes |
Using multiple stacks of small kernel filters increases the network’s depth, which improves complex feature learning while decreasing computational cost.
T422 |
1825-1903 |
Sentence |
denotes |
The VGG16 architecture includes 13 convolutional layers and three fully connected layers, for a total of 16 weight layers.
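As a sanity check on the architecture description, the layer inventory of the standard VGG16 configuration (224 × 224 input) can be tallied in plain Python. The layer list below is the standard published configuration, not anything specific to this paper, and the 2-class head at the end is an illustrative assumption for a cough-vs-sneeze task:

```python
# VGG16 convolutional layers: (kernel_size, in_channels, out_channels); all 3x3.
convs = [
    (3, 3, 64), (3, 64, 64),                      # block 1
    (3, 64, 128), (3, 128, 128),                  # block 2
    (3, 128, 256), (3, 256, 256), (3, 256, 256),  # block 3
    (3, 256, 512), (3, 512, 512), (3, 512, 512),  # block 4
    (3, 512, 512), (3, 512, 512), (3, 512, 512),  # block 5
]
# Fully connected layers: (in_features, out_features).
# 25088 = 512 channels * 7 * 7 spatial map after five 2x2 poolings of 224x224.
fcs = [(512 * 7 * 7, 4096), (4096, 4096), (4096, 1000)]

conv_params = sum((k * k * cin + 1) * cout for k, cin, cout in convs)  # +1 bias
fc_params = sum((fin + 1) * fout for fin, fout in fcs)

print(len(convs), len(fcs))        # 13 conv + 3 FC = 16 weight layers
print(conv_params + fc_params)     # total trainable parameters

# Transfer-learning sketch: freeze the 13-conv feature extractor and replace
# the 1000-class ImageNet head with a small task head, e.g. 2 classes
# (cough vs. sneeze) -- an assumed head size, not the paper's exact setup.
head_params = (512 * 7 * 7 + 1) * 2
```

The count also makes the depth/cost trade-off concrete: the 13 stacked 3 × 3 convolutional layers account for only about 14.7 M of the roughly 138 M parameters; most of the rest sit in the first fully connected layer.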
T423 |
1904-2158 |
Sentence |
denotes |
Audio-based risky behavior detection relies on complex features to distinguish behaviors (e.g., coughing, sneezing, background noise), which requires a deeper CNN model than a shallow (i.e., two- or three-layer) architecture offers [75].
T424 |
2159-2261 |
Sentence |
denotes |
VGG16 has been adopted for audio event detection and has demonstrated strong results in the literature [71].
T425 |
2262-2365 |
Sentence |
denotes |
After the last convolutional layer, the feature maps are flattened to form the input to the fully connected layers.
T426 |
2366-2499 |
Sentence |
denotes |
For most CNN-based architectures, only the last convolutional layer activations are connected to the final classification layer [76]. |
T427 |
2500-2600 |
Sentence |
denotes |
The ESC-50 [77] and AudioSet [78] datasets were used to extract cough and sneezing training samples. |
T428 |
2601-2756 |
Sentence |
denotes |
The ESC-50 dataset is a labelled collection of 2000 environmental audio recordings suitable for benchmarking methods of environmental sound classification. |
T429 |
2757-2916 |
Sentence |
denotes |
AudioSet consists of an expanding ontology of 632 audio event classes and a collection of 2,084,320 human-labelled, 10 s sound clips taken from YouTube videos. |
T430 |
2917-3036 |
Sentence |
denotes |
Over 5000 samples were extracted for the transfer learning CNN model, which were then divided into training and test datasets.
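A minimal sketch of such a split, using hypothetical stand-ins for the extracted samples; the array shapes, the 80/20 ratio, and the random seed are all assumptions, since the paper does not state its split:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for ~5000 extracted log-Mel images and their labels
# (0 = cough, 1 = sneeze); shapes are illustrative only.
X = rng.normal(size=(5000, 122, 40))
y = rng.integers(0, 2, size=5000)

# Shuffle once, then hold out 20% for testing (an assumed ratio).
idx = rng.permutation(len(X))
n_test = len(X) // 5
test_idx, train_idx = idx[:n_test], idx[n_test:]
X_train, y_train = X[train_idx], y[train_idx]
X_test, y_test = X[test_idx], y[test_idx]

print(len(X_train), len(X_test))   # 4000 1000
```

Shuffling before splitting matters here because samples extracted from the same source dataset often arrive grouped by class.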
T431 |
3037-3119 |
Sentence |
denotes |
We examined the performance of the trained CNN models on coughing and sneezing samples.
T432 |
3120-3153 |
Sentence |
denotes |
The results are shown in Table 7. |