Therefore, the processing time for audio is longer than that for video object detection.