Spoken language recognition on Mozilla Common Voice — Part II: Models. | by Sergey Vilov | Aug, 2023

Introduction:

Welcome to the second article on spoken language recognition with the Mozilla Common Voice dataset. In this installment, we discuss the training and evaluation of several models and select the best one. The models include a convolutional neural network (CNN), a CRNN model from Bartz et al. 2017, a CRNN model from Alashban et al. 2022, an AttNN model from De Andrade et al. 2018, a CRNN* model (the AttNN architecture without the attention block), and a time-delay neural network (TDNN). All models were trained on the same split of the Common Voice dataset and evaluated on accuracy. The AttNN model performed best, with a pairwise accuracy of 97%. We also studied the effect of tuplemax loss and found that it performed worse than softmax loss. Overall, our results set a new benchmark for spoken language recognition on the Common Voice dataset; future studies can combine different embeddings and neural network architectures to improve performance further. In Part III, we will discuss audio transformations that can enhance model performance.

Spoken Language Recognition: Training and Evaluation of Various Models

In this article, we continue our discussion on spoken language recognition using the Mozilla Common Voice dataset. In the first part, we covered data selection and optimal embedding choices. Now, we will focus on training multiple models and selecting the best one for the task.

Models to be Trained and Evaluated

We will proceed by training and evaluating the following models on the full dataset, which consists of 40,000 samples:

1. Convolutional neural network (CNN) model: In this approach, we treat the language classification problem as the classification of 2-dimensional images. CNN-based classifiers have shown promising results in a language recognition competition.
2. CRNN model from Bartz et al. 2017: This model combines the descriptive power of CNNs with the ability to capture temporal features of RNNs.
3. CRNN model from Alashban et al. 2022: This is a variation of the CRNN architecture.
4. AttNN model from De Andrade et al. 2018: Initially proposed for speech recognition, this model has an attention block that weighs the relevance of input sequence parts for classification.
5. CRNN* model: Similar to the AttNN architecture, but without the attention block.
6. Time-delay neural network (TDNN) model: This model, tested in Snyder et al. 2018, was used to generate X-vector embeddings for spoken language recognition. Here, we bypass X-vector generation and directly train the network for language classification.
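
To make the CRNN* variant concrete: it is a convolutional front end over the spectrogram feeding an LSTM, with simple mean pooling where AttNN would place its attention block. The following PyTorch sketch is illustrative only; the layer sizes and depths are assumptions, not the exact hyperparameters of the cited papers.

```python
import torch
import torch.nn as nn

class CRNNStar(nn.Module):
    """Sketch of a CRNN* classifier: conv layers over the (time x mel)
    spectrogram, an LSTM over time, mean pooling in place of attention,
    then a linear classifier. Sizes are illustrative assumptions."""
    def __init__(self, n_mels=13, n_classes=5, hidden=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        self.lstm = nn.LSTM(32 * n_mels, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, n_classes)

    def forward(self, x):                     # x: (batch, 1, time, n_mels)
        z = self.conv(x)                      # (batch, 32, time, n_mels)
        z = z.permute(0, 2, 1, 3).flatten(2)  # (batch, time, 32 * n_mels)
        out, _ = self.lstm(z)                 # (batch, time, hidden)
        pooled = out.mean(dim=1)              # mean pooling instead of attention
        return self.fc(pooled)                # (batch, n_classes)

logits = CRNNStar()(torch.randn(2, 1, 100, 13))
print(logits.shape)  # torch.Size([2, 5])
```

Swapping the mean-pooling line for a learned attention weighting over `out` yields the AttNN-style architecture.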

Training and Evaluation Process

All models were trained on the same train/validation/test split and the same mel spectrogram embeddings, using the first 13 mel filterbank coefficients. The resulting learning curves on the validation set are shown in the accompanying figure.

Analyzing Model Performance

Based on the results, it is clear that the AttNN, TDNN, and CRNN* models perform similarly well, with AttNN achieving the highest accuracy at 92.4%. On the other hand, the CRNN (Bartz et al. 2017), CNN, and CRNN (Alashban et al. 2022) models showed modest performance, with CRNN (Alashban et al. 2022) having the lowest accuracy at 58.5%.

To further analyze the performance difference between the models, we note that TDNN and AttNN were specifically designed for speech recognition tasks and had been previously tested against benchmarks. This might explain their superior performance.

The performance gap between AttNN and our CRNN model reinforces the importance of the attention mechanism for spoken language recognition. Interestingly, the CRNN model from Bartz et al. 2017, despite its similar architecture, performs worse. This could be because the default model hyperparameters are not optimal for the Mozilla Common Voice dataset.

The CNN model, while lacking a specific memory mechanism, comes next in terms of performance. However, it can be noted that CNNs do possess some notion of memory due to the hierarchical nature of their computations. The TDNN model, which scored second, can be viewed as a 1-D CNN. With more exploration of CNN architecture, the CNN model might have achieved results similar to the TDNN model.

Surprisingly, the CRNN model from Alashban et al. 2022 showed the worst accuracy among all models. It is interesting to note that the original study reported an accuracy of about 97% but did not make the code publicly available, making it difficult to pinpoint the source of the large discrepancy.

Pairwise Accuracy Analysis

In many cases, users are only concerned with distinguishing between two specific languages. In this context, pairwise accuracy becomes a more relevant metric. The pairwise accuracy for the AttNN model on the test set is shown in the accompanying table, along with the confusion matrix.

The model performs best at distinguishing between German and Spanish, as well as between French and English, both with 98% accuracy. These results are expected given the significant differences in sound systems between these languages.
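
Given a confusion matrix, the pairwise accuracy for any two languages can be read off by restricting the matrix to those two classes: the diagonal entries of the 2x2 sub-matrix are the correct decisions, the off-diagonal entries the cross-confusions. A minimal numpy sketch, with a made-up 3-language confusion matrix for illustration:

```python
import numpy as np

def pairwise_accuracy(conf, i, j):
    """Accuracy when only classes i and j are considered:
    trace of the 2x2 sub-matrix over its total count."""
    sub = conf[np.ix_([i, j], [i, j])]
    return np.trace(sub) / sub.sum()

# Toy confusion matrix (rows: true language, cols: predicted language)
conf = np.array([[95, 3, 2],
                 [4, 90, 6],
                 [1, 5, 94]])
print(round(pairwise_accuracy(conf, 0, 1), 3))  # 0.964
```

Averaging this quantity over all language pairs gives the overall pairwise accuracy reported above.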

Comparison of Softmax Loss and Tuplemax Loss

Although we used softmax loss to train the model, it has been reported that tuplemax loss might yield higher accuracy in pairwise classification tasks. After implementing tuplemax loss in PyTorch, we retrained the AttNN model and compared the effect of both loss functions on accuracy and pairwise accuracy on the validation set.

The results indicate that tuplemax loss performs worse than softmax loss in terms of both overall accuracy and pairwise accuracy.
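
Tuplemax loss (Wan et al. 2019) replaces the n-way softmax loss with the average of the two-way softmax losses over every pair containing the true class g, i.e. -1/(n-1) Σ_{k≠g} log(e^{x_g}/(e^{x_g}+e^{x_k})). A minimal PyTorch sketch, not necessarily the exact implementation used in our experiments:

```python
import torch
import torch.nn.functional as F

def tuplemax_loss(logits, target):
    """Sketch of tuplemax loss: for true class g, average the two-way
    softmax losses -log(e^{x_g}/(e^{x_g}+e^{x_k})) over all k != g."""
    n = logits.shape[1]
    x_g = logits.gather(1, target.unsqueeze(1))   # (batch, 1), true-class logit
    # -log(e^{x_g}/(e^{x_g}+e^{x_k})) == -logsigmoid(x_g - x_k)
    pair = -F.logsigmoid(x_g - logits)            # (batch, n)
    mask = torch.ones_like(pair)
    mask.scatter_(1, target.unsqueeze(1), 0.0)    # drop the k == g term
    return (pair * mask).sum(dim=1).mean() / (n - 1)

loss = tuplemax_loss(torch.tensor([[2.0, 0.5, -1.0]]), torch.tensor([0]))
```

The function can be dropped in wherever `F.cross_entropy` is used in the training loop.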

Conclusion

In this study, we achieved 92% accuracy and 97% pairwise accuracy in spoken language recognition using short audio clips from the Mozilla Common Voice dataset. Our analysis led to the selection of the AttNN model as the best performing model, highlighting the importance of LSTM and attention blocks for this task.

The results presented here serve as a benchmark for spoken language recognition on the Mozilla Common Voice dataset. Further improvements can be made by combining different embeddings and exploring additional neural network architectures, such as transformers. In Part III of this series, we will discuss audio transformations that can potentially enhance model performance.

References:
– Alashban, Adal A., et al. “Spoken language identification system using convolutional recurrent neural network.” Applied Sciences 12.18 (2022): 9181.
– Bartz, Christian, et al. “Language identification using deep convolutional recurrent neural networks.” Neural Information Processing: 24th International Conference, ICONIP 2017, Guangzhou, China, November 14–18, 2017, Proceedings, Part VI 24. Springer International Publishing, 2017.
– De Andrade, Douglas Coimbra, et al. “A neural attention model for speech command recognition.” arXiv preprint arXiv:1808.08929 (2018).
– Snyder, David, et al. “Spoken language recognition using x-vectors.” Odyssey. Vol. 2018. 2018.
– Wan, Li, et al. “Tuplemax loss for language identification.” ICASSP 2019–2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019.
