Last November, we announced the 1,000 Languages Initiative, an ambitious commitment to build a machine learning (ML) model that would support the world’s one thousand most-spoken languages, bringing greater inclusion to billions of people around the globe. However, some of these languages are spoken by fewer than twenty million people, so a core challenge is how to support languages for which there are relatively few speakers or limited available data.
Today, we’re excited to share more about the Universal Speech Model (USM), a critical first step towards supporting 1,000 languages. USM is a family of state-of-the-art speech models with 2B parameters trained on 12 million hours of speech and 28 billion sentences of text, spanning 300+ languages. USM, which is for use in YouTube (e.g., for closed captions), can perform automatic speech recognition (ASR) not only on widely-spoken languages like English and Mandarin, but also on under-resourced languages like Amharic, Cebuano, Assamese, and Azerbaijani, to name a few. In “Google USM: Scaling Automatic Speech Recognition Beyond 100 Languages”, we demonstrate that using a large unlabeled multilingual dataset to pre-train the encoder of the model and fine-tuning on a smaller set of labeled data enables us to recognize under-represented languages. Moreover, our model training process is effective at adapting to new languages and data.
Figure: A sample of the languages that USM supports.
Challenges in current ASR
To accomplish this ambitious goal, we need to address two significant challenges in ASR.
First, there is a lack of scalability with conventional supervised learning approaches. A fundamental challenge of scaling speech technologies to many languages is obtaining enough data to train high-quality models. With conventional approaches, audio data needs to be either manually labeled, which is time-consuming and costly, or collected from sources with pre-existing transcriptions, which are harder to find for languages that lack wide representation. In contrast, self-supervised learning can leverage audio-only data, which is available in much larger quantities across languages. This makes self-supervision a better approach to accomplish our goal of scaling across hundreds of languages.
Another challenge is that models must improve in a computationally efficient manner while we expand the language coverage and quality. This requires the learning algorithm to be flexible, efficient, and generalizable. More specifically, such an algorithm should be able to use large amounts of data from a variety of sources, enable model updates without requiring complete retraining, and generalize to new languages and use cases.
Our approach: Self-supervised learning with fine-tuning
USM uses the standard encoder-decoder architecture, where the decoder can be CTC, RNN-T, or LAS. For the encoder, USM uses the Conformer, or convolution-augmented transformer. The key component of the Conformer is the Conformer block, which consists of attention, feed-forward, and convolutional modules. The encoder takes as input the log-mel spectrogram of the speech signal and performs convolutional sub-sampling, after which a series of Conformer blocks and a projection layer are applied to obtain the final embeddings.
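To make the front end of the encoder concrete, here is a minimal NumPy sketch of convolutional sub-sampling applied to a log-mel spectrogram. It is an illustration only, not USM's implementation: the sizes (100 frames, 80 mel bins, 16 hidden units), the kernel width, the use of two stride-2 layers, and the random weights are all assumptions, and the Conformer blocks and projection layer that would follow are omitted.

```python
import numpy as np

def conv1d_stride2(x, weight):
    """Strided 1-D convolution over time.

    x: (T, C_in) frames, weight: (K, C_in, C_out) kernel.
    Returns roughly T / 2 output frames.
    """
    k = weight.shape[0]
    starts = range(0, x.shape[0] - k + 1, 2)
    # Each output frame is a weighted sum over a window of K input frames.
    return np.stack([np.einsum("kc,kcd->d", x[s:s + k], weight) for s in starts])

def subsample(log_mel, rng, hidden=16):
    """Two stride-2 convolutions with ReLU, giving roughly 4x fewer frames.

    The 4x factor and layer sizes are assumptions made for this sketch.
    """
    w1 = rng.normal(size=(3, log_mel.shape[1], hidden)) * 0.1
    w2 = rng.normal(size=(3, hidden, hidden)) * 0.1
    h = np.maximum(conv1d_stride2(log_mel, w1), 0.0)
    return np.maximum(conv1d_stride2(h, w2), 0.0)

rng = np.random.default_rng(0)
log_mel = rng.normal(size=(100, 80))  # stand-in for a real log-mel spectrogram
frames = subsample(log_mel, rng)
print(frames.shape)  # (24, 16): 100 frames reduced to 24
```

Shortening the sequence before the attention layers matters because self-attention cost grows with the square of the number of frames.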
Our training pipeline starts with a first step of self-supervised learning on speech audio covering hundreds of languages. In the second, optional step, the model’s quality and language coverage can be improved through an additional pre-training step with text data. The decision to incorporate the second step depends on whether text data is available. USM performs best with this second optional step. The last step of the training pipeline is to fine-tune on downstream tasks (e.g., ASR or automatic speech translation) with a small amount of supervised data.
For the first step, we use BEST-RQ, which has already demonstrated state-of-the-art results on multilingual tasks and has proven to be efficient when using very large amounts of unsupervised audio data.
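For readers unfamiliar with BEST-RQ, the sketch below shows its core idea as we understand it: a frozen random projection and a frozen random codebook turn each speech frame into a discrete label, and the encoder is trained to predict those labels for masked frames. This is a simplified illustration, not the USM training code. The class and function names, all sizes, the masking probability, and the per-frame (rather than span-based) masking are assumptions.

```python
import numpy as np

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

class RandomProjectionQuantizer:
    """Frozen random projection + frozen random codebook (never trained)."""

    def __init__(self, feature_dim, code_dim, codebook_size, seed=0):
        rng = np.random.default_rng(seed)
        self.projection = rng.normal(size=(feature_dim, code_dim))
        self.codebook = l2_normalize(rng.normal(size=(codebook_size, code_dim)))

    def __call__(self, features):
        """Map (T, feature_dim) frames to (T,) integer target labels."""
        projected = l2_normalize(features @ self.projection)
        # Label = index of the nearest codebook vector for each frame.
        diffs = projected[:, None, :] - self.codebook[None, :, :]
        return np.linalg.norm(diffs, axis=-1).argmin(axis=-1)

def mask_frames(features, mask_prob, rng):
    """Replace a random subset of frames with small Gaussian noise."""
    mask = rng.random(features.shape[0]) < mask_prob
    masked = features.copy()
    masked[mask] = rng.normal(scale=0.1, size=(int(mask.sum()), features.shape[1]))
    return masked, mask

rng = np.random.default_rng(1)
features = rng.normal(size=(50, 80))      # stand-in for log-mel frames
quantizer = RandomProjectionQuantizer(feature_dim=80, code_dim=16, codebook_size=32)
labels = quantizer(features)              # targets come from the unmasked input
masked_input, mask = mask_frames(features, mask_prob=0.3, rng=rng)
# During pre-training, the encoder would read `masked_input` and be trained with
# a cross-entropy loss to predict `labels[mask]` at the masked positions.
print(labels.shape, int(mask.sum()))
```

Because the quantizer is never trained, it adds no learnable parameters and needs no separate representation-learning stage, which is consistent with the efficiency on very large audio collections noted above.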
In the second (optional) step, we used multi-objective supervised pre-training to incorporate knowledge from additional text data. The model introduces an additional encoder module to take text as input and additional layers to combine the output of the speech encoder and the text encoder, and trains the model jointly on unlabeled speech, labeled speech, and text data.
In the last stage, USM is fine-tuned on the downstream tasks. The overall training pipeline is illustrated below. With the knowledge acquired during pre-training, USM models achieve good quality with only a small amount of supervised data from the downstream tasks.
Figure: USM’s overall training pipeline.
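The three stages, including the optional second one, can be summarized in a small sketch. The `Stage` type, the `build_pipeline` function, and the data labels are hypothetical names used only to mirror the description above; they do not correspond to any released USM code.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Stage:
    name: str
    data: tuple

def build_pipeline(text_data_available: bool):
    """Return the ordered training stages described in the post."""
    stages = [Stage("self-supervised pre-training (BEST-RQ)", ("unlabeled speech",))]
    if text_data_available:
        # Optional second step: only included when text data exists.
        stages.append(Stage(
            "multi-objective supervised pre-training",
            ("unlabeled speech", "labeled speech", "text"),
        ))
    stages.append(Stage("fine-tuning on a downstream task", ("small supervised set",)))
    return stages

for stage in build_pipeline(text_data_available=True):
    print(stage.name, "<-", ", ".join(stage.data))
```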
Key results
Performance across multiple languages on YouTube Captions
Our encoder incorporates 300+ languages through pre-training. We demonstrate the effectiveness of the pre-trained encoder by fine-tuning on YouTube Captions’ multilingual speech data. The supervised YouTube data includes 73 languages and has on average less than three thousand hours of data per language. Despite limited supervised data, the model achieves less than 30% word error rate (WER; lower is better) on average across the 73 languages, a milestone we have never achieved before. For en-US, USM achieves a 6% relative reduction in WER compared to the current internal state-of-the-art model. Lastly, we compare with the recently released large model, Whisper (large-v2), which was trained with more than 400k hours of labeled data. For the comparison, we only use the 18 languages that Whisper can successfully decode with lower than 40% WER. Our model has, on average, a 32.7% relative reduction in WER compared to Whisper for these 18 languages.
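For reference, the sketch below shows how WER and a relative WER reduction are typically computed. It uses the standard word-level edit distance; the function names and the example sentences are ours, and the numbers passed to `relative_wer_reduction` are illustrative values, not results from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """(substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Word-level Levenshtein distance, keeping one row of the table at a time.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,                          # deletion
                cur[j - 1] + 1,                       # insertion
                prev[j - 1] + (ref_word != hyp_word)  # substitution or match
            ))
        prev = cur
    return prev[-1] / len(ref)

def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction by which new_wer is lower than baseline_wer."""
    return (baseline_wer - new_wer) / baseline_wer

# One substitution ("sat" -> "sit") and one deletion ("the") over 6 words.
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
# Illustrative numbers only: going from 20% to 15% WER is a 25% relative reduction.
print(relative_wer_reduction(0.20, 0.15))
```

A relative reduction is a ratio of error rates, so a 6% relative reduction means the new WER is 0.94 times the baseline WER, not 6 percentage points lower.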
Figure: USM supports all 73 languages in the YouTube Captions test set and outperforms Whisper on the languages Whisper can support with lower than 40% WER. Lower WER is better.
Generalization to downstream ASR tasks
On publicly available datasets, our model shows lower WER compared to Whisper on CORAAL (African American Vernacular English), SpeechStew (en-US), and FLEURS (102 languages). Our model achieves lower WER with and without training on in-domain data. The comparison on FLEURS reports the subset of languages (62) that overlaps with the languages supported by the Whisper model. For FLEURS, USM without in-domain data has a 65.8% relative reduction in WER compared to Whisper, and a 67.8% relative reduction with in-domain data.
Figure: Comparison of USM (with or without in-domain data) and Whisper results on ASR benchmarks. Lower WER is better.
Performance on automatic speech translation (AST)
For speech translation, we fine-tune USM on the CoVoST dataset. Our model, which includes text via the second stage of our pipeline, achieves state-of-the-art quality with limited supervised data. To assess the breadth of the model’s performance, we segment the languages from the CoVoST dataset into high, medium, and low based on resource availability and calculate the BLEU score (higher is better) for each segment. As shown below, USM outperforms Whisper for all segments.
Figure: CoVoST BLEU score. Higher BLEU is better.
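The per-segment evaluation described above can be sketched as follows. Everything specific here is an assumption made for illustration: we measure resource availability in hours of training data, the thresholds are hypothetical, the segment score is taken as a plain average of per-language BLEU, and the language names and scores are placeholders rather than CoVoST results.

```python
from collections import defaultdict
from statistics import mean

def segment_for(hours: float, high_min: float = 100.0, low_max: float = 10.0) -> str:
    """Assign a resource segment from training hours (hypothetical thresholds)."""
    if hours >= high_min:
        return "high"
    if hours <= low_max:
        return "low"
    return "medium"

def bleu_by_segment(train_hours: dict, bleu: dict) -> dict:
    """Average per-language BLEU within each resource segment."""
    groups = defaultdict(list)
    for lang, hours in train_hours.items():
        groups[segment_for(hours)].append(bleu[lang])
    return {segment: mean(scores) for segment, scores in groups.items()}

# Placeholder inputs, not real measurements.
train_hours = {"lang_a": 250.0, "lang_b": 120.0, "lang_c": 40.0, "lang_d": 3.0}
bleu = {"lang_a": 30.0, "lang_b": 20.0, "lang_c": 15.0, "lang_d": 5.0}
print(bleu_by_segment(train_hours, bleu))
```

Reporting by segment keeps a few high-resource languages from dominating a single overall average, which is why it is a useful way to assess breadth.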
Towards 1,000 languages
The development of USM is a critical effort towards realizing Google’s mission to organize the world’s information and make it universally accessible. We believe USM’s base model architecture and training pipeline comprise a foundation on which we can build to expand speech modeling to the next 1,000 languages.
Learn More
Check out our paper here. Researchers can request access to the USM API here.
Acknowledgements
We thank all the co-authors for contributing to the project and paper, including Andrew Rosenberg, Ankur Bapna, Bhuvana Ramabhadran, Bo Li, Chung-Cheng Chiu, Daniel Park, Françoise Beaufays, Hagen Soltau, Gary Wang, Ginger Perng, James Qin, Jason Riesa, Johan Schalkwyk, Ke Hu, Nanxin Chen, Parisa Haghani, Pedro Moreno Mengibar, Rohit Prabhavalkar, Tara Sainath, Trevor Strohman, Vera Axelrod, Wei Han, Yonghui Wu, Yongqiang Wang, Yu Zhang, Zhehuai Chen, and Zhong Meng.
We also thank Alexis Conneau, Min Ma, Shikhar Bharadwaj, Sid Dalmia, Jiahui Yu, Jian Cheng, Paul Rubenstein, Ye Jia, Justin Snyder, Vincent Tsang, Yuanzhong Xu, and Tao Wang for useful discussions.
We appreciate valuable feedback and support from Eli Collins, Jeff Dean, Sissie Hsiao, and Zoubin Ghahramani. Special thanks to Austin Tarango, Lara Tumeh, Amna Latif, and Jason Porta for their guidance around Responsible AI practices. We thank Elizabeth Adkison and James Cokerille for help with naming the model, Tom Small for the animated graphic, Abhishek Bapna for editorial support, and Erica Moreira for resource management. We thank Anusha Ramesh for feedback, guidance, and assistance with the publication strategy, and Calum Barnes and Salem Haykal for their valuable partnership.