WenetSpeech
A 10000+ Hours Multi-domain Mandarin Corpus for Speech Recognition

About

We release a multi-domain Mandarin speech corpus with 10,000+ hours of transcribed audio collected from YouTube and podcasts. Optical character recognition (OCR) and automatic speech recognition (ASR) techniques are used to label the YouTube and podcast recordings, respectively. To improve the quality of the corpus, we use a novel end-to-end label error detection method to further validate and filter the data.

10,000+ hours of high-label data (confidence >= 0.95), for supervised training.

2,400+ hours of weak-label data (0.6 < confidence < 0.95), for semi-supervised or noisy training, etc.

22,400+ hours of audio in total, consisting of both labeled and unlabeled data, for unsupervised training or pre-training, etc.
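As a concrete illustration of the confidence-based partitioning above, the Python sketch below groups utterance segments into the three subsets. The metadata layout it reads (a JSON file with per-segment "confidence" and "sid" fields under "audios"/"segments") is an assumption made for illustration only; consult the released metadata and toolkit for the actual format.

```python
# Hypothetical sketch: split segments into training subsets by label
# confidence, following the thresholds listed above. The JSON field names
# used here are assumptions, not the official WenetSpeech specification.
import json


def partition_by_confidence(metadata_path):
    """Group segment IDs into strong-label, weak-label, and other buckets."""
    with open(metadata_path, encoding="utf-8") as f:
        meta = json.load(f)

    strong, weak, other = [], [], []
    for audio in meta.get("audios", []):          # assumed top-level key
        for seg in audio.get("segments", []):     # assumed per-audio segment list
            conf = seg.get("confidence", 0.0)
            sid = seg.get("sid")                  # assumed segment identifier
            if conf >= 0.95:                      # high-label: supervised training
                strong.append(sid)
            elif conf > 0.6:                      # weak-label: semi-supervised / noisy training
                weak.append(sid)
            else:                                 # remaining audio: unsupervised / pre-training
                other.append(sid)
    return strong, weak, other


if __name__ == "__main__":
    s, w, o = partition_by_confidence("WenetSpeech.json")
    print(f"strong: {len(s)}  weak: {len(w)}  other: {len(o)}")
```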


Diversity

WenetSpeech can be classified into 10 categories according to speaking style and spoken scenario.


License

The WenetSpeech dataset is available to download for non-commercial purposes under a Creative Commons Attribution 4.0 International License. WenetSpeech does not own the copyright of the audio; the copyright remains with the original owners of the videos and audio, and a public URL is provided for each original video or audio.

DOWNLOAD

Please fill out the Google Form here, check your mailbox, and follow the instructions to download the WenetSpeech dataset. If you do not receive the email, please write to binbzha@gmail.com.

Schedule


Oct 08, 2021: Release paper
Oct 25, 2021: Release data
Nov 11, 2021: Release various ASR models trained on WenetSpeech

WenetSpeech 2.0

We are preparing WenetSpeech 2.0, which will contain more data as well as richer annotations. If you would like to cooperate or contribute, please contact the authors via the WeChat or email addresses below.


Binbin Zhang

binbzha@qq.com


Hang Lv

hanglyu1991@gmail.com

ACKNOWLEDGEMENTS

WenetSpeech borrows heavily from the work of GigaSpeech. The authors would like to thank Jiayu Du and Guoguo Chen for their suggestions on this work. The authors would also like to thank their colleagues, Lianhui Zhang and Yu Mao, for collecting some of the YouTube data.