Pretrained Models in Wespeaker
Besides speaker verification, speaker embeddings can be utilized in many related tasks that require speaker modeling, such as:

- voice conversion
- text-to-speech
- speaker adaptive ASR
- target speaker extraction
For users who would like to verify the SV performance or extract speaker embeddings for the above tasks without the trouble of training the speaker embedding learner, we provide two types of pretrained models.
- **Checkpoint Model**, with suffix `.pt`: the model trained and saved as a checkpoint by the WeSpeaker Python code. You can reproduce our published results with it, or use it as a checkpoint to continue training.
- **Runtime Model**, with suffix `.onnx`: the runtime model is exported from the checkpoint model to ONNX format and can be run with ONNX Runtime.
Model License
The pretrained models in WeSpeaker follow the licenses of their corresponding datasets. For example, the pretrained models on VoxCeleb follow the Creative Commons Attribution 4.0 International License, since that is the license of the VoxCeleb dataset; see https://mm.kaist.ac.kr/datasets/voxceleb/.
Onnx Inference Demo
To use a pretrained model in PyTorch format, please refer directly to the `run.sh` in the corresponding recipe.
As for extracting speaker embeddings from the ONNX model, the following is a toy example.
```shell
# Download the pretrained model in onnx format and save it as onnx_path
# wav_path is the path to your wave file (16k sample rate)
python wespeaker/bin/infer_onnx.py --onnx_path $onnx_path --wav_path $wav_path
```
You can easily adapt `infer_onnx.py` to your application; a speaker diarization example can be found in the voxconverse recipe.
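Once embeddings are extracted, a common way to compare two utterances for speaker verification is cosine similarity scoring. Below is a minimal NumPy sketch; the random vectors are placeholders standing in for the embeddings that `infer_onnx.py` would produce, and the 256-dimensional size is an assumption for illustration.

```python
import numpy as np

def cosine_score(emb1: np.ndarray, emb2: np.ndarray) -> float:
    """Cosine similarity between two speaker embeddings, in [-1, 1]."""
    emb1 = emb1 / np.linalg.norm(emb1)
    emb2 = emb2 / np.linalg.norm(emb2)
    return float(np.dot(emb1, emb2))

# Placeholder 256-dim embeddings (real ones come from the ONNX model).
rng = np.random.default_rng(0)
e1, e2 = rng.standard_normal(256), rng.standard_normal(256)

# Scores closer to 1 suggest the same speaker; a threshold tuned on a
# development set decides accept/reject.
print(f"cosine score: {cosine_score(e1, e2):.4f}")
```

In practice the decision threshold is calibrated on a held-out trial list for the target domain.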
Model List
A model with the suffix `LM` has been further fine-tuned with large-margin fine-tuning, which can perform better on long audios (e.g., >3 s).
### modelscope
Datasets | Languages | Checkpoint (pt) | Runtime Model (onnx) |
---|---|---|---|
VoxCeleb | EN | ResNet34 / ResNet34_LM | ResNet34 / ResNet34_LM |
VoxCeleb | EN | ResNet152_LM | ResNet152_LM |
VoxCeleb | EN | ResNet221_LM | ResNet221_LM |
VoxCeleb | EN | ResNet293_LM | ResNet293_LM |
VoxCeleb | EN | CAM++ / CAM++_LM | CAM++ / CAM++_LM |
VoxCeleb | EN | ECAPA512 / ECAPA512_LM | ECAPA512 / ECAPA512_LM |
VoxCeleb | EN | ECAPA1024 / ECAPA1024_LM | ECAPA1024 / ECAPA1024_LM |
VoxCeleb | EN | Gemini_DFResnet114_LM | Gemini_DFResnet114_LM |
CNCeleb | CN | ResNet34 / ResNet34_LM | ResNet34 / ResNet34_LM |
VoxBlink2 | Multilingual | SimAMResNet34 | SimAMResNet34 |
VoxBlink2 (pretrain) + VoxCeleb2 (finetune) | Multilingual | SimAMResNet34 | SimAMResNet34 |
VoxBlink2 | Multilingual | SimAMResNet100 | SimAMResNet100 |
VoxBlink2 (pretrain) + VoxCeleb2 (finetune) | Multilingual | SimAMResNet100 | SimAMResNet100 |
### huggingface
Datasets | Languages | Checkpoint (pt) | Runtime Model (onnx) |
---|---|---|---|
VoxCeleb | EN | ResNet34 / ResNet34_LM | ResNet34 / ResNet34_LM |
VoxCeleb | EN | ResNet152_LM | ResNet152_LM |
VoxCeleb | EN | ResNet221_LM | ResNet221_LM |
VoxCeleb | EN | ResNet293_LM | ResNet293_LM |
VoxCeleb | EN | CAM++ / CAM++_LM | CAM++ / CAM++_LM |
VoxCeleb | EN | ECAPA512 / ECAPA512_LM | ECAPA512 / ECAPA512_LM |
VoxCeleb | EN | ECAPA1024 / ECAPA1024_LM | ECAPA1024 / ECAPA1024_LM |
VoxCeleb | EN | Gemini_DFResnet114_LM | Gemini_DFResnet114_LM |
CNCeleb | CN | ResNet34 / ResNet34_LM | ResNet34 / ResNet34_LM |