# Python Package

## Install

``` sh
pip install git+https://github.com/wenet-e2e/wespeaker.git
```

For a development install:

``` sh
git clone https://github.com/wenet-e2e/wespeaker.git
cd wespeaker
pip install -e .
```
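To sanity-check the installation, an import test such as the following should run without errors (a minimal sketch; it only assumes the `wespeaker` package itself and prints whatever public API your installed version exposes):

``` python
# Minimal install check: import the package and list its public attributes.
import wespeaker

print([name for name in dir(wespeaker) if not name.startswith('_')])
```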
## Command Line Usage

``` sh
$ wespeaker --task embedding --audio_file audio.wav --output_file embedding.txt
$ wespeaker --task embedding_kaldi --wav_scp wav.scp --output_file /path/to/embedding
$ wespeaker --task similarity --audio_file audio.wav --audio_file2 audio2.wav
$ wespeaker --task diarization --audio_file audio.wav
$ wespeaker --task diarization --audio_file audio.wav --device cuda:0  # use CUDA on Windows/Linux
$ wespeaker --task diarization --audio_file audio.wav --device mps     # use Metal Performance Shaders on macOS
```
You can specify the following parameters (use `-h` for details):

* `-t` or `--task`: five tasks are supported now
    * `embedding`: extract the embedding for an audio file and save it to an output file
    * `embedding_kaldi`: extract embeddings for a kaldi-style `wav.scp` and save them to ark/scp files
    * `similarity`: compute the similarity of two audio files (in the range [0, 1])
    * `diarization`: apply speaker diarization to an input audio file
    * `diarization_list`: apply speaker diarization to a kaldi-style `wav.scp`
* `-l` or `--language`: use Chinese/English speaker models
* `-p` or `--pretrain`: the path of the pretrained model directory, which should contain `avg_model.pt` and `config.yaml`
* `--device`: set the pytorch device: `cpu`, `cuda`, `cuda:0` or `mps`
* `--campplus`: use `campplus_cn_common_200k` of damo
* `--eres2net`: use `eres2net_cn_common_200k` of damo
* `--vblinkp`: use the sam_resnet34 model pretrained on VoxBlink2
* `--vblinkf`: use the sam_resnet34 model pretrained on VoxBlink2 and finetuned on VoxCeleb2
* `--audio_file`: input audio file path
* `--audio_file2`: path of the second input audio file, used only for the similarity task
* `--wav_scp`: input `wav.scp` file in kaldi format (each line: `key wav_path`)
* `--resample_rate`: resample rate (default: 16000)
* `--vad`: whether to apply VAD to the input audios (default: true)
* `--output_file`: output file to save the speaker embedding; if you use a kaldi `wav.scp`, the output will be `output_file.ark` and `output_file.scp` (see the sketch after this list for reading these back)
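As a rough illustration of consuming these outputs, the sketch below assumes the plain-text output holds one whitespace-separated float vector and that the ark/scp pair is in the usual kaldi format readable by the `kaldiio` package; both are assumptions about the on-disk format rather than a documented contract:

``` python
import numpy as np
import kaldiio

# Single embedding saved as plain text (assumed whitespace-separated floats).
embedding = np.loadtxt('embedding.txt')
print(embedding.shape)

# Embeddings written as a kaldi ark/scp pair (assumed kaldiio-compatible).
for utt, emb in kaldiio.load_scp_sequential('/path/to/embedding.scp'):
    print(utt, emb.shape)
```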
## Pretrained model support

We provide different pretrained models, which can be found on the pretrained models page.

**Warning**: if you want to use the models linked above, be sure to rename the model and config file to `avg_model.pt` and `config.yaml`.
By default, specifying the `language` option will download the pretrained models as:

* english: `ResNet221_LM` pretrained on VoxCeleb
* chinese: `ResNet34_LM` pretrained on CnCeleb
If you want to use other pretrained models, please use `-p` or `--pretrain` to specify the directory containing `avg_model.pt` and `config.yaml`, which can be either one of the models we provide or a model trained by yourself.
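For programmatic use, the same directory layout can be loaded from Python. This is a minimal sketch assuming your installed version exposes a `load_model_local` helper mirroring the `--pretrain` flag; if it does not, fall back to the CLI with `--pretrain`:

``` python
import wespeaker

# Assumption: load_model_local() takes a directory containing avg_model.pt
# and config.yaml, mirroring the CLI's --pretrain option.
model = wespeaker.load_model_local('/path/to/model_dir')
embedding = model.extract_embedding('audio.wav')
```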
## Python Programming Usage

``` python
import wespeaker

model = wespeaker.load_model('chinese')
# set the device on which tensors are or will be allocated
model.set_device('cuda:0')

# embedding/embedding_kaldi/similarity/diarization
embedding = model.extract_embedding('audio.wav')
utt_names, embeddings = model.extract_embedding_list('wav.scp')
similarity = model.compute_similarity('audio1.wav', 'audio2.wav')
diar_result = model.diarize('audio.wav', 'give_this_utt_a_name')

# register and recognize
model.register('spk1', 'spk1_audio1.wav')
model.register('spk2', 'spk2_audio1.wav')
model.register('spk3', 'spk3_audio1.wav')
result = model.recognize('spk1_audio2.wav')
```
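Building on the API above, here is a small end-to-end sketch of a verification-style decision. The audio file names are hypothetical placeholders, the 0.5 threshold is purely illustrative rather than a tuned operating point, and the exact structure of the `recognize` result may vary across versions, so it is simply printed:

``` python
import wespeaker

model = wespeaker.load_model('english')
model.register('alice', 'alice_utt1.wav')
model.register('bob', 'bob_utt1.wav')

# compute_similarity returns a score in [0, 1] (see the similarity task above);
# 0.5 is an illustrative threshold, not a recommended operating point.
score = model.compute_similarity('alice_utt1.wav', 'alice_utt2.wav')
print('same speaker' if score > 0.5 else 'different speakers')

# recognize() matches the query against the registered speakers.
print(model.recognize('alice_utt2.wav'))
```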