# Python Package

## Install

``` sh
pip install git+https://github.com/wenet-e2e/wespeaker.git
```

For a development install:

``` sh
git clone https://github.com/wenet-e2e/wespeaker.git
cd wespeaker
pip install -e .
```

## Command Line Usage

``` sh
$ wespeaker --task embedding --audio_file audio.wav --output_file embedding.txt
$ wespeaker --task embedding_kaldi --wav_scp wav.scp --output_file /path/to/embedding
$ wespeaker --task similarity --audio_file audio.wav --audio_file2 audio2.wav
$ wespeaker --task diarization --audio_file audio.wav
$ wespeaker --task diarization --audio_file audio.wav --device cuda:0  # use CUDA on Windows/Linux
$ wespeaker --task diarization --audio_file audio.wav --device mps     # use Metal Performance Shaders on macOS
```

You can specify the following parameters (use `-h` for details):

* `-t` or `--task`: five tasks are supported now
  - embedding: extract the embedding for an audio file and save it to an output file
  - embedding_kaldi: extract embeddings from a kaldi-style wav.scp and save them to ark/scp files
  - similarity: compute the similarity of two audio files (in the range [0, 1])
  - diarization: apply speaker diarization to an input audio file
  - diarization_list: apply speaker diarization to a kaldi-style wav.scp
* `-l` or `--language`: use Chinese/English speaker models
* `-p` or `--pretrain`: the path to a pretrained model directory, which should contain `avg_model.pt` and `config.yaml`
* `--device`: set the pytorch device, e.g. `cpu`, `cuda`, `cuda:0` or `mps`
* `--campplus`: use [`campplus_cn_common_200k` of damo](https://www.modelscope.cn/models/iic/speech_campplus_sv_zh-cn_16k-common/summary)
* `--eres2net`: use [`res2net_cn_common_200k` of damo](https://www.modelscope.cn/models/iic/speech_eres2net_sv_zh-cn_16k-common/summary)
* `--vblinkp`: use the sam_resnet34 model pretrained on VoxBlink2
* `--vblinkf`: use the sam_resnet34 model pretrained on VoxBlink2 and finetuned on VoxCeleb2
* `--audio_file`: input audio file path
* `--audio_file2`: second input audio file path, used only for the similarity task
* `--wav_scp`: input wav.scp file in kaldi format (each line: `key wav_path`)
* `--resample_rate`: resample rate (default: 16000)
* `--vad`: whether to apply vad to the input audios (default: true)
* `--output_file`: output file to save the speaker embedding; with a kaldi wav.scp input, the output will be `output_file.ark` and `output_file.scp`

### Pretrained model support

We provide different pretrained models, which can be found at [pretrained models](https://github.com/wenet-e2e/wespeaker/blob/master/docs/pretrained.md).

**Warning**: if you want to use the models provided at the above link, be sure to rename the model and config file to `avg_model.pt` and `config.yaml`.

By default, specifying the `language` option will download the pretrained models as:

* english: `ResNet221_LM` pretrained on VoxCeleb
* chinese: `ResNet34_LM` pretrained on CnCeleb

If you want to use other pretrained models, use `-p` or `--pretrain` to specify the directory containing `avg_model.pt` and `config.yaml`, which can be either one we provide or one trained by yourself.
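For example, a custom model directory only needs those two files; the paths below are illustrative, not shipped with the package:

``` sh
# Illustrative layout: any directory containing these two files works.
$ ls /path/to/my_model
avg_model.pt  config.yaml
$ wespeaker --task embedding --pretrain /path/to/my_model \
      --audio_file audio.wav --output_file embedding.txt
```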
## Python Programming Usage

``` python
import wespeaker

model = wespeaker.load_model('chinese')
# set the device on which tensors are or will be allocated
model.set_device('cuda:0')

# embedding/embedding_kaldi/similarity/diarization
embedding = model.extract_embedding('audio.wav')
utt_names, embeddings = model.extract_embedding_list('wav.scp')
similarity = model.compute_similarity('audio1.wav', 'audio2.wav')
diar_result = model.diarize('audio.wav', 'give_this_utt_a_name')

# register and recognize
model.register('spk1', 'spk1_audio1.wav')
model.register('spk2', 'spk2_audio1.wav')
model.register('spk3', 'spk3_audio1.wav')
result = model.recognize('spk1_audio2.wav')
```
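As a minimal sketch of how `compute_similarity` could back a simple verification decision: the score lies in [0, 1], but the threshold below is an assumption for illustration, not a recommended value; calibrate it on a development set for your model and data.

``` python
import wespeaker

model = wespeaker.load_model('english')

# THRESHOLD is a hypothetical placeholder, not a tuned value.
THRESHOLD = 0.5

# compute_similarity returns a score in [0, 1]; accept if it clears the threshold.
score = model.compute_similarity('enroll.wav', 'test.wav')
decision = 'same speaker' if score >= THRESHOLD else 'different speaker'
print(decision, score)
```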