## Diarization Tutorial on VoxConverse v2

If you meet any problems when going through this tutorial, please feel free to ask in the GitHub [issues](https://github.com/wenet-e2e/wespeaker/issues). Thanks for any kind of feedback.

### First Experiment

Speaker diarization is a typical downstream task built on top of well-learned speaker embeddings. Here we introduce our diarization recipe `examples/voxconverse/v2/run.sh` on the VoxConverse 2020 dataset. Note that we provide two recipes: **v1** and **v2**. Their only difference is that in **v2**, the Fbank extraction, embedding extraction and clustering modules are split into separate stages. We recommend that newcomers follow the **v2** recipe, run it stage by stage, and check the intermediate results to better understand the whole process.

```
cd examples/voxconverse/v2/
bash run.sh --stage 1 --stop_stage 1
bash run.sh --stage 2 --stop_stage 2
bash run.sh --stage 3 --stop_stage 3
bash run.sh --stage 4 --stop_stage 4
bash run.sh --stage 5 --stop_stage 5
bash run.sh --stage 6 --stop_stage 6
bash run.sh --stage 7 --stop_stage 7
bash run.sh --stage 8 --stop_stage 8
```

### Stage 1: Download Prerequisites

```
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
    mkdir -p external_tools

    # [1] Download evaluation toolkit
    wget -c https://github.com/usnistgov/SCTK/archive/refs/tags/v2.4.12.zip -O external_tools/SCTK-v2.4.12.zip
    unzip -o external_tools/SCTK-v2.4.12.zip -d external_tools

    # [2] Download voice activity detection model pretrained by Silero Team
    wget -c https://github.com/snakers4/silero-vad/archive/refs/tags/v3.1.zip -O external_tools/silero-vad-v3.1.zip
    unzip -o external_tools/silero-vad-v3.1.zip -d external_tools

    # [3] Download ResNet34 speaker model pretrained by WeSpeaker Team
    mkdir -p pretrained_models
    wget -c https://wespeaker-1256283475.cos.ap-shanghai.myqcloud.com/models/voxceleb/voxceleb_resnet34_LM.onnx -O pretrained_models/voxceleb_resnet34_LM.onnx
fi
```

This stage downloads three prerequisites:

* the evaluation toolkit **SCTK**: computes the DER metric
* the open-source VAD model pre-trained by [silero-vad](https://github.com/snakers4/silero-vad): removes the silence in audio
* the pre-trained ResNet34 speaker model: extracts the speaker embeddings

After finishing this stage, you will get two new dirs:

- **external_tools**
    - SCTK-v2.4.12.zip
    - SCTK-2.4.12
    - silero-vad-v3.1.zip
    - silero-vad-3.1
- **pretrained_models**
    - voxceleb_resnet34_LM.onnx
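Before moving on, you can optionally sanity-check the downloaded ONNX speaker model with a few lines of Python. This is not part of `run.sh`; it is a minimal sketch assuming `onnxruntime` and `numpy` are installed, and the `(batch, num_frames, 80)` Fbank input shape is an assumption that you should verify against the shape reported by the session.

```
# Optional sanity check for the downloaded speaker model (not part of run.sh).
# Assumption: the model expects (batch, num_frames, 80) Fbank features; verify
# this against the input shape printed below before relying on it.
import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession("pretrained_models/voxceleb_resnet34_LM.onnx",
                            providers=["CPUExecutionProvider"])
inp = sess.get_inputs()[0]
print("input:", inp.name, inp.shape)

# Feed 200 frames of random "Fbank" features just to confirm the graph runs.
dummy = np.random.randn(1, 200, 80).astype(np.float32)
emb = sess.run(None, {inp.name: dummy})[0]
print("embedding shape:", emb.shape)
```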
### Stage 2: Download and Prepare Data

```
if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
    mkdir -p data

    # Download annotations for dev and test sets (version 0.0.3)
    wget -c https://github.com/joonson/voxconverse/archive/refs/heads/master.zip -O data/voxconverse_master.zip
    unzip -o data/voxconverse_master.zip -d data

    # Download annotations from VoxSRC-23 validation toolkit (looks like version 0.0.2)
    # cd data && git clone https://github.com/JaesungHuh/VoxSRC2023.git --recursive && cd -

    # Download dev audios
    mkdir -p data/dev
    wget --no-check-certificate -c https://www.robots.ox.ac.uk/~vgg/data/voxconverse/data/voxconverse_dev_wav.zip -O data/voxconverse_dev_wav.zip
    unzip -o data/voxconverse_dev_wav.zip -d data/dev

    # Create wav.scp for dev audios
    ls `pwd`/data/dev/audio/*.wav | awk -F/ '{print substr($NF, 1, length($NF)-4), $0}' > data/dev/wav.scp

    # Download test audios
    mkdir -p data/test
    wget --no-check-certificate -c https://www.robots.ox.ac.uk/~vgg/data/voxconverse/data/voxconverse_test_wav.zip -O data/voxconverse_test_wav.zip
    unzip -o data/voxconverse_test_wav.zip -d data/test

    # Create wav.scp for test audios
    ls `pwd`/data/test/voxconverse_test_wav/*.wav | awk -F/ '{print substr($NF, 1, length($NF)-4), $0}' > data/test/wav.scp
fi
```

This stage downloads the VoxConverse 2020 dev and test sets as well as their annotations. The latest annotation version 0.0.3 is used by default (recommended). You can also try version 0.0.2, which seems to be the one used in the [VoxSRC-23 baseline repo](https://github.com/JaesungHuh/VoxSRC2023.git).

After finishing this stage, you will get the new **data** dir:

- **data**
    - voxconverse_master.zip
    - voxconverse_dev_wav.zip
    - voxconverse_test_wav.zip
    - voxconverse-master
        - dev: ground-truth RTTMs
        - test: ground-truth RTTMs
    - dev
        - audio: wav files
        - wav.scp
    - test
        - voxconverse_test_wav: wav files
        - wav.scp

**wav.scp**: each line records two blank-separated columns: `wav_id` and `wav_path`

```
abjxc /path/to/wespeaker/examples/voxconverse/v2/data/dev/audio/abjxc.wav
afjiv /path/to/wespeaker/examples/voxconverse/v2/data/dev/audio/afjiv.wav
...
```

### Stage 3: Apply SAD (i.e., VAD)

```
if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
    # Set VAD min duration
    min_duration=0.255

    if [[ "x${sad_type}" == "xoracle" ]]; then
        # Oracle SAD: handling overlapping or too short regions in ground truth RTTM
        while read -r utt wav_path; do
            python3 wespeaker/diar/make_oracle_sad.py \
                    --rttm data/voxconverse-master/${partition}/${utt}.rttm \
                    --min-duration $min_duration
        done < data/${partition}/wav.scp > data/${partition}/oracle_sad
    fi

    if [[ "x${sad_type}" == "xsystem" ]]; then
        # System SAD: applying 'silero' VAD
        python3 wespeaker/diar/make_system_sad.py \
                --repo-path external_tools/silero-vad-3.1 \
                --scp data/${partition}/wav.scp \
                --min-duration $min_duration > data/${partition}/system_sad
    fi
fi
```

`sad_type` can be oracle or system:

* oracle: take the VAD information from the ground-truth RTTMs, saved in `data/${partition}/oracle_sad`
* system: compute the VAD results using [silero-vad](https://github.com/snakers4/silero-vad), saved in `data/${partition}/system_sad`

where `partition` is dev or test. Note that VAD segments shorter than `min_duration` seconds are ignored and simply regarded as silence.

### Stage 4: Extract Fbank Features

```
if [ ${stage} -le 4 ] && [ ${stop_stage} -ge 4 ]; then
    [ -d "exp/${partition}_${sad_type}_sad_fbank" ] && rm -r exp/${partition}_${sad_type}_sad_fbank

    echo "Make Fbank features and store them under exp/${partition}_${sad_type}_sad_fbank"
    echo "..."
    bash local/make_fbank.sh \
         --scp data/${partition}/wav.scp \
         --segments data/${partition}/${sad_type}_sad \
         --store_dir exp/${partition}_${sad_type}_sad_fbank \
         --subseg_cmn ${subseg_cmn} \
         --nj 24
fi
```

`subseg_cmn` controls whether Cepstral Mean Normalization (CMN) is applied to the Fbank features

* on each sliding-window sub-segment (`subseg_cmn=true`), or
* on the whole VAD segment (`subseg_cmn=false`)

You can set `nj` according to the number of available CPU cores. The final Fbank features are saved under `exp/${partition}_${sad_type}_sad_fbank`.
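To make the difference concrete, here is a simplified NumPy sketch of the two CMN options. It is illustrative only, not the code used by `local/make_fbank.sh`; the 150-frame window and 75-frame shift mirror the 1.5 s / 0.75 s sliding window configured in Stage 5, assuming a 10 ms frame shift.

```
# Illustrative sketch of subseg_cmn; not the actual implementation in local/make_fbank.sh.
# A window/shift of 150/75 frames corresponds to 1.5 s / 0.75 s at a 10 ms frame shift.
import numpy as np

def apply_cmn(fbank):
    """Subtract the per-dimension mean over the time axis."""
    return fbank - fbank.mean(axis=0, keepdims=True)

def sliding_subsegs(fbank, window=150, shift=75):
    """Cut a (num_frames, 80) Fbank matrix into overlapping sub-segments."""
    for start in range(0, max(len(fbank) - window, 0) + 1, shift):
        yield fbank[start:start + window]

segment_fbank = np.random.randn(1000, 80).astype(np.float32)  # one VAD segment

# subseg_cmn=false: normalize once over the whole VAD segment, then cut windows
whole_segment_cmn = list(sliding_subsegs(apply_cmn(segment_fbank)))

# subseg_cmn=true: cut windows first, then normalize each sub-segment independently
per_subseg_cmn = [apply_cmn(s) for s in sliding_subsegs(segment_fbank)]
```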
### Stage 5: Extract Sliding-window Speaker Embeddings

```
if [ ${stage} -le 5 ] && [ ${stop_stage} -ge 5 ]; then
    [ -d "exp/${partition}_${sad_type}_sad_embedding" ] && rm -r exp/${partition}_${sad_type}_sad_embedding

    echo "Extract embeddings and store them under exp/${partition}_${sad_type}_sad_embedding"
    echo "..."
    bash local/extract_emb.sh \
         --scp exp/${partition}_${sad_type}_sad_fbank/fbank.scp \
         --pretrained_model pretrained_models/voxceleb_resnet34_LM.onnx \
         --device cuda \
         --store_dir exp/${partition}_${sad_type}_sad_embedding \
         --batch_size 96 \
         --frame_shift 10 \
         --window_secs 1.5 \
         --period_secs 0.75 \
         --subseg_cmn ${subseg_cmn} \
         --nj 1
fi
```

This stage extracts speaker embeddings from the Fbank features in a sliding-window fashion: `window=1.5s, period=0.75s`, i.e., an embedding is extracted from each `1.5s` speech window, and the window is shifted forward by `0.75s` each time, so contiguous windows overlap by `1.5-0.75=0.75s`. You can also specify the number of `nj` jobs and choose between the `cuda` and `cpu` devices. The extracted embeddings are saved under `exp/${partition}_${sad_type}_sad_embedding`.

### Stage 6: Apply Spectral Clustering

```
if [ ${stage} -le 6 ] && [ ${stop_stage} -ge 6 ]; then
    [ -f "exp/spectral_cluster/${partition}_${sad_type}_sad_labels" ] && rm exp/spectral_cluster/${partition}_${sad_type}_sad_labels

    echo "Doing spectral clustering and store the result in exp/spectral_cluster/${partition}_${sad_type}_sad_labels"
    echo "..."
    python3 wespeaker/diar/spectral_clusterer.py \
            --scp exp/${partition}_${sad_type}_sad_embedding/emb.scp \
            --output exp/spectral_cluster/${partition}_${sad_type}_sad_labels
fi
```

Apply spectral clustering on the extracted sliding-window speaker embeddings and store the results in `exp/spectral_cluster/${partition}_${sad_type}_sad_labels`, where each line records two blank-separated columns: `subseg_id` and `spk_id`

```
abjxc-00000400-00007040-00000000-00000150 0
abjxc-00000400-00007040-00000075-00000225 0
abjxc-00000400-00007040-00000150-00000300 0
abjxc-00000400-00007040-00000225-00000375 0
...
```

### Stage 7: Reformat Clustering Labels into RTTMs

```
if [ ${stage} -le 7 ] && [ ${stop_stage} -ge 7 ]; then
    python3 wespeaker/diar/make_rttm.py \
            --labels exp/spectral_cluster/${partition}_${sad_type}_sad_labels \
            --channel 1 > exp/spectral_cluster/${partition}_${sad_type}_sad_rttm
fi
```

This stage converts the clustering labels into the Rich Transcription Time Marked (RTTM) format, saved in `exp/spectral_cluster/${partition}_${sad_type}_sad_rttm`. RTTM files are space-delimited text files containing one turn per line, each line containing ten fields:

* `Type` -- segment type; should always be `SPEAKER`
* `File ID` -- file name; basename of the recording minus extension (e.g., `abjxc`)
* `Channel ID` -- channel (1-indexed) that turn is on; should always be `1`
* `Turn Onset` -- onset of turn in seconds from beginning of recording
* `Turn Duration` -- duration of turn in seconds
* `Orthography Field` -- should always be `<NA>`
* `Speaker Type` -- should always be `<NA>`
* `Speaker Name` -- name of speaker of turn; should be unique within the scope of each file
* `Confidence Score` -- system confidence (probability) that information is correct; should always be `<NA>`
* `Signal Lookahead Time` -- should always be `<NA>`

For instance,

```
SPEAKER abjxc 1 0.400 6.640 <NA> <NA> 0 <NA> <NA>
SPEAKER abjxc 1 8.680 55.960 <NA> <NA> 0 <NA> <NA>
```
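To see how the clustering labels map to RTTM turns, here is a simplified sketch of the conversion. The real logic lives in `wespeaker/diar/make_rttm.py`; the sketch assumes, based on the examples in this tutorial, that a `subseg_id` has the layout `<utt>-<seg_begin_ms>-<seg_end_ms>-<subseg_begin_frames>-<subseg_end_frames>` with a 10 ms frame shift, and it simply merges consecutive same-speaker sub-segments into one turn.

```
# Simplified sketch of the label-to-RTTM conversion; the actual conversion is done
# by wespeaker/diar/make_rttm.py. The subseg_id layout below is inferred from the
# examples in this tutorial and should be treated as an assumption.
import itertools

def subseg_interval(subseg_id):
    # <utt>-<seg_begin_ms>-<seg_end_ms>-<subseg_begin_frames>-<subseg_end_frames>
    utt, seg_begin_ms, _, sub_begin, sub_end = subseg_id.rsplit("-", 4)
    offset = int(seg_begin_ms) / 1000.0        # VAD segment onset in seconds
    begin = offset + int(sub_begin) * 0.01     # 10 ms frame shift
    end = offset + int(sub_end) * 0.01
    return utt, begin, end

def labels_to_rttm(lines, channel=1):
    # lines: "subseg_id spk_id" strings, sorted by time within each utterance
    records = []
    for line in lines:
        subseg_id, spk = line.split()
        utt, begin, end = subseg_interval(subseg_id)
        records.append((utt, spk, begin, end))
    # merge runs of consecutive sub-segments sharing the same utterance and speaker
    for (utt, spk), run in itertools.groupby(records, key=lambda r: (r[0], r[1])):
        run = list(run)
        onset, offset = run[0][2], run[-1][3]
        yield (f"SPEAKER {utt} {channel} {onset:.3f} {offset - onset:.3f} "
               f"<NA> <NA> {spk} <NA> <NA>")

labels = [
    "abjxc-00000400-00007040-00000000-00000150 0",
    "abjxc-00000400-00007040-00000075-00000225 0",
]
for turn in labels_to_rttm(labels):
    print(turn)   # SPEAKER abjxc 1 0.400 2.250 <NA> <NA> 0 <NA> <NA>
```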
### Stage 8: Evaluate the Result (DER)

```
if [ ${stage} -le 8 ] && [ ${stop_stage} -ge 8 ]; then
    ref_dir=data/voxconverse-master/
    #ref_dir=data/VoxSRC2023/voxconverse/

    echo -e "Get the DER results\n..."
    perl external_tools/SCTK-2.4.12/src/md-eval/md-eval.pl \
         -c 0.25 \
         -r <(cat ${ref_dir}/${partition}/*.rttm) \
         -s exp/spectral_cluster/${partition}_${sad_type}_sad_rttm 2>&1 | tee exp/spectral_cluster/${partition}_${sad_type}_sad_res

    if [ ${get_each_file_res} -eq 1 ]; then
        single_file_res_dir=exp/spectral_cluster/${partition}_${sad_type}_single_file_res
        mkdir -p $single_file_res_dir

        echo -e "\nGet the DER results for each file and the results will be stored under ${single_file_res_dir}\n..."
        awk '{print $2}' exp/spectral_cluster/${partition}_${sad_type}_sad_rttm | sort -u | while read file_name; do
            perl external_tools/SCTK-2.4.12/src/md-eval/md-eval.pl \
                 -c 0.25 \
                 -r <(cat ${ref_dir}/${partition}/${file_name}.rttm) \
                 -s <(grep "${file_name}" exp/spectral_cluster/${partition}_${sad_type}_sad_rttm) > ${single_file_res_dir}/${partition}_${file_name}_res
        done
        echo "Done!"
    fi
fi
```

This stage uses the **SCTK** toolkit to compute the Diarization Error Rate (DER) metric, which is the sum of

* speaker error -- percentage of scored time for which the wrong speaker ID is assigned within a speech region
* false alarm speech -- percentage of scored time for which a nonspeech region is incorrectly marked as containing speech
* missed speech -- percentage of scored time for which a speech region is incorrectly marked as not containing speech

For more details about DER, consult Section 6.1 of the [NIST RT-09 evaluation plan](https://web.archive.org/web/20100606092041if_/http://www.itl.nist.gov/iad/mig/tests/rt/2009/docs/rt09-meeting-eval-plan-v2.pdf).

The overall DER result will be saved in `exp/spectral_cluster/${partition}_${sad_type}_sad_res`. Optionally, set `get_each_file_res` to `1` if you also want the DER result for each single file; these per-file results will be saved under `exp/spectral_cluster/${partition}_${sad_type}_single_file_res`.
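In other words, DER divides the total duration of the three error types by the total scored reference speech time; the `-c 0.25` option additionally applies a 0.25 s no-score collar around each reference segment boundary. The numbers in the small sketch below are illustrative placeholders only, not results from this recipe:

```
# DER expressed as simple arithmetic; all durations below are illustrative
# placeholders (in seconds), not real results from this recipe.
missed_speech = 12.3     # reference speech not covered by the system output
false_alarm = 4.5        # system speech placed in reference non-speech
speaker_error = 20.1     # speech attributed to the wrong speaker
scored_speech = 600.0    # total scored reference speech time

der = (missed_speech + false_alarm + speaker_error) / scored_speech
print(f"DER = {100 * der:.2f}%")   # DER = 6.15%
```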