Diarization Tutorial on VoxConverse v2
If you run into any problems while going through this tutorial, feel free to ask in the GitHub issues. Thanks for any kind of feedback.
First Experiment
Speaker diarization is a typical downstream task that applies well-learned speaker embeddings.
Here we introduce our diarization recipe examples/voxconverse/v2/run.sh
on the VoxConverse 2020 dataset.
Note that we provide two recipes: v1 and v2. Their only difference is that in v2, the Fbank extraction, embedding extraction, and clustering modules are split into separate stages. We recommend that newcomers follow the v2 recipe, run it stage by stage, and check the results to better understand the whole process.
cd examples/voxconverse/v2/
bash run.sh --stage 1 --stop_stage 1
bash run.sh --stage 2 --stop_stage 2
bash run.sh --stage 3 --stop_stage 3
bash run.sh --stage 4 --stop_stage 4
bash run.sh --stage 5 --stop_stage 5
bash run.sh --stage 6 --stop_stage 6
bash run.sh --stage 7 --stop_stage 7
bash run.sh --stage 8 --stop_stage 8
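Alternatively, once you are familiar with the individual stages, you can run the whole pipeline in one go:
# Run all eight stages back to back
bash run.sh --stage 1 --stop_stage 8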
Stage 1: Download Prerequisites
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
mkdir -p external_tools
# [1] Download evaluation toolkit
wget -c https://github.com/usnistgov/SCTK/archive/refs/tags/v2.4.12.zip -O external_tools/SCTK-v2.4.12.zip
unzip -o external_tools/SCTK-v2.4.12.zip -d external_tools
# [2] Download voice activity detection model pretrained by Silero Team
wget -c https://github.com/snakers4/silero-vad/archive/refs/tags/v3.1.zip -O external_tools/silero-vad-v3.1.zip
unzip -o external_tools/silero-vad-v3.1.zip -d external_tools
# [3] Download ResNet34 speaker model pretrained by WeSpeaker Team
mkdir -p pretrained_models
wget -c https://wespeaker-1256283475.cos.ap-shanghai.myqcloud.com/models/voxceleb/voxceleb_resnet34_LM.onnx -O pretrained_models/voxceleb_resnet34_LM.onnx
fi
Download three prerequisites:
the evaluation toolkit SCTK: computes the DER metric
the open-source VAD model silero-vad, pre-trained by the Silero team: removes the silence in audio
the pre-trained ResNet34 speaker model: extracts the speaker embeddings
After finishing this stage, you will get two new directories:
external_tools
SCTK-v2.4.12.zip
SCTK-2.4.12
silero-vad-v3.1.zip
silero-vad-3.1
pretrained_models
voxceleb_resnet34_LM.onnx
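As a quick optional check (not part of the recipe), verify that all three prerequisites landed where the later stages expect them:
# Any missing path will make ls exit with an error
ls external_tools/SCTK-2.4.12/src/md-eval/md-eval.pl \
   external_tools/silero-vad-3.1 \
   pretrained_models/voxceleb_resnet34_LM.onnx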
Stage 2: Download and Prepare Data
if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
mkdir -p data
# Download annotations for dev and test sets (version 0.0.3)
wget -c https://github.com/joonson/voxconverse/archive/refs/heads/master.zip -O data/voxconverse_master.zip
unzip -o data/voxconverse_master.zip -d data
# Download annotations from VoxSRC-23 validation toolkit (looks like version 0.0.2)
# cd data && git clone https://github.com/JaesungHuh/VoxSRC2023.git --recursive && cd -
# Download dev audios
mkdir -p data/dev
wget --no-check-certificate -c https://www.robots.ox.ac.uk/~vgg/data/voxconverse/data/voxconverse_dev_wav.zip -O data/voxconverse_dev_wav.zip
unzip -o data/voxconverse_dev_wav.zip -d data/dev
# Create wav.scp for dev audios
ls `pwd`/data/dev/audio/*.wav | awk -F/ '{print substr($NF, 1, length($NF)-4), $0}' > data/dev/wav.scp
# Test audios
mkdir -p data/test
wget --no-check-certificate -c https://www.robots.ox.ac.uk/~vgg/data/voxconverse/data/voxconverse_test_wav.zip -O data/voxconverse_test_wav.zip
unzip -o data/voxconverse_test_wav.zip -d data/test
# Create wav.scp for test audios
ls `pwd`/data/test/voxconverse_test_wav/*.wav | awk -F/ '{print substr($NF, 1, length($NF)-4), $0}' > data/test/wav.scp
fi
Download the VoxConverse 2020 dev and test sets as well as their annotations. Here we use the latest version 0.0.3 by default (recommended). You can also try version 0.0.2 (which seems to be used in the VoxSRC-23 baseline repo).
After finishing this stage, you will get the new data directory:
data
voxconverse_master.zip
voxconverse_dev_wav.zip
voxconverse_test_wav.zip
voxconverse-master
dev: ground-truth rttms
test: ground-truth rttms
dev
audio: wav files
wav.scp
test
voxconverse_test_wav: wav files
wav.scp
wav.scp: each line records two space-separated columns: wav_id and wav_path
abjxc /path/to/wespeaker/examples/voxconverse/v2/data/dev/audio/abjxc.wav
afjiv /path/to/wespeaker/examples/voxconverse/v2/data/dev/audio/afjiv.wav
...
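As an optional sanity check (a minimal sketch, not part of the recipe), you can verify that every path listed in wav.scp exists, e.g., for the dev partition:
# Report any wav.scp entries whose audio file is missing
while read -r wav_id wav_path; do
    [ -f "$wav_path" ] || echo "missing: $wav_id $wav_path"
done < data/dev/wav.scp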
Stage 3: Apply SAD (i.e., VAD)
if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
# Set VAD min duration
min_duration=0.255
if [[ "x${sad_type}" == "xoracle" ]]; then
# Oracle SAD: handling overlapping or too short regions in ground truth RTTM
while read -r utt wav_path; do
python3 wespeaker/diar/make_oracle_sad.py \
--rttm data/voxconverse-master/${partition}/${utt}.rttm \
--min-duration $min_duration
done < data/${partition}/wav.scp > data/${partition}/oracle_sad
fi
if [[ "x${sad_type}" == "xsystem" ]]; then
# System SAD: applying 'silero' VAD
python3 wespeaker/diar/make_system_sad.py \
--repo-path external_tools/silero-vad-3.1 \
--scp data/${partition}/wav.scp \
--min-duration $min_duration > data/${partition}/system_sad
fi
fi
sad_type can be either oracle or system:
oracle: get the VAD info from the ground-truth RTTMs, saved in data/${partition}/oracle_sad
system: compute the VAD results using silero-vad, saved in data/${partition}/system_sad
where partition is dev or test.
Note that VAD segments shorter than min_duration seconds are ignored and simply regarded as silence.
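Once this stage finishes, you can take a quick look at the generated SAD file (shown here for the system SAD on the dev partition; purely an inspection step, not part of the recipe):
# Peek at the first few entries and count the total number of segments
head -n 5 data/dev/system_sad
wc -l data/dev/system_sad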
Stage 4: Extract Fbank Features
if [ ${stage} -le 4 ] && [ ${stop_stage} -ge 4 ]; then
[ -d "exp/${sad_type}_sad_fbank" ] && rm -r exp/${sad_type}_sad_fbank
echo "Make Fbank features and store it under exp/${sad_type}_sad_fbank"
echo "..."
bash local/make_fbank.sh \
--scp data/${partition}/wav.scp \
--segments data/${partition}/${sad_type}_sad \
--store_dir exp/${partition}_${sad_type}_sad_fbank \
--subseg_cmn ${subseg_cmn} \
--nj 24
fi
subseg_cmn specifies whether Cepstral Mean Normalization (CMN) is applied to the Fbank features on each sliding-window sub-segment (subseg_cmn=true) or on the whole VAD segment (subseg_cmn=false).
You can specify nj jobs according to the number of CPU cores on your machine.
The final Fbank features are saved under the directory exp/${partition}_${sad_type}_sad_fbank.
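If you want to compare both normalization settings, you can re-run just this stage with the flag overridden (assuming run.sh forwards the subseg_cmn option to the underlying scripts, as its option parsing suggests):
# Re-extract Fbanks with sub-segment-level CMN
bash run.sh --stage 4 --stop_stage 4 --subseg_cmn true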
Stage 5: Extract Sliding-window Speaker Embeddings
if [ ${stage} -le 5 ] && [ ${stop_stage} -ge 5 ]; then
[ -d "exp/${sad_type}_sad_embedding" ] && rm -r exp/${sad_type}_sad_embedding
echo "Extract embeddings and store it under exp/${sad_type}_sad_embedding"
echo "..."
bash local/extract_emb.sh \
--scp exp/${partition}_${sad_type}_sad_fbank/fbank.scp \
--pretrained_model pretrained_models/voxceleb_resnet34_LM.onnx \
--device cuda \
--store_dir exp/${partition}_${sad_type}_sad_embedding \
--batch_size 96 \
--frame_shift 10 \
--window_secs 1.5 \
--period_secs 0.75 \
--subseg_cmn ${subseg_cmn} \
--nj 1
fi
Extract speaker embeddings from the Fbank features in a sliding-window fashion (window=1.5s, step=0.75s), i.e., one embedding is extracted from each 1.5s speech window, and the window is shifted forward by 0.75s each time.
Thus contiguous windows overlap by 1.5 - 0.75 = 0.75s.
You can also specify nj jobs and choose between the gpu and cpu devices.
The extracted embeddings are saved under the directory exp/${partition}_${sad_type}_sad_embedding.
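As a back-of-the-envelope check of the window arithmetic, consider a hypothetical 10s speech segment (not taken from the data):
# floor((10 - 1.5) / 0.75) + 1 = 12 windows cover a 10 s segment
awk 'BEGIN { dur = 10.0; win = 1.5; step = 0.75;
             print "windows:", int((dur - win) / step) + 1 }'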
Stage 6: Apply Spectral Clustering
if [ ${stage} -le 6 ] && [ ${stop_stage} -ge 6 ]; then
[ -f "exp/spectral_cluster/${partition}_${sad_type}_sad_labels" ] && rm exp/spectral_cluster/${partition}_${sad_type}_sad_labels
echo "Doing spectral clustering and store the result in exp/spectral_cluster/${partition}_${sad_type}_sad_labels"
echo "..."
python3 wespeaker/diar/spectral_clusterer.py \
--scp exp/${partition}_${sad_type}_sad_embedding/emb.scp \
--output exp/spectral_cluster/${partition}_${sad_type}_sad_labels
fi
Apply spectral clustering using the extracted sliding-window speaker embeddings, and store the results in exp/spectral_cluster/${partition}_${sad_type}_sad_labels, where each line records two space-separated columns: subseg_id and spk_id:
abjxc-00000400-00007040-00000000-00000150 0
abjxc-00000400-00007040-00000075-00000225 0
abjxc-00000400-00007040-00000150-00000300 0
abjxc-00000400-00007040-00000225-00000375 0
...
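A small optional check (a sketch, not part of the recipe): count how many distinct speaker labels the clustering produced per recording, here for the dev partition with the system SAD:
# subseg_id starts with the recording name; the last field is the speaker label
awk -F'[- ]' '{print $1, $NF}' exp/spectral_cluster/dev_system_sad_labels \
    | sort -u | awk '{cnt[$1]++} END {for (u in cnt) print u, cnt[u]}' | sort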
Stage 7: Reformat Clustering Labels into RTTMs
if [ ${stage} -le 7 ] && [ ${stop_stage} -ge 7 ]; then
python3 wespeaker/diar/make_rttm.py \
--labels exp/spectral_cluster/${partition}_${sad_type}_sad_labels \
--channel 1 > exp/spectral_cluster/${partition}_${sad_type}_sad_rttm
fi
Convert the clustering labels into the Rich Transcription Time Marked (RTTM) format, saved in exp/spectral_cluster/${partition}_${sad_type}_sad_rttm.
RTTM files are space-delimited text files containing one turn per line, where each line contains ten fields:
Type – segment type; should always be SPEAKER
File ID – file name; basename of the recording minus extension (e.g., abjxc)
Channel ID – channel (1-indexed) that turn is on; should always be 1
Turn Onset – onset of turn in seconds from beginning of recording
Turn Duration – duration of turn in seconds
Orthography Field – should always be <NA>
Speaker Type – should always be <NA>
Speaker Name – name of speaker of turn; should be unique within scope of each file
Confidence Score – system confidence (probability) that information is correct; should always be <NA>
Signal Lookahead Time – should always be <NA>
For instance,
SPEAKER abjxc 1 0.400 6.640 <NA> <NA> 0 <NA> <NA>
SPEAKER abjxc 1 8.680 55.960 <NA> <NA> 0 <NA> <NA>
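Since field 4 is the onset and field 5 the duration, a one-liner can sum the total attributed speech time per file (an optional sanity check, shown for the dev partition with the system SAD):
# Sum turn durations (seconds) per file from the hypothesis RTTM
awk '{dur[$2] += $5} END {for (f in dur) printf "%s %.2f\n", f, dur[f]}' \
    exp/spectral_cluster/dev_system_sad_rttm | sort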
Stage 8: Evaluate the Result (DER)
if [ ${stage} -le 8 ] && [ ${stop_stage} -ge 8 ]; then
ref_dir=data/voxconverse-master/
#ref_dir=data/VoxSRC2023/voxconverse/
echo -e "Get the DER results\n..."
perl external_tools/SCTK-2.4.12/src/md-eval/md-eval.pl \
-c 0.25 \
-r <(cat ${ref_dir}/${partition}/*.rttm) \
-s exp/spectral_cluster/${partition}_${sad_type}_sad_rttm 2>&1 | tee exp/spectral_cluster/${partition}_${sad_type}_sad_res
if [ ${get_each_file_res} -eq 1 ];then
single_file_res_dir=exp/spectral_cluster/${partition}_${sad_type}_single_file_res
mkdir -p $single_file_res_dir
echo -e "\nGet the DER results for each file and the results will be stored underd ${single_file_res_dir}\n..."
awk '{print $2}' exp/spectral_cluster/${partition}_${sad_type}_sad_rttm | sort -u | while read file_name; do
perl external_tools/SCTK-2.4.12/src/md-eval/md-eval.pl \
-c 0.25 \
-r <(cat ${ref_dir}/${partition}/${file_name}.rttm) \
-s <(grep "${file_name}" exp/spectral_cluster/${partition}_${sad_type}_sad_rttm) > ${single_file_res_dir}/${partition}_${file_name}_res
done
echo "Done!"
fi
fi
Use the SCTK toolkit to compute the Diarization Error Rate (DER) metric, which is the sum of:
speaker error – percentage of scored time for which the wrong speaker ID is assigned within a speech region
false alarm speech – percentage of scored time for which a nonspeech region is incorrectly marked as containing speech
missed speech – percentage of scored time for which a speech region is incorrectly marked as not containing speech
For more details about DER, consult Section 6.1 of the NIST RT-09 evaluation plan.
The overall DER result will be saved in exp/spectral_cluster/${partition}_${sad_type}_sad_res.
Optionally, set get_each_file_res to 1 if you also want the DER result for each individual file; these per-file results will be saved under the directory exp/spectral_cluster/${partition}_${sad_type}_single_file_res.
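To pull just the headline number out of the saved result file, you can grep for the summary line that md-eval.pl prints (shown for the dev partition with the system SAD):
# Print the overall DER line from the saved evaluation output
grep "OVERALL SPEAKER DIARIZATION ERROR" exp/spectral_cluster/dev_system_sad_res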