Each of the four metrics considered is briefly described at the beginning of its section. Please note that some approximations are made in favour of clarity; refer to the Zero Resource Speech paper for more details.
Checkpoints of the baseline can be obtained by running:
curl https://download.zerospeech.com/2021/baseline_checkpoints_VG.zip | jar xv
In this task, the model receives a triplet of triphones A, B and X, with A and X being two different occurrences of the same triphone, and B differing from A/X only in its central phone (e.g. /big/ vs /bug/). Under a distance function d, we expect d(A,X) < d(B,X), as A and X correspond to the same triphone. This metric is computed under two conditions: within speakers, when the three triphones are pronounced by the same speaker, and across speakers, when the triphones are pronounced by different speakers. Lower is better.
ABX error rate (%) on LibriSpeech dev:

| Model | Layer | dev-clean within | dev-clean across | dev-other within | dev-other across |
|---|---|---|---|---|---|
| MFCCs+VG | rnn1 | 8.70 | 10.69 | 9.86 | 14.71 |
| CPC small+VG | rnn1 | 5.36 | 6.68 | 7.41 | 11.30 |
Layer refers to the layer used to extract representations ('rnn1' corresponds to the second recurrent layer).
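To make the decision rule concrete, here is a minimal sketch of the ABX comparison in Python. It mean-pools frame-level representations and uses cosine distance, which is a simplification: the official evaluation aggregates frame-wise distances with dynamic time warping. All names and shapes below are illustrative, not the actual evaluation code.

```python
import numpy as np

def cosine_distance(u, v):
    """Cosine distance between two pooled embeddings."""
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def abx_correct(a_frames, b_frames, x_frames):
    """Return True when X is judged closer to A than to B.

    Each argument is an array of shape (n_frames, dim) holding the
    frame-level representations of one triphone. Here we simply
    mean-pool over time; the official metric uses DTW over
    frame-wise distances instead.
    """
    a, b, x = (f.mean(axis=0) for f in (a_frames, b_frames, x_frames))
    return cosine_distance(a, x) < cosine_distance(b, x)

# The ABX error rate is the fraction of (A, B, X) triplets for which
# the decision above is wrong, averaged as described in the challenge paper.
```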
In this task, the model receives two words, say /abduct/ and /kidnap/. The distance between the embeddings of these two words is computed. Then, a Spearman's rank correlation coefficient is computed between these distances and human semantic similarity judgements. Higher is better.
sSIMI dev:

| Model | Layer | librispeech | synthetic |
|---|---|---|---|
| MFCCs+VG | att | 13.0894 | 9.4661 |
| CPC small+VG | att | 11.8885 | 6.3074 |
Layer refers to the layer used to extract representations ('att' corresponds to the attention layer).
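The correlation itself can be computed with scipy, as in the hedged sketch below. The `embed` helper is hypothetical (in the baseline, representations are extracted from the attention layer), and the scaling of the final score is an assumption.

```python
import numpy as np
from scipy.stats import spearmanr

def embed(word_audio):
    """Hypothetical helper returning the model embedding of one spoken word."""
    raise NotImplementedError

def cosine_similarity(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def ssimi_score(word_pairs):
    """word_pairs: list of (audio_a, audio_b, human_score) tuples, where
    human_score is the human semantic similarity judgement for the pair."""
    model_scores = [cosine_similarity(embed(a), embed(b)) for a, b, _ in word_pairs]
    human_scores = [h for _, _, h in word_pairs]
    rho, _ = spearmanr(model_scores, human_scores)
    return 100.0 * rho  # assumption: scores are reported as rho x 100
```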
In this task, the model receives two sentences, one of which is syntactically wrong, say /dogs eat meat/ vs /dogs eats meat/. The model is asked to return a pseudo-probability for each of these sentences. The pseudo-probability of the syntactically correct sentence is expected to be higher than that of the syntactically wrong one. Note that the hyperparameters used to extract the pseudo-probabilities from the model can greatly impact the performance. Higher is better.
| Model | K | M_d | Delta_t | sBLIMP dev |
|---|---|---|---|---|
| MFCCs+VG + KMEANS + BERT small | 50 | 10 | 1 | 53.19 |
| CPC small+VG + KMEANS + BERT large | 500 | 10 | 1 | 54.68 |
K refers to the number of clusters used in K-means. M_d and Delta_t are, respectively, the decoding span size and the temporal sliding size used to extract pseudo-probabilities.
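To illustrate how M_d and Delta_t enter the computation, here is a hedged sketch: spans of M_d consecutive discrete units are masked and scored by the language model, the window slides by Delta_t units, and the span scores are summed into a pseudo log-probability. The `score_masked_span` callable is hypothetical; the actual baseline implementation may differ in details such as normalisation.

```python
def pseudo_log_prob(units, score_masked_span, m_d=10, delta_t=1):
    """Pseudo log-probability of a sequence of discrete units.

    units: list of K-means cluster ids for one utterance.
    score_masked_span(units, start, end): hypothetical callable returning
        the log-probability the BERT model assigns to units[start:end]
        when that span is masked.
    m_d: decoding span size; delta_t: temporal sliding size.
    """
    total = 0.0
    for start in range(0, max(len(units) - m_d + 1, 1), delta_t):
        total += score_masked_span(units, start, start + m_d)
    return total

def sblimp_pair_correct(good_units, bad_units, score_masked_span):
    """A sentence pair counts as correct when the grammatical one scores higher."""
    return (pseudo_log_prob(good_units, score_masked_span)
            > pseudo_log_prob(bad_units, score_masked_span))
```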
In this task, the model receives a word and a non-word, say /brick/ and /blick/. It is asked to return a pseudo-probability for each of these. The pseudo-probability of the word is expected to be higher than that of the non-word. Higher is better.
| Model | K | M_d | Delta_t | sWUGGY dev |
|---|---|---|---|---|
| MFCCs+VG + KMEANS + BERT small | 50 | 10 | 1 | 52.53 |
| CPC small+VG + KMEANS + BERT large | 500 | 10 | 1 | 67.16 |
K refers to the number of clusters used in K-means. M_d and Delta_t are, respectively, the decoding span size and the temporal sliding size used to extract pseudo-probabilities.
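The reported accuracy can then be obtained by counting, over all word/non-word pairs, how often the word receives the higher pseudo-probability. A small sketch, reusing a pseudo log-probability function like the one in the sBLIMP section (passed in as a callable):

```python
def swuggy_accuracy(pairs, pseudo_log_prob):
    """Spot-the-word accuracy in percent.

    pairs: list of (word_units, nonword_units) discrete-unit sequences.
    pseudo_log_prob: callable scoring one unit sequence, e.g. the
        masked-span scorer sketched in the sBLIMP section.
    """
    correct = sum(pseudo_log_prob(word) > pseudo_log_prob(nonword)
                  for word, nonword in pairs)
    return 100.0 * correct / max(len(pairs), 1)
```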
Test set results:

| Model | ABX test-clean within | ABX test-clean across | ABX test-other within | ABX test-other across | sSIMI librispeech | sSIMI synthetic | sWUGGY | sBLIMP |
|---|---|---|---|---|---|---|---|---|
| MFCCs+VG + KMEANS + BERT small | 8.39 | 10.59 | 10.66 | 15.03 | -0.10 | 9.99 | 52.86 | 53.02 |
| CPC small+VG + KMEANS + BERT large | 5.36 | 6.71 | 7.35 | 11.92 | 0.16 | 9.71 | 67.20 | 54.53 |
Overall, using the visual modality to learn speech representations in an unsupervised way seems on par with audio-only models.
- The ABX error rate obtained by CPC small is further improved with the VG model (high-budget baseline): it drops from 6.24% to 5.36% (LibriSpeech dev-clean, within speakers).
- The best achievement is obtained on the semantic similarity task, for which our best VG model gets a score of 13.09 compared to 8.72 for the audio-only baseline (results reported here are computed on the sSIMI librispeech dev set). However, we observe a decrease on the test set for the sSIMI metric, which is probably due to the fact that the dev and test sets for this metric have not been drawn from the same distribution, and that the test metric is averaged across multiple datasets.
- No improvement is observed on the syntactic acceptability judgement task (sBLIMP). A small decrease is observed on the spot-the-word task (sWUGGY): 70.69% for the audio-only baseline compared to 67.16% for the multimodal baseline.