Each of the four metrics considered is briefly described at the beginning of its section. Please note that some approximations are made in favour of clarity; refer to the Zero Resource Speech paper for more details.
Checkpoints of the baseline can be obtained by running:
curl https://download.zerospeech.com/2021/baseline_checkpoints_VG.zip | jar xv
In this task, the model receives a triplet of triphones A, B and X, with A and X being two different occurrences of the same triphone, and B differing from A/X only in its central phone (e.g. /big/ vs /bug/). Under a distance function d, we expect d(A,X) < d(B,X), as A and X correspond to the same triphone. This metric is computed under two conditions: within speakers, when the three triphones are pronounced by the same speaker, and across speakers, when the triphones are pronounced by different speakers. Lower is better.
ABX error rate (%) on LibriSpeech dev:

| Model | Layer | dev-clean within | dev-clean across | dev-other within | dev-other across |
|---|---|---|---|---|---|
| MFCCs+VG | rnn1 | 8.70 | 10.69 | 9.86 | 14.71 |
| CPC small+VG | rnn1 | 5.36 | 6.68 | 7.41 | 11.30 |
Layer refers to the layer used to extract representations ('rnn1' corresponds to the second recurrent layer).
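To make the decision rule concrete, here is a minimal sketch of the ABX comparison in Python. It mean-pools frame-level representations and uses cosine distance, which is a simplification: the official evaluation aggregates frame-wise distances with dynamic time warping. All names and shapes below are illustrative, not the actual evaluation code.

```python
import numpy as np

def cosine_distance(u, v):
    """Cosine distance between two pooled embeddings."""
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def abx_correct(a_frames, b_frames, x_frames):
    """Return True when X is judged closer to A than to B.

    Each argument is an array of shape (n_frames, dim) holding the
    frame-level representations of one triphone. Here we simply
    mean-pool over time; the official metric uses DTW over
    frame-wise distances instead.
    """
    a, b, x = (f.mean(axis=0) for f in (a_frames, b_frames, x_frames))
    return cosine_distance(a, x) < cosine_distance(b, x)

# The ABX error rate is the fraction of (A, B, X) triplets for which
# the decision above is wrong, averaged as described in the challenge paper.
```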
In this task, the model receives two words, say /abduct/ and /kidnap/. The distance between the embeddings of these two words is computed. Then, a Spearman's rank correlation coefficient is computed between these distances and human semantic similarity judgements. Higher is better.
sSIMI dev:

| Model | Layer | librispeech | synthetic |
|---|---|---|---|
| MFCCs+VG | att | 13.0894 | 9.4661 |
| CPC small+VG | att | 11.8885 | 6.3074 |
Layer refers to the layer used to extract representations ('att' corresponds to the attention layer).
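The correlation itself can be computed with scipy, as in the hedged sketch below. The `embed` helper is hypothetical (in the baseline, representations are extracted from the attention layer), and the scaling of the final score is an assumption.

```python
import numpy as np
from scipy.stats import spearmanr

def embed(word_audio):
    """Hypothetical helper returning the model embedding of one spoken word."""
    raise NotImplementedError

def cosine_similarity(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def ssimi_score(word_pairs):
    """word_pairs: list of (audio_a, audio_b, human_score) tuples, where
    human_score is the human semantic similarity judgement for the pair."""
    model_scores = [cosine_similarity(embed(a), embed(b)) for a, b, _ in word_pairs]
    human_scores = [h for _, _, h in word_pairs]
    rho, _ = spearmanr(model_scores, human_scores)
    return 100.0 * rho  # assumption: scores are reported as rho x 100
```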
In this task, the model receives two sentences, one of which is syntactically wrong, say /dogs eat meat/ vs /dogs eats meat/. The model is asked to return a pseudo-probability for each of these sentences. The pseudo-probability of the syntactically correct sentence is expected to be higher than that of the syntactically wrong one. Note that the hyperparameters used to extract the pseudo-probabilities from the model can greatly impact the performance. Higher is better.
| Model | K | M_d | Delta_t | sBLIMP dev |
|---|---|---|---|---|
| MFCCs+VG + KMEANS + BERT small | 50 | 10 | 1 | 53.19 |
| CPC small+VG + KMEANS + BERT large | 500 | 10 | 1 | 54.68 |
K refers to the number of clusters used in K-means. M_d and Delta_t are, respectively, the decoding span size and the temporal sliding size used to extract pseudo-probabilities.
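To illustrate how M_d and Delta_t enter the computation, here is a hedged sketch: spans of M_d consecutive discrete units are masked and scored by the language model, the window slides by Delta_t units, and the span scores are summed into a pseudo log-probability. The `score_masked_span` callable is hypothetical; the actual baseline implementation may differ in details such as normalisation.

```python
def pseudo_log_prob(units, score_masked_span, m_d=10, delta_t=1):
    """Pseudo log-probability of a sequence of discrete units.

    units: list of K-means cluster ids for one utterance.
    score_masked_span(units, start, end): hypothetical callable returning
        the log-probability the BERT model assigns to units[start:end]
        when that span is masked.
    m_d: decoding span size; delta_t: temporal sliding size.
    """
    total = 0.0
    for start in range(0, max(len(units) - m_d + 1, 1), delta_t):
        total += score_masked_span(units, start, start + m_d)
    return total

def sblimp_pair_correct(good_units, bad_units, score_masked_span):
    """A sentence pair counts as correct when the grammatical one scores higher."""
    return (pseudo_log_prob(good_units, score_masked_span)
            > pseudo_log_prob(bad_units, score_masked_span))
```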
In this task, the model receives a word and a non-word, say /brick/ and /blick/. It is asked to return a pseudo-probability for each of these. The pseudo-probability of the word is expected to be higher than that of the non-word. Higher is better.
| Model | K | M_d | Delta_t | sWUGGY dev |
|---|---|---|---|---|
| MFCCs+VG + KMEANS + BERT small | 50 | 10 | 1 | 52.53 |
| CPC small+VG + KMEANS + BERT large | 500 | 10 | 1 | 67.16 |
K refers to the number of clusters used in K-means. M_d and Delta_t are, respectively, the decoding span size and the temporal sliding size used to extract pseudo-probabilities.
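The reported accuracy can then be obtained by counting, over all word/non-word pairs, how often the word receives the higher pseudo-probability. A small sketch, reusing a pseudo log-probability function like the one in the sBLIMP section (passed in as a callable):

```python
def swuggy_accuracy(pairs, pseudo_log_prob):
    """Spot-the-word accuracy in percent.

    pairs: list of (word_units, nonword_units) discrete-unit sequences.
    pseudo_log_prob: callable scoring one unit sequence, e.g. the
        masked-span scorer sketched in the sBLIMP section.
    """
    correct = sum(pseudo_log_prob(word) > pseudo_log_prob(nonword)
                  for word, nonword in pairs)
    return 100.0 * correct / max(len(pairs), 1)
```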
Test set results:

| Model | ABX test-clean within | ABX test-clean across | ABX test-other within | ABX test-other across | sSIMI librispeech | sSIMI synthetic | sWUGGY | sBLIMP |
|---|---|---|---|---|---|---|---|---|
| MFCCs+VG + KMEANS + BERT small | 8.39 | 10.59 | 10.66 | 15.03 | -0.10 | 9.99 | 52.86 | 53.02 |
| CPC small+VG + KMEANS + BERT large | 5.36 | 6.71 | 7.35 | 11.92 | 0.16 | 9.71 | 67.20 | 54.53 |
Overall, using the visual modality to learn speech representations in an unsupervised way seems on par with audio-only models.
- The ABX error rate obtained by CPC small is further improved with the VG model (high-budget baseline): it drops from 6.24% to 5.36% (LibriSpeech dev-clean, within speakers).
- The best achievement is obtained on the semantic similarity task, for which our best VG model gets a score of 13.09 compared to 8.72 for the audio-only baseline (results reported here are computed on the sSIMI librispeech dev set). However, we observe a decrease on the test set for the sSIMI metric, which is probably due to the fact that the dev and test sets for this metric have not been drawn from the same distribution, and that the test metric is averaged across multiple datasets.
- No improvement is observed on the syntactic acceptability judgement task (sBLIMP). A small decrease is observed on the spot-the-word task (sWUGGY): 70.69% for the audio-only baseline compared to 67.16% for the multimodal baseline.