| Model / Methods | Title | Paper Link | Code Link | Published | Keywords | Venue |
|---|---|---|---|---|---|---|
| Whisper | Robust Speech Recognition via Large-Scale Weak Supervision |  |  | 2022.12.06 | openai |  |
| VALL-E | Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers |  |  | 2023.01.05 |  |  |
| VALOR | VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset |  |  | 2023.04.17 |  |  |
| VAST | VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset |  |  | 2023.05.29 |  |  |
| AudioPaLM | AudioPaLM: A Large Language Model That Can Speak and Listen |  | - | 2023.06.22 | google |  |
| SALMONN | SALMONN: Towards Generic Hearing Abilities for Large Language Models |  |  | 2023.10.20 | bytedance |  |
| SpeechGPT-Gen | SpeechGPT-Gen: Scaling Chain-of-Information Speech Generation |  |  | 2024.01.24 |  |  |
| SpeechVerse | SpeechVerse: A Large-scale Generalizable Audio Language Model |  |  | 2024.05.14 |  |  |
| SpeechGPT | SpeechGPT: Empowering Large Language Models with Intrinsic Cross-Modal Conversational Abilities |  |  | 2023.05.18 | SpeechInstruct |  |
| video-SALMONN | video-SALMONN: Speech-Enhanced Audio-Visual Large Language Models |  |  | 2024.06.22 | bytedance |  |
| Qwen2-Audio | Qwen2-Audio Technical Report |  |  | 2024.07.15 | alibaba |  |
| VITA | VITA: Towards Open-Source Interactive Omni Multimodal LLM |  |  | 2024.08.09 |  |  |
| Mini-Omni | Mini-Omni: Language Models Can Hear, Talk While Thinking in Streaming |  |  | 2024.08.29 | VoiceAssistant-400K |  |
| LLaMA-Omni | LLaMA-Omni: Seamless Speech Interaction with Large Language Models |  |  | 2024.09.10 | InstructS2S-200K |  |
Zero-Shot Multi-Speaker TTS
| Model / Methods | Title | Paper Link | Code Link | Published | Keywords | Venue |
|---|---|---|---|---|---|---|
| YourTTS | YourTTS: Towards Zero-Shot Multi-Speaker TTS and Zero-Shot Voice Conversion for everyone |  |  | 2021.12.04 |  |  |
| MegaTTS | Mega-TTS: Zero-Shot Text-to-Speech at Scale with Intrinsic Inductive Bias |  | - | 2023.06.06 |  |  |
| MegaTTS2 | Mega-TTS 2: Boosting Prompting Mechanisms for Zero-Shot Speech Synthesis |  |  | 2023.07.14 |  |  |
| XTTS | XTTS: a Massively Multilingual Zero-Shot Text-to-Speech Model |  |  | 2024.06.07 |  |  |
| Model / Methods | Title | Paper Link | Code Link | Published | Keywords | Venue |
|---|---|---|---|---|---|---|
| InstructTTS | InstructTTS: Modelling Expressive TTS in Discrete Latent Space with Natural Language Style Prompt |  |  | 2023.01.31 |  |  |
https://huggingface.co/datasets/ICTNLP/ComSpeech_Datasets
https://github.com/2noise/chattts
https://github.com/suno-ai/bark
https://github.com/openai/whisper