https://medium.com/@prdeepak.babu/the-rise-of-multimodal-large-speech-language-models-4fc5ea34d04f
Qwen-Audio is a foundational audio-language model that handles diverse audio types (human speech, natural sounds, music, and songs) and audio tasks (ASR, acoustic scene classification, etc.). The model is trained on more than 30 audio tasks, such as audio classification, speech recognition, and emotion recognition. The authors show that Qwen-Audio beats SoTA models on varied tasks without task-specific fine-tuning. They further show that training on timestamp prediction improves grounding and grounding-based QA tasks beyond speech signals, as well as ASR. Qwen-Audio uses whisper-large-v2 (about 640M parameters) as its audio encoder and the decoder-only Qwen-7B LLM as its core component. The authors also modify the prompt tags to organize tasks and datasets hierarchically, so that the gains from joint training are not lost to interference between tasks. Finally, the authors train Qwen-Audio-Chat by supervised instruction fine-tuning of Qwen-Audio on 30+ tasks and datasets. The chat model supports multi-turn dialogues.
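The hierarchical tag idea can be sketched in a few lines of Python. This is only an illustration of conditioning the decoder on a coarse-to-fine tag prefix, not the authors' code: the `TaskSpec` class, the `build_tag_prefix` function, and the exact tag strings (modeled on Whisper-style `<|...|>` tokens) are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskSpec:
    """Hypothetical description of how one training example is conditioned."""
    is_transcription: bool  # speech-to-text style task vs. general audio analysis
    audio_lang: str         # language spoken in the audio, or "unknown"
    task: str               # e.g. "transcribe", "translate", "caption"
    text_lang: str          # language of the output text
    timestamps: bool        # whether word-level timestamps are requested


def build_tag_prefix(spec: TaskSpec) -> str:
    """Assemble tags from coarse to fine so related tasks share a common prefix.

    The tag names are illustrative assumptions, not a verified vocabulary.
    """
    tags = [
        "startoftranscription" if spec.is_transcription else "startofanalysis",
        spec.audio_lang,
        spec.task,
        spec.text_lang,
        "timestamps" if spec.timestamps else "notimestamps",
    ]
    return "".join(f"<|{tag}|>" for tag in tags)


# Two related tasks differ only in the tags that actually distinguish them.
asr = build_tag_prefix(TaskSpec(True, "en", "transcribe", "en", False))
asr_with_times = build_tag_prefix(TaskSpec(True, "en", "transcribe", "en", True))
print(asr)
print(asr_with_times)
```

Because similar tasks share most of their prefix, the model can reuse what it learns across them, while the tags that differ tell it which output format a given dataset expects.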
Key contributions from the paper include:
[1] Introduces Qwen-Audio, a versatile multi-task audio-language model, alongside its extension Qwen-Audio-Chat for multi-turn dialogues, both of which are open-source to benefit the audio-text multimodal community.
[2] Develops a multi-task training framework that handles textual label variations across datasets, allowing knowledge sharing and reducing interference, with Qwen-Audio excelling in over 30 tasks.
[3] Demonstrates the importance of the speech recognition with word-level timestamps (SRWT) task for audio-language pre-training, showing improvements in grounding tasks, question answering, and ASR performance.
[4] Qwen-Audio outperforms similar models on benchmark tasks without task-specific fine-tuning, setting new records on the AISHELL-1, CochlScene, ClothoAQA, and VocalSound datasets.
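To make the SRWT task from contribution [3] more concrete, the sketch below shows one plausible way to turn word-level alignments into a training target in which each word is wrapped by its start and end time. The function name `format_srwt_target`, the two-decimal `<|seconds|>` token format, and the sample alignment are assumptions for illustration. The summary above does not specify the exact target format used in the paper.

```python
from typing import List, Tuple

# (word, start time in seconds, end time in seconds)
WordAlignment = Tuple[str, float, float]


def format_srwt_target(words: List[WordAlignment]) -> str:
    """Wrap each word in assumed Whisper-style start/end timestamp tokens."""
    parts = []
    for word, start, end in words:
        if end < start:
            raise ValueError(f"end before start for word {word!r}")
        parts.append(f"<|{start:.2f}|>{word}<|{end:.2f}|>")
    return "".join(parts)


# Hypothetical alignment, e.g. produced by a forced aligner.
alignment = [("hello", 0.0, 0.42), ("world", 0.5, 1.0)]
print(format_srwt_target(alignment))
```

Training the model to emit time boundaries alongside the words forces it to align text with positions in the audio, which is the intuition behind the reported gains on grounding tasks.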