
# MultiModal-Models

| Model / Methods | Title | Paper Link | Code Link | Published | Keywords | Venue |
| --- | --- | --- | --- | --- | --- | --- |
| ALIGN | Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision | Paper / Reading | Code | 2021 | | |
| AltCLIP | AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities | Paper / Reading | Code | 2022 | | |
| BLIP | BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation | Paper / Reading | Code | 2022 | salesforce | |
| BLIP-2 | BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | Paper / Reading | Code | 2023 | salesforce | |
| BLIP-3 | | | | | | |
| BridgeTower | BridgeTower: Building Bridges Between Encoders in Vision-Language Representation Learning | Paper / Reading | Code | 2023 | microsoft | AAAI’23 |
| BROS | BROS: A Pre-trained Language Model Focusing on Text and Layout for Better Key Information Extraction from Documents | Paper / Reading | Code | 2021 | clovaai | |
| Chameleon | Chameleon: Mixed-Modal Early-Fusion Foundation Models | Paper / Reading | Code | 2024 | facebookresearch | |
| Chinese-CLIP | Chinese CLIP: Contrastive Vision-Language Pretraining in Chinese | Paper / Reading | Code | 2023 | OFA-Sys | |
| CLIP | Learning Transferable Visual Models From Natural Language Supervision | Paper / Reading | Code | 2021 | openai | |
| CLIPSeg | Image Segmentation Using Text and Image Prompts | Paper / Reading | Code | 2021 | | CVPR 2022 |
| CLVP | Better speech synthesis through scaling | Paper / Reading | Code | 2023 | | |
| Data2Vec | data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language | Paper / Reading | Code | 2022 | | |
| DePlot | DePlot: One-shot visual language reasoning by plot-to-table translation | Paper / Reading | Code | 2022 | | |
| Donut | OCR-free Document Understanding Transformer | Paper / Reading | Code | 2021 | clovaai | |
| FLAVA | FLAVA: A Foundational Language And Vision Alignment Model | Paper / Reading | Code | 2021 | facebookresearch | |
| GIT | GIT: A Generative Image-to-text Transformer for Vision and Language | Paper / Reading | Code | 2022 | microsoft | |
| Grounding DINO | Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection | Paper / Reading | Code | 2023 | IDEA-Research | |
| GroupViT | GroupViT: Semantic Segmentation Emerges from Text Supervision | Paper / Reading | Code | 2022 | NVlabs | |
| IDEFICS | OBELICS: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents | Paper / Reading | Code | 2023 | HuggingFaceM4 | |
| IDEFICS-2 | What matters when building vision-language models? | Paper / Reading | Code | 2024 | HuggingFaceM4 | |
| InstructBLIP | InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning | Paper / Reading | Code | 2023 | salesforce | |
| InstructBlipVideo | InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning | Paper / Reading | Code | 2023 | salesforce | |
| KOSMOS-2 | Kosmos-2: Grounding Multimodal Large Language Models to the World | Paper / Reading | Code | 2023 | microsoft | |
| LayoutLM | LayoutLM: Pre-training of Text and Layout for Document Image Understanding | Paper / Reading | Code | 2019 | microsoft | |
| LayoutLMv2 | LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding | Paper | Code | 2020 | microsoft | |
| LayoutLMv3 | LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking | Paper | Code | 2022 | microsoft | |
| LayoutXLM | LayoutXLM: Multimodal Pre-training for Multilingual Visually-rich Document Understanding | Paper | Code | 2021 | microsoft | |
| LiLT | LiLT: A Simple yet Effective Language-Independent Layout Transformer for Structured Document Understanding | Paper / Reading | Code | 2022 | | |
| LLaVa | Visual Instruction Tuning | Paper / Reading | Code | 2023 | | |
| LLaVa-VL | Improved Baselines with Visual Instruction Tuning | Paper | Code | 2024 | | |
| LLaVA-NeXT | LLaVA-NeXT: Improved reasoning, OCR, and world knowledge | Paper | Code | 2024 | | |
| LLaVa-NeXT-Video | LLaVA-NeXT: A Strong Zero-shot Video Understanding Model | Paper | Code | 2024 | | |
| Video-LLaVA | Video-LLaVA: Learning United Visual Representation by Alignment Before Projection | Paper | Code | 2023 | | |
| LXMERT | LXMERT: Learning Cross-Modality Encoder Representations from Transformers | Paper | Code | 2019 | | |
| MatCha | MatCha: Enhancing Visual Language Pretraining with Math Reasoning and Chart Derendering | Paper | Code | 2022 | google | |
| MGP-STR | Multi-Granularity Prediction for Scene Text Recognition | Paper | Code | 2022 | AlibabaResearch | |
| Mini-Omni | Mini-Omni: Language Models Can Hear, Talk While Thinking in Streaming | Paper / Reading | Code | 2024 | gpt-omni | |
| Nougat | Nougat: Neural Optical Understanding for Academic Documents | Paper | Code | 2023 | facebookresearch | |
| OneFormer | OneFormer: One Transformer to Rule Universal Image Segmentation | Paper | Code | 2022 | SHI-Labs | |
| OWL-ViT | Simple Open-Vocabulary Object Detection with Vision Transformers | Paper | Code | 2022 | google | |
| OWLv2 | Scaling Open-Vocabulary Object Detection | Paper | Code | 2023 | google | |
| PaliGemma | PaliGemma – Google’s Cutting-Edge Open Vision Language Model | Paper | Code | 2024 | | |
| Perceiver | Perceiver IO: A General Architecture for Structured Inputs & Outputs | Paper | Code | 2021 | | |
| Pix2Struct | Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding | Paper | Code | 2022 | | |
| SAM | Segment Anything | Paper | Code | 2023 | meta | |
| SAM v2 | SAM 2: Segment Anything in Images and Videos | Paper | Code | 2024 | meta | |
| SigLIP | Sigmoid Loss for Language Image Pre-Training | Paper | Code | 2023 | | |
| TAPAS | TAPAS: Weakly Supervised Table Parsing via Pre-training | Paper | Code | 2020 | | |
| TrOCR | TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models | Paper | Code | 2021 | | |
| TVLT | TVLT: Textless Vision-Language Transformer | Paper | Code | 2022 | | |
| TVP | Text-Visual Prompting for Efficient 2D Temporal Video Grounding | Paper | Code | 2023 | Intel | |
| UDOP | Unifying Vision, Text, and Layout for Universal Document Processing | Paper | Code | 2022 | | |
| ViLT | ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision | Paper | Code | 2021 | | |
| VipLlava | Making Large Multimodal Models Understand Arbitrary Visual Prompts | Paper | Code | 2023 | | |
| VisualBERT | VisualBERT: A Simple and Performant Baseline for Vision and Language | Paper | Code | 2019 | | |
| X-CLIP | Expanding Language-Image Pretrained Models for General Video Recognition | Paper | Code | 2022 | | |
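
For readers who want to try one of these models directly, the snippet below is a minimal zero-shot image-classification sketch with CLIP. It assumes the Hugging Face `transformers` implementation, the `openai/clip-vit-base-patch32` checkpoint, and a local image file `example.jpg`; a similar processor-plus-model pattern applies to many of the other `transformers`-hosted entries in the table.

```python
# Minimal zero-shot image classification with CLIP via Hugging Face transformers.
# Checkpoint name and image path are illustrative assumptions.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model_id = "openai/clip-vit-base-patch32"
model = CLIPModel.from_pretrained(model_id)
processor = CLIPProcessor.from_pretrained(model_id)

image = Image.open("example.jpg")  # any local RGB image
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds image-text similarity scores; softmax turns them into
# a probability distribution over the candidate labels.
probs = outputs.logits_per_image.softmax(dim=-1)[0]
print({label: round(p, 3) for label, p in zip(labels, probs.tolist())})
```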

## Reading List


### Section 1: LLMs and MLLMs

  1. OpenAI, 2023, Introducing ChatGPT

  2. OpenAI, 2023, GPT-4 Technical Report

  3. Alayrac, et al., 2022, Flamingo: a Visual Language Model for Few-Shot Learning

  4. Li, et al., 2023, BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

  5. Zhu, et al., 2023, MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models

  6. Wu, et al., 2023, Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models

  7. Shen, et al., 2023, HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face

  8. Tang, et al., 2023, Any-to-Any Generation via Composable Diffusion

  9. Girdhar, et al., 2023, ImageBind: One Embedding Space To Bind Them All

  10. Wu, et al., 2023, NExT-GPT: Any-to-Any Multimodal LLM

  11. Moon, et al., 2023, AnyMAL: An Efficient and Scalable Any-Modality Augmented Language Model

  12. Hu, et al., 2023, Large Multilingual Models Pivot Zero-Shot Multimodal Learning across Languages

  13. Bai, et al., 2023, Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

  14. Wang, et al., 2023, CogVLM: Visual Expert for Pretrained Language Models

  15. Peng, et al., 2023, Kosmos-2: Grounding Multimodal Large Language Models to the World

  16. Dong, et al., 2023, InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model

  17. Zhu, et al., 2023, LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment

  18. Ge, et al., 2023, Planting a SEED of Vision in Large Language Model

  19. Zhan, et al., 2024, AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling

  20. Kondratyuk, et al., 2023, VideoPoet: A Large Language Model for Zero-Shot Video Generation

  21. Zhang, et al., 2023, SpeechTokenizer: Unified Speech Tokenizer for Speech Large Language Models

  22. Zeghidour, et al., 2021, SoundStream: An End-to-End Neural Audio Codec

  23. Liu, et al., 2023, Improved Baselines with Visual Instruction Tuning

  24. Wu, et al., 2023, Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models

  25. Wang, et al., 2023, ModaVerse: Efficiently Transforming Modalities with LLMs

  26. Fei, et al., 2024, VITRON: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing

  27. Lu, et al., 2023, Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action

  28. Bai, et al., 2023, LVM: Sequential Modeling Enables Scalable Learning for Large Vision Models

  29. Huang, et al., 2023, Language Is Not All You Need: Aligning Perception with Language Models

  30. Li, et al., 2023, VideoChat: Chat-Centric Video Understanding

  31. Maaz, et al., 2023, Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models

  32. Zhang, et al., 2023, Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding

  33. Lin, et al., 2023, Video-LLaVA: Learning United Visual Representation by Alignment Before Projection

  34. Qian, et al., 2024, Momentor: Advancing Video Large Language Model with Fine-Grained Temporal Reasoning

  35. Hong, et al., 2023, 3D-LLM: Injecting the 3D World into Large Language Models

  36. Sun, et al., 2023, 3D-GPT: Procedural 3D Modeling with Large Language Models

  37. Chen, et al., 2023, LL3DA: Visual Interactive Instruction Tuning for Omni-3D Understanding, Reasoning, and Planning

  38. Xu, et al., 2023, PointLLM: Empowering Large Language Models to Understand Point Clouds

  39. Chen, et al., 2024, SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities

  40. Huang, et al., 2023, AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head

  41. Zhang, et al., 2023, SpeechGPT: Empowering Large Language Models with Intrinsic Cross-Modal Conversational Abilities

  42. Wang, et al., 2023, VioLA: Unified Codec Language Models for Speech Recognition, Synthesis, and Translation

  43. Rubenstein, et al., 2023, AudioPaLM: A Large Language Model That Can Speak and Listen

  44. Tang, et al., 2023, SALMONN: Towards Generic Hearing Abilities for Large Language Models

  45. Latif, et al., 2023, Sparks of Large Audio Models: A Survey and Outlook

  46. Luo, et al., 2022, BioGPT: Generative Pre-trained Transformer for Biomedical Text Generation and Mining

  47. Li, et al., 2023, DrugGPT: A GPT-based Strategy for Designing Potential Ligands Targeting Specific Proteins

  48. Chen, et al., 2023, MEDITRON-70B: Scaling Medical Pretraining for Large Language Models

  49. Wang, et al., 2023, HuaTuo: Tuning LLaMA Model with Chinese Medical Knowledge

  50. Zhang, et al., 2023, AlpaCare: Instruction-tuned Large Language Models for Medical Application

  51. Frey, et al., 2023, Neural Scaling of Deep Chemical Models

  52. Zhang, et al., 2023, ChemLLM: A Chemical Large Language Model

  53. Liu, et al., 2023, MolCA: Molecular Graph-Language Modeling with Cross-Modal Projector and Uni-Modal Adapter

  54. Jiang, et al., 2023, StructGPT: A General Framework for Large Language Model to Reason on Structured Data

  55. Chen, et al., 2024, LLaGA: Large Language and Graph Assistant

  56. Koh, et al., 2023, Generating Images with Multimodal Language Models

  57. Sun, et al., 2023, Generative Pretraining in Multimodality

  58. Zheng, et al., 2023, MiniGPT-5: Interleaved Vision-and-Language Generation via Generative Vokens

  59. Dong, et al., 2023, DreamLLM: Synergistic Multimodal Comprehension and Creation

  60. Liu, et al., 2023, LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents

  61. Wang, et al., 2023, GPT4Video: A Unified Multimodal Large Language Model for Instruction-Followed Understanding and Safety-Aware Generation

  62. Jin, et al., 2024, Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization

  63. Jin, et al., 2023, Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding

  64. Li, et al., 2023, LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models

  65. Su, et al., 2023, PandaGPT: One Model to Instruction-Follow Them All

  66. Lyu, et al., 2023, Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration

  67. Tang, et al., 2023, CoDi-2: In-Context, Interleaved, and Interactive Any-to-Any Generation

  68. Zhang, et al., 2023, GPT4RoI: Instruction Tuning Large Language Model on Region-of-Interest

  69. Yuan, et al., 2023, Osprey: Pixel Understanding with Visual Instruction Tuning

  70. Rasheed, et al., 2023, GLaMM: Pixel Grounding Large Multimodal Model

  71. Pi, et al., 2023, DetGPT: Detect What You Need via Reasoning

  72. Ren, et al., 2023, PixelLM: Pixel Reasoning with Large Multimodal Model

  73. Lai, et al., 2023, LISA: Reasoning Segmentation via Large Language Model

  74. Chen, et al., 2023, Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic

  75. Munasinghe, et al., 2023, PG-Video-LLaVA: Pixel Grounding in Large Multimodal Video Models

  76. Yu, et al., 2023, Merlin: Empowering Multimodal LLMs with Foresight Minds

  77. Fu, et al., 2023, MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

  78. Xu, et al., 2023, LVLM-eHub: A Comprehensive Evaluation Benchmark for Large Vision-Language Models

  79. Ying, et al., 2024, MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI

  80. Pan, et al., 2024, Auto-Encoding Morph-Tokens for Multimodal LLM

  81. Thagard, et al., 1997, Abductive reasoning: Logic, visual thinking, and coherence

  82. Bavishi, et al., 2023, Fuyu-8B: A Multimodal Architecture for AI Agents

### Section 2: Instruction Tuning

  1. Liu, et al., 2023, Visual Instruction Tuning (see the data-format sketch after this list)

  2. Liu, et al., 2023, Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning

  3. Gao, et al., 2023, LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model

  4. Zhao, et al., 2023, SVIT: Scaling up Visual Instruction Tuning

  5. Ye, et al., 2023, mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality

  6. Yu, et al., 2023, RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback

  7. Liu, et al., 2023, MMC: Advancing Multimodal Chart Understanding with Large-scale Instruction Tuning

  8. Zhu, et al., 2023, MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models

  9. Liu, et al., 2023, HallusionBench: You See What You Think? Or You Think What You See? An Image-Context Reasoning Benchmark Challenging for GPT-4V(ision), LLaVA-1.5, and Other Multi-modality Models

  10. Li, et al., 2023, Evaluating Object Hallucination in Large Vision-Language Models

  11. Huang, et al., 2023, Visual Instruction Tuning towards General-Purpose Multimodal Model: A Survey

  12. Yin, et al., 2023, A Survey on Multimodal Large Language Models

  13. Yin, et al., 2023, Woodpecker: Hallucination Correction for Multimodal Large Language Models
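
To make the instruction-tuning papers above concrete, here is a schematic example of the conversation-style training record popularized by LLaVA-style visual instruction tuning (item 1). The field names, image path, and answer text are illustrative assumptions; individual projects define their own schemas.

```python
# Schematic LLaVA-style visual instruction-tuning record. Field names, the image
# path, and the answer text are illustrative; real datasets use their own schema.
import json

record = {
    "id": "000000033471",
    "image": "coco/train2017/000000033471.jpg",  # path into the image corpus
    "conversations": [
        {"from": "human", "value": "<image>\nWhat is unusual about this image?"},
        {"from": "gpt", "value": "A man is ironing clothes on a board attached to the back of a moving taxi."},
    ],
}

print(json.dumps(record, indent=2))
```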

### Section 3: Reasoning with LLMs

  1. Zhang, et al., 2023, Multimodal Chain-of-Thought Reasoning in Language Models (see the two-stage sketch after this list)

  2. Zhao, et al., 2023, MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning

  3. Lu, et al., 2023, Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models

  4. Zhang, et al., 2023, You Only Look at Screens: Multimodal Chain-of-Action Agents

  5. Sun, et al., 2023, Generative multimodal models are in-context learners

  6. Fei, et al., 2024, VITRON: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing

  7. Wei, et al., 2023, Enhancing Human-like Multi-Modal Reasoning: A New Challenging Dataset and Comprehensive Framework

  8. Zhang, et al., 2023, Igniting Language Intelligence: The Hitchhiker's Guide From Chain-of-Thought Reasoning to Language Agents

  9. Fei, et al., 2024, Video-of-Thought: Step-by-Step Video Reasoning from Perception to Cognition

  10. Prystawski, et al., 2023, Why think step by step? Reasoning emerges from the locality of experience

  11. Gou, et al., 2023, CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing

  12. Tang, et al., 2024, Prioritizing Safeguarding Over Autonomy: Risks of LLM Agents for Science

  13. Yuan, et al., 2024, R-Judge: Benchmarking Safety Risk Awareness for LLM Agents
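
Several entries above, most directly Zhang et al.'s Multimodal Chain-of-Thought (item 1), separate rationale generation from answer inference. The sketch below illustrates that two-stage pattern; `vlm_generate` is a hypothetical helper standing in for whichever vision-language model is used, so treat this as an illustration rather than any paper's actual implementation.

```python
# Two-stage multimodal chain-of-thought sketch. `vlm_generate(image, prompt)` is a
# hypothetical helper standing in for any vision-language model's generation call.
from typing import Any, Callable

def multimodal_cot(image: Any, question: str,
                   vlm_generate: Callable[[Any, str], str]) -> str:
    # Stage 1: produce a free-form rationale grounded in the image.
    rationale = vlm_generate(
        image,
        f"Question: {question}\nDescribe the relevant visual evidence step by step.",
    )
    # Stage 2: answer the question conditioned on the question plus the rationale.
    answer = vlm_generate(
        image,
        f"Question: {question}\nRationale: {rationale}\nTherefore, the answer is:",
    )
    return answer
```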

### Section 4: Efficient Learning

  1. Hu, et al., 2021, LoRA: Low-Rank Adaptation of Large Language Models (see the adapter sketch after this list)

  2. Dettmers, et al., 2023, QLoRA: Efficient Finetuning of Quantized LLMs

  3. Li, et al., 2023, BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

  4. Luo, et al., 2023, Cheap and Quick: Efficient Vision-Language Instruction Tuning for Large Language Models

  5. Yao, et al., 2024, MiniCPM-V: A GPT-4V Level MLLM on Your Phone

  6. DeepSpeed Team, 2020, DeepSpeed Blog

  7. Zhao, et al., 2023, PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel

  8. Zhu, et al., 2023, MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models

  9. Chen, et al., 2023, MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning

  10. Hong, et al., 2023, CogAgent: A Visual Language Model for GUI Agents

  11. Chen, et al., 2024, How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites

  12. Dehghani, et al., 2023, Patch n' Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution

  13. Zhang, et al., 2023, VPGTrans: Transfer Visual Prompt Generator across LLMs

  14. Wu, et al., 2023, NExT-GPT: Any-to-Any Multimodal LLM

  15. Fei, et al., 2024, VITRON: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing

  16. Zhang, et al., 2024, NExT-Chat: An LMM for Chat, Detection and Segmentation
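
As a companion to the LoRA and QLoRA entries at the top of this section (items 1 and 2), the snippet below is a minimal sketch of attaching low-rank adapters to a causal language model with the Hugging Face `peft` library. The base checkpoint, target modules, and hyperparameters are illustrative assumptions, not recommendations.

```python
# Minimal LoRA sketch with Hugging Face peft. The base checkpoint, target modules,
# and hyperparameters below are illustrative assumptions, not recommendations.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("facebook/opt-350m")

lora_cfg = LoraConfig(
    r=8,                                   # rank of the low-rank update matrices
    lora_alpha=16,                         # scaling applied to the update
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attention projections to adapt
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # only the adapter weights remain trainable
```

QLoRA combines the same adapter idea with 4-bit quantization of the frozen base weights, which `peft` supports through `bitsandbytes`-quantized models loaded via `transformers`.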