Model / Methods | Title | Paper Link | Code Link | Published | Keywords | Venue |
---|---|---|---|---|---|---|
ALIGN | Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision | | | 2021 | | |
AltCLIP | AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities | | | 2022 | | |
BLIP | BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation | | salesforce | 2022 | | |
BLIP-2 | BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | | salesforce | 2023 | | |
BLIP-3 | | | | | | |
BridgeTower | BridgeTower: Building Bridges Between Encoders in Vision-Language Representation Learning | | microsoft | 2023 | | AAAI’23 |
BROS | BROS: A Pre-trained Language Model Focusing on Text and Layout for Better Key Information Extraction from Documents | | clovaai | 2021 | | |
Chameleon | Chameleon: Mixed-Modal Early-Fusion Foundation Models | | facebookresearch | 2024 | | |
Chinese-CLIP | Chinese CLIP: Contrastive Vision-Language Pretraining in Chinese | | OFA-Sys | 2023 | | |
CLIP | Learning Transferable Visual Models From Natural Language Supervision | | openai | 2021 | | |
CLIPSeg | Image Segmentation Using Text and Image Prompts | | | 2021 | | CVPR 2022 |
CLVP | Better speech synthesis through scaling | | | 2023 | | |
Data2Vec | data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language | | | 2022 | | |
DePlot | DePlot: One-shot visual language reasoning by plot-to-table translation | | | 2022 | | |
Donut | OCR-free Document Understanding Transformer | | clovaai | 2021 | | |
FLAVA | FLAVA: A Foundational Language And Vision Alignment Model | | facebookresearch | 2021 | | |
GIT | GIT: A Generative Image-to-text Transformer for Vision and Language | | microsoft | 2022 | | |
Grounding DINO | Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection | | IDEA-Research | 2023 | | |
GroupViT | GroupViT: Semantic Segmentation Emerges from Text Supervision | | NVlabs | 2022 | | |
IDEFICS | OBELICS: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents | | huggingface | 2023 | | |
IDEFICS-2 | What matters when building vision-language models? | | huggingface | 2024 | | |
InstructBLIP | InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning | | salesforce | 2023 | | |
InstructBlipVideo | InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning | | salesforce | 2023 | | |
KOSMOS-2 | Kosmos-2: Grounding Multimodal Large Language Models to the World | | microsoft | 2023 | | |
LayoutLM | LayoutLM: Pre-training of Text and Layout for Document Image Understanding | | microsoft | 2019 | | |
LayoutLMV2 | LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding | | microsoft | 2020 | | |
LayoutLMv3 | LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking | | microsoft | 2022 | | |
LayoutXLM | LayoutXLM: Multimodal Pre-training for Multilingual Visually-rich Document Understanding | | microsoft | 2021 | | |
LiLT | LiLT: A Simple yet Effective Language-Independent Layout Transformer for Structured Document Understanding | | | 2022 | | |
LLaVa | Visual Instruction Tuning | | | 2023 | | |
LLaVa-VL | Improved Baselines with Visual Instruction Tuning | | | 2024 | | |
LLaVA-NeXT | LLaVA-NeXT: Improved reasoning, OCR, and world knowledge | | | 2024 | | |
LLaVa-NeXT-Video | LLaVA-NeXT: A Strong Zero-shot Video Understanding Model | | | 2024 | | |
Video-LLaVA | Video-LLaVA: Learning United Visual Representation by Alignment Before Projection | | | 2023 | | |
LXMERT | LXMERT: Learning Cross-Modality Encoder Representations from Transformers | | | 2019 | | |
MatCha | MatCha: Enhancing Visual Language Pretraining with Math Reasoning and Chart Derendering | | | 2022 | | |
MGP-STR | Multi-Granularity Prediction for Scene Text Recognition | | AlibabaResearch | 2022 | | |
Mini-Omni | Mini-Omni: Language Models Can Hear, Talk While Thinking in Streaming | | gpt-omni | 2024 | | |
Nougat | Nougat: Neural Optical Understanding for Academic Documents | | facebookresearch | 2023 | | |
OneFormer | OneFormer: One Transformer to Rule Universal Image Segmentation | | SHI-Labs | 2022 | | |
OWL-ViT | Simple Open-Vocabulary Object Detection with Vision Transformers | | | 2022 | | |
OWLv2 | Scaling Open-Vocabulary Object Detection | | | 2023 | | |
PaliGemma | PaliGemma – Google’s Cutting-Edge Open Vision Language Model | | | 2024 | | |
Perceiver | Perceiver IO: A General Architecture for Structured Inputs & Outputs | | | 2021 | | |
Pix2Struct | Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding | | | 2022 | | |
SAM | Segment Anything | | meta | 2023 | | |
SAM v2 | SAM 2: Segment Anything in Images and Videos | | meta | 2024 | | |
SigLIP | Sigmoid Loss for Language Image Pre-Training | | | 2023 | | |
TAPAS | TAPAS: Weakly Supervised Table Parsing via Pre-training | | | 2020 | | |
TrOCR | TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models | | | 2021 | | |
TVLT | TVLT: Textless Vision-Language Transformer | | | 2022 | | |
TVP | Text-Visual Prompting for Efficient 2D Temporal Video Grounding | | Intel | 2023 | | |
UDOP | Unifying Vision, Text, and Layout for Universal Document Processing | | | 2022 | | |
ViLT | ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision | | | 2021 | | |
VipLlava | Making Large Multimodal Models Understand Arbitrary Visual Prompts | | | 2023 | | |
VisualBERT | VisualBERT: A Simple and Performant Baseline for Vision and Language | | | 2019 | | |
X-CLIP | Expanding Language-Image Pretrained Models for General Video Recognition | | | 2022 | | |
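Most of the entries above correspond to model classes in the Hugging Face transformers library and share its `from_pretrained` / processor interface. As a minimal sketch (assuming the `CLIPModel` and `CLIPProcessor` classes; the checkpoint name, sample image URL, and candidate labels are illustrative assumptions, not part of the table), zero-shot image classification with CLIP looks roughly like this:

```python
# Hedged sketch of zero-shot classification with CLIP via transformers.
# Checkpoint, image URL, and label prompts are illustrative placeholders.
import requests
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

checkpoint = "openai/clip-vit-base-patch32"  # assumed checkpoint name
model = CLIPModel.from_pretrained(checkpoint)
processor = CLIPProcessor.from_pretrained(checkpoint)

# Any RGB image works; here a sample COCO photo is fetched over HTTP.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Encode the candidate captions and the image together.
inputs = processor(
    text=["a photo of a cat", "a photo of a dog"],
    images=image,
    return_tensors="pt",
    padding=True,
)

with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image has shape (num_images, num_texts); softmax over the
# text axis gives image-text matching probabilities.
probs = outputs.logits_per_image.softmax(dim=-1)
print(probs)
```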
- Vision Encoder Decoder Models
- Language Models
- Vision Text Dual Encoder
- Speech Encoder Decoder Models
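The generic wrappers above compose pretrained encoders and decoders from the catalog into a single model rather than defining a new architecture. As a minimal sketch, assuming the transformers `VisionEncoderDecoderModel` and `TrOCRProcessor` classes and the `microsoft/trocr-base-handwritten` checkpoint (the image path is a placeholder), text recognition with TrOCR through the vision encoder-decoder interface looks roughly like this:

```python
# Hedged sketch: VisionEncoderDecoderModel pairs an image encoder with an
# autoregressive text decoder; TrOCR (listed in the table) is one instance.
# "line.png" is a placeholder path to a cropped single-line text image.
from PIL import Image
from transformers import TrOCRProcessor, VisionEncoderDecoderModel

checkpoint = "microsoft/trocr-base-handwritten"  # assumed checkpoint name
processor = TrOCRProcessor.from_pretrained(checkpoint)
model = VisionEncoderDecoderModel.from_pretrained(checkpoint)

image = Image.open("line.png").convert("RGB")

# The processor resizes and normalizes the image into encoder pixel values;
# generate() then decodes the transcription token by token.
pixel_values = processor(images=image, return_tensors="pt").pixel_values
generated_ids = model.generate(pixel_values)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```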
Reading List
- OpenAI, 2023, Introducing ChatGPT
- OpenAI, 2023, GPT-4 Technical Report
- Alayrac, et al., 2022, Flamingo: a Visual Language Model for Few-Shot Learning
- Li, et al., 2023, BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
- Zhu, et al., 2023, MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models
- Wu, et al., 2023, Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models
- Shen, et al., 2023, HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face
- Tang, et al., 2023, Any-to-Any Generation via Composable Diffusion
- Girdhar, et al., 2023, ImageBind: One Embedding Space To Bind Them All
- Wu, et al., 2023, NExT-GPT: Any-to-Any Multimodal LLM
- Moon, et al., 2023, AnyMAL: An Efficient and Scalable Any-Modality Augmented Language Model
- Hu, et al., 2023, Large Multilingual Models Pivot Zero-Shot Multimodal Learning across Languages
- Bai, et al., 2023, Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond
- Wang, et al., 2023, CogVLM: Visual Expert for Pretrained Language Models
- Peng, et al., 2023, Kosmos-2: Grounding Multimodal Large Language Models to the World
- Dong, et al., 2023, InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model
- Zhu, et al., 2023, LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment
- Ge, et al., 2023, Planting a SEED of Vision in Large Language Model
- Zhan, et al., 2024, AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling
- Kondratyuk, et al., 2023, VideoPoet: A Large Language Model for Zero-Shot Video Generation
- Zhang, et al., 2023, SpeechTokenizer: Unified Speech Tokenizer for Speech Large Language Models
- Zeghidour, et al., 2021, SoundStream: An End-to-End Neural Audio Codec
- Liu, et al., 2023, Improved Baselines with Visual Instruction Tuning
- Wu, et al., 2023, Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models
- Wang, et al., 2023, ModaVerse: Efficiently Transforming Modalities with LLMs
- Fei, et al., 2024, VITRON: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing
- Lu, et al., 2023, Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action
- Bai, et al., 2023, LVM: Sequential Modeling Enables Scalable Learning for Large Vision Models
- Huang, et al., 2023, Language Is Not All You Need: Aligning Perception with Language Models
- Li, et al., 2023, VideoChat: Chat-Centric Video Understanding
- Maaz, et al., 2023, Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models
- Zhang, et al., 2023, Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding
- Lin, et al., 2023, Video-LLaVA: Learning United Visual Representation by Alignment Before Projection
- Qian, et al., 2024, Momentor: Advancing Video Large Language Model with Fine-Grained Temporal Reasoning
- Hong, et al., 2023, 3D-LLM: Injecting the 3D World into Large Language Models
- Sun, et al., 2023, 3D-GPT: Procedural 3D Modeling with Large Language Models
- Chen, et al., 2023, LL3DA: Visual Interactive Instruction Tuning for Omni-3D Understanding, Reasoning, and Planning
- Xu, et al., 2023, PointLLM: Empowering Large Language Models to Understand Point Clouds
- Chen, et al., 2024, SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities
- Huang, et al., 2023, AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head
- Zhang, et al., 2023, SpeechGPT: Empowering Large Language Models with Intrinsic Cross-Modal Conversational Abilities
- Wang, et al., 2023, VioLA: Unified Codec Language Models for Speech Recognition, Synthesis, and Translation
- Rubenstein, et al., 2023, AudioPaLM: A Large Language Model That Can Speak and Listen
- Tang, et al., 2023, SALMONN: Towards Generic Hearing Abilities for Large Language Models
- Latif, et al., 2023, Sparks of Large Audio Models: A Survey and Outlook
- Luo, et al., 2022, BioGPT: Generative Pre-trained Transformer for Biomedical Text Generation and Mining
- Li, et al., 2023, DrugGPT: A GPT-based Strategy for Designing Potential Ligands Targeting Specific Proteins
- Chen, et al., 2023, MEDITRON-70B: Scaling Medical Pretraining for Large Language Models
- Wang, et al., 2023, HuaTuo: Tuning LLaMA Model with Chinese Medical Knowledge
- Zhang, et al., 2023, AlpaCare: Instruction-tuned Large Language Models for Medical Application
- Frey, et al., 2023, Neural Scaling of Deep Chemical Models
- Zhang, et al., 2023, ChemLLM: A Chemical Large Language Model
- Liu, et al., 2023, MolCA: Molecular Graph-Language Modeling with Cross-Modal Projector and Uni-Modal Adapter
- Jiang, et al., 2023, StructGPT: A General Framework for Large Language Model to Reason on Structured Data
- Chen, et al., 2024, LLaGA: Large Language and Graph Assistant
- Koh, et al., 2023, Generating Images with Multimodal Language Models
- Sun, et al., 2023, Generative Pretraining in Multimodality
- Zheng, et al., 2023, MiniGPT-5: Interleaved Vision-and-Language Generation via Generative Vokens
- Dong, et al., 2023, DreamLLM: Synergistic Multimodal Comprehension and Creation
- Liu, et al., 2023, LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents
- Wang, et al., 2023, GPT4Video: A Unified Multimodal Large Language Model for Instruction-Followed Understanding and Safety-Aware Generation
- Jin, et al., 2024, Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization
- Jin, et al., 2023, Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding
- Li, et al., 2023, LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models
- Su, et al., 2023, PandaGPT: One Model to Instruction-Follow Them All
- Lyu, et al., 2023, Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration
- Tang, et al., 2023, CoDi-2: In-Context, Interleaved, and Interactive Any-to-Any Generation
- Zhang, et al., 2023, GPT4RoI: Instruction Tuning Large Language Model on Region-of-Interest
- Yuan, et al., 2023, Osprey: Pixel Understanding with Visual Instruction Tuning
- Rasheed, et al., 2023, GLaMM: Pixel Grounding Large Multimodal Model
- Pi, et al., 2023, DetGPT: Detect What You Need via Reasoning
- Ren, et al., 2023, PixelLM: Pixel Reasoning with Large Multimodal Model
- Lai, et al., 2023, LISA: Reasoning Segmentation via Large Language Model
- Chen, et al., 2023, Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic
- Munasinghe, et al., 2023, PG-Video-LLaVA: Pixel Grounding in Large Multimodal Video Models
- Yu, et al., 2023, Merlin: Empowering Multimodal LLMs with Foresight Minds
- Fu, et al., 2023, MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models
- Xu, et al., 2023, LVLM-eHub: A Comprehensive Evaluation Benchmark for Large Vision-Language Models
- Ying, et al., 2024, MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI
- Pan, et al., 2024, Auto-Encoding Morph-Tokens for Multimodal LLM
- Thagard, et al., 1997, Abductive reasoning: Logic, visual thinking, and coherence
- Bavishi, et al., 2023, Fuyu-8B: A Multimodal Architecture for AI Agents
- Liu, et al., 2023, Visual Instruction Tuning
- Liu, et al., 2023, Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning
- Gao, et al., 2023, LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model
- Zhao, et al., 2023, SVIT: Scaling up Visual Instruction Tuning
- Ye, et al., 2023, mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality
- Yu, et al., 2023, RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback
- Liu, et al., 2023, MMC: Advancing Multimodal Chart Understanding with Large-scale Instruction Tuning
- Zhu, et al., 2023, MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models
- Li, et al., 2023, Evaluating Object Hallucination in Large Vision-Language Models
- Huang, et al., 2023, Visual Instruction Tuning towards General-Purpose Multimodal Model: A Survey
- Yin, et al., 2023, A Survey on Multimodal Large Language Models
- Yin, et al., 2023, Woodpecker: Hallucination Correction for Multimodal Large Language Models
- Zhang, et al., 2023, Multimodal Chain-of-Thought Reasoning in Language Models
- Zhao, et al., 2023, MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning
- Lu, et al., 2023, Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models
- Zhang, et al., 2023, You Only Look at Screens: Multimodal Chain-of-Action Agents
- Sun, et al., 2023, Generative Multimodal Models are In-Context Learners
- Fei, et al., 2024, VITRON: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing
- Wei, et al., 2023, Enhancing Human-like Multi-Modal Reasoning: A New Challenging Dataset and Comprehensive Framework
- Zhang, et al., 2023, Igniting Language Intelligence: The Hitchhiker's Guide From Chain-of-Thought Reasoning to Language Agents
- Fei, et al., 2024, Video-of-Thought: Step-by-Step Video Reasoning from Perception to Cognition
- Prystawski, et al., 2023, Why think step by step? Reasoning emerges from the locality of experience
- Gou, et al., 2023, CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing
- Tang, et al., 2024, Prioritizing Safeguarding Over Autonomy: Risks of LLM Agents for Science
- Yuan, et al., 2024, R-Judge: Benchmarking Safety Risk Awareness for LLM Agents
- Hu, et al., 2021, LoRA: Low-Rank Adaptation of Large Language Models
- Dettmers, et al., 2023, QLoRA: Efficient Finetuning of Quantized LLMs
- Li, et al., 2023, BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
- Luo, et al., 2023, Cheap and Quick: Efficient Vision-Language Instruction Tuning for Large Language Models
- Yao, et al., 2024, MiniCPM-V
- DeepSpeed Team, 2020, DeepSpeed Blog
- Zhao, et al., 2023, PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel
- Zhu, et al., 2023, MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models
- Chen, et al., 2023, MiniGPT-v2: Large Language Model as a Unified Interface for Vision-Language Multi-Task Learning
- Hong, et al., 2023, CogAgent: A Visual Language Model for GUI Agents
- Chen, et al., 2024, How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites
- Dehghani, et al., 2023, Patch n' Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution
- Zhang, et al., 2023, VPGTrans: Transfer Visual Prompt Generator across LLMs
- Wu, et al., 2023, NExT-GPT: Any-to-Any Multimodal LLM
- Fei, et al., 2024, VITRON: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing
- Zhang, et al., 2024, NExT-Chat: An LMM for Chat, Detection and Segmentation