
# MultiModal-Models

| Model / Methods | Title | Paper Link | Code Link | Published | Keywords | Venue |
| --- | --- | --- | --- | --- | --- | --- |
| ALIGN | Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision | Paper / Reading | Code | 2021 | | |
| AltCLIP | AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities | Paper / Reading | Code | 2022 | | |
| BLIP | BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation | Paper / Reading | Code | 2022 | salesforce | |
| BLIP-2 | BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | Paper / Reading | Code | 2023 | salesforce | |
| BLIP-3 | | | | | | |
| BridgeTower | BridgeTower: Building Bridges Between Encoders in Vision-Language Representation Learning | Paper / Reading | Code | 2023 | microsoft | AAAI’23 |
| BROS | BROS: A Pre-trained Language Model Focusing on Text and Layout for Better Key Information Extraction from Documents | Paper / Reading | Code | 2021 | clovaai | |
| Chameleon | Chameleon: Mixed-Modal Early-Fusion Foundation Models | Paper / Reading | Code | 2024 | facebookresearch | |
| Chinese-CLIP | Chinese CLIP: Contrastive Vision-Language Pretraining in Chinese | Paper / Reading | Code | 2023 | OFA-Sys | |
| CLIP | Learning Transferable Visual Models From Natural Language Supervision | Paper / Reading | Code | 2021 | openai | |
| CLIPSeg | Image Segmentation Using Text and Image Prompts | Paper / Reading | Code | 2021 | | CVPR 2022 |
| CLVP | Better speech synthesis through scaling | Paper / Reading | Code | 2023 | | |
| Data2Vec | data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language | Paper / Reading | Code | 2022 | | |
| DePlot | DePlot: One-shot visual language reasoning by plot-to-table translation | Paper / Reading | Code | 2022 | | |
| Donut | OCR-free Document Understanding Transformer | Paper / Reading | Code | 2021 | clovaai | |
| FLAVA | FLAVA: A Foundational Language And Vision Alignment Model | Paper / Reading | Code | 2021 | facebookresearch | |
| GIT | GIT: A Generative Image-to-text Transformer for Vision and Language | Paper / Reading | Code | 2022 | microsoft | |
| Grounding DINO | Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection | Paper / Reading | Code | 2023 | IDEA-Research | |
| GroupViT | GroupViT: Semantic Segmentation Emerges from Text Supervision | Paper / Reading | Code | 2022 | NVlabs | |
| IDEFICS | OBELICS: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents | Paper / Reading | Code | 2023 | HuggingFaceM4 | |
| IDEFICS-2 | What matters when building vision-language models? | Paper / Reading | Code | 2024 | HuggingFaceM4 | |
| InstructBLIP | InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning | Paper / Reading | Code | 2023 | salesforce | |
| InstructBlipVideo | InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning | Paper / Reading | Code | 2023 | salesforce | |
| KOSMOS-2 | Kosmos-2: Grounding Multimodal Large Language Models to the World | Paper / Reading | Code | 2023 | microsoft | |
| LayoutLM | LayoutLM: Pre-training of Text and Layout for Document Image Understanding | Paper / Reading | Code | 2019 | microsoft | |
| LayoutLMv2 | LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding | Paper | Code | 2020 | microsoft | |
| LayoutLMv3 | LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking | Paper | Code | 2022 | microsoft | |
| LayoutXLM | LayoutXLM: Multimodal Pre-training for Multilingual Visually-rich Document Understanding | Paper | Code | 2021 | microsoft | |
| LiLT | LiLT: A Simple yet Effective Language-Independent Layout Transformer for Structured Document Understanding | Paper / Reading | Code | 2022 | | |
| LLaVa | Visual Instruction Tuning | Paper / Reading | Code | 2023 | | |
| LLaVa-VL | Improved Baselines with Visual Instruction Tuning | Paper | Code | 2024 | | |
| LLaVA-NeXT | LLaVA-NeXT: Improved reasoning, OCR, and world knowledge | Paper | Code | 2024 | | |
| LLaVa-NeXT-Video | LLaVA-NeXT: A Strong Zero-shot Video Understanding Model | Paper | Code | 2024 | | |
| Video-LLaVA | Video-LLaVA: Learning United Visual Representation by Alignment Before Projection | Paper | Code | 2023 | | |
| LXMERT | LXMERT: Learning Cross-Modality Encoder Representations from Transformers | Paper | Code | 2019 | | |
| MatCha | MatCha: Enhancing Visual Language Pretraining with Math Reasoning and Chart Derendering | Paper | Code | 2022 | google | |
| MGP-STR | Multi-Granularity Prediction for Scene Text Recognition | Paper | Code | 2022 | AlibabaResearch | |
| Mini-Omni | Mini-Omni: Language Models Can Hear, Talk While Thinking in Streaming | Paper / Reading | Code | 2024 | gpt-omni | |
| Nougat | Nougat: Neural Optical Understanding for Academic Documents | Paper | Code | 2023 | facebookresearch | |
| OneFormer | OneFormer: One Transformer to Rule Universal Image Segmentation | Paper | Code | 2022 | SHI-Labs | |
| OWL-ViT | Simple Open-Vocabulary Object Detection with Vision Transformers | Paper | Code | 2022 | google | |
| OWLv2 | Scaling Open-Vocabulary Object Detection | Paper | Code | 2023 | google | |
| PaliGemma | PaliGemma – Google’s Cutting-Edge Open Vision Language Model | Paper | Code | 2024 | | |
| Perceiver | Perceiver IO: A General Architecture for Structured Inputs & Outputs | Paper | Code | 2021 | | |
| Pix2Struct | Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding | Paper | Code | 2022 | | |
| SAM | Segment Anything | Paper | Code | 2023 | meta | |
| SAM v2 | SAM 2: Segment Anything in Images and Videos | Paper | Code | 2024 | meta | |
| SigLIP | Sigmoid Loss for Language Image Pre-Training | Paper | Code | 2023 | | |
| TAPAS | TAPAS: Weakly Supervised Table Parsing via Pre-training | Paper | Code | 2020 | | |
| TrOCR | TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models | Paper | Code | 2021 | | |
| TVLT | TVLT: Textless Vision-Language Transformer | Paper | Code | 2022 | | |
| TVP | Text-Visual Prompting for Efficient 2D Temporal Video Grounding | Paper | Code | 2023 | Intel | |
| UDOP | Unifying Vision, Text, and Layout for Universal Document Processing | Paper | Code | 2022 | | |
| ViLT | ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision | Paper | Code | 2021 | | |
| VipLlava | Making Large Multimodal Models Understand Arbitrary Visual Prompts | Paper | Code | 2023 | | |
| VisualBERT | VisualBERT: A Simple and Performant Baseline for Vision and Language | Paper | Code | 2019 | | |
| X-CLIP | Expanding Language-Image Pretrained Models for General Video Recognition | Paper | Code | 2022 | | |
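
For readers who want to try one of these models directly, the snippet below is a minimal zero-shot image-classification sketch with CLIP. It assumes the Hugging Face `transformers` implementation, the `openai/clip-vit-base-patch32` checkpoint, and a local image file `example.jpg`; a similar processor-plus-model pattern applies to many of the other `transformers`-hosted entries in the table.

```python
# Minimal zero-shot image classification with CLIP via Hugging Face transformers.
# Checkpoint name and image path are illustrative assumptions.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model_id = "openai/clip-vit-base-patch32"
model = CLIPModel.from_pretrained(model_id)
processor = CLIPProcessor.from_pretrained(model_id)

image = Image.open("example.jpg")  # any local RGB image
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds image-text similarity scores; softmax turns them into
# a probability distribution over the candidate labels.
probs = outputs.logits_per_image.softmax(dim=-1)[0]
print({label: round(p, 3) for label, p in zip(labels, probs.tolist())})
```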

## Reading List


### Section 1: LLMs and MLLMs

  1. OpenAI, 2023, Introducing ChatGPT

  2. OpenAI, 2023, GPT-4 Technical Report

  3. Alayrac, et al., 2022, Flamingo: a Visual Language Model for Few-Shot Learning

  4. Li, et al., 2023, BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

  5. Zhu, et al., 2023, MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models

  6. Wu, et al., 2023, Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models

  7. Shen, et al., 2023, HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face

  8. Tang, et al., 2023, Any-to-Any Generation via Composable Diffusion

  9. Girdhar, et al., 2023, ImageBind: One Embedding Space To Bind Them All

  10. Wu, et al., 2023, NExT-GPT: Any-to-Any Multimodal LLM

  11. Moon, et al., 2023, AnyMAL: An Efficient and Scalable Any-Modality Augmented Language Model

  12. Hu, et al., 2023, Large Multilingual Models Pivot Zero-Shot Multimodal Learning across Languages

  13. Bai, et al., 2023, Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

  14. Wang, et al., 2023, CogVLM: Visual Expert for Pretrained Language Models

  15. Peng, et al., 2023, Kosmos-2: Grounding Multimodal Large Language Models to the World

  16. Dong, et al., 2023, InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model

  17. Zhu, et al., 2023, LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment

  18. Ge, et al., 2023, Planting a SEED of Vision in Large Language Model

  19. Zhan, et al., 2024, AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling

  20. Kondratyuk, et al., 2023, VideoPoet: A Large Language Model for Zero-Shot Video Generation

  21. Zhang, et al., 2023, SpeechTokenizer: Unified Speech Tokenizer for Speech Large Language Models

  22. Zeghidour, et al., 2021, SoundStream: An End-to-End Neural Audio Codec

  23. Liu, et al., 2023, Improved Baselines with Visual Instruction Tuning

  24. Wu, et al., 2023, Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models

  25. Wang, et al., 2023, ModaVerse: Efficiently Transforming Modalities with LLMs

  26. Fei, et al., 2024, VITRON: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing

  27. Lu, et al., 2023, Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action

  28. Bai, et al., 2023, LVM: Sequential Modeling Enables Scalable Learning for Large Vision Models

  29. Huang, et al., 2023, Language Is Not All You Need: Aligning Perception with Language Models

  30. Li, et al., 2023, VideoChat: Chat-Centric Video Understanding

  31. Maaz, et al., 2023, Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models

  32. Zhang, et al., 2023, Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding

  33. Lin, et al., 2023, Video-LLaVA: Learning United Visual Representation by Alignment Before Projection

  34. Qian, et al., 2024, Momentor: Advancing Video Large Language Model with Fine-Grained Temporal Reasoning

  35. Hong, et al., 2023, 3D-LLM: Injecting the 3D World into Large Language Models

  36. Sun, et al., 2023, 3D-GPT: Procedural 3D Modeling with Large Language Models

  37. Chen, et al., 2023, LL3DA: Visual Interactive Instruction Tuning for Omni-3D Understanding, Reasoning, and Planning

  38. Xu, et al., 2023, PointLLM: Empowering Large Language Models to Understand Point Clouds

  39. Chen, et al., 2024, SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities

  40. Huang, et al., 2023, AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head

  41. Zhang, et al., 2023, SpeechGPT: Empowering Large Language Models with Intrinsic Cross-Modal Conversational Abilities

  42. Wang, et al., 2023, VioLA: Unified Codec Language Models for Speech Recognition, Synthesis, and Translation

  43. Rubenstein, et al., 2023, AudioPaLM: A Large Language Model That Can Speak and Listen

  44. Tang, et al., 2023, SALMONN: Towards Generic Hearing Abilities for Large Language Models

  45. Latif, et al., 2023, Sparks of Large Audio Models: A Survey and Outlook

  46. Luo, et al., 2022, BioGPT: Generative Pre-trained Transformer for Biomedical Text Generation and Mining

  47. Li, et al., 2023, DrugGPT: A GPT-based Strategy for Designing Potential Ligands Targeting Specific Proteins

  48. Chen, et al., 2023, MEDITRON-70B: Scaling Medical Pretraining for Large Language Models

  49. Wang, et al., 2023, HuaTuo: Tuning LLaMA Model with Chinese Medical Knowledge

  50. Zhang, et al., 2023, AlpaCare: Instruction-tuned Large Language Models for Medical Application

  51. Frey, et al., 2023, Neural Scaling of Deep Chemical Models

  52. Zhang, et al., 2023, ChemLLM: A Chemical Large Language Model

  53. Liu, et al., 2023, MolCA: Molecular Graph-Language Modeling with Cross-Modal Projector and Uni-Modal Adapter

  54. Jiang, et al., 2023, StructGPT: A General Framework for Large Language Model to Reason on Structured Data

  55. Chen, et al., 2024, LLaGA: Large Language and Graph Assistant

  56. Koh, et al., 2023, Generating Images with Multimodal Language Models

  57. Sun, et al., 2023, Generative Pretraining in Multimodality

  58. Zheng, et al., 2023, MiniGPT-5: Interleaved Vision-and-Language Generation via Generative Vokens

  59. Dong, et al., 2023, DreamLLM: Synergistic Multimodal Comprehension and Creation

  60. Liu, et al., 2023, LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents

  61. Wang, et al., 2023, GPT4Video: A Unified Multimodal Large Language Model for Instruction-Followed Understanding and Safety-Aware Generation

  62. Jin, et al., 2024, Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization

  63. Jin, et al., 2023, Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding

  64. Li, et al., 2023, LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models

  65. Su, et al., 2023, PandaGPT: One Model to Instruction-Follow Them All

  66. Lyu, et al., 2023, Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration

  67. Tang, et al., 2023, CoDi-2: In-Context, Interleaved, and Interactive Any-to-Any Generation

  68. Zhang, et al., 2023, GPT4RoI: Instruction Tuning Large Language Model on Region-of-Interest

  69. Yuan, et al., 2023, Osprey: Pixel Understanding with Visual Instruction Tuning

  70. Rasheed, et al., 2023, GLaMM: Pixel Grounding Large Multimodal Model

  71. Pi, et al., 2023, DetGPT: Detect What You Need via Reasoning

  72. Ren, et al., 2023, PixelLM: Pixel Reasoning with Large Multimodal Model

  73. Lai, et al., 2023, LISA: Reasoning Segmentation via Large Language Model

  74. Chen, et al., 2023, Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic

  75. Munasinghe, et al., 2023, PG-Video-LLaVA: Pixel Grounding in Large Multimodal Video Models

  76. Yu, et al., 2023, Merlin: Empowering Multimodal LLMs with Foresight Minds

  77. Fu, et al., 2023, MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

  78. Xu, et al., 2023, LVLM-eHub: A Comprehensive Evaluation Benchmark for Large Vision-Language Models

  79. Ying, et al., 2024, MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI

  80. Pan, et al., 2024, Auto-Encoding Morph-Tokens for Multimodal LLM

  81. Thagard, et al., 1997, Abductive reasoning: Logic, visual thinking, and coherence

  82. Bavishi, et al., 2023, Fuyu-8B: A Multimodal Architecture for AI Agents

### Section 2: Instruction Tuning

  1. Liu, et al., 2023, Visual Instruction Tuning (see the data-format sketch after this list)

  2. Liu, et al., 2023, Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning

  3. Gao, et al., 2023, LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model

  4. Zhao, et al., 2023, SVIT: Scaling up Visual Instruction Tuning

  5. Ye, et al., 2023, mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality

  6. Yu, et al., 2023, RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback

  7. Liu, et al., 2023, MMC: Advancing Multimodal Chart Understanding with Large-scale Instruction Tuning

  8. Zhu, et al., 2023, MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models

  9. Liu, et al., 2023, HallusionBench: You See What You Think? Or You Think What You See? An Image-Context Reasoning Benchmark Challenging for GPT-4V(ision), LLaVA-1.5, and Other Multi-modality Models

  10. Li, et al., 2023, Evaluating Object Hallucination in Large Vision-Language Models

  11. Huang, et al., 2023, Visual Instruction Tuning towards General-Purpose Multimodal Model: A Survey

  12. Yin, et al., 2023, A Survey on Multimodal Large Language Models

  13. Yin, et al., 2023, Woodpecker: Hallucination Correction for Multimodal Large Language Models
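
To make the instruction-tuning papers above concrete, here is a schematic example of the conversation-style training record popularized by LLaVA-style visual instruction tuning (item 1). The field names, image path, and answer text are illustrative assumptions; individual projects define their own schemas.

```python
# Schematic LLaVA-style visual instruction-tuning record. Field names, the image
# path, and the answer text are illustrative; real datasets use their own schema.
import json

record = {
    "id": "000000033471",
    "image": "coco/train2017/000000033471.jpg",  # path into the image corpus
    "conversations": [
        {"from": "human", "value": "<image>\nWhat is unusual about this image?"},
        {"from": "gpt", "value": "A man is ironing clothes on a board attached to the back of a moving taxi."},
    ],
}

print(json.dumps(record, indent=2))
```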

### Section 3: Reasoning with LLMs

  1. Zhang, et al., 2023, Multimodal Chain-of-Thought Reasoning in Language Models (see the two-stage sketch after this list)

  2. Zhao, et al., 2023, MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning

  3. Lu, et al., 2023, Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models

  4. Zhang, et al., 2023, You Only Look at Screens: Multimodal Chain-of-Action Agents

  5. Sun, et al., 2023, Generative multimodal models are in-context learners

  6. Fei, et al., 2024, VITRON: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing

  7. Wei, et al., 2023, Enhancing Human-like Multi-Modal Reasoning: A New Challenging Dataset and Comprehensive Framework

  8. Zhang, et al., 2023, Igniting Language Intelligence: The Hitchhiker's Guide From Chain-of-Thought Reasoning to Language Agents

  9. Fei, et al., 2024, Video-of-Thought: Step-by-Step Video Reasoning from Perception to Cognition

  10. Prystawski, et al., 2023, Why think step by step? Reasoning emerges from the locality of experience

  11. Gou, et al., 2023, CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing

  12. Tang, et al., 2024, Prioritizing Safeguarding Over Autonomy: Risks of LLM Agents for Science

  13. Yuan, et al., 2024, R-Judge: Benchmarking Safety Risk Awareness for LLM Agents
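
Several entries above, most directly Zhang et al.'s Multimodal Chain-of-Thought (item 1), separate rationale generation from answer inference. The sketch below illustrates that two-stage pattern; `vlm_generate` is a hypothetical helper standing in for whichever vision-language model is used, so treat this as an illustration rather than any paper's actual implementation.

```python
# Two-stage multimodal chain-of-thought sketch. `vlm_generate(image, prompt)` is a
# hypothetical helper standing in for any vision-language model's generation call.
from typing import Any, Callable

def multimodal_cot(image: Any, question: str,
                   vlm_generate: Callable[[Any, str], str]) -> str:
    # Stage 1: produce a free-form rationale grounded in the image.
    rationale = vlm_generate(
        image,
        f"Question: {question}\nDescribe the relevant visual evidence step by step.",
    )
    # Stage 2: answer the question conditioned on the question plus the rationale.
    answer = vlm_generate(
        image,
        f"Question: {question}\nRationale: {rationale}\nTherefore, the answer is:",
    )
    return answer
```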

### Section 4: Efficient Learning

  1. Hu, et al., 2021, LoRA: Low-Rank Adaptation of Large Language Models (see the adapter sketch after this list)

  2. Dettmers, et al., 2023, QLoRA: Efficient Finetuning of Quantized LLMs

  3. Li, et al., 2023, BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

  4. Luo, et al., 2023, Cheap and Quick: Efficient Vision-Language Instruction Tuning for Large Language Models

  5. Yao, et al., 2024, MiniCPM-V: A GPT-4V Level MLLM on Your Phone

  6. DeepSpeed Team, 2020, DeepSpeed Blog

  7. Zhao, et al., 2023, PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel

  8. Zhu, et al., 2023, MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models

  9. Chen, et al., 2023, MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning

  10. Hong, et al., 2023, CogAgent: A Visual Language Model for GUI Agents

  11. Chen, et al., 2024, How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites

  12. Dehghani, et al., 2023, Patch n' Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution

  13. Zhang, et al., 2023, VPGTrans: Transfer Visual Prompt Generator across LLMs

  14. Wu, et al., 2023, NExT-GPT: Any-to-Any Multimodal LLM

  15. Fei, et al., 2024, VITRON: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing

  16. Zhang, et al., 2024, NExT-Chat: An LMM for Chat, Detection and Segmentation
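
As a companion to the LoRA and QLoRA entries at the top of this section (items 1 and 2), the snippet below is a minimal sketch of attaching low-rank adapters to a causal language model with the Hugging Face `peft` library. The base checkpoint, target modules, and hyperparameters are illustrative assumptions, not recommendations.

```python
# Minimal LoRA sketch with Hugging Face peft. The base checkpoint, target modules,
# and hyperparameters below are illustrative assumptions, not recommendations.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("facebook/opt-350m")

lora_cfg = LoraConfig(
    r=8,                                   # rank of the low-rank update matrices
    lora_alpha=16,                         # scaling applied to the update
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attention projections to adapt
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # only the adapter weights remain trainable
```

QLoRA combines the same adapter idea with 4-bit quantization of the frozen base weights, which `peft` supports through `bitsandbytes`-quantized models loaded via `transformers`.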