Awesome-AI-Papers

This repository collects papers and code in the field of AI. The contents are organized as follows:

Table of Contents

  ├─ NLP/  
  │  ├─ Word2Vec/  
  │  ├─ Seq2Seq/           
  │  └─ Pretraining/  
  │    ├─ Large Language Model/          
  │    ├─ LLM Application/ 
  │      ├─ AI Agent/          
  │      ├─ Academic/          
  │      ├─ Code/       
  │      ├─ Financial Application/
  │      ├─ Information Retrieval/  
  │      ├─ Math/     
  │      ├─ Medicine and Law/   
  │      ├─ Recommender System/      
  │      └─ Tool Learning/             
  │    ├─ LLM Technique/ 
  │      ├─ Alignment/          
  │      ├─ Context Length/          
  │      ├─ Corpus/       
  │      ├─ Evaluation/
  │      ├─ Hallucination/  
  │      ├─ Inference/     
  │      ├─ MoE/   
  │      ├─ PEFT/     
  │      ├─ Prompt Learning/   
  │      ├─ RAG/       
  │      └─ Reasoning and Planning/       
  │    ├─ LLM Theory/       
  │    └─ Chinese Model/             
  ├─ CV/  
  │  ├─ CV Application/          
  │  ├─ Contrastive Learning/         
  │  ├─ Foundation Model/ 
  │  ├─ Generative Model (GAN and VAE)/          
  │  ├─ Image Editing/          
  │  ├─ Object Detection/          
  │  ├─ Semantic Segmentation/            
  │  └─ Video/          
  ├─ Multimodal/       
  │  ├─ Audio/          
  │  ├─ BLIP/         
  │  ├─ CLIP/        
  │  ├─ Diffusion Model/   
  │  ├─ Multimodal LLM/          
  │  ├─ Text2Image/          
  │  ├─ Text2Video/            
  │  └─ Survey/           
  ├─ Reinforcement Learning/ 
  ├─ GNN/ 
  └─ Transformer Architecture/          

NLP

1. Word2Vec

  • Efficient Estimation of Word Representations in Vector Space, Mikolov et al., arxiv 2013. [paper]
  • Distributed Representations of Words and Phrases and their Compositionality, Mikolov et al., NIPS 2013. [paper]
  • Distributed representations of sentences and documents, Le and Mikolov, ICML 2014. [paper]
  • Word2vec Explained: deriving Mikolov et al.'s negative-sampling word-embedding method, Goldberg and Levy, arxiv 2014. [paper]
  • word2vec Parameter Learning Explained, Rong, arxiv 2014. [paper]
  • GloVe: Global Vectors for Word Representation, Pennington et al., EMNLP 2014. [paper][code]
  • fastText: Bag of Tricks for Efficient Text Classification, Joulin et al., arxiv 2016. [paper][code]
  • ELMo: Deep Contextualized Word Representations, Peters et al., NAACL 2018. [paper]
  • BPE: Neural Machine Translation of Rare Words with Subword Units, Sennrich et al., ACL 2016. [paper][code]
  • Byte-Level BPE: Neural Machine Translation with Byte-Level Subwords, Wang et al., arxiv 2019. [paper][code]
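
The negative-sampling objective analyzed in Goldberg and Levy's note above reduces to a binary logistic loss over (center, context) pairs. A minimal NumPy sketch with toy sizes; for brevity it samples negatives uniformly, whereas word2vec samples from a smoothed unigram distribution:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, dim, k = 1000, 64, 5                # toy vocabulary size, embedding dim, negatives per pair
W_in = rng.normal(0, 0.1, (vocab, dim))    # center-word embeddings
W_out = rng.normal(0, 0.1, (vocab, dim))   # context-word embeddings

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_step(center, context, lr=0.025):
    """One SGD step of skip-gram with negative sampling for a single word pair."""
    negatives = rng.integers(0, vocab, size=k)    # uniform here; unigram^0.75 in the paper
    v = W_in[center].copy()
    grad_v = np.zeros(dim)
    for word, label in [(context, 1.0)] + [(int(n), 0.0) for n in negatives]:
        u = W_out[word]
        g = sigmoid(v @ u) - label                # d(loss)/d(score) of the logistic loss
        grad_v += g * u
        W_out[word] -= lr * g * v
    W_in[center] -= lr * grad_v

sgns_step(center=3, context=17)                   # one (center, context) pair from a corpus window
```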

2. Seq2Seq

  • Generating Sequences With Recurrent Neural Networks, Graves, arxiv 2013. [paper]
  • Sequence to Sequence Learning with Neural Networks, Sutskever et al., NeurIPS 2014. [paper]
  • Neural Machine Translation by Jointly Learning to Align and Translate, Bahdanau et al., ICLR 2015. [paper][code]
  • On the Properties of Neural Machine Translation: Encoder-Decoder Approaches, Cho et al., arxiv 2014. [paper]
  • Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation, Cho et al., arxiv 2014. [paper]
  • [fairseq][fairseq2][pytorch-seq2seq]
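
Bahdanau et al.'s contribution above is additive attention: the decoder scores every encoder state and conditions on their weighted sum. A minimal PyTorch sketch of the scoring rule, with unbatched shapes for clarity:

```python
import torch
import torch.nn.functional as F

def additive_attention(s, H, W_q, W_k, v):
    """Bahdanau-style additive attention: score(s, h_t) = v^T tanh(W_q s + W_k h_t).

    s: (dim,) decoder state; H: (T, dim) encoder states; W_q, W_k: (dim, dim); v: (dim,).
    """
    scores = torch.tanh(s @ W_q + H @ W_k) @ v    # (T,) one score per source position
    weights = F.softmax(scores, dim=0)            # attention distribution over the source
    context = weights @ H                         # (dim,) weighted sum fed to the decoder
    return context, weights

dim, T = 8, 5
g = torch.Generator().manual_seed(0)
s, H = torch.randn(dim, generator=g), torch.randn(T, dim, generator=g)
W_q, W_k = torch.randn(dim, dim, generator=g), torch.randn(dim, dim, generator=g)
v = torch.randn(dim, generator=g)
context, weights = additive_attention(s, H, W_q, W_k, v)
```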

3. Pretraining

3.1 Large Language Model

3.2 LLM Application

3.2.1 AI Agent
3.2.2 Academic
3.2.3 Code
3.2.4 Financial Application
  • DocLLM: A layout-aware generative language model for multimodal document understanding, Wang et al., arxiv 2024. [paper]

  • DocGraphLM: Documental Graph Language Model for Information Extraction, Wang et al., arxiv 2023. [paper]

  • FinBERT: A Pretrained Language Model for Financial Communications, Yang et al., arxiv 2020. [paper][Wiley paper][code][finBERT][valuesimplex/FinBERT]

  • FinGPT: Open-Source Financial Large Language Models, Yang et al., IJCAI 2023. [paper][code]

  • FinRobot: An Open-Source AI Agent Platform for Financial Applications using Large Language Models, Yang et al., arxiv 2024. [paper][code]

  • FinGPT: Instruction Tuning Benchmark for Open-Source Large Language Models in Financial Datasets, Wang et al., arxiv 2023. [paper][code]

  • Instruct-FinGPT: Financial Sentiment Analysis by Instruction Tuning of General-Purpose Large Language Models, Zhang et al., arxiv 2023. [paper][code]

  • FinRL: A Deep Reinforcement Learning Library for Automated Stock Trading in Quantitative Finance, Liu et al., arxiv 2020. [paper][code]

  • FinRL-Meta: Market Environments and Benchmarks for Data-Driven Financial Reinforcement Learning, Liu et al., NeurIPS 2022. [paper][code]

  • DISC-FinLLM: A Chinese Financial Large Language Model based on Multiple Experts Fine-tuning, Chen et al., arxiv 2023. [paper][code]

  • A Multimodal Foundation Agent for Financial Trading: Tool-Augmented, Diversified, and Generalist, Zhang et al., arxiv 2024. [paper]

  • XuanYuan 2.0: A Large Chinese Financial Chat Model with Hundreds of Billions Parameters, Zhang et al., arxiv 2023. [paper][code]

  • Open-FinLLMs: Open Multimodal Large Language Models for Financial Applications, Xie et al., arxiv 2024. [paper][code]

  • StructGPT: A General Framework for Large Language Model to Reason over Structured Data, Jiang et al., arxiv 2023. [paper][code]

  • Large Language Model for Table Processing: A Survey, Lu et al., arxiv 2024. [paper][llm-table-survey][table-transformer][Awesome-Tabular-LLMs][Awesome-LLM-Tabular][Table-LLaVA][tablegpt-agent]

  • rLLM: Relational Table Learning with LLMs, Li et al., arxiv 2024. [paper][code]

  • Data-Copilot: Bridging Billions of Data and Humans with Autonomous Workflow, Zhang et al., arxiv 2023. [paper][code]

  • Data Interpreter: An LLM Agent For Data Science, Hong et al., arxiv 2024. [paper][code]

  • AlphaFin: Benchmarking Financial Analysis with Retrieval-Augmented Stock-Chain Framework, Li et al., COLING 2024. [paper][code]

  • LLMFactor: Extracting Profitable Factors through Prompts for Explainable Stock Movement Prediction, Wang et al., arxiv 2024. [paper][MIGA]

  • A Survey of Large Language Models in Finance (FinLLMs), Lee et al., arxiv 2024. [paper][code][Revolutionizing Finance with LLMs: An Overview of Applications and Insights]

  • A Survey of Large Language Models for Financial Applications: Progress, Prospects and Challenges, Nie et al., arxiv 2024. [paper][financial-datasets][LLMs-in-Finance]

  • PEER: Expertizing Domain-Specific Tasks with a Multi-Agent Framework and Tuning Methods, Wang et al., arxiv 2024. [paper][code][Stockagent]

  • Benchmarking Large Language Models on CFLUE -- A Chinese Financial Language Understanding Evaluation Dataset, Zhu et al., ACL 2024. [paper][code][Golden-Touchstone][financebench][OmniEval]

  • [gpt-investor][FinGLM][agentUniverse][gs-quant][stockbot-on-groq][Real-Time-Stock-Market-Prediction-using-Ensemble-DL-and-Rainbow-DQN][openbb-agents][ai-hedge-fund]

3.2.5 Information Retrieval
3.2.6 Math
3.2.7 Medicine and Law
3.2.8 Recommender System
3.2.9 Tool Learning
  • Tool Learning with Foundation Models, Qin et al., arxiv 2023. [paper][code]

  • Tool Learning with Large Language Models: A Survey, Qu et al., arxiv 2024. [paper][code]

  • Toolformer: Language Models Can Teach Themselves to Use Tools, Schick et al., arxiv 2023. [paper][toolformer-pytorch][conceptofmind/toolformer][xrsrke/toolformer][Graph_Toolformer] (a generic tool-call loop is sketched after this list)

  • ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs, Qin et al., ICLR 2024 Spotlight. [paper][code][StableToolBench]

  • Gorilla: Large Language Model Connected with Massive APIs, Patil et al., arxiv 2023. [paper][code]

  • HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face, Shen et al., NeurIPS 2023. [paper][code]

  • GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction, Yang et al., arxiv 2023. [paper][code]

  • RestGPT: Connecting Large Language Models with Real-World RESTful APIs, Song et al., arxiv 2023. [paper][code]

  • LLMCompiler: An LLM Compiler for Parallel Function Calling, Kim et al., ICML 2024. [paper][code]

  • Large Language Models as Tool Makers, Cai et al., arxiv 2023. [paper][code]

  • ToolAlpaca: Generalized Tool Learning for Language Models with 3000 Simulated Cases, Tang et al., arxiv 2023. [paper][code][ToolQA][toolbench]

  • ToolChain*: Efficient Action Space Navigation in Large Language Models with A* Search, Zhuang et al., arxiv 2023. [paper][code]

  • Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models, Lu et al., NeurIPS 2023. [paper][code]

  • ToolEyes: Fine-Grained Evaluation for Tool Learning Capabilities of Large Language Models in Real-world Scenarios, Ye et al., arxiv 2024. [paper][code]

  • AnyTool: Self-Reflective, Hierarchical Agents for Large-Scale API Calls, Du et al., arxiv 2024. [paper][code]

  • LLMs in the Imaginarium: Tool Learning through Simulated Trial and Error, Wang et al., arxiv 2024. [paper][code]

  • What Are Tools Anyway? A Survey from the Language Model Perspective, Wang et al., arxiv 2024. [paper]

  • ToolSandbox: A Stateful, Conversational, Interactive Evaluation Benchmark for LLM Tool Use Capabilities, Lu et al., arxiv 2024. [paper][code][API-Bank]

  • Re-Invoke: Tool Invocation Rewriting for Zero-Shot Tool Retrieval, Chen et al., arxiv 2024. [paper]

  • ToolACE: Winning the Points of LLM Function Calling, Liu et al., arxiv 2024. [paper][ToolGen]

  • Hammer: Robust Function-Calling for On-Device Language Models via Function Masking, Lin et al., arxiv 2024. [paper][code]

  • [functionary][ToolLearningPapers][awesome-tool-llm]
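
Most systems above (Toolformer, ToolLLM, Gorilla, RestGPT) share one control flow: the model emits a structured tool call, the runtime executes it, and the result is appended to the context until the model answers directly. A minimal sketch of that loop; the JSON protocol, tool registry, and `llm` callable are illustrative assumptions, not any one paper's API:

```python
import json

# Hypothetical two-tool registry; real systems retrieve from thousands of APIs.
TOOLS = {
    "calculator": lambda expr: str(eval(expr, {"__builtins__": {}})),  # demo only, not safe
    "lookup": lambda key: {"pi": "3.14159"}.get(key, "unknown"),
}

def run_agent(llm, question, max_steps=4):
    """Generic tool-use loop; `llm` is any callable prompt -> text."""
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        reply = llm(transcript)
        try:
            call = json.loads(reply)              # expected: {"tool": ..., "input": ...}
        except json.JSONDecodeError:
            return reply                          # plain text is treated as the final answer
        result = TOOLS[call["tool"]](call["input"])
        transcript += f"Call: {reply}\nResult: {result}\n"
    return "max tool steps exceeded"

# Scripted stand-in for a model: one tool call, then a final answer.
scripted = iter(['{"tool": "calculator", "input": "6*7"}', "The answer is 42."])
print(run_agent(lambda prompt: next(scripted), "What is 6*7?"))
```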

3.3 LLM Technique

3.3.1 Alignment
3.3.2 Context Length
  • ALiBi: Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation, Press et al., ICLR 2022. [paper][code]
  • Positional Interpolation: Extending Context Window of Large Language Models via Positional Interpolation, Chen et al., arxiv 2023. [paper] (an interpolation sketch follows this list)
  • Scaling Transformer to 1M tokens and beyond with RMT, Bulatov et al., AAAI 2024. [paper][code][LM-RMT]
  • RecurrentGPT: Interactive Generation of (Arbitrarily) Long Text, Zhou et al., arxiv 2023. [paper][code]
  • LongNet: Scaling Transformers to 1,000,000,000 Tokens, Ding et al., arxiv 2023. [paper][code][unofficial code]
  • Focused Transformer: Contrastive Training for Context Scaling, Tworkowski et al., NeurIPS 2023. [paper][code]
  • LongLoRA: Efficient Fine-tuning of Long-Context Large Language Models, Chen et al., ICLR 2024 Oral. [paper][code]
  • StreamingLLM: Efficient Streaming Language Models with Attention Sinks, Xiao et al., ICLR 2024. [paper][code][SwiftInfer][SwiftInfer blog]
  • YaRN: Efficient Context Window Extension of Large Language Models, Peng et al., ICLR 2024. [paper][code][LM-Infinite]
  • Ring Attention with Blockwise Transformers for Near-Infinite Context, Liu et al., ICLR 2024. [paper][code][ring-attention-pytorch][local-attention][tree_attention]
  • LongLLMLingua: Accelerating and Enhancing LLMs in Long Context Scenarios via Prompt Compression, Jiang et al., ACL 2024. [paper][code]
  • LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens, Ding et al., arxiv 2024. [paper][code]
  • LLM Maybe LongLM: Self-Extend LLM Context Window Without Tuning, Jin et al., arxiv 2024. [paper][code]
  • The What, Why, and How of Context Length Extension Techniques in Large Language Models -- A Detailed Survey, Pawar et al., arxiv 2024. [paper][Awesome-LLM-Long-Context-Modeling]
  • Data Engineering for Scaling Language Models to 128K Context, Fu et al., arxiv 2024. [paper][code]
  • CEPE: Long-Context Language Modeling with Parallel Context Encoding, Yen et al., ACL 2024. [paper][code]
  • Training-Free Long-Context Scaling of Large Language Models, An et al., ICML 2024. [paper][code]
  • InfLLM: Training-Free Long-Context Extrapolation for LLMs with an Efficient Context Memory, Xiao et al., NeurIPS 2024. [paper][code]
  • Counting-Stars: A Simple, Efficient, and Reasonable Strategy for Evaluating Long-Context Large Language Models, Song et al., arxiv 2024. [paper][code][LLMTest_NeedleInAHaystack][RULER][LooGLE][LongBench][google-deepmind/loft]
  • Infini-Transformer: Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention, Munkhdalai et al., arxiv 2024. [paper][infini-transformer-pytorch][InfiniTransformer][infini-mini-transformer][megalodon]
  • Extending Llama-3's Context Ten-Fold Overnight, Zhang et al., arxiv 2024. [paper][code][activation_beacon]
  • Make Your LLM Fully Utilize the Context, An et al., arxiv 2024. [paper][code]
  • CoPE: Contextual Position Encoding: Learning to Count What's Important, Golovneva et al., arxiv 2024. [paper][rope_cope]
  • Scaling Granite Code Models to 128K Context, Stallone et al., arxiv 2024. [paper][code][granite-3.1-language-models]
  • Generalizing an LLM from 8k to 1M Context using Qwen-Agent, Qwen Team, 2024. [blog]
  • LongWriter: Unleashing 10,000+ Word Generation from Long Context LLMs, Bai et al., arxiv 2024. [paper][code][LongCite][LongReward]
  • A failed experiment: Infini-Attention, and why we should keep trying, HuggingFace Blog, 2024. [blog][Magic Blog]
  • Why Does the Effective Context Length of LLMs Fall Short, An et al., arxiv 2024. [paper][code][rotary-embedding-torch]
  • How to Train Long-Context Language Models (Effectively), Gao et al., arxiv 2024. [paper][code]
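
Of the methods above, Positional Interpolation (Chen et al.) is the simplest to state: divide every position index by a scale factor so a longer window maps back into the RoPE range seen during pretraining. A minimal NumPy sketch of RoPE with that rescaling:

```python
import numpy as np

def rope_angles(positions, dim, base=10000.0, scale=1.0):
    """RoPE rotation angles; Positional Interpolation divides positions by `scale`
    (e.g. scale=4.0 maps an 8192-token window onto a model pretrained at 2048)."""
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)            # (dim/2,)
    return np.outer(np.asarray(positions) / scale, inv_freq)    # (len, dim/2)

def apply_rope(x, angles):
    """Rotate each (even, odd) feature pair of x (len, dim) by the given angles."""
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

q = np.random.default_rng(0).normal(size=(8192, 64))
q_scaled = apply_rope(q, rope_angles(range(8192), 64, scale=4.0))  # positions stay in-range
```
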
3.3.3 Corpus
3.3.4 Evaluation
3.3.5 Hallucination
  • Extrinsic Hallucinations in LLMs, Lilian Weng, 2024. [blog]
  • Siren's Song in the AI Ocean: A Survey on Hallucination in Large Language Models, Zhang et al., arxiv 2023. [paper][code]
  • A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions, Huang et al., arxiv 2023. [paper][code][Awesome-MLLM-Hallucination]
  • The Dawn After the Dark: An Empirical Study on Factuality Hallucination in Large Language Models, Li et al., arxiv 2024. [paper][code]
  • FacTool: Factuality Detection in Generative AI -- A Tool Augmented Framework for Multi-Task and Multi-Domain Scenarios, Chen et al., arxiv 2023. [paper][code][OlympicArena][FActScore]
  • Chain-of-Verification Reduces Hallucination in Large Language Models, Dhuliawala et al., arxiv 2023. [paper][code]
  • HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models, Guan et al., CVPR 2024. [paper][code]
  • Woodpecker: Hallucination Correction for Multimodal Large Language Models, Yin et al., arxiv 2023. [paper][code]
  • OPERA: Alleviating Hallucination in Multi-Modal Large Language Models via Over-Trust Penalty and Retrospection-Allocation, Huang et al., CVPR 2024 Highlight. [paper][code]
  • TrustLLM: Trustworthiness in Large Language Models, Sun et al., arxiv 2024. [paper][code]
  • SAFE: Long-form factuality in large language models, Wei et al., arxiv 2024. [paper][code]
  • RefChecker: Reference-based Fine-grained Hallucination Checker and Benchmark for Large Language Models, Hu et al., arxiv 2024. [paper][code][HaluAgent][LLMsKnow]
  • Detecting hallucinations in large language models using semantic entropy, Farquhar et al., Nature 2024. [paper][semantic_uncertainty][long_hallucinations][Semantic Uncertainty ICLR 2023][Lynx-hallucination-detection]
  • A Survey on the Honesty of Large Language Models, Li et al., arxiv 2024. [paper][code]
  • LLMs Know More Than They Show: On the Intrinsic Representation of LLM Hallucinations, Orgad et al., arxiv 2024. [paper][code]
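
The semantic-entropy result above (Farquhar et al., Nature 2024) operationalizes a simple idea: sample several answers, group them by meaning, and flag high entropy over the groups as likely confabulation. A simplified sketch that substitutes normalized exact match for the paper's bidirectional-entailment clustering:

```python
import math
from collections import Counter

def semantic_entropy(answers):
    """Entropy over answer clusters; the paper clusters with an NLI model,
    normalized exact match is a crude stand-in here."""
    clusters = Counter(a.strip().lower().rstrip(".") for a in answers)
    n = sum(clusters.values())
    return -sum((c / n) * math.log(c / n) for c in clusters.values())

samples = ["Paris", "paris", "Paris.", "Lyon", "Paris"]
print(f"entropy = {semantic_entropy(samples):.3f}  (higher = more likely confabulation)")
```
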
3.3.6 Inference
3.3.7 MoE
3.3.8 PEFT (Parameter-efficient Fine-tuning)
3.3.9 Prompt Learning
3.3.10 RAG (Retrieval Augmented Generation)
Text Embedding
3.3.11 Reasoning and Planning
  • Few-Shot-CoT: Chain-of-Thought Prompting Elicits Reasoning in Large Language Models, Wei et al., NeurIPS 2022. [paper][chain-of-thought-hub]

  • Self-Consistency Improves Chain of Thought Reasoning in Language Models, Wang et al., ICLR 2023. [paper] (a voting sketch closes this subsection)

  • Zero-Shot-CoT: Large Language Models are Zero-Shot Reasoners, Kojima et al., NeurIPS 2022. [paper][code]

  • Auto-CoT: Automatic Chain of Thought Prompting in Large Language Models, Zhang et al., ICLR 2023. [paper][code]

  • Multimodal Chain-of-Thought Reasoning in Language Models, Zhang et al., arxiv 2023. [paper][code]

  • Fine-tune-CoT: Large Language Models Are Reasoning Teachers, Ho et al., ACL 2023. [paper][code]

  • The CoT Collection: Improving Zero-shot and Few-shot Learning of Language Models via Chain-of-Thought Fine-Tuning, Kim et al., EMNLP 2023. [paper][code]

  • Chain-of-Thought Reasoning Without Prompting, Wang et al., arxiv 2024. [paper]

  • ReAct: Synergizing Reasoning and Acting in Language Models, Yao et al., ICLR 2023. [paper][code]

  • MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action, Yang et al., arxiv 2023. [paper][code][AutoAct]

  • Tree of Thoughts: Deliberate Problem Solving with Large Language Models, Yao et al., NeurIPS 2023. [paper][code][Plug in and Play Implementation][tree-of-thought-prompting]

  • Graph of Thoughts: Solving Elaborate Problems with Large Language Models, Besta et al., arxiv 2023. [paper][code]

  • Cumulative Reasoning with Large Language Models, Zhang et al., arxiv 2023. [paper][code][On the Diagram of Thought]

  • Algorithm of Thoughts: Enhancing Exploration of Ideas in Large Language Models, Sel et al., arxiv 2023. [paper][unofficial code]

  • Everything of Thoughts: Defying the Law of Penrose Triangle for Thought Generation, Ding et al., arxiv 2023. [paper][code]

  • Diffusion of Thoughts: Chain-of-Thought Reasoning in Diffusion Language Models, Ye et al., arxiv 2024. [paper][code]

  • Least-to-Most Prompting Enables Complex Reasoning in Large Language Models, Zhou et al., ICLR 2023. [paper]

  • DEPS: Describe, Explain, Plan and Select: Interactive Planning with Large Language Models Enables Open-World Multi-Task Agents, Wang et al., arxiv 2023. [paper][code]

  • RAP: Reasoning with Language Model is Planning with World Model, Hao et al., EMNLP 2023. [paper][code][LLM Reasoners COLM 2024]

  • LEMA: Learning From Mistakes Makes LLM Better Reasoner, An et al., arxiv 2023. [paper][code]

  • Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks, Chen et al., TMLR 2023. [paper][code]

  • Chain of Code: Reasoning with a Language Model-Augmented Code Emulator, Li et al., arxiv 2023. [paper][code]

  • The Impact of Reasoning Step Length on Large Language Models, Jin et al., arxiv 2024. [paper][code]

  • Plan-and-Solve Prompting: Improving Zero-Shot Chain-of-Thought Reasoning by Large Language Models, Wang et al., ACL 2023. [paper][code][maestro]

  • Improving Factuality and Reasoning in Language Models through Multiagent Debate, Du et al., ICML 2024. [paper][code][Multi-Agents-Debate]

  • Self-Refine: Iterative Refinement with Self-Feedback, Madaan et al., NeurIPS 2023. [paper][code][MCT Self-Refine][SelFee]

  • Reflexion: Language Agents with Verbal Reinforcement Learning, Shinn et al., NeurIPS 2023. [paper][code]

  • CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing, Gou et al., ICLR 2024. [paper][code]

  • LATS: Language Agent Tree Search Unifies Reasoning Acting and Planning in Language Models, Zhou et al., ICML 2024. [paper][code]

  • Self-Discover: Large Language Models Self-Compose Reasoning Structures, Zhou et al., NeurIPS 2024. [paper][unofficial implementation][SELF-DISCOVER]

  • RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Horizon Generation, Wang et al., arxiv 2024. [paper][code]

  • KnowAgent: Knowledge-Augmented Planning for LLM-Based Agents, Zhu et al., arxiv 2024. [paper][code][KnowLM][KnowPAT]

  • Advancing LLM Reasoning Generalists with Preference Trees, Yuan et al., arxiv 2024. [paper][code]

  • Buffer of Thoughts: Thought-Augmented Reasoning with Large Language Models, Yang et al., arxiv 2024. [paper][code][SymbCoT]

  • ReST-EM: Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models, Singh et al., arxiv 2023. [paper][unofficial code]

  • ReST meets ReAct: Self-Improvement for Multi-Step Reasoning LLM Agent, Aksitov et al., arxiv 2023. [paper][code]

  • Searchformer: Beyond A*: Better Planning with Transformers via Search Dynamics Bootstrapping, Lehnert et al., COLM 2024. [paper][code][Dualformer]

  • How Far Are We from Intelligent Visual Deductive Reasoning?, Zhang et al., arxiv 2024. [paper][code]

  • PlanRAG: A Plan-then-Retrieval Augmented Generation for Generative Large Language Models as Decision Makers, Lee et al., arxiv 2024. [paper][code]

  • Husky: A Unified, Open-Source Language Agent for Multi-Step Reasoning, Kim et al., arxiv 2024. [paper][code]

  • Sibyl: Simple yet Effective Agent Framework for Complex Real-world Reasoning, Wang et al., arxiv 2024. [paper][code]

  • QueryAgent: A Reliable and Efficient Reasoning Framework with Environmental Feedback-based Self-Correction, Huang et al., ACL 2024. [paper][code]

  • Internal Consistency and Self-Feedback in Large Language Models: A Survey, Liang et al., arxiv 2024. [paper][code]

  • Prover-Verifier Games improve legibility of language model outputs, Kirchner et al., 2024. [blog][paper]

  • Self-Training with Direct Preference Optimization Improves Chain-of-Thought Reasoning, Wang et al., ACL 2024. [paper][code]

  • ReST-MCTS*: LLM Self-Training via Process Reward Guided Tree Search, Zhang et al., NeurIPS 2024. [paper][code][llm-mcts]

  • GenRM: Generative Verifiers: Reward Modeling as Next-Token Prediction, Zhang et al., arxiv 2024. [paper][CriticGPT][Rewarding Progress: Scaling Automated Process Verifiers for LLM Reasoning][Free Process Rewards without Process Labels]

  • rStar: Mutual Reasoning Makes Smaller LLMs Stronger Problem-Solvers, Qi et al., arxiv 2024. [paper][code][Orca 2][STaR][Quiet-STaR]

  • OpenAI o1: Learning to Reason with LLMs, OpenAI, 2024. [blog][OpenAI o1 System Card][Agent Q][Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters][search-and-learn][Let's Verify Step by Step][Thinking LLMs: General Instruction Following with Thought Generation][Awesome-LLM-Strawberry]

  • O1 Replication Journey: A Strategic Progress Report -- Part 1, Qin et al., arxiv 2024. [paper][code][O1 Replication Journey -- Part 2][LLaMA-O1][Marco-o1][qwq-32b-preview]

  • ReFT: Reasoning with Reinforced Fine-Tuning, Luong et al., ACL 2024. [paper][code][VinePPO]

  • LLaVA-o1: Let Vision Language Models Reason Step-by-Step, Xu et al., arxiv 2024. [paper][code][internvl2.0_mpo][Insight-V][VisVM]

  • Imitate, Explore, and Self-Improve: A Reproduction Report on Slow-thinking Reasoning Systems, Min et al., arxiv 2024. [paper][code][Technical Report: Enhancing LLM Reasoning with Reward-guided Tree Search]

  • [llm-reasoners][g1][Open-O1][show-me][OpenR]
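
Many entries above build on self-consistency (Wang et al., ICLR 2023): sample several chain-of-thought completions and keep the majority answer. A minimal sketch with a hypothetical `sample_chain` callable standing in for an LLM sampler:

```python
import random
from collections import Counter

def self_consistency(sample_chain, question, k=10):
    """Sample k reasoning chains and return the modal final answer.
    `sample_chain` is any stochastic callable question -> (reasoning, answer)."""
    answers = [sample_chain(question)[1] for _ in range(k)]
    best, votes = Counter(answers).most_common(1)[0]
    return best, votes / k            # answer plus its empirical agreement rate

random.seed(0)
fake = lambda q: ("...steps...", "42" if random.random() < 0.7 else "41")  # toy sampler
print(self_consistency(fake, "6 * 7 = ?"))   # majority vote recovers "42"
```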

Survey

3.4 LLM Theory

3.5 Chinese Model


CV

  • CS231n: Deep Learning for Computer Vision [link]

1. Basic for CV

  • AlexNet: ImageNet Classification with Deep Convolutional Neural Networks, Krizhevsky et al., NIPS 2012. [paper]
  • VGG: Very Deep Convolutional Networks for Large-Scale Image Recognition, Simonyan et al., ICLR 2015. [paper]
  • GoogLeNet: Going Deeper with Convolutions, Szegedy et al., CVPR 2015. [paper]
  • ResNet: Deep Residual Learning for Image Recognition, He et al., CVPR 2016 Best Paper. [paper][code]
  • DenseNet: Densely Connected Convolutional Networks, Huang et al., CVPR 2017 Oral. [paper][code]
  • EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks, Tan et al., ICML 2019. [paper][code][EfficientNet-PyTorch][noisystudent]
  • BYOL: Bootstrap your own latent: A new approach to self-supervised Learning, Grill et al., arxiv 2020. [paper][code][byol-pytorch][simsiam]
  • ConvNeXt: A ConvNet for the 2020s, Liu et al., CVPR 2022. [paper][code][ConvNeXt-V2]

2. Contrastive Learning

  • MoCo: Momentum Contrast for Unsupervised Visual Representation Learning, He et al., CVPR 2020. [paper][code]

  • SimCLR: A Simple Framework for Contrastive Learning of Visual Representations, Chen et al., PMLR 2020. [paper][code]

  • CoCa: Contrastive Captioners are Image-Text Foundation Models, Yu et al., arxiv 2022. [paper][CoCa-pytorch][multimodal]

  • DINOv2: Learning Robust Visual Features without Supervision, Oquab et al., arxiv 2023. [paper][code]

  • FeatUp: A Model-Agnostic Framework for Features at Any Resolution, Fu et al., ICLR 2024. [paper][code]

  • InfoNCE Loss: Representation Learning with Contrastive Predictive Coding, Oord et al., arxiv 2018. [paper][unofficial code]
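
The InfoNCE loss in the last entry is the common core of MoCo and SimCLR: a cross-entropy in which each positive pair competes against in-batch negatives. A minimal PyTorch sketch under SimCLR's in-batch scheme:

```python
import torch
import torch.nn.functional as F

def info_nce(z1, z2, temperature=0.1):
    """InfoNCE over paired embeddings: z1[i] and z2[i] are two views of sample i;
    every other row of z2 acts as a negative for z1[i]."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.T / temperature          # (B, B) scaled cosine similarities
    targets = torch.arange(z1.size(0))        # positives sit on the diagonal
    return F.cross_entropy(logits, targets)

g = torch.Generator().manual_seed(0)
z_a, z_b = torch.randn(32, 128, generator=g), torch.randn(32, 128, generator=g)
loss = info_nce(z_a, z_b)
```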

3. CV Application

4. Foundation Model

  • ViT: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, Dosovitskiy et al., ICLR 2021. [paper][code][vit-pytorch][efficientvit][EfficientFormer][ViT-Adapter]

  • ViT-Adapter: Vision Transformer Adapter for Dense Predictions, Chen et al., ICLR 2023 Spotlight. [paper][code]

  • Vision Transformers Need Registers, Darcet et al., ICLR 2024 Outstanding Paper. [paper]

  • DeiT: Training data-efficient image transformers & distillation through attention, Touvron et al., ICML 2021. [paper][code]

  • ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision, Kim et al., ICML 2021. [paper][code]

  • Swin Transformer: Hierarchical Vision Transformer using Shifted Windows, Liu et al., ICCV 2021. [paper][code]

  • MAE: Masked Autoencoders Are Scalable Vision Learners, He et al., CVPR 2022. [paper][code][FLIP][LVMAE-pytorch]

  • Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks, Xiao et al., CVPR 2024 Oral. [paper][model][Inference code][Florence-VL]

  • LVM: Sequential Modeling Enables Scalable Learning for Large Vision Models, Bai et al., arxiv 2023. [paper][code]

  • GLEE: General Object Foundation Model for Images and Videos at Scale, Wu et al., CVPR 2024 Highlight. [paper][code]

  • Tokenize Anything via Prompting, Pan et al., arxiv 2023. [paper][code]

  • Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model, Zhu et al., ICML 2024. [paper][code][VMamba][mambaout][MLLA]

  • MambaVision: A Hybrid Mamba-Transformer Vision Backbone, Hatamizadeh and Kautz, arxiv 2024. [paper][code]

  • Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data, Yang et al., arxiv 2024. [paper][code][Depth-Anything-V2][PromptDA][ml-depth-pro][DepthCrafter][rollingdepth]

  • Vision Superalignment: Weak-to-Strong Generalization for Vision Foundation Models, Guo et al., arxiv 2024. [paper][code]

  • TiTok: An Image is Worth 32 Tokens for Reconstruction and Generation, Yu et al., NeurIPS 2024. [paper][code][titok-pytorch][Randomized Autoregressive Visual Generation][Cosmos-Tokenizer]

  • Theia: Distilling Diverse Vision Foundation Models for Robot Learning, Shang et al., arxiv 2024. [paper][code]

  • [pytorch-image-models][Pointcept]

5. Generative Model (GAN and VAE)

6. Image Editing

  • InstructPix2Pix: Learning to Follow Image Editing Instructions, Brooks et al., CVPR 2023 Highlight. [paper][code]

  • Drag Your GAN: Interactive Point-based Manipulation on the Generative Image Manifold, Pan et al., SIGGRAPH 2023. [paper][code]

  • DragDiffusion: Harnessing Diffusion Models for Interactive Point-based Image Editing, Shi et al., arxiv 2023. [paper][code]

  • DragonDiffusion: Enabling Drag-style Manipulation on Diffusion Models, Mou et al., ICLR 2024 Spotlight. [paper][code]

  • DragAnything: Motion Control for Anything using Entity Representation, Wu et al., ECCV 2024. [paper][code][Framer][SG-I2V]

  • LEDITS++: Limitless Image Editing using Text-to-Image Models, Brack et al., arxiv 2023. [paper][code][demo]

  • Diffusion Model-Based Image Editing: A Survey, Huang et al., arxiv 2024. [paper][code]

  • PromptFix: You Prompt and We Fix the Photo, Yu et al., NeurIPS 2024. [paper][code]

  • MimicBrush: Zero-shot Image Editing with Reference Imitation, Chen et al., arxiv 2024. [paper][code][EchoMimic]

  • A Survey of Multimodal-Guided Image Editing with Text-to-Image Diffusion Models, Shuai et al., arxiv 2024. [paper][code]

  • Edify Image: High-Quality Image Generation with Pixel Space Laplacian Diffusion Models, Atzmon et al., arxiv 2024. [paper]

  • MagicQuill: An Intelligent Interactive Image Editing System, Liu et al., arxiv 2024. [paper][code]

  • BrushEdit: All-In-One Image Inpainting and Editing, Li et al., arxiv 2024. [paper][code]

  • [EditAnything][ComfyUI-UltraEdit-ZHO][libcom][Awesome-Image-Composition][RF-Solver-Edit]

7. Object Detection

  • DETR: End-to-End Object Detection with Transformers, Carion et al., arxiv 2020. [paper][code][detrex]

  • Focus-DETR: Less is More: Focus Attention for Efficient DETR, Zheng et al., arxiv 2023. [paper][code]

  • U2-Net: Going Deeper with Nested U-Structure for Salient Object Detection, Qin et al., arxiv 2020. [paper][code]

  • YOLO: You Only Look Once: Unified, Real-Time Object Detection, Redmon et al., arxiv 2015. [paper]

  • YOLOX: Exceeding YOLO Series in 2021, Ge et al., arxiv 2021. [paper][code]

  • Gold-YOLO: Efficient Object Detector via Gather-and-Distribute Mechanism, Wang et al., arxiv 2023. [paper][code]

  • Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection, Liu et al., ECCV 2024. [paper][code][DINO-X][OV-DINO][OmDet][groundingLMM]

  • YOLO-World: Real-Time Open-Vocabulary Object Detection, Cheng et al., CVPR 2024. [paper][code]

  • YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information, Wang et al., arxiv 2024. [paper][code]

  • T-Rex2: Towards Generic Object Detection via Text-Visual Prompt Synergy, Jiang et al., arxiv 2024. [paper][code][ChatRex]

  • YOLOv10: Real-Time End-to-End Object Detection, Wang et al., arxiv 2024. [paper][code]

  • D-FINE: Redefine Regression Task in DETRs as Fine-grained Distribution Refinement, Peng et al., arxiv 2024. [paper][code]

  • [detectron2][yolov5][mmdetection][mmdetection3d][detrex][ultralytics][AlphaPose]

8. Semantic Segmentation

9. Video

  • VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training, Tong et al., NeurIPS 2022 Spotlight. [paper][code]

  • Make-A-Protagonist: Generic Video Editing with An Ensemble of Experts, Zhao et al., arxiv 2024. [paper][code]

  • MagicVideo-V2: Multi-Stage High-Aesthetic Video Generation, Wang et al., arxiv 2024. [paper]

  • [V-JEPA][I-JEPA][DINO-WM]

  • VideoMamba: State Space Model for Efficient Video Understanding, Li et al., ECCV 2024. [paper][code]

  • VideoChat: Chat-Centric Video Understanding, Li et al., CVPR 2024 Highlight. [paper][code]

  • Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models, Maaz et al., ACL 2024. [paper][code][Video-LLaMA][MovieChat][Chat-UniVi]

  • MVBench: A Comprehensive Multi-modal Video Understanding Benchmark, Li et al., CVPR 2024 Highlight. [paper][code][PhyGenBench]

  • OmAgent: A Multi-modal Agent Framework for Complex Video Understanding with Task Divide-and-Conquer, Zhang et al., EMNLP 2024. [paper][code]

  • MiraData: A Large-Scale Video Dataset with Long Durations and Structured Captions, Ju et al., arxiv 2024. [paper][code]

  • MIMO: Controllable Character Video Synthesis with Spatial Decomposed Modeling, Men et al., arxiv 2024. [paper][code][MIMO-pytorch][StableV2V]

  • Video-XL: Extra-Long Vision Language Model for Hour-Scale Video Understanding, Shu et al., arxiv 2024. [paper][code][LongVU][VisionZip]

  • [Awesome-LLMs-for-Video-Understanding]

10. Survey for CV

  • ConvNet vs Transformer, Supervised vs CLIP: Beyond ImageNet Accuracy, Vishniakov et al., arxiv 2023. [paper][code]
  • Parameter-Efficient Fine-Tuning for Pre-Trained Vision Models: A Survey, Xin et al., arxiv 2024. [paper][code]

Multimodal

1. Audio

2. BLIP

  • ALBEF: Align before Fuse: Vision and Language Representation Learning with Momentum Distillation, Li et al., NeurIPS 2021. [paper][code]
  • BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation, Li et al., ICML 2022. [paper][code][laion-coco]
  • BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models, Li et al., ICML 2023. [paper][code]
  • InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning, Dai et al., arxiv 2023. [paper][code]
  • X-InstructBLIP: A Framework for aligning X-Modal instruction-aware representations to LLMs and Emergent Cross-modal Reasoning, Panagopoulou et al., arxiv 2023. [paper][code]
  • xGen-MM (BLIP-3): A Family of Open Large Multimodal Models, Xue et al., arxiv 2024. [paper][code]
  • xGen-VideoSyn-1: High-fidelity Text-to-Video Synthesis with Compressed Representations, Qin et al., arxiv 2024. [paper][code]
  • xGen-MM-Vid (BLIP-3-Video): You Only Need 32 Tokens to Represent a Video Even in VLMs, Ryoo et al., arxiv 2024. [paper]
  • LAVIS: A Library for Language-Vision Intelligence, Li et al., arxiv 2022. [paper][code]
  • VLMo: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts, Bao et al., NeurIPS 2022. [paper][code]
  • BEiT: BERT Pre-Training of Image Transformers, Bao et al., ICLR 2022 Oral. [paper][code]
  • BEiT-3: Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks, Wang et al., CVPR 2023. [paper][code]

3. CLIP

  • CLIP: Learning Transferable Visual Models From Natural Language Supervision, Radford et al., ICML 2021. [paper][code][open_clip][clip-as-service][SigLIP][EVA][DIVA][Clip-Forge]
  • DALL-E2: Hierarchical Text-Conditional Image Generation with CLIP Latents, Ramesh et al., arxiv 2022. [paper][code]
  • GLIPv2: Unifying Localization and Vision-Language Understanding, Zhang et al., NeurIPS 2022. [paper][code][GLIGEN]
  • SigLIP: Sigmoid Loss for Language Image Pre-Training, Zhai et al., arxiv 2023. [paper][siglip] (a loss sketch follows this list)
  • EVA-CLIP: Improved Training Techniques for CLIP at Scale, Sun et al., arxiv 2023. [paper][code][EVA-CLIP-18B]
  • Chinese CLIP: Contrastive Vision-Language Pretraining in Chinese, Yang et al., arxiv 2022. [paper][code]
  • MetaCLIP: Demystifying CLIP Data, Xu et al., ICLR 2024 Spotlight. [paper][code]
  • Alpha-CLIP: A CLIP Model Focusing on Wherever You Want, Sun et al., arxiv 2023. [paper][code][Bootstrap3D]
  • MMVP: Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs, Tong et al., arxiv 2024. [paper][code]
  • MobileCLIP: Fast Image-Text Models through Multi-Modal Reinforced Training, Vasu et al., CVPR 2024. [paper][code]
  • Long-CLIP: Unlocking the Long-Text Capability of CLIP, Zhang et al., ECCV 2024. [paper][code][Inf-CLIP]
  • CLOC: Contrastive Localized Language-Image Pre-Training, Chen et al., arxiv 2024. [paper]
  • LLM2CLIP: Powerful Language Model Unlocks Richer Visual Representation, Huang et al., arxiv 2024. [paper][code]
  • SuperClass: Classification Done Right for Vision-Language Pre-Training, Huang et al., NeurIPS 2024. [paper][code]
  • AIM-v2: Multimodal Autoregressive Pre-training of Large Vision Encoders, Fini et al., arxiv 2024. [paper][code]
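
As referenced above, SigLIP replaces CLIP's softmax contrastive loss with an independent sigmoid per image-text pair, so no global normalization over the batch is needed. A minimal sketch of that loss; the temperature and bias are learnable in the paper but fixed here:

```python
import torch
import torch.nn.functional as F

def siglip_loss(img, txt, t=10.0, b=-10.0):
    """Sigmoid loss from SigLIP: every (i, j) pair is a binary problem with
    label +1 on the diagonal (matched pair) and -1 off the diagonal."""
    img, txt = F.normalize(img, dim=1), F.normalize(txt, dim=1)
    logits = img @ txt.T * t + b                     # (B, B) scaled, biased similarities
    labels = 2 * torch.eye(img.size(0)) - 1          # +1 diagonal, -1 elsewhere
    return -F.logsigmoid(labels * logits).sum() / img.size(0)

g = torch.Generator().manual_seed(0)
loss = siglip_loss(torch.randn(16, 64, generator=g), torch.randn(16, 64, generator=g))
```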

4. Diffusion Model

  • Tutorial on Diffusion Models for Imaging and Vision, Stanley H. Chan, arxiv 2024. [paper][diffusion-models-class]

  • Denoising Diffusion Probabilistic Models, Ho et al., NeurIPS 2020. [paper][code][Pytorch Implementation][RDDM] (a training-loss sketch closes this section)

  • Improved Denoising Diffusion Probabilistic Models, Nichol and Dhariwal, ICML 2021. [paper][code]

  • Diffusion Models Beat GANs on Image Synthesis, Dhariwal and Nichol, NeurIPS 2021. [paper][code]

  • Classifier-Free Diffusion Guidance, Ho and Salimans, NeurIPS 2021. [paper][code]

  • GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models, Nichol et al., arxiv 2021. [paper][code]

  • DALL-E2: Hierarchical Text-Conditional Image Generation with CLIP Latents, Ramesh et al., arxiv 2022. [paper][code][dalle-mini]

  • Stable-Diffusion: High-Resolution Image Synthesis with Latent Diffusion Models, Rombach et al., CVPR 2022. [paper][code][CompVis/stable-diffusion][Stability-AI/stablediffusion][ml-stable-diffusion][cleandift]

  • SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis, Podell et al., arxiv 2023. [paper][code][SDXL-Lightning]

  • Introducing Stable Cascade, Stability AI, 2024. [link][code][model]

  • SDXL-Turbo: Adversarial Diffusion Distillation, Sauer et al., arxiv 2023. [paper][code]

  • LCM: Latent Consistency Models: Synthesizing High-Resolution Images with Few-Step Inference, Luo et al., arxiv 2023. [paper][code][Hyper-SD][DMD2][ddim]

  • LCM-LoRA: A Universal Stable-Diffusion Acceleration Module, Luo et al., arxiv 2023. [paper][code][diffusion-forcing][InstaFlow]

  • Stable Diffusion 3: Scaling Rectified Flow Transformers for High-Resolution Image Synthesis, Esser et al., ICML 2024 Best Paper. [paper][model][mmdit]

  • SD3-Turbo: Fast High-Resolution Image Synthesis with Latent Adversarial Diffusion Distillation, Sauer et al., arxiv 2024. [paper][SD3.5]

  • StreamDiffusion: A Pipeline-level Solution for Real-time Interactive Generation, Kodaira et al., arxiv 2023. [paper][code]

  • DiffuseKronA: A Parameter Efficient Fine-tuning Method for Personalized Diffusion Models, Marjit et al., arxiv 2024. [paper][code]

  • Video Diffusion Models, Ho et al., arxiv 2022. [paper][code]

  • Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets, Blattmann et al., arxiv 2023. [paper][code][Stable Video 4D][VideoCrafter][Video-Infinity]

  • Consistency Models, Song et al., arxiv 2023. [paper][code][Consistency Decoder]

  • sCM: Simplifying, Stabilizing and Scaling Continuous-Time Consistency Models, Lu and Song, arxiv 2024. [paper][blog]

  • A Survey on Video Diffusion Models, Xing et al., arxiv 2023. [paper][code]

  • Diffusion Models: A Comprehensive Survey of Methods and Applications, Yang et al., arxiv 2023. [paper][code]

  • MAGVIT2: Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation, Yu et al., ICLR 2024. [paper][magvit2-pytorch][Open-MAGVIT2][LlamaGen]

  • The Chosen One: Consistent Characters in Text-to-Image Diffusion Models, Avrahami et al., arxiv 2023. [paper][code]

  • U-ViT: All are Worth Words: A ViT Backbone for Diffusion Models, Bao et al., CVPR 2023. [paper][code]

  • UniDiffuser: One Transformer Fits All Distributions in Multi-Modal Diffusion, Bao et al., arxiv 2023. [paper][code]

  • Matryoshka Diffusion Models, Gu et al., arxiv 2023. [paper][code]

  • SEDD: Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution, Lou et al., ICML 2024 Best Paper. [paper][code]

  • l-DAE: Deconstructing Denoising Diffusion Models for Self-Supervised Learning, Chen et al., arxiv 2024. [paper]

  • DiT: Scalable Diffusion Models with Transformers, Peebles et al., ICCV 2023 Oral. [paper][code][OpenDiT][VideoSys][MDT][PipeFusion][fast-DiT][FastVideo][xDiT][rlt][U-DiT]

  • SiT: Exploring Flow and Diffusion-based Generative Models with Scalable Interpolant Transformers, Ma et al., arxiv 2024. [paper][code]

  • Hyper-SD: Trajectory Segmented Consistency Model for Efficient Image Synthesis, Ren et al., NeurIPS 2024. [paper][model][AdaCache]

  • Inf-DiT: Upsampling Any-Resolution Image with Memory-Efficient Diffusion Transformer, Yang et al., arxiv 2024. [paper][code]

  • Diffusion Forcing: Next-token Prediction Meets Full-Sequence Diffusion, Chen et al., arxiv 2024. [paper][code]

  • Stretching Each Dollar: Diffusion Training from Scratch on a Micro-Budget, Sehwag et al., arxiv 2024. [paper][code][tiny-stable-diffusion]

  • Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model, Zhou et al., arxiv 2024. [paper][transfusion-pytorch][chameleon][MonoFormer]

  • REPA: Representation Alignment for Generation: Training Diffusion Transformers Is Easier Than You Think, Yu et al., arxiv 2024. [paper][code]

  • In-Context LoRA for Diffusion Transformers, Huang et al., arxiv 2024. [paper][code]

  • SVDQuant: Absorbing Outliers by Low-Rank Components for 4-Bit Diffusion Models, Li et al., arxiv 2024. [paper][code]

  • Training-free Regional Prompting for Diffusion Transformers, Chen et al., arxiv 2024. [paper][code][Add-it][RAG-Diffusion]

  • GitHub Repositories

  • [Awesome-Diffusion-Models][Awesome-Video-Diffusion]

  • [stable-diffusion-webui][stable-diffusion-webui-colab][sd-webui-controlnet][stable-diffusion-webui-forge][automatic]

  • [Fooocus][Omost]

  • [ComfyUI][streamlit][gradio][ComfyUI-Workflows-ZHO][ComfyUI_Bxb]

  • [diffusers][DiffSynth-Studio]
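
The DDPM training objective (Ho et al., referenced at the top of this section) is compact enough to state directly: corrupt a clean sample to a random timestep and regress the injected noise. A minimal sketch with a linear beta schedule; `model` is a stand-in for any noise-prediction network:

```python
import torch
import torch.nn.functional as F

def ddpm_loss(model, x0, T=1000):
    """Simplified DDPM objective: predict eps from
    x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps."""
    betas = torch.linspace(1e-4, 0.02, T)            # linear schedule from the paper
    alpha_bar = torch.cumprod(1.0 - betas, dim=0)    # (T,) cumulative signal fraction
    t = torch.randint(0, T, (x0.size(0),))           # one random timestep per sample
    ab = alpha_bar[t].view(-1, 1, 1, 1)
    eps = torch.randn_like(x0)
    x_t = ab.sqrt() * x0 + (1 - ab).sqrt() * eps     # forward diffusion in closed form
    return F.mse_loss(model(x_t, t), eps)

x0 = torch.randn(4, 3, 32, 32)                       # toy image batch
loss = ddpm_loss(lambda x_t, t: torch.zeros_like(x_t), x0)  # trivial stand-in network
```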

5. Multimodal LLM

6. Text2Image

  • DALL-E: Zero-Shot Text-to-Image Generation, Ramesh et al., arxiv 2021. [paper][code]

  • DALL-E3: Improving Image Generation with Better Captions, Betker et al., OpenAI 2023. [paper][code][blog][Glyph-ByT5]

  • ControlNet: Adding Conditional Control to Text-to-Image Diffusion Models, Zhang et al., ICCV 2023 Marr Prize. [paper][code][ControlNet_Plus_Plus][ControlNeXt][ControlAR][OminiControl][ROICtrl]

  • T2I-Adapter: Learning Adapters to Dig out More Controllable Ability for Text-to-Image Diffusion Models, Mou et al., AAAI 2024. [paper][code]

  • AnyText: Multilingual Visual Text Generation And Editing, Tuo et al., arxiv 2023. [paper][code]

  • RPG: Mastering Text-to-Image Diffusion: Recaptioning, Planning, and Generating with Multimodal LLMs, Yang et al., ICML 2024. [paper][code][IterComp]

  • LAION-5B: An open large-scale dataset for training next generation image-text models, Schuhmann et al., NeurIPS 2022. [paper][code][blog][laion-coco]

  • DeepFloyd IF: Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding, Saharia et al., arxiv 2022. [paper][code]

  • Imagen: Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding, Saharia et al., NeurIPS 2022. [paper][unofficial code]

  • Instruct-Imagen: Image Generation with Multi-modal Instruction, Hu et al., arxiv 2024. [paper][Imagen 3]

  • CogView: Mastering Text-to-Image Generation via Transformers, Ding et al., NeurIPS 2021. [paper][code][ImageReward]

  • CogView2: Faster and Better Text-to-Image Generation via Hierarchical Transformers, Ding et al., arxiv 2022. [paper][code]

  • CogView3: Finer and Faster Text-to-Image Generation via Relay Diffusion, Zheng et al., ECCV 2024. [paper][code]

  • TextDiffuser: Diffusion Models as Text Painters, Chen et al., arxiv 2023. [paper][code]

  • TextDiffuser-2: Unleashing the Power of Language Models for Text Rendering, Chen et al., arxiv 2023. [paper][code]

  • PixArt-α: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis, Chen et al., arxiv 2023. [paper][code]

  • PIXART-δ: Fast and Controllable Image Generation with Latent Consistency Models, Chen et al., arxiv 2024. [paper][code]

  • PixArt-Σ: Weak-to-Strong Training of Diffusion Transformer for 4K Text-to-Image Generation, Chen et al., arxiv 2024. [paper][code]

  • IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models, Ye et al., arxiv 2023. [paper][code][ID-Animator][InstantID]

  • Controllable Generation with Text-to-Image Diffusion Models: A Survey, Cao et al., arxiv 2024. [paper][code]

  • StoryDiffusion: Consistent Self-Attention for Long-Range Image and Video Generation, Zhou et al., NeurIPS 2024. [paper][code][AutoStudio]

  • Hunyuan-DiT: A Powerful Multi-Resolution Diffusion Transformer with Fine-Grained Chinese Understanding, Li et al., arxiv 2024. [paper][code][Hunyuan3D-1][xDiT]

  • GenAI-Bench: Evaluating and Improving Compositional Text-to-Visual Generation, Li et al., CVPR 2024. [paper][t2v_metrics][VQAScore]

  • [Kolors][Kolors-Virtual-Try-On][EVLM: An Efficient Vision-Language Model for Visual Understanding]

  • EvolveDirector: Approaching Advanced Text-to-Image Generation with Large Vision-Language Models, Zhao et al., NeurIPS 2024. [paper][code]

  • Fluid: Scaling Autoregressive Text-to-image Generative Models with Continuous Tokens, Fan et al., arxiv 2024. [paper]

  • Meissonic: Revitalizing Masked Generative Transformers for Efficient High-Resolution Text-to-Image Synthesis, Bai et al., arxiv 2024. [paper][code]

  • SANA: Efficient High-Resolution Image Synthesis with Linear Diffusion Transformers, Xie et al., ICLR 2025. [paper][code]

  • [flux][x-flux][x-flux-comfyui][FLUX.1-dev-LoRA][qwen2vl-flux]

7. Text2Video

  • Animate Anyone: Consistent and Controllable Image-to-Video Synthesis for Character Animation, Hu et al., arxiv 2023. [paper][code][Open-AnimateAnyone][Moore-AnimateAnyone][AnimateAnyone][UniAnimate][Animate-X]

  • EMO: Emote Portrait Alive - Generating Expressive Portrait Videos with Audio2Video Diffusion Model under Weak Conditions, Tian et al., arxiv 2024. [paper][code][V-Express]

  • AniPortrait: Audio-Driven Synthesis of Photorealistic Portrait Animation, Wei et al., arxiv 2024. [paper][code]

  • DreaMoving: A Human Video Generation Framework based on Diffusion Models, Feng et al., arxiv 2023. [paper][code]

  • MagicAnimate: Temporally Consistent Human Image Animation using Diffusion Model, Xu et al., arxiv 2023. [paper][code][champ][MegActor]

  • DynamiCrafter: Animating Open-domain Images with Video Diffusion Priors, Xing et al., ECCV 2024. [paper][code]

  • LivePortrait: Efficient Portrait Animation with Stitching and Retargeting Control, Guo et al., arxiv 2024. [paper][code][FasterLivePortrait][FollowYourEmoji]

  • FlowVid: Taming Imperfect Optical Flows for Consistent Video-to-Video Synthesis, Liang et al., arxiv 2023. [paper][code]

  • [Awesome-Video-Diffusion]

  • Video Diffusion Models, Ho et al., arxiv 2022. [paper][video-diffusion-pytorch]

  • Make-A-Video: Text-to-Video Generation without Text-Video Data, Singer et al., arxiv 2022. [paper][make-a-video-pytorch]

  • Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation, Wu et al., ICCV 2023. [paper][code]

  • Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Generators, Khachatryan et al., ICCV 2023 Oral. [paper][code][StreamingT2V]

  • CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers, Hong et al., ICLR 2023. [paper][code]

  • CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer, Yang et al., arxiv 2024. [paper][code][cogvideox-factory]

  • Follow Your Pose: Pose-Guided Text-to-Video Generation using Pose-Free Videos, Ma et al., AAAI 2024. [paper][code][Follow-Your-Pose v2][Follow-Your-Emoji]

  • Follow-Your-Click: Open-domain Regional Image Animation via Short Prompts, Ma et al., arxiv 2024. [paper][code]

  • AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning, Guo et al., arxiv 2023. [paper][code][AnimateDiff-Lightning]

  • StableVideo: Text-driven Consistency-aware Diffusion Video Editing, Chai et al., ICCV 2023. [paper][code]

  • I2VGen-XL: High-Quality Image-to-Video Synthesis via Cascaded Diffusion Models, Zhang et al., arxiv 2023. [paper][code]

  • TF-T2V: A Recipe for Scaling up Text-to-Video Generation with Text-free Videos, Wang et al., arxiv 2023. [paper][code]

  • Lumiere: A Space-Time Diffusion Model for Video Generation, Bar-Tal et al., arxiv 2024. [paper][lumiere-pytorch]

  • Sora: Creating video from text, OpenAI, 2024. [blog][Generative Models for Image and Long Video Synthesis][Generative Models of Images and Neural Networks][Open-Sora][VideoSys][Open-Sora-Plan][minisora][SoraWebui][MuseV][PhysDreamer][easyanimate]

  • Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models, Liu et al., arxiv 2024. [paper][code]

  • How Far is Video Generation from World Model: A Physical Law Perspective, Kang et al., arxiv 2024. [paper][code]

  • Mora: Enabling Generalist Video Generation via A Multi-Agent Framework, Yuan et al., arxiv 2024. [paper][code]

  • Patch n' Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution, Dehghani et al., NeurIPS 2023. [paper][unofficial code]

  • VideoPoet: A Large Language Model for Zero-Shot Video Generation, Kondratyuk et al., ICML 2024 Best Paper. [paper]

  • Latte: Latent Diffusion Transformer for Video Generation, Ma et al., arxiv 2024. [paper][code][LaVIT][LaVie][VBench][Vchitect-2.0][LiteGen]

  • Snap Video: Scaled Spatiotemporal Transformers for Text-to-Video Synthesis, Menapace et al., arxiv 2024. [paper][articulated-animation]

  • FancyVideo: Towards Dynamic and Consistent Video Generation via Cross-frame Textual Guidance, Feng et al., arxiv 2024. [paper][code][Qihoo-T2X]

  • DepthCrafter: Generating Consistent Long Depth Sequences for Open-world Videos, Hu et al., arxiv 2024. [paper][code]

  • Loong: Generating Minute-level Long Videos with Autoregressive Language Models, Wang et al., arxiv 2024. [paper]

  • Movie Gen: A Cast of Media Foundation Models, The Movie Gen team @ Meta, 2024. [blog][paper][unofficial code]

  • Pyramidal Flow Matching for Efficient Video Generative Modeling, Jin et al., arxiv 2024. [paper][code][LaVIT]

  • Allegro: Open the Black Box of Commercial-Level Video Generation Model, Zhou et al., arxiv 2024. [paper][code]

  • Open-Sora Plan: Open-Source Large Video Generation Model, Lin et al., arxiv 2024. [paper][code][Open-Sora][ConsisID]

  • HunyuanVideo: A Systematic Framework For Large Video Generative Models, Kong et al., arxiv 2024. [paper][code][FastVideo]

  • [MoneyPrinterTurbo][clapper][videos][manim][Mochi 1][genmoai-smol][LTX-Video][Kandinsky-4]

8. Survey for Multimodal

  • A Survey on Multimodal Large Language Models, Yin et al., arxiv 2023. [paper][Awesome-Multimodal-Large-Language-Models][MME-Survey]
  • Multimodal Foundation Models: From Specialists to General-Purpose Assistants, Li et al., arxiv 2023. [paper][cvinw_readings]
  • From GPT-4 to Gemini and Beyond: Assessing the Landscape of MLLMs on Generalizability, Trustworthiness and Causality through Four Modalities, Lu et al., arxiv 2024. [paper][Leaderboards]
  • Efficient Multimodal Large Language Models: A Survey, Jin et al., arxiv 2024. [paper][code]
  • An Introduction to Vision-Language Modeling, Bordes et al., arxiv 2024. [paper]
  • Building and better understanding vision-language models: insights and future directions, Laurençon et al., arxiv 2024. [paper]
  • Video Understanding with Large Language Models: A Survey, Tang et al., arxiv 2023. [paper][code]

9. Other

  • Fuyu-8B: A Multimodal Architecture for AI Agents, Bavishi et al., Adept blog 2023. [blog][model]
  • Otter: A Multi-Modal Model with In-Context Instruction Tuning, Li et al., arxiv 2023. [paper][code]
  • OtterHD: A High-Resolution Multi-modality Model, Li et al., arxiv 2023. [paper][code][model]
  • CM3leon: Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning, Yu et al., arxiv 2023. [paper][Unofficial Implementation]
  • MM-Interleaved: Interleaved Image-Text Generative Modeling via Multi-modal Feature Synchronizer, Tian et al., arxiv 2024. [paper][code]
  • CogCoM: Train Large Vision-Language Models Diving into Details through Chain of Manipulations, Qi et al., arxiv 2024. [paper][code]
  • SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models, Gao et al., arxiv 2024. [paper][code]
  • Lumina-T2X: Transforming Text into Any Modality, Resolution, and Duration via Flow-based Large Diffusion Transformers, Gao et al., arxiv 2024. [paper][code]
  • Lumina-mGPT: Illuminate Flexible Photorealistic Text-to-Image Generation with Multimodal Generative Pretraining, Liu et al., arxiv 2024. [paper][code]
  • LWM: World Model on Million-Length Video And Language With RingAttention, Liu et al., arxiv 2024. [paper][code]
  • Chameleon: Mixed-Modal Early-Fusion Foundation Models, Chameleon Team, arxiv 2024. [paper][code][X-Prompt]
  • SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation, Ge et al., arxiv 2024. [paper][code][SEED][SEED-Story]

Reinforcement Learning

1. Basic for RL

2. LLM for Decision Making

  • Decision Transformer: Reinforcement Learning via Sequence Modeling, Chen et al., NeurIPS 2021. [paper][code] (a token-construction sketch follows this list)
  • Trajectory Transformer: Offline Reinforcement Learning as One Big Sequence Modeling Problem, Janner et al., NeurIPS 2021. [paper][code]
  • Guiding Pretraining in Reinforcement Learning with Large Language Models, Du et al., ICML 2023. [paper][code]
  • Introspective Tips: Large Language Model for In-Context Decision Making, Chen et al., arxiv 2023. [paper]
  • Q-Transformer: Scalable Offline Reinforcement Learning via Autoregressive Q-Functions, Chebotar et al., CoRL 2023. [paper][Unofficial Implementation]
  • Survey on Large Language Model-Enhanced Reinforcement Learning: Concept, Taxonomy, and Methods, Cao et al., arxiv 2024. [paper]
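
Decision Transformer, cited above, recasts offline RL as sequence modeling over interleaved (return-to-go, state, action) tokens. A minimal sketch of the sequence construction from one trajectory; at inference the first return-to-go is set to a desired target return:

```python
import numpy as np

def decision_transformer_tokens(rewards, states, actions):
    """Interleave (return-to-go, state, action) per timestep, as in Chen et al. 2021.
    Return-to-go R_t is the suffix sum of rewards from step t onward."""
    rtg = np.cumsum(rewards[::-1])[::-1]             # suffix sums over the trajectory
    return [(float(rtg[t]), states[t], actions[t]) for t in range(len(rewards))]

traj = decision_transformer_tokens(
    rewards=[0.0, 0.0, 1.0],
    states=["s0", "s1", "s2"],
    actions=["a0", "a1", "a2"],
)
# -> [(1.0, 's0', 'a0'), (1.0, 's1', 'a1'), (1.0, 's2', 'a2')]
```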

GNN

Survey for GNN


Transformer Architecture