Collected from the survey *Navigate through Enigmatic Labyrinth A Survey of Chain of Thought Reasoning: Advances, Frontiers and Future*.

Task | Dataset | Year | Size | Input | Output | Rationale | Description | Paper |
---|---|---|---|---|---|---|---|---|
Mathematical Reasoning | AddSub | 2014 | 395 | Question | Number | Equation | Simple arithmetic | Learning to Solve Arithmetic Word Problems with Verb Categorization |
Mathematical Reasoning | SingleEq | 2015 | 508 | Question | Number | Equation | Simple arithmetic | Parsing Algebraic Word Problems into Equations |
Mathematical Reasoning | MultiArith | 2015 | 600 | Question | Number | Equation | Simple arithmetic | Solving General Arithmetic Word Problems |
Mathematical Reasoning | MAWPS | 2016 | 3,320 | Question | Number | Equation | Simple arithmetic | MAWPS: A Math Word Problem Repository |
Mathematical Reasoning | AQUA-RAT | 2017 | 100,000 | Question | Option | Natural Language | Math reasoning with NL rationale | Program Induction by Rationale Generation: Learning to Solve and Explain Algebraic Word Problems |
Mathematical Reasoning | ASDiv | 2020 | 2,305 | Question | Number | Equation | Multi-step math reasoning | A Diverse Corpus for Evaluating and Developing English Math Word Problem Solvers |
Mathematical Reasoning | SVAMP | 2021 | 1,000 | Question | Number | Equation | Multi-step math reasoning | Are NLP Models really able to Solve Simple Math Word Problems? |
Mathematical Reasoning | GSM8K | 2021 | 8,792 | Question | Number | Natural Language | Multi-step math reasoning | Training Verifiers to Solve Math Word Problems |
Mathematical Reasoning | GSM-Hard | 2023 | 936 | Question | Number | Natural Language | GSM8K with larger numbers | PAL: Program-aided Language Models |
Mathematical Reasoning | MathQA | 2019 | 37,297 | Question | Number | Operation | Annotated based on AQUA | MathQA: Towards Interpretable Math Word Problem Solving with Operation-Based Formalisms |
Mathematical Reasoning | DROP | 2019 | 96,567 | Question+Passage | Number+Span | Equation | Reading comprehension form | DROP: A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs |
Mathematical Reasoning | TheoremQA | 2023 | 800 | Question+Theorem | Number | - | Answer based on theorems | TheoremQA: A Theorem-driven Question Answering Dataset |
Mathematical Reasoning | TAT-QA | 2021 | 16,552 | Question+Table+Text | Number+Span | Operation | Answer based on tables | TAT-QA: A Question Answering Benchmark on a Hybrid of Tabular and Textual Content in Finance |
Mathematical Reasoning | FinQA | 2021 | 8,281 | Question+Table+Text | Number | Operation | Answer based on tables | FinQA: A Dataset of Numerical Reasoning over Financial Data |
Mathematical Reasoning | ConvFinQA | 2022 | 3,892 | Question+Table+Dialog | Number | Operation | Multi-turn dialogs | ConvFinQA: Exploring the Chain of Numerical Reasoning in Conversational Finance Question Answering |
Mathematical Reasoning | MATH | 2021 | 12,500 | Question | Number | Natural Language | Challenging competition math problems | Measuring Mathematical Problem Solving With the MATH Dataset |
Mathematical Reasoning | NumGLUE | 2022 | 101,835 | Question+Text | Number+Span | - | Multi-task benchmark | NumGLUE: A Suite of Fundamental yet Challenging Mathematical Reasoning Tasks |
Mathematical Reasoning | LILA | 2022 | 133,815 | Question+Text | Free-form | Program | Multi-task benchmark | LILA: A Unified Benchmark for Mathematical Reasoning |
Commonsense Reasoning | ARC | 2021 | 7,787 | Question | Option | - | From science exam | Think you have Solved Direct-Answer Question Answering? Try ARC-DA, the Direct-Answer AI2 Reasoning Challenge |
Commonsense Reasoning | OpenBookQA | 2018 | 5,957 | Question+Context | Option | - | Open-book knowledge | Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering |
Commonsense Reasoning | PIQA | 2020 | 21,000 | Goal+Solution | Option | - | Physical commonsense knowledge | PIQA: Reasoning about Physical Commonsense in Natural Language |
Commonsense Reasoning | CommonsenseQA | 2019 | 12,247 | Question | Option | - | Derived from ConceptNet | CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge |
Commonsense Reasoning | CommonsenseQA 2.0 | 2021 | 14,343 | Question | Yes/No | - | High-quality annotation via gamification | CommonsenseQA 2.0: Exposing the Limits of AI through Gamification |
Commonsense Reasoning | Event2Mind | 2018 | 25,000 | Event | Intent+Reaction | - | Intention commonsense reasoning | Event2Mind: Commonsense Inference on Events, Intents, and Reactions |
Commonsense Reasoning | MC-TACO | 2019 | 13,225 | Question | Option | - | Event temporal commonsense reasoning | "Going on a vacation" takes longer than "Going for a walk": A Study of Temporal Commonsense Understanding |
Commonsense Reasoning | CosmosQA | 2019 | 35,588 | Question+Paragraph | Option | - | Narrative commonsense reasoning | Cosmos QA: Machine Reading Comprehension with Contextual Commonsense Reasoning |
Commonsense Reasoning | ComValidation | 2019 | 11,997 | Statement | Option | - | Commonsense verification | Does it Make Sense? And Why? A Pilot Study for Sense Making and Explanation |
Commonsense Reasoning | ComExplanation | 2019 | 11,997 | Statement | Option/Free-form | - | Commonsense explanation | Does it Make Sense? And Why? A Pilot Study for Sense Making and Explanation |
Commonsense Reasoning | StrategyQA | 2021 | 2,780 | Question | Yes/No | - | Multi-hop commonsense reasoning | Did Aristotle Use a Laptop? A Question Answering Benchmark with Implicit Reasoning Strategies |
Symbolic Reasoning | Last Letter Concat | 2022 | - | Words | Letters | - | Rule-based | Chain-of-Thought Prompting Elicits Reasoning in Large Language Models |
Symbolic Reasoning | Coin Flip | 2022 | - | Statement | Yes/No | - | Rule-based | Chain-of-Thought Prompting Elicits Reasoning in Large Language Models |
Symbolic Reasoning | Reverse List | 2022 | - | List | Reversed List | - | Rule-based | Chain-of-Thought Prompting Elicits Reasoning in Large Language Models |
Symbolic Reasoning | BigBench | 2022 | - | - | - | - | Contains multiple symbolic reasoning datasets | Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models |
Symbolic Reasoning | BigBench-Hard | 2023 | - | - | - | - | Contains multiple symbolic reasoning datasets | Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them |
Logical Reasoning | ReClor | 2020 | 6,138 | Question+Context | Option | - | Questions from GMAT and LSAT | ReClor: A Reading Comprehension Dataset Requiring Logical Reasoning |
Logical Reasoning | LogiQA | 2020 | 8,678 | Question+Paragraph | Option | - | Questions from China Civil Service Exam | LogiQA: A Challenge Dataset for Machine Reading Comprehension with Logical Reasoning |
Logical Reasoning | ProofWriter | 2021 | 20,192 | Question+Rule | Answer+Proof | Entailment Tree | Reasoning process generation | ProofWriter: Generating Implications, Proofs, and Abductive Statements over Natural Language |
Logical Reasoning | FOLIO | 2022 | 1,435 | Conclusion+Premise | Yes/No | - | First-order logic | FOLIO: Natural Language Reasoning with First-Order Logic |
Logical Reasoning | DEER | 2024 | 1,200 | Fact | Rule | - | Inductive reasoning | Language Models as Inductive Reasoners |
Logical Reasoning | PrOntoQA | 2023 | - | Question+Context | Yes/No+Process | First-Order Logic | Deductive reasoning | Language Models Are Greedy Reasoners: A Systematic Formal Analysis of Chain-of-Thought |
Multimodal Reasoning | VCR | 2019 | 264,720 | Question+Image | Option | Natural Language | Visual commonsense reasoning | From Recognition to Cognition: Visual Commonsense Reasoning |
Multimodal Reasoning | VisualCOMET | 2020 | 1,465,704 | Image+Event | Action+Intent | - | Visual commonsense reasoning | VisualCOMET: Reasoning About the Dynamic Context of a Still Image |
Multimodal Reasoning | PMR | 2022 | 15,360 | Image+Background | Option | - | Premise-based multi-modal reasoning | Premise-based Multimodal Reasoning: Conditional Inference on Joint Textual and Visual Clues |
Multimodal Reasoning | ScienceQA | 2022 | 21,208 | Q+Image+Context | Option | Natural Language | Multi-modal reasoning with NL rationales | Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering |
Multimodal Reasoning | VLEP | 2020 | 28,726 | Premise+Video | Option | - | Video event prediction | What is More Likely to Happen Next? Video-and-Language Future Event Prediction |
Multimodal Reasoning | CLEVRER | 2020 | 305,280 | Question+Video | Option/Free-form | Program | Video temporal and causal reasoning | CLEVRER: Collision Events for Video Representation and Reasoning |
Multimodal Reasoning | STAR | 2021 | 600,000 | Question+Video | Option | - | Video situated reasoning | STAR: A Benchmark for Situated Reasoning in Real-World Videos |
Multimodal Reasoning | NExT-QA | 2021 | 47,692 | Question+Video | Option | - | Video temporal, causal, commonsense reasoning | NExT-QA: Next Phase of Question-Answering to Explaining Temporal Actions |
Multimodal Reasoning | Causal-VidQA | 2022 | 107,600 | Question+Video | Free-form | Natural Language | Video causal and commonsense reasoning | From Representation to Reasoning: Towards both Evidence and Commonsense Reasoning for Video Question-Answering |
Multimodal Reasoning | News-KVQA | 2022 | 1,041,352 | Q+V+KG | Option | - | Video reasoning with external knowledge | NewsKVQA: Knowledge-Aware News Video Question Answering |
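
The rule-based symbolic tasks in the table (Last Letter Concat, Coin Flip, Reverse List) have no fixed size because their examples can be generated on demand. The following is a minimal sketch of what such rules might look like. All function names and the prompt wording are illustrative assumptions, not the generation scripts used by the cited paper.

```python
import random


def last_letter_concat(words):
    """Concatenate the last letter of each word, e.g. ["Elon", "Musk"] -> "nk"."""
    return "".join(word[-1] for word in words)


def coin_flip(flips, starts_heads_up=True):
    """Return True if the coin is heads up after the given sequence of actions.

    `flips` is a list of booleans; True means that person flipped the coin.
    """
    heads_up = starts_heads_up
    for flipped in flips:
        if flipped:
            heads_up = not heads_up
    return heads_up


def reverse_list(items):
    """Return a new list with the items in reverse order."""
    return list(reversed(items))


def make_coin_flip_example(names, rng):
    """Build one (question, answer) pair; the phrasing here is hypothetical."""
    flips = [rng.random() < 0.5 for _ in names]
    parts = ["A coin is heads up."]
    for name, flipped in zip(names, flips):
        verb = "flips" if flipped else "does not flip"
        parts.append(f"{name} {verb} the coin.")
    parts.append("Is the coin still heads up?")
    answer = "yes" if coin_flip(flips) else "no"
    return " ".join(parts), answer


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the generated example is reproducible
    question, answer = make_coin_flip_example(["Ann", "Bob", "Cy"], rng)
    print(question, "->", answer)
    print(last_letter_concat(["Elon", "Musk"]))
    print(reverse_list(["a", "b", "c"]))
```

Because the gold answer is computed by the same rule that builds the question, the number of steps (words, flips, list items) can be varied freely, for instance to probe behavior on inputs longer than those shown in the prompt.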