The repository for Bridging Research and Readers: A Multi-Modal Automated Academic Papers Interpretation System

✨ Latest News

  • [10/06/2024]: Release Version Update:
    • Optimized code structure for easier comprehension and secondary development.
    • Introduced a video_demo to enhance understanding of MMAPIS functionalities and usage.
  • [08/06/2024]: New Pre-Release Version:
    • Middle-End Services Added: This version introduces middle-end services, allowing the file system, API service, and front-end system to operate independently.
    • Optimized Interactive Experience: Enhancements have been made to improve the user interaction across all applications.
  • [03/26/2024]: Launch of Interactive Functionality in Streamlit.
  • [03/26/2024]: Deployment of Multimodal Question Answering (QA) Generation System.
  • [02/20/2024]: Incorporated updates to Open APIs, complete with relevant calling methods, in api_usage.ipynb.
  • [02/05/2024]: Released the preprocessing code.

TODO List

  • Open-source the core code of each component.
  • Provide APIs for all components to facilitate user integration.
  • Refine the User Interface (UI) for an enhanced interactive experience.
  • Improve the Multimodal QA experience.
  • Boost code execution speed and streamline code architecture.
  • Add a Chinese language version.

TOC

  • ⚑ Introduction
  • 🎯 MMAPIS Architecture
  • Demo
  • πŸ“š Evaluation
  • πŸ“‹ Project Framework
  • πŸš€ How to Run
  • Acknowledgement
  • πŸ“© Contact
  • Citation

⚑ Introduction

Welcome to the repository of MMAPIS!

In the contemporary information era, significantly accelerated by the advent of Large Language Models (LLMs), the proliferation of scientific literature is reaching unprecedented levels. Researchers urgently require efficient tools for reading and summarizing academic papers, uncovering significant scientific literature, and employing diverse interpretative methodologies. To address this burgeoning demand, automated scientific literature interpretation systems have become paramount. However, prevailing models, both commercial and open-source, confront notable challenges: they often overlook multimodal data, grapple with summarizing over-length texts, and lack diverse user interfaces. In response, we introduce an open-source multi-modal automated academic paper interpretation system (MMAPIS) with a three-stage pipeline that incorporates LLMs to augment its functionality. Our system first employs a hybrid modality preprocessing and alignment module to extract plain text and tables or figures from documents separately. It then aligns this information based on the section names it belongs to, ensuring that data with identical section names are grouped under the same section. Following this, we introduce a hierarchical discourse-aware summarization method, which uses the extracted section names to divide the article into shorter text segments, facilitating targeted summarization both within and between sections via LLMs with section-specific prompts. Finally, we have designed four types of diversified user interfaces, including paper recommendation, multimodal Q&A, audio broadcasting, and interpretation blog, which can be widely applied across various scenarios. Our qualitative and quantitative evaluations underscore the system's superiority, especially in scientific summarization, where it outperforms solutions relying solely on GPT-4. We hope our work presents an open-source, user-centered solution that addresses the critical needs of the scientific community in our rapidly evolving digital landscape.

🎯 MMAPIS Architecture

Our system comprises three main parts: (1) Hybrid Modality Preprocessing and Alignment Module; (2) Hierarchical Discourse-Aware Summarization Module; (3) Diverse Multimodal User Interface Module. Firstly, the Hybrid Modality Preprocessing and Alignment Module effectively processes and integrates different types of information from papers separately, including text, images, and tables. Next, the Hierarchical Discourse-Aware Summarization Module utilizes advanced LLMs with special prompts to extract key information from each section of a paper, generating comprehensive and accurate summaries. Lastly, our system features a Diverse Multimodal User Interface Module, designed to meet the specific needs of various user groups, offering multiple interaction modes like paper recommendations, multimodal Q&A, audio broadcasting, and interpretation blogs.

1. Hybrid Modality Preprocessing and Alignment Module

In our Hybrid Modality Preprocessing and Alignment Module, we innovatively combine the strengths of Nougat and PDFFigures 2.0 to transform PDF documents into Markdown format while preserving their essential multimodal content and hierarchical structure. Nougat's transformer-based model is adept at extracting textual content, including plaintext and mathematical formulas, directly from PDFs, ensuring minimal data loss and maintaining the document's inherent structure. Complementarily, PDFFigures 2.0 identifies and extracts critical figures and tables, including their attribution details. This dual approach not only preserves the rich multimodal information of the original document but also facilitates the alignment of these modalities with their respective structural elements, like section titles. The result is a semantically comprehensive and high-quality Markdown document that closely mirrors the original PDF, retaining its key experimental results and concepts, particularly vital in scientific documentation.
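To make the alignment concrete, here is a minimal sketch of the idea (an illustration only, not the shipped module in backend/preprocessing/aligner/; in particular, the figures list with a precomputed "section" field is an assumption for this example):

import re

def align_figures(markdown: str, figures: list[dict]) -> str:
    """Attach extracted figures to the Markdown section they belong to.

    `figures` is assumed (for illustration) to be a list of
    {"section": ..., "caption": ..., "path": ...} dicts derived from
    PDFFigures 2.0 output after section attribution.
    """
    # Split on Markdown headings while keeping the heading lines themselves.
    parts = re.split(r"^(#+\s.*)$", markdown, flags=re.MULTILINE)
    out, current = [], None
    for part in parts:
        out.append(part)
        if re.match(r"^#+\s", part):
            current = part.lstrip("#").strip()
        else:
            # Append every figure whose section name matches the current section.
            for fig in figures:
                if current and fig["section"].lower() == current.lower():
                    out.append(f'\n![{fig["caption"]}]({fig["path"]})\n')
    return "".join(out)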

2. Hierarchical Discourse-Aware Summarization Module

In this module, we tackle the complexity of summarizing long documents through a novel two-stage Hierarchical Discourse-aware Summarization process. Initially, the document is segmented into sections based on hierarchical cues in the Markdown format, moving away from the rigid structure used in previous models like FacetSum. This segmentation allows for concise, section-level summaries that maintain the document's semantic integrity, with the flexibility to exclude non-essential parts such as appendices. Customized prompts, aligned with the generic structure of scientific papers, guide the extraction of key points from each section, while a universal prompt caters to sections without specific prompts. The second stage involves integrating these section-level summaries into a cohesive, document-level summary, emphasizing continuity and enriched with key information like titles and author details identified via NER technology. This method overcomes challenges like semantic fragmentation, offering a comprehensive and contextually rich summary suitable for interpretation systems.
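In code terms, the two stages can be sketched roughly as follows (a simplified illustration of the flow, not the shipped Summarizer; the prompts shown are placeholders rather than the contents of prompts_config.json, and the OpenAI client usage assumes the standard Chat Completions API):

from openai import OpenAI

client = OpenAI()  # reads the API key from the environment; see config.yaml

SECTION_PROMPTS = {  # illustrative stand-ins for prompts_config.json
    "introduction": "Summarize the motivation and contributions of this section.",
    "method": "Summarize the proposed approach and its key components.",
}
UNIVERSAL_PROMPT = "Summarize the key points of this section."

def summarize(prompt: str, text: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "system", "content": prompt},
                  {"role": "user", "content": text}],
    )
    return resp.choices[0].message.content

def hierarchical_summary(sections: list[tuple[str, str]]) -> str:
    # Stage 1: section-level summaries, each guided by a section-specific
    # prompt (or the universal prompt when no specific one exists).
    section_summaries = [
        summarize(SECTION_PROMPTS.get(name.lower(), UNIVERSAL_PROMPT), text)
        for name, text in sections
    ]
    # Stage 2: integrate the section summaries into one coherent
    # document-level summary.
    return summarize(
        "Integrate these section summaries into one fluent, coherent summary.",
        "\n\n".join(section_summaries),
    )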

3. Diverse Multimodal User Interface Module

Our Diverse Multimodal User Interface Module, built on a Streamlit-based interface, offers four distinct applications tailored for various user needs, based on the Hierarchical Discourse-Aware Summarization Module's output. The module includes a Paper Recommendation feature that evaluates research papers on multiple dimensions using LLM prompts, focusing on both summary and original text for a comprehensive assessment. A Multimodal Q&A feature enhances traditional Q&A by integrating queries about visual elements in papers, leveraging GPT models for precise responses. An Audio Broadcasting option converts summaries into easily digestible audio formats, ideal for quick information consumption, utilizing advanced text-to-speech technology. Lastly, an Interpretation Blog tool transforms summaries into detailed, reader-friendly blogs, offering in-depth exploration and understanding of the papers. This module effectively combines textual and visual data, ensuring a versatile and user-centric experience in engaging with research content.
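For a flavor of how the Paper Recommendation scoring could be prompted (a hypothetical sketch; the shipped logic lives in backend/downstream/paper_recommendation/recommendation.py, and the dimension names here merely echo the evaluation table below):

import json
from openai import OpenAI

client = OpenAI()

RATING_PROMPT = (
    "Rate the following paper summary from 1 to 5 on each dimension and reply "
    'with JSON only, e.g. {"informative": 4, "quality": 4, "coherence": 5}.'
)

def recommendation_score(summary: str) -> dict:
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "system", "content": RATING_PROMPT},
                  {"role": "user", "content": summary}],
    )
    # Assumes the model honors the JSON-only instruction.
    return json.loads(resp.choices[0].message.content)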

Demo

We provide two versions of the demo video that showcase what our system can do and how to use it:

  • English version: ./example/MMAPIS_demo_en.mp4
  • Chinese version: ./example/MMAPIS_demo_zh.mp4

If you have cloned this repository locally, you can watch the videos directly from the example/ directory.

πŸ“š Evaluation

| Model | Method | Dataset | Informative | Quality | Coherence | Attributable | Overall | Average Score |
|---|---|---|---|---|---|---|---|---|
| GPT-4o | MMAPIS | 2017 | 4.708 | 4.562 | 4.523 | 4.786 | 4.616 | 4.639 |
| GPT-4o | MMAPIS | 2023 | 4.674 | 4.523 | 4.531 | 4.689 | 4.577 | 4.593 |
| GPT-4o | Direct | 2017 | 4.633 | 4.505 | 4.508 | 4.631 | 4.560 | 4.567 |
| GPT-4o | Direct | 2023 | 4.611 | 4.496 | 4.506 | 4.618 | 4.548 | 4.556 |
| GPT-4 | MMAPIS | 2017 | 4.581 | 4.469 | 4.495 | 4.547 | 4.504 | 4.519 |
| GPT-4 | MMAPIS | 2023 | 4.565 | 4.454 | 4.434 | 4.539 | 4.480 | 4.494 |
| GPT-4 | Direct | 2017 | 4.475 | 4.388 | 4.381 | 4.571 | 4.431 | 4.449 |
| GPT-4 | Direct | 2023 | 4.469 | 4.395 | 4.399 | 4.633 | 4.439 | 4.467 |
| GPT-3.5-Turbo | MMAPIS | 2017 | 4.515 | 4.391 | 4.381 | 4.519 | 4.439 | 4.449 |
| GPT-3.5-Turbo | MMAPIS | 2023 | 4.499 | 4.393 | 4.383 | 4.500 | 4.426 | 4.440 |
| GPT-3.5-Turbo | Direct | 2017 | 4.225 | 4.395 | 4.383 | 4.465 | 4.254 | 4.344 |
| GPT-3.5-Turbo | Direct | 2023 | 4.199 | 4.375 | 4.381 | 4.455 | 4.218 | 4.326 |
| Mistral | MMAPIS | 2017 | 4.367 | 4.213 | 4.217 | 4.289 | 4.260 | 4.269 |
| Mistral | MMAPIS | 2023 | 4.366 | 4.227 | 4.223 | 4.258 | 4.259 | 4.267 |
| Mistral | Direct | 2017 | 4.065 | 3.954 | 3.934 | 4.139 | 3.980 | 4.014 |
| Mistral | Direct | 2023 | 3.913 | 3.933 | 3.883 | 3.982 | 3.924 | 3.927 |

Please refer to the tech report for more analysis results.

πŸ“‹ Project Framework

MMAPIS/
β”œβ”€β”€ main.py                         # Main workflow orchestrating the application
β”‚
β”œβ”€β”€ middleware/                     # Middleware-side file management
β”‚   β”œβ”€β”€ config/                     # Configuration files for middleware functionalities
β”‚   β”‚
β”‚   β”œβ”€β”€ utils/                      # General utility functions for middleware
β”‚   └── middleware.py               # Middleware functions for request handling
β”‚
β”œβ”€β”€ backend/                        # Main server-side system components
β”‚   β”œβ”€β”€ config/                     # Configuration files for the backend
β”‚   β”‚   β”œβ”€β”€ templates/              # HTML templates for downstream applications
β”‚   β”‚   β”œβ”€β”€ __init__.py             # Initialization file for the config module
β”‚   β”‚   β”œβ”€β”€ config.py               # Main configuration script for backend settings
β”‚   β”‚   β”œβ”€β”€ config.yaml             # Central configuration file in YAML format
β”‚   β”‚   β”œβ”€β”€ install.sh              # Installation script for dependencies
β”‚   β”‚   β”œβ”€β”€ logging.ini             # Main logging configuration file
β”‚   β”‚   β”œβ”€β”€ prompts_config.json     # Configuration for prompt generation
β”‚   β”‚   └── requirements.txt        # Python package dependencies
β”‚   β”‚
β”‚   β”œβ”€β”€ pretrained_w/               # Directory for pre-trained models (e.g., Nougat model)
β”‚   β”‚
β”‚   β”œβ”€β”€ data_structure/             # Data structures for management and organization
β”‚   β”‚
β”‚   β”œβ”€β”€ preprocessing/              # Preprocessing functionalities for input data
β”‚   β”‚   β”œβ”€β”€ aligner/                # Scripts for aligning different data modalities
β”‚   β”‚   β”œβ”€β”€ arxiv_extractor/        # Tool for fetching academic documents from arXiv
β”‚   β”‚   β”œβ”€β”€ nougat/                 # Tool for rich text and formula extraction
β”‚   β”‚   β”œβ”€β”€ pdffigure/              # Tool for extracting images and tables from PDFs
β”‚   β”‚   └── __init__.py             # Initialization file for the preprocessing module
β”‚   β”‚
β”‚   β”œβ”€β”€ summarization/              # Two-stage summarization processes
β”‚   β”‚   β”œβ”€β”€ Summarizer.py           # Main summarization script
β”‚   β”‚   β”œβ”€β”€ SectionSummarizer.py    # Script for generating section-level summaries
β”‚   β”‚   └── DocumentSummarizer.py   # Script for integrating section summaries into a document-level summary
β”‚   β”‚
β”‚   β”œβ”€β”€ downstream/                 # Functionalities for downstream applications
β”‚   β”‚   β”œβ”€β”€ paper_recommendation/   # Module for recommending academic papers
β”‚   β”‚   β”‚   └── recommendation.py   # Script for executing paper recommendations
β”‚   β”‚   β”œβ”€β”€ multimodal_qa/          # Multimodal Question and Answering module
β”‚   β”‚   β”‚   β”œβ”€β”€ user_intent.py      # Script for determining user intent
β”‚   β”‚   β”‚   └── answer_generation.py # Script for generating answers based on user queries
β”‚   β”‚   β”œβ”€β”€ audio_broadcast/        # Module for audio broadcasting functionalities
β”‚   β”‚   β”‚   β”œβ”€β”€ script_conversion.py # Script for converting text to audio scripts
β”‚   β”‚   β”‚   └── tts_integration.py   # Integration script for Text-to-Speech capabilities
β”‚   β”‚   └── blog_generation/        # Module for generating blog content
β”‚   β”‚       β”œβ”€β”€ blog_script.py      # Script for creating blog posts
β”‚   β”‚       └── image_integration.py # Script for integrating images into blog content
β”‚   β”‚
β”‚   └── tools/                      # Various utility tools for application enhancement
β”‚       β”œβ”€β”€ chatgpt/                # ChatGPT-related tools for processing tasks
β”‚       β”‚   β”œβ”€β”€ chatgpt_helper.py   # Helper functions for ChatGPT interaction
β”‚       β”‚   └── llm_helper.py       # Helper functions for large language models
β”‚       └── utils/                  # Additional utility functions
β”‚           β”œβ”€β”€ save_file.py        # Utility for saving files
β”‚           └── llm_helper.py       # Functions aiding large language model operations
β”‚
└── client/                         # Client-side visualization interface
    β”œβ”€β”€ api_usage.ipynb             # Notebook documenting API usage and examples
    β”œβ”€β”€ config/                     # Configuration settings for the client
    └── app.py                      # Interactive functionality implemented with Streamlit

πŸš€ How to Run

Installation Guide

$ git clone https://github.com/fjiangAI/MMAPIS.git
$ cd MMAPIS

Before Installing Dependent Libraries

  • Ensure Conda is Installed: Make sure that Conda is installed on your system and the conda command is available in your PATH.
  • Check Permissions: Ensure you have the necessary permissions to execute the commands.

Backend

  1. Navigate to the configuration directory:

    cd path/to/MMAPIS
    cd backend/config
  2. Make the installation script executable:

    chmod +x install.sh
  3. Run the installation script:

    ./install.sh

    Alternatively, you can manually execute the following steps:

    1. Create and activate the Conda environment:

      conda create -n MMAPIS_backend python=3.11 -y
      conda activate MMAPIS_backend
    2. Install dependencies:

      pip install -r requirements.txt
  4. To use pdffigures2 to extract figures and tables from scientific papers, you need to configure the Java environment:

    $ export JAVA_HOME="your/path/to/java"
  5. For spaCy to calculate the similarity of title-like keywords, you need to download en_core_web_lg with the following command (a short usage sketch follows this list):

    $ python -m spacy download en_core_web_lg
  6. Make sure to set up your parameters in config.yaml (an illustrative fragment follows this list).
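A quick way to sanity-check the model downloaded in step 5 (illustrative only):

    import spacy

    # en_core_web_lg ships word vectors, so Doc.similarity is meaningful.
    nlp = spacy.load("en_core_web_lg")
    print(nlp("Experimental Results").similarity(nlp("Experiments")))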
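For step 6, the exact schema of config.yaml depends on your deployment, but at a minimum the OpenAI credentials used for summarization need to be filled in. A hypothetical fragment (the key names are illustrative, not the shipped schema):

    openai:
      api_key: "sk-..."                      # your OpenAI API key
      base_url: "https://api.openai.com/v1"  # or your proxy endpoint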

Middleware

  1. Navigate to the configuration directory:

    cd path/to/MMAPIS
    cd middleware/config
  2. Make the installation script executable:

    chmod +x install.sh
  3. Run the installation script:

    ./install.sh

    Alternatively, you can manually execute the following steps:

    1. Create and activate the Conda environment:

      conda create -n MMAPIS_middleware python=3.11 -y
      conda activate MMAPIS_middleware
    2. Install dependencies:

      pip install -r requirements.txt
  4. Make sure to set up your parameters in config.yaml.

Client

  1. Navigate to the client directory:

    cd path/to/MMAPIS
    cd client
  2. Create and activate the Conda environment:

    conda create -n MMAPIS_client python=3.11 -y
    conda activate MMAPIS_client
  3. Install dependencies:

    pip install -r requirements.txt
  4. Start the client application:

    streamlit run app.py
  5. Make sure to set up your parameters in config.yaml.

Run End-to-End

  1. To ensure the proper functioning of the system, which primarily relies on the OpenAI API for summarization tasks, it is essential to configure the backend settings before initiating any operations. Specifically, you need to edit the configuration file located at config.yaml to include your api_key and base_url.

  2. Run the Entire Process End-to-End

    You can run the whole process by executing main.py.

    The features are:

    • Raw Markdown file of the PDF with plain text
    • Markdown file with figures aligned to their corresponding sections (multimodal)
    • Section-level summary Markdown file with multimodal content
    • Document-level summary Markdown file with multimodal content (based on the section-level summary, but more focused on fluency and coherence)
    • Downstream Applications:
      • Recommendation Score
      • Broadcast Style Generation
      • Blog Style Generation
      • Multimodal QA (Due to Markdown compatibility issues, QA cannot be generated directly from main.py. Currently, only front-end interactive generation is supported.)

    To execute, run:

    
    cd path/to/MMAPIS
    python main.py --pdf <your/pdf1/path> <your/pdf2/path> -s <your/save/dir> --app <your_preference_downstream_application>
    

    For more parameter configuration of main.py, you can run:

    
    $ python main.py --help
    options:
      -h, --help            show this help message and exit
      -p [PDF ...], --pdf [PDF ...]
                            List of PDF files or directories or Arxiv links
      -k API_KEY, --api_key API_KEY
                            OpenAI API key
      -b BASE_URL, --base_url BASE_URL
                            Base URL for OpenAI API
      -kw KEYWORD, --keyword KEYWORD
                            Keyword for Arxiv search if PDFs are not provided
      -dt DAILY_TYPE, --daily_type DAILY_TYPE
                            Type of daily Arxiv papers if PDFs are not provided and keyword is not provided
      -d, --download        Download PDFs from Arxiv if PDFs are not provided
      -s SAVE_DIR, --save_dir SAVE_DIR
                            Directory to save the results
      --recompute           Recompute the results
      --all                 Process all downstream tasks after summarization, including recommendation, blog, and broadcast
      --app {recommendation,blog,broadcast}
                            Specify the downstream task to run, choose from recommendation, blog, broadcast
    

    If you process xxx.pdf, you will get the following output structure:

    
    MMAPIS/
    β”œβ”€β”€ your_save_dir/                       # Output file directory
    β”‚   β”œβ”€β”€ xxx/                   # Directory named after the processed file
    β”‚   β”‚   β”œβ”€β”€ img/               # Directory for storing images used in Markdown
    β”‚   β”‚   β”œβ”€β”€ aligned_raw_md_text.md # Raw Markdown file with aligned text
    β”‚   β”‚   β”œβ”€β”€ broadcast.mp3      # MP3 file for broadcast style generation
    β”‚   β”‚   β”œβ”€β”€ broadcast.md       # Markdown file for broadcast style generation
    β”‚   β”‚   β”œβ”€β”€ aligned_section_level_summary.md  # Section-level summary Markdown file with aligned content
    β”‚   β”‚   β”œβ”€β”€ aligned_document_level_summary.md # Document-level summary Markdown file with aligned content
    β”‚   β”‚   β”œβ”€β”€ recommendation.md  # Markdown file containing the recommendation score
    β”‚   β”‚   └── blog.md            # Markdown file for blog style generation
    

Run with Three Servers

To run the backend, middleware, and frontend as separate services, start each one as follows:

  • For the backend:

    $ cd path/to/MMAPIS
    $ cd backend
    $ uvicorn backend:app --reload --port <your port> --host <your host>

    If you encounter an ImportError stating "No module named MMAPIS", you may need to add the root directory to the PYTHONPATH in each terminal session. To do this, run the following command:

    $ export PYTHONPATH=../..

    This command ensures that Python can locate the MMAPIS module by including the parent directory in its search path.

  • For middleware, you can start your file server in the following way:

    $ cd path/to/MMAPIS
    $ cd middleware
    $ uvicorn middleware:app --reload --port <your port> --host <your host>
  • For the frontend:

    $ cd path/to/MMAPIS
    $ cd client
    $ streamlit run app.py --server.port <your port>

Notes:

  • Configuration: Don't forget to update your config.yaml file with the correct URLs for each server, ensuring that the client, middleware, and backend can communicate with each other (an illustrative fragment appears after these notes).

  • Multimodal QA Generation:

    • For section-specific questions: Include the specific chapter index in your query. For example:
      • What's the main idea of chapter 2? (Chapter index can be found before the section title)
    • For image-specific questions: Include the specific image index in your query. For example:
      • What's the main idea of image 2?
      • What's the main idea of image 2 in chapter 2?

    The first question uses the global image index, which refers to the third image (index starts from 0) in the entire document. The second question uses the image index within a specific chapter, making it easier to locate the target image.
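A minimal sketch of how such queries could be parsed into chapter and image indices (hypothetical; the shipped intent logic lives in backend/downstream/multimodal_qa/user_intent.py):

    import re

    def parse_query(query: str) -> dict:
        """Extract optional chapter and image indices from a QA query."""
        chapter = re.search(r"chapter\s+(\d+)", query, re.IGNORECASE)
        image = re.search(r"image\s+(\d+)", query, re.IGNORECASE)
        return {
            "chapter": int(chapter.group(1)) if chapter else None,
            # Without a chapter, the image index is interpreted globally.
            "image": int(image.group(1)) if image else None,
        }

    print(parse_query("What's the main idea of image 2 in chapter 2?"))
    # -> {'chapter': 2, 'image': 2}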
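And for the configuration note above, a hypothetical config.yaml fragment wiring the three servers together (key names are illustrative, not the shipped schema):

    backend_url: "http://localhost:8000"     # where uvicorn serves the backend
    middleware_url: "http://localhost:8001"  # where uvicorn serves the middleware
    # The client reads these URLs to reach the other two services.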

Acknowledgement

πŸ“© Contact

If you have any questions, please feel free to contact me.

Citation

@misc{jiang2024bridging,
      title={Bridging Research and Readers: A Multi-Modal Automated Academic Papers Interpretation System}, 
      author={Feng Jiang and Kuang Wang and Haizhou Li},
      year={2024},
      eprint={2401.09150},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

We are from the School of Data Science, the Chinese University of Hong Kong, Shenzhen (CUHKSZ), and the Shenzhen Research Institute of Big Data (SRIBD). We welcome aspiring individuals to join our group and contribute to the new era of LLMs.
