DocumentGraph is a Python package designed for end-to-end document analysis, using an ETL (Extract, Transform, Load) pipeline to process textual documents and represent the extracted information in a Neo4j knowledge graph. The package extracts text from documents, preprocesses and chunks the content, generates embeddings, and identifies entities and relationships within the text. These entities, relationships, and text chunks are then loaded into a Neo4j graph database for advanced analysis and querying.
This package is ideal for users who need to process large volumes of documents and structure them into a graph-based knowledge representation, where entities and their relationships can be explored and queried efficiently.
- Document extraction: Loads documents from a specified input folder.
- Text preprocessing: Cleans and chunks the document into smaller, meaningful pieces.
- Embedding generation: Generates vector representations for text chunks.
- Entity and relationship extraction: Detects entities and relationships within the text using a knowledge extraction model.
- Knowledge graph loading: Loads documents, text chunks, entities, and relationships into a Neo4j graph database.
- The package assumes that the input documents are in .txt format.
- Preprocessing and extraction pipelines are designed for text data only.
- The performance depends on the quality of the pre-trained embedding models and entity extraction logic.
- Neo4j must be set up and running (locally or via AuraDB) with proper credentials for the package to function.
Install the DocumentGraph package using pip:
pip install documentgraph
The package requires a Neo4j database to store and query the knowledge graph. You can either use Neo4j Aura (cloud-based) or run a local Neo4j instance.
- Sign up for a free or paid Neo4j Aura account at https://aura.neo4j.io/.
- Create a new Neo4j project and note down the uri, username, and password for the connection.
- Download and install Neo4j Desktop from https://neo4j.com/download/.
- Start a new local graph database instance.
- The default local connection uri is usually bolt://localhost:7687, and the default username/password is neo4j/neo4j.
You need to set up environment variables to allow the package to connect to the Neo4j database. You can add these variables to your shell environment or use a .env file.
export NEO4J_URI=bolt://localhost:7687 # For local Neo4j instance
export NEO4J_USER=neo4j
export NEO4J_PASSWORD=password
export OPENAI_API_KEY=your-openai-api-key
If you are using Neo4j Aura, replace the URI and credentials accordingly:
export NEO4J_URI=neo4j+s://your-aura-database-uri
export NEO4J_USER=your-username
export NEO4J_PASSWORD=your-password
export OPENAI_API_KEY=your-openai-api-key
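A missing variable usually surfaces later as a connection or authentication error, so it can help to check the environment before running the pipeline. The following is a minimal sketch, not part of DocumentGraph: the helper names missing_env_vars and check_environment are hypothetical, and the variable list is simply copied from the export commands above.

```python
import os

# Variable names taken from the export commands above.
REQUIRED_VARS = ("NEO4J_URI", "NEO4J_USER", "NEO4J_PASSWORD", "OPENAI_API_KEY")


def missing_env_vars(env=None):
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


def check_environment(env=None):
    """Raise a readable error if any required variable is missing."""
    missing = missing_env_vars(env)
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
```

Calling check_environment() with no arguments inspects the current process environment; passing a dictionary instead makes the check easy to test.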
For proper relationship creation, ensure you have the APOC (Awesome Procedures on Cypher) plugin installed in your Neo4j instance. This is necessary for creating custom relationships between entities and text chunks.
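One way to confirm that APOC is available is to call its apoc.version() function from Cypher. The sketch below assumes the official neo4j Python driver (pip install neo4j), which is a separate package from DocumentGraph; the helper names apoc_version and check_apoc are hypothetical.

```python
def apoc_version(session):
    """Return the installed APOC version string.

    `session` is any object with a `run(query)` method whose result
    supports `.single()`, such as a session from the official neo4j driver.
    The query fails with a driver error if APOC is not installed.
    """
    record = session.run("RETURN apoc.version() AS version").single()
    return record["version"]


def check_apoc(uri, user, password):
    """Connect with the official driver and return the APOC version."""
    from neo4j import GraphDatabase  # imported lazily; third-party package

    with GraphDatabase.driver(uri, auth=(user, password)) as driver:
        with driver.session() as session:
            return apoc_version(session)
```

For a local instance, check_apoc("bolt://localhost:7687", "neo4j", "password") would return the version string, or raise an error from the driver if the plugin is missing.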
from documentgraph import ETLConfig, DocumentAnalysisPipeline
# Create an ETLConfig (no credentials are passed here; presumably they are read from the environment variables set above)
etl_config = ETLConfig()
# Initialize the ETL pipeline
pipeline = DocumentAnalysisPipeline(etl_config)
# Execute the pipeline with the input folder containing text documents
pipeline.execute_pipeline(input_folder="path/to/your/text/files")
- Document Extraction: The pipeline reads all .txt files from the specified input folder.
- Text Preprocessing: The text is cleaned and broken down into smaller chunks.
- Embedding Generation: Each chunk gets converted into a vector using a pre-trained embedding model.
- Entity and Relationship Extraction: Entities and relationships between them are identified within the chunks.
- Knowledge Graph Loading: The extracted entities, relationships, and chunks are saved in the Neo4j knowledge graph.
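To make the preprocessing step above more concrete, here is a minimal sketch of cleaning and chunking. It is not DocumentGraph's actual implementation, whose chunking strategy is not described here. It assumes fixed-size word windows with a small overlap, and the names clean_text and chunk_words and the default sizes are hypothetical.

```python
import re


def clean_text(text):
    """Collapse runs of whitespace into single spaces and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()


def chunk_words(text, chunk_size=200, overlap=20):
    """Split cleaned text into chunks of at most `chunk_size` words.

    Consecutive chunks share `overlap` words, so the context around a
    chunk boundary is not lost entirely.
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("require chunk_size > 0 and 0 <= overlap < chunk_size")
    words = clean_text(text).split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks
```

Each resulting chunk would then be passed to the embedding and entity extraction steps. Larger chunks keep more context together, while smaller ones give more precise retrieval.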
Once the pipeline has processed the documents and loaded the data into Neo4j, you can query the graph for insights using Cypher.
For example, to retrieve all entities in the graph:
MATCH (e:Entity) RETURN e LIMIT 10;
To retrieve relationships between entities:
MATCH (e1:Entity)-[r]->(e2:Entity) RETURN e1, r, e2 LIMIT 10;
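The same queries can be run from Python. The sketch below assumes the official neo4j driver (pip install neo4j), which is separate from DocumentGraph, and reuses the Entity label from the queries above. The helper names are hypothetical, and the relationship query is adapted slightly to return type(r) so the result is easy to print.

```python
RELATIONSHIP_QUERY = (
    "MATCH (e1:Entity)-[r]->(e2:Entity) "
    "RETURN e1, type(r) AS rel_type, e2 LIMIT $limit"
)


def fetch_relationships(session, limit=10):
    """Return (source, relationship type, target) tuples.

    Each node is converted to a plain dict of its properties.
    """
    result = session.run(RELATIONSHIP_QUERY, limit=limit)
    return [
        (dict(rec["e1"]), rec["rel_type"], dict(rec["e2"]))
        for rec in result
    ]


def print_relationships(uri, user, password, limit=10):
    """Connect with the official driver and print a few relationships."""
    from neo4j import GraphDatabase  # imported lazily; third-party package

    with GraphDatabase.driver(uri, auth=(user, password)) as driver:
        with driver.session() as session:
            for source, rel_type, target in fetch_relationships(session, limit):
                print(source, f"-[{rel_type}]->", target)
```

Passing the limit as a query parameter rather than formatting it into the string keeps the query text constant and avoids injection problems.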
We welcome contributions to DocumentGraph! Here's how you can help:
If you encounter any bugs or have suggestions for improvements:
- Check the existing issues to avoid duplicates.
- If your issue isn't already listed, open a new issue.
- Clearly describe the problem or enhancement, including steps to reproduce if applicable.
- Add relevant labels (e.g., 'bug', 'enhancement', 'documentation').
To contribute code or documentation improvements:
- Fork the repository.
- Create a new branch for your feature: git checkout -b feature/your-feature-name
- Make your changes, ensuring you follow the project's coding standards.
- Write or update tests as necessary.
- Commit your changes with clear, descriptive commit messages.
- Push to your fork and submit a pull request.
For significant changes that could alter the project's direction:
- Open an issue to discuss your proposal before starting work.
- Outline the rationale and implementation details of your proposal.
- Engage in discussion with maintainers and the community.
- If approved, follow the process for making enhancements.
We appreciate your contributions to making DocumentGraph better!
DocumentGraph is licensed under the Apache License Version 2.0.