Transform earnings calls and financial documents into actionable insights using AI-powered analysis.
- Overview
- Architecture
- Features
- Installation
- Configuration
- Usage
- Project Structure
- Implementation Details
- Examples
- Components
- Testing
- Contributing
- Troubleshooting
- License
- Acknowledgments
EarningsAI Demo is a powerful tool that combines audio transcription, document processing, and AI-powered analysis to help users extract insights from earnings calls and financial documents. Built with Fireworks AI and MongoDB, it provides both a command-line interface and a web application for processing and querying financial data.
- Audio Transcription: Automatically transcribe earnings calls using Fireworks AI's Whisper v3
- Document Processing: Extract text from PDF, DOCX, and TXT files
- Vector Search: Store and retrieve documents using MongoDB's vector search capabilities
- Natural Language Querying: Ask questions about your documents in plain English
- Web Interface: User-friendly UI for uploading documents and querying insights
- Batch Processing: Handle multiple files and documents efficiently
- Python 3.8+
- MongoDB Atlas account
- Fireworks AI API key
- Clone the repository:
git clone https://github.com/yourusername/earnings-ai-demo.git
cd earnings-ai-demo
- Create and activate a virtual environment:
python -m venv .venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
- Install dependencies:
pip install -r requirements.txt
- Update the config/config-example.yaml file with your credentials and rename it to config/config.yaml:
fireworks:
  api_key: "your_fireworks_api_key"
mongodb:
  uri: "your_mongodb_uri"
You can also set credentials using environment variables:
export FIREWORKS_API_KEY="your_fireworks_api_key"
export MONGODB_URI="your_mongodb_uri"
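For reference, here is a minimal sketch of how these settings could be loaded with an environment-variable fallback. The load_config helper below is illustrative only; the project's actual loading logic may differ.

import os

import yaml


def load_config(path="config/config.yaml"):
    """Load credentials from config.yaml, falling back to environment variables."""
    config = {}
    if os.path.exists(path):
        with open(path) as f:
            config = yaml.safe_load(f) or {}
    fireworks_key = config.get("fireworks", {}).get("api_key") or os.environ.get("FIREWORKS_API_KEY")
    mongodb_uri = config.get("mongodb", {}).get("uri") or os.environ.get("MONGODB_URI")
    return fireworks_key, mongodb_uri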
- Start the Streamlit app:
streamlit run earnings_ai_demo/earnings_ai_demo/app.py
- Open your browser at http://localhost:8501
- Upload documents or audio files
- Ask questions about your uploaded content
Process a directory of files:
python earnings_ai_demo/earnings_ai_demo/main.py
The script will:
- Process all audio files in data/audio/
- Process all documents in data/documents/
- Store processed data in MongoDB
- Run sample queries
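For orientation, the sketch below shows roughly how these steps fit together using the classes documented in the Examples section. It is a simplified illustration, not the actual contents of main.py, and it assumes each processed item exposes a text string and a metadata dict:

import asyncio

from earnings_ai_demo.database import DatabaseOperations
from earnings_ai_demo.embedding import EmbeddingGenerator
from earnings_ai_demo.extraction import DocumentExtractor
from earnings_ai_demo.query import QueryInterface
from earnings_ai_demo.transcription import AudioTranscriber


async def run_pipeline(api_key: str, mongodb_uri: str):
    db_ops = DatabaseOperations(mongodb_uri)
    embedding_gen = EmbeddingGenerator(api_key=api_key)

    # 1. Transcribe audio files in data/audio/
    transcriber = AudioTranscriber(api_key=api_key)
    transcripts = await transcriber.transcribe_directory("data/audio")

    # 2. Extract text from documents in data/documents/
    extractor = DocumentExtractor()
    documents = extractor.process_directory("data/documents")

    # 3. Embed and store everything in MongoDB
    #    (assumes each item is a dict with "text" and "metadata" keys)
    for doc in list(transcripts) + list(documents):
        embedding = embedding_gen.generate_document_embedding(doc["text"])
        db_ops.store_document(text=doc["text"], embeddings=embedding, metadata=doc.get("metadata", {}))

    # 4. Run a sample query
    query_interface = QueryInterface(api_key=api_key, database_ops=db_ops)
    result = query_interface.query("What was MongoDB's revenue growth in Q3?")
    print(result["response"])


asyncio.run(run_pipeline("your_fireworks_api_key", "your_mongodb_uri"))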
earnings-ai-demo/
├── config/
│ └── config.yaml # Configuration file
├── data/
│ ├── audio/ # Audio files for processing
│ └── documents/ # Documents for processing
├── earnings_ai_demo/
│ ├── __init__.py
│ ├── app.py # Streamlit web interface
│ ├── database.py # MongoDB operations
│ ├── embedding.py # Vector embedding generation
│ ├── extraction.py # Document text extraction
│ ├── main.py # CLI entry point
│ ├── query.py # Query processing
│ └── transcription.py # Audio transcription
├── tests/
│ ├── conftest.py
│ ├── test_database.py
│ ├── test_embeddings.py
│ ├── test_extraction.py
│ ├── test_query.py
│ └── test_transcription.py
├── requirements.txt
├── setup.py
└── README.md
The system uses Fireworks AI's LLM for processing queries about MongoDB earnings data. Here's how it works:
# From query.py
def query(self, query: str, company_ticker: Optional[str] = None,
          doc_type: Optional[str] = None, num_results: int = 5) -> Dict:
    # Generate embedding for the query
    query_embedding = self.client.embeddings.create(
        input=[query],
        model=self.embedding_model
    ).data[0].embedding

    # Apply filters if specified
    filters = {}
    if company_ticker:
        filters["metadata.company_ticker"] = company_ticker
    if doc_type:
        filters["metadata.document_type"] = doc_type

    # Retrieve relevant documents
    relevant_docs = self.db.query_similar(query_embedding, limit=num_results, filters=filters)

    # Build context from retrieved documents
    context = self._build_context(relevant_docs)

    # System prompt for MongoDB earnings analysis
    system_prompt = "You are analyzing MongoDB earnings data. Use only information from the provided context. Include specific numbers and details when available."

    # Generate response using Fireworks AI
    response = self.client.chat.completions.create(
        model=self.model,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}\n\nProvide an answer using only the context."}
        ]
    )

    return {
        "response": response.choices[0].message.content,
        "sources": relevant_docs
    }
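The _build_context helper referenced above is not reproduced in this README. A plausible minimal version, assuming each retrieved document carries the text and metadata fields projected by the vector search stage, might look like this:

# Illustrative sketch only; uses the same typing imports as query.py (from typing import Dict, List)
def _build_context(self, docs: List[Dict]) -> str:
    """Concatenate retrieved documents into a single context string."""
    sections = []
    for doc in docs:
        metadata = doc.get("metadata", {})
        header = f"[{metadata.get('document_type', 'document')} | {metadata.get('company_ticker', 'unknown')}]"
        sections.append(f"{header}\n{doc.get('text', '')}")
    return "\n\n".join(sections)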
MongoDB vector search is implemented in the DatabaseOperations class:
# From database.py
def query_similar(self, query_embedding: List[float], limit: int = 5, filters: Dict = None) -> List[Dict]:
    pipeline = [
        {
            "$vectorSearch": {
                "index": "vector_index",
                "queryVector": query_embedding,
                "path": "embeddings",
                "numCandidates": limit * 10,
                "limit": limit
            }
        }
    ]
    if filters:
        pipeline.append({"$match": filters})
    pipeline.append({
        "$project": {
            "text": 1,
            "metadata": 1,
            "score": {"$meta": "vectorSearchScore"}
        }
    })
    return list(self.documents.aggregate(pipeline))
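For this pipeline to return results, an Atlas Vector Search index named vector_index must exist on the collection. A definition along the following lines matches the 768-dimensional embeddings stored under the embeddings field; the cosine similarity setting is an assumption, so use whatever the project's actual index specifies:

{
  "fields": [
    {
      "type": "vector",
      "path": "embeddings",
      "numDimensions": 768,
      "similarity": "cosine"
    }
  ]
}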
# Initialize transcriber
transcriber = AudioTranscriber(api_key="your_api_key")

# Process single file
result = await transcriber.transcribe_file(
    "earnings_call.mp3",
    metadata={"company": "MDB", "quarter": "Q3"}
)

# Process directory
results = await transcriber.transcribe_directory(
    "data/audio",
    metadata={"year": "2024"}
)
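Because these transcription methods are coroutines, they need an event loop when called from a plain script, for example:

import asyncio

from earnings_ai_demo.transcription import AudioTranscriber


async def transcribe_audio_directory():
    # Transcribe everything under data/audio/ with shared metadata
    transcriber = AudioTranscriber(api_key="your_api_key")
    return await transcriber.transcribe_directory("data/audio", metadata={"year": "2024"})


results = asyncio.run(transcribe_audio_directory())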
# Initialize extractor
extractor = DocumentExtractor()
# Process PDF
pdf_result = extractor.extract_text("quarterly_report.pdf")
# Process DOCX
docx_result = extractor.extract_text("presentation.docx")
# Process directory
all_docs = extractor.process_directory("data/documents")
# Initialize embedding generator
embedding_gen = EmbeddingGenerator(api_key="your_api_key")

# Single text embedding
embedding = embedding_gen.generate_embedding(
    "MongoDB reported strong Q3 results",
    prefix="earnings: "
)

# Batch processing
texts = ["First document", "Second document"]
embeddings = embedding_gen.generate_embeddings_batch(texts)

# Document embedding with chunking
long_text = "... very long document ..."
doc_embedding = embedding_gen.generate_document_embedding(
    long_text,
    method="mean"  # or "max"
)
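For intuition, chunking splits a long document into pieces, embeds each piece, and then pools the chunk vectors into a single document vector. The snippet below illustrates mean versus max pooling; it is not the project's actual implementation and assumes numpy is available:

import numpy as np


def pool_chunk_embeddings(chunk_embeddings, method="mean"):
    """Collapse per-chunk vectors into one document vector (illustrative only)."""
    matrix = np.array(chunk_embeddings)      # shape: (num_chunks, 768)
    if method == "mean":
        return matrix.mean(axis=0).tolist()  # average across chunks
    if method == "max":
        return matrix.max(axis=0).tolist()   # element-wise maximum
    raise ValueError(f"Unknown pooling method: {method}")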
# Initialize database
db_ops = DatabaseOperations("mongodb_uri")

# Store document
doc_id = db_ops.store_document(
    text="Quarterly earnings report content...",
    embeddings=[0.1, 0.2, ...],  # 768-dimensional vector
    metadata={
        "company": "MDB",
        "document_type": "earnings_call",
        "date": "2024-03-21"
    }
)

# Query similar documents
similar_docs = db_ops.query_similar(
    query_embedding=[0.1, 0.2, ...],
    limit=5,
    filters={"metadata.company": "MDB"}
)
# Initialize query interface
query_interface = QueryInterface(api_key="your_api_key", database_ops=db_ops)

# Basic query
result = query_interface.query(
    "What was MongoDB's revenue growth in Q3?"
)

# Filtered query
result = query_interface.query(
    "What are the key AI initiatives?",
    company_ticker="MDB",
    doc_type="earnings_call"
)

# Streaming query
async def handle_chunk(chunk):
    print(chunk)

await query_interface.process_streaming_query(
    "Summarize the earnings highlights",
    handle_chunk
)
# Initialize app
app = EarningsAIApp()

# Process uploaded files
results = await app.process_files([
    uploaded_file1,
    uploaded_file2
])

# Query documents
response, sources = await app.query_documents(
    "What is MongoDB's cloud strategy?"
)
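A rough sketch of how these calls could be wired into Streamlit widgets follows; the real app.py may structure its UI differently, and the widget labels here are placeholders:

import asyncio

import streamlit as st

# EarningsAIApp is defined in earnings_ai_demo/app.py (see above)
app = EarningsAIApp()

uploaded = st.file_uploader("Upload documents or audio", accept_multiple_files=True)
if uploaded:
    results = asyncio.run(app.process_files(uploaded))
    st.success(f"Processed {len(results)} file(s)")

question = st.text_input("Ask a question about your documents")
if question:
    response, sources = asyncio.run(app.query_documents(question))
    st.write(response)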
- DatabaseOperations: Manages MongoDB interactions and vector search
- AudioTranscriber: Handles audio file transcription using Fireworks AI
- EmbeddingGenerator: Creates vector embeddings for text and documents
- DocumentExtractor: Extracts text from various document formats
- QueryInterface: Processes natural language queries and retrieves relevant information
- Uses MongoDB Atlas Vector Search
- 768-dimensional embeddings from Nomic AI
- Supports filtered queries by company and document type
- Supports MP3, WAV, FLAC, M4A formats
- Automatic language detection
- Metadata extraction and storage
- Supports PDF, DOCX, TXT formats
- Automatic text extraction
- Metadata preservation
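As a sketch of how format support like this is typically implemented, the function below dispatches on file extension. It assumes pypdf and python-docx as parsers, which may differ from what extraction.py actually uses:

from pathlib import Path

from docx import Document      # python-docx (assumed)
from pypdf import PdfReader    # pypdf (assumed)


def extract_text(path: str) -> str:
    """Return plain text from a PDF, DOCX, or TXT file (illustrative sketch)."""
    suffix = Path(path).suffix.lower()
    if suffix == ".pdf":
        reader = PdfReader(path)
        return "\n".join(page.extract_text() or "" for page in reader.pages)
    if suffix == ".docx":
        doc = Document(path)
        return "\n".join(paragraph.text for paragraph in doc.paragraphs)
    if suffix == ".txt":
        return Path(path).read_text(encoding="utf-8")
    raise ValueError(f"Unsupported file format: {suffix}")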
pip install pytest pytest-asyncio
- Create test data directories:
mkdir -p data/audio data/documents
- Add sample files:
- Audio: Place sample MP3 files in data/audio/
- Documents: Place sample PDF, DOCX, and TXT files in data/documents/
Run all tests:
pytest -v
# Test transcription functionality
pytest tests/test_transcription.py -v
# Key tests:
pytest tests/test_transcription.py::test_single_file_transcription
pytest tests/test_transcription.py::test_metadata_handling
pytest tests/test_transcription.py::test_file_not_found
# Test document processing
pytest tests/test_extraction.py -v
# Key tests:
pytest tests/test_extraction.py::test_txt_extraction
pytest tests/test_extraction.py::test_pdf_extraction
pytest tests/test_extraction.py::test_process_directory
# Test embedding generation
pytest tests/test_embeddings.py -v
# Key tests:
pytest tests/test_embeddings.py::test_single_embedding
pytest tests/test_embeddings.py::test_batch_embeddings
pytest tests/test_embeddings.py::test_document_chunking
pytest tests/test_embeddings.py::test_prefix_handling
# Test MongoDB operations
pytest tests/test_database.py -v
# Key tests:
pytest tests/test_database.py::test_store_document
pytest tests/test_database.py::test_query_similar
pytest tests/test_database.py::test_update_document
# Test query processing
pytest tests/test_query.py -v
# Key tests:
pytest tests/test_query.py::test_basic_query
pytest tests/test_query.py::test_filtered_query
pytest tests/test_query.py::test_streaming_query
- Create a test configuration file tests/test_config.yaml:
fireworks:
  api_key: "your_test_api_key"
mongodb:
  uri: "your_test_mongodb_uri"
- Set up test environment variables:
export FIREWORKS_TEST_API_KEY="your_test_api_key"
export MONGODB_TEST_URI="your_test_mongodb_uri"
Generate test coverage report:
pytest --cov=earnings_ai_demo tests/
Test the complete pipeline:
# tests/test_integration.py
import pytest
from earnings_ai_demo.main import main

@pytest.mark.asyncio
async def test_full_pipeline():
    await main()
Run integration tests:
pytest tests/test_integration.py -v
tests/test_transcription.py::test_single_file_transcription PASSED
tests/test_transcription.py::test_metadata_handling PASSED
tests/test_transcription.py::test_file_not_found PASSED
tests/test_extraction.py::test_txt_extraction PASSED
tests/test_extraction.py::test_pdf_extraction PASSED
tests/test_extraction.py::test_process_directory PASSED
tests/test_embeddings.py::test_single_embedding PASSED
tests/test_embeddings.py::test_batch_embeddings PASSED
tests/test_embeddings.py::test_document_chunking PASSED
tests/test_embeddings.py::test_prefix_handling PASSED
tests/test_database.py::test_store_document PASSED
tests/test_database.py::test_query_similar PASSED
tests/test_database.py::test_update_document PASSED
tests/test_query.py::test_basic_query PASSED
tests/test_query.py::test_filtered_query PASSED
tests/test_query.py::test_streaming_query PASSED
- MongoDB Connection Failures
- Solution: Ensure MongoDB is running and test URI is correct
pytest tests/mongo_test.py
- Missing Test Files
- Solution: Verify test data exists
ls data/audio/
ls data/documents/
- API Rate Limits
- Solution: Use test API key with sufficient quota
# tests/conftest.py
import time

import pytest

@pytest.fixture(autouse=True)
def rate_limit_delay():
    yield
    time.sleep(1)  # Add delay between API calls
- Async Test Failures
- Solution: Ensure pytest-asyncio is installed and imported
import pytest

pytestmark = pytest.mark.asyncio
After running tests:
# Clean up test files
rm -rf data/audio/*_transcription.json
rm -rf data/documents/*.json
# Reset test database
python -c "from earnings_ai_demo.database import DatabaseOperations; DatabaseOperations('test_uri').documents.delete_many({})"
- Fork the repository
- Create a feature branch
- Commit changes
- Push to the branch
- Open a Pull Request
- Follow PEP 8
- Add docstrings to functions and classes
- Write unit tests for new features
- MongoDB Connection Issues
- Ensure your IP is whitelisted in MongoDB Atlas
- Check network connectivity
- Verify credentials in config.yaml
- API Rate Limits
- Implement exponential backoff (built into the embedding generator; a minimal sketch appears after this list)
- Monitor Fireworks AI usage limits
- File Processing Errors
- Check file permissions
- Verify file formats are supported
- Ensure files aren't corrupted
- Vector Search Issues
- Wait for index build completion
- Check index definition matches embedding dimensions
- Verify documents are properly embedded
- Memory Issues with Large Files
- Use streaming for large audio files
- Implement chunking for large documents
- Monitor RAM usage during batch processing
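As a reference for the API Rate Limits item above, here is a minimal sketch of the exponential backoff pattern; the retry counts and delays are illustrative, not the values used in the embedding generator:

import random
import time


def with_backoff(call, max_retries=5, base_delay=1.0):
    """Retry a callable with exponential backoff and jitter (illustrative)."""
    for attempt in range(max_retries):
        try:
            return call()
        except Exception:  # in practice, catch the API's rate-limit error specifically
            if attempt == max_retries - 1:
                raise
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.5)
            time.sleep(delay)


# Example: wrap an embedding request
# embedding = with_backoff(lambda: embedding_gen.generate_embedding("some text"))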
This project is licensed under the MIT License. See the LICENSE file for details.
- Fireworks AI for their powerful ML models
- MongoDB for vector search capabilities
- The open-source community for various Python libraries
Note: For production deployments, ensure proper security measures are in place and never expose sensitive credentials in configuration files or version control.