This project uses the encoder from OpenAI's Whisper model to classify bird species from their songs. Whisper was originally designed for speech recognition; here its audio feature extraction is adapted to the different demands of birdsong classification.
The system works by:
- Converting audio recordings into mel spectrograms
- Using Whisper's encoder to extract audio features
- Classifying bird species through a custom classification head
Audio Input → Mel Spectrogram Conversion
- Input: Raw audio files (.ogg, .wav, .mp3)
- Output: Mel spectrogram matrix (n_mels × time_steps)
- Parameters:
  - Sample rate: 16 kHz
  - n_mels: 80
  - Window size: 25 ms
  - Hop length: 10 ms
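To make the parameters above concrete, the following sketch converts the window and hop durations into sample counts at 16 kHz and estimates the spectrogram shape for a clip. The helper names and the 30-second clip length are illustrative assumptions, not part of this project's code, and the frame count assumes one frame per hop, which may differ slightly from what `audio_to_mel` produces.

```python
from __future__ import annotations

SAMPLE_RATE = 16_000  # Hz, from the parameter list above
N_MELS = 80
WINDOW_MS = 25
HOP_MS = 10


def ms_to_samples(duration_ms: int, sample_rate: int = SAMPLE_RATE) -> int:
    """Convert a duration in milliseconds to a whole number of samples."""
    return sample_rate * duration_ms // 1000


def mel_shape(clip_seconds: float) -> tuple[int, int]:
    """Approximate (n_mels, time_steps) for a clip, assuming one frame per hop."""
    n_samples = int(clip_seconds * SAMPLE_RATE)
    return N_MELS, n_samples // ms_to_samples(HOP_MS)


window = ms_to_samples(WINDOW_MS)  # 400 samples per analysis window
hop = ms_to_samples(HOP_MS)        # 160 samples between frames
shape = mel_shape(30.0)            # (80, 3000) for a hypothetical 30 s clip
```

Under these assumptions, each second of audio contributes 100 spectrogram columns.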
Feature Extraction
- Using Whisper's pre-trained encoder
- Reuses the audio representations Whisper learned during speech pretraining
- Outputs high-dimensional feature embeddings
Classification
- Custom classification head
- Current implementation: Single-species classification
- Future work: Multi-species detection
- Single species classification from isolated recordings
- Support for common North American bird species
- Real-time inference on standard hardware
- Integration with BirdCLEF 2024 dataset
- Implementing multi-label classification
- Handling overlapping bird calls
- Temporal localization of different species
- Fine-tuning Whisper encoder for bird-specific features
- Data augmentation techniques
- Model quantization for edge devices
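The difference between the current single-species output and the planned multi-label output can be sketched in plain Python. This is a minimal illustration under assumptions: the species labels, the logits, and the 0.5 threshold are placeholders, and the project's classification head may decide differently.

```python
import math

SPECIES = ["species_a", "species_b", "species_c"]  # placeholder labels


def softmax(logits):
    """Turn logits into probabilities that sum to 1 (classes compete)."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    """Independent per-class probability (classes do not compete)."""
    return 1.0 / (1.0 + math.exp(-x))


def predict_single(logits):
    """Single-species: return the one label with the highest softmax probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return SPECIES[best]


def predict_multi(logits, threshold=0.5):
    """Multi-label: return every label whose sigmoid score meets the threshold."""
    return [name for name, x in zip(SPECIES, logits) if sigmoid(x) >= threshold]


logits = [2.0, 0.5, -1.0]  # made-up head outputs for one recording
single = predict_single(logits)
multi = predict_multi(logits)
```

With these made-up logits, the single-species path reports only `species_a`, while the multi-label path also reports `species_b`, which is the behavior needed for recordings with overlapping calls.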
We use the BirdCLEF 2024 dataset, which includes:
- High-quality bird recordings
- Species labels
- Temporal annotations
- Geographic metadata
# Example code for processing a single audio file
from model import WhisperBirdClassifier
from utils import audio_to_mel
# Load model
model = WhisperBirdClassifier()
# Process audio
audio_path = "path/to/bird_recording.ogg"
mel_spectrogram = audio_to_mel(audio_path)
prediction = model(mel_spectrogram)
The main training pipeline is implemented in notebooks/subsample.ipynb. This notebook:
- Loads and preprocesses the BirdCLEF dataset
- Filters recordings to the Great Britain region
- Trains a classifier using Whisper's encoder and a custom classification head
- Logs training metrics to Weights & Biases
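The regional filtering step could look roughly like the sketch below. It is an assumption-laden illustration rather than the notebook's actual code: the bounding-box coordinates are approximate, the `latitude` and `longitude` field names are assumed, and the sample records are invented. A rectangular box also captures parts of Ireland and northern France, so a polygon-based filter (for example with GeoPandas) would be more precise.

```python
# Rough bounding box around Great Britain (approximate, for illustration only).
GB_BOUNDS = {"lat_min": 49.9, "lat_max": 60.9, "lon_min": -8.2, "lon_max": 1.8}


def in_bounds(record, bounds=GB_BOUNDS):
    """Return True if the record has coordinates that fall inside the box."""
    lat, lon = record.get("latitude"), record.get("longitude")
    if lat is None or lon is None:
        return False  # drop recordings without location metadata
    return (
        bounds["lat_min"] <= lat <= bounds["lat_max"]
        and bounds["lon_min"] <= lon <= bounds["lon_max"]
    )


# Invented metadata rows standing in for dataset records.
records = [
    {"filename": "a.ogg", "latitude": 51.5, "longitude": -0.1},   # near London
    {"filename": "b.ogg", "latitude": 40.7, "longitude": -74.0},  # near New York
    {"filename": "c.ogg", "latitude": None, "longitude": None},   # no location
]

kept = [r["filename"] for r in records if in_bounds(r)]
```

Only the first record survives the filter here; the second is outside the box and the third has no coordinates.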
- PyTorch
- OpenAI Whisper
- GeoPandas
- Weights & Biases
- pandas
- matplotlib
- contextily
We welcome contributions! Areas we're particularly interested in:
- Multi-species detection improvements
- Data augmentation techniques
- Model optimization
- Additional species support
If you use this project in your research, please cite:
@software{mlx-birdsong,
author = {John R Hoopes},
title = {Birdsong Classification using Whisper Encoder},
year = {2024},
url = {https://github.com/johnx25bd/mlx-birdsong}
}
MIT License
- Barbora Filova
- OpenAI Whisper team
- BirdCLEF dataset creators
- Contributing researchers and developers