This project provides a Python-based solution to extract Arabic text from PDF documents using Google Document AI. It processes PDFs to generate formatted .txt files containing the extracted text.
- Arabic Text Extraction: Utilizes Google Document AI to accurately extract Arabic text from PDF files.
- Formatted Output: Saves the extracted text into well-structured .txt files for easy readability and further processing.
- Tashkeel Support: Adds diacritical marks (Tashkeel) to the extracted Arabic text for enhanced readability.
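To make the Tashkeel feature concrete, here is a minimal sketch that counts and strips the common Arabic diacritics (U+064B through U+0652) using only the standard library. It does not use pylibtashkeel and is not taken from this project's scripts; the helper names `count_tashkeel` and `strip_tashkeel` are hypothetical. A check like this might be handy for confirming that a processed file actually gained diacritics.

```python
# Fathatan through Sukun: the eight core Tashkeel marks in Unicode.
TASHKEEL = {chr(cp) for cp in range(0x064B, 0x0652 + 1)}


def count_tashkeel(text: str) -> int:
    """Count how many Tashkeel marks appear in the text."""
    return sum(1 for ch in text if ch in TASHKEEL)


def strip_tashkeel(text: str) -> str:
    """Return the text with all Tashkeel marks removed."""
    return "".join(ch for ch in text if ch not in TASHKEEL)


if __name__ == "__main__":
    # Kaf, Ta, Ba, each followed by a Fatha (written with escapes for clarity).
    word = "\u0643\u064E\u062A\u064E\u0628\u064E"
    print(count_tashkeel(word))
    print(strip_tashkeel(word))
```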
Ensure the following dependencies are installed:
pylibtashkeel
google-cloud-documentai
PyPDF2
Install these dependencies using the provided requirements.txt file:
pip install -r requirements.txt
Note: Ensure that you have access to Google Document AI and have set up the necessary authentication credentials.
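Before running the scripts, it can help to confirm that credentials are discoverable. The sketch below assumes a service-account key file referenced by the `GOOGLE_APPLICATION_CREDENTIALS` environment variable, which is one of the ways Google client libraries locate credentials. The helper name `find_credentials` is hypothetical and not part of this project, and other authentication methods are not detected by this check.

```python
import os
from pathlib import Path
from typing import Optional


def find_credentials() -> Optional[Path]:
    """Return the key file named by GOOGLE_APPLICATION_CREDENTIALS, or None."""
    value = os.environ.get("GOOGLE_APPLICATION_CREDENTIALS")
    if not value:
        return None
    path = Path(value).expanduser()
    # Only report the path if it points at an existing regular file.
    return path if path.is_file() else None


if __name__ == "__main__":
    found = find_credentials()
    if found is None:
        print("No key file found via GOOGLE_APPLICATION_CREDENTIALS.")
    else:
        print(f"Using credentials file: {found}")
```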
- Set Up Google Document AI Credentials: Follow the Google Cloud documentation to set up authentication and obtain your credentials.
- Run the Scripts:
  - Use main.py to extract text from PDF files. The extracted and formatted text will be saved as .txt files in the specified output directory.
  - If you want to add Tashkeel, format the text, and combine it, use step2.py.
- Configure the Scripts:
  - Specify the path to your input PDF file in main.py.
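For orientation, the extraction step could look roughly like the sketch below, which follows the general pattern of the google-cloud-documentai client library. It is not the code in main.py: the function names and the `project_id`, `location`, and `processor_id` parameters are placeholders you would supply from your own Google Cloud project. Synchronous requests are subject to page limits, so long PDFs may need to be split first.

```python
def api_endpoint(location: str) -> str:
    """Build the regional Document AI endpoint for a location such as 'us' or 'eu'."""
    return f"{location}-documentai.googleapis.com"


def extract_text(pdf_path: str, project_id: str, location: str, processor_id: str) -> str:
    """Send one PDF to a Document AI processor and return the recognized text."""
    # Imported lazily so api_endpoint() works even without the package installed.
    from google.api_core.client_options import ClientOptions
    from google.cloud import documentai

    client = documentai.DocumentProcessorServiceClient(
        client_options=ClientOptions(api_endpoint=api_endpoint(location))
    )
    # Full resource name of the processor (placeholder IDs, supplied by the caller).
    name = client.processor_path(project_id, location, processor_id)

    with open(pdf_path, "rb") as f:
        raw_document = documentai.RawDocument(
            content=f.read(), mime_type="application/pdf"
        )

    request = documentai.ProcessRequest(name=name, raw_document=raw_document)
    result = client.process_document(request=request)
    return result.document.text
```

Calling `extract_text` requires valid credentials and an existing processor; `api_endpoint` can be used on its own.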
- Text Files: .txt files containing the extracted Arabic text, formatted for readability and ease of use.
Here's how to set the input PDF path and output directory in the scripts:
# Set the path to the input PDF file in main.py
input_pdf = '/path/to/your/input.pdf'
# Run step2.py to add Tashkeel and format the text
After running the scripts, the extracted and processed text files will be saved in the specified output directory.
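As an illustration of how an output file name might be derived from the input PDF, here is a small sketch. It assumes each PDF maps to a .txt file of the same base name inside the output directory; the scripts may use a different naming scheme, and `output_txt_path` and `save_text` are hypothetical helpers.

```python
from pathlib import Path


def output_txt_path(input_pdf: str, output_dir: str) -> Path:
    """Return <output_dir>/<pdf base name>.txt for the given input PDF."""
    return Path(output_dir) / (Path(input_pdf).stem + ".txt")


def save_text(text: str, input_pdf: str, output_dir: str) -> Path:
    """Write the text as UTF-8 so Arabic characters are preserved; return the path."""
    target = output_txt_path(input_pdf, output_dir)
    # Create the output directory if it does not exist yet.
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(text, encoding="utf-8")
    return target
```

Writing with an explicit `encoding="utf-8"` avoids depending on the platform's default encoding, which matters for Arabic text on some systems.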
- Ensure that your Google Cloud credentials are correctly set up and that you have the necessary permissions to use Document AI.
- The script is designed to handle PDFs containing Arabic text. For other languages, adjust the Document AI settings accordingly.
For more details and updates, visit the GitHub repository.