This script extracts data from PDF forms and maps it to predefined fields. It utilizes several libraries for text and checkbox detection, visualization, and data extraction. The output can be saved in Excel/CSV format for further analysis.
## IMPORTANT NOTE: convert.py targets the older form format; Test.py is the updated version for the new form format.
- pdfplumber: To extract text and images from PDF documents.
- OpenCV: To process images and detect checkboxes.
- numpy: To handle numerical operations and array manipulations.
- matplotlib: To visualize detected checkboxes for debugging purposes.
- pytesseract: (If required) For optical character recognition.
- pandas: To handle tabular data and export it to Excel/CSV.
The fields expected to be present in the form are defined in two lists, fields and checkbox_fields:
- Name
- Account number
- Last 4 digits of the card
- Transaction date
- Amount
- Merchant name
- Date you lost your card
- Time you lost your card
- Date you realized your card was stolen
- Time you realized your card was stolen
- When was the last time you used your card
- Last transaction amount
- Where do you normally store your card
- Where do you normally store your PIN
- Other items that were stolen
- Officer name
- Report number
- Suspect name
- Date
- Contact number
- Reason for dispute
- Transaction not authorized
- My card was
- Do you know who made the transaction
- Have you given permission to anyone to use your card
- Have you filed police report
Field mappings are used to handle differences between the actual field names in the PDF and the expected field names in the script. For example:
- "Your Name" is mapped to "Name"
- "Account#" is mapped to "Account number"
- "Last 4 digits of the card#" is mapped to "Last 4 digits of the card"
Note: Whenever the form layout gains new fields, add them to the field_mappings dictionary so that extraction stays accurate.
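The mapping described above can be sketched as a plain dictionary lookup. This is a minimal illustration, not the script's actual code: `FIELD_MAPPINGS` holds only the three example entries listed above, and the helper name `canonical_field` and its case-insensitive matching are assumptions.

```python
# Hypothetical subset of the mapping; the real dictionary lives in the script.
FIELD_MAPPINGS = {
    "Your Name": "Name",
    "Account#": "Account number",
    "Last 4 digits of the card#": "Last 4 digits of the card",
}


def canonical_field(label, mappings=None):
    """Return the expected field name for a label as printed on the form."""
    if mappings is None:
        mappings = FIELD_MAPPINGS
    # Drop surrounding whitespace and a trailing colon, e.g. "Your Name:".
    cleaned = label.strip().rstrip(":").strip()
    # Assumption: labels are compared case-insensitively.
    lookup = {key.lower(): value for key, value in mappings.items()}
    # Labels without a mapping are returned unchanged.
    return lookup.get(cleaned.lower(), cleaned)


print(canonical_field("Your Name:"))
print(canonical_field("Amount"))
```

Returning unmapped labels unchanged means a field whose printed name already matches the expected name needs no entry in the dictionary.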
Extracts all text lines from a PDF, including empty lines.
- Parameter `pdf_path` (str): Path to the input PDF file.
- Returns `list`: A list of text lines extracted from the PDF.
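A minimal sketch of this step using pdfplumber's `open`, `pages`, and `extract_text`. The function names `split_page_text` and `extract_text_lines` are hypothetical, and whether blank lines actually appear depends on the text pdfplumber returns for a given page.

```python
def split_page_text(page_text):
    """Split one page's text into lines, keeping empty lines."""
    # extract_text() may yield None or an empty string for a page without text.
    if not page_text:
        return []
    # split("\n") keeps empty strings, so blank lines survive as "".
    return page_text.split("\n")


def extract_text_lines(pdf_path):
    """Collect the text lines of every page in the PDF, in page order."""
    import pdfplumber  # imported here so split_page_text works without it

    lines = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            lines.extend(split_page_text(page.extract_text()))
    return lines
```

Keeping empty lines preserves the relative position of each line, which matters if later steps look at the lines near a label by index.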
Detects checkboxes on the given image and determines whether they are checked or unchecked.
- Parameter `image` (numpy array): An image of the PDF page.
- Returns `dict`: Checkbox states, with coordinates as keys and states (Checked/Unchecked) as values.
- Returns `list`: Positions of all detected checkboxes.
Visualizes all detected checkboxes with color-coded rectangles for debugging.
- Parameter `image` (numpy array): An image of the PDF page.
- Parameter `checkbox_positions` (list): Positions of detected checkboxes.
- Parameter `checkbox_states` (dict): Checkbox states (Checked/Unchecked).
- Parameter `page_number` (int): Page number for display purposes.
Extracts checkbox states from nearby text lines and returns text associated with the checkbox.
- Parameter `lines` (list): List of text lines.
- Parameter `index` (int): Index of the current line.
- Parameter `aliases` (list): List of aliases for the field name.
- Returns `str`: Text associated with the checkbox.
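The description leaves the exact search rule open, so the following is only one plausible reading: look for an alias on the current line or the next few lines, and return whatever text follows it. The name `extract_checkbox_text`, the `lookahead` parameter, and the fallback to the next non-empty line are assumptions.

```python
def extract_checkbox_text(lines, index, aliases, lookahead=2):
    """Return the text that follows a checkbox field label near lines[index]."""
    window = lines[index:index + 1 + lookahead]
    for offset, line in enumerate(window):
        lowered = line.lower()
        for alias in aliases:
            position = lowered.find(alias.lower())
            if position == -1:
                continue
            # Text on the same line, after the label and any punctuation.
            remainder = line[position + len(alias):].strip(" :?-")
            if remainder:
                return remainder
            # Label alone on its line: assume the answer is on the next
            # non-empty line inside the window.
            for following in window[offset + 1:]:
                if following.strip():
                    return following.strip()
    return ""


sample = ["Have you filed police report? Yes", "My card was", "", "Stolen"]
print(extract_checkbox_text(sample, 0, ["Have you filed police report"]))
print(extract_checkbox_text(sample, 1, ["My card was"]))
```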
Maps predefined fields to their respective content in the PDF.
- Parameter `lines` (list): List of text lines extracted from the PDF.
- Parameter `pdf_path` (str): Path to the input PDF file.
- Returns `dict`: Mapped fields and their corresponding content.
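As a rough sketch of the text side of this step, the function below scans each line for a known label and stores the text after it. It assumes a "Label: value" layout on a single line, takes the field list and mappings as arguments instead of `pdf_path`, and leaves checkbox handling out; the name `map_fields_to_content` is hypothetical.

```python
def map_fields_to_content(lines, fields, field_mappings):
    """Map each expected field to the text found after its label."""
    # Every expected name is its own label; PDF labels map to expected names.
    labels = {name: name for name in fields}
    labels.update(field_mappings)
    # Try longer labels first so "Date you lost your card" wins over "Date".
    ordered = sorted(labels, key=len, reverse=True)

    mapped = {name: "" for name in fields}
    for line in lines:
        stripped = line.strip()
        for label in ordered:
            if stripped.lower().startswith(label.lower()):
                value = stripped[len(label):].lstrip(" :").strip()
                field = labels[label]
                # Keep the first value found for each field.
                if field in mapped and not mapped[field]:
                    mapped[field] = value
                break
    return mapped


demo = map_fields_to_content(
    ["Your Name: Jane Doe", "Date you lost your card: 01/02/2024"],
    ["Name", "Date", "Date you lost your card"],
    {"Your Name": "Name"},
)
print(demo)
```

Matching the longest label first matters for this form because several field names share a prefix, such as "Date" and "Date you lost your card".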
Saves the extracted data to an Excel or CSV file.
- Parameter `mapped_data` (dict): Mapped fields and content.
- Parameter `filename` (str): Name of the output Excel/CSV file.
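A minimal sketch of the export with pandas, assuming one row per form and a format chosen from the file extension. The function name `save_mapped_data` and the extension-based dispatch are assumptions; writing `.xlsx` additionally requires an Excel engine such as openpyxl.

```python
from pathlib import Path

import pandas as pd


def save_mapped_data(mapped_data, filename):
    """Write one row of extracted fields to a .csv or .xlsx file."""
    frame = pd.DataFrame([mapped_data])  # one row, one column per field
    suffix = Path(filename).suffix.lower()
    if suffix == ".csv":
        frame.to_csv(filename, index=False)
    elif suffix == ".xlsx":
        # pandas needs an Excel engine such as openpyxl for this call.
        frame.to_excel(filename, index=False)
    else:
        raise ValueError(f"Unsupported output format: {suffix!r}")
    return frame
```

For example, `save_mapped_data(mapped_data, "Dispute_form_output.xlsx")` would produce a single-row workbook with the field names as column headers.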
- Place the input PDF in the `ref` directory.
- Run the script using: `python Test.py`
- The output will be saved as `Dispute_form_output.xlsx` in the working directory.
- Ensure all required libraries are installed: `pip install pdfplumber opencv-python-headless numpy matplotlib pandas`
- If there are changes to the form layout, update the `fields` and `checkbox_fields` lists and the `field_mappings` dictionary accordingly.
- Debugging visuals for checkbox detection can be enabled using the `visualize_checkboxes` function.
- Add support for dynamic field detection using OCR.
- Implement error handling for malformed PDFs.
- Automate updates to the `field_mappings` dictionary based on new form layouts.