GitHub - neetigyab/PDFReader: Ready-to-use Python application for parsing a specific PDF form format and storing the relevant user data in tabular form in an Excel sheet


# PDF Form Processing Tool

This project is a modular tool designed for extracting, processing, and saving data from form-like PDF documents. The tool detects and processes textual and checkbox elements in PDF files, maps fields to corresponding content, and exports the results to an Excel file.

## Features
- Extract text and checkbox data from PDFs.
- Map form fields to their corresponding values.
- Visualize detected checkboxes on PDF pages.
- Save mapped data to an Excel file.

## Project Structure

### **Packages and Modules**

#### 1. **`pdf_extractor`**
- **`extractor.py`**
  - `extract_pdf_text(pdf_path)`: Extracts all text lines from a PDF document.

#### 2. **`checkbox_detector`**
- **`checkbox_detector.py`**
  - `detect_checkboxes(image, ignored_area=None)`: Detects checkboxes in a given image and determines whether they are checked or unchecked.
  - `visualize_checkboxes(image, checkbox_positions, checkbox_states, page_number)`: Visualizes detected checkboxes and their states on the image.

#### 3. **`checkbox_parser`**
- **`checkbox_parser.py`**
  - `parse_checkbox(lines, index, aliases, pdf_path)`: Extracts checkbox states from nearby lines in the PDF.

#### 4. **`field_mapper`**
- **`field_mapper.py`**
  - `map_fields_to_content(lines, pdf_path)`: Maps fields to their respective content based on the extracted text and detected checkboxes.

#### 5. **`output_saver`**
- **`output_generator.py`**
  - `save_to_excel(mapped_data, filename)`: Saves the mapped data to an Excel file.

#### 6. **`config`**
- **`config.py`**
  - `fields`: A list of fields expected in the PDF.
  - `checkbox_fields`: A subset of fields specifically for checkboxes.
  - `field_mappings`: A dictionary mapping field names to possible aliases in the PDF.

---

## Libraries Utilized
1. **`pdfplumber`**: For extracting text and images from PDF documents.
2. **`OpenCV`**: For detecting and visualizing checkboxes.
3. **`numpy`**: For handling image arrays and numerical operations.
4. **`matplotlib`**: For visualizing checkboxes.
5. **`pytesseract`**: For Optical Character Recognition (OCR) when necessary.
6. **`pandas`**: For saving data to an Excel file.

---

## Manual Inputs and Configuration

### 1. **Fields of Expected Form Elements**
- **`fields`**: Define all possible fields that the PDF form may contain. This list must be updated to include any new fields introduced in the form layout.
- Example:
```python
fields = ["Name", "Address", "Date of Birth", "Phone Number"]
```

### 2. **Field Mappings**
- **`field_mappings`**: Map each expected field name to the alternative labels (aliases) under which it may appear in the form, so that differently worded labels resolve to the same field.
- Example:
```python
field_mappings = {
    "Name": ["Full Name", "Name of Applicant"],
    "Address": ["Residential Address", "Home Address"],
}
```
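
To make the role of the aliases concrete, here is a minimal sketch of how a lookup over the extracted lines could use `field_mappings`. It is not the code in `field_mapper.py`: the function name `find_field_value` is hypothetical, and it assumes that a label starts its line and is followed by the value on the same line.

```python
def find_field_value(lines, field, field_mappings):
    """Return the text after the first label matching `field` or one of its aliases."""
    # Try longer labels first so "Name of Applicant" wins over the shorter "Name".
    labels = sorted([field] + field_mappings.get(field, []), key=len, reverse=True)
    for line in lines:
        lowered = line.lower()
        for label in labels:
            if lowered.startswith(label.lower()):
                # Drop the label, then any separator and surrounding whitespace.
                return line[len(label):].lstrip(" :-\t").strip()
    return None  # no line starts with the field name or any alias


field_mappings = {
    "Name": ["Full Name", "Name of Applicant"],
    "Address": ["Residential Address", "Home Address"],
}
lines = ["Dispute Form", "Name of Applicant: Jane Doe", "Home Address - 12 Elm Street"]
name = find_field_value(lines, "Name", field_mappings)
address = find_field_value(lines, "Address", field_mappings)
```

A prefix match like this is deliberately simple; a form whose values sit on the line below the label would need a different rule.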

### 3. **Checkbox Fields**
- **`checkbox_fields`**: Specify fields that require checkbox processing.
- Example:
```python
checkbox_fields = ["Terms and Conditions", "Subscription Opt-in"]
```
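
For each checkbox field, the tool has to decide whether a detected box is ticked. One common heuristic, sketched below, is to measure how much of the box interior is dark. This is an illustration only and may differ from what `detect_checkboxes` does; the function name `is_checked`, the `(x, y, width, height)` box format, and the threshold values are all assumptions.

```python
import numpy as np


def is_checked(gray_image, box, dark_threshold=128, fill_ratio=0.1, border=2):
    """Guess whether the checkbox at box = (x, y, width, height) is ticked."""
    x, y, w, h = box
    # Shrink the region by `border` pixels so the printed outline is not counted.
    inner = gray_image[y + border : y + h - border, x + border : x + w - border]
    if inner.size == 0:
        return False
    # Fraction of interior pixels darker than the threshold.
    dark_fraction = float(np.mean(inner < dark_threshold))
    return dark_fraction >= fill_ratio


# Synthetic 20x20 grayscale checkbox: white interior with a 2-pixel black outline.
empty_box = np.full((20, 20), 255, dtype=np.uint8)
empty_box[:2, :] = 0
empty_box[-2:, :] = 0
empty_box[:, :2] = 0
empty_box[:, -2:] = 0

# The same box with an "X" drawn across the interior.
ticked_box = empty_box.copy()
for i in range(2, 18):
    ticked_box[i, i] = 0
    ticked_box[i, 19 - i] = 0
```

The thresholds would need tuning against real scans, since thin tick marks or noisy backgrounds shift the dark-pixel fraction.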

---

## Updating for New Form Layouts
1. **Add New Fields**: Update the `fields` list in `config.py` to include any new fields present in the updated form layout.
2. **Update Field Mappings**: Extend `field_mappings` with aliases for new or renamed fields.
3. **Update Checkbox Fields**: Add any new checkbox-related fields to the `checkbox_fields` list.
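
Because the three settings have to stay in step, a small consistency check can catch omissions after an update. The sketch below is not part of the repository; `check_config` is a hypothetical helper. It relies on `checkbox_fields` being a subset of `fields`, as described above, and additionally assumes that every key of `field_mappings` should appear in `fields`.

```python
def check_config(fields, checkbox_fields, field_mappings):
    """Return a list of problems; an empty list means the settings are consistent."""
    problems = []
    known = set(fields)
    for name in checkbox_fields:
        if name not in known:
            problems.append(f"checkbox field {name!r} is missing from fields")
    for name in field_mappings:
        if name not in known:
            problems.append(f"field_mappings key {name!r} is missing from fields")
    # An alias claimed by two different fields makes the mapping ambiguous.
    seen = {}
    for name, aliases in field_mappings.items():
        for alias in aliases:
            key = alias.lower()
            if key in seen and seen[key] != name:
                problems.append(
                    f"alias {alias!r} is used by both {seen[key]!r} and {name!r}"
                )
            seen.setdefault(key, name)
    return problems


fields = ["Name", "Address", "Terms and Conditions"]
ok = check_config(fields, ["Terms and Conditions"], {"Name": ["Full Name"]})
bad = check_config(
    fields,
    ["Subscription Opt-in"],
    {"Phone": ["Full Name"], "Name": ["full name"]},
)
```

Here `ok` is empty, while `bad` lists the unknown checkbox field, the unknown mapping key, and the alias shared by two fields.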

---

## Usage

### 1. Extract Text from PDF
The `extract_pdf_text` function extracts lines of text from the PDF.
```python
from extractor import extract_pdf_text
lines = extract_pdf_text("path/to/pdf")
```
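
For readers who want to see what line extraction with `pdfplumber` can look like, here is a minimal sketch. It is an assumption about the approach, not the contents of `extractor.py`; the helper names `lines_from_page_texts` and `extract_lines` are hypothetical.

```python
def lines_from_page_texts(page_texts):
    """Flatten per-page text into a list of stripped, non-empty lines."""
    lines = []
    for text in page_texts:
        # Some pages (for example scanned images) yield no text at all.
        for line in (text or "").splitlines():
            line = line.strip()
            if line:
                lines.append(line)
    return lines


def extract_lines(pdf_path):
    """Read every page with pdfplumber and return its text lines."""
    import pdfplumber  # imported here so the helper above works without it

    with pdfplumber.open(pdf_path) as pdf:
        return lines_from_page_texts(page.extract_text() for page in pdf.pages)


sample = lines_from_page_texts(["Name: Jane\n\n  Address: 12 Elm  ", None, ""])
```

Pages that yield no text are where an OCR fallback such as `pytesseract` would come in.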

### 2. Map Fields to Content
Map the extracted lines to the predefined fields using `map_fields_to_content`.
```python
from field_mapper import map_fields_to_content
mapped_data = map_fields_to_content(lines, "path/to/pdf")
```

### 3. Save Data to Excel
Save the mapped data to an Excel file using `save_to_excel`.
```python
from output_generator import save_to_excel
save_to_excel(mapped_data, "output.xlsx")
```
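
The exact shape of `mapped_data` is not documented here. Assuming it is a dictionary keyed by field name, the export step could be sketched as follows; `to_row` and `save_rows` are hypothetical names, not functions from `output_generator.py`.

```python
def to_row(mapped_data, fields):
    """Order values by `fields`, using an empty string for fields that were not found."""
    return {field: mapped_data.get(field, "") for field in fields}


def save_rows(rows, filename):
    """Write one spreadsheet row per form; needs pandas and an engine such as openpyxl."""
    import pandas as pd  # imported here so to_row stays dependency-free

    pd.DataFrame(rows).to_excel(filename, index=False)


row = to_row({"Name": "Jane Doe", "Extra": "ignored"}, ["Name", "Address"])
```

Building every row from the same `fields` list keeps the column order stable across forms, even when some fields are missing from a particular PDF.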

---

## Example Workflow

1. Place the PDF file in the project directory.
2. Run the main script:
```bash
python main.py
```
3. View the extracted and processed data in the output Excel file.

---

## Notes
- Ensure that any updates to the form layout are reflected in `fields`, `field_mappings`, and `checkbox_fields` in `config.py`.
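- Install all required libraries before running the script. Writing `.xlsx` files with `pandas` also needs an Excel engine such as `openpyxl`, and `pytesseract` needs the Tesseract OCR program installed on the system:
```bash
pip install pdfplumber opencv-python numpy matplotlib pytesseract pandas openpyxl
```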