PDF Data Extraction and Transformation

README em Português, clique aqui ->
README in English, click here ->

About

This project extracts car market data from a PDF file published on a website, cleans and transforms it, stores it in AWS RDS, refreshes it automatically through an Apache Airflow pipeline, and feeds a Power BI dashboard. The data is processed with Python libraries such as pypdf, pandas, numpy, requests, seaborn, matplotlib and scikit-learn, and sales are forecast with ARIMA/SARIMA models. The main tasks are data extraction, transformation, cleaning, forecasting, loading and visualization.

Features:

Downloading a PDF containing car market data.
Extracting the tables from the PDF.
Cleaning, transforming, and organizing the data into a pandas DataFrame.
Visualizing missing values and data distribution.
Exporting the final cleaned data into a CSV file for further use.
Loading the final DataFrame into AWS RDS for storage.
Integrating with Power BI for dynamic visualizations.
Configuring Apache Airflow for an automated monthly data refresh.
Using ARIMA to forecast 2025 sales and scikit-learn to analyze model performance (to be updated).
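
The download, extraction and cleaning steps above could be sketched roughly as follows. This is a minimal illustration and not the project's actual code: the function names, the `ROW_PATTERN` row layout (`<model name> <units sold>` with `.` as thousands separator) and the column names are all assumptions, since the real PDF layout is not described here. pypdf extracts text rather than tables, so the sketch rebuilds rows from text lines.

```python
import re

import pandas as pd


def download_pdf(url, dest_path, timeout=30):
    """Download the PDF at `url` and save it to `dest_path`."""
    # Imported here so the parsing helper below can be tried without requests.
    import requests

    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # fail loudly on HTTP errors
    with open(dest_path, "wb") as fh:
        fh.write(response.content)
    return dest_path


def extract_lines(pdf_path):
    """Return every text line found in the PDF, page by page."""
    # Imported here so the parsing helper below can be tried without pypdf.
    from pypdf import PdfReader

    lines = []
    for page in PdfReader(pdf_path).pages:
        text = page.extract_text() or ""
        lines.extend(text.splitlines())
    return lines


# Hypothetical row layout: "<model name> <units sold>", e.g. "Fiat Strada 12.345".
ROW_PATTERN = re.compile(r"^(?P<model>.+?)\s+(?P<units>\d[\d.]*)$")


def rows_to_dataframe(lines):
    """Keep only lines matching ROW_PATTERN and build a tidy DataFrame."""
    records = []
    for line in lines:
        match = ROW_PATTERN.match(line.strip())
        if match is None:
            continue  # headers, blank lines and other non-data text
        units = int(match.group("units").replace(".", ""))
        records.append({"model": match.group("model"), "units": units})
    return pd.DataFrame(records, columns=["model", "units"])
```

A call such as `rows_to_dataframe(extract_lines("pdf/report.pdf")).to_csv("data/cars.csv", index=False)` would then cover the CSV export step; both paths are placeholders.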

Images

Power BI Dashboard

pt-BR
en-US

Technologies Used and Requirements

Python: 🐍
Pandas: Data manipulation and analysis library.
NumPy: Numerical computing library.
pypdf: Library for extracting data from PDFs.
Seaborn: Data visualization library based on matplotlib.
Matplotlib: Plotting library for creating visualizations.
Requests: HTTP library used to download the PDF file from the website.
Scikit-learn: Train/test splitting, regression, and evaluation of model performance on past sales data.
ARIMA/SARIMA: Time-series models for forecasting sales data.
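
As an illustration of how ARIMA forecasting and scikit-learn evaluation can fit together, here is a minimal sketch. It assumes the ARIMA implementation comes from statsmodels, which this README does not state, and the function names, the `(1, 1, 1)` order and the 12-month hold-out are placeholders rather than the project's settings.

```python
def chronological_split(values, test_size):
    """Split a time-ordered sequence into train and test parts without shuffling."""
    if not 0 < test_size < len(values):
        raise ValueError("test_size must be between 1 and len(values) - 1")
    return values[:-test_size], values[-test_size:]


def evaluate_arima(monthly_sales, order=(1, 1, 1), test_size=12):
    """Fit ARIMA on the older months and return the MAE on the held-out months."""
    # Assumed dependencies, imported here so the split helper works without them.
    from sklearn.metrics import mean_absolute_error
    from statsmodels.tsa.arima.model import ARIMA

    train, test = chronological_split(list(monthly_sales), test_size)
    fitted = ARIMA(train, order=order).fit()
    forecast = fitted.forecast(steps=len(test))  # one prediction per held-out month
    return mean_absolute_error(test, forecast)
```

The split is chronological because shuffling a time series would leak future months into the training data.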

License

This project is licensed under the MIT License. See the LICENSE file for details.

Contact

If you have any questions, feel free to reach out at vinigoes@outlook.com or to vinox_quente on Discord.
