by Anne Ensign
This is a project for Code Louisville's Data Analysis Course 2.
You were recently hired at a company that hasn't been keeping all of its records updated. You notice a CSV with all orders for the last two years, which is up to date. You also notice a CSV with contact people for the companies, but there are many more companies with orders than contacts in this file. A database would also be a better way to store these records.
- Create a program to scan all the CSVs in one folder.
- Update the tables to be clean and consistent.
- Extract the data to evaluate.
- Create visuals for the other departments.
- Make sure the program updates the data every time it's run, as new info is added to the CSVs.
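The first goal above — scanning a folder of CSVs — might look something like this sketch (the folder layout and pandas usage here are assumptions for illustration, not taken from the actual scripts):

```python
from pathlib import Path

import pandas as pd

def load_csvs(folder):
    """Read every CSV in `folder` into a dict of DataFrames keyed by filename."""
    frames = {}
    for path in sorted(Path(folder).glob("*.csv")):
        frames[path.stem] = pd.read_csv(path)
    return frames
```

Because the function re-reads the folder on every call, rerunning the program picks up any new rows added to the CSVs.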
Spreadsheets
Data was generated through Mockaroo.
The original_csvs folder contains the CSVs as originally uploaded. Any transformation occurs on the CSVs in the csv folder.
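Keeping the originals read-only while transforming copies could be sketched like this (the folder names come from this README, but the real scripts may stage files differently):

```python
import shutil
from pathlib import Path

def stage_csvs(src="original_csvs", dst="csv"):
    """Copy each source CSV into the working folder so the originals stay untouched."""
    Path(dst).mkdir(exist_ok=True)
    for path in Path(src).glob("*.csv"):
        shutil.copy(path, Path(dst) / path.name)
```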
Python 3 is required to run this program. I wrote these scripts with version 3.9.6 installed, and at the time of writing, any version 3.6 or below is end-of-life (EOL).
If you need to download or update Python, click here.
The files to try the program for yourself are in the zip file TRY_IT.
- Download the repo company_data or clone it and save it to a location of your choice on your machine.
- Recommended: Create a virtual environment to run the program.
- Install the requirements from the root company_data folder.

Most systems:
pip install -r requirements.txt
or
pip3 install -r requirements.txt
If these aren't working for you, try the following:
Conda:
conda install --file requirements.txt
Windows:
py -m pip install -r requirements.txt
Unix/macOS:
python -m pip install -r requirements.txt
If you do not have Anaconda installed, make sure to enter the following command in your command line once you've navigated to the repo folder:
pip install pandas
- Extract all files from the zip folder TRY_IT.
- Navigate to the TRY_IT/python folder in the command line.
- Run the db_auto.py file first with the command:
python db_auto.py
- Once that file is complete, remaining in the TRY_IT/python folder, run the next file:
python sql_pandas.py

You should see other folders appear: database, datasets, and tableau. Your CSVs are now cleaned-up tables in a database!
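To confirm the tables actually landed in the database, you can poke at it with the sqlite3 standard library. A quick sketch (the database file name is an assumption — pass whatever path appears in the new database folder):

```python
import sqlite3

def list_tables(db_path):
    """Return the names of all tables in a SQLite database file."""
    with sqlite3.connect(db_path) as conn:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")
        return [r[0] for r in rows]
```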
You can also run unit tests to make sure the pandas methods and Python functions are working correctly.
- Navigate to TRY_IT/test/python.
- Run the files one at a time:
python test_pandas.py
python test_python.py
python test_sql.py
The test_pandas.py file will log your unit test results with a timestamp in the TRY_IT/text/log.txt file.
The commands for SQLite are saved in the TRY_IT/python/query.py file. The returned lists are imported into the sql_pandas.py file; this keeps the script for making the dataframes a little cleaner. You can run the query.py script on your own, alter the commands, write print statements into it, etc. You will also need to edit the dataframes in the sql_pandas.py file if you want to run it against your edited queries, though.
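The pattern described — one module runs the SQL and returns plain lists, another turns them into dataframes — might look roughly like this (the function names, table, and columns are hypothetical; the real query.py and sql_pandas.py will differ):

```python
import sqlite3

import pandas as pd

def run_query(db_path, sql):
    """Like query.py: execute a SQL command and return the rows as a plain list."""
    with sqlite3.connect(db_path) as conn:
        return conn.execute(sql).fetchall()

def rows_to_frame(rows, columns):
    """Like sql_pandas.py: build a DataFrame from the returned list."""
    return pd.DataFrame(rows, columns=columns)
```

Splitting the SQL from the dataframe-building this way means either side can be edited and rerun on its own, which matches the workflow described above.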
My Tableau profile has visuals, a dashboard, and a story for these databases. CLICK HERE to visit my page.
A PDF version of the Story is located at company_data/project/tableau/Company Story.pdf.
The following criteria have been met with this project:
- Category 1: Python Programming Basics
- Create a dictionary or list, populate it with several values, retrieve at least one value, and use it in your program.
- Create and call at least 3 functions or methods, at least one of which must return a value that is used somewhere else in your code.
- Implement a regular expression (regex)
- Category 2: Utilize External Data
- Read data from an external file, such as text, JSON, CSV, etc, and use that data in your application.
- Connect to a database and read data using SQL.
- Category 3: Data Display
- Implement a data visualization in Tableau
- Category 4: Best Practices
- Implement a log that records errors, invalid inputs, or other important events and writes them to a text file.
- Create 3 or more unit tests for your application.
- The program should utilize a virtual environment and document library dependencies in a requirements.txt file.
- Source data should not be modified/changed - clean data should be stored separately.
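As one illustration of the regex criterion, a cleaning step like this could normalize phone numbers in a contacts CSV (the column and formatting rule here are hypothetical, not taken from the project code):

```python
import re

def normalize_phone(raw):
    """Strip non-digits and format 10-digit numbers as XXX-XXX-XXXX."""
    digits = re.sub(r"\D", "", raw)
    if len(digits) == 10:
        return f"{digits[:3]}-{digits[3:6]}-{digits[6:]}"
    return raw  # leave unrecognized formats unchanged

print(normalize_phone("(502) 555-0147"))  # 502-555-0147
```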
While this code and project are my own, I found a lot of helpful information about structuring the process from Nate at StrataScratch.