Skip to content

Latest commit

 

History

History
85 lines (69 loc) · 3.63 KB

README.md

File metadata and controls

85 lines (69 loc) · 3.63 KB

Crowdfunding ETL Data Analysis


Overview

  • This project demonstrates the development of an ETL (Extract, Transform, Load) pipeline to process crowdfunding data.
  • The pipeline extracts raw data, performs transformations using Python and Pandas, and loads the cleaned data into a PostgreSQL database.
  • To ensure data integrity, the project incorporates foreign key constraints during data loading.

Installation

  • Tools and Languages:

    • Python (including Pandas and NumPy libraries).
    • PostgreSQL, managed via pgAdmin4.
    • VS Code for development.
  • Dependencies:

    • Imported pandas, NumPy, json, and re (regex module).

Code examples

  • Example 1 (comprehension method)

      # Use a list comprehension to add "cat" to each category_id. 
          cat_ids = [f'cat{cat_id}' for cat_id in category_ids]
      # Use a list comprehension to add "subcat" to each subcategory_id.
          scat_ids = [f'subcat{subcat_id}' for subcat_id in subcategory_ids]
      # Displaying results 
          print(cat_ids)
          print(scat_ids)
  • Example 2 (comprehension method)

     # Initialize an empty lists to store the list values and keys
         dict_values = []
         column_names = []
     # Iterate through the DataFrame.
         for i, row in contact_info_df.iterrows():
         data = row.iloc[0]
     # Convert each row to a Python dictionary.
         converted_data = json.loads(data)
    
     # Use a list comprehension to get the keys from the converted data.
         columns = [k for k,v in converted_data.items()]
     # Use a list comprehension to get the values for each row.
         row_values = [v for k, v in converted_data.items()]
     # Append the keys and list values to the lists created in step 1.  
         column_names.append(columns)
         dict_values.append(row_values)
     # Print out the list of values for each row and columns names
       print(column_names[0])
       print(dict_values)
  • Example 3 (Regex)

     # Extract the name of the contact and add it to a new column.
         contacts_next = pd.DataFrame(contacts_df)
     # Function to extract name from contact_info using regex
         def extract_name(contact_info): #define the function
     # Set paramenters of the search: any range and any characters and whitespaces after "name" except the double quote.
         match = re.search(r'"name":\s*"([^"]+)"', contact_info) 
         return match.group(1) if match else '' #save results of the search if the match is found
     # Apply the function to the contact_info column and create a new column 'name'
         contacts_next['name'] = contacts_next['contact_info'].apply(extract_name)
     # Display the first few rows of the new DataFrame
         print(contacts_next.head())

Roadmap

  • ETL_Project: Contains the main ETL pipeline code, showcasing data transformation with Pandas and Regex techniques.
  • crowdfunding_db_schema: SQL script to create the database schema and load transformed data.
  • Resources: Includes raw data files and transformed datasets.
  • Images_SQL_Data_Upload: Features screenshots verifying successful data upload into PostgreSQL tables.

Resources

  • The data used in this analysis is fictional and was created solely for the purpose of showcasing data analytics and ETL skills.