To use the sustainable product database, the user must:
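- Download the database (provided as a gzip-compressed file) from the GitHub repository.
- Make sure the sqlite3 and pandas Python modules are available. sqlite3 ships with the standard Python distribution; pandas needs to be installed if it is not already present.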
- Use the UseSustainableProductDB.ipynb notebook to load the view containing sustainable products.
- Additionally, data in the database can be queried and retrieved through simple SQL queries.
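The download-and-query steps above can be sketched as follows. This is a minimal illustration using only the standard library, not the contents of UseSustainableProductDB.ipynb. The file names, the `products` table, and the `sustainable_products` view are hypothetical stand-ins, since the actual schema is not described here; `build_demo_archive` exists only so the sketch can run without the real download.

```python
import gzip
import shutil
import sqlite3
import tempfile
from pathlib import Path


def decompress_database(gz_path, db_path):
    """Unpack a gzip-compressed SQLite file to db_path."""
    with gzip.open(gz_path, "rb") as src, open(db_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return db_path


def fetch_rows(db_path, query, params=()):
    """Run a SQL query and return (column_names, rows)."""
    conn = sqlite3.connect(db_path)
    try:
        cursor = conn.execute(query, params)
        columns = [col[0] for col in cursor.description]
        return columns, cursor.fetchall()
    finally:
        conn.close()


def build_demo_archive(folder):
    """Create a tiny stand-in database (hypothetical schema) and gzip it."""
    raw = Path(folder) / "demo.db"
    conn = sqlite3.connect(raw)
    conn.execute("CREATE TABLE products (name TEXT, is_sustainable INTEGER)")
    conn.executemany(
        "INSERT INTO products VALUES (?, ?)",
        [("bamboo toothbrush", 1), ("plastic straw", 0)],
    )
    conn.execute(
        "CREATE VIEW sustainable_products AS "
        "SELECT name FROM products WHERE is_sustainable = 1"
    )
    conn.commit()
    conn.close()
    archive = Path(folder) / "demo.db.gz"
    with open(raw, "rb") as src, gzip.open(archive, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return archive


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        archive = build_demo_archive(tmp)
        db_file = decompress_database(archive, Path(tmp) / "restored.db")
        print(fetch_rows(db_file, "SELECT name FROM sustainable_products"))
```

With pandas installed, `pandas.read_sql_query(query, conn)` on an open `sqlite3` connection returns the same result as a DataFrame, which is usually more convenient for exploring the data.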
The design in the project report outlines how a data pipeline is created. The MainProjectPipeline.ipynb notebook implements one such pipeline from scratch. The steps to follow are summarized below:
- Read the product dataset from any source and explore the fields.
- Decide on the pre-processing pipeline design for that data, leveraging the ColumnDropper, RowDropper, and StringCleaner classes.
- Load the ontology data (the already processed file vocab_updated.xlsx is available in the Git repository).
- Implement the keyword extraction step using the KeywordExtractor class.
- Implement the keyword mapping step using the KeywordMapper class.
- Investigate the results and tune the vocabulary as required.
- Insert the data generated at different stages into the sustainable product database using the DBWriter class.
- Save the keyword extractor and mapper objects using Python's pickle module so that they can be reused later.
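To make the flow of these steps concrete, here is a toy end-to-end sketch. It does not use the project's classes: `ToyStringCleaner`, `ToyKeywordExtractor`, and `ToyKeywordMapper` are simplified stand-ins with assumed interfaces, the vocabulary is a hard-coded dictionary rather than vocab_updated.xlsx, and `write_labels` with its `product_labels` table is a hypothetical substitute for DBWriter. The real implementation is in MainProjectPipeline.ipynb.

```python
import pickle
import re
import sqlite3
import tempfile
from pathlib import Path


class ToyStringCleaner:
    """Illustrative cleaner: lowercases text and replaces punctuation with spaces."""

    def transform(self, texts):
        return [re.sub(r"[^a-z0-9]+", " ", t.lower()).strip() for t in texts]


class ToyKeywordExtractor:
    """Illustrative extractor: keeps every word that is not a stop word."""

    def __init__(self, stop_words):
        self.stop_words = set(stop_words)

    def extract(self, texts):
        return [[w for w in t.split() if w not in self.stop_words] for t in texts]


class ToyKeywordMapper:
    """Illustrative mapper: looks keywords up in a keyword -> label vocabulary."""

    def __init__(self, vocabulary):
        self.vocabulary = dict(vocabulary)

    def map(self, keyword_lists):
        return [
            sorted({self.vocabulary[k] for k in kws if k in self.vocabulary})
            for kws in keyword_lists
        ]


def write_labels(conn, names, labels):
    """Store one (product, label) row per mapped label; returns the row count."""
    conn.execute("CREATE TABLE IF NOT EXISTS product_labels (product TEXT, label TEXT)")
    rows = [(name, label) for name, found in zip(names, labels) for label in found]
    conn.executemany("INSERT INTO product_labels VALUES (?, ?)", rows)
    conn.commit()
    return len(rows)


def save_object(obj, path):
    """Pickle a pipeline object so that a later pipeline can reuse it."""
    with open(path, "wb") as handle:
        pickle.dump(obj, handle)


def run_pipeline(names, descriptions, conn, out_dir):
    """Clean, extract, map, store, and save the reusable objects."""
    cleaned = ToyStringCleaner().transform(descriptions)
    extractor = ToyKeywordExtractor(stop_words=["a", "the", "made", "from", "with"])
    mapper = ToyKeywordMapper(
        {"bamboo": "renewable material", "recycled": "recycled content"}
    )
    labels = mapper.map(extractor.extract(cleaned))
    write_labels(conn, names, labels)
    save_object(extractor, Path(out_dir) / "extractor.pkl")
    save_object(mapper, Path(out_dir) / "mapper.pkl")
    return labels
```

Saving the extractor and mapper as separate files keeps each stage independently reusable; the vocabulary travels inside the pickled mapper, so a later pipeline does not need to reload the ontology file.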
If a data pipeline already exists and its keyword extractor and mapper objects have been saved, the same objects can be reused on a different product dataset. SecondDataPipeline.ipynb is an example of this design. The steps to follow are listed below:
- Read the product dataset from any source and explore the fields.
- Decide on the pre-processing pipeline design for that data, leveraging the ColumnDropper, RowDropper, and StringCleaner classes.
- Load the keyword extractor object and extract keywords from the new product data.
- Load the keyword mapper object, set the results of the previous step in this object and then map the keywords using the ontology data already saved in the object.
- Investigate the results and tune the vocabulary as required.
- Insert the data generated at different stages into the sustainable product database using the DBWriter class.
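The reuse pattern in these steps can be sketched as follows. The two classes are again simplified stand-ins, not the project's KeywordExtractor and KeywordMapper; in particular, the `set_keywords` and `map` method names are assumptions chosen to mirror the "set the results, then map" step described above. Note that the class definitions must be importable when the pickled objects are loaded.

```python
import pickle
import tempfile
from pathlib import Path


class ToyKeywordExtractor:
    """Illustrative extractor: keeps every word that is not a stop word."""

    def __init__(self, stop_words):
        self.stop_words = set(stop_words)

    def extract(self, texts):
        return [
            [w for w in t.lower().split() if w not in self.stop_words]
            for t in texts
        ]


class ToyKeywordMapper:
    """Illustrative mapper that carries its vocabulary and the keywords to map."""

    def __init__(self, vocabulary):
        self.vocabulary = dict(vocabulary)
        self.keywords = []

    def set_keywords(self, keyword_lists):
        # Hypothetical setter: stores the extractor output for the next map() call.
        self.keywords = keyword_lists

    def map(self):
        return [
            sorted({self.vocabulary[k] for k in kws if k in self.vocabulary})
            for kws in self.keywords
        ]


def load_object(path):
    """Load a previously pickled pipeline object."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


def reuse_pipeline(extractor_path, mapper_path, new_descriptions):
    """Apply saved extractor and mapper objects to a new product dataset."""
    extractor = load_object(extractor_path)
    mapper = load_object(mapper_path)
    mapper.set_keywords(extractor.extract(new_descriptions))
    return mapper.map()
```

Because `pickle.load` can execute arbitrary code while deserializing, saved objects should only be loaded from files that come from a trusted source, such as the project's own earlier pipeline run.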