---
title: "Data 607 - Project 3 - Team Document"
author: "Glen Dale Davis, Coco Donovan, Alex Khaykin, Mohamed Hassan-El Serafi, Eddie Xu"
date: "2023-02-21"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
## Team Document
* Group Members:
    + Glen Dale Davis
    + Coco Donovan
    + Alex Khaykin
    + Mohamed Hassan-El Serafi
    + Eddie Xu
* Collaboration Tools:
    + Communication:
        + Slack thread and Zoom meetings
    + Code Sharing:
        + GitHub: https://github.com/geedoubledee/data607_project3
    + Project Documentation:
        + RStudio documents for our research (which will include a recap of the motivation/approach conversations we had on Slack and via Zoom, in addition to our process, our successes and challenges, and ultimately our findings), our Team Document, the Deliverables Timeline, and the Rubric
        + A MySQL database hosted on Google Cloud Platform for keeping track of the job listings we've found and gone through (a connection sketch follows this list)
        + TXT files containing the job descriptions for individual listings
        + A CSV file gathering together all the job description text we collected
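
For reference, here is a minimal sketch of how connecting to the MySQL database from R might look, using the DBI and RMariaDB packages. The host, credentials, database name, and the Jobs column names shown are placeholders for illustration, not our actual configuration.

```{r mysql_connection_sketch, eval=FALSE}
# Minimal sketch: connect to the MySQL database hosted on Google Cloud
# Platform and record a job listing we've gone through. All connection
# details and column names below are placeholders, not our real values.
library(DBI)
library(RMariaDB)

con <- dbConnect(
    RMariaDB::MariaDB(),
    host = "203.0.113.10",              # placeholder Cloud SQL instance IP
    user = "team_user",                 # placeholder username
    password = Sys.getenv("MYSQL_PWD"), # keep credentials out of the Rmd
    dbname = "jobs_db"                  # placeholder database name
)

# Record a listing whose description we successfully retrieved
dbExecute(con,
          "INSERT INTO Jobs (title, site_id) VALUES (?, ?)",
          params = list("Data Scientist", 1L))

dbDisconnect(con)
```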
* Data:
    + Several job boards that are at least partially devoted to data science job listings have RSS feeds we were able to gather data from, including:
        + ai-jobs.net
        + CareerCast IT & Engineering
        + Open Data Science Job Portal
        + Jobs for R-Users
        + MLconf Job Board
        + Python Job Board
        + Indeed
    + NB: Indeed's RSS feed stopped refreshing after the initial retrieval, and we were also unable to scrape the few job listings we did retrieve from it, so we later used a [recent Kaggle dataset of Indeed job listings](https://www.kaggle.com/datasets/yusufolonade/data-science-job-postings-indeed-usa) to supplement our data. That dataset contains approximately as many observations as we collected via other means, so we feel we ended up with a good mix of sources. Additionally, we are using a dataset of job postings from 2019, strictly for comparative/historical analysis against the most recent data we have; this dataset was scraped from Glassdoor and [obtained through Kaggle](https://www.kaggle.com/datasets/andrewmvd/data-scientist-jobs).
    + We set up a [Feedbin](https://feedbin.com/) RSS reader account to collect data science job listings from these RSS feeds. Every few days, we sent API calls to Feedbin to retrieve new results and save them as dataframes in CSV format (see the retrieval sketch after this list). We then read the data into R and extracted the job descriptions from the listings. Each job description was saved as a TXT file, and we recorded every job listing for which we successfully retrieved a job description in our SQL database.
    + LinkedIn job listings were not accessible via RSS feed, so we used a [RapidAPI alternative](https://rapidapi.com/jaypat87/api/linkedin-jobs-search) to retrieve job listings from that site (also sketched below). We were unable to scrape these listings automatically, but we were able to download the job descriptions manually instead. We stored them the same way we stored the rest of the data we collected.
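
As referenced above, here is a minimal sketch of the Feedbin retrieval step, using the httr and jsonlite packages. The account credentials are placeholders and the `per_page` value is an illustrative choice; the endpoint is Feedbin's documented v2 entries endpoint.

```{r feedbin_retrieval_sketch, eval=FALSE}
# Minimal sketch: pull recent entries from the Feedbin API with basic
# authentication and save them as a CSV for later processing in R.
library(httr)
library(jsonlite)

resp <- GET(
    "https://api.feedbin.com/v2/entries.json",
    authenticate("user@example.com", Sys.getenv("FEEDBIN_PWD")),
    query = list(per_page = 100)
)
stop_for_status(resp)

entries <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))
write.csv(entries, "feedbin_entries.csv", row.names = FALSE)
```

The LinkedIn step follows RapidAPI's standard request pattern. The `X-RapidAPI-*` headers are how RapidAPI authenticates requests, but the endpoint path and body fields below are assumptions for illustration only; the actual interface is documented on the API's RapidAPI listing.

```{r rapidapi_linkedin_sketch, eval=FALSE}
# Minimal sketch: query a RapidAPI-hosted LinkedIn jobs search API.
# The search_terms/location/page body fields are hypothetical examples.
library(httr)
library(jsonlite)

resp <- POST(
    "https://linkedin-jobs-search.p.rapidapi.com/",
    add_headers(
        "X-RapidAPI-Key" = Sys.getenv("RAPIDAPI_KEY"),
        "X-RapidAPI-Host" = "linkedin-jobs-search.p.rapidapi.com"
    ),
    body = list(search_terms = "data scientist",
                location = "United States",
                page = "1"),
    encode = "json"
)
stop_for_status(resp)
listings <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))
```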
* Normalized Database: Entity-Relationship (ER) Diagram:
![](https://raw.githubusercontent.com/geedoubledee/data607_project3/main/db_erd.png)
+ In addition to the tables for Sites and Jobs, the text from the job descriptions we retrieved will be stored in a third table called Txts in the normalized database. Because some of the "lines" in our job description files are very long (a byproduct of how the data was collected from the web), we will split each job description line into multiple labeled lines so that no field exceeds the character limits in SQL databases (a sketch of this splitting step follows).
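
To make the splitting step concrete, here is a minimal sketch in R. The 255-character chunk width, the input file name, and the Txts column names (job_id, line_num, chunk_num, txt) are illustrative assumptions, not our finalized schema.

```{r txts_splitting_sketch, eval=FALSE}
# Minimal sketch: break each long line of a job description TXT file into
# fixed-width, labeled pieces so every piece fits within a SQL field limit.
desc_lines <- readLines("job_description_001.txt", warn = FALSE)
max_chars <- 255  # assumed field limit for illustration

txts <- do.call(rbind, lapply(seq_along(desc_lines), function(i) {
    line <- desc_lines[i]
    starts <- seq(1, max(nchar(line), 1), by = max_chars)
    data.frame(
        job_id = 1L,                   # placeholder foreign key into Jobs
        line_num = i,                  # which original line this came from
        chunk_num = seq_along(starts), # order of the pieces within the line
        txt = substring(line, starts, starts + max_chars - 1),
        stringsAsFactors = FALSE
    )
}))

# txts can then be loaded into the Txts table, e.g. with DBI::dbWriteTable()
# or row-wise INSERTs.
```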