-
Notifications
You must be signed in to change notification settings - Fork 0
Dataset
In this section, we describe in detail how to get data from data sources.
We found that pandas datareader
is a great tool to get various data sources.
Detail on pandas Datareader remote resources could be found here.
Below is the summary of data sources in this project:
Method | Source | Categories | Feature/Key | Frequency | Unit | Description |
---|---|---|---|---|---|---|
API | JHU CSSE | Covid | Confirmed cases US | daily | case | Covid-19 confirmed cases in the US |
API | JHU CSSE | Covid | Deaths US | daily | case | Covid-19 deaths in the US |
API | JHU CSSE | Covid | Confirmed cases Global | daily | case | Covid-19 confirmed in the world |
API | JHU CSSE | Covid | Deaths cases Global | daily | case | Covid-19 deaths in the world |
PD Datareader | fred | Financial | nasdaq100 | daily | N/A | NASDAQ 100 Index |
PD Datareader | fred | Financial | sp500 | daily | N/A | S&P 500 Index |
PD Datareader | fred | Financial | djia | daily | N/A | Dow Jones Industrial Average |
PD Datareader | fred | Labor market | payems | monthly | Thousands of person | All Employees, Total Nonfarm |
PD Datareader | fred | Labor market | CES4348100001 | monthly | Thousands of person | All Employees, Air Transportation |
PD Datareader | fred | Labor market | CES6562000101 | monthly | Thousands of person | All Employees, Health Care |
PD Datareader | fred | Labor market | CES6561000001 | monthly | Thousands of person | All Employees, Educational Services |
PD Datareader | fred | Labor market | CES7071000001 | monthly | Thousands of person | All Employees, Arts, Entertainment, and Recreation |
PD Datareader | fred | Labor market | unrate | monthly | Percent | Unemployment Rate |
PD Datareader | fred | Labor market | jtsjol | monthly | Level in Thousands | Job Openings: Total Nonfarm |
PD Datareader | fred | Labor market | LNS13023653 | monthly | Thousands of Persons | Unemployment Level - Job Losers on Layoff |
PD Datareader | fred | Production & Business Activity | VMTD11 | monthly | Millions | Vehicle Miles Traveled |
PD Datareader | fred | Production & Business Activity | AIRRPMTSID11 | monthly | Thousands | Air Revenue Passenger Miles |
PD Datareader | fred | Production & Business Activity | MRTSSM7225USN | monthly | Millions of Dollars | Retail Sales: Restaurants and Other Eating Places |
PD Datareader | fred | Production & Business Activity | MRTSSM4541USS | monthly | Millions of Dollars | Retail Sales: Electronic Shopping and Mail-order Houses |
PD Datareader | fred | Production & Business Activity | MRTSSM4451USS | monthly | Millions of Dollars | Retail Sales: Grocery Stores |
PD Datareader | fred | Production & Business Activity | MRTSSM446USS | monthly | Millions of Dollars | Retail Sales: Health and Personal Care Stores |
-
This datase has two subsets: global and the US.
-
Type: Time series
-
Supported methods: API, csv
-
Frequency: daily
-
Interested features:
- Population
- World bank group (High Income,Upper Middle Income, Lower Middle Income, Low Income)
- Total cases
- Total deaths
- Total tests
- Test per case
-
Sample code:
import pandas as pd
confirmed_case_global_url='https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv'
confirmed_df = pd.read_csv(confirmed_case_global_url)
print(confirmed_df.shape)
print(confirmed_df.head())
- US Raw data sample:
UID iso2 iso3 code3 ... 10/15/20 10/16/20 10/17/20 10/18/20
0 84001001 US USA 840 ... 1949 1966 1983 1989
1 84001003 US USA 840 ... 6285 6333 6350 6369
2 84001005 US USA 840 ... 965 968 977 981
3 84001007 US USA 840 ... 761 771 775 785
4 84001009 US USA 840 ... 1768 1783 1807 1827
- Global Raw data sample:
Province/State Country/Region Lat ... 10/16/20 10/17/20 10/18/20
0 NaN Afghanistan 33.93911 ... 40073 40141 40200
1 NaN Albania 41.15330 ... 16501 16774 17055
2 NaN Algeria 28.03390 ... 53998 54203 54402
3 NaN Andorra 42.50630 ... 3377 3377 3377
4 NaN Angola -11.20270 ... 7222 7462 7622
Important Note The raw data is in coloumn-oriented format. Data of a new day is append as new column.
US:
UID 84001001
iso2 US
iso3 USA
code3 840
FIPS 1001.0
Admin2 Autauga
Province_State Alabama
Country_Region US
Lat 32.53952745
Long_ -86.64408227
Combined_Key Autauga, Alabama, US
day1
day2
...
currentday
Global:
Province/State NaN
Country/Region Albania
Lat 41.1533
Long 20.1683
day1
day2
...
currentday
-
Currently, there are three methods to get stock prices: Yahoo Finance API, Google Finance API, and pandas_datareader. Only the last one work.
-
Type: Time series
-
Supported methods: API, json
-
Frequency: daily
-
We focus on popular stock market in the US such as S&P500, Dow Jones, and Nasdaq
-
stock_ticker_raw
table:
CREATE TABLE IF NOT EXISTS stock_ticker_raw(
ticker VARCHAR(16) UNIQUE NOT NULL,
name VARCHAR(128), -- full name of the stock ticker
industry VARCHAR(64) NULL,
subindustry VARCHAR(64) NULL,
hq_location VARCHAR(64) NULL,
date_first_added datetime NULL,
cik VARCHAR(10) NULL, -- A Central Index Key or CIK number
founded_year int NULL,
PRIMARY KEY(ticker)
);
-
stock_price_raw
table:
CREATE TABLE IF NOT EXISTS stock_price_raw(
stock_ticker VARCHAR(16) NOT NULL,
date datetime NOT NULL,
High double NOT NULL,
Low double NOT NULL,
Open double NOT NULL,
Close double NOT NULL,
Volume double NOT NULL,
adj_close double NOT NULL
);
- How to extract data
- First, we get the master list of all stock stickers using the API.
- Next, for each stock sticker, we get the stock prices data using Yahoo's API.
- Extract data into the raw table.
pip install pandas-datareader
from pandas_datareader import data
tickers = ['AAPL', 'MSFT', '^GSPC']
start_date = '2020-01-01'
end_date = '2020-09-30'
for ticker in tickers:
panel_data = data.DataReader(ticker, 'yahoo', start_date, end_date)
print(type(panel_data))
print("ticker: ", ticker)
print(panel_data)
- Stock price data sample
Date High Low Open Close Volume Adj Close
2020-01-02 75.150002 73.797501 74.059998 75.087502 135480400.0 74.573036
2020-01-03 75.144997 74.125000 74.287498 74.357498 146322800.0 73.848030
2020-01-06 74.989998 73.187500 73.447502 74.949997 118387200.0 74.436470
2020-01-07 75.224998 74.370003 74.959999 74.597504 108872000.0 74.086395
2020-01-08 76.110001 74.290001 74.290001 75.797501 132079200.0 75.278160
- Data source: U.S Bureau of Labor
- Type: Time series
- Supported methods: API, json
- Frequency: monthly
- All data from Bureau of Labor has unique series id for each feature. We list out desired features and their corresponding series_id as below
- Interested features:
-
Unemployment Rate (overall):
LNS14000000
-
Unemployment Rate Races:
- Black or African American:
LNS14000006
- Hispanic or Latino:
LNS14000009
- White:
LNS14000003
- Asian:
LNS14032183
- Black or African American:
-
Unemployment Rate - Occupations source
- Management, Professional, and Related Occupations:
LNU04032215
- Service Occupations:
LNU04032218
- Sales and Office Occupations:
LNU04032219
- Natural Resources, Construction, and Maintenance Occupations:
LNU04032222
- Production, Transportation and Material Moving Occupations:
LNU04032226
- Management, Professional, and Related Occupations:
-
- How to extract data:
- First, we build the master list of interested series (e.g., LNS14000000, LNS14000009, LNS14000003, etc) and keep the data in
bol_series_dim
.
- First, we build the master list of interested series (e.g., LNS14000000, LNS14000009, LNS14000003, etc) and keep the data in
CREATE TABLE IF NOT EXISTS bol_series_dim(
series_id VARCHAR(64) UNIQUE NOT NULL, -- matched with series_id from bol_raw
category VARCHAR(256) NOT NULL, -- main category
subcat1 VARCHAR(256), -- subcategory 1
subcat2 VARCHAR(256), -- subcategory 1
PRIMARY KEY(series_id)
);
INSERT INTO BOL_series_dim VALUES('LNS14000000', 'Unemployment Rate', 'overall', '');
INSERT INTO BOL_series_dim VALUES('LNS14000006', 'Unemployment Rate', 'race', 'Black or African American');
INSERT INTO BOL_series_dim VALUES('LNS14000009', 'Unemployment Rate', 'race', 'Hispanic or Latino');
INSERT INTO BOL_series_dim VALUES('LNS14000003', 'Unemployment Rate', 'race', 'White');
INSERT INTO BOL_series_dim VALUES('LNS14000003', 'Unemployment Rate', 'race', 'Asian');
INSERT INTO BOL_series_dim VALUES('LNU04032215', 'Unemployment Rate', 'occupation', 'Management, Professional, and Related Occupations');
INSERT INTO BOL_series_dim VALUES('LNU04032218', 'Unemployment Rate', 'occupation', 'Service');
INSERT INTO BOL_series_dim VALUES('LNU04032219', 'Unemployment Rate', 'occupation', 'Sales and Office Occupations');
INSERT INTO BOL_series_dim VALUES('LNU04032222', 'Unemployment Rate', 'occupation', 'Natural Resources, Construction, and Maintenance Occupations');
INSERT INTO BOL_series_dim VALUES('LNU04032226', 'Unemployment Rate', 'occupation', 'Production, Transportation and Material Moving Occupations');
- Next, for each serie, we extract data from data source using HTTP request.
- We extract data to raw tables.
import requests
import json
import prettytable
headers = {'Content-type': 'application/json'}
data = json.dumps({"seriesid": ['CUUR0000SA0','SUUR0000SA0'],"startyear":"2011", "endyear":"2014"})
p = requests.post('https://api.bls.gov/publicAPI/v2/timeseries/data/', data=data, headers=headers)
json_data = json.loads(p.text)
for series in json_data['Results']['series']:
x=prettytable.PrettyTable(["series id","year","period","value","footnotes"])
seriesId = series['seriesID']
for item in series['data']:
year = item['year']
period = item['period']
value = item['value']
footnotes=""
for footnote in item['footnotes']:
if footnote:
footnotes = footnotes + footnote['text'] + ','
'if 'M01' <= period <= 'M12':'
x.add_row([seriesId,year,period,value,footnotes[0:-1]])
output = open(seriesId + '.txt','w')
output.write (x.get_string())
output.close()
- Sample data
CUUR0000SA0.txt
+-------------+------+--------+---------+-----------+
| series id | year | period | value | footnotes |
+-------------+------+--------+---------+-----------+
| CUUR0000SA0 | 2020 | M09 | 260.280 | |
| CUUR0000SA0 | 2020 | M08 | 259.918 | |
| CUUR0000SA0 | 2020 | M07 | 259.101 | |
| CUUR0000SA0 | 2020 | M06 | 257.797 | |
| CUUR0000SA0 | 2020 | M05 | 256.394 | |
| CUUR0000SA0 | 2020 | M04 | 256.389 | |
| CUUR0000SA0 | 2020 | M03 | 258.115 | |
| CUUR0000SA0 | 2020 | M02 | 258.678 | |
| CUUR0000SA0 | 2020 | M01 | 257.971 | |
| CUUR0000SA0 | 2019 | M12 | 256.974 | |
| CUUR0000SA0 | 2019 | M11 | 257.208 | |
| CUUR0000SA0 | 2019 | M10 | 257.346 | |
| CUUR0000SA0 | 2019 | M09 | 256.759 | |
| CUUR0000SA0 | 2019 | M08 | 256.558 | |
| CUUR0000SA0 | 2019 | M07 | 256.571 | |
| CUUR0000SA0 | 2019 | M06 | 256.143 | |
| CUUR0000SA0 | 2019 | M05 | 256.092 | |
| CUUR0000SA0 | 2019 | M04 | 255.548 | |
| CUUR0000SA0 | 2019 | M03 | 254.202 | |
| CUUR0000SA0 | 2019 | M02 | 252.776 | |
| CUUR0000SA0 | 2019 | M01 | 251.712 | |
+-------------+------+--------+---------+-----------+
SUUR0000SA0.txt
+-------------+------+--------+---------+-----------+
| series id | year | period | value | footnotes |
+-------------+------+--------+---------+-----------+
| SUUR0000SA0 | 2020 | M09 | 146.072 | Initial |
| SUUR0000SA0 | 2020 | M08 | 145.853 | Initial |
| SUUR0000SA0 | 2020 | M07 | 145.405 | Initial |
| SUUR0000SA0 | 2020 | M06 | 144.651 | Interim |
| SUUR0000SA0 | 2020 | M05 | 143.800 | Interim |
| SUUR0000SA0 | 2020 | M04 | 143.847 | Interim |
| SUUR0000SA0 | 2020 | M03 | 145.005 | Interim |
| SUUR0000SA0 | 2020 | M02 | 145.390 | Interim |
| SUUR0000SA0 | 2020 | M01 | 144.995 | Interim |
| SUUR0000SA0 | 2019 | M12 | 144.437 | Interim |
| SUUR0000SA0 | 2019 | M11 | 144.613 | Interim |
| SUUR0000SA0 | 2019 | M10 | 144.722 | Interim |
| SUUR0000SA0 | 2019 | M09 | 144.428 | |
| SUUR0000SA0 | 2019 | M08 | 144.388 | |
| SUUR0000SA0 | 2019 | M07 | 144.409 | |
| SUUR0000SA0 | 2019 | M06 | 144.243 | |
| SUUR0000SA0 | 2019 | M05 | 144.183 | |
| SUUR0000SA0 | 2019 | M04 | 143.926 | |
| SUUR0000SA0 | 2019 | M03 | 143.297 | |
| SUUR0000SA0 | 2019 | M02 | 142.571 | |
| SUUR0000SA0 | 2019 | M01 | 142.001 | |
+-------------+------+--------+---------+-----------+
- Air Carrier Statistics (Form 41 Traffic)- U.S. Carriers
- Air Carriers : T-100 Domestic Segment (U.S. Carriers) (Latest Available Data: July 2020)
- National Center for Environmental Information
- AirNow Web service: Get daily air quality index