This repository contains cleaned daily reports and time series data from the 2019 Novel Coronavirus COVID-19 (2019-nCoV) Data Repository by the Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE).
It has the following goals:
- To provide a stable version of the JHU CSSE data repository where variable names are not subject to change.
- To provide an updated data set that maintains consistent naming conventions across all files.
- To address various inconsistencies in how data were named and managed, including how dates and times are encoded and how programs may interpret missing values.
- To provide time series data in long (tidy) format that will be familiar to most R users.
The cleaned data are organized within the following folders akin to how JHU CSSE organize their data.
- Cleaned Daily Reports (csse_covid_19_daily_reports_cleaned)
- Cleaned Daily Reports US (csse_covid_19_daily_reports_us_cleaned)
- Tidy Time-Series Data (csse_covid_19_time_series_cleaned)
- Cleaned JHU CSSE Data (csse_covid_19_clean_data)
  - This is a new folder. It contains the above daily reports (concatenated) and time series data in .Rdata format.
- Cleaned Supporting Material (csse_cleaned_supporting_material)
  - This is a new folder that contains a cleaned copy of JHU CSSE's Lookup Table (i.e., UID_ISO_FIPS_LookUp_Table.csv) for mapping geographical codes to regions.
The csse_covid_19_daily_reports_cleaned folder contains cleaned daily reports from JHU CSSE. Unlike JHU CSSE's raw csv files, every file in this folder contains the same variables. These variables adopt a consistent naming scheme and order that will not change (although new variables may be added sequentially). Cleaned daily reports are added to this folder daily to reflect the latest additions and updates from JHU CSSE.
The following columns are found in every csv file in this order:
- Date_Published *
- FIPS
- Admin2
- Province_State
- Country_Region
- Last_Update
- Latitude
- Longitude
- Confirmed
- Deaths
- Recovered
- Active
- Combined_Key
New variables will be added IF JHU CSSE make changes. However, the above variables and their names will not change. If JHU CSSE introduce new variables, these will be added sequentially to the variables above.
Note 1: * The `Date_Published` field is unique to this repository. It records the date that JHU CSSE publishes each daily report and is used to organize data and troubleshoot issues. Note that the dates recorded in `Date_Published` may differ from those recorded in `Last_Update`. Note also that `Date_Published` for reports starting April 23, 2020 is usually one day behind `Last_Update`. This is not an issue with this repo but a result of how JHU scheduled its automated updates. See #2369 for more information.
Note 2: Some variables (e.g., `Incidence_rate`, `Case-fatality_Ratio`) are excluded from this repo as JHU inconsistently uploaded them in the past. Users can calculate these variables themselves using the data available in this repository.
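As a rough illustration of Note 2, the excluded variables can be derived from the columns that are kept. The sketch below is in Python and assumes the usual definitions (case-fatality ratio as deaths per 100 confirmed cases, incidence rate as confirmed cases per 100,000 people). The function names and the sample numbers are hypothetical, and the population figure would have to come from the Lookup Table or another source.

```python
def case_fatality_ratio(confirmed, deaths):
    """Deaths as a percentage of confirmed cases; None when there are no cases."""
    if not confirmed:
        return None
    return 100.0 * deaths / confirmed


def incidence_rate(confirmed, population, per=100_000):
    """Confirmed cases per `per` people; None when the population is unknown."""
    if not population:
        return None
    return per * confirmed / population


# Hypothetical values for one row of a cleaned daily report.
row = {"Confirmed": 2000, "Deaths": 50}
population = 400_000  # assumed to come from the Lookup Table

print(case_fatality_ratio(row["Confirmed"], row["Deaths"]))  # 2.5
print(incidence_rate(row["Confirmed"], population))          # 500.0
```

Returning `None` rather than zero keeps "no data" distinct from "no cases", which mirrors how this repository treats missing values.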
The following fixes are applied to the daily reports:
- `Last_Update` is reformatted to fix inconsistencies in how dates and times were formatted across csv files.
  - All timestamps adopt a YYYY-MM-DD HH:MM:SS (24-hour format, in UTC) POSIXct format.
- Blank cells indicating an absence of COVID-19 `Confirmed`, `Deaths`, and `Recovered` cases are replaced with zeros. (This prevents programs like R from treating these as missing values.)
- `Active` cases (i.e., Active = Confirmed - Deaths - Recovered) are recalculated to correct for errors and to replace missing values in older daily reports. A sanity check also ensures that active cases are equal to or greater than zero; cases where JHU reports negative active cases are recorded as missing values.
- A consistent naming scheme is enforced in `Country_Region` and `Province_State` to uniquely identify geographical locations across daily reports and time series data. For example, "Korea, South" and "Republic of Korea" are reduced to "South Korea" across all files. Other values such as "US" are changed to "United States" to improve data exploration and enforce a consistent naming scheme (e.g., "United States" and "United Kingdom" rather than "US" and "United Kingdom").
- Values in `Combined_Key` are recreated each day based on relevant string values. Creating a new `Combined_Key` addresses various inconsistencies across daily reports (e.g., "France" and ",,France") as well as issues that occur as a result of typos (see: #2603).
- Inconsistencies are fixed in older daily reports where string values in `Province_State` combined the names of cities/counties with provinces/states (e.g., "Los Angeles, CA" and "Calgary, AB"). Such strings are split into `Admin2` (e.g., "Los Angeles") and `Province_State` (e.g., "California"). See #2 for more information.
- Data from JHU's Lookup Table are mapped to daily reports to better ensure consistent naming conventions and data accuracy for `FIPS` geographical codes as well as `Latitude` and `Longitude` coordinates. Mapping these data also updates missing values in older daily reports and ensures consistency across all files. Any changes that JHU makes to their Lookup Table are automatically applied to all files in this repository with each update.
- `FIPS` codes are fixed to address known issues pertaining to JHU truncating leading zeros (e.g., #2638 and #2530). `FIPS` values in this repo are corrected to 2 digits at the state level and 5 digits at the county level.
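A few of the fixes listed above can be sketched in Python to make the rules concrete. This is not the code used to build this repository; the function names, the list of accepted timestamp layouts, and the `county_level` flag are assumptions made for illustration.

```python
from datetime import datetime

# Timestamp layouts assumed for illustration; the raw files may use others.
_FORMATS = (
    "%Y-%m-%d %H:%M:%S",
    "%Y-%m-%dT%H:%M:%S",
    "%m/%d/%Y %H:%M",
    "%m/%d/%y %H:%M",
)


def normalize_timestamp(text):
    """Return the timestamp as YYYY-MM-DD HH:MM:SS, or None if it cannot be parsed."""
    for fmt in _FORMATS:
        try:
            parsed = datetime.strptime(text.strip(), fmt)
        except ValueError:
            continue
        return parsed.strftime("%Y-%m-%d %H:%M:%S")
    return None


def to_count(cell):
    """Treat a blank Confirmed/Deaths/Recovered cell as zero rather than missing."""
    cell = (cell or "").strip()
    return int(float(cell)) if cell else 0


def recompute_active(confirmed, deaths, recovered):
    """Active = Confirmed - Deaths - Recovered; negative results become missing."""
    active = confirmed - deaths - recovered
    return active if active >= 0 else None


def pad_fips(fips, county_level):
    """Restore leading zeros: 5 digits for counties, 2 digits for states."""
    text = (fips or "").strip().split(".")[0]  # also drops a trailing ".0"
    if not text:
        return None
    return text.zfill(5 if county_level else 2)


print(normalize_timestamp("1/22/2020 17:00"))  # 2020-01-22 17:00:00
print(recompute_active(to_count("10"), to_count(""), to_count("3")))  # 7
print(pad_fips("1001", county_level=True))  # 01001
```

Each rule is a separate small function so it can be checked on its own against a handful of raw rows.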
This folder contains daily reports from JHU CSSE's csse_covid_19_daily_reports_us folder. Daily reports contain US-only data, which are aggregated at the state level.
These are similar to daily reports discussed above, but include more variables.
The following columns are found in every csv file in this order:
- Date_Published
- UID
- iso2
- iso3
- code3
- FIPS
- Province_State
- Country_Region
- Last_Update
- Latitude
- Longitude
- Confirmed
- Deaths
- Recovered
- Active
- Incident_Rate
- People_Tested
- People_Hospitalized
- Mortality_Rate
- Testing_Rate
- Hospitalization_Rate
- Population
See the README in csse_covid_19_daily_reports_us_cleaned for more information.
The csse_covid_19_time_series_cleaned folder contains time series data in a tidy rather than wide format. Data include `confirmed`, `deaths`, and `recovered` cases of COVID-19. All data are from JHU CSSE's time series csv files (which JHU CSSE creates from their daily case reports).
The following variables are recorded in this order:
- Province_State
- Country_Region
- Latitude
- Longitude
- Date
- Confirmed
- Deaths
- Recovered
New variables will be added IF JHU CSSE make changes. However, the above variable names will not change. If JHU CSSE introduce new variables, these will be added sequentially to the variables above.
- `Date` is encoded as date objects and adopts a standard YYYY-MM-DD format.
- Data on `Confirmed`, `Deaths`, and `Recovered` cases are concatenated into a single csv file in a tidy format.
- Data from JHU's Lookup Table are mapped to time series data to ensure consistency in `Latitude` and `Longitude` coordinates.
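To show what moving from wide to tidy format involves, here is a minimal Python sketch that reshapes one wide time series table into long rows. It assumes the wide layout has four identifying columns followed by one column per date written as M/D/YY. The sample table, the place names, and the `wide_to_long` helper are made up for illustration.

```python
import csv
import io
from datetime import datetime

# Made-up sample in the assumed wide layout (one column per date).
WIDE = """Province/State,Country/Region,Lat,Long,1/22/20,1/23/20
,Exampleland,10.0,20.0,1,3
Example Province,Exampleland,11.0,21.0,5,7
"""

ID_COLUMNS = ("Province/State", "Country/Region", "Lat", "Long")


def wide_to_long(wide_csv_text, value_name):
    """Emit one row per location and date instead of one column per date."""
    rows = []
    for record in csv.DictReader(io.StringIO(wide_csv_text)):
        for column, value in record.items():
            if column in ID_COLUMNS:
                continue
            date = datetime.strptime(column, "%m/%d/%y").date().isoformat()
            rows.append({
                "Province_State": record["Province/State"] or None,
                "Country_Region": record["Country/Region"],
                "Latitude": record["Lat"],
                "Longitude": record["Long"],
                "Date": date,
                value_name: int(value) if value else None,
            })
    return rows


for row in wide_to_long(WIDE, "Confirmed"):
    print(row["Country_Region"], row["Province_State"], row["Date"], row["Confirmed"])
```

Running the same reshaping for deaths and recoveries and joining the results on location and date would give the single tidy file described above.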
Warning: The `Recovered` series contains fewer observations than `Confirmed` and `Deaths`. Missing values indicate dates where data on recoveries are unavailable. JHU CSSE also warns that there are no reliable data sources reporting recovered cases for many countries. Therefore, please exercise caution when interpreting data on the number of recoveries.
The csse_covid_19_clean_data folder contains the latest data from JHU CSSE in .Rdata and csv formats. CSSE_DailyReports concatenates JHU's daily reports into a single master file; `Date_Published` identifies the csv file behind each daily report. CSSE_TimeSeries contains the latest time series data. Both are presented in long rather than wide format. Data are also cleaned per the descriptions above.
Additional features:
- Geographic codes from JHU's Lookup Table are matched with geographical locations in CSSE_DailyReports according to `Combined_Key` to provide additional variables that are not included in JHU daily reports; for example, `UID`, `iso2`, `code3`, etc., where applicable.
- `Population` statistics are also added to CSSE_DailyReports based on values reported in JHU's Lookup Table.
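The matching on `Combined_Key` described above amounts to a left join from the daily reports onto the Lookup Table. A minimal Python sketch follows, with hypothetical rows and a hypothetical `add_lookup_fields` helper; it is not the code used to build CSSE_DailyReports.

```python
def add_lookup_fields(report_rows, lookup_rows,
                      fields=("UID", "iso2", "code3", "Population")):
    """Attach lookup columns to each report row that shares a Combined_Key."""
    by_key = {row["Combined_Key"]: row for row in lookup_rows}
    merged = []
    for row in report_rows:
        match = by_key.get(row["Combined_Key"], {})
        # Unmatched rows are kept, with the extra fields left as None.
        merged.append({**row, **{name: match.get(name) for name in fields}})
    return merged


# Hypothetical rows; real keys and values come from the cleaned files.
reports = [
    {"Combined_Key": "Example County, Example State, United States", "Confirmed": 10},
    {"Combined_Key": "Unlisted Place", "Confirmed": 1},
]
lookup = [
    {"Combined_Key": "Example County, Example State, United States",
     "UID": "84099999", "iso2": "US", "code3": "840", "Population": "12345"},
]

for row in add_lookup_fields(reports, lookup):
    print(row["Combined_Key"], row["UID"], row["Population"])
```

Keeping unmatched rows matches the "where applicable" wording: locations without a lookup entry stay in the data, just without the extra codes.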
Note: I will provide notices if changes to files in this folder are planned. But keep in mind that unlike the other folders, these files are more likely to be modified.
The csse_cleaned_supporting_material folder contains other miscellaneous files that JHU CSSE shares on GitHub. Currently, it consists of a cleaned copy of JHU's Lookup Table (i.e., UID_ISO_FIPS_LookUp_Table.csv), which contains geographical codes and population statistics on various regions.
- `FIPS` is encoded as a character variable rather than an integer. Relatedly, a known issue with FIPS codes is fixed such that state-level and county-level FIPS codes are appropriately padded with leading zeros. FIPS codes include an appropriate number of digits for states (e.g., Alabama's FIPS is 01 rather than 1) and counties (e.g., Alabama's Autauga is 01001 rather than 1001).
- Some country names have been modified in `Country_Region` to ensure a more consistent naming scheme across data; specifically,
  - "US" becomes "United States".
  - "Korea, South" becomes "South Korea".
  - "Taiwan*" becomes "Taiwan".
- Blank cells are treated as missing values.
Warning: Programs such as Microsoft Excel and LibreOffice Calc will, by default, truncate leading zeros when opening the Lookup_Table.csv. Keep this in mind when opening the file in these programs, or consider using a text editor such as Notepad++.
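The same care applies when reading the table in code: `FIPS` should be read as text so that leading zeros survive. The Python sketch below reads a small in-memory sample that stands in for the real file; the `read_lookup` helper is hypothetical, and the sample keeps only a few of the table's columns.

```python
import csv
import io

# Small stand-in for the cleaned Lookup Table (only a few columns shown).
SAMPLE = """FIPS,Admin2,Province_State,Country_Region
01,,Alabama,United States
01001,Autauga,Alabama,United States
"""


def read_lookup(handle):
    """Read every column as text and turn blank cells into None."""
    rows = []
    for record in csv.DictReader(handle):
        rows.append({key: (value if value != "" else None)
                     for key, value in record.items()})
    return rows


rows = read_lookup(io.StringIO(SAMPLE))
print(rows[0]["FIPS"], rows[0]["Admin2"])  # 01 None
print(rows[1]["FIPS"], rows[1]["Admin2"])  # 01001 Autauga
```

The `csv` module never converts values to numbers, so the zeros are preserved without extra work. With pandas, passing `dtype={"FIPS": str}` to `read_csv` has the same effect.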
A huge thanks to everyone at the Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE) who is involved in collecting and maintaining the 2019 Novel Coronavirus COVID-19 (2019-nCoV) Data Repository.
Data can be found at the JHU CSSE's Data Repository.
I'm just one guy. If you use these data I make no warranties regarding the accuracy of this information and disclaim any liability for damages resulting from using this repository. JHU CSSE's Terms of Use apply.