This open source pipeline aggregates public COVID-19 data sources into a single dataset, which includes COVID-19 cases, deaths, tests, hospitalizations, discharges, intensive care unit (ICU) cases, ventilator cases, government interventions, and Google's COVID-19 Community Mobility Reports. The aggregated data is designed for researchers to build models quickly, and the pipeline is designed for engineers to add new data sources quickly.
In particular, we support data that arrives in three forms: data that can be downloaded automatically (generally a .csv or .xlsx file from a stable URL), data that must be downloaded manually (generally .csv or .xlsx files without stable URLs), and data that is not machine-readable and must be scraped by a human (from charts, tables, PDFs, or occasionally tweets).
If you just want to use the latest data for models, visualizations, or research, we provide four aggregated data files under four different licenses. This is to provide you with options so that you can use data with a license that is acceptable for your use case, while respecting the original licenses of the data sources.
- The CC-BY aggregation can be downloaded from this link.
- The CC-BY-SA aggregation can be downloaded from this link.
- The CC-BY-NC aggregation can be downloaded from this link.
- The Google TOS data can be downloaded from this link. In order to download or use the data or reports, you must agree to the Google Terms of Service.
Releases of the dataset are tagged, so there is a stable GitHub URL that points to each version of the data.
Please see the Data Sources section of this README to note the attributions and licenses for each source.
We have carefully checked the license and attribution information on each data source included in this repository, and in many cases have contacted the data owners directly to ask how they would like to be attributed.
If you are the owner of a data source included here and would like us to remove data, add or alter an attribution, or add or alter license information, please do not hesitate to email us at open-covid-19-data@google.com and we will happily consider your request.
Data is obtained from the original source by automatic download, by manual download, or by human scraping. All data goes into the `data/inputs` directory before being consumed by the rest of the pipeline. Data sources for each data type are then loaded into pandas dataframes with a standardized schema for dates, locations, and columns. These dataframes are joined into a single dataframe, which can then be exported.
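The join step can be pictured with a minimal pandas sketch. The key columns `region_code` and `date`, the data column names, and all values below are assumptions made up for illustration; the pipeline's real schema and join logic may differ.

```python
import pandas as pd

# Hypothetical key columns shared by every standardized dataframe.
KEYS = ["region_code", "date"]


def join_sources(frames):
    """Outer-join per-data-type frames on the shared key columns."""
    joined = frames[0]
    for frame in frames[1:]:
        joined = joined.merge(frame, on=KEYS, how="outer")
    return joined.sort_values(KEYS).reset_index(drop=True)


# Toy inputs with invented values, already in the standardized schema.
hospitalizations = pd.DataFrame({
    "region_code": ["XX", "XX"],
    "date": ["2020-07-26", "2020-07-27"],
    "hospitalized_current": [10, 12],
})
mobility = pd.DataFrame({
    "region_code": ["XX"],
    "date": ["2020-07-27"],
    "mobility_retail": [-5.0],
})

joined = join_sources([hospitalizations, mobility])
print(joined)
```

An outer join keeps a row for every region and date that appears in any source, leaving missing values where a source has no data for that row.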
Locations are mapped to a standardized, hierarchical set of region codes. The full list of region codes can be found at `data/exports/locations/locations.csv`.
The first-level region codes are ISO-3166-1 codes. By default, the second-level region codes are ISO-3166-2 codes. However, in some locations, COVID-19 data is reported in administrative regions other than ISO-3166-2, so the choice of sub-country regions is informed partially by data availability. Third-level regions include cities and counties; within the United States, counties are coded using FIPS 6-4 codes.
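As a rough illustration of how these code systems nest, the sketch below relies only on two properties of the standards themselves: an ISO 3166-2 code begins with a two-letter ISO 3166-1 country code followed by a hyphen, and a FIPS 6-4 county code is five digits whose first two digits identify the state. The helper names are hypothetical, and how this repository composes its own region codes is defined by the locations file, not by this sketch.

```python
def country_of(iso_3166_2_code):
    """Return the two-letter country prefix of an ISO 3166-2 code."""
    country, sep, subdivision = iso_3166_2_code.partition("-")
    if not sep or len(country) != 2 or not subdivision:
        raise ValueError(f"not an ISO 3166-2 code: {iso_3166_2_code!r}")
    return country


def split_county_fips(fips):
    """Split a five-digit FIPS 6-4 county code into (state, county) parts."""
    if len(fips) != 5 or not fips.isdigit():
        raise ValueError(f"not a five-digit county FIPS code: {fips!r}")
    return fips[:2], fips[2:]


print(country_of("US-CA"))         # California belongs to the US
print(split_county_fips("06037"))  # Los Angeles County, California
```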
All dates are mapped to ISO 8601 format during data loading.
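A minimal sketch of this normalization, using only the Python standard library. The function names are hypothetical and are not the parsers in `date_utils.py`; they only show the two common cases of a single formatted date column and separate year, month, and day columns.

```python
from datetime import date, datetime


def to_iso_date(raw, date_format):
    """Parse a date string with an explicit format and return YYYY-MM-DD."""
    return datetime.strptime(raw, date_format).date().isoformat()


def to_iso_date_from_parts(year, month, day):
    """Build YYYY-MM-DD from separate year, month, and day values."""
    return date(int(year), int(month), int(day)).isoformat()


print(to_iso_date("27/07/2020", "%d/%m/%Y"))
print(to_iso_date_from_parts("2020", "7", "27"))
```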
To install Python dependencies:
pip install pandas xlrd pyyaml python3-wget
Before adding a new data source, we go through an internal approval within Google to ensure compliance with licensing and terms. Once a data source is approved, you can add the data to the pipeline as follows:
- If the source includes a data type that isn't yet included in the data schema, register the data type in the schema by adding an entry to `src/config/data.yaml`.
- Specify the `fetch` parameters:
  - `source_url`: where to download the data
  - `method`: one of `AUTOMATIC_DOWNLOAD`, `MANUAL_DOWNLOAD`, `SCRAPED`, `STATIC`
  - `file`: filename for the data source
- Specify the `load` parameters:
  - `function`: which function in `load_functions.py` to use to load the data. Most data sources can be loaded with `default_load_function`, but some data sources will have formatting that requires implementing a new function in `load_functions.py`.
  - `read`: data sources are read using the `pandas.read_csv()` or `pandas.read_excel()` functions. The `read` field accepts key/value parameters that are passed to the appropriate pandas read function.
  - `dates`:
    - `columns`: list of column names in the original data source that are passed as arguments to a function that returns the date in ISO 8601 format. This is often just a single column, but sometimes the year, month, and day are in separate columns in the original data.
    - `date_format`: the format of the date in the original data source
    - `parse_function`: most dates can be parsed using the `default` function in `date_utils.py`. If the data source has a date format that requires a parser that doesn't exist in `date_utils.py`, implement a separate function in that file.
  - `regions`:
    - `mapping_keys`: if a data source contains multiple regions but not ISO-3166 codes for the regions, the locations file at `data/exports/locations/locations.csv` must contain a column or list of columns that can uniquely map the locations in the data to the `region_code` for that location. The `mapping_keys` field takes key/value fields where the key is the column in the locations file, and the value is the string name of the column in the original data source.
- Specify the `data` parameters:
  - These parameters follow the data schema specified in `src/config/data.yaml`, where the keys come from the data schema and the values are the column name in the original data source for the corresponding data.
- Specify the `attribution` parameters. These are used to generate the data source section of the README. The fields for existing data sources serve as an example of what to include.
- Specify the `license` parameters. These are used to generate the LICENSE file. The fields for existing data sources serve as an example of what to include.
- Specify the `cc_by` and `cc_by_sa` fields: we produce two aggregated CSV files, one is licensed under `CC-BY` and the other is under `CC-BY-SA`. These fields specify whether the data can appear in each file.
- When you run `src/scripts/export_data.py`, it will update the `README.md` as well as the `LICENSE` files within `data/exports`.
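To make the parameters above concrete, here is a hedged sketch of how one source entry might be interpreted. The entry is written as a Python dict for illustration only; the nesting of `dates` and `regions` under `load`, the boolean `cc_by` values, the schema key `hospitalized_current`, the URL, and all column names and values are assumptions. `load_source` is a hypothetical stand-in, not the repository's `default_load_function`.

```python
import io

import pandas as pd

ALLOWED_METHODS = {"AUTOMATIC_DOWNLOAD", "MANUAL_DOWNLOAD", "SCRAPED", "STATIC"}

# Hypothetical source entry; every value is invented for illustration.
source = {
    "fetch": {
        "source_url": "https://example.com/hospital.csv",
        "method": "AUTOMATIC_DOWNLOAD",
        "file": "hospital.csv",
    },
    "load": {
        "function": "default_load_function",
        "read": {"sep": ";"},  # forwarded to pandas.read_csv
        "dates": {"columns": ["day"], "date_format": "%d/%m/%Y"},
        # locations-file column -> column in the original data
        "regions": {"mapping_keys": {"region_name": "province"}},
    },
    # data-schema key -> column in the original data
    "data": {"hospitalized_current": "in_hospital"},
    "cc_by": True,
    "cc_by_sa": True,
}


def load_source(raw, source, locations):
    """Read one source and map it to region_code, date, and data columns."""
    if source["fetch"]["method"] not in ALLOWED_METHODS:
        raise ValueError("unknown fetch method")
    load = source["load"]
    df = pd.read_csv(raw, **load["read"])

    # Dates: this sketch handles the single-column case only.
    date_column = load["dates"]["columns"][0]
    parsed = pd.to_datetime(df[date_column], format=load["dates"]["date_format"])
    df["date"] = parsed.dt.strftime("%Y-%m-%d")

    # Regions: rename source columns to the locations-file names, then join.
    keys = load["regions"]["mapping_keys"]
    df = df.rename(columns={src: loc for loc, src in keys.items()})
    key_columns = list(keys)
    df = df.merge(locations[key_columns + ["region_code"]], on=key_columns, how="left")

    # Data: rename source columns to the schema names.
    df = df.rename(columns={src: name for name, src in source["data"].items()})
    return df[["region_code", "date", *source["data"]]]


# Toy stand-ins for the downloaded file and the locations file.
raw_csv = "province;day;in_hospital\nRegion A;27/07/2020;12\n"
locations = pd.DataFrame({"region_name": ["Region A"], "region_code": ["XX-A"]})

result = load_source(io.StringIO(raw_csv), source, locations)
print(result)
```

A left join on the mapping keys keeps rows whose location has no match in the locations file, with a missing `region_code`, which makes unmapped locations easy to spot.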
This repository is created and maintained by Katie Everett, Dan Nanas, Maddy Myers (UCSD), Sumit Arora, and Ian Fischer.
Source name: covid19data.com.au (link)
Link to data: https://www.covid19data.com.au/hospitalisations-icu
Description: Data is scraped manually from the charts provided at the source link. Data for Australia consists of time series data for current hospitalizations, ICU and ventilator cases.
License: Creative Commons Attribution 4.0 International (link)
Last accessed: 2020-07-19
Source name: COVID-19 Tracking Project (link)
Link to data: https://github.com/COVID19Tracking/covid-tracking-data/tree/master/data
Description: Data is downloaded automatically from the source link. Data for the United States consists of time series data for current and cumulative hospitalizations.
License: Apache 2.0 (link)
Last accessed: 2020-07-27
Original data source: GOV.CO (link)
Link to original data: https://www.datos.gov.co/Salud-y-Protecci-n-Social/Casos-positivos-de-COVID-19-en-Colombia/gt2j-8ykr/data
Data aggregated by: COVID-19 Colombia (link)
License: Creative Commons Attribution-ShareAlike 4.0 International (link)
Last accessed: 2020-07-27
Source name: National Health Information System, Regional Hygiene Stations, Ministry of Health of the Czech Republic (link)
Link to data: https://onemocneni-aktualne.mzcr.cz/covid-19
Description: Data is scraped manually from the charts provided at the source link. Data for the Czech Republic consists of time series data for current ICU cases, and current and cumulative hospitalizations.
Citation:
Komenda M., Karolyi M., Bulhart V., Žofka J., Brauner T., Hak J., Jarkovský J., Mužík J., Blaha M., Kubát J., Klimeš D., Langhammer P., Daňková Š., Májek O., Bartůňková M., Dušek L. COVID-19: Přehled aktuální situace v ČR. Onemocnění aktuálně [online]. Praha: Ministerstvo zdravotnictví ČR, 2020 [cit. 25.04.2020]. Dostupné z: https://onemocneni-aktualne.mzcr.cz/covid-19. Vývoj: společné pracoviště ÚZIS ČR a IBA LF MU. ISSN 2694-9423.
Last accessed: 2020-07-19
Source name: Statens Serum Institute (link)
Link to data: https://www.sst.dk/da/corona/tal-og-overvaagning
Description: Data is manually scraped from charts at the source link. Data for Denmark consists of time series data for current hospitalizations and ICU cases.
Last accessed: 2020-07-19
Source name: Finnish Institute for Health and Welfare (link)
Link to data: https://thl.fi/en/web/infectious-diseases/what-s-new/coronavirus-covid-19-latest-updates
License: Creative Commons Attribution 4.0 International (link)
Last accessed: 2020-07-19
Source name: data.gouv.fr (link)
Link to data: https://www.data.gouv.fr/en/datasets/donnees-hospitalieres-relatives-a-lepidemie-de-covid-19/
Description: Data is scraped manually from the charts provided at the source link. Data for France consists of time series data for cumulative hospitalizations and ICU cases.
License: Open License 2.0 (link)
Last accessed: 2020-07-27
Source name: https://www.google.com/covid19/mobility/
Link to data: https://www.gstatic.com/covid19/mobility/Global_Mobility_Report.csv
Help Center: https://support.google.com/covid19-mobility
Description: These Community Mobility Reports aim to provide insights into what has changed in response to policies aimed at combating COVID-19. The reports chart movement trends over time by geography, across different categories of places.
Terms: In order to download or use the data or reports, you must agree to the Google Terms of Service.
License: Google Terms of Service (link)
Citation:
Google LLC "Google COVID-19 Community Mobility Reports".
https://www.google.com/covid19/mobility/ Accessed: <date>.
Last accessed: 2020-07-27
Source name: Directorate of Health in Iceland (Embaetti landlaeknis) (link)
Link to data: https://www.covid.is/data
Description: Data is downloaded manually from the source link. Data for Iceland consists of time series data for current ICU cases, and current and cumulative hospitalizations.
Last accessed: 2020-06-22
Source name: Health Protection Surveillance Centre (link)
Link to data: https://www.hpsc.ie/a-z/respiratory/coronavirus/novelcoronavirus/casesinireland/epidemiologyofcovid-19inireland/
Description: Data is scraped manually from daily situation reports. Data for Ireland consists of time series data for cumulative hospitalizations.
License: Creative Commons Attribution ShareAlike 3.0 (link)
Last accessed: 2020-07-19
Source name: Dipartimento della Protezione Civile (link)
Link to data: https://github.com/pcm-dpc/COVID-19
Description: Data is downloaded automatically from the source repository. Data for Italy consists of time series data for current hospitalizations, but we can also compute cumulative hospitalizations.
License: Creative Commons Attribution 4.0 International (link)
Last accessed: 2020-07-27
Source name: Toyo Keizai Online (link)
Link to data: https://github.com/kaz-ogiwara/covid19
Copyright notice: Copyright (c) 2020 Kazuki OGIWARA / 荻原 和樹
Description: Data is downloaded automatically from the source repository. Data for Japan consists of time series data for current hospitalizations and ICU cases.
License: MIT (link)
Last accessed: 2020-07-27
Source name: Luxembourg Ministry of Health (link)
Link to data: https://data.public.lu/fr/datasets/donnees-covid19/#_
Description: Data is downloaded automatically from the source link. Data for Luxembourg consists of time series data for current hospitalizations and ICU cases.
License: Creative Commons Zero 1.0 Universal (link)
Last accessed: 2020-07-27
Source name: Ministry of Health, Labour and Social Protection (link)
Link to data: https://msmps.gov.md/ro/advanced-page-type/comunicate-de-presa
Last accessed: 2020-07-19
Source name: National Institute for Public Health and The Environment (link)
Link to data: https://www.rivm.nl/coronavirus-covid-19/grafieken
Description: Data is downloaded manually from the source link. Data for the Netherlands consists of time series data for current hospitalizations.
Last accessed: 2020-06-29
Source name: New Zealand Ministry of Health (link)
Link to data: https://www.health.govt.nz/our-work/diseases-and-conditions/covid-19-novel-coronavirus/covid-19-current-situation/covid-19-current-cases
Last accessed: 2020-07-19
Source name: Norwegian Institute of Public Health (link)
Link to data: https://www.fhi.no/en/id/infectious-diseases/coronavirus/daily-reports/daily-reports-COVID19/
Last accessed: 2020-06-22
Source name: Our World in Data (link)
Link to data: https://github.com/owid/covid-19-data/tree/master/public/data
License: Creative Commons Attribution 4.0 International (link)
Citation:
Data from Our World in Data has been collected, aggregated, and documented by Diana Beltekian, Daniel Gavrilov, Charlie Giattino, Joe Hasell, Bobbie Macdonald, Edouard Mathieu, Esteban Ortiz-Ospina, Hannah Ritchie, and Max Roser.
Last accessed: 2020-07-27
Source name: Oxford Covid-19 Government Response Tracker (link)
Link to data: https://github.com/OxCGRT/covid-policy-tracker/blob/master/data/OxCGRT_latest.csv
License: Creative Commons Attribution 4.0 International (link)
Citation:
Thomas Hale, Sam Webster, Anna Petherick, Toby Phillips, and Beatriz Kira. (2020). Oxford COVID-19 Government Response Tracker. Blavatnik School of Government.
Last accessed: 2020-07-27
Source name: Philippines Department of Health (link)
Link to data: http://www.doh.gov.ph/covid19tracker
Last accessed: 2020-07-19
Source name: Ministerio de Sanidad, Consumo y Bienestar Social (link)
Link to data: https://cnecovid.isciii.es/covid19/resources/agregados.csv
Description: The data is downloaded automatically from the source link. Due to regional differences in hospitalization reporting, we do not aggregate across regions to produce country-level statistics for Spain.
Last accessed: 2020-07-27
Source name: Public Health Agency of Sweden (link)
Link to data: https://www.arcgis.com/sharing/rest/content/items/b5e7488e117749c19881cce45db13f7e/data
Description: Data is downloaded automatically from the source link. Data for Sweden consists of time series data for current ICU cases.
Last accessed: 2020-07-27
Source name: Switzerland Federal Office of Public Health BAG (link)
Link to data: https://www.bag.admin.ch/bag/de/home/krankheiten/ausbrueche-epidemien-pandemien/aktuelle-ausbrueche-epidemien/novel-cov/situation-schweiz-und-international.html
Last accessed: 2020-06-29
Source name: The New York Times COVID-19 Data (link)
Link to data: https://github.com/nytimes/covid-19-data
License: Creative Commons Attribution-NonCommercial 4.0 International (link)
Citation:
Data from The New York Times, based on reports from state and local health agencies.
Last accessed: 2020-07-27
Source name: GOV.UK (link)
Link to data: https://www.gov.uk/government/publications/
Description: Data is downloaded manually from the publications provided at the source link. Data is aggregated across regions in England and reported at the country level for England, Scotland, Wales and Northern Ireland. Data consists of time series data for current hospitalizations.
License: Open Government License 3.0 (link)
Last accessed: 2020-06-23