A data science project using the HCUP Nationwide Readmissions Database (NRD) to predict hospital readmission.
-
The conda package manager.
-
PostgreSQL 10.6 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.3 20140911 (Red Hat 4.8.3-9), 64-bit
- Other versions of PostgreSQL may work but none have been tested.
-
User with CREATE, INSERT, SELECT, UPDATE, and DELETE privileges.
-
An existing database with a public schema.
-
The 2016 NRD data available for purchase from HCUP.
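Because only PostgreSQL 10.6 has been tested, it can help to compare the server's version banner against that release before going further. The following is a minimal sketch, not part of this repository: the function names `parse_pg_version` and `is_tested_version` are hypothetical, and the banner is assumed to be the text returned by `SELECT version()`, which has to be fetched separately.

```python
import re

# The only release this README reports as tested.
TESTED_VERSION = (10, 6)


def parse_pg_version(version_string):
    """Extract (major, minor) from the text returned by SELECT version().

    Assumes a banner of the form "PostgreSQL <major>.<minor> ...".
    Pre-release banners without a dotted number raise ValueError.
    """
    match = re.match(r"PostgreSQL (\d+)\.(\d+)", version_string)
    if match is None:
        raise ValueError(f"unrecognized version string: {version_string!r}")
    return int(match.group(1)), int(match.group(2))


def is_tested_version(version_string):
    """Return True only when the banner reports the tested release."""
    return parse_pg_version(version_string) == TESTED_VERSION


if __name__ == "__main__":
    banner = (
        "PostgreSQL 10.6 on x86_64-pc-linux-gnu, compiled by gcc (GCC) "
        "4.8.3 20140911 (Red Hat 4.8.3-9), 64-bit"
    )
    print(is_tested_version(banner))
```

A mismatch does not mean the project will fail, only that the combination is untested.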
-
Clone this repository:
$ git clone https://github.com/tadtenacious/nrd_db_proj.git
-
Create a virtual environment with the required libraries:
$ conda env create --name nrd -f=env.yml
-
Store the three CSV files from HCUP in the data directory.
-
Activate the virtual environment:
$ conda activate nrd
-
Create the configuration file for connecting to the database.
(nrd) $ python nrd.py --config
-
Run the tests. This is done from the parent directory, so there is no need to cd into tests.
(nrd) $ python -m pytest
-
Load the data into the database and perform most of the feature engineering. The full and sample data sets will also be extracted. The full data set is roughly 7.4 GB on disk.
(nrd) $ python nrd.py --etl
-
Run the model on the sample.
(nrd) $ python nrd.py --model sample
-
Run the model on the full data set. This was only successfully run on a computer with 125 GB of RAM.
(nrd) $ python nrd.py --model full
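Before starting the ETL step, a quick pre-flight check can catch a missing CSV file or a full disk early. The sketch below is illustrative and not part of this repository. The helper names are hypothetical. It assumes the HCUP files end in .csv (any letter case), that the data directory is named data, and that the extracted data set is written to the same filesystem as that directory.

```python
import shutil
from pathlib import Path

FULL_DATASET_GB = 7.4   # approximate size of the full data set quoted above
EXPECTED_CSV_COUNT = 3  # number of HCUP CSV files quoted above


def count_csv_files(data_dir):
    """Count files ending in .csv (case-insensitive) directly inside data_dir."""
    return sum(
        1
        for path in Path(data_dir).iterdir()
        if path.is_file() and path.suffix.lower() == ".csv"
    )


def free_space_gb(path):
    """Free space, in decimal gigabytes, on the filesystem holding path."""
    return shutil.disk_usage(path).free / 1e9


def preflight(data_dir="data"):
    """Return a list of human-readable problems; an empty list means none found."""
    data_path = Path(data_dir)
    if not data_path.is_dir():
        return [f"directory not found: {data_dir}"]

    problems = []
    found = count_csv_files(data_path)
    if found < EXPECTED_CSV_COUNT:
        problems.append(f"expected {EXPECTED_CSV_COUNT} CSV files, found {found}")
    # Assumption: the extract lands on the same filesystem as the data directory.
    if free_space_gb(data_path) < FULL_DATASET_GB:
        problems.append(f"less than {FULL_DATASET_GB} GB of free disk space")
    return problems


if __name__ == "__main__":
    for problem in preflight():
        print("WARNING:", problem)
```

Run from the repository root, this prints nothing when both checks pass. It does not check memory, so the 125 GB RAM figure for the full model run still needs to be considered separately.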