6. Key Changes (2018.09.14): Imputation
Before describing the changes I made, it is worth explaining why filling in missing data, or imputation, matters for this study. In general, imputation helps avoid potential biases when working with sample data. Researchers used to simply remove or ignore missing values, but imputation has become increasingly common: many papers have found that NA values assumed to be missing completely at random were not, particularly in time-series data, where the gaps can actually be estimated from the surrounding observations. Thanks to this statistical work, we can now replace NA values with plausible numerical estimates using a variety of approaches.
For this study, imputation is important because the model investigates people's exposure and response to air pollution at an hourly resolution. If missing observations (caused by repairs or network errors) are left out of the model, it cannot capture the hourly fluctuations, especially during rush hour. Filling in the gaps therefore improves the measurement of exposure levels that vary across days and seasons.
The topics I will cover:
- Exploring different methods to impute NA values
- Transforming the PM10 data frame for NetLogo
- Running the NetLogo simulation
- Plotting people's health at the end of the simulation
Note that two programs, R and NetLogo, run at the same time; the software in use is noted at the beginning of each code section.
The imputeTS package in R (https://cran.r-project.org/web/packages/imputeTS/imputeTS.pdf) offers two seasonal approaches to choose between before settling on a method.
The first is the Seasonally Decomposed approach (Seasonally Decomposed Missing Value Imputation, na.seadec), which "removes the seasonal component from the time series, performs imputation on the deseasonalised series and afterwards adds the seasonal component again".
The other is the Seasonally Splitted approach (Seasonally Splitted Missing Value Imputation, na.seasplit), which "splits the times series into seasons and afterwards performs imputation separately for each of the resulting time series datasets (each containing the data for one specific season)".
Each approach offers the same four imputation methods, tabled below:
Function | Method | Description |
---|---|---|
na.seadec | Mean | Mean imputation on the deseasonalised series |
na.seadec | Moving average | Moving-average imputation on the deseasonalised series |
na.seadec | Interpolation | Interpolation on the deseasonalised series |
na.seadec | Kalman | Kalman smoothing on the deseasonalised series |
na.seasplit | Mean | Mean imputation within each season |
na.seasplit | Moving average | Moving-average imputation within each season |
na.seasplit | Interpolation | Interpolation within each season |
na.seasplit | Kalman | Kalman smoothing within each season |
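Software: R. A minimal sketch of how these eight calls look with imputeTS, assuming `pm10_ts` is an hourly `ts` object with `frequency = 24` (the object name is my placeholder, not from the original data):

```r
library(imputeTS)

# pm10_ts: hypothetical hourly PM10 series, e.g. ts(pm10_values, frequency = 24)
ts_mean <- na.seadec(pm10_ts, algorithm = "mean")           # mean of the deseasonalised series
ts_ma   <- na.seadec(pm10_ts, algorithm = "ma")             # moving average
ts_int  <- na.seadec(pm10_ts, algorithm = "interpolation")  # linear interpolation
ts_kal  <- na.seadec(pm10_ts, algorithm = "kalman")         # Kalman smoothing

# the seasonally split variants take the same algorithm argument, e.g.
ts_kal_split <- na.seasplit(pm10_ts, algorithm = "kalman")
```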
Personally, I found the seasonally split approach quite useful: pollution shows a clear seasonal trend, so fitting values within each season carries less risk. The decomposed approach captures the overall trend, but it can smooth out the instantaneous hourly spikes, which matter a great deal here.
Now I will compare the four methods by imputing the NA values for the Seoul monitoring stations. Out of the six years, I sampled February 2012, because the whole month was missing! Let's have a look at the outcomes:
Even without the equations in front of you, you can roughly guess how each value is estimated. Since none of the results stood out as clearly preferable, I thought it better to import them all.
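Software: R. As a hedged sketch, the four imputed series could be overlaid for the missing month like this (`ts_mean`, `ts_ma`, `ts_int`, `ts_kal` follow the sketch above; `stamps`, a POSIXct vector parallel to the series, is an assumption):

```r
# hour indices of the all-NA month (stamps is a hypothetical POSIXct vector)
feb12 <- which(stamps >= as.POSIXct("2012-02-01") & stamps < as.POSIXct("2012-03-01"))

plot(feb12, ts_mean[feb12], type = "l", col = "grey40",
     xlab = "hour index", ylab = "PM10", main = "February 2012: four imputations")
lines(feb12, ts_ma[feb12],  col = "blue")
lines(feb12, ts_int[feb12], col = "darkgreen")
lines(feb12, ts_kal[feb12], col = "red")
legend("topright", legend = c("mean", "ma", "interpolation", "kalman"),
       col = c("grey40", "blue", "darkgreen", "red"), lty = 1)
```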
Here is a sample of the PM10 data I have for Gangnam, in long format. It includes Date, hour, Type, Value, and work. The working hours are 09-19, and the home hours are 20-24 and 01-08.
Date | hour | Type | Value | work |
---|---|---|---|---|
2010-01-01 | 1 | ts_mean | 30 | home |
2010-01-01 | 2 | ts_mean | 35 | home |
2010-01-01 | 3 | ts_mean | 36 | home |
2010-01-01 | 4 | ts_mean | 31 | home |
2010-01-01 | 5 | ts_mean | 33 | home |
2010-01-01 | 6 | ts_mean | 36 | home |
2010-01-01 | 7 | ts_mean | 28 | home |
2010-01-01 | 8 | ts_mean | 38 | home |
2010-01-01 | 9 | ts_mean | 37 | work |
2010-01-01 | 10 | ts_mean | 33 | work |
2010-01-01 | 11 | ts_mean | 31 | work |
2010-01-01 | 12 | ts_mean | 41 | work |
2010-01-01 | 13 | ts_mean | 29 | work |
2010-01-01 | 14 | ts_mean | 41 | work |
2010-01-01 | 15 | ts_mean | 36 | work |
2010-01-01 | 16 | ts_mean | 36 | work |
2010-01-01 | 17 | ts_mean | 37 | work |
2010-01-01 | 18 | ts_mean | 35 | work |
2010-01-01 | 19 | ts_mean | 37 | work |
2010-01-01 | 20 | ts_mean | 38 | home |
2010-01-01 | 21 | ts_mean | 52 | home |
2010-01-01 | 22 | ts_mean | 43 | home |
2010-01-01 | 23 | ts_mean | 43 | home |
2010-01-01 | 24 | ts_mean | 39 | home |
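Software: R. For reference, the work flag depends only on the hour, so it can be derived with a one-liner (a sketch; `pm10_long` stands for the long-format data above):

```r
library(dplyr)

# pm10_long: hypothetical long-format frame with Date, hour, Type, Value columns
pm10_long <- pm10_long %>%
  mutate(work = ifelse(hour >= 9 & hour <= 19, "work", "home"))
```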
If we import the file directly into NetLogo, we create extra work for ourselves, because NetLogo handles wide formats better than long ones. We could obviously use a for loop to do the job, but I prefer to reduce the burden on NetLogo.
I transformed the data into the format shown below: each day now has one row of 13 home hours and one row of 11 working hours. To avoid a mixture of formats (integers, factors, characters) within a row, I converted the NA values to -999.
A tibble: 23,376 x 16
 | dates | type | work | h1 | h2 | h3 | h4 | h5 | h6 | h7 | h8 | h9 | h10 | h11 | h12 | h13 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 2010-01-01 | ts_int | home | 30 | 35 | 36 | 31 | 33 | 36 | 28 | 38 | 38 | 52 | 43 | 43 | 39 |
2 | 2010-01-01 | ts_int | work | 37 | 33 | 31 | 41 | 29 | 41 | 36 | 36 | 37 | 35 | 37 | -999 | -999 |
3 | 2010-01-02 | ts_int | home | 39 | 37 | 38 | 40 | 43 | 39 | 41 | 40 | 99 | 85 | 83 | 77 | 70 |
4 | 2010-01-02 | ts_int | work | 46 | 44 | 47 | 42 | 39 | 42 | 47 | 55 | 68 | 75 | 102 | -999 | -999 |
5 | 2010-01-03 | ts_int | home | 64 | 62 | 59 | 56 | 54 | 52 | 47 | 40 | 46 | 59 | 61 | 64 | 65 |
6 | 2010-01-03 | ts_int | work | 49 | 36 | 33 | 41 | 45 | 39 | 46 | 37 | 44 | 50 | 43 | -999 | -999 |
7 | 2010-01-04 | ts_int | home | 57 | 48 | 54 | 51 | 44 | 37 | 35 | 36 | 34 | 62 | 54 | 44 | 45 |
8 | 2010-01-04 | ts_int | work | 25 | 26 | 28 | 24 | 25 | 27 | 31 | 30 | 33 | 27 | 26 | -999 | -999 |
9 | 2010-01-05 | ts_int | home | 46 | 56 | 53 | 60 | 61 | 62 | 62 | 59 | 40 | 33 | 31 | 31 | 37 |
10 | 2010-01-05 | ts_int | work | 65 | 63 | 56 | 58 | 41 | 46 | 68 | 64 | 47 | 29 | 41 | -999 | -999 |
This data has been exported as an hourly .csv file, as sketched below.
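Software: R. A sketch of the long-to-wide reshape with dplyr/tidyr (column and file names are illustrative):

```r
library(dplyr)
library(tidyr)

pm10_wide <- pm10_long %>%
  arrange(Date, Type, work, hour) %>%
  group_by(Date, Type, work) %>%
  mutate(slot = paste0("h", row_number())) %>%  # h1-h13 for home rows, h1-h11 for work rows
  ungroup() %>%
  select(-hour) %>%
  spread(slot, Value, fill = -999) %>%          # pad the two unused work slots with -999
  select(Date, Type, work, paste0("h", 1:13))   # restore natural h1-h13 column order

write.csv(pm10_wide, "gangnam_pm10_wide.csv", row.names = FALSE)
```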
Having imported the data into NetLogo, I let NetLogo pick one of the values in h1-h13 at random and distribute it to each patch. Of course, I did not allow the programme to select -999: the work rows draw only from h1-h11. Have a look at the NetLogo code below:
Software: NetLogo
if (Scenario = "BAU")
[
  ask patches with [gangnam = true]
  [
    ;; home and work rows alternate in the table; items 0-2 hold dates/type/work,
    ;; so hour values start at item 3, and drawing from only 11 slots on work
    ;; rows avoids the -999 padding; note (ticks + 1) must be parenthesised so
    ;; it is passed to table:get rather than added to its result
    if ticks > 0 and (ticks + 1) mod 2 = 0 [set ts__int item (3 + random 13) table:get ts_int (ticks + 1)]   ; home
    if ticks > 0 and ticks mod 2 = 0 [set ts__int item (3 + random 11) table:get ts_int (ticks + 1)]         ; work
    if ticks > 0 and (ticks + 1) mod 2 = 0 [set ts__mean item (3 + random 13) table:get ts_mean (ticks + 1)] ; home
    if ticks > 0 and ticks mod 2 = 0 [set ts__mean item (3 + random 11) table:get ts_mean (ticks + 1)]       ; work
    if ticks > 0 and (ticks + 1) mod 2 = 0 [set ts__ma item (3 + random 13) table:get ts_ma (ticks + 1)]     ; home
    if ticks > 0 and ticks mod 2 = 0 [set ts__ma item (3 + random 11) table:get ts_ma (ticks + 1)]           ; work
    if ticks > 0 and (ticks + 1) mod 2 = 0 [set ts__kal item (3 + random 13) table:get ts_kal (ticks + 1)]   ; home
    if ticks > 0 and ticks mod 2 = 0 [set ts__kal item (3 + random 11) table:get ts_kal (ticks + 1)]         ; work
  ]
]
As a result, you can see the bottom four values (ts__int, ts__mean, ts__ma, ts__kal) in the attributes box when inspecting a patch.
Here is a sampled result after simulating the six years from 2010-01-01 to 2015-12-31.
who | homename | destinationName | age | health |
---|---|---|---|---|
0 | sinsa | sinsa | young | 208.3 |
1 | sinsa | sinsa | young | 208.9 |
2 | sinsa | sinsa | young | 211.2 |
3 | sinsa | sinsa | young | 206.5 |
4 | sinsa | sinsa | young | 208.0 |
5 | sinsa | sinsa | young | 0.0 |
6 | sinsa | sinsa | young | 211.0 |
7 | sinsa | sinsa | young | 208.8 |
8 | sinsa | sinsa | young | 0.0 |
9 | sinsa | sinsa | young | 208.2 |
and some plots...
[Plots: health by district | health by age]
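Software: R. A minimal sketch of how such plots could be drawn from the exported agent table (the file name is an assumption; columns as in the sample above):

```r
library(ggplot2)

agents <- read.csv("simulation_agents.csv")  # hypothetical export of the agent table

# distribution of final health scores by home district
ggplot(agents, aes(x = homename, y = health)) +
  geom_boxplot() +
  labs(x = "district", y = "health after 6 years")

# and by age group
ggplot(agents, aes(x = age, y = health)) +
  geom_boxplot() +
  labs(x = "age group", y = "health after 6 years")
```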