
6. Key Changes (2018.09.14) Imputation


Filling missing data = Imputation

Before I write about the changes I made, it is worth describing why filling in missing data, or imputation, matters for this study. In general, imputation helps avoid potential biases when working with sample data. Whereas researchers in the past tended to remove or ignore missing values, imputation has become increasingly common. Many papers have found that NA values, once assumed to be missing completely at random, often are not, particularly in time-series data, where you can actually estimate what the gap might be. Nowadays, thanks to the work of statisticians, we can replace NA values with plausible numerical estimates using a range of approaches.

For this study, imputation is important because the model investigates people's exposure and response to air pollution at an hourly resolution. If missing observations (caused, for example, by instrument repair or network errors) are not filled in, the model cannot capture the hourly fluctuations, especially during rush hour. Filling the gaps therefore directly improves the measurement of people's exposure levels, which vary across days and seasons.

The topics I will cover:

  1. Exploring different methods to impute NA values
  2. Transforming the PM10 data frame for NetLogo
  3. Running the NetLogo simulation
  4. Plotting people's health at the end of the simulation

Note that two programs, R and NetLogo, run at the same time. The software in use is stated at the beginning of each code section.

Change 1: Exploring different methods to impute NA values

The imputeTS package in R (https://cran.r-project.org/web/packages/imputeTS/imputeTS.pdf) offers two seasonal approaches to choose between before you pick a method.
The first is the seasonally decomposed approach (Seasonally Decomposed Missing Value Imputation, na.seadec), which "removes the seasonal component from the time series, performs imputation on the deseasonalised series and afterwards adds the seasonal component again".
The other is the seasonally split approach (Seasonally Splitted Missing Value Imputation, na.seasplit), which "splits the time series into seasons and afterwards performs imputation separately for each of the resulting time series datasets (each containing the data for one specific season)".

Each approach then offers four imputation methods, tabled below:

| Function | Method | Description |
| --- | --- | --- |
| na.seadec | Mean | Mean imputation of the deseasonalised series |
| na.seadec | Moving average | Moving-average imputation of the deseasonalised series |
| na.seadec | Interpolation | Interpolation of the deseasonalised series |
| na.seadec | Kalman | Kalman smoothing of the deseasonalised series |
| na.seasplit | Mean | Mean imputation within each season |
| na.seasplit | Moving average | Moving-average imputation within each season |
| na.seasplit | Interpolation | Interpolation within each season |
| na.seasplit | Kalman | Kalman smoothing within each season |
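
To make the two approaches concrete, here is a minimal sketch of how they are called (function and argument names as documented in the 2018 imputeTS release linked above; `pm10` and `pm10_ts` are hypothetical objects):

```r
# Software: R
library(imputeTS)

# Hypothetical hourly PM10 series with a daily cycle (frequency = 24)
pm10_ts <- ts(pm10$Value, frequency = 24)

# Seasonally decomposed: deseasonalise, impute, then re-add the seasonal component
imp_seadec <- na.seadec(pm10_ts, algorithm = "interpolation")

# Seasonally split: impute each season's sub-series separately
imp_seasplit <- na.seasplit(pm10_ts, algorithm = "interpolation")
```

The `algorithm` argument is what switches between the four methods in the table ("mean", "ma", "interpolation", "kalman").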

Personally, I found the seasonally split approach the more useful of the two: pollution shows a clear seasonal trend, so fitting values within each season carries less risk. The decomposition approach captures the overall trend, but it smooths out the instantaneous hour-to-hour spikes, which are very important here.

Now I will compare the four methods for imputing the NA values at the Seoul monitoring stations. Out of the six years, I sampled February 2012 because the entire month was missing! A sketch of how the comparison can be run follows, and then the outcomes.
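
This is a minimal sketch, assuming `feb2012` is the sub-series for that month; `plotNA.imputations` is the comparison plot from the same imputeTS release:

```r
# Software: R
library(imputeTS)

# Impute the same gap with each of the four methods (seasonally split approach)
imp_mean   <- na.seasplit(feb2012, algorithm = "mean")
imp_ma     <- na.seasplit(feb2012, algorithm = "ma")
imp_interp <- na.seasplit(feb2012, algorithm = "interpolation")
imp_kalman <- na.seasplit(feb2012, algorithm = "kalman")

# Overlay one imputed series on the original series with its NA gaps
plotNA.imputations(x.withNA = feb2012, x.withImputations = imp_kalman)
```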

[Figure: Sample-imputation — the four imputation methods compared on the February 2012 gap]

Even without looking at the equations, you can roughly guess how each value is estimated. But because no method stood out as clearly preferable in these results, I thought it better to import them all.
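
Since all four series are kept, one way to carry them forward (a sketch, reusing the objects above; the Date and hour columns are hypothetical) is to store them side by side under the names used later on, ts_mean, ts_ma, ts_int, and ts_kal:

```r
# Software: R
library(tidyr)

pm10_imputed <- data.frame(
  Date    = pm10$Date,
  hour    = pm10$hour,
  ts_mean = as.numeric(imp_mean),
  ts_ma   = as.numeric(imp_ma),
  ts_int  = as.numeric(imp_interp),
  ts_kal  = as.numeric(imp_kalman)
)

# Long format with a Type column (ts_mean, ts_ma, ts_int, ts_kal), as in the
# sample table in the next section
pm10_long <- gather(pm10_imputed, key = "Type", value = "Value", ts_mean:ts_kal)
```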

Change 2: Transforming the PM10 data frame for NetLogo

Here is a sample of the PM10 data I have for Gangnam, in long format. It includes Date, hour, Type, Value, and work. You can see that working hours run from 09 to 19, and home hours are 20-24 and 01-08.

| Date | hour | Type | Value | work |
| --- | --- | --- | --- | --- |
| 2010-01-01 | 1 | ts_mean | 30 | home |
| 2010-01-01 | 2 | ts_mean | 35 | home |
| 2010-01-01 | 3 | ts_mean | 36 | home |
| 2010-01-01 | 4 | ts_mean | 31 | home |
| 2010-01-01 | 5 | ts_mean | 33 | home |
| 2010-01-01 | 6 | ts_mean | 36 | home |
| 2010-01-01 | 7 | ts_mean | 28 | home |
| 2010-01-01 | 8 | ts_mean | 38 | home |
| 2010-01-01 | 9 | ts_mean | 37 | work |
| 2010-01-01 | 10 | ts_mean | 33 | work |
| 2010-01-01 | 11 | ts_mean | 31 | work |
| 2010-01-01 | 12 | ts_mean | 41 | work |
| 2010-01-01 | 13 | ts_mean | 29 | work |
| 2010-01-01 | 14 | ts_mean | 41 | work |
| 2010-01-01 | 15 | ts_mean | 36 | work |
| 2010-01-01 | 16 | ts_mean | 36 | work |
| 2010-01-01 | 17 | ts_mean | 37 | work |
| 2010-01-01 | 18 | ts_mean | 35 | work |
| 2010-01-01 | 19 | ts_mean | 37 | work |
| 2010-01-01 | 20 | ts_mean | 38 | home |
| 2010-01-01 | 21 | ts_mean | 52 | home |
| 2010-01-01 | 22 | ts_mean | 43 | home |
| 2010-01-01 | 23 | ts_mean | 43 | home |
| 2010-01-01 | 24 | ts_mean | 39 | home |

If we import this file directly into NetLogo, more work is needed on the NetLogo side, because NetLogo handles wide formats much better than long ones. We could of course loop over the rows in NetLogo, but I prefer to reduce the burden on NetLogo and reshape the data in R.

I transformed the data into the format shown below. In each day there are 13 home hours and 11 working hours. To avoid any confusion from a mixture of types (integers, factors, characters) within a row, I converted the NA cells to -999.

```
# A tibble: 23,376 x 16
   dates      type   work    h1    h2    h3    h4    h5    h6    h7    h8    h9   h10   h11   h12   h13
 1 2010-01-01 ts_int home    30    35    36    31    33    36    28    38    38    52    43    43    39
 2 2010-01-01 ts_int work    37    33    31    41    29    41    36    36    37    35    37  -999  -999
 3 2010-01-02 ts_int home    39    37    38    40    43    39    41    40    99    85    83    77    70
 4 2010-01-02 ts_int work    46    44    47    42    39    42    47    55    68    75   102  -999  -999
 5 2010-01-03 ts_int home    64    62    59    56    54    52    47    40    46    59    61    64    65
 6 2010-01-03 ts_int work    49    36    33    41    45    39    46    37    44    50    43  -999  -999
 7 2010-01-04 ts_int home    57    48    54    51    44    37    35    36    34    62    54    44    45
 8 2010-01-04 ts_int work    25    26    28    24    25    27    31    30    33    27    26  -999  -999
 9 2010-01-05 ts_int home    46    56    53    60    61    62    62    59    40    33    31    31    37
10 2010-01-05 ts_int work    65    63    56    58    41    46    68    64    47    29    41  -999  -999
```

This data has been exported as an hourly .csv file.
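
Here is a minimal sketch of how that reshape and export can be done with dplyr and tidyr (`pm10_long` carries the columns shown earlier; the output file name is hypothetical):

```r
# Software: R
library(dplyr)
library(tidyr)

pm10_wide <- pm10_long %>%
  group_by(Date, Type, work) %>%
  # Number the hours h1, h2, ... within each day/type/period block; a factor
  # keeps the spread columns in natural order rather than alphabetical
  mutate(slot = factor(sprintf("h%d", row_number()),
                       levels = sprintf("h%d", 1:13))) %>%
  ungroup() %>%
  select(-hour) %>%
  # Wide format; work rows only have 11 hours, so h12/h13 are padded with -999
  spread(slot, Value, fill = -999)

write.csv(pm10_wide, "pm10_gangnam_hourly.csv", row.names = FALSE)  # hypothetical file name
```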

Change 3: NetLogo simulation

Having imported the data into NetLogo, I let NetLogo pick one of the h1-h13 values at random and assign it to each patch. Of course, I did not allow the programme to select -999: for work rows the random index only runs over the 11 real hours. Have a look at the NetLogo code below:

Software: NetLogo

```netlogo
if (Scenario = "BAU")
[ ask patches with [gangnam = true]
  [
    ; Ticks alternate between home periods (13 hour slots) and work periods
    ; (11 slots, so the -999 pads in h12/h13 can never be selected).
    ; Items 0-2 of each row are dates/type/work, so hour values start at item 3.
    if ticks > 0 and (ticks + 1) mod 2 = 0 [ set ts__int  item (3 + random 13) table:get ts_int  (ticks + 1) ] ; home
    if ticks > 0 and ticks mod 2 = 0       [ set ts__int  item (3 + random 11) table:get ts_int  (ticks + 1) ] ; work
    if ticks > 0 and (ticks + 1) mod 2 = 0 [ set ts__mean item (3 + random 13) table:get ts_mean (ticks + 1) ] ; home
    if ticks > 0 and ticks mod 2 = 0       [ set ts__mean item (3 + random 11) table:get ts_mean (ticks + 1) ] ; work
    if ticks > 0 and (ticks + 1) mod 2 = 0 [ set ts__ma   item (3 + random 13) table:get ts_ma   (ticks + 1) ] ; home
    if ticks > 0 and ticks mod 2 = 0       [ set ts__ma   item (3 + random 11) table:get ts_ma   (ticks + 1) ] ; work
    if ticks > 0 and (ticks + 1) mod 2 = 0 [ set ts__kal  item (3 + random 13) table:get ts_kal  (ticks + 1) ] ; home
    if ticks > 0 and ticks mod 2 = 0       [ set ts__kal  item (3 + random 11) table:get ts_kal  (ticks + 1) ] ; work
  ]
]
```

As a result, you can see the four imputed values at the bottom of the patch attribute box.

[Figure: netlogo-interface]

Change 4: Plotting people's health at the end of the simulation

Here is a sampled result after simulating six years, from 2010-01-01 to 2015-12-31.

| who | homename | destinationName | age | health |
| --- | --- | --- | --- | --- |
| 0 | sinsa | sinsa | young | 208.3 |
| 1 | sinsa | sinsa | young | 208.9 |
| 2 | sinsa | sinsa | young | 211.2 |
| 3 | sinsa | sinsa | young | 206.5 |
| 4 | sinsa | sinsa | young | 208.0 |
| 5 | sinsa | sinsa | young | 0.0 |
| 6 | sinsa | sinsa | young | 211.0 |
| 7 | sinsa | sinsa | young | 208.8 |
| 8 | sinsa | sinsa | young | 0.0 |
| 9 | sinsa | sinsa | young | 208.2 |

and some plots...

By districts: [Figure: healthzero-district]
By age: [Figure: healthzero-age]
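
Plots like these could be produced with a sketch such as the following in R (the agents.csv export and its column handling are assumptions; the figure names suggest counts of agents whose health reached zero):

```r
# Software: R
library(dplyr)
library(ggplot2)

agents <- read.csv("agents.csv")  # hypothetical export of the agent table above

# Agents whose health reached zero, counted by home district
agents %>%
  filter(health == 0) %>%
  count(homename) %>%
  ggplot(aes(x = homename, y = n)) +
  geom_col() +
  labs(x = "District", y = "Agents with health = 0")

# The plot by age group is analogous: swap homename for age
```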