# (PART) PART I. INTRODUCTION {-}
# About the Course {#intro}
## ‘QALL401 Data Analysis for Researchers’
> An introduction to data science (DS). The focus is exploratory data analysis (EDA), an essential part of scientific research and of the evidence-based decision making of a responsible global citizen. Students acquire the knowledge and principles needed to use computers appropriately when making research results public and communicating outcomes. Since data science supports artificial intelligence technologies, ethical issues are becoming more and more important.
> We introduce R, a widely used free software environment for statistical computing and graphics, and Rmarkdown, an authoring format that enables easy creation of dynamic documents, presentations, and reports from R, supporting reproducible research and literate programming.
> We will experience the process of data science, set a foundation to develop data science skills, and take time to think about the ethical issues of its outcomes.
> Instructors: Taisei Kaizoji, Professor of Economics, and Hiroshi Suzuki, Part-time Instructor
Description: This course helps students from many academic fields develop the skills to obtain necessary information from open data and to create charts and graphs for visualization. Students also learn the fundamentals of data analysis and write short articles that include data reasoning. The laboratory work uses open software such as R, and guest lectures on data analysis for research are included.
Key Words: open data, data visualization, data analysis, data reasoning, R
Features: laboratory work - practicum, write short articles, guest lectures
### Course Overview
The objective of this course is to learn the fundamentals of data science. Using the free software, R and its IDE, RStudio, students will learn how to collect data, transform it into appropriate forms, and visualize it. Students will also learn how to analyze data and present their findings to others.
**Syllabus**: https://campus.icu.ac.jp/public/ehandbook/PreviewSyllabus.aspx?regno=32002&year=2022&term=3
### Course Schedule
1. 2022-12-07: Introduction: About the course [led by TK]
    - An introduction to open and public data, and data science
2. 2022-12-14: Exploratory Data Analysis (EDA) 1 [led by HS]
    - R Basics with RStudio and/or RStudio.cloud
    - `tidyverse` using Toy Data
    - Assignment One
3. 2022-12-21: Exploratory Data Analysis (EDA) 2 [led by HS]
    - R Markdown for reproducibility and communication
    - `dplyr` for transforming data
    - Assignment Two
4. 2023-01-11: Exploratory Data Analysis (EDA) 3 [led by HS]
    - `WDI`, a package for searching and downloading World Development Indicators
    - `ggplot2` for data visualization
    - Assignment Three
5. 2023-01-18: Exploratory Data Analysis (EDA) 4 [led by HS]
    - `tidyr` for tidying data
    - Workflow of EDA
    - Assignment Four
6. 2023-01-25: Exploratory Data Analysis (EDA) 5 [led by HS]
    - Data Modeling
    - Roundup; R Markdown revisited
    - Assignment Five
7. 2023-02-01: Introduction to the PPDAC (Problem-Plan-Data-Analysis-Conclusion) Cycle [led by TK]
    - PPDAC in EDA
    - `owidR`
8. 2023-02-08: Model building I [led by TK]
    - World Bank data
    - Merging data
9. 2023-02-15: Model building II [led by TK]
    - Analyzing data and communication
10. 2023-02-22: Project Presentation
### Objective and Grading Policy:
The objective of this course is to learn the fundamentals of data science. Using the free software, R and its IDE, RStudio, students will learn how to collect data, transform it into appropriate forms, and visualize it. Students will also learn how to analyze data and present their findings to others.
Grading policy:
A. Course participation by giving feedback - 10%
B. Short papers: Assignments 1-5 - 30%
C. Presentation - 20%
D. Final report - 40%
### Learning Resources
#### Textbooks and References
* "R for Data Science" by Hadley Wickham and Garrett Grolemund:
    - Free Online Book: https://r4ds.had.co.nz
* Visit the `bookdown` site: https://bookdown.org
    - Many more books are listed on the [archive page](https://bookdown.org/home/archive/).
### Interactive Tutorials for R
#### Posit Primers https://posit.cloud/learn/primers
1. The Basics -- [r4ds: Explore, I](https://r4ds.had.co.nz/explore-intro.html#explore-intro)
    - [Visualization Basics](https://rstudio.cloud/learn/primers/1.1)
    - [Programming Basics](https://rstudio.cloud/learn/primers/1.2)
2. Work with Data -- [r4ds: Wrangle, I](https://r4ds.had.co.nz/wrangle-intro.html#wrangle-intro)
    - [Working with Tibbles](https://rstudio.cloud/learn/primers/2.1)
    - [Isolating Data with dplyr](https://rstudio.cloud/learn/primers/2.2)
    - [Deriving Information with dplyr](https://rstudio.cloud/learn/primers/2.3)
3. Visualize Data -- [r4ds: Explore, II](https://r4ds.had.co.nz/explore-intro.html#explore-intro)
    - [Exploratory Data Analysis](https://rstudio.cloud/learn/primers/3.1)
    - [Bar Charts](https://rstudio.cloud/learn/primers/3.2)
    - [Histograms](https://rstudio.cloud/learn/primers/3.3)
    - [Boxplots and Counts](https://rstudio.cloud/learn/primers/3.4)
    - [Scatterplots](https://rstudio.cloud/learn/primers/3.5)
    - [Line plots and maps](https://rstudio.cloud/learn/primers/3.6)
    - [Overplotting](https://rstudio.cloud/learn/primers/3.7)
    - [Customize plots](https://rstudio.cloud/learn/primers/3.8)
4. Tidy Your Data -- [r4ds: Wrangle, II](https://r4ds.had.co.nz/wrangle-intro.html#wrangle-intro)
    - [Reshape Data - a bit old](https://rstudio.cloud/learn/primers/4.1)
    - [Separate and Unite](https://rstudio.cloud/learn/primers/4.2)
    - [Join Data Sets](https://rstudio.cloud/learn/primers/4.3)
5. Iterate -- [r4ds: Program](https://r4ds.had.co.nz/program-intro.html#program-intro)
    - [Introduction to Iteration](https://rstudio.cloud/learn/primers/5.1)
    - [Map](https://rstudio.cloud/learn/primers/5.2)
    - [Map Shortcut](https://rstudio.cloud/learn/primers/5.3)
    - [Multiple Vectors](https://rstudio.cloud/learn/primers/5.3)
    - [List Columns](https://rstudio.cloud/learn/primers/5.4)
6. Write Functions -- [r4ds: Program](https://r4ds.had.co.nz/program-intro.html#program-intro)
    - [Function Basics](https://rstudio.cloud/learn/primers/6.1)
    - [How to Write a Function](https://rstudio.cloud/learn/primers/6.2)
    - [Argument Matching](https://rstudio.cloud/learn/primers/6.3)
    - [Environments and Scoping](https://rstudio.cloud/learn/primers/6.4)
    - [Control Flow](https://rstudio.cloud/learn/primers/6.5)
    - [Advanced Control Flow](https://rstudio.cloud/learn/primers/6.6)
    - [Loops in R](https://rstudio.cloud/learn/primers/6.7)
7. Report Reproducibly -- [r4ds: Communicate](https://r4ds.had.co.nz/communicate-intro.html)
    - [Link to Videos and Explanations](https://rmarkdown.rstudio.com/lesson-1.html?_ga=2.215340127.979535829.1639794069-1104332695.1639233659)
8. [Build Interactive Web Apps](https://shiny.rstudio.com/tutorial/?_ga=2.149795838.979535829.1639794069-1104332695.1639233659)
#### Swirl: An interactive learning environment for R and statistics
Swirl is a console-based interactive tutorial containing several courses. We did not use it in class this academic year.
* {swirl} website: https://swirlstats.com
    - JHU Data Science on Coursera uses swirl for exercises.
### A Message to Students
This course consists of the following components.
1. Lecture Notes
    - We provide slides, notes, and lecture notes from past years
2. Lecture
    - Lectures are also offered over Zoom as an option, and recordings are provided
3. Textbook
    - R for Data Science - you can read it online
4. Practicum in class
    - We provide a log of each session as R Notebooks or R scripts
5. Interactive Tutorial
    - Posit Primers - you can practice online
6. Assignments - format: R Notebook
    - We provide feedback on each assignment, with responses in an R Notebook
7. Student Presentation - format: R Notebook, Slides, ...
    - Last class
8. Final Paper - format: R Notebook (including code) and PDF (8 pages)
    - Due: Two weeks after the last class
The components are closely linked. We do not check your engagement with Posit Primers, but the lectures from week two to week six are designed to follow them. For assignments, you may submit an R Notebook that contains code chunks with errors; the instructors will try to give feedback and suggestions. We can also set up a personal tutorial meeting on Zoom upon request.
Our goal is for you to develop the skills to explore and analyze data by yourself, mainly using open public data. We truly hope you enjoy the course.
## Introduction to Exploratory Data Analysis
### What is data science?
Wikipedia https://en.wikipedia.org/wiki/Data_science
> An inter-disciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from many structural and unstructured data.
* Create Insights
* Impact Decision Making
* Maintain & Improve Over Time
## Gapminder
### Hans Rosling (1948 – 2017)
> Hans Rosling was a Swedish physician, academic, and public speaker. He was a professor of international health at Karolinska Institute and was the co-founder and chairman of the Gapminder Foundation, which developed the Trendalyzer software system. ([wikipedia](https://en.wikipedia.org/wiki/Hans_Rosling))
* Books:
    - Factfulness: Ten Reasons We're Wrong About The World - And Why Things Are Better Than You Think, 2018
    - How I Learned to Understand the World: A Memoir, 2020
* Gapminder: https://www.gapminder.org
    - [You are probably wrong about: Upgrade Your World View](https://upgrader.gapminder.org)
    - [Bubble Chart](https://www.gapminder.org/tools/#$state$time$value=2020;;&chart-type=bubbles): Income vs Life Expectancy over time, 1800 - 2020
        + How many variables?
* Videos: [The best stats you’ve ever seen, Hans Rosling](http://www.edtech.events/the-best-stats-youve-ever-seen-hans-rosling/)
<iframe width="560" height="315" src="https://www.youtube.com/embed/Sm5xF-UYgdg" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen></iframe>
* [How not to be ignorant about the world | Hans and Ola Rosling](https://www.youtube.com/watch?v=Sm5xF-UYgdg)
### Factfulness is ... \hfill _From the book_
recognizing when a decision feels urgent and remembering that it rarely is.
To control the urgency instinct, take small steps.
* Take a breath. When your urgency instinct is triggered, your other instincts kick in and your analysis shuts down. Ask for more time and more information. It’s rarely now or never and it’s rarely either/or.
* Insist on the data. If something is urgent and important, it should be measured. Beware of data that is relevant but inaccurate, or accurate but irrelevant. Only relevant and accurate data is useful.
* Beware of fortune-tellers. Any prediction about the future is uncertain. Be wary of predictions that fail to acknowledge that. Insist on a full range of scenarios, never just the best or worst case. Ask how often such predictions have been right before.
* Be wary of drastic action. Ask what the side effects will be. Ask how the idea has been tested. Step-by-step practical improvements, and evaluation of their impact, are less dramatic but usually more effective.
## Exploratory Data Analysis
### What is EDA? (Posit Primers: [Visualise Data](https://posit.cloud/learn/primers/3.1))
EDA is an iterative cycle that helps you understand what your data says. When you do EDA, you:

1. Generate questions about your data
2. Search for answers by visualising, transforming, and/or modeling your data
3. Use what you learn to refine your questions and/or generate new questions

EDA is an important part of any data analysis. You can use EDA to make discoveries about the world; or you can use EDA to ensure the quality of your data, asking questions about whether the data meets your standards or not.
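
One pass through the cycle described above can be sketched in R as follows. This is only a minimal illustration, not part of the course materials: it assumes the `dplyr` and `ggplot2` packages are installed and uses the `mpg` dataset that ships with `ggplot2`. The question asked and the object names (`p`, `hwy_by_class`, `fit`) are chosen here purely for illustration.

```r
library(dplyr)
library(ggplot2)

# 1. Generate a question (illustrative):
#    do cars with larger engines get fewer highway miles per gallon?

# 2a. Visualise: scatterplot of engine displacement against highway mileage
p <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point()
print(p)

# 2b. Transform: count cars and average highway mileage by vehicle class
hwy_by_class <- mpg %>%
  group_by(class) %>%
  summarise(n = n(), mean_hwy = mean(hwy)) %>%
  arrange(desc(mean_hwy))
print(hwy_by_class)

# 2c. Model: fit a simple linear model of highway mileage on displacement
fit <- lm(hwy ~ displ, data = mpg)
print(coef(fit))

# 3. Refine: use the plot, the summary table, and the fitted slope to decide
#    which question to ask next (for example, does the pattern differ by class?)
```

Each of the three outputs is a starting point for the next round of questions rather than a final answer, which is what makes the process iterative.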
## Open and Public Data, World Bank
### [Open Government Data Toolkit](http://opendatatoolkit.worldbank.org): [Open Data Defined](http://opendatatoolkit.worldbank.org/en/essentials.html)
The term **Open Data** has a very precise meaning. Data or content is open if anyone is free to use, re-use or redistribute it, subject at most to measures that preserve provenance and openness.
1. The data must be _legally open_, which means they must be placed in the public domain or under liberal terms of use with minimal restrictions.
2. The data must be _technically open_, which means they must be published in electronic formats that are machine readable and non-proprietary, so that anyone can access and use the data using common, freely available software tools. Data must also be publicly available and accessible on a public server, without password or firewall restrictions. To make Open Data easier to find, most organizations create and manage Open Data catalogs.