The General Social Survey (GSS) gathers data on American society in order to monitor and explain trends and constants in attitudes, behaviors, and attributes. It is conducted biennially (every two years) through in-person interviews using a probability sampling approach. It is one of the most commonly studied datasets in the social science disciplines.
In the `data` folder, I have included a (large) sample of questions asked during the 2012 GSS. Using the exploratory data analysis skills we have reviewed in class, you will conduct an exploratory analysis of the data to identify interesting questions and (potential) answers. Remember the types of questions we seek to answer using EDA:
- What type of variation occurs within my variables?
- What type of covariation occurs between my variables?
- Are there outliers in the data?
- Do I have missingness? Are there patterns to it?
- How much variation/error exists in my statistical estimates? Is there a pattern to it?
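A few of the questions above can be checked numerically before any plotting. The sketch below is only an illustration: it uses the Python standard library and a small invented table (the column names `age`, `degree`, and `tvhours` and all values are made up, not taken from the GSS file) to count missing values per column, tabulate one categorical variable, and flag outliers with the common 1.5 × IQR rule of thumb.

```python
import csv
import io
import statistics
from collections import Counter

# Invented toy data standing in for the real file; blank cells are missing values.
SAMPLE = """age,degree,tvhours
34,bachelor,2
51,high school,
,high school,4
29,graduate,24
47,high school,3
62,bachelor,1
38,,2
55,high school,3
"""


def missing_counts(rows):
    """Return the number of blank cells in each column."""
    counts = Counter()
    for row in rows:
        for column, value in row.items():
            if not value.strip():
                counts[column] += 1
    return counts


def iqr_outliers(values):
    """Return values more than 1.5 * IQR below Q1 or above Q3."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - 1.5 * spread, q3 + 1.5 * spread
    return [v for v in values if v < low or v > high]


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
# Keep only the non-missing numeric responses for the outlier check.
tvhours = [float(r["tvhours"]) for r in rows if r["tvhours"].strip()]

print("missing per column:", dict(missing_counts(rows)))
print("degree frequencies:", Counter(r["degree"] or "(missing)" for r in rows))
print("tvhours outliers:", iqr_outliers(tvhours))
```

If you work in pandas, `DataFrame.isna().sum()` and `Series.value_counts()` give the same kind of tallies with less code. Either way, treat a flagged value as a prompt to look at a graph, not as a verdict.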
No complex statistical methods should be employed. Focus primarily on graphical analysis, though you can also use basic statistical tests you may have learned in other classes (e.g. tests for normality, difference of means).
Do not worry about using survey weights in your exploratory analysis. Just treat every observation equally.
The final submission should include two components.
The first component is the lab notebook, a record of all your exploratory analysis. It should be extensive (at least 30-40 graphs) and consist mostly of code and graphics.
- Minimally annotate your code and output as necessary to keep track of what you've done and highlight important insights gained through your exploration.
- It should be somewhat stream-of-consciousness (that is, a stored record of your exploration as you explore the data), though certainly feel free to maintain a structure or go back and reformat different sections.
- Don't bother cleaning up each graph to have meaningful labels
The second component is the exploration write-up. In a short paper (around 750 words), summarize your insights and what you've learned about the data. This could include one or two important research questions you think you could answer using the data, as well as some initial hypotheses supported by your exploratory analysis. Or perhaps you've identified unusual variation in a single variable, or extreme outliers or systematic missingness in the data that should be accounted for in future analysis. This component will look different for each student. That's fine. What I want to see is genuine effort and some thought put into what you've learned from this EDA.
- This component should include mostly written analysis and a handful of graphs to support your questions and answers
- Clean up these graphs so they are publication-ready. This means giving each graph a meaningful title, axis labels, legends, etc.
There are three files in the `data` folder. Each file contains the same data, simply in a different format:
- `gss2012.csv` - CSV file
- `gss2012.dta` - Stata file
- `gss2012.feather` - Feather file
Use whichever format is easiest to import into your software package of choice.
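As one possible way to load the file in Python, the sketch below picks a pandas reader from the file extension. The helper names `reader_name` and `load_gss` are hypothetical, and the `data/` path in the usage comment assumes the notebook runs from the repository root. Note that `pandas.read_feather` also needs the `pyarrow` package installed.

```python
from pathlib import Path

# Maps each extension found in the data folder to the pandas reader for it.
READERS = {
    ".csv": "read_csv",
    ".dta": "read_stata",
    ".feather": "read_feather",
}


def reader_name(path):
    """Return the name of the pandas function that reads this file type."""
    suffix = Path(path).suffix.lower()
    if suffix not in READERS:
        raise ValueError(f"unsupported file type: {suffix!r}")
    return READERS[suffix]


def load_gss(path):
    """Load one of the GSS data files into a pandas DataFrame."""
    import pandas as pd  # imported here so reader_name works without pandas

    return getattr(pd, reader_name(path))(path)


# Example usage (assumes the file exists at this relative path):
# gss = load_gss("data/gss2012.csv")
# gss.info()
```

All three files hold the same data, so the choice mostly affects load time and how much type information (such as Stata value labels) survives the import.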
If using R, you can access this data file in the `poliscidata` package:

```r
install.packages("poliscidata")
data(gss, package = "poliscidata")

# convert to tibble
library(tidyverse)
gss <- as_tibble(gss)
```
In the `documentation` folder, there are three files that are potentially relevant to your analysis:
- `codebook.txt` - a codebook of the dataset automatically generated by Stata
- `GSS_Codebook_index.pdf` - a list of all variables available from the GSS, with their variable names in the data file and a brief description of each variable
- `GSS_Codebook_mainbody.pdf` - a detailed description of all variables available from the GSS, with full question wording and potential responses
You can also find more information on the survey and specific variables at the GSS website.
Here are some relevant resources for how to write code in Python or R to generate EDA graphs.
- VanderPlas, Jake. (2016). Python Data Science Handbook. O'Reilly Media, Inc.
- Unwin, A. (2015). Graphical data analysis with R (Vol. 27). CRC Press.
- My lecture notes (which I will post in the repo after class on Wednesday)
- Lab session on Wednesday
See here for instructions on submitting course assignments.
Submit your assignment as a set of reproducible notebooks - one notebook for the lab notebook, and one notebook for the exploration write-up.
- If you use Python, this means a Jupyter Notebook (`.ipynb`).
- If you use R, this means an R Markdown document (`.Rmd` knitted with `output: github_document` in the front matter). Be sure to stage and commit not only the source file but also the output images so everything is visible in GitHub.
Do not submit a plain `.py` or `.R` script. Your generated graphs will not be visible in the repo, and we will not be running every single script to ensure it works correctly. Make sure to use a notebook format. If you have questions about this, please ask.
Submit your pull request before class on Monday, November 27 (11:30 am).