Exploring the General Social Survey (10 points)

The General Social Survey (GSS) gathers data on American society in order to monitor and explain trends and constants in attitudes, behaviors, and attributes. It is conducted biennially (every two years) through in-person interviews using a probability sampling approach. It is one of the most commonly studied datasets in the social sciences.

In the data folder, I have included a (large) sample of questions asked during the 2012 GSS. Using the exploratory data analysis (EDA) skills we have reviewed in class, you will conduct an exploratory analysis of the data to identify interesting questions and (potential) answers. Remember the types of questions we seek to answer using EDA:

What type of variation occurs within my variables?
What type of covariation occurs between my variables?
Are there outliers in the data?
Do I have missingness? Are there patterns to it?
How much variation/error exists in my statistical estimates? Is there a pattern to it?
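For Python users, here is a minimal sketch of how some of the questions above might map onto pandas calls. The DataFrame is a toy stand-in with invented column names and values, not the actual GSS sample, and the 1.5 * IQR rule is just one common convention for flagging outliers.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the GSS sample; column names and values are invented.
df = pd.DataFrame({
    "age": [23, 35, 47, 89, 35, np.nan],
    "educ": [12, 16, np.nan, 8, 16, 14],
    "marital": ["single", "married", "married", None, "married", "single"],
})

# Variation within a variable: summary statistics and category counts.
age_summary = df["age"].describe()
marital_counts = df["marital"].value_counts(dropna=False)

# Covariation between two numeric variables (uses pairwise-complete rows).
age_educ_corr = df["age"].corr(df["educ"])

# Missingness: share of missing values in each column.
missing_share = df.isna().mean()

# Outliers: flag values more than 1.5 * IQR beyond the quartiles.
q1, q3 = df["age"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["age"] < q1 - 1.5 * iqr) | (df["age"] > q3 + 1.5 * iqr)]
```

In a lab notebook, each of these results would typically be followed by a quick plot (a histogram, bar chart, or scatterplot) rather than inspected only as numbers.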

What not to do

Build a statistical model

Do not employ complex statistical methods. Focus primarily on graphical analysis, though you can also use basic statistical tests you may have learned in other classes (e.g. tests for normality, difference of means).

Adjust for survey weights

Do not worry about using survey weights in your exploratory analysis. Just treat every observation equally.

What you should do

The final submission should include two components.

Lab notebook (6 points)

This is a record of all your exploratory analysis. It should be extensive (minimum 30-40 graphs), and mostly code and graphics.

Minimally annotate your code and output as necessary to keep track of what you've done and highlight important insights gained through your exploration.
It should be somewhat stream-of-consciousness (that is, a stored record of your exploration as you explore the data), though feel free to maintain a structure or go back and reformat different sections.
Don't bother cleaning up each graph to have meaningful labels.

Exploration write-up (4 points)

In a short paper (around 750 words), summarize your insights and what you've learned about the data. This could include one or two important research questions you think you could answer using the data, as well as some initial hypotheses supported by your exploratory analysis. Or perhaps you've identified unusual variation in a single variable, or extreme outliers or systematic missingness in the data that should be accounted for in future analysis. This component will look different for each student. That's fine. What I want to see is genuine effort and some thought put into what you've learned from this EDA.

This component should include mostly written analysis and a handful of graphs to support your questions and answers.
Clean up these graphs so they are publication-ready. That means giving each graph a meaningful title, axis labels, a legend, etc.
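As a rough illustration of what "publication-ready" could look like in Python, the sketch below draws a grouped bar chart with matplotlib and sets the title, axis labels, and legend explicitly. The data frame, its column names, the numbers, and the helper `plot_tv_hours` are all hypothetical and do not come from the GSS.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical data for illustration only; these are not GSS results.
df = pd.DataFrame({
    "degree": ["High school", "High school", "Bachelor", "Bachelor",
               "Graduate", "Graduate"],
    "sex": ["Female", "Male", "Female", "Male", "Female", "Male"],
    "tv_hours": [3.1, 2.9, 2.2, 2.4, 1.8, 1.9],
})


def plot_tv_hours(data):
    """Draw a labeled grouped bar chart and return (fig, ax)."""
    # One row per degree, one column per sex.
    wide = data.pivot(index="degree", columns="sex", values="tv_hours")
    # Order the categories meaningfully instead of alphabetically.
    wide = wide.loc[["High school", "Bachelor", "Graduate"]]

    fig, ax = plt.subplots(figsize=(6, 4))
    wide.plot(kind="bar", ax=ax, rot=0)

    # Replace variable names with labels a reader can understand.
    ax.set_title("Average daily TV hours by education and sex")
    ax.set_xlabel("Highest degree earned")
    ax.set_ylabel("Hours of TV per day")
    ax.legend(title="Sex")
    fig.tight_layout()
    return fig, ax


fig, ax = plot_tv_hours(df)
```

In a Jupyter Notebook the figure renders inline after the cell runs; `fig.savefig(...)` can be added if a standalone image file is also wanted.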

Accessing the data

There are three files in the data folder. Each file contains the same data, simply in a different format:

gss2012.csv - CSV file
gss2012.dta - Stata file
gss2012.feather - Feather file

Use whichever format is easiest to import into your software package of choice.
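If using Python, one possible way to read any of the three formats is to pick the pandas reader from the file extension. This is only a sketch: the helper name `load_gss` is made up for illustration, the `data/gss2012.csv` path assumes the notebook runs from the repository root, and `pd.read_feather` needs the `pyarrow` package installed.

```python
from pathlib import Path

import pandas as pd

# Map each file extension to the matching pandas reader.
READERS = {
    ".csv": pd.read_csv,
    ".dta": pd.read_stata,
    ".feather": pd.read_feather,  # requires pyarrow
}


def load_gss(path):
    """Load a data file with the reader that matches its extension.

    Hypothetical helper, not part of the course materials.
    """
    path = Path(path)
    try:
        reader = READERS[path.suffix.lower()]
    except KeyError:
        raise ValueError(f"Unsupported file format: {path.suffix!r}") from None
    return reader(path)


# Example call (path is an assumption about where the notebook runs):
# gss = load_gss("data/gss2012.csv")
```

Because all three files contain the same data, the choice mostly affects loading speed and how variable types and labels are carried over, so it is worth checking `gss.dtypes` after loading.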

If using R, you can access this data file in the poliscidata package:

install.packages("poliscidata")
data(gss, package = "poliscidata")

# convert to tibble
library(tidyverse)
gss <- as_tibble(gss)

Dataset documentation

In the documentation folder, there are three files that are potentially relevant to your analysis.

codebook.txt - a codebook of the dataset automatically generated by Stata
GSS_Codebook_index.pdf - a list of all variables available from the GSS, with their variable names in the data file and a brief description of the variable
GSS_Codebook_mainbody.pdf - a detailed description of all variables available from the GSS, with full question wording and potential responses

You can also find more information on the survey and specific variables at the GSS website.

Writing the code

Here are some relevant resources for how to write code in Python or R to generate EDA graphs.

VanderPlas, Jake. (2016). Python Data Science Handbook. O'Reilly Media, Inc.
Unwin, A. (2015). Graphical data analysis with R (Vol. 27). CRC Press.
My lecture notes (which I will post in the repo after class on Wednesday)
Lab session on Wednesday

Submitting your assignment

See here for instructions on submitting course assignments.

Submission format

Submit your assignment as a set of reproducible notebooks - one notebook for the lab notebook, and one notebook for the exploration write-up.

If you use Python, this means a Jupyter Notebook (.ipynb).
If you use R, this means an R Markdown document (.Rmd knitted with output: github_document in the front matter). Be sure to stage and commit not only the source file but also the output images as well so everything is visible in GitHub.

Do not submit a plain .py or .R script. Your generated graphs will not be visible in the repo, and we will not be running every single script to ensure it works correctly. Make sure to use a notebook format. If you have questions about this, please ask.

Submission deadline

Submit your pull request before class on Monday, November 27 (11:30 am).
