Skip to content

Latest commit

 

History

History
50 lines (44 loc) · 3.28 KB

14.0. Data Analysis and Dataframes.md

File metadata and controls

50 lines (44 loc) · 3.28 KB

14.0. Data Analysis and Visualization

Python has two very popular modules for data analysis and visualization. The Pandas module is a popular Python library that helps users manipulate and analyze data. Matplotlib is a module for creating plots and visualizing data. For people who have used R, pandas and matplotlib replicate a lot of the functionality of R (except for building complex statistical models, though there are other python modules to do that).

Installing pandas and matplotlib

Let's start with installation. These are add-on modules that probably are not already installed. So let's do that first. Go to your terminal window (or command prompt on windows) and enter the following commands:

Windows:

pip install pandas
pip install matplotlib

Mac:

python3 -m pip install pandas
python3 -m pip install matplotlib

Understanding pandas

The value of matplotlib is pretty obvious: making graphs and figures can be very useful. But what does pandas do? To better understand the value of pandas, let's consider some common tasks researchers in cognitive science might encounter and how pandas can help.

The main thing to know about pandas is that it gives us a new data structure to work with, the DataFrame. A pandas DataFrame is a versatile, two-dimensional, tabular data structure. Some of its key features include the fact that both rows and columns have labels. In a pandas DataFrame, each column is a pandas Series, which is a one-dimensional labeled array capable of holding any data type, such as integers, floats, and strings. So what are its advantages?

  • Data Organization: So far we have dealt with many datasets that we have had to store in matrices or lists of lists. Pandas dataframes gives is many methods for inputting and organizing data, so we don't have to write code to do it ourselves.
  • Data Cleaning: Suppose a researcher is working with a dataset containing reaction times for participants in a study, but some reaction times are missing. Pandas can be used to fill in those missing values with an average reaction time or remove the rows with missing information altogether, ensuring a clean dataset for further analysis.
  • Data Combining: In another scenario, the researcher might have separate datasets with information about participants' demographics and their performance in a memory task. Pandas makes it easy to merge these two datasets, connecting participants with their respective task results.
  • Data Filtering: When it comes to filtering data, pandas is equally helpful. Imagine a dataset that includes details about participants, such as their age and scores in a cognitive test. The researcher can use pandas to extract data for participants who scored above a certain threshold and are within a specific age range, allowing for more targeted analysis.
  • Data Summary: Data summarization is another area where pandas shines. For instance, a dataset might include responses from multiple participants in a survey about learning strategies. The researcher can use pandas to group the data by strategy and calculate the average effectiveness rating for each learning technique.

Next: 14.1. Loading and Saving DataFrames
Previous: 13.5. Lab 13