Solve short hands-on challenges to perfect your data manipulation skills.
In this tutorial, you will learn how to create your own data, along with how to work with data that already exists.
import pandas as pd
There are two core objects in pandas: the DataFrame and the Series.

We use the pd.DataFrame() constructor to generate these DataFrame objects. The syntax for declaring a new one is a dictionary whose keys are the column names (Yes/No and Bob/Sue in the examples below) and whose values are lists of entries. This is the standard way of constructing a new DataFrame.
# A DataFrame is a table.
# It contains an array of individual entries, each of which has a certain value.
# Each entry corresponds to a row (or record) and a column.
pd.DataFrame({'Yes': [50, 21], 'No': [131, 2]})
Out:
Yes No
0 50 131
1 21 2
# DataFrame entries are not limited to integers.
pd.DataFrame({'Bob': ['I liked it.', 'It was awful.'], 'Sue': ['Pretty good.', 'Bland.']})
Out:
Bob Sue
0 I liked it. Pretty good.
1 It was awful. Bland.
The dictionary-list constructor assigns values to the column labels, but just uses an ascending count from 0 (0, 1, 2, 3, ...) for the row labels. Sometimes this is OK, but oftentimes we will want to assign these labels ourselves:
pd.DataFrame({'Bob': ['I liked it.', 'It was awful.'],
'Sue': ['Pretty good.', 'Bland.']},
index=['Product A', 'Product B'])
Out:
Bob Sue
Product A I liked it. Pretty good.
Product B It was awful. Bland.
A Series, by contrast, is a sequence of data values. If a DataFrame is a table, a Series is a list. And in fact you can create one with nothing more than a list:
pd.Series([1, 2, 3, 4, 5])
Out:
0 1
1 2
2 3
3 4
4 5
dtype: int64
You can assign row labels to the Series the same way as before, using an index parameter. However, a Series does not have a column name; it only has one overall name:
pd.Series([30, 35, 40], index=['2015 Sales', '2016 Sales', '2017 Sales'], name='Product A')
Out:
2015 Sales 30
2016 Sales 35
2017 Sales 40
Name: Product A, dtype: int64
It's helpful to think of a DataFrame as actually being just a bunch of Series "glued together".
Let's now set aside our toy datasets and see what a real dataset looks like when we read it into a DataFrame. We'll use the pd.read_csv() function to read the data into a DataFrame. This goes thusly:
wine_reviews = pd.read_csv("../input/wine-reviews/winemag-data-130k-v2.csv")
wine_reviews.shape  # check how large the resulting DataFrame is
Out:
(129971, 14)
So the resulting DataFrame has 129,971 records split across 14 columns: about 1.8 million entries in total.
- wine_reviews.head() gives the first 5 rows of the dataset.
- pd.read_csv() has more than 30 optional parameters; for example, the index_col=0 parameter makes the first column of the CSV the index: wine_reviews = pd.read_csv("../input/wine-reviews/winemag-data-130k-v2.csv", index_col=0)
- animals.to_csv("cows_and_goats.csv") saves the animals DataFrame to the cows_and_goats.csv file on disk.
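Putting reading and writing together, a minimal round-trip sketch (the animals numbers here are made up for illustration):

# A small DataFrame, written to disk and read back
animals = pd.DataFrame({'Cows': [12, 20], 'Goats': [22, 19]},
                       index=['Year 1', 'Year 2'])
animals.to_csv("cows_and_goats.csv")
pd.read_csv("cows_and_goats.csv", index_col=0)  # index_col=0 restores the saved index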
One of the first things you need to learn in working with data in Python is how to go about selecting the data points relevant to you quickly and effectively.
- desc = reviews['description'] and desc = reviews.description both give the same result: the description column from reviews, assigned to the variable desc.
- reviews.description.iloc[0], reviews.description.loc[0], and reviews.description[0] all give the same thing: the first value from the description column of reviews. Both loc and iloc are row-first, column-second.
- reviews.iloc[0] gives back the first row of the reviews dataset.
- reviews.description.iloc[:10], reviews.description.head(10), and reviews.loc[:9, "description"] are the same, and give back the first 10 values from the description column in reviews.
- sample_reviews = reviews.iloc[[1, 2, 3, 5, 8], :] selects the records with index labels 1, 2, 3, 5, and 8, assigning the result to the variable sample_reviews.
- df = reviews.loc[[0, 1, 10, 100], ['country', 'province', 'region_1', 'region_2']] creates a variable df containing the country, province, region_1, and region_2 columns of the records with the index labels 0, 1, 10, and 100.
- df = reviews.loc[:99, ['country', 'variety']] creates a variable df containing the country and variety columns of the first 100 records. In this case df.iloc[0:1000] will return 1000 entries, while df.loc[0:1000] returns 1001 of them, because loc includes its endpoint!
- italian_wines = reviews[reviews.country == 'Italy'] creates a DataFrame italian_wines containing reviews of wines made in Italy.
- top_oceania_wines = reviews.loc[(reviews.country.isin(['Australia', 'New Zealand'])) & (reviews.points >= 95)] creates a DataFrame top_oceania_wines containing all reviews with at least 95 points (out of 100) for wines from Australia or New Zealand.
- reviews.loc[reviews.price.notnull()] selects the reviews where price is not empty. isnull (and its companion notnull) let you highlight values which are (or are not) empty (NaN).

It's worth knowing that negative numbers can be used in selection. This will start counting from the end of the values: reviews.iloc[-5:] returns the last five rows.

Label-based selection derives its power from the labels in the index. Critically, the index we use is not immutable; we can manipulate the index in any way we see fit, e.g. reviews.set_index("title").
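A self-contained sketch of the loc/iloc endpoint difference, using toy data rather than the wine dataset:

# iloc excludes the endpoint of a range; loc includes it
df = pd.DataFrame({'country': ['Italy', 'Portugal', 'US', 'Spain']})
df.iloc[0:3]  # rows at positions 0, 1, 2 -> 3 entries
df.loc[0:3]   # rows with labels 0, 1, 2, 3 -> 4 entries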
- reviews.points.describe() gives a high-level summary of the given column; describe() can also be used on the whole dataset.
- mean() gets the mean of a column, ex: reviews.points.mean().
- unique() gets the unique values, so duplicates are filtered out, ex: reviews.taster_name.unique().
- value_counts() shows how often each unique value appears, ex: reviews.taster_name.value_counts() => out: Roger Voss 25514 ...
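These methods work on any Series; a minimal sketch on made-up data:

# Summary methods on a toy Series
s = pd.Series(['a', 'b', 'a', 'c', 'a'])
s.describe()      # count 5, unique 3, top 'a', freq 3 (type-aware: strings get a different summary than numbers)
s.unique()        # array(['a', 'b', 'c'], dtype=object)
s.value_counts()  # a: 3, b: 1, c: 1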
A map is a term, borrowed from mathematics, for a function that takes one set of values and "maps" them to another set of values. In data science we often have a need for creating new representations from existing data, or for transforming data from the format it is in now to the format that we want it to be in later.
There are two mapping methods that you will use often.
Note that map() and apply() return new, transformed Series and DataFrames, respectively. They don't modify the original data they're called on.

The function you pass to map() should expect a single value from the Series (a point value, in the example below), and return a transformed version of that value.
# map() is the first, and slightly simpler one.
# For example, suppose that we wanted to remean
# the scores the wines received to 0.
#
# We can do this as follows:
review_points_mean = reviews.points.mean()
reviews.points.map(lambda p: p - review_points_mean)
Create a variable centered_price containing a version of the price column with the mean price subtracted:
centered_price = reviews.price.map(lambda p: p - reviews.price.mean())
# of course this is equal to
# centered_price = reviews.price - reviews.price.mean()
apply() is the equivalent method if we want to transform a whole DataFrame by calling a custom method on each row. If we had called reviews.apply() with axis='index', then instead of passing a function to transform each row, we would need to give a function to transform each column.
def remean_points(row):
row.points = row.points - review_points_mean
return row
reviews.apply(remean_points, axis='columns')
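To make the axis='index' case concrete, here is a self-contained sketch on toy data (not the wine dataset); the function receives one column at a time as a Series:

# With axis='index', apply() passes each column to the function
df = pd.DataFrame({'points': [90, 85], 'price': [20, 15]})
df.apply(lambda col: col - col.mean(), axis='index')
#    points  price
# 0     2.5    2.5
# 1    -2.5   -2.5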
# A score of 95 or higher counts as 3 stars,
# a score of at least 85 but less than 95 is 2 stars.
# Any other score is 1 star.
#
# any wines from Canada should automatically get 3 stars,
# regardless of points.
#
# Create a series star_ratings with the number of stars
# corresponding to each review in the dataset
def stars(row):
if row.country == 'Canada':
return 3
elif row.points >= 95:
return 3
elif row.points >= 85:
return 2
else:
return 1
star_ratings = reviews.apply(stars, axis='columns')
Often we want to group our data, and then do something specific to the group the data is in.

One function we've been using heavily thus far is the value_counts() function. We can replicate what value_counts() does with groupby():

reviews.points.value_counts()
# is equivalent to:
reviews.groupby('points').points.count()
groupby() created a group of reviews which allotted the same point values to the given wines. Then, for each of these groups, we grabbed the points column and counted how many times it appeared. value_counts() is just a shortcut to this groupby() operation.
For example, here's one way of selecting the name of the first wine reviewed from each winery in the dataset: reviews.groupby('winery').apply(lambda df: df.title.iloc[0])
For even more fine-grained control, you can also group by more than one column. For an example, here's how we would pick out the best wine by country and province: reviews.groupby(['country', 'province']).apply(lambda df: df.loc[df.points.idxmax()])
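As a self-contained sketch of the first pattern (hypothetical wineries and titles, not the real dataset):

# Toy version of the winery example: first title per group
df = pd.DataFrame({'winery': ['A', 'A', 'B'],
                   'title': ['Wine 1', 'Wine 2', 'Wine 3']})
df.groupby('winery').apply(lambda g: g.title.iloc[0])
# winery
# A    Wine 1
# B    Wine 3
# dtype: object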
agg() lets you run a bunch of different functions on your DataFrame simultaneously. For example, we can generate a simple statistical summary of the dataset as follows:
reviews.groupby(['country']).price.agg([len, min, max])
Out:
len min max
country
Argentina 3800 4.0 230.0
Armenia 2 14.0 15.0
... ... ... ...
Ukraine 14 6.0 13.0
Uruguay 109 10.0 130.0
43 rows × 3 columns
groupby() is slightly different in that, depending on the operation we run, it will sometimes result in what is called a multi-index. A multi-index differs from a regular index in that it has multiple levels. For example:
countries_reviewed = reviews.groupby(['country', 'province']).description.agg([len])
countries_reviewed
Out:
len
country province
Argentina Mendoza Province 3264
Other 536
... ... ...
Uruguay San Jose 3
Uruguay 24
The use cases for a multi-index are detailed alongside instructions on using them in the MultiIndex / Advanced Selection section of the pandas documentation.
However, in general the multi-index method you will use most often is the one for converting back to a regular index, the reset_index() method:
countries_reviewed.reset_index()
Out:
country province len
0 Argentina Mendoza Province 3264
1 Argentina Other 536
... ... ... ...
423 Uruguay San Jose 3
424 Uruguay Uruguay 24
425 rows × 3 columns
- Grouping returns data in index order, not in value order.
- To get data in the order we want it in, we can sort it ourselves. The sort_values() method is handy for this.
- We can also sort in descending order by passing ascending=False to sort_values().
countries_reviewed = countries_reviewed.reset_index()
countries_reviewed.sort_values(by='len')
Out:
country province len
179 Greece Muscat of Kefallonian 1
192 Greece Sterea Ellada 1
... ... ... ...
415 US Washington 8639
392 US California 36247
425 rows × 3 columns
To sort by more than one column at a time:
countries_reviewed.sort_values(by=['country', 'len'])
Out:
country province len
1 Argentina Other 536
0 Argentina Mendoza Province 3264
... ... ... ...
424 Uruguay Uruguay 24
419 Uruguay Canelones 43
425 rows × 3 columns
more in exercise-grouping-and-sorting.ipynb
The data type for a column in a DataFrame or a Series is known as the dtype. You can use the dtype property to grab the type of a specific column. Alternatively, the dtypes property returns the dtype of every column in the DataFrame: reviews.dtypes.
It's possible to convert a column of one type into another wherever such a conversion makes sense by using the astype() function. For example, we may transform the points column from its existing int64 data type into a float64 data type: reviews.points.astype('float64')
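A minimal sketch of dtype, dtypes, and astype() on toy data:

# Inspecting and converting column types
df = pd.DataFrame({'points': [87, 92]})
df.points.dtype              # dtype('int64')
df.points.astype('float64')  # 0    87.0, 1    92.0, dtype: float64
df.dtypes                    # points    int64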
To select NaN entries you can use pd.isnull() (or its companion pd.notnull()). Pandas also provides a really handy method for filling in missing values: fillna(). For example, reviews.region_2.fillna("Unknown") replaces each NaN with "Unknown".

The replace() method substitutes one value for another; ex: .replace("@kerinokeefe", "@kerino") called on a column replaces each @kerinokeefe with @kerino.
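Both methods in one tiny sketch on made-up data (np.nan is used to create the missing entry):

import numpy as np  # for np.nan

s = pd.Series(['Willamette Valley', np.nan, '@kerinokeefe'])
s.isnull()                            # False, True, False
s.fillna("Unknown")                   # the NaN becomes "Unknown"
s.replace("@kerinokeefe", "@kerino")  # value-for-value substitution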
more in exercise-data-types-and-missing-values.ipynb
Oftentimes data will come to us with column names, index names, or other naming conventions that we are not satisfied with.
rename()
- lets you change index names and/or column names. ex: to change the points column in our dataset to score, we would do:
reviews.rename(columns={'points': 'score'})
- rename index or column values by specifying an index or column keyword parameter, respectively:
reviews.rename(index={0: 'firstEntry', 1: 'secondEntry'})
set_index()
- often more convenient than rename() when you want to change the index wholesale
rename_axis()
- renames the row index and/or the column index themselves (their name attributes), ex: reviews.rename_axis("wines", axis='rows').rename_axis("fields", axis='columns')
When performing operations on a dataset, we will sometimes need to combine different DataFrames and/or Series in non-trivial ways.
concat()
- the simplest combining method; smushes a list of DataFrames together along an axis
join()
- combines DataFrame objects which have an index in common
merge()
- most of what merge() can do can also be done more simply with join()

Concatenate different datasets with the same structure (columns):
canadian_youtube = pd.read_csv("../input/youtube-new/CAvideos.csv")
british_youtube = pd.read_csv("../input/youtube-new/GBvideos.csv")
pd.concat([canadian_youtube, british_youtube])
To combine the two datasets on a shared index instead, for example to pull down videos that happened to be trending on the same day in both countries, use join():
left = canadian_youtube.set_index(['title', 'trending_date'])
right = british_youtube.set_index(['title', 'trending_date'])
left.join(right, lsuffix='_CAN', rsuffix='_UK')
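The lsuffix and rsuffix parameters are needed here because both datasets share the same column names, so the suffixes keep the Canadian and UK columns distinguishable in the joined result.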