---
title: Setup
jupyter: python3
---
**Tools - pandas**
*The `pandas` library provides high-performance, easy-to-use data structures and data analysis tools. The main data structure is the `DataFrame`, which you can think of as an in-memory 2D table (like a spreadsheet, with column names and row labels). Many features available in Excel are available programmatically, such as creating pivot tables, computing columns based on other columns, plotting graphs, etc. You can also group rows by column value, or join tables much like in SQL. Pandas is also great at handling time series.*
Prerequisites:
* NumPy – if you are not familiar with NumPy, we recommend that you go through the [NumPy tutorial](tools_numpy.ipynb) now.
<table align="left">
<td>
<a href="https://colab.research.google.com/github/ageron/handson-ml3/blob/main/tools_pandas.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
</td>
<td>
<a target="_blank" href="https://kaggle.com/kernels/welcome?src=https://github.com/ageron/handson-ml3/blob/main/tools_pandas.ipynb"><img src="https://kaggle.com/static/images/open-in-kaggle.svg" /></a>
</td>
</table>
First, let's import `pandas`. People usually import it as `pd`:
```{python}
import pandas as pd
```
# `Series` objects
The `pandas` library contains the following useful data structures:
* `Series` objects, which we will discuss now. A `Series` object is a 1D array, similar to a column in a spreadsheet (with a column name and row labels).
* `DataFrame` objects. This is a 2D table, similar to a spreadsheet (with column names and row labels).
* `Panel` objects. You could think of a `Panel` as a dictionary of `DataFrame`s. They were rarely used, and have been deprecated and removed in modern versions of pandas (a `DataFrame` with a multi-index is the recommended replacement), so we will not discuss them here.
## Creating a `Series`
Let's start by creating our first `Series` object!
```{python}
s = pd.Series([2,-1,3,5])
s
```
## Similar to a 1D `ndarray`
`Series` objects behave much like one-dimensional NumPy `ndarray`s, and you can often pass them as parameters to NumPy functions:
```{python}
import numpy as np
np.exp(s)
```
Arithmetic operations on `Series` are also possible, and they apply *elementwise*, just like for `ndarray`s:
```{python}
s + [1000,2000,3000,4000]
```
Similar to NumPy, if you add a single number to a `Series`, that number is added to all items in the `Series`. This is called *broadcasting*:
```{python}
s + 1000
```
The same is true for all binary operations such as `*` or `/`, and even conditional operations:
```{python}
s < 0
```
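These boolean `Series` are handy for filtering: indexing a `Series` with a boolean `Series` keeps only the items where the condition is `True`. A quick sketch:

```{python}
import pandas as pd

s = pd.Series([2, -1, 3, 5])  # the same Series as above
s[s < 0]  # keep only the negative items
```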
## Index labels
Each item in a `Series` object has a unique identifier called the *index label*. By default, it is simply the rank of the item in the `Series` (starting from `0`) but you can also set the index labels manually:
```{python}
s2 = pd.Series([68, 83, 112, 68], index=["alice", "bob", "charles", "darwin"])
s2
```
You can then use the `Series` just like a `dict`:
```{python}
s2["bob"]
```
You can still access the items by integer location, like in a regular array (note that in recent pandas versions this bare integer indexing on a labeled `Series` is deprecated in favor of `iloc`, shown below):
```{python}
s2[1]
```
To make it clear when you are accessing by label or by integer location, it is recommended to always use the `loc` attribute when accessing by label, and the `iloc` attribute when accessing by integer location:
```{python}
s2.loc["bob"]
```
```{python}
s2.iloc[1]
```
Slicing a `Series` also slices the index labels:
```{python}
s2.iloc[1:3]
```
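Note that, unlike integer slicing with `iloc`, slicing by label with `loc` includes *both* endpoints:

```{python}
import pandas as pd

s2 = pd.Series([68, 83, 112, 68], index=["alice", "bob", "charles", "darwin"])  # same as above
s2.loc["bob":"charles"]  # "charles" is included
```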
This can lead to unexpected results when using the default numeric labels, so be careful:
```{python}
surprise = pd.Series([1000, 1001, 1002, 1003])
surprise
```
```{python}
surprise_slice = surprise[2:]
surprise_slice
```
Oh, look! The first element has index label `2`. The element with index label `0` is absent from the slice:
```{python}
try:
surprise_slice[0]
except KeyError as e:
print("Key error:", e)
```
But remember that you can access elements by integer location using the `iloc` attribute. This illustrates another reason why it's always better to use `loc` and `iloc` to access `Series` objects:
```{python}
surprise_slice.iloc[0]
```
## Init from `dict`
You can create a `Series` object from a `dict`. The keys will be used as index labels:
```{python}
weights = {"alice": 68, "bob": 83, "colin": 86, "darwin": 68}
s3 = pd.Series(weights)
s3
```
You can control which elements you want to include in the `Series` and in what order by explicitly specifying the desired `index`:
```{python}
s4 = pd.Series(weights, index = ["colin", "alice"])
s4
```
## Automatic alignment
When an operation involves multiple `Series` objects, `pandas` automatically aligns items by matching index labels.
```{python}
print(s2.keys())
print(s3.keys())
s2 + s3
```
The resulting `Series` contains the union of index labels from `s2` and `s3`. Since `"colin"` is missing from `s2` and `"charles"` is missing from `s3`, these items have a `NaN` result value (`NaN` stands for Not-a-Number; pandas uses it to represent *missing* values).
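If you would rather treat a label that is missing from one side as zero instead of getting `NaN`, you can use the `add()` method with its `fill_value` argument instead of the `+` operator:

```{python}
import pandas as pd

# the same Series as above
s2 = pd.Series([68, 83, 112, 68], index=["alice", "bob", "charles", "darwin"])
s3 = pd.Series({"alice": 68, "bob": 83, "colin": 86, "darwin": 68})
s2.add(s3, fill_value=0)
```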
Automatic alignment is very handy when working with data that may come from various sources with varying structure and missing items. But if you forget to set the right index labels, you can have surprising results:
```{python}
s5 = pd.Series([1000,1000,1000,1000])
print("s2 =", s2.values)
print("s5 =", s5.values)
s2 + s5
```
Pandas could not align the `Series`, since their labels do not match at all, hence the full `NaN` result.
## Init with a scalar
You can also initialize a `Series` object using a scalar and a list of index labels: all items will be set to the scalar.
```{python}
meaning = pd.Series(42, ["life", "universe", "everything"])
meaning
```
## `Series` name
A `Series` can have a `name`:
```{python}
s6 = pd.Series([83, 68], index=["bob", "alice"], name="weights")
s6
```
## Plotting a `Series`
Pandas makes it easy to plot `Series` data using matplotlib (for more details on matplotlib, check out the [matplotlib tutorial](tools_matplotlib.ipynb)). Just import matplotlib and call the `plot()` method:
```{python}
#| scrolled: true
import matplotlib.pyplot as plt
temperatures = [4.4,5.1,6.1,6.2,6.1,6.1,5.7,5.2,4.7,4.1,3.9,3.5]
s7 = pd.Series(temperatures, name="Temperature")
s7.plot()
plt.show()
```
There are *many* options for plotting your data. It is not necessary to list them all here: if you need a particular type of plot (histograms, pie charts, etc.), just look for it in the excellent [Visualization](https://pandas.pydata.org/pandas-docs/stable/user_guide/visualization.html) section of pandas' documentation, and look at the example code.
# Handling time
Many datasets have timestamps, and pandas is awesome at manipulating such data:
* it can represent periods (such as 2016Q3) and frequencies (such as "monthly"),
* it can convert periods to actual timestamps, and *vice versa*,
* it can resample data and aggregate values any way you like,
* it can handle timezones.
## Time range
Let's start by creating a time series using `pd.date_range()`. It returns a `DatetimeIndex` containing one datetime per hour for 12 hours starting on October 29th 2016 at 5:30pm.
```{python}
dates = pd.date_range('2016/10/29 5:30pm', periods=12, freq='H')
dates
```
This `DatetimeIndex` may be used as an index in a `Series`:
```{python}
temp_series = pd.Series(temperatures, dates)
temp_series
```
Let's plot this series:
```{python}
temp_series.plot(kind="bar")
plt.grid(True)
plt.show()
```
## Resampling
Pandas lets us resample a time series very simply. Just call the `resample()` method and specify a new frequency:
```{python}
temp_series_freq_2H = temp_series.resample("2H")
temp_series_freq_2H
```
The resampling operation is actually a deferred operation, which is why we did not get a `Series` object, but a `DatetimeIndexResampler` object instead. To actually perform the resampling operation, we can simply call the `mean()` method. Pandas will compute the mean of every pair of consecutive hours:
```{python}
temp_series_freq_2H = temp_series_freq_2H.mean()
```
Let's plot the result:
```{python}
temp_series_freq_2H.plot(kind="bar")
plt.show()
```
Note how the values have automatically been aggregated into 2-hour periods. If we look at the 6-8pm period, for example, we had a value of `5.1` at 6:30pm, and `6.1` at 7:30pm. After resampling, we just have one value of `5.6`, which is the mean of `5.1` and `6.1`. Rather than computing the mean, we could have used any other aggregation function, for example we can decide to keep the minimum value of each period:
```{python}
temp_series_freq_2H = temp_series.resample("2H").min()
temp_series_freq_2H
```
Or, equivalently, we could use the `apply()` method instead:
```{python}
temp_series_freq_2H = temp_series.resample("2H").apply(np.min)
temp_series_freq_2H
```
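You can even compute several aggregations at once by passing a list to the `agg()` method; the result is a `DataFrame` with one column per aggregation. A sketch with a small stand-in series:

```{python}
import pandas as pd

temps = pd.Series([4.4, 5.1, 6.1, 6.2],
                  index=pd.date_range("2016-10-29 17:30", periods=4, freq="h"))
temps.resample("2h").agg(["min", "max", "mean"])
```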
## Upsampling and interpolation
That was an example of *downsampling*. We can also upsample (i.e. increase the frequency), but doing so creates holes in our data:
```{python}
temp_series_freq_15min = temp_series.resample("15Min").mean()
temp_series_freq_15min.head(n=10) # `head` displays the top n values
```
One solution is to fill the gaps by interpolating. We just call the `interpolate()` method. The default is to use linear interpolation, but we can also select another method, such as cubic interpolation:
```{python}
#| scrolled: true
temp_series_freq_15min = temp_series.resample("15Min").interpolate(method="cubic")
temp_series_freq_15min.head(n=10)
```
```{python}
temp_series.plot(label="Period: 1 hour")
temp_series_freq_15min.plot(label="Period: 15 minutes")
plt.legend()
plt.show()
```
## Timezones
By default, datetimes are *naive*: they are not aware of timezones, so 2016-10-30 02:30 might mean October 30th 2016 at 2:30am in Paris or in New York. We can make datetimes timezone *aware* by calling the `tz_localize()` method:
```{python}
temp_series_ny = temp_series.tz_localize("America/New_York")
temp_series_ny
```
Note that `-04:00` is now appended to all the datetimes. It means that these datetimes are 4 hours behind [UTC](https://en.wikipedia.org/wiki/Coordinated_Universal_Time).
We can convert these datetimes to Paris time like this:
```{python}
temp_series_paris = temp_series_ny.tz_convert("Europe/Paris")
temp_series_paris
```
You may have noticed that the UTC offset changes from `+02:00` to `+01:00`: this is because France switches to winter time at 3am that particular night (time goes back to 2am). Notice that 2:30am occurs twice! Let's go back to a naive representation (if you log some data hourly using local time, without storing the timezone, you might get something like this):
```{python}
temp_series_paris_naive = temp_series_paris.tz_localize(None)
temp_series_paris_naive
```
Now `02:30` is really ambiguous. If we try to localize these naive datetimes to the Paris timezone, we get an error:
```{python}
try:
temp_series_paris_naive.tz_localize("Europe/Paris")
except Exception as e:
print(type(e))
print(e)
```
Fortunately, by using the `ambiguous` argument we can tell pandas to infer the right DST (Daylight Saving Time) based on the order of the ambiguous timestamps:
```{python}
temp_series_paris_naive.tz_localize("Europe/Paris", ambiguous="infer")
```
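The opposite problem occurs in spring, when clocks jump forward and some local times never exist at all. The `nonexistent` argument of `tz_localize()` controls how such timestamps are handled, for example by shifting them forward to the next valid time:

```{python}
import pandas as pd

# 2:30am did not exist in Paris on March 26th 2017 (clocks jumped from 2am to 3am)
spring = pd.Series([1, 2], index=pd.to_datetime(["2017-03-26 01:30", "2017-03-26 02:30"]))
spring.tz_localize("Europe/Paris", nonexistent="shift_forward")
```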
## Periods
The `pd.period_range()` function returns a `PeriodIndex` instead of a `DatetimeIndex`. For example, let's get all quarters in 2016 and 2017:
```{python}
quarters = pd.period_range('2016Q1', periods=8, freq='Q')
quarters
```
Adding a number `N` to a `PeriodIndex` shifts the periods by `N` times the `PeriodIndex`'s frequency:
```{python}
quarters + 3
```
The `asfreq()` method lets us change the frequency of the `PeriodIndex`. All periods are lengthened or shortened accordingly. For example, let's convert all the quarterly periods to monthly periods (zooming in):
```{python}
quarters.asfreq("M")
```
By default, `asfreq()` zooms in on the end of each period. We can tell it to zoom in on the start of each period instead:
```{python}
quarters.asfreq("M", how="start")
```
And we can zoom out:
```{python}
quarters.asfreq("A")
```
Of course, we can create a `Series` with a `PeriodIndex`:
```{python}
quarterly_revenue = pd.Series([300, 320, 290, 390, 320, 360, 310, 410], index=quarters)
quarterly_revenue
```
```{python}
quarterly_revenue.plot(kind="line")
plt.show()
```
We can convert periods to timestamps by calling `to_timestamp`. By default, it will give us the first day of each period, but by setting `how` and `freq`, we can get the last hour of each period:
```{python}
last_hours = quarterly_revenue.to_timestamp(how="end", freq="H")
last_hours
```
And back to periods by calling `to_period`:
```{python}
last_hours.to_period()
```
Pandas also provides many other time-related functions that we recommend you check out in the [documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html). To whet your appetite, here is one way to get the last business day of each month in 2016, at 9am:
```{python}
months_2016 = pd.period_range("2016", periods=12, freq="M")
one_day_after_last_days = months_2016.asfreq("D") + 1
last_bdays = one_day_after_last_days.to_timestamp() - pd.tseries.offsets.BDay()
last_bdays.to_period("H") + 9
```
# `DataFrame` objects
A `DataFrame` object represents a spreadsheet, with cell values, column names and row index labels. You can define expressions to compute columns based on other columns, create pivot tables, group rows, draw graphs, etc. You can see `DataFrame`s as dictionaries of `Series`.
## Creating a `DataFrame`
You can create a DataFrame by passing a dictionary of `Series` objects:
```{python}
people_dict = {
"weight": pd.Series([68, 83, 112], index=["alice", "bob", "charles"]),
"birthyear": pd.Series([1984, 1985, 1992], index=["bob", "alice", "charles"], name="year"),
"children": pd.Series([0, 3], index=["charles", "bob"]),
"hobby": pd.Series(["Biking", "Dancing"], index=["alice", "bob"]),
}
people = pd.DataFrame(people_dict)
people
```
A few things to note:
* the `Series` were automatically aligned based on their index,
* missing values are represented as `NaN`,
* `Series` names are ignored (the name `"year"` was dropped),
* `DataFrame`s are displayed nicely in Jupyter notebooks, woohoo!
You can access columns pretty much as you would expect. They are returned as `Series` objects:
```{python}
people["birthyear"]
```
You can also get multiple columns at once:
```{python}
people[["birthyear", "hobby"]]
```
If you pass a list of columns and/or index row labels to the `DataFrame` constructor, it will guarantee that these columns and/or rows will exist, in that order, and no other column/row will exist. For example:
```{python}
d2 = pd.DataFrame(
people_dict,
columns=["birthyear", "weight", "height"],
index=["bob", "alice", "eugene"]
)
d2
```
Another convenient way to create a `DataFrame` is to pass all the values to the constructor as an `ndarray`, or a list of lists, and specify the column names and row index labels separately:
```{python}
values = [
[1985, np.nan, "Biking", 68],
[1984, 3, "Dancing", 83],
[1992, 0, np.nan, 112]
]
d3 = pd.DataFrame(
values,
columns=["birthyear", "children", "hobby", "weight"],
index=["alice", "bob", "charles"]
)
d3
```
To specify missing values, you can use either `np.nan` or NumPy's masked arrays:
```{python}
masked_array = np.ma.asarray(values, dtype=object)
masked_array[(0, 2), (1, 2)] = np.ma.masked
d3 = pd.DataFrame(
masked_array,
columns=["birthyear", "children", "hobby", "weight"],
index=["alice", "bob", "charles"]
)
d3
```
Instead of an `ndarray`, you can also pass a `DataFrame` object:
```{python}
d4 = pd.DataFrame(
d3,
columns=["hobby", "children"],
index=["alice", "bob"]
)
d4
```
It is also possible to create a `DataFrame` with a dictionary (or list) of dictionaries (or lists):
```{python}
people = pd.DataFrame({
"birthyear": {"alice": 1985, "bob": 1984, "charles": 1992},
"hobby": {"alice": "Biking", "bob": "Dancing"},
"weight": {"alice": 68, "bob": 83, "charles": 112},
"children": {"bob": 3, "charles": 0}
})
people
```
## Multi-indexing
If all columns are tuples of the same size, then they are understood as a multi-index. The same goes for row index labels. For example:
```{python}
d5 = pd.DataFrame(
{
("public", "birthyear"):
{("Paris","alice"): 1985, ("Paris","bob"): 1984, ("London","charles"): 1992},
("public", "hobby"):
{("Paris","alice"): "Biking", ("Paris","bob"): "Dancing"},
("private", "weight"):
{("Paris","alice"): 68, ("Paris","bob"): 83, ("London","charles"): 112},
("private", "children"):
{("Paris", "alice"): np.nan, ("Paris","bob"): 3, ("London","charles"): 0}
}
)
d5
```
You can now get a `DataFrame` containing all the `"public"` columns very simply:
```{python}
d5["public"]
```
```{python}
d5["public", "hobby"] # Same result as d5["public"]["hobby"]
```
## Dropping a level
Let's look at `d5` again:
```{python}
d5
```
There are two levels of columns, and two levels of indices. We can drop a column level by calling `droplevel()` (the same goes for indices):
```{python}
d5.columns = d5.columns.droplevel(level = 0)
d5
```
## Transposing
You can swap columns and indices using the `T` attribute:
```{python}
d6 = d5.T
d6
```
## Stacking and unstacking levels
Calling the `stack()` method will push the lowest column level down, turning it into the lowest row index level:
```{python}
d7 = d6.stack()
d7
```
Note that many `NaN` values appeared. This makes sense because many new combinations did not exist before (e.g. there was no `bob` in `London`).
Calling `unstack()` will do the reverse, once again creating many `NaN` values.
```{python}
d8 = d7.unstack()
d8
```
If we call `unstack` again, we end up with a `Series` object:
```{python}
d9 = d8.unstack()
d9
```
The `stack()` and `unstack()` methods let you select the `level` to stack/unstack. You can even stack/unstack multiple levels at once:
```{python}
#| scrolled: true
d10 = d9.unstack(level = (0,1))
d10
```
## Most methods return modified copies
As you may have noticed, the `stack()` and `unstack()` methods do not modify the object they are called on. Instead, they work on a copy and return that copy. This is true of most methods in pandas.
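For example, `drop()` returns a new `DataFrame` without the given rows, leaving the original untouched (a quick sketch with a throwaway `DataFrame`):

```{python}
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]}, index=["x", "y", "z"])
df2 = df.drop(["x"])      # returns a copy without row "x"
print(df2.index.tolist())
print(df.index.tolist())  # the original still has all three rows
```

Many of these methods also accept an `inplace=True` argument if you want to modify the object directly instead of getting a copy.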
## Accessing rows
Let's go back to the `people` `DataFrame`:
```{python}
people
```
The `loc` attribute lets you access rows instead of columns. The result is a `Series` object in which the `DataFrame`'s column names are mapped to row index labels:
```{python}
people.loc["charles"]
```
You can also access rows by integer location using the `iloc` attribute:
```{python}
people.iloc[2]
```
You can also get a slice of rows, and this returns a `DataFrame` object:
```{python}
people.iloc[1:3]
```
Finally, you can pass a boolean array to get the matching rows:
```{python}
people[np.array([True, False, True])]
```
This is most useful when combined with boolean expressions:
```{python}
people[people["birthyear"] < 1990]
```
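You can combine several conditions using the element-wise `&` (and), `|` (or) and `~` (not) operators. The parentheses around each condition are required because these operators have higher precedence than comparisons:

```{python}
import pandas as pd

# the same `people` DataFrame as above
people = pd.DataFrame({
    "birthyear": {"alice": 1985, "bob": 1984, "charles": 1992},
    "hobby": {"alice": "Biking", "bob": "Dancing"},
    "weight": {"alice": 68, "bob": 83, "charles": 112},
    "children": {"bob": 3, "charles": 0}
})
people[(people["birthyear"] < 1990) & (people["weight"] > 70)]
```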
## Adding and removing columns
You can generally treat `DataFrame` objects like dictionaries of `Series`, so the following works fine:
```{python}
people
```
```{python}
people["age"] = 2018 - people["birthyear"] # adds a new column "age"
people["over 30"] = people["age"] > 30 # adds another column "over 30"
birthyears = people.pop("birthyear")
del people["children"]
people
```
```{python}
birthyears
```
When you add a new column from a plain list or array, it must have the same number of rows as the `DataFrame`. If you add a `Series`, however, pandas aligns it by index label: missing rows are filled with `NaN`, and extra rows are ignored:
```{python}
people["pets"] = pd.Series({"bob": 0, "charles": 5, "eugene": 1}) # alice is missing, eugene is ignored
people
```
When adding a new column, it is added at the end (on the right) by default. You can also insert a column anywhere else using the `insert()` method:
```{python}
people.insert(1, "height", [172, 181, 185])
people
```
## Assigning new columns
You can also create new columns by calling the `assign()` method. Note that this returns a new `DataFrame` object, the original is not modified:
```{python}
people.assign(
body_mass_index = people["weight"] / (people["height"] / 100) ** 2,
has_pets = people["pets"] > 0
)
```
Note that you cannot access columns created within the same assignment:
```{python}
try:
people.assign(
body_mass_index = people["weight"] / (people["height"] / 100) ** 2,
overweight = people["body_mass_index"] > 25
)
except KeyError as e:
print("Key error:", e)
```
The solution is to split this assignment in two consecutive assignments:
```{python}
d6 = people.assign(body_mass_index = people["weight"] / (people["height"] / 100) ** 2)
d6.assign(overweight = d6["body_mass_index"] > 25)
```
Having to create a temporary variable `d6` is not very convenient. You may want to just chain the assignment calls, but it does not work because the `people` object is not actually modified by the first assignment:
```{python}
try:
(people
.assign(body_mass_index = people["weight"] / (people["height"] / 100) ** 2)
.assign(overweight = people["body_mass_index"] > 25)
)
except KeyError as e:
print("Key error:", e)
```
But fear not, there is a simple solution. You can pass a function to the `assign()` method (typically a `lambda` function), and this function will be called with the `DataFrame` as a parameter:
```{python}
(people
.assign(body_mass_index = lambda df: df["weight"] / (df["height"] / 100) ** 2)
.assign(overweight = lambda df: df["body_mass_index"] > 25)
)
```
Problem solved!
## Evaluating an expression
A great feature supported by pandas is expression evaluation. It relies on the `numexpr` library to speed up evaluation of large expressions (if it is not installed, pandas falls back to a slower pure-Python engine).
```{python}
people.eval("weight / (height/100) ** 2 > 25")
```
Assignment expressions are also supported. Let's set `inplace=True` to directly modify the `DataFrame` rather than getting a modified copy:
```{python}
people.eval("body_mass_index = weight / (height/100) ** 2", inplace=True)
people
```
You can use a local or global variable in an expression by prefixing it with `'@'`:
```{python}
overweight_threshold = 30
people.eval("overweight = body_mass_index > @overweight_threshold", inplace=True)
people
```
## Querying a `DataFrame`
The `query()` method lets you filter a `DataFrame` based on a query expression:
```{python}
people.query("age > 30 and pets == 0")
```
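Just like `eval()`, `query()` lets you reference Python variables by prefixing them with `@` (sketched here on a throwaway `DataFrame`):

```{python}
import pandas as pd

df = pd.DataFrame({"age": [34, 33, 26], "pets": [0.0, None, 5.0]},
                  index=["bob", "alice", "charles"])
min_age = 30
df.query("age > @min_age and pets == 0")
```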
## Sorting a `DataFrame`
You can sort a `DataFrame` by calling its `sort_index` method. By default, it sorts the rows by their index label, in ascending order, but let's reverse the order:
```{python}
people.sort_index(ascending=False)
```
Note that `sort_index` returned a sorted *copy* of the `DataFrame`. To modify `people` directly, we can set the `inplace` argument to `True`. Also, we can sort the columns instead of the rows by setting `axis=1`:
```{python}
people.sort_index(axis=1, inplace=True)
people
```
To sort the `DataFrame` by the values instead of the labels, we can use `sort_values` and specify the column to sort by:
```{python}
people.sort_values(by="age", inplace=True)
people
```
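You can also sort by several columns at once by passing a list to `by`, optionally with a matching `ascending` list (a sketch with a throwaway `DataFrame`):

```{python}
import pandas as pd

df = pd.DataFrame({"age": [34, 33, 26], "weight": [83, 68, 112]},
                  index=["bob", "alice", "charles"])
df.sort_values(by=["age", "weight"], ascending=[True, False])
```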
## Plotting a `DataFrame`
Just like for `Series`, pandas makes it easy to draw nice graphs based on a `DataFrame`.
For example, it is trivial to create a line plot from a `DataFrame`'s data by calling its `plot` method:
```{python}
people.sort_values(by="body_mass_index", inplace=True)
people.plot(kind="line", x="body_mass_index", y=["height", "weight"])
plt.show()
```
You can pass extra arguments supported by matplotlib's functions. For example, we can create scatterplot and pass it a list of sizes using the `s` argument of matplotlib's `scatter()` function:
```{python}
#| scrolled: true
people.plot(kind="scatter", x="height", y="weight", s=[40, 120, 200])
plt.show()
```
Again, there are way too many options to list here: the best option is to scroll through the [Visualization](https://pandas.pydata.org/pandas-docs/stable/user_guide/visualization.html) page in pandas' documentation, find the plot you are interested in and look at the example code.
## Operations on `DataFrame`s
Although `DataFrame`s do not try to mimic NumPy arrays, there are a few similarities. Let's create a `DataFrame` to demonstrate this:
```{python}
grades_array = np.array([[8, 8, 9], [10, 9, 9], [4, 8, 2], [9, 10, 10]])
grades = pd.DataFrame(grades_array, columns=["sep", "oct", "nov"], index=["alice", "bob", "charles", "darwin"])
grades
```
You can apply NumPy mathematical functions on a `DataFrame`: the function is applied to all values:
```{python}
np.sqrt(grades)
```
Similarly, adding a single value to a `DataFrame` will add that value to all elements in the `DataFrame`. This is called *broadcasting*:
```{python}
grades + 1
```
Of course, the same is true for all other binary operations, including arithmetic (`*`,`/`,`**`...) and conditional (`>`, `==`...) operations:
```{python}
grades >= 5
```
Aggregation operations, such as computing the `max`, the `sum` or the `mean` of a `DataFrame`, apply to each column, and you get back a `Series` object:
```{python}
grades.mean()
```
The `all` method is also an aggregation operation: it checks whether all values are `True` or not. Let's see during which months all students got a grade greater than `5`:
```{python}
(grades > 5).all()
```
Most of these functions take an optional `axis` parameter which lets you specify along which axis of the `DataFrame` you want the operation executed. The default is `axis=0`, meaning that the operation is executed vertically (on each column). You can set `axis=1` to execute the operation horizontally (on each row). For example, let's find out which students had all grades greater than `5`:
```{python}
(grades > 5).all(axis=1)
```
The `any` method returns `True` if any value is True. Let's see who got at least one grade 10:
```{python}
(grades == 10).any(axis=1)
```
If you add a `Series` object to a `DataFrame` (or execute any other binary operation), pandas attempts to broadcast the operation to all *rows* in the `DataFrame`. This only works if the `Series` has the same size as the `DataFrame`'s rows. For example, let's subtract the `mean` of the `DataFrame` (a `Series` object) from the `DataFrame`:
```{python}
grades - grades.mean() # equivalent to: grades - [7.75, 8.75, 7.50]
```
We subtracted `7.75` from all September grades, `8.75` from October grades and `7.50` from November grades. It is equivalent to subtracting this `DataFrame`:
```{python}
pd.DataFrame([[7.75, 8.75, 7.50]]*4, index=grades.index, columns=grades.columns)
```
If you want to subtract the global mean from every grade, here is one way to do it:
```{python}
#| scrolled: true
grades - grades.values.mean() # subtracts the global mean (8.00) from all grades
```
## Automatic alignment
Similar to `Series`, when operating on multiple `DataFrame`s, pandas automatically aligns them by row index label, but also by column names. Let's create a `DataFrame` with bonus points for each person from October to December:
```{python}
bonus_array = np.array([[0, np.nan, 2], [np.nan, 1, 0], [0, 1, 0], [3, 3, 0]])
bonus_points = pd.DataFrame(bonus_array, columns=["oct", "nov", "dec"], index=["bob", "colin", "darwin", "charles"])
bonus_points
```
```{python}
#| scrolled: true
grades + bonus_points
```
Looks like the addition worked in some cases, but far too many elements are now empty. That's because when aligning the `DataFrame`s, some columns and rows were only present on one side, so they were considered missing on the other side (`NaN`). Adding `NaN` to a number results in `NaN`, hence all the empty results.
## Handling missing data
Dealing with missing data is a frequent task when working with real life data. Pandas offers a few tools to handle missing data.
Let's try to fix the problem above. For example, we can decide that missing data should result in a zero, instead of `NaN`. We can replace all `NaN` values by any value using the `fillna()` method:
```{python}
#| scrolled: true
(grades + bonus_points).fillna(0)
```
It's a bit unfair that we're setting grades to zero in September, though. Perhaps we should decide that missing grades are missing grades, but missing bonus points should be replaced by zeros:
```{python}
fixed_bonus_points = bonus_points.fillna(0)
fixed_bonus_points.insert(0, "sep", 0)
fixed_bonus_points.loc["alice"] = 0
grades + fixed_bonus_points
```
That's much better: although we made up some data, we have not been too unfair.
Another way to handle missing data is to interpolate. Let's look at the `bonus_points` `DataFrame` again:
```{python}
bonus_points
```
Now let's call the `interpolate()` method. By default, it interpolates vertically (`axis=0`), so let's tell it to interpolate horizontally (`axis=1`):
```{python}
bonus_points.interpolate(axis=1)
```
Bob had 0 bonus points in October and 2 in December, so interpolating for November gives the mean: 1 bonus point. Colin had 1 bonus point in November, but we do not know how many he had in September, so pandas cannot interpolate; that is why there is still a missing value in October after interpolation. To fix this, we can set the September bonus points to 0 before interpolating.
```{python}
better_bonus_points = bonus_points.copy()
better_bonus_points.insert(0, "sep", 0)
better_bonus_points.loc["alice"] = 0
better_bonus_points = better_bonus_points.interpolate(axis=1)
better_bonus_points
```
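Instead of inserting a zero September column by hand, `interpolate()` can also be told to fill leading `NaN`s via its `limit_direction` argument. A small sketch on a made-up `Series` (note this is a different convention: the leading gap gets the first known value, not zero):

```{python}
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 1.0, np.nan, 3.0],
              index=["sep", "oct", "nov", "dec"])

# limit_direction="both" also back-fills the leading NaN with the
# first valid value, in addition to the usual linear interpolation
filled = s.interpolate(limit_direction="both")
filled
```

Whether propagating the first known value backwards is fairer than assuming zero depends on what the data means, so pick the convention deliberately.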
Great, now we have reasonable bonus points everywhere. Let's find out the final grades:
```{python}
grades + better_bonus_points
```
It is slightly annoying that the September column ends up on the right. This is because the `DataFrame`s we are adding do not have the exact same columns (the `grades` `DataFrame` is missing the `"dec"` column), so to make things predictable, pandas orders the final columns alphabetically. To fix this, we can simply add the missing column before adding:
```{python}
#| scrolled: true
grades["dec"] = np.nan
final_grades = grades + better_bonus_points
final_grades
```
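Another option is to impose the column order explicitly on the result with `reindex()`, rather than adding the missing column up front. A sketch on made-up data:

```{python}
import pandas as pd

df = pd.DataFrame({"dec": [1.0], "nov": [2.0], "oct": [3.0], "sep": [4.0]},
                  index=["alice"])

# reindex() returns the columns in exactly the order requested
ordered = df.reindex(columns=["sep", "oct", "nov", "dec"])
ordered
```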
There's not much we can do about December and Colin: it's bad enough that we are making up bonus points, but we can't reasonably make up grades (well, I guess some teachers probably do). So let's call the `dropna()` method to get rid of rows that are full of `NaN`s:
```{python}
final_grades_clean = final_grades.dropna(how="all")
final_grades_clean
```
Now let's remove columns that are full of `NaN`s by setting the `axis` argument to `1`:
```{python}
final_grades_clean = final_grades_clean.dropna(axis=1, how="all")
final_grades_clean
```
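`dropna()` is not limited to all-or-nothing: its `thresh` argument keeps only the rows (or columns) that have at least a given number of non-`NaN` values. A minimal sketch on made-up data:

```{python}
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan, np.nan],
                   "b": [2.0, 3.0, np.nan],
                   "c": [4.0, 5.0, np.nan]})

# keep only the rows that have at least two non-NaN values:
# row 0 has three, row 1 has two, row 2 has none
kept = df.dropna(thresh=2)
kept
```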
## Aggregating with `groupby`
Much like SQL, pandas lets you split your data into groups and run calculations over each group.
First, let's add some extra data about each person so we can group them, and let's go back to the `final_grades` `DataFrame` so we can see how `NaN` values are handled:
```{python}
#| scrolled: true
final_grades["hobby"] = ["Biking", "Dancing", np.nan, "Dancing", "Biking"]
final_grades
```
Now let's group data in this `DataFrame` by hobby:
```{python}
grouped_grades = final_grades.groupby("hobby")
grouped_grades
```
We are ready to compute the average grade per hobby:
```{python}
grouped_grades.mean()
```
That was easy! Note that the `NaN` values have simply been skipped when computing the means.
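A grouped column can also be aggregated with several functions at once using `agg()`. A minimal sketch on made-up data (the hobbies and grades here are illustrative):

```{python}
import pandas as pd

df = pd.DataFrame({
    "hobby": ["Biking", "Dancing", "Biking", "Dancing"],
    "grade": [8.0, 9.0, 10.0, 7.0],
})

# compute several statistics per group in a single call;
# the result has one row per group and one column per function
stats = df.groupby("hobby")["grade"].agg(["mean", "min", "max"])
stats
```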
## Pivot tables
Pandas supports spreadsheet-like [pivot tables](https://en.wikipedia.org/wiki/Pivot_table) that allow quick data summarization. To illustrate this, let's create a simple `DataFrame`:
```{python}
more_grades = final_grades_clean.stack().reset_index()
more_grades.columns = ["name", "month", "grade"]
more_grades["bonus"] = [np.nan, np.nan, np.nan, 0, np.nan, 2, 3, 3, 0, 0, 1, 0]
more_grades
```
Now we can call the `pd.pivot_table()` function on this `DataFrame`, asking to group by the `name` column. By default, `pivot_table()` computes the mean of each numeric column:
```{python}
pd.pivot_table(more_grades, index="name")
```
We can change the aggregation function by setting the `aggfunc` argument, and we can also specify the list of columns whose values will be aggregated:
```{python}
pd.pivot_table(more_grades, index="name", values=["grade", "bonus"],
               aggfunc="max")  # a string avoids the deprecated aggfunc=np.max
```
We can also specify the `columns` to aggregate over horizontally, and request the grand totals for each row and column by setting `margins=True`:
```{python}
pd.pivot_table(more_grades, index="name", values="grade", columns="month", margins=True)
```
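`pivot_table()` also accepts a `fill_value` argument to replace missing cells in the result. A small sketch on made-up data:

```{python}
import pandas as pd

df = pd.DataFrame({
    "name": ["alice", "alice", "bob"],
    "month": ["oct", "nov", "oct"],
    "grade": [8.0, 9.0, 7.0],
})

# bob has no "nov" entry; without fill_value that cell would be NaN
table = pd.pivot_table(df, index="name", columns="month",
                       values="grade", fill_value=0)
table
```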
Finally, we can specify multiple index or column names, and pandas will create multi-level indices:
```{python}
pd.pivot_table(more_grades, index=["name", "month"], margins=True)
```
## Overview functions
When dealing with large `DataFrame`s, it is useful to get a quick overview of their content. Pandas offers a few functions for this. First, let's create a large `DataFrame` with a mix of numeric values, missing values and text values. Notice how Jupyter displays only the corners of the `DataFrame`: