import numpy as np
import pandas as pd
Notes
Introduction
This is a long chapter, these notes are intended as a tour of main ideas!
Pandas is a major tool in Python data analysis
Works with Numpy, adding support for tabular / heterogenous data
Import conventions:
Panda’s primary data structures
Series: One dimensional object containing a sequence of values of the same type.
DataFrame: Tabular data, similar (and inspired by) R dataframe.
Other structures will be introduced as they arise, e.g. Index and Groupby objects.
Series
= pd.Series([4,7,-4,3], index = ["A","B","C","D"])
obj obj
A 4
B 7
C -4
D 3
dtype: int64
The index
is optional, if not specified it will default to 0 through N-1
Selection
Select elements or sub-Series by labels, sets of labels, boolean arrays …
'A'] obj[
4
'A','C']] obj[[
A 4
C -4
dtype: int64
> 3] obj[obj
A 4
B 7
dtype: int64
Other things you can do
- Numpy functions and Numpy-like operations work as expected:
*3 obj
A 12
B 21
C -12
D 9
dtype: int64
np.exp(obj)
A 54.598150
B 1096.633158
C 0.018316
D 20.085537
dtype: float64
- Series can be created from and converted to a dictionary
obj.to_dict()
{'A': 4, 'B': 7, 'C': -4, 'D': 3}
- Series can be converted to numpy array:
obj.to_numpy()
array([ 4, 7, -4, 3], dtype=int64)
DataFrame
Represents table of data
Has row index index and column index column
Common way to create is from a dictionary, but see Table 5.1 for more!
= pd.DataFrame({"cars":['Chevy','Ford','Dodge','BMW'],'MPG':[14,15,16,12], 'Year':[1979, 1980, 2001, 2020]})
test test
cars | MPG | Year | |
---|---|---|---|
0 | Chevy | 14 | 1979 |
1 | Ford | 15 | 1980 |
2 | Dodge | 16 | 2001 |
3 | BMW | 12 | 2020 |
If you want a non-default index, it can be specified just like with Series.
head(n)
/tail(n)
- return the first / last n rows, 5 by default
Selecting
- Can retrieve columns or sets of columns by using
obj[...]
:
'cars'] test[
0 Chevy
1 Ford
2 Dodge
3 BMW
Name: cars, dtype: object
Note that we got a Series
here.
'cars','MPG']] test[[
cars | MPG | |
---|---|---|
0 | Chevy | 14 |
1 | Ford | 15 |
2 | Dodge | 16 |
3 | BMW | 12 |
Dot notation can also be used (
test.cars
) as long as the column names are valid identifiersRows can be retrieved with
iloc[...]
andloc[...]
:loc
retrieves by indexiloc
retrieves by position.
Modifying / Creating Columns
- Columns can be modified (and created) by assignment:
'MPG^2'] = test['MPG']**2
test[ test
cars | MPG | Year | MPG^2 | |
---|---|---|---|---|
0 | Chevy | 14 | 1979 | 196 |
1 | Ford | 15 | 1980 | 225 |
2 | Dodge | 16 | 2001 | 256 |
3 | BMW | 12 | 2020 | 144 |
del
keyword can be used to drop columns, ordrop
method can be used to do so non-destructively
Index object
- Index objects are used for holding axis labels and other metadata
test.index
RangeIndex(start=0, stop=4, step=1)
- Can change the index, in this case replacing the default:
# Create index from one of the columns
= test['cars']
test.index
# remove 'cars' column since i am using as an index now. s
=test.drop('cars', axis = "columns") # or axis = 1
test test
MPG | Year | MPG^2 | |
---|---|---|---|
cars | |||
Chevy | 14 | 1979 | 196 |
Ford | 15 | 1980 | 225 |
Dodge | 16 | 2001 | 256 |
BMW | 12 | 2020 | 144 |
Note the
axis
keyword argument above, many DataFrame methods use this.Above I changed a column into an index. Often you want to go the other way, this can be done with
reset_index
:
# Note this doesn't actually change test test.reset_index()
cars | MPG | Year | MPG^2 | |
---|---|---|---|---|
0 | Chevy | 14 | 1979 | 196 |
1 | Ford | 15 | 1980 | 225 |
2 | Dodge | 16 | 2001 | 256 |
3 | BMW | 12 | 2020 | 144 |
- Columns are an index as well:
test.columns
Index(['MPG', 'Year', 'MPG^2'], dtype='object')
- Indexes act like immutable sets, see Table 5.2 in book for Index methods and properties
Essential Functionality
Reindexing and dropping
reindex
creats a new object with the values arranged according to the new index. Missing values are used if necessary, or you can use optional fill methods. You can useiloc
andloc
to reindex as well.
= pd.Series([1,2,3,4,5], index = list("abcde"))
s = s.reindex(list("abcfu")) # not a song by GAYLE
s2 s2
a 1.0
b 2.0
c 3.0
f NaN
u NaN
dtype: float64
- Missing values and can be tested for with
isna
ornotna
methods
pd.isna(s2)
a False
b False
c False
f True
u True
dtype: bool
drop
, illustrated above can drop rows or columns. In addition to usingaxis
you can usecolumns
orindex
. Again these make copies.
= 'MPG') test.drop(columns
Year | MPG^2 | |
---|---|---|
cars | ||
Chevy | 1979 | 196 |
Ford | 1980 | 225 |
Dodge | 2001 | 256 |
BMW | 2020 | 144 |
= ['Ford', 'BMW']) test.drop(index
MPG | Year | MPG^2 | |
---|---|---|---|
cars | |||
Chevy | 14 | 1979 | 196 |
Dodge | 16 | 2001 | 256 |
Indexing, Selection and Filtering
Series
- For Series, indexing is similar to Numpy, except you can use the index as well as integers.
= pd.Series(np.arange(4.), index=["a", "b", "c", "d"])
obj 0:3] obj[
a 0.0
b 1.0
c 2.0
dtype: float64
'a':'c'] obj[
a 0.0
b 1.0
c 2.0
dtype: float64
<2] obj[obj
a 0.0
b 1.0
dtype: float64
'a','d']] obj[[
a 0.0
d 3.0
dtype: float64
- However, preferred way is to use
loc
for selection by index andiloc
for selection by position. This is to avoid the issue where theindex
is itself integers.
'a','d']] obj.loc[[
a 0.0
d 3.0
dtype: float64
1] obj.iloc[
1.0
Note if a range or a set of indexes is used, a Series is returned. If a single item is requested, you get just that item.
DataFrame
Selecting with
df[...]
for a DataFrame retrieves one or more columns as we have seen, if you select a single column you get a SeriesThere are some special cases, indexing with a boolean selects rows, as does selecting with a slice:
0:1] test[
MPG | Year | MPG^2 | |
---|---|---|---|
cars | |||
Chevy | 14 | 1979 | 196 |
'MPG'] < 15] test[test[
MPG | Year | MPG^2 | |
---|---|---|---|
cars | |||
Chevy | 14 | 1979 | 196 |
BMW | 12 | 2020 | 144 |
iloc
andloc
can be used to select rows as illustrated before, but can also be used to select columns or subsets of rows/columns
'Year','MPG']] test.loc[:,[
Year | MPG | |
---|---|---|
cars | ||
Chevy | 1979 | 14 |
Ford | 1980 | 15 |
Dodge | 2001 | 16 |
BMW | 2020 | 12 |
'Ford','MPG'] test.loc[
15
- These work with slices and booleans as well! The following says “give me all the rows with MPG more then 15, and the columns starting from Year”
'MPG'] > 15, 'Year':] test.loc[test[
Year | MPG^2 | |
---|---|---|
cars | ||
Dodge | 2001 | 256 |
Indexing options are fully illustrated in the book and Table 5.4
Be careful with chained indexing:
'MPG']> 15].loc[:,'MPG'] = 18 test[test[
C:\Users\jryan\AppData\Local\Temp\ipykernel_13388\2484144822.py:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
test[test['MPG']> 15].loc[:,'MPG'] = 18
Here we are assigning to a ‘slice’, which is probably not what is intended. You will get a warning and a recommendation to fix it by using one loc
:
'MPG']> 15 ,'MPG'] = 18
test.loc[test[ test
MPG | Year | MPG^2 | |
---|---|---|---|
cars | |||
Chevy | 14 | 1979 | 196 |
Ford | 15 | 1980 | 225 |
Dodge | 18 | 2001 | 256 |
BMW | 12 | 2020 | 144 |
Avoid chained indexing when doing assignments
Arithmetic and Data Alignment
- Pandas can make it simpler to work with objects that have different indexes, usually ‘doing the right thing’
= pd.Series([7.3, -2.5, 3.4, 1.5], index=["a", "c", "d", "e"])
s1 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1], index=["a", "c", "e", "f", "g"])
s2 +s2 s1
a 5.2
c 1.1
d NaN
e 0.0
f NaN
g NaN
dtype: float64
- Fills can be specified by using methods:
= 0) s1.add(s2, fill_value
a 5.2
c 1.1
d 3.4
e 0.0
f 4.0
g 3.1
dtype: float64
See Table 5.5 for list of these methods.
You can also do arithmetic between DataFrames and Series in a way that is similar to Numpy.
Function Application and Mapping
- Numpy ufuncs also work with Pandas objects.
= pd.DataFrame(np.random.standard_normal((4, 3)),
frame =list("bde"),
columns=["Utah", "Ohio", "Texas", "Oregon"])
index frame
b | d | e | |
---|---|---|---|
Utah | 0.798548 | -1.455476 | 0.507618 |
Ohio | -0.405775 | -0.132380 | -0.563721 |
Texas | -0.404526 | 0.703566 | 1.661291 |
Oregon | -0.359775 | -0.450894 | -0.712254 |
abs(frame) np.
b | d | e | |
---|---|---|---|
Utah | 0.798548 | 1.455476 | 0.507618 |
Ohio | 0.405775 | 0.132380 | 0.563721 |
Texas | 0.404526 | 0.703566 | 1.661291 |
Oregon | 0.359775 | 0.450894 | 0.712254 |
apply
can be used to apply a function on 1D arrays to each column or row:
apply(np.max, axis = 'rows') #'axis' is optional here, default is rows frame.
b 0.798548
d 0.703566
e 1.661291
dtype: float64
Applying accross columns is common, especially to combine different columns in some way:
'max'] = frame.apply(np.max, axis = 'columns')
frame[ frame
b | d | e | max | |
---|---|---|---|---|
Utah | 0.798548 | -1.455476 | 0.507618 | 0.798548 |
Ohio | -0.405775 | -0.132380 | -0.563721 | -0.132380 |
Texas | -0.404526 | 0.703566 | 1.661291 | 1.661291 |
Oregon | -0.359775 | -0.450894 | -0.712254 | -0.359775 |
- Many more examples of this in the book.
Sorting and Ranking
sort_index
will sort with the index (on either axis for DataFrame)sort_values
is used to sort by values or a particular column
'MPG') test.sort_values(
MPG | Year | MPG^2 | |
---|---|---|---|
cars | |||
BMW | 12 | 2020 | 144 |
Chevy | 14 | 1979 | 196 |
Ford | 15 | 1980 | 225 |
Dodge | 18 | 2001 | 256 |
rank
will assign ranks from on through the number of data points.
Summarizing and Computing Descriptive Statistics
= pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
df 0.75, -1.3]],
[np.nan, np.nan], [=["a", "b", "c", "d"],
index=["one", "two"])
columns df
one | two | |
---|---|---|
a | 1.40 | NaN |
b | 7.10 | -4.5 |
c | NaN | NaN |
d | 0.75 | -1.3 |
Some Examples:
Sum over rows:
sum() df.
one 9.25
two -5.80
dtype: float64
Sum over columns:
# Sum Rows
sum(axis="columns") df.
a 1.40
b 2.60
c 0.00
d -0.55
dtype: float64
Extremely useful is describe
:
df.describe()
one | two | |
---|---|---|
count | 3.000000 | 2.000000 |
mean | 3.083333 | -2.900000 |
std | 3.493685 | 2.262742 |
min | 0.750000 | -4.500000 |
25% | 1.075000 | -3.700000 |
50% | 1.400000 | -2.900000 |
75% | 4.250000 | -2.100000 |
max | 7.100000 | -1.300000 |
Book chapter contains many more examples and a full list of summary statistics and related methods.
Summary
Primary Panda’s data structures:
Series
DataFrame
Many ways to access and transform these objects. Key ones are:
[]
: access an element(s) of aSeries
or columns(s) of aDataFrame
loc[r ,c]
: access a row / column / cell by theindex
.iloc[i, j]
: access ar row / column / cell by the integer position.
Work though the chapter’s code and try stuff!
References
Next Chapter
- Loading and writing data sets!