Notes

Introduction

Note

This is a long chapter, these notes are intended as a tour of main ideas!

Panda bus tour!
  • Pandas is a major tool in Python data analysis

  • Works with Numpy, adding support for tabular / heterogenous data

Import conventions:

import numpy as np
import pandas as pd

Panda’s primary data structures

  • Series: One dimensional object containing a sequence of values of the same type.

  • DataFrame: Tabular data, similar (and inspired by) R dataframe.

  • Other structures will be introduced as they arise, e.g. Index and Groupby objects.

Series

obj = pd.Series([4,7,-4,3], index = ["A","B","C","D"])
obj
A    4
B    7
C   -4
D    3
dtype: int64

The index is optional, if not specified it will default to 0 through N-1

Selection

Select elements or sub-Series by labels, sets of labels, boolean arrays …

obj['A']
4
obj[['A','C']]
A    4
C   -4
dtype: int64
obj[obj > 3]
A    4
B    7
dtype: int64

Other things you can do

  • Numpy functions and Numpy-like operations work as expected:
obj*3
A    12
B    21
C   -12
D     9
dtype: int64
np.exp(obj)
A      54.598150
B    1096.633158
C       0.018316
D      20.085537
dtype: float64
  • Series can be created from and converted to a dictionary
obj.to_dict()
{'A': 4, 'B': 7, 'C': -4, 'D': 3}
  • Series can be converted to numpy array:
obj.to_numpy()
array([ 4,  7, -4,  3], dtype=int64)

DataFrame

  • Represents table of data

  • Has row index index and column index column

  • Common way to create is from a dictionary, but see Table 5.1 for more!

test = pd.DataFrame({"cars":['Chevy','Ford','Dodge','BMW'],'MPG':[14,15,16,12], 'Year':[1979, 1980, 2001, 2020]})
test
cars MPG Year
0 Chevy 14 1979
1 Ford 15 1980
2 Dodge 16 2001
3 BMW 12 2020
  • If you want a non-default index, it can be specified just like with Series.

  • head(n) / tail(n) - return the first / last n rows, 5 by default

Selecting

  • Can retrieve columns or sets of columns by using obj[...]:
test['cars']
0    Chevy
1     Ford
2    Dodge
3      BMW
Name: cars, dtype: object

Note that we got a Series here.

test[['cars','MPG']]
cars MPG
0 Chevy 14
1 Ford 15
2 Dodge 16
3 BMW 12
  • Dot notation can also be used (test.cars) as long as the column names are valid identifiers

  • Rows can be retrieved with iloc[...] and loc[...]:

    • loc retrieves by index

    • iloc retrieves by position.

Modifying / Creating Columns

  • Columns can be modified (and created) by assignment:
test['MPG^2'] = test['MPG']**2
test
cars MPG Year MPG^2
0 Chevy 14 1979 196
1 Ford 15 1980 225
2 Dodge 16 2001 256
3 BMW 12 2020 144
  • del keyword can be used to drop columns, or drop method can be used to do so non-destructively

Index object

  • Index objects are used for holding axis labels and other metadata
test.index
RangeIndex(start=0, stop=4, step=1)
  • Can change the index, in this case replacing the default:
# Create index from one of the columns
test.index = test['cars']  

 # remove 'cars' column since i am using as an index now.  s
test=test.drop('cars', axis = "columns")  # or axis  = 1
test
MPG Year MPG^2
cars
Chevy 14 1979 196
Ford 15 1980 225
Dodge 16 2001 256
BMW 12 2020 144
  • Note the axis keyword argument above, many DataFrame methods use this.

  • Above I changed a column into an index. Often you want to go the other way, this can be done with reset_index:

test.reset_index()  # Note this doesn't actually change test
cars MPG Year MPG^2
0 Chevy 14 1979 196
1 Ford 15 1980 225
2 Dodge 16 2001 256
3 BMW 12 2020 144
  • Columns are an index as well:
test.columns
Index(['MPG', 'Year', 'MPG^2'], dtype='object')
  • Indexes act like immutable sets, see Table 5.2 in book for Index methods and properties

Essential Functionality

Reindexing and dropping

  • reindex creats a new object with the values arranged according to the new index. Missing values are used if necessary, or you can use optional fill methods. You can use iloc and loc to reindex as well.
s = pd.Series([1,2,3,4,5], index = list("abcde"))
s2 = s.reindex(list("abcfu"))  #  not a song by GAYLE 
s2
a    1.0
b    2.0
c    3.0
f    NaN
u    NaN
dtype: float64
  • Missing values and can be tested for with isna or notna methods
pd.isna(s2)
a    False
b    False
c    False
f     True
u     True
dtype: bool
  • drop , illustrated above can drop rows or columns. In addition to using axis you can use columns or index. Again these make copies.
test.drop(columns = 'MPG')
Year MPG^2
cars
Chevy 1979 196
Ford 1980 225
Dodge 2001 256
BMW 2020 144
test.drop(index = ['Ford', 'BMW'])
MPG Year MPG^2
cars
Chevy 14 1979 196
Dodge 16 2001 256

Indexing, Selection and Filtering

Series

  • For Series, indexing is similar to Numpy, except you can use the index as well as integers.
obj = pd.Series(np.arange(4.), index=["a", "b", "c", "d"])
obj[0:3]
a    0.0
b    1.0
c    2.0
dtype: float64
obj['a':'c']
a    0.0
b    1.0
c    2.0
dtype: float64
obj[obj<2]
a    0.0
b    1.0
dtype: float64
obj[['a','d']]
a    0.0
d    3.0
dtype: float64
  • However, preferred way is to use loc for selection by index and iloc for selection by position. This is to avoid the issue where the index is itself integers.
obj.loc[['a','d']]
a    0.0
d    3.0
dtype: float64
obj.iloc[1]
1.0
Note

Note if a range or a set of indexes is used, a Series is returned. If a single item is requested, you get just that item.

DataFrame

  • Selecting with df[...] for a DataFrame retrieves one or more columns as we have seen, if you select a single column you get a Series

  • There are some special cases, indexing with a boolean selects rows, as does selecting with a slice:

test[0:1]
MPG Year MPG^2
cars
Chevy 14 1979 196
test[test['MPG'] < 15]
MPG Year MPG^2
cars
Chevy 14 1979 196
BMW 12 2020 144
  • iloc and loc can be used to select rows as illustrated before, but can also be used to select columns or subsets of rows/columns
test.loc[:,['Year','MPG']]
Year MPG
cars
Chevy 1979 14
Ford 1980 15
Dodge 2001 16
BMW 2020 12
test.loc['Ford','MPG']
15
  • These work with slices and booleans as well! The following says “give me all the rows with MPG more then 15, and the columns starting from Year”
test.loc[test['MPG'] > 15, 'Year':]
Year MPG^2
cars
Dodge 2001 256
  • Indexing options are fully illustrated in the book and Table 5.4

  • Be careful with chained indexing:

test[test['MPG']> 15].loc[:,'MPG'] = 18
C:\Users\jryan\AppData\Local\Temp\ipykernel_13388\2484144822.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  test[test['MPG']> 15].loc[:,'MPG'] = 18

Here we are assigning to a ‘slice’, which is probably not what is intended. You will get a warning and a recommendation to fix it by using one loc:

test.loc[test['MPG']> 15 ,'MPG'] = 18
test
MPG Year MPG^2
cars
Chevy 14 1979 196
Ford 15 1980 225
Dodge 18 2001 256
BMW 12 2020 144
Rule of Thumb

Avoid chained indexing when doing assignments

Arithmetic and Data Alignment

  • Pandas can make it simpler to work with objects that have different indexes, usually ‘doing the right thing’
s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=["a", "c", "d", "e"])
s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1], index=["a", "c", "e", "f", "g"])
s1+s2
a    5.2
c    1.1
d    NaN
e    0.0
f    NaN
g    NaN
dtype: float64
  • Fills can be specified by using methods:
s1.add(s2, fill_value = 0)
a    5.2
c    1.1
d    3.4
e    0.0
f    4.0
g    3.1
dtype: float64
  • See Table 5.5 for list of these methods.

  • You can also do arithmetic between DataFrames and Series in a way that is similar to Numpy.

Function Application and Mapping

  • Numpy ufuncs also work with Pandas objects.
frame = pd.DataFrame(np.random.standard_normal((4, 3)),
                         columns=list("bde"),
                         index=["Utah", "Ohio", "Texas", "Oregon"])
frame
b d e
Utah 0.798548 -1.455476 0.507618
Ohio -0.405775 -0.132380 -0.563721
Texas -0.404526 0.703566 1.661291
Oregon -0.359775 -0.450894 -0.712254
np.abs(frame)
b d e
Utah 0.798548 1.455476 0.507618
Ohio 0.405775 0.132380 0.563721
Texas 0.404526 0.703566 1.661291
Oregon 0.359775 0.450894 0.712254
  • apply can be used to apply a function on 1D arrays to each column or row:
frame.apply(np.max, axis = 'rows') #'axis' is optional here, default is rows
b    0.798548
d    0.703566
e    1.661291
dtype: float64

Applying accross columns is common, especially to combine different columns in some way:

frame['max'] = frame.apply(np.max, axis = 'columns')
frame
b d e max
Utah 0.798548 -1.455476 0.507618 0.798548
Ohio -0.405775 -0.132380 -0.563721 -0.132380
Texas -0.404526 0.703566 1.661291 1.661291
Oregon -0.359775 -0.450894 -0.712254 -0.359775
  • Many more examples of this in the book.

Sorting and Ranking

  • sort_index will sort with the index (on either axis for DataFrame)
  • sort_values is used to sort by values or a particular column
test.sort_values('MPG')
MPG Year MPG^2
cars
BMW 12 2020 144
Chevy 14 1979 196
Ford 15 1980 225
Dodge 18 2001 256
  • rank will assign ranks from on through the number of data points.

Summarizing and Computing Descriptive Statistics

df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
                      [np.nan, np.nan], [0.75, -1.3]],
                      index=["a", "b", "c", "d"],
                      columns=["one", "two"])
df
one two
a 1.40 NaN
b 7.10 -4.5
c NaN NaN
d 0.75 -1.3

Some Examples:

Sum over rows:

df.sum()
one    9.25
two   -5.80
dtype: float64

Sum over columns:

# Sum Rows
df.sum(axis="columns")
a    1.40
b    2.60
c    0.00
d   -0.55
dtype: float64

Extremely useful is describe:

df.describe()
one two
count 3.000000 2.000000
mean 3.083333 -2.900000
std 3.493685 2.262742
min 0.750000 -4.500000
25% 1.075000 -3.700000
50% 1.400000 -2.900000
75% 4.250000 -2.100000
max 7.100000 -1.300000

Book chapter contains many more examples and a full list of summary statistics and related methods.

Summary

  • Primary Panda’s data structures:

    • Series

    • DataFrame

  • Many ways to access and transform these objects. Key ones are:

    • [] : access an element(s) of a Series or columns(s) of a DataFrame

    • loc[r ,c] : access a row / column / cell by the index.

    • iloc[i, j] : access ar row / column / cell by the integer position.

  • Online reference.

Suggestion

Work though the chapter’s code and try stuff!

References

Next Chapter

  • Loading and writing data sets!