Notes – Python for Data Analysis Book Club

Introduction

Note

This is a long chapter, these notes are intended as a tour of main ideas!

Pandas is a major tool in Python data analysis
Works with Numpy, adding support for tabular / heterogenous data

Import conventions:

import numpy as np
import pandas as pd

Panda’s primary data structures

Series: One dimensional object containing a sequence of values of the same type.
DataFrame: Tabular data, similar (and inspired by) R dataframe.
Other structures will be introduced as they arise, e.g. Index and Groupby objects.

Series

obj = pd.Series([4,7,-4,3], index = ["A","B","C","D"])
obj

A    4
B    7
C   -4
D    3
dtype: int64

The index is optional, if not specified it will default to 0 through N-1

Selection

Select elements or sub-Series by labels, sets of labels, boolean arrays …

obj['A']

obj[['A','C']]

A    4
C   -4
dtype: int64

obj[obj > 3]

A    4
B    7
dtype: int64

Other things you can do

Numpy functions and Numpy-like operations work as expected:

obj*3

A    12
B    21
C   -12
D     9
dtype: int64

np.exp(obj)

A      54.598150
B    1096.633158
C       0.018316
D      20.085537
dtype: float64

Series can be created from and converted to a dictionary

obj.to_dict()

{'A': 4, 'B': 7, 'C': -4, 'D': 3}

Series can be converted to numpy array:

obj.to_numpy()

array([ 4,  7, -4,  3], dtype=int64)

DataFrame

Represents table of data
Has row index index and column index column
Common way to create is from a dictionary, but see Table 5.1 for more!

test = pd.DataFrame({"cars":['Chevy','Ford','Dodge','BMW'],'MPG':[14,15,16,12], 'Year':[1979, 1980, 2001, 2020]})
test

	cars	MPG	Year
0	Chevy	14	1979
1	Ford	15	1980
2	Dodge	16	2001
3	BMW	12	2020

If you want a non-default index, it can be specified just like with Series.
head(n) / tail(n) - return the first / last n rows, 5 by default

Selecting

Can retrieve columns or sets of columns by using obj[...]:

test['cars']

0    Chevy
1     Ford
2    Dodge
3      BMW
Name: cars, dtype: object

Note that we got a Series here.

test[['cars','MPG']]

	cars	MPG
0	Chevy	14
1	Ford	15
2	Dodge	16
3	BMW	12

Dot notation can also be used (test.cars) as long as the column names are valid identifiers
Rows can be retrieved with iloc[...] and loc[...]:
- loc retrieves by index
- iloc retrieves by position.

Modifying / Creating Columns

Columns can be modified (and created) by assignment:

test['MPG^2'] = test['MPG']**2
test

	cars	MPG	Year	MPG^2
0	Chevy	14	1979	196
1	Ford	15	1980	225
2	Dodge	16	2001	256
3	BMW	12	2020	144

del keyword can be used to drop columns, or drop method can be used to do so non-destructively

Index object

Index objects are used for holding axis labels and other metadata

test.index

RangeIndex(start=0, stop=4, step=1)

Can change the index, in this case replacing the default:

# Create index from one of the columns
test.index = test['cars']  

 # remove 'cars' column since i am using as an index now.  s
test=test.drop('cars', axis = "columns")  # or axis  = 1
test

	MPG	Year	MPG^2
cars
Chevy	14	1979	196
Ford	15	1980	225
Dodge	16	2001	256
BMW	12	2020	144

Note the axis keyword argument above, many DataFrame methods use this.
Above I changed a column into an index. Often you want to go the other way, this can be done with reset_index:

test.reset_index()  # Note this doesn't actually change test

	cars	MPG	Year	MPG^2
0	Chevy	14	1979	196
1	Ford	15	1980	225
2	Dodge	16	2001	256
3	BMW	12	2020	144

Columns are an index as well:

test.columns

Index(['MPG', 'Year', 'MPG^2'], dtype='object')

Indexes act like immutable sets, see Table 5.2 in book for Index methods and properties

Essential Functionality

Reindexing and dropping

reindex creats a new object with the values arranged according to the new index. Missing values are used if necessary, or you can use optional fill methods. You can use iloc and loc to reindex as well.

s = pd.Series([1,2,3,4,5], index = list("abcde"))
s2 = s.reindex(list("abcfu"))  #  not a song by GAYLE 
s2

a    1.0
b    2.0
c    3.0
f    NaN
u    NaN
dtype: float64

Missing values and can be tested for with isna or notna methods

pd.isna(s2)

a    False
b    False
c    False
f     True
u     True
dtype: bool

drop , illustrated above can drop rows or columns. In addition to using axis you can use columns or index. Again these make copies.

test.drop(columns = 'MPG')

	Year	MPG^2
cars
Chevy	1979	196
Ford	1980	225
Dodge	2001	256
BMW	2020	144

test.drop(index = ['Ford', 'BMW'])

	MPG	Year	MPG^2
cars
Chevy	14	1979	196
Dodge	16	2001	256

Indexing, Selection and Filtering

Series

For Series, indexing is similar to Numpy, except you can use the index as well as integers.

obj = pd.Series(np.arange(4.), index=["a", "b", "c", "d"])
obj[0:3]

a    0.0
b    1.0
c    2.0
dtype: float64

obj['a':'c']

a    0.0
b    1.0
c    2.0
dtype: float64

obj[obj<2]

a    0.0
b    1.0
dtype: float64

obj[['a','d']]

a    0.0
d    3.0
dtype: float64

However, preferred way is to use loc for selection by index and iloc for selection by position. This is to avoid the issue where the index is itself integers.

obj.loc[['a','d']]

a    0.0
d    3.0
dtype: float64

obj.iloc[1]

1.0

Note

Note if a range or a set of indexes is used, a Series is returned. If a single item is requested, you get just that item.

DataFrame

Selecting with df[...] for a DataFrame retrieves one or more columns as we have seen, if you select a single column you get a Series
There are some special cases, indexing with a boolean selects rows, as does selecting with a slice:

test[0:1]

	MPG	Year	MPG^2
cars
Chevy	14	1979	196

test[test['MPG'] < 15]

	MPG	Year	MPG^2
cars
Chevy	14	1979	196
BMW	12	2020	144

iloc and loc can be used to select rows as illustrated before, but can also be used to select columns or subsets of rows/columns

test.loc[:,['Year','MPG']]

	Year	MPG
cars
Chevy	1979	14
Ford	1980	15
Dodge	2001	16
BMW	2020	12

test.loc['Ford','MPG']

These work with slices and booleans as well! The following says “give me all the rows with MPG more then 15, and the columns starting from Year”

test.loc[test['MPG'] > 15, 'Year':]

	Year	MPG^2
cars
Dodge	2001	256

Indexing options are fully illustrated in the book and Table 5.4
Be careful with chained indexing:

test[test['MPG']> 15].loc[:,'MPG'] = 18

C:\Users\jryan\AppData\Local\Temp\ipykernel_13388\2484144822.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  test[test['MPG']> 15].loc[:,'MPG'] = 18

Here we are assigning to a ‘slice’, which is probably not what is intended. You will get a warning and a recommendation to fix it by using one loc:

test.loc[test['MPG']> 15 ,'MPG'] = 18
test

	MPG	Year	MPG^2
cars
Chevy	14	1979	196
Ford	15	1980	225
Dodge	18	2001	256
BMW	12	2020	144

Rule of Thumb

Avoid chained indexing when doing assignments

Arithmetic and Data Alignment

Pandas can make it simpler to work with objects that have different indexes, usually ‘doing the right thing’

s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=["a", "c", "d", "e"])
s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1], index=["a", "c", "e", "f", "g"])
s1+s2

a    5.2
c    1.1
d    NaN
e    0.0
f    NaN
g    NaN
dtype: float64

Fills can be specified by using methods:

s1.add(s2, fill_value = 0)

a    5.2
c    1.1
d    3.4
e    0.0
f    4.0
g    3.1
dtype: float64

See Table 5.5 for list of these methods.
You can also do arithmetic between DataFrames and Series in a way that is similar to Numpy.

Function Application and Mapping

Numpy ufuncs also work with Pandas objects.

frame = pd.DataFrame(np.random.standard_normal((4, 3)),
                         columns=list("bde"),
                         index=["Utah", "Ohio", "Texas", "Oregon"])
frame

	b	d	e
Utah	0.798548	-1.455476	0.507618
Ohio	-0.405775	-0.132380	-0.563721
Texas	-0.404526	0.703566	1.661291
Oregon	-0.359775	-0.450894	-0.712254

np.abs(frame)

	b	d	e
Utah	0.798548	1.455476	0.507618
Ohio	0.405775	0.132380	0.563721
Texas	0.404526	0.703566	1.661291
Oregon	0.359775	0.450894	0.712254

apply can be used to apply a function on 1D arrays to each column or row:

frame.apply(np.max, axis = 'rows') #'axis' is optional here, default is rows

b    0.798548
d    0.703566
e    1.661291
dtype: float64

Applying accross columns is common, especially to combine different columns in some way:

frame['max'] = frame.apply(np.max, axis = 'columns')
frame

	b	d	e	max
Utah	0.798548	-1.455476	0.507618	0.798548
Ohio	-0.405775	-0.132380	-0.563721	-0.132380
Texas	-0.404526	0.703566	1.661291	1.661291
Oregon	-0.359775	-0.450894	-0.712254	-0.359775

Many more examples of this in the book.

Sorting and Ranking

sort_index will sort with the index (on either axis for DataFrame)
sort_values is used to sort by values or a particular column

test.sort_values('MPG')

	MPG	Year	MPG^2
cars
BMW	12	2020	144
Chevy	14	1979	196
Ford	15	1980	225
Dodge	18	2001	256

rank will assign ranks from on through the number of data points.

Summarizing and Computing Descriptive Statistics

df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
                      [np.nan, np.nan], [0.75, -1.3]],
                      index=["a", "b", "c", "d"],
                      columns=["one", "two"])
df

	one	two
a	1.40	NaN
b	7.10	-4.5
c	NaN	NaN
d	0.75	-1.3

Some Examples:

Sum over rows:

df.sum()

one    9.25
two   -5.80
dtype: float64

Sum over columns:

# Sum Rows
df.sum(axis="columns")

a    1.40
b    2.60
c    0.00
d   -0.55
dtype: float64

Extremely useful is describe:

df.describe()

	one	two
count	3.000000	2.000000
mean	3.083333	-2.900000
std	3.493685	2.262742
min	0.750000	-4.500000
25%	1.075000	-3.700000
50%	1.400000	-2.900000
75%	4.250000	-2.100000
max	7.100000	-1.300000

Book chapter contains many more examples and a full list of summary statistics and related methods.

Summary

Primary Panda’s data structures:
- Series
- DataFrame
Many ways to access and transform these objects. Key ones are:
- [] : access an element(s) of a Series or columns(s) of a DataFrame
- loc[r ,c] : access a row / column / cell by the index.
- iloc[i, j] : access ar row / column / cell by the integer position.
Online reference.

Suggestion

Work though the chapter’s code and try stuff!

References

Next Chapter

Loading and writing data sets!