import numpy as np
import pandas as pdNotes
Introduction
This is a long chapter, these notes are intended as a tour of main ideas!

Pandas is a major tool in Python data analysis
Works with Numpy, adding support for tabular / heterogenous data
Import conventions:
Panda’s primary data structures
Series: One dimensional object containing a sequence of values of the same type.
DataFrame: Tabular data, similar (and inspired by) R dataframe.
Other structures will be introduced as they arise, e.g. Index and Groupby objects.
Series
obj = pd.Series([4,7,-4,3], index = ["A","B","C","D"])
objA 4
B 7
C -4
D 3
dtype: int64
The index is optional, if not specified it will default to 0 through N-1
Selection
Select elements or sub-Series by labels, sets of labels, boolean arrays …
obj['A']4
obj[['A','C']]A 4
C -4
dtype: int64
obj[obj > 3]A 4
B 7
dtype: int64
Other things you can do
- Numpy functions and Numpy-like operations work as expected:
obj*3A 12
B 21
C -12
D 9
dtype: int64
np.exp(obj)A 54.598150
B 1096.633158
C 0.018316
D 20.085537
dtype: float64
- Series can be created from and converted to a dictionary
obj.to_dict(){'A': 4, 'B': 7, 'C': -4, 'D': 3}
- Series can be converted to numpy array:
obj.to_numpy()array([ 4, 7, -4, 3], dtype=int64)
DataFrame
Represents table of data
Has row index index and column index column
Common way to create is from a dictionary, but see Table 5.1 for more!
test = pd.DataFrame({"cars":['Chevy','Ford','Dodge','BMW'],'MPG':[14,15,16,12], 'Year':[1979, 1980, 2001, 2020]})
test| cars | MPG | Year | |
|---|---|---|---|
| 0 | Chevy | 14 | 1979 |
| 1 | Ford | 15 | 1980 |
| 2 | Dodge | 16 | 2001 |
| 3 | BMW | 12 | 2020 |
If you want a non-default index, it can be specified just like with Series.
head(n)/tail(n)- return the first / last n rows, 5 by default
Selecting
- Can retrieve columns or sets of columns by using
obj[...]:
test['cars']0 Chevy
1 Ford
2 Dodge
3 BMW
Name: cars, dtype: object
Note that we got a Series here.
test[['cars','MPG']]| cars | MPG | |
|---|---|---|
| 0 | Chevy | 14 |
| 1 | Ford | 15 |
| 2 | Dodge | 16 |
| 3 | BMW | 12 |
Dot notation can also be used (
test.cars) as long as the column names are valid identifiersRows can be retrieved with
iloc[...]andloc[...]:locretrieves by indexilocretrieves by position.
Modifying / Creating Columns
- Columns can be modified (and created) by assignment:
test['MPG^2'] = test['MPG']**2
test| cars | MPG | Year | MPG^2 | |
|---|---|---|---|---|
| 0 | Chevy | 14 | 1979 | 196 |
| 1 | Ford | 15 | 1980 | 225 |
| 2 | Dodge | 16 | 2001 | 256 |
| 3 | BMW | 12 | 2020 | 144 |
delkeyword can be used to drop columns, ordropmethod can be used to do so non-destructively
Index object
- Index objects are used for holding axis labels and other metadata
test.indexRangeIndex(start=0, stop=4, step=1)
- Can change the index, in this case replacing the default:
# Create index from one of the columns
test.index = test['cars']
# remove 'cars' column since i am using as an index now. s
test=test.drop('cars', axis = "columns") # or axis = 1
test| MPG | Year | MPG^2 | |
|---|---|---|---|
| cars | |||
| Chevy | 14 | 1979 | 196 |
| Ford | 15 | 1980 | 225 |
| Dodge | 16 | 2001 | 256 |
| BMW | 12 | 2020 | 144 |
Note the
axiskeyword argument above, many DataFrame methods use this.Above I changed a column into an index. Often you want to go the other way, this can be done with
reset_index:
test.reset_index() # Note this doesn't actually change test| cars | MPG | Year | MPG^2 | |
|---|---|---|---|---|
| 0 | Chevy | 14 | 1979 | 196 |
| 1 | Ford | 15 | 1980 | 225 |
| 2 | Dodge | 16 | 2001 | 256 |
| 3 | BMW | 12 | 2020 | 144 |
- Columns are an index as well:
test.columnsIndex(['MPG', 'Year', 'MPG^2'], dtype='object')
- Indexes act like immutable sets, see Table 5.2 in book for Index methods and properties
Essential Functionality
Reindexing and dropping
reindexcreats a new object with the values arranged according to the new index. Missing values are used if necessary, or you can use optional fill methods. You can useilocandlocto reindex as well.
s = pd.Series([1,2,3,4,5], index = list("abcde"))
s2 = s.reindex(list("abcfu")) # not a song by GAYLE
s2a 1.0
b 2.0
c 3.0
f NaN
u NaN
dtype: float64
- Missing values and can be tested for with
isnaornotnamethods
pd.isna(s2)a False
b False
c False
f True
u True
dtype: bool
drop, illustrated above can drop rows or columns. In addition to usingaxisyou can usecolumnsorindex. Again these make copies.
test.drop(columns = 'MPG')| Year | MPG^2 | |
|---|---|---|
| cars | ||
| Chevy | 1979 | 196 |
| Ford | 1980 | 225 |
| Dodge | 2001 | 256 |
| BMW | 2020 | 144 |
test.drop(index = ['Ford', 'BMW'])| MPG | Year | MPG^2 | |
|---|---|---|---|
| cars | |||
| Chevy | 14 | 1979 | 196 |
| Dodge | 16 | 2001 | 256 |
Indexing, Selection and Filtering
Series
- For Series, indexing is similar to Numpy, except you can use the index as well as integers.
obj = pd.Series(np.arange(4.), index=["a", "b", "c", "d"])
obj[0:3]a 0.0
b 1.0
c 2.0
dtype: float64
obj['a':'c']a 0.0
b 1.0
c 2.0
dtype: float64
obj[obj<2]a 0.0
b 1.0
dtype: float64
obj[['a','d']]a 0.0
d 3.0
dtype: float64
- However, preferred way is to use
locfor selection by index andilocfor selection by position. This is to avoid the issue where theindexis itself integers.
obj.loc[['a','d']]a 0.0
d 3.0
dtype: float64
obj.iloc[1]1.0
Note if a range or a set of indexes is used, a Series is returned. If a single item is requested, you get just that item.
DataFrame
Selecting with
df[...]for a DataFrame retrieves one or more columns as we have seen, if you select a single column you get a SeriesThere are some special cases, indexing with a boolean selects rows, as does selecting with a slice:
test[0:1]| MPG | Year | MPG^2 | |
|---|---|---|---|
| cars | |||
| Chevy | 14 | 1979 | 196 |
test[test['MPG'] < 15]| MPG | Year | MPG^2 | |
|---|---|---|---|
| cars | |||
| Chevy | 14 | 1979 | 196 |
| BMW | 12 | 2020 | 144 |
ilocandloccan be used to select rows as illustrated before, but can also be used to select columns or subsets of rows/columns
test.loc[:,['Year','MPG']]| Year | MPG | |
|---|---|---|
| cars | ||
| Chevy | 1979 | 14 |
| Ford | 1980 | 15 |
| Dodge | 2001 | 16 |
| BMW | 2020 | 12 |
test.loc['Ford','MPG']15
- These work with slices and booleans as well! The following says “give me all the rows with MPG more then 15, and the columns starting from Year”
test.loc[test['MPG'] > 15, 'Year':]| Year | MPG^2 | |
|---|---|---|
| cars | ||
| Dodge | 2001 | 256 |
Indexing options are fully illustrated in the book and Table 5.4
Be careful with chained indexing:
test[test['MPG']> 15].loc[:,'MPG'] = 18C:\Users\jryan\AppData\Local\Temp\ipykernel_13388\2484144822.py:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
test[test['MPG']> 15].loc[:,'MPG'] = 18
Here we are assigning to a ‘slice’, which is probably not what is intended. You will get a warning and a recommendation to fix it by using one loc:
test.loc[test['MPG']> 15 ,'MPG'] = 18
test| MPG | Year | MPG^2 | |
|---|---|---|---|
| cars | |||
| Chevy | 14 | 1979 | 196 |
| Ford | 15 | 1980 | 225 |
| Dodge | 18 | 2001 | 256 |
| BMW | 12 | 2020 | 144 |
Avoid chained indexing when doing assignments
Arithmetic and Data Alignment
- Pandas can make it simpler to work with objects that have different indexes, usually ‘doing the right thing’
s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=["a", "c", "d", "e"])
s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1], index=["a", "c", "e", "f", "g"])
s1+s2a 5.2
c 1.1
d NaN
e 0.0
f NaN
g NaN
dtype: float64
- Fills can be specified by using methods:
s1.add(s2, fill_value = 0)a 5.2
c 1.1
d 3.4
e 0.0
f 4.0
g 3.1
dtype: float64
See Table 5.5 for list of these methods.
You can also do arithmetic between DataFrames and Series in a way that is similar to Numpy.
Function Application and Mapping
- Numpy ufuncs also work with Pandas objects.
frame = pd.DataFrame(np.random.standard_normal((4, 3)),
columns=list("bde"),
index=["Utah", "Ohio", "Texas", "Oregon"])
frame| b | d | e | |
|---|---|---|---|
| Utah | 0.798548 | -1.455476 | 0.507618 |
| Ohio | -0.405775 | -0.132380 | -0.563721 |
| Texas | -0.404526 | 0.703566 | 1.661291 |
| Oregon | -0.359775 | -0.450894 | -0.712254 |
np.abs(frame)| b | d | e | |
|---|---|---|---|
| Utah | 0.798548 | 1.455476 | 0.507618 |
| Ohio | 0.405775 | 0.132380 | 0.563721 |
| Texas | 0.404526 | 0.703566 | 1.661291 |
| Oregon | 0.359775 | 0.450894 | 0.712254 |
applycan be used to apply a function on 1D arrays to each column or row:
frame.apply(np.max, axis = 'rows') #'axis' is optional here, default is rowsb 0.798548
d 0.703566
e 1.661291
dtype: float64
Applying accross columns is common, especially to combine different columns in some way:
frame['max'] = frame.apply(np.max, axis = 'columns')
frame| b | d | e | max | |
|---|---|---|---|---|
| Utah | 0.798548 | -1.455476 | 0.507618 | 0.798548 |
| Ohio | -0.405775 | -0.132380 | -0.563721 | -0.132380 |
| Texas | -0.404526 | 0.703566 | 1.661291 | 1.661291 |
| Oregon | -0.359775 | -0.450894 | -0.712254 | -0.359775 |
- Many more examples of this in the book.
Sorting and Ranking
sort_indexwill sort with the index (on either axis for DataFrame)sort_valuesis used to sort by values or a particular column
test.sort_values('MPG')| MPG | Year | MPG^2 | |
|---|---|---|---|
| cars | |||
| BMW | 12 | 2020 | 144 |
| Chevy | 14 | 1979 | 196 |
| Ford | 15 | 1980 | 225 |
| Dodge | 18 | 2001 | 256 |
rankwill assign ranks from on through the number of data points.
Summarizing and Computing Descriptive Statistics
df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
[np.nan, np.nan], [0.75, -1.3]],
index=["a", "b", "c", "d"],
columns=["one", "two"])
df| one | two | |
|---|---|---|
| a | 1.40 | NaN |
| b | 7.10 | -4.5 |
| c | NaN | NaN |
| d | 0.75 | -1.3 |
Some Examples:
Sum over rows:
df.sum()one 9.25
two -5.80
dtype: float64
Sum over columns:
# Sum Rows
df.sum(axis="columns")a 1.40
b 2.60
c 0.00
d -0.55
dtype: float64
Extremely useful is describe:
df.describe()| one | two | |
|---|---|---|
| count | 3.000000 | 2.000000 |
| mean | 3.083333 | -2.900000 |
| std | 3.493685 | 2.262742 |
| min | 0.750000 | -4.500000 |
| 25% | 1.075000 | -3.700000 |
| 50% | 1.400000 | -2.900000 |
| 75% | 4.250000 | -2.100000 |
| max | 7.100000 | -1.300000 |
Book chapter contains many more examples and a full list of summary statistics and related methods.
Summary
Primary Panda’s data structures:
Series
DataFrame
Many ways to access and transform these objects. Key ones are:
[]: access an element(s) of aSeriesor columns(s) of aDataFrameloc[r ,c]: access a row / column / cell by theindex.iloc[i, j]: access ar row / column / cell by the integer position.
Work though the chapter’s code and try stuff!
References
Next Chapter
- Loading and writing data sets!