```
import numpy as np
import pandas as pd
```

# Notes

## Introduction

This is a long chapter, these notes are intended as a tour of main ideas!

Pandas is a major tool in Python data analysis

Works with Numpy, adding support for tabular / heterogenous data

## Import conventions:

## Panda’s primary data structures

Series: One dimensional object containing a sequence of values of the same type.

DataFrame: Tabular data, similar (and inspired by) R dataframe.

Other structures will be introduced as they arise, e.g. Index and Groupby objects.

### Series

```
= pd.Series([4,7,-4,3], index = ["A","B","C","D"])
obj obj
```

```
A 4
B 7
C -4
D 3
dtype: int64
```

The `index`

is optional, if not specified it will default to 0 through N-1

#### Selection

Select elements or sub-Series by labels, sets of labels, boolean arrays …

`'A'] obj[`

`4`

`'A','C']] obj[[`

```
A 4
C -4
dtype: int64
```

`> 3] obj[obj `

```
A 4
B 7
dtype: int64
```

#### Other things you can do

- Numpy functions and Numpy-like operations work as expected:

`*3 obj`

```
A 12
B 21
C -12
D 9
dtype: int64
```

` np.exp(obj)`

```
A 54.598150
B 1096.633158
C 0.018316
D 20.085537
dtype: float64
```

- Series can be created from and converted to a dictionary

` obj.to_dict()`

`{'A': 4, 'B': 7, 'C': -4, 'D': 3}`

- Series can be converted to numpy array:

` obj.to_numpy()`

`array([ 4, 7, -4, 3], dtype=int64)`

### DataFrame

Represents table of data

Has row index

*index*and column index*column*Common way to create is from a dictionary, but see

*Table 5.1*for more!

```
= pd.DataFrame({"cars":['Chevy','Ford','Dodge','BMW'],'MPG':[14,15,16,12], 'Year':[1979, 1980, 2001, 2020]})
test test
```

cars | MPG | Year | |
---|---|---|---|

0 | Chevy | 14 | 1979 |

1 | Ford | 15 | 1980 |

2 | Dodge | 16 | 2001 |

3 | BMW | 12 | 2020 |

If you want a non-default index, it can be specified just like with Series.

`head(n)`

/`tail(n)`

- return the first / last n rows, 5 by default

#### Selecting

- Can retrieve columns or sets of columns by using
`obj[...]`

:

`'cars'] test[`

```
0 Chevy
1 Ford
2 Dodge
3 BMW
Name: cars, dtype: object
```

Note that we got a `Series`

here.

`'cars','MPG']] test[[`

cars | MPG | |
---|---|---|

0 | Chevy | 14 |

1 | Ford | 15 |

2 | Dodge | 16 |

3 | BMW | 12 |

Dot notation can also be used (

`test.cars`

) as long as the column names are valid identifiers*Rows*can be retrieved with`iloc[...]`

and`loc[...]`

:`loc`

retrieves by index`iloc`

retrieves by position.

#### Modifying / Creating Columns

- Columns can be modified (and created) by assignment:

```
'MPG^2'] = test['MPG']**2
test[ test
```

cars | MPG | Year | MPG^2 | |
---|---|---|---|---|

0 | Chevy | 14 | 1979 | 196 |

1 | Ford | 15 | 1980 | 225 |

2 | Dodge | 16 | 2001 | 256 |

3 | BMW | 12 | 2020 | 144 |

`del`

keyword can be used to drop columns, or`drop`

method can be used to do so non-destructively

### Index object

- Index objects are used for holding axis labels and other metadata

` test.index`

`RangeIndex(start=0, stop=4, step=1)`

- Can change the index, in this case replacing the default:

```
# Create index from one of the columns
= test['cars']
test.index
# remove 'cars' column since i am using as an index now. s
=test.drop('cars', axis = "columns") # or axis = 1
test test
```

MPG | Year | MPG^2 | |
---|---|---|---|

cars | |||

Chevy | 14 | 1979 | 196 |

Ford | 15 | 1980 | 225 |

Dodge | 16 | 2001 | 256 |

BMW | 12 | 2020 | 144 |

Note the

`axis`

keyword argument above, many DataFrame methods use this.Above I changed a column into an index. Often you want to go the other way, this can be done with

`reset_index`

:

`# Note this doesn't actually change test test.reset_index() `

cars | MPG | Year | MPG^2 | |
---|---|---|---|---|

0 | Chevy | 14 | 1979 | 196 |

1 | Ford | 15 | 1980 | 225 |

2 | Dodge | 16 | 2001 | 256 |

3 | BMW | 12 | 2020 | 144 |

- Columns are an index as well:

` test.columns`

`Index(['MPG', 'Year', 'MPG^2'], dtype='object')`

- Indexes act like immutable sets, see
*Table 5.2*in book for Index methods and properties

## Essential Functionality

### Reindexing and dropping

`reindex`

creats a*new*object with the values arranged according to the new index. Missing values are used if necessary, or you can use optional fill methods. You can use`iloc`

and`loc`

to reindex as well.

```
= pd.Series([1,2,3,4,5], index = list("abcde"))
s = s.reindex(list("abcfu")) # not a song by GAYLE
s2 s2
```

```
a 1.0
b 2.0
c 3.0
f NaN
u NaN
dtype: float64
```

- Missing values and can be tested for with
`isna`

or`notna`

methods

` pd.isna(s2)`

```
a False
b False
c False
f True
u True
dtype: bool
```

`drop`

, illustrated above can drop rows or columns. In addition to using`axis`

you can use`columns`

or`index`

. Again these make copies.

`= 'MPG') test.drop(columns `

Year | MPG^2 | |
---|---|---|

cars | ||

Chevy | 1979 | 196 |

Ford | 1980 | 225 |

Dodge | 2001 | 256 |

BMW | 2020 | 144 |

`= ['Ford', 'BMW']) test.drop(index `

MPG | Year | MPG^2 | |
---|---|---|---|

cars | |||

Chevy | 14 | 1979 | 196 |

Dodge | 16 | 2001 | 256 |

### Indexing, Selection and Filtering

#### Series

- For Series, indexing is similar to Numpy, except you can use the index as well as integers.

```
= pd.Series(np.arange(4.), index=["a", "b", "c", "d"])
obj 0:3] obj[
```

```
a 0.0
b 1.0
c 2.0
dtype: float64
```

`'a':'c'] obj[`

```
a 0.0
b 1.0
c 2.0
dtype: float64
```

`<2] obj[obj`

```
a 0.0
b 1.0
dtype: float64
```

`'a','d']] obj[[`

```
a 0.0
d 3.0
dtype: float64
```

*However*, preferred way is to use`loc`

for selection by*index*and`iloc`

for selection by position. This is to avoid the issue where the`index`

is itself integers.

`'a','d']] obj.loc[[`

```
a 0.0
d 3.0
dtype: float64
```

`1] obj.iloc[`

`1.0`

Note if a range or a set of indexes is used, a Series is returned. If a single item is requested, you get just that item.

#### DataFrame

Selecting with

`df[...]`

for a DataFrame retrieves one or more columns as we have seen, if you select a single column you get a SeriesThere are some special cases, indexing with a boolean selects

*rows*, as does selecting with a slice:

`0:1] test[`

MPG | Year | MPG^2 | |
---|---|---|---|

cars | |||

Chevy | 14 | 1979 | 196 |

`'MPG'] < 15] test[test[`

MPG | Year | MPG^2 | |
---|---|---|---|

cars | |||

Chevy | 14 | 1979 | 196 |

BMW | 12 | 2020 | 144 |

`iloc`

and`loc`

can be used to select rows as illustrated before, but can also be used to select columns or subsets of rows/columns

`'Year','MPG']] test.loc[:,[`

Year | MPG | |
---|---|---|

cars | ||

Chevy | 1979 | 14 |

Ford | 1980 | 15 |

Dodge | 2001 | 16 |

BMW | 2020 | 12 |

`'Ford','MPG'] test.loc[`

`15`

- These work with slices and booleans as well! The following says “give me all the rows with MPG more then 15, and the columns starting from Year”

`'MPG'] > 15, 'Year':] test.loc[test[`

Year | MPG^2 | |
---|---|---|

cars | ||

Dodge | 2001 | 256 |

Indexing options are fully illustrated in the book and

*Table 5.4*Be careful with

*chained indexing*:

`'MPG']> 15].loc[:,'MPG'] = 18 test[test[`

```
C:\Users\jryan\AppData\Local\Temp\ipykernel_13388\2484144822.py:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
test[test['MPG']> 15].loc[:,'MPG'] = 18
```

Here we are assigning to a ‘slice’, which is probably not what is intended. You will get a warning and a recommendation to fix it by using one `loc`

:

```
'MPG']> 15 ,'MPG'] = 18
test.loc[test[ test
```

MPG | Year | MPG^2 | |
---|---|---|---|

cars | |||

Chevy | 14 | 1979 | 196 |

Ford | 15 | 1980 | 225 |

Dodge | 18 | 2001 | 256 |

BMW | 12 | 2020 | 144 |

Avoid chained indexing when doing assignments

### Arithmetic and Data Alignment

- Pandas can make it simpler to work with objects that have different indexes, usually ‘doing the right thing’

```
= pd.Series([7.3, -2.5, 3.4, 1.5], index=["a", "c", "d", "e"])
s1 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1], index=["a", "c", "e", "f", "g"])
s2 +s2 s1
```

```
a 5.2
c 1.1
d NaN
e 0.0
f NaN
g NaN
dtype: float64
```

- Fills can be specified by using methods:

`= 0) s1.add(s2, fill_value `

```
a 5.2
c 1.1
d 3.4
e 0.0
f 4.0
g 3.1
dtype: float64
```

See

*Table 5.5*for list of these methods.You can also do arithmetic between

*DataFrame*s and*Series*in a way that is similar to Numpy.

### Function Application and Mapping

- Numpy
*ufuncs*also work with Pandas objects.

```
= pd.DataFrame(np.random.standard_normal((4, 3)),
frame =list("bde"),
columns=["Utah", "Ohio", "Texas", "Oregon"])
index frame
```

b | d | e | |
---|---|---|---|

Utah | 0.798548 | -1.455476 | 0.507618 |

Ohio | -0.405775 | -0.132380 | -0.563721 |

Texas | -0.404526 | 0.703566 | 1.661291 |

Oregon | -0.359775 | -0.450894 | -0.712254 |

`abs(frame) np.`

b | d | e | |
---|---|---|---|

Utah | 0.798548 | 1.455476 | 0.507618 |

Ohio | 0.405775 | 0.132380 | 0.563721 |

Texas | 0.404526 | 0.703566 | 1.661291 |

Oregon | 0.359775 | 0.450894 | 0.712254 |

`apply`

can be used to apply a function on 1D arrays to each column or row:

`apply(np.max, axis = 'rows') #'axis' is optional here, default is rows frame.`

```
b 0.798548
d 0.703566
e 1.661291
dtype: float64
```

Applying accross columns is common, especially to combine different columns in some way:

```
'max'] = frame.apply(np.max, axis = 'columns')
frame[ frame
```

b | d | e | max | |
---|---|---|---|---|

Utah | 0.798548 | -1.455476 | 0.507618 | 0.798548 |

Ohio | -0.405775 | -0.132380 | -0.563721 | -0.132380 |

Texas | -0.404526 | 0.703566 | 1.661291 | 1.661291 |

Oregon | -0.359775 | -0.450894 | -0.712254 | -0.359775 |

- Many more examples of this in the book.

### Sorting and Ranking

`sort_index`

will sort with the index (on either axis for*DataFrame*)`sort_values`

is used to sort by values or a particular column

`'MPG') test.sort_values(`

MPG | Year | MPG^2 | |
---|---|---|---|

cars | |||

BMW | 12 | 2020 | 144 |

Chevy | 14 | 1979 | 196 |

Ford | 15 | 1980 | 225 |

Dodge | 18 | 2001 | 256 |

`rank`

will assign ranks from on through the number of data points.

## Summarizing and Computing Descriptive Statistics

```
= pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
df 0.75, -1.3]],
[np.nan, np.nan], [=["a", "b", "c", "d"],
index=["one", "two"])
columns df
```

one | two | |
---|---|---|

a | 1.40 | NaN |

b | 7.10 | -4.5 |

c | NaN | NaN |

d | 0.75 | -1.3 |

Some Examples:

Sum over rows:

`sum() df.`

```
one 9.25
two -5.80
dtype: float64
```

Sum over columns:

```
# Sum Rows
sum(axis="columns") df.
```

```
a 1.40
b 2.60
c 0.00
d -0.55
dtype: float64
```

Extremely useful is `describe`

:

` df.describe()`

one | two | |
---|---|---|

count | 3.000000 | 2.000000 |

mean | 3.083333 | -2.900000 |

std | 3.493685 | 2.262742 |

min | 0.750000 | -4.500000 |

25% | 1.075000 | -3.700000 |

50% | 1.400000 | -2.900000 |

75% | 4.250000 | -2.100000 |

max | 7.100000 | -1.300000 |

**Book chapter contains many more examples and a full list of summary statistics and related methods.**

## Summary

Primary Panda’s data structures:

Series

DataFrame

Many ways to access and transform these objects. Key ones are:

`[]`

: access an element(s) of a`Series`

or columns(s) of a`DataFrame`

`loc[r ,c]`

: access a row / column / cell by the`index`

.`iloc[i, j]`

: access ar row / column / cell by the integer position.

Work though the chapter’s code and try stuff!

## References

## Next Chapter

- Loading and writing data sets!