7. Data Cleaning and Preparation

Learning Objectives

  • Know which tools to use for missing data
  • Know how to filter out missing data
  • Understand methods to fill in missing values
  • Know when and how to transform data
  • Know how to use certain numpy functions to handle outliers, permute, and take random samples
  • Know how to manipulate strings
  • Understand some useful methods for regular expressions
  • Learn about some helpful methods in pandas to explore strings
  • Understand how to handle categorical data more efficiently

import pandas as pd
import numpy as np

food = pd.read_csv("https://openmv.net/file/food-consumption.csv")

print(food.head(5))
   Country  Real coffee  Instant coffee  Tea  Sweetener  Biscuits  \
0  Germany           90              49   88       19.0      57.0   
1    Italy           82              10   60        2.0      55.0   
2   France           88              42   63        4.0      76.0   
3  Holland           96              62   98       32.0      62.0   
4  Belgium           94              38   48       11.0      74.0   

   Powder soup  Tin soup  Potatoes  Frozen fish  ...  Apples  Oranges  \
0           51        19        21           27  ...      81       75   
1           41         3         2            4  ...      67       71   
2           53        11        23           11  ...      87       84   
3           67        43         7           14  ...      83       89   
4           37        23         9           13  ...      76       76   

   Tinned fruit  Jam  Garlic  Butter  Margarine  Olive oil  Yoghurt  \
0            44   71      22      91         85         74     30.0   
1             9   46      80      66         24         94      5.0   
2            40   45      88      94         47         36     57.0   
3            61   81      15      31         97         13     53.0   
4            42   57      29      84         80         83     20.0   

   Crisp bread  
0           26  
1           18  
2            3  
3           15  
4            5  

[5 rows x 21 columns]

Dataset: the relative consumption of certain food items in European and Scandinavian countries. The numbers represent the percentage of the population consuming that food type.

7.1 Handling Missing Data

Some things to note:

  • ALL DESCRIPTIVE STATISTICS ON pandas OBJECTS EXCLUDE MISSING DATA - BY DEFAULT

  • NaN is used for missing values of type: float64

  • Values like NaN are called sentinel values

    • a value that is not part of the input but indicates a special meaning; a signal value

    • NaN for missing floating-point values (integer columns are upcast to float64 when they contain missing data), -1 returned by a function that otherwise computes only non-negative integers, etc.

print(food.Yoghurt.isna())
0     False
1     False
2     False
3     False
4     False
5     False
6     False
7     False
8     False
9     False
10    False
11    False
12    False
13     True
14    False
15    False
Name: Yoghurt, dtype: bool

We do have an NaN in our midst!

# descriptive stats
print(np.mean(food['Yoghurt']), "\n versus", np.average(food['Yoghurt']))
20.533333333333335 
 versus nan

Different results! Why?? According to the numpy documentation, np.mean computes the arithmetic mean along the specified axis (over the flattened array by default), while np.average computes a weighted average along the specified axis. The practical difference here shows up in the source snippets below: when np.mean is handed something that is not an ndarray, it defers to the object's own .mean() method, so a pandas Series uses Series.mean(), which skips NaN by default. np.average converts the input to an array and divides a sum by a count, so the NaN propagates.

sum(food.Yoghurt) -> nan

from average source:

        avg = avg_as_array = np.multiply(a, wgt,
                          dtype=result_dtype).sum(axis, **keepdims_kw) / scl

from mean source:

if type(a) is not mu.ndarray:
        try:
            mean = a.mean
        except AttributeError:
            pass
        else:
            return mean(axis=axis, dtype=dtype, out=out, **kwargs)

    return _methods._mean(a, axis=axis, dtype=dtype,
                          out=out, **kwargs)
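
If all you want is control over how the NaN is handled, the pandas methods expose that directly (a minimal sketch on the same column):

food['Yoghurt'].mean()                # 20.53..., pandas skips the NaN by default
food['Yoghurt'].mean(skipna=False)    # nan, once you ask pandas to propagate it
np.average(food['Yoghurt'].dropna())  # 20.53..., dropping the NaN first also works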

FYI: the statistics module includes mean()

Something weird to consider….

print(np.nan == np.nan)

# apparently, according to the floating-point standard, NaN is not equal to itself!
False

I digress…

Filtering Missing Data

# method dropna
print("`dropna`: option to include `how = all` to only remove rows where every value is NaN \n",food.Yoghurt.dropna().tail(), "\n",
"`fillna`: pass fillna a dictionary (fillna({1: 0.5, 2: 0})) to specify a different value for each column\n", food.Yoghurt.fillna(0).tail(), "\n",
"`isna`\n", food.Yoghurt.isna().tail(), "\n",
"`notna`\n", food.Yoghurt.notna().tail())
`dropna`: option to include `how = all` to only remove rows where every value is NaN 
 10     2.0
11    11.0
12     2.0
14    16.0
15     3.0
Name: Yoghurt, dtype: float64 
 `fillna`: pass fillna a dictionary (fillna({1: 0.5, 2: 0})) to specify a different value for each column
 11    11.0
12     2.0
13     0.0
14    16.0
15     3.0
Name: Yoghurt, dtype: float64 
 `isna`
 11    False
12    False
13     True
14    False
15    False
Name: Yoghurt, dtype: bool 
 `notna`
 11     True
12     True
13    False
14     True
15     True
Name: Yoghurt, dtype: bool

7.2 Data Transformation

Removing Duplicates

Check to see if duplicates exist:

food.duplicated()
0     False
1     False
2     False
3     False
4     False
5     False
6     False
7     False
8     False
9     False
10    False
11    False
12    False
13    False
14    False
15    False
dtype: bool

If you do have duplicates, you can remove them with drop_duplicates().

*NOTE: by default, drop_duplicates keeps the first observed row and drops any later duplicates*

dup_food = food[['Yoghurt','Yoghurt']]
dup_food.columns = ['a','b']
# index 11, 12 are dropped: their values (11.0 and 2.0) already appeared at index 6 and 10, so the later occurrences go
dup_food.drop_duplicates()
a b
0 30.0 30.0
1 5.0 5.0
2 57.0 57.0
3 53.0 53.0
4 20.0 20.0
5 31.0 31.0
6 11.0 11.0
7 6.0 6.0
8 13.0 13.0
9 48.0 48.0
10 2.0 2.0
13 NaN NaN
14 16.0 16.0
15 3.0 3.0
# index 6, 10 are dropped: keep='last' keeps the later occurrence of each duplicated value (index 11, 12) instead
dup_food.drop_duplicates(keep = 'last')
a b
0 30.0 30.0
1 5.0 5.0
2 57.0 57.0
3 53.0 53.0
4 20.0 20.0
5 31.0 31.0
7 6.0 6.0
8 13.0 13.0
9 48.0 48.0
11 11.0 11.0
12 2.0 2.0
13 NaN NaN
14 16.0 16.0
15 3.0 3.0
# again 11, 12 are dropped: restricting the check to column 'a' changes nothing here because 'a' and 'b' are identical copies
dup_food.drop_duplicates(subset=['a'])
a b
0 30.0 30.0
1 5.0 5.0
2 57.0 57.0
3 53.0 53.0
4 20.0 20.0
5 31.0 31.0
6 11.0 11.0
7 6.0 6.0
8 13.0 13.0
9 48.0 48.0
10 2.0 2.0
13 NaN NaN
14 16.0 16.0
15 3.0 3.0

Transforming Data with a Function or Mapping

Since mapping a function over a series has already been covered, this section will only go over a few more helpful ways to map.

  • define your own function - similar to how we would do in apply functions or purrr:map()

    food_sub = food[:5][['Country','Yoghurt']]
    country_yogurt = {
      'Germany':'Quark',
      'Italy':'Yomo',
      'France':'Danone',
      'Holland':'Campina',
      'Belgium':'Activia'
    }
def get_yogurt(x):
   return country_yogurt[x]

food_sub['Brand'] = food_sub['Country'].map(get_yogurt)

food_sub['Country'].map(get_yogurt)
0      Quark
1       Yomo
2     Danone
3    Campina
4    Activia
Name: Country, dtype: object
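
As an aside, map also accepts the dictionary directly, so the helper function isn't strictly needed (a one-line sketch):

food_sub['Country'].map(country_yogurt)   # same result as map(get_yogurt)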

Replace Values

print("using `replace`: \n", food_sub.replace([30],50), '\n',
"using `replace` for more than one value: \n", food_sub.replace([30, 20],[50, 40]))
using `replace`: 
    Country  Yoghurt    Brand
0  Germany     50.0    Quark
1    Italy      5.0     Yomo
2   France     57.0   Danone
3  Holland     53.0  Campina
4  Belgium     20.0  Activia 
 using `replace` for more than one value: 
    Country  Yoghurt    Brand
0  Germany     50.0    Quark
1    Italy      5.0     Yomo
2   France     57.0   Danone
3  Holland     53.0  Campina
4  Belgium     40.0  Activia
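
replace will also take a dictionary, which reads a little more clearly when there are several substitutions (a sketch):

food_sub.replace({30: 50, 20: 40})   # equivalent to the two-list form above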

Renaming Axis Indices

As we’ve seen, standard indices are labelled as such:

>>> food_sub.index
RangeIndex(start=0, stop=5, step=1)

That can also be changed with the mapping of a function:

print(food_sub.index.map(lambda x: x + 10))
print('or')
print(food_sub.index.map({0:'G', 1:'I', 2:'F', 3:'H', 4:'B'}))
Int64Index([10, 11, 12, 13, 14], dtype='int64')
or
Index(['G', 'I', 'F', 'H', 'B'], dtype='object')
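
If you want a relabelled copy of the whole object rather than a new index, rename combines both axis mappings (a sketch; the uppercase column names are just for illustration):

food_sub.rename(index={0: 'G', 1: 'I', 2: 'F', 3: 'H', 4: 'B'},
                columns=str.upper)   # returns a renamed copy; food_sub is unchanged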

Discretization and Binning

It is common to convert continuous variables into discrete, grouped ones. Let's group the affinity for yogurt into some arbitrary bins:

scale = [0, 20, 30, 50, 70]
# reasonable, ok, interesting, why

pd.cut(food.Yoghurt, scale)
0     (20.0, 30.0]
1      (0.0, 20.0]
2     (50.0, 70.0]
3     (50.0, 70.0]
4      (0.0, 20.0]
5     (30.0, 50.0]
6      (0.0, 20.0]
7      (0.0, 20.0]
8      (0.0, 20.0]
9     (30.0, 50.0]
10     (0.0, 20.0]
11     (0.0, 20.0]
12     (0.0, 20.0]
13             NaN
14     (0.0, 20.0]
15     (0.0, 20.0]
Name: Yoghurt, dtype: category
Categories (4, interval[int64, right]): [(0, 20] < (20, 30] < (30, 50] < (50, 70]]
scaled = pd.cut(food.Yoghurt.values, scale)
scaled.categories

pd.value_counts(scaled)
(0, 20]     10
(30, 50]     2
(50, 70]     2
(20, 30]     1
dtype: int64

Apply the labels to the bins to have it make more sense:

scale_names = ['reasonable', 'ok', 'interesting', 'why']
pd.value_counts(pd.cut(food.Yoghurt.values, scale, labels = scale_names))
reasonable     10
interesting     2
why             2
ok              1
dtype: int64

Finally, let pandas do the work for you by supplying a number of bins and a precision. qcut cuts at sample quantiles, so each bin holds roughly the same number of observations, and precision limits the number of decimal places used for the bin edges.

pd.qcut(food.Yoghurt.values, 4, precision = 2)
[(13.0, 30.5], (1.99, 5.5], (30.5, 57.0], (30.5, 57.0], (13.0, 30.5], ..., (5.5, 13.0], (1.99, 5.5], NaN, (13.0, 30.5], (1.99, 5.5]]
Length: 16
Categories (4, interval[float64, right]): [(1.99, 5.5] < (5.5, 13.0] < (13.0, 30.5] < (30.5, 57.0]]
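
qcut will also accept explicit quantiles instead of a bin count (a sketch; the 10/40/40/10 split is arbitrary):

pd.qcut(food.Yoghurt.values, [0, 0.1, 0.5, 0.9, 1.0], precision=2)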

Detecting and Filtering Outliers

We often have to face the decision of how to handle outliers. We can choose to exclude them or to transform them.

# let's say any country whose percentage of yogurt consumption is over 50% is an outlier

yog = food.Yoghurt
yog[yog.abs() > 50]
2    57.0
3    53.0
Name: Yoghurt, dtype: float64

More interestingly, what if we wanted to know which countries had consumption of ANY food over 95%?

food2 = food.drop('Country', axis = 'columns')
food2[(food2.abs() > 95).any(axis = 'columns')]
Real coffee Instant coffee Tea Sweetener Biscuits Powder soup Tin soup Potatoes Frozen fish Frozen veggies Apples Oranges Tinned fruit Jam Garlic Butter Margarine Olive oil Yoghurt Crisp bread
3 96 62 98 32.0 62.0 67 43 7 14 14 83 89 61 81 15 31 97 13 53.0 15
5 97 61 86 28.0 79.0 73 12 7 26 23 85 94 83 20 91 94 94 84 31.0 24
6 27 86 99 22.0 91.0 55 76 17 20 24 76 68 89 91 11 95 94 57 11.0 28
10 97 13 93 31.0 NaN 43 43 39 54 45 56 78 53 75 9 68 32 48 2.0 93
11 96 17 92 35.0 66.0 32 17 11 51 42 81 72 50 64 11 92 91 30 11.0 34
13 98 12 84 20.0 64.0 27 10 8 18 12 50 57 22 37 15 96 94 17 NaN 64
15 30 52 99 11.0 80.0 75 18 2 5 3 57 52 46 89 5 97 25 31 3.0 9
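
To transform the outliers instead of excluding them, one common option is to cap the values (a hedged sketch, keeping 95 as the cutoff from above):

capped = food2.copy()
capped[capped > 95] = 95        # cap anything above 95 at exactly 95
# or, equivalently
capped = food2.clip(upper=95)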

Permutation and Random Sampling

  • Permuting = random reordering

    • np.random.permutation: pass it the length of the axis you want to permute and it returns a shuffled array of integer indices
  • Random sampling = each sample has an equal probability of being chosen

Let’s randomly reorder yogurt affinity:

print(np.random.permutation(5))

food.take(np.random.permutation(5))
[2 3 0 1 4]
Country Real coffee Instant coffee Tea Sweetener Biscuits Powder soup Tin soup Potatoes Frozen fish ... Apples Oranges Tinned fruit Jam Garlic Butter Margarine Olive oil Yoghurt Crisp bread
2 France 88 42 63 4.0 76.0 53 11 23 11 ... 87 84 40 45 88 94 47 36 57.0 3
1 Italy 82 10 60 2.0 55.0 41 3 2 4 ... 67 71 9 46 80 66 24 94 5.0 18
3 Holland 96 62 98 32.0 62.0 67 43 7 14 ... 83 89 61 81 15 31 97 13 53.0 15
0 Germany 90 49 88 19.0 57.0 51 19 21 27 ... 81 75 44 71 22 91 85 74 30.0 26
4 Belgium 94 38 48 11.0 74.0 37 23 9 13 ... 76 76 42 57 29 84 80 83 20.0 5

5 rows × 21 columns

This method can be helpful when using iloc indexing!

food.take(np.random.permutation(5), axis = 'columns')
Country Sweetener Instant coffee Real coffee Tea
0 Germany 19.0 49 90 88
1 Italy 2.0 10 82 60
2 France 4.0 42 88 63
3 Holland 32.0 62 96 98
4 Belgium 11.0 38 94 48
5 Luxembourg 28.0 61 97 86
6 England 22.0 86 27 99
7 Portugal 2.0 26 72 77
8 Austria 15.0 31 55 61
9 Switzerland 25.0 72 73 85
10 Sweden 31.0 13 97 93
11 Denmark 35.0 17 96 92
12 Norway 13.0 17 92 83
13 Finland 20.0 12 98 84
14 Spain NaN 40 70 40
15 Ireland 11.0 52 30 99

Let’s try taking a random subset without replacement:

# you can always add `replace=True` if you want replacement
food.sample(n=5)
Country Real coffee Instant coffee Tea Sweetener Biscuits Powder soup Tin soup Potatoes Frozen fish ... Apples Oranges Tinned fruit Jam Garlic Butter Margarine Olive oil Yoghurt Crisp bread
14 Spain 70 40 40 NaN 62.0 43 2 14 23 ... 59 77 30 38 86 44 51 91 16.0 13
1 Italy 82 10 60 2.0 55.0 41 3 2 4 ... 67 71 9 46 80 66 24 94 5.0 18
9 Switzerland 73 72 85 25.0 31.0 69 10 17 19 ... 79 70 46 61 64 82 48 61 48.0 30
4 Belgium 94 38 48 11.0 74.0 37 23 9 13 ... 76 76 42 57 29 84 80 83 20.0 5
13 Finland 98 12 84 20.0 64.0 27 10 8 18 ... 50 57 22 37 15 96 94 17 NaN 64

5 rows × 21 columns
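
With replace=True you can draw more samples than there are rows, which is handy for things like bootstrapping (a small sketch on a toy Series):

choices = pd.Series([5, 7, -1, 6, 4])
choices.sample(n=10, replace=True)   # 10 draws with replacement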

Computing Indicator/Dummy Vars

This kind of transformation is really helpful for machine learning. It converts a categorical variable into indicator (dummy) variables: one column of 0s and 1s per category.

pd.get_dummies(food['Country'])
Austria Belgium Denmark England Finland France Germany Holland Ireland Italy Luxembourg Norway Portugal Spain Sweden Switzerland
0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0
2 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0
4 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
5 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0
6 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
7 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0
8 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
9 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
10 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
11 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0
12 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0
13 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0
14 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
15 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0

This example is not the most helpful since every country in this set appears only once, but you get the idea.
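
For a column that does repeat categories, combining cut with get_dummies is a more typical use (a sketch reusing the yogurt bins and labels from earlier):

pd.get_dummies(pd.cut(food.Yoghurt, scale, labels=scale_names))
# one 0/1 indicator column per bin: reasonable, ok, interesting, why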

This topic will make more sense in Ch. 13, when full data analysis examples are worked out.

7.3 Extension Data Types

Extension types address some of the shortcomings of building pandas on top of numpy, such as:

  • expensive string computations

  • missing data conversions

  • lack of support for time related objects

s = pd.Series([1, 2, 3, None])
s.dtype
dtype('float64')
s = pd.Series([1, 2, 3, None], dtype=pd.Int64Dtype())
s
print(s.dtype)
Int64

Note that this extension type indicates missing with <NA>

print(s.isna())
0    False
1    False
2    False
3     True
dtype: bool

<NA> uses the pandas.NA sentinel value

s[3] is pd.NA
True

Types can be set with astype()

df = pd.DataFrame({"A": [1, 2, None, 4],
"B": ["one", "two", "three", None],
"C": [False, None, False, True]})

df["A"] = df["A"].astype("Int64")
df["B"] = df["B"].astype("string")
df["C"] = df["C"].astype("boolean")

df
A B C
0 1 one False
1 2 two <NA>
2 <NA> three False
3 4 <NA> True

Find a table of extension types here
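
Rather than calling astype column by column, convert_dtypes will infer an appropriate extension type for every column at once (a brief sketch):

pd.DataFrame({"A": [1, 2, None, 4]}).convert_dtypes().dtypes   # A is inferred as Int64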

7.4 String Manipulation

Functions that are built in:

  • split() : break a string into pieces

  • join() : use a string as the delimiter for concatenating a sequence of other strings

  • strip() : trim whitespace

  • the `in` keyword : good for checking whether a substring is present (use index() or find() to locate it)

  • count() : returns the number of occurrences of a substring

  • replace() : substitute occurrences of one pattern for another

See more functions here

lb = " layla is smart, witty, charming, and... "
lb.split(" ")
lb.strip()
'-'.join(lb)
'smart' in lb
lb.count(',')
lb.replace('...', ' bad at python.')
' layla is smart, witty, charming, and bad at python. '

Regular Expressions

RegEx is not easy. It takes some getting used to. It is really useful for programmatically applying any of the string functions to a particular pattern.

I often refer to this handy [cheat sheet](https://raw.githubusercontent.com/rstudio/cheatsheets/main/strings.pdf)

To use regular expression in python, you must import the module re:

import re

text = "layla has lived in philadelphia county, miami-dade county, and rockdale county"

# split on whitespace
re.split(r"\s+", text)
['layla',
 'has',
 'lived',
 'in',
 'philadelphia',
 'county,',
 'miami-dade',
 'county,',
 'and',
 'rockdale',
 'county']

To avoid repeating a common expression, you can compile it and store it as its own object.

regex = re.compile(r"\s+")

Don’t forget: characters that have a special meaning in regex (like . \ + * ? ( ) [ ]) must be escaped with a backslash when you want to match them literally.

What if I wanted to get the counties?

regex = re.compile(r"\w+(?=\s+county)")

regex.findall(text)
['philadelphia', 'dade', 'rockdale']
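
findall is only one of the methods a compiled regex exposes; search and sub cover the other common cases (a brief sketch with the same pattern):

regex.search(text)             # match object for the first county name, or None
regex.sub('somewhere', text)   # replace every matched county name with 'somewhere'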

String Functions

data = {"Dave": "dave@google.com", "Steve": "steve@gmail.com",
"Rob": "rob@gmail.com", "Wes": np.nan}
# convert to series
data = pd.Series(data)
data
Dave     dave@google.com
Steve    steve@gmail.com
Rob        rob@gmail.com
Wes                  NaN
dtype: object

To get at certain information, we can apply string functions element-wise through the Series str attribute:

# does the string contain something
data.str.contains("gmail")
# change to the extension string type
data_as_string_ext = data.astype('string')
data_as_string_ext
Dave     dave@google.com
Steve    steve@gmail.com
Rob        rob@gmail.com
Wes                 <NA>
dtype: string
# vectorized element retrieval
pattern = r"([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})"
data.str.findall(pattern, flags=re.IGNORECASE).str[0]
Dave     (dave, google, com)
Steve    (steve, gmail, com)
Rob        (rob, gmail, com)
Wes                      NaN
dtype: object
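
str.extract applies a regex with capture groups and returns one column per group, which can be easier to work with than findall (a sketch using the same pattern):

data.str.extract(pattern, flags=re.IGNORECASE)   # columns 0, 1, 2: user, domain, suffix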

7.5 Categorical Data

values = pd.Series(['apple', 'orange', 'apple',
                   'apple'] * 2)
                   
pd.unique(values)
pd.value_counts(values)
apple     6
orange    2
dtype: int64

You can improve performance and memory use by representing the categories as integer codes plus a separate array of the unique values:

values = pd.Series([0, 1, 0, 0] * 2)
dim = pd.Series(['apple', 'orange'])

dim
0     apple
1    orange
dtype: object

Retrieve the original set of strings with take

dim.take(values)
0     apple
1    orange
0     apple
0     apple
0     apple
1    orange
0     apple
0     apple
dtype: object
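
pandas' own Categorical extension type stores exactly this codes-plus-categories representation for you (a small sketch):

fruits = pd.Series(['apple', 'orange', 'apple', 'apple'] * 2).astype('category')
fruits.cat.codes        # the integer codes: 0, 1, 0, 0, 0, 1, 0, 0
fruits.cat.categories   # Index(['apple', 'orange'], dtype='object')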

Computations with Categoricals

rng = np.random.default_rng(seed=12345)
draws = rng.standard_normal(1000)
bins = pd.qcut(draws, 4)
bins
[(-3.121, -0.675], (0.687, 3.211], (-3.121, -0.675], (-0.675, 0.0134], (-0.675, 0.0134], ..., (0.0134, 0.687], (0.0134, 0.687], (-0.675, 0.0134], (0.0134, 0.687], (-0.675, 0.0134]]
Length: 1000
Categories (4, interval[float64, right]): [(-3.121, -0.675] < (-0.675, 0.0134] < (0.0134, 0.687] < (0.687, 3.211]]
bins = pd.qcut(draws, 4, labels=['Q1', 'Q2', 'Q3', 'Q4'])
bins
# then use groupby
bins = pd.Series(bins, name='quartile')
results = (pd.Series(draws)
               .groupby(bins)
               .agg(['count', 'min', 'max'])
               .reset_index())

Working with the categorical codes rather than the raw strings generally leads to better performance and lower memory use.
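
A rough way to see the benefit is to compare the memory footprint of plain object strings against the categorical version (a sketch; exact sizes will vary):

labels = pd.Series(['foo', 'bar', 'baz', 'qux'] * 250_000)
labels.memory_usage(deep=True)                      # tens of MB as plain Python strings
labels.astype('category').memory_usage(deep=True)   # a small fraction of that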