7. Data Cleaning and Preparation

Learning Objectives

  • Know which tools to use for missing data
  • Know how to filter out missing data
  • Understand methods to fill in missing values
  • Know when and how to transform data
  • Know how to use certain numpy functions to handle outliers, permute, and take random samples
  • Know how to manipulate strings
  • Understand some useful methods for regular expressions
  • Learn about some helpful methods in pandas to explore strings
  • Understand how to handle categorical data more efficiently

import pandas as pd
import numpy as np

food = pd.read_csv("https://openmv.net/file/food-consumption.csv")

print(food.head(5))
   Country  Real coffee  Instant coffee  Tea  Sweetener  Biscuits  \
0  Germany           90              49   88       19.0      57.0   
1    Italy           82              10   60        2.0      55.0   
2   France           88              42   63        4.0      76.0   
3  Holland           96              62   98       32.0      62.0   
4  Belgium           94              38   48       11.0      74.0   

   Powder soup  Tin soup  Potatoes  Frozen fish  ...  Apples  Oranges  \
0           51        19        21           27  ...      81       75   
1           41         3         2            4  ...      67       71   
2           53        11        23           11  ...      87       84   
3           67        43         7           14  ...      83       89   
4           37        23         9           13  ...      76       76   

   Tinned fruit  Jam  Garlic  Butter  Margarine  Olive oil  Yoghurt  \
0            44   71      22      91         85         74     30.0   
1             9   46      80      66         24         94      5.0   
2            40   45      88      94         47         36     57.0   
3            61   81      15      31         97         13     53.0   
4            42   57      29      84         80         83     20.0   

   Crisp bread  
0           26  
1           18  
2            3  
3           15  
4            5  

[5 rows x 21 columns]

Dataset: the relative consumption of certain food items in European and Scandinavian countries. The numbers represent the percentage of the population consuming that food type.

7.1 Handling Missing Data

Some things to note:

  • ALL DESCRIPTIVE STATISTICS ON pandas OBJECTS EXCLUDE MISSING DATA - BY DEFAULT

  • NaN is used for missing values of type: float64

  • Values like NaN are called sentinel values

    • a value that is not part of the input but indicates a special meaning; a signal value

    • NaN for missing floating-point values (integer columns are upcast to float64 when they contain missing data), -1 returned by a function that otherwise computes only non-negative integers, etc.

print(food.Yoghurt.isna())
0     False
1     False
2     False
3     False
4     False
5     False
6     False
7     False
8     False
9     False
10    False
11    False
12    False
13     True
14    False
15    False
Name: Yoghurt, dtype: bool

We do have an NaN in our midst!

# descriptive stats
print(np.mean(food['Yoghurt']), "\n versus", np.average(food['Yoghurt']))
20.533333333333335 
 versus nan

Different results! Why?? According to the numpy documentation, np.mean computes the arithmetic mean along the specified axis (over the flattened array by default), while np.average computes a weighted average along the specified axis. The practical difference here shows up in the source snippets below: when np.mean is handed something that is not an ndarray, it defers to the object's own .mean() method, so a pandas Series uses Series.mean(), which skips NaN by default. np.average converts the input to an array and divides a sum by a count, so the NaN propagates.

sum(food.Yoghurt) -> nan

from average source:

        avg = avg_as_array = np.multiply(a, wgt,
                          dtype=result_dtype).sum(axis, **keepdims_kw) / scl

from mean source:

if type(a) is not mu.ndarray:
        try:
            mean = a.mean
        except AttributeError:
            pass
        else:
            return mean(axis=axis, dtype=dtype, out=out, **kwargs)

    return _methods._mean(a, axis=axis, dtype=dtype,
                          out=out, **kwargs)
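
If all you want is control over how the NaN is handled, the pandas methods expose that directly (a minimal sketch on the same column):

food['Yoghurt'].mean()                # 20.53..., pandas skips the NaN by default
food['Yoghurt'].mean(skipna=False)    # nan, once you ask pandas to propagate it
np.average(food['Yoghurt'].dropna())  # 20.53..., dropping the NaN first also works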

FYI: the statistics module includes mean()

Something weird to consider….

print(np.nan == np.nan)

# apparently, according to the floating-point standard, NaN is not equal to itself!
False

I digress…

Filtering Missing Data

# method dropna
print("`dropna`: option to include `how = all` to only remove rows where every value is NaN \n",food.Yoghurt.dropna().tail(), "\n",
"`fillna`: pass fillna a dictionary (fillna({1: 0.5, 2: 0})) to specify a different value for each column\n", food.Yoghurt.fillna(0).tail(), "\n",
"`isna`\n", food.Yoghurt.isna().tail(), "\n",
"`notna`\n", food.Yoghurt.notna().tail())
`dropna`: option to include `how = all` to only remove rows where every value is NaN 
 10     2.0
11    11.0
12     2.0
14    16.0
15     3.0
Name: Yoghurt, dtype: float64 
 `fillna`: pass fillna a dictionary (fillna({1: 0.5, 2: 0})) to specify a different value for each column
 11    11.0
12     2.0
13     0.0
14    16.0
15     3.0
Name: Yoghurt, dtype: float64 
 `isna`
 11    False
12    False
13     True
14    False
15    False
Name: Yoghurt, dtype: bool 
 `notna`
 11     True
12     True
13    False
14     True
15     True
Name: Yoghurt, dtype: bool

7.2 Data Transformation

Removing Duplicates

Check to see if duplicates exist:

food.duplicated()
0     False
1     False
2     False
3     False
4     False
5     False
6     False
7     False
8     False
9     False
10    False
11    False
12    False
13    False
14    False
15    False
dtype: bool

If you do have duplicates, you can remove them with drop_duplicates().

*NOTE: by default, drop_duplicates keeps the first observed row and drops any later duplicates*

dup_food = food[['Yoghurt','Yoghurt']]
dup_food.columns = ['a','b']
# index 11, 12 are dropped: their values (11.0 and 2.0) already appeared at index 6 and 10, so the later occurrences go
dup_food.drop_duplicates()
a b
0 30.0 30.0
1 5.0 5.0
2 57.0 57.0
3 53.0 53.0
4 20.0 20.0
5 31.0 31.0
6 11.0 11.0
7 6.0 6.0
8 13.0 13.0
9 48.0 48.0
10 2.0 2.0
13 NaN NaN
14 16.0 16.0
15 3.0 3.0
# index 6, 10 are dropped: keep='last' keeps the later occurrence of each duplicated value (index 11, 12) instead
dup_food.drop_duplicates(keep = 'last')
a b
0 30.0 30.0
1 5.0 5.0
2 57.0 57.0
3 53.0 53.0
4 20.0 20.0
5 31.0 31.0
7 6.0 6.0
8 13.0 13.0
9 48.0 48.0
11 11.0 11.0
12 2.0 2.0
13 NaN NaN
14 16.0 16.0
15 3.0 3.0
# again 11, 12 are dropped: restricting the check to column 'a' changes nothing here because 'a' and 'b' are identical copies
dup_food.drop_duplicates(subset=['a'])
a b
0 30.0 30.0
1 5.0 5.0
2 57.0 57.0
3 53.0 53.0
4 20.0 20.0
5 31.0 31.0
6 11.0 11.0
7 6.0 6.0
8 13.0 13.0
9 48.0 48.0
10 2.0 2.0
13 NaN NaN
14 16.0 16.0
15 3.0 3.0

Transforming Data with a Function or Mapping

Since mapping a function over a series has already been covered, this section will only go over a few more helpful ways to map.

  • define your own function - similar to how we would do in apply functions or purrr:map()

    food_sub = food[:5][['Country','Yoghurt']]
    country_yogurt = {
      'Germany':'Quark',
      'Italy':'Yomo',
      'France':'Danone',
      'Holland':'Campina',
      'Belgium':'Activia'
    }
def get_yogurt(x):
   return country_yogurt[x]

food_sub['Brand'] = food_sub['Country'].map(get_yogurt)

food_sub['Country'].map(get_yogurt)
0      Quark
1       Yomo
2     Danone
3    Campina
4    Activia
Name: Country, dtype: object
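
As an aside, map also accepts the dictionary directly, so the helper function isn't strictly needed (a one-line sketch):

food_sub['Country'].map(country_yogurt)   # same result as map(get_yogurt)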

Replace Values

print("using `replace`: \n", food_sub.replace([30],50), '\n',
"using `replace` for more than one value: \n", food_sub.replace([30, 20],[50, 40]))
using `replace`: 
    Country  Yoghurt    Brand
0  Germany     50.0    Quark
1    Italy      5.0     Yomo
2   France     57.0   Danone
3  Holland     53.0  Campina
4  Belgium     20.0  Activia 
 using `replace` for more than one value: 
    Country  Yoghurt    Brand
0  Germany     50.0    Quark
1    Italy      5.0     Yomo
2   France     57.0   Danone
3  Holland     53.0  Campina
4  Belgium     40.0  Activia
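
replace will also take a dictionary, which reads a little more clearly when there are several substitutions (a sketch):

food_sub.replace({30: 50, 20: 40})   # equivalent to the two-list form above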

Renaming Axis Indices

As we’ve seen, standard indices are labelled as such:

>>> food_sub.index
RangeIndex(start=0, stop=5, step=1)

That can also be changed with the mapping of a function:

print(food_sub.index.map(lambda x: x + 10))
print('or')
print(food_sub.index.map({0:'G', 1:'I', 2:'F', 3:'H', 4:'B'}))
Int64Index([10, 11, 12, 13, 14], dtype='int64')
or
Index(['G', 'I', 'F', 'H', 'B'], dtype='object')
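
If you want a relabelled copy of the whole object rather than a new index, rename combines both axis mappings (a sketch; the uppercase column names are just for illustration):

food_sub.rename(index={0: 'G', 1: 'I', 2: 'F', 3: 'H', 4: 'B'},
                columns=str.upper)   # returns a renamed copy; food_sub is unchanged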

Discretization and Binning

It is common to convert continuous variables into discrete, grouped ones. Let's group the affinity for yogurt into some arbitrary bins:

scale = [0, 20, 30, 50, 70]
# reasonable, ok, interesting, why

pd.cut(food.Yoghurt, scale)
0     (20.0, 30.0]
1      (0.0, 20.0]
2     (50.0, 70.0]
3     (50.0, 70.0]
4      (0.0, 20.0]
5     (30.0, 50.0]
6      (0.0, 20.0]
7      (0.0, 20.0]
8      (0.0, 20.0]
9     (30.0, 50.0]
10     (0.0, 20.0]
11     (0.0, 20.0]
12     (0.0, 20.0]
13             NaN
14     (0.0, 20.0]
15     (0.0, 20.0]
Name: Yoghurt, dtype: category
Categories (4, interval[int64, right]): [(0, 20] < (20, 30] < (30, 50] < (50, 70]]
scaled = pd.cut(food.Yoghurt.values, scale)
scaled.categories

pd.value_counts(scaled)
(0, 20]     10
(30, 50]     2
(50, 70]     2
(20, 30]     1
dtype: int64

Apply the labels to the bins to have it make more sense:

scale_names = ['reasonable', 'ok', 'interesting', 'why']
pd.value_counts(pd.cut(food.Yoghurt.values, scale, labels = scale_names))
reasonable     10
interesting     2
why             2
ok              1
dtype: int64

Finally, let pandas do the work for you by supplying a number of bins and a precision. qcut cuts at sample quantiles, so each bin holds roughly the same number of observations, and precision limits the number of decimal places used for the bin edges.

pd.qcut(food.Yoghurt.values, 4, precision = 2)
[(13.0, 30.5], (1.99, 5.5], (30.5, 57.0], (30.5, 57.0], (13.0, 30.5], ..., (5.5, 13.0], (1.99, 5.5], NaN, (13.0, 30.5], (1.99, 5.5]]
Length: 16
Categories (4, interval[float64, right]): [(1.99, 5.5] < (5.5, 13.0] < (13.0, 30.5] < (30.5, 57.0]]
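
qcut will also accept explicit quantiles instead of a bin count (a sketch; the 10/40/40/10 split is arbitrary):

pd.qcut(food.Yoghurt.values, [0, 0.1, 0.5, 0.9, 1.0], precision=2)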

Detecting and Filtering Outliers

We often have to face the decision of how to handle outliers. We can choose to exclude them or to transform them.

# let's say any country whose percentage of yogurt consumption is over 50% is an outlier

yog = food.Yoghurt
yog[yog.abs() > 50]
2    57.0
3    53.0
Name: Yoghurt, dtype: float64

More interestingly, what if we wanted to know which countries had consumption of ANY food over 95%?

food2 = food.drop('Country', axis = 'columns')
food2[(food2.abs() > 95).any(axis = 'columns')]
Real coffee Instant coffee Tea Sweetener Biscuits Powder soup Tin soup Potatoes Frozen fish Frozen veggies Apples Oranges Tinned fruit Jam Garlic Butter Margarine Olive oil Yoghurt Crisp bread
3 96 62 98 32.0 62.0 67 43 7 14 14 83 89 61 81 15 31 97 13 53.0 15
5 97 61 86 28.0 79.0 73 12 7 26 23 85 94 83 20 91 94 94 84 31.0 24
6 27 86 99 22.0 91.0 55 76 17 20 24 76 68 89 91 11 95 94 57 11.0 28
10 97 13 93 31.0 NaN 43 43 39 54 45 56 78 53 75 9 68 32 48 2.0 93
11 96 17 92 35.0 66.0 32 17 11 51 42 81 72 50 64 11 92 91 30 11.0 34
13 98 12 84 20.0 64.0 27 10 8 18 12 50 57 22 37 15 96 94 17 NaN 64
15 30 52 99 11.0 80.0 75 18 2 5 3 57 52 46 89 5 97 25 31 3.0 9
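
To transform the outliers instead of excluding them, one common option is to cap the values (a hedged sketch, keeping 95 as the cutoff from above):

capped = food2.copy()
capped[capped > 95] = 95        # cap anything above 95 at exactly 95
# or, equivalently
capped = food2.clip(upper=95)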

Permutation and Random Sampling

  • Permuting = random reordering

    • np.random.permutation: pass it the length of the axis you want to permute and it returns a shuffled array of integer indices
  • Random sampling = each sample has an equal probability of being chosen

Let’s randomly reorder yogurt affinity:

print(np.random.permutation(5))

food.take(np.random.permutation(5))
[2 3 0 1 4]
Country Real coffee Instant coffee Tea Sweetener Biscuits Powder soup Tin soup Potatoes Frozen fish ... Apples Oranges Tinned fruit Jam Garlic Butter Margarine Olive oil Yoghurt Crisp bread
2 France 88 42 63 4.0 76.0 53 11 23 11 ... 87 84 40 45 88 94 47 36 57.0 3
1 Italy 82 10 60 2.0 55.0 41 3 2 4 ... 67 71 9 46 80 66 24 94 5.0 18
3 Holland 96 62 98 32.0 62.0 67 43 7 14 ... 83 89 61 81 15 31 97 13 53.0 15
0 Germany 90 49 88 19.0 57.0 51 19 21 27 ... 81 75 44 71 22 91 85 74 30.0 26
4 Belgium 94 38 48 11.0 74.0 37 23 9 13 ... 76 76 42 57 29 84 80 83 20.0 5

5 rows × 21 columns

This method can be helpful when using iloc indexing!

food.take(np.random.permutation(5), axis = 'columns')
Country Sweetener Instant coffee Real coffee Tea
0 Germany 19.0 49 90 88
1 Italy 2.0 10 82 60
2 France 4.0 42 88 63
3 Holland 32.0 62 96 98
4 Belgium 11.0 38 94 48
5 Luxembourg 28.0 61 97 86
6 England 22.0 86 27 99
7 Portugal 2.0 26 72 77
8 Austria 15.0 31 55 61
9 Switzerland 25.0 72 73 85
10 Sweden 31.0 13 97 93
11 Denmark 35.0 17 96 92
12 Norway 13.0 17 92 83
13 Finland 20.0 12 98 84
14 Spain NaN 40 70 40
15 Ireland 11.0 52 30 99

Let’s try taking a random subset without replacement:

# you can always add `replace=True` if you want replacement
food.sample(n=5)
Country Real coffee Instant coffee Tea Sweetener Biscuits Powder soup Tin soup Potatoes Frozen fish ... Apples Oranges Tinned fruit Jam Garlic Butter Margarine Olive oil Yoghurt Crisp bread
14 Spain 70 40 40 NaN 62.0 43 2 14 23 ... 59 77 30 38 86 44 51 91 16.0 13
1 Italy 82 10 60 2.0 55.0 41 3 2 4 ... 67 71 9 46 80 66 24 94 5.0 18
9 Switzerland 73 72 85 25.0 31.0 69 10 17 19 ... 79 70 46 61 64 82 48 61 48.0 30
4 Belgium 94 38 48 11.0 74.0 37 23 9 13 ... 76 76 42 57 29 84 80 83 20.0 5
13 Finland 98 12 84 20.0 64.0 27 10 8 18 ... 50 57 22 37 15 96 94 17 NaN 64

5 rows × 21 columns
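
With replace=True you can draw more samples than there are rows, which is handy for things like bootstrapping (a small sketch on a toy Series):

choices = pd.Series([5, 7, -1, 6, 4])
choices.sample(n=10, replace=True)   # 10 draws with replacement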

Computing Indicator/Dummy Vars

This kind of transformation is really helpful for machine learning. It converts a categorical variable into indicator (dummy) variables: one column of 0s and 1s per category.

pd.get_dummies(food['Country'])
Austria Belgium Denmark England Finland France Germany Holland Ireland Italy Luxembourg Norway Portugal Spain Sweden Switzerland
0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0
2 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0
4 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
5 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0
6 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
7 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0
8 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
9 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
10 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
11 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0
12 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0
13 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0
14 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
15 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0

This example is not the most helpful since every country in this set appears only once, but you get the idea.
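
For a column that does repeat categories, combining cut with get_dummies is a more typical use (a sketch reusing the yogurt bins and labels from earlier):

pd.get_dummies(pd.cut(food.Yoghurt, scale, labels=scale_names))
# one 0/1 indicator column per bin: reasonable, ok, interesting, why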

This topic will make more sense in Ch. 13, when full data analysis examples are worked out.

7.3 Extension Data Types

Extension types address some of the shortcomings of building pandas on top of numpy, such as:

  • expensive string computations

  • missing data conversions

  • lack of support for time related objects

s = pd.Series([1, 2, 3, None])
s.dtype
dtype('float64')
s = pd.Series([1, 2, 3, None], dtype=pd.Int64Dtype())
s
print(s.dtype)
Int64

Note that this extension type indicates missing with <NA>

print(s.isna())
0    False
1    False
2    False
3     True
dtype: bool

<NA> uses the pandas.NA sentinel value

s[3] is pd.NA
True

Types can be set with astype()

df = pd.DataFrame({"A": [1, 2, None, 4],
"B": ["one", "two", "three", None],
"C": [False, None, False, True]})

df["A"] = df["A"].astype("Int64")
df["B"] = df["B"].astype("string")
df["C"] = df["C"].astype("boolean")

df
A B C
0 1 one False
1 2 two <NA>
2 <NA> three False
3 4 <NA> True

Find a table of extension types here
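
Rather than calling astype column by column, convert_dtypes will infer an appropriate extension type for every column at once (a brief sketch):

pd.DataFrame({"A": [1, 2, None, 4]}).convert_dtypes().dtypes   # A is inferred as Int64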

7.4 String Manipulation

Functions that are built in:

  • split() : break a string into pieces

  • join() : use a string as the delimiter for concatenating a sequence of other strings

  • strip() : trim whitespace

  • the `in` keyword : good for checking whether a substring is present (use index() or find() to locate it)

  • count() : returns the number of occurrences of a substring

  • replace() : substitute occurrences of one pattern for another

See more functions here

lb = " layla is smart, witty, charming, and... "
lb.split(" ")
lb.strip()
'-'.join(lb)
'smart' in lb
lb.count(',')
lb.replace('...', ' bad at python.')
' layla is smart, witty, charming, and bad at python. '

Regular Expressions

RegEx is not easy. It takes some getting used to. It is really useful for programmatically applying any of the string functions to a particular pattern.

I often refer to this handy [cheat sheet](https://raw.githubusercontent.com/rstudio/cheatsheets/main/strings.pdf)

To use regular expression in python, you must import the module re:

import re

text = "layla has lived in philadelphia county, miami-dade county, and rockdale county"

# split on whitespace
re.split(r"\s+", text)
['layla',
 'has',
 'lived',
 'in',
 'philadelphia',
 'county,',
 'miami-dade',
 'county,',
 'and',
 'rockdale',
 'county']

To avoid repeating a common expression, you can compile it and store it as its own object.

regex = re.compile(r"\s+")

Don’t forget: characters that have a special meaning in regex (like . \ + * ? ( ) [ ]) must be escaped with a backslash when you want to match them literally.

What if I wanted to get the counties?

regex = re.compile(r"\w+(?=\s+county)")

regex.findall(text)
['philadelphia', 'dade', 'rockdale']
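
findall is only one of the methods a compiled regex exposes; search and sub cover the other common cases (a brief sketch with the same pattern):

regex.search(text)             # match object for the first county name, or None
regex.sub('somewhere', text)   # replace every matched county name with 'somewhere'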

String Functions

data = {"Dave": "dave@google.com", "Steve": "steve@gmail.com",
"Rob": "rob@gmail.com", "Wes": np.nan}
# convert to series
data = pd.Series(data)
data
Dave     dave@google.com
Steve    steve@gmail.com
Rob        rob@gmail.com
Wes                  NaN
dtype: object

To get at certain information, we can apply string functions element-wise through the Series str attribute:

# does the string contain something
data.str.contains("gmail")
# change to the extension string type
data_as_string_ext = data.astype('string')
data_as_string_ext
Dave     dave@google.com
Steve    steve@gmail.com
Rob        rob@gmail.com
Wes                 <NA>
dtype: string
# vectorized element retrieval
pattern = r"([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})"
data.str.findall(pattern, flags=re.IGNORECASE).str[0]
Dave     (dave, google, com)
Steve    (steve, gmail, com)
Rob        (rob, gmail, com)
Wes                      NaN
dtype: object
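
str.extract applies a regex with capture groups and returns one column per group, which can be easier to work with than findall (a sketch using the same pattern):

data.str.extract(pattern, flags=re.IGNORECASE)   # columns 0, 1, 2: user, domain, suffix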

7.5 Categorical Data

values = pd.Series(['apple', 'orange', 'apple',
                   'apple'] * 2)
                   
pd.unique(values)
pd.value_counts(values)
apple     6
orange    2
dtype: int64

You can improve performance and memory use by representing the categories as integer codes plus a separate array of the unique values:

values = pd.Series([0, 1, 0, 0] * 2)
dim = pd.Series(['apple', 'orange'])

dim
0     apple
1    orange
dtype: object

Retrieve the original set of strings with take

dim.take(values)
0     apple
1    orange
0     apple
0     apple
0     apple
1    orange
0     apple
0     apple
dtype: object
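
pandas' own Categorical extension type stores exactly this codes-plus-categories representation for you (a small sketch):

fruits = pd.Series(['apple', 'orange', 'apple', 'apple'] * 2).astype('category')
fruits.cat.codes        # the integer codes: 0, 1, 0, 0, 0, 1, 0, 0
fruits.cat.categories   # Index(['apple', 'orange'], dtype='object')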

Computations with Categoricals

rng = np.random.default_rng(seed=12345)
draws = rng.standard_normal(1000)
bins = pd.qcut(draws, 4)
bins
[(-3.121, -0.675], (0.687, 3.211], (-3.121, -0.675], (-0.675, 0.0134], (-0.675, 0.0134], ..., (0.0134, 0.687], (0.0134, 0.687], (-0.675, 0.0134], (0.0134, 0.687], (-0.675, 0.0134]]
Length: 1000
Categories (4, interval[float64, right]): [(-3.121, -0.675] < (-0.675, 0.0134] < (0.0134, 0.687] < (0.687, 3.211]]
bins = pd.qcut(draws, 4, labels=['Q1', 'Q2', 'Q3', 'Q4'])
bins
# then use groupby
bins = pd.Series(bins, name='quartile')
results = (pd.Series(draws)
               .groupby(bins)
               .agg(['count', 'min', 'max'])
               .reset_index())

Working with the categorical codes rather than the raw strings generally leads to better performance and lower memory use.
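
A rough way to see the benefit is to compare the memory footprint of plain object strings against the categorical version (a sketch; exact sizes will vary):

labels = pd.Series(['foo', 'bar', 'baz', 'qux'] * 250_000)
labels.memory_usage(deep=True)                      # tens of MB as plain Python strings
labels.astype('category').memory_usage(deep=True)   # a small fraction of that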