dataset: the relative consumption of certain food items in European and Scandinavian countries. The numbers represent the percentage of the population consuming that food type.
7.1 Handling Missing Data
Some things to note:
ALL DESCRIPTIVE STATISTICS ON pandas OBJECTS EXCLUDE MISSING DATA BY DEFAULT
NaN is used for missing values of type: float64
Values like NaN are called sentinel values
a value that is not part of the input but indicates a special meaning; a signal value
e.g. NaN for missing numeric data (note that this upcasts integer columns to float64), or -1 returned by a function that otherwise computes only non-negative integers, etc.
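A minimal sketch of both points above, using a toy Series (not the food dataset): descriptive statistics skip NaN unless you pass `skipna=False`, and the presence of NaN forces a float64 dtype.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])   # NaN forces float64 dtype
print(s.sum())                      # 4.0 — NaN excluded by default
print(s.sum(skipna=False))          # nan — missing values included
print(s.dtype)                      # float64
```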
Different results! Why? According to the NumPy documentation: np.mean always computes the arithmetic mean along the specified axis (over the flattened array by default), while np.average computes a weighted average along the specified axis. When no weights are supplied, the two agree; with weights, np.average can give a different result.
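A quick sketch of the distinction on a toy array: identical without weights, different once weights are passed.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0, 4.0])
print(np.mean(a))                           # 2.5 — plain arithmetic mean
print(np.average(a))                        # 2.5 — same when no weights are given
print(np.average(a, weights=[1, 1, 1, 5]))  # 3.25 — weighted mean: 26/8
```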
print(np.nan == np.nan)  # according to the IEEE 754 floating-point standard, NaN is not equal to itself!
False
I digress…
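Since `==` is useless for detecting NaN, the practical consequence is worth a two-line sketch: use `np.isnan` or `pd.isna` instead.

```python
import numpy as np
import pandas as pd

print(np.nan == np.nan)  # False — NaN compares unequal to itself
print(np.isnan(np.nan))  # True — the right way to test for NaN
print(pd.isna(np.nan))   # True — pandas equivalent; also handles None
```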
Filtering Missing Data
# method dropna
print("`dropna`: option to include `how='all'` to only remove rows where every value is NaN \n",
      food.Yoghurt.dropna().tail(), "\n",
      "`fillna`: pass fillna a dictionary (fillna({1: 0.5, 2: 0})) to specify a different value for each column\n",
      food.Yoghurt.fillna(0).tail(), "\n",
      "`isna`\n", food.Yoghurt.isna().tail(), "\n",
      "`notna`\n", food.Yoghurt.notna().tail())
`dropna`: option to include `how = all` to only remove rows where every value is NaN
10 2.0
11 11.0
12 2.0
14 16.0
15 3.0
Name: Yoghurt, dtype: float64
`fillna`: pass fillna a dictionary (fillna({1: 0.5, 2: 0})) to specify a different value for each column
11 11.0
12 2.0
13 0.0
14 16.0
15 3.0
Name: Yoghurt, dtype: float64
`isna`
11 False
12 False
13 True
14 False
15 False
Name: Yoghurt, dtype: bool
`notna`
11 True
12 True
13 False
14 True
15 True
Name: Yoghurt, dtype: bool
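The two options mentioned above (`how='all'` and a per-column fill dict) are easier to see on a whole DataFrame; a minimal sketch with made-up data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan, np.nan],
                   "b": [2.0, 5.0, np.nan]})
# how="all" drops only rows where *every* value is NaN (here, row 2)
print(df.dropna(how="all"))
# a dict gives each column its own fill value
print(df.fillna({"a": 0.5, "b": 0.0}))
```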
print("using `replace`: \n", food_sub.replace([30], 50), '\n',
      "using `replace` for more than one value: \n", food_sub.replace([30, 20], [50, 40]))
using `replace`:
Country Yoghurt Brand
0 Germany 50.0 Quark
1 Italy 5.0 Yomo
2 France 57.0 Danone
3 Holland 53.0 Campina
4 Belgium 20.0 Activia
using `replace` for more than one value:
Country Yoghurt Brand
0 Germany 50.0 Quark
1 Italy 5.0 Yomo
2 France 57.0 Danone
3 Holland 53.0 Campina
4 Belgium 40.0 Activia
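`replace` also accepts a dict mapping old values to new ones, which is equivalent to the two-list form above; a sketch on a toy Series mirroring the Yoghurt values:

```python
import pandas as pd

s = pd.Series([30, 5, 57, 53, 20])
# dict form: {old: new} — same result as replace([30, 20], [50, 40])
print(s.replace({30: 50, 20: 40}).tolist())  # [50, 5, 57, 53, 40]
```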
Renaming Axis Indices
As we’ve seen, standard indices are labelled as such:
reasonable 10
interesting 2
why 2
ok 1
dtype: int64
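For the renaming itself, `rename` accepts either functions or dicts for each axis; a minimal sketch with hypothetical labels:

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]],
                  index=["germany", "italy"],
                  columns=["tea", "jam"])
# functions are applied to every label on the given axis
print(df.rename(index=str.title, columns=str.upper))
```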
Finally, you can let pandas do the work for you by supplying only a number of bins and a precision. pd.cut will compute equal-width bins, and precision limits how many decimal places appear in the bin edges.
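A sketch of that call on random data (the bin count of 4 is an arbitrary choice here):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
data = pd.Series(rng.uniform(size=20))
# 4 equal-width bins; precision=2 keeps the displayed edges to 2 decimals
cats = pd.cut(data, 4, precision=2)
print(cats.value_counts())
```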
Let’s try taking a random subset without replacement:
food.sample(n=5)  # you can always add `replace=True` if you want replacement
        Country  Real coffee  Instant coffee  Tea  Sweetener  Biscuits  Powder soup  Tin soup  Potatoes  Frozen fish  ...  Apples  Oranges  Tinned fruit  Jam  Garlic  Butter  Margarine  Olive oil  Yoghurt  Crisp bread
14        Spain           70              40   40        NaN      62.0           43         2        14           23  ...      59       77            30   38      86      44         51         91     16.0           13
1         Italy           82              10   60        2.0      55.0           41         3         2            4  ...      67       71             9   46      80      66         24         94      5.0           18
9   Switzerland           73              72   85       25.0      31.0           69        10        17           19  ...      79       70            46   61      64      82         48         61     48.0           30
4       Belgium           94              38   48       11.0      74.0           37        23         9           13  ...      76       76            42   57      29      84         80         83     20.0            5
13      Finland           98              12   84       20.0      64.0           27        10         8           18  ...      50       57            22   37      15      96         94         17      NaN           64

5 rows × 21 columns
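With `replace=True` you can even draw more rows than the frame contains; a toy sketch (hypothetical one-column DataFrame):

```python
import pandas as pd

df = pd.DataFrame({"x": range(5)})
# sampling 8 rows from 5 is only possible with replacement
out = df.sample(n=8, replace=True, random_state=1)
print(out)
```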
Computing Indicator/Dummy Vars
This kind of transformation is really helpful for machine learning. It converts categorical variables into indicator (dummy) variables through a transformation that results in 0s and 1s.
pd.get_dummies(food['Country'])
    Austria  Belgium  Denmark  England  Finland  France  Germany  Holland  Ireland  Italy  Luxembourg  Norway  Portugal  Spain  Sweden  Switzerland
0         0        0        0        0        0       0        1        0        0      0           0       0         0      0       0            0
1         0        0        0        0        0       0        0        0        0      1           0       0         0      0       0            0
2         0        0        0        0        0       1        0        0        0      0           0       0         0      0       0            0
3         0        0        0        0        0       0        0        1        0      0           0       0         0      0       0            0
4         0        1        0        0        0       0        0        0        0      0           0       0         0      0       0            0
5         0        0        0        0        0       0        0        0        0      0           1       0         0      0       0            0
6         0        0        0        1        0       0        0        0        0      0           0       0         0      0       0            0
7         0        0        0        0        0       0        0        0        0      0           0       0         1      0       0            0
8         1        0        0        0        0       0        0        0        0      0           0       0         0      0       0            0
9         0        0        0        0        0       0        0        0        0      0           0       0         0      0       0            1
10        0        0        0        0        0       0        0        0        0      0           0       0         0      0       1            0
11        0        0        1        0        0       0        0        0        0      0           0       0         0      0       0            0
12        0        0        0        0        0       0        0        0        0      0           0       1         0      0       0            0
13        0        0        0        0        1       0        0        0        0      0           0       0         0      0       0            0
14        0        0        0        0        0       0        0        0        0      0           0       0         0      1       0            0
15        0        0        0        0        0       0        0        0        1      0           0       0         0      0       0            0
This example is not the most helpful, since each country appears only once, but I hope you get the idea.
This topic will make more sense in Ch. 13, when full data analysis examples are worked out.
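With a column that repeats values, the 0/1 pattern is easier to see; a sketch on a made-up Series:

```python
import pandas as pd

s = pd.Series(["tea", "coffee", "tea", "jam"])
# repeated categories show that each row gets a 1 in exactly one column
d = pd.get_dummies(s)
print(d)
```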
7.3 Extension Data Types
Extension types address some of the shortcomings brought on by numpy, such as:
expensive string computations
missing-data conversions (e.g. integers upcast to float64)
lack of support for time-related objects
s = pd.Series([1, 2, 3, None])
s.dtype
dtype('float64')
s = pd.Series([1, 2, 3, None], dtype=pd.Int64Dtype())
s
print(s.dtype)
Int64
Note that this extension type displays missing values as <NA>, and the integers are not upcast to float.
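A short sketch of the nullable-integer behaviour: the missing value is `pd.NA`, the detection methods work as before, and the dtype stays integer.

```python
import pandas as pd

s = pd.Series([1, 2, 3, None], dtype="Int64")  # string alias for pd.Int64Dtype()
print(s)          # last element displays as <NA>
print(s.isna())   # isna/notna work exactly as with NaN
print(s.dtype)    # Int64 — no upcast to float64
```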