Data preparation

The original dataset had 89 variables and 16,924 players.

Below is a preview of the slimmed-down version of this dataset used throughout this chapter:

Rows: 5,000
Columns: 42
$ nationality                <fct> Argentina, Portugal, Brazil, Slovenia, Belg…
$ overall                    <dbl> 94, 93, 92, 91, 91, 91, 90, 90, 90, 90, 89,…
$ potential                  <dbl> 94, 93, 92, 93, 91, 91, 93, 91, 90, 90, 95,…
$ wage_eur                   <dbl> 565000, 405000, 290000, 125000, 470000, 370…
$ value_eur                  <dbl> 95500000, 58500000, 105500000, 77500000, 90…
$ age                        <dbl> 32, 34, 27, 26, 28, 28, 27, 27, 33, 27, 20,…
$ height_cm                  <dbl> 170, 187, 175, 188, 175, 181, 187, 193, 172…
$ weight_kg                  <dbl> 72, 83, 68, 87, 74, 70, 85, 92, 66, 71, 73,…
$ attacking_crossing         <dbl> 88, 84, 87, 13, 81, 93, 18, 53, 86, 79, 78,…
$ attacking_finishing        <dbl> 95, 94, 87, 11, 84, 82, 14, 52, 72, 90, 89,…
$ attacking_heading_accuracy <dbl> 70, 89, 62, 15, 61, 55, 11, 86, 55, 59, 77,…
$ attacking_short_passing    <dbl> 92, 83, 87, 43, 89, 92, 61, 78, 92, 84, 82,…
$ attacking_volleys          <dbl> 88, 87, 87, 13, 83, 82, 14, 45, 76, 79, 79,…
$ skill_dribbling            <dbl> 97, 89, 96, 12, 95, 86, 21, 70, 87, 89, 91,…
$ skill_curve                <dbl> 93, 81, 88, 13, 83, 85, 18, 60, 85, 83, 79,…
$ skill_fk_accuracy          <dbl> 94, 76, 87, 14, 79, 83, 12, 70, 78, 69, 63,…
$ skill_long_passing         <dbl> 92, 77, 81, 40, 83, 91, 63, 81, 88, 75, 70,…
$ skill_ball_control         <dbl> 96, 92, 95, 30, 94, 91, 30, 76, 92, 89, 90,…
$ movement_acceleration      <dbl> 91, 89, 94, 43, 94, 77, 38, 74, 77, 94, 96,…
$ movement_sprint_speed      <dbl> 84, 91, 89, 60, 88, 76, 50, 79, 71, 92, 96,…
$ movement_agility           <dbl> 93, 87, 96, 67, 95, 78, 37, 61, 92, 91, 92,…
$ movement_reactions         <dbl> 95, 96, 92, 88, 90, 91, 86, 88, 89, 92, 89,…
$ movement_balance           <dbl> 95, 71, 84, 49, 94, 76, 43, 53, 93, 88, 83,…
$ power_shot_power           <dbl> 86, 95, 80, 59, 82, 91, 66, 81, 79, 80, 83,…
$ power_jumping              <dbl> 68, 95, 61, 78, 56, 63, 79, 90, 68, 69, 76,…
$ power_stamina              <dbl> 75, 85, 81, 41, 84, 89, 35, 75, 85, 85, 84,…
$ power_strength             <dbl> 68, 78, 49, 78, 63, 74, 78, 92, 58, 73, 76,…
$ power_long_shots           <dbl> 94, 93, 84, 12, 80, 90, 10, 64, 82, 84, 79,…
$ mentality_aggression       <dbl> 48, 63, 51, 34, 54, 76, 43, 82, 62, 63, 62,…
$ mentality_interceptions    <dbl> 40, 29, 36, 19, 41, 61, 22, 89, 82, 55, 38,…
$ mentality_positioning      <dbl> 94, 95, 87, 11, 87, 88, 11, 47, 79, 92, 89,…
$ mentality_vision           <dbl> 94, 82, 90, 65, 89, 94, 70, 65, 91, 84, 80,…
$ mentality_penalties        <dbl> 75, 85, 90, 11, 88, 79, 25, 62, 82, 77, 70,…
$ mentality_composure        <dbl> 96, 95, 94, 68, 91, 91, 70, 89, 92, 91, 84,…
$ defending_marking          <dbl> 33, 28, 27, 27, 34, 68, 25, 91, 68, 38, 34,…
$ defending_standing_tackle  <dbl> 37, 32, 26, 12, 27, 58, 13, 92, 76, 43, 34,…
$ defending_sliding_tackle   <dbl> 26, 24, 29, 18, 22, 51, 10, 85, 71, 41, 32,…
$ goalkeeping_diving         <dbl> 6, 7, 9, 87, 11, 15, 88, 13, 13, 14, 13, 7,…
$ goalkeeping_handling       <dbl> 11, 11, 9, 92, 12, 13, 85, 10, 9, 14, 5, 11…
$ goalkeeping_kicking        <dbl> 15, 15, 15, 78, 6, 5, 88, 13, 7, 9, 7, 7, 1…
$ goalkeeping_positioning    <dbl> 14, 14, 15, 90, 8, 10, 88, 11, 14, 11, 11, …
$ goalkeeping_reflexes       <dbl> 8, 11, 11, 89, 8, 13, 90, 11, 9, 14, 6, 5, …

Note: The fifa dataset referenced in the text appears to differ from the one currently available in the DALEX package. For instance, the field naming conventions are different, and the dimensions of the fifa data frame do not match those given in the text. We use the fifa dataset from the current DALEX package for presentation purposes.

Target Variable: Players’ Value

Player value, value_eur, is heavily right-skewed (skewness: 4.03).

We’ll apply a log transformation for modeling purposes.
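The effect of the transformation can be illustrated with a minimal Python sketch. The player values below are hypothetical and chosen only to be right-skewed; they are not taken from the fifa data. The skewness formula (third central moment divided by the 1.5 power of the variance) and the base-10 logarithm are assumptions too, since the notes state neither which estimator produced 4.03 nor which base is used.

```python
import math


def skewness(values):
    """Moment coefficient of skewness: m3 / m2**1.5 (population form)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5


# Hypothetical player values in EUR (illustration only, not real data):
# most players are cheap, a few are very expensive.
values_eur = [100_000, 200_000, 300_000, 500_000,
              1_000_000, 5_000_000, 50_000_000]

# Log transformation; base 10 is an assumption.
log_values = [math.log10(v) for v in values_eur]

raw_skew = skewness(values_eur)
log_skew = skewness(log_values)

print(f"skewness before log: {raw_skew:.2f}")
print(f"skewness after log:  {log_skew:.2f}")
```

On data shaped like this, the log scale pulls in the long right tail, so the skewness of the transformed values is noticeably smaller than that of the raw values.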

Key Feature Variables

Four key variables:
- age - ranges from 16 to 41, symmetric, with median and mean both at age 27
- movement_reactions - roughly symmetric
- skill_ball_control - bimodal, because goalkeepers' scores are distributed lower than those of other players
- skill_dribbling - bimodal, because goalkeepers' scores are distributed lower than those of other players
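The goalkeeper effect behind the bimodal variables is already visible in the preview. The Python sketch below copies the first 11 values of skill_dribbling and goalkeeping_diving from the preview and splits the dribbling scores by whether a player looks like a goalkeeper. The rule used to flag goalkeepers (goalkeeping_diving above 50) is an assumption made for illustration; the notes do not say how goalkeepers are identified.

```python
from statistics import mean

# First 11 values of each column, copied from the dataset preview.
skill_dribbling = [97, 89, 96, 12, 95, 86, 21, 70, 87, 89, 91]
goalkeeping_diving = [6, 7, 9, 87, 11, 15, 88, 13, 13, 14, 13]

# Assumed cut-off: a player with goalkeeping_diving above this value
# is treated as a goalkeeper. This rule is illustrative only.
GK_THRESHOLD = 50


def split_by_goalkeeper(scores, gk_scores, threshold=GK_THRESHOLD):
    """Return (goalkeeper_scores, outfield_scores) for paired columns."""
    keepers = [s for s, g in zip(scores, gk_scores) if g > threshold]
    outfield = [s for s, g in zip(scores, gk_scores) if g <= threshold]
    return keepers, outfield


keepers, outfield = split_by_goalkeeper(skill_dribbling, goalkeeping_diving)

print("goalkeepers:", keepers, "mean:", mean(keepers))
print("outfield:   ", outfield, "mean:", round(mean(outfield), 1))
```

Even in this small sample, the two groups sit far apart, which is what produces two separate peaks in the full distribution.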