5.1 The Data Generating Process
Introduction
- Basic premise of science: underlying laws that govern how the world works
- Laws operate regardless of our knowledge about them
- We can learn about laws, or data generating processes (DGP), through empirical observation
- Laws that govern social processes are generally less well-behaved than physical laws
Two Parts of DGPs
- Part we already know about the process, through prior assumptions or research
- Part we don’t know: what we’re attempting to learn through additional research
Contrived DGP Example
Assume the following laws:
- Income is log-normally distributed
- Brown hair causes a 10% increase to income
- A college degree produces a 20% income boost
Other assumptions (laws):
- 20% of population have naturally brown hair
- 30% of population have college degrees
- 40% of people with neither naturally brown hair nor a college degree dye their hair brown
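The laws above can be turned into a small simulation. The following is a minimal sketch, not the original author's code. It assumes that natural hair color and college attendance are independent, that dyed brown hair earns the same 10% boost as natural brown hair, and that the boosts are multiplicative (so they add on the log scale). The function name `simulate_dgp`, the baseline log income of 5, and the noise standard deviation of 1 are arbitrary illustration choices.

```python
import math
import random


def simulate_dgp(n, seed=0):
    """Draw n people from the contrived DGP (sketch under stated assumptions)."""
    rng = random.Random(seed)
    people = []
    for _ in range(n):
        college = rng.random() < 0.30        # 30% have a college degree
        natural_brown = rng.random() < 0.20  # 20% are naturally brown-haired
        # 40% of people with neither trait dye their hair brown
        dyes_brown = (not college) and (not natural_brown) and rng.random() < 0.40
        brown = natural_brown or dyes_brown
        # Assumption: multiplicative boosts, so they add in logs.
        # Normal noise on the log scale makes income log-normal.
        log_income = (
            5.0
            + math.log(1.10) * brown
            + math.log(1.20) * college
            + rng.gauss(0.0, 1.0)
        )
        people.append(
            {
                "college": college,
                "natural_brown": natural_brown,
                "brown": brown,
                "income": math.exp(log_income),
            }
        )
    return people


people = simulate_dgp(100_000)
share_brown = sum(p["brown"] for p in people) / len(people)
print(f"share with brown hair (natural or dyed): {share_brown:.3f}")
```

Under these assumptions the brown-haired share should sit near 0.2 + 0.8 × 0.7 × 0.4 = 0.424, since dyers are drawn only from the people who have neither naturally brown hair nor a degree.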
Premise 1: No prior knowledge of laws
- Look at the data and try to infer the relationship between brown hair and income
- In a high-level exploration of the data, brown hair is associated with only 1.6% higher income
Premise 2: Knowledge of all laws except brown hair/income relationship
- The underlying relationship is hard to see because many brown-haired people are non-college people who dyed their hair, and they don’t get the college wage boost
- We know college students don’t dye their hair, so among them the relationship between brown hair and income is not distorted by hair dyeing
- So restrict the exploration of hair color and income to college students
- Now brown-haired college students show 13% higher income than college students with other hair colors, which is close to the 10% set by the true DGP
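The contrast between the two premises can be reproduced in a short, self-contained sketch. It is illustrative only and rests on assumptions the notes do not state: hair color and college are drawn independently, dyed brown hair gets the same boost as natural brown hair, the boosts add on the log scale, and the noise standard deviation of 0.5 and the helper names `simulate` and `brown_gap` are hypothetical. The exact figures depend on the sample and the noise level, so they need not match the 1.6% and 13% reported above.

```python
import math
import random


def simulate(n, seed=1):
    """Return (college, brown, log_income) tuples from the contrived DGP."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        college = rng.random() < 0.30
        brown = rng.random() < 0.20          # naturally brown hair
        if not college and not brown:
            brown = rng.random() < 0.40      # dyes hair brown
        log_income = (
            5.0
            + math.log(1.10) * brown
            + math.log(1.20) * college
            + rng.gauss(0.0, 0.5)            # arbitrary noise level
        )
        rows.append((college, brown, log_income))
    return rows


def brown_gap(rows):
    """Difference in mean log income: brown-haired minus everyone else."""
    brown = [y for _, is_brown, y in rows if is_brown]
    other = [y for _, is_brown, y in rows if not is_brown]
    return sum(brown) / len(brown) - sum(other) / len(other)


rows = simulate(200_000)

# Premise 1: compare everyone, ignoring what we know about college and dyeing.
naive_gap = brown_gap(rows)

# Premise 2: use the known laws and compare college students only.
college_gap = brown_gap([r for r in rows if r[0]])

print(f"all people:       {math.expm1(naive_gap) * 100:.1f}% higher income")
print(f"college students: {math.expm1(college_gap) * 100:.1f}% higher income")
```

In this sketch the full-sample comparison understates the effect because the brown-haired group is padded with non-college dyers, while the college-only comparison lands close to the 10% written into the DGP.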