5.1 The Data Generating Process

Introduction

  • Basic premise of science: underlying laws that govern how the world works
  • Laws operate regardless of our knowledge about them
  • We can learn about laws, or data generating processes (DGP), through empirical observation
  • Laws that govern social processes are generally less well-behaved than physical laws

Two Parts of DGPs

  • The part we already know about the process, through prior assumptions or research
  • The part we don’t know, which is what we’re attempting to learn about through additional research

Contrived DGP Example

Assume the following laws:

  • Income is log-normally distributed
  • Brown hair causes a 10% increase in income
  • A college degree produces a 20% income boost

Other assumptions (laws):

  • 20% of population have naturally brown hair
  • 30% of population have college degrees
  • 40% of folks with neither natural brown hair nor college degrees will dye hair brown
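
The laws above can be turned into a small simulation. What follows is a minimal sketch in Python, not the original author's code. Several details are assumptions made for illustration only: the function name `simulate_dgp`, a baseline log income of 5, a noise standard deviation of 1, treating the 10% and 20% effects as additive shifts of 0.10 and 0.20 in log income, and letting dyed brown hair earn the same premium as natural brown hair.

```python
import random


def simulate_dgp(n, seed=0):
    """Draw n people from the contrived DGP and return them as a list of dicts."""
    rng = random.Random(seed)
    people = []
    for _ in range(n):
        college = rng.random() < 0.30        # 30% have a college degree
        natural_brown = rng.random() < 0.20  # 20% have naturally brown hair
        # 40% of people with neither natural brown hair nor a degree dye it brown
        dyed = (not natural_brown) and (not college) and rng.random() < 0.40
        brown = natural_brown or dyed
        # ASSUMPTION: additive effects on log income, baseline 5, noise sd 1.
        # Normal noise on the log scale makes income itself log-normal.
        log_income = 5.0 + 0.10 * brown + 0.20 * college + rng.gauss(0.0, 1.0)
        people.append(
            {"college": college, "brown": brown, "log_income": log_income}
        )
    return people


people = simulate_dgp(100_000, seed=1)
share_brown = sum(p["brown"] for p in people) / len(people)
print(f"share with brown hair (natural or dyed): {share_brown:.3f}")
```

Under these assumptions, brown hair is far more common among people without a degree (natural brown plus dyers) than among graduates, which is what makes the raw comparison of incomes by hair color misleading.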

Premise 1: No prior knowledge of laws

  • Let’s look at the data and try to infer the relationship between brown hair and income:

  • Based on high-level data exploration, brown hair appears to be associated with only 1.6% higher income, far below the 10% effect set by the laws.

Premise 2: Knowledge of all laws except brown hair/income relationship

  • The underlying relationship is hard to see because many brown-haired people are non-college folks who don’t benefit from the college wage boost
  • We know college students don’t dye their hair, so among them the relationship between brown hair and income is not distorted by hair dyeing
  • Let’s limit the exploration of hair color and income to college students:

  • Now we see 13% higher income for brown-haired college students than for college students with other hair colors. This is close to the 10% effect set by the true DGP
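
The gap between the two comparisons can also be worked out directly from the stated population shares, without simulating anything. The sketch below is a hedged illustration: it assumes the effects are additive shifts of 0.10 and 0.20 in log income and that dyed brown hair carries the same premium as natural brown hair; the variable and function names are made up for this example. It gives the expected gaps, so the estimates from any single finite sample (such as the 1.6% and 13% figures above) will differ because of noise.

```python
# Population shares implied by the stated laws
p_college = 0.30
p_brown_given_college = 0.20                   # natural brown only; graduates don't dye
p_brown_given_no_college = 0.20 + 0.80 * 0.40  # natural brown plus dyers among the rest

# ASSUMPTION: effects are additive shifts in log income
effect_brown = 0.10
effect_college = 0.20


def college_share(brown):
    """P(college degree | hair group), by Bayes' rule."""
    pc = p_brown_given_college if brown else 1 - p_brown_given_college
    pn = p_brown_given_no_college if brown else 1 - p_brown_given_no_college
    return p_college * pc / (p_college * pc + (1 - p_college) * pn)


# Raw comparison: the brown-hair effect plus the college effect weighted by
# how much less common degrees are among brown-haired people.
naive_gap = effect_brown + effect_college * (
    college_share(True) - college_share(False)
)

# Within graduates, college status is held fixed, so only the hair effect remains.
college_only_gap = effect_brown

print(f"P(college | brown) = {college_share(True):.3f}")
print(f"P(college | other) = {college_share(False):.3f}")
print(f"expected raw gap in log income          = {naive_gap:.3f}")
print(f"expected gap among graduates (log inc.) = {college_only_gap:.3f}")
```

Under these assumptions the raw comparison understates the effect because brown-haired people are much less likely to hold a degree, while restricting to graduates removes that difference and leaves only the hair effect.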

Two core ideas

  • Variation - How can we find the variation we need and focus on it?
  • Identification - How can we use the DGP to be sure we’re teasing out the right variation?