9.2 Method

Loss Function Minimization

\[ \hat g = \arg \min_{g \in \mathcal{G}} L\{f, g, \nu(\underline{x}_*)\} + \Omega (g), \] where

  • \(\mathcal{G}\) is the class of simple (glass-box) models
  • \(f\) is the black-box model's prediction
  • \(g\) is the glass-box model's prediction
  • \(\nu(\underline{x}_*)\) is a neighborhood around the instance \(\underline{x}_*\)
  • \(\Omega (g)\) is a penalty for the complexity of \(g\)

In practice, \(\Omega (g)\) is excluded from the optimization procedure. Instead, complexity is controlled manually by specifying the number of coefficients allowed in the glass-box model.
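A minimal numeric sketch of the loss term may help. Assuming the common LIME choice of a similarity-weighted squared difference between \(f\) and \(g\) over sampled neighborhood points (the predictions and weights below are made-up illustration values):

```python
import numpy as np

# Hypothetical predictions of f (black box) and g (glass box) on three
# sampled neighborhood points, plus similarity weights nu of each point
# to the instance of interest.
f_pred = np.array([0.9, 0.4, 0.7])
g_pred = np.array([0.8, 0.5, 0.6])
nu = np.array([1.0, 0.5, 0.8])

# Similarity-weighted squared loss L{f, g, nu(x_*)}.
L = np.sum(nu * (f_pred - g_pred) ** 2)
print(round(L, 4))  # → 0.023
```

Minimizing this quantity over \(g \in \mathcal{G}\) (subject to the complexity limit) yields \(\hat g\).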

f() and g() operate on different spaces:

  • f() works in \(p\) dimensions, corresponding to the \(p\) predictors
  • g() works in a \(q\)-dimensional space
  • \(q \ll p\)

Glass-box Fitting Procedure

Specify:
  - \(\underline{x}_*\), the instance of interest
  - N, the sample size of artificial points
  - K, the number of predictor variables for the glass-box model
  - a similarity function
  - a glass-box model
    
Steps:
1. Sample N artificial points around the instance.
2. Apply the black-box model to these points; its predictions become the target variable.
3. Calculate the similarity between the instance and each artificial point.
4. Fit the glass-box model:
    - use the N artificial points as features
    - use the black-box predictions on the N points as the target
    - limit to K nonzero coefficients
    - weight each point in training by its similarity to the instance
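The steps above can be sketched in Python. The black-box function, kernel width, and the alpha-doubling search used to cap the LASSO at K nonzero coefficients are all illustrative assumptions, not prescribed by the procedure itself:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)

# Hypothetical black box: any function mapping p features to a prediction.
def black_box(X):
    return np.sin(X[:, 0]) + X[:, 1] ** 2 + 0.1 * X[:, 2]

p, N, K = 5, 500, 2
x_star = np.zeros(p)                       # instance of interest

# 1. Sample N artificial points around the instance (Gaussian perturbations).
X = x_star + rng.normal(scale=0.5, size=(N, p))

# 2. Black-box predictions on the artificial points become the target.
y = black_box(X)

# 3. Similarity: a Gaussian kernel on distance to the instance.
d = np.linalg.norm(X - x_star, axis=1)
w = np.exp(-d ** 2 / 0.5)

# 4. Fit a similarity-weighted LASSO, doubling alpha until at most K
#    coefficients remain nonzero (a crude but simple complexity control).
alpha = 1e-4
while True:
    g = Lasso(alpha=alpha).fit(X, y, sample_weight=w)
    if np.count_nonzero(g.coef_) <= K:
        break
    alpha *= 2
```

The surviving coefficients of `g` identify which predictors drive the black-box prediction locally around \(\underline{x}_*\).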

Image Classifier Example

Overview:

  • Classify a 244 x 244 color image into one of 1,000 potential categories
  • Goal: explain the model's classification.
  • Issue: high dimensionality; each image comprises 178,608 dimensions (3 x 244 x 244)

Solution:

  • Transform the image into 100 superpixels; the glass box operates on the \(\{0,1\}^{100}\) space
  • Sample around the instance (i.e., randomly exclude some superpixels)
  • Fit a LASSO with K = 15 nonzero coefficients
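A minimal sketch of the perturbation step, assuming a precomputed superpixel label map (e.g. from a SLIC-style segmentation); the random image, segment labels, and grey fill value are hypothetical stand-ins:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical inputs: an RGB image and a superpixel label map with
# values 0..99, as a segmentation algorithm would produce.
image = rng.random((244, 244, 3))
segments = rng.integers(0, 100, size=(244, 244))

# One artificial point: a vector in {0,1}^100 saying which superpixels
# to keep (1) and which to exclude (0).
z = rng.integers(0, 2, size=100)

# Rebuild the perturbed image: excluded superpixels are greyed out.
perturbed = image.copy()
perturbed[z[segments] == 0] = 0.5
```

The black box classifies each `perturbed` image, while the glass box sees only the binary vector `z`; this is how the 178,608-dimensional problem is reduced to 100 dimensions.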

Figure 9.3: Left, original image; Middle, superpixels; Right, artificial data

Figure 9.4: The 15 selected superpixel features

Sampling around instance

  • Can’t always sample from existing data points because the data are very sparse and “far” from each other in high-dimensional space
  • Usually create perturbations of the instance of interest instead
    • continuous variables - multiple approaches
      • add Gaussian noise
      • perturb discretized versions of the variables
    • binary variables - change some 0s to 1s and vice versa
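Both perturbation strategies can be sketched in a few lines; the instance values, noise scale, and flip probability below are arbitrary illustration choices:

```python
import numpy as np

rng = np.random.default_rng(2)
N = 200

x_star_cont = np.array([1.2, -0.7, 3.4])   # continuous part of the instance
x_star_bin = np.array([1, 0, 1, 1])        # binary part of the instance

# Continuous variables: add Gaussian noise around the instance.
X_cont = x_star_cont + rng.normal(scale=0.3, size=(N, 3))

# Binary variables: flip each bit independently with small probability,
# turning some 0s into 1s and vice versa.
flips = rng.random((N, 4)) < 0.1
X_bin = np.where(flips, 1 - x_star_bin, x_star_bin)
```

Each row of `X_cont` and `X_bin` is one artificial neighborhood point to be scored by the black box.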