4.1 An Overview of Classification

  • Classification: approaches for making inferences about and/or predicting a qualitative (categorical) response variable

  • A few common classification techniques (classifiers):

    • logistic regression
    • linear discriminant analysis (LDA)
    • quadratic discriminant analysis (QDA)
    • naive Bayes
    • K-nearest neighbors


  • Examples of classification problems:

  1. A person arrives at the emergency room with a set of symptoms that could possibly be attributed to one of three medical conditions. Which of the three conditions does the individual have?
    • Predictor variables: Symptoms
    • Response variable: Type of medical condition

  2. An online banking service must be able to determine whether or not a transaction being performed on the site is fraudulent, on the basis of the user’s IP address, past transaction history, and so forth.
    • Predictor variables: User’s IP address, past transaction history, etc.
    • Response variable: Fraudulent activity (Yes/No)

  3. On the basis of DNA sequence data for a number of patients with and without a given disease, a biologist would like to figure out which DNA mutations are deleterious (disease-causing) and which are not.
    • Predictor variables: DNA sequence data
    • Response variable: Presence of a deleterious mutation (Yes/No)

  • In the following section, we explore the Default data set, in which annual income (\(X_1\) = income) and monthly credit card balance (\(X_2\) = balance) are used to predict whether an individual will default on his or her credit card payment.

Figure 4.1: The distributions of balance and income, split by the binary default variable. Note: defaulters are represented by orange plus signs; non-defaulters by blue circles.
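
To make the setup concrete, here is a minimal Python sketch of how two predictors (income, balance) and a binary response (default) fit together in a classifier. It uses K-nearest neighbors, one of the classifiers listed above, only because it is short to write out; it is not meant to be the method applied to the Default data later. The rows in `toy_rows` are invented for illustration and are not taken from the Default data set, and the names `scale` and `knn_predict` are hypothetical helpers.

```python
from collections import Counter
from math import dist
from statistics import mean, pstdev

# Hypothetical toy rows, invented for illustration (NOT from the Default data set).
# Each row is (income, balance, default).
toy_rows = [
    (45000.0, 300.0, "No"),
    (52000.0, 650.0, "No"),
    (38000.0, 900.0, "No"),
    (61000.0, 1200.0, "No"),
    (29000.0, 1900.0, "Yes"),
    (33000.0, 2100.0, "Yes"),
    (41000.0, 2300.0, "Yes"),
]

# Split the rows into predictors X = (income, balance) and response y = default.
X = [(income, balance) for income, balance, _ in toy_rows]
y = [label for _, _, label in toy_rows]

# Standardize each predictor so that income (tens of thousands) does not
# swamp balance (hundreds to thousands) in the distance calculation.
col_means = [mean(col) for col in zip(*X)]
col_sds = [pstdev(col) for col in zip(*X)]


def scale(x):
    """Center each coordinate of x and divide by its column standard deviation."""
    return tuple((v - m) / s for v, m, s in zip(x, col_means, col_sds))


X_scaled = [scale(x) for x in X]


def knn_predict(x_new, k=3):
    """Predict the label of x_new by majority vote among its k nearest neighbors."""
    z = scale(x_new)
    # Sort training points by Euclidean distance to the new point; keep the k closest.
    neighbors = sorted(zip(X_scaled, y), key=lambda pair: dist(pair[0], z))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Classify two new (income, balance) pairs, also made up for this sketch.
print(knn_predict((30000.0, 2000.0)))
print(knn_predict((50000.0, 500.0)))
```

The response here is a label ("Yes"/"No") rather than a number, which is what distinguishes this from a regression problem. The choice of k = 3 is arbitrary for this toy example.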