8. This exercise relates to the College data set, which can be found in the file College.csv on the book website. It contains a number of variables for 777 different universities and colleges in the US. The variables are
Private : Public/private indicator
Apps : Number of applications received
Accept : Number of applicants accepted
Enroll : Number of new students enrolled
Top10perc : New students from top 10 % of high school class
Top25perc : New students from top 25 % of high school class
F.Undergrad : Number of full-time undergraduates
P.Undergrad : Number of part-time undergraduates
Outstate : Out-of-state tuition
Room.Board : Room and board costs
Books : Estimated book costs
Personal : Estimated personal spending
PhD : Percent of faculty with Ph.D.s
Terminal : Percent of faculty with terminal degree
S.F.Ratio : Student/faculty ratio
perc.alumni : Percent of alumni who donate
Expend : Instructional expenditure per student
Grad.Rate : Graduation rate
Before reading the data into Python, it can be viewed in Excel or a text editor.
(a) Use the pd.read_csv() function to read the data into Python. Call the loaded data college. Make sure that you have the directory set to the correct location for the data.
college = pd.read_csv('ISLP_data/College.csv')
college
(b) Look at the data used in the notebook by creating and running a new cell with just the code college in it. You should notice that the first column is just the name of each university in a column named something like Unnamed: 0. We don’t really want pandas to treat this as data. However, it may be handy to have these names for later. Try the following commands and similarly look at the resulting data frames:
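The commands referred to here are not reproduced in this text. A minimal sketch of two ways to move the college names out of the data columns follows; it uses a tiny in-memory CSV with illustrative values in place of College.csv. The names college2 and college3 follow the surrounding text, and the index label 'College' is an assumption.

```python
import io
import pandas as pd

# Tiny stand-in for College.csv (illustrative values only). The empty first
# header field is what pandas reports as the column 'Unnamed: 0'.
csv_text = """,Private,Apps
Abilene Christian University,Yes,1660
Adelphi University,Yes,2186
"""

college = pd.read_csv(io.StringIO(csv_text))

# Option 1: ask read_csv to use the first column as the row index.
college2 = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Option 2: rename the column with a dictionary {old: new}, then index by it.
college3 = college.rename({'Unnamed: 0': 'College'}, axis=1)
college3 = college3.set_index('College')

print(college2)
print(college3)
```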
This has used the first column in the file as an index for the data frame. This means that pandas has given each row a name corresponding to the appropriate university. Now you should see that the first data column is Private. Note that the names of the colleges appear on the left of the table. We also introduced a new Python object above: a dictionary, which is specified by (key, value) pairs. Keep your modified version of the data with the following:
college = college3
college
Private Apps Accept Enroll Top10perc \
college
Abilene Christian University Yes 1660 1232 721 23
Adelphi University Yes 2186 1924 512 16
Adrian College Yes 1428 1097 336 22
Agnes Scott College Yes 417 349 137 60
Alaska Pacific University Yes 193 146 55 16
... ... ... ... ... ...
Worcester State College No 2197 1515 543 4
Xavier University Yes 1959 1805 695 24
Xavier University of Louisiana Yes 2097 1915 695 34
Yale University Yes 10705 2453 1317 95
York College of Pennsylvania Yes 2989 1855 691 28
Top25perc F.Undergrad P.Undergrad Outstate \
college
Abilene Christian University 52 2885 537 7440
Adelphi University 29 2683 1227 12280
Adrian College 50 1036 99 11250
Agnes Scott College 89 510 63 12960
Alaska Pacific University 44 249 869 7560
... ... ... ... ...
Worcester State College 26 3089 2029 6797
Xavier University 47 2849 1107 11520
Xavier University of Louisiana 61 2793 166 6900
Yale University 99 5217 83 19840
York College of Pennsylvania 63 2988 1726 4990
Room.Board Books Personal PhD Terminal \
college
Abilene Christian University 3300 450 2200 70 78
Adelphi University 6450 750 1500 29 30
Adrian College 3750 400 1165 53 66
Agnes Scott College 5450 450 875 92 97
Alaska Pacific University 4120 800 1500 76 72
... ... ... ... ... ...
Worcester State College 3900 500 1200 60 60
Xavier University 4960 600 1250 73 75
Xavier University of Louisiana 4200 617 781 67 75
Yale University 6510 630 2115 96 96
York College of Pennsylvania 3560 500 1250 75 75
S.F.Ratio perc.alumni Expend Grad.Rate
college
Abilene Christian University 18.1 12 7041 60
Adelphi University 12.2 16 10527 56
Adrian College 12.9 30 8735 54
Agnes Scott College 7.7 37 19016 59
Alaska Pacific University 11.9 2 10922 15
... ... ... ... ...
Worcester State College 21.0 14 4469 40
Xavier University 13.3 31 9189 83
Xavier University of Louisiana 14.4 20 8323 49
Yale University 5.8 49 40386 99
York College of Pennsylvania 18.1 28 4509 99
[777 rows x 18 columns]
(c) Use the describe() method to produce a numerical summary of the variables in the data set.
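As a rough illustration of what describe() returns, the sketch below runs it on a three-row stand-in for college (the values are illustrative, not the full data set). By default only numeric columns are summarized; include='all' is a standard pandas option that also covers qualitative columns such as Private.

```python
import pandas as pd

# Small illustrative frame; the real `college` has 777 rows and 18 columns.
college = pd.DataFrame({
    'Private': ['Yes', 'Yes', 'No'],
    'Apps': [1660, 2186, 2197],
    'Outstate': [7440, 12280, 6797],
})

# Numeric columns only: count, mean, std, min, quartiles, max.
summary = college.describe()
print(summary)

# Also report count/unique/top/freq for non-numeric columns.
print(college.describe(include='all'))
```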
(d) Use the pd.plotting.scatter_matrix() function to produce a scatterplot matrix of the first columns [Top10perc, Apps, Enroll]. Recall that you can reference a list C of columns of a data frame A using A[C].
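One possible shape for this call is sketched below on randomly generated stand-in data (not the real College values). The column list is passed as A[C], as the exercise suggests; the figsize value and the Agg backend line are assumptions made so the sketch runs outside a notebook.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; omit this line in a notebook
import numpy as np
import pandas as pd

# Random stand-in data with the three column names from the exercise.
rng = np.random.default_rng(0)
college = pd.DataFrame({
    'Top10perc': rng.integers(1, 97, size=50),
    'Apps': rng.integers(100, 20000, size=50),
    'Enroll': rng.integers(50, 6000, size=50),
})

cols = ['Top10perc', 'Apps', 'Enroll']
# Returns a 3 x 3 array of Axes: scatterplots off the diagonal,
# histograms of each variable on the diagonal.
axes = pd.plotting.scatter_matrix(college[cols], figsize=(6, 6))
```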
(e) Use the boxplot() method of college to produce side-by-side boxplots of Outstate versus Private.
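A minimal sketch of a grouped boxplot, assuming the column= and by= arguments of DataFrame.boxplot are what the exercise intends; the six rows are illustrative stand-ins for the College data.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; omit this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Illustrative stand-in rows, not the full data set.
college = pd.DataFrame({
    'Private': ['Yes', 'Yes', 'Yes', 'No', 'No', 'No'],
    'Outstate': [7440, 12280, 19840, 6797, 4990, 6900],
})

fig, ax = plt.subplots(figsize=(6, 4))
# One box of Outstate per level of Private, drawn side by side.
college.boxplot(column='Outstate', by='Private', ax=ax)
ax.set_ylabel('Outstate')
```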
(f) Create a new qualitative variable, called Elite, by binning the Top10perc variable into two groups based on whether or not the proportion of students coming from the top 10% of their high school classes exceeds 50%.
Use the value_counts() method of college['Elite'] to see how many elite universities there are. Finally, use the boxplot() method again to produce side-by-side boxplots of Outstate versus Elite.
college['Elite'].value_counts()
Yes 3
No 0
Name: Elite, dtype: int64
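One way to do the binning is pd.cut(). The sketch below uses five illustrative Top10perc values and assumes bin edges of 0, 50 and 100, since Top10perc is recorded in percent. Edges written as proportions (0, 0.5, 1) would leave every value above 1 unbinned.

```python
import pandas as pd

# Five illustrative values of Top10perc, in percent.
college = pd.DataFrame({'Top10perc': [23, 16, 60, 95, 50]})

# Intervals are (0, 50] -> 'No' and (50, 100] -> 'Yes'.
college['Elite'] = pd.cut(college['Top10perc'],
                          [0, 50, 100],
                          labels=['No', 'Yes'])
counts = college['Elite'].value_counts()
print(counts)

# With proportion-scale edges, none of these percentages falls in a bin.
unbinned = pd.cut(college['Top10perc'], [0, 0.5, 1]).isna().all()
```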
(g) Use the plot.hist() method of college to produce some histograms with differing numbers of bins for a few of the quantitative variables. The command plt.subplots(2, 2) may be useful: it will divide the plot window into four regions so that four plots can be made simultaneously. By changing the arguments you can divide the screen up in other combinations.
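A sketch of the 2 x 2 layout, using randomly generated stand-in data; the choice of Apps and Outstate and the bin counts of 10 and 40 are arbitrary assumptions.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; omit this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Random stand-in data, not the real College values.
rng = np.random.default_rng(1)
college = pd.DataFrame({
    'Apps': rng.integers(100, 20000, size=200),
    'Outstate': rng.integers(2000, 22000, size=200),
})

# Four regions; each histogram is directed to one of them via ax=.
fig, axes = plt.subplots(2, 2, figsize=(8, 8))
college['Apps'].plot.hist(bins=10, ax=axes[0, 0], title='Apps, 10 bins')
college['Apps'].plot.hist(bins=40, ax=axes[0, 1], title='Apps, 40 bins')
college['Outstate'].plot.hist(bins=10, ax=axes[1, 0], title='Outstate, 10 bins')
college['Outstate'].plot.hist(bins=40, ax=axes[1, 1], title='Outstate, 40 bins')
fig.tight_layout()
```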
(h) Continue exploring the data, and provide a brief summary of what you discover.
9. This exercise involves the Auto data set studied in the lab. Make sure that the missing values have been removed from the data.
Auto = pd.read_csv('ISLP_data/Auto.csv', na_values=['?'])
Auto
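Passing na_values=['?'] only marks those entries as missing; removing the affected rows is a separate step. A minimal sketch with dropna(), using a three-row in-memory CSV with illustrative values in place of Auto.csv:

```python
import io
import pandas as pd

# Illustrative stand-in for Auto.csv with one '?' entry.
csv_text = """mpg,horsepower,name
18.0,130,car a
25.0,?,car b
26.0,46,car c
"""

Auto = pd.read_csv(io.StringIO(csv_text), na_values=['?'])
print(Auto.isna().sum())   # number of missing values per column

# Drop every row that contains at least one missing value.
Auto = Auto.dropna()
print(Auto.shape)
```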
(b) What is the range of each quantitative predictor? You can answer this using the min() and max() methods in numpy.
mpg_min = Auto['mpg'].min()
mpg_max = Auto['mpg'].max()
print('The min and max miles per gallon are', (mpg_min, mpg_max))
The min and max miles per gallon are (9.0, 46.6)
dsp_min = Auto['displacement'].min()
dsp_max = Auto['displacement'].max()
print('The min and max displacement are', (dsp_min, dsp_max))
The min and max displacement are (68.0, 455.0)
hpwr_min = Auto['horsepower'].min()
hpwr_max = Auto['horsepower'].max()
print('The min and max horsepower are', (hpwr_min, hpwr_max))
The min and max horsepower are (46.0, 230.0)
wt_min = Auto['weight'].min()
wt_max = Auto['weight'].max()
print('The min and max weights are', (wt_min, wt_max))
The min and max weights are (1613, 5140)
acc_min = Auto['acceleration'].min()
acc_max = Auto['acceleration'].max()
print('The min and max accelerations are', (acc_min, acc_max))
The min and max accelerations are (8.0, 24.8)
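The per-variable computations above can also be written as one call. The sketch below assumes a list of quantitative column names (here called quantitative, a hypothetical name) and uses DataFrame.agg on a three-row illustrative frame.

```python
import pandas as pd

# Illustrative stand-in rows, not the full Auto data.
Auto = pd.DataFrame({
    'mpg': [18.0, 15.0, 26.0],
    'weight': [3504, 3693, 1835],
    'name': ['car a', 'car b', 'car c'],
})

quantitative = ['mpg', 'weight']  # hypothetical list of quantitative columns
# One row per statistic, one column per variable.
ranges = Auto[quantitative].agg(['min', 'max'])
print(ranges)
```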
(c) What is the mean and standard deviation of each quantitative predictor?
mpg_mean = Auto['mpg'].mean()
mpg_sd = Auto['mpg'].std()
print('The mean and standard deviation of miles per gallon are', mpg_mean, 'and', mpg_sd)
The mean and standard deviation of miles per gallon are 23.515869017632248 and 7.825803928946562
dsp_mean = Auto['displacement'].mean()
dsp_sd = Auto['displacement'].std()
print('The mean and standard deviation of displacement are', dsp_mean, 'and', dsp_sd)
The mean and standard deviation of displacement are 193.53274559193954 and 104.37958329992945
hpwr_mean = Auto['horsepower'].mean()
hpwr_sd = Auto['horsepower'].std()
print('The mean and standard deviation of horsepower are', hpwr_mean, 'and', hpwr_sd)
The mean and standard deviation of horsepower are 104.46938775510205 and 38.49115993282855
wt_mean = Auto['weight'].mean()
wt_sd = Auto['weight'].std()
print('The mean and standard deviation of weight are', wt_mean, 'and', wt_sd)
The mean and standard deviation of weight are 2970.2619647355164 and 847.9041194897246
acc_mean = Auto['acceleration'].mean()
acc_sd = Auto['acceleration'].std()
print('The mean and standard deviation of acceleration are', acc_mean, 'and', acc_sd)
The mean and standard deviation of acceleration are 15.555667506297214 and 2.7499952929761515
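One detail worth keeping in mind when reading these numbers: the pandas std() method divides by n - 1 (sample standard deviation), whereas np.std() divides by n unless ddof=1 is passed. A small sketch on four illustrative mpg values:

```python
import numpy as np
import pandas as pd

# Four illustrative values, not the full Auto data.
Auto = pd.DataFrame({'mpg': [18.0, 15.0, 26.0, 31.0]})

# Mean and sample standard deviation in one call.
stats = Auto[['mpg']].agg(['mean', 'std'])
print(stats)

sample_sd = Auto['mpg'].std()                    # divides by n - 1
population_sd = np.std(Auto['mpg'].to_numpy())   # divides by n
print(sample_sd, population_sd)
```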
(d) Now remove the 10th through 85th observations. What is the range, mean, and standard deviation of each predictor in the subset of the data that remains?
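The code below refers to Auto_new, whose construction is not shown. One way it could be built is sketched here on a 100-row stand-in frame. It assumes the exercise counts observations from 1 and includes both endpoints, so positions 9 through 84 (zero-based) are dropped; a different counting convention would shift the slice.

```python
import numpy as np
import pandas as pd

# Stand-in frame: 100 rows where mpg equals the 1-based row number.
Auto = pd.DataFrame({'mpg': np.arange(1.0, 101.0)})

# 10th through 85th observations (1-based, inclusive) are positions 9..84.
Auto_new = Auto.drop(Auto.index[9:85])
print(len(Auto_new))
```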
mpg_min = Auto_new['mpg'].min()
mpg_max = Auto_new['mpg'].max()
print('The min and max miles per gallon of the subsetted data are', (mpg_min, mpg_max))
The min and max miles per gallon of the subsetted data are (11.0, 46.6)
mpg_mean = Auto_new['mpg'].mean()
mpg_sd = Auto_new['mpg'].std()
print('The mean and standard deviation of miles per gallon of the subsetted data are', mpg_mean, 'and', mpg_sd)
The mean and standard deviation of miles per gallon of the subsetted data are 24.40931677018633 and 7.913357147165568
dsp_min = Auto_new['displacement'].min()
dsp_max = Auto_new['displacement'].max()
print('The min and max displacement of the subsetted data are', (dsp_min, dsp_max))
The min and max displacement of the subsetted data are (68.0, 455.0)
dsp_mean = Auto_new['displacement'].mean()
dsp_sd = Auto_new['displacement'].std()
print('The mean and standard deviation of displacement of the subsetted data are', dsp_mean, 'and', dsp_sd)
The mean and standard deviation of displacement of the subsetted data are 187.6801242236025 and 100.12092459330134
hpwr_min = Auto_new['horsepower'].min()
hpwr_max = Auto_new['horsepower'].max()
print('The min and max horsepower of the subsetted data are', (hpwr_min, hpwr_max))
hpwr_mean = Auto_new['horsepower'].mean()
hpwr_sd = Auto_new['horsepower'].std()
print('The mean and standard deviation of horsepower of the subsetted data are', hpwr_mean, 'and', hpwr_sd)
wt_min = Auto_new['weight'].min()
wt_max = Auto_new['weight'].max()
print('The min and max weights of the subsetted data are', (wt_min, wt_max))
wt_mean = Auto_new['weight'].mean()
wt_sd = Auto_new['weight'].std()
print('The mean and standard deviation of weight of the subsetted data are', wt_mean, 'and', wt_sd)
acc_min = Auto_new['acceleration'].min()
acc_max = Auto_new['acceleration'].max()
print('The min and max accelerations of the subsetted data are', (acc_min, acc_max))
acc_mean = Auto_new['acceleration'].mean()
acc_sd = Auto_new['acceleration'].std()
print('The mean and standard deviation of acceleration of the subsetted data are', acc_mean, 'and', acc_sd)
(e) Using the full data set, investigate the predictors graphically, using scatterplots or other tools of your choice. Create some plots highlighting the relationships among the predictors. Comment on your findings.
(f) Suppose that we wish to predict gas mileage (mpg) on the basis of the other variables. Do your plots suggest that any of the other variables might be useful in predicting mpg? Justify your answer.
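Alongside plots, pairwise correlations with mpg give a quick numerical check. The sketch below uses synthetic data that was deliberately generated with a negative weight/mpg relationship, so it only illustrates the mechanics and says nothing about the real Auto data.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: mpg decreases with weight by construction;
# acceleration is generated independently of mpg.
rng = np.random.default_rng(0)
weight = rng.uniform(1600, 5100, size=100)
Auto = pd.DataFrame({
    'weight': weight,
    'mpg': 50 - 0.008 * weight + rng.normal(0, 2, size=100),
    'acceleration': rng.uniform(8, 25, size=100),
})

# Correlation of every other column with mpg, most negative first.
corr_with_mpg = Auto.corr()['mpg'].drop('mpg').sort_values()
print(corr_with_mpg)
```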
10. This exercise involves the Boston housing data set.
(a) To begin, load in the Boston data set, which is part of the ISLP library.
(b) How many rows are in this data set? How many columns? What do the rows and columns represent?
(c) Make some pairwise scatterplots of the predictors (columns) in this data set. Describe your findings.
(d) Are any of the predictors associated with per capita crime rate? If so, explain the relationship.
(e) Do any of the suburbs of Boston appear to have particularly high crime rates? Tax rates? Pupil-teacher ratios? Comment on the range of each predictor.
(f) How many of the suburbs in this data set bound the Charles river?
(g) What is the median pupil-teacher ratio among the towns in this data set?
(h) Which suburb of Boston has the lowest median value of owner-occupied homes? What are the values of the other predictors for that suburb, and how do those values compare to the overall ranges for those predictors? Comment on your findings.
(i) In this data set, how many of the suburbs average more than seven rooms per dwelling? More than eight rooms per dwelling? Comment on the suburbs that average more than eight rooms per dwelling.
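Parts (f) through (i) reduce to a few pandas idioms: counting rows that satisfy a condition, taking a median, and locating the row with the smallest value. The sketch below runs them on a five-row made-up frame. The column names chas, ptratio, rm and medv are assumed to match those of the Boston data, and the values are illustrative only.

```python
import pandas as pd

# Made-up stand-in for Boston; column names are assumptions.
Boston = pd.DataFrame({
    'chas': [0, 1, 0, 1, 0],
    'ptratio': [15.3, 17.8, 20.2, 14.7, 19.1],
    'rm': [6.5, 7.1, 5.9, 8.3, 6.2],
    'medv': [24.0, 34.7, 5.0, 50.0, 21.6],
})

# (f) Rows that bound the river: a boolean mask sums to a count.
n_river = (Boston['chas'] == 1).sum()

# (g) Median pupil-teacher ratio.
median_ptratio = Boston['ptratio'].median()

# (h) Row with the lowest median home value, with all its predictors.
lowest = Boston.loc[Boston['medv'].idxmin()]

# (i) Rows averaging more than seven, and more than eight, rooms.
n_rm7 = (Boston['rm'] > 7).sum()
n_rm8 = (Boston['rm'] > 8).sum()

print(n_river, median_ptratio, n_rm7, n_rm8)
print(lowest)
```

To compare the row from (h) with the overall ranges, Boston.agg(['min', 'max']) can be printed next to lowest.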