# SJSU American Community Survey 2015 Microdata & Aggregate Excelsheet

Download the Large ACS Excel file. The “household” worksheet contains data on all households surveyed by the ACS in PUMA 0608511 in 2015. Find the variables SERIALNO, VEH and HINCP in the household worksheet, and copy and paste them next to each other in a new worksheet. The variable SERIALNO should be in column A of the new worksheet (this is a unique code identifying individual households that were surveyed), followed by VEH and HINCP. Finally, sort the data by HINCP, from smallest to largest. (Consult the Data Dictionary and do a Ctrl+F search for the variable names to find out precisely what these variables measure.)
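For readers who want to check their spreadsheet work in code, the same preparation step can be sketched in Python. This is only an illustration under stated assumptions: the rows below are made-up placeholder values, not real ACS records, the `OTHER` column stands in for the worksheet's remaining variables, and `select_and_sort` is a hypothetical helper name. Rows with a blank HINCP are placed at the end of the sorted table.

```python
# Hypothetical rows standing in for the "household" worksheet (NOT real ACS data).
# None represents a blank cell.
rows = [
    {"SERIALNO": 101, "VEH": 2, "HINCP": 82000, "OTHER": "a"},
    {"SERIALNO": 102, "VEH": 1, "HINCP": 31000, "OTHER": "b"},
    {"SERIALNO": 103, "VEH": None, "HINCP": None, "OTHER": "c"},
    {"SERIALNO": 104, "VEH": 3, "HINCP": 120500, "OTHER": "d"},
    {"SERIALNO": 105, "VEH": 0, "HINCP": 12000, "OTHER": "e"},
]

# Column order for the new worksheet: SERIALNO first, then VEH and HINCP.
KEEP = ("SERIALNO", "VEH", "HINCP")


def select_and_sort(rows):
    """Keep the three variables and sort by HINCP, smallest to largest.

    Rows with a blank HINCP are sorted to the end (an assumption of this sketch).
    """
    picked = [{name: row[name] for name in KEEP} for row in rows]
    return sorted(
        picked,
        key=lambda row: (row["HINCP"] is None, row["HINCP"] or 0),
    )


table = select_and_sort(rows)
for row in table:
    print(row["SERIALNO"], row["VEH"], row["HINCP"])
```

Keeping the blank-income rows in the table, rather than dropping them at this stage, makes it possible to count them later.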

- What is the average value of VEH for households with HINCP<50,001? (Hint: use the *=AVERAGE(* syntax in Excel.)
- What is the average value of VEH for households with HINCP>50,000?
- What is the variance of VEH for households with HINCP<50,001? (Hint: use the *=VAR.S(* syntax in Excel.)
- What is the variance of VEH for households with HINCP>50,000?
- How many households have HINCP>50,000?
- How many households have HINCP<50,001?
- How many households have missing values for HINCP? Why are they missing? (Hint: see the Data Dictionary.)
- What is the difference in means (i.e., the average number of vehicles in high income households minus the average number of vehicles in low income households)? For this and all remaining questions, do not use any households that have missing values of HINCP or VEH; observations with missing values can be dropped for the purposes of this analysis.
- What is the value of the standard error of the difference in means test? (Hint: use the formula in footnote 17 of *Mastering Metrics*.)
- What is the value of the test statistic, in a test of the null hypothesis that there is no difference in number of vehicles in low and high income households (where high income is defined as HINCP>50,000)?
- Finally, select six observations from this data set, three from the low income group, and three from the high income group, and carry out the same difference in means test using this subsample. Pick the three observations at random; for example, sort by SERIALNO and use data from the first three observations in each group. How do your results compare with what you found in question 10? If your results differed, to what do you attribute the difference?
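
The difference in means, its standard error, and the test statistic asked for above can be cross-checked outside Excel with a short Python sketch. Several things here are assumptions rather than part of the assignment: the VEH values are invented placeholders, `diff_in_means` is a hypothetical helper name, and the standard error uses the unequal-variance form, the square root of (variance_high / n_high + variance_low / n_low). Check that form against footnote 17 of *Mastering Metrics* before relying on it. `statistics.variance` computes the sample variance (dividing by n - 1), which is what `=VAR.S(` computes in Excel.

```python
import math
import statistics


def diff_in_means(low, high):
    """Compare mean VEH across two income groups.

    `low` and `high` are lists of VEH values with missing observations
    already dropped. Each group needs at least two observations so that
    the sample variance is defined.
    """
    n_low, n_high = len(low), len(high)
    mean_low, mean_high = statistics.mean(low), statistics.mean(high)
    # Sample variances (n - 1 in the denominator).
    var_low, var_high = statistics.variance(low), statistics.variance(high)

    diff = mean_high - mean_low
    # ASSUMPTION: unequal-variance standard error; verify against the textbook.
    se = math.sqrt(var_high / n_high + var_low / n_low)
    # The statistic is undefined when both groups have zero variance.
    t_stat = diff / se if se > 0 else math.nan
    return {"diff": diff, "se": se, "t": t_stat}


# Hypothetical VEH values (NOT real ACS data).
low_income_veh = [0, 1, 1, 2, 1, 0]    # households with HINCP < 50,001
high_income_veh = [2, 2, 3, 1, 2, 3]   # households with HINCP > 50,000

result = diff_in_means(low_income_veh, high_income_veh)
print(result)
```

Calling the same function with only three values per group mirrors the six-observation exercise. With so few observations the variance estimates are noisy, so the standard error and test statistic can differ considerably from the full-sample values.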