stat 579

Homework #5

Due in class, Thursday Oct 16.

  1. Reading Assignment: read the paper "The Split-Apply-Combine Strategy for Data Analysis" by Hadley Wickham.
  2. This week's homework is again based on the GSS data extract economical-status.csv.

    Use the dplyr package to produce datasets of summary statistics:

    1. a dataset with one row for each year, summarising: the total number of entries in that year, the percentage of female respondents, the mode of income, the number of missing values in income, the average age of respondents, and the region with the highest number of respondents.
    2. a dataset with one row for each combination of year and region, containing all summary statistics of income2 (see the class notes for how to derive income2 from the income variable) needed to draw the main part of boxplots by year and region, i.e. the median, the upper and lower quartiles, and the whiskers (use the 'Tukey' definition for the whiskers).
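As a starting point for the whiskers, here is a minimal sketch on made-up data (not the GSS extract). It assumes the common reading of the 'Tukey' definition: each whisker ends at the most extreme observation that lies within 1.5 times the interquartile range of the nearer quartile. The data frame toy, its values, and the output column names q1, med, q3, lower and upper are placeholders chosen for illustration; check the class notes for the exact definition and the real variable coding.

```r
library(dplyr)

# Hypothetical toy data standing in for the GSS extract.
toy <- data.frame(
  year    = rep(2000, 10),
  region  = rep(c("A", "B"), each = 5),
  income2 = c(1, 2, 3, 4, 100, 10, 20, 30, 40, 50)
)

box_stats <- toy %>%
  group_by(year, region) %>%
  summarise(
    q1  = quantile(income2, 0.25, na.rm = TRUE),
    med = median(income2, na.rm = TRUE),
    q3  = quantile(income2, 0.75, na.rm = TRUE),
    # Assumed Tukey rule: whiskers end at the most extreme observations
    # that are still within 1.5 * IQR of the quartiles.
    lower = min(income2[income2 >= q1 - 1.5 * (q3 - q1)], na.rm = TRUE),
    upper = max(income2[income2 <= q3 + 1.5 * (q3 - q1)], na.rm = TRUE)
  ) %>%
  ungroup()

print(box_stats)
```

In region A the value 100 lies beyond the upper fence, so the upper whisker stops at 4 rather than at the maximum. Note that quantile() has several interpolation types, so quartiles computed this way can differ slightly from the hinges used by boxplot().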
    For the year-region dataset, draw dotplots of all statistics in a single plot. To do this, you will need to use melt (from the reshape2 package) to get the data into a suitable form. Make sure to use aesthetics such as colour, shape, or size to visually distinguish between the summary statistics.
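The reshaping step could look roughly like the sketch below. It uses a small hand-made summary table rather than real results, and the column names (lower, q1, med, q3, upper, statistic, value) are illustrative assumptions. melt turns the one-column-per-statistic layout into one row per year, region and statistic, which is the form ggplot2 needs to map the statistic to an aesthetic.

```r
library(reshape2)
library(ggplot2)

# Hypothetical summary table: one row per year and region,
# one column per statistic (values are made up).
box_stats <- data.frame(
  year   = c(2000, 2000),
  region = c("A", "B"),
  lower  = c(1, 10),
  q1     = c(2, 20),
  med    = c(3, 30),
  q3     = c(4, 40),
  upper  = c(4, 50)
)

# Long form: one row per (year, region, statistic).
long <- melt(box_stats,
             id.vars = c("year", "region"),
             variable.name = "statistic",
             value.name = "value")

# One possible dotplot: colour and shape distinguish the statistics,
# with one panel per region.
p <- ggplot(long, aes(x = year, y = value,
                      colour = statistic, shape = statistic)) +
  geom_point() +
  facet_wrap(~ region)
```

Faceting by region is only one choice here; putting region on the x axis and faceting by year gives the other comparison, which may help when deciding which plot best supports the write-up.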
    Provide a 3-4 sentence write-up of your findings - does income change more between regions or over time? Pick the plot that best shows your conclusion.


    Submit an R Markdown file (.Rmd) to Blackboard containing all of the R code, all of the charts, and the write-up and interpretation.