stat 579

Midterm II - Fall 2012

The four questions below add up to 100 points (not counting extra credit). A passing grade starts at about 40 points; any score above 80 will receive an A.

For all of the questions below, incorporate the necessary R code directly into your answers. The Iowa Department of Natural Resources monitors water quality at different sites across Iowa and provides access to these records at http://programs.iowadnr.gov/iastoret/srchStations.aspx. Data on a variety of analytes for several sites in the Big Creek area (a local lake) are given in the data set http://www.hofroe.net/stat579/data/bigcreek.csv
  1. Question I (25 points)

    Is the data set in a normal form? Start your assessment with the 3rd normal form and work downwards. For each normal form that you reject give an example from the dataset that violates it, but would not violate the next lower normal form. Don't forget to include in your discussions which (combination of) variable(s) form(s) (part of) the key.

    
    	
    bigcreek <- read.csv("http://www.hofroe.net/stat579/data/bigcreek.csv")
    	
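    Key: the combination of Date, Analyte, and Station_Name.
    
    Not in 3rd normal form: "nothing but the key" is violated, because the location is stored twice (Station_Name and Storet_ID).
    
    Not in 2nd normal form: "the whole key" is violated, because part of the key already determines a non-key variable: Analyte alone determines the unit, e.g. 
    
    Not in 1st normal form: there are some duplicate rows in the data, so the key does not identify a record uniquely.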
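
The three checks above can be sketched in base R. This is only an illustration on a small made-up data frame: the column name Unit, all values, and the helper determines() are assumptions and are not taken from bigcreek.csv.

```r
# Toy stand-in for bigcreek; Unit and all values are made up for illustration.
toy <- data.frame(
  Date         = c("2012-05-01", "2012-05-01", "2012-05-01", "2012-06-01"),
  Station_Name = c("Beach", "Beach", "Beach", "Marina"),
  Storet_ID    = c(101, 101, 101, 202),
  Analyte      = c("pH", "pH", "Escherichia coli", "pH"),
  Unit         = c("none", "none", "MPN/100mL", "none"),
  Result       = c(8.1, 8.1, 20, 7.9),
  stringsAsFactors = FALSE
)

key <- c("Date", "Analyte", "Station_Name")

# 1st normal form: fully duplicated rows mean no key can identify a record.
n_dup_rows <- sum(duplicated(toy))

# After dropping exact duplicates, every key combination should occur once.
dedup <- unique(toy)
key_is_unique <- !any(duplicated(dedup[key]))

# Hypothetical helper: does each value of `lhs` map to exactly one value of `rhs`?
determines <- function(data, lhs, rhs) {
  pairs <- unique(data[c(lhs, rhs)])
  !any(duplicated(pairs[[lhs]]))
}

# 2nd normal form: part of the key (Analyte) determines a non-key column.
analyte_determines_unit <- determines(dedup, "Analyte", "Unit")

# 3rd normal form: the station is identified twice.
station_determines_id <- determines(dedup, "Station_Name", "Storet_ID")
```

On the real data the same calls would use the actual column names; a TRUE from determines() is evidence of a dependency in the observed records, not a proof that it holds in general.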
    	
  2. Question II (25 points)
    Identify the analyte for which most observations are available. Create a data frame summary.stat for an overview of this analyte by station, which includes
    • the number of observations,
    • the percentage of results (Result) not equal to zero,
    • and basic summaries of the results (Result), including mean, maximum, minimum, and standard deviation of the records.
    The data set http://www.hofroe.net/stat579/data/stationsinfo.csv contains information on each station at which observations are taken. Merge this information with the summary.stat data. Draw a scatterplot of the mean observed values (as color) by longitude and latitude.

    library(plyr)
    # the analyte with the most observations is listed last
    sort(table(bigcreek$Analyte))
    # one row of summaries per station for that analyte
    summary.stat <- ddply(subset(bigcreek, Analyte=="Escherichia coli"), .(Station_Name), summarise, 
      n = length(Analyte),
      # percentage of results not equal to zero
      notzero = 100*sum(Result!=0)/length(Result),
      mean = mean(Result),
      max = max(Result),
      min = min(Result),
      sd = sd(Result)
      )

    # add the station coordinates and plot the station means on the map
    stations <- read.csv("http://www.hofroe.net/stat579/data/stationsinfo.csv")
    summary_loc <- merge(summary.stat, stations, by="Station_Name")
    library(ggplot2)
    qplot(LONGITUDE, LATITUDE, colour=mean, data=summary_loc)
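
One point worth checking in the merge step: by default merge() keeps only stations that appear in both data frames, so a station without coordinates would silently drop out of the plot. The sketch below shows this on made-up data; the station names, coordinates, and means are assumptions for illustration only.

```r
# Toy data; station names, coordinates, and means are made up.
summary_toy <- data.frame(
  Station_Name = c("Beach", "Marina", "Inlet"),
  mean = c(35, 120, 60),
  stringsAsFactors = FALSE
)
stations_toy <- data.frame(
  Station_Name = c("Beach", "Marina"),
  LONGITUDE = c(-93.74, -93.73),
  LATITUDE = c(41.80, 41.81),
  stringsAsFactors = FALSE
)

# Default merge: only stations present in both data frames survive.
inner <- merge(summary_toy, stations_toy, by = "Station_Name")

# all.x = TRUE keeps every summary row and fills missing coordinates with NA.
left <- merge(summary_toy, stations_toy, by = "Station_Name", all.x = TRUE)

# Stations that have a summary but no location record.
unmatched <- setdiff(summary_toy$Station_Name, stations_toy$Station_Name)
```

Comparing nrow(summary.stat) with nrow(summary_loc) after the real merge is a quick way to see whether any station was lost.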
  3. Question III (25 points)
    Write a function trim(x, p = 5) that computes the mean of a numeric vector x after removing a percentage p of the most extreme values (both large and small values). In the function, first determine the number of values to remove, then remove one value at a time. Using the water quality data as an example, find the mean result of pH with and without trimming 5% of the data. For which analyte do the untrimmed mean and the 5% trimmed mean differ the most? Note: if you can't get the function trim to work, use the regular mean and investigate the purpose of its parameter trim. Caution: even if you get your function to work, the results of trim and mean will differ, because here you are asked to implement a non-standard version of trimming.

    	
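    trim <- function(x, p=5) {
        # number of values to remove
        npts <- round(length(x)*p/100)
        # seq_len() skips the loop when npts is 0; 1:npts would run over c(1, 0)
        for (i in seq_len(npts)) {
            # drop the value farthest from the current mean
            mx <- mean(x)
            idx <- which.max(abs(x-mx))
            x <- x[-idx]
        }
        mean(x)
    }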
    
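    library(plyr)
    # untrimmed and 5% trimmed mean for each analyte
    res <- ddply(bigcreek, .(Analyte), summarise, mean=mean(Result),
     trim=trim(Result))
    # absolute and relative difference between the two means
    res$absolute <- with(res, abs(mean-trim))
    res$relative <- with(res, abs((mean-trim)/mean))
    # analytes with the largest absolute and the largest relative difference
    res[which.max(res$absolute),]
    res[which.max(res$relative),]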
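
To see why this trimming differs from the trim argument of mean(), here is a small worked example on a made-up vector. It repeats the function so that the block runs on its own; the vector x_toy is an assumption chosen to make the difference visible.

```r
# Same idea as the solution: remove, one at a time, the value farthest
# from the current mean, then average what is left.
trim <- function(x, p = 5) {
  npts <- round(length(x) * p / 100)
  for (i in seq_len(npts)) {
    idx <- which.max(abs(x - mean(x)))
    x <- x[-idx]
  }
  mean(x)
}

x_toy <- c(1, 2, 3, 4, 100)  # made-up data with one large outlier

# p = 20 removes round(5 * 20 / 100) = 1 value in total: the outlier 100.
ours <- trim(x_toy, p = 20)

# mean(trim = 0.2) removes 20% from EACH end: both 1 and 100 are dropped.
builtin <- mean(x_toy, trim = 0.2)

# p = 0 removes nothing and returns the plain mean.
untrimmed <- trim(x_toy, p = 0)
```

Here ours is 2.5 (the mean of 1, 2, 3, 4) while builtin is 3 (the mean of 2, 3, 4): the exam version removes p percent in total and always picks the point farthest from the current mean, whereas mean() cuts the same fraction symmetrically from both tails.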
    	
    	
  4. Question IV (25 points)
    For the following gene sequence, write a single regular expression that "bundles" the sequence into groups of length 10 by inserting a white space after each group. Use the expression in an appropriate R function to re-format the sequence below.

    sequence <- "ACAAGATGCCATTGTCCCCCGGCCTCCTGCTGCTGCTGCTCTCCGGGGCCACGGCCACCGCTGCCCTGCCCCTGGAGGGTGGCCCCACCGGCCGAGACAGCGAGCATATGCAGGAAGCGGCAGGAATAAGGAAAAGCAGCCTCCTGACTTTCCTCGCTTGGTGGTTTGAGTGGACCTCCCAGGCCAGTGCCGGGCCCCTCATAGGAGAGGAAGCTCGGGAGGTGGCCAGGCGGCAGGAAGGCGCACCCCCCCAGCAATCCGCGCGCCGGGACAGAATGCCCTGCAGGAACTTCTTCTGGAAGACCTTCTCCTCCTGCAAATAAAACCTCACCCATGAATGCTCACGCAAGTTTAATTACAGACCTGAA"

    
    	
    bundle <- function(x) {
        gsub("([ACGT]{10})","\\1 ", x)
    }
    
    bundle(sequence)
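
A short made-up sequence makes the effect of the expression easier to check by eye. The inputs short_seq and exact_seq below are assumptions for illustration; bundle() is repeated so that the block runs on its own.

```r
# Capture each run of exactly 10 bases and write it back followed by a space.
bundle <- function(x) {
  gsub("([ACGT]{10})", "\\1 ", x)
}

short_seq <- "ACGTACGTACGTACGTACGTACG"   # 23 bases, made up
bundled <- bundle(short_seq)

# When the length is a multiple of 10, the last group also gets a space.
exact_seq <- "ACGTACGTACGTACGTACGT"      # 20 bases, made up
with_trailing <- bundle(exact_seq)
without_trailing <- trimws(with_trailing, which = "right")
```

The leftover three bases in short_seq stay together at the end because they do not fill a complete group of 10. Whether a trailing space matters depends on how the result is used; trimws() is one way to drop it.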