For all of the questions below, incorporate the necessary R code directly into your answers. The Iowa Department of Natural Resources monitors water quality at different sites across Iowa and provides access to these records at http://programs.iowadnr.gov/iastoret/srchStations.aspx. Data on a variety of analytes for several sites in the Big Creek area (a local lake) are given in the data set http://www.hofroe.net/stat579/data/bigcreek.csv.

Question I (25 points)
Is the data set in a normal form? Start your assessment with the 3rd normal form and work downwards. For each normal form that you reject give an example from the dataset that violates it, but would not violate the next lower normal form. Don't forget to include in your discussions which (combination of) variable(s) form(s) (part of) the key.

```r
bigcreek <- read.csv("http://www.hofroe.net/stat579/data/bigcreek.csv")
```
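The normal-form assessment that follows can be backed up with a few structural checks. Since the exact column names of the file are not shown here, the sketch below runs on a toy data frame with the assumed structure (the column names `Storet_ID` and `Units` are assumptions based on the discussion, not taken from the file):

```r
# Toy data frame mirroring the assumed structure of bigcreek;
# Storet_ID and Units are assumed column names.
toy <- data.frame(
  Date         = c("2010-06-01", "2010-06-01", "2010-06-01", "2010-06-08"),
  Station_Name = c("Beach", "Beach", "Beach", "Marina"),
  Storet_ID    = c(101, 101, 101, 102),
  Analyte      = c("pH", "pH", "Escherichia coli", "pH"),
  Units        = c("SU", "SU", "#/100ml", "SU"),
  Result       = c(7.2, 7.2, 35, 7.5),
  stringsAsFactors = FALSE
)

# 1st NF check: are there duplicate rows?
sum(duplicated(toy))

# 2nd NF check: does Analyte (only part of the key) alone determine Units?
units_per_analyte <- tapply(toy$Units, toy$Analyte,
                            function(u) length(unique(u)))
all(units_per_analyte == 1)

# 3rd NF check: is the location stored twice, i.e. does Station_Name
# determine Storet_ID?
ids_per_station <- tapply(toy$Storet_ID, toy$Station_Name,
                          function(s) length(unique(s)))
all(ids_per_station == 1)
```

The same three checks can be run on `bigcreek` itself once the actual column names are confirmed.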

What is the key? The combination of Date, Analyte, and Station_Name.

- Not in 3rd normal form, because "nothing but the key" is violated: the location is stored two-fold, as both Station_Name and Storet_ID.
- Not in 2nd normal form, because the key is split: part of the key determines a non-key attribute, e.g. Analyte alone determines the unit.
- Not in 1st normal form, because there are some duplicate rows in the data.

Question II (25 points)

Identify the analyte for which most observations are available. Create a data frame `summary.stat` with an overview of this analyte by station, which includes

- the number of observations,
- the percentage of results (`Result`) not equal to zero,
- and basic summaries of the results (`Result`), including mean, maximum, minimum, and standard deviation of the records.

Merge the station information into the `summary.stat` data. Draw a scatterplot of the mean observed values (as color) by longitude and latitude.

```r
library(plyr)

# Find the analyte with the most observations:
sort(table(bigcreek$Analyte))

# Escherichia coli has the most observations; summarise it by station:
summary.stat <- ddply(subset(bigcreek, Analyte == "Escherichia coli"),
                      .(Station_Name), summarise,
                      n       = length(Analyte),
                      notzero = sum(Result != 0) / length(Result),
                      mean    = mean(Result),
                      max     = max(Result),
                      min     = min(Result),
                      sd      = sd(Result))
summary.stat

# Merge in the station locations and plot:
stations    <- read.csv("http://www.hofroe.net/stat579/data/stationsinfo.csv")
summary_loc <- merge(summary.stat, stations, by = "Station_Name")

library(ggplot2)
qplot(LONGITUDE, LATITUDE, colour = mean, data = summary_loc)
```

Question III (25 points)

Write a function `trim(x, p = 5)` that computes the mean of a numeric vector `x` after removing a percentage `p` of the most extreme values (both large and small values). In the function, first determine the number of values to remove, then remove one value at a time. Using the water quality data, find the mean result of pH with and without trimming 5% of the data. For which analyte do the untrimmed mean and the 5% trimmed mean differ the most?

Note: if you can't get the function `trim` to work, use the regular `mean` and investigate the purpose of its parameter `trim`. Caution: even if you got your function to work, the results of `trim` and `mean` will differ, because here you are asked to implement a non-standard version of trimming.

```r
trim <- function(x, p = 5) {
  npts <- round(length(x) * p / 100)
  for (i in seq_len(npts)) {       # seq_len() guards against npts == 0
    mx  <- mean(x)
    idx <- which.max(abs(x - mx))  # value farthest from the current mean
    x   <- x[-idx]
  }
  mean(x)
}

# Mean result of pH with and without trimming 5%:
pH <- subset(bigcreek, Analyte == "pH")$Result
c(untrimmed = mean(pH), trimmed = trim(pH))

library(plyr)
res <- ddply(bigcreek, .(Analyte), summarise,
             mean = mean(Result), trim = trim(Result))
res$absolute <- with(res, abs(mean - trim))
res$relative <- with(res, abs((mean - trim) / mean))
res[which.max(res$absolute), ]   # largest absolute difference
res[which.max(res$relative), ]   # largest relative difference
```
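As a quick sanity check, here is a minimal, self-contained example on a toy vector (the function is restated so the block runs on its own). It also contrasts the result with base R's `mean(..., trim=)`, which sorts the data and removes a fraction from *each* tail rather than removing whole values one at a time:

```r
# Non-standard trimming, restated so this block is self-contained.
trim <- function(x, p = 5) {
  npts <- round(length(x) * p / 100)
  for (i in seq_len(npts)) {
    idx <- which.max(abs(x - mean(x)))  # value farthest from the current mean
    x <- x[-idx]
  }
  mean(x)
}

x <- c(1:9, 100)       # overall mean is 14.5; 100 is the single most extreme value
trim(x, p = 10)        # removes 100 only, leaving mean(1:9) = 5
mean(x, trim = 0.1)    # base R removes one value from each tail: mean(2:9) = 5.5
```

This illustrates the caution in the question: the two notions of trimming give genuinely different answers on the same data.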

Question IV (25 points)

For the following gene sequence, write a single regular expression that allows you to "bundle" subsequences of length 10 by inserting a white space. Use the expression in an appropriate R function on the sequence below to re-format the sequence.

```r
sequence <- "ACAAGATGCCATTGTCCCCCGGCCTCCTGCTGCTGCTGCTCTCCGGGGCCACGGCCACCGCTGCCCTGCCCCTGGAGGGTGGCCCCACCGGCCGAGACAGCGAGCATATGCAGGAAGCGGCAGGAATAAGGAAAAGCAGCCTCCTGACTTTCCTCGCTTGGTGGTTTGAGTGGACCTCCCAGGCCAGTGCCGGGCCCCTCATAGGAGAGGAAGCTCGGGAGGTGGCCAGGCGGCAGGAAGGCGCACCCCCCCAGCAATCCGCGCGCCGGGACAGAATGCCCTGCAGGAACTTCTTCTGGAAGACCTTCTCCTCCTGCAAATAAAACCTCACCCATGAATGCTCACGCAAGTTTAATTACAGACCTGAA"
```

```r
bundle <- function(x) {
  gsub("([ACGT]{10})", "\\1 ", x)
}
bundle(sequence)
```
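A small check on shorter toy strings makes the regex behavior explicit: every complete block of 10 bases gets a trailing space (including the last full block), and a final block shorter than 10 bases is left untouched.

```r
# Same regex as above, restated so this block is self-contained.
bundle <- function(x) gsub("([ACGT]{10})", "\\1 ", x)

bundle("ACAAGATGCCATTGTCCCCC")   # two full blocks -> "ACAAGATGCC ATTGTCCCCC "
bundle("ACAAGATGCCATT")          # 13 bases -> "ACAAGATGCC ATT"

# The trailing space after the final full block can be dropped if needed:
sub(" $", "", bundle("ACAAGATGCCATTGTCCCCC"))
```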