# stat 579

## Midterm II - Sample

The three questions below give points adding up to 100 points (not counting extra credit). A passing grade will start at about 40 points, any score above 80 will receive an A.

### Warm-up (15 points)

File brfss-2009-clean.csv contains the records from the BRFSS (Behavioral Risk Factor Surveillance System) as discussed in class earlier this semester.
1. For each state, find the percentage of female and male population.

```	```

library(plyr)
pct.states <- ddply(brfss, .(X_STATE), summarise,
pctmale = sum(SEX==1)/length(SEX),
pctfemale = sum(SEX==2)/length(SEX)
)
```
```
2. The file fips-code.csv has a list of state names by FIPS code - used in the variable XSTATE in the BRFSS data. Use this data and the states data from the `maps` package to draw a map of the U.S. (mainland only) with states colored by percentage of male respondents.

```	```
fips\$region <- tolower(fips\$State.Name)

library(maps)
library(ggplot2)
states <- map_data("state")
statesfips <- merge(states, fips, by="region")
statesbrfss <- merge(statesfips, pct.states, by.x="FIPS.Code", by.y="X_STATE")
qplot(long, lat, geom="polygon", group=group, order=order, data=statesbrfss, fill=pctmale)
# some states do not show up or are wrongly named ....
```
```

### Working with text (40 points + 2 + 5 points of extra credit)

1. Write a function `palindrome5(x)` that is `TRUE`, if `x` is a palindrome of length 5 (i.e. x is five letters long and can be read from the back or the front, e.g. 'radar', 'level'). Extra points, if you can avoid an explicit loop or recursion.

```	```
palindrome5 <- function(x) {
if (nchar(x) != 5) return(FALSE)
pal5 <- gsub("^(.)(.)(.).*", "\\1\\2\\3\\2\\1", x)
x == pal5
}

```
```
2. The file course-description.txt contains a description of all the courses offered by the Statistics Department at ISU. Read this file and convert to a data frame with columns 'Course', 'Name', 'Description'. Extra points, if you successfully extract number of credits from the description as well.

```	```
file <- read.table("http://www.hofroe.net/stat579/course-description.txt", sep="\n")
fsplit <- strsplit(split="\\. ", as.character(file[,1]))
dframe <- ldply(fsplit, function(x) c(x, x, x, paste(x[-(1:5)], collapse=". ")))
names(dframe) <- c("Course", "Name", "Credit", "Description")
summary(dframe)
```
```
3. For the following gene sequence:

```sequence <-
"ATGGATTCTGGTATGTTCTAGCGCTTGCACCATCCCATTTAACTGTAAGAAGAATTGCACGGTCCCAATTGCTCGAGAGA
TTTCTCTTTTACCTTTTTTTACTATTTTTCACTCTCCCATAACCTCCTATATTGACTGATCTGTAATAACCACGATATTA
TTGGAATAAATAGGGGCTTGAAATTTGGAAAAAAAAAAAAACTGAAATATTTTCGTGATAAGTGATAGTGATATTCTTCT
TTTATTTGCTACTGTTACTAAGTCTCATGTACTAACATCGATTGCTTCATTCTTTTTGTTGCTATATTATATGTTTAGAG
GTTGCTGCTTTGGTTATTGATAACGGTTCTGGTATGTGTAAAGCCGGTTTTGCCGGTGACGACGCTCCTCGTGCTGTCTT
CCCATCTATCGTCGGTAGACCAAGACACCAAGGTATCATGGTCGGTATGGGTCAAAAAGACTCCTACGTTGGTGATGAAG
CTCAATCCAAGAGAGGTATCTTGACTTTACGTTACCCAATTGAACACGGTATTGTCACCAACTGGGACGATATGGAAAAG
ATCTGGCATCATACCTTCTACAACGAATTGAGAGTTGCCCCAGAAGAACACCCTGTTCTTTTGACTGAAGCTCCAATGAA
CCCTAAATCAAACAGAGAAAAGATGACTCAAATTATGTTTGAAACTTTCAACGTTCCAGCCTTCTACGTTTCCATCCAAG
CCGTTTTGTCCTTGTACTCTTCCGGTAGAACTACTGGTATTGTTTTGGATTCCGGTGATGGTGTTACTCACGTCGTTCCA
ATTTACGCTGGTTTCTCTCTACCTCACGCCATTTTGAGAATCGATTTGGCCGGTAGAGATTTGACTGACTACTTGATGAA
GATCTTGAGTGAACGTGGTTACTCTTTCTCCACCACTGCTGAAAGAGAAATTGTCCGTGACATCAAGGAAAAACTATGTT
ACGTCGCCTTGGACTTCGAACAAGAAATGCAAACCGCTGCTCAATCTTCTTCAATTGAAAAATCCTACGAACTTCCAGAT
GGTCAAGTCATCACTATTGGTAACGAAAGATTCAGAGCCCCAGAAGCTTTGTTCCATCCTTCTGTTTTGGGTTTGGAATC
TGCCGGTATTGACCAAACTACTTACAACTCCATCATGAAGTGTGATGTCGATGTCCGTAAGGAATTATACGGTAACATCG
TTATGTCCGGTGGTACCACCATGTTCCCAGGTATTGCCGAAAGAATGCAAAAGGAAATCACCGCTTTGGCTCCATCTTCC
ATGAAGGTCAAGATCATTGCTCCTCCAGAAAGAAAGTACTCCGTCTGGATTGGTGGTTCTATCTTGGCTTCTTTGACTAC
CTTCCAACAAATGTGGATCTCAAAACAAGAATACGACGAAAGTGGTCCATCTATCGTTCACCACAAGTGTTTCTAA"
```
find all 3 letter 'words' (a word is a sequence of the letters 'A', 'G', 'C', and 'T') and the frequency of their occurence

```	```
next3 <- function(x) {
if (nchar(x) <= 3) return(x)
substr(x, 1, 3)
}

seq <- gsub("\n", "", sequence)
res <- rep("", nchar(seq)-3)
for (i in 1:(nchar(seq)-3)) {
res[i] <- next3(seq)
seq <- substr(seq, 2, nchar(seq))
}
sort(table(res))
```
```

### Integration by simulation (45 points)

1. Write a function `integral (n, a,b,f)` that estimates the area under function `f` between limits `a` and `b` using `n` uniform random numbers. The picture below shows a description of the process. In this example, 1000 pairs of random uniform values between [0.5,3.5] x [0,15] were used. 592 of them resulted in falling below the curve, giving an estimate for the are under curve as 0.592 * 3 * 15 = 26.64. ```	```
integral <- function(n, a,b,f) {
x <- runif(n, a,b)
y <- runif(n, 0, 15)
sum(f(x) < y)/n*(b-a)*15
}
```
```
2. Apply your function repeatedly (say, 20 times) to the curve f(x) = 2x(x-3)^2+5 between limits 1 and 4 using 2000 pairs of random numbers. Give an estimate for the area under the curve and a standard deviation.

```	```
f <- function(x) 2*x*(x-3)^2+5
res <- replicate(20, integral(2000, 1, 4, f))
mean(res)
sd(res)
```
```