Assistment - Printing Content

The R Project is a free software program for use in statistical computing and graphics. It is used by many professional and non-professional statisticians. This link leads to a helpful resource which explains many aspects of R: cran.r-project.org/doc/contrib/Verzani-SimpleR.pdf

To start out, let's make a data set. Say a statistics class has ten students and their scores on an exam are as follows: 67, 99, 81, 83, 77, 84, 21, 59, 67, and 91. To create a data set from these numbers in R, type:
x=c(67,99,81,83,77,84,21,59,67,91)
Next, find the median (M) and mean (X-bar) of these data by typing:
mean(x)
median(x)
Now press "Run the R Program" and scroll down to check the answers.
The median and mean are?

n.b. You can replace the "x" with "data" or "testscores" or anything else which you would like to name the data. Just make sure to use this same name consistently (if you define the data as "testscores" and then ask R what the mean of "x" is, R will not give you the correct answer).
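
As a small sketch of the naming point above, the same exam scores can be stored under a different name (testscores is just an illustrative choice), as long as that name is used in every later command:

```r
# Store the ten exam scores under a descriptive name instead of x
testscores = c(67, 99, 81, 83, 77, 84, 21, 59, 67, 91)

# Use the same name in every later command
mean(testscores)
median(testscores)

# R is case sensitive: Testscores and testscores are different names
exists("Testscores")   # FALSE unless you created it yourself
```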

Multiple Choice:

	A.) M=77 X-bar=72.9
	No, sorry.
	B.) M=74 X-bar=72
	No, Sorry.
	C.) M=79 X-bar=72.9
	D.) M=79 X-bar=72
	No, Sorry.

Hints:

Make sure that your data is correct and that you typed x=c(...). The c is important. Also, R is case sensitive, so x and X are different names.

Make sure you are reading your answers in the right order.

A survey asked twenty high school students how many times a week they eat fast food. The results were as follows: 3,1,0,7,2,2,5,0,1,4,6,5,2,3,4,7,0,1,0,0.
Enter the data into R.
What is the median number of times a week?
What is the mean number of times a week?

Multiple Choice:

	2,3
	No, sorry.
	2,3.5
	No, sorry.
	2,2.65
	3,2.65
	No, sorry.

Playing with the Stock Market

Stock markets are a matter of international attention. Massive amounts of money are lost and gained each day, as stock traders decide the value of a stock, and players of the stock market make gambles about these values. Our goal here isn't to get rich quick by making bets on the market; rather, we have academic desires: to learn some R commands using data from the stock market.

1.) Starting in with R

The R software can conveniently perform most of the calculations in statistics. Think of R as a calculator for statistics where the many dedicated buttons are replaced by a keyboard on which you type the commands for what you want to do.

Link: http://rss.acs.unt.edu/cgi-bin/R/Rprog

Starting R in Windows opens up a large window that will contain various subwindows: a command console for typing commands, windows for displaying graphs, data-editing windows, and help page windows.

Interacting with R is done in a question-and-answer manner: you ask questions and R answers.

You ask these questions by typing them in after the prompt:

>

For example, to see that R can be a calculator, type the following commands (not the prompt) and hit the Enter key:

> 2 + 2

[1] 4

> 5 * 6

[1] 30

> (3 + 2)^2

[1] 25

After a leading [1], R returns the correct answer. (The leading [1] will be explained later.) As you see, R uses +, -, *, /, and ^ for the usual math notations; and parentheses to group expressions.

Use R to find the value for:
52 · 17.75

Algebraic Expression:

923

Hints:

Make sure you use only the five symbols given for mathematical operations. R will be confused if you use X instead of *, for example.

2.) Working with data

Statistics is about analyzing data sets which likely will have more than one data point. Unlike most calculators, R works naturally with data sets.

The price of a share of stock fluctuates on a daily basis, some stocks more so than others. In January of 2004, the AT&T Wireless stock (symbol AWE) for AT&T's cellphone services had been in a big decline. In late January, though, word of a possible merger was released, changing how investors viewed the stock. (AT&T Wireless merged with Cingular in 2004.)

Data for the closing price of AT&T wireless stock for a few different Fridays are in Table 1.

What can we say about this data?

23-Jan-04	10.61	12-Dec-03	7.13
16-Jan-04	9.99	5-Dec-03	7.27
9-Jan-04	8.15	28-Nov-03	7.50
2-Jan-04	8.08	21-Nov-03	7.00
26-Dec-03	7.63	14-Nov-03	6.81
19-Dec-03	7.35	7-Nov-03	7.02

Table 1: Closing price of AT&T wireless stock

2.1- Storing data

Before doing anything, let's store the data into the computer for January and December.

We use the function c() to combine numbers into a data set. Simply separate the values with commas.

> c(10.61, 9.99, 8.15, 8.08, 7.63, 7.35, 7.13, 7.27)

[1] 10.61 9.99 8.15 8.08 7.63 7.35 7.13 7.27

The numbers were combined and then printed - then they were forgotten! Again, the [1] appears. This helps keep track of how many numbers are in the data vector (we call a variable that stores data a data vector). When there are several rows of numbers output, the number in square brackets indicates the position of the first number in that row.
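
To see the bracketed position markers in action, it can help to print a vector that is too long for one line. A minimal sketch (the vector long is made up for illustration, and where the rows break depends on the width of your console):

```r
# A made-up vector of 30 values, long enough to wrap on most consoles
long = seq(101, 130)
long

# Each output row starts with the position of its first value,
# for example [1] on the first row and a larger number on later rows
length(long)   # 30 values in total
```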

Functions in R are called using the function name, an opening parenthesis, any arguments, and then a closing parenthesis. Don't forget the parentheses. The output of a function is the name for what is returned.

We need to store the data so we can reuse it. To do this, we assign the output to a variable using an equals sign. The following will store the values into the variable called awe.

> awe = c(10.61, 9.99, 8.15, 8.08, 7.63, 7.35, 7.13, 7.27)

R is quiet after an assignment; only the prompt is returned. However, R was busy. Wherever the variable awe is used, R will refer to this dataset. For example, to see the values of a variable simply type its name:

> awe
[1] 10.61 9.99 8.15 8.08 7.63 7.35 7.13 7.27
Type in and store the numbers 7.97, 7.51, 5.94, 5.62 into the variable pcs.
Ask R to return pcs. Copy and paste what R's answer is.

Algebraic Expression:

	[1] 7.97 7.51 5.94 5.62

Hints:

Once R has your data set, simply type pcs into R and hit run program.

The data set should look like this when you enter it into R: pcs=c(7.97, 7.51, 5.94, 5.62)

[1] 7.97 7.51 5.94 5.62 is the answer.

Manipulating data using functions

In R, data sets are explored, summarized, and analyzed by applying functions to them. A basic usage looks like functionname(datasetname), though many functions have extra arguments to change their default behavior. Many things can be done with the output of a function. It may simply answer your question, you may want to store it for later use, or you may compose it directly with another function.

For the stock market, where there is so much data available, people are interested in summaries of the data. For example, maximum price, minimum price, and average price. R has functions max() and min() to find the maximum and minimum values in a data vector.

> max(awe)

[1] 10.61

> min(awe)

[1] 7.13

These are returned together with the range() function.

> range(awe)

[1] 7.13 10.61
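
The three uses of a function's output described at the start of this section (read it, store it, or pass it straight to another function) might look like this for the awe data. The name highest is an illustrative choice; round() and mean() are base R functions, and round() takes the number of decimal places as an extra argument:

```r
awe = c(10.61, 9.99, 8.15, 8.08, 7.63, 7.35, 7.13, 7.27)

# 1. Simply read the answer
max(awe)

# 2. Store the output for later use
highest = max(awe)
highest - min(awe)

# 3. Compose functions: the output of mean() becomes the input of round()
round(mean(awe), 2)
```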

Question: Use R to find the maximum and minimum of the variables sbux and pcs.

Multiple Choice:

	PCS: 7.97,5.62 SBUX: 24.76,23.98
	PCS: 8.01,6.3 SBUX: 25.6,22.98
	No, sorry
	PCS: 10.1,3.25, SBUX: 30.56, 21.77
	No, Sorry
	PCS: 5.62,7.97 SBUX: 24.76, 23.98
	No, Sorry

Hints:

Make sure to write the maximum first, followed by the minimum.

Make sure you are matching the maximum and minimum with the right data, and writing the exact number

The average value of a data set can be found several ways, as illustrated next. For the data in awe we can do it all by hand:

> (10.61 + 9.99 + 8.15 + 8.08 + 7.63 + 7.35 + 7.13 + 7.27)/8

[1] 8.27625

But why should we type the data values in when they are already stored in awe? We can let the computer do the addition using the sum() function:

> sum(awe)/8

[1] 8.27625

As well, rather than counting the eight numbers we added, we can let the computer find the length using length(awe):

> sum(awe)/length(awe)

[1] 8.27625

This works fine, but as finding the average is a common task in statistics, there is a built-in function, mean(), for this (in statistics, the average of a data set is called the sample mean):

> mean(awe)

[1] 8.27625

Question: Find the average of the data sets sbux and pcs

Multiple Choice:

	PCS:6.76 SBUX:24.34833
	PCS: 7.83 SBUX:26.45
	No, sorry
	PCS: 6.03 SBUX:23.3481
	No, sorry
	PCS: 7.11 SBUX:30.562
	No, sorry

Hints:

Make sure you match the results with the right variables

Make sure you write the exact numbers.

Continuation of problem 3...

The difference between the maximum and minimum values in a data set is sometimes referred to as the range of the data sets. There are several ways to find this. We can subtract the minimum from the maximum, or use the diff() function on the output of range().

For example,

> max(awe) - min(awe)
[1] 3.48

> diff(range(awe))

[1] 3.48

Question: Find the difference between the maximum and minimum values of the variables sbux and pcs.

Multiple Choice:

	2.35,0.78
	3.52,1.08
	No, sorry
	2.56,0.96
	No, sorry
	4.01,0.05
	No, sorry

Hints:

Make sure to be matching up the results with the right variable

Graphical views

R has several functions that produce graphics for viewing a data set. Enter the following into R:

> plot(awe)

After typing this command, a plot window should open up showing an admittedly boring plot. By default, this plots the numbers in the order they are typed in. The x-axis label, Index, refers to the position in the data vector of the data point.

Seems like the stock price is dropping, doesn't it? Well, not really: the stock numbers were typed in reverse chronological order. How can we reverse the numbers without retyping the data? R has a built-in function, rev(), to do so.
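
A short sketch of how rev() might be used here, with the awe values from above:

```r
awe = c(10.61, 9.99, 8.15, 8.08, 7.63, 7.35, 7.13, 7.27)

# rev() returns the values in the opposite order; awe itself is unchanged
rev(awe)

# Plot the prices oldest first, so time runs left to right
plot(rev(awe))
```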

Question: Make a reverse plot of the sbux data set. Are the reversed data positively correlated or negatively?

Note:
Another plotting function is barplot().

Multiple Choice:

	A.) Positively
	B.) Negatively
	No, sorry.
	C.) Neither
	No, sorry.

4.) Real data sets
All of the previous computer work could have been done by hand or with a calculator. To illustrate why a computer is a much better tool for statistics than a calculator, let's use bigger datasets. So big, you wouldn't even want to find the largest number by hand, let alone the average value. Rather than type the data in, we are going to let the computer do the work for us. However, you need to teach the computer how by typing the following exactly as shown (there are four capital letters):
> source("http://www.math.csi.cuny.edu/st/R/downloadStockData.R")
This command downloads a file from the Stem and Tendril website. The file defines a new function, downloadStockData(), that will fetch the previous year's worth of data on a stock courtesy of http://finance.yahoo.com. It only requires the user to provide the stock symbol. To illustrate, a year's worth of stock data for Yahoo! can be retrieved by using its symbol, "YHOO".
> yahoo = downloadStockData("YHOO")
> max(yahoo)
[1] 57.59
This shows the maximum closing value of the stock for the previous year at the time this project was made (October 26, 2010).
A plot (Figure 3) of the year's activities is produced as before:
> plot(yahoo)
From this graph we can see a lot about the history of the stock. For example, we can see that the minimum value occurred near index 130 and the maximum value occurred near index 50.

Download current stock data for Yahoo!. Answer the following:
What was the maximum price? minimum price? average price?

Ungraded Open Response:

The day-to-day differences in the stock price can be looked at by using the function diff(). This will form a new data vector containing the differences between successive days' values. For example, the command
> yahoo.diffs = diff(yahoo)
forms the differences and stores them into the data vector yahoo.diffs.
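
Because the download depends on an external site that may no longer be available, here is a minimal sketch of diff() on a small made-up vector (prices is hypothetical, not real Yahoo! data):

```r
# Hypothetical closing prices for five consecutive days
prices = c(20.10, 20.45, 19.80, 19.95, 21.00)

# diff() subtracts each value from the one after it,
# so the result has one fewer entry than the input
changes = diff(prices)
changes

max(changes)   # largest one-day increase
min(changes)   # largest one-day decrease (most negative change)
```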
For yahoo.diffs do the following: What was the largest increase in a given day? the largest decrease in a given day?

Ungraded Open Response:

Hints:

What is the largest increase in terms of the data set? The largest decrease?

The largest increase is the maximum value and the largest decrease is the minimum value.

Type max(yahoo.diffs) for the largest increase and min(yahoo.diffs) for the largest decrease

5.) Using indices
The entries in a data vector come with a natural order: the first, second, ..., nth. Being able to access the values by their index can extend the ways we can look at a data vector.
A single value in a data set can be accessed using square brackets, []. For example, suppose the closing values of the Dow Jones Industrial Average for a week were
10196 10243 10391 10433 10368
We can use indexing to subtract the week's first value from the last:
> dow = c(10196, 10243, 10391, 10433, 10368)
> dow[5] - dow[1]
[1] 172
This says the market went up 172 points during this week.
(Note that you use square brackets for data extraction, and parentheses for functions.)
More than one index can be referred to at once. The first and fifth days are pulled out with:
> dow[c(1, 5)]
[1] 10196 10368
Question 14: Copy the following data set into R:
> dow = c(55,60,65,70,71,90,65,78,76,88,55,67,35,56,45,76,85,87,56,87,56,46,76,45,56,75,66,77,56,55,45,45,64,75)
How many data points are there in the set (do not count manually)? What was the overall difference between the first and last point (do not calculate by hand)? Give your answer in the form: answer , answer.

Algebraic Expression:

34 , 20

Hints:

Use the length() function to find the number of data points in the set. Remember what we have named the set.

Use what you have learned about indices to find the difference between the first and last data points.

The first data point will be dow[1]. The last will be dow[x], where x is the length of the data set. To find the difference, subtract the first from the last.

The data set has 34 points and the value increased by 20 from the first point to the last.

Type in 34 , 20
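
The hints above can be put together as follows. This sketch uses length() both to count the values and to index the last one, so nothing is counted by hand (n is a name chosen for illustration):

```r
dow = c(55,60,65,70,71,90,65,78,76,88,55,67,35,56,45,76,85,
        87,56,87,56,46,76,45,56,75,66,77,56,55,45,45,64,75)

n = length(dow)     # number of data points
n

dow[n] - dow[1]     # last value minus first value
```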

5.1) Indices can also be logical expressions allowing one to question the data.

We use this data for dow.

> dow

[1] 10196 10243 10391 10433 10368

We can ask what days were more than 10,200 as follows

> dow > 10200

[1] FALSE TRUE TRUE TRUE TRUE

The answer is TRUE or FALSE for each value in the data vector dow. When using such answers as indices, the values corresponding to TRUE are returned.

> dow[dow > 10200]

[1] 10243 10391 10433 10368

Logical expressions used for indices must be the same length as the data vector. Other logical questions are possible using >, >=, <, <=, == (double equals signs), and ! for the negative.

Expressions can be combined using & (and) and | (or).

For example, values less than or equal to 10,400 are

> dow[dow <= 10400]

[1] 10196 10243 10391 10368

Both conditions are found with

> dow[dow <= 10400 & dow > 10200]
[1] 10243 10391 10368
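
The remaining operators mentioned above work the same way. A brief sketch using the five-day dow values (sum() counts the TRUE values because R treats TRUE as 1 and FALSE as 0):

```r
dow = c(10196, 10243, 10391, 10433, 10368)

# | (or): days below 10,200 or above 10,400
dow[dow < 10200 | dow > 10400]

# ! (not): days that were NOT above 10,200
dow[!(dow > 10200)]

# == (equal): was any day exactly 10,391?
dow == 10391

# Count how many days closed above 10,200
sum(dow > 10200)
```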

Question: Use the dow data from the previous question to find the indices of all the values greater than the last value.

Multiple Choice:

	A.) 6,8,9,10,16,17,18,20,23,28
	B.) 1,3,5,7,9,21,28,31
	No, sorry
	C.) 2,4,6,8,10,12,27,32
	No, sorry
	D.) 5,6,7,10,14,15,16,17,27,30
	No, sorry

Hints:

Remember the data is:(55,60,65,70,71,90,65,78,76,88,55,67,35,56,45,76,85,87,56,87,56,46,76,45,56,75,66,77,56,55,45,45,64,75)

Remember that the answer refers to the index (position) of each value, not the value itself. For example, the first value has index 1 and the last value in this data set has index 34.

Use dow > 75, since 75 is the last value.
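
One way to get the positions rather than the values, sketched with the 34-value data set, is to wrap the comparison in which() (a base R function that returns the indices of the TRUE entries). The name last is an illustrative choice:

```r
dow = c(55,60,65,70,71,90,65,78,76,88,55,67,35,56,45,76,85,
        87,56,87,56,46,76,45,56,75,66,77,56,55,45,45,64,75)

last = dow[length(dow)]     # the last value, without hard-coding 75

dow[dow > last]             # the values greater than the last value
which(dow > last)           # the positions (indices) of those values
```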

5.2) What index was that?
A natural question to ask is which index has a value that does something special. For example, when is something at its maximum, or minimum? The which() command can answer in terms of the index.
> which(dow == max(dow))
[1] 4
The answer is the index (or indices) where the data set dow is at its maximum value. Similarly, the indices where dow is at its minimum are found with:
> which(dow == min(dow))
[1] 1
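
Base R also provides which.max() and which.min() as shortcuts. A small sketch with the five-day dow values; note that, unlike which(), these return only the first position when the extreme value occurs more than once (the vector ties is made up to show this):

```r
dow = c(10196, 10243, 10391, 10433, 10368)

which.max(dow)   # position of the maximum
which.min(dow)   # position of the minimum

# Ties: which() reports every position, which.max() only the first
ties = c(3, 9, 9, 1)
which(ties == max(ties))
which.max(ties)
```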
For the 34-value data set from Question 14, which data point is the minimum, and which is the maximum?

Multiple Choice:

	A.) min=value 13; max=value 6
	B.) min=value 13; max=value 7
	No, sorry.
	C.) min=value 34; max= value 33
	No, sorry.
	D.) min=value 13; max=value 14
	No, sorry.

Hints:

The functions used in the text will be the exact ones needed for the problem.

R Cheat Sheet

Create a data set:

data=c(2,6,5,8,4,8,...)

Finding math functions:

Maximum: max(x)
Minimum: min(x)
Range: range(x)
Sum: sum(x)
Mean: mean(x)
Median: median(x)
Standard deviation: sd(x)
Variance: var(x)
Correlation: cor(x,y)
Quantiles (minimum, Q1, median, Q3, maximum): quantile(x)
Round x to n decimal places: round(x,n)
Histogram: hist(x)
Barplot: barplot(x)
Stemplot: stem(x)
Pie chart: pie(x)
Boxplot: boxplot(x,y)
Plot: plot(x) plot(x,y)
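
A quick sketch of several cheat-sheet commands applied to one made-up data set (the numbers in hours are hypothetical and chosen only for illustration):

```r
# Hypothetical data: hours of sleep for eight people
hours = c(6, 7.5, 8, 5, 7, 9, 6.5, 7)

sd(hours)                 # standard deviation
var(hours)                # variance, equal to sd(hours)^2
round(mean(hours), 1)     # mean rounded to 1 decimal place
quantile(hours)           # minimum, Q1, median, Q3, maximum

boxplot(hours)            # draws the five-number summary
```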

Question: A survey asked fifteen students how many hours they spend on Facebook daily. The results were as follows: 4,2,1,0,2,0,3,2,2,5,3,2,6,1, and 2. Using this R cheat sheet, make a boxplot, and then find the minimum, maximum, mean, median, and quantiles.

Multiple Choice:

	A) min=0, max=6, mean=2.3, median=2, quantiles=0,1.5,2,3,6
	min=0, max=6, mean=2, median=1, quantiles=1,2,3,4,5
	No, sorry
	min=0, max=5, mean=3, median=3, quantiles=0,1,2,4,6
	No, sorry
	min=1, max=2, mean=1.5, median=2.5, quantiles=0,.5,1,2.5,5
	No, sorry

Hints:

Make sure to type the commands exactly how you see them on the cheat sheet

Make sure to match the values with the correct command

What is extrapolation?

Multiple Choice:

	The line that makes the sum of squared vertical distances of the data points from the line as small as possible.
	Wrong
	The use of a regression line for prediction outside the range of values of the explanatory variable x used to obtain the line.
	The line that summarizes the relationship between two variables.
	Try Again
	The model that tells us about the dependence of the response variable y on the explanatory variable x.
	Wrong

x (third exam score)	y (final exam score)
65	175
67	133
71	185
71	163
66	126
75	198
67	153
70	163
71	159
69	151
69	159

The data shows scores on a final exam based on scores from a third exam. Use a calculator to plot the points and find the least-squares regression line. Round to the nearest hundredth.

Algebraic Expression:

-173.51+4.83x

Hints:

It is always important to plot a scatter diagram first

The line of best fit is a+bx

Type in -173.51+4.83x

Scaffold:

Insert the data into L1 and L2 of your calculator. Type in "yes" after you complete this step.

Algebraic Expression:

yes

Scaffold:

Calculate the linear regression equation through STAT - CALC - LinReg(a+bx).

Multiple Choice:

	a+bx
	ax+b
	Try Again.
	y-yhat
	Try Again.

Scaffold:

The answer you should get is -173.51+4.83x. Type this into the answer box.

Algebraic Expression:

-173.51+4.83x
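
The same line can be checked in R instead of on a calculator. A minimal sketch, assuming the eleven score pairs from the table above are typed in as third and final (names chosen here for illustration):

```r
third = c(65, 67, 71, 71, 66, 75, 67, 70, 71, 69, 69)
final = c(175, 133, 185, 163, 126, 198, 153, 163, 159, 151, 159)

plot(third, final)            # scatter diagram first
fit = lm(final ~ third)       # least-squares regression of final on third
abline(fit)                   # add the fitted line to the plot

round(coef(fit), 2)           # intercept a, then slope b
```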

Refer to the data in this link to solve the problem http://cnx.org/content/m17090/latest/. What would you predict the final score to be for a student who scored 66 on the third exam? Round answer to the nearest hundredth.

Multiple Choice:

	140.95
	145.27
	126
	You cannot reliably predict the final exam score for this student.

Hints:

Since 66 is between the x-values 65 and 75, substitute x=66 into the equation.

Refer to the data in this link to solve the problem http://cnx.org/content/m17090/latest/. What would you predict the final exam score to be for a student who scored an 80 on the third exam?

Multiple Choice:

	194.08
	198
	189.24
	You cannot reliably predict the final exam score for this student.
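
The two prediction questions above differ only in whether the x-value lies inside the observed range of 65 to 75. A sketch of that check, using the rounded equation from this problem (predict_final is a hypothetical helper written for this example, not a built-in function):

```r
# Predict a final exam score from a third exam score, refusing to extrapolate
predict_final = function(x, lo = 65, hi = 75) {
  if (x < lo || x > hi) {
    return(NA)   # outside the observed x-values: prediction is not reliable
  }
  -173.51 + 4.83 * x
}

predict_final(66)   # inside the range, so the line is used
predict_final(80)   # outside the range, so NA is returned
```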

The data below was obtained from the Centers for Disease Control and Prevention.

Year		Number of Cases
1994		343
1995		308
1996		237
1997		185
1998		145
1999		117
2000		117
2001		126
2002		113
2003		93
2004		84

The data shows the number of reported cases of HIV/AIDS in infants born to HIV-infected mothers from 1994 to 2004. Plot the points to find the least-squares regression line. Round to the nearest hundredth.

Algebraic Expression:

48745.52-24.3x

Hints:

Remember to insert the data and plot the points on your graph.

The linear regression line is in the form of a+bx.

The answer is 48745.52-24.3x

Refer to the data in the previous question. What would you predict the number of cases to be in the year 2000? Round answer to the nearest hundredth.

Multiple Choice:

	117
	121
	136
	You cannot reliably predict the number of cases for this year.

What would you predict the number of cases to be in the year 2004? Round answer to the nearest whole number.

Algebraic Expression:

48

Hints:

Substitute 2004 for x in the equation.

The answer is 48

Outliers are points that are far from the least-squares line and from other observations.
Outliers in the x direction often influence the least-squares regression line.

Multiple Choice:

	True
	False
	Wrong. Outliers often have large errors.

The data below was obtained from the Centers for Disease Control and Prevention.

Year		Total Cigarettes (billion)
1996		487
1997		480
1998		465
1999		435
2000		430
2001		425
2002		415
2003		400
2004		388
2005		381
2006		380

The data shows the adult yearly consumption of cigarettes in the United States from 1996 to 2006. Plot the points to find the least-squares regression line. Round to the nearest hundredth.

Algebraic Expression:

23110.06-11.34x

Hints:

Use your calculator to plot the data and calculate the linear regression line in the form of a+bx.

The answer is 23110.06-11.34x

Which variable is the explanatory variable, and on which axis should it be located?

Multiple Choice:

	Year, x-axis
	Year, y-axis
	Total cigarettes, x-axis
	Total cigarettes, y-axis

What is the observed value when the total cigarette value is 425?

Algebraic Expression:

2001

Hints:

Refer to the original data set.

The observed value asked for here is the year, the explanatory variable (x-axis).

The answer is 2001

Do outliers affect the accuracy of a least squares regression line?

Multiple Choice:

	Yes
	No
	WRONG!
	Sometimes
	NOPE!

Which of these is most influenced by outliers?

Multiple Choice:

	Mean
	Least Squares Regression Line
	NOPE!
	Mode
	TRY AGAIN!
	Median
	WRONG

Hints:

We decided that the line of least squares regression IS affected by outliers in the last question.

Use the following link that simulates a least-squares regression line:

http://hadm.sph.sc.edu/COURSES/J716/demos/LeastSquares/LeastSquaresDemo.html

Click on the buttons located on the right side of the page: "Show Residuals", "Show Squares", "Squares' Sum", "Residuals' Sum", and "LS Line".

The least-squares regression line of y on x is the line that makes the sum of the squared vertical distances of the data points from the line as small as possible.
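
To make this definition concrete, the sketch below compares the sum of squared residuals for the least-squares line with that of a slightly different line, using a small made-up data set. The helper sse() and the nudged slope are illustrative assumptions:

```r
x = c(1, 2, 3, 4, 5)
y = c(2, 4, 5, 4, 6)

fit = lm(y ~ x)
a = coef(fit)[1]   # intercept of the least-squares line
b = coef(fit)[2]   # slope of the least-squares line

# Sum of squared vertical distances from the line a + b*x
sse = function(a, b) sum((y - (a + b * x))^2)

sse(a, b)          # the least-squares line
sse(a, b + 0.2)    # a line with a different slope gives a larger sum
```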

Do you understand the function of a least-squares regression line?

Samuel L. Baker, "Least Squares Applet," hspm.sph.sc.edu. July 21, 2002. http://hadm.sph.sc.edu/COURSES/J716/demos/LeastSquares/LeastSquaresDemo.html

Multiple Choice:

	Yes
	No

Which of the following is NOT a property of the LSR Line?

source: student-made.

Multiple Choice:

	the sum of the residuals = 0
	The sum of the distances between each point and the LSR Line is minimized.
	The average x value and the average y value lie on the LSR Line
	The sum of squared residuals is minimized
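
Two of these statements can be checked numerically. A sketch using R's built-in women data set (all.equal() is used because the computed sums are only zero up to rounding error):

```r
fit = lm(weight ~ height, data = women)

# The residuals of a least-squares line sum to zero
all.equal(sum(resid(fit)), 0)

# The point (x-bar, y-bar) lies on the line
a = coef(fit)[1]
b = coef(fit)[2]
all.equal(unname(a + b * mean(women$height)), mean(women$weight))
```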

Let's start analyzing correlation and regression by using R with the built-in "women" data set.
First, go ahead and create a scatterplot of the women data by typing in:

women
plot(women)

Now we will find the correlation of the data. We can do that in R using the cor(x,y) function.
Make sure you have the following typed into R:

plot(women)
x=(women$height)
y=(women$weight)
cor(x,y)

What is the correlation of this data set?

source: student-made

Algebraic Expression:

.99

Hints:

Paste the code into R, and review the output for correlation.

The correlation should be .99.

Now we can add in the least squares regression line on our scatterplot.
Type in the following code into R:
x=(women$height)
y=(women$weight)
plot(x,y)
model=lm(y~x) ### This creates a linear model using the data from x and y
abline(model) ### This function adds the line to your scatterplot

Have you made a scatterplot with the least-squares regression line?

Multiple Choice:

	Yes
	No

Use the formulas below for the equation of a least squares regression line. Solve for the slope, b.
y-hat = a + b*x

where, b = r*(sd(y)/sd(x)) ### remember r is correlation, in R, r=cor(x,y).

B equals? (Round to the nearest hundredth).

Algebraic Expression:

3.43

Scaffold:

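First, use R to find the standard deviation of each variable in the data set. In current versions of R, sd() no longer accepts a whole data frame, so type sapply(women, sd) instead of sd(women).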
What is the standard deviation for weight (y)?

**Answer rounded to the nearest hundredth.

Algebraic Expression:

15.50

Scaffold:

Now what is the standard deviation for height (x)?
***Round to the nearest hundredth.

Algebraic Expression:

4.47

Scaffold:

Remember that correlation can be found using R as we did in the previous problem, and that r=the correlation.

What is, again, the correlation of this data set?

Algebraic Expression:

.99

Scaffold:

Plug all variables into the equation:

b = r*(sd(y)/sd(x))

and solve for b, the variable for slope.

What is the slope (round to the nearest hundredth)?

Algebraic Expression:

3.43

Let's solve for the y-intercept, a.
The formula is:
a=(y-bar) - b*(x-bar)

What is the y-intercept of our least squares linear equation? (Round to the nearest hundredth).

Algebraic Expression:

-87.52

Hints:

x-bar and y-bar refer to the means of the data, which can be found by typing the following into R (in current versions of R, mean(women) no longer works on a whole data frame):

colMeans(women)

The y-intercept is -87.52.
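
The slope and intercept formulas can be carried out in R without rounding the intermediate values. In this sketch the results agree with lm(); note that if r is first rounded to .99, the slope formula gives about 3.43, while the unrounded values give about 3.45:

```r
x = women$height
y = women$weight

r = cor(x, y)                 # correlation
b = r * sd(y) / sd(x)         # slope
a = mean(y) - b * mean(x)     # y-intercept

c(intercept = a, slope = b)

# Compare with the coefficients R computes directly
coef(lm(y ~ x))

# Effect of rounding r before using the formula
0.99 * sd(y) / sd(x)
```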

Finally, have R simply define the slope and y-intercept for us. Use the following code:

model=lm(y~x) ### this will create a least squares model for our data set
model ### this will output the slope and y-intercept for us

What is the output?

Multiple Choice:

	3.01, -92.13
	-87.52, 3.43
	-82.55, 3.56
	87.52, -3.01

An ice cream truck owner collects data on the number of sales made each day and the average temperature that day. He computes a regression line for predicting the number of sales based on how far the daily temperature is from freezing (32 degrees Fahrenheit) and finds sales = 0.22 + 1.8 X (degrees over 32 Fahrenheit). Identify the y-intercept.

source: No Author, "EBook Problems GLM Regress", WikiStatistics Book, January 8, 2009. http://wiki.stat.ucla.edu/socr/index.php/EBook_Problems_GLM_Regress

Algebraic Expression:

	We can't tell from the information given
	32
	.22
	1.8

Hints:

The equation for the least-squares regression line is y-hat (the predicted outcome) = a + bx, where a is the y-intercept, and b is the slope.

The y-intercept is 0.22.

Let's now use the cars data set to start us off with residual plots:
Load the data set. We are going to use the function called attach, and also insert a least-squares regression line.

### Start Code
cars
names(cars)
attach(cars) ### cars has two variables, speed & dist. This function
### allows us to simply call on the variable names without using
### the more complicated notation: cars$speed or cars$dist
plot(speed, dist)

model=lm(dist~speed) ### dist is the response (y), speed is the explanatory variable (x)
abline(model)
model

### End Code
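
As an alternative sketch that avoids both attach() and the $ notation, many R functions accept a data argument, and with() evaluates an expression inside a data frame. This is one possible style, not the approach the exercise asks for (model2 is a name chosen for this sketch):

```r
# Fit without attaching cars to the search path
model2 = lm(dist ~ speed, data = cars)   # dist is the response, speed is explanatory

with(cars, plot(speed, dist))            # scatterplot of dist against speed
abline(model2)                           # add the least-squares line

coef(model2)                             # intercept and slope
```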
Now, using the cars data set, use the attach function to plot speed vs. distance. Make sure you don't use the $ anywhere in the code.
Question: Do you understand the attach function and the least-squares regression line?

source: student-made.

Multiple Choice:

	Yes
	No

Now, create a residual plot to determine the fit of the regression line. We will also create the scatterplot with the least-squares regression line (LSRL) to better understand the relationship between the LSRL and the residual plot. Copy and paste the following code into R.
### Code
plot(speed, dist)
abline(model)
residuals=model$residuals
residuals
### now let's have R create a residual plot
plot(speed, residuals, main="Residual Plot")
abline(h=0) ### let's draw a horizontal line at 0
### End Code
What does the horizontal line at 0 on the residual plot represent?

Multiple Choice:

	It has no meaning.
	The horizon of the residual plot.
	The x-axis.
	The least-squares regression line.

Hints:

Look at the data that falls above and below the horizontal line at 0, and picture where it would appear on a regular scatterplot.

If you were to flip the regression line completely horizontal, without changing the data points, you would see that on a Residual Plot, the line at 0 represents the regression line of the data set it accompanies.
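
Each residual is the observed y-value minus the value the line predicts, which is why the regression line itself corresponds to residual = 0. A short check of that arithmetic on the cars data (fitted() and resid() are base R functions; model3, predicted, and by_hand are names chosen for this sketch):

```r
model3 = lm(dist ~ speed, data = cars)

predicted = fitted(model3)            # y-hat for each observed speed
by_hand = cars$dist - predicted       # observed minus predicted

# The hand-computed residuals match the ones stored in the model
all.equal(unname(by_hand), unname(resid(model3)))
```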

Let's see how well our LSRL fits our data.
Do you see a pattern in the residual plot?

Multiple Choice:

	The data seems random.
	It's in a U-Shaped pattern.

Referring to the cars residual plot, do you feel that the linear model is most appropriate for this data set?

Multiple Choice:

	No.
	Yes.

Hints:

If the residuals do not show a distinct pattern and seem to fall equally above and below the horizontal line at 0, then a linear model is a reasonable fit for the data.


[Figure: three residual plots, labeled A, B, and C. A: random pattern. B: non-random, U-shaped curve. C: non-random, inverted U.]
Which residual plot represents the most accurate data set?

image source: No Author, "Statistics and Probability Glossary," StatTrek, 2010, http://stattrek.com/Help/Glossary.aspx?Target=Residual%20plot

Multiple Choice:

	A
	B
	C

Let's see how outliers affect a data set and the regression line.
Put random scatterplot data into R by pasting in the following code:
x=c(3,5,7,9,11)
y=c(5,8,9,14,18)
plot(x,y)

Now, add in the regression line:

model=lm(y~x)
abline(model)

Does the data appear to have a very accurate regression line?

source: student-made.

Multiple Choice:

	Yes
	No

Now, let's throw in an outlier.
Copy this code into R:
x=c(3,5,7,9,11,4)
y=c(5,8,9,14,18,22)
plot(x,y)

model=lm(y~x)
abline(model)

Has the regression line stayed the same?

Multiple Choice:

	No
	Yes

In response to the above questions, outliers DO change the regression line equation.

Type in model after the previous coding, and find out the equation for the new line.
What is the equation?

Multiple Choice:

	y = .76x + 7.7
	y-hat = .76x + 7.7
	y = 7.7x +.76
	y-hat = 7.7x + .76
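
To see the size of the change, the coefficients of the two fits can be placed side by side. A sketch using the same data (the names fit_without and fit_with are illustrative):

```r
# Original five points
x1 = c(3, 5, 7, 9, 11)
y1 = c(5, 8, 9, 14, 18)
fit_without = lm(y1 ~ x1)

# Same points plus the outlier (4, 22)
x2 = c(x1, 4)
y2 = c(y1, 22)
fit_with = lm(y2 ~ x2)

# Row 1: intercepts, row 2: slopes
cbind(without_outlier = unname(coef(fit_without)),
      with_outlier = unname(coef(fit_with)))
```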

An influential point is a point that affects the coefficients of the regression line.
We will explore this using R, by making a scatterplot that includes an influential point.
Open R, and put in the following code:
x=c(1,2,3,4,5,6)
y=c(5,7,8,14,18,19)
plot(x,y)

You should get a scatterplot that appears visibly linear.
Now let's add in a regression line:

model=lm(y~x)
abline(model)

To find the equation of the line, add in:

model

What is the equation of this linear regression line?

source: student-made.

Multiple Choice:

	y-hat = 3.11x + .93
	y-hat = .93x + 3.11
	y = .93x + 3.11
	y = 3.11x + .93

Hints:

Typing model in R prints the intercept and the coefficient (slope) of x.

Plug these coefficients into the equation y-hat = a + bx.

You should get: y-hat = 3.11x + .93

Now, let's add in an influential point, by putting this data set into R:
x=c(1,2,3,4,5,6,18)
y=c(5,7,8,14,18,19,45)
plot(x,y)

The scatterplot should still appear to be linear.
Now let's add in a regression line again:

model=lm(y~x)
abline(model)

Find the equation of this regression line by typing model a second time.

Multiple Choice:

	y = 2.4x + 3.4
	y-hat = 3.4x + 2.4
	y-hat = 2.4x + 3.4
	y = 3.4x + 2.4

Did the influential point change the equation of the regression line completely?

Multiple Choice:

	Yes
	No

Correlation is resistant or not resistant to a few outlying observations?

source: student-made.

Multiple Choice:

	resistant
	not resistant

What is true about the Least-Squares Regression Line (LSRL)?

Multiple Choice:

	It minimizes the sum of the residuals
	no, sorry
	It minimizes the sum of the squared residuals
	It must pass through the median of both variables
	no, sorry. The LSRL must pass through the point with the means of both data sets.
	There is more than one LSRL for any given set of data.
	No, sorry. There is only one LSRL for a set of data.

If the correlation of a set of data is 0, what does this tell you about the data?

Multiple Choice:

	A correlation of 0 is impossible
	The slope of the LSRL will be 0
	there is no relationship between the variables
	there is no linear relationship between the variables
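
A quick experiment can help here. The data below are made up for illustration: y is completely determined by x, yet the relationship is curved rather than linear.

```r
x = c(-3, -2, -1, 0, 1, 2, 3)
y = x^2          # y depends entirely on x, but not along a straight line
plot(x, y)
cor(x, y)        # the correlation only measures the linear part of the relationship
```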

Correlation measures:

Multiple Choice:

	the strength of a linear relationship
	the strength of any relationship
	the strength and direction of any relationship
	the strength and direction of a linear relationship

Scaffold:

Is a negative correlation possible? If so, what does it mean?

Multiple Choice:

	Not possible
	Yes; the more negative the correlation, the weaker the relationship
	no, the strength of the relationship is measured by the absolute value of the correlation. The closer this is to 1, the stronger the relationship.
	Yes; a negative correlation means the data has an inverse relationship
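
To see a negative correlation in R, try a data set where y falls as x rises. The values and the names hours_tv and score below are invented for illustration only.

```r
hours_tv = c(1, 2, 3, 4, 5, 6)        # made-up explanatory values
score = c(95, 90, 82, 80, 71, 65)     # made-up responses that fall as x rises
r = cor(hours_tv, score)
r        # negative, because score decreases as hours_tv increases
abs(r)   # the absolute value measures the strength of the linear relationship
```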

Scaffold:

Which correlation corresponds to the data set with the strongest relationship?

Multiple Choice:

	.87
	.91
	-.91
	-.92

Scaffold:

Which correlation corresponds to a data set with a LSRL with a slope of -.5?

Multiple Choice:

	-.5
	No, sorry. The correlation has the same sign as the slope, but it does not tell us the slope's value.
	.5
	No, sorry. A line with a negative slope cannot have a positive correlation, and correlation does not tell us the slope's value.
	2
	No, sorry. A correlation of 2 is not possible.
	none of these are correct.
	No, sorry.
	Slope cannot be determined by correlation alone.
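
The sketch below illustrates why the slope cannot be read off the correlation. It uses made-up data and simply rescales y (as if the units had changed); the name y_scaled is hypothetical.

```r
x = c(1, 2, 3, 4, 5, 6)
y = c(12, 10, 9, 7, 4, 3)   # made-up data
y_scaled = 10 * y           # same pattern, measured in units ten times smaller

cor(x, y)
cor(x, y_scaled)            # same correlation as before

slope = coef(lm(y ~ x))[["x"]]
slope_scaled = coef(lm(y_scaled ~ x))[["x"]]
c(slope, slope_scaled)      # the second slope is ten times the first
```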

True or False:
An influential point greatly affects the slope of the LSRL and always lowers the correlation coefficient.

Multiple Choice:

	True
	No, sorry. Influential points might actually increase the correlation coefficient.
	False
	I don't know.

Scaffold:

An observation is considered influential if adding or removing it would change the result of the LSRL or correlation coefficient significantly.

Most influential points are outliers in either their x or y values.

Let's look at the data sets we've been using and determine whether they might have influential points:

women
attach(women)
plot(height, weight)
###end code

Does this data look like it has any influential points?

Multiple Choice:

	yes
	no

Hints:

Are there any points that seem like they have outliers for their x or y value?

Most of the points seem pretty evenly spaced, so none of them are outliers in their x or y values.

This means there don't seem to be any influential points.

Scaffold:

Good. Now let's look at our next data set:
Orange
attach(Orange)
plot(age, circumference)
###end code

Does this data look like it has any influential points?

Multiple Choice:

	Yes
	No
	I don't know.

Hints:

Are there any points that seem like they have outliers for their x or y value?

None of the points seem very far away from the rest of the data, so none of them are outliers in their x or y values.

This means there don't seem to be any influential points.

Scaffold:

Good. Now let's look at our last data set:
cars
attach(cars)
plot(speed,dist)
###end code
Of course, at this point we can't tell if a point is influential unless we plot the data set with and without the point and compare, but:
Compared to the other data sets, does this data look like it has any influential points?

Multiple Choice:

	Yes
	No

Hints:

Are there any points that seem like they have outliers for their x or y value?

Look for points in the corners of the graph. Do those points seem far away from the rest of the data?

Look at the point in the top right corner. Does it seem separated from the rest of the data?

The point is separated from the rest of the data.

This seems like it might be an influential point.
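
To check whether a suspicious point really is influential, fit the line with and without it and compare. The sketch below assumes the point in question is the observation with the largest stopping distance; the names full, reduced, and i are made up for illustration.

```r
full = lm(dist ~ speed, data = cars)

i = which.max(cars$dist)      # row of the point with the largest stopping distance
cars[i, ]                     # show which observation is being dropped
reduced = lm(dist ~ speed, data = cars[-i, ])

# intercept and slope with and without that point
rbind(full = coef(full), reduced = coef(reduced))
```

If the two rows are close, the point has little influence; a large change in slope or intercept suggests it is influential.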

What is the definition of 'residual'?

Multiple Choice:

	the points along the LSRL
	y minus y-hat
	y-hat minus y
	the points of the data set

What does it mean if the residual for a given point in a set of data is -.2?

Multiple Choice:

	a negative residual is not possible
	the predicted is .2 units away from the observed value in either direction
	the predicted is .2 units above the observed value
	the predicted is .2 units below the observed value

Hints:

Recall the definition of residual from the last problem...

...which is y minus y-hat

y is the observed value
y-hat is the predicted value
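
The same definition can be checked in R. This sketch reuses the six-point data set from the influential-point problem purely as an example; fitted() and resid() are standard R functions, and the names y_hat and residual are chosen here for illustration.

```r
x = c(1, 2, 3, 4, 5, 6)
y = c(5, 7, 8, 14, 18, 19)
model = lm(y ~ x)

y_hat = fitted(model)      # predicted values on the line
residual = y - y_hat       # observed minus predicted
round(residual, 2)

# wherever the residual is negative, the predicted value is above the observed value
y_hat[residual < 0] > y[residual < 0]
```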

Using a TI-83, create a scatter plot of the data provided by Seattle Central's Quantitative Environmental Learning Project (data set #018). On the TI-83, start by going to stat->edit and inputting the X data in L1 and the Y data in L2.

source: US Department of Energy
Total Number of Alternative-Fueled Vehicles In Use in the United States
year	number
1992	251,352
1993	314,848
1994	324,472
1995	333,049
1996	352,421
1997	367,526
1998	383,847
1999	407,542

What type of trend does this data set reveal?

Multiple Choice:

	exponential
	linear
	logarithmic
	quadratic

This data reveals a linear pattern, which allows us to derive a formula from the data in order to predict the number of alternative-fueled vehicles in future years. In order to do this, we must find the equation of the linear regression, or "line of best fit", which is in the form y=ax+b.
After inputting the data into your calculator (stat->edit X and Y), we can find the linear regression by going to (stat->calc->4. LinReg(ax+b)). In this form, a is the slope of the line and b is the y-intercept.

What is the value of a (slope) and b (y-intercept) rounded to the nearest tenth?

Multiple Choice:

	a = -37355464.71 ; b = 18891.18
	a is NOT the y-intercept!
	a = 18891.18 ; b = -37355464.71
	Round the values!
	a = 18891.2 ; b = -37355464.7

Now we can use this linear regression equation in order to predict values in the future. Let's start by finding the number of alternative-fueled cars in the year 2005. Because we do not have an actual figure for the year 2005, we call this EXPECTED value "y-hat".
In order to find the number of alternative-fueled cars in the year 2005, we simply start by looking at the equation of the linear regression, or "best fit line", which will tell us the predicted value of Y (y-hat) for any given value of X.
y-hat = 18891.2 x - 37355464.7
y-hat = 18891.2 (2005) - 37355464.7
y-hat = 521391.3
Because you cannot have .3 of a car, the EXPECTED value for the year 2005 is 521391 alternative-fueled cars.
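
For readers working in R rather than on the TI-83, the same calculation might look like the sketch below. The variable names year, number, fit, a, and b are made up here. Note that predict() uses the unrounded coefficients, so its result differs slightly from the hand calculation with the rounded equation.

```r
year = 1992:1999
number = c(251352, 314848, 324472, 333049, 352421, 367526, 383847, 407542)
fit = lm(number ~ year)
coef(fit)

# round the coefficients to the nearest tenth, as in the worked example
b = round(coef(fit)[["(Intercept)"]], 1)
a = round(coef(fit)[["year"]], 1)
a * 2005 + b                                      # hand calculation with rounded coefficients

predict(fit, newdata = data.frame(year = 2005))   # prediction from the unrounded fit
```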

Referring to the example problem above, how many alternative-fueled cars can we expect to see in the year 2010?

Multiple Choice:

	615847.3
	Remember that you cannot have a fraction of a car!
	615847
	615848
	Remember not to round the number up, because .3 is less than .5.
	596956
	We are looking at the year 2010, not 2009!

Hints:

Plug the year value (2010) into the X position in the linear regression equation above.

A linear regression also lets us see how closely the line matches the data collected. In order to compare the observed and the expected values of a data set, you must compare the y and y-hat values for a certain point.
Let's compare the observed and the expected values for two different years, 1992 and 1994.
Example 1:
X=1992, Y=251352

In order to compare the observed and expected values for 1992, we must first find the expected value (y-hat) because we already know that the observed value (Y) is 251352. Let's solve for y-hat by using our linear regression equation.

y-hat = 18891.2 x - 37355464.7
y-hat = 18891.2 (1992) - 37355464.7
y-hat = 275805.7

Because .7 is greater than .5, we can round our expected value to 275806.

What is the Expected value for 2007?

Multiple Choice:

	559173.7
	We can't have a fraction of a car!
	559174
	559173
	Must round up, not down!

Let's try to find another EXPECTED value for this data set. Let's predict the number of alternative-fueled cars in the year 2020.

Once again, start by plugging 2020 in for X in the linear regression equation.

y-hat = 18891.2 x - 37355464.7
y-hat = 18891.2 (2020) - 37355464.7

Now, we can simply solve this problem using algebra and find that...

y-hat = 804759.3 for the X value 2020

*IMPORTANT: Remember that we cannot have a fraction of a car, so when we round this value, our y-hat becomes 804759.

What is the expected number of alternative-fueled cars in the year 2030?

Algebraic Expression:

	993671.3
	We cannot have a fraction of a car!
	993671
	993672
	.3 is not greater than .5 so therefore we do not round up
	1012563
	We are looking for the expected value for the year 2030, not 2031

Not only is the linear regression equation used to predict future values, it is also used to test the accuracy, and linearity, of the data.

Let's take a look at a previous example...

In question 4, we found the EXPECTED value for the year 1992.
y-hat = 18891.2 x - 37355464.7
y-hat = 18891.2 (1992) - 37355464.7
y-hat = 275805.7

rounded, y-hat = 275806 for the year 1992.

By looking back at the original data collected, we can see that data was collected for the year 1992 and the number of alternative-fueled cars was 251352. Now we have both an observed and an EXPECTED value for 1992, but just how can this information help us test the linearity of the data?

Residuals are what help us test the linearity of a data set. By definition, a residual is the difference between an observed value of the response variable (Y-axis) and the value predicted by the regression line.

*Formula: Residual = observed Y - predicted Y
          r = Y - (Y-hat)

Let's find the value of the residual for the year 1992.

Observed = 251352
Expected = 275806

r = Y - (Y-hat)
r = 251352 - 275806
r = -24454
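
The same arithmetic can be checked in R. This is only a sketch of the 1992 calculation using the rounded regression equation from this problem set; the names observed and predicted are chosen here for illustration.

```r
observed = 251352                                  # number of vehicles recorded for 1992
predicted = round(18891.2 * 1992 - 37355464.7)     # y-hat, rounded to a whole car
observed - predicted                               # residual = Y - (Y-hat)
```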

What is the value of the residual for the year 1994?

Algebraic Expression:

Scaffold:

In order to find the residual value for 1994, we must first find the observed value. We can easily find this by looking back at our data chart ...

source: US Department of Energy
Total Number of Alternative-Fueled Vehicles In Use in the United States
year	number
1992	251,352
1993	314,848
1994	324,472
1995	333,049
1996	352,421
1997	367,526
1998	383,847
1999	407,542

What is the observed value for 1994?

Algebraic Expression:

	324472
	333049
	that is the observed value for 1995
	314848
	that is the observed value for 1993
	251352
	that is the observed value for 1992

Scaffold:

The next step in finding the residual value is to find the expected value for the year 1994. Remember, you need the linear regression equation to find the expected value!
Earlier in the problem set, we found that the linear regression equation was ...

y-hat = 18891.2 x - 37355464.7

Now, we just plug the year (1994) into the X-value in this equation to obtain our expected value.

y-hat = 18891.2 (1994) - 37355464.7

What is the expected value for 1994?
*Remember to round to the nearest whole number because we cannot have a fraction of a car!

Multiple Choice:

	313588.1
	Remember to round!
	294696.9
	the expected value for 1993
	313588
	294697
	Rounded expected value for 1993

Scaffold:

To find the residual value, we must take the observed value and subtract the expected value. Let's refer back to the residual equation, which was presented in an earlier problem...

r = Observed Y - Expected Y
r = Y - (y-hat)

Now, let's find the residual value for 1994.

X = 1994
Y = 324472
y-hat = 313588

Plug the values into the residual equation.

r = Y - (y-hat)
r = 324472 - 313588

What is the value of the residual?

Multiple Choice:

	-10884
	Y - (y-hat) not (y-hat) - Y
	10884

To prove your strength in finding residual values, what is the residual value for the year 1996?
*REMEMBER - residual = Y - (y-hat)

Multiple Choice:

	1050.5
	-1050.5
	It's not negative!
	1051
	Don't round!
	-1051
	It's not negative, and don't round!

What is the residual value for the year 1998?
*REMEMBER - residual value = Y - (y-hat)

Multiple Choice:

	5305.9
	Don't forget the negative sign!
	-5305.9
	5306
	Don't forget the negative sign, and don't round!
	-5306
	Don't round!

Now that we know what residuals are, let's look at the bigger picture. Residuals are very useful when they are shown on a graph. Graphs showing data based on residual values are known as "residual plots".

In order to graph a residual plot, keep the X-values the same (on the x-axis) and, instead of graphing the observed Y-values on the Y-axis, graph the residual values on the Y-axis.

Let's take a look at this using a TI-83. In order to graph the residual plot for this data set, make sure that you enter the original data into the calculator by going to STAT --> EDIT --> and listing the X-values in L1 and the observed Y-values in L2.

Even though we now know how to calculate a residual value, it would be very time consuming to do that for every X-value in this data set; luckily, our TI-83 can do this quickly and easily for us.

In order to find the residual values using the calculator, go to STAT --> EDIT --> scroll over and highlight L3 --> 2nd STAT (LIST) --> 7 RESID --> ENTER. Now we have the residual values stored in L3 of our calculator.

So how do we see this on a graph? In order to see the residual plot on our calculator we start by going to 2nd Y= (STAT PLOT) --> 1 PLOT 1 --> ENTER --> ON & ENTER --> scroll down to Xlist and make sure it says L1 --> scroll down to Ylist and change it to L3 by hitting 2nd 3. Now we can graph the plot by hitting GRAPH. In this format, you may not be able to see the graph, but this problem can be solved by hitting ZOOM and choosing option 9. ZoomStat.

Now we can clearly see our residual plot. Residual plots help us judge linearity, but how do we know when a graph is linear or not based on its residual plot?

1. If the points in the residual plot are scattered randomly above and below zero with no clear pattern, we can say that our data is linear.
2. If the points in the residual plot form a clear pattern (parabola, repeating pattern, etc.), then we can say that our data is not linear.
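
If you have R available, a residual plot for the same data could be drawn with a sketch like the one below. The names year, number, and fit are made up for illustration; resid() returns the residuals of the unrounded fit, so the values may differ slightly from hand calculations with the rounded equation.

```r
year = 1992:1999
number = c(251352, 314848, 324472, 333049, 352421, 367526, 383847, 407542)
fit = lm(number ~ year)

plot(year, resid(fit), ylab = "RESIDUALS")   # x-values stay the same, residuals go on the y-axis
abline(h = 0)                                # reference line at residual = 0
```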

By looking at the residual plot for the alternative-fueled cars data set that we just produced on our calculator, can we say that our data is linear?

Multiple Choice:

	maybe
	yes
	no
	I can't tell from this graph

Let's look at a data set comparing women's height and weight:
women
attach(women)
x= height
y= weight
plot(x,y) ###plots the data set
lm(y~x) ###gives us the linear regression equation for the data set
cor(x,y) ###gives us the correlation of the data set

model= lm(y~x)
abline(model) ###plots the least-squares regression line
resid= resid(model)
plot(x,resid) ###plots the residuals
abline(0,0) ###draws a horizontal line at y=0
###end code
Is the LSRL a good representation of the data?

Multiple Choice:

	Yes
	No
	I don't know.

Scaffold:

Ok let's break this down.
Where would we look to see if the LSRL is a good fit?

Multiple Choice:

	r (the correlation)
	no, sorry
	the data plot
	no, sorry
	the residual plot

Scaffold:

Good!
So now let's look at the residual plot. (the second graph you created on R)
Do we see a clear pattern in the residual plot?

Multiple Choice:

	Yes
	No
	No,sorry. Look again.

Scaffold:

Good!
Now that we see a pattern in the residual plot, what does this mean about the fit of the LSRL?

Multiple Choice:

	The pattern means the LSRL is a good fit.
	no, sorry
	The pattern means the LSRL is not a good fit.

Let's look at a set of data in R comparing the age of orange trees and their size:

Orange ###shows us the data set
x= Orange$age ###defines age as the explanatory variable
y= Orange$circumference ###defines circumference as the response variable
plot(x,y) ###plots the data set
lm(y~x) ###gives us the linear regression equation for the data set
cor(x,y) ### gives us the correlation of the data set

model= lm(y~x)
abline(model) ###plots the least-squares regression line
###end code
Is this data linear?

Multiple Choice:

	Yes
	No
	I don't know.

Scaffold:

What is the correlation of this data set?
Round to the nearest thousandth.

Algebraic Expression:

	.914
	.913
	No, check your rounding.

Hints:

the command to find correlation in R is cor(x,y)

find the number R gives you after cor(x,y)

type in .914

Scaffold:

Good. Let's plot the residual plot for this data.
resid= resid(model)
plot(x,resid) ###plots residuals
abline(0,0) ###draws a horizontal line at y=0
###end code
Does there seem to be a pattern with the residual plot, or is it random?

Multiple Choice:

	Random
	Pattern
	sorry, look again

Scaffold:

Good. There is no pattern in the residual plot.
Taking into account the strong correlation (.914) and the lack of pattern in the residual plot, what does this say about the LSRL?

Multiple Choice:

	Despite the strong correlation, the lack of pattern shows the LSRL is a good representation of the data.
	Despite the strong correlation, the lack of pattern shows the LSRL is not a good representation of the data.
	The strong correlation and the lack of pattern shows the LSRL is a good representation of the data.
	The strong correlation and the lack of pattern shows the LSRL is not a good representation of the data.

Scaffold:

Let's look at correlation and its relationship with linear data.
Which of the following is true about correlation?

Multiple Choice:

	All data sets with a strong pattern will have a strong correlation.
	No, sorry
	Data sets with strong linear relationships will have strong correlations.
	Data sets with strong nonlinear relationships will have strong correlations.
	No, sorry

Scaffold:

Now that we know the signs of a strong linear relationship in correlation and residual plots:
Taking into account the strong correlation and the lack of pattern in the residual plot, what does this say about the LSRL?

Multiple Choice:

	Despite the strong correlation, the lack of pattern shows the LSRL is a good representation of the data.
	Despite the strong correlation, the lack of pattern shows the LSRL is not a good representation of the data.
	The strong correlation and the lack of pattern shows the LSRL is a good representation of the data.
	The strong correlation and the lack of pattern shows the LSRL is not a good representation of the data.

What does the LSRL tell us we can assume from the data?

Multiple Choice:

	As the tree ages one day, its circumference increases by about .1068 mm
	As the tree ages .1068 days, its circumference increases by 1 mm
	no, sorry
	As the tree ages by 1 day, its circumference increases by 17.3997 mm
	no, sorry
	As the tree ages by 17.3997 days, its circumference increases by 1 mm.
	no, sorry

Hints:

Look at the coefficients of the LSRL listed under "Coefficients" as "(Intercept)" and "x"

The intercept is the y-intercept of the equation (where x=0) and the coefficient of x is the slope.

Since the coefficient of x is the slope (which is rise over run), it tells us how much the circumference (the response variable) would change when the age (the explanatory variable) increases by one.
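
As a rough illustration of this hint, the slope can be extracted and scaled in R. The sketch below fits the model straight from the built-in Orange data frame (where, per R's documentation, age is recorded in days and circumference in mm); the name b is chosen here for illustration.

```r
model = lm(circumference ~ age, data = Orange)
b = coef(model)[["age"]]   # slope: predicted change in circumference per one-unit increase in age
b
b * 100                    # predicted change in circumference over 100 units of age
```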

Stefano wants to determine if the size of a watermelon has a linear relationship with how much sunlight it receives. In order to accomplish this, he is growing his own watermelons, allotting them certain amounts of sunlight. What is the equation for the least-squares regression line? Use the first column (weight) as x and the second column (sunshine) as y, and round to the nearest hundredth.
Weight (lbs)                  Sunshine (hours daily)
2.1                               4
2.3                               4.5
2.7                               5
2.9                               5.5
3.1                               5.7
3.1                               5.8
3.3                               7
3.6                               7.4

Multiple Choice:

	y = 2.23x - 0.82
	y = 2.2x - 0.8
	Round to the nearest hundredth.
	y = -0.82x + 2.23
	y = -0.8x + 2.2
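
When a table like this is entered into R, the choice of which column is x matters. The sketch below assumes the table above and fits the line both ways; the variable names weight and sunshine are chosen here for illustration.

```r
weight = c(2.1, 2.3, 2.7, 2.9, 3.1, 3.1, 3.3, 3.6)     # lbs
sunshine = c(4, 4.5, 5, 5.5, 5.7, 5.8, 7, 7.4)         # hours daily

# weight as x, sunshine as y (the order the columns appear in the table)
round(coef(lm(sunshine ~ weight)), 2)

# sunshine as x, weight as y (sunlight used to explain weight)
round(coef(lm(weight ~ sunshine)), 2)
```

The two fits give different equations, so it is worth stating which variable is explanatory before reporting a line.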

Ms. Lincoln is noticing trends in the tests she gives to her history class. She gave a survey after a test that asked her students how long they studied. She then compared the surveys with the scores to determine if they displayed a linear relationship.

Hours Studied              Score
0.5                               72
0.5                               78
1                                  81
1                                  83
1                                  87
1                                  89
2                                  88
2                                  91
3                                  94

What is the equation for the least squared regression line? Round to the nearest hundredth.

Multiple Choice:

	y = 6.85x + 75.65
	y = 6.8x + 75.6
	Round to the nearest hundredth.
	y = 75.65x + 6.85
	y = 75.6x + 6.8

Jane is trying to determine if there is a relationship between age and the amount of sleep a person gets. She surveyed classmates at her high school, adults she knows from her job at the community center, as well as some of her younger brother's friends. She did this in order to obtain a broad range of data:

Age                              Hours of Sleep
7                                  9
8                                  10
8                                  9
16                                7
16                                6
17                                6
26                                7
29                                7
33                                6
37                                9
43                                8

What is the equation for the least squared regression line? Round to the nearest hundredth.

Multiple Choice:

	y = -0.03x + 8.3
	y = 8.3
	Round to the nearest hundredth.
	y = 8.3x - 0.03
	y = -8.3x

A curved pattern in a residual plot shows that the relationship of the data is...

Multiple Choice:

	not linear
	linear
	it doesn't tell you anything about the data
	it always requires a logarithmic transformation of the data

What is the impact of an outlier on a regression line?

Multiple Choice:

	Since it is an outlier, the regression line ignores it
	The line tries to have the least amount of variance for all the data, including the outlier.
	Outliers usually influence the least-squares regression line
	An outlier will only be influential if it touches the regression line

The removal of an influential observation...

Multiple Choice:

	a. has little effect on the regression line
	b. has a significant impact on the direction of the regression line
	c. none of the above

Lynn, a local real estate agent, is trying to determine if the size of a house can be used to predict its sale price. In order to determine if there is a relationship, she observed six recent sales in her neighborhood.
House Size                  Price
1,503                           $162,000
1,272                           $135,000
2,216                           $240,000
1,861                           $195,000
1,017                           $125,000
2,400                           $262,000
What is the least-squares regression line? Round to the nearest hundredth.

Multiple Choice:

	y = 102.52x + 11029
	y = 102.5x + 11029
	y = 11029x + 102.52
	y = 11029x + 102.5

Amanda is trying to figure out how many viewers a television show will have in 2011. The show started in 2000, but Amanda is using that as her zero point, and using a data set that looks like this (with viewers in millions):

###start code
x=seq(1,10,by=1)
y=c(6,4.3,4.1,3.8,3.7,3.4,3,2.9,2.5,2.1)
model=lm(y~x)
plot(x,y)
abline(model)
model
###end code

What is the LSRL (exactly as it appears in Rweb and in a+bx form)?

Algebraic Expression:

	5.4600-0.3418x
	-0.3418x+5.4600
	That is ax+b form.

What is the predicted number of viewers for 2011, as it appears on your calculator?

Algebraic Expression:

1.7002
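
The prediction can also be checked in R. Because 2000 is the zero point, 2011 corresponds to x = 11. This sketch repeats the data from the problem; the hand calculation uses the printed (rounded) coefficients, while predict() uses the unrounded fit, so the two can differ in the last decimal places.

```r
x = seq(1, 10, by = 1)
y = c(6, 4.3, 4.1, 3.8, 3.7, 3.4, 3, 2.9, 2.5, 2.1)
model = lm(y ~ x)

5.4600 - 0.3418 * 11                           # hand calculation with the printed coefficients
predict(model, newdata = data.frame(x = 11))   # prediction from the unrounded fit
```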

Katy is doing a survey for Perry Pet Grooming. She wants to know if there is a linear relationship between cat age and weight. She collects data from ten cat owners.

Age              Weight (in lb)
3                8
3                10
5                10
6                11
7                11.25
8                12
9                11.5
12               13
15               16.5
15               20

Find the least-squared regression line for this data.

Multiple Choice:

	y=6.32+0.72x
	No, the least-squared regression line is ALWAYS y(hat).
	y(hat)=6.32+0.72x
	y=0.72+6.32x
	Make sure your line is a+bx.
	y(hat)=0.72+6.32x
	Make sure your line is a+bx.

Hints:

Enter the data into L1 and L2 using STAT > EDIT on your TI-83.

Go to STAT > CALC > LinReg(a+bx)

To store the line so you can graph and check it, press STAT > CALC > LinReg(a+bx), then go to VARS > Y-VARS > Function > Y1.

Using "cars" data and Rweb, calculate the least squared regression line for the speed vs. distance data. Type it in exactly as it appears on rweb.

Algebraic Expression:

	-17.579+3.932x
	3.932x-17.579
	a+bx form
	-17.579x+3.932
	-17.579 isn't the slope, it's the y-intercept.
	3.932-17.579x
	While you did give it in a+bx form, the a and b are opposite.

Scaffold:

The first step is to plot the data in rweb.
###start code
cars
names(cars)
x=cars$speed
y=cars$dist
plot(x,y)
###end code
Did you plot the data?

Multiple Choice:

	Yes
	No
	Why not?

Scaffold:

The next step is to plot the least squared regression line.
###start code
model=lm(y~x)
plot(x,y)
abline(model)
model
###end code
Does the line look correct?

Multiple Choice:

	Yes
	No

Scaffold:

The least squared regression line equation is y=________________.

Algebraic Expression:

	-17.579+3.932x
	3.932x-17.579
	a+bx form
	-17.579x+3.932
	-17.579 isn't the slope, it's the y-intercept.
	3.932-17.579x
	The slope and y-intercept are switched.

Hints:

The components of the LSRL are displayed when you type model on its own line (the last line of the code above), not by the abline command. They are listed under "Coefficients"; see if you can find them.

Lana is looking at high school and college grades for 200 students at a local state school. She's trying to predict a student's university GPA from his or her high school GPA. If x=high school GPA and y=university GPA, the LSRL equation is y=0.675x+ 1.097.

What would the university GPA be for someone who had a 3.2 GPA in high school? Input the answer exactly as it appears on your calculator.

Algebraic Expression:

3.257

Hints:

Input the equation for the LSRL on your TI-83; graph the line. Find 3.2 on the x-axis and the corresponding y value.

Press 2nd > TRACE > value. Then x=3.2. The answer will appear in the bottom right of your calculator screen.
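
The same substitution can be done in R. This is a minimal sketch of plugging x = 3.2 into the LSRL given in the problem; the function name predict_gpa is hypothetical.

```r
# LSRL from the problem: y-hat = 0.675x + 1.097
predict_gpa = function(hs_gpa) 0.675 * hs_gpa + 1.097   # hypothetical helper
predict_gpa(3.2)
```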

Jomelia works as a nurse. She measures the height and weight of patients that come into the emergency room.

Using the data from the set "women," make a residual plot. (Copy the following into R.)

###start code
women
names(women)
x=women$height
y=women$weight
plot(x,y)
model=lm(y~x)
plot(x,y)
abline(model)
model
plot(x,model$residuals,ylab="RESIDUALS")
abline(h=0)
###end code

What shape does the residual plot take?

Multiple Choice:

	linear
	Look at the residual plot, not the graph.
	u-shape
	inverted u-shape

Is the data linear?

Multiple Choice:

	Yes
	No

Hints:

When the residuals are distributed randomly around the horizontal line at zero, the data is linear.

Taylor is studying the effect of x hours of sleep on the ability to use four-syllable words in conversation. The data can be entered into Rweb using this code:

###start code
x=seq(1,10,by=1)
y=c(4,6,9,10,14,17,18,20,24,25)
model=lm(y~x)
plot(x,y)
abline(model)
model
###end code

What is the LSRL? (input the numbers EXACTLY as they appear on Rweb.) y=__________

Algebraic Expression:

1.467+2.406x

Now, let's input an outlier at (3, 22):

###start code
x=seq(1,10,by=1)
y=c(4,6,9,10,14,17,18,20,24,25)
model=lm(y~x)
plot(x,y)
abline(model)
model
x=c(x,3)
y=c(y,22)
model=lm(y~x)
plot(x,y)
abline(model)
model
###end code

Does this change the LSRL? (Is the outlier significant?)

Multiple Choice:

	Yes
	No

Hannah is an FBI Agent. She is looking at data on pirated movies, and trying to figure out if ticket sales decrease as the incidence of movie pirating increases.

Since 1900, the ticket sales of movies have decreased with more and more movies being pirated. x=the number of people who have downloaded movies (in millions), and y=the number of tickets sold each year (in billions).

###start code
x=c(10,15,18,22,24,28,30,55,60)
y=c(1.92,1.88,1.84,1.86,1.8,1.74,1.75,1.62,1.66)
model=lm(y~x)
plot(x,y)
abline(model)
model
###end code

What is the least squared regression line for the data?

Multiple Choice:

	y=1.946953-0.005544x
	It should be y(hat), since it is predicted.
	y(hat)=1.946953-0.005544x
	y(hat)=-0.005544+1.946953x
	You switched up the slope and y-intercept.
	y=-0.005544+1.946953x
	You switched up the slope and y-intercept.

Now, add an outlier to the data. In 2008, the ticket sales for The Dark Knight pushed up 2008's ticket sales to 2.02 billion from 1.62 billion. The data now looks like this:

###start code
x=c(10,15,18,22,24,28,30,55,60)
y=c(1.92,1.88,1.84,1.86,1.8,1.74,1.75,2.02,1.66)
model=lm(y~x)
plot(x,y)
abline(model)
model
###end code

How significant is this outlier?

Multiple Choice:

	Not significant
	Significant
	Very significant

Stefano's sister, Stefani, is looking at the relationship between the number of ads run and peanut sales. She surveyed 10 companies.

# of Ads Run (in thousands)                  Sales (in thousands)
1                                                          12
1.3                                                       14
1.36                                                     17
1.44                                                     19
1.62                                                     25
1.78                                                     28
2                                                          35
2.14                                                     36
2.56                                                     42
2.79                                                     48

Make a residual plot. According to the residual plot, is this data linear?

Multiple Choice:

	Yes
	No

Hints:

On the TI-83, to make a residual plot you first input the data into L1 and L2. Then, you go to STAT > CALC > LinReg. THEN, go to 2nd > Y= > PLOT 1. Then, make sure you turn the plot on and set the Ylist to RESID, which you can do by going to 2nd > STAT > RESID. Then hit graph, and zoom stat, and you have a residual plot!
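
If you are using R instead of the TI-83, a residual plot for this table could be made with a sketch like the following. The variable names ads and sales are chosen here for illustration, with ads as the explanatory variable.

```r
ads = c(1, 1.3, 1.36, 1.44, 1.62, 1.78, 2, 2.14, 2.56, 2.79)   # ads run (in thousands)
sales = c(12, 14, 17, 19, 25, 28, 35, 36, 42, 48)              # sales (in thousands)
model = lm(sales ~ ads)

plot(ads, resid(model), ylab = "RESIDUALS")   # residuals against the explanatory variable
abline(h = 0)                                 # reference line at residual = 0
```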

What is the shape of a linear model?

Multiple Choice:

	curved line
	straight line

Go to www.stats4stem.org
Click on Rweb-1 at the top (near the right)
Enter following code:

##Code

X=read.table("http://seattlecentral.edu/qelp/sets/038/s038.txt") ## This reads the data from the website and names it "X"
X ## This displays the data
attach(X) ## This breaks the data set into variables
names(X) ## This shows the names of the variables
population=V1 ##This renames V1 "population"
disposed=V2 ##This renames V2 "disposed"
plot(population, disposed) ## This creates a scatterplot of population versus disposed

Look at the scatterplot. Upon initial inspection, does the data appear linear?

Multiple Choice:

	Yes
	No

Add the following to the previous code:

##Code
lsrl=lm(disposed~population) ##This makes the least-squares regression line
lsrl ## This displays the least-squares regression line

What is the equation of the least-squares regression line?

Multiple Choice:

	y=-4763.5509x + 0.8784
	yhat=-4763.5509x + 0.8784
	y=0.8784x -4763.5509
	yhat=0.8784x -4763.5509

Hints:

The least-squares regression line is given by the formula:
yhat=ax+b

The intercept is b.

Add the following to the previous code:
##Code:
abline(lsrl) ## This graphs the least-squares regression line on the scatterplot

Upon second inspection, do you think the linear model fits the data?

Multiple Choice:

	Yes
	No

Add the following to the previous code:

##Code:
plot(population,lsrl$residuals, ylab="RESIDUALS", main="RESIDUAL PLOT") ##This creates the residual plot
abline(h=0) ##This graphs the horizontal line at 0 on the residual plot.

What does the horizontal line at 0 on the residual plot represent?

Multiple Choice:

	Nothing
	The median residual
	The least-squares regression line
	The average residual

Looking at the residual plot, do you think the linear model is a good fit for the data?

Multiple Choice:

	Yes
	No

Hints:

Data whose residual plot shows a pattern are not well fit by a linear model.

Are there any outliers?

Multiple Choice:

	Yes
	No

Hints:

Outliers are points outside of the overall pattern in the y direction on the residual plot.

Are there any influential observations?

Multiple Choice:

	Yes
	No

Hints:

Influential observations are points outside of the overall pattern in the x direction on the residual plot.

What would be affected if the influential observation were removed?

Multiple Choice:

	The residual
	The least-squares regression line
	It would stay the same
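To get a feel for what an influential observation does, here is a small sketch using made-up numbers (not the data from the website). Five points follow a flat trend and a sixth sits far out in the x direction; the code fits the line with and without that sixth point.

```r
# Hypothetical data: five points on a flat trend plus one point far out in x
x <- c(1, 2, 3, 4, 5, 20)
y <- c(2, 3, 2, 3, 2, 12)

fit_all  <- lm(y ~ x)            # fit using every point
fit_drop <- lm(y[-6] ~ x[-6])    # fit without the far-out point

slope_all  <- coef(fit_all)[[2]]
slope_drop <- coef(fit_drop)[[2]]

cat("slope with the point:   ", round(slope_all, 3), "\n")
cat("slope without the point:", round(slope_drop, 3), "\n")
```

Comparing the two slopes shows how much a single point that is extreme in x can pull the fitted line toward itself.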

Go to www.stats4stem.org
Click on Rweb-1 at the top (near the right)
Enter the following code:
##Code
X=read.table("http://www.statsci.org/data/general/kittiwak.txt", header=T) ## This names the data from the website "X"
X ## This displays the data
attach(X) ## This breaks the data set into variables
names(X) ## This shows the names of the variables
plot(Area, Population) ## This creates a scatterplot of area versus population

Upon first inspection, does the plot appear linear?

Multiple Choice:

	Yes
	No

Add the following code to the code presented above:
##Code
lsrl=lm(Population~Area) ## This makes the least-squares regression line
lsrl ## This displays the least-squares regression line

What is the equation for the least-squares regression line?

Multiple Choice:

	y=3.302x -734.806
	yhat=3.302x -734.806
	y=-734.806x+3.302
	yhat=-734.806x+3.302

Hints:

The least-squares regression line is given by the formula:
yhat=ax+b

The intercept is b.

Add the following code to the code presented above:
##Code
abline(lsrl) ## This graphs the least-squares regression line on the scatterplot

Upon second inspection, does the plot appear linear?

Multiple Choice:

	Yes
	No

Add the following to the previous code:

##Code
plot(Area,lsrl$residuals,ylab="RESIDUALS",main="RESIDUAL PLOT") ##This creates the residual plot

abline(h=0) ##This draws the horizontal line at 0 on the residual plot.

Looking at the residual plot, do you think the linear model fits the data?

Multiple Choice:

	Yes
	No

Hints:

Data whose residual plot shows a pattern are not well fit by a linear model.

Looking at the graphs, are there any outliers?

Multiple Choice:

	Yes
	No

What are three things you learned?

Ungraded Open Response:

In an exponential model, what is the regression equation?

Multiple Choice:

	log(y) = b0 + b1x
	1/y = b0 + b1x
	no, try again
	log(y)= b0 + b1log(x)
	no, keep trying
	sqrt(y) = b0 + b1x
	Come on!... you know this

Hints:

Please refer to this website:
http://stattrek.com/AP-Statistics-1/Transformation.aspx?Tutorial=AP
It is very helpful in explaining how to achieve linearity no matter what type of data you are given.

In a power model, what equation do you use to find the predicted value?

Multiple Choice:

	ŷ = 10^(b0 + b1log(x))
	ŷ = ( b0 + b1x )^2
	try again
	ŷ = b0 + b1x
	no that is wrong
	ŷ = 1 / ( b0 + b1x )
	keep trying

Hints:

Please refer to this website:
http://stattrek.com/AP-Statistics-1/Transformation.aspx?Tutorial=AP
It is very helpful in explaining how to achieve linearity no matter what type of data you are given.

Given the following data set, figure out the regression equation.
X = 1 2 3 4 5
Y = 1 2 4 8 16

Multiple Choice:

	y=2^x
	Nope
	log(y)=2^x
	Nope
	log(y)=-.301+.301x
	y=-.301+.301x
	Nope

Scaffold:

First of all, identify what type of model fits the data.

Multiple Choice:

	Exponential
	Linear
	Nope
	Quadratic
	Nope
	Logarithmic
	Nope

Hints:

Realize that for each time the x-value goes up by one, the y-value goes up by a fixed multiple.

Scaffold:

Our ultimate goal is to transform this data in some way so that we end up with a linear model. What transformation must be made in order to achieve this?

Multiple Choice:

	log(y)
	square-root(y)
	Nope
	1/y
	Nope
	None
	Nope

Hints:

It is log(y). When graphing the x-values against the log(y)-values, you get a straight line.

Scaffold:

Now for a little bit of practice...when x=3, what is log(y)?

Multiple Choice:

	.602
	4
	Nope
	.301
	Nope
	.903
	Nope

Hints:

Refer back to the table and find out what y is when x=3.

Take the log(4) because y=4 when x=3, according to the table.
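If you would rather check this step in R than on the calculator, a quick sketch follows. The variable names are only for illustration. Note that log10() is the base-10 log used throughout this problem, while plain log() in R is the natural log.

```r
y_at_3 <- 4               # from the table: y = 4 when x = 3
log_y  <- log10(y_at_3)   # base-10 log, the same as the LOG key on the calculator
round(log_y, 3)
```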

Scaffold:

Using the method from the previous problem, we can find out the rest of the log(y) values.
X = 1 2 3 4 5
Log(Y) = 0 .301 .602 .903 1.204

Using your calculator, find out the linear regression equation.

Multiple Choice:

	log(y)=-.301+.301x
	y=-.301+.301x
	Nope, remember that we were using the log(y) values, not the regular y-values
	log(y)=.301-.301x
	Nope
	y=.301-.301x
	Nope

Hints:

Enter the original data set into your calculator as L1 and L2. Then enter in the log(y) values as L3. Now, go to the "Stat" menu; go to the "Calc" tab; go to the LinReg (a+bx) option and press enter. Back on the calculator home screen, LinReg (a+bx) should be showing right now. Directly beside this type in "L1,L3"; then hit enter. The a-value and b-value should come up. Congratulations, you now have your equation.

The a-value is your y-intercept and the b-value is your slope. The b-value should be attached to "x" in the equation.

Now that you have a regression equation, solve it for y so that you can use it to predict values from a given "x".

Regression Equation: log(y)=-.301+.301x

Multiple Choice:

	y=10^(-.301+.301x)
	y=(-.301+.301x)/log
	Nope, you can't divide by log
	y=(-.301+.301x)^10
	Nope
	The equation is already correct
	Nope

Hints:

You need to find a way to isolate y on the left side of the equation. That means getting rid of the log.

In order to get rid of the log, you need to raise 10 to the power of each side of the equation.
10^log(y)=y
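The hint above can be sketched in R using the same data set from this problem. The names a, b, and y_hat are just for illustration: the code fits log(y) = a + bx and then raises 10 to that power to get back to the original y scale.

```r
x <- c(1, 2, 3, 4, 5)
y <- c(1, 2, 4, 8, 16)

model <- lm(log10(y) ~ x)    # fits log(y) = a + b*x
a <- coef(model)[[1]]        # intercept
b <- coef(model)[[2]]        # slope

y_hat <- 10^(a + b * x)      # undo the log: 10^log(y) = y
round(y_hat, 3)
```

Because this data set is perfectly exponential, the back-transformed predictions should land on the original y values.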

Now we can use R to check our answers and get a residual plot. Enter the following into R:
x=c(1,2,3,4,5)
y=c(1,2,4,8,16)
plot(x,y)
log.y=log10(y)
plot(x,log.y)
model=lm(log.y~x)
abline(model)
plot(x, model$residuals, ylab="RESIDUALS")
abline(h=0)
model
According to R, what is the y-intercept?

Algebraic Expression:

-0.301

Hints:

It is the number right under "Intercept" in R.

Enter the answer EXACTLY how it appears in R.

Type in -0.301.

According to R, what is the slope?

Algebraic Expression:

0.301

Hints:

It is the number under "x" in R.

Enter the number EXACTLY how it appears in R.

Type in 0.301.

Now let's take a look at the graphs. The first one is the graph of the original data, which is exponential. The second one is the graph of the x-values against the log(y)-values, which is linear. The third graph is the residual plot. Do you see any pattern on the residual plot?

Multiple Choice:

	Yes
	Wrong
	No

Right, there is no pattern on the residual plot. When you are plotting the residuals from a linear graph, there shouldn't be any pattern on the plot. If you tried to plot the residuals from a non-linear graph, there would be a pattern (U-shaped or something else). This website explains it well:
stattrek.com/Help/Glossary.aspx?Target=Residual%20plot
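To see the difference for yourself, here is a rough sketch that fits a straight line twice to the data from this problem: once to the raw y-values and once to the log(y)-values. The names raw_fit and log_fit are made up for this example.

```r
x <- c(1, 2, 3, 4, 5)
y <- c(1, 2, 4, 8, 16)

raw_fit <- lm(y ~ x)           # straight line forced through the curved data
log_fit <- lm(log10(y) ~ x)    # straight line through the transformed data

round(residuals(raw_fit), 2)   # signs go +, -, -, -, + (a U shape)
max(abs(residuals(log_fit)))   # essentially zero
```

The residuals from the raw fit start positive, dip negative, and come back positive, which is the U-shaped pattern described above. The residuals from the log fit are only rounding noise because this data set is exactly exponential.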
What did you learn from this assistment?

Ungraded Open Response:

The true antelopes are found only in Africa and Asia. They range in size from pygmy antelopes, which stand 12" (30 cm) at the shoulder, to giant elands, which are over 6 feet (180 cm) tall at the shoulder. Most antelopes are between 3 and 4 feet (90-120 cm) tall at the shoulder. The horns of antelopes, unlike the antlers of deer, are unbranched, are made of a shell with a bony core, and are not shed. The majority of antelopes reside in Africa.

Data: The data below represents the length and mid-shaft diameters of the humerus bones of African Antelopes.

Diameter (mm)	Length (mm)
17.6	159.9
26.0	206.9
31.9	236.8
38.9	269.9
45.8	300.6
51.2	323.6
58.1	351.7
64.7	377.6
66.7	384.1
80.8	437.2
82.9	444.7

Prepare a scatter plot of the data using R.
Enter the following into R:
x=c(17.6,26.0,31.9,38.9,45.8,51.2,58.1,64.7,66.7,80.8,82.9)
y=c(159.9,206.9,236.8,269.9,300.6,323.6,351.7,377.6,384.1,437.2,444.7)
plot(x,y)
Were you able to make a scatterplot?

Multiple Choice:

	Yes
	No
	Try again

Hints:

Keep in mind that this is a power model data set...so it should look different from an exponential scatter plot.

Using R, find the linear regression model for the data.
Enter the following into R:
x=c(17.6,26.0,31.9,38.9,45.8,51.2,58.1,64.7,66.7,80.8,82.9)
y=c(159.9,206.9,236.8,269.9,300.6,323.6,351.7,377.6,384.1,437.2,444.7)
plot(x,y)
log.x=log10(x)
log.y=log10(y)
plot(log.x,log.y)
model=lm(log.y~log.x)
abline(model)
plot(x, model$residuals, ylab="RESIDUALS")
abline(h=0)
model
What is the y-intercept?

Algebraic Expression:

1.3826

Hints:

It is the number directly under "Intercept" in R.

Enter it in EXACTLY how it appears in R.

Type in 1.3826.

What is the slope?

Algebraic Expression:

0.6595

Hints:

It is the number right under "log.x" in R.

Enter the number EXACTLY how it appears in R.

Type in 0.6595.
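Once you have the intercept and slope of the log-log line, you can turn them back into a power model. The sketch below uses the antelope data from this problem; the names a, b, and power_hat are only for illustration. It relies on the algebra log(y) = a + b*log(x), which becomes y = 10^a * x^b after raising 10 to both sides.

```r
x <- c(17.6, 26.0, 31.9, 38.9, 45.8, 51.2, 58.1, 64.7, 66.7, 80.8, 82.9)
y <- c(159.9, 206.9, 236.8, 269.9, 300.6, 323.6, 351.7, 377.6, 384.1, 437.2, 444.7)

model <- lm(log10(y) ~ log10(x))   # log(y) = a + b*log(x)
a <- coef(model)[[1]]              # intercept of the log-log line
b <- coef(model)[[2]]              # slope of the log-log line

# Undo the logs: y-hat = 10^a * x^b
power_hat <- 10^a * x^b
round(power_hat, 1)
```

You can compare the printed predictions with the Length column in the table to judge how well the power model follows the data.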

Now for some more practice on R. Enter the following into R:
x=c(1,2,3,4,5)
y=c(1,2,8,28,85)
plot(x,y)
log.y=log10(y)
plot(x,log.y)
model=lm(log.y~x)
abline(model)
plot(x, model$residuals, ylab="RESIDUALS")
abline(h=0)
model
What kind of model is this data set?

Multiple Choice:

	Exponential
	Linear
	Nope
	Quadratic
	Nope
	Power
	Nope

Hints:

Realize that as the x goes up by 1, the y goes up by roughly the same multiple each time.

According to R, what is the y-intercept of the linear regression equation?

Algebraic Expression:

-0.5854

Hints:

It is the number right under "Intercept" in R.

Enter the number EXACTLY how it appears in R.

Type in -0.5854.

What is the slope?

Algebraic Expression:

0.5005

Hints:

It is the number right under "x" in R.

Enter the number EXACTLY how it appears in R.

Type in 0.5005.

Is the residual plot random or does it have a pattern?

Multiple Choice:

	Random
	Pattern
	Nope, look again

Because of the residual plot, is it a good linear model?

Multiple Choice:

	Yes
	No
	Nope

Hints:

Go back to this website if you need some help remembering the relationship between residual plots and linear models:
http://stattrek.com/Help/Glossary.aspx?Target=Residual%20plot

What should a residual plot look like when a linear regression model is appropriate for the data?

Multiple Choice:

	a random pattern
	a U-shaped curve
	no keep trying
	an inverted U
	no. try again
	all answers
	no. don't give up!

Hints:

Please refer to this website:
http://stattrek.com/AP-Statistics-1/Transformation.aspx?Tutorial=AP
It is very helpful in explaining how to achieve linearity no matter what type of data you are given.

Make sure to check out this website for a full breakdown of how to achieve linearity from different types of data sets using transformations:
http://stattrek.com/AP-Statistics-1/Transformation.aspx?Tutorial=AP
Did you find this website useful?

Multiple Choice:

	Yes
	No
	Oh well

This data set is a little harder!
X Y
8 9.64
21 27.61
54 77.01
67 98.64
98 146.04
12 15.007
34 46.70
99 149.02
22 29.15
Find Y when X is 105. Round to the nearest hundredth.

Algebraic Expression:

	301.66
	301.67
	300.55
	300.5

Scaffold:

First make the data linear.

What is the correlation (r) of the LinReg of the transformed data? (Round everything to the nearest hundredth.)

Algebraic Expression:

	.95
	.94
	.949

Hints:

Press (STAT) then (Edit...) and, if you have not already, enter the data set. Next go over and up to L3 (make sure there is a black box around L3), then type LOG, L2. This makes each point in L3 the log of its corresponding point in L2.

Next press STAT, go over to CALC then down to (8: LinReg (a+bx)) hit ENTER, then press L1 then (,) then L3,then ENTER. The 5th variable is r.

Type .95

Scaffold:

Good Job! Now that we have transformed the data to linear form we must transform it back to Exponential form.
Find the equation for the exponential least squares regression line.

Multiple Choice:

	y-hat= 13.49*1.03^X
	y-hat= 13.49*1.08^X
	y= 13.49*1.03^X
	y-hat= 18.49*1.03^X
	y= 18.49*1.03^X
	y-hat= 13.49*1.08^X

Hints:

When you get the LinReg of L1, L3 it is in the form Log(y-hat)=a+bx. How do you convert this so that y-hat is by itself?

Take the inverse Log (10^X) of both sides. The problem will become y-hat=10^(a)*10^(bx)

Plug in. y-hat=10^(1.13)*10^(.011(X)).

Simplify: y-hat=10^(1.13)*10^(.011(X)) = 13.49*1.03^X
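The arithmetic in these hints can be checked in R. This is only a sketch using the rounded intercept and slope quoted in the hint; the names base_coef and growth are made up for the example.

```r
a <- 1.13     # intercept from the hint
b <- 0.011    # slope from the hint

base_coef <- 10^a   # the multiplier in front
growth    <- 10^b   # the base that gets raised to the power X

round(base_coef, 2)
round(growth, 2)

# The two forms agree for any x, for example x = 10:
x <- 10
all.equal(10^(a + b * x), base_coef * growth^x)
```

The last line confirms that 10^(a+bx) and 10^(a) times (10^b)^x are the same thing, which is the log rule the hints rely on.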

Scaffold:

Find Y when X is 105. Round to the nearest hundredth.

Algebraic Expression:

	301.66
	301.67
	300.55
	300.5

Let's start with a simple data set!
x   y
1   2
2   4
3   8
4 16
5 32
6 64
7 128
8 256
9 512
Find Y when X is 25

Algebraic Expression:

33554432

Scaffold:

First make the data linear.

What is the correlation(r) of the LinReg of the transformed data?

Algebraic Expression:

1

Hints:

Press (STAT) then (Edit...) and, if you have not already, enter the data set. Next go over and up to L3 (make sure there is a black box around L3), then type LOG, L2. This makes each point in L3 the log of its corresponding point in L2.

Next press STAT, go over to CALC then down to (8: LinReg (a+bx)) hit ENTER, then press L1 then (,) then L3,then ENTER. The 5th variable is r.

Type 1

Scaffold:

Good Job! Now that we have transformed the data to linear form we must transform it back to Exponential form.
Find the equation for the exponential least squares regression line.

Multiple Choice:

	y-hat=1*2^X
	y=1*2^X
	y-hat=2*1^X
	y=2*1^X

Hints:

When you get the LinReg of L1, L3 it is in the form Log(y-hat)=a+bx. How do you convert this so that y-hat is by itself?

Take the inverse Log (10^X) of both sides. The problem will become y-hat=10^(a)*10^(bx)

Plug in. y-hat=10^(0)*10^(.301(X)).

Simplify: y-hat=10^(0)*10^(.301(X)) = 1*2^X

Now plug in for X and solve the original problem!
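If you want to double-check the calculator work in R, here is one possible sketch. It fits the log-transformed data from this problem, predicts log(y) at x = 25, and then undoes the log. The names log_pred and y_pred are only for illustration. Keep in mind that x = 25 is far outside the data (1 through 9), so this kind of extrapolation only works cleanly here because the data are perfectly exponential.

```r
x <- 1:9
y <- 2^x                        # 2, 4, 8, ..., 512, as in the table

model <- lm(log10(y) ~ x)       # log(y-hat) = a + b*x

log_pred <- predict(model, newdata = data.frame(x = 25))
y_pred   <- 10^log_pred         # undo the log

format(round(unname(y_pred)), big.mark = ",")
```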

Scaffold:

Find Y when X is 25

Algebraic Expression:

33554432

Now that you know a little more about linear transformations, let's try another problem!
Given the data set
X Y

Greg Langkamp and Joe Hull, "Exponential Scatterplots," QELP, October 27, 2010

http://seattlecentral.edu/qelp/sets/045/045.html

Find Y when X is 150. Round to the nearest hundredth.

Algebraic Expression:

	6300179.18
	6300179.181

Scaffold:

First make the data linear.

What is the correlation(r) of the LinReg of the transformed data?

Exact Match (case sensitive):

.96

Hints:

Press (STAT) then (Edit...) and, if you have not already, enter the data set. Next go over and up to L3 (make sure there is a black box around L3), then type LOG, L2. This makes each point in L3 the log of its corresponding point in L2.

Next press STAT, go over to CALC then down to (8: LinReg (a+bx)) hit ENTER, then press L1 then (,) then L3,then ENTER. The 5th variable is r.

Type .96

Scaffold:

Good Job! Now that we have transformed the data to linear form we must transform it back to Exponential form.
Find the equation for the exponential least squares regression line.

Multiple Choice:

	y-hat=2788.02*1.008^x
	y=2788.02*1.008^x
	y=4235.32*1.008^x
	y-hat=4235.32*1.008^x
	y=2788.02*1.89^x
	y-hat=2788.02*1.89^x
	y-hat= 4177.90*1.05^X

Hints:

When you get the LinReg of L1, L3 it is in the form Log(y-hat)=a+bx. How do you convert this so that y-hat is by itself?

Take the inverse Log (10^X) of both sides. The problem will become y-hat=10^(a)*10^(bx)

Plug in. y-hat=10^(3.620958695)*10^(.0210229934(X)).

Simplify: y-hat=10^(3.620958695)*10^(.0210229934(X)) = 4177.906294*1.04959799^X

Now plug in for X and solve the original problem!

Scaffold:

Find Y when X is 150. Round to the nearest hundredth.

Algebraic Expression:

	6300179.18
	6300179.181

Exponential Transformations.
When do we use exponential transformations? We use them to assess the correlation (strength and direction) of data that follow an exponential pattern. They are also used to create a least-squares regression line, which can later be used to predict future points on the same exponential track as the previous points.
Let's try it out!
Here's a data set; I will walk you through how to complete an exponential transformation using this data set.

X                Y

Greg Langkamp and Joe Hull, "Exponential Scatterplots," QELP, October 27, 2010
http://seattlecentral.edu/qelp/sets/045/045.html

1) First enter this data set into your calculator. (press STAT, EDIT, enter the first column into L1 and the second into L2)

2) Next press second STAT PLOT (above Y=), ENTER change Xlist to L1 and Ylist to L2, Then press ZOOM, 9 (ZoomStat)

3) This lets us view the original graph

4) Next press STAT, EDIT, move over to L3, then enter log(L2). This takes the log of each of the original Y values.

5) Repeat step 2 replacing Ylist to L3 to view this graph.

6) Next press STAT, move over to CALC, then move down to 8 [LinReg(a+bx)]

7) Then type L1,L3 then hit Enter.

8) Your r should be .9947465551 your r^2 should be .9895207089

9) Now we must convert this into a form in which we can predict future points.

10) Since this is technically log(y-hat)=a+bx, that is, log(y-hat)=(.4793)+(.0324)(X), we must change it so that y-hat is by itself.

11) Do this by taking the inverse log (10^X, or 2nd LOG) of both sides. You should end up with y-hat=3.0156*1.0775^X. Now simply plug in a number for X and the equation will predict the Y variable for you!

Question: Given the above data set and equation what will Y be when X is 50?

Multiple Choice:

	50
	125.68
	67.54
	78.33

This is a data set of the amount of nitrogen used on a crop, and the crop yield.
X Y

122.3	6449
102.4	7483
104.1	7874
101.0	8034
106.0	8419
113.7	9362
146.0	10080
168.9	11959
198.3	9928
254.1	11850
408.4	16001
602.1	18753
635.3	21412

What is the equation of the power transformation?

Multiple Choice:

	y=2.85+.52x
	y=.52+2.85x
	x=.52
	r=.95
	that is the value of r not the equation

Hints:

Make sure all of the data is entered correctly and that you have followed all the steps

In the data area make sure you have highlighted L3 and entered the log of L1

In the data area make sure you have highlighted L4 and entered the log of L2

After doing all of these enter the LinReg(a+bx) and enter L3,L4 after it. This should give you the correct answer.

Here is a new data set about Madrid's earthquakes and their magnitudes.
X Y

5.34	4.01
4.5	3.91
4.88	4.38
3.84	2.66
4.71	3.67
4.83	3.87
4.43	2.89
3.06	1.11
4.92	3.46
4.92	3.6
4.39	3.04
4.27	2.93
4.82	4.03
3.54	3.21
2.22	1.23
5.66	4.47
4.04	3.24
4.68	3.46
4.83	3.5
2.53	0.78
4.61	3.63
4.2	3.19
4.25	3.04
5.83	4.94
4.64	3.06

What is the R-squared value of the regression line (a + bx) when you use the power model?

Algebraic Expression:

	.84
	.839
	Round it!
	.83
	Incorrect rounding
	.92
	That is the value of r

Scaffold:

What is the value in the Xlist?

Algebraic Expression:

	L3
	L4

Scaffold:

What is the value in the Ylist?

Algebraic Expression:

	L4
	L3

Scaffold:

What is entered after LinReg(a+bx)?

Algebraic Expression:

	L3, L4
	L4, L3
	Backwards

Scaffold:

Algebraic Expression:

.84

Here is a data list of gas mileage compared to engine size.
X Y

http://seattlecentral.edu/qelp/sets/036/036.html

What is the equation for the power transformation of the above data?

Multiple Choice:

	y=1.65-.4x
	y=.4-1.65x
	y=1.65+.4x

Hints:

Make sure all the data is entered correctly and you have followed all of the steps.

If you have done all of the steps make sure that L3 is entered in the Xlist and L4 is entered in the Ylist.

If this still doesn't work, try entering L3, L4 after you enter LinReg(a+bx)

A power transformation involves taking the log of both data sets and graphing the new data sets to achieve linearity.
To take the logs of the data set you must first input the data into your calculator. All of the power transformations will be done with the calculator.
Here is a data set to help practice how to do this procedure. Follow these steps with the data set provided.

1	2940600
70	13094400
109	28953600
173	40379580
242	56427280
322	64593200
376	75072000
547	88965600
603	100742400
699	115814504
872	152472840
922	154291740
1087	173260800
1343	178320000
1692	212908800
1858	243579520

http://seattlecentral.edu/qelp/sets/020/020.html

List of steps to enter data into the calculator:
1) Press "STAT"
2) Go into "Edit" by pressing 1 or "ENTER"
3) Data on the X axis goes in L1
4) Data on the Y axis goes in L2
5) Graph the data set by going into "ZOOM" and click on "ZoomStat" or 9
6) If this data set graphs correctly skip steps 7, 8, 9, and 10
7) If this doesn't work, it's probably because the plot settings aren't correct. To change them so the data will graph, go to "STAT PLOT", which is right above "Y=" on the calculator.
8) After you're in "STAT PLOT" click enter on Plot 1
9) Go down to "Xlist" and set it to L1 and change "Ylist" to L2
10) Go back to "ZOOM" and click "ZoomStat" to graph this data
11) What do you notice? The graph should not be linear, right? Now we will try to change that
12) To achieve linearity we must take the log of both L1 and L2
13) To do this go back into "STAT" click edit and you should see your data set in L1 and L2
14) Go over to L3 and highlight it by pressing up on the directional pad. The blinking black cursor should be over L3. After you do this press "LOG" then press L1 and click enter. This should give you a new column of data
15) Go over to L4 and highlight it by pressing up on the directional pad. The blinking black cursor should be over L4. After you do this press "LOG" then press L2 and click enter. This should give you a new column of data
16) To graph this go back into "STAT PLOT" and go to Plot 1. Go down to "Xlist" and change L1 to L3. Go down to "Ylist" and change L2 to L4. This will change the dimensions of your new graph.
17) Go to "ZOOM" and click 9 or "ZoomStat"
18) If done correctly this should show you a new graph that is very linear.
19) To confirm linearity click "STAT", go over to "CALC", and click "LinReg(a+bx)". Then enter L3 insert a comma and press L4 then press enter. This should give you a few numbers but the one you're most concerned with is r squared. If this number is anywhere from .8 to .99 then you have a very strong linear relationship.
20) This data that you have just accumulated will be very useful in the following problems.
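The same power transformation can also be tried in R instead of on the calculator. The following is only a sketch of the steps above using the data set from this problem: log_x and log_y play the roles of L3 and L4, and r_squared is a name made up for this example.

```r
x <- c(1, 70, 109, 173, 242, 322, 376, 547, 603, 699, 872, 922,
       1087, 1343, 1692, 1858)
y <- c(2940600, 13094400, 28953600, 40379580, 56427280, 64593200,
       75072000, 88965600, 100742400, 115814504, 152472840, 154291740,
       173260800, 178320000, 212908800, 243579520)

plot(x, y)                    # steps 1-11: the original, non-linear graph

log_x <- log10(x)             # steps 12-14: plays the role of L3
log_y <- log10(y)             # step 15: plays the role of L4

plot(log_x, log_y)            # steps 16-18: the transformed graph
model <- lm(log_y ~ log_x)    # step 19: like LinReg(a+bx) L3, L4
abline(model)

r_squared <- summary(model)$r.squared
round(r_squared, 2)
```

Taking the square root of r_squared gives the size of r, which is handy if you want to tell the two values apart.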

Algebraic Expression:

	.97
	.98
	That is the value of r

Scaffold:

What is the value in the Xlist?

Algebraic Expression:

	L3
	L4

Scaffold:

What is the value in the Ylist?

Algebraic Expression:

	L4
	L3

Scaffold:

What is entered after LinReg(a+bx)?

Algebraic Expression:

	L3, L4
	L1,L2

Scaffold:

What is the value of r-squared?

Multiple Choice:

	.97
	.96
	Incorrect rounding!
	.968
	Round it!
	.98
	That is the value of r

Algebraic Expression:

One day Roger decides to take out a loan of 25 dollars from a local loan shark. Little does he know, the debt triples each week (a growth factor of 3.0 per week). Roger does not pay for 5 weeks and soon becomes heavily in debt. Graph the data below in your graphing calculator. Is the graph linear?

Weeks: 1, 2, 3, 4, 5
Debt: 25, 75, 225, 675, 2025

Multiple Choice:

	Yes
	Have you tried hitting zoomstat?
	No

Scaffold:

Plug in the variables into the L1 and L2 columns of your stats area in your graphing calculator.

Turn on Stat plot1 and hit graph.

Do you now see your graph?

Multiple Choice:

	yes
	no

Scaffold:

Is the graph linear? or possibly exponential?

Multiple Choice:

	Linear
	Hit stat, calc, LinReg(ax+b), L1, L2, Y1. Do the data points really line up?
	Exponential

Now since we've identified this as an exponential graph, let's make it linear with a transformation.

Using Log on your graphing calculator, use exponential transformation to find a linear graph.
Did you succeed?

Multiple Choice:

	Yes
	No

Scaffold:

Go to stats, edit.
Always set L3 as Log(L1)
and set L4 as Log(L2)

For this graph, which list are you transforming (L1 or L2)?
Which pair of lists are you then going to use?

Multiple Choice:

	L2, L3
	L1, L4
	L2, L4
	L1, L3

Scaffold:

Graph everything out on the graphing calculator. Did you get a linear plot? Did you remember to go to stat plot1 and change the y-list to L4?

Multiple Choice:

	Yes, Yes
	Yes, No
	impossible!
	No, Yes
	Try hitting stats, calc, LinReg(ax+b) L1, L4, Y1. Graph

What are the y-intercept and slope of the line?

Multiple Choice:

	.9208, .4771
	.9419, .2490
	.9128, .2771
	.9012, .3532

Hints:

Using your graphing calculator, after you calculate LinReg(a+bx)

a is the y-intercept.
b is the slope

What is the equation of the regression line?

Multiple Choice:

	Log(y-hat) = .9419 + .2490x
	Log(y-hat) = .9208 + .4771x
	(y-hat) = .9208 + .4771x
	Remember, if you look back, you used Log(L2) in place of L4. It doesn't just go away!
	(y-hat) = .9419 + .2490x

Hints:

Using R, after you plug in the original data you just need to plug in

>model

to find the equation of the line.
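Here is a sketch of what that could look like in R for Roger's loan. The variable names weeks, debt, and log.debt are just chosen for this example; log.debt plays the same role as L4 on the calculator.

```r
weeks <- c(1, 2, 3, 4, 5)
debt  <- c(25, 75, 225, 675, 2025)

log.debt <- log10(debt)          # same role as L4 = log(L2)
model <- lm(log.debt ~ weeks)    # fits log(y-hat) = a + b*x

model                            # prints the intercept (a) and the slope (b)
round(coef(model), 4)
```

The number under "(Intercept)" is a and the number under "weeks" is b.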

How would you arrive at an equation with just y-hat on the left side?

Multiple Choice:

	You take the original equation and you throw away the Log
	You go to plot stat1 and change the Y-list back into L2
	You use a property of Logs to raise everything to a power of 10
	You find the reciprocal of each value

Go to http://stattrek.com/AP-Statistics-1/Association.aspx?Tutorial=AP for a quick run-through on two-way frequency tables and use this as a reference.

	Dance Club	Sports Club	Drama Club
Boys	3	9	7
Girls	16	7	8

Given the table above, how many total people are described in the two-way table?

StatTrek. “AP* Statistics Tutorial: Two-Way Tables.” Accessed October 25, 2010. http://stattrek.com/AP-Statistics-1/Association.aspx?Tutorial=AP.

Multiple Choice:

	50
	16
	10
	48

Hints:

What does each of the six entries represent? The total people is the sum of which entries?

	Dance Club	Sports Club	Drama Club
Boys	3	9	7
Girls	16	7	8

In the two-way table above, define the row variable and column variable. (Row Variable; Column Variable)

Multiple Choice:

	Gender; School Club
	Boys; School Club
	Girls; School Club
	School Club; Gender

	Dance Club	Sports Club	Drama Club
Boys	3	9	7
Girls	16	7	8

Using the table above, give the marginal distribution of participation in sports club as a percentage. Ignore the percent sign (%) in your answer.

Algebraic Expression:

32

Scaffold:

Marginal distributions are percents of the table total. First off, what is the group you are looking at?

Multiple Choice:

	Sports Club
	Dance Club
	Drama Club

Scaffold:

What is the number of people participating in the Sports Club?

Multiple Choice:

	16
	18
	9
	7

Scaffold:

What is the total amount of people represented?

Multiple Choice:

	48
	50
	40
	46

Scaffold:

The marginal distribution of participation in sports club would be the percentage of the number of people participating in the sports club over the total number of people. What is the marginal distribution of participation in sports club as a percentage?

Multiple Choice:

	32%
	45%
	30%
	28%
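The scaffold steps above can be mirrored in R. This is a minimal sketch that types the club table in by hand; the object names clubs, total, club_totals, and club_percent are made up for this example.

```r
clubs <- matrix(c(3, 9, 7,
                  16, 7, 8),
                nrow = 2, byrow = TRUE,
                dimnames = list(c("Boys", "Girls"),
                                c("Dance Club", "Sports Club", "Drama Club")))

total        <- sum(clubs)         # everyone in the table
club_totals  <- colSums(clubs)     # one total per club
club_percent <- 100 * club_totals / total

club_percent
```

Each entry of club_percent is a club's total as a percent of the table total, which is how this problem set defines the marginal distribution.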

	Dance Club	Sports Club	Drama Club
Boys	3	9	7
Girls	16	7	8

Out of the dance club only, what is the conditional frequency of girls? Round to the nearest hundredth.

Multiple Choice:

	84.21%
	80.76%
	16.00%
	48.47%

Hints:

To find the conditional frequency of a variable, look only at the column you are focusing on.
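As a sketch of this hint in R, prop.table() with margin = 2 divides every entry by its column total, so each column is looked at on its own. The table is typed in by hand and the names clubs and col_cond are only for illustration.

```r
clubs <- matrix(c(3, 9, 7,
                  16, 7, 8),
                nrow = 2, byrow = TRUE,
                dimnames = list(c("Boys", "Girls"),
                                c("Dance Club", "Sports Club", "Drama Club")))

col_cond <- prop.table(clubs, margin = 2)   # each column now sums to 1

round(100 * col_cond[, "Dance Club"], 2)    # boys and girls within the dance club
```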

Based on the bar graph above, what is the marginal distribution of deaths per vehicle kilometers traveled of people aged 55 and older? Ignore the percent sign (%) in your answer and round to the nearest hundredth if necessary.

Raise the Hammer. “Stay In Your Lane: on plucking the low-lying fruit of safe driving.” Accessed October 25, 2010. http://www.raisethehammer.org/article/609/stay_in_your_lane.

Algebraic Expression:

34.52

Hints:

Marginal distributions are percents of the table total. What is the group you are looking at? What is the total?

The group you are looking at is people aged 55 and older, which is 29 people. The total amount of people is 84 people.

The marginal distribution of deaths per vehicle kilometers traveled of people aged 55 and older would be the percentage of people aged 55 and older over the total amount of people. Hence, you would compute 29/84 = 0.3452 (rounded to four decimal places). The final answer would be 34.52%.

Using the same bar graph, give the marginal distributions of the youngest age group and oldest age group in percentages. (Youngest age group; Oldest age group)

Multiple Choice:

	32%; 23.81%
	23.81%; 32%
	26%; 44%
	44%; 26%

Hints:

Follow the same procedure as the previous problem- determine which group you are looking at and what the total amount is. Marginal distribution is a percent of the table total.

Based on the graph above, which age group(s) seem(s) to be the safest drivers? Why?

Multiple Choice:

	35-44 and 45-54; for they have the least deaths.
	35-44; for it has only three deaths.
	45-54; for older middle aged drivers are safer.
	55-64; for it only has four deaths.

Stacked bar graphs are used to easily compare parts of a whole. For example the different colors (blue, black, red, etc.) are part of a whole category (least favorite color).

What can be observed from the color, orange, in this stacked bar graph?

Joe Hallock. "Colour Assignment: Preferences." Accessed October 27th, 2010. http://www.joehallock.com/edu/COM498/preferences.html.

Multiple Choice:

	Dislike of orange generally increases as age increases
	Dislike of orange generally decreases as age increases
	Dislike of orange generally increases as age decreases
	Dislike of orange generally decreases as age decreases

Hints:

A large percentage represents MORE dislike in the color. A small percentage represents LESS dislike in the color.

What happens to the percentage of the color orange as age increases? Do the percentages get larger or smaller? What does this represent?

Assuming that all age groups contain the same number of surveyed people, which color is considered the most liked?

Multiple Choice:

	Black
	Purple
	Blue
	Orange

Hints:

Remember that a large percentage represents MORE dislike in the color, while a small percentage represents LESS dislike in the color.

What does the absence of blue in age groups under 70 years old represent?

Multiple Choice:

	Little to no younger people under 70 years old disliked blue.
	Younger people under 70 years old don't care about blue.
	Older people who are over 70 years old love blue more than young people.
	Older and younger people alike have no comments on blue.

Comparing the colors purple and green, which one is better liked? Why?

Multiple Choice:

	Green, because few people dislike green.
	Purple, because few people dislike purple.
	Purple, because a large amount of people love purple.
	Green, because a large amount of people love green.

How many Moroccans were infected with syphilis in 1991 and 1992?

International Encyclopedia of Sexuality. "Morocco." Accessed October 27, 2010. http://www2.hu-berlin.de/sexology/IES/morocco.html.

Multiple Choice:

	10,458
	4,952
	5,506
	10,000

What is the marginal distribution of Moroccans who were infected with STDs in 1997?

Multiple Choice:

	19%
	22%
	24%
	17%

Hints:

The group we are looking at is Moroccans who were infected with STDs in 1997. This is represented by "Total" in the year 1997, which was 189,021 Moroccans.

To find the total amount of Moroccans represented by this table, simply add up the "Total" row.

The marginal distribution of Moroccans who were infected with STDs in 1997 is the percentage of Moroccans infected in 1997 over the total amount of Moroccans represented by the table.

What is the conditional distribution of other STDs in 1995? Round your answer to the nearest hundredth.

Multiple Choice:

	2.52%
	2%
	1.86%
	2.98%

Hints:

To find the conditional distribution of the row variable for one specific value of the column variable, look only at that column in the table. Find each entry in the column as a percent of the column total.

Which type of STD is the most uncommon among Moroccans between the years of 1991-1998?

Multiple Choice:

	Genital Herpes
	Condyloma
	Hepatitis
	Chancre

Now it's time to use the R-program to further understand two-way frequency tables.
------------ START CODE

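# Counts entered row by row: male (smoke, nosmoke), then female (smoke, nosmoke)
sexsmoke <- matrix(c(70, 120, 65, 140), ncol = 2, byrow = TRUE)
# Label the rows and columns of the table
rownames(sexsmoke) <- c("male", "female")
colnames(sexsmoke) <- c("smoke", "nosmoke")
# Convert the matrix to a table object
sexsmoke <- as.table(sexsmoke)
# Print the two-way frequency table
sexsmoke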

--------------- END CODE

Scrolling down, you should get a two-way frequency table comparing the number of males and females who smoke or don't smoke. What is the total number of people represented in this data?

Check your answer by entering the line below into R, which will give you the total number of people represented in this data.

margin.table(sexsmoke)
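
As a cross-check, the grand total can be compared against the row and column totals. The sketch below rebuilds the same table and calls margin.table with its margin argument (1 for rows, 2 for columns). The names row_totals, col_totals, and grand_total are illustrative and are not part of the original exercise.

```r
# Rebuild the two-way table from the exercise above.
sexsmoke <- matrix(c(70, 120, 65, 140), ncol = 2, byrow = TRUE)
rownames(sexsmoke) <- c("male", "female")
colnames(sexsmoke) <- c("smoke", "nosmoke")
sexsmoke <- as.table(sexsmoke)

# Totals for each row (sex) and each column (smoking status).
row_totals <- margin.table(sexsmoke, 1)
col_totals <- margin.table(sexsmoke, 2)

# Grand total of all four cells.
grand_total <- margin.table(sexsmoke)

print(row_totals)
print(col_totals)
print(grand_total)
```

The row totals and the column totals should each add up to the same grand total, which is a quick way to catch a data-entry mistake.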

Cyclismo. "R Tutorial: Tables." Accessed October 28, 2010. http://www.cyclismo.org/tutorial/R/tables.html.
Cyclismo. "R Tutorial: The Basic Data Types." Accessed October 28, 2010. http://www.cyclismo.org/tutorial/R/types.html.

Multiple Choice:

	395
	400
	380
	375

Add the following line in:
sexsmoke/margin.table(sexsmoke)

What do you think this new data set gives us?

Multiple Choice:

	Marginal Distributions
	Conditional Distributions
	Row Totals
	Column Totals

Using the marginal distribution table created from the previous problem, determine which group has the lowest marginal distribution.

Multiple Choice:

	Females who smoke
	Females who don't smoke
	Males who smoke
	Males who don't smoke

Enter the line
prop.table(sexsmoke, 2)
This gives us the Relative Frequency of Column Table, which gives us the conditional frequencies of smokers and non-smokers. What is the conditional frequency of women who smoke?

Multiple Choice:

	48.15%
	51.45%
	43.15%
	53.83%
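
To see what the second argument of prop.table controls, it may help to compare column-wise and row-wise proportions side by side. This is a minimal sketch; the names col_props and row_props are illustrative, and rounding to two decimal places is an assumption made only for readability.

```r
# Rebuild the two-way table from the exercise above.
sexsmoke <- matrix(c(70, 120, 65, 140), ncol = 2, byrow = TRUE)
rownames(sexsmoke) <- c("male", "female")
colnames(sexsmoke) <- c("smoke", "nosmoke")
sexsmoke <- as.table(sexsmoke)

# Margin 2: each cell divided by its column total (each column sums to 1).
col_props <- prop.table(sexsmoke, 2)

# Margin 1: each cell divided by its row total (each row sums to 1).
row_props <- prop.table(sexsmoke, 1)

# Show both tables as percentages.
print(round(100 * col_props, 2))
print(round(100 * row_props, 2))
```

The column version answers "of the smokers, what share are women?", while the row version answers "of the women, what share smoke?". The two questions generally have different answers.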

Which of the following statements best describe the data, based on your observations from the Marginal Frequency Table and Relative Frequency of Column Table? Check all statements that apply.

Check All That Apply:

	More men smoke than women.
	More women choose not to smoke than to smoke.
	Most men do not smoke.
	More women than men choose not to smoke.
	The marginal frequency of male smokers is less than the marginal frequency of women smokers.

From: https://oli.web.cmu.edu/jcourse/workbook/activity/page?context=90de571d80020ca60104b2620b3f3e00&view=frameset
Suppose you want to determine the musical preferences of all students at your university, based on a sample of students. Here are some examples of the many possible ways to pursue this problem.
1. Post a music-lovers' survey on a university internet bulletin board, asking students to vote for their favorite type of music.
This is an example of a volunteer sample, where individuals have selected themselves to be included. Such a sample is almost guaranteed to be biased. In general, volunteer samples tend to be comprised of individuals who have a particularly strong opinion about an issue (and are just waiting for an opportunity to voice it....). Whether the variable's values obtained from such a sample are over- or under-stated, and to what extent, cannot be determined. As a result, data obtained from a voluntary response sample is quite useless when you think about the "Big Picture" since the sampled individuals only provide information about themselves, and we cannot generalize to any larger group at all.
Comment: As we will see in our discussion of study design, a volunteer sample is not so problematic when it is taken for the purpose of carrying out an experiment where individuals are randomly assigned to different treatment groups.
2. Stand outside the Student Union, across from the Fine Arts Building, and ask students passing by to respond to your question about musical preference.
This is an example of a convenience sample, where individuals happen to be at the right time and place to suit the schedule of the researcher. Depending on what variable is being studied, it may be that a convenience sample provides a fairly representative group. However, there are often subtle reasons why the sample's results are biased. In this case, the proximity to the Fine Arts Building might result in a disproportionate number of students favoring classical music. A convenience sample may be susceptible to bias because certain types of individuals are more likely to be selected than others. In the extreme, some convenience samples are designed in such a way that certain individuals have no chance at all of being selected, as in the next example.
3. Ask your professors for email rosters of all the students in your classes. Randomly sample some addresses, and email those students with your question about musical preference.
Here is a case where the sampling frame---list of potential individuals to be sampled---does not match the population of interest. The population of interest consists of all students at the university, whereas the sampling frame consists of only your classmates. There may be bias arising because of this discrepancy. For example, students with similar majors will tend to take the same classes as you, and their musical preferences may also be somewhat different from those of the general population of students. It is always best to have the sampling frame match the population as closely as possible.
4. Obtain a student directory with email addresses of all the university's students, and send the music poll to every 50th name on the list.
This is called systematic sampling. It may not be subject to any clear bias, but it would not be as safe as taking a random sample.
If individuals are sampled completely at random, and without replacement, then each group of a given size is just as likely to be selected as all the other groups of that size. This is called a simple random sample (SRS). In contrast, a systematic sample would not allow for sibling students to be selected, because of having the same last name. In a simple random sample, sibling students would have just as much of a chance of both being selected as any other pair of students. Therefore, there may be subtle sources of bias in using such a sampling plan.
5. Obtain a student directory with email addresses of all the university's students, and send your music poll to a simple random sample of students. As long as all of the students respond, then the sample is not subject to any bias, and should succeed in being representative of the population of interest.
But what if only 40% of those selected email you back with their vote?
The results of this poll would not necessarily be representative of the population because of volunteer response. Since individuals are not compelled to respond, often a relatively small subset take the trouble to participate. Volunteer response is not as problematic as a volunteer sample (presented in (1) above), but there is still a danger that those who do respond are different from those who don't, with respect to the variable of interest. An improvement would be to follow up with a second email, asking politely for students' cooperation. This may boost the response rate, resulting in a sample that is fairly representative of the entire population of interest, and it may be the best that you can do, under the circumstances. Non-response is still an issue, but at least you have managed to reduce its impact on your results.
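
The difference between the systematic sample in (4) and the simple random sample in (5) can be sketched in R. Everything below is hypothetical: the directory of 1000 made-up email addresses, the sample size of 20, and the fixed seed are assumptions chosen only for illustration.

```r
set.seed(1)  # assumption: fixed seed so the sketch is repeatable

# Hypothetical student directory with 1000 made-up addresses.
directory <- sprintf("student%04d@example.edu", 1:1000)

# Simple random sample: every group of 20 students is equally likely.
srs <- sample(directory, size = 20, replace = FALSE)

# Systematic sample: every 50th name, starting at a random position
# among the first 50 names on the list.
start <- sample(1:50, 1)
systematic <- directory[seq(from = start, to = length(directory), by = 50)]

print(srs)
print(systematic)
```

In the systematic sample, two students who sit next to each other on the list can never both be chosen. That is the sibling issue the passage describes.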

Did You Read This? (We know that it's long, but it will really help!)

Multiple Choice:

	yes
	No, I am lazy

We want to find out who the senior class wants to DJ prom. We post a sheet outside of the guidance office asking students to write their preference. This is an example of what kind of survey?

Multiple Choice:

	Convenience
	Read the above passage.
	Volunteer Sample
	Systematic
	Read the above passage
	Volunteer Response
	Read the above passage. Be careful not to confuse volunteer sample and volunteer response.

Hints:

In a volunteer response everybody is asked the question and may choose to answer or not. In a volunteer sample people select themselves to be included in the survey.

Refer to Question 2, the volunteer sample. Is this sample biased?

Multiple Choice:

	Yes
	No
	Read the passage from Question 1

Referring to the volunteer sample in Question 2, why is it biased?
Keep in mind that answers come right from the passage in Question 1.

Check All That Apply:

	volunteer samples tend to be comprised of individuals who have a particularly strong opinion about an issue
	individuals have selected themselves to be included.
	The whole class has been made aware that the survey exists.
	This is a reason it is NOT biased. If only part of the class knew that the survey existed then it would be biased.
	The guidance counselors have a meeting r2
	This is completely irrelevant to the survey

We want to know who the Boston Latin School Senior Class' favorite teacher is (besides Mr. Simoneau). We obtain a list of every senior in the school. We ask every third senior who their favorite teacher is. This is an example of what kind of survey?

Multiple Choice:

	Convenience
	Volunteer Sample
	Systematic
	Volunteer Response

Hints:

Read the original lesson

Multiple Choice:

	This is horribly biased
	This is slightly biased, but it would be more biased if we just asked our friends who their favorite teacher is.
	This is not biased at all.
	What if none of the students we asked took German? How would we know if any of the German teachers were good?

Multiple Choice:

	Ways to ask people questions
	The answer people give to the question
	A list of potential individuals to be sampled
	The number of answers possible for the question

Hints:

Make sure you read the lesson CAREFULLY.

A Health Science Magazine, Doctors Learn (DL), wants to know which Boston college's students drink the most alcohol. They will find out by asking students how many alcoholic beverages they have per week. The following are two ways that DL could go about surveying the students:
a. DL surveys all of the students in every college's Education, Nursing, Engineering, Business, and International Relations Majors.
b. DL surveys 10 students in every major at every college.

Exact Match (case sensitive):

Cluster

Hints:

Is this an example of stratified sampling or cluster sampling? Make sure that you read the passage to understand the exact difference.

A Census is used to find out different facts about people living in a certain area. Let's use the United States Census as our example. The US census is taken every ten years. The US census is the way that the government knows about the population of the United States. How does the government know how many buses it should run in a certain area? Through the census the government knows how many people live in a specific area and how many of those people rely on public transportation.
Here is a link to the United States Census: http://2010.census.gov/2010census/ The following questions will be about this site.

Under the "How It Works" page the site says that the Census is used to allocate funds for ....

Check All That Apply:

	Hospitals
	Animal Shelters
	Public Works Projects
	Emergency Services

Hints:

Make sure you read the bullet points carefully. Three out of the four answers are correct.

On the Census homepage, click on the "Learn More About Data Processing" link and watch the video. (It won't kill you. It's less than a minute.) Answer the following questions about the video.
The video explains that the census is used to determine how many seats a state gets in the U.S. House of Representatives. True or False?

Multiple Choice:

	True
	False

- Carry out an observational study, where values of the variable or variables of interest are recorded as they naturally occur. There is no interference by the researchers who conduct the study.
- Take a sample survey, which is a particular type of observational study where individuals report variables' values themselves, frequently by giving their opinions.
- Perform an experiment: instead of assessing values of variables as they naturally occur, the researchers interfere, and they are the ones who assign values of the explanatory variable to the individuals. The reason why the researchers "take control" of the values of the explanatory variable is because they want to see how changes in the values of the explanatory variable affect the response. (Note: By nature, any experiment, then, involves at least two variables)
The type of design used, and the details of the design, are crucial, since they will determine what kind of conclusions we may draw from the results. In particular, when studying relationships in the Exploratory Data Analysis unit, we stressed that an association between two variables does not guarantee that a causal relationship exists. In this module, we will explore how various details of a study design play a crucial role in our ability to establish evidence of causation.

EXAMPLE

Suppose researchers want to determine whether people tend to snack more while they watch TV. In other words, the researchers would like to explore the relationship between the explanatory variable "TV" (a categorical variable that takes the values 'on' and 'not on') and the response "snack consumption".
Identify each of the following designs as being an observational study, a sample survey, or an experiment.
1. Recruit participants for the study. While they are presumably waiting to be interviewed, half of the individuals sit in a waiting room with snacks available and a TV on. The other half sit in a waiting room with snacks available and no TV, just magazines. Researchers determine whether people consume more snacks in the TV setting.
This is an experiment, because the researchers take control of the explanatory variable of interest (TV watched or not) by assigning each individual to either watch TV or not, and determine the effect on the response of interest (snack consumption).
2. Recruit participants for a study. Give them journals to record hour by hour their activities the following day, including TV watched and food consumed. Determine if food consumption is higher during TV times.
This is an observational study, because the participants themselves determined whether or not TV was watched. There is no attempt on the researchers' part to interfere.
3. Recruit participants for a study. Ask them to recall, for each hour of the previous day, whether they were watching TV, and what food they consumed each hour. Determine whether food consumption was higher during the TV times.
This is also an observational study; again, it was the participants themselves who decided whether or not to watch TV. (Do you see the difference between 2 and 3? See the comment below.)
4. Poll a sample of individuals with the following question: While watching TV, do you tend to snack (a) less than usual (b) more than usual (c) the same amount as usual?
This is a sample survey, because the individuals self-assess the relationship between TV watching and snacking.

Comment
Notice that in Example 2, the values of the variables of interest (TV and Snacking Habits) are recorded forward in time. Such observational studies are called prospective. In contrast, in Example 3 the values of the variables of interest are recorded backward in time. This is called a retrospective observational study.
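
The random assignment that makes Example 1 an experiment can be sketched in R. The 20 participant IDs, the even 10/10 split, and the fixed seed below are hypothetical choices for illustration. The original example only says that half of the individuals go to each room.

```r
set.seed(2)  # assumption: fixed seed so the sketch is repeatable

# Hypothetical participant IDs.
participants <- paste0("P", 1:20)

# Shuffle ten "TV on" and ten "TV not on" labels so that the researchers,
# not the participants, decide who gets which room.
room <- sample(rep(c("TV on", "TV not on"), each = 10))

assignment <- data.frame(participant = participants, room = room)
print(assignment)
print(table(assignment$room))
```

Because the labels are shuffled at random, any difference in snacking between the two rooms is harder to blame on who happened to choose TV.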

N.B. This material, as well as the following 4 questions, was taken from https://oli.web.cmu.edu/jcourse/workbook/activity/page?context=90e062da80020ca601f85ed29d5c001c&view=frameset, a Creative Commons website.

Multiple Choice:

	Yes
	No
	Please don't be lazy. This is very helpful

Identify the type of design in the following scenario:
An internet poll asks people to vote on their favorite American Idol singer. This is an example of ___________ kind of study.

Multiple Choice:

	Prospective Observational Study
	This is not quite right. Which study design typically asks for your opinion?
	Retrospective Observational Study
	This is not quite right. Which study design typically asks for your opinion?
	Survey
	Experiment
	Does an experiment ask for the opinion of participants? No.

Identify the type of study:
Researchers compared the rates of autism for children who did and did not receive the standard measles-mumps-rubella vaccine, to see if the vaccine was responsible for autism in some children.

Multiple Choice:

	Prospective Observational Study
	A prospective observational study records what is happening in the FUTURE. This study is looking at the past.
	Retrospective Observational Study
	Survey
	Which study records values of variables as they happened in the past?
	Experiment
	Which study records values of variables as they happened in the past?

Identify the type of study:
Researchers injected some patients' underarms with Botox, and others with salt water, in order to see if Botox (which was originally intended to smooth wrinkles) would also reduce sweating.

Multiple Choice:

	Prospective Observational Study
	Not quite. Note that, in this study, a treatment (Botox/salt water) was imposed on the individuals. Which study design does that?
	Retrospective Observational Study
	Nope. A retrospective observational study looks at events that happened in the past. In this study a treatment was imposed on the individuals. Which study does that?
	Survey
	Not quite. Note that, in this study, a treatment (Botox/salt water) was imposed on the individuals. Which study design does that?
	Experiment

Identify the type of study:
Researchers classified pregnant women as being non-drinkers or light, moderate, or heavy drinkers; they examined the weights of their children at regular age intervals to see if alcohol during pregnancy results in poor growth.

Multiple Choice:

	Prospective Observational Study
	Retrospective Observational Study
	Nope. A Retrospective Observational Study takes place in the past. This study is taking place as it happens. What type of study takes place as it happens?
	Survey
	This is not quite right. Which study design records values of variables as they naturally happen forward in time?
	Experiment
	This is not quite right. Which study design records values of variables as they naturally happen forward in time?

Algebraic Expression:

Use the internet to find the definition of a census. What is the purpose of a census?

Multiple Choice:

	Studying part of a population
	This is a sampling, not a census
	An attempt to contact every individual in a population
	An experiment

The US Census is conducted every __________ years.

Algebraic Expression:

10

Hints:

Visit http://2010.census.gov/2010census/how/index.php

The 2000 US Census data can be found here: http://www2.census.gov/census_2000/datasets/demographic_profile/0_United_States/2kh00.pdf

We can use this data to estimate the population percentages for any state given their population. Go to page 3 and find what percentage of the US population is Male. What is that percentage?

Algebraic Expression:

	49.1
	.491
	That is a decimal. What is the percentage?

Using the percentage found in the previous problem, what percentage of the population in Massachusetts would you expect to be male?

Algebraic Expression:

	49.1
	.491
	That is the decimal.

If the population of Massachusetts is 6.4 million and you expect 49.1% to be male, how many males do you expect to live in MA?

Algebraic Expression:

3142400

Scaffold:

The first step is to write 49.1% as a decimal. What is that decimal?

Algebraic Expression:

.491

Scaffold:

What is 6.4 million as one number? (Write it out with all the zeros.)

Algebraic Expression:

6400000

Scaffold:

Multiply 6400000 by .491
This is your answer.

Algebraic Expression:

3142400
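
The three scaffold steps above can also be carried out in R. This is a small sketch using the numbers from the problem; the variable names are illustrative.

```r
# Population of Massachusetts used in the problem (6.4 million).
population <- 6400000

# 49.1% written as a decimal.
share_male <- 0.491

# Expected number of males is the population times the share.
expected_males <- population * share_male
print(expected_males)
```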

According to the 2000 MA Census Data, there were 3,058,816 men in MA, not 3,142,400, as we estimated in the previous problem. How far off were you from the actual number (as a positive percentage)?

Algebraic Expression:

	2.73
	2.7325
	Too many places!!

Scaffold:

First, you need to realize that this problem is asking by what percentage of the actual number you were off. Start by subtracting the actual number from the expected number. What did you get?

Algebraic Expression:

	83584
	-83584
	It should be positive, not negative.

Scaffold:

Then divide that number by the actual number of men in MA (or 3,058,816) and multiply by 100 to get the percentage. Limit your answer to 2 decimal places. What did you get as a result?

Algebraic Expression:

2.73

Scaffold:

The answer to the last scaffold is the answer to the problem. 2.73% is the answer. Retype that here to make sure you got the correct answer.

Algebraic Expression:

2.73
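
The percent-off calculation from the scaffolds above can be sketched in R. It uses the two counts given in the problem; the variable names are illustrative.

```r
# Estimate from the previous problem and the actual 2000 census count.
estimated <- 3142400
actual <- 3058816

# Step 1: how far apart the two numbers are (kept positive).
difference <- abs(estimated - actual)

# Step 2: express the difference as a percentage of the actual count,
# rounded to 2 decimal places as the problem asks.
percent_off <- round(difference / actual * 100, 2)

print(difference)
print(percent_off)
```

Dividing by the actual count rather than the estimate matters here, since the two choices give slightly different percentages.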

Write a reflection about what you have learned about census data collection and how this information can be used.

Ungraded Open Response:

Do you understand census data collection?

Check All That Apply:

	yes
	No
	Look back at the previous questions and retry them before continuing

Go to http://en.wikipedia.org/wiki/Survey_sampling

What is the definition/purpose of Survey sampling?

Multiple Choice:

	To reduce the cost and/or the amount of work that it would take to survey the entire target population.
	To contact everyone in a population
	That is a census
	To study a population in its entirety
	That is a census

What does a sample survey often entail? (Refer to the website provided in the previous question)

Multiple Choice:

	It most often refers to a questionnaire used to measure the characteristics and/or attitudes of people.
	It contacts everyone in the population
	That is a census

What is the difference between a survey and a census? (Respond in 2-3 Sentences)

Ungraded Open Response:

If deduction is essentially going from the larger picture to the smaller picture, what can deduction be compared to: census or sample survey?

Multiple Choice:

	Census
	Sample Survey

If induction is essentially going from the smaller picture to the larger picture, what can induction be compared to: census or sample survey?

Multiple Choice:

	Census
	Sample Survey

Multiple Choice:

	yes
	no
	Look back at the previous questions before continuing on.

Write 2-3 Sentences about the differences between a census and a sample survey.

Ungraded Open Response:

Sampling involves

Multiple Choice:

	studying a part in order to gain information about the whole.
	conducting an experiment
	studying the whole in order to gain information about a part.

Which of the following is not a type of sampling?

Multiple Choice:

	voluntary response sample
	convenience sampling
	probability sample
	strange sampling

True or false: Undercoverage occurs when an individual chosen for the sample can't be contacted or does not cooperate.

Multiple Choice:

	true
	This is false because the above definition describes nonresponse. Undercoverage occurs when some groups in the population are left out of the process of choosing the sample.
	False

Nike wants to know what type of sneakers high school athletes wear. They send out surveys to 100 high school sports teams at random. They receive 76 back. What is the population for this study, and what is the sample?

Multiple Choice:

	high school athletes; 76
	100;76
	high school athletes;100
	76;100

Do you know what "Table B" is?

Multiple Choice:

	Yes
	No
	Table B is a table of random digits.

SRS

Algebraic Expression:

Go to the following website and find the definition of an experiment.

http://stattrek.com/AP-Statistics-2/Experiment.aspx?Tutorial=Stat

Multiple Choice:

	Manipulation of one or more variables, while holding all other variables constant.
	Changing every single variable
	Holding all variables constant

Go to the following website about observational studies. What is the definition of an observational study?

http://en.wikipedia.org/wiki/Observational_study

Multiple Choice:

	draws inferences about the possible effect of a treatment on subjects, where the assignment of subjects into a treated group versus a control group is outside the control of the investigator
	Conducts an experiment
	Holds all variables constant
	Changes one variable while holding the others constant

What are the differences between an experiment and observational study?

Multiple Choice:

	An observer observes without interfering, but an experimenter interferes.
	An experimenter observes without interfering, but an observer interferes.

Sample surveys are another method of data collection. There are 30 students in the class. These are the test scores taken from a sample survey.

74, 89, 68, 95, 100, 96, 70, 79, 90

Does this data reflect exactly how the entire class did?

Multiple Choice:

	Yes
	It's not yes because a sample survey does not represent exactly the entire population.
	No

Scaffold:

If a survey shows that 40% of the student population have part-time jobs after school, does this mean that if you take 100 students, 40 of them must be working part-time jobs now?

Multiple Choice:

	Yes
	No, because it is only a sample survey, so it just means that 2 out of 5 students on average have a part-time job.
	No

Scaffold:

Based on a sample survey, 15% of teenagers have smoked before. What can we infer from the statistics provided?

Multiple Choice:

	On average 3 out of 20 teenagers smoke
	If 100 teenagers were polled, 15 of them must be smokers.

Hints:

It is not necessarily true that exactly 15 out of every 100 people surveyed have smoked before.
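
A short simulation can make this hint concrete. The sketch below assumes, purely for illustration, that each polled teenager independently has a 15% chance of having smoked. It then repeats a poll of 100 teenagers ten times. The seed and the number of repetitions are arbitrary choices.

```r
set.seed(3)  # assumption: fixed seed so the sketch is repeatable

# Ten hypothetical polls of 100 teenagers each, where each teenager
# has a 15% chance of having smoked before.
counts <- rbinom(n = 10, size = 100, prob = 0.15)

# Number of teenagers who have smoked in each poll.
print(counts)

# Average count across the ten polls.
print(mean(counts))
```

The counts vary from poll to poll even though the underlying rate never changes, which is why 15% describes an average rather than a guarantee.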

Suppose ABC College has 10,000 part-time students (the population). We are interested in the average amount of money a part-time student spends on books in the fall term. Asking all 10,000 students is an almost impossible task.

Suppose we take two different samples.

First, we use convenience sampling and survey 10 students from a first term organic chemistry class. Many of these students are taking first term calculus in addition to the organic chemistry class. The amount of money they spend is as follows:

$128; $87; $173; $116; $130; $204; $147; $189; $93; $153

The second sample is taken by using a list from the P.E. department of senior citizens who take P.E. classes and taking every 5th senior citizen on the list, for a total of 10 senior citizens. They spend:

$50; $40; $36; $15; $50; $100; $40; $53; $22; $22

Do you think that either of these samples is representative of (or is characteristic of) the entire 10,000 part-time student population?

Source: Barbara Illowsky and Susan Dean, "Sampling and Data: Sampling," Connexions, February 23, 2010, http://cnx.org/content/m16014/latest/

Multiple Choice:

	Yes
	No

Scaffold:

Since these samples are not representative of the entire population, is it wise to use the results to describe the entire population?

Multiple Choice:

	Yes
	Never use a sample that is not representative or does not have the characteristics of the population.
	No

Scaffold:

Now, suppose we take a third sample. We choose ten different part-time students from the disciplines of chemistry, math, English, psychology, sociology, history, nursing, physical education, art, and early childhood development. Each student is chosen using simple random sampling. Using a calculator, random numbers are generated and a student from a particular discipline is selected if he/she has a corresponding number. The students spend:

$180; $50; $150; $85; $260; $75; $180; $200; $200; $150
Do you think this sample is representative of the population?

Multiple Choice:

	Yes
	No
	It is chosen from different disciplines across the population.
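
One way to compare the three samples described above is to compute each sample's mean in R. The data values come straight from the problem; the variable names are illustrative.

```r
# Dollars spent on books in each of the three samples.
chem_sample  <- c(128, 87, 173, 116, 130, 204, 147, 189, 93, 153)
pe_sample    <- c(50, 40, 36, 15, 50, 100, 40, 53, 22, 22)
mixed_sample <- c(180, 50, 150, 85, 260, 75, 180, 200, 200, 150)

# Mean spending in each sample.
sample_means <- c(
  chemistry = mean(chem_sample),
  pe        = mean(pe_sample),
  mixed     = mean(mixed_sample)
)
print(sample_means)
```

A large gap between the sample means is a sign that at least one sample does not reflect the whole population of part-time students.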

What sampling technique is being used in this scenario?
Voters are selected at random from an alphabetical list of all registered voters.

Source: N/A, "Sampling (2 of 2)," Open Learning, N/A, https://oli.web.cmu.edu/jcourse/workbook/activity/page?context=90e062c080020ca6007985a9a1b18479&view=frameset

Multiple Choice:

	cluster sampling
	simple random sampling
	stratified sampling
	systematic sampling

Hints:

Simple Random Sampling: when each individual has the same chance of being selected, like "selecting names out of a hat".

Cluster Sampling: when our population is naturally divided into groups (clusters). For example, all the students in a university are divided into majors; all the nurses in a certain city are divided into hospitals; all registered voters are divided into precincts (election districts). In cluster sampling we take a random sample of clusters, and use all the individuals within the selected clusters as our sample. For example, in order to get a sample of high-school seniors from a certain city, you choose 3 high schools at random from among all the high schools in that city, and use all the high-school seniors in the three selected high schools as your sample.
Stratified Sampling: when our population is naturally divided into sub-populations (strata). For example, all the students in a certain college are divided by gender or by year in college; all the registered voters in a certain city are divided by race. In stratified sampling, we choose a simple random sample from each stratum, and our sample consists of all these simple random samples put together. For example, in order to get a random sample of high-school seniors from a certain city, we choose a random sample of 25 high-school seniors from each of the high schools in that city. Our sample consists of all these samples put together.
Systematic sampling: when an organized (but not random) approach to the selection process is taken, such as picking every 50th name on a list, or the first product to come off the production line each hour.

Source: N/A, "Sampling (2 of 2)," Open Learning, N/A, https://oli.web.cmu.edu/jcourse/workbook/activity/page?context=90e062c080020ca6007985a9a1b18479&view=frameset
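
The cluster and stratified examples in the hints above can be sketched in R. The population here is hypothetical: 5 made-up high schools with 200 seniors each. The seed is fixed only so the sketch is repeatable. The choices of 3 schools and 25 seniors per school follow the hints.

```r
set.seed(4)  # assumption: fixed seed so the sketch is repeatable

# Hypothetical population: 5 high schools with 200 seniors each.
seniors <- data.frame(
  id     = 1:1000,
  school = rep(paste0("School", 1:5), each = 200)
)

# Cluster sampling: choose 3 schools at random and keep every senior in them.
chosen_schools <- sample(unique(seniors$school), 3)
cluster_sample <- seniors[seniors$school %in% chosen_schools, ]

# Stratified sampling: a simple random sample of 25 seniors from each school,
# then all of those samples put together.
by_school <- split(seniors, seniors$school)
stratified_sample <- do.call(
  rbind,
  lapply(by_school, function(s) s[sample(nrow(s), 25), ])
)

print(table(cluster_sample$school))
print(table(stratified_sample$school))
```

The cluster sample leaves some schools out entirely, while the stratified sample includes every school but only some of its seniors.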

What sampling technique is being used in this scenario?
Voters are selected by choosing at random several of the city's zip codes and selecting all the voters from those selected zip codes.

Source: N/A, "Sampling (2 of 2)," Open Learning, N/A, https://oli.web.cmu.edu/jcourse/workbook/activity/page?context=90e062c080020ca6007985a9a1b18479&view=frameset

Multiple Choice:

	cluster sampling
	simple random sampling
	stratified sampling
	systematic sampling

What sampling technique is being used in this scenario?
Several pieces of fruit from each tree in an orchard are selected.

Source: N/A, "Sampling (2 of 2)," Open Learning, N/A, https://oli.web.cmu.edu/jcourse/workbook/activity/page?context=90e062c080020ca6007985a9a1b18479&view=frameset

Multiple Choice:

	cluster sampling
	simple random sampling
	stratified sampling
	systematic sampling

Which two of the basic principles of statistical design of experiments help prevent bias, or systematic favoritism, in experiments?

Multiple Choice:

	Replication and Control
	Randomization and control
	Replication and Randomization

Hints:

Comparison is the simplest form of control. Experiments should compare two or more treatments in order to prevent confounding the effect of a treatment with other influences, such as lurking variables.

There is one main difference between experiments and observational studies: in experiments, the explanatory variable is set for a certain sample of test subjects, whereas in observational studies, the explanatory variable is set by the test subjects themselves.

Which of the following scenarios describes a situation in which the method of data collection is an observational study?

Multiple Choice:

	In order to discover which of 4 weight loss programs gets the best results, subjects are randomly selected for each of the possible programs.
	No, this represents an experiment because the subjects were assigned the weight loss program they were to use.
	In order to find out which graduates have the highest starting salaries we gather information from graduates of ivy league, private, and public universities.
	A delivery company that owns a fleet of trucks performs a study to decide if rotating tires at specific intervals has any effect on the number of miles a set of tires lasts.
	No, this represents an experiment because the company can decide which trucks have their tires rotated.
	In order to find out whether a new hair product increases strength of hair, subjects are asked to either continue with their daily hair products or try the new product.
	No, this represents an experiment because the subjects have the choice of whether or not to use the new hair product.

Hints:

Ask yourself, are the test subjects free to choose the explanatory variable that yields their response variable? Or is it assigned to them?

Last year 20% of a group of adult women did not have a cold throughout the year. This year they participated in a study in which they all took Echinacea capsules every day, and 30% did not get a cold. It was concluded that Echinacea capsules prevent colds.
Source: N/A, "Understanding Statistics", Australian Bureau of Statistics, November 13, 2009, http://www.abs.gov.au/websitedbs/a3121120.nsf/4a256353001af3ed4b2562bb00121564/ce88a10d17c46b56ca257610000a9364!OpenDocument

What method of data collection was used to draw this conclusion?

Multiple Choice:

	Census
	Sorry, try again.
	Sample Survey
	Sorry, try again.
	Experimental Study
	Observational Study
	Sorry, try again.

A study of engineers showed that those who had completed a certificate earned 10% more, on average, than those who had completed a degree.
Source: N/A, "Understanding Statistics", Australian Bureau of Statistics, November 13, 2009, http://www.abs.gov.au/websitedbs/a3121120.nsf/4a256353001af3ed4b2562bb00121564/ce88a10d17c46b56ca257610000a9364!OpenDocument

Which of these is the explanatory variable for these results?

Multiple Choice:

	Income
	No
	Engineers
	No
	Level of Schooling
	Age of Engineers
	No

Can the results of an observational study establish the cause of a relationship between an explanatory variable and a response variable?

Source: N/A, "Statistics/Methods of Data Collection/Observational Studies", WikiBooks, May 23, 2010, http://en.wikibooks.org/wiki/Statistics/Methods_of_Data_Collection/Observational_Studies

Multiple Choice:

	Yes
	Sorry, try again.
	No

Scaffold:

Suppose four treatments for drug addiction are offered. A sample of smokers wanting to quit the habit choose the method they want to use to quit. Is this an observational study or an experimental study?

Multiple Choice:

	Observational
	Experimental
	No. Try again.

Scaffold:

Refer to the table above. Based on this data, can we tell which method was the most effective in helping smokers quit their habit?

Exact Match (case sensitive):

	No
	Yes

Scaffold:

Refer to the table above. Does this data mean that the combination drug/therapy method definitely causes the best success rate?

Multiple Choice:

	Yes
	No

Scaffold:

To understand why not, let's consider the study in itself.

Do we know the ages of the people who participated in the study?

Multiple Choice:

	No
	Yes

Scaffold:

Do we know the genders of the individuals who participated in the study?

Multiple Choice:

	Yes
	No

Scaffold:

Variables that we do not know about, or that are not given to us, and that can create an apparent relationship between the variables we study are called lurking variables.

Because of these lurking variables, we cannot conclude that combination drug/therapy definitely yields success in the quest to quit smoking.

Do you now understand the concept of an observational study?

Check All That Apply:

	Yes
	No
	Somewhat
	Conclusions cannot be absolutely drawn from observational studies because of the possibility of unforeseen variables.

It is essential that we approach and solve problems by using and interpreting data, not by giving "obvious" answers.
Statistical significance is usually expressed in terms of a significance level which is a percentage, but no matter what that percentage is, significance does not equate to importance.
Statistical significance implies that there is evidence of an association between two variables.
------------------------------------------

	% with no colds	% with colds
Placebo	18	82
Vitamin C	26	74

Note: A placebo is a "fake" pill given to patients; any effect it has is psychological rather than physical.
Refer to the table above. An experimental study was conducted on a group of people to see whether Vitamin C helps prevent colds. Is the reported difference statistically significant, so that we can conclude that Vitamin C plays a role in preventing colds?
Source: http://www.abs.gov.au/websitedbs/a3121120.nsf/4a256353001af3ed4b2562bb00121564/b3cb0b453c0c4203ca25761700002c35!OpenDocument

Multiple Choice:

	Yes
	No
	Sorry, try again.

Hints:

"Significant" in a statistical sense means "not likely to happen just by chance". It does not mean "important".

Analysis of this result indicated that the difference between the placebo and Vitamin C was statistically significant. Statisticians can calculate that a difference this large would arise by chance in 1% of studies of this size and design; thus the result is statistically significant.
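Whether a gap like 18% versus 26% counts as statistically significant depends on how many people were in each group, which the table does not report. The R sketch below uses two hypothetical group sizes (50 and 500 people per group, both assumptions made only for illustration) and R's built-in prop.test function to show how the same percentages can lead to very different p-values.

```r
# Percentages of people with no colds, taken from the table above
p_placebo   <- 0.18
p_vitamin_c <- 0.26

# Hypothetical helper: p-value of a two-proportion test when each group
# has n_per_group people (the real group sizes are NOT given in the source)
p_value_for <- function(n_per_group) {
  no_colds <- round(c(p_placebo, p_vitamin_c) * n_per_group)
  prop.test(x = no_colds, n = c(n_per_group, n_per_group))$p.value
}

p_small_study <- p_value_for(50)    # assumed 50 people per group
p_large_study <- p_value_for(500)   # assumed 500 people per group

p_small_study
p_large_study
```

A small p-value (conventionally below 0.05) is read as "not likely to happen just by chance". It says nothing about whether the effect is large enough to matter in practice.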
-------------
If a result is practically significant, the conclusion drawn from the data is strong enough to change behavior.
To understand this, let's look at the data table above.
We could ask if the data in the table would be enough to convince people to take 1 gram of Vitamin C every day of their life.
Would people necessarily change their behavior on the basis of these results?
Source: N/A, "Understanding Statistics", Australian Bureau of Statistics, September 1, 2009, http://www.abs.gov.au/websitedbs/a3121120.nsf/4a256353001af3ed4b2562bb00121564/d513a9432adec195ca25761700002cfd!OpenDocument

Multiple Choice:

	No
	Yes

Let's make sure you understand the difference between statistically significant and practically significant.

Let's say a study was conducted which compares test scores after hours of studying without the television on and with the television on. Analysis showed that the difference in the test scores is statistically significant. Does this mean that it is also practically significant, meaning people might change how they study as a result?

Multiple Choice:

	Yes
	No

Now that you have learned four methods of data collection (census, sample survey, experimental study, observational study), let's make sure you know the difference between them.

Which method of data collection gets results from an entire population?

Multiple Choice:

	Sample Survey
	Experimental Study
	Census
	Observational Study

Which method of data collection gets results from a small group of people?

Multiple Choice:

	Census
	Sample Survey
	Observational Study
	Experimental Study

Which method of data collection gets results from a group of people who have the freedom to choose the explanatory variable that yields their response variable?

Multiple Choice:

	Observational Study
	Sample Survey
	Experimental Study
	Census

Hints:

A study is done on a group of teenagers who are studying for a final exam. Researchers tell the students that they can study by either making flashcards, making chapter outlines, doing practice problems, or by rereading the textbook. The students have 3 hours to study for their exam by whichever method they choose. The researchers will then ask the students for the grades they got on the exam and the method of studying they chose.

Which method of data collection does this exemplify?

Which method of data collection gets results from a group of people who are assigned a certain explanatory variable to yield a response variable?

Multiple Choice:

	Census
	Sample Survey
	Observational Study
	Experimental Study

Hints:

Researchers are testing out two new brands of shampoo designed to make your hair grow faster. They gather a group of thirty 25-year-old women and ask them to try out the new products. Fifteen women are to use product A, and fifteen women are to use product B. At the end of the study, the results will be used to see which product helps hair grow faster.

Which method of data collection does this exemplify?

In this assistment you will be taught how to make and conduct a survey.
The purpose of a good survey is to understand and analyze the views of a certain segment of the population. Before writing questions to ask people, you need to fully analyze and understand the exact purpose of the survey.
Once the purpose of the survey is established, we need to identify the specific population we will focus on.
Once we have the people and the purpose, we have to write questions that give the fairest and least biased results.

Sampling the population
This is the hardest part of a survey. When sampling a population one needs to get results that are the closest to the truth and the least biased.
Examples of biased samples:
volunteer sample, where individuals have selected themselves to be included

convenience sample, where individuals happen to be at the right time and place to suit the schedule of the researcher

sampling frame---list of potential individuals to be sampled---does not match the population of interest
How you should sample:
simple random sample - start from a list of everyone that the question pertains to and choose from it at random. Even when using this, there will still be some volunteer response: since individuals are not compelled to respond, often a relatively small subset takes the trouble to participate. Still, this is the least biased way to conduct a survey, as you are giving everyone a chance to answer.

Multiple Choice:

	i understand this
	i do not understand
	https://oli.web.cmu.edu/jcourse/workbook/activity/page?context=90de571d80020ca60104b2620b3f3e00&view=frameset http://www.abs.gov.au/websitedbs/a3121120.nsf/4a256353001af3ed4b2562bb00121564/a5a5e246faa67167ca2575e1001ce4fb!OpenDocument

When you need to find out how much the average American loves sports on a scale of 1-10, how should you conduct the survey?

Multiple Choice:

	Stand in front of a baseball field and ask people
	convenience sample
	Ask only people 75 and older
	wrong sampling frame, older people play fewer sports and thus may be less interested in sports
	Ask 1000 people in Canada
	we are only interested in people living in the United States
	Call 1000 random people

What is the best way to survey people if you want to find out who will win the class election?

Multiple Choice:

	Yell as loudly as you can at lunch and listen to who yells back.
	this is a volunteer sample because only the loudest kids will answer.
	Ask 50 random 12th graders
	Ask an 11th grader
	wrong sampling frame
	Email all your friends
	sampling frame, all of your friends think similarly to you

Different sampling methods
Simple Random Sampling is, as the name suggests, the simplest probability sampling plan. It is equivalent to "selecting names out of a hat", where each individual has the same chance of being selected.
Cluster Sampling - This sampling technique is used when our population is naturally divided into groups (which we call clusters).
Stratified Sampling - Stratified sampling is used when our population is naturally divided into sub-populations (each of which we call a stratum; plural: strata).
Ex.
Suppose you would like to study the job satisfaction of hospital nurses in a certain city based on a sample. Besides taking a simple random sample, here are two additional ways to obtain such a sample:
1. Suppose that the city has 10 hospitals. Choose one of the 10 hospitals at random and interview all the nurses in that hospital regarding their job satisfaction. This is an example of cluster sampling where the hospitals are the clusters.
2. Choose a random sample of 50 nurses from each of the 10 hospitals and interview these 50*10=500 nurses regarding their job satisfaction. This is an example of stratified sampling where each hospital is a stratum.
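The two plans in the nurse example can be sketched in R as follows. The population here is hypothetical: the text does not say how many nurses work at each hospital, so the sketch assumes 120 per hospital, and the names nurses, cluster_sample, and stratified_sample are made up for illustration.

```r
set.seed(1)  # fixed seed only so the sketch is repeatable

# Hypothetical population: 10 hospitals with 120 nurses each (an assumption)
nurses <- data.frame(
  nurse_id = 1:1200,
  hospital = rep(1:10, each = 120)
)

# Cluster sampling: pick one hospital at random and take every nurse in it
chosen_hospital <- sample(1:10, 1)
cluster_sample <- nurses[nurses$hospital == chosen_hospital, ]

# Stratified sampling: take a simple random sample of 50 nurses
# from each hospital, then combine the 10 samples
ids_by_hospital <- split(nurses$nurse_id, nurses$hospital)
stratified_ids <- unlist(lapply(ids_by_hospital, function(ids) sample(ids, 50)))
stratified_sample <- nurses[nurses$nurse_id %in% stratified_ids, ]

nrow(cluster_sample)              # every nurse from the one chosen hospital
table(stratified_sample$hospital) # 50 nurses from each of the 10 hospitals
```

The cluster sample covers a single hospital completely, while the stratified sample reaches into every hospital.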

Which way of sampling is best if you were to survey students at Boston Latin School about how much they like school lunch?

Multiple Choice:

	Simple Random Sampling
	Cluster Sampling
	Stratified Sampling

If you are interested in finding out the approval rating of a president, should you just ask the people in Massachusetts or should you ask people from all 50 states?

Multiple Choice:

	You should only ask Massachusetts residents because Massachusetts is the best state.
	You should ask Canadians
	You should ask people from every state with proper representation
	You should ask only the states on the east coast because they are the closest to the president

What sampling method is best when we are trying to figure out how much students like their high schools in New York?

Multiple Choice:

	Simple Random Sampling
	No because it is too vague
	Cluster Sampling
	no we want all the schools
	Stratified Sampling

When asking the public survey questions, the questions have to be written in an unbiased way.
Examples of biased questions:
With Obama's terrible track record in the senate, do you think he will make a good president?
Do you want biased questions in your Survey?

Multiple Choice:

	no
	yes
	yes

When asking people if they will vote for Obama, what is the best way of asking the question?

Multiple Choice:

	Will you vote For Obama?
	The candidate running against Obama wants to destroy the world, will you vote for Obama?
	Obama loves animals, will you vote for Obama?
	With Obama’s terrible track record in the senate, do you think he will make a good president?

When you are conducting a good survey, you cannot let the people you are asking know your opinion, or their answers may be swayed.

If you are asking people about global warming and if they believe in it, you cannot say that you believe that the earth will be destroyed by global warming.
What is the best Survey question if you are interested in whether or not people believe in Global warming?

Multiple Choice:

	Do you think that we will feel the effects of Global warming in 50 years?
	Do you believe in any form of climate change?
	Polar bears are dying, is global warming real?
	I believe in global warming, do you?

If you are interested in finding out the number of people who have diabetes in Boston, what is the best way to ask the question?

Multiple Choice:

	Diabetes kills 1 in 5 Americans, do you have diabetes?
	People with diabetes will not want to respond
	Do you have diabetes?
	Free medication for anyone who has diabetes, do you have diabetes?
	People love free stuff so more people will say they have diabetes than actually do.
	New cure found for diabetes, do you have diabetes?
	People love free stuff so more people will say they have diabetes than actually do.

Where is the best place to ask the previous question?

Multiple Choice:

	At the top of a high mountain.
	no, people who have diabetes will not be on a high mountain.
	At the Gym
	At the Hospital
	there will be too many people with diabetes at the hospital.
	on the street corner

What is a good number of people to ask for a good survey?

Multiple Choice:

	10
	too few
	100
	better but still few
	10000000
	too many
	1200
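A rough way to compare these sample sizes: for a simple random sample, the worst-case 95% margin of error for a percentage is approximately 1.96 * sqrt(0.25 / n). The R sketch below applies that standard approximation to the four answer choices. It assumes a simple random sample drawn from a much larger population, and the variable names are made up for illustration.

```r
# The four sample sizes offered as answer choices, in increasing order
n <- c(10, 100, 1200, 10000000)

# Approximate worst-case 95% margin of error for a proportion
margin_of_error <- 1.96 * sqrt(0.25 / n)

data.frame(
  sample_size = n,
  margin_in_percentage_points = round(100 * margin_of_error, 2)
)
```

The margin shrinks quickly at first and then levels off, so a very large sample costs far more while adding comparatively little precision.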

When we conduct a survey we always have a purpose.
If we were trying to find out the age at which most people develop diabetes, we would need to take the mean and median of the data.
Mean- is the average
Median- is the middle term
ages- 23,34,37,64,65,66,66,67,70
What are the mean and median?

Multiple Choice:

	54.7, 67
	54.7, 65
	50, 66
	50, 65
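The mean and median of the ages listed in this question can be checked in R with the built-in mean and median functions. The variable name ages is just an illustrative choice.

```r
# Ages from the question above
ages <- c(23, 34, 37, 64, 65, 66, 66, 67, 70)

age_mean <- mean(ages)      # sum of the ages divided by how many there are
age_median <- median(ages)  # middle value once the ages are sorted

round(age_mean, 1)
age_median
```

With nine values, the median is the fifth value in sorted order.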

Now we decided to record the weight of the people in our diabetes survey.
No diabetes(weight lbs) - 134, 140, 150, 160, 170, 168
Yes diabetes(weight lbs) - 200, 210, 290, 300, 250
What is the Mean weight of a person who has diabetes?

Algebraic Expression:

250

Which has the stronger correlation with diabetes: a person's weight or a person's age?
yes diabetes ages- 23,34,37,64,65,66,66,67,70
no diabetes ages - 23, 45, 67, 68, 26,29, 80
No diabetes(weight lbs) - 134, 140, 150, 160, 170, 168
Yes diabetes(weight lbs) - 200, 210, 290, 300, 250

Multiple Choice:

	age
	no
	weight
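One informal way to compare the two variables is to look at how far apart the diabetes and no-diabetes groups are on each one. The R sketch below does this with the numbers given in the question; the variable names are illustrative, and with samples this small the comparison is only a rough guide, not a formal test.

```r
# Data from the question above
age_yes    <- c(23, 34, 37, 64, 65, 66, 66, 67, 70)
age_no     <- c(23, 45, 67, 68, 26, 29, 80)
weight_no  <- c(134, 140, 150, 160, 170, 168)
weight_yes <- c(200, 210, 290, 300, 250)

# Difference between the group means for each variable
age_gap    <- mean(age_yes) - mean(age_no)
weight_gap <- mean(weight_yes) - mean(weight_no)
age_gap
weight_gap

# Do the two groups overlap? Compare their ranges.
range(age_yes);    range(age_no)
range(weight_yes); range(weight_no)
```

If the two groups' ranges barely overlap on a variable, that variable separates the groups more clearly than one where the ranges overlap heavily.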

volunteer sample, where individuals have selected themselves to be included

convenience sample, where individuals happen to be at the right time and place to suit the schedule of the researcher

sampling frame---list of potential individuals to be sampled---does not match the population of interest

Give an example of each.

Ungraded Open Response:

Sources:
Open Learning Initiative, "Sampling," Open Learning Initiative, October 27, 2010 https://oli.web.cmu.edu/jcourse/workbook/activity/page?context=90de571d80020ca60104b2620b3f3e00&view=frameset

Australian Bureau of Statistics, "Module 1: Producing Data," Australian Bureau of Statistics, October 28, 2010 http://www.abs.gov.au/websitedbs/a3121120.nsf/4a256353001af3ed4b2562bb00121564/a5a5e246faa67167ca2575e1001ce4fb!OpenDocument

Ungraded Open Response:

Type in the following code:

##Start of code
# Draw 5 labels from 1 to 30; replace=F (short for FALSE) means no label can be drawn twice
sample(1:30, 5, replace=F)
##End of Code

This will allow you to take an SRS from a population of 30, with a sample size of 5, without replacement, so that you cannot choose the same number twice. This SRS is chosen fairly because the numbers are randomly generated and each chosen number is removed from the lottery.
Do you understand how the lottery works?

Multiple Choice:

	Yes.
	No

Scaffold:

The lottery works well for choosing an SRS because it selects each number at random using an algorithm and cannot select that number again: the code "replace=F" means a drawn number is not put back into the lottery.
A lottery is used frequently as, you guessed it, the means of choosing the SRS in televised lottery games.
Do you understand the lottery in reference to an SRS?

Multiple Choice:

	Yes.
	No.
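The effect of the replace argument discussed above can be checked directly. This is a minimal sketch: the seed and the variable names are arbitrary choices, and anyDuplicated is a base R function that returns 0 when a vector has no repeated values.

```r
set.seed(42)  # arbitrary seed so the sketch is repeatable

# Lottery without replacement: 5 distinct labels out of 1..30
srs <- sample(1:30, 5, replace = FALSE)
srs
anyDuplicated(srs)   # 0 means no label was drawn twice

# With replacement, repeats are allowed; 1000 draws from only 30 labels
# are guaranteed to contain repeats
with_replacement <- sample(1:30, 1000, replace = TRUE)
anyDuplicated(with_replacement) > 0
```

Without replacement, asking for more than 30 labels from 1:30 would raise an error, because the lottery would run out of numbers.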

Use the labeled table from the review website found here:
https://sites.google.com/site/apstats16/home/srs
Using your textbook, go to line 113 of Table B and find the 5 clubs that will be shown on BLSTV this week.

Multiple Choice:

	62, 56, 87, 02, 06
	45, 14, 93, 29, 92
	40, 32, 50, 36, 99
	02, 06, 08, 25, 11
	62568, 70206, 40325, 03699, 71080

Scaffold:

Reference Table B from your Statistics textbook.
Using the table, find line 113 and take the first 5 sets of numbers.
62568 70206 40325 03699 71080
Next, split these into two-digit groups as follows:
62 56 87 02 06 40 32 50 ... and so on until you've gotten the 5 correct choices.
Which clubs will be chosen to appear this week on BLSTV?

Multiple Choice:

	62, 56, 87, 02, 06
	45, 14, 93, 29, 92
	02, 06, 08, 25, 11
	40, 32, 50, 36, 99

Hints:

Ignore the labels from 31 to 99, and 00, because they are not used in this example.
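The splitting and filtering steps described in the scaffold can be sketched in R. The sketch uses only the digits of line 113 quoted above and assumes the clubs are labeled 01 to 30, as the hint implies; the variable names are illustrative.

```r
# Line 113 of the random digit table, as quoted in the scaffold
line_113 <- "62568 70206 40325 03699 71080"

# Remove the spaces and cut the digits into two-digit groups
digits <- gsub(" ", "", line_113)
n_pairs <- nchar(digits) %/% 2   # an odd leftover digit carries over to the next line
starts <- seq(1, by = 2, length.out = n_pairs)
pairs <- substring(digits, starts, starts + 1)
pairs

# Keep only labels 01 to 30 and skip any repeats
labels <- as.integer(pairs)
valid <- unique(labels[labels >= 1 & labels <= 30])
valid
```

Line 113 on its own yields fewer than five usable labels, so the selection continues onto the next line of the table, starting with the leftover digit.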

The Prudential Center decided to survey the shopping habits of its consumers. Due to budget constraints, interviews were conducted only in person within the plaza. Is this sample survey biased?

Multiple Choice:

	No, because the interviewers are getting the opinion of the entire population of consumers.
	No, because interviewers survey all of the plaza's customers.
	Yes, because the interviewers are only interviewing consumers that frequent the plaza.
	Yes, because the respondents tend to avoid interviewers.
	None of the above.

Scaffold:

The Prudential Center is low on budget, so it does not want to spend more by contacting consumers by phone or through online advertisements. Why could this survey sample be biased?

Multiple Choice:

	Only the middle-to-high class was represented.
	Only the people that could be easily reached were interviewed.
	Survey sample may have been mostly young folk or the elderly.
	People with negative opinions seem to volunteer to be the respondent more often than those with positive feedback.
	All of the above.

Hints:

Why would responses from only select groups of an entire population be a problem for an unbiased random survey?

The Prudential Center wishes to gather information from more than just one of these groups.

All of these groups cause the survey sample to become biased.

Scaffold:

Another source of response bias is the behavior of the interviewer or the respondent. Would an interviewer likely choose an unkempt or safe-looking man as the respondent for an interview conducted in the Prudential Center?

Multiple Choice:

	Unkempt man
	Would you?
	Safe-looking

Boston Latin School students of a health class were tasked to survey their neighbors about smoking policy. The question is posed as follows:
A recent survey has shown that many of the cases of lung cancer this past year were caused by the frequent first-hand smoking habits of doctors and business persons outside of their respective offices and the resulting second-hand smoke. We are concerned for the health of our community and wish to know your stance on what should be done. Do you think that these people should be smoking so close to the building of their work?
Is there a problem with how this question was posed?

Multiple Choice:

	No, the question was posed fairly.
	No, the person clearly preceded their question well with information.
	Yes, the way the question was stated showed bias.
	Yes, there is an obvious discrepancy with the student's information.

Scaffold:

Let us focus on how the question was worded. What is missing from the question so that the response would be a fair one?
Notice that the question doesn't leave space open for interpretation.

Multiple Choice:

	Nothing was left out.
	There was no bias in the question.
	The question wasn't stated in a neutral way.
	The ducks aren't all quacking.

Bias is everywhere in our daily lives. It is inevitable. It is unavoidable. It is part of human nature. You must succumb to it. There is no escape.

Have you ever experienced bias in your life before?

Multiple Choice:

	Yes
	No
	You're lucky! But wrong

The bias you've experienced in your life is most likely different from bias as we know it in statistics.
In statistics, a sampling method is BIASED if it systematically favors certain outcomes.
Would you consider propaganda a form of bias?

Multiple Choice:

	Yes
	No
	Propaganda is information that is spread for the purpose of promoting some cause. So yes, it is a form of bias.

Let's say for example you want to conduct a survey about the average number of hours people spend on computers.
You survey people at a local daycare center.
What would be a source of bias in this survey?

Multiple Choice:

	The location
	The year
	Shoe size
	Race/gender

An advertisement in the newspaper USA Today once asked the readers, "Should handgun control be tighter? If yes, call 617-504-9511, if no, call 857-540-6142. All calls will be charged 50 cents."
Why is this survey most certainly biased?

Multiple Choice:

	It's not biased
	People who care the most are willing to pay
	Some people don't like the look of the second phone number
	The newspaper it is advertising in is about handgun control

Now that you have a better idea of what type of data would be considered biased, you should know there are many different kinds of bias:
Selection bias is where some individuals or groups are more likely than others to take part in a study or a survey, thus resulting in biased data (Berkson's bias is one well-known form of selection bias).
Spectrum bias is where the surveyed group is biased to begin with.
Sampling bias occurs when some members of the population being surveyed are less likely to be included than others.
What type of bias is shown in the following scenario?
A writer for a current scientific publication wants to interview teenagers about illegal drug use at a local high school. However, because this survey is conducted at a high school, it does not include the teenagers that are high school drop outs or teenagers that are homeschooled.

Multiple Choice:

	Sampling bias
	Spectrum bias
	Selection bias
	It's not biased

Algebraic Expression:

Now that you know about the various types of statistical bias, let's take a look at how one would sample something for surveys.

Are you excited?

Multiple Choice:

	Yes!
	No.
	Well, you better get excited.

There are two types of groups of individuals that we work with when we sample for data.

A population is the entire group of individuals we want information about.

A sample is only a part of the population that we examine to gather information about the whole.

For example, say you want to gather information about how happy college students are in the United States. You go to a local college and survey the students there to represent all U.S. college students.

Did you just take a sample of college students, or did you survey the entire population of college students?

Multiple Choice:

	Sample
	Population

Now let's say you want to find out what the minimum wage for all of the farmers in China is. You go to China and interview every single farmer in the People's Republic.

Did you just interview a sample of farmers or the entire population?

Multiple Choice:

	Sample
	Population

Now that you know the difference between a sample and a population, let's take a look at some actual sampling methods...

A sampling method refers to the process used to choose the sample from the population. Poor sampling methods can lead to misleading conclusions.

For example, in the state of Alabama, a local news station casts a poll to see which candidate the people will vote for in the upcoming race in a dominantly Democratic part of the state.

Is this an example of a poor sampling method?

Multiple Choice:

	Yes, because that part of the state is an anomaly compared to the rest of the state.
	No, because one part of the state can represent all.
	Yes, because it is a form of spectrum bias.
	No, there is nothing wrong with the sampling method.

Now consider the following scenario:

A science journal wants to know how much fast food America consumes per month. A group of statisticians from the journal contacts fast food restaurants around the nation and asks how much fast food they sell within a month.

Is this an example of a good sampling method?

Multiple Choice:

	Yes, because the scientists go straight to the source for their information.
	No, because fast food restaurants cannot be trusted
	No, because the scientists do not ask consumers themselves.
	Yes, because they are from a scientific journal.

Another type of sampling method is called a voluntary response sample. This kind of sample consists only of people who decide to respond to the survey. This kind of sampling method is inherently biased, as only people who want to respond will.

Another type is called convenience sampling. This kind of sampling only chooses individuals that are easy to reach to interview or survey.

Now let's consider the following situation:

You work for a marketing agency and are asked by your boss to find out what makes people in your city buy certain kinds of MP3 players. You then interview people only in your neighborhood, going from door to door. Because you live in a very high-end and rich neighborhood, you conclude that people buy only the most expensive kinds of MP3 players.

What kind of poor sampling method is this?

Multiple Choice:

	Convenience sampling, because you only interviewed people who were closest to you and not representative of the entire city.
	Voluntary response sampling, because the people who opened their doors were willing to answer you.
	None of these

Now, voluntary response sampling and convenience sampling are both inherently flawed because in one, the people choose to respond to the survey and in the other, the interviewer chooses. The only way to remedy this is by using a simple random sample.

A simple random sample consists of individuals from a population that are chosen in such a way to ensure that everyone has an equal chance to be selected.

What are some ways you think you can take a simple random sample?

Multiple Choice:

	Make each entry independent of each other
	All of these
	Ensure that each entry is equally likely to be chosen
	Flip a coin
	Close your eyes and point

Say you won 5 free tickets to Canobie Lake Park, and you didn't know which of your friends to choose without hurting their feelings. You make a list of all of your closest friends and want to make sure each one gets an equal chance to be selected. You close your eyes and run your finger over each name and then stop randomly. Here is your list:

Dan, Ian, Kevin, Caitlyn, Malcolm, Dan, Mike, Eric.

Would you be taking a simple random sample?

Multiple Choice:

	No, because not everyone has an equal chance to be selected.
	No, because not every entry is independent of each other.
	Yes, because the method you are using ensures randomness.
	Yes, because you are only taking a sample of your best friends, not the entire population.

Now, the last good sampling method we'll cover today (finally!) is called stratified random sampling. This involves dividing the population into groups of similar individuals, called strata. Then choose a separate simple random sample in each stratum and combine these to form the full sample.

Let's say you want to know the political affiliation of people in Massachusetts. First, you would divide the population of Massachusetts into groups by political party (Democratic, Republican, Green, Independent). Then you would take a simple random sample from each party and combine the samples to get a better picture of the political standing of Massachusetts.

Is this a good example of stratified random sampling?

Multiple Choice:

	Yes
	No
	I tried my best...

That was last, but certainly not least! Did you find our Assistment helpful?

Ungraded Open Response:

Design of experiments refers to the blueprint for planning a study or experiment, performing the data collection protocol and controlling the study parameters for accuracy and consistency. Data, or information, is typically collected in regard to a specific process or phenomenon being studied to investigate the effects of some controlled variables (independent variables or predictors) on other observed measurements (responses or dependent variables).

Did you learn something new?

"AP Statistics Curriculum 2007 IntroDesign." Statistics Online Computational Resource. 28 June 2010. http://wiki.stat.ucla.edu/socr/index.php/AP_Statistics_Curriculum_2007_IntroDesign

Multiple Choice:

	Yes.
	No.

There are two types of variables involved in experimenting with data.

Explanatory variable- the independent variable or the variable that causes change in another variable. (It is located on the x-axis.)
Response variable- the dependent variable or the variable in which the differences are observed. (It is located on the y-axis.) The response variable is changed by the explanatory variable.

Example: In a study, the hours of preparation for an exam determined the student's test score.
In this case, the explanatory variable would be the hours of preparation because it causes a change in the response variable. The response variable would be the student's test score because differences can be observed in it. For example, a student who spends two hours studying might receive a higher test score than a student who spends only 20 minutes studying. The explanatory variable changes the response variable.
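To make the two roles concrete, here is a small R sketch with made-up numbers (the hours and scores below are hypothetical, not data from any study). In an R model formula, the response variable goes on the left of the ~ and the explanatory variable goes on the right.

```r
# Hypothetical data: hours of preparation (explanatory) and test score (response)
hours <- c(0.5, 1, 1.5, 2, 2.5, 3)
score <- c(55, 62, 70, 74, 83, 88)

# Fit a straight line: response ~ explanatory
fit <- lm(score ~ hours)

# Estimated change in test score for each extra hour of preparation
slope <- coef(fit)[["hours"]]
slope
```

Calling plot(hours, score) would draw the explanatory variable on the x-axis and the response variable on the y-axis, matching the convention described above.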

Do you understand it?

Multiple Choice:

	Yes.
	No.
	Please go back and reread this section.

Scientists are conducting an experiment to test the correlation between a bird's speed and its wing span. Which of these two variables is the explanatory variable (independent variable)?

Multiple Choice:

	Wing span
	Speed
	Break the question down and try again.
	I don't understand
	Go break the question down for help.
	Neither
	Go break the question down for help.

Scaffold:

In this experiment, we are looking for which variable depends on the other. For example, in a correlation between a person's weight and the amount of food they eat, we say that the explanatory/independent variable is the amount of food. This is because a person's weight depends on the amount of food, making weight the response/dependent variable.
Now, looking at the original problem, if a bird's speed is greater with a larger wing span and less with a smaller wing span, we say that the speed depends on the wing span.
Which, then, is the independent variable in this problem?

Algebraic Expression:

wing span

Some students are studying the effect of height on a person's weight. They took 50 samples and found a relationship in the data. The taller a person is, the more they weigh. Conversely, a shorter person will weigh less.

If the weight changes according to the height, then what kind of variable is it?

Multiple Choice:

	Explanatory
	Please try again (look at the hint).
	Response
	Controlled
	Please try again (look at the hint).

Hints:

The weight depends on the person's height.

The amount of cigarettes someone smokes can change their life span. A group of researchers found that people who smoke a lot of cigarettes will have a shorter life span than people who don't smoke or smoke very little.

What is the explanatory or the independent variable in this situation?

Multiple Choice:

	The amount of cigarettes
	Life span
	Look at the hint.
	Time
	Look at the hint.
	No independent variable
	Look at the hint.

Hints:

The explanatory variable changes the response variable. In this case, what causes the life span to change?

In an experiment, there are two groups: the experimental group and the control group. The control group and the experimental group are almost identical except the experimental group is affected by a variable that the control group is not. This means that the control group doesn't receive any experimental treatment. The control group can be used to observe how a variable changes the experimental group.

Are you ready to move on to the practice problems?

Multiple Choice:

	Yes.
	No, I don't understand it.
	Click on hint.

Hints:

The control group is the group without the experimental treatment.

EXAMPLE:
You're testing out how to improve your test scores through different methods of preparation for a test. You try doing it four different ways, each lasting one hour. The first way is not preparing for the test at all. The second way is reading the book. The third way is studying only from your notes. The fourth way is a combination of reading from the book and studying from your notes.

The control group is the first way because you don't apply any method of preparing for the test to it. Thus it doesn't get the experimental treatment.

You are looking at the effects of nitrogen on plants. You have two groups: one group has a high level of nitrogen and the other has a normal level of nitrogen. Which is the control group?

Multiple Choice:

	The group with the high level of nitrogen.
	Break this problem down and try again.
	The group with a normal level of nitrogen.
	None of the above.
	Break this problem down and try again.

Hints:

The control group is the one that does not receive the experimental treatment.

Scientists are conducting an experiment to observe the effects of a new drug (Drug A) for depression. One group was given Drug A, one was given another drug (Drug B), and one was given no drug.

Which group is the control group?

Multiple Choice:

	Drug A group
	Try again.
	Drug B group
	Try again.
	No drug group

A farmer is trying to determine what type of feed he should give his chickens in order to have them at their optimal weight. He split up his chickens into four groups. Group 1 is given a corn-based feed. Group 2 is given a fish-based feed. Group 3 is given a grain-based feed. Group 4 is given their regular mixed-type feed.

Which of the following is the control group?

Multiple Choice:

	Group 1
	Break this problem down and try again.
	Group 2
	Break this problem down and try again.
	Group 3
	Break this problem down and try again.
	Group 4

Scaffold:

Remember that a control group is not affected by the variable.

First of all, what is the variable that changes here?

Multiple Choice:

	The feed type
	The groups of chickens
	Try again

Scaffold:

Now that we know what the variable is, which group is not affected by the variable?

Multiple Choice:

	Group 1
	Are you sure? Try again.
	Group 2
	Try again.
	Group 3
	Try again.
	Group 4

A confounding variable is another variable whose effect on the response variable cannot be separated from the explanatory variable under study.

Look at the table below and identify the anesthetic that seems most dangerous.

Death Rates Associated with Various Anesthetics

Halothane	Pentothal	Cyclopropane	Ether	All Others
1.7	1.7	3.4	1.9	3.0

"Module 1: Producing Data." Australian Bureau of Statistics. 13 November 2009. http://www.abs.gov.au/websitedbs/a3121120.nsf/4a256353001af3ed4b2562bb00121564/ce88a10d17c46b56ca257610000a9364!OpenDocument

Multiple Choice:

	Cyclopropane
	Halothane
	Pentothal
	Ether
	Others

If you selected Cyclopropane, you are right. It does SEEM to be the most dangerous anesthetic as it is associated with the highest death rate. However, understanding the context in which these data were generated allows you to identify confounding variables and will allow you to make a more reasoned interpretation of the data. In fact, the apparently higher death rate for people who were given Cyclopropane can be explained if you consider that Cyclopropane tended to be used for risky operations that had a higher death rate anyway.
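
If you are following along in R, the table above can be entered as a named vector. This is only a sketch of the naive comparison: the variable names `death_rate` and `highest` are assumptions made for illustration, and the code says nothing about the confounding described in the explanation.

```r
# Death rates copied from the table; the names are the anesthetics
death_rate <- c(Halothane = 1.7, Pentothal = 1.7, Cyclopropane = 3.4,
                Ether = 1.9, "All Others" = 3.0)

# Name of the anesthetic with the largest recorded death rate
highest <- names(death_rate)[which.max(death_rate)]
print(highest)
```

The result only tells you which number in the table is largest. It cannot tell you whether the anesthetic or the type of operation is responsible.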

Multiple Choice:

	Ohhhkay, I get it
	I don't understand, could you explain more?
	of course!

Scaffold:

Let's take a better look at that diagram:

We know that the explanatory variable is the type of anaesthetic, and that the response variable is the death rate.
The confounding variable is a different variable that also affects the response variable (in this case the death rate). The type of operation affects this response variable because some operations are riskier than others, which could raise the death rate. It does not change anything in the Explanatory Variable, but it does change the Response Variable.

Multiple Choice:

	Ok, Thanks
	Can we try one more?
	yes! click above on the scaffold

Scaffold:

Which of the following variables could confound or influence the results shown in the table above?

Multiple Choice:

	The age of the patients
	The type of anaesthetic used
	The sex of the patients
	The general health / physical condition of the patients

Scaffold:

Variables you could have selected include a) age, c) sex and d) physical condition / general health. The confounding nature of these variables can be reduced by specifically accounting for these variables in any analysis of these data.

Multiple Choice:

kthxbye

When a medical researcher is testing out a new drug, one very important task is to determine how much of the drug should be administered for the best results. If he has 4 volunteers to test the drug, he will give each of them a different Treatment. This means that each of them would receive a different amount of the drug.
A Treatment is a unique condition applied to experimental units (such as an individual).
An Experimental Unit is the individual that an experiment would be conducted upon.
In an experiment, you want to find out how different variables affect the experimental units, so you will need to create different treatments to observe the differences.

Multiple Choice:

OK. Let's keep going

Here's an example:
A doctor has just created a drug called Simoneaucine to help people's memory. He needs to find out how many milligrams of Simoneaucine a patient should take for the effects to work properly. He has 5 volunteers to experiment with. He chooses to give each of them a different dosage and observe which dosage produces the intended results best.

Each patient in this experiment will be considered an Experimental Unit because the experiment is being conducted on them.
Each of them will receive a unique Treatment. Each will take a different dosage of Simoneaucine.

Multiple Choice:

Cool, I think I get it

Try one out for yourself. Identify the Experimental Units and the Treatments in the following scenario.
Mr. Simoneau, a high school statistics teacher, is trying to find out which types of questions on a test will confuse the students most. He has 3 types of questions to decide between: multiple choice, open response, and short answer. He has 3 classes in his schedule to give the test to. He gives each of the 3 classes a test consisting of only one type of question. He observes which type of question produces the lowest average grades in the class for the test.

(Choose the best answer in this order: Experimental Unit, Treatment)

Multiple Choice:

	Each Class, Types of Question
	Types of Question, Each Class
	Break the problem down for better understanding
	Test Grades, Types of Question
	Break the problem down for better understanding
	Test Grades, Each Class
	Break the problem down for better understanding
	Types of Question, Test Grades
	Break the problem down for better understanding

Scaffold:

In the experiment, we were looking for the Experimental Units and the Treatments. Let's look for the experimental unit first. We are looking for an individual that the experiment is being conducted on. Our choices are between Test Scores, Classes, and Types of Questions.
The types of questions cannot be the experimental units because they are a variable we can change.
Test Scores are not the experimental units either because they are a result of the changing variable.
The Classes are our experimental units because we are testing the effectiveness of the questions on them.

Multiple Choice:

Scaffold:

Now let's look for the Treatments. The Treatments are experimental changes that can be applied to the experimental units we just found. Now that we know that the experimental units are the Classes, our choices for the Treatments are narrowed down to Test Scores and Types of Questions.

Which do you think is the Treatment?

Multiple Choice:

	Test Scores
	break this down further for better understanding
	Types of Questions

Scaffold:

The Treatments in this experiment are actually the types of questions.
Think about what can be changed and applied to the experimental units we found. Test Scores are a result of how the class performs on the test, so they are actually our response variable in this experiment. The types of questions are assigned to the different classes, making them different conditions that change between classes. This is why they are the Treatments.

Multiple Choice:

Confounding variables can be very important in clinical trials where a new drug or procedure is being tested. Let's say a doctor gives a patient a new drug in tablet form and the patient gets better. How can you tell whether it was
1) the attention that was given to the patient as the drug was administered, or
2) the drug itself that caused the improvement?
Many patients respond positively to any treatment, even when they are given a placebo, i.e. a dummy medication. In other words, it is the process of being treated, not the action of the drug, which produces patient improvement. As a result, it becomes important to separate the drug (explanatory variable) from the treatment (confounding variable). An improvement in a person's health that occurs when they are given a dummy medication is called the placebo effect.

Do you understand it?

"Module 1: Producing Data." Australian Bureau of Statistics. 13 November 2009. http://www.abs.gov.au/websitedbs/a3121120.nsf/4a256353001af3ed4b2562bb00121564/ce88a10d17c46b56ca257610000a9364!OpenDocument

Multiple Choice:

	Yes.
	No.
	Go back and reread the information on the placebo effect.

Identify the Explanatory Variable, Response Variable, Experimental Units, and the Treatments in the following scenario.
A banana packaging company is conducting an experiment to see at what temperature their trucks should transport their bananas to have the bananas last the longest. They own 4 trucks and decide to send them each out full of bananas at different temperatures. The trucks are sent with temperatures of 10, 15, 20, and 25 degrees Celsius.

What is the Explanatory Variable?

Multiple Choice:

	The Trucks
	The Bananas
	The Temperature change
	The life of the Bananas

What is the Response Variable?

Multiple Choice:

	The Trucks
	The Bananas
	The Life of the Bananas
	The Temperature change
	10, 15, 20, 25 degrees

What is the experimental unit?

Multiple Choice:

	The Trucks
	The Bananas
	The Life of the Bananas
	The Temperature Change
	10, 15, 20, 25 degrees

What is the treatment?

Multiple Choice:

	The Trucks
	The Bananas
	The Life of the Bananas
	The Temperature Change
	10, 15, 20, 25 degrees

Read the following
Scenario 1:
A boy named Rasheem had a pet turtle that he found deep in the woods. One day his pet turtle showed unusual signs. Rasheem had an idea. He thought a possible reason for this was the food that he was feeding his pet turtle. He decided to conduct an experiment. He went back to the woods where he found his pet turtle. He took 20 more turtles and isolated all 20 of them from the rest of the turtles. He marked an X on the turtles that he used as a control (the turtles that would eat the food normally eaten in the woods), while the experimental group ate the food that Rasheem provided. He then isolated the control group and the experimental group. Each day, he went back to the woods to observe them. He recorded the effects on the experimental group. After a couple of days, the experimental group showed the exact same signs as his pet turtle. Rasheem concluded that the food was the problem.

Scenario 2:
A boy named Rasheem had a pet turtle that he found deep in the woods. One day his pet turtle showed unusual signs. Rasheem had an idea. He thought a possible reason for this was the food that he was feeding his pet turtle. He decided to conduct an experiment. He went back to the woods where he found his pet turtle. He took 2 more turtles and isolated both of them from the rest of the turtles. He marked an X on the turtle that he used as a control (the turtle that would eat the food normally eaten in the woods), while the experimental turtle ate the food that Rasheem provided. Each day, he went back to the woods to observe them. After the first day, there were no signs. Rasheem concluded that it was just his pet turtle.

Which scenario is the best example of a well-conducted experiment, and why was the wrong one wrong?

Check All That Apply:

	Scenario 1, sample size was too small
	Scenario 1, two groups were not isolated
	Scenario 1, no data was recorded
	Scenario 2, isolation doesn't reveal the confounding variable to the experiment

If a team of biologists from Harvard Medical School wanted to create a study to show the effects of sleep deprivation on teenagers with low test scores, they would design a test with the fewest possible factors. The test would need to be replicated throughout the world by various medical institutions. In order to create a statistically accurate study, what would be the best candidates for the study?

Check All That Apply:

	Similar age range of males and females
	dietary habits
	Similar educational curriculum and position (i.e. High school vs. College)
	Large test group
	Equal Number of males to females
	Control group (adequate amount of sleep)

A teacher wanted to know if students actually learned the material taught in class. To test this, she decided to give them all a pop quiz. She wanted to know if there was any difference between taking a multiple choice test and taking an open-response test. Which group will have the highest test results? Instead of giving them all the same formatted test, she decided to give them a choice. They can choose between taking a multiple choice test or taking an open-response test. 15 students chose to take the multiple choice test and 10 chose to take the open-response test. Because the teacher let them choose which type of test to take, the statistical data results were similar. What happened?

Multiple Choice:

	The students were biased and chose the type of test they were better at taking.
	There were cases of cheating.
	An error occurred while calculating the scores.
	The test was too easy.

Hints:

In statistics, bias is systematic favoritism present in data collection, analysis or reporting of quantitative research.

A single-blind study is a way of eliminating bias that comes from subjects knowing which product they are testing. Coca-Cola designed a single-blind study to test their Coke Zero product against their original Coca-Cola product. How should they display the two products to the subjects, that is, the people testing the two products?

Multiple Choice:

	display liquid in original can and original labels.
	display both in a Coke Zero can
	display both in an original Coca-Cola can
	display each one in the same type of cup and label one A and the other B

Is a random sample biased or unbiased?

Multiple Choice:

	Biased
	Unbiased

In order to have an accurate study what ratio of the test group should be the control group?

Multiple Choice:

	1:2
	3:4
	9:10
	No Control group
	1:1
	5:100

What are the sources of bias?

Check All That Apply:

	Undercoverage
	Nonresponse
	Don't know, haven't decided
	Untruthful answers
	Ignorant people
	People who don't remember
	Statement of questions

Hints:

Keep on going

There is a correlation between the sales of winter clothes and the deaths from falling through ice and drowning. But one cannot infer that winter clothes shopping causes drowning. What factor is missing from this situation that would explain the relationship between these two instances?

Multiple Choice:

	Confounding: Wintertime
	Confounding: Increase in precipitation
	Lurking: Population increase
	Lurking and Confounding are essentially the same thing.
	Lurking: Global Warming
	Lurking and Confounding are essentially the same thing.

A boy named Rasheem had a pet turtle that he found deep in the woods. One day his pet turtle showed unusual signs. Rasheem had an idea. He thought a possible reason for this was the food that he was feeding his pet turtle: processed turtle food. He decided to conduct an experiment. He went back to the woods where he found his pet turtle. He took 20 more turtles and isolated all 20 of them from the rest of the turtles. He marked an X on the turtles that he used as a control (the turtles that would eat the food normally eaten in the woods), while the experimental group ate the food that Rasheem provided. He then isolated the control group and the experimental group. Each day, he went back to the woods to observe them. He recorded the effects on the experimental group. After a couple of days, the experimental group showed the exact same signs as his pet turtle. Rasheem concluded that the food was the problem.

What is the treatment of the experimental group?

Multiple Choice:

	Turtles
	Grass
	Worms
	Soil
	Smart food
	Turtle food

Hints:

A treatment is something that researchers administer to the experimental units.

In our intellectual society, advancements in science and medicine occur quite frequently. Not all of the science is viable. Some results are sometimes sloppy and full of human error. The important difference between "Sound" science and "Junk" science is...

Multiple Choice:

	The replication of the experiment by other researchers
	The support of science scholars and doctors
	Thorough analysis of the data
	A large budget
	Media coverage

Hints:

"Sound" science means reliable and consistent science.

An experimental unit is any group, thing, or object that takes part in an experiment. Knowing this, what is the answer to the following question?

Hospital floors are usually covered by bare tiles. Carpets would cut down on noise but might be more likely to harbor germs. To study this possibility, investigators randomly assigned 8 of 16 available hospital rooms to have carpet installed. The others were left bare. Later, air from each room was pumped over a dish of agar. The dish was incubated for a fixed period, and the number of bacteria colonies was counted. Select the appropriate statistical term for the 16 hospital rooms.

"EBook Problems EDA IntroDesign," last modified on 26 October 2009, accessed 26 October 2010, http://wiki.stat.ucla.edu/socr/index.php/EBook_Problems_EDA_IntroDesign

Multiple Choice:

	Response
	Treatments
	Experimental Units
	Control Group

Jennifer participated in a study where she had to take two pills each day. The pills were part of a study conducted by Johns Hopkins University for the treatment of chronic migraines. Jennifer wasn't told whether she was part of the control group or the experimental group. After two weeks of taking the pills she told researchers at Johns Hopkins that she noticed a change in her migraine pain. She noted that the migraines were now less painful than before. However, she was not part of the experimental group; she was part of the control group. What kind of effect did she experience?

Multiple Choice:

	Placebo effect
	Bias effect
	Blinding effect
	Double blind effect

Hints:

The placebo effect is a positive change in the dependent variable that occurs when a subject receives a neutral treatment (a placebo), such as the one given to the control group.

The word random explains it all. In order for something to be random, there can't be any bias or any choosing. What are some techniques of random assignment (an experimental technique for assigning subjects to different treatments or groups)?

Check All That Apply:

	Tossing a coin
	Assigning random numbers to participants
	Picking out of a hat
	Choosing your favorite group
	Separating groups by particular areas, genders, etc.

Algebraic Expression:

Who should perform a double-blind experiment for a product?

Multiple Choice:

	Computer
	Study Advisor present
	Owner or president of the company
	A random person not affiliated with the company
	A Statistician
	If the experiment is performed by a human, there is considered to be some sort of bias (conscious or subconscious). Even a professional can be biased in a study.

500 people signed up to take part in Josh's sleep deprivation study. Josh chose the first 100 people to be in it. He randomly assigned 50 people to be a part of his control group and the remaining 50 to be a part of his experimental group.

What was Josh's method of splitting the people into two groups called?

Exact Match (case sensitive):

random assignment

Hints:

Random As_ _ g _ _ _ _ t
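
A split like Josh's can be sketched in R with `sample()`. This is a minimal illustration, not Josh's actual procedure: the participant numbers 1 to 100, the variable names, and the seed are all assumptions.

```r
set.seed(1)  # arbitrary seed so the example can be repeated

# The 100 people Josh kept, labeled 1 to 100 (hypothetical labels)
participants <- 1:100

# Draw 50 labels by chance for the control group
control <- sample(participants, 50)

# Everyone who was not drawn goes to the experimental group
experimental <- setdiff(participants, control)

length(control)
length(experimental)
```

Because chance alone decides who lands in each group, neither Josh nor the participants can influence the split.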

A double-blind study is a way to eliminate bias from both the subjects and the experimenter. In what ways can an experimenter be biased during a study testing the public's preference of Pepsi or Coca-Cola?

Check All That Apply:

	Adding more or less ice to one or the other
	Positioning one cup closer to the subject than the other
	Suggesting one product over the other
	Hinting at which liquid is from which company

Doctors at the UCLA Hospital are worried about some of the side effects of a drug used to treat cancer when that drug is prescribed in large amounts. 60 volunteers are randomly split into three groups of 20; the first group doesn't take the drug, the second group takes a low dosage of the drug, and the third group takes a high dosage of the drug. How many treatments are there in this experiment?

"EBook Problems EDA IntroDesign," last modified on 26 October 2009, accessed 26 October 2010, http://wiki.stat.ucla.edu/socr/index.php/EBook_Problems_EDA_IntroDesign

Multiple Choice:

	There are 60 treatments, one for each volunteer.
	There is only one treatment used for this cancer, the drug being tested.
	There were 180 treatments, one for each level of the drug and one for each patient.
	There are three treatments, one for each level of the drug.
	We need to know what the dosages prescribed were in order to determine the treatments

If doctors want to conduct an experiment to determine whether Prograf or Cyclosporin is more effective as an immunosuppressant, in a sample size of 300, how many subjects would be assigned to each?
Carnegie Mellon, Open Learning Initiative, Statistics, October 26, 2010, https://oli.web.cmu.edu/jcourse/workbook/activity/page?context=90de572a80020ca6013b1983f8562ffd&view=frameset

Check All That Apply:

	100 for Control group, no immunosuppressant
	It's unethical for patients not to receive necessary medication. In situations like this, no control group is tested; instead, the drugs are compared with each other.
	150 for Prograf
	150 for Cyclosporin
	100 for Prograf
	100 for Cyclosporin

A new hair product promises to cure frizzy hair, and backs up this statement with tests they collected and analyzed over a period of 2 years. They take a group of 550 males and females and over the course of two years watch the group for changes to their hair. What is the possible flaw to this study?

Multiple Choice:

	2 years
	550 is not a large enough test group
	female to male ratio
	No control group

What is a factor?

Multiple Choice:

	The combination of levels of explanatory variables
	The combination of Answers
	The combination of possible solutions
	The combination of subjects

Hints:

This is a definition question, consult book.

What is the purpose of a control?

Multiple Choice:

	Remove Lurking Variables
	Randomization
	Sorry
	Comparison
	Wrong
	Replication
	Wrong

Algebraic Expression:

	explanatory variables
	response variables
	Wrong
	conceptional variables
	no
	power ups
	no

What is one issue with double blind experiments?

Multiple Choice:

	The administrator knows what's going on
	That is not blind
	The subject knows what's going on
	Wrong
	Lack of realism
	blocks
	Wrong

Given the random digits table in the book, when asked for the numbers for an experiment with labels 01 to 99, who would be picked when looking at line fifty of the random digits table?

Multiple Choice:

	05, 16, 17, 20, 19, 04, 25, 29, 18, 07, 13, 02, 23, 27, 21
	05, 00, 16, 31, 05, 15, 30, 42, 16, 18, 04, 09, 02, 01, 05
	Wrong, Probably had the wrong line
	69, 05, 16, 17, 20, 19, 04, 25, 29, 18, 07, 13, 02, 23, 27,
	wrong, improper usage, see scaffolding for step by step.
	None of them are correct
	One of them has to be right for scaffolding.

Scaffold:

First, what is a random digit table?

Multiple Choice:

	A table of random digits
	Doomsday table
	No this will not be the end of the world
	Numbers going from one to ten
	It's actually 0 to 9
	Totally Pointless
	I suppose when we learn other methods for randomness, but at the moment we are focusing on random tables

Scaffold:

How many digits should we use per random subject?

Algebraic Expression:

2

Scaffold:

Can you write out 15 of these numbers? If they are not in the range of numbers you randomized you should remove them and find a replacement further down.

Multiple Choice:

	Yes
	No
	I think you can?

Scaffold:

If those numbers don't exist in your range, then you must find more further down the table. What are those 15 numbers after this process?

The answer is 05, 16, 17, 20, 19, 04, 25, 29, 18, 07, 13, 02, 23, 27, 21

Exact Match (case sensitive):

05, 16, 17, 20, 19, 04, 25, 29, 18, 07, 13, 02, 23, 27, 21
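
The scaffold steps (read two digits at a time, skip labels outside the range, skip repeats) can be sketched in R. Everything specific here is an assumption made for illustration: the digit string is borrowed from one of the answer choices rather than copied from the book's table, and the allowed range of 01 to 30 is a guess.

```r
# Hypothetical run of random digits (NOT the actual line from the book)
digits <- "6905161720190425291807130223"

# Step 1: read the digits two at a time
starts <- seq(1, nchar(digits) - 1, by = 2)
labels <- substring(digits, starts, starts + 1)

# Step 2: keep only labels inside the assumed range 01 to 30
values <- as.integer(labels)
in_range <- labels[values >= 1 & values <= 30]

# Step 3: drop any label that already appeared
chosen <- unique(in_range)
print(chosen)
```

If fewer labels than needed survive the filtering, you would continue reading further along the table, as the scaffold describes.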

Before we delve into the world of experimental design, here are some helpful basics to know about the statistical design of experiments:

Control the effects of lurking/external variables on the response by comparing two or more treatments.
Randomize by using chance to assign experimental units to treatments.
Replicate the same treatment on many experimental units to reduce variation in results.

Do you understand these basic principles? If not, reread each point.

Multiple Choice:

	Yes
	No
	Go back and reread!

Now let's delve into the world of experimental design itself!

Here are some more basics:
The individuals that are being tested are called experimental units. When humans are the experimental units, they are referred to as subjects.

In the following problem, check all of the experimental units that are subjects:

Check All That Apply:

	Doctors
	Children
	Flowers
	Flowers aren't human!
	Textbooks
	You better read a textbook on what a human is!
	Trucks
	Trucks aren't human!

With the experimental units out of the way, let's move onto the experiment!

The experimental conditions that affect the units are called factors. A combination of factor levels forms a treatment.

For example: Scientists are conducting a study on what factors will cause a cow to grow larger. They propose that a cow's diet and the amount of exercise are factors. They carry out the experiment with 500 cows being fed grass or grain diets. They also are subjected to either 40 minutes of exercise or 60 minutes of exercise. So cows are divided into FOUR treatments with TWO factors. Refer to the table below:

500 cows	40 minutes of exercise	60 minutes of exercise
Grain diet	125 cows	125 cows
Grass diet	125 cows	125 cows
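
The way two factors combine into four treatments can be listed in R with `expand.grid()`. This is a small sketch based on the cow example; the column names `diet` and `exercise_minutes` are assumptions.

```r
# Each factor with its levels, taken from the cow example
treatments <- expand.grid(diet = c("grain", "grass"),
                          exercise_minutes = c(40, 60))

# One row per treatment (each combination of factor levels)
print(treatments)

# Number of treatments, and cows per treatment if 500 cows are split evenly
n_treatments <- nrow(treatments)
cows_per_treatment <- 500 / n_treatments
```

Each row of `treatments` corresponds to one cell of the table.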

If you understand, click yes.
Otherwise go back and reread!

Multiple Choice:

	Yes, I understand!
	Uh, let me reread!

Teachers are trying to enhance the performance of their students. They conduct a study with 120 volunteer students. They ask the students to change sleeping habits to 7, 8, and 9 hours of sleep. Students are also asked to change their diet to eating fish or not eating fish. How many factors and treatments are there?

120 Students	7 Hours of Sleep	8 Hours of Sleep	9 Hours of Sleep
Eats fish	20 students	20 students	20 students
Doesn't eat fish	20 students	20 students	20 students

Multiple Choice:

	2 factors and 6 treatments
	4 factors and 6 treatments
	How many factors are there?
	2 factors and 5 treatments
	Refer to the table!
	4 factors and 5 treatments
	How many factors are there? Refer to the table!

Good job on getting this far! Now that we're familiar with the terminology of experimental design, we are going to move onto the types of design.

The simplest experimental design is the completely randomized design. Exactly like what we have been doing so far, all the experimental units are divided evenly at RANDOM to each treatment (there is no specific assignment of units).
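
A completely randomized design can be sketched in R by shuffling a list of treatment labels. The 12 units, the treatment names A, B, and C, and the seed are hypothetical choices for illustration only.

```r
set.seed(2)  # arbitrary seed so the example can be repeated

# 12 hypothetical experimental units
units <- 1:12

# Four copies of each treatment label, shuffled into random order
treatment <- sample(rep(c("A", "B", "C"), each = 4))

assignment <- data.frame(unit = units, treatment = treatment,
                         stringsAsFactors = FALSE)

# Check that the units are divided evenly among the treatments
counts <- table(assignment$treatment)
print(counts)
```

Shuffling the labels guarantees both conditions at once: the groups are the same size, and chance decides which unit gets which treatment.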

To determine which delivery company delivers packages faster, 600 packages were distributed among 3 delivery companies evenly. Is this a completely randomized design?

Company A	200 Packages
Company B	200 Packages
Company C	200 Packages

Multiple Choice:

	Yes, it is a completely randomized design
	Are you crazy? Not at all!
	Check to see if the packages were distributed evenly and at random!

Two types of fertilizer are being tested to see which is better at helping plant growth. 100 flower bulbs are used. 25 flower bulbs receive fertilizer A while 75 flower bulbs receive fertilizer B. Is this a completely randomized design?

Fertilizer A	25 flower bulbs
Fertilizer B	75 flower bulbs

Multiple Choice:

	Yes
	Wrong! The experimental units aren't distributed evenly!
	No

Moving on! Matched pairs design is a more elaborate randomized design and compares only two treatments. The subjects are PAIRED to be as close as possible to each other.

For example: a piece of cloth is torn into two pieces. Each piece of cloth is then used to test the strength of two different detergents. This eliminates the external variable of the type of cloth used.
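
The cloth example can be sketched in R by letting chance decide which half of each cloth gets which detergent. The number of cloths, the detergent names, and the column names are assumptions made for illustration.

```r
set.seed(3)  # arbitrary seed so the example can be repeated

n_pairs <- 5  # hypothetical number of cloths, each torn into two halves

# For each cloth, pick by chance which detergent the first half gets
half_a <- sample(c("detergent 1", "detergent 2"), n_pairs, replace = TRUE)

# The second half always gets the other detergent
half_b <- ifelse(half_a == "detergent 1", "detergent 2", "detergent 1")

pairs <- data.frame(cloth = 1:n_pairs, half_a = half_a, half_b = half_b,
                    stringsAsFactors = FALSE)
print(pairs)
```

Within every pair both detergents are used exactly once, so any difference between the halves cannot come from the type of cloth.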

700 subjects with mental illnesses are subjected to a study. They undergo either a psychological therapy session or medical treatment. The subjects are paired based on age and gender.
Is this an example of a matched pairs design?

Multiple Choice:

	Yes
	No
	What are the subjects being paired by?

Matched pairs design does not only pair similar experimental units, but can also have each subject receive both treatments.

Referring back to the previous problem, a subject can receive both the medical treatment and the psychological therapy to determine the effect of both treatments on one subject rather than have one subject of a pair undergo each treatment.

Vegan scientists are proposing that vegetables may have an effect on brainpower. They assert that three vegetables in particular (broccoli, carrots, and Brussels sprouts) have the ability to bolster a student's thinking power. 100 students each took 3 standardized tests, eating one of the specific vegetables before each test. The scores of the tests were recorded along with the type of vegetable eaten.
Is this a matched pairs design?

Multiple Choice:

	Yes
	A matched pairs design only compares TWO treatments. There are three here.
	No

With completely randomized design and matched pairs design out of the way, the last experimental design to discuss is block design. In a block design, the experimental units are divided into groups called blocks. Then the experimental units in each block are allocated to a random treatment.

For example: Milk is being tested for expiration date and taste. There are 200 milk samples. Half of the milk comes from cows and half of the milk comes from goats. The experimental unit, the milk, is blocked into the subgroup of either cow or goat milk.
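
Randomizing within blocks can be sketched in R as follows. The milk example does not name its treatments, so "treatment 1" and "treatment 2" are placeholders, and the column names and seed are also assumptions.

```r
set.seed(4)  # arbitrary seed so the example can be repeated

# 200 milk samples: 100 from cows and 100 from goats (the two blocks)
milk <- data.frame(sample_id = 1:200,
                   source = rep(c("cow", "goat"), each = 100),
                   stringsAsFactors = FALSE)
milk$treatment <- NA_character_

# Inside each block, shuffle an even mix of the two placeholder treatments
for (block in unique(milk$source)) {
  rows <- which(milk$source == block)
  labels <- rep(c("treatment 1", "treatment 2"), length.out = length(rows))
  milk$treatment[rows] <- sample(labels)
}

# Each block should contribute 50 samples to each treatment
block_counts <- table(milk$source, milk$treatment)
print(block_counts)
```

The random assignment happens separately inside the cow block and the goat block, so the milk source cannot be mixed up with the effect of the treatment.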

Physical therapists are conducting a study on runners depending on the amount of stretching they do. Runners are asked to stretch for 10, 15, and 20 minutes before a race to determine stretching's effectiveness. Physical therapists want to divide the groups into blocks to remove any other external variables.

Check all of the following that are acceptable blocks:

Check All That Apply:

	Gender
	Ethnicity
	Weight class
	Amount of water drunk before the race
	Amount of water does not distinguish the experimental units!
	What each runner had for breakfast
	What each runner ate for breakfast does not distinguish the experimental units!

There is an SAT test coming up. To determine whether starting to study earlier can improve results, 400 students are asked to start studying either two months or two weeks prior to the exam. The students were divided equally between the two groups.

What type of experimental design is this?
How many treatments are there?

Multiple Choice:

	Completely randomized design, 2 treatments
	Block design, 2 treatments
	Are there really blocks separating the experimental units?
	Matched pairs design, 2 treatments
	How are the experimental units being paired?
	None of the above!
	You had better review the previous questions!

There is a study that indicates identical twins think alike. 100 pairs of twins were gathered. The twins were separated into different rooms. The same ten question survey was administered to all the twins and the results were compared to see if they had the same answers.

What type of design is this?

Multiple Choice:

	Matched pairs design
	Block design
	The subjects were not placed into blocks.
	Factors
	A factor is an independent variable.
	Treatments
	Treatments are the combination of factors.
	Completely randomized designs
	Were subjects really distributed at random?

Here's a wrap up question about what you learned! Good job for completing the assistment!
What did you learn about experimental design?
What are the basics of experimental design?
What are the individuals referred to? The experimental conditions?
What are the three types of experimental design?

Ungraded Open Response:

Romano's Industries, a pharmaceutical company, has developed an experimental new extreme conditioning drug for the army. 1000 military volunteers are available. Each volunteer will be given a dose and have their physical performance tested.
Why should Romano's not simply administer the new drug as the first step and record the volunteer's test results?

Ungraded Open Response:

Without randomizing the group and establishing a control group and test group, Romano's Industries can not establish whether or not the drug actually is effective.

Hints:

Think about why randomization in a group is important for designing an experiment.

Experiments use a control group and compare it to the test group that will be different in one explanatory variable.

Why should statisticians rely on chance to make an assignment or allocation of test subjects to the control or test group?

Ungraded Open Response:

Chance is not affected by human biases.

Hints:

What do humans do that chance doesn't?

If John, a scientist, is conducting an experiment on rats' diets, then which of the following data sets would be the best to use for his experiment if he is only changing the variable of different diets among different groups of rats? Note: John desires as unbiased a data set as he can get from his choice of the following different data sets.

Multiple Choice:

	1,000 Random Rats
	Not Quite.
	334 Slim Rats, 334 Medium Rats, 334 Fat Rats
	Wrong Concept, Sorry.
	10,000 Random Rats
	3334 Slim Rats, 3334 Medium Rats, 3334 Fat Rats
	Wrong Concept, Sorry

Scaffold:

So John the scientist wants the most unbiased data set that he can get for his experiment on the diets of rats. In this case John wants which type of data set from the following types?

Multiple Choice:

	Categorized
	Sorry, a categorized data set will bias the study because it is adding extra variables.
	Quantitative
	Not exactly applicable to a relatively unbiased data set or having anything to do with biased data.
	Random

Hints:

So if we want an unbiased Data Set, we are looking for a data set that minimizes variability in any variables that are not part of the variables being tested. We are not looking for new variables to be introduced into the data set for this problem.

Scaffold:

Ok, so now we know that we are looking for a Random Data Set. Which is a better data Set for John to use that is the most unbiased?

Multiple Choice:

	10 Rats
	No, for the data to be more unbiased, the largest number of rats is best to balance out the variability of variables not related to the variable(s) being tested.
	100 Rats
	No, for the data to be more unbiased, the largest number of rats is best to balance out the variability of variables not related to the variable(s) being tested.
	1,000 Rats
	No, for the data to be more unbiased, the largest number of rats is best to balance out the variability of variables not related to the variable(s) being tested.
	10,000 Rats

AnimaEcho has created a new dog food formula. They want to test whether or not it tastes better than their previous Athion brand of dog food. There are 250 dogs of many different breeds available to act as test subjects. How should the dogs be distributed to the different groups?

Ungraded Open Response:

They should number the dogs, write the numbers on small pieces of paper, put them in a hat, shuffle them around, and pick the first 125 to be the test group.
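Drawing numbers from a hat can be imitated in R with `sample()`. The following is a minimal sketch, assuming the dogs are simply numbered 1 to 250; the variable names are made up for illustration.

```r
set.seed(42)  # fixed seed only so the sketch is reproducible

dogs <- 1:250                               # one number per dog
test_group <- sample(dogs, 125)             # draw 125 numbers without replacement
control_group <- setdiff(dogs, test_group)  # the dogs left in the hat

length(test_group)
length(control_group)
```

Every dog has the same chance of landing in the test group, and no person chooses which dog goes where.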

Scaffold:

You want to determine a method that removes all possible bias inherent in human selection. This would be a process that allocates the test subjects to the two different groups at random. What is this process called?

Exact Match (case sensitive):

Randomization

Scaffold:

How can you randomize the distribution of dogs to each group?

Ungraded Open Response:

Matched pairs design compares two experiments by...

Multiple Choice:

	separating the test subjects into two groups by looking at certain factors.
	randomizing the test subjects into two groups and making them act as the control group first, and then as the experimental group.
	randomizing the test subjects into two groups and making one of the groups act as the control group first, and then as the experimental group while the other group acts as the experimental group first, and then as the control group.
	randomizing the test subjects into two groups and making them act as the experimental group.

Machinecle Cubix, a pharmaceutical company with an emphasis on fitness, has developed a new energy drink for marathon runners that is claimed to significantly boost one's endurance. 3000 volunteers are available to act as test subjects to test this claim. Why should the company not simply let all the runners run at once, record their running distance, and redo it after they are given an appropriate amount of energy drink and rest? Assume that it is physically and legally possible for that many runners to run on the road at the same time.

Ungraded Open Response:

By allowing all of the runners to run under the same conditions, they will all become more wary the second time of obstacles and shortcuts. This will affect the results of the second trial.

Hints:

What problems could you predict if everyone gets to see the course at once?

How might the first trial affect the second?

A new television company, Flerovane, wants to test the side effects on customers of watching its TVs for extended durations. Its researchers have defined the experiment as comparing the side effects of watching their Standard Definition TV for an extended duration against the side effects of watching their High Definition TV for an extended duration. If they want as unbiased a data set as they can get, which of the following data sets is most appropriate for their experiment?

Multiple Choice:

	Random % of People watch SD TV first and Remainder watch HD TV first Based on 50% Probability Randomization
	50% of People watch SD TV first and 50% watch HD TV first.
	Not Perfectly Randomized, so not the least biased data set.
	Completely Random Data Set
	This data set does not take into account the bias from some % of people doing one experiment before the other.
	Random % of People watch SD TV first and Remainder watch HD TV first
	Close, but not quite. This data set does not take into account the relative ratio of 1:1 based on 50% probability. If 99% of people watch SD first then it is much more biased than one with a relative 1:1 ratio.

Hints:

Think about how order affects the experiment in terms of using Randomization to achieve unbiased data.

There are 2 data sets, so order should be determined by Randomization based on the principle of there being only 2 possible orders for the set.

In using Randomization for unbiased data in the order of 2 data sets one must use Randomization on the order RELATIVE to the 2 choices, (Relative 50% should use one order remainder use the other). Think of it as if you are flipping a coin for each person in the experiment to determine in which order they will do the 2 segments of the experiment.

Using Relative Probability instead of fixed probability allows the randomization to unbias the experiment. This is because no variable other than the difference between the two experiments is left systematically connected to the results.
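The coin-flip idea in the hints above can be sketched in R. This is an illustration only: the number of participants is not given in the problem, so `n_people` is a hypothetical value, and the variable names are made up.

```r
set.seed(7)  # fixed seed only so the sketch is reproducible

n_people <- 20  # hypothetical number of participants (not from the problem)

# One fair "coin flip" per person decides which TV they watch first.
viewing_order <- sample(c("SD first", "HD first"), n_people, replace = TRUE)

# Counts will be close to 1:1 on average, but are not forced to be exactly equal.
table(viewing_order)
```

Each person's order is decided independently with probability 50%, which is what separates this from fixing exactly half of the people in each order by hand.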

If Gori's Utilities is developing a new program for phones that allows the phones to listen to music stored on their computer and Gori wants as unbiased a data set as she can get, which of the following data sets is the most unbiased for her to use if she knows that the new program will work differently on different types of phones as determined by her research team?

Multiple Choice:

	Completely Randomized Data Set
	The different types of phones will bias each other in the data.
	Randomized Data Set divided into sub groups of different phones and their versions
	Sorry, Gori has no information on the different versions of phones, so we cannot assume anything that is not given.
	50% Old Phones 50% New Phones randomly divided between the sub groups of the different phones
	There is no mention of 50% or comparison between 2 things, please reread the question.
	Randomized Data Set divided into sub groups of the different phones

Hints:

Gori knows that the different phone types may bias the data if tested together as the same data type.

Ronald's Factory is testing out a new conveyor belt system that will allow his workers to take their breaks at different times, and the project was successful. Now Ronald wants to implement this experimental conveyor belt system in all of the factories in his city. But many of these factories are different, and the way Ronald's new conveyor belt system will work in each one will be a little different. Ronald unfortunately was unable to hire a research team that would investigate all of the factories for information on how different his results would be, so now he is only able to do the experiment itself. With this information, which of the following data sets should Ronald choose?

Multiple Choice:

	Completely Randomized Data Set
	Randomized Between Old Factories and New Factories
	Ronald has no such information on the difference between old and new factories.
	Randomized Between Retail Factories and Hardware Factories
	Ronald has no such information on the difference between retail and hardware factories.
	Randomized Data Set divided between the different types of Factories
	Ronald has no information on the different types of Factories, so he is unable to make such assumptions

Hints:

Ronald has no real information on the differences between the factories so he must eliminate bias through randomization, not assumptions of undetermined data.

Design an experiment from the situation above, given that Ronald wants to test his new conveyor belt system for the Night Shift and the Day Shift (Same Participants) across Very Old Factories, Somewhat Old Factories, and New Factories. Assume that the factories in question supplied the funds to research their factories, and that the research found significant differences between the 3 types of factories and between the Night and Day Shifts.

Ungraded Open Response:

Hints:

Remember to Split up variables that are introduced to the experiment that are not the main experiment variable (the 3 types and the 2 shifts)

Remember that order matters in splitting up the 2 shifts because the same participants are taking part in the experiment for both shifts.

Remember to use relative probability within Randomization for Order Matters Data Sets

Block design seeks to divide the test subjects into blocks. What are blocks?

Multiple Choice:

	Groups of test subjects that are randomly selected.
	Groups of test subjects that are known before the experiment to be similar in some way that is expected to affect the results.
	Groups of test subjects that are divided arbitrarily.
	Groups of test subjects that are known before the experiment to be similar.

Scientists have developed a new medication that is claimed to cure lung cancer. How should scientists divide the test subjects into groups if they know that one of the following paired groups of factors will have a significant difference?

Multiple Choice:

	1st hand smoking vs 2nd hand smoking
	Young vs Old
	Male vs Female
	Average weight vs Obese

A gym teacher wants to sample his students' upper body strength by giving them an endurance test that will measure the number of push-ups that each student can perform before tiring out. Before the experiment is started, what should the teacher do to the class so that some of the interfering variables will be eliminated?

Ungraded Open Response:

The class should be divided into two separate groups of males and females, since gender plays a role in the amount of muscle mass.

Hints:

Think about how humans can be classified or divided into groups. Is there a classifying factor that might significantly affect the test result?

Gingy is experimenting with chemicals in a lab when he discovers that a prankster has mixed some of his chemicals together: the contents of every two test tubes were combined into one tube, so each mixed tube now holds exactly 2 different chemicals. Gingy is a bit distressed, but he thinks he may still be able to carry out the experiment. If Gingy's experiment is to test which chemical reacts in a certain way to a special compound, how should Gingy design his experiment using Randomization if only half of his chemicals were mixed by the prankster?

Ungraded Open Response:

Hints:

Gingy can separate the mixed chemicals from the unmixed chemicals and then use Randomization.

Fred's New Gen Tech Industry wants to test the effectiveness of a new wireless headset that controls the mouse of one's computer in response to eye movement, and he desires an unbiased data set. From research done by the scientists at his company, Fred now knows that the effectiveness of his new wireless headset will differ drastically depending on a particular characteristic of the user. Fred understands from this research that his new wireless headset is impractical for people who are blind or very nearly blind. The effects are not so straightforward, however, for people who are near sighted, are far sighted, have 20-20 vision, have strained/weak retinas (loss of eye focus), or have bad eyesight. With this information, which of the following data sets is the most appropriate for his experiment if he wishes to have as unbiased a data set as he can while testing his current headset, which he may adjust for the different sub groups based on the results of this experiment?

Multiple Choice:

	Completely Random Data Set of Non-Blind Persons
	If Fred were to use this data set he would be biasing his results by crossing the substantial differences within his sub groups.
	Random Data Set within each Sub Category Separated
	Completely Random Data Set
	This data set ignores all of the conditions that Fred needs in order to have the most unbiased data set in his experiment. Reread the question and think about the Sub-Groups this time in more detail.
	Random Data Set within persons with 20-20 Vision and Random Data set within persons with impaired vision Separated
	You're on the right track, but this data set does not completely separate out the bias of the sub groups of persons with impaired vision as the scientists have researched that the effects should be different among those sub groups.
	Random Data Set within Children, Adolescents, Middle aged, and Elderly Separated.
	Reread the question, the question does not mention age groups at all, so age groups are irrelevant to Fred's experiment and thus do not separate out the bias for his experiment.

Hints:

Pay Attention to Sub Groups and Separation of Bias.

Let's take apart the problem a bit, now Fred wants an unbiased experiment, so he should use Randomized Data, but he needs to use Randomized Data within the different groups of people as determined by his scientists that researched the differences among the different groups.

Hysterian Systems is trying to test the effects of a new remote that is specifically for police to catch speeding cars remotely. Hysterian System's team of researchers have found that the effect of the remote to bring a targeted car's speed to match the position of the police car using a simple algorithm differs drastically between a car driving slightly over the speed limit and a car driving much higher than the speed limit. Which of the following data sets is the most suited for their experiment?

Multiple Choice:

	Completely Randomized Data Set
	Randomized Data Set divided between very fast cars and slightly fast cars
	This data set does not allow them to establish a correlation between speed and results.
	Randomized Data Set divided between Cars Under the Speed Limit and Over the Speed Limit
	There is no mention of cars under the speed limit being tested in the experiment, reread the question.
	Randomized Data Set divided between different Car Models
	There is no mention of the effect of different car models in the experiment.

Hints:

The only two variables being considered are Speed and Results.

Dividing the experiment between two ranges of Speeds does not allow the variable to be tested.

Just because the results differ drastically between slightly over the limit and much higher than the limit, it does not mean they should be separated, from the information it can be assumed that randomization will provide results that yield a correlation or an otherwise undeterminable difference between the two ranges. The drastic difference may even be as a result of an exponential relationship.

What is wrong with this "completely randomized design"?
AAAA
BBBB
CCCC
DDDD

Multiple Choice:

	It is biased
	Wrong, this is a completely randomized design
	WRONGO!
	It has too many groupings
	It has too many groupings
	It is not in any way a random design
	nope!

What are factors of a completely randomized design?

Check All That Apply:

	k = number of factors
	L = number of levels
	n = number of replications
	B = number of treatments
	T = Time elapsed

Hints:

All completely randomized designs are defined by three things, how many factors, levels, and replications are involved in the experiment.

What type of design does this show?
              Treatment
Gender    Placebo    Vaccine
Male      250        250
Female    250        250

Multiple Choice:

	Completely Randomized Design
	Block Design
	Stacked Design
	HA
	Grouped Design
	NO!

Scaffold:

How many "factors" are there in this problem?

Algebraic Expression:

4

Scaffold:

Since there are 4 factors in this experiment, we know that it is not a completely randomized design. This design is called a randomized block design. Here, the experimenter divides participants into subgroups called blocks, such that the variability within blocks is less than the variability between blocks. Then, participants within each block are randomly assigned to treatment conditions. Because this design reduces variability and potential confounding, it produces a better estimate of treatment effects.

Multiple Choice:

	yes
	no

Scaffold:

Knowing what a "Randomized Block Design" is, what is the answer to the problem?

Multiple Choice:

	Randomized Block Design
	wrong, this is a completely randomized design
	NO!!!!!!!!!!!!!!!!!!!
	This is a stacked design
	ARE YOU SERIOUS!!!!
	Grouped Design
	you mustn't have read the scaffold....

It is known that men and women are physiologically different and react differently to medication. This design ensures that each treatment condition has an equal proportion of men and women. As a result, differences between treatment conditions cannot be attributed to gender. Which design would be best to have to represent this type of situation?

Multiple Choice:

	Randomized Block Design
	Completely Randomized Design
	no, sorry
	Stacked Design
	Simoneau Special Design
	almost!

All of the following are advantages of Completely Randomized Design except....

Multiple Choice:

	It is completely flexible
	Any number of treatments can be investigated
	each treatment can have any number of units (though equal units are desired)
	experiments do not involve randomized numbers

A completely randomized design is most appropriate when

I. Experimental units are similar.

II. Several units may be destroyed or fail to respond.
III. It is a relatively small experiment.

Algebraic Expression:

	I. only
	nope
	II and I only
	I, II, III
	III only
	II and III only

Assess the validity of this statement: In Completely Randomized Design, the experiments work best when they are small such as laboratory experiments.

Multiple Choice:

	True
	False
	no, sorry
	Not Enough Information
	nope.

Which of the following statements are true?

I. A completely randomized design offers no control for lurking variables.
II. A randomized block design controls for the placebo effect.
III. In a matched pairs design, participants within each pair receive the same treatment.

Multiple Choice:

	I only
	nope, try again
	II only
	nope, try again
	III only
	nope, try again
	All of the above
	nope, try again
	None of the above

Hints:

In a completely randomized design, experimental units are randomly assigned to treatment conditions. Randomization provides some control for lurking variables. By itself, a randomized block design does not control for the placebo effect. To control for the placebo effect, the experimenter must include a placebo in one of the treatment levels. In a matched pairs design, experimental units within each pair are assigned to different treatment levels.

What is the first step in the complete randomization procedure?

Multiple Choice:

	-Treatments are assigned to experimental units completely at random.
	-Every experimental unit has the same probability of receiving any treatment.
	-Randomization is performed using a random number table, computer, program, etc.
	-Assign units in order.

(Table I)
      Treatment
Placebo    Vaccine
500        500

(Table II)
              Treatment
Gender    Placebo    Vaccine
Male      250        250
Female    250        250

Question
Table _____ is categorized as randomized block design, while Table ____ is categorized as completely randomized design.

Write answer down as first answer, second answer.

Exact Match (case sensitive):

	II, I
	I, II
	Nope

Hints:

In a randomized block design, subjects are first split into blocks of similar subjects, such as a certain race, age group, health status, etc.

In a completely randomized design, subjects are assigned treatments at random, and their different characteristics are disregarded and not noted, e.g. the difference between men and women.
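The two tables can be reproduced in R as a rough sketch. It assumes 1000 subjects, 500 men and 500 women, to match the table totals; the variable names `crd`, `rbd`, and `gender` are made up for illustration.

```r
set.seed(3)  # fixed seed only so the sketch is reproducible
n <- 1000

# Table I: completely randomized design.
# Shuffle 500 "Placebo" and 500 "Vaccine" labels over all subjects, ignoring gender.
crd <- sample(rep(c("Placebo", "Vaccine"), each = n / 2))
table(crd)

# Table II: randomized block design.
# Block on gender first, then shuffle the labels inside each block.
gender <- rep(c("Male", "Female"), each = n / 2)
rbd <- character(n)
for (g in c("Male", "Female")) {
  rows <- which(gender == g)
  rbd[rows] <- sample(rep(c("Placebo", "Vaccine"), each = length(rows) / 2))
}
table(gender, rbd)
```

In the first design nothing stops one treatment group from containing more men than the other; in the second, each gender is split evenly between the treatments by construction.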

What is a characteristic of completely randomized design?

Check All That Apply:

	It is one of the most complicated experimental designs.
	Nope
	It is one of the simplest experimental designs.
	It is used to control for the effects of outside variables.
	It is used to randomize the subjects so they are not in order.
	Nope

All completely randomized designs with one primary factor are defined by three numbers; what are they?

Exact Match (case sensitive):

number of factors, number of levels, number of replications

What is the one disadvantage of CRD?

Multiple Choice:

	units that receive one treatment may be inherently different from units that receive other treatments
	units that receive one treatment may be inherently different from units that receive other treatments
	The statistical analysis is not straightforward
	Any number of treatments can not be investigated
	The analysis becomes complicated if observations from some units are missing

As more and more consumers are taking sleep aids, the effectiveness of Eszopiclone (Lunesta) is being tested. Three hundred men and women over the age of 30 who get 6 hours of sleep or less agreed to take part in the experiment. The participants will be split into three groups of one hundred: one group is given Eszopiclone, one is given a placebo (essentially a sugar pill), and one is placed in a room with soothing music.

Describe a completely randomized design that will show the effectiveness of each method.

Ungraded Open Response:

When dealing with generalizations, one must realize the importance of sampling. Because there are many types of sampling, and a sample's ability to represent all the given data correctly can vary, both logic and good judgment are needed when sampling.
Click the following link to read and get an introduction to sampling:
http://www.statcan.gc.ca/edu/power-pouvoir/ch6/sampling-echantillonage/5214807-eng.htm

Multiple Choice:

	Yes
	No
	Please read through the given link.

Write a couple sentences explaining what you were able to gather from this exercise.

Ungraded Open Response:

Here is an interactive site with which you can further delve into sampling; it explores the pros and cons that come with voting.
http://www.learner.org/interactives/statistics/

Note: Answer as if you were 18 and old enough to vote.

Ungraded Open Response:

Did you complete the exercise? What were you able to learn about the voting process?

Ungraded Open Response:

Use this app for the following question:

http://www.stat.tamu.edu/~west/applets/mandmtest.html

You and your friend are looking to buy bags of M&M's for a party. The party requires that the colors served are Red, Green and Yellow. The suggested number is about 200 total.

Click the app above and try a couple samples. Based on looking at them, what can you conclude from the different colors represented? Are the same ratio of colors found in all of the bags? How many bags would you probably need to buy?

Ungraded Open Response:

Use the following link:
http://stattrek.com/AP-Statistics-2/Experiment.aspx?Tutorial=Stat
Experiments are used to show:

Multiple Choice:

	observing variables to determine how the observations prove a hypothesis
	observing what is happening to draw conclusions based on observations
	taking random samples to draw a conclusion

Hints:

Remember the difference between experiments, sampling and observational studies.

Using the graph on the following link, what can you conclude about this experiment and how does this differ from an observational study and sampling?
http://www.sciencebuddies.org/science-fair-projects/project_data_analysis.shtml

Ungraded Open Response:

Hints:

View the graph carefully, and if you would like a deeper analysis, view the data chart on the link as well.

Algebraic Expression:

Suppose two researchers wanted to determine if aspirin reduced the chance of a heart attack. Researcher 1 studied the medical records of 500 patients. For each patient, he recorded whether the person took aspirin every day and if the person had ever had a heart attack. Then he reported the percentage of heart attacks for the patients who took aspirin every day and for those who did not take aspirin every day.
Researcher 2 also studied 500 people. He randomly assigned half of the patients to take aspirin every day and the other half to take a placebo every day. After a certain length of time, he reported the percentage of heart attacks for the patients who took aspirin every day and for those who did not take aspirin every day. Suppose that both researchers found that there is a statistically significant difference in the heart attack rates for the aspirin users and the non-aspirin users and that aspirin users had a lower rate of heart attacks. Can both researchers conclude that aspirin caused the reduction?

Multiple Choice:

	No, only researcher 2 can conclude this.
	No, only researcher 1 can conclude this.
	Yes, because aspirin is known to reduce heart attacks
	Yes, because aspirin users had a larger heart attack rate in both studies.

Hints:

Use this link to help you understand blinding
http://stattrek.com/AP-Statistics-2/Experiment.aspx?Tutorial=Stat

Suppose that you were hired as a statistical consultant to design an experiment to examine the impact of a new medicine vs. a current medicine on migraines. 50 patients volunteer to participate in the study. What design will you recommend?

Multiple Choice:

	Completely randomized design with two variables.
	Completely randomized design.
	Completely randomized design with two factors and single blind.
	Completely randomized design with two factors and double blind.

There are three basic types of study design--- observational studies, sample surveys, and experiments.
Observational Studies: where values of the variable or variables of interest are recorded as they naturally occur. There is no interference by the researchers who conduct the study.
Sample Surveys: a particular type of observational study where individuals report variables' values themselves, frequently by giving their opinions.
Experiments: instead of assessing values of variables as they naturally occur, the researchers interfere, and they are the ones who assign values of the explanatory variable to the individuals. The reason why the researchers "take control" of the values of the explanatory variable is because they want to see how changes in the values of the explanatory variable affect the response. (Note: By nature, any experiment, then, involves at least two variables)

Multiple Choice:

	A) Sampling is a type of observational study
	B) Sampling takes a sample
	Answer the question fully
	C) Observational studies observe
	Answer the question fully

Hints:

Re-read the question

Hints:

Re-read the entire passage

What is the difference between observational studies and experiments?

Multiple Choice:

	A) No difference
	Really? Did you even try?
	B) Experiments are a form of observational studies
	C) Observational studies are forms of experiments
	D) In observational studies the researchers have no influence but in experiments they do

Hints:

Read the blurb on experiments in the first question.

Suppose researchers want to determine whether people tend to snack more while they watch TV. In other words, the researchers would like to explore the relationship between the explanatory variable "TV" (a categorical variable that takes the values 'on' and 'not on') and the response "snack consumption".
Identify each of the following designs as being an observational study, a sample survey, or an experiment.
1. Recruit participants for the study. While they are presumably waiting to be interviewed, half of the individuals sit in a waiting room with snacks available and a TV on. The other half sit in a waiting room with snacks available and no TV, just magazines. Researchers determine whether people consume more snacks in the TV setting.

What type of data collection is this?

Multiple Choice:

	A) This is an observational study
	B) This is an experiment
	C) This is a sample survey

Hints:

The researchers take control of the explanatory variable of interest (TV watched or not) by assigning each individual to either watch TV or not, and determine the effect on the response of interest (snack consumption)

2. Recruit participants for a study. Give them journals to record hour by hour their activities the following day, including TV watched and food consumed. Determine if food consumption is higher during TV times.
What type of data collection is this?

Multiple Choice:

	A) This is an observational study
	B) This is an experiment
	C) This is a sample survey

Hints:

The participants themselves determined whether or not TV was watched. There is no attempt on the researchers' part to interfere.

3. Recruit participants for a study. Ask them to recall, for each hour of the previous day, whether they were watching TV, and what food they consumed each hour. Determine whether food consumption was higher during the TV times.
What type of data collection is this?

Multiple Choice:

	A) This is an observational study
	B) This is a sample survey
	C) This is an experiment

Hints:

The participants themselves decided whether or not to watch TV. (Do you see the difference between 2 and 3?)

4. Poll a sample of individuals with the following question: While watching TV, do you tend to snack (a) less than usual (b) more than usual (c) the same amount as usual?
What type of data collection is this?

Multiple Choice:

	A) This is an observational study
	B) This is a sample survey
	C) This is an experiment

Hints:

The individuals self-assess the relationship between TV watching and snacking.

Inferences to the population can be made from surveys and observational studies only if their subjects are selected at random.

Cause-effect relationships between explanatory and response variables can be concluded from experiments only when treatments are randomly assigned to groups.

Read and understand this concept. It is a short but confusing concept that one can make many mistakes on. When you are ready to move on type "1" into the answer box below.

Algebraic Expression:

1

Can one generalize results of a survey if the subjects were randomly chosen?

Multiple Choice:

	Yes
	No
	If subjects were selected randomly, the results of the survey can be assumed to be the results of the entire population.

Can one generalize results of an observational study when subjects are randomly chosen?

Multiple Choice:

	No
	If subjects are chosen randomly one can generalize the results as the results of the population.
	Yes

Hints:

To generalize the results means to assume that the results apply to the entire population.

What type of conclusions can be drawn from a survey where subjects are chosen randomly?

Multiple Choice:

	Inferences to population
	Cause-effect relationships between explanatory and response variables.
	No conclusions

What type of conclusions can be drawn from observational studies where subjects are chosen randomly?

Multiple Choice:

	Inferences to population
	Cause-effect relationships between explanatory and response variables.
	No conclusions

What type of conclusion can be drawn from an experiment where treatments are assigned to random groups?

Multiple Choice:

	Cause-effect relationships between explanatory and response variables.
	No conclusions
	Inferences to population

Can results of surveys be generalized if the subjects were not chosen at random?

Multiple Choice:

	Yes
	Survey subjects MUST be selected randomly for it to be able to be generalized for the entire population.
	No

Hints:

Would a survey be accurate of the entire population if you asked only people with a biased idea on something?

Can observational studies be generalized if subjects were not chosen randomly?

Multiple Choice:

	No
	Yes
	Subjects must be chosen randomly for results to be generalized for the entire population.

Can cause-effect relationships between explanatory and response variables be generalized if the treatments were not assigned to groups at random?

Multiple Choice:

	Yes
	Treatments must be assigned randomly.
	No

If you wanted to make an inference to the population, which of the following would you conduct?

Multiple Choice:

	Census
	A controlled experiment
	A field experiment

Scaffold:

What is a census?

Multiple Choice:

	A survey
	An experiment
	An observational study

Scaffold:

Can inferences to the population be made from surveys?

Multiple Choice:

	Yes
	No
	Yes, inferences can be made from surveys

Which of the following would you NOT conduct to get an inference to the population?

Multiple Choice:

	A controlled experiment
	A census
	A census is a survey. Surveys can give you an inference to the population.
	A natural experiment
	A natural experiment is a form of observational study. Observational studies can give you an inference to the population.

If Johnny wanted to take a poll on which team was better- the Boston Red Sox or the New York Yankees- how should Johnny conduct the poll in order to generalize the results as an inference to the population?

Multiple Choice:

	Poll only Yankees fans.
	By polling only Yankees fans, Johnny would not be using random subjects and achieve only a biased result.
	Poll only Red Sox fans.
	By polling only Red Sox fans, Johnny would not be using random subjects and achieve only a biased result.
	Polling random people on the phone.

Bob wants to see if drinking water makes you eat less. How should Bob conduct his experiment in order for him to be able to generalize the results?

Multiple Choice:

	Conducting his experiment with random subjects
	Conducting his experiments on big eaters only.
	Conducting his experiment on himself.

Which survey can be used to get an inference to the entire population?

Multiple Choice:

	A survey of 3rd graders asking how old they are.
	The subjects of this survey are not selected randomly.
	A survey of random people on favorite ice cream flavors.
	A survey of pet owners on number of pets they own.
	The portion of the population that does not own any pets is not included in this survey.

Which sample should a company use if they wanted to determine the popularity of their product?

Multiple Choice:

	Every third person in the phone book.
	A selection of people who use the product.
	A selection of people who don't use the product.
	A selection of people within the company.

If Dexter wants to test out his hypothesis of whether all plants are happier listening to classical music, can he make a generalization of the entire plant population if he tested on only the sunflower?

Multiple Choice:

	Yes
	Testing on one plant cannot give results that represent the entire plant population.
	No

When a dotplot is skewed right the median is on the left.

Multiple Choice:

	True
	False
	Try again

The difference between a stem and leaf plot and a dot plot is

Multiple Choice:

	nothing. they are just different words that mean the same thing.
	They display the same information in different shapes
	stem plots have dots and dot plots have numbers
	The opposite of this is true.

A stem and leaf display is a good method of displaying large amounts of data.

Multiple Choice:

	False
	True
	True
	Correct

The Minnesota Vikings have a team of over 53 men, with the smallest man at 5'9" and the tallest at 6'8". Would it be better to display this information on a stemplot or a dotplot?

Multiple Choice:

	Stemplot
	guess again
	Dotplot
	guess again
	Neither would be good because there is too much information to display

Which gives more detailed information, a stem and leaf plot or a dot plot?

Multiple Choice:

	both
	just pick one
	stem and leaf
	dot
	NOPE. the stem and leaf displays the specific numbers used to generate the graph

Greg Jennings is one fast dude. He is notorious for his explosive speed and his ability to put his team on his back. Below is a stemplot showing the number of touchdowns Greg Jennings has scored during each season of his career. What is the average number of touchdowns he has scored in a season?
0 / 2
0 /
1 / 24
1 / 789
2 / 12
2 /

Multiple Choice:

	One cannot count the number of touchdowns scored by Greg Jennings
	Yes you can, they are right there in front of you.
	15.625
	18.75
	try again.
	2
	No way, Jose.
	22
	find the average
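
One way to check the average in R is sketched below. The vector name `touchdowns` is an assumption made for illustration; the values are read off the stemplot above, with each stem as the tens digit and each leaf as the ones digit.

```r
# Season totals read from the stemplot: 0|2, 1|24, 1|789, 2|12
touchdowns <- c(2, 12, 14, 17, 18, 19, 21, 22)

# The mean is the sum of the values divided by how many values there are
avg_touchdowns <- sum(touchdowns) / length(touchdowns)
avg_touchdowns           # 15.625

# mean() gives the same result in one step
mean(touchdowns)
```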

This Dotplot has a normal distribution

Multiple Choice:

	True
	False

What is the distribution of this stem and leaf plot?
Stems / Leaves
1 / 2457699999999999999999999999999999999999999
2 /45666666888888888888999999999
3 /555555555555
4 /2344
5 /45
6 /23

Multiple Choice:

	Skewed Left
	opposite
	Skewed Right
	Skewed Write
	Skewed Tight
	Normal

A stem and leaf plot is

Multiple Choice:

	Quantitative
	Categorical
	Categorical graphs group things. Do stem and leaf plots do that?

In the following stem and leaf plot, is # a stem or a leaf?
1 / #
2 / 5

Multiple Choice:

	Leaf
	Stem
	sorry sir

Stemplot is to Vertical as Dotplot is to_____

Multiple Choice:

	Horizontal
	Flat
	Vertical
	Exterior

Is this a dotplot or a stem and leaf plot?

Multiple Choice:

	Stem and Leaf plot
	Dot Plot

Stem and Leaf plot is related to dot plot how?

Multiple Choice:

	Brothers (similar but a little different looking)
	Cousins (both are shapes but not similar)
	Fathers (physically impossible relationship)
	Twins (exactly the same in every way)
	no they do have a few minor differences

1 / 1234
2 / 234
3 / 0
The stem and leaf plot above shows which set of numbers?

Multiple Choice:

	11 12 13 14 22 23 24 30
	121 111 123 123 213 233 30002
	The stem and leaf plot shows the number 11 as 1 / 1 and the number 30 as 3 / 0
	1 2 32434 55 4555 22
	11 222 333 54321 112 22 23
	11 12 14 21 22 23 24 30

The stemplot below represents the points allowed by the Green Bay Packers in their last 16 games. What is the median of the stemplot?
1 / 00447
2 / 111448
3 / 116
4 / 18

Multiple Choice:

	22.5
	23
	21
	24
	22
	22.5
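
As a rough check in R, the stemplot can be typed in as a vector and passed to `median()`. The name `points_allowed` is an assumption for illustration; the values come from reading each stem as the tens digit and each leaf as the ones digit.

```r
# Points allowed, read row by row from the stemplot
points_allowed <- c(10, 10, 14, 14, 17,      # 1 | 00447
                    21, 21, 21, 24, 24, 28,  # 2 | 111448
                    31, 31, 36,              # 3 | 116
                    41, 48)                  # 4 | 18

length(points_allowed)   # 16 games

# With an even number of values, the median is the average of the
# two middle values (the 8th and 9th here: 21 and 24)
median(points_allowed)   # 22.5
```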

Forrest Gump once said "You know it's funny what a young man recollects?"
What do you recollect from these questions?

Ungraded Open Response:

How would you properly plot this data?
Ten strangers were asked how many hours they spent exercising in a week. Here are the results: 5,6,5,8,10,9,9,7,5
A)                                      B)                                C)
     *        *                             *
     *        *                             *            *
     *        *        *                   *  *   *   *   *               *  *  *  *   *   * * * *
   4-6    7-9     10-13             5  6 7 8 9 10            5 6 5 8 10 9 9 7 5

Multiple Choice:

	A
	B
	C
	None of the above

Hints:

Refer to the website below. Scroll halfway down and it provides you with examples of how to properly create a dot plot
http://www.vertex42.com/ExcelArticles/dot-plot.html

This data represents the average age of randomly selected NHL players in the year 2009. We want to make a dot plot out of the data set, but first we need to know whether it contains any outliers. Does it?
24,37,29,19,19,23,28,33,29,20,22,31,22,25,28,32,33,35,45,18,47, 28,30

Multiple Choice:

	No outliers
	1 outlier
	2 outliers
	3 outliers

Hints:

Outliers are numbers that appear outside the norm in a data set. For example, the data set below has three obvious outliers: 19, 20, and 21. Look at the initial problem to see whether there are numbers that appear outside the norm.
3,5,7,4,2,7,3,19,3,2,20,4,6,7,21.

Which is correct about a dotplot?

Multiple Choice:

	Dot plots are used only in studying statistics, never real world applications
	Dot plots are both categorical and quantitative at times but never one or the other
	Dot plots represent individual observations in a batch of data with symbols. It's used to depict distribution
	Answers B and C are correct

Hints:

Before tackling the problem, think about what dot plots represent. Have there been examples in the real world where dot plots have been used? Is it possible for a data set to be both quantitative and categorical? After a process of elimination, the right answer will be evident.

A)                                        B)          *   *               C)
    *    *     *     *     *                 *    *    *   *                           *      *    *
    *    *     *     *     *               *     *    *    *     *             *     *   *     *    *
    0    1     2     3     4             0     1    2    3     4          0     1     2    3     4

These dot plots represent the number of fouls by randomly selected NBA players. Which dot plot reveals a uniform distribution?

Multiple Choice:

	A
	B
	C
	None of the above

Hints:

Uniform distribution represents a set of data that is equally distributed throughout. Eliminate answers that are skewed or symmetrical and you will find your answer.

            *
            *     *
     *     *     *     *
     *     *     *     *     *
*   *     *     *     *     *     *     *     *
0    1     2     3     4     5     6     7     8
This data set represents the number of shots attempted by randomly selected NHL players in a single game. Does this data set contain any outliers?

Multiple Choice:

	Yes; there is a large spread thus points 1 and 8 are outliers
	Yes; since the dot plot is skewed to the left point 8 is an outlier
	No; because there are five 2's in the data set
	No; because there are no observations that are numerically distant from the data set

Scaffold:

In order to understand this problem, you first must know the definition of an outlier. An outlier is any data value that appears unusually large or small and out of place compared to the rest of the data set. Now look back at the data and see whether it in fact contains any outliers.

Multiple Choice:

	Yes; there is a large spread thus points 1 and 8 are outliers
	Yes; since the dot plot is skewed to the left point 8 is an outlier
	No; because there are five 2's in the data set
	No; because there are no observations that are numerically distant from the data

            *             *     *
     *     *     *     *
     *     *     *     *     *
*   *     *     *     *     *     *     *     *
0    1     2     3     4     5     6     7     8
This data set represents the number of shots attempted by randomly selected NHL players in a single game. Describe the distribution

Multiple Choice:

	Distribution is skewed to the left
	Distribution is skewed to the right
	Distribution is symmetric
	Distribution contains outliers

Scaffold:

Look at the illustration above. Note the shape of the distribution. Is it symmetric (bell-shaped), skewed to the right (most observations are on the left, at lower values), or skewed to the left (most observations are on the right, at higher values)?

Multiple Choice:

	Distribution is skewed to the left
	Distribution is skewed to the right
	Distribution is symmetric
	Distribution contains outliers

Scaffold:

Multiple Choice:

Multiply the stem by 10. After that, what is the biggest number in this set of data?

Exact Match (case sensitive):

	73
	73.3
	073

Hints:

The stems are supposed to be multiplied by ten and then you may add the leaf.

After doing that, you get that 73 is the largest of them all.

Why are there multiple stems starting with the same number?

Multiple Choice:

	The stems represent a large dataset and to make it easier to read the dataset might be divided into 5 stems of the same number
	The stem and the leaf switched
	The data is filled with too many numbers so the leaf of the dataset is cut
	None of the above

4|33
3|56
2|00456
1|00134
0|1245589
-0|0679
-1|005559
-2|7
How many negative numbers are in this stem plot?

Multiple Choice:

	11
	10
	9
	8
	7

Hints:

To read negative values, look for the negative stems, then count how many values they hold.

The answer is 10. The negative stems hold 11 leaves, but the leaf 0 on the -0 stem stands for 0, which is not negative.

What is the mean of this stem and leaf plot?

Multiple Choice:

	61.8
	61.1
	62.3
	63.3
	62.8

Hints:

To find the arithmetic mean, add up all the values, then divide by the number of values.

Here is an example: 5, 6, 9, 12, 15. There are 5 values, which add up to 47. Divide by the number of values, which is 5, and you get 9.4.

61.8 is the answer

Figure 3. Back to back stem and leaf display. The left side shows the 1998 TD data and the right side shows the 2000 TD data.

Number of touchdowns per team in an NFL season

     1998 | Stem | 2000
       11 |  4   |
          |  3   | 7
      332 |  3   | 233
     8865 |  2   | 889
 44331110 |  2   | 001112223
987776665 |  1   | 56888899
      321 |  1   | 22444
        7 |  0   | 69

In what season were there more than 40 touchdowns by a team?

Multiple Choice:

	1997
	1998
	1999
	2000

Hints:

You read a double stem plot similarly to how you would read a regular one, but one side is one set of data while the other side is another set of data. So the left side is 1998 and the right side is 2000.

The answer is 1998 because 2 teams scored 41 touchdowns.

3|2337
2|001112223889
1|2244456888899
0|69

What number is the mode in this stem plot?

Exact Match (case sensitive):

	18
	18.0
	18.
	18.00

	The mode is the number that appears most often.
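
Base R has no built-in function for the statistical mode (its `mode()` function reports how an object is stored, not the most frequent value). A minimal sketch using `table()` is shown below; the name `touchdowns` is an assumption for illustration, and the values are read from the stem plot above.

```r
# Values read from the stem plot, one row per stem
touchdowns <- c(6, 9,                                            # 0 | 69
                12, 12, 14, 14, 14, 15, 16, 18, 18, 18, 18, 19, 19,  # 1 | 2244456888899
                20, 20, 21, 21, 21, 22, 22, 22, 23, 28, 28, 29,  # 2 | 001112223889
                32, 33, 33, 37)                                  # 3 | 2337

# Count how often each value appears
counts <- table(touchdowns)

# The mode is the value with the highest count
mode_value <- as.numeric(names(counts)[which.max(counts)])
mode_value       # 18
max(counts)      # 18 appears 4 times
```

Note that `which.max()` returns only the first value if several are tied for the highest count, so this sketch suits data with a single mode.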

Scaffold:

3|2337
2|001112223889
1|2244456888899
0|69

What number is the mode in this stem plot?

Multiple Choice:

	15
	13
	19
	16

Weights of NFL half backs

__________________________________
185 195 205 215 225 235 245
The Arizona Cardinals have not signed a running back under 185 pounds. What is the chance of a 185.5 pounder making the team considering weight standards?

Multiple Choice:

	0%
	2.5%
	3%
	50%

Hints:

Draw in the bell curve

Place correct percentages

Based on WEIGHT STANDARDS

Choose closest answer

Common Adult Shoe Sizes

______________________________
6 7 8 9 10 11 12
If an online sneaker carrier only has sizes 7-11 left in stock, what percentage of people cannot purchase off the website?

Algebraic Expression:

	5%
	13.5%
	18.5%
	2.5%

Hints:

Draw the problem out with a bell curve

Write in the correct percentages

x       x       x                                x      x       x
x       x       x                               x       x       x
x       x       x     x                x       x       x      x
x       x       x     x       x       x       x       x      x
_______________________________________________
6       7       8     9       10     11     12     13    14
Bags of candy

How would the distribution above be described?

Multiple Choice:

	Skewed right
	Symmetric
	Skewed Right
	All of the above

Hours Of Sleep (x-hours, y-people)

                            *
                            *           *
*                          *     *     *     *
*___*_______*__*___*__*___*____*__ *___
3     4     5     6     7     8     9     10     11   12
What is the average number of hours of sleep in this data set?

Multiple Choice:

	17
	Try Again
	8.5
	Try Again
	7.05
	8
	Try Again

57,60,70,84,58,60,66,72,63,63,75,61,68,71,70,80,59,69,66,78,74,70,69,72,67,73,76
Given the set of numbers above, which of the following statements is accurate?

Multiple Choice:

	The stemplot is skewed to the right
	Try Again
	The stemplot is skewed to the left
	Try Again
	The median is 69.5
	The mean is 69
	Try Again

Hints:

Start by arranging the numbers in order from least to greatest.

Stems should be 5, 6, 7, and 8. Fill in the matching leaves.

Does the graph look skewed at all?

The median is at 69.5

Algebraic Expression:

Which is true about normal distributions?

Multiple Choice:

	Standard deviations for all normal distributions are the same.
	Data does not need to be bell shaped to be considered "normal distribution."
	Normal distributions must have one peak

In the equation z = (x - μ) / σ, what is "x"?

Multiple Choice:

	mean
	standard deviation
	a specific number in the set of data

What is

?

Multiple Choice:

	the standard deviation, mean
	mean, z
	mean, standard deviation
	z, standard deviation

For one week at Boston Latin School, the number of students that bought a school lunch was recorded. If the bottom 1.5% is at 196 and the top 1.5% is at 234, what is the mean of the data?

Exact Match (case sensitive):

215

Hints:

Draw a graph of the data. What number falls evenly between 196 and 234?

Using your previous answer and the z chart, what is the standard deviation? Round to the nearest one-hundredth.

Exact Match (case sensitive):

8.84

Hints:

Plug the numbers into z = (x - μ) / σ.

Scaffold:

The answer is 8.84.

Algebraic Expression:

8.84

A normal distribution has a mean of 22 and a standard deviation of 2. What percentage of the data lies below 20?

Multiple Choice:

	100%
	16%
	84%
	23%

Hints:

Start by plugging in all the known information into the Z-score formula.

Refer to the Z Table and find the value of the Z score. That value will determine the portion of the data out of 1.

Convert the decimal into a percentage.
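
The three hint steps can be sketched in R, where `pnorm()` plays the role of the Z table by returning the area to the left of a z-score. The variable names are assumptions for illustration.

```r
mu    <- 22   # mean
sigma <- 2    # standard deviation
x     <- 20   # cutoff value

# Step 1: z-score formula
z <- (x - mu) / sigma
z                          # -1

# Step 2: area to the left of z (what the Z table lists)
p_below <- pnorm(z)

# Step 3: convert the proportion to a percentage
round(100 * p_below)       # 16
```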

Mr. Simoneau's AP Stats class just took a chapter three quiz. If 2.5% of people scored to the left of 76 and 2.5% scored to the right of 80, what is the mean score?

Multiple Choice:

	77
	look at z score
	79.1
	79.9
	78

Suppose that foot length of a randomly chosen adult male is a normal random variable with mean μ=11 and standard deviation σ=1.5. Then the Standard Deviation Rule lets us sketch the probability distribution of X as follows:

This graph shows the length of adult male feet. The probability is only 2.5% that an adult male will have a foot longer than how many inches?
Barbara Illowsky and Susan Dean, "Collaborative Statistics," Connexions, March 22, 2010, http://cnx.org/content/col10522/1.38/

Multiple Choice:

	28
	15
	14
	4

For one week, Neiman Marcus decided to record the sizes of every pair of shoes bought by customers. At the end of the week, the data was compiled and it was determined the mean shoe size purchased was an 8, with a standard deviation of .5 sizes. What shoe size represents the bottom 2.5% of shoes sold? What size represents the top 2.5% sold?

Multiple Choice:

	7, 9
	6.5, 9.5
	2.5%, not .15%
	7, 7.5
	not between percentages
	7.5, 8.5

Hints:

Use the 68-95-99.7 rule.
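
Under the 68-95-99.7 rule, about 95% of the data lies within 2 standard deviations of the mean, leaving about 2.5% in each tail. A small sketch for the shoe-size problem, with variable names chosen for illustration:

```r
mu    <- 8     # mean shoe size
sigma <- 0.5   # standard deviation

# Cutoffs 2 standard deviations below and above the mean
lower <- mu - 2 * sigma
upper <- mu + 2 * sigma
c(lower, upper)              # 7 9

# The rule is an approximation: the exact lower-tail area at
# 2 standard deviations is a little under 2.5%
pnorm(lower, mean = mu, sd = sigma)
```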

The number of hot chocolates ordered at Starbucks daily was recorded for one week. The data had a mean of 42. If the number 48 corresponded to the z-score of 2.47, determine the standard deviation of the data.

Multiple Choice:

	2.43
	2.47
	-2.43
	-.84

Using the data from the previous problem, determine the z-score of 37

Multiple Choice:

	-2.06
	.49
	-.49
	2.06
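
Both hot chocolate problems rearrange the same formula, z = (x - mean) / (standard deviation). A sketch in R, with variable names chosen for illustration:

```r
mu <- 42   # mean number of hot chocolates

# Solve z = (x - mu) / sigma for sigma, using x = 48 and z = 2.47
sigma <- (48 - mu) / 2.47
round(sigma, 2)              # 2.43

# Use that standard deviation to find the z-score of 37
z_37 <- (37 - mu) / sigma
round(z_37, 2)               # -2.06
```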

What is the area underneath a normal distribution bell-curve?

Algebraic Expression:

1

Students recently took the SATs. Out of a composite score of 2400, the mean score was 1800, with a standard deviation of 100 points. What percent of students scored above 2050 on their SATs?

Algebraic Expression:

.0062

Hints:

Note that if a problem uses the term GREATER THAN, you must subtract the z-table value from 1, because the table gives the area to the left of z and the question asks for the area to the right. (The answer is 4 digits)
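
In R, the subtraction from 1 can be done by hand or by asking `pnorm()` for the upper tail directly. A short sketch for the SAT problem, with variable names chosen for illustration:

```r
mu    <- 1800
sigma <- 100

z <- (2050 - mu) / sigma
z                                    # 2.5

# Area to the right of z, computed two equivalent ways
p_above_manual <- 1 - pnorm(z)
p_above_direct <- pnorm(2050, mean = mu, sd = sigma, lower.tail = FALSE)

round(p_above_manual, 4)             # 0.0062
```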

Mangia Pizza advertises in the newspaper that all of its thick crust pizzas have a crust of 1.5 inches. In reality, the mean of their crusts' thickness is 1.3 inches with a standard deviation of .1 inches. How thick is the crust of the bottom 1.1%?

Multiple Choice:

	1.071 inches
	1 inch
	.9845 inches
	1.12 inches

A group of students at a school takes a history test. The distribution is normal with a mean of 25, and a standard deviation of 4. (a) Everyone who scores in the top 30% of the distribution gets a certificate. What is the lowest score someone can get and still earn a certificate?
David Lane, "Online Statistics:An Interactive Multimedia Course of Study," Online Statbook, October 27,2010, http://onlinestatbook.com/

Multiple Choice:

	27.1
	28.9
	36.4
	31

Hints:

Find the z-score for 70% and plug the numbers into the z-score equation.
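
Going from a percentage back to a score is the reverse of a table lookup, which R does with `qnorm()`. A sketch for the history test problem, with variable names chosen for illustration:

```r
mu    <- 25
sigma <- 4

# The top 30% starts where 70% of the scores lie below
z_70 <- qnorm(0.70)          # z-score with 70% of the area to its left

# Rearranged z-score equation: x = mu + z * sigma
cutoff <- mu + z_70 * sigma
round(cutoff, 1)             # 27.1

# qnorm() can also take the mean and standard deviation directly
qnorm(0.70, mean = mu, sd = sigma)
```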

J.Crew decided to count the number of shoppers that entered the store daily for one month. When they compiled data, it was determined that the mean number of shoppers daily was 202. If 1.5% of the data lay below 172 and 1.5% lay above 232, what was the standard deviation of the data? Round to the nearest one-hundredth.

Exact Match (case sensitive):

13.82

Scaffold:

Subtract 232-202

Algebraic Expression:

30

Scaffold:

Since 1.5% of the data is above 232, 98.5% is below 232. Where is 98.5% on the Z table?

Algebraic Expression:

2.17

Scaffold:

2.17 is z for 98.5%. So, 2.17s= 30 (232-202). Solve for s.

Algebraic Expression:

13.82

Scaffold:

30 divided by 2.17 =13.82. Write 13.82

Algebraic Expression:

13.82
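
The scaffold steps above can be retraced in R. This is only a sketch: `qnorm()` stands in for the Z table lookup, and the variable names are assumptions.

```r
mu <- 202

# Step 1: distance from the mean to the upper cutoff
distance <- 232 - mu
distance                     # 30

# Step 2: z-score with 98.5% of the area below it
z <- qnorm(0.985)
round(z, 2)                  # 2.17

# Step 3: z * s = distance, so s = distance / z
s <- distance / z
round(s, 2)                  # 13.82
```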

Shoppers at Urban Outfitters were asked how much they paid for a pair of jeans. When the data was recorded, the mean price paid was $54, with a standard deviation of $4. What percentage of shoppers bought their jeans between the prices of $47 and $60?

Multiple Choice:

	93.32%
	94.23%
	89.31%
	72.01%

Hints:

Subtract the percentage of shoppers who paid less than $47 from the percentage of shoppers who paid less than $60.
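
The area between two values is the difference of two left-hand areas. A sketch in R for the jeans problem, with variable names chosen for illustration:

```r
mu    <- 54
sigma <- 4

p_below_60 <- pnorm(60, mean = mu, sd = sigma)   # z = 1.5
p_below_47 <- pnorm(47, mean = mu, sd = sigma)   # z = -1.75

# Area between $47 and $60, as a percentage
p_between <- p_below_60 - p_below_47
round(100 * p_between, 2)                        # 89.31
```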

The Real Deal sub shop recently asked its patrons how much they spend weekly at the shop. From the data they collected, the mean was $24 with a standard deviation of $2. What's the probability that someone spends more than $28.5 a week at the Real Deal?

Multiple Choice:

	.9878
	subtract from 1
	14.25
	.0225
	use z chart to find value for 2.25
	.0122

The weights of sixie backpacks follow a normal distribution with a mean weight of 40 lbs. and a standard deviation of 3 lbs. What percentage of sixies have backpacks larger than 45 lbs.?

Algebraic Expression:

	4.75%
	4.75
	.0475
	No, this is the area. To find the percentage, you need to multiply by 100.
	95.25
	No, this is the percentage of backpacks smaller than 45 lbs. To find the percentage larger than 45, subtract this percentage from 100.
	.9525
	No, this is the area of backpacks smaller than 45 lbs. To find the area larger than 45, subtract this area from 1.

Scaffold:

First, use the equation ((x-mean)/(standard deviation)) = Z . Fill in the given numbers. What is the value of x in this particular problem?

Algebraic Expression:

45

Scaffold:

Good. Now fill in the mean and the standard deviation. The equation should look something like this: ((45-40)/(3)) = Z. After dividing, we find that Z = 5/3. Calculate 5/3 as a decimal. Round to the nearest hundredth place.

Algebraic Expression:

	1.67
	1.66
	No, the question asks to round to the nearest hundredth.

Scaffold:

Now, locate 1.67 on the z-table. You will find that it is 0.9525. This number is the area of the normal distribution graph that is to the left (or less than) 45. However, the original question asks for the percentage of sixies with a backpack GREATER than 45 lbs. Therefore, because the area under the normal distribution is 1, you need to subtract this area from 1. This will give you the remaining area greater than 45. Write the final percentage of sixies with backpacks greater than 45 lbs.

Algebraic Expression:

	4.75
	4.75%
	0.0475
	No, the percentage is the area multiplied by 100.

Hints:

Remember, the percentage is the area found on the z-table multiplied by 100.

What is the probability that, on a normal distribution with a mean of 40 and standard deviation of 3, x=39?

Algebraic Expression:

0

Hints:

Finding probability implies finding the area of a certain part of the normal distribution. Because you only want to find a point, there is no area and therefore, the answer is 0.

Find the probability:
P(-0.7<Z<1.5)=

Algebraic Expression:

	0.6912
	69.12%

Hints:

Subtract the areas found on the z-table to find the area in between 1.5 and -0.7.

A B
Which normal probability plot represents the data from the normal distribution? (A or B)
Plot A taken from: meloun.upce.cz
Plot B taken from: bjo.bmj.com

Algebraic Expression:

A

A ruler company advertises that it supplies 12 inch rulers to Staples. In fact, the lengths of the rulers have a mean value of 12.15 inches with a standard deviation of 0.03 inches. What percentage of rulers are between 12.05 and 12.19 inches?

Multiple Choice:

	90.62%
	10.62%
	No, try again.
	50%
	Nope!
	80.62%

Scaffold:

First, find out your unknowns. Using the equation [(x-mean)/(standard deviation)]=Z, fill in what you know. You know the mean, the x-values and the standard deviation. What is the value of each z-score?

Multiple Choice:

	-2; 1
	No, try again
	-3.33; 1.33
	-1.25; 3
	-2.55; 1.65

Scaffold:

Now that you know the 2 z-scores, what are the 2 probability values?

Multiple Choice:

	0.9066; 0.0004
	0.0025; 0.0004
	0.9066; -0.0004
	0.1012; 0.1011

Hints:

Use the z-table to find the probability.

Scaffold:

Now that you know the 2 probability values, you know the area of the 2 values on the density curve. What is the difference?

Multiple Choice:

	50%
	75%
	80.12%
	90.62%

Hints:

Remember to multiply the probability values by 100 to get the percentage.

The weight of ballerinas is normally distributed with a mean 96 lbs. and a standard deviation of 2 lbs. What is the area of weight of ballerinas that lies to the left of 97.5 lbs?

Algebraic Expression:

	0.7734
	77.34%

In a certain normal distribution, 15.85% of the area lies to the left of 47 and 15.85% lies to the right of 53. What are the mean and standard deviation?

Multiple Choice:

	mean=50; sd=3
	mean=49; sd=2.5
	mean=47.4; sd=3
	mean=51; sd=.3

Scaffold:

Construct a density curve following normal distribution and 68-95-99.7 rule. Label z-scores accordingly. What is the percentage that does not lie between the 47 and 53?

Multiple Choice:

	60%
	Add up the percentages you know!
	31.7%
	30%
	No, add up the percentages you know.
	68%
	Nope, sorry.

Scaffold:

Now that you have 31.7%, what is the area that lies between the two number values, 47 and 53?

Multiple Choice:

	70%
	No
	40%
	Remember, subtract 31.7% from 100%.
	58%
	68%

Scaffold:

Now that you know the area between these values is 68%, you know that 47 has a z-score of -1 and 53 has a z-score of 1. What do you think the standard deviation is?
Hint: The standard deviation is a whole number.

Algebraic Expression:

3

Hints:

The standard deviation is a whole number.

Scaffold:

Now that you know the standard deviation, what is the mean?
Hint: you may use the equation Z=(x-mean)/(standard deviation).

Algebraic Expression:

50

If I am 3 standard deviations below the mean, what is my z-score?

Algebraic Expression:

	-3
	3
	No, the z-score is negative when the value is below the mean.

Scaffold:

Ok, let's put an example to this problem. Using the equation [(x-mean)/(standard deviation)]=Z, plug in these numbers: If the mean of a normal distribution of heights of men is 70 inches and the standard deviation is 3, find the z-score of 64 inches. What is the z-score?

Algebraic Expression:

-2

Scaffold:

So, because 64 is below the mean, 70, the z-score is negative. Type "correct" to move on.

Algebraic Expression:

correct

New England has an average temperature of 55 degrees in October with a standard deviation of 2 degrees. What temperature lies in the 87th percentile?

Multiple Choice:

	60.03
	Nope, sorry.
	52.76
	Uh-oh...remember definition of percentile!
	57.26
	56.35
	Not there yet!

Scaffold:

Because the percentile is 87, your area is 0.87. Find 0.87 on the z-table. What is the z-score?

Algebraic Expression:

1.13

Scaffold:

Nice job! Let's move on. Write the equation [(x-mean)/ (standard deviation)] with the variables you now know. What variable are you looking for?

Algebraic Expression:

X

Scaffold:

You're almost done! Use your algebra skills, solve for x!

Algebraic Expression:

57.26
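
The percentile scaffold can be retraced in R as a sketch. `qnorm()` replaces the z-table lookup, and the z-score is rounded to two decimals here only to mirror the table; the variable names are assumptions.

```r
mu    <- 55   # mean October temperature
sigma <- 2    # standard deviation

# 87th percentile: area of 0.87 to the left
z <- round(qnorm(0.87), 2)
z                            # 1.13

# Solve z = (x - mu) / sigma for x
x <- mu + z * sigma
x                            # 57.26
```

Without the rounding step, `qnorm(0.87, mean = 55, sd = 2)` gives a value slightly below 57.26, because the table only lists z to two decimal places.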

What is the term for taking original data values and converting them to standard deviation units?

Multiple Choice:

	simulation
	no, try again.
	standardizing
	normal distribution
	no, try again.
	statistical inference
	no, try again.

Hints:

Pay attention to what the question is asking you. Can you pick out a key vocabulary term?

What is the name of a standardized value?

Multiple Choice:

	mean
	no, try again.
	z-score
	average
	no, try again.
	normal distribution
	no, try again.

Hints:

The standardized value is the z-score.

What is the area under any given density curve?

Multiple Choice:

	1
	0
	no, try again.
	-1
	no, try again.
	.5
	no, try again.

Hints:

Density Curve: is a curve that is always on or above the horizontal axis and has an area of exactly 1 underneath it.

The mean of a density curve is the "balancing point": if the curve were made of solid material, it would balance at that point. The median is the "equal areas point", which divides the area under the curve in half.

In a normal distribution how does the mean relate to the median?

Multiple Choice:

	Equal to
	Less than
	no, try again.
	Greater than
	no, try again.
	Cannot be determined
	no, try again.

Hints:

The picture is not representative of the answer!

A normal distribution means that the density curve is symmetrical and thus the mean is equal to the median.

Where is 68% of the data located?

Multiple Choice:

	+/- 3 standard deviations
	no, try again.
	2nd quartile
	no, try again.
	+/- 1 standard deviation
	1st percentile
	no, try again.

Hints:

Remember the 68-95-99.7 rule.

Normal distributions do not necessarily have the same means and standard deviations. A normal distribution with a mean of 0 and a standard deviation of 1 is called a standard normal distribution.
Areas of the normal distribution are often represented by tables of the standard normal distribution. A portion of a table of the standard normal distribution is shown in Table 1.

Table 1. A portion of a table of the standard normal distribution.

Z	Area below Z
-2.50	0.0062
-2.49	0.0064
-2.48	0.0066
-2.47	0.0068
-2.46	0.0069
-2.45	0.0071
-2.44	0.0073
-2.43	0.0075
-2.42	0.0078
-2.41	0.0080
-2.40	0.0082
-2.39	0.0084
-2.38	0.0087
-2.37	0.0089
-2.36	0.0091
-2.35	0.0094
-2.34	0.0096
-2.33	0.0099
-2.32	0.0102

The first column titled "Z" contains values of the standard normal distribution; the second column contains the area below Z. Since the distribution has a mean of 0 and a standard deviation of 1, the Z column is equal to the number of standard deviations below (or above) the mean. For example, a Z of -2.5 represents a value 2.5 standard deviations below the mean. The area below Z is 0.0062.
Online Statistics: An Interactive Multimedia Course of Study, http://onlinestatbook.com/
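
If you have R available, a table like Table 1 can be rebuilt with pnorm, which returns the area below a given Z on the standard normal distribution. This is a sketch; the names z and table1 are our own.

```r
# Z values from -2.50 up to -2.32 in steps of 0.01
z <- seq(-2.50, -2.32, by = 0.01)

# pnorm(z) is the area below z; round to 4 places like the printed table
table1 <- data.frame(Z = z, Area_below_Z = round(pnorm(z), 4))
table1
```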

Do you understand what the value of Z stands for?

Multiple Choice:

	Yes
	No

A value from any normal distribution can be transformed into its corresponding value on a standard normal distribution using the following formula:
Z = (X - μ)/σ
where Z is the value on the standard normal distribution, X is the value on the original distribution, μ is the mean of the original distribution and σ is the standard deviation of the original distribution.
As a simple application, what portion of a normal distribution with a mean of 50 and a standard deviation of 10 is below 26? Applying the formula, we obtain
Z = (26 - 50)/10 = -2.4.
From Table 1, we can see that 0.0082 of the distribution is below -2.4. This value represents the area of the portion that is below. It also means that 0.82% of the data lies below the value of 26 since the total area of a normal distribution is always equal to 1.
To find the portion of a normal distribution above a value, use the formula to find the portion below that value and subtract it from 1, the total area under the curve.
What portion of a normal distribution with a mean of 50 and a standard deviation of 10 is above 26? Applying the formula we obtain
Z = (26 - 50)/10 = -2.4.
From Table 1, we can see that 0.0082 of the distribution is below -2.4 but since this value is for the portion below 26 and we're trying to find the portion above we would have to subtract this value from one.
1 - 0.0082 = 0.9918
This means that 99.18% of the data lies above the value of 26.
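
The two calculations above can be reproduced in R as a quick check. This is a sketch with variable names of our own; pnorm gives the area below a Z value, so no printed table is needed.

```r
mu <- 50      # mean of the original distribution
sigma <- 10   # standard deviation of the original distribution
x <- 26       # value of interest

z <- (x - mu) / sigma   # -2.4
below <- pnorm(z)       # area below z, about 0.0082
above <- 1 - below      # area above z, about 0.9918

round(c(z = z, below = below, above = above), 4)
```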

Do you understand how to find the area of a normal distribution?

Multiple Choice:

	Yes
	No

A standard normal distribution has:

Multiple Choice:

	a mean of 1 and a standard deviation of 1
	a mean of 0 and a standard deviation of 1
	a mean larger than its standard deviation
	all scores within one standard deviation of the mean

A number 2.5 standard deviations above the mean has a Z score of:

Multiple Choice:

	2.5
	-2.5
	5
	Not enough information

A normal distribution has a mean of 110 and a standard deviation of 20. What percent of the data lie between 85 and 130?

Multiple Choice:

	84%
	11%
	74%
	95%

Hints:

Start by finding the Z-scores of both 85 and 130.

Use the Z-table to find the area below each Z-score.

Subtract the area below 85 from the area below 130 to obtain the decimal that represents the portion of the data that lies between 85 and 130.
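
The same three steps can be sketched in R. The variable names are our own, and pnorm stands in for the Z-table lookup.

```r
mu <- 110
sigma <- 20

z_low  <- (85 - mu) / sigma    # -1.25
z_high <- (130 - mu) / sigma   #  1.00

# Area between = area below the larger z minus area below the smaller z
between <- pnorm(z_high) - pnorm(z_low)
round(100 * between)           # percent, to the nearest whole number
```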

A normal distribution has a mean of 50 and a standard deviation of 3. What portion of the data is over 53?

Multiple Choice:

	.8413
	.1587
	1

	0

Hints:

Start by entering all your known data into the z-score formula.

Once you find the Z-score look at the Z table and find what portion of the data it represents.

Remember that we are trying to find the portion of the data above 53 and not below it. To find this, look up the area below your Z-score in the Z-table and subtract it from 1. Since 1 is the total area of the distribution, subtracting the area below 53 from 1 leaves only the area above 53.

In a normal distribution with a standard deviation of 2.5, 60% of the data lies to the left of 58. What is the mean?

Multiple Choice:

	57.375
	-57.375
	92
	-92

Hints:

Start by converting the 60% value into a Z-score. To do this, look in the body of the Z-table for the area that most closely corresponds to .6. The Z-score for that area is .25.

Plug all the known variables into the z-score formula and work backwards to find the mean.

Multiply 2.5 by .25 to get rid of the fraction.

Subtract 58 from both sides to isolate the mean.

Since the mean is negative divide both sides by -1 to make it positive.
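
Working backwards for the mean can be sketched in R as follows. The variable names are our own. The hand calculation uses the table value z = 0.25, while qnorm returns the unrounded Z-score, so the two means differ slightly.

```r
sigma <- 2.5
x <- 58

# By hand: z = (x - mu)/sigma rearranges to mu = x - z * sigma
z_table <- 0.25
mu_table <- x - z_table * sigma
mu_table

# With the unrounded z-score for an area of 0.60
z_exact <- qnorm(0.60)
mu_exact <- x - z_exact * sigma
mu_exact
```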

In a normal distribution with a mean of 62, 30% of the data lies below 54. What is the standard deviation?

Multiple Choice:

	15.09
	-15.09
	-26.66
	26.66

Hints:

First, start by converting 30% to a Z-score. Find the decimal that is closest to .3 (30%) on the Z-table.

Plug in all the known variables into the Z-score formula.

Work backwards by subtracting the mean from the number first.

Multiply the standard deviation on both sides to get rid of the fraction.

Divide by the Z-score.
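
Here is a sketch of the same rearrangement in R, with variable names of our own. We assume the table value z = -0.53 here because it reproduces the listed answer of 15.09; the unrounded Z-score from qnorm is closer to -0.524 and gives a slightly larger standard deviation.

```r
mu <- 62
x <- 54

# By hand: z = (x - mu)/sigma rearranges to sigma = (x - mu)/z
z_table <- -0.53                  # assumed table value for an area near 0.30
sigma_table <- (x - mu) / z_table
round(sigma_table, 2)

# With the unrounded z-score for an area of 0.30
sigma_exact <- (x - mu) / qnorm(0.30)
round(sigma_exact, 2)
```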

A normal distribution with a mean of 75 has 45% of its data above 82. What is the standard deviation?

Multiple Choice:

	53.85
	.13
	43.75
	7

Hints:

Since 45% of the data lies above 82, subtract .45 from 1 to get the area below 82 (.55). Find the value closest to .55 in the body of the Z-table and read off the corresponding Z-score.

Now let's try a word problem:

A pizza shop makes pizzas with an average diameter of 13 inches. Occasionally the workers make pizzas that are smaller than that size; the diameters have a standard deviation of 0.2 inches. Pizzas that are under 12.75 inches must be tossed out. What percent of the pizzas made must be thrown out?

Multiple Choice:

	89%
	11%
	27%
	36%

On a recent midterm, 200 students scored a mean of 72 with a standard deviation of 6. Marisa scored higher than 68% of her peers. Her teacher said he would give her 5 points of extra credit on the exam if she could tell him her score using only her percentile, the mean, and the standard deviation. What is her score? Always round to the nearest hundredth if necessary.

Algebraic Expression:

	74.82
	74.82%

Scaffold:

Remember we need to use the formula Z = (x - μ)/σ, where Z is the number of standard deviations, x is the number we are dealing with, μ is the mean and σ is the value of one standard deviation.
First we need to find Z. We are given that Marisa is in the 68th percentile. Use that to find Z.

Algebraic Expression:

0.47

Scaffold:

Now that we have Z we can go back to the formula Z=(x - μ)/σ. Plug in the formula and find Marisa's midterm score.

Algebraic Expression:

	74.82
	74.82%

Marisa was able to find her score on the test, so her teacher gave her the 5 points of extra credit. If Marisa was the only student able to find her score in this manner, and the standard deviation is unaffected while the mean shifts to 72.05, what would her new percentile be if the test was out of 60 points? Always round to the nearest hundredth if necessary.

Algebraic Expression:

	96.78
	96.78%

Scaffold:

So first we need to use the information given to find Marisa's new score. Her original score, the answer from the first question, was 74.82%, and the test was out of 60 points. Knowing this, how many points did Marisa score out of 60?

Algebraic Expression:

	44.89
	44.89%
	44.892
	round to the nearest hundredth

Hints:

To find what 74.82% of 60 is, multiply 60 by .7482

Scaffold:

So Marisa's original score was a 44.89 out of 60. With the extra credit, what is Marisa's new percent score?

Algebraic Expression:

	83.15
	83.15%

Hints:

First add the 5 points to the 44.89 points.

After you know what the new score out of 60 is, divide by 60 to find the percent score.

Scaffold:

Now we have all the information we need to use the Z formula to find Marisa's new percentile. We know that x is Marisa's new score, 83.15, μ is now 72.05, and the σ remains 6. What is Marisa's new percentile?

Algebraic Expression:

	96.78
	96.78%
	1.85
	That's the correct number of standard deviations over, now use that to find the percentile with the Z-tables.

The baseball team x has a mean batting average of .261 and a standard deviation of .05. If player y has a batting average of .298, which percentile is he in?

Algebraic Expression:

	77.04%
	.7704
	Remember you're finding the percentile, not the area. The answer should be in the form of a percentage.
	0.74
	You've found Z, use that to find the area on the Z-tables.
	77.04

Scaffold:

To solve this problem, use the formula Z = (x - μ)/σ
In Z = (x - μ)/σ, each of the variables stands for something in the problem. The problem was:
The baseball team x has a mean batting average of .261 and a standard deviation of .05. If player y has a batting average of .298, which percentile is he in?
x represents the piece of data we are currently dealing with, which in this case is .298
μ is the mean of the data, which here is .261
σ is the standard deviation, which is .05
What did you find for Z?

Algebraic Expression:

0.74

Scaffold:

Z is the number of standard deviations the piece of data we are dealing with (x) is away from the mean. In this case, .298 is 0.74 standard deviations above the mean, .261. With Z, use the z-tables to find the area, then put the answer in the form of a percentile to find the answer.

Algebraic Expression:

	77.04%
	.7704
	put the answer in the form of a percentile
	77.04

Shaniqua buys a cow to feed her 12 children at the country fair for $978 and then goes to get it appraised by a farmer/statistician. He tells her that the mean price for cows is $500 and the standard deviation is $173.
How many standard deviations did Shaniqua pay above the mean? Round to the nearest one-hundredth.

Multiple Choice:

	2.76
	2.56
	No, sorry. Solve for the standardized value of 978.
	1.33
	No, sorry. Solve for the standardized value of 978.
	3.00
	No, sorry. Solve for the standardized value of 978.

Shaniqua's husband Alan comes home and is enraged that she paid so much for a cow to feed children that aren't his, and has her go back to return the cow. After Shaniqua returns the cow, another farmer offers to sell his cow at a price 1.75 standard deviations below the mean.
How much is the farmer selling his cow for?

Multiple Choice:

	197.25
	498.25
	No, sorry. Try using the standardized value equation to find the answer.
	200
	No, sorry. Use the standardized value equation to find the answer.
	156.68
	No, sorry. Use the standardized value equation to find the answer.

Alan makes buns for a living. The mean weight of all his buns is .5lbs and the standard deviation is .08lbs. What percentage of his buns will weigh less than .36lbs?

Multiple Choice:

	4.01%
	10.26%
	No, sorry. Try finding the standardized value of .36 first.
	-1.75
	No, sorry. You found the correct z-score, try finding the probability that it corresponds with.
	5%
	No, sorry. Try finding the standardized value of .36 first.

If 70% of Alan's buns have to weigh less than .60lbs, what must be the new standard deviation?
Round to the nearest one-hundredth.

Multiple Choice:

	.19
	.07
	No, sorry
	.22
	No, sorry.
	.04
	No, sorry.

Alan is working in a factory that makes toothpaste for Colgate (extra whitening). He must throw away tubes that have toothpaste under 51 grams and tubes that exceed 54 grams. Luckily, the mean amount of toothpaste in the tubes is 52.5 grams and the standard deviation is .8 grams. Find the z-score for the amount of toothpaste under 53 grams.

Multiple Choice:

	.625
	.800
	.498
	.375

Referring to the previous question, what is the probability that the amount of toothpaste in Colgate tubes would be less than 54.7 grams?

Multiple Choice:

	.9970
	.6250
	.0030
	.1475

Referring to the toothpaste problem again, how many grams of toothpaste correspond to the 30th percentile?

Multiple Choice:

	52.1
	55.4
	51.5
	53.4

Referring back to the toothpaste problem, what is the probability that Alan will have to throw away a tube of toothpaste because it doesn't fall within guidelines?

Multiple Choice:

	6.02%
	96.99%
	No, this is the percentage of tubes with toothpaste under 54 grams
	3.01%
	No, this is the percentage of tubes in only one tail (over 54 grams, or under 51 grams). You need both.
	56.92%

Alan owns a bakery and makes 100 juicy buns every day. He claims to use 10 ounces of sugar for each bun, but in reality the amount of sugar on a random bun is normally distributed with a mean of 12 oz and a standard deviation of 0.5 oz.

What is the z-score of a juicy bun that has 11 oz of sugar?

Algebraic Expression:

	-2
	2
	No, sorry
	-1.87
	No, watch your rounding.

What percent of buns have less than 11 ounces of sugar?

Multiple Choice:

	2.28%
	2%
	No, this is the z-score.
	2.05%
	No, sorry.
	1.01%
	No, sorry.

One afternoon, an old lady comes into the bakery to complain that her bun was too sweet. The next day, a young boy says that his bun is not sweet enough. Alan decides to throw away the 5 buns with the most sugar and the 5 buns with the least sugar every day. The buns that are left will contain between how many ounces of sugar?

Multiple Choice:

	11.18 and 12.82
	5 and 95
	No, these are the percentiles.
	-1.645 and 1.645
	No, these are the z-scores.
	10.5 and 12.5
	No, sorry.

Scaffold:

Find the z-score of the buns with the least amount of sugar. Remember that this is the lowest 5%.

Algebraic Expression:

	-1.65
	-1.64
	-1.645
	0.05
	No, this is the percentage.

Scaffold:

After finding that the z-score is around -1.645, find the corresponding x-value of ounces of sugar. Round to the nearest hundredth.

Algebraic Expression:

	11.18
	11.17
	No, remember to round.
	11
	No, remember to round.
	11.95
	No, sorry.

Scaffold:

Find the z-score of the buns with the most amount of sugar. This is the top 5%

Algebraic Expression:

	0.95
	No, this is the percentage.
	1.645
	1.65
	1.64

Scaffold:

After finding that the z-score is around 1.645, find the corresponding x-value in ounces of sugar. Round to the nearest hundredth.

Algebraic Expression:

	12.82
	12.8
	No, you must round to the nearest hundredth.
	13
	No, sorry.

Scaffold:

From the previous answers, between how many ounces will the buns that are kept be?

Multiple Choice:

	11.18 and 12.82
	-1.645 and 1.645
	No, these are the z-scores.
	5 and 95
	No, these are the percentiles.
	10 and 12
	No, sorry.
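
Both cutoffs can be found in one step in R. This is a sketch with variable names of our own; qnorm uses the unrounded Z-scores (about -1.645 and 1.645), and it lands on the same rounded answers.

```r
mu <- 12      # mean ounces of sugar
sigma <- 0.5  # standard deviation

# 5 of 100 buns is 5% in each tail, so the buns kept lie
# between the 5th and 95th percentiles
cutoffs <- qnorm(c(0.05, 0.95), mean = mu, sd = sigma)
round(cutoffs, 2)
```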

What percent of the buns will have exactly 11 ounces of sugar?

Multiple Choice:

	0
	unknown
	No, sorry.
	-2
	Sorry, this is the z-score.

Referring to the image above, which curve has a larger standard deviation?

Multiple Choice:

	Red curve
	Blue curve
	No, sorry.
	Both are the same.
	No, sorry.

What percentage of the area of the normal distribution curve falls within 1 standard deviation from the mean?

Multiple Choice:

	99.7
	No, sorry.
	95
	No, sorry.
	68.5
	No, but very close.
	68

When 95% of the area under the normal distribution curve is represented, within how many standard deviation(s) does it fall from the mean?

Multiple Choice:

	2
	1
	No, sorry.
	3
	No, sorry.
	2.5
	No, sorry.

According to the Empirical Rule, what percentage of the area under the normal distribution curve lies between 1 standard deviation below the mean and the mean?

Multiple Choice:

	34
	68
	No, this is the area between -1 and 1 standard deviations.
	95
	No, this is the area between -2 and 2 standard deviations.
	47.5
	No, this is the area between 2 standard deviations below the mean and the mean.

Alan buys a bar of soap (Dove creamy & silky smooth) from the convenience store. He showers on random days and he records the weight of the soap each day that he uses it.

Day	Weight of soap (g)

Rex Boggs, "Bar of Soap", Glenmore State High School, accessed 26 Oct 2010, www.statsci.org/data/oz/soap.html

Find the mean of the weight (g) and round to the nearest hundredth.

Algebraic Expression:

62.93

Find the standard deviation of the previous data and round to the nearest hundredth.

Algebraic Expression:

41.24

Alan comes home late one night from the clüb and is feeling really filthy. If Alan wants his bar of soap to weigh at least 118 grams when he showers, what is the probability of that happening?

Multiple Choice:

	90.82
	No, subtract this from 1.
	1.33
	No, this is the z-score.
	80.92
	No, sorry.
	9.18

When creating a graph in terms of normal distribution, the result is impacted by which two factors?

Multiple Choice:

	mean and standard deviation
	dimension and mode
	positive and negative coefficients

What determines the shape of a graph of normal distribution?

Multiple Choice:

	standard deviation
	mean

Scaffold:

http://www.stattucino.com/berrie/dsl/index.html

Go to this link and type in mean=3 and standard deviation=1. Which one affects the shape?

Multiple Choice:

	standard deviation
	mean

In an ordered pair referring to the density curve of a normal distribution graph, (10,3) the 10 refers to __ and the 3 refers to __

Multiple Choice:

	mean, standard deviation
	standard deviation, mean
	x,y
	y,x

Hints:

x,y is wrong because we are referring to ordered pairs in terms of normal distribution.

The empirical rule states that about what percentage of the area of the density curve falls within 1 standard deviation of the mean?

Algebraic Expression:

68%

Hints:

According to the empirical rule:

About ???% of the area under the curve falls within 1 standard deviation of the mean.
About 95% of the area under the curve falls within 2 standard deviations of the mean.
About 99.7% of the area under the curve falls within 3 standard deviations of the mean.

The answer is 68%
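
The three percentages in the empirical rule are rounded. As a sketch, the values behind them can be computed in R with pnorm (the names k and within are our own):

```r
k <- 1:3                          # 1, 2 and 3 standard deviations

# Area within k standard deviations of the mean on the standard normal curve
within <- pnorm(k) - pnorm(-k)
round(100 * within, 1)            # compare with 68, 95 and 99.7
```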

What is the probability that Z is less than -1.5?

Algebraic Expression:

.0668

Hints:

Refer to your Z-table.

In a standard normal distribution, find the probability that:
P(Z<-1.4)

Multiple Choice:

	0.0808
	0.0668
	0.0968
	0.0793

Hints:

Look at the Z-Table

In a standard normal distribution, find the probability that:
P(z=-1.4)

Algebraic Expression:

0

Scaffold:

Is there any area in a straight line?

Multiple Choice:

	Yes
	No

Scaffold:

If there is no area, what is your answer?

Algebraic Expression:

0

In a standard normal distribution, find the probability that:
P(-1.4<Z<0.6)
(to the nearest ten-thousandth)

Algebraic Expression:

0.6449

Scaffold:

Find the probabilities for both -1.4 and 0.6

The answer is in an ordered pair --> [P(-1.4), P(0.6)]

Multiple Choice:

	[0.0808, 0.7257]
	[0.0010, 0.0025]
	[0.0035,0.0655]
	[0.1357, 0.1492]

Scaffold:

Subtract the probability that you find for -1.4 from the probability that you find for 0.6.

Algebraic Expression:

0.6449

In a normal distribution, 2.5% of the area lies to the left of 51 and 2.5% lies to the right of 57.
Find the mean.

Algebraic Expression:

54

Refer to the previous problem in which 2.5% was to the left of 51, and 2.5% to the right of 57, and we found that the mean=54
What is the standard deviation?

Algebraic Expression:

1.5

Scaffold:

What would be the z-score when the probability is 2.5%?

Algebraic Expression:

-1.96

Scaffold:

So if z = -1.96 and the mean = 54, you can use the value 51 (which pairs with z = -1.96) or 57 (which pairs with z = 1.96). What is the standard deviation?
Use this equation: z = (predicted - mean)/standard deviation

(To the hundredths place)

Algebraic Expression:

1.53

A supermarket advertises that the weight of turkeys is 5 lbs. The weight of turkeys randomly weighed is normally distributed with a mean value of 4.5 lbs and a standard deviation of 0.2 lbs. What is the probability that a randomly selected turkey weighs more than 5 lbs?
(to the nearest ten-thousandth)

Algebraic Expression:

0.0062

Hints:

To find the area, first plug all the given information into the equation.
predicted = 5, mean = 4.5, standard deviation = 0.2
Equation: z = (predicted - mean)/standard deviation

The z-score = 2.5
Look up the z-score in the Z-table to find the area to its left.

Subtract the area from 1.

The area to the right of z = 2.5 is 0.0062
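
These hints translate into a few lines of R. This is a sketch with variable names of our own; setting lower.tail = FALSE asks pnorm for the area to the right, which is the same as subtracting the left-hand area from 1.

```r
mu <- 4.5
sigma <- 0.2

z <- (5 - mu) / sigma                    # 2.5
p_over <- pnorm(z, lower.tail = FALSE)   # area to the right of z
round(p_over, 4)
```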

Refer back to the supermarket question where the mean of turkeys sold is 4.5 and the standard deviation is 0.2.
Without calculating, would the probability that a randomly selected turkey weighs more than 5 lbs be higher or lower if the standard deviation were instead 0.3?

Multiple Choice:

	higher
	lower

Refer back to supermarket problem again.
What is the probability that a randomly selected turkey would weigh less than 4.3?

Algebraic Expression:

	0.1587
	.1587

Scaffold:

Find the z-score using the equation z = (predicted - mean)/standard deviation

Algebraic Expression:

-1

The distribution of heights of adult American men is approximately Normal with mean 69 inches and standard deviation 2.5 inches. Use the empirical rule to answer the following question:
What percent of men are taller than 74 inches?

(Citation: The Practice of Statistics. Yates Moore and Starnes, page 137.)

Multiple Choice:

	2.5%
	5%
	1.5%
	3.2%

Refer back to the problem about men's heights. Between what heights do the middle 95% of men fall?
(In inches)

Multiple Choice:

	64-74
	63-73
	70-75
	50-63

If x is an observation from a normal distribution that has a mean value and a standard deviation value, the standardized value of x is z=(x-mean)/standard deviation. A standardized value is often called a z-score.

The standard normal distribution is the normal distribution N(0,1) with mean of 0 and standard deviation of 1.

Table A is a table of areas under the standard normal curve. The table entry for each value z is the area under the curve to the left of z.

Do not make the common mistake of looking up a z-value in Table A and then reporting the entry corresponding to that z-value without knowing if the problem asks for the area to the left or right of that z-value. To make sure you do not fall for this common mistake make sure you always sketch the standard normal curve, mark the z-value, and shade the area you are looking for.

Sometimes we want to find the observed value with a given proportion of the observations above or below it. To do this, use Table A backward. Look in the body of the table to find the given proportion, and read the corresponding z value from the left column and top row. Then plug in z, mean, and the standard deviation into the equation to solve for x.
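
Using Table A backward can be sketched in R with qnorm, which turns an area into a z value. The helper name value_at is hypothetical (it is not a built-in R function), and the example reuses the men's heights distribution N(69, 2.5) from the earlier problem.

```r
# Hypothetical helper: the value x with a given proportion of observations below it
value_at <- function(p_below, mean, sd) {
  z <- qnorm(p_below)   # "Table A backward": area -> z
  mean + z * sd         # solve z = (x - mean)/sd for x
}

# Height with 90% of men below it
value_at(0.90, mean = 69, sd = 2.5)

# For a proportion ABOVE x, convert to the proportion below first:
# 10% above is the same as 90% below
value_at(1 - 0.10, mean = 69, sd = 2.5)
```

Sketching the curve first, as recommended above, is still the best way to decide whether the proportion you were given is the area to the left or to the right.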

Now let's begin:
Your class has a test and you score in the top 20th percentile. If the class scores are normally distributed and have a mean of 75 and a standard deviation of 5, what was your score?

Multiple Choice:

	79.2
	70.8
	the top 20th percentile is equivalent to 80%
	90.5
	80

Hints:

Look up the top 20th percentile, which is 80%, on the z-table.
When you find the z-score of 80%, set that number equal to:
z=(x-mean)/ standard deviation

Gabe measured the heights of all the kids in his AP Stats class. The heights are normally distributed with a mean of 66 inches and a standard deviation of 2 inches. What fraction of the students are taller than 71 inches? Round to the nearest ten-thousandth.

Algebraic Expression:

	0.9938
	you need to subtract this from 1.0
	.0062

Hints:

You are given the x, the mean, and the standard deviation.

The answer is .0062

There are 2400 students at a school. Every student takes a test and the scores are normally distributed with a mean score of 80 and a standard deviation of 8. What number of students received a grade above 95? Round to the nearest whole number.

Algebraic Expression:

	.0301
	you need to multiply this by the number of students
	.9699
	you need to subtract this from 1.0 then multiply it by the number of students
	72

Scaffold:

let's start by finding the percentage of students who scored above 95.
remember the formula z=(x-mean)/standard deviation
z=(95-80)/8
what is the z score?

Exact Match (case sensitive):

1.88

Scaffold:

now look this up in the z score table
what is the z table value of 1.88?
round to the nearest ten-thousandth

Exact Match (case sensitive):

.9699

Scaffold:

Now that you have the Z-table value for the z-score, which is 0.9699, you need to subtract this number from 1.0, because we are looking for the area above the z-score, not the area below it. Now do 1.000-0.9699. What is the answer? This is the proportion of students that scored above a 95.

Multiple Choice:

	.0301
	-.0301

Scaffold:

now that you know the percentage of students that scored above a 95 on the test (0.0301), multiply the percentage by the total number of students and you will find out exactly how many students scored above a 95 on the test. round to the nearest whole number.

Multiple Choice:

	72
	73
	round down not up

Suppose the weight of cookies is normally distributed with a mean of 17 ounces and a standard deviation of .5 ounces. The company wants to keep the mean at 17 ounces but adjust the standard deviation so that only 3% of the cookies weigh less than 15 ounces. What does the new standard deviation need to be?

Multiple Choice:

	1.06
	3.85
	3% is .03 not .3 when looking at z table
	1.11
	You need to look for a value closer to .03 in your z table

Hints:

A random variable X has the following distribution N(10, 5) and is normally distributed.
Find:
Probability that X>15

round to the nearest thousandth

Exact Match (case sensitive):

	.159
	.841
	you need to subtract this from 1 because x>15 not less than
	.1587
	you need to round to the nearest thousandth

Hints:

Remember this is a greater than problem.

The answer is .159

Referring to the previous question, what is the probability that X = 12?

Algebraic Expression:

	0
	.6554
	there is no area value when x is = to a number, it must be < or >
	.3446
	there is no area value when x is = to a number, it must be < or >

Hints:

What is always the probability when X equals a number?

The answer is 0

PLOT A PLOT B PLOT C

Which of these plots is normally distributed?

Multiple Choice:

	Plot A
	Plot B
	S-shaped graphs are not normally distributed
	Plot C
	normally distributed plots should be linear

A sub shop advertises that they put .6 pounds of meat in their subs. When a group of subs were randomly selected, the amount of meat is normally distributed with a mean of .5 pounds and a standard deviation of .02 pounds. What percentage of subs have between .49 pounds and .55 pounds of meat? Round to nearest whole percent.

Exact Match (case sensitive):

	69%
	68.53%
	Round to nearest whole percent
	.69
	When you found the area for greater than .49 and less than .55 you forgot you had a decimal, not a percent

Hints:

Find z when x equals .49 and when x equals .55. Then find the area that corresponds with these 2 z values. Then subtract the smaller decimal from the larger decimal to find your answer.

The answer is 69%

In a normal distribution, find the mean when the standard deviation is 6 and 3.5% of the area lies to the left of 90.

Multiple Choice:

	100.86
	92.28
	Look up .035 not .35 in the body of Table A
	-100.86
	-92.28

Hints:

Work backward from Table A. Look in the body of Table A to find your Z value.

If I am 2 standard deviations above the mean, what is my z-score?

Exact Match (case sensitive):

2

Hints:

The definition of the z-score is how many standard deviations you are away from the mean.

Suppose a group of students take a test and the scores are normally distributed with a mean equal to 85 and a variance of 144. What percentage of the students score better than a 90? Round to nearest whole percent.

Multiple Choice:

	34%
	66%
	Since it is greater than 90 you need to subtract this from 1.

Hints:

Standard deviation equals the square root of the variance.
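
As a sketch of how this hint is used, the R lines below first convert the variance to a standard deviation and then find the area above 90. The variable names are our own.

```r
mu <- 85
sigma <- sqrt(144)     # variance 144 gives a standard deviation of 12

# Area above 90 = 1 - area below 90
p_above <- 1 - pnorm(90, mean = mu, sd = sigma)
round(100 * p_above)   # percent, to the nearest whole number
```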

The weights of watermelons are normally distributed with a mean of 4 pounds and a standard deviation equal to .03 pounds. What percentage of watermelons will weigh more than 3.97 pounds?

Multiple Choice:

	84%
	16%
	This is a greater than problem not a less than problem
	15%
	1%

If the variable X has the distribution N(150,25) and is normally distributed,
find the probability that X < 130.
Round to the nearest ten-thousandth.

Algebraic Expression:

	.2119
	.7881
	the answer is less than not greater than

The weights of members of a football team are normally distributed with a mean of 250 pounds and a standard deviation of 10 pounds. What percentage of players weigh less than 235 pounds?

Multiple Choice:

	7%
	93%
	This is a less than problem not a greater than so you do not subtract from 1.
	9%
	25%

If Gabe goes hiking in the woods every day and the lengths of his hikes are normally distributed with a mean of 120 minutes (2 hours) and a standard deviation of 30 minutes, what proportion of his nature hikes take more than 3 hours (180 minutes)? Round to the nearest ten-thousandth.

Multiple Choice:

	.9772
	this is a greater than problem not less than
	.0228
	.8754
	check your math and do the formula correctly
	.00022
	check your math and do the formula properly

The 68-95-99.7 Rule
68% of observations fall within 1 standard deviation of the mean
95% of observations fall within 2 standard deviations of the mean
99.7% of observations fall within 3 standard deviations of the mean
If a random variable X is normally distributed with a distribution of N(25, 5),
what percent of observations lie below the number 35?

Multiple Choice:

	68%
	95%
	99.7%
	97.5%

An ice cream company advertises that it puts 0.2 lb of real chocolate chips in its ice cream. In fact, the amount of chocolate chips on a sample of randomly selected chocolate chip ice cream has a mean value of 0.25 lb and a standard deviation of 0.02 lb. What percentage of ice cream has between 0.19 and 0.26 lbs of chocolate chips?

Algebraic Expression:

0.6902

Scaffold:

What variables do you have?

Multiple Choice:

	X, Z, the mean
	We are solving for Z.
	Z, the mean, the standard deviation
	We are solving for Z.
	the mean, X, the standard deviation
	Z, X, the standard deviation
	We are solving for Z.

Scaffold:

What is Z if X=0.19 ? Use the equation Z=(X-mean)/standard deviation.

Multiple Choice:

	0.25
	This is the mean.
	0.02
	This is the standard deviation.
	-3
	0.5
	This is Z if X=0.26

Scaffold:

Using the Z-chart, find the probability if X=0.19.

Algebraic Expression:

0.0013

Scaffold:

What is Z if X=0.26 ? Use the equation Z=(X-mean)/standard deviation.

Multiple Choice:

	0.5
	-0.5
	Make sure you plug in the numbers correctly.
	0.7
	-0.6

Scaffold:

Using the Z-chart, find the probability if X=0.26.

Algebraic Expression:

	0.6915
	0.3085
	Z=0.5

Scaffold:

What is the area between 0.19 lbs and 0.26 lbs?

Algebraic Expression:

0.6902

Hints:

Subtract 0.0013 from 0.6915.

A random variable X has the following distribution: N(40, 7). Find P(X>30).

Multiple Choice:

	0.0764
	This is the area to the left of the graph. We are looking for the area to the right.
	0.0778
	You must round the z-value to the nearest hundredth. But this is the area to the left of the graph. We are looking for the area to the right.
	0.9236
	0.9222
	Round the Z-value to nearest hundredth.

Scaffold:

Find Z by plugging in the variables into the equation Z=(X-mean)/Standard deviation. Round to the nearest hundredth.

Algebraic Expression:

	-1.42
	Round to the nearest hundredth.
	-1.43

Scaffold:

Find the probability by using the Z-chart.

Algebraic Expression:

	0.0764
	This is the area to the left of 30, we are looking for the area to the right of 30. Subtract this value from 1.
	0.0778
	Round the z-value to the nearest hundredth. Then subtract the probability from 1.
	0.9236
	0.9222
	Round the z-value to the nearest hundredth.
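
To see the two steps side by side in R, here is a short sketch. It assumes N(40, 7) means a mean of 40 and a standard deviation of 7, which is the reading the answer choices use; the variable names are illustrative.

```r
# Z = (X - mean) / standard deviation, rounded as you would for the Z-chart
z <- (30 - 40) / 7
z_rounded <- round(z, 2)

# pnorm() gives the area to the LEFT of the Z-value
p_left <- pnorm(z_rounded)

# P(X > 30) is the area to the RIGHT, so subtract from 1
p_right <- 1 - p_left

print(z_rounded)
print(round(c(p_left, p_right), 4))
```

Rounding z before the lookup mirrors the chart method. Calling pnorm() on the unrounded z gives a slightly different fourth decimal place.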

Michael takes AP Statistics. During the first term he has taken 6 exams and scored an average of 85% with a standard deviation of 3%. On how many of his exams has he scored at least 88%?

Algebraic Expression:

1

Scaffold:

Using the equation z=(x-mean)/standard deviation, what is your z-score?

Multiple Choice:

	1
	3
	50

Scaffold:

Knowing your z-value is 1, what is the proportion in the z-table?

Algebraic Expression:

.8413

Hints:

Find 1.0 along the first column in the z-table. The proportion will be .8413

Scaffold:

On how many of Michael's exams has he scored at least 88%?

Algebraic Expression:

1

Hints:

The z-table proportion (.8413) is the area to the left of 88%, so subtract it from 1 to get the proportion of exams at or above 88% (.1587). Multiply this by the number of tests taken (6) to get the number of tests on which Michael scored at least 88%. Round your answer to the nearest whole number.

Suppose that the duration of a particular type of criminal trial is known to be normally distributed with a mean of 21 days and a standard deviation of 7 days. 60% of all of these types of trials are completed within how many days? Round to the nearest day.

Algebraic Expression:

	23
	22.75
	Round to the nearest whole day.

Scaffold:

Using 60% in decimal form, find the Z-value in the Z-chart.

Multiple Choice:

	0.26
	0.5987 is closer to 0.6000 than 0.6026.
	0.25
	-0.25
	-0.26

Scaffold:

Using the equation Z=(X-mean)/standard deviation, find X. Round to the nearest whole.

Algebraic Expression:

	23
	22.75
	Round to the nearest whole.
	22.82
	Round to the nearest whole.
	22.77
	Round to the nearest whole.
	22
	Round up, not down.
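
Working backwards from a percentage to a value is what qnorm() does in R. The sketch below assumes the normal model stated in the problem; the variable names are illustrative.

```r
# Values from the problem statement (names are illustrative)
trial_mean <- 21
trial_sd <- 7

# Z-value with 60% of the area to its left (the reverse Z-chart lookup)
z_60 <- qnorm(0.60)

# Solve Z = (X - mean) / standard deviation for X
days <- trial_mean + z_60 * trial_sd

print(round(z_60, 2))
print(round(days))
```

Because qnorm() does not round Z to the nearest chart entry, the unrounded number of days is close to, but not exactly, the 22.75 obtained from the chart value of 0.25. Both round to the same whole day.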

At the 2010 Massachusetts Division 1 State Championship meet for Track and Field 25 girls competed in the 200 meter dash. With an average time of 27.12 seconds and a standard deviation of 0.2 seconds, what percent of the girls ran with a time of 26.8 seconds or faster?

Algebraic Expression:

5.48%

Hints:

Remember when looking for the girls that ran 26.8 seconds or faster, you want to find the area to the left of 26.8

Scaffold:

Using the formula z=(x-mean)/standard deviation, what is your z-value?

Algebraic Expression:

-1.6

Hints:

z=(26.8-27.12)/0.2

Scaffold:

Knowing your z-value is -1.6, what is the proportion you find in the z-table that corresponds to -1.6?

Algebraic Expression:

.0548

Scaffold:

What percent of the girls ran with a time of 26.8 seconds or faster?

Algebraic Expression:

5.48%

Hints:

Remember, to find the percent you multiply your proportion by 100.
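
The same lookup can be done in R without a z-table. This is a sketch that assumes the times are normally distributed; pnorm() accepts the mean and standard deviation directly, so the z-value does not have to be computed first. The variable names are illustrative.

```r
# Values from the problem statement (names are illustrative)
time_mean <- 27.12
time_sd <- 0.2

# "26.8 seconds or faster" means times at or below 26.8: the area to the left
z <- (26.8 - time_mean) / time_sd
p_faster <- pnorm(26.8, mean = time_mean, sd = time_sd)

print(round(z, 2))
print(round(100 * p_faster, 2))   # as a percent
```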

Smith's Store sells 600 Christmas trees during the month of December. With a mean price of $50 and a standard deviation of $10, what percent of the Christmas trees cost more than $58?

Algebraic Expression:

21.19%

Scaffold:

What are your x-value, mean, and standard deviation, respectively?

Multiple Choice:

	58, 50, 10
	60, 2, 5
	100, 18, 30

Scaffold:

Using the formula z=(x-mean)/standard deviation, what is your z-value?

Multiple Choice:

	0.9
	your equation is z=(58-50)/10
	0.8
	0.2
	your equation is z=(58-50)/10

Scaffold:

Find 0.8 in your z-table. What is the proportion?

Algebraic Expression:

.7881

Scaffold:

To find the proportion greater than 58, find the area to the right of 58. How do you do this?

Multiple Choice:

	1-.7881
	.7881-1
	switch the order
	.7881 is the area to the left of 58
	No. Do 1-.7881 to find the area to the right of 58

Scaffold:

Having found the proportion of 1-.7881, what is the percent of trees that cost more than $58?

Algebraic Expression:

21.19%
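
As a check, the area to the right of $58 can be computed in R. This is a sketch under the assumption that prices are normally distributed; the variable names are illustrative.

```r
# Values from the problem statement (names are illustrative)
price_mean <- 50
price_sd <- 10

# Area to the left of $58 (the z-table proportion for z = 0.8)
p_left <- pnorm(58, mean = price_mean, sd = price_sd)

# "More than $58" is the area to the right
p_more <- 1 - p_left

print(round(p_left, 4))
print(round(100 * p_more, 2))   # as a percent
```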

Nancy gives out candy on Halloween every year. Over the past 5 years the average amount of candy that she gave out was 100 candy bars with a standard deviation of 4. What is the probability that Nancy will give out at least 97 candy bars this year?

Algebraic Expression:

0.7734

Hints:

Use the equation z=(x-mean)/standard deviation. The z-table gives the area to the left of 97 (0.2266); subtract it from 1 to get the area to the right.

Among first year students at a certain university, scores on the verbal SAT follow the normal curve. The average is around 500 and the SD is about 100. Tatiana took the SAT, and placed at the 85th percentile. What was her verbal SAT score?

UCLA Statistics, http://wiki.stat.ucla.edu/socr/index.php/EBook_Problems_Normal_Std

Algebraic Expression:

604

Scaffold:

Find the value closest to 0.85 in the z-table (it's 0.8508). What is the z-value?

Algebraic Expression:

1.04

Scaffold:

Using the formula z=(x-mean)/standard deviation, what does x equal?

Multiple Choice:

	620
	switch your mean and standard deviation
	604
	400
	use the equation 1.04=(x-500)/100

IQ is normally distributed with a mean of 100 and a standard deviation of 15. What is the minimum IQ of a person in the top 20% of the data? Round your answer to the nearest whole number.

Barbara Illowsky and Susan Dean, "Collaborative Statistics," Connexions, March 22, 2010, http://cnx.org/content/col10522/1.38/

Algebraic Expression:

113

Hints:

The cutoff for the top 20% is the 80th percentile, which corresponds to .80 in the body of the z-table.
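
In R, the percentile lookup and the conversion back to an IQ score can be done in one call. This is a sketch; the variable name is illustrative.

```r
# The cutoff for the top 20% is the 80th percentile of N(mean = 100, sd = 15)
iq_cutoff <- qnorm(0.80, mean = 100, sd = 15)

print(round(iq_cutoff))
```

qnorm(0.80) on its own returns the z-value you would find by locating .80 in the body of the z-table.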

In the 1992 presidential election, Alaska's 40 election districts averaged 1956.8 votes per district for President Clinton. The standard deviation was 572.3. (There are only 40 election districts in Alaska.) The distribution of the votes per district for President Clinton was bell-shaped. What is the probability that a district has 2000 votes? (Source: The World Almanac and Book of Facts)

Barbara Illowsky and Susan Dean, "Collaborative Statistics," Connexions, March 22, 2010, http://cnx.org/content/col10522/1.38/

Multiple Choice:

	0.5319
	0.5279
	Round the z-value up to 0.08
	0.4721
	0.4681

Scaffold:

What variables do we have?

Multiple Choice:

	the mean, Z, the standard deviation
	We are solving for the Z-value
	the standard deviation, X, the mean
	Z, the mean, X
	We are solving for the Z-value
	Z, the standard deviation, X
	We are solving for the Z-value

Scaffold:

Using the equation Z=(X-mean)/standard deviation, what is the Z-value?

Multiple Choice:

	0.7548
	0.07548
	-0.7548
	-0.07548
	the equation is (X-mean)/standard deviation, not (mean-X)/standard deviation

Scaffold:

Using the z-value and the z-chart, what is the probability?

Algebraic Expression:

	0.4721
	0.4681
	0.5319
	0.5279
	Round the z-value to the nearest hundredth.

Terri Vogel, an amateur motorcycle racer, averages 129.71 seconds per 2.5 mile lap (in a 7 lap race) with a standard deviation of 2.28 seconds. Her lap times are normally distributed. Find the percent of her laps that are completed in less than 130 seconds.

Barbara Illowsky and Susan Dean, "Collaborative Statistics," Connexions, March 22, 2010, http://cnx.org/content/col10522/1.38/

Multiple Choice:

	.13
	This is your z-value. Use this in the z-table to find your percent
	.5517
	Turn this proportion into a percent
	55.17%
	30%

Scaffold:

What is your x-value?

Algebraic Expression:

130

Scaffold:

Using the equation z=(x-mean)/standard deviation, what is your z-value?

Multiple Choice:

	.13
	-.13
	Switch your mean and x-value
	.98
	Switch your standard deviation and mean
	5
	z=(130-129.71)/2.28

Scaffold:

Using the Z-table, what is the proportion for your z-value?

Multiple Choice:

	.9032
	Find the z-value .13 in the table
	.5517
	.4413
	Find the z-value .13, not -.13

Scaffold:

What percent of her laps are completed in less than 130 seconds?

Multiple Choice:

	.5517
	Multiply by 100 to turn it into a percent
	55.17%
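
The sketch below shows both the table method and the direct calculation in R, assuming normally distributed lap times. The variable names are illustrative.

```r
# Values from the problem statement (names are illustrative)
lap_mean <- 129.71
lap_sd <- 2.28

# z-value, rounded to two places as you would for the z-table
z <- (130 - lap_mean) / lap_sd
z_rounded <- round(z, 2)

# Proportion using the rounded z-value (matches the z-table)
p_table <- pnorm(z_rounded)

# Proportion using the unrounded values
p_direct <- pnorm(130, mean = lap_mean, sd = lap_sd)

print(z_rounded)
print(round(100 * c(p_table, p_direct), 2))   # as percents
```

The two percents differ by about a tenth of a percentage point because the table method rounds z to .13 before the lookup.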

According to a study done by De Anza students, the height for Asian adult males is normally distributed with an average of 66 inches and a standard deviation of 2.5 inches. Suppose one Asian adult male is randomly chosen. What is the height of a man in the 40th percentile? (only answer in a number, leave out "inches")

Barbara Illowsky and Susan Dean, "Collaborative Statistics," Connexions, March 22, 2010, http://cnx.org/content/col10522/1.38/

Algebraic Expression:

	65
	66
	The exact number is 65.375 inches, therefore you round down to 65 not up to 66.

Scaffold:

What is the percentage we are looking for in the z-chart?

Multiple Choice:

	40%
	60%
	the _th percentile means the _%
	30%
	the _th percentile means the _%
	50%
	the _th percentile means the _%

Scaffold:

Using the Z-Chart, what is the Z value?

Multiple Choice:

	-0.25
	-0.26
	We are looking for the percentage that is closest to 0.40, 0.4013 is closer than 0.3974.
	0.25
	0.26

Hints:

When using the Z-Chart, look for the value that is closest to 0.4000 to help you find Z.

Scaffold:

What variables do we have to complete the equation, Z= (X-mean)/standard deviation ?

Multiple Choice:

	the mean, Z, the standard deviation
	Z, the standard deviation, X
	We are solving for the X
	the mean, the standard deviation, X
	We are solving for the X
	X, the mean, Z
	We are solving for the X

Scaffold:

Using the equation from above and the variables from the problem, solve for X to the nearest whole number.

Algebraic Expression:

	65
	66
	The exact value is 65.375, you must round down to 65 instead of up to 66.
	65.375
	Round to the nearest whole value
	62
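
A quick R check of the 40th percentile, written as a sketch with illustrative variable names. It uses the same two steps as the scaffolds: find Z from the percentage, then solve Z=(X-mean)/standard deviation for X.

```r
# Values from the problem statement (names are illustrative)
height_mean <- 66
height_sd <- 2.5

# Z-value with 40% of the area to its left (negative, since 40% is below the mean)
z_40 <- qnorm(0.40)

# Solve for X
height_40 <- height_mean + z_40 * height_sd

print(round(z_40, 2))
print(round(height_40))
```

The unrounded height is slightly below the 65.375 obtained with the chart value Z=-0.25, and it rounds to the same whole number.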

The grade point averages of the students at the University of Houlihan are approximately normally distributed with mean equal to 3.0 and standard deviation equal to 0.2. What percentage of the students will possess a grade point average greater than 3.5?

Algebraic Expression:

	0.9938
	This is the area to the left of 3.5, or lower than 3.5.
	0.0062

Scaffold:

Using the equation Z= (X-mean)/standard deviation, find Z.

Multiple Choice:

	2.5
	-2.5
	the equation is (X-mean)/standard deviation not (mean-X)/standard deviation
	1.5
	-1.5

Scaffold:

Using the Z-Chart, what is the probability that the scores will be above 3.5?

Multiple Choice:

	0.9938
	This is the area to the left of 3.5
	0.0062
	0.5293
	0.0427
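
In R, the area to the right can be requested directly with lower.tail = FALSE instead of subtracting from 1. This is a sketch; the variable names are illustrative.

```r
# Values from the problem statement (names are illustrative)
gpa_mean <- 3.0
gpa_sd <- 0.2

z <- (3.5 - gpa_mean) / gpa_sd

# Area to the left of 3.5, and area to the right of 3.5
p_below <- pnorm(3.5, mean = gpa_mean, sd = gpa_sd)
p_above <- pnorm(3.5, mean = gpa_mean, sd = gpa_sd, lower.tail = FALSE)

print(z)
print(round(c(p_below, p_above), 4))
```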

Suppose that weights of bags of potato chips coming from a factory follow a normal distribution with mean 12 ounces and standard deviation 0.6 ounces. If the manufacturer wants to keep the mean at 12 ounces but adjust the standard deviation so that only 4% of the bags weigh less than 11 ounces, what does the new standard deviation need to be? Round to the nearest hundredth.

Algebraic Expression:

	0.38
	Make sure you are using 0.0401 as the percentage.
	0.37
	Make sure you are using 0.0401 as the percentage.
	0.57
	0.56
	Make sure you are using 0.0401 as the percentage.

Scaffold:

What is the decimal equivalent to 4% ?

Multiple Choice:

	0.0400
	0.0040
	Try 4/100 in your calculator.
	0.4000
	Try 4/100 in your calculator.
	4.0000
	Try 4/100 in your calculator.

Scaffold:

Using the Z-chart, what is the z-value?

Algebraic Expression:

	-1.75
	-1.7
	Make sure you are using both parts of the z-table. You are missing the .05
	-2.65
	The decimal equivalent is 0.0401 not 0.0040

Scaffold:

What is the X-value?

Algebraic Expression:

	11
	12
	This is the average, not the X-value.

Scaffold:

Using the equation Z=(X-mean)/standard deviation, find the standard deviation to the nearest hundredth.

Multiple Choice:

	0.570
	0.580
	Round to the nearest whole.
	0.571
	Round to the nearest hundredth.
	0.377
	Make sure you have the correct value for Z

Hints:

Find the decimal equivalent of 4% in the Z-chart.
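
Here the unknown is the standard deviation, so the equation Z=(X-mean)/standard deviation is rearranged to standard deviation=(X-mean)/Z. A minimal R sketch, with illustrative variable names:

```r
# Values from the problem statement (names are illustrative)
bag_mean <- 12
cutoff <- 11

# Z-value with 4% of the area to its left
z_4 <- qnorm(0.04)

# Rearranged equation: standard deviation = (X - mean) / Z
new_sd <- (cutoff - bag_mean) / z_4

print(round(z_4, 2))
print(round(new_sd, 2))
```

Both the numerator and Z are negative, so the standard deviation comes out positive, as it must.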

In China, 4-year-olds average 3 hours a day unsupervised. Most of the unsupervised children live in rural areas, considered safe. Suppose that the standard deviation is 1.5 hours and the amount of time spent alone is normally distributed. We randomly survey one Chinese 4-year-old living in a rural area. Find the probability that the child spends less than 1 hour per day unsupervised.

Barbara Illowsky and Susan Dean, "Collaborative Statistics," Connexions, March 22, 2010, http://cnx.org/content/col10522/1.38/

Algebraic Expression:

.0918

Scaffold:

What are the x-value, mean, and standard deviation, respectively? *Remember, average is the same as mean*

Algebraic Expression:

1,3,1.5

Scaffold:

Using the formula z=(x-mean)/standard deviation, what is your z-value?

Multiple Choice:

	1.33
	-1.33
	0.4
	Switch your mean and x-value
	-.167
	Switch your mean and standard deviation

Scaffold:

What is the probability that the child spends less than 1 hour per day unsupervised?

Multiple Choice:

	0.3
	Remember to look for the value -1.33 in the z-table
	.0918
	.0237
	Remember to look for the value -1.33 in the z-table
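
A short R sketch of the same calculation, with illustrative variable names. The z-value is rounded to two places first so that the result lines up with the z-table.

```r
# Values from the problem statement (names are illustrative)
hours_mean <- 3
hours_sd <- 1.5

# Note the parentheses: subtract the mean before dividing
z <- (1 - hours_mean) / hours_sd
z_rounded <- round(z, 2)

# "Less than 1 hour" is the area to the left
p_less <- pnorm(z_rounded)

print(z_rounded)
print(round(p_less, 4))
```

Without the parentheses, R would evaluate 1 - hours_mean / hours_sd, which divides first and gives a different number.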

In a certain normal distribution, 1.25% of the area lies to the left of 33 and 1.25% lies to the right of 39. Find the mean.

Algebraic Expression:

36

Hints:

If the area to the left and the right of these numbers are equal, the mean is between these two numbers.

Referring to the question above, find the standard deviation to the nearest hundredth.

Algebraic Expression:

	1.34
	1.33
	Round to the nearest hundredth.

Scaffold:

Using the Z-chart, what is the Z-value?

Multiple Choice:

	-2.24
	-2.34
	Look for 1.25% in the body of the z-chart
	2.24
	Look for 1.25% in the body of the z-chart
	2.34
	Look for 1.25% in the body of the z-chart

Hints:

1.25% equals 0.0125 in decimal form

Scaffold:

What variables do you have?

Multiple Choice:

	X, Z, the mean
	X, standard deviation, the mean
	You are solving for the standard deviation
	Z, the mean, the standard deviation
	You are solving for the standard deviation
	X, Z, the standard deviation
	You are solving for the standard deviation

Scaffold:

Which value do you use for X ?

Multiple Choice:

	33
	39
	Because the area 1.25% is to the left, you need to use the value on the left
	36
	Because the area 1.25% is to the left, you need to use the value on the left
	37
	Because the area 1.25% is to the left, you need to use the value on the left

Scaffold:

Using the equation Z=(X-mean)/standard deviation, solve for the standard deviation to the nearest hundredth.

Multiple Choice:

	1.33
	Round to the nearest hundredth
	1.34
	1.36
	1.30
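
Both parts of this question can be sketched in R. The variable names are illustrative; the approach follows the scaffolds: the mean is the midpoint by symmetry, and the standard deviation comes from rearranging Z=(X-mean)/standard deviation.

```r
# Cutoffs from the problem statement (names are illustrative)
lower <- 33
upper <- 39

# Equal tail areas on both sides, so the mean is the midpoint
dist_mean <- (lower + upper) / 2

# Z-value with 1.25% of the area to its left
z_lower <- qnorm(0.0125)

# Rearranged equation, using the left-hand cutoff
dist_sd <- (lower - dist_mean) / z_lower

print(dist_mean)
print(round(z_lower, 2))
print(round(dist_sd, 2))
```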

Find the probability, P(Z=1.7).

Algebraic Expression:

	0
	0.9554
	There is no area above one specific point.
	0.0446
	There is no area above one specific point.

Hints:

When Z is equal to one specific value, there is no area above that single point, so the probability is 0.

The percent of fat calories that a person in America consumes each day is normally distributed with a mean of about 36 and a standard deviation of 10. Suppose that one individual is randomly chosen. Find the probability that the percent of fat calories a person consumes is more than 40.

Barbara Illowsky and Susan Dean, "Collaborative Statistics," Connexions, March 22, 2010, http://cnx.org/content/col10522/1.38/

Algebraic Expression:

	0.3446
	0.6554
	This is the area to the left of 40

Scaffold:

What is your x-value?

Multiple Choice:

	36
	This is the mean
	40
	10
	This is the standard deviation

Scaffold:

Using the equation z=(x-mean)/standard deviation, what is the z-value?

Algebraic Expression:

	-0.65
	Switch the standard deviation and the x-value
	0.65
	the equation is z=(x-mean)/standard deviation
	0.4

Scaffold:

Is the z-table value for your z-value (0.6554) the area to the left or right of 40? Which are you trying to find?

Multiple Choice:

	Left, Left
	You are not trying to find the area to the left of 40
	Left, Right
	Right, Right
	This is the area to the left of 40
	Right, Left
	This is the area to the left of 40, and you are trying to find the area to the right of 40

Scaffold:

How do you use the proportion you found in the z-table (0.6554) to find the area of the data to the right of 40?

Multiple Choice:

	1-0.6554
	0.6554-1
	1+0.6554
	0.6554 is the answer

Scaffold:

What is the probability that the percent of fat calories a person consumes is more than 40?

Multiple Choice:

	0.3446
	34.46%
	Keep your answer as a decimal
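
To tie the scaffolds together, here is a minimal R sketch of the whole calculation. The variable names are illustrative.

```r
# Values from the problem statement (names are illustrative)
fat_mean <- 36
fat_sd <- 10

# Step 1: z-value for x = 40
z <- (40 - fat_mean) / fat_sd

# Step 2: the z-table (and pnorm) give the area to the LEFT of 40
p_left <- pnorm(z)

# Step 3: "more than 40" is the area to the RIGHT
p_right <- 1 - p_left

print(z)
print(round(c(p_left, p_right), 4))
```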