Data Analysis Assignment 1
The general approach you should take to a data analysis assignment is to write as if you
were providing a report to a manager who knows basic math and statistics, but is not a
specialist. Your work product should exhibit a high degree of professionalism. Specifically,
on every data analysis assignment, I expect your papers to conform to the following guidelines:
1). Word-process the assignment using 11- or 12-point font, double spacing, and 1-inch
margins all around.
2). If applicable to the problem, use subscripts and superscripts. For example, write $25^2$,
not 25^2, if you need to indicate an exponent; write $x_1$, not x_1.
3). Write any equations or mathematical symbols using the equation editor (Word,
OpenOffice, and other programs have these). For example, write the sample mean,
pronounced as “x-bar”, as $\bar{x}$, not “xbar” or something similar.
4). Write all of your answers in complete, grammatically correct sentences. Never answer a
question with just a number. For example, if the question asks for the sample mean salary,
write “The sample mean of the salaries is <whatever>.”
5). Spell check your document.
6). Write for the reader, not “the teacher.” Explain exactly what you are doing on every
question as if writing for a colleague or supervisor who does not know the details of the
problem you are working on. Imagine you had to pick up your assignment a year from now
and understand it; write your answers so that you could do that.
7). Include an introductory sentence before every graph, symbol, table, or element of
software output. For example, if the directions ask for a histogram, include a sentence
before the histogram describing exactly what the graph shows. If the directions ask for a
confidence interval, write out the conclusion as a statement in the context of the problem.
For example, do not write “[4.2, 5]”; write “We are 95% confident that the true mean length
of time spent at a job is between 4.2 and 5 years.”
8). Make all figures large enough to be easily read, and do not rely on color to distinguish
the various components of graphs. Rather, use different patterns or shades of gray.
Professionalism, as judged by your paper adhering to the guidelines above, will
make up roughly 40% of each assignment grade.
1. (Note: For this question, and only this question, all group members must answer
individually. Label each answer with the name of the student to whom it belongs. Each
person’s response should be no less than half a page, following the formatting guidelines
above. This question refers to material in the “Big Picture” document and video.)
Think of a topic of interest to you, either professionally or personally, that addresses
some natural phenomenon that can be studied by observation, experimentation, or
by analyzing existing data. Yes, your options here are quite varied, with the only
restriction being that the topic have the ability to be studied by gathering observable
data. Example topics include: crime, poverty, e-commerce, privacy rights, sick leave
policies in the U. S. versus Europe, money in politics, education, mental health issues,
gun control, tax policy, race relations, corporate governance, minimum wage laws, free
trade agreements, mobile phone usage, population growth, birth control in developing
countries, body language, the work-life balance, and countless others. If you need
inspiration, a favorite site of mine is TED.com,¹ which houses thousands of 15–20-minute
talks about almost any conceivable topic. Just pick one that interests you.
Personal beliefs that have no ability to be tested systematically, such as the belief that
the entire known universe rests on the back of a giant turtle, or that you are really just
a “brain floating in a jar” experiencing everything through an elaborate simulation,²
are not topics to discuss here. Once you have your topic, do the following.
(a) Think of a question that you have about your topic and use Google Scholar to find
one (1) academic research paper that generally addresses that question.³ Focus
on papers submitted to scholarly journals, not on news articles or “white papers”
written by companies or organizations. Note that you might not understand most
of the paper; that’s okay. Just read the abstract (i.e., the summary at the
beginning), the introduction section, and the conclusion section. If, after a few tries,
you don’t have a general idea of what the researchers did, find a better article
(the point of these three sections is to explain clearly what is being done, but
some excellent researchers are poor communicators). Give a full citation for the
article using APA⁴ format. Then, briefly, explain:
1. The aspect of Nature⁵ that was studied.
2. The Design and Measurement approach. That is, how did the researchers
decide to address the question? With an experiment? By observing a group?
By working with an existing data set? By some other means?
3. The Data the researchers obtained. That is, what did the numbers they
recorded represent? Crime rates? Weights? Reaction times? Website hits?
Rankings on a 1–5 (or similar) scale?⁶
(b) How convincing are the conclusions of the article? Do you believe the researchers
approached the problem the right way? If you had an opinion on the issue before,
has it changed or has it been reinforced? If you think the study was flawed or not
convincing, state, briefly but specifically, why you think so.
2. (Note: from here onward, you can work in a group as normal). Again referring to
the “Big Picture” document and/or video, briefly state whether you think each of the
statements below is reasonable and why you think it is or is not. You don’t need to
write much more than a sentence or two for each, and you do not need to mention any
statistical methods or cite any outside sources. Just use your best judgement. You
may make additional (but reasonable) assumptions if needed to support your point.
(a) I’ve heard smoking causes cancer, but my dear aunt Hilda lived to be 90 and
smoked most of her life. So smoking does not cause cancer.
¹ Alphabetical list of TED talks
² Yes, this is a real philosophical point of view. See http://en.wikipedia.org/wiki/Brain_in_a_vat
³ You may not find something that addresses the issue exactly, as that’s how new research starts, but you
should find something broadly related.
⁴ Here is a link to the format of an APA journal citation:
⁵ Remember that “Nature,” in this context, is the entire observable world, not just trees and birds.
⁶ These rating scales are formally called Likert scales.
(b) Two groups of U. S. high school freshmen, selected by randomly choosing student
ID numbers, were enrolled in a two-week drug abuse education program. One
group (A) received instruction from a police officer in uniform while the other
group (B) received no instruction. Five years later, the two groups were surveyed.
In Group A, 15% had tried an illegal drug at least once in the last five years, while
in Group B, 30% had done so. The program therefore does not work because it
should be 0% for Group A.
(c) A researcher asked 100 shoppers in a mall to try a new bracelet that supposedly
improves balance. Each shopper was told that the bracelet used magnets to direct
the body’s energy flow. The researcher asked each shopper to stand on one foot as
long as possible while wearing the bracelet, and then to repeat the action without
wearing the bracelet. At the end of the study, 70% of the shoppers stood for
a longer time while wearing the bracelet. Therefore, the bracelet is effective at
improving balance.
(d) A graph⁷ shows that as the sales of organic food increased from 1997 to 2007,
diagnoses of autism also increased. Organic foods are therefore a cause of autism.
3. This problem will use the “MBA survey data” data set. This is an additional data
file located under the “Course Materials” area in Blackboard. The data were collected
from a survey given to one MBA statistics class at Sam Houston State University.
Students were not required to respond to the survey, and received no course credit
whatsoever for choosing to participate. In total, 38 out of 50 students responded. The
variable names and descriptions are as follows:
work_stat: A student’s employment status. Possible values were: working full
time; working part time; unemployed, laid off, or looking for work; other. If
“other” was chosen, the student was given the chance to enter additional information.
job_hrs: The number of hours the student typically spends per week at his/her job
sch_hrs: The number of hours the student typically spends on school work per week
num110: The first number between 1 and 10 the student thought of after reading
last_name: Letter group containing the first letter of the student’s last name
num_kids: The number of children the student has
age: student’s age
gender: student’s gender
yrs_stat: number of years since the student last studied statistics
accidents: number of automobile accidents (reported and unreported) the student
had in the past year
⁷ Here is the link.
(a) Classify each variable as nominal/categorical, ordinal, or interval/ratio, and
explain your reasoning.
(b) For the variable “job_hrs,” find the mean, standard deviation, minimum,
maximum, and quartiles. Explain what each measure tells you in the context of the
problem.
(c) Make a completely labeled histogram and boxplot for “job_hrs.” Describe the
shape of the histogram and what it tells you about the population or process the
data came from.
(d) Repeat (b) but separate the results by gender (you don’t have to repeat the
descriptions of what the measures tell you; just report the statistics). This can be
done either with a pivot table or in R Commander by going to Statistics ->
Summaries -> Numerical Summaries. Then click “Summarize by groups…”
and select “gender” as the group. Comment on the differences you see between
the groups.
(e) Repeat (c) but, again, separate the results by gender. You can make separate
histograms in Excel, but R will be easier. Similar to (d), go to Graphs ->
Histogram, then click Plot by groups…, and again select “gender” as
the group. Describe the shapes of the plots and what additional information they
provide about the two groups.
(f) Investigate the data set to find the cause of the outlier visible in the plots you
made. Putting yourself in the position of a professional data analyst, discuss
whether or not the outlier should be removed and why you think as you do.
(g) The dean of the College of Business eventually wants to make a report to the
Board of Regents about the demographics of the MBA program, and wants to
use the data you have collected. Is this data appropriate for that purpose? In your
answer, mention the population or process you think the sample of 38 students
is taken from (Hint: All data come from some population or process; the issue is
whether it’s the population we are interested in studying).
(h) The dean grants you access to the full registrar database containing the
demographic information for all MBA students currently enrolled. Explain why you
still might not have “the population.” In what sense is this data better than the
sample of 38 students you used earlier?
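
The summary measures asked for in parts (b) and (d) can be sketched in a few lines of code. The assignment itself expects Excel or R Commander; the Python below is only an illustration, and the `rows` values and the `summarize` helper are hypothetical, not the MBA survey data. Quartile conventions differ between programs, so the values Excel or R reports may not match this method exactly.

```python
import statistics
from collections import defaultdict

# Hypothetical (gender, job_hrs) pairs; NOT the actual MBA survey data.
rows = [("F", 40), ("M", 45), ("F", 20), ("M", 50),
        ("F", 0), ("M", 40), ("F", 35), ("M", 30)]


def summarize(values):
    """Return the summary measures requested in part (b)."""
    # "inclusive" is one of several quartile conventions; software differs.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "mean": statistics.mean(values),
        "sd": statistics.stdev(values),  # sample standard deviation (n - 1)
        "min": min(values),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(values),
    }


# Overall summary, as in part (b).
overall = summarize([hours for _, hours in rows])

# Summary separated by gender, as in part (d).
by_gender = defaultdict(list)
for gender, hours in rows:
    by_gender[gender].append(hours)
grouped = {g: summarize(v) for g, v in by_gender.items()}

print(overall)
print(grouped)
```

Reporting the numbers is only half of the task: each measure still needs a sentence explaining what it says about the hours students work.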
4. The following problems will use the “AMESHousing.csv” data set, which includes 82
variables on nearly 3,000 residential properties in Ames, Iowa, from 2006 to 2010,
obtained from the Ames Assessor’s Office. The names of the variables and what they
represent are found in the “AmesHousingDescription” file in the “Data Sets” folder on
Blackboard. Using R or Excel, do the following:
(a) Make a histogram of the Sale_Price variable and describe the pattern that you
see. Is this pattern surprising to you, given what you know about the housing
market? If you see strange-looking numbers with “e’s” in the plot, those indicate
powers of 10. For example, 1.5e+05 is $1.5 \times 10^5 = 150{,}000$. Determine a way to
get rid of the “e’s.” Hint: The hard way is to start messing with R’s graphing
code to get the big numbers to print. There is a much easier way to solve this
problem.
(b) Make a new variable called “lnSale” to record the natural logarithm of the sale
price, i.e., ln(Sale_Price). You can do this in either Excel by adding a column
and re-importing the data to R, or directly in R Commander by going to Data ->
Manage variables in active data set -> Compute new variable and inputting
the appropriate expression to calculate the natural logarithm of Sale_Price.⁸
Make a histogram of “lnSale” and compare its appearance to the one in (a).
Make a conclusion about one of the properties of the natural logarithm.
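
As a small aside on the notation and the transformation mentioned above, the following sketch (in Python, purely for illustration; the assignment expects R or Excel) shows how the “e” notation reads as a power of 10 and how a natural logarithm is computed. The `sale_price` values are hypothetical and are not taken from the Ames data.

```python
import math

# "1.5e+05" is scientific notation for 1.5 x 10^5.
value = float("1.5e+05")
print(value == 150000)  # True

# Hypothetical sale prices in dollars; NOT rows from the Ames data.
sale_price = [150000, 215000, 98500]

# math.log with one argument is the natural logarithm (base e).
ln_sale = [math.log(p) for p in sale_price]
print([round(x, 3) for x in ln_sale])
```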
5. Using R Commander or Excel, calculate a new variable called “Age” to hold the age of
a house at the time of sale. In R Commander, select Data -> Manage variables in
active data set -> Compute new variable and input the appropriate expression
to calculate the age of the house using the variables “Year_Built” and “Year_Sold.”
Then do the following:
(a) Make a histogram of “Age” and describe the information you can get from it.
(b) Make a box plot of “Age” and describe what information you can get from it that
you cannot get easily from a histogram.
(c) Calculate the mean, standard deviation, and 90th percentile of the age data.
Describe what each measure tells you.
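
The calculations in this problem can be sketched as follows. This is a Python illustration only (the assignment expects R Commander or Excel), the `houses` pairs are hypothetical rather than rows from the Ames data, and it assumes that age means the year sold minus the year built. Percentile conventions vary between programs, so Excel or R may report a slightly different 90th percentile.

```python
import statistics

# Hypothetical (Year_Built, Year_Sold) pairs; NOT rows from the Ames data.
houses = [(1960, 2008), (2005, 2006), (1978, 2010), (1999, 2007), (2003, 2009),
          (1925, 2006), (1988, 2008), (2007, 2007), (1970, 2009), (1995, 2010)]

# Assumption: age at sale = year sold minus year built.
ages = [sold - built for built, sold in houses]

mean_age = statistics.mean(ages)
sd_age = statistics.stdev(ages)  # sample standard deviation (n - 1)

# quantiles(n=10) returns the nine deciles; the last one is the 90th percentile.
p90 = statistics.quantiles(ages, n=10, method="inclusive")[-1]

print(mean_age, round(sd_age, 2), p90)
```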
6. This problem will use the General Social Survey (GSS) data from 2012, which is in the
“gss_2012.csv” file. We will use side-by-side bar charts to investigate the question of
whether there is a relationship between marital status and general happiness. Using
an Excel Pivot Chart (or R Commander if you’re brave), do the following:
(a) Display a contingency table with the marital status (“marital”) variable down the
rows and the general happiness variable (“happy”) along the columns.
(b) Summarize the distribution of the “happy” variable by marital status by
summarizing by row percentages.
(c) Display a side-by-side bar chart of the distribution you found in (b).
(d) Using (b) and (c), does there seem to be a significant relationship between marital
status and general happiness? If so, what is the apparent relationship? Describe
specifically what features you are looking for in the graph and in the table in
order to make your assessment.
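
The contingency table and row percentages in parts (a) and (b) can be sketched as follows. This Python version is only an illustration of the arithmetic (the assignment expects an Excel Pivot Chart or R Commander). The `responses` list and its category labels are hypothetical; they are not the 2012 GSS data, and the actual labels in the file may differ.

```python
from collections import Counter

# Hypothetical (marital, happy) responses; NOT the 2012 GSS data.
responses = [
    ("Married", "Very happy"), ("Married", "Pretty happy"),
    ("Married", "Very happy"), ("Married", "Not too happy"),
    ("Never married", "Pretty happy"), ("Never married", "Pretty happy"),
    ("Never married", "Not too happy"), ("Never married", "Very happy"),
    ("Divorced", "Pretty happy"), ("Divorced", "Not too happy"),
]

# Contingency table: counts for each (row, column) combination.
counts = Counter(responses)
rows = sorted({marital for marital, _ in responses})
cols = ["Very happy", "Pretty happy", "Not too happy"]

# Row percentages: each count divided by its row total, so each row sums to 100.
row_pct = {}
for marital in rows:
    total = sum(counts[(marital, happy)] for happy in cols)
    row_pct[marital] = {happy: 100 * counts[(marital, happy)] / total
                        for happy in cols}

for marital in rows:
    print(marital, {h: round(p, 1) for h, p in row_pct[marital].items()})
```

Row percentages put groups of different sizes on a common scale, which is what makes the side-by-side bars in part (c) comparable.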
7. Download two years of weekly stock closing prices from a publicly traded company of
your choice.⁹ Then, using Excel (probably easier) or R Commander, do the following:
⁸ In R, the natural log function is log().
⁹ Yahoo! Finance makes this easy. Go to http://finance.yahoo.com and enter the company name (or stock
symbol) in the “Quote Lookup” area. Then click “Historical Prices” on the left-hand side and enter the
two-year time frame. The starting year should be 2012.
(a) Letting $Y_t$ be the closing price of the stock in week $t$, define a new variable called
$R_t$:
$$R_t = \frac{Y_t - Y_{t-1}}{Y_{t-1}}$$
where $Y_{t-1}$ is the price of the stock the previous trading week (set the return for
the first period to 0, and note that you will only have “real” data for the second
week onward). Thus, $R_t$ is just the percentage change from week to week without
the multiplication by 100.
(b) Report the first 10 rows of returns.
(c) Define a new variable $L_t = \ln(Y_t)$ to hold the natural logarithm of the closing
price on each week. Report the first 10 rows of $L_t$.
(d) Define one more variable “lnDiff” as the difference between the log of the price in
week $t$ and the log of the price in week $t-1$. That is, lnDiff $= L_t - L_{t-1}$. Compare the
values of “lnDiff” to the values of $R_t$ from part (a) and draw a conclusion about
a property of the natural logarithm.
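
The mechanics of parts (a) and (c) can be sketched as follows. This is a Python illustration only (the assignment expects Excel or R Commander), and the `close` prices are hypothetical, not real market data. The comparison in part (d) is deliberately left out.

```python
import math

# Hypothetical weekly closing prices; NOT real market data.
close = [50.00, 51.00, 49.98, 52.48]

# Weekly return: (this week's price - last week's price) / last week's price.
# The first period has no previous week, so its return is set to 0.
returns = [0.0] + [
    (close[t] - close[t - 1]) / close[t - 1] for t in range(1, len(close))
]

# Natural logarithm of each closing price.
log_price = [math.log(y) for y in close]

for y, r, l in zip(close, returns, log_price):
    print(f"{y:8.2f} {r:10.6f} {l:10.6f}")
```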
Extra Credit (i.e., I won’t test you on this, but I’ll give you some extra points if
you do it correctly): Show mathematically why the property in (d) occurs. Your answer
should not rely on any specific data values but show, in general, why the property “works.”