Introduction to Hypothesis Testing
- Describe how a probability value is used to cast doubt on the null
- Define "statistically significant"
- Distinguish between statistical significance and practical significance
- Distinguish between two approaches to significance testing
A low probability
value casts doubt on the null hypothesis. How
low must the probability value be in order to conclude that
the null hypothesis is false? Although there is clearly no
right or wrong answer to this question, it is conventional
to conclude the null hypothesis is false if the probability
value is less than 0.05. More conservative researchers conclude
the null hypothesis is false only if the probability value
is less than 0.01. When a researcher concludes that the null
hypothesis is false, the researcher is said to have rejected
the null hypothesis. The probability value below which the
null hypothesis is rejected is called the
α level or simply α. It is also
called the significance level.
When the null hypothesis is rejected, the effect
is said to be statistically significant.
For example, in the Physicians
Reactions case study, the probability value is 0.0057. Therefore,
the effect of obesity is statistically significant and the null
hypothesis that obesity makes no difference is rejected. It is
very important to keep in mind that statistical significance means
only that the null hypothesis of exactly no effect is rejected;
it does not mean that the effect is important, which is what "significant"
usually means. When an effect is significant, you can have confidence
the effect is not exactly zero. Finding that an effect is significant
does not tell you about how large or important the effect is.
Do not confuse statistical significance with
practical significance. A small effect can be highly significant
if the sample size is large enough.
Why does the word "significant" in the
phrase "statistically significant" mean something so
different from other uses of the word? Interestingly, this is
because the meaning of "significant" in everyday language
has changed. In the 19th century when the procedures for hypothesis
testing were developed, something was "significant"
if it signified something. Thus, finding that an effect is statistically
significant signifies that the effect is real and not due to chance.
Over the years, the meaning of "significant" changed,
leading to the potential misinterpretation.
There are two approaches (at least) to conducting
significance tests. In one (favored by R. Fisher) a significance
test is conducted and the probability value reflects the strength
of the evidence against the null hypothesis. If the probability
is below 0.01, the data provide strong evidence that the null
hypothesis is false. If the probability value is below 0.05 but
larger than 0.01, then the null hypothesis is typically rejected,
but not with as much confidence as it would be if the probability
value were below 0.01. Probability values between 0.05 and 0.10
provide weak evidence against the null hypothesis, and, by convention,
are not considered low enough to justify rejecting it. Higher
probabilities provide less evidence that the null hypothesis is
The alternative approach (favored by the statisticians
Neyman and Pearson) is to specify an α
level before analyzing the data. If
the data analysis results in a probability value below the α
level, then the null hypothesis is rejected; if it is not, then
the null hypothesis is not rejected. According to this perspective,
if a result is significant, then it does not matter how significant
it is. Moreover, if it is not significant, then it does not matter
how close to being significant it is. Therefore, if the 0.05 level
is being used, then probability values of 0.049 and 0.001 are
treated identically. Similarly, probability values of 0.06 and
0.34 are treated identically.
The former approach (preferred by Fisher) is more
suitable for scientific research and will be adopted here. The
latter is more suitable for applications in which a yes/no decision
must be made. For example, if a statistical analysis were undertaken
to determine whether a machine in a manufacturing plant were malfunctioning,
the statistical analysis would be used to determine whether or
not the machine should be shut down for repair. The plant manager
would be less interested in assessing the weight of the evidence
than knowing what action should be taken. There is no need for
an immediate decision in scientific research where a researcher
may conclude that there is some evidence against the null hypothesis,
but that more research is needed before a definitive conclusion
can be drawn.