
Thus, my third recommendation is that, as researchers, we routinely report effect sizes in the form of confidence limits. "Everyone knows" that confidence intervals contain all the information to be found in significance tests and much more. They not only reveal the status of the trivial nil hypothesis but also the status of non-nil null hypotheses, and thus help remind researchers about the possible operation of the crud factor.
Yet they are rarely to be found in the literature. I suspect that the main reason they are not reported is that they are so embarrassingly large! But their sheer size should move us toward improving our measurement by seeking to reduce the unreliable and invalid part of the variance in our measures (as Student himself recommended almost a century ago). Also, their width provides us with the analogue of power analysis in significance testing: larger sample sizes reduce the size of confidence intervals as they increase the statistical power of NHST. A new program covers confidence intervals for mean differences, correlation, cross-tabulations (including odds ratios and relative risks), and survival analysis (Borenstein, Cohen, & Rothstein, in press). It also produces Birnbaum's (1961) "confidence curves," from which can be read all confidence intervals from 50% to 100%, thus obviating the necessity of choosing a specific confidence level for presentation.
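To make both points concrete, here is a minimal sketch in Python. It is not the Borenstein et al. program; the mean difference, standard deviation, and sample sizes are invented, and a normal approximation stands in for whatever exact methods such a program would use.

```python
# A sketch of the two ideas above: a confidence interval for a mean
# difference, and Birnbaum-style "confidence curves" obtained by computing
# the interval at many confidence levels. All numbers are hypothetical.
from statistics import NormalDist

def mean_diff_ci(diff, se, level=0.95):
    """Normal-approximation confidence interval for a mean difference."""
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)  # two-sided critical value
    return diff - z * se, diff + z * se

# Width shrinks as 1/sqrt(n): quadrupling n halves the interval, the
# confidence-interval analogue of gaining power from larger samples.
for n in (25, 100, 400):
    se = 10 / n ** 0.5  # assumes a raw-score standard deviation of 10
    lo, hi = mean_diff_ci(3.0, se)
    print(f"n={n:4d}  95% CI = ({lo:6.2f}, {hi:6.2f})  width = {hi - lo:.2f}")

# A "confidence curve": the interval can be read off at any level,
# so no single level need be singled out for presentation.
for level in (0.50, 0.80, 0.90, 0.95, 0.99):
    lo, hi = mean_diff_ci(3.0, 10 / 400 ** 0.5, level)
    print(f"{level:.0%} CI = ({lo:.2f}, {hi:.2f})")
```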
As researchers, we have a considerable array of statistical techniques that can help us find our way to theories of some depth, but they must be used sensibly and guided by informed judgment. Even null hypothesis testing complete with power analysis can be useful if we abandon the rejection of point nil hypotheses and use instead "good-enough" range null hypotheses (e.g., "the effect size is no larger than 8 raw score units, or d = .5"), as Serlin and Lapsley (1993) have described in detail. As our measurement and theories improve, we can begin to achieve the Popperian principle of representing our theories as null hypotheses and subjecting them to challenge, as Meehl (1967) argued many years ago.
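One simple way to operationalize such a test is to ask whether a confidence interval for the effect excludes the entire good-enough range. The sketch below is an illustrative decision rule under that assumption, not Serlin and Lapsley's own procedure; the cutoff and the intervals are invented.

```python
# A "good-enough" range null: instead of testing whether the effect is
# exactly zero, ask whether the data rule out the whole trivial band
# |d| <= 0.5. One conservative rule: reject the range null only when the
# confidence interval for d lies entirely outside that band.
def exceeds_good_enough(ci_low, ci_high, delta=0.5):
    """True if the CI for the effect excludes the band [-delta, +delta]."""
    return ci_low > delta or ci_high < -delta

print(exceeds_good_enough(0.12, 0.61))  # False: a trivially small d remains plausible
print(exceeds_good_enough(0.55, 0.98))  # True: even the lower limit exceeds delta
```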
With more evolved psychological theories, we can also find use for likelihood ratios and Bayesian methods (Goodman, 1993; Greenwald, 1975). We quantitative behavioral scientists need not go out of business.
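As a minimal illustration of the likelihood-ratio idea, the sketch below compares how well two specific hypothesized effect sizes predict an observed result; the observed effect, its standard error, and the two hypothesized values are invented.

```python
# Likelihood ratio for two point hypotheses about a standardized effect d,
# given an observed estimate and its standard error (all values invented).
from statistics import NormalDist

d_obs, se = 0.40, 0.15
like_nil = NormalDist(0.0, se).pdf(d_obs)  # likelihood under d = 0
like_alt = NormalDist(0.5, se).pdf(d_obs)  # likelihood under d = .5
print(f"LR = {like_alt / like_nil:.1f}")   # relative evidence for d = .5 over d = 0
```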
Induction has long been a problem in the philosophy of science. Meehl (1990a) attributed to the distinguished philosopher Morris Raphael Cohen the saying "All logic texts are divided into two parts. In the first part, on deductive logic, the fallacies are explained; in the second part, on inductive logic, they are committed" (p. 110). We appeal to inductive logic to move from the particular results in hand to a theoretically useful generalization. As I have noted, we have a body of statistical techniques that, used intelligently, can facilitate our efforts. But given the problems of statistical induction, we must finally rely, as have the older sciences, on replication.
REFERENCES
Bakan, D. (1966). The test of significance in psychological research. Psychological Bulletin, 66, 1-29.
Berkson, J. (1938). Some difficulties of interpretation encountered in the application of the chi-square test. Journal of the American Statistical Association, 33, 526-542.
Birnbaum, A. (1961). Confidence curves: An omnibus technique for estimation and testing statistical hypotheses. Journal of the American Statistical Association, 56, 246-249.
Borenstein, M., Cohen, J., & Rothstein, H. (in press). Confidence intervals, effect size, and power [Computer program]. Hillsdale, NJ: Erlbaum.
Cleveland, W. S. (1993). Visualizing data. Summit, NJ: Hobart.
Cleveland, W. S., & McGill, M. E. (Eds.). (1988). Dynamic graphics for statistics. Belmont, CA: Wadsworth.
Cohen, J. (1962). The statistical power of abnormal-social psychological research: A review. Journal of Abnormal and Social Psychology, 65, 145-153.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Erlbaum.
Cohen, J. (1990). Things I have learned (so far). American Psychologist, 45, 1304-1312.
Dawes, R. M. (1988). Rational choice in an uncertain world. San Diego, CA: Harcourt Brace Jovanovich.
Falk, R., & Greenbaum, C. W. (in press). Significance tests die hard: The amazing persistence of a probabilistic misconception. Theory and Psychology.
Fisher, R. A. (1951). Statistical methods for research workers. Edinburgh, Scotland: Oliver & Boyd. (Original work published 1925)
Gigerenzer, G. (1993). The superego, the ego, and the id in statistical reasoning. In G. Keren & C. Lewis (Eds.), A handbook for data analysis in the behavioral sciences: Methodological issues (pp. 311-339). Hillsdale, NJ: Erlbaum.
Goodman, S. N. (1993). P values, hypothesis tests, and likelihood: Implications for epidemiology of a neglected historical debate. American Journal of Epidemiology, 137, 485-496.
Greenwald, A. G. (1975). Consequences of prejudice against the null hypothesis. Psychological Bulletin, 82, 1-20.
Hogben, L. (1957). Statistical theory. London: Allen & Unwin.
Lykken, D. T. (1968). Statistical significance in psychological research. Psychological Bulletin, 70, 151-159.
Meehl, P. E. (1967). Theory testing in psychology and physics: A meth-
odological paradox. Philosophy of Science, 34, 103-115.
Meehl, P. E. (1978). Theoretical risks and tabular asterisks: Sir Karl, Sir Ronald, and the slow progress of soft psychology. Journal of Consulting and Clinical Psychology, 46, 806-834.
Meehl, P. E. (1986). What social scientists don't understand. In D. W. Fiske & R. A. Shweder (Eds.), Metatheory in social science: Pluralisms and subjectivities (pp. 315-338). Chicago: University of Chicago Press.
Meehl, P. E. (1990a). Appraising and amending theories: The strategy of Lakatosian defense and two principles that warrant it. Psychological Inquiry, 1, 108-141.
Meehl, P. E. (1990b). Why summaries of research on psychological theories are often uninterpretable. Psychological Reports, 66 (Monograph Suppl. 1-V66), 195-244.
Morrison, D. E., & Henkel, R. E. (Eds.). (1970). The significance test controversy. Chicago: Aldine.
Oakes, M. (1986). Statistical inference: A commentary for the social and behavioral sciences. New York: Wiley.
Pollard, P., & Richardson, J. T. E. (1987). On the probability of making Type I errors. Psychological Bulletin, 102, 159-163.
Popper, K. (1959). The logic of scientific discovery. London: Hutchinson.
Rosenthal, R. (1979). The "file drawer problem" and tolerance for null results. Psychological Bulletin, 86, 638-641.
Rosenthal, R. (1993). Cumulating evidence. In G. Keren & C. Lewis (Eds.), A handbook for data analysis in the behavioral sciences: Methodological issues (pp. 519-559). Hillsdale, NJ: Erlbaum.
Rozeboom, W. W. (1960). The fallacy of the null hypothesis significance test. Psychological Bulletin, 57, 416-428.
Schmidt, F. L. (1992). What do data really mean? Research findings, meta-analysis, and cumulative knowledge in psychology. American Psychologist, 47, 1173-1181.
Sedlmeier, P., & Gigerenzer, G. (1989). Do studies of statistical power have an effect on the power of studies? Psychological Bulletin, 105, 309-316.