For anyone interested in issues related to p-values, quality of res...

These are heavyweight names in epidemiology/biostatistics. This pap...

This same sentiment was expressed in the American Statistical Assoc...

Another great resource for misinterpretations of p-values, which wa...

In 2015, the journal "Basic and Applied Social Psychology" (BASP) d...

This is often known as data dredging or p-hacking - an excellent Wi...

Unfortunately, this terminology has led to a misconception and many...

R.A. Fisher created the concept of a "p value" and the "null hypoth...

Multiple hypothesis testing is a huge problem in the sciences. Let'...

[Here's](https://fivethirtyeight.com/features/not-even-scientists-c...

This is an excellent short read in its own right!

Wikipedia has a fantastic and very accessible article on the [Misun...

If you've ever worked with a big data set, you'll know that it is v...

Estimating the science-wide false discovery rate is an active area ...

To appropriately combine evidence from more than one study, we need...

This is a misconception I have witnessed time and time again within...

An additional recommendation that has attracted lots of supporters,...

ESSAY

Statistical tests, P values, conﬁdence intervals, and power: a guide to misinterpretations

Sander Greenland 1 • Stephen J. Senn 2 • Kenneth J. Rothman 3 • John B. Carlin 4 • Charles Poole 5 • Steven N. Goodman 6 • Douglas G. Altman 7

Received: 9 April 2016 / Accepted: 9 April 2016 / Published online: 21 May 2016

The Author(s) 2016. This article is published with open access at Springerlink.com

Abstract Misinterpretation and abuse of statistical tests,

conﬁdence intervals, and statistical power have been

decried for decades, yet remain rampant. A key problem is

that there are no interpretations of these concepts that are at

once simple, intuitive, correct, and foolproof. Instead,

correct use and interpretation of these statistics requires an

attention to detail which seems to tax the patience of

working scientists. This high cognitive demand has led to

an epidemic of shortcut deﬁnitions and interpretations that

are simply wrong, sometimes disastrously so—and yet

these misinterpretations dominate much of the scientiﬁc

literature. In light of this problem, we provide deﬁnitions

and a discussion of basic statistics that are more general

and critical than typically found in traditional introductory

expositions. Our goal is to provide a resource for instruc-

tors, researchers, and consumers of statistics whose

knowledge of statistical theory and technique may be

limited but who wish to avoid and spot misinterpretations.

We emphasize how violation of often unstated analysis

protocols (such as selecting analyses for presentation based

on the P values they produce) can lead to small P values

even if the declared test hypothesis is correct, and can lead

to large P values even if that hypothesis is incorrect. We

then provide an explanatory list of 25 misinterpretations of

P values, conﬁdence intervals, and power. We conclude

with guidelines for improving statistical interpretation and

reporting.

Editor’s note This article has been published online as

supplementary material with an article of Wasserstein RL, Lazar NA.

The ASA’s statement on p-values: context, process and purpose. The

American Statistician 2016.

Albert Hofman, Editor-in-Chief EJE.

Sander Greenland

lesdomes@ucla.edu

Stephen J. Senn

stephen.senn@lih.lu

John B. Carlin

john.carlin@mcri.edu.au

Charles Poole

cpoole@unc.edu

Steven N. Goodman

steve.goodman@stanford.edu

Douglas G. Altman

doug.altman@csm.ox.ac.uk

1 Department of Epidemiology and Department of Statistics, University of California, Los Angeles, CA, USA

2 Competence Center for Methodology and Statistics, Luxembourg Institute of Health, Strassen, Luxembourg

3 RTI Health Solutions, Research Triangle Institute, Research Triangle Park, NC, USA

4 Clinical Epidemiology and Biostatistics Unit, Murdoch Children’s Research Institute, School of Population Health, University of Melbourne, Melbourne, VIC, Australia

5 Department of Epidemiology, Gillings School of Global Public Health, University of North Carolina, Chapel Hill, NC, USA

6 Meta-Research Innovation Center, Departments of Medicine and of Health Research and Policy, Stanford University School of Medicine, Stanford, CA, USA

7 Centre for Statistics in Medicine, Nufﬁeld Department of Orthopaedics, Rheumatology and Musculoskeletal Sciences, University of Oxford, Oxford, UK

Eur J Epidemiol (2016) 31:337–350

DOI 10.1007/s10654-016-0149-3

Keywords: Conﬁdence intervals · Hypothesis testing · Null testing · P value · Power · Signiﬁcance tests · Statistical testing

Introduction

Misinterpretation and abuse of statistical tests has been

decried for decades, yet remains so rampant that some

scientiﬁc journals discourage use of ‘‘statistical signiﬁ-

cance’’ (classifying results as ‘‘signiﬁcant’’ or not based on

a P value) [1]. One journal now bans all statistical tests and

mathematically related procedures such as conﬁdence

intervals [2], which has led to considerable discussion and

debate about the merits of such bans [3, 4].

Despite such bans, we expect that the statistical methods

at issue will be with us for many years to come. We thus

think it imperative that basic teaching as well as general

understanding of these methods be improved. Toward that

end, we attempt to explain the meaning of signiﬁcance

tests, conﬁdence intervals, and statistical power in a more

general and critical way than is traditionally done, and then

review 25 common misconceptions in light of our expla-

nations. We also discuss a few more subtle but nonetheless

pervasive problems, explaining why it is important to

examine and synthesize all results relating to a scientiﬁc

question, rather than focus on individual ﬁndings. We

further explain why statistical tests should never constitute

the sole input to inferences or decisions about associations

or effects. Among the many reasons are that, in most sci-

entiﬁc settings, the arbitrary classiﬁcation of results into

‘‘signiﬁcant’’ and ‘‘non-signiﬁcant’’ is unnecessary for and

often damaging to valid interpretation of data; and that

estimation of the size of effects and the uncertainty sur-

rounding our estimates will be far more important for

scientiﬁc inference and sound judgment than any such

classiﬁcation.

More detailed discussion of the general issues can be found

in many articles, chapters, and books on statistical methods and

their interpretation [5–20]. Speciﬁc issues are covered at length

in these sources and in the many peer-reviewed articles that

critique common misinterpretations of null-hypothesis testing

and ‘‘statistical signiﬁcance’’ [1, 12, 21–74].

Statistical tests, P values, and conﬁdence intervals: a caustic primer

Statistical models, hypotheses, and tests

Every method of statistical inference depends on a complex

web of assumptions about how data were collected and

analyzed, and how the analysis results were selected for

presentation. The full set of assumptions is embodied in a

statistical model that underpins the method. This model is a

mathematical representation of data variability, and thus

ideally would capture accurately all sources of such vari-

ability. Many problems arise however because this statis-

tical model often incorporates unrealistic or at best

unjustiﬁed assumptions. This is true even for so-called

‘‘non-parametric’’ methods, which (like other methods)

depend on assumptions of random sampling or random-

ization. These assumptions are often deceptively simple to

write down mathematically, yet in practice are difﬁcult to

satisfy and verify, as they may depend on successful

completion of a long sequence of actions (such as identi-

fying, contacting, obtaining consent from, obtaining

cooperation of, and following up subjects, as well as

adherence to study protocols for treatment allocation,

masking, and data analysis).

There is also a serious problem of deﬁning the scope of a

model, in that it should allow not only for a good repre-

sentation of the observed data but also of hypothetical

alternative data that might have been observed. The ref-

erence frame for data that ‘‘might have been observed’’ is

often unclear, for example if multiple outcome measures or

multiple predictive factors have been measured, and many

decisions surrounding analysis choices have been made

after the data were collected—as is invariably the case

[33].

The difﬁculty of understanding and assessing underlying

assumptions is exacerbated by the fact that the statistical

model is usually presented in a highly compressed and

abstract form—if presented at all. As a result, many

assumptions go unremarked and are often unrecognized by

users as well as consumers of statistics. Nonetheless, all

statistical methods and interpretations are premised on the

model assumptions; that is, on an assumption that the

model provides a valid representation of the variation we

would expect to see across data sets, faithfully reﬂecting

the circumstances surrounding the study and phenomena

occurring within it.

In most applications of statistical testing, one assump-

tion in the model is a hypothesis that a particular effect has

a speciﬁc size, and has been targeted for statistical analysis.

(For simplicity, we use the word ‘‘effect’’ when ‘‘associa-

tion or effect’’ would arguably be better in allowing for

noncausal studies such as most surveys.) This targeted

assumption is called the study hypothesis or test hypothesis,

and the statistical methods used to evaluate it are called

statistical hypothesis tests. Most often, the targeted effect

size is a ‘‘null’’ value representing zero effect (e.g., that the

study treatment makes no difference in average outcome),

in which case the test hypothesis is called the null

hypothesis. Nonetheless, it is also possible to test other

effect sizes. We may also test hypotheses that the effect

does or does not fall within a speciﬁc range; for example,

we may test the hypothesis that the effect is no greater than

a particular amount, in which case the hypothesis is said to

be a one-sided or dividing hypothesis [7, 8].

Much statistical teaching and practice has developed a

strong (and unhealthy) focus on the idea that the main aim

of a study should be to test null hypotheses. In fact most

descriptions of statistical testing focus only on testing null

hypotheses, and the entire topic has been called ‘‘Null

Hypothesis Signiﬁcance Testing’’ (NHST). This exclusive

focus on null hypotheses contributes to misunderstanding

of tests. Adding to the misunderstanding is that many

authors (including R.A. Fisher) use ‘‘null hypothesis’’ to

refer to any test hypothesis, even though this usage is at

odds with other authors and with ordinary English deﬁni-

tions of ‘‘null’’—as are statistical usages of ‘‘signiﬁcance’’

and ‘‘conﬁdence.’’

Uncertainty, probability, and statistical signiﬁcance

A more reﬁned goal of statistical analysis is to provide an

evaluation of certainty or uncertainty regarding the size of

an effect. It is natural to express such certainty in terms of

‘‘probabilities’’ of hypotheses. In conventional statistical

methods, however, ‘‘probability’’ refers not to hypotheses,

but to quantities that are hypothetical frequencies of data

patterns under an assumed statistical model. These methods

are thus called frequentist methods, and the hypothetical

frequencies they predict are called ‘‘frequency probabili-

ties.’’ Despite considerable training to the contrary, many

statistically educated scientists revert to the habit of mis-

interpreting these frequency probabilities as hypothesis

probabilities. (Even more confusingly, the term ‘‘likelihood

of a parameter value’’ is reserved by statisticians to refer to

the probability of the observed data given the parameter

value; it does not refer to a probability of the parameter

taking on the given value.)

Nowhere are these problems more rampant than in

applications of a hypothetical frequency called the P value,

also known as the ‘‘observed signiﬁcance level’’ for the test

hypothesis. Statistical ‘‘signiﬁcance tests’’ based on this

concept have been a central part of statistical analyses for

centuries [75]. The focus of traditional deﬁnitions of

P values and statistical signiﬁcance has been on null

hypotheses, treating all other assumptions used to compute

the P value as if they were known to be correct. Recog-

nizing that these other assumptions are often questionable

if not unwarranted, we will adopt a more general view of

the P value as a statistical summary of the compatibility

between the observed data and what we would predict or

expect to see if we knew the entire statistical model (all the

assumptions used to compute the P value) were correct.

Speciﬁcally, the distance between the data and the

model prediction is measured using a test statistic (such as

a t-statistic or a Chi squared statistic). The P value is then

the probability that the chosen test statistic would have

been at least as large as its observed value if every model

assumption were correct, including the test hypothesis.

This deﬁnition embodies a crucial point lost in traditional

deﬁnitions: In logical terms, the P value tests all the

assumptions about how the data were generated (the entire

model), not just the targeted hypothesis it is supposed to

test (such as a null hypothesis). Furthermore, these

assumptions include far more than what are traditionally

presented as modeling or probability assumptions—they

include assumptions about the conduct of the analysis, for

example that intermediate analysis results were not used to

determine which analyses would be presented.
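The definition above can be sketched numerically. The following is a minimal illustration, assuming (purely for the sake of the example) a model in which the test statistic is a z-statistic that follows a standard normal distribution when every assumption holds. The function name `two_sided_p_value` and all numbers are hypothetical and do not come from the article.

```python
import math

def two_sided_p_value(estimate, hypothesized, standard_error):
    """P value for the hypothesis that the effect equals `hypothesized`.

    Assumes a normal model with known standard error (an illustrative
    assumption): the probability that |Z| would be at least as large as
    the observed |z| if every model assumption, including the test
    hypothesis, were correct.
    """
    z = abs(estimate - hypothesized) / standard_error
    # For a standard normal Z, P(|Z| >= z) = erfc(z / sqrt(2)).
    return math.erfc(z / math.sqrt(2))

# Hypothetical data: observed difference 3.92, standard error 2, null value 0.
p = two_sided_p_value(3.92, 0.0, 2.0)
print(round(p, 3))  # 0.05
```

The number returned describes the compatibility of the data with the whole assumed model. If the standard error were misspecified, or the result had been selected for presentation because of its size, the same arithmetic would no longer carry this meaning.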

It is true that the smaller the P value, the more unusual

the data would be if every single assumption were correct;

but a very small P value does not tell us which assumption

is incorrect. For example, the P value may be very small

because the targeted hypothesis is false; but it may instead

(or in addition) be very small because the study protocols

were violated, or because it was selected for presentation

based on its small size. Conversely, a large P value indicates

only that the data are not unusual under the model,

but does not imply that the model or any aspect of it (such

as the targeted hypothesis) is correct; it may instead (or in

addition) be large because (again) the study protocols were

violated, or because it was selected for presentation based

on its large size.

The general deﬁnition of a P value may help one to

understand why statistical tests tell us much less than what

many think they do: Not only does a P value not tell us

whether the hypothesis targeted for testing is true or not; it

says nothing speciﬁcally related to that hypothesis unless

we can be completely assured that every other assumption

used for its computation is correct—an assurance that is

lacking in far too many studies.

Nonetheless, the P value can be viewed as a continuous

measure of the compatibility between the data and the

entire model used to compute it, ranging from 0 for complete

incompatibility to 1 for perfect compatibility, and in

this sense may be viewed as measuring the ﬁt of the model

to the data. Too often, however, the P value is degraded

into a dichotomy in which results are declared ‘‘statistically

signiﬁcant’’ if P falls on or below a cut-off (usually 0.05)

and declared ‘‘nonsigniﬁcant’’ otherwise. The terms ‘‘signiﬁcance

level’’ and ‘‘alpha level’’ (α) are often used to

refer to the cut-off; however, the term ‘‘signiﬁcance level’’

invites confusion of the cut-off with the P value itself.

Their difference is profound: the cut-off value α is supposed

to be ﬁxed in advance and is thus part of the study

design, unchanged in light of the data. In contrast, the

P value is a number computed from the data and thus an

analysis result, unknown until it is computed.

Moving from tests to estimates

We can vary the test hypothesis while leaving other

assumptions unchanged, to see how the P value differs

across competing test hypotheses. Usually, these test

hypotheses specify different sizes for a targeted effect; for

example, we may test the hypothesis that the average dif-

ference between two treatment groups is zero (the null

hypothesis), or that it is 20 or -10 or any size of interest.

The effect size whose test produced P = 1 is the size most

compatible with the data (in the sense of predicting what

was in fact observed) if all the other assumptions used in

the test (the statistical model) were correct, and provides a

point estimate of the effect under those assumptions. The

effect sizes whose test produced P > 0.05 will typically

deﬁne a range of sizes (e.g., from 11.0 to 19.5) that would

be considered more compatible with the data (in the sense

of the observations being closer to what the model pre-

dicted) than sizes outside the range—again, if the statistical

model were correct. This range corresponds to a

1 - 0.05 = 0.95 or 95 % conﬁdence interval, and provides

a convenient way of summarizing the results of hypothesis

tests for many effect sizes. Conﬁdence intervals are

examples of interval estimates.
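The construction just described, varying the test hypothesis and keeping the effect sizes with P > 0.05, can be sketched as follows. This is an illustration under an assumed normal model with known standard error. The estimate (15.25), standard error (2.17), and grid of candidate effect sizes are hypothetical values chosen only to echo the 11.0 to 19.5 range mentioned above.

```python
import math

def p_value(estimate, hypothesized, standard_error):
    # Two-sided P value under an assumed normal model with known standard error.
    z = abs(estimate - hypothesized) / standard_error
    return math.erfc(z / math.sqrt(2))

def compatible_range(estimate, standard_error, candidates, cutoff=0.05):
    # Keep every hypothesized effect size whose test gives P > cutoff.
    kept = [h for h in candidates if p_value(estimate, h, standard_error) > cutoff]
    return min(kept), max(kept)

# Hypothetical numbers, not taken from any study.
estimate, standard_error = 15.25, 2.17
candidates = [i * 0.25 for i in range(0, 121)]  # effect sizes 0.00, 0.25, ..., 30.00

print(p_value(estimate, estimate, standard_error))             # 1.0 at the point estimate
print(compatible_range(estimate, standard_error, candidates))  # (11.0, 19.5)
```

The grid search is deliberately naive. Under this simple model the same limits follow directly from the estimate plus or minus 1.96 standard errors, but the loop makes the link between the interval and the underlying family of tests explicit.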

Neyman [76] proposed the construction of conﬁdence

intervals in this way because they have the following

property: If one calculates, say, 95 % conﬁdence intervals

repeatedly in valid applications, 95 % of them, on average,

will contain (i.e., include or cover) the true effect size.

Hence, the speciﬁed conﬁdence level is called the coverage

probability. As Neyman stressed repeatedly, this coverage

probability is a property of a long sequence of conﬁdence

intervals computed from valid models, rather than a

property of any single conﬁdence interval.
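The long-run character of the coverage probability can be illustrated with a small simulation. This is a sketch under idealized assumptions: every hypothetical study produces a normally distributed estimate with a known standard error, so every interval comes from a valid model. The function name `coverage` and its parameter values are illustrative only.

```python
import random

def coverage(true_effect, standard_error, n_studies, seed=0):
    """Fraction of 95 % intervals that contain the true effect.

    Assumes each hypothetical study yields a normally distributed
    estimate with known standard error (an idealized, valid model).
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_studies):
        estimate = rng.gauss(true_effect, standard_error)
        lower = estimate - 1.96 * standard_error
        upper = estimate + 1.96 * standard_error
        if lower <= true_effect <= upper:
            hits += 1
    return hits / n_studies

# The result should be close to 0.95 when the model assumptions hold.
print(coverage(true_effect=10.0, standard_error=2.0, n_studies=20000))
```

Each simulated interval either contains the true effect or does not. The figure of roughly 95 % emerges only across the whole sequence, and only because the simulation satisfies every model assumption by construction.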

Many journals now require conﬁdence intervals, but

most textbooks and studies discuss P values only for the

null hypothesis of no effect. This exclusive focus on null

hypotheses in testing not only contributes to misunder-

standing of tests and underappreciation of estimation, but

also obscures the close relationship between P values and

conﬁdence intervals, as well as the weaknesses they share.

What P values, conﬁdence intervals, and power calculations don’t tell us

Much distortion arises from basic misunderstanding of

what P values and their relatives (such as conﬁdence

intervals) do not tell us. Therefore, based on the articles in

our reference list, we review prevalent P value

misinterpretations as a way of moving toward defensible

interpretations and presentations. We adopt the format of

Goodman [40] in providing a list of misinterpretations that

can be used to critically evaluate conclusions offered by

research reports and reviews. Every one of the bolded

statements in our list has contributed to statistical distortion

of the scientiﬁc literature, and we add the emphatic ‘‘No!’’

to underscore statements that are not only fallacious but

also not ‘‘true enough for practical purposes.’’

Common misinterpretations of single P values

1. The P value is the probability that the test

hypothesis is true; for example, if a test of the null

hypothesis gave P = 0.01, the null hypothesis has

only a 1 % chance of being true; if instead it gave

P = 0.40, the null hypothesis has a 40 % chance of

being true. No! The P value assumes the test

hypothesis is true—it is not a hypothesis probability

and may be far from any reasonable probability for the

test hypothesis. The P value simply indicates the degree

to which the data conform to the pattern predicted by

the test hypothesis and all the other assumptions used in

the test (the underlying statistical model). Thus

P = 0.01 would indicate that the data are not very close

to what the statistical model (including the test

hypothesis) predicted they should be, while P = 0.40

would indicate that the data are much closer to the

model prediction, allowing for chance variation.

2. The P value for the null hypothesis is the probability

that chance alone produced the observed association;

for example, if the P value for the null

hypothesis is 0.08, there is an 8 % probability that

chance alone produced the association. No! This is a

common variation of the ﬁrst fallacy and it is just as

false. To say that chance alone produced the observed

association is logically equivalent to asserting that

every assumption used to compute the P value is

correct, including the null hypothesis. Thus to claim

that the null P value is the probability that chance alone

produced the observed association is completely back-

wards: The P value is a probability computed assuming

chance was operating alone. The absurdity of the

common backwards interpretation might be appreci-

ated by pondering how the P value, which is a

probability deduced from a set of assumptions (the

statistical model), can possibly refer to the probability

of those assumptions.

Note: One often sees ‘‘alone’’ dropped from this

description (becoming ‘‘the P value for the null

hypothesis is the probability that chance produced the

observed association’’), so that the statement is more

ambiguous, but just as wrong.

3. A signiﬁcant test result (P ≤ 0.05) means that the

test hypothesis is false or should be rejected. No! A

small P value simply ﬂags the data as being unusual

if all the assumptions used to compute it (including

the test hypothesis) were correct; it may be small

because there was a large random error or because

some assumption other than the test hypothesis was

violated (for example, the assumption that this

P value was not selected for presentation because

it was below 0.05). P ≤ 0.05 only means that a

discrepancy from the hypothesis prediction (e.g., no

difference between treatment groups) would be as

large or larger than that observed no more than 5 %

of the time if only chance were creating the

discrepancy (as opposed to a violation of the test

hypothesis or a mistaken assumption).

4. A nonsigniﬁcant test result (P > 0.05) means that

the test hypothesis is true or should be accepted.

No! A large P value only suggests that the data are

not unusual if all the assumptions used to compute the

P value (including the test hypothesis) were correct.

The same data would also not be unusual under many

other hypotheses. Furthermore, even if the test

hypothesis is wrong, the P value may be large

because it was inﬂated by a large random error or

because of some other erroneous assumption (for

example, the assumption that this P value was not

selected for presentation because it was above 0.05).

P > 0.05 only means that a discrepancy from the

hypothesis prediction (e.g., no difference between

treatment groups) would be as large or larger than

that observed more than 5 % of the time if only

chance were creating the discrepancy.

5. A large P value is evidence in favor of the test

hypothesis. No! In fact, any P value less than 1

implies that the test hypothesis is not the hypothesis

most compatible with the data, because any other

hypothesis with a larger P value would be even

more compatible with the data. A P value cannot be

said to favor the test hypothesis except in relation to

those hypotheses with smaller P values. Further-

more, a large P value often indicates only that the

data are incapable of discriminating among many

competing hypotheses (as would be seen immedi-

ately by examining the range of the conﬁdence

interval). For example, many authors will misinter-

pret P = 0.70 from a test of the null hypothesis as

evidence for no effect, when in fact it indicates that,

even though the null hypothesis is compatible with

the data under the assumptions used to compute the

P value, it is not the hypothesis most compatible

with the data—that honor would belong to a

hypothesis with P = 1. But even if P = 1, there

will be many other hypotheses that are highly

consistent with the data, so that a deﬁnitive conclu-

sion of ‘‘no association’’ cannot be deduced from a

P value, no matter how large.

6. A null-hypothesis P value greater than 0.05 means

that no effect was observed, or that absence of an

effect was shown or demonstrated. No! Observing

P > 0.05 for the null hypothesis only means that the

null is one among the many hypotheses that have

P > 0.05. Thus, unless the point estimate (observed

association) equals the null value exactly, it is a

mistake to conclude from P > 0.05 that a study

found ‘‘no association’’ or ‘‘no evidence’’ of an

effect. If the null P value is less than 1 some

association must be present in the data, and one

must look at the point estimate to determine the

effect size most compatible with the data under the

assumed model.

7. Statistical signiﬁcance indicates a scientiﬁcally or

substantively important relation has been detected.

No! Especially when a study is large, very minor

effects or small assumption violations can lead to

statistically signiﬁcant tests of the null hypothesis.

Again, a small null P value simply ﬂags the data as

being unusual if all the assumptions used to compute

it (including the null hypothesis) were correct; but the

way the data are unusual might be of no clinical

interest. One must look at the conﬁdence interval to

determine which effect sizes of scientiﬁc or other

substantive (e.g., clinical) importance are relatively

compatible with the data, given the model.

8. Lack of statistical signiﬁcance indicates that the

effect size is small. No! Especially when a study is

small, even large effects may be ‘‘drowned in noise’’

and thus fail to be detected as statistically signiﬁcant

by a statistical test. A large null P value simply ﬂags

the data as not being unusual if all the assumptions

used to compute it (including the test hypothesis)

were correct; but the same data will also not be

unusual under many other models and hypotheses

besides the null. Again, one must look at the

conﬁdence interval to determine whether it includes

effect sizes of importance.

9. The P value is the chance of our data occurring if

the test hypothesis is true; for example, P = 0.05

means that the observed association would occur

only 5 % of the time under the test hypothesis. No!

The P value refers not only to what we observed, but

also observations more extreme than what we

observed (where ‘‘extremity’’ is measured in a

particular way). And again, the P value refers to a

data frequency when all the assumptions used to

compute it are correct. In addition to the test

hypothesis, these assumptions include randomness in

sampling, treatment assignment, loss, and missing-

ness, as well as an assumption that the P value was

not selected for presentation based on its size or some

other aspect of the results.

10. If you reject the test hypothesis because P ≤ 0.05,

the chance you are in error (the chance your

‘‘signiﬁcant ﬁnding’’ is a false positive) is 5 %. No!

To see why this description is false, suppose the test

hypothesis is in fact true. Then, if you reject it, the chance

you are in error is 100 %, not 5 %. The 5 % refers only to

how often you would reject it, and therefore be in error,

over very many uses of the test across different studies

when the test hypothesis and all other assumptions used

for the test are true. It does not refer to your single use of

the test, which may have been thrown off by assumption

violations as well as random errors. This is yet another

version of misinterpretation #1.

11. P = 0.05 and P ≤ 0.05 mean the same thing. No!

This is like saying reported height = 2 m and

reported height ≤ 2 m are the same thing:

‘‘height = 2 m’’ would include few people and those

people would be considered tall, whereas ‘‘height

≤ 2 m’’ would include most people including small

children. Similarly, P = 0.05 would be considered a

borderline result in terms of statistical signiﬁcance,

whereas P ≤ 0.05 lumps borderline results together

with results very incompatible with the model (e.g.,

P = 0.0001) thus rendering its meaning vague, for no

good purpose.

12. P values are properly reported as inequalities (e.g.,

report ‘‘P < 0.02’’ when P = 0.015 or report

‘‘P > 0.05’’ when P = 0.06 or P = 0.70). No! This is

bad practice because it makes it difﬁcult or impossible for

the reader to accurately interpret the statistical result. Only

when the P value is very small (e.g., under 0.001) does an

inequality become justiﬁable: There is little practical

difference among very small P values when the assump-

tions used to compute P values are not known with

enough certainty to justify such precision, and most

methods for computing P values are not numerically

accurate below a certain point.

13. Statistical signiﬁcance is a property of the phe-

nomenon being studied, and thus statistical tests

detect signiﬁcance. No! This misinterpretation is

promoted when researchers state that they have or

have not found ‘‘evidence of’’ a statistically signiﬁcant

effect. The effect being tested either exists or

does not exist. ‘‘Statistical signiﬁcance’’ is a dichoto-

mous description of a P value (that it is below the

chosen cut-off) and thus is a property of a result of a

statistical test; it is not a property of the effect or

population being studied.

14. One should always use two-sided P values. No!

Two-sided P values are designed to test hypotheses that

the targeted effect measure equals a speciﬁc value (e.g.,

zero), and is neither above nor below this value. When,

however, the test hypothesis of scientiﬁc or practical

interest is a one-sided (dividing) hypothesis, a one-

sided P value is appropriate. For example, consider the

practical question of whether a new drug is at least as

good as the standard drug for increasing survival time.

This question is one-sided, so testing this hypothesis

calls for a one-sided P value. Nonetheless, because

two-sided P values are the usual default, it will be

important to note when and why a one-sided P value is

being used instead.
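The contrast in item 14 between a one-sided (dividing) hypothesis and a two-sided point hypothesis can be sketched as follows. The example assumes a normal model with known standard error. The trial numbers and the function names `one_sided_p` and `two_sided_p` are hypothetical illustrations, not taken from the article.

```python
import math

def normal_cdf(z):
    # Standard normal cumulative distribution function.
    return 0.5 * math.erfc(-z / math.sqrt(2))

def one_sided_p(estimate, bound, standard_error):
    """P value for the dividing hypothesis 'effect >= bound'.

    Small values arise only when the estimate falls well below the bound.
    Assumes a normal model with known standard error (illustrative).
    """
    return normal_cdf((estimate - bound) / standard_error)

def two_sided_p(estimate, value, standard_error):
    # P value for the point hypothesis 'effect == value'.
    z = abs(estimate - value) / standard_error
    return math.erfc(z / math.sqrt(2))

# Hypothetical trial: estimated survival difference of -3 months, standard error 2.
print(round(one_sided_p(-3.0, 0.0, 2.0), 3))  # 0.067
print(round(two_sided_p(-3.0, 0.0, 2.0), 3))  # 0.134
```

The two numbers answer different questions. The first addresses whether the new drug is at least as good as the standard, and the second whether the difference is exactly zero, which is why a report should state which kind of P value was computed and why.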

There are other interpretations of P values that are

controversial, in that whether a categorical ‘‘No!’’ is war-

ranted depends on one’s philosophy of statistics and the

precise meaning given to the terms involved. The disputed

claims deserve recognition if one wishes to avoid such

controversy.

For example, it has been argued that P values overstate

evidence against test hypotheses, based on directly com-

paring P values against certain quantities (likelihood ratios

and Bayes factors) that play a central role as evidence

measures in Bayesian analysis [37, 72, 77–83]. Nonetheless,

many other statisticians do not accept these quantities

as gold standards, and instead point out that P values

summarize crucial evidence needed to gauge the error rates

of decisions based on statistical tests (even though they are

far from sufﬁcient for making those decisions). Thus, from

this frequentist perspective, P values do not overstate

evidence and may even be considered as measuring one

aspect of evidence [7, 8, 84–87], with 1 - P measuring

evidence against the model used to compute the P value.

See also Murtaugh [88] and its accompanying discussion.

Common misinterpretations of P value comparisons and predictions

Some of the most severe distortions of the scientiﬁc liter-

ature produced by statistical testing involve erroneous

comparison and synthesis of results from different studies

or study subgroups. Among the worst are:

15. When the same hypothesis is tested in different

studies and none or a minority of the tests are

statistically signiﬁcant (all P > 0.05), the overall

evidence supports the hypothesis. No! This belief is

often used to claim that a literature supports no effect

when the opposite is the case. It reflects a tendency of

researchers to ‘‘overestimate the power of most

research’’ [89]. In reality, every study could fail to

reach statistical signiﬁcance and yet when combined

show a statistically signiﬁcant association and persua-

sive evidence of an effect. For example, if there were

ﬁve studies each with P = 0.10, none would be

significant at the 0.05 level; but when these P values are

combined using the Fisher formula [9], the overall

P value would be 0.01. There are many real examples

of persuasive evidence for important effects when few

studies or even no study reported ‘‘statistically signif-

icant’’ associations [90, 91]. Thus, lack of statistical

signiﬁcance of individual studies should not be taken as

implying that the totality of evidence supports no

effect.
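
The five-study example can be reproduced with a short Python sketch of the Fisher formula. It assumes the studies are independent and that all other assumptions hold; the function name `fisher_combined_p` is ours. Under those assumptions the statistic -2 Σ ln(P_i) is chi-squared with 2k degrees of freedom, and for even degrees of freedom the upper-tail probability has a closed form, so only the standard library is needed.

```python
import math

def fisher_combined_p(p_values):
    """Combine independent P values with Fisher's method.

    Returns the chi-squared statistic (2k degrees of freedom for k
    studies) and its upper-tail probability.
    """
    k = len(p_values)
    statistic = -2.0 * sum(math.log(p) for p in p_values)
    half = statistic / 2.0
    # Closed-form chi-squared upper tail for even degrees of freedom 2k:
    # exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    tail = math.exp(-half) * sum(half ** i / math.factorial(i) for i in range(k))
    return statistic, tail

statistic, combined_p = fisher_combined_p([0.10] * 5)
print(f"chi-squared = {statistic:.2f} on 10 df, combined P = {combined_p:.3f}")
```

The combined P value is close to 0.01 even though no single study reaches the 0.05 level.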

16. When the same hypothesis is tested in two different

populations and the resulting P values are on

opposite sides of 0.05, the results are conﬂicting.

No! Statistical tests are sensitive to many differences

between study populations that are irrelevant to

whether their results are in agreement, such as the

sizes of compared groups in each population. As a

consequence, two studies may provide very different

P values for the same test hypothesis and yet be in

perfect agreement (e.g., may show identical observed

associations). For example, suppose we had two

randomized trials A and B of a treatment, identical

except that trial A had a known standard error of 2 for

the mean difference between treatment groups

whereas trial B had a known standard error of 1 for

the difference. If both trials observed a difference

between treatment groups of exactly 3, the usual

normal test would produce P = 0.13 in A but

P = 0.003 in B. Despite their difference in P values,

the test of the hypothesis of no difference in effect

across studies would have P = 1, reﬂecting the

perfect agreement of the observed mean differences

from the studies. Differences between results must be

evaluated directly, for example by estimating and

testing those differences to produce a confidence

interval and a P value comparing the results (often

called analysis of heterogeneity, interaction, or

modiﬁcation).

17. When the same hypothesis is tested in two different

populations and the same P values are obtained, the

results are in agreement. No! Again, tests are sensitive

to many differences between populations that are irrel-

evant to whether their results are in agreement. Two

different studies may even exhibit identical P values for

testing the same hypothesis yet also exhibit clearly

different observed associations. For example, suppose

randomized experiment A observed a mean difference

between treatment groups of 3.00 with standard error

1.00, while B observed a mean difference of 12.00 with

standard error 4.00. Then the standard normal test would

produce P = 0.003 in both; yet the test of the hypothesis

of no difference in effect across studies gives P = 0.03,

reﬂecting the large difference (12.00 - 3.00 = 9.00)

between the mean differences.
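
The two-trial examples in items 16 and 17 can be checked with a short calculation. The following Python sketch works under the text's assumptions (independent normal estimates with known standard errors); the helper names are ours, and the test of "no difference in effect across studies" is taken to be the usual z test on the difference of the two estimates.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()

def two_sided_p(estimate, se):
    """Two-sided P value for the hypothesis that the true value is zero."""
    return 2.0 * STD_NORMAL.cdf(-abs(estimate / se))

def difference_p(est_a, se_a, est_b, se_b):
    """Two-sided P value for no difference between two independent estimates."""
    se_diff = sqrt(se_a ** 2 + se_b ** 2)
    return two_sided_p(est_b - est_a, se_diff)

# Item 16: identical estimates, different standard errors.
p16_a = two_sided_p(3.0, 2.0)
p16_b = two_sided_p(3.0, 1.0)
p16_diff = difference_p(3.0, 2.0, 3.0, 1.0)

# Item 17: identical P values, very different estimates.
p17_a = two_sided_p(3.0, 1.0)
p17_b = two_sided_p(12.0, 4.0)
p17_diff = difference_p(3.0, 1.0, 12.0, 4.0)

print(f"item 16: P_A = {p16_a:.2f}, P_B = {p16_b:.3f}, P_diff = {p16_diff:.2f}")
print(f"item 17: P_A = {p17_a:.3f}, P_B = {p17_b:.3f}, P_diff = {p17_diff:.2f}")
```

In both cases the comparison of the studies rests on `difference_p`, not on a side-by-side reading of the two single-study P values.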

18. If one observes a small P value, there is a good

chance that the next study will produce a P value

at least as small for the same hypothesis. No! This is

false even under the ideal condition that both studies are

independent and all assumptions including the test

hypothesis are correct in both studies. In that case, if

(say) one observes P = 0.03, the chance that the new

study will show P ≤ 0.03 is only 3 %; thus the chance

the new study will show a P value as small or smaller

(the ‘‘replication probability’’) is exactly the observed

P value! If on the other hand the small P value arose

solely because the true effect exactly equaled its

observed estimate, there would be a 50 % chance that

a repeat experiment of identical design would have a

larger P value [37]. In general, the size of the new

P value will be extremely sensitive to the study size and

the extent to which the test hypothesis or other

assumptions are violated in the new study [86]; in

particular, P may be very small or very large depending

on whether the study and the violations are large or

small.
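
Both replication figures in item 18 can be sketched in Python. The sketch assumes a two-sided normal test in which the new study's z statistic is normal with unit variance and a mean of `true_z` (the true effect in standard-error units); `prob_new_p_at_most` is a name introduced here for illustration.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()

def prob_new_p_at_most(p_obs, true_z=0.0):
    """Chance that an independent study of identical design gives a
    two-sided P value <= p_obs, when its z statistic ~ N(true_z, 1)."""
    z_crit = STD_NORMAL.inv_cdf(1.0 - p_obs / 2.0)
    lower_tail = STD_NORMAL.cdf(-z_crit - true_z)
    upper_tail = 1.0 - STD_NORMAL.cdf(z_crit - true_z)
    return lower_tail + upper_tail

p_obs = 0.03
z_obs = STD_NORMAL.inv_cdf(1.0 - p_obs / 2.0)

# Test hypothesis correct (true_z = 0): replication probability equals p_obs.
replication_under_null = prob_new_p_at_most(p_obs)

# True effect exactly equal to the observed estimate (true_z = z_obs):
# chance that the repeat experiment gives a larger P value.
larger_p_if_estimate_exact = 1.0 - prob_new_p_at_most(p_obs, true_z=z_obs)

print(f"{replication_under_null:.3f}, {larger_p_if_estimate_exact:.3f}")
```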

Finally, although it is (we hope obviously) wrong to do

so, one sometimes sees the null hypothesis compared with

another (alternative) hypothesis using a two-sided P value

for the null and a one-sided P value for the alternative. This

comparison is biased in favor of the null in that the two-

sided test will falsely reject the null only half as often as

the one-sided test will falsely reject the alternative (again,

under all the assumptions used for testing).

Common misinterpretations of conﬁdence intervals

Most of the above misinterpretations translate into an

analogous misinterpretation for conﬁdence intervals. For

example, another misinterpretation of P > 0.05 is that it

means the test hypothesis has only a 5 % chance of being

false, which in terms of a conﬁdence interval becomes the

common fallacy:

19. The speciﬁc 95 % conﬁdence interval presented by

a study has a 95 % chance of containing the true

effect size. No! A reported conﬁdence interval is a range

between two numbers. The frequency with which an

observed interval (e.g., 0.72–2.88) contains the true effect

is either 100 % if the true effect is within the interval or

0 % if not; the 95 % refers only to how often 95 %

conﬁdence intervals computed from very many studies

would contain the true size if all the assumptions used to

compute the intervals were correct. It is possible to

compute an interval that can be interpreted as having

95 % probability of containing the true value; nonethe-

less, such computations require not only the assumptions

used to compute the conﬁdence interval, but also further

assumptions about the size of effects in the model. These

further assumptions are summarized in what is called a

prior distribution, and the resulting intervals are usually

called Bayesian posterior (or credible) intervals to

distinguish them from conﬁdence intervals [18].
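
The long-run meaning of the 95 % can be illustrated by simulation. This Python sketch assumes an idealized setting chosen only for illustration: every study yields an unbiased normal estimate with the same known standard error, and the true effect (1.5 here) is fixed. The function name, the seed, and the numbers are hypothetical.

```python
import random
from statistics import NormalDist

def simulate_coverage(true_effect, se, n_studies, level=0.95, seed=12345):
    """Fraction of simulated confidence intervals that contain true_effect."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    covered = 0
    for _ in range(n_studies):
        estimate = rng.gauss(true_effect, se)
        lower, upper = estimate - z * se, estimate + z * se
        # Each single interval either contains the true effect or does not.
        covered += lower <= true_effect <= upper
    return covered / n_studies

coverage = simulate_coverage(true_effect=1.5, se=0.5, n_studies=20000)
print(f"fraction of intervals containing the true effect: {coverage:.3f}")
```

The fraction is a property of the procedure over many studies; nothing in the simulation assigns a 95 % probability to any one of the intervals it generates.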

Symmetrically, the misinterpretation of a small P value as

disproving the test hypothesis could be translated into:

20. An effect size outside the 95 % conﬁdence interval

has been refuted (or excluded) by the data. No! As

with the P value, the conﬁdence interval is computed

from many assumptions, the violation of which may

have led to the results. Thus it is the combination of

the data with the assumptions, along with the arbitrary

95 % criterion, that is needed to declare an effect

size outside the interval is in some way incompatible

with the observations. Even then, judgements as

extreme as saying the effect size has been refuted or

excluded will require even stronger conditions.

As with P values, naïve comparison of confidence intervals

can be highly misleading:

21. If two conﬁdence intervals overlap, the difference

between two estimates or studies is not signiﬁcant.

No! The 95 % conﬁdence intervals from two subgroups

or studies may overlap substantially and yet the test for

difference between them may still produce P < 0.05.

Suppose for example, two 95 % conﬁdence intervals for

means from normal populations with known variances

are (1.04, 4.96) and (4.16, 19.84); these intervals

overlap, yet the test of the hypothesis of no difference

in effect across studies gives P = 0.03. As with P values,

comparison between groups requires statistics that

directly test and estimate the differences across groups.

It can, however, be noted that if the two 95 % conﬁdence

intervals fail to overlap, then when using the same

assumptions used to compute the conﬁdence intervals

we will find P < 0.05 for the difference; and if one of the

95 % intervals contains the point estimate from the other

group or study, we will find P > 0.05 for the difference.
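
The overlapping-interval example in item 21 can be worked through in Python. The sketch assumes, as in the text, normal estimates with known variances, so that each 95 % interval is the estimate plus or minus about 1.96 standard errors; the helper names are ours. It recovers each estimate and standard error from the interval limits and then tests the difference directly.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()
Z95 = STD_NORMAL.inv_cdf(0.975)  # about 1.96

def estimate_and_se(lower, upper):
    """Recover the point estimate and standard error from 95 % limits."""
    return (lower + upper) / 2.0, (upper - lower) / (2.0 * Z95)

def intervals_overlap(ci_a, ci_b):
    return ci_a[0] <= ci_b[1] and ci_b[0] <= ci_a[1]

def difference_p(ci_a, ci_b):
    """Two-sided P value for no difference between the two estimates."""
    est_a, se_a = estimate_and_se(*ci_a)
    est_b, se_b = estimate_and_se(*ci_b)
    z = (est_b - est_a) / sqrt(se_a ** 2 + se_b ** 2)
    return 2.0 * STD_NORMAL.cdf(-abs(z))

ci_a, ci_b = (1.04, 4.96), (4.16, 19.84)
overlap = intervals_overlap(ci_a, ci_b)
p_diff = difference_p(ci_a, ci_b)
print(f"overlap: {overlap}, P for difference = {p_diff:.2f}")
```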

Finally, as with P values, the replication properties of

conﬁdence intervals are usually misunderstood:

22. An observed 95 % conﬁdence interval predicts

that 95 % of the estimates from future studies will

fall inside the observed interval. No! This statement

is wrong in several ways. Most importantly, under the

model, 95 % is the frequency with which other

unobserved intervals will contain the true effect, not

how frequently the one interval being presented will

contain future estimates. In fact, even under ideal

conditions the chance that a future estimate will fall

within the current interval will usually be much less

than 95 %. For example, if two independent studies of

the same quantity provide unbiased normal point

estimates with the same standard errors, the chance

that the 95 % conﬁdence interval for the ﬁrst study

contains the point estimate from the second is 83 %

(which is the chance that the difference between the

two estimates is less than 1.96 standard errors). Again,

an observed interval either does or does not contain the

true effect; the 95 % refers only to how often 95 %

conﬁdence intervals computed from very many studies

would contain the true effect if all the assumptions used

to compute the intervals were correct.
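
The 83 % figure in item 22 follows from the fact that the difference between two independent estimates with a common standard error has a standard error that is √2 times larger. A small Python sketch under the text's assumptions (unbiased normal estimates of the same quantity); the function name and the `se_ratio` parameter are introduced here for illustration:

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()

def prob_future_estimate_inside(level=0.95, se_ratio=1.0):
    """Chance that the point estimate of a second, independent study falls
    inside the first study's confidence interval.

    se_ratio is the second study's standard error divided by the first's.
    """
    z = STD_NORMAL.inv_cdf(0.5 + level / 2.0)
    # In units of the first standard error, the difference of the two
    # estimates has standard error sqrt(1 + se_ratio^2).
    scaled = z / sqrt(1.0 + se_ratio ** 2)
    return 2.0 * STD_NORMAL.cdf(scaled) - 1.0

same_precision = prob_future_estimate_inside()
print(f"chance the second estimate lands in the first interval: {same_precision:.3f}")
```

The probability approaches 95 % only when the second study is far more precise than the first (`se_ratio` near zero).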

23. If one 95 % conﬁdence interval includes the null

value and another excludes that value, the interval

excluding the null is the more precise one. No!

When the model is correct, precision of statistical

estimation is measured directly by conﬁdence interval

width (measured on the appropriate scale). It is not a

matter of inclusion or exclusion of the null or any other

value. Consider two 95 % conﬁdence intervals for a

difference in means, one with limits of 5 and 40, the

other with limits of -5 and 10. The ﬁrst interval

excludes the null value of 0, but is 35 units wide. The

second includes the null value, but is less than half as wide and

therefore much more precise.

In addition to the above misinterpretations, 95 % conﬁ-

dence intervals force the 0.05-level cutoff on the reader,

lumping together all effect sizes with P > 0.05, and in this

way are as bad as presenting P values as dichotomies.

Nonetheless, many authors agree that conﬁdence intervals are

superior to tests and P values because they allow one to shift

focus away from the null hypothesis, toward the full range of

effect sizes compatible with the data—a shift recommended

by many authors and a growing number of journals. Another

way to bring attention to non-null hypotheses is to present

their P values; for example, one could provide or demand

P values for those effect sizes that are recognized as scien-

tiﬁcally reasonable alternatives to the null.

As with P values, further cautions are needed to avoid

misinterpreting conﬁdence intervals as providing sharp

answers when none are warranted. The hypothesis which

says the point estimate is the correct effect will have the

largest P value (P = 1 in most cases), and hypotheses inside

a conﬁdence interval will have higher P values than

hypotheses outside the interval. The P values will vary

greatly, however, among hypotheses inside the interval, as

well as among hypotheses on the outside. Also, two

hypotheses may have nearly equal P values even though one

of the hypotheses is inside the interval and the other is outside.

Thus, if we use P values to measure compatibility of

hypotheses with data and wish to compare hypotheses with

this measure, we need to examine their P values directly, not

simply ask whether the hypotheses are inside or outside the

interval. This need is particularly acute when (as usual) one

of the hypotheses under scrutiny is a null hypothesis.
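
A small numerical sketch may make this concrete. The Python below assumes a hypothetical normal estimate of 3 with a known standard error of 1 (numbers chosen for illustration only) and computes two-sided P values for several hypothesized effect sizes, including one just inside and one just outside the 95 % confidence interval.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()

def p_value_for(hypothesized, estimate, se):
    """Two-sided P value for the hypothesis: true effect == hypothesized."""
    return 2.0 * STD_NORMAL.cdf(-abs(estimate - hypothesized) / se)

ESTIMATE, SE = 3.0, 1.0  # hypothetical study result
Z95 = STD_NORMAL.inv_cdf(0.975)
lower, upper = ESTIMATE - Z95 * SE, ESTIMATE + Z95 * SE  # about (1.04, 4.96)

p_at_estimate = p_value_for(3.0, ESTIMATE, SE)   # largest possible P value
p_just_inside = p_value_for(1.1, ESTIMATE, SE)   # hypothesis inside the interval
p_just_outside = p_value_for(1.0, ESTIMATE, SE)  # hypothesis outside the interval
p_null = p_value_for(0.0, ESTIMATE, SE)          # null hypothesis, also outside

for label, p in [("estimate", p_at_estimate), ("1.1", p_just_inside),
                 ("1.0", p_just_outside), ("0.0", p_null)]:
    print(f"hypothesized effect {label}: P = {p:.3f}")
```

The hypotheses 1.1 and 1.0 fall on opposite sides of the interval limit yet have nearly equal P values, while hypotheses within the interval range from P just above 0.05 up to P = 1.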

Common misinterpretations of power

The power of a test to detect a correct alternative

hypothesis is the pre-study probability that the test will

reject the test hypothesis (e.g., the probability that P will

not exceed a pre-speciﬁed cut-off such as 0.05). (The

corresponding pre-study probability of failing to reject the

test hypothesis when the alternative is correct is one minus

the power, also known as the Type-II or beta error rate [84].)

As with P values and confidence intervals, this probability

is defined over repetitions of the same study design

and so is a frequency probability. One source of reasonable

alternative hypotheses are the effect sizes that were used to

compute power in the study proposal. Pre-study power

calculations do not, however, measure the compatibility of

these alternatives with the data actually observed, while

power calculated from the observed data is a direct (if

obscure) transformation of the null P value and so provides

no test of the alternatives. Thus, presentation of power does

not obviate the need to provide interval estimates and

direct tests of the alternatives.

For these reasons, many authors have condemned use of

power to interpret estimates and statistical tests [42, 92–

97], arguing that (in contrast to conﬁdence intervals) it

distracts attention from direct comparisons of hypotheses

and introduces new misinterpretations, such as:

24. If you accept the null hypothesis because the null

P value exceeds 0.05 and the power of your test is

90 %, the chance you are in error (the chance that

your ﬁnding is a false negative) is 10 %. No! If the

null hypothesis is false and you accept it, the chance

you are in error is 100 %, not 10 %. Conversely, if the

null hypothesis is true and you accept it, the chance

you are in error is 0 %. The 10 % refers only to how

often you would be in error over very many uses of

the test across different studies when the particular

alternative used to compute power is correct and all

other assumptions used for the test are correct in all

the studies. It does not refer to your single use of the

test or your error rate under any alternative effect size

other than the one used to compute power.

It can be especially misleading to compare results for two

hypotheses by presenting a test or P value for one and power

for the other. For example, testing the null by seeing whether

P ≤ 0.05 with a power less than 1 - 0.05 = 0.95 for the

alternative (as done routinely) will bias the comparison in

favor of the null because it entails a lower probability of

incorrectly rejecting the null (0.05) than of incorrectly

accepting the null when the alternative is correct. Thus, claims

about relative support or evidence need to be based on direct

and comparable measures of support or evidence for both

hypotheses, otherwise mistakes like the following will occur:

25. If the null P value exceeds 0.05 and the power of this

test is 90 % at an alternative, the results support the

null over the alternative. This claim seems intuitive to

many, but counterexamples are easy to construct in

which the null P value is between 0.05 and 0.10, and yet

there are alternatives whose own P value exceeds 0.10

and for which the power is 0.90. Parallel results ensue

for other accepted measures of compatibility, evidence,

and support, indicating that the data show lower

compatibility with and more evidence against the null

than the alternative, despite the fact that the null P value

is ‘‘not signiﬁcant’’ at the 0.05 alpha level and the

power against the alternative is ‘‘very high’’ [42].
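
One such counterexample can be constructed in a few lines of Python. The sketch assumes a two-sided normal test at the 0.05 level with everything expressed in standard-error units; the observed z statistic of 1.8 is a hypothetical value chosen for illustration, and the alternative is placed where the test has 90 % power.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()
ALPHA = 0.05
Z_CRIT = STD_NORMAL.inv_cdf(1.0 - ALPHA / 2.0)  # about 1.96

def two_sided_p(z_obs, hypothesized_z=0.0):
    """Two-sided P value for the hypothesis that the true effect, in
    standard-error units, equals hypothesized_z."""
    return 2.0 * STD_NORMAL.cdf(-abs(z_obs - hypothesized_z))

def power(alt_z):
    """Chance that the null P value is <= ALPHA when z ~ N(alt_z, 1)."""
    return STD_NORMAL.cdf(-Z_CRIT - alt_z) + 1.0 - STD_NORMAL.cdf(Z_CRIT - alt_z)

# Alternative at which the test has (very nearly) 90 % power: about 3.24.
alt_z = Z_CRIT + STD_NORMAL.inv_cdf(0.90)

z_obs = 1.8  # hypothetical observed z statistic
null_p = two_sided_p(z_obs)
alt_p = two_sided_p(z_obs, alt_z)
alt_power = power(alt_z)

print(f"power = {alt_power:.2f}, null P = {null_p:.3f}, alternative P = {alt_p:.3f}")
```

Here the null is "not significant" and the power against the alternative is 90 %, yet the alternative has the larger P value, so by this measure the data are more compatible with the alternative than with the null.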

Despite its shortcomings for interpreting current data,

power can be useful for designing studies and for under-

standing why replication of ‘‘statistical signiﬁcance’’ will

often fail even under ideal conditions. Studies are often

designed or claimed to have 80 % power against a key

alternative when using a 0.05 signiﬁcance level, although

in execution often have less power due to unanticipated

problems such as low subject recruitment. Thus, if the

alternative is correct and the actual power of two studies is

80 %, the chance that the studies will both show P ≤ 0.05

will at best be only 0.80(0.80) = 64 %; furthermore, the

chance that one study shows P ≤ 0.05 and the other does

not (and thus will be misinterpreted as showing conﬂicting

results) is 2(0.80)0.20 = 32 % or about 1 chance in 3.

Similar calculations taking account of typical problems

suggest that one could anticipate a ‘‘replication crisis’’ even

if there were no publication or reporting bias, simply

because current design and testing conventions treat individual

study results as dichotomous outputs of ‘‘significant’’/‘‘nonsignificant’’

or ‘‘reject’’/‘‘accept.’’
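
The two-study arithmetic above generalizes directly to any number of studies. A minimal Python sketch, assuming independent studies that all have the same actual power against a correct alternative (the function name is ours):

```python
def replication_outcomes(power, n_studies=2):
    """Probabilities that all, none, or only some of n independent studies
    show P <= 0.05 when each has the given power."""
    all_significant = power ** n_studies
    none_significant = (1.0 - power) ** n_studies
    mixed = 1.0 - all_significant - none_significant
    return all_significant, none_significant, mixed

both, neither, mixed = replication_outcomes(0.80)
print(f"both: {both:.2f}, neither: {neither:.2f}, apparent conflict: {mixed:.2f}")

# With five such studies, unanimous "significance" becomes the exception.
all_five, _, _ = replication_outcomes(0.80, n_studies=5)
print(f"all five significant: {all_five:.2f}")
```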

A statistical model is much more

than an equation with Greek letters

The above list could be expanded by reviewing the

research literature. We will however now turn to direct

discussion of an issue that has been receiving more atten-

tion of late, yet is still widely overlooked or interpreted too

narrowly in statistical teaching and presentations: the assumption that the

statistical model used to obtain the results is correct.

Too often, the full statistical model is treated as a simple

regression or structural equation in which effects are

represented by parameters denoted by Greek letters. ‘‘Model

checking’’ is then limited to tests of ﬁt or testing additional

terms for the model. Yet these tests of ﬁt themselves make

further assumptions that should be seen as part of the full

model. For example, all common tests and conﬁdence

intervals depend on assumptions of random selection for

observation or treatment and random loss or missingness

within levels of controlled covariates. These assumptions

have gradually come under scrutiny via sensitivity and bias

analysis [98], but such methods remain far removed from the

basic statistical training given to most researchers.

Less often stated is the even more crucial assumption

that the analyses themselves were not guided toward

ﬁnding nonsigniﬁcance or signiﬁcance (analysis bias), and

that the analysis results were not reported based on their

nonsigniﬁcance or signiﬁcance (reporting bias and publi-

cation bias). Selective reporting renders false even the

limited ideal meanings of statistical signiﬁcance, P values,

and conﬁdence intervals. Because author decisions to

report and editorial decisions to publish results often

depend on whether the P value is above or below 0.05,

selective reporting has been identified as a major problem

in large segments of the scientific literature [99–101].

Although this selection problem has also been subject to

sensitivity analysis, there has been a bias in studies of

reporting and publication bias: It is usually assumed that

these biases favor signiﬁcance. This assumption is of course

correct when (as is often the case) researchers select results

for presentation when P ≤ 0.05, a practice that tends to

exaggerate associations [101–105]. Nonetheless, bias in

favor of reporting P ≤ 0.05 is not always plausible let alone

supported by evidence or common sense. For example, one

might expect selection for P > 0.05 in publications funded

by those with stakes in acceptance of the null hypothesis (a

practice which tends to understate associations); in accord

with that expectation, some empirical studies have observed

smaller estimates and ‘‘nonsigniﬁcance’’ more often in such

publications than in other studies [101, 106, 107].

Addressing such problems would require far more political

will and effort than addressing misinterpretation of statistics,

such as enforcing registration of trials, along with open data

and analysis code from all completed studies (as in the

AllTrials initiative, http://www.alltrials.net/). In the mean-

time, readers are advised to consider the entire context in

which research reports are produced and appear when inter-

preting the statistics and conclusions offered by the reports.

Conclusions

Upon realizing that statistical tests are usually misinterpreted,

one may wonder what if anything these tests do for

science. They were originally intended to account for

random variability as a source of error, thereby sounding a

note of caution against overinterpretation of observed

associations as true effects or as stronger evidence against

null hypotheses than was warranted. But before long that

use was turned on its head to provide fallacious support for

null hypotheses in the form of ‘‘failure to achieve’’ or

‘‘failure to attain’’ statistical signiﬁcance.

We have no doubt that the founders of modern statistical

testing would be horriﬁed by common treatments of their

invention. In their ﬁrst paper describing their binary

approach to statistical testing, Neyman and Pearson [108]

wrote that ‘‘it is doubtful whether the knowledge that [a

P value] was really 0.03 (or 0.06), rather than 0.05…would

in fact ever modify our judgment’’ and that ‘‘The tests

themselves give no ﬁnal verdict, but as tools help the

worker who is using them to form his ﬁnal decision.’’

Pearson [109] later added, ‘‘No doubt we could more aptly

have said, ‘his ﬁnal or provisional decision.’’’ Fisher [110]

went further, saying ‘‘No scientiﬁc worker has a ﬁxed level

of signiﬁcance at which from year to year, and in all cir-

cumstances, he rejects hypotheses; he rather gives his mind

to each particular case in the light of his evidence and his

ideas.’’ Yet fallacious and ritualistic use of tests continued

to spread, including beliefs that whether P was above or

below 0.05 was a universal arbiter of discovery. Thus by

1965, Hill [111] lamented that ‘‘too often we weaken our

capacity to interpret data and to take reasonable decisions

whatever the value of P. And far too often we deduce ‘no

difference’ from ‘no signiﬁcant difference.’’’

In response, it has been argued that some misinterpre-

tations are harmless in tightly controlled experiments on

well-understood systems, where the test hypothesis may

have special support from established theories (e.g., Men-

delian genetics) and in which every other assumption (such

as random allocation) is forced to hold by careful design

and execution of the study. But it has long been asserted

that the harms of statistical testing in more uncontrollable

and amorphous research settings (such as social-science,

health, and medical ﬁelds) have far outweighed its beneﬁts,

leading to calls for banning such tests in research reports—

again with one journal banning P values as well as conﬁ-

dence intervals [2].

Given, however, the deep entrenchment of statistical

testing, as well as the absence of generally accepted

alternative methods, there have been many attempts to

salvage P values by detaching them from their use in sig-

niﬁcance tests. One approach is to focus on P values as

continuous measures of compatibility, as described earlier.

Although this approach has its own limitations (as descri-

bed in points 1, 2, 5, 9, 15, 18, 19), it avoids comparison of

P values with arbitrary cutoffs such as 0.05 (as described

in 3, 4, 6–8, 10–13, 15, 16, 21 and 23–25). Another

approach is to teach and use correct relations of P values to

hypothesis probabilities. For example, under common statistical

models, one-sided P values can provide lower

bounds on probabilities for hypotheses about effect direc-

tions [45, 46, 112, 113]. Whether such reinterpretations can

eventually replace common misinterpretations to good

effect remains to be seen.

A shift in emphasis from hypothesis testing to estimation

has been promoted as a simple and relatively safe way to

improve practice [5, 61, 63, 114, 115] resulting in increasing

use of conﬁdence intervals and editorial demands for them;

nonetheless, this shift has brought to the fore misinterpre-

tations of intervals such as 19–23 above [116]. Other

approaches combine tests of the null with further calculations

involving both null and alternative hypotheses [117,

118]; such calculations may, however, bring with them

further misinterpretations similar to those described above

for power, as well as greater complexity.

Meanwhile, in the hopes of minimizing harms of current

practice, we can offer several guidelines for users and

readers of statistics, and re-emphasize some key warnings

from our list of misinterpretations:

(a) Correct and careful interpretation of statistical tests

demands examining the sizes of effect estimates and

conﬁdence limits, as well as precise P values (not

just whether P values are above or below 0.05 or

some other threshold).

(b) Careful interpretation also demands critical exami-

nation of the assumptions and conventions used for

the statistical analysis—not just the usual statistical

assumptions, but also the hidden assumptions about

how results were generated and chosen for

presentation.

(c) It is simply false to claim that statistically non-

signiﬁcant results support a test hypothesis, because

the same results may be even more compatible with

alternative hypotheses—even if the power of the test

is high for those alternatives.

(d) Interval estimates aid in evaluating whether the data

are capable of discriminating among various

hypotheses about effect sizes, or whether statistical

results have been misrepresented as supporting one

hypothesis when those results are better explained by

other hypotheses (see points 4–6). We caution

however that conﬁdence intervals are often only a

ﬁrst step in these tasks. To compare hypotheses in

light of the data and the statistical model it may be

necessary to calculate the P value (or relative

likelihood) of each hypothesis. We further caution

that conﬁdence intervals provide only a best-case

measure of the uncertainty or ambiguity left by the

data, insofar as they depend on an uncertain

statistical model.

(e) Correct statistical evaluation of multiple studies

requires a pooled analysis or meta-analysis that deals

correctly with study biases [68, 119–125]. Even when

this is done, however, all the earlier cautions apply.

Furthermore, the outcome of any statistical procedure

is but one of many considerations that must be

evaluated when examining the totality of evidence. In

particular, statistical signiﬁcance is neither necessary

nor sufﬁcient for determining the scientiﬁc or prac-

tical signiﬁcance of a set of observations. This view

was afﬁrmed unanimously by the U.S. Supreme

Court, (Matrixx Initiatives, Inc., et al. v. Siracusano

et al. No. 09–1156. Argued January 10, 2011,

Decided March 22, 2011), and can be seen in our

earlier quotes from Neyman and Pearson.

(f) Any opinion offered about the probability, likeli-

hood, certainty, or similar property for a hypothesis

cannot be derived from statistical methods alone. In

particular, signiﬁcance tests and conﬁdence intervals

do not by themselves provide a logically sound basis

for concluding an effect is present or absent with

certainty or a given probability. This point should be

borne in mind whenever one sees a conclusion

framed as a statement of probability, likelihood, or

certainty about a hypothesis. Information about the

hypothesis beyond that contained in the analyzed

data and in conventional statistical models (which

give only data probabilities) must be used to reach

such a conclusion; that information should be

explicitly acknowledged and described by those

offering the conclusion. Bayesian statistics offers

methods that attempt to incorporate the needed

information directly into the statistical model; they

have not, however, achieved the popularity of

P values and conﬁdence intervals, in part because

of philosophical objections and in part because no

conventions have become established for their use.

(g) All statistical methods (whether frequentist or

Bayesian, or for testing or estimation, or for

inference or decision) make extensive assumptions

about the sequence of events that led to the results

presented—not only in the data generation, but in the

analysis choices. Thus, to allow critical evaluation,

research reports (including meta-analyses) should

describe in detail the full sequence of events that led

to the statistics presented, including the motivation

for the study, its design, the original analysis plan,

the criteria used to include and exclude subjects (or

studies) and data, and a thorough description of all

the analyses that were conducted.

In closing, we note that no statistical method is immune

to misinterpretation and misuse, but prudent users of

statistics will avoid approaches especially prone to serious

abuse. In this regard, we join others in singling out the

degradation of P values into ‘‘signiﬁcant’’ and ‘‘nonsignif-

icant’’ as an especially pernicious statistical practice [126].

Acknowledgments SJS receives funding from the IDEAL project

supported by the European Union’s Seventh Framework Programme

for research, technological development and demonstration under

Grant Agreement No. 602552. We thank Stuart Hurlbert, Deborah

Mayo, Keith O’Rourke, and Andreas Stang for helpful comments, and

Ron Wasserstein for his invaluable encouragement on this project.

Open Access This article is distributed under the terms of the Creative

Commons Attribution 4.0 International License

(http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use,

distribution, and reproduction in any medium, provided you give appropriate

credit to the original author(s) and the source, provide a link to the

Creative Commons license, and indicate if changes were made.

References

1. Lang JM, Rothman KJ, Cann CI. That confounded P-value.

Epidemiology. 1998;9:7–8.

2. Traﬁmow D, Marks M. Editorial. Basic Appl Soc Psychol.

2015;37:1–2.

3. Ashworth A. Veto on the use of null hypothesis testing and p

intervals: right or wrong? Taylor & Francis Editor. 2015.

Resources online, http://editorresources.taylorandfrancisgroup.com/veto-on-the-use-of-null-hypothesis-testing-and-p-intervals-right-or-wrong/.

Accessed 27 Feb 2016.

4. Flanagan O. Journal’s ban on null hypothesis signiﬁcance test-

ing: reactions from the statistical arena. 2015. Stats Life online,

https://www.statslife.org.uk/opinion/2114-journal-s-ban-on-null-hypothesis-significance-testing-reactions-from-the-statistical-arena.

Accessed 27 Feb 2016.

5. Altman DG, Machin D, Bryant TN, Gardner MJ, eds. Statistics

with conﬁdence. 2nd ed. London: BMJ Books; 2000.

6. Atkins L, Jarrett D. The signiﬁcance of ‘‘signiﬁcance tests’’. In:

Irvine J, Miles I, Evans J, editors. Demystifying social statistics.

London: Pluto Press; 1979.

7. Cox DR. The role of signiﬁcance tests (with discussion). Scand J

Stat. 1977;4:49–70.

8. Cox DR. Statistical signiﬁcance tests. Br J Clin Pharmacol.

1982;14:325–31.

9. Cox DR, Hinkley DV. Theoretical statistics. New York: Chap-

man and Hall; 1974.

10. Freedman DA, Pisani R, Purves R. Statistics. 4th ed. New York:

Norton; 2007.

11. Gigerenzer G, Swijtink Z, Porter T, Daston L, Beatty J, Kruger

L. The empire of chance: how probability changed science and

everyday life. New York: Cambridge University Press; 1990.

12. Harlow LL, Mulaik SA, Steiger JH. What if there were no

signiﬁcance tests?. New York: Psychology Press; 1997.

13. Hogben L. Statistical theory. London: Allen and Unwin; 1957.

14. Kaye DH, Freedman DA. Reference guide on statistics. In:

Reference manual on scientiﬁc evidence, 3rd ed. Washington,

DC: Federal Judicial Center; 2011. p. 211–302.

15. Morrison DE, Henkel RE, editors. The signiﬁcance test con-

troversy. Chicago: Aldine; 1970.

16. Oakes M. Statistical inference: a commentary for the social and

behavioural sciences. Chichester: Wiley; 1986.

17. Pratt JW. Bayesian interpretation of standard inference state-

ments. J Roy Stat Soc B. 1965;27:169–203.

18. Rothman KJ, Greenland S, Lash TL. Modern epidemiology. 3rd

ed. Philadelphia: Lippincott-Wolters-Kluwer; 2008.

19. Ware JH, Mosteller F, Ingelﬁnger JA. p-Values. In: Bailar JC,

Hoaglin DC, editors. Ch. 8. Medical uses of statistics. 3rd ed.

Hoboken, NJ: Wiley; 2009. p. 175–94.

20. Ziliak ST, McCloskey DN. The cult of statistical signiﬁcance:

how the standard error costs us jobs, justice and lives. Ann

Arbor: U Michigan Press; 2008.

21. Altman DG, Bland JM. Absence of evidence is not evidence of

absence. Br Med J. 1995;311:485.

22. Anscombe FJ. The summarizing of clinical experiments by

signiﬁcance levels. Stat Med. 1990;9:703–8.

23. Bakan D. The test of signiﬁcance in psychological research.

Psychol Bull. 1966;66:423–37.

24. Bandt CL, Boen JR. A prevalent misconception about sample

size, statistical signiﬁcance, and clinical importance. J Peri-

odontol. 1972;43:181–3.

25. Berkson J. Tests of signiﬁcance considered as evidence. J Am

Stat Assoc. 1942;37:325–35.

26. Bland JM, Altman DG. Best (but oft forgotten) practices: testing

for treatment effects in randomized trials by separate analyses of

changes from baseline in each group is a misleading approach.

Am J Clin Nutr. 2015;102:991–4.

27. Chia KS. ‘‘Signiﬁcant-itis’’—an obsession with the P-value.

Scand J Work Environ Health. 1997;23:152–4.

28. Cohen J. The earth is round (p < 0.05). Am Psychol.

1994;47:997–1003.

29. Evans SJW, Mills P, Dawson J. The end of the P-value? Br

Heart J. 1988;60:177–80.

30. Fidler F, Loftus GR. Why ﬁgures with error bars should replace

p values: some conceptual arguments and empirical demon-

strations. J Psychol. 2009;217:27–37.

31. Gardner MJ, Altman DG. Conﬁdence intervals rather than P

values: estimation rather than hypothesis testing. Br Med J.

1986;292:746–50.

32. Gelman A. P-values and statistical practice. Epidemiology.

2013;24:69–72.

33. Gelman A, Loken E. The statistical crisis in science: Data-de-

pendent analysis—a ‘‘garden of forking paths’’—explains why

many statistically signiﬁcant comparisons don’t hold up. Am

Sci. 2014;102:460–465. Erratum at http://andrewgelman.com/

2014/10/14/didnt-say-part-2/. Accessed 27 Feb 2016.

34. Gelman A, Stern HS. The difference between ‘‘signiﬁcant’’ and

‘‘not signiﬁcant’’ is not itself statistically signiﬁcant. Am Stat.

2006;60:328–31.

35. Gigerenzer G. Mindless statistics. J Socioecon.

2004;33:567–606.

36. Gigerenzer G, Marewski JN. Surrogate science: the idol of a

universal method for scientiﬁc inference. J Manag. 2015;41:

421–40.

37. Goodman SN. A comment on replication, p-values and evi-

dence. Stat Med. 1992;11:875–9.

38. Goodman SN. P-values, hypothesis tests and likelihood: impli-

cations for epidemiology of a neglected historical debate. Am J

Epidemiol. 1993;137:485–96.

39. Goodman SN. Towards evidence-based medical statistics, I: the

P-value fallacy. Ann Intern Med. 1999;130:995–1004.

40. Goodman SN. A dirty dozen: twelve P-value misconceptions.

Semin Hematol. 2008;45:135–40.

41. Greenland S. Null misinterpretation in statistical testing and

its impact on health risk assessment. Prev Med. 2011;53:

225–8.

42. Greenland S. Nonsigniﬁcance plus high power does not imply

support for the null over the alternative. Ann Epidemiol.

2012;22:364–8.

43. Greenland S. Transparency and disclosure, neutrality and bal-

ance: shared values or just shared words? J Epidemiol Com-

munity Health. 2012;66:967–70.

44. Greenland S, Poole C. Problems in common interpretations of

statistics in scientiﬁc articles, expert reports, and testimony.

Jurimetrics. 2011;51:113–29.

45. Greenland S, Poole C. Living with P-values: resurrecting a

Bayesian perspective on frequentist statistics. Epidemiology.

2013;24:62–8.

46. Greenland S, Poole C. Living with statistics in observational

research. Epidemiology. 2013;24:73–8.

47. Grieve AP. How to test hypotheses if you must. Pharm Stat.

2015;14:139–50.

48. Hoekstra R, Finch S, Kiers HAL, Johnson A. Probability as

certainty: dichotomous thinking and the misuse of p-values.

Psychon Bull Rev. 2006;13:1033–7.

49. Hurlbert SH, Lombardi CM. Final collapse of the Neyman–Pearson

decision theoretic framework and rise of the neoFisherian. Ann

Zool Fenn. 2009;46:311–49.

50. Kaye DH. Is proof of statistical signiﬁcance relevant? Wash

Law Rev. 1986;61:1333–66.

51. Lambdin C. Signiﬁcance tests as sorcery: science is empirical—

signiﬁcance tests are not. Theory Psychol. 2012;22(1):67–90.

52. Langman MJS. Towards estimation and conﬁdence intervals.

BMJ. 1986;292:716.

53. LeCoutre M-P, Poitevineau J, Lecoutre B. Even statisticians are

not immune to misinterpretations of null hypothesis tests. Int J

Psychol. 2003;38:37–45.

54. Lew MJ. Bad statistical practice in pharmacology (and other

basic biomedical disciplines): you probably don’t know P. Br J

Pharmacol. 2012;166:1559–67.

55. Loftus GR. Psychology will be a much better science when we

change the way we analyze data. Curr Dir Psychol. 1996;5:

161–71.

56. Matthews JNS, Altman DG. Interaction 2: Compare effect sizes

not P values. Br Med J. 1996;313:808.

57. Pocock SJ, Ware JH. Translating statistical ﬁndings into plain

English. Lancet. 2009;373:1926–8.

58. Pocock SJ, Hughes MD, Lee RJ. Statistical problems in the

reporting of clinical trials. N Eng J Med. 1987;317:426–32.

59. Poole C. Beyond the conﬁdence interval. Am J Public Health.

1987;77:195–9.

60. Poole C. Conﬁdence intervals exclude nothing. Am J Public

Health. 1987;77:492–3.

61. Poole C. Low P-values or narrow conﬁdence intervals: which

are more durable? Epidemiology. 2001;12:291–4.

62. Rosnow RL, Rosenthal R. Statistical procedures and the justi-

ﬁcation of knowledge in psychological science. Am Psychol.

1989;44:1276–84.

63. Rothman KJ. A show of conﬁdence. NEJM. 1978;299:1362–3.

64. Rothman KJ. Signiﬁcance questing. Ann Intern Med.

1986;105:445–7.

65. Rozeboom WM. The fallacy of null-hypothesis signiﬁcance test.

Psychol Bull. 1960;57:416–28.

66. Salsburg DS. The religion of statistics as practiced in medical

journals. Am Stat. 1985;39:220–3.

67. Schmidt FL. Statistical signiﬁcance testing and cumulative

knowledge in psychology: Implications for training of

researchers. Psychol Methods. 1996;1:115–29.

68. Schmidt FL, Hunter JE. Methods of meta-analysis: correcting

error and bias in research ﬁndings. 3rd ed. Thousand Oaks:

Sage; 2014.

69. Sterne JAC, Davey Smith G. Sifting the evidence—what’s

wrong with signiﬁcance tests? Br Med J. 2001;322:226–31.

70. Thompson WD. Statistical criteria in the interpretation of epi-

demiologic data. Am J Public Health. 1987;77:191–4.

71. Thompson B. The ‘‘signiﬁcance’’ crisis in psychology and

education. J Soc Econ. 2004;33:607–13.

72. Wagenmakers E-J. A practical solution to the pervasive problem

of p values. Psychon Bull Rev. 2007;14:779–804.

73. Walker AM. Reporting the results of epidemiologic studies. Am

J Public Health. 1986;76:556–8.

74. Wood J, Freemantle N, King M, Nazareth I. Trap of trends to

statistical signiﬁcance: likelihood of near signiﬁcant P value

becoming more signiﬁcant with extra data. BMJ.

2014;348:g2215. doi:10.1136/bmj.g2215.

75. Stigler SM. The history of statistics. Cambridge, MA: Belknap

Press; 1986.

76. Neyman J. Outline of a theory of statistical estimation based on

the classical theory of probability. Philos Trans R Soc Lond A.

1937;236:333–80.

77. Edwards W, Lindman H, Savage LJ. Bayesian statistical infer-

ence for psychological research. Psychol Rev. 1963;70:193–242.

78. Berger JO, Sellke TM. Testing a point null hypothesis: the

irreconcilability of P-values and evidence. J Am Stat Assoc.

1987;82:112–39.

79. Edwards AWF. Likelihood. 2nd ed. Baltimore: Johns Hopkins

University Press; 1992.

80. Goodman SN, Royall R. Evidence and scientiﬁc research. Am J

Public Health. 1988;78:1568–74.

81. Royall R. Statistical evidence. New York: Chapman and Hall;

1997.

82. Sellke TM, Bayarri MJ, Berger JO. Calibration of p values for

testing precise null hypotheses. Am Stat. 2001;55:62–71.

83. Goodman SN. Introduction to Bayesian methods I: measuring

the strength of evidence. Clin Trials. 2005;2:282–90.

84. Lehmann EL. Testing statistical hypotheses. 2nd ed. Wiley:

New York; 1986.

85. Senn SJ. Two cheers for P-values. J Epidemiol Biostat.

2001;6(2):193–204.

86. Senn SJ. Letter to the Editor re: Goodman 1992. Stat Med.

2002;21:2437–44.

87. Mayo DG, Cox DR. Frequentist statistics as a theory of induc-

tive inference. In: J Rojo, editor. Optimality: the second Erich L.

Lehmann symposium, Lecture notes-monograph series, Institute

of Mathematical Statistics (IMS). 2006;49: 77–97.

88. Murtaugh PA. In defense of P-values (with discussion). Ecol-

ogy. 2014;95(3):611–53.

89. Hedges LV, Olkin I. Vote-counting methods in research syn-

thesis. Psychol Bull. 1980;88:359–69.

90. Chalmers TC, Lau J. Changes in clinical trials mandated by the

advent of meta-analysis. Stat Med. 1996;15:1263–8.

91. Maheshwari S, Sarraj A, Kramer J, El-Serag HB. Oral contra-

ception and the risk of hepatocellular carcinoma. J Hepatol.

2007;47:506–13.

92. Cox DR. The planning of experiments. New York: Wiley; 1958. p. 161.

93. Smith AH, Bates M. Conﬁdence limit analyses should replace

power calculations in the interpretation of epidemiologic stud-

ies. Epidemiology. 1992;3:449–52.

94. Goodman SN. Letter to the editor re Smith and Bates. Epi-

demiology. 1994;5:266–8.

95. Goodman SN, Berlin J. The use of predicted conﬁdence inter-

vals when planning experiments and the misuse of power when

interpreting results. Ann Intern Med. 1994;121:200–6.

96. Hoenig JM, Heisey DM. The abuse of power: the pervasive

fallacy of power calculations for data analysis. Am Stat.

2001;55:19–24.

97. Senn SJ. Power is indeed irrelevant in interpreting completed

studies. BMJ. 2002;325:1304.

98. Lash TL, Fox MP, Maclehose RF, Maldonado G, McCandless

LC, Greenland S. Good practices for quantitative bias analysis.

Int J Epidemiol. 2014;43:1969–85.

99. Dwan K, Gamble C, Williamson PR, Kirkham JJ, Reporting

Bias Group. Systematic review of the empirical evidence of

study publication bias and outcome reporting bias—an updated

review. PLoS One. 2013;8:e66844.

100. Page MJ, McKenzie JE, Kirkham J, Dwan K, Kramer S, Green

S, Forbes A. Bias due to selective inclusion and reporting of

outcomes and analyses in systematic reviews of randomised

trials of healthcare interventions. Cochrane Database Syst Rev.

2014;10:MR000035.

101. You B, Gan HK, Pond G, Chen EX. Consistency in the analysis

and reporting of primary end points in oncology randomized

controlled trials from registration to publication: a systematic

review. J Clin Oncol. 2012;30:210–6.

102. Button K, Ioannidis JPA, Mokrysz C, Nosek BA, Flint J,

Robinson ESJ, Munafò MR. Power failure: why small sample

size undermines the reliability of neuroscience. Nat Rev Neu-

rosci. 2013;14:365–76.

103. Eyding D, Lelgemann M, Grouven U, Härter M, Kromp M,

Kaiser T, Kerekes MF, Gerken M, Wieseler B. Reboxetine for

acute treatment of major depression: systematic review and

meta-analysis of published and unpublished placebo and selec-

tive serotonin reuptake inhibitor controlled trials. BMJ.

2010;341:c4737.

104. Land CE. Estimating cancer risks from low doses of ionizing

radiation. Science. 1980;209:1197–203.

105. Land CE. Statistical limitations in relation to sample size.

Environ Health Perspect. 1981;42:15–21.

106. Greenland S. Dealing with uncertainty about investigator bias:

disclosure is informative. J Epidemiol Community Health.

2009;63:593–8.

107. Xu L, Freeman G, Cowling BJ, Schooling CM. Testosterone

therapy and cardiovascular events among men: a systematic

review and meta-analysis of placebo-controlled randomized

trials. BMC Med. 2013;11:108.

108. Neyman J, Pearson ES. On the use and interpretation of certain

test criteria for purposes of statistical inference: part I. Biome-

trika. 1928;20A:175–240.

109. Pearson ES. Statistical concepts in the relation to reality. J R Stat

Soc B. 1955;17:204–7.

110. Fisher RA. Statistical methods and scientiﬁc inference. Edin-

burgh: Oliver and Boyd; 1956.

111. Hill AB. The environment and disease: association or causation?

Proc R Soc Med. 1965;58:295–300.

112. Casella G, Berger RL. Reconciling Bayesian and frequentist

evidence in the one-sided testing problem. J Am Stat Assoc.

1987;82:106–11.

113. Casella G, Berger RL. Comment. Stat Sci. 1987;2:344–417.

114. Yates F. The inﬂuence of statistical methods for research

workers on the development of the science of statistics. J Am

Stat Assoc. 1951;46:19–34.

115. Cumming G. Understanding the new statistics: effect sizes,

conﬁdence intervals, and meta-analysis. London: Routledge;

2011.

116. Morey RD, Hoekstra R, Rouder JN, Lee MD, Wagenmakers E-J.

The fallacy of placing conﬁdence in conﬁdence intervals. Psy-

chon Bull Rev (in press).

117. Rosenthal R, Rubin DB. The counternull value of an effect size:

a new statistic. Psychol Sci. 1994;5:329–34.

118. Mayo DG, Spanos A. Severe testing as a basic concept in a

Neyman–Pearson philosophy of induction. Br J Philos Sci.

2006;57:323–57.

119. Whitehead A. Meta-analysis of controlled clinical trials. New

York: Wiley; 2002.

120. Borenstein M, Hedges LV, Higgins JPT, Rothstein HR. Intro-

duction to meta-analysis. New York: Wiley; 2009.

121. Chen D-G, Peace KE. Applied meta-analysis with R. New York:

Chapman & Hall/CRC; 2013.

122. Cooper H, Hedges LV, Valentine JC. The handbook of research

synthesis and meta-analysis. Thousand Oaks: Sage; 2009.

123. Greenland S, O’Rourke K. Meta-analysis Ch. 33. In: Rothman

KJ, Greenland S, Lash TL, editors. Modern epidemiology. 3rd

ed. Philadelphia: Lippincott-Wolters-Kluwer; 2008. p. 682–5.

124. Petitti DB. Meta-analysis, decision analysis, and cost-effec-

tiveness analysis: methods for quantitative synthesis in medi-

cine. 2nd ed. New York: Oxford U Press; 2000.

125. Sterne JAC. Meta-analysis: an updated collection from the Stata

journal. College Station, TX: Stata Press; 2009.

126. Weinberg CR. It’s time to rehabilitate the P-value. Epidemiol-

ogy. 2001;12:288–90.

This same sentiment was expressed in the American Statistical Association's (ASA) [paper on p-values](https://amstat.tandfonline.com/doi/abs/10.1080/00031305.2016.1154108#.XE8wl89KjRY), to which this article served as a supplement. This was the first time the ASA had ever issued such a statement of concern.
Another great resource for misinterpretations of p-values, which was also validated by one of the authors, can be found in Wikipedia's article [Misunderstandings of p-values](https://en.wikipedia.org/wiki/Misunderstandings_of_p-values).
These are heavyweight names in epidemiology and biostatistics. As far as I know, this paper is the first time they have joined forces to write an article, which reflects the importance of addressing these misinterpretations.
[Here's](https://fivethirtyeight.com/features/not-even-scientists-can-easily-explain-p-values/) a funny clip from FiveThirtyEight, where scientists actually working on research quality and p-values at Stanford (including one of the authors of this paper) are asked on the spot to explain p-values!
Multiple hypothesis testing is a huge problem in the sciences. Let's say you have 20 hypotheses that you want to test and are using a significance threshold of 0.05. If all 20 null hypotheses are true and the tests are independent, the probability that at least one of them comes out significant by random chance is 1 − (1 − 0.05)^20 ≈ 64%.
For more: https://www.stat.berkeley.edu/~mgoldman/Section0402.pdf
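The arithmetic in the note above can be checked with a few lines of Python. This is only a minimal sketch: the function name `familywise_error_rate` is made up for illustration, and the formula assumes that the tests are independent and that every null hypothesis is true.

```python
def familywise_error_rate(num_tests: int, alpha: float) -> float:
    """Chance of at least one false positive among `num_tests` tests.

    Assumes the tests are independent and every null hypothesis is true,
    so each test has probability `alpha` of a false positive.
    """
    # P(no false positive in any test) = (1 - alpha) ** num_tests
    return 1.0 - (1.0 - alpha) ** num_tests


fwer = familywise_error_rate(num_tests=20, alpha=0.05)
print(f"{fwer:.1%}")  # prints 64.2%
```

With correlated tests the true rate would generally differ from this figure, so it is best read as an illustration of how quickly false positives accumulate rather than as an exact value for any real study.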
Unfortunately, this terminology has led to a misconception: many people look primarily at the p-value, rather than at the magnitude and direction of the effect being measured, to judge its significance. Well-known scientists, such as Harvard's Miguel Hernán, have vowed to stop using the term "statistically significant".
R.A. Fisher introduced the concepts of the "p value" and the "null hypothesis" (no relationship between two measured phenomena) when he published on statistical significance tests in 1925. Later, starting in 1928, Neyman and Pearson introduced the terminology of null and alternative hypothesis testing that is regularly used today.
For more: https://en.wikipedia.org/wiki/Null_hypothesis
If you've ever worked with a big data set, you'll know that it is very easy to selectively fish for significant p-values. If you're working with "big data", even the smallest differences might appear "statistically significant", so always put your results in context and inspect effect sizes. The more data you have, the more you should consider using smaller significance levels, especially if you're testing multiple hypotheses (see the Bonferroni correction).
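As a rough illustration of the Bonferroni correction mentioned in this note, the sketch below divides the significance level by the number of tests before comparing each p-value against it. The function name `bonferroni_reject` and the p-values are invented for the example; they do not come from the paper.

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Return one True/False decision per p-value under Bonferroni.

    Each p-value is compared against alpha divided by the number of tests,
    which keeps the chance of any false positive at or below alpha.
    """
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]


# Hypothetical p-values from five tests (illustrative numbers only).
p_values = [0.001, 0.012, 0.030, 0.049, 0.200]
decisions = bonferroni_reject(p_values, alpha=0.05)
print(decisions)  # prints [True, False, False, False, False]
```

Four of the five made-up p-values fall below the unadjusted 0.05 level, but only one survives the adjusted threshold of 0.01. Bonferroni is a conservative choice; other corrections exist and may suit a given analysis better.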
Wikipedia has a fantastic and very accessible article on the [Misunderstandings of p-values](https://en.wikipedia.org/wiki/Misunderstandings_of_p-values), as also indicated above.
Estimating the science-wide false discovery rate is an active area of research (and fight!). The journal Biostatistics published a fantastic issue in which heavyweight names weigh in on this topic [here](https://academic.oup.com/biostatistics/issue/15/1). Unfortunately, reproducibility projects in psychology, cancer biology and economics do not paint a particularly favorable picture, even for the most prestigious of studies. This has led to what is known as the "reproducibility crisis."
This is an excellent short read in its own right!
This is a misconception I have witnessed time and time again within the biosciences. Even the general guidelines given here do not apply when comparing groups of very different sample sizes, which is often the case. The best-written article I have found so far on how to interpret comparisons between standard deviation error bars, standard error error bars and confidence intervals is [here](https://www.graphpad.com/support/faq/spanwhat-you-can-conclude-when-two-error-bars-overlap-or-dontspan/).
Fisher was in fact vehemently against the null hypothesis significance testing procedure introduced by Neyman and Pearson (which is the procedure most scientists use today). He was very concerned that we would turn the often complicated process of inferring from data into an automated decision algorithm; unfortunately, it turns out he was spot on. One of the classic papers in this space offers further historical details and was written by one of the authors of this paper in 1993; it can be found [here](https://academic.oup.com/aje/article/137/5/485/50007), but it is behind a paywall.
An additional recommendation that has attracted lots of supporters, and enemies, is that we lower the commonly accepted level of significance from 0.05 to 0.005. The paper proposing this change has been signed by some very big names and can be found [here](https://www.nature.com/articles/s41562-017-0189-z). The authors understand this proposal as an interim measure to curtail the "symptoms" of misunderstandings in statistical inference while working on the "cause."
In 2015, the journal "Basic and Applied Social Psychology" (BASP) declared the null hypothesis significance testing procedure invalid and banned it from all studies it publishes. For more: http://www.medicine.mcgill.ca/epidemiology/Joseph/courses/EPIB-621/BASP2015.pdf
For anyone interested in issues related to p-values, quality of research, open science and evidence-based practice, you may find the [Reddit community on meta-research](https://www.reddit.com/r/metaresearch/) of interest (disclaimer: I am one of the moderators of that community!)
To appropriately combine evidence from more than one study, we need to use a statistical method, NOT compare p-values. The most commonly used such method in medicine is meta-analysis. One of the famous success stories of meta-analyses is the discovery that administering corticosteroids to preterm babies substantially reduces complications; this was in contrast to all individual studies up to that point, which had not provided convincing evidence of such benefit. The main figure of that meta-analysis is now the logo of probably the most well-known consortium for evidence synthesis, Cochrane; more on this [here](https://www.ncbi.nlm.nih.gov/pubmed/16856047).
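To make the idea of combining studies more concrete, here is a minimal sketch of one common approach, a fixed-effect inverse-variance meta-analysis. It is not the method used in the corticosteroid review, which is not described here; the function name `fixed_effect_meta` and all numbers are invented for illustration, and the 95% interval uses the usual normal approximation (1.96 standard errors).

```python
import math


def fixed_effect_meta(estimates, std_errors):
    """Pool study estimates with inverse-variance (fixed-effect) weights.

    Returns the pooled estimate, its standard error, and an approximate
    95% confidence interval based on the normal approximation.
    """
    weights = [1.0 / se ** 2 for se in std_errors]
    total_weight = sum(weights)
    pooled = sum(w * e for w, e in zip(weights, estimates)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    ci = (pooled - 1.96 * pooled_se, pooled + 1.96 * pooled_se)
    return pooled, pooled_se, ci


# Hypothetical log odds ratios and standard errors from three small studies.
log_odds_ratios = [-0.40, -0.30, -0.50]
std_errors = [0.30, 0.25, 0.35]

pooled, pooled_se, ci = fixed_effect_meta(log_odds_ratios, std_errors)
print(f"pooled = {pooled:.3f}, 95% CI = ({ci[0]:.3f}, {ci[1]:.3f})")
```

In this made-up example, none of the three studies on its own has an estimate more than 1.96 standard errors from zero, yet the pooled interval excludes zero. That is the pattern the note describes: individually unconvincing studies can add up to clear evidence once they are combined properly. A real meta-analysis would also need to consider heterogeneity between studies and reporting bias.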
This is often known as data dredging or p-hacking; an excellent Wikipedia article on it can be found [here](https://en.wikipedia.org/wiki/Data_dredging). A MUST READ case that came to light last year is that of Brian Wansink, a Cornell professor, who inadvertently admitted to p-hacking on his personal blog. Many of his papers have since been retracted from the academic literature. You can read the original report about this case [here](https://www.buzzfeednews.com/article/stephaniemlee/brian-wansink-cornell-p-hacking).