Decision | Frequency | Percentage |
Accept | 1 | 2.6% |
R&R | 13 | 34.2% |
Reject | 24 | 63.2% |
Decision | Frequency | Percentage |
Accept | 9 | 69.2% |
R&R | 2 | 15.4% |
Reject | 2 | 15.4% |
My Recommendation | Ultimate Decision: Accept | Ultimate Decision: Reject | Ultimate Decision: Unknown | Total |
Accept | 1 | 0 | 0 | 1 |
R&R | 6 | 2 | 5 | 13 |
Reject | 4 | 18 | 2 | 24 |
Total | 11 | 20 | 7 | 38 |
[DISCLAIMER: The opinions expressed in my posts are personal opinions, and they do not reflect the editorial policy of Social Psychological and Personality Science or its sponsoring associations, which are responsible for setting editorial policy for the journal.]
i've been bitching and moaning for a long time about the low statistical power of psych studies. i've been wrong. our studies would be underpowered, if we actually followed the rules of Null Hypothesis Significance Testing (but kept our sample sizes as small as they are). but the way we actually do research, our effective statistical power is actually very high, much higher than our small sample sizes should allow. let's start at the beginning.
background (skip this if you know NHST)
[table: Null Hypothesis Significance Testing (over)simplified]
in this table, power is the probability of ending up in the bottom right cell if we are in the right column (i.e., the probability of rejecting the null hypothesis if the null is false). in Null Hypothesis Significance Testing (NHST), we don't know which column we're in, we only know which row we end up in. if we get a result with p < .05, we are in the bottom row (and we can publish!* yay!). if we end up with a result with p > .05, we end up in the top row (null result, hard to publish, boo).
within each column, the probability of ending up in each of the two cells (top row, bottom row) adds up to 100%. so, when we are in the left column (i.e., when the null is actually true, unbeknownst to us), the probability of getting a false positive (typically assumed to be 5%, if we use p < .05 as our threshold for statistical significance) plus the probability of a correct rejection (95%) add up to 100%. and, when we are in the right column (i.e., when the null is false, also unbeknownst to - but hoped for by - us), the probability of a false negative (ideally at or below 20%) plus the probability of a hit (i.e., statistical power; 80%) add up to 100%.
side note: even if the false positive rate actually is 5% when the null is true, it does not follow that only 5% of significant findings are false positives. 5% is the proportion of findings in the left column that are in the bottom left cell. what we really want to know is the proportion of results in the bottom row that are in the bottom left cell (i.e., the proportion of false positives among all significant results). this is the complement of the Positive Predictive Value (PPV), which is the proportion of significant results that are true hits, and it would likely correspond closely to the rate of false positives in the published literature (since the published literature consists almost entirely of significant key findings). but we don't know what it is, and it could be much higher than 5%, even if the false positive rate in the left column really was 5%.**
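to make the side note concrete, here is a minimal sketch (not from the original post) of the arithmetic. it assumes we somehow know the share of tested hypotheses for which the null is false, which in practice we don't; the name `prior_false_null` and the scenario values are hypothetical, chosen only for illustration.

```python
def ppv(prior_false_null, power, alpha=0.05):
    """Share of significant results that are true hits (bottom right / bottom row)."""
    hits = prior_false_null * power                      # right column, bottom cell
    false_positives = (1.0 - prior_false_null) * alpha   # left column, bottom cell
    return hits / (hits + false_positives)


def false_positive_share(prior_false_null, power, alpha=0.05):
    """Share of significant results that are false positives (1 - PPV)."""
    return 1.0 - ppv(prior_false_null, power, alpha)


# hypothetical scenarios: (share of tested hypotheses where the null is false, power)
for prior, power in [(0.5, 0.8), (0.1, 0.8), (0.1, 0.5)]:
    share = false_positive_share(prior, power)
    print(f"prior={prior:.1f} power={power:.1f} -> false positive share = {share:.2f}")
```

under these assumed numbers, if only 10% of tested hypotheses are true and power is 80%, 36% of significant results are false positives, even though the false positive rate in the left column is still exactly 5%.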
back to the main point.
we have small sample sizes in social/personality psychology. small sample sizes often lead to low power, at least with the effect sizes (and between-subjects designs) we're typically dealing with in social and personality psychology. therefore, like many others, i have been beating the drum for larger sample sizes.
not background
our samples are too small, but despite our small samples, we have been operating with very high effective power. because we've been taking shortcuts.
the guidelines about power (and about false positives and false negatives) only apply when we follow the rules of NHST. we do not follow the rules of NHST. following the rules of NHST (and thus being able to interpret p-values the way we would like to interpret them, the way we teach undergrads to interpret them) would require making a prediction and pre-registering a key test of that prediction, and only interpreting the p-value associated with that key test (and treating everything else as preliminary, exploratory findings that need to be followed up on).
since we violate the rules of NHST quite often, by HARKing (Hypothesizing After Results are Known), p-hacking, and not pre-registering, we do not actually have a false positive error rate of 5% when the null is true. that's not new - that's the crux of the replicability crisis. but there's another side of that coin.
the point of p-hacking is to get into the bottom row of the NHST table - we cherry-pick analyses so that we end up with significant results (or we interpret all significant results as robust, even when we should not because we didn't predict them). in other words, we maximize our chances of ending up in the bottom row. this means that, when we're in the left column (i.e., when the null is true), we inflate our chances of getting a false positive to something quite a bit higher than 5%.
but it also means that, when we're in the right column (i.e., when the null hypothesis is false), we increase our chances of a hit well beyond what our sample size should buy us. that is, we increase our power. but it's a bald-faced power grab. we didn't earn that power.
that sounds like a good thing, and it has its perks for sure. for one thing, we end up with far fewer false negatives. indeed, it's one of the main reasons i'm not worried about false negatives. even if we start with 50% power (i.e., if we have 50% chance of a hit when the null is false, if we follow the rules of NHST), and then we bend the rules a bit (give ourselves some wiggle room to adjust our analyses based on what we see in the data), we could easily be operating with 80% effective power (i haven't done the simulations but i'm sure one of you will***).
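for what it's worth, here is one rough way such a simulation could look. it is a sketch under strong assumptions, not the author's analysis: the only form of p-hacking modeled is measuring several independent outcomes and counting the study as significant if any of them reaches p < .05; the test is a two-group z-test with a known SD of 1; and the effect size (d = 0.49) and sample size (32 per group) are hypothetical values picked so that a single planned test has about 50% power. real p-hacking involves correlated outcomes and many other choices, so the numbers are illustrative only.

```python
import random
from statistics import NormalDist, fmean

ALPHA = 0.05


def p_value(n, d, rng):
    """Two-sided p-value from a two-group z-test (SD assumed known and equal to 1)."""
    control = [rng.gauss(0.0, 1.0) for _ in range(n)]
    treated = [rng.gauss(d, 1.0) for _ in range(n)]
    z = (fmean(treated) - fmean(control)) / (2.0 / n) ** 0.5
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def significant_rate(n, d, n_outcomes, n_sims=2000, seed=1):
    """Share of simulated studies with at least one p < ALPHA among n_outcomes outcomes."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        # the "p-hack": the study counts as significant if ANY outcome works
        if any(p_value(n, d, rng) < ALPHA for _ in range(n_outcomes)):
            hits += 1
    return hits / n_sims


if __name__ == "__main__":
    for label, d in [("null false (d = 0.49)", 0.49), ("null true (d = 0)", 0.0)]:
        for k in (1, 3):
            rate = significant_rate(n=32, d=d, n_outcomes=k)
            print(f"{label}, {k} outcome(s): significant in {rate:.1%} of studies")
```

with independent outcomes the expected rates can also be worked out by hand: three tries at 50% power give 1 - 0.5^3 = 87.5% effective power, and three tries at a 5% false positive rate give 1 - 0.95^3, or about 14%. so in this toy setup the same wiggle room that buys the extra power also nearly triples the false positive rate.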
what's the downside? well, all the false positives. p-hacking is safe as long as our predictions are correct (i.e., as long as the null is false, and we're in the right column). then we're just increasing our power. but if we already know that our predictions are correct, we don't need science. if we aren't putting our theories to a strong test - giving ourselves a serious chance of ending up with a true null effect - then why bother collecting data? why not just decide truth based on the strength of our theory and reasoning?
to be a science, we have to take seriously the possibility that the null is true - that we're wrong. and when we do that, pushing things that would otherwise end up in the top row into the bottom row becomes much riskier. if we can make many null effects look like significant results, our PPV (and rate of false positives in the published literature) gets all out of whack. a significant p-value no longer means much.
nevertheless, all of us who have been saying that our studies are underpowered were wrong. or at least we were imprecise. our studies would be underpowered if we were not p-hacking, if we pre-registered,**** and if we only interpreted p-values for planned analyses. but if we're allowed to do what we've always done, our power is actually quite high. and so is our false positive rate.
also
other reasons i'm not that worried about false negatives:
You’re not alone.
As I conduct research for my upcoming book on friendship, I’ve found that most people have experienced discomfort when attempting to strike up and maintain a conversation with someone they don’t know well.
But here’s an important observation. If we tried to avoid small talk, because of the tensions involved, it would likely prevent us from making friendships over time.
While small talk may sometimes be dismissed as the meaningless “fluff” of communication, it’s actually an essential building block for connecting with others.
To help take the mystery out of the daunting task of small talking, I’ve engaged another round of experts who’ve devoted themselves to studying human interaction. (If you missed the first round of interviews on friendship, you can check out that series here.)