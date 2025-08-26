Series within a series? Yes! The main series constitutes the posts (with more to come) on Sexism in Science. But this is the first of a subseries, i.e., a series of posts within Sexism in Science on some of the very bad arguments we received from the reviewers at Nature when we submitted our proposal to replicate Moss-Racusin et al.’s (2012, hence “M-R”) famous study finding biases against hiring women in science. They rejected it, so we submitted it to Meta-Psychology, where the paper is now accepted and awaiting publication. You can find an overview of our reversal (i.e., using methods nearly identical to theirs, we found biases against men) here:

The Nature reviews are from 2019. The reviews were filled with reasons to reject it that were sophisticated versions of bad arguments. Because of the bullshit asymmetry principle:

It takes an entire post to debunk, critique, or contest a sophisticated version of a single bad argument.

So, in this post, I do so for one reviewer criticism. That criticism suggested that the “fact” that our respondents “may have been familiar” with the Moss-Racusin et al study meant that their responses could not be believed, and, therefore, that the study could not be replicated.

I will return to other bad reviewer claims in subsequent posts. That is the “series within a series.”

You can think of this as a sort of constructive revenge post.

You might think, “He’s just pissed because his paper was rejected” and you’d be half right. I was pissed, but not because the paper was rejected; this happens all the time (we just had a paper rejected at Psych Review, and, though disappointing, the reviews were very reasonable). So no, I was not pissed because it was rejected. I was pissed because of the ridiculous reasons it was rejected. This post begins to tell that story.

Dysfunctions in Peer Review Generally

Peer review has so many dysfunctions that it has been described as a complete failure and I have argued elsewhere that most claims that appear in psychology journals are not credible.

Many of the dysfunctions of conventional peer review synergistically combine to create a spiral of low quality reviews:

Power asymmetry

Reviewers hold the fate of authors’ papers in their hands (and to some extent their careers) but authors hold nothing of value in their hands regarding the fate of reviewers. At top journals, one negative review can often kill a paper’s chances.

Reviewers evaluate the submitted manuscript but, unless the editor permits a revision, no one evaluates the quality of the reviews.

No transparency or accountability

No one knows how the editor selects reviewers, nor, usually, even who the reviewers were, unless they choose to identify themselves.

No accountability. If there are systematic problems with a journal’s review processes, there is no way for anyone outside the journal’s inner circle to know about it. Thus, reforms to peer review have been occurring on a geological time scale.

Reviewer biases

There are so many. Theory, status, ingroup (especially, academic ingroup), political, biases against null results, biases against replication, and many more. Of course, bias is inherent to the human condition, but most papers are only reviewed by 2-5 reviewers, so it’s a total crapshoot as to who you are going to get as a reviewer, and with what baggage. This is even true for political biases, even though the social sciences are almost completely populated by people left of the American center. Even those people can be roughly divided into those who feel liberated to inflict their biases on scholarship and peer review and those who believe this is a very bad idea and work hard not to do so.

How to reform peer review is a complex and sticky issue, and beyond the scope of the present essay. The relevant part is that these biases are bad enough on their own, but each can exacerbate the problem of the others. Power asymmetry? Without accountability, it just continues and continues and continues… Reviewer biases? As long as there is neither transparency nor an open review process whereby the entire scientific community can evaluate particular reviews, reviewers have no reason to even attempt to limit their biases.

The only way to expose this nonsense is … to expose this nonsense. Thus, this revenge post.

Before getting our replication (and, ultimately, reversal) of the famous Moss-Racusin et al (2012) study (hence, “M-R”) finding biases against women accepted at Meta-Psychology, we submitted the proposal for a registered replication report (RRR) to Nature. They rejected it, basically telling us:

The study could not be replicated. This is the topic of the present post.

The study should not be replicated. This will be a topic of subsequent posts.

We did the right thing and (mostly) ignored them.

This is the actual jacket of my collaborator and unindicted co-conspirator, Catherine Salmon

In this post, I present their key reasons for rejecting our proposed RRR as a testament to scientific malfunctions endemic to peer review and, by implication, academia more widely.

A Brief Explainer

M-R found, based on a sample of 127 faculty at six institutions, biases against hiring a woman for a lab manager position.

In a very close replication attempt (i.e., our methods were intentionally adopted to be as identical to theirs as possible), we found, based on a sample of over 1100 (N varied a bit from analysis to analysis because of missing data), biases against hiring men for a lab manager position.

I have posted on this previously, so you can go here for a summary of what we did and found, including a link to the full article accepted for publication at Meta-Psychology:

What is a Registered Replication Report?

More details about what an RRR is can be found in the post on our reversal. Here I just review some key features. RRR’s are something of a gold standard in replication work because of their transparency and the way they limit reviewer and editor biases.

We knew we’d have a much better shot at getting the paper published if the paper was accepted as a Registered Replication Report (RRR). In an RRR, you do not submit to the journal the final paper with all the results and analyses potentially disconfirming the social justicey narrative testaments to the power of bias and patriarchy that so many academics so love. No. You definitely do not do that.

Instead, you submit a proposal describing the research you intend to conduct – complete with theory, hypotheses, methods and planned analyses. For journals that accept RRR’s, this leaves reviewers and editors no choice but to review the proposal without knowing the results. As we shall see, this does not eliminate the problem of reviewer bias completely but it does remove one major source of such biases – not liking the results.

RRR to the Rescue?

But there is one big bad but:

If reviewers liked the original study and its results, AND they anticipate it won’t replicate, they can concoct “serious scientific-sounding” “reasons” to reject the RRR. Which, as you shall see, is exactly what happened at Nature. The Orwelexicon has this covered:

In it, you will find this:

The paper should really have been published in the Proceedings of the National Academy of Sciences (PNAS), because that is the journal that published M-R. In my Academic Utopia, journals would accept responsibility for publishing papers attempting to replicate work they previously published. If it was important enough to publish, it should be important enough to either verify or disconfirm. PNAS did not, at the time, have an RRR option (I have no idea whether they do now), so we could not submit it there. Journals without an RRR format are (in my experience and opinion, though I have no hard data) far more likely to reject replication work: if the replication succeeds, their reaction is “We knew this already and do not need to republish the same result,” and if it fails, they will likely come up with some reason to reject it (after all, it really doesn’t make the journal look good to be viewed as publishing nonsense, aka the original study). But I digress.

Lowlight #1 from the Nature Reviews

tl;dr: The reviewer stops barely short of claiming, but strongly implies, that very close replications of M-R are impossible.

Reviewer 1:

I am worried about the fact that the faculty participants may be aware of the original 2012 paper, and that their responses may be skewed as a result.

What a weird phrasing. It’s a “fact that…” and the “facts” that follow are speculations. Ok, whatever. We treated the speculation seriously. After all, it was hypothetically possible that masses of faculty are aware of M-R and said awareness produces non-credible responses. The speculation raised two questions:

How aware were respondents of the M-R study? This is an empirical question that can easily be answered by asking them. We did this, and it was in the proposal.

Did awareness “skew” responses? This, too, is an empirical question, and we also proposed assessing this.

But the reviewer knew this and did not like it, either.

Here is how the reviewer “criticized” our proposal to address the question empirically:

However, my broader concerns still stand. How can we be certain that participants will be willing and/or able to answer these familiarity checks honestly and accurately?

WTLF? Let me count the LF’s:

“Certain”!? Well, no, we can’t be “certain.” We almost never can be certain of anything in social science research, ever, at least not until it’s been confirmed by zillions of studies. See “rigor mortus selectivus” above.

Willing? Why the LF would people with PhDs in science be unwilling to answer q’s about their knowledge of research accurately? If anything, a few might be motivated to overclaim, that is, to act like they know more than they actually do, and thereby claim they do know the study when, in fact, they do not. Thus, when we asked about familiarity, if anything, those results would be more likely to overstate familiarity than understate it.

Able? Why the LF would people with PhDs in science be unable to figure out whether they knew some study, not counting the whole overclaiming thing? Sure, memory is imperfect, but if most of them could not figure out whether they knew the study, then the entire problem of “familiarity” largely or completely evaporates into the dark mists of memory and forgetting.

But the worst is not what the reviewer stated explicitly, but the implications of the argument. Specifically, if faculty responses are so distorted, so polluted by awareness of the M-R study, and if this is a reason not to even conduct the study, then, implicitly, the argument is that a direct replication of their study is not possible. One could, presumably, conduct a different study to assess sex bias in science, but one could not directly replicate that particular study. The reviewer stopped barely short of saying so explicitly:

My specific concern about this is that you are then not actually measuring STEM faculty members’ gender bias, but rather, how STEM faculty respond when they think they are being measured for bias.

Translation: The M-R study must stand intact, forever, it can never be replicated, and, should one be conducted and published, any failure to replicate can be dismissed as “tainted”!

Let’s Talk Evidence

How aware of M-R were our faculty participants?

tl;dr: Not very.

We asked this question:

“How much do you know about the following articles?”

Following this were the M-R article and five nonexistent articles, all in full APA style. I’ll get to what we did with the fake articles later.

Answer options were: [1] “nothing at all,” [2] “heard of it, but do not know much,” [3] “I know a little about it,” [4] “I know a moderate amount about this,” and [5] “I know a great deal about this.”

I’m reporting results for Study 1 here but results were similar for all three studies.

What did we find?

69% of participants selected “nothing at all”

11% “heard of it, but do not know much”

9% “I know a little about it,”

7% “I know a moderate amount about this”

3% “I know a great deal about this.”

Taking this at face value, 80% had either not heard of it or did not know much about it, and 89% knew, at most, “a little” about it.

Awareness was mostly a non-issue. And what about the 11% who claimed to know at least a moderate amount about it? It is almost definitely an overestimate based on overclaiming. How do we know? That’s where the fake, nonexistent articles come in. 4% claimed to know a “moderate amount” or “great deal” about the nonexistent articles. There is no reason to expect any less making shit up about what they knew about M-R than for the fake, nonexistent articles. We can estimate how many actually knew about M-R by subtracting the 4% who “knew” about the fake articles from the 11% claiming to know about M-R. So the best guess is that around 7% actually knew much about M-R. The similar overclaiming-adjusted figures for our Studies 2 and 3 were … ready? … wait for it … are you sure you want to see this? … 2%.
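The overclaiming adjustment above is just a subtraction, but here it is spelled out as a minimal Python sketch. The function name and the floor at zero are my own illustrative assumptions, not something taken from the paper; the 11% and 4% figures are the Study 1 numbers reported above.

```python
def overclaiming_adjusted(claimed_pct: int, fake_pct: int) -> int:
    """Estimate the share who actually know a real article.

    claimed_pct: percent claiming moderate or great knowledge of the real article.
    fake_pct: percent claiming the same about nonexistent articles (overclaiming rate).
    The floor at zero is an illustrative assumption: a share cannot be negative.
    """
    return max(claimed_pct - fake_pct, 0)


# Study 1 figures from the post: 11% claimed to know M-R, 4% "knew" the fake articles.
study1_estimate = overclaiming_adjusted(11, 4)
print(f"Best guess for Study 1: about {study1_estimate}% actually knew much about M-R")
```

The logic assumes that people overclaim about a real article at the same rate they overclaim about fake ones, which is the assumption stated in the paragraph above.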

Did Awareness Skew the Results, as the Reviewer Predicted?

No. We tested this directly, by assessing whether familiarity scores “moderated” the four main outcomes (ratings of the applicant’s hireability, competence, and deservingness of mentorship, and the recommended starting salary). Moderation answers the question, “Did the effects of the sex of applicant on the outcome — i.e., bias — depend on the level of familiarity faculty claimed to have about M-R?”

The answer was mostly “no.” There were 12 such tests (four outcomes by three studies). 9 of 12 were not statistically significant, meaning there was no evidence for moderation (and given the large sample sizes, the moderation effect sizes were near zero). For two of the three statistically significant moderation analyses, results were consistent with the reviewer’s concerns: more familiarity, more bias favoring the female applicant. But for the third, the result was wild: at low familiarity, there was bias favoring the female applicant, but at high familiarity, the bias favored the male applicant! Keeping in mind that the number of people expressing moderate or high familiarity with M-R was so low that there is huge uncertainty around any estimate of how biased they were, I think the “significant” moderation by familiarity results are statistical noise. Regardless, there was no evidence supporting a general pattern of “more familiarity, more bias favoring the female applicant” for 10 of 12 outcomes.

For this post (i.e., it’s not in the paper), I tested my suspicion that the pattern was statistical noise. I did this by testing this pattern against the null hypothesis of “no effect of familiarity that increases biases favoring women.” This can be tested with a (relatively) simple (for the statistically trained) binomial probability. Null hypotheses, if true and tested appropriately, will produce “statistically significant” results in 1 of 20 tests. If the null is true, how likely would we get statistically significant results in the “right” direction in 2 of 12 tests? I address this question using this online binomial probability calculator.

The probability of getting no more than 2 significant results in the “right” direction, out of 12 results was .98. The probability of getting at least 2 significant results in the “right” direction was .12. Put differently, the null of no effect of familiarity cannot be rejected for our results (which using conventional standards requires a probability level of .05 or lower). In normal English, this means there was no evidence here of a systematic effect of familiarity increasing biases favoring women.
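If you would rather not trust an online calculator, the two probabilities above can be reproduced in a few lines of Python. This is a minimal sketch, not the calculator’s own code: the helper names are mine, and it assumes what the text above assumes, namely 12 independent tests that each have a 1-in-20 (.05) chance of coming up “significant” when the null is true.

```python
from math import comb


def binom_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def prob_at_most(k: int, n: int, p: float) -> float:
    """Probability of k or fewer successes."""
    return sum(binom_pmf(i, n, p) for i in range(k + 1))


def prob_at_least(k: int, n: int, p: float) -> float:
    """Probability of k or more successes."""
    return 1.0 - prob_at_most(k - 1, n, p)


N_TESTS = 12   # four outcomes by three studies
ALPHA = 0.05   # assumed chance of a "significant" result under the null

p_at_most_2 = prob_at_most(2, N_TESTS, ALPHA)
p_at_least_2 = prob_at_least(2, N_TESTS, ALPHA)

print(f"P(no more than 2 of {N_TESTS}) = {p_at_most_2:.2f}")
print(f"P(at least 2 of {N_TESTS})     = {p_at_least_2:.2f}")
```

The independence assumption is a simplification, since the four outcomes within a study come from the same respondents, so treat the result as a rough check rather than an exact p-value.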

If Reviewer 1 is STILL Right, It’s Even Worse!

The reviewer’s “familiarity” and “tainted response” hypotheses were resoundingly disconfirmed. Furthermore, there is little support in the wider literature for the reviewer’s core idea (that people lie to appear unbiased for social desirability purposes on anonymous questionnaires), which is why the reviewer did not cite any evidence documenting such social desirability biases in the review. Indeed, there is evidence that social desirability plays no role in many cognitive tasks. Although those are admittedly not the same as bias assessment, there is also excellent evidence that direct explicit questionnaire measures of racial prejudice are more valid than indirect ones.

There is neither evidence nor good reason to think it is any different with respect to sex biases. In short, there is little evidence, anywhere, for the “social desirability taints bias assessment” hypothesis and considerable evidence against it.

But despite this, let’s steelman the idea that, nonetheless, the reviewer is right. This would mean that the STEM faculty in our sample, when responding on anonymous questionnaire, felt social desirability pressures to evaluate “Jennifer” (the name of the female applicant in our studies) more positively than they really believed. Let’s just temporarily pretend that’s true, and consider what it would mean.

Why would faculty lie on an anonymous questionnaire? They are terrified of appearing sexist, even to themselves. And they do this on an anonymous questionnaire.

Most likely, they have rationalized the reasonableness of doing so. They have become “true believers” in the whole political “smash the patriarchy” business, and in the relentless drumbeat of rhetoric about sexist biases in academia and beyond. There is more than ample evidence for this, i.e., that faculty massively overestimate sexist biases in hiring generally and are massively more likely to cite research showing biases against women than biases against men. So what do such faculty do? They engage in psychological affirmative action — giving higher evaluations of the female candidate (than they do for an identical male candidate) to do something like “compensate for past sexist injustice.”

If this is the case, then their pro-female biases are their honest responses, again disconfirming the reviewer’s analysis.

But the reviewer’s analysis did not argue that faculty rationalized their biases and reported them honestly. Instead, it argued that faculty responses would not be credible; they would be disingenuous distortions. If social desirability pressures are enough to get faculty to distort their evaluations on an anonymous questionnaire, think how much stronger such pressures would be when publicly deliberating (e.g., in a department meeting) about whether to hire male or female job applicants. How many faculty want to be seen as atavistic chauvinist pigs? As far right reactionaries? As on the wrong side of history? How many want to alienate their female and feminist colleagues by standing for merit if doing so means NOT giving the edge to the female applicant? The answer probably is above 0, but probably not by much. Far easier to pretend to base the decision on merit and give the nod to the female candidate (which in both M-R and our studies would be pretense under the “taint” analysis, given that their qualifications were identical).

Even though faculty hiring votes are often anonymous, we just stipulated that social desirability manifests on anonymous questionnaires. If anything, such biases would likely be more extreme when there is social pressure to evaluate the female candidate more positively. Such biases seem even more likely to manifest in an anonymous department vote than on an anonymous questionnaire, because in a departmental vote, people believe (often, I suspect, mostly correctly) that it is easier to figure out who voted yea or nay than it is on an anonymous questionnaire.

The results showing biases against men are damning, even in the extraordinarily unlikely event that they were “tainted” by social desirability pressures. In fact, they may be even more damning if that is what produced those biases. It would indicate that regressive/progressives have created a toxic academic culture where people are terrified to express their true opinions.

M-R failed to replicate; indeed, its finding was reversed. We did not test explanations for the reversal (other than “familiarity with M-R,” which failed to account for the patterns of bias), so we do not really know why the results were different. Perhaps M-R was a false positive (statistically significant results finding biases against women, even though no such biases existed). Perhaps the culture changed, going from misogynistic to misandrist. Perhaps the culture did not change, but academia did, going from misogynistic to misandrist. Perhaps the faculty at M-R’s six institutions were misogynistic but the faculty at the 55 institutions from which we surveyed faculty were misandrist. It’s impossible to know for sure based on our replications.

Regardless, M-R not only failed to replicate, its findings were reversed. That stands, regardless of the explanation.

