REVERSAL: Science Faculty's "Subtle" Gender Biases Against Men
Our Highly Powered Registered Replication Report Finds Bias in the Opposite Direction of a Very Influential Study
Summary Up Front
In this post, I present the headline findings and a few other details from a reversal of a very famous study on sexism in science. Our studies, which used nearly identical methods to those of the original, were not merely a series of failed replications, though they were that. They were a reversal. The original found biases against hiring women in science. Three far more highly powered replication experiments all found biases against men.
The full paper is available here and the supplement is available here.
What is in this post:
Summary of the original study’s main findings of biases against women.
Summary of the central reversal findings from our own studies. As part of doing so, I explain some of the extraordinary steps we took to ensure the rigor of our replications, including:
Following the original study’s methods very very closely.
Conducting it as a registered replication report (and how this format dramatically reduces the potential for both researcher and journal biases is explained below).
Conducting it as an adversarial collaboration (and how this constitutes a very different type of check on researcher biases is discussed below).
I end with some general conclusions but also explain that this endeavor is sufficiently rich and complex that it warrants a series of posts on this topic, and I give some sense of what they will address.
The 2012 Moss-Racusin et al Study
This is one of the most influential studies conducted in psychology in the last 15 years (available here). As of 6/27/25, Google Scholar indicates it has been cited 4505 times (most papers are cited in single digits, a paper that has over 100 is doing quite well, and anything over 1000 is in rarefied territory). Citations are one measure of scientific impact (and, in fact, various citation measures are often referred to as “impact factors”). In addition, the Obama administration considered this study so important, so valuable, that it invited the lead author, Corinne Moss-Racusin, to the White House. She has parlayed this work and follow-up research1 into, by traditional academic standards, an impressive career, including mass media coverage, multiple grants, and a prestigious award from the American Psychological Association.
So by conventional academic standards, Moss-Racusin is very very successful. Kudos to her! This post is not about Moss-Racusin as a person or about her career writ large. It is just about this paper:
And yet:
Indeed, everything about this paper, from many aspects of the study itself, to the fact that it was published in one of the most influential journals, not just in psychology, but in all of science, to its widespread and widely uncritical acceptance, and to the attempts, often successful, to use such flawed science to influence policy, was so wrong. To be clear, I am not claiming they made up their data; there is no reason to believe that. Furthermore, I have no reason even to doubt that their results were mostly accurately described; certainly, their main results were accurately described. “What then,” you ask, “was the problem?”
It turns out that the problems2 were nonetheless so deep and pervasive that:
Identifying and explaining the ways in which it was wrong would take too long to condense into a single Unsafe Science post. Therefore, I will be doing an entire series on the ways this paper, and its aftermath and related research, is emblematic of much of what is wrong in science, academia, and elite culture.
The accepted version of our very close replication studies took 66 manuscript pages and 81 additional pages as part of an online supplement. I do realize, in the words of the poet:
On the other hand, there is no way I can make all the killer points in one essay. Thus, this will be a series of posts. In this one, my main goal is to convey what Moss-Racusin et al found, explain why it should never have been taken too seriously, and then contrast it with our studies, which were very close and rigorous replication attempts that found the opposite, i.e., biases against men.
What Moss-Racusin et al. (2012) Did
Moss-Racusin et al. (2012, hence “M-R”) performed an experiment, with science faculty as their participants. They recruited 127 faculty3 from six universities in biology, chemistry and physics to evaluate the application materials of an undergraduate student applying to be a lab manager. Everyone received the same materials, purposely designed to be kinda good but not stellar,4 with one twist: For half the participants, the applicant was named John, and for the other half, the applicant was named Jennifer. So, the only difference between the applications was the name — a marker for sex.5
M-R’s Key Findings
Results are shown (directly screenshotted from their paper) for three key variables: ratings of the applicant’s competence, hireability, and how deserving the student was of being mentored. Higher scores indicate more positive evaluations. All differences between the ratings of the male and female student applicants were statistically significant at p < .001, and they reported all effect sizes as d’s > .60.
Their participants also endorsed paying “John” more than they endorsed paying “Jennifer.”
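For readers who do not think in Cohen’s d units, here is one way (a generic statistical illustration of mine, not an analysis from either paper) to interpret an effect of d = 0.60, roughly the smallest effect they reported: assuming normally distributed ratings with equal variances, it corresponds to about a two-in-three chance that a randomly chosen rating of “John” exceeds a randomly chosen rating of “Jennifer.”

```python
# A small illustration (not from the paper) of what Cohen's d = 0.60 implies,
# using the "common language effect size": the probability that a randomly
# chosen rating of "John" exceeds a randomly chosen rating of "Jennifer",
# assuming normally distributed ratings with equal variances.
from scipy.stats import norm

d = 0.60  # roughly the smallest effect size M-R reported
prob_superiority = norm.cdf(d / 2 ** 0.5)
print(f"P(random 'John' rating > random 'Jennifer' rating) ≈ {prob_superiority:.2f}")
```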
The thing is, the fanfare and glorification of this study within and beyond academia was never justified.
M-R had a sample size of 127. This is not tiny, but it is pretty small. The problem with small samples is that they tend to have large levels of uncertainty around whatever they find. The small sample did not make the study wrong. Nonetheless, it should have affected its credibility. Rather than cause for celebration and glorification of the amazing discovery, attention to the sample size should have constituted a warning to consider the results suggestive and preliminary, rather than definitive.
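To make that uncertainty concrete, here is a minimal sketch (mine, not from either paper) of the approximate 95% confidence interval around a d of 0.60 with a sample roughly M-R’s size versus one roughly the size of our combined studies, assuming an even split across the two conditions. With about 63 per condition, the interval stretches from a modest effect to a huge one.

```python
# A rough sketch (illustrative assumptions, not either paper's analysis) of how
# sample size drives the width of the 95% confidence interval around Cohen's d.
import math

def ci_for_d(d, n1, n2, z=1.96):
    """Approximate 95% CI for Cohen's d, using the usual large-sample SE formula."""
    se = math.sqrt((n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2)))
    return d - z * se, d + z * se

for label, n_per_group in [("M-R-sized sample (~127 total)", 63),
                           ("replication-sized sample (~1100 total)", 550)]:
    lo, hi = ci_for_d(0.60, n_per_group, n_per_group)
    print(f"{label}: d = 0.60, 95% CI ≈ [{lo:.2f}, {hi:.2f}]")
```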
Our Replication Studies
Our studies were recently accepted for publication in Meta-Psychology, a journal that emerged from efforts to improve the methods, practices, and rigor of psychological research. Accordingly, our studies had several unique strengths.
Let’s Talk About Scientific Rigor
Close Replications
We performed three very close replications of the M-R study. A “direct” or “close” replication is a study that follows the methods and statistical analyses of a prior study, usually a published one, very very closely. It is tempting to use the term “exact replication,” but, in social science, there is no such thing. If some study was conducted by, say, Yale researchers in 2011, a replication is not “exact” if it is not conducted by Yale researchers in 2011; put differently, even if the methods are identical, a study conducted by, say, Rutgers researchers in 2022 is not literally identical to an original study conducted by Yale researchers in 2011. The replication also has different participants; even in the extraordinarily unlikely event that it somehow tracked down the original participants, they are now 11 years older; they are not exactly the same.
A replication attempt is considered “close” if it follows the methods as described in the published report nearly exactly. It is often impossible to follow the methods and statistical analyses exactly, even in published reports, because, too often, at least some of the methods were not described in the original report with sufficient clarity to know what they were. This was indeed the case with the M-R studies, where there were some ambiguities in how they conducted their analyses, ambiguities described in detail in our supplement. Sometimes, we contacted Moss-Racusin for clarifications. But we needed so many that we eventually stopped, because we were pretty sure it would be annoying. In other cases, we attempted to identify the different things they might have done, and we conducted the study or analyses every way we could identify that they might have conducted theirs. For example, one such issue, reported below (among many others not reported here), involves whether they paid their participants; because this was not completely clear, we conducted two studies with paid participants and one with unpaid participants.
Registered Replication Report
We did not merely conduct the three replications, though. We conducted them as a registered replication report, aka an RRR. These are something of a gold standard in replication work because of their transparency and the way they limit reviewer and editor biases. This is how it worked for us:
First, we wrote up a proposal, called a pre-registration, describing every aspect of the intended methods and statistical analyses in advance. Pre-registrations severely limit researchers’ abilities to cherrypick analyses and results, because they are all described upfront. They also eliminate the possibility of researchers fishing around in the data and, when p<.05 comes up, reframing the paper as if it was all planned in advance to test some a priori hypothesis.
We then submitted the proposal to the journal, Meta-Psychology, before we conducted the study. This means everyone — us, the reviewers, the editor — knew, in advance, what we proposed to do. No making shit up after the fact or fishing around in the data for “good” results and pretending it was “as predicted.”6 They accepted the paper based on the proposal, meaning that, as long as we at least mostly stuck to our pre-registration, they were committed to publishing it no matter how the results turned out. Journals with this format are about as serious about the science of it all as possible, because it means they abdicate their ability to reject a paper because “the results were not interesting” or “because nothing was statistically significant” or for scientific-sounding bullshit designed to prevent findings that challenge popular (in academia) political or theoretical narratives from getting published — all common reasons reviewers come up with for rejecting papers that journals are not committed to publishing based on a proposal.
Deviations are usually permitted, as long as they are minor and transparently reported.
So everyone knew, upfront, exactly what we planned to do, and both researcher funny business and editorial/reviewer biases based on the findings were largely pre-empted by the RRR format.
Adversarial Collaboration
But that’s not all folks. The paper was also an adversarial collaboration. This refers to researchers who have very different views about some phenomena, and often have publicly staked out very different positions, working together to empirically resolve some, or maybe all, of those differences.
The main value of the adversarial collaboration is that those who disagree can function as a skeptical check on each other’s potential biases. Even though this was a close replication, it is hard to know when you inadvertently slip in an assumption that, however unintentional, functions to bias the methods towards confirming the results. Similarly, we could also act as checks on each other’s potential biases in interpreting the results once they were obtained.
There were four authors: Nathan Honeycutt, Akeela Careem, Neil Lewis, and me. We all generated predictions regarding how the replication attempts would turn out. Some years ago, a debate about the (de)merits of “implicit bias” was held at UMass/Boston, where Neil and I took opposite sides (him saying it was a real thing and important, me saying not so much). Accordingly, we were on semi-opposite sides in our predictions for this series of replication experiments. Because of M-R’s small sample, I had no idea how it would come out; my prediction range ran from small biases against men to moderate biases against women. Nate pretty much predicted it would not replicate, though his prediction range included small biases against men and small biases against women.
Nate and Akeela were, at the time, my grad students (both have since completed their PhDs). Given randomness and uncertainty, we all gave a range of predictions for the bias effect size. Akeela, despite being my grad advisee at the time, predicted M-R would replicate, though probably with a somewhat smaller effect size. Neil predicted it would probably replicate with a much smaller effect size, though his prediction interval included 0 (no bias).
The combination of the RRR format and the adversarial collaboration is probably about as close as psychological research can get to a gold standard for rigor, transparency, and limiting bias.

Our Three Studies and Main Results
Methods
Our studies followed M-R’s methods very closely (John, Jennifer, same application materials as used in M-R, etc.), with a few differences to increase rigor and generalizability:
We performed three experiments, not just one, plus a meta-analysis of our three experiments.
Our sample sizes totaled about 1100 (compared to 127) across the three studies (the N for analysis varied a bit from analysis to analysis because of missing data).
We drew faculty from 55 top institutions, rather than the six they focused on.
Our studies included STEM faculty, whereas M-R focused on faculty in just the S(cience) fields. Study 1 focused exclusively on faculty in biology, chemistry and physics, as did M-R. Studies 2 and 3 focused on faculty in the rest of STEM (the TEM: technology, engineering, and math).
Main Results
There were no differences between the patterns of sex bias obtained in the three experiments, so, in a fourth study, essentially a meta-analysis of the three main studies, we simply combined all the data. The key result, shown below, was not merely that M-R failed to replicate; it was a reversal — we found biases against men! Faculty across the three studies consistently rated Jennifer more positively than they rated John.
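For readers curious what combining results across experiments looks like in its simplest form, here is a generic fixed-effect (inverse-variance-weighted) meta-analysis sketch. The per-study effect sizes and variances below are invented purely for illustration; they are not our actual numbers, and the fourth study reported in the paper used a more robust approach than this bare-bones version.

```python
# A bare-bones, fixed-effect meta-analysis sketch with invented numbers
# (NOT our actual results). Here a negative d means "Jennifer rated more
# positively than John."
import math

# (effect size d, sampling variance of d) for three hypothetical studies
studies = [(-0.20, 0.012), (-0.15, 0.010), (-0.25, 0.015)]

weights = [1 / v for _, v in studies]  # inverse-variance weights
d_pooled = sum(w * d for (d, _), w in zip(studies, weights)) / sum(weights)
se_pooled = math.sqrt(1 / sum(weights))

print(f"pooled d = {d_pooled:.3f}, "
      f"95% CI ≈ [{d_pooled - 1.96 * se_pooled:.3f}, {d_pooled + 1.96 * se_pooled:.3f}]")
```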
There were plenty of other interesting results in the main paper, and I plan to present them in subsequent posts in this series. They, too, generally failed to replicate or otherwise raised questions about the generality and even the credibility of M-R’s findings.
Conclusions
M-R found biases against women. Our very close replication studies found biases against men. Which should you believe?
This table compares the two reports on markers of methodological rigor and quality. You decide.7 Even if you decided ours were more rigorous, do not be too hard on M-R. They deserve credit for creating the methodology. Science (I am including social science here, which I realize some of you may consider a dubious proposition), if it functions well, often improves with practice. This is why, for example, we no longer bleed people to cure fevers (figuring that out took something like 2000 years; even I am cautiously optimistic that unjustified claims in social science won’t take 2000 years to correct, although it may feel that way). We could look at what they did well and not so well, and improve on some of it. They had no such option.
Nonetheless, it turns out that most experimental studies of sex biases in academic hiring show biases favoring women, but the research showing biases favoring men:
Is cited vastly more
Has vastly smaller sample sizes than the research showing biases favoring women
That will have to do for now; take my word for it, or not. Actually, you should not take my word for it; you should demand to see the evidence, which you should do for most claims of scientific credibility, including, but not only, mine. I will address that in a more detailed subsequent post (draft in preparation), with data and full references. For now, these Orwelexicon entries will have to tide you over.
There were lots of strange things going on throughout the entire experience of trying to get this published, and many will also be addressed in subsequent posts. Here is a teaser of things to come, unless I get distracted (and you never want to underestimate the probability that I will get distracted).
Not necessarily in the order in which they will appear:
Note that “Subtle” is in scare quotes in the title of this post. That is because an entire post will be dedicated to whether M-R actually even tested for “subtle” biases.
Bizarrenesses in the peer review process. PNAS, the journal that published the M-R paper, did not have an RRR format at the time we wrote up the proposal (I don’t know if they do now), so we could not send it there (which is itself a comment on PNAS). So we first sent it to Nature, which does have an RRR format. They rejected it for what I can only describe as some pretty strange reasons, which I will review in a subsequent post. Nature recently decided to make reviews public, but I think that is only for accepted papers anyway.
Strange claims in M-R about various aspects of their methods and analyses.
M-R also performed mediation analyses. We conducted the same mediation analysis. But we also tested alternative models. They (both theirs and ours) are a minor disaster area and a warning about how not to test for mediation in these sorts of studies.
There is a great little lesson about statistical power in our studies, revealing what happens if you perform the same study with and without high power to detect small effects (a quick back-of-the-envelope version appears just after this list).
M-R also administered a sexism scale. We administered the same scale. She found that the higher people scored on the scale, the lower they rated the female applicant. So did we. However, because, in our studies, people were generally biased in favor of women, the upshot of this relationship was that we found the least bias against men among those “highest” in “sexism” and the most bias favoring women among those “lowest” in “sexism.” This raises questions about what the “sexism” scale is actually measuring.
In a subsequent paper, Moss-Racusin found that men were more reluctant to accept at face value studies finding biases against women. The writing strongly implied that this was some sort of defensive unscientific attitude on the part of men. Turns out, the men’s skepticism was justified.
The rhetoric exploiting the M-R study to justify policies geared around advancing women was a thing to behold.
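As promised in the statistical-power item above, here is a back-of-the-envelope power calculation (normal approximation, illustrative numbers, not an analysis from either paper). If the true bias were small, say d = 0.20, a total N of 127 would detect it only about a fifth of the time, whereas a total N of roughly 1100 would detect it about nine times out of ten.

```python
# A back-of-the-envelope power calculation (normal approximation; illustrative
# only, not the analysis from either paper) for a two-group comparison of means.
from scipy.stats import norm

def approx_power(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample test for effect size d."""
    z_crit = norm.ppf(1 - alpha / 2)
    noncentrality = d * (n_per_group / 2) ** 0.5
    return norm.cdf(noncentrality - z_crit) + norm.cdf(-noncentrality - z_crit)

for total_n in (127, 1100):          # roughly M-R's N vs. our combined N
    n = total_n // 2                 # assuming an even split across conditions
    print(f"N = {total_n}: power to detect d = 0.20 ≈ {approx_power(0.20, n):.2f}")
```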
At the top of the Unsafe Science site, you can now find this:
Each is a tab/link that will get you to a specific type of post, either posts thematically related to each other (When Racist Mules Attack; Trump, Science and Academia) or posts of a particular kind (Book Chapters, Reports, Empirical Studies). This post, and all subsequent ones on the M-R study, our replication, and follow-up posts on the same type of topic, (will) appear in the tab circled in red. Just alerting you so that you can find the collection if and when there is a collection to be found, even if you are not avidly reading every post the second it appears. The point is to make it easier for you to find all thematically related posts in one place. I am also doing this because, although I am planning a series of posts, the series will not necessarily appear one right after another. I already have a guest post lined up on an entirely different topic, and stuff just comes up all the time. But I do plan to post a slew of essays on this.
Commenting
Before commenting, please review my commenting guidelines. They will prevent your comments from being deleted. Here are the core ideas:
Don’t attack or insult the author or other commenters.
Stay relevant to the post.
Keep it short.
Do not dominate a comment thread.
Do not mindread; it’s a loser’s game.
Don’t tell me how to run Unsafe Science or what to post. (Guest essays are welcome and inquiries about doing one should be submitted by email).
Related Unsafe Science Posts
~75% of Psychology Claims are False
In this essay, I explain why, if the only thing you know is that something is published in a psychology peer reviewed journal, or book or book chapter, or presented at a conference, you should simply disbelieve it, pending confirmation by multiple independent researchers in the future.
No Evidence for Workplace Sex Discrimination
This is a modified for comprehensibility (I hope!) excerpt from a draft manuscript by Nate Honeycutt and me to be submitted for peer review. The larger paper is on how psychology has periodically experienced explosive paroxysms of research on bias, only to discover that either the work was unreplicable, overblown, or that the “biases” discovered actuall…
Some Sex Differences are More Problematic than Others
April and Michael have become regular recent contributors to Unsafe Science, and I invite you to check out their prior contributions, which are definitely among my favorite guest posts, and which you can find here, here, and here. As most of you know,
Apex "Peer Reviewed" Journal, Nature, in Free Fall
Note: This is an expanded version of a Twitter thread I posted on this earlier this week. In case you did not already know, I am (probably too) active on Twitter, but it is such an amazing source of evidence that academia has gone off the rails, that I can’t resist. You can follow me on
Footnotes
Moss-Racusin’s follow-up research. I won’t be discussing that here, but I will indeed be discussing some of it in follow-up essays in this series.
The problems with Moss-Racusin’s research. There are indeed problems with the research, and I began, but by no means finished, identifying them in this essay. But I include in “the problems” not only the research itself but the subsequent unjustified interpretations of the work and follow-up work, and the dysfunctional manner in which it was used to leap to policy advocacy and to justify interventions.
Recruited 127 faculty. STRANGENESS ALERT. We were in touch with Moss-Racusin as we were designing our replication studies. Via email, she informed us that they had paid participants $25 each. This statement appears nowhere in either the main M-R article or the supplement. Thus, the scientific record has no evidence that M-R’s participants were paid, but, on the other hand, we have no reason to think she was lying or otherwise in error when she emailed us that she paid her participants. What to do? We conducted three separate replication attempts; in two, we paid participants $30 each, and one sample was unpaid. The patterns of means were amazingly similar across the studies.
The applicant was kinda good but not stellar. LIMITATION AND GENERAL STRANGENESS ALERT. They purposely did this to, as they wrote on p. 16478, allow “for variability in participant responses.” There is variability in responses to almost every question any social scientist has ever asked. However, steelmanning their rationale, the idea was probably that there would not likely be much bias if the candidate was stellar or awful. And there is some evidence that stereotypes are most likely to bias judgments when the personal characteristics of the target (the person being judged) are somewhat ambiguous. Nonetheless, in the real world, when selecting a person for a job, those doing the selecting usually try to pick the best person, not some second-tier pretty good but not great person. Thus, even if one takes their results at face value, something our reversal studies cast strong doubt on, this methodological feature of the study should have always been viewed as a severe constraint on generalizing their findings. Also filed under STRANGENESS ALERT: they characterized their decision to have the applicant be kinda good but not great as “following best practices.” There are best practices for implementing methodological procedures, but there are no “best practices” with respect to how to design a study to assess demographic biases; there are only different types of experimental methodologies, each with its own benefits and limitations, each useful for answering different questions.
Jennifer vs. John as a marker for sex. Published in 2012, the study was probably conducted in 2010 or 2011. Back in those days, it was reasonable to assume someone named “John” was a male and someone named “Jennifer” was a female. From The Orwelexicon:
Pre-registration, no fishing around in the data. Actually, fishing around in the data is fine, even in a registered replication report. It’s just that doing so needs to be transparently reported, and one cannot pretend that the results of said fishing were based on analyses planned in advance to test an a priori hypothesis.
I have to give them credit for having had a higher response rate than we did. This is how we discussed it in the published paper:
M-R had a 23% response rate, which was higher than that of our three studies, all of which were in the single digits. This is the one methodological aspect of their study that is stronger than that of the three studies reported in our paper. How likely is it that the response rate accounts for the differences in findings? It is possible but not likely, on several grounds. First, the demographics of our sample corresponded as well to the academic population as did theirs, which they touted as evidence for their study’s rigor and generalizability. If such correspondence constitutes such evidence for M-R, it also does so for our replications. Regardless, there is no evidence suggesting that demographic differences or biases can explain the results. Second, response rates to surveys in general have been falling for decades (Holbrook et al., 2007), including those of faculty email surveys such as M-R’s and ours. For example, the response rate for the faculty survey by Williams and Ceci (2015) was 34%, for Honeycutt & Freberg (2017) it was 26%, for Kaufmann (2021) it was 2-4% (depending on survey), and for Weitz et al. (2022) it was 6.6%. Third, response rates have generally been found to be “...unrelated to or only very weakly related to the distributions of substantive responses” (Holbrook et al., 2007, p. 508). Fourth, the robust meta-analysis reported in our fourth study found that the bias results were similar across our three studies, despite differing response rates.
I am too lazy to include the references to all those citations here, but they can be found in the paper, the link to which appears at the top of this post.