8 Comments

I completely agree with your suggestion not to take a reported finding too seriously until it has been independently and directly (not conceptually) replicated in a study with high power. However, this suggestion applies not only to reported findings but also to reported "failures to replicate." Therefore, I think you need to inquire more deeply before saying things like this: "In this context, it was then not surprising that the long-term success rate for replications in psychology has hovered right around 50% (e.g., Boyce et al., 2023; Open Science Collaboration, 2015; Scheel et al., 2021)."

The tables in OSC 2015 indicate 92% power for their replications. However, few realize that those power calculations were based on the inflated observed effect sizes reported in the original (also low-power) experiments, not on a reasonable estimate of their true effect sizes. When you take into account the inflation of reported effect sizes produced by a p < .05 filter (the filter used in the original studies), statistical power in OSC 2015 was not far from 50%, which means that what they reported is not far from what you'd expect if all of the original experiments reported true positives (not 50% false positives).

The findings that fail to replicate when tested again with high power are not so much the everyday studies conducted by experimental psychologists (contrary to what OSC 2015 implies) but the studies associated with sensational claims -- that is, studies associated with low prior odds of being true. This is why people are pretty good at predicting which findings will replicate and which will not. You say that the 50% successful replication rate baseline might be too high, but I think it is almost certainly too low. My 2 cents!
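A minimal simulation sketch of that inflation point, assuming a hypothetical true effect of d = 0.3 and n = 20 per group (purely illustrative values, not estimates from OSC 2015): original studies filtered at p < .05 report inflated effect sizes, and a replication powered to 92% against the inflated estimate has much lower power against the true effect.

```python
# Illustrative simulation: p < .05 filtering inflates observed effect sizes,
# so power computed from published effects overstates true replication power.
# true_d, n, and the 92% target are assumed values for illustration only.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
true_d, n, sims = 0.3, 20, 20_000

published_d = []
for _ in range(sims):
    a = rng.normal(true_d, 1.0, n)          # "treatment" group
    b = rng.normal(0.0, 1.0, n)             # control group
    t, p = stats.ttest_ind(a, b)
    if p < 0.05 and t > 0:                  # the publication filter
        pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
        published_d.append((a.mean() - b.mean()) / pooled_sd)

d_hat = float(np.mean(published_d))         # inflated "literature" effect size

def power(d, n_per_group, alpha=0.05):
    """Power of a two-sample t test via the noncentral t distribution."""
    df = 2 * n_per_group - 2
    nc = d * np.sqrt(n_per_group / 2)
    t_crit = stats.t.ppf(1 - alpha / 2, df)
    return 1 - stats.nct.cdf(t_crit, df, nc)

def n_for_power(d, target=0.92):
    """Smallest per-group n that reaches the target power at effect size d."""
    n_rep = 2
    while power(d, n_rep) < target:
        n_rep += 1
    return n_rep

n_rep = n_for_power(d_hat)                  # replication planned from published d
print(f"true d: {true_d}, mean published d after p < .05 filter: {d_hat:.2f}")
print(f"replication n per group for nominal 92% power: {n_rep}")
print(f"actual power of that replication at the true d: {power(true_d, n_rep):.2f}")
```

How large the gap is depends entirely on the assumed true effect size; the point is only that nominal power computed from filtered estimates exceeds actual power.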


Great article. Just the perfect combination of snark and great reasoning.

A quibble.

I work in a complex business field across a wide variety of industries, and was CEO / CTO for a non-profit research organization which worked to create reliable worldwide standardized measures, practices and processes in addressing business issues.

One simple indicator of complex failure was poor "perfect order reliability." The calculation involved reducing the problem to four types of failure. In business it's slightly different from science, but conceptually identical.

The standard is that an order can be late, incomplete, damaged, or have inaccurate paperwork (wrong invoice).

To calculate reliability you first classify orders on those four failure attributes, count the number that have none of them, and divide by the total number of orders.

A common arithmetic failure is to either multiply the percentages in each failure category across a sample set, or to add them - both are completely wrong. [I burst out laughing when I first read the origin of "intersectionality," a completely false set of concepts. They fail at elementary arithmetic.]

Consider two sets of four orders. In the first set only a single order fails in all four syndromes. In the second set each order fails in its own way.

Multiplying the per-category pass rates (75% of orders pass each category) in the first case gives you a reliability level of about 32%. The actual reliability is 3 perfect orders out of 4, or 75%.

In the second case the same logic again gives about 32%, but the actual reliability is 0% - there were no perfect orders.
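A minimal sketch of that comparison, using the two hypothetical four-order sets above (the category names are just labels):

```python
# Two hypothetical sets of four orders; each order lists the failure
# categories it exhibits (empty set = a perfect order).
CATEGORIES = ("late", "incomplete", "damaged", "wrong_paperwork")

set_1 = [set(CATEGORIES), set(), set(), set()]   # one order fails in all four ways
set_2 = [{"late"}, {"incomplete"}, {"damaged"}, {"wrong_paperwork"}]  # each fails once

def perfect_order_reliability(orders):
    """Correct measure: share of orders with no failures at all."""
    return sum(1 for o in orders if not o) / len(orders)

def multiplied_pass_rates(orders):
    """Wrong measure: multiply the per-category pass rates together."""
    result = 1.0
    for c in CATEGORIES:
        result *= sum(1 for o in orders if c not in o) / len(orders)
    return result

for name, orders in (("set 1", set_1), ("set 2", set_2)):
    print(name,
          f"multiplied: {multiplied_pass_rates(orders):.0%}",   # ~32% for both sets
          f"actual: {perfect_order_reliability(orders):.0%}")   # 75% vs 0%
```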

For that kind of multiplication to be valid, the failure categories must not be correlated or overlapping - ideally they would be "orthogonal" vectors, so to speak - when you look at reliability metrics.

Companies often claim great reliability, often in the 80%+ range, but when you apply the correct measure they are quite bad. It is not unusual to encounter a company that, on a month-to-month basis, has 0% reliability. I also encounter companies that have terrible reliability and simply reject the metric. It's quite crushing to realize customers dislike doing business with you, and easier to make excuses. Customers abandon the company.

I would apply the same logic to your observation.

Consider 6 papers: a single paper can be faulty due to faulty citations, faulty assertions, faulty data/arithmetic, lack of contrary evidence, censored evidence, or MSU (and, I note, citing outdated evidence as well). Faulty data/arithmetic is so widespread that it's not hard to glance through papers and instantly get a feeling that there's a problem.

The same issue is in play as with orders.

Consider two sets of six papers: one has a single paper defective in all 6 categories; the other has each paper defective in one category. Multiplying the per-category rates makes both situations look like about 33% reliability, but of course the first has 83% reliability and the second has 0% reliability.
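The same counting, sketched for the six-paper example (the fault labels here are just placeholders):

```python
# Two hypothetical sets of six papers; each paper lists the fault
# categories it exhibits (empty set = a fault-free paper).
FAULTS = ("faulty_citations", "faulty_assertions", "faulty_data",
          "missing_contrary_evidence", "censored_evidence", "msu")

papers_1 = [set(FAULTS)] + [set() for _ in range(5)]   # one paper faulty in all six ways
papers_2 = [{f} for f in FAULTS]                       # each paper faulty in one way

def share_fault_free(papers):
    """Correct measure: share of papers with no faults at all."""
    return sum(1 for p in papers if not p) / len(papers)

def multiplied_per_fault_rates(papers):
    """Wrong measure: multiply the per-fault pass rates together."""
    out = 1.0
    for f in FAULTS:
        out *= sum(1 for p in papers if f not in p) / len(papers)
    return out

for name, papers in (("set 1", papers_1), ("set 2", papers_2)):
    print(name,
          f"multiplied: {multiplied_per_fault_rates(papers):.0%}",   # ~33% for both sets
          f"fault-free: {share_fault_free(papers):.0%}")             # 83% vs 0%
```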

In the business world, I consider the orders for a company as a whole over a time period when benchmarking against other companies in the same or similar industries.

You may consider a few things to strengthen your case. First, I'd partition the data by field and then by institution - there are a huge number of research institutions (analogous to companies), and their order fulfillment is research: papers and other artifacts, like data, hypotheses, experiment designs, materials, patents, methods, trade secrets, and other IP.

I would apply your imperfect-paper criteria, with the partitioning logic I outlined, to all the output by industry (field) and by institution, and you could easily see exact outcomes.
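A rough sketch of that partitioning, assuming a hypothetical flat table with one row per paper and one boolean column per fault (the column names and toy rows are made up for illustration):

```python
# Hypothetical layout: one row per paper, one boolean column per fault,
# plus field and institution labels. All names and rows are invented.
import pandas as pd

FAULT_COLS = ["faulty_citations", "faulty_assertions", "faulty_data",
              "missing_contrary_evidence", "censored_evidence", "msu"]

papers = pd.DataFrame({
    "field":       ["social", "social", "cognitive", "cognitive"],
    "institution": ["Univ A", "Univ B", "Univ A", "Univ B"],
    "faulty_citations":          [True,  False, False, False],
    "faulty_assertions":         [False, True,  False, False],
    "faulty_data":               [False, False, False, True],
    "missing_contrary_evidence": [False, False, False, False],
    "censored_evidence":         [False, False, False, False],
    "msu":                       [False, False, False, False],
})

# A paper counts as "perfect" only if it has none of the faults.
papers["fault_free"] = ~papers[FAULT_COLS].any(axis=1)

# Perfect-paper reliability, partitioned by field and by field x institution.
print(papers.groupby("field")["fault_free"].mean())
print(papers.groupby(["field", "institution"])["fault_free"].mean())
```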

When we apply perfect order logic to a naive company the results are usually crushing, but smart companies simply begin partitioning failure by syndrome and attacking the core problems.

Institutions and industries (sciences) are not used to being measured by reliable research output; they are often measured only by citation indexing. They would all count as naive companies by the standards of the control processes businesses use.

Your thesis was that 75% of psychology research is faulty. By analogy to naive companies, it is entirely possible that entire institutions in the field have zero credible research - none at all.

An example is in fields contemplating "gender identity". The term was invented by two scientists in the 1960s to support positions they believed to be true about trans delusional patients, patients with transvestic fetishism, and a smattering of children with puberty dysphoria.

The principal experiments (not research: anecdotal rather than systematic) were both grotesquely unethical and masked the fact that the research subjects committed suicide as an outcome.

“Gender identity” is not a scientifically supported term in any dimension, similar to implicit bias or stereotype falsification.

Use of it in research (to me) invalidates the premises involved. That means that, for instance, all documents from the APA, WPATH, and others in the field of "gender identity" are faulty as a starting point, citing an entity, "gender identity," which is itself a combination of MSU, censorship, and faked/faulty/masked data.

I’m more cheerful than you, while being very pessimistic. I know empirically that generation of faulty work products can be corrected by systemic efforts at the institutional level.


Grand. Academic circle jerking.

Nov 15 · edited Nov 15 · Liked by Lee Jussim

Another great article. I can see why you use Gould's definition of a fact, but it is still a bad definition, and this is perhaps why you have often found yourself using scare quotes around the word 'fact'. A fact is a way that the world IS; more technically, it is a state of affairs that obtains. States-of-affairs, and therefore facts, are not the kind of thing that can be confirmed or disconfirmed.

What can be confirmed are statements that state states-of-affairs. A statement stating a fact is a truth. Scientific statements are used to make claims about what the facts are, and such claims can be the outcome of a body of scientific research. If that body of research is sufficiently comprehensive and methodologically sound, many of its claims are scientifically justified statements and belief in them is justified. It is such scientifically justified statements that can be said to be 'confirmed to such a degree that it would be perverse to withhold provisional assent'. Statements that state a fact can never turn out to be false, but scientifically justified statements can turn out to be false.

Consequently, for those two reasons, Gould's definition cannot be a good definition of a fact. It should instead be offered as a definition of when a claim is a scientifically justified statement and when belief in that claim is scientifically justified. I would still reject it even as that definition, though. Rather, his perverse-to-withhold-assent condition is a usefully short, generally reliable indicator of a scientifically justified claim.

Nov 15 · Liked by Lee Jussim

I get it, but, as usual, this approach has a very strong subdisciplinary bias. In my area, cognitive psychology/cognitive neuroscience, I hardly know of any (systematically) non-replicable findings: the Stroop, Simon, and flanker effects; memory effects such as recency; learning; the attentional blink; you name it. Some higher-order interactions may be context-dependent, but the basic effects (originally obtained with dramatically small samples, sometimes just the two authors) are super solid. So most of this handwaving business may be restricted to effects and phenomena that are more penetrated by culture. That's not so surprising, but please don't over-generalize from your own (admittedly messy) field...

author

I think you might be confusing "social psych is really bad" with "cognitive is really good." There is good evidence for the former, not so much the latter.

https://royalsocietypublishing.org/doi/pdf/10.1098/rsos.231240

Replication Rates

Cognitive: 58%

Social: 36%


Well, I was thinking of the theoretically relevant outcomes of theoretically motivated, theoretically important experiments conducted by experienced researchers (that's what my attentional span is optimized for processing), not of studies of unknown relevance that were cherry-picked (based on what graduate students found "interesting" and, I guess, unlikely to replicate at face value) and scored by two inexperienced graduate students according to "subjective replicability" using unknown, if any, criteria…

Nov 1 · Liked by Lee Jussim

Great article, Lee! The default position in science is to remain skeptical until there is sufficient evidence to cross Gould's threshold. Unfortunately, I think too many people, including psychologists, want to claim that they have special knowledge. Perhaps too many PhDs do this to justify the costs of earning a doctorate. I used to be guilty of it.
