R³: Randomized Replication Residency
a proposal to determine the reliability of federally funded science while building out replication capacity
No one really knows how much scientific research can be trusted. Large-scale evaluations of the replicability of work are still relatively rare, but what evidence there is seems worrying. Only a third of psychology papers replicate. Economics does somewhat better, but about 40% of top papers appear to have serious empirical issues.
Nor is the problem limited to social science. Perhaps half of published biomedical papers do not stand up to further investigation. In one analysis, only 11% of “landmark” preclinical cancer studies could be reproduced, and around half of cancer researchers have tried and failed to reproduce published work.
This lack of replicability isn’t simply an academic concern; it has real consequences. One study estimates that unreplicable preclinical research wastes $26B a year. It is also possible that fraudulent research into Alzheimer’s led to billions of misdirected dollars and a delay in curing a disease that affects millions of Americans. A fraudulent study on beta-blockers may have led to tens of thousands of deaths.1
For this reason, replication - and metascience more generally - is getting increased attention from policymakers. In the US, RFK Jr. says that fixing the replication crisis should be a top priority,2 and NIH director Jay Bhattacharya agrees.
But how do we actually do that? That is less clear. Replication is currently a thankless task, and academics have few incentives to focus on it. Academics generally must publish to get tenure, and replications are much more difficult to publish than new research. Some journals do not take replications at all; even those that do accept them publish replications far less often than new papers.3
Furthermore, focusing on replication often makes researchers quite unpopular in their field. After all, telling other people that their papers are wrong is rarely a good way to make friends and influence people. This is an existential concern for working academics; getting tenure relies on having positive relationships with senior people in their field.4
Perhaps a natural place to start would be the grantmaking organizations themselves. After all, they ultimately control which science does and does not get done. They could play a role in replication in three ways:
Funding replication work.
Most funding agencies currently spend very little on replication work. As an example, the NIH spent just $2M on replication funding in 2024. Since it is a $47B agency, it currently spends less than 0.01% of its annual budget on replication.5 The House has suggested increasing this amount 50-fold, to $100M.6
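As a quick sanity check on those percentages, here is a back-of-envelope sketch using only the figures cited above:

```python
# Back-of-envelope check of NIH replication spending as a share of its budget,
# using the figures cited above ($2M spent in 2024, ~$47B total budget, $100M proposed).
replication_spend = 2e6
nih_budget = 47e9
proposed = 100e6

print(f"Current share:  {replication_spend / nih_budget:.4%}")  # ~0.0043%, i.e. less than 0.01%
print(f"Proposed share: {proposed / nih_budget:.3%}")           # ~0.213% of the annual budget
print(f"Increase:       {proposed / replication_spend:.0f}x")   # the 50-fold increase
```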
Directly replicating work.
This may be more promising than asking academics to do the work, because NIH and other agency staff face different career incentives. It does not matter to career NIH staff if journals will not accept replications, as long as the NIH itself thinks this is a valuable use of staff time. It is also less of a problem if career agency staff alienate senior scientists - they will never need to ask those same scientists for tenure letters.
However, no agency currently has the capacity to do much replication in-house. Even if the NIH wanted to replicate just 1% of the studies it funds internally, it does not currently have enough staff with the relevant expertise. It certainly has staff who could learn to replicate, but they do not yet have the necessary skills.
Moving agency funding away from non-replicable work and toward better-designed work that is more likely to replicate.
At least in the social sciences, it is possible to predict - at a rate much better than chance - which studies are likely to replicate. This suggests that there are specific, identifiable qualities of non-replicable work. It could be possible to train grantmakers to focus on funding work that is more likely to replicate.7
This would be the biggest win. It is useful to know how much science is wrong or even fraudulent, but it would be better not to fund such science in the first place. If grantmakers can identify unreliable work before they fund it - and then decline to fund it - we would be in a much better place.
Which of these should be our first priority?
Funding replication is clearly the easiest of the three options, and it would be a good start. However, it doesn't really solve the underlying problem. Papers aren't cited less after they fail to replicate - and the incentives within science still push researchers toward doing bad, unreliable work.
We also need a way to tackle the second and third problems: science funding agencies simply don't have in-house replication capacity, and they often don't prioritize funding the most reliable science.
I propose the R3 fellowship program to address these issues. It would build out agency capacity to replicate, find scientific fraud, and shift grantmaking towards more replicable work.
What is R3?
R3 would be a two-year fellowship for top replication scientists who have already taken down bad science in their field - and want to expand their reach. These replicators would be embedded within grantmaking teams to replicate existing work and teach grantmakers how the replication process works.
Each fellow would be assigned to a small number of grants and would then replicate the results of those grants. During the replication process, the expert replicator would teach their process to the team that originally made the grant - building out additional capacity and training the team on what makes a study replicable.
This serves four purposes:
Determining the reliability of government-funded science.
We do not currently have a good estimate of how reliable an individual result is - or even how prevalent scientific fraud is. By randomizing which grants are assigned a replicator, R3 will give the first credible estimate of how much government-funded science is replicable, how much may be p-hacked, and even how much could be fraudulent.
The policy implications here seem obvious. We currently find scientific fraud in a scattershot way, investigating further only when someone already has suspicions about a paper. To draw conclusions from scientific work, we really should know how likely it is to be true. Today, we build scientific consensus without knowing how reliable each paper is - or how many papers on a given topic we need before deciding that a theory is probably true.8
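To make the statistical logic concrete, here is a minimal sketch of how randomly sampling grants for replication yields a portfolio-wide estimate. It is illustrative only; every number in it (portfolio size, sample size, outcomes) is invented.

```python
import math
import random

# Minimal sketch of the estimation logic behind randomized replication:
# sample grants at random, attempt to replicate each, and estimate the
# portfolio-wide replication rate with a confidence interval.
# All numbers here are invented for illustration.
random.seed(0)
all_grants = list(range(5_000))           # pretend portfolio of funded grants
sampled = random.sample(all_grants, 100)  # grants randomly assigned a replicator

# Pretend each replication yields a binary outcome (True = replicated);
# in the real program these would come from the fellows' replication reports.
outcomes = [random.random() < 0.6 for _ in sampled]

n = len(outcomes)
p_hat = sum(outcomes) / n                       # point estimate of replicability
se = math.sqrt(p_hat * (1 - p_hat) / n)         # standard error (normal approximation)
lo, hi = p_hat - 1.96 * se, p_hat + 1.96 * se   # rough 95% confidence interval

print(f"Estimated replication rate: {p_hat:.0%} (95% CI {lo:.0%}-{hi:.0%})")
```

Because grants are chosen at random rather than because someone was already suspicious of them, the estimate reflects the portfolio as a whole; even a hundred replications pins the rate down to within roughly ±10 percentage points.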
Making science more reliable.
Such a fellowship would have some of the same benefits as simply establishing a replication fund; each replication adds to the body of scientific knowledge.
Building out capacity for a more robust scientific ecosystem going forward.
This, to me, is the key value-add of the fellowship. Each replicator will train agency staff on how to replicate and where studies often fail. This directly addresses the lack of agency capacity to do replication in-house.
It is, in some sense, a metascience pyramid scheme. Your lead replicator trains some small number of replicators; each of them goes out and trains more replicators, and so on. Eventually, you will have a small army of people at government funding agencies with replication training.9
It also means that the impact of the program does not end when the lead replicator leaves. They may move on to other projects, but the people they have trained will be able to continue replicating federally funded research for years afterward.
Shifting funding patterns towards more replicable work.
Doing replications also trains people to look for flaws in research. As grantmaking staff replicate more projects, they will develop a sense of how likely a given study is to replicate - expertise they can then draw on when evaluating new grants.
This would equip agencies to favor better-designed studies and help shift funding towards work that is more likely to replicate.
All of this means that a successful R3 fellowship would not just produce a handful of replications. It is designed to create more replicators - and more grantmakers who care about replicability.
Fellowship Mechanics
I imagine a pilot of this program looking something like this:
Fellows: 10 replication leaders with proven fraud‑detection/replication records across biomedicine, social science & statistics.
The replicator assigned would be determined by the subject of the grant; a social science grant might be replicated by someone like David Roodman, and a microbiology grant by someone like Elisabeth Bik.
Host Match: Each fellow would be paired with one grantmaker or grantmaking team.
Budget: $250k/year fellow compensation + $50k replication expenses per expert10 ⇒ $550k per fellow over the two-year term × 10 fellows = $5.5M pilot.
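For clarity, a quick sketch of that arithmetic, assuming compensation is paid in each of the two fellowship years and the $50k replication budget is a one-time per-fellow allocation:

```python
# Pilot budget arithmetic. Assumes compensation is paid in each of the two
# fellowship years and the $50k replication budget is a per-fellow total.
fellows = 10
years = 2
compensation_per_year = 250_000
replication_expenses = 50_000

per_fellow = compensation_per_year * years + replication_expenses  # $550,000
pilot_total = per_fellow * fellows                                 # $5,500,000

print(f"Per fellow:  ${per_fellow:,}")
print(f"Pilot total: ${pilot_total:,}")
```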
If you are a funder or work at a science agency and would like to discuss this idea further, please drop me a line at lagilbert@gmail.com.
Many thanks to Laura Ryan, Sam Enright, Parth Ahya and Stuart Buck, who provided feedback on various versions of this document. All remaining mistakes are my own.
In most cases, replication catches mistakes and errors in analysis. In some cases - like the Dan Ariely case - it can catch outright fraud.
I rarely agree with RFK Jr, but he’s right here.
In one case that I know of, a journal was only willing to publish a failed replication if the original author agreed that their original paper was wrong. Shockingly, the failed replication did not get published at that journal.
As a graduate student, I replicated a study and found the headline result was probably wrong. A professor told me to drop it, because I couldn’t afford to make enemies if I ever wanted to get an academic job.
And the current program is opt-in; it seems unlikely that people will submit work they know is p-hacked or otherwise unlikely to replicate.
The amount of funding needed per replication varies by type of research. Replication in the social sciences largely involves PI time and rerunning code; this is relatively cheap. Replication in biomedicine may involve redoing preclinical and clinical work. It would be relatively easy to spend $100M on this.
“Could” is doing some work here - we don’t yet know how easy this is to do in fields outside social science. Early data in cancer biology suggests that it may be more difficult than in social science, but this paper is based on only six replications.
Let’s say there are five papers on topic X that say X is caused by Y. I am much more inclined to believe Y really causes X if 80% of the studies in that field replicate than if only 10% of studies in that field replicate.
I think of this somewhat like the order-of-magnitude (OOM) physics class I took in college. Originally, there was one professor at Caltech - Sterl - who taught an OOM class. His students took that knowledge and started teaching similar classes at the universities where they got jobs. Now it's a class at Stanford, Penn, MIT, the University of Texas at San Antonio, Berkeley, and others. For each of these, you can usually trace out the genealogy of how the professor knows Sterl. (And yes, Sterl knows I now occasionally teach OOM thinking.)
For cost reasons, the pilot would begin by focusing on desk and low-cost research. As the program expanded, it could begin to cover more costly forms of replication (e.g. preclinical and clinical work).