Rebutting a Revisionist History of the Implicit Relational Assessment Procedure

I poured my heart and soul into the IRAP and into cultivating a career around it. Over a thirteen-year period I managed to conduct, often in collaboration with a variety of colleagues and students, an integrated program of IRAP research. I have personally laid eyes on more than 3000 IRAP data files and designed and analyzed more than 30 empirical studies with this measure. For those who aren’t familiar with IRAP research and IRAP researchers, this is a lot. My efforts added up to hundreds of hours of discussion in lab meetings and thousands of person-hours of delegated work in my lab. This mission started during my time in graduate school, shortly after I first learned about the IRAP at a conference, I believe in 2005, from Dermot Barnes-Holmes, the creator of the IRAP and supervisor of an RFT lab that was then beginning to produce a steady flow of IRAP research. By 2006, I was producing my own IRAP data from a version of the measure that I wrote myself (with help from a programming-savvy friend).

In other words, I know and understand the IRAP from the inside out.

In those early years – from grad school, to internship, to a postdoctoral residency, to an assistant professor position at a small university – I managed to conduct several IRAP studies, to generate several upgrades to my own software, and to stay on top of the accumulating IRAP literature. One thing I was disappointed to discover early on was that the IRAP was not going to move the field overnight; the measure seemed to be pretty sensitive and often a bit challenging for participants. There was a lot of developmental work to do and relatively basic goals to aspire to, such as improving the reliability of the measure, reducing attrition, and just generally figuring out what the procedure could do for us. Although I had witnessed some people become disenchanted with the measure over these issues, I was okay with the challenge of addressing them empirically. I was willing to do the work. To me, it was all part of the scientific journey. I was optimistic that this measure could eventually offer us a means of assessing private events in a way that would greatly improve our reach into the world.

In 2012 I secured a professorship at Southern Illinois University. This job provided me with the opportunity to have a graduate research assistant, a lot of start-up funding, my own dedicated lab space, and, more broadly, the ability to embark on a large and prolonged research program with a team of graduate and undergraduate students. It was a dream come true. My first graduate assistant and I hit the ground running. The two of us, along with four newly recruited undergrads, were collecting IRAP data within a couple of months. I was now supervising a dedicated RFT lab! By this time, I was also a published IRAP researcher and starting to receive requests to review manuscripts of IRAP studies.

In the ensuing years my lab transformed into the most productive IRAP lab in the United States. I loved the challenge of using the IRAP to advance the field, loved the creative and strategic efforts we devised to serve that mission, and loved leading a team of capable students who were helping me make this journey as they were earning their doctorates in Clinical Psychology. I also enjoyed reviewing IRAP manuscripts and being a presence at conferences, where I could report my lab’s IRAP work and get acquainted with the work of others.

That is not to say that running an IRAP lab was easy or always fun. The difficulties culminated for me in 2016, a tumultuous year of ups and downs that would end with a pivotal change in my perceptions of my research program. It started off with the publication of a very high-quality IRAP study conducted by my lab showing that IRAP performances could be faked, despite expectations and the results of an existing publication on the (un)fakability of the measure. The reviewers were pretty strict, and after a lot of accommodating revisions the end result was very good.

Also that year, some colleagues and I finally published data we had collected several years earlier on acceptance and experiential avoidance. This was a very juicy topic for someone with translational research interests like mine. However, it was a messy dataset. I decided to submit it for publication because a highly counterintuitive finding in that dataset was replicated in a related project on psychological flexibility done by one of my graduate students for her master’s thesis. My plan was to produce a manuscript of that replication soon and submit it for consideration as well. The review of this manuscript was a strange experience; there seemed to be some revulsion at the counterintuitive finding, along with demands for inappropriate statistical analyses that would have obscured it. I had to recruit an outside expert on statistics to make my case with the editor and get it published.

That summer, I submitted another manuscript for a study conducted in conjunction with the study on fakability. This was another study where I was trying to build a foundation for translational research, hoping eventually to move my IRAP research into a clinical setting. I requested Dermot as one of the reviewers. Dermot was instead assigned as the managing editor. I was very excited about this manuscript – it was the best empirical IRAP manuscript I ever wrote, at least in my opinion. But to my tremendous dismay, several months later two anonymous reviewers declared, in strangely similar reviews, that it was a highly problematic study. I felt such despair as I read that the project was not fit for publication. No revisions were invited, and no response to those crazy reviewers would be allowed. It was so odd, for two reasons: because I easily could have addressed the concerns that were raised, and because I was often reading IRAP publications that were inferior to the study detailed in my manuscript. For years this manuscript languished on my computer hard drive. I am happy to say that people are now free to read it and judge its quality for themselves, as I’ve made it available as a preprint:

Drake, Sain, & Thompson (2022) “Comparing a Nomothetic and Idiographic Approach to Implicit Social Cognition” https://psyarxiv.com/mp72t/

The review of this manuscript was the point where I started to lose heart for the work, ten years into my journey. Many concerns had accumulated by this time – including many disturbing but anecdotal experiences that I have not reported here – and they were finally leading me to a more cynical view of what was happening. I had already discovered that it was hard to produce publishable data with the IRAP, and that year I learned that it was hard to publish even data that compared favorably to the current literature.

By 2018, my lab was no longer conducting IRAP studies. I finally admitted to myself that I was never going to be able to use the measure the way I had dreamed about years ago. In the preceding couple of years, in my lab’s last efforts to find something worth pursuing with the measure, we started to do more basic work, and we also used larger samples, as burdensome as that was. Among these studies were several earnest efforts to address the ACT concepts of fusion and defusion with the IRAP. These studies produced mostly null findings and/or small effects, which is such a shame, because the research design in these projects was so frickin cool! I’m sorry to brag, but seriously, this was my passion. That passion provided me with the fuel to conduct far more studies with the IRAP than I was able to publish. You can get a glimpse of these studies – every single IRAP study I ever did – in this presentation I gave at Kelly Wilson’s retirement conference:

Drake (2019) “The IRAP will break your heart” https://osf.io/htkm6

Looking back, I think I made two big mistakes at the outset of my research career. First, I let my heroes exert too much influence over my scientific behavior. They were publishing fascinating research with very small samples, so I assumed that was perfectly acceptable and tried to do the same, at least initially. Second, although I had a rudimentary understanding of statistical power, I did not go to the trouble of educating myself about how to plan adequately powered studies or how to evaluate the power of existing data. For most of my research program, I did not ensure that my findings would replicate, or were at least likely to replicate. As a result, I wasted a lot of time and energy (including that of my students and other collaborators) that could have been better spent on other endeavors. Insufficient power is a clear deficit of the IRAP literature broadly, and yet there is still almost no discussion of it in the IRAP community.
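To make “adequately powered” concrete, here is a minimal sketch of the kind of planning I skipped, written in Python with statsmodels. The medium effect size (d = 0.5) and the 20-per-group sample are my own illustrative assumptions, meant only to resemble a typical small group-design IRAP study rather than any particular paper.

```python
# A priori planning vs. post hoc evaluation for a simple two-group design.
# Assumes a between-groups comparison of IRAP D-scores; d = 0.5 and n = 20
# per group are illustrative values, not taken from any specific study.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# How many participants per group are needed to detect a medium effect
# (d = 0.5) with 80% power at alpha = .05?
n_required = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.80,
                                  alternative='two-sided')
print(f"Required n per group: {n_required:.0f}")        # ~64 per group

# What power does a 20-per-group study actually have for that same effect?
achieved = analysis.solve_power(effect_size=0.5, nobs1=20, alpha=0.05,
                                alternative='two-sided')
print(f"Power with n = 20 per group: {achieved:.2f}")   # ~0.34
```

With only about a third of the needed power, a null result is nearly uninterpretable, and a significant one is disproportionately likely to be inflated or spurious.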

The problems are even broader and deeper. One problem is the insufficient internal consistency and test-retest reliability of the measure. A lot of my studies looked at this, mostly as a secondary focus. It was usually poor – like, really poor – and you never knew why. This problem is sometimes dismissed as a needless indulgence in psychometrics that isn’t relevant to Contextual Behavioral Science and behavior analysis, but reliability absolutely is relevant: a score that does not correlate with itself cannot correlate dependably with anything else. You can find a hopefully informative manuscript about the issue here (currently under review at JCBS):

Hussey & Drake (2020) “The Implicit Relational Assessment Procedure demonstrates poor internal consistency and test-retest reliability: A meta-analysis” https://psyarxiv.com/ge3k7/
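For readers who want a feel for what estimating internal consistency involves with a latency-based score, here is a hedged sketch of one common approach: score each participant’s odd and even trials separately, correlate the two half-scores across participants, and apply the Spearman-Brown correction. The column names, the simplified D-like scoring, and the odd/even split are my own assumptions for illustration; see the preprint above for the analyses actually used.

```python
# Split-half internal consistency for a latency-based score (illustrative sketch).
# Assumes a trial-level data frame with columns: participant, trial_number,
# latency_ms, and trial_type ("consistent" vs. "inconsistent"). These names,
# and the simplified scoring below, are assumptions for this example only.
import numpy as np
import pandas as pd

def d_like_score(trials: pd.DataFrame) -> float:
    """Standardized latency difference between inconsistent and consistent
    trials (a simplification, not the full published scoring algorithm)."""
    incon = trials.loc[trials.trial_type == "inconsistent", "latency_ms"]
    con = trials.loc[trials.trial_type == "consistent", "latency_ms"]
    return (incon.mean() - con.mean()) / trials["latency_ms"].std(ddof=1)

def split_half_reliability(data: pd.DataFrame) -> float:
    """Odd/even split-half correlation with Spearman-Brown correction."""
    halves = []
    for _, trials in data.groupby("participant"):
        odd = trials[trials.trial_number % 2 == 1]
        even = trials[trials.trial_number % 2 == 0]
        halves.append((d_like_score(odd), d_like_score(even)))
    odd_scores, even_scores = map(np.array, zip(*halves))
    r = np.corrcoef(odd_scores, even_scores)[0, 1]
    return 2 * r / (1 + r)  # Spearman-Brown correction to full-test length

# usage (hypothetical file): print(split_half_reliability(pd.read_csv("irap_trials.csv")))
```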

Another problem is that the IRAP does not measure what you would expect it to measure. My first inkling of this came from a relatively consistent pattern in its scores across studies in which the IRAP was being used to measure very different things. Initially I found it amusing, then empirically interesting, and eventually I came to see that it reflects a very serious and fundamental limitation of the measure. You can find a hopefully informative manuscript about that issue here (also currently under review at JCBS):

Hussey & Drake (2020) “The Implicit Relational Assessment Procedure is not very sensitive to the attitudes and learning histories it is used to assess” https://psyarxiv.com/sp6jx/

People who are better informed and more statistically capable than I am have also observed problems with the IRAP and the IRAP literature. Ian Hussey, one of Dermot’s former students, was the primary author with me on the previous two citations; he has also been active on other projects that have critically examined the IRAP. You can find a very informative presentation detailing his activities here:

Hussey (2022) “An updated critique of IRAP research” https://docs.google.com/presentation/d/1I8PMhLGQ4Ib6IcxywFBYVrf5Afn_qdpDHz6TQq-XYqM/edit#slide=id.gc6f73a04f_0_0

Other works are forthcoming. It’s a shame that it takes so much time and effort to address problematic work that should never have been done in the first place.

Nowadays when someone asks me what the IRAP measures, I say, “It’s a noisy measure that measures noise.” Of course, one can provide other answers, such as “implicit cognition,” or “natural verbal relations,” or “relational response strength,” but these kinds of distinctions matter most in very specific circumstances. And they don’t matter at all if the IRAP does not reproducibly measure anything. I am waiting for replicated studies suggesting that my skepticism about the actual utility of the IRAP is wrong. After almost 20 years, it seems to me that such data should have been produced by now. If the IRAP were a legitimate psychological instrument, producing that evidence should have been easy. My lab made a mighty effort to produce it and was not able to.

I know that other labs have generated a lot of IRAP data over the years as well. I know because I have found myself reviewing manuscripts where it seemed apparent that the authors were attempting to extract a publishable product from a study that did not go very well. I suppose my own experience interrogating lousy datasets gives me some insight into these activities. One manuscript in particular has stuck in my memory: it detailed a study that I considered methodologically poor, and in the last of many paragraphs of my review I wrote:

“Perhaps most critically, the work involves a small sample and involves a very large number of analyses. It seems likely that the study is very underpowered, and the excessive number of analyses makes the work highly vulnerable to Type I errors. There are even multiple “marginally significant” effects reported in spite of these concerns. And on top of these critical issues, it is not clear how this project provides anything especially new to the IRAP literature. For these final reasons I believe that the manuscript is unsuited for publication in The Psychological Record.”

The manuscript was “Exploring Racial Bias in a European Country with a Recent History of Immigration of Black Africans”, by Power, Harte, Barnes-Holmes, and Barnes-Holmes. About a year later, I saw this project cited in the introduction of another manuscript I was reviewing. It wasn’t the first time I had witnessed a study get published despite my recommendation against it, but I had no recollection of being asked to review a revision. Curious, I went to the submission portal and located the authors’ response to my review. I discovered that the paragraph above was missing from the authors’ response to my list of concerns – it appears that the authors simply dropped this point and the editor didn’t notice. This deletion could have been a simple accident, but to me, it felt like yet another example of the kinds of things that seem to surround the strange persistence of underpowered IRAP studies in our little corner of psychology.

Of note, I have been asked to review an IRAP study only once in the past year, and only a few times in the past three years. I used to review many of them every year. I stopped being asked to review IRAP manuscripts around the same time that I started asking for larger samples, fewer analyses, and more transparent research practices. I know Ian has similarly been ostracized from the reviewer pool. The fact that we are no longer being asked to review submissions, despite our expertise, says something about what is going on with producers of IRAP manuscripts. It suggests not only that there is a lack of blindness in the review process, but also that editors over-rely on author suggestions when selecting reviewers. It’s a method of gaming and short-circuiting the process of science. Instead of conducting adequately powered studies, the authors find friendly reviewers who are willing to support their problematic research.

I have mostly been quiet about all these concerns, believing that the basic principles of scientific conduct, coupled with the growing interest in Open Science practices, would provide a corrective trajectory for these issues. These problems with the IRAP are publicly known, certainly by members of the IRAP community, in part because of the works that I have listed above.

So why am I posting this blog now? 

A couple months ago, this article was brought to my attention:

Barnes-Holmes & Harte (2022) “The IRAP as a Measure of Implicit Cognition: A Case of Frankenstein’s Monster” https://link.springer.com/article/10.1007/s40614-022-00352-z

For the typical reader, someone who is not terribly informed about the IRAP, this is an enjoyable and informative article, relating a nice story about how the IRAP was originally meant to be an RFT measure but got hijacked by people who wanted to do more applied work on implicit cognition. There’s even a seemingly poignant Frankenstein metaphor offered about that history, and some intriguing speculation about the viability of using the IRAP to explore fusion and defusion (good luck with that). Very readable stuff.

My awareness of this article marks the end of my silence. Because for me, it was not at all an engaging story; it is an insult to me and to the many other people who have worked very hard to make a meaningful contribution to Contextual Behavioral Science. Because the article is a lie. The history described in this article is not a history so much as it is a coverup. It’s subtle and clever, but easy to see if you know what to look for. It is quite obvious, to anyone willing to pay attention to the facts, what the IMPLICIT Relational Assessment Procedure was designed to measure, and who bears the greatest ownership of its use. For an excellent, data-based presentation of those details, see this work:

Hussey (2022) “Reply to Barnes-Holmes & Harte (2022) ‘The IRAP as a Measure of Implicit Cognition: A Case of Frankenstein’s Monster’” https://psyarxiv.com/qmg6s/

At this point, since it is apparent that I have a lot to say about this measure and the way it has been used over the years, you may be wondering why I don’t pursue a peer-reviewed rebuttal myself. After all, it’s in poor taste for a scientist to make controversial claims without subjecting those claims to scrutiny via the evaluative methods embraced by the scientific community. I mean, crafting a false narrative is also in poor taste, but at least it passed peer review, right?

First, let me just say that the relevant data are cited above. If you want an evidence-based critique of the IRAP, read the citations I have already offered here. Some of them are currently under review at JCBS, but the preprints and their associated data and analytic code have been open for inspection and critique by anyone for two years. These studies were conducted using Open Science practices. I wouldn’t be surprised if these data and their affiliated preprints are a major reason Dermot decided to write his revisionist history piece. As Ian details in his reply linked above, Dermot’s pivot away from claiming that the IRAP is an implicit measure coincided with emerging evidence that it is a very *bad* implicit measure relative to other implicit measures. It’s one thing to face the evidence and suggest a pivot; it’s something else to state that you never claimed something that you did indeed claim, in black and white, for over a decade. Revising history is what you do when you are unwilling to face the facts. It will be interesting, and perhaps aggravating, to see how reviewers at JCBS evaluate our submissions. Given my history, it is difficult to be optimistic.

Second, I would like to point out that Ian has already encountered difficulties in publishing his rebuttal to Dermot’s revisionist history piece – it has been desk-rejected by three peer-reviewed journals, including Perspectives on Behavior Science, where Dermot’s history was published, and the Journal of Contextual Behavioral Science. Apparently it is perfectly fine to publish misinformation, but not okay to publish corrections to that misinformation – not even for the audiences most vulnerable to it. This differential ability to publish in peer-reviewed journals adds another layer of dysfunction to our science. Those with privilege get to produce unreliable work and then get to publish content that helps them dodge accountability for doing it. There are other harms as well – it unjustifiably raises the perpetrator’s profile, misleads the community, and may lead to expenditures of resources to no good end, similar to what I experienced with my own research program.

This has been such a waste of resources, and as the years go by it will lead to an ever greater waste. History will not view this body of work kindly, nor the people who knowingly built it. Sooner or later, it will be broadly recognized that this literature is essentially worthless. It is just a matter of time. The fundamentals of this critique have been discussed informally for years, but few improvements have been made. Indeed, in some ways there has been a doubling down on bad practices.

In a recent paper about his experience as a student mentor (Barnes-Holmes, 2018, “A Commentary on the Student-Supervisor Relationship: A Shared Journey of Discovery” https://doi.org/10.1007/s40617-018-0227-y), Dermot does not seem terribly in favor of Open Science practices. While discussing some contemporary difficulties for students who wish to make themselves marketable for a career after receiving mentorship, Dermot wrote:

“…other external pressures have also entered the mix, such as the Open Science Framework (OSF), on foot of the so called replication crisis. This is not the place to work through the potential costs and benefits of these more recent developments in academic life, but the apotheosis of the KPI and the unquestioning acceptance of the OSF, all of which seem so reasonable at first blush, have the potential to impact on academic life in perhaps unexpectedly negative ways.” 

The context of this statement is Dermot’s expression of concern (without any substantive elaboration of that concern) about students who are exclusively interested in KPIs (Key Performance Indicators) when selecting a mentor, as opposed to having a genuine interest in learning a full range of research skills from that mentor. That seems reasonable, though I can understand students wanting to be marketable for employment after surviving graduate school. But notice how a concern about OSF, a platform that facilitates the conduct of Open Science, is folded clunkily into this concern about KPIs. Not only does Dermot seem critical of the practice of re-analyzing existing datasets, which is just one thin slice of Open Science practice, but he also seems skeptical of the replication crisis in psychology and of the value of OSF in and of itself.

I have learned that you cannot have a constructive dialogue with someone who participates in that conversation in bad faith. As I have been working on this post I have checked some of the more recent IRAP publications. Despite IRAP researchers having had plenty of time to know better, the typical sample sizes are still, even now at the end of 2022, usually way too low to provide credible results.

Some people will be inclined to defend these publication practices because behavior analysis has a long tradition of low-N research. But we are not talking about time-series designs here. We are not talking about studies that demonstrate any degree of control over the data produced by individuals – not even close. We are, instead, talking about group designs employing inferential statistics – playing the game of Null Hypothesis Significance Testing without abiding by the rules of that game. There is no shortcut to statistical inference with these methods. The math of p values doesn’t change based on whether the researcher using them identifies as a behaviorist or as doing ‘inductive’ research. These studies involve correlations, comparisons of means, and other practices that do not yield reliable, replicable, and credible conclusions unless they are derived from much larger sample sizes (to control false negatives), tighter constraints on experimenter degrees of freedom (to control false positives), and the other features of good-quality research.
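To make the arithmetic concrete, here is a minimal sketch of power for a simple correlation test, using the standard Fisher z approximation. The true correlation of r = .30 (a conventionally “medium” effect) and the sample size of 30 are my own illustrative assumptions, chosen to resemble the small samples common in group-design IRAP studies rather than any specific paper.

```python
# Approximate two-sided power for detecting a nonzero Pearson correlation,
# via the Fisher z approximation. r = .30 and n = 30 are illustrative values.
import numpy as np
from scipy.stats import norm

def correlation_power(r: float, n: int, alpha: float = 0.05) -> float:
    """Power to reject rho = 0 (two-sided) when the true correlation is r."""
    z_effect = np.arctanh(r) * np.sqrt(n - 3)  # Fisher z scaled by precision
    z_crit = norm.ppf(1 - alpha / 2)
    return norm.cdf(z_effect - z_crit) + norm.cdf(-z_effect - z_crit)

print(f"Power for r = .30 with n = 30: {correlation_power(0.30, 30):.2f}")  # ~0.36
print(f"Power for r = .30 with n = 85: {correlation_power(0.30, 85):.2f}")  # ~0.80
```

Roughly 85 participants are needed to reach 80% power for that effect; with 30, a study has about a one-in-three chance of detecting it even when it is really there.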

I am sorry to report that the replication crisis is alive and well – thriving, even – in the world of IRAP research. I now find myself worrying about the quality of other areas of research within Contextual Behavioral Science (with a particular concern about some of the RFT data). I suspect that our community is no more immune to the replication crisis than any other corner of psychology, but our failure to acknowledge the problem leaves us nearly a decade behind the crisis and its ongoing recovery in other areas. There has been some progress in recent years that I can acknowledge. In the spring of 2021, an ACBS Task Force released a report on recommended methods for Contextual Behavioral Science. A lot of attention was given to this vision, but Open Science practices were not overtly embraced in it. However, in the summer of 2021, ACBS released a follow-up document “recommending” a list of Open Science practices. To me, this seems like a baby step, but at least it’s a step in the right direction. Still, the wording of the document suggests that conducting credible research through transparent methods is desirable but still optional. And as one might imagine, a year and a half later, I am still seeing underpowered IRAP studies getting published in JCBS. You’d think that a collection of experts on behavioral contingencies would recognize how the current incentives favor cutting corners over credible work.

So, I am hoping that I can disrupt some perceptions about the IRAP that seem persistent in our community. Instead of “Has anyone done an IRAP study on this?”, here are some alternative questions I recommend we all begin asking:

Why are IRAP researchers continuing to conduct unreliable/unreplicable/non-credible research? 

Why do reviewers of IRAP manuscripts recommend publication of unreliable/unreplicable/non-credible research? 

Why do editors and peer-reviewed journals publish unreliable/unreplicable/non-credible IRAP research? 

Why is it so difficult to get critiques of the IRAP and its extant literature into peer-reviewed journals?

I wasted most of my research career on a measure that has been, and continues to be, misused and that actually seems to be just about useless. Many lines of research don’t work out – that’s part of the game – but that’s not what happened here. The warning signs that the IRAP has been oversold to our community have been present in the data for years, but they have been ignored – often willfully, and most often by the task’s creator, who effectively controls the literature. The data I have to support this bleak view are cited above, and hopefully forthcoming publications will continue to make their way into the light of day. I would have preferred to let these works do the talking for me, but Dermot’s recent distortion of history has motivated me to speak out.

Thank you for reading.
