How to know what you cannot know?

The term “impact evaluation” means different things to different people, but a helpful definition is the systematic analysis and estimation of the causal effects of policy interventions and other events. In short, it means thoroughly trying to understand whether an intervention works or not. The fundamental problem of causal inference, however, is missing data: for each person, district, municipality, region, country or subset of entrepreneurs, researchers and so on, we can observe at most one potential outcome. That makes it hard not only to estimate the size of an intervention’s effect but, more importantly, to know whether the outcome would have been better without the intervention.
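
One common way to write this missing-data problem down is the potential-outcomes notation. The symbols below are not in the original text and are introduced only for illustration: D_i is 1 if unit i receives the intervention and 0 otherwise, and Y_i(1) and Y_i(0) are the outcomes that unit would have with and without it. The block assumes the amsmath and amssymb packages.

```latex
\begin{align*}
\tau_i &= Y_i(1) - Y_i(0)
  && \text{individual causal effect, never observed directly} \\
Y_i^{\text{obs}} &= D_i \, Y_i(1) + (1 - D_i) \, Y_i(0)
  && \text{only one potential outcome is realised} \\
\text{ATE} &= \mathbb{E}\left[ Y_i(1) - Y_i(0) \right]
  && \text{average treatment effect}
\end{align*}
```

Because only one term of the individual effect is ever observed for a given unit, evaluation designs usually aim at an average such as the ATE rather than at individual effects.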

The answer to this dilemma has been experimental research methods, which have become mainstream across many disciplines in the social and behavioural sciences after their widespread and longstanding use in medicine. A randomised controlled trial (or randomised control trial, RCT) is a type of scientific experiment or intervention study (as opposed to an observational study) that aims to reduce certain sources of bias when testing the effectiveness of new treatments. This is accomplished by randomly allocating subjects to two or more groups, treating them differently, and then comparing them with respect to a measured response.
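
The allocate-then-compare logic described above can be sketched as a small simulation. This is a minimal illustration on made-up data, not a description of any real trial: the function name `simulate_rct`, the outcome scale and the true effect of 2.0 are all assumptions chosen for the example.

```python
import random
import statistics


def simulate_rct(n_subjects=10_000, true_effect=2.0, seed=42):
    """Simulate a two-arm trial on hypothetical data and estimate the effect."""
    rng = random.Random(seed)

    # Hypothetical potential outcomes. In a simulation we can see both;
    # in a real study only one of them is ever observed per subject.
    outcome_without = [rng.gauss(50.0, 10.0) for _ in range(n_subjects)]
    outcome_with = [y + true_effect for y in outcome_without]

    # Random allocation: shuffle the subject ids and treat the first half.
    ids = list(range(n_subjects))
    rng.shuffle(ids)
    treated = set(ids[: n_subjects // 2])

    # Each subject reveals only the outcome for the group they were put in.
    treated_outcomes = [outcome_with[i] for i in ids if i in treated]
    control_outcomes = [outcome_without[i] for i in ids if i not in treated]

    # Compare the groups on the measured response (difference in means).
    estimate = statistics.mean(treated_outcomes) - statistics.mean(control_outcomes)
    return estimate, len(treated_outcomes), len(control_outcomes)


estimate, n_treated, n_control = simulate_rct()
print(f"estimated effect: {estimate:.2f} (treated={n_treated}, control={n_control})")
```

Because allocation is random, the two groups are alike on average in everything except the treatment, so the difference in means should land close to the assumed true effect, with some sampling noise.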

Fervent fans of experimental methods in social sciences, dubbed ‘randomistas’, were given a boost in 2019 after three pioneers of randomised experiments in international development – Esther Duflo, Abhijit Banerjee and Michael Kremer – won the Nobel Memorial Prize in Economics. Their organisation, the Abdul Latif Jameel Poverty Action Lab (or J-PAL), has run over 1,000 randomised controlled trials to understand how to reduce poverty and has championed the use of the method internationally.

However, with some exceptions, other areas of empirical social science lag behind in the acceptance and adoption of these experimental methods. Some of this lag stems from a misunderstanding of experimental methods and their application. At the same time, as with any new ‘standard’ in scientific methodology, criticism has grown alongside its use and authority. Below, a few pros and cons of using RCTs are listed.

Three benefits:

RCTs are currently the optimal and least-biased method for estimating, on average, whether something works, when done well. Those last three words are important. Just because a study is an RCT does not automatically mean it is ‘gold standard’. In fact, there is a continuum from the truly outstanding to the totally rubbish.
When combined with information on cost and implementation, RCTs provide very powerful information for decision makers (in our case this often includes programme managers in the educational sector and government officials) on the ‘best bets’ for spending their budgets. However, without information on implementation and an underlying theory, RCTs are a ‘black box’, meaning it can be difficult to interpret the result and decide how to act on it.
Crucially, experimentation also means being transparent about what does not work. There is no avoiding the challenges this brings, and it may mean taking a more honest approach: recognising in public that you do not have all the answers and need to test out your ideas. Though this is a work in progress, helped by a trend towards more open science, the set-up of RCTs further encourages the publication and sharing of negative results.

Three challenges:

RCTs are not suited to answering all kinds of questions, and there are some things to which subjects are just not willing to be randomised. Questions that would be interesting to answer through RCTs but have low odds of success include mixed attainment grouping versus setting in schools (too ideological), financial incentives for teachers (too controversial), and changing secondary school start times to later in the day to accommodate sleepy teenagers (too impractical). For these kinds of questions, we need alternative designs.
Sometimes the answer is ‘it depends’. RCTs tell you what works on average, but how is one school supposed to know how applicable an estimate is to them? I have learned that we need hundreds of schools even to estimate what works on average, so imagine how many we would need to tell what works for different types of schools and pupils. But data archiving and linkage have great potential for understanding variation in outcomes.
Decision makers want answers now, but RCTs take time to plan and deliver well. Without excellent planning and communication throughout, RCTs just won’t work. There is no easy solution to this. However, it does mean that it is essential to persuade decision makers of the value of RCTs and to demonstrate it, and to make sure the RCTs we design now are still relevant when the results come out.
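
The remark above about needing hundreds of schools to estimate an average effect can be made concrete with a rough sample-size calculation. This is a sketch under stated assumptions, not a planning tool: it uses the textbook normal-approximation formula for a two-arm comparison of means, inflated by a design effect for randomising whole schools. The function name `schools_per_arm` and the default values (25 pupils per school, an intra-cluster correlation of 0.15, 5 per cent significance, 80 per cent power) are illustrative choices, not figures from the original text.

```python
import math
from statistics import NormalDist


def schools_per_arm(effect_size, pupils_per_school=25, icc=0.15,
                    alpha=0.05, power=0.80):
    """Approximate schools needed in each arm of a school-randomised trial.

    effect_size is the standardised difference in means to be detected.
    All defaults are illustrative assumptions.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided test
    z_power = NormalDist().inv_cdf(power)

    # Pupils needed per arm if pupils were randomised individually.
    pupils_if_individual = 2 * (z_alpha + z_power) ** 2 / effect_size ** 2

    # Pupils in the same school resemble each other, so each extra pupil
    # adds less information; the design effect inflates the sample for that.
    design_effect = 1 + (pupils_per_school - 1) * icc

    return math.ceil(pupils_if_individual * design_effect / pupils_per_school)


for d in (0.3, 0.2, 0.1):
    print(f"effect size {d}: about {schools_per_arm(d)} schools per arm")
```

The required number of schools grows with the inverse square of the effect size, so halving the effect to be detected roughly quadruples the sample. Estimating an effect separately for one type of school needs about that many schools of that type alone, which is why subgroup questions become expensive so quickly.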

There are some measures that can be taken to mitigate the challenges listed above. A few ideas: communicate the benefits well to participants, so they do not drop out; collect high-quality data on cost and implementation; and think carefully about context and timing, to ensure relevance at the end.

At the same time, the field is not stagnant: the use of experiments to guide effective social action is constantly being championed and reinvented. It’s not only the classic social scientific experiment that has entered the toolbox of the policymaker. Prototyping is an approach traditionally used by engineers, designers and web developers, and is now part of the landscape of policy and service design and decision making. Another set of approaches central to learning about policy are quasi-experimental designs (QEDs), sometimes called ‘queasy’ experiments – after the sense of unease they can instil in purist researchers who prefer randomised designs. These methods have deep roots in social science and statistics, providing ways to test and learn about policy ideas that are already ongoing or in areas of decision-making where randomised approaches may not be possible. These types of designs also allow us to exploit unplanned opportunities for learning about social change, in what are termed ‘natural experiments’. These are tests that explore a policy or practice change resulting from a ‘natural’ divergence or external event.

No approach is perfect, and no design offers easy answers. Many innovations fail to make a difference; making better policy takes hard work and investment. Most projects evaluated by an RCT show little impact. In education, 90 per cent of interventions evaluated in RCTs by the US Institute of Education Sciences had weak or no positive effects. This lack of progress should not, however, depress us. It is a reminder of how difficult it is to make a difference (and how unwise it is to assume that just because an innovation is new and sounds attractive, it will improve outcomes in practice).