Researchers in fields as diverse as healthcare and politics are increasingly turning to machine learning to help them find patterns in their data and draw conclusions. However, a pair of researchers from Princeton University in New Jersey have found that the claims in many such studies are likely to be exaggerated. They hope to raise awareness of what they describe as a “brewing reproducibility problem” in the fields of science that rely heavily on machine learning.
According to Sayash Kapoor, a machine-learning researcher at Princeton, many researchers accept the idea that machine learning is a tool that can be picked up in a few hours and applied on their own; yet, he argues, one would not expect a scientist to learn how to run a laboratory through an online course. Moreover, says Kapoor, who has co-authored a preprint on the ‘crisis’1, few scientists realize that the problems they encounter when deploying machine-learning algorithms are shared across fields. He claims that peer reviewers lack the time to thoroughly scrutinize these models, and that there are currently no systems in place to identify and exclude studies that cannot be replicated. To help scientists avoid such pitfalls, Kapoor and his co-author Arvind Narayanan have devised a detailed checklist to accompany every scientific manuscript.
What is reproducibility?
Kapoor and Narayanan define reproducibility broadly. Their definition includes computational reproducibility — the requirement that other teams be able to replicate a model’s results given full details of the data, code and conditions — which is already a recognized problem in machine learning. The pair also deem a model irreproducible when data-analysis mistakes make it less predictive than advertised.
Judging such errors is highly subjective and usually requires expert knowledge of the domain to which machine learning is being applied. Some researchers whose publications the team has critiqued have responded that they do not believe their studies are flawed, or have argued that Kapoor is making overly bold assertions. In the social sciences, for instance, researchers have built machine-learning models to predict whether a country will descend into civil war. Once the mistakes are stripped out of these models, according to Kapoor and Narayanan, they perform no better than traditional statistical methods. Political scientist David Muchlinski from Georgia Tech in Atlanta, whose paper2 was reviewed by the pair, argues that conflict prediction has been unfairly maligned and that subsequent investigations support his findings.
The team’s message has nevertheless struck a chord. On July 28th, Kapoor and colleagues hosted a small online workshop on reproducibility with the goal of brainstorming and disseminating possible solutions. More than 1,200 people registered for the event. Until something like this is done, he warns, each field will keep rediscovering the same problems in a perpetual cycle.
At the event, Momin Malik, a data scientist at the Mayo Clinic in Rochester, Minnesota, warned that over-optimism about the capabilities of machine-learning models could have harmful consequences when algorithms are applied in fields such as medicine and law. If the situation is not addressed, he says, machine learning’s credibility will suffer. He says he is surprised that the legitimacy of machine learning hasn’t already crashed. “But I think it might be arriving very soon.”
Concerns with machine learning
Similar difficulties, according to Kapoor and Narayanan, arise when machine learning is applied across different scientific domains. The pair examined 20 reviews covering 17 fields of study, which together identified 329 research papers whose results could not be replicated because of problems in how machine learning was applied1.
One of the 329 is a paper on computer security that Narayanan himself co-authored in 2015 (so he is not immune to the problem either). Kapoor argues that fixing the problem will require the efforts of the entire community; in his view, the failures cannot be pinned on any single researcher. Instead, they stem from exaggerated claims about artificial intelligence and a lack of proper checks. Kapoor and Narayanan identify “data leakage” as a major problem, which occurs when a model’s training data overlaps with the data used to evaluate it. If the two are not kept completely separate, the model’s predictions will appear far more accurate than they really are. The team has identified eight distinct forms of data leakage that researchers can watch out for.
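The basic effect can be illustrated with a toy sketch (the classifier, data and numbers here are hypothetical, invented for illustration, and not drawn from the preprint): a 1-nearest-neighbour model fitted to random, signal-free labels looks perfect when evaluated on its own training points, but performs at chance on properly held-out data.

```python
import random

random.seed(0)

# Toy data: random 2-D points with random binary labels (no real signal).
points = [(random.random(), random.random()) for _ in range(200)]
labels = [random.randint(0, 1) for _ in range(200)]

def nn_predict(train_pts, train_lbls, x):
    """1-nearest-neighbour prediction by squared Euclidean distance."""
    best = min(range(len(train_pts)),
               key=lambda i: (train_pts[i][0] - x[0]) ** 2
                             + (train_pts[i][1] - x[1]) ** 2)
    return train_lbls[best]

def accuracy(train_pts, train_lbls, test_pts, test_lbls):
    hits = sum(nn_predict(train_pts, train_lbls, x) == y
               for x, y in zip(test_pts, test_lbls))
    return hits / len(test_pts)

# Leaky evaluation: the test points are also in the training set.
leaky = accuracy(points, labels, points, labels)

# Proper evaluation: hold out the last 50 points entirely.
clean = accuracy(points[:150], labels[:150], points[150:], labels[150:])

print(f"leaky accuracy:    {leaky:.2f}")  # 1.00 — the model just memorises
print(f"held-out accuracy: {clean:.2f}")  # roughly chance, since labels are random
```

Because each test point is its own nearest neighbour under the leaky split, the model simply recalls the memorized label, so the apparent accuracy is perfect even though there is nothing to learn.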
Some forms of leakage are subtler. Temporal leakage, for instance, occurs when the training data contains information from a later time period than the test data, which is a problem because the future depends on the past. As an example, Malik cites a 2011 paper4 claiming that a model analysing Twitter users’ sentiments could forecast the stock market’s closing value with 87.6 percent accuracy. But because the model’s predictive power was tested on data from a period earlier than some of its training set, the algorithm had effectively been given access to the future, he says.
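A hypothetical sketch of how such a split goes wrong (the record layout and the helper function are invented for illustration): shuffling time-stamped records before splitting lets some training records postdate some test records, whereas a chronological cutoff keeps all training data strictly in the past.

```python
import random

random.seed(1)
timestamps = list(range(100))  # one record per day, in chronological order

# Leaky split: shuffling before splitting mixes past and future,
# so training records can postdate test records.
shuffled = timestamps[:]
random.shuffle(shuffled)
leaky_train, leaky_test = shuffled[:80], shuffled[80:]

# Clean split: everything before the cutoff is training data,
# everything after it is test data.
clean_train, clean_test = timestamps[:80], timestamps[80:]

def has_temporal_leakage(train, test):
    """True if any training record postdates the earliest test record."""
    return max(train) > min(test)

print(has_temporal_leakage(leaky_train, leaky_test))  # True
print(has_temporal_leakage(clean_train, clean_test))  # False
```

A check like `has_temporal_leakage` is cheap to run before evaluation and catches the mistake the Twitter study is said to have made: testing a forecast on a period that overlaps with, or precedes, part of the training data.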
Broader problems, according to Malik, include training models on datasets that are narrower than the populations they are ultimately meant to represent. For example, a computer program that detects pneumonia in chest X-rays might be less accurate for younger patients if it was trained solely on people over the age of 65. Another issue is that algorithms often come to rely on shortcuts that don’t always hold, notes computer scientist Jessica Hullman from Northwestern University in Evanston, Illinois, who also spoke at the workshop. A computer-vision system might, for instance, learn to recognize a cow by the grassy background typical of cow photos, and then fail when shown an image of a cow on a mountain or a beach.
High accuracy on prediction tests can lead people to believe that models are grasping the “real structure of the problem” in a human-like way, she says. She compares the situation to the replication crisis in psychology, in which too much faith was placed in statistical procedures. According to Kapoor, the hype surrounding machine learning has encouraged scholars to accept its results too readily. Malik adds that the term “prediction” itself is problematic, because most predictions are in fact tested after the fact and have nothing to do with forecasting the future.
Repairing the problem of data leakage
To combat data leakage, Kapoor and Narayanan propose that authors of research papers provide evidence that their models avoid each of the eight types of leakage. They suggest model information sheets as a format for reporting this material.
Xiao Liu, a clinical ophthalmologist at the University of Birmingham, UK, who has helped develop reporting standards for studies that use AI in, say, screening or diagnosis, notes that biomedicine has made significant strides with a similar approach over the past three years. In 2019, Liu and her colleagues found that of more than 20,000 papers using AI for medical imaging, only 5% described their methods in enough detail to judge whether they would work in a clinical setting. Guidelines don’t improve anyone’s models per se, she says, but they do “make it pretty evident who the individuals who’ve done it well, and maybe people who haven’t done it well, are”, giving regulators a useful resource.
Collaboration can also help, Malik argues. He recommends that studies involve specialists not only from the relevant discipline but also from machine learning, statistics and survey sampling. Kapoor predicts that fields where machine learning generates leads for follow-up, such as drug research, will reap the greatest rewards from the technology, but that more work is needed to demonstrate its usefulness in other areas. While machine learning is still in its infancy in many disciplines, he warns, researchers must take precautions now to avoid a crisis of confidence like the one that followed the replication crisis in psychology a decade ago. The longer the problem goes unaddressed, he says, the larger it will become.