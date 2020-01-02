

How much credit do you get for being "pretty good," that is, merely better than bad?

If you are an artificial intelligence algorithm, you get a lot of credit. AI programs do not have to produce a definitive answer, just a probabilistic one: a percentage likelihood that an answer is correct, whether the task is natural-language translation or cancer diagnosis.

The latest example of AI's probabilistic achievements appears in this week's issue of the journal Nature, in a paper titled "International evaluation of an AI system for breast cancer screening." It was written by an army of 31 scholars from Google's Google Health unit, its DeepMind unit, and Imperial College London, led by Scott Mayer McKinney, Marcin T. Sieniek, Varun Godbole and Jonathan Godwin (DeepMind CEO Demis Hassabis is among the authors).

In addition, a blog post offers comments from Google Health's Shravya Shetty, M.S., and Daniel Tse, M.D.

The Google Health team, Google's DeepMind unit and Imperial College London used an ensemble of three different deep learning neural networks consisting, from top to bottom, of Facebook AI's "RetinaNet" combined with Google's "MobileNetV2," followed by the now-standard ResNet-v2-50 in the middle section and, finally, a ResNet-v1-50 in the bottom tier. Each picks out suspicious areas of a mammogram in a different way, and the results are combined to arrive at a probability of cancer or no cancer.

Google Health, DeepMind, Imperial College London

The headline news is that Google's technology outperformed radiologists in both the United Kingdom and the United States at looking at mammograms years after the fact and stating whether cancer was present, demonstrating "an absolute reduction (…) in false positives and (…) in false negatives." The AI even outperformed a panel of six human radiologists who were tasked with reviewing five hundred mammograms and giving their diagnosis.

The result is an important contribution toward AI tools that could be very useful to doctors. But that does not mean AI can replace human evaluation. It is important to look more closely at the numbers, where there are many ins and outs.

Consider the setup. The scientists gathered data from three different hospitals in the United Kingdom on women who had been screened for breast cancer between 2012 and 2015 and who met certain criteria, such as age and examination, for a total of 13,918 women. That was the data used to train the system. Another 26,000 cases were used to test the system once it was trained. They followed the same process with data from a single US hospital, Northwestern Memorial Hospital, collected from 2001 to 2018, a much smaller sample. (When asked, the authors acknowledged a similar study conducted by New York University, whose results were published earlier this year. According to the Google authors, one of the most important differences is that they included three years of follow-up data, while the NYU study, like previous studies, was limited to case histories of a year or less.)

The scientists trained an ingenious ensemble of three different neural networks, each of which examined mammograms at a different level of detail. The details of this deep learning setup are fascinating and perhaps represent the state of the art in combining machine learning networks. One is ResNet-v1-50, by now a classic approach to image recognition, developed by Kaiming He and his colleagues at Microsoft in 2015. A second is RetinaNet, developed by researchers at Facebook AI Research in 2017. And a third is the MobileNetV2 neural network, unveiled by Google scientists in 2018. It is truly a wonderful combination of approaches, one that shows how code sharing and open scientific publication can enrich everyone's work. The details are in the supplementary materials document linked at the bottom of the main Nature paper.
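To make the idea of combining three networks concrete, here is a minimal sketch, not the authors' implementation. It assumes each network emits a cancer score between zero and one for the same mammogram and that the ensemble simply takes the mean. The function name, the network labels and the scores are all hypothetical, and the actual combining rule is described in the paper's supplementary materials and may well differ.

```python
from statistics import mean


def ensemble_score(model_scores):
    """Average per-network cancer scores (each in [0, 1]) into one score.

    Assumption: a plain mean. The real system may combine its networks
    differently.
    """
    if not model_scores:
        raise ValueError("need at least one model score")
    for score in model_scores:
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"score out of range: {score}")
    return mean(model_scores)


# Hypothetical scores for a single mammogram from three networks.
scores = {"network_1": 0.82, "network_2": 0.64, "network_3": 0.70}
combined = ensemble_score(list(scores.values()))
print(f"ensemble score: {combined:.2f}")
```

The point of the sketch is only that the output is still a continuous score: averaging three probabilities yields another probability, not a diagnosis.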

Now here comes the tricky part. The "ground truth" against which the trained network is judged in each case is whether breast cancer was confirmed by a subsequent biopsy. The diagnosis, in other words, was not just a matter of how things looked in an image but of what later medical tests found by definitively removing a piece of tissue. The answer, in that case, was an unequivocal yes or no as to the presence of cancer.

But the exquisite ensemble of three deep learning neural networks described above does not produce a yes or a no, not really. It produces a score from zero to one, a "continuous value," rather than a binary judgment. In other words, the AI can be fairly right or fairly wrong in any given case, depending on how close its score is to the correct value of zero or one.

To match that probability score against what humans do when they make a judgment, McKinney and his colleagues had to convert the AI's probability score into binary values. They did this using a separate validation set, which they used to pick individual cut-off points. The claims of "superiority" over human judgment therefore rest on a selection of the answers the AI gave, drawn from the broader set of total answers it produced.

As the authors explain, "the AI system natively produces a continuous score that represents the probability that cancer is present," and thus, "to support comparisons with the predictions of human readers, we threshold this score to produce analogous binary detection decisions." To "threshold" in this case means choosing a single point to compare: "For each clinical reference point, we use the validation set to choose a different operating point; this is equivalent to a scoring threshold that separates positive and negative decisions."
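What choosing an operating point on a validation set can look like is sketched below. This is an illustration under stated assumptions, not the procedure from the paper: it assumes the goal is to match a target specificity (for example, that of a human reader) and picks the lowest score threshold that achieves it on the validation data. All function names, scores and labels here are hypothetical.

```python
def validation_specificity(scores, labels, threshold):
    """Share of cancer-free cases (label 0) scored below the threshold."""
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not negatives:
        raise ValueError("validation set has no cancer-free cases")
    return sum(s < threshold for s in negatives) / len(negatives)


def choose_operating_point(val_scores, val_labels, target_specificity):
    """Lowest threshold whose validation specificity meets the target.

    Taking the lowest such threshold flags as many cases as possible
    while still meeting the specificity target.
    """
    # Candidates: every observed score, plus one above all scores,
    # which flags nothing and therefore always meets the target.
    candidates = sorted(set(val_scores)) + [float("inf")]
    for t in candidates:
        if validation_specificity(val_scores, val_labels, t) >= target_specificity:
            return t


def decide(score, threshold):
    """Turn a continuous score into a binary recall / no-recall decision."""
    return score >= threshold


# Hypothetical validation data: AI scores and biopsy-confirmed labels.
val_scores = [0.05, 0.10, 0.20, 0.35, 0.40, 0.60, 0.75, 0.90]
val_labels = [0, 0, 0, 0, 1, 0, 1, 1]

threshold = choose_operating_point(val_scores, val_labels, target_specificity=0.8)
print(threshold)                # 0.4
print(decide(0.72, threshold))  # True
print(decide(0.30, threshold))  # False
```

A different target gives a different threshold, which is the sense in which each comparison with human readers depends on a chosen point rather than on the full range of scores.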

On the UK data, the AI was as good as people at predicting whether something is cancer. The term the report uses is "non-inferior," which means it is no worse than human judgment. The area where the AI networks performed better was in what is called "specificity," a statistical term meaning that the neural networks were a little better at avoiding false positives, that is, predicting disease when it is not there. That certainly matters, because a false diagnosis of cancer means a lot of unnecessary stress and anxiety for women.
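For readers who want the arithmetic behind the term, here is a small sketch of how specificity and its counterpart, sensitivity, are computed from binary decisions and confirmed outcomes. The function name and the eight example cases are made up for illustration and are not data from the study.

```python
def screening_metrics(decisions, truths):
    """Return (sensitivity, specificity) for binary decisions vs. ground truth.

    decisions: 1 if the reader (human or AI) flagged the case, else 0.
    truths:    1 if cancer was confirmed, else 0.
    """
    tp = fp = tn = fn = 0
    for flagged, cancer in zip(decisions, truths):
        if flagged and cancer:
            tp += 1  # true positive: cancer correctly flagged
        elif flagged and not cancer:
            fp += 1  # false positive: healthy case flagged
        elif not flagged and cancer:
            fn += 1  # false negative: cancer missed
        else:
            tn += 1  # true negative: healthy case correctly cleared
    sensitivity = tp / (tp + fn)  # share of cancers that were caught
    specificity = tn / (tn + fp)  # share of healthy cases that were cleared
    return sensitivity, specificity


# Eight hypothetical cases.
decisions = [1, 1, 0, 0, 1, 0, 0, 0]
truths = [1, 0, 0, 0, 1, 1, 0, 0]

sensitivity, specificity = screening_metrics(decisions, truths)
print(f"sensitivity: {sensitivity:.2f}, specificity: {specificity:.2f}")
```

Fewer false positives raise specificity; fewer missed cancers raise sensitivity. A system can improve on one while merely matching humans on the other.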

But again, pay attention to the fine print. The human score, in this case, came from doctors who had to judge whether, on the basis of the mammogram, further tests such as a biopsy should be performed. It is conceivable that doctors at this early stage of diagnosis cast a wide net, sending a patient on to further tests so as not to run the risk of missing a cancer. That is a fundamental difference between a doctor who is deciding where to go next with a patient and a machine that is guessing the probability of an outcome years down the road.

In other words, a doctor sitting in front of a patient is not usually trying to guess the odds of outcomes years from now, but trying to determine the next critical step this patient should take. For example, even if the AI determines in a particular case that the likelihood of cancer is low based on the mammogram, would a patient want her doctor to err on the side of caution and prescribe a biopsy, to be safe rather than sorry? Patients may very well appreciate such caution.

And the scientists write in the summary section that the AI also missed several cancers that doctors caught, even as it found cases that doctors missed. This was especially true in the additional "reader study," in which six human radiologists reviewed five hundred screening cases. The researchers found "a sample cancer case that all six radiologists did not detect, but that the AI system correctly identified," but also "a sample cancer case that was detected by all six radiologists, but that the AI system did not detect."

Somewhat worryingly, the authors write that it is not entirely clear why the AI succeeds or fails in each case: "Although we were unable to determine clear patterns between these cases, the presence of such extreme cases suggests potentially complementary roles for the AI system and human readers to reach precise conclusions."

Perhaps, but one would certainly like to know more about how the three deep learning neural networks arrive at their probability guesses. What are they seeing, so to speak? That question, the question of what the networks represent, is not addressed in the study, but it is a crucial one for AI in such a sensitive application.

A big question that follows from all of the above is this: How much effort should be put into a system that can predict the odds of future cancer development more accurately than some doctors who have to make an initial evaluation? If those probability scores can help doctors decide in some "extreme cases," so to speak, the value of helping doctors with AI could be very high, even if AI can't really replace them at this time.

On a tangential note, the study, which analyzed data from both the United Kingdom and the United States, offers some puzzling findings about the comparative quality of the two health systems. In general, doctors in the UK appear to be much more accurate than doctors in the US at correctly concluding, from an initial screening, that something will turn out to be cancer.

Given the disparity between the data sets used (13,981 in the United Kingdom, from three hospitals, compared with 3,097 in the United States, from a single hospital), it is really difficult to know what to make of these disparate results. Apparently, the relative skill of human doctors in two different medical systems is just as intriguing as the AI.