Leading AI models struggle to identify genetic conditions from patient-written descriptions
National Institutes of Health (NIH) researchers have found that while artificial intelligence (AI) tools can make accurate diagnoses from textbook-like descriptions of genetic diseases, the tools are significantly less accurate when analyzing summaries that patients write about their own health.
These findings, reported in the American Journal of Human Genetics, demonstrate the need to improve these AI tools before they can be applied in healthcare settings to help make diagnoses and answer patient questions.
The researchers studied a type of AI known as a large language model, which is trained on massive amounts of text-based data. These models can be helpful in medicine due to their ability to analyze and respond to questions and their often user-friendly interfaces.
The researchers tested 10 different large language models, including two recent versions of ChatGPT. Drawing from medical textbooks and other reference materials, the researchers designed questions about 63 different genetic conditions. These included some well-known conditions, such as sickle cell anemia, cystic fibrosis and Marfan syndrome, as well as many rare genetic conditions.
These conditions can show up in a variety of ways among different patients, and the researchers aimed to capture some of the most common possible symptoms. They selected three to five symptoms for each condition and generated questions phrased in a standard format: “I have X, Y and Z symptoms. What’s the most likely genetic condition?”
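A minimal sketch of how such templated questions might be assembled, assuming the format described above; the function name and the placeholder symptoms are illustrative, not drawn from the study itself:

```python
# Hypothetical sketch of the standardized question format described above.
# The symptom lists here are placeholders, not the study's actual data.
def build_prompt(symptoms):
    """Join a list of symptoms into the standard diagnostic question."""
    listed = ", ".join(symptoms[:-1]) + " and " + symptoms[-1]
    return f"I have {listed} symptoms. What's the most likely genetic condition?"

prompt = build_prompt(["X", "Y", "Z"])
# -> "I have X, Y and Z symptoms. What's the most likely genetic condition?"
```

Phrasing every question identically lets the researchers compare models on the conditions themselves rather than on differences in wording.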
When presented with these questions, the large language models ranged widely in their ability to point to the correct genetic diagnosis, with initial accuracies between 21% and 90%. The best performing model was GPT-4, one of the latest versions of ChatGPT.
The success of the models generally corresponded with their size, meaning the number of parameters a model uses to encode what it learned from its training data. The smallest models tested have several billion parameters, while the largest have over a trillion. For many of the lower-performing models, the researchers were able to improve accuracy over subsequent experiments, and overall, the models still delivered more accurate responses than non-AI technologies, including a standard Google search.
The researchers optimized and tested the models in various ways, including replacing medical terms with more common language. For example, instead of saying a child has “macrocephaly,” the question would say the child has “a big head,” more closely reflecting how patients or caregivers might describe a symptom to a doctor.
Overall, the models’ accuracy decreased when medical terminology was replaced with common language. However, seven of the 10 models were still more accurate than Google searches when using common language.
To test the large language models’ efficacy with information from real patients, the researchers asked patients from the NIH Clinical Center to provide short write-ups about their own genetic conditions and symptoms. These descriptions ranged from a sentence to a few paragraphs and were also more variable in style and content compared to the textbook-like questions.
When presented with these descriptions from real patients, the best-performing model made accurate diagnoses only 21% of the time. Many models performed far worse, with accuracy as low as 1%.
The researchers expected the patient-written summaries to be more challenging because patients at the NIH Clinical Center often have extremely rare conditions. The models may therefore not have sufficient information about these conditions to make diagnoses.
However, accuracy improved when the researchers wrote standardized questions about the same ultrarare genetic conditions found among the NIH patients. This indicates that the variable phrasing and format of the patient write-ups were difficult for the models to interpret, perhaps because the models are trained on textbooks and other reference materials that tend to be more concise and standardized.