Ensuring Accuracy in Clinical AI: Lessons from the VA’s Experience with Population Health Risk Algorithmic Drift
Key Takeaways
- Study finding: A study by the US Veterans Health Administration (VA) found a 4% decline in the positive predictive value of the Care Assessment Need (CAN) algorithm from 2016 to 2021, equating to approximately 18 000 additional false positives among >5 million veterans.
- Clinical impact: Among high-risk veterans receiving palliative care, the false-positive rate rose from 61% to 69%, indicating misallocation of resources and erosion of quality metrics tied to high-risk veteran identification.
- Operational insight: The study recommends ongoing algorithm monitoring, input-variable tracking, and predefined recalibration plans to maintain accuracy, equity, and clinical validity in the VA and comparable health systems.
In this interview, Ravi B. Parikh, MD, medical oncologist and associate professor at Emory University’s Winship Cancer Institute, discusses his study titled “Performance Drift in a Nationally Deployed Population Health Risk Algorithm in the US Veterans Health Administration.” Dr Parikh shares insights into how predictive algorithms like the VA’s CAN score can evolve over time, what performance drift means for patient care and quality metrics, and how health systems can safeguard the accuracy and fairness of clinical AI tools.
Please introduce yourself by stating your name, title, organization, and relevant professional experience.
Ravi B. Parikh, MD: I’m Ravi Parikh, and I’m a medical oncologist and associate professor at Emory University’s Winship Cancer Institute. I’m also a staff physician at the Joseph Maxwell Cleland Atlanta VA Medical Center, where I practice medical oncology. I’ve been a VA oncologist and researcher for about 7 years.
Your study found that the Care Assessment Need (CAN) algorithm’s positive predictive value declined by about 4% from 2016 to 2021. For clinicians working with high-risk veterans, how might this drift have affected care decisions—particularly referrals to palliative care or intensive management programs?
Dr Parikh: I think it’s useful to explain a bit about what the CAN score actually does. The CAN score is one of the most widely used predictive algorithms across the VA system. It predicts the risk of hospitalization or mortality for over 5 million veterans who see primary care physicians every year.
The CAN score is used to make individual care management decisions for veterans and to allocate population-level resources, such as new facilities, services, or high-risk care management programs at VA medical centers. It’s widely accessed by thousands of VA clinicians and population health managers daily.
In our study, we found that over time, the CAN score’s performance—its ability to predict hospitalization or mortality—declined. That’s not unexpected; predictive algorithms often degrade over time. However, because the CAN score informs specific care decisions, such as allocating palliative care resources, this decline means that some veterans were likely misclassified.
Over a 5-year period, we observed a 4% decline in positive predictive value, which translated to about 18 000 additional false positives by 2021. This means clinicians were increasingly flagging veterans as high risk who did not experience hospitalization or death. Among high-risk veterans receiving palliative care, the false positive rate climbed from 61% to 69%. Although palliative care can benefit patients beyond those facing imminent mortality, this drift undermines the algorithm’s purpose.
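To make the arithmetic concrete, the sketch below shows how positive predictive value (PPV) is calculated and how a decline in PPV translates into additional false positives when the number of flagged patients stays roughly constant. The counts and PPV values are hypothetical illustrations, not figures taken from the study's data.

```python
# Minimal sketch (not the VA's evaluation code): how positive predictive value (PPV)
# is computed from flagged patients and observed outcomes, and how a PPV decline
# translates into additional false positives for a fixed number of flags.

def ppv(true_positives: int, false_positives: int) -> float:
    """PPV = TP / (TP + FP): of the patients flagged as high risk, the fraction
    who actually experienced the outcome (hospitalization or death)."""
    return true_positives / (true_positives + false_positives)

def extra_false_positives(n_flagged: int, ppv_baseline: float, ppv_current: float) -> int:
    """For the same number of flagged patients, a lower PPV means more false positives."""
    fp_baseline = n_flagged * (1 - ppv_baseline)
    fp_current = n_flagged * (1 - ppv_current)
    return round(fp_current - fp_baseline)

# Illustrative only: of 100 flagged veterans, 32 have the outcome -> PPV = 0.32
print(round(ppv(true_positives=32, false_positives=68), 2))

# Hypothetical illustration: a 4-percentage-point PPV drop across ~450,000 flagged
# patients yields roughly 18,000 additional false positives.
print(extra_false_positives(n_flagged=450_000, ppv_baseline=0.32, ppv_current=0.28))
```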
Because CAN scores are tied to national quality metrics, such as palliative care visit rates among high-risk veterans, what are the potential downstream consequences of algorithm drift for VA quality reporting and resource allocation?
Dr Parikh: One of the interesting aspects of our study is that, while other research has explored how performance drift leads to misallocated resources, ours is the first to examine how drift affects quality metrics. Many health systems use algorithms not only for clinical decision support but also to define populations eligible for certain quality measures.
For example, one VA quality metric tracks the percentage of high-risk veterans receiving palliative care. We found that when algorithms drift, the validity of these benchmarks erodes. Although the metric itself appeared stable, the underlying population changed due to increasing false positives. This could mislead policymakers into thinking care quality is stable or improving when, in fact, the algorithm is misidentifying patients.
That creates a troubling scenario: if quality metrics rely on flawed algorithms, resource planning and funding decisions could be misguided. This instability has implications beyond the VA, affecting any health system that uses algorithm-informed quality reporting.
You identified shifts in key covariates—especially demographics, health care utilization, and lab data. From a clinical standpoint, what kinds of changes in the veteran population or care delivery might be driving these shifts, and how could frontline clinicians or administrators recognize them sooner?
Dr Parikh: That’s a great question. It ties into how we can anticipate algorithmic decline before it affects care decisions. The biggest shifts we saw were in health care utilization: fewer in-person visits, more telehealth, and reduced hospital admissions. These reflect the profound impact of the COVID-19 pandemic, which continues to influence utilization patterns.
Clinicians experienced these changes firsthand; there were fewer lab tests, different patient presentations, and shifts in case mix as less acutely ill patients stayed home. By monitoring not just algorithm outputs but also the distributions of input variables—such as rates of missing laboratory values or telehealth usage—we might detect early signs of drift. That would allow for proactive monitoring or recalibration before misclassification affects care.
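As one way to operationalize this idea, the sketch below monitors the distribution of a single input variable against a pre-pandemic baseline using the population stability index (PSI), a common drift statistic. The study does not specify a monitoring method; the feature, threshold, and simulated data here are illustrative assumptions.

```python
# A minimal sketch of input-variable drift monitoring, assuming access to a baseline
# and a current sample of one input feature (e.g., in-person visit counts or the rate
# of missing laboratory values). PSI is used purely as an illustration.
import numpy as np

def population_stability_index(baseline, current, n_bins: int = 10) -> float:
    """Compare the binned distribution of a feature in a current window against a
    baseline window; larger values indicate larger distribution shift."""
    edges = np.histogram_bin_edges(baseline, bins=n_bins)
    base_pct = np.histogram(baseline, bins=edges)[0] / len(baseline)
    curr_pct = np.histogram(current, bins=edges)[0] / len(current)
    # Avoid log(0) in sparsely populated bins.
    base_pct = np.clip(base_pct, 1e-6, None)
    curr_pct = np.clip(curr_pct, 1e-6, None)
    return float(np.sum((curr_pct - base_pct) * np.log(curr_pct / base_pct)))

# Hypothetical usage: flag a feature for review when its PSI exceeds a chosen threshold.
rng = np.random.default_rng(0)
baseline_visits = rng.poisson(lam=4.0, size=10_000)  # simulated pre-pandemic visit counts
current_visits = rng.poisson(lam=2.5, size=10_000)   # simulated later period with fewer visits
psi = population_stability_index(baseline_visits, current_visits)
if psi > 0.2:  # 0.2 is a commonly cited rule of thumb for a meaningful shift
    print(f"Input drift detected (PSI={psi:.2f}); review the model before relying on scores.")
```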
Given that performance drift can accumulate over time, what processes or monitoring frameworks would you recommend the VA or other health systems adopt to ensure risk prediction models remain accurate and equitable?
Dr Parikh: Those who develop algorithms shouldn’t just deploy them and walk away. We need ongoing monitoring with clearly defined metrics. While statistical metrics like positive predictive value or area under the curve (AUC) are useful, they don’t necessarily reflect patient impact. We should instead track operational and clinical metrics—how algorithm-driven decisions change and which patients they affect.
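A monitoring program along these lines might compute statistical and operational measures side by side on a rolling basis. The sketch below assumes a scored-patient table with hypothetical column names (risk_score, high_risk_flag, outcome, score_date) and reports AUC, PPV, and the number of patients flagged per quarter; it is an illustration, not the VA's monitoring pipeline.

```python
# A minimal sketch of windowed performance monitoring. Column names are assumptions.
import pandas as pd
from sklearn.metrics import roc_auc_score

def quarterly_performance(df: pd.DataFrame) -> pd.DataFrame:
    """Compute AUC, PPV, and an operational metric (patients flagged) per quarter."""
    df = df.assign(quarter=df["score_date"].dt.to_period("Q"))
    rows = []
    for quarter, grp in df.groupby("quarter"):
        flagged = grp[grp["high_risk_flag"] == 1]
        rows.append({
            "quarter": str(quarter),
            "auc": roc_auc_score(grp["outcome"], grp["risk_score"]),
            "ppv": flagged["outcome"].mean(),  # of flagged patients, fraction with the outcome
            "n_flagged": len(flagged),         # operational: how many patients the flag affects
        })
    return pd.DataFrame(rows)
```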
Hospitals and systems should establish algorithm governance committees to continuously evaluate model performance and clinical consequences. Stress-testing algorithms under different conditions can also help identify potential weaknesses before they cause harm.
Finally, organizations should have predefined change-management plans that specify when and how to recalibrate or suspend algorithms if they degrade. Too often, corrective action comes only after retrospective studies reveal problems. A proactive framework would help health systems respond more quickly and effectively.
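One way to encode such a plan is as explicit, pre-agreed thresholds tied to actions. The sketch below is a hypothetical example of that idea; the tolerances and actions are placeholders a governance committee would set in advance, not VA policy.

```python
# A minimal sketch of a predefined change-management rule; thresholds are hypothetical.
from dataclasses import dataclass

@dataclass
class DriftPolicy:
    baseline_ppv: float
    recalibrate_drop: float = 0.03  # absolute PPV drop that triggers recalibration
    suspend_drop: float = 0.08      # absolute PPV drop that triggers suspension

    def action(self, current_ppv: float) -> str:
        drop = self.baseline_ppv - current_ppv
        if drop >= self.suspend_drop:
            return "suspend"      # pause algorithm-driven referrals and related quality metrics
        if drop >= self.recalibrate_drop:
            return "recalibrate"  # refit or re-weight on recent data, then revalidate
        return "continue"         # keep monitoring on the agreed schedule

policy = DriftPolicy(baseline_ppv=0.32)  # hypothetical baseline
print(policy.action(current_ppv=0.28))   # prints "recalibrate" for a 4-point drop
```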
The VA has been a leader in using predictive analytics at scale. What lessons from this analysis should other health systems—especially those serving complex or aging populations—take away about maintaining trust and safety in clinical artificial intelligence (AI) tools?
Dr Parikh: The VA has indeed led the way in deploying predictive algorithms for many beneficial uses, such as the CAN score, sepsis prediction, and suicide prevention. Other health systems should recognize that the same factors that make their populations complex—such as multimorbidity and social determinants of health—can also make algorithms more prone to drift.
The key lesson is that algorithm deployment can’t be “set it and forget it.” Health systems must budget for ongoing monitoring infrastructure or ensure that vendors provide it. Clinicians and population health teams should be educated about the signs of drift and empowered to raise concerns about algorithm performance.
Transparency is also critical. During events like COVID-19, clinicians should be informed when an algorithm’s reliability may be compromised. Finally, systems should reassess quality metrics based on high-risk patient identification during such disruptions. If those metrics become invalid, they should be paused rather than used to determine funding or performance evaluations.


