Maternal and neonatal risk is a complex, multi-faceted issue that demands a sensitive, holistic approach to understanding and acting on it. This piece is part of a new resource library published by Jacaranda Health that explores pregnancy and postpartum risk across a broad spectrum of risk drivers – clinical, environmental, social, conversational – and provides practical suggestions for how digital tools can be used to both collect and effectively respond to data on these factors.
PROMPTS is Jacaranda’s AI-enabled digital health navigator that empowers new and expecting mothers across Kenya, Ghana and Eswatini to seek and connect with care at the right time and place. On any given day, the platform can field up to 5,000 questions from users about their pregnancy or their newborn baby. Most of the questions are routine, covering topics like nutrition or baby milestones, but occasionally these questions reflect an urgent or emergency situation.
PROMPTS uses machine learning to triage incoming questions, identifying urgent ones and prioritizing them for escalation (i.e. a call back and potential referral by a trained nurse). We have recently embedded our own locally adapted LLM into the platform to improve the efficiency of this process; we currently reach mothers with urgent questions within ~10 minutes.
As we scale PROMPTS, we are looking for additional ways to identify mothers who are at risk.
Sometimes an individual question from a user would not be classified as urgent, but a pattern in that user’s history of questions would warrant an escalation. These patterns are not immediately obvious: questions about ‘risky’ issues, for example, would already have been flagged by the helpdesk team. In addition:
- A minority (12%) of users have an escalation, and finding patterns through general observations is a bit like finding a needle in a haystack.
- Escalations can occur at any point in the 200 days following enrollment on the platform.
- There are no geographical patterns to escalation.
Machine learning can be a useful tool in identifying patterns where traditional analytical methods are not powerful enough.
We set out to answer two questions:
- Can we determine likelihood of escalation from what a user has previously told us?
- And can we use this data to better monitor and support ‘flagged’ users with a higher risk of complications?
Building a supervised machine learning model using conversational history
To answer the above questions, we established a well-characterized dataset consisting of 870k+ messages from ~160,000 users over a period of two years (early 2021 – 2023). These messages were classified by subject matter, or ‘intents’.
We used a family of algorithms known as tree-based models, specifically a ‘random forest’ (RF) technique, to build a predictive model from this dataset capable of reviewing a user’s message history and determining the likelihood of escalation. These models are well suited to classification problems (likely to escalate vs. not), can efficiently handle complex datasets with a large number of variables (different intents), produce interpretable results, and are easy to deploy at scale.
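As an illustration of the input such a model needs, the sketch below turns a message log into one row of intent counts per user. It is a minimal sketch under stated assumptions, not the PROMPTS pipeline: the intent names, the `(user_id, intent)` message format and the function `intent_count_features` are all hypothetical.

```python
from collections import Counter

# Hypothetical intent labels; the real PROMPTS intent taxonomy is not shown here.
INTENTS = ["nutrition", "baby_milestones", "bleeding", "headache"]


def intent_count_features(messages, intents=INTENTS):
    """Turn (user_id, intent) pairs into one count vector per user."""
    counts = {}
    for user_id, intent in messages:
        counts.setdefault(user_id, Counter())[intent] += 1
    # One row per user, one column per intent, in a fixed column order.
    return {user: [c[i] for i in intents] for user, c in counts.items()}


# Toy message log (invented for illustration).
messages = [
    ("u1", "nutrition"),
    ("u1", "headache"),
    ("u1", "headache"),
    ("u2", "baby_milestones"),
]
features = intent_count_features(messages)
```

Each row could then be paired with a label (escalated or not) and passed to a random forest implementation of choice.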
We tested a variety of publicly available random forest models and selected the one with the best performance. For classifier models, performance can be summarized by a single metric, the Area Under the Curve (AUC). An AUC of 1 indicates a perfect classifier, while an AUC of 0.5 is no better than a random guess. Our RF model had a strong AUC of 0.85 (see black line in Fig 2 below).
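For readers less familiar with AUC, one way to read it is as the probability that a randomly chosen user who escalated received a higher score than a randomly chosen user who did not. The sketch below computes it that way on invented toy data; the function name `auc` and the sample values are illustrative assumptions, not results from the study.

```python
def auc(labels, scores):
    """Probability that a random positive is scored above a random negative.

    Ties count as half a win. This pairwise form is fine for small examples;
    library implementations use faster rank-based methods.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy data: 1 = escalated, 0 = did not escalate (invented values).
labels = [1, 0, 1, 0, 0]
scores = [0.9, 0.2, 0.4, 0.6, 0.1]
example_auc = auc(labels, scores)
```

In this toy case, five of the six escalated/non-escalated pairs are ranked correctly.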
Model Results: The longer you monitor, the better the model.
Models typically do well when using retrospective data, but we wanted to understand how strong the predictive power of the model would be when presented with a batch of users’ conversational histories as part of a real-world, ongoing observational study. We devised an experiment to follow the progress of eight cohorts within a sample of 5,851 active users (see Fig 3).
Every week, we assessed the escalation risk of users who messaged in during that week, based on their conversations prior to that week. We then followed the cohort to assess whether or not they had an escalation on the helpdesk. The observational study ended on the same date for every cohort, so each cohort was monitored for a different length of time (the shortest observation period was 28 days, the longest 103 days).
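The key point of this design is that the model only sees messages sent before the scoring week, while the outcome is observed afterwards. A minimal sketch of that split follows; the `(date, intent)` message format, the function names and all dates are hypothetical, and counting the follow-up window from the start of the scoring week is an assumption.

```python
from datetime import date, timedelta


def split_history(messages, week_start):
    """Separate messages into history before the scoring week
    and messages sent during the week itself."""
    week_end = week_start + timedelta(days=7)
    history = [m for m in messages if m[0] < week_start]
    in_week = [m for m in messages if week_start <= m[0] < week_end]
    return history, in_week


def escalated_during_follow_up(escalation_dates, week_start, study_end):
    """True if any escalation falls between the scoring week and study end."""
    return any(week_start <= d <= study_end for d in escalation_dates)


# Toy user record (invented dates and intents).
messages = [
    (date(2023, 1, 2), "nutrition"),
    (date(2023, 1, 10), "headache"),
    (date(2023, 1, 20), "bleeding"),
]
week_start = date(2023, 1, 9)
study_end = date(2023, 2, 6)

history, in_week = split_history(messages, week_start)
in_cohort = len(in_week) > 0  # user messaged during the scoring week
outcome = escalated_during_follow_up([date(2023, 1, 25)], week_start, study_end)
```

Only `history` would be used to build the features for scoring, which keeps information from the follow-up period out of the prediction.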
The AUC across cohorts initially ranged from 0.57 (not much better than a random guess) to 0.68 (not too bad in comparison to other ‘real world’ predictive classifiers). The longer we monitored conversations, the better the model became at predicting the likelihood of escalation.
How would we use these models with the helpdesk?
The model outputs a probability of escalation. Our helpdesk agents can choose to further monitor users whose probability of escalation exceeds a certain level (monitoring could involve asking follow-up questions or proactively calling the individual). Choosing a threshold for escalation (i.e. the % probability of risk) has major operational implications.
Here are two scenarios to illustrate this choice:
Scenario 1: We set a relatively high escalation threshold at 33%. The model escalates the 1,500 individuals that meet this threshold to our human helpdesk team, who monitor them on the platform. We have high confidence that we are missing relatively few individuals who are likely to escalate (in the cohort study, 18% of individuals who were not monitored ended up escalating). This approach would work well if we want to provide a strong ‘rule-out’ option to agents, so as not to overload their schedules.
Scenario 2: We set a relatively low escalation threshold at 10%. As a result, 3,500 individuals get escalated to the human helpdesk. The larger volume of escalations means that we only miss 9% of those who eventually escalate, but we will also have a high number of false positives. In the real-world context of PROMPTS, this approach would mean our helpdesk would be monitoring an additional 30,000 individuals. This ‘rule-in’ option is better if we are able to automate the process of monitoring, e.g. by using our LLM.
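The trade-off in the two scenarios can be made concrete by counting, for a given threshold, how many users are flagged and how many escalations fall outside the flagged group. The sketch below does this on a handful of invented probabilities; the function `threshold_summary` and the toy data are assumptions for illustration and do not reproduce the cohort figures quoted above.

```python
def threshold_summary(probabilities, escalated, threshold):
    """Summarize the operational impact of one escalation threshold."""
    flagged = [e for p, e in zip(probabilities, escalated) if p >= threshold]
    unflagged = [e for p, e in zip(probabilities, escalated) if p < threshold]
    total_escalations = sum(escalated)
    missed = sum(unflagged)  # escalations among users we did not monitor
    return {
        "flagged": len(flagged),
        # Share of all eventual escalations that were not flagged.
        "missed_share_of_escalations": (
            missed / total_escalations if total_escalations else 0.0
        ),
        # Share of unmonitored users who went on to escalate.
        "escalation_rate_among_unflagged": (
            missed / len(unflagged) if unflagged else 0.0
        ),
    }


# Toy model outputs and outcomes (1 = escalated), invented for illustration.
probabilities = [0.05, 0.12, 0.20, 0.35, 0.50, 0.08]
escalated = [0, 1, 0, 1, 1, 0]

high = threshold_summary(probabilities, escalated, threshold=0.33)
low = threshold_summary(probabilities, escalated, threshold=0.10)
```

Note that the two "missed" figures answer different questions: one is measured against everyone who eventually escalates, the other against everyone left unmonitored.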
Limitations and next steps
Predictive models are only as good as their underlying data. The model is built on (1) conversational history and (2) recorded escalations. We can imagine a scenario where someone who didn’t chat much still had an escalation, and a scenario where someone who chatted a lot didn’t have a recorded escalation on the platform. Given the limited availability of clinically validated data (e.g. EMRs or accessible paper health records) for this population, we will need to explore alternative methods of validating our approach.
If there is inconsistency in how our agents historically escalated cases, this would also be reflected in our RF model. A more likely case is that the model reflects the lower risk threshold that our human helpdesk agents have (i.e. escalating a broader variety of cases to ensure nothing is missed). As we incorporate other factors into risk prediction on PROMPTS (clinical history, environment, social factors), our hope is that we can build stronger models that leverage varied sources of data to inform us whether a particular pregnant woman or new mother needs additional attention.
Meanwhile, we are working to incorporate these tools into our helpdesk software, so that agents are able to proactively monitor, support and make faster referrals for at-risk individuals on the platform. These conversational patterns can help us build towards a picture of risk, especially when processed and analyzed among other risk drivers that may impact our users – like their clinical history or environmental factors (see Jacaranda’s Risk Series).