
Measuring ‘felt’ respect: Can AI models understand sentiment in conversational data?

Respectful Maternity Care (RMC) is a critical component of quality services for moms and babies. But it is subjective, sensitive and nuanced, and has been historically difficult to measure. Several approaches to RMC measurement have been proposed in recent years, including expanding facility exit interviews to include RMC questions and sharing granular email surveys with postpartum moms. But these approaches (while yielding encouraging results) are costly, tricky to scale, and don’t adequately reflect the real-time nature of these issues in facilities (where rapid action can save lives).

Our digital health navigator PROMPTS offers a rapid, low-cost way to assess client service utilization and experience of care. It routinely prompts its users – new and expecting mothers – to share feedback (via a simple two-question SMS survey) on whether they were treated with respect at their last facility visit. The data is anonymized (PROMPTS only captures a user’s phone number, location and gestational stage) and shared quantitatively and qualitatively in real-time dashboards with our government partners. We have strong anecdotal evidence that this feedback is driving tangible improvements in facilities.

We’ve often been asked whether two basic questions can adequately capture the breadth of RMC issues we see reflected in patient-centered care (eg. as listed in the 14-item Mothers on Respect index, or MORi). But we deliberately designed it this way to make sure women actually engaged (ie. no one is going to answer a 14-question survey). 60% of PROMPTS moms routinely respond to our surveys and, because the second question is open-ended, users are volunteering granular information about their experiences, much of which falls outside the remit of the 14 MORi items.

However, despite being able to collect granular information, we quickly recognized limitations in the data we were sharing with government. Quantitative bar charts showing the split between moms reporting respectful vs. disrespectful care didn’t adequately capture the nuances of disrespectful care. We also had feedback that government officials didn’t always have time to scroll through and extract insights from the qualitative ‘feed’ of survey responses. We needed a way to break down the issues raised by moms around respectful care and then capture this in a format that facilities and local governments could use to target improvements.

Defining ‘sentiment’ in conversational data

Early this year, Jacaranda’s Technology team developed a ‘sentiment analysis model’ to classify mothers’ responses to RMC surveys. The process began with a data annotation exercise to define categories of respectful and disrespectful care, including delayed service, harassment, treatment by students, refusal of service, insurance issues, positive/negative communication, stock-outs, abuse, and understaffing.

We used ‘real world’ training data based on concerns mothers had reported previously on PROMPTS. This is important: all our models are trained on data that reflects the context in which they are deployed to ensure the relevance and accessibility of information moms receive, and the contextual sensitivity of data generated. Given the code-mixed nature of the language that our moms use (eg. Swahili, English and mixed dialects like Sheng), we selected a base model that already understood the nuances of how they communicate on the platform. The model was then fine-tuned on the annotated RMC data to enable it to appropriately classify responses.

How does it match up to human interpretation of sentiment?

Our latest and best iteration of the model has an accuracy of 82% in classifying RMC responses from PROMPTS users. Here, accuracy is the share of responses for which the model’s classification matches the human annotators’ interpretation of the same response. Previously, this kind of classification was not practical at scale: done manually, the process of labeling ~45,000 RMC survey responses might have taken a team of five agents two weeks.

Figures 2 and 3: A bar graph (top) and confusion matrix (below) show a comparison between human classification and sentiment analysis model classification on a sample of 245 RMC survey responses (ie. by how much they ‘agree’ on the sentiment behind a mother’s message). The values along the diagonal in the confusion matrix are instances where the model has matched the human classification (eg. they both agree on ‘long waiting times’ as the primary intent), whereas the values off the diagonal are instances where the model has predicted a different sentiment from the human annotator (eg. verbal abuse), either because the message carries more than one sentiment or because the model is wrong.
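As a rough illustration of how agreement between human and model labels can be tallied, here is a minimal Python sketch. The category names and the five example label pairs are invented for illustration (they are not PROMPTS data), and this is not the team’s actual evaluation code.

```python
from collections import Counter


def accuracy(human, model):
    """Share of responses where the model label equals the human label."""
    if not human or len(human) != len(model):
        raise ValueError("label lists must be non-empty and the same length")
    matches = sum(h == m for h, m in zip(human, model))
    return matches / len(human)


def confusion_counts(human, model):
    """Count (human_label, model_label) pairs.

    Cells where both labels are equal form the diagonal (agreement);
    every other cell is a disagreement.
    """
    return Counter(zip(human, model))


# Hypothetical labels for five survey responses (illustrative only).
human_labels = ["delayed_service", "verbal_abuse", "delayed_service",
                "stock_out", "verbal_abuse"]
model_labels = ["delayed_service", "verbal_abuse", "verbal_abuse",
                "stock_out", "verbal_abuse"]

print(accuracy(human_labels, model_labels))
cells = confusion_counts(human_labels, model_labels)
print(cells[("delayed_service", "verbal_abuse")])
```

In this toy sample the model agrees with the annotators on four of five responses, and the single off-diagonal cell records a response the annotators labeled as delayed service but the model labeled as verbal abuse.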

Learnings: A classification model is only as good as its training data.

We tested different ways to increase the quality of our training data, including:

  1. Start with a large dataset so, by the time it’s ‘cleaned’, you’re left with enough quality data to train a high-performing model. The size is dependent on the number of classes you have – ie. the more classes, the more samples you need for the model to learn effectively.
  2. Minimize ambiguity by making a clear distinction between the classes (or categories) in the training data. We initially found multiple ambiguities in our data as a result of disaggregating a subjective and nuanced issue like felt respect (eg. ‘rudeness’ could be interpreted as harassment, poor communication, or verbal abuse). We addressed this ambiguity via ‘feature engineering’ to add, delete, and merge similar data, collaborating with domain experts to characterize data classes, and using evaluation tools (eg. a confusion matrix) to detect ambiguities.
  3. Use Large Language Models (LLMs) to create a ‘synthetic dataset’ to increase the size and quality of human-annotated training data. We anticipate this approach will help us achieve a big jump in classification accuracy (90%+) and the ability to detect numerous sentiments in a single message (eg. a user reporting both refusal of care and verbal abuse) – all at minimal expense.
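One way a confusion matrix can point to ambiguous classes is by listing the label pairs that humans and the model disagree on most often. The sketch below assumes invented labels and a hypothetical merge of two classes; in practice any merge would be a decision for domain experts, and this is not a description of the team’s own tooling.

```python
from collections import Counter


def most_confused_pairs(human, model, top_n=3):
    """Return the label pairs that human and model labels disagree on most."""
    disagreements = Counter()
    for h, m in zip(human, model):
        if h != m:
            # Sort so (a, b) and (b, a) count as the same confusable pair.
            disagreements[tuple(sorted((h, m)))] += 1
    return disagreements.most_common(top_n)


def merge_classes(labels, mapping):
    """Relabel classes according to `mapping`; unmapped labels are unchanged."""
    return [mapping.get(label, label) for label in labels]


# Hypothetical annotations (illustrative only).
human = ["harassment", "verbal_abuse", "poor_communication",
         "verbal_abuse", "delayed_service"]
model = ["verbal_abuse", "harassment", "poor_communication",
         "verbal_abuse", "delayed_service"]

pairs = most_confused_pairs(human, model)
print(pairs)

# Hypothetical merge of the most-confused pair into a single class.
merged_human = merge_classes(human, {"harassment": "verbal_abuse"})
print(merged_human)
```

A pair that dominates the disagreement list is a candidate for a sharper class definition or a merge, which is the kind of ambiguity item 2 describes.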

How will this data be used, and what are the ethical considerations?

We recognize that some RMC data shared by mothers will be linked with unpleasant or traumatic experiences, and dealing with this data requires sensitivity. We have taken steps to increase the trust mothers need in order to submit sensitive information to PROMPTS, but we also see a responsibility to close the feedback loop. Our dashboards were developed for this purpose. Data generated through the classification model will appear as bar charts in these dashboards, allowing facility and government partners to review hotspots of trending RMC issues (eg. long client waiting times) at a county, sub-county and facility level.
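To illustrate how classified responses could be rolled up into the counts behind such bar charts, here is a minimal Python sketch. The record fields (`county`, `facility`, `category`), the place names, and the category names are all assumptions made for the example, not the actual PROMPTS schema.

```python
from collections import Counter


def count_issues(records, level):
    """Count classified responses per (area, category) at one reporting level."""
    return Counter((record[level], record["category"]) for record in records)


def top_issue(records, level, area):
    """Most frequently reported category for one area, or None if no data."""
    counts = Counter(r["category"] for r in records if r[level] == area)
    return counts.most_common(1)[0][0] if counts else None


# Hypothetical classified responses (illustrative only).
records = [
    {"county": "County A", "facility": "Facility X", "category": "long_waiting_time"},
    {"county": "County A", "facility": "Facility X", "category": "long_waiting_time"},
    {"county": "County A", "facility": "Facility Y", "category": "stock_out"},
    {"county": "County B", "facility": "Facility Z", "category": "positive_communication"},
]

by_facility = count_issues(records, "facility")
print(by_facility[("Facility X", "long_waiting_time")])
print(top_issue(records, "county", "County A"))
```

The same function works at any level by changing the `level` argument, so one set of classified responses can feed county, sub-county and facility views.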

We’ve recently conducted an assessment to understand the behavioral dynamics of how these stakeholders perceive, prioritize and act on data, which offered suggestions around meaningful review of client feedback, like adding insights boxes to increase understanding and action around data ‘red flags’. We’ll continue working collaboratively with these stakeholders to increase adoption and bridge a gap between basic data review and the actions that follow.

Additionally, we have already put guardrails in place around how we use and process this data, like:

  1. Keeping humans-in-the-loop, by having all RMC responses reviewed by a human (ie. a clinical helpdesk agent). This includes cases flagged by the model (ie. issues falling into certain categories and requiring follow up) and non-flagged cases to ensure nothing is missed.
  2. Removing all PII (eg. phone numbers) from dashboard data. In certain serious cases where health administrators want to personally follow up on cases, we always seek the mother’s permission first.
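The two guardrails above can be sketched in a few lines of Python. The field name `phone_number`, the set of follow-up categories, and the sample records are hypothetical; the sketch only shows the general pattern of reviewing every response (flagged ones first) and dropping PII before data reaches a dashboard.

```python
# Hypothetical field and category names, chosen for illustration.
PII_FIELDS = {"phone_number"}
FOLLOW_UP_CATEGORIES = {"verbal_abuse", "refusal_of_service"}


def strip_pii(record):
    """Return a copy of the record without PII fields, for dashboard use."""
    return {key: value for key, value in record.items() if key not in PII_FIELDS}


def review_queue(records):
    """Order responses for human review: flagged categories first.

    Nothing is filtered out, so non-flagged responses are still reviewed.
    """
    return sorted(records, key=lambda r: r["category"] not in FOLLOW_UP_CATEGORIES)


# Hypothetical responses (illustrative only).
responses = [
    {"phone_number": "0700000000", "facility": "Facility X", "category": "long_waiting_time"},
    {"phone_number": "0700000001", "facility": "Facility Y", "category": "verbal_abuse"},
]

queue = review_queue(responses)
dashboard_rows = [strip_pii(r) for r in responses]
print(queue[0]["category"])
print(dashboard_rows[0])
```

Because `strip_pii` returns a copy, the original record stays available to the helpdesk agent who may need to seek permission for a follow-up.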

Why does this matter?

Instances of disrespectful care need to be dealt with quickly, but facilities and governments can’t do this with ‘slow’, unstructured data. The sentiment analysis model means we can deliver RMC data to these stakeholders in a granular, real-time way to support rapid, highly targeted action (which is especially important in settings where resources are tight). Importantly, this work will also help us unpack and scrutinize reports of positive feedback – a key motivator for busy providers.

Our aim is that facilities will be able to use the data to make rapid, cheap improvements (eg. installing privacy curtains if the data shows privacy concerns) to boost client experiences, while also recognizing the often-invisible ‘wins’ on the ward and incentivizing nurses to continue providing excellent care.

