Stanslaus Mwongela, Jay Patel, Sathy Rajasekharan, Laura Down, Mohamed Ahmed, Gilles Hacheme, Bernard Shibwabo, Julius Butime

Data efficient learning for healthcare queries in low-resource and code-mixed language settings

The leading approaches in modern Natural Language Processing (NLP) are notoriously data-hungry.

A prime example is Transformer models, which achieve state-of-the-art performance, but only at the cost of large amounts of training data.

However, acquiring such large datasets is expensive and time-consuming in many application domains, limiting wider adoption. Consequently, state-of-the-art NLP models perform poorly for low-resource languages such as African languages.

Their performance is even worse when applied in sectors such as healthcare in low-resource settings.

As a result, both the academic and industrial communities are calling for more data-efficient models: artificial learners that require less training data and less supervision.

This research tackles these challenges by building a data-efficient Transformer-based model for intent detection on maternal health queries in a low-resource, code-mixed language setting (Kiswahili, Sheng, and other local Kenyan languages).

Several experiments were carried out, including the use of pre-trained multilingual language models, language-adaptive fine-tuning, supervised contrastive learning, and sample weighting in the loss function.
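To illustrate one of these experiments, the sketch below shows a supervised contrastive (SupCon-style) loss over a batch of query embeddings, pulling queries that share an intent label together and pushing other intents apart. It is a minimal PyTorch sketch; the temperature, pooling strategy, and batching details are assumptions rather than the exact configuration used in the study.

```python
# Sketch of a supervised contrastive (SupCon-style) loss over query embeddings.
# Temperature and pooling choices here are assumptions, not the study's settings.
import torch
import torch.nn.functional as F

def supervised_contrastive_loss(embeddings, labels, temperature=0.1):
    """embeddings: (batch, dim) pooled Transformer outputs; labels: (batch,) intent ids."""
    z = F.normalize(embeddings, dim=1)                        # unit-normalise embeddings
    sim = z @ z.T / temperature                               # pairwise similarities
    self_mask = torch.eye(len(labels), dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(self_mask, float("-inf"))           # exclude self-pairs
    # positives: other queries in the batch that share the same intent label
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    pos_counts = pos_mask.sum(dim=1)
    valid = pos_counts > 0                                    # anchors that have positives
    per_anchor = -(log_prob.masked_fill(~pos_mask, 0.0)).sum(dim=1) / pos_counts.clamp(min=1)
    return per_anchor[valid].mean()

# Example usage with random embeddings standing in for encoder outputs
emb = torch.randn(8, 768)
lab = torch.tensor([0, 0, 1, 1, 2, 2, 3, 3])
loss = supervised_contrastive_loss(emb, lab)
```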

The most data-efficient learner was obtained by Masked Language Model (MLM) adaptation on our unlabelled maternal queries, followed by fine-tuning the adapted MLM checkpoint on our labelled training dataset. Applying sample weighting to the loss function of this model further increased its robustness and overall performance.
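A condensed sketch of that recipe, using the Hugging Face Transformers library, is given below: continue MLM pretraining on the unlabelled queries, then fine-tune the adapted checkpoint for intent classification with class weights in the cross-entropy loss. The base checkpoint name, file paths, hyperparameters, and number of intent labels are illustrative assumptions, not the exact values used in the deployed model.

```python
# Sketch: domain-adaptive MLM pretraining, then weighted fine-tuning.
# Checkpoint name, file paths, hyperparameters, and label count are assumptions.
import torch
from torch.nn import CrossEntropyLoss
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoModelForSequenceClassification,
                          AutoTokenizer, DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

base = "xlm-roberta-base"                       # assumed multilingual starting point
tok = AutoTokenizer.from_pretrained(base)

# Step 1: adapt the masked language model to unlabelled maternal queries.
unlabelled = load_dataset("text", data_files={"train": "maternal_queries.txt"})["train"]
unlabelled = unlabelled.map(lambda b: tok(b["text"], truncation=True, max_length=128),
                            batched=True, remove_columns=["text"])
mlm = AutoModelForMaskedLM.from_pretrained(base)
Trainer(
    model=mlm,
    args=TrainingArguments(output_dir="mlm-adapted", num_train_epochs=3,
                           per_device_train_batch_size=32),
    train_dataset=unlabelled,
    data_collator=DataCollatorForLanguageModeling(tok, mlm_probability=0.15),
).train()
mlm.save_pretrained("mlm-adapted")
tok.save_pretrained("mlm-adapted")

# Step 2: fine-tune the adapted checkpoint with a class-weighted loss so that
# rare intents contribute more to the gradient than frequent ones.
class WeightedTrainer(Trainer):
    def __init__(self, class_weights, **kwargs):
        super().__init__(**kwargs)
        self.class_weights = class_weights

    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        loss_fn = CrossEntropyLoss(weight=self.class_weights.to(outputs.logits.device))
        loss = loss_fn(outputs.logits, labels)
        return (loss, outputs) if return_outputs else loss

clf = AutoModelForSequenceClassification.from_pretrained("mlm-adapted", num_labels=10)
# WeightedTrainer(class_weights=torch.tensor([...]), model=clf, args=...,
#                 train_dataset=labelled_train).train()  # labelled set tokenised similarly
```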

The developed model was later deployed and is currently used to triage code-mixed maternal health queries at Jacaranda Health.
