Khimo Osawere on Why the Data Problem Is Really a Structure Problem

The African Innovators Series(TAIS): Tech, Data, and AI Changing the Game

Jun 19, 2026

Welcome to Issue #66 of TAIS, where every Friday we spotlight visionary changemakers reshaping Africa’s tech, data, and AI landscape, one breakthrough at a time.

Today we spotlight Khimo Osawere, a Nigerian NLP researcher and dataset builder whose path into one of the continent’s most underfunded problems (the language layer beneath every AI system) began not in a machine learning lab but in military barracks in Ojo, Lagos, where Pidgin English was the sound of everyday life, and where he developed, before she had technical language for it, a deep intuition for what it means when a language carries meaning that formal systems cannot see.

**Khimo (Sunmisola) Osawere**_Building Nigerian pidgin Dataset for Fintech customer support (currently)| Turning Africa’s Untapped Data into Al Infrastructure | Geoscientist.

Her world operates where linguistics meets infrastructure, where the unglamorous work of cleaning, structuring, and governing informal language data determines whether the AI systems built on top of it actually understand the people they claim to serve. Not in the well-resourced environments where models are benchmarked and celebrated, but in the harder, less visible work of ensuring that the way Nigerians speak in fintech customer support, in mobile banking, in the fluid code-switching of everyday commerce, is captured with enough fidelity that it does not get flattened into something more legible but less true.

Khimo is not building models. She is building what models depend on. And her argument, which runs through every answer in this conversation, is that Africa’s most consequential decisions about AI are not being made at the application level. They are being made at the data level, right now, mostly by default, and largely by people who have never heard Nigerian Pidgin spoken in a moment of genuine frustration or confusion. Her dataset work is her answer to that, not a protest against the systems that exist, but a deliberate construction of the foundation they are missing.

Origin & Identity

Q: You began in geology before moving into NLP and dataset building. How did that shift happen, and do you see parallels between interpreting geological data and structuring language data for AI systems?

A: I studied geology, and honestly, I still see myself as someone grounded in the sciences. Geology taught me how to observe patterns, work with messy real-world data, and carefully interpret information before jumping to conclusions. A lot of the work is about understanding the story hidden inside raw evidence. That way of thinking stayed with me. Growing up in the military barracks in Ojo, Lagos, Pidgin English was the language around me every day. It was how people joked, argued, complained, and expressed emotions naturally. So even before I got into NLP, I already had a strong connection to the language. Over time, I started thinking about preservation. Languages change, people move, and communication evolves. And with AI becoming part of everyday life, I kept wondering what happens when millions of people speak a language fluently, but the systems they use don’t really understand them. That curiosity pulled me into NLP and dataset building. At first, it was just observation. Then I started noticing how little structured representation exists for many African languages in real-world contexts like customer support and fintech, where communication is often informal and emotional. So for me, this never felt like leaving science behind. It feels like applying the same scientific mindset to a different kind of problem. I’m also interested in how AI and environmental science may connect in the future, so I see this more as an expansion of my work than a shift away from it.

**Geology collage**: Visual collage representing Khimo’s background in geology and early scientific training in field observation and data interpretation.

Q: You’ve said that Africa possesses data “nobody else on earth can produce,” but that much of it remains unstructured and unusable. What kinds of data are you referring to and why do you think these forms of knowledge have remained largely absent from AI systems?

A: I’m referring to everyday communication in African contexts. The way people speak in local languages and dialects, especially in informal settings like customer service, banking conversations, markets, and daily life. It includes culture, expressions, sarcasm, code-switching, and tone. It is very rich, but also very unstructured. I don’t think it’s accurate to say the data cannot exist elsewhere. It’s more that this kind of communication is deeply tied to specific cultural and linguistic realities, so it doesn’t appear naturally in the datasets that early AI systems were built on. Most early datasets came from formal and written sources like books and web text. That naturally left out more informal, emotional, and context-heavy communication. So the gap is not absence. It’s structure.

Editorial commentary: At first glance, Khimo’s reflections seem to tell the story of a transition from geology to AI. The deeper continuity, however, lies in how she approaches information itself. Geologists are trained to look beyond what is immediately visible and uncover the stories embedded in fragments, patterns, and incomplete evidence. That same instinct appears to shape her approach to language data. Where others might see informal conversations, slang, or code-switching as messy and difficult to work with, she sees a rich source of knowledge waiting to be understood.
This perspective also helps explain her interest in African languages and datasets. The challenge, as she frames it, is not that African realities are absent from the world, but that they are often overlooked by the systems designed to interpret it. In many ways, her work is an exercise in making the invisible legible.

The Language Layer

Q: You’re focusing specifically on Nigerian Pidgin within fintech and mobile banking contexts. What makes Nigerian Pidgin particularly important or particularly difficult when building NLP systems?

A: Nigerian Pidgin is important because it is one of the most widely used ways people communicate in Nigeria, especially in everyday and informal situations. In fintech and mobile banking, it shows up a lot. When people are frustrated, confused, or trying to explain issues, they often switch into Pidgin because it feels more natural. The challenge is that it is not consistent. The same idea can be expressed in many different ways, often mixed with English and other languages. There is also sarcasm and context that is not obvious from the words alone. So the difficulty is not translation. It is understanding meaning, tone, and intent in a very flexible language.

Behind-the-scenes view of Khimo working, representing the process of building datasets and thinking through the work in real time.

Q: A major part of your work involves transforming messy, informal, lived language into structured datasets. What gets gained in that process and what risks being flattened, removed, or misunderstood?

A: What you gain is clarity and usability. Once data is structured, it becomes possible for models to learn from it and for patterns to emerge more clearly. It also helps systems become more useful in real-world applications because they start reflecting how people actually speak. What can be lost is some of the natural richness. Tone, humor, sarcasm, and cultural context can sometimes get reduced when you force language into structure. So the goal is balance. Enough structure for usability, but not so much that the language loses its human feel.

Q: You repeatedly use the word “usable” when talking about African data. Usable for whom? And who ultimately decides what forms of language or context are considered valuable enough to structure?

A: When I say “usable,” I mean usable for building systems like customer support tools, fintech products, and other AI applications that respond to people in natural language. It is useful for both builders and users. Builders need structured data so models can learn, and users benefit when systems understand them properly. What gets considered “valuable enough” to structure has mostly been shaped by what was easy to collect early on, like formal written text and widely available online data. That shaped what AI systems are good at today. But a lot of important communication happens outside those spaces, in informal and local contexts. Part of the work now is expanding what we consider worth structuring.

Editorial commentary: AI requires language to be structured, categorized, and made usable. Yet much of what makes human communication meaningful lies precisely in its ambiguity. Sarcasm, humor, frustration, code-switching, and cultural references often derive their power from what is implied rather than explicitly stated. The challenge Khimo describes is therefore not simply one of translation or data collection. It is a question of interpretation.
This is what makes her focus on Nigerian Pidgin particularly interesting. Pidgin is often treated as informal or unstructured, but its flexibility is also what allows speakers to convey nuance, emotion, and social context with remarkable efficiency. When language is transformed into a dataset, some of that richness inevitably has to be translated into categories, labels, and patterns that machines can recognize. The task is not merely to organize language, but to decide what aspects of meaning are preserved, what gets simplified, and what may disappear altogether.
So Khimo’s work sits at the intersection of technology and culture. She is not only helping machines understand how people speak. She is grappling with a much older question: how much of human meaning can ever be captured in a system without losing the very qualities that make it human in the first place?

Infrastructure & Application

Q: Your current focus is on fintech and mobile banking applications. What does language inclusion change in financial systems, especially for users who operate outside formal or English-dominant environments?

A: Language inclusion affects how people understand and trust financial systems. Many users do not interact in formal English. They use local languages or mixed speech to ask questions or report issues. If systems cannot understand them properly, it creates confusion and frustration. Inclusion is not just translation. It is understanding intent and context in the language people naturally use.

Q: Much of your work feels infrastructural, you’re helping create datasets that many downstream systems may eventually depend on. What is it like building foundational AI infrastructure in a context where many of the surrounding systems are still emerging?

A: It feels exciting and uncertain at the same time. Exciting because the gap is clear, and you can see how better data can immediately improve systems. Uncertain because the space is still forming. There are not many clear standards yet, so you are constantly figuring things out as you go. It often feels like building something foundational without all the surrounding structure in place, but that also creates space to shape how things should be done.

Snapshot of Khimo’s Nigerian Pidgin dataset work in progress, showing structured conversational data being built and reviewed.

Q: Conversations around AI often focus on models, while the labour of collecting, cleaning, and annotating data receives far less attention. What aspects of dataset creation remain most invisible to outsiders?

A: What people don’t often see is how much effort goes into making data usable. It is not just a collection. It includes cleaning, checking consistency, fixing structure, and making sure it reflects real language. There are also many decisions about what to include, what to remove, and how to handle unclear cases. These decisions shape everything built on top of the dataset. Most attention goes to models, but a lot of foundational work happens at the data level.

Editorial commentary: A common assumption in AI is that datasets are neutral inputs and models are where the real intelligence resides. Khimo’s reflections subtly challenge that view. The process she describes is filled with human decisions: what to include, what to exclude, how to interpret ambiguous cases, and how to represent the way people actually communicate. Long before a model generates an output, someone has already made choices about what counts as meaningful data.
This is particularly significant in contexts such as fintech and mobile banking, where understanding language is closely tied to trust. A customer explaining a problem is not simply transmitting information; they are expressing confusion, urgency, frustration, or uncertainty. Capturing those signals requires more than technical accuracy. It requires judgments about meaning. Dataset creation is not merely preparatory work for AI systems. It is one of the places where the assumptions, priorities, and values that shape those systems are established in the first place.

Building in Public

Q: You intentionally build in public, including sharing uncertainty and confusion. Why is that openness important to you, especially in a field where expertise is often presented as certainty?

A: I build in public because it helps me stay grounded in the process. A lot of AI work can look finished from the outside, but in reality, most of it is still being figured out step by step. Sharing it makes it easier to learn and get feedback. It also removes the pressure to appear like everything is already figured out. For me, this is still early work, so openness helps more than hiding uncertainty. It also helps other people see that building is a process, not a finished state.

Editorial commentary: Khimo challenges the common assumption about expertise that we see in many technical fields, where credibility is often associated with certainty. The ability to present clear answers, confident predictions, and well-defined plans. Yet the reality of building something new is usually far messier. There are unknowns, dead ends, revisions, and moments when the next step is not entirely clear.
Her commitment to building in public reflects a different understanding of expertise, one rooted less in having all the answers and more in being transparent about the process of finding them. In that sense, openness becomes a way of demystifying innovation itself, replacing the image of the lone expert with a more honest picture of experimentation, learning, and adaptation. The result is a reminder that uncertainty is often evidence that genuine exploration is taking place.

Sovereignty & Governance

Q: As African datasets become more valuable, questions of ownership and extraction become more urgent. How do you think African language data should be governed to avoid simply becoming another extractive resource?

A: This is important and should be taken seriously early. Data should be collected with clarity around consent, ownership, and purpose. People should understand how their data is being used. There should also be more local participation in how datasets are built and applied, not just as sources but as contributors. Otherwise there is a risk that data becomes something extracted without enough value returning to the communities it represents.

Q: Dataset builders often have enormous influence over what systems can and cannot understand. Do you think data infrastructure builders are recognised as central actors in AI or are they still treated as secondary to model builders?

A: Dataset builders are still mostly seen as secondary compared to model builders. Most attention goes to models and performance, but models depend heavily on data quality and structure. This is slowly changing as people realize how important data is, but there is still a gap in recognition. To me, dataset work feels like building the foundation everything else depends on.

Q: You describe your work as turning “untapped” African data into AI infrastructure. Do you worry that once African language data becomes valuable at scale, the continent could again find itself positioned primarily as a supplier of raw material rather than as an owner of the systems built from it?

A: Yes, it is a real concern. When something becomes valuable, there is always a risk that people closest to it only supply raw material while others build systems on top. That is why ownership and participation matter, not just data creation. It is important to think early about how datasets are used and who benefits from them. This is also part of why I’m interested in governance alongside technical work.

Editorial commentary: A recurring theme in discussions about AI governance is the power held by model developers and technology companies. Khimo’s reflections draw attention to a less visible group of actors: the people who decide what data enters those systems in the first place. Dataset builders occupy a unique position between communities and technology. They determine what forms of language are collected, how they are categorized, and ultimately what AI systems are able to recognize and understand.
This helps explain why her concern extends beyond data ownership to questions of participation and governance. The issue is not simply whether African language data is valuable, but who gets to shape the conditions under which that value is created. The reality is that dataset building is an act of stewardship, carrying responsibility not only for the quality of the data itself, but for how the communities represented within that data participate in the systems built upon it.

Vision & Concern

Q: There is increasing attention on African languages in AI, but much of it remains symbolic or experimental. In your view, what would it actually take for African language NLP systems to become robust enough for widespread real-world use?

A: What is needed is high-quality, consistent data that reflects real communication in real contexts. Not just isolated words or translations, but structured conversational data that captures intent, tone, and meaning. There also needs to be long-term focus. A lot of current work is still experimental, so sustained effort is important. When those pieces come together, systems can become more reliable in real use cases.

Q: When you think about the future of African NLP, what excites you most and what concerns you about the direction things are heading?

A: What excites me most is the possibility of systems that better understand African languages in real contexts, not just translations but meaning, tone, and intent. I’m also encouraged by how many people are now paying attention to this space. What concerns me is depth. Some work risks staying surface level without building strong, usable datasets. There is also the question of how data is collected and whether the context behind it is respected. So it is a mix of optimism and caution.

Closing reflection: I appreciate the opportunity to reflect on this work in a structured way. It has helped me think more clearly about what I’m building and why.

Editorial commentary: One of the more interesting tensions in Khimo’s reflections is the gap between visibility and progress. African languages are receiving more attention in AI than ever before, attracting new research, funding, and public interest. Yet she repeatedly returns to the question of depth. Her concern is not whether African languages are entering the conversation, but whether the underlying work required to support them is keeping pace with the enthusiasm surrounding them.
This perspective reveals a recurring challenge in emerging technology spaces: visibility is often easier to achieve than capability. A language can appear in a benchmark, a pilot project, or a research paper long before robust systems exist to support real-world use. Khimo’s caution stems from an awareness of that distinction. For her, the measure of success is not whether African languages are acknowledged by AI systems, but whether those systems can reliably operate in the contexts where people actually live and communicate. The future she describes therefore depends less on attention and more on the patient, often invisible work of building the foundations that make meaningful inclusion possible.

Closing remarks

This conversation arrives at an interesting moment for African language AI. There is growing recognition across the industry that African languages matter, reflected in the work of researchers, startups, open-source communities, and increasingly, large technology companies investing in African language resources. In many ways, the debate has shifted. The question is no longer whether African languages should be represented in AI systems, but how that representation is built and who gets to shape it.

What distinguishes Khimo’s perspective is her focus on the less visible work required to move from attention to capability. Throughout the interview, she returns to the challenge of transforming real-world communication into data that systems can meaningfully learn from, while preserving the context, intent, and nuance that make language human. Her reflections are a reminder that inclusion is not achieved simply because a language appears in a dataset or a research benchmark. The harder task is ensuring that AI systems can engage with how people actually communicate in everyday life.

Khimo’s work points to a broader challenge facing the continent’s AI ecosystem. As interest in African languages grows, so too does the need for the infrastructure, governance, and expertise that can support that interest over the long term. The future of African NLP will depend not only on expanding representation, but on building the foundations that turn representation into understanding.

Thank you for reading!

Based on today’s conversation, which cluster in TAIS Knowledge map do you think best describes Khimo’s thematic community?

Don't see your pick in the options? Drop it in the comments. Khimo joins the map this weekend.

Rebecca Mbaya

Discussion about this post

Ready for more?