When you hear the term machine learning, what immediately comes to mind is something complicated and difficult to understand. We sat down with Kathleen Siminyu, a Machine Learning Fellow at the Mozilla Foundation, to better understand her background and what the job entails. Here is what she had to say.
Tell us about yourself.
My name is Kathleen Siminyu and I am a Machine Learning Fellow at the Mozilla Foundation. Professionally, I am a researcher in Natural Language Processing. At Mozilla I work on the Common Voice dataset, which is basically a platform that allows language communities to create datasets. One of the languages in there is Kiswahili, which I’m working on.
How did you start your career path?
For my undergraduate degree, I studied math and computer science at JKUAT, and while in my 4th year I decided to venture into a field that used both. I did my research and came across data science, which encompassed both math and computer science. My 4th year project was on data science, which helped me after I graduated as I was able to include it in my portfolio.
Besides the degree, I started taking online courses related to data science on platforms like edX and Coursera. I did this to deepen my knowledge and boost my CV because ultimately I was not a trained data scientist.
My first job was at Africa’s Talking and I learned a lot on the job. In the beginning, my role was more about providing metrics, like airtime sold, than about data science. However, I managed to automate most of these processes, which meant that I had more time to focus on my passion, which is data engineering.
During this time, I realized that there was a need for African linguistic tools and resources, and that IT was not the place where I could follow this interest. This meant I had to go back to academia. At that time, I found research communities that were building NLP for African languages. This is what really contributed to my learning journey, because in Africa there are very few academic institutions that offer degrees in Data Science and Artificial Intelligence. However, there are grassroots communities that nurture this talent and I am a product of that.
What is the Kiswahili Common Voice Dataset and what are its benefits?
The Kiswahili Common Voice dataset is basically a dataset for speech recognition, also known as speech-to-text, which is the task of turning audio into text. One of its uses is captioning for television, videos and conferencing platforms like Zoom.
The dataset itself begins with the collection of text, which is then broken down to the sentence level and sent to people. At Mozilla we also crowdsource the audio aspect: when you go to our platform and sign up as a contributor, you will start receiving phrases and you can record yourself saying those phrases out loud.
So, a dataset for speech recognition or transcription is essentially text along with the audio of what’s in the text. This is the data that you would feed into your algorithm or machine learning model for it to start learning to transcribe Kiswahili audio into text. In effect, it learns to map a word to its respective sound.
This is important because datasets can be used to develop products for end users. Transcripts can be used on platforms like Zoom or Google Meet.
Apart from Luhya and Kiswahili, what other languages have you worked on?
Earlier in my research days I worked on a task known as machine translation, which is what Google Translate does: you type in English and it gives you, say, a French translation. During this time, I worked on several Kenyan languages such as Kamba, Kikuyu and Luo. In speech recognition, I started with the Luhya language and I am currently working on Kiswahili.
What are your future plans?
When I am done with Kiswahili, I would like to continue my work on Kenyan languages. In the course of my work, I have come to realize that translations usually pivot on English, say a Kamba-English or Luhya-English translation. I am thinking of changing this pivot to Kiswahili so that we can have a Kiswahili-Kamba or Kiswahili-Luhya translation model. This makes all the more sense since there are more than 200 million Kiswahili speakers in the world.