One of the challenges everyone faces in this space is the scarcity of
machine-readable language data that can be used to build technology. For many
languages, such data is difficult to find or simply does not exist. Diversity
gaps in Natural Language Processing (NLP) education and academia also narrow
representation among language technologists working on lesser-resourced
languages. Democratizing access to data for underrepresented languages and
increasing NLP education help drive NLP research and advance language
technology.
As part of our continued commitment to and investment in digital
transformation in Africa, Google teams have been working on programs to
advance language technologies that serve the region. These include adding 24
new languages to Google Translate earlier at I/O (including Bambara, Ewe,
Krio, Lingala, Luganda, Tsonga and Twi), researching how to build speech
recognition in African languages, and supporting local researchers through
initiatives like Lacuna Fund. Community initiatives launched in India
expanded to Africa, resulting in open-sourced crowdsourced datasets for
speech applications in Nigerian English and Yoruba, and new community
initiatives and workshops like Explore ML with Crowdsource are gaining
momentum in multiple African countries. We also hosted our first community
workshop in the field of NLP and African languages at our growing AI research
center in Ghana, which is also looking into how to advance NLP for African
languages.
A more recent example of our language initiatives on the continent comes from
a partnership with Africans to invest in African languages and NLP
technology: in collaboration with Zindi, a social enterprise and professional
network for data science, we organized a series of NLP hackathons in Africa.
The series included an Africa Automatic Speech Recognition (ASR) workshop and
three hackathon challenges centered on model training for speech recognition,
sentiment analysis, and speech data collection.
The interactive workshop aimed to increase NLP awareness and skills in
Africa, especially among researchers, students, and data scientists new to
NLP. The workshop provided a beginner-friendly introduction to NLP and ASR,
including a step-by-step guide on how to train a speech model for a new
language. Participants also learned about the challenges and progress of work
in the Africa NLP space, as well as opportunities to get involved with data
science and grow their careers.
In the Intro to Speech Recognition Africa Challenge, participants collected
speech data for African languages and used it to train their own speech
recognition models. This challenge generated new datasets in African
languages, including the open-source datasets released by the challenge
winners in Fongbe, Wolof, Swahili, Baule, Dendi, Chichewa and Khartoum
Arabic. These releases enable further research, collaboration, and
development of technology for these languages.
We partnered with Data Scientists Network (DSN) to organize the West Africa
Speech Recognition Challenge, which, according to Toyin Adekanmbi, the
Executive Director of DSN, gave participants an “immersive experience to
sharpen their skills as they learned to solve local problems”. Participants
worked to train their own speech recognition models for Hausa, spoken by an
estimated 72 million people, using open-source data from the Mozilla Common
Voice platform.
In the Swahili Social Media Sentiment Analysis Challenge, held across
Tanzania, Malawi, Kenya, Rwanda and Uganda, participants open-sourced their
solutions: models that classified whether the sentiment of a tweet was
positive, negative, or neutral. These challenges allowed participants with
similar interests to connect with each other in a supportive environment and
improve their machine learning and NLP skills.
Our focus on empowering people to use technology in the language of their
choice continues. Across many teams, we are on a mission to advance language
technologies for African languages and to increase NLP skills and education
in the region, so that we can collectively build a world that is truly
accessible for everyone, irrespective of the language they speak.