By Siyolise Fikizolo
For millions of young South African learners, finding age-appropriate reading materials in their mother tongue remains a challenge. Now, a team of interdisciplinary researchers is working to change that by developing the first complete child-focused lexicon database for the country’s Nguni languages.
Dr Tracy Bowles and Paige Cox, lecturers at Rhodes University; Dr Robyn Berghoff from Stellenbosch University; and Dr Zelalem Shibeshi, senior lecturer at Rhodes University, are behind this project. They aim to create an open-access database covering isiZulu, isiXhosa, siSwati, and isiNdebele. These languages have received limited attention compared to English, Spanish, and German, which have substantial lexical databases supporting both research and educational development.
“What inspired me to create this database project is that I used to create my own database when doing my master’s degree, which contained a broader linguistic range of information that was beyond simple word definitions, which used to help me a lot. That’s when I thought it would contribute to children’s literature,” said Cox.
Until recently, the only child-focused lexical database for indigenous South African languages was for isiXhosa, created in 2024 by Berghoff. “Most literacy resources and assessments to date are currently built on what we call expert judgments, and without objective empirical data, these are really just estimates, and this often leaves a mismatch between research tools, class materials, and the vocabulary that children encounter in the books they read,” said Bowles during a recent presentation of their project.
A database built for children
The database comprises more than 2 000 children’s stories across four Nguni languages, sourced from initiatives such as African Storybook, Nal’ibali, and the Department of Basic Education’s readers. It represents the most extensive collection of kid-friendly text in these languages, with over 150 000 tokens in isiZulu, 140 000 in isiXhosa, and roughly 100 000 in siSwati and isiNdebele.
Unlike existing databases built from adult texts or government websites, this project uses exclusively children’s literature to ensure the data reflects what young learners actually encounter in their reading materials. The database includes critical information such as word frequency, word length, neighbourhood density (similarly spelt words), and sublexical features like consonant sequences, all factors that research shows directly impact how children process, recognise, and learn words.
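To make these measures concrete, here is a minimal Python sketch of how frequency, length, and neighbourhood density could be computed from a set of story texts. It is an illustration only, not the team's code: the function names, the simple tokeniser, and the definition of a neighbour (a word of the same length that differs by exactly one letter) are assumptions, and the sample strings are toy input rather than data from the database.

```python
import re
from collections import Counter


def tokenize(text):
    # Assumption: lowercase the text and keep runs of letters only.
    # Real Nguni text may need more careful handling (hyphens, apostrophes).
    return re.findall(r"[^\W\d_]+", text.lower())


def is_neighbour(a, b):
    # Assumed definition: same length, exactly one letter substituted.
    if a == b or len(a) != len(b):
        return False
    return sum(x != y for x, y in zip(a, b)) == 1


def lexical_stats(texts):
    # Count every token across all texts, then describe each distinct word.
    counts = Counter(tok for text in texts for tok in tokenize(text))
    vocab = list(counts)
    stats = {}
    for word, freq in counts.items():
        density = sum(1 for other in vocab if is_neighbour(word, other))
        stats[word] = {
            "frequency": freq,
            "length": len(word),
            "neighbourhood_density": density,
        }
    return stats


# Toy input for illustration only (not drawn from the database).
stats = lexical_stats(["bala bona bala", "lala bala"])
print(stats["bala"])
```

The pairwise neighbour check is quadratic in vocabulary size, which is fine for a sketch but would likely need indexing for a corpus of several hundred thousand tokens.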
Technical challenges and solutions
Shibeshi collected and processed the texts with Python-based methods, combining automated PDF downloads, optical character recognition, and web scraping. The team faced unusual challenges, including child-oriented fonts that are difficult for software to read and typographic errors in the source materials.
“We wanted this database to be as clean as possible, and we had to balance keeping the corpus clean for use as a pedagogical resource without dictating what counts as a legitimate word in Nguni,” said Cox.
The team took a conservative approach, filtering out obvious non-standard tokens such as onomatopoeic words and interjections while preserving all metadata for potential future linguistic analysis.
From research to practice
The researchers envision the database supporting multiple applications, including designing psycholinguistic experiments with developmentally appropriate stimuli, creating graded reading materials, developing phonics programs based on actual letter frequency, and aligning literacy assessments with real-world reading input.
Importantly, the database tracks metadata such as story titles, grade levels, and sources, allowing researchers to compare how different publishers approach vocabulary difficulty or analyse patterns across reading levels.
The team plans to present their work at the Digital Humanities Association of Southern Africa conference later this year, with a research paper in preparation. Their ultimate goal is to create an interactive, user-friendly web interface that allows researchers, educators, and NGOs to access the data and download word lists that meet specific criteria.
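As a rough idea of what selecting a word list by criteria might look like, the following Python sketch filters word records by language, minimum frequency, and maximum length. The interface is still planned, so everything here is hypothetical: the record fields, the function name `select_words`, and the sample records are invented for illustration.

```python
def select_words(records, language, min_frequency=1, max_length=None):
    # Return words for one language that meet the frequency and length
    # criteria, most frequent first. All field names are hypothetical.
    matches = [
        r for r in records
        if r["language"] == language
        and r["frequency"] >= min_frequency
        and (max_length is None or len(r["word"]) <= max_length)
    ]
    matches.sort(key=lambda r: r["frequency"], reverse=True)
    return [r["word"] for r in matches]


# Invented sample records; the frequencies are not real corpus counts.
records = [
    {"word": "bala", "language": "isiZulu", "frequency": 12},
    {"word": "bona", "language": "isiZulu", "frequency": 30},
    {"word": "lala", "language": "isiXhosa", "frequency": 8},
    {"word": "umfundi", "language": "isiZulu", "frequency": 15},
]

word_list = select_words(records, "isiZulu", min_frequency=10, max_length=5)
print(word_list)
```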
“We envision this as an open-access platform that anyone can use to inform their research and develop literacy and language resources,” Bowles said.


