Wikipedia cleanup
From Cpp Tea :: Articles
24.11.2007, 12:04
September, October and November 2007 were months of major cleanup on the Serbian-language Wikipedia, which I decided to undertake. The cleanup dealt with the following:
- Titles that mix the Cyrillic and Latin alphabets
- Generating statistics on articles that contain words mixing Cyrillic and Latin letters, and correcting a number of those articles
- Finding the most common grammar mistakes and eliminating them across the whole project
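To illustrate the mixed-alphabet check, here is a minimal C++ sketch. It is not the code used for the cleanup; the names `classify`, `decode_utf8` and `has_mixed_script_word` are hypothetical, and it assumes the text is well-formed UTF-8 and that any character that is neither a Latin nor a Cyrillic letter ends a word.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Script classes for a single code point (hypothetical helper type).
enum class Script { Latin, Cyrillic, Other };

// Rough classification by Unicode block; good enough for Serbian text.
Script classify(char32_t cp) {
    if ((cp >= U'A' && cp <= U'Z') || (cp >= U'a' && cp <= U'z') ||
        (cp >= 0x00C0 && cp <= 0x024F && cp != 0x00D7 && cp != 0x00F7))
        return Script::Latin;      // ASCII letters, Latin-1 letters, Latin Extended-A/B
    if (cp >= 0x0400 && cp <= 0x052F)
        return Script::Cyrillic;   // Cyrillic and Cyrillic Supplement blocks
    return Script::Other;          // digits, spaces, punctuation, everything else
}

// Decode UTF-8 into code points. Assumption: input is well-formed;
// a stray continuation byte is mapped to U+FFFD instead of being validated.
std::vector<char32_t> decode_utf8(const std::string& s) {
    std::vector<char32_t> out;
    for (std::size_t i = 0; i < s.size();) {
        unsigned char c = static_cast<unsigned char>(s[i]);
        char32_t cp;
        int extra;
        if (c < 0x80)              { cp = c;        extra = 0; }
        else if ((c >> 5) == 0x06) { cp = c & 0x1F; extra = 1; }
        else if ((c >> 4) == 0x0E) { cp = c & 0x0F; extra = 2; }
        else if ((c >> 3) == 0x1E) { cp = c & 0x07; extra = 3; }
        else                       { cp = 0xFFFD;   extra = 0; }
        ++i;
        for (int k = 0; k < extra && i < s.size(); ++k, ++i)
            cp = (cp << 6) | (static_cast<unsigned char>(s[i]) & 0x3F);
        out.push_back(cp);
    }
    return out;
}

// True if any single word contains both Latin and Cyrillic letters.
bool has_mixed_script_word(const std::string& utf8) {
    bool latin = false, cyrillic = false;
    for (char32_t cp : decode_utf8(utf8)) {
        switch (classify(cp)) {
            case Script::Latin:    latin = true; break;
            case Script::Cyrillic: cyrillic = true; break;
            case Script::Other:    latin = cyrillic = false; break; // word boundary
        }
        if (latin && cyrillic) return true;
    }
    return false;
}
```

Calling `has_mixed_script_word` on every title, or on every article body, would give the raw counts for the statistics. A title in which one word is entirely Latin and another entirely Cyrillic is not flagged, because the flags reset at each word boundary.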
While doing this, I developed a C++ API capable of crawling Wikipedia or a Wikipedia dump, and of finding, marking and suggesting the needed corrections or actions. What remains to be done is the ability to log in to Wikipedia autonomously and post edits as a bot, which may make this program rather popular among Wikipedians once it is released. Besides the plans for the bot, I will soon upload some of these tools to the "Wikipedia tools" section of this site.
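The API itself is not shown here, but as a rough idea of what scanning a dump can look like, the sketch below pulls page titles out of a MediaWiki XML dump stream. The function name `extract_titles` is hypothetical. It assumes each `<title>` element sits on a single line, and it does not decode XML entities such as `&amp;`.

```cpp
#include <cstddef>
#include <istream>
#include <string>
#include <vector>

// Collect the text of every <title>...</title> element found on a single line.
// Assumption: no title spans more than one line; entities are left undecoded.
std::vector<std::string> extract_titles(std::istream& dump) {
    static const std::string open = "<title>";
    static const std::string close = "</title>";
    std::vector<std::string> titles;
    std::string line;
    while (std::getline(dump, line)) {
        std::size_t begin = line.find(open);
        if (begin == std::string::npos) continue;   // no title on this line
        begin += open.size();
        std::size_t end = line.find(close, begin);
        if (end == std::string::npos) continue;     // unterminated, skip
        titles.push_back(line.substr(begin, end - begin));
    }
    return titles;
}
```

Passing a `std::ifstream` opened on the decompressed dump file yields the list of titles, which can then be fed to whatever checks are needed.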
At present, I'm waiting for a new dump of the Serbian-language Wikipedia to see the results of these actions.
