Wikipedia cleanup

From Cpp Tea :: Articles

Date:
24.11.2007, 12:04

September, October, and November 2007 were months of major cleanup on the Serbian-language Wikipedia, a job I had decided to take on. The cleanup addressed the following problems:

  • Titles that mix the Cyrillic and Latin alphabets
  • Generating statistics on articles containing words that mix the Cyrillic and Latin alphabets, and correcting a number of those articles
  • Finding the most common grammar mistakes and eliminating them from the whole project
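
To illustrate the mixed-alphabet check, here is a minimal sketch of how a word containing both Latin and Cyrillic letters might be detected in UTF-8 text. It is not the API described in this article: the names `nextCodePoint`, `classify`, and `findMixedScriptWords` are hypothetical, and the code point ranges used for "Latin" and "Cyrillic" are a simplifying assumption (basic Latin plus the accented Latin ranges, and the main Cyrillic block).

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Script classes of interest; anything else (digits, spaces, punctuation)
// is treated as a word separator.
enum class Script { Latin, Cyrillic, Other };

// Decode one UTF-8 code point starting at s[i] and advance i past it.
// Malformed input yields U+FFFD, which is later classified as Other.
char32_t nextCodePoint(const std::string& s, std::size_t& i) {
    assert(i < s.size());
    const unsigned char b0 = static_cast<unsigned char>(s[i++]);
    if (b0 < 0x80) return b0;
    int extra = 0;
    char32_t cp = 0;
    if ((b0 & 0xE0) == 0xC0)      { extra = 1; cp = b0 & 0x1F; }
    else if ((b0 & 0xF0) == 0xE0) { extra = 2; cp = b0 & 0x0F; }
    else if ((b0 & 0xF8) == 0xF0) { extra = 3; cp = b0 & 0x07; }
    else return 0xFFFD;
    while (extra-- > 0) {
        if (i >= s.size()) return 0xFFFD;
        const unsigned char b = static_cast<unsigned char>(s[i]);
        if ((b & 0xC0) != 0x80) return 0xFFFD;
        cp = (cp << 6) | (b & 0x3F);
        ++i;
    }
    return cp;
}

// Assumed ranges: A-Z, a-z and U+00C0..U+024F (minus the two math signs)
// count as Latin; U+0400..U+04FF counts as Cyrillic.
Script classify(char32_t cp) {
    if ((cp >= U'A' && cp <= U'Z') || (cp >= U'a' && cp <= U'z'))
        return Script::Latin;
    if (cp >= 0x00C0 && cp <= 0x024F && cp != 0x00D7 && cp != 0x00F7)
        return Script::Latin;
    if (cp >= 0x0400 && cp <= 0x04FF)
        return Script::Cyrillic;
    return Script::Other;
}

// Return every word (maximal run of Latin/Cyrillic letters) in `text`
// that contains at least one letter from each script.
std::vector<std::string> findMixedScriptWords(const std::string& text) {
    std::vector<std::string> mixed;
    std::size_t wordStart = 0;
    bool inWord = false, sawLatin = false, sawCyrillic = false;
    std::size_t i = 0;
    while (i <= text.size()) {
        const std::size_t cpStart = i;
        Script sc = Script::Other;
        if (i < text.size()) sc = classify(nextCodePoint(text, i));
        else ++i;  // one extra pass past the end flushes the last word
        if (sc == Script::Other) {
            if (inWord && sawLatin && sawCyrillic)
                mixed.push_back(text.substr(wordStart, cpStart - wordStart));
            inWord = sawLatin = sawCyrillic = false;
        } else {
            if (!inWord) { wordStart = cpStart; inWord = true; }
            (sc == Script::Latin ? sawLatin : sawCyrillic) = true;
        }
    }
    return mixed;
}
```

Run over an article title or body, `findMixedScriptWords` returns the suspicious words, so a title is flagged when the result is non-empty and per-article statistics can be built by counting the results. A real tool would also have to allow for legitimate mixes, such as Latin abbreviations inside Cyrillic text.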

While doing this, I developed a C++ API capable of crawling Wikipedia or its dump and finding, marking, and suggesting the needed corrections or actions. What remains to be done is the ability to log in to Wikipedia autonomously and post as a bot, which may make this program rather popular among Wikipedians once it is released. Besides the plans for the bot, I will soon upload some of these tools to the "Wikipedia tools" section of this site.

At present, I'm waiting for a new dump of the Serbian-language Wikipedia to see the results of these actions.