Correcting imported articles
From Cpp Tea :: Articles
31.10.2008
Lately, I have cleaned about 3400 articles on sr.wikinews of words with incorrect orthography. These articles were imported from external sites that are written in Latin-script Serbian and, for some strange reason, avoid the letter "đ", replacing it with "dj".
Approach to the problem
I extracted all the words containing the substring "dj" from the articles. Then I manually wrote the correct versions of these words, producing a sort of dictionary that states exactly which word is to be replaced with which. This dictionary has 938 words in total and is available at this link:
The dictionary has one entry per line. Each line consists of the bad entry, then a tab (\t, 0x09), then the good entry, and finally a newline (\n, 0x0A).
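The extraction step and the dictionary format described above could be sketched roughly as follows. This is only an illustration, not the script that was actually used: the function names, the word pattern (a run of Unicode letters), and the case-insensitive test for "dj" are all assumptions.

```python
import re

# Assumption: a "word" is a run of Unicode letters (no digits, no underscore).
WORD_RE = re.compile(r"[^\W\d_]+")


def extract_dj_words(texts):
    """Collect the distinct words containing "dj" (in any case) from the texts."""
    found = set()
    for text in texts:
        for word in WORD_RE.findall(text):
            if "dj" in word.lower():
                found.add(word)
    return sorted(found)


def parse_dictionary(content):
    """Parse lines of the form bad<TAB>good<LF> into a {bad: good} mapping."""
    mapping = {}
    for line in content.split("\n"):
        if not line:
            continue  # skip the empty piece after the final newline
        bad, good = line.split("\t", 1)
        mapping[bad] = good
    return mapping
```

The list returned by extract_dj_words would be the starting point for the manual work: each word gets its corrected form written next to it, separated by a tab, and the resulting file can be read back with parse_dictionary.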
Once I had the dictionary, I simply let the bot go through all the articles the bad words had originally come from.
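The replacement pass itself might look something like the sketch below. It is not the bot's real code: fix_text is a hypothetical helper, and matching whole words with exact case is an assumption about how the dictionary was applied.

```python
import re

# Assumption: a "word" is a run of Unicode letters, matched as a whole.
WORD_RE = re.compile(r"[^\W\d_]+")


def fix_text(text, mapping):
    """Replace every whole word found in mapping; leave all other text as is."""
    def swap(match):
        word = match.group(0)
        # Words that are not in the dictionary are returned unchanged.
        return mapping.get(word, word)
    return WORD_RE.sub(swap, text)
```

Because only words listed in the dictionary are touched, a word containing "dj" that was never entered as a bad word stays exactly as it was, which is the point of using a hand-checked dictionary instead of a blind "dj" to "đ" substitution.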
Results
As a result, a considerable number of misspelled words were removed from sr.wikinews. The complete list of the bot's actions is available in its logs:
In the end, I proposed to the owner of the bot that imports these articles that the dictionary be used to fix incoming articles before they are posted.
Side effects
Manually correcting the bad words while building the dictionary turned up several words that needed more correction than just changing "dj" to "đ". These were properly corrected as well.
