Topic on Project:Support desk

Improving the effectiveness of regular expressions against vandalism

Guiwp (talkcontribs)

The current mechanism that identifies vandalism is based on regular expressions applied to "illegal" variations of words. This is not effective; it can be easily bypassed (as has been seen very often, at least in the pt.* community).
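
To illustrate the bypass problem, here is a hypothetical sketch in Python (the pattern is an invented placeholder, not an actual filter rule from any wiki):

    import re

    # Hypothetical filter rule in the spirit of the current regex-based approach
    # ("badword" stands in for a real insult; this is not an actual pt.* rule).
    BAD_WORD = re.compile(r"badw[o0]rds?", re.IGNORECASE)

    edits = [
        "he is a badword",       # caught: plain spelling
        "he is a badw0rd",       # caught: the rule anticipated the digit zero
        "he is a bädwοrd",       # missed: 'a' with umlaut and a Greek omicron
        "he is a b a d w o r d", # missed: simple spacing defeats the pattern
    ]

    for text in edits:
        print("blocked" if BAD_WORD.search(text) else "accepted", "-", text)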

I've proposed a solution to it here.

The English translation boils down to: 1) provide a default (and good) spell-checking mechanism, which only allows saving the wikitext if all the words exist (checking against the dictionary, inflections, etc.); 2) after this process, pass the data on to the layer that counters vandalism (the anti-vandalism tools). A rough sketch of the idea follows the list below.

  1. A database of words that "exist" already exists, and it is open: Wiktionary.
  2. Libraries that inflect words already exist for many programming languages.
  3. That is very good, because it has a nice side effect: articles will be more correct!
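
Roughly, the kind of check I have in mind would look something like this Python sketch (the tiny whitelist here just stands in for Wiktionary plus generated inflections; everything is simplified):

    import re

    # Toy stand-in for a Wiktionary-derived whitelist plus generated inflections;
    # a real implementation would load this from a database.
    WHITELIST = {"the", "cat", "cats", "sits", "sat", "on", "a", "mat", "mats"}

    def unknown_words(wikitext):
        """Return the words of an edit that the whitelist does not recognise."""
        words = re.findall(r"[^\W\d_]+", wikitext.lower(), re.UNICODE)
        return [w for w in words if w not in WHITELIST]

    def check_edit(wikitext):
        """Step 1: spell check. Step 2: hand the result to the anti-vandalism layer."""
        bad = unknown_words(wikitext)
        if bad:
            # Instead of rejecting outright, the edit could be flagged for review here.
            return ("flag_for_review", bad)
        return ("pass_to_antivandalism_tools", [])

    print(check_edit("The cat sat on the mat."))    # ('pass_to_antivandalism_tools', [])
    print(check_edit("The cat zqxwv on the mat."))  # ('flag_for_review', ['zqxwv'])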

What is missing?

Are there any projects or extensions looking into this?

Bawolff (talkcontribs)

Having a "very good" dictionary of possible words is difficult. In particular, having an exhaustive list of proper nouns is pretty much impossible. Also, some articles will presumably have words in foreign languages, so one has to allow valid words in any language.

88.130.111.116 (talkcontribs)

Yes, that's also what I thought: creating a usable whitelist of allowed words will be pretty much impossible.

Guiwp (talkcontribs)

But this doesn't make sense. Do you think the number of arbitrary words (illegal variations) is smaller than the number of known words we can consult in a simple dictionary? No, there are far more of them, 1000x more and beyond. Most people use few words. You could do some research and publish the word frequencies used in the articles; you would see that the words actually used are very few compared to the whole lexicon that could be used.

The whitelist doesn't need to be perfect the first time; words can be added as they start appearing. And articles can be flagged to be published later, after someone personally checks them and possibly adds the word to the whitelist database (or creates an "inflector" for a new variation).

The way vandalism is currently identified is much more "ridiculous" than trying something better. Man, I have seen funny regular expressions just trying to catch a single illegal variation of a word, and they failed, simply because it is easy to bypass a system that accepts arbitrary combinations of characters.

Usually foreign languages appear only as foreign terms in some articles, and we are not allowed to write an article entirely in a foreign language. That is another point in favor of per-category whitelists: if someone introduces a term that has nothing to do with the subject, we know it is vandalism, or at least possible vandalism that can be flagged.

I seriously believe that things won't get better with the current mechanism. If you believe otherwise, please give me technical reasons. Maybe in 20 or more years things will still be the same if this way of catching vandalism doesn't change.

Skizzerz (talkcontribs)

Technical reasons? We already gave them: it is impossible to make a dictionary comprehensive enough to cover every possible word in every possible language, including every possible proper noun, intentional misspellings of words (e.g. "internet speak"), and code (among many other things). If such a dictionary were made, it would likely number in the millions of words, and it would be impossible to scan every word of every submission for a match in the dictionary in a reasonable amount of time (especially for larger articles that average 20 KB of text or more).

This is way too limiting and even more arbitrary of a restriction than the current regex-based measures. Additionally, a sufficiently motivated vandal can and will get around any arbitrary restriction you impose, so in addition to not doing anything (except, I guess, making vandals vandalize pages using actual words), it harms the average user who may wish to type something that doesn't exist in the dictionary and finds that they are unable to do so ("I really wanted to write an article on Khenarthi's Roost, but it isn't in the dictionary so I can't!").

Guiwp (talkcontribs)

Sorry, but I think that you didn't read what I wrote.

Let me try, once again, to get you to read it.

  • Do you think the number of arbitrary words (illegal variations) is smaller than the number of known words we can consult in a simple dictionary? You have to be kidding if you answer that arbitrary combinations are fewer than known words. So no, the current system is much less effective. It's that simple; I won't rewrite it again.
  • Most people use few words. That is something very widely known; read some books in the linguistics area and you will see.
  • The whitelist doesn't need to be perfect the first time. Where did I say it should cover every possible word in every possible language? Please, read the text...

Now, based on math as simple as 2+2 being greater than 1, I can't completely understand what you mean by "This is way too limiting and even more arbitrary of a restriction than the current regex-based measures." Do you know what arbitrary character combinations are?

Another thing you said that proves you didn't read:

"I really wanted to write an article on Khenarthi's Roost, but it isn't in the dictionary so I can't!"

What I wrote before:

And articles can be flagged to be published later, after someone personally checks them and possibly adds the word to the whitelist database (or creates an "inflector" for a new variation).

Anyway, thanks for your comments. I think I should look for more optimistic people. Bye.

88.130.111.116 (talkcontribs)

> Do you know what arbitrary character combinations are?

Skizzerz is right. Please stop being offensive and stay on topic!

It is true that trusted editors could flag all new revisions in every article; the article would then only be displayed after this check. This concept has in fact been introduced, e.g. in the German Wikipedia, and has been used successfully there for years. However, this has nothing to do with an automated check against a whitelist that has to contain every word of the article, which is what you want. While aiming at the same problem, flagged revisions are technically a completely different cup of tea.

I also see the point that you can use various kinds of spelling to write one and the same illegal word. E.g. for the small blue pills you can use two slashes instead of the "V", a small "L" or a "1" instead of the "i", or some kind of Unicode sign which looks like one of these letters but is not one of them. You see that you will quickly have several thousand variations of the same illegal word. However, compared to the few illegal words, the number of legal words is way bigger: the Oxford English Dictionary lists around 170,000 words in current use, around 50,000 obsolete words and nearly 10,000 derivative words. And all this includes neither technical terms nor inflections nor compounds. There is a vast number of legal words, and it is just untrue to say they would not be used. Even if each individual user only uses "a few" (however many that may be) different words a day, the sum of users, the sum of languages and the sum of all their different areas of expertise will create an immensely huge number of legal words.

And what should happen when a user adds a word not in that list? Should we then disallow saving the whole edit? There are so often articles with typing errors, or maybe an article about a word which you simply do not have among your billions of words. Imagine you wrote your first article and you just would not be able to save it. How annoying! The number of editors is already stagnating today, but what would happen with that kind of patronizing? Many more would turn away and never come back. It would do way more harm than good.
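
To put a rough number on those "several thousand variations": a small Python sketch with only a handful of look-alike substitutions per letter already gives thousands of spellings (the table is purely illustrative):

    from itertools import product

    # A tiny, illustrative sample of look-alike substitutions per letter; real
    # vandals have far more options (accents, Cyrillic, fullwidth forms, ...).
    LOOKALIKES = {
        "v": ["v", "V", "\\/", "ⅴ"],
        "i": ["i", "I", "l", "1", "і"],
        "a": ["a", "A", "4", "@", "а"],
        "g": ["g", "G", "9", "q"],
        "r": ["r", "R", "®"],
    }

    word = "viagra"
    choices = [LOOKALIKES.get(c, [c]) for c in word]
    variants = ["".join(combo) for combo in product(*choices)]
    print(len(variants))   # 3840 spellings from this tiny table alone
    print(variants[:3])    # a few examples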

And I even go one step further when I dare to say: not only is it impossible in terms of manpower; it is also technically impossible to check every single edit and every single word against a dictionary with these billions of words, inflections and derivations. Although hardware and CPU cycles are becoming cheaper and cheaper, you would still have to throw an immense amount of money at this problem if you really wanted to check every single edit and every single word against such a list. And all this does not even take into account that languages are constantly changing: new words are created and the meanings of old words change. What was accepted yesterday may have become an insult by now. You would have to constantly update your pile of words to stay up to date. Basically like a dictionary company, but even more extreme, since they can leave words out (e.g. because they don't consider them established enough), while with a whitelist you cannot do so.

I am not pessimistic, but just realistic when I say:

What you want, technically as well as in terms of manpower, is simply impossible to accomplish.

Guiwp (talkcontribs)

Do you know what arbitrary character combinations are?

This was a question, simply a question on the topic. There are some readers and volunteers who do not know "computing"; they just try to help, and I was thinking Skizzerz was one of them. Sorry, I didn't want to offend Skizzerz. But he didn't say that he was offended; are you "Skizzerz"?

You wrote:

several thousand variations of the same illegal word.

Man... I could say that with arbitrary combinations of characters you don't even have to declare a "word" a "word"; there are simply millions/billions/etc. of combinations (that is, thousands of times bigger than all the dictionaries you could cite in this thread right now, sorry). I'm not telling you to make characters look like a word, as an illegal variation; I'm telling you that fuzzy combinations of all available Unicode characters are much, much greater in number than any predefined set of words. And that is what I mean by arbitrary.
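
A back-of-the-envelope comparison in Python, using the OED figures quoted above (the 6-character length and 26-letter alphabet are just illustrative choices):

    # Back-of-the-envelope comparison: dictionary size vs. arbitrary strings.
    # 170,000 current + ~50,000 obsolete + ~10,000 derivative words, i.e. the
    # OED figures quoted earlier, ignoring inflections and compounds.
    dictionary_words = 170_000 + 50_000 + 10_000

    # Arbitrary 6-character strings over just the 26 lowercase ASCII letters,
    # never mind the far larger repertoire Unicode actually offers.
    arbitrary_strings = 26 ** 6

    print(dictionary_words)                       # 230000
    print(arbitrary_strings)                      # 308915776
    print(arbitrary_strings // dictionary_words)  # roughly 1343 times more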

I can't make it any clearer, so I'll stop here in this thread. Thank you for your comments.

88.130.111.116 (talkcontribs)

You're always welcome. :-)

Merry Christmas!

Guiwp (talkcontribs)

Thanks for these words. Now I see hope in your words. Now I'm happier. =)

We'll find a way to improve this whole "system"! Let's hope! =)

Merry Christmas to all people!

Bawolff (talkcontribs)

I guess I'm late to the party here, but...

In your original post I assumed you meant that any edit containing a non-allowed word would be rejected outright. In such a system, the room for error is pretty close to zero, as the inconvenience of a false positive is extremely high.

In regards to the number of bad words being larger than the number of good words: that is almost certainly true. There is a countably infinite number of bad words, and I believe only a finite number of good words. However, just because the cardinality is higher on the bad-word side does not necessarily make it a doable task to work in the other direction, since a number smaller than infinity can still be much too large to manage.

However, later on you talk about using this as a system to trigger extra review, or to tag an edit for further review, etc. That's much more likely to be workable imo, although there would still be many questions to work out about how such a system would actually work and what the review process would be like.
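
As a rough Python sketch of "tag for review instead of reject" (all names and thresholds here are made up, not an existing MediaWiki hook):

    # Sketch of "tag for review" rather than "reject": unknown words raise the
    # review priority and add a tag, but the edit is always saved.
    KNOWN_WORDS = {"the", "new", "album", "was", "released", "in"}

    def score_edit(added_text):
        words = added_text.lower().split()
        unknown = [w for w in words if w.strip(".,!?") not in KNOWN_WORDS]
        ratio = len(unknown) / max(len(words), 1)
        tags = ["possible-gibberish"] if ratio > 0.5 else []
        return {"save": True, "review_priority": ratio, "tags": tags, "unknown": unknown}

    print(score_edit("The new album was released in 2014"))
    print(score_edit("asdf qwerty zxcv lol"))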

But if you do go in that direction, it might be better to take a machine-learning approach rather than a spelling-based one (lest the vandals and spammers learn to spell; obligatory xkcd). You may be interested in reading up on w:User:ClueBot_NG.
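
Very roughly, "features instead of spelling" could mean extracting signals like these from an edit (a toy Python illustration only; ClueBot NG itself uses a trained neural network on many more signals):

    # Toy feature extraction for an edit, in the spirit of ML-based vandalism
    # detection (not how ClueBot NG actually works).
    def edit_features(old_text, new_text):
        added = new_text[len(old_text):] if new_text.startswith(old_text) else new_text
        letters = sum(c.isalpha() for c in added) or 1
        return {
            "size_change": len(new_text) - len(old_text),
            "uppercase_ratio": sum(c.isupper() for c in added) / letters,
            "repeated_chars": any(c * 5 in added for c in set(added)),
            "blanked_page": len(new_text) < 0.1 * len(old_text),
        }

    print(edit_features("Some article text.", "Some article text. AAAAAHAHAHA!!!"))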

Happy holidays,
