Talk:ORES/BWDS review

About this board

This discussion page is reserved for topics related to the bad words review processing. Feel free to express any problem, idea, or improvement suggestion. Thank you.

This approach doesn't really work

3 comments • 21:54, 21 October 2019 4 years ago

Jeblad (talkcontribs)

In some languages bad words are generative, not fixed. That makes it pretty much impossible to make a fixed list of bad words.

In Nordland county in Norway, animal names, names of genitalia, and a few other names (often in the local dialect) are combined to produce a bad word. The fun thing is that what seems like a bad word can be taken as a boast in some situations. To confuse the situation even more, some of the names are common names used for other things. A “peis” is a fireplace, but also a penis. Let's take an example. If you have asked a friend over to do some work, and he doesn't show up, and then you call him a “måspeis” (“dick of a seagull”), he may just laugh it off: you called him a small jerk. If you call him a “hæstpeis” (“dick of a horse”), he might hit you: you called him a big jerk. Now assume you go to a party with your friend, you are asked at the door who he is, and you say he is a “måspeis”; then he might hit you. If you call him a “hæstpeis”, he might give you a beer.

To make this somewhat simpler, both “måspeis” and “hæstpeis” are informal terms that should not show up in articles. To make it harder, they are not listed in any dictionaries. To make it even harder, there are a lot of other combinations: “apskjit”, “mainskjit”, “hæstskjit”, “torskskjit”, “apekuk”, “mainkuk”, “hæstkuk”, “torsk-kuk”, “ap-peis”, “mainpeis”, “hæstpeis”, “torskpeis”, etc. The large set of variations will seem to be just noise in a k-means algorithm.
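
To illustrate how quickly such combinations multiply, here is a minimal Python sketch. The two part lists are hypothetical, taken only from the examples above, and the function name `compounds` is made up for illustration. Linking vowels such as the “e” in “apekuk” are not modelled, so real coverage would be lower than this suggests.

```python
from itertools import product

# Hypothetical part lists built from the examples above; real lists would be far longer.
PREFIXES = ["ap", "main", "hæst", "torsk"]
HEADS = ["skjit", "kuk", "peis"]


def compounds(prefixes, heads):
    """Return every prefix+head compound, both joined directly and with a hyphen."""
    out = set()
    for prefix, head in product(prefixes, heads):
        out.add(prefix + head)        # e.g. "hæstpeis"
        out.add(prefix + "-" + head)  # e.g. "torsk-kuk"
    return out


words = compounds(PREFIXES, HEADS)
print(len(words))
```

Every new prefix adds one entry per head (times the number of spelling variants), so a fixed list has to grow with the product of the part lists rather than with their sum.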

Some years ago I was pretty sure I had found all combinations; the list was short of 2000 bad words. Then I got a list from another source, and suddenly the list had over 7000 bad words, and it was still not complete.

I wrote a better description a few years back at m:Grants:IdeaLab/BadWords detector for AbuseFilter/Technical description. I also posted nearly the same text at m:Research talk:Revision scoring as a service/Word lists/no#Badwords.

Reply Edited 13:19, 25 October 2019 4 years ago

Wladek92 (talkcontribs)

Thank you for your point of view and the explanations. Nobody claims the method is universal, but it gives a basic level of filtering. In French the method described works. In any case, we are encouraged to update the lists manually per language (if your 7000 entries hurt performance, select at least the most common Norwegian subsets). These words feed an automatic process; if you do not declare them, a human review should always catch them later (and declare them manually).

Christian FR (talk) 06:46, 25 October 2019 (UTC)

Reply 06:46, 25 October 2019 4 years ago

Jeblad (talkcontribs)

The set has some common items, but then turns into a “very biggely” (!) long tail. The phrases “hæstpeis”, “hæstkuk”, “hæstskjit”, and “mainskjit” are winners, but terms such as “hyspeis” (dick of a haddock) and “frosk-kuk” (dick of a frog) can also be found in the long tail. I got some feedback a few years ago, and it seems the same thing exists in other cultures too.

My idea back then was to create lists of terms, and merge terms to form composite words. To do so it would be necessary to have affix rules, in particular infix rules. These would be used to brute-force the creation of composite words. This approach gives a list of regular expressions of order $O(N\times M)$, with $N$ and $M$ the sizes of the term lists being combined. A better solution would be to look for forms that can be merged. Only after sufficient coverage of a word is achieved is it flagged as a match, and further processing triggered. This approach gives a list of regular expressions of order $O(N+M)$.
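
A rough Python sketch of the coverage idea, under several assumptions: the part lists, the function names `matched_parts` and `is_candidate`, the 0.8 threshold, and the two-part minimum are all hypothetical, and plain prefix matching stands in for the regular expressions mentioned above. Only the $N+M$ parts are stored; no compound is ever enumerated.

```python
# Hypothetical part lists; only N + M entries are stored, never the N x M compounds.
PREFIXES = {"ap", "main", "hæst", "torsk", "mås", "hys", "frosk"}
HEADS = {"skjit", "kuk", "peis"}
PARTS = PREFIXES | HEADS


def matched_parts(word, parts=PARTS):
    """Scan left to right, greedily taking the longest known part at each position."""
    letters = word.lower().replace("-", "")
    ordered = sorted(parts, key=len, reverse=True)
    found = []
    i = 0
    while i < len(letters):
        for part in ordered:
            if letters.startswith(part, i):
                found.append(part)
                i += len(part)
                break
        else:
            i += 1  # unknown character, e.g. a linking vowel
    return letters, found


def is_candidate(word, threshold=0.8, min_parts=2):
    """Flag a word only when at least min_parts known parts cover enough of it."""
    letters, found = matched_parts(word)
    if not letters or len(found) < min_parts:
        return False
    covered = sum(len(part) for part in found)
    return covered / len(letters) >= threshold


for w in ["hæstpeis", "frosk-kuk", "apekuk", "peis", "fiskesuppe"]:
    print(w, is_candidate(w))
```

Requiring two parts keeps a lone “peis” (the fireplace) from being flagged, and the threshold below 1.0 tolerates a linking vowel as in “apekuk”. Greedy matching can mis-segment some words, so a flagged word is only a candidate for the further processing mentioned above, not a verdict.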

Reply Edited 16:50, 25 October 2019 4 years ago
