Talk:ORES

About this board

This talk page is intended to be used exclusively to discuss the development and deployment of ORES. For any discussion about the bots/tools that use ORES, please use the respective talk page.

EEggleston (WMF) (talkcontribs)

Is the github.com/wiki-ai page the right place to link?

EpochFail (talkcontribs)

It's not a bad place to link. We keep all of our primary repos within that organization.

This post was hidden by Neil P. Quinn-WMF (history)
Adamw (talkcontribs)

Unfortunately, this is out-of-date now. The wiki-ai organization still shows which repos we work in, but the most current code for those projects will be in the `wikimedia` organization.

We need to create a new entry point for developers.

Reply to "Links to repos?"

Question from a student doing independent research

Summary by FeralOink

I am marking this question as fully resolved. To summarize, a computer science student at an American university inquired of Wikimedia regarding detection of undisclosed paid edits and revisions. The student is using a particular framework for his own project of detecting paid Wikipedia edits, and wished to validate its accuracy versus the findings of Wikimedia's own methods of undisclosed paid edit detection.

An employee of Wikimedia responded to the inquiry. He provided links to recent public releases of data, as well as descriptions of ORES and a predictive model used for detection. He provided ORES parameter settings and information about API use. The student replied, acknowledging that the response he received was helpful and adequate. The student also gave permission for the conversation to be openly posted on mediawiki.

Halfak (WMF) (talkcontribs)

The following is an email conversation that I had with Mark Wang about ORES. I'm posting this so that it'll maybe gain some long-term usefulness. As you can see at the end of the thread, Mark agreed to me posting this publicly.

On Sat, Nov 17, 2018 at 8:50 PM Wang, Mark <> wrote:

Hi Scoring Platform Team!

I'm Mark, a CS student at Brown Univ. I'm working on experimenting with applying Snorkel (a framework for leveraging unlabeled data with noisy label proposals) to detect paid Wikipedia edits. I've got a few selfish requests / questions for you guys.

Snorkel code : https://github.com/HazyResearch/snorkel Snorkel paper : https://arxiv.org/pdf/1711.10160.pdf

Some selfish questions:

1) Is it possible for me to have access to edits and page-stats data that you work with? I can scrape them myself (with a reasonable crawl rate), but of course, it's less convenient and I'll end up working with less data.

2) How do you represent revisions? I'm thinking about using character embeddings here. What are some methods that worked well for you guys? And what should I probably not try?

3) What features seem to be strongly informative in your models for detecting low-quality edits?

4) Any additional recommendations /advice?

Thank you in advance for your time, Mark Wang


On Mon, Nov 19, 2018 at 4:49 PM Aaron Halfaker <ahalfaker@wikimedia.org> wrote: Hi Mark!

Thanks for reaching out! Have you seen our recent data release of known paid editors? https://figshare.com/articles/Known_Undisclosed_Paid_Editors_English_Wikipedia_/6176927

1) I'm not sure what page stats you are looking for, but you can see the features we use in making predictions by adding a "?features" argument to an ORES query. For example, https://ores.wikimedia.org/v3/scores/enwiki/21312312/damaging?features shows the features extracted and an "is this edit damaging" prediction for https://en.wikipedia.org/wiki/Special:Diff/21312312
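For anyone who wants to script this kind of query, here is a minimal sketch that only builds the URL. The path layout and the bare "features" flag are copied from the example link above; the helper name `build_ores_url` is hypothetical and not part of any ORES client library.

```python
# Base path taken from the example URL in the message above.
ORES_BASE = "https://ores.wikimedia.org/v3/scores"


def build_ores_url(wiki, rev_id, model, features=False):
    """Build a score URL (hypothetical helper, for illustration only)."""
    url = "{}/{}/{}/{}".format(ORES_BASE, wiki, rev_id, model)
    if features:
        # The example above passes "features" as a bare flag with no value.
        url += "?features"
    return url


print(build_ores_url("enwiki", 21312312, "damaging", features=True))
```

The resulting string can be handed to any HTTP client; the shape of the JSON that comes back is not shown here because it is not described in this thread.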

2) A revision is a vector that we feed into the prediction model. We do a lot of manual feature engineering, but we use vector embeddings for topic modeling. We're actually looking into just using our current word2vec strategies for implementing better damage detection too. See https://phabricator.wikimedia.org/T197007

3) Here's an output of our feature importance weights for the same model. This is estimated by sklearn's GradientBoosting model.

feature.log((temporal.revision.user.seconds_since_registration + 1)) 0.131
feature.revision.user.is_anon 0.036
feature.english.dictionary.revision.diff.dict_word_prop_delta_sum 0.033
feature.revision.parent.markups_per_token 0.029
feature.revision.parent.words_per_token 0.028
feature.revision.parent.chars_per_word 0.027
feature.log((wikitext.revision.parent.ref_tags + 1)) 0.026
feature.revision.diff.chars_change 0.026
feature.revision.user.is_patroller 0.026
feature.english.dictionary.revision.diff.dict_word_prop_delta_increase 0.025
feature.log((wikitext.revision.parent.chars + 1)) 0.023
feature.log((AggregatorsScalar(<datasource.tokenized(datasource.revision.parent.text)>) + 1)) 0.023
feature.log((AggregatorsScalar(<datasource.wikitext.revision.parent.words>) + 1)) 0.023
feature.revision.parent.uppercase_words_per_word 0.022
feature.log((wikitext.revision.parent.wikilinks + 1)) 0.021
feature.log((wikitext.revision.parent.external_links + 1)) 0.02
feature.log((wikitext.revision.parent.templates + 1)) 0.02
feature.wikitext.revision.diff.markup_prop_delta_sum 0.02
feature.english.dictionary.revision.diff.non_dict_word_prop_delta_sum 0.02
feature.log((AggregatorsScalar(<datasource.wikitext.revision.parent.uppercase_words>) + 1)) 0.018
feature.revision.diff.tokens_change 0.018
feature.log((wikitext.revision.parent.headings + 1)) 0.017
feature.wikitext.revision.diff.markup_delta_sum 0.015
feature.revision.diff.words_change 0.015
feature.english.dictionary.revision.diff.dict_word_delta_sum 0.015
feature.english.dictionary.revision.diff.dict_word_prop_delta_decrease 0.015
feature.english.dictionary.revision.diff.non_dict_word_prop_delta_increase 0.015
feature.revision.diff.markups_change 0.014
feature.english.dictionary.revision.diff.dict_word_delta_increase 0.014
feature.wikitext.revision.diff.markup_prop_delta_increase 0.013
feature.wikitext.revision.diff.markup_delta_increase 0.012
feature.wikitext.revision.diff.number_prop_delta_sum 0.011
feature.wikitext.revision.diff.number_prop_delta_increase 0.011
feature.english.dictionary.revision.diff.non_dict_word_delta_sum 0.011
feature.wikitext.revision.diff.number_delta_increase 0.01
feature.revision.diff.wikilinks_change 0.01
feature.revision.comment.has_link 0.01
feature.english.dictionary.revision.diff.dict_word_delta_decrease 0.01
feature.revision.page.is_mainspace 0.009
feature.wikitext.revision.diff.number_delta_sum 0.009
feature.wikitext.revision.diff.markup_prop_delta_decrease 0.008
feature.english.dictionary.revision.diff.non_dict_word_prop_delta_decrease 0.008
feature.revision.page.is_articleish 0.007
feature.revision.diff.external_links_change 0.007
feature.revision.diff.templates_change 0.007
feature.revision.diff.ref_tags_change 0.007
feature.english.informals.revision.diff.match_prop_delta_sum 0.007
feature.english.informals.revision.diff.match_prop_delta_increase 0.007
feature.wikitext.revision.diff.number_prop_delta_decrease 0.006
feature.revision.comment.suggests_section_edit 0.006
feature.english.dictionary.revision.diff.non_dict_word_delta_increase 0.006
feature.wikitext.revision.diff.markup_delta_decrease 0.005
feature.revision.user.is_bot 0.005
feature.revision.user.is_admin 0.005
feature.english.badwords.revision.diff.match_prop_delta_sum 0.005
feature.wikitext.revision.diff.number_delta_decrease 0.004
feature.wikitext.revision.diff.uppercase_word_prop_delta_sum 0.004
feature.revision.diff.headings_change 0.004
feature.revision.diff.longest_new_repeated_char 0.004
feature.english.badwords.revision.diff.match_prop_delta_increase 0.004
feature.english.informals.revision.diff.match_delta_increase 0.004
feature.english.dictionary.revision.diff.non_dict_word_delta_decrease 0.004
feature.wikitext.revision.diff.uppercase_word_delta_sum 0.003
feature.wikitext.revision.diff.uppercase_word_prop_delta_increase 0.003
feature.revision.diff.longest_new_token 0.003
feature.english.informals.revision.diff.match_delta_sum 0.003
feature.wikitext.revision.diff.uppercase_word_delta_increase 0.002
feature.wikitext.revision.diff.uppercase_word_prop_delta_decrease 0.002
feature.english.badwords.revision.diff.match_delta_sum 0.002
feature.english.badwords.revision.diff.match_delta_increase 0.002
feature.wikitext.revision.diff.uppercase_word_delta_decrease 0.001
feature.english.informals.revision.diff.match_prop_delta_decrease 0.001
feature.revision.page.is_draftspace 0.0
feature.revision.user.has_advanced_rights 0.0
feature.revision.user.is_trusted 0.0
feature.revision.user.is_curator 0.0
feature.english.badwords.revision.diff.match_delta_decrease 0.0
feature.english.badwords.revision.diff.match_prop_delta_decrease 0.0
feature.english.informals.revision.diff.match_delta_decrease 0.0
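
A listing like the one above is easy to load for further analysis. The following is a rough sketch, assuming each line is a feature name followed by a weight; since some names contain spaces, it splits on the last whitespace only. The function name `parse_importances` is hypothetical, and the sample is three lines copied from the list.

```python
def parse_importances(text):
    """Parse 'name weight' lines into (name, weight) pairs."""
    pairs = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the last run of whitespace only: names may contain spaces.
        name, weight = line.rsplit(None, 1)
        pairs.append((name, float(weight)))
    return pairs


# Three lines copied from the listing above.
sample = """\
feature.log((temporal.revision.user.seconds_since_registration + 1)) 0.131
feature.revision.user.is_anon 0.036
feature.revision.diff.chars_change 0.026
"""

pairs = parse_importances(sample)
top = sorted(pairs, key=lambda p: p[1], reverse=True)
for name, weight in top:
    print("{:.3f}  {}".format(weight, name))
```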

4) You'll note that time since registration and is_anon are strongly predictive. They don't overwhelm the predictions -- we can still differentiate good from bad among newcomers and anonymous editors. But the model generally doesn't predict that an edit by a very experienced editor is bad, regardless of what's actually in the edit. The more we can move away from relying on is_anon and seconds_since_registration, the more we'll be targeting the things that people do -- rather than targeting them for their status. See section 7.4 of our systems paper for a more substantial discussion of this problem.

-Aaron


On Mon, Nov 19, 2018 at 6:47 PM Wang, Mark <> wrote:

Thanks a bunch for your help Aaron! This is all very informative.

One more question from me: May I borrow your features? And if so, is accessing them through the API the preferred method of access for an outsider?

Thanks again, Mark


On Tue, Nov 20, 2018 at 11:07 AM Aaron Halfaker <ahalfaker@wikimedia.org> wrote:

Say, I'd like to save this conversation publicly so that others might benefit from it. Would you be OK with me posting our discussion publicly on a wiki?

On Tue, Nov 20, 2018 at 10:06 AM Aaron Halfaker <ahalfaker@wikimedia.org> wrote: Yes. That is a good method for accessing the features. You'll notice that the features that the API reports are actually just the basic reagents for the features the model uses.

For example, we have features like this:

  • words added
  • words removed
  • words added / words removed
  • log(words added)
  • log(words removed)
  • etc.

In all of these features, the basic foundation is "words added" and "words removed" with some mathematical operators on top. So we only report those two via the API. To see the full set of features for our damage detection model, see https://github.com/wikimedia/editquality/blob/master/editquality/feature_lists/enwiki.py See also a quick overview I put together for feature engineering here: https://github.com/wikimedia/revscoring/blob/master/ipython/feature_engineering.ipynb
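
The idea of a few base values with mathematical operators on top can be sketched in plain Python. This is only an illustration and not the revscoring implementation: the function name, the dictionary keys, the zero-denominator guard, and the use of log(x + 1) (modeled on names such as log((... + 1)) in the importance list earlier in this thread) are all assumptions.

```python
import math


def derive_features(words_added, words_removed):
    """Build derived values from two base counts (illustrative only)."""
    return {
        "words_added": words_added,
        "words_removed": words_removed,
        # Assumption: treat a zero denominator as 1 to avoid dividing by zero.
        "added_per_removed": words_added / max(words_removed, 1),
        # Assumption: log(x + 1), so that a count of zero is still defined.
        "log_words_added": math.log(words_added + 1),
        "log_words_removed": math.log(words_removed + 1),
    }


print(derive_features(words_added=12, words_removed=3))
```

Only the two inputs need to be fetched or stored; everything else can be recomputed from them, which matches the point that the API reports just the base values.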

If I wanted to extract the raw feature values for the English Wikipedia "damaging" model, I'd install the "revscoring" library (pip install revscoring) and then run the following code from the base of the editquality repo:

$ python
Python 3.5.1+ (default, Mar 30 2016, 22:46:26) 
[GCC 5.3.1 20160330] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from editquality.feature_lists.enwiki import damaging
/home/halfak/venv/3.5/lib/python3.5/site-packages/sklearn/cross_validation.py:41: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
  "This module will be removed in 0.20.", DeprecationWarning)
>>> from revscoring.extractors import api
>>> import mwapi
>>> extractor = api.Extractor(mwapi.Session("https://en.wikipedia.org"))
Sending requests with default User-Agent.  Set 'user_agent' on mwapi.Session to quiet this message.
>>> list(extractor.extract(123456789, damaging))
[True, True, False, 10.06581896445358, 9.010913347279288, 8.079927770758275, 3.4965075614664802, 2.772588722239781, 5.402677381872279, 2.70805020110221, 1.791759469228055, 2.1972245773362196, 7.287484510532837, 0.3940910755707484, 0.009913258983890954, 0.06543767549749725, 0.0, 2.0, -2.0, 0.04273504273504275, 0.15384615384615385, -0.1111111111111111, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1, 1, False, False, False, False, True, False, False, 11.305126087390619, False, False, 0, 0, 0, 0.0, 0.0, 0.0, 0, 0, 0, 0.0, 0.0, 0.0, 0, 0, 0, 0.0, 0.0, 0.0, 0, 0, 0, 0.0, 0.0, 0.0]

This extracts the features for this edit: https://en.wikipedia.org/w/index.php?diff=123456789
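
The raw list is hard to read without names attached. A small sketch of one way to label it, assuming that calling str() on each entry of the feature list gives a readable name (that is an assumption about revscoring, not something confirmed in this thread). The helper `label_values` is hypothetical, and the demo uses made-up stand-in data so it runs without revscoring installed.

```python
def label_values(features, values):
    """Pair each feature's string name with its extracted value."""
    features = list(features)
    values = list(values)
    if len(features) != len(values):
        raise ValueError("expected one value per feature")
    return {str(f): v for f, v in zip(features, values)}


# Made-up stand-in data. With revscoring this might look like:
#   label_values(damaging, extractor.extract(123456789, damaging))
names = ["feature.revision.user.is_anon", "feature.revision.diff.chars_change"]
labeled = label_values(names, [True, 42])
print(labeled)
```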

-Aaron


Hi Aaron:

Thank you so much! This is all so helpful. And of course, feel free to publicize any of our conversations.


Mark

Help to support ORES at Galician wiki

Elisardojm (talkcontribs)
Daylen (talkcontribs)

The product teams that deal with this likely have a backlog. It might take a few weeks or months for the project to be picked up, as it is a feature request and not a bug.

Elisardojm (talkcontribs)

Ok, thanks!

EpochFail (talkcontribs)

Hey! Thanks for the ping! I just got back from a vacation and will be picking up new tasks soon. I just added an update to both.

Reply to "Help to support ORES at Galician wiki"
Summary by Harej (WMF)

Off-topic

DPAO EUTMMali (talkcontribs)

Good morning.

I am the Deputy Public Affairs Office from EUTM Mali.

I am trying to update the information on Wikipedia concerning this topic, but this is the third time the bot has erased everything I modified and referred me to this page.

Is there any way of doing this, as I have to change the Spanish/English/French pages?

Regards.

EpochFail (talkcontribs)
DPAO EUTMMali (talkcontribs)

Good morning.

Thanks for the answer.

As I don't know exactly how Wikipedia works... I will try to materialize your words...

Regards.

Coverage of ORES (and hackathon) in a Catalan newspaper

Halfak (WMF) (talkcontribs)
Townie (talkcontribs)

ORES was also mentioned here:

among many others. It is certainly an interesting project :)

Halfak (WMF) (talkcontribs)

\o/ Thanks for sharing

Reply to "Coverage of ORES (and hackathon) in a Catalan newspaper"
Netkawai (talkcontribs)

There is no Japanese Wikipedia in the support table? I am not a developer. This is curious.

If someone can point me to a link about this, I would appreciate it and will follow it.

EpochFail (talkcontribs)

Thanks for your note! We're stuck on an issue we can't review with our current support for Japanese Wikipedia. Can you check out Phab:T133405 and see if you can help us?

Reply to "Japanese Wikipedia"
Aymatth2 (talkcontribs)

See discussion at en:Wikipedia:Village pump (idea lab)#Automated article assessment. The idea is for a bot to use ORES to generate project assessments on article talk pages, flagged as "bot assessed". The bot would periodically reassess articles until a human removed the "bot assessed" flag. This must have been considered before? Any comments there would be welcome. Thanks, ~~~~

EpochFail (talkcontribs)

It's a good idea IMO. How can we help?

Aymatth2 (talkcontribs)
Reply to "Automated assessment on en.wiki"
Lorely De Leon (talkcontribs)

Hello, I wanted to update some data and I added the references. I do not know whether you reverted my changes; I would like you to check. The information is correct and so are the sources.

Reply to "reversal of changes"
Fridaonz (talkcontribs)

I am trying to edit the article called Miguel Castro Reynoso, since it was flagged as spam and I was told that I needed to make some corrections to that page. However, when I try to edit and correct it, my edit is flagged as vandalism and reverted.

Pontesalpublicidad (talkcontribs)

I am trying to edit the Veganismo article, but I keep getting errors and it is impossible to edit. What is the problem?

Reply to "Editar Articulo"
Summary by EpochFail

Abuse filter was blocking an edit

Jogamau (talkcontribs)

I am trying to change one page and then I get an error. The page tells me that I am a "titere", so I need help.

Wargo (talkcontribs)

Where?

Jogamau (talkcontribs)

Page is «Juego de bienes públicos»

Wargo (talkcontribs)

On which wiki?

Jogamau (talkcontribs)

Wikipedia

Jogamau (talkcontribs)

Please HELP

Wargo (talkcontribs)