Wikipedia article translation metrics/How to detect translated articles

From mediawiki.org

This research is based on conversations with editors and on insights from a detailed comparison between translated articles and their original articles.

Finding the right revision

  1. In the target wiki (the wiki of the article that is a candidate for being a translation), search for the earliest revision that is stable and has enough bytes. This revision was edited by the creator of the article. Usually it is not the first revision but the fourth or fifth.
    • What's earliest?
      • Timestamp is preserved after import from another wiki; rev_id doesn't ensure chronological order.
      • Look for the earliest rev_timestamp that has either rev_user > 0, or rev_user = 0 and a rev_user_text resembling an IP address (using rev_user_text is hacky; rev_parent_id might help?).
    • A few definitions for stable:
      • Until another author made changes.
      • No one edited the article for a while.
      • A bot added langlinks.
      • The change in bytes is small (less than 50).
  2. In the source wiki (the original article in the source language), search for the closest edit from below, i.e., the latest source revision made before the chosen target revision.
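
The two steps above can be sketched as follows. This is only an illustration under assumptions: the `Revision` record, the `min_bytes` threshold, and both function names are hypothetical, and only two of the stability definitions listed above are used (another author edits, or the byte change drops below 50).

```python
from bisect import bisect_right
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Revision:
    # Hypothetical record; field names only loosely mirror the revision table.
    timestamp: str  # "YYYYMMDDHHMMSS", so string order is chronological order
    user: str       # stand-in for rev_user / rev_user_text
    size: int       # page size in bytes after this revision


def earliest_stable_revision(revs: List[Revision],
                             min_bytes: int = 500,   # assumed threshold for "enough bytes"
                             max_delta: int = 50) -> Optional[Revision]:
    """Return the creator's first revision that is large enough and has settled."""
    ordered = sorted(revs, key=lambda r: r.timestamp)  # do not trust rev_id order
    if not ordered:
        return None
    creator = ordered[0].user
    candidate = None
    previous_size = 0
    for rev in ordered:
        if rev.user != creator:
            break  # stable "until another author made changes"
        candidate = rev
        if rev.size >= min_bytes and abs(rev.size - previous_size) < max_delta:
            return rev  # the change in bytes is small
        previous_size = rev.size
    return candidate  # last revision by the creator before anyone else edited


def closest_source_revision(source_revs: List[Revision],
                            target: Revision) -> Optional[Revision]:
    """Return the latest source revision at or before the target revision."""
    ordered = sorted(source_revs, key=lambda r: r.timestamp)
    idx = bisect_right([r.timestamp for r in ordered], target.timestamp)
    return ordered[idx - 1] if idx else None
```

Sorting by timestamp rather than by revision id follows the note above that imports preserve timestamps while rev_id does not ensure chronological order.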

Characteristics of translated pages

Traces of another language

  1. Editors copy the original article (in the original language) to the target article:
    1. Look at the first target revision.
    2. If it is similar in size (±20%) to the source version, check whether it is in a different language from the target one (for languages that share an alphabet, use Google Translate to detect the language).
    3. Another option is to check if the first sentence is identical in both versions.
    4. For examples see “היסטוריה של אנדורה” (in Hebrew from Catalan) and Roberto Melli (in English from Italian). Both demonstrate the shift in language from the first target version until the stable version.
  2. References of the source article are in another language (e.g., here). Can be seen in the first version of the article.
    • Note that this might not work when translating from English.
  3. Headlines in a different language.
  4. The Source, Notes, External links, and Further reading sections of the source article are in another language (e.g., here). This data is part of the article from its creation.
  5. The target language editor edits the source article with an edit that is more meaningful than only adding langlink to the target language.
  6. In articles created before the transition to Wikidata, the first draft already has langlinks in the code to all the other languages (maybe except the source).
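
The size comparison and the first-sentence comparison from item 1 can be sketched as below. The function names are hypothetical and the sentence splitter is deliberately naive (it stops at the first ".", "!" or "?" followed by whitespace), so treat this as a rough illustration rather than a finished detector.

```python
import re


def similar_size(target_bytes: int, source_bytes: int, tolerance: float = 0.20) -> bool:
    """True when the target size is within +-20% of the source size."""
    if source_bytes <= 0:
        return False
    return abs(target_bytes - source_bytes) <= tolerance * source_bytes


def first_sentence(text: str) -> str:
    """Naive splitter: everything up to the first sentence-ending mark."""
    stripped = text.strip()
    match = re.search(r"(.+?[.!?])(\s|$)", stripped, flags=re.S)
    return match.group(1) if match else stripped


def same_first_sentence(target_text: str, source_text: str) -> bool:
    """True when both revisions open with the identical sentence."""
    return first_sentence(target_text) == first_sentence(source_text)
```

Both checks are meant to run on the first target revision and the matching source revision; a positive result only suggests that the source text was pasted in before being translated.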

Manual annotations

  1. Comments such as “translated from”, “translation of”, or “תורגם מ” (Hebrew for “translated from”).
    • A legacy but common practice is to inform others about the source of the translation in the talk page.
    • The legally required/sufficient practice, per m:Terms of use#7c, is to write in the edit summary (revision comment) whether the page was translated and from where.
  2. A translation template.
  3. Weaker possibility: the language name or code in the talk page.
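
A search for such comments in edit summaries or talk-page text might look like the sketch below. The pattern covers only the three phrases quoted above; the names `TRANSLATION_MARKERS` and `has_translation_marker` are hypothetical, and a real list would need phrases for every language of interest.

```python
import re

# Only the phrases mentioned in the text; extend per language as needed.
TRANSLATION_MARKERS = re.compile(
    r"translated\s+from|translation\s+of|תורגם\s+מ",
    re.IGNORECASE,
)


def has_translation_marker(comment: str) -> bool:
    """True when an edit summary or talk-page line mentions a translation."""
    return bool(TRANSLATION_MARKERS.search(comment))
```

For example, `has_translation_marker("Translated from it:Roberto Melli")` matches, while an unrelated summary such as "fixed typo" does not.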

Structure

  1. Table of contents structure.
  2. Paragraph structure:
    1. Number of paragraphs in a section.
    2. Similar length of paragraphs (normalizing by language).
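
One possible way to compare paragraph structure is sketched below. It assumes that "normalizing by language" can be approximated by expressing each paragraph's length as a share of its section's total length, which is an assumption of this example and not something the research prescribes; the function names are hypothetical.

```python
from typing import List, Optional


def length_profile(paragraphs: List[str]) -> List[float]:
    """Each paragraph's share of the total length (shares sum to 1)."""
    total = sum(len(p) for p in paragraphs)
    if total == 0:
        return []
    return [len(p) / total for p in paragraphs]


def profile_distance(source_paragraphs: List[str],
                     target_paragraphs: List[str]) -> Optional[float]:
    """Mean absolute difference between the two length profiles.

    Returns None when the paragraph counts differ or a section is empty,
    since the number of paragraphs is itself one of the signals.
    """
    source = length_profile(source_paragraphs)
    target = length_profile(target_paragraphs)
    if not source or len(source) != len(target):
        return None
    return sum(abs(s - t) for s, t in zip(source, target)) / len(source)
```

A distance near 0 means the paragraphs have the same relative lengths in both articles, regardless of how verbose each language is overall.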

Location and order

  1. The order of interlinks in the text[1].
    • Look at the order of the links that appear in both articles (the intersection), and at how many links appear in one of the articles but not in the other.
    • Maybe localizing to paragraph/subject.
    • Different languages have different sentence structure. We might need to find the right separator for shifting the order.
    • Can we do it without translating interlinks? It might be possible because the linked articles should themselves be connected across languages.
  2. The location and order of numbers and dates.
    • Should enable understanding of the paragraph without translation.
  3. Order and location of references (should match) in the text.
  4. Order and location of notes in the text.
  5. Matching between the presented order of references (i.e., in the paragraph “References”), external links, further reading, images.
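
The order comparison can be sketched with a longest common subsequence over the items shared by both articles. The sketch assumes the links have already been mapped to language-independent identifiers (the "Q…" values in the usage note are made-up placeholders), and the function names are hypothetical. The same comparison can be applied to numbers and dates extracted from the text.

```python
import re
from typing import Dict, List


def numbers_in(text: str) -> List[str]:
    """Digit runs in reading order, as a language-independent sequence."""
    return re.findall(r"\d+", text)


def order_signals(source_items: List[str], target_items: List[str]) -> Dict[str, float]:
    """Count unshared items and measure how well shared items keep their order."""
    shared = set(source_items) & set(target_items)
    # Keep the first occurrence of each shared item, preserving order.
    src = [x for x in dict.fromkeys(source_items) if x in shared]
    tgt = [x for x in dict.fromkeys(target_items) if x in shared]

    # Longest common subsequence length via row-by-row dynamic programming.
    prev = [0] * (len(tgt) + 1)
    for s in src:
        cur = [0]
        for j, t in enumerate(tgt, 1):
            cur.append(prev[j - 1] + 1 if s == t else max(prev[j], cur[j - 1]))
        prev = cur

    return {
        "only_in_source": len(set(source_items) - shared),
        "only_in_target": len(set(target_items) - shared),
        "order_agreement": prev[-1] / len(shared) if shared else 0.0,
    }
```

For instance, `order_signals(["Q1", "Q2", "Q3", "Q4"], ["Q1", "Q3", "Q2", "Q5"])` reports one unshared item on each side and an order agreement of 2/3, because only two of the three shared links keep their relative order.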

Other properties

  1. Bad translation.
    • The page Wikipedia:Translation, which exists in 49 languages, explains the do's and don'ts of translation for each language.
    • In Hewiki there is a page that points to common translation mistakes. It is specific to Hebrew but has a lot of value for that language. We should consider whether to use it (and search for similar pages in other languages). For example, searching for disproportionate use of phrases (e.g. משקל הופעל; יכול היה) that is caused by bad translation.
    • When searching if it was translated from German, irregular sentence length should be checked.
  2. Headlines are translated.
    • Use “Google Translate” or another machine translation for aid.
      • Detect the language in revisions into which the source was copied.
    • Create a list of the common translated headlines e.g., references, external links, see also, further reading.
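
A list of common headlines could be used as in the sketch below. The table is a small illustrative sample typed in for this example, not an authoritative or complete list, and `foreign_headings` is a hypothetical helper name.

```python
from typing import Dict, List

# Illustrative sample only; a real list would be collected per wiki.
COMMON_HEADINGS: Dict[str, List[str]] = {
    "en": ["References", "External links", "See also", "Further reading"],
    "he": ["הערות שוליים", "קישורים חיצוניים", "ראו גם", "לקריאה נוספת"],
    "it": ["Note", "Collegamenti esterni", "Voci correlate", "Bibliografia"],
    "de": ["Einzelnachweise", "Weblinks", "Siehe auch", "Literatur"],
}


def foreign_headings(headings: List[str], target_lang: str) -> Dict[str, str]:
    """Map each heading that belongs to another language's list to that language."""
    found: Dict[str, str] = {}
    for heading in headings:
        for lang, known in COMMON_HEADINGS.items():
            if lang != target_lang and heading.strip() in known:
                found[heading] = lang
    return found
```

Running it on the headings of an early revision of an English article that still contains "Collegamenti esterni" would flag that heading as Italian, which hints at the source language.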

Types of translated articles

  1. Exact translation - the source and target contain the same information.
  2. Partial translation - not all the information was translated from the source, but all the information in the target article can be found in the source. The translator changed the division of the paragraphs. The order is the same, but the target is shorter than the original article.
    • For example: Original article for Marcel Marceau in English and translated article in Hebrew.
  3. More than one possible "father" as the source.
  4. Translation from more than one source.
  5. Inspirational translation

Unanswered problems and other directions:

  1. Machine translation.
    • A note about it should be in the talk page or main page.
  2. Pages that were deleted.
  3. Is there a way to look at draft pages?
  4. What can we do with the information that the editor added a langlink to another wiki at the start of the edit?
    • Maybe it can point to the possibility that it was translated from it.
    • Nowadays a new link is added to the item in Wikidata.

Notes

  1. An article that discusses a similar idea is Hecht & Gergle, 2010, which also uses interlinks (outlinks) as a way to look at the similarity of concepts between languages.