Talk:Wikipedia article translation metrics/How to detect translated articles

Simpler hints
I think it would be best to start with some simpler comparisons on markup and non-linguistic features. For instance, if an article is created with "ref name" markup, links, image links and/or ISBN links that are all or mostly contained in an older (interlanguage-linked) article, the older article is probably the source. You can calculate the overlap between the two pages for each of those factors, multiply the overlaps together, and then find a threshold.

Other, less certain hints are given by the structure of the article, i.e. the number and tree structure of the sections and the amount of text in each of them. --Nemo 12:17, 26 January 2015 (UTC)
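
The overlap-and-multiply idea above could be sketched roughly as follows. This is only an illustration under several assumptions: the regexes are simplified and will miss many wikitext variants, the image pattern only knows the English "File:"/"Image:" prefixes (other wikis localize them), and all function names are made up for the example. Ordinary wikilinks are left out because their titles differ between languages and would first need to be mapped through interlanguage links. A feature that is missing from the new page is skipped rather than counted as zero, which is a design choice of this sketch and not part of the proposal.

```python
import re
from math import prod

# Simplified, assumed patterns; real wikitext has many more variants.
REF_NAME = re.compile(r'<ref\s+name\s*=\s*"?([^">/]+?)"?\s*/?>', re.IGNORECASE)
IMAGE = re.compile(r'\[\[\s*(?:File|Image)\s*:\s*([^|\]]+)', re.IGNORECASE)
ISBN = re.compile(r'ISBN[\s=:]*([0-9][0-9Xx-]{8,16})')


def extract_features(wikitext):
    """Return one set of normalized items per language-independent feature."""
    return {
        "ref_names": {m.strip().lower() for m in REF_NAME.findall(wikitext)},
        "images": {m.strip().replace("_", " ") for m in IMAGE.findall(wikitext)},
        "isbns": {m.replace("-", "").upper() for m in ISBN.findall(wikitext)},
    }


def overlap(new_items, old_items):
    """Fraction of the new page's items that already occur in the older page."""
    if not new_items:
        return None  # feature absent from the new page: no evidence either way
    return len(new_items & old_items) / len(new_items)


def product_score(new_text, old_text):
    """Multiply the per-feature overlaps; 0.0 if no feature is present at all."""
    new, old = extract_features(new_text), extract_features(old_text)
    ratios = [overlap(new[key], old[key]) for key in new]
    ratios = [r for r in ratios if r is not None]
    return prod(ratios) if ratios else 0.0


# Toy example with invented wikitext.
older = '<ref name="a">x</ref> <ref name="b"/> [[File:Cat.jpg|thumb]] ISBN 978-3-16-148410-0'
newer = '<ref name="a"/> <ref name="c">y</ref> [[File:Cat.jpg|miniatur]] ISBN 9783161484100'
score = product_score(newer, older)
```

In this toy case one of the two ref names is shared while the image and the ISBN match completely, so the product is 0.5. A threshold would still have to be chosen empirically on pages whose translation status is known.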


 * I agree, I will probably start with non-linguistic features. I'm not sure about multiplying, because then it is an all-or-nothing function (if the editor didn't translate one of the factors, the final score will be zero). I might just add up the percentages of markup found in the target page instead of multiplying.
 * Another interesting direction is to see whether there is markup that is not in the original content. I think it could serve as a hint that the article was not translated from the other language. Livnetata (talk) 14:59, 28 January 2015 (UTC)
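
To make the difference between the two combinations concrete, here is a small sketch. The overlap values are invented for illustration, the function names are hypothetical, and dividing the sum by the number of features (so the score stays between 0 and 1) is an assumption added here. The last function illustrates the second point: the share of markup in the new page that does not occur in the older page.

```python
from math import prod


def product_score(ratios):
    """All-or-nothing: a single zero overlap forces the score to zero."""
    return prod(ratios.values())


def mean_score(ratios):
    """Additive variant: sum of the overlaps divided by their count."""
    return sum(ratios.values()) / len(ratios)


def extra_fraction(new_items, old_items):
    """Share of the new page's items that are absent from the older page."""
    if not new_items:
        return 0.0
    return len(new_items - old_items) / len(new_items)


# Invented overlaps: the editor carried over everything except the images.
ratios = {"ref_names": 0.9, "links": 0.8, "images": 0.0, "isbns": 1.0}

multiplied = product_score(ratios)  # 0.0, despite three strong matches
averaged = mean_score(ratios)       # about 0.675

# Two of the four ref names in the new page are unknown to the older page.
extra = extra_fraction({"a", "b", "x", "y"}, {"a", "b", "c"})  # 0.5
```

A high `extra_fraction` would count against translation, so it could be subtracted from the additive score or used as a separate signal; which of the two works better is an open question here.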