User:TJones (WMF)/Notes/DWIM as API

January 2021 — See TJones_(WMF)/Notes for other projects.

DWIM as API

Background

The DWIM (“do what I mean”) gadgets on Hebrew and Russian wikis try to address the problem of typing on the wrong keyboard (particularly the US English keyboard) while searching on those wikis.

On the Hebrew Wikipedia, if an autocomplete search in the upper search box in either Hebrew or Latin script gets fewer than 10 results, the query is remapped to the other keyboard and searched again (assuming it is different—123 maps back to 123 so no point in searching it twice); results are limited to those needed to fill out the list of 10 to show the user. On Russian Wikipedia, the behavior is analogous, though the gadget is off by default and can be enabled in the user preferences.

There is also an Arabic DWIM gadget, though its implementation appears to have problems, and I don’t think it can work as written.

The upgrade to Vue.js disrupted the DWIM scripts, though there may be a sub-optimal hack to keep them working. See T262566 for more.

As a result, we are looking into options for adding DWIM functionality to the CirrusSearch API.

See also T138958 & T155104 for more on generic backend “wrong keyboard” detection and Trey’s relevant notes for even more.

Minimum Viable Product

At a minimum, we would want to support the functionality of the Hebrew and Russian DWIM gadgets. We could hard code their relevant mappings and have a generic function to do the mapping—the DWIM gadgets hard code numbers, but the relevant genericization of this line:

return ic + 1 ? hes.charAt( ( ic + 29 ) % 58 ) : c;

is

return ic + 1 ? hes.charAt( ( ic + hes.length/2 ) % hes.length ) : c;

—though the /2 division should probably be done outside the inner loop. Right now the completion suggester API call looks like this (autocompleting on most of the word for “Hebrew”):

https://he.wikipedia.org/w/api.php?action=opensearch&search=עברי

The DWIMified API call would look like this (with the same search as above, but on the US keyboard):

https://he.wikipedia.org/w/api.php?action=opensearch&DWIM=he&search=gcrh

Acceptable values for the DWIM param could be he and ru, or perhaps he-us and ru-us, where us stands for the US English keyboard.
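
As a rough sketch of how these pieces could fit together, a DWIM value could select a hard-coded mapping string, and one generic function could swap characters between its two halves. The names `DWIM_MAPS` and `remapQuery` are hypothetical, and the Hebrew string below is an illustrative subset (the 24 letter keys only), not the 58-character string the gadget uses.

```javascript
// Hypothetical table of hard-coded mappings, keyed by DWIM value.
// Each string is two equal halves: position i in the first half is the
// same physical key as position i in the second half.
// Illustrative subset: US letter keys and the Hebrew letters on those keys.
const DWIM_MAPS = {
  "he-us": "ertyuiopasdfghjklzxcvbnm" + "קראטוןםפשדגכעיחלךזסבהנמצ",
};

// Swap every mapped character for its counterpart in the other half;
// characters that are not in the mapping pass through unchanged.
function remapQuery(query, mapping) {
  const half = mapping.length / 2; // computed once, outside the loop
  let out = "";
  for (const c of query) {
    const ic = mapping.indexOf(c);
    out += ic === -1 ? c : mapping.charAt((ic + half) % mapping.length);
  }
  return out;
}

console.log(remapQuery("gcrh", DWIM_MAPS["he-us"])); // עברי
console.log(remapQuery("123", DWIM_MAPS["he-us"])); // 123
```

Because the function swaps halves, the same string works in both directions, so a Hebrew query typed with the Hebrew layout on a wiki expecting Latin input would be remapped the other way.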

On the back end, after generating the suggestions for gcrh—which currently gets 3 suggestions—we could remap the query to עברי and ask for 7 more suggestions (or whatever is needed to not go over the limit param, which defaults to 10).
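
The top-up step could look roughly like the sketch below. `suggest(query, limit)` and `remap(query)` are hypothetical stand-ins for the completion suggester and the keyboard mapping, and the toy index only exists to make the example runnable; none of this is existing CirrusSearch code.

```javascript
// Hypothetical back-end flow. A real implementation would probably be
// asynchronous; this sketch is synchronous to keep it short.
function suggestWithDwim(query, limit, suggest, remap) {
  const results = suggest(query, limit);
  const remapped = remap(query);
  const remaining = limit - results.length;
  // Skip the second search if the list is already full or the remapped
  // query is identical (e.g. 123 maps back to 123).
  if (remaining <= 0 || remapped === query) {
    return results;
  }
  // Ask only for as many suggestions as are needed to fill the list.
  return results.concat(suggest(remapped, remaining));
}

// Toy stand-ins, for demonstration only (not real search results).
const toyIndex = new Map([
  ["gcrh", ["latin 1", "latin 2", "latin 3"]],
  ["עברי", Array.from({ length: 9 }, (_, i) => "hebrew " + (i + 1))],
]);
const toySuggest = (q, n) => (toyIndex.get(q) || []).slice(0, n);
const toyRemap = (q) => (q === "gcrh" ? "עברי" : q);

const titles = suggestWithDwim("gcrh", 10, toySuggest, toyRemap);
console.log(titles.length); // 10: 3 for gcrh plus 7 for the remapped query
```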

Controlled by Preferences

As noted above, the Russian DWIM feature is controlled by a user preferences setting and is not on by default. Interacting with user preferences would not be required for the MVP, but may be required for adoption by the Russian Wikipedia community.

The default configuration could be that for Hebrew Wikipedia, DWIM=he-us, and for Russian Wikipedia, DWIM= (blank) or DWIM=ru-us. A further option would be to allow for other mappings in preferences, such as DWIM=ru-fr (Russian PC to French AZERTY keyboard).
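
One way to picture the interaction between wiki defaults and preferences is sketched below. The table `WIKI_DWIM_DEFAULTS`, its keys, and the function `effectiveDwim` are all hypothetical; the only values taken from these notes are he-us, ru-us, ru-fr, and blank.

```javascript
// Hypothetical per-wiki defaults; an empty string means DWIM is off.
const WIKI_DWIM_DEFAULTS = { hewiki: "he-us", ruwiki: "" };

// A user preference, when set (including set to "" to opt out),
// overrides the wiki default.
function effectiveDwim(wikiId, userPref) {
  if (userPref !== undefined && userPref !== null) {
    return userPref;
  }
  return WIKI_DWIM_DEFAULTS[wikiId] || "";
}

console.log(effectiveDwim("hewiki")); // he-us
console.log(effectiveDwim("ruwiki", "ru-fr")); // ru-fr
```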

Custom Mappings

An obvious extension would be to allow the user to specify their own custom mapping, keeping in mind that such mappings could be up to 200 characters long (all keys on the keyboard (~50), plus shifted keys (x2), for each keyboard (x2)). The configurations could be even larger if we needed to map other command key combinations (like the control key). It would make sense to set a reasonable limit (200–300 characters) to prevent malicious or naive users from sending, say, a 10,000 character keyboard map.

For example, on one of the Mac Korean keyboard layouts (“2-Set Korean”), you would type 뵤도인 (the Buddhist temple Byōdō-in) as ㅂㅛㄷㅗㅇㅣㄴ; on the US keyboard, the corresponding keys would be qyehdls.

My (poorly tested) mapping for 2-Set Korean to US keyboard in the API would look something like this:

https://he.wikipedia.org/w/api.php?action=opensearch&DWIM=custom&DWIMMap=`poiuytrewqasdfghjklmnbvcxzPOYTREWQ₩ㅔㅐㅑㅕㅛㅅㄱㄷㅈㅂㅁㄴㅇㄹㅎㅗㅓㅏㅣㅡㅜㅠㅍㅊㅌㅋㅖㅒㅛㅆㄲㄸㅉㅃ&search=qyehdls

Actually, the mapping should be URL encoded in case it includes ?, &, spaces or other URL elements and non-ASCII characters, so the DWIMMap parameter should look like this:

DWIMMap=%60poiuytrewqasdfghjklmnbvcxzPOYTREWQ%E2%82%A9%E3%85%94%E3%85%90%E3%85%91%E3%85%95%E3%85%9B%E3%85%85%E3%84%B1%E3%84%B7%E3%85%88%E3%85%82%E3%85%81%E3%84%B4%E3%85%87%E3%84%B9%E3%85%8E%E3%85%97%E3%85%93%E3%85%8F%E3%85%A3%E3%85%A1%E3%85%9C%E3%85%A0%E3%85%8D%E3%85%8A%E3%85%8C%E3%85%8B%E3%85%96%E3%85%92%E3%85%9B%E3%85%86%E3%84%B2%E3%84%B8%E3%85%89%E3%85%83
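
On the client side, the encoding could be done with the standard `encodeURIComponent` function, as in this sketch. `buildDwimQuery` is a hypothetical helper, and `DWIM` and `DWIMMap` are the parameter names proposed in these notes, not part of any existing API.

```javascript
// Build the query string for a custom mapping. encodeURIComponent
// percent-encodes ?, &, spaces, backticks, and non-ASCII characters
// (as UTF-8 bytes).
function buildDwimQuery(search, dwimMap) {
  return (
    "action=opensearch&DWIM=custom" +
    "&DWIMMap=" + encodeURIComponent(dwimMap) +
    "&search=" + encodeURIComponent(search)
  );
}

// A three-pair excerpt of the Korean mapping: ` p o  <->  ₩ ㅔ ㅐ
console.log(buildDwimQuery("qyehdls", "`po₩ㅔㅐ"));
// action=opensearch&DWIM=custom&DWIMMap=%60po%E2%82%A9%E3%85%94%E3%85%90&search=qyehdls
```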

Allowing custom mappings would allow maximum flexibility, but would also make it possible to generate weird, incorrect, or ridiculous results. The mapping DWIMMap=abcdefghijklmnopqrstuvwxyz would implement ROT13, which is mildly interesting, but probably not very useful.
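
To make the ROT13 observation and the length cap concrete, here is a minimal sketch of parsing a custom mapping string. The names `parseCustomMap`, `applyCustomMap`, and `MAX_MAP_CHARS` are hypothetical, and the cap of 300 is just the top of the range suggested above. The sketch assumes the same two-halves format as the hard-coded mappings, so it also rejects strings that cannot be split evenly.

```javascript
// Assumed cap, taken from the 200–300 character range suggested above.
const MAX_MAP_CHARS = 300;

// Turn a mapping string into a lookup table, or return null if it is
// empty, too long, or cannot be split into two equal halves.
function parseCustomMap(mapString) {
  const chars = Array.from(mapString); // split by code point
  if (chars.length === 0 || chars.length % 2 !== 0 || chars.length > MAX_MAP_CHARS) {
    return null;
  }
  const half = chars.length / 2;
  const table = new Map();
  for (let i = 0; i < half; i++) {
    // If a character is listed twice, the first listing wins.
    if (!table.has(chars[i])) table.set(chars[i], chars[i + half]);
    if (!table.has(chars[i + half])) table.set(chars[i + half], chars[i]);
  }
  return table;
}

function applyCustomMap(query, table) {
  return Array.from(query, (c) => (table.has(c) ? table.get(c) : c)).join("");
}

const rot13 = parseCustomMap("abcdefghijklmnopqrstuvwxyz");
console.log(applyCustomMap("hello", rot13)); // uryyb
console.log(parseCustomMap("abc")); // null (odd length)
console.log(parseCustomMap("ab".repeat(5000))); // null (10,000 characters)
```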

Additional concerns:

  • We should check that mapping strings are an even number of characters long, and ignore them (or send a warning) if they are not.
  • It is possible that there are keyboards where a single character represents multiple keystrokes (for example, á on my keyboard is option-e + a). Supporting multi-character mappings would not be possible with the most straightforward DWIM implementation—though it isn’t needed for Hebrew or Russian.
  • We might have to take extra care to handle characters outside the Basic Multilingual Plane (which take two 16-bit code units in UTF-16) in custom mappings. I don’t think any are currently in common use, but Old Hungarian keyboard layouts do exist!
  • Input systems that use dead keys or that combine input characters on the fly can cause problems. We’ve already had problems with dead keys and autocomplete (see T177251). With my Korean map above, qyehdl (equivalent to typing ㅂㅛㄷㅗㅇㅣ, which is composed as 뵤도이) gets the desired suggestion of 뵤도인, but adding the final s doesn’t work. And, while ㅇㅐㅎ will generate a remapped query of dog, the composed version 앻 will not, because it is a single character. This could be fixed by decomposing input, but that’s another layer of complication.
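
For the Korean case in the last point, decomposition could be sketched as below. It uses the standard arithmetic for precomposed Hangul syllables (U+AC00 through U+D7A3) and emits the compatibility jamo used in the mapping above. The function name `decomposeHangul` and the `SPLIT` table are hypothetical; `SPLIT` assumes that compound vowels and final clusters are typed as two keys on the 2-Set layout. Dead keys on other layouts would need their own handling.

```javascript
// Compatibility jamo in the order used by the Unicode Hangul syllable
// formula: 19 initials, 21 vowels, 28 finals (the first final is "none").
const INITIALS = Array.from("ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ");
const VOWELS = Array.from("ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ");
const FINALS = [""].concat(Array.from("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ"));

// Jamo assumed to be typed as two keystrokes.
const SPLIT = {
  "ㅘ": "ㅗㅏ", "ㅙ": "ㅗㅐ", "ㅚ": "ㅗㅣ", "ㅝ": "ㅜㅓ", "ㅞ": "ㅜㅔ",
  "ㅟ": "ㅜㅣ", "ㅢ": "ㅡㅣ", "ㄳ": "ㄱㅅ", "ㄵ": "ㄴㅈ", "ㄶ": "ㄴㅎ",
  "ㄺ": "ㄹㄱ", "ㄻ": "ㄹㅁ", "ㄼ": "ㄹㅂ", "ㄽ": "ㄹㅅ", "ㄾ": "ㄹㅌ",
  "ㄿ": "ㄹㅍ", "ㅀ": "ㄹㅎ", "ㅄ": "ㅂㅅ",
};

// Replace each precomposed syllable with the jamo that were typed to
// produce it; everything else passes through unchanged.
function decomposeHangul(text) {
  let out = "";
  for (const ch of text) {
    const idx = ch.codePointAt(0) - 0xac00;
    if (idx < 0 || idx > 11171) {
      out += ch;
      continue;
    }
    const jamo =
      INITIALS[Math.floor(idx / 588)] +
      VOWELS[Math.floor((idx % 588) / 28)] +
      FINALS[idx % 28];
    for (const j of jamo) {
      out += SPLIT[j] || j;
    }
  }
  return out;
}

console.log(decomposeHangul("앻")); // ㅇㅐㅎ
console.log(decomposeHangul("뵤도인")); // ㅂㅛㄷㅗㅇㅣㄴ
```

The decomposed output could then be passed through the keyboard mapping as usual, at the cost of an extra pass over every query.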

Arbitrary Keyboard Pairings

We could also appropriately describe a number of keyboards, and then allow users to arbitrarily pair them, so that DWIM=ru-fr could indicate that a mapping between the standard Russian PC keyboard and the standard French AZERTY should be built on the fly on the backend (and possibly cached?) and applied to the query.

Implementation thoughts: given all of the keys’ outputs for two keyboards it’s possible to automatically generate the minimal mapping between them (or something close). You can ignore characters that don’t change when mapped (e.g., numbers 0-9 or punctuation, much of the time) and remove them from the mapping. You could also build a more efficient data structure than the mapping string—like a hash—to improve lookup speed.
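
A minimal sketch of that idea follows, assuming each keyboard is described as a table from physical key to output character. The `KEYBOARDS` table, its key names, and the functions `buildMapping` and `applyMapping` are hypothetical, and the layouts are three-key excerpts for illustration; real descriptions would cover every key and shift state.

```javascript
// Hypothetical keyboard descriptions: physical key -> unshifted output.
const KEYBOARDS = {
  us: { KeyQ: "q", KeyW: "w", KeyE: "e", Space: " " },
  fr: { KeyQ: "a", KeyW: "z", KeyE: "e", Space: " " },
  ru: { KeyQ: "й", KeyW: "ц", KeyE: "у", Space: " " },
};

// Build a one-directional hash from what was typed (on `from`) to what
// was probably meant (on `to`), leaving out characters that do not change.
function buildMapping(from, to) {
  const table = new Map();
  for (const [key, typed] of Object.entries(from)) {
    const meant = to[key];
    if (meant !== undefined && meant !== typed) {
      table.set(typed, meant);
    }
  }
  return table;
}

function applyMapping(query, table) {
  return Array.from(query, (c) => (table.has(c) ? table.get(c) : c)).join("");
}

// French AZERTY keystrokes reinterpreted as the Russian layout.
const frToRu = buildMapping(KEYBOARDS.fr, KEYBOARDS.ru);
console.log(applyMapping("az e", frToRu)); // йц у

// Between us and fr, the E key and the space bar are unchanged and dropped.
console.log(buildMapping(KEYBOARDS.us, KEYBOARDS.fr).size); // 2
```

When both keyboards produce Latin characters (as with us and fr), the mapping is only meaningful in one direction at a time, which is why this sketch builds a directional hash rather than a swap-style string.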

Even with a maximal mapping that doesn’t remove unchanged characters, this approach would allow DWIM via API to work with many keyboards, and to add a new keyboard, we would only have to define its output, and not have to define all mappings between other keyboards.

Alternatively, we could have a pre-processing step that would generate mappings for all (n² – n)/2 pairs if generating them on the fly was too CPU intensive.
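
As a small illustration of the pair count, the sketch below enumerates every unordered pair of keyboard names; the function `allPairs` and the list of names are hypothetical. A pre-processing step would build and store one mapping per pair.

```javascript
// List every unordered pair of keyboard names.
// n names give (n * n - n) / 2 pairs.
function allPairs(names) {
  const pairs = [];
  for (let i = 0; i < names.length; i++) {
    for (let j = i + 1; j < names.length; j++) {
      pairs.push(names[i] + "-" + names[j]);
    }
  }
  return pairs;
}

const pairNames = allPairs(["he", "ru", "us", "fr"]);
console.log(pairNames.length); // 6, i.e. (16 - 4) / 2
console.log(pairNames.join(" ")); // he-ru he-us he-fr ru-us ru-fr us-fr
```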

In any event, this could allow arbitrary mappings, like Tamil–Greek or Arabic–Russian, without us or the users having to explicitly define the mappings—assuming we have all the necessary characters covered and we can handle dead keys and combining characters appropriately.