User:Rukhabot/Tbot and language variants

From Wiktionary, the free dictionary
Jump to navigation Jump to search

Background[edit]

Background:

  • One task that Rukhabot (talkcontribs) performs is converting between {{t}} and {{t+}} (and other such pairs). At a very high level, the logic for that is "If {{t+}} would link to a foreign-language Wiktionary entry that actually exists, use it; otherwise, use {{t}}." There are a lot of subtle complexities involved in this.
  • A few languages' Wiktionaries have a feature whereby they support multiple "variants" of their language, with server-side logic to convert between those variants. For example, the Chinese Wiktionary (zh.wikt) supports converting between the Traditional Chinese and Simplified Chinese writing systems; users can select a preference for one of the other, and whenever they visit a page, it will automatically appear to them in their preferred variant. One part of this feature is that if you try to visit a non-existent page, but there exists a page with a corresponding title in a different variant, then the server will redirect you to that one. (For example, as of this writing, zh:响水 doesn't exist but zh:響水 does, so if you try to visit the former, you get redirected to the latter. Note that this is an HTTP 301 redirect — it actually sends your browser to the other page — and not "redirect" in the usual Wiktionary sense.)
  • So, to convert between {{t}} and {{t+}}, Rukhabot has to know about this variant conversion feature; for example, it has to know to write {{t+|cmn|响水}} rather than {{t|cmn|响水}}, because the link would ultimately point to a working page, namely zh:響水.

This page compares some of the approaches I've used for this feature.

Approach #1: Hardcoded support[edit]

The first Wiktionary to support this feature, at least to my knowledge, was the Serbian Wiktionary (sr.wikt), which has pretty simple deterministic rules for converting between Latin and Cyrillic scripts; so at first I simply added those rules as hardcoded logic in the bot. (The bot uses database-dumps to find which page-titles exist, so this just means checking a few extra titles to support both possibilities.)

As of this writing (April 2024), the bot uses this approach for both the Serbian and the Kurdish Wiktionaries (sr.wikt and ku.wikt); however, I plan to migrate both of them to approach #4 below.

Approach #2: action=query with converttitles=true[edit]

The first Wiktionary that I couldn't support in the above way was the Chinese Wiktionary (zh.wikt), because due to the nature of the writing system, its conversions involve thousands upon thousands of rules, and more rules get added every year.

To address this, I added logic to the bot to query the zh.wikt API to find out which pages exist, using URLs like https://zh.wiktionary.org/w/api.php?action=query&converttitles=true&titles=响水 (except that I batched them into 50 titles at a time). The API responds with information about which pages exist, and what conversions it performed in order to find them.

This is slow (because the API has to not just do the conversion, but also query the database for every single title), but it works.

As of this writing (April 2024), the bot no longer uses this approach for the Chinese Wiktionary, but it still uses this approach for both the Kazakh and the Inuktitut Wiktionaries (kk.wikt and iu.wikt); however, I plan to migrate to approach #4 below for Inuktitut, and I plan to remove this functionality for Kazakh (because that Wiktionary no longer has this multiple-variants feature enabled).

Approach #3: action=parse with <langconvert>[edit]

In March 2024, after noticing that we have support for <langconvert> enabled on en.wikt, I migrated to a faster approach for Chinese: I used the en.wikt API to parse wikitext that does the conversions (so I could get all possible versions of a given title) — e.g. https://en.wiktionary.org/w/api.php?action=parse&contentmodel=wikitext&prop=text&disablelimitreport=true&text=%3Clangconvert+from=%27zh-Hans%27+to=%27zh-Hant%27%3E响水%3C/langconvert%3E (except that I batched them into 1000 titles at a time, and used POST) — and then I went back to using the database dump to see which titles actually exist.

This was a huge improvement, but then I noticed a comment at mw:LangConv that it has only partial support for Chinese. After some investigation, I found that the conversions performed by <langconvert> are not always the same as those performed by the actual feature on the Chinese Wiktionary.

For example, as of this writing (April 2024), zh:矽晶片 redirects to zh:硅晶片, but <langconvert from='zh-Hans' to='zh-Hant'>矽晶片</langconvert> (Simplified→Traditional) and <langconvert from='zh-Hant' to='zh-Hans'>矽晶片</langconvert> (Traditional→Simplified) both produce 矽晶片, so this approach means that the bot erroneously uses {{t|cmn|矽晶片}} instead of {{t+|cmn|矽晶片}}.

There are currently only eight such mismatches, but still, it demonstrates that this approach is flawed. And fortunately, I found in my investigation that there's an approach that doesn't produce mismatches — approach #4 below — so I plan to migrate to that approach very shortly.

Approach #4: action=parse with variant=...[edit]

So, that brings us to approach #4, which I plan to migrate to for all of these languages (as well as any languages that I don't currently support this feature for). It turns out that, on a Wiktionary that has this feature, action=parse can be combined with e.g. variant=zh-Hans to do the requisite conversions. For example: https://zh.wiktionary.org/w/api.php?action=parse&contentmodel=wikitext&prop=text&disablelimitreport=true&text=响水%3C&variant=zh-Hant.

This is essentially equivalent to approach #3 — and has all of its advantages over approach #1 and approach #2 — except that it gives the same (correct) result as approach #2 in all cases.