User talk:DTLHS/cleanup/unrecognized scripts

From Wiktionary, the free dictionary

@Atitarev, Mahagaja, Metaknowledge I think you may be interested in this project, and I also wanted to ask the group a few questions:

  1. When do CJKV languages not use the default script? (I imagine Han characters in Korean should be Kore and not Hani)
  2. Is it safe to assume that cross-script borrowings like "DVD" are rendered just fine in Cyrl, Kore, etc?
  3. Is the first script listed on a language's category always the default script? E.g. for Category:Tajik language, are translations assumed to be Cyrl rather than Arab or Latn? Ultimateria (talk) 16:35, 16 April 2019 (UTC)
@Ultimateria: I don't know the answers to your questions, but I have one of my own: do we ever need the |sc= parameter anymore? Aren't scripts detected automatically now, so that the parameter is always superfluous? —Mahāgaja · talk 16:38, 16 April 2019 (UTC)
@Ultimateria:
  1. I think it's OK to use the default script for CJK(V) entries. Korean fonts should handle hanja well and they should use similar sizes. Not so sure about Vietnamese. I think the script is defined in CSS somewhere in modules.
  2. Normally borrowings are fine, so it's OK to use DVD, not DVD. I like to use fullwidth DVD (ＤＶＤ) for Japanese.
  3. I think there should be script detection in place, and I think we should try to reduce, if not eliminate, the use of 'sc=' parameters. It's annoying to see Lua error in Module:parameters at line 95: Parameter 1 should be a valid language or etymology language code; the value "ku" is not valid. See WT:LOL and WT:LOL/E. with oversized fonts when it should be Lua error in Module:parameters at line 95: Parameter 1 should be a valid language or etymology language code; the value "ku" is not valid. See WT:LOL and WT:LOL/E. automatically. --Anatoli T. (обсудить/вклад) 22:59, 16 April 2019 (UTC)
@Mahagaja: The one exception that I know of is multi-script languages that use more than one script in translations. E.g. Serbo-Croatian's default script is Latn, so the script needs to be specified in Cyrillic translations. Ultimateria (talk) 16:41, 16 April 2019 (UTC)
@-sche might know better than us. A quick scan of the list makes it look to me like we could just remove the sc='s and be fine for almost all of them, so I reckon this is a better bot job. —Μετάknowledgediscuss/deeds 16:46, 16 April 2019 (UTC)
@Ultimateria: I thought that was exactly when it was no longer necessary to specify, because the software can automatically detect that Beograd is in Latin and Београд is in Cyrillic. @Rua, isn't it right that |sc= is superfluous everywhere, including languages with multiple scripts? —Mahāgaja · talk 16:51, 16 April 2019 (UTC)
I'd say almost everywhere, but I can't really vouch for that last 0.001% of special cases. Script detection doesn't work if the script is not one of the language's regular scripts, which can happen with things like mathematical symbols. It is possible to change the script detection code so that it tracks cases where none of the scripts matched. Once the tracking transclusions have filled up, you'd have a better idea of those rare edge cases. —Rua (mew) 16:56, 16 April 2019 (UTC)
@Rua: So, for example, if we had a German entry written in Hebrew script (which is extremely rare, but I did once see a tapestry at the New Synagogue (Berlin) with German—not Yiddish—written in the Hebrew script), we'd have to write something like {{m|de|נעהמען|sc=Hebr}}, since Hebrew isn't specified as a script of German, right? —Mahāgaja · talk 18:59, 16 April 2019 (UTC)
As far as I know, yes. —Rua (mew) 20:03, 16 April 2019 (UTC)
Well, you'd have to find three uses in Hebrew script for that. Except for the most edgy of edge cases, we should assign scripts to languages that use(d) them only occasionally, and indeed we have Hebrew script assigned for Old French and Arabic script for Afrikaans. —Μετάknowledgediscuss/deeds 20:49, 16 April 2019 (UTC)
@Metaknowledge: We might theoretically want to mention {{m|de||נעהמען|sc=Hebr}} somewhere even without having an entry for it. —Mahāgaja · talk 16:01, 17 April 2019 (UTC)
I honestly can't think of why we would, so that falls firmly in "the most edgy of edge cases". —Μετάknowledgediscuss/deeds 17:04, 17 April 2019 (UTC)
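To make the behaviour discussed in this thread concrete, here is a minimal Python sketch of per-language script detection with an explicit sc= override. It is not the actual Wiktionary module code: the tables LANGUAGE_SCRIPTS and SCRIPT_RANGES, the function names, and the choice to return None when nothing matches are all assumptions made for illustration, and the code point ranges are deliberately rough.

```python
# Hypothetical data: language code -> scripts assigned to it, default first.
LANGUAGE_SCRIPTS = {
    "sh": ["Latn", "Cyrl"],  # Serbo-Croatian: two regular scripts
    "de": ["Latn"],          # German: Hebrew is not an assigned script
}

# Rough code point ranges, for illustration only.
SCRIPT_RANGES = {
    "Latn": [(0x0041, 0x005A), (0x0061, 0x007A), (0x00C0, 0x024F)],
    "Cyrl": [(0x0400, 0x052F)],
    "Hebr": [(0x0590, 0x05FF)],
}


def count_in_script(text, script):
    """Count the characters of text that fall inside the script's ranges."""
    ranges = SCRIPT_RANGES[script]
    return sum(1 for ch in text if any(lo <= ord(ch) <= hi for lo, hi in ranges))


def detect_script(lang, text, sc=None):
    """Return a script code for text, or None if no assigned script matches.

    An explicit sc always wins, mirroring a manual |sc= parameter.
    Only scripts assigned to the language are tried, so text in an
    unassigned script is not detected.
    """
    if sc is not None:
        return sc
    best, best_count = None, 0
    for script in LANGUAGE_SCRIPTS[lang]:
        n = count_in_script(text, script)
        if n > best_count:
            best, best_count = script, n
    return best  # None would be the case worth tracking
```

In this sketch, detect_script("sh", "Београд") picks Cyrl without any override, while detect_script("de", "נעהמען") returns None unless sc="Hebr" is passed, which is the edge case raised above. A None result is where tracking of unmatched cases could hook in.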
Thanks for the ping, but I doubt I have any more insight than the rest of you. Re "is it safe to assume that cross-script borrowings like 'DVD' are rendered just fine in Cyrl, Kore, etc?": AFAIK it's safe to assume Latin script will display OK in Cyrl, Kore, etc, but if there are cross-script borrowings in the other direction, I don't know if those would be negatively affected. As Rua suggests, temporarily instituting tracking (or scanning a database dump for pages matching the same criteria, if anyone would prefer to do that) would help with figuring out edge cases.
One case where I think we want sc= is in {{t-simple}}, to help with memory usage. Right?
Looking at the list I see the Votic entry using Cyrillic; to quote WP, "in the 1920s, the Votic linguist Dmitri Tsvetkov wrote a Votic grammar using a modified Cyrillic alphabet"; if there's any literature in Cyrillic, then I'd think we should add Cyrillic as a script like we allow Arabic-script Afrikaans etc. I seem to think Rua and Tropylium have knowledge of Votic, and I see we even have an L3 speaker(!), Joonas07. Some other entries in the list seem like simple errors, like where the script code was left out of the Church Slavonic translation of Germany, or where "ku-Arab" was specified for a language which I guess is supposed to use Arab(?).
Fixing (or just separating) all the myriad Hani/Hans/Hant instances would make it easier to look over the rest. - -sche (discuss) 17:31, 16 April 2019 (UTC)
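The bot-style cleanup suggested earlier in this thread could look roughly like the Python sketch below. It drops |sc= from a template only when the given script is one that detection could find for that language, and it leaves {{t-simple}} alone, as proposed above. Everything here is an assumption for illustration: the DETECTABLE table, the SKIP_TEMPLATES set, the function name, and the premise that the language code is always the first positional parameter. Nested templates are not handled beyond the innermost one.

```python
import re

# Hypothetical data: scripts that automatic detection can find per language.
DETECTABLE = {
    "sh": {"Latn", "Cyrl"},
    "de": {"Latn"},
}

# Templates where sc= is kept on purpose (assumed, per the memory-usage remark).
SKIP_TEMPLATES = {"t-simple"}

# Matches an innermost {{...}} template call.
TEMPLATE_RE = re.compile(r"\{\{([^{}]*)\}\}")


def strip_redundant_sc(wikitext):
    """Remove |sc=X from templates where X is detectable for the language."""

    def fix(match):
        parts = match.group(1).split("|")
        if len(parts) < 2 or parts[0].strip() in SKIP_TEMPLATES:
            return match.group(0)
        lang = parts[1].strip()  # assumes the first parameter is the language code
        detectable = DETECTABLE.get(lang, set())
        kept = [
            p for p in parts
            if not (p.startswith("sc=") and p[3:].strip() in detectable)
        ]
        return "{{" + "|".join(kept) + "}}"

    return TEMPLATE_RE.sub(fix, wikitext)
```

With these tables, {{m|sh|Београд|sc=Cyrl}} would lose its sc= parameter, while {{m|de|נעהמען|sc=Hebr}} would keep it, because Hebr is not listed for German. A real run would first need the tracking or dump scan mentioned above to confirm which removals are safe.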