Module talk:ar-translit/testcases

Character replacement

@CodeCat Links should replace ٱ (hamzat al-waṣl) with the simple alif ا. Unlike hamza (which is arguable), the waṣl symbol is a diacritic. Which module does this replacement? --Anatoli (обсудить/вклад) 07:52, 27 April 2014 (UTC)

It would be the entry_name property for Arabic in Module:languages/data2. --CodeCat 11:43, 27 April 2014 (UTC)
@CodeCat Thank you. The module uses numeric values for characters, though. What are they, and how can I find the values for ٱ and ا (or for any character, if there is a database)? --Anatoli (обсудить/вклад) 12:30, 27 April 2014 (UTC)
Yes, that part is a bit difficult. We use numeric values mainly because the actual characters disrupt the editing window, making things appear in strange places and the cursor doesn't behave the way it should. I'm not sure if this will help but this is how you would convert the characters:
  • Look up the Unicode character number for the character, which would be in hexadecimal (using 0-9 and A-F).
  • Split it into groups of two each.
  • Use a calculator (most calculator programs have such a function somewhere) or one of the many websites to convert each group from hexadecimal to decimal. This will result in groups of 1 to 3 digits from 0-9.
  • Put \ before each group of digits and put them together with no spaces.
I hope that helps? --CodeCat 12:38, 27 April 2014 (UTC)
Also, currently, it replaces all diacritics with nothing. What should be done here to replace one character with another?
m["ar"] = {
	names = {"Arabic", "Modern Standard Arabic", "Standard Arabic", "Literary Arabic", "Classical Arabic"},
	type = "regular",
	scripts = {"Arab"},
	family = "sem-arb",
	entry_name = {
		from = {"\217\139", "\217\140", "\217\141", "\217\142", "\217\143", "\217\144", "\217\145", "\217\146", "\217\176", "\217\128"},
		to   = {}} }
--Anatoli (обсудить/вклад) 12:34, 27 April 2014 (UTC)
The values appear in pairs, with each value in "from" matched by a value in "to". If there is no matching value in "to", then the replacement will be nothing. So you need to put this character first in the "from" list, and then put a replacement in the "to" list. --CodeCat 12:38, 27 April 2014 (UTC)
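As an illustration of the pairing described above, here is a minimal Lua sketch. The helper names `replace_plain` and `apply_entry_name` are hypothetical and are not taken from Module:languages, which may apply the lists differently. The sketch walks the "from" list in order and substitutes the entry at the same index in "to", or nothing when "to" has no entry at that index.

```lua
-- Replace every literal occurrence of `from` in `text` with `to`.
-- Uses a plain (non-pattern) find so no byte is treated as a pattern character.
local function replace_plain(text, from, to)
	assert(from ~= "", "empty search string")
	local out, pos = {}, 1
	while true do
		local s, e = string.find(text, from, pos, true)
		if not s then break end
		out[#out + 1] = string.sub(text, pos, s - 1)
		out[#out + 1] = to
		pos = e + 1
	end
	out[#out + 1] = string.sub(text, pos)
	return table.concat(out)
end

-- Hypothetical helper: apply from/to pairs in order.
-- A missing "to" entry means the character is deleted.
local function apply_entry_name(text, rules)
	for i, from in ipairs(rules.from) do
		text = replace_plain(text, from, rules.to[i] or "")
	end
	return text
end

local rules = {
	from = {"\217\177", "\217\142", "\217\146"}, -- alif waṣla, fatḥa, sukūn
	to   = {"\216\167"},                         -- plain alif; the others have no pair
}

-- alif waṣla + fatḥa + "b" becomes plain alif + "b"
print(apply_entry_name("\217\177\217\142b", rules) == "\216\167b")
```

Because only the first "from" entry has a partner in "to", the character that needs a real replacement has to come first, which is the point made above.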
I'll check the steps with you. For ٱ and ا after conversion into Unicode I got \u0671 and \u0627. A site converted these numbers (0671 and 0627) to 1649 and 1575. They have four digits, not like the other numbers, though.
m["ar"] = {
	names = {"Arabic", "Modern Standard Arabic", "Standard Arabic", "Literary Arabic", "Classical Arabic"},
	type = "regular",
	scripts = {"Arab"},
	family = "sem-arb",
	entry_name = {
		from = {"\1649", "\217\139", "\217\140", "\217\141", "\217\142", "\217\143", "\217\144", "\217\145", "\217\146", "\217\176", "\217\128"},
		to   = {"\1575"}} }
Is the above correct? Sorry if my question sounds stupid. --Anatoli (обсудить/вклад) 12:57, 27 April 2014 (UTC)
Actually I forgot a rather big step. First you need to convert the Unicode character number into UTF-8 encoding, and then convert to decimal. For 0671, you get D9 B1 in UTF-8. Now you convert D9 to decimal (which is 217) and B1 (which is 177). So the final result is \217\177. --CodeCat 13:09, 27 April 2014 (UTC)
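The corrected procedure (code point, then UTF-8 bytes, then decimal) can be sketched in Lua as follows. The function names `utf8_bytes` and `lua_escape` are made up for this illustration; only standard Lua is assumed, and the arithmetic avoids bit operators so that it also runs on Lua 5.1.

```lua
-- Encode a Unicode code point as a list of UTF-8 byte values.
local function utf8_bytes(cp)
	local floor = math.floor
	if cp < 0x80 then
		return {cp}
	elseif cp < 0x800 then
		return {0xC0 + floor(cp / 0x40), 0x80 + cp % 0x40}
	elseif cp < 0x10000 then
		return {0xE0 + floor(cp / 0x1000),
		        0x80 + floor(cp / 0x40) % 0x40,
		        0x80 + cp % 0x40}
	else
		return {0xF0 + floor(cp / 0x40000),
		        0x80 + floor(cp / 0x1000) % 0x40,
		        0x80 + floor(cp / 0x40) % 0x40,
		        0x80 + cp % 0x40}
	end
end

-- Build the decimal escape text (backslash + decimal byte value) for a code point.
local function lua_escape(cp)
	local parts = {}
	for _, b in ipairs(utf8_bytes(cp)) do
		parts[#parts + 1] = string.format("\\%d", b)
	end
	return table.concat(parts)
end

print(lua_escape(0x0671)) -- alif waṣla
print(lua_escape(0x0627)) -- plain alif
```

For U+0671 this yields \217\177, matching the value worked out above, and for U+0627 it yields \216\167.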
Thank you. I couldn't find a function that converts to UTF-8, but I used this to get the required value. Anyway, it's working now. اَلْلُغَةُ ٱلْعَرَبِيَّةُ (al-luḡatu l-ʕarabiyyatu) from the test page (which uses a rather rare diacritic to show that alif is elided or dropped in pronunciation) links to اللغة العربية. --Anatoli (обсудить/вклад) 13:22, 27 April 2014 (UTC)

Some test cases with plain alif

@Erutuon, Wikitiki89, Benwing2, CodeCat, Kolmiel, Wyang, Backinstadiums. It might be difficult to determine the readings for unmarked alifs. The first one fails, the 2nd one (more accurate) works. But it seems to work in other cases if ا is followed by a lam ل (definite article ال).

		{ 'رَأَيْتُ ابْنَهُ', "raʾaytu bnahu" },
		{ 'رَأَيْتُ ٱبْنَهُ', "raʾaytu bnahu" },

Not sure if it's feasible but I would use hamzat al-waṣl in such cases. We are asking too much from the module. Even with definite article cases, we can use the more strict spelling مَعَ ٱلسَّلَامَة (maʕa s-salāma) to make sure the alif is silent. Perhaps (nil) should be the right result but say so if you disagree.

BTW, I've removed the wrong spelling/reading رَأَيْتُ اِبْنَهُ (raʾaytu ibnahu) from test cases. --Anatoli T. (обсудить/вклад) 07:44, 16 August 2017 (UTC)

I think the rationale for leaving that testcase in may be to detect hamzat al-waṣl when the second component satisfies certain rules (short etc.), but I agree it seems too difficult. P.S. For me, ‹ʾ› is (interestingly) rendered ‹ʿ› in the examples above, lol. Wyang (talk) 08:06, 16 August 2017 (UTC)
@Wyang Thanks. Logically, it makes sense, when an unmarked alif should be read as hamzat al-waṣl. I think the rule should be "(any) vowel" + space + alif + consonant, e.g. مَا اسْمُكَ؟ ("What is your name?"), which should convert to مَا ٱسْمُكَ؟ (mā smuka?, "What is your name?"), visibly or invisibly.
I had the same rendering on my iPhone but it looks OK on PC. --Anatoli T. (обсудить/вклад) 07:27, 17 August 2017 (UTC)
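The rule proposed above ("(any) vowel" + space + alif + consonant) could be sketched in Lua roughly as follows. This is only an illustration under stated assumptions, not the code of Module:ar-translit: the function name `mark_wasl` is hypothetical, "vowel" is narrowed to the three short-vowel marks plus a plain alif (long ā, as in mā), and "consonant" is approximated as any character of the Arabic block that is not one of the diacritics stripped in the entry_name list.

```lua
local ALIF       = "\216\167" -- U+0627, plain alif
local ALIF_WASLA = "\217\177" -- U+0671, alif with waṣla

-- Assumption: these are the only word-final "vowels" the sketch recognises.
local VOWEL_BEFORE = {
	["\217\142"] = true, -- fatḥa
	["\217\143"] = true, -- ḍamma
	["\217\144"] = true, -- kasra
	[ALIF]       = true, -- long ā
}

-- Rough stand-in for "consonant": a two-byte character with lead byte 216..219
-- (U+0600..U+06FF) that is not a diacritic U+064B..U+0652 or U+0670.
local function is_consonant(ch)
	local a, b = ch:byte(1, 2)
	if not a or not b or a < 216 or a > 219 then return false end
	local diacritic = a == 217 and ((b >= 139 and b <= 146) or b == 176)
	return not diacritic
end

-- Rewrite "vowel + space + plain alif + consonant" so the alif carries a waṣla.
local function mark_wasl(text)
	local out, pos = {}, 1
	while true do
		local s, e = text:find(" " .. ALIF, pos, true)
		if not s then break end
		local before = text:sub(s - 2, s - 1) -- character before the space
		local after  = text:sub(e + 1, e + 2) -- character after the alif
		out[#out + 1] = text:sub(pos, s)      -- copy up to and including the space
		if VOWEL_BEFORE[before] and is_consonant(after) then
			out[#out + 1] = ALIF_WASLA
		else
			out[#out + 1] = ALIF
		end
		pos = e + 1
	end
	out[#out + 1] = text:sub(pos)
	return table.concat(out)
end
```

With this sketch, a plain alif that already carries a written vowel, or that follows a word ending in sukūn, is left untouched, so those inputs would still have to be handled (or rejected) elsewhere.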
Hmm. Does it depend on the following word? Hamza#Hamzat waṣl lists only certain short words with two-consonant roots as having Hamzat waṣl in rule #2. Wyang (talk) 10:19, 17 August 2017 (UTC)
That would work if we can rely on folks always using hamzas when they add fully vowelized (transliterable) text. It would be good to have a category or a tracking template for cases in which the module assumes that an unmarked alif is hamzat al-waSl, so that it can be checked. -- Eru·tuon 18:27, 17 August 2017 (UTC)
Yes, not writing out hamza is an issue for transliteration purposes in any case, and in this one in particular. The last sentence in Wikipedia, "It occurs only in the definite article or at the beginning of a word following a preposition", is wrong: it does not occur only there, and it contradicts what is said above. --Anatoli T. (обсудить/вклад) 21:25, 17 August 2017 (UTC)
I have the same rendering issue on my iPhone (it's confused me many a time). As to the issue at hand, I really don't get what the problem is. If the plain alif is preceded by a word-final vowel, then it should be assumed to be an alif al-wasl. Otherwise, there should be a vowel placed on the alif, unless it is the definite article, which can be automatically detected. Does that not cover all use-cases? --WikiTiki89 15:02, 22 August 2017 (UTC)
Oh I see, the issue is that the module currently assumes that if the alif al-wasl doesn't have a vowel and is not the definite article, then the word is not sufficiently vocalized to produce a transliteration. It should instead simply treat it as an alif al-wasl. User:Benwing2 wrote this code so perhaps he'd know best how to fix it. --WikiTiki89 15:05, 22 August 2017 (UTC)
@Wikitiki89 Thanks. I still maintain that the rule should be "(any) vowel" + space + alif + consonant, to avoid mistransliterating anything less than strict spellings. User:Benwing2 has been unavailable for a little while but maybe he lost interest in working with Arabic? --Anatoli T. (обсудить/вклад) 00:47, 24 August 2017 (UTC)
@Atitarev: I think you're saying the same thing as me. Otherwise give me an example where your rule is different from mine. --WikiTiki89 18:04, 24 August 2017 (UTC)
@Wikitiki89 There could be a vowel after the alif, if the author forgot a hamza. The module should produce nil in such cases. --Anatoli T. (обсудить/вклад) 20:45, 24 August 2017 (UTC)
@Atitarev: Can you give an actual example? If a plain alif has a vowel, then we use it. As correctly in اِبْن (ibn) and incorrectly in اَنْتَ (anta). No need to return nil, and this does not interfere with my rule, which only applies when no vowel is written on the alif. --WikiTiki89 20:51, 24 August 2017 (UTC)
@Wikitiki89: I think we are both clear on this. In your examples, like اِبْن (ibn), the silent alif is not a hamzat al-waṣl (ٱ) and it wouldn't get the diacritic if we were to mark all such cases. Also, in your cases, I doubt it would be orthographically correct to have a word starting with a plain alif (without a hamza) + vowel, if the preceding word ends in a vowel as well (e.g. مَعَ اِبْنِ). My rule is more specific about hamzat al-waṣl, not about all cases with alif. --Anatoli T. (обсудить/вклад) 22:32, 24 August 2017 (UTC)
@Atitarev: I don't quite understand what you're saying. But to clarify my position, when an alif is given with an explicit hamza or wasla, the transliteration is straightforward. However, I do not think we should ideally ever require a wasla to be given explicitly in order to generate the proper transliteration, so the transliteration module should be able to handle plain alifs in these cases. What my rule tried to demonstrate was that handling a plain alif is straightforward. Keep in mind, the transliteration module should try to transliterate things as vocalized as much as possible, regardless of whether the vocalization follows the standard rules or not. Thus, if I write مَعَ اِبْن, it should give maʿa ibn, despite the fact that that is not "proper" grammar, because I indicated the vowel explicitly. Whereas if I write مَعَ ابْنِ, it should give maʿa bni, rather than return nil. This is very straightforward in most cases because if the vowel is not elided, it is written, and if it is elided, it is not written. The main exception to that is the definite article, which is usually written without a vowel even when the vowel is not elided. Thus, for the definite article, the module will need to check whether the preceding word ends in a vowel or not. In addition to covering these requirements, the rule I gave above will have the module return nil in the case of مِنْ ابْن and ابْن, because returning min bn and bn would be confusing and guessing the vowel would be infeasible. As far as I can tell your rule agrees with mine in all cases it applies. If you think it does not, please give me a specific example of such a case. --WikiTiki89 14:59, 25 August 2017 (UTC)
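The decision procedure described above can be sketched in Lua as a small classifier. It is a simplified illustration under stated assumptions, not the actual logic of Module:ar-translit: the name `classify_initial_alif` and its return labels are hypothetical, every word beginning with alif + lām is assumed to be the definite article, and a word is taken to "end in a vowel" only if its last character is a short-vowel mark or a plain alif.

```lua
local ALIF = "\216\167" -- U+0627
local LAM  = "\217\132" -- U+0644
local SHORT_VOWELS = {
	["\217\142"] = true, -- fatḥa
	["\217\143"] = true, -- ḍamma
	["\217\144"] = true, -- kasra
}

-- True if the byte string `s` begins with `prefix`.
local function starts_with(s, prefix)
	return s:sub(1, #prefix) == prefix
end

-- Assumption: only a final short-vowel mark or a final plain alif counts as a vowel.
-- All characters involved are two bytes long in UTF-8.
local function ends_in_vowel(word)
	local last = word:sub(-2)
	return SHORT_VOWELS[last] == true or last == ALIF
end

-- Classify a word-initial plain alif. `prev_word` may be nil (start of text).
local function classify_initial_alif(prev_word, word)
	if not starts_with(word, ALIF) then
		return "none"     -- word does not begin with a plain alif
	end
	if SHORT_VOWELS[word:sub(3, 4)] then
		return "vowelled" -- vowel written on the alif: transliterate it as written
	elseif starts_with(word, ALIF .. LAM) then
		return "article"  -- caller picks the elided or unelided form from prev_word
	elseif prev_word and ends_in_vowel(prev_word) then
		return "wasl"     -- silent alif after a word-final vowel
	end
	return nil            -- not enough information to transliterate
end
```

Under these assumptions the four inputs discussed above come out as "vowelled" (maʿa ibn), "wasl" (maʿa bni), and nil for the two cases where no vowel can be recovered.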
@Wikitiki89 We are agreeing but in your definition you haven't mentioned the consonant after the alif in the failed cases, which is what this topic is about, as in مَا اسْمُكَ؟ ("What is your name?") or مَعَ ابْنِي ("with my son"). I have tried to define the rule ("(any) vowel" + space + alif + consonant) for when an unmarked alif should simply be ignored and not treated as unmarked.
No, I don't object to مَعَ اِبْن being transliterated as maʿa ibn. --Anatoli T. (обсудить/вклад) 00:47, 26 August 2017 (UTC)
@Atitarev: Can you give an example of what you mean by an alif that is not followed by a consonant? All letters in Arabic are consonants, so an alif is always followed by a consonant. I had thought you meant things like مَعَ اِبْن (maʕa ibn) and مَعَ اَب (maʕa ab), but we've covered that. --WikiTiki89 14:35, 28 August 2017 (UTC)
@Wikitiki89 Take your example: مَعَ اِبْنِي (maʕa ibnī, "with my son"), where the alif is followed by a vowel (isn't it?), works fine and causes no transliteration problem. مَعَ ابْنِي ("with my son") is not working, but it matches my rule: a plain alif is followed by a consonant. Do you have another rule in mind? If so, please suggest it. Yes, the working cases are covered, but please check again what the topic is about. --Anatoli T. (обсудить/вклад) 14:45, 28 August 2017 (UTC)
@Atitarev: Well yes, my rule covers both of those cases. So I still don't get what issue you have with it. --WikiTiki89 14:52, 28 August 2017 (UTC)
@Wikitiki89 What "both cases" are you talking about? There's only one case where plain alif = hamzat al-waṣl, the one that fails to transliterate. I have described the rule under which the module should detect hamzat al-waṣl. Sorry, it seems that you're not focused or not interested. I am seeking technical help here. --Anatoli T. (обсудить/вклад) 21:49, 28 August 2017 (UTC)
@Atitarev: I don't know what makes you think that I'm not focused or not interested, but in programming it's preferred to fix existing rules to apply to all cases, rather than patch up one specific case that isn't working. So I'm giving a general rule that will cover alif al-wasl for all cases including the one that is currently failing. --WikiTiki89 16:45, 29 August 2017 (UTC)