Edge Cases to Keep in Mind. Part 1: Text
No matter if you are a software developer, a copywriter, or just writing an e-mail, text has many traps you need to be aware of. Some can cause serious issues, ranging from bugs in your app, through visual artefacts, to actual human victims! Let’s take a look at how we can avoid them.
Text (aka strings) exists in virtually all software projects, from one-liners like hello-worlds to enterprise systems containing billions of lines of code, regardless of the programming language, platform and so on. Text is just a sequence of characters, so this shouldn’t be rocket science, right? Let’s take a look at the traps you can encounter!
Some of the world’s alphabets (including English) are bicameral, which means they contain both upper and lower case letters.
For example: a is a lower case character and A is upper case. Conversion from one letter case to another is quite a common operation.
Casing might seem trivial: one character is simply converted (mapped) to another. A non-letter character, such as 1 or +, just maps to itself. Additionally, the mapping looks like it can always be reversed, e.g. A->a and a->A. So, everything seems fine at first glance. Well, nothing could be further from the truth!
This is not a joke, and we are not talking about enraged grammar Nazis. As you can read in this article, a casing glitch led to 2 deaths and put 3 more people in jail.
How did that happen? Well, Turkish (and Azeri) has 2 distinct i letters: dotted (closed) and dotless (open). In English and other Latin-script alphabets, the lowercase letter is always dotted while the uppercase one is always dotless. Everything is illustrated in Table 1 and the online demo.
Table 1. Dotted and dotless i letters.
| Language | Lowercase | Uppercase |
|---|---|---|
| English | i (dotted) | I (dotless) |
| Turkish | i (dotted) | İ (dotted) |
| Turkish | ı (dotless) | I (dotless) |
As you can see, the result of a case conversion depends on context, which in turn depends on the current language. It is important to use the appropriate language (locale) when processing texts intended for humans. If you don’t, your words may end up having a different meaning than intended.
On the other hand, machine-readable texts like HTTP headers or JSON keys should be processed in a language-neutral way. Otherwise, you may get non-ASCII characters in the output, which may break application logic. That exact situation happened in GSON, a library used by thousands (or maybe millions) of projects.
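To see the difference in practice, here is a minimal Java sketch (the class and variable names are just for illustration). String.toUpperCase without arguments uses the JVM’s default locale, which is exactly the trap:

```java
import java.util.Locale;

public class TurkishCasing {
    public static void main(String[] args) {
        // Human-visible text: use the user's locale explicitly.
        Locale turkish = Locale.forLanguageTag("tr-TR");
        System.out.println("title".toUpperCase(turkish)); // TİTLE (dotted capital İ)

        // Machine-readable token: use the locale-neutral root locale.
        String header = "image/jpeg";
        System.out.println(header.toUpperCase(Locale.ROOT)); // IMAGE/JPEG

        // Dangerous: depends on the JVM's default locale; on a Turkish
        // machine this prints İMAGE/JPEG and may break comparisons.
        System.out.println(header.toUpperCase());
    }
}
```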
Characters with diacritics can be precomposed, like ó, or created with combining marks, like ó. When reading this page, both look like the same character. Yet, if you look at the hexdump of the second one, or even try to obtain its length programmatically, like in this demo, you will see that it consists of 2 individual characters: a Latin small letter o and a combining acute accent. Similarly, each Hangul (Korean alphabet) syllable block can be precomposed or written as a combination of distinct jamo (individual letters).
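A minimal Java sketch of this effect (explicit escapes are used so the two forms stay unambiguous in source code):

```java
public class CombiningMarks {
    public static void main(String[] args) {
        String precomposed = "\u00F3";  // ó as a single code point
        String decomposed  = "o\u0301"; // o + combining acute accent

        System.out.println(precomposed + " vs " + decomposed); // render identically
        System.out.println(precomposed.length());              // 1
        System.out.println(decomposed.length());               // 2
        System.out.println(precomposed.equals(decomposed));    // false!
    }
}
```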
Why are combining marks so important? Well, there are two ways of writing most characters with diacritics (for example, from the Polish, Hungarian or Czech alphabets). This makes operations like sorting, searching or measuring text length non-trivial. Usually, to achieve the best user experience, texts need to be normalized (converted to one of the normal forms). Otherwise, users may be confused when they see, for instance, multiple “different” logins or filenames that look the same. A great example of this is how Slack handles channel names: they are normalized before channel creation, so the same name written in different ways cannot coexist.
There are 2 levels of character equivalence. Canonical equivalence occurs when characters have both the same meaning and the same appearance, e.g. the aforementioned ó and ó differ only in the (technical) way they are written. Compatibility, on the other hand, means that characters may appear distinct but have the same meaning. For example, the ligature ﬃ is compatible with the three distinct letters ffi, but they are not canonically equal. More info about Unicode normalization can be found in the standard documentation.
Both the composed and the decomposed forms are standardized for each of the 2 equivalence levels, so we have 4 normal forms in total (NFC, NFD, NFKC, NFKD). However, normalization is not always reversible. For example, the angstrom sign Å decomposes into the Latin capital letter A plus a combining ring above, which compose back into the Latin capital letter A with ring above (Å), not the angstrom sign from which it originated.
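Java ships Unicode normalization in the standard library as java.text.Normalizer; a small sketch of the cases above:

```java
import java.text.Normalizer;
import java.text.Normalizer.Form;

public class NormalizationDemo {
    public static void main(String[] args) {
        String precomposed = "\u00F3";  // ó
        String decomposed  = "o\u0301"; // o + combining acute accent

        // Canonically equivalent strings become equal after NFC (or NFD).
        System.out.println(
            Normalizer.normalize(decomposed, Form.NFC).equals(precomposed)); // true

        // Compatibility normalization folds the ﬃ ligature into plain letters.
        System.out.println(Normalizer.normalize("\uFB03", Form.NFKC)); // ffi

        // Not reversible: the angstrom sign U+212B round-trips
        // to LATIN CAPITAL LETTER A WITH RING ABOVE (U+00C5).
        String roundTrip = Normalizer.normalize("\u212B", Form.NFC);
        System.out.printf("U+%04X%n", (int) roundTrip.charAt(0)); // U+00C5
    }
}
```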
It is also important that all the applications sharing a given text use the same normalization method. If they don’t, subtle errors and even silent data loss may follow. Such bugs can be difficult to discover because each application works faultlessly when running on its own. Applications often do not “crash” in such cases but just send or receive data different from what they should, causing unintended consequences. One such example is this bug in nettalk.
The aforementioned typographic ligatures are used to improve the visual appearance of certain character pairs which don’t look good next to each other. Most users don’t need to worry about ligatures, since they are generated automatically from individual letters by software, e.g. TeX produces ligatures by default. However, developers of such tools have to take into account that, in some cases, ligatures may be inappropriate and introduce errors.
Take a look at this: fi. Is the second letter dotted or dotless? Turkish-speaking readers may be confused. Ligatures containing i should not be used in some contexts.
A few scripts (the so-called bicameral ones), like Latin and Greek, contain letters in two cases. Virtually all of their letters have both a lowercase and an uppercase form. Virtually… but not absolutely all!
While the lowercase form is always present, the same is not true for uppercase. So, if a character has only a lowercase form, what happens when you try to convert it to uppercase? Does the operation fail with an error? Does the character stay the same? Neither!
One of the most notable examples is the German sharp s: ß. It is a lowercase character and, when converted to uppercase, it becomes a double S: SS. That transformation is not reversible: SS lowercases to ss. See it online. TL;DR Unicode 5.1 introduced ẞ (LATIN CAPITAL LETTER SHARP S), but it is not generally used as the uppercase mapping of ß. It was only recently (in 2017) added to the official German orthography rules as an equally valid alternative to SS.
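A minimal Java sketch of this lossy round trip:

```java
import java.util.Locale;

public class SharpS {
    public static void main(String[] args) {
        String word  = "straße";
        String upper = word.toUpperCase(Locale.ROOT);  // STRASSE
        String back  = upper.toLowerCase(Locale.ROOT); // strasse

        System.out.println(upper);
        System.out.println(word.equals(back));         // false: the ß is gone
        System.out.println(word.length() + " -> " + upper.length()); // 6 -> 7
    }
}
```

Note that uppercasing also changed the string’s length, which leads straight to the next trap.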
Many other lowercase ligatures do not have their corresponding precomposed uppercase forms. The complete list can be found in the Unicode Special Casing documentation.
Some uppercase characters are missing, so what? Ligatures can consist of 2 or even 3 characters, so uppercased text may be up to 3 times longer than the original lowercase. This fact is extremely important wherever the resulting text length is limited, for example in avatar or initials generators, like in this bug on bitrise.io.
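A sketch (the ﬃ ligature is U+FB03):

```java
import java.util.Locale;

public class LigatureUppercase {
    public static void main(String[] args) {
        String ligature = "\uFB03"; // the ﬃ ligature, a single character
        String upper = ligature.toUpperCase(Locale.ROOT);

        System.out.println(upper);             // FFI
        System.out.println(ligature.length()); // 1
        System.out.println(upper.length());    // 3, three times longer!
    }
}
```

So length checks should be performed after any case mapping, not before it.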
The Greek alphabet contains the letter Sigma, which looks like this in uppercase: Σ. What is its lowercase form? Well, it depends! Usually it is σ (non-final) but, at the end of a word, it's ς (final). However, if the Sigma is the only letter, or the word is written in all caps, the non-final version is always used, even in the final position. See the interactive example.
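Java’s String.toLowerCase implements this context-dependent mapping (a small sketch; the sigma rule itself is locale-independent):

```java
import java.util.Locale;

public class GreekSigma {
    public static void main(String[] args) {
        // ΟΔΟΣ, an all-caps Greek word ending with Σ.
        String allCaps = "\u039F\u0394\u039F\u03A3";

        // The final Σ lowercases to the final form ς (U+03C2)…
        System.out.println(allCaps.toLowerCase(Locale.ROOT)); // οδος

        // …but an isolated Σ lowercases to the non-final σ (U+03C3).
        System.out.println("\u03A3".toLowerCase(Locale.ROOT)); // σ
    }
}
```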
What is the lowercase of the Latin capital letter I with tilde, Ĩ? As you may have guessed, the answer is not trivial. A corresponding lowercase form (ĩ) exists. Both forms are dotless, but that is perfectly normal: neither i nor j keeps its dot when a diacritic is attached above. So what is the problem here?
Apart from Turkish, Lithuanian orthographic rules are also exceptional when it comes to the letter I. In Lithuanian, the dot is preserved beneath the accent. This means, for example, that the aforementioned Ĩ, when lowercased in the context of the Lithuanian language, becomes i̇̃. If you look carefully, you can see that there are 3 characters: a Latin small letter i, a combining dot above and a combining tilde above. The length of the text has tripled (again).
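Java applies the Lithuanian special-casing rules when given the lt locale; a sketch:

```java
import java.util.Locale;

public class LithuanianDots {
    public static void main(String[] args) {
        String capital = "\u0128"; // Ĩ, LATIN CAPITAL LETTER I WITH TILDE

        String lower = capital.toLowerCase(Locale.forLanguageTag("lt"));
        System.out.println(lower.length()); // 3

        // i, combining dot above, combining tilde: U+0069 U+0307 U+0303
        lower.chars().forEach(c -> System.out.printf("U+%04X ", c));
    }
}
```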
How can you write a word consisting of 7 letters using only 6 characters? Just use precomposed ligatures and multigraphs (digraphs, trigraphs and so on)! Of course, there is no precomposed character for every possible combination of joined letters. However, the existing ones can be used to effectively stretch text length limits. For example, the Silesian word dzbonek (a pot) consists of 7 letters, but it can be written as ǳbonek, with the precomposed ǳ character, using only 6 characters. See it online. Note that dz is a digraph, not a ligature.
Now you can, for example, tweet messages containing more than 140 characters! The list of precomposed Unicode digraphs and ligatures can be found here.
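A sketch of the dzbonek trick from above (U+01F3 is the precomposed dz digraph; NFKC expands it back):

```java
import java.text.Normalizer;
import java.text.Normalizer.Form;

public class PrecomposedDigraph {
    public static void main(String[] args) {
        String word = "\u01F3bonek"; // ǳbonek

        System.out.println(word.length()); // 6

        // Compatibility normalization expands the digraph to two letters.
        String expanded = Normalizer.normalize(word, Form.NFKC);
        System.out.println(expanded);          // dzbonek
        System.out.println(expanded.length()); // 7
    }
}
```

This is also why, if a length limit is meant to count letters, it should be enforced on normalized text.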
The alphabetical order is usually taught at the beginning of primary school. A, B, C, D… and so on to Z. As easy as pie!
Unfortunately, alphabetical order depends on language. Even positions of the basic Latin letters (without diacritics) may be different. For example, in Estonian, the letter Z is between S and T.
The location of letters with diacritical marks is also not universal. There are several possible schemes:
- Before the corresponding base letter, like in Maltese: W, X, Ż, Z.
- After the corresponding base letter, like in Polish: A, Ą, B, C, Ć.
- At the end of the alphabet, like in Swedish: Z, Å, Ä.
- At the same position (for collation purposes) as the base letter, like in Hungarian: O = Ó.
Note that the same letter may be collated differently in various languages, and may even be collated differently within the same language, depending on context! For example, in Slovak, an A with an umlaut is always located after A. In German, however, it may have the same weight as the non-umlauted version, be located after it, or even be treated as A+E. More info about which way is used in which cases can be found here.
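Java’s java.text.Collator selects such rules per locale. A sketch contrasting the German and Swedish treatment of ä (the exact ordering depends on the JDK’s collation data):

```java
import java.text.Collator;
import java.util.Locale;

public class UmlautCollation {
    public static void main(String[] args) {
        Collator german  = Collator.getInstance(Locale.GERMAN);
        Collator swedish = Collator.getInstance(Locale.forLanguageTag("sv"));

        // In German dictionary order, ä sorts together with a, before z…
        System.out.println(german.compare("\u00E4", "z") < 0);  // true

        // …while in Swedish, ä belongs at the end of the alphabet, after z.
        System.out.println(swedish.compare("\u00E4", "z") > 0); // true
    }
}
```

For multigraph rules like the Slovak CH discussed below, a library shipping full CLDR tailorings, such as ICU4J, is a safer bet than the JDK’s built-in collation data.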
It’s not only individual letters that are subject to collation; multigraphs can have their own rules too. In Slovak, CH is collated between H and I. So, for example, the word chlieb (a bread) will be collated after hodina (an hour). In Polish, on the other hand, that digraph is treated simply as two separate letters, C and H, and thus has no special collation rules. See it online.
Hungarian even has doubled digraphs, and each of them has its own collation rules, which leads to many complicated cases. Let’s consider one example. The digraph SZ is collated after S, and its doubled version (SZ + SZ) is written SSZ. This means that the word kaszinó (a casino) should come before kassza (a cash register): naively, kassza would come first because S sorts before Z, but for collation the words decompose into K A SZ I… and K A SZ SZ…, and I sorts before SZ.
Furthermore, the same group of letters may or may not form a (double) digraph, depending on the word. For example, the aforementioned Slovak CH is treated as 2 separate letters, C and H, in some words, e.g. viachlas (a polyphony). Normally, in Hungarian, NNY = NY + NY, like in the word mennybolt (a heaven). However, we also have tizennyolc (eighteen), where NNY = N + NY: a single letter N followed by a single digraph NY.
You may think that the headline above consists of only plain Latin letters. In fact, the vast majority of them are Greek, Cyrillic or Armenian capital letters; they are merely homoglyphs of some Latin letters.
So A (Latin capital A) is not the same thing as Α (Greek capital Alpha) nor А (Cyrillic capital A). Why is this important? Because they are visually indistinguishable, they can be used in IDN homograph attacks. For example, the domain bank.com, containing only Latin letters, looks pretty much the same as bаnk.com, which contains a Cyrillic small a instead of the Latin small a. Such domains may be used for phishing.
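A sketch showing that two visually identical domain names are different strings (the second uses the Cyrillic а, U+0430):

```java
public class Homoglyphs {
    public static void main(String[] args) {
        String latin   = "bank.com";
        String spoofed = "b\u0430nk.com"; // Cyrillic small a inside

        System.out.println(latin + " vs " + spoofed); // look the same
        System.out.println(latin.equals(spoofed));    // false

        // The code points reveal the impostor: U+0430 instead of U+0061.
        spoofed.codePoints().forEach(cp -> System.out.printf("U+%04X ", cp));
    }
}
```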
Dealing with text can be tricky, especially if you work in a multilingual environment. As a rule of thumb, all settings should be appropriate for the given context. For example, the user’s current language should be taken into account when processing texts visible to that user, while machine-readable texts should be processed in a language-neutral way (or using English if that is not possible). The selected collation settings should match actual usage as well. Text should be normalized where needed, and the chosen normalization method should be consistent across the whole system.
Want to know about more edge cases? Stay tuned, part 2 is on the way!