How Can I Efficiently Remove Diacritics from Unicode Strings in Java?

Barbara Streisand
2024-12-11 01:23:10

Remove Diacritic Marks from Unicode Characters

To strip diacritical marks (e.g., tildes and umlauts) from Unicode text, consider the following approaches:

Java Algorithm

In Java, utilize the following code:

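import java.text.Normalizer;
import java.util.regex.Pattern;

// Matches combining diacritical marks, modifier letters (Lm), modifier symbols (Sk),
// and the Hebrew accents and points in the range U+0591 to U+05C7.
// Backslashes are doubled so that the regex engine, not the Java compiler, reads the escapes.
public static final Pattern DIACRITICS_AND_FRIENDS = Pattern.compile("[\\p{InCombiningDiacriticalMarks}\\p{IsLm}\\p{IsSk}\\u0591-\\u05C7]+");

private static String stripDiacritics(String str) {
    // NFD splits each accented character into a base letter plus combining marks.
    str = Normalizer.normalize(str, Normalizer.Form.NFD);
    // Delete the marks, leaving only the base letters.
    str = DIACRITICS_AND_FRIENDS.matcher(str).replaceAll("");
    return str;
}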

Example:

stripDiacritics("Björn")  = Bjorn
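To see why the normalization step matters, the following sketch prints the code points of "Björn" before and after NFD decomposition. The class name `NfdDemo` and its helper methods are hypothetical and exist only for illustration; the only library call is `Normalizer.normalize`, as used above.

```java
import java.text.Normalizer;

final class NfdDemo {
    /** Formats the code points of s as space-separated U+XXXX tokens. */
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    /** Applies canonical decomposition (NFD) to s. */
    static String decompose(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    public static void main(String[] args) {
        String composed = "Bj\u00F6rn"; // "Bjorn" with o-umlaut (U+00F6)
        // prints: U+0042 U+006A U+00F6 U+0072 U+006E
        System.out.println(codePoints(composed));
        // prints: U+0042 U+006A U+006F U+0308 U+0072 U+006E
        System.out.println(codePoints(decompose(composed)));
    }
}
```

After decomposition, the umlaut is a separate combining character (U+0308) that falls inside the Combining Diacritical Marks block, which is why the pattern can remove it without touching the base letter "o".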

Enhanced Algorithm

For a more comprehensive solution, add a second cleanup stage for special characters that are not diacritics. Letters such as "ł" have no Unicode decomposition, so normalization leaves them unchanged.

public static final char DEFAULT_REPLACE_CHAR = '-';
public static final String DEFAULT_REPLACE = String.valueOf(DEFAULT_REPLACE_CHAR);
// ImmutableMap comes from Guava (com.google.common.collect.ImmutableMap).
private static final ImmutableMap<String, String> NONDIACRITICS = ImmutableMap.<String, String>builder()
        // ... [List of non-diacritic characters]
        .build();

public static String simplifiedString(String orig) {
    String str = orig;
    if (str == null) {
        return null;
    }
    str = stripDiacritics(str);
    str = stripNonDiacritics(str);
    if (str.length() == 0) {
        // ... 
    }
    return str.toLowerCase();
}

// ... [Continued implementation]

Applicability and Limitations

These algorithms remove diacritics well enough for search purposes, such as matching "Bjorn" against "Björn". However, special characters that are not diacritics, such as the "ł" in "Białegostok", need additional handling because Unicode normalization does not decompose them. The enhanced algorithm attempts to replace these characters with their closest equivalent.
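The article does not show the body of `stripNonDiacritics` or the contents of its replacement table, so the following is only a minimal sketch of how such a second stage might look. The class name `NonDiacriticsSketch`, the `REPLACEMENTS` map, and its handful of sample entries are assumptions chosen for illustration, and it uses `java.util.HashMap` rather than Guava to stay self-contained.

```java
import java.util.HashMap;
import java.util.Map;

final class NonDiacriticsSketch {
    // Hypothetical sample entries; the article's actual table is elided.
    private static final Map<Character, String> REPLACEMENTS = new HashMap<>();
    static {
        REPLACEMENTS.put('\u0142', "l");  // small l with stroke
        REPLACEMENTS.put('\u0141', "L");  // capital L with stroke
        REPLACEMENTS.put('\u00F8', "o");  // small o with stroke
        REPLACEMENTS.put('\u00D8', "O");  // capital O with stroke
        REPLACEMENTS.put('\u0111', "d");  // small d with stroke
        REPLACEMENTS.put('\u00DF', "ss"); // sharp s
    }

    /** Replaces each mapped character with its ASCII stand-in; other characters pass through. */
    static String stripNonDiacritics(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            String replacement = REPLACEMENTS.get(c);
            if (replacement != null) {
                out.append(replacement);
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // "Bialystok" spelled with l-with-stroke (U+0142); prints: Bialystok
        System.out.println(stripNonDiacritics("Bia\u0142ystok"));
    }
}
```

A lookup table is needed here because these letters are distinct characters rather than a base letter plus a combining mark, so no pattern over decomposed text can reach them. Which replacements count as the "closest equivalent" is a judgment call that depends on the languages being searched.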
