How Can I Efficiently Remove Diacritical Marks from Unicode Strings in Java?

Barbara Streisand
2024-12-01 15:36:14

Removing Diacritical Marks from Unicode Characters

Problem Description

Diacritical marks such as tildes, circumflexes, umlauts, and carons modify the letters they attach to, so "ñ" and "n" are distinct characters even when a user searching for one expects to match the other. To make search and comparison tolerant of these differences, it can help to strip the marks and work with the plain base letters.

Solution

The approach has two steps: normalize the string to Unicode form NFD, which splits each accented letter into a base letter followed by its combining marks, then delete those marks with a regular expression:

import java.text.Normalizer;
import java.util.regex.Pattern;

public class DiacriticStripper {

    // Backslashes are doubled because the regex is written as a Java string literal.
    private static final Pattern DIACRITICS_PATTERN = Pattern.compile("[\\p{InCombiningDiacriticalMarks}\\p{IsLm}\\p{IsSk}\\u0591-\\u05C7]+");

    public static String stripDiacritics(String input) {
        String normalizedInput = Normalizer.normalize(input, Normalizer.Form.NFD);
        return DIACRITICS_PATTERN.matcher(normalizedInput).replaceAll("");
    }

}

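For instance, given the input string "ńǹňñṅņṇṋṉ̈ɲƞᶇɳȵ", each accented letter from "ń" through "ṉ̈" is reduced to a plain "n". The letters "ɲ", "ƞ", "ᶇ", "ɳ", and "ȵ" are left unchanged, because Unicode defines no decomposition of them into a base letter plus a combining mark.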
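To make the two steps concrete, the following is a minimal sketch that prints the code points before and after NFD normalization and then strips the marks. The class name DiacriticDemo and the helper codePoints are illustrative names, not part of the original solution; the pattern uses the same character classes as the article's, written with the doubled backslashes a Java string literal requires.

```java
import java.text.Normalizer;
import java.util.regex.Pattern;

class DiacriticDemo {

    // Same character classes as the article's pattern, escaped for a Java string literal.
    private static final Pattern DIACRITICS_PATTERN =
            Pattern.compile("[\\p{InCombiningDiacriticalMarks}\\p{IsLm}\\p{IsSk}\\u0591-\\u05C7]+");

    static String stripDiacritics(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        return DIACRITICS_PATTERN.matcher(decomposed).replaceAll("");
    }

    // Illustrative helper: renders each code point as U+XXXX so the decomposition is visible.
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> sb.append(String.format("U+%04X ", cp)));
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String precomposed = "\u00F1"; // ñ as a single code point
        String decomposed = Normalizer.normalize(precomposed, Normalizer.Form.NFD);

        System.out.println(codePoints(precomposed)); // U+00F1
        System.out.println(codePoints(decomposed));  // U+006E U+0303

        System.out.println(stripDiacritics("cr\u00E8me br\u00FBl\u00E9e")); // creme brulee

        // ɲ (U+0272) has no decomposition, so it passes through unchanged.
        System.out.println(stripDiacritics("\u0272").equals("\u0272")); // true
    }
}
```

The last line shows the limitation of this technique: only marks that NFD separates from their base letter can be removed by the pattern.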

Extended String Simplification

The solution above only removes marks that NFD separates from their base letter. Letters such as "ø" or "ł" have no such decomposition and pass through unchanged. To handle them as well, the same method can be extended with a custom mapping step:

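import java.text.Normalizer;
import java.util.Locale;
import java.util.regex.Pattern;

public class StringSimplifier {

    // Backslashes are doubled because the regex is written as a Java string literal.
    private static final Pattern DIACRITICS_PATTERN = Pattern.compile("[\\p{InCombiningDiacriticalMarks}\\p{IsLm}\\p{IsSk}\\u0591-\\u05C7]+");

    public static String simplify(String input) {
        // Decompose accented letters into base letter + combining marks, then drop the marks
        String normalizedInput = Normalizer.normalize(input, Normalizer.Form.NFD);
        String diacriticStripped = DIACRITICS_PATTERN.matcher(normalizedInput).replaceAll("");
        // Replace additional non-diacritic special characters using a custom mapping
        // ...
        String simplifiedString = diacriticStripped;
        // Locale.ROOT keeps lowercasing independent of the JVM's default locale
        return simplifiedString.toLowerCase(Locale.ROOT);
    }

}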

With the mapping step filled in, this method also simplifies characters that normalization cannot decompose, and lowercasing the result makes later comparisons case-insensitive.
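The custom mapping is left open above. One possible shape for it, as a sketch only, is a small lookup table applied after the diacritics are stripped. The class name SimplifierSketch, the table EXTRA_MAPPINGS, and its entries are assumptions chosen for illustration; a real application would pick its own replacements. Map.of assumes Java 9 or later.

```java
import java.text.Normalizer;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Pattern;

class SimplifierSketch {

    private static final Pattern DIACRITICS_PATTERN =
            Pattern.compile("[\\p{InCombiningDiacriticalMarks}\\p{IsLm}\\p{IsSk}\\u0591-\\u05C7]+");

    // Hypothetical, deliberately incomplete table for letters that NFD does not decompose.
    private static final Map<Character, String> EXTRA_MAPPINGS = Map.of(
            '\u00F8', "o",   // ø
            '\u0142', "l",   // ł
            '\u0111', "d",   // đ
            '\u00DF', "ss",  // ß
            '\u00E6', "ae",  // æ
            '\u0153', "oe",  // œ
            '\u0272', "n"    // ɲ
    );

    static String simplify(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        String stripped = DIACRITICS_PATTERN.matcher(decomposed).replaceAll("");
        // Lowercase first so the table only needs lowercase keys.
        String lowered = stripped.toLowerCase(Locale.ROOT);

        StringBuilder out = new StringBuilder(lowered.length());
        for (int i = 0; i < lowered.length(); i++) {
            char c = lowered.charAt(i);
            String replacement = EXTRA_MAPPINGS.get(c);
            if (replacement != null) {
                out.append(replacement);
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(simplify("\u0141\u00F3d\u017A"));         // Łódź -> lodz
        System.out.println(simplify("Stra\u00DFe"));                 // Straße -> strasse
        System.out.println(simplify("Sm\u00F8rrebr\u00F8d"));        // Smørrebrød -> smorrebrod
    }
}
```

Lowercasing before the lookup keeps the table half the size, at the cost of always returning lowercase text, which matches the behavior of the simplify method above.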
