How to use Unicode surrogate programming in Java

PHPz

May 06, 2023 pm 08:43 PM

Sequential access

Sequential access is a basic operation for processing strings in the Java language. Under this approach, each character in the input string is accessed sequentially from beginning to end, or sometimes from end to beginning. This section discusses seven technical examples of creating a 32-bit code point array from a string using sequential access methods and estimates their processing time.
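To make the problem concrete, here is a minimal sketch of why a char-by-char loop is not enough. The class and helper names are illustrative and do not come from the article. A supplementary character such as U+20B9F occupies two 16-bit char units but counts as one code point.

```java
final class CodeUnitDemo {
    // Hypothetical helper: number of 16-bit char units in s.
    static int charUnits(String s) {
        return s.length();
    }

    // Hypothetical helper: number of Unicode code points in s.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // U+53F1 (one char) followed by U+20B9F (stored as the pair U+D842 U+DF9F)
        String s = "\u53F1\uD842\uDF9F";
        System.out.println(charUnits(s));                            // 3
        System.out.println(codePoints(s));                           // 2
        System.out.println(Character.isHighSurrogate(s.charAt(1)));  // true
    }
}
```

The examples that follow differ only in how they bridge these two ways of counting.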

Example 1-1: Benchmark (no support for surrogate pairs)

Listing 1 assigns each 16-bit char value directly to a 32-bit code point value and does not consider surrogate pairs at all:

Listing 1. No support for surrogate pairs

int[] toCodePointArray(String str) { // Example 1-1
    int len = str.length();          // the length of str
    int[] acp = new int[len];        // an array of code points

    // The source listing is cut off after "i <"; the rest of this loop is
    // reconstructed from the description above.
    for (int i = 0, j = 0; i < len; i++) {
        acp[j++] = str.charAt(i);    // widen each char to an int
    }
    return acp;
}

Although this example does not support surrogate pairs, it provides a processing-time baseline against which the subsequent sequential access examples are compared.

Example 1-2: Using isSurrogatePair()

Listing 2 uses isSurrogatePair() to count the total number of surrogate pairs. After counting, it allocates enough memory for an array of code points to store the values. It then enters a sequential access loop, using isHighSurrogate() and isLowSurrogate() to determine whether each char is a high or low surrogate. When it finds a high surrogate followed by a low surrogate, it uses toCodePoint() to convert the surrogate pair to a code point value and increments the current index by 2. Otherwise, it assigns the char value directly to a code point value and increments the current index by 1. This example takes 1.38 times longer to process than Example 1-1.

Listing 2. Limited support

int[] toCodePointArray(String str) { // Example 1-2
    int len = str.length();          // the length of str
    int[] acp;                       // an array of code points
    int surrogatePairCount = 0;      // the count of surrogate pairs

    // The source listing is cut off after "i <"; everything from this loop
    // onward is reconstructed from the description above.
    for (int i = 1; i < len; i++) {  // count the surrogate pairs
        if (Character.isSurrogatePair(str.charAt(i - 1), str.charAt(i))) {
            surrogatePairCount++;
            i++;                     // skip past the pair just counted
        }
    }
    acp = new int[len - surrogatePairCount];
    for (int i = 0, j = 0; i < len; i++) {
        char ch0 = str.charAt(i);         // the current char
        if (Character.isHighSurrogate(ch0) && i + 1 < len) {
            char ch1 = str.charAt(i + 1); // the next char
            if (Character.isLowSurrogate(ch1)) {
                acp[j++] = Character.toCodePoint(ch0, ch1);
                i++;                      // the pair consumed two chars
                continue;
            }
        }
        acp[j++] = ch0;
    }
    return acp;
}

The approach to updating software in Listing 2 is naive. It is cumbersome and requires extensive modifications, making the resulting software brittle and difficult to change in the future.
Specifically, these issues are:

◆The number of code points needs to be calculated to allocate sufficient memory

◆It is difficult to obtain the correct code point value for a specified index in the string

◆It is difficult to move the current index correctly for the next processing step

An improved algorithm appears in the next example.

Example 1-3: Basic support

Java 1.5 provides the codePointCount(), codePointAt() and offsetByCodePoints() methods, which address the three problems in Example 1-2 respectively. Listing 3 uses these methods to improve the readability of the algorithm:

Listing 3. Basic support

int[] toCodePointArray(String str) { // Example 1-3
    int len = str.length();          // the length of str
    int[] acp = new int[str.codePointCount(0, len)];

    // The source listing is cut off after "i <"; the rest of this loop is
    // reconstructed from the description above.
    for (int i = 0, j = 0; i < len; i = str.offsetByCodePoints(i, 1)) {
        acp[j++] = str.codePointAt(i);
    }
    return acp;
}

However, Listing 3 takes 2.8 times longer to process than Listing 1.

Example 1-4: Using codePointBefore()

When offsetByCodePoints() receives a negative number as its second parameter, it moves backward by that many code points and returns the resulting absolute offset, measured from the beginning of the string. Next, codePointBefore() returns the code point value before a specified index. Listing 4 uses these methods to traverse the string from end to beginning:

Listing 4. Basic support using codePointBefore()

int[] toCodePointArray(String str) { // Example 1-4
    int len = str.length();          // the length of str
    int[] acp = new int[str.codePointCount(0, len)];
    int j = acp.length;              // an index for acp

    for (int i = len; i > 0; i = str.offsetByCodePoints(i, -1)) {
        acp[--j] = str.codePointBefore(i);
    }
    return acp;
}

The processing time for this example is 2.72 times longer than Example 1-1, which is slightly faster than Example 1-3. Typically, comparing against zero rather than a non-zero value produces smaller code in the JVM, which sometimes improves performance. However, the small improvement may not be worth sacrificing readability.

Example 1-5: Using charCount()

Examples 1-3 and 1-4 provide basic surrogate pair support. They do not require any temporary variables and are robust coding methods. To obtain shorter processing time, using charCount() instead of offsetByCodePoints() is effective, but requires a temporary variable to hold the code point value, as shown in Listing 5:

Listing 5. Optimized support using charCount()

int[] toCodePointArray(String str) { // Example 1-5
    int len = str.length();          // the length of str
    int[] acp = new int[str.codePointCount(0, len)];
    int j = 0;                       // an index for acp

    // The source listing is cut off after "i <"; the rest of this loop is
    // reconstructed from the description above.
    for (int i = 0, cp; i < len; i += Character.charCount(cp)) {
        cp = str.codePointAt(i);     // temporary variable for the code point
        acp[j++] = cp;
    }
    return acp;
}

The processing time of Listing 5 is reduced to 1.68 times longer than Example 1-1.

Example 1-6: Accessing a char array

Listing 6 accesses a char array directly while using the optimization shown in Example 1-5:

Listing 6. Optimized support using a char array

int[] toCodePointArray(String str) { // Example 1-6
    char[] ach = str.toCharArray();  // a char array copied from str
    int len = ach.length;            // the length of ach
    int[] acp = new int[Character.codePointCount(ach, 0, len)];
    int j = 0;                       // an index for acp

    // The source listing is cut off after "i <"; the rest of this loop is
    // reconstructed from the description above.
    for (int i = 0, cp; i < len; i += Character.charCount(cp)) {
        cp = Character.codePointAt(ach, i);
        acp[j++] = cp;
    }
    return acp;
}

The char array is copied from the string using toCharArray(). Performance improves because direct access to the array is faster than indirect access through a method. The processing time is 1.51 times longer than Example 1-1. However, calling toCharArray() incurs some overhead to create a new array and copy the data into it, and the convenience methods provided by the String class can no longer be used. Even so, this algorithm is useful when dealing with large amounts of data.

Example 1-7: An object-oriented algorithm

The object-oriented algorithm for this example uses the CharBuffer class, as shown in Listing 7:

Listing 7. Object-oriented algorithm using CharSequence

// Requires: import java.nio.CharBuffer; import java.nio.IntBuffer;
int[] toCodePointArray(String str) {        // Example 1-7
    CharBuffer cBuf = CharBuffer.wrap(str); // Buffer to wrap str
    IntBuffer iBuf = IntBuffer.allocate(    // Buffer to store code points
            Character.codePointCount(cBuf, 0, cBuf.capacity()));

    while (cBuf.remaining() > 0) {
        int cp = Character.codePointAt(cBuf, 0); // the current code point
        iBuf.put(cp);
        cBuf.position(cBuf.position() + Character.charCount(cp));
    }
    return iBuf.array();
}

Unlike the previous examples, Listing 7 does not require an index to hold the current position for sequential access. Instead, the CharBuffer tracks the current position internally. The Character class provides the static methods codePointCount() and codePointAt(), which handle a CharBuffer through the CharSequence interface. A CharBuffer always treats its current position as the head of the CharSequence, so the second parameter of codePointAt() is always 0. The processing time is 2.15 times longer than Example 1-1.
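As a quick sanity check, the sketch below runs an index-based traversal (in the style of Example 1-5) and a buffer-based traversal (in the style of Example 1-7) on the same input and compares the results. The class and method names are illustrative, and the test string is an assumption chosen to contain one surrogate pair.

```java
import java.nio.CharBuffer;
import java.nio.IntBuffer;
import java.util.Arrays;

final class CodePointArrays {
    // Index-based traversal in the style of Example 1-5.
    static int[] viaCharCount(String str) {
        int len = str.length();
        int[] acp = new int[str.codePointCount(0, len)];
        int j = 0;
        for (int i = 0, cp; i < len; i += Character.charCount(cp)) {
            cp = str.codePointAt(i);
            acp[j++] = cp;
        }
        return acp;
    }

    // Buffer-based traversal in the style of Example 1-7.
    static int[] viaCharBuffer(String str) {
        CharBuffer cBuf = CharBuffer.wrap(str);
        IntBuffer iBuf = IntBuffer.allocate(
                Character.codePointCount(cBuf, 0, cBuf.capacity()));
        while (cBuf.remaining() > 0) {
            // Index 0 is relative to the buffer's current position.
            int cp = Character.codePointAt(cBuf, 0);
            iBuf.put(cp);
            cBuf.position(cBuf.position() + Character.charCount(cp));
        }
        return iBuf.array();
    }

    public static void main(String[] args) {
        String s = "a\uD842\uDF9Fb"; // 'a', U+20B9F, 'b'
        int[] byIndex = viaCharCount(s);
        int[] byBuffer = viaCharBuffer(s);
        System.out.println(byIndex.length);                    // 3
        System.out.println(Arrays.equals(byIndex, byBuffer));  // true
    }
}
```

Both variants advance by Character.charCount(cp), so a surrogate pair moves the position by two chars and any other character by one.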

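Processing time comparison

The timing tests for these sequential access examples used a sample string containing 10,000 surrogate pairs and 10,000 non-surrogate characters. A code point array was created from this string 10,000 times. The test environment was:

◆OS: Microsoft Windows® XP Professional SP2

◆Java: IBM Java 1.5 SR7

◆CPU: Intel® Core 2 Duo CPU T8300 @ 2.40GHz

◆Memory: 2.97GB RAM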
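A measurement of this kind could be set up roughly as follows. This is only a sketch under stated assumptions: the article does not show its harness, so the class name, the interleaved layout of the sample string, and the use of System.nanoTime() are all illustrative choices. The conversion under test follows the style of Example 1-5.

```java
final class TimingSketch {
    // Builds a string with `pairs` surrogate pairs and `singles` BMP characters.
    // Interleaving them is an assumption; the article does not describe the layout.
    static String sample(int pairs, int singles) {
        StringBuilder sb = new StringBuilder(pairs * 2 + singles);
        for (int i = 0; i < Math.max(pairs, singles); i++) {
            if (i < pairs) sb.appendCodePoint(0x20B9F); // stored as U+D842 U+DF9F
            if (i < singles) sb.append('\u53F1');
        }
        return sb.toString();
    }

    // Conversion under test, in the style of Example 1-5.
    static int[] toCodePointArray(String str) {
        int len = str.length();
        int[] acp = new int[str.codePointCount(0, len)];
        int j = 0;
        for (int i = 0, cp; i < len; i += Character.charCount(cp)) {
            cp = str.codePointAt(i);
            acp[j++] = cp;
        }
        return acp;
    }

    // Returns the elapsed milliseconds for `runs` conversions of s.
    static long time(String s, int runs) {
        long total = 0; // consume the results so the work is not optimized away
        long start = System.nanoTime();
        for (int r = 0; r < runs; r++) {
            total += toCodePointArray(s).length;
        }
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        if (total != (long) runs * s.codePointCount(0, s.length())) {
            throw new IllegalStateException("unexpected code point count");
        }
        return elapsedMs;
    }

    public static void main(String[] args) {
        String s = sample(10_000, 10_000);
        System.out.println(time(s, 10_000) + " ms");
    }
}
```

Absolute numbers from such a loop depend heavily on JIT warm-up and the JVM in use, so only the ratios between examples are meaningful.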

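Table 1 shows the absolute and relative processing times of Examples 1-1 through 1-7, together with the associated APIs:

Table 1. Processing times and APIs of the sequential access examples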

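Random access

Random access means directly accessing an arbitrary position in a string. When a string is accessed, the index is expressed in units of the 16-bit char type. If the string is to be treated as a sequence of 32-bit code points, however, it cannot be accessed with an index expressed in code point units. offsetByCodePoints() must be used to convert a code point index to a char index. With a poorly designed algorithm this leads to poor performance, because offsetByCodePoints() always scans the inside of the string, starting at its first parameter and counting the number of code points given by its second parameter. In this section, I compare three examples that slice a long string into short units.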
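The index conversion described above can be sketched as follows. The class and helper names are illustrative, not part of the article or the JDK; only offsetByCodePoints() and codePointAt() are real String methods.

```java
final class CodePointIndex {
    // Converts a code point index into a char index by scanning from the start of s.
    static int charIndexOf(String s, int codePointIndex) {
        return s.offsetByCodePoints(0, codePointIndex);
    }

    // Returns the code point located at the given code point index.
    static int codePointAtIndex(String s, int codePointIndex) {
        return s.codePointAt(charIndexOf(s, codePointIndex));
    }

    public static void main(String[] args) {
        // U+20B9F (chars 0-1), U+53F1 (char 2), U+20B9F (chars 3-4)
        String s = "\uD842\uDF9F\u53F1\uD842\uDF9F";
        System.out.println(charIndexOf(s, 1));                            // 2
        System.out.println(charIndexOf(s, 2));                            // 3
        System.out.println(Integer.toHexString(codePointAtIndex(s, 2)));  // 20b9f
    }
}
```

Each call to charIndexOf() walks the string from index 0, which is why repeating it inside a loop becomes expensive for long strings.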

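Example 2-1: Benchmark (no support for surrogate pairs)

Listing 8 shows how to slice a string using a width unit. This benchmark serves as a baseline for later comparison and does not support surrogate pairs.

Listing 8. No support for surrogate pairs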

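// Requires: import java.util.ArrayList; import java.util.List;
String[] sliceString(String str, int width) { // Example 2-1
    // It must be that "str != null && width > 0".
    List<String> slices = new ArrayList<String>();
    int len = str.length();       // (1) the length of str
    int sliceLimit = len - width; // (2) Do not slice beyond here.
    int pos = 0;                  // the current position per char type

    // The source listing is cut off after "while (pos <"; the rest is
    // reconstructed from the surrounding text and Listings 9 and 10.
    while (pos < sliceLimit) {
        int begin = pos;                // (3)
        int end   = pos + width;        // (4)
        slices.add(str.substring(begin, end));
        pos += width;                   // (5)
    }
    slices.add(str.substring(pos));     // (6)
    return slices.toArray(new String[slices.size()]);
}

The sliceLimit variable restricts the slice position so that no IndexOutOfBoundsException is thrown when the remaining string is too short to be sliced by the current width unit. Once the current position passes sliceLimit, the algorithm exits the while loop and then handles the final slice.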
Example 2-2: Using a code point index

Listing 9 shows how to use a code point index to access a string randomly:

Listing 9. Poor performance
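// Requires: import java.util.ArrayList; import java.util.List;
String[] sliceString(String str, int width) { // Example 2-2
    // It must be that "str != null && width > 0".
    List<String> slices = new ArrayList<String>();
    int len = str.codePointCount(0, str.length()); // (1) code point count [Modified]
    int sliceLimit = len - width; // (2) Do not slice beyond here.
    int pos = 0;                  // the current position per code point

    // The source listing is cut off after "while (pos <"; the rest is
    // reconstructed from the description below.
    while (pos < sliceLimit) {
        int begin = str.offsetByCodePoints(0, pos);            // (3) [Modified]
        int end   = str.offsetByCodePoints(0, pos + width);    // (4) [Modified]
        slices.add(str.substring(begin, end));
        pos += width;                                          // (5)
    }
    slices.add(str.substring(str.offsetByCodePoints(0, pos))); // (6) [Modified]
    return slices.toArray(new String[slices.size()]);
}

Listing 9 modifies several lines of Listing 8. First, in Line (1), length() is replaced by codePointCount(). Second, in Lines (3), (4) and (6), the char index is replaced by a code point index that is converted through offsetByCodePoints().

The basic algorithm flow looks almost the same as in Example 2-1. However, the processing time relative to Example 2-1 grows in proportion to the string length, because offsetByCodePoints() always scans the inside of the string from its head to the specified index.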
Example 2-3: Reduced processing time

The method shown in Listing 10 avoids the performance problem of Example 2-2:

Listing 10. Improved performance
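// Requires: import java.util.ArrayList; import java.util.List;
String[] sliceString(String str, int width) { // Example 2-3
    // It must be that "str != null && width > 0".
    List<String> slices = new ArrayList<String>();
    int len = str.length(); // (1) the length of str
    int sliceLimit          // (2) Do not slice beyond here. [Modified]
            = (len >= width * 2 || str.codePointCount(0, len) > width)
            ? str.offsetByCodePoints(len, -width) : 0;
    int pos = 0;            // the current position per char type

    // The source listing is cut off after "while (pos <"; the rest is
    // reconstructed from the description below.
    while (pos < sliceLimit) {
        int begin = pos;                                // (3)
        int end   = str.offsetByCodePoints(pos, width); // (4) [Modified]
        slices.add(str.substring(begin, end));
        pos = end;                                      // (5) [Modified]
    }
    slices.add(str.substring(pos));                     // (6)
    return slices.toArray(new String[slices.size()]);
}

First, in Line (2), the expression len - width (from Listing 9) is replaced by offsetByCodePoints(len, -width). However, this throws an IndexOutOfBoundsException when the value of width is greater than the number of code points. The boundary conditions must be considered to avoid the exception; wrapping the call in a try/catch exception handler would be another solution. If the expression len >= width * 2 is true, offsetByCodePoints() can be called safely, because even if every code point is encoded as a surrogate pair, the number of code points is still at least the value of width. Alternatively, if codePointCount(0, len) > width is true, offsetByCodePoints() can also be called safely. In any other case, sliceLimit must be set to 0.

In Line (4), the expression pos + width from Listing 9 must be replaced inside the while loop with offsetByCodePoints(pos, width). The amount of computation stays within the value of width, because the first parameter specifies the position from which the scan starts. Next, in Line (5), the expression pos += width must be replaced with pos = end. This avoids calling offsetByCodePoints() twice to compute the same index. The source code can be modified further to minimize processing time.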
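The boundary handling described above can be exercised with a self-contained sketch in the style of Example 2-3. The class name, the argument check, and the sample input are illustrative assumptions; the slicing logic follows the expressions quoted in the text.

```java
import java.util.ArrayList;
import java.util.List;

final class CodePointSlicer {
    // Slices str into pieces of `width` code points each; the last piece may be shorter.
    static String[] sliceString(String str, int width) {
        if (str == null || width <= 0) {
            throw new IllegalArgumentException("str must be non-null and width > 0");
        }
        List<String> slices = new ArrayList<String>();
        int len = str.length();
        // Only step back `width` code points when that many are known to exist.
        int sliceLimit = (len >= width * 2 || str.codePointCount(0, len) > width)
                ? str.offsetByCodePoints(len, -width) : 0;
        int pos = 0; // current position in char units
        while (pos < sliceLimit) {
            int end = str.offsetByCodePoints(pos, width); // scans at most `width` code points
            slices.add(str.substring(pos, end));
            pos = end;
        }
        slices.add(str.substring(pos)); // the final slice
        return slices.toArray(new String[0]);
    }

    public static void main(String[] args) {
        // Five code points in eight chars: pair, single, pair, single, pair
        String s = "\uD842\uDF9F\u53F1\uD842\uDF9F\u53F1\uD842\uDF9F";
        String[] parts = sliceString(s, 2);
        System.out.println(parts.length);                  // 3
        System.out.println(sliceString("ab", 5).length);   // 1
    }
}
```

No slice boundary falls inside a surrogate pair, because every position is produced by offsetByCodePoints() rather than by char arithmetic.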
Processing time comparison

Figures 1 and 2 show the processing times of Examples 2-1, 2-2 and 2-3. The sample string contains equal numbers of surrogate pairs and non-surrogate characters. The sample string was sliced 10,000 times while the string length and the value of width were varied.

Figure 1. Constant width of a segment

Figure 2. Constant count of segments

Examples 2-1 and 2-3 increase their processing time in proportion to the string length, whereas Example 2-2 increases in proportion to the square of the length. When the string length and the value of width increase while the number of segments stays fixed, Example 2-1 has a constant processing time, whereas Examples 2-2 and 2-3 increase their processing time in proportion to the value of width.
Information API

Most of the information APIs that handle surrogates have two methods with the same name. One takes a 16-bit char parameter, the other a 32-bit code point parameter. Table 2 shows the return values of each API. The third column is for U+53F1, the fourth for U+20B9F, and the last for U+D842 (a high surrogate), where U+20B9F is converted to the surrogate pair U+D842 plus U+DF9F. If a program cannot handle surrogate pairs, the value U+D842 instead of U+20B9F leads to unexpected results (shown in bold italics in Table 2).

Table 2. Information APIs for surrogates
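The difference between the two overloads can be seen in a short sketch. The methods picked here are illustrative; the contents of Table 2 are not available in this copy, so this is not a reproduction of the table. The wrapper names are hypothetical, while Character.isLetter(char), Character.isLetter(int), Character.isSupplementaryCodePoint(int), Character.charCount(int) and Character.toChars(int) are standard JDK methods.

```java
final class InfoApiDemo {
    // Calls the overload that takes a 16-bit char.
    static boolean isLetterAsChar(char ch) {
        return Character.isLetter(ch);
    }

    // Calls the overload that takes a 32-bit code point.
    static boolean isLetterAsCodePoint(int cp) {
        return Character.isLetter(cp);
    }

    public static void main(String[] args) {
        int cp = 0x20B9F;                      // a supplementary CJK ideograph
        char high = Character.toChars(cp)[0];  // its high surrogate, U+D842

        System.out.println(isLetterAsCodePoint(cp));                // true
        System.out.println(isLetterAsChar(high));                   // false
        System.out.println(Character.isSupplementaryCodePoint(cp)); // true
        System.out.println(Character.charCount(cp));                // 2
    }
}
```

Passing only the high surrogate to the char overload is exactly the mistake a program makes when it ignores surrogate pairs.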
Other APIs

This section introduces the surrogate-pair-related APIs that were not discussed in the preceding sections. Table 3 shows all of these remaining APIs. All surrogate pair APIs are covered by Tables 1, 2 and 3.

Table 3. Other surrogate APIs

Listing 11 shows five ways to create a string from a code point. The code points used for the test were U+53F1 and U+20B9F, which were repeated 10 billion times in a string. The comments in Listing 11 show the processing times:

Listing 11. Five ways to create a string from a code point
int cp = 0x20b9f; // CJK Ideograph Extension B
String str1 = new String(new int[]{cp}, 0, 1);    // processing time: 206ms
String str2 = new String(Character.toChars(cp));                  //  187ms
String str3 = String.valueOf(Character.toChars(cp));              //  195ms
String str4 = new StringBuilder().appendCodePoint(cp).toString(); //  269ms
String str5 = String.format("%c", cp);                            // 3781ms

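The processing times of str1, str2, str3 and str4 do not differ noticeably. By contrast, creating str5 takes far longer because it uses String.format(), which supports flexible output based on locale and formatting information. The str5 approach should be used only at the end of a program to output text.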

This article is reproduced from 亿速云.