Detailed explanation of how JavaScript uses 300 lines of code to convert Chinese characters to Pinyin-JS Tutorial-php.cn

Home

Web Front-end

JS Tutorial

Detailed explanation of how JavaScript uses 300 lines of code to convert Chinese characters to Pinyin

黄舟

May 21, 2017 am 11:32 AM

This article mainly introduces the god-level programmer JavaScript 300 lines of code to convert Chinese characters to pinyin. Friends in need can refer to the following

.The current situation of converting Chinese characters to Pinyin

First of all, it should be said that there is a strong demand for converting Chinese characters to Pinyin, such as sorting/filtering contacts by Pinyin letters; such as destinations (typically such as ticket purchases)
By Pinyin Initial letter classification and so on. But the solution to this requirement, but I haven’t heard of any clever implementation (especially on the browser side), probably requires a huge dictionary.
Specific to JavaScript, check github and npm. The better libraries for converting Chinese characters to pinyin include pinyin
and pinyinjs. You can see that both All come with a huge dictionary.
These dictionaries are often tens or hundreds of KB (some even several MB), and it still takes some courage to use them on the browser side. So when we encounter the need to convert Chinese characters to Pinyin, it is no wonder that our first reaction is to reject the request (or implement it on the server side).
Now, if I tell you that you can convert Chinese characters to Pinyin in 300 lines of code on the browser side, is it unbelievable?

2. Starting from the Android 4.2.2 contact code

Again emphasize this blog - using the Android source code, Easily convert Chinese characters to Pinyin.
Today I would like to share with you a solution for converting Chinese characters into Pinyin extracted from the Android system source code. With just one class and more than 560 lines of code, you can easily implement the function of converting Chinese characters into Pinyin without any other third party. rely.
Has this broken your thinking pattern: Is there any powerful algorithm that can abandon the dictionary?
After reading the blog for the first time, I was a little disappointed. There was no algorithm analysis. It just introduced the hundreds of lines of code discovered from the Android code. The second time I read the code with the idea of porting it to JavaScript, I finally understood the principle, so I started the journey of porting.

3. Teach you step by step how to convert Chinese characters to Pinyin with 300 lines of JavaScript code

First, let’s get straight to the core: why converting Chinese characters to Pinyin requires a huge dictionary of thinking Settlement?
Because the arrangement of Chinese characters has nothing to do with pinyin, for example, in the Chinese character range \u4E00-\u9FFF, the former may be ha, and the latter may be ze. There is no way to associate the unicode of Chinese characters with pinyin, so we can only There is a huge dictionary that records the pinyin of every Chinese character (or commonly used Chinese character).
However, suppose we can sort all Chinese characters by pinyin, such as 'A','AI','AN','ANG','AO','BA',...,'ZUI',' ZUN','ZUO' sorting, then we only need to remember the first Chinese character of each Chinese character queue with the same pinyin. Then, the required dictionary will be very small (just cover all pinyin, the number of pinyin itself is not large).
Now, the difficulty is to sort the Chinese characters by pinyin. Fortunately, the ICU/localization related API provides this sorting API (if there were no convenient sorting/comparison methods, this article may not appear).

So, this is why 300 lines can be used to convert Chinese characters to Pinyin: Intl.CollatorAPI: Intl.Collator internally implements localization-related string sorting. We can basically sort all Chinese characters according to Pinyin through Intl.Collator.prototype.compare.
Boundary Chinese character table: records the sorted boundary points. Each Chinese character in this Chinese character table is the first Chinese character in a set of Chinese characters with the same pinyin after sorting (Eachunihansisthefirstonewithinsamepinyinwhencollatoriszh_CN).
Speaking of this, there may still be something that is not clearly explained, so I will directly upload a piece of code:

For interested students You can execute the script.js above node--icu-data-dir=node_modules/full-icu to see if it is basically sorted by pinyin. Chinese character table.

Here are a few points to note:

I bolded "Basic" again because the list of Chinese characters we got is not completely sorted according to Pinyin. There are occasionally some other Pinyin Chinese characters inserted in the middle. This needs to be done when making the boundary table. Extra attention.
The table obtained in the above script is the sorting of all Chinese characters. Some of them are different from the table of HanziToPinyin.java in the Android code, so the table of HanziToPinyin.java needs to be updated. (The biggest pitfall and workload in switching from Java to JavaScript: correcting the boundary table)
I believe everyone has seen the core code: constCOLLATOR=newIntl.Collator(['zh-Hans-CN']), Intl.Collator
(The locale specified here is China zh-Hans-CN) is the key to sorting Chinese characters by pinyin. It is an Internationalization API that sorts strings in locale-specific order.
Please first npmifull-icu when executing the script. This dependency will automatically install the missing Chinese support and prompt how to specify the ICU data file to execute the script.
1.ICUICU stands for InternationalComponentsforUnicode, which provides Unicode and internationalization support for applications.
ICUisamature,widelyusedsetofC/C++andJavalibrariesprovidingUnicodeandGlobalizationsupportforsoftwareapplications.ICUiswidelyportableandgivesapplicationsthesameresultsonallplatformsandbetweenC/C++andJavasoftware.
And ICU provides localized string comparison services (UnicodeCollationAlgorithm+locally specific comparison rules):
Collation:Comparestringsaccordingtotheconvention sandstandardsofaparticularlanguage,regionorcountry. ICU'scollationisbasedontheUnicodeCollationAlgorithmpluslocale-specificcomparisonrulesfromtheCommonLocaleDataRepository,acomprehensivesourceforthistypeofdata.
On modern browsers, generally ICU has built-in support for the user's local language, and we can use it directly.
But for node.js, usually, ICU only contains a subset (usually English), so we need to add support for Chinese ourselves. Generally speaking, you can install full-icu
through npminstallfull-icu to install missing Chinese support. (See node--icu-data-dir=node_modules/full-icu above).
2.IntlAPI The previous section should basically explain the knowledge related to internationalization/localization. Here we will add the use of built-in API. How to check whether the user language and Runtime support this language? Intl.Collator.supportedLocalesOf(array|string)
Returns an array containing locales that are supported (without falling back to the default locale). The parameter can be an array or a string, which is the locales you want to test (i.e. BCP47languagetag).

Construct the Collator object and sort the string

through Intl.Collator.prototype. compare, we can sort strings in the order specified by the language. In Chinese, this sorting happens to be mostly in pinyin order, 'A', 'AI', 'AN', 'ANG', 'AO', 'BA', 'BAI', 'BAN' ,'BANG','BAO','BEI','BEN','BENG','BI','BIAN','BIAO','BIE','BIN','BING','BO',' BU','CA','CAI','CAN',...
, this is the key to converting Chinese characters to Pinyin as we mentioned above.

4. Boundary table correction

Obviously, there is a problem with this boundary table and needs to be corrected.
We can see that most of the Chinese characters have been converted into qing. It can be seen that there is a problem with the Chinese character corresponding to the pinyin of qing.
Found this Chinese character, it is '\u72c5'/'狅', plus one character before and after, ['\u4eb2','\u72c5','\u828e']/["情","狅", "芎"]
.
Search, '\u72c5'/'狅' can be read as qing, but now it is read as kuang, which should be the cause of the error.
According to the initial sorting list of all Chinese characters, the first Chinese character of qing is '\u9751'/'靑'.
After the change, only 104 failed conversions.

The above is the detailed content of Detailed explanation of how JavaScript uses 300 lines of code to convert Chinese characters to Pinyin. For more information, please follow other related articles on the PHP Chinese website!

Statement

The content of this article is voluntarily contributed by netizens, and the copyright belongs to the original author. This site does not assume corresponding legal responsibility. If you find any content suspected of plagiarism or infringement, please contact admin@php.cn

es6数组怎么去掉重复并且重新排序May 05, 2022 pm 07:08 PM

去掉重复并排序的方法：1、使用“Array.from(new Set(arr))”或者“[…new Set(arr)]”语句，去掉数组中的重复元素，返回去重后的新数组；2、利用sort()对去重数组进行排序，语法“去重数组.sort()”。

JavaScript的Symbol类型、隐藏属性及全局注册表详解Jun 02, 2022 am 11:50 AM

本篇文章给大家带来了关于JavaScript的相关知识，其中主要介绍了关于Symbol类型、隐藏属性及全局注册表的相关问题，包括了Symbol类型的描述、Symbol不会隐式转字符串等问题，下面一起来看一下，希望对大家有帮助。

原来利用纯CSS也能实现文字轮播与图片轮播！Jun 10, 2022 pm 01:00 PM

怎么制作文字轮播与图片轮播？大家第一想到的是不是利用js，其实利用纯CSS也能实现文字轮播与图片轮播，下面来看看实现方法，希望对大家有所帮助！

JavaScript对象的构造函数和new操作符（实例详解）May 10, 2022 pm 06:16 PM

本篇文章给大家带来了关于JavaScript的相关知识，其中主要介绍了关于对象的构造函数和new操作符，构造函数是所有对象的成员方法中，最早被调用的那个，下面一起来看一下吧，希望对大家有帮助。

JavaScript面向对象详细解析之属性描述符May 27, 2022 pm 05:29 PM

本篇文章给大家带来了关于JavaScript的相关知识，其中主要介绍了关于面向对象的相关问题，包括了属性描述符、数据描述符、存取描述符等等内容，下面一起来看一下，希望对大家有帮助。

javascript怎么移除元素点击事件Apr 11, 2022 pm 04:51 PM

方法：1、利用“点击元素对象.unbind("click");”方法，该方法可以移除被选元素的事件处理程序；2、利用“点击元素对象.off("click");”方法，该方法可以移除通过on()方法添加的事件处理程序。

整理总结JavaScript常见的BOM操作Jun 01, 2022 am 11:43 AM

本篇文章给大家带来了关于JavaScript的相关知识，其中主要介绍了关于BOM操作的相关问题，包括了window对象的常见事件、JavaScript执行机制等等相关内容，下面一起来看一下，希望对大家有帮助。

foreach是es6里的吗May 05, 2022 pm 05:59 PM

foreach不是es6的方法。foreach是es3中一个遍历数组的方法，可以调用数组的每个元素，并将元素传给回调函数进行处理，语法“array.forEach(function(当前元素,索引,数组){...})”；该方法不处理空数组。

See all articles

Hot AI Tools

Undresser.AI Undress

AI-powered app for creating realistic nude photos

AI Clothes Remover

Online AI tool for removing clothes from photos.

Undress AI Tool

Undress images for free

Clothoff.io

AI clothes remover

AI Hentai Generator

Generate AI Hentai for free.

Hot Article

R.E.P.O. Energy Crystals Explained and What They Do (Yellow Crystal)

2 weeks agoBy尊渡假赌尊渡假赌尊渡假赌

How Long Does It Take To Beat Split Fiction?

1 months agoByDDD

R.E.P.O. Save File Location: Where Is It & How to Protect It?

1 months agoByDDD

R.E.P.O. Best Graphic Settings

2 weeks agoBy尊渡假赌尊渡假赌尊渡假赌

Assassin's Creed Shadows: Seashell Riddle Solution

1 weeks agoByDDD

Hot Tools

SAP NetWeaver Server Adapter for Eclipse

Integrate Eclipse with SAP NetWeaver application server.

SublimeText3 Linux new version

SublimeText3 Linux latest version

MinGW - Minimalist GNU for Windows

This project is in the process of being migrated to osdn.net/projects/mingw, you can continue to follow us there. MinGW: A native Windows port of the GNU Compiler Collection (GCC), freely distributable import libraries and header files for building native Windows applications; includes extensions to the MSVC runtime to support C99 functionality. All MinGW software can run on 64-bit Windows platforms.