MySQL 5.7 supports the GB18030 Chinese Character Set

My former boss at MySQL sent out a notice that MySQL 5.7.4 now supports the GB18030 character set, thus responding to requests that have been appearing since 2005. This is a big deal because the Chinese government demands GB18030 support, and because the older simplified-Chinese character sets (gbk and gb2312) have a much smaller repertoire (that is, they have too few characters). And this is real GB18030 support -- I can define columns and variables with CHARACTER SET GB18030. That's rare -- Oracle 12c and SQL Server 2012 and PostgreSQL 9.3 can't do it. (They allow input from GB18030 clients but they convert it immediately to Unicode.) Among big-time DBMSs, until now, only DB2 has treated GB18030 as a first-class character set.
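As a minimal sketch of what "first-class" means here, assuming a server at 5.7.4 or later: the table and column names below are made up for illustration, and the user variable stands in for "variables" in general.

```sql
-- Hypothetical names (gb18030_demo, note); assumes MySQL 5.7.4 or later.
CREATE TABLE gb18030_demo (
  id   INT PRIMARY KEY,
  note VARCHAR(20) CHARACTER SET gb18030
);

-- A user variable takes on the character set of the string assigned to it.
SET @v = CONVERT('abc' USING gb18030);
SELECT CHARSET(@v) AS variable_charset;

-- The character set declared for the column is recorded in the data dictionary.
SELECT character_set_name, collation_name
  FROM information_schema.columns
 WHERE table_schema = DATABASE()
   AND table_name   = 'gb18030_demo'
   AND column_name  = 'note';
```

The point of the second and third statements is that the data stays in GB18030 inside the server, rather than being converted to Unicode on the way in.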

Standard Adherence

We're talking about the current version of the standard, GB18030-2005 "IT Chinese coded character set", especially its description of 70,244 Chinese characters. I couldn't puzzle out the Chinese wording in the official document; all I could do was use translate.google.com on some excerpts. But I've been told that the MySQL person who coded this feature is Chinese, so they'll have had better luck. What I could understand was which characters are the difficult ones, what the requirements are for a claim of support, and what the encoding should look like. From the coder's comments, it's clear that part was understood. I did not check whether there was adherence for non-mandatory parts, such as Tibetan script.

Conversions

The repertoire of GB18030 ought to be the same as the Unicode repertoire. So I took a list of every Unicode character, converted to GB18030, and converted back to Unicode. The result in every case was the same Unicode character that I'd started with. That's called "perfect round tripping". As I explained in an earlier blog post, "The UTF-8 World Is Not Enough", storing Chinese characters with a Chinese character set has certain advantages. Well, now the biggest disadvantage has disappeared.
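The round-trip check can be sketched for a single character as follows. This is only an illustration of the idea, not the script used for the full test: it checks one code point (U+4E2D, chosen arbitrarily), and a full run would repeat it for every code point in the list.

```sql
-- U+4E2D written as a utf32 hex literal, so the client character set does not matter.
SET @u = _utf32 X'00004E2D';

-- Unicode -> GB18030
SET @g = CONVERT(@u USING gb18030);

-- GB18030 -> Unicode, compared with the starting point
SELECT HEX(@u)                      AS utf32_before,
       HEX(@g)                      AS gb18030_bytes,
       HEX(CONVERT(@g USING utf32)) AS utf32_after,
       CONVERT(@g USING utf32) = @u AS round_trip_ok;
```

If the round trip is perfect for that character, utf32_before and utf32_after are identical and round_trip_ok is 1.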

Hold on -- how is perfect round tripping possible, given that MySQL frequently refers to Unicode 4.0, and some of the characters in GB18030-2005 are only in Unicode 4.1? Certainly that ought to be a problem according to the Unicode FAQ and this extract from Ken Lunde's book. But it turns out to be okay because MySQL doesn't actually disallow those characters -- it accepts encodings which are not assigned to characters. Of course I believe that MySQL should have upgraded the Unicode support first, and added GB18030 support later. But the best must not be an enemy of the good.

Also the conversions to and from gb2312 work fine, so I expect that eventually gb2312 will become obsolete. It's time for mainland Chinese users to consider switching over to gb18030 once MySQL 5.7 is GA.

Collations

The new character set comes with three collations: one trivial, one tremendous, one tsk, tsk.

The trivial collation is gb18030_bin. As always the bin stands for binary. I expect that as always this will be the most performant collation, and the only one that guarantees that no two characters will ever have the same weight.

The tremendous collation is gb18030_unicode_520_ci. The "unicode_520" part of the name really does mean that the collation table comes from "Unicode 5.2", and this is the first time that MySQL has taken to heart the maxim: what applies to the superset can apply to the subset. In fact all MySQL character sets should have Unicode collations, because all their characters are in Unicode. So to test this, I went through all the Unicode characters and their GB18030 equivalents, and compared their weights with WEIGHT_STRING:
WEIGHT_STRING(utf32_char COLLATE utf32_unicode_520_ci) to
WEIGHT_STRING(gb18030_char COLLATE gb18030_unicode_520_ci).
Every utf32 weight was exactly the same as the gb18030 weight.
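For one character, the weight comparison could look something like this. It is a sketch under the same assumptions as the description above; the code point (U+4E2D) is an arbitrary pick, and HEX() is there only to make the binary weight strings readable.

```sql
-- One Unicode character, as a utf32 hex literal.
SET @u = _utf32 X'00004E2D';

SELECT HEX(WEIGHT_STRING(@u COLLATE utf32_unicode_520_ci)) AS utf32_weight,
       HEX(WEIGHT_STRING(CONVERT(@u USING gb18030)
                         COLLATE gb18030_unicode_520_ci))  AS gb18030_weight;
```

If the two collations really share one table, the two columns come back identical for every character tried.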

The tsk, tsk collation is gb18030_chinese_ci.

The first bad thing is the suffix chinese_ci, which will make some people think that this collation is like gb2312_chinese_ci. (Such confusion has happened before for the general_ci suffix.) In fact there are thousands of differences between gb2312_chinese_ci and gb18030_chinese_ci. Here's an example.

mysql> CREATE TABLE t5
    -> (gb2312 CHAR CHARACTER SET gb2312 COLLATE gb2312_chinese_ci,
    ->  gb18030 CHAR CHARACTER SET gb18030 COLLATE gb18030_chinese_ci);
Query OK, 0 rows affected (0.22 sec)

mysql> INSERT INTO t5 VALUES ('[','['),(']',']');
Query OK, 2 rows affected (0.01 sec)
Records: 2  Duplicates: 0  Warnings: 0

mysql> SELECT DISTINCT gb2312 from t5 ORDER BY gb2312;
+--------+
| gb2312 |
+--------+
| ]      |
| [      |
+--------+
2 rows in set (0.00 sec)

mysql> SELECT DISTINCT gb18030 from t5 ORDER BY gb18030;
+---------+
| gb18030 |
+---------+
| [       |
| ]       |
+---------+
2 rows in set (0.00 sec)

See the difference? The gb18030 order is obviously better -- ']' should be greater than '[' -- but when two collations are wildly different they shouldn't both be called "chinese_ci".

The second bad thing is the algorithm. The new chinese_ci collation is based on pinyin for Chinese characters, and binary comparisons of the UPPER() values for non-Chinese characters. This is pretty well useless for non-Chinese. I can bet that somebody will observe "well, duh, it's a Chinese character set" -- but I can't see why one would use an algorithm for Latin/Greek/Cyrillic/etc. characters that's so poor. There's a Common Locale Data Repository for tailoring for Chinese, there are MySQL worklog tasks that explain the brave new world, there's no need to invent an idiolect when there's a received dialect.
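One way to see the effect on non-Chinese text is to compare the same pair of letters under both collations. This is a hedged sketch: the two letters (U+00E9 and U+0066) are chosen only for illustration, and no result is claimed here, since the outcome depends on exactly how the server implements the rule described above.

```sql
-- Two Latin letters, converted to gb18030: U+00E9 (e with acute) and U+0066 (f).
SET @a = CONVERT(_utf32 X'000000E9' USING gb18030);
SET @b = CONVERT(_utf32 X'00000066' USING gb18030);

SELECT @a COLLATE gb18030_chinese_ci     < @b COLLATE gb18030_chinese_ci     AS chinese_ci_says_less,
       @a COLLATE gb18030_unicode_520_ci < @b COLLATE gb18030_unicode_520_ci AS unicode_520_says_less,
       HEX(UPPER(@a))                                                        AS upper_a_bytes,
       HEX(UPPER(@b))                                                        AS upper_b_bytes;
```

The last two columns show the byte values that a binary comparison of UPPER() results would be working from, which makes any disagreement between the first two columns easier to explain.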

Documentation

The documentation isn't up to date yet -- there's no attempt to explain what the new character set and its collations are about, and no mention at all in the FAQ.

But the worklog task WL#4024: gb18030 Chinese character set gives a rough idea of what the coder had in mind before starting. It looks as if WL#4024 was partly copied from http://icu-project.org/docs/papers/unicode-gb18030-faq.html so that's also worth a look.

For developers who just need to know what's going on now, just re-read this blog post. What I've described should be enough for people who care about Chinese.

I didn't look for bugs with full-text or LIKE searches, and I didn't look at speed. But I did look hard for problems with the essentials, and found none. Congratulations are due.
