Home  >  Article  >  Database  >  How to index string fields in MySQL

How to index string fields in MySQL

王林
王林forward
2023-05-28 14:38:522424browse

Assume that you currently maintain a system that supports email login. The user table is defined like this:

create table SUser(
 ID bigint unsigned primary key,
 email varchar(64), 
 ... 
 )engine=innodb;

Since you need to use email to log in, statements similar to this must appear in the business code:

select f1, f2 from SUser where email='xxx';

If there is no index on the email field, then this statement can only perform a full table scan.

1) Can I build an index on the email address field?

  • MySQL supports prefix index, you can define a part of the string as the index

2) If you create an index What happens if the statement does not specify the prefix length?

  • The index will contain the entire string

3) Can you give me an example?

alter table SUser add index index1(email);
 或
 alter table SUser add index index2(email(6));
  • index1 In the index, it contains the entire string of each record

  • index2 In the index, for each record Just take the first 6 bytes

#4) What are the differences between these two different definitions in data structure and storage?

How to index string fields in MySQL

How to index string fields in MySQL

It is obvious that the email(6) index structure will occupy less space

5) Email(6) Are there any disadvantages to this index structure?

  • may increase the number of additional record scans

6) The following statement is defined in these two indexes How are the following implemented?

select id,name,email from SUser where email='zhangssxyz@xxx.com';

index1 (that is, the index structure of the entire email string), execution sequence

  • Find the index value that satisfies the index value from the index1 index tree and is ’zhangssxyz@xxx For this record of .com’, get the value of ID2;

  • Go back to the table and find the row whose primary key value is ID2, judge that the value of email is correct, and add this row of records Result set;

  • Continue to the next record in the index index tree, and find that the condition of email='zhangssxyz@xxx.com’ is no longer met, and the loop ends.

In this process, you only need to retrieve data from the primary key index once, so the system thinks that only one row has been scanned.

index2 (i.e. email(6) index structure), execution sequence

  • Find the record that satisfies the index value of "zhangs" from the index2 index tree, and find the first One is ID1;

  • Go to the primary key and find the row whose primary key value is ID1. It is judged that the value of email is not "zhangssxyz@xxx.com", and this row of records is discarded;

  • Get the next record at the location just found on index2, and find that it is still "zhangs". Take out ID2, then get the entire row on the ID index and judge, this time the value By the way, add this row of records to the result set;

  • Repeat the previous step until the value obtained on idxe2 is not "zhangs", and the loop ends.

In this process, the primary key index needs to be retrieved 4 times, that is, 4 rows are scanned.

7) What conclusions can be drawn from the above comparison?

  • #Using a prefix index may cause the query statement to read data more times.

8) Are prefix indexes really useless?

  • If the index2 we define is not email(6) but email(7), then there is only one record that satisfies the prefix ’zhangss’, and ID2 can be directly found. , scanning only one line and it ends.

#9) So what are the precautions for using prefix index?

  • Length selection is reasonable

10) When creating a prefix index for a string, how do I know what I should do? What length of prefix index should be used?

  • # Count how many different values ​​there are on the index to determine how long a prefix should be used.

11) How to count how many different values ​​there are on the index?

select count(distinct email) as L from SUser;

12) What should we do next after getting how many different values ​​corresponding to the index?

  • Select prefixes of different lengths in turn to see this value

    select 
       count(distinct left(email,4))as L4,
       count(distinct left(email,5))as L5,
       count(distinct left(email,6))as L6,
       count(distinct left(email,7))as L7,
     from SUser;
  • Then, among L4~L7, find the first A value not less than L * 95% means that more than 95% of the data can be found through this index.

13) What is the impact of prefix index on covering index?

The following SQL statement:

select id,email from SUser where email='zhangssxyz@xxx.com';

Compared with the SQL statement

select id,name,email from SUser where email='zhangssxyz@xxx.com';

in the previous example, the first statement only requires the return of the id and email fields.

  • If you use index1 (that is, the index structure of the entire email string), you can get the ID by checking the email. Then there is no need to return the table. This is a covering index.

  • 用 index2(即 email(6) 索引结构)的话,就不得不回到 ID 索引再去判断 email 字段的值。

14)那我把index2 的定义修改为 email(18) 的前缀索引不就行了?

  • 这个18是你自己定义的,系统不知道18这个长度是否已经大于我的email长度,所以它还是会回表去查一下验证。

总而言之:使用前缀索引就用不上覆盖索引对查询性能的优化了

针对类似于邮箱这样的字段,使用前缀索引可能会产生不错的效果。但是,遇到身份证这种前缀的区分度不够好的情况时,我们要怎么办呢?

  • 索引选取的要更长一些。

    • 但是所以越长的话,占的磁盘空间更大,相同的一页能放下的索引值就变少了,反而会影响查询效率。

16)如果我们能够确定业务需求里面只有按照身份证进行等值查询的需求,还有没有别的处理方法呢?

  • 既然正过来相同的多,那我就把它倒过来存。查询时候这样查

    select field_list from t where id_card = reverse('input_id_card_string');

    使用 的时候用count(distinct) 方法去做个验证

  • 使用 hash 字段。在表上再创建一个整数字段,来保存身份证的校验码,同时在这个字段上创建索引。

    alter table t add id_card_crc int unsigned, add index(id_card_crc);

    新记录插入时必须使用 crc32() 函数生成校验码,并填入新字段中。由于校验码可能存在冲突,也就是说两个不同的身份证号通过 crc32() 函数得到的结果可能是相同的,所以你的查询语句 where 部分要判断 id_card 的值是否精确相同。

    select field_list from t where id_card_crc=crc32('input_id_card_string') and id_card='input_id_card_string'

    这样,索引的长度变成了 4 个字节(int类型),比原来小了很多

17)使用倒序存储和使用 hash 字段这两种方法有什么异同点?

  • 相同点:都不支持范围查询

    • 倒序存储的字段上创建的索引是按照倒序字符串的方式排序的,已经没有办法利用索引方式查出身份证号码在[ID_X, ID_Y]的所有市民了。同样地,hash 字段的方式也只能支持等值查询。

  • 区别

    • 从占用的额外空间来看,倒序存储方式在主键索引上,不会消耗额外的存储空间,而 hash 字段方法需要增加一个字段。当然,倒序存储方式使用 4 个字节的前缀长度应该是不够的,如果再长一点,这个消耗跟额外这个 hash 字段也差不多抵消了。

    • 在 CPU 消耗方面,倒序方式每次写和读的时候,都需要额外调用一次 reverse 函数,而 hash 字段的方式需要额外调用一次 crc32() 函数。以仅考虑这两个函数的计算复杂度为前提,reverse 函数对 CPU 资源的额外消耗将较少。

    • 就查询性能而言,采用哈希字段方式的查询更具可靠性。虽然crc32算法不可避免地存在冲突的风险,但这种风险极其微小,因此我们可以认为查询时平均扫描行数接近于1。使用倒序存储方式仍然需要使用前缀索引来进行扫描,因此会增加扫描的行数。

案例:如果你在维护一个学校的学生信息数据库,学生登录名的统一格式是”学号 @gmail.com", 而学号的规则是:十五位的数字,其中前三位是所在城市编号、第四到第六位是学校编号、第七位到第十位是入学年份、最后五位是顺序编号。

学生必须输入正确的登录名和密码,方可继续使用系统。如果只考虑登录验证这个行为,你会如何为登录名设计索引?

  • 如果一个学校每年预计2万新生,50年才100万记录,如果直接使用全字段索引,可以节省多少存储空间?。除非遇到超大规模数据,否则不需要使用后两种方法,从而避免了开发转换和限制风险

  • 在实际操作中,只需对所有字段进行索引,一个学校的数据库数据量和查询负担不会变得很大。 如果单从优化数据表的角度: \1. 后缀@gmail可以单独一个字段来存,或者用业务代码来保证, \2. 城市编号和学校编号估计也不会变,也可以用业务代码来配置 \3. 然后直接存年份和顺序编号就行了,这个字段可以全字段索引

The above is the detailed content of How to index string fields in MySQL. For more information, please follow other related articles on the PHP Chinese website!

Statement:
This article is reproduced at:yisu.com. If there is any infringement, please contact admin@php.cn delete