PHP Tutorial

Optimization of high-concurrency low-cardinality multi-field arbitrary combination queries_PHP tutorial

WBOYWBOYWBOYWBOYWBOYWBOYWBOYWBOYWBOYWBOYWBOYWBOYWB

Jul 12, 2016 am 08:57 AM

android

Optimization of high concurrent low cardinality multi-field arbitrary combination query

1. Question

First explain the "low cardinality multi-field arbitrary combination query" that appears in this title. something. This refers to queries that meet the following conditions:
1. The search conditions involve a combination of multiple field conditions
2. The combination of these fields is uncertain
3. The selection of each individual field The sex is not good

This type of query has many usage scenarios, such as the product display page of e-commerce. Users will enter various combinations of query conditions: category, supplier, brand, promotion, price, etc..., and in the end they often have to sort and paginate the results.

The troublesome aspects of this type of problem are:
1. There are a large number of records. If a full table scan is performed, the performance will be low and it cannot meet the requirements of high concurrent access.
2. The selectivity of any single field involved in the query conditions is very low, and the query efficiency problem cannot be solved through a single field index.
3. If you build an ordinary Btree multi-field index, hundreds or thousands of indexes may be built due to too many combinations of user input conditions, which is unrealistic and difficult to maintain.

2. Solution

There are 2 solutions that I think of for this kind of problem

2.1 bitmap index

The characteristic of bitmap is to store the key and all values are equal to The bitmap of the row set of this key. For combined queries involving multiple keys, you only need to perform an AND or operation on the bitmaps corresponding to these keys. Since the size of bitmap is very small and the efficiency of bit AND-OR operations is also very high, bitmap is very suitable for this type of query.
Bitmap index also has disadvantages. Updating a record will lock the entire table, which is not suitable for scenarios with many concurrent writes. Another problem is that among common relational databases, only Oracle seems to support bitmap indexes, and many times we want to use open source databases.

2.2 Inverted index

Inverted index is similar to bitmap. It stores the key and the row set whose value is equal to this key. The row set may be a list or a tree or Other storage forms. For combined queries of multiple keys, just perform set operations on the results of these keys.
Inverted index is generally used for full-text retrieval, but many systems also use it to support structured data search, such as Elasticsearch. Elasticsearch supports fast search of JSON documents, compound queries, sorting, aggregation, distributed deployment and many other great features. But considering the following factors, we prefer to find a solution in a relational database.
-There is no need to use the advanced features provided by search engines for fuzzy matching. In fact, we need exact matching or simple fuzzy matching.
-The amount of data is not large enough to require a distributed search cluster.
-The original data is already in the relational database, so you don’t want to worry about data synchronization.
-I have developed an application based on the interface of a relational database and do not want to start over.
-Having mastered the operation and maintenance management of relational databases, I don’t know how many pitfalls I have to go through in a brand new system.
- Considering the performance difference between Java and C, the performance of the built-in relational database solution may not be inferior to that of professional search engines.

3. PostgreSQL solution

If the scope of the solution is limited to open source relational databases, there may be only one answer, which is PostgreSQL's gin index.
PostgreSQL's gin index is an inverted index. It is not only used for full-text search but also for regular data types, such as int and varchar.
For multi-dimensional queries, we can build indexes like this:
1. Create a unique multi-field gin index for all low-cardinality fields involved in equivalent conditions
2. For equivalent queries with good selectivity or For the fields involved in the range query, build a btree index

Some students may have questions. It is also a multi-field index. Why does gin's multi-field index only need to build one, but btree's multi-field index does? Consider various query combinations and build several. This is because each field in the gin multi-field index is equivalent and there is no leading field. Therefore, as long as a unique gin multi-field index is built, all query combinations can be covered; while the btree multi-field index is different. If the query conditions do not contain the suoyi leading field, the index cannot be used.

Each key stored internally in the multi-field gin index is in the form of (column number, key datum), so different fields can be distinguished without confusion. The stored value is the set of ctids of all records matching key. This set is stored in the form of btree when the number of records is relatively large, and has been compressed, so the storage space occupied by the gin index is very small, only about one-twentieth of the equivalent btree index, which is also derived from another Improved performance.

For multiple fields involved in multi-dimensional queries, fields included in the multi-field gin index, the ctid set is merged by the gin index (taking the union or intersection), and then the obtained ctid set and other indexes are obtained The ctid collection is then merged with BitmapAnd or BitmapOr. The efficiency of merging ctid sets within the gin index is much higher than that of merging ctid sets between indexes, and the gin index optimizes low-cardinality fields better, so making full use of the characteristics of the gin index is better than building a separate btree index for each field and then using BitmapAnd Or BitmapOr merges the result set much more efficiently.

4. A real case

4.1 Original query

The following SQL is a simplified version of a real SQL in a system.

SELECT CASE WHEN gpppur.GB_BEGINDATE '2016-02-29 14: 36:00' THEN 1
WHEN gpppur.PREVIEW_BEGINDT '2016-02-29 14:36: 00' THEN 2
ELSE 3 END AS flag,
gpppur.*
FROM T_MPS_INFO gpppur
WHERE gpppur.ATTRACT_TP = 0
AND gpppur.COLUMN_ID = 1
AND gpppur.FIELD2 = 1
AND gpppur. STATUS = 1
ORDER BY flag ASC,gpppur.PC_SORT_NUM ASC,gpppur.GB_BEGINDATE DESC
LIMIT 0,45

Using MySQL database , the total data volume is 60w, in which a multi-field index of FIELD2 STATUS is built.
The value distribution of the four fields involved in the query conditions is as follows:

postgres=# select ATTRACT_TP,count(*) from T_MPS_INFO group by ATTRACT_TP;
attract_tp | count
------------ --------
| 16196
6 | 251
2 | 50
1 | 3692
3 | 143
10 | 314
4 | 214
5 | 194333
9 | 326485
7 | 1029
0 | 6458
(11 rows)
postgres=# select COLUMN_ID,count(*) from T_MPS_INFO group by COLUMN_ID;
column_id | count
------------ - -------
| 2557
285 | 20
120 | 194
351 | 2
337 | 79
227 | 26
311 | 9
347 | 2
228 | 21
318 | 1
314 | 9
54 | 10
133 | 27
2147483647 | 1
336 | 1056
364 | 1
131 | 10
243 | 5
115 | 393
61 | 73
226 | 40
196 | 16
350 | 5
373 | 72
377 | 2
260 | 4
184 | 181
363 | 1
341 | 392
64 | 1
344 | 199271
235 | 17
294 | 755
352 | 3
368 | 1
225 | 1
199 | 8
374 | 2
248 | 8
84 | 1
362 | 1
361 | 331979
319 | 7
244 | 65
125 | 2
130 | 1
272 | 65
66 | 2
240 | 2
775 | 1
253 | 49
60 | 45
121 | 5
257 | 3
365 | 1
0 | 1
217 | 5
270 | 1
122 | 39
56 | 49
355 | 5
161 | 1
329 | 1
222 | 9
261 | 275
2 | 3816
57 | 19
307 | 4
310 | 8
97 | 37
202 | 20
203 | 3
85 | 1
375 | 641
58 | 98
1 | 6479
59 | 114
185 | 7
338 | 10
379 | 17
(80 rows)
postgres=# select FIELD2,count(*) from T_MPS_INFO group by FIELD2;
field2 | count
-------- --------
| 2297
6 | 469
2 | 320
1 | 11452
3 | 286
10 | 394
4 | 291
5 | 200497
9 | 331979
0 | 2
7 | 1178
(11 rows)
postgres=# select STATUS,count(*) from T_MPS_INFO group by STATUS;
status | count
- ------- --------
| 2297
0 | 15002
3 | 5
4 | 1
1 | 531829
2 | 31
(6 rows)

Since the value distribution of these fields is extremely uneven, we construct the following Lua script to generate different select statements to simulate the load.
qx.lua:

pathtest = string.match(test, "(.*/)") or ""
dofile(pathtest .. "common.lua")
function thread_init(thread_id)
set_vars()
end
function event(thread_id)
local ATTRACT_TP,COLUMN_ID,FIELD2,STATUS
ATTRACT_TP = sb_rand_uniform(0, 10)
COLUMN_ID = sb_rand_uniform(1, 100)
FIELD2 = sb_rand_uniform(0, 10)
STATUS = sb_rand_uniform(0, 4)
rs = db_query("SELECT CASE WHEN gpppur.GB_BEGINDATE '2016-02-29 14:36:00' THEN 1
WHEN gpppur.PREVIEW_BEGINDT '2016-02-29 14:36:00' THEN 2
ELSE 3 END AS flag,
gpppur.*
FROM T_MPS_INFO gpppur
WHERE gpppur.ATTRACT_TP = "..ATTRACT_TP.."
AND gpppur.COLUMN_ID = "..COLUMN_ID.."
AND gpppur.FIELD2 = "..FIELD2.."
AND gpppur.STATUS = "..STATUS.."
ORDER BY flag ASC,gpppur.PC_SORT_NUM ASC,gpppur.GB_BEGINDATE DESC
LIMIT 45")
end

然后用sysbench进行压测，结果在32并发时测得的qps是64。

[root@rh6375Gt20150507 ~]# sysbench --db-driver=mysql --test=/opt/sysbench-0.5/sysbench/tests/db/qx.lua --mysql-db=test --mysql-user=mysql --mysql-password=mysql --mysql-host=srdsdevapp69 --num-threads=32 --max-time=5 run
sysbench 0.5: multi-threaded system evaluation benchmark
Running the test with following options:
Number of threads: 32
Random number generator seed is 0 and will be ignored
Threads started!
OLTP test statistics:
queries performed:
read: 825
write: 0
other: 0
total: 825
transactions: 0 (0.00 per sec.)
read/write requests: 825 (64.20 per sec.)
other operations: 0 (0.00 per sec.)
ignored errors: 0 (0.00 per sec.)
reconnects: 0 (0.00 per sec.)
General statistics:
total time: 12.8496s
total number of events: 825
total time taken by event execution: 399.6003s
response time:
min: 1.01ms
avg: 484.36ms
max: 12602.74ms
approx. 95 percentile: 222.79ms
Threads fairness:
events (avg/stddev): 25.7812/24.12
execution time (avg/stddev): 12.4875/0.23

4.2 优化后的查询

对于上面那个特定的SQL虽然我们可以通过建一个包含所有等值查询条件中4个字段(ATTRACT_TP,COLUMN_ID,FIELD2,STATUS)的组合索引进行优化，但是需要说明的是，这条SQL只是各种查询组合产生的1000多种不同SQL中的一个，每个SQL涉及的查询字段的组合是不一样的，我们不可能为每种组合都单独建一个多字段索引。
所以我们想到了PostgreSQL的gin索引。为了使用PostgreSQL的gin索引，先把MySQL的表定义，索引和数据原封不动的迁移到PostgreSQL。
在添加gin索引前，先做了一个测试。另人惊讶的是，还没有开始进行优化，PostgreSQL测出的性能已经是MySQL的5倍(335/64=5)了。

[root@rh6375Gt20150507 ~]# sysbench --db-driver=pgsql --test=/opt/sysbench-0.5/sysbench/tests/db/qx.lua --pgsql-db=postgres --pgsql-user=postgres --pgsql-password=postgres --pgsql-host=srdsdevapp69 --num-threads=32 --max-time=5 run
sysbench 0.5: multi-threaded system evaluation benchmark
Running the test with following options:
Number of threads: 32
Random number generator seed is 0 and will be ignored
Threads
OLTP test statistics:
queries performed:
read: 1948
write: 0
other: 0
total: 1948
transactions: 0 (0.00 per sec.)
read/write requests: 1948 (335.52 per sec.)
other operations: 0 (0.00 per sec.)
ignored errors: 0 (0.00 per sec.)
reconnects: 0 (0.00 per sec.)
General statistics:
total time: 5.8059s
total number of events: 1948
total time taken by event execution: 172.0538s
response time:
min: 0.90ms
avg: 88.32ms
max: 2885.69ms
approx. 95 percentile: 80.01ms
Threads fairness:
events (avg/stddev): 60.8750/27.85
execution time (avg/stddev): 5.3767/0.29

下一步，添加gin索引。

postgres=# create extension btree_gin;
CREATE EXTENSION
postgres=# create index idx3 on t_mps_info using gin(attract_tp, column_id, field2, status);
CREATE INDEX

再进行压测，测出的qps是5412，是MySQL的85倍(5412/64=85)。

[root@rh6375Gt20150507 ~]# sysbench --db-driver=pgsql --test=/opt/sysbench-0.5/sysbench/tests/db/qx.lua --pgsql-db=postgres --pgsql-user=postgres --pgsql-password=postgres --pgsql-host=srdsdevapp69 --num-threads=32 --max-time=5 run
sysbench 0.5: multi-threaded system evaluation benchmark
Running the test with following options:
Number of threads: 32
Random number generator seed is 0 and will be ignored
Threads
OLTP test statistics:
queries performed:
read: 10000
write: 0
other: 0
total: 10000
transactions: 0 (0.00 per sec.)
read/write requests: 10000 (5412.80 per sec.)
other operations: 0 (0.00 per sec.)
ignored errors: 0 (0.00 per sec.)
reconnects: 0 (0.00 per sec.)
General statistics:
total time: 1.8475s
total number of events: 10000
total time taken by event execution: 58.2706s
response time:
min: 0.95ms
avg: 5.83ms
max: 68.36ms
approx. 95 percentile: 9.42ms
Threads fairness:
events (avg/stddev): 312.5000/47.80
execution time (avg/stddev): 1.8210/0.02

4.3 补充

作为对比，我们又在MySQL上添加了包含attract_tp, column_id, field2和status这4个字段的多字段索引，测出的qps是4000多，仍然不如PostgreSQL。可见业界广为流传的MySQL的简单查询性能优于PostgreSQL的说法不可信！(对于复杂查询PostgreSQL的性能大大优于MySQL应该是大家的共识。我例子中的SQL不能算是复杂查询吧？)

5. 总结

gin索引(还包括类似的gist,spgist索引)是PostgreSQL的一大特色，基于它可以挖掘出很多好玩的用法。对于本文提到的场景，有兴趣的同学可以把它和Oracle的bitmap索引以及基于搜索引擎(Elasticsearch，Solr等)的方案做个对比。另外，本人所知有限，如果有其它更好的方案，希望能让我知道。

Statement

The content of this article is voluntarily contributed by netizens, and the copyright belongs to the original author. This site does not assume corresponding legal responsibility. If you find any content suspected of plagiarism or infringement, please contact admin@php.cn

How do you set the session cookie parameters in PHP?Apr 22, 2025 pm 05:33 PM

Setting session cookie parameters in PHP can be achieved through the session_set_cookie_params() function. 1) Use this function to set parameters, such as expiration time, path, domain name, security flag, etc.; 2) Call session_start() to make the parameters take effect; 3) Dynamically adjust parameters according to needs, such as user login status; 4) Pay attention to setting secure and httponly flags to improve security.

What is the main purpose of using sessions in PHP?Apr 22, 2025 pm 05:25 PM

The main purpose of using sessions in PHP is to maintain the status of the user between different pages. 1) The session is started through the session_start() function, creating a unique session ID and storing it in the user cookie. 2) Session data is saved on the server, allowing data to be passed between different requests, such as login status and shopping cart content.

How can you share sessions across subdomains?Apr 22, 2025 pm 05:21 PM

How to share a session between subdomains? Implemented by setting session cookies for common domain names. 1. Set the domain of the session cookie to .example.com on the server side. 2. Choose the appropriate session storage method, such as memory, database or distributed cache. 3. Pass the session ID through cookies, and the server retrieves and updates the session data based on the ID.

How does using HTTPS affect session security?Apr 22, 2025 pm 05:13 PM

HTTPS significantly improves the security of sessions by encrypting data transmission, preventing man-in-the-middle attacks and providing authentication. 1) Encrypted data transmission: HTTPS uses SSL/TLS protocol to encrypt data to ensure that the data is not stolen or tampered during transmission. 2) Prevent man-in-the-middle attacks: Through the SSL/TLS handshake process, the client verifies the server certificate to ensure the connection legitimacy. 3) Provide authentication: HTTPS ensures that the connection is a legitimate server and protects data integrity and confidentiality.

The Continued Use of PHP: Reasons for Its EnduranceApr 19, 2025 am 12:23 AM

What’s still popular is the ease of use, flexibility and a strong ecosystem. 1) Ease of use and simple syntax make it the first choice for beginners. 2) Closely integrated with web development, excellent interaction with HTTP requests and database. 3) The huge ecosystem provides a wealth of tools and libraries. 4) Active community and open source nature adapts them to new needs and technology trends.

PHP and Python: Exploring Their Similarities and DifferencesApr 19, 2025 am 12:21 AM

PHP and Python are both high-level programming languages that are widely used in web development, data processing and automation tasks. 1.PHP is often used to build dynamic websites and content management systems, while Python is often used to build web frameworks and data science. 2.PHP uses echo to output content, Python uses print. 3. Both support object-oriented programming, but the syntax and keywords are different. 4. PHP supports weak type conversion, while Python is more stringent. 5. PHP performance optimization includes using OPcache and asynchronous programming, while Python uses cProfile and asynchronous programming.

PHP and Python: Different Paradigms ExplainedApr 18, 2025 am 12:26 AM

PHP is mainly procedural programming, but also supports object-oriented programming (OOP); Python supports a variety of paradigms, including OOP, functional and procedural programming. PHP is suitable for web development, and Python is suitable for a variety of applications such as data analysis and machine learning.

PHP and Python: A Deep Dive into Their HistoryApr 18, 2025 am 12:25 AM

PHP originated in 1994 and was developed by RasmusLerdorf. It was originally used to track website visitors and gradually evolved into a server-side scripting language and was widely used in web development. Python was developed by Guidovan Rossum in the late 1980s and was first released in 1991. It emphasizes code readability and simplicity, and is suitable for scientific computing, data analysis and other fields.

See all articles

Hot AI Tools

Undresser.AI Undress

AI-powered app for creating realistic nude photos

AI Clothes Remover

Online AI tool for removing clothes from photos.

Undress AI Tool

Undress images for free

Clothoff.io

AI clothes remover

Video Face Swap

Swap faces in any video effortlessly with our completely free AI face swap tool!

Hot Article

Assassin's Creed Shadows: Seashell Riddle Solution

3 weeks agoByDDD

What's New in Windows 11 KB5054979 & How to Fix Update Issues

2 weeks agoByDDD

Assassin's Creed Shadows - How To Find The Blacksmith And Unlock Weapon And Armour Customisation

1 months agoByDDD

Where to find the Crane Control Keycard in Atomfall

3 weeks agoByDDD

Roblox: Dead Rails - How To Complete Every Challenge

3 weeks agoByDDD

Hot Tools

Atom editor mac version download

The most popular open source editor

SublimeText3 English version

Recommended: Win version, supports code prompts!

mPDF

mPDF is a PHP library that can generate PDF files from UTF-8 encoded HTML. The original author, Ian Back, wrote mPDF to output PDF files "on the fly" from his website and handle different languages. It is slower than original scripts like HTML2FPDF and produces larger files when using Unicode fonts, but supports CSS styles etc. and has a lot of enhancements. Supports almost all languages, including RTL (Arabic and Hebrew) and CJK (Chinese, Japanese and Korean). Supports nested block-level elements (such as P, DIV),

DVWA

Damn Vulnerable Web App (DVWA) is a PHP/MySQL web application that is very vulnerable. Its main goals are to be an aid for security professionals to test their skills and tools in a legal environment, to help web developers better understand the process of securing web applications, and to help teachers/students teach/learn in a classroom environment Web application security. The goal of DVWA is to practice some of the most common web vulnerabilities through a simple and straightforward interface, with varying degrees of difficulty. Please note that this software

MinGW - Minimalist GNU for Windows

This project is in the process of being migrated to osdn.net/projects/mingw, you can continue to follow us there. MinGW: A native Windows port of the GNU Compiler Collection (GCC), freely distributable import libraries and header files for building native Windows applications; includes extensions to the MSVC runtime to support C99 functionality. All MinGW software can run on 64-bit Windows platforms.

Hot Topics

Where is the login entrance for gmail email?

7645

CakePHP Tutorial

1392

What is the format of the account name of steam

win11 activation key permanent

nyt connections hints and answers

152