How to query quantile value in MySQL

王林
2023-05-27 16:36:28

Background

The concept of quantile value

In statistics and data analysis, quantiles are often used to describe the distribution of a data set. The most common case is the quartiles, which split the sorted data into four equal parts: the first quartile (Q1), the second quartile (Q2, which is the median) and the third quartile (Q3). The interquartile range (IQR) is not a fourth quartile but the difference Q3 - Q1. One quarter of the data is smaller than Q1, one quarter is larger than Q3, and the middle 50% lies between Q1 and Q3.

With the data sorted in ascending order, Q1 is the value at the 25% position, Q2 is the value in the middle position, and Q3 is the value at the 75% position. Quantile values help us understand how the data is distributed: whether it is skewed to one side and how dispersed it is. When the distribution is uneven, quantile values describe the differences in the data more accurately than the mean alone.
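As a small illustration of these definitions, the sketch below computes the quartiles and the IQR with Python's standard `statistics` module. The sample values are assumed data chosen only for illustration, and `statistics.quantiles` is called with its default interpolation method, so other tools may report slightly different Q1 and Q3 values for the same data.

```python
import statistics

# Sample face values (assumed data, chosen only for illustration).
values = [10, 9, 8, 8, 6, 5, 4, 4, 2, 2]

# quantiles(n=4) returns the three cut points Q1, Q2 and Q3.
q1, q2, q3 = statistics.quantiles(values, n=4)
iqr = q3 - q1  # interquartile range: spread of the middle 50%

print("Q1 =", q1, "Q2 (median) =", q2, "Q3 =", q3, "IQR =", iqr)
```

Q2 always coincides with the median; Q1 and Q3 depend on the interpolation rule in use.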

Business Background

Merchants issue coupons with face values in the range [1, 20], and each coupon is marked with its face value. To control coupon cost accurately, the business needs to monitor issuance in real time. Tracking the number of coupons issued, the average face value issued, and the quantile values of the face value issued (that is, the average face value within different ranges of the distribution) gives a clearer picture of how coupons are being issued.

The business side has defined the following metrics and needs the data team to provide them. All metrics use one minute as the statistical granularity:

Issuance volume: the total number of coupons issued

Average coupon amount issued: total face value issued / total number of coupons issued

0.1 quantile mean of coupon amount issued: sort the coupons issued in each minute by face value in descending order, then take the average of the top 10% of them [for example, if the face values issued are 10, 9, 8, 8, 6, 5, 4, 4, 2, 2, the 0.1 quantile mean is 10]

0.2 quantile mean of coupon amount issued: sort the coupons issued in each minute by face value in descending order, then take the average of the top 20% of them [for example, if the face values issued are 10, 9, 8, 8, 6, 5, 4, 4, 2, 2, the 0.2 quantile mean is (10 + 9) / 2 = 9.5]
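The two worked examples above can be checked with a few lines of Python. This is only a sketch of the metric definition, not of the MySQL implementation; the function name `top_percent_mean` is made up for illustration, and rounding the row count half up is an assumption that happens to agree with both examples.

```python
def top_percent_mean(face_values, percent):
    """Mean of the top `percent` % of face values, largest first.

    The row count is rounded half up with integer arithmetic; this
    rounding rule is an assumption chosen to match the worked examples.
    """
    ordered = sorted(face_values, reverse=True)
    keep = (len(ordered) * percent + 50) // 100  # n * percent / 100, rounded half up
    if keep == 0:
        return None  # too few coupons to form this top slice
    return sum(ordered[:keep]) / keep


issued = [10, 9, 8, 8, 6, 5, 4, 4, 2, 2]
print(top_percent_mean(issued, 10))  # 10.0
print(top_percent_mean(issued, 20))  # 9.5
```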

Metrics such as issuance volume and average coupon amount are straightforward to compute in MySQL. But how can MySQL be used to query the quantile values?

Approach

Sorting in MySQL

row_number() over ( partition by a1.min order by metric_value desc) as orderNum

metric_value is the face value of an issued coupon. The window function above (window functions are available in MySQL 8.0 and later) numbers the rows within each minute in descending order of face value, so every coupon gets its rank within its own minute.
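To see what this window function produces, the sketch below runs the same `row_number()` expression on a few made-up rows. It uses Python's built-in `sqlite3` module as a stand-in for MySQL (SQLite 3.25 or later also supports window functions); the table name `coupon_issue` and the sample rows are assumptions for illustration only.

```python
import sqlite3

# Hypothetical table and toy rows; SQLite stands in for MySQL 8.0 here.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE coupon_issue (hour INTEGER, min INTEGER, metric_value INTEGER)"
)
conn.executemany(
    "INSERT INTO coupon_issue VALUES (?, ?, ?)",
    [(11, 0, 5), (11, 0, 10), (11, 0, 8), (11, 1, 3), (11, 1, 7)],
)

# Rank coupons inside each minute, largest face value first.
ranked = conn.execute(
    """
    SELECT min, metric_value,
           row_number() OVER (PARTITION BY min ORDER BY metric_value DESC) AS orderNum
    FROM coupon_issue
    ORDER BY min, orderNum
    """
).fetchall()

for row in ranked:
    print(row)  # (min, metric_value, orderNum)
```

The numbering restarts at 1 for every minute, which is what makes a per-minute cutoff possible.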

Top N in MySQL

SELECT * FROM sales ORDER BY amount DESC LIMIT 10;

This top-N approach cannot rank rows within each minute, and it returns a fixed number of rows rather than the top N%. To know how many rows make up N%, we first need the total number of rows in each minute; multiplying that count by N% gives the number of rows to keep.

-- "table" is a placeholder for the real table name
select hour, min, count(1) as cn
from table
where dt=20230423 and hour=11 and min>=0 and min<=30
group by hour, min

Then we multiply the per-minute count by N% and join the result to the ranked rows:

-- 0.1 stands for N% (here the top 10%)
select dt, a2.hour, a2.min as min, metric_value, round(cn * 0.1) as cn, orderNum
from (
	select dt, hour, a1.min as min,
	metric_value, row_number() over ( partition by a1.min order by metric_value desc) as orderNum
	from table a1
	where dt=20230423 and hour=11 and min>=0 and min<=30
	) as a2
inner join (
	select hour, min, count(1) as cn
	from table c
	where dt=20230423 and hour=11 and min>=0 and min<=30
	group by hour, min ) a3
on a2.hour=a3.hour and a2.min=a3.min

We can then compare cn (the number of rows needed for the quantile calculation) with orderNum (the rank of the current coupon by face value) to keep only the top N% of rows, and apply avg to those rows to obtain the quantile mean.
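The rank, count, compare, and average steps can be tried end to end on toy data. The sketch below again uses Python's `sqlite3` module as a stand-in for MySQL 8.0; the table name `coupon_issue`, the sample rows, and the helper `top_mean_per_minute` are assumptions for illustration. It also assumes that a row counts as "in range" when its rank is at most the rounded row count, which is the rule that reproduces the 0.2 example from the business background.

```python
import sqlite3

# Hypothetical table and toy rows; SQLite stands in for MySQL 8.0 here.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE coupon_issue (hour INTEGER, min INTEGER, metric_value INTEGER)"
)
rows = [(11, 0, v) for v in (10, 9, 8, 8, 6, 5, 4, 4, 2, 2)]
rows += [(11, 1, v) for v in (20, 15, 10, 5, 1)]
conn.executemany("INSERT INTO coupon_issue VALUES (?, ?, ?)", rows)

TOP_MEAN_SQL = """
SELECT a2.hour, a2.min, avg(a2.metric_value) AS top_mean
FROM (
    -- rank every coupon inside its minute, largest face value first
    SELECT hour, min, metric_value,
           row_number() OVER (PARTITION BY hour, min
                              ORDER BY metric_value DESC) AS orderNum
    FROM coupon_issue
) AS a2
INNER JOIN (
    -- number of coupons issued in each minute
    SELECT hour, min, count(1) AS cn
    FROM coupon_issue
    GROUP BY hour, min
) AS a3 ON a2.hour = a3.hour AND a2.min = a3.min
WHERE a2.orderNum <= round(a3.cn * ?)   -- keep only the top fraction
GROUP BY a2.hour, a2.min
ORDER BY a2.hour, a2.min
"""


def top_mean_per_minute(fraction):
    """Return (hour, min, mean of the top `fraction` of coupons) per minute."""
    return conn.execute(TOP_MEAN_SQL, (fraction,)).fetchall()


print(top_mean_per_minute(0.2))  # [(11, 0, 9.5), (11, 1, 20.0)]
```

Minute 0 holds ten coupons, so the top 20% is the two largest (10 and 9); minute 1 holds five, so only the largest (20) is kept.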

Adjusting the logic and combining the steps gives the following SQL for the quantile mean:

-- ? is the quantile fraction passed as a parameter, e.g. 0.1
select dt, hour, min, round(avg(metric_value)) as metric_value
from (
	select dt, a2.hour, a2.min as min, metric_value, round(cn * ?) as cn, orderNum
	from (
		select dt, hour, a1.min as min,
		metric_value, row_number() over ( partition by a1.min order by metric_value desc) as orderNum
		from table a1
		where dt=20230423 and hour=11 and min>=0 and min<=30
		) as a2
	inner join (
		select hour, min, count(1) as cn
		from table a1
		where dt=20230423 and hour=11 and min>=0 and min<=30
		group by hour, min
		) as a3
	on a2.hour=a3.hour and a2.min=a3.min ) as q
where cn >= orderNum  -- keep the rows ranked within the top N%
group by dt, hour, min
order by dt, hour, min

A row falls within the range used for the quantile calculation when cn >= orderNum. For example, to compute the 0.1 quantile mean we need the top 10% of the coupons issued in each minute. After partitioning by minute and sorting by face value, each row carries its rank (orderNum). The number of coupons issued in that minute multiplied by 10% gives cn, the number of rows needed for that minute's 0.1 quantile mean. With the ten face values from the earlier example, cn = 1, so only the row with orderNum = 1 (face value 10) is kept and the mean is 10; a strict comparison cn > orderNum would keep no row at all. The count subquery also needs group by hour, min so that it returns one count per minute. Note that the outer round() rounds the mean to an integer; remove it if fractional means such as 9.5 are required.

Querying the quantile value directly in MySQL

Query time dropped from more than 1 minute to within 15 seconds, a substantial performance improvement.

Statement:
This article is reproduced from yisu.com.