There is a table messages which contains data as shown below:
Id   Name   Other_Columns
-------------------------
1    A      A_data_1
2    A      A_data_2
3    A      A_data_3
4    B      B_data_1
5    B      B_data_2
6    C      C_data_1
If I run the query select * from messages group by name, the results I get are:
1   A   A_data_1
4   B   B_data_1
6   C   C_data_1
What query would return the following results?
3   A   A_data_3
5   B   B_data_2
6   C   C_data_1
That is, the last record in each group should be returned.
Currently, this is the query I use:
SELECT * FROM (SELECT * FROM messages ORDER BY id DESC) AS x GROUP BY name
But this seems very inefficient. Are there any other ways to achieve the same result?
P粉111927962  2023-10-10 14:48:01
UPD 2017-03-31: as of version 5.7.5, MySQL enables the ONLY_FULL_GROUP_BY switch by default (so non-deterministic GROUP BY queries are rejected). In addition, the GROUP BY implementation was updated, and the solution may not work as expected even with the switch disabled. This needs checking.
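As a side note (this is just the standard way to inspect the setting, not part of the original update), you can check whether the switch is active on a given server and, if need be, drop it for the current session; rewriting the query deterministically is still the better long-term fix:

-- Show the current SQL mode; ONLY_FULL_GROUP_BY appears in the list when the switch is on
SELECT @@sql_mode;

-- Remove it for the current session only (a commonly used workaround; prefer fixing the query instead)
SET SESSION sql_mode = (SELECT REPLACE(@@sql_mode, 'ONLY_FULL_GROUP_BY', ''));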
Bill Karwin's solution above works fine when the number of items within each group is fairly small, but the performance of the query becomes poor when the groups are fairly large, since the solution requires about n*n/2 + n/2 IS NULL comparisons.
I tested on an InnoDB table containing 18,684,446 rows and 1,182 groups. The table contains results of functional tests and has (test_id, request_id) as the primary key. Thus, test_id is the group and I am searching for the last request_id for each test_id.
Bill's solution has been running on my Dell E4310 for a few hours now, and although it operates on a covering index (hence Using index in EXPLAIN), I don't know when it will complete.
I have a couple of other solutions based on the same idea: the largest (group_id, item_value) pair is the last value within each group_id, i.e. the first one for each group_id if we traverse the index in descending order. 3 Ways MySQL Uses Indexes is a great article to help you understand some of the details.
Solution 1
This is incredibly fast, taking about 0.8 seconds on my 18M rows:
SELECT test_id, MAX(request_id) AS request_id
FROM testresults
GROUP BY test_id DESC;
If you want to change the order to ASC, put it in a subquery that returns only the ids and use it as a subquery to join the rest of the columns:
SELECT test_id, request_id
FROM (
    SELECT test_id, MAX(request_id) AS request_id
    FROM testresults
    GROUP BY test_id DESC
) AS ids
ORDER BY test_id;
This takes about 1.2 seconds for my data.
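If you also need the non-key columns and not just the (test_id, request_id) pairs, a minimal sketch (assuming the same testresults table; not part of the original timings) is to join the grouped ids back to the table:

SELECT t.*
FROM testresults AS t
INNER JOIN (
    SELECT test_id, MAX(request_id) AS request_id
    FROM testresults
    GROUP BY test_id
) AS ids USING (test_id, request_id)
ORDER BY test_id;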
Solution 2
Here's another solution that took about 19 seconds for my table:
SELECT test_id, request_id
FROM testresults, (SELECT @group:=NULL) AS init
WHERE IF(IFNULL(@group, -1)=@group:=test_id, 0, 1)  -- only the first row of each test_id (in index order) passes
ORDER BY test_id DESC, request_id DESC;
It also returns tests in descending order. It's much slower because it performs a full index scan, but it gives you an idea of how to output the N maximum rows for each group.
The disadvantage of this query is that the query cache cannot cache its results.
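As a sketch of that N-rows-per-group idea (N = 3 chosen arbitrarily, same testresults table; note that the MySQL manual does not guarantee the evaluation order of user variables, and on 8.0 the derived table's ORDER BY may be optimized away, so a window function is the safer choice there):

SELECT test_id, request_id
FROM (
    SELECT test_id, request_id,
           @rank := IF(@grp = test_id, @rank + 1, 1) AS rnk,  -- position within the current group
           @grp  := test_id AS grp                            -- remember the group of this row
    FROM testresults, (SELECT @grp := NULL, @rank := 0) AS init
    ORDER BY test_id DESC, request_id DESC
) AS ranked
WHERE rnk <= 3;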
P粉015402013  2023-10-10 11:57:49
MySQL 8.0 now supports window functions, like almost all popular SQL implementations. With this standard syntax, we can write greatest-n-per-group queries:
WITH ranked_messages AS (
    SELECT m.*,
           ROW_NUMBER() OVER (PARTITION BY name ORDER BY id DESC) AS rn
    FROM messages AS m
)
SELECT *
FROM ranked_messages
WHERE rn = 1;
This and other methods of finding groupwise maximum rows are described in the MySQL manual.
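The same CTE also generalizes to the greatest N rows per group simply by changing the filter (N = 3 picked for illustration):

WITH ranked_messages AS (
    SELECT m.*,
           ROW_NUMBER() OVER (PARTITION BY name ORDER BY id DESC) AS rn
    FROM messages AS m
)
SELECT *
FROM ranked_messages
WHERE rn <= 3;  -- the 3 most recent rows per name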
The following is the original answer I wrote to this question in 2009:
I wrote the solution like this:
SELECT m1.*
FROM messages m1
LEFT JOIN messages m2
  ON (m1.name = m2.name AND m1.id < m2.id)
WHERE m2.id IS NULL;  -- keep only rows for which no row with the same name and a greater id exists
Regarding performance, one solution or the other may be better depending on the nature of your data, so you should test both queries and use whichever performs better on your database.
For example, I have a copy of the StackOverflow August data dump, which I will use for benchmarking. There are 1,114,357 rows in the Posts table. This runs on MySQL 5.0.75 on my MacBook Pro 2.40GHz.
I will write a query to find the latest posts for a given user ID (mine).
First, using the technique shown by @Eric with GROUP BY in a subquery:
SELECT p1.postid
FROM Posts p1
INNER JOIN (SELECT pi.owneruserid, MAX(pi.postid) AS maxpostid
            FROM Posts pi
            GROUP BY pi.owneruserid) p2
  ON (p1.postid = p2.maxpostid)
WHERE p1.owneruserid = 20860;

1 row in set (1 min 17.89 sec)
Even the EXPLAIN analysis takes more than 16 seconds:
+----+-------------+------------+--------+----------------------------+-------------+---------+--------------+---------+-------------+
| id | select_type | table      | type   | possible_keys              | key         | key_len | ref          | rows    | Extra       |
+----+-------------+------------+--------+----------------------------+-------------+---------+--------------+---------+-------------+
|  1 | PRIMARY     | <derived2> | ALL    | NULL                       | NULL        | NULL    | NULL         |   76756 |             |
|  1 | PRIMARY     | p1         | eq_ref | PRIMARY,PostId,OwnerUserId | PRIMARY     | 8       | p2.maxpostid |       1 | Using where |
|  2 | DERIVED     | pi         | index  | NULL                       | OwnerUserId | 8       | NULL         | 1151268 | Using index |
+----+-------------+------------+--------+----------------------------+-------------+---------+--------------+---------+-------------+
3 rows in set (16.09 sec)
Now using my technique with LEFT JOIN:
SELECT p1.postid
FROM Posts p1
LEFT JOIN posts p2
  ON (p1.owneruserid = p2.owneruserid AND p1.postid < p2.postid)
WHERE p2.postid IS NULL AND p1.owneruserid = 20860;

1 row in set (0.28 sec)
The EXPLAIN analysis shows that both tables are able to use their indexes:
+----+-------------+-------+------+----------------------------+-------------+---------+-------+------+--------------------------------------+
| id | select_type | table | type | possible_keys              | key         | key_len | ref   | rows | Extra                                |
+----+-------------+-------+------+----------------------------+-------------+---------+-------+------+--------------------------------------+
|  1 | SIMPLE      | p1    | ref  | OwnerUserId                | OwnerUserId | 8       | const | 1384 | Using index                          |
|  1 | SIMPLE      | p2    | ref  | PRIMARY,PostId,OwnerUserId | OwnerUserId | 8       | const | 1384 | Using where; Using index; Not exists |
+----+-------------+-------+------+----------------------------+-------------+---------+-------+------+--------------------------------------+
2 rows in set (0.00 sec)
Here is the DDL for my Posts table:
CREATE TABLE `posts` (
  `PostId` bigint(20) unsigned NOT NULL auto_increment,
  `PostTypeId` bigint(20) unsigned NOT NULL,
  `AcceptedAnswerId` bigint(20) unsigned default NULL,
  `ParentId` bigint(20) unsigned default NULL,
  `CreationDate` datetime NOT NULL,
  `Score` int(11) NOT NULL default '0',
  `ViewCount` int(11) NOT NULL default '0',
  `Body` text NOT NULL,
  `OwnerUserId` bigint(20) unsigned NOT NULL,
  `OwnerDisplayName` varchar(40) default NULL,
  `LastEditorUserId` bigint(20) unsigned default NULL,
  `LastEditDate` datetime default NULL,
  `LastActivityDate` datetime default NULL,
  `Title` varchar(250) NOT NULL default '',
  `Tags` varchar(150) NOT NULL default '',
  `AnswerCount` int(11) NOT NULL default '0',
  `CommentCount` int(11) NOT NULL default '0',
  `FavoriteCount` int(11) NOT NULL default '0',
  `ClosedDate` datetime default NULL,
  PRIMARY KEY (`PostId`),
  UNIQUE KEY `PostId` (`PostId`),
  KEY `PostTypeId` (`PostTypeId`),
  KEY `AcceptedAnswerId` (`AcceptedAnswerId`),
  KEY `OwnerUserId` (`OwnerUserId`),
  KEY `LastEditorUserId` (`LastEditorUserId`),
  KEY `ParentId` (`ParentId`),
  CONSTRAINT `posts_ibfk_1` FOREIGN KEY (`PostTypeId`) REFERENCES `posttypes` (`PostTypeId`)
) ENGINE=InnoDB;
Note to commenters: If you want to run another benchmark using a different version of MySQL, a different data set, or a different table design, please do it yourself. I've demonstrated the technique above. Stack Overflow is here to show you how to do software development work, not to do all the work for you.