搜索

首页  >  问答  >  正文

检索每组中的最后一条记录 - MySQL

<p>There is a table <code>messages</code> that contains data as shown below:</p> <pre class="brush:php;toolbar:false;">Id Name Other_Columns ------------------------- 1 A A_data_1 2 A A_data_2 3 A A_data_3 4 B B_data_1 5 B B_data_2 6 C C_data_1</pre> <p>If I run a query <code>select * from messages group by name</code>, I will get the result as:</p> <pre class="brush:php;toolbar:false;">1 A A_data_1 4 B B_data_1 6 C C_data_1</pre> <p>什么查询将返回以下结果?</p> <pre class="brush:php;toolbar:false;">3 A A_data_3 5 B B_data_2 6 C C_data_1</pre> <p>也就是说,应该返回每组中的最后一条记录。</p> <p>目前,这是我使用的查询:</p> <pre class="brush:php;toolbar:false;">SELECT * FROM (SELECT * FROM messages ORDER BY id DESC) AS x GROUP BY name</pre> <p>但这看起来效率很低。还有其他方法可以达到相同的结果吗?</p>
P粉464088437P粉464088437508 天前560

全部回复(2)我来回复

  • P粉786432579

    P粉7864325792023-08-25 11:56:11

    UPD:2017-03-31,版本5.7.5 MySQL 默认启用 ONLY_FULL_GROUP_BY 开关(因此,非确定性 GROUP BY 查询被禁用)。此外,他们更新了 GROUP BY 实现,即使禁用了开关,该解决方案也可能无法按预期工作。需要检查一下。

    Bill Karwin's solution above works fine when item count within groups is rather small, but the performance of the query becomes bad when the groups are rather large, since the solution requires about n*n/2 + n/2 of only IS NULL comparisons.

    I made my tests on a InnoDB table of 18684446 rows with 1182 groups. The table contains testresults for functional tests and has the (test_id, request_id) as the primary key. Thus, test_id is a group and I was searching for the last request_id for each test_id.

    Bill's solution has already been running for several hours on my dell e4310 and I do not know when it is going to finish even though it operates on a coverage index (hence using index in EXPLAIN).

    我有几个基于相同想法的其他解决方案:

    • if the underlying index is BTREE index (which is usually the case), the largest (group_id, item_value) pair is the last value within each group_id, that is the first for each group_id if we walk through the index in descending order;
    • 如果我们读取索引覆盖的值,则按照索引的顺序读取值;
    • 每个索引都隐式包含附加到该索引的主键列(即主键位于覆盖索引中)。在下面的解决方案中,我直接对主键进行操作,在您的情况下,您只需要在结果中添加主键列即可。
    • 在许多情况下,在子查询中按所需的顺序收集所需的行 ID 并将子查询的结果连接到 ID 上要便宜得多。由于对于子查询结果中的每一行,MySQL 将需要基于主键进行一次获取,因此子查询将首先放在连接中,并且行将按照子查询中 id 的顺序输出(如果我们省略显式 ORDER BY用于连接)

    MySQL 使用索引的 3 种方式 是一篇很棒的文章,可以帮助您了解一些细节。

    解决方案1

    这个速度快得令人难以置信,在我的 18M+ 行上大约需要 0.8 秒:

    SELECT test_id, MAX(request_id) AS request_id
    FROM testresults
    GROUP BY test_id DESC;

    如果您想将顺序更改为 ASC,请将其放入子查询中,仅返回 ids 并将其用作子查询来连接其余列:

    SELECT test_id, request_id
    FROM (
        SELECT test_id, MAX(request_id) AS request_id
        FROM testresults
        GROUP BY test_id DESC) as ids
    ORDER BY test_id;

    这对我的数据来说大约需要 1.2 秒。

    解决方案2

    这是另一个解决方案,对于我的表来说大约需要 19 秒:

    SELECT test_id, request_id
    FROM testresults, (SELECT @group:=NULL) as init
    WHERE IF(IFNULL(@group, -1)=@group:=test_id, 0, 1)
    ORDER BY test_id DESC, request_id DESC

    它也按降序返回测试。它要慢得多,因为它执行完整索引扫描,但它可以让您了解如何为每个组输出 N 个最大行。

    该查询的缺点是查询缓存无法缓存其结果。

    回复
    0
  • P粉848442185

    P粉8484421852023-08-25 09:17:24

    MySQL 8.0 现在支持窗口函数,例如几乎所有流行的 SQL 实现。使用这个标准语法,我们可以编写每组最大n个查询:

    WITH ranked_messages AS (
      SELECT m.*, ROW_NUMBER() OVER (PARTITION BY name ORDER BY id DESC) AS rn
      FROM messages AS m
    )
    SELECT * FROM ranked_messages WHERE rn = 1;

    此方法和其他查找的方法分组最大行数在 MySQL 手册中进行了说明。

    以下是我在2009年针对这个问题写的原始答案:


    我这样写解决方案:

    SELECT m1.*
    FROM messages m1 LEFT JOIN messages m2
     ON (m1.name = m2.name AND m1.id < m2.id)
    WHERE m2.id IS NULL;

    关于性能,一种解决方案可能会更好,具体取决于数据的性质。因此,您应该测试这两个查询,并根据您的数据库使用性能更好的查询。

    For example, I have a copy of the StackOverflow August data dump. I'll use that for benchmarking. There are 1,114,357 rows in the Posts table. This is running on MySQL 5.0.75 on my Macbook Pro 2.40GHz.

    我将编写一个查询来查找给定用户 ID(我的)的最新帖子。

    First using the technique shown by @Eric with the GROUP BY in a subquery:

    SELECT p1.postid
    FROM Posts p1
    INNER JOIN (SELECT pi.owneruserid, MAX(pi.postid) AS maxpostid
                FROM Posts pi GROUP BY pi.owneruserid) p2
      ON (p1.postid = p2.maxpostid)
    WHERE p1.owneruserid = 20860;
    
    1 row in set (1 min 17.89 sec)

    Even the EXPLAIN analysis takes over 16 seconds:

    +----+-------------+------------+--------+----------------------------+-------------+---------+--------------+---------+-------------+
    | id | select_type | table      | type   | possible_keys              | key         | key_len | ref          | rows    | Extra       |
    +----+-------------+------------+--------+----------------------------+-------------+---------+--------------+---------+-------------+
    |  1 | PRIMARY     | <derived2> | ALL    | NULL                       | NULL        | NULL    | NULL         |   76756 |             | 
    |  1 | PRIMARY     | p1         | eq_ref | PRIMARY,PostId,OwnerUserId | PRIMARY     | 8       | p2.maxpostid |       1 | Using where | 
    |  2 | DERIVED     | pi         | index  | NULL                       | OwnerUserId | 8       | NULL         | 1151268 | Using index | 
    +----+-------------+------------+--------+----------------------------+-------------+---------+--------------+---------+-------------+
    3 rows in set (16.09 sec)

    Now produce the same query result using my technique with LEFT JOIN:

    SELECT p1.postid
    FROM Posts p1 LEFT JOIN posts p2
      ON (p1.owneruserid = p2.owneruserid AND p1.postid < p2.postid)
    WHERE p2.postid IS NULL AND p1.owneruserid = 20860;
    
    1 row in set (0.28 sec)

    The EXPLAIN analysis shows that both tables are able to use their indexes:

    +----+-------------+-------+------+----------------------------+-------------+---------+-------+------+--------------------------------------+
    | id | select_type | table | type | possible_keys              | key         | key_len | ref   | rows | Extra                                |
    +----+-------------+-------+------+----------------------------+-------------+---------+-------+------+--------------------------------------+
    |  1 | SIMPLE      | p1    | ref  | OwnerUserId                | OwnerUserId | 8       | const | 1384 | Using index                          | 
    |  1 | SIMPLE      | p2    | ref  | PRIMARY,PostId,OwnerUserId | OwnerUserId | 8       | const | 1384 | Using where; Using index; Not exists | 
    +----+-------------+-------+------+----------------------------+-------------+---------+-------+------+--------------------------------------+
    2 rows in set (0.00 sec)

    Here's the DDL for my Posts table:

    CREATE TABLE `posts` (
      `PostId` bigint(20) unsigned NOT NULL auto_increment,
      `PostTypeId` bigint(20) unsigned NOT NULL,
      `AcceptedAnswerId` bigint(20) unsigned default NULL,
      `ParentId` bigint(20) unsigned default NULL,
      `CreationDate` datetime NOT NULL,
      `Score` int(11) NOT NULL default '0',
      `ViewCount` int(11) NOT NULL default '0',
      `Body` text NOT NULL,
      `OwnerUserId` bigint(20) unsigned NOT NULL,
      `OwnerDisplayName` varchar(40) default NULL,
      `LastEditorUserId` bigint(20) unsigned default NULL,
      `LastEditDate` datetime default NULL,
      `LastActivityDate` datetime default NULL,
      `Title` varchar(250) NOT NULL default '',
      `Tags` varchar(150) NOT NULL default '',
      `AnswerCount` int(11) NOT NULL default '0',
      `CommentCount` int(11) NOT NULL default '0',
      `FavoriteCount` int(11) NOT NULL default '0',
      `ClosedDate` datetime default NULL,
      PRIMARY KEY  (`PostId`),
      UNIQUE KEY `PostId` (`PostId`),
      KEY `PostTypeId` (`PostTypeId`),
      KEY `AcceptedAnswerId` (`AcceptedAnswerId`),
      KEY `OwnerUserId` (`OwnerUserId`),
      KEY `LastEditorUserId` (`LastEditorUserId`),
      KEY `ParentId` (`ParentId`),
      CONSTRAINT `posts_ibfk_1` FOREIGN KEY (`PostTypeId`) REFERENCES `posttypes` (`PostTypeId`)
    ) ENGINE=InnoDB;

    评论者请注意:如果您想要使用不同版本的 MySQL、不同的数据集或不同的表设计进行另一个基准测试,请随意自己做。我已经展示了上面的技术。 Stack Overflow 在这里向您展示如何进行软件开发工作,而不是为您完成所有工作。

    回复
    0
  • 取消回复