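My project needs a feature for searching similar images. I looked up the dHash algorithm and generated hash values 16 characters long, which I query in MySQL like this: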
SELECT pk, hash,
BIT_COUNT( CONV(hash, 16, 10) ^ CONV('4c8e3366c275650f', 16, 10) ) as hamming_distance
FROM image_hashes HAVING hamming_distance < 4 ORDER BY hamming_distance ASC;
In testing with 150,000 rows, the search takes a long time. A second run is somewhat faster. Is there any way to optimize this?
巴扎黑 2017-04-17 14:56:25
Thank you for the invitation.
I'm sorry to say I have never done this myself, though I have heard of Hamming distance. 150,000 rows is not a lot, but because the query wraps the hash column in MySQL functions, an index on that column cannot be used.
You can refer to these two threads:
"MySQL image Hamming distance calculation: nearly 400,000 XOR operations, how to solve it?"
"Hamming distance on binary strings in SQL"
You can also try a MySQL MEMORY table; see "Introduction to the features and use of MySQL memory tables".
大家讲道理 2017-04-17 14:56:25
Sorry, I have no experience with "Hamming distance".
If there are only 150,000 rows, you can load the data into memory, cache it, and do the calculation in your program. After all, SQL is not well suited to this kind of work.
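As a rough illustration of the in-memory approach, here is a minimal Python sketch. It assumes the (pk, hash) rows have already been read from image_hashes once and cached; the function names build_cache and find_similar and the sample rows are hypothetical, not part of the original question.

```python
def build_cache(rows):
    """Convert (pk, hex_hash) rows into (pk, int_hash) pairs once, up front.

    `rows` stands in for the result of SELECT pk, hash FROM image_hashes.
    """
    return [(pk, int(hex_hash, 16)) for pk, hex_hash in rows]


def hamming_distance(a, b):
    """Count the bits that differ between two integers."""
    return bin(a ^ b).count("1")


def find_similar(cache, query_hex, max_distance=4):
    """Return (distance, pk) pairs with distance < max_distance, nearest first."""
    query = int(query_hex, 16)
    matches = []
    for pk, value in cache:
        distance = hamming_distance(value, query)
        if distance < max_distance:
            matches.append((distance, pk))
    matches.sort()  # sorting in the program replaces ORDER BY
    return matches


# Hypothetical sample rows for demonstration only.
sample_rows = [
    (1, "4c8e3366c275650f"),  # identical to the query
    (2, "4c8e3366c275650e"),  # differs from the query in one bit
    (3, "0000000000000000"),  # far from the query
]
cache = build_cache(sample_rows)
result = find_similar(cache, "4c8e3366c275650f")
print(result)
```

The hex strings are converted to integers only once when the cache is built, so each search is a single pass of XOR and bit counting over the cached values. Whether this is fast enough for 150,000 rows depends on the runtime and hardware, so it is worth measuring.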
Based on your SQL, there really isn’t much room for optimization. The only things I can think of are the following:
Make hash a char column instead of varchar.
Drop ORDER BY hamming_distance ASC and sort in the program instead.
I recommend posting the DDL statement and the EXPLAIN output so they can be analyzed together, but the biggest performance bottleneck is most likely BIT_COUNT(CONV(hash, 16, 10) ^ CONV('4c8e3366c275650f', 16, 10)) as hamming_distance, which has to be evaluated for every row and so effectively amounts to a full table scan.