I have a 36.6 GB CSV file that needs to be deduplicated and imported into a database (the order doesn't matter; the result just needs to be a table with no duplicate rows). How should I handle this?
PHPz 2017-04-17 13:29:41
If the Foo field is not allowed to repeat, just declare it UNIQUE and duplicate rows will be rejected automatically:
CREATE TABLE xxx (
    ...
    Foo VARCHAR(255) UNIQUE NOT NULL,
    ...
);
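For example, a minimal sketch assuming MySQL (the table name, columns, VARCHAR length, and CSV layout are placeholders): loading with the IGNORE keyword skips rows that would violate the unique constraint instead of aborting the whole import.

-- Hypothetical schema; adjust the columns to match the CSV.
CREATE TABLE dedup_target (
    Foo VARCHAR(255) NOT NULL,
    Bar VARCHAR(255),
    UNIQUE KEY uk_foo (Foo)
);

-- IGNORE skips rows whose Foo already exists instead of
-- failing on the first duplicate.
LOAD DATA LOCAL INFILE '/path/to/big.csv'
IGNORE INTO TABLE dedup_target
FIELDS TERMINATED BY ',' ENCLOSED BY '"'
LINES TERMINATED BY '\n'
(Foo, Bar);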
大家讲道理 2017-04-17 13:29:41
You can also import everything into the database first and then delete the duplicate rows with a SQL statement.
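A hedged sketch of that approach in MySQL (t_raw and t_dedup are hypothetical names): load the raw CSV into a staging table, copy the distinct rows into a fresh table, then swap the tables.

-- Assumes the raw CSV has already been loaded into t_raw.
CREATE TABLE t_dedup LIKE t_raw;
INSERT INTO t_dedup SELECT DISTINCT * FROM t_raw;  -- whole-row deduplication
DROP TABLE t_raw;
RENAME TABLE t_dedup TO t_raw;

Note that with a file this size the SELECT DISTINCT needs substantial temporary space, and you briefly hold two copies of the data.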
伊谢尔伦 2017-04-17 13:29:41
Create a unique index on the field(s) that may contain duplicates, then use INSERT IGNORE INTO ... when inserting, as in the sketch below.
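A small sketch assuming MySQL, a hypothetical target table t_clean, and Foo as the column that must stay unique:

-- Unique index on the column that must not repeat.
ALTER TABLE t_clean ADD UNIQUE INDEX uk_foo (Foo);

-- Rows whose Foo already exists are silently skipped
-- instead of raising a duplicate-key error.
INSERT IGNORE INTO t_clean (Foo, Bar) VALUES
('a', '1'),
('a', '2');   -- only the first ('a', '1') row is kept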
怪我咯 2017-04-17 13:29:41
You can do it in bash: sort the file first, then use awk to compare each line with the previous one and only write it to a new file if it differs. This is actually not slow, but it can require a lot of disk space for the temporary sort files; see the sketch below.
A better approach is to let the database handle it during the import itself, for example by defining unique fields as mentioned above.
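A rough sketch of the bash pipeline (the file names, memory size, and temp directory are assumptions, and the CSV is assumed to have no header line): sort the file, then keep only lines that differ from the previous one.

# Point -T at a disk with enough free space; sort's temp files
# can approach the size of the input.
sort -S 4G -T /data/tmp big.csv \
  | awk '$0 != prev { print; prev = $0 }' > big_dedup.csv

sort -u big.csv > big_dedup.csv would collapse this into one step; the explicit awk stage just mirrors the adjacent-line comparison described above.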