python - Merging and sorting very large (100 GB) files

Question

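I want to merge two 100 GB log files. Every entry has a date, and the date ranges overlap (for example, one file covers the 1st to the 10th and the other covers the 5th to the 15th), so the merged result also has to be sorted by date. The goal is to query the entries within a given time range. My current plan is: I can find out, for each file...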

阿神 · Answer

Here is an idea: your segmentation approach is reasonable, but you don't actually need to split the files. All you need to maintain is an index file. Read each file once, and for every 1000 entries (for example) record the file offsets where that block starts and ends, together with the timestamps of its first and last entries. That gives you an index file like this:

time1~time2, file1, offset1~offset2
time3~time4, file2, offset3~offset4
...
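One way to produce entries in that shape is sketched below. It is only a sketch under assumptions that the question does not confirm: each log line is assumed to start with an ISO-8601 timestamp followed by a space, and `build_index`, `chunk_size`, and the file name `file1.log` are hypothetical names chosen for illustration.

```python
import io


def build_index(f, name, chunk_size=1000):
    """Scan a time-sorted log stream (opened in binary mode) once.

    Returns a list of (start_time, end_time, name, start_offset, end_offset),
    one entry per block of up to chunk_size lines.
    """
    entries = []
    start_offset = f.tell()
    start_time = end_time = None
    count = 0
    while True:
        line = f.readline()
        if not line:
            break
        # Assumed line layout: b"<timestamp> <message>\n"
        ts = line.split(b" ", 1)[0].decode("ascii")
        if count == 0:
            start_time = ts
        end_time = ts
        count += 1
        if count == chunk_size:
            end_offset = f.tell()
            entries.append((start_time, end_time, name, start_offset, end_offset))
            start_offset = end_offset
            count = 0
    if count:
        # Final, partially filled block.
        entries.append((start_time, end_time, name, start_offset, f.tell()))
    return entries


# Tiny in-memory stand-in for a real log file.
demo = io.BytesIO(
    b"2024-01-01T00:00:00 a\n"
    b"2024-01-01T00:00:05 b\n"
    b"2024-01-02T00:00:00 c\n"
)
index = build_index(demo, "file1.log", chunk_size=2)
```

For a real file you would pass `open(path, "rb")` instead of the `BytesIO` object. Binary mode keeps `tell()` offsets exact, so they can be handed to `seek()` later.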

When you query later, check the index file first to find out which file and which offset range hold the data you need. Because each of your files is already sorted by time, nothing has to be sorted while building the index.
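The lookup step might look roughly like this. Again this is a hedged sketch, not the answerer's code: it assumes index entries are tuples of (start time, end time, file name, start offset, end offset), that timestamps are ISO-8601 strings (so plain string comparison follows time order), and that each line starts with its timestamp. The names `overlapping`, `read_range`, and `query` are made up for the example. Because the two files overlap in time, lines from both are interleaved with `heapq.merge`.

```python
import heapq
import io


def overlapping(index, t_from, t_to):
    """Return index entries whose time range intersects [t_from, t_to]."""
    return [e for e in index if e[0] <= t_to and e[1] >= t_from]


def read_range(f, start, end, t_from, t_to):
    """Yield lines from bytes [start, end) whose timestamp is in range."""
    f.seek(start)
    for line in f.read(end - start).splitlines(keepends=True):
        ts = line.split(b" ", 1)[0].decode("ascii")
        if t_from <= ts <= t_to:
            yield line


def query(index, files, t_from, t_to):
    """Collect matching lines from every file, merged in timestamp order."""
    streams = [
        read_range(files[name], start, end, t_from, t_to)
        for (_, _, name, start, end) in overlapping(index, t_from, t_to)
    ]
    # Each stream is already time-sorted, so a k-way merge is enough.
    return list(heapq.merge(*streams, key=lambda line: line.split(b" ", 1)[0]))


# In-memory stand-ins for the two log files and their index.
files = {
    "file1.log": io.BytesIO(b"2024-01-01T00:00:00 a\n2024-01-05T00:00:00 b\n"),
    "file2.log": io.BytesIO(b"2024-01-03T00:00:00 c\n2024-01-09T00:00:00 d\n"),
}
index = [
    ("2024-01-01T00:00:00", "2024-01-05T00:00:00", "file1.log", 0, 44),
    ("2024-01-03T00:00:00", "2024-01-09T00:00:00", "file2.log", 0, 44),
]
result = query(index, files, "2024-01-02T00:00:00", "2024-01-06T00:00:00")
```

The linear scan in `overlapping` is fine for a small index. With millions of entries, keeping the index sorted by start time and using the `bisect` module would avoid scanning all of it.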

伊谢尔伦 · Answer

The files are this large and you still need to query them, so why not load them into a database at regular intervals?
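As a rough illustration of the database route, here is a minimal sketch using SQLite from Python's standard library. The answer does not name a database, so SQLite, the table name `logs`, and the helper names `load_lines` and `query_range` are all assumptions, as is the line layout (an ISO-8601 timestamp, a space, then the message).

```python
import sqlite3

# ":memory:" keeps the demo self-contained; a real setup would use a file path.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE logs (ts TEXT NOT NULL, line TEXT NOT NULL)")
# The index on ts is what makes time-range queries cheap.
conn.execute("CREATE INDEX idx_logs_ts ON logs (ts)")


def load_lines(conn, lines):
    """Insert an iterable of text log lines in one transaction."""
    rows = ((line.split(" ", 1)[0], line.rstrip("\n")) for line in lines)
    with conn:
        conn.executemany("INSERT INTO logs (ts, line) VALUES (?, ?)", rows)


def query_range(conn, t_from, t_to):
    """Return the lines with t_from <= ts <= t_to, ordered by time."""
    cur = conn.execute(
        "SELECT line FROM logs WHERE ts BETWEEN ? AND ? ORDER BY ts",
        (t_from, t_to),
    )
    return [row[0] for row in cur]


# Two overlapping batches, loaded in any order.
load_lines(conn, ["2024-01-01T00:00:00 a\n", "2024-01-05T00:00:00 b\n"])
load_lines(conn, ["2024-01-03T00:00:00 c\n", "2024-01-09T00:00:00 d\n"])
rows = query_range(conn, "2024-01-02T00:00:00", "2024-01-06T00:00:00")
```

With this approach the merge and sort happen implicitly: the database orders rows by `ts` at query time, so the input files can be loaded in whatever order they arrive.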

