I need to read 50 million doubles from a txt file and store them in a vector. I assumed file I/O would be the slow part, so I used file memory mapping to read the file contents into memory in blocks and then push_back the values into the vector one by one. However, reading the values one at a time directly from the file takes only 3 minutes, while my "optimized" version takes 5 minutes.
My optimization plan is to read the entire file into memory, hold it in a char* buffer, and call vec_name.reserve(50000000); to allocate capacity for 50 million elements up front and avoid repeated reallocation, but this seems to have no effect.
Is it because time is mainly spent on push_back?
Is there any good optimization method? Thank you all!
The key part of the optimized code is as follows (it takes five minutes to read all the data into the vector):
ifstream iVecSim("input.txt");
iVecSim.seekg(0, iVecSim.end);
long long file_size = iVecSim.tellg(); // file size
iVecSim.seekg(0, iVecSim.beg);
char *buffer = new char[file_size];
iVecSim.read(buffer, file_size);
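string input(buffer, static_cast<size_t>(iVecSim.gcount())); // buffer is not null-terminated, so pass the number of chars actually read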
delete[]buffer;
istringstream ss_sim(input); // string stream
string fVecSim;
vec_similarity.reserve(50000000);
while (ss_sim.good()) { // read from the string stream into the vector
    ss_sim >> fVecSim;
    vec_similarity.push_back(atof(fVecSim.c_str()));
}
漂亮男人 2017-05-31 10:38:40
Timing a debug build is meaningless. When I build your code in release mode, it takes only about 14 seconds.
To solve a problem, you first have to find it. I modified the code like this to find out where the time is spent:
// GetTickCount() and DWORD come from <windows.h>
std::cout << "Start" << std::endl;
auto n1 = ::GetTickCount(); // start of the whole loop
DWORD n2 = 0;               // time spent in ss_sim >> fVecSim
DWORD n3 = 0;               // time spent in atof
DWORD n4 = 0;               // time spent in push_back
while (ss_sim.good())
{
    auto n = ::GetTickCount();
    ss_sim >> fVecSim;
    n2 += (::GetTickCount() - n);
    n = ::GetTickCount();
    auto v = atof(fVecSim.c_str());
    n3 += (::GetTickCount() - n);
    n = ::GetTickCount();
    vec_similarity.push_back(v);
    n4 += (::GetTickCount() - n);
}
n1 = ::GetTickCount() - n1; // total elapsed time
std::cout << "ss_sim >> fVecSim:" << n2 << "ms" << std::endl;
std::cout << "atof:" << n3 << "ms" << std::endl;
std::cout << "push_back:" << n4 << "ms" << std::endl;
std::cout << "Total:" << n1 << "ms" << std::endl;
So the bottleneck is the statement "ss_sim >> fVecSim"; atof is fast enough.
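One caveat on the measurement: GetTickCount typically has a resolution of 10 to 16 ms, so the per-iteration deltas are mostly zero and the sums are coarse. As a cross-check, here is a minimal sketch that times each phase as one whole pass with std::chrono::steady_clock. The names PhaseTimes and time_phases are made up for this example, and it assumes you run it on a sample of the input, because it keeps every token in memory.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdlib>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper type: time per phase, in milliseconds.
struct PhaseTimes {
    double tokenize_ms = 0.0;
    double convert_ms = 0.0;
    double store_ms = 0.0;
};

// Runs each phase as its own pass so every timer covers many operations.
PhaseTimes time_phases(const std::string& input, std::vector<double>& out) {
    using steady = std::chrono::steady_clock;
    using ms = std::chrono::duration<double, std::milli>;

    // Phase 1: split the text into tokens (also pays for storing them).
    const auto t0 = steady::now();
    std::istringstream ss(input);
    std::vector<std::string> tokens;
    std::string tok;
    while (ss >> tok) tokens.push_back(tok);
    const auto t1 = steady::now();

    // Phase 2: convert every token to a double.
    std::vector<double> values(tokens.size());
    for (std::size_t i = 0; i < tokens.size(); ++i)
        values[i] = std::atof(tokens[i].c_str());
    const auto t2 = steady::now();

    // Phase 3: append the converted values to the destination vector.
    out.reserve(out.size() + values.size());
    for (double v : values) out.push_back(v);
    const auto t3 = steady::now();

    PhaseTimes t;
    t.tokenize_ms = ms(t1 - t0).count();
    t.convert_ms = ms(t2 - t1).count();
    t.store_ms = ms(t3 - t2).count();
    return t;
}
```

Calling time_phases(sample, vec_similarity) on the first few million values should be enough to see which phase dominates. Keep in mind that the tokenize figure includes the cost of storing each token.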
So my conclusion is: the ultimate optimization is to start with the storage format and store your data as binary rather than as text. This avoids the overhead of string I/O and conversion functions, and lets you load the data in a matter of seconds.
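To illustrate the binary-format idea, here is a rough sketch that dumps a vector<double> as raw bytes and loads it back with a single read. The function names and the file layout (no header, just the raw doubles) are assumptions made for this example. A file written this way is only portable between machines with the same endianness and double representation.

```cpp
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Writes the doubles as raw bytes. Returns false if the file cannot be written.
bool write_doubles_binary(const std::string& path, const std::vector<double>& values) {
    std::ofstream out(path, std::ios::binary);
    if (!out) return false;
    out.write(reinterpret_cast<const char*>(values.data()),
              static_cast<std::streamsize>(values.size() * sizeof(double)));
    return static_cast<bool>(out);
}

// Reads the whole file back in one call. Returns an empty vector on failure.
std::vector<double> read_doubles_binary(const std::string& path) {
    std::vector<double> values;
    std::ifstream in(path, std::ios::binary | std::ios::ate); // open at the end to get the size
    if (!in) return values;
    const std::streamsize bytes = in.tellg();
    if (bytes <= 0) return values;
    in.seekg(0, std::ios::beg);
    values.resize(static_cast<std::size_t>(bytes) / sizeof(double));
    in.read(reinterpret_cast<char*>(values.data()),
            static_cast<std::streamsize>(values.size() * sizeof(double)));
    if (!in) values.clear();
    return values;
}
```

The idea would be to convert the text file once, save the result with write_doubles_binary, and from then on load it with read_doubles_binary, which involves no parsing at all.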
phpcn_u1582 2017-05-31 10:38:40
The most efficient approach at present is to use streams. As your code shows, you read the entire file contents into the buffer at once, which is not the best way. I recommend reading a fixed-size chunk each time, for example buffer[1024] (1K) or some other size. After each read, move on to the next chunk and continue until you reach EOF.
天蓬老师 2017-05-31 10:38:40
1. If there are no dependencies between the data items, you can try reading the file in blocks with multiple threads;
2. In addition, a vector's memory is contiguous. If the subsequent traversal does not need random access, using a list would be considerably more efficient.
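For point 1, a rough sketch of what block-wise parsing with several threads might look like, assuming the whole text is already in memory as a std::string. The names parse_range and parse_doubles_parallel are invented for the example, and whether it actually speeds things up depends on the machine. Each block boundary is moved forward to the next whitespace so that no number is split between two threads.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <cstdlib>
#include <string>
#include <thread>
#include <vector>

// Parses the doubles in text[begin, end). Assumes text[end] is whitespace or
// the end of the string, so strtod cannot read past the block.
static void parse_range(const std::string& text, std::size_t begin, std::size_t end,
                        std::vector<double>& out) {
    const char* p = text.c_str() + begin;
    const char* const last = text.c_str() + end;
    while (true) {
        while (p < last && std::isspace(static_cast<unsigned char>(*p))) ++p;
        if (p >= last) break;
        char* next = nullptr;
        const double v = std::strtod(p, &next);
        if (next == p) { // not a number: skip this token
            while (p < last && !std::isspace(static_cast<unsigned char>(*p))) ++p;
            continue;
        }
        out.push_back(v);
        p = next;
    }
}

std::vector<double> parse_doubles_parallel(const std::string& text, unsigned num_threads) {
    if (num_threads == 0) num_threads = 1;

    // Pick block boundaries, pushing each one forward to the next whitespace.
    std::vector<std::size_t> cuts{0};
    for (unsigned k = 1; k < num_threads; ++k) {
        std::size_t cut = std::max(cuts.back(), text.size() / num_threads * k);
        while (cut < text.size() && !std::isspace(static_cast<unsigned char>(text[cut]))) ++cut;
        cuts.push_back(cut);
    }
    cuts.push_back(text.size());

    // One thread and one private output vector per block.
    std::vector<std::vector<double>> parts(num_threads);
    std::vector<std::thread> workers;
    for (unsigned k = 0; k < num_threads; ++k)
        workers.emplace_back([&text, &cuts, &parts, k] {
            parse_range(text, cuts[k], cuts[k + 1], parts[k]);
        });
    for (auto& w : workers) w.join();

    // Concatenate in block order so the original order is preserved.
    std::vector<double> result;
    for (const auto& part : parts) result.insert(result.end(), part.begin(), part.end());
    return result;
}
```

A reasonable starting point for num_threads is std::thread::hardware_concurrency(). On Linux you may need to compile with -pthread.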
天蓬老师 2017-05-31 10:38:40
You can switch to C-style scanf.
Give it a try.
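A minimal sketch of the C-style approach, using fscanf with the %lf conversion to read one double at a time. The function name and the expected_count parameter are assumptions for the example, and it is not measured against the stream version here.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>
#include <vector>

// Reads whitespace-separated doubles from a text file with fscanf.
// expected_count is only a hint used to reserve capacity up front.
std::vector<double> read_doubles_fscanf(const std::string& path, std::size_t expected_count = 0) {
    std::vector<double> values;
    values.reserve(expected_count);
    std::FILE* fp = std::fopen(path.c_str(), "r");
    if (!fp) return values; // file could not be opened
    double v = 0.0;
    // fscanf returns the number of items converted; stop at EOF or on bad input.
    while (std::fscanf(fp, "%lf", &v) == 1) values.push_back(v);
    std::fclose(fp);
    return values;
}
```

For the file in the question the call would be something like read_doubles_fscanf("input.txt", 50000000).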
Wow, why is my answer being treated like this? To the user who reported me: what exactly is wrong with this answer?