Example code for reading very large files with PHP
At the end of last year, account databases from a number of websites were leaked, which caused quite a stir. I took the opportunity to download several of them, intending to analyze the account information the way a data analyst would. Although the data has already been "organized" by others, it is still useful to study on your own; after all, there is a very large amount of it.
The problem with this much data is that each file is very large, and simply opening one is not easy. Don't count on Notepad, which just freezes. Even the MSSQL client cannot open an SQL file this large and reports insufficient memory. The reason is said to be that MSSQL loads everything it reads into memory at once, so if the data is too large and memory runs short, the system can crash.
Navicat Premium
Here I recommend a piece of software, Navicat Premium, which is quite powerful. It opens SQL files of several hundred megabytes easily, without any lag. It also supports connections to various databases such as MSSQL, MySQL, and Oracle. I will explore its many other features over time.
Although Navicat can open the 274MB CSDN SQL file, viewing the content this way is of little use: it is inconvenient to query, classify, or compute statistics on the account information. The only way is to read the records one by one, split each record into its parts, and store those parts in a database as fields so that they can be used conveniently later.
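As a rough sketch of the splitting step: the example below assumes that each record is a single line whose fields are separated by a fixed delimiter. The " # " delimiter, the field names, and the splitRecord helper are illustrative assumptions, not details taken from the actual file.

```php
<?php
/**
 * Split one record line into named fields.
 * The delimiter and the field names are assumptions for illustration only.
 */
function splitRecord(
    string $line,
    string $delimiter = " # ",
    array $fields = ["username", "password", "email"]
): ?array {
    // Remove the trailing line break before splitting.
    $line = rtrim($line, "\r\n");

    // Limit the number of pieces so extra delimiters stay in the last field.
    $parts = explode($delimiter, $line, count($fields));
    if (count($parts) !== count($fields)) {
        return null; // malformed record: the caller decides whether to skip or log it
    }

    // Pair each field name with its trimmed value.
    return array_combine($fields, array_map('trim', $parts));
}
```

Each returned array could then be written to a table with matching columns, for example through a prepared INSERT statement.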
Use PHP to read very large files
PHP offers many ways to read files. Choosing the method that suits the target file can noticeably improve execution efficiency. Since the CSDN database file is very large, we should avoid reading all of it at once; after all, every record that is read also has to be split and written. A more appropriate approach is to read the file region by region: by combining PHP's fseek and fread, you can read any part of the file at will. The following is the example code:
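function readBigFile($filename, $count = 20, $tag = "\r\n")
{
    $content = "";          // final content
    $current = "";          // holds the content of the current read
    $step = 1;              // number of bytes to read each time
    $tagLen = strlen($tag);
    $start = 0;             // starting position
    $i = 0;                 // line counter
    // Open read-only in binary mode; the pointer starts at the beginning of the file.
    $handle = fopen($filename, 'rb');
    if ($handle === false) {
        return false;       // the file could not be opened
    }
    while ($i < $count && !feof($handle)) {
        fseek($handle, $start, SEEK_SET);  // move the pointer to the current offset
        $current = fread($handle, $step);  // read $step bytes
        $content .= $current;              // append to the result
        $start += $step;                   // advance by the step size
        // Take the last few characters, as many as the separator is long.
        $substrTag = substr($content, -$tagLen);
        if ($substrTag === $tag) {         // is it a line break or other separator?
            $i++;
            $content .= "<br />";
        }
    }
    fclose($handle);  // close the file
    return $content;  // return the result
}

$filename = "csdn.sql"; // file to read
$tag = "\n";            // line separator; note that double quotes are required here
$count = 100;           // number of lines to read
$data = readBigFile($filename, $count, $tag);
echo $data;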
The value to pass in for the $tag parameter depends on the system the file came from: Windows uses "\r\n", Linux/Unix uses "\n", and classic Mac OS uses "\r" (modern macOS uses "\n").
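If the origin of a file is unknown, the separator can be guessed from the file itself. The following is a minimal sketch; detectLineEnding is a hypothetical helper, and the 4096-byte sample size is an arbitrary choice.

```php
<?php
/**
 * Guess the line separator of a file by inspecting its first bytes.
 * Hypothetical helper; the default sample size is an arbitrary choice.
 */
function detectLineEnding(string $filename, int $sampleSize = 4096): string
{
    $handle = fopen($filename, 'rb');
    if ($handle === false) {
        throw new RuntimeException("Cannot open $filename");
    }
    $sample = (string) fread($handle, $sampleSize);
    fclose($handle);

    // Check "\r\n" first, because it contains both of the other separators.
    if (strpos($sample, "\r\n") !== false) {
        return "\r\n";
    }
    if (strpos($sample, "\n") !== false) {
        return "\n";
    }
    if (strpos($sample, "\r") !== false) {
        return "\r";
    }
    return PHP_EOL; // no line break in the sample: fall back to the platform default
}
```

The result can be passed straight to the function above, for example readBigFile($filename, $count, detectLineEnding($filename)). A file with mixed line endings, or a first line longer than the sample, would need a more careful check.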
The general flow of the program: first define some basic variables for reading the file, then open the file, position the pointer at the specified offset, and read content of the specified size. Each piece that is read is appended to a variable, until the requested number of lines has been read or the end of the file is reached.
Never assume that everything in a program will work as planned.
With the code above, the data at a specified position and of a specified size can be obtained, but the whole process runs only once and does not retrieve all the data. To get all the data, you could wrap the code in an outer loop that checks whether the end of the file has been reached, but this wastes system resources, and because the file is so large, PHP may even time out before reaching the end. Another method is to record the pointer position where the last read ended; when the code runs again, the pointer is moved to that saved position, so there is no need to read the file from beginning to end in a single loop.
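The idea of saving the pointer position can be sketched as follows. This is an illustration, not the code from this article: readLinesFrom is a hypothetical helper that uses fgets instead of the one-byte fread loop, and it assumes lines end in "\n" or "\r\n".

```php
<?php
/**
 * Read up to $count lines starting at byte offset $offset.
 * Returns the lines plus the offset at which the next call should resume.
 * Hypothetical helper; assumes "\n" or "\r\n" line endings.
 */
function readLinesFrom(string $filename, int $offset = 0, int $count = 100): array
{
    $handle = fopen($filename, 'rb');
    if ($handle === false) {
        throw new RuntimeException("Cannot open $filename");
    }
    fseek($handle, $offset, SEEK_SET); // jump to where the previous call stopped

    $lines = [];
    while (count($lines) < $count && ($line = fgets($handle)) !== false) {
        $lines[] = rtrim($line, "\r\n"); // strip the line break
    }

    $next = ftell($handle); // position to store for the next call
    fclose($handle);

    return ['lines' => $lines, 'offset' => $next];
}
```

The caller stores the returned offset (in a file, a session, or a database row) and passes it back in on the next run. An empty 'lines' array means the end of the file has been reached. Each run then handles one batch and finishes well within PHP's execution time limit.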
Actually, I still haven't imported the CSDN data into a database, because an analysis appeared on CNBETA just a few days after the leak. Haha, they were too fast. When you see that others have already done it, you lose much of your motivation, but for the sake of learning it is still worth taking the time to finish.