Home >Backend Development >PHP Problem >How to use box/spout to parse large Excel tables

How to use box/spout to parse large Excel tables

醉折花枝作酒筹
醉折花枝作酒筹forward
2021-07-22 17:34:052926browse

A relatively famous library for PHP to parse Excel is phpoffice/phpexcel. In actual use, when encountering tens of thousands of Excel rows, the memory usage of phpexcel will soar. Today I will introduce box/spout, another PHP library that efficiently parses Excel.

How to use box/spout to parse large Excel tables

Look at the official introduction: Read and write spreadsheet files in a fast and scalable way. It can be seen that when this library parses Excel, the size of the Excel file will not affect the memory too much. Usage amount. It can be said to be an alternative to PHPExcel.

Since the document link on the Github homepage of the box/spout library has been down, here is a simple example of converting an XLSX file into CSV:

include 'vendor/autoload.php';
use Box\Spout\Reader\ReaderFactory;
use Box\Spout\Common\Type;

$t = time();
$reader = ReaderFactory::create(Type::XLSX);
//如果注释掉,单元格内的日期类型将会是DateTime,不注释的话Spout自动帮你将日期转化成string
//$reader->setShouldFormatDates(true);
$reader->open('./test.xlsx');
$iterator = $reader->getSheetIterator();
$iterator->rewind();
$sheet1 = $iterator->current();
$rowIter = $sheet1->getRowIterator();
foreach ($rowIter as $row) {
    $d = '';
    foreach ($row as $col) {
        echo $d;
        if ($col instanceof DateTime) {
            echo $col->format('Y-m-d');
        } else {
            echo $col;
        }
        $d = "\t";
    }
    echo PHP_EOL;
}
$reader->close();

The usage method is still very simple and clear.

Use memory cache String dictionary

As mentioned earlier, the memory size consumed by box/spout during parsing is not affected by the size of the Excel file. How is this achieved?

Here is a simple science: The XLSX file format conforms to a standard called OOXML (https://zh.wikipedia.org/zh-cn/Office_Open_XML). XLSX is actually a Zip package that can be decompressed to see its contents.

In the XLSX table, if the content of the cell is a string, only a stringId is saved when actually saving, and the real content of the string is saved in a String dictionary.

When reading an XLSX file, if the cell is a string, Spout will query the String dictionary. Spout has two query methods. One is to read a part of the dictionary from the file each time, and the other is to load the entire dictionary into memory for query.

Obviously, loading the entire String dictionary into memory has the fastest query speed. However, Spout is too conservative and uses the first query method in many cases. So slightly modify the Spout code and let Spout try to load all the dictionaries into the memory:

# Spout/Reader/XLSX/Helper/SharedStringsCaching/CachingStrategyFactory.php
class CachingStrategyFactory {
  ....
  const MAX_NUM_STRINGS_PER_TEMP_FILE = 10000; // 改成50000
  ....
}

Try to convert an EXCEL with 1.3w rows, 28 columns, and 2.8MB size into CSV for comparison:

Method Time consuming Occupies memory
Dictionary is not loaded To memory 185 s 1.3 MB
Dictionary loaded into memory 43 s 9.4 MB

It can be seen that the processing time difference is quite large.

Recommended learning: php video tutorial

The above is the detailed content of How to use box/spout to parse large Excel tables. For more information, please follow other related articles on the PHP Chinese website!

Statement:
This article is reproduced at:csdn.net. If there is any infringement, please contact admin@php.cn delete