Home  >  Article  >  Backend Development  >  How can I efficiently parse fixed width files in Python?

How can I efficiently parse fixed width files in Python?

Susan Sarandon
Susan SarandonOriginal
2024-10-30 18:28:31399browse

How can I efficiently parse fixed width files in Python?

Efficient Parsing of Fixed Width Files

Fixed width files pose a challenge when it comes to parsing due to their rigid structure. To address this, multiple approaches can be employed to efficiently extract data from such files.

Using the struct Module

The Python standard library's struct module offers a concise and fast solution for parsing fixed width lines. It allows for predefined field widths and data types, making it a suitable option for large datasets. The following code snippet demonstrates how to utilize struct for this purpose:

<code class="python">import struct

fieldwidths = (2, -10, 24)
fmtstring = ' '.join('{}{}'.format(abs(fw), 'x' if fw < 0 else 's') for fw in fieldwidths)

# Convert Unicode input to bytes and the result back to Unicode string.
unpack = struct.Struct(fmtstring).unpack_from  # Alias.
parse = lambda line: tuple(s.decode() for s in unpack(line.encode()))

print('fmtstring: {!r}, record size: {} chars'.format(fmtstring, struct.calcsize(fmtstring)))

line = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789\n'
fields = parse(line)
print('fields: {}'.format(fields))</code>

String Slicing with Compile-Time Optimization

String slicing is another viable method for parsing fixed width files. While initially less efficient, a technique known as "compile-time optimization" can significantly improve performance. The following code implements this optimization:

<code class="python">def make_parser(fieldwidths):
    cuts = tuple(cut for cut in accumulate(abs(fw) for fw in fieldwidths))
    pads = tuple(fw < 0 for fw in fieldwidths)  # bool flags for padding fields
    flds = tuple(zip_longest(pads, (0,)+cuts, cuts))[:-1]  # ignore final one
    slcs = ', '.join('line[{}:{}]'.format(i, j) for pad, i, j in flds if not pad)
    parse = eval('lambda line: ({})\n'.format(slcs))  # Create and compile source code.
    # Optional informational function attributes.
    parse.size = sum(abs(fw) for fw in fieldwidths)
    parse.fmtstring = ' '.join('{}{}'.format(abs(fw), 'x' if fw < 0 else 's')
                                                for fw in fieldwidths)
    return parse</code>

This optimized approach provides both efficiency and readability for parsing fixed width files.

The above is the detailed content of How can I efficiently parse fixed width files in Python?. For more information, please follow other related articles on the PHP Chinese website!

Statement:
The content of this article is voluntarily contributed by netizens, and the copyright belongs to the original author. This site does not assume corresponding legal responsibility. If you find any content suspected of plagiarism or infringement, please contact admin@php.cn