How can I efficiently parse fixed-width file lines in Python?

Barbara Streisand
2024-10-30 17:09:26

Fast Parsing of Fixed Width File Lines

In a fixed-width file, each column occupies a specific number of characters in every line. When such files are large, the per-line parsing cost adds up, so it pays to parse each line efficiently. Two approaches are discussed below:

The Problem

Consider a fixed-width file in which characters 1-20 of each line hold the first column, characters 21-30 hold the second, and so on. Given a 100-character line, how can we efficiently split it into its respective columns?
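As a baseline, the layout described above can be handled with hand-written slices. The following is only a minimal sketch: the sample record, the names name, code and rest, and the width of the third field are made up for illustration and are not part of the original question.

```python
# Hypothetical 100-character record: columns 1-20 and 21-30 follow the
# description above; the remaining 70 characters are filler for this sketch.
line = 'Jane Doe'.ljust(20) + '0042'.ljust(10) + '-' * 70

# Python slices are 0-based and exclude the end index, so
# characters 1-20 are line[0:20] and characters 21-30 are line[20:30].
name = line[0:20].rstrip()   # strip the padding spaces
code = line[20:30].rstrip()
rest = line[30:100]

print(repr(name), repr(code), len(rest))
```

Writing the offsets by hand like this is manageable for two or three columns, which is why the solutions below derive them from a tuple of field widths instead.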

Solutions

1. Struct Module:

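The standard library's struct module is both simple and fast, since it is implemented in C. Each positive width becomes an s (bytes) field in the format string, and each negative width becomes x pad bytes that are skipped. Because struct operates on bytes, the line is encoded before unpacking and every field is decoded afterwards, which also means the widths count bytes rather than characters. The code below demonstrates its usage: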

<code class="python">import struct

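fieldwidths = (2, -10, 24)  # Negative widths represent ignored padding fields.
fmtstring = ' '.join('{}{}'.format(abs(fw), 'x' if fw < 0 else 's') for fw in fieldwidths)
print('fmtstring: {!r}, record size: {} chars'.format(fmtstring, struct.calcsize(fmtstring)))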

# Convert Unicode input to bytes and decode result.
unpack = struct.Struct(fmtstring).unpack_from  # Alias.
parse = lambda line: tuple(s.decode() for s in unpack(line.encode()))

# Parse a sample line.
line = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789\n'
fields = parse(line)
print('fields:', fields)</code>

Output:

fmtstring: '2s 10x 24s', record size: 36 chars
fields: ('AB', 'MNOPQRSTUVWXYZ0123456789')
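One caveat worth illustrating: struct counts bytes, so a line containing characters that encode to more than one byte can shift the field boundaries. The sketch below reuses the '2s 10x 24s' format from above; the sample line and the two-argument parse helper are assumptions made for this illustration, and whether a single-byte encoding such as latin-1 is appropriate depends on the actual data.

```python
import struct

unpack = struct.Struct('2s 10x 24s').unpack_from

# '\u00c9' is an accented capital E: one character, but two bytes in UTF-8.
line = '\u00c9BCDEFGHIJKLMNOPQRSTUVWXYZ0123456789'

def parse(line, encoding):
    """Encode the line, unpack the byte fields, and decode each one again."""
    return tuple(s.decode(encoding) for s in unpack(line.encode(encoding)))

print(len(line), len(line.encode('utf-8')))  # 36 characters, 37 bytes

# With UTF-8 the first field holds only the accented character, and every
# later boundary is shifted by one byte.
utf8_fields = parse(line, 'utf-8')

# With a single-byte encoding, byte offsets and character offsets agree.
latin1_fields = parse(line, 'latin-1')

print(utf8_fields)
print(latin1_fields)
```

If the file is known to be ASCII, this distinction does not matter. Otherwise, slicing the decoded string (as in the next solution) sidesteps the issue because it counts characters.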

2. Optimized String Slicing:

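Plain string slicing also works, but writing out each slice by hand becomes cumbersome and error-prone as the number of fields grows. The approach below computes the slice boundaries once from the field widths and returns a function that reuses them for every line: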

<code class="python">from itertools import zip_longest
from itertools import accumulate

def make_parser(fieldwidths):
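    # Calculate slice boundaries (cumulative end offset of each field).
    cuts = tuple(accumulate(abs(fw) for fw in fieldwidths))
    # Flag padding fields (negative widths) so they can be skipped.
    pads = tuple(fw < 0 for fw in fieldwidths)
    # Create (is_padding, start, end) tuples for each field.
    flds = tuple(zip_longest(pads, (0,)+cuts, cuts))[:-1]  # Ignore final value.
    # Construct the parsing function.
    parse = lambda line: tuple(line[i:j] for pad, i, j in flds if not pad)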
    parse.size = sum(abs(fw) for fw in fieldwidths)
    parse.fmtstring = ' '.join('{}{}'.format(abs(fw), 'x' if fw < 0 else 's')
                                            for fw in fieldwidths)
    return parse

# Parse a sample line.
line = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789\n'
fieldwidths = (2, -10, 24)  # Negative values indicate ignored padding fields.
parse = make_parser(fieldwidths)
fields = parse(line)
print('fmtstring: {!r}, record size: {} chars'.format(parse.fmtstring, parse.size))
print('fields:', fields)</code>

Output:

fmtstring: '2s 10x 24s', record size: 36 chars
fields: ('AB', 'MNOPQRSTUVWXYZ0123456789')
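The two solutions also differ in how they treat a line that is shorter than the record size, which may matter when a file has truncated rows. The following sketch uses the same 2/10/24 layout; the 16-character sample line is made up for illustration.

```python
import struct

record = struct.Struct('2s 10x 24s')   # 36 bytes per record
short_line = 'ABCDEFGHIJKLMNOP'        # only 16 characters

# Slicing never raises: a slice that runs past the end is simply shorter.
sliced = (short_line[0:2], short_line[12:36])
print('sliced:', sliced)

# struct refuses a buffer that is smaller than the record size.
try:
    record.unpack_from(short_line.encode())
    raised = False
except struct.error as exc:
    raised = True
    print('struct.error:', exc)

# Padding the line to the record size first makes struct accept it.
padded = short_line.ljust(record.size)
unpacked = tuple(s.decode() for s in record.unpack_from(padded.encode()))
print('unpacked:', unpacked)
```

Which behavior is preferable depends on the data: silent truncation is convenient for ragged files, while the struct.error makes malformed rows visible.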
