Home >Backend Development >Python Tutorial >How can we efficiently split a text string of concatenated words without spaces into individual words?

How can we efficiently split a text string of concatenated words without spaces into individual words?

Barbara Streisand
Barbara StreisandOriginal
2024-11-04 10:48:021046browse

How can we efficiently split a text string of concatenated words without spaces into individual words?

Splitting Text into a Word List Without Spaces

Problem

Given a text string consisting of concatenated words without spaces:

Input: "tableapplechairtablecupboard..."

How can we efficiently split this text into a list of individual words?

Output: ["table", "apple", "chair", "table", ["cupboard", ["cup", "board"]], ...]

Algorithm

A simple approach is to iteratively find the longest possible word within the text. However, this can lead to suboptimal results.

Frequency-Based Algorithm

Instead, we can exploit the relative frequency of words in the language to improve accuracy:

  1. Model the Word Distribution: Assume words are independently distributed and follow Zipf's law, where word probability is inversely proportional to its rank.
  2. Define Word Cost: The cost of a word is defined as the logarithm of the inverse of its likelihood.
  3. Dynamic Programming Approach:

    • Initialize a cost array where the first element is 0.
    • For each character in the text, find the word that minimizes the total cost for characters up to that point.
    • Backtrack from the end to reconstruct the minimum-cost word sequence.

Code Implementation

<code class="python">from math import log

wordcost = {}  # Dictionary of word costs using Zipf's law

maxword = max(len(word) for word in wordcost)

def infer_spaces(s):
    cost = [0]
    for i in range(1, len(s) + 1):
        candidates = enumerate(reversed(cost[max(0, i - maxword):i]))
        c, k = min((wordcost.get(s[i - k - 1:i], 9e999) + c, k + 1) for k, c in candidates)
        cost.append(c)

    out = []
    i = len(s)
    while i > 0:
        c, k = best_match(i)
        assert c == cost[i]
        out.append(s[i - k:i])
        i -= k

    return " ".join(reversed(out))</code>

Results

This algorithm is able to accurately segment text into a list of words, even in the absence of spaces.

Example:

Input: "tableapplechairtablecupboard..."
Output: ["table", "apple", "chair", "table", ["cupboard", ["cup", "board"]], ...]

Optimizations:

  • Suffix Tree: By building a suffix tree from the word list, the candidate search can be accelerated.
  • Text Block Splitting: For large text inputs, the text can be split into blocks to minimize memory usage while maintaining accuracy.

The above is the detailed content of How can we efficiently split a text string of concatenated words without spaces into individual words?. For more information, please follow other related articles on the PHP Chinese website!

Statement:
The content of this article is voluntarily contributed by netizens, and the copyright belongs to the original author. This site does not assume corresponding legal responsibility. If you find any content suspected of plagiarism or infringement, please contact admin@php.cn