Given a text string made up of words concatenated without spaces:
Input: "tableapplechairtablecupboard..."
How can we efficiently split this text into a list of individual words?
Output: ["table", "apple", "chair", "table", ["cupboard", ["cup", "board"]], ...]
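A simple approach is to scan the text and repeatedly take the longest possible word at the current position. However, this greedy choice can lead to suboptimal results, because an early long match can force poor splits later in the string.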
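To make the problem with longest-match concrete, here is a minimal sketch of the greedy approach. The function name `greedy_split`, the small vocabulary, and the test string are all made up for illustration; they are not part of the original article.

```python
def greedy_split(text, vocabulary):
    """Greedy longest-match segmentation (illustrative sketch only).

    Returns a list of words, or None if the scan gets stuck.
    """
    longest = max(len(word) for word in vocabulary)
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then progressively shorter ones.
        for length in range(min(longest, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if candidate in vocabulary:
                words.append(candidate)
                i += length
                break
        else:
            # No vocabulary word starts at position i.
            return None
    return words


# Hypothetical vocabulary chosen to expose the weakness.
vocabulary = {"the", "theta", "table", "bled", "down", "own", "there"}
print(greedy_split("thetabledownthere", vocabulary))
# ['theta', 'bled', 'own', 'there']
```

Every piece is a valid word, yet the more plausible reading "the table down there" is never considered, because the first step committed to "theta".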
Instead, we can use the relative frequency of words in the language to improve accuracy.
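As a rough illustration of how frequencies help, the sketch below scores two candidate segmentations of the same string. It assumes Zipf's law, under which the word of rank r in a list of N words has probability of about 1 / (r ln N), so its cost (negative log probability) is log(r ln N). The word list and its ordering are invented for the example.

```python
from math import log

# Hypothetical vocabulary, ordered from most to least frequent.
words_by_frequency = ["the", "there", "down", "table", "own", "bled", "theta"]

# Cost of each word under the Zipf assumption: log(rank * ln N).
n = len(words_by_frequency)
wordcost = {
    word: log((rank + 1) * log(n))
    for rank, word in enumerate(words_by_frequency)
}


def segmentation_cost(words):
    """Total cost of a segmentation: the sum of its word costs."""
    return sum(wordcost[word] for word in words)


# Two ways to split "thetabledownthere".
common = ["the", "table", "down", "there"]
rare = ["theta", "bled", "own", "there"]

print(segmentation_cost(common) < segmentation_cost(rare))  # True
```

The segmentation built from frequent words gets the lower total cost, so a method that minimizes cost prefers it over the one built from rare words.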
Dynamic programming approach:
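<code class="python">from math import log

# Placeholder vocabulary, ordered from most to least frequent; replace it
# with a real frequency-ranked word list.
words = ["table", "chair", "apple", "cup", "board", "cupboard"]

# Dictionary of word costs using Zipf's law: cost = -log(probability),
# where the word of rank r has probability of about 1 / (r * log(N)).
wordcost = {word: log((rank + 1) * log(len(words))) for rank, word in enumerate(words)}
maxword = max(len(word) for word in wordcost)


def infer_spaces(s):
    """Infer the word boundaries in a string that contains no spaces."""

    def best_match(i):
        # Best (cost, length) pair for the last word of s[:i], assuming
        # cost has already been filled in for every earlier position.
        candidates = enumerate(reversed(cost[max(0, i - maxword):i]))
        return min((c + wordcost.get(s[i - k - 1:i], 9e999), k + 1) for k, c in candidates)

    # cost[i] is the minimal total cost of segmenting s[:i].
    cost = [0]
    for i in range(1, len(s) + 1):
        c, k = best_match(i)
        cost.append(c)

    # Backtrack from the end to recover the minimal-cost segmentation.
    out = []
    i = len(s)
    while i > 0:
        c, k = best_match(i)
        assert c == cost[i]
        out.append(s[i - k:i])
        i -= k
    return " ".join(reversed(out))</code>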
This algorithm can accurately split the text into a list of words, even when …
Example:
Input: "tableapplechairtablecupboard..."
Output: ["table", "apple", "chair", "table", ["cupboard", ["cup", "board"]], ...]
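The nested entry in the output above appears to list an alternative split for a compound word. One possible way to produce that shape from a flat word list is sketched below. The helper `with_alternatives` and the vocabulary are hypothetical, and the sketch only looks for two-way splits; the original article does not say how the nested form is generated.

```python
def with_alternatives(words, vocabulary):
    """Attach two-way sub-splits to words that are themselves compounds.

    A word with no such split is kept as a plain string; otherwise it
    becomes [word, [left, right], ...].
    """
    result = []
    for word in words:
        splits = [
            [word[:i], word[i:]]
            for i in range(1, len(word))
            if word[:i] in vocabulary and word[i:] in vocabulary
        ]
        result.append([word, *splits] if splits else word)
    return result


# Hypothetical vocabulary built from the words in the example.
vocabulary = {"table", "apple", "chair", "cup", "board", "cupboard"}
flat = ["table", "apple", "chair", "table", "cupboard"]
print(with_alternatives(flat, vocabulary))
# ['table', 'apple', 'chair', 'table', ['cupboard', ['cup', 'board']]]
```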
Optimization: