Given a text string made up of concatenated words with no spaces between them:
Input: "tableapplechairtablecupboard..."
How can we efficiently split this text into a list of words?
Output: ["table", "apple", "chair", "table", ["cupboard", ["cup", "board"]], ...]
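A simple approach is to repeatedly take the longest possible word from the start of the remaining text. However, this greedy choice can produce suboptimal results, because a long match early on can leave a remainder that no longer splits into valid words.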
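To make the weakness of the longest-match approach concrete, here is a minimal sketch. The function name `greedy_split`, the toy vocabulary, and the fallback of emitting single characters when nothing matches are all assumptions made for illustration.

```python
def greedy_split(text, vocabulary):
    """Repeatedly take the longest vocabulary word that starts the remaining text."""
    longest = max(len(w) for w in vocabulary)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a known word is found.
        for size in range(min(longest, len(text) - i), 0, -1):
            if text[i:i + size] in vocabulary:
                break
        else:
            size = 1  # assumption: no known word starts here, so emit one character
        tokens.append(text[i:i + size])
        i += size
    return tokens


vocabulary = {"the", "theme", "meeting"}  # hypothetical toy vocabulary
print(greedy_split("themeeting", vocabulary))
# → ['theme', 'e', 't', 'i', 'n', 'g']
```

The greedy pass commits to "theme" and can never recover "the meeting", which is the kind of failure a global, cost-based search avoids.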
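Instead, we can use the relative frequency of words in the language to improve accuracy, choosing the segmentation whose words are most probable overall:
Dynamic programming approach: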
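<code class="python">from math import log

# Placeholder vocabulary, most frequent word first; substitute a real ranked list.
words = ["table", "apple", "chair", "cup", "board", "cupboard"]
# Zipf's law: a word of rank r has probability ~ 1 / (r * log(N)); cost = -log(probability).
wordcost = {w: log((r + 1) * log(len(words))) for r, w in enumerate(words)}
maxword = max(len(w) for w in words)

def infer_spaces(s):
    def best_match(i):
        # Cheapest (cost, length) over all candidate words ending at position i.
        candidates = enumerate(reversed(cost[max(0, i - maxword):i]))
        return min((c + wordcost.get(s[i - k - 1:i], float("inf")), k + 1) for k, c in candidates)

    # cost[i] is the minimal total cost of segmenting s[:i].
    cost = [0]
    for i in range(1, len(s) + 1):
        c, k = best_match(i)
        cost.append(c)

    # Backtrack from the end to recover the cheapest segmentation.
    out = []
    i = len(s)
    while i > 0:
        c, k = best_match(i)
        assert c == cost[i]
        out.append(s[i - k:i])
        i -= k
    return " ".join(reversed(out))</code>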
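The code above relies on a `wordcost` dictionary of costs based on Zipf's law. As a rough sketch of how such costs might be derived, and why they steer the search toward whole words, the helper below assigns each word the cost log(rank × log N), meaning the negative log of an approximate Zipf probability. The name `zipf_costs` and the ranked word list are hypothetical, and the list is assumed to hold at least three words so that every cost is positive.

```python
from math import log


def zipf_costs(ranked_words):
    """Map each word to -log(probability), with probability ~ 1 / (rank * log(N))."""
    n = len(ranked_words)
    return {w: log((rank + 1) * log(n)) for rank, w in enumerate(ranked_words)}


# Hypothetical list, ordered from most to least frequent.
ranked = ["table", "apple", "chair", "cup", "board", "cupboard"]
costs = zipf_costs(ranked)

# One rarer word can still be cheaper than two more common ones,
# because the costs of the parts add up.
print(costs["cupboard"] < costs["cup"] + costs["board"])
# → True
```

Since costs are summed across a segmentation, minimizing the total is equivalent to maximizing the product of the word probabilities.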
Result:
Input: "tableapplechairtablecupboard..."
Output: ["table", "apple", "chair", "table", ["cupboard", ["cup", "board"]], ...]
The algorithm can accurately split the text into a list of words, even in...
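The nested entry `["cupboard", ["cup", "board"]]` in the output suggests that a compound word may be reported together with its own decomposition. The source does not say how that structure is produced, so the following is only one possible reading: a post-processing step, here called `with_alternatives` (a hypothetical name), that pairs a word with its first split into two known words.

```python
def with_alternatives(words, vocabulary):
    """Annotate each word with a two-word split when both halves are known words."""
    annotated = []
    for word in words:
        splits = [[word[:i], word[i:]] for i in range(1, len(word))
                  if word[:i] in vocabulary and word[i:] in vocabulary]
        # Keep plain words as-is; pair compound words with their first split.
        annotated.append([word, splits[0]] if splits else word)
    return annotated


# Hypothetical vocabulary and an already segmented word list.
vocabulary = {"table", "apple", "chair", "cup", "board", "cupboard"}
segmented = ["table", "apple", "chair", "table", "cupboard"]
print(with_alternatives(segmented, vocabulary))
# → ['table', 'apple', 'chair', 'table', ['cupboard', ['cup', 'board']]]
```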