[PHP] Why does this happen when using PHP to count words in a plain English txt file? [Resolved]
The code is as follows:
<code><?php
/**
 * Given any plain-text English file, count how many times each word occurs.
 * Created by PhpStorm.
 * User: Paul
 * Date: 2016/11/5
 * Time: 23:18
 */
$content = file_get_contents('4/Gone with the wind.txt');
$res = count_word($content, 1);
print_r($res);

/**
 * Given any plain-text English file, count how many times each word occurs.
 * @param string $string The text to analyse
 * @param int $lower Case handling. 1: case-insensitive, 0: case-sensitive
 * @return array
 */
function count_word($string, $lower = 0)
{
    $string = trim($string);
    if ($lower) {
        $string = strtolower($string);
    }
    // Filter out some punctuation marks
    $string = str_replace(';', '', $string);
    $string = str_replace(',', '', $string);
    $string = str_replace('.', '', $string);
    $string = str_replace('.', '', $string);
    $string = str_replace('‘', '', $string);
    $string = str_replace('?', '', $string);
    $string = str_replace('“', '', $string);
    $string = str_replace('”', '', $string);
    $string = str_replace('―', '', $string);
    $string = str_replace('-', '', $string);
    $string = str_replace('!', '', $string);
    $string = str_replace(':', '', $string);
    $string = str_replace('(', '', $string);
    $string = str_replace(')', '', $string);
    $array = explode(' ', trim($string));
    $res = array();
    foreach ($array as $key => $value) {
        // Skip words such as I’ll, you’re, masters’s
        if (strpos($value, '’') !== false || strpos($value, "'") !== false) {
            continue;
        }
        // Skip empty values
        if (empty($value) === true) {
            continue;
        }
        if (array_key_exists($value, $res)) {
            $res[$value]++;
        } else {
            $res[$value] = 1;
        }
    }
    // Sort by count, descending
    array_multisort($res, SORT_DESC, SORT_NUMERIC);
    return $res;
}</code>
Output result:
<code>array( [repression] => 1 [thoroughness] => 1 [bleached] => 1 [tow] => 1 [inspired] => 1 [uniformwell] => 1 [panamas] => 1 [caps when] => 1 )</code>
I don’t understand why two words are counted as one. The txt file was opened in Sublime with the encoding set to UTF-8; it was never opened or edited with the text editor that ships with the operating system. In addition, when filtering punctuation marks I also tried filtering out \r\n, but it had no effect, so I removed that code. Why does this happen, and how can I avoid it?
Your problem is most likely twofold: line feeds (and carriage returns) are not handled, and the filtered characters are replaced with '' when they should be replaced with ' '.
<code class="php"><?php
// I don't have your original text, so this reads the script itself as the sample
$content = file_get_contents(__FILE__);
$res = count_word($content, 1);
print_r($res);

/**
 * Given any plain-text English file, count how many times each word occurs.
 * @param string $string The text to analyse
 * @param int $lower Case handling. 1: case-insensitive, 0: case-sensitive
 * @return array
 */
function count_word($string, $lower = 0)
{
    $string = trim($string);
    if ($lower) {
        $string = strtolower($string);
    }
    // Replace punctuation marks and line breaks with a space
    $string = str_replace([';',',','.','.','‘','?','“','”','―','-','!',':','(',')',"\r","\n"], ' ', $string);
    $array = explode(' ', $string);
    $res = array();
    foreach ($array as $key => $value) {
        // Skip empty values
        if (!$value) {
            continue;
        }
        // Skip words such as I’ll, you’re, masters’s
        if (strpos($value, '’') !== false || strpos($value, "'") !== false) {
            continue;
        }
        if (array_key_exists($value, $res)) {
            $res[$value]++;
        } else {
            $res[$value] = 1;
        }
    }
    // Sort by count, descending
    array_multisort($res, SORT_DESC, SORT_NUMERIC);
    return $res;
}</code>
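As a possible alternative to listing every punctuation mark by hand, the splitting step could be done with a regular expression that treats any run of non-letter characters as a separator. This is only a sketch under assumptions: the function name `count_word_regex` and the sample string are made up for illustration, the input is assumed to be valid UTF-8, and words containing an apostrophe are skipped in the same way as in the code above.

```php
<?php
/**
 * Hypothetical variant of count_word(): split on anything that is not a
 * letter or an apostrophe, so spaces, tabs, \r, \n and punctuation all
 * act as separators.
 */
function count_word_regex(string $string, bool $lower = false): array
{
    if ($lower) {
        $string = strtolower($string);
    }
    // \p{L} matches any Unicode letter; the u flag makes the pattern UTF-8 aware.
    // The ASCII apostrophe and the curly one (’) are kept inside tokens so that
    // contractions can be recognised and skipped below.
    // preg_split() returns false on invalid UTF-8; fall back to an empty list.
    $tokens = preg_split("/[^\\p{L}'’]+/u", $string, -1, PREG_SPLIT_NO_EMPTY) ?: [];

    $res = [];
    foreach ($tokens as $token) {
        // Skip contractions and possessives such as I'll, you're, master's.
        if (strpos($token, "'") !== false || strpos($token, '’') !== false) {
            continue;
        }
        $res[$token] = ($res[$token] ?? 0) + 1;
    }
    // Sort by count, highest first, keeping the word keys.
    arsort($res, SORT_NUMERIC);
    return $res;
}

// Made-up sample containing a Windows line break and a hyphenated word.
$sample = "Caps when\r\nuniform-well, caps; I'll go.";
print_r(count_word_regex($sample, true));
```

With this approach "uniform-well" is counted as the two words "uniform" and "well" instead of being glued into "uniformwell", and the line break no longer joins "caps" and "when".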
I don’t know what the text in your file looks like, but the trim function only strips whitespace (including \r\n) from both ends of the string, not from the middle, so I think that is where the problem lies.
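The point about `trim` can be checked with a short experiment. The sample string below is made up; it only assumes that the file uses Windows-style `\r\n` line endings, which would be consistent with the `caps when` key in the question's output.

```php
<?php
// Made-up sample: two words on separate lines, joined by a Windows line ending.
$text = "  caps\r\nwhen  ";

// trim() strips whitespace only at the start and end of the string.
$trimmed = trim($text);

// The \r\n in the middle survives, so splitting on a space finds nothing to split.
$parts = explode(' ', $trimmed);

// json_encode() makes the otherwise invisible characters visible.
echo json_encode($parts), PHP_EOL; // ["caps\r\nwhen"]

// Splitting on any run of whitespace separates the two words.
$words = preg_split('/\s+/', $trimmed, -1, PREG_SPLIT_NO_EMPTY);
echo json_encode($words), PHP_EOL; // ["caps","when"]
```

Printing suspicious keys through `json_encode` is also a quick way to confirm whether an array key such as `caps when` really contains a space or a hidden line break.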