Introduction to using the Chinese word segmentation tool ICTCLAS from PHP
For a Chinese search engine, word segmentation is one of the most basic parts of the whole system, because search based on single characters currently does not perform very well. This article is not a study of Chinese search engines, though. It shares how to build an on-site search engine with PHP, and it is one article in that series.
The word segmentation tool I use is the open-source version of ICTCLAS from the Institute of Computing Technology, Chinese Academy of Sciences. There is also the open-source Bamboo, which I will look into later.
ICTCLAS is a good place to start: its algorithm is widely known, it is described in public academic papers, it is easy to compile, and it has few library dependencies. However, only C/C++, Java and C# versions of the code are currently provided, and there is no PHP version. What can be done? One option is to study the C/C++ source code and the papers and then write a PHP port. Instead, I chose inter-process communication: the PHP code calls the executable built from the C/C++ version.
After downloading and unpacking the source code, run make ictclas on a machine with a C++ compiler and development libraries. The Makefile contains an error: the command that runs the test is not prefixed with './', so, unlike on Windows, it cannot be executed. This does not affect the compiled result.
The PHP class for Chinese word segmentation is below. It uses the proc_open() function to run the segmentation program and talks to it through pipes: it writes the text to be segmented to the program's standard input and reads the segmentation result from its standard output.
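<?php
class NLP {
    // Directory that contains the ictclas executable (no trailing '/').
    private static $cmd_path;

    static function set_cmd_path($path) {
        self::$cmd_path = $path;
    }

    // Run ictclas, send $str to its stdin and return its stdout as UTF-8.
    private static function cmd($str) {
        $descriptorspec = array(
            0 => array("pipe", "r"), // stdin: the child process reads from it
            1 => array("pipe", "w"), // stdout: the child process writes to it
        );
        $cmd = self::$cmd_path . "/ictclas";
        $output = '';
        $process = proc_open($cmd, $descriptorspec, $pipes);
        if (is_resource($process)) {
            // The program is fed GBK text, so convert from UTF-8 first.
            $str = iconv('utf-8', 'gbk', $str);
            fwrite($pipes[0], $str);
            // Close stdin before reading so the program sees end-of-input.
            fclose($pipes[0]);
            $output = stream_get_contents($pipes[1]);
            fclose($pipes[1]);
            $return_value = proc_close($process);
        }
        /* Alternative that goes through the shell instead of pipes:
        $cmd = "printf '$str' | " . self::$cmd_path . "/ictclas";
        exec($cmd, $output, $ret);
        $output = join("\n", $output);
        */
        $output = trim($output);
        $output = iconv('gbk', 'utf-8', $output);
        return $output;
    }

    /**
     * Segment the text and return the list of words.
     * Each item is array('seg' => word, 'tag' => part-of-speech tag).
     */
    static function tokenize($str) {
        $tokens = array();
        $output = self::cmd($str);
        if ($output) {
            $ps = preg_split('/\s+/', $output);
            foreach ($ps as $p) {
                // Split on the last '/' so a missing tag does not raise a notice.
                $pos = strrpos($p, '/');
                $seg = ($pos === false) ? $p : substr($p, 0, $pos);
                $tag = ($pos === false) ? '' : substr($p, $pos + 1);
                $tokens[] = array(
                    'seg' => $seg,
                    'tag' => $tag,
                );
            }
        }
        return $tokens;
    }
}

NLP::set_cmd_path(dirname(__FILE__));
?>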
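The pipe handling can be tried on its own, without a compiled ictclas binary. The following is only a minimal sketch under some assumptions: pipe_through() is a hypothetical helper name, and the Unix cat command stands in for the segmentation program because it simply echoes its input. The point it illustrates is the order of operations: write to the child's stdin, close it so the child sees end-of-input, and only then read its stdout.

```php
<?php
// Hypothetical helper (not part of ICTCLAS or the NLP class):
// send $input to a command's stdin and return everything it prints.
function pipe_through($cmd, $input) {
    $descriptorspec = array(
        0 => array("pipe", "r"), // child's stdin
        1 => array("pipe", "w"), // child's stdout
    );
    $process = proc_open($cmd, $descriptorspec, $pipes);
    if (!is_resource($process)) {
        return false; // the command could not be started
    }
    fwrite($pipes[0], $input);
    fclose($pipes[0]);                         // signal end-of-input
    $output = stream_get_contents($pipes[1]);  // read until the child exits
    fclose($pipes[1]);
    proc_close($process);
    return $output;
}

// Assumption: a Unix-like system where `cat` is available.
echo pipe_through('cat', "hello pipes\n");
```

For short texts this write-then-read sequence is enough. With very large inputs, a child that starts producing output before it has consumed all of its input could fill the stdout pipe, so a more careful version would probably interleave reads and writes.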
Using the class is simple (make sure the compiled ICTCLAS executable and its dictionary are in the current directory):
<?php
require_once('NLP.php');
var_dump(NLP::tokenize('Hello, World!'));
?>
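To see the shape of the list that tokenize() returns without running the segmenter, the parsing step can be sketched separately. This assumes, as the class does, that the program prints whitespace-separated items of the form word/tag. The helper name parse_tagged() is hypothetical, and the sample string is hand-written for illustration rather than real ICTCLAS output.

```php
<?php
// Hypothetical helper: turn "word/tag word/tag ..." into a list of
// array('seg' => word, 'tag' => tag) items.
function parse_tagged($output) {
    $tokens = array();
    $parts = preg_split('/\s+/u', trim($output), -1, PREG_SPLIT_NO_EMPTY);
    foreach ($parts as $p) {
        // Split on the last '/' so a word that itself contains '/' survives.
        $pos = strrpos($p, '/');
        if ($pos === false) {
            $tokens[] = array('seg' => $p, 'tag' => '');
        } else {
            $tokens[] = array(
                'seg' => substr($p, 0, $pos),
                'tag' => substr($p, $pos + 1),
            );
        }
    }
    return $tokens;
}

// Hand-written sample in the assumed "word/tag" format (tags are illustrative).
print_r(parse_tagged("我/r 爱/v 北京/ns"));
```

Each element carries the word under 'seg' and its tag under 'tag', which is the structure an on-site search indexer would then consume.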