Use PHPdig to create your own Google [Graphic Tutorial]
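I. What is PHPdig?
PHPdig is a vertical search engine product that is very popular abroad (less a product than a search technique distinct from traditional search engines). It is written in PHP and relies on PHP's runtime efficiency for fast search responses. Like Google, Baidu, and other search engines, it can search the Internet, and besides ordinary web pages it handles txt, doc, xls, pdf, and other file formats, with strong content search and file parsing capabilities. Like a traditional search engine, PHPdig is built on three basic technologies:
1. Spider technology
2. Structured web information extraction, or metadata collection
3. Word segmentation and indexing
Unlike traditional search engines, PHPdig suits more specialized, deeper, personalized search engines, which makes it a strong choice for building a vertical search engine for a specific field.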
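The three technologies listed above can be illustrated with a toy pipeline. This is only a minimal sketch in Python, not PHPdig's own code: the pages, URLs, and function names are hypothetical, tag stripping stands in for information extraction, and the tokenizer handles only ASCII words.

```python
import re
from collections import defaultdict

# Hypothetical pages standing in for what a spider would fetch.
PAGES = {
    "http://example.com/a": "<html><title>PHP search</title><body>Build a search engine in PHP</body></html>",
    "http://example.com/b": "<html><title>MySQL tips</title><body>Tune a MySQL search index</body></html>",
}

def extract_text(html):
    """Strip tags: a crude stand-in for structured information extraction."""
    return re.sub(r"<[^>]+>", " ", html)

def tokenize(text):
    """Lowercase the text and keep runs of ASCII letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(pages):
    """Build an inverted index: each word maps to the URLs containing it."""
    index = defaultdict(set)
    for url, html in pages.items():
        for word in tokenize(extract_text(html)):
            index[word].add(url)
    return index

index = build_index(PAGES)
print(sorted(index["search"]))  # both hypothetical pages contain "search"
```

Looking up a word in `index` returns the set of pages that contain it, which is the basic operation a search page performs.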
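II. How do I get PHPdig?
PHPdig is free (the copyright notice must be kept). The latest version is phpdig-1.8.9, but to avoid compatibility problems with Apache and MySQL versions, an earlier release is recommended. The website is http://www.phpdig.net , and the download page is http://www.phpdig.net/navigation.php?action=download . I tried phpdig-1.8.9 and ran into many problems; switching to PHPdig-1.8.8 caused fewer.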
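III. Step-by-step procedure
1. Get the product
Visit http://www.phpdig.net/navigation.php?action=download , download PHPdig-1.8.8 to the desktop, and unzip it into the Apache server's html directory, typically D:\usr\www\html\. (If you have not installed Apache, install it first. Mappm-Server v1.1.9 Final is recommended: its one-step installer makes it convenient for debugging and running PHP/CGI and MySQL programs.)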
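2. Run PHPdig and configure its database
Open a browser, enter http://localhost/phpdig/ and press Enter. The page lists all of PHPdig's files and folders, and a quick look shows there is no default home page file (default, index). Clicking search.php produces the error: Unable to connect to database : Check the connection script. The connection fails because PHPdig's database has not been configured yet. Go back, enter the admin directory, find install.php, and click it to run. The interface is entirely in English (no current version of PHPdig supports a Chinese interface). That is not a problem: if you have localization experience you can translate it yourself, and a cn-language.php file that I translated is provided for download here (copy it into the locales directory). You also need to modify config.php in the includes directory (language setting) and style.css (fonts and styles).
After opening install.php, the system asks for the PHPdig administrator user name and password, both admin by default. After logging in, the following screen appears (after localization):
(Figure 1)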
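The information required is:
If you are testing locally, enter the default server name localhost (localhost is the default server name under Mappm-Server, which is also the default MySQL server name; Mappm-Server bundles MySQL). The database server port can be left blank to use the default (3306 for MySQL), and the database socket is empty by default. The user name defaults to root (the Mappm-Server default), and the password is the one you entered when installing Mappm-Server. The PHPdig database name defaults to phpdig and can be changed freely. You can also add a prefix to the table names in the database; the default is empty.
If you are uploading to a web server connected to the Internet, ask your hosting provider for the MySQL server name or IP address, along with the database server port, socket, user name, and password. The database name and table prefix are set as above.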
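The settings described above can be summarized as a small data structure. The sketch below is written in Python purely for illustration; the field names, the host db.example.com, the user site_user, and the table name pages are all hypothetical and are not taken from PHPdig's installer. It mainly shows how an empty versus non-empty table prefix changes the table names that end up in the database.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class DbSettings:
    """Connection settings mirroring the fields on the install form (names are hypothetical)."""
    host: str = "localhost"      # local testing default
    port: Optional[int] = None   # None means "left blank": use the server default
    socket: str = ""             # empty by default
    user: str = "root"
    password: str = ""
    database: str = "phpdig"     # default database name, can be changed
    table_prefix: str = ""       # empty by default

    def table(self, name):
        """Return the table name with the configured prefix applied."""
        return f"{self.table_prefix}{name}"

# Local test setup versus a hosted setup with values from the provider.
local = DbSettings(password="secret")
hosted = DbSettings(host="db.example.com", user="site_user",
                    password="secret", table_prefix="dig_")

print(local.table("pages"), hosted.table("pages"))
```

A prefix is mostly useful on shared hosting, where several applications may have to live in one database.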
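As for the four radio buttons on the right, choose according to your situation. For a first-time installation, select the default, "Create database".
After confirming that the information is correct, click the Install button. If the database connection fails, the error message "Cannot connect to database" appears; if it succeeds, you are taken straight to the administration page, shown below:
(Figure 2)
3. Interface areas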
Area 1 is a text input area. The default text has three lines, each starting with http; this is where you enter the addresses of the websites to spider (spidering only one website at a time is recommended).
Area 2 holds the spider options. Search depth is the number of directory levels the spider descends into the website. Links per page is the maximum number of linked pages the spider follows from any one page. Both default to 0, which means the entire site is spidered.
Area 3 displays database status information, including the websites that have been spidered, keywords, indexes, and the sites currently being spidered.
Area 4 is a drop-down list of the URLs of spidered sites. Select a site there to clear or update it in Area 5.
Area 5 provides the clear and update operations for the site selected in Area 4, as well as links to related statistics and spider controls.
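To make the two spider options in Area 2 concrete, here is a minimal Python sketch of a breadth-first crawl over a hypothetical in-memory link graph. It follows this article's description that 0 means "no limit"; it is not PHPdig's implementation, and the names `LINKS`, `crawl`, `max_depth`, and `max_links` are invented for the example.

```python
from collections import deque

# Hypothetical site: each page maps to the pages it links to.
LINKS = {
    "/": ["/a", "/b", "/c"],
    "/a": ["/a1", "/a2"],
    "/b": ["/b1"],
    "/c": [],
    "/a1": ["/deep"],
    "/a2": [],
    "/b1": [],
    "/deep": [],
}

def crawl(start, max_depth=0, max_links=0):
    """Breadth-first crawl; 0 means 'no limit' for either option."""
    seen = {start}
    order = []
    queue = deque([(start, 0)])
    while queue:
        page, depth = queue.popleft()
        order.append(page)
        if max_depth and depth >= max_depth:
            continue  # depth limit reached: index the page but do not follow its links
        links = LINKS.get(page, [])
        if max_links:
            links = links[:max_links]  # follow only the first max_links links
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return order

print(crawl("/"))               # whole site
print(crawl("/", max_depth=1))  # start page plus the pages it links to
print(crawl("/", max_links=1))  # only the first link on each page
```

With both options at 0 every reachable page is visited, which is why a full-site spider can take so long on a large site.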
4. Run the spider on a specific site
Suppose you are very interested in the content of the Tianji Software channel. You can build a search engine dedicated to that content, one that can cover it more comprehensively and in more depth than Google. The following uses the Tianji Software channel as an example of how to spider a website.
1) Enter http://soft.yesky.com in Area 1 of Figure 2, and leave the search depth and links per page at the default of 0.
2) Click the spider button. The page jumps to the spider information page, and the program starts spidering http://soft.yesky.com automatically.
Note: Spidering a website is very slow. If the site has a lot of content, the process may take anywhere from a few hours to a day, but you do not need to worry about the script timing out, because the system timeout is set to a maximum of 48 hours. During this process you can interrupt the spider and later restart it to finish the remaining pages. If you accidentally close the spider page while it is running, the spider does not actually stop and keeps consuming system resources. Reopen the spider page and click the Stop spider link to release them.
(Figure 3)
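The note above says an interrupted spider can be restarted to finish the remaining pages. How PHPdig does this internally is not described here; the Python sketch below shows one generic approach, which is to keep the list of pending and finished pages as explicit state that survives between runs. The link graph and all names are hypothetical.

```python
from collections import deque

# Hypothetical site: each page maps to the pages it links to.
LINKS = {"/": ["/a", "/b"], "/a": ["/c"], "/b": [], "/c": []}

def new_state(start):
    """Initial crawl state: one pending page, nothing done."""
    return {"pending": [start], "done": []}

def run(state, budget):
    """Process at most `budget` pages, then return the updated state."""
    pending = deque(state["pending"])
    done = list(state["done"])
    while pending and budget > 0:
        page = pending.popleft()
        done.append(page)
        budget -= 1
        for link in LINKS.get(page, []):
            if link not in done and link not in pending:
                pending.append(link)
    return {"pending": list(pending), "done": done}

state = run(new_state("/"), budget=2)  # interrupted after two pages
print(state)
state = run(state, budget=10)          # restarted later, finishes the rest
print(state)
```

Because the state is a plain dictionary of lists, it could be saved to a file or database table between runs.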
5. Search using PHPdig
After the spider has run for a while, the information from http://soft.yesky.com is stored in the server database, mainly each page's title, keywords, and address. At this point you can search by opening search.php.
(Figure 4)
You can choose how many results to display and whether to use fuzzy or exact matching. You can also restrict the search to a particular site; by default, all spidered sites are searched.
(Figure 5)
The figure above shows the results page for a search on "QQ2006".
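The search options described in this section (result count, fuzzy or exact matching, and restricting to one site) can be sketched over a handful of stored records. This Python example is an illustration only: the records, site names, and the `search` function are hypothetical, and "fuzzy" is modeled here simply as substring matching, which may differ from what PHPdig does.

```python
# Hypothetical records, as a spider might have stored them.
RECORDS = [
    {"site": "site-a.example", "url": "http://site-a.example/1", "keywords": ["qq2006", "download"]},
    {"site": "site-a.example", "url": "http://site-a.example/2", "keywords": ["qq2006beta", "review"]},
    {"site": "site-b.example", "url": "http://site-b.example/3", "keywords": ["qq2006", "news"]},
]

def search(query, exact=True, site=None, limit=10):
    """Return up to `limit` URLs whose keywords match the query."""
    q = query.lower()
    hits = []
    for rec in RECORDS:
        if site and rec["site"] != site:
            continue  # restricted to one site and this record is from another
        if exact:
            matched = q in rec["keywords"]                    # whole-keyword match
        else:
            matched = any(q in kw for kw in rec["keywords"])  # substring match
        if matched:
            hits.append(rec["url"])
    return hits[:limit]

print(search("QQ2006"))                         # exact, all sites
print(search("QQ2006", exact=False))            # fuzzy also finds "qq2006beta"
print(search("QQ2006", site="site-a.example"))  # restricted to one site
```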
6. Problems
Because of PHPdig's language settings, its word segmentation, and character handling in the MySQL database, searching for Chinese vocabulary in PHPdig is still unreliable. These issues need further work. Comments are welcome, and anyone interested can discuss them in the Taoba-PHPdig theme community.
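The word segmentation problem can be seen in a few lines of Python. Chinese text has no spaces between words, so a tokenizer that splits on whitespace treats a whole phrase as one token. Overlapping character bigrams are one common workaround; it is shown here as a generic technique, not as something PHPdig provides, and the function names are hypothetical.

```python
def whitespace_tokens(text):
    """Split on whitespace, as a tokenizer designed for English might."""
    return text.split()

def bigrams(text):
    """Split text into overlapping two-character tokens, ignoring whitespace."""
    chars = [c for c in text if not c.isspace()]
    if len(chars) < 2:
        return ["".join(chars)] if chars else []
    return [chars[i] + chars[i + 1] for i in range(len(chars) - 1)]

text = "天极软件频道"
print(whitespace_tokens(text))  # the whole phrase comes back as a single token
print(bigrams(text))            # overlapping pairs, including "软件" (software)
```

With bigrams, a query for 软件 can match the phrase above, whereas with whitespace splitting only a query for the entire phrase would.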