Original address: http://javaz.cn/site/javaz/site_study/info/2015/23312.html
Project address: http://www.freeteam.cn/
Web page information collection
Supported starting from FreeCMS 2.1
With simple configuration, the system captures information from target web pages. It supports incremental collection, keyword replacement, and scheduled collection. A single collection rule can collect multiple pages (static and dynamic) and multiple information attributes, and can automatically approve the collected information and generate static pages for it.
Collection Rules Management
Click "Collection Rules" in the management menu on the left to open the list.
Add collection rules
Click the "Add" button below the collection rules list.
Fill in the relevant attributes, and then click the "Save" button.
Collection rule attribute description
Collection rule attributes are grouped into five tabs: Basic, Settings, Collection Address, Collection Attributes, and Keyword Replacement.
Generally, you only need to fill in the attributes on the Basic tab. If you need more advanced settings, use the other tabs.
The main attributes are explained below.
Name: The name of the collection rule.
Collected column: The column to which the collected information is added.
Page encoding: The character encoding of the target web page. The default is UTF-8.
Collection address: The address of the target web page. Only one address can be set on the Basic tab. To set several, use the Collection Address tab.
Collection scheduling: Sets the schedule on which the collection operation runs. This setting is very important: the system executes the collection operation only when collection scheduling is set.
Content list start and end html: The system extracts information attributes by intercepting the text between two keywords in the content of the target web page, so the start and end html of each target attribute matter a great deal. They must be relatively unique so that the system can intercept the target attribute correctly. This attribute is used to intercept the html of the information list on the target page.
Content address start and end html: After obtaining the content list html according to the above attributes, use this attribute to intercept each content address.
Content title start and end html: After obtaining a content address according to the above attributes, the system fetches the web page at that address and intercepts the content title based on this attribute. The other content-related attributes are set in the same way and are not described in detail below.
Status: The system executes only collection rules whose status is enabled.
Collect pictures: Download the pictures in the information content and store them locally.
Automatically approved: Set the collected information directly to the approved status.
Use the click volume of collected information: By default, the click volume of collected information is 0. After setting this attribute together with the content click volume start and end html, the system intercepts the click volume of the target information and uses it as the click volume of the collected information.
Maximum number of collected contents: No limit by default. If this attribute is set, the system counts from the collection records how many pieces of information this collection rule has already collected. Once the maximum number is reached, the system stops collecting.
Set the first image as the title image: If there are images in the information content, extract the first image as the title image and set the information as image information.
Clear the html tags in the content: Clear the html tags in the information content and keep the plain text.
Whether to collect when the content is empty: You can set not to collect this information when the content is empty.
Use the adding time of the collected information: By default, the adding time of collected information is the current time. After setting this attribute together with the content adding time start and end html, the system intercepts the adding time of the target information and uses it as the adding time of the collected information.
Collection information adding time format: The default format is yyyy-MM-dd. If the adding time format of the target page is different, it needs to be set to the correct date format here.
Collection start time: The default is the current time. If the current time is earlier than the collection start time, the system will not collect.
Collection end time: The default is to never end. If the collection end time is exceeded, the system will not collect.
Content address completion url: Some web pages link with relative or absolute paths rather than full URLs. This attribute sets the prefix used to complete the content address.
Image address completion url: Some web pages link with relative or absolute paths rather than full URLs. This attribute sets the prefix used to complete the image link address.
Completion url of the A tag link address in the content: Some web pages link with relative or absolute paths rather than full URLs. This attribute sets the prefix used to complete the A tag link addresses in the content.
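The start and end html attributes and the completion url can be pictured with a small sketch. The following Java class is only an illustration of the idea described above, not FreeCMS source code: the class name MarkerExtractor, its method names, and the example.com prefix in the usage note are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper illustrating start/end html interception; not FreeCMS code. */
final class MarkerExtractor {

    /** Returns the text between the first start marker and the next end marker, or null if either is missing. */
    static String between(String html, String start, String end) {
        int from = html.indexOf(start);
        if (from < 0) return null;
        from += start.length();
        int to = html.indexOf(end, from);
        if (to < 0) return null;
        return html.substring(from, to);
    }

    /** Returns every non-overlapping segment enclosed by the two markers, in document order. */
    static List<String> allBetween(String html, String start, String end) {
        if (start.isEmpty() || end.isEmpty()) {
            throw new IllegalArgumentException("markers must not be empty");
        }
        List<String> out = new ArrayList<>();
        int pos = 0;
        while (true) {
            int from = html.indexOf(start, pos);
            if (from < 0) break;
            from += start.length();
            int to = html.indexOf(end, from);
            if (to < 0) break;
            out.add(html.substring(from, to));
            pos = to + end.length();
        }
        return out;
    }

    /** Prepends the completion prefix unless the address is already a full URL. */
    static String complete(String prefix, String address) {
        if (address.startsWith("http://") || address.startsWith("https://")) return address;
        return prefix + address;
    }

    /** Cuts out the content list, then each content address inside it, then completes each address. */
    static List<String> contentAddresses(String pageHtml, String listStart, String listEnd,
                                         String addrStart, String addrEnd, String prefix) {
        List<String> urls = new ArrayList<>();
        String listHtml = between(pageHtml, listStart, listEnd);
        if (listHtml == null) return urls; // list markers not found: nothing to collect
        for (String address : allBetween(listHtml, addrStart, addrEnd)) {
            urls.add(complete(prefix, address));
        }
        return urls;
    }
}
```

For a page whose list is wrapped in <ul id="news"> ... </ul>, calling contentAddresses with those two strings as the list markers, <a href=" and " as the address markers, and http://example.com as the prefix would return only the links inside that list. This also shows why the markers need to be relatively unique: a start marker that also appears earlier in the page would make the cut begin in the wrong place.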
Collection addresses are either static or dynamic. A static address is a fixed address. A dynamic address generally refers to a paginated address, in which {page} stands for the page number. You can set the first and last page to collect. For example, with http://www.freetam.cn/list_{page}.html, a starting page number of 1, and an ending page number of 10, the system automatically collects the data on every page from http://www.freetam.cn/list_1.html to http://www.freetam.cn/list_10.html.
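The expansion of a dynamic address can be sketched in a few lines of Java. This is a minimal illustration of the {page} substitution described above, not the actual FreeCMS implementation; the class name PagedAddress and the method expand are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper illustrating {page} expansion; not FreeCMS code. */
final class PagedAddress {

    /** Replaces {page} with every page number from startPage to endPage, inclusive. */
    static List<String> expand(String template, int startPage, int endPage) {
        List<String> urls = new ArrayList<>();
        for (int page = startPage; page <= endPage; page++) {
            urls.add(template.replace("{page}", String.valueOf(page)));
        }
        return urls;
    }
}
```

Calling PagedAddress.expand("http://www.freetam.cn/list_{page}.html", 1, 10) yields ten addresses, from list_1.html through list_10.html. A static address contains no {page} variable, so it corresponds to a single fixed URL.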
Under normal circumstances, we only collect the title and content of the information. The system also provides the function of collecting content description, clicks, author, source, and adding time attributes.
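For the adding time attribute, the intercepted text has to be parsed with the configured date format. The sketch below shows one way this could work, assuming the standard java.time API; FreeCMS itself may parse dates differently, and the class name AddTimeParser and its fallback parameter are hypothetical.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;

/** Hypothetical helper illustrating the adding time format setting; not FreeCMS code. */
final class AddTimeParser {

    /**
     * Parses the intercepted text with the given pattern.
     * An empty or null pattern falls back to yyyy-MM-dd, the default named in the manual.
     * Returns the fallback (for example the current date) when the text does not match.
     */
    static LocalDate parse(String text, String pattern, LocalDate fallback) {
        if (text == null) return fallback;
        String effective = (pattern == null || pattern.isEmpty()) ? "yyyy-MM-dd" : pattern;
        try {
            return LocalDate.parse(text.trim(), DateTimeFormatter.ofPattern(effective));
        } catch (DateTimeParseException e) {
            return fallback; // text does not match the configured format
        }
    }
}
```

If the target page prints dates such as 07/03/2015, the format would need to be set to dd/MM/yyyy. With the default yyyy-MM-dd that text fails to parse, and this sketch would return the fallback value instead.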
Through the keyword replacement function, you can replace the keywords in the collected information with the keywords you want.
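Keyword replacement can be thought of as a list of literal search-and-replace pairs applied to the collected text. The Java sketch below illustrates that idea only; the class name KeywordReplacer is hypothetical, and whether FreeCMS applies its rules in order or matches case-sensitively is an assumption made here.

```java
import java.util.Map;

/** Hypothetical helper illustrating keyword replacement; not FreeCMS code. */
final class KeywordReplacer {

    /**
     * Applies each keyword -> replacement pair in the map's iteration order.
     * Matching is literal and case-sensitive. Later pairs see the output of earlier ones.
     */
    static String apply(String text, Map<String, String> replacements) {
        String result = text;
        for (Map.Entry<String, String> rule : replacements.entrySet()) {
            if (rule.getKey().isEmpty()) continue; // an empty keyword would match everywhere
            result = result.replace(rule.getKey(), rule.getValue());
        }
        return result;
    }
}
```

Passing a LinkedHashMap keeps the rules in the order they were entered, which matters when the replacement text of one rule contains the keyword of a later rule.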
Edit collection rules
Select the collection rule to be edited, and then click the "Edit" button.
Note: Only one collection rule can be edited at a time.
Fill in the relevant attributes and click the "Save" button.
Collection
Select the collection rule to run, and then click the "Collect" button.
Note: Only one collection rule can be run at a time.
Delete collection rules
Select the collection rule to be deleted, and then click the "Delete" button.
Tip: Multiple collection rules can be deleted at the same time.
To prevent accidental deletion, the system asks the user to confirm. Click "OK" to complete the deletion.
View the collection record
Click "Collection Records" in the management menu on the left to open the list.
Here you can view all web page collection records. You can delete specific collection records. Doing so does not delete the information that was collected. Select the collection records to be deleted, and then click the "Delete" button.
Tip: Multiple collection records can be deleted at the same time.
To prevent accidental deletion, the system asks the user to confirm. Click "OK" to complete the deletion.