How to use dedecms collection
Taking the official DedeCMS website as an example, we collect the PHP tutorial column under the Webmaster Academy. Open the list address http://www.dedecms.com/web-art/PHP_jiaocheng.
Log in to the backend, enter "Collection Node Management", create a new node, and select "Normal Article" as the content model.
1. Set the basic information of the node
First fill in a node name that is easy to remember, and select GB2312 as the target page encoding. The anti-hotlink mode does not need to be changed, because the target site has no restrictions. The system default timeout is 10 seconds.
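To illustrate why the target page encoding matters, here is a minimal Python sketch. It is not DedeCMS code; the function names `decode_page` and `fetch_page` are hypothetical, and the 10 second timeout simply mirrors the default mentioned above.

```python
from urllib.request import urlopen


def decode_page(raw, encoding="gb2312"):
    """Decode downloaded bytes using the encoding chosen for the node."""
    return raw.decode(encoding, errors="replace")


def fetch_page(url, encoding="gb2312", timeout=10):
    """Download a page with a timeout and decode it (hypothetical helper, not called here)."""
    with urlopen(url, timeout=timeout) as resp:
        return decode_page(resp.read(), encoding)


# Simulate a GB2312 page body without touching the network.
raw = "正则表达式".encode("gb2312")
print(decode_page(raw))            # decoded with the matching encoding
print(decode_page(raw, "utf-8"))   # a mismatched encoding produces garbled text
```

If the collected text looks garbled, a mismatch between this setting and the real page encoding is a likely cause.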
2. Set the list URL acquisition rules
In this step we configure how the article list addresses are obtained. Return to the list page of the target site and observe how the address changes between pages. Only the number after "list_14_" changes, increasing by one per page.
First page: http://www.dedecms.com/web-art/PHP_jiaocheng/list_14_1.html
Middle pages: http://www.dedecms.com/web-art/PHP_jiaocheng/list_14_(*).html
Last page: http://www.dedecms.com/web-art/PHP_jiaocheng/list_14_172.html
Copy a paging address and return to the "Add Collection Node" page. Select "Batch Generate List URL" as the "Source Attribute", paste the address into "Matching URL", replace the changing number with (*), and in "Batch Generate Address Settings" set (*) to run from 1 to 172. This generates the addresses of all list pages, from the first page to the last page, 172.
Test it. The pop-up box shows that 172 address records have been generated, so the setting works. Sometimes a list has addresses that follow no pattern; in that case, copy the irregular addresses into the "Manually specified list URL" text box to collect them.
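The batch generation described above amounts to substituting each page number for the (*) placeholder. A minimal Python sketch of the idea follows; `generate_list_urls` is a hypothetical name used for illustration, not a DedeCMS function.

```python
def generate_list_urls(pattern, start, end):
    """Expand the (*) placeholder into every page number from start to end, inclusive."""
    if "(*)" not in pattern:
        raise ValueError("pattern must contain the (*) placeholder")
    return [pattern.replace("(*)", str(n)) for n in range(start, end + 1)]


pattern = "http://www.dedecms.com/web-art/PHP_jiaocheng/list_14_(*).html"
list_urls = generate_list_urls(pattern, 1, 172)

print(len(list_urls))   # number of generated list pages
print(list_urls[0])     # first page address
print(list_urls[-1])    # last page address
```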
3. Set article URL matching rules
The pages that contain the article addresses have been specified above. In this step, you need to pick out the article addresses that meet the requirements from those pages. Open a list page and observe that the box in the left column contains all the addresses we need. When the region is this clearly delimited, it can be isolated using the "HTML at the beginning of the region" and "HTML at the end of the region" settings.
Another method also works. Hover the mouse over the various links and observe the full address displayed in the lower left corner of the browser. The addresses we need all contain "PHP_jiaocheng/20", so we fill that in under "Must Contain".
Both methods can filter out the addresses, and for complex pages they can be used together. Combined with regular expressions, there is almost no address that cannot be filtered out. Compare with the figure below. Finally confirm and go to the next step, "Web page content acquisition rules".
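The two filtering methods can be sketched in Python as follows. This is only an illustration of the idea, not how DedeCMS implements it: the function `extract_article_urls`, the sample HTML, and its class names are all made up for the example.

```python
import re
from urllib.parse import urljoin


def extract_article_urls(html, base_url, region_start="", region_end="", must_contain=""):
    """Return de-duplicated absolute links from the chosen region that contain must_contain."""
    region = html
    # Keep only the HTML between the region start and region end markers.
    if region_start and region_start in region:
        region = region.split(region_start, 1)[1]
    if region_end and region_end in region:
        region = region.split(region_end, 1)[0]
    urls = []
    for href in re.findall(r'<a\s[^>]*?href=["\']([^"\']+)["\']', region, flags=re.I):
        url = urljoin(base_url, href)
        if must_contain in url and url not in urls:
            urls.append(url)
    return urls


# Hypothetical list page markup, invented for this example.
sample_html = """
<div class="nav"><a href="/web-art/">Webmaster Academy</a></div>
<div class="listbox">
  <a href="/web-art/PHP_jiaocheng/20070420/38633.html">Regular Expressions</a>
  <a href="/web-art/PHP_jiaocheng/20070420/38633.html">Read more</a>
  <a href="/web-art/PHP_jiaocheng/list_14_2.html">Next page</a>
</div>
<div class="footer"><a href="/about.html">About</a></div>
"""

article_urls = extract_article_urls(
    sample_html,
    "http://www.dedecms.com/web-art/PHP_jiaocheng/list_14_1.html",
    region_start='<div class="listbox">',
    region_end='<div class="footer">',
    must_contain="PHP_jiaocheng/20",
)
print(article_urls)
```

Here the region markers drop the navigation and footer links, and the "must contain" string drops the pagination link that sits inside the same region.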
4. Web page content acquisition rules
The steps above covered the list settings. Next we set the content acquisition rules. If collection were a meal, steps one to three would only be the appetizer that leads to the main course. This step explains how to collect the article content from the target site, and it is the core of the whole collection process.
Return to the PHP tutorial list on the DedeCMS site and open an article from the list. Here we take the article "Regular Expressions" as an example: http://www.dedecms.com/web-art/PHP_jiaocheng/20070420/38633.html. Copy this address into "Preview URL". Because none of the articles on the DedeCMS site are paginated, there is no need to set pagination here, and you can go directly to the "Fixed collection items" page.
(Note: If the collected content is paginated, you only need to set the matching rule for the pagination navigation section. Depending on the content, choose either the fully listed pagination list, or the previous/next page form or incomplete pagination list.)
The two pagination forms are:
Fully listed pagination list: the pagination navigation lists the links to all pages, as shown in the figure below
Previous/next page form or incomplete pagination list: each page shows only the navigation for the current page, so the list of pages is incomplete
5. Fixed collection items
On this page, the first step is to analyze the page source code. Collection is nothing more than analyzing the structure of the HTML page to obtain the content we need. You therefore need some understanding of HTML and must be able to find the required content by viewing the page source. It is best to open several pages for analysis and find what they have in common.
It is recommended to use Dreamweaver for the analysis. When analyzing the page code, make frequent use of the search function. In particular, after finding a tag, search for it again to see whether it is duplicated elsewhere, which reduces analysis errors.
1) Article title: The title of this page is "Regular Expressions". Copy it and press Ctrl+F in Dreamweaver to search for all occurrences; there are 30 records. Because the match must be unique, here we select the "
2) Author: Continue searching with "author" as the keyword. It occurs only once, on line 110. Copy it together with the tags before and after it into the matching rule, and replace the part to be collected with [content].
3) Source: Same as above. The tag is on line 109; copy it and replace the part to be collected with [content]. If the source contains hyperlink tags that you want to remove, fill in the following rules in the filter rule box to filter them out:
<a([^>]*)>
</a>
4) Release time: It is on line 111. Copy, paste, and modify it in the same way as above.
5) Article content: Search for the beginning of the article content. For example, searching for "Part One" finds the target on line 118. Click the status bar
and you will find that not all of the article content is selected. Continue to the previous
At this point, the content filtering settings have been completed.
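The matching rules used in the items above follow one pattern: copy the text surrounding a field, put [content] where the value sits, and optionally filter unwanted tags from the result. A rough Python sketch of that idea follows. The function names are hypothetical, and the sample markup is invented, because the actual tags of the target page are not reproduced in this article.

```python
import re


def rule_to_regex(rule):
    """Turn a rule such as 'Author: <span>[content]</span>' into a compiled regex."""
    if "[content]" not in rule:
        raise ValueError("rule must contain the [content] placeholder")
    before, _, after = rule.partition("[content]")
    # Escape the literal parts and capture whatever sits at [content].
    return re.compile(re.escape(before) + r"(.*?)" + re.escape(after), re.S)


def extract(html, rule):
    """Return the text captured at [content], or '' when the rule does not match."""
    match = rule_to_regex(rule).search(html)
    return match.group(1).strip() if match else ""


def strip_links(text):
    """Apply the filter rules <a([^>]*)> and </a>: drop the tags, keep the link text.
    Note that this simple pattern also matches other tags starting with 'a'."""
    return re.sub(r"<a([^>]*)>|</a>", "", text)


# Invented markup standing in for the real page source.
page = ('<div class="info">Author: <span>admin</span> '
        'Source: <a href="http://example.com">Example Site</a></div>')

author = extract(page, "Author: <span>[content]</span>")
source = strip_links(extract(page, "Source: [content]</div>"))
print(author)
print(source)
```

The more unique the surrounding text in a rule, the less likely it is to match the wrong part of the page, which is why searching for duplicates first is worthwhile.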
6. Node collection
If you finish the collection node in one go and the test succeeds, click the button as prompted to collect directly. If the node was written earlier, go to the "Node Management" page, check the nodes to be collected, and press the "Collect" button. If you want to collect new content from all nodes, use the monitoring collection page.
You can set how many items are collected per page. Generally, do not set this too high, otherwise the system may be unable to process them and some items will not be collected. It is recommended not to exceed 15.
The number of threads is how many threads collect at the same time. Increasing the number of threads speeds up collection, but it also increases the load on server resources, so use it with caution. If the target site has an anti-refresh limit, set the interval here according to the target site's anti-refresh time limit. If not, leave the default of 0 seconds.
Additional options: the three settings are self-explanatory, so choose according to your actual needs.
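The relationship between the per-page count, the thread count, and the interval can be sketched in Python as below. This is an assumption about how such settings typically interact, not a description of DedeCMS internals; `collect` and `fake_fetch` are hypothetical names, and the fetch function is a stand-in so the example runs without network access.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def collect(urls, fetch, per_page=15, threads=1, interval=0.0):
    """Fetch urls in batches of per_page, using `threads` workers per batch
    and pausing `interval` seconds between batches."""
    results = {}
    for start in range(0, len(urls), per_page):
        batch = urls[start:start + per_page]
        with ThreadPoolExecutor(max_workers=threads) as pool:
            # pool.map preserves input order, so zip pairs each url with its body.
            for url, body in zip(batch, pool.map(fetch, batch)):
                results[url] = body
        # Wait before the next batch if the target site limits refresh rate.
        if interval and start + per_page < len(urls):
            time.sleep(interval)
    return results


def fake_fetch(url):
    """Stand-in for a real download."""
    return "content of " + url


urls = ["http://example.com/%d.html" % n for n in range(1, 41)]
pages = collect(urls, fake_fetch, per_page=15, threads=3, interval=0)
print(len(pages))
```

Smaller batches and fewer threads put less load on both servers, at the cost of a longer total collection time.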
Collection completed.