python - Scrapy ItemLoader data cleaning questions

Question

I am scraping data with Scrapy and using the ItemLoader class. The value extracted by the selector is passed into scrapy.Field(), where filter() is called on it. If the selector value is not empty, it does return "with value". But if the selector returns [] or "", then after the value enters filter(), it does not...

仅有的幸福 · Answer

Thanks for the invitation~
I don’t know much about Scrapy, so I can’t say much about it specifically. The general idea behind the crawler I wrote myself in PHP is:
1. First, using regular expressions and some loops, put the pages to be collected into queues, sorted by category: for example, one queue for the paginated list pages and one queue for the content pages linked from those lists.
2. Then use XPath to extract the data from the content pages. During extraction, some of the data is processed as the task requires.
3. Assemble the data and save it in whatever format you need.
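
The three steps above can be sketched in Python roughly as follows. This is only an illustration, not the PHP crawler described in the answer: the in-memory `PAGES` dictionary stands in for real HTTP fetching, the URLs and field names are made up, and the standard library's `xml.etree.ElementTree` (which supports only a limited XPath subset and needs well-formed markup) stands in for a full XPath engine.

```python
from collections import deque
import xml.etree.ElementTree as ET

# Hypothetical pages keyed by URL; a real crawler would fetch these over HTTP.
PAGES = {
    "list/1": "<html><body><a class='item' href='item/1'/>"
              "<a class='item' href='item/2'/>"
              "<a class='next' href='list/2'/></body></html>",
    "list/2": "<html><body><a class='item' href='item/3'/></body></html>",
    "item/1": "<html><body><h1> First </h1><p class='price'>10</p></body></html>",
    "item/2": "<html><body><h1>Second</h1><p class='price'></p></body></html>",
    "item/3": "<html><body><h1>Third</h1></body></html>",
}


def crawl(start_url):
    list_queue = deque([start_url])  # step 1: queue of paginated list pages
    item_queue = deque()             # step 1: queue of content pages
    seen = {start_url}
    results = []

    # Step 1: walk the list pages and fill the content-page queue.
    while list_queue:
        root = ET.fromstring(PAGES[list_queue.popleft()])
        for link in root.findall(".//a[@class='item']"):
            item_queue.append(link.get("href"))
        for link in root.findall(".//a[@class='next']"):
            url = link.get("href")
            if url not in seen:
                seen.add(url)
                list_queue.append(url)

    while item_queue:
        url = item_queue.popleft()
        root = ET.fromstring(PAGES[url])
        # Step 2: extract with XPath and clean during extraction.
        # findtext() returns None when the element is missing.
        title = (root.findtext(".//h1") or "").strip()
        price_text = (root.findtext(".//p[@class='price']") or "").strip()
        # Step 3: assemble the record in the shape we want to save.
        results.append({
            "url": url,
            "title": title,
            "price": int(price_text) if price_text else None,
        })
    return results
```

Calling `crawl("list/1")` returns one dictionary per content page. Missing or empty prices come back as `None` because the empty case is handled explicitly in step 2.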

That’s roughly it. Most crawler frameworks are probably built on this idea. They just add features such as anti-crawling countermeasures, multi-threading, multi-processing, and incremental crawling on top of it. So the asker can do the processing either at the point where the framework extracts the data or at the point where it assembles the data; either works.
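
To make the choice between cleaning at extraction time and cleaning at assembly time concrete, here is a small sketch in plain Python (not Scrapy's API). The function name `clean_value` and the `"no value"` placeholder are hypothetical. The function treats `None`, `[]`, and `""` as empty and handles them explicitly, so it does not matter at which stage it is called.

```python
def clean_value(raw, default="no value"):
    """Normalize an extracted value, falling back to `default` when empty.

    `raw` may be a list of strings (as selectors often return),
    a single string, or None.
    """
    if raw is None:
        return default
    if isinstance(raw, str):
        raw = [raw]
    # Drop empty and whitespace-only strings, strip the rest.
    cleaned = [s.strip() for s in raw if s and s.strip()]
    return cleaned[0] if cleaned else default


# Option A: clean at extraction time, field by field.
extracted = {"title": clean_value([" Widget "]), "price": clean_value([])}

# Option B: keep the raw values and clean while assembling the record.
raw_record = {"title": [" Widget "], "price": ""}
assembled = {key: clean_value(value) for key, value in raw_record.items()}
```

Both options produce the same record here. The difference is only where the cleaning code lives.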

