
Writing a joke-page crawler with Python 3

高洛峰 | Original | 2017-02-14 13:37:00

Most Python tutorials on the Internet are written for version 2.X. Python 2.X differs quite a lot from Python 3.X, and many libraries are used differently. I have Python 3.X installed, so let's walk through an example written for it.

0x01

I had nothing to do during the Spring Festival (yes, that free), so I wrote a simple program to grab some jokes, and recorded the process of writing it. My first contact with crawlers was a post about scraping photos of girls from the Jiandan site; it was not very convenient to use, so I followed its example and put together my own.

Technology should inspire the future. As a programmer, how could I spend my time on that? Better to crawl some jokes, which are good for body and mind.


0x02

Before we roll up our sleeves and get started, let’s popularize some theoretical knowledge.

To put it simply, we need to pull down the content at a specific position on a web page. How? First we have to analyze the page and work out which part we want. What we crawl this time are the jokes on Pengfu (捧腹网). When we open its joke page we can see a lot of jokes, and our goal is to grab that content. Once you have finished reading them, calm down and come back; we cannot write code if we keep laughing like this. In Chrome, open Inspect Element, then expand the HTML tags level by level, or use the element picker to locate the element we need.

(Screenshot: locating the joke element with Chrome's developer tools)

Finally, we can see that the content inside the p tag with class "content-img clearfix pt10 relative" is the joke we need, and the second joke looks the same. So we can find all such p tags in this web page, extract the content inside them, and we are done.

0x03

Okay, now that we know our goal, we can roll up our sleeves and get started. I use Python 3 here; whether to choose Python 2 or Python 3 is up to you, and the same functionality can be implemented in either with some differences, but Python 3 is still recommended.
We want to pull down the content we need, so first we have to pull down the web page itself. How? Here we need a library called urllib; we use the methods it provides to fetch the whole page.
First, we import urllib.request. The code is as follows:

import urllib.request as request

Then, we can use request to fetch the web page. The code is as follows:

def getHTML(url):
  return request.urlopen(url).read()

Life is short, I use Python: one line of code downloads a web page. As you can see, there is no reason not to use Python.
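One side note: urlopen returns raw bytes and raises an exception on network failures. A slightly more defensive variant, purely my own sketch (the timeout value and error handling are not part of the original article), could look like this:

import urllib.request as request
import urllib.error

def get_html_safe(url, timeout=10):
  # Return the page bytes, or None if the request fails or times out.
  try:
    return request.urlopen(url, timeout=timeout).read()
  except urllib.error.URLError as e:
    print('failed to fetch %s: %s' % (url, e))
    return None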

After downloading the web page, we have to parse it to get the elements we need. For parsing we use another tool called Beautiful Soup, which can quickly parse HTML and XML and extract the elements we need. The code is as follows:

from bs4 import BeautifulSoup

soup = BeautifulSoup(getHTML("http://www.pengfu.com/xiaohua_1.html"))

Parsing a web page with BeautifulSoup takes just one line, but when you run the code a warning like the one below appears, prompting you to specify a parser; otherwise the code may behave differently or report an error on another platform or system. The warning is as follows:

/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/bs4/__init__.py:181: UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("lxml"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.

The code that caused this warning is on line 64 of the file joke.py. To get rid of this warning, change code that looks like this:

BeautifulSoup([your markup])

to this:

BeautifulSoup([your markup], "lxml")

markup_type=markup_type))

The types of parsers and the differences between them are explained in detail in the official documentation; for now, lxml is the more reliable choice.

After the modification:

soup = BeautifulSoup(getHTML("http://www.pengfu.com/xiaohua_1.html"), 'lxml')

This way, the warning above no longer appears.
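If lxml happens not to be installed, Beautiful Soup can also fall back to the parser that ships with the standard library. The small helper below is only my own sketch of that choice, not code from the original article:

from bs4 import BeautifulSoup

def make_soup(html):
  # Prefer the faster lxml parser; fall back to the stdlib html.parser
  # when lxml is not available on the system.
  try:
    return BeautifulSoup(html, 'lxml')
  except Exception:
    return BeautifulSoup(html, 'html.parser')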


The code is as follows:

p_array = soup.find_all('p', {'class': "content-img clearfix pt10 relative"})

Use the find_all function to find all p tags whose class is "content-img clearfix pt10 relative", and then traverse the resulting array.


The code is as follows:

for x in p_array:
  content = x.string

This way, we get the content of each target p tag. At this point, we have achieved our goal and crawled our jokes.
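One caveat: x.string returns None when a tag contains more than one child node, which is why the full script at the end wraps the lstrip() call in a try/except. An alternative sketch of my own (not from the original article) falls back to get_text():

for x in p_array:
  # .string is None for tags with nested children; get_text() flattens them.
  content = x.string or x.get_text('\n', strip=True)
  print(content.strip() + '\n')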
However, when I crawled Qiushibaike (糗事百科) in the same way, the following error was reported.


The error is as follows:

raise RemoteDisconnected("Remote end closed connection without"
http.client.RemoteDisconnected: Remote end closed connection without response

It says that the remote end closed the connection without responding. I checked the network and there was no problem, so what is causing this? Is my approach wrong?
Opening Charles to capture packets showed nothing at all, which was strange: how could a site that works fine in a browser be inaccessible from Python? Could it be a UA problem? Looking at Charles, I found that requests made with urllib carry the default UA Python-urllib/3.5, while Chrome accesses the site with User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36. Could the server be rejecting the Python crawler based on the UA? Let's disguise the UA and see whether that works.


The code is as follows:

def getHTML(url):
  headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36'}
  req = request.Request(url, headers=headers)
  return request.urlopen(req).read()

In this way, Python disguises itself as Chrome to fetch the Qiushibaike page, and the data can be obtained smoothly.
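As a small variation (my own addition, not something from the original article), urllib also lets you install a global opener whose default headers are attached to every subsequent urlopen call, so you do not have to build a Request object each time:

import urllib.request as request

# The opener's default headers are sent with every request, so a plain
# request.urlopen(url) already carries the spoofed User-Agent.
opener = request.build_opener()
opener.addheaders = [('User-Agent',
                      'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) '
                      'AppleWebKit/537.36 (KHTML, like Gecko) '
                      'Chrome/56.0.2924.87 Safari/537.36')]
request.install_opener(opener)

html = request.urlopen('http://www.pengfu.com/xiaohua_1.html').read()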

At this point, crawling the jokes from Pengfu and Qiushibaike with Python is done. We only need to analyze the corresponding web pages, find the elements we are interested in, and use Python's powerful features to achieve our goal; whether it is racy pictures or Neihan-style jokes, it can all be done the same way. I'll stop talking now and go look for some pictures of girls. The complete code is as follows:

# -*- coding: utf-8 -*-
import urllib.request as request
from bs4 import BeautifulSoup

def getHTML(url):
  # Pretend to be Chrome so the server does not reject the default urllib UA.
  headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36'}
  req = request.Request(url, headers=headers)
  return request.urlopen(req).read()

def get_pengfu_results(url):
  soup = BeautifulSoup(getHTML(url), 'lxml')
  # Each joke on Pengfu sits in a p tag with this class.
  return soup.find_all('p', {'class': "content-img clearfix pt10 relative"})

def get_pengfu_joke():
  for page in range(1, 2):
    url = 'http://www.pengfu.com/xiaohua_%d.html' % page
    for x in get_pengfu_results(url):
      content = x.string
      try:
        print(content.lstrip() + '\n\n')
      except:
        # content is None when the tag has nested children; skip those.
        continue
  return

def get_qiubai_results(url):
  soup = BeautifulSoup(getHTML(url), 'lxml')
  contents = soup.find_all('p', {'class': 'content'})
  results = []
  for x in contents:
    # The joke text lives in a span inside the content tag.
    text = x.find('span').getText('\n', '<br/>')
    results.append(text)
  return results

def get_qiubai_joke():
  for page in range(1, 2):
    url = 'http://www.qiushibaike.com/8hr/page/%d/?s=4952526' % page
    for x in get_qiubai_results(url):
      print(x + '\n\n')
  return

if __name__ == '__main__':
  get_pengfu_joke()
  get_qiubai_joke()

