
Writing a joke-page crawler with Python 3

高洛峰 | Original | 2017-02-14 13:37:00

Most Python tutorials on the Internet are written for version 2.X. Python 2.X differs quite a lot from Python 3.X, and many libraries are used differently. I have Python 3.X installed, so let's walk through an example written for it.

0x01

I had nothing to do during the Spring Festival (yes, that free), so I wrote a simple program to grab some jokes, and recorded the process of writing it. My first contact with crawlers was a post about scraping photos of girls from the Jiandan site; it was not very convenient to use, so I followed its example and put together my own.

Technology should inspire the future. As a programmer, how could I spend my time on that? Better to crawl some jokes, which are good for body and mind.


0x02

Before we roll up our sleeves and get started, let’s popularize some theoretical knowledge.

To put it simply, we need to pull down the content at a specific position on a web page. How? First we have to analyze the page and work out which part we want. What we crawl this time are the jokes on Pengfu (捧腹网). When we open its joke page we can see a lot of jokes, and our goal is to grab that content. Once you have finished reading them, calm down and come back; we cannot write code if we keep laughing like this. In Chrome, open Inspect Element, then expand the HTML tags level by level, or use the element picker to locate the element we need.

(Screenshot: locating the joke element with Chrome's developer tools)

Finally, we can see that the content inside the p tag with class "content-img clearfix pt10 relative" is the joke we need, and the second joke looks the same. So we can find all such p tags in this web page, extract the content inside them, and we are done.

0x03

Okay, now that we know our goal, we can roll up our sleeves and get started. I use Python 3 here; whether to choose Python 2 or Python 3 is up to you, and the same functionality can be implemented in either with some differences, but Python 3 is still recommended.
We want to pull down the content we need, so first we have to pull down the web page itself. How? Here we need a library called urllib; we use the methods it provides to fetch the whole page.
First, we import urllib.request. The code is as follows:

import urllib.request as request

Then, we can use request to fetch the web page. The code is as follows:

def getHTML(url):
  return request.urlopen(url).read()

Life is short, I use Python: one line of code downloads a web page. As you can see, there is no reason not to use Python.
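One side note: urlopen returns raw bytes and raises an exception on network failures. A slightly more defensive variant, purely my own sketch (the timeout value and error handling are not part of the original article), could look like this:

import urllib.request as request
import urllib.error

def get_html_safe(url, timeout=10):
  # Return the page bytes, or None if the request fails or times out.
  try:
    return request.urlopen(url, timeout=timeout).read()
  except urllib.error.URLError as e:
    print('failed to fetch %s: %s' % (url, e))
    return None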

After downloading the web page, we have to parse it to get the elements we need. For parsing we use another tool called Beautiful Soup, which can quickly parse HTML and XML and extract the elements we need. The code is as follows:

from bs4 import BeautifulSoup

soup = BeautifulSoup(getHTML("http://www.pengfu.com/xiaohua_1.html"))

Parsing a web page with BeautifulSoup takes just one line, but when you run the code a warning like the one below appears, prompting you to specify a parser; otherwise the code may behave differently or report an error on another platform or system. The warning is as follows:

/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/bs4/__init__.py:181: UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("lxml"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.

The code that caused this warning is on line 64 of the file joke.py. To get rid of this warning, change code that looks like this:

BeautifulSoup([your markup])

to this:

BeautifulSoup([your markup], "lxml")

markup_type=markup_type))

The types of parsers and the differences between them are explained in detail in the official documentation; for now, lxml is the more reliable choice.

After the modification:

soup = BeautifulSoup(getHTML("http://www.pengfu.com/xiaohua_1.html"), 'lxml')

This way, the warning above no longer appears.
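If lxml happens not to be installed, Beautiful Soup can also fall back to the parser that ships with the standard library. The small helper below is only my own sketch of that choice, not code from the original article:

from bs4 import BeautifulSoup

def make_soup(html):
  # Prefer the faster lxml parser; fall back to the stdlib html.parser
  # when lxml is not available on the system.
  try:
    return BeautifulSoup(html, 'lxml')
  except Exception:
    return BeautifulSoup(html, 'html.parser')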


The code is as follows:

p_array = soup.find_all('p', {'class': "content-img clearfix pt10 relative"})

Use the find_all function to find all p tags whose class is "content-img clearfix pt10 relative", and then traverse the resulting array.


The code is as follows:

for x in p_array:
  content = x.string

This way, we get the content of each target p tag. At this point, we have achieved our goal and crawled our jokes.
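One caveat: x.string returns None when a tag contains more than one child node, which is why the full script at the end wraps the lstrip() call in a try/except. An alternative sketch of my own (not from the original article) falls back to get_text():

for x in p_array:
  # .string is None for tags with nested children; get_text() flattens them.
  content = x.string or x.get_text('\n', strip=True)
  print(content.strip() + '\n')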
However, when I crawled Qiushibaike (糗事百科) in the same way, the following error was reported.


The error is as follows:

raise RemoteDisconnected("Remote end closed connection without"
http.client.RemoteDisconnected: Remote end closed connection without response

It says that the remote end closed the connection without responding. I checked the network and there was no problem, so what is causing this? Is my approach wrong?
Opening Charles to capture packets showed nothing at all, which was strange: how could a site that works fine in a browser be inaccessible from Python? Could it be a UA problem? Looking at Charles, I found that requests made with urllib carry the default UA Python-urllib/3.5, while Chrome accesses the site with User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36. Could the server be rejecting the Python crawler based on the UA? Let's disguise the UA and see whether that works.


The code is as follows:

def getHTML(url):
  headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36'}
  req = request.Request(url, headers=headers)
  return request.urlopen(req).read()

In this way, Python disguises itself as Chrome to fetch the Qiushibaike page, and the data can be obtained smoothly.
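As a small variation (my own addition, not something from the original article), urllib also lets you install a global opener whose default headers are attached to every subsequent urlopen call, so you do not have to build a Request object each time:

import urllib.request as request

# The opener's default headers are sent with every request, so a plain
# request.urlopen(url) already carries the spoofed User-Agent.
opener = request.build_opener()
opener.addheaders = [('User-Agent',
                      'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) '
                      'AppleWebKit/537.36 (KHTML, like Gecko) '
                      'Chrome/56.0.2924.87 Safari/537.36')]
request.install_opener(opener)

html = request.urlopen('http://www.pengfu.com/xiaohua_1.html').read()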

At this point, crawling the jokes from Pengfu and Qiushibaike with Python is done. We only need to analyze the corresponding web pages, find the elements we are interested in, and use Python's powerful features to achieve our goal; whether it is racy pictures or Neihan-style jokes, it can all be done the same way. I'll stop talking now and go look for some pictures of girls. The complete code is as follows:

# -*- coding: utf-8 -*-
import urllib.request as request
from bs4 import BeautifulSoup

def getHTML(url):
  # Pretend to be Chrome so the server does not reject the default urllib UA.
  headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36'}
  req = request.Request(url, headers=headers)
  return request.urlopen(req).read()

def get_pengfu_results(url):
  soup = BeautifulSoup(getHTML(url), 'lxml')
  # Each joke on Pengfu sits in a p tag with this class.
  return soup.find_all('p', {'class': "content-img clearfix pt10 relative"})

def get_pengfu_joke():
  for page in range(1, 2):
    url = 'http://www.pengfu.com/xiaohua_%d.html' % page
    for x in get_pengfu_results(url):
      content = x.string
      try:
        print(content.lstrip() + '\n\n')
      except:
        # content is None when the tag has nested children; skip those.
        continue
  return

def get_qiubai_results(url):
  soup = BeautifulSoup(getHTML(url), 'lxml')
  contents = soup.find_all('p', {'class': 'content'})
  results = []
  for x in contents:
    # The joke text lives in a span inside the content tag.
    text = x.find('span').getText('\n', '<br/>')
    results.append(text)
  return results

def get_qiubai_joke():
  for page in range(1, 2):
    url = 'http://www.qiushibaike.com/8hr/page/%d/?s=4952526' % page
    for x in get_qiubai_results(url):
      print(x + '\n\n')
  return

if __name__ == '__main__':
  get_pengfu_joke()
  get_qiubai_joke()

