最近学习Python,于是就用Python写了一个抓取Discuz!用户名的脚本,代码很少但是很搓。思路很简单,就是正则匹配title然后提取用户名写入文本文档。程序以百度站长社区为例(一共有40多万用户),挂在VPS上就没管了,虽然用了延时但是后来发现一共只抓取了50000多个用户名就被封了。。。
代码如下:
# -*- coding: utf-8 -*-
# Author: 天一
# Blog: http://www.90blog.org
# Version: 1.0
# 功能: Python抓取百度站长平台用户名脚本
import urllib
import urllib2
import re
import time
def BiduSpider():
pattern = re.compile(r'
uid=1
thedatas = []
while uid theUrl = "http://bbs.zhanzhang.baidu.com/home.php?mod=space&uid="+str(uid)
uid +=1
theResponse = urllib2.urlopen(theUrl)
thePage = theResponse.read()
#正则匹配用户名
theFindall = re.findall(pattern,thePage)
#等待0.5秒,以防频繁访问被禁止
time.sleep(0.5)
if theFindall :
#中文编码防止乱码输出
thedatas = theFindall[0].decode('utf-8').encode('gbk')
#写入txt文本文档
f = open('theUid.txt','a')
f.writelines(thedatas+'\n')
f.close()
if __name__ == '__main__':
BiduSpider()
最终成果如下:

SlicingaPythonlistisdoneusingthesyntaxlist[start:stop:step].Here'showitworks:1)Startistheindexofthefirstelementtoinclude.2)Stopistheindexofthefirstelementtoexclude.3)Stepistheincrementbetweenelements.It'susefulforextractingportionsoflistsandcanuseneg

NumPyallowsforvariousoperationsonarrays:1)Basicarithmeticlikeaddition,subtraction,multiplication,anddivision;2)Advancedoperationssuchasmatrixmultiplication;3)Element-wiseoperationswithoutexplicitloops;4)Arrayindexingandslicingfordatamanipulation;5)Ag

ArraysinPython,particularlythroughNumPyandPandas,areessentialfordataanalysis,offeringspeedandefficiency.1)NumPyarraysenableefficienthandlingoflargedatasetsandcomplexoperationslikemovingaverages.2)PandasextendsNumPy'scapabilitieswithDataFramesforstruc

ListsandNumPyarraysinPythonhavedifferentmemoryfootprints:listsaremoreflexiblebutlessmemory-efficient,whileNumPyarraysareoptimizedfornumericaldata.1)Listsstorereferencestoobjects,withoverheadaround64byteson64-bitsystems.2)NumPyarraysstoredatacontiguou

ToensurePythonscriptsbehavecorrectlyacrossdevelopment,staging,andproduction,usethesestrategies:1)Environmentvariablesforsimplesettings,2)Configurationfilesforcomplexsetups,and3)Dynamicloadingforadaptability.Eachmethodoffersuniquebenefitsandrequiresca

The basic syntax for Python list slicing is list[start:stop:step]. 1.start is the first element index included, 2.stop is the first element index excluded, and 3.step determines the step size between elements. Slices are not only used to extract data, but also to modify and invert lists.

Listsoutperformarraysin:1)dynamicsizingandfrequentinsertions/deletions,2)storingheterogeneousdata,and3)memoryefficiencyforsparsedata,butmayhaveslightperformancecostsincertainoperations.

ToconvertaPythonarraytoalist,usethelist()constructororageneratorexpression.1)Importthearraymoduleandcreateanarray.2)Uselist(arr)or[xforxinarr]toconvertittoalist,consideringperformanceandmemoryefficiencyforlargedatasets.


Hot AI Tools

Undresser.AI Undress
AI-powered app for creating realistic nude photos

AI Clothes Remover
Online AI tool for removing clothes from photos.

Undress AI Tool
Undress images for free

Clothoff.io
AI clothes remover

Video Face Swap
Swap faces in any video effortlessly with our completely free AI face swap tool!

Hot Article

Hot Tools

VSCode Windows 64-bit Download
A free and powerful IDE editor launched by Microsoft

MantisBT
Mantis is an easy-to-deploy web-based defect tracking tool designed to aid in product defect tracking. It requires PHP, MySQL and a web server. Check out our demo and hosting services.

PhpStorm Mac version
The latest (2018.2.1) professional PHP integrated development tool

Zend Studio 13.0.1
Powerful PHP integrated development environment

mPDF
mPDF is a PHP library that can generate PDF files from UTF-8 encoded HTML. The original author, Ian Back, wrote mPDF to output PDF files "on the fly" from his website and handle different languages. It is slower than original scripts like HTML2FPDF and produces larger files when using Unicode fonts, but supports CSS styles etc. and has a lot of enhancements. Supports almost all languages, including RTL (Arabic and Hebrew) and CJK (Chinese, Japanese and Korean). Supports nested block-level elements (such as P, DIV),
