urllib

urlopen

A basic request

import ssl
from urllib import request, parse

def Api():
    # Disable certificate verification (insecure; use only for testing)
    ssl._create_default_https_context = ssl._create_unverified_context
    response = request.urlopen('https://www.baidu.com/')
    print(response.read().decode('utf-8'))

if __name__ == '__main__':
    Api()

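Basic attributes and methods of the response object

response.status returns the status code

response.getheaders() returns the response headers

response.getheader('Server') returns the value of the Server response header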
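These attributes can be tried without contacting an external site. The following is a minimal sketch that assumes a throwaway local server; HelloHandler and the 127.0.0.1 address are illustration-only names and are not part of the original notes.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib import request

class HelloHandler(BaseHTTPRequestHandler):
    """Hypothetical handler that answers every GET with 'hello'."""

    def do_GET(self):
        body = b'hello'
        self.send_response(200)
        self.send_header('Content-Type', 'text/plain; charset=utf-8')
        self.send_header('Content-Length', str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo output quiet

# Port 0 lets the operating system pick a free port
server = HTTPServer(('127.0.0.1', 0), HelloHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = 'http://127.0.0.1:%d/' % server.server_port

with request.urlopen(url, timeout=5) as response:
    status = response.status                      # status code
    server_header = response.getheader('Server')  # one header value
    header_names = [name for name, _ in response.getheaders()]
    body = response.read().decode('utf-8')

server.shutdown()
server.server_close()

print(status)
print(header_names)
print(server_header)
print(body)
```

getheaders() returns a list of (name, value) tuples, which is why the sketch unpacks each item to collect the header names.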

data

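data is an optional parameter of urlopen. It must be passed as bytes, so convert it with bytes() first. When data is supplied, the request method changes from GET to POST.

data = bytes(parse.urlencode({'name': 'k'}), encoding='utf-8')

request.urlopen('https://www.baidu.com/', data=data)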
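The switch from GET to POST can be checked without sending anything. This is a small sketch that builds Request objects directly and asks them which method they would use; the variable names are only for illustration.

```python
from urllib import parse, request

form = {'name': 'k'}
encoded = parse.urlencode(form)          # a str: 'name=k'
data = bytes(encoded, encoding='utf-8')  # the bytes that urlopen expects

without_data = request.Request('https://www.baidu.com/')
with_data = request.Request('https://www.baidu.com/', data=data)

print(encoded)                    # name=k
print(without_data.get_method())  # GET
print(with_data.get_method())     # POST
```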

timeout

The timeout parameter sets the timeout in seconds. If no response arrives within that time, an exception is raised.

request.urlopen('https://www.baidu.com/', timeout=0.1)
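A timeout is usually handled with try/except. The sketch below assumes a local socket that accepts connections but never replies, so the timeout is reproducible; fetch_with_timeout and the silent socket are hypothetical helpers added for illustration.

```python
import socket
from urllib import error, request

def fetch_with_timeout(url, timeout):
    """Return 'ok' if a response arrived in time, otherwise 'timed out'."""
    try:
        with request.urlopen(url, timeout=timeout) as response:
            response.read()
            return 'ok'
    except error.URLError as e:
        # A timeout while connecting is wrapped in URLError
        if isinstance(e.reason, socket.timeout):
            return 'timed out'
        raise
    except socket.timeout:
        # A timeout while waiting for the response is raised directly
        return 'timed out'

# Listening socket that never answers, standing in for a slow server
silent = socket.socket()
silent.bind(('127.0.0.1', 0))
silent.listen(1)
url = 'http://127.0.0.1:%d/' % silent.getsockname()[1]

result = fetch_with_timeout(url, timeout=0.2)
silent.close()
print(result)  # timed out
```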

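Request

To add request headers and other data, build a Request object.

url: the request URL
data: the data to send; encode it with urllib.parse.urlencode and convert it to bytes
headers: the request headers
origin_req_host: the host name or IP address of the requester
unverifiable: whether the request is unverifiable; defaults to False
method: the request method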

import ssl
from urllib import request as req, parse

def send_request():
    # Disable certificate verification (insecure; use only for testing)
    ssl._create_default_https_context = ssl._create_unverified_context
    url = 'https://www.httpbin.org/post'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36',
        'Host': 'www.httpbin.org'
    }
    form = {'name': 'k'}
    # Form data must be URL-encoded and then converted to bytes
    data = bytes(parse.urlencode(form), encoding='utf-8')
    post_req = req.Request(url=url, data=data, headers=headers, method='POST')
    response = req.urlopen(post_req)
    print(response.read().decode('utf-8'))


if __name__ == '__main__':
    send_request()
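A Request object can also be inspected before it is sent, which helps when checking what the parameters do. This is a sketch only; the User-Agent value 'demo-client/0.1' is a made-up placeholder.

```python
from urllib import parse, request

data = parse.urlencode({'name': 'k'}).encode('utf-8')
r = request.Request(
    url='https://www.httpbin.org/post',
    data=data,
    headers={'User-Agent': 'demo-client/0.1'},  # hypothetical value
    method='POST',
)
# Headers can also be added after construction
r.add_header('Accept', 'application/json')

print(r.full_url)                  # https://www.httpbin.org/post
print(r.get_method())              # POST
print(r.get_header('User-agent'))  # demo-client/0.1
print(r.has_header('Accept'))      # True
print(r.origin_req_host)           # www.httpbin.org
print(r.unverifiable)              # False
```

Request stores header names in capitalized form, so the lookup uses 'User-agent' rather than 'User-Agent'.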

BaseHandler

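BaseHandler is the parent class of all other Handlers

HTTPDefaultErrorHandler handles HTTP error responses; every error is raised as an HTTPError exception

HTTPRedirectHandler handles redirects

HTTPCookieProcessor handles cookies

ProxyHandler sets a proxy; no proxy is configured by default

HTTPPasswordMgr manages passwords by maintaining a table of usernames and passwords

HTTPBasicAuthHandler manages authentication; use it when opening a link requires authentication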
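Handlers are combined into an opener with build_opener. The sketch below assumes a proxy listening at 127.0.0.1:8080, an address chosen purely for illustration; nothing is sent, the code only lists which handlers the opener ends up with.

```python
from urllib.request import HTTPCookieProcessor, ProxyHandler, build_opener

# Hypothetical proxy address; replace it with a real proxy before opening URLs
proxy_handler = ProxyHandler({
    'http': 'http://127.0.0.1:8080',
    'https': 'http://127.0.0.1:8080',
})

# Handlers passed in are added alongside the default ones
opener = build_opener(proxy_handler, HTTPCookieProcessor())

handler_names = [type(h).__name__ for h in opener.handlers]
print(handler_names)
```

Requests would then go through opener.open(url) instead of urlopen(url).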

Example: a page that requires a username and password

import ssl
from urllib.request import HTTPPasswordMgrWithDefaultRealm, HTTPBasicAuthHandler, build_opener
from urllib.error import URLError

# Disable certificate verification (insecure; use only for testing)
ssl._create_default_https_context = ssl._create_unverified_context
username = 'admin'
password = 'admin'
url = 'https://ssr3.scrape.center/'

p = HTTPPasswordMgrWithDefaultRealm()
p.add_password(None, url, username, password)
# Instantiate the handler that performs the authentication
auth_handler = HTTPBasicAuthHandler(p)
opener = build_opener(auth_handler)

try:
    result = opener.open(url)
    html = result.read().decode('utf-8')
    print(html)
except URLError as e:
    print(e.reason)

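Example: retrieving cookies

import http.cookiejar
from urllib import request as req

cookie = http.cookiejar.CookieJar()
handle = req.HTTPCookieProcessor(cookie)
opener = req.build_opener(handle)
resp = opener.open('https://www.baidu.com')
# Print every cookie the site set
for item in cookie:
    print(item.name + '=' + item.value)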

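Exceptions

URLError comes from the urllib.error module and inherits from OSError. It is the base class of the exceptions in that module, so it can catch any exception raised by the request module.

HTTPError is a subclass of URLError dedicated to HTTP request errors. It has three attributes:

code: the HTTP status code

reason: the reason for the error

headers: the response headers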
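Because HTTPError is a subclass of URLError, it has to be caught first. The sketch below shows that ordering; fetch is a hypothetical helper, and the file URL points at a path assumed not to exist so the example runs without network access.

```python
from urllib import error, request

def fetch(url):
    """Return a (kind, detail) pair describing how the request ended."""
    try:
        with request.urlopen(url, timeout=5) as response:
            return 'ok', response.read()
    except error.HTTPError as e:
        # Must come first: HTTPError is a subclass of URLError
        return 'http error', (e.code, e.reason)
    except error.URLError as e:
        return 'url error', e.reason

# Hypothetical path that should not exist on the local machine
kind, detail = fetch('file:///no/such/dir/missing.txt')

print(kind)                                         # url error
print(issubclass(error.HTTPError, error.URLError))  # True
print(issubclass(error.URLError, OSError))          # True
```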

Parsing URLs

urlparse

from urllib.parse import urlparse

result = urlparse('https://www.baidu.com/index.html;user?id=5#comment')
print(result)
# ParseResult(scheme='https', netloc='www.baidu.com', path='/index.html', params='user', query='id=5', fragment='comment')

A standard URL has the form scheme://netloc/path;params?query#fragment

ParseResult is a tuple, so its fields can be read by index as well as by attribute name
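Since the result is a tuple, there are several equivalent ways to read it. A short sketch:

```python
from urllib.parse import urlparse

result = urlparse('https://www.baidu.com/index.html;user?id=5#comment')

# Attribute access and index access return the same values
print(result.scheme, result[0])  # https https
print(result.netloc, result[1])  # www.baidu.com www.baidu.com

# Tuple unpacking yields the six components in order
scheme, netloc, path, params, query, fragment = result
print(path, params, query, fragment)  # /index.html user id=5 comment
```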

urlunparse

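from urllib.parse import urlunparse

data = ['https', 'www.baidu.com', 'index.html', 'user', 'a=6', 'comment']
print(urlunparse(data))  # https://www.baidu.com/index.html;user?a=6#comment

urlunparse builds a URL from its parts. The argument must contain exactly six elements; passing more or fewer raises an error.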

urlsplit

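It works like urlparse, except that it does not parse params separately; params stay merged into path, so five components are returned. The returned SplitResult is also a tuple.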
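A short sketch using the same URL as the urlparse example, to show where the params part ends up:

```python
from urllib.parse import urlsplit

result = urlsplit('https://www.baidu.com/index.html;user?id=5#comment')
print(result)
# SplitResult(scheme='https', netloc='www.baidu.com', path='/index.html;user', query='id=5', fragment='comment')

print(len(result))   # 5
print(result.path)   # /index.html;user
```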

urlunsplit

It works like urlunparse: it also assembles the parts into a complete URL, but takes five elements.
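A minimal sketch, including a round trip through urlsplit to show that the two functions are inverses for this URL:

```python
from urllib.parse import urlsplit, urlunsplit

data = ['https', 'www.baidu.com', 'index.html', 'a=6', 'comment']
built = urlunsplit(data)
print(built)  # https://www.baidu.com/index.html?a=6#comment

# Splitting a URL and joining it again gives back the original
original = 'https://www.baidu.com/index.html;user?id=5#comment'
print(urlunsplit(urlsplit(original)) == original)  # True
```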

urljoin

urljoin takes two arguments: base_url first and a new link second. It analyses the scheme, netloc, and path of base_url and fills in whatever the new link is missing.

from urllib.parse import urljoin

print(urljoin('https://www.baidu.com', 'FAQ.html'))
# https://www.baidu.com/FAQ.html

print(urljoin('https://www.baidu.com', 'https://www.douyin.com/FAQ.html'))
# https://www.douyin.com/FAQ.html

print(urljoin('https://www.baidu.com?qw=1', 'https://www.douyin.com/index.html'))
# https://www.douyin.com/index.html

print(urljoin('https://www.baidu.com', '?category=2#comment'))
# https://www.baidu.com?category=2#comment

print(urljoin('www.baidu.com', '?category=2#comment'))
# www.baidu.com?category=2#comment

print(urljoin('www.baidu.com#comment', '?category=2'))
# www.baidu.com?category=2

base_url supplies three things: scheme, netloc, and path. If the new link lacks any of them, it is filled in from base_url; if the new link has them, its own values are used. In the examples above, the query and fragment of base_url have no effect on the result.
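The path of base_url matters most when the new link is relative. The sketch below uses a made-up base path, /docs/guide.html, to show how relative links resolve against it.

```python
from urllib.parse import urljoin

# Hypothetical page used as the base
base = 'https://www.baidu.com/docs/guide.html'

sibling = urljoin(base, 'FAQ.html')        # relative to the base directory
rooted = urljoin(base, '/FAQ.html')        # relative to the site root
parent = urljoin(base, '../FAQ.html')      # one directory up
other_host = urljoin(base, '//www.douyin.com/FAQ.html')  # only the scheme is reused

print(sibling)     # https://www.baidu.com/docs/FAQ.html
print(rooted)      # https://www.baidu.com/FAQ.html
print(parent)      # https://www.baidu.com/FAQ.html
print(other_host)  # https://www.douyin.com/FAQ.html
```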

urlencode

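from urllib.parse import urlencode

params = {
    'name': 'germey',
    'age': 18
}

# The trailing '?' separates the base URL from the query string
base_url = 'https://www.baidu.com?'
url = base_url + urlencode(params)
print(url)  # https://www.baidu.com?name=germey&age=18

When building request parameters, it is often convenient to write them as a dictionary first; calling urlencode then turns the dictionary into URL query parameters.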
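urlencode also escapes special characters and, with doseq=True, expands list values into repeated keys. A brief sketch with made-up parameter names:

```python
from urllib.parse import urlencode

simple = urlencode({'name': 'germey', 'age': 18})
escaped = urlencode({'q': 'a b&c'})                  # space and '&' are escaped
repeated = urlencode({'tag': ['x', 'y']}, doseq=True)  # one key=value pair per list item

print(simple)    # name=germey&age=18
print(escaped)   # q=a+b%26c
print(repeated)  # tag=x&tag=y
```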

parse_qs

from urllib.parse import parse_qs

query='name=germey&age=25'
print(parse_qs(query))  # {'name': ['germey'], 'age': ['25']}

Converts a GET query string back into a dictionary

parse_qsl

from urllib.parse import parse_qsl

query='name=germey&age=25'
print(parse_qsl(query))  # [('name', 'germey'), ('age', '25')]

Converts the query string into a list of tuples
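The difference between the two shows up when a key is repeated. In this sketch the tag parameter is a made-up example:

```python
from urllib.parse import parse_qs, parse_qsl

query = 'name=germey&tag=a&tag=b'

as_dict = parse_qs(query)    # values are grouped into lists per key
as_pairs = parse_qsl(query)  # every key=value pair is kept in order

print(as_dict)   # {'name': ['germey'], 'tag': ['a', 'b']}
print(as_pairs)  # [('name', 'germey'), ('tag', 'a'), ('tag', 'b')]

# Converting the pairs to a dict keeps only the last value for each key
print(dict(as_pairs))  # {'name': 'germey', 'tag': 'b'}
```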

quote

from urllib.parse import quote

keyword='壁纸'
url='https://www.baidu.com/s?wd='+quote(keyword)
print(url)  # https://www.baidu.com/s?wd=%E5%A3%81%E7%BA%B8

quote converts content into URL-encoded form. A URL with Chinese parameters may end up garbled, so use quote to percent-encode the Chinese text.

unquote

unquote decodes a URL-encoded string
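A minimal sketch that reverses the quote example above:

```python
from urllib.parse import quote, unquote

url = 'https://www.baidu.com/s?wd=%E5%A3%81%E7%BA%B8'
decoded = unquote(url)
print(decoded)  # https://www.baidu.com/s?wd=壁纸

# quote and unquote are inverses for this keyword
keyword = '壁纸'
print(unquote(quote(keyword)) == keyword)  # True
```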

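Robots protocol

The Robots protocol, also called the crawler protocol, tells crawlers and search engines which pages may be crawled and which may not. When a search crawler visits a site, it first checks whether a robots.txt file exists in the site's root directory. If it does, the crawler crawls within the scope that file defines; if not, the crawler visits every page it can reach.

User-agent: names the crawler the rules apply to; * means the rules apply to all crawlers. Several User-agent lines may be defined, but at least one is required
Disallow: the directories crawlers may not crawl; / means no page may be crawled
Allow: rarely used alone; it is paired with Disallow to carve out exceptions to a restriction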

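Block all crawlers from every directory

User-agent: *
Disallow: /

Allow all crawlers to access every page

User-agent: *
Disallow:

Block all crawlers from certain directories

User-agent: *
Disallow: /private/
Disallow: /tmp/

Allow only one crawler to access every directory

User-agent: WebCrawler
Disallow:

User-agent: *
Disallow: /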

Crawler name	Website
BaiduSpider	Baidu
Googlebot	Google
360Spider	360 Search
YodaoBot	Youdao
ia_archiver	Alexa
Scooter	AltaVista
Bingbot	Bing

robotparser

from urllib.robotparser import RobotFileParser

rp=RobotFileParser()
rp.set_url('https://www.baidu.com/robots.txt')
rp.read()

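print(rp.can_fetch('Baiduspider', 'https://www.baidu.com'))            # True
print(rp.can_fetch('Baiduspider', 'https://www.baidu.com/homepage/'))  # True
print(rp.can_fetch('Googlebot', 'https://www.baidu.com/homepage/'))    # False

RobotFileParser uses a site's robots.txt file to decide whether a crawler is allowed to fetch a page.

set_url: sets the URL of the robots.txt file; not needed if the URL was passed to the constructor

read: fetches robots.txt and parses it. It does not return the content, but it must be called; otherwise every later check returns False

parse: parses the content of a robots.txt file. Its argument is a list of lines from the file, which it interprets according to the robots.txt syntax

can_fetch: takes a User-agent as the first argument and the URL to fetch as the second; returns True or False to indicate whether that URL may be fetched

mtime: returns the time robots.txt was last fetched and parsed. Long-running crawlers need this to decide when to re-fetch the latest robots.txt

modified: sets the time robots.txt was last fetched and parsed to the current time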
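parse makes it possible to test rules without downloading anything. The sketch below feeds it the "allow only one crawler" rules from the Robots protocol section; example.com and the OtherBot name are placeholders chosen for illustration.

```python
from urllib.robotparser import RobotFileParser

rules = [
    'User-agent: WebCrawler',
    'Disallow:',
    '',
    'User-agent: *',
    'Disallow: /',
]

rp = RobotFileParser()

# Before any rules are read or parsed, every check returns False
before = rp.can_fetch('WebCrawler', 'https://example.com/page')

rp.parse(rules)
allowed = rp.can_fetch('WebCrawler', 'https://example.com/page')
blocked = rp.can_fetch('OtherBot', 'https://example.com/page')

print(before)          # False
print(allowed)         # True
print(blocked)         # False
print(rp.mtime() > 0)  # True: parse records when the rules were last processed
```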