Spider-入门-526互联

又是HTTP

F12大法

Network面板

列名	看法
name	看样子是截取了主域名后的一部分
status	状态码（200，500.......）
type	文档类型为啥HTML非得搞个document
initiator	请求源看不懂
size	大小，如果有cache的话，会有标识
time	响应时间
waterfall	可视化？

Header

请求和响应比较详细的内容

请求

CRUD

请求头

cookies, UA, type之类的玩意

请求体

表单数据，正好这几天刚写完了表单

注意类型

Content-Type	提交数据的方式
application/x-www-form-urlencoded	表单数据
multipart/form-data	表单文件上传
application/json	序列化json数据
text/xml	XML数据

响应

响应头

时间（create,update,outdate）类型，cookie

响应体

正主来了，我们要解析的对象

网页基础

HTML

标签
节点

CSS

加buff

使用link标签与html配合
CSS选择器（还是比较喜欢用xpath）

#id .class .class_inner{
position: center;//页面布局
bottom: 40px;//距离底边距离
width: 100%;//宽度
height: 181px;//高度
}

JS

动态交互
script标签与html交互

基本库的使用

urllib

https://docs.python.org/3/library/urllib.html

request

模拟发送请求
- urlopen(url, data=None, [timeout, ], cafile=None, capath=None, cadefault=False, context=None)
- Request
  
  class urllib.request.Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None)
- Handler
- Opener
error
- 各种error：HTTPError、URLError
- 异常里的属性
  - 错误状态码、原因、headers
parse

处理url
- urllib.parse.urlparse(urlstring, scheme='', allow_fragments=True)
  
  urlstring 必填，scheme协议，优先采取urlstring 里的协议
- urlunparse
  
  合并
- urllib.parse.urlsplit(urlstring, scheme='', allow_fragments=True)
- urlunsplit
  
  长度5
- urllib.parse.urljoin(base, url, allow_fragments=True)
- urlencode()
字典转GET参数
- parse_qs()
转存成字典类型
- parse_qsl()
元组列表
quote()

转中文为URL编码

robotparser

听起来挺有用，但实际上...
```
from urllib.robotparser import RobotFileParser
from urllib.request import urlopen
rp = RobotFileParser()
rp.parse(urlopen('http://www.zhihu.com/robots.txt').read().decode('utf-8').split('\n'))
print(rp.can_fetch('*', 'https://www.zhihu.com/topic/19661050/hot'))
print(rp.can_fetch('*', "https://www.zhihu.com/question/448199854/answer/1882163142"))
```
- set_url ：用来设置 robots.txt 文件的链接。如果在创建 RobotFileParser 对象时传入了链接，那么就不需要再使用这个方法设置了。
- read：读取 robots.txt 文件并进行分析。注意，这个方法执行一个读取和分析操作，如果不调用这个方法，接下来的判断都会为 False，所以一定记得调用这个方法。这个方法不会返回任何内容，但是执行了读取操作。
  parse：用来解析 robots.txt 文件，传入的参数是 robots.txt 某些行的内容，它会按照 robots.txt 的语法规则来分析这些内容。
- can_fetch：该方法传入两个参数，第一个是 User-agent，第二个是要抓取的 URL。返回的内容是该搜索引擎是否可以抓取这个 URL，返回结果是 True 或 False.
- mtime：返回的是上次抓取和分析 robots.txt 的时间，这对于长时间分析和抓取的搜索爬虫是很有必要的，你可能需要定期检查来抓取最新的 robots.txt。
- modified：它同样对长时间分析和抓取的搜索爬虫很有帮助，将当前时间设置为上次抓取和分析 robots.txt 的时间。

requests

在urllib中，我们可以通过handler,opener完成cookie认证之类的操作，但在requests里，功能更加强大（反正内置urllib3）

直接贴接口文档https://docs.python-requests.org/zh_CN/latest/api.html#id2

response.status_code

# 信息性状态码  
100: ('continue',),  
101: ('switching_protocols',),  
102: ('processing',),  
103: ('checkpoint',),  
122: ('uri_too_long', 'request_uri_too_long'),  

# 成功状态码  
200: ('ok', 'okay', 'all_ok', 'all_okay', 'all_good', '\\o/', '✓'),  
201: ('created',),  
202: ('accepted',),  
203: ('non_authoritative_info', 'non_authoritative_information'),  
204: ('no_content',),  
205: ('reset_content', 'reset'),  
206: ('partial_content', 'partial'),  
207: ('multi_status', 'multiple_status', 'multi_stati', 'multiple_stati'),  
208: ('already_reported',),  
226: ('im_used',),  

# 重定向状态码  
300: ('multiple_choices',),  
301: ('moved_permanently', 'moved', '\\o-'),  
302: ('found',),  
303: ('see_other', 'other'),  
304: ('not_modified',),  
305: ('use_proxy',),  
306: ('switch_proxy',),  
307: ('temporary_redirect', 'temporary_moved', 'temporary'),  
308: ('permanent_redirect',  
      'resume_incomplete', 'resume',), # These 2 to be removed in 3.0  

# 客户端错误状态码  
400: ('bad_request', 'bad'),  
401: ('unauthorized',),  
402: ('payment_required', 'payment'),  
403: ('forbidden',),  
404: ('not_found', '-o-'),  
405: ('method_not_allowed', 'not_allowed'),  
406: ('not_acceptable',),  
407: ('proxy_authentication_required', 'proxy_auth', 'proxy_authentication'),  
408: ('request_timeout', 'timeout'),  
409: ('conflict',),  
410: ('gone',),  
411: ('length_required',),  
412: ('precondition_failed', 'precondition'),  
413: ('request_entity_too_large',),  
414: ('request_uri_too_large',),  
415: ('unsupported_media_type', 'unsupported_media', 'media_type'),  
416: ('requested_range_not_satisfiable', 'requested_range', 'range_not_satisfiable'),  
417: ('expectation_failed',),  
418: ('im_a_teapot', 'teapot', 'i_am_a_teapot'),  
421: ('misdirected_request',),  
422: ('unprocessable_entity', 'unprocessable'),  
423: ('locked',),  
424: ('failed_dependency', 'dependency'),  
425: ('unordered_collection', 'unordered'),  
426: ('upgrade_required', 'upgrade'),  
428: ('precondition_required', 'precondition'),  
429: ('too_many_requests', 'too_many'),  
431: ('header_fields_too_large', 'fields_too_large'),  
444: ('no_response', 'none'),  
449: ('retry_with', 'retry'),  
450: ('blocked_by_windows_parental_controls', 'parental_controls'),  
451: ('unavailable_for_legal_reasons', 'legal_reasons'),  
499: ('client_closed_request',),  

# 服务端错误状态码  
500: ('internal_server_error', 'server_error', '/o\\', '✗'),  
501: ('not_implemented',),  
502: ('bad_gateway',),  
503: ('service_unavailable', 'unavailable'),  
504: ('gateway_timeout',),  
505: ('http_version_not_supported', 'http_version'),  
506: ('variant_also_negotiates',),  
507: ('insufficient_storage',),  
509: ('bandwidth_limit_exceeded', 'bandwidth'),  
510: ('not_extended',),  
511: ('network_authentication_required', 'network_auth', 'network_authentication')

正则表达式

如何筛选我想要的数据

表达式搜索

https://any86.github.io/any-rule/

python-spider python spider 001

爬虫框架scrapy spider

前端font-spider字体文件

题解spider p9447 2021