[Python Modules] Using urllib

Let's start with urllib. It is Python's built-in HTTP request library, so it can be used without installing anything extra.
Sending a simple GET request

#python2
import urllib2
response = urllib2.urlopen('http://www.baidu.com')

#python3
import urllib.request
res = urllib.request.urlopen('http://www.baidu.com')

import urllib.request

a = urllib.request.urlopen('http://www.baidu.com')
print(a.read().decode('gbk', 'ignore'))
print(a.status)  # status code
print(a.getheaders())  # all headers, as a list of (name, value) tuples
print(a.getheader('Set-Cookie'))  # value of the named header, as a string

read() returns bytes; decode() converts those bytes into a str.
decode() takes the encoding as its argument

decode('utf-8')  # decode as UTF-8
decode('gbk')  # decode as GBK

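Sometimes you may hit an error like this
UnicodeDecodeError: 'gbk' codec can't decode byte 0x80 in position 1412: illegal multibyte sequence
or
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 1412: invalid start byte
This means the page is not in the encoding you passed to decode. Usually switching the encoding fixes it: change gbk to utf-8, or utf-8 to gbk. If neither works, you can pass a second argument to suppress the error
decode('gbk', 'ignore') will no longer raise, but note that 'ignore' silently drops the bytes it cannot decode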
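
The "try another encoding, then fall back" advice can be wrapped in a small helper. This is only a minimal sketch: `decode_body` is a hypothetical name, not a urllib function, and the encoding order is an assumption.

```python
def decode_body(raw, encodings=("utf-8", "gbk")):
    """Try each encoding in turn (hypothetical helper, not part of urllib)."""
    for enc in encodings:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    # Last resort: keep going, but mark undecodable bytes with U+FFFD
    return raw.decode(encodings[0], errors="replace")

text = "\u4e2d\u6587"  # two Chinese characters, written as escapes
# The same text decodes correctly whether it arrived as GBK or UTF-8 bytes
print(decode_body(text.encode("gbk")) == decode_body(text.encode("utf-8")))
```

Using 'replace' rather than 'ignore' in the fallback keeps a visible marker where data was lost. A wrong encoding can also decode without raising and produce garbled text, so when the server declares a charset in Content-Type it is safer to use that first.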

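urlopen accepts url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None
Everything except url is optional. cafile, capath and cadefault have been deprecated since Python 3.6 and were removed in Python 3.13; use context instead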

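Calling urlopen returns a response object (an http.client.HTTPResponse for HTTP and HTTPS URLs)
You can call the methods of this response object to read the data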
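
As a rough illustration of reading from the response object, the sketch below uses a `data:` URL, which urllib resolves locally, so it runs without network access. The URL is an assumption chosen for the example; with a real HTTP URL the object is an http.client.HTTPResponse and additionally offers `status` and `getheaders()`, which this sketch does not rely on.

```python
import urllib.request

# A data: URL is handled locally by urllib, so no network is needed.
# Replace it with a real http(s) URL in practice.
url = "data:text/plain;charset=utf-8,hello%20urllib"

with urllib.request.urlopen(url, timeout=5) as resp:
    raw = resp.read()                             # body as bytes
    charset = resp.headers.get_content_charset()  # from Content-Type, may be None
    text = raw.decode(charset or "utf-8")

print(type(raw).__name__, charset, text)
```

Using the response as a context manager closes it once the body has been read.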

The above is enough for simple requests, but to send a request with custom headers you need the Request class

request = urllib.request.Request('http://www.baidu.com')
response = urllib.request.urlopen(request)
print(response.read().decode('utf-8'))

Here urlopen is given a Request object instead of a URL string

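Request takes the parameters url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None
data is the request body, as bytes
headers is a dict; headers can also be added by calling add_header() on the Request instance
origin_req_host is the host name or IP address of the party making the request
unverifiable says whether the request is unverifiable, meaning the user had no chance to approve the URL (for example, an image fetched automatically while a page loads); it defaults to False
method is the HTTP method, such as 'GET' or 'POST'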

All of these parameters except url are optional
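
A minimal sketch of building a Request with headers, without sending it. The User-Agent and Referer values are assumptions chosen for illustration.

```python
import urllib.request

req = urllib.request.Request(
    "http://www.baidu.com",
    headers={"User-Agent": "Mozilla/5.0"},  # example value, an assumption
    method="GET",
)
req.add_header("Referer", "http://www.baidu.com/")  # added after construction

# Request stores header names via str.capitalize(), e.g. "User-agent"
print(req.get_method())
print(req.header_items())
# response = urllib.request.urlopen(req)  # uncomment to actually send it
```

Because of the name normalization, look headers up with `req.get_header("User-agent")` rather than `"User-Agent"`.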

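The data parameter needs a little more explanation
When your data is a dict (called params here), you can convert it with
bytes(urllib.parse.urlencode(params), encoding='utf8')  # URL-encode the dict, then encode the result to bytes as UTF-8; gbk works the same way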
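
A short sketch of the dict-to-bytes conversion in context. The form fields and the `http://example.com/login` URL are hypothetical placeholders; the request is built but not sent.

```python
import urllib.parse
import urllib.request

form = {"wd": "python", "page": 1}                   # hypothetical form fields
data = urllib.parse.urlencode(form).encode("utf-8")  # str -> bytes

# Placeholder URL, assumed for illustration only
req = urllib.request.Request("http://example.com/login", data=data)
print(data)
print(req.get_method())  # data is set and method is not, so urllib uses POST
```

`str.encode("utf-8")` is equivalent to the `bytes(..., encoding='utf8')` form and is a little shorter.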

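Exception handling
There are two main exceptions, both defined in urllib.error:
URLError and HTTPError
URLError is raised when the request cannot be completed, for example on a DNS failure, a refused connection, or an unsupported URL scheme (a URL with no scheme at all raises ValueError instead)
HTTPError is raised when the server returns an error status code such as 404; it is a subclass of URLError
HTTPError has three attributes
HTTPError.code  # status code
HTTPError.reason  # error message
HTTPError.headers  # response headers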

Use try and except to catch and handle these exceptions
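
A minimal sketch of such a try/except block. The `fetch` helper and its return values are hypothetical. HTTPError is caught first because it is a subclass of URLError; the two demo URLs are chosen so that they fail locally without touching the network.

```python
import urllib.error
import urllib.request

def fetch(url, timeout=5):
    """Hypothetical helper: return an (outcome, detail) tuple instead of raising."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return ("ok", resp.read())
    except urllib.error.HTTPError as e:   # server answered with an error status
        return ("http_error", e.code)
    except urllib.error.URLError as e:    # request could not be completed
        return ("url_error", str(e.reason))
    except ValueError as e:               # URL has no scheme at all
        return ("bad_url", str(e))

print(fetch("foo://example.invalid/")[0])  # unsupported scheme
print(fetch("not a url")[0])               # no scheme
```

If the URLError clause came first, it would also swallow every HTTPError and the status code would never be inspected.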

Sometimes, when simulating a login, we need to disable redirects in order to capture the login cookie

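import urllib.request

class RedirectHandler(urllib.request.HTTPRedirectHandler):
  """Redirect handler that does not follow 301/302 responses."""
  def http_error_301(self, req, fp, code, msg, headers):
    # Returning None means the redirect is not followed; the opener then raises HTTPError
    pass
  def http_error_302(self, req, fp, code, msg, headers):
    pass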
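
One possible way to put a no-redirect handler to use is sketched below. It is a variant, not the code above: it overrides `redirect_request` so that every redirect code is refused, and the `login_cookies` helper is hypothetical. When the redirect is refused, urllib raises HTTPError for the 3xx response, and the cookies can be read from the exception's headers.

```python
import urllib.error
import urllib.request

class NoRedirectHandler(urllib.request.HTTPRedirectHandler):
    """Refuse every redirect by declining to build the follow-up request."""
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None  # no new request, so the 3xx surfaces as HTTPError

opener = urllib.request.build_opener(NoRedirectHandler)

def login_cookies(url, data=None):
    """Hypothetical helper: return the Set-Cookie values of a login response."""
    try:
        with opener.open(url, data=data, timeout=5) as resp:
            return resp.headers.get_all("Set-Cookie") or []
    except urllib.error.HTTPError as e:
        if e.code in (301, 302, 303, 307, 308):
            return e.headers.get_all("Set-Cookie") or []
        raise
```

Passing the handler to build_opener replaces the default HTTPRedirectHandler, so only requests made through this opener have redirects disabled.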

Making HTTPS requests

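import ssl
from urllib.request import urlopen

# Method 1: pass an unverified context for a single request (certificate checks are disabled)
# url is the HTTPS address you want to request
context = ssl._create_unverified_context()
urlopen(url, context=context)

# Method 2: disable certificate verification globally for urlopen
ssl._create_default_https_context = ssl._create_unverified_context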
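
Both methods above rely on private ssl functions and turn certificate checking off. As a hedged alternative, the sketch below builds the two kinds of context using only public attributes; the `fetch_https` helper is hypothetical.

```python
import ssl
import urllib.request

# Verified context: certificates and host names are checked
verified = ssl.create_default_context()

# Unverified context built from public attributes only.
# check_hostname must be turned off before verify_mode is set to CERT_NONE.
unverified = ssl.create_default_context()
unverified.check_hostname = False
unverified.verify_mode = ssl.CERT_NONE

def fetch_https(url, insecure=False):
    """Hypothetical helper: fetch url, optionally skipping verification."""
    ctx = unverified if insecure else verified
    with urllib.request.urlopen(url, context=ctx, timeout=5) as resp:
        return resp.read()
```

Skipping verification is best limited to testing or to hosts whose self-signed certificate you already trust, since it removes protection against man-in-the-middle attacks.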

Setting a proxy IP

proxy_handler = urllib.request.ProxyHandler({'http': '127.0.0.1:1080', 'https': '127.0.0.1:1080'})
opener = urllib.request.build_opener(proxy_handler)
r = opener.open('http://www.baidu.com')
print(r.read().decode('utf-8', 'ignore'))
# Install the opener globally so later urlopen calls use the proxy without setting it up again
urllib.request.install_opener(opener)
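
The proxy setup can also be wrapped in a small function. This is a sketch only: `build_proxy_opener` is a hypothetical name, and the proxy address is the same local placeholder used above. Nothing is sent over the network here.

```python
import urllib.request

def build_proxy_opener(proxy="127.0.0.1:1080"):
    """Hypothetical helper: opener that routes http and https through one proxy."""
    handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
    return urllib.request.build_opener(handler)

opener = build_proxy_opener()
proxy_handlers = [h for h in opener.handlers
                  if isinstance(h, urllib.request.ProxyHandler)]
print(proxy_handlers[0].proxies)
# opener.open("http://www.baidu.com", timeout=5) would go through the proxy;
# urllib.request.urlopen is unaffected until install_opener(opener) is called.
```

Keeping the opener local instead of installing it lets different parts of a program use different proxies. Passing an empty dict to ProxyHandler disables the proxies that urllib would otherwise pick up from the environment.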