Basic Usage of the requests Library · 虫师de江湖

[TOC]

# Basic Usage of the Requests Library

> `Requests` is a higher-level abstraction built on top of `urllib3`. It is simpler to use and usually needs less code.

## Quick Install

```
pip install requests
```

## Quick Examples

> The best way to learn is by doing, so get your hands on the keyboard and start with these simple examples.

### Fetching an HTTP Page

```Python
#!/usr/bin/env python3
import requests

url = 'https://www.baidu.com'
resp = requests.get(url)
print(resp.status_code)
if resp.status_code == 200:
    print(resp.text)
```

### GET: Query Parameters

Below are two `GET` query-parameter examples. In the second one, `seen_list` is a list, which is convenient when a single parameter name has to carry several values.

```Python
import requests as req

payload1 = {
    'name': 'Peter',
    'age': 23
}
payload2 = {
    'name': 'Peter',
    'seen_list': [1, 2, 3, 4, 5, 6]
}
url = 'https://httpbin.org/get'

resp = req.get(url, params=payload1)
print(resp.url)
resp = req.get(url, params=payload2)
print(resp.url)
```

Output:

```
$ python ./method_get.py
https://httpbin.org/get?name=Peter&age=23
https://httpbin.org/get?name=Peter&seen_list=1&seen_list=2&seen_list=3&seen_list=4&seen_list=5&seen_list=6
```

### POST: Submitting a Web Form

> `POST` is typically used to submit forms such as login pages. Because `POST` parameters do not appear in the `URL`, they are somewhat safer from casual exposure. `URL` length is also limited, so `GET` cannot carry large amounts of data, while a `POST` body is not bound by that limit.

```Python
import requests as req

payload = {
    'name': 'Peter',
    'age': 23
}
url = 'https://httpbin.org/post'
resp = req.post(url, data=payload)
print(resp.text)
```

Output:

```
$ python ./method_post.py
{
  "args": {},
  "data": "",
  "files": {},
  "form": {
    "age": "18",
    "name": "Peter"
  },
  "headers": {
    "Accept-Encoding": "identity",
    "Content-Length": "214",
    "Content-Type": "multipart/form-data; boundary=281e3c05a41e5ec834f98cf2b673113a",
    "Host": "httpbin.org",
    "X-Amzn-Trace-Id": "Root=1-5f0c0f9d-f490eb9ad301417e59f7b6a2"
  },
  "json": null,
  "origin": "127.0.0.1",
  "url": "https://httpbin.org/post"
}
```

The `form` field now contains the data we submitted with `POST`. Note that the sample output above does not match this script exactly: with this payload, `age` would come back as `"23"`, and passing a dict via `data=` sends `application/x-www-form-urlencoded` rather than `multipart/form-data`.

### POST: Sending JSON Data

```Python
import requests as req

payload = {
    'name': 'Peter',
    'age': 23
}
url = 'https://httpbin.org/post'
resp = req.post(url, json=payload)
print(resp.text)
```

Output:

```
$ python ./method_post_json.py
{
  "args": {},
  "data": "{\"name\": \"Peter\", \"age\": 23}",
  "files": {},
  "form": {},
  "headers": {
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate",
    "Content-Length": "28",
    "Content-Type": "application/json",
    "Host": "httpbin.org",
    "User-Agent": "python-requests/2.24.0",
    "X-Amzn-Trace-Id": "Root=1-5f0c1120-a6c6b65c1eb9dbdc18e21420"
  },
  "json": {
    "age": 23,
    "name": "Peter"
  },
  "origin": "127.0.0.1",
  "url": "https://httpbin.org/post"
}
```

As you can see, `headers` now includes `"Content-Type": "application/json"`. Requests sets it for us, which saves the manual `headers` setup that `urllib3` requires.

### Downloading a Binary File in `stream` Mode

```Python
import requests as req

url = 'https://docs.oracle.com/javase/specs/jls/se14/jls14.pdf'
filename = url.split('/')[-1]
with req.get(url, stream=True) as r:
    with open(filename, 'wb') as f:
        # Read the body chunk by chunk; r.content would load it all into memory
        for chunk in r.iter_content(chunk_size=8192):
            f.write(chunk)
```

When `stream` is set to `True` on a request, `Requests` cannot release the connection back to the pool automatically. You have to read all of the data or call `Response.close` yourself.

### Downloading an Image

```Python
import requests as req

resp = req.get('http://www.baidu.com/favicon.ico')
with open('favicon.ico', 'wb') as f:
    f.write(resp.content)
```

### Setting a `timeout`

> There are two kinds of timeout: the connect timeout (establishing the connection) and the read timeout (waiting for data).

`timeout` is given in seconds and can be any of the following:

- `float`: the maximum wait, applied to both connecting and reading.
- 2-tuple: (connect timeout, read timeout).
- `None`: wait until the end of time.

```Python
#!/usr/bin/env python3
import requests

url = 'https://www.baidu.com'
try:
    # 0.01 s to connect (deliberately tiny), 3 s to read
    resp = requests.get(url, timeout=(0.01, 3))
    print(resp.status_code)
    if resp.status_code == 200:
        print(resp.text[:50])
except requests.exceptions.ConnectTimeout as e:
    print('Connect timeout: ' + str(e))
```

### Fetching an HTTPS Page

> `Requests` also uses the `certifi` library for certificate verification, and `verify` defaults to `True`. Setting it to `False` still lets the request through, but an SSL `InsecureRequestWarning` is emitted.

```Python
#!/usr/bin/env python3
import requests as req

url = 'https://httpbin.org/anything'
resp = req.get(url, verify=True)
print(resp.status_code)
```

### Using a Proxy

```Python
import requests as req

# Include the scheme in each proxy URL
proxies = {
    'http': 'http://88.198.201.112:8888',
    'https': 'http://88.198.201.112:8888'
}
print(f'Proxy: {proxies}')
resp = req.get('https://httpbin.org/ip', proxies=proxies)
print(resp.text)
```

**Output:**

```sh
$ python proxy_http_get.py
Proxy: {'http': 'http://88.198.201.112:8888', 'https': 'http://88.198.201.112:8888'}
{
  "origin": "88.198.201.112"
}
```
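The URL we requested uses `https`, and the proxy took effect: `origin` shows the proxy's address.

### Using a SOCKS5 Proxy

> You may need to install the `PySocks` package first.

```sh
pip install 'requests[socks]'
```

The setup is just as simple:

```Python
proxies = {
    'http': 'socks5://user:pass@host:port',
    'https': 'socks5://user:pass@host:port'
}
```

Here is a full example:

```Python
import requests as req

proxies = {
    'http': 'socks5://127.0.0.1:1080',
    'https': 'socks5://127.0.0.1:1080'
}
print(f'Proxy: {proxies}')
resp = req.get('https://httpbin.org/ip', proxies=proxies)
print(resp.text)

url = 'https://www.google.com'
resp = req.get(url, proxies=proxies)
print(f'Status code: {resp.status_code}')
```

**Output:**

```sh
$ python ./proxy_socks_get.py
Proxy: {'http': 'socks5://127.0.0.1:1080', 'https': 'socks5://127.0.0.1:1080'}
{
  "origin": "88.198.201.112"
}
Status code: 200
```

### Session Management

> A `Session` in the requests library keeps `cookies` across multiple requests to the same site. For example, after a simulated login the cookies are kept automatically, so later requests to other pages on that site are made as the logged-in account.

```Python
import requests

s = requests.Session()
s.get('https://httpbin.org/cookies/set/sessioncookie/123456789')
r = s.get("https://httpbin.org/cookies")
print(r.text)
# '{"cookies": {"sessioncookie": "123456789"}}'
```

If you access the same site concurrently through several proxy IPs, do not share a `Session()`; call `get()` or `post()` directly instead. Otherwise every proxy IP sends the same cookies and the site treats them all as one user, which defeats the purpose of using multiple proxy IPs.

## Encoding Detection

When you receive a response, Requests guesses the encoding to use for decoding it when you access `Response.text`. Requests first checks the HTTP headers for a declared encoding. If none is present, it uses [charade](http://pypi.python.org/pypi/charade) to try to guess the encoding (Requests 2.24.0 uses the `chardet` library instead).

The only time Requests does not guess is when the HTTP headers contain no explicit charset and the `Content-Type` header contains `text`. In that case, [RFC 2616](http://www.w3.org/Protocols/rfc2616/rfc2616-sec3.html#sec3.7.1) specifies that the default charset must be ISO-8859-1, and Requests follows the specification. If you need a different encoding, set the `Response.encoding` attribute manually or work with the raw `Response.content`.

---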
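The header rule described above can be illustrated offline with a small sketch. It is not how Requests is implemented: `guess_encoding_from_content_type` is a hypothetical helper written for this example, it only approximates the "contains `text`" check by looking at the main type, and the sample body is made up. It shows why a UTF-8 page served as plain `text/html` comes out garbled until the encoding is overridden.

```python
from email.message import Message


def guess_encoding_from_content_type(content_type):
    """Hypothetical helper (not a Requests API) that mirrors the header rule:
    explicit charset wins, text/* without a charset falls back to ISO-8859-1,
    anything else gives no answer from the header alone."""
    msg = Message()
    msg["Content-Type"] = content_type
    charset = msg.get_param("charset")
    if charset:
        return charset
    if msg.get_content_maintype() == "text":
        return "ISO-8859-1"
    return None


# A UTF-8 body whose server forgot to declare a charset (assumed scenario).
body = "你好".encode("utf-8")

fallback = guess_encoding_from_content_type("text/html")
garbled = body.decode(fallback)   # decoded with the RFC 2616 default
fixed = body.decode("utf-8")      # decoded with the encoding we know is right

print(fallback)
print(repr(garbled))
print(fixed)
```

With a real response, the equivalent fix would be to set `resp.encoding = 'utf-8'` before reading `resp.text`, or to decode `resp.content` yourself.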