[TOC]

# Basic Usage of the urllib3 Library

> urllib3 is a feature-rich, powerful HTTP client for Python.

## Installing urllib3

```bash
pip install urllib3
```

Check the installed `urllib3` version:

```python
#!/usr/bin/env python3
import urllib3
print(urllib3.__version__)
```

## Writing urllib3 Examples

> The following examples cover the most common ways of using urllib3.

### Fetching a Page over HTTP

```python
#!/usr/bin/env python3
import urllib3

http = urllib3.PoolManager()
url = 'http://www.baidu.com'
resp = http.request('GET', url)
print(resp.status)
if resp.status == 200:
    print(resp.data.decode('utf-8'))
```

### Downloading a Binary File in `stream` Mode

```python
import urllib3
import certifi

url = 'https://docs.oracle.com/javase/specs/jls/se14/jls14.pdf'
filename = url.split('/')[-1]

http = urllib3.PoolManager(ca_certs=certifi.where())

# preload_content=False enables streaming: the body is not read into memory up front
resp = http.request('GET', url, preload_content=False)
try:
    with open(filename, 'wb') as f:
        # Read the body and write it to disk in 4096-byte chunks
        for chunk in resp.stream(4096):
            f.write(chunk)
finally:
    # In streaming mode the connection must be released back to the pool manually
    resp.release_conn()
```

### Setting a `timeout`

> The timeout is given in seconds as a float, which applies to both the connect and read phases. Pass a `urllib3.Timeout` object to set the two phases separately.

```python
import urllib3

http = urllib3.PoolManager()
url = 'https://www.baidu.com'
try:
    # A single float also works: http.request('GET', url, timeout=0.5, retries=False)
    # retries=False lets the timeout exceptions surface directly
    # instead of being wrapped in MaxRetryError
    resp = http.request('GET', url,
                        timeout=urllib3.Timeout(connect=0.5, read=3.0),
                        retries=False)
    print(resp.status)
    if resp.status == 200:
        print(resp.data.decode('utf-8'))
except urllib3.exceptions.ConnectTimeoutError:
    print('Connection timed out')
except urllib3.exceptions.ReadTimeoutError:
    print('Read timed out')
```

### Fetching a Page over HTTPS

> urllib3 provides client-side TLS/SSL verification. To use it we need the certifi module, which supplies a carefully curated collection of root certificates for verifying the identity of TLS hosts and the trustworthiness of SSL certificates.

Install the certifi module:

```
pip install certifi
```

Show the location of the certificate bundle:

```python
import certifi
print(certifi.where())
```

As before, here is an example that fetches an HTTPS page:

```python
#!/usr/bin/env python3
import urllib3
import certifi

url = 'https://httpbin.org/anything'
# cert_reqs='CERT_REQUIRED' makes verification explicit;
# recent urllib3 releases verify certificates by default
http = urllib3.PoolManager(cert_reqs='CERT_REQUIRED', ca_certs=certifi.where())
resp = http.request('GET', url)
print(resp.status)
```

### Sending Query Parameters with GET

```python
#!/usr/bin/env python3
import urllib3
import certifi

http = urllib3.PoolManager(ca_certs=certifi.where())
payload = {
    'name': 'Peter',
    'age': 23
}
url = 'https://httpbin.org/get'
# For GET requests, fields are encoded into the URL query string
resp = http.request('GET', url, fields=payload)
print(resp.data.decode('utf-8'))
```

Output:

```json
{
  "args": {
    "age": "23",
    "name": "Peter"
  },
  "headers": {
    "Accept-Encoding": "identity",
    "Host": "httpbin.org",
    "X-Amzn-Trace-Id": "Root=1-5f0bbbdc-915aa646d14ee320d98bc4e3"
  },
  "origin": "127.0.0.1",
  "url": "https://httpbin.org/get?name=Peter&age=23"
}
```

### Submitting a Web Form with POST

```python
#!/usr/bin/env python3
import urllib3
import certifi

http = urllib3.PoolManager(ca_certs=certifi.where())
payload = {
    'name': 'Peter',
    'age': 23
}
url = 'https://httpbin.org/post'
# For POST requests, fields are sent in the body as multipart/form-data
resp = http.request('POST', url, fields=payload)
print(resp.data.decode('utf-8'))
```

Output:

```sh
$ python ./post_request.py
{
  "args": {},
  "data": "",
  "files": {},
  "form": {
    "age": "23",
    "name": "Peter"
  },
  "headers": {
    "Accept-Encoding": "identity",
    "Content-Length": "214",
    "Content-Type": "multipart/form-data; boundary=281e3c05a41e5ec834f98cf2b673113a",
    "Host": "httpbin.org",
    "X-Amzn-Trace-Id": "Root=1-5f0c0f9d-f490eb9ad301417e59f7b6a2"
  },
  "json": null,
  "origin": "127.0.0.1",
  "url": "https://httpbin.org/post"
}
```

The `form` field now contains the data we submitted with `POST`.

### Sending JSON Data with POST

```python
import urllib3
import certifi
import json

http = urllib3.PoolManager(ca_certs=certifi.where())
payload = {
    'name': 'Peter',
    'age': 23
}
# Serialize the payload to JSON and encode it as UTF-8 bytes for the request body
encoded_data = json.dumps(payload).encode('utf-8')
header = {
    'Content-Type': 'application/json'
}
url = 'https://httpbin.org/post'
resp = http.request('POST', url, headers=header, body=encoded_data)
print(resp.data.decode('utf-8'))
```

Output:

```
$ python ./post_json.py
{
  "args": {},
  "data": "{\"name\": \"Peter\", \"age\": 23}",
  "files": {},
  "form": {},
  "headers": {
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate",
    "Content-Length": "28",
    "Content-Type": "application/json",
    "Host": "httpbin.org",
    "User-Agent": "python-requests/2.24.0",
    "X-Amzn-Trace-Id": "Root=1-5f0c1120-a6c6b65c1eb9dbdc18e21420"
  },
  "json": {
    "age": 23,
    "name": "Peter"
  },
  "origin": "127.0.0.1",
  "url": "https://httpbin.org/post"
}
```

The `headers` field now includes `"Content-Type": "application/json"`, and the `json` field contains our data.

### Accessing a Server Through a Proxy

> urllib3 supports accessing servers through a proxy. HTTP proxies use the `ProxyManager` class, while `SOCKS4` and `SOCKS5` proxies use `SOCKSProxyManager`.

Note: the public IP addresses shown here are not real, working addresses; they only illustrate the effect.

#### Using an HTTP/HTTPS Proxy

> First obtain a proxy address from a proxy pool such as `http://localhost:5010/get`, then use it to access `https://httpbin.org/ip`. In the code below the address is hard-coded for simplicity.

The code is as follows:

```python
import urllib3

# Proxy address, hard-coded here; normally fetched from the proxy pool
proxy_addr = 'http://88.198.201.112:8888'
print(f'Proxy address: {proxy_addr}')
proxy = urllib3.ProxyManager(proxy_addr)
resp = proxy.request('GET', 'https://httpbin.org/ip')
print(resp.data.decode('utf-8'))
```

**Output:**

```
$ python ./get_proxy.py
Proxy address: http://88.198.201.112:8888
{
  "origin": "88.198.201.112"
}
```

The origin IP address returned by `httpbin.org` is the proxy's address rather than my own public address, which helps avoid the limits a server places on too many requests from a single IP address.

#### Using a SOCKS5 Proxy

> SOCKS support requires the `PySocks` package, which may need to be installed first.

```bash
pip install 'urllib3[socks]'
```

Example code:

```python
from urllib3.contrib.socks import SOCKSProxyManager

proxy_addr = 'socks5://127.0.0.1:1080'
print(f'SOCKS5 proxy address: {proxy_addr}')
proxy = SOCKSProxyManager(proxy_addr)
resp = proxy.request('GET', 'https://httpbin.org/ip')
print(resp.data.decode('utf-8'))

url = 'https://www.google.com'
resp = proxy.request('GET', url)
print(f'Status code: {resp.status}')
```

**Output:**

```bash
$ python ./get_socks.py
SOCKS5 proxy address: socks5://127.0.0.1:1080
{
  "origin": "88.198.201.112"
}
Status code: 200
```

---
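
The request encodings used in the GET and JSON POST examples can be checked offline, without contacting httpbin. The following is a minimal sketch using only the standard library. It assumes that urllib3 encodes GET `fields` the same way `urllib.parse.urlencode` does; `build_get_url` and `build_json_body` are hypothetical helper names introduced here for illustration and are not part of urllib3.

```python
import json
from urllib.parse import urlencode


def build_get_url(base_url, fields):
    """Append fields to base_url as a percent-encoded query string.

    Hypothetical helper: mirrors what passing fields= to a GET request
    is assumed to do, but performs no network I/O.
    """
    return f"{base_url}?{urlencode(fields)}"


def build_json_body(payload):
    """Return (body, headers) for a JSON POST.

    Hypothetical helper: the body is the payload serialized to JSON and
    encoded as UTF-8 bytes, as in the JSON POST example.
    """
    body = json.dumps(payload).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    return body, headers


payload = {"name": "Peter", "age": 23}

# Query string as it would appear in the request URL
print(build_get_url("https://httpbin.org/get", payload))

# Byte length of the JSON body, i.e. the Content-Length that would be sent
body, headers = build_json_body(payload)
print(len(body), headers)
```

The printed URL matches the `url` field that httpbin returned for the GET example, and the body length of 28 matches the `Content-Length` shown in the JSON POST output.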