Hands-on Exercise: Baidu Tieba Hot Topics · 虫师de江湖

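[TOC]

# Python Web Scraping in Practice: Baidu Tieba Hot Topics

> The example below is fairly simple; its main purpose is to show how to locate an AJAX data endpoint.

1. Open the `Chrome` browser, press `F12` to show the developer tools, then visit `https://tieba.baidu.com`.
2. Go to the `Network` tab and click the `Filter` funnel icon. A row of resource-type filters appears ("All|XHR|JS|...."); select `XHR`.
3. The URL list below now contains an entry named `topicList`. Select it, and the "Preview" pane on the right shows `JSON` data. This is exactly the `Top30` list of hot topics displayed on the page.

![XHR](https://img.kancloud.cn/8b/2c/8b2c5b79322799bbb1449c153807a657_1096x413.png)

At this point we have located the request URL of the data source: `https://jump.bdimg.com/hottopic/browse/topicList`. Next, we request it from a `Python` script and read the response data.

```python
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import html

import requests as req

# Send a browser-like User-Agent with the request.
headers = {
    'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36',
}


# Fetch the Tieba hot topics and print one line per topic:
# rank | title | topic URL
def tieba_hot():
    url = 'https://jump.bdimg.com/hottopic/browse/topicList'  # JSON data endpoint
    resp = req.get(url, headers=headers, timeout=10)
    resp.raise_for_status()  # stop early on an HTTP error status
    data = resp.json()  # parse the JSON body into a Python dict
    topic_list = data['data']['bang_topic']['topic_list']
    for topic in topic_list:
        # The URL arrives HTML-escaped (`&amp;`), so unescape it before use.
        topic_url = html.unescape(topic['topic_url'])
        print('{} |{}|{}\n'.format(topic['idx_num'], topic['topic_name'], topic_url))


if __name__ == '__main__':
    tieba_hot()
```

The response returned by the `requests` library is in `json` format, so we can call the `json()` method to parse it into a `Python` dictionary. From the topic list we then extract each topic's `rank`, `name`, and corresponding `URL`.

**Note:** The `&` characters in `topic_url` are escaped as `&amp;`. These `HTML entities` must be `unescaped`, otherwise the URL cannot be accessed correctly. We handle this with `html.unescape()`: simply change `topic['topic_url']` to `html.unescape(topic['topic_url'])` and leave the other fields unchanged. The source code of the `html` library that implements `HTML entity conversion` can be viewed here: [Lib/html/__init__.py](https://github.com/python/cpython/blob/3.8/Lib/html/__init__.py)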
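
The extraction and unescaping steps can also be tried offline, without calling the endpoint. The following is a minimal sketch under stated assumptions: `SAMPLE_PAYLOAD` and the helper `extract_topics` are hypothetical names introduced here for illustration, the payload only mirrors the key layout used in the script above (`data` → `bang_topic` → `topic_list`), and its values, including the `example.com` URL, are made up rather than real Tieba data.

```python
import html

# Made-up payload that only mirrors the key layout used in the script above;
# the values (including the example.com URL) are illustrative, not real data.
SAMPLE_PAYLOAD = {
    "data": {
        "bang_topic": {
            "topic_list": [
                {
                    "idx_num": 1,
                    "topic_name": "example topic",
                    "topic_url": "https://example.com/topic?topic_id=1&amp;topic_name=example",
                },
            ]
        }
    }
}


def extract_topics(payload):
    """Return (rank, name, unescaped URL) tuples from a topicList-style dict.

    Hypothetical helper: missing top-level keys yield an empty list
    instead of raising KeyError.
    """
    topics = payload.get("data", {}).get("bang_topic", {}).get("topic_list", [])
    return [
        (t["idx_num"], t["topic_name"], html.unescape(t["topic_url"]))
        for t in topics
    ]


if __name__ == "__main__":
    for rank, name, url in extract_topics(SAMPLE_PAYLOAD):
        print(f"{rank} | {name} | {url}")
```

Separating parsing from the network request in this way makes it possible to check the `&amp;` to `&` conversion on a fixed input before running the script against the live site.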