**Scraping text**

There is an article list to scrape, and we need the title of each article in it: https://www.weixz.com/zxzx/

![](https://img.kancloud.cn/de/d6/ded638168c58cf256a3dd788bb882e52_858x891.png)

<br/><br/>

Now write the code. The first step is to create the form struct.

```
package main

import "github.com/PeterYangs/article-spider/form"

func main() {

	f := form.Form{}

	_ = f // placeholder so the snippet compiles; the final step passes f to spider.Start
}
```

<br/><br/>

Then set the site's host.

```
package main

import "github.com/PeterYangs/article-spider/form"

func main() {

	f := form.Form{
		Host: "https://www.weixz.com",
	}

	_ = f // placeholder so the snippet compiles
}
```

<br/><br/>

Set the channel (the list page path).

```
package main

import "github.com/PeterYangs/article-spider/form"

func main() {

	f := form.Form{
		Host:    "https://www.weixz.com",
		Channel: "/zxzx/",
	}

	_ = f // placeholder so the snippet compiles
}
```

**Note that article lists are usually paginated, and you will want to scrape as many pages as possible rather than just one. On this site the second page is at https://www.weixz.com/zxzx/list_2.html , so the code above needs a small change.**

<br/><br/>

Fill in the page number placeholder.

```
package main

import "github.com/PeterYangs/article-spider/form"

func main() {

	f := form.Form{
		Host:    "https://www.weixz.com",
		Channel: "/zxzx/list_[PAGE].html",
	}

	_ = f // placeholder so the snippet compiles
}
```

The rule is simple: replace the page number in the URL with **[PAGE]**.

<br/><br/>

Set the number of pages to scrape and the starting page.

```
package main

import "github.com/PeterYangs/article-spider/form"

func main() {

	f := form.Form{
		Host:      "https://www.weixz.com",
		Channel:   "/zxzx/list_[PAGE].html",
		Limit:     5,
		PageStart: 1,
	}

	_ = f // placeholder so the snippet compiles
}
```

`Limit: 5` means scrape 5 pages; `PageStart: 1` means start from page 1.

<br/><br/>

Set the list selector. First look at how the list is structured:

![](https://img.kancloud.cn/e2/03/e2031073e7a685ca754adb26fd33129c_465x221.png)

Each `li` is one entry of the list. To save effort, use Chrome to copy the selector of an `li`:

![](https://img.kancloud.cn/49/59/4959051e339e217cb854bb39160bd7b1_517x397.png)

Put it into the code:

```
package main

import "github.com/PeterYangs/article-spider/form"

func main() {

	f := form.Form{
		Host:         "https://www.weixz.com",
		Channel:      "/zxzx/list_[PAGE].html",
		Limit:        5,
		PageStart:    1,
		ListSelector: "body > div > div.information-main.mt-20px.wd1200.displayFlex > div.information-main-left > div.information-main-list > ul > li:nth-child(1)",
	}

	_ = f // placeholder so the snippet compiles
}
```

Note that we want every `li`, not one particular `li`, so drop the `:nth-child(1)` suffix:

```
package main

import "github.com/PeterYangs/article-spider/form"

func main() {

	f := form.Form{
		Host:         "https://www.weixz.com",
		Channel:      "/zxzx/list_[PAGE].html",
		Limit:        5,
		PageStart:    1,
		ListSelector: "body > div > div.information-main.mt-20px.wd1200.displayFlex > div.information-main-left > div.information-main-list > ul > li",
	}

	_ = f // placeholder so the snippet compiles
}
```

<br/><br/>

Set the selector for the `a` link.

```
package main

import "github.com/PeterYangs/article-spider/form"

func main() {

	f := form.Form{
		Host:             "https://www.weixz.com",
		Channel:          "/zxzx/list_[PAGE].html",
		Limit:            5,
		PageStart:        1,
		ListSelector:     "body > div > div.information-main.mt-20px.wd1200.displayFlex > div.information-main-left > div.information-main-list > ul > li",
		ListHrefSelector: "div > a",
	}

	_ = f // placeholder so the snippet compiles
}
```

**ListHrefSelector** is the selector for the `a` link. The spider needs the link to each detail page in order to scrape more data. Do not use the full selector for the `a` link; write it relative to the list item instead. In the example above the list selector already ends at `li`, so the `a` selector starts from `li`, and `div > a` is enough. The remaining examples use the more specific `div.information-main-list-title > a`, which is likewise relative to the `li`.

<br/><br/>

Get the title field from the detail page.

```
package main

import (
	"github.com/PeterYangs/article-spider/fileTypes"
	"github.com/PeterYangs/article-spider/form"
)

func main() {

	f := form.Form{
		Host:             "https://www.weixz.com",
		Channel:          "/zxzx/list_[PAGE].html",
		Limit:            5,
		PageStart:        1,
		ListSelector:     "body > div > div.information-main.mt-20px.wd1200.displayFlex > div.information-main-left > div.information-main-list > ul > li",
		ListHrefSelector: "div.information-main-list-title > a",
		DetailFields: map[string]form.Field{
			"title": {Types: fileTypes.SingleField, Selector: "body > div > div.information-main.mt-20px.wd1200.displayFlex > div.information-main-left > div.informationContents > div.informationContentTitle > h1"},
		},
	}

	_ = f // placeholder so the snippet compiles
}
```

**DetailFields** is the set of fields to extract from the detail page. `title` is the key; it can be any name you like and later becomes the column header in the Excel file. **Types** is the field type. Here we want the title of the detail page, which is plain text, so the type is **fileTypes.SingleField**. **Selector** is the selector of the title element.

<br/><br/>

Run it.

```
package main

import (
	"github.com/PeterYangs/article-spider/fileTypes"
	"github.com/PeterYangs/article-spider/form"
	"github.com/PeterYangs/article-spider/spider"
)

func main() {

	f := form.Form{
		Host:             "https://www.weixz.com",
		Channel:          "/zxzx/list_[PAGE].html",
		Limit:            5,
		PageStart:        1,
		ListSelector:     "body > div > div.information-main.mt-20px.wd1200.displayFlex > div.information-main-left > div.information-main-list > ul > li",
		ListHrefSelector: "div.information-main-list-title > a",
		DetailFields: map[string]form.Field{
			"title": {Types: fileTypes.SingleField, Selector: "body > div > div.information-main.mt-20px.wd1200.displayFlex > div.information-main-left > div.informationContents > div.informationContentTitle > h1"},
		},
	}

	// Start the crawl with the completed form.
	spider.Start(f)
}
```

When the run finishes, a corresponding Excel file is generated under web/static/excel.
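
To make the `[PAGE]`, `Limit`, and `PageStart` settings more concrete, here is a minimal standalone sketch of how such a placeholder could be expanded into list page URLs. It is not taken from article-spider: `expandPages` is a hypothetical helper, and it assumes the pages are numbered consecutively from `PageStart`, which may differ from what the library does internally.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// expandPages is a hypothetical helper (not part of article-spider).
// It replaces the [PAGE] placeholder in channel with each page number
// from pageStart to pageStart+limit-1 and prefixes the result with host.
// Assumption: page numbers are consecutive integers.
func expandPages(host, channel string, pageStart, limit int) []string {
	if limit <= 0 {
		return nil
	}
	urls := make([]string, 0, limit)
	for page := pageStart; page < pageStart+limit; page++ {
		path := strings.ReplaceAll(channel, "[PAGE]", strconv.Itoa(page))
		urls = append(urls, host+path)
	}
	return urls
}

func main() {
	// Same values as the form in the tutorial: 5 pages, starting at page 1.
	for _, u := range expandPages("https://www.weixz.com", "/zxzx/list_[PAGE].html", 1, 5) {
		fmt.Println(u)
	}
}
```

With the tutorial's values this yields five URLs, from `list_1.html` through `list_5.html`. Whether a site actually serves its first page at `list_1.html` depends on the site, so it is worth opening the generated URL for the starting page in a browser before running a crawl.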