**Scraping text**
Suppose there is an article list to scrape, and we want the title of each article in the list:
https://www.weixz.com/zxzx/

<br/><br/>
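Start writing the code.
Step 1: create the form struct
```
package main

import "github.com/PeterYangs/article-spider/form"

func main() {
	f := form.Form{}
	_ = f // placeholder so the snippet compiles; f is used in the final step
}
```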
<br/><br/>
Then set the site's domain
```
package main

import "github.com/PeterYangs/article-spider/form"

func main() {
	f := form.Form{
		Host: "https://www.weixz.com",
	}
	_ = f // placeholder so the snippet compiles; f is used in the final step
}
```
<br/><br/>
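Set the current channel (the list path)
```
package main

import "github.com/PeterYangs/article-spider/form"

func main() {
	f := form.Form{
		Host:    "https://www.weixz.com",
		Channel: "/zxzx/",
	}
	_ = f // placeholder so the snippet compiles; f is used in the final step
}
```
**Note: article lists are usually paginated, and you will want to crawl as many pages as possible rather than just one. On this site the second page is https://www.weixz.com/zxzx/list_2.html, so the code above needs a small change.**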
<br/><br/>
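Fill in the page number
```
package main

import "github.com/PeterYangs/article-spider/form"

func main() {
	f := form.Form{
		Host:    "https://www.weixz.com",
		Channel: "/zxzx/list_[PAGE].html",
	}
	_ = f // placeholder so the snippet compiles; f is used in the final step
}
```
The rule is simple: replace the page number in the URL with **[PAGE]**.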
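
To make the **[PAGE]** rule concrete, here is a minimal sketch of how such a placeholder could be expanded into a page URL. `pageURL` is a hypothetical helper written for illustration only; it is not part of article-spider, and the library's own expansion may work differently.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// pageURL is a hypothetical helper (not part of article-spider).
// It joins host and channel, replacing every [PAGE] placeholder
// in the channel with the given page number.
func pageURL(host, channel string, page int) string {
	return host + strings.ReplaceAll(channel, "[PAGE]", strconv.Itoa(page))
}

func main() {
	fmt.Println(pageURL("https://www.weixz.com", "/zxzx/list_[PAGE].html", 2))
	// prints https://www.weixz.com/zxzx/list_2.html
}
```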
<br/><br/>
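Set the maximum page to crawl and the starting page
```
package main

import "github.com/PeterYangs/article-spider/form"

func main() {
	f := form.Form{
		Host:      "https://www.weixz.com",
		Channel:   "/zxzx/list_[PAGE].html",
		Limit:     5,
		PageStart: 1,
	}
	_ = f // placeholder so the snippet compiles; f is used in the final step
}
```
`Limit: 5` means crawl 5 pages, and `PageStart: 1` means start crawling from page 1.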
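
The sketch below lists the page numbers that such a configuration would cover. It assumes that the limit is a count of pages beginning at the starting page; with a starting page of 1 this is the same as crawling up to page 5. `pageNumbers` is a hypothetical helper, not an article-spider function.

```go
package main

import "fmt"

// pageNumbers is a hypothetical helper (not part of article-spider).
// Assumption: limit is the number of pages to crawl, starting at pageStart.
func pageNumbers(pageStart, limit int) []int {
	if limit <= 0 {
		return nil
	}
	pages := make([]int, 0, limit)
	for i := 0; i < limit; i++ {
		pages = append(pages, pageStart+i)
	}
	return pages
}

func main() {
	fmt.Println(pageNumbers(1, 5)) // prints [1 2 3 4 5]
}
```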
<br/><br/>
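Set the list selector
First, look at how the list is structured.

Each `li` is one entry in the list. To save effort, copy the selector of an `li` from Chrome (in DevTools, right-click the element and choose Copy > Copy selector).

Put it into the code:
```
package main

import "github.com/PeterYangs/article-spider/form"

func main() {
	f := form.Form{
		Host:         "https://www.weixz.com",
		Channel:      "/zxzx/list_[PAGE].html",
		Limit:        5,
		PageStart:    1,
		ListSelector: "body > div > div.information-main.mt-20px.wd1200.displayFlex > div.information-main-left > div.information-main-list > ul > li:nth-child(1)",
	}
	_ = f // placeholder so the snippet compiles; f is used in the final step
}
```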
Note that we want every `li`, not one particular `li`, so remove the trailing `:nth-child(1)`:
```
package main

import "github.com/PeterYangs/article-spider/form"

func main() {
	f := form.Form{
		Host:         "https://www.weixz.com",
		Channel:      "/zxzx/list_[PAGE].html",
		Limit:        5,
		PageStart:    1,
		ListSelector: "body > div > div.information-main.mt-20px.wd1200.displayFlex > div.information-main-left > div.information-main-list > ul > li",
	}
	_ = f // placeholder so the snippet compiles; f is used in the final step
}
```
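
A selector copied from the browser usually points at a single element, which is why it ends in `:nth-child(1)`. As an illustration, the trailing pseudo-class could also be stripped programmatically. `generalize` is a hypothetical helper, not part of article-spider, and it only handles a `:nth-child(n)` at the very end of the selector.

```go
package main

import (
	"fmt"
	"regexp"
)

// trailingNthChild matches a :nth-child(n) pseudo-class at the end of a selector.
var trailingNthChild = regexp.MustCompile(`:nth-child\(\d+\)\s*$`)

// generalize is a hypothetical helper (not part of article-spider).
// It drops a trailing :nth-child(n) so the selector matches all siblings.
func generalize(selector string) string {
	return trailingNthChild.ReplaceAllString(selector, "")
}

func main() {
	fmt.Println(generalize("div.information-main-list > ul > li:nth-child(1)"))
	// prints div.information-main-list > ul > li
}
```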
<br/><br/>
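Set the selector for the `a` link
```
package main

import "github.com/PeterYangs/article-spider/form"

func main() {
	f := form.Form{
		Host:             "https://www.weixz.com",
		Channel:          "/zxzx/list_[PAGE].html",
		Limit:            5,
		PageStart:        1,
		ListSelector:     "body > div > div.information-main.mt-20px.wd1200.displayFlex > div.information-main-left > div.information-main-list > ul > li",
		ListHrefSelector: "div > a",
	}
	_ = f // placeholder so the snippet compiles; f is used in the final step
}
```
**ListHrefSelector** is the selector for the `a` link. The spider needs the link to each detail page in order to scrape more data. Do not use the link's full selector; use a selector relative to the list item. In the example above the list selector already ends at `li`, so the link selector starts from `li`, and `div > a` is enough.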
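
The `href` found on a list page is often site-relative, which is one reason the host is configured separately. The sketch below shows one way a relative link could be turned into an absolute detail-page URL using the standard library. `resolveHref` is a hypothetical helper, the path `/zxzx/123.html` is made up for the example, and article-spider may build its URLs differently.

```go
package main

import (
	"fmt"
	"net/url"
)

// resolveHref is a hypothetical helper (not part of article-spider).
// It resolves href against host; an href that is already absolute is returned as-is.
func resolveHref(host, href string) (string, error) {
	base, err := url.Parse(host)
	if err != nil {
		return "", err
	}
	ref, err := url.Parse(href)
	if err != nil {
		return "", err
	}
	return base.ResolveReference(ref).String(), nil
}

func main() {
	// "/zxzx/123.html" is a made-up path used only for illustration.
	link, err := resolveHref("https://www.weixz.com", "/zxzx/123.html")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(link) // prints https://www.weixz.com/zxzx/123.html
}
```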
<br/><br/>
Extract the title field from the detail page
```
package main

import (
	"github.com/PeterYangs/article-spider/fileTypes"
	"github.com/PeterYangs/article-spider/form"
)

func main() {
	f := form.Form{
		Host:             "https://www.weixz.com",
		Channel:          "/zxzx/list_[PAGE].html",
		Limit:            5,
		PageStart:        1,
		ListSelector:     "body > div > div.information-main.mt-20px.wd1200.displayFlex > div.information-main-left > div.information-main-list > ul > li",
		ListHrefSelector: "div.information-main-list-title > a",
		DetailFields: map[string]form.Field{
			"title": {Types: fileTypes.SingleField, Selector: "body > div > div.information-main.mt-20px.wd1200.displayFlex > div.information-main-left > div.informationContents > div.informationContentTitle > h1"},
		},
	}
	_ = f // placeholder so the snippet compiles; f is used in the final step
}
```
**DetailFields** lists the fields to extract from the detail page. `title` is the key; it can be any name you like, and it becomes a column header in the Excel file. **Types** is the field type. In this example we want the title of the detail page, which is plain text, so Types is set to **fileTypes.SingleField**. **Selector** is the selector for the title.
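
To illustrate how field keys relate to column headers, the sketch below derives a header row from a map of field names to selectors. It uses a plain `map[string]string` instead of `form.Field`, `headerRow` is a hypothetical helper, and the `date` entry is invented for the example. The keys are sorted only to make the sketch's output deterministic; the order article-spider actually uses is not specified here.

```go
package main

import (
	"fmt"
	"sort"
)

// headerRow is a hypothetical helper (not part of article-spider).
// It returns the field names in sorted order for use as column headers.
func headerRow(fields map[string]string) []string {
	headers := make([]string, 0, len(fields))
	for name := range fields {
		headers = append(headers, name)
	}
	sort.Strings(headers) // Go map iteration order is random, so sort for stable output
	return headers
}

func main() {
	fields := map[string]string{
		"title": "div.informationContentTitle > h1",
		"date":  "span.date", // made-up field and selector, for illustration only
	}
	fmt.Println(headerRow(fields)) // prints [date title]
}
```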
<br/><br/>
Run it
```
package main

import (
	"github.com/PeterYangs/article-spider/fileTypes"
	"github.com/PeterYangs/article-spider/form"
	"github.com/PeterYangs/article-spider/spider"
)

func main() {
	f := form.Form{
		Host:             "https://www.weixz.com",
		Channel:          "/zxzx/list_[PAGE].html",
		Limit:            5,
		PageStart:        1,
		ListSelector:     "body > div > div.information-main.mt-20px.wd1200.displayFlex > div.information-main-left > div.information-main-list > ul > li",
		ListHrefSelector: "div.information-main-list-title > a",
		DetailFields: map[string]form.Field{
			"title": {Types: fileTypes.SingleField, Selector: "body > div > div.information-main.mt-20px.wd1200.displayFlex > div.information-main-left > div.informationContents > div.informationContentTitle > h1"},
		},
	}

	// Start crawling with the form configured above.
	spider.Start(f)
}
```
When the run finishes, a corresponding Excel file is generated under web/static/excel.
