Parsing only part of a document · Beautiful Soup 4.2.0 Chinese Documentation

# Parsing only part of a document

If all you want is the `<a>` tags in a document, parsing the whole document wastes memory and time. The fastest approach is to ignore everything other than `<a>` tags from the start. The `SoupStrainer` class lets you specify which parts of a document to keep, so a search does not have to wait for the whole document to be parsed: only the parts that match the `SoupStrainer` are parsed. Create a `SoupStrainer` object and pass it to the `BeautifulSoup` constructor as the `parse_only` argument.

## SoupStrainer

The `SoupStrainer` class takes the same arguments as a typical search method: [name](#id32), [attrs](#css), [recursive](#recursive), [text](#text), and [**kwargs](#keyword). Here are three `SoupStrainer` objects:

```
from bs4 import SoupStrainer

# Keep only <a> tags.
only_a_tags = SoupStrainer("a")

# Keep only tags whose id attribute is "link2".
only_tags_with_id_link2 = SoupStrainer(id="link2")

# Keep only strings shorter than 10 characters.
def is_short_string(string):
    return len(string) < 10

only_short_strings = SoupStrainer(text=is_short_string)
```

Returning to the "Alice" document, here is what it looks like when parsed with each of these three `SoupStrainer` objects:

```
from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>

<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

print(BeautifulSoup(html_doc, "html.parser", parse_only=only_a_tags).prettify())
# <a class="sister" href="http://example.com/elsie" id="link1">
#  Elsie
# </a>
# <a class="sister" href="http://example.com/lacie" id="link2">
#  Lacie
# </a>
# <a class="sister" href="http://example.com/tillie" id="link3">
#  Tillie
# </a>

print(BeautifulSoup(html_doc, "html.parser", parse_only=only_tags_with_id_link2).prettify())
# <a class="sister" href="http://example.com/lacie" id="link2">
#  Lacie
# </a>

print(BeautifulSoup(html_doc, "html.parser", parse_only=only_short_strings).prettify())
# Elsie
# ,
# Lacie
# and
# Tillie
# ...
#
```

You can also pass a `SoupStrainer` into any of the methods covered in [Searching the tree](#id24). This is probably not a common use, but it is worth mentioning:

```
soup = BeautifulSoup(html_doc)
soup.find_all(only_short_strings)
# [u'\n\n', u'\n\n', u'Elsie', u',\n', u'Lacie', u' and\n', u'Tillie',
#  u'\n\n', u'...', u'\n']
```
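
As a small illustration of the motivating case above (collecting links without building a tree for the rest of the page), the sketch below wraps `parse_only` in a helper. The names `extract_links` and `sample_html` are hypothetical and exist only for this example, and the sketch assumes the built-in `html.parser` backend.

```python
from bs4 import BeautifulSoup, SoupStrainer

# Hypothetical sample markup, made up for this sketch.
sample_html = """
<html><body>
<h1>Links</h1>
<p>See <a href="http://example.com/one" id="first">one</a> and
<a href="http://example.com/two" id="second">two</a>.</p>
<p>No links here.</p>
</body></html>
"""

def extract_links(markup):
    """Return the href of every <a> tag, keeping only <a> tags while parsing."""
    only_a_tags = SoupStrainer("a")
    soup = BeautifulSoup(markup, "html.parser", parse_only=only_a_tags)
    # Skip anchors that have no href attribute.
    return [a["href"] for a in soup.find_all("a") if a.has_attr("href")]

links = extract_links(sample_html)
print(links)
# ['http://example.com/one', 'http://example.com/two']
```

Because the strainer is applied while the document is being parsed, the resulting `soup` holds only the `<a>` tags and their contents; a lookup such as `soup.find("h1")` returns `None`. Note that `parse_only` has no effect with the html5lib parser, which always parses the whole document.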