在 Java 中读取网页 · ZetCode 中文系列教程

# 在 Java 中读取网页原文：http://zetcode.com/articles/javareadwebpage/ 在 Java 中读取网页是一个教程，它提供了几种读取 Java 中网页的方法。它包含六个示例，这些示例从一个很小的网页上下载 HTTP 源。 ## Java 读取网页工具 Java 具有内置的工具和第三方库，用于读取/下载网页。在示例中，我们使用 URL，JSoup，`HtmlCleaner`，Apache `HttpClient`，Jetty `HttpClient`和 HtmlUnit。在以下示例中，我们从 [something.com](http://something.com) 微型网页下载 HTML 源。 ## 读取带有 URL 的网页 `URL`表示一个统一资源定位符，它是指向万维网上资源的指针。 `ReadWebPageEx.java` ```java package com.zetcode; import java.io.BufferedReader; import java.io.IOException; import java.io.InputStreamReader; import java.net.MalformedURLException; import java.net.URL; public class ReadWebPageEx { public static void main(String[] args) throws MalformedURLException, IOException { BufferedReader br = null; try { URL url = new URL("http://www.something.com"); br = new BufferedReader(new InputStreamReader(url.openStream())); String line; StringBuilder sb = new StringBuilder(); while ((line = br.readLine()) != null) { sb.append(line); sb.append(System.lineSeparator()); } System.out.println(sb); } finally { if (br != null) { br.close(); } } } } ``` 该代码示例读取网页的内容。 ```java br = new BufferedReader(new InputStreamReader(url.openStream())); ``` `openStream()`方法打开到指定 URL 的连接，并返回`InputStream`以从该连接读取。 `InputStreamReader`是从字节流到字符流的桥梁。它读取字节，并使用指定的字符集将其解码为字符。另外，`BufferedReader`用于获得更好的性能。 ```java StringBuilder sb = new StringBuilder(); while ((line = br.readLine()) != null) { sb.append(line); sb.append(System.lineSeparator()); } ``` 使用`readLine()`方法逐行读取 HTML 数据。源被附加到`StringBuilder`上。 ```java System.out.println(sb); ``` 最后，`StringBuilder`的内容被打印到终端。 ## 使用 JSoup 读取网页 JSoup 是流行的 Java HTML 解析器。 ```java <dependency> <groupId>org.jsoup</groupId> <artifactId>jsoup</artifactId> <version>1.9.2</version> </dependency> ``` 我们已经使用了这种 Maven 依赖关系。 `ReadWebPageEx2.java` ```java package com.zetcode; import java.io.IOException; import org.jsoup.Jsoup; public class ReadWebPageEx2 { public static void main(String[] args) throws IOException { String webPage = "http://www.something.com"; String html = Jsoup.connect(webPage).get().html(); System.out.println(html); } } ``` 该代码示例使用 JSoup 下载并打印一个很小的网页。 ```java String html = Jsoup.connect(webPage).get().html(); ``` `connect()`方法连接到指定的网页。 `get()`方法发出 GET 请求。最后，`html()`方法检索 HTML 源。 ## 使用`HtmlCleaner`读取网页 `HtmlCleaner`是用 Java 编写的开源 HTML 解析器。 ```java <dependency> <groupId>net.sourceforge.htmlcleaner</groupId> <artifactId>htmlcleaner</artifactId> <version>2.16</version> </dependency> ``` 对于此示例，我们使用`htmlcleaner` Maven 依赖项。 `ReadWebPageEx3.java` ```java package com.zetcode; import java.io.IOException; import java.net.URL; import org.htmlcleaner.CleanerProperties; import org.htmlcleaner.HtmlCleaner; import org.htmlcleaner.SimpleHtmlSerializer; import org.htmlcleaner.TagNode; public class ReadWebPageEx3 { public static void main(String[] args) throws IOException { URL url = new URL("http://www.something.com"); CleanerProperties props = new CleanerProperties(); props.setOmitXmlDeclaration(true); HtmlCleaner cleaner = new HtmlCleaner(props); TagNode node = cleaner.clean(url); SimpleHtmlSerializer htmlSerializer = new SimpleHtmlSerializer(props); htmlSerializer.writeToStream(node, System.out); } } ``` 该示例使用`HtmlCleaner`下载网页。 ```java CleanerProperties props = new CleanerProperties(); props.setOmitXmlDeclaration(true); ``` 在属性中，我们设置为省略 XML 声明。 ```java SimpleHtmlSerializer htmlSerializer = new SimpleHtmlSerializer(props); htmlSerializer.writeToStream(node, System.out); ``` `SimpleHtmlSerializer`将创建结果 HTML，而不会进行任何缩进和/或压缩。 ## 使用 Apache `HttpClient`读取网页 Apache `HttpClient`是 HTTP/1.1 兼容的 HTTP 代理实现。它可以使用请求和响应过程来抓取网页。 HTTP 客户端实现 HTTP 和 HTTPS 协议的客户端。 ```java <dependency> <groupId>org.apache.httpcomponents</groupId> <artifactId>httpclient</artifactId> <version>4.5.2</version> </dependency> ``` 我们将这种 Maven 依赖关系用于 Apache HTTP 客户端。 `ReadWebPageEx4.java` ```java package com.zetcode; import java.io.IOException; import org.apache.http.HttpEntity; import org.apache.http.HttpResponse; import org.apache.http.client.HttpClient; import org.apache.http.client.methods.HttpGet; import org.apache.http.impl.client.HttpClientBuilder; import org.apache.http.util.EntityUtils; public class ReadWebPageEx4 { public static void main(String[] args) throws IOException { HttpGet request = null; try { String url = "http://www.something.com"; HttpClient client = HttpClientBuilder.create().build(); request = new HttpGet(url); request.addHeader("User-Agent", "Apache HTTPClient"); HttpResponse response = client.execute(request); HttpEntity entity = response.getEntity(); String content = EntityUtils.toString(entity); System.out.println(content); } finally { if (request != null) { request.releaseConnection(); } } } } ``` 在代码示例中，我们将 GET HTTP 请求发送到指定的网页并接收 HTTP 响应。从响应中，我们读取了 HTML 源代码。 ```java HttpClient client = HttpClientBuilder.create().build(); ``` 生成了`HttpClient`。 ```java request = new HttpGet(url); ``` `HttpGet`是 HTTP GET 方法的类。 ```java request.addHeader("User-Agent", "Apache HTTPClient"); HttpResponse response = client.execute(request); ``` 执行 GET 方法并接收到`HttpResponse`。 ```java HttpEntity entity = response.getEntity(); String content = EntityUtils.toString(entity); System.out.println(content); ``` 从响应中，我们获得网页的内容。 ## 使用 Jetty `HttpClient`读取网页 Jetty 项目也有一个 HTTP 客户端。 ```java <dependency> <groupId>org.eclipse.jetty</groupId> <artifactId>jetty-client</artifactId> <version>9.4.0.M1</version> </dependency> ``` 这是 Jetty HTTP 客户端的 Maven 依赖项。 `ReadWebPageEx5.java` ```java package com.zetcode; import org.eclipse.jetty.client.HttpClient; import org.eclipse.jetty.client.api.ContentResponse; public class ReadWebPageEx5 { public static void main(String[] args) throws Exception { HttpClient client = null; try { client = new HttpClient(); client.start(); String url = "http://www.something.com"; ContentResponse res = client.GET(url); System.out.println(res.getContentAsString()); } finally { if (client != null) { client.stop(); } } } } ``` 在示例中，我们使用 Jetty HTTP 客户端获取网页的 HTML 源。 ```java client = new HttpClient(); client.start(); ``` 创建并启动`HttpClient`。 ```java ContentResponse res = client.GET(url); ``` GET 请求发布到指定的 URL。 ```java System.out.println(res.getContentAsString()); ``` 使用`getContentAsString()`方法从响应中检索内容。 ## 使用 HtmlUnit 读取网页 HtmlUnit 是用于测试基于 Web 的应用的 Java 单元测试框架。 ```java <dependency> <groupId>net.sourceforge.htmlunit</groupId> <artifactId>htmlunit</artifactId> <version>2.23</version> </dependency> ``` 我们使用这种 Maven 依赖关系。 `ReadWebPageEx6.java` ```java package com.zetcode; import com.gargoylesoftware.htmlunit.WebClient; import com.gargoylesoftware.htmlunit.WebResponse; import com.gargoylesoftware.htmlunit.html.HtmlPage; import java.io.IOException; public class ReadWebPageEx6 { public static void main(String[] args) throws IOException { try (WebClient webClient = new WebClient()) { String url = "http://www.something.com"; HtmlPage page = webClient.getPage(url); WebResponse response = page.getWebResponse(); String content = response.getContentAsString(); System.out.println(content); } } } ``` 该示例下载一个网页并使用 HtmlUnit 库进行打印。在本文中，我们使用各种工具（包括 URL，JSoup，`HtmlCleaner`，Apache `HttpClient`，Jetty `HttpClient`和 HtmlUnit）在 Java 中抓取了一个网页。您可能也对以下相关教程感兴趣： [Java 教程](/lang/java/)， [Java Servlet 读取网页](/articles/javaservletreadwebpage/)，[如何使用 Java 读取文本文件](/articles/javagetpostrequest/) ，[发送 HTTP GET POST 请求 ]使用 Java](/articles/javareadtext/)，[使用 Java8 的`StringJoiner`来连接字符串](/articles/java8stringjoiner/)，[Jetty 教程](/java/jetty/)或 [Google Guava 简介](/articles/guava/)。