[12] Hive Partitions · PHP Development Notes & Solutions

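# [12] Hive Partitions

### 12.1 Objectives

Learn how to use Hive partitions, deepen your understanding of the partition concept, and see how a Hive table is laid out as directories in HDFS.

### 12.2 Requirements

Create a partitioned Hive table; create two partitions, year=2014 and year=2015, based on the year of the data; load the 2015 data into the year=2015 partition; and query the 2015 data from the Hive client with the condition year=2015.

### 12.3 Background

A partition (Partition) in Hive corresponds to a dense index on the partition column in a database, but Hive organizes partitions very differently from a database. In Hive, each partition of a table corresponds to a directory under the table's directory, and all of a partition's data is stored in that directory. For example, if the table pvs has two partition columns, ds and ctry, then the HDFS subdirectory for ds = 20090801, ctry = US is /wh/pvs/ds=20090801/ctry=US, and the HDFS subdirectory for ds = 20090801, ctry = CA is /wh/pvs/ds=20090801/ctry=CA.

An external table (External Table) points to data that already exists in HDFS, and it can also be partitioned. Its metadata is organized in the same way as that of an ordinary table (Table), but the actual data is stored quite differently. For an ordinary table, creating the table and loading the data are two steps (they can be done in a single statement). During loading, the actual data is moved into the data warehouse directory, and all later access reads it directly from that directory. When the table is dropped, its data and its metadata are deleted together.

### 12.4 Procedure

Because Hive depends on MapReduce, start the Hadoop cluster before this experiment and then start Hive. The experiment consists of the following three steps.

#### 12.4.1 Start the Hadoop cluster

On the master node, go to the Hadoop installation directory and start the Hadoop cluster.

~~~
[root@master ~]# cd /usr/cstor/hadoop/sbin
[root@master sbin]# ./start-all.sh
~~~

#### 12.4.2 Open the Hive client

Go to the Hive installation directory and start the Hive client from the command line.

~~~
[root@master ~]# cd /usr/cstor/hive
[root@master hive]# bin/hive
~~~

#### 12.4.3 Run the experiment with HQL statements

Once in the client, list the Hive databases and select the default database:

~~~
hive> show databases;
OK
default
Time taken: 1.152 seconds, Fetched: 1 row(s)
hive> use default;
OK
Time taken: 0.036 seconds
~~~

Create the partitioned Hive table from the command line:

~~~
hive> create table parthive (createdate string, value string) partitioned by (year string) row format delimited fields terminated by '\t';
OK
Time taken: 0.298 seconds
~~~

Check the new table:

~~~
hive> show tables;
OK
parthive
Time taken: 1.127 seconds, Fetched: 1 row(s)
~~~

Create two partitions for the parthive table:

~~~
hive> alter table parthive add partition(year='2014');
OK
Time taken: 0.195 seconds
hive> alter table parthive add partition(year='2015');
OK
Time taken: 0.121 seconds
~~~

Inspect the structure of the parthive table:

~~~
hive> describe parthive;
OK
createdate              string
value                   string
year                    string

# Partition Information
# col_name              data_type               comment

year                    string
Time taken: 0.423 seconds, Fetched: 8 row(s)
~~~

Load local data into the year=2015 partition:

~~~
hive> load data local inpath '/root/data/12/parthive.txt' into table parthive partition(year='2015');
Loading data to table default.parthive partition (year=2015)
Partition default.parthive{year=2015} stats: [numFiles=1, totalSize=110]
OK
Time taken: 1.071 seconds
~~~

Query the year=2015 data with a condition:

~~~
hive> select * from parthive t where t.year='2015';
~~~

Count the year=2015 rows with a condition:

~~~
hive> select count(*) from parthive where year='2015';
~~~

### 12.5 Results

Listing the HDFS files from the command line shows the storage directory structure of the parthive table in HDFS, as in Figure 12-1:

![](https://box.kancloud.cn/10b98eb752d82f712506f1b6fc80ef1a_484x54.jpg)

Figure 12-1 Structure of the parthive table

The query result in the Hive client is shown in Figure 12-2:

![](https://box.kancloud.cn/b4cf5b3e6ff0cf4bab163183280ea2a3_529x166.jpg)

Figure 12-2 Query result in the client

The count result in the Hive client is shown in Figure 12-3:

![](https://box.kancloud.cn/f0e58c32e29c6908b4f19ff09d58aaff_529x257.jpg)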
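
The directory layout described in section 12.3 can be illustrated with a small Python sketch. It is not Hive's own code: the helper name `partition_dir` is hypothetical, and the sketch assumes partition values are plain strings that need no escaping. It only reproduces the naming rule that each partition column adds one `key=value` directory level under the table directory.

```python
from typing import Sequence, Tuple


def partition_dir(table_dir: str, spec: Sequence[Tuple[str, str]]) -> str:
    """Return table_dir followed by one key=value level per partition column.

    Hypothetical helper for illustration only; it assumes values contain
    no characters that would need escaping in a path.
    """
    levels = [f"{key}={value}" for key, value in spec]
    return "/".join([table_dir.rstrip("/")] + levels)


# The pvs example from section 12.3: two partition columns, ds and ctry.
print(partition_dir("/wh/pvs", [("ds", "20090801"), ("ctry", "US")]))
# prints /wh/pvs/ds=20090801/ctry=US
print(partition_dir("/wh/pvs", [("ds", "20090801"), ("ctry", "CA")]))
# prints /wh/pvs/ds=20090801/ctry=CA
```

The order of the pairs matters: the levels follow the order in which the partition columns were declared, so ds comes before ctry in the path.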