Creating Tables · JAVA

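[TOC]

# Creating tables

Note that traditional databases validate table data with schema on write, whereas Hive does not check whether data conforms to the schema at load time. Hive follows schema on read: only when the data is read does Hive check and parse the actual fields against the schema.

The advantage of schema on read is that `load data` is very fast, because nothing is read or parsed; files are only copied or moved. The advantage of schema on write is better query performance, because data that has been parsed up front can be indexed by column and compressed, at the cost of a longer load time.

## Table creation syntax

~~~
CREATE [EXTERNAL] TABLE [IF NOT EXISTS] table_name
  [(col_name data_type [COMMENT col_comment], ...)]
  [COMMENT table_comment]
  [PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
  [CLUSTERED BY (col_name, col_name, ...)
    [SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
  [ROW FORMAT row_format]
  [STORED AS file_format]
  [LOCATION hdfs_path]
~~~

Notes:

1. `create table` creates a table with the given name. If a table with the same name already exists, an exception is thrown; the `IF NOT EXISTS` option suppresses this exception.
2. The `external` keyword creates an external table and lets you specify, at creation time, a path (`LOCATION`) that points to the actual data. For an internal (managed) table, Hive moves the data into the path the data warehouse points to; for an external table, Hive only records where the data lives and does not change its location. When a table is dropped, an internal table loses both its metadata and its data, while an external table loses only its metadata and the data is kept.
3. `like` lets you copy the structure of an existing table without copying its data.
4. `row format`

   ~~~
   DELIMITED [FIELDS TERMINATED BY char]
             [COLLECTION ITEMS TERMINATED BY char]
             [MAP KEYS TERMINATED BY char]
             [LINES TERMINATED BY char]
   | SERDE serde_name [WITH SERDEPROPERTIES (property_name=property_value, property_name=property_value, ...)]
   ~~~

   When creating a table you can supply a custom SerDe or use a built-in one. If `ROW FORMAT` is omitted, or `ROW FORMAT DELIMITED` is used, the built-in SerDe applies. The columns of the table are declared in the same statement as the SerDe, and Hive uses the SerDe to work out the data of each column.
5. `stored as SEQUENCEFILE | TEXTFILE | RCFILE`

   If the data is plain text, use `STORED AS TEXTFILE`. If the data needs to be compressed, use `STORED AS SEQUENCEFILE`. TEXTFILE is the default format and is used when nothing is specified; on load, the data file is copied to HDFS as-is, with no processing. Tables in SEQUENCEFILE, RCFILE, or ORCFILE format cannot be loaded directly from a local file: load the data into a TEXTFILE table first, then `insert` from that table into the SequenceFile, RCFile, or ORCFile table.
6. `clustered by`

   Within each table or partition, Hive can further organize data into buckets, a finer-grained division of the data range. Bucketing is defined on a column: Hive hashes the column value and takes the remainder after dividing by the number of buckets to decide which bucket a record goes into. There are two reasons to organize a table (or partition) into buckets:

   (1) **More efficient query processing.** Buckets add extra structure to a table that Hive can exploit for some queries. In particular, joining two tables that are bucketed on the same columns (which include the join columns) can be implemented efficiently as a map-side join. If both tables in a JOIN are bucketed on the shared column, only the buckets holding the same column values need to be joined, which greatly reduces the amount of data the JOIN touches.

   (2) **More efficient sampling.** When working with large data sets, it is very convenient during query development and tuning to try a query on a small part of the data.

## Delimiters

`fields terminated by`: the separator between fields.

`collection items terminated by`: the separator between the items inside a single field.

## Partitioned tables

A partition corresponds to a separate directory on the HDFS file system, and that directory holds all the data files of the partition. Partitioning in Hive simply means splitting into directories: a large data set is divided into smaller ones according to business needs. At query time, the expression in the `where` clause selects only the partitions the query needs, which improves query efficiency considerably.

# Worked examples

## Loading a file into a table

~~~
hive> create table student(id int, name string, age int)
    > row format delimited
    > fields terminated by ',';
OK
~~~

The row format and the field delimiter are specified when the table is created.

Create a text file:

~~~
[root@master ~]# cat student.txt
1,jdxia,17
2,user2,20
~~~

Then upload it (the last argument is the Hadoop path):

~~~
hdfs dfs -put student.txt /user/hive/warehouse/db1.db/student/
~~~

Then query the table:

~~~
hive> select * from student;
OK
1	jdxia	17
2	user2	20
Time taken: 0.082 seconds, Fetched: 2 row(s)
~~~

If the table had not been created with these row and field delimiters, the columns would show up as null.

Upload the file again under another name:

~~~
[root@master ~]# cp student.txt student1.txt
[root@master ~]# hdfs dfs -put student1.txt /user/hive/warehouse/db1.db/student/
~~~

A `select` now returns the additional rows as well.

## Loading from HDFS into a table

Copying files into the warehouse directory by hand is not good practice. The usual approach is the following.

**Loading with inpath**

Create the table:

~~~
hive> create table t_user(id int,name string,age int)
    > row format delimited
    > fields terminated by ',';
OK
Time taken: 0.088 seconds
~~~

Load a local file into it:

~~~
hive> load data local inpath '/root/student.txt' into table t_user;
~~~

To load data that lives on HDFS, first put the file into Hadoop:

~~~
hdfs dfs -put student1.txt /
~~~

Then, in Hive:

~~~
load data inpath '/student1.txt' into table t_user;
~~~

This loads a file that is already in HDFS into the table.

## Creating a bucketed table

Do not use `load` on a bucketed table; if you do, you will still see a single file when you look on HDFS.

Enable bucketing, which is off by default:

~~~
set hive.enforce.bucketing=true;
-- show the current value
set hive.enforce.bucketing;
~~~

`clustered by` states which column to bucket on:

~~~
hive> create table stu_buck(Sno int,Sname string,Sex string,Sage int,Sdept string)
    > clustered by(Sno)
    > sorted by(Sno DESC)
    > into 4 buckets
    > row format delimited
    > fields terminated by ',';
~~~

~~~
-- to clear the table data, use:
truncate table stu_buck;
~~~

**Inserting into a bucketed table**

Data for the `student_ext` table, separated by commas:

~~~
95001,李勇,男,20,CS
95002,刘晨,女,19,IS
95003,王敏,女,22,MA
95004,张立,男,19,IS
95005,刘刚,男,18,MA
~~~

~~~
-- When inserting, the trailing clause (distribute by sno sort by sno desc) is required;
-- without it the rows do not follow the bucketing rule. "distribute" means to dispatch rows.
-- Do not write clustered here; it raises an error.
insert overwrite table stu_buck
select * from student_ext distribute by sno sort by sno desc;
~~~

**Sampling a bucketed table**

~~~
-- take a look
select * from student_ext;
-- Hive can interact with HDFS directly
dfs -cat /user/hive/warehouse/db1.db/stu_buck/000000_0;
~~~

~~~
select * from student tablesample(bucket 1 out of 2 on id);
~~~

`tablesample` is the sampling clause. Syntax: `TABLESAMPLE(BUCKET x OUT OF y)`.

`y` must be a multiple or a factor of the table's total number of buckets. Hive decides the sampling ratio from `y`. For example, if the table has 64 buckets, `y=32` samples (64/32=) 2 buckets of data, and `y=128` samples (64/128=) 1/2 of one bucket.

`x` is the bucket to start from. For example, with 32 buckets in total, `tablesample(bucket 3 out of 16)` samples (32/16=) 2 buckets of data: the 3rd bucket and the (3+16=) 19th bucket.

~~~
-- query one bucket; the result matches cat-ing the bucket file directly
select * from stu_buck tablesample (bucket 1 out of 4 on sno);
-- take 2 buckets, namely buckets 1 and 3
select * from stu_buck tablesample (bucket 1 out of 2 on sno);
~~~

# Differences between internal and external tables

Internal and external tables in Hive differ as follows:

1) At creation: for an internal table, Hive moves the data into the path the data warehouse points to; for an external table, Hive only records the path of the data and does not change where the data lives.

2) At deletion: dropping an internal table deletes both its metadata and its data, while dropping an external table deletes only the metadata and keeps the data. External tables are therefore somewhat safer, allow more flexible data organization, and make it easier to share source data.

## External tables (`external`)

~~~
hive> create external table t_ext(id int,name string,age int)
    > row format delimited
    > fields terminated by ',';
OK
~~~

An external table can also be created with a `location` clause that specifies a path, so that, unlike an internal table, it reads data stored outside the warehouse directory:

~~~
hive> create external table t_ext(id int,name string,age int)
    > row format delimited
    > fields terminated by ','
    > location "/hivedata";
OK
~~~

`/hivedata` is a directory. Once files are placed in this directory, `select` returns their data.

In the MySQL metastore, look at the `TBLS` table (table-level data such as the creation time) and the `COLUMNS_V2` table (column information).

**If the table is dropped, the files are still in HDFS but the table is gone from Hive: the link is broken, but the data remains.**

## Checking the table type

~~~
desc formatted t_ext;
~~~

In the output, a `Table Type` of `MANAGED_TABLE` means a managed table, that is, not an external table: dropping it deletes everything.

# Table storage formats (`stored as`)

**Create a table**

~~~
create table t_2(id int,name string)
row format delimited
fields terminated by ','
stored as textfile;
~~~

Fill it from an external file:

~~~
[root@master ~]# cat name.txt
1,jdxia
2,xiaozhan
~~~

~~~
hive> load data local inpath '/root/name.txt' into table t_2;
~~~

~~~
STORED AS SEQUENCEFILE | TEXTFILE | RCFILE
~~~

If the data is plain text, use `STORED AS TEXTFILE`. If the data needs to be compressed, use `STORED AS SEQUENCEFILE`. The default is TEXTFILE.

Create a compressed table:

~~~
hive> create table t_3(id int,name string)
    > row format delimited
    > fields terminated by ','
    > stored as SEQUENCEFILE;
~~~

A compressed table cannot be filled with `load` from an external file; Hive reports an error asking you to check the file format. Insert from another table instead:

~~~
hive> insert overwrite table t_3 select * from t_2;
~~~

### Differences

**TEXTFILE format**

The default format. Data is not compressed, so disk usage is high and parsing is expensive. It can be combined with Gzip or Bzip2 (Hive detects the compression and decompresses automatically when running queries), but a Gzip-compressed text file cannot be split, so the data in it cannot be processed in parallel.

Example:

~~~
create table if not exists textfile_table(
  site string,
  url string,
  pv bigint,
  label string)
row format delimited
fields terminated by '\t'
stored as textfile;
~~~

Insert data:

~~~
hive> set hive.exec.compress.output=true;
hive> set mapred.output.compress=true;
hive> set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
hive> set io.compression.codecs=org.apache.hadoop.io.compress.GzipCodec;
hive> insert overwrite table textfile_table select * from textfile_table;
~~~

**SEQUENCEFILE format**

SequenceFile is a binary file format provided by the Hadoop API. It is easy to use, splittable, and compressible. SequenceFile supports three compression options: NONE, RECORD, and BLOCK. RECORD gives a low compression ratio, so BLOCK compression is generally recommended.

Example:

~~~
create table if not exists seqfile_table(
  site string,
  url string,
  pv bigint,
  label string)
row format delimited
fields terminated by '\t'
stored as sequencefile;
~~~

Insert data:

~~~
hive> set hive.exec.compress.output=true;
hive> set mapred.output.compress=true;
hive> set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
hive> set io.compression.codecs=org.apache.hadoop.io.compress.GzipCodec;
hive> set mapred.output.compression.type=BLOCK;
hive> insert overwrite table seqfile_table select * from textfile_table;
~~~

**RCFILE format**

RCFILE combines row-oriented and column-oriented storage. First, it splits the data into blocks by row, which guarantees that a record sits in a single block and avoids reading several blocks to fetch one record. Second, within each block the data is stored by column, which helps compression and fast column access.

~~~
create table if not exists rcfile_table(
  site string,
  url string,
  pv bigint,
  label string)
row format delimited
fields terminated by '\t'
stored as rcfile;
~~~

Insert data:

~~~
hive> set hive.exec.compress.output=true;
hive> set mapred.output.compress=true;
hive> set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
hive> set io.compression.codecs=org.apache.hadoop.io.compress.GzipCodec;
hive> insert overwrite table rcfile_table select * from textfile_table;
~~~

Compared with TEXTFILE and SEQUENCEFILE, RCFILE costs more at load time because of its columnar layout, but it offers a better compression ratio and faster query response. A data warehouse is typically written once and read many times, so overall RCFILE has a clear advantage over the other two formats.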
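
As a closing aside on the bucketing rule and the `TABLESAMPLE(BUCKET x OUT OF y)` arithmetic described earlier, the following is a minimal Python sketch rather than Hive's actual implementation. It assumes that the hash of an INT column is the integer itself, and it only covers the case where `y` is a factor of the total bucket count. The helper names `bucket_index` and `sampled_buckets` are hypothetical and exist only for illustration.

```python
from typing import Dict, List

INT_MAX = 0x7FFFFFFF  # mask that keeps the hash non-negative


def bucket_index(key: int, num_buckets: int) -> int:
    """0-based bucket for an INT key.

    Assumption: the hash of an INT value is the value itself, so the rule
    "hash the column, take the remainder by the bucket count" reduces to a modulo.
    """
    return (key & INT_MAX) % num_buckets


def sampled_buckets(total: int, x: int, y: int) -> List[int]:
    """1-based bucket numbers covered by BUCKET x OUT OF y when y divides total."""
    if not 1 <= x <= y:
        raise ValueError("x must satisfy 1 <= x <= y")
    if total % y != 0:
        raise ValueError("this sketch only handles y being a factor of total")
    # Start at bucket x and step by y, which yields total / y buckets.
    return [x + k * y for k in range(total // y)]


# Distribute the Sno values of the sample student data over 4 buckets.
snos = [95001, 95002, 95003, 95004, 95005]
layout: Dict[int, List[int]] = {}
for sno in snos:
    layout.setdefault(bucket_index(sno, 4), []).append(sno)

print(layout)
print(sampled_buckets(32, 3, 16))
print(sampled_buckets(4, 1, 2))
```

Under these assumptions, `sampled_buckets(32, 3, 16)` reproduces the 3rd and 19th buckets from the text, and `sampled_buckets(4, 1, 2)` reproduces the "buckets 1 and 3" case of the `stu_buck` query.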