select · Big Data

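[TOC]

## select

* Basic SELECT operations
* Syntax

~~~
SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list [HAVING condition]]
[CLUSTER BY col_list
  | [DISTRIBUTE BY col_list] [SORT BY | ORDER BY col_list]
]
[LIMIT number]
~~~

**Notes:**

1. **ORDER BY sorts the entire input globally, so it runs on a single reducer; with large inputs the computation takes a long time.**
2. **SORT BY is not a global sort; it sorts the data before it enters each reducer. So if you sort with SORT BY and set `mapred.reduce.tasks>1`, SORT BY only guarantees that each reducer's output is ordered, not that the overall result is ordered.**
3. DISTRIBUTE BY (column) distributes rows to different reducers according to the given column; the distribution algorithm is hashing.
4. CLUSTER BY (column) (bucketing) does what DISTRIBUTE BY does and also sorts by that column. So when the bucketing column and the sort column are the same, `cluster by = distribute by + sort by`.

What bucketed tables are for: their main use is to make joins more efficient. (Consider this question: for `select a.id,a.name,b.addr from a join b on a.id = b.id;`, if tables a and b are both already bucketed, and both are bucketed on the id column, does the join still need a Cartesian product over the full tables?)

**Note: Hive provides a "strict mode" setting that stops users from running queries that may have unexpected harmful effects.**

Setting the property hive.mapred.mode to strict blocks the following three kinds of queries:

1. Queries on a partitioned table, unless the WHERE clause contains a partition filter. Partitioned tables usually hold large amounts of data, and a query with no partition restriction scans every partition, which consumes a lot of resources. Not allowed: `select * from logs;` Allowed: `select * from logs where day=20151212;`
2. Queries with ORDER BY but no LIMIT clause. ORDER BY sends all results to a single reducer for sorting, which is very time-consuming.
3. Cartesian products; put simply, a JOIN that has no ON clause and uses WHERE instead.

**Examples**

~~~
create external table student_ext(Sno int,Sname string,Sex string,Sage int,Sdept string)
row format delimited fields terminated by ','
location '/stu';
~~~

~~~
-- WHERE query
select * from student_ext where sno=95020;
-- grouping
select sex,count(*) from student_ext group by sex;
~~~

~~~
-- distribute and sort; with only 1 reducer this is pointless
select * from student_ext cluster by sex;
~~~

~~~
-- use 4 reducers
-- each reducer then sorts its own rows internally
hive> set mapred.reduce.tasks=4;
hive> create table tt_1 as select * from student_ext cluster by sno;
-- check the result: the tt_1 directory contains 4 files
dfs -cat /user/hive/warehouse/db1.db/tt_1/000000_0;
-- same result as above, split across 4 reducers
create table tt_2 as select * from student_ext distribute by sno sort by sno;
-- the sort can use a different column
create table tt_3 as select * from student_ext distribute by sno sort by sage;
~~~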
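
The difference between ORDER BY, SORT BY, DISTRIBUTE BY and CLUSTER BY can be illustrated with a small simulation. The following is a minimal Python sketch, not Hive's actual implementation: the function names (`distribute_by`, `sort_by`, `cluster_by`, `order_by`), the sample rows, and the use of `hash(value) % num_reducers` as the partitioning rule are all assumptions made for illustration. It only shows that per-reducer sorting does not yield a globally ordered result.

```python
from collections import defaultdict

def distribute_by(rows, key, num_reducers):
    """Route each row to a reducer by hashing the key column.
    Assumption: hash(value) % num_reducers stands in for Hive's partitioner."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[hash(row[key]) % num_reducers].append(row)
    return [buckets[i] for i in range(num_reducers)]

def sort_by(partitions, key):
    """Sort rows inside each reducer independently (no cross-reducer order)."""
    return [sorted(part, key=lambda r: r[key]) for part in partitions]

def cluster_by(rows, key, num_reducers):
    """CLUSTER BY col modeled as DISTRIBUTE BY col followed by SORT BY col."""
    return sort_by(distribute_by(rows, key, num_reducers), key)

def order_by(rows, key):
    """Global sort: conceptually everything goes through one reducer."""
    return sorted(rows, key=lambda r: r[key])

# Hypothetical sample data loosely shaped like student_ext (sno, sage only).
rows = [
    {"sno": 95004, "sage": 19},
    {"sno": 95001, "sage": 20},
    {"sno": 95007, "sage": 18},
    {"sno": 95002, "sage": 19},
    {"sno": 95005, "sage": 21},
    {"sno": 95003, "sage": 22},
]

# Like "cluster by sno" with 2 reducers: each list is sorted on its own.
for i, part in enumerate(cluster_by(rows, "sno", 2)):
    print("reducer", i, [r["sno"] for r in part])

# Like "distribute by sno sort by sage": same routing, different sort column.
for i, part in enumerate(sort_by(distribute_by(rows, "sno", 2), "sage")):
    print("reducer", i, [(r["sno"], r["sage"]) for r in part])

# Like "order by sno": one globally ordered list.
print([r["sno"] for r in order_by(rows, "sno")])
```

Concatenating the per-reducer lists in the first loop gives a sequence that is not globally sorted, which is why ORDER BY is needed when a total order matters.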