1.概述
在离线数仓处理通过HQL业务数据时,经常会遇到行转列或者列转行之类的操作,就像concat_ws之类的函数被广泛使用,今天这个也是经常要使用的拓展方法。
2.explode函数
2.1 函数语法
-- explode(a) - separates the elements of array a into multiple rows, or the elements of a map into multiple rows and columns Function class:org.apache.hadoop.hive.ql.udf.generic.GenericUDTFExplode Function type:BUILTIN -- explode()用于array的语法如下 select explode(arraycol) as newcol from tablename; -- explode()用于map的语法如下: select explode(mapcol) as (keyname,valuename) from tablename;
2.2 函数说明
- explode 函数是UDTF函数,将hive一列中复杂的array或者map结构拆分成多行。
- Explode函数是不允许在select再有其他字段,
- explode(ARRAY) 列表中的每个元素生成一行。
- explode(MAP) map中每个key-value对,生成一行,key为一列,value为一列。
2.3 使用案例
-- explode (array) select explode(array('A','B','C')); select explode(array('A','B','C')) as col; select tf.* from (select 0) t lateral view explode(array('A','B','C')) tf; select tf.* from (select 0) t lateral view explode(array('A','B','C')) tf as col; -- 结果 col A B C -- explode (map) select explode(map('A',10,'B',20,'C',30)); select explode(map('A',10,'B',20,'C',30)) as (key,value); select tf.* from (select 0) t lateral view explode(map('A',10,'B',20,'C',30)) tf; select tf.* from (select 0) t lateral view explode(map('A',10,'B',20,'C',30)) tf as key,value; -- 结果 key value A 10 B 20 C 30
3.posexplode函数
2.1 函数语法
-- posexplode(a) - behaves like explode for arrays, but includes the position of items in the original array Function class:org.apache.hadoop.hive.ql.udf.generic.GenericUDTFPosExplode Function type:BUILTIN select posexplode(ARRAY
a) -- Explodes an array to multiple rows with additional positional column of int type (position of items in the original array, starting with 0). Returns a row-set with two columns (pos,val), one row for each element from the array. 2.2 函数说明
- posexplode 函数,将ARRAY数组a展开,每个Value一行,每行两列分别对应数组从0开始的下标和数组元素。
2.3 使用案例
-- posexplode (array) select posexplode(array('A','B','C')); select posexplode(array('A','B','C')) as (pos,val); select tf.* from (select 0) t lateral view posexplode(array('A','B','C')) tf; select tf.* from (select 0) t lateral view posexplode(array('A','B','C')) tf as pos,val; -- 结果 pos val 0 A 1 B 2 C
4.later view
4.1 语法
lateralView: LATERAL VIEW udtf(expression) tableAlias AS columnAlias (',' columnAlias)* fromClause: FROM baseTable (lateralView)* -- columnAlias是给udtf(expression)列起的别名。 -- tableAlias 虚拟表的别名。
4.2 用法描述
- lateral view为侧视图,意义是为了配合UDTF来使用,把某一行数据拆分成多行数据。
- 不加lateral view的UDTF只能提取单个字段拆分,并不能塞回原来数据表中。
- 加上lateral view就可以将拆分的单个字段数据与原始表数据关联上。
- lateral view函数会将UDTF生成的结果放到一个虚拟表中,然后虚拟表中的数据和输入行进行join来达到连接UDTF外的select字段的目的。(本质是笛卡尔积)
4.3 使用案例
4.3.1 准备数据
下表 pageAds. 它有两个字段: pageid (页码) and adid_list (页面上的adid):
Column name Column type pageid STRING adid_list Array 表中数据如下:
pageid adid_list front_page [1, 2, 3] contact_page [3, 4, 5] 需求: 统计各个页面出现的广告的次数
4.3.2 代码实现
第一步: 使用 lateral view 和 explore() 函数将 adid_list 列的 list 拆分,sql代码如下:
select pageid, adid FROM pageAds lateral view explode(adid_list) ad_view as adid;
可的如下结果
pageid adid front_page 1 front_page 2 front_page 3 contact_page 4 contact_page 5 第二步: 使用 count/group by 语句统计出每个adid出现的次数:
select adid,count(1) as cnt FROM pageAds lateral view explode(adid_list) ad_view as adid group by adid;
adid cnt 1 1 2 1 3 2 4 1 5 1 4.4 Multiple Lateral Views
FROM子句可以有多个LATERAL VIEW子句。 后面的LATERAL VIEWS子句可以引用出现在LATERAL VIEWS左侧表的任何列。
例如,如下查询:
SELECT * FROM exampleTable LATERAL VIEW explode(col1) myTable1 AS myCol1 LATERAL VIEW explode(col2) myTable2 AS myCol2;
例如使用以下基表:
Array pageid_list Array adid_list [1, 2, 3] [“a”, “b”, “c”] [3, 4] [“c”, “d”] 单个Lateral View查询:
SELECT pageid_list, adid FROM pageAds_1 LATERAL VIEW explode(adid_list) adTable AS adid; [1,2,3] a [1,2,3] b [1,2,3] c [4,5] c [4,5] d
多个Lateral View查询:
select pageid,adid FROM pageAds_1 lateral view explode(pageid_list) adTable as pageid lateral view explode(adid_list) adTable as adid; 1,a 1,b 1,c 2,a 2,b 2,c 3,a 3,b 3,c 3,c 3,d 4,c 4,d
4.5 later view json_tuple()
4.5.1 准备数据
create table lateral_tal_3 ( id int, col1 string, col2 string ); insert into lateral_tal_3 values(1234,'{"part1" : "61", "total" : "623", "part2" : "560", "part3" : "1", "part4" : "1"}',' {"to_part2" : "0", "to_part4" : "0", "to_up" : "0", "to_part3" : "0", "to_part34" : "0"}'), (4567,'{"part1" : "451", "total" : "89928", "part2" : "88653", "part3" : "789", "part4" : "35"}','{"to_part2" : "54", "to_part4" : "6", "to_up" : "65", "to_part3" : "2", "to_part34" : "3"}'), (7890,'{"part1" : "142", "total" : "351808", "part2" : "346778", "part3" : "4321", "part4" : "567"}','{"to_part2" : "76", "to_part4" : "23", "to_up" : "65", "to_part3" : "14", "to_part34" : "53"}');
id col1 col2 1234 {“part1” : “61”, “total” : “623”, “part2” : “560”, “part3” : “1”, “part4” : “1”} {“to_part2” : “0”, “to_part4” : “0”, “to_up” : “0”, “to_part3” : “0”, “to_part34” : “0”} 4567 {“part1” : “451”, “total” : “89928”, “part2” : “88653”, “part3” : “789”, “part4” : “35”} {“to_part2” : “54”, “to_part4” : “6”, “to_up” : “65”, “to_part3” : “2”, “to_part34” : “3”} 7890 {“part1” : “142”, “total” : “351808”, “part2” : “346778”, “part3” : “4321”, “part4” : “567”} {“to_part2” : “76”, “to_part4” : “23”, “to_up” : “65”, “to_part3” : “14”, “to_part34” : “53”} 需求: 解析非结构化的json数据类型
“json_tuple(jsonStr, p1, p2, …, pn) - like get_json_object, but it takes multiple names and return a tuple. All the input parameters and output column types are string.”
Function class:org.apache.hadoop.hive.ql.udf.generic.GenericUDTFJSONTuple
Function type:BUILTIN
json_tuple : 第一个参数是json 字符串所在的列名,其它参数是获取 json 字符串中的哪些key值;
4.5.2 代码实现
SELECT id, part1, part3, part4, to_part2, to_part3, to_part4, IF(part3 = 0, 0.0, to_part3 / part3) as ratio3, IF(part4 = 0, 0.0, to_part4 / part4) as ratio4 FROM lateral_tal_3 lateral VIEW json_tuple(col1, 'part3', 'part4', 'part1') json1 AS part3, part4, part1 lateral VIEW json_tuple(col2, 'to_part2','to_part3', 'to_part4') json2 AS to_part2, to_part3, to_part4 ; 1234,61,1,1,0,0,0,0,0 4567,451,789,35,54,2,6,0.0025348542458808617,0.17142857142857143 7890,142,4321,567,76,14,23,0.0032399907428835918,0.04056437389770723
5.使用案例
需求1: 如何产生1-100的连续的数字?
--方式1: 结合space函数与split函数,posexplode函数,lateral view函数获得 select id_start + pos as id from ( select 1 as id_start, 100 as id_end ) m lateral view posexplode(split(space(id_end - id_start), '')) t as pos, val; -- 方式2:结合space函数与split函数,explode函数,lateral view函数+窗口函数获得 select row_number() over () as id from (select split(space(99), '') as x) t lateral view explode(x) ex; -- 方式2:结合space函数与split函数,posexplode函数,lateral view函数获取 from (select split(space(99), ' ') as x) t lateral view posexplode(x) ex as pos,val;
需求2: 获取2024-07-15至2024-07-29间所有的日期
SELECT pos, date_add(start_date, pos) dd FROM (SELECT '2024-07-15' AS start_date, '2024-07-29' AS end_date) temp lateral VIEW posexplode(split(space(datediff(end_date, start_date)), '')) t AS pos, val;
- posexplode 函数,将ARRAY数组a展开,每个Value一行,每行两列分别对应数组从0开始的下标和数组元素。