概述
I am creating a Hive table on Amazon's Elastic MapReduce by using a gzipped JSON encoded file. I am using this JSON SerDe: http://code.google.com/p/hive-json-serde/
The unencoded file looks like this:
{"id":"101", "items":"A:231,234,119,12"}
{"id":"102", "items":"B:13,89,121"}
...
I'd like to create an array of the "items" column for user with a table generating function. The array I want would be the "exploded" CSV of ints ignoring the ":" and the letter before it. I want to be able to GROUP BY the item ints (231,234,etc.) or JOIN them against other tables. Changing around the format of this file would be difficult and all my old data is stored in this format so I'd like to solve this in Hive without making changes to how to store my data.
Is there a way for me to do this?
Edit:
Thanks @mark-grover, this worked!
I did something like this:
SELECT id, item FROM my_table LATERAL VIEW explode(split(substr(items, 3, length(items) - 2), ',')) t1 AS item
to get:
101 231
101 234
101 119
101 12
102 13
102 89
102 121
In doing this I found out I'd like to do a little more. I'd like to also select the position that each item was in in the CSV. So I'd like the output to look like:
101 231 0
101 234 1
101 119 2
101 12 3
102 13 1
102 89 2
102 121 3
Any ideas here?
Edit again:
I came up with something for my followup question doing something like this:
SELECT id, item, find_in_set(item, substr(items, 3, length(items) - 2)) AS position
FROM my_table LATERAL VIEW explode(split(substr(items, 3, length(items) - 2), ',')) t1 AS item
解决方案
You can do that use Hive UDFs.
You would want to use the split UDF to split the string on commas to get an array and then use Lateral view along with explode UDTF to "explode" the contents of the array.
To answer your second question:
This is what I can think of right now. There might be a better/easier/less hackier way to do it.
Create a UDF say split_extended that would return as a struct with 2 elements, first one being the actual contents of the array element, second one being the index. When you explode this the type of exploded column will be a struct. One of the entries of this struct would the actual element from the array, the other would be the index corresponding to the element.
If you think about it, we are augmenting split UDF to return 2 pieces of information - the element and the index. These are several ways of going about returning such information as a single object - structure is one of them, array being another (first element of the array being the actual element from the array, second element being the index). Another hackier way to return 2 pieces of information is to concat them with something (say a ':') that you know doesn't exist in your data so then you can have your UDF return '231:0', '234:0', '119:2', etc. and in the end instead of accessing them members of a struct, you can split the resulting string out based on ':' (using split UDF) to get the desired two pieces.
For creating your own UDF, you will be able to leverage most, if not all of split UDF code available here.
最后
以上就是舒服咖啡豆为你收集整理的hive将list转成string_Hive:将逗号分隔的字符串转换为数组以便生成表生成函数的全部内容,希望文章能够帮你解决hive将list转成string_Hive:将逗号分隔的字符串转换为数组以便生成表生成函数所遇到的程序开发问题。
如果觉得靠谱客网站的内容还不错,欢迎将靠谱客网站推荐给程序员好友。
发表评论 取消回复