id col1 col2 ... coln
foo barA barB ...
foo barD barX
boo barA barC
foo barC barC
I'd like to combine this into 'collapsed' rows which look like this:
foo barA;barD;barC barB;barX;barC
boo barD barC
At the moment the source document is a hive 'table', [which is essentially the same as a flat text file I suppose] - and I am wondering what is the most efficient way to accomplish this?
EDIT: related earlier question (for SQL, alas not hive) Combine multiple rows into one space separated string
If you are loading the data into hive from a mapreduce job you may be able to adjust that MR to do the transformation on the data for you, and load it into the table as you want it (array or ; delimited, ect)
If you are looking to be able to update/adjust the data, HIVE probably isn't the best option for that. You may want to look at HBase and do an 'aggregation' to generate the data as you want it to load into HBase. Anytime the same Key/ColumnFamily/Column value is generated, it overwrites it if it exists so it will 'update' the value. I use this in production to generate data throughout the day that constantly updates.
In either case, to manipulate the structure of large quantities of data, you will probably want to use a mapreduce job and have it do the restructuring for you.