When writing a Hadoop streaming task, I used -archives to upload a tgz from the local machine to the HDFS task working directory, but it has not been untarred as the documentation says it should be. I've searched a lot without any luck.
Here is the command that starts the Hadoop streaming task with hadoop-2.5.2; it is very simple:
hadoop jar /opt/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.5.2.jar \
-archives /home/hadoop/tmp/test.tgz#test \
-D mapreduce.job.maps=1 \
-D mapreduce.job.reduces=1 \
-input "/test/test.txt" \
-output "/res/" \
-mapper "sh mapper.sh"

where mapper.sh is simply:

cat > /dev/null
ls -l test
in "test.tgz" there is two files "test.1.txt" and "test.2.txt"
echo "abcd" > test.1.txt
echo "efgh" > test.2.txt
tar zcvf test.tgz test.1.txt test.2.txt
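For reference (this check is not part of the original job), you can list the archive members before shipping it to confirm both files are inside:

tar -tzf test.tgz
# test.1.txt
# test.2.txt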
The output from the above task is:
lrwxrwxrwx 1 hadoop hadoop 71 Feb 8 23:25 test -> /tmp/hadoop-hadoop/nm-local-dir/usercache/hadoop/filecache/116/test.tgz
But what I expected is something like this:
-rw-r--r-- 1 hadoop hadoop 5 Feb 8 23:25 test.1.txt
-rw-r--r-- 1 hadoop hadoop 5 Feb 8 23:25 test.2.txt
So why hasn't test.tgz been untarred automatically as the documentation says, and is there any other way to get the tgz untarred?
Any help please, thanks.
My mistake. After submitting an issue to hadoop.apache.org, I was told that Hadoop has in fact already untarred test.tgz.
Although the name is still test.tgz, it is actually the directory produced by the untarring, so the files can be read like "cat test/test.1.txt".
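A quick way to see this from inside the task (a hypothetical extra check, not part of the original mapper.sh) is to make ls dereference the symlink instead of showing the link itself:

ls -lL test    # follow the symlink: lists test.1.txt and test.2.txt
ls -l test/    # same effect: the trailing slash forces the target directory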
This will untar it:
tar -zxvf test.tgz
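Putting it together, a minimal mapper.sh sketch (assuming the same -archives /home/hadoop/tmp/test.tgz#test option; the cat lines are an illustration, not from the original post) that reads the extracted files through the symlink:

#!/bin/sh
# Drain the streaming input; this job only inspects the distributed archive.
cat > /dev/null
# "test" is a symlink to the directory where the NodeManager unpacked test.tgz,
# so the extracted files are reachable through it.
cat test/test.1.txt    # prints: abcd
cat test/test.2.txt    # prints: efgh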