
python - Pydoop on Amazon EMR

Problem description:

How would I use Pydoop on Amazon EMR?

I tried googling this topic to no avail: is it at all possible?

Answer:

I finally got this working. Everything happens on the master node, so start by ssh-ing into that node as the user hadoop.
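For reference, the ssh step looks something like this (the key file and the master node's public DNS are placeholders for your own values):

ssh -i ~/mykey.pem hadoop@<master-public-dns>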

You need some packages:

sudo easy_install argparse importlib
sudo apt-get update
sudo apt-get install libboost-python-dev

To build stuff:

wget http://apache.mirrors.pair.com/hadoop/common/hadoop-0.20.205.0/hadoop-0.20.205.0.tar.gz
wget http://sourceforge.net/projects/pydoop/files/Pydoop-0.6/pydoop-0.6.0.tar.gz
tar xvf hadoop-0.20.205.0.tar.gz
tar xvf pydoop-0.6.0.tar.gz

export JAVA_HOME=/usr/lib/jvm/java-6-sun 
export JVM_ARCH=64 # I assume that 32 works for 32-bit systems
export HADOOP_HOME=/home/hadoop
export HADOOP_CPP_SRC=/home/hadoop/hadoop-0.20.205.0/src/c++/
export HADOOP_VERSION=0.20.205
export HDFS_LINK=/home/hadoop/hadoop-0.20.205.0/src/c++/libhdfs/

cd ~/hadoop-0.20.205.0/src/c++/libhdfs
sh ./configure
make
make install
cd ../install
tar cvfz ~/libhdfs.tar.gz lib
sudo tar xvf ~/libhdfs.tar.gz -C /usr
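To sanity-check the libhdfs install, you can verify that the shared library landed under /usr/lib (the path follows from the lib directory extracted into /usr above):

ls /usr/lib/libhdfs*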

cd ~/pydoop-0.6.0
python setup.py bdist
cp dist/pydoop-0.6.0.linux-x86_64.tar.gz ~/
sudo tar xvf ~/pydoop-0.6.0.linux-x86_64.tar.gz -C /
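A quick check that pydoop is importable before moving on (Python 2 syntax, as used by pydoop 0.6):

python -c 'import pydoop.pipes; print "pydoop OK"'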

Save the two tarballs; in the future you can skip the build part and simply run the following two commands to install (I still need to figure out how to turn this into a bootstrap action for installing on multi-node clusters; a sketch follows the commands below):

sudo tar xvf ~/libhdfs.tar.gz -C /usr
sudo tar xvf ~/pydoop-0.6.0.linux-x86_64.tar.gz -C /
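A sketch of what such a bootstrap script might look like (untested; it assumes you have uploaded both tarballs to your own bucket and that the hadoop client is already available when bootstrap actions run):

#!/bin/bash
# Hypothetical bucket; replace <my bucket> with your own bucket name.
hadoop fs -get s3://<my bucket>/libhdfs.tar.gz /tmp/libhdfs.tar.gz
hadoop fs -get s3://<my bucket>/pydoop-0.6.0.linux-x86_64.tar.gz /tmp/pydoop.tar.gz
sudo tar xf /tmp/libhdfs.tar.gz -C /usr
sudo tar xf /tmp/pydoop.tar.gz -C /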

I was then able to run the example program that uses the full-fledged Hadoop API (after fixing a bug in its constructor so that it calls super(WordCountMapper, self).__init__(context)).

#!/usr/bin/python

import pydoop.pipes as pp

class WordCountMapper(pp.Mapper):

  def __init__(self, context):
    super(WordCountMapper, self).__init__(context)
    context.setStatus("initializing")
    # Custom counter tracking the total number of input words.
    self.input_words = context.getCounter("WORDCOUNT", "INPUT_WORDS")

  def map(self, context):
    # Emit each word with a count of "1"; the reducer sums these up.
    words = context.getInputValue().split()
    for w in words:
      context.emit(w, "1")
    context.incrementCounter(self.input_words, len(words))

class WordCountReducer(pp.Reducer):

  def reduce(self, context):
    # Sum the "1" values emitted by the mappers for this key.
    s = 0
    while context.nextValue():
      s += int(context.getInputValue())
    context.emit(context.getInputKey(), str(s))

pp.runTask(pp.Factory(WordCountMapper, WordCountReducer))

I uploaded that program to a bucket and called it run.
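A sketch of the upload from the master node, assuming the cluster's hadoop client is configured for s3:// URIs (it is on EMR):

hadoop fs -put run s3://<my bucket>/run

I then used the following conf.xml: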

<?xml version="1.0"?>
<configuration>

<property>
  <name>hadoop.pipes.executable</name>
  <value>s3://<my bucket>/run</value>
</property>

<property>
  <name>mapred.job.name</name>
  <value>myjobname</value>
</property>

<property>
  <name>hadoop.pipes.java.recordreader</name>
  <value>true</value>
</property>

<property>
  <name>hadoop.pipes.java.recordwriter</name>
  <value>true</value>
</property>

</configuration>

Finally, I used the following command line:

hadoop pipes -conf conf.xml -input s3://elasticmapreduce/samples/wordcount/input -output s3://tmp.nou/asdf
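If the job succeeds, the word counts end up as part files under the output path. A quick way to inspect them (the part file name below is an assumption):

hadoop fs -ls s3://tmp.nou/asdf
hadoop fs -cat s3://tmp.nou/asdf/part-00000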
Answer:

The answer given above is only partially correct, and there is a much simpler solution:

Copy this code into a bash file that you create on your computer:

bootstrap.sh:

#!/bin/bash
pip install pydoop

After you finish writing this file, upload it to an S3 bucket.

Then, when launching your EMR cluster, add a bootstrap action: choose "Custom action" and give the path to the script in your S3 bucket. And that's it: Pydoop is installed on your EMR cluster.
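For scripted cluster creation, the same bootstrap action can be passed to the AWS CLI; a minimal sketch (the name and instance settings are placeholders):

aws emr create-cluster --name pydoop-cluster \
  --ami-version 3.11.0 \
  --instance-type m1.large --instance-count 3 \
  --bootstrap-actions Path=s3://<your bucket>/bootstrap.sh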
