When using a Perl script as the mapper and reducer in Hadoop Streaming, how can we manage Perl module dependencies?
I want to use Net::RabbitMQ in my Perl mapper and reducer scripts.
Is there a standard way in Perl/Hadoop Streaming to handle dependencies, similar to the DistributedCache for Hadoop Java MapReduce?
There are a couple of ways to handle dependencies including specifying a custom library path or creating a packed binary of your Perl application with PAR::Packer. There are some examples of how to accomplish these tasks in the Examples section of the Hadoop::Streaming POD, and the author includes a good description of the process, as well as some considerations for the different ways to handle dependencies. Note that the suggestions provided in the Hadoop::Streaming documentation about handling Perl dependencies are not specific to that module.
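To make the custom-library-path option concrete: a streaming mapper in Perl is just a filter that reads records from STDIN and writes tab-separated key/value pairs to STDOUT, and it can add a shared module directory to @INC before loading anything non-core. This is only a sketch — the /apps/perl5 path, the RabbitMQ host, credentials, exchange, and routing key are all placeholders, and the Net::RabbitMQ calls follow that module's documented interface:

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# If a module directory has been pushed out to a fixed location on every
# cluster node (or shipped alongside the job), put it on @INC before
# loading non-core modules. The path here is an example, not a convention.
use lib '/apps/perl5/lib/perl5';
use Net::RabbitMQ;

my $mq = Net::RabbitMQ->new();
$mq->connect( 'rabbit.example.com', { user => 'guest', password => 'guest' } );
$mq->channel_open(1);

# Hadoop Streaming feeds one input record per line on STDIN and expects
# "key\tvalue" pairs on STDOUT.
while ( my $line = <STDIN> ) {
    chomp $line;
    $mq->publish( 1, 'my.routing.key', $line, { exchange => 'logs' } );
    print "$line\t1\n";
}

$mq->disconnect();
```

The same pattern applies to the reducer; the only Hadoop-specific part is the STDIN/STDOUT contract, so the dependency handling is entirely on the Perl side.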
Here is an excerpt from the documentation for Hadoop::Streaming (there are detailed examples therein, as previously mentioned):
All perl modules must be installed on each hadoop cluster machine. This proves to be a challenge for large installations. I have a local::lib controlled perl directory that I push out to a fixed location on all of my hadoop boxes (/apps/perl5) that is kept up-to-date and included in my system image. Previously I was producing stand-alone perl files with PAR::Packer (pp), which worked quite well except for the size of the jar with the -file option. The standalone files can be put into hdfs and then included with the jar via the -cacheFile option.
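The local::lib approach described in that excerpt can be wired directly into the scripts; a minimal sketch, assuming the same /apps/perl5 location the author uses:

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Bootstrap local::lib against the directory pushed out to every cluster
# node, as described in the excerpt; /apps/perl5 is that author's choice
# of path, not a requirement.
use local::lib '/apps/perl5';

# Modules installed under /apps/perl5 now resolve like normal installs.
use Net::RabbitMQ;
```

With the PAR::Packer route instead, the pp-built standalone executable carries its dependencies internally, so no use lib/local::lib bootstrapping is needed in the script; as the excerpt notes, the standalone file can live in HDFS and be attached to the job with -cacheFile rather than shipped in the jar with -file.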