I have created an Azure HDInsight cluster using PowerShell. Now I need to install some custom software on the worker nodes that is required for the mappers I will be running using Hadoop streaming. I haven't found any PowerShell command that could help me with this task. I can prepare a custom job that will setup all the workers, but I'm not convinced that this is the best solution. Are there better options?
With AWS Elastic MapReduce there is an option to install additional software in a bootstrap action that is defined when you create a cluster. I was looking for something similar.
You can use a bootstrap action to install additional software and to change the configuration of applications on the cluster. Bootstrap actions are scripts that are run on the cluster nodes when Amazon EMR launches the cluster. They run before Hadoop starts and before the node begins processing data.
from: Create Bootstrap Actions to Install Additional Software
The short answer is that you don't. It's not ideal from a caching perspective, but you ought to be able to bundle all your job dependencies into the map reduce jar which is distributed across the cluster for you by YARN (part of Hadoop). This is broadly speaking transparent to the end user, as it's all handled through the job submission process.
If you need something large which is a shared dependency across many jobs, and you don't want it copied out every time, you can keep it on wasb:// storage, and reference that in a class path, but that might cause you complexity if you are for instance using the .NET Streaming API.
I've just heard from a collage that I need to update my Azure PS because recently a new Cmdlet
Add-AzureHDInsightScriptAction was added and it does just that.
Customize HDInsight clusters using Script Action