
Baseline for measuring Apache Spark jobs execution times

Question:

I am fairly new to Apache Spark. I have been using it for several months, but this is my first project that uses it.

I use Spark to compute dynamic reports from data stored in a NoSQL database (Cassandra). So far I have created several reports and they are computed correctly. Inside them I use the DataFrame API: .unionAll(), .join(), .count(), .map(), etc.
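For context, such a report looks roughly like the sketch below. This is a minimal sketch only, assuming the Spark Cassandra connector's DataFrame source is on the classpath; the keyspace, table, and column names ("reports", "orders_2014", "customers", "customer_id", "id") are placeholders, not the real schema.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Minimal sketch of one such report over Cassandra-backed DataFrames
val sc = new SparkContext(new SparkConf().setAppName("ReportJob"))
val sqlContext = new SQLContext(sc)

def cassandraTable(name: String) = sqlContext.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "reports", "table" -> name))
  .load()

val orders2014 = cassandraTable("orders_2014")
val orders2015 = cassandraTable("orders_2015")
val customers  = cassandraTable("customers")

// unionAll + join + count, as mentioned above
val allOrders = orders2014.unionAll(orders2015)
val joined    = allOrders.join(customers, allOrders("customer_id") === customers("id"))
val total     = joined.count()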

I am running a Spark 1.4.1 cluster on my local machine with the following setup:

export SPARK_WORKER_INSTANCES=6
export SPARK_WORKER_CORES=8
export SPARK_WORKER_MEMORY=1g

I have also populated the database with test data, around 10-12k records per table.

By using the driver's web UI (http://localhost:4040/), I have noticed that the jobs are taking 40s-50s to execute, so lately I have been researching ways to tune Apache Spark and the jobs.

I have configured Spark to use the KryoSerializer, I have set spark.io.compression.codec to lzf, and I have optimized the jobs as much as my knowledge allows.
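For reference, those two settings expressed on a SparkConf (they could equally go into conf/spark-defaults.conf); the property keys are the standard Spark ones, the application name is a placeholder:

import org.apache.spark.SparkConf

// Serializer and compression codec settings mentioned above
val tunedConf = new SparkConf()
  .setAppName("ReportJob")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.io.compression.codec", "lzf")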

This brought the jobs down to 20s-30s (which I think is a good improvement). The problem is that, because this is my first Spark project, I have no baseline to compare the job times against, so I have no idea whether the execution is slow or fast and whether the problem lies in my code or in the Spark configuration.

What is the best way to proceed? Is there a graph or benchmark that shows how much time an action on N records should take?
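One way to build your own baseline, absent a published benchmark, is to time individual actions in isolation and compare them against a trivial job over the same data on the same cluster. A rough sketch only: the timed helper is made up, and orders2014 / joined are the placeholder DataFrames from the sketch above.

// Crude wall-clock timing around a Spark action; repeat each measurement a
// few times and ignore the first run, which pays JVM and class-loading costs.
def timed[T](label: String)(block: => T): T = {
  val start = System.nanoTime()
  val result = block
  println(s"$label took ${(System.nanoTime() - start) / 1000000} ms")
  result
}

timed("plain count")  { orders2014.count() }  // trivial action as a reference point
timed("full report")  { joined.count() }      // the report pipeline being tuned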

Answer:

You could use Hive, with Spark running on top of it. Create a temporary table in Hive for the Cassandra table; you can then perform all kinds of aggregation and filtering on it. After that, use a Hive JDBC connection to fetch the results, which should come back quickly.
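A rough sketch of what this answer seems to suggest, assuming a HiveContext and a temp table registered over the Cassandra-backed DataFrame; the table and column names are placeholders, and serving results over JDBC would additionally require running Spark's Thrift server:

import org.apache.spark.sql.hive.HiveContext

// Register the Cassandra-backed DataFrame as a temporary table and run the
// aggregation/filtering through SQL; "orders" and its columns are placeholders.
val hiveContext = new HiveContext(sc)

val orders = hiveContext.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "reports", "table" -> "orders_2014"))
  .load()

orders.registerTempTable("orders")

val summary = hiveContext.sql(
  "SELECT customer_id, COUNT(*) AS n FROM orders WHERE total > 100 GROUP BY customer_id")
summary.show()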
