I am fairly new to Apache Spark. I have been using it for several months, but this is my first project that uses it.
I use Spark to compute dynamic reports from data, stored in a NoSQL database (Cassandra). So far I have created several reports and they are computed correctly. Inside them I use
I am running a 1.4.1 Spark cluster on my local machine with the following setup:
I have also populated the database with test data which is around 10-12k records per table.
By using the driver's web UI (http://localhost:4040/), I have noticed that the jobs are taking 40s-50s to execute, so lately I have been researching ways to tune Apache Spark and the jobs.
I have configured Spark to use the
KryoSerializer, I have set the
lzf, I have optimized the jobs as much as I can and as much as my knowledge allows me to.
This led to the jobs taking 20s-30s to compute (which I think is a good improvement). The problem is that because this is my first Spark project, I have no baseline to compare the jobs times, so I have no idea if the execution is slow or fast and whether there is some problem in the code or with the Spark config.
What is the best way to proceed? Is there a graph or benchmark that shows how much time an action with N data should take?
You have to use hive . On top of hive you can put spark . After doing this create temp table in hive for Cassandra table you can perform all type of aggregation and filtering . After doing this you can use hive jdbc connection to get result . It will give fast result .