当前位置: 动力学知识库 > 问答 > 编程问答 >

Spark-Cassandra very slow when using IN for composite partition key

问题描述:

I have a cassandra Table with a composite partition key as (time_bucket timestamp, node int). time_bucket value is the time the data was inserted with seconds converted to 00 and node values range from 0 to 100

A spark job runs every minute picking up data from the table. The table contains close to 25 million records with records being added every minute.

If my spark job selects all the records every time it runs, the job completes in 2 minutes. But if i query using:

sc.cassandraTable(keyspace_name,table_name).where("time_bucket = ? ", from).where("nodeid_bucket IN ? ", nodeid_bucket_range)

where val nodeid_bucket_range = 0 to 100,

the job takes 10 minutes to complete.

My cluster has 6 nodes and am using DSE 4.8.9. Each executor uses 8 cores and 20GB of memory. Increasing these values does not help to make the spark job faster.

Any idea why my job takes 10 minutes? Does spark-cassandra not function well when using IN clause?

网友答案:

You probably want joinWithCassandraTable. Almost always an In clause is better served by doing a join if you have a large number of values. This will execute all of your requests in parallel on different executors.

https://github.com/datastax/spark-cassandra-connector/blob/master/doc/2_loading.md#using-joinwithcassandratable

网友答案:

IN statements translate into key OR key OR key3 ... OR key100, which is very inefficient for the optimizer to make something useful of. In your case you could use:

sc.cassandraTable(keyspace_name,table_name).where("time_bucket = ? ", from).where("nodeid_bucket > ? AND nodeid_bucket < ? ", nodeid_bucket_range)

Watch the edges of the range and of course this assumes your range is continuous.

分享给朋友:
您可能感兴趣的文章:
随机阅读: