当前位置: 动力学知识库 > 问答 > 编程问答 >

Spark Job Speed Relational To SQL Server Size

问题描述:

My setup:

  • I have a MS SQL server running Azure (V12)
  • I have a Spark cluster running in Azure HDInsights
  • Spark.Version: 1.6.2
  • Scala.Version:2.10.5

I am pulling in my data (about 5 million rows) from SQL server, via the jdbc driver, and then updating the column names. Then I perform my logic.

Here is how I read from the server:

println("SQL Load Start: " + TimeStamp.getCurrentTime.toDateString)

val options: Map[String, String] = Map("url" -> connectionString,

"dbtable" -> ("(SELECT * FROM Data)" + "Data"))

val data = sQLContext.read.format("jdbc").options(options).load()

println("DF Count: " + data.count().toString)

println("Partition Count: " + data.rdd.partitions.length.toString)

println("SQL Load End: " + TimeStamp.getCurrentTime.toDateString)

When I scale the SQL server up, my data is read in quickly and my logic runs in about 10 seconds.

When the SQL server is small, my data is read in at a slower pace (expected), but my logic also takes much longer (UNEXPECTED) taking about 3 - 4 minutes.

The cluster size is not changing.

The number of partitions in the DataFrame does not appear to change. I am only changing the size of the SQL server.

Does anyone have any ideas on why the logic portion of my Spark job would fluctuate based on the size of my SQL server?

网友答案:

I figured this out. One of my teammates reminded me about something I learned in training. If you don't cache the data after retrieving it from MS SQL, it will go and retrieve the data again when you work on it.

The updated could looks like the following:

println("SQL Load Start: " + TimeStamp.getCurrentTime.toDateString)
val options: Map[String, String] = Map("url" -> connectionString,
    "dbtable" -> ("(SELECT * FROM Data)" + "Data"))
val data = sQLContext.read.format("jdbc").options(options).load()

data.cache()   // NEW CODE

println("DF Count: " + data.count().toString)
println("Partition Count: " + data.rdd.partitions.length.toString)
println("SQL Load End: " + TimeStamp.getCurrentTime.toDateString)
分享给朋友:
您可能感兴趣的文章:
随机阅读: