I got a large conversion job- 299Gb of JPEG images, already in the database, into thumbnail equivalents for reporting and bandwidth purposes.
I've written a thread safe SQLCLR function to do the business of re-sampling the images, lovely job.
Problem is, when I execute it in an
UPDATE statement (from the PhotoData field to the ThumbData field), this executes linearly to prevent race conditions, using only one processor to resample the images.
So, how would I best utilise the 12 cores and phat raid setup this database machine has? Is it to use a subquery in the
FROM clause of the update statement? Is this all that is required to enable parallelism on this kind of operation?
Anyway the operation is split into batches, around 4000 images per batch (in a windowed query of about 391k images), this machine has plenty of resources to burn.
I too recommend the "round-robin" methodology advocated by kragen2uk and onupdatecascade (I'm voting them up). I know I've read something irritating about CLR routines and SQL paralellism, but I forget what it was just now... but I think they don't play well together.
The bit I've done in the past on similar tasks it to set up a table listing each batch of work to be done. For each connection you fire up, it goes to this table, gest the next batch, marks it as being processed, processes it, updates it as Done, and repeats. This allows you to gauge performance, manage scaling, allow stops and restarts without having to start over, and gives you something to show how complete the task is (let alone show that it's actually doing anything).
Please check the configuration setting for Maximum Degree of Parallelism (MAXDOP) on your SQL Server. You can also set the value of MAXDOP.
This link might be useful to you http://www.mssqltips.com/tip.asp?tip=1047
Could you not split the query into batches, and execute each batch separately on a separate connection? SQL server only uses parallelism in a query when it feels like it, and although you can stop it, or even encourage it (a little) by changing the cost threshold for parallelism option to O, but I think its pretty hit and miss.
One thing thats worth noting is that it will only decide whether or not to use parallelism at the time that the query is compiled. Also, if the query is compiled at a time when the CPU load is higher, SQL server is less likely to consider parallelism.
Find some criteria to break the set into distinct sub-sets of rows (1-100, 101-200, whatever) and then call your update statement from multiple connections at the same time, where each connection handles one subset of rows in the table. All the connections should run in parallel.