I am working on a small crawling engine and am using curl to request pages from various websites. Question is what do suggest should I set my connection_timeout and timeout values to? Stuff I would normally be crawling would be pages with lots of images and text.
cURL knows two different timeouts.
CURLOPT_CONNECTTIMEOUT it doesn't matter how much text the site contains or how many other resources like images it references because this is a connection timeout and even the server cannot know about the size of the requested page until the connection is established.
CURLOPT_TIMEOUT it does matter. Even large pages require only a few packets on the wire, but the server may need more time to assemble the output. Also the number of redirects and other things (e.g. proxies) can significantly increase response time.
Generally speaking the "best value" for timeouts depends on your requirements and conditions of the networks and servers. Those conditions are subject of change. Therefore there is no "one best value".
I recommend to use rather short timeouts and retry failed downloads later.
Btw cURL does not automatically download resources referenced in the response. You have to do this manually with further calls to
curl_exec (with fresh timeouts).
The best response is the rik's one.
I have a Proxy Checker and in my benchmarks I saw that most of working Proxies takes less than 10 seconds to connect.
So I use 10 seconds for ConnectionTimeOut and TimeOut but that's in my case, you have to decide how many time you want to use so start with big values, use curl_getinfo to see time benchmarks and decrease the value.
Note: A proxy that takes more than 5 or 10 seconds to connect is useless for me, that's why I use that values.
Yes. If your target is a proxy to query another site, such a cascading connection will require fairly long period like these values to execute the curl calls.
Especially when you encountered intermittent curl problems, please check these values first.
If you set it too high then your script will be slow as a one url that is down will take all the time you set in CURLOPT_TIMEOUT to finish processing. If you are not using proxies then you can just set the following values
CURLOPT_TIMEOUT = 3 CURLOPT_CONNECTTIMEOUT = 1
Then you can go through failed urls at a later time to double check on them.
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT,30); curl_setopt($ch, CURLOPT_TIMEOUT,60);