
php - How to preg_match_all only what's in the body

Question:

So I have:

function crawl( $url ){
    $content = @file_get_contents( $url );
    if( $content === FALSE) {
        echo "<br/> Not working " . $url;
        return;
    }
    $content = strtolower( $content );
    preg_match_all( '/http:\/\/[^ "\']+/', $content , $links );
    foreach( $links[0] as $crawled ){
        sleep( 1 );
        crawl( $crawled );
    }
}

I want it to go through the site I give it ($url) and find all the links in it, kind of like a web crawler. It gets through the first site, but it picks up links that don't go anywhere because they're CSS or JS links, or something else that isn't a page. How can I fix it to only get links inside the body tags, or only actual links?

Answer:

Here's a crude way of trimming the content to only what is within the body tags before applying the regex:

$content = strtolower( $content );
// Added code below...
$bodyStartPos = strpos( $content , "<body>" );
$bodyEndPos = strpos( $content , "</body>" );
$content = substr( $content, $bodyStartPos, $bodyEndPos - $bodyStartPos );

There's more detail you could add, such as allowing whitespace or attributes in the tags, adding the length of the opening tag to the start position, ensuring the end tag comes after the start tag, ignoring tags inside quotes, etc. But this should be rough and ready enough to get you started.
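For what it's worth, here is a rough sketch of how a few of those refinements might look. The helper name extractBody is made up for illustration and is not part of the original code. It assumes the opening tag has no ">" inside a quoted attribute value, so the "tags in quotes" case is still not handled. It falls back to the whole document when no opening body tag is found.

```php
<?php
// Hypothetical helper (not from the original answer): returns the markup
// between <body ...> and </body>, or the whole input if no opening tag is found.
function extractBody(string $html): string
{
    // Match the opening tag case-insensitively, allowing whitespace and
    // attributes, e.g. <body class="home"> or <BODY >.
    // Assumption: no ">" appears inside a quoted attribute value.
    if (!preg_match('/<body(?:\s[^>]*)?>/i', $html, $m, PREG_OFFSET_CAPTURE)) {
        return $html;
    }

    // Start just after the opening tag: its offset plus its length.
    $start = $m[0][1] + strlen($m[0][0]);

    // Look for the closing tag only after the opening tag.
    $end = stripos($html, '</body', $start);
    if ($end === false) {
        // No closing tag: keep everything after the opening tag.
        return substr($html, $start);
    }

    return substr($html, $start, $end - $start);
}

// Usage with the same link regex as in the question.
$html = '<html><head><link href="http://example.com/style.css"></head>'
      . '<BODY class="home"><a href="http://example.com/page">x</a></BODY></html>';

$body = extractBody($html);
preg_match_all('/http:\/\/[^ "\']+/', $body, $links);
```

Since the tag matching here is case-insensitive, the strtolower() call is not needed to find the body, and skipping it keeps URLs with case-sensitive paths intact.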
