java - Jsoup trying to test javascript link

Question:

I'm using JSoup to parse all the links on a webpage, and I then test the response code of each gathered link.

The issue I'm having is that some of the pages I'm testing have links that open a JavaScript popup instead of pointing at a normal URL. I'm sure there's a simple way to avoid selecting these links, but I can't think of it!

My code:

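    import java.io.IOException;
    import java.util.ArrayList;

    import org.jsoup.Connection;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;

    public class PingUrls {

        private String url;
        private Connection.Response response;
        private boolean success = false;

        // Fetch the page, retrying up to three times on I/O errors.
        PingUrls(String pageUrl) {
            url = pageUrl;
            int i = 0;
            int retries = 3;
            while (i < retries) {
                try {
                    response = Jsoup.connect(url)
                            .userAgent("Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.21 (KHTML, like Gecko) Chrome/19.0.1042.0 Safari/535.21")
                            .timeout(10000)
                            .execute();
                    success = true;
                    break;
                } catch (IOException e) {
                    // Ignored: the loop below simply tries again.
                }
                System.out.println("Attempt " + i);
                i++;
            }
        }

        // Status code of the fetched page; 404 is reported if every attempt failed.
        public int getUrlStatus() {
            if (success) {
                return response.statusCode();
            } else {
                return 404;
            }
        }

        // Collect the absolute href of every link inside the given selector.
        public ArrayList<String> getLinks(String targetValue) {
            ArrayList<String> urls = new ArrayList<String>();
            try {
                Document doc = response.parse();
                Elements elements = doc.select(targetValue + " a[href]");
                for (Element page : elements) {
                    urls.add(page.attr("abs:href"));
                }
                return urls;
            } catch (IOException e) {
                System.out.println(e);
                return null;
            }
        }
    }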

Answer:

First of all, I'd advise using a Set instead of a List. (If you're not familiar with Collections, a Set makes sure that there are no repeated elements.)

Also, I'd pass each URL through a method such as manageURL(String url) before adding it to the Collection. Put some checks in it to make sure the crawler behaves the way you want: for example, check the URL's absolute path and canonical path, and make sure its protocol is http or https.
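
The two suggestions above could look roughly like the sketch below. It is only an illustration under some assumptions: the class name LinkFilter and the methods manageUrl and filterLinks are hypothetical (the answer only names manageURL), and it uses java.net.URI from the JDK rather than any jsoup call, so it can run on its own. A javascript: link has no http or https scheme, so it is rejected along with empty strings, mailto: links and anything that does not parse.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class LinkFilter {

    // Hypothetical helper in the spirit of manageURL(String url):
    // accepts only absolute http/https URLs that have a host.
    static boolean manageUrl(String url) {
        if (url == null || url.trim().isEmpty()) {
            return false; // nothing to test
        }
        try {
            URI uri = new URI(url.trim());
            String scheme = uri.getScheme();
            if (scheme == null) {
                return false; // relative URL, no protocol
            }
            scheme = scheme.toLowerCase(Locale.ROOT);
            boolean web = scheme.equals("http") || scheme.equals("https");
            return web && uri.getHost() != null;
        } catch (URISyntaxException e) {
            return false; // not parseable as a URI at all
        }
    }

    // Keep only acceptable URLs; the Set drops duplicates and
    // LinkedHashSet preserves the order in which links were found.
    static Set<String> filterLinks(List<String> candidates) {
        Set<String> urls = new LinkedHashSet<>();
        for (String candidate : candidates) {
            if (manageUrl(candidate)) {
                urls.add(candidate.trim());
            }
        }
        return urls;
    }

    public static void main(String[] args) {
        // Made-up sample hrefs, for illustration only.
        List<String> found = Arrays.asList(
                "http://example.com/a",
                "http://example.com/a",
                "javascript:void(0)",
                "mailto:someone@example.com",
                "");
        System.out.println(filterLinks(found));
    }
}
```

In the question's getLinks method, the same check could be applied to each page.attr("abs:href") value before it is added, with the ArrayList swapped for a Set.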
