
java - Find the text region that contains the article content in HTML

Question:

Recently I have wanted to extract information from HTML source using Java. The basic need is to get the main content area of the HTML.

For example, here is some HTML source:

<html>

<head>

<title>

Chinese characters --中文

</title>

</head>

<body>

<div>

this is some area including Chinese characters, like a menu, which I don't need.

</div>

<div>

this is some area including Chinese characters, like ads, which I don't need.

</div>

<div>

this is the main content, including the content I need. Almost all of it consists of Chinese characters, like: 好好学习,天天向上。 我爱stackoverflow.谢谢你的帮助,非常感谢!

</div>

<div>

this is the footer area, also including Chinese characters, but I don't need it.

</div>

</body>

</html>

This HTML source is a simple one; real pages are varied and far more complex. I want to use Java to extract the div (or other element) that contains the main content. The result I want is:

<div>

This is the main content, including the content I need. Almost all of it consists of Chinese characters, like: 好好学习,天天向上。 我爱stackoverflow.谢谢你的帮助,非常感谢!

</div>

There are tens of thousands of divs with different content, and the div ids are unknown or differ from one to the next. The divs also vary in structure; some contain p tags, for example. Is there a way to use the presence or distribution of Chinese characters to locate the main content?
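
As a rough illustration of what judging by the distribution of Chinese characters could look like, the sketch below counts the Han-script characters in each div and keeps the div with the highest count. It is only one possible heuristic, not an established method. The names `CjkDensityDemo`, `countHan`, and `richestDiv` are made up for this example, and the regex-based div extraction assumes the divs are not nested; real pages would need a proper HTML parser.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CjkDensityDemo {

    // Matches <div>...</div> blocks. Assumption: divs are NOT nested,
    // which holds for the sample page but not for real-world HTML.
    private static final Pattern DIV = Pattern.compile(
            "<div[^>]*>(.*?)</div>", Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    /** Counts the code points that belong to the Han script. */
    static int countHan(String text) {
        return (int) text.codePoints()
                .filter(cp -> Character.UnicodeScript.of(cp) == Character.UnicodeScript.HAN)
                .count();
    }

    /** Returns the text of the div with the most Han characters, or null if no div is found. */
    static String richestDiv(String html) {
        Matcher m = DIV.matcher(html);
        String best = null;
        int bestCount = -1;
        while (m.find()) {
            // Drop inner tags such as <p> so only the visible text is counted.
            String text = m.group(1).replaceAll("<[^>]+>", "").trim();
            int count = countHan(text);
            if (count > bestCount) {
                bestCount = count;
                best = text;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for the sample page above.
        String html = "<html><body>"
                + "<div>menu 中文</div>"
                + "<div>ads 广告</div>"
                + "<div><p>main content: 好好学习,天天向上。谢谢你的帮助,非常感谢!</p></div>"
                + "<div>footer 页脚</div>"
                + "</body></html>";
        System.out.println(richestDiv(html));
    }
}
```

On this stand-in page the third div wins because it holds far more Han characters than the menu, ad, or footer divs. A raw count favors long blocks, so a ratio of Han characters to total text length, or to the number of links, may separate menus from articles better.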

Answer:

I can't say I'm that confident I understand the question, but it seems like you want to scrape a certain div in an HTML page via Java?

I had to do this to scrape some data from a legacy system to test a new one. Have a look at http://htmlunit.sourceforge.net/ . Basically it lets you hit the page you want as if you were in a browser (so even if you would normally have to fill out a form to get to that page, you can do it), then scrape the contents of different parts of the page in a bunch of different ways. You can get a collection of all the divs and pick the third one, for instance, or pick the div with the right CSS class, or just use XPath.
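
To show what "pick the third div" and "just use XPath" could mean in practice, here is a minimal sketch that uses the XML and XPath classes bundled with the JDK instead of HtmlUnit, so it runs without extra dependencies. It assumes the markup is well-formed XML, which real HTML often is not (HtmlUnit tolerates messy pages). The class `XPathDivDemo`, the method `select`, and the `class="main"` attribute are hypothetical; the sample page in the question has no such attribute.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class XPathDivDemo {

    /**
     * Evaluates an XPath expression against well-formed markup and
     * returns the trimmed text of the first matching node.
     */
    static String select(String xhtml, String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            return xpath.evaluate(expression, doc).trim();
        } catch (Exception e) {
            // Parsing or XPath errors are rethrown unchecked to keep the demo short.
            throw new IllegalArgumentException("Could not evaluate " + expression, e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical, simplified version of the page from the question.
        String page = "<html><body>"
                + "<div>menu</div>"
                + "<div>ads</div>"
                + "<div class=\"main\">main content</div>"
                + "<div>footer</div>"
                + "</body></html>";

        // Pick the third div by position.
        System.out.println(select(page, "/html/body/div[3]"));
        // Pick the div by its (assumed) CSS class.
        System.out.println(select(page, "//div[@class='main']"));
    }
}
```

Both expressions select the same div here. Selecting by position only works when every page has the same layout, which is why it does not fully solve the case in the question, where the position and id of the content div are unknown.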

Answer:

I can't say that I know for certain what you're going for, but one good place to start would probably be Apache's HttpComponents package. There are a lot of tools there for making HTTP requests and getting the data back in a string buffer (which is what I think you're going for).

Check it out here:

http://hc.apache.org/httpcomponents-client-ga/tutorial/html/fundamentals.html#d5e43

Also, on the HTTPComponents main page, there are Chinese translations of most of the tutorials--you know, if that's something that would be useful to you :D

http://hc.apache.org/
