
Jsoup isn't detecting text in quotes

Question:

I'm using Jsoup to parse a website. I am parsing this table cell:

<td class="tl">
<script> document.write(Icons.GetShortDescription(1, 'CurrentWeather'));</script>
"Despejado"<span class="details">
</span>
</td>

Jsoup could not detect the text "Despejado". Here's the relevant code:

String url = "http://freemeteo.ar.com/eltiempo/mendoza/historia/historial-diario/?gid=3844421&date=2010-07-02&station=23812&language=spanishar&country=argentina";

Document doc = Jsoup.connect(url).get();

Elements lineks = doc.select("table.daily-history");

for (Element linek : lineks) {
    Elements datos = linek.select("tbody");

    for (Element dato : datos) {
        Elements datos5 = dato.select("td.tl");
        System.out.println("code class:" + datos5.html());
    }
}

The output is:

code class: <script>
document.write(Icons.GetShortDescription(1, 'CurrentWeather'));
</script><span class="details"> </span>

Jsoup does not read "Despejado". What is the problem?

  • Is it a bug in Jsoup?
  • Is the problem with the website?

Please help me understand how to read the text "Despejado".

Answer:

Okay, I got it.

Jsoup can't get "Despejado" because the text doesn't exist in the page's HTML until the JavaScript puts it there. Jsoup only sees the HTML the server sends, so there's nothing for it to select. Jsoup is an HTML parser; it does not run JavaScript. But I think I figured out a workaround.

The scripts are declared at the top of the page, and if you look there you'll find the one that supplies "Despejado" and the other descriptions:

<script type="text/javascript" src="/Services/IconDescriptions/Index/37/g.js"></script>

If you open that script, you'll see it's a huge file. Here's part of it:

var Icons = {
    "Forecast":{
        "1":{"Description":"Buen tiempo","ShortDescription":"Despejado"},
        "2":{"Description":"Pocas nubes","ShortDescription":"Pocas nubes"},
        "3":{"Description":"Cielos parcialmente cubiertos","ShortDescription":"Parcialmente cubierto"},
        "4":{"Description":"Cielos cubiertos","ShortDescription":"Cubierto"},

... and like 150 more

Knowing this, you can pull the number out of the script call like so:

Elements elements = doc.select("table.daily-history tbody td.tl script");

for (Element element : elements) {

    // here's what you had
    System.out.println("code class: " + element.html());

    // get the script text as a string, trimmed so stray whitespace can't shift the indexes
    String numberString = element.html().trim();

    // isolate the number: it sits between the last "(" and the "," before the last space
    numberString = numberString.substring(numberString.lastIndexOf("(") + 1, numberString.lastIndexOf(" ") - 1);

    // parse to int
    int number = Integer.parseInt(numberString);
    System.out.println("number: " + number);
}

I kept the intermediate String steps in there to make it easier to follow. Here's the output:

code class: document.write(Icons.GetShortDescription(1, 'CurrentWeather'));
number: 1
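
The substring arithmetic depends on the exact spacing of the call, so a regular expression may be a sturdier way to pull out the same number. What follows is only a sketch using the standard library: the class name IconCallParser and its two methods are made up for illustration, and the pattern assumes every call looks like the one shown above (a number, a comma, then a single-quoted name).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Jsoup or of the website's code.
public class IconCallParser {

    // Assumed call shape: GetShortDescription(<digits>, '<name>')
    private static final Pattern CALL = Pattern.compile(
            "GetShortDescription\\(\\s*(\\d+)\\s*,\\s*'([^']*)'\\s*\\)");

    /** Returns the numeric id from the script text, or -1 if no call is found. */
    public static int extractId(String scriptText) {
        Matcher m = CALL.matcher(scriptText);
        return m.find() ? Integer.parseInt(m.group(1)) : -1;
    }

    /** Returns the quoted name (e.g. CurrentWeather), or null if no call is found. */
    public static String extractCategory(String scriptText) {
        Matcher m = CALL.matcher(scriptText);
        return m.find() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        // Same script text as in the question, with stray whitespace left in
        String script = " document.write(Icons.GetShortDescription(1, 'CurrentWeather'));\n";
        System.out.println("number: " + extractId(script));
        System.out.println("category: " + extractCategory(script));
    }
}
```

You would pass element.html() from the loop above as the scriptText argument. Returning -1 when nothing matches lets you skip cells whose script has a different shape instead of crashing on them.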

Now you can take the number, which is 1 here, and cross-reference it against the JavaScript file to get the "ShortDescription", which is "Despejado". I checked a few other dates on the calendar with different conditions and it works.
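
One way to do the cross-referencing in Java is a small lookup table. This is a rough sketch, not the site's real data set: the class name ShortDescriptions is made up, and the map holds only the four entries visible in the excerpt above, which come from the "Forecast" section. The page's call passes 'CurrentWeather', and I'm assuming the numbering is the same there; for real use you would fill the map from the full script file.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical lookup table built by hand from the excerpt of the script file.
public class ShortDescriptions {

    private static final Map<Integer, String> BY_ID = new HashMap<>();

    static {
        // Only the four entries shown in the excerpt; the real file has many more.
        BY_ID.put(1, "Despejado");
        BY_ID.put(2, "Pocas nubes");
        BY_ID.put(3, "Parcialmente cubierto");
        BY_ID.put(4, "Cubierto");
    }

    /** Returns the short description for the id, or null if the id is not in the table. */
    public static String lookup(int id) {
        return BY_ID.get(id);
    }

    public static void main(String[] args) {
        int number = 1; // the value parsed from the script call
        System.out.println("short description: " + lookup(number));
    }
}
```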

I wish there were an easier way, but this will work. If you can find a text-only version of the website, that should make things easier. Websites sometimes offer simpler versions for blind users who rely on screen readers. Good luck!
