html - LibXML C++ XPathEval Errors

Problem description:

For starters, I'm seeing two types of problems with the functionality of my code. First, I can't seem to find the correct element with the function xmlXPathEvalExpression. Second, I am receiving errors similar to:

HTML parser error : Unexpected end tag : a

This happens for what appears to be all tags in the page.

For some background, the HTML is fetched by CURL and fed into the parsing function immediately after. For the sake of debugging, the return statements have been replaced with printf.

    std::string cleanHTMLDoc(std::string &aDoc, std::string &symbolString) {
        std::string ctxtID = "//span[id='" + symbolString + "']";
        htmlDocPtr doc = htmlParseDoc((xmlChar*) aDoc.c_str(), NULL);
        xmlXPathContextPtr context = xmlXPathNewContext(doc);
        xmlXPathObjectPtr result = xmlXPathEvalExpression((xmlChar*) ctxtID.c_str(), context);
        if (xmlXPathNodeSetIsEmpty(result->nodesetval)) {
            xmlXPathFreeObject(result);
            xmlXPathFreeContext(context);
            xmlFreeDoc(doc);
            printf("[ERR] Invalid XPath\n");
            return "";
        }
        else {
            int size = result->nodesetval->nodeNr;
            for (int i = size - 1; i >= 0; --i) {
                printf("[DBG] %s\n", result->nodesetval->nodeTab[i]->name);
            }
            return "";
        }
    }

The parameter aDoc contains the HTML of the page, and symbolString contains the id of the item we're looking for; in this case yfs_l84_aapl. I have verified that this is an element on the page in the style span[id='yfs_l84_aapl'] or <span id="yfs_l84_aapl">.
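One thing worth checking in the expression above: in XPath 1.0, a bare name inside a predicate refers to a child element, so `span[id='...']` looks for a span that has a child element named `id`. Attributes are selected with the `@` prefix, as in `span[@id='...']`. A minimal sketch of building the expression that way follows; `buildSpanIdXPath` is a hypothetical helper name, not part of libxml2, and the quote handling is an assumption added for illustration (XPath 1.0 string literals have no escape syntax).

```cpp
#include <stdexcept>
#include <string>

// Hypothetical helper (not a libxml2 function): builds an XPath 1.0
// expression that matches <span id="..."> through the @id attribute test.
std::string buildSpanIdXPath(const std::string& id) {
    const bool hasSingle = id.find('\'') != std::string::npos;
    const bool hasDouble = id.find('"') != std::string::npos;
    // XPath 1.0 literals cannot escape quotes, so an id containing both
    // quote characters is rejected instead of producing a broken expression.
    if (hasSingle && hasDouble) {
        throw std::invalid_argument("id mixes both quote characters");
    }
    // Delimit the literal with whichever quote character the id does not use.
    const char quote = hasSingle ? '"' : '\'';
    return std::string("//span[@id=") + quote + id + quote + "]";
}
```

For the id in question, this yields `//span[@id='yfs_l84_aapl']`, which can be passed to xmlXPathEvalExpression in place of the original string.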

From what I've read, the errors reported by the HTML parser are due to a lack of a namespace, but when attempting to use the XHTML namespace, I received the same error. When instead using htmlParseChunk to write out the DOM tree, I do not receive these errors, thanks to options such as HTML_PARSE_NOERROR. However, htmlParseDoc does not accept these options.

For the sake of information, I am compiling with Visual Studio 2015 and have successfully compiled and executed programs with this library before. My apologies for the poorly formatted code. I recently switched from writing Java in Eclipse.

Any help would be greatly appreciated!

[Edit]

It's not a pretty answer, but I got what I was trying to do working. Instead of searching the DOM with my (presumably incorrect) XPath expression, I walked down tag by tag to reach the element I needed, and hard-coded the correct index into the nodeTab array of the node set.

The code is as follows:

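    std::string StockIO::cleanHTMLDoc(std::string htmlInput) {
        std::string ctxtID = "/html/body/div/div/div/div/div/div/div/div/span/span";
        xmlChar* xpath = (xmlChar*) ctxtID.c_str();
        htmlDocPtr doc = htmlParseDoc((xmlChar*) htmlInput.c_str(), NULL);
        xmlXPathContextPtr context = xmlXPathNewContext(doc);
        xmlXPathObjectPtr result = xmlXPathEvalExpression(xpath, context);
        std::string text;
        // nodeTab[1] is read below, so require at least two matches.
        if (result == NULL || xmlXPathNodeSetIsEmpty(result->nodesetval)
                || result->nodesetval->nodeNr < 2) {
            printf("[ERR] Invalid XPath\n");
        }
        else {
            xmlNodeSetPtr nodeSet = result->nodesetval;
            xmlNodePtr nodePtr = nodeSet->nodeTab[1];
            // xmlNodeListGetString returns a buffer the caller must release with xmlFree.
            xmlChar* content = xmlNodeListGetString(doc, nodePtr->children, 1);
            if (content != NULL) {
                text = (char*) content;
                xmlFree(content);
            }
        }
        // Release the libxml2 objects on every path, not only on failure.
        xmlXPathFreeObject(result);
        xmlXPathFreeContext(context);
        xmlFreeDoc(doc);
        return text;
    }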

I will leave this question open in hopes that someone will help elaborate upon what I did wrong in setting up my XPath expression.
