XmlParse and HTMLParse problem


#1

I installed Lucee 5.3.1 (5.3.1.74 and 5.3.1.15-BETA) the weekend of Oct 6th, and I’m having trouble with the XmlParse() and htmlParse() functions. In both cases, isXml() returns true, but XmlSearch() returns an empty array. While trying to isolate the source of the problem, I found that if just the xmlns attribute is removed from the <html> tag, XmlSearch() returns a non-empty array.

Help…

Code shown below is my cfm page to test/debug this issue.

<!doctype html>
<html>
<head>
<meta charset="utf-8">
<title>XMLParse issue</title>
</head>

<body>
    <cfscript>
        // Test document: the xmlns attribute on <html> is what triggers the problem
        xml_stream = '
			<?xml version="1.0" encoding="UTF-8" standalone="no"?><html xmlns="http://www.w3.org/1999/xhtml" xmlns:html="http://www.w3.org/1999/xhtml"><head><meta charset="utf-8"/><title>Global Perspectives</title></head><body> <h1>Chapter 5: Resources</h1> <h2>Overview</h2> </body></html>
';
        xml_document = XmlParse(xml_stream);

        writeOutput('isXml: ' & isXml(xml_document) & '<br />');
        dump(XmlSearch(xml_document, "/html")); // root node
    </cfscript>
</body>
</html>

#2

Just as a follow-up, here’s a screenshot:

The empty array at the bottom is the result of the function call

dump(XmlSearch(xml_document,"/html"));

If I remove the attribute xmlns from xml_stream, then the dump displays an array with one element.
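
This behaviour is consistent with how XPath 1.0 treats namespaces: an unprefixed name such as html in a path only matches elements that are in no namespace, while the xmlns attribute puts <html> and everything below it into the XHTML namespace. Whether that is what Lucee 5.3.1 does internally is an assumption. The sketch below uses Python's standard xml.etree.ElementTree (not CFML) purely to show the same effect on a trimmed copy of the test document; all variable names are illustrative.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Trimmed copy of the test document, with and without the default namespace.
with_ns = (
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    "<head><title>Global Perspectives</title></head>"
    "<body><h1>Chapter 5: Resources</h1></body></html>"
)
without_ns = with_ns.replace(' xmlns="http://www.w3.org/1999/xhtml"', "")

ns_root = ET.fromstring(with_ns)
plain_root = ET.fromstring(without_ns)

# An unqualified name only matches elements that are in no namespace.
plain_hits = plain_root.findall("body")  # finds the <body> element
missed = ns_root.findall("body")         # empty: <body> is in the XHTML namespace

# Binding a prefix to the namespace makes the same element reachable.
qualified_hits = ns_root.findall("html:body", {"html": XHTML_NS})

print(len(plain_hits), len(missed), len(qualified_hits))  # prints: 1 0 1
```

In XPath terms, a namespace-agnostic expression such as /*[local-name()='html'] is the usual way around this. It has not been tested against Lucee here.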


#3

Looks like it has something to do with: https://luceeserver.atlassian.net/browse/LDEV-839
I added an example for your use case.


#4

@Marilou_Landes I was having an issue with a website that I was scraping. I don’t need to go into all of that, but what I found is that after you run your variable through htmlParse(), the following is added to the <html> tag within the HTML you’re trying to parse:

<html xmlns="http://www.w3.org/1999/xhtml" xmlns:html="http://www.w3.org/1999/xhtml">

If the HTML you’re trying to parse doesn’t include the <html> tag, it will be added. If it does include the tag, the xmlns attributes are added. The only workaround I found was to htmlParse() the variable, then use the reReplace function to get rid of the xmlns stuff. This is what I used. Mind you, there are probably regex gurus who can do this better than I can, but the following works:

REreplace(the_html_parsed_variable, '(?i)<html [a-zA-Z\d/()\:;,\.=\s\n\t"-]+>', "<html>");

All that does is replace the xmlns stuff with a plain ole <html> tag. Once that’s done, I was able to use xmlParse() to start parsing out the data I wanted.
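
To make the scope of that pattern concrete, here is a small sketch in Python (not CFML) that applies the same regular expression with the standard re module. The helper name strip_html_attributes and the sample strings are made up for illustration. It suggests two things worth knowing: the pattern drops every attribute on the <html> tag, not only the xmlns ones, and it leaves the tag alone if an attribute contains a character outside the class, such as a single quote.

```python
import re

# Pattern copied from the post; Python's re module accepts it unchanged.
HTML_TAG = re.compile(r'(?i)<html [a-zA-Z\d/()\:;,\.=\s\n\t"-]+>')


def strip_html_attributes(markup: str) -> str:
    """Replace the first attribute-bearing <html ...> tag with a bare <html>."""
    return HTML_TAG.sub("<html>", markup, count=1)


namespaced = (
    '<html xmlns="http://www.w3.org/1999/xhtml" '
    'xmlns:html="http://www.w3.org/1999/xhtml"><head/></html>'
)
with_lang = '<html lang="en" xmlns="http://www.w3.org/1999/xhtml"><head/></html>'
single_quoted = "<html lang='en'><head/></html>"

print(strip_html_attributes(namespaced))     # <html><head/></html>
print(strip_html_attributes(with_lang))      # <html><head/></html> (lang is dropped too)
print(strip_html_attributes(single_quoted))  # unchanged: ' is not in the character class
```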

Hope this helps.


#5

Yes, very similar to LDEV-839. However, I’m now using version 5.3.1.87-RC, and the problem persists. The suggestion by Hugh_Rainey sort of helps, but creates other issues. I’m not sure why it works in the first place, since I believe the_html_parsed_variable is an XML object. How can you do a string replacement on an XML object? XmlSearch() produces correct results, but is it searching an object or a string? I say the solution “sort of” helps because later in my code, when I try to insert new XML elements into the XML document, the inserts fail.

So, I’m still searching for a solution to my problem.

My environment is Lucee 5.3.1.87-RC
OS is Linux (2.6.32-754.2.1.el6.x86_64) 64bit
Servlet Container is Apache Tomcat/8.0.53


#6

Please try the latest snapshot. 5.3.1 has so many problems that I wouldn’t even bother releasing it.

Lots of good fixes have already landed in the 5.3.2 branch. @brucekirkpatrick has also been doing some great work, which I hope will eventually land too.


#7

FWIW, I’ve found a workable (somewhat arcane) solution.

xml_document = htmlParse(html_stream);
xml_document = reReplace( xml_document, '(?i)<html [a-zA-Z\d/()\:;,\.=\s\n\t"-]+>', "<html>" );
xml_document2 = xmlParse(toString(xml_document));

The first line converts the HTML text string to an XML object.
The second line removes all attributes, including any xmlns ones, from the <html> tag.
The third line converts the result to a string, then uses xmlParse() to create a usable XML object.

I can now perform all document modifications I need to do on the XmlParsed object.
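
For anyone puzzling over the same steps, here is the flow sketched in Python (not CFML) with the standard library, making the object-to-string conversion explicit. My guess, and it is only a guess, is that reReplace() casts the XML object to its string form before matching, which would mean the value passed to xmlParse() is already plain text. The sample document and variable names are illustrative.

```python
import re
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
# Serialize the XHTML namespace as a default xmlns attribute (no prefix).
ET.register_namespace("", XHTML_NS)

# Step 1: a parsed, namespaced document (stand-in for the htmlParse() result).
parsed = ET.fromstring(
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    "<body><h1>Chapter 5: Resources</h1></body></html>"
)

# Step 2: a regex needs text, so serialize explicitly, then strip the attributes.
as_text = ET.tostring(parsed, encoding="unicode")
stripped = re.sub(r'(?i)<html [a-zA-Z\d/()\:;,\.=\s\n\t"-]+>', "<html>", as_text)

# Step 3: reparse the cleaned text into a fresh, namespace-free document.
clean = ET.fromstring(stripped)

# Unqualified searches and insertions now work on the reparsed object.
body = clean.find("body")
ET.SubElement(body, "h2").text = "Overview"
print(ET.tostring(clean, encoding="unicode"))
```

The reparsed document no longer carries the XHTML namespace, so anything that relies on that namespace downstream would need it added back.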

It doesn’t make sense, but it’s working. As we install newer snapshots of Lucee, I will revisit this code to see if I can simplify the process.

Thanks to all who have helped me with this.