How to manipulate contents of an imported HTML page

mikek · September 18, 2022, 12:44am

I am pulling in a page from another web site. It’s a lot of HTML with JavaScript and CSS scattered through it, and I want to extract only the parts I need for my application. As it turns out, everything I need is in a single table, so the part between the opening and closing tags of that table gives me the stuff I need to manipulate.

I’m using cfhttp to download the page from the internet into a variable I’m calling ‘rawpage’.

However, when I try to use find() or refind() to locate where in this variable the table begins, I get an error that says it can’t cast a struct as a string. How do I get around this?

I am trying to find the start of the table with the following line:
start = FindNoCase( "<table" , rawpage, '0');

I need to discard everything from the beginning of rawpage to <table> and everything from </table> to the end of the variable. That leaves the table contents as a nice neat variable I can work on.

Cheers
Mike Kear
Windsor, NSW, Australia

Phillyun · September 18, 2022, 3:32am

As a starting point, take a look at your variable

dump(rawpage)

I’m making an assumption about your code here but I think you’ll see why you are getting the error. rawpage.filecontent is likely the string you’re most interested in.
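Here is a minimal cfscript sketch of that idea, assuming the cfhttp result is stored in rawpage as described and that the response came back as text. The URL and the variable names html, startPos, endPos and tableHtml are placeholders I made up for illustration, not anything from your code.

```cfml
<cfscript>
// Hypothetical URL; substitute the page you are actually fetching.
cfhttp(url="https://example.com/weekly-report.html", method="get", result="rawpage");

// rawpage is a struct; the downloaded HTML is in its fileContent key.
html = rawpage.fileContent;

// String positions in CFML are 1-based, and findNoCase() returns 0 when nothing is found.
startPos = findNoCase("<table", html, 1);
// Look for the closing tag only after the opening tag (or from position 1 if there was none).
endPos = findNoCase("</table>", html, max(startPos, 1));

if (startPos > 0 && endPos > 0) {
    // Keep everything from "<table" through the end of "</table>".
    tableHtml = mid(html, startPos, endPos - startPos + len("</table>"));
} else {
    // No complete table found.
    tableHtml = "";
}
</cfscript>
```

Note that this stops at the first closing table tag, so it will cut the content short if the table you want has other tables nested inside it.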

Last thought, slightly OT and worth sharing as you explore with new code: Please make sure this is content you’re allowed to take/format/use. You didn’t say and I won’t assume ill intent - just confirm (on your own) it’s licensed for your use case.

mikek · September 18, 2022, 5:02pm

Thank you for your help, Jason. Yes, I have permission to use this data. We’re starting a joint side project that’s only slightly related to their existing business. But the guy I’m working with at the other company doesn’t know how to get me access to the API they use or to their database. I tried to help him for a while, but eventually decided that was going to be about as frustrating as getting Adobe tech support to help with something. The data I need is on a page they publish every Friday, so I figured why don’t I just scrape that page and extract what I need from the result.

I notice that you are using the variable rawpage.filecontent. It’s years since I did this kind of coding in CF, and I had forgotten about .filecontent altogether. So that explains the error related to structs: I was working on the variable rawpage itself, which is a struct, rather than on the string inside it. I’ll read up on that again, look at some blog posts, and see where I might be going haywire.

Thank you for your input, Jason. I think you might have pointed me towards the answer I need.

Cheers
Mike Kear
Windsor, NSW, Australia