The way I am reading the documentation for CFPDF, there is an option to extract text from a PDF. Example:
<cfpdf source="mypdf" action="extracttext" type="xml" name="diditwork"></cfpdf>
But that results in an error message:
not supported yet, see https://issues.jboss.org/browse/LUCEE-1559
Am I missing something?
This issue has been around since Railo, and our workaround has been to call the PDFBox Java library’s text extraction directly.
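For anyone who wants to try that workaround, here is a minimal sketch of calling PDFBox from CFML. It assumes a PDFBox 2.x jar is on Lucee's classpath (in 1.x the stripper lives in org.apache.pdfbox.util, and 3.x loads documents through a separate Loader class), and the file path is just a placeholder.

```cfml
<cfscript>
// Assumption: PDFBox 2.x is on the classpath; "./mypdf.pdf" is a placeholder path.
pdfFile = createObject("java", "java.io.File").init(expandPath("./mypdf.pdf"));

// PDDocument.load() is a static factory in PDFBox 2.x.
pdDoc = createObject("java", "org.apache.pdfbox.pdmodel.PDDocument").load(pdfFile);

try {
    stripper = createObject("java", "org.apache.pdfbox.text.PDFTextStripper").init();
    // getText() returns the text of the whole document as one string.
    extracted = stripper.getText(pdDoc);
} finally {
    // Always release the file handle, even if extraction fails.
    pdDoc.close();
}

writeOutput(encodeForHTML(extracted));
</cfscript>
```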
But it looks as if it’s finally been addressed in Lucee 22.214.171.124 and PDF Extension 126.96.36.199. See this ticket: https://luceeserver.atlassian.net/browse/LDEV-1941
OK, I’ve upgraded to Lucee 188.8.131.52 and PDF 184.108.40.206.
I wasn’t getting any data until I exported the PDF as a flat PDF on my laptop. I also tried flattening it with Lucee first, but that did not work.
<cfpdf source="mypdf" action="extracttext" type="xml" name="diditwork" info=#showme#></cfpdf>
This does show the text.
Now, the stupid questions start!! Like, why can’t the output be structured? It’s just one long XML data field.
File a bug with a sample PDF and link it back to the above issue.
Take a look at Matt Clemente’s CFC wrapper for PDFBox.
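If what you want is some structure rather than one long string, one option is to go to PDFBox directly and extract page by page. This is only a sketch under the same assumptions as any direct PDFBox call (a 2.x jar on the classpath, a placeholder file path); it does not use the wrapper's API, and the shape of the `pages` array is made up for illustration.

```cfml
<cfscript>
// Assumption: PDFBox 2.x is on the classpath; "./mypdf.pdf" is a placeholder path.
pdfFile = createObject("java", "java.io.File").init(expandPath("./mypdf.pdf"));
pdDoc = createObject("java", "org.apache.pdfbox.pdmodel.PDDocument").load(pdfFile);

pages = [];
try {
    stripper = createObject("java", "org.apache.pdfbox.text.PDFTextStripper").init();
    pageCount = pdDoc.getNumberOfPages();

    // PDFTextStripper page numbers are 1-based and inclusive.
    for (i = 1; i <= pageCount; i++) {
        stripper.setStartPage(javaCast("int", i));
        stripper.setEndPage(javaCast("int", i));
        arrayAppend(pages, { page: i, text: stripper.getText(pdDoc) });
    }
} finally {
    pdDoc.close();
}

// One struct per page, each holding the page number and its text.
writeDump(pages);
</cfscript>
```

From there each page's text can be split or parsed however the document layout allows.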