Extract Text with CFPDF?

derrickpeavy · April 22, 2020, 7:43pm

The way I am reading the documentation for CFPDF, there is an option to extract text from a PDF. Example:

<cfpdf source="mypdf" action="extracttext" type="xml" name="diditwork"></cfpdf>

But that results in an error message:

not supported yet, see https://issues.jboss.org/browse/LUCEE-1559

Am I missing something?

Julian_Halliwell · April 22, 2020, 8:24pm

This issue has been around since Railo, and our workaround has been to use the PDFBox Java library’s text extraction directly.
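
For anyone wanting to try that route, here is a minimal sketch of calling PDFBox from CFML. It is an illustration under assumptions, not the exact code from this thread: it assumes a PDFBox 2.x jar is available to Lucee (1.8 keeps the stripper in a different package, and 3.x loads documents differently), and the file name mypdf.pdf is a placeholder.

```cfml
<cfscript>
// Hypothetical input path; replace with a real PDF on your server.
pdfPath = expandPath( "mypdf.pdf" );

// Assumption: PDFBox 2.x class names. PDDocument.load( File ) is a static
// method, so no init() call is needed on this object.
PDDocument = createObject( "java", "org.apache.pdfbox.pdmodel.PDDocument" );
stripper   = createObject( "java", "org.apache.pdfbox.text.PDFTextStripper" ).init();
pdfFile    = createObject( "java", "java.io.File" ).init( pdfPath );

doc = PDDocument.load( pdfFile );
try {
    // getText() returns the text of all pages as a single string.
    extractedText = stripper.getText( doc );
}
finally {
    // Always release the document, even if extraction throws.
    doc.close();
}

writeOutput( encodeForHTML( extractedText ) );
</cfscript>
```

PDFTextStripper also has setStartPage() and setEndPage() if only some pages are needed.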

But it looks as if it’s finally been addressed in Lucee 5.3.5.75 and PDF Extension 1.0.0.78. See this ticket: [LDEV-1941] - Lucee

derrickpeavy · April 22, 2020, 9:37pm

OK, I’ve upgraded to Lucee 5.3.5.92 and PDF 1.0.0.80.

I wasn’t getting any data until I exported the PDF as a flat PDF on my laptop. I tried flattening it with Lucee first, but that did not work.

<cfpdf source="mypdf" action="extracttext" type="xml" name="diditwork" info="#showme#"></cfpdf>

Does show the text.

Now, the stupid questions start!! Like, why can’t the output be structured? It’s just one long XML data field.

Zackster · April 23, 2020, 5:55am

File a bug with a sample PDF and link it back to the issue above.

Julian_Halliwell · April 23, 2020, 7:14am

Take a look at Matt Clemente’s CFC wrapper for PDFBox.