External jar file problem (TIKA)

Hey everybody!

I’m having some difficulties getting a well-known jar library called ‘Tika’ to work.
It’s a Java library that parses files into readable text.

So I’ve a clean installation of Lucee on my PC (localhost) and an empty page where
I’ve a small code section to call the Tika library. I can see that the Tika library is loaded
in the Lucee admin and I don’t get any parse errors.

The problem is the empty result that I get back. I have tried many PDFs and other files, tried
different code examples etc. I read online that this is perhaps caused by some missing
dependencies, but I’ve downloaded the Tika jar file from their website and it’s pretty big (70 MB).

How can I find out what is going wrong here? Is there perhaps some hidden error I’m not getting
or can somebody maybe test it on their machine? Any help would be really appreciated!

I’ve also got a stackoverflow thread here: link

Thanks for reading and if you have any questions, please feel free to ask!

I get the same results using the Tika bundle that ships with Lucee. I guess it’s connected to the fact that PDF text extraction doesn’t work natively in Lucee.

If you’re only interested in PDFs then Matt Clemente has a possible solution.

We do use Tika for general text extraction in Lucee, but with the latest “app” jar - downloadable directly from Apache - loaded via JavaLoader to avoid java class conflicts with the bundled version.

Please always state which version of Lucee you’re using :slight_smile:

Hey Julian,

Thanks for your elaborate answer! I would like to use this library because it sounds like a very broad one, with parsers for lots of file types. I too tried the jar file downloaded from that link and placed it in the lib folder of Lucee’s WEB-INF. Next I used this code, converted from an example, to try to do a clean test (this is the small test, the bigger one is at the bottom):

handler = createObject("java", "org.apache.tika.sax.BodyContentHandler");
metadata = createObject("java", "org.apache.tika.metadata.Metadata");
inputstream = createObject("java", "java.io.FileInputStream").init(createObject("java", "java.io.File").init('C:\lucee\tomcat\webapps\ROOT\test\dummy.pdf'));
pcontext = createObject("java", "org.apache.tika.parser.ParseContext");
pdfparser = createObject("java", "org.apache.tika.parser.AutoDetectParser");
pdfparser.parse(inputstream, handler, metadata, pcontext);
writeDump(handler.toString());

But did you use the single jar file? I tried the server one and the app one. When I writeOutput the result I get null. Is that because of some hidden error? I tried it on the server and on my own clean local Lucee, with the same result.

EDIT
I’ve found this link to the JavaLoader you were talking about: link. So should I add this to the components folder and use this example? link. It would be very helpful if you could explain that, as we’ve been having some trouble with jar files lately.

(this is the sloppy one, but should work I think)

try {
    httpGet = createObject("java", "org.apache.http.client.methods.HttpGet").init('[local pdf]');
    response = createObject("java", "org.apache.http.impl.client.DefaultHttpClient").execute(httpGet);
    entity = response.getEntity();
    input = createObject("java", "java.io.InputStream");
    map = {}; // not defined in the original snippet; added so map.put() has a target
    if(entity !== null)
    {
        try{
            input = entity.getContent();
            handler = createObject("java", "org.apache.tika.sax.BodyContentHandler");
            metadata = createObject("java", "org.apache.tika.metadata.Metadata");
            parser = createObject("java", "org.apache.tika.parser.AutoDetectParser");
            parseContext = createObject("java", "org.apache.tika.parser.ParseContext");
            parser.parse(input, handler, metadata, parseContext);
            // collapse line breaks and tabs in the extracted text into spaces
            map.put("text", handler.toString().replaceAll("\n|\r|\t", " "));
            map.put("title", metadata.get(createObject("java", "org.apache.tika.metadata.TikaCoreProperties").TITLE));
            map.put("pageCount", metadata.get("xmpTPg:NPages"));
            // map.put("status_code", response.getStatusLine().getStatusCode() + "");
            f_response = metadata.get(createObject("java", "org.apache.tika.metadata.TikaCoreProperties").SOURCE);
            writeOutput(parser);
        }
        catch(any e){
            f_response = "#e.message# 1";
        }
        finally{
            if (input !== null) {
                try {
                    input.close();
                } catch (any e) {
                    f_response = "#e.message# 2";
                }
            }
        }
    }
    else
    {
        f_response = "Object is null";
    }
}
catch(any e){
    f_response = e.message;
}

Ah sorry, the server version is “Lucee 5.3.3.62”. The Tomcat version is “Apache Tomcat/9.0.20” and the Java version is “11.0.3 (AdoptOpenJDK) 64bi”, on Windows Server 2019. We do have some other issues with websockets and a spreadsheet tag, so that doesn’t sit very well with me. Maybe there is some underlying issue that we cannot find. But first things first ^^ It would be nice to have this library working so we can use it as a search tool in documents.

I expect you may be experiencing a “class conflict” with the older version of Tika in Lucee which as we know doesn’t work. Even though you’ve put the newer Tika app jar in your /lib folder, Lucee will have already loaded the older version and will be using that - hence no change to the results.
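One way to check a suspected class conflict like this is to ask the JVM where a class was actually loaded from. The following is only a minimal Java sketch: the class name ClassOrigin and its describe helper are made up for illustration and are not part of Lucee or Tika. It relies on the standard ProtectionDomain/CodeSource API, which returns no location for JDK bootstrap classes and may report a bundle-style URL rather than a file path inside an OSGi container.

```java
import java.net.URL;
import java.security.CodeSource;
import java.security.ProtectionDomain;

// Hypothetical helper (not part of Lucee or Tika): reports where a class came from.
final class ClassOrigin {

    /** Returns the jar or directory a class was loaded from, or a marker if none is reported. */
    static String describe(Class<?> type) {
        ProtectionDomain domain = type.getProtectionDomain();
        CodeSource source = (domain == null) ? null : domain.getCodeSource();
        URL location = (source == null) ? null : source.getLocation();
        return (location == null) ? "unknown (bootstrap or no code source)" : location.toString();
    }

    public static void main(String[] args) throws ClassNotFoundException {
        // Pass a fully qualified class name, e.g. org.apache.tika.Tika, if it is on the classpath.
        String className = args.length > 0 ? args[0] : "java.lang.String";
        System.out.println(className + " -> " + describe(Class.forName(className)));
    }
}
```

If the location printed for a Tika class points at the bundled jar rather than the one dropped into /lib, that would be consistent with the older version having been loaded first.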

JavaLoader can solve that problem, but it might first be worth trying the “OSGi” approach that Lucee 5 offers and which is also designed to avoid class conflicts. Brad has a good blog post about this, which you should read first, then try the following which is based on his advice:

  1. Use 7-zip to open up the tika-app-1.23.jar you downloaded (right-click the jar and choose 7-zip > Open Archive)
  2. Find the file META-INF/MANIFEST.MF, right-click it and choose Edit.
  3. Add the following to the end of the file contents:
Bundle-Name: Apache Tika App Bundle
Bundle-SymbolicName: apache-tika-app-bundle
Bundle-Description: Apache Tika App jar converted to an OSGi bundle
Bundle-ManifestVersion: 2
Bundle-Version: 1.23
  4. Save and close the file, choosing to update the archive when prompted by 7-zip
  5. The jar is now an OSGi bundle and can be dropped into your Lucee installation’s lucee-server/bundles folder.
  6. Check the new bundle is available in your Lucee server admin UI under Info > Bundle (jar)
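Before relying on the admin UI, the edited jar can also be checked from plain Java. This is a rough sketch using only the JDK's java.util.jar classes; the class name BundleHeaderCheck and the EXPECTED list are assumptions for illustration. The list simply mirrors the five headers given in the steps above and makes no claim about which of them an OSGi framework strictly requires.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.jar.Attributes;
import java.util.jar.JarFile;
import java.util.jar.Manifest;

// Hypothetical helper: checks that the headers added in the steps above reached the manifest.
final class BundleHeaderCheck {

    // Assumption: these are just the five headers listed in the post.
    static final String[] EXPECTED = {
        "Bundle-Name", "Bundle-SymbolicName", "Bundle-Description",
        "Bundle-ManifestVersion", "Bundle-Version"
    };

    /** Returns the expected headers that are absent or blank in the main section. */
    static List<String> missingHeaders(Manifest manifest) {
        List<String> missing = new ArrayList<>();
        Attributes main = manifest.getMainAttributes();
        for (String header : EXPECTED) {
            String value = main.getValue(header);
            if (value == null || value.trim().isEmpty()) {
                missing.add(header);
            }
        }
        return missing;
    }

    /** Parses manifest text; the text should end with a newline so the last header is read. */
    static Manifest parse(String text) {
        try {
            return new Manifest(new ByteArrayInputStream(text.getBytes(StandardCharsets.UTF_8)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: BundleHeaderCheck <path-to-jar>");
            return;
        }
        try (JarFile jar = new JarFile(args[0])) {
            Manifest manifest = jar.getManifest();
            if (manifest == null) {
                System.out.println("No META-INF/MANIFEST.MF found in " + args[0]);
                return;
            }
            List<String> missing = missingHeaders(manifest);
            System.out.println(missing.isEmpty()
                ? "All expected bundle headers present"
                : "Missing headers: " + missing);
        }
    }
}
```

Run against the jar in lucee-server/bundles, a non-empty "Missing headers" list would suggest the manifest edit was not saved back into the archive.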

Assuming that works, try this code in your app:

tika = CreateObject( "java", "org.apache.tika.Tika", "apache-tika-app-bundle" );
filePath = "[FullFilePathToMy].pdf"; //e.g. "C:\temp\my.pdf"
try{
    fileStream = CreateObject( "java","java.io.FileInputStream" ).init( JavaCast( "string", filePath ) );
    result = tika.parseToString( fileStream );
}
finally{
    fileStream.close();
}
dump( result );

Hey, this sounds very promising. Hopefully I can take a look later today and come back to you!

Hey,

I did what you said but I’m getting the following error when I call the code:

The OSGi Bundle with name [apache-tika-app-bundle] is not available locally (C:\lucee\tomcat\lucee-server\bundles) or from the update provider (http://update.lucee.org).

When I look at the server admin bundles page I see the following:

I will continue to try to work it out, but I hope you maybe know what’s going on!

Btw: I’ve also removed the previous Tika jar files and tried a restart.

Edit: I’ve now also tried it locally on a cleanly installed Lucee, still the same error. I opened the jar with WinRAR and replaced the MANIFEST.MF file with the changed one. I also use the app version like you said, instead of the server version.

Thanks in advance,

DrunkenMoose

Pass in a third argument [path-to-tika]

In your screenshot of the bundles admin page the bundle meta data isn’t showing, which suggests the changes to the MANIFEST.MF weren’t saved and therefore the jar file isn’t recognised as a bundle.

Try editing a copy of the MANIFEST.MF outside the jar and then replace the file in the jar’s META-INF folder.

Hey,

Yes I did that exactly from his example.

Greetings,
DrunkenMoose

Hey,

Thanks for your quick response. Yes, that’s exactly what I did. I copied the manifest file out of the jar, added the lines at the end, saved it and put it back. I can also see that the modification date is correct. I don’t get why this is so hard with my Lucee :frowning: I’m currently also trying to get PDFBox to work, but I get the exact same issue.

The OSGi Bundle with name [org.apache.pdfbox.app] is not available locally (C:\lucee\tomcat\lucee-server\bundles) or from the update provider (http://update.lucee.org).

Thanks!

Is the screenshot of the MANIFEST.MF from the jar in your lucee-server/bundles folder? If so, try restarting Lucee, to ensure the changes are picked up.

Another thing: your screenshot of the MANIFEST.MF doesn’t look right. Here’s what my complete MANIFEST.MF looks like:

Manifest-Version: 1.0
Archiver-Version: Plexus Archiver
Created-By: Apache Maven 3.6.0
Built-By: tim
Build-Jdk: 11.0.4
Specification-Title: Apache Tika application
Specification-Version: 1.23
Specification-Vendor: The Apache Software Foundation
Implementation-Title: Apache Tika application
Implementation-Version: 1.23
Implementation-Vendor-Id: org.apache.tika
Implementation-Vendor: The Apache Software Foundation
Automatic-Module-Name: org.apache.tika.app
Main-Class: org.apache.tika.cli.TikaCLI
Bundle-Name: Apache Tika App Bundle
Bundle-SymbolicName: apache-tika-app-bundle
Bundle-Description: Apache Tika App jar converted to an OSGi bundle
Bundle-ManifestVersion: 2
Bundle-Version: 1.23

Hey,

Thanks, I will try that! Other good news: I just got the PDFBox bundle working. With your information I now understand how to check what a jar is and where to place it. I put the PDFBox and PDFBox-app jars in the bundles folder, saw they were loaded in the Lucee admin, and now I get parsed text out of a PDF. (I noticed that it has to be actual text and not an image of text.) So thank you, this got me a step closer! I would still like to figure Tika out too, so I know how jars work and that my Lucee is fine.

I will try to do the bundle like you said and get back to you.

Greetings,
DrunkenMoose

I agree it’s too complicated, which is why we use JavaLoader, which just works without modifying anything.

But JL is old and no longer maintained and Lucee really should make it just as easy.

You might want to vote for this ticket which has suggestions for how things could be improved:

https://luceeserver.atlassian.net/browse/LDEV-1528

Hey again,

Tika is working now! You’re a god! I had indeed made an error by putting the bundle data at the bottom of the manifest, when it was already present at the top. I just changed the existing entries to the given info and now it’s loaded. I will now try to implement this in our projects.
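For anyone hitting the same thing, a quick way to spot bundle headers that appear twice is to scan the raw MANIFEST.MF text. This is only a sketch: the class and method names are hypothetical, and it only looks at headers whose names start with "Bundle-", which is an assumption based on this thread rather than a general manifest validator.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper: lists Bundle-* headers that occur more than once in manifest text.
final class DuplicateHeaderScan {

    /** Returns lower-cased names of Bundle-* headers seen more than once, in first-seen order. */
    static List<String> duplicateBundleHeaders(String manifestText) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String line : manifestText.split("\\r?\\n|\\r")) {
            // Blank lines separate sections; lines starting with a space continue the previous value.
            if (line.isEmpty() || line.startsWith(" ")) {
                continue;
            }
            int colon = line.indexOf(':');
            if (colon <= 0) {
                continue;
            }
            // Manifest header names are case-insensitive, so compare them in lower case.
            String name = line.substring(0, colon).toLowerCase(Locale.ROOT);
            if (name.startsWith("bundle-")) {
                counts.merge(name, 1, Integer::sum);
            }
        }
        List<String> duplicates = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (entry.getValue() > 1) {
                duplicates.add(entry.getKey());
            }
        }
        return duplicates;
    }

    public static void main(String[] args) {
        // Illustrative input only; the first symbolic name here is made up.
        String text = "Manifest-Version: 1.0\n"
            + "Bundle-SymbolicName: some.existing.name\n"
            + "Bundle-SymbolicName: apache-tika-app-bundle\n";
        System.out.println(duplicateBundleHeaders(text));
    }
}
```

Any name reported here is a header to edit in place rather than append again.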

Is there maybe a way I can give you some bucks through paypal for your troubles?

Greetings,
DrunkenMoose


Glad you got it working in the end. This is a free support forum so no need for payment, but feel free to make a donation to the Lucee project.


We’ve successfully used the ‘separate classloader’ approach for a long time (we built a ‘simple class loader’ class so we get a distinct class path for Tika and a few other things). This works fine with ACF and Lucee 5.latest, but breaks with Lucee 6.
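For readers unfamiliar with the idea, here is a minimal child-first classloader sketch in Java. It is not the poster's loader, whose code isn't shown; the class name and the loadOrNull helper are made up for illustration. It looks in its own jars before asking the parent, whereas the JDK's default delegation asks the parent first. Any class missing from the supplied jars is still resolved by the parent.

```java
import java.net.URL;
import java.net.URLClassLoader;

// Hypothetical sketch of a child-first loader; not the loader described in the post.
final class ChildFirstClassLoader extends URLClassLoader {

    ChildFirstClassLoader(URL[] jars, ClassLoader parent) {
        super(jars, parent);
    }

    @Override
    protected Class<?> loadClass(String name, boolean resolve) throws ClassNotFoundException {
        synchronized (getClassLoadingLock(name)) {
            Class<?> loaded = findLoadedClass(name);
            if (loaded == null) {
                if (name.startsWith("java.")) {
                    // Core JDK classes must always come from the parent chain.
                    loaded = super.loadClass(name, false);
                } else {
                    try {
                        loaded = findClass(name);              // our own jars first
                    } catch (ClassNotFoundException notLocal) {
                        loaded = super.loadClass(name, false); // then fall back to the parent
                    }
                }
            }
            if (resolve) {
                resolveClass(loaded);
            }
            return loaded;
        }
    }

    /** Convenience wrapper that returns null instead of throwing when a class is not found. */
    Class<?> loadOrNull(String name) {
        try {
            return loadClass(name);
        } catch (ClassNotFoundException e) {
            return null;
        }
    }
}
```

With a loader like this, a logging class that is not in the supplied jars would still be served by the parent, that is, by whatever the host makes visible. Whether that is what happens in the Lucee 6 case described here is not confirmed.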

The problem is that Tika loads a logger object with this line:
private static final Logger LOG = LoggerFactory.getLogger(TikaConfig.class);
… and the logger (which is coming from the Lucee classpath) is throwing an NPE with the message “No Bundle provided” - it appears that the Lucee-provided logger is expecting a bundle instead of a simple class…

Here’s the stack fragment where things appear to be going wrong…
0: …java.util.Objects.requireNonNull[Objects.java:248]
1: …org.apache.logging.log4j.core.osgi.BundleContextSelector.locateContext[BundleContextSelector.java:143]
2: …org.apache.logging.log4j.core.osgi.BundleContextSelector.getContext[BundleContextSelector.java:127]
3: …org.apache.logging.log4j.core.selector.ClassLoaderContextSelector.getContext[ClassLoaderContextSelector.java:117]
4: …org.apache.logging.log4j.core.impl.Log4jContextFactory.getContext[Log4jContextFactory.java:149]
5: …org.apache.logging.log4j.core.impl.Log4jContextFactory.getContext[Log4jContextFactory.java:46]
6: …org.apache.logging.log4j.LogManager.getContext[LogManager.java:197]
7: …org.apache.logging.log4j.spi.AbstractLoggerAdapter.getContext[AbstractLoggerAdapter.java:136]
8: …org.apache.logging.slf4j.Log4jLoggerFactory.getContext[Log4jLoggerFactory.java:58]
9: …org.apache.logging.log4j.spi.AbstractLoggerAdapter.getLogger[AbstractLoggerAdapter.java:46]
10: …org.apache.logging.slf4j.Log4jLoggerFactory.getLogger[Log4jLoggerFactory.java:32]
11: …org.slf4j.LoggerFactory.getLogger[LoggerFactory.java:422]
12: …org.slf4j.LoggerFactory.getLogger[LoggerFactory.java:447]
13: …org.apache.tika.config.TikaConfig.[TikaConfig.java:96]
14: …jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0[NativeConstructorAccessorImpl.java:-2]
15: …jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance[NativeConstructorAccessorImpl.java:62]
16: …jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance[DelegatingConstructorAccessorImpl.java:45]
17: …java.lang.reflect.Constructor.newInstance[Constructor.java:490]
… (our loader class invoking the constructor)

… Again… this logic works with Lucee 5 and all current ACF releases, so it’s not that we’re loading the wrong thing

I experimented with Tika a while ago, but never with Lucee 6. Just posting it here… I don’t know if this approach is similar to yours or if it even works in Lucee 6:
