The problem:
I recently needed to display a video and scale it using the <video> tag. Searching the Internet, I could not find how to determine the dimensions (width and height) of the source video, which I needed in order to scale it while preserving its aspect ratio.
A solution:
Use Apache Tika (https://tika.apache.org/), which can detect and extract metadata and text from over a thousand different file types.
The method:
- Install Apache Tika in Lucee
Lucee already bundles the Apache Tika core. However, because that is only a façade for the API, many calls to it return nothing, so you need the full version of Apache Tika.
Download Tika from https://tika.apache.org/download.html. I am using tika-app-2.1.0.jar, but other versions should work. Lucee requires that third-party JAR files be OSGi compliant. To achieve that, open the JAR file with WinRAR or 7-Zip, add the following lines to the internal META-INF/MANIFEST.MF file, and save it back into the JAR file (the manifest must still end with a newline, or its last line will not be parsed):
Bundle-Name: Apache Tika App Bundle
Bundle-SymbolicName: apache-tika-app-bundle
Bundle-Description: Apache Tika App jar converted to an OSGi bundle
Bundle-ManifestVersion: 2
Bundle-Version: 2.1.0
Place the tika-app-2.1.0.jar file in the C:\Lucee\tomcat\lucee-server\bundles folder (adjust the path for your installation). Browsing to http://127.0.0.1:8888/lucee/admin/server.cfm?action=info.bundle should then list the apache-tika-app-bundle as “not loaded”.
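Before going further, it can be worth checking that Lucee can actually load a class from the new bundle. The snippet below is a minimal sketch, assuming the bundle name from the manifest lines above and Tika's `org.apache.tika.Tika` facade class. Its `detect(String)` method guesses a media type from a file name alone, so no real file is needed. The file name `example.mp4` is only a placeholder.

```cfml
<cfscript>
// Smoke test: load a class from the bundle by its Bundle-SymbolicName.
try {
    tika = createObject("java", "org.apache.tika.Tika", "apache-tika-app-bundle").init();
    // javaCast makes sure the String overload of detect() is chosen.
    // "example.mp4" is a placeholder name; the file does not have to exist.
    detectedType = tika.detect(javaCast("string", "example.mp4"));
    writeOutput("Tika loaded; example.mp4 detected as " & detectedType);
} catch (any e) {
    // Typical causes: the JAR is not in the bundles folder or the manifest edit was not saved.
    writeOutput("Tika bundle could not be loaded: " & e.message);
}
</cfscript>
```

If the catch branch runs, recheck the manifest entries and the bundles folder before moving on to the next step.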
- Add this function to your CFML code to read and retrieve text and metadata from a file.
<cfscript>
// Parse a file with Tika and return a struct holding its text, metadata and any error.
function getFileContent(filename) {
    var LOCAL = StructNew();
    LOCAL.result = StructNew();
    LOCAL.result.error = "";
    LOCAL.result.text = "";
    if (FileExists(filename)) {
        LOCAL.f = createObject("java", "java.io.File").init(filename);
        LOCAL.fis = createObject("java", "java.io.FileInputStream").init(LOCAL.f);
        // Load the Tika classes from the OSGi bundle installed in the previous step
        LOCAL.ch = CreateObject("java", "org.apache.tika.sax.BodyContentHandler", "apache-tika-app-bundle");
        LOCAL.parser = CreateObject("java", "org.apache.tika.parser.AutoDetectParser", "apache-tika-app-bundle");
        LOCAL.md = CreateObject("java", "org.apache.tika.metadata.Metadata", "apache-tika-app-bundle");
        try {
            LOCAL.parser.parse(LOCAL.fis, LOCAL.ch, LOCAL.md);
            LOCAL.keys = LOCAL.md.names();
            LOCAL.result.metadata = StructNew();
            // lte (not lt) so that the last metadata key is included
            for (var ii = 1; ii lte arrayLen(LOCAL.keys); ii = ii + 1) {
                LOCAL.mdval = LOCAL.md.get(LOCAL.keys[ii]);
                if (not isNull(LOCAL.mdval)) {
                    LOCAL.result.metadata[LOCAL.keys[ii]] = LOCAL.mdval;
                }
            }
            LOCAL.result.text = LOCAL.ch.toString();
        } catch (any e) {
            LOCAL.result.error = e;
        } finally {
            // Always release the file handle, whether or not parsing succeeded
            LOCAL.fis.close();
        }
    } else {
        LOCAL.result.error = "File not found";
    }
    return LOCAL.result;
}
</cfscript>
- To extract text and metadata from an existing file, or from a file that has just been uploaded, use code similar to the following, adapted to your circumstances.
<!--- create a list to filter the metadata --->
<cfset RequiredMetaData = "Content-Type,Date/Time,Exif SubIFD:Date/Time Original,Make,Model,tiff:Make,tiff:Model,Lens,F-Number,Focal Length,Focal Length 35,Exif SubIFD:Exposure Time,Exif SubIFD:F-Number,Exif SubIFD:Focal Length,Author,title,Comments,Exif SubIFD:Lens Specification,Copyright,Shutter Speed Value,GPS Latitude,GPS Longitude,GPS:GPS Latitude,geo:lat,GPS:GPS Longitude,geo:long,GPS Img Direction,Flash,Orientation,ISO Speed Ratings,Exposure Time,Image Height,tiff:ImageLength,Image Width,tiff:ImageWidth">
<!--- get file content (fullPathToFileName is assumed to be set earlier in your page) --->
<cfset FileContent = getFileContent(fullPathToFileName)>
<!--- get the file text content --->
<cfif IsDefined("FileContent.text")>
    <cfset contentText = Trim(FileContent.text)>
<cfelse>
    <cfset contentText = "">
</cfif>
<!--- get the file metadata as a comma delimited string --->
<cfset fileMetaData = "">
<cfset videoWidth = 0>
<cfset videoHeight = 0>
<cfif IsDefined("FileContent.metadata")>
    <!--- SortOrder is typed as an integer so that ORDER BY sorts numerically --->
    <cfset qrymeta = QueryNew("Property,PropertyValue,SortOrder", "varchar,varchar,integer")>
    <cfset ii = 0>
    <cfloop collection="#FileContent.metadata#" item="prop">
        <cfif ListFindNoCase(variables.RequiredMetaData, variables.prop) gt 0>
            <cfset ii = variables.ii + 1>
            <cfset QueryAddRow(qrymeta, 1)>
            <cfset temp = QuerySetCell(qrymeta, "Property", "#ReplaceNoCase(ReplaceNoCase(ReplaceNoCase(variables.prop,'tiff:','','all'),'GPS:GPS','','all'),'Exif SubIFD:','','all')#", variables.ii)>
            <cfset temp = QuerySetCell(qrymeta, "PropertyValue", "#FileContent.metadata[variables.prop]#", variables.ii)>
            <!--- sort by the property's position in the RequiredMetaData list --->
            <cfset temp = QuerySetCell(qrymeta, "SortOrder", ListFindNoCase(variables.RequiredMetaData, variables.prop), variables.ii)>
        </cfif>
    </cfloop>
    <cfquery dbtype="query" name="qrymeta2">
        SELECT Property,
               PropertyValue
        FROM qrymeta
        ORDER BY SortOrder
    </cfquery>
    <cfloop query="qrymeta2">
        <cfset fileMetaData = variables.fileMetaData & qrymeta2.Property & " = " & qrymeta2.PropertyValue & ", ">
        <!--- The Content-Type metadata isn't returned for videos so identify by the file extension --->
        <!--- (fileextension is assumed to be set earlier in your page) --->
        <cfif variables.fileextension eq "MP4" or variables.fileextension eq "M4V">
            <cfif qrymeta2.Property eq "ImageWidth">
                <cfset videoWidth = val(qrymeta2.PropertyValue)>
            </cfif>
            <cfif qrymeta2.Property eq "ImageLength">
                <cfset videoHeight = val(qrymeta2.PropertyValue)>
            </cfif>
        </cfif>
    </cfloop>
</cfif>
- Other metadata values can be retrieved as well. To see everything that Tika returns for the selected file type, dump the result:
<cfdump var="#FileContent#">
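Once the width and height are known, scaling the video while preserving its aspect ratio is simple arithmetic. The following is a minimal sketch rather than a definitive implementation: `scaleToFit` is a hypothetical helper name, and the 640 x 360 display box and the 1920 x 1080 sample size are assumptions chosen for illustration.

```cfml
<cfscript>
// Hypothetical helper: fit srcWidth x srcHeight inside maxWidth x maxHeight
// without changing the aspect ratio.
function scaleToFit(required numeric srcWidth, required numeric srcHeight, required numeric maxWidth, required numeric maxHeight) {
    // Fall back to the bounding box when the source size is unknown (0).
    if (srcWidth lte 0 or srcHeight lte 0) {
        return { width: maxWidth, height: maxHeight };
    }
    // Use the smaller of the two ratios so that both dimensions fit inside the box.
    var ratio = min(maxWidth / srcWidth, maxHeight / srcHeight);
    return { width: round(srcWidth * ratio), height: round(srcHeight * ratio) };
}

// Sample values; in a real page pass videoWidth and videoHeight instead.
scaled = scaleToFit(1920, 1080, 640, 360);
// 640 x 360 for these sample values
writeOutput('<video width="#scaled.width#" height="#scaled.height#" controls></video>');
</cfscript>
```

Note that this sketch also scales small videos up to fill the box. If you never want to enlarge a video, cap the ratio at 1 before multiplying.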
Extracting the text content from documents such as PDF and Word files also lets you save that text in a database. Combined with SQL Server Full-Text Search, you could build a Google-like document search with ranked results and content snippets.