Cannot extract text via Apache Tika using Lucee - java

I would like to extract text from PDF, DOCX etc. via Lucee 5+ (5.2.9), but unfortunately I get an empty result set. I have tried several Apache Tika versions (runnable jar with Java 1.8.0) that might fit my specific Lucee and Java requirements, but the result set always remains empty.
exract.cfc
component {

    public any function init() {
        _setTikaJarPath( GetDirectoryFromPath( GetCurrentTemplatePath() ) & "tika-app-1.19.1.jar" );
        return this;
    }

    private struct function doParse( required any fileContent, boolean includeMeta=true, boolean includeText=true ) {
        var result = {};
        var is = "";
        var jarPath = _getTikaJarPath();
        if ( IsBinary( arguments.fileContent ) ) {
            is = CreateObject( "java", "java.io.ByteArrayInputStream" ).init( arguments.fileContent );
        } else {
            // TODO: support plain string input (i.e. html)
            return {};
        }
        try {
            var parser = CreateObject( "java", "org.apache.tika.parser.AutoDetectParser", jarPath );
            var ch = CreateObject( "java", "org.apache.tika.sax.BodyContentHandler", jarPath ).init( -1 );
            var md = CreateObject( "java", "org.apache.tika.metadata.Metadata", jarPath ).init();
            parser.parse( is, ch, md );
            if ( arguments.includeMeta ) {
                result.metadata = {};
                for ( var key in md.names() ) {
                    var mdval = md.get( key );
                    if ( !isNull( mdval ) ) {
                        result.metadata[ key ] = _removeNonUnicodeChars( mdval );
                    }
                }
            }
            if ( arguments.includeText ) {
                result.text = _removeNonUnicodeChars( ch.toString() );
            }
        } catch ( any e ) {
            result = { error = e };
        }
        return result;
    }

    public function read( required string filename ) {
        var result = {};
        if ( !fileExists( filename ) ) {
            result.error = "#filename# does not exist.";
            return result;
        }
        var f = createObject( "java", "java.io.File" ).init( filename );
        var fis = createObject( "java", "java.io.FileInputStream" ).init( f );
        try {
            result = doParse( fis );
        } catch ( any e ) {
            result.error = e;
        }
        fis.close();
        return result;
    }

    private string function _removeNonUnicodeChars( required string potentiallyDirtyString ) {
        return ReReplace( arguments.potentiallyDirtyString, "[^\x20-\x7E]", "", "all" );
    }

    // GETTERS AND SETTERS
    private string function _getTikaJarPath() {
        return _tikaJarPath;
    }

    private void function _setTikaJarPath( required string tikaJarPath ) {
        _tikaJarPath = arguments.tikaJarPath;
    }
}
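A side note on _removeNonUnicodeChars(): despite its name, the character class [^\x20-\x7E] removes everything outside printable ASCII, including legitimate accented and non-Latin characters, so non-English documents will come back with letters missing even when extraction works. A rough Java equivalent of the same filter, for illustration (class and method names are mine):

```java
import java.util.regex.Pattern;

public class AsciiFilter {
    // Same character class the CFC uses: keep only printable ASCII (0x20-0x7E).
    // Note this also strips accented and non-Latin characters, not just
    // control characters, so extracted non-English text loses content.
    private static final Pattern NON_PRINTABLE_ASCII = Pattern.compile("[^\\x20-\\x7E]");

    public static String strip(String dirty) {
        return NON_PRINTABLE_ASCII.matcher(dirty).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(strip("héllo\u0000 wörld")); // prints "hllo wrld"
    }
}
```

If the intent is only to drop control characters while keeping Unicode text, a narrower class such as [\x00-\x1F\x7F] would be less destructive.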
And here is the code that I use to run it:
<cfset takis = new exract()>
<cfset files = directoryList(expandPath("./sources"))>
<cfloop index="f" array="#files#">
    <cfif not findNoCase(".DS_Store", f)>
        <cfdump var="#takis.read(f)#" label="#f#">
    </cfif>
</cfloop>

I think the problem is a class clash: the Lucee core engine already loads its own version of Tika, meaning the one you point to is ignored. But the loaded version doesn't behave as expected, returning empty strings as you've seen.
I've solved this by using OSGi to load the desired Tika version. This involves editing the manifest of the tika-app jar to include basic OSGi metadata and then loading it via my osgiLoader.
There is a pre-built Tika bundle available but I haven't been able to get it to work with Lucee.
Here's how to convert the latest tika-app jar to OSGi:
Open "tika-app-1.28.2.jar" with 7-Zip.
Open META-INF, then select MANIFEST.MF and press F4 to open it in a text editor.
Add the following to the end of the file:
Bundle-Name: Apache Tika App Bundle
Bundle-SymbolicName: apache-tika-app-bundle
Bundle-Description: Apache Tika App jar converted to an OSGi bundle
Bundle-ManifestVersion: 2
Bundle-Version: 1.28.2
Bundle-ClassPath: .,tika-app-1.28.2.jar
Save, choosing to update the archive when prompted.
You can then call the jar using osgiLoader as follows:
extractor.cfc
component {

    property name="loader" type="object";
    property name="tikaBundle" type="struct";

    public extractor function init( required object loader, required struct tikaBundle ) {
        variables.loader = arguments.loader
        variables.tikaBundle = arguments.tikaBundle
        return this
    }

    public string function parseToString( required string filePath ) {
        try {
            var fileStream = CreateObject( "java", "java.io.FileInputStream" ).init( JavaCast( "string", arguments.filePath ) )
            var tikaObject = loader.loadClass( "org.apache.tika.Tika", tikaBundle.path, tikaBundle.name, tikaBundle.version )
            var result = tikaObject.parseToString( fileStream )
        }
        finally {
            fileStream.close()
        }
        return result
    }
}
(The following script assumes extractor.cfc, the modified Tika jar, the osgiLoader.cfc and the document to be processed are in the same directory.)
index.cfm
<cfscript>
docPath = ExpandPath( "test.pdf" )
loader = New osgiLoader()
tikaBundle = {
    version: "1.28.2"
    ,name: "apache-tika-app-bundle"
    ,path: ExpandPath( "tika-app-1.28.2.jar" )
}
extractor = New extractor( loader, tikaBundle )
result = extractor.parseToString( docPath )
dump( result )
</cfscript>
Another way to get the right version loaded is to use JavaLoader. For some reason I couldn't get it to work with the latest tika-app jar (1.28.2), but 1.19.1 does seem to work.
Hacking the existing extension
I would advise you to raise an issue with Preside to change their extension to avoid the clash, but as a temporary hack you could try amending it yourself as follows:
First, add your modified Tika bundle and the osgiLoader.cfc to the /preside-ext-tika/services/ directory.
Next, change line 14 of DocumentMetadataService.cfc so the Tika jar path matches your modified bundle:
_setTikaJarPath( GetDirectoryFromPath( GetCurrentTemplatePath( ) ) & "tika-app-1.28.2.jar" );
Then, modify lines 33-35 of the same cfc to replace:
var parser = CreateObject( "java", "org.apache.tika.parser.AutoDetectParser", jarPath );
var ch = CreateObject( "java", "org.apache.tika.sax.BodyContentHandler" , jarPath ).init(-1);
var md = CreateObject( "java", "org.apache.tika.metadata.Metadata" , jarPath ).init();
with the following:
var loader = New osgiLoader();
var tikaBundle = { version: "1.28.2", name: "apache-tika-app-bundle" };
var parser = loader.loadClass( "org.apache.tika.parser.AutoDetectParser", jarPath, tikaBundle.name, tikaBundle.version )
var ch = loader.loadClass( "org.apache.tika.sax.BodyContentHandler" , jarPath, tikaBundle.name, tikaBundle.version ).init(-1)
var md = loader.loadClass( "org.apache.tika.metadata.Metadata" , jarPath, tikaBundle.name, tikaBundle.version ).init()
NB: I don't have Preside so can't test it in context.

Related

Generate ISO and retrieve the size after adding files

I am trying to generate an ISO file with the JIIC library:
https://github.com/stephenc/java-iso-tools
The thing is that I need to know the exact size of the ISO file (to be generated) after adding a file. (I need to enforce a restriction on the generated ISO file size and, if it is exceeded, to generate a new ISO file.)
I am trying:
long currentIsoSize;
var numOfCreatedIsos = 1;
var root = new ISO9660RootDirectory();
for (var i = 0; i < listFilesInPdfDir.length; i++) {
    if (listFilesInPdfDir[i].isDirectory()) {
        currentIsoSize = generateIso(event, numOfCreatedIsos, root, listFilesInPdfDir[i]);
        if (maxIsoFileSizeExceeds(currentIsoSize)) {
            root.getDirectories().remove(root.getDirectories().get(root.getDirectories().size() - 1));
            generateIso(event, numOfCreatedIsos, root, null);
            root = new ISO9660RootDirectory();
            numOfCreatedIsos++;
            i--;
        }
    } else {
        log.warn("File: {} skipped", listFilesInPdfDir[i]);
    }
}

private long generateIso(ArchiveEventTrigger event, int numOfCreatedIsos, ISO9660RootDirectory root, File file) throws HandlerException, FileNotFoundException {
    if (file != null) {
        var directory = root.addDirectory(file);
        for (var pdf : Objects.requireNonNull(file.listFiles())) {
            directory.addFile(pdf);
        }
    }
    var isoName = ArchiveUtils.buildIsoName(event, numOfCreatedIsos);
    var isoFile = new File("DIR_TO_ISO" + isoName);
    var handler = new ISOImageFileHandler(isoFile);
    var iso = new CreateISO(handler, root);
    iso.process(Utils.isoConfig("daily-isos-" + numOfCreatedIsos), null, null, null);
    return new File("DIR_TO_ISO" + isoName).length();
}
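Independent of JIIC, the "close the current volume when the cap would be exceeded and start a new one" logic described above can be sketched as a greedy partition over entry sizes. This is only illustrative (class and method names are mine): the real cap check must use the actual image size, since ISO 9660 adds filesystem overhead on top of the raw file bytes.

```java
import java.util.ArrayList;
import java.util.List;

public class VolumePartitioner {
    // Greedy split: each volume's total stays within maxBytes; a single
    // oversized entry still gets a volume of its own. The sizes stand in
    // for the directory sizes added to ISO9660RootDirectory above.
    public static List<List<Long>> partition(List<Long> sizes, long maxBytes) {
        List<List<Long>> volumes = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        long total = 0;
        for (long size : sizes) {
            if (!current.isEmpty() && total + size > maxBytes) {
                volumes.add(current);          // close the current volume
                current = new ArrayList<>();   // start a fresh one
                total = 0;
            }
            current.add(size);
            total += size;
        }
        if (!current.isEmpty()) volumes.add(current);
        return volumes;
    }
}
```

With a cap of 100, sizes 40, 30, 50 split into two volumes: [40, 30] and [50].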
The thing is that it creates the appropriate ISO, but:
The file size is not the expected one (files: 70MB, ISO: 60MB, roughly the size of the first directory inserted into the ISO).
Only the first directory has valid files; the other directories contain the files, but they are corrupted.
I noticed that this happens because iso.process is called repeatedly for the same ISO file.
Any suggestions?

Downloading large amount of data and storing on Android & iOS

So I have this API that downloads from our web service, but the web service sends the data as a ZIP file instead of a JSON stream or something else.
Now the files can get quite large, but they are not saved as a ZIP file on the device; instead they are unzipped and then saved in a Realm database.
This seems like an extremely complicated way to do this, and I would just like to remove the zip part and turn it into a JSON streaming service instead.
Is that a valid way to do this, or is there something else I should be doing?
The app for context is basically a Form viewer that is intended to have offline mode.
[WebMethod]
public string AndroidGetFormByID(string sessionID, int formID)
{
    JObject json = new JObject();
    UserDetails user = DBUserHelper.GetUserBySessionID(new Guid(sessionID));
    if (user == null)
    {
        json["Error"] = "Not logged in";
        return json.ToString(Newtonsoft.Json.Formatting.None);
    }
    Client client = Client.GetClient(user.ClientID);
    var formTemplateRecord = SqlInsertUpdate.SelectQuery("SELECT JSON, CreatedDate FROM FormTemplates WHERE ID=#ID AND clientID=#clientID", "FormsConnectionString", new List<SqlParameter> { new SqlParameter("#ID", formID), new SqlParameter("#clientID", client.ID) }).GetFirstRow();
    var formJson = formTemplateRecord["JSON"].ToString();
    if (formJson == null)
    {
        json["Error"] = "No such form";
        return json.ToString(Newtonsoft.Json.Formatting.None);
    }
    json = JObject.Parse(formJson);
    json["formID"] = formID;
    try
    {
        json["created"] = Convert.ToDateTime(formTemplateRecord["CreatedDate"]).ToString("dd/MM/yyyy");
    }
    catch (Exception e)
    {
    }
    MemoryStream convertedFormData = new MemoryStream();
    try
    {
        using (MemoryStream ms = new MemoryStream(json.ToString(Newtonsoft.Json.Formatting.None).ToByteArray()))
        {
            ms.Seek(0, SeekOrigin.Begin);
            using (ZipFile zipedForm = new ZipFile())
            {
                zipedForm.AddEntry(json["title"].ToString() + "_" + json["formID"].ToString(), ms);
                zipedForm.Save(convertedFormData);
            }
        }
    }
    catch (Exception ex)
    {
        return ex.Message.ToString();
    }
    return Convert.ToBase64String(convertedFormData.ToArray());
}
I've also added a bit of Java code for context on how it is being used:
private void getForms( WeakReference< Context > contextWeakReference, List< Integer > ids )
{
    AtomicInteger atomicReference = new AtomicInteger( );
    Observable.interval( 1, TimeUnit.SECONDS )
            .map( aLong -> ids.get( aLong.intValue() ) )
            .take( ids.size() )
            .flatMap( integer ->
            {
                atomicReference.set( integer );
                GetFormsListener.setCurrentItem( listOfIds.indexOf( integer ) + 1 );
                FormDBHelper.updateTemplateDownloading( contextWeakReference, atomicReference.get( ), -1, FormIOHelper.FORM_STATUS.DOWNLOADING.toString() );
                return ServiceGenerator.createService( ).androidGetFormByID( ClientUtils.loginDetailsConstructor.sessionID, String.valueOf( integer ) );
            }, 1 )
            .map( base64 ->
            {
                final Context context = contextWeakReference.get();
                if ( context == null )
                    throw new NullPointerException( );
                AppUtils.LogToConsole( Log.ASSERT, "Reached Here Before Write Form", AppUtils.getLoggedTime( ) );
                final File file = FormIOHelper.checkFormFileExists( context.getFilesDir(), atomicReference.get(), "Library", FormIOHelper.FOLDERS.TEMPLATES.toString() );
                FormIOHelper.writeForm( file, base64 );
                AppUtils.LogToConsole( Log.ASSERT, "Reached Here After Write Form", AppUtils.getLoggedTime( ) );
                return file;
            } )
            .map( file ->
            {
                JsonObject formObject = null;
                try
                {
                    JsonObject jsonObject = FormIOHelper.getFormFromZipFileAndStrip( file );
                    formObject = FormDBHelper.stripFormJson( contextWeakReference, jsonObject, -1 );
                } catch ( Throwable e )
                {
                    ErrorLog.log( e );
                    FormDBHelper.updateTemplateDownloading( contextWeakReference, atomicReference.get( ), -1, FormIOHelper.FORM_STATUS.ERROR.toString( ) );
                }
                if ( formObject == null )
                    return new JsonArray( );
                JsonArray jsonElements;
                if ( formObject.has( "embeddedFiles" ) && formObject.get( "embeddedFiles" ).isJsonArray( ) )
                    jsonElements = formObject.get( "embeddedFiles" ).getAsJsonArray( );
                else
                    jsonElements = new JsonArray( );
                if ( jsonElements.size( ) > 0 )
                {
                    final List< DownloadableFilesConstructor > downloadableFilesConstructorList = FormIOHelper.setEmbeddedFiles( jsonElements );
                    Context context = contextWeakReference.get( );
                    if ( context == null )
                        return jsonElements;
                    DownloadableFilesDBHelper.saveData( context, downloadableFilesConstructorList );
                }
                return jsonElements;
            } )
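For context, the round trip the service performs (JSON → zip entry → Base64 string on the server, then decode and unzip on the device) can be sketched with the JDK alone; the entry name and payload below are placeholders of my own:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class FormCodec {
    // Server side: zip a JSON string into a single entry, then
    // Base64-encode the whole archive.
    public static String encode(String json) throws Exception {
        ByteArrayOutputStream zipped = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(zipped)) {
            zos.putNextEntry(new ZipEntry("form"));
            zos.write(json.getBytes(StandardCharsets.UTF_8));
            zos.closeEntry();
        }
        return Base64.getEncoder().encodeToString(zipped.toByteArray());
    }

    // Client side: Base64-decode, then read the first zip entry back out.
    public static String decode(String base64) throws Exception {
        byte[] raw = Base64.getDecoder().decode(base64);
        try (ZipInputStream zis = new ZipInputStream(new ByteArrayInputStream(raw))) {
            zis.getNextEntry();
            return new String(zis.readAllBytes(), StandardCharsets.UTF_8);
        }
    }
}
```

Given that the payload is a single JSON document, enabling gzip on the HTTP response would achieve comparable compression without the custom zip/Base64 layer, which supports the move to a plain streaming JSON service.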
You can try out Google GSON Stream (JsonReader). It helps when downloading large amounts of data through a JSON REST API.

How to create separate change list when using the API?

I am trying to create a Groovy script that takes a list of change lists from our trunk and merges them one at a time into a release branch. I would like to have all the change lists locally because I want to run a test build before submitting upstream. However, whenever I run the script I find that when I look in P4V all the merges have been placed in the default change list. How can I keep them separate?
My code (in Groovy but using the Java API) is as follows:
final changeListNumbers = [ 579807, 579916, 579936 ]
final targetBranch = "1.0.7"

changeListNumbers.each { changeListNumber ->
    final existingCl = server.getChangelist( changeListNumber )
    final cl = new Changelist(
        IChangelist.UNKNOWN,
        client.getName(),
        server.userName,
        ChangelistStatus.NEW,
        new Date(),
        "${existingCl.id} - ${existingCl.description}",
        false,
        server
    );
    cl.fileSpecs = mergeChangeListToBranch( client, cl, changeListNumber, targetBranch )
}

def List<IFileSpec> mergeChangeListToBranch( final IClient client, final IChangelist changeList, final srcChangeListNumber, final String branchVersion ){
    final projBase = '//Clients/Initech'
    final trunkBasePath = "$projBase/trunk"
    final branchBasePath = "$projBase/release"
    final revisionedTrunkPath = "$trunkBasePath/...#$srcChangeListNumber,$srcChangeListNumber"
    final branchPath = "$branchBasePath/$branchVersion/..."
    println "trunk path: $revisionedTrunkPath\nbranch path is: $branchPath"
    mergeFromTo( client, changeList, revisionedTrunkPath, branchPath )
}

def List<IFileSpec> mergeFromTo( final IClient client, final IChangelist changeList, final String sourceFile, final String destFile ){
    mergeFromTo(
        client,
        changeList,
        new FileSpec( new FilePath( FilePath.PathType.DEPOT, sourceFile ) ),
        new FileSpec( new FilePath( FilePath.PathType.DEPOT, destFile ) )
    )
}

def List<IFileSpec> mergeFromTo( final IClient client, final IChangelist changeList, final FileSpec sourceFile, final FileSpec destFile ){
    final resolveOptions = new ResolveFilesAutoOptions()
    resolveOptions.safeMerge = true
    client.resolveFilesAuto(
        client.integrateFiles( sourceFile, destFile, null, null ),
        // client.integrateFiles( changeList.id, false, null, null, sourceFile, destFile ),
        resolveOptions
    )
}
If I try to IChangeList.update() I get the following error:
Caught: com.perforce.p4java.exception.RequestException: Error in change specification.
Error detected at line 7.
Invalid status 'new'.
If, instead of using IChangelist.UNKNOWN, I use existingCl.id + 10000 (which is larger than any existing changelist number currently in use), then I get
Caught: com.perforce.p4java.exception.RequestException: Tried to update new or default changelist
To create the changelist in the server, call IClient.createChangelist():
final existingCl = server.getChangelist( changeListNumber )
cl = new Changelist(
    IChangelist.UNKNOWN,
    ... snip ...
);
cl = client.createChangelist(cl);
cl.fileSpecs = mergeChangeListToBranch( client, cl, ...
Then to integrate into this particular change:
IntegrateFilesOptions intOpts = new IntegrateFilesOptions()
intOpts.setChangelistId( cl.getId() )
client.integrateFiles( sourceFile, destFile, null, intOpts )
That integrateFiles() returns the integrated file(s), so check that the returned IFileSpec.getOpStatus() is FileSpecOpStatus.VALID.

How do I re-encode dynamically compiled bytes to text?

Consider the following (sourced primarily from here):
JavaCompiler compiler = ToolProvider.getSystemJavaCompiler( );
JavaFileManager manager = new MemoryFileManager( compiler.getStandardFileManager( null, null, null ) );
compiler.getTask( null, manager, null, null, null, sourceScripts ).call( ); //sourceScripts is of type List<ClassFile>
And the following file manager :
public class MemoryFileManager extends ForwardingJavaFileManager< JavaFileManager > {
    private HashMap< String, ClassFile > classes = new HashMap<>( );

    public MemoryFileManager( StandardJavaFileManager standardManager ) {
        super( standardManager );
    }

    @Override
    public ClassLoader getClassLoader( Location location ) {
        return new SecureClassLoader( ) {
            @Override
            protected Class< ? > findClass( String className ) throws ClassNotFoundException {
                if ( classes.containsKey( className ) ) {
                    byte[ ] classFile = classes.get( className ).getClassBytes( );
                    System.out.println( new String( classFile, "utf-8" ) );
                    return super.defineClass( className, classFile, 0, classFile.length );
                } else throw new ClassNotFoundException( );
            }
        };
    }

    @Override
    public ClassFile getJavaFileForOutput( Location location, String className, Kind kind, FileObject sibling ) {
        if ( classes.containsKey( className ) ) return classes.get( className );
        else {
            ClassFile classObject = new ClassFile( className, kind );
            classes.put( className, classObject );
            return classObject;
        }
    }
}
public class ClassFile extends SimpleJavaFileObject {
    private byte[ ] source;
    protected final ByteArrayOutputStream compiled = new ByteArrayOutputStream( );

    public ClassFile( String className, byte[ ] contentBytes ) {
        super( URI.create( "string:///" + className.replace( '.', '/' ) + Kind.SOURCE.extension ), Kind.SOURCE );
        source = contentBytes;
    }

    public ClassFile( String className, CharSequence contentCharSequence ) throws UnsupportedEncodingException {
        super( URI.create( "string:///" + className.replace( '.', '/' ) + Kind.SOURCE.extension ), Kind.SOURCE );
        source = ( ( String )contentCharSequence ).getBytes( "UTF-8" );
    }

    public ClassFile( String className, Kind kind ) {
        super( URI.create( "string:///" + className.replace( '.', '/' ) + kind.extension ), kind );
    }

    public byte[ ] getClassBytes( ) {
        return compiled.toByteArray( );
    }

    public byte[ ] getSourceBytes( ) {
        return source;
    }

    @Override
    public CharSequence getCharContent( boolean ignoreEncodingErrors ) throws UnsupportedEncodingException {
        return new String( source, "UTF-8" );
    }

    @Override
    public OutputStream openOutputStream( ) {
        return compiled;
    }
}
Stepping through the code, on the compiler.getTask().call(), the first thing that happens is that getJavaFileForOutput() is called, and then the getClassLoader() method is called to load the class, which results in the compiled bytes being written to the console.
Why does that println in the getClassLoader() method yield an amalgamation of my working compiled bytecode (primarily strings; it appears the actual bytecode instruction keywords are not there) and random gibberish? This led me to believe that I was using too short a UTF, so I tried UTF-16, and it looked more or less the same. How do I encode the bytes back into text? I am aware that using the SimpleJavaFileManager would be straightforward enough, but I need to be able to use this example of caching (without the possible memory leaks, of course) for performance purposes.
Edit:
And yes, the compiled code does classload and run perfectly.
Why does that println in the getClassLoader() method yield an amalgamation of my working compiled bytecode (primarily strings; it appears the actual bytecode instruction keywords are not there) and random gibberish?
Without seeing the so-called "random gibberish", I would surmise that what you are seeing is the well-formed binary content of a class file that has been "decoded" as a String in some character set.
That ain't going to work. It is a binary format, and you can't expect to turn it into text like that and have it display as something readable.
(And for what it is worth, a ".class" file would not contain keywords for the JVM opcodes, any more than a ".exe" file would contain keywords for machine instructions. It is binary!)
If you want to see the compiled code in text form, then save the bytes in that byte array to a file, and use the javap utility to look at it. (I'll leave you to look up the command line syntax for the javap command ... )
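The point about binary content is easy to verify yourself: dump the first four bytes of any class file and you get the JVM magic number CAFEBABE, not readable text. A small stdlib sketch (class and method names are mine; the resource lookup for .class files works even under JDK 9+ module encapsulation, which exempts them):

```java
import java.io.InputStream;

public class MagicCheck {
    // Format the first four bytes of a class file as hex. For any valid
    // class file this is the magic number CAFEBABE.
    public static String magicOf(byte[] classBytes) {
        StringBuilder hex = new StringBuilder();
        for (int i = 0; i < 4; i++) {
            hex.append(String.format("%02X", classBytes[i]));
        }
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        try (InputStream in = Object.class.getResourceAsStream("Object.class")) {
            System.out.println(magicOf(in.readAllBytes())); // prints CAFEBABE
        }
    }
}
```

Everything after those four bytes is the binary constant pool, field/method tables, and opcodes, which is exactly the "gibberish" a String decode produces.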

How to convert a string into a piece of code (Factory Method Pattern?)

Let's say we have a String like this:
String string2code = "variable = 'hello';";
How could we convert that String to a piece of code like this?:
variable = "hello";
GroovyShell is the answer:
String string2code = "variable = 'hello'; return variable.toUpperCase()";
def result = new GroovyShell().evaluate string2code
assert result == "HELLO"
If you're into more complex stuff later, you can compile whole classes using GroovyClassLoader.
private static Class loadGroovyClass( File file ) throws MigrationException {
    try {
        GroovyClassLoader gcl = new GroovyClassLoader( ExternalMigratorsLoader.class.getClassLoader() );
        GroovyCodeSource src = new GroovyCodeSource( file );
        Class clazz = gcl.parseClass( src );
        return clazz;
    }
    catch( CompilationFailedException | IOException ex ){
        ...
    }
}
Maybe you can take a look at Janino.
Janino is a small Java compiler that can compile not only source files but also expressions like the one you have.
