How do I re-encode dynamically compiled bytes to text? - java

Consider the following (sourced primarily from here):
JavaCompiler compiler = ToolProvider.getSystemJavaCompiler( );
JavaFileManager manager = new MemoryFileManager( compiler.getStandardFileManager( null, null, null ) );
compiler.getTask( null, manager, null, null, null, sourceScripts ).call( ); //sourceScripts is of type List<ClassFile>
And the following file manager:
public class MemoryFileManager extends ForwardingJavaFileManager< JavaFileManager > {
private HashMap< String, ClassFile > classes = new HashMap<>( );
public MemoryFileManager( StandardJavaFileManager standardManager ) {
super( standardManager );
}
@Override
public ClassLoader getClassLoader( Location location ) {
return new SecureClassLoader( ) {
@Override
protected Class< ? > findClass( String className ) throws ClassNotFoundException {
if ( classes.containsKey( className ) ) {
byte[ ] classFile = classes.get( className ).getClassBytes( );
System.out.println(new String(classFile, "utf-8"));
return super.defineClass( className, classFile, 0, classFile.length );
} else throw new ClassNotFoundException( );
}
};
}
@Override
public ClassFile getJavaFileForOutput( Location location, String className, Kind kind, FileObject sibling ) {
if ( classes.containsKey( className ) ) return classes.get( className );
else {
ClassFile classObject = new ClassFile( className, kind );
classes.put( className, classObject );
return classObject;
}
}
}
public class ClassFile extends SimpleJavaFileObject {
private byte[ ] source;
protected final ByteArrayOutputStream compiled = new ByteArrayOutputStream( );
public ClassFile( String className, byte[ ] contentBytes ) {
super( URI.create( "string:///" + className.replace( '.', '/' ) + Kind.SOURCE.extension ), Kind.SOURCE );
source = contentBytes;
}
public ClassFile( String className, CharSequence contentCharSequence ) throws UnsupportedEncodingException {
super( URI.create( "string:///" + className.replace( '.', '/' ) + Kind.SOURCE.extension ), Kind.SOURCE );
source = ( ( String )contentCharSequence ).getBytes( "UTF-8" );
}
public ClassFile( String className, Kind kind ) {
super( URI.create( "string:///" + className.replace( '.', '/' ) + kind.extension ), kind );
}
public byte[ ] getClassBytes( ) {
return compiled.toByteArray( );
}
public byte[ ] getSourceBytes( ) {
return source;
}
@Override
public CharSequence getCharContent( boolean ignoreEncodingErrors ) throws UnsupportedEncodingException {
return new String( source, "UTF-8" );
}
@Override
public OutputStream openOutputStream( ) {
return compiled;
}
}
Stepping through the code: during compiler.getTask().call(), getJavaFileForOutput() is called first, and then getClassLoader() is called to load the class, which results in the compiled bytes being written to the console.
Why does that println in the getClassLoader() method print a mix of recognizable pieces of my working compiled bytecode (primarily strings; the actual bytecode instruction keywords do not appear) and random gibberish? This led me to believe that I was using too narrow a UTF encoding, so I tried UTF-16, but the output looked more or less the same. How do I encode the bytes back into text? I am aware that using the SimpleJavaFileManager would be straightforward enough, but I need to be able to use this caching approach (without the possible memory leaks, of course) for performance purposes.
Edit:
And yes, the compiled code does classload and run perfectly.

Why does that println in the getClassLoader() method yield an amalgamation of my working compiled bytecode(primarily strings, it appears the actual bytecode instruction keywords are not here) and random gibberish?
Without seeing the so-called "random gibberish", I would surmise that what you are seeing is the well-formed binary content of a class file that has been "decoded" as a String in some character set.
That ain't going to work. It is a binary format, and you can't expect to turn it into text like that and have it display as something readable.
(And for what it is worth, a ".class" file would not contain keywords for the JVM opcodes, any more than a ".exe" file would contain keywords for machine instructions. It is binary!)
If you want to see the compiled code in text form, then save the bytes in that byte array to a file, and use the javap utility to look at it. (I'll leave you to look up the command line syntax for the javap command ... )
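As a rough sketch of that suggestion, the bytes can be written to a .class file and their first few bytes inspected as hex rather than decoded as text. The class name ClassBytesDump and both helper methods are hypothetical and not part of the question's code; the byte array in main is a stand-in holding only the 8-byte class file header, where real code would pass the result of getClassBytes().

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of the question's code.
class ClassBytesDump {

    // Render up to `limit` bytes as space-separated upper-case hex pairs.
    static String toHex(byte[] bytes, int limit) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < Math.min(limit, bytes.length); i++) {
            if (i > 0) sb.append(' ');
            sb.append(String.format("%02X", bytes[i] & 0xFF));
        }
        return sb.toString();
    }

    // Write the raw class bytes to <dir>/<simpleName>.class so javap can read them.
    static Path save(Path dir, String simpleName, byte[] classBytes) throws IOException {
        return Files.write(dir.resolve(simpleName + ".class"), classBytes);
    }

    public static void main(String[] args) throws IOException {
        // Stand-in data: magic number, minor version 0, major version 52.
        // Real code would use classes.get( className ).getClassBytes( ) here.
        byte[] classBytes = { (byte) 0xCA, (byte) 0xFE, (byte) 0xBA, (byte) 0xBE, 0, 0, 0, 52 };
        System.out.println(toHex(classBytes, 4));
        Path file = save(Files.createTempDirectory("classes"), "Example", classBytes);
        System.out.println("Wrote " + file);
    }
}
```

With a complete class file saved this way, running javap -c -p on the resulting file prints the disassembled bytecode. The truncated stand-in above is only there to show the magic number and would not be accepted by javap.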

Related

Can not extract text via Apache Tika using Lucee

I would like to extract text from PDF, DOCX, etc. via Lucee 5+ (5.2.9), but unfortunately I get an empty result set. I have tried several Apache Tika versions (runnable jar with Java 1.8.0) that should fit my specific Lucee and Java requirements, but the result set always remains empty.
exract.cfc
component {
public any function init() {
_setTikaJarPath( GetDirectoryFromPath( GetCurrentTemplatePath( ) ) & "tika-app-1.19.1.jar" );
return this;
}
private struct function doParse( required any fileContent, boolean includeMeta=true, boolean includeText=true ) {
var result = {};
var is = "";
var jarPath = _getTikaJarPath();
if ( IsBinary( arguments.fileContent ) ) {
is = CreateObject( "java", "java.io.ByteArrayInputStream" ).init( arguments.fileContent );
} else {
// TODO, support plain string input (i.e. html)
return {};
}
try {
var parser = CreateObject( "java", "org.apache.tika.parser.AutoDetectParser", jarPath );
var ch = CreateObject( "java", "org.apache.tika.sax.BodyContentHandler" , jarPath ).init(-1);
var md = CreateObject( "java", "org.apache.tika.metadata.Metadata" , jarPath ).init();
parser.parse( is, ch, md );
if ( arguments.includeMeta ) {
result.metadata = {};
for( var key in md.names() ) {
var mdval = md.get( key );
if ( !isNull( mdval ) ) {
result.metadata[ key ] = _removeNonUnicodeChars( mdval );
}
}
}
if ( arguments.includeText ) {
result.text = _removeNonUnicodeChars( ch.toString() );
}
} catch( any e ) {
result = { error = e };
}
return result;
}
public function read(required string filename) {
var result = {};
if(!fileExists(filename)) {
result.error = "#filename# does not exist.";
return result;
};
var f = createObject("java", "java.io.File").init(filename);
var fis = createObject("java","java.io.FileInputStream").init(f);
try {
result = doParse(fis);
} catch(any e) {
result.error = e;
}
fis.close();
return result;
}
private string function _removeNonUnicodeChars( required string potentiallyDirtyString ) {
return ReReplace( arguments.potentiallyDirtyString, "[^\x20-\x7E]", "", "all" );
}
// GETTERS AND SETTERS
private string function _getTikaJarPath() {
return _tikaJarPath;
}
private void function _setTikaJarPath( required string tikaJarPath ) {
_tikaJarPath = arguments.tikaJarPath;
}
}
And the code that I use to run it:
<cfset takis = new exract()>
<cfset files = directoryList(expandPath("./sources"))>
<cfloop index="f" array="#files#">
<cfif not findNoCase(".DS_Store",f)>
<cfdump var="#takis.read(f)#" label="#f#">
</cfif>
</cfloop>
I think the problem is a class clash: the Lucee core engine already loads a version of Tika, meaning the one you point to is ignored. But the loaded version doesn't behave as expected, returning empty strings as you've seen.
I've solved this by using OSGi to load the desired Tika version. This involves editing the Manifest of the tika-app jar to include basic OSGi metadata and then loading it via my osgiLoader.
There is a pre-built Tika bundle available but I haven't been able to get it to work with Lucee.
Here's how to convert the latest tika-app jar to OSGi:
open the "tika-app-1.28.2.jar" with 7-zip
open META-INF then select MANIFEST.MF and press F4 to open it in a text editor
add the following to the end of the file:
Bundle-Name: Apache Tika App Bundle
Bundle-SymbolicName: apache-tika-app-bundle
Bundle-Description: Apache Tika App jar converted to an OSGi bundle
Bundle-ManifestVersion: 2
Bundle-Version: 1.28.2
Bundle-ClassPath: .,tika-app-1.28.2.jar
Save choosing to update when prompted.
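For anyone who prefers not to edit the manifest by hand, the same headers could be produced programmatically. The following is only a sketch using java.util.jar.Manifest from the JDK; the class name OsgiManifestSketch and its methods are hypothetical, and it only builds and renders the manifest text. It does not repack the jar.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.jar.Attributes;
import java.util.jar.Manifest;

// Hypothetical helper: adds the OSGi headers listed above to a manifest.
class OsgiManifestSketch {

    static Manifest addOsgiHeaders(Manifest manifest, String version) {
        Attributes main = manifest.getMainAttributes();
        // The Manifest.write javadoc requires Manifest-Version to be set first.
        if (main.getValue(Attributes.Name.MANIFEST_VERSION) == null) {
            main.put(Attributes.Name.MANIFEST_VERSION, "1.0");
        }
        main.putValue("Bundle-Name", "Apache Tika App Bundle");
        main.putValue("Bundle-SymbolicName", "apache-tika-app-bundle");
        main.putValue("Bundle-Description", "Apache Tika App jar converted to an OSGi bundle");
        main.putValue("Bundle-ManifestVersion", "2");
        main.putValue("Bundle-Version", version);
        main.putValue("Bundle-ClassPath", ".,tika-app-" + version + ".jar");
        return manifest;
    }

    // Serialize the manifest to text, as it would appear in META-INF/MANIFEST.MF.
    static String render(Manifest manifest) {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            manifest.write(out);
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.print(render(addOsgiHeaders(new Manifest(), "1.28.2")));
    }
}
```

In practice the starting Manifest would be read from the existing jar (for example via JarFile.getManifest()) so that the original entries are kept, and the result written back into a copy of the jar.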
You can then call the jar using osgiLoader as follows:
extractor.cfc
component{
property name="loader" type="object";
property name="tikaBundle" type="struct";
public extractor function init( required object loader, required struct tikaBundle ){
variables.loader = arguments.loader
variables.tikaBundle = arguments.tikaBundle
return this
}
public string function parseToString( required string filePath ){
try{
var fileStream = CreateObject( "java", "java.io.FileInputStream" ).init( JavaCast( "string", arguments.filePath ) )
var tikaObject = loader.loadClass( "org.apache.tika.Tika", tikaBundle.path, tikaBundle.name, tikaBundle.version )
var result = tikaObject.parseToString( fileStream )
}
finally{
fileStream.close()
}
return result
}
}
(The following script assumes extractor.cfc, the modified Tika jar, the osgiLoader.cfc and the document to be processed are in the same directory.)
index.cfm
<cfscript>
docPath = ExpandPath( "test.pdf" )
loader = New osgiLoader()
tikaBundle = {
version: "1.28.2"
,name: "apache-tika-app-bundle"
,path: ExpandPath( "tika-app-1.28.2.jar" )
}
extractor = New extractor( loader, tikaBundle )
result = extractor.parseToString( docPath )
dump( result )
</cfscript>
Another way to get the right version loaded is to use JavaLoader. For some reason I couldn't get it to work with the latest tika-app jar (1.28.2), but 1.19.1 does seem to work.
Hacking the existing extension
I would advise you to raise an issue with Preside to change their extension to avoid the clash, but as a temporary hack you could try amending it yourself as follows:
First, add your modified Tika bundle and the osgiLoader.cfc to the /preside-ext-tika/services/ directory.
Next, change line 14 of DocumentMetadataService.cfc so the name of the Tika jar path matches your modified bundle.
_setTikaJarPath( GetDirectoryFromPath( GetCurrentTemplatePath( ) ) & "tika-app-1.28.2.jar" );
Then, modify lines 33-35 of the same cfc to replace:
var parser = CreateObject( "java", "org.apache.tika.parser.AutoDetectParser", jarPath );
var ch = CreateObject( "java", "org.apache.tika.sax.BodyContentHandler" , jarPath ).init(-1);
var md = CreateObject( "java", "org.apache.tika.metadata.Metadata" , jarPath ).init();
with the following:
var loader = New osgiLoader();
var tikaBundle = { version: "1.28.2", name: "apache-tika-app-bundle" };
var parser = loader.loadClass( "org.apache.tika.parser.AutoDetectParser", jarPath, tikaBundle.name, tikaBundle.version )
var ch = loader.loadClass( "org.apache.tika.sax.BodyContentHandler" , jarPath, tikaBundle.name, tikaBundle.version ).init(-1)
var md = loader.loadClass( "org.apache.tika.metadata.Metadata" , jarPath, tikaBundle.name, tikaBundle.version ).init()
NB: I don't have Preside so can't test it in context.

Convert UTF-8 to windows-1252 and write into csv in gwt 2.7.0 on tomcat v7

I am facing a problem with converting UTF-8 to windows-1252. I have to output symbols like ², ³ and °. The customer wants to open the file in Excel by double-clicking it, without going through the import dialog.
System:
Frontend in gwt 2.7.0 with massive usage of gxt 3.1.4
Server on customer side is a tomcat v7
Testing is done on the GWT built-in server
The problem right now is, that the application supports Japanese symbols, which are displayed perfectly fine in UTF-8 but not in windows-1252. On the other hand, the ²,³,° symbols are displayed. The current solution is to collect the rows of the csv and put them in hidden fields inside a FormPanel. The FormPanel is then encoded and submitted.
public void postCsvForExcel( String url, Map<String, String> postData )
{
setSize( "0px", "0px" );
setVisible( false );
sinkEvents( Event.ONLOAD );
setMethod( FormPanel.METHOD_POST );
setEncoding( FormPanel.ENCODING_URLENCODED );
VerticalPanel panel = new VerticalPanel();
add( panel );
for( Entry<String, String> data : postData.entrySet() )
{
Hidden hiddenField = new Hidden( data.getKey(), data.getValue() );
panel.add( hiddenField );
}
SubmitButton submit = new SubmitButton();
panel.add( submit );
setAction( url );
FormElement.as( this.getElement() ).setAcceptCharset( "Cp1252" );
RootPanel.get().add( this );
submit();
}
The Japanese characters are only displayed in the header. Facing this problem, I extended HttpServlet for POST operations to convert the UTF-8 as follows, and removed the FormElement.as( this.getElement() ).setAcceptCharset( "Cp1252" ); call from the method above.
public class ExporterServlet extends HttpServlet {
public ExporterServlet() {
}
@Override
protected void service(HttpServletRequest arg0, HttpServletResponse arg1)
throws ServletException, IOException {
super.service(arg0, arg1);
}
@Override
protected void doPost(HttpServletRequest req, HttpServletResponse resp)
throws ServletException, IOException {
String filename = req.getParameter("filename");
String content = req.getParameter("content");
if(filename != null) {
//resp.setContentType( getContentType( filename ) + "; charset=utf-8" );
resp.setContentType( "text/csv" + "; charset=windows-1252" );
resp.setHeader( "Content-Disposition", "attachment;filename=\"" + filename + "\"" );
resp.setIntHeader("Expires", 0);
resp.setContentLength(content.length());
resp.setStatus(200);
//resp.setCharacterEncoding( "UTF-8" );
resp.setCharacterEncoding( "windows-1252" );
//byte[] destinationBytes = content.getBytes( "utf-8" );
ByteBuffer bb = ByteBuffer.wrap( content.getBytes() );
CharBuffer cb = Charset.forName( "UTF-8" ).decode( bb );
bb = Charset.forName( "windows-1252" ).encode( cb );
resp.getOutputStream().write( bb.array() );
resp.getOutputStream().flush();
}
}
}
But this does not seem to work. Am I missing something?
Further information: I have observed one strange thing. The doPost method, although it is called, has no effect on the encoding of the file. I tried to encode it in UTF-8, but the output was still windows-1252. When I removed the encoding of the FormPanel in the method before, the result was UTF-8.
Another question is what the correct encoding name for windows-1252 is. I have tried both versions, cp1252 and windows-1252, and I can't spot a difference in the result.
Japanese characters can't be displayed in cp1252. The customer's instruction itself was not investigated properly: the customer doesn't know what cp1252 is capable of. I should have checked that before working on a solution.
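A quick way to check this before writing any export code is to ask the charset's encoder directly. The sketch below uses only java.nio.charset from the JDK; the class name Cp1252Check and the method fitsIn are hypothetical. It also shows that, in Java, Cp1252 and windows-1252 resolve to the same charset, which would explain why both names gave the same result.

```java
import java.nio.charset.Charset;

// Hypothetical helper: tests whether a string survives encoding in a given charset.
class Cp1252Check {

    static boolean fitsIn(String charsetName, String text) {
        // canEncode returns false if any character has no mapping in the charset.
        return Charset.forName(charsetName).newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        // Superscript two, superscript three, degree sign (written as escapes).
        String symbols = "\u00B2\u00B3\u00B0";
        // Three kanji, written as escapes so the source file stays ASCII.
        String japanese = "\u65E5\u672C\u8A9E";

        System.out.println("symbols in windows-1252:  " + fitsIn("windows-1252", symbols));
        System.out.println("japanese in windows-1252: " + fitsIn("windows-1252", japanese));
        System.out.println("japanese in UTF-8:        " + fitsIn("UTF-8", japanese));
        System.out.println("same charset: "
                + Charset.forName("Cp1252").equals(Charset.forName("windows-1252")));
    }
}
```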
You can add the following code, and try it again:
filename = new String(filename.getBytes(), "ISO-8859-1");

How to convert a string into a piece of code (Factory Method Pattern?)

Let's say we have a String like this:
String string2code = "variable = 'hello';";
How could we convert that String to a piece of code like this?:
variable = "hello";
GroovyShell is the answer:
String string2code = "variable = 'hello'; return variable.toUpperCase()";
def result = new GroovyShell().evaluate string2code
assert result == "HELLO"
If you're into more complex stuff later, you can compile whole classes using GroovyClassLoader.
private static Class loadGroovyClass( File file ) throws MigrationException {
try {
GroovyClassLoader gcl = new GroovyClassLoader( ExternalMigratorsLoader.class.getClassLoader() );
GroovyCodeSource src = new GroovyCodeSource( file );
Class clazz = gcl.parseClass( src );
return clazz;
}
catch( CompilationFailedException | IOException ex ){
...
}
}
Maybe you can take a look at Janino.
Janino is a small Java compiler that can not only compile source files but also compile expressions like the one you have.
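If adding Groovy or Janino is not an option, the JDK's own javax.tools compiler can do something similar, at the cost of more ceremony. This is a sketch under a few assumptions: it must run on a JDK (not a bare JRE), the snippet is valid Java rather than the single-quoted form in the question, and the names StringToCode, Snippet and eval are made up for the illustration.

```java
import java.lang.reflect.Method;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.tools.JavaCompiler;
import javax.tools.ToolProvider;

// Hypothetical helper: wraps a Java method body in a class, compiles it, and runs it.
class StringToCode {

    static Object compileAndRun(String body) {
        String source = "public class Snippet { public static Object eval() { " + body + " } }";
        try {
            JavaCompiler compiler = ToolProvider.getSystemJavaCompiler();
            if (compiler == null) {
                throw new IllegalStateException("No system compiler available; run on a JDK");
            }
            // Write the generated source to a temporary directory and compile it there.
            Path dir = Files.createTempDirectory("snippet");
            Path file = dir.resolve("Snippet.java");
            Files.write(file, source.getBytes(StandardCharsets.UTF_8));
            int status = compiler.run(null, null, null, "-d", dir.toString(), file.toString());
            if (status != 0) {
                throw new IllegalArgumentException("Compilation failed");
            }
            // Load the compiled class from that directory and call its static method.
            try (URLClassLoader loader = new URLClassLoader(new URL[] { dir.toUri().toURL() })) {
                Method eval = loader.loadClass("Snippet").getMethod("eval");
                return eval.invoke(null);
            }
        } catch (RuntimeException e) {
            throw e;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        String string2code = "String variable = \"hello\"; return variable.toUpperCase();";
        System.out.println(compileAndRun(string2code));
    }
}
```

Compiling arbitrary strings this way executes whatever code they contain, so it is only reasonable for trusted input.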

Why the HelloWorld of opennlp library works fine on Java but doesn't work with Jruby?

I am getting this error:
SyntaxError: hello.rb:13: syntax error, unexpected tIDENTIFIER
public HelloWorld( InputStream data ) throws IOException {
The HelloWorld.rb is:
require "java"
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.IOException;
import opennlp.tools.postag.POSModel;
import opennlp.tools.postag.POSTaggerME;
public class HelloWorld {
private POSModel model;
public HelloWorld( InputStream data ) throws IOException {
setModel( new POSModel( data ) );
}
public void run( String sentence ) {
POSTaggerME tagger = new POSTaggerME( getModel() );
String[] words = sentence.split( "\\s+" );
String[] tags = tagger.tag( words );
double[] probs = tagger.probs();
for( int i = 0; i < tags.length; i++ ) {
System.out.println( words[i] + " => " + tags[i] + " # " + probs[i] );
}
}
private void setModel( POSModel model ) {
this.model = model;
}
private POSModel getModel() {
return this.model;
}
public static void main( String args[] ) throws IOException {
if( args.length < 2 ) {
System.out.println( "HelloWord <file> \"sentence to tag\"" );
return;
}
InputStream is = new FileInputStream( args[0] );
HelloWorld hw = new HelloWorld( is );
is.close();
hw.run( args[1] );
}
}
This happens when running ruby HelloWorld.rb "I am trying to make it work".
When I run HelloWorld.java with "I am trying to make it work", it works perfectly; of course, the .java file doesn't contain the require "java" statement.
EDIT:
I followed the following steps.
The output for jruby -v :
jruby 1.6.7.2 (ruby-1.8.7-p357) (2012-05-01 26e08ba) (Java HotSpot(TM) 64-Bit Server VM 1.6.0_35) [darwin-x86_64-java]
JRuby is a Ruby implementation in Java, which means that if you want to use JRuby, you have to use Ruby syntax. You can indeed use Java objects in JRuby, but with Ruby syntax; you just can't use Java syntax.
For example, frame = javax.swing.JFrame.new("Window") uses JFrame, but with a ruby syntax (i.e. JFrame.new rather than new JFrame).
And so your code would be something like:
require 'java'
# Require opennlp jars
Dir.glob('**/*.jar').each do |jar|
require jar
end
java_import 'opennlp.tools.postag.POSTaggerME'
java_import 'opennlp.tools.postag.POSModel'
class HelloWorld
def initialize(data)
@model = POSModel.new(data)
end
def run(sentence)
tagger = POSTaggerME.new(@model)
words = sentence.split
tags = tagger.tag(words)
probs = tagger.probs
probs.each_with_index do |p,i|
puts "#{words[i]} => #{tags[i]} # #{p}"
end
end
end
stream = File.new(ARGV[0]).to_java.getInStream
HelloWorld.new(stream).run(ARGV[1])
All. ruby. code.
Because it's written in Java and not Ruby?

Java reflection and manifest file in jar

I would like to (and I don't know if it's possible) do something if jarA is in my classpath and do something else if jarB is in my classpath. I am NOT going to be specifying these jars in the Netbeans project library references because I don't know which of the jars will be used.
Now including the jar in my Netbeans project library references works when I try to use the jar's classes through reflection. But when I remove the netbeans project library reference but add the jar to my classpath the reflection does not work.
My question is really 1) can this be done? 2) Am I thinking about it correctly 3) How come when I specify -cp or -classpath to include the directory containing the jar it doesn't work? 4) How come when I specify the directory in my manifest.mf in the jar file it doesn't work?
Please let me know. This is really bothering me.
Thanks,
Julian
On point 3: you should include the fully qualified jar name in your classpath, not just the directory.
I believe so!
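((URLClassLoader) ClassLoader.getSystemClassLoader()).getURLs(); // Java 8 and earlier only: from Java 9 the system class loader is not a URLClassLoader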
This will tell you which Jar files are in your classpath. Then do X or Y as you please.
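On Java 9 and later the system class loader is no longer a URLClassLoader, so a version-independent variant of this idea is to read the java.class.path system property instead. The sketch below assumes that approach; the class name ClasspathEntries and the jar name jarA.jar are placeholders.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper: inspects a classpath string for a given jar file name.
class ClasspathEntries {

    // Split a classpath string on the platform separator (':' or ';').
    static List<String> split(String classPath) {
        List<String> entries = new ArrayList<>();
        for (String entry : classPath.split(Pattern.quote(File.pathSeparator))) {
            if (!entry.isEmpty()) {
                entries.add(entry);
            }
        }
        return entries;
    }

    // True if some entry's file name equals jarFileName, e.g. "jarA.jar".
    static boolean hasJar(String classPath, String jarFileName) {
        for (String entry : split(classPath)) {
            if (new File(entry).getName().equals(jarFileName)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String classPath = System.getProperty("java.class.path");
        System.out.println(split(classPath));
        System.out.println("jarA present: " + hasJar(classPath, "jarA.jar"));
    }
}
```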
A classpath can reference a directory that contains .class files, or it can reference a .jar file directly. If it references a directory that contains .jar files, they will not be included.
java -help says this about -classpath: "list of directories, JAR archives,
and ZIP archives to search for class files." This is very clear that a directory on the classpath is searched for class files, not JAR archives.
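Once the jar itself (not just its directory) is on the classpath, choosing between jarA and jarB can be done by probing for a class that only one of them contains. This is a minimal sketch; the class names com.example.a.FeatureA and com.example.b.FeatureB are placeholders for whatever classes the real jars provide.

```java
// Hypothetical helper: picks a code path based on which optional class is loadable.
class OptionalDependency {

    // True if the named class can be found, without running its static initializers.
    static boolean isPresent(String className) {
        try {
            Class.forName(className, false, OptionalDependency.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    // Decide which jar to use; the class names are placeholders.
    static String chooseBackend() {
        if (isPresent("com.example.a.FeatureA")) {
            return "jarA";
        }
        if (isPresent("com.example.b.FeatureB")) {
            return "jarB";
        }
        return "none";
    }

    public static void main(String[] args) {
        System.out.println("Using backend: " + chooseBackend());
    }
}
```

Passing false as the second argument to Class.forName avoids initializing the class just to test for its presence.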
This is how I'm doing it. The inner class encapsulates the singleton instance of the logger and its trace method (heh - I know - a singleton inside a singleton). The outer class only uses it if the special class can be loaded, otherwise we go on without it. Hopefully you can modify this to suit your needs. And any suggestions as to better code are always appreciated! :-) HTH
import java.lang.reflect.Method;
import java.text.SimpleDateFormat;
import java.util.Date;
/**
* Provides centralized access to standardized output formatting. Output is sent to System.out and,
* if the classpath allows it, to the Cisco CTIOS LogManager. (This is NOT a dependency, however.)
*
*/
public class LogWriter
{
protected static LogWriter me = null;
private SimpleDateFormat dateFormat = null;
private StringBuffer line = null;
CLogger ciscoLogger = null;
/*
* The following 2 methods constitute the thread-safe singleton pattern.
*/
private static class LogWriterHolder
{
public static LogWriter me = new LogWriter();
}
/**
* Returns singleton instance of the class. Thread-safe. The only way to get one is to use this.
*
* @return an instance of LogWriter
*/
public static LogWriter sharedInstance()
{
return LogWriterHolder.me;
}
@SuppressWarnings("unchecked")
LogWriter()
{
dateFormat = new SimpleDateFormat("yyyyMMddHHmmss ");
line = new StringBuffer();
try {
Class x = Class.forName("com.cisco.cti.ctios.util.LogManager");
if( x != null ) {
java.lang.reflect.Method m = x.getMethod("Instance", new Class[0]);
java.lang.reflect.Method n = x.getMethod("Trace", int.class, String.class );
if( m != null ) {
Object y = m.invoke( x , new Object[0] );
if( n != null ) {
ciscoLogger = new CLogger();
ciscoLogger.target = y;
ciscoLogger.traceImpl = n ;
}
}
}
} catch(Throwable e )
{
System.err.println( e.getMessage() ) ;
e.printStackTrace();
}
}
/**
* Formats a line and sends to System.out. The collection and formatting of the text is
* thread safe.
*
* @param message The human message you want to display in the log (required).
* @param hostAddress Host address of server (optional)
* @param hostPort Port on hostAddress (optional) - also used for className in object-specific logs.
*/
public void log( String message, String hostAddress, String hostPort )
{
if ( message == null || message.length() < 3 ) return;
synchronized( this )
{
try {
line.delete(0, line.length());
line.append(dateFormat.format(new Date()));
line.append(hostAddress);
line.append(":");
line.append(hostPort);
line.append(" ");
while (line.length() < 28)
line.append(" ");
line.append(message);
this.write( line.toString() );
} catch (Exception e) {
e.printStackTrace();
}
}
}
private void write(String line )
{
System.out.println( line ) ;
}
/**
* Write a simple log message to output delegates (default is System.out).
* <p>
* Will prepend each line with date in yyyyMMddHHmmss format. there will be a big space
* after the date, in the spot where host and port are normally written, when {@link LogWriter#log(String, String, String) log(String,String,String)}
* is used.
*
* @param message What you want to record in the log.
*/
public void log( String message )
{
if( ciscoLogger != null ) ciscoLogger.trace(0x01, message );
this.log( message, "", "");
}
class CLogger
{
Object target;
Method traceImpl;
@SuppressWarnings("boxing")
public void trace( int x, String y )
{
try {
traceImpl.invoke( target, x, y) ;
} catch( Throwable e ) {
// nothing to say about this
}
}
}
}
