Tika in server mode performance

I have read in several articles that running Tika in server mode improves performance. Can someone explain how? Can we implement similar functionality within our Java application for better performance?
Running tika in server mode

In the example you provided, Tika is executed as a standalone application from the jar, so additional steps are performed before the PDF file is actually processed. You can roughly split the run into three steps:
1) The JVM is instantiated
2) Tika classes are loaded and configured (e.g. parsers)
3) (only then) Tika performs content processing
In server mode the first two steps are performed once, at server startup, so the server is ready to process files as it receives them.
You can do the same in your application if it processes input data and the processing time is measurably less than the time needed to instantiate and configure the app.
For the implementation, have a look at the Tika source code.
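To make the "set up once, then process as input arrives" idea concrete, here is a minimal sketch. It does not use Tika at all: WarmProcessor and the setup lambda are hypothetical stand-ins for whatever expensive object your application builds (a configured parser, for example), so treat it as an illustration of the pattern rather than Tika's actual server code.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;
import java.util.function.Supplier;

/** Hypothetical wrapper: pays the setup cost once, then handles any number of inputs. */
final class WarmProcessor {
    private final Function<String, String> engine;

    /** Runs the expensive setup (the second step in the list) exactly once, at construction. */
    WarmProcessor(Supplier<Function<String, String>> setup) {
        this.engine = setup.get();
    }

    /** The third step: per-input work only; nothing is re-initialized here. */
    String process(String input) {
        return engine.apply(input);
    }

    public static void main(String[] args) {
        AtomicInteger setups = new AtomicInteger();
        // Stand-in for building and configuring a parser; real setup would go here.
        WarmProcessor processor = new WarmProcessor(() -> {
            setups.incrementAndGet();
            return text -> text.trim().toUpperCase();
        });
        for (String doc : new String[] {" a ", " b ", " c "}) {
            System.out.println(processor.process(doc));
        }
        System.out.println("setups performed: " + setups.get());
    }
}
```

Whether this helps depends on how the setup cost compares with the per-file cost, so it is worth timing both before restructuring anything.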

I looked at the code in TikaServer; only the Parser object seems to be initialized. The other socket-related code is not required here. I tried initializing the Parser only once, but didn't see any improvement (when extracting the content of 100 files).
So, going by vadchen's answer:
JVM initialization isn't an issue for a running application; it happens only once anyway.
Tika initializes the Parser object, which doesn't seem to have much impact on performance.
So there isn't any performance improvement of the kind claimed by the article.

Related

Java - netty library execution time very big - Java NIO

I am developing a Java application that reads data from a Redis database. I use the Lettuce library to connect to Redis, which in turn uses the Netty library to communicate with Redis.
I suspect that the execution time of my application is greater than expected, so I conducted a profiling experiment using JProfiler. I was surprised that a FastThreadLocalRunnable takes a significant portion of the execution time with no apparent justification, as the tree shows no function calls taking time:
So, is it a bug in the Lettuce library, or is it a problem with how the profiler measures the execution time?
Any help is appreciated.
Edit:
Thanks to Ingo's answer I can now expand the tree, but it turns out that Java NIO is consuming my processor:
Any idea?

The call tree in JProfiler only shows classes that are included in the call tree filters that you define in the profiling settings:
By default, this excludes a lot of common frameworks and libraries so that you can get started without configuring anything. It is better if you delete these filters and add your own profiled packages here.
In addition to the profiled classes, JProfiler shows the thread entry point even if it is not a profiled class, such as io.netty.util.concurrent.FastThreadLocalRunnable. Also, the first call into non-profiled classes is always shown at any level in the call tree.
In your case there are call chains to non-profiled classes below io.netty.util.concurrent.FastThreadLocalRunnable that never call a profiled class. They could belong to some framework or to some part of your code that is not included in the profiled classes. This time has to go somewhere, so it is attributed to the io.netty.util.concurrent.FastThreadLocalRunnable node.
An easy way to check is to disable filtering in the profiling settings; you will then see all classes.
More information about call tree filters can be found at
https://www.ej-technologies.com/resources/jprofiler/help/doc/main/methodCallRecording.html

Generating java code from wsdl using cxf gives code too large error

I generated Java code from a WSDL using CXF 2.7.3, but when building the assembly I get a "code too large" error, indicating that one of the methods has exceeded Java's 64 KB limit. I know exactly which class it is, and to me this seems like a bug in CXF. Actually, Axis2 does the same, so I was wondering if anyone knows how to solve this.
The code I'm playing around with is provided here, in the path eco-api-ex / examples / java /
How can I force the code generation to split up the large generated method? Or should I use some external tool for this?
[ERROR] \workspace\e-conomics\target\generated\src\main\java\com\e_conomic\EconomicWebServiceSoap_EconomicWebServiceSoap12_Client.java:[34,23] error: code too large

Don't run wsdl2java with the -client flag. The _Client.java class is just a sample class to show how to use the generated service class and proxies and such. It's not normally needed for anything. That SHOULD be the only class generated with a large method like that.

You've got a 3MB WSDL document there. (No wonder my browser was a bit unhappy when I tried to view a general XML document of that size.) It's got around 3000 elements defined in it; also 3k messages and 4.5k operations. I don't know exactly what you're hitting the limit in (there's a few places where registries of all entities of a particular type are constructed) but it doesn't matter too much. It's just far too large for most code to normally handle. (The limit you're hitting appears to be the one on the total size of bytecode for a method; hitting that is usually an indication of something somewhere else going badly awry: in this case, it's the bunker-busting WSDL document.)
Constructing a cut-down version that has a much smaller set of elements, messages and operations would be an excellent idea. Putting that cut-down version in your repository where Maven can find it (e.g., in src/main/wsdl) would also make a lot of sense, as it would stop you from downloading that 3MB document again each time you build.

JAXWS Client Timeout on Jboss

How can I set a timeout for a JAX-WS client? I'm using JBoss 5.1.
I was trying to do this with
bp.getRequestContext().put("com.sun.xml.ws.connect.timeout", 100);
bp.getRequestContext().put("com.sun.xml.ws.request.timeout", 100);
but it doesn't work. It works fine for a standalone client.
When I tried to use
bp.getRequestContext().put("com.sun.xml.ws.request.timeout", 100);
I got org.jboss.ws.core.WSTimeoutException: Timeout after: 100ms, but it happens after 300 ms (3 * 100 ms).
Can anyone help me with this issue?

While this looks likely to be an oversight on your part, the settings for JAX-WS timeouts can depend on the specific implementation (RI) you're building on.
You could try these settings (they are meant to be used in pairs):
BindingProviderProperties.REQUEST_TIMEOUT
BindingProviderProperties.CONNECT_TIMEOUT
BindingProviderProperties should be from com.sun.xml.internal.ws.client
Or the strings
javax.xml.ws.client.connectionTimeout
javax.xml.ws.client.receiveTimeout
All properties are to be put on getRequestContext(), in milliseconds.
BTW, how were you able to time the milliseconds without code :)?
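Since it is not obvious which key a given stack honours, one option is to set both pairs of string keys in one place. The sketch below is only an illustration: JaxWsTimeouts is a hypothetical helper, and the HashMap in main stands in for the Map<String, Object> that bp.getRequestContext() returns. The key spellings are the ones from this thread (com.sun.xml.ws.connect.timeout, com.sun.xml.ws.request.timeout, javax.xml.ws.client.connectionTimeout and javax.xml.ws.client.receiveTimeout); whether JBoss 5.1 reads any of them is something you would still have to verify.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper that writes the timeout keys mentioned in this thread. */
final class JaxWsTimeouts {

    /** Puts connect and request timeouts (milliseconds) under both key spellings. */
    static void apply(Map<String, Object> requestContext, int connectMillis, int requestMillis) {
        // Keys from the question (JAX-WS RI style).
        requestContext.put("com.sun.xml.ws.connect.timeout", connectMillis);
        requestContext.put("com.sun.xml.ws.request.timeout", requestMillis);
        // Alternative string keys from the answer.
        requestContext.put("javax.xml.ws.client.connectionTimeout", connectMillis);
        requestContext.put("javax.xml.ws.client.receiveTimeout", requestMillis);
    }

    public static void main(String[] args) {
        // Stand-in for ((BindingProvider) port).getRequestContext().
        Map<String, Object> requestContext = new HashMap<>();
        apply(requestContext, 100, 100);
        requestContext.forEach((key, value) -> System.out.println(key + " = " + value));
    }
}
```

In real code you would pass bp.getRequestContext() instead of the HashMap, before making the first call on the port.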

JAX-WS uses JAXB for marshalling and unmarshalling.
In the container it is probably taking more time because the JAXBContext is scanning your classpath.
If that's the case, try eager initialization of the JAXBContext:
JBossWS may perform differently during the first method invocation of
each service and the following ones when huge wsdl contracts (with
hundreds of imported xml schemas) are referenced. This is due to a
bunch of operations performed internally during the first invocation
whose resulting data is then cached and reused during the following
ones. While this is usually not a problem, you might be interested in
having almost the same performance on each invocation. This can be
achieved setting the org.jboss.ws.eagerInitializeJAXBContextCache
system property to true, both on server side (in the JBoss start
script) and on client side (a convenient constant is available in
org.jboss.ws.Constants). The JAXBContext creation is usually
responsible for most of the time required by the stack during the
first invocation; this feature makes JBossWS try to eagerly create and
cache the JAXB contexts before the first invocation is handled.
http://www.mastertheboss.com/javaee/jboss-web-services/web-services-performance-tuning
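On the client side, the property from the quoted note can be set programmatically. This is a minimal sketch: EagerJaxbInit is a hypothetical class name, the property string is copied from the quote, and I am assuming it has to be set before the first web service invocation in that JVM (the quote mentions a constant in org.jboss.ws.Constants, but I have not looked up its name, so the literal is used here).

```java
/** Hypothetical helper that switches on eager JAXB context creation for JBossWS. */
final class EagerJaxbInit {
    // Property name as quoted in the JBossWS note.
    static final String EAGER_JAXB_PROPERTY = "org.jboss.ws.eagerInitializeJAXBContextCache";

    /** Intended to be called early, before any service proxy is used. */
    static void enable() {
        System.setProperty(EAGER_JAXB_PROPERTY, "true");
    }

    public static void main(String[] args) {
        enable();
        System.out.println(EAGER_JAXB_PROPERTY + "=" + System.getProperty(EAGER_JAXB_PROPERTY));
    }
}
```

On the server side the same property would be passed as a -D option in the JBoss start script, as the quote describes.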

Java RMI: statistics for object stubs

I'd like to profile the network overhead of my RMI-based application. For instance, I'd be interested to know how many bytes a stub transferred over the network, or how many method calls were made through it. I can't find anything in the RMI API to hook into, though. Is this possible at all?

I am not particularly fond of RMI and have found JSON-based protocols, Thrift and even XML-RPC easier to work with. However, sometimes we don't have a choice.
There is a microbenchmark suite for RMI, as well as object serialization, in the "test" tree of the jdk7/jdk repository, see:
jdk/test/java/rmi/reliability/benchmark
The script:
jdk/test/java/rmi/reliability/scripts/create_benchmark_jars.ksh
shows how to create the two JAR files which are used in the benchmarking. You can pass command-line parameters to each instance for specific settings such as repetitions per run, etc. (One instance of the jar will run as the client and the other as the server, which is also configured from a command-line parameter.)
I haven't played much with this myself - usually trusting existing benchmarks, for example:
http://daniel.gredler.net/2008/01/07/java-remoting-protocol-benchmarks
...or using tools such as (I haven't looked much at the last two):
JMeter (http://jmeter.apache.org/), Soap-stone (http://soap-stone.sourceforge.net/) or
JVM-serialisers (https://github.com/eishay/jvm-serializers/wiki/)
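For the "how many method calls went through the stub" part of the question, one approach that does not need any hook in the RMI API is to wrap the stub in a java.lang.reflect.Proxy and count invocations there. This is a different technique from the benchmarks above, and only a sketch: CallCounter and Greeter are hypothetical names, and the lambda in main stands in for a stub you would normally get from a registry lookup. It counts calls only; it does not measure bytes on the wire.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Method;
import java.lang.reflect.Proxy;
import java.rmi.Remote;
import java.rmi.RemoteException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical remote interface used only for the demo. */
interface Greeter extends Remote {
    String greet(String name) throws RemoteException;
}

/** Counts calls per method name, then forwards them to the wrapped object. */
final class CallCounter implements InvocationHandler {
    private final Object target;
    private final Map<String, AtomicLong> counts = new ConcurrentHashMap<>();

    CallCounter(Object target) {
        this.target = target;
    }

    /** Returns a proxy implementing iface that records each call before delegating. */
    <T> T proxy(Class<T> iface) {
        Object p = Proxy.newProxyInstance(iface.getClassLoader(), new Class<?>[] {iface}, this);
        return iface.cast(p);
    }

    @Override
    public Object invoke(Object proxy, Method method, Object[] args) throws Throwable {
        counts.computeIfAbsent(method.getName(), k -> new AtomicLong()).incrementAndGet();
        try {
            return method.invoke(target, args);
        } catch (InvocationTargetException e) {
            throw e.getCause(); // rethrow the original exception from the target
        }
    }

    long count(String methodName) {
        AtomicLong c = counts.get(methodName);
        return c == null ? 0 : c.get();
    }

    public static void main(String[] args) throws Exception {
        // Local stand-in for a real stub obtained from an RMI registry.
        CallCounter counter = new CallCounter((Greeter) name -> "hello " + name);
        Greeter greeter = counter.proxy(Greeter.class);
        greeter.greet("a");
        greeter.greet("b");
        System.out.println("greet calls: " + counter.count("greet"));
    }
}
```

Counting bytes would need a different hook, for example socket factories that wrap the streams, which this sketch does not attempt.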

XML With SimpleXML Library - Performance on Android

I'm using the Simple XML library to process XML files in my Android application. These files can get quite big, around 1 MB, and can be nested pretty deeply, so they are fairly complex.
When the app loads one of these files via the Simple API, it can take up to 30 seconds to complete. Currently I am passing a FileInputStream into the read(Class, InputStream) method of Simple's Persister class. Effectively it just reads the XML nodes and maps the data to instances of my model objects, replicating the XML tree structure in memory.
My question then is how do I improve the performance on Android? My current thought would be to read the contents of the file into a byte array, and pass a ByteArrayInputStream to the Persister's read method instead. I imagine the time to process the file would be shorter, but I'm not sure if the time saved would be counteracted by the time taken to read the whole file first. Also, memory constraints might be an issue.
Is this a fool's errand? Is there anything else I could do to increase the performance in this situation? If not, I will just have to resort to improving the feedback to the user on the progress of loading the file.
Some caveats:
1) I can't change the XML library I'm using - the code in question is part of an "engine" which is used across desktop, mobile and web applications. The overhead to change it would be too much for the time being.
2) The data files are created by users so I don't have control over the size/depth of nesting in them.

Well, there are many things you can do to improve this. Here they are.
1) On Android you should be using at least version 2.5.2, but ideally 2.5.3 as it uses KXML which is much faster and more memory efficient on Android.
2) Simple will dynamically build your object graph, this means that it will need to load classes that have not already been loaded, and build a schema for each class based on its annotations using reflection. So first use will always be by far the most expensive. Repeated use of the same persister instance will be many times faster. So try to avoid multiple persister instances, just use one if possible.
3) Try measuring the time taken to read the file directly, without using the Simple XML library. How long does it take? If it takes forever, then you know the performance hit here is due to the file system. Try using a BufferedInputStream to improve performance.
4) If you still notice any issues, raise it on the mailing list.
EDIT:
Android has some issues with annotation processing (https://code.google.com/p/android/issues/detail?id=7811); Simple 2.6.6 has implemented workarounds for these issues. Performance increases of 10 times can be observed.
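For point 3, a rough way to get the baseline is to drain the file through a BufferedInputStream without any XML parsing and time just that. The sketch below uses only java.io; RawReadTimer is a hypothetical name and the file path is taken from the command line, so adapt it to wherever your app stores the file.

```java
import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical helper: measures raw file read time, with no XML parsing involved. */
final class RawReadTimer {

    /** Reads the whole file through a BufferedInputStream and returns the byte count. */
    static long drain(File file) throws IOException {
        long total = 0;
        byte[] chunk = new byte[4096];
        try (InputStream in = new BufferedInputStream(new FileInputStream(file))) {
            int n;
            while ((n = in.read(chunk)) != -1) {
                total += n;
            }
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        File file = new File(args[0]);
        long start = System.nanoTime();
        long bytes = drain(file);
        long millis = (System.nanoTime() - start) / 1_000_000;
        System.out.println(bytes + " bytes read in " + millis + " ms");
    }
}
```

If this number is small compared with the 30 seconds you see, the time is going into parsing and object building rather than file I/O, and points 1 and 2 are the ones to focus on.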
