Can we make Lucene IndexWriter serializable for ExecutionContext of Spring Batch? - java

This question is related to another SO question of mine.
To keep an IndexWriter open for the duration of a partitioned step, I thought of adding the IndexWriter to the partitioner's ExecutionContext and then closing it in a StepExecutionListenerSupport's afterStep(StepExecution stepExecution) method.
The challenge I am facing with this approach is that ExecutionContext needs objects to be serializable.
In light of these two questions, Q1 and Q2, it doesn't seem feasible, because I can't add a no-arg constructor to my custom writer: IndexWriter doesn't have a no-arg constructor.
public class CustomIndexWriter extends IndexWriter implements Serializable {

    private static final long serialVersionUID = 1L;

    /*
    private Directory d;
    private IndexWriterConfig conf;

    public CustomIndexWriter() {
        super();
        super(this.d, this.conf);
    }
    */

    public CustomIndexWriter(Directory d, IndexWriterConfig conf) throws IOException {
        super(d, conf);
    }

    private void readObject(ObjectInputStream input) throws IOException, ClassNotFoundException {
        input.defaultReadObject();
    }

    private void writeObject(ObjectOutputStream output) throws IOException, ClassNotFoundException {
        output.defaultWriteObject();
    }
}
In the above code, I can't add the constructor shown commented out, because there is no no-arg constructor in the superclass and I can't access this fields before calling super.
Is there a way to achieve this?

You can always add a parameter-less constructor.
E.g.:
public class CustomWriter extends IndexWriter implements Serializable {

    private Directory lDirectory;
    private IndexWriterConfig iwConfig;

    public CustomWriter() {
        super();
        // Assign default values
        this(new Directory("." + System.getProperty("path.separator")), new IndexWriterConfig());
    }

    public CustomWriter(Directory dir, IndexWriterConfig iwConf) {
        lDirectory = dir;
        iwConfig = iwConf;
    }

    public Directory getDirectory() { return lDirectory; }

    public IndexWriterConfig getConfig() { return iwConfig; }

    public void setDirectory(Directory dir) { lDirectory = dir; }

    public void setConfig(IndexWriterConfig conf) { iwConfig = conf; }

    // ...
}
EDIT:
Having taken a look at my own code (which uses Lucene.Net), I see the IndexWriter also needs an Analyzer and a MaxFieldLength.
So the super-call would look something like this:
super(new Directory("." + System.getProperty("path.separator")), new StandardAnalyzer(), MaxFieldLength.UNLIMITED);
So adding these values as defaults should fix the issue. Maybe then add getter and setter methods for the Analyzer and MaxFieldLength, so you have control over them at a later stage.

I am not sure how, but the approach below works with Spring Batch, and the ExecutionContext returns a non-null object in StepExecutionListenerSupport.
public class CustomIndexWriter implements Serializable {

    private static final long serialVersionUID = 1L;

    private transient IndexWriter luceneIndexWriter;

    public CustomIndexWriter(IndexWriter luceneIndexWriter) {
        this.luceneIndexWriter = luceneIndexWriter;
    }

    public IndexWriter getLuceneIndexWriter() {
        return luceneIndexWriter;
    }

    public void setLuceneIndexWriter(IndexWriter luceneIndexWriter) {
        this.luceneIndexWriter = luceneIndexWriter;
    }
}
I put an instance of CustomIndexWriter into the step partitioner's ExecutionContext, the partitioned step's chunk works with the writer obtained via getLuceneIndexWriter(), and then in StepExecutionListenerSupport I close that writer.
This way my Spring Batch partitioned step works with a single instance of the Lucene IndexWriter object.
I was expecting a NullPointerException when performing operations on the writer obtained from getLuceneIndexWriter(), since the field is transient, but that doesn't happen. I am not sure why this works, but it does.
For Spring Batch job metadata I am using the in-memory repository, not the db-based one. I am not sure whether this will continue to work once I start using a db for metadata.
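A minimal sketch of that wiring, assuming a partitioner and listener along these lines (the class names and the "customIndexWriter" context key are illustrative, not taken from the original code):
public class LucenePartitioner implements Partitioner {

    private final CustomIndexWriter customIndexWriter;

    public LucenePartitioner(CustomIndexWriter customIndexWriter) {
        this.customIndexWriter = customIndexWriter;
    }

    @Override
    public Map<String, ExecutionContext> partition(int gridSize) {
        Map<String, ExecutionContext> partitions = new HashMap<>();
        for (int i = 0; i < gridSize; i++) {
            ExecutionContext context = new ExecutionContext();
            // every partition shares the same wrapper instance
            context.put("customIndexWriter", customIndexWriter);
            partitions.put("partition" + i, context);
        }
        return partitions;
    }
}

public class IndexWriterClosingListener extends StepExecutionListenerSupport {

    private final CustomIndexWriter customIndexWriter;

    public IndexWriterClosingListener(CustomIndexWriter customIndexWriter) {
        this.customIndexWriter = customIndexWriter;
    }

    @Override
    public ExitStatus afterStep(StepExecution stepExecution) {
        try {
            // close the single shared IndexWriter once the partitioned step is done
            customIndexWriter.getLuceneIndexWriter().close();
        } catch (IOException e) {
            throw new IllegalStateException("Failed to close Lucene IndexWriter", e);
        }
        return stepExecution.getExitStatus();
    }
}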

Related

Custom Serialization capability for EntryProcessor in Hazelcast

Do we have custom serialization capability for EntryProcessor or ExecutorService? The Hazelcast documentation does not specify anything in this regard, and there are no samples related to custom serialization of an EntryProcessor. We are looking for Portable serialization of the EntryProcessor.
public class SampleEntryProcessor implements EntryProcessor<SampleDataKey, SampleDataValue, SampleDataValue>, Portable {

    private static final long serialVersionUID = 1L;

    private SampleDataValue sampleDataValue;

    @Override
    public SampleDataValue process(Map.Entry<SampleDataKey, SampleDataValue> entry) {
        // Sample logic here
        return null;
    }

    @Override
    public int getFactoryId() {
        return 1;
    }

    @Override
    public int getClassId() {
        return 1;
    }

    @Override
    public void writePortable(PortableWriter writer) throws IOException {
        writer.writePortable("i", sampleDataValue);
    }

    @Override
    public void readPortable(PortableReader reader) throws IOException {
        sampleDataValue = reader.readPortable("i");
    }
}
UPDATE: When I try to call the processor I get the following error.
Exception in thread "main" java.lang.ClassCastException: com.hazelcast.internal.serialization.impl.portable.DeserializedPortableGenericRecord cannot be cast to com.hazelcast.map.EntryProcessor
at com.hazelcast.client.impl.protocol.task.map.MapExecuteOnKeyMessageTask.prepareOperation(MapExecuteOnKeyMessageTask.java:42)
at com.hazelcast.client.impl.protocol.task.AbstractPartitionMessageTask.processInternal(AbstractPartitionMessageTask.java:45)
Yes, you can use different serialization mechanisms to serialize entry processors, provided that they are correctly configured on the sender and receiver sides. So, after making sure that the Portable factory for your class is registered on the members and on the instance you are sending the entry processor from (for example, your client), it should work.
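A minimal sketch of that registration, assuming a factory class named SamplePortableFactory and factory id 1 (both illustrative; they must match what getFactoryId() and getClassId() return in your entry processor):
// Factory Hazelcast uses to recreate the Portable on the receiving side.
public class SamplePortableFactory implements PortableFactory {

    public static final int FACTORY_ID = 1;

    @Override
    public Portable create(int classId) {
        if (classId == 1) {
            return new SampleEntryProcessor();
        }
        return null;
    }
}

// Member side
Config memberConfig = new Config();
memberConfig.getSerializationConfig()
            .addPortableFactory(SamplePortableFactory.FACTORY_ID, new SamplePortableFactory());

// Client side (the instance that submits the entry processor)
ClientConfig clientConfig = new ClientConfig();
clientConfig.getSerializationConfig()
            .addPortableFactory(SamplePortableFactory.FACTORY_ID, new SamplePortableFactory());
Without the factory registered on the members, the member deserializes the entry processor as a generic Portable record rather than your class, which is consistent with the ClassCastException in the update.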

Spring Boot batch - MultiResourceItemReader : move to next file on error

In a batch service, I read multiple XML files using a MultiResourceItemReader, which delegates to a StaxEventItemReader.
If an error is raised while reading a file (a parsing exception, for example), I would like to tell Spring to start reading the next matching file, using the @OnReadError annotation and/or a SkipPolicy, for example.
Currently, when a reading exception is raised, the batch stops.
Does anyone have an idea how to do it?
EDIT: I see MultiResourceItemReader has a method readNextItem(), but it's private -_-
I haven't used Spring Batch for a while, but looking at the MultiResourceItemReader code I suppose you can write your own ResourceAwareItemReaderItemStream wrapper that checks a flag: if the flag is set, move to the next file; otherwise perform a standard read using a delegate.
This flag can be stored in the execution context or in your wrapper, and should be cleared after a move-next.
class MoveNextReader<T> implements ResourceAwareItemReaderItemStream<T> {

    private ResourceAwareItemReaderItemStream<T> delegate;
    private boolean skipThisFile = false;

    public void setSkipThisFile(boolean value) {
        skipThisFile = value;
    }

    public void setResource(Resource resource) {
        skipThisFile = false;
        delegate.setResource(resource);
    }

    public T read() {
        if (skipThisFile) {
            skipThisFile = false;
            // Returning null forces MultiResourceItemReader to move to the next resource
            return null;
        }
        return delegate.read();
    }
}
Use this class as the delegate for MultiResourceItemReader, and in @OnReadError inject the MoveNextReader and set MoveNextReader.skipThisFile.
I can't test this code myself, but I hope it is a good starting point.
Here are my final classes for reading multiple XML files and jumping to the next file when a read error occurs (thanks to Luca's idea).
My custom ItemReader, extending MultiResourceItemReader:
public class MyItemReader extends MultiResourceItemReader<InputElement> {

    private SkippableResourceItemReader<InputElement> reader;

    public MyItemReader() throws IOException {
        super();

        // Resources
        PathMatchingResourcePatternResolver resourceResolver = new PathMatchingResourcePatternResolver();
        this.setResources(resourceResolver.getResources("classpath:input/inputFile*.xml"));

        // Delegate reader
        reader = new SkippableResourceItemReader<InputElement>();
        StaxEventItemReader<InputElement> delegateReader = new StaxEventItemReader<InputElement>();
        delegateReader.setFragmentRootElementName("inputElement");
        Jaxb2Marshaller unmarshaller = new Jaxb2Marshaller();
        unmarshaller.setClassesToBeBound(InputElement.class);
        delegateReader.setUnmarshaller(unmarshaller);
        reader.setDelegate(delegateReader);
        this.setDelegate(reader);
    }

    [...]

    @OnReadError
    public void onReadError(Exception exception) {
        reader.setSkipResource(true);
    }
}
And the ItemReader-in-the-middle used to skip the current resource:
public class SkippableResourceItemReader<T> implements ResourceAwareItemReaderItemStream<T> {

    private ResourceAwareItemReaderItemStream<T> delegate;
    private boolean skipResource = false;

    @Override
    public void close() throws ItemStreamException {
        delegate.close();
    }

    @Override
    public T read() throws UnexpectedInputException, ParseException, NonTransientResourceException, Exception {
        if (skipResource) {
            skipResource = false;
            return null;
        }
        return delegate.read();
    }

    @Override
    public void setResource(Resource resource) {
        skipResource = false;
        delegate.setResource(resource);
    }

    @Override
    public void open(ExecutionContext executionContext) throws ItemStreamException {
        delegate.open(executionContext);
    }

    @Override
    public void update(ExecutionContext executionContext) throws ItemStreamException {
        delegate.update(executionContext);
    }

    public void setDelegate(ResourceAwareItemReaderItemStream<T> delegate) {
        this.delegate = delegate;
    }

    public void setSkipResource(boolean skipResource) {
        this.skipResource = skipResource;
    }
}
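For the @OnReadError callback to fire and the step to carry on instead of failing, the step also needs to be configured as fault tolerant. A minimal sketch of such a step definition, assuming Java config with a StepBuilderFactory and an existing writer bean (the bean names and chunk size are illustrative, not from the original answer):
@Bean
public Step readXmlFilesStep(StepBuilderFactory stepBuilderFactory,
                             MyItemReader reader,
                             ItemWriter<InputElement> inputElementWriter) {
    return stepBuilderFactory.get("readXmlFilesStep")
            .<InputElement, InputElement>chunk(10)
            .reader(reader)
            .writer(inputElementWriter)
            .faultTolerant()
            .skip(Exception.class)        // read errors are skipped instead of failing the step
            .skipLimit(Integer.MAX_VALUE)
            .listener(reader)             // registers MyItemReader's @OnReadError method as a listener
            .build();
}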

How to resolve NoSuchFieldError exception when testing Lucene 4.0

I want to test my own Analyzer. The following is test code from Lucene in Action, 2nd Edition, Listing 4.2, page 121.
public class AnalyzerUtils {

    public static void displayTokens(Analyzer analyzer, String text) throws IOException {
        TokenStream tokenStream = analyzer.tokenStream("contents", new StringReader(text));
        displayTokens(tokenStream);
    }

    public static void displayTokens(TokenStream stream) throws IOException {
        CharTermAttribute term = stream.getAttribute(CharTermAttribute.class);
        while (stream.incrementToken()) {
            System.out.println(Arrays.toString(term.buffer()));
        }
    }
}
My custom Analyzer is:
static class SimpleAnalyzer extends Analyzer {

    static class SimpleFilter extends TokenFilter {
        protected SimpleFilter(TokenStream input) { super(input); }

        @Override
        public boolean incrementToken() throws IOException { return false; }
    }

    @Override
    protected TokenStreamComponents createComponents(String s, Reader reader) {
        Tokenizer tokenizer = new WhitespaceTokenizer(reader);
        return new TokenStreamComponents(tokenizer, new SimpleFilter(tokenizer));
    }
}

static class FilteringAnalyzer extends Analyzer {

    static class FilteringFilter extends FilteringTokenFilter {
        public FilteringFilter(TokenStream in) { super(in); }

        @Override
        protected boolean accept() throws IOException { return false; }
    }

    @Override
    protected TokenStreamComponents createComponents(String s, Reader reader) {
        Tokenizer tokenizer = new WhitespaceTokenizer(reader);
        return new TokenStreamComponents(tokenizer, new FilteringFilter(tokenizer));
    }
}
The problem is that running AnalyzerUtils.displayTokens(new SimpleAnalyzer(), "美国 法国 中国"); works fine; however, running AnalyzerUtils.displayTokens(new FilteringAnalyzer(), "美国 法国 中国"); throws this exception:
Exception in thread "main" java.lang.NoSuchFieldError: LATEST
at org.apache.lucene.analysis.util.FilteringTokenFilter.<init>(FilteringTokenFilter.java:70)
at cn.edu.nju.ws.miliqa.nlp.ner.index.NameEntityIndexing$FilteringFilter.<init>(NameEntityIndexing.java:62)
at cn.edu.nju.ws.miliqa.nlp.ner.index.NameEntityIndexing$FilteringAnalyzer.createComponents(NameEntityIndexing.java:83)
at org.apache.lucene.analysis.Analyzer.tokenStream(Analyzer.java:134)
at cn.edu.nju.ws.miliqa.lucene.AnalyzerUtils.displayTokens(AnalyzerUtils.java:19)
The difference between the two test cases is whether the filter in the analyzer extends TokenFilter or FilteringTokenFilter. I have been working on this for three days but still have no idea. What is the reason for this odd exception?
java.lang.NoSuchFieldError is thrown at runtime when one class attempts to access a field of another class that doesn't exist. Here the offending class is FilteringTokenFilter.
Most likely, you have multiple versions of Lucene in your classpath.
You mention you are using 4.0 in the title, but Version.LATEST (the field this exception complains is missing) was not introduced until Lucene 4.10.
That implies that perhaps you have a copy of FilteringTokenFilter.class in a Lucene 4.10+ jar file attempting to find the field "LATEST" in an older (4.0?) Version.class file.
Check that you have only one copy each of the "lucene-core" and "lucene-analyzers-common" jar files on your classpath, and that their version numbers match. If you are not sure, download them again to ensure you have matching versions.
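A quick way to confirm which jars the two classes are actually loaded from (a diagnostic sketch, not from the original answer; the two printed locations should name lucene-core and lucene-analyzers-common jars of the same version):
// Prints the jar file each class was loaded from.
System.out.println(org.apache.lucene.util.Version.class
        .getProtectionDomain().getCodeSource().getLocation());
System.out.println(org.apache.lucene.analysis.util.FilteringTokenFilter.class
        .getProtectionDomain().getCodeSource().getLocation());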

Using guava-libraries LoadingCache to cache java pojo

I am using the guava-libraries LoadingCache to cache classes in my app.
Here is the class I have come up with.
public class MethodMetricsHandlerCache {

    private Object targetClass;
    private Method method;
    private Configuration config;

    private LoadingCache<String, MethodMetricsHandler> handlers = CacheBuilder.newBuilder()
            .maximumSize(1000)
            .build(new CacheLoader<String, MethodMetricsHandler>() {
                public MethodMetricsHandler load(String identifier) {
                    return createMethodMetricsHandler(identifier);
                }
            });

    private MethodMetricsHandler createMethodMetricsHandler(String identifier) {
        return new MethodMetricsHandler(targetClass, method, config);
    }

    public void setTargetClass(Object targetClass) {
        this.targetClass = targetClass;
    }

    public void setMethod(Method method) {
        this.method = method;
    }

    public void setConfig(Configuration config) {
        this.config = config;
    }

    public MethodMetricsHandler getHandler(String identifier) throws ExecutionException {
        return handlers.get(identifier);
    }
}
I am using this class as follows to cache the MethodMetricsHandler
...
private static MethodMetricsHandlerCache methodMetricsHandlerCache = new MethodMetricsHandlerCache();
...
MethodMetricsHandler handler = getMethodMetricsHandler(targetClass, method, config);

private MethodMetricsHandler getMethodMetricsHandler(Object targetClass, Method method, Configuration config) throws ExecutionException {
    String identifier = targetClass.getClass().getCanonicalName() + "." + method.getName();
    methodMetricsHandlerCache.setTargetClass(targetClass);
    methodMetricsHandlerCache.setMethod(method);
    methodMetricsHandlerCache.setConfig(config);
    return methodMetricsHandlerCache.getHandler(identifier);
}
My question:
Is this creating a cache of MethodMetricsHandler objects keyed on the identifier? (I have not used this before, so this is just a sanity check.)
Also, is there a better approach, given that I will have multiple instances (hundreds) of the same MethodMetricsHandler for a given identifier if I do not cache?
Yes, it does create a cache of MethodMetricsHandler objects. This approach is generally not bad; however, I might be able to say more if you described your use case, because this solution is quite unusual. You have partially reinvented the factory pattern.
Also think about some suggestions:
It's very odd that you need to call three setters before calling getHandler (see the sketch after this list for one way to avoid that).
As the Configuration is not part of the key, you'll get the same object back from the cache for different configurations with the same targetClass and method.
Why is targetClass an Object? You may want to pass a Class<?> instead.
Are you planning to evict objects from the cache?
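On the first point, one option is to pass everything the loader needs at lookup time rather than through mutable fields. A minimal sketch, assuming Guava's Cache.get(key, valueLoader) overload and the same MethodMetricsHandler constructor as in the question (the surrounding names are otherwise illustrative):
public class MethodMetricsHandlerCache {

    private final Cache<String, MethodMetricsHandler> handlers = CacheBuilder.newBuilder()
            .maximumSize(1000)
            .build();

    public MethodMetricsHandler getHandler(final Object target, final Method method,
                                           final Configuration config) throws ExecutionException {
        String identifier = target.getClass().getCanonicalName() + "." + method.getName();
        // The Callable only runs on a cache miss, so there are no setters or shared mutable fields.
        return handlers.get(identifier, new Callable<MethodMetricsHandler>() {
            @Override
            public MethodMetricsHandler call() {
                return new MethodMetricsHandler(target, method, config);
            }
        });
    }
}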

Reading and writing multiple files in parallel

I need to write a program in Java which will read a relatively large number (~50,000) of files in a directory tree, process the data, and output the processed data in a separate (flat) directory.
Currently I have something like this:
private void crawlDirectoryAndProcessFiles(File directory) {
    for (File file : directory.listFiles()) {
        if (file.isDirectory()) {
            crawlDirectoryAndProcessFiles(file);
        } else {
            Data d = readFile(file);
            ProcessedData p = d.process();
            writeFile(p, file.getAbsolutePath(), outputDir);
        }
    }
}
Suffice it to say that each of those methods has been trimmed down for ease of reading, but they all work fine. The whole process works fine, except that it is slow. The processing of the data happens via a remote service and takes between 5 and 15 seconds. Multiply that by 50,000...
I've never done anything multi-threaded before, but I figure I can get some pretty good speed increases if I do. Can anyone give some pointers on how I can effectively parallelise this method?
I would use a ThreadPoolExecutor to manage the threads. You can do something like this:
private class Processor implements Runnable {

    private final File file;

    public Processor(File file) {
        this.file = file;
    }

    @Override
    public void run() {
        Data d = readFile(file);
        ProcessedData p = d.process();
        writeFile(p, file.getAbsolutePath(), outputDir);
    }
}

private void crawlDirectoryAndProcessFiles(File directory, Executor executor) {
    for (File file : directory.listFiles()) {
        if (file.isDirectory()) {
            crawlDirectoryAndProcessFiles(file, executor);
        } else {
            executor.execute(new Processor(file));
        }
    }
}
You would obtain an Executor using:
ExecutorService executor = Executors.newFixedThreadPool(poolSize);
where poolSize is the maximum number of threads you want running at once. (It's important to choose a reasonable number here; 50,000 threads isn't exactly a good idea. A reasonable number might be 8.) Note that after you've queued all the files, your main thread can wait until everything is done by calling executor.shutdown() followed by executor.awaitTermination.
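Putting it together, a minimal sketch of the driver code (the pool size of 8 and the input path are illustrative):
ExecutorService executor = Executors.newFixedThreadPool(8);
crawlDirectoryAndProcessFiles(new File("/path/to/input"), executor);
executor.shutdown();                          // stop accepting new tasks
executor.awaitTermination(1, TimeUnit.DAYS);  // block until every queued file is processed (throws InterruptedException)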
Assuming you have a single hard disk (i.e. something that only allows single simultaneous read operations, not an SSD, RAID array, network file system, etc.), then you only want one thread performing IO (reading from/writing to the disk). Also, you only want as many threads doing CPU-bound operations as you have cores, otherwise time will be wasted on context switching.
Given the above restrictions, the code below should work for you. The single threaded executor ensures that only one Runnable executes at any one time. The fixed thread pool ensures no more than NUM_CPUS Runnables are executing at any one time.
One thing this does not do is to provide feedback on when processing is finished.
private final static int NUM_CPUS = 4;

private final Executor _fileReaderWriter = Executors.newSingleThreadExecutor();
private final Executor _fileProcessor = Executors.newFixedThreadPool(NUM_CPUS);

private final class Data {}
private final class ProcessedData {}

private final class FileReader implements Runnable
{
    private final File _file;

    FileReader(final File file) { _file = file; }

    @Override public void run()
    {
        final Data data = readFile(_file);
        _fileProcessor.execute(new FileProcessor(_file, data));
    }

    private Data readFile(File file) { /* ... */ return null; }
}

private final class FileProcessor implements Runnable
{
    private final File _file;
    private final Data _data;

    FileProcessor(final File file, final Data data) { _file = file; _data = data; }

    @Override public void run()
    {
        final ProcessedData processedData = processData(_data);
        _fileReaderWriter.execute(new FileWriter(_file, processedData));
    }

    private ProcessedData processData(final Data data) { /* ... */ return null; }
}

private final class FileWriter implements Runnable
{
    private final File _file;
    private final ProcessedData _data;

    FileWriter(final File file, final ProcessedData data) { _file = file; _data = data; }

    @Override public void run()
    {
        writeFile(_file, _data);
    }

    private Data writeFile(final File file, final ProcessedData data) { /* ... */ return null; }
}

public void process(final File file)
{
    if (file.isDirectory())
    {
        for (final File subFile : file.listFiles())
            process(subFile);
    }
    else
    {
        _fileReaderWriter.execute(new FileReader(file));
    }
}
The easiest (and probably one of the most reasonable) ways is to have a thread pool (take a look at the corresponding Executor). The main thread is responsible for crawling the directory. When a file is encountered, create a "job" (a Runnable/Callable) and let the Executor handle it.
(This should be sufficient to get you started; I prefer not to give too much concrete code because it should not be difficult for you to figure out once you have read up on Executor, Callable, etc.)
