Does anyone know how, in Spring Batch (3.0.7), I can flatten the result of a processor that returns a list of entities?
Example:
I have a processor that returns a List:
public class MyProcessor implements ItemProcessor<Long, List<Entity>> {
    public List<Entity> process(Long id) { /* ... */ }
}
Now all subsequent processors and writers have to work on List<Entity>. Is there any way to flatten the result to a plain Entity, so that the later processors in the same step can work on single entities?
The only way I can see is to persist the list somehow with a writer and then create a separate step that reads from the persisted data.
Thanks in advance!
As you know, processors in Spring Batch can be chained with a composite processor. Within the chain, the item type may change from processor to processor, but the output type of each processor must of course match the input type of its neighbour.
However, input and output are always treated as a single item. Therefore, if the output type of a processor is a List, that list is regarded as one item. Hence, the following processor needs to have List as its input type, and if a writer follows, its write method needs to take a List of Lists.
Moreover, a processor cannot multiply its items: there can only be one output item for every input item.
Basically, there is nothing wrong with having a chain like
Reader<Integer>
ProcessorA<Integer,List<Integer>>
ProcessorB<List<Integer>,List<Integer>>
Writer<List<Integer>> (which leads to a write method write(List<List<Integer>> items))
Depending on the context, there could be a better solution.
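To make the "a List is one item" point concrete, here is a small plain-Java sketch. It does not use Spring Batch at all: `processChunk` is a hypothetical stand-in for what a chunk-oriented step does with a processor, and `flatten` stands in for what a list-aware writer would have to do. Both names are assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

final class ListItemChunkDemo {

    // Stand-in for a chunk-oriented step: every input yields at most ONE output item.
    // A null result means "filtered out", as with an ItemProcessor.
    static <I, O> List<O> processChunk(List<I> chunk, Function<I, O> processor) {
        List<O> out = new ArrayList<>();
        for (I item : chunk) {
            O result = processor.apply(item);
            if (result != null) {
                out.add(result);
            }
        }
        return out;
    }

    // Stand-in for what a list-aware writer has to do before writing single values.
    static <T> List<T> flatten(List<? extends List<? extends T>> listOfLists) {
        List<T> all = new ArrayList<>();
        for (List<? extends T> inner : listOfLists) {
            all.addAll(inner);
        }
        return all;
    }

    public static void main(String[] args) {
        List<Integer> chunk = Arrays.asList(1, 2, 3);
        // The "processor" returns a list per input, so the chunk holds three lists.
        List<List<Integer>> processed = processChunk(chunk, n -> Arrays.asList(n, n * 10));
        System.out.println(processed.size());   // 3
        System.out.println(flatten(processed)); // [1, 10, 2, 20, 3, 30]
    }
}
```

The three inputs produce three output items (each of them a list); only the flatten step at the very end turns them into six single values.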
You could mitigate the impact (for instance on reusability) by using wrapper processors and a wrapper writer like the following code examples:
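import java.util.ArrayList;
import java.util.List;
import org.springframework.batch.item.ItemProcessor;

public class ListWrapperProcessor<I, O> implements ItemProcessor<List<I>, List<O>> {
    private ItemProcessor<I, O> delegate;

    public void setDelegate(ItemProcessor<I, O> delegate) {
        this.delegate = delegate;
    }

    // "throws Exception" is required because delegate.process() declares it
    @Override
    public List<O> process(List<I> itemList) throws Exception {
        List<O> outputList = new ArrayList<>();
        for (I item : itemList) {
            O outputItem = delegate.process(item);
            if (outputItem != null) {
                outputList.add(outputItem);
            }
        }
        // returning null filters out the whole (empty) list
        if (outputList.isEmpty()) {
            return null;
        }
        return outputList;
    }
}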
import java.util.Collection;
import java.util.List;
import java.util.stream.Collectors;
import org.springframework.batch.item.ExecutionContext;
import org.springframework.batch.item.ItemStreamWriter;
import org.springframework.beans.factory.InitializingBean;
import org.springframework.util.Assert;

public class ListOfListItemWriter<T> implements InitializingBean, ItemStreamWriter<List<T>> {
    private ItemStreamWriter<T> itemWriter;

    @Override
    public void write(List<? extends List<T>> listOfLists) throws Exception {
        if (listOfLists.isEmpty()) {
            return;
        }
        // flatten the list of lists and hand the single items to the delegate
        List<T> all = listOfLists.stream().flatMap(Collection::stream).collect(Collectors.toList());
        itemWriter.write(all);
    }

    @Override
    public void afterPropertiesSet() throws Exception {
        Assert.notNull(itemWriter, "The 'itemWriter' may not be null");
    }

    public void setItemWriter(ItemStreamWriter<T> itemWriter) {
        this.itemWriter = itemWriter;
    }

    @Override
    public void close() {
        this.itemWriter.close();
    }

    @Override
    public void open(ExecutionContext executionContext) {
        this.itemWriter.open(executionContext);
    }

    @Override
    public void update(ExecutionContext executionContext) {
        this.itemWriter.update(executionContext);
    }
}
Using such wrappers, you could still implement "normal" processors and writers and then use the wrappers to move the List handling out of them.
Unless you can provide a compelling reason, there's no reason to send a List of Lists to your ItemWriter. This is not the way the ItemProcessor was intended to be used. Instead, you should create/configure an ItemReader to return one object together with its related objects.
For example, if you're reading from the database, you could use the HibernateCursorItemReader and a query that looks something like this:
"from ParentEntity parent left join fetch parent.childrenEntities"
Your data model SHOULD have a parent table with the Long id that you're currently passing to your ItemProcessor, so leverage that to your advantage. The reader would then pass back ParentEntity objects, each with a collection of ChildEntity objects that go along with it.
I would like to create an array_agg UDF for Apache Drill to be able to aggregate all values of a group to a list of values.
This should work with any major types (required, optional) and minor types (varchar, dict, map, int, etc.)
However, I get the impression that Apache Drill's UDF API does not really make use of inheritance and generics. Each type has its own writer and handler, and they cannot be abstracted to handle any type. E.g., the ValueHolder interface seems to be purely cosmetic and cannot be used to have type-agnostic hooking of UDFs to any type.
My current implementation
I tried to solve this by using Java's reflection so I could use the ListHolder's write function independent of the holder of the original value.
However, I then ran into the limitations of the @FunctionTemplate annotation.
I cannot create a general UDF annotation for any value (I tried it with the interface ValueHolder: @Param ValueHolder input).
So to me it seems like the only way to support different types is to have a separate class for each type. But I can't even abstract much and work on any @Param input, because input is only visible in the class where it is defined (i.e. type specific).
I based my implementation on https://issues.apache.org/jira/browse/DRILL-6963
and created the following two classes for required and optional varchars (how can this be unified in the first place?)
@FunctionTemplate(
    name = "array_agg",
    scope = FunctionScope.POINT_AGGREGATE,
    nulls = NullHandling.INTERNAL
)
public static class VarChar_Agg implements DrillAggFunc {
    @Param org.apache.drill.exec.expr.holders.VarCharHolder input;
    @Workspace ObjectHolder agg;
    @Output org.apache.drill.exec.vector.complex.writer.BaseWriter.ComplexWriter out;

    @Override
    public void setup() {
        agg = new ObjectHolder();
    }

    @Override
    public void reset() {
        agg = new ObjectHolder();
    }

    @Override
    public void add() {
        if (agg.obj == null) {
            // Initialise list object for output
            agg.obj = out.rootAsList();
        }
        org.apache.drill.exec.vector.complex.writer.BaseWriter.ListWriter listWriter =
            (org.apache.drill.exec.vector.complex.writer.BaseWriter.ListWriter) agg.obj;
        listWriter.varChar().write(input);
    }

    @Override
    public void output() {
        ((org.apache.drill.exec.vector.complex.writer.BaseWriter.ListWriter) agg.obj).endList();
    }
}
@FunctionTemplate(
    name = "array_agg",
    scope = FunctionScope.POINT_AGGREGATE,
    nulls = NullHandling.INTERNAL
)
public static class NullableVarChar_Agg implements DrillAggFunc {
    @Param NullableVarCharHolder input;
    @Workspace ObjectHolder agg;
    @Output org.apache.drill.exec.vector.complex.writer.BaseWriter.ComplexWriter out;

    @Override
    public void setup() {
        agg = new ObjectHolder();
    }

    @Override
    public void reset() {
        agg = new ObjectHolder();
    }

    @Override
    public void add() {
        if (agg.obj == null) {
            // Initialise list object for output
            agg.obj = out.rootAsList();
        }
        // Skip NULL inputs
        if (input.isSet != 1) {
            return;
        }
        org.apache.drill.exec.vector.complex.writer.BaseWriter.ListWriter listWriter =
            (org.apache.drill.exec.vector.complex.writer.BaseWriter.ListWriter) agg.obj;
        // Copy the nullable holder into a required holder before writing
        org.apache.drill.exec.expr.holders.VarCharHolder outHolder = new org.apache.drill.exec.expr.holders.VarCharHolder();
        outHolder.start = input.start;
        outHolder.end = input.end;
        outHolder.buffer = input.buffer;
        listWriter.varChar().write(outHolder);
    }

    @Override
    public void output() {
        ((org.apache.drill.exec.vector.complex.writer.BaseWriter.ListWriter) agg.obj).endList();
    }
}
Interestingly, I can't import org.apache.drill.exec.vector.complex.writer.BaseWriter to make the whole thing easier because then Apache Drill would not find it.
So I have to put the entire package path for everything in org.apache.drill.exec.vector.complex.writer in the code.
Furthermore, I'm using the deprecated ObjectHolder. Any better solution?
Anyway: These work so far, e.g. with this query:
SELECT
MIN(tbl.`timestamp`) AS start_view,
MAX(tbl.`timestamp`) AS end_view,
array_agg(tbl.eventLabel) AS label_agg
FROM `dfs.root`.`/path/to/avro/folder` AS tbl
WHERE tbl.data.slug IS NOT NULL
GROUP BY tbl.data.slug
However, when I use ORDER BY, I get this:
org.apache.drill.common.exceptions.UserRemoteException: SYSTEM ERROR: UnsupportedOperationException: NULL
Fragment 0:0
Additionally, I tried more complex types, namely maps/dicts.
Interestingly, when I call SELECT sqlTypeOf(tbl.data) FROM tbl, I get MAP.
But when I write UDFs, the query planner complains about having no UDF array_agg for type dict.
Anyway, I wrote a version for dicts:
@FunctionTemplate(
    name = "array_agg",
    scope = FunctionScope.POINT_AGGREGATE,
    nulls = NullHandling.INTERNAL
)
public static class Map_Agg implements DrillAggFunc {
    @Param MapHolder input;
    @Workspace ObjectHolder agg;
    @Output org.apache.drill.exec.vector.complex.writer.BaseWriter.ComplexWriter out;

    @Override
    public void setup() {
        agg = new ObjectHolder();
    }

    @Override
    public void reset() {
        agg = new ObjectHolder();
    }

    @Override
    public void add() {
        if (agg.obj == null) {
            // Initialise list object for output
            agg.obj = out.rootAsList();
        }
        org.apache.drill.exec.vector.complex.writer.BaseWriter.ListWriter listWriter =
            (org.apache.drill.exec.vector.complex.writer.BaseWriter.ListWriter) agg.obj;
        //listWriter.copyReader(input.reader);
        input.reader.copyAsValue(listWriter);
    }

    @Override
    public void output() {
        ((org.apache.drill.exec.vector.complex.writer.BaseWriter.ListWriter) agg.obj).endList();
    }
}
@FunctionTemplate(
    name = "array_agg",
    scope = FunctionScope.POINT_AGGREGATE,
    nulls = NullHandling.INTERNAL
)
public static class Dict_agg implements DrillAggFunc {
    @Param DictHolder input;
    @Workspace ObjectHolder agg;
    @Output org.apache.drill.exec.vector.complex.writer.BaseWriter.ComplexWriter out;

    @Override
    public void setup() {
        agg = new ObjectHolder();
    }

    @Override
    public void reset() {
        agg = new ObjectHolder();
    }

    @Override
    public void add() {
        if (agg.obj == null) {
            // Initialise list object for output
            agg.obj = out.rootAsList();
        }
        org.apache.drill.exec.vector.complex.writer.BaseWriter.ListWriter listWriter =
            (org.apache.drill.exec.vector.complex.writer.BaseWriter.ListWriter) agg.obj;
        //listWriter.copyReader(input.reader);
        input.reader.copyAsValue(listWriter);
    }

    @Override
    public void output() {
        ((org.apache.drill.exec.vector.complex.writer.BaseWriter.ListWriter) agg.obj).endList();
    }
}
But here, I get an empty list in the field data_agg for my query:
SELECT
MIN(tbl.`timestamp`) AS start_view,
MAX(tbl.`timestamp`) AS end_view,
array_agg(tbl.data) AS data_agg
FROM `dfs.root`.`/path/to/avro/folder` AS tbl
GROUP BY tbl.data.viewSlag
Summary of questions
Most importantly: How do I create an array_agg UDF for Apache Drill?
How can UDFs be made type-agnostic/general purpose? Do I really have to implement an entire class for each Nullable, Required and Repeated version of all types? That's a lot to do and quite tedious. Isn't there a way to handle values in a UDF without depending on the underlying types?
I wish Apache Drill would just use what Java offers here: generic types, specialised function overloading and inheritance within their own type system. Am I missing something on how to do that?
How can I fix the NULL problem when I use ORDER BY on my varchar version of the aggregate?
How can I fix the problem where my aggregate of maps/dicts is an empty list?
Is there an alternative to using the deprecated ObjectHolder?
To answer your question: unfortunately, you've run into one of the limits of the Drill aggregate UDF API, which is that it can only return simple data types. It would be a great improvement to Drill to fix this, but that is the current status. If you're interested in discussing that further, please start a thread on the Drill user group and/or Slack channel. I don't think it is impossible, but it would require some modification to the Drill internals. IMHO it would be well worth it, because there are a few other UDFs that I'd like to implement that need this feature.
The second part of your question is how to make UDFs type agnostic, and once again... you've found yet another bit of ugliness in the UDF API. :-) If you do some digging in the codebase, you'll see that most of the math functions have separate versions that accept FLOAT, INT, etc.
Regarding the aggregate of null or empty lists. I actually have some good news here... The current way of doing that is to provide two versions of the function, one which accepts regular holders and the second which accepts nullable holders and returns an empty list or map if the inputs are null. Yes, this sucks, but the additional good news is that I'm working on cleaning this up and hopefully will have a PR submitted soon that will eliminate the need to do this.
Regarding the ObjectHolder, I wrote a median function that uses a few Stacks to compute a streaming median and I used the ObjectHolder for that. I think it will be with us for some time as there is no alternative at the moment.
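The median UDF mentioned above keeps its running state in an ObjectHolder; its code is not shown here. Purely to illustrate the kind of state such an aggregate has to carry from one add() call to the next, below is a plain-Java sketch of a streaming median. Note that it uses the common two-heap technique rather than the stacks described above, and that it is not Drill API: `StreamingMedian` and its methods are hypothetical names.

```java
import java.util.Collections;
import java.util.PriorityQueue;

final class StreamingMedian {
    // Max-heap holding the smaller half of the values seen so far.
    private final PriorityQueue<Double> lower = new PriorityQueue<>(Collections.reverseOrder());
    // Min-heap holding the larger half.
    private final PriorityQueue<Double> upper = new PriorityQueue<>();

    void add(double value) {
        if (lower.isEmpty() || value <= lower.peek()) {
            lower.add(value);
        } else {
            upper.add(value);
        }
        // Rebalance: the halves differ in size by at most one,
        // and any extra element lives in the lower half.
        if (lower.size() > upper.size() + 1) {
            upper.add(lower.poll());
        } else if (upper.size() > lower.size()) {
            lower.add(upper.poll());
        }
    }

    double median() {
        if (lower.isEmpty()) {
            throw new IllegalStateException("no values added yet");
        }
        if (lower.size() > upper.size()) {
            return lower.peek();
        }
        return (lower.peek() + upper.peek()) / 2.0;
    }

    public static void main(String[] args) {
        StreamingMedian m = new StreamingMedian();
        m.add(5);
        m.add(1);
        m.add(3);
        System.out.println(m.median()); // 3.0
    }
}
```

In a UDF, an object like this would be the thing stored in the workspace holder, which is why a general-purpose object holder is hard to replace.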
I hope this answers your questions.
How can I create a method that accepts a Class and a Field as parameters? Like this:
List<SomeClassEntity> list = ...;
// Service to make useful things around a list of objects
UsefulThingsService<SomeClassEntity> usefulThingsService = new UsefulThingsService<>();
// Maybe invoke it like this. Didn't work
usefulThingsService.makeUsefulThings(list, SomeClassEntity.class, SomeClassEntity::getFieldOne);
// or like this. Will cause delayed runtime errors
usefulThingsService.makeUsefulThings(list, SomeClassEntity.class, "fieldTwo");
public class SomeClassEntity {
Integer fieldOne = 10;
Double fieldThree = 0.123;
public Integer getFieldOne() {
return fieldOne;
}
public void setFieldOne(Integer fieldOne) {
this.fieldOne = fieldOne;
}
public Double getFieldThree() {
return fieldThree;
}
public void setFieldThree(Double fieldThree) {
this.fieldThree = fieldThree;
}
}
public class UsefulThingsService<T> {
public void makeUsefulThings(Class<T> someClassBClass, String fieldName) {
// there is some code
}
}
I want the references to be checked at compile time, not at runtime.
Update:
I need code that would look more convenient than this:
Field fieldOne = null;
try {
fieldOne = SomeClassEntity.class.getDeclaredField("fieldOne");
} catch (NoSuchFieldException e) {
e.printStackTrace();
}
usefulThingsService.makeUsefulThings(SomeClassEntity.class, fieldOne);
Sorry for yet another clarification.
Update 2:
- The service compares the list with the previous list, finds only the changed fields of the objects (list items), and updates those fields in the objects of the original list.
- Currently I use an annotation on the entity field that is the entity's ID; that ID is used to match identical entities (old and new) when I need to update a field of an entity in the source list.
- The service detects the annotated field and uses it for the subsequent update process.
- I want to stop using annotations and pass a Field directly to the service's constructor, or use something else that could establish the relationship between class and field at compile time.
Assuming that you want field access because you want to get and set the value, you’d need two functions:
public class UsefulThingsService<T> {
public <V> void makeUsefulThings(List<T> list, Function<T,V> get, BiConsumer<T,V> set) {
for(T object: list) {
V v = get.apply(object);
// there is some code
set.accept(object, v);
}
}
}
and
usefulThingsService.makeUsefulThings(
list, SomeClassEntity::getFieldOne, SomeClassEntity::setFieldOne);
usefulThingsService.makeUsefulThings(
list, SomeClassEntity::getFieldThree, SomeClassEntity::setFieldThree);
Some things remain open, however. For example, how is this service supposed to do something useful with the field (or property) without knowing its actual type? In your example, both are subtypes of Number, so you could declare <V extends Number>, and the method would then know how to extract numerical values. Constructing an appropriate result object, however, would require another function argument.
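The same idea can also cover the ID annotation described in Update 2: instead of annotating the ID field, the caller passes a third method reference that extracts the ID. The following is only a sketch under that assumption; `FieldSync`, `syncField` and the `Person` entity are hypothetical names that do not appear in the question.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.function.BiConsumer;
import java.util.function.Function;

final class FieldSync {
    // Copies one property from "updated" items onto the matching "target" items
    // (matched by id) and returns how many target items were actually changed.
    static <T, K, V> int syncField(List<T> target, List<T> updated,
                                   Function<T, K> id,
                                   Function<T, V> get, BiConsumer<T, V> set) {
        Map<K, T> updatedById = new HashMap<>();
        for (T item : updated) {
            updatedById.put(id.apply(item), item);
        }
        int changed = 0;
        for (T item : target) {
            T newer = updatedById.get(id.apply(item));
            if (newer == null) {
                continue; // no counterpart in the updated list
            }
            V newValue = get.apply(newer);
            if (!Objects.equals(get.apply(item), newValue)) {
                set.accept(item, newValue);
                changed++;
            }
        }
        return changed;
    }
}

// Hypothetical entity used only for this illustration.
final class Person {
    private final Long id;
    private String name;

    Person(Long id, String name) { this.id = id; this.name = name; }

    Long getId() { return id; }
    String getName() { return name; }
    void setName(String name) { this.name = name; }

    public static void main(String[] args) {
        List<Person> current = new ArrayList<>();
        current.add(new Person(1L, "Ann"));
        current.add(new Person(2L, "Bob"));
        List<Person> incoming = new ArrayList<>();
        incoming.add(new Person(2L, "Robert"));
        // All three references are checked by the compiler.
        int changed = FieldSync.syncField(current, incoming,
                Person::getId, Person::getName, Person::setName);
        System.out.println(changed + " " + current.get(1).getName());
    }
}
```

A misspelled getter or a setter of the wrong type fails at compile time here, which is the property the string-based and reflection-based variants lack.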
Background
I have a Spring Batch job where:
1. FlatFileItemReader - Reads one row at a time from the file.
2. ItemProcessor - Transforms the row from the file into a List<MyObject> and returns the List. That is, each row in the file is broken down into a List<MyObject> (1 row in the file is transformed to many output rows).
3. ItemWriter - Writes the List<MyObject> to a database table. (I used this implementation to unpack the list received from the processor and delegate to a JdbcBatchItemWriter.)
Question
At point 2) The processor can return a List of 100000 MyObject instances.
At point 3), The delegate JdbcBatchItemWriter will end up writing the entire List with 100000 objects to the database.
My question is: the JdbcBatchItemWriter does not allow a custom batch size. For all practical purposes, the batch size equals the commit-interval of the step. With this in mind, is there another ItemWriter implementation available in Spring Batch that writes to the database and allows a configurable batch size? If not, how do I go about writing a custom writer myself to achieve this?
I see no obvious way to set the batch size on the JdbcBatchItemWriter. However, you can extend the writer and use a custom BatchPreparedStatementSetter to specify the batch size. Here is a quick example:
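import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;
import org.springframework.batch.item.database.JdbcBatchItemWriter;
import org.springframework.jdbc.core.BatchPreparedStatementSetter;

public class MyCustomWriter<T> extends JdbcBatchItemWriter<T> {
    @Override
    public void write(final List<? extends T> items) throws Exception {
        namedParameterJdbcTemplate.getJdbcOperations().batchUpdate("your sql", new BatchPreparedStatementSetter() {
            @Override
            public void setValues(PreparedStatement ps, int i) throws SQLException {
                // set values on your sql
            }

            @Override
            public int getBatchSize() {
                return items.size(); // or any other value you want
            }
        });
    }
}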
The StagingItemWriter in the samples is an example of how to use a custom BatchPreparedStatementSetter as well.
The answer from Mahmoud Ben Hassine and the comments pretty much cover all aspects of the solution, and that answer is the accepted one.
Here is the implementation I used if anyone is interested :
public class JdbcCustomBatchSizeItemWriter<W> extends JdbcDaoSupport implements ItemWriter<W> {
private int batchSize;
private ParameterizedPreparedStatementSetter<W> preparedStatementSetter;
private String sqlFileLocation;
private String sql;
public void initReader() {
this.setSql(FileUtilties.getFileContent(sqlFileLocation));
}
public void write(List<? extends W> arg0) throws Exception {
getJdbcTemplate().batchUpdate(sql, Collections.unmodifiableList(arg0), batchSize, preparedStatementSetter);
}
public void setBatchSize(int batchSize) {
this.batchSize = batchSize;
}
public void setPreparedStatementSetter(ParameterizedPreparedStatementSetter<W> preparedStatementSetter) {
this.preparedStatementSetter = preparedStatementSetter;
}
public void setSqlFileLocation(String sqlFileLocation) {
this.sqlFileLocation = sqlFileLocation;
}
public void setSql(String sql) {
this.sql = sql;
}
}
Note:
The use of Collections.unmodifiableList avoids the need for any explicit casting.
I use sqlFileLocation to specify an external file that contains the SQL; the file helper called in initReader() is my own utility and simply returns the contents of this SQL file. This can be skipped, and one can pass the SQL to the class directly when creating the bean.
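For readers wondering what the batchSize argument changes: the four-argument JdbcTemplate.batchUpdate used above sends the items in groups of at most batchSize instead of one large batch. The plain-Java sketch below only illustrates that slicing. It is not Spring's implementation, and `BatchPartitioner.partition` is a hypothetical helper.

```java
import java.util.ArrayList;
import java.util.List;

final class BatchPartitioner {
    // Splits items into consecutive batches of at most batchSize elements.
    static <T> List<List<T>> partition(List<T> items, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int from = 0; from < items.size(); from += batchSize) {
            int to = Math.min(from + batchSize, items.size());
            // Copy the slice so each batch is independent of the source list.
            batches.add(new ArrayList<>(items.subList(from, to)));
        }
        return batches;
    }

    public static void main(String[] args) {
        List<Integer> items = new ArrayList<>();
        for (int i = 0; i < 7; i++) {
            items.add(i);
        }
        System.out.println(partition(items, 3)); // [[0, 1, 2], [3, 4, 5], [6]]
    }
}
```

With a list of 100000 objects and a batch size of 1000, this kind of slicing would produce 100 batches, all still inside the same chunk transaction.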
I wouldn't do this. It presents issues for restartability. Instead, modify your reader to produce individual items rather than having your processor take in an object and return a list.
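One way to read "produce individual items" is a reader that buffers each list internally and hands out one element per call. The sketch below shows that idea in plain Java only: `FlatteningReader` is a hypothetical class, the Iterator stands in for whatever produces the lists, and `read()` merely mirrors the ItemReader convention of returning null at the end. It does not save its position anywhere, so on its own it does not address the restartability concern.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;

final class FlatteningReader<T> {
    private final Iterator<? extends List<? extends T>> source;
    // Elements of the list currently being handed out (must not contain nulls).
    private final Deque<T> buffer = new ArrayDeque<>();

    FlatteningReader(Iterator<? extends List<? extends T>> source) {
        this.source = source;
    }

    // Returns the next single element, or null once every list is exhausted.
    T read() {
        // Refill from the source, skipping over empty lists.
        while (buffer.isEmpty() && source.hasNext()) {
            buffer.addAll(source.next());
        }
        return buffer.poll();
    }

    public static void main(String[] args) {
        List<List<Integer>> rows = Arrays.asList(
                Arrays.asList(1, 2), Collections.<Integer>emptyList(), Arrays.asList(3));
        FlatteningReader<Integer> reader = new FlatteningReader<>(rows.iterator());
        Integer item;
        while ((item = reader.read()) != null) {
            System.out.println(item);
        }
    }
}
```

Downstream processors and writers then see single items, and the commit interval applies to individual elements rather than to whole lists.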
This is the ItemWriter interface, which JdbcBatchItemWriter implements:
public interface ItemWriter<T> {
void write(List<? extends T> items) throws Exception;
}
I think it is designed for batch updates, but what if the item I use as input is already a List or List? Do I have to write my own JdbcItemWriter, or can the built-in JdbcBatchItemWriter do the work?
The built-in JdbcBatchItemWriter will work. Your item is of type List. There's nothing wrong with that. You'll just need to implement the appropriate ItemPreparedStatementSetter or ItemSqlParameterSourceProvider yourself to map the elements of the List to the values in the SQL.
Nope.
Use a domain object that contains a list.
class MyDomainObject {
    List<Item> items = new ArrayList<Item>();
}
and substituting it for T gives:
public class MyItemWriter implements ItemWriter<MyDomainObject> {
    public void write(List<? extends MyDomainObject> items) throws Exception {
        for (MyDomainObject o : items) {
            // Perform o.items write
        }
    }
}
Let's say I have a manufacturing scheduling system, which is made up of four parts:
There are factories that can manufacture a certain type of product and know if they are busy:
interface Factory<ProductType> {
void buildProduct(ProductType product);
boolean isBusy();
}
There is a set of different products, which (among other things) know in which factory they are built:
interface Product<ActualProductType extends Product<ActualProductType>> {
Factory<ActualProductType> getFactory();
}
Then there is an ordering system that can generate requests for products to be built:
interface OrderSystem {
Product<?> getNextProduct();
}
Finally, there's a dispatcher that grabs the orders and maintains a work-queue for each factory:
class Dispatcher {
Map<Factory<?>, Queue<Product<?>>> workQueues
= new HashMap<Factory<?>, Queue<Product<?>>>();
public void addNextOrder(OrderSystem orderSystem) {
Product<?> nextProduct = orderSystem.getNextProduct();
workQueues.get(nextProduct.getFactory()).add(nextProduct);
}
public void assignWork() {
for (Factory<?> factory: workQueues.keySet())
if (!factory.isBusy())
factory.buildProduct(workQueues.get(factory).poll());
}
}
Disclaimer: This code is merely an example and has several bugs (check if factory exists as a key in workQueues missing, ...) and is highly non-optimal (could iterate over entryset instead of keyset, ...)
Now the question:
The last line in the Dispatcher (factory.buildProduct(workQueues.get(factory).poll());) produces this compile error:
The method buildProduct(capture#5-of ?) in the type Factory<capture#5-of ?> is not applicable for the arguments (Product<capture#7-of ?>)
I've been racking my brain over how to fix this in a type-safe way, but my Generics-skills have failed me here...
Changing it to the following, for example, doesn't help either:
public void assignWork() {
for (Factory<?> factory: workQueues.keySet())
if (!factory.isBusy()) {
Product<?> product = workQueues.get(factory).poll();
product.getFactory().buildProduct(product);
}
}
Even though in this case it should be clear that this is ok...
I guess I could add a "buildMe()" function to every Product that calls factory.buildProduct(this), but I have a hard time believing that this should be my most elegant solution.
Any ideas?
EDIT:
A quick example for an implementation of Product and Factory:
class Widget implements Product<Widget> {
    public String color;

    @Override
    public Factory<Widget> getFactory() {
        return WidgetFactory.INSTANCE;
    }
}
class WidgetFactory implements Factory<Widget> {
    static final WidgetFactory INSTANCE = new WidgetFactory();

    @Override
    public void buildProduct(Widget product) {
        // Build the widget of the given color (product.color)
    }

    @Override
    public boolean isBusy() {
        return false; // It's really quick to make this widget
    }
}
Your code is weird.
Your problem is that you are passing a Product<?> to a method which expects a ProductType, which is actually T.
Also I have no idea what Product is as you don't mention its definition in the OP.
You need to pass a Product<?> to work. I don't know where you will get it as I can not understand what you are trying to do with your code
Map<Factory<?>, Queue<Product<?>>> workQueues = new HashMap<Factory<?>, Queue<Product<?>>>();
// factory has the type "Factory of ?"
for (Factory<?> factory : workQueues.keySet()) {
    // the queue is of type "Queue of Product of ?"
    Queue<Product<?>> q = workQueues.get(factory);
    // thus you put a "Product of ?" into a method that expects a "?"
    // the compiler can't do anything with that.
    factory.buildProduct(q.poll());
}
Got it! Thanks to meriton who answered this version of the question:
How to replace run-time instanceof check with compile-time generics validation
I need to baby-step the compiler through the product.getFactory().buildProduct(product)-part by doing this in a separate generic function. Here are the changes that I needed to make to the code to get it to work (what a mess):
Be more specific about the OrderSystem:
interface OrderSystem {
<ProductType extends Product<ProductType>> ProductType getNextProduct();
}
Define my own, more strongly typed queue to hold the products:
@SuppressWarnings("serial")
class MyQueue<T extends Product<T>> extends LinkedList<T> {};
And finally, changing the Dispatcher to this beast:
class Dispatcher {
    Map<Factory<?>, MyQueue<?>> workQueues = new HashMap<Factory<?>, MyQueue<?>>();

    @SuppressWarnings("unchecked")
    public <ProductType extends Product<ProductType>> void addNextOrder(OrderSystem orderSystem) {
        ProductType nextProduct = orderSystem.getNextProduct();
        MyQueue<ProductType> myQueue = (MyQueue<ProductType>) workQueues.get(nextProduct.getFactory());
        myQueue.add(nextProduct);
    }

    public void assignWork() {
        for (Factory<?> factory : workQueues.keySet())
            if (!factory.isBusy())
                buildProduct(workQueues.get(factory).poll());
    }

    public <ProductType extends Product<ProductType>> void buildProduct(ProductType product) {
        product.getFactory().buildProduct(product);
    }
}
Notice all the generic functions, especially the last one. Also notice, that I can NOT inline this function back into my for loop as I did in the original question.
Also note that the @SuppressWarnings("unchecked") annotation on addNextOrder() is needed for the cast of the queue, not of some Product object. Since I only call add on this queue, and after compilation and type erasure all elements are simply stored as Objects, this should never result in a run-time casting exception. (Please do correct me if this is wrong!)
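For anyone who wants to try the capture trick in isolation, here is a trimmed, self-contained sketch. It rests on the same two ingredients as the code above: a queue type whose parameter is bounded by Product<T>, and a generic helper method that lets the compiler name the wildcard once. The `ProductQueue`, `CaptureDemo`, `buildNext` and `drain` names are hypothetical, and isBusy() is left out to keep it short.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

interface Factory<P> {
    void buildProduct(P product);
}

interface Product<P extends Product<P>> {
    Factory<P> getFactory();
}

final class Widget implements Product<Widget> {
    final String color;
    Widget(String color) { this.color = color; }
    @Override public Factory<Widget> getFactory() { return WidgetFactory.INSTANCE; }
}

final class WidgetFactory implements Factory<Widget> {
    static final WidgetFactory INSTANCE = new WidgetFactory();
    final List<String> built = new ArrayList<>();
    @Override public void buildProduct(Widget product) { built.add(product.color); }
}

// The bound on T is what makes the capture of ProductQueue<?> usable below.
final class ProductQueue<T extends Product<T>> {
    private final Deque<T> items = new ArrayDeque<>();
    void add(T item) { items.add(item); }
    T poll() { return items.poll(); }
}

final class CaptureDemo {
    // Inside this method the wildcard has a name (P), so the product and
    // its factory are known to agree on the same type.
    static <P extends Product<P>> boolean buildNext(ProductQueue<P> queue) {
        P product = queue.poll();
        if (product == null) {
            return false;
        }
        product.getFactory().buildProduct(product);
        return true;
    }

    // Callers only hold ProductQueue<?>; each call captures the wildcard.
    static int drain(List<ProductQueue<?>> queues) {
        int built = 0;
        for (ProductQueue<?> queue : queues) {
            while (buildNext(queue)) {
                built++;
            }
        }
        return built;
    }

    public static void main(String[] args) {
        ProductQueue<Widget> widgets = new ProductQueue<>();
        widgets.add(new Widget("red"));
        List<ProductQueue<?>> queues = new ArrayList<>();
        queues.add(widgets);
        System.out.println(drain(queues) + " " + WidgetFactory.INSTANCE.built);
    }
}
```

No cast and no @SuppressWarnings is needed in this reduced form, because nothing has to be looked up from a map keyed by Factory<?>; that lookup is where the unchecked cast in the Dispatcher comes from.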