Run bulk operation as intermediate stream operation

I have a Java stream of unknown length. I need to load some metadata from the database and attach it to the streamed data.
I cannot:
load all data from the stream into RAM at once, populate the metadata and then start a new stream, as this might use too much RAM.
load the metadata for each element individually, as this would flood my database with too many requests.
So I thought I could load the metadata from the database in partitions.
I need a method like this:
<T> Stream<List<T>> partition(Stream<T> stream, int partitionSize)
so I can use it like this
partition(dataSource.stream(), 1000)
.map(metadataSource::populate)
.flatMap(List::stream)
.forEach(this::doSomething);
I already found Guava's Iterables#partition, but that would force me to convert the stream to an iterable, partition it and convert it back to a stream. Is there something built in for stream partitioning, or is there an easy way to implement it myself?

I haven't found an existing method that does this already, so I implemented one myself:
public class Partitioner<E> implements Iterator<List<E>> {
private final Iterator<E> iterator;
private final int partitionSize;
public static <T> Stream<List<T>> partition(final Stream<T> stream, final int partitionSize) {
return new Partitioner<>(stream, partitionSize).asStream();
}
public Partitioner(final Stream<E> stream, final int partitionSize) {
this(stream.iterator(), partitionSize);
}
public Partitioner(final Iterator<E> iterator, final int partitionSize) {
this.iterator = iterator;
this.partitionSize = partitionSize;
}
@Override
public boolean hasNext() {
return this.iterator.hasNext();
}
@Override
public List<E> next() {
if (!hasNext()) {
throw new NoSuchElementException("No more elements");
}
final ArrayList<E> result = new ArrayList<>(this.partitionSize);
for (int i = 0; i < this.partitionSize && hasNext(); i++) {
result.add(this.iterator.next());
}
return result;
}
public Stream<List<E>> asStream() {
return StreamSupport.stream(Spliterators.spliteratorUnknownSize(this, Spliterator.NONNULL), false);
}
}
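For anyone who wants to try the idea quickly, here is a compact, self-contained sketch of the same iterator-based partitioning with a usage example. The class name `PartitionDemo` is only illustrative, and the sketch assumes a sequential stream; it is not meant as a replacement for the class above.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.Spliterator;
import java.util.Spliterators;
import java.util.stream.Collectors;
import java.util.stream.IntStream;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

final class PartitionDemo {

    // Lazily groups the elements of a sequential stream into lists of at most `size` elements.
    static <T> Stream<List<T>> partition(Stream<T> stream, int size) {
        if (size <= 0) {
            throw new IllegalArgumentException("size must be positive");
        }
        Iterator<T> source = stream.iterator();
        Iterator<List<T>> chunks = new Iterator<List<T>>() {
            @Override
            public boolean hasNext() {
                return source.hasNext();
            }

            @Override
            public List<T> next() {
                if (!source.hasNext()) {
                    throw new NoSuchElementException();
                }
                List<T> chunk = new ArrayList<>();
                // Pull elements only until the chunk is full or the source is exhausted.
                while (chunk.size() < size && source.hasNext()) {
                    chunk.add(source.next());
                }
                return chunk;
            }
        };
        return StreamSupport.stream(
                        Spliterators.spliteratorUnknownSize(
                                chunks, Spliterator.ORDERED | Spliterator.NONNULL),
                        false)
                .onClose(stream::close); // closing the result also closes the source stream
    }

    public static void main(String[] args) {
        List<List<Integer>> chunks = partition(IntStream.rangeClosed(1, 7).boxed(), 3)
                .collect(Collectors.toList());
        System.out.println(chunks); // [[1, 2, 3], [4, 5, 6], [7]]
    }
}
```

Only one chunk is held in memory at a time, so a bulk lookup can run once per chunk in a `map` step before `flatMap(List::stream)` flattens the result again.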

Related

Is there something like an Iterator, but with functions like Streams?

So basically what I am trying to do is the following:
Load Batch of Data from the Database
Map that data (Object[] query result) to a class representing the data in a readable format
Write to File
Repeat until query gets no more results
I listed the structures that I am familiar with that seem to fit the need and why they don't fit my needs.
Iterator → Has no option to map and filter without calling next()
I need to define the map function in a subclass though without actually having the data (similar to a stream), so that I can pass the "Stream" way up to a calling class and only there call next, which then calls all the map functions as a result
Stream → All data needs to be available before mapping and filtering is possible
Observable → Sends data as soon as it comes available. I need to process it in sync though
To get more of a feeling what I am trying to do, I made a small example:
// Disclaimer: "Something" is the structure I am not sure of now.
// Could be an Iterator or something else that fits (that's the question)
public class Orchestrator {
@Inject
private DataGetter dataGetter;
public void doWork() {
FileWriter writer = new FileWriter("filename");
// Write the formatted data to the file
dataGetter.getData()
.forEach(data -> writer.writeToFile(data));
}
}
public class FileWriter {
public void writeToFile(List<Thing> data) {
// Write to file
}
}
public class DataGetter {
@Inject
private ThingDao thingDao;
public Something<List<Thing>> getData() {
// Map data to the correct format and return that
return thingDao.getThings()
.map(partialResult -> /* map to object */);
}
}
public class ThingDao {
public Something<List<Object[]>> getThings() {
Query q = ...;
// Don't know what to return
}
}
What I have got so far:
I tried to go from the base of an Iterator, because it's the only one that really fulfills my memory requirements. Then I have added some methods to map and loop over the data. It's not really a robust design though and it's going to be harder than I thought, so I wanted to know if there is anything out there already that does what I need.
public class QIterator<E> implements Iterator<List<E>> {
public static String QUERY_OFFSET = "queryOffset";
public static String QUERY_LIMIT = "queryLimit";
private Query query;
private long lastResultIndex = 0;
private long batchSize;
private Function<List<Object>, List<E>> mapper;
public QIterator(Query query, long batchSize) {
this.query = query;
this.batchSize = batchSize;
}
public QIterator(Query query, long batchSize, Function<List<Object>, List<E>> mapper) {
this(query, batchSize);
this.mapper = mapper;
}
@Override
public boolean hasNext() {
return lastResultIndex % batchSize == 0;
}
@Override
public List<E> next() {
query.setParameter(QIterator.QUERY_OFFSET, lastResultIndex);
query.setParameter(QIterator.QUERY_LIMIT, batchSize);
List<Object> result = (List<Object>) query.getResultList(); // unchecked
lastResultIndex += result.size();
List<E> mappedResult;
if (mapper != null) {
mappedResult = mapper.apply(result);
} else {
mappedResult = (List<E>) result; // unchecked
}
return mappedResult;
}
public <R> QIterator<R> map(Function<List<E>, List<R>> appendingMapper) {
return new QIterator<>(query, batchSize, (data) -> {
if (this.mapper != null) {
return appendingMapper.apply(this.mapper.apply(data));
} else {
return appendingMapper.apply((List<E>) data);
}
});
}
public void forEach(BiConsumer<List<E>, Integer> consumer) {
for (int i = 0; this.hasNext(); i++) {
consumer.accept(this.next(), i);
}
}
}
This works so far, but it has some unchecked assignments, which I do not really like. I would also like to be able to "append" one QIterator to another. That is not hard by itself, but the appended iterator should also pick up the maps that follow the append.

Assume you have a DAO that provides data in a paginated manner, e.g. by applying the LIMIT and OFFSET clauses to the underlying SQL. Such a DAO class would have a method that takes those values as arguments, i.e. the method would conform to the following functional interface:
@FunctionalInterface
public interface PagedDao<T> {
List<T> getData(int offset, int limit);
}
E.g. calling getData(0, 20) would return the first 20 rows (page 1), and calling getData(60, 20) would return the 20 rows on page 4. If the method returns fewer than 20 rows, it means we got the last page. Asking for data after the last row will return an empty list.
For the demo below, we can mock such a DAO class:
public class MockDao {
private final int rowCount;
public MockDao(int rowCount) {
this.rowCount = rowCount;
}
public List<SimpleRow> getSimpleRows(int offset, int limit) {
System.out.println("DEBUG: getData(" + offset + ", " + limit + ")");
if (offset < 0 || limit <= 0)
throw new IllegalArgumentException();
List<SimpleRow> data = new ArrayList<>();
for (int i = 0, rowNo = offset + 1; i < limit && rowNo <= this.rowCount; i++, rowNo++)
data.add(new SimpleRow("Row #" + rowNo));
System.out.println("DEBUG: data = " + data);
return data;
}
}
public class SimpleRow {
private final String data;
public SimpleRow(String data) {
this.data = data;
}
@Override
public String toString() {
return "Row[data=" + this.data + "]";
}
}
If you then want to generate a Stream of rows from that method, streaming all rows in blocks of a certain size, we need a Spliterator for that, so we can use StreamSupport.stream(Spliterator<T> spliterator, boolean parallel) to create a stream.
Here is an implementation of such a Spliterator:
public class PagedDaoSpliterator<T> implements Spliterator<T> {
private final PagedDao<T> dao;
private final int blockSize;
private int nextOffset;
private List<T> data;
private int dataIdx;
public PagedDaoSpliterator(PagedDao<T> dao, int blockSize) {
if (blockSize <= 0)
throw new IllegalArgumentException();
this.dao = Objects.requireNonNull(dao);
this.blockSize = blockSize;
}
@Override
public boolean tryAdvance(Consumer<? super T> action) {
if (this.data == null) {
if (this.nextOffset == -1/*At end*/)
return false; // Already at end
this.data = this.dao.getData(this.nextOffset, this.blockSize);
this.dataIdx = 0;
if (this.data.size() < this.blockSize)
this.nextOffset = -1/*At end, after this data*/;
else
this.nextOffset += data.size();
if (this.data.isEmpty()) {
this.data = null;
return false; // At end
}
}
action.accept(this.data.get(this.dataIdx++));
if (this.dataIdx == this.data.size())
this.data = null;
return true;
}
@Override
public Spliterator<T> trySplit() {
return null; // Parallel processing not supported
}
@Override
public long estimateSize() {
return Long.MAX_VALUE; // Unknown
}
@Override
public int characteristics() {
return ORDERED | NONNULL;
}
}
We can now test that using the mock DAO above:
MockDao dao = new MockDao(13);
Stream<SimpleRow> stream = StreamSupport.stream(
new PagedDaoSpliterator<>(dao::getSimpleRows, 5), /*parallel*/false);
stream.forEach(System.out::println);
Output
DEBUG: getData(0, 5)
DEBUG: data = [Row[data=Row #1], Row[data=Row #2], Row[data=Row #3], Row[data=Row #4], Row[data=Row #5]]
Row[data=Row #1]
Row[data=Row #2]
Row[data=Row #3]
Row[data=Row #4]
Row[data=Row #5]
DEBUG: getData(5, 5)
DEBUG: data = [Row[data=Row #6], Row[data=Row #7], Row[data=Row #8], Row[data=Row #9], Row[data=Row #10]]
Row[data=Row #6]
Row[data=Row #7]
Row[data=Row #8]
Row[data=Row #9]
Row[data=Row #10]
DEBUG: getData(10, 5)
DEBUG: data = [Row[data=Row #11], Row[data=Row #12], Row[data=Row #13]]
Row[data=Row #11]
Row[data=Row #12]
Row[data=Row #13]
As can be seen, we get 13 rows of data, retrieved from the database in blocks of 5 rows.
The data is not retrieved from the database until it is needed, which keeps the memory footprint low, as long as the block size is modest and the downstream stream operations do not cache the data.
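If you are on Java 9 or later and do not need a custom Spliterator, roughly the same paging can be sketched with Stream.iterate and takeWhile. The names `PagedStreams`, `pagedStream` and `loadRows` are made up for this illustration, and `loadRows` merely stands in for a real DAO call. One difference from the Spliterator: this version only stops on an empty page, so it issues one extra query when the last page is short.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

final class PagedStreams {

    // Same shape as the PagedDao interface described above, repeated to keep the sketch self-contained.
    @FunctionalInterface
    interface PagedDao<T> {
        List<T> getData(int offset, int limit);
    }

    // Hypothetical helper: turns an (offset, limit) page loader into a lazy stream of rows.
    static <T> Stream<T> pagedStream(PagedDao<T> dao, int blockSize) {
        if (blockSize <= 0) {
            throw new IllegalArgumentException("blockSize must be positive");
        }
        return Stream.iterate(0, offset -> offset + blockSize)   // 0, blockSize, 2 * blockSize, ...
                .map(offset -> dao.getData(offset, blockSize))    // one query per page
                .takeWhile(page -> !page.isEmpty())               // stop at the first empty page
                .flatMap(List::stream);
    }

    // Stand-in for a DAO call: pretends the table holds the row numbers 1..13.
    static List<Integer> loadRows(int offset, int limit) {
        List<Integer> rows = new ArrayList<>();
        for (int rowNo = offset + 1; rowNo <= Math.min(offset + limit, 13); rowNo++) {
            rows.add(rowNo);
        }
        return rows;
    }

    public static void main(String[] args) {
        List<Integer> rows = pagedStream(PagedStreams::loadRows, 5).collect(Collectors.toList());
        System.out.println(rows.size()); // 13
    }
}
```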

You can do it in one line as follows:
Statement stmt = con.createStatement();
ResultSet rs = stmt.executeQuery(queryThatReturnsAllRowsOrdered);
// Stream.generate takes a Supplier; map(rs) advances the cursor and returns null after the last row
Stream.generate(() -> map(rs))
.takeWhile(Objects::nonNull)
.filter(<some predicate>)
.forEach(<some operation>);
This starts processing when the first row is returned from the query and continues in parallel with the database until all rows have been read.
This approach only has one row in memory at a time, and minimises the load on the database by running only one query. Note that takeWhile requires Java 9 or later.
Mapping from a ResultSet is far easier and more natural than mapping from Object[] because you can access columns by name and with properly typed values, e.g.:
MyDao map(ResultSet rs) {
try {
if (!rs.next()) {
return null; // no more rows; takeWhile ends the stream
}
String someStr = rs.getString("COLUMN_X");
int someInt = rs.getInt("COLUMN_Y");
return new MyDao(someStr, someInt);
} catch (SQLException e) {
throw new RuntimeException(e);
}
}

Why does guava joiner implement a private method iterable(final Object first, final Object second, final Object[] rest)?

private static Iterable<Object> iterable(
final Object first, final Object second, final Object[] rest) {
checkNotNull(rest);
return new AbstractList<Object>() {
@Override
public int size() {
return rest.length + 2;
}
@Override
public Object get(int index) {
switch (index) {
case 0:
return first;
case 1:
return second;
default:
return rest[index - 2];
}
}
};
}
What is the author's purpose?
I guess they want to reuse the array the compiler generates for the varargs call rather than allocate a new ArrayList.
But one point still confuses me: why not write it as below?
private static Iterable<Object> iterable(final Object[] rest) {
checkNotNull(rest);
return new AbstractList<Object>() {
@Override
public int size() {
return rest.length;
}
@Override
public Object get(int index) {
return rest[index];
}
};
}

The point here is that this method is called from public methods which look like (source):
public final String join(
@NullableDecl Object first, @NullableDecl Object second, Object... rest) {
return join(iterable(first, second, rest));
}
Using signatures like this is a trick to force you to pass in at least two arguments - after all, if you've not got two arguments, there is nothing to join.
For example:
Joiner.on(':').join(); // Compiler error.
Joiner.on(':').join("A"); // Compiler error.
Joiner.on(':').join("A", "B"); // OK.
Joiner.on(':').join("A", "B", "C"); // OK.
// etc.
This iterable method just creates an Iterable without having to copy everything into a new array. Doing so would be O(n) in the number of arguments; the approach taken here is O(1).
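To see the trick in isolation, here is a small sketch that does not depend on Guava. `VarargsView` and `view` are names invented for this example; the body mirrors the idea of the private method quoted in the question, not Guava's exact source.

```java
import java.util.AbstractList;
import java.util.List;
import java.util.Objects;

final class VarargsView {

    // Presents (first, second, rest...) as one read-only list without copying `rest`.
    static List<Object> view(Object first, Object second, Object... rest) {
        Objects.requireNonNull(rest);
        return new AbstractList<Object>() {
            @Override
            public int size() {
                return rest.length + 2;
            }

            @Override
            public Object get(int index) {
                if (index == 0) {
                    return first;
                }
                if (index == 1) {
                    return second;
                }
                return rest[index - 2]; // an out-of-range index fails in the array access
            }
        };
    }

    public static void main(String[] args) {
        System.out.println(view("A", "B"));           // [A, B]
        System.out.println(view("A", "B", "C", "D")); // [A, B, C, D]
        // view("A") does not compile: the signature requires at least two arguments.
    }
}
```

The view is created in constant time because it only wraps the existing varargs array; nothing is copied until somebody iterates over it.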

Custom iterator asks Object instead my type Java

I've made my own implementation of HashTable and want to build an iterator so that it can be used in a for-each loop. MyHashTable is a generic class with K, V parameters.
Here is a snippet from the MyHashTable class:
@Override
public Iterator iterator() {
return new MyHashTableIterator();
}
class MyHashTableIterator implements Iterator {
private MyHashTableEntry[] entries;
private int entriesIter;
public MyHashTableIterator() {
this.entries = entrySet();
this.entriesIter = 0;
}
@Override
public boolean hasNext() {
return entriesIter < entries.length - 1;
}
@Override
public MyHashTableEntry next() {
return entries[entriesIter++];
}
}
This seems OK to me, but when I try to use a for-each loop somewhere, it forces me to iterate with an Object variable:
for (MyHashTable.MyHashTableEntry entry : wordDict) {
}
This code produces:
Error: java: incompatible types: java.lang.Object cannot be converted to MyHashTable.MyHashTableEntry
How can I make this work without using an Object variable?
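A likely cause is the raw types in the snippet: a raw Iterator hands out Object, so the for-each loop cannot assign its elements to MyHashTableEntry. Below is a minimal sketch of the parameterized version. `MiniTable` and its `Entry` class are simplified stand-ins (a plain list instead of a real hash table), not the class from the question. Note also that `entriesIter < entries.length - 1` in the question's hasNext() would skip the last entry.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Simplified stand-in for the question's class: entries live in a list, not in hash buckets.
final class MiniTable<K, V> implements Iterable<MiniTable.Entry<K, V>> {

    static final class Entry<K, V> {
        final K key;
        final V value;

        Entry(K key, V value) {
            this.key = key;
            this.value = value;
        }
    }

    private final List<Entry<K, V>> entries = new ArrayList<>();

    void put(K key, V value) {
        entries.add(new Entry<>(key, value));
    }

    // Declaring the element type here is what lets for-each yield Entry<K, V> instead of Object.
    @Override
    public Iterator<Entry<K, V>> iterator() {
        return new Iterator<Entry<K, V>>() {
            private int position = 0;

            @Override
            public boolean hasNext() {
                return position < entries.size();
            }

            @Override
            public Entry<K, V> next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                return entries.get(position++);
            }
        };
    }

    public static void main(String[] args) {
        MiniTable<String, Integer> table = new MiniTable<>();
        table.put("one", 1);
        table.put("two", 2);
        for (MiniTable.Entry<String, Integer> entry : table) {
            System.out.println(entry.key + "=" + entry.value);
        }
    }
}
```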

Iterator design pattern for generic types

I am trying to implement the Iterator design pattern for a generic type. I cannot use a generic array to store the collection. Can someone help me design the Iterator pattern for generic types?
My code here:
public class NamesRepository<T> implements Container<T> {
public NamesRepository(){
names = new T[]; // Compilation error
}
public T names[];
@Override
public Iterator<T> getIterator() {
return new NameIterator();
}
private class NameIterator implements Iterator<T>{
private int index;
@Override
public boolean hasNext() {
if(index<names.length)
return true;
return false;
}
@Override
public T next() {
if(this.hasNext())
return names[index++];
return null;
}
}
}

As has been discussed many times before, you can't directly create a generic array using the type parameter. You can follow that question's answers if you must use an array.
However, you can create a List of a type parameter instead. No tricks are necessary, and it's more flexible.
public NamesRepository(){
names = new ArrayList<T>(); // Easy creation
}
public List<T> names;
Then, your iterator's hasNext() method can compare its index to names.size(), and the next() method can return names.get(index++). You can even implement the Iterator interface's remove() method (an optional operation, and a default method since Java 8) by calling names.remove(--index), which removes the element most recently returned by next().
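Putting those pieces together, a List-backed version might look roughly like the sketch below. It assumes java.lang.Iterable in place of the question's Container interface, and the name `ListRepository` is made up for this example.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Assumption: Iterable<T> replaces the question's Container<T>, whose definition is not shown.
final class ListRepository<T> implements Iterable<T> {

    private final List<T> names = new ArrayList<>();

    void add(T name) {
        names.add(name);
    }

    @Override
    public Iterator<T> iterator() {
        return new Iterator<T>() {
            private int index = 0;
            private boolean canRemove = false; // true only right after a call to next()

            @Override
            public boolean hasNext() {
                return index < names.size();
            }

            @Override
            public T next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                canRemove = true;
                return names.get(index++);
            }

            @Override
            public void remove() {
                if (!canRemove) {
                    throw new IllegalStateException();
                }
                names.remove(--index); // drops the element that next() just returned
                canRemove = false;
            }
        };
    }

    @Override
    public String toString() {
        return names.toString();
    }

    public static void main(String[] args) {
        ListRepository<String> repo = new ListRepository<>();
        repo.add("Ann");
        repo.add("Bob");
        repo.add("Cy");
        for (Iterator<String> it = repo.iterator(); it.hasNext(); ) {
            if (it.next().equals("Bob")) {
                it.remove();
            }
        }
        System.out.println(repo); // [Ann, Cy]
    }
}
```

Throwing NoSuchElementException from next() follows the Iterator contract more closely than returning null, which a caller could mistake for a stored null element.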

You cannot create an array of type parameter. Rather you can store an Object[] array, and type cast the element that you are returning to T:
public NamesRepository(){
names = new Object[5]; // You forgot size here
}
public Object names[];
And then change the next() method as:
@Override
public T next() {
if(this.hasNext()) {
@SuppressWarnings("unchecked")
T value = (T)names[index++];
return value;
}
return null;
}
You can also reduce the hasNext() method to a single line:
@Override
public boolean hasNext() {
return index < names.length;
}

The NamesRepository() constructor creates the generic array incorrectly. Use a List of the type parameter instead:
public NamesRepository() {
names = new ArrayList<T>();
}
public List<T> names;

Is there a no-duplicate List implementation out there?

I know about SortedSet, but in my case I need something that implements List, and not Set. So is there an implementation out there, in the API or elsewhere?
It shouldn't be hard to implement myself, but I figured why not ask people here first?

There's no Java collection in the standard library to do this. LinkedHashSet<E> preserves ordering similarly to a List, though, so if you wrap your set in a List when you want to use it as a List you'll get the semantics you want.
Alternatively, the Commons Collections (or commons-collections4, for the generic version) has a List which does what you want already: SetUniqueList / SetUniqueList<E>.

Here is what I did and it works.
Assuming I have an ArrayList to work with, the first thing I did was create a new LinkedHashSet.
LinkedHashSet<E> hashSet = new LinkedHashSet<E>();
Then I attempt to add my new element to the LinkedHashSet. The add method does not alter the LinkedHashSet and returns false if the new element is a duplicate. So this becomes a condition I can test before adding to the ArrayList.
if (hashSet.add(element)) arrayList.add(element);
This is a simple and elegant way to prevent duplicates from being added to an array list. If you want, you can encapsulate it in an override of the add method in a class that extends ArrayList. Just remember to deal with addAll by looping through the elements and calling the add method.
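Here is a small sketch of that guard wrapped in its own class. `UniqueAppender` is a hypothetical name, and the sketch only covers appending; a full List implementation would also have to keep the set in sync on remove, set and the index-based add.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class UniqueAppender<E> {

    private final Set<E> seen = new HashSet<>();
    private final List<E> items = new ArrayList<>();

    // Returns true if the element was new and has been appended, false if it was a duplicate.
    boolean add(E element) {
        if (!seen.add(element)) {
            return false; // Set.add reports that the element was already present
        }
        items.add(element);
        return true;
    }

    // Read-only view, so callers cannot bypass the duplicate check.
    List<E> asList() {
        return Collections.unmodifiableList(items);
    }

    public static void main(String[] args) {
        UniqueAppender<String> appender = new UniqueAppender<>();
        appender.add("a");
        appender.add("b");
        appender.add("a");
        System.out.println(appender.asList()); // [a, b]
    }
}
```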

So here's what I did eventually. I hope this helps someone else.
class NoDuplicatesList<E> extends LinkedList<E> {
@Override
public boolean add(E e) {
if (this.contains(e)) {
return false;
}
else {
return super.add(e);
}
}
@Override
public boolean addAll(Collection<? extends E> collection) {
Collection<E> copy = new LinkedList<E>(collection);
copy.removeAll(this);
return super.addAll(copy);
}
@Override
public boolean addAll(int index, Collection<? extends E> collection) {
Collection<E> copy = new LinkedList<E>(collection);
copy.removeAll(this);
return super.addAll(index, copy);
}
@Override
public void add(int index, E element) {
if (this.contains(element)) {
return;
}
else {
super.add(index, element);
}
}
}

Why not encapsulate a set with a list, sort of like:
new ArrayList( new LinkedHashSet() )
This leaves the other implementation for someone who is a real master of Collections ;-)

You should seriously consider dhiller's answer:
Instead of worrying about adding your objects to a duplicate-less List, add them to a Set (any implementation), which will by nature filter out the duplicates.
When you need to call the method that requires a List, wrap it in a new ArrayList(set) (or a new LinkedList(set), whatever).
I think that the NoDuplicatesList solution you posted has some issues, mostly with the contains() method, which is a linear scan on a LinkedList. In addition, your class does not check for duplicates within the Collection passed to your addAll() method.
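As a rough illustration of the Set-then-wrap suggestion, the sketch below uses a LinkedHashSet so that first-seen order is kept; the helper name `distinctInOrder` is made up for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class DedupDemo {

    // Collects into a LinkedHashSet (drops duplicates, keeps first-seen order),
    // then copies into an ArrayList for APIs that require a List.
    static <T> List<T> distinctInOrder(Collection<? extends T> input) {
        Set<T> unique = new LinkedHashSet<>(input);
        return new ArrayList<>(unique);
    }

    public static void main(String[] args) {
        List<String> result = distinctInOrder(Arrays.asList("b", "a", "b", "c", "a"));
        System.out.println(result); // [b, a, c]
    }
}
```

The copy is a snapshot: elements added to the returned list afterwards are not checked for duplicates.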

I needed something like that, so I went to Commons Collections and used SetUniqueList. When I ran a performance test, however, it turned out to be much slower than using a Set and obtaining an array with Set.toArray().
SetUniqueList took about 20 times as long to fill and then traverse 100,000 Strings as the other implementation, which is a big difference.
So, if you worry about performance, I recommend using a Set and getting an array from it instead of using SetUniqueList, unless you really need the logic of SetUniqueList, in which case you'll need to check other solutions.
Testing code main method:
public static void main(String[] args) {
SetUniqueList pq = SetUniqueList.decorate(new ArrayList());
Set s = new TreeSet();
long t1 = 0L;
long t2 = 0L;
String t;
t1 = System.nanoTime();
for (int i = 0; i < 200000; i++) {
pq.add("a" + Math.random());
}
while (!pq.isEmpty()) {
t = (String) pq.remove(0);
}
t1 = System.nanoTime() - t1;
t2 = System.nanoTime();
for (int i = 0; i < 200000; i++) {
s.add("a" + Math.random());
}
s.clear();
String[] d = (String[]) s.toArray(new String[0]);
s.clear();
for (int i = 0; i < d.length; i++) {
t = d[i];
}
t2 = System.nanoTime() - t2;
System.out.println((double)t1/1000/1000/1000); //seconds
System.out.println((double)t2/1000/1000/1000); //seconds
System.out.println(((double) t1) / t2); //comparing results
}
Regards,
Mohammed Sleem

My latest implementation: https://github.com/marcolopes/dma/blob/master/org.dma.java/src/org/dma/java/util/UniqueArrayList.java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.LinkedHashSet;
/**
* Extends <tt>ArrayList</tt> and guarantees no duplicate elements
*/
public class UniqueArrayList<T> extends ArrayList<T> {
private static final long serialVersionUID = 1L;
public UniqueArrayList(int initialCapacity) {
super(initialCapacity);
}
public UniqueArrayList() {
super();
}
public UniqueArrayList(T[] array) {
this(Arrays.asList(array));
}
public UniqueArrayList(Collection<? extends T> col) {
addAll(col);
}
@Override
public void add(int index, T e) {
if (!contains(e)) super.add(index, e);
}
@Override
public boolean add(T e) {
return contains(e) ? false : super.add(e);
}
@Override
public boolean addAll(Collection<? extends T> col) {
Collection set=new LinkedHashSet(this);
set.addAll(col);
clear();
return super.addAll(set);
}
@Override
public boolean addAll(int index, Collection<? extends T> col) {
Collection set=new LinkedHashSet(subList(0, index));
set.addAll(col);
set.addAll(subList(index, size()));
clear();
return super.addAll(set);
}
@Override
public T set(int index, T e) {
return contains(e) ? null : super.set(index, e);
}
/** Ensures element.equals(o) */
@Override
public int indexOf(Object o) {
int index=0;
for(T element: this){
if (element.equals(o)) return index;
index++;
}return -1;
}
}

Off the top of my head, lists allow duplicates. You could quickly implement a UniqueArrayList and override all the add / insert functions to check for contains() before you call the inherited methods. For personal use, you could only implement the add method you use, and override the others to throw an exception in case future programmers try to use the list in a different manner.

The documentation for collection interfaces says:
Set — a collection that cannot contain duplicate elements.
List — an ordered collection (sometimes called a sequence). Lists can contain duplicate elements.
So if you don't want duplicates, you probably shouldn't use a list.

In the add method, why not use HashSet.add() to check for duplicates instead of HashSet.contains()?
HashSet.add() returns true if the element was not already present and false otherwise.

What about this?
Just check the list before adding with a contains for an already existing object
while (searchResult != null && searchResult.hasMore()) {
SearchResult nextElement = searchResult.nextElement();
Attributes attributes = nextElement.getAttributes();
String stringName = getAttributeStringValue(attributes, SearchAttribute.*attributeName*);
if(!List.contains(stringName)){
List.add(stringName);
}
}

I just made my own UniqueList in my own little library like this:
package com.bprog.collections;//my own little set of useful utilities and classes
import java.util.HashSet;
import java.util.ArrayList;
import java.util.List;
/**
*
* @author Jonathan
*/
public class UniqueList {
private HashSet masterSet = new HashSet();
private ArrayList growableUniques;
private Object[] returnable;
public UniqueList() {
growableUniques = new ArrayList();
}
public UniqueList(int size) {
growableUniques = new ArrayList(size);
}
public void add(Object thing) {
if (!masterSet.contains(thing)) {
masterSet.add(thing);
growableUniques.add(thing);
}
}
/**
* Casts to an ArrayList of unique values
* @return
*/
public List getList(){
return growableUniques;
}
public Object get(int index) {
return growableUniques.get(index);
}
public Object[] toObjectArray() {
int size = growableUniques.size();
returnable = new Object[size];
for (int i = 0; i < size; i++) {
returnable[i] = growableUniques.get(i);
}
return returnable;
}
}
I have a TestCollections class that looks like this:
package com.bprog.collections;
import com.bprog.out.Out;
/**
*
* @author Jonathan
*/
public class TestCollections {
public static void main(String[] args){
UniqueList ul = new UniqueList();
ul.add("Test");
ul.add("Test");
ul.add("Not a copy");
ul.add("Test");
//should only contain two things
Object[] content = ul.toObjectArray();
Out.pl("Array Content",content);
}
}
Works fine. All it does is add an element to a set if the set does not have it already; there is an ArrayList that can be returned, as well as an Object array.
