How to process chunks of a file with java.util.stream - java

To get familiar with the stream API, I tried to code a quite simple pattern.
Problem: I have a text file containing non-nested blocks of text. All blocks are identified by start/end patterns (e.g. <start> and <stop>). The content of a block isn't syntactically distinguishable from the noise between the blocks. Therefore it is impossible to work with simple (stateless) lambdas.
I was just able to implement something ugly like:
Files.lines(path).collect(new MySequentialParseAndProsessEachLineCollector<>());
To be honest, this is not what I want.
I'm looking for a mapper, something like:
Files.lines(path).map(MyMapAllLinesOfBlockToBuckets()).parallelStream().collect(new MyProcessOneBucketCollector<>());
is there a good way to extract chunks of data from a java 8 stream seems to contain a skeleton of a solution. Unfortunately, I'm too stupid to translate that to my problem. ;-)
Any hints?

Here is a solution which can be used for converting a Stream<String>, each element representing a line, to a Stream<List<String>>, each element representing a chunk found using a specified delimiter:
import java.util.ArrayList;
import java.util.List;
import java.util.Spliterator;
import java.util.function.Consumer;
import java.util.function.Predicate;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

public class ChunkSpliterator implements Spliterator<List<String>> {
    private final Spliterator<String> source;
    private final Predicate<String> start, end;
    private final Consumer<String> getChunk;
    private List<String> current;

    ChunkSpliterator(Spliterator<String> lineSpliterator,
                     Predicate<String> chunkStart, Predicate<String> chunkEnd) {
        source = lineSpliterator;
        start = chunkStart;
        end = chunkEnd;
        getChunk = s -> {
            if(current != null) current.add(s);
            else if(start.test(s)) current = new ArrayList<>();
        };
    }

    public boolean tryAdvance(Consumer<? super List<String>> action) {
        while(current == null || current.isEmpty()
              || !end.test(current.get(current.size()-1)))
            if(!source.tryAdvance(getChunk)) return false;
        current.remove(current.size()-1);
        action.accept(current);
        current = null;
        return true;
    }

    public Spliterator<List<String>> trySplit() {
        return null;
    }

    public long estimateSize() {
        return Long.MAX_VALUE;
    }

    public int characteristics() {
        return ORDERED|NONNULL;
    }

    public static Stream<List<String>> toChunks(Stream<String> lines,
            Predicate<String> chunkStart, Predicate<String> chunkEnd,
            boolean parallel) {
        return StreamSupport.stream(
            new ChunkSpliterator(lines.spliterator(), chunkStart, chunkEnd),
            parallel);
    }
}
The lines matching the predicates are not included in the chunk; it would be easy to change this behavior, if desired.
It can be used like this:
ChunkSpliterator.toChunks( Files.lines(Paths.get(myFile)),
Pattern.compile("^<start>$").asPredicate(),
Pattern.compile("^<stop>$").asPredicate(),
true )
.collect(new MyProcessOneBucketCollector<>())
The patterns are specified as ^word$ to require the entire line to consist of the word only; without these anchors, lines merely containing the pattern can start and end a chunk. The nature of the source stream does not allow parallelism when creating the chunks, so when chaining with an immediate collection operation the parallelism for the entire operation is rather limited. It depends on the MyProcessOneBucketCollector whether there can be any parallelism at all.
If your final result does not depend on the order of occurrences of the buckets in the source file, it is strongly recommended that either your collector reports itself to be UNORDERED or you insert an unordered() in the stream’s method chains before the collect.
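If MyProcessOneBucketCollector is assembled via Collector.of, the UNORDERED characteristic can be declared right there. A minimal, self-contained sketch under that assumption; the processBucket logic is invented for illustration and simply sums the line lengths of each chunk:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collector;
import java.util.stream.Stream;

public class UnorderedCollectorDemo {
    // Hypothetical per-bucket processing: here it just sums line lengths.
    static int processBucket(List<String> bucket) {
        return bucket.stream().mapToInt(String::length).sum();
    }

    public static void main(String[] args) {
        // Collector.of lets us declare UNORDERED explicitly, so the result
        // does not depend on the encounter order of the buckets.
        Collector<List<String>, ?, List<Integer>> processEachBucket = Collector.of(
            ArrayList::new,
            (acc, bucket) -> acc.add(processBucket(bucket)),
            (a, b) -> { a.addAll(b); return a; },
            Collector.Characteristics.UNORDERED);

        List<Integer> results = Stream.of(List.of("ab", "cde"), List.of("f"))
            .parallel()
            .collect(processEachBucket);
        // Sum is order-independent even though the list order may vary.
        System.out.println(results.stream().mapToInt(Integer::intValue).sum()); // 6
    }
}
```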

Related

How to set a value to variable based on multiple conditions using Java Streams API?

I couldn't wrap my head around writing the below condition using Java Streams. Let's assume that I have a list of elements from the periodic table. I have to write a method that returns a String by checking whether the list has Silicon, Radium, or both. If it has only Silicon, the method has to return "Silicon". If it has only Radium, the method has to return "Radium". If it has both, the method has to return "Both". If neither of them is available, the method returns "" (the default value).
Currently, the code that I've written is below.
String resolve(List<Element> elements) {
    AtomicReference<String> value = new AtomicReference<>("");
    elements.stream()
            .map(Element::getName)
            .forEach(name -> {
                if (name.equalsIgnoreCase("RADIUM")) {
                    if (value.get().equals("")) {
                        value.set("RADIUM");
                    } else {
                        value.set("BOTH");
                    }
                } else if (name.equalsIgnoreCase("SILICON")) {
                    if (value.get().equals("")) {
                        value.set("SILICON");
                    } else {
                        value.set("BOTH");
                    }
                }
            });
    return value.get();
}
I understand the code looks messy and more imperative than functional. But I don't know how to write it in a better manner using streams. I've also considered going through the list a couple of times to filter the elements Silicon and Radium and finalizing based on that. But it doesn't seem efficient to go through a list twice.
NOTE: I also understand that this could be written in an imperative manner rather than complicating it with streams and atomic variables. I just want to know how to write the same logic using streams.
Please share your suggestions on better ways to achieve the same goal using Java Streams.
It could be done with the Stream API in a single statement, without multiline lambdas, nested conditions, or an impure function that changes state outside the lambda.
My approach is to introduce an enum whose constants correspond to all possible outcomes: EMPTY, SILICON, RADIUM, BOTH.
All the return values apart from the empty string can be obtained by invoking the method name() inherited from java.lang.Enum. Only to cover the case of the empty string have I added a getName() method.
Note that since Java 16 enums can be declared locally inside a method.
The logic of the stream pipeline is the following:
the stream of elements turns into a stream of strings;
gets filtered and transformed into a stream of enum constants;
reduction is done on the enum members;
the optional of enum turns into an optional of string.
Implementation can look like this:
public static String resolve(List<Element> elements) {
    return elements.stream()
            .map(Element::getName)
            .map(String::toUpperCase)
            .filter(str -> str.equals("SILICON") || str.equals("RADIUM"))
            .map(Elements::valueOf)
            .reduce((result, next) -> result == Elements.BOTH || result != next ? Elements.BOTH : next)
            .map(Elements::getName)
            .orElse("");
}
enum
enum Elements {
    EMPTY, SILICON, RADIUM, BOTH;

    String getName() {
        return this == EMPTY ? "" : name(); // note: name() is declared final in java.lang.Enum and can't be overridden
    }
}
main
public static void main(String[] args) {
    System.out.println(resolve(List.of(new Element("Silicon"), new Element("Lithium"))));
    System.out.println(resolve(List.of(new Element("Silicon"), new Element("Radium"))));
    System.out.println(resolve(List.of(new Element("Ferrum"), new Element("Oxygen"), new Element("Aurum")))
        .isEmpty() + " - no target elements"); // output is an empty string
}
output
SILICON
BOTH
true - no target elements
Note:
Although with streams you can produce the result in O(n) time, an iterative approach might be better for this task. Think about it this way: if you have a list of 10,000 elements and it starts with "SILICON" and "RADIUM", you could break the loop early and return "BOTH".
Stateful operations in streams have to be avoided according to the documentation; to understand why the javadoc warns against stateful streams, you might take a look at this question. If you want to play around with AtomicReference it's totally fine, just keep in mind that this approach is not considered good practice.
I guess if I had implemented such a method with streams, the overall logic would be the same as above, but without utilizing an enum. Since only a single object is needed, it's a reduction, so I'd apply reduce() on a stream of strings and extract the reduction logic, with all its conditions, to a separate method. Normally, lambdas should be well-readable one-liners.
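For illustration, the early-exit iterative version hinted at in the note might look like this minimal sketch (Element is assumed to be a simple name holder, modeled here as a record):

```java
import java.util.List;

public class ResolveDemo {
    record Element(String name) {
        String getName() { return name; }
    }

    // Iterative version that can exit as soon as both target elements are seen.
    static String resolve(List<Element> elements) {
        boolean silicon = false, radium = false;
        for (Element e : elements) {
            String n = e.getName().toUpperCase();
            if (n.equals("SILICON")) silicon = true;
            else if (n.equals("RADIUM")) radium = true;
            if (silicon && radium) return "BOTH"; // early exit, no need to scan the rest
        }
        return silicon ? "SILICON" : radium ? "RADIUM" : "";
    }

    public static void main(String[] args) {
        System.out.println(resolve(List.of(new Element("Silicon"), new Element("Radium")))); // BOTH
        System.out.println(resolve(List.of(new Element("Silicon"), new Element("Lithium")))); // SILICON
    }
}
```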
Collect the strings to a unique set. Then check containment in constant time.
Set<String> names = elements.stream()
        .map(Element::getName)
        .map(String::toLowerCase)
        .collect(Collectors.toSet()); // or toSet() with a static import
boolean hasSilicon = names.contains("silicon");
boolean hasRadium = names.contains("radium");
String result = "";
if (hasSilicon && hasRadium) {
    result = "BOTH";
} else if (hasSilicon) {
    result = "SILICON";
} else if (hasRadium) {
    result = "RADIUM";
}
return result;
I have used a predicate in filter for Radium and Silicon, and using the resulting set I print the result.
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class Test {
    public static void main(String[] args) {
        List<Element> elements = new ArrayList<>();
        Set<String> stringSet = elements.stream().map(e -> e.getName())
                .filter(string -> (string.equals("Radium") || string.equals("Silicon")))
                .collect(Collectors.toSet());
        if (stringSet.size() == 2) {
            System.out.println("both");
        } else if (stringSet.size() == 1) {
            System.out.println(stringSet);
        } else {
            System.out.println(" ");
        }
    }
}
You could save a few lines if you use regex, but I doubt if it is better than the other answers:
String resolve(List<Element> elements) {
    String result = elements.stream()
            .map(Element::getName)
            .map(String::toUpperCase)
            .filter(str -> str.matches("RADIUM|SILICON"))
            .sorted()
            .collect(Collectors.joining());
    return result.matches("RADIUMSILICON") ? "BOTH" : result;
}

Stream mysteriously consumed twice

The following code ends up with a java.lang.IllegalStateException: stream has already been operated upon or closed.
public static void main(String[] args) {
    Stream.concat(Stream.of("FOOBAR"),
        reverse(StreamSupport.stream(new File("FOO/BAR").toPath().spliterator(), true)
            .map(Path::toString)));
}

static <T> Stream<T> reverse(Stream<T> stream) {
    return stream.reduce(Stream.empty(),
        (Stream<T> a, T b) -> Stream.concat(Stream.of(b), a),
        (a, b) -> Stream.concat(b, a));
}
The obvious solution is to generate a non-parallel stream with StreamSupport.stream(…, false), but I can't see why it can't run in parallel.
Stream.empty() is not a constant. This method returns a new stream instance on each invocation that will get consumed like any other stream, e.g. when you pass it into Stream.concat.
Therefore, Stream.empty() is not suitable as the identity value for reduce, as the identity value may get passed as input to the reduction function an arbitrary, intentionally unspecified number of times. It's an implementation detail that it happens to be used only a single time for sequential reduction and potentially multiple times for parallel reduction.
You can use
static <T> Stream<T> reverse(Stream<T> stream) {
    return stream.map(Stream::of)
                 .reduce((a, b) -> Stream.concat(b, a))
                 .orElseGet(Stream::empty);
}
instead.
However, I only provide the solution as an academic exercise. As soon as the stream gets large, it leads to an excessive amount of concat calls and the note of the documentation applies:
Use caution when constructing streams from repeated concatenation. Accessing an element of a deeply concatenated stream can result in deep call chains, or even StackOverflowError.
Generally, the resulting underlying data structure will be far more expensive than a flat list, when using the Stream API this way.
You can use something like
Stream<String> s = Stream.concat(Stream.of("FOOBAR"),
    reverse(new File("FOO/BAR").toPath()).map(Path::toString));

static Stream<Path> reverse(Path p) {
    ArrayDeque<Path> d = new ArrayDeque<>();
    p.forEach(d::addFirst);
    return d.stream();
}
or
static Stream<Path> reverse(Path p) {
    Stream.Builder<Path> b = Stream.builder();
    for(; p != null; p = p.getParent()) b.add(p.getFileName());
    return b.build();
}
With Java 9+ you can use a stream that truly has no additional storage (which does not necessarily imply that it will be more efficient):
static Stream<Path> reverse(Path p) {
    return Stream.iterate(p, Objects::nonNull, Path::getParent).map(Path::getFileName);
}

Java Stream `generate()` how to "include" the first "excluded" element

Assume this usage scenario for a Java stream, where data is added from a data source. Data source can be a list of values, like in the example below, or a paginated REST api. It doesn't matter, at the moment.
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Stream;
public class Main {
    public static void main(String[] args) {
        final List<Boolean> dataSource = List.of(true, true, true, false, false, false, false);
        final AtomicInteger index = new AtomicInteger();
        Stream
            .generate(() -> {
                boolean value = dataSource.get(index.getAndIncrement());
                System.out.format("--> Executed expensive operation to retrieve data: %b\n", value);
                return value;
            })
            .takeWhile(value -> value == true)
            .forEach(data -> System.out.printf("--> Using: %b\n", data));
    }
}
If you run this code your output will be
--> Executed expensive operation to retrieve data: true
--> Using: true
--> Executed expensive operation to retrieve data: true
--> Using: true
--> Executed expensive operation to retrieve data: true
--> Using: true
--> Executed expensive operation to retrieve data: false
As you can see the last element, the one that evaluated to false, did not get added to the stream, as expected.
Now assume that the generate() method loads pages of data from a REST api. In that case the value true/false is a value on page N indicating if page N + 1 exists, something like a has_more field. Now, I want the last page returned by the API to be added to the stream, but I do not want to perform another expensive operation to read an empty page, because I already know that there are no more pages.
What is the most idiomatic way to do this using the Java Stream API? Every workaround I can think of requires a call to the API to be executed.
UPDATE
In addition to the approaches listed in Inclusive takeWhile() for Streams there is another ugly way to achieve this.
public static void main(String[] args) {
    final List<Boolean> dataSource = List.of(true, true, true, false, false, false, false);
    final AtomicInteger index = new AtomicInteger();
    final AtomicBoolean hasMore = new AtomicBoolean(true);
    Stream
        .generate(() -> {
            if (!hasMore.get()) {
                return null;
            }
            boolean value = dataSource.get(index.getAndIncrement());
            hasMore.set(value);
            System.out.format("--> Executed expensive operation to retrieve data: %b\n", value);
            return value;
        })
        .takeWhile(Objects::nonNull)
        .forEach(data -> System.out.printf("--> Using: %b\n", data));
}
You are using the wrong tool for your job. As already noticeable in your code example, the Supplier passed to Stream.generate has to go to great lengths to maintain the index it needs for fetching pages.
What makes matters worse, is that Stream.generate creates an unordered Stream:
Returns an infinite sequential unordered stream where each element is generated by the provided Supplier.
This is suitable for generating constant streams, streams of random elements, etc.
You’re not returning constant or random values nor anything else that would be independent of the order.
This has a significant impact on the semantics of takeWhile:
Otherwise returns, if this stream is unordered, a stream consisting of a subset of elements taken from this stream that match the given predicate.
This makes sense if you think about it. If there is at least one element rejected by the predicate, it could be encountered at an arbitrary position for an unordered stream, so an arbitrary subset of elements encountered before it, including the empty set, would be a valid prefix.
But since there is no “before” or “after” for an unordered stream, even elements produced by the generator after the rejected one could be included by the result.
In practice, you are unlikely to encounter such effects for a sequential stream, but it doesn’t change the fact that Stream.generate(…) .takeWhile(…) is semantically wrong for your task.
From your example code, I conclude that pages do not contain their own number nor a "getNext" method, so we have to maintain the number and the "hasNext" state for creating a stream.
Assuming an example setup like
class Page {
    private String data;
    private boolean hasNext;

    public Page(String data, boolean hasNext) {
        this.data = data;
        this.hasNext = hasNext;
    }
    public String getData() {
        return data;
    }
    public boolean hasNext() {
        return hasNext;
    }
}

private static String[] SAMPLE_PAGES = { "foo", "bar", "baz" };

public static Page getPage(int index) {
    Objects.checkIndex(index, SAMPLE_PAGES.length);
    return new Page(SAMPLE_PAGES[index], index + 1 < SAMPLE_PAGES.length);
}
You can implement a correct stream like
Stream.iterate(Map.entry(0, getPage(0)), Objects::nonNull,
        e -> e.getValue().hasNext()? Map.entry(e.getKey()+1, getPage(e.getKey()+1)): null)
    .map(Map.Entry::getValue)
    .forEach(page -> System.out.println(page.getData()));
Note that Stream.iterate creates an ordered stream:
Returns a sequential ordered Stream produced by iterative application of the given next function to an initial element,
conditioned on satisfying the given hasNext predicate.
Of course, things would be much easier if the page knew its own number, e.g.
Stream.iterate(getPage(0), Objects::nonNull,
p -> p.hasNext()? getPage(p.getPageNumber()+1): null)
.forEach(page -> System.out.println(page.getData()));
or if there was a method to get from an existing Page to the next Page, e.g.
Stream.iterate(getPage(0), Objects::nonNull, p -> p.hasNext()? p.getNextPage(): null)
.forEach(page -> System.out.println(page.getData()));

Java pattern for parameters of which only one needs to be non-null?

Lately I have often been writing long functions that take several parameters but use only one of them, where the functionality differs only at a few key points scattered around the function. Splitting the function would therefore create too many small functions without a purpose. Is this good style, or is there a good general refactoring pattern for it? To be clearer, an example:
public void performSearch(DataBase dataBase, List<List<String>> segments) { performSearch(dataBase, null, null, segments); }
public void performSearch(DataBaseCache dataBaseCache, List<List<String>> segments) { performSearch(null, dataBaseCache, null, segments); }
public void performSearch(DataBase dataBase, List<String> keywords) { performSearch(dataBase, null, keywords, null); }
public void performSearch(DataBaseCache dataBaseCache, List<String> keywords) { performSearch(null, dataBaseCache, keywords, null); }
/** Either dataBase or dataBaseCache may be null; dataBaseCache is used if it is non-null, else dataBase is used (slower). */
private void performSearch(DataBase dataBase, DataBaseCache dataBaseCache,
                           List<String> keywords, List<List<String>> segments) {
    SearchObject search = new SearchObject();
    search.setFast(true);
    ...
    search.setNumberOfResults(25);
    if (dataBaseCache != null) { search.setSource(dataBaseCache); }
    else { search.setSource(dataBase); }
    ... do some stuff ...
    if (segments == null) {
        // create segments from keywords
        ....
        segments = ...
    }
}
This style of code works, but I don't like all those null parameters and the possibility of calling the methods wrong (what if both parameters are null, and what happens if both are non-null?), yet I don't want to write four separate functions either... I know this may be too general, but maybe someone has a general solution to this class of problems :-)
P.S.: I don't like to split up a long function if there is no reason for it other than its being long (i.e. if the subfunctions are only ever called in that order and only by this one function), especially if they are tightly interwoven and would need a big number of parameters transported around them.
I think it is very bad procedural style. Try to avoid such coding. Since you already have a bulk of such code, it may be very hard to refactor it, because each method contains its own logic that is slightly different from the others. By the way, the fact that it is hard is evidence that the style is bad.
I think you should use behavioral patterns like
Chain of responsibilities
Command
Strategy
Template method
that can help you to change your procedural code to object oriented.
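For instance, the Strategy pattern applied to the DataBase/DataBaseCache pair might look like this minimal sketch (the types here are simplified stand-ins, not the original classes):

```java
public class StrategyDemo {
    // Strategy interface: both the slow and the cached source implement it,
    // so the search code never needs null checks or parallel parameters.
    interface SearchSource {
        String search(String keyword);
    }

    // One search method, parameterized by the strategy instead of by nullable arguments.
    static String performSearch(SearchSource source, String keyword) {
        return source.search(keyword);
    }

    public static void main(String[] args) {
        SearchSource slow = k -> "db:" + k;       // stands in for DataBase
        SearchSource cached = k -> "cache:" + k;  // stands in for DataBaseCache
        System.out.println(performSearch(slow, "foo"));   // db:foo
        System.out.println(performSearch(cached, "foo")); // cache:foo
    }
}
```

The caller picks the strategy once, and the long function body no longer branches on which parameter happens to be non-null.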
Could you use something like this
public static <T> T firstNonNull(T... parameters) {
    for (T parameter : parameters) {
        if (parameter != null) {
            return parameter;
        }
    }
    throw new IllegalArgumentException("At least one argument must be non null");
}
It does not check whether more than one parameter is non-null, and they must all be of the same type, but you could use it like this:
search.setSource(firstNonNull(dataBaseCache, database));
Expecting nulls is an anti-pattern because it litters your code with NullPointerExceptions waiting to happen. Use the builder pattern to construct the SearchObject. This is the signature you want, I'll let you figure out the implementation:
class SearchBuilder {
    SearchObject search = new SearchObject();
    List<String> keywords = new ArrayList<String>();
    List<List<String>> segments = new ArrayList<List<String>>();

    public SearchBuilder(DataBase dataBase) {}
    public SearchBuilder(DataBaseCache dataBaseCache) {}
    public void addKeyword(String keyword) {}
    public void addSegment(String... segment) {}
    public void performSearch();
}
I agree with what Alex said. Without knowing the problem I would recommend following structure based on what was in the example:
public interface SearchEngine {
    public SearchEngineResult findByKeywords(List<String> keywords);
}

public class JDBCSearchEngine implements SearchEngine {
    private DataSource dataSource;

    public JDBCSearchEngine(DataSource dataSource) {
        this.dataSource = dataSource;
    }

    public SearchEngineResult findByKeywords(List<String> keywords) {
        // Find from the JDBC datasource
        // It might be useful to use a DAO instead of a datasource, if you have
        // database operations other than searching
    }
}

public class CachingSearchEngine implements SearchEngine {
    private SearchEngine searchEngine;

    public CachingSearchEngine(SearchEngine searchEngine) {
        this.searchEngine = searchEngine;
    }

    public SearchEngineResult findByKeywords(List<String> keywords) {
        // First check the cache
        ...
        // If not found, then fetch from the real search engine
        SearchEngineResult result = searchEngine.findByKeywords(keywords);
        // Then add to cache
        // Return the result
        return result;
    }
}
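The two classes compose as a decorator: the caching engine wraps the slow one behind the same interface. A self-contained sketch of that wiring (the engines here are simplified stand-ins for the classes above):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DecoratorDemo {
    interface SearchEngine {
        List<String> findByKeywords(List<String> keywords);
    }

    // Stands in for the JDBC-backed engine; the real lookup logic is omitted.
    static class SlowSearchEngine implements SearchEngine {
        int calls = 0;
        public List<String> findByKeywords(List<String> keywords) {
            calls++; // count expensive lookups
            return List.of("result-for-" + String.join("+", keywords));
        }
    }

    // Decorator: consults its cache first, delegating only on a miss.
    static class CachingSearchEngine implements SearchEngine {
        private final SearchEngine delegate;
        private final Map<List<String>, List<String>> cache = new HashMap<>();
        CachingSearchEngine(SearchEngine delegate) { this.delegate = delegate; }
        public List<String> findByKeywords(List<String> keywords) {
            return cache.computeIfAbsent(keywords, delegate::findByKeywords);
        }
    }

    public static void main(String[] args) {
        SlowSearchEngine slow = new SlowSearchEngine();
        SearchEngine engine = new CachingSearchEngine(slow);
        engine.findByKeywords(List.of("foo"));
        engine.findByKeywords(List.of("foo")); // second call is served from the cache
        System.out.println(slow.calls); // 1
    }
}
```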

Find matching Predicate. Predicate of Predicate?

I have a collection of Predicates, say List<Predicate<File>>. I then have a single File and I need to get the predicate (if any) that matches the file. I was thinking along the lines of using Iterables.find(), but of course that takes a Predicate, not a value to pass into a Predicate. I thought about implementing the following, but I don't know if there already exists a mechanism.
public static <T> Predicate<Predicate<? super T>> createInversePredicate(
        final T value) {
    return new Predicate<Predicate<? super T>>() {
        @Override
        public boolean apply(Predicate<? super T> input) {
            return input.apply(value);
        }
    };
}
This would allow me to do the following:
private List<Predicate<File>> filters = ...;

@Nullable
Predicate<File> findMatching(File file) {
    return Iterables.find(filters, createInversePredicate(file), null);
}
Is there a better way?
Guava team member here.
This is how I'd do it. There isn't a better way.
I would avoid the complexity of creating an "inverse" predicate, and simply use imperative code:
private List<Predicate<File>> filters = ...;

@Nullable
Predicate<File> findMatchingFilter(File file) {
    for (Predicate<File> filter : filters) {
        if (filter.apply(file)) {
            return filter;
        }
    }
    return null;
}
It's more straightforward, and the next programmer won't need to take 1 minute to understand this "inverse" predicate business :)
Java 8 users can do this:
Predicate<File> findMatching(File file) {
    List<Predicate<File>> matchingFilters = filters.stream()
            .filter(predicate -> predicate.test(file))
            .collect(Collectors.toList());
    return matchingFilters.isEmpty() ? null : matchingFilters.get(0);
}
Here I am assuming only one predicate will match the file.
You can also use Optional<Predicate<File>> instead of @Nullable in Java 8.
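The Optional variant can use findFirst directly, which also stops at the first match instead of collecting every matching predicate first. A minimal sketch with made-up filters:

```java
import java.io.File;
import java.util.List;
import java.util.Optional;
import java.util.function.Predicate;

public class FindMatchingDemo {
    // Hypothetical filters for illustration: match by file extension.
    static List<Predicate<File>> filters = List.of(
        f -> f.getName().endsWith(".txt"),
        f -> f.getName().endsWith(".java"));

    // findFirst short-circuits at the first matching predicate.
    static Optional<Predicate<File>> findMatching(File file) {
        return filters.stream().filter(p -> p.test(file)).findFirst();
    }

    public static void main(String[] args) {
        System.out.println(findMatching(new File("Demo.java")).isPresent());  // true
        System.out.println(findMatching(new File("Demo.class")).isPresent()); // false
    }
}
```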
