How to partition a collection into arbitrary sized partitions - java

I'm trying to use Java 8 lambda expressions and streams to parse some logs. I have one giant log file that contains run after run. I want to split it into separate collections, one for each run. I do not know in advance how many runs the log has. And to exercise my very weak lambda-expression muscles, I'd like to do it in one pass through the list.
Here is my current implementation:
List<String> lines = readLines(fileDirectory);
Pattern runStartPattern = Pattern.compile("INFO: \\d\\d:\\d\\d:\\d\\d: Starting");
LinkedList<List<String>> testRuns = new LinkedList<>();
List<String> currentTestRun = new LinkedList<>(); // In case log starts in middle of run
testRuns.add(currentTestRun);
for (String line : lines) {
    if (runStartPattern.matcher(line).find()) {
        currentTestRun = new ArrayList<>();
        testRuns.add(currentTestRun);
    }
    currentTestRun.add(line);
}
if (testRuns.getFirst().size() == 0) { // In case log starts at a run
    testRuns.removeFirst();
}
Basically I want something like TomekRekawek's solution here, but with an unknown partition size to begin with.

There's no standard way to easily achieve this in the Stream API, but my StreamEx library has a groupRuns method which can solve this pretty easily:
List<List<String>> testLines = StreamEx.of(lines)
.groupRuns((a, b) -> !runStartPattern.matcher(b).find())
.toList();
It groups the input elements based on a predicate which is applied to pairs of adjacent elements. Here we don't want to group two lines together if the second line matches the runStartPattern. This works correctly regardless of whether the log starts in the middle of a run or not. This feature also works nicely with parallel streams.
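For reference, here is a minimal, self-contained sketch of how groupRuns behaves on a tiny made-up log (the sample lines and the class name are invented for illustration; it assumes the StreamEx library is on the classpath):
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import one.util.streamex.StreamEx;

public class GroupRunsDemo {
    public static void main(String[] args) {
        Pattern runStartPattern = Pattern.compile("INFO: \\d\\d:\\d\\d:\\d\\d: Starting");
        // Hypothetical sample: the tail of an earlier run, followed by two complete runs
        List<String> lines = Arrays.asList(
                "INFO: 11:59:59: still working",   // log starts in the middle of a run
                "INFO: 12:00:00: Starting",
                "INFO: 12:00:01: step one",
                "INFO: 12:00:02: Starting",
                "INFO: 12:00:03: step one");

        // Start a new group whenever the second element of an adjacent pair is a run start
        List<List<String>> testRuns = StreamEx.of(lines)
                .groupRuns((a, b) -> !runStartPattern.matcher(b).find())
                .toList();

        // Prints three groups: the leading fragment, then one list per run
        testRuns.forEach(System.out::println);
    }
}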

Related

How to read a file in non-sequential order

file.txt has 10 lines.
Integer[] lineWanted = {2, 5, 1};
BufferedReader br = new BufferedReader(new FileReader("file.txt"));
for (int i = 0; i < lineWanted.length; i++) {
    List<String> lineList = br.lines()
            .skip(lineWanted[i] - 1)
            .limit(1)
            .collect(Collectors.toList());
    System.out.println(lineList);
}
But the code is skipping lines and then counting from there, i.e. the output I get is for lines 2, 7 and 8.
If you insist on doing it this way, look carefully at what skip (and limit) is doing. You are skipping to the line index you want, but from the current position in the file. I.e., you get to line 2 correctly, then skip 5 lines (actually 4 from skip + 1 from limit). This puts you at 7, where you get one line to get to 8.
The "correct" way to implement this would be to pre-sort lineWanted, keep track of the previous index, and increment by the difference between the current and previous indices. However, as #tsolakp points out, multiple calls to lines is effectively an undefined operation: you just shouldn't do it.
The specification of BufferedReader.lines() makes it pretty clear that after processing the Stream, the BufferedReader is in an undefined state and can not be used afterwards. So unless you have a strong reason to use a BufferedReader, it’s recommended to use Files.lines to get the stream of lines from a file, which prevents any attempt to reuse the underlying reader in the first place.
You could achieve the goal by repeatedly creating a new stream, but that bears an unacceptable overhead. Keep in mind that even if you skip lines, the file contents have to be processed anyway, to identify the line boundaries, before they can be skipped. And I/O operations are generally expensive compared to computations.
A compromise is to identify the maximum wanted line number first, to avoid processing more lines than necessary (via limit) and the minimum wanted line number to avoid unnecessary intermediate storage (via skip) for a single Stream operation collecting into a temporary List. This may temporarily hold some unwanted lines between the minimum and maximum, but will be more efficient than multiple I/O based Stream operations in most cases:
int[] lineWanted = {2, 5, 1};
IntSummaryStatistics iss = Arrays.stream(lineWanted).summaryStatistics();
List<String> lineList;
try (Stream<String> lines = Files.lines(Paths.get("file.txt"))) {
    lineList = lines
            .limit(iss.getMax())          // stop reading after the largest wanted line
            .skip(iss.getMin() - 1)       // don't store lines before the smallest wanted line
            .collect(Collectors.toList());
}
lineList = Arrays.stream(lineWanted)
        .map(i -> i - iss.getMin())       // translate 1-based line numbers to list indices
        .mapToObj(lineList::get)
        .collect(Collectors.toList());
System.out.println(lineList);
If you really cannot sort your lineWanted array, the best option would probably be to buffer the whole document into a String[] of lines, but it all depends on how you want to access the data. Do you want to read only a few lines from one document, or will you be reading the whole document, just in random order?
I just have to move the BufferedReader inside the loop (I could find no better option, as I wanted to do it with a BufferedReader).
Files.readAllLines(path).get(lineNo)
is another option which gives the line directly, but it is not helpful in my case.
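For completeness, here is a rough sketch of what "moving the BufferedReader inside the loop" looks like; it re-opens (and re-reads) the file for every wanted line, which is exactly the I/O overhead the answers above warn about. File and class names are illustrative:
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.stream.Collectors;

public class ReaderPerLine {
    public static void main(String[] args) throws IOException {
        Integer[] lineWanted = {2, 5, 1};
        for (int wanted : lineWanted) {
            // A fresh reader per iteration, so each lines() stream starts at line 1
            try (BufferedReader br = new BufferedReader(new FileReader("file.txt"))) {
                System.out.println(br.lines()
                        .skip(wanted - 1)
                        .limit(1)
                        .collect(Collectors.toList()));
            }
        }
    }
}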

Java 8 filter for contains

I am new to Java 8, and when I try to use a filter to find all the cities that contain a particular letter, it doesn't work for me. However, when I run it with the old approach, it works.
List<String> cityList = new ArrayList<>();
cityList.add("Noida");
cityList.add("Gurgaon");
cityList.add("Denver");
cityList.add("London");
cityList.add("Utah");
cityList.add("New Delhi");
System.out.println(cityList);
/* Prior to Java 8 Approach */
for (String city : cityList) {
    if (city.contains("a")) {
        System.out.println(city + " contains letter a");
    }
}
/* Java 8 Approach */
System.out.println(Stream.of(cityList).filter(str -> str.contains("a")).collect(Collectors.toList()));
Here is the output
Noida contains letter a
Gurgaon contains letter a
Utah contains letter a
[]
Can you please explain where I am making a mistake?
Thanks in advance!
You'll need to use cityList.stream() rather than Stream.of(cityList). The reason is that Stream.of(cityList) returns a Stream<List<String>>, whereas you want a Stream<String>. You could still accomplish your task with your current approach, but you'd need to flatten the Stream<List<String>> into a Stream<String> (I don't recommend it, as it causes unnecessary overhead; it's better to use cityList.stream()). A sketch of that flattening variant is shown after the corrected line below.
That said, here is how you should go about accomplishing your task:
System.out.println(cityList.stream().filter(str -> str.contains("a")).collect(Collectors.toList()));
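And, purely for illustration, this is what the flattening variant mentioned above would look like; it is a sketch that drops into the question's code where cityList is already defined (not recommended, since cityList.stream() is simpler):
System.out.println(Stream.of(cityList)
        .flatMap(List::stream)                 // Stream<List<String>> -> Stream<String>
        .filter(str -> str.contains("a"))
        .collect(Collectors.toList()));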
Stream.of(cityList) creates a Stream<List<String>> having a single element - your input List.
You need to use cityList.stream() in order to get a Stream<String> containing the elements of your input List.
System.out.println(cityList.stream().filter(str -> str.contains("a")).collect(Collectors.toList()));
outputs
[Noida, Gurgaon, Utah]
The only reason your code passed compilation is that both List and String have a contains method that returns a boolean.
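To make that concrete, here is a small self-contained sketch (class name and sample data invented) that restates the point above in runnable form:
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class ContainsDemo {
    public static void main(String[] args) {
        List<String> cityList = Arrays.asList("Noida", "Gurgaon", "Utah");

        // The single stream element is the whole list, and List.contains("a") asks
        // whether some element equals the string "a" - none does, so the list is dropped.
        System.out.println(Stream.of(cityList)
                .filter(list -> list.contains("a"))
                .collect(Collectors.toList()));   // prints []
    }
}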

In the JavaDocs for the Stream map function, why does the mapper need to be a non-interfering parameter? [duplicate]

I am a beginner in Java 8.
Non-interference is important for consistent Java stream behaviour.
Imagine we are processing a large stream of data and during the process
the source is changed. The result will be unpredictable. This is
irrespective of the processing mode of the stream, parallel or
sequential.
The source can be modified until the terminal operation is invoked.
Beyond that point the source should not be modified until the stream
execution completes. So handling concurrent modification of the stream
source is critical to have consistent stream performance.
The above quotations are taken from here.
Can someone give a simple example that sheds light on why mutating the stream source causes such big problems?
Well, the Oracle example is self-explanatory here. The first one is this:
List<String> l = new ArrayList<>(Arrays.asList("one", "two"));
Stream<String> sl = l.stream();
l.add("three");
String s = sl.collect(Collectors.joining(" "));
If you change l by adding one more element to it before you call the terminal operation (collect with Collectors.joining), you are fine; but notice that the Stream then consists of three elements, not the two that were present at the time you created the Stream via l.stream().
On the other hand doing this:
List<String> list = new ArrayList<>();
list.add("test");
list.forEach(x -> list.add(x));
will throw a ConcurrentModificationException, since you can't change a non-concurrent source while it is being traversed.
And now suppose you have an underlying source that can handle concurrent adds:
ConcurrentHashMap<String, Integer> cMap = new ConcurrentHashMap<>();
cMap.put("one", 1);
cMap.forEach((key, value) -> cMap.put(key + key, value + value));
System.out.println(cMap);
What should the output be here? When I run this it is:
{oneoneoneoneoneoneoneone=8, one=1, oneone=2, oneoneoneone=4}
Changing the key to zx (cMap.put("zx", 1)), the result is now:
{zxzx=2, zx=1}
The result is not consistent.

Why can't a stream of streams be reduced in parallel? / stream has already been operated upon or closed

Context
I've stumbled upon a rather annoying problem: I have a program with a lot of data sources that are able to stream the same type of elements, and I want to "map" each available element in the program (element order doesn't matter).
Therefore I've tried to reduce my Stream<Stream<T>> streamOfStreamOfT; into a simple Stream<T> streamOfT; using streamOfT = streamOfStreamOfT.reduce(Stream.empty(), Stream::concat);
Since element order is not important to me, I've tried to parallelize the reduce operation with a .parallel(): streamOfT = streamOfStreamOfT.parallel().reduce(Stream.empty(), Stream::concat); But this triggers a java.lang.IllegalStateException: stream has already been operated upon or closed
Example
To experience it yourself, just play with the following main (Java 1.8u20) by commenting / uncommenting the .parallel():
public static void main(String[] args) {
    // GIVEN
    List<Stream<Integer>> listOfStreamOfInts = new ArrayList<>();
    for (int j = 0; j < 10; j++) {
        IntStream intStreamOf10Ints = IntStream.iterate(0, i -> i + 1)
                .limit(10);
        Stream<Integer> genericStreamOf10Ints = StreamSupport.stream(
                intStreamOf10Ints.spliterator(), true);
        listOfStreamOfInts.add(genericStreamOf10Ints);
    }
    Stream<Stream<Integer>> streamOfStreamOfInts = listOfStreamOfInts
            .stream();
    // WHEN
    Stream<Integer> streamOfInts = streamOfStreamOfInts
            // ////////////////
            // PROBLEM
            // |
            // V
            .parallel()
            .reduce(Stream.empty(), Stream::concat);
    // THEN
    System.out.println(streamOfInts.map(String::valueOf).collect(
            joining(", ")));
}
Question
Can someone explain this limitation? / find a better way of handling parallel reduction of a stream of streams?
Edit 1
Following #Smutje's and #LouisWasserman's comments, it seems that .flatMap(Function.identity()) is a better option that tolerates .parallel() streams.
The form of reduce you are using takes an identity value and an associative combining function. But Stream.empty() is not a value; it has state. Streams are not data structures like arrays or collections; they are carriers for pushing data through possibly-parallel aggregate operations, and they have some state (like whether the stream has been consumed or not). Think about how this works: you're going to build a tree where the same "empty" stream appears in more than one leaf. When you try to use this stateful not-an-identity twice (which won't happen sequentially, but will happen in parallel), the second time you try to traverse that empty stream, it will quite correctly be seen to be already used.
So the problem is, you're simply using this reduce method incorrectly. The problem is not with the parallelism; it is simply that the parallelism exposed the underlying problem.
Secondly, even if this "worked" the way you think it should, you would only get parallelism building the tree that represents the flattened stream-of-streams; when you go to do the joining, that's a sequential stream pipeline. Oops.
Thirdly, even if this "worked" the way you think it should, you're going to add a lot of element-access overhead by building up concatenated streams, and you're not going to get the benefit of parallelism that you are seeking.
The simple answer is to flatten the streams:
String joined = streamOfStreams.parallel()
.flatMap(s -> s)
.collect(joining(", "));
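Applied to the question's own example, the fix would look roughly like this (a sketch reusing the listOfStreamOfInts variable from the question's main):
String joined = listOfStreamOfInts.stream()
        .parallel()
        .flatMap(s -> s)                  // Stream<Stream<Integer>> -> Stream<Integer>
        .map(String::valueOf)
        .collect(Collectors.joining(", "));
System.out.println(joined);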
