Java 8 Streams: Read file word by word - java

I use Java 8 streams a lot to process files but so far always line-by-line.
What I want is a function which gets a BufferedReader br, reads a specific number of words (separated by "\\s+"), and leaves the BufferedReader at the exact position where that number of words was reached.
Right now I have a version which reads the file line by line:
final int[] wordCount = {20};
br.lines()
  .map(l -> l.split("\\s+"))
  .flatMap(Arrays::stream)
  .filter(s -> {
      // process s
      return --wordCount[0] == 0;
  })
  .findFirst();
This obviously leaves the underlying input stream positioned at the start of the line following the one that contains the 20th word.
Is there a way to get a stream which reads less than a line from the inputstream?
EDIT
I am parsing a file where the first word contains the number of words that follow. I read this word and then read in exactly that many words. The file contains multiple such sections, and each section is parsed by the described function.
Having read all the helpful comments, it has become clear to me that using a Scanner is the right choice for this problem, and that Java 9 will have a Scanner class which provides stream features (Scanner.tokens() and Scanner.findAll()).
Using streams the way I described gives no guarantee that the reader will be at a specific position after the terminal operation of the stream (API docs), which makes streams the wrong choice for parsing a structure where you only parse a section at a time and have to keep track of the position.
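A minimal sketch of that Scanner-based approach for the count-then-words format described above (the class and method names here are illustrative, not from the original post):

import java.io.BufferedReader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

public class SectionReader {

    // Reads one "count + words" section from the scanner, or returns null at end of input.
    static List<String> readSection(Scanner sc) {
        if (!sc.hasNextInt()) {
            return null;
        }
        int count = sc.nextInt();               // first word: number of words that follow
        List<String> words = new ArrayList<>();
        for (int i = 0; i < count && sc.hasNext(); i++) {
            words.add(sc.next());               // default delimiter is whitespace
        }
        return words;
    }

    public static void main(String[] args) {
        String input = "5 a section of five words 3 three words\nsection 2 short section";
        try (Scanner sc = new Scanner(new BufferedReader(new StringReader(input)))) {
            List<String> section;
            while ((section = readSection(sc)) != null) {
                System.out.println(section);
            }
        }
    }
}

Note that a Scanner buffers ahead of the underlying reader and keeps track of the position itself, so keep using the same Scanner for the whole file rather than dropping back to the raw BufferedReader.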

Regarding your original problem: I assume your file looks like this:
5 a section of five words 3 three words
section 2 short section 7 this section contains a lot
of words
And you want to get the output like this:
[a, section, of, five, words]
[three, words, section]
[short, section]
[this, section, contains, a, lot, of, words]
In general, the Stream API is poorly suited to such problems; writing a plain old loop looks like a better solution here (a loop-based sketch appears after the explanation below). If you still want to see a Stream API based solution, I can suggest using my StreamEx library, which contains a headTail() method allowing you to easily write custom stream-transformation logic. Here's how your problem could be solved using headTail:
/* Transform a stream of words like 2, a, b, 3, c, d, e
   into a stream of lists like [a, b], [c, d, e] */
public static StreamEx<List<String>> records(StreamEx<String> input) {
    return input.headTail((count, tail) ->
            makeRecord(tail, Integer.parseInt(count), new ArrayList<>()));
}

private static StreamEx<List<String>> makeRecord(StreamEx<String> input, int count,
                                                 List<String> buf) {
    return input.headTail((head, tail) -> {
        buf.add(head);
        return buf.size() == count
                ? records(tail).prepend(buf)
                : makeRecord(tail, count, buf);
    });
}
Usage example:
String s = "5 a section of five words 3 three words\n"
        + "section 2 short section 7 this section contains a lot\n"
        + "of words";
Reader reader = new StringReader(s);
Stream<List<String>> stream = records(StreamEx.ofLines(reader)
        .flatMap(Pattern.compile("\\s+")::splitAsStream));
stream.forEach(System.out::println);
The result looks exactly like the desired output above. Replace reader with your BufferedReader or FileReader to read from the input file. The stream of records is lazy: at most one record is kept by the stream at a time, and if you short-circuit, the rest of the input will not be read (though, of course, the current file line will be read to the end). The solution, while it looks recursive, does not eat stack or heap, so it works for huge files as well.
Explanation:
The headTail() method takes a two-argument lambda which is executed at most once during the outer stream's terminal operation, when a stream element is requested. The lambda receives the first stream element (head) and a stream containing all the other original elements (tail). The lambda should return a new stream which will be used instead of the original one. In records we have:
return input.headTail((count, tail) ->
makeRecord(tail, Integer.parseInt(count), new ArrayList<>()));
The first element of the input is the count: convert it to a number, create an empty ArrayList, and call makeRecord for the tail. Here's the makeRecord helper method implementation:
return input.headTail((head, tail) -> {
First stream element is head, add it to the current buffer:
buf.add(head);
Has the target buffer size been reached?
return buf.size() == count
If yes, call records for the tail again (to process the next record, if any) and prepend the resulting stream with a single element: the current buffer.
? records(tail).prepend(buf)
Otherwise, call myself for the tail (to add more elements to the buffer).
: makeRecord(tail, count, buf);
});
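For comparison, here is a minimal sketch of the plain-loop alternative mentioned above, consuming an Iterator over the same flattened word stream (the class and helper names are illustrative):

import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.regex.Pattern;
import one.util.streamex.StreamEx;

public class RecordsLoop {

    // Plain-loop version of records(): consume the word iterator record by record.
    static List<List<String>> recordsLoop(Iterator<String> words) {
        List<List<String>> result = new ArrayList<>();
        while (words.hasNext()) {
            int count = Integer.parseInt(words.next()); // first word of a record is its size
            List<String> record = new ArrayList<>(count);
            for (int i = 0; i < count && words.hasNext(); i++) {
                record.add(words.next());
            }
            result.add(record);
        }
        return result;
    }

    public static void main(String[] args) {
        Reader reader = new StringReader("5 a section of five words 3 three words\n"
                + "section 2 short section 7 this section contains a lot\nof words");
        Iterator<String> words = StreamEx.ofLines(reader)
                .flatMap(Pattern.compile("\\s+")::splitAsStream)
                .iterator();
        recordsLoop(words).forEach(System.out::println);
    }
}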

Related

How to read a file in unsequential order

file.txt has 10 lines.
Integer[] lineWanted = {2, 5, 1};
BufferedReader br = new BufferedReader(new FileReader("file.txt"));
for (int i = 0; i < lineWanted.length; i++) {
    List<String> lineList = br.lines()
            .skip(lineWanted[i] - 1)
            .limit(1)
            .collect(Collectors.toList());
    System.out.println(lineList);
}
But the code is skipping lines and then counting, i.e. the output I am getting is for lines 2, 7 and 8.
If you insist on doing it this way, look carefully at what skip (and limit) is doing. You are skipping to the line index you want, but from the current position in the file. I.e., you get to line 2 correctly, then skip 5 lines (actually 4 from skip + 1 from limit). This puts you at 7, where you get one line to get to 8.
The "correct" way to implement this would be to pre-sort lineWanted, keep track of the previous index, and skip by the difference between the current and previous indices (see the sketch below). However, as #tsolakp points out, multiple calls to lines() on the same reader is effectively an undefined operation: you just shouldn't do it.
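A minimal sketch of that skip-by-difference arithmetic; note that it still reuses one BufferedReader across several lines() calls, which, as just said, is effectively undefined behaviour, and it prints the lines in ascending order rather than in the original lineWanted order:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.Arrays;

public class SkipByDifference {
    public static void main(String[] args) throws IOException {
        Integer[] lineWanted = {2, 5, 1};
        Arrays.sort(lineWanted);                   // {1, 2, 5}
        try (BufferedReader br = new BufferedReader(new FileReader("file.txt"))) {
            int previous = 0;                      // last line number already consumed
            for (int wanted : lineWanted) {
                br.lines()
                        .skip(wanted - previous - 1)   // skip relative to the current position
                        .limit(1)
                        .forEach(System.out::println);
                previous = wanted;
            }
        }
    }
}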
The specification of BufferedReader.lines() makes it pretty clear that after processing the Stream, the BufferedReader is in an undefined state and can not be used afterwards. So unless you have a strong reason to use a BufferedReader, it’s recommended to use Files.lines to get the stream of lines from a file, which prevents any attempt to reuse the underlying reader in the first place.
You could achieve the goal by repeatedly creating a new stream, but that bears an unacceptable overhead. Keep in mind that even if you skip lines, the file contents have to be processed anyway, to identify the line boundaries, before they can be skipped. And I/O operations are generally expensive compared to computations.
A compromise is to identify the maximum wanted line number first, to avoid processing more lines than necessary (via limit), and the minimum wanted line number, to avoid unnecessary intermediate storage (via skip), for a single Stream operation collecting into a temporary List. This may temporarily hold some unwanted lines between the minimum and maximum, but it will be more efficient than multiple I/O based Stream operations in most cases:
int[] lineWanted = {2, 5, 1};                   // zero-based line indices
IntSummaryStatistics iss = Arrays.stream(lineWanted).summaryStatistics();
List<String> lineList;
try (Stream<String> lines = Files.lines(Paths.get("file.txt"))) {
    lineList = lines
            .limit(iss.getMax() + 1).skip(iss.getMin())
            .collect(Collectors.toList());
}
// copy into an effectively final variable so the method reference below can capture it
List<String> buffered = lineList;
lineList = Arrays.stream(lineWanted)
        .map(i -> i - iss.getMin())             // translate a line index to a buffer index
        .mapToObj(buffered::get)
        .collect(Collectors.toList());
System.out.println(lineList);
If you really cannot sort your lineWanted list, the best way would probably be to buffer the whole document into a String[] of lines, but it all depends on how you want to access the data. Do you want to read only a few lines from one document, or will you be reading the whole document, just in random order?
I just had to move the BufferedReader inside the loop (no better option that I could find, since I wanted to do it with a BufferedReader); see the sketch below.
Files.readAllLines(path).get(lineNo)
is another option which gives the line directly, but it was not helpful in my case.
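A minimal sketch of that workaround, opening a fresh reader for each wanted (1-based) line number:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class ReaderPerLine {
    public static void main(String[] args) throws IOException {
        int[] lineWanted = {2, 5, 1};
        for (int lineNo : lineWanted) {
            // a new reader per wanted line, so every lines() call starts at line 1
            try (BufferedReader br = new BufferedReader(new FileReader("file.txt"))) {
                br.lines()
                        .skip(lineNo - 1)
                        .limit(1)
                        .forEach(System.out::println);
            }
        }
    }
}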

How to read in a text file and store content in separate ArrayLists

My classes are the following: Room, Drawers, Shorts, Tops, and Socks. I have ArrayLists roomList, drawerList, shortList, topList, and sockList in their designated classes. I also have a text file (room.txt) whose contents I need to read in and store in the appropriate ArrayList.
The text file looks something like this:
Room 1, White,BlackStripes,3
Drawer 1,Black,Large,3,2,4
Drawer 2,White, Small,4,1,2
Short 1,Blue, M, 32
Short 2,Yellow, L, 34
Short 3, Orange, S, 28
Top 1,Green,L, 10
Sock 1, White, L, 20
Sock 2, Red, L, 18
Basically I'm having trouble putting the content in the right place. This is what my code looks like:
try {
    Scanner read = new Scanner(new File("room.txt"));
    while (read.hasNextLine()) {
        Room myRoom = new Room();
        Drawers myDrawers = new Drawers();
        Shorts myShorts = new Shorts();
        Tops myTops = new Tops();
        Socks mySocks = new Socks();
        // What goes here, including comma delimiters, counters etc.?
        myRoom.setName(/* What goes here? */);
        myRoom.setColor();
        myRoom.setStripes();
        myRoom.setDrawerAmount();
        Room.roomList.add(myRoom);
        myDrawers.setName();
        myDrawers.setColor();
        myDrawers.setSize();
        myDrawers.setContainers();
        myDrawers.setKnobs();
        myDrawers.setItem();
        Drawers.drawerList.add(myDrawers);
        myShorts.setName();
        myShorts.setColor();
        myShorts.setSize();
        myShorts.setNumSize();
        Shorts.shortList.add(myShorts);
        myTops.setName();
        myTops.setColor();
        myTops.setSize();
        myTops.setNumSize();
        Tops.topList.add(myTops);
        mySocks.setName();
        mySocks.setColor();
        mySocks.setSize();
        mySocks.setPairs();
        Socks.sockList.add(mySocks);
    }
} catch (Exception e) {
    System.out.println("Error: File Not Found");
}
Basically, I'm not sure how to structure it so that I know when I should add to roomList, drawerList, shortList, topList, or sockList, since there are different numbers of lines and different amounts of content per line.
You must write a custom parser for that kind of file.
Your parser must read each line, then split the line by commas "," (see the String.split() method). That gives you a String[] for each line.
Once you have the String[], compare the first element with "Room", "Drawer", etc. in a switch statement or if/else.
You must process each element of the String[] in order to classify your objects. Do this in a separate method. Your method should return a Room, Drawer, Shorts, etc. to the caller, and so on (a minimal sketch follows below).
You don't tell us what the classes Room, Drawer... look like, so there is your problem :)
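Here is a minimal, self-contained sketch of that split-and-classify idea. Since the Room, Drawers, etc. classes were not shown, it just collects the parsed fields into per-type buckets; in the real code each branch (or switch case) would construct the matching object and add it to its list:

import java.io.File;
import java.io.FileNotFoundException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Scanner;

public class RoomFileParser {
    public static void main(String[] args) throws FileNotFoundException {
        // one bucket per record type; in your program these would be
        // Room.roomList, Drawers.drawerList, and so on
        Map<String, List<String[]>> buckets = new HashMap<>();
        try (Scanner read = new Scanner(new File("room.txt"))) {
            while (read.hasNextLine()) {
                String line = read.nextLine().trim();
                if (line.isEmpty()) {
                    continue;
                }
                String[] parts = line.split("\\s*,\\s*");  // split on commas, trimming spaces
                String type = parts[0].split("\\s+")[0];   // "Room", "Drawer", "Short", ...
                buckets.computeIfAbsent(type, t -> new ArrayList<>()).add(parts);
            }
        }
        buckets.forEach((type, rows) ->
                System.out.println(type + ": " + rows.size() + " entries"));
    }
}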
Why don't you use plain old lists?
List<Room> rlist= new ArrayList<Room>();
List<Drawer> dlist= new ArrayList<Drawer>();
so on and so forth, then it is this easy:
rlist.add(myRoom);
then on and on again
#eyp has good advice. In addition,
Scanner can be frustrating at times. If it works for you, go with it. Otherwise, read up on other Java I/O classes like FileReader and BufferedReader. You can compose a BufferedReader and a FileReader (see the example in the BufferedReader documentation). It's almost as easy as using Scanner, it gives you the readLine() method you need, and it allows you to do more advanced I/O in the future.
If the different classes were subclasses of a common superclass, you could put a lot of the parsing into the superclass constructor to avoid repeating it for each subclass. Look at what each line in your text file is made up of, and I think you'll be able to recognize a common superclass; each subclass could then handle the part that is specific to it. A sketch of that idea follows below.
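A minimal sketch of that superclass idea, using hypothetical class and field names (the question's real classes were not shown); the fields shared by every line, the name and the color, are parsed once in the superclass constructor:

public class SuperclassDemo {

    // Every line starts with a name and a color, so parse those in a common superclass.
    static abstract class Entry {
        final String name;
        final String color;

        Entry(String[] parts) {
            this.name = parts[0].trim();   // e.g. "Short 1"
            this.color = parts[1].trim();  // e.g. "Blue"
        }
    }

    // Each subclass only parses what is specific to it.
    static class ShortsEntry extends Entry {
        final String size;
        final int numSize;

        ShortsEntry(String[] parts) {
            super(parts);                                     // shared fields
            this.size = parts[2].trim();                      // e.g. "M"
            this.numSize = Integer.parseInt(parts[3].trim()); // e.g. 32
        }
    }

    public static void main(String[] args) {
        ShortsEntry s = new ShortsEntry("Short 1,Blue, M, 32".split(","));
        System.out.println(s.name + " / " + s.color + " / " + s.size + " / " + s.numSize);
    }
}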

Why can't a stream of streams be reduced in parallel? / stream has already been operated upon or closed

Context
I've stumbled upon a rather annoying problem: I have a program with a lot of data sources that are able to stream the same type of elements, and I want to "map" each available element in the program (element order doesn't matter).
Therefore I've tried to reduce my Stream<Stream<T>> streamOfStreamOfT; into a simple Stream<T> streamOfT; using streamOfT = streamOfStreamOfT.reduce(Stream.empty(), Stream::concat);
Since element order is not important to me, I've tried to parallelize the reduce operation with a .parallel(): streamOfT = streamOfStreamOfT.parallel().reduce(Stream.empty(), Stream::concat); But this triggers a java.lang.IllegalStateException: stream has already been operated upon or closed
Example
To experience it yourself, just play with the following main (Java 1.8u20) by commenting / uncommenting the .parallel():
public static void main(String[] args) {
    // GIVEN
    List<Stream<Integer>> listOfStreamOfInts = new ArrayList<>();
    for (int j = 0; j < 10; j++) {
        IntStream intStreamOf10Ints = IntStream.iterate(0, i -> i + 1)
                .limit(10);
        Stream<Integer> genericStreamOf10Ints = StreamSupport.stream(
                intStreamOf10Ints.spliterator(), true);
        listOfStreamOfInts.add(genericStreamOf10Ints);
    }
    Stream<Stream<Integer>> streamOfStreamOfInts = listOfStreamOfInts
            .stream();

    // WHEN
    Stream<Integer> streamOfInts = streamOfStreamOfInts
            // ////////////////
            // PROBLEM
            // |
            // V
            .parallel()
            .reduce(Stream.empty(), Stream::concat);

    // THEN
    System.out.println(streamOfInts.map(String::valueOf).collect(
            joining(", ")));
}
Question
Can someone explain this limitation, or suggest a better way of handling parallel reduction of a stream of streams?
Edit 1
Following #Smutje's and #LouisWasserman's comments, it seems that .flatMap(Function.identity()) is a better option that tolerates .parallel() streams.
The form of reduce you are using takes an identity value and an associative combining function. But Stream.empty() is not a value; it has state. Streams are not data structures like arrays or collections; they are carriers for pushing data through possibly-parallel aggregate operations, and they have some state (like whether the stream has been consumed or not). Think about how this works; you're going to build a tree where the same "empty" stream appears in more than one leaf. When you try to use this stateful not-an-identity twice (which won't happen sequentially, but will happen in parallel), the second time you try to traverse that empty stream, it will quite correctly be seen to be already used.
So the problem is, you're simply using this reduce method incorrectly. The problem is not with the parallelism; it is simply that the parallelism exposed the underlying problem.
Secondly, even if this "worked" the way you think it should, you would only get parallelism building the tree that represents the flattened stream-of-streams; when you go to do the joining, that's a sequential stream pipeline there. Ooops.
Thirdly, even if this "worked" the way you think it should, you're going to add a lot of element-access overhead by building up concatenated streams, and you're not going to get the benefit of parallelism that you are seeking.
The simple answer is to flatten the streams:
String joined = streamOfStreams.parallel()
.flatMap(s -> s)
.collect(joining(", "));

External Sorting from files in Java

I am wondering how to write Java code for the following pseudocode:
foreach file F in file directory D
    foreach int I in file F
        sort all I from each file
Basically this is part of the external sorting algorithm: those files contain sorted lists of integers, and I want to read the first value from each file, sort them, output them to another file, and then move to the next integer from each file again, until all the integers are fully sorted.
The problem is that, as far as I understand, each file needs its own reader, so if we have N files does that mean we need N file readers?
======update=======
I am wondering whether it should be something that looks like this. Correct me if I missed anything, or suggest a better approach.
int numOfFiles = 10;
Scanner[] scanners = new Scanner[numOfFiles];
try {
    // create a reader for each of the files
    for (int i = 0; i < numOfFiles; i++) {
        scanners[i] = new Scanner(new BufferedReader(
                new FileReader("file" + i + ".txt")));
    }
} catch (FileNotFoundException fnfe) {
    // handle the missing file
}
The problem is that, as far as I understand, for each file we need a reader, so if we have N files does that mean we need N file readers?
Yes, that's right - unless you want to either go back over the data repeatedly, or read the whole of each file into memory. Either of those would let you get away with only one file open at a time - but that may well not suit what you want to do.
Operating systems usually only allow you to open a certain number of files at a time. If you're trying to do something like create a single sorted set of results from a very large number of files, you might want to consider operating on a few of them at a time, producing larger intermediate files. At its simplest, this would just sort two files at a time, e.g.
input1 + input2 => tmp-a1
input3 + input4 => tmp-a2
input5 + input6 => tmp-a3
input7 + input8 => tmp-a4
tmp-a1 + tmp-a2 => tmp-b1
tmp-a3 + tmp-a4 => tmp-b2
tmp-b1 + tmp-b2 => result
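If, on the other hand, all N sorted inputs can be open at once, a minimal k-way merge sketch with one reader per file might look like this (the file names and the one-integer-per-line format are assumptions):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.PriorityQueue;

public class KWayMerge {

    // Pairs a reader with the most recent value read from it.
    static class Source {
        final BufferedReader reader;
        int current;

        Source(BufferedReader reader, int current) {
            this.reader = reader;
            this.current = current;
        }
    }

    public static void main(String[] args) throws IOException {
        int numOfFiles = 10;
        // the source with the smallest current value is always polled first
        PriorityQueue<Source> queue =
                new PriorityQueue<>((a, b) -> Integer.compare(a.current, b.current));

        for (int i = 0; i < numOfFiles; i++) {
            BufferedReader br = new BufferedReader(new FileReader("file" + i + ".txt"));
            String line = br.readLine();
            if (line != null) {
                queue.add(new Source(br, Integer.parseInt(line.trim())));
            } else {
                br.close();
            }
        }

        try (PrintWriter out = new PrintWriter("sorted.txt")) {
            while (!queue.isEmpty()) {
                Source s = queue.poll();              // smallest value among all files
                out.println(s.current);
                String line = s.reader.readLine();
                if (line != null) {                   // refill from the same file and re-queue
                    s.current = Integer.parseInt(line.trim());
                    queue.add(s);
                } else {
                    s.reader.close();
                }
            }
        }
    }
}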
Yes, we must have N file readers for reading N files.
To iterate over all the files in a directory, read the files one by one and store their values in a list, then sort that list to get your output.
There's a method called polyphase merge sort that I recently learnt in my data structures class, where you traverse the files in the form of runs (a run is a sorted sequence). There are n sources and a destination.
The gist of this polyphase method is that no file (in the given set of files) is kept idle. It significantly reduces the number of iterations. It's done by taking a Fibonacci sequence of an order equal to the number of files. So in the case of 5 files, I'll take the Fibonacci sequence of order 5: [1, 1, 2, 4, 8], which represents the number of runs you're going to take out of each file and place, where one of the files corresponding to runs = 1 will be the destination.
In short:
Distribute the file into runs according to the Fibonacci sequence [which would mean the entire dataset is in a single file; if that's not the case, you can always create in-situ runs, where you might want to add dummy runs to suit the sequence].
Take the first n runs from every file into the buffer, sort them (insertion sort preferred) and dump them into ONE file. That ONE file is again selected by the Fibonacci sequence.
Repeat until you end up with a single file containing a single run.
This is a paper which neatly explains the polyphase concept: ftp://reports.stanford.edu/pub/cstr/reports/cs/tr/76/543/CS-TR-76-543.pdf
http://en.wikipedia.org/wiki/Polyphase_merge_sort explains the algorithm better.
Just presenting code, not answering "do we need N file readers?" :)
Use org.apache.commons.io:
// get line iterators:
Collection<File> files = FileUtils.listFiles(/* TODO : filter conf */);
List<LineIterator> iters = new ArrayList<LineIterator>();
for (File file : files) {
    iters.add(FileUtils.lineIterator(file, "UTF-8"));
}

// collect a line from each file
List<String> numbers = new ArrayList<String>();
for (LineIterator li : iters) {
    numbers.add(li.nextLine());
}

// sort
// Arrays.sort(numbers /* will fail */); // :)
Yes, you need N File readers.
public void workOnFiles() {
    File[] D = new File("directoryName").listFiles(); // D.length should equal N
    for (File F : D) {
        doSortingForEachFile(F); // do the sorting part here; the same reader cannot open the same file again
    }
}

public void doSortingForEachFile(File f) {
    try {
        ArrayList<Integer> list = new ArrayList<Integer>();
        Scanner s = new Scanner(f);
        while (s.hasNextInt()) { // store the ints inside the file
            list.add(s.nextInt());
        }
        s.close(); // once closed, it cannot be opened again
        Collections.sort(list); // sorts the ArrayList of ints
        // ...write the numbers inside list to another file...
    } catch (Exception e) {
    }
}

Java: read file input contents and filter them if some line pattern sequences are found

I need to process an input file and copy its content (line by line) to an output file. However, there is some unimportant (stray) data inside this input file that I need to skip.
The main problem that I am trying to solve is actually more complicated than this, but I am just going to simplify it:
So, I have an input file containing hundreds of thousands of lines.
If the following sequence of 3-lines occur inside the input file:
A
B
C
then I need to skip these 3 lines and proceed with the next line in the input file. I can only skip these 3 lines if these 3 lines occur as a sequence of consecutive lines.
For example:
Input File:
A
A
B
C
B
P
A
B
C
A
B
A
A
B
C
A
Output file:
A
B
P
A
B
A
A
Clarification:
A
A (skipped)
B (skipped)
C (skipped)
B
P
A (skipped)
B (skipped)
C (skipped)
A
B
A
A (skipped)
B (skipped)
C (skipped)
A
Notice that I can skip the sequence of lines (A, B, C) only if they occur sequentially. All the other lines which are not skipped have to be copied to the output file.
If I use BufferedReader.readLine(), I cannot backtrack to the previous lines if the next line doesn't match the input pattern. For example, if I have already encountered an A and the next line is another A (not a B), I then have to copy the first A to the output file, start the filtering again from the second A, which I have not processed yet, check the following line, and so on.
One way that I can think of is to first save the contents of the input text file, so I can easily backtrack while traversing the contents if they don't match the pattern I am looking for. However, this is not a memory-efficient solution. Is there any clever algorithm to solve this, preferably in a single traversal, i.e. O(N) complexity? Or, if this is not possible, what would be the most memory-efficient solution? Some example C / Java code would be really helpful.
You could do this with a 3-element array.
Whenever you encounter an A, check if the first element of the array is empty -- if not, flush the array to the output file -- then store the new A to the first element of the array.
Whenever you encounter a B, check if the second element of the array is empty but the first element is full -- if not, flush the array to the output file along with the new B. Otherwise (that is, if the first element is full but the 2nd is empty) you'll store the new B as the 2nd element of the array.
For C, repeat the logic for B, incremented by one: Whenever you encounter a C, check if the third element of the array is empty but the 2nd element is full -- if not, flush the array to the output file along with the new C. Otherwise (that is, if the 2nd element is full but the 3rd is empty) you'll store the new C as the 3rd element of the array.
When you encounter neither A nor B nor C, flush any existing array elements to the output file, then write the new line directly to the output file.
The main trick here is that you're defining explicit rules for filling each slot of the buffer array, and using them to avoid re-checking any line matches, while flushing the buffer to the output and resetting the sequence whenever you break the pattern.
Granted, you admit that your actual rule set is somewhat more complicated, but the same type of approach should work; a sketch for the literal A/B/C case follows below.
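A minimal sketch of that buffering approach for the literal A/B/C case shown above (the in.txt and out.txt file names are assumptions):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.io.PrintWriter;

public class SkipAbcSequences {

    public static void main(String[] args) throws IOException {
        String[] buf = new String[3];   // holds a partial A, B, C sequence
        int filled = 0;                 // how many slots of buf are in use

        try (BufferedReader in = new BufferedReader(new FileReader("in.txt"));
             PrintWriter out = new PrintWriter("out.txt")) {
            String line;
            while ((line = in.readLine()) != null) {
                if (line.equals("A")) {
                    // a new potential sequence starts: flush whatever was buffered
                    filled = flush(buf, filled, out);
                    buf[filled++] = line;
                } else if (line.equals("B") && filled == 1) {
                    buf[filled++] = line;            // an A is buffered, keep building
                } else if (line.equals("C") && filled == 2) {
                    filled = 0;                      // full A, B, C sequence: drop it
                } else {
                    filled = flush(buf, filled, out);
                    out.println(line);               // ordinary line, copy it through
                }
            }
            flush(buf, filled, out);                 // write any leftover partial sequence
        }
    }

    // Writes the buffered lines to the output and reports the buffer as empty again.
    private static int flush(String[] buf, int filled, PrintWriter out) {
        for (int i = 0; i < filled; i++) {
            out.println(buf[i]);
        }
        return 0;
    }
}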
I'm assuming your lines are more complex than just "A", "B" and "C", but that there is some way to tell an "A" from a "B" from a "C".
(If they really are just A, B and C then you don't need to store anything.)
I'd make a little state-machine type program:
state = Base;
while (there are more lines) {
    line = read_a_line();
    switch (state) {
        case Base:
            if (line.isTypeA()) {
                storedLines.add(line);
                state = GotA;
            } else {
                output(line);
            }
            break;
        case GotA:
            if (line.isTypeB()) {
                storedLines.add(line);
                state = GotB;
            } else {
                output(storedLines);
                storedLines.clear();
                if (line.isTypeA()) {       // this line may itself start a new sequence
                    storedLines.add(line);  // stay in GotA
                } else {
                    output(line);
                    state = Base;
                }
            }
            break;
        case GotB:
            if (line.isTypeC()) {
                storedLines.clear();        // complete A, B, C sequence: drop it
                state = Base;
            } else {
                output(storedLines);
                storedLines.clear();
                if (line.isTypeA()) {       // this line may itself start a new sequence
                    storedLines.add(line);
                    state = GotA;
                } else {
                    output(line);
                    state = Base;
                }
            }
            break;
    }
}
// TODO: at end of file, output any lines still left in storedLines.
You could use mark() and reset() on your reader to "rewind".
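A minimal sketch of that mark()/reset() idea with a BufferedReader, for the literal A/B/C case (the read-ahead limit and file name are assumptions):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class MarkResetExample {
    public static void main(String[] args) throws IOException {
        try (BufferedReader in = new BufferedReader(new FileReader("in.txt"))) {
            String line;
            while ((line = in.readLine()) != null) {
                if (line.equals("A")) {
                    in.mark(8192);                   // remember the position after this "A"
                    String second = in.readLine();
                    String third = in.readLine();
                    if ("B".equals(second) && "C".equals(third)) {
                        continue;                    // full A, B, C sequence: skip all three
                    }
                    in.reset();                      // not a sequence: rewind the look-ahead
                }
                System.out.println(line);
            }
        }
    }
}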
