I am trying to read a huge file that has approximately one billion lines in it. I want to use a Stream for parallel processing and insert each line into a database. I am doing something like this:
br = new BufferedReader(new InputStreamReader(inputStream, "UTF-8"));
list = br.lines().parallel().collect(Collectors.toList());
This will store all the lines in a list. But I don't want to keep all the lines in memory. So, I want to save each line to the database as soon as it is read. Please help me in achieving this, and guide me in tweaking this idea.
Thanks in advance :)
It seems you need to use forEach and pass a Consumer that will take the line and store it in the database:
lines.parallel()
     .forEach(line -> {
         // Invoke the code that persists the 'line' in the DB... something like
         dbWriter.write(line);
     });
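For context, a rough sketch of what that consumer could look like with plain JDBC. The connection URL, table name and column name below are placeholders, not from the question, and the stream is kept sequential because a single Connection/PreparedStatement is not thread-safe; batching does the heavy lifting instead.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class LineLoader {

    // jdbcUrl and the table/column names are placeholders, not from the question.
    public static void load(InputStream inputStream, String jdbcUrl) throws IOException, SQLException {
        try (BufferedReader br = new BufferedReader(
                     new InputStreamReader(inputStream, StandardCharsets.UTF_8));
             Connection conn = DriverManager.getConnection(jdbcUrl);
             PreparedStatement ps = conn.prepareStatement(
                     "INSERT INTO lines_table (line) VALUES (?)")) {

            final int[] pending = {0};            // lines queued in the current batch
            br.lines().forEach(line -> {          // sequential: one connection is not thread-safe
                try {
                    ps.setString(1, line);
                    ps.addBatch();
                    if (++pending[0] == 10_000) { // flush periodically so the batch stays small
                        ps.executeBatch();
                        pending[0] = 0;
                    }
                } catch (SQLException e) {
                    throw new RuntimeException(e);
                }
            });
            ps.executeBatch();                    // flush whatever is left
        }
    }
}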
Related
Hi, so I'm using an API to run a search for books:
http://openlibrary.org/search.json?q=prolog
I then run a buffered read (below) to read each line in and select the lines that I want using an if statement, for example lines like:
"title_suggest": "Prolog",
URL search = new URL("http://openlibrary.org/search.json?q=" + searchingTerm); // search string
BufferedReader in = new BufferedReader(
        new InputStreamReader(search.openStream()));
String inputLine;
while ((inputLine = in.readLine()) != null) // read file
For instance: if inputLine contains title_suggest, add inputLine to an ArrayList.
However, this is quite slow, and I was wondering if there is a more efficient way to read in the data?
I cannot imagine that the parsing is a giant time suck compared to retrieving the data over the internet. But whatever you're doing, you're way better off using a genuine JSON parser rather than rolling your own, especially if you're relying on if statements to do so.
Also, make damned sure the query you send to the API is as restrictive as you can make it; after all, the more exact the data they can give you, the better off all parties involved are.
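As a sketch of the parser route, here is roughly what extracting titles could look like with the org.json library (any JSON library will do). The "docs" array name is an assumption about the Open Library response, and "title_suggest" is taken from the question's snippet, so verify both against the actual JSON you get back.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;

import org.json.JSONArray;
import org.json.JSONObject;

public class BookSearch {

    public static List<String> searchTitles(String searchTerm) throws Exception {
        URL search = new URL("http://openlibrary.org/search.json?q=" + searchTerm);
        StringBuilder body = new StringBuilder();
        try (BufferedReader in = new BufferedReader(new InputStreamReader(search.openStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                body.append(line);                        // accumulate the whole response
            }
        }
        JSONObject response = new JSONObject(body.toString());
        JSONArray docs = response.getJSONArray("docs");   // assumed top-level array name
        List<String> titles = new ArrayList<>();
        for (int i = 0; i < docs.length(); i++) {
            JSONObject doc = docs.getJSONObject(i);
            if (doc.has("title_suggest")) {               // field name from the question
                titles.add(doc.getString("title_suggest"));
            }
        }
        return titles;
    }
}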
this is quite slow
BufferedReader isn't 'quite slow'. It is extremely fast. You can read millions of lines per second with BufferedReader.readLine().
The slow part is your code that processes each line. Or possibly the server is slow executing the query or delivering the data.
You're barking up the wrong tree here.
I am writing code in Java to read lines from a file. It is required by the problem statement that the code read the same file multiple times. However, it should only read new lines, without using any flag of any sort. Please suggest ways I can approach this. Any ideas are welcome.
There is no way to "only read new lines." To achieve what you're looking to do, I would suggest caching the old version of the file and comparing the new file against the old cached one every time you re-read it. You will be able to detect the new lines and any other changes in the file. After you are done analyzing, overwrite the old cache, saving the newest read.
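A minimal sketch of that idea, assuming the file only ever grows by appending (so the "new lines" are simply everything past the previously cached line count); the class and method names are made up for illustration:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class NewLineReader {

    private List<String> cachedLines = new ArrayList<>(); // snapshot from the previous read

    // Returns only the lines that were not present in the cached snapshot.
    public List<String> readNewLines(Path file) throws IOException {
        List<String> current = Files.readAllLines(file);
        List<String> newLines = current.size() > cachedLines.size()
                ? new ArrayList<>(current.subList(cachedLines.size(), current.size()))
                : new ArrayList<>();              // file shrank or is unchanged
        cachedLines = current;                    // overwrite the cache with the newest read
        return newLines;
    }
}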
Is it possible to download just the first 50 lines of a .txt file in Java?
If possible, I'd need a solution without external libraries, compatible with Java 5 and as simple as possible (involving lines of text rather than streams... one can dream!)
Certainly it's possible: just read the first 50 lines and then stop reading.
You can't do it without streams, since that's what will happen underneath anyway, but a regular new BufferedReader(new InputStreamReader(inputStream, "UTF-8")) (select the proper encoding) will work just fine.
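A minimal sketch under those constraints (Java 5 syntax, no external libraries); the URL is whatever you are downloading from, and 50 is just the limit the question asked for:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;

public class FirstLines {

    // Reads at most the first 50 lines of a remote text file (Java 5 compatible).
    public static List<String> firstFiftyLines(String address) throws IOException {
        URL url = new URL(address);
        BufferedReader in = new BufferedReader(
                new InputStreamReader(url.openStream(), "UTF-8")); // pick the proper encoding
        List<String> lines = new ArrayList<String>();
        try {
            String line;
            while (lines.size() < 50 && (line = in.readLine()) != null) {
                lines.add(line);
            }
        } finally {
            in.close(); // stop reading; closing the stream abandons the rest of the download
        }
        return lines;
    }
}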
For a project I am working on, I am trying to count the vowels in text file as fast as possible. In order to do so, I am trying a concurrent approach. I was wondering if it is possible to concurrently read a text file as a way to speed up the counting? I believe the bottleneck is the I/O, and since right now I am reading the file in via a buffered reader and processing line by line, I was wondering if it was possible to read multiple sections of the file at once.
My original thought was to use
Split File - Java/Linux
but apparently MappedByteBuffers are not great performance-wise, and I still need to read line by line from each MappedByteBuffer once I split.
Another option is to split after reading a certain number of lines, but that defeats the purpose.
Would appreciate any help.
The following will NOT split the file - but can help in concurrently processing it!
Using Streams in Java 8 you can do things like:
Stream<String> lines = Files.lines(Paths.get(filename));
lines.filter(StringUtils::isNotEmpty) // ignore empty lines
and if you want to run in parallel you can do:
lines.parallel().filter(StringUtils::isNotEmpty)
In the example above I was filtering empty lines, but of course you can adapt it to your use case (counting vowels) by implementing your own method and calling it.
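For instance, a minimal sketch of counting vowels in parallel with the same approach; the file path is a placeholder:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.stream.Stream;

public class VowelCount {
    public static void main(String[] args) throws IOException {
        try (Stream<String> lines = Files.lines(Paths.get("/tmp/input.txt"))) { // placeholder path
            long vowels = lines.parallel()
                    .flatMapToInt(String::chars)               // every character of every line
                    .filter(c -> "aeiouAEIOU".indexOf(c) >= 0) // keep only vowels
                    .count();
            System.out.println("Vowels: " + vowels);
        }
    }
}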
I have a large log file with client-id as one of the fields in each log line. I would like to split this large log file into several files grouped by client-id. So, if the original file has 10 lines with 10 unique client-ids, then at the end there will be 10 files with 1 line in each.
I am trying to do this in Scala and don't want to load the entire file into memory, so I load one line at a time using scala.io.Source.getLines(). That is working nicely. But I don't have a good way to write the lines out to separate files one at a time. I can think of two options:
Create a new PrintWriter backed by a BufferedWriter (Files.newBufferedWriter) for every line. This seems inefficient.
Create a new PrintWriter backed by a BufferedWriter for every output file, hold on to these PrintWriters and keep writing to them till we have read all the lines in the original log file, and then close them. This doesn't seem a very functional way to do it in Scala.
Being new to Scala, I am not sure if there is another, better way to accomplish something like this. Any thoughts or ideas are much appreciated.
You can do the second option in pretty functional, idiomatic Scala. You can keep track of all of your PrintWriters, and fold over the lines of the file:
import java.io._
import scala.io._

Source.fromFile(new File("/tmp/log")).getLines.foldLeft(Map.empty[String, PrintWriter]) {
  case (printers, line) =>
    val id = line.split(" ").head // client-id is the first field of the line
    val printer = printers.get(id).getOrElse(new PrintWriter(new File(s"/tmp/log_$id"))) // reuse or open the writer for this id
    printer.println(line)
    printers.updated(id, printer) // carry the (possibly extended) map forward
}.values.foreach(_.close) // close every writer once all lines have been read
Maybe in a production-level version you'd want to wrap the I/O operations in a try (or Try) and keep track of failures that way, while still closing all the PrintWriters at the end.