Writing to standard out vs writing to file - java

This is similar to:
Printing to the console vs writing to a file (speed)
I was confused because there are two conflicting answers, so I wrote a simple Java program
for(int i=0; i<1000000; i++){
System.out.println(i);
}
and ran it with /usr/bin/time -v java test to measure the time to write to stdout, then I tried /usr/bin/time -v java test > file and /usr/bin/time -v java test > /dev/null. Writing to the console was slowest (10 seconds), then the file (6 seconds), and /dev/null was fastest (2 seconds). Why?

Because writing to the console requires the terminal to redraw the screen every time something is printed, which takes time.
Writing to a file requires writing bytes to disk, which also takes time, but less than redrawing the screen.
And writing to /dev/null doesn't write anything anywhere, which takes far less time still.

Another problem with System.out.println is that System.out is created with autoflush enabled by default, so every println forces a flush and effectively defeats buffering. Try this
PrintWriter a = new PrintWriter(System.out, false);
for (int i = 0; i < 1000000; i++) {
a.println(i);
}
a.flush();
and you will see that output to a file becomes ten times faster.
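A rough way to see the difference yourself: a sketch, not a rigorous benchmark, and the exact numbers depend on the OS, the terminal and where the output is redirected:
import java.io.PrintWriter;

public class PrintlnBench {
    public static void main(String[] args) {
        long t0 = System.nanoTime();
        for (int i = 0; i < 1000000; i++) {
            System.out.println(i);                          // auto-flushing PrintStream
        }
        long t1 = System.nanoTime();

        PrintWriter out = new PrintWriter(System.out, false);  // buffered, no autoflush
        for (int i = 0; i < 1000000; i++) {
            out.println(i);
        }
        out.flush();                                        // flush once at the end
        long t2 = System.nanoTime();

        System.err.printf("System.out: %d ms, buffered PrintWriter: %d ms%n",
                (t1 - t0) / 1000000, (t2 - t1) / 1000000);
    }
}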

Related

Why is print in Scala so slow?

Just started to learn Scala :) The question right away: why is printing to stdout so slow?
$ time scala -e 'val s = "foobarjoe!"; for (i <- 1 to 10000000) println(s)' > /dev/null
real 0m22.522s
user 0m14.252s
sys 0m8.508s
$ scala -version
Scala code runner version 2.11.4 -- Copyright 2002-2013, LAMP/EPFL
The same code in Perl as an example:
$ time perl -le '$s = "foobarjoe!"; for (1..10000000) { print($s) }' > /dev/null
real 0m1.276s
user 0m1.266s
sys 0m0.010s
This link answers the question: http://www.rgagnon.com/javadetails/java-0603.html
If your program is doing a lot of printing to the console using System.out.println(), then it is possible to get a good performance boost by using an alternative for the console output.
By default, System.out.print() is only line-buffered and does a lot of work related to Unicode handling. Because of its small buffer size, System.out.println() is not well suited to many repetitive outputs in batch mode: each line is flushed right away. If your output is mainly ASCII-based, then removing the Unicode-related work improves the overall execution time.
It now remains to translate the solution from Java to Scala :)
BufferedWriter out = new BufferedWriter(
    new OutputStreamWriter(new FileOutputStream(java.io.FileDescriptor.out), "ASCII"), 512);
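For completeness, here is a sketch of how that writer gets used for bulk output in Java (the Scala version is a direct transliteration of the same calls); the buffer size and charset come from the snippet above:
import java.io.BufferedWriter;
import java.io.FileDescriptor;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;

public class FastPrint {
    public static void main(String[] args) throws IOException {
        // ASCII-only, explicitly buffered writer straight to the stdout file descriptor
        BufferedWriter out = new BufferedWriter(
                new OutputStreamWriter(new FileOutputStream(FileDescriptor.out), "ASCII"), 512);
        String s = "foobarjoe!";
        for (int i = 0; i < 10000000; i++) {
            out.write(s);
            out.write('\n');   // no per-line flush, unlike println on an autoflushing stream
        }
        out.flush();           // flush once at the end
    }
}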

Error output stream is concurrent with standard output stream when printing in Java

When this code gets executed, it produces scrambled outputs with no particular order between System.out.print() and System.err.print(). Why is that so?
for (int i = 0; i < 100; i++) {
System.out.print(i+" ");
System.err.print(i+"+");
}
Why is the standard error output stream concurrent with the standard output stream when printing in Java?
System.out and System.err are two different streams, so they are written and flushed independently of each other. In a console application both streams end up on the same device, the console, but each stream flushes at its own time and the console simply displays whatever arrives in the order it arrives, which leads to unpredictable interleaving.
If you want/need a single, better-managed output for both streams, use a logging library like log4j2 or logback, which already solves these problems.
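If pulling in a logging library is too heavy, one low-tech sketch is to flush each stream before switching to the other, or to point both at the same PrintStream; this narrows (but does not fully eliminate) the window for interleaving:
for (int i = 0; i < 100; i++) {
    System.out.print(i + " ");
    System.out.flush();        // push pending stdout bytes before writing to stderr
    System.err.print(i + "+");
    System.err.flush();
}

// Or, if relative ordering matters more than keeping the streams separate:
System.setErr(System.out);     // both now share one underlying PrintStream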

Waiting for answer from Java process?

I'm trying to call a command-line application (let's call it SomeTerminalApp) from Java; more precisely, the following should happen:
1. Java boots up SomeTerminalApp.
2. SomeTerminalApp asks for user input, which Java provides.
3. SomeTerminalApp processes the input, and after a while shows its output and asks for new user input.
4. Java detects this, processes SomeTerminalApp's output and inserts new user input, possibly based on the previous output.
5. Steps 3-4 are repeated any number of times (until a stop condition in Java is met, depending on the outputs).
6. We are done: Java closes SomeTerminalApp and minds its own business again.
How should I best do this? Currently I have very naive process handling
Process p = Runtime.getRuntime().exec("/path/SomeTerminalApp -q"); //run in quiet mode
BufferedReader br = new BufferedReader(new InputStreamReader(p.getInputStream()));
BufferedWriter bw = new BufferedWriter(new OutputStreamWriter(p.getOutputStream()));
with a function that inserts input and reads a fixed number of output lines immediately.
public ArrayList<String> exec(ArrayList<String> input, int nrOfOutputLines) throws IOException {
// send every input line, then flush so the process actually receives them
for (int i=0; i<input.size(); i++) bw.write(input.get(i)+"\n");
bw.flush();
// read back a fixed number of output lines
ArrayList<String> output = new ArrayList<String>();
for (int i=0; i<nrOfOutputLines; i++) output.add(br.readLine());
return output;
}
Alas, I often don't know how many lines I'll have to read (the output is unknown in advance and is automatically split over many lines), and the output doesn't always arrive immediately (it can take anywhere from milliseconds to days to appear).
How should I handle such a process properly, i.e. detect when it has finished processing (rather than relying on a fixed timer or no timer at all) and read all of the output produced in response to the latest command (rather than the last N lines for some fixed N)?
You've got the right idea. There is no easy way to know when the subprocess is done. Either it has to print something recognizable as its last action before going idle ("done" or similar), or it has to exit (which you can detect as EOF on your end of its output stream).
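One possible sketch of that approach: a background thread drains the process output into a queue, so the main thread can block on "the next line" instead of guessing a line count. The END_MARKER value is an assumption and only works if SomeTerminalApp prints a recognizable prompt or "done" line when it goes idle; readLine() returning null means the process closed its output (EOF).
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class ProcessReaderSketch {
    static final String END_MARKER = "done";   // hypothetical idle marker printed by the app

    public static void main(String[] args) throws IOException, InterruptedException {
        Process p = Runtime.getRuntime().exec("/path/SomeTerminalApp -q");
        BlockingQueue<String> lines = new LinkedBlockingQueue<String>();

        Thread reader = new Thread(() -> {
            try {
                BufferedReader br = new BufferedReader(
                        new InputStreamReader(p.getInputStream()));
                String procLine;
                while ((procLine = br.readLine()) != null) {   // null == EOF, process is done
                    lines.put(procLine);
                }
            } catch (IOException e) {
                // stream closed; nothing more to read
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        reader.setDaemon(true);
        reader.start();

        // Main thread: collect output until the marker appears (a poll with a timeout
        // would also work if the app never prints a marker).
        String line;
        while (!(line = lines.take()).equals(END_MARKER)) {
            System.out.println("got: " + line);
        }
    }
}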

How to get UNIX uptime in Java?

What is the best way to get the UNIX uptime in Java? Is there a standard Java library/function I could use or should I use Runtime's exec or ProcessBuilder to execute 'uptime'? Thanks
You can read /proc/uptime:
new Scanner(new FileInputStream("/proc/uptime")).next();
//"1128046.07" on my machine and still counting
From Wikipedia:
Shows how long the system has been on since it was last restarted:
$ cat /proc/uptime
350735.47 234388.90
The first number is the total number of seconds the system has been up. The second number is how much of that time the machine has spent idle, in seconds.
Runtime.getRuntime().exec("uptime");
Did you try this? The only thing in Java that does something equivalent to system("uptime") is Runtime.exec(). That's all I could think of.
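A rough sketch of that idea using ProcessBuilder and reading the single line that uptime prints (Unix only, of course):
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

public class UptimeExec {
    public static void main(String[] args) throws IOException, InterruptedException {
        Process p = new ProcessBuilder("uptime").start();
        BufferedReader br = new BufferedReader(new InputStreamReader(p.getInputStream()));
        String line = br.readLine();   // e.g. "12:01 up 3 days,  2 users,  load average: ..."
        p.waitFor();
        System.out.println(line);
    }
}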
This is likely to be entirely system dependent, but you can use System.nanoTime():
System.out.println("/proc/uptime " + new Scanner(new FileInputStream("/proc/uptime")).next());
System.out.println("System.nanoTime " + System.nanoTime() / 1e9);
prints
/proc/uptime 265671.85
System.nanoTime 265671.854834206
Warning: This is unlikely to work on all platforms.

any quick sorting for a huge csv file

I am looking for a Java implementation of a sorting algorithm. The file could be HUGE, say 20000*600 = 12,000,000 lines of records. Each line is comma delimited with 37 fields, and we use 5 fields as keys. Is it possible to sort it quickly, say within 30 minutes?
Approaches other than Java are welcome as long as they can be easily integrated into a Java system, for example a Unix utility.
Thanks.
Edit: The lines that need to be sorted are dispersed across 600 files, with 20000 lines each, about 4 MB per file. In the end I would like them to be one big sorted file.
I am trying to time a Unix sort and will update afterwards.
Edit:
I appended all the files into one big file and tried the Unix sort function; it is pretty good. Sorting the 2 GB file took 12-13 minutes. Appending the 600 files took 4 minutes.
sort -t ',' -k 1,1 -k 4,7 -k 23,23 -k 2,2r big.txt -o sorted.txt
How does the data get into CSV format? Does it come from a relational database? You could make whatever process creates the file write its entries in the right order, so you don't have to solve this problem down the line.
If you are doing a simple lexicographic ordering, you can try the Unix sort, but I am not sure how it will perform on a file of that size.
Calling the Unix sort program should be efficient. It does multiple passes to ensure it is not a memory hog. You can fork a process with Java's Runtime, but the output of the process is redirected, so you have to do some juggling to get the redirection to work right:
public static void sortInUnix(File fileIn, File sortedFile)
throws IOException, InterruptedException {
String[] cmd = {
"cmd", "/c",
// above should be changed to "sh", "-c" if on Unix system
"sort " + fileIn.getAbsolutePath() + " > "
+ sortedFile.getAbsolutePath() };
Process sortProcess = Runtime.getRuntime().exec(cmd);
// capture error messages (if any)
BufferedReader reader = new BufferedReader(new InputStreamReader(
sortProcess.getErrorStream()));
String outputS = reader.readLine();
while (outputS != null) {
System.err.println(outputS);
outputS = reader.readLine();
}
sortProcess.waitFor();
}
Use the Java library big-sorter, which is published to Maven Central and has an optional dependency on commons-csv for CSV processing. It handles files of any size by splitting into intermediate files, then sorting and merging the intermediate files repeatedly until only one is left. Note also that the maximum group size for a merge is configurable (useful when there are a large number of input files).
Here's an example:
Given the CSV file below, we will sort on the second column (the "number" column):
name,number,cost
WIPER BLADE,35,12.55
ALLEN KEY 5MM,27,3.80
Serializer<CSVRecord> serializer = Serializer.csv(
CSVFormat.DEFAULT
.withFirstRecordAsHeader()
.withRecordSeparator("\n"),
StandardCharsets.UTF_8);
Comparator<CSVRecord> comparator = (x, y) -> {
int a = Integer.parseInt(x.get("number"));
int b = Integer.parseInt(y.get("number"));
return Integer.compare(a, b);
};
Sorter
.serializer(serializer)
.comparator(comparator)
.input(inputFile)
.output(outputFile)
.sort();
The result is:
name,number,cost
ALLEN KEY 5MM,27,3.80
WIPER BLADE,35,12.55
I created a CSV file with 12 million rows and 37 columns and filled the grid with random integers between 0 and 100,000. I then sorted the 2.7 GB file on the 11th column using big-sorter, and it took 8 minutes single-threaded on an i7 with an SSD and the max heap set to 512 MB (-Xmx512m).
See the project README for more details.
Java Lists can be sorted; you could try starting there.
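A minimal sketch of that in-memory approach, assuming the whole data set fits in the heap; the column indexes used for the key are placeholders for whichever of your 5 key fields apply:
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Comparator;
import java.util.List;

public class InMemoryCsvSort {
    public static void main(String[] args) throws IOException {
        Path in = Paths.get("big.txt");        // assumed input file name
        Path out = Paths.get("sorted.txt");
        List<String> lines = Files.readAllLines(in, StandardCharsets.UTF_8);
        // Sort on column 0, then column 3 (placeholders for the real key fields).
        lines.sort(Comparator
                .comparing((String line) -> line.split(",", -1)[0])
                .thenComparing(line -> line.split(",", -1)[3]));
        Files.write(out, lines, StandardCharsets.UTF_8);
    }
}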
Python on a big server.
import csv
def sort_key( aRow ):
    return aRow['this'], aRow['that'], aRow['the other']
with open('some_file.csv','rb') as source:
    rdr= csv.DictReader( source )
    data = [ row for row in rdr ]
    data.sort( key=sort_key )
    fields= rdr.fieldnames
with open('some_file_sorted.csv', 'wb') as target:
    wtr= csv.DictWriter( target, fields )
    wtr.writerows( data )
This should be reasonably quick. And it's very flexible.
On a small machine, break this into three passes: decorate, sort, undecorate
Decorate:
import csv
def sort_key( aRow ):
    return aRow['this'], aRow['that'], aRow['the other']
with open('some_file.csv','rb') as source:
    rdr= csv.DictReader( source )
    with open('temp.txt','w') as target:
        for row in rdr:
            # write the sort key fields, then the original field values, all '|'-separated
            target.write( "|".join( map(str,sort_key(row)) ) + "|" + ",".join( row[f] for f in rdr.fieldnames ) + "\n" )
Part 2 is the operating system sort using "|" as the field separator
Undecorate:
with open('sorted_temp.txt','r') as source:
    with open('sorted.csv','w') as target:
        for row in source:
            keys, _, data = row.rpartition('|')
            target.write( data )
You don't mention the platform, so it is hard to put the specified time in context. 12x10^6 records isn't that many, but sorting is a pretty intensive task. Let's say 37 fields at 100 bytes/field: that would be about 45 GB. That's a bit much for most machines, but if the records average 10 bytes/field your server should be able to fit the entire file in RAM, which would be ideal.
My suggestion: break the file into chunks that are half the available RAM, sort each chunk, then merge-sort the resulting sorted chunks. This lets you do all of your sorting in memory rather than hitting swap, which is what I suspect is causing the slowdown.
Say (1G chunks, in a directory you can play around in):
split --line-bytes=1000000000 original_file chunk
for each in chunk*
do
sort $each > $each.sorted
done
sort -m chunk*.sorted > original_file.sorted
Since your data set is huge, as you have mentioned, sorting it all in one go will be time consuming depending on your machine (if you try quicksort).
But since you would like it done within 30 minutes, I would suggest that you have a look at MapReduce, using
Apache Hadoop as your application server.
Please keep in mind it's not an easy approach, but in the long run you can easily scale up depending on your data size.
I am also pointing you to an excellent link on Hadoop setup.
Work your way through the single-node setup and then move on to a Hadoop cluster.
I would be glad to help you if you get stuck anywhere.
You really do need to make sure you have the right tools for the job. (Today, I am hoping to get a 3.8 GHz PC with 24 GB of memory for home use. It's been a while since I bought myself a new toy. ;)
However, if you want to sort these lines and you don't have enough hardware, you don't need to break up the data, because it's already in 600 files.
Sort each file individually, then do a 600-way merge sort (you only need to keep 600 lines in memory at once). It's not as simple as doing them all at once, but you could probably do it on a mobile phone. ;)
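A sketch of that 600-way merge step in Java, using a priority queue of file "heads"; it assumes each input file has already been sorted with the same comparator, and only one current line per file is held in memory:
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

public class KWayMerge {

    private static class Head {
        String line;
        BufferedReader reader;
    }

    static void merge(List<File> sortedFiles, File output, Comparator<String> cmp)
            throws IOException {
        PriorityQueue<Head> heap = new PriorityQueue<>((a, b) -> cmp.compare(a.line, b.line));
        for (File f : sortedFiles) {
            BufferedReader r = new BufferedReader(new FileReader(f));
            String first = r.readLine();
            if (first == null) { r.close(); continue; }   // skip empty files
            Head h = new Head();
            h.line = first;
            h.reader = r;
            heap.add(h);
        }
        try (BufferedWriter w = new BufferedWriter(new FileWriter(output))) {
            while (!heap.isEmpty()) {
                Head h = heap.poll();           // smallest current line across all files
                w.write(h.line);
                w.newLine();
                String next = h.reader.readLine();
                if (next != null) {
                    h.line = next;
                    heap.add(h);                // re-insert with this file's next line
                } else {
                    h.reader.close();           // this file is exhausted
                }
            }
        }
    }
}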
Since you have 600 smaller files, it could be faster to sort all of them concurrently. This will eat up 100% of the CPU. That's the point, correct?
waitlist=
for f in ${SOURCE}/*
do
sort -t ',' -k 1,1 -k 4,7 -k 23,23 -k 2,2r -o ${f}.srt ${f} &
waitlist="$waitlist $!"
done
wait $waitlist
LIST=`echo $SOURCE/*.srt`
sort --merge -t ',' -k 1,1 -k 4,7 -k 23,23 -k 2,2r -o sorted.txt ${LIST}
This will sort 600 small files all at the same time and then merge the sorted files. It may be faster than trying to sort a single large file.
Use Hadoop MapReduce to do the sorting. I recommend Spring Data Hadoop for Java.
Well, since you're talking about HUGE datasets, you'll need some external sorting algorithm anyway. There are implementations for Java and pretty much every other language out there; since the result will have to be stored on disk anyway, which language you use doesn't matter much.
