I want to read a big text file, so I decided to create four threads, have each one read 25% of the file, and then join them. But it is not noticeably faster. Can anyone tell me whether I can use concurrent programming for this?
My file contains records structured as
name contact company policyname policynumber uniqueno
and I want to put all the data into a HashMap at the end.
Thanks.
Reading a large file is typically limited by I/O performance, not by CPU time. You can't speed up the reading by dividing it across multiple threads; if anything, that will decrease performance, since it's still the same file on the same drive. You can use concurrent programming to process the data, but that can only improve performance after the file has been read.
You may, however, have some luck by dedicating one single thread to reading the file, and delegate the actual processing from this thread to worker threads, whenever a data unit has been read.
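A minimal sketch of that pattern, assuming whitespace-separated fields in the order given in the question and keying the map on the unique number (the class name, pool size and field indexes are only illustrative):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class SingleReaderMultiParser {
    public static void main(String[] args) throws IOException, InterruptedException {
        final Map<String, String[]> records = new ConcurrentHashMap<String, String[]>();
        ExecutorService workers = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors());

        BufferedReader reader = new BufferedReader(new FileReader(args[0]), 1 << 20);
        try {
            String line;
            while ((line = reader.readLine()) != null) {   // only this thread touches the disk
                final String current = line;
                workers.submit(new Runnable() {
                    public void run() {                    // CPU-bound parsing runs on the pool
                        String[] fields = current.split("\\s+");
                        records.put(fields[5], fields);    // key on uniqueno
                    }
                });
            }
        } finally {
            reader.close();
        }
        workers.shutdown();
        workers.awaitTermination(1, TimeUnit.HOURS);
        System.out.println("Parsed " + records.size() + " records");
    }
}

Whether this actually beats a plain single-threaded loop depends on how expensive the per-line parsing really is.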
If it is a big file, chances are it is written to disk as one contiguous chunk, and "streaming" the data will be faster than parallel reads, which would send the drive heads seeking back and forth. To know what is fastest you need intimate knowledge of your target production environment, because on high-end storage the data will likely be distributed over multiple disks and parallel reads might then be faster.
The best approach, I think, is to read the file into memory in large chunks and make each chunk available as a ByteArrayInputStream for parsing.
Quite likely you will peg the CPU while parsing and handling the data, so a parallel map-reduce style approach could help spread that load over all cores.
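On Java 8 or later, a rough sketch of that map-reduce style using parallel streams might look like this (the field layout is assumed from the question; the stream still reads the file sequentially underneath, only the per-line work is parallel):

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Map;
import java.util.stream.Collectors;

public class ParallelParse {
    public static void main(String[] args) throws Exception {
        Map<String, String[]> byUniqueNo = Files.lines(Paths.get(args[0]))
                .parallel()
                .map(line -> line.split("\\s+"))
                .collect(Collectors.toConcurrentMap(
                        fields -> fields[5],           // uniqueno as the key (assumed field order)
                        fields -> fields,              // keep the whole split line as the value
                        (first, second) -> second));   // last one wins if a key repeats
        System.out.println(byUniqueNo.size() + " records loaded");
    }
}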
You might want to use Memory-mapped file buffers (NIO) instead of plain java.io.
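For illustration, a minimal sketch of mapping a file with NIO and walking it to count lines (the file name is a placeholder, and a single mapping is limited to 2 GB, so a really big file would need to be mapped in several windows):

import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class MappedRead {
    public static void main(String[] args) throws Exception {
        RandomAccessFile file = new RandomAccessFile("bigfile.txt", "r");
        FileChannel channel = file.getChannel();
        MappedByteBuffer buffer =
                channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());

        long lineCount = 0;
        while (buffer.hasRemaining()) {      // walk the mapped region byte by byte
            if (buffer.get() == '\n') {
                lineCount++;
            }
        }
        System.out.println(lineCount + " lines");
        channel.close();
        file.close();
    }
}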
Well, doing it like that you might defeat the disk cache and put high contention on the synchronization of the HashMap. I would suggest that you simply make sure the stream is properly buffered, possibly with a large buffer size; use the BufferedReader(Reader in, int sz) constructor to specify the buffer size.
If the bottleneck is not parsing the lines (that is, if the bottleneck is not CPU usage), you should not parallelize the task in the way described.
You could also look into memory-mapped files (available through the NIO package), but that's probably only useful if you want to read and write files efficiently. A tutorial with source code is available here: http://www.linuxtopia.org/online_books/programming_books/thinking_in_java/TIJ314_029.htm
Well, you can get some help from the link below:
http://java.sun.com/developer/technicalArticles/Programming/PerfTuning/
Or use a large buffer, or something like this:
import java.io.*;

public class LineCounter {
    public static void main(String[] args) {
        if (args.length != 1) {
            System.err.println("missing filename");
            System.exit(1);
        }
        try {
            // BufferedReader replaces the deprecated DataInputStream.readLine()
            // and does the buffering that keeps system calls to a minimum.
            BufferedReader reader = new BufferedReader(new FileReader(args[0]));
            int cnt = 0;
            while (reader.readLine() != null) {
                cnt++;
            }
            reader.close();
            System.out.println(cnt);
        } catch (IOException e) {
            System.err.println(e);
        }
    }
}
Let me preface this post with a single caution. I am a total beginner when it comes to Java. I have been programming PHP on and off for a while, but I was ready to make a desktop application, so I decided to go with Java for various reasons.
The application I am working on is in the beginning stages (less than 5 classes) and I need to read bytes from a local file. Typically, the files are currently less than 512kB (but may get larger in the future). Currently, I am using a FileInputStream to read the file into three byte arrays, which perfectly satisfies my requirements. However, I have seen a BufferedInputStream mentioned, and was wondering if the way I am currently doing this is best, or if I should use a BufferedInputStream as well.
I have done some research and have read a few questions here on Stack Overflow, but I am still having trouble understanding the best situations in which to use, and not use, a BufferedInputStream. In my situation, the first array I read bytes into is only a few bytes long (less than 20). If the data I receive in these bytes is good, then I read the rest of the file into two more byte arrays of varying size.
I have also heard many people mention profiling to see which is more efficient in each specific case; however, I have no profiling experience and I'm not really sure where to start. I would love some suggestions on this as well.
I'm sorry for such a long post, but I really want to learn and understand the best way to do these things. I always have a bad habit of second guessing my decisions, so I would love some feedback. Thanks!
If you are consistently doing small reads, then a BufferedInputStream will give you significantly better performance. Each read request on an unbuffered stream typically results in a system call to the operating system to read the requested number of bytes. The overhead of a system call may be thousands of machine instructions per syscall. A buffered stream reduces this by doing one large read of (say) up to 8k bytes into an internal buffer, and then handing out bytes from that buffer. This can drastically reduce the number of system calls.
However, if you are consistently doing large reads (e.g. 8k or more), then a BufferedInputStream slows things down a bit. You typically don't reduce the number of syscalls, and the buffering introduces an extra data-copying step.
In your use-case (where you read a 20 byte chunk first then lots of large chunks) I'd say that using a BufferedInputStream is more likely to reduce performance than increase it. But ultimately, it depends on the actual read patterns.
If you are using relatively large arrays to read the data a chunk at a time, then BufferedInputStream will just introduce a wasteful copy. (Remember, read does not necessarily fill the whole array; you might want DataInputStream.readFully.) Where BufferedInputStream wins is when you make lots of small reads.
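As a hedged sketch of that pattern for the use-case in the question (a small header followed by the rest of the file; the sizes and file name are made up):

import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.IOException;

public class HeaderThenBody {
    public static void main(String[] args) throws IOException {
        DataInputStream in = new DataInputStream(new FileInputStream("data.bin"));
        try {
            byte[] header = new byte[20];
            in.readFully(header);                    // blocks until all 20 bytes are read

            byte[] body = new byte[in.available()];  // rough size of what's left (fine for a local file)
            in.readFully(body);
        } finally {
            in.close();
        }
    }
}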
BufferedInputStream reads more of the file than you need, in advance. As I understand it, it does the work up front: one big contiguous disk read instead of many small ones in a tight loop.
As far as profiling goes, I like the profiler that's built into NetBeans. It's really easy to get started with. :-)
I can't speak to the profiling, but in my experience developing Java applications, using any of the buffered classes - BufferedInputStream, StringBuffer - makes my applications noticeably faster. Because of that, I use them even for the smallest files and string operations.
import java.io.*;

// Demonstrates mark() and reset() on a BufferedInputStream: mark the start of the
// stream, read through it once to count the bytes, then reset and read it again,
// printing the contents.
class BufferedInputStreamDemo
{
    public static void main(String[] args) throws IOException
    {
        FileInputStream fin = new FileInputStream("abc.txt");
        BufferedInputStream bis = new BufferedInputStream(fin);

        int size = bis.available();
        bis.mark(size);                  // remember the current position (the start of the file)

        int count = 0;
        while (bis.read() != -1)         // first pass: just count the bytes
        {
            count++;
        }
        System.out.println("Read " + count + " bytes");

        bis.reset();                     // jump back to the marked position

        int x;
        while ((x = bis.read()) != -1)   // second pass: print each byte as a character
        {
            System.out.print((char) x);
        }
        bis.close();
    }
}
When I want to write Java code for writing text to a file, it usually looks something like this:
File logFile = new File("/home/someUser/app.log");
FileWriter writer;
try {
    writer = new FileWriter(logFile, true);
    writer.write("Some text.");
    writer.close();
} catch (IOException e) {
    e.printStackTrace();
}
I am now writing a Logger that will be used extensively by an in-house reporting tool. For reasons outside the context of this question, I can't use one of the traditional logging frameworks (SLF4J, Log4j, Logback, JUL, JCL, etc.). So I have to make something homegrown.
This logging system will be simple, non-configurable, but has to be capable of handling high-volume (possibly hundreds of log operations per second, or more).
So I ask: how can I optimize my normal file I/O template above to handle high-throughput logging? What kind of "hidden gems of Java file I/O" can I capitalize on here? Pretty much anything goes except, as I said, the use of other logging frameworks. The basic Logger API needs to be something like:
public class Logger {

    private File logFile;

    public Logger(File logFile) {
        super();
        setFile(logFile);
    }

    public void log(String message) {
        ???
    }
}
Thanks in advance!
Update: If my Logger used a ByteOutputStream instead of a FileWriter, then how can I properly synchronize my log(String) : void method?
public class Logger {

    private File logFile;

    // Constructor, getters/setters, etc.

    public synchronized void log(String message) {
        FileOutputStream foutStream = new FileOutputStream(logFile);
        ByteOutputStream boutStream = new ByteOutputStream(foutStream);
        boutStream.write(message.getBytes(Charset.forName("utf-8")));
        // etc.
    }
}
If you want to achieve maximum throughput for the logging operation you should decouple the logging of messages from writing them to the file system by using a queue and a separate log-writing thread.
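A minimal sketch of that design, with log() only enqueueing and a daemon thread doing all of the file I/O (queue capacity, flushing policy and shutdown handling are simplified placeholders):

import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class AsyncLogger {
    private final BlockingQueue<String> queue = new LinkedBlockingQueue<String>(100000);

    public AsyncLogger(String path) throws IOException {
        final PrintWriter out = new PrintWriter(new BufferedWriter(new FileWriter(path, true)));
        Thread writerThread = new Thread(new Runnable() {
            public void run() {
                try {
                    while (true) {
                        out.println(queue.take());   // blocks until a message is available
                        if (queue.isEmpty()) {
                            out.flush();             // flush only when there is nothing left to batch
                        }
                    }
                } catch (InterruptedException e) {
                    out.close();                     // stop on interrupt; anything still queued is lost
                }
            }
        });
        writerThread.setDaemon(true);
        writerThread.start();
    }

    public void log(String message) {
        queue.offer(message);   // never blocks the caller; silently drops if the queue is full
    }
}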
The purpose of a logging system isn't just to achieve maximum throughput. It is required as an audit trail. Business decisions must be made as to how much data loss, if any, is tolerable in case of a crash. You need to investigate that first, before committing yourself to any specific technical solution.
I'm speaking only about throughput here and not about engineering or reliability concerns, since the question asked only about performance.
You will want to buffer your writes to disk. Writing lots of tiny pieces with unbuffered I/O incurs a whole bunch of overhead:
The cost of a native method call; the JVM has to do a good amount of bookkeeping, including knowing which threads are running native methods and which aren't, in order to work. This is on the order of tens or hundreds of nanoseconds on modern platforms.
Copying the data from the Java heap to native memory through a magic JNI call. There's a memory copy taking time proportional to the length of your data, but there's also a bunch of JVM bookkeeping. Ballpark the bookkeeping overhead around a few hundred nanoseconds.
The cost of an OS call to write() or similar. Ballpark the overhead around 2 microseconds. (There are other costs, too; your caches and TLB have probably been flushed upon return.) write() also needs to copy the data from user space to kernel space.
The OS may internally buffer your writes. It may not. This depends on the OS and the characteristics of the underlying filesystem. You can typically force it not to buffer your writes. You can typically also flush the OS's buffer. Doing so will incur the cost of a disk seek and a write. Ballpark the disk seek around 8ms and the write between 100MB/s and 1GB/s. Throw the disk seek overhead out the window if you're using a RAM disk or flash storage or something like that --- latencies there are typically much lower.
The really big cost you want to avoid if possible is the disk seek cost. 8 milliseconds is a hell of a long time to wait when writing a 100-odd-byte log message. You will want some kind of buffering between the user and the backing storage, whether it's provided by the OS or hidden by the logging interface.
The overhead of a system call from the JVM is also significant, though it's about 1000 times less than the cost of a disk seek. You're spending two or three microseconds to tell the kernel to buffer a write of 100-odd bytes. Almost all of those two or three microseconds are spent handling various bookkeeping tasks that have nothing at all to do with writing a log message to a file. This is why you want the buffering to happen in userspace, and preferably in Java code instead of native code. (However, engineering concerns may render this impossible.)
Java already comes with drop-in buffering solutions --- BufferedWriter and BufferedOutputStream. It turns out that these are internally synchronised. You'll want to use BufferedOutputStream so that the String-to-bytes conversion happens outside of the lock rather than inside.
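For instance, a sketch along those lines (the 8 MB buffer is just an example of a large buffer; error handling is omitted):

import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.charset.Charset;

public class BufferedLogger {
    private final BufferedOutputStream out;

    public BufferedLogger(File logFile) throws IOException {
        this.out = new BufferedOutputStream(new FileOutputStream(logFile, true), 8 * 1024 * 1024);
    }

    public void log(String message) throws IOException {
        byte[] bytes = (message + "\n").getBytes(Charset.forName("UTF-8"));  // encoding outside the lock
        synchronized (out) {
            out.write(bytes);   // cheap: usually just a copy into the in-memory buffer
        }
    }
}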
You could do one better than the Buffered classes if you kept a queue of Strings that you flush once it reaches a certain size. This saves a memory copy, but I rather doubt this is worth doing.
On buffer sizes, I suggested something around 4MB or 8MB. Buffer sizes in this range cover up the latency of a disk seek fairly well on most typical modern hardware. Your southbridge can push about 1GB/s and a typical disk can push about 100MB/s. Maxing out your southbridge, then, an 8MB write will take about 8 milliseconds --- roughly as long as a disk seek. With a single "typical modern disk", 90% of the time spent doing an 8MB random write is spent doing the write.
Again, you can't do buffering inside Java if log messages need to be reliably written to the backing store. You need to trust the kernel in that case, and you pay a speed hit for doing so.
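If you do need that reliability, a hedged sketch of the "trust the kernel" variant is to write unbuffered and force the data out with FileDescriptor.sync() after each message; expect it to be far slower than the buffered version:

import java.io.FileOutputStream;
import java.io.IOException;

public class DurableLogger {
    private final FileOutputStream out;

    public DurableLogger(String path) throws IOException {
        this.out = new FileOutputStream(path, true);
    }

    public synchronized void log(String message) throws IOException {
        out.write((message + "\n").getBytes("UTF-8"));
        out.getFD().sync();   // ask the OS to push its buffers all the way to the device
    }
}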
I am reading a 50 GB file containing millions of rows separated by newline characters. Presently I am using the following code to read the file:
String line = null;
BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream("FileName")));
while ((line = br.readLine()) != null)
{
    // Processing each line here
    // All processing is done in memory. No IO required here.
}
Since the file is so big, it is taking 2 hours to process the whole file. Can I improve the reading of the file from the hard disk so that the I/O (reading) part takes minimal time? The restriction on my code is that I have to process the lines in sequential order.
it is taking 2 Hrs to process the whole file.
50 GB / 2 hours equals approximately 7 MB/s. That's not a bad rate at all. A good (modern) hard disk should be capable of sustaining a higher rate continuously, so maybe your bottleneck is not the I/O? You're already using BufferedReader, which, as the name says, buffers (in memory) what it reads. You could experiment with giving the reader a somewhat bigger buffer than the default size (8192 bytes), like so:
BufferedReader br = new BufferedReader(
new InputStreamReader(new FileInputStream("FileName")), 100000);
Note that with the default 8192-byte buffer and 7 MB/s throughput, the BufferedReader is going to re-fill its buffer almost 1000 times per second, so raising the buffer size to lower that number could really help cut down some overhead. But if the processing you're doing, rather than the I/O, is the bottleneck, then no I/O trick is going to help you much. You should maybe consider making the processing multi-threaded, but whether that's doable, and how, depends on what "processing" means here.
Your only hope is to parallelize the reading and processing of what's inside. Your strategy should be to never require the entire file contents to be in memory at once.
Start by profiling the code you have to see where the time is being spent. Rewrite the part that takes the most time and re-profile to see if it improved. Keep repeating until you get an acceptable result.
I'd think about Hadoop and a distributed solution. Data sets that are larger than yours are processed routinely now. You might need to be a bit more creative in your thinking.
Without NIO you won't be able to break the throughput barrier. For example, try using new Scanner(File) instead of directly creating readers; I recently took a look at that source code, and it uses NIO's file channels.
But the first thing I would suggest is to run an empty loop with BufferedReader that does nothing but reading. Note the throughput -- and also keep an eye on the CPU. If the loop floors the CPU, then there's definitely an issue with the IO code.
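Something like this, as a rough baseline (the buffer size and file name are placeholders): if this alone takes close to your 2 hours, the I/O is the problem; if it finishes quickly, the per-line processing is.

import java.io.BufferedReader;
import java.io.FileReader;

public class ReadOnlyBaseline {
    public static void main(String[] args) throws Exception {
        long start = System.nanoTime();
        BufferedReader br = new BufferedReader(new FileReader("FileName"), 1 << 20);
        long lines = 0;
        while (br.readLine() != null) {
            lines++;                    // intentionally no other work
        }
        br.close();
        System.out.printf("%d lines in %.1f s%n", lines, (System.nanoTime() - start) / 1e9);
    }
}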
Disable the antivirus and any other program which adds to disk contention while reading the file.
Defragment the disk.
Create a raw disk partition and read the file from there.
Read the file from an SSD.
Create a 50GB Ramdisk and read the file from there.
I think you may get the best results by re-considering the problem you're trying to solve. There's clearly a reason you're loading this 50Gig file. Consider if there isn't a better way to break the stored data down and only use the data you really need.
The way you read the file is fine. There might be ways to make it faster, but that usually requires understanding where your bottleneck is. Because the I/O throughput is actually on the lower end, I assume the computation is what's hurting performance. If it's not too lengthy, you could show your whole program.
Alternatively, you could run your program without the contents of the loop and see how long it takes to read through the file :)
I have a program that generates a lot of data and puts it in a queue to write, but the problem is that it generates data faster than I can currently write it (causing it to max out memory and start to slow down). Order does not matter, as I plan to parse the file later.
I looked around a bit and found a few questions that helped me design my current process (but I still find it slow). Here's my code so far:
//...background multi-threaded process keeps building the queue..
FileWriter writer = new FileWriter("foo.txt",true);
BufferedWriter bufferWritter = new BufferedWriter(writer);
while(!queue_of_stuff_to_write.isEmpty()) {
    String data = solutions.poll().data;
    bufferWritter.newLine();
    bufferWritter.write(data);
}
bufferWritter.close();
I'm pretty new to programming, so I may be assessing this wrong (maybe it's a hardware issue, as I'm using EC2), but is there a way to very quickly dump the queue results into a file, or, if my approach is okay, can I improve it somehow? Since order does not matter, does it make more sense to write to multiple files on multiple drives? Will threading make it faster? Etc. I'm not exactly sure of the best approach, and any suggestions would be great. My goal is to save the results of the queue (sorry, no outputting to /dev/null :-) and keep memory consumption as low as possible for my app (I'm not 100% sure, but the queue fills up 15 GB, so I'm assuming it'll be a 15 GB+ file).
Fastest way to write huge data in text file Java (realized I should use buffered writer)
Concurrent file write in Java on Windows (made me see that maybe multi-threading writes wasn't a great idea)
Looking at that code, one thing that springs to mind is character encoding. You're writing strings, but ultimately it's bytes that go to the streams. A writer does the character-to-byte encoding under the hood, and it does it on the same thread that handles the writing. That may mean time is spent encoding that delays the writes, which could reduce the rate at which data is written.
A simple change would be to use a queue of byte[] instead of String, do the encoding in the threads which push onto the queue, and have the IO code use a BufferedOutputStream rather than a BufferedWriter.
This may also reduce memory consumption, if the encoded text takes up less than two bytes per character on average. For latin text and UTF-8 encoding, this will usually be true.
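A rough sketch of that arrangement (queue capacity, buffer size, file name and the poison-pill shutdown are all simplified placeholders):

import java.io.BufferedOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class ByteQueueWriter {
    static final byte[] POISON = new byte[0];   // producers put POISON on the queue when finished
    static final BlockingQueue<byte[]> queue = new LinkedBlockingQueue<byte[]>(100000);

    // Producer threads call this instead of queueing Strings:
    static void enqueue(String data) throws InterruptedException, IOException {
        queue.put((data + "\n").getBytes("UTF-8"));   // encoding happens on the producer thread
    }

    // The single writer thread runs this:
    static void drainToFile(String path) throws IOException, InterruptedException {
        BufferedOutputStream out =
                new BufferedOutputStream(new FileOutputStream(path, true), 1 << 20);
        try {
            byte[] chunk;
            while ((chunk = queue.take()) != POISON) {
                out.write(chunk);                     // no char-to-byte work on the IO thread
            }
        } finally {
            out.close();
        }
    }
}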
However, I suspect it's likely that you're simply generating data faster than your I/O subsystem can handle. You will need to make your I/O subsystem faster, either by using a faster one (if you're on EC2, perhaps renting a faster instance or writing to a different backend - SQS vs EBS vs local disk, etc.), or by ganging several I/O subsystems together in parallel somehow.
Yes, writing multiple files on multiple drives should help, and if nothing else is writing to those drives at the same time, performance should scale linearly with the number of drives until I/O is no longer the bottleneck. You could also try a couple other optimizations to boost performance even more.
If you're generating huge files and the disk simply can't keep up, you can use a GZIPOutputStream to shrink the output--which, in turn, will reduce the amount of disk I/O. For non-random text, you can usually expect a compression ratio of at least 2x-10x.
//...background multi-threaded process keeps building the queue..
OutputStream out = new FileOutputStream("foo.txt",true);
OutputStreamWriter writer = new OutputStreamWriter(new GZIPOutputStream(out));
BufferedWriter bufferWriter = new BufferedWriter(writer);
while(!queue_of_stuff_to_write.isEmpty()) {
    String data = solutions.poll().data;
    bufferWriter.newLine();
    bufferWriter.write(data);
}
bufferWriter.close();
If you're outputting regular (i.e., repetitive) data, you might also want to consider switching to a different output format--for example, a binary encoding of the data. Depending on the structure of your data, it might be more efficient to store it in a database. If you're outputting XML and really want to stick to XML, you should look into a Binary XML format, such as EXI or Fast InfoSet.
I guess as long as you produce your data out of calculations and do not load your data from another data source, writing will always be slower than generating your data.
You can try writing your data to multiple files (not to the same file, to avoid synchronization problems) from multiple threads, but I guess that will not fix your problem.
Is it possible for you to wait for the writing part of your application to finish its operation and continue your calculations?
Another thing to check: do you actually empty your queue? Does solutions.poll() reduce your solutions queue?
Writing to different files using multiple threads is a good idea. Also, you should look into setting the BufferedWriter's buffer size, which you can do from the constructor. Try initializing it with a 10 MB buffer and see if that helps.
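For example, something like this (the second constructor argument is the buffer size in characters, here roughly 10 MB; the file name is taken from the question's snippet):

BufferedWriter bufferWriter =
        new BufferedWriter(new FileWriter("foo.txt", true), 10 * 1024 * 1024);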
Say you have a file bigger than the memory you have to handle it. You'd like to read the file n bytes at a time and not get blocked in the process:
read a block
pass it to a thread
read another block
pass it to a thread
I tried different things with varying success; however, blocking always seems to be the issue.
Please provide an example of a non-blocking way to gain access to, say, a byte[].
You can't.
You will always block while waiting for the disk to provide you with data. If you have a lot of work to do with each chunk of data, then using a second thread may help: that thread can perform CPU-intensive work on the data while the first thread is blocked waiting for the next read to complete.
But that doesn't sound like your situation.
Your best bet is to read data in as large a block as you possibly can (say, 1MB or more). This minimizes the time blocked in the kernel, and may result in less time waiting for the disk (if the blocks being read happen to be contiguous).
Here's teh codez
ExecutorService exec = Executors.newFixedThreadPool(1);
// use RandomAccessFile because it supports readFully()
RandomAccessFile in = new RandomAccessFile("myfile.dat", "r");
in.seek(0L);
while (in.getFilePointer() < in.length())
{
    int readSize = (int) Math.min(1000000, in.length() - in.getFilePointer());
    final byte[] data = new byte[readSize];
    in.readFully(data);
    exec.execute(new Runnable()
    {
        public void run()
        {
            // do something with data
        }
    });
}
It sounds like you are looking for Streams, buffering, or some combination of the two (BufferedInputStream anyone?).
Check this out:
http://docs.oracle.com/javase/tutorial/essential/io/buffers.html
This is the standard way to deal with very large files. I apologize if this isn't what you were looking for, but hopefully it'll help get the juices flowing anyway.
Good luck!
If you have a program that does I/O and CPU computations, blocking is inevitable (somewhere in your program) if on average the amount of CPU time it takes to process a byte is less than the time to read a byte.
If you try to read a file and that requires a disk seek, the data might not arrive for 10 ms. A 2 GHz CPU could have done 20 M clock cycles of work in that time.