I am trying to parse a pcap file in two different ways, using two different methods. The pcap file is passed to the class that contains both methods when it is created. When I use the pcap file in the first method, it loops through the packets without a problem. However, when I go to parse through it a second time in the second method, nothing happens when I try to print each packet. I tried passing the pcap file directly to the second method and still no dice. Do I need to reset a counter/pointer? Any ideas?
How pcap file is loaded from disk
pcap = Pcap.openStream(pcapPath);
How class constructor intakes pcap file
public PcapParsing(Pcap pcap) {
this.pcap = pcap;
}
How both methods parse the pcap file
public void arpFloodDetect(Pcap ppcap)
{
    try {
        ppcap.loop((final Packet packet) -> {
            System.out.println(packet.toString());
            return true;
        });
    } catch (IOException e) {
        e.printStackTrace();
    }
}
Do I need to reset a counter/pointer?
You need to create a new Pcap by calling Pcap.openStream again. The Pcap API does not expose any methods for resetting the underlying stream.
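A minimal sketch of that approach, assuming the io.pkts-style API used in the question (Pcap.openStream, loop and close); the class and method names are just the ones from the question, kept for illustration:

import java.io.IOException;
import io.pkts.Pcap;

public class PcapParsing {
    private final String pcapPath;

    public PcapParsing(String pcapPath) {
        // Store the path rather than a Pcap instance, so every pass opens a fresh stream.
        this.pcapPath = pcapPath;
    }

    public void arpFloodDetect() throws IOException {
        final Pcap pcap = Pcap.openStream(pcapPath); // fresh stream for this pass
        try {
            pcap.loop(packet -> {
                System.out.println(packet);
                return true; // keep looping
            });
        } finally {
            pcap.close();
        }
    }

    // A second detection method would call Pcap.openStream(pcapPath) again in the same way.
}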
Pcap files can get large, like a couple of gigabytes or larger. Will this add a significant load penalty each time I call it?
It depends on how good your file system is. If we assume that your file system is on a fast local SSD, and you are running an OS which uses RAM for file system buffer caching, then reading a big file will be fast the first time, and faster the second time.
It also depends on what you mean by "significant", and what is acceptable. And how much money you are prepared to pay to upgrade your hardware to achieve acceptable performance.
Would you happen to know a different way of loading files that avoids a penalty if there is one?
Basically, no.
The only other alternatives I can think of involve reading or mapping the entire file into the JVM's address space and then wrapping it in an InputStream. You would still need to create a new Pcap for each pass through the file.
But the problem with this is that it requires as much JVM address space as the size of the file you are processing. If the file is significantly bigger than the amount of physical RAM available, it can get horrible:
In the best case your performance will be equivalent to re-reading the file from disk.
In the worst case your application thrashes and brings the operating system to its knees (or gets OOM-killed to prevent that).
The current Pcap implementation is designed to avoid that by not caching the data in RAM. That is how it is able to cope with huge input files without running out of memory, etc.
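For completeness, a sketch of that in-memory alternative, assuming your version of the library has a Pcap.openStream overload that accepts an InputStream (check that before relying on it); it is only viable when the whole capture comfortably fits in the heap:

import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import io.pkts.Pcap;

public class InMemoryPcap {
    public static void main(String[] args) throws IOException {
        // Read the whole capture into memory once (only viable if it fits in the heap)...
        byte[] capture = Files.readAllBytes(Paths.get(args[0]));

        // ...then create a new Pcap over the same bytes for every pass.
        Pcap firstPass = Pcap.openStream(new ByteArrayInputStream(capture));
        firstPass.loop(packet -> { /* first analysis */ return true; });
        firstPass.close();

        Pcap secondPass = Pcap.openStream(new ByteArrayInputStream(capture));
        secondPass.loop(packet -> { /* second analysis */ return true; });
        secondPass.close();
    }
}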
Related
Context
I am writing a Java program that communicates with a C# program through standard in and standard out. The C# program is started as a child process. It gets "requests" through stdin and sends "responses" through stdout. The requests are very lightweight (a few bytes in size), but the responses are large. In a normal run of the program, the responses amount to about 2GB of data.
I am looking for ways to improve performance, and my measurements indicate that writing to stdout is a bottleneck. Here are the numbers from a normal run:
Total time: 195 seconds
Data transferred through stdout: 2026MB
Time spent writing to stdout: 85 seconds
stdout throughput: 23.8 MB/s
By the way, I am writing all the bytes to an in-memory buffer first, and copying them in one go to stdout to make sure I only measure stdout write time.
Question
What is an efficient and elegant way to share data between the C# child process and the Java parent process? It is clear that stdout is not going to be enough.
I have read here and there about sharing memory through memory mapped files, but the Java and .NET APIs give me the impression that I'm looking in the wrong place.
Before you invest more in memory mapped files or named pipes, I would first check whether you actually read and write efficiently. java.lang.Process.getInputStream() uses a BufferedInputStream, so the reader side should be OK. But in your C# program you will most likely use Console.Write. The problem here is that AutoFlush is enabled by default, so every single write explicitly flushes the stream. I wrote my last C# code years ago, so I'm not up-to-date. But maybe it is possible to set the AutoFlush property of Console.Out to false and flush the stream manually after multiple writes.
If disabling AutoFlush is not possible, the only way to improve performance with Console.Out would be to write more text in a single write.
Another potential bottleneck may be a shell in between that has to interpret the written data. Ensure that you execute the C# program directly and not through a script or by calling the command executor.
Before you start using memory mapped files I would first try simply writing into a file. As long as you have enough free memory that is not used by your programs or others, and as long as there are no other programs with frequent disk access, the operating system will be able to hold quite a large amount of written data in the file system cache. As long as your Java program reads fast enough from the file while your C# program is writing to it, chances are high that only some, or even no, data has to be loaded from disk.
As Matthew Watson mentioned in the comments, it is indeed possible and incredibly fast to use a memory mapped file. In fact, the throughput for my program went from 24 MB/s to 180 MB/s. Below is the gist of it.
The following Java code creates the memory mapped file used for communication and opens a buffer we can read from:
var path = Paths.get("test.mmap");
var channel = FileChannel.open(path, StandardOpenOption.READ, StandardOpenOption.WRITE, StandardOpenOption.CREATE);
var mappedByteBuffer = channel.map(FileChannel.MapMode.READ_WRITE, 0, 200_000 * 8);
The following C# code opens the memory mapped file and creates a stream that you can use to write bytes to it (note that buffer is the name of the array of bytes to be written):
// This code assumes the file has already been created on the Java side
var file = File.Open("test.mmap", FileMode.Open, FileAccess.ReadWrite, FileShare.ReadWrite);
var memoryMappedFile = MemoryMappedFile.CreateFromFile(file, null, 0, MemoryMappedFileAccess.ReadWrite, HandleInheritability.None, false); // null = no map name, 0 = use the file's current size
var stream = memoryMappedFile.CreateViewStream();
stream.Write(buffer, 0, buffer.Length);
stream.Flush();
Of course, you need to somehow synchronize the Java and the C# side. For the sake of simplicity, I didn't include that in the code above. In my code, I am using standard in and standard out to signal when it is safe to read / write.
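On the Java side, once the writer has signalled that a block is ready, reading it back from the mapped buffer looks roughly like this (a sketch that simply continues the Java snippet above; treating the region as 200,000 8-byte values is just an assumption based on the sizes used there):

// Continues the Java snippet above; 'mappedByteBuffer' is the MappedByteBuffer created there.
mappedByteBuffer.order(ByteOrder.LITTLE_ENDIAN); // match the byte order of the C# writer
mappedByteBuffer.position(0);                    // rewind to the start of the shared region
long[] values = new long[200_000];
for (int i = 0; i < values.length; i++) {
    values[i] = mappedByteBuffer.getLong();      // 8 bytes per value
}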
With reference to this Stack Overflow question, it is said that an InputStream can be read multiple times using the mark() and reset() methods provided by InputStream, or by using a PushbackInputStream.
In all these cases the content of the stream is stored in a byte array (i.e. the original content of the file is stored in main memory) and reused multiple times.
What happens when the size of the file exceeds the memory size? I think this may pave the way for an OutOfMemoryError.
Is there any better way to read the stream content multiple times without storing the stream content locally (i.e. in main memory)?
Please help me understand this. Thanks in advance.
It depends on the source of the stream.
If it's a local file, you can likely re-open and re-read the stream as many times as you want.
If it's dynamically generated by a process, a remote service, etc., you might not be free to re-generate it. In that case, you need to store it, either in memory or in some more persistent (and slow) storage like a file system or storage service.
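For the local-file case, "resetting" is simply a matter of opening a fresh stream for each pass. A minimal sketch (the file name is hypothetical):

import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class TwoPasses {
    public static void main(String[] args) throws IOException {
        Path file = Paths.get("data.bin"); // hypothetical file

        // First pass: read the stream to the end.
        try (InputStream in = Files.newInputStream(file)) {
            while (in.read() != -1) { /* process the bytes */ }
        }

        // Second pass: open a fresh stream instead of trying to reset the old one.
        try (InputStream in = Files.newInputStream(file)) {
            while (in.read() != -1) { /* process the bytes again */ }
        }
    }
}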
Maybe an analogy would help. Suppose your friend is speaking to you at length. You listen carefully without interruption, but when they are done, you realize you didn't understand something they said near the beginning, and want to review that portion.
At this point, there are a few possibilities.
Perhaps your friend was actually reading aloud from a book. You can simply re-read the book.
Or, perhaps you had the foresight to record their monologue. You can replay the recording.
However, since neither you nor your friend has perfect and unlimited recall, simply repeating verbatim what was said ten minutes ago from memory alone is not an option.
An InputStream is like your friend speaking. Neither of you has a good enough memory to remember exactly, word-for-word, what is said. In the same way, neither a process that is generating the data stream nor your program has enough RAM to store, byte-for-byte, the stream. To scale, your program has to rely on its "short-term memory" (RAM), working with just a small portion of the whole stream at any given time, and "taking notes" (writing to a persistent store) as it encounters important points.
If the source of the stream is a local file, then it's like your friend reading aloud from a book. Either of you can re-read that content easily enough.
If you copy the stream to some persistent storage, that's like recording your friend's speech. You can replay it as often as you like.
Consider a scenario where a browser is uploading a large file, but the server is busy and not able to read that stream for some time. Where is that data stored during that delay?
Because the receiver can't always respond immediately to input, TCP and many other protocols allocate a small buffer to store some data from a sender. But they also have a way to tell the sender to wait because it is sending data too fast; this is called flow control. Going back to the analogy, it's like telling your friend to pause a moment while you catch up with your note-taking.
As the browser uploads the file, at first, the buffer will be filled. But if the server can't keep up, the browser will be instructed to pause its upload until there is more room in the buffer. (This generally happens at the OS and TCP level; the client and server applications don't manage this directly.) The upload speed depends on how fast the browser can read the file from disk, how fast the network link is, and how fast the server can process the uploaded data. Even a fast network and client will be limited by the weak link in this chain.
When I want to write Java code for writing text to a file, it usually looks something like this:
File logFile = new File("/home/someUser/app.log");
FileWriter writer;
try {
    writer = new FileWriter(logFile, true);
    writer.write("Some text.");
    writer.close();
} catch (IOException e) {
    e.printStackTrace();
}
I am now writing a Logger that will be used extensively by an in-house reporting tool. For reasons outside the context of this question, I can't use one of the traditional logging frameworks (SLF4J, Log4j, Logback, JUL, JCL, etc.). So I have to make something homegrown.
This logging system will be simple, non-configurable, but has to be capable of handling high-volume (possibly hundreds of log operations per second, or more).
So I ask: how can I optimize my normal file I/O template above to handle high-throughput logging? What kind of "hidden gems of Java File I/O" can I capitalize on here? Pretty much anything goes, except, like I said, use of other logging frameworks. The basic Logger API needs to be something like:
public class Logger {
    private File logFile;

    public Logger(File logFile) {
        super();
        setFile(logFile);
    }

    public void log(String message) {
        ???
    }
}
Thanks in advance!
Update: If my Logger used a byte-based OutputStream instead of a FileWriter, then how can I properly synchronize my log(String) : void method?
public class Logger {
    private File logFile;

    // Constructor, getters/setters, etc.

    public synchronized void log(String message) {
        try (FileOutputStream foutStream = new FileOutputStream(logFile, true);
             BufferedOutputStream boutStream = new BufferedOutputStream(foutStream)) {
            boutStream.write(message.getBytes(StandardCharsets.UTF_8));
            // etc.
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
If you want to achieve maximum throughput for the logging operation you should decouple the logging of messages from writing them to the file system by using a queue and a separate log-writing thread.
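A rough sketch of that decoupling (the class and field names here are made up for illustration, and it ignores shutdown, flushing policy and error handling):

import java.io.BufferedOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class AsyncLogger {
    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();

    public AsyncLogger(String path) throws IOException {
        OutputStream out = new BufferedOutputStream(new FileOutputStream(path, true));
        Thread writer = new Thread(() -> {
            try {
                while (true) {
                    // Only this thread ever touches the file; callers never block on disk I/O.
                    out.write(queue.take().getBytes(StandardCharsets.UTF_8));
                }
            } catch (IOException | InterruptedException e) {
                // Real code needs a shutdown protocol and error reporting here.
            }
        });
        writer.setDaemon(true);
        writer.start();
    }

    public void log(String message) {
        queue.add(message + System.lineSeparator());
    }
}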
The purpose of a logging system isn't just to achieve maximum throughput. It is required as an audit trail. Business decisions must be made as to how much data loss, if any, is tolerable in case of a crash. You need to investigate that first, before committing yourself to any specific technical solution.
I'm speaking only about throughput here and not about engineering or reliability concerns, since the question asked only about performance.
You will want to buffer writes to disk. Writing lots of little tiny pieces with unbuffered I/O incurs a whole bunch of overhead:
The cost of a native method call; the JVM has to do a good amount of bookkeeping, including knowing which threads are running native methods and which aren't, in order to work. This is on the order of tens or hundreds of nanoseconds on modern platforms.
Copying the data from the Java heap to native memory through a magic JNI call. There's a memory copy taking time proportional to the length of your data, but there's also a bunch of JVM bookkeeping. Ballpark the bookkeeping overhead around a few hundred nanoseconds.
The cost of an OS call to write() or similar. Ballpark the overhead around 2 microseconds. (There are other costs, too; your caches and TLB have probably been flushed upon return.) write() also needs to copy the data from user space to kernel space.
The OS may internally buffer your writes. It may not. This depends on the OS and the characteristics of the underlying filesystem. You can typically force it not to buffer your writes. You can typically also flush the OS's buffer. Doing so will incur the cost of a disk seek and a write. Ballpark the disk seek around 8ms and the write between 100MB/s and 1GB/s. Throw the disk seek overhead out the window if you're using a RAM disk or flash storage or something like that --- latencies there are typically much lower.
The really big cost you want to avoid if possible is the disk seek cost. 8 milliseconds is a hell of a long time to wait when writing a 100-odd-byte log message. You will want some kind of buffering between the user and the backing storage, whether it's provided by the OS or hidden by the logging interface.
The overhead of a system call from the JVM is also significant, though it's about 1000 times less than the cost of a disk seek. You're spending two or three microseconds to tell the kernel to buffer a write of 100-odd bytes. Almost all of those two or three microseconds are spent handling various bookkeeping tasks that have nothing at all to do with writing a log message to a file. This is why you want the buffering to happen in userspace, and preferably in Java code instead of native code. (However, engineering concerns may render this impossible.)
Java already comes with drop-in buffering solutions --- BufferedWriter and BufferedOutputStream. It turns out that these are internally synchronised. You'll want to use BufferedOutputStream so that the String-to-bytes conversion happens outside of the lock rather than inside.
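Concretely, the point about doing the conversion outside the lock looks something like this (a sketch; the Logger class and its shared BufferedOutputStream are assumptions for illustration):

import java.io.BufferedOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class Logger {
    private final BufferedOutputStream out; // its write methods are internally synchronized

    public Logger(BufferedOutputStream out) {
        this.out = out;
    }

    public void log(String message) throws IOException {
        // Charset conversion happens here, outside any lock...
        byte[] bytes = (message + System.lineSeparator()).getBytes(StandardCharsets.UTF_8);
        // ...and only the buffered byte copy is serialized inside write().
        out.write(bytes);
    }
}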
You could do one better than the Buffered classes if you kept a queue of Strings that you flush once it reaches a certain size. This saves a memory copy, but I rather doubt this is worth doing.
On buffer sizes, I suggested something around 4MB or 8MB. Buffer sizes in this range cover up the latency of a disk seek fairly well on most typical modern hardware. Your southbridge can push about 1GB/s and a typical disk can push about 100MB/s. Maxing out your southbridge, then, an 8MB write will take about 8 milliseconds --- roughly as long as a disk seek. With a single "typical modern disk", 90% of the time spent doing an 8MB random write is spent doing the write.
Again, you can't do buffering inside Java if log messages need to be reliably written to the backing store. You need to trust the kernel in that case, and you pay a speed hit for doing so.
I have a program that generates a lot of data and puts it in a queue to write, but the problem is it's generating data faster than I'm currently writing it (causing it to max out memory and start to slow down). Order does not matter, as I plan to parse the file later.
I looked around a bit and found a few questions that helped me design my current process (but I still find it slow). Here's my code so far:
//...background multi-threaded process keeps building the queue..
FileWriter writer = new FileWriter("foo.txt", true);
BufferedWriter bufferWritter = new BufferedWriter(writer);
while (!queue_of_stuff_to_write.isEmpty()) {
    String data = solutions.poll().data;
    bufferWritter.newLine();
    bufferWritter.write(data);
}
bufferWritter.close();
I'm pretty new to programming, so I may be assessing this wrong (maybe it's a hardware issue, as I'm using EC2), but is there a way to very quickly dump the queue results into a file? Or, if my approach is okay, can I improve it somehow? As order does not matter, does it make more sense to write to multiple files on multiple drives? Will threading make it faster? etc. I'm not exactly sure of the best approach, and any suggestions would be great. My goal is to save the results of the queue (sorry, no outputting to /dev/null :-) ) and keep memory consumption as low as possible for my app (I'm not 100% sure, but the queue fills up 15 GB, so I'm assuming it'll be a 15 GB+ file).
Fastest way to write huge data in text file Java (realized I should use buffered writer)
Concurrent file write in Java on Windows (made me see that maybe multi-threading writes wasn't a great idea)
Looking at that code, one thing that springs to mind is character encoding. You're writing strings, but ultimately, it's bytes that go to the streams. A writer does the character-to-byte encoding under the hood, and it's doing it in the same thread that is handling the writing. That may mean that time is being spent encoding that is delaying writes, which could reduce the rate at which data is written.
A simple change would be to use a queue of byte[] instead of String, do the encoding in the threads which push onto the queue, and have the IO code use a BufferedOutputStream rather than a BufferedWriter.
This may also reduce memory consumption, if the encoded text takes up less than two bytes per character on average. For latin text and UTF-8 encoding, this will usually be true.
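A sketch of that change (the queue and class names here are hypothetical): the producer threads do the encoding, and the single writer thread only moves bytes.

import java.io.BufferedOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class ByteQueueWriter {
    private final BlockingQueue<byte[]> queue = new LinkedBlockingQueue<>();

    // Called by the producer threads: encode while generating,
    // so the writer thread never pays for charset conversion.
    public void enqueue(String data) {
        queue.add((data + "\n").getBytes(StandardCharsets.UTF_8));
    }

    // Called by the single writer thread: pure byte I/O.
    public void drainTo(String path) throws IOException, InterruptedException {
        try (BufferedOutputStream out = new BufferedOutputStream(new FileOutputStream(path, true))) {
            while (true) {
                out.write(queue.take());
            }
        }
    }
}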
However, I suspect it's likely that you're simply generating data faster than your IO subsystem can handle it. You will need to make your IO subsystem faster - either by using a faster one (if you're on EC2, perhaps renting a faster instance, or writing to a different backend - SQS vs EBS vs local disk, etc.), or by ganging several IO subsystems together in parallel somehow.
Yes, writing multiple files on multiple drives should help, and if nothing else is writing to those drives at the same time, performance should scale linearly with the number of drives until I/O is no longer the bottleneck. You could also try a couple other optimizations to boost performance even more.
If you're generating huge files and the disk simply can't keep up, you can use a GZIPOutputStream to shrink the output--which, in turn, will reduce the amount of disk I/O. For non-random text, you can usually expect a compression ratio of at least 2x-10x.
//...background multi-threaded process keeps building the queue..
OutputStream out = new FileOutputStream("foo.txt", true);
OutputStreamWriter writer = new OutputStreamWriter(new GZIPOutputStream(out));
BufferedWriter bufferWriter = new BufferedWriter(writer);
while (!queue_of_stuff_to_write.isEmpty()) {
    String data = solutions.poll().data;
    bufferWriter.newLine();
    bufferWriter.write(data);
}
bufferWriter.close();
If you're outputting regular (i.e., repetitive) data, you might also want to consider switching to a different output format--for example, a binary encoding of the data. Depending on the structure of your data, it might be more efficient to store it in a database. If you're outputting XML and really want to stick to XML, you should look into a Binary XML format, such as EXI or Fast InfoSet.
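For example, a simple binary encoding with DataOutputStream (just an illustration of the idea; the file name and record fields are made up):

import java.io.BufferedOutputStream;
import java.io.DataOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;

public class BinaryDump {
    public static void main(String[] args) throws IOException {
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(new FileOutputStream("solutions.bin", true)))) {
            // A fixed-width record (an int id plus two doubles = 20 bytes) is usually
            // far smaller than the same values written out as text.
            out.writeInt(42);
            out.writeDouble(3.14);
            out.writeDouble(2.71);
        }
    }
}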
I guess as long as you produce your data out of calculations and do not load your data from another data source, writing will always be slower than generating your data.
You can try writing your data to multiple files (not to the same file, due to synchronization problems) from multiple threads (but I guess that will not fix your problem).
Is it possible for you to wait for the writing part of your application to finish its operation and continue your calculations?
Another thing to check: do you actually empty your queue? Does solutions.poll() reduce your solutions queue?
Writing to different files using multiple threads is a good idea. Also, you should look into setting the BufferedWriter's buffer size, which you can do from the constructor. Try initializing it with a 10 MB buffer and see if that helps.
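For reference, the buffer size is the second constructor argument (the 10 MB figure is just the suggestion above, not a magic number):

// 10 MB buffer instead of the default of 8,192 characters
BufferedWriter bufferWriter = new BufferedWriter(new FileWriter("foo.txt", true), 10 * 1024 * 1024);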
Let's say one program is reading file F.txt, and another program is writing to this file at the same moment.
(When I think about how I would implement this functionality if I were a system programmer) I realize that there can be ambiguity in:
what will the first program see?
where does the second program write new bytes? (i.e. write "in place" vs write to a new file and then replace the old file with the new one)
how many programs can write to the same file simultaneously?
.. and maybe something not so obvious.
So, my questions are:
what are the main strategies for reading/writing files functionality?
which of them are supported in which OS (Windows, Linux, Mac OS etc)?
can it depend on the programming language? (I suppose that Java might try to provide some unified behavior on all supported OSes)
A single byte has a long journey from the magnetic platter or flash cell to your local Java variable. This is the path that a single byte travels:
Magnetic platter/flash cell
Internal hard disc buffer
SATA/IDE bus
SATA/IDE buffer
PCI/PCI-X bus
Computer's data bus
Computer's RAM via DMA
OS Page-cache
Libc read buffer, aka user space fopen() read buffer
Local Java variable
For performance reasons, most of the file buffering done by the OS is kept in the page cache, storing recently read and written file contents in RAM.
That means that every read and write operation from your Java code is done from and to your local buffer:
FileInputStream fis = new FileInputStream("/home/vz0/F.txt");
// This byte comes from the user space buffer.
int oneByte = fis.read();
A page is usually a single block of 4 KB of memory. Every page has some special flags and attributes, one of them being the "dirty page" flag, which means the page has modified data that has not yet been written to the physical media.
Some time later, when the OS decides to flush the dirty data back to the disk, it sends the data in the opposite direction from where it came.
Whenever two distinct processes write data to the same file, the resulting behaviour is:
Impossible, if the file is locked. The second process won't be able to open the file.
Undefined, if writing over the same region of the file.
Expected, if operating over different regions of the file.
A "region" is dependant on the internal buffer sizes that your application uses. For example, on a two megabytes file, two distinct processes may write:
One on the first 1kB of data (0; 1024).
The other on the last 1kB of data (2096128; 2097152)
Buffer overlapping and data corruption would occur only if the local buffer were two megabytes in size. In Java you can use channel I/O to read files with fine-grained control of what's going on inside.
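For example, a positional read with FileChannel touches only the region you ask for (a minimal sketch, reusing the file and offsets from the example above):

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class RegionRead {
    public static void main(String[] args) throws IOException {
        try (FileChannel channel = FileChannel.open(Paths.get("/home/vz0/F.txt"), StandardOpenOption.READ)) {
            ByteBuffer buffer = ByteBuffer.allocate(1024);
            // Read exactly the last 1 kB of the two-megabyte file, starting at byte offset 2096128,
            // without disturbing any other reader's or writer's position.
            channel.read(buffer, 2096128);
        }
    }
}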
Many transactional databases force some writes from the local RAM buffers back to disk by issuing a sync operation. All the data related to a single file gets flushed back to the magnetic platters or flash cells, effectively ensuring that no data will be lost on power failure.
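In Java, the same effect is available through FileChannel.force() or FileDescriptor.sync() (a minimal sketch; the file name and record are just placeholders):

import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class DurableWrite {
    public static void main(String[] args) throws IOException {
        try (FileOutputStream out = new FileOutputStream("/home/vz0/F.txt", true)) {
            out.write("important record\n".getBytes(StandardCharsets.UTF_8));
            // Force the dirty pages for this file out of the page cache onto the physical media.
            out.getChannel().force(true); // or out.getFD().sync()
        }
    }
}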
Finally, a memory mapped file is a region of memory that enables a user process to read and write directly from and to the page cache, bypassing the user space buffering.
The Page Cache system is vital to the performance of a multitasking protected mode OS, and every modern operating system (Windows NT upwards, Linux, MacOS, *BSD) supports all these features.
http://ezinearticles.com/?How-an-Operating-Systems-File-System-Works&id=980216
There can be as many strategies as there are file systems. Generally, the OS focuses on avoiding I/O operations by caching file data before it is synchronized with the disk; a read from the buffer sees the data previously written to it. So between the software and the hardware there is a layer of buffering (e.g. the MySQL MyISAM engine makes heavy use of this layer).
The JVM synchronizes file descriptor buffers to disk when a file is closed or when the program invokes methods like fsync(), but buffers may also be synchronized by the OS when they exceed defined thresholds. In the JVM this is, of course, unified across all supported OSes.