I recently came across this article, which provided a nice intro to memory-mapped files and how they can be shared between two processes. Here is the code for a process that reads the file:
import java.io.File;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
public class MemoryMapReader {

    /**
     * @param args
     * @throws IOException
     * @throws FileNotFoundException
     * @throws InterruptedException
     */
    public static void main(String[] args) throws FileNotFoundException, IOException, InterruptedException {
        FileChannel fc = new RandomAccessFile(new File("c:/tmp/mapped.txt"), "rw").getChannel();
        long bufferSize = 8 * 1000;
        MappedByteBuffer mem = fc.map(FileChannel.MapMode.READ_ONLY, 0, bufferSize);
        long oldSize = fc.size();
        long currentPos = 0;
        long xx = currentPos;
        long startTime = System.currentTimeMillis();
        long lastValue = -1;
        for (;;) {
            while (mem.hasRemaining()) {
                lastValue = mem.getLong();
                currentPos += 8;
            }
            if (currentPos < oldSize) {
                xx = xx + mem.position();
                mem = fc.map(FileChannel.MapMode.READ_ONLY, xx, bufferSize);
                continue;
            } else {
                long end = System.currentTimeMillis();
                long tot = end - startTime;
                System.out.println(String.format("Last Value Read %s , Time(ms) %s ", lastValue, tot));
                System.out.println("Waiting for message");
                while (true) {
                    long newSize = fc.size();
                    if (newSize > oldSize) {
                        oldSize = newSize;
                        xx = xx + mem.position();
                        mem = fc.map(FileChannel.MapMode.READ_ONLY, xx, oldSize - xx);
                        System.out.println("Got some data");
                        break;
                    }
                }
            }
        }
    }
}
I have, however, a few comments/questions regarding that approach:
If we execute the reader only on an empty file, i.e run
long bufferSize=8*1000;
MappedByteBuffer mem = fc.map(FileChannel.MapMode.READ_ONLY, 0, bufferSize);
long oldSize=fc.size();
This will allocate 8000 bytes which will now extend the file. The buffer that this returns has a limit of 8000 and a position of 0, therefore, the reader can proceed and read empty data. After this happens, the reader will stop, as currentPos == oldSize.
Suppose now the writer comes in (code is omitted as most of it is straightforward and can be referenced from the website): it uses the same buffer size, so it will write the first 8000 bytes, then allocate another 8000, extending the file. Now, if we suppose this process pauses at this point and we go back to the reader, the reader sees the new size of the file, allocates the remainder (from position 8000 until 16000) and starts reading again, reading in more garbage...
I am a bit confused about whether there is a way to synchronize those two operations. As far as I can see, any call to map might extend the file with a genuinely empty buffer (filled with zeros), or the writer might have just extended the file but not written anything into it yet...
I do a lot of work with memory-mapped files for interprocess communication. I would not recommend Holger's #1 or #2, but his #3 is what I do. But a key point is perhaps that I only ever work with a single writer - things get more complicated if you have multiple writers.
The start of the file is a header section with whatever header variables you need, most importantly a pointer to the end of the written data. The writer should always update this header variable after writing a piece of data, and the reader should never read beyond this variable. A thing called "cache coherency", which all mainstream CPUs have, guarantees that the reader will see memory writes in the same sequence they are written, so the reader will never read uninitialised memory if you follow these rules. (An exception is where the reader and writer are on different servers - cache coherency doesn't work there. Don't try to implement shared memory across different servers!)
There is no limit to how frequently you can update the end-of-file pointer - it's all in memory and there won't be any i/o involved, so you can update it each record or each message you write.
ByteBuffer has versions of 'getInt()' and 'putInt()' methods which take an absolute byte offset, so that's what I use for reading & writing the end-of-file marker...I never use the relative versions when working with memory-mapped files.
You shouldn't use the file size or yet another interprocess mechanism to communicate the end-of-data marker - there's no need and no benefit when you already have shared memory.
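To make that concrete, here is a minimal sketch of such a layout. The offsets, the region size and showing both sides in one method are my own assumptions for illustration, not a published format, and real code would want proper volatile/fenced access rather than the plain absolute puts and gets shown here:
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class HeaderPointerSketch {
    static final int END_PTR_OFFSET = 0;   // header: a long holding the end of the written data
    static final int DATA_START     = 8;   // records begin right after the header

    public static void main(String[] args) throws Exception {
        FileChannel fc = new RandomAccessFile("c:/tmp/mapped.txt", "rw").getChannel();
        MappedByteBuffer mem = fc.map(FileChannel.MapMode.READ_WRITE, 0, 8 * 1000);

        // Writer side: write the payload first, then publish the new end pointer.
        int writePos = Math.max(DATA_START, (int) mem.getLong(END_PTR_OFFSET));
        mem.putLong(writePos, 42L);                 // the record payload
        mem.putLong(END_PTR_OFFSET, writePos + 8);  // publish only AFTER the data is in place

        // Reader side: never read past the published end pointer.
        long end = mem.getLong(END_PTR_OFFSET);
        for (int pos = DATA_START; pos + 8 <= end; pos += 8) {
            System.out.println(mem.getLong(pos));
        }
    }
}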
Check out my library Mappedbus (http://github.com/caplogic/mappedbus) which enables multiple Java processes (JVMs) to write records in order to the same memory mapped file.
Here's how Mappedbus solves the synchronization problem between multiple writers:
The first eight bytes of the file make up a field called the limit. This field specifies how much data has actually been written to the file. The readers will poll the limit field (using volatile) to see whether there's a new record to be read.
When a writer wants to add a record to the file it will use the fetch-and-add instruction to atomically update the limit field.
When the limit field has increased, a reader will know there's new data to be read, but the writer which updated the limit field might not yet have written any data into the record. To avoid this problem, each record contains an initial byte which makes up the commit field.
When a writer has finished writing a record it will set the commit field (using volatile) and the reader will only start reading a record once it has seen that the commit field has been set.
(BTW, the solution has only been verified to work on Linux x86 with Oracle's JVM. It most likely won't work on all platforms).
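To make the protocol concrete, here is a rough reader-side sketch of the scheme described above. The record layout (a 1-byte commit flag followed by a 4-byte length and the payload), the offsets and the plain buffer reads are my assumptions for illustration; the real Mappedbus implementation uses proper volatile/atomic access and its own record format:
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class MappedBusReaderSketch {
    // Hypothetical record layout: [1-byte commit flag][4-byte length][payload]
    static void pollForever(FileChannel fc, long mapSize) throws Exception {
        MappedByteBuffer mem = fc.map(FileChannel.MapMode.READ_ONLY, 0, mapSize);
        int readPos = 8;                          // bytes 0..7 hold the shared "limit" field
        while (true) {
            long limit = mem.getLong(0);          // poll the limit (real code needs volatile/fenced reads)
            while (readPos < limit) {
                if (mem.get(readPos) == 0) {      // commit flag not set: the writer claimed the slot
                    break;                        // but hasn't finished writing the record yet
                }
                int length = mem.getInt(readPos + 1);
                // ... consume 'length' payload bytes starting at readPos + 5 ...
                readPos += 1 + 4 + length;
            }
        }
    }
}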
There are several ways.
Let the writer acquire an exclusive Lock on the region that has not been written yet, and release the lock when everything has been written. This is compatible with every other application running on that system, but it requires the reader to be smart enough to retry on failed reads, unless you combine it with one of the other methods.
Use another communication channel, e.g. a pipe or a socket or a file’s metadata channel to let the writer tell the reader about the finished write.
Write a special marker (as part of the protocol) at a known position in the file to signal that the data has been written, e.g.
MappedByteBuffer bb;
…
// write your data
bb.force();  // ensure completion of all writes
bb.put(specialPosition, specialMarkerValue);
bb.force();  // ensure visibility of the marker
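The reader-side counterpart of option 3 could then poll that marker before trusting the data; a minimal sketch, assuming the same fc, regionSize, specialPosition and specialMarkerValue that the writer uses:
MappedByteBuffer bb = fc.map(FileChannel.MapMode.READ_ONLY, 0, regionSize);
while (bb.get(specialPosition) != specialMarkerValue) {
    Thread.onSpinWait(); // Java 9+; or sleep briefly if latency doesn't matter
}
// marker seen: the writer has completed and forced its writes, so the data is safe to read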
This is my current thread; I use it to stress-test the CPU. I need to output the Hcount to a .txt file every hour. Currently it will print it, but only from one thread, and when another hour passes it deletes what is written in the .txt file and rewrites the new Hcount.
I'm running 3 threads.
import java.util.Random;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
public class MyThread extends Thread {
    public void run() {
        String B; // will hold the value of Cpointer
        String A; // will hold the string value of Hcount
        Path fileName = Path.of("D:/EEoutput/Out.txt");
        Random rand = new Random();
        long Hcount = 0; // counts the number of iterations in an hour
        long t = System.currentTimeMillis();
        long end = t + 3800000 * 5; // total run time (roughly five hours)
        double a1 = 0; // random holder 1
        double a2 = 0; // random holder 2
        double answer = 0; // answer placeholder
        long hour = System.currentTimeMillis() + 3600000; // when to output the iteration count to the file
        int Cpointer = 1; // tells how many hours have passed
        while (System.currentTimeMillis() < end) {
            a1 = rand.nextDouble();
            a2 = rand.nextDouble();
            answer = a1 * 23 / (a2 + a1) + a2;
            Hcount++;
            if (System.currentTimeMillis() >= hour) // checks if an hour has passed and output is due
            {
                B = String.valueOf(Cpointer);
                A = String.valueOf(Hcount);
                try {
                    Files.writeString(fileName, A);
                } catch (IOException e) {
                    e.printStackTrace();
                }
                hour = System.currentTimeMillis() + 3600000; // sets the stop to the next hour
                Cpointer++; // another hour has passed; used to tell how many iterations there are in a certain hour
                Hcount = 0;
            }
        }
    }
}
Writing into a file from multiple threads is a bad idea. I suggest you create a queue (even if it's just an in-memory queue) and have all your threads write the info they want written to the file into this queue. In other words, your queue will have multiple producers. Then have a single consumer on the queue that reads from it and writes to your file. This way you will have only one thread writing into the file.
You have two separate issues here.
Files.writeString replaces content by default. You want Files.writeString(fileName, A, StandardOpenOption.APPEND).
Writing to the same file from simultaneous threads isn't going to work (think about it: the OS cannot promise that your writes will be atomic, that should be obvious). So even if you fix it, it'll seem to work but every so often fail on you: a race condition.
The usual strategy to work around that last part is to use locks of some kind. If a single JVM is the only one doing those file writes, you can use whatever you like that Java offers: synchronized, for example, or a ReadWriteLock from the java.util.concurrent package.
But this does mean your CPU-stresser threads will be doing the waiting for the lock. You may instead want to start a separate thread and use a single BlockingQueue (for example a LinkedBlockingQueue). Your CPU stress testers send log messages to the queue, and your log-writer thread just runs a 5-line loop: endlessly fetch an item from the queue (this blocks until there is something), write it to the file, flush the stream, and loop.
This solves a bunch of problems, given that now only one thread writes.
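A minimal sketch of that pattern (the class name, file path and queue choice are placeholders, not anything from the question): the stress threads only hand strings to the queue, and exactly one thread appends them to the file.
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class LogWriter {
    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();
    private final Path file = Path.of("D:/EEoutput/Out.txt");

    // Called by the CPU-stress threads; never touches the file directly.
    public void log(String line) {
        queue.offer(line);
    }

    // Run this in exactly one thread.
    public void writerLoop() {
        while (true) {
            try {
                String line = queue.take();          // blocks until something is queued
                Files.writeString(file, line + System.lineSeparator(),
                        StandardOpenOption.CREATE, StandardOpenOption.APPEND);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();  // shut down the writer loop
                return;
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
}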
If it's multiple JVMs, that's trickier - then lock with a .lock file. You can use Files.createFile() to create logfile.lock; this will fail if the file is already there. Then wait some time (you can't ask the OS to tell you when the file is deleted, so you have to wait half a second or so and check again, forever, until the file is gone), until it succeeds, then write, then delete the lock file.
A major downside to .lock files is: if your process hard-crashes, the lock file sticks around, which you don't want. One solution to that is to write your own PID (Process ID) to it, so anybody doing a check can at least see that the process it belongs to is dead. Except this is tricky; modern OSes don't necessarily just let you check for existence, and it's all very OS-dependent (no Java libraries that automate this stuff, as far as I know). This all gets quite complicated, so let's keep it simple:
If you want to write to the same file simultaneously from different JVMs / processes on the same system, you can do that 'safely', but it is rather complicated.
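If you do end up needing the multi-JVM variant, here is a rough sketch of the .lock-file dance described above (the method name and paths are placeholders, and it still has the stale-lock problem mentioned):
import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class LockFileAppend {
    public static void appendWithLock(Path target, String line)
            throws IOException, InterruptedException {
        Path lock = target.resolveSibling(target.getFileName() + ".lock");
        while (true) {
            try {
                Files.createFile(lock);     // atomic: fails if the lock file already exists
                break;
            } catch (FileAlreadyExistsException e) {
                Thread.sleep(500);          // another process holds the lock; retry shortly
            }
        }
        try {
            Files.writeString(target, line + System.lineSeparator(),
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        } finally {
            Files.deleteIfExists(lock);     // release the lock
        }
    }
}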
I have a massive 25GB CSV file. I know that there are ~500 Million records in the file.
I want to do some basic analysis with the data. Nothing too fancy.
I don't want to use Hadoop/Pig, not yet at least.
I have written a java program to do my analysis concurrently. Here is what I am doing.
class MainClass {
    public static void main(String[] args) throws Exception {
        long start = 1;
        long increment = 10000000;
        OpenFileAndDoStuff[] a = new OpenFileAndDoStuff[50];
        for (int i = 0; i < 50; i++) {
            a[i] = new OpenFileAndDoStuff("path/to/50GB/file.csv", start, start + increment - 1);
            a[i].start();
            start += increment;
        }
        for (OpenFileAndDoStuff obj : a) {
            obj.join();
        }
        //do aggregation
    }
}

class OpenFileAndDoStuff extends Thread {
    volatile HashMap<Integer, Integer> stuff = new HashMap<>();
    BufferedReader _br;
    long _end;

    OpenFileAndDoStuff(String filename, long startline, long endline) throws IOException, FileNotFoundException {
        _br = new BufferedReader(new FileReader(filename));
        long counter = 0;
        // move the BufferedReader to the startline specified
        while (counter++ < startline)
            _br.readLine();
        this._end = endline;
    }

    void doStuff() {
        // read from the BufferedReader until end of file or until the specified endline is reached, and do stuff
    }

    public void run() {
        doStuff();
    }

    public HashMap<Integer, Integer> getStuff() {
        return stuff;
    }
}
I thought that by doing this I could open 50 BufferedReaders, all reading 10-million-line chunks in parallel, and once all of them are done doing their stuff, I'd aggregate the results.
But the problem I face is that even though I ask 50 threads to start, only two start at a time and can read from the file at a time.
Is there a way I can make all 50 of them open the file and read from it at the same time? Why am I limited to only two readers at a time?
The file is on a windows 8 machine and java is also on the same machine.
Any ideas ?
Here is a similar post: Concurrent reading of a File (java preffered)
The most important question here is what is the bottleneck in your case?
If the bottleneck is your disk IO, then there isn't much you can do at the software part. Parallelizing the computation will only make things worse, because reading the file from different parts simultaneously will degrade disk performance.
If the bottleneck is processing power and you have multiple CPU cores, then you can take advantage of starting multiple threads to work on different parts of the file. You can safely create several InputStreams or Readers to read different parts of the file in parallel (as long as you don't go over your operating system's limit for the number of open files). You could separate the work into tasks and run them in parallel.
See the referred post for an example that reads a single file in parallel with FileInputStream, which should be significantly faster than using BufferedReader according to these benchmarks: http://nadeausoftware.com/articles/2008/02/java_tip_how_read_files_quickly#FileReaderandBufferedReader
One issue I see is that when a Thread is being asked to read, for example, lines 80000000 through 90000000, you are still reading in the first 80000000 lines (and ignoring them).
Maybe try java.io.RandomAccessFile.
In order to do this, you need all of the lines to be the same number of Bytes. If you cannot adjust the structure of your file, then this would not be an option. But if you can, this should allow for greater concurrency.
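Here is a rough sketch of that idea, assuming every line is padded to a fixed RECORD_LEN bytes (a constraint your file would have to satisfy; the constant and class name are made up for illustration):
import java.io.IOException;
import java.io.RandomAccessFile;

public class FixedWidthChunkReader extends Thread {
    static final int RECORD_LEN = 100;           // assumed fixed line width, including the newline
    private final String path;
    private final long startLine, endLine;

    FixedWidthChunkReader(String path, long startLine, long endLine) {
        this.path = path;
        this.startLine = startLine;
        this.endLine = endLine;
    }

    @Override
    public void run() {
        try (RandomAccessFile raf = new RandomAccessFile(path, "r")) {
            raf.seek(startLine * RECORD_LEN);    // jump straight to the first line of this chunk
            byte[] record = new byte[RECORD_LEN];
            for (long line = startLine; line <= endLine; line++) {
                raf.readFully(record);
                // ... parse the record and update this thread's partial result ...
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}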
I'm new to Java and working on reading very large files; I need some help to understand the problem and solve it. We have some legacy code which has to be optimized to make it run properly. The file size can vary from 10 MB to 10 GB. The trouble only starts when the file size goes beyond 800 MB.
InputStream inFileReader = channelSFtp.get(path); // file reading from ssh
byte[] localbuffer = new byte[2048];
ByteArrayOutputStream bArrStream = new ByteArrayOutputStream();
int i = 0;
while (-1 != (i = inFileReader.read(localbuffer))) {
    bArrStream.write(localbuffer, 0, i);
}
byte[] data = bArrStream.toByteArray();
inFileReader.close();
bArrStream.close();
We are getting the error
java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:2271)
at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
Any help would be appreciated.
Try to use java.nio.MappedByteBuffer.
http://docs.oracle.com/javase/7/docs/api/java/nio/MappedByteBuffer.html
You can map a file's content into memory without copying it manually. Modern operating systems offer memory mapping, and Java has an API to use the feature.
If my understanding is correct, memory mapping does not load a file's entire content into memory (it is loaded and unloaded partially, as necessary), so I guess a 10 GB file won't eat up your memory.
Even though you could increase the JVM memory limit, it is needless, and allocating huge amounts of memory like 10 GB just to process a file sounds like overkill and is resource intensive.
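A minimal sketch of reading a large file through mapped windows (the path and the 256 MB window size are placeholders; a single mapping is limited to Integer.MAX_VALUE bytes, which is why the file is mapped in several windows):
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class MappedScanSketch {
    public static void main(String[] args) throws Exception {
        try (FileChannel ch = FileChannel.open(Path.of("/path/to/huge.file"),
                                               StandardOpenOption.READ)) {
            long windowSize = 256L * 1024 * 1024;              // map 256 MB at a time
            for (long pos = 0; pos < ch.size(); pos += windowSize) {
                long len = Math.min(windowSize, ch.size() - pos);
                MappedByteBuffer window = ch.map(FileChannel.MapMode.READ_ONLY, pos, len);
                while (window.hasRemaining()) {
                    byte b = window.get();                     // process the bytes of this window
                }
            }
        }
    }
}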
Currently you are using a ByteArrayOutputStream, which keeps an internal buffer to hold the data. This line in your code keeps appending the last-read 2 KB file chunk to the end of this buffer:
bArrStream.write(localbuffer, 0, i);
bArrStream keeps growing and eventually you run out of memory.
Instead you should reorganize your algorithm and process the file in a streaming way:
InputStream inFileReader = channelSFtp.get(path); // file reading from ssh
byte[] localbuffer = new byte[2048];
int i = 0;
while (-1 != (i = inFileReader.read(localbuffer))) {
    // Deal with the current 2 KB file chunk here
}
inFileReader.close();
The Java virtual machine (JVM) runs with a fixed upper memory limit, which you can modify thus:
java -Xmx1024m ....
e.g. the above option (-Xmx...) sets the limit to 1024 megabytes. You can amend it as necessary (within the limits of your machine, OS, etc.). Note that this is different from traditional applications, which would allocate more and more memory from the OS on demand.
However a better solution is to rework your application such that you don't need to load the whole file into memory at one go. That way you don't have to tune your JVM, and you don't impose a huge memory footprint.
You can't read a 10 GB text file into memory. You have to read X MB first, do something with it, and then read the next X MB.
The problem is inherent in what you're doing. Reading entire files into memory is always and everywhere a bad idea. You're really not going to be able to read a 10GB file into memory with current technology unless you have some pretty startling hardware. Find a way to process them line by line, record by record, chunk by chunk, ...
Is it mandatory to get the entire byte array of the output stream?
byte[] data = bArrStream.toByteArray();
The best approach is to read line by line and write line by line. You can use a BufferedReader or Scanner to read large files, as shown below.
import java.io.*;
import java.util.*;

public class FileReadExample {
    public static void main(String args[]) throws FileNotFoundException {
        File fileObj = new File(args[0]);

        long t1 = System.currentTimeMillis();
        try (
            // BufferedReader object for reading the file
            BufferedReader br = new BufferedReader(new FileReader(fileObj));) {
            // Reading each line of the file using the BufferedReader class
            String str;
            while ((str = br.readLine()) != null) {
                System.out.println(str);
            }
        } catch (Exception err) {
            err.printStackTrace();
        }
        long t2 = System.currentTimeMillis();
        System.out.println("Time taken for BufferedReader:" + (t2 - t1));

        t1 = System.currentTimeMillis();
        try (
            // Scanner object for reading the file
            Scanner scnr = new Scanner(fileObj);) {
            // Reading each line of the file using the Scanner class
            while (scnr.hasNextLine()) {
                String strLine = scnr.nextLine();
                // print data on console
                System.out.println(strLine);
            }
        }
        t2 = System.currentTimeMillis();
        System.out.println("Time taken for Scanner:" + (t2 - t1));
    }
}
You can replace System.out with your ByteArrayOutputStream in the above example.
Please have a look at below article for more details: Read Large File
Have a look at related SE question:
Scanner vs. BufferedReader
ByteArrayOutputStream writes to an in-memory buffer. If this is really how you want it to work, then you have to size the JVM heap according to the maximum possible size of the input. Also, if possible, you can check the input size before you even start processing, to save time and resources.
The alternative approach is a streaming solution, where the amount of memory used at runtime is known (maybe configurable, but still known before the program starts). Whether that's feasible or not depends entirely on your application's domain (because you can no longer use an in-memory buffer), and maybe on the architecture of the rest of your code if you can't or don't want to change it.
Try using a larger read buffer, maybe 10 MB, and then check.
Read the file iteratively, line by line. This would significantly reduce memory consumption. Alternatively you may use
FileUtils.lineIterator(theFile, "UTF-8");
provided by Apache Commons IO.
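A minimal usage sketch, assuming Commons IO is on the classpath and theFile is your input file:
import java.io.File;
import java.io.IOException;
import org.apache.commons.io.FileUtils;
import org.apache.commons.io.LineIterator;

public class LineIteratorSketch {
    static void process(File theFile) throws IOException {
        LineIterator it = FileUtils.lineIterator(theFile, "UTF-8");
        try {
            while (it.hasNext()) {
                String line = it.nextLine();
                // handle one line at a time; the rest of the file stays on disk
            }
        } finally {
            it.close();    // release the underlying reader
        }
    }
}
The plain-JDK alternative below, using FileInputStream and Scanner, gives the same streaming behaviour: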
FileInputStream inputStream = null;
Scanner sc = null;
try {
    inputStream = new FileInputStream(path);
    sc = new Scanner(inputStream, "UTF-8");
    while (sc.hasNextLine()) {
        String line = sc.nextLine();
        // System.out.println(line);
    }
    // note that Scanner suppresses exceptions
    if (sc.ioException() != null) {
        throw sc.ioException();
    }
} finally {
    if (inputStream != null) {
        inputStream.close();
    }
    if (sc != null) {
        sc.close();
    }
}
Run Java with the command-line option -Xmx, which sets the maximum size of the heap.
See here for details.
Assuming that you are reading a large text file whose data is laid out line by line, use a line-by-line reading approach. As far as I know, you can read up to 6 GB, maybe more.
...
// Open the file
FileInputStream fstream = new FileInputStream("textfile.txt");
BufferedReader br = new BufferedReader(new InputStreamReader(fstream));
String strLine;

// Read the file line by line
while ((strLine = br.readLine()) != null) {
    // Print the content on the console
    System.out.println(strLine);
}

// Close the input stream
br.close();
Reference for the code fragment
Short answer,
Without doing anything, you can push the current limit by a factor of 1.5. It means that if you are able to process 800 MB, you can process 1200 MB. It also means that if by some trick with java -Xmx ... you can move to a point where your current code can process 7 GB, your problem is solved, because the 1.5 factor will take you to 10.5 GB, assuming you have that space available on your system and that the JVM can get it.
Long answer:
The error is pretty self-descriptive. You hit the practical memory limit of your configuration. There is a lot of speculation about the limit you can have with the JVM; I do not know enough about that, since I cannot find any official information. However, you will somehow be limited by constraints like the available swap, kernel address-space usage, memory fragmentation, etc.
What is happening now is that ByteArrayOutputStream objects are created with a default buffer of size 32 if you do not supply any size (which is your case). Whenever you call the write method on the object, internal machinery is started. The OpenJDK implementation release 7u40-b43, which seems to match the output of your error perfectly, uses an internal method ensureCapacity to check that the buffer has enough room to put the bytes you want to write. If there is not enough room, another internal method grow is called to grow the size of the buffer. The method grow determines the appropriate size and calls the method copyOf from the class Arrays to do the job.
The appropriate size of the buffer is the maximum between the current size and the size required to hold all the content (the existing content plus the new content to be written).
The method copyOf from the class Arrays (follow the link) allocates the space for the new buffer, copies the content of the old buffer to the new one and returns it to grow.
Your problem occurs at the allocation of the space for the new buffer. After some writes, you get to a point where the available memory is exhausted: java.lang.OutOfMemoryError: Java heap space.
If we look into the details, you are reading by chunks of 2048 bytes. So:
your first write grows the size of the buffer from 32 to 2048
your second call will double it to 2*2048
your third call will take it to 2^2*2048; you have time to write two more times before the need to allocate again
then 2^3*2048; you will have time for 4 more writes before allocating again
at some point, your buffer will be of size 2^18*2048, which is 2^19*1024 or 2^9*2^20 (512 MB)
then 2^19*2048, which is 1024 MB or 1 GB
Something that is unclear in your description is that you can somehow read up to 800 MB, but cannot go beyond. You have to explain that to me.
I expect your limit to be exactly a power of 2 (or close, if we use power-of-10 units somewhere). In that regard, I expect you to start having trouble immediately above one of these: 256 MB, 512 MB, 1 GB, 2 GB, etc.
When you hit that limit, it does not mean that you are out of memory; it simply means that it is not possible to allocate another buffer of twice the size of the buffer you already have. This observation opens room for improvement in your work: find the maximum size of buffer that you can allocate and reserve it upfront by calling the appropriate constructor:
ByteArrayOutputStream bArrStream = new ByteArrayOutputStream(myMaxSize);
This has the advantage of reducing the background memory-allocation overhead that happens under the hood to keep you happy. By doing this, you will be able to go to 1.5 times the limit you have right now. This is simply because the last time the buffer was increased, it went from half the current size to the current size, and at some point you had both the current buffer and the old one together in memory. But you will not be able to go beyond 3 times the limit you have now; the explanation is exactly the same.
That being said, I do not have any magic suggestion to solve the problem apart from processing your data in chunks of a given size, one chunk at a time. Another good approach would be to use the suggestion of Takahiko Kawasaki and use a MappedByteBuffer. Keep in mind that in any case you will need at least 10 GB of physical memory or swap to be able to load a 10 GB file.
After thinking about it, I decided to put a second answer. I considered the advantages and disadvantages of putting this second answer, and the advantages are worth going for it. So here it is.
Most of the suggested considerations forget a given fact: there is a built-in limit on the size of arrays (including ByteArrayOutputStream) that you can have in Java, and that limit is dictated by the biggest int value, which is 2^31 - 1 (a little less than 2 giga). This means that you can only read a maximum of 2 GB (minus 1 byte) and put it in a single ByteArrayOutputStream. The limit might actually be smaller for array size if the VM wants more control.
My suggestion is to use an ArrayList of byte[] instead of a single byte[] holding the full content of the file, and also to remove the unnecessary step of putting the data into a ByteArrayOutputStream before putting it into a final data array. Here is an example based on your original code:
InputStream inFileReader = channelSFtp.get(path); // file reading from ssh

// good habits are good, define a buffer size
final int BUF_SIZE = (int) Math.pow(2, 30); // 1 GB, let's not go close to the limit

List<byte[]> data = new ArrayList<>();      // holds the file content as a list of buffers
byte[] localbuffer = new byte[BUF_SIZE];

int i = 0;
while (-1 != (i = inFileReader.read(localbuffer))) {
    if (i < BUF_SIZE) {
        data.add(Arrays.copyOf(localbuffer, i));
        // No need to reallocate the reading buffer, we copied the data
    } else {
        data.add(localbuffer);
        // reallocate the reading buffer
        localbuffer = new byte[BUF_SIZE];
    }
}
inFileReader.close();

// Process your data, keeping in mind that you have a list of buffers,
// so you need to loop over the list.
Simply running your program should work fine on a 64-bit system with enough physical memory or swap. Now, if you want to speed it up and help the VM size the heap correctly at the beginning, run with the options -Xms and -Xmx. For example, if you want a heap of 12 GB to be able to handle a 10 GB file, use java -Xms12288m -Xmx12288m YourApp.
Currently I have the below code for reading an InputStream. I am storing the whole file into a StringBuilder variable and processing this string afterwards.
public static String getContentFromInputStream(InputStream inputStream)
// public static String getContentFromInputStream(InputStream inputStream,
//     int maxLineSize, int maxFileSize)
{
    StringBuilder stringBuilder = new StringBuilder();
    BufferedReader bufferedReader = new BufferedReader(new InputStreamReader(inputStream));
    String lineSeparator = System.getProperty("line.separator");
    String fileLine;
    boolean firstLine = true;
    try {
        // Expect some function which checks for the line size limit,
        // e.g. reading character by character into a char array and checking for
        // the line size in a loop until a line feed is encountered.
        // If the max line size limit is exceeded then throw an exception.
        // If a line feed is encountered, append the char array to a StringBuilder.
        // After appending, check the size of the StringBuilder.
        // If the file size exceeds the max file limit then throw an exception.
        fileLine = bufferedReader.readLine();
        while (fileLine != null) {
            if (!firstLine) stringBuilder.append(lineSeparator);
            stringBuilder.append(fileLine);
            fileLine = bufferedReader.readLine();
            firstLine = false;
        }
    } catch (IOException e) {
        // TODO: throw or handle the exception
    }
    // TODO: close the stream
    return stringBuilder.toString();
}
The code went for a review with the Security team and the following comments were received:
BufferedReader.readLine is susceptible to DOS (Denial of Service) attacks (a line of infinite length, or a huge file containing no line feed/carriage return)
Resource exhaustion for the StringBuilder variable (cases where a file contains data greater than the available memory)
Below are the solutions I could think of:
Create an alternate implementation of the readLine method (readLine(int limit)), which checks the number of bytes read and, if it exceeds the specified limit, throws a custom exception.
Process the file line by line without loading the file in entirety. (pure non-Java solution :) )
Please suggest if there are any existing libraries which implement the above solutions.
Also suggest any alternate solutions which offer more robustness or are more convenient to implement than the proposed ones. Though performance is also a major requirement, security comes first.
Updated Answer
You want to avoid all sorts of DOS attacks (on lines, on the size of the file, etc). But at the end of the function, you're trying to convert the entire file into one single String! Assume that you limit lines to 8 KB, but what happens if somebody sends you a file made up of millions of 8 KB lines? The line-reading part will pass, but when you finally combine everything into a single string, the String will choke all available memory.
So since you're finally converting everything into one single String, limiting the line size doesn't matter, nor is it safe. You have to limit the entire size of the file.
Secondly, what you're basically trying to do is read data in chunks. So you're using BufferedReader and reading it line by line. But what you're trying to do, and what you really want at the end, is some way of reading the file piece by piece. Instead of reading one line at a time, why not read 2 KB at a time?
BufferedReader - by its name - has a buffer inside it. You can configure that buffer. Let's say you create a BufferedReader with buffer size of 2 KB:
BufferedReader reader = new BufferedReader(..., 2048);
Now if the InputStream that you pass to BufferedReader has 100 KB of data, BufferedReader will automatically read it 2 KB at a time. So it will read the stream 50 times, 2 KB each (50 x 2 KB = 100 KB). Similarly, if you create a BufferedReader with a 10 KB buffer size, it will read the input 10 times (10 x 10 KB = 100 KB).
BufferedReader already does the work of reading your file chunk-by-chunk. So you don't want to add an extra layer of line-by-line above it. Just focus on the end result - if your file at the end is too big (> available RAM) - how are you going to convert it into a String at the end?
One better way is to just pass things around as a CharSequence. That's what Android does. Throughout the Android APIs, you will see that they return CharSequence everywhere. Since StringBuilder also implements CharSequence, Android will internally use either a String, a StringBuilder or some other optimized string class based on the size/nature of the input. So you could rather directly return the StringBuilder object itself once you've read everything, rather than converting it to a String. This would be safer with large data. StringBuilder also maintains the same concept of buffers inside it, and it will internally allocate multiple buffers for large strings, rather than one long string.
So overall:
Limit the overall file size, since you're going to deal with the entire content at some point. Forget about limiting or splitting lines.
Read in chunks
Using Apache Commons IO, here is how you would read data from a BoundedInputStream into a StringBuilder, splitting by 2 KB blocks instead of lines:
// import org.apache.commons.io.output.StringBuilderWriter;
// import org.apache.commons.io.input.BoundedInputStream;
// import org.apache.commons.io.IOUtils;
BoundedInputStream boundedInput = new BoundedInputStream(originalInput, <max-file-size>);
BufferedReader reader = new BufferedReader(new InputStreamReader(boundedInput), 2048);
StringBuilder output = new StringBuilder();
StringBuilderWriter writer = new StringBuilderWriter(output);
IOUtils.copy(reader, writer); // copies data from "reader" => "writer"
return output;
Original Answer
Use BoundedInputStream from the Apache Commons IO library. Your work becomes much easier.
The following code will do what you want:
public static String getContentFromInputStream(InputStream inputStream) {
inputStream = new BoundedInputStream(inputStream, <number-of-bytes>);
// The rest of the code stays the same
You just simply wrap your InputStream with a BoundedInputStream and you specify a maximum size. BoundedInputStream will take care of limiting reads up to that maximum size.
Or you can do this when you're creating the reader:
BufferedReader bufferedReader = new BufferedReader(
new InputStreamReader(
new BoundedInputStream(inputStream, <no-of-bytes>)
)
);
Basically what we're doing here is, we're limiting the read size at the InputStream layer itself, rather than doing that when reading lines. So you end up with a reusable component like BoundedInputStream which limits reading at the InputStream layer, and you can use that wherever you want.
Edit: Added footnote
Edit 2: Added updated answer based on comments
There are basically 4 ways to do file processing:
Stream-Based Processing (the java.io.InputStream model): Optionally put a bufferedReader around the stream, iterate & read the next available text from the stream (if no text is available, block until some becomes available), process each piece of text independently as it's read (catering for widely-varying sizes of text pieces)
Chunk-Based Non-Blocking Processing (the java.nio.channels.Channel model): Create a set of fixed-sized buffers (representing the "chunks" to be processed), read into each of the buffers in turn without blocking (nio API delegates to native IO, using fast O/S-level threads), your main processing thread picks each buffer in turn once it is filled and processes the fixed-size chunk, as other buffers continue to be asynchronously loaded.
Part File Processing (including line-by-line processing) (can leverage (1) or (2) to isolate or build up each "part"): break your file format down into semantically meaningful sub-parts (if possible! breaking into lines could be possible!), iterate through stream pieces or chunks and build-up content in memory until the next part is completely built, process each part as soon as it's built.
Entire File Processing (the java.nio.file.Files model): Read the entire file into memory in one operation, process the complete contents
Which one should you use?
It depends - on your file contents and the type of processing you require.
From a resource-use efficiency perspective (best to worst) is: 1,2,3,4.
From a processing speed & efficiency perspective (best to worst) is: 2,1,3,4.
From an ease of programming perspective (best to worst): 4,3,1,2.
However, some types of processing might require more than the smallest piece of text (ruling out 1, and maybe 2) and some file formats may not have internal parts (ruling out 3).
You're doing 4. I suggest you shift to 3 (or lower), if you can.
Under 4, there's only one way to avoid DOS: limit the size before it's read into memory (or, for that matter, copied to your file system). It's too late once it's read in. If this is not possible, then try 3, 2 or 1.
Limiting File Size
Often the file is uploaded via a HTML form.
If uploading using the Servlet @MultipartConfig annotation and request.getPart().getInputStream(), you have control over how much data you read from the stream. Also, request.getPart().getSize() returns the file size in advance, and if it's small enough, you can do request.getPart().write(path) to write the file to disk.
If uploading using JSF, then JSF 2.2 (very new) has the standard html component <h:inputFile> (javax.faces.component.html.InputFile), which has an attribute for maxLength; pre-JSF 2.2 implementations have similar custom components (e.g. Tomahawk has <t:InputFileUpload> with maxLength attribute; PrimeFaces has <p:FileUpload> with sizeLimit attribute).
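For the servlet case, a rough sketch of enforcing a limit before the stream is ever read (the 10 MB limit, the part name and the target path are placeholders, not anything the question specifies):
import java.io.IOException;
import javax.servlet.ServletException;
import javax.servlet.annotation.MultipartConfig;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import javax.servlet.http.Part;

@MultipartConfig(maxFileSize = 10 * 1024 * 1024)   // the container rejects parts over 10 MB
public class UploadServlet extends HttpServlet {
    @Override
    protected void doPost(HttpServletRequest req, HttpServletResponse resp)
            throws ServletException, IOException {
        Part part = req.getPart("file");
        if (part.getSize() > 10 * 1024 * 1024) {    // belt-and-braces check before reading
            resp.sendError(HttpServletResponse.SC_REQUEST_ENTITY_TOO_LARGE);
            return;
        }
        part.write("upload.tmp");                   // stream straight to disk, never into a String
    }
}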
Alternatives to Read Entire File
Your code which uses InputStream, StringBuilder, etc, is an efficient way to read the entire file, but is not necessarily the simplest way (least lines of code).
Junior/average developers could get the misapprehension that you're doing efficient stream-based processing, when you're processing the entire file - so include appropriate comments.
If you want less code, you could try one of the following:
List<String> stringList = java.nio.file.Files.readAllLines(path, charset);
or
byte[] byteContents = java.nio.file.Files.readAllBytes(path);
But they require care, or they could be inefficient in resource usage. If you use readAllLines and then concatenate the List elements into a single String, then you would consume double the memory (for the List elements + the concatenated String). Similarly, if you use readAllBytes, followed by encoding to String (new String(byteContents, charset)), then again, you're using "double" the memory. So best to process directly against List<String> or byte[], unless you limit your files to a small enough size.
Instead of readLine, use read, which reads a given amount of chars.
In each loop, check how much data has been read; if it's more than a certain amount - more than the maximum expected input - stop, return an error and log it.
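A rough sketch of that approach (the 1 MB cap and the class name are placeholders):
import java.io.IOException;
import java.io.Reader;

public class BoundedReadSketch {
    static final int MAX_SIZE = 1 << 20;                 // assumed 1 MB cap on total input

    static StringBuilder readBounded(Reader reader) throws IOException {
        StringBuilder out = new StringBuilder();
        char[] chunk = new char[2048];
        int n;
        while ((n = reader.read(chunk)) != -1) {
            if (out.length() + n > MAX_SIZE) {           // would exceed the agreed maximum
                throw new IOException("Input exceeds " + MAX_SIZE + " characters");
            }
            out.append(chunk, 0, n);
        }
        return out;
    }
}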
I faced a similar issue when copying a huge binary file (which generally does not contain a newline character). Doing a readLine() leads to reading the entire binary file into one single string, causing an OutOfMemoryError on heap space.
Here is a simple JDK alternative:
public static void main(String[] args) throws Exception
{
    byte[] array = new byte[1024];
    FileInputStream fis = new FileInputStream(new File("<Path-to-input-file>"));
    FileOutputStream fos = new FileOutputStream(new File("<Path-to-output-file>"));
    int length = 0;
    while ((length = fis.read(array)) != -1)
    {
        fos.write(array, 0, length);
    }
    fis.close();
    fos.close();
}
Things to note:
The above example copies the file using a buffer of 1 KB. However, if you are doing this copy over a network, you may want to tweak the buffer size.
If you would like to use FileChannel or libraries like Commons IO, just make sure that the implementation boils down to something like the above.
This worked for me without any problems.
char charArray[] = new char[ MAX_BUFFER_SIZE ];
int i = 0;
int c = 0;
while((c = br.read()) != -1 && i < MAX_BUFFER_SIZE) {
char character = (char) c;
charArray[i++] = character;
}
return Arrays.copyOfRange(charArray,0,i);
I cannot think of a solution other than Apache Commons IO FileUtils.
It's pretty simple with the FileUtils class, as the so-called DOS attack won't come directly from the top layer.
Reading and writing a file is very simple, as you can do it with just one line of code, like
String content = FileUtils.readFileToString(new File(filePath));
You can explore more about this.
There is a class EntityUtils under Apache HttpCore. Use the toString() method of this class to get a String from the response content.
These are recommendations from a Fortify scan. You can adapt the InputStream to other sources, such as an HTTP request InputStream.
InputStream zipInput = zipFile.getInputStream(zipEntry);
Reader zipReader = new InputStreamReader(zipInput);
BufferedReader br = new BufferedReader(zipReader);

StringBuffer sb = new StringBuffer();
int intC;
while ((intC = br.read()) != -1) {
    char c = (char) intC;
    if (c == '\n') {
        break;
    }
    if (sb.length() >= MAX_STR_LEN) {
        throw new Exception("Input too long");
    }
    sb.append(c);
}
String line = sb.toString();
I am really in trouble: I want to read HUGE files of several GB using FileChannels and MappedByteBuffers - all the documentation I found implies it's rather simple to map a file using the FileChannel.map() method.
Of course there is a limit at 2 GB, as all the Buffer methods use int for position, limit and capacity - but what about the system-implied limits below that?
In reality, I get lots of problems regarding OutOfMemoryErrors! And no documentation at all that really defines the limits!
So - how can I map a file that fits into the int-limit safely into one or several MappedByteBuffers without just getting exceptions?
Can I ask the system which portion of a file I can safely map before I try FileChannel.map()? How?
Why is there so little documentation about this feature??
I can offer some working code. Whether this solves your problem or not is difficult to say. This hunts through a file for a pattern recognised by the Hunter.
See the excellent article Java tip: How to read files quickly for the original research (not mine).
// 4k buffer size.
static final int SIZE = 4 * 1024;
static byte[] buffer = new byte[SIZE];

// Fastest because a FileInputStream has an associated channel.
private static void ScanDataFile(Hunter p, FileInputStream f) throws FileNotFoundException, IOException {
    // Use a mapped and buffered stream for best speed.
    // See: http://nadeausoftware.com/articles/2008/02/java_tip_how_read_files_quickly
    FileChannel ch = f.getChannel();
    long red = 0L;
    do {
        long read = Math.min(Integer.MAX_VALUE, ch.size() - red);
        MappedByteBuffer mb = ch.map(FileChannel.MapMode.READ_ONLY, red, read);
        int nGet;
        while (mb.hasRemaining() && p.ok()) {
            nGet = Math.min(mb.remaining(), SIZE);
            mb.get(buffer, 0, nGet);
            for (int i = 0; i < nGet && p.ok(); i++) {
                p.check(buffer[i]);
            }
        }
        red += read;
    } while (red < ch.size() && p.ok());
    // Finish off.
    p.close();
    ch.close();
    f.close();
}
What I use is a List<ByteBuffer> where each ByteBuffer maps a block of the file of 16 MB to 1 GB. I use powers of 2 to simplify the logic. I have used this to map files up to 8 TB.
A key limitation of memory mapped files is that you are limited by your virtual memory. If you have a 32-bit JVM you won't be able to map in very much.
I wouldn't keep creating new memory mappings for a file, because these are never cleaned up. You can create lots of them, but there appears to be a limit of about 32K of them on some systems (no matter how small they are).
The main reason I find memory-mapped files useful is that they don't need to be flushed (if you can assume the OS won't die). This allows you to write data in a low-latency way, without worrying about losing too much data if the application dies, or losing too much performance by having to write() or flush().
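A minimal sketch of that scheme (the chunk size and file path are placeholders): map the file in fixed power-of-two blocks, keep the buffers in a list, and index the right block for any absolute position.
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.util.ArrayList;
import java.util.List;

public class ChunkedMapSketch {
    static final long CHUNK = 256L * 1024 * 1024;            // 256 MB per mapping, a power of two

    public static void main(String[] args) throws Exception {
        try (RandomAccessFile raf = new RandomAccessFile("/path/to/huge.file", "r");
             FileChannel ch = raf.getChannel()) {
            List<MappedByteBuffer> chunks = new ArrayList<>();
            for (long pos = 0; pos < ch.size(); pos += CHUNK) {
                long len = Math.min(CHUNK, ch.size() - pos);
                chunks.add(ch.map(FileChannel.MapMode.READ_ONLY, pos, len));
            }
            // Random access: the byte at absolute offset p lives in chunk (p / CHUNK) at (p % CHUNK).
            long p = 1234567L;                               // example absolute offset
            if (p < ch.size()) {
                byte b = chunks.get((int) (p / CHUNK)).get((int) (p % CHUNK));
                System.out.println(b);
            }
        }
    }
}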
You don't use the FileChannel API to write the entire file at once. Instead, you send the file in parts. See example code in Martin Thompson's post comparing performance of Java IO techniques: Java Sequential IO Performance
In addition, there is not much documentation because you are making a platform-dependent call. from the map() JavaDoc:
Many of the details of memory-mapped files are inherently dependent
upon the underlying operating system and are therefore unspecified.
The bigger the file, the less you want it all in memory at once. Devise a way to process the file a buffer at a time, a line at a time, etc.
MappedByteBuffers are especially problematic, as there is no defined release of the mapped memory, so using more than one at a time is essentially bound to fail.