I have a Java application that pushes data rows to a Spark Streaming MemoryStream (org.apache.spark.sql.execution.streaming.MemoryStream) and writes it out to a file sink.
Will the rows that have already been written out become eligible for garbage collection? In other words, if I keep pushing rows to that MemoryStream while continuously writing them out to the file sink, will I eventually run out of memory?
Now the same question, but assume I am doing window transformations before the write operation, using ten-minute windows. Is there a way to automatically drop rows based on the value of their timestamp column (older than ten minutes)?
Let's say I'm designing a REST service with Spring, and I need to have a method that accepts a file, and returns some kind of ResponseDto. The application server has its POST request size limited to 100MB. Here's the hypothetical spring controller method implementation:
public ResponseEntity<ResponseDto> uploadFile(@RequestBody MultipartFile file) {
return ResponseEntity.ok(someService.process(file));
}
Let's assume that my server has 64GB of RAM. How do I ensure that I don't get an out of memory error if in a short period (short enough for process() method to still be running for every file uploaded), 1000 users decide to upload a 100MB file (or just 1 user concurrently uploads 1000 files)?
EDIT: To clarify, I want to make sure my application doesn't crash, but instead just stops accepting/delays new requests.
You can monitor the memory usage and see when you have to stop accepting requests or cancel existing requests.
https://docs.oracle.com/javase/6/docs/api/java/lang/management/MemoryMXBean.html
https://docs.oracle.com/javase/6/docs/api/java/lang/management/MemoryPoolMXBean.html
Also you can use this
Runtime runtime = Runtime.getRuntime();
System.out.println("Free memory: " + runtime.freeMemory() + " bytes.");
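As a rough sketch of the monitoring idea above, the MemoryMXBean can report how full the heap is, and a request handler could check that before accepting an upload. The class name, method names, and the threshold are made up for illustration. Keep in mind that "used" heap includes garbage that has not been collected yet, so this is only a coarse signal.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;

// Hypothetical helper, not part of any framework.
class HeapMonitor {

    /** Returns the fraction of the heap currently in use (0.0 to 1.0). */
    static double heapUsedFraction() {
        MemoryMXBean memoryBean = ManagementFactory.getMemoryMXBean();
        MemoryUsage heap = memoryBean.getHeapMemoryUsage();
        // getMax() returns -1 when no maximum is defined; fall back to committed memory.
        long limit = heap.getMax() > 0 ? heap.getMax() : heap.getCommitted();
        return (double) heap.getUsed() / limit;
    }

    /** True when heap usage is above the given fraction, e.g. 0.8 for 80%. */
    static boolean shouldRejectNewRequests(double threshold) {
        return heapUsedFraction() > threshold;
    }
}
```

A controller could call `HeapMonitor.shouldRejectNewRequests(0.8)` and answer with HTTP 503 when it returns true. The 0.8 value is an assumption to tune, not a recommendation.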
Consider creating a database table that holds the uploads in progress:
CREATE TABLE PROC_FILE_UPLOAD
(
ID NUMBER(19,0) NOT NULL
, USER_ID NUMBER(19,0) NOT NULL
, UPLOAD_STATUS_ID NUMBER(19,0) NOT NULL
, FILE_SIZE NUMBER(19,0)
, CONSTRAINT PROC_FILE_UPLOAD_PK PRIMARY KEY (ID) ENABLE
);
COMMENT ON COLUMN PROC_FILE_UPLOAD.FILE_SIZE IS 'In Bytes';
USER_ID is a FK to your users table and UPLOAD_STATUS_ID is a FK to a data dictionary with the different statuses for your application (IN_PROGRESS, DONE, ERROR, UNKNOWN, whatever suits you).
Before your service uploads a file, it must check if the current user is already uploading a file and if the maximum number of concurrent uploads has been reached. If so, reject the upload, else update PROC_FILE_UPLOAD information with the new upload and proceed.
Even though you could hold many files in memory with 64 GB of RAM, you don't want to waste too many resources on it. There are memory-efficient ways to read files. For example, a BufferedReader reads through a small buffer, so it doesn't store the whole file in memory.
The documentation does a really good job explaining it:
Reads text from a character-input stream, buffering characters so as to provide for the efficient reading of characters, arrays, and lines.
The buffer size may be specified, or the default size may be used. The default is large enough for most purposes.
In general, each read request made of a Reader causes a corresponding read request to be made of the underlying character or byte stream. It is therefore advisable to wrap a BufferedReader around any Reader whose read() operations may be costly, such as FileReaders and InputStreamReaders. For example,
BufferedReader in
= new BufferedReader(new FileReader("foo.in"));
will buffer the input from the specified file. Without buffering, each invocation of read() or readLine() could cause bytes to be read from the file, converted into characters, and then returned, which can be very inefficient.
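To illustrate the point about not holding the whole file in memory, here is a minimal sketch that processes input line by line through a BufferedReader. The class and method names are invented for this example, and counting non-empty lines only stands in for whatever per-line processing you actually need.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;

// Hypothetical example class.
class LineCounter {

    /** Counts non-empty lines while keeping only the current line in memory. */
    static long countNonEmptyLines(Reader source) throws IOException {
        long count = 0;
        // try-with-resources closes the reader (and the wrapped source) when done.
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (!line.isEmpty()) {
                    count++;
                }
            }
        }
        return count;
    }
}
```

For an uploaded file, the `Reader` could be an `InputStreamReader` wrapped around the upload's input stream, so memory use stays around the buffer size plus the longest line, regardless of file size.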
Here is another SO question that you may find useful:
Java read large text file with 70 million lines of text
If you need to calculate the checksum of the file like you said in the comments you could use this link.
You can either limit the number of concurrent requests or use streaming to avoid keeping the whole file in RAM.
Limiting requests
You can limit the number of concurrent incoming requests in the web server. The default web server for Spring Boot is Tomcat, which is configurable in application.properties with server.tomcat.max-connections. If you have 64 GB of RAM available after the app is fully loaded and your max file size is 100 MB, you should be able to accept roughly 640 concurrent requests (64 GB / 100 MB), assuming each request holds at most one full file in memory. After that limit is reached, you can keep incoming connections in a queue before accepting them, configurable with server.tomcat.accept-count. These properties are described here: https://tomcat.apache.org/tomcat-9.0-doc/config/http.html
(In theory you can do better. If you know the upload size in advance, you can use a counting semaphore to reserve space for a file when it's time to start processing it, and delay starting any upload until there is room to reserve space for it.)
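The counting-semaphore idea could look roughly like the sketch below. Everything here (class name, one permit per megabyte, the timeout) is an assumption made for illustration, not something Spring or Tomcat provides.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

// Hypothetical memory budget: one permit represents one megabyte of upload data.
class UploadMemoryBudget {
    private final Semaphore permits;

    UploadMemoryBudget(int totalMegabytes) {
        // Fair mode so that waiting uploads are served in arrival order.
        this.permits = new Semaphore(totalMegabytes, true);
    }

    /** Waits up to the timeout for room; returns false if none became available. */
    boolean tryReserve(int megabytes, long timeout, TimeUnit unit) throws InterruptedException {
        return permits.tryAcquire(megabytes, timeout, unit);
    }

    /** Gives the reserved space back once processing has finished. */
    void release(int megabytes) {
        permits.release(megabytes);
    }

    int availableMegabytes() {
        return permits.availablePermits();
    }
}
```

The intended usage is to reserve the declared upload size before processing starts, release it in a `finally` block, and return an error such as HTTP 503 when `tryReserve` returns false. That way the application delays or refuses new uploads instead of running out of heap.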
Streaming
If you are able to implement streaming instead, you can handle many more connections at the same time: you never keep the whole file in RAM for any one connection, but process the upload a chunk at a time, e.g. as you write it out to a file or database. It looks like the Apache Commons library has a component to help you build an API which streams in the request:
https://www.initialspark.co.uk/blog/articles/java-spring-file-streaming.html
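Independent of which library hands you the request stream, the core of the streaming approach is a fixed-size buffer loop. The sketch below copies an upload to its destination while computing a checksum on the fly (a checksum is used as the example processing step because it came up elsewhere in this thread). The class name, buffer size, and the choice of SHA-256 are assumptions.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper: memory use is bounded by the buffer, not by the file size.
class StreamingUpload {

    /** Copies in to out in 8 KB chunks and returns the SHA-256 digest of the data. */
    static byte[] copyAndDigest(InputStream in, OutputStream out)
            throws IOException, NoSuchAlgorithmException {
        MessageDigest digest = MessageDigest.getInstance("SHA-256");
        byte[] buffer = new byte[8192];
        int read;
        while ((read = in.read(buffer)) != -1) {
            digest.update(buffer, 0, read); // process the chunk
            out.write(buffer, 0, read);     // then pass it on
        }
        out.flush();
        return digest.digest();
    }
}
```

With this shape, 1000 concurrent uploads need roughly 1000 small buffers rather than 1000 full files in memory.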
I am trying to use NiFi to process large CSV files (potentially billions of records each) using HDF 1.2. I've implemented my flow, and everything is working fine for small files.
The problem is that if I try to push the file size to 100MB (1M records) I get a java.lang.OutOfMemoryError: GC overhead limit exceeded from the SplitText processor responsible for splitting the file into single records. I've searched for that, and it basically means that the garbage collector is running for too long without reclaiming much heap space. I expect this means that too many flow files are being generated too fast.
How can I solve this? I've tried changing nifi's configuration regarding the max heap space and other memory-related properties, but nothing seems to work.
Right now I have added an intermediate SplitText with a line count of 1K, and that lets me avoid the error. I don't see this as a solid solution, though: when the incoming file size grows well beyond that, I am afraid I will get the same behavior from the processor.
Any suggestions are welcome! Thank you
The reason for the error is that when splitting 1M records with a line count of 1, you are creating 1M flow files, which equates to 1M Java objects. Overall the approach of using two SplitText processors is common and avoids creating all of the objects at the same time. You could probably use an even larger split size on the first split, maybe 10k. For a billion records I am wondering if a third level would make sense, split from 1B to maybe 10M, then 10M to 10K, then 10K to 1, but I would have to play with it.
Some additional things to consider are increasing the default heap size from 512MB, which you may have already done, and also figuring out if you really need to split down to 1 line. It is hard to say without knowing anything else about the flow, but in a lot of cases if you want to deliver each line somewhere you could potentially have a processor that reads in a large delimited file and streams each line to the destination. For example, this is how PutKafka and PutSplunk work, they can take a file with 1M lines and stream each line to the destination.
I had a similar error while using the GetMongo processor in Apache NiFi.
I changed my configurations to:
Limit: 100
Batch Size: 10
Then the error disappeared.
I have 642 million records to load into a Java ConcurrentHashMap. The size of the entire data set on disk is only 30GB. However, when I load it into memory, almost 250GB of memory is consumed.
I don't have any explanation for such an overhead. Can anyone explain it? Also, any ideas on how to reduce the memory consumption would be welcome.
A Java servlet returns a JSON object.
response.setContentType("application/json");
response.getWriter().write(json.toString());
The JSON object contains data fetched from a database table, and its size is > 50 MB.
On running, servlet throws this error:
java.lang.OutOfMemoryError: Java heap space
It seems the issue occurs while writing the JSON data: the server is unable to allocate contiguous memory of size > 50 MB for the String.
I am unable to figure out a fix for this issue. How can I send a huge JSON object from a servlet?
json.toString() is likely to cause the error. It creates one big string from the already existing JSON object before anything has been sent out.
Slurping everything into memory is convenient, but not very wise when it comes to any limitations. Process your database records one by one and stream immediately to the client instead of copying around in memory. Rule of thumb: "Any limitation given will be exceeded at some time."
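A minimal sketch of the record-by-record approach: write the JSON array brackets yourself and emit each record as soon as it is available, so only one record is held as a string at a time. The class name is invented, and the sketch assumes each element from the iterator is already a valid serialized JSON value (for example the result of calling toString() on a small per-row JSON object).

```java
import java.io.IOException;
import java.io.Writer;
import java.util.Iterator;

// Hypothetical helper; assumes every record is already valid JSON text.
class JsonArrayStreamer {

    /** Writes the records as one JSON array without building the whole array in memory. */
    static void writeArray(Iterator<String> serializedRecords, Writer out) throws IOException {
        out.write('[');
        boolean first = true;
        while (serializedRecords.hasNext()) {
            if (!first) {
                out.write(','); // separator goes before every record except the first
            }
            out.write(serializedRecords.next());
            first = false;
        }
        out.write(']');
        out.flush();
    }
}
```

In the servlet this would be called with `response.getWriter()` as the `Writer` and an iterator that pulls one row at a time from the database result.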
Splitting the JSON data structure into smaller parts is definitely one way to solve the problem at hand. But an alternative via heap size increase might do the job in this case as well.
The “java.lang.OutOfMemoryError: Java heap space” error will be triggered when you try to add more data into the heap space area in memory, but the size of this data is larger than the JVM can accommodate in the Java heap space.
Note that the JVM gets a limited amount of memory from the OS, specified during startup. There are several startup parameters controlling the separate regions of memory allocated, but in your case you are interested in the heap region, which you can set (or increase if already present) as in the following example, which sets the heap size to 1GB:
java -Xmx1024m com.mycompany.MyClass
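To confirm that the setting was actually picked up, the running JVM can report its maximum heap. This is a small sketch with a made-up class name. The reported value may be somewhat lower than the -Xmx figure depending on the garbage collector in use.

```java
// Hypothetical helper to check the effective heap limit from inside the application.
class HeapSizeCheck {

    /** Returns the maximum heap the JVM will attempt to use, in megabytes. */
    static long maxHeapMegabytes() {
        return Runtime.getRuntime().maxMemory() / (1024 * 1024);
    }
}
```

Logging `HeapSizeCheck.maxHeapMegabytes()` at startup is a cheap way to catch a container or start script that silently ignores the option.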