Memory stream and garbage collection (Apache Spark Structured Streaming) - java

I have a Java application that pushes data rows to a Spark Structured Streaming MemoryStream (org.apache.spark.sql.execution.streaming.MemoryStream) and writes them out to a file sink.
I wonder whether rows that have already been written out are considered for garbage collection. Basically, if I keep pushing rows to that MemoryStream while continuously writing them out to the file sink, will I eventually run out of memory?
Now the same question, but assume I am doing window transformations before the write operation, using ten-minute windows. Is there a way to automatically drop rows based on the value of their timestamp column (older than ten minutes)?
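For the windowing part, the standard mechanism is a watermark. Below is a minimal Java sketch under stated assumptions: the streaming Dataset has a timestamp column named "timestamp" (fed here from the built-in rate source purely for illustration; substitute the Dataset you build from your MemoryStream), and the output/checkpoint paths are placeholders.
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.window;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.StreamingQuery;

public class WatermarkSketch {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder()
                .appName("watermark-sketch")
                .master("local[*]")
                .getOrCreate();

        // Illustrative source: the built-in "rate" source emits rows with a
        // "timestamp" column; substitute the Dataset built from your MemoryStream.
        Dataset<Row> events = spark.readStream()
                .format("rate")
                .option("rowsPerSecond", 10)
                .load();

        // The watermark lets Spark discard buffered rows/state whose event time is
        // more than ten minutes behind the latest event time seen so far.
        Dataset<Row> windowedCounts = events
                .withWatermark("timestamp", "10 minutes")
                .groupBy(window(col("timestamp"), "10 minutes"))
                .count();

        // File sinks need "append" mode: a window is emitted once the watermark
        // passes its end, and its state is dropped afterwards.
        StreamingQuery query = windowedCounts.writeStream()
                .outputMode("append")
                .format("parquet")
                .option("path", "/tmp/windowed-output")           // placeholder paths
                .option("checkpointLocation", "/tmp/checkpoint")
                .start();

        query.awaitTermination();
    }
}
As for the first part: MemoryStream is a test utility, and already-processed batches are only released once the engine commits them, so for anything long-running a real source (files, Kafka) is the safer choice.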

Related

Fastest way to read data from single MS SQL table into java under single transaction

I need to read a large result set from MS SQL Server into a Java program. I need to read a consistent data state, so it runs under a single transaction; I don't want dirty reads.
I can split the read using OFFSET and FETCH NEXT and have each set of rows processed by a separate thread.
However, when I do this, the overall throughput is about 30k rows read per second, which is pretty lame. I'd like to get to about 1M rows per second.
I've checked with VisualVM that I have no memory pressure and no GC pauses. Looking at machine utilisation, there appears to be no CPU limitation either.
I believe that the upstream source (MS SQL) is the limiting factor.
Any ideas on what I should look at?
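As a starting point, here is a minimal, hedged sketch of the split-and-parallelize idea described in the question, with a hypothetical table big_table(id, payload), a placeholder JDBC URL, and key-range partitioning instead of OFFSET/FETCH (deep offsets force the server to re-scan skipped rows). Each worker uses its own connection with a forward-only, read-only cursor and a large fetch size. Note that separate connections are separate transactions, so a truly consistent cross-partition view would additionally need SNAPSHOT isolation or a database snapshot.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hedged sketch: range-partitioned parallel reads over a hypothetical table
// big_table(id, payload) with an indexed numeric key.
public class PartitionedReader {

    private static final String URL = "jdbc:sqlserver://host;databaseName=db"; // placeholder

    public static void main(String[] args) {
        long minId = 0, maxId = 8_000_000, step = 1_000_000;
        ExecutorService pool = Executors.newFixedThreadPool(8);
        for (long lo = minId; lo < maxId; lo += step) {
            final long from = lo, to = lo + step;
            pool.submit(() -> readRange(from, to));
        }
        pool.shutdown();
    }

    private static void readRange(long from, long to) {
        String sql = "SELECT id, payload FROM big_table WHERE id >= ? AND id < ?";
        try (Connection con = DriverManager.getConnection(URL, "user", "password");
             PreparedStatement ps = con.prepareStatement(
                     sql, ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY)) {
            ps.setFetchSize(10_000);   // fewer round trips per partition
            ps.setLong(1, from);
            ps.setLong(2, to);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    process(rs.getLong(1), rs.getString(2));
                }
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    private static void process(long id, String payload) {
        // hand off to the consuming side of the application
    }
}
If throughput stays around the same figure regardless of the number of partitions, the bottleneck is likely network bandwidth or the server's own read speed rather than the client.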

How to process large log file in java

I have 4 files, each 200 MB. I have created 4 threads that run in parallel, and each thread processes one file and adds the parsed records to an ArrayBlockingQueue.
Another thread takes records from the ArrayBlockingQueue, groups them into batches of 5000, and executes the batch to insert the records into the database. Still, processing all 4 files takes around 6 minutes to complete.
How can I increase performance in this case?
1) Make sure you have enough memory for the queue + processor buffers + DB buffers.
2) A batch size of 5k is a bit more than needed; in general you are up to speed at around 100, not that it makes much difference here though.
3) You can push data into Oracle from multiple threads. If you fetch the sequence values for the ID fields ahead of time, you can insert into one table in parallel, as long as you don't have many indexes. Otherwise consider disabling/rebuilding the indexes, or insert into a temporary table and then move everything into the main one. (A sketch of such parallel batch inserts follows after this list.)
4) Take a look at the Oracle settings together with a DB admin. Things like extent sizing can change performance.
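A minimal sketch of point 3, under stated assumptions: a hypothetical table log_entries(id, line), a placeholder connection URL, and IDs handed out from an AtomicLong seeded from a block of sequence values fetched up front. Each writer thread owns its connection and commits every few hundred rows.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

// Hedged sketch: one of several writer threads doing batched inserts.
// Table name and ID-seeding strategy are assumptions for illustration.
public class BatchWriter implements Runnable {

    private static final String URL = "jdbc:oracle:thin:@//host:1521/service"; // placeholder
    private final List<String> lines;       // records handed over by the parsing threads
    private final AtomicLong nextId;        // seeded from a sequence block fetched up front

    BatchWriter(List<String> lines, AtomicLong nextId) {
        this.lines = lines;
        this.nextId = nextId;
    }

    @Override
    public void run() {
        String sql = "INSERT INTO log_entries (id, line) VALUES (?, ?)";
        try (Connection con = DriverManager.getConnection(URL, "user", "password");
             PreparedStatement ps = con.prepareStatement(sql)) {
            con.setAutoCommit(false);
            int inBatch = 0;
            for (String line : lines) {
                ps.setLong(1, nextId.getAndIncrement());
                ps.setString(2, line);
                ps.addBatch();
                if (++inBatch == 500) {          // a few hundred per batch is usually enough
                    ps.executeBatch();
                    con.commit();
                    inBatch = 0;
                }
            }
            ps.executeBatch();
            con.commit();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
Several BatchWriter instances can then be submitted to an ExecutorService, one per slice of the parsed records.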

Efficient Use of Java Heap Memory

The exact question I wanted to ask has already been answered here. Still, I just want to explore a few more possibilities (if there are any).
Scenario: my application is a thread-based, data-centric web app, and the amount of data is decided at run time by the user. A user can request a data operation that triggers multiple threads, each transporting its own data. Sometimes the data selection crashes the application with an OutOfMemoryError, i.e. there is insufficient space to allocate new objects in the Java heap. When multiple users use the application concurrently and most of them request big data operations, this OutOfMemoryError is more likely to occur.
Question: is there a way I can prevent the whole application from crashing? I can limit the amount of data being pulled into memory, but is there a better way than this? Even after limiting the amount of data per user, multiple concurrent users can still trigger an OutOfMemoryError. It would be better to put one user on hold or reject their request than to bring everyone down.
I have consistently had good experience with the following points:
Stream large data out, combined with a GZIPOutputStream when the client advertises compression support via Accept-Encoding (possibly as a servlet filter). This can be done for the file system and the database. There also exist (URL-based) XML pipelines where parts of the XML can be streamed. (A sketch of the streaming-plus-compression idea follows at the end of this answer.)
This lowers memory costs, and a stream may be throttled (artificially slowed down).
PDF generation: optimize the PDF, store repeated images only once, use fonts sensibly (ideally the built-in PDF fonts, otherwise embedded fonts).
Office documents: the OpenOffice or Microsoft xlsx/docx variants.
In your case:
Have every process be combinable and stream its result to one output stream: a branching pipeline of tasks. If such a task might be called with the same parameters, yielding the same data, you could use parametrized URLs and cache the results.
I am aware this answer might not fit.
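A minimal sketch of the streaming-plus-compression point, assuming the javax.servlet API (swap for jakarta.servlet on newer containers); openExport() and the file path are placeholders for whatever actually produces the data (a file, a database cursor, a report generator), and in practice the gzip wrapping is often done once in a servlet filter instead.
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.zip.GZIPOutputStream;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// Hedged sketch: stream a large export directly to the client, gzip-compressed
// when the client advertises support via the Accept-Encoding request header.
public class ExportServlet extends HttpServlet {

    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp) throws IOException {
        String acceptEncoding = req.getHeader("Accept-Encoding");
        boolean gzip = acceptEncoding != null && acceptEncoding.contains("gzip");

        resp.setContentType("text/csv");
        OutputStream out = resp.getOutputStream();
        if (gzip) {
            resp.setHeader("Content-Encoding", "gzip");
            out = new GZIPOutputStream(out);
        }

        byte[] buffer = new byte[8 * 1024];
        try (InputStream in = openExport(); OutputStream o = out) {
            int n;
            while ((n = in.read(buffer)) != -1) {
                o.write(buffer, 0, n);   // only one buffer's worth is held in memory at a time
            }
        }
    }

    private InputStream openExport() throws IOException {
        // placeholder for the real data source
        return new FileInputStream("/path/to/large-export.csv");
    }
}
Only a single buffer's worth of data is held in memory at any moment, regardless of how large the export is.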

Reading large amount of records MySQL into Java

Having a MySQL database with over 8 million records that I need to process (which can't be done in the database itself), I run into issues when trying to read them into my Java application.
I already tried some solutions from people with similar problems (e.g., this link); however, none of them worked for me. I tried setting the fetch size and so on, but no luck! My application is built around a BlockingQueue: the Producer continuously reads data from the database and stores it in the queue so the Consumer can process it. This way I limit the number of records held in main memory at the same time.
My code works for a small number of records (I tested with 1000 records), so I suspect the phase from the database to my application is what needs to be fixed.
Edit1
connection = ConnectionFactory.getConnection(DATABASE);
preparedStatement = connection.prepareStatement(query, java.sql.ResultSet.CONCUR_READ_ONLY, java.sql.ResultSet.TYPE_FORWARD_ONLY);
preparedStatement.setFetchSize(1000);
preparedStatement.executeQuery();
rs = preparedStatement.getResultSet();
Edit2
Eventually I now get some output other than just watching my memory go down. I get this error:
Exception in thread "Thread-0" java.lang.OutOfMemoryError: Java heap space
at com.mysql.jdbc.Buffer.<init>(Buffer.java:59)
at com.mysql.jdbc.MysqlIO.nextRow(MysqlIO.java:2089)
at com.mysql.jdbc.MysqlIO.readSingleRowSet(MysqlIO.java:3554)
at com.mysql.jdbc.MysqlIO.getResultSet(MysqlIO.java:491)
at com.mysql.jdbc.MysqlIO.readResultsForQueryOrUpdate(MysqlIO.java:3245)
at com.mysql.jdbc.MysqlIO.readAllResults(MysqlIO.java:2413)
at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2836)
at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2828)
at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2777)
at com.mysql.jdbc.StatementImpl.executeQuery(StatementImpl.java:1651)
at razoralliance.dao.DataDAOImpl.getAllDataRS(DataDAOImpl.java:38)
at razoralliance.app.DataProducer.run(DataProducer.java:34)
at java.lang.Thread.run(Thread.java:722)
Edit3
I did some more research on the Producer-Consumer pattern, and it turns out that when the Consumer cannot keep up with the Producer, the queue automatically grows and thus eventually runs out of memory. So I switched to ArrayBlockingQueue, which makes the size fixed. However, I still get memory leaks: Eclipse Memory Analyzer says that the ArrayBlockingQueue occupies 65.31% of my memory, while it only holds 1000 objects with 4 fields, all text.
You will need to stream your results. With the MySQL driver it appears you have to set CONCUR_READ_ONLY and TYPE_FORWARD_ONLY for your ResultSet. Also, set the fetch size accordingly: stmt.setFetchSize(Integer.MIN_VALUE);
By default, ResultSets are completely retrieved and stored in memory. In most cases this is the most efficient way to operate, and due to the design of the MySQL network protocol is easier to implement. If you are working with ResultSets that have a large number of rows or large values, and cannot allocate heap space in your JVM for the memory required, you can tell the driver to stream the results back one row at a time.
To enable this functionality, create a Statement instance in the following manner:
stmt = conn.createStatement(java.sql.ResultSet.TYPE_FORWARD_ONLY,
                            java.sql.ResultSet.CONCUR_READ_ONLY);
stmt.setFetchSize(Integer.MIN_VALUE);
The combination of a forward-only, read-only result set, with a fetch size of Integer.MIN_VALUE serves as a signal to the driver to stream result sets row-by-row. After this, any result sets created with the statement will be retrieved row-by-row.
There are some caveats with this approach...
Streaming large result sets with MySQL
http://dev.mysql.com/doc/connector-j/en/connector-j-reference-implementation-notes.html
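Adapted to the producer/consumer setup from the question, a minimal sketch (the table and column names are placeholders, and the queue is deliberately bounded so the producer blocks instead of growing the heap):
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hedged sketch: a streaming producer feeding a bounded queue.
public class StreamingProducer implements Runnable {

    private final Connection connection;                       // obtained elsewhere
    private final BlockingQueue<String> queue = new ArrayBlockingQueue<>(1000);

    public StreamingProducer(Connection connection) {
        this.connection = connection;
    }

    public BlockingQueue<String> getQueue() {
        return queue;
    }

    @Override
    public void run() {
        try (Statement stmt = connection.createStatement(
                     ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY)) {
            // Integer.MIN_VALUE tells Connector/J to stream row by row instead of
            // buffering the whole result set in the driver.
            stmt.setFetchSize(Integer.MIN_VALUE);
            try (ResultSet rs = stmt.executeQuery("SELECT payload FROM data")) {
                while (rs.next()) {
                    queue.put(rs.getString(1));   // blocks when the queue is full
                }
            }
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
One caveat from the Connector/J notes: while a streaming result set is open, no other statements can be issued on that connection until the result set has been fully read or closed.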
Why don't you try the approach from this answer:
Problem exporting a lot of data from database to .csv with java
Instead of fetching the entire result set, records can be fetched one by one and then used for processing. The answer I am referring to fetches records one at a time and writes them to a file, but you can use the same pattern for your processing. This is one approach you can use.
Another approach is to use multithreading, fetching records on demand and processing them separately.

Pro and Cons of opening multiple InputStream?

While coding the solution to the problem of downloading a huge dynamic zip with low RAM impact, an idea started besieging me, and it led to this question, asked out of pure curiosity / hunger for knowledge:
What kind of drawbacks could I run into if, instead of loading the InputStreams one at a time (with separate queries to the database), I loaded all the InputStreams in a single query, returning a List of (n, potentially thousands of, "opened") InputStreams?
Current (safe) version: n queries, one inputStream instantiated at a time
for (long id : ids) {
    InputStream in = getMyService().loadStreamById(id);
    IOUtils.copyStream(in, out);
    in.close();
}
Hypothetical version: one query, n instantiated inputStreams
List<InputStream> streams = getMyService().loadAllStreams();
for (InputStream in : streams) {
    IOUtils.copyStream(in, out);
    in.close();
    in = null;
}
What are the pros and cons of the second approach, excluding the (I suppose small) amount of memory used to keep multiple Java InputStreams instantiated?
Could it lead to some kind of network freeze or database stress (or locks, or problems if others read/write the same BLOB field the stream points to, etc.) more than multiple queries would?
Or are they smart enough to be almost invisible until asked for data, so that 1 query + 1000 active streams could be better than 1000 queries + 1 active stream?
The short answer is that you risk hitting a limit of your operating system and/or DBMS.
The longer answer depends on the specific operating system and DBMS, but here are a few things to think about:
On Linux there are a maximum number of open file descriptors that any process can hold. The default is/was 1024, but it's relatively easy to increase. The intent of this limit IMO is to kill a poorly-written process, as the amount of memory required per file/socket is minimal (on a modern machine).
If the open stream represents an individual socket connection to the database, there's a hard limit on the total number of client sockets that a single machine may open to a single server address/port. This is driven by the client's dynamic port range, and it's roughly 16k or 32k (but can be modified). This limit applies across all processes on the machine, so excessive consumption by one process may starve other processes trying to access the same server.
Depending on how the DBMS manages the connections used to retrieve BLOBs, you may run into a much smaller limit enforced by the DBMS. Oracle, for example, defaults to a total of 50 "cursors" (active retrieval operations) per user connection.
Aside from these limits, you won't get any benefit given your code as written, since it runs through the connections sequentially. If you were to use multiple threads to read, you may see some benefit from having multiple concurrent connections. However, I'd still open those connections on an as-needed basis. And lest you think of spawning a thread for each connection (and running into the physical limit of number of threads), you'll probably reach a practical throughput limit before you hit any physical limits.
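To illustrate the as-needed approach under those constraints, here is a hedged sketch that assumes the loadStreamById(id) call from the question and a single ordered output stream: a small pool prefetches a bounded window of blobs ahead of one writer, so only a handful of streams/buffers are open at any moment.
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.ArrayDeque;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hedged sketch: bounded prefetching ahead of a single ordered writer.
// Only loadStreamById(id) comes from the question; everything else is illustrative.
public class PrefetchingCopier {

    private static final int WINDOW = 4;   // at most 4 blobs fetched/buffered at a time

    public void copyAll(List<Long> ids, OutputStream out, MyService service) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(WINDOW);
        Queue<Future<byte[]>> window = new ArrayDeque<>();
        try {
            for (long id : ids) {
                window.add(pool.submit(() -> readFully(service.loadStreamById(id))));
                if (window.size() == WINDOW) {
                    out.write(window.poll().get());   // single writer keeps output ordered
                }
            }
            while (!window.isEmpty()) {
                out.write(window.poll().get());
            }
        } finally {
            pool.shutdown();
        }
    }

    private static byte[] readFully(InputStream in) throws Exception {
        try (InputStream closing = in) {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            byte[] chunk = new byte[8 * 1024];
            int n;
            while ((n = closing.read(chunk)) != -1) {
                buf.write(chunk, 0, n);
            }
            return buf.toByteArray();
        }
    }

    interface MyService {                   // stand-in for the question's getMyService()
        InputStream loadStreamById(long id);
    }
}
The window size caps buffered data (about 40 MB with 10 MB blobs) while still overlapping database reads with writing.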
I tested it in PostgreSQL, and it works.
Since PostgreSQL does not seem to have a predefined max cursor limit, I still don't know whether the simple assignment of a cursor/pointer from a BLOB field to a Java InputStream object via java.sql.ResultSet.getBinaryStream("blob_field") counts as an active retrieval operation or not (I guess not, but who knows...);
Loading all the InputStreams at once with something like SELECT blob_field FROM table WHERE length(blob_field) > 0 produced a very long query execution time, followed by very fast access to the binary content (read sequentially, as above).
With a test case of 200 MB in 20 files of 10 MB each:
The old way took circa 1 second per query, plus 0.XX seconds for the other operations (reading each InputStream and writing it to the output stream, and so on);
Total elapsed time: 35 seconds.
The experimental way took circa 22 seconds for the big query and 12 seconds in total for iterating and performing the other operations.
Total elapsed time: 34 seconds.
This makes me think that while the BinaryStream from the database is being assigned to the Java InputStream object, the complete read is already being performed :/ which makes using an InputStream similar to using a byte[] (but worse in this case, because of the memory overhead of having all the items instantiated);
Conclusion
Reading all at once is a bit faster (~1 second faster per 30 seconds of execution),
but it could easily make the big query time out, besides causing memory pressure and, potentially, hitting the max cursor limit.
Do not try this at home; just stick with one query at a time...
