Hashmap Over Large dataset giving OutOfMemory in spark - java

I have a requirement to update hashmaps. In my Spark job I have a JavaPairRDD whose value is a wrapper holding 9 different hashmaps. Each hashmap has roughly 40-50 crore (400-500 million) keys. While merging two maps (reduceByKey in Spark) I get a Java heap OutOfMemory exception. Below is the code snippet.
private HashMap<String, Long> getMergedMapNew(HashMap<String, Long> oldMap,
        HashMap<String, Long> newMap) {
    for (Entry<String, Long> entry : newMap.entrySet()) {
        try {
            String imei = entry.getKey();
            Long oldTimeStamp = oldMap.get(imei);
            Long newTimeStamp = entry.getValue();
            if (oldTimeStamp != null && newTimeStamp != null) {
                if (oldTimeStamp < newTimeStamp) {
                    oldMap.put(imei, newTimeStamp);
                } else {
                    oldMap.put(imei, oldTimeStamp);
                }
            } else if (oldTimeStamp == null) {
                oldMap.put(imei, newTimeStamp);
            } else if (newTimeStamp == null) {
                oldMap.put(imei, oldTimeStamp);
            }
        } catch (Exception e) {
            logger.error("{}", Utils.getStackTrace(e));
        }
    }
    return oldMap;
}
This method works on a small dataset but fails on a large one. The same method is used for all 9 hashmaps. I searched for how to increase heap memory, but I have no idea how to do that in Spark since it runs on a cluster. My cluster is also large (nearly 300 nodes). Please help me find a solution.

First, I'd focus on three parameters: spark.driver.memory=45g, spark.executor.memory=6g, and spark.driver.maxResultSize=8g. Don't take these values for granted; they are what works on my setup without OOM errors. Check how much memory is available in the UI. You want to give the executors as much memory as you can. By the way, spark.driver.memory is the setting that gives the driver more heap space.

As far as I can see, this code is executed on the Spark driver. I would recommend converting the two hashmaps to DataFrames with two columns, imei and timestamp. Then join both using an outer join on imei and select the appropriate timestamp using when.
This code will be executed on the workers and parallelized, and consequently you won't run into the memory problems. If you really plan to do this on the driver, follow the instructions given by Jarek and increase spark.driver.memory.
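The per-key rule in the question boils down to "keep the larger timestamp". A minimal plain-Java sketch of that rule using Map.merge is shown below; the class and method names (MergeLatest, mergeLatest) are made up for illustration, and unlike the original method it skips null values because Map.merge does not accept them. If the data were keyed by IMEI in a JavaPairRDD<String, Long> instead of being held in large maps, the same Long::max function is presumably what one would pass to reduceByKey, but that part is an assumption and not shown here.

```java
import java.util.HashMap;
import java.util.Map;

class MergeLatest {
    /** Merges newMap into oldMap in place, keeping the larger timestamp per key. */
    static Map<String, Long> mergeLatest(Map<String, Long> oldMap, Map<String, Long> newMap) {
        for (Map.Entry<String, Long> entry : newMap.entrySet()) {
            if (entry.getValue() != null) {          // Map.merge rejects null values
                // Inserts the value if the key is absent, otherwise keeps the maximum.
                oldMap.merge(entry.getKey(), entry.getValue(), Long::max);
            }
        }
        return oldMap;
    }

    public static void main(String[] args) {
        Map<String, Long> oldMap = new HashMap<>();
        oldMap.put("imei-1", 100L);
        oldMap.put("imei-2", 500L);
        Map<String, Long> newMap = new HashMap<>();
        newMap.put("imei-1", 200L);   // newer, replaces 100
        newMap.put("imei-2", 300L);   // older, 500 is kept
        newMap.put("imei-3", 50L);    // new key, inserted
        System.out.println(mergeLatest(oldMap, newMap));
    }
}
```

This does not reduce memory use by itself; it only shortens the merge logic so that it can be applied one key at a time.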


How to write large volumes of unique data to Postgres without storing it all in memory

I have a Spring Boot application that generates images. I'm trying to scale it to the point it can generate an unlimited number of images.
When an image is generated I create a hash using MurmurHash3 of the base64 encoded values of the image, this is then added to an object as a @Lob value. The hash is how I consider images to be unique, the images are then pushed into Postgres.
So far everything is fine and this creates ~1,000 images in a few seconds without problem. Where I'm having issues is say I want to create 100,000+ images.
When the images are generated there is a pretty good chance of duplicates, so what I thought was a good idea would be to create 'chunks' of images using a HashSet to hopefully rule out duplicates at least within the specific 'chunk'
public class CreateImages {
    public void process() {
        while (repository.count() < 100_000) {
            createChunk();
        }
    }
    private void createChunk() {
        Set<TokenUri> result = new HashSet<>();
        while (result.size() < 1000) {
            final ImageWrapper imageWrapper = svgService.create(-1);
            // ...
        }
        try {
            // ...
        } catch (Exception e) {
            log.error("Failed to save chunk {}", e.getMessage());
        }
        log.info("Created {} images", repository.count());
    }
}
Not worrying about time taken here to create the images (it's all single threaded) this will do what I expect, each chunk doesn't contain duplicates, however more than likely it will contain duplicates when compared to previously generated chunks.
So to try and solve that I added a @Column(unique = true) annotation to each hash row being saved, thinking Postgres will reject duplicates but allow 'non-duplicates' to be saved.
What seems to happen though is the batch write fails due to not satisfying the condition and doesn't seem to move past it.
2022-01-02 15:38:56.071 ERROR 19292 --- [ main] o.h.engine.jdbc.spi.SqlExceptionHelper : ERROR: duplicate key value violates unique constraint "uk_90jgw9r7w8bhtgw17fmi79j0w"
Detail: Key (hash)=(1625765490) already exists.
Even when attempting to catch those with a generic Exception, either I'm not handling it correctly or it doesn't do what I expect.
Even this feels rather hacky and not a correct solution.
So, tl;dr - How can I generate an unknown number (assume millions) of unique objects (without keeping them all in memory to check for uniqueness) and safely store those in Postgres?
Is there some standard pattern for this kind of thing?
This could be a possible solution. You save only hash property in a Set and in that way you can have unique hash across chunks and also it's memory efficient because you are saving only a String and not an object with all properties. You also need to override hashCode and equals methods in TokenUri (to use hash property) because otherwise Set<TokenUri> doesn't work.
public class CreateImages {
    public void process() {
        Set<String> hashCodesOfSavedImages = new HashSet<>();
        while (hashCodesOfSavedImages.size() < 100_000) {
            hashCodesOfSavedImages.addAll(createChunkOfImages(hashCodesOfSavedImages, 1000));
        }
    }
    private Set<String> createChunkOfImages(Set<String> hashCodesOfSavedImages, int chunkSize) {
        Set<TokenUri> chunkOfImages = new HashSet<>();
        while (chunkOfImages.size() < chunkSize) {
            final ImageWrapper imageWrapper = svgService.create(-1);
            // O(1) time complexity (contains)
            if (!hashCodesOfSavedImages.contains(imageWrapper.hash())) {
                // ...
            }
        }
        // ...
        return chunkOfImages.stream().map(TokenUri::hash).collect(Collectors.toSet());
    }
}
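The idea of remembering only the hashes across chunks can be tried out without Spring or Postgres. The following is a minimal sketch under stated assumptions: UniqueChunkWriter, writeUnique, hashSource and sink are hypothetical names, the Supplier stands in for image generation plus hashing, and the Consumer stands in for whatever saves a chunk to the database.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Consumer;
import java.util.function.Supplier;

class UniqueChunkWriter {
    /**
     * Pulls hashes from hashSource until `target` distinct ones have been seen,
     * handing them to `sink` in chunks of at most `chunkSize`.
     * Only the hashes are kept in memory between chunks.
     */
    static Set<String> writeUnique(Supplier<String> hashSource, int target,
                                   int chunkSize, Consumer<List<String>> sink) {
        Set<String> seen = new HashSet<>();
        List<String> chunk = new ArrayList<>(chunkSize);
        while (seen.size() < target) {
            String hash = hashSource.get();
            if (seen.add(hash)) {                 // add() returns false for a duplicate
                chunk.add(hash);
                if (chunk.size() == chunkSize) {
                    sink.accept(new ArrayList<>(chunk));
                    chunk.clear();
                }
            }
        }
        if (!chunk.isEmpty()) {
            sink.accept(new ArrayList<>(chunk));  // flush the last, partial chunk
        }
        return seen;
    }

    public static void main(String[] args) {
        int[] counter = {0};
        List<String> written = new ArrayList<>();
        // Hypothetical source that produces every value twice to simulate duplicates.
        Set<String> seen = writeUnique(() -> "h" + (counter[0]++ / 2), 10, 4, written::addAll);
        System.out.println(written.size() + " written, " + seen.size() + " distinct");
    }
}
```

The set of hashes still grows with the number of saved images, so this trades full objects in memory for one string per image rather than removing the memory cost entirely.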

Flink Task Manager timeout

My program gets very slow as more and more records are processed. I initially thought it is due to excessive memory consumption as my program is String intensive (I am using Java 11 so compact strings should be used whenever possible) so I increased the JVM Heap:
I also increased the task manager's memory as well as timeout, flink-conf.yaml:
jobmanager.heap.size: 6144m
heartbeat.timeout: 5000000
However, none of this helped with the issue. The program still gets very slow at about the same point, which is after processing roughly 3.5 million records, with only about 0.5 million more to go. As the program approaches the 3.5 million mark it gets very, very slow until it eventually times out; total execution time is about 11 minutes.
I checked the memory consumption in VisualVM, but it never goes above about 700 MB. My Flink pipeline looks as follows:
final StreamExecutionEnvironment environment = StreamExecutionEnvironment.createLocalEnvironment(1);
DataStream<Tuple> stream = environment.addSource(new TPCHQuery3Source(filePaths, relations));
stream.process(new TPCHQuery3Process(relations)).addSink(new FDSSink());
The bulk of the work is done in the process function. I am implementing database join algorithms and the columns are stored as Strings; specifically, I am implementing query 3 of the TPC-H benchmark, see https://examples.citusdata.com/tpch_queries.html.
The timeout error is this:
java.util.concurrent.TimeoutException: Heartbeat of TaskManager with id <id> timed out.
Once I got this error as well:
Exception in thread "pool-1-thread-1" java.lang.OutOfMemoryError: Java heap space
Also, my VisualVM monitoring, screenshot is captured at the point where things get very slow:
Here is the run loop of my source function:
while (run) {
    readers.forEach(reader -> {
        try {
            String line = reader.readLine();
            if (line != null) {
                Tuple tuple = lineToTuple(line, counter.get() % filePaths.size());
                if (tuple != null && isValidTuple(tuple)) {
                    // ...
                }
            } else {
                // ...
                if (closedReaders.size() == filePaths.size()) {
                    System.out.println("ALL FILES HAVE BEEN STREAMED");
                    // ...
                }
            }
        } catch (IOException e) {
            // ...
        }
    });
}
I basically read a line from each of the 3 files I need. Based on the order of the files, I construct a tuple object (my custom class Tuple, representing a row in a table) and emit that tuple if it is valid, i.e. it fulfils certain conditions on the date.
I am also suggesting that the JVM run garbage collection at the 1 millionth, 1.5 millionth, 2 millionth and 2.5 millionth record like this:
Any thoughts on how I can optimize this?
String intern() saved me. I did intern on every string before storing it in my maps and that worked like a charm.
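To illustrate why interning can help here: equal strings parsed from different lines are separate objects, while String.intern() returns one shared canonical instance, so repeated column values stored in maps no longer each hold their own copy. A minimal sketch follows; the class name InternDemo, the helper canonical, and the sample values are illustrative assumptions and not taken from the original program.

```java
import java.util.HashMap;
import java.util.Map;

class InternDemo {
    /** Returns the canonical instance for s so that equal strings share one object. */
    static String canonical(String s) {
        return s == null ? null : s.intern();
    }

    public static void main(String[] args) {
        // Two equal strings created separately (as when parsing lines) are distinct objects...
        String a = new String("AUTOMOBILE");
        String b = new String("AUTOMOBILE");
        System.out.println(a == b);                        // false

        // ...but after interning, both references point to the same instance.
        System.out.println(canonical(a) == canonical(b));  // true

        // Storing the interned value means the map entries share one String.
        Map<Long, String> segmentByCustomer = new HashMap<>();
        segmentByCustomer.put(1L, canonical(a));
        segmentByCustomer.put(2L, canonical(b));
        System.out.println(segmentByCustomer.get(1L) == segmentByCustomer.get(2L)); // true
    }
}
```

Interning pays off when the same values repeat often; for mostly unique strings it adds lookup cost without saving memory.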
These are the properties that I changed on my Flink stand-alone cluster to compute TPC-H query 03.
jobmanager.memory.process.size: 1600m
heartbeat.timeout: 100000
taskmanager.memory.process.size: 8g # default: 1728m
I implemented this query to stream only the Order table and kept the other tables as state. I am also computing it as a windowless query, which I think makes more sense and is faster.
public class TPCHQuery03 {
    private final String topic = "topic-tpch-query-03";
    public TPCHQuery03() {
        this(PARAMETER_OUTPUT_LOG, "", false, false, -1);
    }
    public TPCHQuery03(String output, String ipAddressSink, boolean disableOperatorChaining, boolean pinningPolicy, long maxCount) {
        try {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            if (disableOperatorChaining) {
                // ...
            }
            DataStream<Order> orders = env
                .addSource(new OrdersSource(maxCount)).name(OrdersSource.class.getSimpleName()).uid(OrdersSource.class.getSimpleName());
            // Filter market segment "AUTOMOBILE"
            // customers = customers.filter(new CustomerFilter());
            // Filter all Orders with o_orderdate < 12.03.1995
            DataStream<Order> ordersFiltered = orders
                .filter(new OrderDateFilter("1995-03-12")).name(OrderDateFilter.class.getSimpleName()).uid(OrderDateFilter.class.getSimpleName());
            // Join customers with orders and package them into a ShippingPriorityItem
            DataStream<ShippingPriorityItem> customerWithOrders = ordersFiltered
                .keyBy(new OrderKeySelector())
                .process(new OrderKeyedByCustomerProcessFunction(pinningPolicy)).name(OrderKeyedByCustomerProcessFunction.class.getSimpleName()).uid(OrderKeyedByCustomerProcessFunction.class.getSimpleName());
            // Join the last join result with Lineitems
            DataStream<ShippingPriorityItem> result = customerWithOrders
                .keyBy(new ShippingPriorityOrderKeySelector())
                .process(new ShippingPriorityKeyedProcessFunction(pinningPolicy)).name(ShippingPriorityKeyedProcessFunction.class.getSimpleName()).uid(ShippingPriorityKeyedProcessFunction.class.getSimpleName());
            // Group by l_orderkey, o_orderdate and o_shippriority and compute revenue sum
            DataStream<ShippingPriorityItem> resultSum = result
                .keyBy(new ShippingPriority3KeySelector())
                .reduce(new SumShippingPriorityItem(pinningPolicy)).name(SumShippingPriorityItem.class.getSimpleName()).uid(SumShippingPriorityItem.class.getSimpleName());
            // emit result
            if (output.equalsIgnoreCase(PARAMETER_OUTPUT_MQTT)) {
                resultSum
                    .map(new ShippingPriorityItemMap(pinningPolicy)).name(ShippingPriorityItemMap.class.getSimpleName()).uid(ShippingPriorityItemMap.class.getSimpleName())
                    .addSink(new MqttStringPublisher(ipAddressSink, topic, pinningPolicy)).name(OPERATOR_SINK).uid(OPERATOR_SINK);
            } else if (output.equalsIgnoreCase(PARAMETER_OUTPUT_LOG)) {
                // ...
            } else if (output.equalsIgnoreCase(PARAMETER_OUTPUT_FILE)) {
                StreamingFileSink<String> sink = StreamingFileSink
                    .forRowFormat(new Path(PATH_OUTPUT_FILE), new SimpleStringEncoder<String>("UTF-8"))
                    // ...
                    .withMaxPartSize(1024 * 1024 * 1024).build())
                    // ...
                    .map(new ShippingPriorityItemMap(pinningPolicy)).name(ShippingPriorityItemMap.class.getSimpleName()).uid(ShippingPriorityItemMap.class.getSimpleName())
                    // ...
            } else {
                System.out.println("discarding output");
            }
            System.out.println("Stream job: " + TPCHQuery03.class.getSimpleName());
            System.out.println("Execution plan >>>\n" + env.getExecutionPlan());
            // ...
        } catch (IOException e) {
            // ...
        } catch (Exception e) {
            // ...
        }
    }
    public static void main(String[] args) throws Exception {
        new TPCHQuery03();
    }
}
The UDFs are here: OrderSource, OrderKeyedByCustomerProcessFunction, ShippingPriorityKeyedProcessFunction, and SumShippingPriorityItem. I am using the com.google.common.collect.ImmutableList since the state will not be updated. Also I am keeping only the necessary columns on the state, such as ImmutableList<Tuple2<Long, Double>> lineItemList.

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space while trying to Verify Millions of data

I have a flat file that contains data line by line, and my task is to verify which of those entries are not present in the DB. I am trying to do this in Java: first I load the flat-file data into HashSet1 and the DB data into HashSet2, and then I check the entries of one set against the other with contains so that I can identify which data is not present in the DB.
Given below is dummy code in which you can treat hashset1 (which has some data missing) as the file reader data and hashset2 (the full data) as the DB data.
However, as I mentioned, I have 30 million records to verify. I am able to verify 1 million records this way, but not 30 million, which is my task. Is there a better way to do this? Any suggestion, ideally with some code, would be much appreciated.
public class App {
    public static void sampleMethod() {
        Set<Integer> hashset1 = new HashSet<Integer>();
        Set<Integer> hashset2 = new HashSet<Integer>();
        for (int i = 0; i < 30000000; i++) {
            if (i % 50000 != 0) {
                hashset1.add(i); // file data: every 50000th value is missing
            }
            hashset2.add(i); // DB data: full range
        }
        int count = 0;
        for (int j = 0; j < 30000000; j++) {
            if (hashset1.contains(j)) {
                count++;
            } else {
                System.out.println(j + " Is Not Present");
            }
        }
        System.out.println("Contain Value Count" + count);
    }
    public static void main(String[] args) {
        sampleMethod();
    }
}
Error Stack Trace :
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at java.util.HashMap.resize(HashMap.java:703)
at java.util.HashMap.putVal(HashMap.java:662)
at java.util.HashMap.put(HashMap.java:611)
at java.util.HashSet.add(HashSet.java:219)
at com.java.anz.BankingPro.App.sampleMethod(App.java:20)
at com.java.anz.BankingPro.App.main(App.java:38)
For combining two sets of data, it's enough to load only the smaller of the two into a hash set (1.), then, as the next step, detect the differences between the sets (2.), and only then modify data according to the differences found (3.). Let's call the small set simply smallHashSet in the following pseudocode:
1. Load the smaller set of data into smallHashSet.
2. Iterate over the entries in the bigger set of data, one by one. Do not load it all at once; load one entry after another and process one at a time:
2.1. Let's say bigSetEntry is such an entry from the bigger set; then
2.2. if (smallHashSet.contains(bigSetEntry)) smallHashSet.remove(bigSetEntry).
3. When you are done, smallHashSet contains only the entries which are in the small set but missing from the big set. You never needed to load the big set all at once. You can now do something with these differing entries, e.g. add them to the big data file.
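The steps above can be sketched in Java roughly as follows. This is only an illustration under assumptions: StreamingDiff and missingFromBig are made-up names, and the Iterator stands in for whatever reads the bigger data set one entry at a time (a file reader or a DB cursor). Since Set.remove is a no-op for absent elements, the contains check from step 2.2 is folded into the remove call.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Set;
import java.util.stream.IntStream;

class StreamingDiff {
    /**
     * Removes from smallSet every entry that also appears in bigSource.
     * What remains is in the small set but missing from the big one.
     * The big data set is consumed one entry at a time and never held in memory.
     */
    static <T> Set<T> missingFromBig(Set<T> smallSet, Iterator<T> bigSource) {
        while (bigSource.hasNext()) {
            smallSet.remove(bigSource.next());
        }
        return smallSet;
    }

    public static void main(String[] args) {
        Set<Integer> small = new HashSet<>(Arrays.asList(1, 2, 3, 4, 5));
        // Stand-in for the big data set: odd numbers below 100, produced lazily.
        Iterator<Integer> big = IntStream.range(0, 100)
                .filter(i -> i % 2 != 0)
                .boxed()
                .iterator();
        System.out.println("Missing from big set: " + missingFromBig(small, big));
    }
}
```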

Spark java.lang.StackOverflowError

I'm using Spark to calculate the PageRank of user reviews, but I keep getting java.lang.StackOverflowError when I run my code on a big dataset (40k entries). When I run the code on a small number of entries it works fine, though.
Entry Example :
product/productId: B00004CK40 review/userId: A39IIHQF18YGZA review/profileName: C. A. M. Salas review/helpfulness: 0/0 review/score: 4.0 review/time: 1175817600 review/summary: Reliable comedy review/text: Nice script, well acted comedy, and a young Nicolette Sheridan. Cusak is in top form.
The Code:
public void calculatePageRank() {
    JavaRDD<String> rddFileData = sc.textFile(inputFileName).cache();
    JavaRDD<String> rddMovieData = rddFileData.map(new Function<String, String>() {
        public String call(String arg0) throws Exception {
            String[] data = arg0.split("\t");
            String movieId = data[0].split(":")[1].trim();
            String userId = data[1].split(":")[1].trim();
            return movieId + "\t" + userId;
        }
    });
    JavaPairRDD<String, Iterable<String>> rddPairReviewData = rddMovieData.mapToPair(new PairFunction<String, String, String>() {
        public Tuple2<String, String> call(String arg0) throws Exception {
            String[] data = arg0.split("\t");
            return new Tuple2<String, String>(data[0], data[1]);
        }
    }).groupByKey();
    JavaRDD<Iterable<String>> cartUsers = rddPairReviewData.map(f -> f._2());
    List<Iterable<String>> cartUsersList = cartUsers.collect();
    JavaPairRDD<String, String> finalCartesian = null;
    int iterCounter = 0;
    for (Iterable<String> out : cartUsersList) {
        JavaRDD<String> currentUsersRDD = sc.parallelize(Lists.newArrayList(out));
        if (finalCartesian == null) {
            finalCartesian = currentUsersRDD.cartesian(currentUsersRDD);
        } else {
            finalCartesian = currentUsersRDD.cartesian(currentUsersRDD).union(finalCartesian);
            if (iterCounter % 20 == 0) {
                // ...
            }
        }
    }
    JavaRDD<Tuple2<String, String>> finalCartesianToTuple = finalCartesian.map(m -> new Tuple2<String, String>(m._1(), m._2()));
    finalCartesianToTuple = finalCartesianToTuple.filter(x -> x._1().compareTo(x._2()) != 0);
    JavaPairRDD<String, String> userIdPairs = finalCartesianToTuple.mapToPair(m -> new Tuple2<String, String>(m._1(), m._2()));
    JavaRDD<String> userIdPairsString = userIdPairs.map(new Function<Tuple2<String, String>, String>() {
        //Tuple2<Tuple2<MovieId, userId>, Tuple2<movieId, userId>>
        public String call(Tuple2<String, String> t) throws Exception {
            return t._1 + " " + t._2;
        }
    });
    try {
        //calculate pagerank using this https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/JavaPageRank.java
        JavaPageRank.calculatePageRank(userIdPairsString, 100);
    } catch (Exception e) {
        // TODO Auto-generated catch block
    }
}
I have multiple suggestions which will help you to greatly improve the performance of the code in your question.
Caching: Caching should be used on data sets that you need to refer to again and again for the same or different operations (iterative algorithms).
An example is RDD.count — to tell you the number of lines in the
file, the file needs to be read. So if you write RDD.count, at
this point the file will be read, the lines will be counted, and the
count will be returned.
What if you call RDD.count again? The same thing: the file will be
read and counted again. So what does RDD.cache do? Now, if you run
RDD.count the first time, the file will be loaded, cached, and
counted. If you call RDD.count a second time, the operation will use
the cache. It will just take the data from the cache and count the
lines, no recomputing.
Read more about caching here.
In your code sample you are not reusing anything that you've cached. So you may remove the .cache from there.
Parallelization: In the code sample, you've parallelized every individual element in your RDD which is already a distributed collection. I suggest you to merge the rddFileData, rddMovieData and rddPairReviewData steps so that it happens in one go.
Get rid of .collect since that brings the results back to the driver and may be the actual reason for your error.
This problem occurs when your DAG grows big and too many levels of transformations happen in your code. The JVM is not able to hold the operations needed for lazy execution when an action is finally performed.
Checkpointing is one option. I would suggest using Spark SQL for this kind of aggregation. If your data is structured, try to load it into DataFrames and perform grouping and other SQL functions to achieve this.
When your for loop grows really large, Spark can no longer keep track of the lineage. Enable checkpointing in your for loop to checkpoint your rdd every 10 iterations or so. Checkpointing will fix the problem. Don't forget to clean up the checkpoint directory after.
The steps below fixed the StackOverflowError. As others pointed out, it is caused by the lineage that Spark keeps building, especially when you have a loop/iteration in the code.
Set the checkpoint directory.
Checkpoint the DataFrame/RDD you are modifying in the iteration.
Cache DataFrames that are reused in each iteration.
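Why a long lineage ends in a StackOverflowError, and why cutting it periodically helps, can be shown with a plain-Java analogy. This is not Spark code: LineageDemo and run are made-up names, each lazy IntSupplier plays the role of an RDD that remembers its parent, and "checkpointing" here simply means evaluating the chain and replacing it with its result.

```java
import java.util.function.IntSupplier;

class LineageDemo {
    /**
     * Builds a chain of `steps` lazy increments. Every `checkpointEvery` steps
     * (0 = never) the chain is evaluated and replaced by a constant, which
     * truncates the "lineage" behind it.
     */
    static int run(int steps, int checkpointEvery) {
        IntSupplier current = () -> 0;
        for (int i = 1; i <= steps; i++) {
            IntSupplier parent = current;
            current = () -> parent.getAsInt() + 1;   // lazy: only remembers its parent
            if (checkpointEvery > 0 && i % checkpointEvery == 0) {
                int value = current.getAsInt();      // evaluate now...
                current = () -> value;               // ...and drop the chain behind it
            }
        }
        return current.getAsInt();                   // recursion depth = remaining chain length
    }

    public static void main(String[] args) {
        System.out.println("with checkpoints: " + run(1_000_000, 100));
        try {
            System.out.println("without checkpoints: " + run(1_000_000, 0));
        } catch (StackOverflowError e) {
            System.out.println("without checkpoints: StackOverflowError");
        }
    }
}
```

In Spark the corresponding calls would be made on the SparkContext and on the RDD or DataFrame inside the loop, as the steps above describe; the analogy only shows the recursion-depth effect.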

Javolution FastSortedMap in Threaded Environment

I'm trying to find an alternative to java.util.TreeMap in a threaded environment because of the memory TreeMap consumes and doesn't free, using Sun JDK 1.6. We have a constantly resizing TreeMap, which needs to stay sorted by key:
public class WKey implements Comparable<Object> {
    private Long ms = null;
    private Long id = null;
    public WKey(Long ms, Long id) {
        this.ms = ms;
        this.id = id;
    }
    public int hashCode() {
        final int prime = 31;
        int result = 1;
        result = prime * result + ((id == null) ? 0 : id.hashCode());
        result = prime * result + ((ms == null) ? 0 : ms.hashCode());
        return result;
    }
    public boolean equals(Object obj) {
        if (this == obj)
            return true;
        if (obj == null)
            return false;
        if (getClass() != obj.getClass())
            return false;
        WKey other = (WKey) obj;
        if (id == null) {
            if (other.id != null)
                return false;
        } else if (!id.equals(other.id))
            return false;
        if (ms == null) {
            if (other.ms != null)
                return false;
        } else if (!ms.equals(other.ms))
            return false;
        return true;
    }
    public int compareTo(Object arg0) {
        WKey k = (WKey) arg0;
        if (this.ms < k.ms)
            return -1;
        else if (this.ms.equals(k.ms)) {
            if (this.id < k.id)
                return -1;
            else if (this.id.equals(k.id)) {
                return 0;
            }
        }
        return 1;
    }
}
Thread 1
Iterator<WKey> it = result.keySet().iterator();
if (it.hasNext()) {
    WKey key = it.next();
    /// Some processing here
}
Constantly retrieves the first element within the TreeMap and then
removes it.
Threads 2, 3, and 4
for (Object r : rs) {
    Object[] row = (Object[]) r;
    Long ms = ((Calendar) row[1]).getTimeInMillis();
    Long id = (Long) row[0];
    WKey key = new WKey(ms, id);
    result.put(key, row);
}
Are bulk processing threads which process returned results from various
services, which are generally basic POJOs. POJOs are generated a key
based off their id and timestamp using the key above. I cannot
modify the POJO to implement a Comparator, so I must use this key.
After keys have been identified and process, they are inserted into a
shared tree map where they are getting pulled off in sorted order by
a processing thread.
We were using:
Map<WKey, Object[]> result =
Collections.synchronizedMap(new TreeMap<WKey, Object[]>());
We also tried using ConcurrentSkipListMap:
SortedMap<WKey, Object[]> result =
new ConcurrentSkipListMap<WKey, Object[]>();
We are experimenting with big data and need a collection which uses memory efficiently any time remove or put is called in a threaded environment. We are inserting records by the hundreds of thousands and removing elements from the top as needed. We need a container which can scale. The problem with TreeMap is that it never releases memory unless you recreate the container with Collections.synchronizedMap(new TreeMap()). This is an expensive operation to perform in a threaded environment every time an entry is removed.
Alternatively, I've been experimenting with Javolution. It has a FastSortedMap, which seems to fit in nicely. However, I find their implementation and usage of the collection rather quirky and lacking sufficient documentation and examples.
They do have a few examples listed in the doc, which relate to the classes FastSortedMap is derived from, but nothing seems to work:
A high-performance hash map with real-time behavior. Related to FastCollection, fast map supports various views.
atomic() - Thread-safe view for which all reads are mutex-free and map updates (e.g. putAll) are atomic.
shared() - View allowing concurrent modifications.
parallel() - A view allowing parallel processing including updates.
sequential() - View disallowing parallel processing.
unmodifiable() - View which does not allow any modifications.
entrySet() - FastSet view over the map entries allowing entries to be added/removed.
keySet() - FastSet view over the map keys allowing keys to be added (map entry with null value).
values() - FastCollection view over the map values (add not supported).
I instantiated the following collection as a replacement to TreeMap:
private FastMap<WKey, Object[]> result =
new FastSortedMap<WKey, Object[]>().shared();
However, once another thread touches the container, all the member functions start to fail. I still encounter null values returned from result.iterator().next(), size() sometimes hangs, result.keySet().min() is very sluggish, and result.get returns null. None of the examples in the doc really show how the concurrent views listed above are used. It's really frustrating.
I've looked at Apache Collections, but I'm afraid I might experience the same issue, as many of their sorting collections are derived from java.util HashMaps and TreeMaps. I looked into Guava as well, but their sorted containers require you to implement Comparable on both key and value. I was trying to avoid implementing Comparable on the 'value'. I don't need to sort both objects. If I implemented Comparable on the value, I would just use a sorted list, queue, or table. Highscale and Trove don't have ordered maps. Fastutil may be a candidate, but I'd have to synchronize everything manually, and I'm trying to save time.
I've reviewed others listed in the stackoverflow benchmark post, but the projects listed previously seem to be my best alternatives.
So far, I'm not convinced Javolution is everything they advertise on their site. My experience is that their implementation is very inconsistent, lacking documentation, and performs rather sluggish in threaded environments. TreeMap performs great; I just wish it wouldn't allocate in such large bursts and GC every now and then. However, I'm hoping there might be somebody out there to prove me wrong, may even demonstrate appropriate usage for Javolutions collections in a threaded environment.
Otherwise, if somebody knows a way around resizing Treemaps, without using 'new', or has solved similar/alternative instances working with threading and sorted maps, any info would be greatly appreciated!
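For the access pattern described in the question (several producers inserting, one consumer repeatedly taking the smallest key), a minimal sketch using the JDK's own ConcurrentSkipListMap could look like the following. It declares the map as ConcurrentNavigableMap so that pollFirstEntry(), which removes and returns the first entry in a single atomic step, is available; the SortedMap type used in the declaration above does not expose it. The Key class and the fillAndDrain method are simplified stand-ins made up for illustration, and the sketch makes no claim about how much memory the map holds or releases.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentNavigableMap;
import java.util.concurrent.ConcurrentSkipListMap;

class SortedQueueDemo {
    /** Simplified stand-in for WKey: ordered by ms, then by id. */
    static final class Key implements Comparable<Key> {
        final long ms;
        final long id;
        Key(long ms, long id) { this.ms = ms; this.id = id; }
        @Override public int compareTo(Key o) {
            int c = Long.compare(ms, o.ms);
            return c != 0 ? c : Long.compare(id, o.id);
        }
    }

    /** Fills the map from several threads, then drains it in key order; returns the drained count. */
    static int fillAndDrain(int producerCount, int perProducer) {
        ConcurrentNavigableMap<Key, Object[]> result = new ConcurrentSkipListMap<>();
        Thread[] producers = new Thread[producerCount];
        for (int t = 0; t < producerCount; t++) {
            final long id = t;
            // Producers insert concurrently without external locking.
            producers[t] = new Thread(() -> {
                for (long ms = 0; ms < perProducer; ms++) {
                    result.put(new Key(ms, id), new Object[] { id, ms });
                }
            });
            producers[t].start();
        }
        try {
            for (Thread p : producers) p.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        int drained = 0;
        Key previous = null;
        Map.Entry<Key, Object[]> first;
        // Consumer: pollFirstEntry() returns null once the map is empty.
        while ((first = result.pollFirstEntry()) != null) {
            if (previous != null && previous.compareTo(first.getKey()) >= 0) {
                throw new IllegalStateException("entries came out of order");
            }
            previous = first.getKey();
            drained++;
        }
        return drained;
    }

    public static void main(String[] args) {
        System.out.println("drained " + fillAndDrain(3, 1000) + " entries in key order");
    }
}
```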

