Creating dictionary from Pyspark dataframe showing OutOfMemoryError: Java heap space

I have seen and tried many existing StackOverflow posts regarding this issue, but none of them work. I guess my Java heap space is not large enough for my dataset, which contains 6.5M rows. My Linux instance has 64GB of RAM and 4 cores. As per this suggestion I need to fix my code, but I think making a dictionary from a PySpark dataframe should not be very costly. Please advise if there is another way to compute it.
I just want to make a Python dictionary from my PySpark dataframe. This is the content of the dataframe;
property_sql_df.show() shows:
+--------------+------------+--------------------+--------------------+
|            id|country_code|                name|    hash_of_cc_pn_li|
+--------------+------------+--------------------+--------------------+
|  BOND-9129450|          US|Scotron Home w/Ga...|90cb0946cf4139e12...|
|  BOND-1742850|          US|Sited in the Mead...|d5c301f00e9966483...|
|  BOND-3211356|          US|NEW LISTING - Com...|811fa26e240d726ec...|
|  BOND-7630290|          US|EC277- 9 Bedroom ...|d5c301f00e9966483...|
|  BOND-7175508|          US|East Hampton Retr...|90cb0946cf4139e12...|
+--------------+------------+--------------------+--------------------+
What I want is to make a dictionary with hash_of_cc_pn_li as key and id as a list value.
Expected Output
{
    "90cb0946cf4139e12": ["BOND-9129450", "BOND-7175508"],
    "d5c301f00e9966483": ["BOND-1742850", "BOND-7630290"]
}
What I have tried so far,
%%time
duplicate_property_list = {}
for ind in property_sql_df.collect():
    hashed_value = ind.hash_of_cc_pn_li
    property_id = ind.id
    if hashed_value in duplicate_property_list:
        duplicate_property_list[hashed_value].append(property_id)
    else:
        duplicate_property_list[hashed_value] = [property_id]
What I get now on the console:
java.lang.OutOfMemoryError: Java heap space
and this error is shown in the Jupyter notebook output:
ERROR:py4j.java_gateway:An error occurred while trying to connect to the Java server (127.0.0.1:33097)

"making a dictionary from pyspark dataframe should not be very costly"
This is true in terms of runtime, but it can easily take up a lot of memory, especially if you're calling property_sql_df.collect(), at which point you're loading your entire dataframe into driver memory. At 6.5M rows, you'll already hit 65GB if each row takes 10KB (roughly 10K characters), and that is before the dictionary is even built.
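The arithmetic behind that estimate can be checked with a short sketch. The 10KB per row is the answer's hypothetical figure. The 60 bytes per row for a trimmed two-column selection is purely an illustrative assumption, and neither number accounts for JVM or Python object overhead.

```python
def estimate_collect_size_gb(num_rows: int, bytes_per_row: int) -> float:
    """Rough size of the raw row data in decimal gigabytes (1 GB = 10**9 bytes)."""
    return num_rows * bytes_per_row / 10**9


# 6.5M rows at the hypothetical 10 KB per row used in the answer
full_rows_gb = estimate_collect_size_gb(6_500_000, 10_000)

# Same row count if only two short string columns are kept
# (60 bytes per row is an assumed figure, not a measurement)
two_columns_gb = estimate_collect_size_gb(6_500_000, 60)

print(full_rows_gb, two_columns_gb)
```

Real memory use on the driver will be higher than the raw data size, so treat these numbers as a lower bound.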
First, you can collect just the columns you need (e.g. not name). Second, you can do the aggregation upstream in Spark, which will save some space depending on how many ids there are per hash_of_cc_pn_li:
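from pyspark.sql.functions import collect_set

# Group in Spark so that only one row per hash reaches the driver
rows = property_sql_df.groupBy("hash_of_cc_pn_li") \
    .agg(collect_set("id").alias("ids")) \
    .collect()
duplicate_property_list = {row.hash_of_cc_pn_li: row.ids for row in rows}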
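If the grouping has to happen on the driver anyway, the dictionary can at least be built from a stream of (hash, id) pairs instead of a fully collected list. The helper below is a minimal sketch. The name group_ids_by_hash is hypothetical, and the sample data stands in for rows coming from Spark.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def group_ids_by_hash(pairs: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group ids under their hash, consuming the input one pair at a time."""
    groups = defaultdict(list)
    for hash_value, property_id in pairs:
        groups[hash_value].append(property_id)
    return dict(groups)


# Plain tuples standing in for (hash_of_cc_pn_li, id) rows
sample = [
    ("90cb", "BOND-9129450"),
    ("d5c3", "BOND-1742850"),
    ("811f", "BOND-3211356"),
    ("d5c3", "BOND-7630290"),
    ("90cb", "BOND-7175508"),
]
duplicate_property_list = group_ids_by_hash(sample)
print(duplicate_property_list)
```

With Spark, one option (an assumption, not something from the post) is to pass property_sql_df.select("hash_of_cc_pn_li", "id").toLocalIterator() as the input, which fetches one partition at a time. The finished dictionary still has to fit in driver memory.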

Adding the accepted answer from the linked post for posterity. It solves the problem by using the write.json method, which avoids collecting a too-large dataset to the driver:
https://stackoverflow.com/a/63111765/12378881
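The linked answer is not reproduced here, so what follows is only a sketch of the reading side. It assumes the grouped DataFrame was written with DataFrame.write.json, which produces a directory of uncompressed JSON Lines part files. The field name ids, the file names, and the helper load_groups are all assumptions for illustration.

```python
import glob
import json
import os
import tempfile


def load_groups(output_dir, key_field="hash_of_cc_pn_li", ids_field="ids"):
    """Merge JSON Lines part files into one dict of key -> list of ids."""
    groups = {}
    # Spark names its output files part-*; the .json suffix assumes no compression.
    for path in sorted(glob.glob(os.path.join(output_dir, "part-*.json"))):
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                if not line.strip():
                    continue
                record = json.loads(line)
                groups.setdefault(record[key_field], []).extend(record[ids_field])
    return groups


# Demo with hand-written files standing in for Spark's output directory.
with tempfile.TemporaryDirectory() as demo_dir:
    with open(os.path.join(demo_dir, "part-00000.json"), "w", encoding="utf-8") as f:
        f.write('{"hash_of_cc_pn_li": "90cb", "ids": ["BOND-9129450", "BOND-7175508"]}\n')
    with open(os.path.join(demo_dir, "part-00001.json"), "w", encoding="utf-8") as f:
        f.write('{"hash_of_cc_pn_li": "d5c3", "ids": ["BOND-1742850", "BOND-7630290"]}\n')
    duplicate_property_list = load_groups(demo_dir)

print(duplicate_property_list)
```

Reading the files line by line keeps the JVM out of the picture, but the resulting dictionary must still fit in the memory of the Python process that builds it.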

Here's how to make a sample DataFrame with your data:
data = [
    ("BOND-9129450", "90cb"),
    ("BOND-1742850", "d5c3"),
    ("BOND-3211356", "811f"),
    ("BOND-7630290", "d5c3"),
    ("BOND-7175508", "90cb"),
]
df = spark.createDataFrame(data, ["id", "hash_of_cc_pn_li"])
Let's aggregate the data in a Spark DataFrame to limit the number of rows that are collected on the driver node. We'll use the two_columns_to_dictionary function defined in quinn to create the dictionary.
import pyspark.sql.functions as F
import quinn

agg_df = df.groupBy("hash_of_cc_pn_li").agg(
    F.max("hash_of_cc_pn_li").alias("hash"),
    F.collect_list("id").alias("id"),
)
res = quinn.two_columns_to_dictionary(agg_df, "hash", "id")
print(res) # => {'811f': ['BOND-3211356'], 'd5c3': ['BOND-1742850', 'BOND-7630290'], '90cb': ['BOND-9129450', 'BOND-7175508']}
This might work on a relatively small, 6.5 million row dataset, but it won't work on a huge dataset. "I think making a dictionary from pyspark dataframe should not be very costly" is only true for DataFrames that are really tiny. Making a dictionary from a PySpark DataFrame is actually very expensive.
PySpark is a cluster computing framework that benefits from having data spread out across nodes in a cluster. When you call collect, all the data is moved to the driver node and the worker nodes don't help. You'll get an OutOfMemory exception whenever you try to move too much data to the driver node.
It's probably best to avoid the dictionary entirely and figure out a different way to solve the problem. Great question.

From Spark 2.4 onward, we can use groupBy together with the built-in functions collect_list, map_from_arrays, and to_json for this case.
Example:
df.show()
#+------------+-----------------+
#| id| hash_of_cc_pn_li|
#+------------+-----------------+
#|BOND-9129450|90cb0946cf4139e12|
#|BOND-7175508|90cb0946cf4139e12|
#|BOND-1742850|d5c301f00e9966483|
#|BOND-7630290|d5c301f00e9966483|
#+------------+-----------------+
from pyspark.sql.functions import col, collect_list

# One JSON object per hash
df.groupBy(col("hash_of_cc_pn_li")).\
    agg(collect_list(col("id")).alias("id")).\
    selectExpr("to_json(map_from_arrays(array(hash_of_cc_pn_li),array(id))) as output").\
    show(10,False)
#+-----------------------------------------------------+
#|output |
#+-----------------------------------------------------+
#|{"90cb0946cf4139e12":["BOND-9129450","BOND-7175508"]}|
#|{"d5c301f00e9966483":["BOND-1742850","BOND-7630290"]}|
#+-----------------------------------------------------+
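To get one dict, use another agg with collect_list:
from pyspark.sql.functions import col, collect_list, map_from_arrays, to_json

# Second agg (no grouping) folds all hashes into a single map
df.groupBy(col("hash_of_cc_pn_li")).\
    agg(collect_list(col("id")).alias("id")).\
    agg(to_json(map_from_arrays(collect_list(col("hash_of_cc_pn_li")),collect_list(col("id")))).alias("output")).\
    show(10,False)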
#+---------------------------------------------------------------------------------------------------------+
#|output |
#+---------------------------------------------------------------------------------------------------------+
#|{"90cb0946cf4139e12":["BOND-9129450","BOND-7175508"],"d5c301f00e9966483":["BOND-1742850","BOND-7630290"]}|
#+---------------------------------------------------------------------------------------------------------+
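The output column is a JSON string, so turning it into a Python dictionary on the driver takes one more step. The sketch below hard-codes the string shown in the table above. In a real job it could be fetched with something like .first()["output"] instead of show(), which is an assumption and not part of the original answer.

```python
import json

# The single-row output from the table above, copied as a plain string
output = (
    '{"90cb0946cf4139e12":["BOND-9129450","BOND-7175508"],'
    '"d5c301f00e9966483":["BOND-1742850","BOND-7630290"]}'
)

duplicate_property_list = json.loads(output)
print(duplicate_property_list["90cb0946cf4139e12"])
```

Note that the second agg concentrates every hash and id into one row, so for large inputs this moves the memory pressure to a single task rather than removing it.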

Related

Using training made with python API as input to LabelImage module in java API?

I have a problem with the TensorFlow Java API. I ran the training using the TensorFlow Python API, generating the files output_graph.pb and output_labels.txt. Now I want to use those files as input to the LabelImage module in the TensorFlow Java API. I thought everything would work, since that module needs exactly one .pb and one .txt file. Nevertheless, when I run the module, I get this error:
2017-04-26 10:12:56.711402: W tensorflow/core/framework/op_def_util.cc:332] Op BatchNormWithGlobalNormalization is deprecated. It will cease to work in GraphDef version 9. Use tf.nn.batch_normalization().
Exception in thread "main" java.lang.IllegalArgumentException: No Operation named [input] in the Graph
at org.tensorflow.Session$Runner.operationByName(Session.java:343)
at org.tensorflow.Session$Runner.feed(Session.java:137)
at org.tensorflow.Session$Runner.feed(Session.java:126)
at it.zero11.LabelImage.executeInceptionGraph(LabelImage.java:115)
at it.zero11.LabelImage.main(LabelImage.java:68)
I would be very grateful if you could help me find where the problem is. Furthermore, is there a way to run the training from the TensorFlow Java API? That would make things easier.
To be more precise:
As a matter of fact, I do not use self-written code, at least for the relevant steps. All I have done is run the training with this module, https://github.com/tensorflow/tensorflow/blob/master/tensorflow/examples/image_retraining/retrain.py, feeding it the directory that contains the images, divided among subdirectories according to their description. In particular, I think these are the lines that generate the outputs:
output_graph_def = graph_util.convert_variables_to_constants(
    sess, graph.as_graph_def(), [FLAGS.final_tensor_name])
with gfile.FastGFile(FLAGS.output_graph, 'wb') as f:
    f.write(output_graph_def.SerializeToString())
with gfile.FastGFile(FLAGS.output_labels, 'w') as f:
    f.write('\n'.join(image_lists.keys()) + '\n')
Then, I give the outputs (one some_graph.pb and one some_labels.txt) as input to this java module: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/java/src/main/java/org/tensorflow/examples/LabelImage.java, replacing the default inputs. The error I get is the one reported above.

The model used by default in LabelImage.java is different from the model that is being retrained, so the names of the input and output nodes do not align. Note that TensorFlow models are graphs, and the arguments to feed() and fetch() are names of nodes in the graph. So you need to know the names appropriate for your model.
Looking at retrain.py, it seems that it has a node that takes the raw contents of a JPEG file as input (the node DecodeJpeg/contents) and produces the set of labels in the node final_result.
If that's the case, then you'd do something like the following in Java. You don't need the bit that constructs a graph to normalize the image, since that seems to be part of the retrained model, so replace LabelImage.java:64 with something like:
try (Tensor image = Tensor.create(imageBytes);
    Graph g = new Graph()) {
  g.importGraphDef(graphDef);
  try (Session s = new Session(g);
      // Note the change to the name of the node and the fact
      // that it is being provided the raw imageBytes as input
      Tensor result = s.runner().feed("DecodeJpeg/contents", image).fetch("final_result").run().get(0)) {
    final long[] rshape = result.shape();
    if (result.numDimensions() != 2 || rshape[0] != 1) {
      throw new RuntimeException(
          String.format(
              "Expected model to produce a [1 N] shaped tensor where N is the number of labels, instead it produced one with shape %s",
              Arrays.toString(rshape)));
    }
    int nlabels = (int) rshape[1];
    float[] probabilities = result.copyTo(new float[1][nlabels])[0];
    // At this point nlabels = number of classes in your retrained model
    DoSomethingWith(probabilities);
  }
}
Hope that helps.

Regarding the "No operation" error, I was able to resolve that by using input and output layer names "Mul" and "final_result", respectively. See:
https://github.com/tensorflow/tensorflow/issues/2883

Save a spark RDD using mapPartition with iterator

I have some intermediate data that I need to store both in HDFS and locally. I'm using Spark 1.6. In HDFS, the intermediate data ends up in /output/testDummy/part-00000 and /output/testDummy/part-00001. I want to save these partitions locally using Java/Scala, either merged as /users/home/indexes/index.nt or separately as /users/home/indexes/index-0000.nt and /home/indexes/index-0001.nt.
Here is my code:
Note: testDummy is the same as test, and the output has two partitions. I want to store them separately or combined, but locally, in an index.nt file. I would prefer to store them separately on two data nodes. I'm using a cluster and submit the Spark job on YARN. I also added comments showing how many times each line prints and what data I'm getting. How can I do this? Any help is appreciated.
val testDummy = outputFlatMapTuples.coalesce(Constants.INITIAL_PARTITIONS).saveAsTextFile(outputFilePathForHDFS+"/testDummy")
println("testDummy done") //1 time print

def savesData(iterator: Iterator[(String)]): Iterator[(String)] = {
  println("Inside savesData") // now 4 times when coalesce(Constants.INITIAL_PARTITIONS)=2
  println("iter size"+iterator.size) // 2 735 2 735 values
  val filenamesWithExtension = outputPath + "/index.nt"
  println("filenamesWithExtension "+filenamesWithExtension.length) //4 times
  var list = List[(String)]()

  val fileWritter = new FileWriter(filenamesWithExtension,true)
  val bufferWritter = new BufferedWriter(fileWritter)

  while (iterator.hasNext){ //iterator.hasNext is false
    println("inside iterator") //0 times
    val dat = iterator.next()
    println("datadata "+iterator.next())

    bufferWritter.write(dat + "\n")
    bufferWritter.flush()
    println("index files written")

    val dataElements = dat.split(" ")
    println("dataElements") //0
    list = list.::(dataElements(0))
    list = list.::(dataElements(1))
    list = list.::(dataElements(2))
  }
  bufferWritter.close() //closing
  println("savesData method end") //4 times when coal=2
  list.iterator
}

println("before saving data into local") //1
val test = outputFlatMapTuples.coalesce(Constants.INITIAL_PARTITIONS).mapPartitions(savesData)
println("testRDD partitions "+test.getNumPartitions) //2
println("testRDD size "+test.collect().length) //0
println("after saving data into local") //1
PS: I followed this and this, but they are not exactly what I'm looking for. I got something running, but nothing ends up in index.nt.

A couple of things:
Never call Iterator.size if you plan to use the data later. Iterators are TraversableOnce. The only way to compute an Iterator's size is to traverse all of its elements, and after that there is no more data to read.
Don't use transformations like mapPartitions for side effects. It is bad practice and doesn't guarantee that a given piece of code will be executed only once. If you want to perform some type of IO, use actions like foreach / foreachPartition.
A local path inside an action or transformation is a local path on the particular worker. If you want to write directly on the client machine, you should fetch the data first with collect or toLocalIterator. It could be better, though, to write to distributed storage and fetch the data later.
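The first point is the reason the while loop in the question never runs. Python iterators are single-pass in the same way as Scala's Iterator, so the effect can be shown with a small Python sketch. The function names and sample lines below are hypothetical.

```python
def count_then_read(iterator):
    """Mimics the question's code: take the size first, then try to read."""
    size = sum(1 for _ in iterator)  # traverses (and exhausts) the iterator
    remaining = list(iterator)       # nothing is left to read
    return size, remaining


def write_and_count(iterator, sink):
    """Counts elements while consuming them, so each one is seen exactly once."""
    count = 0
    for item in iterator:
        sink.append(item)  # stand-in for writing the line to a file
        count += 1
    return count


lines = ["s1 p1 o1", "s2 p2 o2"]

broken = count_then_read(iter(lines))
written = []
count = write_and_count(iter(lines), written)

print(broken)
print(count, written)
```

Counting inside the loop avoids both the double traversal and the need to hold the whole partition in memory.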

Java 7 provides a means to watch directories:
https://docs.oracle.com/javase/tutorial/essential/io/notification.html
The idea is to create a watch service and register the directory of interest with it, specifying the events you care about (file creation, deletion, etc.). While it is watching, you are notified of any such event and can take whatever action you want.
You will have to depend heavily on the Java HDFS API wherever applicable.
Run the program in the background, since it waits for events forever. (You can write logic to quit after you have done whatever you want.)
On the other hand, shell scripting would also help.
Be aware of the coherency model of the HDFS file system while reading files.
Hope this gives you some ideas.

What is the replacement for summing list in Scala-Scalding

I have the following code, in which I maintain a large List. What I do here is go over the data stream and create an inverted index. I use the Twitter Scalding API, and dataTypePipe is a TypedPipe.
lazy val cats = dataTypePipe.cross(cmsCats)
  .map(vf => (vf._1.itemId, vf._1.leafCats, vf._2))
  .flatMap {
    case (id, categorySet, cHhitters) => categorySet.map(cat => (
    ...
  }
  .filter(f => f._2.nonEmpty)
  .group.withReducers(4000)
  .sum
  .map {
    case ((token,bucket), ids) =>
      toIndexedRecord(ids, token, bucket)
  }
Due to a serialization issue, I convert the Scala list to a Java list and use Avro to write:
def toIndexedRecord(ids: List[Long], token: String, bucket: Int): IndexRecord = {
  val javaList = ids.map(l => l: java.lang.Long).asJava //need to convert from scala long to java long
  new IndexRecord(token, bucket,javaList)
}
But the problem is that keeping such a large amount of information in a list causes a Java heap issue. I believe the summing also contributes to it:
2013-08-25 16:41:09,709 WARN org.apache.hadoop.mapred.Child: Error running child
cascading.pipe.OperatorException: [_pipe_0*_pipe_1][com.twitter.scalding.GroupBuilder$$anonfun$1.apply(GroupBuilder.scala:189)] operator Every failed executing operation: MRMAggregator[decl:'value']
at cascading.flow.stream.AggregatorEveryStage.receive(AggregatorEveryStage.java:136)
at cascading.flow.stream.AggregatorEveryStage.receive(AggregatorEveryStage.java:39)
at cascading.flow.stream.OpenReducingDuct.receive(OpenReducingDuct.java:49)
at cascading.flow.stream.OpenReducingDuct.receive(OpenReducingDuct.java:28)
at cascading.flow.hadoop.stream.HadoopGroupGate.run(HadoopGroupGate.java:90)
at cascading.flow.hadoop.FlowReducer.reduce(FlowReducer.java:133)
at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:522)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:421)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1232)
at org.apache.hadoop.mapred.Child.main(Child.java:249)
Caused by: java.lang.OutOfMemoryError: Java heap space
at scala.collection.mutable.ListBuffer.$plus$eq(ListBuffer.scala:168)
at scala.collection.mutable.ListBuffer.$plus$eq(ListBuffer.scala:45)
at scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:48)
at scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:48)
at scala.collection.immutable.List.foreach(List.scala:318)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:176)
at scala.collection.immutable.List.$colon$colon$colon(List.scala:127)
at scala.collection.immutable.List.$plus$plus(List.scala:193)
at com.twitter.algebird.ListMonoid.plus(Monoid.scala:86)
at com.twitter.algebird.ListMonoid.plus(Monoid.scala:84)
at com.twitter.scalding.KeyedList$$anonfun$sum$1.apply(TypedPipe.scala:264)
at com.twitter.scalding.MRMAggregator.aggregate(Operations.scala:279)
at cascading.flow.stream.AggregatorEveryStage.receive(AggregatorEveryStage.java:128)
So my question is: what can I do to avoid this situation?

Try .forceToReducers before the .sum. This OOM is happening map side as we are caching values. That may not help in your case.
If the lists are truly too large however, there is really very little that can be done.

Quick but unscalable answer: try increasing mapred.child.java.opts.
Better answer: well, it's a little tricky to understand the question, because I don't know the types of your vals and I don't know what f and vf are, since you haven't named them informatively. If you provide the minimal amount of code required for me to paste into an IDE and play around with, I might find your problem.
sum might be where the OOM happens, but it is not what is causing it; refactoring to do the sum in a different way won't help.
Chances are you're crossing with something too big to fit in memory, so mapred.child.java.opts might be the only solution for you unless you completely restructure your data. Note that cross calls crossWithTiny, and tiny means tiny :)

Jena Rule Engine with TDB

I have my data loaded in a TDB model and have written some rules using Jena to apply to the TDB. I then store the inferred data in a new TDB.
I applied this to a small dataset (~200kb) and it worked just fine. However, my actual TDB is 2.7G, and the computer has been running for about a week and is in fact still running.
Is that normal, or am I doing something wrong? What alternatives to the Jena rule engine could I use?
Here is a small piece of the code:
public class Ruleset {
    private List<Rule> rules = null;
    private GenericRuleReasoner reasoner = null;

    public Ruleset (String rulesSource){
        this.rules = Rule.rulesFromURL(rulesSource);
        this.reasoner = new GenericRuleReasoner(rules);
        reasoner.setOWLTranslation(true);
        reasoner.setTransitiveClosureCaching(true);
    }

    public InfModel applyto(Model mode){
        return ModelFactory.createInfModel(reasoner, mode);
    }

    public static void main(String[] args) {
        System.out.println(" ... Running the Rule Engine ...");
        String rulepath = "src/schemaRules.osr";
        Ruleset rule = new Ruleset (rulepath);
        InfModel infedModel = rule.applyto(data.tdb);
        infdata.close();
    }
}

A large dataset in a persistent store is not a good match with Jena's rule system. The basic problem is that the RETE engine will make many small queries into the graph during rule propagation. The overhead in making these queries to any persistent store, including TDB, tends to make the execution times unacceptably long, as you have found.
Depending on your goals for employing inference, you may have some alternatives:
Load your data into a large enough memory graph, then save the inference closure (the base graph plus the entailments) to a TDB store in a single transaction. Thereafter, you can query the store without incurring the overhead of the rules system. Updates, obviously, can be an issue with this approach.
Have your data in TDB, as now, but load a subset dynamically into a memory model to use live with inference. Makes updates easier (as long as you update both the memory copy and the persistent store), but requires you to partition your data.
If you only want some basic inferences, such as closure of the rdfs:subClassOf hierarchy, you can use the infer command line tool to generate an inference closure which you can load into TDB:
$ infer -h
infer --rdfs=vocab FILE ...
General
-v --verbose Verbose
-q --quiet Run with minimal output
--debug Output information for debugging
--help
--version Version information
Infer can be more efficient, because it doesn't require a large memory model. However, it is restricted in the inferences that it will compute.
If none of these work for you, you may want to consider commercial inference engines such as OWLIM or Stardog.

Thanks Ian.
I was actually able to do it via SPARQL Update, as Dave advised me to, and it took only 10 minutes to finish the job.
Here is an example of the code:
System.out.println(" ... Load rules ...");
data.startQuery();
String query = data.loadQuery("src/sparqlUpdatesRules.tql");
data.endQuery();
System.out.println(" ... Inserting rules ...");
UpdateAction.parseExecute(query, inferredData.tdb);
System.out.println(" ... Printing RDF ...");
inferredData.exportRDF();
System.out.println(" ... closeing ...");
inferredData.close();
And here is an example of the SPARQL update:
INSERT {
    ?w ddids:carries ?p .
} WHERE {
    ?p ddids:is_in ?w .
};
Thanks for your answers.

SpreadsheetAddRows failing on moderate size query

Edit: I changed the title because there is a similar SO question, How do I fix SpreadSheetAddRows function crashing when adding a large query?, that describes my issue, so I phrased it more succinctly. The issue is that spreadsheetAddRows on my query result bombs the entire server at what I consider a moderate size (1600 rows, 27 columns), which is considerably less than his 18,000 rows.
I am using an Oracle stored procedure, accessed via ColdFusion 9.0.1 cfstoredproc, that on completion creates a spreadsheet for the user to download.
The issue is that result sets greater than about 1200 rows return a 500 Internal Server Error, while 700 rows return fine, so I am guessing it is a memory problem?
The only message I received other than the 500 Internal Server Error in the standard ColdFusion look was "gc overhead limit exceeded" in small print, and that appeared only once, on a page refresh. It refers to the underlying JVM.
I am not even sure how to go about diagnosing this.
Here is the end of the cfstoredproc and the spreadsheet object:
<!--- variables assigned correctly above --->
<cfprocresult name="RC1">
</cfstoredproc>
<cfset sObj = spreadsheetNew("reconcile","yes")>
<cfset SpreadsheetAddRow(sObj, "Column_1, ... , Column27")>
<cfset SpreadsheetFormatRow(sObj, {bold=TRUE, alignment="center"}, 1)>
<cfset spreadsheetAddRows(sObj, RC1)>
<cfheader name="content-disposition" value="attachment; filename=report_#Dateformat(NOW(),"MMDDYYYY")#.xlsx">
<cfcontent type="application/vnd.openxmlformats-officedocument.spreadsheetml.sheet" variable="#spreadsheetReadBinary(sObj)#">

My answer lies with ColdFusion and one simple fact: DO NOT USE SpreadsheetAddRows or any of the related functions like SpreadsheetFormatRows.
My solution was to execute the query, create an xls file, use the cfspreadsheet tag to write to the newly created xls file, then serve it to the browser, deleting the file after serving.
Using SpreadsheetAddRows, the runtime ranged from crashing the server on 1000+ rows to 5+ minutes on 700 rows.
Using the method outlined above, it takes 1-1.5 seconds.
If you are interested in more code, just comment and I can provide it. I am using the ColdBox framework, so I didn't think the specifics would help, just the new workflow.
