I expected the new mapWithState API for Spark 1.6+ to remove timed-out objects near-immediately, but there is a delay.
I'm testing the API with the adapted version of the JavaStatefulNetworkWordCount below:
SparkConf sparkConf = new SparkConf()
.setAppName("JavaStatefulNetworkWordCount")
.setMaster("local[*]");
JavaStreamingContext ssc = new JavaStreamingContext(sparkConf, Durations.seconds(1));
ssc.checkpoint("./tmp");
StateSpec<String, Integer, Integer, Tuple2<String, Integer>> mappingFunc =
StateSpec.function((word, one, state) -> {
if (state.isTimingOut())
{
System.out.println("Timing out the word: " + word);
return new Tuple2<String,Integer>(word, state.get());
}
else
{
int sum = one.or(0) + (state.exists() ? state.get() : 0);
Tuple2<String, Integer> output = new Tuple2<String, Integer>(word, sum);
state.update(sum);
return output;
}
});
JavaMapWithStateDStream<String, Integer, Integer, Tuple2<String, Integer>> stateDstream =
ssc.socketTextStream(args[0], Integer.parseInt(args[1]),
StorageLevels.MEMORY_AND_DISK_SER_2)
.flatMap(x -> Arrays.asList(SPACE.split(x)))
.mapToPair(w -> new Tuple2<String, Integer>(w, 1))
.mapWithState(mappingFunc.timeout(Durations.seconds(5)));
stateDstream.stateSnapshots().print();
I test it together with nc (nc -l -p <port>).
When I type a word into the nc window I see the tuple being printed in the console every second. But it doesn't seem like the timing out message gets printed out 5s later, as expected based on the timeout set. The time it takes for the tuple to expire seems to vary between 5 & 20s.
Am I missing some configuration option, or is the timeout perhaps only performed at the same time as checkpoints?
Once an event times out it's NOT deleted right away, but is only marked for deletion by saving it to a 'deltaMap':
override def remove(key: K): Unit = {
val stateInfo = deltaMap(key)
if (stateInfo != null) {
stateInfo.markDeleted()
} else {
val newInfo = new StateInfo[S](deleted = true)
deltaMap.update(key, newInfo)
}
}
Then, timed-out events are collected and sent to the output stream only at checkpoint time. That is, events which time out at batch t will appear in the output stream only at the next checkpoint, which by default happens after 5 batch intervals on average, i.e. around batch t+5:
override def checkpoint(): Unit = {
super.checkpoint()
doFullScan = true
}
...
removeTimedoutData = doFullScan // remove timedout data only when full scan is enabled
...
// Get the timed out state records, call the mapping function on each and collect the
// data returned
if (removeTimedoutData && timeoutThresholdTime.isDefined) {
...
Elements are actually removed only when there are enough of them, and when the state map is being serialized, which currently also happens only at checkpoint:
/** Whether the delta chain length is long enough that it should be compacted */
def shouldCompact: Boolean = {
deltaChainLength >= deltaChainThreshold
}
// Write the data in the parent state map while copying the data into a new parent map for
// compaction (if needed)
val doCompaction = shouldCompact
...
By default checkpointing occurs every 10 iterations, thus in the example above every 10 seconds; since your timeout is 5 seconds, events are expected within 5-15 seconds.
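If the delay matters for your use case, one workaround (based on my reading of MapWithStateDStreamImpl, which forwards checkpoint() to the internal state DStream) is to shorten the checkpoint interval explicitly before starting the context; the 1-second value below is just an example and trades faster timeout reporting for more frequent checkpoint I/O:
// Shorten the checkpoint interval of the state stream (default is 10x the batch interval),
// so the full scan that emits timed-out entries runs more often. Call this before ssc.start().
stateDstream.checkpoint(Durations.seconds(1));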
EDIT: Corrected and elaborated answer following comments by @YuvalItzchakov
Am I missing some configuration option, or is the timeout perhaps only
performed at the same time as snapshots?
Every time a mapWithState is invoked (with your configuration, around every 1 second), the MapWithStateRDD will internally check for expired records and time them out. You can see it in the code:
// Get the timed out state records, call the mapping function on each and collect the
// data returned
if (removeTimedoutData && timeoutThresholdTime.isDefined) {
newStateMap.getByTime(timeoutThresholdTime.get).foreach { case (key, state, _) =>
wrappedState.wrapTimingOutState(state)
val returned = mappingFunction(batchTime, key, None, wrappedState)
mappedData ++= returned
newStateMap.remove(key)
}
}
(Other than the time taken to execute each job, it turns out that newStateMap.remove(key) actually only marks records for deletion. See "Edit" for more.)
You have to take into account the time it takes for each stage to be scheduled, and the amount of time it takes for each execution of such a stage to actually take its turn and run. It isn't accurate because this runs as a distributed system where other factors can come into play, making your timeout more or less accurate than you expect it to be.
Edit
As @etov rightly points out, newStateMap.remove(key) doesn't actually remove the element from the OpenHashMapBasedStateMap[K, S], but simply marks it for deletion. This is also a reason why you're seeing the expiration time add up.
The actual relevant piece of code is here:
// Write the data in the parent state map while
// copying the data into a new parent map for compaction (if needed)
val doCompaction = shouldCompact
val newParentSessionStore = if (doCompaction) {
val initCapacity = if (approxSize > 0) approxSize else 64
new OpenHashMapBasedStateMap[K, S](initialCapacity = initCapacity, deltaChainThreshold)
} else { null }
val iterOfActiveSessions = parentStateMap.getAll()
var parentSessionCount = 0
// First write the approximate size of the data to be written, so that readObject can
// allocate appropriately sized OpenHashMap.
outputStream.writeInt(approxSize)
while(iterOfActiveSessions.hasNext) {
parentSessionCount += 1
val (key, state, updateTime) = iterOfActiveSessions.next()
outputStream.writeObject(key)
outputStream.writeObject(state)
outputStream.writeLong(updateTime)
if (doCompaction) {
newParentSessionStore.deltaMap.update(
key, StateInfo(state, updateTime, deleted = false))
}
}
// Write the final limit marking object with the correct count of records written.
val limiterObj = new LimitMarker(parentSessionCount)
outputStream.writeObject(limiterObj)
if (doCompaction) {
parentStateMap = newParentSessionStore
}
If deltaMap should be compacted (marked with the doCompaction variable), then (and only then) is the map cleared of all the deleted instances. How often does that happen? Once the delta chain exceeds the threshold:
val DELTA_CHAIN_LENGTH_THRESHOLD = 20
That is, when the delta chain is longer than 20 items and there are items that have been marked for deletion.
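If compaction frequency matters for your workload, the threshold appears to be configurable: based on my reading of StateMap.create in the 1.6 sources it is read from the spark.streaming.sessionByKey.deltaChainThreshold setting, but please verify that key against your Spark version before relying on it.
// Assumed configuration key, taken from StateMap.create() in the Spark 1.6 sources.
SparkConf sparkConf = new SparkConf()
.setAppName("JavaStatefulNetworkWordCount")
.setMaster("local[*]")
.set("spark.streaming.sessionByKey.deltaChainThreshold", "5"); // default is 20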
Related
I'm using Ignite 2.13 in a Windows environment.
I have exclusive use of a 110 node compute grid. I want to complete a ComputeTask that maps to 100 ComputeJobs. The jobs take a variable amount of time to complete. As the compute progresses many of the nodes become idle waiting for the remaining jobs to complete. Midway through the compute I'd like to find all the idle nodes and then kill some subset of them, leaving a few in case one of the remaining jobs randomly fails and needs to failover.
In the ComputeTask the "result" method is called as Jobs complete. From the "result" method I'm trying to find all the grid nodes with an active job. Any node that does not have an active job is assumed idle and potentially eligible to be killed off.
I have two methods for finding the nodes with an active job getActiveNodesOld and getActiveNodesLocal. They return drastically different answers.
// This version uses this node's view of the all the cluster.
// This status lags behind reality by 2 seconds.
public static Collection<UUID> getActiveNodesOld(Ignite grid)
{
Set<UUID> retval = new LinkedHashSet<>();
Collection<ClusterNode> nodes = grid.cluster().nodes();
for(ClusterNode node : nodes)
{
int activeJobs = node.metrics().getCurrentActiveJobs();
if(activeJobs > 0)
{
retval.add(node.id());
}
}
return retval;
}
// This version asks each node about its own status. The results are supposed to be more
// accurate.
public static Collection<UUID> getActiveNodesLocal(Ignite grid)
{
Set<UUID> retval = new LinkedHashSet<>();
IgniteCompute compute = grid.compute();
Collection<IgniteBiTuple<UUID, Integer>> metricsMap =
compute.broadcast(new ActiveJobCallable());
for(Map.Entry<UUID, Integer> entry : metricsMap)
{
UUID id = entry.getKey();
int activeJobs = entry.getValue();
if(activeJobs > 0)
{
retval.add(id);
}
if(activeJobs > 1){
logger.info("Warning: Node:" + id + " has " + activeJobs + " active jobs. ");
}
}
return retval;
}
private static class ActiveJobCallable implements IgniteCallable<IgniteBiTuple<UUID, Integer>>
{
@IgniteInstanceResource
private transient Ignite ignite;
public ActiveJobCallable()
{
}
@Override
public IgniteBiTuple<UUID, Integer> call()
{
ClusterNode clusterNode = ignite.cluster().localNode();
ClusterMetrics nodeMetrics = clusterNode.metrics();
return new IgniteBiTuple<>(clusterNode.id(), nodeMetrics.getCurrentActiveJobs());
}
}
These two methods return very different results with regard to which nodes are active. The first time through, getActiveNodesLocal reports that there are 0 active nodes. I can see how this is possible - perhaps a job is not considered complete until "result" returns - so I'll give it the benefit of the doubt. The other method, getActiveNodesOld, indicates there is a large number of idle nodes, even though it's being called for the first completed compute job, so the remainder of the grid should still be working on compute jobs.
As the compute progresses neither method produces the answers I'd expect. The nodes seem to start reporting 1 activeJob and then never report 0 activeJobs even after the job is complete. Do I need to be resetting the ClusterMetrics somehow?
The method getActiveNodesLocal broadcasts an IgniteCallable into the grid. Does that callable get counted in the results of getCurrentActiveJobs()? It doesn't seem to but I thought that might explain the weird results I'm seeing. I never see the logger message I added that should be triggered if "activeJobs > 1".
Is there some other way I should accomplish my goal of finding nodes that are idling? Can I find, for example, which node each job is currently assigned to and then determine which nodes don't have a job assignment? I don't see a method on the ComputeTask to determine the node<->job mapping, but maybe I'm just overlooking it.
I suppose I could have the nodes send a grid message when they start and also when they complete a job and I could track which jobs are active but that feels like the wrong solution.
Skip ClusterMetrics entirely. Create an IgniteCache and put the node UUID in the cache when the Job is started and remove the UUID when the job is completed.
Finding the Nodes that are still in the cluster and have a job could be done like this:
public static Collection<UUID> getActiveNodesCache(Ignite grid)
{
Set<UUID> retval = new TreeSet<>();
IgniteCache<UUID, ComputeJob> workingOnMap = grid.cache(JOB_CACHE);
workingOnMap.query(new ScanQuery<>(null)).forEach(entry -> retval.add((UUID) entry.getKey()));
// We count on the nodes to add and remove themselves from the cache when they start/complete jobs.
// Nodes that crashed aren't removed from this cache - they just die. So retval could
// contain nodes that are gone at this point.
Set<UUID> inGrid = grid.cluster().nodes().stream().map(ClusterNode::id).collect(Collectors.toSet());
Set<UUID> notInGrid = new LinkedHashSet<>(retval);
notInGrid.removeAll(inGrid);
retval.retainAll(inGrid); // remove any nodes that are no longer in the cluster.
logger.log(Level.INFO,
"Found {0} nodes in grid. {1} of those are active. {2} nodes were active but left grid.",
new Object[]{inGrid.size(), retval.size(), notInGrid.size()});
return retval;
}
The ComputeJobs can create the cache like this:
IgniteCache<UUID, ComputeJob> workingOnMap = grid.getOrCreateCache(getInProgressConfig());
Because you want to kill the nodes, be sure the cache is REPLICATED across the grid so that dead nodes don't take portions of the cache with them.
public static CacheConfiguration<UUID, ComputeJob> getInProgressConfig()
{
CacheConfiguration<UUID, ComputeJob> config = new CacheConfiguration<>();
config.setName(JOB_CACHE);
config.setCacheMode(CacheMode.REPLICATED);
config.setAtomicityMode(CacheAtomicityMode.ATOMIC);
config.setWriteSynchronizationMode(CacheWriteSynchronizationMode.FULL_SYNC);
return config;
}
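For completeness, here is a rough sketch of how a job could maintain that cache around its own execution. TrackedJob and doWork() are placeholders I made up, and getInProgressConfig() is the method above:
// Hypothetical sketch: a ComputeJob that records its node as busy in JOB_CACHE while it
// runs, and removes the entry when it finishes or fails.
public class TrackedJob implements ComputeJob
{
@IgniteInstanceResource
private transient Ignite ignite;
@Override
public Object execute() throws IgniteException
{
UUID nodeId = ignite.cluster().localNode().id();
IgniteCache<UUID, ComputeJob> workingOnMap = ignite.getOrCreateCache(getInProgressConfig());
workingOnMap.put(nodeId, this); // mark this node as busy
try
{
return doWork(); // placeholder for the real job logic
}
finally
{
workingOnMap.remove(nodeId); // mark it idle again, even on failure
}
}
@Override
public void cancel()
{
// nothing to do in this sketch
}
private Object doWork()
{
return null; // placeholder
}
}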
My program gets very slow as more and more records are processed. I initially thought it was due to excessive memory consumption, as my program is String intensive (I am using Java 11, so compact strings should be used whenever possible), so I increased the JVM heap:
-Xms2048m
-Xmx6144m
I also increased the task manager's memory as well as the heartbeat timeout in flink-conf.yaml:
jobmanager.heap.size: 6144m
heartbeat.timeout: 5000000
However, none of this helped with the issue. The program still gets very slow at about the same point, which is after processing roughly 3.5 million records, with only about 0.5 million more to go. As the program approaches the 3.5 million mark it gets very, very slow until it eventually times out; total execution time is about 11 minutes.
I checked the memory consumption in VisualVM, but it never goes above about 700 MB. My Flink pipeline looks as follows:
final StreamExecutionEnvironment environment = StreamExecutionEnvironment.createLocalEnvironment(1);
environment.setParallelism(1);
DataStream<Tuple> stream = environment.addSource(new TPCHQuery3Source(filePaths, relations));
stream.process(new TPCHQuery3Process(relations)).addSink(new FDSSink());
environment.execute("FlinkDataService");
The bulk of the work is done in the process function: I am implementing database join algorithms, and the columns are stored as Strings. Specifically, I am implementing query 3 of the TPC-H benchmark (see https://examples.citusdata.com/tpch_queries.html if you wish).
The timeout error is this:
java.util.concurrent.TimeoutException: Heartbeat of TaskManager with id <id> timed out.
Once I got this error as well:
Exception in thread "pool-1-thread-1" java.lang.OutOfMemoryError: Java heap space
Also, here is my VisualVM monitoring; the screenshot is captured at the point where things get very slow:
Here is the run loop of my source function:
while (run) {
readers.forEach(reader -> {
try {
String line = reader.readLine();
if (line != null) {
Tuple tuple = lineToTuple(line, counter.get() % filePaths.size());
if (tuple != null && isValidTuple(tuple)) {
sourceContext.collect(tuple);
}
} else {
closedReaders.add(reader);
if (closedReaders.size() == filePaths.size()) {
System.out.println("ALL FILES HAVE BEEN STREAMED");
cancel();
}
}
counter.getAndIncrement();
} catch (IOException e) {
e.printStackTrace();
}
});
}
I basically read a line of each of the 3 files I need and, based on the order of the files, construct a tuple object (my custom class called Tuple, representing a row in a table) and emit that tuple if it is valid, i.e. it fulfils certain conditions on the date.
I am also suggesting that the JVM do garbage collection at the 1 millionth, 1.5 millionth, 2 millionth and 2.5 millionth record, like this:
System.gc();
Any thoughts on how I can optimize this?
String.intern() saved me. I interned every string before storing it in my maps and that worked like a charm.
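Roughly, the change was along these lines; the column separator and the Tuple constructor here are only illustrative, not my exact code:
// Sketch: intern every column value before it goes into the Tuple (and from there into the
// state maps), so repeated values such as dates, flags and keys share one String instance.
private Tuple lineToTuple(String line, int relation) {
String[] columns = line.split("\\|"); // assumes '|'-separated TPC-H rows
for (int i = 0; i < columns.length; i++) {
columns[i] = columns[i].intern(); // deduplicate via the string pool
}
return new Tuple(relation, columns); // assumed constructor of the custom Tuple class
}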
These are the properties that I changed on my Flink stand-alone cluster to compute the TPC-H query 03.
jobmanager.memory.process.size: 1600m
heartbeat.timeout: 100000
taskmanager.memory.process.size: 8g # default: 1728m
I implemented this query to stream only the Order table and kept the other tables as state. Also, I am computing it as a windowless query, which I think makes more sense and is faster.
public class TPCHQuery03 {
private final String topic = "topic-tpch-query-03";
public TPCHQuery03() {
this(PARAMETER_OUTPUT_LOG, "127.0.0.1", false, false, -1);
}
public TPCHQuery03(String output, String ipAddressSink, boolean disableOperatorChaining, boolean pinningPolicy, long maxCount) {
try {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setStreamTimeCharacteristic(TimeCharacteristic.ProcessingTime);
if (disableOperatorChaining) {
env.disableOperatorChaining();
}
DataStream<Order> orders = env
.addSource(new OrdersSource(maxCount)).name(OrdersSource.class.getSimpleName()).uid(OrdersSource.class.getSimpleName());
// Filter market segment "AUTOMOBILE"
// customers = customers.filter(new CustomerFilter());
// Filter all Orders with o_orderdate < 12.03.1995
DataStream<Order> ordersFiltered = orders
.filter(new OrderDateFilter("1995-03-12")).name(OrderDateFilter.class.getSimpleName()).uid(OrderDateFilter.class.getSimpleName());
// Join customers with orders and package them into a ShippingPriorityItem
DataStream<ShippingPriorityItem> customerWithOrders = ordersFiltered
.keyBy(new OrderKeySelector())
.process(new OrderKeyedByCustomerProcessFunction(pinningPolicy)).name(OrderKeyedByCustomerProcessFunction.class.getSimpleName()).uid(OrderKeyedByCustomerProcessFunction.class.getSimpleName());
// Join the last join result with Lineitems
DataStream<ShippingPriorityItem> result = customerWithOrders
.keyBy(new ShippingPriorityOrderKeySelector())
.process(new ShippingPriorityKeyedProcessFunction(pinningPolicy)).name(ShippingPriorityKeyedProcessFunction.class.getSimpleName()).uid(ShippingPriorityKeyedProcessFunction.class.getSimpleName());
// Group by l_orderkey, o_orderdate and o_shippriority and compute revenue sum
DataStream<ShippingPriorityItem> resultSum = result
.keyBy(new ShippingPriority3KeySelector())
.reduce(new SumShippingPriorityItem(pinningPolicy)).name(SumShippingPriorityItem.class.getSimpleName()).uid(SumShippingPriorityItem.class.getSimpleName());
// emit result
if (output.equalsIgnoreCase(PARAMETER_OUTPUT_MQTT)) {
resultSum
.map(new ShippingPriorityItemMap(pinningPolicy)).name(ShippingPriorityItemMap.class.getSimpleName()).uid(ShippingPriorityItemMap.class.getSimpleName())
.addSink(new MqttStringPublisher(ipAddressSink, topic, pinningPolicy)).name(OPERATOR_SINK).uid(OPERATOR_SINK);
} else if (output.equalsIgnoreCase(PARAMETER_OUTPUT_LOG)) {
resultSum.print().name(OPERATOR_SINK).uid(OPERATOR_SINK);
} else if (output.equalsIgnoreCase(PARAMETER_OUTPUT_FILE)) {
StreamingFileSink<String> sink = StreamingFileSink
.forRowFormat(new Path(PATH_OUTPUT_FILE), new SimpleStringEncoder<String>("UTF-8"))
.withRollingPolicy(
DefaultRollingPolicy.builder().withRolloverInterval(TimeUnit.MINUTES.toMillis(15))
.withInactivityInterval(TimeUnit.MINUTES.toMillis(5))
.withMaxPartSize(1024 * 1024 * 1024).build())
.build();
resultSum
.map(new ShippingPriorityItemMap(pinningPolicy)).name(ShippingPriorityItemMap.class.getSimpleName()).uid(ShippingPriorityItemMap.class.getSimpleName())
.addSink(sink).name(OPERATOR_SINK).uid(OPERATOR_SINK);
} else {
System.out.println("discarding output");
}
System.out.println("Stream job: " + TPCHQuery03.class.getSimpleName());
System.out.println("Execution plan >>>\n" + env.getExecutionPlan());
env.execute(TPCHQuery03.class.getSimpleName());
} catch (IOException e) {
e.printStackTrace();
} catch (Exception e) {
e.printStackTrace();
}
}
public static void main(String[] args) throws Exception {
new TPCHQuery03();
}
}
The UDFs are here: OrderSource, OrderKeyedByCustomerProcessFunction, ShippingPriorityKeyedProcessFunction, and SumShippingPriorityItem. I am using com.google.common.collect.ImmutableList since the state will not be updated. Also, I am keeping only the necessary columns in the state, such as ImmutableList<Tuple2<Long, Double>> lineItemList.
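To illustrate the state layout, here is a sketch (not the linked UDFs; the class name, signature and field names are made up) of how such an ImmutableList can be kept as keyed state:
// Sketch: keep only the needed columns (order key, revenue) as keyed state.
public class LineItemJoinFunction extends KeyedProcessFunction<Long, Tuple2<Long, Double>, Tuple2<Long, Double>> {
private transient ValueState<ImmutableList<Tuple2<Long, Double>>> lineItemList;
@Override
public void open(Configuration parameters) {
lineItemList = getRuntimeContext().getState(new ValueStateDescriptor<>(
"lineItemList",
TypeInformation.of(new TypeHint<ImmutableList<Tuple2<Long, Double>>>() {})));
}
@Override
public void processElement(Tuple2<Long, Double> value, Context ctx, Collector<Tuple2<Long, Double>> out) throws Exception {
ImmutableList.Builder<Tuple2<Long, Double>> builder = ImmutableList.builder();
if (lineItemList.value() != null) {
builder.addAll(lineItemList.value());
}
builder.add(value);
lineItemList.update(builder.build()); // the list is replaced, never mutated in place
out.collect(value);
}
}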
I am using Kafka Streams 2.1.
I am trying to aggregate a stream of messages by their id. We have approximately 20 messages with the same id produced almost at the same time (a couple of hundred ms maximum between two messages). So I am using a session window with an inactivity gap of 500 ms and a grace time of 5 seconds.
The incoming records have the ID as the key, and the value is made of a few string fields plus a map that can contain from zero to a few hundred entries (the key is a string, the value is an object with one string field).
Here is the code:
private final Duration INACTIVITY_GAP = Duration.ofMillis(500);
private final Duration GRACE_TIME = Duration.ofMillis(5000);
KStream<String, MyCustomMessage> source = streamsBuilder.stream("inputTopic", Consumed.with(Serdes.String(), myCustomSerde));
source
.groupByKey(Grouped.with(Serdes.String(), myCustomSerde))
.windowedBy(SessionWindows.with(INACTIVITY_GAP).grace(GRACE_TIME))
.aggregate(
// initializer
() -> {
return new CustomAggMessage();
},
// aggregates records in same session
(s, message, aggMessage) -> {
// ...
return aggMessage;
},
// merge sessions
(s, aggMessage1, aggMessage2) -> {
// ... merge
return aggMessage2;
},
Materialized.with(Serdes.String(), myCustomAggSerde)
)
.suppress(Suppressed.untilWindowCloses(unbounded()))
.toStream()
.selectKey((windowed, o) -> windowed.key())
.to("outputTopic");
I also tried another Suppressed: .suppress(Suppressed.untilTimeLimit(Duration.ofSeconds(30), maxBytes(1_000_000_000L).emitEarlyWhenFull())). It did not help.
I have a custom RocksDB config:
public class CustomRocksDbConfig implements RocksDBConfigSetter {
@Override
public void setConfig(final String storeName, final Options options, final Map<String, Object> configs) {
BlockBasedTableConfig tableConfig = (BlockBasedTableConfig) options.tableFormatConfig();
tableConfig.setCacheIndexAndFilterBlocks(true);
options.setTableFormatConfig(tableConfig);
}
}
The input topic has 32 partitions (about 10k msg/second) and we run 8 instances with 4 stream threads each.
When running this, the heap usage is very high (max heap is set to 4 GB, the machine has 8 GB), which makes the app crash and restart, so the lag increases.
Does someone know why? What could I change to make this work? Are the session window and its parameters the right way to achieve this?
I'm using Spark in order to calculate the PageRank of user reviews, but I keep getting java.lang.StackOverflowError when I run my code on a big dataset (40k entries). When running the code on a small number of entries it works fine, though.
Entry example:
product/productId: B00004CK40 review/userId: A39IIHQF18YGZA review/profileName: C. A. M. Salas review/helpfulness: 0/0 review/score: 4.0 review/time: 1175817600 review/summary: Reliable comedy review/text: Nice script, well acted comedy, and a young Nicolette Sheridan. Cusak is in top form.
The Code:
public void calculatePageRank() {
sc.clearCallSite();
sc.clearJobGroup();
JavaRDD < String > rddFileData = sc.textFile(inputFileName).cache();
sc.setCheckpointDir("pagerankCheckpoint/");
JavaRDD < String > rddMovieData = rddFileData.map(new Function < String, String > () {
@Override
public String call(String arg0) throws Exception {
String[] data = arg0.split("\t");
String movieId = data[0].split(":")[1].trim();
String userId = data[1].split(":")[1].trim();
return movieId + "\t" + userId;
}
});
JavaPairRDD<String, Iterable<String>> rddPairReviewData = rddMovieData.mapToPair(new PairFunction < String, String, String > () {
@Override
public Tuple2 < String, String > call(String arg0) throws Exception {
String[] data = arg0.split("\t");
return new Tuple2 < String, String > (data[0], data[1]);
}
}).groupByKey().cache();
JavaRDD<Iterable<String>> cartUsers = rddPairReviewData.map(f -> f._2());
List<Iterable<String>> cartUsersList = cartUsers.collect();
JavaPairRDD<String,String> finalCartesian = null;
int iterCounter = 0;
for(Iterable<String> out : cartUsersList){
JavaRDD<String> currentUsersRDD = sc.parallelize(Lists.newArrayList(out));
if(finalCartesian==null){
finalCartesian = currentUsersRDD.cartesian(currentUsersRDD);
}
else{
finalCartesian = currentUsersRDD.cartesian(currentUsersRDD).union(finalCartesian);
if(iterCounter % 20 == 0) {
finalCartesian.checkpoint();
}
}
iterCounter++; // advance the counter so the periodic checkpoint (every 20 iterations) actually fires
}
JavaRDD<Tuple2<String,String>> finalCartesianToTuple = finalCartesian.map(m -> new Tuple2<String,String>(m._1(),m._2()));
finalCartesianToTuple = finalCartesianToTuple.filter(x -> x._1().compareTo(x._2())!=0);
JavaPairRDD<String, String> userIdPairs = finalCartesianToTuple.mapToPair(m -> new Tuple2<String,String>(m._1(),m._2()));
JavaRDD<String> userIdPairsString = userIdPairs.map(new Function < Tuple2<String, String>, String > () {
//Tuple2<Tuple2<MovieId, userId>, Tuple2<movieId, userId>>
@Override
public String call (Tuple2<String, String> t) throws Exception {
return t._1 + " " + t._2;
}
});
try {
//calculate pagerank using this https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/JavaPageRank.java
JavaPageRank.calculatePageRank(userIdPairsString, 100);
} catch (Exception e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
sc.close();
}
I have multiple suggestions which will help you to greatly improve the performance of the code in your question.
Caching: Caching should be used on data sets that you need to refer to again and again for the same or different operations (iterative algorithms).
An example is RDD.count — to tell you the number of lines in the
file, the file needs to be read. So if you write RDD.count, at
this point the file will be read, the lines will be counted, and the
count will be returned.
What if you call RDD.count again? The same thing: the file will be
read and counted again. So what does RDD.cache do? Now, if you run
RDD.count the first time, the file will be loaded, cached, and
counted. If you call RDD.count a second time, the operation will use
the cache. It will just take the data from the cache and count the
lines, no recomputing.
Read more about caching here.
In your code sample you are not reusing anything that you've cached. So you may remove the .cache from there.
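By contrast, caching does pay off when the same RDD feeds more than one action, for example (illustrative only):
// The second action is served from the cache instead of re-reading the file.
JavaRDD<String> lines = sc.textFile(inputFileName).cache();
long totalLines = lines.count(); // reads the file and caches it
long productLines = lines.filter(l -> l.contains("product/productId")).count(); // uses the cache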
Parallelization: In the code sample, you've parallelized every individual element in your RDD, which is already a distributed collection. I suggest you merge the rddFileData, rddMovieData and rddPairReviewData steps so that it all happens in one go.
Get rid of .collect, since that brings the results back to the driver and may be the actual reason for your error.
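Putting the last two points together, the first part of the pipeline can stay distributed end to end; a sketch based on the code in the question (on Spark 2.x the flatMapToPair lambda must return an Iterator instead of a List):
// One chained pipeline: no collect(), no per-element parallelize().
JavaPairRDD<String, Iterable<String>> usersPerMovie = sc.textFile(inputFileName)
.mapToPair(line -> {
String[] data = line.split("\t");
String movieId = data[0].split(":")[1].trim();
String userId = data[1].split(":")[1].trim();
return new Tuple2<>(movieId, userId);
})
.groupByKey();
// Build the user-user pairs per movie inside the cluster instead of on the driver.
JavaPairRDD<String, String> userIdPairs = usersPerMovie.values()
.flatMapToPair(users -> {
List<Tuple2<String, String>> pairs = new ArrayList<>();
for (String u1 : users) {
for (String u2 : users) {
if (!u1.equals(u2)) {
pairs.add(new Tuple2<>(u1, u2));
}
}
}
return pairs; // Spark 1.x; use pairs.iterator() on Spark 2.x
});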
This problem occurs when your DAG grows big and too many levels of transformations are happening in your code. The JVM is not able to hold the ever-growing chain of operations that lazy execution has to replay when an action is finally performed.
Checkpointing is one option. I would also suggest using Spark SQL for this kind of aggregation: if your data is structured, try loading it into DataFrames and performing the grouping and other SQL functions there.
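For example (an illustrative sketch only, assuming Spark 2.x with spark-sql on the classpath; the input path and column names are placeholders):
// Do the grouping with Spark SQL instead of a hand-built chain of RDD transformations.
SparkSession spark = SparkSession.builder().appName("ReviewPairs").getOrCreate();
Dataset<Row> reviews = spark.read()
.option("sep", "\t")
.csv("movie_user_pairs.tsv") // placeholder: pre-parsed (movieId, userId) pairs
.toDF("movieId", "userId");
// Collect the users per movie with a SQL aggregate instead of groupByKey + driver-side loops.
Dataset<Row> usersPerMovie = reviews
.groupBy("movieId")
.agg(functions.collect_list("userId").alias("users"));
usersPerMovie.show();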
When your for loop grows really large, Spark can no longer keep track of the lineage. Enable checkpointing in your for loop to checkpoint your RDD every 10 iterations or so. Checkpointing will fix the problem. Don't forget to clean up the checkpoint directory afterwards.
http://spark.apache.org/docs/latest/streaming-programming-guide.html#checkpointing
The following things fixed the StackOverflowError for me. As others pointed out, it's because of the lineage that Spark keeps building, especially when you have a loop/iteration in the code.
Set a checkpoint directory:
spark.sparkContext.setCheckpointDir("./checkpoint")
Checkpoint the DataFrame/RDD you are modifying/operating on in each iteration:
modifyingDf = modifyingDf.checkpoint()
Cache DataFrames that are reused in each iteration:
reusedDf.cache()
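Put together, the loop looks roughly like this Java sketch; the paths, join key and iteration counts are placeholders:
// Truncate the lineage periodically inside an iterative job.
SparkSession spark = SparkSession.builder().appName("IterativeJob").getOrCreate();
spark.sparkContext().setCheckpointDir("./checkpoint"); // 1. checkpoint directory
Dataset<Row> reusedDf = spark.read().parquet("lookup.parquet").cache(); // 3. cached, reused every pass
Dataset<Row> modifyingDf = spark.read().parquet("input.parquet");
for (int i = 1; i <= 100; i++) {
modifyingDf = modifyingDf.join(reusedDf, "key"); // placeholder transformation
if (i % 10 == 0) {
modifyingDf = modifyingDf.checkpoint(); // 2. cut the lineage every 10 iterations
}
}
modifyingDf.write().mode("overwrite").parquet("output.parquet");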
We have a slow backend server that is getting crushed by load and we'd like the middle-tier Scala server to only have one outstanding request to the backend for each unique lookup.
The backend server only stores immutable data, but upon the addition of new data, the middle-tier servers will request the newest data on behalf of the clients and the backend server has a hard time with the load. The immutable data is cached in memcached using unique keys generated upon the write, but the write rate is high so we get a low memcached hit rate.
One idea I have is to use Google Guava's MapMaker#makeComputingMap() to wrap the actual lookup and after ConcurrentMap#get() returns, the middle-tier will save the result and just delete the key from the Map.
This seems a little wasteful, although the code is very easy to write, see below for an example of what I'm thinking.
Is there a more natural data structure, library or part of Guava that would solve this problem?
import com.google.common.collect.MapMaker
object Test
{
val computer: com.google.common.base.Function[Int,Long] =
{
new com.google.common.base.Function[Int,Long] {
override
def apply(i: Int): Long =
{
val l = System.currentTimeMillis + i
System.err.println("For " + i + " returning " + l)
Thread.sleep(2000)
l
}
}
}
val map =
{
new MapMaker().makeComputingMap[Int,Long](computer)
}
def get(k: Int): Long =
{
val l = map.get(k)
map.remove(k)
l
}
def main(args: Array[String]): Unit =
{
val t1 = new Thread() {
override def run(): Unit =
{
System.err.println(get(123))
}
}
val t2 = new Thread() {
override def run(): Unit =
{
System.err.println(get(123))
}
}
t1.start()
t2.start()
t1.join()
t2.join()
System.err.println(get(123))
}
}
I'm not sure why you implement the removal yourself; why not simply use weak or soft values and let the GC clean up for you?
new MapMaker().weakValues().makeComputingMap[Int, Long](computer)
I think what you do is quite reasonable. You only use the structure to get lock striping on the key, to ensure that accesses to the same key conflict. No worries that you don't need a value mapping per key. ConcurrentHashMap and friends are the only structures in the Java libraries + Guava that offer you lock striping.
This does induce some minor runtime overhead, plus the size of the hashtable which you don't need (which might even grow, if accesses to the same segment pile up and remove() doesn't keep up).
If you want to make it as cheap as possible, you could code some simple lock striping yourself. Basically an Object[] (or Array[AnyRef] :)) of N locks (N = concurrency level), where you just map the hash of the lookup key into this array and lock on that slot. Another advantage of this is that you really don't have to do the hashcode tricks that CHM has to do, because the latter must split the hashcode into one part to select the lock and another for the needs of the hashtable, whereas you can use the whole of it just for the lock selection.
edit: Sketching my comment below:
val concurrencyLevel = 16
val locks = (for (i <- 0 until concurrencyLevel) yield new AnyRef).toArray // 'until' so we get exactly concurrencyLevel locks
def access(key: K): V = {
val lock = locks((key.hashCode & Int.MaxValue) % locks.size) // mask to avoid a negative index
lock synchronized {
val valueFromCache = cache.lookup(key)
valueFromCache match {
case Some(v) => return v
case None =>
val valueFromBackend = backendServer.lookup(key)
cache.put(key, valueFromBackend)
return valueFromBackend
}
}
}
(Btw, is the toArray call needed? Or is the returned IndexedSeq already fast to access by index?)