I'm using Spark to calculate the PageRank of user reviews, but I keep getting a java.lang.StackOverflowError when I run my code on a big dataset (40k entries). When running the code on a small number of entries it works fine, though.
Entry example:
product/productId: B00004CK40 review/userId: A39IIHQF18YGZA review/profileName: C. A. M. Salas review/helpfulness: 0/0 review/score: 4.0 review/time: 1175817600 review/summary: Reliable comedy review/text: Nice script, well acted comedy, and a young Nicolette Sheridan. Cusak is in top form.
The Code:
public void calculatePageRank() {
sc.clearCallSite();
sc.clearJobGroup();
JavaRDD < String > rddFileData = sc.textFile(inputFileName).cache();
sc.setCheckpointDir("pagerankCheckpoint/");
JavaRDD < String > rddMovieData = rddFileData.map(new Function < String, String > () {
@Override
public String call(String arg0) throws Exception {
String[] data = arg0.split("\t");
String movieId = data[0].split(":")[1].trim();
String userId = data[1].split(":")[1].trim();
return movieId + "\t" + userId;
}
});
JavaPairRDD<String, Iterable<String>> rddPairReviewData = rddMovieData.mapToPair(new PairFunction < String, String, String > () {
@Override
public Tuple2 < String, String > call(String arg0) throws Exception {
String[] data = arg0.split("\t");
return new Tuple2 < String, String > (data[0], data[1]);
}
}).groupByKey().cache();
JavaRDD<Iterable<String>> cartUsers = rddPairReviewData.map(f -> f._2());
List<Iterable<String>> cartUsersList = cartUsers.collect();
JavaPairRDD<String,String> finalCartesian = null;
int iterCounter = 0;
for(Iterable<String> out : cartUsersList){
JavaRDD<String> currentUsersRDD = sc.parallelize(Lists.newArrayList(out));
if(finalCartesian==null){
finalCartesian = currentUsersRDD.cartesian(currentUsersRDD);
}
else{
finalCartesian = currentUsersRDD.cartesian(currentUsersRDD).union(finalCartesian);
if(iterCounter % 20 == 0) {
finalCartesian.checkpoint();
}
}
}
JavaRDD<Tuple2<String,String>> finalCartesianToTuple = finalCartesian.map(m -> new Tuple2<String,String>(m._1(),m._2()));
finalCartesianToTuple = finalCartesianToTuple.filter(x -> x._1().compareTo(x._2())!=0);
JavaPairRDD<String, String> userIdPairs = finalCartesianToTuple.mapToPair(m -> new Tuple2<String,String>(m._1(),m._2()));
JavaRDD<String> userIdPairsString = userIdPairs.map(new Function < Tuple2<String, String>, String > () {
//Tuple2<Tuple2<MovieId, userId>, Tuple2<movieId, userId>>
@Override
public String call (Tuple2<String, String> t) throws Exception {
return t._1 + " " + t._2;
}
});
try {
//calculate pagerank using this https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/JavaPageRank.java
JavaPageRank.calculatePageRank(userIdPairsString, 100);
} catch (Exception e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
sc.close();
}
I have multiple suggestions which will help you to greatly improve the performance of the code in your question.
Caching: Caching should be used on data sets that you need to refer to again and again for the same or different operations (iterative algorithms).
An example is RDD.count — to tell you the number of lines in the file, the file needs to be read. So if you write RDD.count, at this point the file will be read, the lines will be counted, and the count will be returned.
What if you call RDD.count again? The same thing: the file will be read and counted again. So what does RDD.cache do? Now, if you run RDD.count the first time, the file will be loaded, cached, and counted. If you call RDD.count a second time, the operation will use the cache. It will just take the data from the cache and count the lines, no recomputing.
Read more about caching here.
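A minimal sketch of that behaviour (the file name is just an illustration):
JavaRDD<String> lines = sc.textFile("reviews.txt").cache();
long first = lines.count();   // reads the file, caches the partitions, counts the lines
long second = lines.count();  // answered from the cache, the file is not read again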
In your code sample you are not reusing anything that you've cached. So you may remove the .cache from there.
Parallelization: In the code sample, you've parallelized every individual element in your RDD, which is already a distributed collection. I suggest you merge the rddFileData, rddMovieData and rddPairReviewData steps so that they happen in one go.
Get rid of .collect, since that brings the results back to the driver and may be the actual reason for your error.
This problem occurs when your DAG grows big and too many levels of transformations happen in your code. The JVM will not be able to hold all the operations needed for lazy execution when an action is finally performed.
Checkpointing is one option. I would suggest using Spark SQL for this kind of aggregation. If your data is structured, try loading it into DataFrames and performing the grouping and other SQL functions there.
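A rough sketch of that idea, assuming a tab-separated file of movieId/userId pairs (the session setup, path and column names are illustrative, not the asker's actual schema):
SparkSession spark = SparkSession.builder().appName("pagerank-prep").getOrCreate();
Dataset<Row> reviews = spark.read()
        .option("sep", "\t")
        .csv(parsedPairsPath)            // hypothetical path to the already-parsed movieId/userId pairs
        .toDF("movieId", "userId");
Dataset<Row> usersPerMovie = reviews.groupBy("movieId")
        .agg(functions.collect_list("userId").alias("users"));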
When your for loop grows really large, the lineage that Spark has to track becomes too deep. Enable checkpointing in your for loop to checkpoint your RDD every 10 iterations or so. Checkpointing will fix the problem. Don't forget to clean up the checkpoint directory afterwards.
http://spark.apache.org/docs/latest/streaming-programming-guide.html#checkpointing
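Applied to the loop in the question, that could look roughly like this (note that checkpoint() is lazy, so an action such as count() is needed to actually write the checkpoint and truncate the lineage, and iterCounter has to be incremented):
if (iterCounter % 10 == 0 && iterCounter > 0) {
    finalCartesian.checkpoint();
    finalCartesian.count();   // force the checkpoint to be materialized
}
iterCounter++;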
The things below fixed the StackOverflowError; as others pointed out, it is caused by the lineage that Spark keeps building, especially when you have a loop/iteration in your code.
Set checkpoint directory
spark.sparkContext.setCheckpointDir("./checkpoint")
Checkpoint the DataFrame/RDD you are modifying/operating on in the iteration
modifyingDf.checkpoint()
Cache DataFrames that are reused in each iteration
reusedDf.cache()
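Put together, the pattern in an iterative job looks roughly like this (loadLookupTable, loadInitialData and step are hypothetical placeholders for your own logic):
spark.sparkContext().setCheckpointDir("./checkpoint");
Dataset<Row> reusedDf = loadLookupTable(spark).cache();     // reused in every iteration
Dataset<Row> modifyingDf = loadInitialData(spark);
for (int i = 1; i <= numIterations; i++) {
    modifyingDf = step(modifyingDf, reusedDf);              // whatever the iteration computes
    if (i % 10 == 0) {
        modifyingDf = modifyingDf.checkpoint();             // returns a Dataset with truncated lineage
    }
}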
Related
My program gets very slow as more and more records are processed. I initially thought it was due to excessive memory consumption, as my program is String-intensive (I am using Java 11, so compact strings should be used whenever possible), so I increased the JVM heap:
-Xms2048m
-Xmx6144m
I also increased the task manager's memory as well as the timeout in flink-conf.yaml:
jobmanager.heap.size: 6144m
heartbeat.timeout: 5000000
However, none of this helped with the issue. The program still gets very slow at about the same point, which is after processing roughly 3.5 million records, with only about 0.5 million more to go. As the program approaches the 3.5 million mark it gets very, very slow until it eventually times out; total execution time is about 11 minutes.
I checked the memory consumption in VisualVM, but it never goes above about 700 MB. My Flink pipeline looks as follows:
final StreamExecutionEnvironment environment = StreamExecutionEnvironment.createLocalEnvironment(1);
environment.setParallelism(1);
DataStream<Tuple> stream = environment.addSource(new TPCHQuery3Source(filePaths, relations));
stream.process(new TPCHQuery3Process(relations)).addSink(new FDSSink());
environment.execute("FlinkDataService");
The bulk of the work is done in the process function, where I am implementing database join algorithms; the columns are stored as Strings. Specifically, I am implementing query 3 of the TPC-H benchmark (see https://examples.citusdata.com/tpch_queries.html if you wish).
The timeout error is this:
java.util.concurrent.TimeoutException: Heartbeat of TaskManager with id <id> timed out.
Once I got this error as well:
Exception in thread "pool-1-thread-1" java.lang.OutOfMemoryError: Java heap space
I also monitored the run with VisualVM; the screenshot was captured at the point where things get very slow.
Here is the run loop of my source function:
while (run) {
readers.forEach(reader -> {
try {
String line = reader.readLine();
if (line != null) {
Tuple tuple = lineToTuple(line, counter.get() % filePaths.size());
if (tuple != null && isValidTuple(tuple)) {
sourceContext.collect(tuple);
}
} else {
closedReaders.add(reader);
if (closedReaders.size() == filePaths.size()) {
System.out.println("ALL FILES HAVE BEEN STREAMED");
cancel();
}
}
counter.getAndIncrement();
} catch (IOException e) {
e.printStackTrace();
}
});
}
I basically read a line from each of the 3 files I need and, based on the order of the files, I construct a tuple object (my custom class called Tuple, representing a row in a table) and emit that tuple if it is valid, i.e. it fulfils certain conditions on the date.
I am also suggesting to the JVM that it perform garbage collection at the 1 millionth, 1.5 millionth, 2 millionth and 2.5 millionth record, like this:
System.gc()
Any thoughts on how I can optimize this?
String.intern() saved me. I interned every string before storing it in my maps and that worked like a charm.
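In case it helps, the pattern is simply this (the variable and map names are illustrative):
// Deduplicate repeated column values before keeping them in long-lived maps
String segment = rawSegment.intern();        // returns the canonical pooled instance
marketSegments.put(customerKey, segment);    // the map now references the shared instance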
These are the properties that I changed on my Flink stand-alone cluster to compute TPC-H query 03.
jobmanager.memory.process.size: 1600m
heartbeat.timeout: 100000
taskmanager.memory.process.size: 8g # default: 1728m
I implemented this query to stream only the Order table and I kept the other tables as state. Also, I am computing it as a windowless query, which I think makes more sense and is faster.
public class TPCHQuery03 {
private final String topic = "topic-tpch-query-03";
public TPCHQuery03() {
this(PARAMETER_OUTPUT_LOG, "127.0.0.1", false, false, -1);
}
public TPCHQuery03(String output, String ipAddressSink, boolean disableOperatorChaining, boolean pinningPolicy, long maxCount) {
try {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setStreamTimeCharacteristic(TimeCharacteristic.ProcessingTime);
if (disableOperatorChaining) {
env.disableOperatorChaining();
}
DataStream<Order> orders = env
.addSource(new OrdersSource(maxCount)).name(OrdersSource.class.getSimpleName()).uid(OrdersSource.class.getSimpleName());
// Filter market segment "AUTOMOBILE"
// customers = customers.filter(new CustomerFilter());
// Filter all Orders with o_orderdate < 12.03.1995
DataStream<Order> ordersFiltered = orders
.filter(new OrderDateFilter("1995-03-12")).name(OrderDateFilter.class.getSimpleName()).uid(OrderDateFilter.class.getSimpleName());
// Join customers with orders and package them into a ShippingPriorityItem
DataStream<ShippingPriorityItem> customerWithOrders = ordersFiltered
.keyBy(new OrderKeySelector())
.process(new OrderKeyedByCustomerProcessFunction(pinningPolicy)).name(OrderKeyedByCustomerProcessFunction.class.getSimpleName()).uid(OrderKeyedByCustomerProcessFunction.class.getSimpleName());
// Join the last join result with Lineitems
DataStream<ShippingPriorityItem> result = customerWithOrders
.keyBy(new ShippingPriorityOrderKeySelector())
.process(new ShippingPriorityKeyedProcessFunction(pinningPolicy)).name(ShippingPriorityKeyedProcessFunction.class.getSimpleName()).uid(ShippingPriorityKeyedProcessFunction.class.getSimpleName());
// Group by l_orderkey, o_orderdate and o_shippriority and compute revenue sum
DataStream<ShippingPriorityItem> resultSum = result
.keyBy(new ShippingPriority3KeySelector())
.reduce(new SumShippingPriorityItem(pinningPolicy)).name(SumShippingPriorityItem.class.getSimpleName()).uid(SumShippingPriorityItem.class.getSimpleName());
// emit result
if (output.equalsIgnoreCase(PARAMETER_OUTPUT_MQTT)) {
resultSum
.map(new ShippingPriorityItemMap(pinningPolicy)).name(ShippingPriorityItemMap.class.getSimpleName()).uid(ShippingPriorityItemMap.class.getSimpleName())
.addSink(new MqttStringPublisher(ipAddressSink, topic, pinningPolicy)).name(OPERATOR_SINK).uid(OPERATOR_SINK);
} else if (output.equalsIgnoreCase(PARAMETER_OUTPUT_LOG)) {
resultSum.print().name(OPERATOR_SINK).uid(OPERATOR_SINK);
} else if (output.equalsIgnoreCase(PARAMETER_OUTPUT_FILE)) {
StreamingFileSink<String> sink = StreamingFileSink
.forRowFormat(new Path(PATH_OUTPUT_FILE), new SimpleStringEncoder<String>("UTF-8"))
.withRollingPolicy(
DefaultRollingPolicy.builder().withRolloverInterval(TimeUnit.MINUTES.toMillis(15))
.withInactivityInterval(TimeUnit.MINUTES.toMillis(5))
.withMaxPartSize(1024 * 1024 * 1024).build())
.build();
resultSum
.map(new ShippingPriorityItemMap(pinningPolicy)).name(ShippingPriorityItemMap.class.getSimpleName()).uid(ShippingPriorityItemMap.class.getSimpleName())
.addSink(sink).name(OPERATOR_SINK).uid(OPERATOR_SINK);
} else {
System.out.println("discarding output");
}
System.out.println("Stream job: " + TPCHQuery03.class.getSimpleName());
System.out.println("Execution plan >>>\n" + env.getExecutionPlan());
env.execute(TPCHQuery03.class.getSimpleName());
} catch (IOException e) {
e.printStackTrace();
} catch (Exception e) {
e.printStackTrace();
}
}
public static void main(String[] args) throws Exception {
new TPCHQuery03();
}
}
The UDFs are here: OrderSource, OrderKeyedByCustomerProcessFunction, ShippingPriorityKeyedProcessFunction, and SumShippingPriorityItem. I am using com.google.common.collect.ImmutableList since the state will not be updated. Also, I am keeping only the necessary columns in the state, such as ImmutableList<Tuple2<Long, Double>> lineItemList.
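A rough sketch of what such a keyed state declaration can look like inside a KeyedProcessFunction (the descriptor name and types are illustrative, not the exact code of the UDFs linked above):
// Keep only the needed columns as keyed state
private transient ValueState<ImmutableList<Tuple2<Long, Double>>> lineItemList;

@Override
public void open(Configuration parameters) {
    ValueStateDescriptor<ImmutableList<Tuple2<Long, Double>>> descriptor =
            new ValueStateDescriptor<>("lineItemList",
                    TypeInformation.of(new TypeHint<ImmutableList<Tuple2<Long, Double>>>() {}));
    lineItemList = getRuntimeContext().getState(descriptor);
}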
I am looking for an efficient way to preprocess CSV data before (or while) dumping it to a java stream.
Under normal circumstances I would do something like this to process the file:
File input = new File("helloworld.csv");
InputStream is = new FileInputStream(input);
BufferedReader br = new BufferedReader(new InputStreamReader(is));
br.lines().parallel().forEach(line -> {
System.out.println(line);
});
However in this current case I need to preprocess the records before or while streaming them and each item in my collection could depend on the previous. Here is a simple example CSV file to demonstrate the issue:
species, breed, name
dog, lab, molly
, greyhound, stella
, beagle, stanley
cat, siamese, toby
, persian, fluffy
In my example CSV the species column is only populated when it changes from record to record. I know the simple answer would be to fix my CSV output but in this case that is not possible.
I am looking for a reasonably efficient way to process the records from the CSV, copying the species value from the prior record if it is blank, and then passing them to a parallel stream after preprocessing.
Downstream processing can take a long time so I ultimately need to process in parallel once preprocessing is complete. My CSV files can also be large so I would like to avoid loading the entire file into an object in memory first.
I was hoping there was some way to do something like the following (warning bad pseudocode):
parallelStream.startProcessing
while read line {
if (line.doesntHaveSpecies) {
line.setSpecies
}
parallelStream.add(line)
}
My current solution is to process the entire file and "fix it" then stream it. Since the file can be large, it would be nice to start processing records immediately after they have been "fixed" and before the entire file has been processed.
You have to encapsulate the state into a Spliterator.
private static Stream<String> getStream(BufferedReader br) {
return StreamSupport.stream(
new Spliterators.AbstractSpliterator<String>(
100, Spliterator.ORDERED|Spliterator.NONNULL) {
String prev;
public boolean tryAdvance(Consumer<? super String> action) {
try {
String next = br.readLine();
if(next==null) return false;
final int ix = next.indexOf(',');
if(ix==0) {
if(prev==null)
throw new IllegalStateException("first line without value");
next = prev+next;
}
else prev=ix<0? next: next.substring(0, ix);
action.accept(next);
return true;
} catch (IOException ex) {
throw new UncheckedIOException(ex);
}
}
}, false);
}
which can be used as
try(Reader r = new FileReader(input);
BufferedReader br = new BufferedReader(r)) {
getStream(br).forEach(System.out::println);
}
The Spliterator will always be traversed sequentially. If parallel processing is turned on, the Stream implementation will try to get new Spliterator instances for other threads by calling trySplit. Since we can't offer an efficient strategy for that operation, we inherit the default from AbstractSpliterator, which does some array-based buffering. This will always work correctly, but it only pays off if you have heavy computations in the subsequent stream pipeline. Otherwise, you may simply stay with sequential execution.
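For completeness, enabling parallel processing is just a matter of calling parallel() on the returned stream (expensiveTransformation is a hypothetical placeholder for whatever heavy per-line work you do):
try(Reader r = new FileReader(input);
    BufferedReader br = new BufferedReader(r)) {
    getStream(br).parallel()
                 .map(line -> expensiveTransformation(line)) // hypothetical heavy work
                 .forEach(System.out::println);
}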
You can't start with a parallel stream, because the lines have to be processed sequentially to get the species from the previous line. So we could introduce a side-effecting mapper:
final String[] species = new String[1];
final Function<String, String> speciesAppending = l -> {
if (l.startsWith(",")) {
return species[0] + l;
} else {
species[0] = l.substring(0, l.indexOf(','));
return l;
}
};
try (Stream<String> stream = Files.lines(new File("helloworld.csv").toPath())) {
stream.map(speciesAppending).parallel()... // TODO
}
I'm trying to create a generic method in Java for querying hbase.
I currently have one written which takes in 3 arguments
A Range (to scan the table)
A Column (to be returned) ... and
A Condition (i.e. browser==Chrome)
So a statement (if written in a SQLish language) may look like
SELECT OS FROM TABLE WHERE BROWSER==CHROME IN RANGE (5 WEEKS AGO -> 2 WEEKS AGO)
Now, I know I'm not using HBase properly (using common column queries for rowkey etc.) but for the sake of experimentation I'd like to try it, to help me learn.
So the first thing I do is set a range on the Scan (5 weeks ago to 2 weeks ago). Since the rowkey is the timestamp, this is very efficient.
Then I set a SingleColumnValueFilter (browser = Chrome) (after the range filter, this is pretty fast)
Then I store all the rowkeys (from the scan) into an array.
For each rowkey (in the array) I perform a GET operation to get the corresponding OS.
I have tried using MultiGet, which sped up the process a lot.
I then tried using normal GET requests, each spawning a new thread, all running concurrently, which halved the query time! But still not fast enough.
I have considered limiting the number of threads using a single connection to the database, i.e. 100 threads per connection.
Given my circumstances, what is the most efficient way to perform these GETs, or am I totally approaching it incorrectly?
Any help is hugely appreciated.
EDIT (Here is my threaded GET attempt)
List<String> newresults = Collections.synchronizedList(new ArrayList<String>());
for (String rowkey : result) {
spawnGetThread(rowkey, colname);
}
public void spawnGetThread(String rk, String cn) {
new Thread(new Runnable() {
public void run() {
String rt = "";
Get get = new Get(Bytes.toBytes(rk));
get.addColumn(COL_FAM, cn);
try {
Result getResult = tb.get(get);
rt = (Bytes.toString(getResult.value()));
} catch (IOException e) {
}
newresults.add(rt);
}
}).start();
}
Given my circumstances, what is the most efficient way to perform these GETs, or am I totally approaching it incorrectly?
I would suggest the approach below.
Get is good if you know upfront which rowkeys you need to access.
In that case you can use a method like the one below; it returns an array of Result.
/**
 * Method getDetailRecords.
 *
 * @param listOfRowKeys List<String>
 * @return Result[]
 * @throws IOException
 */
private Result[] getDetailRecords(final List<String> listOfRowKeys) throws IOException {
final HTableInterface table = HBaseConnection.getHTable(TBL_DETAIL);
final List<Get> listOFGets = new ArrayList<Get>();
Result[] results = null;
try {
for (final String rowkey : listOfRowKeys) {// prepare batch of get with row keys
// System.err.println("get 'yourtablename', '" + saltIndexPrefix + rowkey + "'");
final Get get = new Get(Bytes.toBytes(saltedRowKey(rowkey)));
get.addColumn(COLUMN_FAMILY, Bytes.toBytes(yourcolumnname));
listOFGets.add(get);
}
results = table.get(listOFGets);
} finally {
table.close();
}
return results;
}
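Called, for example, like this (the rowkeys would be the ones collected from your range scan; the values here are made up):
List<String> rowKeys = Arrays.asList("1425556800000", "1425643200000"); // illustrative timestamp rowkeys
Result[] rows = getDetailRecords(rowKeys);
for (Result row : rows) {
    System.out.println(Bytes.toString(row.value()));
}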
Additional note: row filters are always faster than column value filters (which do a full table scan).
I would suggest going through HBase: The Definitive Guide --> Client API: Advanced Features.
I expected the new mapWithState API for Spark 1.6+ to near-immediately remove objects that are timed-out, but there is a delay.
I'm testing the API with the adapted version of the JavaStatefulNetworkWordCount below:
SparkConf sparkConf = new SparkConf()
.setAppName("JavaStatefulNetworkWordCount")
.setMaster("local[*]");
JavaStreamingContext ssc = new JavaStreamingContext(sparkConf, Durations.seconds(1));
ssc.checkpoint("./tmp");
StateSpec<String, Integer, Integer, Tuple2<String, Integer>> mappingFunc =
StateSpec.function((word, one, state) -> {
if (state.isTimingOut())
{
System.out.println("Timing out the word: " + word);
return new Tuple2<String,Integer>(word, state.get());
}
else
{
int sum = one.or(0) + (state.exists() ? state.get() : 0);
Tuple2<String, Integer> output = new Tuple2<String, Integer>(word, sum);
state.update(sum);
return output;
}
});
JavaMapWithStateDStream<String, Integer, Integer, Tuple2<String, Integer>> stateDstream =
ssc.socketTextStream(args[0], Integer.parseInt(args[1]),
StorageLevels.MEMORY_AND_DISK_SER_2)
.flatMap(x -> Arrays.asList(SPACE.split(x)))
.mapToPair(w -> new Tuple2<String, Integer>(w, 1))
.mapWithState(mappingFunc.timeout(Durations.seconds(5)));
stateDstream.stateSnapshots().print();
Together with nc (nc -l -p <port>)
When I type a word into the nc window I see the tuple being printed in the console every second. But it doesn't seem like the timing out message gets printed out 5s later, as expected based on the timeout set. The time it takes for the tuple to expire seems to vary between 5 & 20s.
Am I missing some configuration option, or is the timeout perhaps only performed at the same time as checkpoints?
Once an event times out it's NOT deleted right away, but is only marked for deletion by saving it to a 'deltaMap':
override def remove(key: K): Unit = {
val stateInfo = deltaMap(key)
if (stateInfo != null) {
stateInfo.markDeleted()
} else {
val newInfo = new StateInfo[S](deleted = true)
deltaMap.update(key, newInfo)
}
}
Then, timed out events are collected and sent to the output stream only at checkpoint. That is: events which time out at batch t, will appear in the output stream only at the next checkpoint - by default, after 5 batch-intervals on average, i.e. batch t+5:
override def checkpoint(): Unit = {
super.checkpoint()
doFullScan = true
}
...
removeTimedoutData = doFullScan // remove timedout data only when full scan is enabled
...
// Get the timed out state records, call the mapping function on each and collect the
// data returned
if (removeTimedoutData && timeoutThresholdTime.isDefined) {
...
Elements are actually removed only when there are enough of them, and when the state map is being serialized - which currently also happens only at checkpoint:
/** Whether the delta chain length is long enough that it should be compacted */
def shouldCompact: Boolean = {
deltaChainLength >= deltaChainThreshold
}
// Write the data in the parent state map while copying the data into a new parent map for
// compaction (if needed)
val doCompaction = shouldCompact
...
By default checkpointing occurs every 10 batch intervals, thus in the example above every 10 seconds; since your timeout is 5 seconds, events are expected to expire within 5-15 seconds.
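If that delay matters, you could try shortening the checkpoint interval of the state stream explicitly; as far as I can tell the stream returned by mapWithState forwards the standard DStream.checkpoint(interval) call to its internal state stream (the 5-second value below is only an example and must be a multiple of the batch interval):
stateDstream.checkpoint(Durations.seconds(5)); // checkpoint (and therefore the time-out scan) every 5 batches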
EDIT: Corrected and elaborated answer following comments by @YuvalItzchakov
Am I missing some configuration option, or is the timeout perhaps only performed at the same time as snapshots?
Every time a mapWithState is invoked (with your configuration, around every 1 second), the MapWithStateRDD will internally check for expired records and time them out. You can see it in the code:
// Get the timed out state records, call the mapping function on each and collect the
// data returned
if (removeTimedoutData && timeoutThresholdTime.isDefined) {
newStateMap.getByTime(timeoutThresholdTime.get).foreach { case (key, state, _) =>
wrappedState.wrapTimingOutState(state)
val returned = mappingFunction(batchTime, key, None, wrappedState)
mappedData ++= returned
newStateMap.remove(key)
}
}
(Other than the time taken to execute each job, it turns out that newStateMap.remove(key) actually only marks records for deletion. See "Edit" for more.)
You have to take into account the time it takes for each stage to be scheduled, and the amount of time it takes for each execution of such a stage to actually take its turn and run. It isn't accurate because this runs in a distributed system where other factors can come into play, making your timeout more or less accurate than you expect it to be.
Edit
As @etov rightly points out, newStateMap.remove(key) doesn't actually remove the element from the OpenHashMapBasedStateMap[K, S], but simply marks it for deletion. This is also a reason why you're seeing the expiration time adding up.
The actual relevant piece of code is here:
// Write the data in the parent state map while
// copying the data into a new parent map for compaction (if needed)
val doCompaction = shouldCompact
val newParentSessionStore = if (doCompaction) {
val initCapacity = if (approxSize > 0) approxSize else 64
new OpenHashMapBasedStateMap[K, S](initialCapacity = initCapacity, deltaChainThreshold)
} else { null }
val iterOfActiveSessions = parentStateMap.getAll()
var parentSessionCount = 0
// First write the approximate size of the data to be written, so that readObject can
// allocate appropriately sized OpenHashMap.
outputStream.writeInt(approxSize)
while(iterOfActiveSessions.hasNext) {
parentSessionCount += 1
val (key, state, updateTime) = iterOfActiveSessions.next()
outputStream.writeObject(key)
outputStream.writeObject(state)
outputStream.writeLong(updateTime)
if (doCompaction) {
newParentSessionStore.deltaMap.update(
key, StateInfo(state, updateTime, deleted = false))
}
}
// Write the final limit marking object with the correct count of records written.
val limiterObj = new LimitMarker(parentSessionCount)
outputStream.writeObject(limiterObj)
if (doCompaction) {
parentStateMap = newParentSessionStore
}
If deltaMap should be compacted (marked with the doCompaction variable), then (and only then) is the map cleared of all the deleted instances. How often does that happen? Once the delta chain exceeds the threshold:
val DELTA_CHAIN_LENGTH_THRESHOLD = 20
That is, compaction happens when the delta chain is longer than 20 items and there are items that have been marked for deletion.
Basically I want something like this,
int count = 100;
JavaRDD<String> myRandomRDD = generate(count, new Function<String, String>() {
@Override
public String call(String arg0) throws Exception {
return RandomStringUtils.randomAlphabetic(42);
}
});
Theoretically I could use Spark RandomRDD, but I can't get it working right. I'm overwhelmed by the choices. Should I use RandomRDDs::randomRDD or RandomRDDs::randomRDDVector? Or should I use RandomVectorRDD?
I have tried the following, but I can't even get the syntax to be correct.
RandomRDDs.randomRDD(jsc, new RandomDataGenerator<String>() {
@Override
public void setSeed(long arg0) {
// TODO Auto-generated method stub
}
@Override
public org.apache.spark.mllib.random.RandomDataGenerator<String> copy() {
// TODO Auto-generated method stub
return null;
}
@Override
public String nextValue() {
RandomStringUtils.randomAlphabetic(42);
}
}, count, ??);
The documentation is sparse, I'm confused, and I would appreciate any help.
Thanks!
The simplest solution I can think of is:
JavaRDD<String> randomStringRDD = RandomRDDs.uniformJavaRDD(jsc, numRows).map((Double d) -> RandomStringUtils.randomAlphabetic(42));
Here is a more complete example to test locally:
SparkConf conf = new SparkConf().setAppName("Test random").setMaster("local");
JavaSparkContext jsc = new JavaSparkContext(conf);
int numRows= 10;//put here how many rows you want
JavaRDD<String> randomStringRDD = RandomRDDs.uniformJavaRDD(jsc, numRows).map((Double d) -> RandomStringUtils.randomAlphabetic(42));
//display (to use only on small dataset)
for(String row:randomStringRDD.collect()){
System.out.println(row);
}
There is a small CPU overhead, because an initial set of random doubles is generated only to be thrown away, but this approach takes care of creating the partitions etc. for you.
If avoiding that small overhead is important to you, and you want to generate 1 million rows in 10 partitions, you could try the following:
Create an empty rdd via jsc.emptyRDD()
Set its partitioning via repartition to create 10 partitions
use a mapPartitions function to create 1 million / 10 partitions = 100,000 rows per partition; your RDD is ready (see the sketch below).
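A rough sketch of that recipe (using the Spark 2.x mapPartitions signature, which expects an Iterator to be returned; on 1.x you would return the List itself):
JavaRDD<Object> seed = jsc.emptyRDD();
JavaRDD<String> randomStringRDD = seed
    .repartition(10)                                   // 10 empty partitions
    .mapPartitions((Iterator<Object> ignored) -> {
        List<String> rows = new ArrayList<>(100000);
        for (int i = 0; i < 100000; i++) {
            rows.add(RandomStringUtils.randomAlphabetic(42));
        }
        return rows.iterator();                        // 100,000 random strings per partition
    });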
Side notes:
Having RandomRDDs.randomRDD() exposed in the Java API would make this simpler, but unfortunately it is not.
However, RandomRDDs.randomVectorRDD() is exposed, so you could use that one if you need to generate randomized vectors. (but you asked for Strings here so that does not apply).
The RandomRDD class is private to Spark, but we can access the RandomRDDs class and use it to create these. There are some examples in JavaRandomRDDsSuite.java (see https://github.com/apache/spark/blob/master/mllib/src/test/java/org/apache/spark/mllib/random/JavaRandomRDDsSuite.java). It seems that the Java examples all produce Doubles and the like, but we can use this and turn it into Strings like so:
import static org.apache.spark.mllib.random.RandomRDDs.*;
...
JavaDoubleRDD rdd1 = normalJavaRDD(sc, size, numPartitions);
JavaRDD<String> rdd = rdd1.map(e -> Double.toString(e));
That being said, we could use the randomRDD function, but it uses class tags, which are a bit frustrating to use from Java. (I've created a JIRA, https://issues.apache.org/jira/browse/SPARK-10626, to make an easy Java API for accessing this.)