I am working on a task in which I need to process data in chunks. I have a properties file in which I define the chunk size, say 500, and the data that I am getting from the database is, say, 1000 records. I want to process those 1000 records in chunks of 500 each using multi-threading.
This is the first time I am implementing this, so please let me know if I can achieve the same thing using another technique. The main purpose behind this is that I am generating an Excel file in which I populate the data keeping the chunk size in mind. So probably the first thread processes 500 records and the second thread the next 500.
Partial code (the rest parses the XML and writes to Excel using POI):
public List<NYProgramTO> getNYPPAData() throws Exception {
    this.getConfiguration();
    List<NYProgramTO> to = dao.getLatestNYData();

    DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
    Document document = null;

    // Returns a chunk of (at most) chunkSize records
    List<NYProgramTO> myList = getNextChunk(to);

    ExecutorService executor = Executors.newFixedThreadPool(myList.size());
    myList.stream()
          .forEach((NYProgramTO nyTo) -> {
              executor.execute(new NYExecutorThread(nyTo, migrationConfig, appContext, dao));
          });

    executor.shutdown();
    executor.awaitTermination(300, TimeUnit.SECONDS);
    System.gc();
}
The dao.getLatestNYData() method returns me all the records from the database, and that is how I populate the list to.
I have the following method which gives me the next chunk, so if 500 records have already been processed, this method should give me the next 500 records to process (I hope this makes sense).
private static List<NYProgramTO> getNextChunk(List<NYProgramTO> list) {
    currentIndex = 0; // This is a static int class variable
    List<NYProgramTO> nyList = new ArrayList<>();

    if (list.size() == 0) {
        return list;
    }

    int totalCount = list.size();
    for (int i = currentIndex; i < (currentIndex + chunkSize); i++) {
        if (i == totalCount) break;
        nyList.add(list.get(i));
    }
    return nyList;
}
In my first method I create the threads, but I am not sure how many threads I need to create. Currently I am passing the size of the list that I receive from the getNextChunk() method.
NYExecutorThread simply implements Runnable and I don't have any logic in it yet. Currently I just pass parameters to the constructor so that I can access the configuration and create the threads.
It is a little confusing, and if anyone has implemented such logic, please let me know how I can go ahead with this.
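For reference, here is a rough sketch of the kind of chunked submission I have in mind (just a guess at one common pattern, not working code from my project; processChunk is a placeholder for the POI logic, and this would sit inside a method that throws Exception like getNYPPAData()):
// Sketch only: split the full record list into chunks of chunkSize and submit
// one task per chunk to a small, fixed-size pool (the pool size does not need
// to match the number of records).
List<NYProgramTO> all = dao.getLatestNYData();
int chunkSize = 500; // would come from the properties file
ExecutorService executor = Executors.newFixedThreadPool(
        Runtime.getRuntime().availableProcessors());

for (int start = 0; start < all.size(); start += chunkSize) {
    List<NYProgramTO> chunk =
            all.subList(start, Math.min(start + chunkSize, all.size()));
    executor.execute(() -> processChunk(chunk)); // processChunk = placeholder for the POI work
}

executor.shutdown();
executor.awaitTermination(5, TimeUnit.MINUTES);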
Thanks
I have two separate ChronicleQueues that were created by independent threads that monitor web socket streams in a Java application. When I read each queue independently in a separate single-thread program, I can traverse each entire queue as expected - using the following minimal code:
final ExcerptTailer queue1Tailer = queue1.createTailer();
final ExcerptTailer queue2Tailer = queue2.createTailer();

while (true)
{
    try (final DocumentContext context = queue1Tailer.readingDocument())
    {
        if (isNull(context.wire()))
            break;
        counter1++;
        queue1Data = context.wire()
                            .bytes()
                            .readObject(Queue1Data.class);
        queue1Writer.write(String.format("%d\t%d\t%d%n", counter1, queue1Data.getEventTime(), queue1Data.getEventContent()));
    }
}

while (true)
{
    try (final DocumentContext context = queue2Tailer.readingDocument())
    {
        if (isNull(context.wire()))
            break;
        counter2++;
        queue2Data = context.wire()
                            .bytes()
                            .readObject(Queue2Data.class);
        queue2Writer.write(String.format("%d\t%d\t%d%n", counter2, queue2Data.getEventTime(), queue2Data.getEventContent()));
    }
}
In the above, I am able to read all the Queue1Data objects, then all the Queue2Data objects, and access their values as expected. However, when I try to interleave reading the queues (read an object from one queue, then, based on a property of that Queue1Data object (a time stamp, the limit variable below), read Queue2Data objects until the first one after that time stamp is found, and then do something with it), an exception is thrown after only one object has been read from queue2Tailer: DecoratedBufferUnderflowException: readCheckOffset0 failed. The simplified code that fails is below (I have tried putting the outer while(true) loop both inside and outside the queue2Tailer try block):
final ExcerptTailer queue1Tailer = queue1Queue.createTailer("label1");
try (final DocumentContext queue1Context = queue1Tailer.readingDocument())
{
    final ExcerptTailer queue2Tailer = queue2Queue.createTailer("label2");
    while (true)
    {
        try (final DocumentContext queue2Context = queue2Tailer.readingDocument())
        {
            if (isNull(queue2Context.wire()))
            {
                terminate = true;
                break;
            }
            queue2Data = queue2Context.wire()
                                      .bytes()
                                      .readObject(Queue2Data.class);
            while (true)
            {
                queue1Data = queue1Context.wire()
                                          .bytes()
                                          .readObject(Queue1Data.class); // first read succeeds
                if (queue1Data.getFieldValue() > limit) // if this fails the inner loop continues
                {                                       // but the second read fails
                    // cache a value
                    break;
                }
            }
            // continue working with queue2Data object and cached values
        } // end try block for queue2 tailer
    } // end outer while loop
} // end outer try block for queue1 tailer
I have tried it as above, and also with both tailers created at the beginning of the function which does the processing (a private function executed when a button is clicked in a relatively simple Java application). Basically I took the loop which worked independently and put it inside another loop in the function, expecting no problems. I think I am missing something crucial in how tailers are positioned and used to read objects, but I cannot figure out what it is, since the same basic code works when reading the queues independently. I got the use of isNull(context.wire()) to determine when there are no more objects in a queue from one of the examples, though I am not sure it is the proper way to detect this when processing a queue sequentially.
Any suggestions would be appreciated.
You're not writing it correctly in the first instance.
Now, there's a hardcore way of achieving what you are trying to achieve (that is, doing everything explicitly, at a lower level), and there's the MethodReader/MethodWriter magic provided by Chronicle.
Hardcore way
Writing
// write the first event type
try (DocumentContext dc = queueAppender.writingDocument()) {
    dc.wire().writeEventName("first").text("Hello first");
}

// write the second event type
try (DocumentContext dc = queueAppender.writingDocument()) {
    dc.wire().writeEventName("second").text("Hello second");
}
This will write different types of messages into the same queue, and you will be able to easily distinguish those when reading.
Reading
StringBuilder reusable = new StringBuilder();
while (true) {
    try (DocumentContext dc = tailer.readingDocument()) {
        if (!dc.isPresent()) {
            continue;
        }
        dc.wire().readEventName(reusable);
        if ("first".contentEquals(reusable)) {
            // handle first
        } else if ("second".contentEquals(reusable)) {
            // handle second
        }
        // optionally handle other events
    }
}
The Chronicle Way (aka Peter's magic)
This works with any marshallable types, as well as any primitive types, CharSequence subclasses (i.e. Strings), and Bytes. For more details, have a read of the MethodReader/MethodWriter documentation.
Suppose you have some data classes:
public class FirstDataType implements Marshallable { // alternatively - extends SelfDescribingMarshallable
    // data fields...
}

public class SecondDataType implements Marshallable { // alternatively - extends SelfDescribingMarshallable
    // data fields...
}
Then, to write those data classes to the queue, you just need to define the interface, like this:
interface EventHandler {
    void first(FirstDataType first);
    void second(SecondDataType second);
}
Writing
Then, writing data is as simple as:
final EventHandler writer = appender.methodWriterBuilder(EventHandler.class).get();
// assuming firstDatum and secondDatum are created earlier
writer.first(firstDatum);
writer.second(secondDatum);
What this does is the same as in the hardcore section - it writes the event name (which is taken from the method name of the method writer, i.e. "first" or "second" respectively), and then the actual data object.
Reading
Now, to read those events from the queue, you need to provide an implementation of the above interface that handles the corresponding event types, e.g.:
// you implement this to read data from the queue
private class MyEventHandler implements EventHandler {
    public void first(FirstDataType first) {
        // handle first type of events
    }

    public void second(SecondDataType second) {
        // handle second type of events
    }
}
And then you read as follows:
EventHandler handler = new MyEventHandler();
MethodReader reader = tailer.methodReader(handler);
while (true) {
    reader.readOne(); // readOne returns boolean value which can be used to determine if there's no more data, and pause if appropriate
}
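For example, a minimal sketch of using that boolean to back off when nothing is available right now, rather than spinning (LockSupport is just one way to pause here; Chronicle also ships pauser utilities that could be used instead):
// Sketch: readOne() returns false when there is no message to read at the moment,
// so back off briefly instead of burning a CPU core.
while (true) {
    if (!reader.readOne()) {
        java.util.concurrent.locks.LockSupport.parkNanos(1_000_000L); // ~1 ms
    }
}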
Misc
You don't have to use the same interface for reading and writing. In case you want to read only events of the second type, you can define another interface:
interface OnlySecond {
    void second(SecondDataType second);
}
Now, if you create a handler implementing this interface and pass it to the tailer#methodReader() call, the readOne() calls will only process events of the second type, skipping all others.
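For instance, a small sketch of such a second-only reader (OnlySecond has a single method, so a lambda can serve as the handler):
// Sketch: this reader reacts only to "second" events; all other events are skipped.
MethodReader onlySecondReader = tailer.methodReader(
        (OnlySecond) second -> {
            // handle the second type of events
        });
while (true) {
    onlySecondReader.readOne();
}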
This also works for MethodWriters, i.e. if you have several processes writing different types of data and one process consuming all that data, it is not uncommon to define multiple interfaces for writing data and then a single interface extending all the others for reading, e.g.:
interface FirstOut {
    void first(String first);
}

interface SecondOut {
    void second(long second);
}

interface ThirdOut {
    void third(ThirdDataType third);
}

interface AllIn extends FirstOut, SecondOut, ThirdOut {
}
(I deliberately used different data types for method parameters to show how it is possible to use various types)
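To illustrate, a sketch of that fan-in setup (reusing the appender and tailer names from earlier): the producers write through their narrow interfaces, while the single consumer reads everything through AllIn:
// Sketch: each producer gets its own narrow writer...
FirstOut firstWriter = appender.methodWriterBuilder(FirstOut.class).get();
SecondOut secondWriter = appender.methodWriterBuilder(SecondOut.class).get();

firstWriter.first("hello");
secondWriter.second(42L);

// ...while the consumer handles all event types through the combined interface.
MethodReader reader = tailer.methodReader(new AllIn() {
    @Override public void first(String first) { /* handle first */ }
    @Override public void second(long second) { /* handle second */ }
    @Override public void third(ThirdDataType third) { /* handle third */ }
});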
With further testing, I have found that nested loops reading multiple queues which contain data in different POJO classes are possible. The problem with the code in the question above is that queue1Context is obtained once, OUTSIDE the loop that I expected to read queue1Data objects in. My fundamental misconception was that DocumentContext objects manage stepping through the objects in a queue, whereas actually ExcerptTailer objects manage the stepping (maintaining indices) when reading a queue sequentially.
In case it might help someone else just getting started with ChronicleQueues, the inner loop in the original question should be:
while (true)
{
    try (final DocumentContext queue1Context = queue1Tailer.readingDocument())
    {
        queue1Data = queue1Context.wire()
                                  .bytes()
                                  .readObject(Queue1Data.class); // first read succeeds
        if (queue1Data.getFieldValue() > limit) // if this fails the inner loop continues as expected
        {                                       // and the second and subsequent reads now succeed
            // cache a value
            break;
        }
    }
}
And of course the outer-most try block containing queue1Context (in the original code) should be removed.
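Putting it together, the overall corrected shape looks roughly like this (a sketch of my understanding, with an added emptiness check on queue1; names are those from the original code):
// Sketch: the tailers do the positioning; every read gets its own short-lived DocumentContext.
final ExcerptTailer queue1Tailer = queue1Queue.createTailer("label1");
final ExcerptTailer queue2Tailer = queue2Queue.createTailer("label2");

while (true)
{
    try (final DocumentContext queue2Context = queue2Tailer.readingDocument())
    {
        if (isNull(queue2Context.wire()))
            break; // no more queue2 entries
        queue2Data = queue2Context.wire()
                                  .bytes()
                                  .readObject(Queue2Data.class);
        while (true)
        {
            try (final DocumentContext queue1Context = queue1Tailer.readingDocument())
            {
                if (isNull(queue1Context.wire()))
                    break; // queue1 exhausted
                queue1Data = queue1Context.wire()
                                          .bytes()
                                          .readObject(Queue1Data.class);
                if (queue1Data.getFieldValue() > limit)
                {
                    // cache a value
                    break;
                }
            }
        }
        // continue working with queue2Data and the cached values
    }
}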
So I have a really large list of zip codes (about 80,000) that I want to pass to a URL, getting the JSON data from that URL for each zip code.
I am running a query on that JSON to see if it has end_lat, and if it does, I want to save that zip code to a list.
As I am fetching and matching JSON for a lot of zip codes, it's taking forever.
So I tried a few different approaches to make it a multi-threaded application. I tried the good old Thread approach with the Runnable interface.
I tried executor services. But everything stops abruptly, which makes me believe that I should be making synchronized writes to that list.
public void breakingZipCodesForThreads() {
    List<String> zip_Codes = Serenity.sessionVariableCalled("zipCodes");
    int size = (int) Math.ceil(zip_Codes.size() / 5.0);
    ExecutorService executor = Executors.newFixedThreadPool(4);

    for (int start = 0; start < zip_Codes.size(); start += size) {
        int end = Math.min(start + size, zip_Codes.size());
        Runnable worker = new MyRunnable(zip_Codes.subList(start, end));
        executor.execute(worker);
    }
}

// The run() method basically has this code for a function
for (String zipCode : zip_Codes) {
    currentPage = pageUrl + zipCode;
    Response response = given().urlEncodingEnabled(false)
            .when()
            .get(currentPage);
    try {
        Object end_lat = response.getBody().path("end_lat");
        if (end_lat != null && !end_lat.toString().isEmpty()) {
            resultantZipCode.add(zipCode);
        }
    } catch (Exception e) {
        // Something else
    }
}
So essentially I want all my threads to concurrently write to the list resultantZipCode and, in the end, give me a single list of all the zip codes that satisfy my condition.
So how do I break my zip codes into pieces, run the run() function on them in parallel, and collect all the resulting zip codes into one list that gets returned to me? What am I missing?
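What I am trying to get to, I think, is something shaped like the sketch below (it reuses pageUrl, given() and the end_lat check from my code above; each worker returns its own partial list via a Callable and the main thread merges them, so no shared list is needed):
// Sketch: split the zip codes into slices, let each Callable return the matches
// for its slice, and merge the partial results at the end.
ExecutorService executor = Executors.newFixedThreadPool(4);
List<Future<List<String>>> futures = new ArrayList<>();

int size = (int) Math.ceil(zip_Codes.size() / 4.0);
for (int start = 0; start < zip_Codes.size(); start += size) {
    List<String> slice = zip_Codes.subList(start, Math.min(start + size, zip_Codes.size()));
    futures.add(executor.submit(() -> {
        List<String> matches = new ArrayList<>();
        for (String zipCode : slice) {
            Response response = given().urlEncodingEnabled(false).when().get(pageUrl + zipCode);
            Object endLat = response.getBody().path("end_lat");
            if (endLat != null && !endLat.toString().isEmpty()) {
                matches.add(zipCode);
            }
        }
        return matches;
    }));
}

List<String> resultantZipCode = new ArrayList<>();
for (Future<List<String>> f : futures) {
    try {
        resultantZipCode.addAll(f.get()); // blocks until that worker is done
    } catch (InterruptedException | ExecutionException e) {
        // handle/log as appropriate
    }
}
executor.shutdown();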
I have a flat file which contains data line by line, and my task is to verify which of that data is not present in the DB. I am trying to verify this using Java: first I insert the flat file data into HashSet1 and the DB data into another set, HashSet2, and after that I check HashSet1.contains(...) so that I can identify which data is not present in the DB.
Given below is dummy code, in which you can assume hashset1 (which has some data missing) is the file reader data and hashset2 (the full data from the DB) is the DB data.
But as I mentioned, I have 30 million records to verify. I am able to verify 1 million records this way, but not 30 million, which is my task. Is there a better way to do this? Kindly suggest one, with some sort of code if possible; I will be very thankful.
public class App
{
    public static void sampleMethod() {
        Set<Integer> hashset1 = new HashSet<Integer>();
        Set<Integer> hashset2 = new HashSet<Integer>();
        for (int i = 0; i < 30000000; i++) {
            if (i % 50000 != 0) {
                hashset1.add(i);
            }
        }
        int count = 0;
        for (int j = 0; j < 30000000; j++) {
            if (hashset1.contains(j)) {
                count++;
            } else {
                System.out.println(j + " Is Not Present");
                hashset2.add(j);
            }
        }
        System.out.println("Contain Value Count" + count);
    }

    public static void main(String[] args)
    {
        sampleMethod();
    }
}
Error stack trace:
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at java.util.HashMap.resize(HashMap.java:703)
at java.util.HashMap.putVal(HashMap.java:662)
at java.util.HashMap.put(HashMap.java:611)
at java.util.HashSet.add(HashSet.java:219)
at com.java.anz.BankingPro.App.sampleMethod(App.java:20)
at com.java.anz.BankingPro.App.main(App.java:38)
For comparing two sets of data, it's enough to load only the smaller of the two into a hash set (1.), then, as the next step, detect the differences between the sets (2.), and only then modify the data according to the differences found (3.). Let's call the small set simply smallHashSet in the following pseudo code:
1. Load the smaller set of data into smallHashSet.
2. Iterate (loop) over the entries in the bigger set of data, one by one - do not load it all at once, just load one entry after another and process one at a time:
2.1. Let's say bigSetEntry is such an entry from the bigger set; then
if (smallHashSet.contains(bigSetEntry)) smallHashSet.remove(bigSetEntry).
When you are done, smallHashSet contains only the entries which are in the small set but missing from the big set. And you never needed to load the big set all at once. You can now do something with these differing entries, e.g. add them to the big data file.
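A minimal sketch of that pseudo code in Java (assumptions: the flat file is the smaller side, the DB rows can be streamed through a plain JDBC ResultSet, this runs inside a method that declares throws IOException, SQLException, and connection, the file name, and the table/column names are placeholders):
// 1. Load the smaller set (here: the flat file) into a HashSet.
Set<String> smallHashSet = new HashSet<>();
try (BufferedReader reader = new BufferedReader(new FileReader("flatfile.txt"))) {
    String line;
    while ((line = reader.readLine()) != null) {
        smallHashSet.add(line.trim());
    }
}

// 2. Stream the bigger set (here: the DB) one row at a time and remove the matches.
try (Statement st = connection.createStatement();
     ResultSet rs = st.executeQuery("SELECT key_column FROM big_table")) {
    while (rs.next()) {
        smallHashSet.remove(rs.getString(1));
    }
}

// 3. Whatever is left is present in the file but missing from the DB.
System.out.println("Missing from DB: " + smallHashSet.size() + " entries");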
I'm using Spark to calculate the PageRank of user reviews, but I keep getting a java.lang.StackOverflowError when I run my code on a big dataset (40k entries). When running the code on a small number of entries it works fine, though.
Entry example:
product/productId: B00004CK40 review/userId: A39IIHQF18YGZA review/profileName: C. A. M. Salas review/helpfulness: 0/0 review/score: 4.0 review/time: 1175817600 review/summary: Reliable comedy review/text: Nice script, well acted comedy, and a young Nicolette Sheridan. Cusak is in top form.
The Code:
public void calculatePageRank() {
    sc.clearCallSite();
    sc.clearJobGroup();
    JavaRDD<String> rddFileData = sc.textFile(inputFileName).cache();
    sc.setCheckpointDir("pagerankCheckpoint/");

    JavaRDD<String> rddMovieData = rddFileData.map(new Function<String, String>() {
        @Override
        public String call(String arg0) throws Exception {
            String[] data = arg0.split("\t");
            String movieId = data[0].split(":")[1].trim();
            String userId = data[1].split(":")[1].trim();
            return movieId + "\t" + userId;
        }
    });

    JavaPairRDD<String, Iterable<String>> rddPairReviewData = rddMovieData.mapToPair(new PairFunction<String, String, String>() {
        @Override
        public Tuple2<String, String> call(String arg0) throws Exception {
            String[] data = arg0.split("\t");
            return new Tuple2<String, String>(data[0], data[1]);
        }
    }).groupByKey().cache();

    JavaRDD<Iterable<String>> cartUsers = rddPairReviewData.map(f -> f._2());
    List<Iterable<String>> cartUsersList = cartUsers.collect();
    JavaPairRDD<String, String> finalCartesian = null;
    int iterCounter = 0;
    for (Iterable<String> out : cartUsersList) {
        JavaRDD<String> currentUsersRDD = sc.parallelize(Lists.newArrayList(out));
        if (finalCartesian == null) {
            finalCartesian = currentUsersRDD.cartesian(currentUsersRDD);
        } else {
            finalCartesian = currentUsersRDD.cartesian(currentUsersRDD).union(finalCartesian);
            if (iterCounter % 20 == 0) {
                finalCartesian.checkpoint();
            }
        }
    }

    JavaRDD<Tuple2<String, String>> finalCartesianToTuple = finalCartesian.map(m -> new Tuple2<String, String>(m._1(), m._2()));
    finalCartesianToTuple = finalCartesianToTuple.filter(x -> x._1().compareTo(x._2()) != 0);
    JavaPairRDD<String, String> userIdPairs = finalCartesianToTuple.mapToPair(m -> new Tuple2<String, String>(m._1(), m._2()));

    JavaRDD<String> userIdPairsString = userIdPairs.map(new Function<Tuple2<String, String>, String>() {
        // Tuple2<Tuple2<MovieId, userId>, Tuple2<movieId, userId>>
        @Override
        public String call(Tuple2<String, String> t) throws Exception {
            return t._1 + " " + t._2;
        }
    });

    try {
        // calculate pagerank using this https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/JavaPageRank.java
        JavaPageRank.calculatePageRank(userIdPairsString, 100);
    } catch (Exception e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
    sc.close();
}
I have multiple suggestions which will help you to greatly improve the performance of the code in your question.
Caching: Caching should be used on data sets which you need to refer to again and again for the same or different operations (iterative algorithms).
An example is RDD.count: to tell you the number of lines in the file, the file needs to be read. So if you write RDD.count, at this point the file will be read, the lines will be counted, and the count will be returned.
What if you call RDD.count again? The same thing: the file will be read and counted again. So what does RDD.cache do? Now, if you run RDD.count the first time, the file will be loaded, cached, and counted. If you call RDD.count a second time, the operation will use the cache. It will just take the data from the cache and count the lines, no recomputing.
Read more about caching here.
In your code sample you are not reusing anything that you've cached. So you may remove the .cache from there.
Parallelization: In the code sample, you've parallelized every individual element of your RDD, which is already a distributed collection. I suggest you merge the rddFileData, rddMovieData and rddPairReviewData steps so that everything happens in one go.
Get rid of .collect, since that brings the results back to the driver and is maybe the actual reason for your error.
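A rough sketch of those two suggestions combined (illustrative only, assuming Spark 2.x where flatMap takes an Iterator-returning function): parse in a single mapToPair pass, and build the per-movie user pairs with flatMap instead of collect() plus a driver-side loop:
// Parse each line once, straight into (movieId, userId) pairs, then group by movie.
JavaPairRDD<String, Iterable<String>> moviesToUsers = sc.textFile(inputFileName)
        .mapToPair(line -> {
            String[] data = line.split("\t");
            String movieId = data[0].split(":")[1].trim();
            String userId = data[1].split(":")[1].trim();
            return new Tuple2<>(movieId, userId);
        })
        .groupByKey();

// Build the user-user pairs per movie on the cluster, without collect().
JavaRDD<String> userIdPairsString = moviesToUsers.flatMap(entry -> {
    List<String> users = Lists.newArrayList(entry._2());
    List<String> pairs = new ArrayList<>();
    for (String a : users) {
        for (String b : users) {
            if (!a.equals(b)) {
                pairs.add(a + " " + b);
            }
        }
    }
    return pairs.iterator(); // Spark 2.x flatMap expects an Iterator
});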
This problem occurs when your DAG grows big and too many levels of transformations are happening in your code. The JVM cannot hold all the operations needed to perform the lazy execution when an action is finally performed.
Checkpointing is one option. I would suggest implementing Spark SQL for this kind of aggregation. If your data is structured, try loading it into DataFrames and performing the grouping and other SQL functions to achieve this.
When your for loop grows really large, Spark can no longer keep track of the lineage. Enable checkpointing in your for loop to checkpoint your RDD every 10 iterations or so. Checkpointing will fix the problem. Don't forget to clean up the checkpoint directory afterwards.
http://spark.apache.org/docs/latest/streaming-programming-guide.html#checkpointing
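As a sketch of what that could look like in the question's loop (illustrative only; note that checkpoint() is lazy, so an action is needed to actually materialize it):
sc.setCheckpointDir("pagerankCheckpoint/");
int iterCounter = 0;
for (Iterable<String> out : cartUsersList) {
    // ... build currentUsersRDD and union it into finalCartesian as before ...
    iterCounter++;
    if (iterCounter % 10 == 0) {
        finalCartesian.checkpoint();
        finalCartesian.count(); // an action forces the checkpoint, truncating the lineage
    }
}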
The things below fixed the StackOverflowError; as others have pointed out, it's caused by the lineage that Spark keeps building, especially when you have a loop/iteration in your code.
Set a checkpoint directory:
spark.sparkContext.setCheckpointDir("./checkpoint")
Checkpoint the DataFrame/RDD you are modifying/operating on in the iteration:
modifyingDf.checkpoint()
Cache the DataFrames which are reused in each iteration:
reusedDf.cache()
I have this code that dumps documents into MongoDB once an ArrayBlockingQueue fills its quota. When I run the code, it seems to run only once and then gives me a stack trace. My guess is that the BulkWriteOperation somehow has to be 'reset' or started over again.
Also, I create the BulkWriteOperations in the constructor...
bulkEvent = eventsCollection.initializeOrderedBulkOperation();
bulkSession = sessionsCollection.initializeOrderedBulkOperation();
Here's the stacktrace.
10 records inserted
java.lang.IllegalStateException: already executed
at org.bson.util.Assertions.isTrue(Assertions.java:36)
at com.mongodb.BulkWriteOperation.insert(BulkWriteOperation.java:62)
at willkara.monkai.impl.managers.DataManagers.MongoDBManager.dumpQueue(MongoDBManager.java:104)
at willkara.monkai.impl.managers.DataManagers.MongoDBManager.addToQueue(MongoDBManager.java:85)
Here's the code for the Queues:
public void addToQueue(Object item) {
    if (item instanceof SakaiEvent) {
        if (eventQueue.offer((SakaiEvent) item)) {
        } else {
            dumpQueue(eventQueue);
        }
    }

    if (item instanceof SakaiSession) {
        if (sessionQueue.offer((SakaiSession) item)) {
        } else {
            dumpQueue(sessionQueue);
        }
    }
}
And here is the code that reads from the queues, adds the items to a BulkWriteOperation (initializeOrderedBulkOperation), executes it, and then dumps the data into the database. Only 10 documents get written and then it fails.
private void dumpQueue(BlockingQueue q) {
    Object item = q.peek();
    Iterator itty = q.iterator();
    BulkWriteResult result = null;

    if (item instanceof SakaiEvent) {
        while (itty.hasNext()) {
            bulkEvent.insert(((SakaiEvent) itty.next()).convertToDBObject());
            // It's failing at that line^^
        }
        result = bulkEvent.execute();
    }

    if (item instanceof SakaiSession) {
        while (itty.hasNext()) {
            bulkSession.insert(((SakaiSession) itty.next()).convertToDBObject());
        }
        result = bulkSession.execute();
    }

    System.out.println(result.getInsertedCount() + " records inserted");
}
The general documentation applies to all driver implementations in this case:
"After execution, you cannot re-execute the Bulk() object without reinitializing."
So the .execute() method effectively "drains" the current list of operations that have been sent to it, and the object then contains state information about how the commands were actually sent. So you cannot add more entries or call .execute() again on the same instance without reinitializing.
So after you call .execute() on each "Bulk" object, you need to initialize it again:
bulkEvent = eventsCollection.initializeOrderedBulkOperation();
bulkSession = sessionsCollection.initializeOrderedBulkOperation();
Each of those lines should be placed again, respectively, after each .execute() call in your function. Then further calls to those instances can add operations and call .execute() again, continuing the cycle.
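Applied to the dumpQueue() method from the question, that looks roughly like this (a sketch; only the event branch is shown):
if (item instanceof SakaiEvent) {
    while (itty.hasNext()) {
        bulkEvent.insert(((SakaiEvent) itty.next()).convertToDBObject());
    }
    result = bulkEvent.execute();
    // re-initialize so the next dump starts with a fresh bulk operation
    bulkEvent = eventsCollection.initializeOrderedBulkOperation();
}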
Note that "Bulk" operations objects will store as many items as you want to put into them but will break up requests to the server into maximum amounts of 1000 items. After execution the state of the operations list will reflect exactly how this is done should you want to inspect that.