Reading documents from the Couchbase bucket as batches - java

I have a Couchbase cluster which has around 25M documents. I am able to read them sequentially and also I have a function that can read a specific number of documents from the database. But my use case is slightly different since I cannot store all the 25M documents (each document is huge) in memory.
I need to process the documents in batches, say 1M/batch, push that batch to my memory, (do some operation on those documents) and push the next batch.
The function which I have written to read specific number of documents doesn't ensure that it returns a different set of documents when called again.
Is there a way by which I can complete this functionality? I also have a function which can create documents in batches. I am not sure if I can write a similar function that can read the documents in batches. The function is given below.
public void createMultipleCustomerDocuments(String docId, Customer myCust, long numDocs) {
Gson gson = new GsonBuilder().create();
JsonObject content = JsonObject.fromJson(gson.toJson(myCust));
JsonDocument document = JsonDocument.create(docId, content);
jsonDocuments.add(document);
documentCounter++;
if (documentCounter == numDocs) {
Observable.from(jsonDocuments).flatMap(new Func1<JsonDocument, Observable<JsonDocument>>() {
public Observable<JsonDocument > call(final JsonDocument docToInsert) {
return (theBucket.async().upsert(docToInsert));
}
}).last().toBlocking().single();
documentCounter = 0;
//System.out.println("Batch counter: " + batchCounter++);
}
Can someone please help me with this?

I would try to create a view which containing all of the documents, and then querying the view with skip and limit. (Can use .startKey() and startKeyId() functions instead of skip() to avoid overhead.)
but, remember not to keep that view in a production env, it will be cpu hog.
Another option, use the DCP protocol to replicate the database into your app. but it is more work.

Related

How does Flinks Collector.collect() handle data?

Im trying to understand what Flinks Collector.collect() does and how it handles incoming/outgoing data:
Example taken from Flink DataSet API:
The following code transforms a DataSet of text lines into a DataSet of words:
DataSet<String> output = input.flatMap(new Tokenizer());
public class Tokenizer implements FlatMapFunction<String, String> {
#Override
public void flatMap(String value, Collector<String> out) {
for (String token : value.split("\\W")) {
out.collect(token);
}
}
}
So the text Lines get split into tokens and each of them gets "collected". As intuitive as it might sound but im missing the actual dynamics behind Collector.collect(). Where is the collected data stored before it gets assigned to output i.e does Flink put them in some sort of Buffer? And if yes, how is the data transferred to the network?
from the official source code documentation.
Collects a record and forwards it. The collector is the "push"
counterpart of the {#link java.util.Iterator}, which "pulls" data in.
So, it receives a value and stores one or more values into the Iterator. Then pushes to the next operator. But this is a matter of the network stack/ buffers.

Apache Flink : Add side inputs for DataStream API

In my Java application, I have three DataStreams. For example, for One stream data is consumed from Kafka, for another stream data is consumed from Apache Nifi. For these two streams Object type is different. For example, Stream-1 object type is Person, Stream-2 object type is Address.
The third one is the broadcast stream (for this data is consumed from Kafka).
Now I want to combine Stream-1 and Stream-2 in a Job class and want to split in the task process element. How to implement this?
Note :
Stream-1 is mainstream and Stream-2 is side input. MainStream is continuously fetching data from Kafka. For Side Input, initially while the application is UP all table data is loaded from DB and then read new data when the table data is updated (not frequently) .
Sample structure:
DataStream<Person> stream-1 = env.addSource(read data from kafka)....
DataStream<Address> stream-2 = env.addSource(read data from nifi)....
BroadcastStream<String> BroadCastStream = stream-3.broadcast(read data from kafka);
I was referred to as the following links.
FLIP-17 Side Inputs for DataStream API
jira/browse/FLINK-6131
My Use case is :
Join stream with slowly evolving data: The side input that we use for enriching is evolving over time (Data is read from DB). This can be done by waiting for some initial data to be available before processing the main input and the continuously ingesting new data into the internal side input structure as it arrives.
Based on the latest response, the recommendation by #Arvid was in fact what was needed here.
Core of the answer:
You can easily join stream1 and stream2 even if they have different
types. Then you can add the broadcast to the result
Links to doc and example, and a relevant snippet from the doc (the example is too long to be included in here):
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
...
DataStream<Integer> orangeStream = ...
DataStream<Integer> greenStream = ...
orangeStream.join(greenStream)
.where(<KeySelector>)
.equalTo(<KeySelector>)
.window(TumblingEventTimeWindows.of(Time.milliseconds(2)))
.apply (new JoinFunction<Integer, Integer, String> (){
#Override
public String join(Integer first, Integer second) {
return first + "," + second;
}
});

Spark - Collect partitions using foreachpartition

We are using spark for file processing. We are processing pretty big files with each file around 30 GB with about 40-50 million lines. These files are formatted. We load them into data frame. Initial requirement was to identify records matching criteria and load them to MySQL. We were able to do that.
Requirement changed recently. Records not meeting criteria are now to be stored in an alternate DB. This is causing issue as the size of collection is too big. We are trying to collect each partition independently and merge into a list as suggested here
https://umbertogriffo.gitbooks.io/apache-spark-best-practices-and-tuning/content/dont_collect_large_rdds.html
We are not familiar with scala, so we are having trouble converting this to Java. How can we iterate over partitions one by one and collect?
Thanks
Please use df.foreachPartition to execute for each partition independently and won't returns to driver. You can save the matching results into DB in each executor level. If you want to collect the results in driver, use mappartitions which is not recommended for your case.
Please refer the below link
Spark - Java - foreachPartition
dataset.foreachPartition(new ForeachPartitionFunction<Row>() {
public void call(Iterator<Row> r) throws Exception {
while (t.hasNext()){
Row row = r.next();
System.out.println(row.getString(1));
}
// do your business logic and load into MySQL.
}
});
For mappartitions:
// You can use the same as Row but for clarity I am defining this.
public class ResultEntry implements Serializable {
//define your df properties ..
}
Dataset<ResultEntry> mappedData = data.mapPartitions(new MapPartitionsFunction<Row, ResultEntry>() {
#Override
public Iterator<ResultEntry> call(Iterator<Row> it) {
List<ResultEntry> filteredResult = new ArrayList<ResultEntry>();
while (it.hasNext()) {
Row row = it.next()
if(somecondition)
filteredResult.add(convertToResultEntry(row));
}
return filteredResult.iterator();
}
}, Encoders.javaSerialization(ResultEntry.class));
Hope this helps.
Ravi

Apache Beam in Dataflow Large Side Input

This is most similar to this question.
I am creating a pipeline in Dataflow 2.x that takes streaming input from a Pubsub queue. Every single message that comes in needs to be streamed through a very large dataset that comes from Google BigQuery and have all the relevant values attached to it (based on a key) before being written to a database.
The trouble is that the mapping dataset from BigQuery is very large - any attempt to use it as a side input fails with the Dataflow runners throwing the error "java.lang.IllegalArgumentException: ByteString would be too long". I have attempted the following strategies:
1) Side input
As stated,the mapping data is (apparently) too large to do this. If I'm wrong here or there is a work-around for this, please let me know because this would be the simplest solution.
2) Key-Value pair mapping
In this strategy, I read the BigQuery data and Pubsub message data in the first part of the pipeline, then run each through ParDo transformations that change every value in the PCollections to KeyValue pairs. Then, I run a Merge.Flatten transform and a GroupByKey transform to attach the relevant mapping data to each message.
The trouble here is that streaming data requires windowing to be merged with other data, so I have to apply windowing to the large, bounded BigQuery data as well. It also requires that the windowing strategies are the same on both datasets. But no windowing strategy for the bounded data makes sense, and the few windowing attempts I've made simply send all the BQ data in a single window and then never send it again. It needs to be joined with every incoming pubsub message.
3) Calling BQ directly in a ParDo (DoFn)
This seemed like a good idea - have each worker declare a static instance of the map data. If it's not there, then call BigQuery directly to get it. Unfortunately this throws internal errors from BigQuery every time (as in the entire message just says "Internal error"). Filing a support ticket with Google resulted in them telling me that, essentially, "you can't do that".
It seems this task doesn't really fit the "embarrassingly parallelizable" model, so am I barking up the wrong tree here?
EDIT :
Even when using a high memory machine in dataflow and attempting to make the side input into a map view, I get the error java.lang.IllegalArgumentException: ByteString would be too long
Here is an example (psuedo) of the code I'm using:
Pipeline pipeline = Pipeline.create(options);
PCollectionView<Map<String, TableRow>> mapData = pipeline
.apply("ReadMapData", BigQueryIO.read().fromQuery("SELECT whatever FROM ...").usingStandardSql())
.apply("BQToKeyValPairs", ParDo.of(new BQToKeyValueDoFn()))
.apply(View.asMap());
PCollection<PubsubMessage> messages = pipeline.apply(PubsubIO.readMessages()
.fromSubscription(String.format("projects/%1$s/subscriptions/%2$s", projectId, pubsubSubscription)));
messages.apply(ParDo.of(new DoFn<PubsubMessage, TableRow>() {
#ProcessElement
public void processElement(ProcessContext c) {
JSONObject data = new JSONObject(new String(c.element().getPayload()));
String key = getKeyFromData(data);
TableRow sideInputData = c.sideInput(mapData).get(key);
if (sideInputData != null) {
LOG.info("holyWowItWOrked");
c.output(new TableRow());
} else {
LOG.info("noSideInputDataHere");
}
}
}).withSideInputs(mapData));
The pipeline throws the exception and fails before logging anything from within the ParDo.
Stack trace:
java.lang.IllegalArgumentException: ByteString would be too long: 644959474+1551393497
com.google.cloud.dataflow.worker.repackaged.com.google.protobuf.ByteString.concat(ByteString.java:524)
com.google.cloud.dataflow.worker.repackaged.com.google.protobuf.ByteString.balancedConcat(ByteString.java:576)
com.google.cloud.dataflow.worker.repackaged.com.google.protobuf.ByteString.balancedConcat(ByteString.java:575)
com.google.cloud.dataflow.worker.repackaged.com.google.protobuf.ByteString.balancedConcat(ByteString.java:575)
com.google.cloud.dataflow.worker.repackaged.com.google.protobuf.ByteString.balancedConcat(ByteString.java:575)
com.google.cloud.dataflow.worker.repackaged.com.google.protobuf.ByteString.copyFrom(ByteString.java:559)
com.google.cloud.dataflow.worker.repackaged.com.google.protobuf.ByteString$Output.toByteString(ByteString.java:1006)
com.google.cloud.dataflow.worker.WindmillStateInternals$WindmillBag.persistDirectly(WindmillStateInternals.java:575)
com.google.cloud.dataflow.worker.WindmillStateInternals$SimpleWindmillState.persist(WindmillStateInternals.java:320)
com.google.cloud.dataflow.worker.WindmillStateInternals$WindmillCombiningState.persist(WindmillStateInternals.java:951)
com.google.cloud.dataflow.worker.WindmillStateInternals.persist(WindmillStateInternals.java:216)
com.google.cloud.dataflow.worker.StreamingModeExecutionContext$StepContext.flushState(StreamingModeExecutionContext.java:513)
com.google.cloud.dataflow.worker.StreamingModeExecutionContext.flushState(StreamingModeExecutionContext.java:363)
com.google.cloud.dataflow.worker.StreamingDataflowWorker.process(StreamingDataflowWorker.java:1000)
com.google.cloud.dataflow.worker.StreamingDataflowWorker.access$800(StreamingDataflowWorker.java:133)
com.google.cloud.dataflow.worker.StreamingDataflowWorker$7.run(StreamingDataflowWorker.java:771)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
java.lang.Thread.run(Thread.java:745)
Check out the section called "Pattern: Streaming mode large lookup tables" in Guide to common Cloud Dataflow use-case patterns, Part 2. It might be the only viable solution since your side input doesn't fit into memory.
Description:
A large (in GBs) lookup table must be accurate, and changes often or
does not fit in memory.
Example:
You have point of sale information from a retailer and need to
associate the name of the product item with the data record which
contains the productID. There are hundreds of thousands of items
stored in an external database that can change constantly. Also, all
elements must be processed using the correct value.
Solution:
Use the "Calling external services for data enrichment" pattern
but rather than calling a micro service, call a read-optimized NoSQL
database (such as Cloud Datastore or Cloud Bigtable) directly.
For each value to be looked up, create a Key Value pair using the KV
utility class. Do a GroupByKey to create batches of the same key type
to make the call against the database. In the DoFn, make a call out to
the database for that key and then apply the value to all values by
walking through the iterable. Follow best practices with client
instantiation as described in "Calling external services for data
enrichment".
Other relevant patterns are described in Guide to common Cloud Dataflow use-case patterns, Part 1:
Pattern: Slowly-changing lookup cache
Pattern: Calling external services for data enrichment

Elasticsearch Performance Analysis

We are currently evaluating Elasticsearch as our solution for Analytics. The main driver is the fact that once the data is populated into Elasticsearch, the reporting comes for free with Kibana.
Before adopting it, I am tasked to do a performance analysis of the tool.
The main requirement is supporting a PUT rate of 500 evt/sec.
I am currently starting with a small setup as follows just to get a sense of the API before I upload that to a more serious lab.
My Strategy is basically, going over CSVs of analytics that correspond to the format I need and putting them into elasticsearch. I am not using the bulk API because in reality the events will not arrive in a bulk fashion.
Following is the main code that does this:
// Created once, used for creating a JSON from a bean
ObjectMapper mapper = new ObjectMapper();
// Creating a measurement for checking the count of sent events vs
// ES stored events
AnalyticsMetrics metrics = new AnalyticsMetrics();
metrics.startRecording();
File dir = new File(mFolder);
for (File file : dir.listFiles()) {
CSVReader reader = new CSVReader(new FileReader(file.getAbsolutePath()), '|');
String [] nextLine;
while ((nextLine = reader.readNext()) != null) {
AnalyticRecord record = new AnalyticRecord();
record.serializeLine(nextLine);
// Generate json
String json = mapper.writeValueAsString(record);
IndexResponse response = mClient.getClient().prepareIndex("sdk_sync_log", "sdk_sync")
.setSource(json)
.execute()
.actionGet();
// Recording Metrics
metrics.sent();
}
}
metrics.stopRecording();
return metrics;
I have the following questions:
How do I know through the API when all the requests are completed and the data is saved into Elasticsearch? I could query Elasticsearch for the objects counts in my particular index but doing that would be a new performance factor by itself, hence I am eliminating this option.
Is the above the fastest way to insert object to Elasticsearch or are there other optimizations I could do. Keep in mind the bulk API is not an option for now.
Thx in advance.
P.S: the Elasticsearch version I am using on both client and server is 1.0.0.
Elasticsearch index response has isCreated() method that returns true if the document is a new one or false if it has been updated and can be used to see if the document was successfully inserted/updated.
If bulk indexing is not an option there are other areas that could be tweaked to improve performance like
increasing index refresh interval using index.refresh_interval
disabling replicas by setting index.number_of_replicas to 0
Disabling _source and _all fields if they are not needed.

Categories

Resources