Spark streaming maintain state over window

For Spark Streaming, is there a way to maintain state only for the current window? I understand updateStateByKey works, but it maintains the state forever unless we purge it. Is it possible to store and reset the state per window?
To give more context: I'm trying to convert one type of object into another within a windowed stream. The conversion works as follows:
Object 1 is either an invocation or a response.
Object 2 is not considered complete until we see both an invocation and a response.
However, since the response for an object could arrive in a separate batch, I need to maintain state across batches.
But I only wish to maintain the state for the current window. Is there any way I could achieve this with Spark?
Thank you!

You can use the mapWithState transformation instead of updateStateByKey, and set a timeout on the StateSpec equal to your batch interval. That way the state only ever covers the last batch. This works only if your invocation and response depend on the last batch alone; otherwise, when you try to update a key that has already been removed, it will throw an exception.
mapWithState is also faster than updateStateByKey.
A sample code snippet is below.
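import org.apache.spark.streaming._

// updateUserEvents is the state-update function passed to mapWithState (not shown here).
val stateSpec =
  StateSpec
    .function(updateUserEvents _)
    .timeout(Minutes(5)) // idle timeout; set this to your batch interval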
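To make the pairing logic concrete, here is a minimal plain-Java sketch of what a per-key state function with an idle timeout has to do. It is a model of the idea only, not the Spark API: CallPairer, Pending, and onEvent are hypothetical names, and timestamps are passed in by hand instead of coming from batches.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Plain-Java model of per-key state with an idle timeout (not the Spark API). */
final class CallPairer {
    /** Hypothetical half-complete record: what has been seen so far for one key. */
    private static final class Pending {
        String invocation;
        String response;
        long lastSeenMs;
    }

    private final Map<String, Pending> state = new HashMap<>();
    private final long timeoutMs;

    CallPairer(long timeoutMs) {
        this.timeoutMs = timeoutMs;
    }

    /** Feeds one event; returns the completed pair once both halves have been seen. */
    Optional<String> onEvent(String key, boolean isInvocation, String payload, long nowMs) {
        expire(nowMs);
        Pending p = state.computeIfAbsent(key, k -> new Pending());
        if (isInvocation) {
            p.invocation = payload;
        } else {
            p.response = payload;
        }
        p.lastSeenMs = nowMs;
        if (p.invocation != null && p.response != null) {
            state.remove(key); // complete: emit the pair and forget the key
            return Optional.of(p.invocation + " -> " + p.response);
        }
        return Optional.empty();
    }

    /** Drops keys that have been idle for longer than the timeout. */
    private void expire(long nowMs) {
        state.values().removeIf(p -> nowMs - p.lastSeenMs > timeoutMs);
    }

    int pendingKeys() {
        return state.size();
    }
}
```

A response that arrives after its invocation has timed out simply starts a fresh, incomplete entry here, which is the case the answer warns about.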

Related

How to get the element details which got removed using Arrays.asList(remove()) using couchbase Java SDK?

I am trying to implement a highly concurrent access pattern where every request should get a unique document when it does a get.
I can't use N1QL and I don't have a key to do a KV fetch.
I implemented an array of documents, and as Arrays.asList(remove(0)) is a thread-safe call, every parallel thread should be able to remove the rolling 0th element of the array, making sure that no two threads remove the same element.
This works fine with concurrent threads. The problem now is that every thread also wants to use the content of the document it retrieved, and I don't see any method to deserialize the removed element and read its content.
The remove call doesn't return the element as such.
Any guidance/pointers will be appreciated.
Here is my code snippet:
MutateInResult resultDet = collection.mutateIn("TestDoc", Arrays.asList(remove("[0]")));
Thanks
Naved

This feature (return the removed content) has been requested before, and is tracked as MB-31401. The comments on that issue explain why it wasn't part of the original design.
Until that issue is resolved, you'd need to get the value using lookupIn before removing it (using the CAS value from the lookupIn result, and retrying on CAS mismatch). There's an example of this technique in CouchbaseQueue.poll(), although that particular code runs the risk of removing the item from the database before it has actually been processed by your application.
I'm not aware of a simple way to ensure each array element is handled by exactly one thread. If you ensure the element is processed before it is deleted, it might be processed by multiple threads. On the other hand, if you delete the element before processing it, you can ensure it's handled by at most one thread, but you risk it being handled by zero threads if there's an application failure after the element is deleted and before it's processed.
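The read-then-remove-with-CAS loop can be sketched in plain Java. This is a model of the technique only, not the Couchbase SDK: an AtomicReference stands in for the document and its CAS value, and CasArrayDoc and pollFirst are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Optional;
import java.util.concurrent.atomic.AtomicReference;

/** In-memory stand-in for an array document guarded by a CAS check. */
final class CasArrayDoc {
    private final AtomicReference<List<String>> doc;

    CasArrayDoc(List<String> initial) {
        this.doc = new AtomicReference<>(
                Collections.unmodifiableList(new ArrayList<>(initial)));
    }

    /** Reads element 0, then removes it only if the document is unchanged; retries otherwise. */
    Optional<String> pollFirst() {
        while (true) {
            List<String> snapshot = doc.get(); // plays the role of lookupIn (value plus CAS)
            if (snapshot.isEmpty()) {
                return Optional.empty();
            }
            String head = snapshot.get(0);
            List<String> rest = Collections.unmodifiableList(
                    new ArrayList<>(snapshot.subList(1, snapshot.size())));
            if (doc.compareAndSet(snapshot, rest)) { // plays the role of remove with the CAS value
                return Optional.of(head);
            }
            // CAS mismatch: another thread changed the document first, so read it again.
        }
    }
}
```

As in the at-most-once case described above, the element is gone from the document by the time the caller sees it, so a crash between pollFirst() returning and the caller finishing its work loses that element.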

How to Monitor/inspect data/attribute flow in Java code

I have a use case where I need to capture the data flow from one API to another. For example, my code reads data from a database using Hibernate; during processing I convert one POJO to another, perform some more processing, and finally convert the result into the final Hibernate object. In a nutshell, something like POJO1 to POJO2 to POJO3.
In Java, is there a way to deduce that an attribute of POJO3 was made or transformed from a particular attribute of POJO1? I am looking for something that captures the data flow from one model to another. The tool can work at compile time or at runtime; I am OK with both.
I am looking for a tool that can run in parallel with the code and provide data lineage details for each run.

Instead of POJOs I will call them states. You have a start position, and you iterate and transform your model through different states. At the end you have a final, terminal state that you would like to persist to the database:
stream(A).map(P1).map(P2).map(P3)....-> set of B
If you use a technique known as event sourcing, then yes, you can deduce it. What would this look like? Instead of mapping A directly to state P1 and state P1 to state P2, you queue all the operations that are necessary and sufficient to map A to P1, P1 to P2, and so on. If you want to recover P1 or P2 at any time, it is just the product of the queued operations. You can replay forward or rewind backward at any time, as long as you have not yet changed your DB state. P1, P2, and P3 can act as snapshots.
This way you can rebuild the exact mapping flow for a given attribute. How fine-grained your queued operations are, whether at attribute level or more coarse-grained, is up to you.
Here is a good article that depicts event sourcing and how it works: https://kickstarter.engineering/event-sourcing-made-simple-4a2625113224
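As a rough illustration of the idea, here is a minimal Java sketch that queues attribute-level operations, replays them to rebuild intermediate states, and walks the queue backwards to recover where an attribute came from. AttributeLog, Op, and the attribute names are hypothetical, and it assumes each target attribute is derived from exactly one source attribute.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.UnaryOperator;

/** Queue of attribute-level operations that can be replayed or traced. */
final class AttributeLog {
    /** One recorded operation: target attribute = fn(source attribute). */
    private static final class Op {
        final String target;
        final String source;
        final UnaryOperator<String> fn;

        Op(String target, String source, UnaryOperator<String> fn) {
            this.target = target;
            this.source = source;
            this.fn = fn;
        }
    }

    private final List<Op> ops = new ArrayList<>();

    AttributeLog record(String target, String source, UnaryOperator<String> fn) {
        ops.add(new Op(target, source, fn));
        return this;
    }

    /** Replays the first `steps` operations on top of the initial attributes. */
    Map<String, String> replay(Map<String, String> initial, int steps) {
        Map<String, String> state = new LinkedHashMap<>(initial);
        for (int i = 0; i < steps && i < ops.size(); i++) {
            Op op = ops.get(i);
            state.put(op.target, op.fn.apply(state.get(op.source)));
        }
        return state;
    }

    /** Walks the queue backwards to list the attributes a target was derived from. */
    List<String> lineage(String target) {
        List<String> chain = new ArrayList<>();
        chain.add(target);
        String current = target;
        for (int i = ops.size() - 1; i >= 0; i--) {
            Op op = ops.get(i);
            if (op.target.equals(current)) {
                chain.add(op.source);
                current = op.source;
            }
        }
        return chain;
    }
}
```

Replaying with a smaller step count gives the intermediate state (P1, P2, and so on), while lineage() answers the "which attribute did this come from" question directly.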
UPDATE:
I can think of one more technique to capture the attribute changes. You can instrument your POJOs; it is pretty much the same technique Hibernate uses to enhance POJOs, and the same technique profilers use for tracing. Then you can capture and react to each setter invocation on Pojo1, Pojo2, and Pojo3. I'm not sure I would go that way, though.
Here is some detailed reading about bytecode instrumentation: https://www.cs.helsinki.fi/u/pohjalai/k05/okk/seminar/Aarniala-instrumenting.pdf

I can imagine two reasons: either the code was not developed by you, and you want to understand how data flows and is combined to turn input into output, or your code is behaving in a way you do not expect.
I think you need to log the values of all the POJOs, inputs, and outputs somewhere you can inspect later for each run.
For example: a database table if you might need the data after hundreds of runs, or, if it's a one-time need, a log in an appropriate format. You then need to manually use those data values, layer by layer, to map each value to the next layer. With the code available, that should be easy. If you have a different need, please explain.
Please accept and like if you appreciate my gesture to help with my ideas and experience.

There are "time travelling debuggers". For Java, a quick search only turned up this:
Chronon Time Travelling Debugger; see this screencast for how it might help you.
Since your transformations probably use setters and getters, this tool might also be interesting: Flow
Writing your own Java agent for tracking this is probably not what you want. You might be able to use AspectJ to add some stack trace logging to getters and setters. See here for a quick introduction.

Kafka - problems with TimestampExtractor

I use org.apache.kafka:kafka-streams:0.10.0.1
I'm attempting to work with a time-series-based stream, but KStream.Process() never seems to trigger ("punctuate"). (See here for reference.)
In a KafkaStreams config I'm passing in this param (among others):
config.put(
    StreamsConfig.TIMESTAMP_EXTRACTOR_CLASS_CONFIG,
    EventTimeExtractor.class.getName());
Here, EventTimeExtractor is a custom timestamp extractor (that implements org.apache.kafka.streams.processor.TimestampExtractor) to extract the timestamp information from JSON data.
I would expect this to call my object (derived from TimestampExtractor) when each new record is pulled in. The stream in question carries 2 * 10^6 records per minute. I have punctuate() set to 60 seconds and it never fires. I know the data passes this span very frequently, since it's pulling old values to catch up.
In fact it never gets called at all.
Is this the wrong approach to setting timestamps on KStream records?
Is this the wrong way to declare this configuration?

Update Nov 2017: Kafka Streams in Kafka 1.0 now supports punctuate() with both stream-time and with processing-time (wall clock time) behavior. So you can pick whichever behavior you prefer.
Your setup seems correct to me.
What you need to be aware of: As of Kafka 0.10.0, the punctuate() method operates on stream-time (by default, i.e. based on the default timestamp extractor, stream-time will mean event-time). And the stream-time is only advanced when new data records are coming in, and how much the stream-time is advanced is determined by the associated timestamps of these new records.
For example:
Let's assume you have set punctuate() to be called every 1 minute = 60 * 1000 (note: 1 minute of stream-time). Now, if it happens that no data is being received for the next 5 minutes, punctuate() will not be called at all -- even though you might expect it to be called 5 times. Why? Again, because punctuate() depends on stream-time, and the stream-time is only advanced based on newly received data records.
Might this be causing the behavior you are seeing?
Looking ahead: There's already an ongoing discussion in the Kafka project on how to make punctuate() more flexible, e.g. to trigger it not only based on stream-time (which defaults to event-time) but also based on processing-time.
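The stream-time behavior described in this answer can be modeled in a few lines of plain Java. This is a simplified sketch, not Kafka's implementation: StreamTimePunctuator is a hypothetical class, and the choice to fire at most once per incoming record and then reschedule from the current stream-time is an assumption made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model: punctuation is driven by record timestamps, never by the wall clock. */
final class StreamTimePunctuator {
    private final long intervalMs;
    private boolean started = false;
    private long streamTimeMs;
    private long nextPunctuateMs;
    private final List<Long> fired = new ArrayList<>();

    StreamTimePunctuator(long intervalMs) {
        this.intervalMs = intervalMs;
    }

    /** Called for every record; this is the only place stream-time can advance. */
    void onRecord(long recordTimestampMs) {
        if (!started) {
            started = true;
            streamTimeMs = recordTimestampMs;
            nextPunctuateMs = recordTimestampMs + intervalMs;
            return;
        }
        streamTimeMs = Math.max(streamTimeMs, recordTimestampMs);
        if (streamTimeMs >= nextPunctuateMs) {
            fired.add(streamTimeMs); // this is where punctuate() would run
            nextPunctuateMs = streamTimeMs + intervalMs;
        }
    }

    /** Stream-time values at which punctuation fired. */
    List<Long> firedAt() {
        return fired;
    }
}
```

With a 60-second interval, feeding records stamped 0, 30000, and 61000 fires once; if no record arrives for the next five minutes, nothing fires at all, and a single record stamped 400000 then fires once more rather than five times.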

Your approach seems to be correct. Compare the paragraph "Timestamp Extractor (timestamp.extractor):" in http://docs.confluent.io/3.0.1/streams/developer-guide.html#optional-configuration-parameters
I'm not sure why your custom timestamp extractor is not used. Have a look at org.apache.kafka.streams.processor.internals.StreamTask. In the constructor there should be something like
TimestampExtractor timestampExtractor1 = (TimestampExtractor)config.getConfiguredInstance("timestamp.extractor", TimestampExtractor.class);
Check if your custom extractor is picked up there or not...

I think this is another case of issues at the broker level. I went and rebuilt the cluster using instances with more CPU and RAM. Now I'm getting the results I expected.
Note to distant observer(s): if your KStream app is behaving strangely take a look at your brokers and make sure they aren't stuck in GC and have plenty of 'headroom' for file handles, RAM, etc.

Search optimization when data owner is someone else

In my project, we have two REST calls that take too much time, so we are planning to optimize them. Here is how it works currently: we make the first call to system A and then pass the response to system B for further processing. Once we get the response from system B, we have to manipulate it further before passing it to the UI layer, and this entire process takes a lot of time. We planned on using Solr/Lucene, but since we are not the data owners, we can't implement that. Can someone please shed some light on how best to handle this? We are using Spring MVC and Spring Web Flow. Thanks in advance!
[EDIT:] This is not the actual scenario; I am writing it as an example for better understanding. Think of this as making a store locator call for a particular zip code to get a list of 100 stores, and then sending those 100 stores to another call to get a list of inventory, etc. So the list of stores changes for every zip code, and so does the inventory there.

If your query parameters to System A / System B are frequently the same, you can add a cache framework to your code. If you use Spring 3.1 or later, you can add caching easily with a @Cacheable annotation on the code that calls System A. See:
http://static.springsource.org/spring/docs/3.1.0.M1/spring-framework-reference/html/cache.html
The cache subsystem caches the method's result, including whatever processing the method performs on it.
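For a sense of what such a cache does under the hood, here is a minimal plain-Java sketch that memoizes a slow lookup per zip code. It is not Spring's cache abstraction: StoreLookupCache and storesFor are hypothetical names, and there is no eviction or expiry, which you would need if the store list or inventory changes over time.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Memoizes a slow lookup (for example the call to System A) by zip code. */
final class StoreLookupCache {
    private final Map<String, List<String>> cache = new ConcurrentHashMap<>();
    private final Function<String, List<String>> slowLookup;

    StoreLookupCache(Function<String, List<String>> slowLookup) {
        this.slowLookup = slowLookup;
    }

    /** Calls the slow lookup only on the first request for a given zip code. */
    List<String> storesFor(String zip) {
        return cache.computeIfAbsent(zip, slowLookup);
    }
}
```

This only pays off when the same zip codes are requested repeatedly, which is the condition the answer starts from.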

Using A BlockingQueue With A Servlet To Persist Objects

First, this may be a stupid question, but I'm hoping someone will tell me so, and why. I also apologize if my explanation of what/why is lacking.
I am using a servlet to upload a HUGE (247MB) file, which is pipe (|) delimited. I grab about 5 of 20 fields, create an object, then add it to a list. Once this is done, I pass the list to an OpenJPA transactional method called persistList().
This would be okay, except for the size of the file. It's taking forever, so I'm looking for a way to improve it. An idea I had was to use a BlockingQueue in conjunction with the persist/persistList method in a new thread. Unfortunately, my skills in Java concurrency are a bit weak.
Does what I want to do make sense? If so, has anyone done anything like it before?

Servlets should respond to requests within a short amount of time. In this case, the persist of the file contents needs to be an asynchronous job, so:
The servlet should respond with some text about the upload job, expected time to complete or something like that.
The uploaded content should be written to some temp space in binary form, rather than keeping it all in memory. This is the usual way the multi-part post libraries do their work.
You should have a separate service that blocks on a queue of pending jobs. Once it gets a job, it processes it.
The 'job' is simply some handle to the temporary file that was written when the upload happened... and any metadata like who uploaded it, job id, etc.
The persisting service needs to upload a large number of rows but make the upload appear 'atomic'. To do that, either model the intermediate state as part of the table model(s), or write to temp spaces.
If you are writing to temp tables, and then copying all the content to the live table, remember to have enough log space and temp space at the database level.
If you have a full J2EE stack, consider modelling the job queue as a JMS queue, so recovery makes sense. Once again, remember to have proper XA boundaries, so all the row persists fall within an outer transaction.
Finally, consider also having a status check API and/or UI, where you can determine the state of any particular upload job: Pending/Processing/Completed.
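The queue-plus-worker part of this design can be sketched with a BlockingQueue. This is a minimal in-memory sketch under stated assumptions: UploadJobService is a hypothetical class, a job is represented only by its id (standing in for the temp-file handle and metadata), the persist step is supplied by the caller, and the FAILED status is an addition not mentioned above. There is no JMS, XA, or recovery here.

```java
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Consumer;

/** Background service that blocks on a queue of pending upload jobs. */
final class UploadJobService {
    enum Status { PENDING, PROCESSING, COMPLETED, FAILED }

    private final BlockingQueue<String> pending = new LinkedBlockingQueue<>();
    private final Map<String, Status> statuses = new ConcurrentHashMap<>();
    private final Thread worker;

    /** persistFile receives the job id and does the actual parsing and persisting. */
    UploadJobService(Consumer<String> persistFile) {
        worker = new Thread(() -> {
            try {
                while (true) {
                    String jobId = pending.take(); // blocks until a job is available
                    statuses.put(jobId, Status.PROCESSING);
                    try {
                        persistFile.accept(jobId);
                        statuses.put(jobId, Status.COMPLETED);
                    } catch (RuntimeException e) {
                        statuses.put(jobId, Status.FAILED);
                    }
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // shutdown was requested
            }
        }, "upload-worker");
        worker.setDaemon(true);
        worker.start();
    }

    /** Called by the servlet after the upload has been written to temp space. */
    void submit(String jobId) {
        statuses.put(jobId, Status.PENDING);
        pending.add(jobId);
    }

    /** Backs the status check API; returns null for an unknown job id. */
    Status status(String jobId) {
        return statuses.get(jobId);
    }

    void shutdown() {
        worker.interrupt();
    }
}
```

The servlet would call submit() and return immediately with the job id, and a separate status endpoint would call status() for that id.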
