I have a use case when I need to capture the data flow from one API to another. For example my code reads data from database using hibernate and during the data processing I convert one POJO to another and perform some more processing and then finally convert into final result hibernate object. In a nutshell something like POJO1 to POJO2 to POJO3.
In Java is there a way where I can deduce that an attribute from POJO3 was made/transformed from this attribute of POJO1. I want to look something where I can capture data flow from one model to another. This tool can be either compile time or runtime, I am ok with both.
I am looking for a tool which can run in parallel with code and provide data lineage details on each run basis.
Now instead of Pojos I will call them States! You are having a start position you iterate and transform your model through different states. At the end you have a final terminal state that you would like to persist to the database
stream(A).map(P1).map(P2).map(P3)....-> set of B
If you use a technic known as Event sourcing you can deduce it yes. How would this look like then? Instead of mapping directly A to state P1 and state P1 to state P2 you will queue all your operations that are necessary and enough to map A to P1 and P1 to P2 and so on... If you want to recover P1 or P2 at any time, it will be just a product of the queued operations. You can at any time rewind forward or rewind backwards as long as you have not yet chaged your DB state. P1,P2,P3 can act as snapshots.
This way you will be able to rebuild the exact mapping flow for this attribute. How fine grained you will queue your oprations, if it is going to be as fine as attribute level , or more course grained it is up to you.
Here is a good article that depicts event sourcing and how it works: https://kickstarter.engineering/event-sourcing-made-simple-4a2625113224
UPDATE:
I can think of one more technic to capture the attribute changes. You can instument your Pojo-s, it is pretty much the same technic used by Hibernate to enhance Pojos and same technic profiles use to for tracing. Then you can capture and react to each setter invocation on the Pojo1,Pojo2,Pojo3. Not sure if I would have gone that way though....
Here is some detiled readin about the byte code instrumentation if https://www.cs.helsinki.fi/u/pohjalai/k05/okk/seminar/Aarniala-instrumenting.pdf
I would imagine two reasons, either the code is not developed by you and therefore you want to understand the flow of data along with combinations to convert input to output OR your code is behaving in a way that you are not expecting.
I think you need to log the values of all the pojos, inputs and outputs to any place that you can inspect later for each run.
Example: A database table if you might need after hundred of runs, but if its one time may be to a log in appropriate form. Then you need to yourself manually use those data values layer by later to map to the next layer. I think with availability of code that would be easy. If you have a different need pls. explain.
Please accept and like if you appreciate my gesture to help with my ideas n experience.
There are "time travelling debuggers". For Java, a quick search did only spill this out:
Chronon Time Travelling Debugger, see this screencast how it might help you .
Since your transformations probably use setters and getters this tool might also be interesting: Flow
Writing your own java agent for tracking this is probably not what you want. You might be able to use AspectJ to add some stack trace logging to getters and setters. See here for a quick introduction.
Related
For spark streaming, are there ways that we can maintain state only for the current window? I understand updateStateByKey works but that maintains the state forever unless we purge it. Is it possible to store and reset the state per window?
To give more context. I'm trying to convert one type of object into another within a windowed stream. However, the conversion is the following:
Object 1 is either an invocation or a response.
Object 2 is not considered complete until we see both a invocation and a response.
However, since the response for the an object could be in a separate batch I need to maintain states across batches.
But I only wish to maintain the state for the current window. Are there any ways that I could achieve this through spark.
thank you!
You can use the mapWithState transformation instead of updateStateByKey and you can set time out to the State spec with duration of your batch interval.by this you can have the state for only last batch every time.but it will work if you invocation and response depends only on the last batch.other wise when you try to update key which got removed it will throw exception.
MapwithState is fast in performance compared to updateStateByKey.
you can find the sample code snippet below.
import org.apache.spark.streaming._
val stateSpec =
StateSpec
.function(updateUserEvents _)
.timeout(Minutes(5))
I'm using Java EE (JDBC, MVC, DAO) and MySql.
I'm making my own project, so all architecture's design - my responsibility.
I have a system "Facultative", where i have entity Facultative, that store information about course, lecturer and start and duration.
Now, it is also storing a field "Status": Wait (not started), Started and Ended.
And this is a place, i have problem: how should information be updated?
Of course, it is possible, to give this function to the admin, but it seems to easy and not efficient.
I have idea - not store field "status" at DB, but to check what status in Model Entity (by checking start date/duration).
I'm using MVC Pattern and not sure if it is correct to add such method to Class.
Thank you in advance.
This is really an issue of the "world" you are modeling. Ask yourself this:
Do courses ever fail to start at the scheduled time?
Do you want to explicitly model that?
If the answer to both of those questions is "yes", then you can't treat the status field as derived from (just) the start and end dates (and the current date). And similarly, automatically setting a (non-derived) status field based on the dates is dubious.
On the other hand ... setting the status administratively would be a bad idea too, since it needs to be done at a particular time; i.e. when the lecture actually starts.
But then ... actually modeling this accurately needs to acknowledge that there is a "gap" between the information in your database, and what is actually happening in the real world. It is (probably) impractical to ensure that the database is 100% accurate. So the pragmatic solution is to accept that: make it a "feature" of the system.
If you take the pragmatic view, then making status derived should be good enough. (Change its name to notional_status or something, and change the start and end fields to scheduled_start and scheduled_end or something.)
Storing the start date and end date (or duration) and deriving the status makes the most sense to me.
The main advantage is the data wont need to be updated as the Status transitions from Wait to Started and Started to Ended and just take care of itself as time passes naturally.
I'm building an application that downloads a set of images from a website, extracts some features from them and then allows a user to compare an image she submits to the downloaded set, to see which one is the closest. At the moment the application downloads the images and extracts the features from them. Then the image and the feature get wrapped in an object and stored in a map, with the key as the name of the image, and the value as the aforementioned wrapped object.
Because this is stored in memory, each time I start the application it has to go through the quite expensive process of downloading and feature extraction. It would be much quicker if it could just load this info from disk, but I'm not sure on the best way to go about it - I've thought about these options:
RDMS: something like Postgres or SQLite
NoSQL: something like
Voldemort or Reddis
Serialisation: use built in java methods to write
objects to a file (could also be used in conjunction with a DB
though...)
I want it to be really light weight; I want to keep the application as small as possible and keep configuration down to a minimum. For this reason serialisation seems like the way to go, but I'd like a second (or more) opinion on that, because something about doing it that way just feels wrong. I can't quite put my finger on why I feel like that...
I should also say that users can add images to the set when the application is running, I'd like to save these images too.
I wouldn't recommend serialzation - just too many pitfalls.
If what you have is really just a map, then i think any of the key-value stores ( like redis) would be appropriate.
If you have more complex data, then you might want to consider a database (whether SQL or no-sql).
In my project, we have 2 REST calls which take too much time, so we are planning to optimize that. Here is how it works currently - we make 1st call to system A and then pass the response to system B for further processing. Once we get the response from system B, we have to manipulate it further before passing it to UI layer and this entire process takes lot of time. We planned on using Solr/Lucene but since we are not the data owners, we can't implement that. Can someone please shed some light on how best this can be handled? We are using Spring MVC and Spring webflow. Thanks in advance!!
[EDIT:] This is not the actual scenario and I am writing this as an example for better understanding. Think of this as making a store locator call for a particular zip to get a list of 100 stores and then sending those 100 stores to another call to get a list of inventory etc. So, this list of stores would change for every zip code and also the inventory there.
If your queries parameters to System A / System B are frequently the same you can add a cache framework to your code. If you use Spring3, you can use the cache easily with an #Cacheable annotation on your code calling SystemA. See :
http://static.springsource.org/spring/docs/3.1.0.M1/spring-framework-reference/html/cache.html
The cache subsystem will cache the result including processing code.
First, this may be a stupid question, but I'm hoping someone will tell me so, and why. I also apologize if my explanation of what/why is lacking.
I am using a servlet to upload a HUGE (247MB) file, which is pipe (|) delineated. I grab about 5 of 20 fields, create an object, then add it to a list. Once this is done, I pass the the list to an OpenJPA transactional method called persistList().
This would be okay, except for the size of the file. It's taking forever, so I'm looking for a way to improve it. An idea I had was to use a BlockingQueue in conjunction with the persist/persistList method in a new thread. Unfortunately, my skills in java concurrency are a bit weak.
Does what I want to do make sense? If so, has anyone done anything like it before?
Servlets should respond to requests within a short amount of time. In this case, the persist of the file contents needs to be an asynchronous job, so:
The servlet should respond with some text about the upload job, expected time to complete or something like that.
The uploaded content should be written to some temp space in binary form, rather than keeping it all in memory. This is the usual way the multi-part post libraries to their work.
You should have a separate service that blocks on a queue of pending jobs. Once it gets a job, it processes it.
The 'job' is simply some handle to the temporary file that was written when the upload happened... and any metadata like who uploaded it, job id, etc.
The persisting service needs to upload a large number of rows, but make it appear 'atomic', either model the intermediate state as part of the table model(s), or write to temp spaces.
If you are writing to temp tables, and then copying all the content to the live table, remember to have enough log space and temp space at the database level.
If you have a full J2EE stack, consider modelling the job queue as a JMS queue, so recovery makes sense. Once again, remember to have proper XA boundaries, so all the row persists fall within an outer transaction.
Finally, consider also having a status check API and/or UI, where you can determine the state of any particular upload job: Pending/Processing/Completed.