Live queries implementation / approaches in backend - Java

I am working on something that should be "live", i.e. use WebSockets or SSE to show current data in the browser. I have two data sources, and they should be combined with a bit of business logic. The data can be retrieved using HTTP GET, and changes also arrive as webhook notifications.
I am able to code what is needed in Java + Spring, but readability would suffer. I discovered that RethinkDB would make my task much easier, but that project does not seem to be actively developed anymore.
I would like a Java-idiomatic approach / library / external software (like a database) that makes it easy (maintainable ~= less code) to implement an algorithm which would, for example, do something like this:
2 inputs:
a filesystem tree (git repo)
a list of trees with some processing info in them. Each tree in the list contains:
a root node with some irrelevant info
some number of child nodes
leaf nodes with a filename from the filesystem (with path), the duration of an action on the file, and the status of the file processing
Note: the second input can contain, for example, 20 trees, i.e. it can hold info about processing a single file from the filesystem tree 20 times. So to get info about one particular file we may need to crawl the whole list of trees, and there is no guarantee that the file has any matching processing info in the second input at all. In that case we output "N/A" for that file in the resulting tree.
I would like to transform these two inputs into another tree, which will have the structure of the first input and will contain, for each file, the last status (last from the list) and the sum of the durations.
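For concreteness, here is a rough, non-reactive sketch of that merge step. The record types (ProcessingLeaf, FileReport), the flattening of each processing tree to its leaves, and the representation of the filesystem tree by its file paths are my own placeholders, since the real domain model isn't shown:

```java
import java.time.Duration;
import java.util.*;
import java.util.stream.Collectors;

class MergeSketch {
    // Placeholder types; the real domain classes are not shown in the question.
    record ProcessingLeaf(String path, Duration duration, String status) {}
    record FileReport(String path, Duration totalDuration, String lastStatus) {}

    static Map<String, FileReport> merge(Collection<String> filePaths,
                                         List<List<ProcessingLeaf>> processingLeavesPerTree) {
        // Group all leaves from all processing trees by file path.
        Map<String, List<ProcessingLeaf>> byPath = processingLeavesPerTree.stream()
                .flatMap(List::stream)
                .collect(Collectors.groupingBy(ProcessingLeaf::path));

        // For each file: sum the durations and keep the last status, or "N/A" if unmatched.
        return filePaths.stream().collect(Collectors.toMap(
                path -> path,
                path -> {
                    List<ProcessingLeaf> leaves = byPath.getOrDefault(path, List.of());
                    Duration total = leaves.stream()
                            .map(ProcessingLeaf::duration)
                            .reduce(Duration.ZERO, Duration::plus);
                    String last = leaves.isEmpty() ? "N/A"
                            : leaves.get(leaves.size() - 1).status();
                    return new FileReport(path, total, last);
                }));
    }
}
```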
My current approach was not reactive. It involved a lot of the Java Stream API and HTTP GETs to fetch the current data from the two sources. It worked OK and was fast enough, but not fast enough to introduce polling and make the user feel it is real time.
Making this reactive while keeping the current algorithms in place would involve a lot of spaghetti code, so I started another approach (from scratch).
I started to write a nice OOP class which receives changes from both inputs and produces "observable" changes as output. This would be relatively nice if my "query", which computes the output, were immutable. It is not, due to design changes in the business logic.
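One option in that direction (my assumption, not something the question requires) is Project Reactor, which ships with Spring WebFlux: model each input as a stream of snapshots fed by the webhooks (or a polling fallback), combine the latest values, and re-run the pure merge function on every change. A minimal sketch with placeholder types:

```java
import reactor.core.publisher.Flux;
import reactor.core.publisher.Sinks;

class LiveReportSketch {
    // Placeholder snapshot types standing in for the real inputs/output.
    record FilesystemTree() {}
    record ProcessingTrees() {}
    record ReportTree() {}

    // Webhook handlers (or a polling fallback) push fresh snapshots into these sinks.
    private final Sinks.Many<FilesystemTree> fsUpdates = Sinks.many().replay().latest();
    private final Sinks.Many<ProcessingTrees> processingUpdates = Sinks.many().replay().latest();

    void onFilesystemWebhook(FilesystemTree snapshot) { fsUpdates.tryEmitNext(snapshot); }
    void onProcessingWebhook(ProcessingTrees snapshot) { processingUpdates.tryEmitNext(snapshot); }

    // Whenever either input changes, recompute the merged tree; the resulting Flux
    // can be exposed to the browser via an SSE or WebSocket endpoint.
    Flux<ReportTree> liveReport() {
        return Flux.combineLatest(fsUpdates.asFlux(), processingUpdates.asFlux(), this::merge);
    }

    private ReportTree merge(FilesystemTree fs, ProcessingTrees processing) {
        // Keep the business logic as a pure function of the two snapshots,
        // so it stays testable and free of reactive types.
        return new ReportTree();
    }
}
```

The returned Flux can then be wired to, for example, a WebFlux controller producing MediaType.TEXT_EVENT_STREAM_VALUE for SSE, while the merge function itself stays plain Java.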
Can you point me to an approach that makes this problem easy to implement and maintain?
PS: I was considering using Spring's cache mechanism for receiving changes (caching the methods which make the HTTP GET calls for the inputs and return parsed, partly processed input data). But this part of the code is too small to make any difference.

Related

Searching strategy to efficiently load data from service or server?

This is not really a language-specific question, it's more of a pattern-related question, but I would like to tag it with some popular languages that I understand.
I am not very experienced with the requirement of efficiently loading data in combination with searching it (especially in a mobile environment).
The strategy I have used before is to load everything into local memory and search there (e.g. using LINQ in C#).
Another strategy is to reload the data every time a new search is executed. That is of course not efficient, and we may also need to do something more complicated to sync the newly loaded data with the data already in local memory.
The last strategy I can think of is the hardest one to implement: load the data lazily together with the search execution. When a search is executed, the returned results are cached locally. The search should look in local memory first before fetching new results from the service/server, so the result of each search is a combination of the local search and the server search. The purpose is to reduce the amount of data reloaded from the server every time a search is run.
Here is what I can think of to implement this kind of strategy:
When a search is run, look in local memory first. Finishing this step gives the local result.
Before sending the request to the server, we need to somehow pass along what is already in the local result so that it can be excluded from the server-side search. So the search method may take an argument containing all the item IDs found in the first step.
With that search request, the server can exclude the already-found items and return only new items to the client.
The last step is to merge the two results, local and server, into the final result shown in the UI.
I'm not sure if this is the right approach, but what doesn't feel right is step 2. We need to send the list of item IDs found in step 1 to the server, and if we have hundreds or thousands of such IDs, sending them may not be very efficient. The query that excludes such a large number of items may not be efficient either (even using direct SQL or LINQ). I'm still stuck at this point.
Finally, if you have a better idea, ideally one already used in a production project, please share it with me. I don't need any concrete example code, just the idea or the steps to implement it.
Too long for a comment....
Concerning step 2, you should know that you can run into several problems:
Amount of data
Over time, you may accumulate so much data that even the set of its IDs gets bigger than a normal server answer. In the end, you could need to cache not only the previous server answers on the client, but also the client's state on the server. What you're doing is a sort of synchronization, so look at rsync for inspiration; it's an old but smart Unix tool. git push might be inspiring, too.
Basically, by organizing your IDs into a tree, you can cheaply synchronize the information about what the client already knows between the server and the client. The price may be increased latency, as multiple round trips may be needed.
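As a rough illustration of that idea (my own sketch, not part of the original answer): bucket the IDs by prefix, hash each bucket on both sides, and only exchange the full ID lists of buckets whose hashes differ; deeper levels of the tree would simply refine the prefixes.

```java
import java.security.MessageDigest;
import java.util.*;

// Bucket item IDs by prefix and hash each bucket. Client and server compare the
// bucket hashes first and only transfer the IDs of buckets that differ.
class IdSyncSketch {
    static Map<String, String> bucketHashes(Collection<String> ids, int prefixLength) throws Exception {
        Map<String, SortedSet<String>> buckets = new TreeMap<>();
        for (String id : ids) {
            String prefix = id.substring(0, Math.min(prefixLength, id.length()));
            buckets.computeIfAbsent(prefix, p -> new TreeSet<>()).add(id);
        }
        Map<String, String> hashes = new TreeMap<>();
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        for (var e : buckets.entrySet()) {
            md.reset();
            for (String id : e.getValue()) {
                md.update(id.getBytes());       // sorted order makes the hash deterministic
            }
            hashes.put(e.getKey(), HexFormat.of().formatHex(md.digest()));
        }
        return hashes;
    }
}
```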
Using the knowledge
It's quite possible that excluding the already-known objects from the SQL result is more expensive than not excluding them, especially when you can't easily determine whether a to-be-excluded object would be part of the full answer at all. Still, you can save bandwidth by post-filtering the data on the server before sending it.
Being up to date
If your data change or get deleted, you may find your client keeping obsolete data. Having the client subscribe to relevant changes is one possibility; associating a (logical) timestamp with your IDs is another.
Summary
It can get pretty complicated, and you should measure before you even try. You may find out that the problem itself is hard enough, that achieving these savings is even harder, and that the gain is limited. You know the root of all evil, right?
I would approach the problem by treating local and remote as two different data sources:
When a search is triggered, it is initiated against both data sources (local in-memory and the server).
Most likely the local search will return results first, so display them to the user.
When results come back from the server, append the non-duplicate ones.
Optional: in case the server data has changed and some results were removed or changed, update/remove the local results and refresh the view.
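A minimal Java sketch of that flow, assuming hypothetical LocalIndex and RemoteApi interfaces (the question does not name a platform, so all names here are placeholders):

```java
import java.util.*;
import java.util.concurrent.CompletableFuture;
import java.util.function.Consumer;

// Show local results immediately, then append non-duplicate server results by ID.
class DualSourceSearch {
    interface Item { String id(); }
    interface LocalIndex { List<Item> search(String query); }
    interface RemoteApi { CompletableFuture<List<Item>> search(String query); }

    void search(String query, LocalIndex local, RemoteApi remote, Consumer<List<Item>> render) {
        List<Item> shown = new ArrayList<>(local.search(query));
        render.accept(List.copyOf(shown));              // show local hits right away

        remote.search(query).thenAccept(serverItems -> {
            Set<String> knownIds = new HashSet<>();
            shown.forEach(i -> knownIds.add(i.id()));
            serverItems.stream()
                    .filter(i -> knownIds.add(i.id())) // keep only items not seen locally
                    .forEach(shown::add);
            render.accept(List.copyOf(shown));          // refresh the view with the merge
        });
    }
}
```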

Algorithm and Implementation of Microflow Engine

I am working on a microflow engine (backend), i.e. a process flow that is executed at runtime.
Consider the following diagram, where each process is a Java class. Variables flow out of one process and into another. Since the flow is dynamic in nature, very complicated flows are possible, with many gateways (GW) and processes.
Is DFS/BFS a good choice for implementing the runtime engine? Any ideas?
As far as the given example is concerned, it can be solved via Depth First Search (DFS), using the output node as the "root" of the tree.
This is because:
For the output to obtain a value, it needs the output of Process4
For Process4 to produce an output, it needs the outputs of Process2 and Process3
For Process2 / Process3 to produce an output, they need the output of GW
For GW to produce an output, it needs the output from Process1
So the general idea would be to do a DFS from each output, all the way back to the inputs.
This will work almost as described for anything that looks like a Directed Acyclic Graph (DAG, or in fact a tree) from the point of view of the output.
If a workflow ends up having "cycle edges" or "feedback loops", that is, if it now looks like a general graph, then additional consideration is needed to avoid infinite traversals and repeated re-evaluation of a process's output.
Finally, if a workflow needs to be aware of the concept of "time" (in general), then additional consideration is needed to ensure that, although the graph is evaluated progressively, node by node, in the end it has produced the right output for time instance (n). That is, you want to avoid some processes producing output ahead of the current time instance just because they were called more frequently.
A trivial example of this is already present in the question. Due to DFS, GW will be evaluated for Process2 (or Process3), but it does not have to be re-evaluated (for the same time instance) for Process3 (or Process2). When dealing with DAGs, you can simply add an "evaluated" flag to each process which is cleared at the beginning of the traversal. DFS then only descends into the branch of a node if it finds that the node is not yet evaluated; otherwise, it simply reuses the output of a process that was evaluated earlier in the traversal. (This is why I said "almost as described" earlier.) But this trivial trick will not work with multiple feedback loops; in that case, you really need to make the nodes "aware" of the passage of time.
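A minimal sketch of that "evaluated flag" traversal for the DAG case, with made-up Node/process() names since the question's real classes are not shown:

```java
import java.util.*;

class MicroflowSketch {
    static class Node {
        final String name;
        final List<Node> inputs = new ArrayList<>();
        boolean evaluated;
        Object output;

        Node(String name) { this.name = name; }

        Object process(List<Object> inputValues) {
            // ... the actual business logic of this process ...
            return name + inputValues;
        }
    }

    // DFS from an output node back towards the inputs; each node is evaluated once per pass.
    static Object evaluate(Node node) {
        if (node.evaluated) {
            return node.output;                 // reuse the result from this traversal
        }
        List<Object> inputValues = new ArrayList<>();
        for (Node input : node.inputs) {
            inputValues.add(evaluate(input));   // descend first (depth-first)
        }
        node.output = node.process(inputValues);
        node.evaluated = true;
        return node.output;
    }

    // Before each new evaluation pass (time instance), clear the flags.
    static void reset(Collection<Node> allNodes) {
        for (Node n : allNodes) {
            n.evaluated = false;
            n.output = null;
        }
    }
}
```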
For more information, and for a really thorough exposition of the related issues, I would strongly recommend going through Bruno Preiss' Y logic simulator. Although it is written in C++ and is a logic simulator, it works through exactly the same considerations faced by any similar system of interconnected "abstract nodes" that are supposed to carry out some form of "processing".
Hope this helps.

Data structure for continuous additions and cheap deletions

I am reading this blog post about making animations with Gnuplot and the Cairo terminal, where the plan of the algorithm is simply:
to save PNG images to the working directory, and
to save the latest video to the working directory.
I would like to have something more, so that the user can also browse the images in real time while they are being converted:
to give the user a list in some interface which they can browse with arrow buttons
in this interface, new images are added to the end of the list
the user can also remove bad images from the stream in real time
This may work well in the data-parallelism model of parallel programming, i.e. a data set regularly structured in an array.
The operations (additions, deletions) can operate on this data, but independently, in distinct processes.
Let's assume for simplicity that there is no need for efficient searches in version 1.
However, if you come up with a model which can also do that, I am happy to consider it; let's call it version 2.
I think a list is not a good data structure here because of the desired deletions and the continuous, easy additions to the end of the data structure.
A stack is not going to work either, because of the deletions.
I think some sort of tree data structure could work, because deletions are rather cheap there and so is searching.
However, a simple array in the data-parallelism model might be sufficient.
Languages
I think Java is a good option here because of its parallelism support.
However, any language or pseudocode is fine too.
Frontend
My intuition is that the frontend requirement for such a system would be Qt as a terminal emulator.
What is a better data structure for cheap deletions and continuous additions to the end?
Java's LinkedList seems to be something you could use for version 1. You can use its single-parameter add() to append to the list in constant time. If by "real time" you mean that the image is on the user's display and thus already pointed to somehow, you can delete it in constant time as well.
There is no re-allocation and copying of a backing array as you would have with an ArrayList.
Any doubly linked list implemented with node objects (as opposed to an array) would do.
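A tiny sketch of that suggestion, with a placeholder Image type: append new images at the end, and use a ListIterator as the browsing cursor, so removing the currently shown image is O(1).

```java
import java.util.LinkedList;
import java.util.ListIterator;

class ImageBrowserSketch {
    record Image(String path) {}

    private final LinkedList<Image> images = new LinkedList<>();

    void onNewImage(Image img) {
        images.add(img);                         // constant-time append to the end
    }

    void browse() {
        ListIterator<Image> cursor = images.listIterator();
        while (cursor.hasNext()) {
            Image current = cursor.next();
            if (isBad(current)) {
                cursor.remove();                 // constant-time removal at the cursor
            }
        }
    }

    private boolean isBad(Image img) {
        return false;                            // in the real UI this is the user's decision
    }
}
```

Note that LinkedList is not thread-safe, so if the producer and the browsing UI run on different threads you would need external synchronization or something like java.util.concurrent.ConcurrentLinkedDeque.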
Your second version isn't specified clearly enough.

Best way to read a file and process its content in Java

I'm curious about the best way to read a file and then process each of its lines. Assume that the resource being read can grow in size (e.g. a very large file) and that the reading and processing can be swapped for a different implementation (e.g. reading an XML source instead of a file containing strings). Please consider the following approaches:
Create two services. The first service extracts the data from the file and returns a list. The second service takes the list and iterates through it to process each item. The pro of this approach is that it adheres to the SRP and makes it possible to switch services (e.g. get the data from a different source like XML or JSON). The only con I can think of is the performance hit of iterating through the collection a second time (the first time is when reading and putting the items into the collection) to do the processing.
Have only one service that does both tasks. This way you can do the processing inside the initial loop. The trade-off is the coupling between the reading and processing functionality, which breaks the SRP.
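For what it's worth, a third variant (my own sketch, not one of the two approaches above, with made-up LineSource/ItemProcessor names) keeps reading and processing in separate classes but hands over a lazy Stream instead of a fully materialized list, so there is only one pass and the whole file never sits in memory:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

// The reader only knows how to produce lines; the processor only knows how to handle one item.
interface LineSource {
    Stream<String> lines();
}

interface ItemProcessor {
    void process(String item);
}

class FileLineSource implements LineSource {
    private final Path file;
    FileLineSource(Path file) { this.file = file; }

    @Override
    public Stream<String> lines() {
        try {
            return Files.lines(file);       // lazy; does not load the whole file
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}

class Pipeline {
    // Single pass: each line is handed to the processor as it is read.
    static void run(LineSource source, ItemProcessor processor) {
        try (Stream<String> lines = source.lines()) {
            lines.forEach(processor::process);
        }
    }
}
```

The SRP split is preserved (an XML or JSON source would just be another LineSource), yet there is no second iteration over a materialized collection.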
If you have other suggestions on how to accomplish this, all ideas are very welcome! TYIA!
Another thing I want to take away from this question is how great designers and developers (the folks on this site ☺) reach decisions in cases like this: why would you trade one benefit for another? What are the best practices when it comes to trade-offs? What's an acceptable trade-off for you, and why? Etc. Thanks again!

Software pattern for matching objects with handlers

I have been thinking about an approach to this problem, but I have not found any solution that convinces me. I am writing a crawler, and I have a download task for every URL from a list of URLs. In addition, the different HTML documents are parsed in different ways depending on the site URL and the information I want to extract. So my problem is how to link every task with its appropriate parser.
The ideas are:
Create a huge 'if' that checks the download type and associates a parser with it.
(Avoided, because the 'if' grows with every new site added to the crawler.)
Use polymorphism: create a different download task for every site, tied to the type of information I want to get, and then use a post-action that links it to its parser.
(This increases the complexity again with every new parser.)
So I am looking for some kind of software pattern or idea that says:
"Hey, I am a download task with this information."
"Really? Then you need this parser to extract it. Here is the parser you need."
Additional information:
The architecture is very simple: a list of URLs which are seeds for the crawler, a producer which downloads the pages, another list with the downloaded HTML documents, and a consumer which should apply the right parser to each page.
Depending on the downloaded page we sometimes need parser A, or parser B, etc.
EDIT
An example:
We have three websites: site1.com, site2.com and site3.com.
There are three URL types which we want to parse: site1.com/A, site1.com/B, site1.com/C, site2.com/A, site2.com/B, site2.com/C, ... site3.com/C.
Every URL is parsed differently, and usually the same kind of information is shared between site1.com/A - site2.com/A - site3.com/A; ...; site1.com/C - site2.com/C - site3.com/C.
It looks like a genetic algorithm approach fits your description of the problem; what you need to find first are the basic (atomic) solutions.
Here's a short description from Wikipedia:
In a genetic algorithm, a population of candidate solutions (called individuals, creatures, or phenotypes) to an optimization problem is evolved toward better solutions. Each candidate solution has a set of properties (its chromosomes or genotype) which can be mutated and altered; traditionally, solutions are represented in binary as strings of 0s and 1s, but other encodings are also possible.[2]
The evolution usually starts from a population of randomly generated individuals, and is an iterative process, with the population in each iteration called a generation. In each generation, the fitness of every individual in the population is evaluated; the fitness is usually the value of the objective function in the optimization problem being solved. The more fit individuals are stochastically selected from the current population, and each individual's genome is modified (recombined and possibly randomly mutated) to form a new generation. The new generation of candidate solutions is then used in the next iteration of the algorithm. Commonly, the algorithm terminates when either a maximum number of generations has been produced, or a satisfactory fitness level has been reached for the population.
A typical genetic algorithm requires:
a genetic representation of the solution domain,
a fitness function to evaluate the solution domain.
A standard representation of each candidate solution is as an array of bits.[2] Arrays of other types and structures can be used in essentially the same way. The main property that makes these genetic representations convenient is that their parts are easily aligned due to their fixed size, which facilitates simple crossover operations. Variable length representations may also be used, but crossover implementation is more complex in this case. Tree-like representations are explored in genetic programming and graph-form representations are explored in evolutionary programming; a mix of both linear chromosomes and trees is explored in gene expression programming.
Once the genetic representation and the fitness function are defined, a GA proceeds to initialize a population of solutions and then to improve it through repetitive application of the mutation, crossover, inversion and selection operators.
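A compact, generic illustration of the loop described above (my own sketch, not tied to the crawler problem): individuals are bit strings, fitness is simply the number of 1-bits, and parents are picked by tournament selection.

```java
import java.util.*;

class GeneticAlgorithmSketch {
    static final int LENGTH = 32, POP = 50, GENERATIONS = 100;
    static final double MUTATION_RATE = 1.0 / LENGTH;
    static final Random rnd = new Random();

    // Fitness: count of 1-bits ("OneMax"); a real application would plug in its own objective.
    static int fitness(boolean[] genome) {
        int f = 0;
        for (boolean bit : genome) if (bit) f++;
        return f;
    }

    static boolean[] randomGenome() {
        boolean[] g = new boolean[LENGTH];
        for (int i = 0; i < LENGTH; i++) g[i] = rnd.nextBoolean();
        return g;
    }

    // Tournament selection: pick two at random, keep the fitter one.
    static boolean[] tournament(List<boolean[]> pop) {
        boolean[] a = pop.get(rnd.nextInt(pop.size()));
        boolean[] b = pop.get(rnd.nextInt(pop.size()));
        return fitness(a) >= fitness(b) ? a : b;
    }

    public static void main(String[] args) {
        List<boolean[]> population = new ArrayList<>();
        for (int i = 0; i < POP; i++) population.add(randomGenome());

        for (int gen = 0; gen < GENERATIONS; gen++) {
            List<boolean[]> next = new ArrayList<>();
            while (next.size() < POP) {
                boolean[] p1 = tournament(population), p2 = tournament(population);
                int cut = rnd.nextInt(LENGTH);                 // one-point crossover
                boolean[] child = new boolean[LENGTH];
                for (int i = 0; i < LENGTH; i++) {
                    child[i] = (i < cut ? p1[i] : p2[i]);
                    if (rnd.nextDouble() < MUTATION_RATE) child[i] = !child[i]; // mutation
                }
                next.add(child);
            }
            population = next;
        }
        population.stream().mapToInt(GeneticAlgorithmSketch::fitness).max()
                .ifPresent(best -> System.out.println("Best fitness: " + best));
    }
}
```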
I would externalize the parsing pattern / structure in some form (like XML) and use it dynamically.
For example, say I have to download site1.com and site2.com, and the two have different layouts. I will create two XML files which hold the layout patterns,
and one master XML which records which URL should use which XML.
At startup, load this master XML and use it as a dictionary. When you have to download, download the page, look the XML up in the dictionary, and pass the XML and the stream to the parser (a single generic parser), which can read the stream based on the XML flow and the XML information.
In this way, we can create common patterns in XML and use them to read similar sites. Use regular expressions in the XML patterns to cover most sites with a single XML.
If a layout is completely different, just create one more XML and modify the master XML; that's it.
The secret / success of this design lies in how you create such generic XMLs, and that depends entirely on what you need and what you do after parsing.
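A minimal Java sketch of the "master dictionary" part of this idea, with made-up names; in the real design the rules would be loaded from the master XML at startup, but they are hard-coded here to keep the example self-contained:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;
import java.util.regex.Pattern;

class ParserRegistrySketch {
    interface PageParser {
        void parse(String html);
    }

    // Ordered mapping: URL pattern -> parser configuration (first match wins).
    private final Map<Pattern, PageParser> rules = new LinkedHashMap<>();

    void register(String urlRegex, PageParser parser) {
        rules.put(Pattern.compile(urlRegex), parser);
    }

    Optional<PageParser> parserFor(String url) {
        return rules.entrySet().stream()
                .filter(e -> e.getKey().matcher(url).matches())
                .map(Map.Entry::getValue)
                .findFirst();
    }
}

// Hypothetical usage:
//   registry.register("https?://site1\\.com/A.*", new Site1AParser());
//   registry.parserFor(url).ifPresent(p -> p.parse(html));
```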
This seems to be a connectivity problem. I'd suggest considering the quick-find algorithm.
See here for more details:
http://jaysonlagare.blogspot.com.au/2011/01/union-find-algorithms.html
And here's a simple Java sample:
https://gist.github.com/gtkesh/3604922
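For reference, a minimal quick-find sketch (the simplest union-find variant) in Java, in case the links above go stale:

```java
class QuickFind {
    private final int[] id;

    QuickFind(int n) {
        id = new int[n];
        for (int i = 0; i < n; i++) id[i] = i;   // each element starts in its own set
    }

    boolean connected(int p, int q) {
        return id[p] == id[q];                    // O(1) find
    }

    void union(int p, int q) {
        int pid = id[p], qid = id[q];
        if (pid == qid) return;
        for (int i = 0; i < id.length; i++) {     // O(n) union: relabel one whole set
            if (id[i] == pid) id[i] = qid;
        }
    }
}
```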
