WatchService / detect renames and/or moves

Note: Replace INSERT/DELETE with the appropriate events from WatchService...
One more question regarding the WatchService class from Java 7. How can I reliably detect renames (and maybe even moves) of directories/files? On closer thought it seems very hard even to detect renames; more precisely, it involves a lot of state to keep track of. I don't think it's enough to check for INSERT/DELETE or DELETE/INSERT pairs of the same file/directory identifier, and I don't think it's enough to keep track of only one event, because a pair might be interleaved with other DELETEs or INSERTs of files. So all I can think of is a really ugly heuristic: keep track of unique identifiers and watch for an INSERT and a subsequent DELETE (or DELETE/INSERT), which might be interleaved with other events. I think I would therefore need some kind of timeout, and after it expires just report all tracked changes as plain inserts/deletes. That is definitely error-prone, or at best a best-effort way to detect renames :-( I think it's possible to get the relevant info from Linux and Windows (RENAME event), but I currently don't intend to use JNI, as I've already implemented everything with WatchService (even if I've spent only a few days on it).
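The timeout heuristic described above could be sketched roughly as follows. This is only an illustration under stated assumptions: `RenameCorrelator` and its methods are hypothetical names, the "unique identifier" is assumed to be something like `BasicFileAttributes.fileKey()` that was cached while the file still existed (it cannot be read after deletion and may be null on some file systems), and only the DELETE-then-CREATE order is handled. The class is pure bookkeeping, so the caller would feed it events translated from `ENTRY_DELETE` / `ENTRY_CREATE`.

```java
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: pairs a DELETE with a later CREATE of the same file identity. */
class RenameCorrelator {
    /** A delete that has not (yet) been matched with a create. */
    private static final class PendingDelete {
        final Path path;
        final long timeMillis;
        PendingDelete(Path path, long timeMillis) {
            this.path = path;
            this.timeMillis = timeMillis;
        }
    }

    private final long timeoutMillis;
    // Keyed by file identity (assumption: a cached fileKey or similar).
    private final Map<Object, PendingDelete> pending = new LinkedHashMap<>();
    private final List<String> results = new ArrayList<>();

    RenameCorrelator(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    /** Remember the delete; it is only reported once the timeout expires unmatched. */
    void onDelete(Object fileKey, Path path, long nowMillis) {
        flushExpired(nowMillis);
        pending.put(fileKey, new PendingDelete(path, nowMillis));
    }

    /** A create with the identity of a pending delete is reported as a rename. */
    void onCreate(Object fileKey, Path path, long nowMillis) {
        flushExpired(nowMillis);
        PendingDelete d = pending.remove(fileKey);
        if (d != null) {
            results.add("RENAME " + d.path + " -> " + path);
        } else {
            results.add("CREATE " + path);
        }
    }

    /** Deletes older than the timeout are given up on and reported as plain deletes. */
    void flushExpired(long nowMillis) {
        Iterator<PendingDelete> it = pending.values().iterator();
        while (it.hasNext()) {
            PendingDelete d = it.next();
            if (nowMillis - d.timeMillis >= timeoutMillis) {
                results.add("DELETE " + d.path);
                it.remove();
            }
        }
    }

    List<String> results() {
        return results;
    }
}
```

As the question already suspects, this stays best effort: a rename whose two events arrive further apart than the timeout is reported as a separate delete and create.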

StandardWatchEventKinds.ENTRY_MODIFY is the event you are looking for.

Detecting termination in CP-SAT

I was exploring the CP-SAT APIs for fetching all solutions for a given set of constraints.
As per the API documentation, the onSolutionCallback() function is called for every solution found. However, if I need to find all solutions for a given model, is there a way to detect the last solution, or that no more solutions are feasible, through onSolutionCallback() or other means?
I found that the searchAllSolutions() API can be used, and that we can set termination conditions based on time or number of solutions. Assuming I can wait an unlimited amount of time, how do I detect that no more feasible solutions exist?
https://developers.google.com/optimization/cp/cp_tasks
Another related question:
Is there any remote chance for the CP-SAT solver to run into a non-deterministic state or into infinite loops (or similar), even when a feasible solution exists for the given set of constraints?
I plan to use CP-SAT for a production use case and hence would like to know about its determinism and upper bounds on execution time.
Edit: Added the second question.

How to undo any function

Is there a way to undo the last function/method that has just finished running?
For example, I have a function/method that does this:
isTrue = false
num+=5
and many more functions. Each function starts running when the user presses a certain button on the screen. The user should be able to undo their last action.
Is there a universal way to undo any function that has just finished running?
Of course, I can do it this way:
isTrue=true
num-=5
but what to do if there are many functions? Is there an easy way to undo any function (Android)?

I don't know any language or platform where you can do something like that.
Any code that your application executes modifies data in memory in some way (creates or destroys objects, changes values, etc.). In order to undo changes made by that code, the operating system would have to save the full history of data modifications, and that is practically impossible because of limited resources.

There isn't any built-in language feature that undoes everything a function has changed, but there are established patterns that help in this kind of situation.
One option is the memento pattern, where you keep a copy of the previous state so that you can always come back to it. The drawback is increased memory usage for saving all that extra state or, if you persist the state on disk, the performance overhead of reading and writing it.
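A minimal memento sketch, using the two fields from the question, might look like this. `ScreenState`, `Memento`, and `MementoHistory` are hypothetical names chosen for illustration; a real app would snapshot whatever state its actions can touch.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical app state holding the two fields from the question. */
class ScreenState {
    boolean isTrue = true;
    int num = 0;

    /** Immutable snapshot (the "memento") of every field. */
    static final class Memento {
        private final boolean isTrue;
        private final int num;
        private Memento(boolean isTrue, int num) {
            this.isTrue = isTrue;
            this.num = num;
        }
    }

    Memento save() {
        return new Memento(isTrue, num);
    }

    void restore(Memento m) {
        isTrue = m.isTrue;
        num = m.num;
    }
}

/** Keeps one snapshot per performed action, newest on top. */
class MementoHistory {
    private final Deque<ScreenState.Memento> snapshots = new ArrayDeque<>();
    private final ScreenState state;

    MementoHistory(ScreenState state) {
        this.state = state;
    }

    /** Takes a snapshot first, then runs the action that mutates the state. */
    void perform(Runnable action) {
        snapshots.push(state.save());
        action.run();
    }

    /** Restores the most recent snapshot; returns false if there is nothing to undo. */
    boolean undo() {
        if (snapshots.isEmpty()) {
            return false;
        }
        state.restore(snapshots.pop());
        return true;
    }
}
```

Every button handler would go through `perform(...)`, so undo works the same way for any action, at the cost of one full snapshot per action.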
Another option is the command pattern, where each command knows how to undo itself. This way you don't need to keep the previous state, but you do need to keep track of the command history. If you have a lot of state, the command pattern is the better option: instead of saving the previous state, each command has an inverse function that knows how to undo the change it made. If you have only a few strings, the memento pattern is the better option, especially if you need to allow undoing actions even after the app is killed and started again.
A good example of using the command pattern for undo, with an explanation, can be found here.
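As a rough sketch of the command approach (all names here, `Command`, `Counter`, `AddCommand`, `CommandHistory`, are hypothetical and not taken from any linked example), each action carries its own inverse and a stack records what has been executed:

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical command interface: every action knows its own inverse. */
interface Command {
    void execute();
    void undo();
}

/** Hypothetical counter standing in for the app's state. */
class Counter {
    int num;
}

/** Mirrors the question's "num += 5" / "num -= 5" pair. */
class AddCommand implements Command {
    private final Counter target;
    private final int amount;

    AddCommand(Counter target, int amount) {
        this.target = target;
        this.amount = amount;
    }

    @Override
    public void execute() {
        target.num += amount;
    }

    @Override
    public void undo() {
        target.num -= amount;
    }
}

/** Stack of executed commands; undoing pops the most recent one. */
class CommandHistory {
    private final Deque<Command> done = new ArrayDeque<>();

    void run(Command c) {
        c.execute();
        done.push(c);
    }

    /** Returns false if there is nothing left to undo. */
    boolean undoLast() {
        if (done.isEmpty()) {
            return false;
        }
        done.pop().undo();
        return true;
    }
}
```

The history stores only small command objects rather than copies of the state, which is the trade-off against the memento pattern described above.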

Efficient recall of a delta-based data log in Java

My application has a number of objects in an internal list, and I need to be able to log them (e.g. once a second) and later recreate the state of the list at any time by querying the log file.
The current implementation logs the entire list every second, which is great for retrieval because I can simply load the log file, scan through it until I reach the desired time, and load the stored list.
However, the majority of my objects (~90%) rarely change, so it is wasteful in terms of disk space to continually log them at a set interval.
I am considering switching to a "delta" based log where only the changed objects are logged every second. Unfortunately this means it becomes hard to find the true state of the list at any one recorded time, without "playing back" the entire file to catch those objects that had not changed for a while before the desired recall time.
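To make the trade-off concrete, here is a minimal in-memory sketch of such a delta log. `DeltaLog` and `Tick` are hypothetical names, objects are reduced to an integer id and a string value, and a null value is assumed to mark a removed object. Recreating the state at a given time means replaying every delta up to that time, which is exactly the playback cost mentioned above.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical delta log: each tick stores only the objects that changed. */
class DeltaLog {
    private static final class Tick {
        final long time;
        final Map<Integer, String> changed; // object id -> new value (null = removed)
        Tick(long time, Map<Integer, String> changed) {
            this.time = time;
            this.changed = changed;
        }
    }

    private final List<Tick> ticks = new ArrayList<>();

    /** Appends one tick; times are assumed to be non-decreasing. */
    void append(long time, Map<Integer, String> changed) {
        ticks.add(new Tick(time, new HashMap<>(changed)));
    }

    /** Rebuilds the full state at the given time by replaying all earlier deltas. */
    Map<Integer, String> stateAt(long time) {
        Map<Integer, String> state = new HashMap<>();
        for (Tick t : ticks) {
            if (t.time > time) {
                break;
            }
            for (Map.Entry<Integer, String> c : t.changed.entrySet()) {
                if (c.getValue() == null) {
                    state.remove(c.getKey());
                } else {
                    state.put(c.getKey(), c.getValue());
                }
            }
        }
        return state;
    }
}
```

The replay loop grows with the length of the log, so objects that have not changed for a long time are only found by scanning back to their last delta.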
An alternative could be to store (every second) both the changed objects and the last-changed time for each unchanged object, so that a log reader would know where to look for them. I'm worried I'm reinventing the wheel here though — this must be a problem that has been encountered before.
Existing comparable techniques, I suppose, are those used in version control systems, but I'd like a native object-aware Java solution if possible — running git commit on a binary file once a second seems like it's abusing the intention of a VCS!
So, is there a standard way of solving this problem that I should be aware of? If not, any pitfalls that I might encounter when developing my own solution?

Concurrent Access within a Big InMemory B+Index

I am currently designing a big in-memory index structure (several gigabytes). The index is actually an RTree whose leaves are BTrees (don't ask). It supports a special query and pushes it to the logical limit.
Since those nodes are solely search nodes, I ask myself how best to make it parallel.
I know of six solutions so far:
1. Block reads when a write is scheduled. The tree is completely blocked until the last read is finished; then the write is performed, and after the write the tree can again be used for multiple reads. (Reads need no locking.)
2. Clone the nodes to be changed and reuse the existing nodes (including leaves), then switch between the two versions by simply, yet again, stopping reads, switching, and done. Since leaf pointers must be altered as well, the leaf pointers might become their own collection, which makes it possible to switch modifications atomically; changes can be redone on a second version to avoid copying the pointers on each insert.
3. Use independent copies of the index, like double buffering. Update one copy of the index and switch to it. Once no one reads the old index, alter that index in the same way. This way the change can be done without blocking existing reads. If another insert hits the tree within a reasonable amount of time, those changes can be applied as well.
4. Use a serial shared-nothing architecture so that each search thread has its own copy. Since a thread can only alter its tree after a single read has been performed, this would also be lock-free and simple. Because reads are spread evenly across the worker threads (each bound to a certain core), throughput would not be harmed.
5. Use read/write locks for each node about to be written and block only a subtree during the write. This would involve additional operations on the tree, since splitting and merging propagate upwards and therefore require a second pass of the insert (expanding locks upwards, towards the parent, would introduce the chance of a deadlock). Since splits and merges are not that frequent with a larger page size, this would also be a good way. In fact, my BTree implementation currently uses a similar mechanism: it splits a node and reinserts the value until no split is needed (which is not optimal but simpler).
6. Use a double buffer for each node, like the shadow cache in databases, where each page is switched between two versions. Every time a node is modified, a copy is modified, and a read uses either the old version or the new one. Each node gets a version number, and the version closest to the active version (latest change) is chosen. To switch between two versions, one needs only an atomic change to the root information. This way the tree can be altered and used at the same time. The switch can be done at any time, but it must be ensured that no read is still using the old version when it is overwritten with the new one. This method has the potential not to interfere with cache locality when linking leaves and the like. It requires twice the amount of memory, since a back buffer must be present, but it saves allocation time and might be good for a high frequency of changes.
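The "atomic change on the root information" that several of these options rely on can be illustrated with a small sketch. This is not the index described here: `SnapshotIndex` is a hypothetical stand-in that uses an immutable sorted array instead of a tree, and it copies the whole structure on each insert rather than only the touched nodes. What it shows is the publication step: readers take the current version once and never lock, while a writer prepares a new version and publishes it with a single compare-and-set.

```java
import java.util.Arrays;
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical stand-in for the index: an immutable sorted array behind one atomic reference. */
class SnapshotIndex {
    private final AtomicReference<int[]> root = new AtomicReference<>(new int[0]);

    /** Readers grab the current version once and search it without any lock. */
    boolean contains(int key) {
        int[] snapshot = root.get();
        return Arrays.binarySearch(snapshot, key) >= 0;
    }

    /** Writers build a modified copy and publish it with a single atomic switch. */
    void insert(int key) {
        while (true) {
            int[] old = root.get();
            int pos = Arrays.binarySearch(old, key);
            if (pos >= 0) {
                return; // key already present, nothing to publish
            }
            int at = -pos - 1; // insertion point reported by binarySearch
            int[] next = new int[old.length + 1];
            System.arraycopy(old, 0, next, 0, at);
            next[at] = key;
            System.arraycopy(old, at, next, at + 1, old.length - at);
            if (root.compareAndSet(old, next)) {
                return; // published; otherwise another writer won and we retry
            }
        }
    }

    int size() {
        return root.get().length;
    }
}
```

In Java the concern that no read may still be using the old version is handled by the garbage collector in this sketch, because an old array stays alive for as long as any reader holds a reference to it. A design that reuses a fixed back buffer, as in option 6, would have to track readers explicitly.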
With all those thoughts, what is best? I know it depends, but what is done in the wild? If there are 10 read threads (or even more) all blocked by a single write operation, I guess that is not something I really want.
Also, what about the L3, L2, and L1 caches, and scenarios with multiple CPUs? Any issues there? The beauty of double buffering is the chance that reads hitting the old version still work with the correct cached version.
The variant that creates a fresh copy of a node is not very appealing. So what does one meet in the wild in today's database landscape?
[update]
On rereading the post, I wonder whether, instead of using write locks for split and merge, creating replacement nodes would be better suited. For a split or a merge I need to copy roughly half of the elements around anyway, and those operations are very rare, so copying a node completely and replacing it in the parent node, which is a simple and fast operation, would do the trick. This way the actual blocking of reads would be very limited, and since we create copies anyway, blocking only happens when the new nodes are swapped in. Since leaves may not be altered during those accesses, it is unimportant, as the information density has not changed. But again, this requires an increment and decrement of a read lock, plus a check for intended write locks, on every access of a node. All of this is overhead, and all of it blocks further reads.
[Update2]
Solution 7. (currently favored)
Currently we favor a double buffer for the internal (non-leaf) nodes and use something similar to row locking.
The logical tables that we try to decompose using those index structures (which is all an index does) result in applying the algebra of sets to that information. I noticed that this algebra of sets is linear (O(m+n) for intersection and union), which gives us the chance to lock each entry that is part of such an operation.
By double buffering the internal nodes (which is neither hard to implement nor costly, at less than about 1% memory overhead) we can live problem-free on that front without blocking too many read operations.
Since we batch modifications in a certain way, a given column is very rarely updated, but once it is, it takes more time, since the modifications for that single entry might number in the thousands.
So the goal is to alter the set algebra so that columns currently being modified are simply intersected later on. Since only one column is modified at a time, such an operation would block only once. And the write operation has to wait for everyone currently reading the column. And guess what: once a write operation waits, it usually lets another write operation on another column that is not busy take place. We calculate the probability of such a block to be very, very low, so we don't need to care.
The locking mechanism works as follows: check for write, check for write intention, add read, check for write again, and proceed with the read. So there is no explicit object locking. We access fixed areas of bytes, and once the structure is clear, everything critical is planned to move into a C++ version to make it somewhat faster (2x, we guess; this should only take one person one or two weeks, especially with a Java-to-C++ translator).
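The check / add read / re-check sequence could be sketched with two atomics, as below. This is an interpretation rather than the actual implementation: `ColumnGuard` is a hypothetical name, the separate "write" and "write intention" states are folded into a single flag for brevity, and waiting is done by simple spinning with `Thread.yield()`.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

/** Hypothetical per-column guard following the check / add read / re-check sequence. */
class ColumnGuard {
    private final AtomicBoolean writeIntent = new AtomicBoolean(false);
    private final AtomicInteger readers = new AtomicInteger(0);

    <T> T read(Supplier<T> body) {
        while (true) {
            if (writeIntent.get()) {          // check for write (intention)
                Thread.yield();
                continue;
            }
            readers.incrementAndGet();        // add read
            if (!writeIntent.get()) {         // check for write again
                try {
                    return body.get();        // proceed with the read
                } finally {
                    readers.decrementAndGet();
                }
            }
            readers.decrementAndGet();        // a writer slipped in: back off and retry
            Thread.yield();
        }
    }

    void write(Runnable body) {
        // Announce the write; only one writer per column at a time.
        while (!writeIntent.compareAndSet(false, true)) {
            Thread.yield();
        }
        try {
            // Wait for everyone currently reading the column.
            while (readers.get() != 0) {
                Thread.yield();
            }
            body.run();
        } finally {
            writeIntent.set(false);
        }
    }
}
```

The second check is what closes the race: a reader that registers itself after the writer has announced its intention sees the flag and withdraws, while a reader that registered earlier is seen by the writer, which then waits for the count to drop to zero.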
The only other effect that might now be important is caching, since a write invalidates L1 caches and maybe L2 too. So we plan to collect all modifications to such a table/index and schedule them to run within a time slice of one or more minutes, evenly distributed so that the system has no performance hiccups.
If you know of anything that helps us, please go ahead.

As no one replied, I would like to summarize what we (I) finally did. The structure is now separated. We have an RTree whose leaves are actually tables. Those tables can even be remote, so we have a distribution scheme that is mostly transparent thanks to RMI and proxies.
The rest was simply easy. The RTree has a way to advise a table to split, and the result of the split is again a table. The split is done on a single machine and transferred to another if it has to be remote. Merge is almost the same.
This remoting also applies to threads bound to different CPUs, to avoid cache issues.
As for modification in memory, it is as I already suggested: we duplicated the internal nodes, turned the table by 90°, and adapted the set-algebra algorithms to handle locked columns efficiently. The test for a single table is simple and, compared to the thousands of entries per column, not a performance issue after all. Deadlocks are also impossible, since one column is used at a time, so there is only one lock per thread. We are experimenting with processing columns in parallel, which would improve the response time. We are also thinking about binding columns to a given virtual core, so that no locking is needed, since the column is in isolation and the modifications can again be serialized.
This way one can utilize 20 cores and more per CPU and also avoid cache misses.

General methods for optimizing program for speed

What are some generic methods for optimizing a program in Java in terms of speed? I am using a DOM parser to parse an XML file, store certain words in an ArrayList, and remove any duplicates. I then spell check those words by creating a Google search URL for each word, fetching the HTML document, locating the corrected word, and saving it to another ArrayList.
Any help would be appreciated! Thanks.

Why do you need to improve performance? From your explanation, it is pretty obvious that the big bottleneck here (or performance hit) is going to be the IO resulting from the fact that you are accessing a URL.
This will surely dwarf by orders of magnitude any minor improvements you make in data structures or XML frameworks.
It is a good general rule of thumb that your big performance problems will involve IO. Humorously enough, I am at this very moment waiting for a database query to return in a batch process. It has been running for almost an hour. But I welcome any suggested improvements to my XML parsing library nevertheless!
Here are my general methods:
Does your program perform any obviously expensive task from the perspective of latency (IO)? Do you have enough logging to see that this is where the delay is (if significant)?
Is your program prone to lock contention (i.e. can it wait around, doing nothing, waiting for some resource to be "free")? Perhaps you are locking an entire Map whilst you make an expensive calculation for a value to store, blocking other threads from accessing the map.
Is there some obvious algorithm (perhaps for data-matching, or sorting) that might have poor characteristics?
Run up a profiler (e.g. jvisualvm, which ships with the JDK itself) and look at your code hotspots. Where is the JVM spending its time?

SAX is faster than DOM. If you don't want to go through the ArrayList searching for duplicates, put everything in a LinkedHashMap -- no duplicates, and you still get the order-of-insertion that ArrayList gives you.
But the real bottleneck is going to be sending the HTTP request to Google, waiting for the response, then parsing the response. Use a spellcheck library, instead.
Edit: But take my educated guesses with a grain of salt. Use a code profiler to see what's really slowing down your program.

Generally the best method is to figure out where your bottleneck is, and fix it. You'll usually find that you spend 90% of your time in a small portion of your code, and that's where you want to focus your efforts.
Once you've figured out what's taking a lot of time, focus on improving your algorithms. For example, removing duplicates from an ArrayList can be O(n²) complexity if you're using the most obvious algorithm, but that can be reduced to O(n) if you leverage the correct data structures.
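As a small illustration of that difference (the class and method names are made up for this sketch), the first method below is the obvious list-scanning approach and the second gets the same result in expected linear time with a `LinkedHashSet`, which also preserves first-seen order:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;

class Dedup {
    /** Obvious approach: contains() scans the result list, so the loop is O(n^2) overall. */
    static List<String> quadratic(List<String> words) {
        List<String> result = new ArrayList<>();
        for (String w : words) {
            if (!result.contains(w)) {
                result.add(w);
            }
        }
        return result;
    }

    /** Hash-based approach: expected O(n); LinkedHashSet keeps first-seen order. */
    static List<String> linear(List<String> words) {
        return new ArrayList<>(new LinkedHashSet<>(words));
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("teh", "cat", "teh", "sat", "cat");
        System.out.println(quadratic(words)); // [teh, cat, sat]
        System.out.println(linear(words));    // [teh, cat, sat]
    }
}
```

For a word list of the size described in the question either version is likely to be dwarfed by the network calls, which is why profiling comes first.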
Once you've figured out which portions of your code are taking the most time, and you can't figure out how best to fix it, I'd suggest narrowing down your question and posting another question here on StackOverflow.
Edit
As @oxbow_lakes so snidely put it, not all performance bottlenecks are to be found in the code's big-O characteristics. I certainly had no intention to imply that they were. Since the question was about "general methods" for optimizing, I tried to stick to general ideas rather than talking about this specific program. But here's how you can apply my advice to this specific program:
See where your bottleneck is. There are a number of ways to profile your code, ranging from high-end, expensive profiling software to really hacky. Chances are, any of these methods will indicate that your program spends 99% of its time waiting for a response from Google.
Focus on algorithms. Right now your algorithm is (roughly):
  Parse the XML
  Create a list of words
  For each word
    Ping Google for a spell check.
  Return results
Since most of your time is spent in the "ping Google" phase, an obvious way to fix this would be to avoid doing that step more times than necessary. For example:
  Parse the XML
  Create a list of words
  Send list of words to spelling service.
  Parse results from spelling service.
  Return results
Of course, in this case, the biggest speed boost would probably come from using a spell checker that runs on the same machine, but that isn't always an option. For example, TinyMCE runs as a JavaScript program within the browser, and it can't afford to download the entire dictionary as part of the web page. So it packages up all the words into a distinct list and performs a single AJAX request to get a list of those words that aren't in the dictionary.
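The difference between the two outlines can be sketched with a fake service that counts round trips instead of doing real I/O. Everything here is hypothetical: `SpellService` is an invented interface standing in for whatever batch-capable spelling service is available, and `FakeSpellService` only knows one correction.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;

/** Hypothetical remote spelling service; each call stands for one network round trip. */
interface SpellService {
    Map<String, String> correct(Collection<String> words);
}

/** In-memory fake that counts round trips instead of doing real I/O. */
class FakeSpellService implements SpellService {
    int roundTrips = 0;

    @Override
    public Map<String, String> correct(Collection<String> words) {
        roundTrips++;
        Map<String, String> out = new LinkedHashMap<>();
        for (String w : words) {
            out.put(w, "teh".equals(w) ? "the" : w);
        }
        return out;
    }
}

class SpellBatching {
    /** First outline: one request per word, so n round trips. */
    static List<String> oneByOne(SpellService service, List<String> words) {
        List<String> out = new ArrayList<>();
        for (String w : words) {
            out.add(service.correct(Collections.singletonList(w)).get(w));
        }
        return out;
    }

    /** Second outline: one request carrying the distinct words, so a single round trip. */
    static List<String> batched(SpellService service, List<String> words) {
        Map<String, String> fixed = service.correct(new LinkedHashSet<>(words));
        List<String> out = new ArrayList<>();
        for (String w : words) {
            out.add(fixed.get(w));
        }
        return out;
    }
}
```

Both methods return the same corrections; only the number of calls to the service differs, and with a real network behind the interface that number is what dominates the running time.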

These folks are probably right, but a few random pauses will turn "probably" into "definitely, and here's why".
