What is the cost of using `checksumHeaderBypass` in MapDB?

We are using MapDB to store a list of files that have been visited during a long-running process, so that if we need to abort or if the process crashes we can resume where we left off.
We want to protect against crashes corrupting our MapDB file store,
so we are using transactions and periodically commit changes to disk.
But then I noticed something interesting: if our process crashes at certain times, we still get this error:
Header checksum broken. Store was not closed correctly and might be corrupted. Use DBMaker.checksumHeaderBypass() to recover your data. Use clean shutdown or enable transactions to protect the store in the future.
Setting checksumHeaderBypass does indeed make the error go away. What is the cost of using this checksumHeaderBypass setting?

If you use MapDB from a @PostConstruct method in a Spring Boot app, it throws this error. Avoid initialising MapDB before the app has started (don't initialise it from @PostConstruct).

Not getting any traffic here because there are not a whole lot of MapDB people on SO, so I'll post the answer I think is the best.
Basically, if you allow the checksum header bypass you can load the MapDB store, but it may contain invalid entries. A checksum that doesn't match indicates that the content isn't what it should be, so you'll likely have some bad data in the store. Depending on how often you commit to storage, that could be a lot or a little corrupted data.
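To make the "checksum doesn't match" point concrete, here is a generic sketch using only the JDK. It is not MapDB's on-disk format or API; the class name `HeaderChecksumDemo` and the sample data are made up for illustration. It only shows why a mismatch means the bytes on disk are not the bytes that were last committed cleanly.

```java
import java.util.zip.CRC32;

// Generic illustration only: this is NOT how MapDB lays out its header.
final class HeaderChecksumDemo {

    // Checksum a store could record in its header after a clean commit.
    static long checksum(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data, 0, data.length);
        return crc.getValue();
    }

    // True when the bytes found on disk still match the recorded checksum.
    static boolean isConsistent(byte[] onDisk, long recordedChecksum) {
        return checksum(onDisk) == recordedChecksum;
    }
}
```

Bypassing the check is the equivalent of ignoring a `false` from `isConsistent` and reading the bytes anyway: you get whatever is there, including half-written entries.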

Related

How to speed up frequent writing

We created a Java agent which checks our application suite to see if, for instance, the parent/child structure is still correct. It therefore needs to check 8000+ documents across several applications.
The check itself goes very fast. We use a navigator to retrieve data from views and only read data from those entries. The problem is within our logging mechanism. Whenever we report a log entry with level SEVERE (aka a really big issue), the backend document is updated directly. This is because we don't want to lose any info about these issues.
In our test runs we see that everything runs smoothly, but as soon as we 'create' a lot of severe issues the performance drops enormously because of all the writes. I would like to know whether any Notes developers are facing the same challenge. How could we speed up the writing without losing any data?
-- added more info after comment from simon --
It's a scheduled agent which runs every night to check for inconsistencies. The goal is of course to find inconsistencies, fix the cause, and eventually have no inconsistencies reported at all.

"Its a scheduled agent which runs every night to check for inconsistencies."
OK. So there are a number of factors to take into account.
Are there any embedded Jars? When an agent has embedded jars the server has to detach them from the agent to the disk before they can run the code. This is done every time the agent executes. This can be a performance hit. If your agent spawns a number of times, remove the embedded jars and put them into the lib\ext folder on the server instead (requires server restart).
You mention it runs at night. By default, general housekeeping processes run at night. Check the notes.ini for scheduled server tasks and assess what impact they have on the server/agent when running. For example:
ServerTasksAt1=Catalog,Design
ServerTasksAt2=Updall
ServerTasksAt5=Statlog
In this case, if the agent runs between 2 and 5, then UPDALL could have an impact on it. Also check program documents for scheduled executions.
In what way are you writing? If you are creating a document for each incident and each document is small, then the write time should be reasonable. What is liable to hurt performance is one of the following.
If you are multi threading those writes.
Pulling a log document, appending a line, saving and then repeating.
One last thing to think about. If you are getting 3000 errors, there must be a point where X errors mean there is no point in continuing, and the agent should instead alert the admin via SNMP/email/etc. It might be worth coding that in as well.
Other than that, you should probably post some sample code showing how you write.

Hmm, difficult or general question.
As far as I understand, you update the documents in the view you are walking through. I would set view.AutoUpdate to false. This ensures that the view is not reloaded while you are running your code. This should speed up your code.
This is an extract from the Designer help:
Avoid automatically updating the parent view by explicitly setting
AutoUpdate to False. Automatic updates degrade performance and may
invalidate entries in the navigator ("Entry not found in index"). You
can update the view as needed with Refresh.
Hope that helps.
If that does not help you might want to post a code fragment or more details.

Create separate documents for each error rather than one huge document.
or
Write to a text file directly rather than to a database, and then pull it into a document if necessary. This should speed things up considerably.
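A minimal sketch of the text-file option, using only the JDK. The class name `SevereLogFile`, the one-entry-per-line format, and the file location are assumptions for illustration, not part of any Notes/Domino API.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper: appends SEVERE entries to a plain text file, one per line.
final class SevereLogFile implements AutoCloseable {
    private final BufferedWriter out;

    SevereLogFile(Path file) throws IOException {
        // CREATE + APPEND: keep entries from earlier runs, add new ones at the end.
        out = Files.newBufferedWriter(file, StandardCharsets.UTF_8,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    void severe(String message) throws IOException {
        out.write(message.replace('\n', ' ')); // keep each entry on a single line
        out.newLine();
        // Hand the line to the OS right away so a crash of the JVM does not lose it.
        // This does not protect against an OS or power failure.
        out.flush();
    }

    @Override
    public void close() throws IOException {
        out.close();
    }
}
```

At the end of the run (or the next morning) the file can be read back with `Files.readAllLines` and imported into a single document, so the expensive document save happens once instead of once per issue.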

Hack for a "real" Java flush on a remote/virtual disk

I'm looking for a "trick" or a "hack" to be certain that a file has been persisted on a remote disk, passing through the VMware cache, NAS cache, etc.
Flushing and closing a FileOutputStream is not enough. I don't think FileChannel.force(true) is either.
I'm thinking about something like these:
write the file and read back the file
write the file, check timestamp, rename the file, check for a different timestamp
write the file with "wrong content", overwrite with the original content, read it back and check the content
maybe someone had the same problem and found a solution.
My requirement is not to lose data. The java application works in this way:
1. accept a file from a remote source
2. add a digital signature and a certified timestamp creating a new file. If this file is lost it cannot be recreated in any way.
3. write this file to the storage
4. mark the file as signed on the database
5. tell the remote side that everything is ok
Tonight we had a crash and three transactions failed after step 5 but before the data was actually flushed to the remote store. So the database says that everything is fine, the remote side was told the same but 15 seconds of signed data was lost. And this is no good.
The correct solution could be to do a "sync mount" of the remote file system. But this is not going to happen any time soon. Even then, I would not completely trust this scenario, given that the app is running on a VMware server.
So I'd like to have a "best effort hack" to prevent (mitigate) incidents like this one.

Let's start with one assumption: you cannot guarantee any single write to any single disk. There are just too many layers of software and hardware between your write and the disk platter. And even if you could guarantee the write, you cannot guarantee that the data will be readable. It's possible that the disk will crash between the write and the read.
The only solution is redundancy, either provided by a framework (e.g., an RDBMS) or by your app.
When you receive and sign the file, you need to send it to multiple destinations on different physical hosts, and wait for them to reply that they saved the file. One of them might crash. Two of them might crash. How important the data is will determine how many remote hosts you need.
Incidentally, redundancy also applies to your database. The fact that a transaction committed does not mean that you'll be able to recover it after a database crash (although DBMS engineers have a lot more experience than either you or I in ensuring writes, all of it depends on a sysadmin who understands things like "logs and datafiles must reside on separate physical drives"). I strongly recommend that you (redundantly) store enough metadata along with the file to be able to reconstruct the database entry.
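For the "best effort" part of the question, the write-then-read-back idea combined with force(true) can be sketched with the JDK alone. The class and method names (`VerifiedWrite`, `writeAndVerify`) are made up for this example. As said above, this narrows the window but guarantees nothing: the read-back can be served from a local cache, and caches below the OS are out of the JVM's reach.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Arrays;

// Best-effort sketch; it cannot see through hypervisor or NAS caches.
final class VerifiedWrite {

    static void writeAndVerify(Path target, byte[] content) throws IOException {
        try (FileChannel ch = FileChannel.open(target,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE,
                StandardOpenOption.TRUNCATE_EXISTING)) {
            ByteBuffer buf = ByteBuffer.wrap(content);
            while (buf.hasRemaining()) {
                ch.write(buf);
            }
            // Ask the OS to push both data and metadata to the storage device.
            ch.force(true);
        }
        // Read the file back and compare. This may be answered from a cache,
        // so a match is evidence, not proof, that the bytes reached the disk.
        byte[] readBack = Files.readAllBytes(target);
        if (!Arrays.equals(content, readBack)) {
            throw new IOException("Read-back mismatch for " + target);
        }
    }
}
```

The point of the sketch is the ordering: only mark the file as signed in the database and answer the remote side after `writeAndVerify` has returned without an exception, ideally for more than one destination.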

Is there a Java local queue library I can use that keeps memory usage low by dumping to the hard drive?

This may not be possible, but I thought I might just give it a try. I have some work that processes some data; it makes one of three decisions for each piece of data it processes: keep, discard, or modify/reprocess (because it's unsure whether to keep or discard). This generates a very large amount of data because the reprocessing may break the data into many different parts.
My initial method was to send it to the ExecutorService that was processing the data, but because the number of items to process was large I would run out of memory very quickly. Then I decided to offload the queue to a messaging server (RabbitMQ), which works fine, but now I'm bound by network IO. What I like about RabbitMQ is that it keeps messages in memory up to a certain level and then dumps old messages to the local drive, so if I have 8 gigs of memory on my server I can still have a 100 gig message queue.
So my question is: is there any library that has a similar feature in Java? Something that I can use as a non-blocking queue that keeps only X items in the queue (either by number of items or by size) and writes the rest to the local drive.
Note: right now I'm only asking for this to be used on one server. In the future I might add more servers, but because each server generates its own data I would try to take messages from one queue and push them to another if one server's queue is empty. The library would not need to have network access, but I would need to access the queue from another Java process. I know this is a long shot, but I thought if anyone knew it would be SO.

Not sure if it is the approach you are looking for, but why not use a lightweight database like HSQLDB and a persistence layer like Hibernate? You can keep your messages in memory, then commit to the DB to save them on disk, and later retrieve them with a convenient SQL query.

Actually, as Cuevas wrote, HSQLDB could be a solution. If you use the "cached table" type it provides, you can specify the maximum amount of memory used; data beyond that is written to the hard drive.

Use the filesystem. It's old-school, yet so many engineers get bitten with libraries because they are lazy. True that HSQLDB provides lots of value-add features, but in the context of being light weight....
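A rough sketch of what "use the filesystem" could look like: a FIFO queue that keeps at most N items in memory and spills the rest to a local file. Everything here (the `SpillQueue` name, String items, the single spill file) is an assumption for illustration. It is single-process, synchronized rather than lock-free, and the spill file is wiped on startup, so it covers the memory problem only, not the cross-process access or crash recovery.

```java
import java.io.Closeable;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.file.Path;
import java.util.ArrayDeque;

// Hypothetical bounded-memory FIFO queue that overflows to a local file.
final class SpillQueue implements Closeable {
    private final int maxInMemory;
    private final ArrayDeque<String> memory = new ArrayDeque<>();
    private final RandomAccessFile spill;
    private long readPos = 0; // file offset of the next unread spilled item
    private long spilled = 0; // number of items in the file not yet read back

    SpillQueue(Path spillFile, int maxInMemory) throws IOException {
        this.maxInMemory = maxInMemory;
        this.spill = new RandomAccessFile(spillFile.toFile(), "rw");
        this.spill.setLength(0); // start empty; this sketch does not recover old data
    }

    synchronized void offer(String item) throws IOException {
        // Once anything is on disk, new items must follow it to preserve FIFO order.
        if (spilled == 0 && memory.size() < maxInMemory) {
            memory.addLast(item);
            return;
        }
        spill.seek(spill.length());
        spill.writeUTF(item); // writeUTF limits one item to 65535 encoded bytes
        spilled++;
    }

    // Returns the oldest item, or null when the queue is empty (never blocks).
    synchronized String poll() throws IOException {
        if (memory.isEmpty() && spilled > 0) {
            refill();
        }
        return memory.pollFirst();
    }

    private void refill() throws IOException {
        spill.seek(readPos);
        while (spilled > 0 && memory.size() < maxInMemory) {
            memory.addLast(spill.readUTF());
            spilled--;
        }
        readPos = spill.getFilePointer();
        if (spilled == 0) { // everything consumed: reclaim the disk space
            spill.setLength(0);
            readPos = 0;
        }
    }

    @Override
    public synchronized void close() throws IOException {
        spill.close();
    }
}
```

Note that disk space is only reclaimed when the spilled part has been drained completely; a long-running queue that never empties would need segment files or compaction instead.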

Highly reliable storage for a 'log' / time series

In an application I'm working on, I need a write-behind data log. That is, the application accumulates data in memory, and can hold all the data in memory. It must, however, persist, tolerate reasonable faults, and allow for backup.
Obviously, I could write to a SQL database; Derby springs to mind for easy embedding. I'm not tremendously fond of dealing with a SQL API (JDBC, however lipsticked) and I don't need any queries, indices, or other decoration. The records go out, and on restart, I need to read them all back.
Are there any other suitable alternatives?

Try using just a simple log file.
As data comes in, store in memory and write (append) to a file. write() followed by fsync() will guarantee (on most systems, read your system and filesystem docs carefully) that the data is written to persistent storage (disc). These are the same mechanisms any database engine would use to get data in persistent storage.
On restart, reload the log. Occasionally, trim the front of the log file so data usage doesn't grow infinitely. Or, model the log file as a circular buffer the same size as what you can hold in memory.
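In Java, the write() plus fsync() pair maps roughly to FileChannel.write followed by FileChannel.force. Below is a minimal sketch of such a log; the `AppendLog` name and the length-prefixed record format are my own assumptions, and trimming the front of the log is left out. On reload, an incomplete last record (from a crash in the middle of an append) is simply dropped. There is no per-record checksum, so other kinds of corruption are not detected.

```java
import java.io.BufferedInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;

// Hypothetical append-only log: each record is a 4-byte length followed by the payload.
final class AppendLog implements AutoCloseable {
    private final FileChannel channel;

    AppendLog(Path file) throws IOException {
        channel = FileChannel.open(file, StandardOpenOption.CREATE,
                StandardOpenOption.WRITE, StandardOpenOption.APPEND);
    }

    // Appends one record, then asks the OS to flush it to the device (the fsync step).
    void append(byte[] record) throws IOException {
        ByteBuffer buf = ByteBuffer.allocate(4 + record.length);
        buf.putInt(record.length).put(record).flip();
        while (buf.hasRemaining()) {
            channel.write(buf);
        }
        channel.force(true);
    }

    // Reads every complete record; a truncated tail ends the reload quietly.
    static List<byte[]> reload(Path file) throws IOException {
        List<byte[]> records = new ArrayList<>();
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(Files.newInputStream(file)))) {
            while (true) {
                int length = in.readInt();
                byte[] record = new byte[length];
                in.readFully(record);
                records.add(record);
            }
        } catch (EOFException endOfLog) {
            // Normal end of file, or a record cut short by a crash.
        }
        return records;
    }

    @Override
    public void close() throws IOException {
        channel.close();
    }
}
```

Calling force on every append is the slow but safe choice; batching several records per force trades a small window of possible loss for throughput.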

Have you looked at (now Oracle) Berkeley DB for Java? The "Direct Persistence Layer" is actually quite simple to use. Docs here for DPL.
It has different options for backups and comes with a few utilities. It runs embedded.
(Licensing: a form of the BSD License, I believe.)

How to improve my software project's speed?

I'm doing a school software project with my classmates in Java.
We store the info in a remote DB.
When we start the application we pull all the information from the database and transform it into objects to use in our application (using Java SQL statements).
In the application we edit some of these objects, and then when we exit the application we save or update information in the database using Hibernate.
As you can see, we don't use Hibernate for pulling in information; we use it just for saving and updating.
We have two very similar problems.
Loading the objects (when we start the app) and saving the objects with Hibernate (when closing the app) both take too much time.
And our project is not a huge enterprise application; it's quite a small app. We just manage some students, teachers, homework and tests. So our DB is also very small.
How could we increase performance?
Later edit: with a local database it runs very quickly; it is only slow with remote databases.

Are you saying you are loading the entire database into memory and then manipulating it? If that is the case, why don't you instead simply use the database as a storage device, and do lookups and manipulation as necessary (using Hibernate if you like, or something else if you don't)? The key there is to make sure that you are using connection pooling, as that will reduce the connection time.
If this is what you are doing, then you could be running into memory issues as well. By not caching the entire database in memory, you reduce memory use and spread the network load from startup and shutdown across the times when it is actually needed.

These two sentences are red flags for me:
"When we start the application we pull all the information from the database and transform it into objects to use in our application (using java sql statements). In the application we edit some of these objects and then when we exit the application we save or update information in the database using Hibernate."
Is there a requirements reason that you are loading all the information from the database into memory at startup, or why you're waiting until shutdown to save changes back in the database?
If not, I'd suggest a design change. If you've already got Hibernate mappings for the tables in the DB, I'd use Hibernate for all of your CRUD (create, read, update, delete) operations. And I'd only load the data that each page in your app needs, as it needs it.
If you can't make that kind of design change at this point, I think you've got to look closely at how you're managing the database connections. Are you using connection pools? Are you opening up multiple connections? Forgetting to release them?
Something else to look at. How are you using Hibernate to save the entities to the db? Are you doing a getHibernateTemplate().get on each one and then doing an entity.save or entity.update on each one? If so, that means you are also causing Hibernate to run a select query for each database object before it does a save or update. So, essentially, you'd be loading each database object twice (once at the beginning of the program, once before saving). To see if that's what's happening, you can turn on the show_sql property or use P6Spy to see exactly what queries Hibernate is running.

For what you are doing, you may very well be better off serializing your objects and writing them out to a flat file.
But, much more likely, you should just read / update objects directly from your database as needed instead of all at once, for all the reasons aperkins gives.
Also, consider what happens if your application crashes? If all of your updates are saved only in memory until the application is closed, everything would be lost if the app closes unexpectedly.

The difference in loading everything from a remote DB server versus loading everything from a local DB server is the network latency / pipe size. The network is a much smaller pipe than anything else. Two questions: first, how much data are we really talking about? Second, what is your network speed? 10/100/1000? Figure between 10 and 20% of your pipe size is going to be overhead due to everything from networking protocols to the actual queries themselves.
As others have stated, the way you've architected is usually high on the list of "don't do". When starting, pull only enough data to initialize the app. As the user works through it, pull what you need for that task.
The ONLY time you pull everything is when they are working in a disconnected state. In that case, you still don't load everything as objects in the application, you just work from a local data store which gets sync'ed with the remote server every so often.

The project is pretty much complete. We can't do large refactoring on it now.
I tried to use a second-level cache for Hibernate when saving: EhCacheProvider.
In hibernate.xml:
net.sf.ehcache.hibernate.EhCacheProvider
I have done a config for the cache, ehcache.xml:
I have put the cache.jar in the project build path,
and I have set the Hibernate property for every class and set it in the mapping.
But this cache doesn't seem to have any effect. I don't know if it works (if it is used).

Try minimising number of SQL queries, since every query has its own overhead.
You can enable database compression, which should speed things up when there is a lot of data.
Maybe you are connecting to the database many times?
Check the ping time of remote database server - it might be the problem.

As your application is just slow when running on a remote database server, I'd assume that the performance loss is due to:
Connecting to the server: try to reuse connections (pass the instance around) or use connection pooling
Query round-trip time: use as few queries as possible, see here in case of a hand-written DAL:
Preferred way of retrieving row with multiple relating rows
For Hibernate you may use its batch functionality and adjust hibernate.jdbc.batch_size.
In all cases, especially when you can't refactor larger parts of the codebase, use a profiler (method time or SQL queries) to find the bottleneck. I bet you'll find thousands of queries (each taking 10 ms RTT) which could be merged into one.

Some other things you can look into:
You can allocate more memory to the JVM
Use the jconsole tool to investigate what the bottlenecks are.

Why don't you have two separate threads?
Thread 1 will load your objects one by one.
Thread 2 will process objects as they are loaded.
Your app will seem more interactive at startup.
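A minimal sketch of the two-thread idea with a BlockingQueue between loader and processor. The names (`LoadAndProcess`, `run`) are invented, the "load" is just iterating a list and the "processing" is an upper-case conversion, both standing in for the real database read and the real work.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical loader/processor pair connected by a bounded queue.
final class LoadAndProcess {
    // Unique sentinel object that tells the processor there is nothing more to load.
    private static final String DONE = new String("<done>");

    static List<String> run(List<String> source) throws InterruptedException {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>(100);
        List<String> processed = new ArrayList<>();

        // Thread 1: loads objects one by one (here: just walks the list).
        Thread loader = new Thread(() -> {
            try {
                for (String item : source) {
                    queue.put(item);
                }
                queue.put(DONE);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        // Thread 2: processes objects as soon as they arrive.
        Thread processor = new Thread(() -> {
            try {
                for (String item = queue.take(); item != DONE; item = queue.take()) {
                    processed.add(item.toUpperCase()); // placeholder for real processing
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        loader.start();
        processor.start();
        loader.join();
        processor.join(); // after join, the processor's results are safely visible here
        return processed;
    }
}
```

This makes startup feel faster but does not reduce the total number of round trips to the remote database, so it complements the other suggestions rather than replacing them.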

It never hurts to review the basics:
Improving speed means reducing time (obviously), and to do that, you find activities that take significant time but can be eliminated or replaced with something that uses less time. What I mean by activity is almost always a function call, method call, or property call, performed on a specific line of code for a specific purpose. It may invoke I/O or it may invoke computation, or both. If its purpose is not essential, then it can be optimized.
Many people use profilers to try to find these time-wasting lines of code, but most profilers miss the target because they look at functions, not lines, they go to sleep during I/O, and they worry about "self time".
Many more people try to guess what could be the problem, or they ask others to guess, such as by asking on SO. Such guesses, in the nature of guesses, are sometimes right - more often not, but people still invest time and resources in them.
There's a very simple way to find out for sure, without guessing, what could fruitfully be optimized, and here is one way to do it in Java.

Thanks for your answers. They were more than helpful.
We completely solved this problem like so:
Refactored the LOAD code. Now it uses Hibernate with lazy fetching.
Refactored the SAVE code. Now it saves just the data that was modified, right after it was modified. This way we don't have a HUGE save at the end.
I'm amazed at how well it all went. The amount of new code we had to write was very small.
