Couchbase 1 -> 2 migration: obtaining backups and checking consistency - Java

Given: an old Java project using Couchbase 1. It has a big problem with runtime backups: neither simply copying the files nor using cbbackup works. The backups we obtained were corrupted, and Couchbase wouldn't start with them. The only way to get a data snapshot was a relatively long application shutdown.
Now we're migrating to Couchbase 2+. cbbackup fails with something like this (a senseless message to me, since there were no design docs in Couchbase 1):
/pools/default/buckets/default/ddocs; reason: provide_design done
However, if we use the resulting files, Couchbase seems to wake up and work properly.
Question 1: Any insights into, or help with, this whole spoiled-backups situation?
Question 2: How, at the very least, could we verify the consistency of the new database backups in our case?
(Writing a huge check pack for all docs and fields through the client is very expensive and a last resort.)
I'd appreciate any help; this is vague legacy infrastructure for the team, and neither googling nor the Couchbase documentation has helped us much.

Question 1: Any insights into, or help with, this whole spoiled-backups situation?
Couchbase 1.x used SQLite as the on-disk format (sharded into 4 files per bucket, IIRC), which has a number of issues at scale.
One of the major changes in Couchbase 2 was the move to a custom append-only file format (Couchstore), which is much less susceptible to corruption issues: once written, a block is never modified until a new, compacted file is later created by an automated job.
Question 2: How, at the very least, could we verify the consistency of the new database backups in our case? (Writing a huge check pack for all docs and fields through the client is very expensive and a last resort.)
If you want to check the consistency of a backup, you need to do something along the lines of what you mention.
Note, however, that if you're backing up a live system (as most people are), the live system is likely to have changed between taking the backup and comparing against it.
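As a rough illustration only, here is a minimal sketch of the kind of spot-check you could script with the old, spymemcached-based Couchbase Java client (1.4.x), assuming you restore the backup into a separate staging cluster and compare a sample of keys against the live bucket. The host names, bucket name, and key list are hypothetical.

```java
import com.couchbase.client.CouchbaseClient;

import java.net.URI;
import java.util.Arrays;
import java.util.List;

// Hypothetical spot-check: compare a sample of documents between the live
// bucket and a bucket restored from the backup into a staging cluster.
public class BackupSpotCheck {
    public static void main(String[] args) throws Exception {
        // Hypothetical endpoints and bucket names.
        CouchbaseClient live = new CouchbaseClient(
                Arrays.asList(URI.create("http://live-node:8091/pools")), "default", "");
        CouchbaseClient restored = new CouchbaseClient(
                Arrays.asList(URI.create("http://staging-node:8091/pools")), "default", "");

        // A sample of keys to verify; in practice you would take these from
        // your own key inventory or generate them from known ID patterns.
        List<String> sampleKeys = Arrays.asList("doc::1", "doc::2", "doc::3");

        int mismatches = 0;
        for (String key : sampleKeys) {
            Object a = live.get(key);
            Object b = restored.get(key);
            if (a == null ? b != null : !a.equals(b)) {
                mismatches++;
                System.out.println("Mismatch for key " + key);
            }
        }
        System.out.println(mismatches + " mismatching keys out of " + sampleKeys.size());

        live.shutdown();
        restored.shutdown();
    }
}
```

Keep in mind the caveat above: if the live bucket keeps taking writes, some mismatches will simply reflect changes made after the backup was taken.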
As a final aside, I would suggest looking at the 1.8.x to 2.0 Upgrade Guide on the Couchbase website. Note that 2.x is now pretty old (3.x is current as of writing, 4.0 is in beta) hence the 2.0 documentation is in an archived section of the website.

Related

How to write custom storage plugin for apache drill

I have my data in a proprietary format, none of the ones supported by Apache Drill.
Are there any tutorials on how to write my own storage plugin to handle such data?
This is something that really should be in the docs but currently is not. The interface isn't too complicated, but it can be a bit much to look at one of the existing plugins and understand everything that is going on.
There are two major components to writing a storage plugin: exposing information to the query planner and schema management system, and then actually implementing the translation from the datasource API to the Drill record representation.
The Kudu plugin was added recently and is a reasonable model for a storage system with a lot of the elements Drill can take advantage of. One thing I would note is that if your storage system is not distributed and you just plan on making all remote reads, you don't have to do as much work around affinities/work lists/assignments in the group scan. If I have some time soon, I'll try to write up a doc on the different parts of the interface and maybe a tutorial about one of the existing plugins.
https://github.com/apache/drill/tree/master/contrib/storage-kudu/src/main/java/org/apache/drill/exec/store/kudu

Using Lucene as storage

I would like to know if it would be recommended to use Lucene as data storage. I am saying 'recommended' because I already know that it's possible.
I am asking this question because the only Q&A I could find on SO was this one: Lucene as data store, which is kind of outdated (from 2010), even though it is almost exactly the same question.
My main concern about having data exclusively in Lucene is the storage reliability. I have been using Lucene since 2011 and at that time (version 2.4) it was not improbable to encounter a CorruptIndexException, basically meaning that the data would be lost if you didn't have it somewhere else.
However, in the newest versions (from 4.x onward), I've never experienced any problem with Lucene indices.
The answer should not consider the performance too much as I already have a pretty good idea of what to expect in that field.
I am also open to hearing about Solr and Elasticsearch reliability experiences (how often do shards fail, what options do we have when this occurs, etc.).
This sounds like a good match for SolrCloud, as it is designed to handle the load and also takes care of backups. My only concern would be that it is not a datastore; it "only" handles the indexing of those documents.
We are using SolrCloud for data storage and reliability has been pretty good so far.
However, make sure that you configure and tune it well, or you may find nodes failing and ZooKeeper being unable to detect some of them after a while.
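For what it's worth, here is a minimal sketch (assuming a recent Lucene, 5.x or later) of what "Lucene as the primary store" tends to look like in practice: the payload lives in stored fields and is looked up by an indexed ID. The index path, field names, and payload are made up for the example.

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

import java.nio.file.Paths;

public class LuceneAsStore {
    public static void main(String[] args) throws Exception {
        Directory dir = FSDirectory.open(Paths.get("/tmp/lucene-store")); // hypothetical path

        // Write: the "id" field is indexed for lookup, the payload is only stored.
        try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
            Document doc = new Document();
            doc.add(new StringField("id", "42", Field.Store.YES));
            doc.add(new StoredField("payload", "{\"name\":\"example\",\"value\":123}"));
            writer.updateDocument(new Term("id", "42"), doc); // upsert by id
            writer.commit();
        }

        // Read back by id.
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            TopDocs hits = searcher.search(new TermQuery(new Term("id", "42")), 1);
            if (hits.scoreDocs.length > 0) {
                Document found = searcher.doc(hits.scoreDocs[0].doc);
                System.out.println(found.get("payload"));
            }
        }
    }
}
```

The reliability concerns raised above still apply; this only shows the mechanics, not a judgement on whether the index should be your sole copy of the data.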

Tools to do data processing from Java

I've got a legacy system that uses SAS to ingest raw data from the database, cleanse and consolidate it, and then score the outputted documents.
I want to move to Java or a similar object-oriented solution, so I can implement unit testing and generally have better control over the code. (I'm not talking about overhauling the whole system, just injecting Java where I can.)
In terms of data size, we're talking about around 1 TB of data being both ingested and created. In terms of scaling, this might increase by a factor of around 10, but it isn't likely to grow on a massive scale the way a worldwide web project might.
The question is - what tools would be most appropriate for this kind of project?
Where would I find this information - what search terms should be used?
Is doing processing on an SQL database (creating and dropping tables, adding columns, as needed) an appropriate, or awful, solution?
I've had a quick look at Hadoop - but due to the small scale of this project, would Hadoop be an unnecessary complication?
Are there any Java packages that offer functionality similar to SAS or SQL in terms of merging, joining, sorting, and grouping datasets, as well as modifying data?
It's hard for me to prescribe exactly what you need given your problem statement.
It sounds like a good database API might be all you need (i.e. native JDBC with a good open-source database backend).
However, I think you should take some time to check out Lucene. It's a fantastic tool and may meet your scoring needs very well. Taking a search engine indexing approach to your problem may be fruitful.
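To make the JDBC suggestion concrete, here is a minimal sketch of pushing the merge/join/group work into SQL and driving it from Java. It assumes a hypothetical embedded H2 database file and made-up table names; any open-source database with a JDBC driver would look much the same.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class SqlProcessing {
    public static void main(String[] args) throws Exception {
        // Hypothetical embedded H2 database file; swap the URL for Postgres, etc.
        try (Connection conn = DriverManager.getConnection("jdbc:h2:./workdir/etl", "sa", "");
             Statement st = conn.createStatement()) {

            // Staging tables created and dropped as part of the run.
            st.execute("DROP TABLE IF EXISTS raw_scores");
            st.execute("CREATE TABLE raw_scores(customer_id INT, score DOUBLE)");
            st.execute("INSERT INTO raw_scores VALUES (1, 0.4), (1, 0.9), (2, 0.7)");

            st.execute("DROP TABLE IF EXISTS customers");
            st.execute("CREATE TABLE customers(customer_id INT, region VARCHAR(32))");
            st.execute("INSERT INTO customers VALUES (1, 'NORTH'), (2, 'SOUTH')");

            // Join + group + sort, i.e. the SAS-style consolidation step done in SQL.
            try (ResultSet rs = st.executeQuery(
                    "SELECT c.region, AVG(r.score) AS avg_score " +
                    "FROM raw_scores r JOIN customers c ON r.customer_id = c.customer_id " +
                    "GROUP BY c.region ORDER BY c.region")) {
                while (rs.next()) {
                    System.out.println(rs.getString("region") + " -> " + rs.getDouble("avg_score"));
                }
            }
        }
    }
}
```

At your scale (around 1 TB), creating and dropping staging tables like this is a perfectly reasonable approach, provided the database is tuned for bulk loads.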
I think the questions you need to ask yourself are:
what is the nature of your data set, and how often will it be updated?
what workload will you have on this 1 TB (or more) of data in the future? Will it mainly be offline read and analysis operations, or will there also be a lot of random write operations?
Here is an article discussing whether or not to choose Hadoop, which I think is worth reading.
Hadoop is a better choice if you only have daily or weekly updates of your data set and the major operations on the data are read-only, along with further data analysis. For the merging, joining, sorting, and grouping operations you mentioned, Cascading is a Java library running on top of Hadoop which supports these operations well.
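As a rough, hedged sketch of what a Cascading (2.x on Hadoop) group-and-count job looks like; the input/output paths, field names, and delimiter are made up, and the exact imports vary a little between Cascading versions.

```java
import cascading.flow.FlowDef;
import cascading.flow.hadoop.HadoopFlowConnector;
import cascading.operation.aggregator.Count;
import cascading.pipe.Every;
import cascading.pipe.GroupBy;
import cascading.pipe.Pipe;
import cascading.scheme.hadoop.TextDelimited;
import cascading.tap.Tap;
import cascading.tap.hadoop.Hfs;
import cascading.tuple.Fields;

import java.util.Properties;

public class GroupAndCount {
    public static void main(String[] args) {
        // Hypothetical tab-delimited input with "customer_id" and "amount" columns.
        Tap source = new Hfs(new TextDelimited(new Fields("customer_id", "amount"), "\t"),
                "data/input");
        Tap sink = new Hfs(new TextDelimited(), "data/output");

        Pipe head = new Pipe("count-per-customer");
        Pipe pipe = new GroupBy(head, new Fields("customer_id")); // like SQL GROUP BY / SAS BY-group
        pipe = new Every(pipe, new Count());                      // count rows per group

        FlowDef flowDef = FlowDef.flowDef()
                .addSource(head, source)
                .addTailSink(pipe, sink);

        new HadoopFlowConnector(new Properties()).connect(flowDef).complete();
    }
}
```

Whether this is worth the Hadoop dependency at your scale is exactly the trade-off discussed above; for a single-machine workload, the SQL/JDBC route is usually simpler.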

Single-file, persistent, sorted key-value store for Java (alternative to Berkeley DB) [closed]

Berkeley DB (JE) licensing may be a deal killer. I have a Java application going out to a small set of customers, but as it is a desktop application, my price cannot support per-instance licensing.
Is there a recommended Java alternative to Berkeley DB? Commercial or otherwise (good key-value store implementations can get non-trivial; I'd prefer to defer maintenance elsewhere). I need more than just a hash store, as I'll need to iterate through subsequent key subsets, and a basic hash store would make that search O(m*n); I expect the store to be ~50-60 GiB on a desktop machine. Added benefit for anything you can recommend that keeps its backing store in a single file.
You should definitely try JDBM2; it does what you want:
Disk-backed HashMaps/TreeMaps, so you can iterate through keys.
Apache 2 license
In addition:
Fast, very small footprint
Transactional
The standalone jar is only 145 KB.
Simple usage
Scales well up to 1e9 records
Uses Java serialization, no ORM mapping
UPDATE
The project has now evolved into MapDB http://www.mapdb.org
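Since the project is now MapDB, here is a minimal sketch against the MapDB 3.x API (file name, map name, and keys made up) showing the single-file store and the sorted key-subset iteration the question asks about:

```java
import org.mapdb.BTreeMap;
import org.mapdb.DB;
import org.mapdb.DBMaker;
import org.mapdb.Serializer;

import java.util.concurrent.ConcurrentNavigableMap;

public class MapDbExample {
    public static void main(String[] args) {
        // Single backing file, with write-ahead-log transactions enabled.
        DB db = DBMaker.fileDB("store.db").transactionEnable().make();

        BTreeMap<String, String> map = db
                .treeMap("docs", Serializer.STRING, Serializer.STRING)
                .createOrOpen();

        map.put("user:1:name", "Alice");
        map.put("user:1:email", "alice@example.org");
        map.put("user:2:name", "Bob");

        // Sorted key subset: everything under the "user:1:" prefix.
        ConcurrentNavigableMap<String, String> user1 = map.subMap("user:1:", "user:1:\uffff");
        user1.forEach((k, v) -> System.out.println(k + " = " + v));

        db.commit();
        db.close();
    }
}
```

Because the tree map is a NavigableMap, the subMap/headMap/tailMap views avoid the O(m*n) scan you were worried about.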
I think SQLite is exactly what you want: free (public domain), single-file database, zero-configuration, small footprint, fast, cross-platform, etc. Here is a list of wrappers; there is a section for Java. Take a look at sqlite4java and read more on Java + SQLite here.
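If you go the SQLite route, iterating a key subset is just an indexed primary-key range query. A minimal sketch, using the widely used xerial sqlite-jdbc driver rather than sqlite4java (an assumption on my part), with a made-up file name and key scheme:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

public class SqliteKeyValue {
    public static void main(String[] args) throws Exception {
        // Single-file database; created on first use.
        try (Connection conn = DriverManager.getConnection("jdbc:sqlite:store.db")) {
            try (Statement st = conn.createStatement()) {
                st.execute("CREATE TABLE IF NOT EXISTS kv (k TEXT PRIMARY KEY, v TEXT)");
                st.execute("INSERT OR REPLACE INTO kv VALUES ('user:1:name', 'Alice')");
                st.execute("INSERT OR REPLACE INTO kv VALUES ('user:1:email', 'alice@example.org')");
                st.execute("INSERT OR REPLACE INTO kv VALUES ('user:2:name', 'Bob')");
            }

            // Iterate a key subset: the primary-key index makes this a range scan,
            // not the O(m*n) search the question is worried about.
            try (PreparedStatement ps = conn.prepareStatement(
                    "SELECT k, v FROM kv WHERE k >= ? AND k < ? ORDER BY k")) {
                ps.setString(1, "user:1:");
                ps.setString(2, "user:1;"); // ';' sorts just after ':'
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        System.out.println(rs.getString("k") + " = " + rs.getString("v"));
                    }
                }
            }
        }
    }
}
```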
It won't be a single file, but if you want an embedded database, I suggest Java DB (a rebranded version of Apache Derby, which I used in a previous job with wonderful results).
Plus, both are completely free.
Edit: reading the other comments, another note: Java DB/Derby is 100% Java.
--- Edited after seeing the size of the file ---
50 to 60 GiB files! At that size you would have to be sure that your DB engine doesn't load all of it into memory at once and is very efficient at handling/scavenging off-loaded data-backing blocks.
I don't know if Cloudscape is up to the task, and I wouldn't be surprised if it wasn't.
--- original post follows ---
Cloudscape often fits the bill. It's a bit more than Berkeley DB, but it gained enough traction to be distributed even with some JDK offerings.
Consider ehcache. I show here a class for wrapping it as a java.util.Map. You can easily store Lists or other data structures as your values, avoiding the O(m*n) issue you are concerned with. ehcache is Apache 2.0-licensed, with a commercial enterprise version available from Terracotta. The open source version will allow you to spill your cache to disk, and if you choose not to evict cache entries, it is effectively a persistent key-value store.
JavaDB aka Derby aka Cloudscape would be a decent choice; it's a pure Java SQL database, and it's included in the JRE, so you don't have to ship it with your code or require users to install it separately.
(It's actually not included in the JRE provided by some Linux package managers, but there it will be a separate package that is trivial to install)
However, Derby has fairly poor performance. An alternative would be H2 - again, a pure Java SQL database that stores a database in a single file, with a ~1MB jar under a redistributable license, but one that is considerably faster and lighter than Derby.
I've happily used H2 for a number of small projects. JBoss liked it enough that they bundled it in AS7. It's trivial to set up, and definitely worth a try.
Persistit is a newer challenger. It's a fast, persistent, transactional Java B+Tree library.
I'm afraid there's no guarantee that it will still be maintained: Akiban, the company supporting Persistit, was recently acquired by FoundationDB, and the latter has not provided any information about its future.
https://github.com/akiban/persistit
I would just like to point out that the storage backend of H2 can also be used as a key-value storage engine if you do not need SQL/JDBC:
http://www.h2database.com/html/mvstore.html
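For completeness, a minimal MVStore sketch (file name, map name, and keys are made up); the whole store lives in a single file and keys come back in sorted order:

```java
import org.h2.mvstore.MVMap;
import org.h2.mvstore.MVStore;

public class MvStoreExample {
    public static void main(String[] args) {
        // Single-file store; created if it does not exist yet.
        MVStore store = MVStore.open("store.mv.db");

        MVMap<String, String> map = store.openMap("docs");
        map.put("user:1:name", "Alice");
        map.put("user:2:name", "Bob");

        // Keys are returned in sorted order.
        for (String key : map.keySet()) {
            System.out.println(key + " = " + map.get(key));
        }

        store.commit();
        store.close();
    }
}
```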
H2 http://www.h2database.com/
It's a full-blown SQL/JDBC database, but it's lightweight and fast
Take a look at LMDBJava, Java bindings to LMDB, the fastest sorted ACID key-value store out there.

DSpace and Physical Separation of Documents

I'm looking for a document repository that supports physical data separation. I was looking into Open Text's LiveLink, but that price range is out of my league. I just started looking into DSpace! It's free and open source and works with PostgreSQL.
I'm really trying to figure out whether DSpace supports physically separating the data. I guess I would create multiple DSpace Communities, and I want the documents for each Community stored on different mounts. Is this possible? Or is there a better way?
Also, I won't be using DSpace's front end. Users will not even know DSpace is storing the docs. I will build my own front end and use Java to talk to DSpace. Does anyone know of good Java API wrappers for DSpace?
This answer is a pretty long time after you asked the question, so not sure if it's of any use to you.
DSpace currently does not support storing different parts of the data on different partitions, although your asset store can be located on its own partition, separate from the database and the application. You may get what you need from the new DuraSpace application, which deals with synchronising your data to the cloud.
In terms of Java APIs, DSpace supports SWORD 1.3 which is a deposit-only protocol, and a lightweight implementation of WebDAV with SOAP. Other than that, it's somewhat lacking. If you are looking for a real back-end repository with good web services you might look at Fedora which is a pure back end, with no native UI, and probably more suited to your needs. DSpace and Fedora are both part of the DuraSpace organisation so you can probably benefit from that also. I'm not knowledgeable enough about the way that Fedora stores data to say whether you can physically separate the storage, though.
Hope that helps.
