I am pretty new to ES. I have been searching for a DB migration tool for a long time and could not find one. I am wondering if anyone could point me in the right direction.
I will be using Elasticsearch as the primary datastore in my project. I would like to version all mapping and configuration changes / data import / data upgrade scripts that I run as I develop new modules in my project.
In the past I used database versioning tools like Flyway or Liquibase.
Are there any frameworks / scripts or methods I could use with ES to achieve something similar?
Does anyone have any experience doing this by hand with scripts, running migration scripts (or at least upgrade scripts)?
Thanks in advance!
From this point of view / for this need, ES has some significant limitations:
despite having dynamic mapping, ES is not schemaless but schema-intensive. Mappings can't be changed when the change conflicts with existing documents (practically, if any document has a non-null value in a field the new mapping affects, this will result in an exception)
documents in ES are immutable: once you've indexed one, you can only retrieve or delete it. The syntactic sugar around this is the partial update, which performs a thread-safe delete + index (with the same id) on the ES side (see the sketch below)
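For illustration, a partial update issued over the REST API might look like the following minimal sketch (the index/type/id and the field are made up; the 1.x/2.x-era URL shape is assumed, and Java 11's HttpClient is used only to keep the example dependency-free):

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class PartialUpdateSketch {

    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();

        // Partial update: ES reads the existing source, merges the "doc" part,
        // then deletes the old document and indexes the merged one under the same id.
        String body = "{ \"doc\": { \"title\": \"updated title\" } }";

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:9200/news/article/1/_update")) // hypothetical index/type/id
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();

        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}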
What does that mean in the context of your question? Basically, you can't have classic migration tools for ES. Here's what can make your work with ES easier:
use strict mapping ("dynamic": "strict" and/or index.mapper.dynamic: false; take a look at the mapping docs, and see the sketch after this list). This will:
protect your indexes/types from being accidentally dynamically mapped with the wrong type
give you an explicit error when you miss something in the data-to-mapping relation
you can fetch the actual ES mapping and compare it with your data models. If your programming language has a high-level-enough library for ES, this should be pretty easy
you can leverage index aliases for migrations
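As a concrete illustration of the strict-mapping and alias points above, here is a minimal sketch that creates a versioned index with "dynamic": "strict" and attaches an alias to it in the same request (index, type, alias and field names are made up; plain HTTP is used instead of any particular client library):

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class CreateStrictIndexSketch {

    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();

        // Versioned physical index; the application only ever talks to the "news" alias.
        String index = "news_index_r42_20150101"; // hypothetical revision/date suffix

        // "dynamic": "strict" makes ES reject documents containing unmapped fields
        // instead of silently guessing a type for them.
        String body =
            "{"
          + "  \"mappings\": {"
          + "    \"article\": {"
          + "      \"dynamic\": \"strict\","
          + "      \"properties\": {"
          + "        \"title\":   { \"type\": \"string\" },"
          + "        \"created\": { \"type\": \"date\"   }"
          + "      }"
          + "    }"
          + "  },"
          + "  \"aliases\": { \"news\": {} }"
          + "}";

        HttpRequest createIndex = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:9200/" + index))
                .header("Content-Type", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofString(body))
                .build();

        HttpResponse<String> response =
                client.send(createIndex, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}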
So, a little bit of experience. For me, the currently reasonable flow is this:
All data structures are described as models in code. These models actually provide an ORM abstraction too.
Index/mapping creation is a simple method call on the model.
Every index has an alias (e.g. news) which points to the actual index (e.g. news_index_{revision}_{date_created}).
Every time the code is deployed, you:
Try to put the model (type) mapping. If this succeeds without error, it means that either:
you've put the same mapping,
you've put a mapping that is a pure superset of the old one (only new fields were added, the old ones stay untouched), or
no documents have values in the fields affected by the new mapping.
All of this means that you're good to go with the mapping/data you have; just work with the data as always.
If ES raises an exception about the new mapping, you:
create a new index/type with the new mapping (named like name_{revision}_{date})
redirect your alias to the new index
fire up migration code that makes bulk requests for fast reindexing
During this reindexing you can safely index new documents through the alias as usual. The drawback is that historical data is only partially available while the reindexing runs.
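A sketch of the deploy-time steps described above (try to put the mapping; on conflict, create a new versioned index and repoint the alias atomically). All names are made up and error handling is reduced to a status-code check:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class AliasSwapSketch {

    private static final HttpClient CLIENT = HttpClient.newHttpClient();

    public static void main(String[] args) throws Exception {
        String alias = "news";
        String oldIndex = "news_index_r42_20150101"; // hypothetical
        String newIndex = "news_index_r43_20150301"; // hypothetical

        // 1. Try to put the new mapping on the existing index.
        int status = send("PUT", "/" + oldIndex + "/_mapping/article",
                "{ \"article\": { \"properties\": { \"rating\": { \"type\": \"integer\" } } } }");

        if (status >= 200 && status < 300) {
            return; // mapping accepted: same mapping, pure superset, or no conflicting documents
        }

        // 2. Conflict: create a new versioned index (mapping body omitted here for brevity) ...
        send("PUT", "/" + newIndex, "{}");

        // 3. ... and atomically repoint the alias, so readers/writers never see a gap.
        send("POST", "/_aliases",
                "{ \"actions\": ["
              + "  { \"remove\": { \"index\": \"" + oldIndex + "\", \"alias\": \"" + alias + "\" } },"
              + "  { \"add\":    { \"index\": \"" + newIndex + "\", \"alias\": \"" + alias + "\" } }"
              + "] }");

        // 4. Reindexing old documents into newIndex via bulk requests would start here.
    }

    private static int send(String method, String path, String body) throws Exception {
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:9200" + path))
                .header("Content-Type", "application/json")
                .method(method, HttpRequest.BodyPublishers.ofString(body))
                .build();
        return CLIENT.send(request, HttpResponse.BodyHandlers.ofString()).statusCode();
    }
}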
This is a production-tested solution. Caveats of this approach:
you cannot do this if your read requests require consistent historical data
you're required to reindex the whole index. If you have one type per index (a viable solution), that's fine, but sometimes you need multi-type indexes
the data makes a network round trip, which can be painful sometimes
To sum this up:
try to have a good abstraction in your models; this always helps
try to keep historical data/fields stale. Just build your code with this idea in mind; it's easier than it sounds at first
I strongly recommend avoiding migration tools that rely on ES's experimental features. Those can change at any time, as the river-* tools did.
Related
Recently I came across a schema model like this.
The structure looks exactly the same; I have just renamed the entities to generic names like Table (*).
Starting from table C, all the tables from C to L have close to 200 columns.
The reason for posting this is that I have never come across a structure like this before. If anyone has already experienced something like this, or has worked with something similar or more complex, please share your thoughts.
Is having a structure like this good or bad, and why?
Assume we need an API to save data for a table structure like this:
How should we design the API?
How are we going to manage transactions across all these tables?
In the service code, there are a few cases where we might need to get data from these tables and transfer it to an external system.
The catch here is that the external system accepts the request in a flattened structure, not in the hierarchy we have as mentioned above. If this data needs to be transferred to the external system, how can we manage the marshalling and unmarshalling?
Last but not least, the API that is going to manage data like this will be consumed at least 2K times a day.
What are your thoughts on this? I don't know exactly why we need it; it needs a detailed discussion and we need to break things up.
If I consider Spring Data JPA and Hibernate, what are all the things I need to consider?
More importantly, the row values in all these tables will be limited based on the ownerId/tenantId, so the data needs to be consistent across all the tables.
I cannot comment on the general aspect of the structure, as that is pretty domain-specific, and one would need to know why this structure was chosen to be able to say whether it's good or not. Either way, you probably can't change it anyway, so why bother asking whether it's good or not?
Having said that, with such a model there are a few aspects that you should consider:
When updating data, it is pretty important to update only the columns that really changed, to avoid index trashing and to allow the DB to use spare storage in pages. This is a performance concern that usually comes up when using Hibernate with such models, as Hibernate by default updates all "updatable" columns, not just the dirty ones. There is an option to do dynamic updates, though (see the sketch after these two points). Without dynamic updates, you might produce a few more IOs per update and thus hold locks for a longer time, which affects the overall scalability.
When reading data, it is very important not to use join fetching by default as that might result in a result set size explosion.
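To make these two points concrete, a minimal sketch of what this can look like on a Hibernate entity, assuming classic javax.persistence annotations and Hibernate's @DynamicUpdate (entity and column names are made up):

import java.util.ArrayList;
import java.util.List;

import javax.persistence.Entity;
import javax.persistence.FetchType;
import javax.persistence.Id;
import javax.persistence.ManyToOne;
import javax.persistence.OneToMany;

import org.hibernate.annotations.DynamicUpdate;

// @DynamicUpdate tells Hibernate to include only the dirty columns in the
// generated UPDATE statement instead of all updatable columns.
@Entity
@DynamicUpdate
public class TableC {

    @Id
    private Long id;

    private Long ownerId;

    // ... close to 200 further columns in the real model ...

    // Keep wide associations lazy and fetch them explicitly (e.g. with a
    // targeted "join fetch" query) only where a use case really needs them,
    // to avoid result-set explosion from joining several ~200-column tables.
    @OneToMany(mappedBy = "parent", fetch = FetchType.LAZY)
    private List<TableD> children = new ArrayList<>();
}

@Entity
class TableD {

    @Id
    private Long id;

    @ManyToOne(fetch = FetchType.LAZY)
    private TableC parent;
}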
We are migrating a whole application, originally developed in Oracle Forms a few years back, to a Java (7) web-based application with Hibernate (4.2.7.Final) and Hibernate Search (4.1.1.Final).
One of the requirements is that while users are using the new migrated version, they will still be able to use the Oracle Forms version, so the Hibernate Search indexes will get out of sync. Is it feasible to implement a servlet so that some PL/SQL calls a link that updates the local indexes on the application server (AS)?
I thought of implementing some sort of clustering mechanism for Hibernate, but as I read through the documentation I realised that while clustering may be a good option for scalability and performance, it may be overkill for keeping legacy data in sync.
Does anyone have any idea how to implement a service, accessible via a servlet, that updates the local AS indexes for a given model entity with a given ID?
I don't know exactly what you mean by the clustering part, but anyway:
It seems like you are facing a problem similar to mine. I am currently working on a Hibernate-Search adaptation for JPA providers (other than Hibernate ORM, meaning EclipseLink, TopLink, etc.), and I am working on an automatic reindexing feature at the moment. Since JPA doesn't have an event system suitable for reindexing with Hibernate-Search, I came up with the idea of using triggers at the database level to keep track of everything.
For a basic OneToOne relationship it's pretty straightforward, while for things like relation tables or anything that is not stored in the main table of an entity it gets a bit trickier; but once you have a system for OneToOne relationships, it's not that hard to get to the next step. Okay, let's start:
Imagine two entities: Place and Sorcerer in the Lord of the Rings universe. To keep things simple, let's say they are in a (quite restrictive :D) 1:1 relationship with each other. Normally you end up with two tables named SORCERER and PLACE.
Now you have to create 3 triggers (one for CREATE, one for DELETE and one for UPDATE) on each table (SORCERER and PLACE) that store information about which entity has changed (only the id; for mapping tables there are always multiple ids) and how (CREATE, UPDATE, DELETE) into special UPDATE tables. Let's call these PLACE_UPDATES and SORCERER_UPDATES.
In addition to the ID of the original object that has changed and the event type, these tables need an ID field that is UNIQUE across all UPDATE tables. This is needed because, if you want to feed information from the update tables into the Hibernate-Search index, you have to make sure the events are applied in the right order or you will break your index. How such a UNIQUE ID can be created in your database should be easy to find on the internet/Stack Overflow.
Okay. Now that you have set up the triggers correctly, you just have to find a way to access all the UPDATES tables in a feasible fashion (I do this by querying multiple tables at once, sorting each query by our UNIQUE id field, and then comparing the first result of each query with the others) and then update the index.
This can be a bit tricky, and you have to find the correct way of dealing with each specific update event, but it can be done (that's what I am currently working on).
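As a rough, hypothetical sketch of the polling side (table and column names such as PLACE_UPDATES, UPDATE_ID and EVENT_TYPE are made up, and an in-memory H2 database stands in for the real data source so the snippet runs standalone):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class UpdateTablePollerSketch {

    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection("jdbc:h2:mem:demo");
             Statement stmt = conn.createStatement()) {

            // Create the (hypothetical) update tables so the sketch runs standalone;
            // in reality the triggers fill these tables in the application's database.
            for (String table : new String[] { "PLACE_UPDATES", "SORCERER_UPDATES" }) {
                stmt.execute("CREATE TABLE " + table
                        + " (UPDATE_ID BIGINT, ENTITY_ID BIGINT, EVENT_TYPE VARCHAR(10))");
            }

            // Read each update table ordered by the globally unique id. A real
            // implementation would merge the streams by always taking the row with
            // the smallest UPDATE_ID across all tables, so events replay in order.
            for (String table : new String[] { "PLACE_UPDATES", "SORCERER_UPDATES" }) {
                try (ResultSet rs = stmt.executeQuery(
                        "SELECT UPDATE_ID, ENTITY_ID, EVENT_TYPE FROM " + table
                      + " ORDER BY UPDATE_ID")) {
                    while (rs.next()) {
                        long updateId = rs.getLong("UPDATE_ID");
                        long entityId = rs.getLong("ENTITY_ID");
                        String event  = rs.getString("EVENT_TYPE"); // CREATE / UPDATE / DELETE
                        // Here the entity would be loaded, the Hibernate-Search index
                        // updated accordingly, and the processed row deleted.
                        System.out.printf("%s #%d: %s for entity %d%n",
                                table, updateId, event, entityId);
                    }
                }
            }
        }
    }
}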
If you're interested in that part, you can find it here:
https://github.com/Hotware/Hibernate-Search-JPA/blob/master/hibernate-search-db/src/main/java/com/github/hotware/hsearch/db/events/IndexUpdater.java
The link to the whole project is:
https://github.com/Hotware/Hibernate-Search-JPA/
This uses Hibernate-Search 5.0.0.
I hope this was of help (at least a little bit).
And about your remote indexing problem:
The update tables can easily be used as some kind of dump for events until you send them to the remote machine that is to be updated.
I've been using H2 for the functional tests of a MySQL-based application with Hibernate. I finally got fed up with it and decided to use jOOQ, mostly so I could still abstract myself from the underlying database.
My problem is that I don't like the code generation thing jOOQ does at all, since I have yet to see an example with it properly set up in multiple profiles, and I also don't like connecting to the database as part of my build. Overall it's quite a nasty set-up that I don't want to spend a morning on only to realise it's horrible, and I don't want it in the project.
I'm using tableByName() and fieldByName() instead, which I thought was a good solution, but I'm running into problems with H2 putting everything in upper case.
If I do something like Query deleteInclusiveQuery = jooqContext.delete(tableByName("inclusive_test"))... I get table inclusive_test not found. Note this has nothing to do with the connection delay or closing configuration.
I tried changing the connection to use ;DATABASE_TO_UPPER=false, but then I get field not found (I thought it would apply to the whole schema).
I'm not sure whether H2 is unable to create non-upper-cased schemas or whether I'm failing at that. If it's the former, I'd expect jOOQ to also upper-case the table and field names in the query.
Example output is:
delete from "inclusive_test" where "segment_id" in (select "id" from "segment" where "external_taxonomy_id" = 1)
which would be correct if the H2 schema had not been created like this. However, the query I create the schema with specifically puts it in lower case, yet in the end it ends up upper-cased, which Hibernate seems to understand or work around, but jOOQ does not.
Anyway, I'm asking if there is a solution because I'm quite disappointed at the moment and I'm considering just dropping the tests where I can't use Hibernate.
Any solution that does not use the code generation feature is welcome.
My problem is that I don't like the code generation thing jOOQ does at all, since I have yet to see an example with it properly set up in multiple profiles, and I also don't like connecting to the database as part of my build. Overall it's quite a nasty set-up that I don't want to spend a morning on only to realise it's horrible, and I don't want it in the project.
You're missing out on a ton of awesome jOOQ features if you're going this way. See this very interesting discussion about why having a DB connection in the build isn't that bad:
https://groups.google.com/d/msg/jooq-user/kQO757qJPbE/UszW4aUODdQJ
In any case, don't get frustrated too quickly. There are a couple of reasons why things have been done the way they are. DSL.fieldByName() creates a case-sensitive column. If you provide a lower-case "inclusive_test" column, then jOOQ will render the name with quotes and in lower case, by default.
You have several options:
Consistently name your MySQL and H2 tables / columns, explicitly specifying the case. E.g. `inclusive_test` in MySQL and "inclusive_test" in H2.
Use jOOQ's Settings to override the rendering behaviour. As I said, by default, jOOQ renders everything with quotes. You can override this by specifying RenderNameStyle.AS_IS (see the sketch after this list).
Use DSL.field() instead of DSL.fieldByName(). It will allow you to keep full control of your SQL string.
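For options 2 and 3, a small sketch, assuming jOOQ 3.x (where tableByName()/fieldByName() and RenderNameStyle are available) and the H2 driver on the classpath; it only renders SQL, nothing is executed:

import java.sql.Connection;
import java.sql.DriverManager;

import org.jooq.DSLContext;
import org.jooq.SQLDialect;
import org.jooq.conf.RenderNameStyle;
import org.jooq.conf.Settings;

import static org.jooq.impl.DSL.*;

public class JooqNamingSketch {

    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection("jdbc:h2:mem:demo")) {

            // Option 2: render identifiers exactly as given, without quotes, so H2
            // applies its default upper-casing just like it did at DDL time.
            Settings settings = new Settings().withRenderNameStyle(RenderNameStyle.AS_IS);
            DSLContext ctx = using(conn, SQLDialect.H2, settings);

            System.out.println(
                ctx.delete(tableByName("inclusive_test"))
                   .where(fieldByName("inclusive_test", "segment_id").in(
                           select(fieldByName("segment", "id"))
                           .from(tableByName("segment"))
                           .where(fieldByName("segment", "external_taxonomy_id").eq(1))))
                   .getSQL());

            // Option 3: plain SQL building blocks keep you in full control of the string.
            System.out.println(
                ctx.delete(table("inclusive_test"))
                   .where(field("segment_id").in(
                           select(field("id"))
                           .from(table("segment"))
                           .where(field("external_taxonomy_id").eq(1))))
                   .getSQL());
        }
    }
}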
By the way, I think we'll change the manual to suggest using DSL.field() instead of DSL.fieldByName() to new users. This whole case-sensitivity topic has been causing too many issues in the past. This will be done with Issue #3218.
I have an interesting scenario that is, I believe, an excellent application for an in-memory database (IMDB) such as H2 and, possibly, jOOQ. However, there are some interesting challenges and questions that arise.
We’ve developed a specialized, Java-based ETL platform for insurance data conversion that is now in its fourth generation. Without going into unnecessary detail, we routinely extract data from source systems such as SQL Server, DB2, etc. that are normalized to varying degrees. Insurance data conversion has two characteristics that are highly relevant here:
We typically convert one insurance entity (e.g. policy, application, claim) at a time (unless it's part of a package or other transactional grouping, in which case we might be converting a few entities at a time). Importantly, therefore, a given conversion transaction seldom involves even 1 MB of data at a time. Indeed, a typical transaction involves less than 50 KB of data, minuscule by any modern measure.
Because source and target systems can differ so dramatically in their schemas, granularity, and even underlying semantics, the transformations can be very complex. In terms of source processing, the queries are numerous and complex, frequently joining many tables, using subqueries, etc. Given this fact, obtaining reasonable performance means saving the query results in some fashion. Until now, we’ve relied on a proprietary approach involving “insurance maps,” which are specialized Java maps. We knew this approach was ultimately insufficient, but it served our needs initially.
Now that I have some time to reflect, I'm thinking about a long-term approach. If we just consider the basic characteristics above, it would seem that an IMDB like H2 would be perfect:
Execute all the complex queries against the source database (e.g. SQL Server) up front, creating tables and performing inserts/updates, in order to create an IMDB representation of all the data that pertains to a single conversion transaction (e.g. a single insurance policy). BTW, I could see how jOOQ could be really helpful here (and elsewhere) for simplifying and increasing the type safety of these queries.
Execute all the complex transformation queries against the IMDB. Again, jOOQ might have significant benefits.
Discard and recreate the IMDB for each insurance conversion transaction.
One of the things that I love about this approach (at least with H2) is the ability to encapsulate queries in Java-based stored procedures—much better than writing T-SQL stored procs. And would it again make things even easier/safer to use jOOQ against the IMDB instead of, for example, the native H2 stored proc API?
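For readers who haven't used H2's Java-based stored procedures: a minimal sketch of how one is written and registered via CREATE ALIAS (the class, alias, and table names are invented for illustration, not taken from the platform described above):

package com.example; // hypothetical package

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class H2StoredProcSketch {

    // H2 passes the current connection automatically when the first
    // parameter of the aliased method is a java.sql.Connection.
    public static int coverageCount(Connection conn, String policyId) throws SQLException {
        try (PreparedStatement ps =
                 conn.prepareStatement("SELECT COUNT(*) FROM COVERAGE WHERE POLICY_ID = ?")) {
            ps.setString(1, policyId);
            try (ResultSet rs = ps.executeQuery()) {
                rs.next();
                return rs.getInt(1);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        // In-memory database; DB_CLOSE_DELAY keeps it alive until the JVM exits.
        try (Connection conn = DriverManager.getConnection("jdbc:h2:mem:conversion;DB_CLOSE_DELAY=-1");
             Statement stmt = conn.createStatement()) {

            stmt.execute("CREATE TABLE COVERAGE (POLICY_ID VARCHAR(20), CODE VARCHAR(10))");
            stmt.execute("INSERT INTO COVERAGE VALUES ('P-1', 'FIRE'), ('P-1', 'THEFT')");

            // Register the Java method as a stored-procedure-like SQL alias.
            stmt.execute("CREATE ALIAS COVERAGE_COUNT FOR \"com.example.H2StoredProcSketch.coverageCount\"");

            try (ResultSet rs = stmt.executeQuery("SELECT COVERAGE_COUNT('P-1')")) {
                rs.next();
                System.out.println(rs.getInt(1)); // prints 2
            }
        }
    }
}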
However, I have two concerns:
Serialization--This is actually a distributed platform (I've simplified my description above for discussion purposes), and we make fairly heavy use of services and message queuing to pass/queue data. This all works wonderfully when we're working with XML data sources, which is frequently the case. How well will this work with an IMDB? For a given insurance transaction IMDB, we must be able to a) serialize the IMDB, b) transmit and/or queue the IMDB and, finally, c) deserialize the data back into a fully functioning IMDB for conversion processing. It appears that the best way to do this with H2, for example, is to use the SQL SCRIPT command to serialize the data, and then run the script to deserialize the data (see the sketch after these two points). I'm wondering about the performance characteristics of this approach. I don't consider our platform to be particularly performance-sensitive, but I do want to avoid an approach that is particularly sluggish or architecturally awkward.
Target loading—This discussion has focused on source-side database processing because we frequently generate XML on the target side (we have mature subsystems for this purpose). Sometimes, however, we need to directly address databases on the target side as well. In this case, we must be able to directly insert/update against mainstream relational databases in accordance with the converted data. The approach I'm contemplating again uses an IMDB, but on the target side. The transformed data populates an IMDB with the same schema as the actual target database. Then, this target IMDB could be serialized and transmitted as needed. Finally, the contents of the target IMDB would be used to insert/update against the actual target database (which, of course, could have many gigabytes of data). What would be tremendous (but I'm not optimistic) is if I could use a simple SQL SCRIPT statement against the IMDB to generate a script containing INSERT/UPDATE statements that I could then simply run against the target database. I suspect it won't be that easy. In any event, does this general approach to target loading seem reasonable?
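Regarding the serialization concern above, a small sketch of the SCRIPT/RUNSCRIPT round trip with H2 (table and file names are made up). Note that SCRIPT TO emits CREATE and INSERT statements only, so it covers serialization well but would not by itself produce the UPDATE/merge logic needed for the target-loading case:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class ScriptRoundTripSketch {

    public static void main(String[] args) throws Exception {
        // Source IMDB for one conversion transaction (names are made up).
        try (Connection src = DriverManager.getConnection("jdbc:h2:mem:txn1;DB_CLOSE_DELAY=-1");
             Statement stmt = src.createStatement()) {
            stmt.execute("CREATE TABLE POLICY (ID VARCHAR(20) PRIMARY KEY, HOLDER VARCHAR(100))");
            stmt.execute("INSERT INTO POLICY VALUES ('P-1', 'Jane Doe')");

            // Serialize: SCRIPT TO writes DDL plus INSERT statements to a plain SQL file,
            // which can then be queued or transmitted like any other payload.
            stmt.execute("SCRIPT TO 'txn1.sql'");
        }

        // Deserialize on the receiving side: replay the script into a fresh IMDB.
        try (Connection dst = DriverManager.getConnection("jdbc:h2:mem:txn1_copy;DB_CLOSE_DELAY=-1");
             Statement stmt = dst.createStatement()) {
            stmt.execute("RUNSCRIPT FROM 'txn1.sql'");
        }
    }
}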
I apologize for the length of this post, but this is a critically important question for our team. Thank you so much, in advance, for your responses.
A bit off topic... One thing to remember is that H2 is a non-distributed database and thus a rather primitive solution at best. Essentially, it is a whatever-fits-in-the-heap-of-a-single-JVM database. There are better approaches unless you are talking about an absolutely simplistic use case (which I don't think you are).
GridGain's In-Memory Database, for example, uses H2 for its SQL processing internally (with all its benefits) but also provides full distribution for SQL as well as a host of other features. There are other distributed in-memory databases, and even some sophisticated data grids, that can fit your use case.
Just my 2 cents here.
I'm integrating search functionality into a desktop application and I'm using vanilla Lucene to do so. The application handles (potentially) thousands of POJOs, each with its own set of key/value(s) properties. When mapping models between my application and Lucene, I originally thought of assigning each POJO a Document and adding the properties as Fields. This approach works great as far as indexing and searching go, but the main downside is that whenever a POJO changes its properties I have to reindex ALL the properties again, even the ones that didn't change, in order to update the index. I have been thinking of changing my approach and instead creating a Document per property and assigning the same id to all the Documents from the same POJO. This way, when a POJO property changes, I only update its corresponding Document without reindexing all the other unchanged properties. I think the graph DB Neo4j follows a similar approach when it comes to indexing, but I'm not completely sure. Could anyone comment on the possible impact on performance, querying, etc.?
It depends fundamentally on what you want to return as a Document in a search result.
But indexing is pretty cheap. Does a changed POJO really have so many properties that reindexing them all is a major problem?
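For what it's worth, if you keep one Document per POJO, the whole-document re-index mentioned above is a single call; a minimal sketch (the POJO id and field handling are made up, Lucene 4+ field types assumed):

import java.io.IOException;
import java.util.Map;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

public class PojoIndexerSketch {

    // Re-index one POJO: build the full Document and atomically replace the
    // previous one carrying the same id term (delete + add in a single call).
    public static void reindex(IndexWriter writer, String pojoId,
                               Map<String, String> properties) throws IOException {
        Document doc = new Document();
        doc.add(new StringField("id", pojoId, Store.YES)); // exact-match key field
        for (Map.Entry<String, String> property : properties.entrySet()) {
            doc.add(new TextField(property.getKey(), property.getValue(), Store.YES));
        }
        writer.updateDocument(new Term("id", pojoId), doc);
    }
}

Updating one property then just means rebuilding this one Document from the current POJO state, which for a handful of fields is usually simpler than keeping many per-property Documents consistent under the same id.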
If you only search one field in every search request, splitting one POJO into several documents will speed up reindexing. But it will cause another problem if you search on multiple fields: a POJO may appear many times in the results.
Actually, I agree with EJP: building an index is very fast on a small dataset.