Multilingual fields in DB tables - java

I have an application that needs to support a multilingual interface, five languages to be exact. For the main part of the interface the standard ResourceBundle approach can be used to handle this.
However, the database contains numerous tables whose elements contain human readable names, descriptions, abstracts etc. It needs to be possible to enter each of these in all five languages.
While I suppose I could simply have fields on each table like
NameLang1
NameLang2
...
I feel that leads to a significant amount of largely identical code when writing the beans that represent each table.
From a purely object-oriented point of view, however, the solution is simple. Each class simply has a Text object that contains the relevant text in each of the languages. This is further helpful in that only one of the languages is mandatory; the others have fallback rules (e.g. if language 4 is missing, return language 2, which falls back to language 1, which is mandatory).
Unfortunately, mapping this back to a relational database, means that I wind up with a single table that some 10-12 other tables FK to (some tables have more than one FK to it in fact).
This approach seems to work and I've been able to map the data to POJOs with Hibernate. About the only thing you can't do is map from a Text object back to its parent (since you have no way of knowing which table you should link to), but then there is hardly any need to do that.
So, overall this seems to work but it just feels wrong to have multiple tables reference one table like this. Anyone got a better idea?
If it matters I'm using MySQL...
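For illustration, here is a minimal Hibernate/JPA sketch of the shared text table described above; all class, table and column names are made up for the example, and the fallback is simplified to "walk down to the next available language":

import javax.persistence.*;

// Hypothetical shared text table; every localisable column on other tables points here.
@Entity
@Table(name = "LocalizedText")
class LocalizedText {
    @Id @GeneratedValue
    private Long id;

    private String lang1; // mandatory
    private String lang2;
    private String lang3;
    private String lang4;
    private String lang5;

    // Simplified sequential fallback: requested language, else the next lower one, ending at lang1.
    public String get(int language) {
        String[] values = { lang1, lang2, lang3, lang4, lang5 };
        for (int i = Math.min(language, values.length) - 1; i > 0; i--) {
            if (values[i] != null) return values[i];
        }
        return lang1;
    }
}

// One of the 10-12 tables that reference it; some have more than one such FK.
@Entity
class Product {
    @Id @GeneratedValue
    private Long id;

    @ManyToOne(cascade = CascadeType.ALL)
    private LocalizedText name;

    @ManyToOne(cascade = CascadeType.ALL)
    private LocalizedText description;
}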

I had to do that once... multilingual text for some tables... I don't know if I found the best solution but what I did was have the table with the language-agnostic info and then a child table with all the multilingual fields. At least one record was required in the child table, for the default language; more languages could be added later.
With Hibernate you can map the info from the child tables as a Map and get the info for the language you want, implementing the fallback on your POJO as you said. You can have different getters for the multilingual fields that internally call the fallback method to get the appropriate child object for the needed language and then just return the required field.
This approach uses more tables (one extra table for every table that needs multilingual info) but the performance is much better, and so is the maintenance, I think...
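For illustration, a minimal JPA sketch of this parent/child layout, with the multilingual child rows exposed as a Map keyed by language code (entity and column names are assumptions for the example):

import java.util.Map;
import javax.persistence.*;

// Hypothetical parent entity: one row of language-agnostic data per book.
@Entity
class Book {
    @Id @GeneratedValue
    private Long id;

    private String isbn; // language-agnostic info

    // one BookText row per language, keyed by its languageCode property
    @OneToMany(mappedBy = "book", cascade = CascadeType.ALL)
    @MapKey(name = "languageCode")
    private Map<String, BookText> texts;

    private static final String DEFAULT_LANGUAGE = "en"; // the mandatory language

    // Fallback getter: requested language first, then the mandatory default.
    public String getTitle(String languageCode) {
        BookText text = texts.get(languageCode);
        if (text == null) {
            text = texts.get(DEFAULT_LANGUAGE);
        }
        return text != null ? text.getTitle() : null;
    }
}

// Child table holding all the multilingual fields.
@Entity
class BookText {
    @Id @GeneratedValue
    private Long id;

    @ManyToOne(optional = false)
    private Book book;

    private String languageCode;
    private String title;
    private String description;

    public String getTitle() { return title; }
}

The fallback getter then only has to consult the map; the child rows are loaded lazily by default.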

The standard translation approach as used, for example, in gettext is to use a single string to describe the concept and make a call to a translate method which translates to the destination language.
This way you only need to store a single string in the database (the canonical representation) and then make a call in your application to the translate method to get the translated string. No FKs and total flexibility, at the cost of a little runtime performance (and maybe a bit more maintenance trouble, but with some thought there's no need to make maintenance a problem in this scenario).
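A minimal sketch of that idea in Java, using a ResourceBundle as the translation source (the bundle name "messages" is just an assumption for the example):

import java.util.Locale;
import java.util.ResourceBundle;

// The database stores only the canonical string; the application translates at display time.
public final class Translator {

    public static String translate(String canonical, Locale locale) {
        ResourceBundle bundle = ResourceBundle.getBundle("messages", locale);
        // fall back to the canonical text if no translation is defined
        return bundle.containsKey(canonical) ? bundle.getString(canonical) : canonical;
    }
}

// Usage: the column value itself acts as the key.
// String label = Translator.translate(row.getName(), Locale.GERMAN);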

The approach I've seen in an application with a similar problem is that we use a "text id" column to store a reference, and we have a single table with all the translations. This provides some flexibility also in reusing the same keys to reduce the amount of required translations, which is an expensive part of the project.
It also provides a good separation between the data and the translations, which in my opinion are more of a UI thing.
If it is the case that the strings you require are not that many after all, then you can just load them all in memory once and use some method to provide translations by checking a data structure in memory.
With this approach, your beans won't have getters for each language, but you would use some other translator object:
MyTranslator.translate(myBean.getNameTextId());
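A rough sketch of what such a translator could look like if, as suggested above, all translations are loaded into memory once (the keying scheme here is an assumption):

import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical "text id" approach: all translations live in one table,
// are loaded into memory once, and beans only carry the text id.
public class MyTranslator {

    // (textId, languageCode) -> translated string, loaded once from the translation table
    private static final Map<String, String> TRANSLATIONS = new HashMap<>();

    public static String translate(String textId) {
        return translate(textId, Locale.getDefault().getLanguage());
    }

    public static String translate(String textId, String languageCode) {
        String value = TRANSLATIONS.get(textId + ":" + languageCode);
        // fall back to the default language, then to the raw id so missing keys are visible
        if (value == null) {
            value = TRANSLATIONS.get(textId + ":en");
        }
        return value != null ? value : textId;
    }
}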

Depending on your requirements, it may be best to have a separate label table for each table which needs to be multilingual. E.g. you have an XYZ table with an xyz_id column, and an XYZ_Label table with xyz_id, language_code, label, other_label, etc.
The advantage of this, over having a single huge labels table, is that you can put unique constraints on the XYZ_Label table (e.g. the English name for XYZ must be unique), and you can do indexed lookups much more efficiently, since the index will only be covering a single table at a time (e.g. if you need to look up XYZ entities by English name).
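For illustration, a hypothetical JPA mapping of such a label table, including the per-language unique constraint (names follow the XYZ example above, the rest is assumed):

import javax.persistence.*;

// Language-agnostic side of the pair.
@Entity
@Table(name = "XYZ")
class Xyz {
    @Id @GeneratedValue
    @Column(name = "xyz_id")
    private Long id;
}

// One label row per (xyz_id, language_code); the unique constraint keeps e.g. the English name unique.
@Entity
@Table(name = "XYZ_Label",
       uniqueConstraints = @UniqueConstraint(columnNames = {"language_code", "label"}))
class XyzLabel {

    @Id @GeneratedValue
    private Long id;

    @ManyToOne(optional = false)
    @JoinColumn(name = "xyz_id")
    private Xyz xyz;

    @Column(name = "language_code", nullable = false)
    private String languageCode;

    @Column(nullable = false)
    private String label;

    private String otherLabel;
}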

What about this:
http://rob.purplerockscissors.com/2009/07/24/internationalizing-websites/
...that is what user "Chochos" says in response #2

Related

Adding custom fields in my application

I have a SaaS product, which is built with Spring MVC and Hibernate. Generally SaaS products allow users to customize the product, like adding extra fields to a table. So I want to give users the flexibility to create custom fields in the tables for themselves. Please provide all the viable solutions to achieve it. Thank you so much for your help.
I'm guessing you're trying to back this with a relational database. The primary problem is that relational databases store things in tables, and tables don't really handle free-form data well.
So one solution is to use a flexible document structure like XML (and perhaps ditch the database), but databases have features which are nice, so let's also consider the database-based approaches.
You could create a "custom field" table which would have columns (composite primary key) for
ExtendedTable
ColumnName
but you'd also have to store the data somewhere
(ExtendedKey)
DataItem
And now we get into the really nasty bits. How would you apply constraints to this data? I mean, what would the type be of a DataItem? A general solution would be quite complex (being a type of free form database). Hopefully you could limit the solution to solve only the problems you require solved.
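A hypothetical sketch of that "custom field" table as a Hibernate entity, using the columns listed above as a composite primary key; the type problem shows up immediately in that DataItem can only be stored as text:

import java.io.Serializable;
import java.util.Objects;
import javax.persistence.*;

// Hypothetical entity-attribute-value table; the three columns form the composite primary key.
@Entity
@IdClass(CustomFieldValue.Key.class)
class CustomFieldValue {

    @Id private String extendedTable; // which base table this value extends
    @Id private String columnName;    // the user-defined column name
    @Id private Long extendedKey;     // primary key of the extended row

    // everything is stored as text; any type or constraint checking is up to the application
    private String dataItem;

    public static class Key implements Serializable {
        private String extendedTable;
        private String columnName;
        private Long extendedKey;

        @Override public boolean equals(Object o) {
            if (!(o instanceof Key)) return false;
            Key k = (Key) o;
            return Objects.equals(extendedTable, k.extendedTable)
                    && Objects.equals(columnName, k.columnName)
                    && Objects.equals(extendedKey, k.extendedKey);
        }

        @Override public int hashCode() {
            return Objects.hash(extendedTable, columnName, extendedKey);
        }
    }
}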
Another approach is to use a single "extra" column that contains an XML record which embeds its own "column and value" extensions, but if you wanted to display a table of this data efficiently, you'd have to parse every XML document in every field, which is not ideal.
Neither one of these approaches will work well with the existing SQL query language, so you'll then start building your own query language.
I suggest you go back and look at real data requirements, instead of sweeping them under the table with a "and anything else one might want" set of columns on your table.
Your requirement is a use case best suited to NoSQL databases (like MongoDB).
Dynamically creating relational database tables and columns (modifying schemas) upon user requests in an application is not a best practice, as these involve DDL operations, which are very powerful; if you don't handle them carefully, the whole application's database can end up in an inconsistent state.
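For comparison, a minimal sketch of the document-store route with the MongoDB Java driver (database, collection and field names are made up); user-defined fields become ordinary document keys, with no DDL involved:

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

// Each record simply carries whatever extra fields the user defined; no schema change required.
public class CustomFieldDemo {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> customers =
                    client.getDatabase("saasDemo").getCollection("customers");

            customers.insertOne(new Document("name", "Acme")
                    .append("plan", "enterprise")
                    // user-defined fields are just additional keys on the document
                    .append("customFields", new Document("costCenter", "EU-42")
                            .append("preferredColor", "green")));
        }
    }
}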

Data structure for fast searching of custom object using its attributes (fields) in Java

I have an abstract super class and some sub classes. My question is: what is the best way to keep objects of those classes so I can easily find them using all the different parameters?
For example, if I want to look up by resourceCode (every object has a unique resource code) I can use a HashMap keyed by resourceCode. But what happens if I want to look up by genre - there are many games with the same genre, so I will get all those games. My first idea was an ArrayList of those objects, but isn't it too slow if we have 1 000 000 games (about 1 000 000 operations)?
My other idea is to have a HashTable keyed by product code. The complexity of that lookup is constant. Then I create as many HashSets as I have fields in the classes, and for each field value I get the product code(s) of the objects stored under that field value (for example game promoter). With those unique codes I can get everything I want from the HashTable. Is this a good idea? It seems a lot of space will be needed to store the data, but it will be fast.
So my question is: what data structure should I use so I can quickly find custom objects, searching by their attributes (fields)?
Please see the attachment: Classes Example
Thank you in advance.
Stefan Stefanov
You can use Sorted or Ordered data structures to optimize search complexity.
You can introduce your own search index for custom data.
But it is better to use database or search engine.
Have a look at Elasticsearch, Apache Solr, PostgreSQL
It sounds like most of your fields can be mapped to a string (name, genre, promoter, description, year of release, ...). You could put all these strings in a single large index that maps each keyword to all objects that contain the word in any of their fields. Then if you search for certain keywords it will return a list of all entries that contain that word. For example, searching for 'mine' should return 'minecraft' (because of the title), as well as all Minecraft clones (having 'minecraft-like' as genre) and all games that use the word 'mine' in the 'info text' field.
You can code this yourself, but I suppose some fulltext indexer, such as Lucene, may be useful. I haven't used Lucene myself, but I suppose it would also allow you to search for multiple keywords at once, even if they occur in different fields.
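A minimal sketch of such a keyword index in plain Java, before reaching for Lucene (class and field names are assumptions based on the question):

import java.util.*;

// Hypothetical game class; only the string fields used for searching are shown.
class Game {
    String name, genre, promoter, info;

    List<String> searchableText() {
        return Arrays.asList(name, genre, promoter, info);
    }
}

// Inverted index: every word from every field maps to the set of games containing it.
class GameIndex {

    private final Map<String, Set<Game>> index = new HashMap<>();

    void add(Game game) {
        for (String field : game.searchableText()) {
            if (field == null) continue;
            for (String word : field.toLowerCase().split("\\W+")) {
                if (word.isEmpty()) continue;
                index.computeIfAbsent(word, k -> new HashSet<>()).add(game);
            }
        }
    }

    Set<Game> search(String keyword) {
        return index.getOrDefault(keyword.toLowerCase(), Collections.emptySet());
    }
}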
This is not a very appealing answer.
Start with a database. Maybe an embedded database (like h2database).
A fixed set of development/test data is easy to set up and easy to change (via a database dump).
Too many indices (hash maps) can do more harm than good.
Developing and optimizing queries is easier (declarative) than with hand-rolled data structures.
Database tables are less coupled than data structures with helper structures (maps).
The resulting system is far less complex and scales better.
After development has stabilized the set of queries, you can think about doing away with the DB part. Use at least a two-tier separation of the database and the classes.
Then you might find a stable and best fitting data model.
Should you still intend to do it all with pure objects, then work them out in detail as design documentation before you start programming. Example stories, and how one solves them.

Should I always retrieve full object from a database?

This is a very simple question that applies to programming web interfaces with java. Say, I am not using an ORM (even if I am using one), and let's say I've got this Car (id,name, color, type, blah, blah) entity in my app and I have a CAR table to represent this entity in the database. So, say I have this need to update only a subset of fields on a bunch of cars, I understand that the typical flow would be:
A DAO class (CarDAO) - getCarsForUpdate()
Iterate over all Car objects, update just the color to say green or something.
Another DAO call to updateCars(Cars cars).
Now, isn't this a little beating around the bush for what would be a simple select and update query? In the first step above, I would be retrieving the entire object data from the database: "select id,name,color,type,blah,blah.. where ..from CAR" instead of "select id,color from CAR where ...". So why should I retrieve those extra fields when, after the DAO call, I would never use anything other than "color"? The same applies to step 3. Or, say I query just for the id and color (select id, color) and create a Car object with only id and color populated - that is perfectly OK, isn't it? The Car object is anemic anyway?
Doesn't all this (object oriented-ness) seem a little fake?
For one, I would prefer that if the RDBMS can handle your queries, let it. The reason is that you don't want your JVM to do all the work, especially when running an enterprise application (and you have many concurrent connections needing the same resource).
If you particularly want to update an object (e.g. set the car colour to green) in the database, I would suggest SQL like
UPDATE CAR SET COLOR = 'GREEN';
(Notice I haven't used a WHERE clause.) This updates ALL rows in the CAR table, and I didn't need to pull every Car object, call setColor("Green") and do an update.
In hindsight, what I'm trying to say is: apply engineering knowledge. Your DAO should simply do fast select, update, etc. and let all the SQL "work" be handled by the RDBMS.
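For illustration, a plain JDBC sketch of a DAO method that pushes the whole update to the database in one statement, instead of loading and re-saving Car objects (the WHERE clause and method name are assumptions):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import javax.sql.DataSource;

// Hypothetical DAO method: one targeted UPDATE instead of select-modify-save per Car.
class CarDao {

    private final DataSource dataSource;

    CarDao(DataSource dataSource) {
        this.dataSource = dataSource;
    }

    int recolor(String newColor, String type) throws SQLException {
        String sql = "UPDATE CAR SET COLOR = ? WHERE TYPE = ?";
        try (Connection con = dataSource.getConnection();
             PreparedStatement ps = con.prepareStatement(sql)) {
            ps.setString(1, newColor);
            ps.setString(2, type);
            return ps.executeUpdate(); // number of rows the database changed
        }
    }
}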
From my experience, what I can say is :
As long as you're not doing join operations, i.e. just querying columns from the same table, the number of columns you fetch makes almost no difference to performance. What really affects performance is how many rows you get, and the where clause. Fetching 2 or 20 columns changes so little you won't see any difference.
The same goes for updating.
I think that in certain situations, it is useful to request a subset of the fields of an object. This can be a performance win if you have a large number of columns or if there are some large BLOB columns that would impact performance if they were hydrated. Although the database usually reads in an entire row of information whenever there is a match, it is typical to store BLOB and other large fields in different locations with non-trivial IO requirements.
It might also make sense if you are iterating across a large table and doing some sort of processing. Although the savings might be insignificant on a single row, it might be measurable across a large table.
Also, if you are only using fields that are in indexes, I believe that the row itself will never be read and it will use the fields from the index itself. Not sure in your example if color would be indexed however.
All this said, if you are only persisting objects that are relatively simple, without BLOB or other large database fields, then this could turn into premature optimization, since the query processing, row IO, JDBC overhead, and object creation are most likely going to take a lot more time than hydrating a subset of the fields in the row. Converting database objects into the final Java class is typically a small portion of the load of each query.

What is the best way to iterate and process an entire table from database?

I have a table called Token in my database that represents tokenized texts.
Each row has attributes like textblock, sentence and position (for identifying the text that the token is from) and logical fields like text, category, chartype, etc.
What I want to do is iterate over all tokens to find patterns and do some operations. For example, merging two adjacent tokens that have the category Name into one (and after this, resetting the positions). I think that I will need some kind of list.
What is the best way to do this? With SQL queries to find the patterns, or by iterating over all tokens in the table? I think the queries would get quite complex, and maybe iterating as a list will be simpler, but I don't know which way to go (for example, retrieving into a Java list, or using a language that lets me iterate and make changes directly in the database).
So that this question doesn't get closed: what is the most recommended way to do this? I'm using Java, but if another language is better, no problem; I think I will need to use R to do some statistical calculations.
Edit: The table is large (millions of rows), so loading it entirely into memory is not possible.
If you are working with a small table, or proving out a merge strategy, then just set up a query that finds all of the candidate duplicate lines and dump the relevant columns out to a table. Then view that table in a text editor or spreadsheet to see if your hypothesis about the duplication is correct.
Keep in mind that any time you try to merge two rows into one, you will be deleting data. Worst case is that you might merge ALL of your rows into one. Proceed with caution!
This is an engineering decision to be made, based mostly on the size of the corpus you want to maintain, and the kind of operations you want to perform on them.
If the size gets bigger than "what fits in the editor", you'll need some kind of database. That may or may not be an SQL database. But there is also the code part: if you want to perform non-trivial operations on the data, you might need a real programming language (could be anything: C, Java, Python; anything goes). In that case, the communication with the database will become a bottleneck: you need to generate queries that produce results that fit in the application program's memory. SQL is powerful enough to represent and store N-grams and do some calculations on them, but that is about as far as you are going to get. In any case the database has to be fully normalised, and that will cause it to be more difficult to understand for non-DBAs.
My own toy project, http://sourceforge.net/projects/wakkerbot/ used a hybrid approach:
the data was obtained by a python crawler
the corpus was stored as-is in the database
the actual (modified MegaHal) Markov code stores its own version of the corpus in a (binary) flatfile, containing the dictionary, N-grams, and the associated coefficients.
the training and text generation is done by a highly optimised C program
the output was picked up by another python script, and submitted to the target.
[in another life, I would probably have done some more normalisation, and stored N-grams or trees in the database. That would possibly cause the performance to drop to only a few generated sentences per second. It now is about 4000/sec]
My gut feeling is that what you want is more like a "linguistic workbench" than a program that does exactly one task efficiently (like wakkerbot). In any case you'll need to normalise a bit more: store the tokens as {tokennumber,tokentext} and refer to them only by number. Basically, a text is just a table (or array) containing a bunch of token numbers. An N-gram is just a couple of tokennumbers+the corresponding coefficients.
This is not the most optimized method but it's a design that allows you to write the code easily.
write an entity class that represents a row in your table.
write a factory method that allows you to get the entity object for a given row id, i.e. a method that creates an object of the entity class with the values from the specified row.
write methods that remove and insert a given row object into the table.
write a row counting method.
Now you can try to iterate over your table using your Java code. Remember that if you merge two rows, you need to correctly adjust the next index.
This method lets you use little memory, but you will be issuing a lot of queries to create the row objects.
The concept is very similar or identical to ORM (Object-Relational Mapping). If you know how to use Hibernate or another ORM, then try those libraries.
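A rough sketch of that row-at-a-time approach for the Name-merging example; the Token entity and TokenDao below are assumptions, with each DAO call standing for one query against the Token table:

// One query per call is assumed behind each of these DAO methods.
interface TokenDao {
    long count();
    Token findByPosition(long position);
    void update(Token token);
    void delete(Token token);
    void shiftPositionsAfter(long position); // close the gap left by a deleted row
}

// Hypothetical entity representing one row of the Token table.
class Token {
    long position;
    String text;
    String category;

    String getText() { return text; }
    void setText(String text) { this.text = text; }
    String getCategory() { return category; }
}

class TokenMerger {
    // Merge adjacent tokens whose category is "Name", one pair at a time.
    void mergeAdjacentNames(TokenDao dao) {
        long position = 0;
        while (position < dao.count() - 1) {
            Token current = dao.findByPosition(position);
            Token next = dao.findByPosition(position + 1);
            if ("Name".equals(current.getCategory()) && "Name".equals(next.getCategory())) {
                current.setText(current.getText() + " " + next.getText());
                dao.update(current);
                dao.delete(next);
                dao.shiftPositionsAfter(position); // reset positions after the merge
                // stay on this position: the new neighbour may also be a Name token
            } else {
                position++;
            }
        }
    }
}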
IMO it'd be easier, and likely faster overall, to load everything into Java and do your operations there to avoid continually re-querying the DB.
There are some pretty strong numerical libs for Java and statistics, too; I wouldn't dismiss it out-of-hand until you're sure what you need isn't available (or is too slow).
This sounds like you're designing a text search engine. You should first see if pgsql's full text search engine is right for you.
If you do it without full text search, loading PL/R into pgsql and learning to drive it is likely to be the fastest and most efficient solution. It'll allow you to put all this work into a few well-thought-out lines of R, and do it all in the db where access to the data is closest. The only time to avoid such a plan is when it would make the database server work VERY hard, like holding the dataset in memory and cranking a single CPU core across it. Then it's OK to do it app side.
Whether you use pl/R or not, access large data sets through a cursor; it's by far the most efficient way to get either single rows or smaller subsets of rows. If you do it with a select with a where clause for each thing you want to process, then you don't have to hold all those rows in memory at once. You can grab and discard parts of result sets while doing things like running averages, etc.
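For illustration, a plain JDBC sketch of cursor-style access from Java against PostgreSQL: with auto-commit off and a fetch size set, the driver streams rows through a cursor instead of materialising the whole Token table in memory (column names follow the question):

import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

// Streams the Token table through a server-side cursor instead of loading it all at once.
class TokenScanner {

    void scan(Connection con) throws SQLException {
        con.setAutoCommit(false);      // required for the PostgreSQL driver to use a cursor
        try (Statement st = con.createStatement()) {
            st.setFetchSize(1000);     // rows fetched per round trip
            try (ResultSet rs = st.executeQuery("SELECT * FROM Token")) {
                while (rs.next()) {
                    process(rs.getString("text"), rs.getString("category"));
                }
            }
        }
        con.commit();
    }

    void process(String text, String category) {
        // pattern detection / running statistics go here
    }
}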
Think about scale here. If you had a 5 TB database, how would you access it to do this the fastest? A poor scaling solution will come back to bite you even if it's only accessing 1% of the data set. And if you're already starting on a pretty big dataset today, it'll just get worse with time.
pl/R http://www.joeconway.com/plr/

Should I use composite primary keys or not?

There seems to be only second-class support for composite database keys in Java's JPA (via the EmbeddedId or IdClass annotations). And when I read up on composite keys, regardless of language, people keep coming across as saying they are a bad thing. But I cannot understand why. Are composite keys still acceptable to use these days? If not, why not?
I've found one person who agrees with me:
http://weblogs.sqlteam.com/jeffs/archive/2007/08/23/composite_primary_keys.aspx
But another who doesn't:
http://weblogs.java.net/blog/bleonard/archive/2006/11/using_composite.html
Is it just me, or are people not able to make the distinction of where a composite key is appropriate or not? I see composite primary keys useful when the table doesn't represent an entity - i.e. when it represents a join table.
A simple example:
Actor { Id, Name, Email }
Movie { Id, Name, Year }
Character { Id, Name }
Role { Actor, Movie, Character }
Here Actor, Movie and Character obviously benefit from having an Id column as the primary key.
But Role is a many-to-many join table. I see no point in creating an id just to identify a row in the database. To me it seems obvious that the primary key is { Actor, Movie, Character }. A surrogate id also seems like a rather limiting feature, especially if the data in the join table changes all the time; you could find yourself with primary key collisions once the primary key sequence wraps around to 0.
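For illustration, a minimal JPA sketch of the Role join table with exactly that composite primary key, using @EmbeddedId (everything beyond the question's example is assumed):

import java.io.Serializable;
import java.util.Objects;
import javax.persistence.*;

// Composite key class: must be Serializable and define equals()/hashCode() over all fields.
@Embeddable
class RoleId implements Serializable {
    @Column(name = "actor_id")     private Long actorId;
    @Column(name = "movie_id")     private Long movieId;
    @Column(name = "character_id") private Long characterId;

    @Override public boolean equals(Object o) {
        if (!(o instanceof RoleId)) return false;
        RoleId other = (RoleId) o;
        return Objects.equals(actorId, other.actorId)
                && Objects.equals(movieId, other.movieId)
                && Objects.equals(characterId, other.characterId);
    }

    @Override public int hashCode() {
        return Objects.hash(actorId, movieId, characterId);
    }
}

// The join table itself carries no surrogate id, only the composite key.
@Entity
class Role {
    @EmbeddedId
    private RoleId id;
}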
So, back to the original question, is it still acceptable practice to use composite primary keys? If not, why not?
In my personal opinion you should avoid composite primary keys due to several reasons:
Future changes: when you design a database you sometimes miss what in the future will become important. A significant example for this is thinking a combination of two or more fields is unique (and thus can become a primary key), whereas in the future you want to allow NULLs or other non-unique values in them. Having a single primary key is a good solid solution against such changes.
Uniformity: If every table has a unique numerical ID, and you also maintain some standard as to its name (e.g. "ID" or "tablename_id"), the code and SQL referring to it is clearer (in my opinion).
There are other reasons, but these are just a few.
The main question I would ask is why not use a separate primary key if you have a unique set of fields? What's the cost? An additional integer index? That's not too bad.
Hope that helps.
I think there's no problem using a composite key.
To me the database is a component on its own that should be treated the same way we treat code: we want clean code that communicates its intent clearly, that does one thing and does it well, that doesn't add any unneeded level of complexity, etc.
Same thing with the db: if the PK is composite, this is the reality, so the model should be kept clean and clear. A composite PK is clearer than the mix of auto-increment + constraint. When you see an ID column that does nothing, you need to ask what the real PK is, whether there are any other hidden things you should be aware of, etc. A clear PK doesn't leave any doubts.
The db is the base of your app; to me we need the most solid base we can have. On this base we'll build the app (web or not). So I can't see why we should bend the db model to conform to some specifics of one development tool/framework/language. The data directs the application, not the other way around. What if the ORM becomes obsolete in the future and a better solution appears that imposes another model? We can't play with the db model to fit this or that framework; the model should stay the same, and it should not depend on what tool we're using to access the data...
If the db model changes in the future, it should change because the functionality changed. If we knew today how this functionality will change, we'd be modeling it already. And any future change will be dealt with when the time comes; we can't predict, for instance, the impact on existing data, so one extra column doesn't guarantee that it will withstand any future change...
We should design for today's functionality and keep the db model as simple as possible; this way it will be easy to change/evolve in the future.
Religious wars have been, and still are, going on on this subject.
OO people have this zealous thing about "identity", and will tell you that the only thing that matters is the ability for you to "identify" "real-life objects" inside your programs, and that composite, "real-life" keys will only get you into trouble when trying to achieve that goal.
Data people have this thing about "uniqueness" that is perceived as "zealous" by the OO side, and will tell you that the only thing that matters is that if the business tells you that the combination of (values for) attribute X and attribute Y must be unique, then it is your job to see to it that the database enforces this business rule of uniqueness of the combined X+Y.
How you want your question answered is just a matter of which religion you prefer. My personal religion is the Data one. That religion has proven to be able to survive any hype and trend ever since 1969.
Similar questions have been asked on SO, and there is no consensus ;)
If you develop a web application, you will love single column pk's, as they make your URLs simpler.
For a sequence to wrap you'd need 2 billion records in a single table (32bit), or 10^18 with 64 bit pk's.
Btw, your data model does not allow for movie characters with unknown actors.
My general opinion is... no. don't use composite primary keys.
They will typically complicate ORMs if you use them (ORMs sometimes go so far as to call composite primary keys "legacy behaviour") and generally if you're using multiple keys, one or more of them will tend to be natural rather than technical keys, which for me is the bigger problem: IMHO you should certainly favour technical primary keys.
More on this in Database Development Mistakes Made by AppDevelopers.
It's a religious thing. I use natural keys and shun surrogates. I have no problem with composite keys either in theory or in practice.
Only the most trivial logical model would involve no composite keys. Call me lazy but I see no need to complicate the data model by introducing surrogates into the physical model on implementation. Sure, I'd consider one on a table if performance issues were found but I take the same approach as for denormalization i.e. as a last resort. Habitually using surrogates amounts to premature optimization, IMO.
In Ruby on Rails, when you don't explicitly specify otherwise, your Role table would be kind of like you described (if the columns are actually the IDs from the other tables). Still, in the database you might want to ensure unique combinations by defining a unique index on those three columns, if only to help the database optimize your queries. With that unique index in place and the framework not using any other primary key anyway, there is no need for an additional numeric primary key in your Role table. Having said that, the unique index could very well be defined as a composite primary key instead.
As for future changes: defining a strict database for your first iteration will prevent unexpected data from being persisted, which will make migrations much easier.
So: I would use composite primary keys.
I would only ever use them in join tables. The only way to absolutely ensure that every record identifier is unique and consistent over time is to use a synthetic key.
Composite keys seem OK in theory, which is why they are tempting to use, but practice has shown that they usually indicate that there is a flaw in your data model. Worse still, in many cases they will fail to guarantee uniqueness, given a large enough data set. And data sets always grow over time, so using them may mean that you have planted a bomb in your application which will only explode when the application has been in production use for a while.
I think that people are underplaying ORMs. Every mainstream programming language has a defacto ORM, and has had for years, because they solve the fundamental incompatibility between OO and relational structures. Trying to write any complex, testable OO software against SQL databases without an ORM is very inefficient, at best.
Good ORMs also provide practices and tooling that make it much easier to create and maintain consistent high-quality database schema, so on average, a team will come out well ahead by working with an ORM. Handcrafting schema is rather like writing C++ ...people can do it, but in the real world it is so hard to maintain quality over time that the average product is not good.
I have almost never seen a case where a composite key was a good idea (the exception being a join table consisting of only two surrogate keys). In the first place, you are wasting space in the child tables. You are harming performance in the joins, as integer joins are generally much faster. If you have the composite key as a clustered index (talking SQL Server here), then you are causing the database to be less efficient about storing records and less efficient in building other indexes - all of which use the clustered index.
When the data in the key changes (as it almost inevitably will), you need to update all related tables as well, causing massive unnecessary updates and wasting processing power on a task that is completely unneeded when the database is designed to use surrogate keys. Primary keys need not only to be unique but also to be unchanging. Composite keys often fail the second test.
So you are thinking of using a technique that harms performance, causes poor use of memory and database storage, uses way more space in child records (another waste of resources) and requires painful updating of what may be millions of child records when things change. And which might make it hard to use an ORM? Why would you do that? Because you are too lazy to put a surrogate key on and then define a unique index on the potential composite key? Is there any gain at all to using a composite index? For the lack of 5 minutes of work you are permanently harming your database?
In terms of the domain model, I see nothing wrong with creating a composite primary key when the table doesn't represent an entity - i.e. when it represents a join table (as you mention in your question) - other than that if it is not monotonically increasing, you will get a certain amount of page splits during insertions.
Some ORMs don't cope well with composite primary keys, so perhaps it is safer to create a surrogate auto-incrementing integer for the primary key, and cover the columns with a non-clustered index.
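For illustration, a hypothetical JPA sketch of that alternative: a surrogate auto-generated primary key, with the natural composite key still enforced by a unique constraint:

import javax.persistence.*;

// Hypothetical alternative: surrogate primary key plus a unique constraint on the natural key.
@Entity
@Table(name = "Role",
       uniqueConstraints = @UniqueConstraint(columnNames = {"actor_id", "movie_id", "character_id"}))
class RoleWithSurrogateKey {

    @Id @GeneratedValue
    private Long id; // surrogate key the ORM is happy with

    @Column(name = "actor_id", nullable = false)     private Long actorId;
    @Column(name = "movie_id", nullable = false)     private Long movieId;
    @Column(name = "character_id", nullable = false) private Long characterId;
}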
