How to model semistructured data in Java? - java

I have agriculture-related data that I need to model in Java. In a nutshell, the data is a collection of attributes recorded whenever a new plant variety is produced. I currently want to model about 135 plants, grouped into 9 groups. The problem is that every plant has some attributes it shares with no other plant, plus a few attributes it has in common with others. The attributes can even differ for the same plant released in the same year, which makes it difficult to fix the set of fields to include in a class.
For example, the attributes shown in red might be missing from another variety, or might appear alongside additional attributes; it is not possible to know in advance which attributes will be present. The other problem is that some attributes, like the one shown in the blue rectangle, have range values.
What I have tried: I listed every possible attribute by going through the 24 books I have, turned the unique attributes into classes for every plant, and extracted the values that have ranges into separate classes with min and max values. I ended up with over 200 classes, which makes it very complicated to build a unified system that I can feed the data into and retrieve information from.
What approach do you recommend for modeling this data? The database I plan to use is MongoDB, since some varieties may lack values that other varieties have. I am ready to move to another programming language, such as Python, if I have to. Thanks.

Related

Data structure for fast searching of custom object using its attributes (fields) in Java

I have an abstract superclass and some subclasses. My question is: what is the best way to store objects of those classes so that I can easily find them by any of their parameters?
For example, if I want to look up by resourceCode (every object has a unique resource code), I can use a HashMap keyed by resourceCode. But what happens if I want to look up by genre? There are many games with the same genre, so I will get all of those games. My first idea was an ArrayList of those objects, but isn't that too slow if we have 1 000 000 games (about 1 000 000 operations per search)?
My other idea is to have a Hashtable keyed by the product code, so that a lookup takes constant time. Then I create as many HashSets as there are fields in the classes, and for each field value (for example, a game promoter) I keep the product codes of the objects that have that value. With those unique codes I can get everything I want from the Hashtable. Is this a good idea? It seems a lot of space will be needed to store the data, but it will be fast.
So my question is: what data structure should I use to implement fast lookup of custom objects by their attributes (fields)?
Please see the attachment: Classes Example
Thank you in advance.
Stefan Stefanov
You can use Sorted or Ordered data structures to optimize search complexity.
You can introduce your own search index for custom data.
But it is better to use database or search engine.
Have a look at Elasticsearch, Apache Solr, PostgreSQL
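If you do want to try a hand-rolled search index before reaching for a database, here is a minimal sketch of the "primary map plus one secondary index per field" idea from the question. The Game class, its fields and the GameIndex name are assumptions made up for illustration, and the sketch assumes each resource code is added only once.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical product type; the field names are illustrative only.
final class Game {
    final String resourceCode;
    final String genre;
    final String promoter;

    Game(String resourceCode, String genre, String promoter) {
        this.resourceCode = resourceCode;
        this.genre = genre;
        this.promoter = promoter;
    }
}

final class GameIndex {
    // Primary index: unique resource code -> object.
    private final Map<String, Game> byCode = new HashMap<>();
    // Secondary indexes: field value -> all objects sharing that value.
    private final Map<String, Set<Game>> byGenre = new HashMap<>();
    private final Map<String, Set<Game>> byPromoter = new HashMap<>();

    // Assumes each resource code is added only once (no replacement handling).
    void add(Game game) {
        byCode.put(game.resourceCode, game);
        byGenre.computeIfAbsent(game.genre, k -> new HashSet<>()).add(game);
        byPromoter.computeIfAbsent(game.promoter, k -> new HashSet<>()).add(game);
    }

    Game findByCode(String code) {
        return byCode.get(code);
    }

    Set<Game> findByGenre(String genre) {
        return byGenre.getOrDefault(genre, Collections.emptySet());
    }

    Set<Game> findByPromoter(String promoter) {
        return byPromoter.getOrDefault(promoter, Collections.emptySet());
    }
}
```

Each lookup is a single hash access, at the cost of one extra map per searchable field. The secondary sets hold references to the same objects, so the extra memory is per reference, not per copy.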
It sounds like most of your fields can be mapped to a string (name, genre, promoter, description, year of release, ...). You could put all these strings in a single large index that maps each keyword to all objects that contain the word in any of their fields. Then if you search for certain keywords it will return a list of all entries that contain that word. For example searching for 'mine' should return 'minecraft' (because of title), as well as all mine craft clones (having 'minecraft-like' as genre) and all games that use the word 'mine' in the 'info text' field.
You can code this yourself, but I suppose a full-text indexer such as Lucene may be useful. I haven't used Lucene myself, but I suppose it would also allow you to search for multiple keywords at once, even if they occur in different fields.
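For the "code this yourself" route, a rough sketch of such a keyword index could look like the following. It is plain Java, not Lucene; the KeywordIndex class and its method names are made up for illustration. It splits every field into lower-cased words and uses a sorted map so that a prefix such as 'mine' also finds entries containing 'minecraft'.

```java
import java.util.Locale;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

final class KeywordIndex {
    // Sorted map: lower-cased word -> ids of all entries containing that word in any field.
    private final TreeMap<String, Set<String>> postings = new TreeMap<>();

    // Indexes every word of every given field value under the entry id.
    void add(String id, String... fieldValues) {
        for (String value : fieldValues) {
            // Split on anything that is not a letter or a digit.
            for (String word : value.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
                if (!word.isEmpty()) {
                    postings.computeIfAbsent(word, k -> new TreeSet<>()).add(id);
                }
            }
        }
    }

    // Returns ids of entries that have at least one word starting with the prefix.
    Set<String> searchPrefix(String prefix) {
        String from = prefix.toLowerCase(Locale.ROOT);
        String to = from + Character.MAX_VALUE; // upper bound of the prefix range
        Set<String> result = new TreeSet<>();
        for (Set<String> ids : postings.subMap(from, true, to, false).values()) {
            result.addAll(ids);
        }
        return result;
    }
}
```

This only matches word prefixes. Matching in the middle of a word, ranking and multi-keyword queries are what a real full-text engine adds on top.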
This is not a very appealing answer.
Start with a database. Maybe an embedded database (like h2database).
A fixed set of development/test data is easy to set up and easy to change (the database dump).
Too many indices (hash maps) are harmful.
Developing and optimizing queries is easier (declarative) than with data structures.
Database tables are less coupled than data structures with helper structures (maps).
The resulting system is far less complex and scales better.
Once development has stabilized the set of queries, you can think about doing away with the DB part. Use at least a two-tier separation of the database and the classes.
Then you might find a stable and best-fitting data model.
Should you still intend to do it all with pure objects, then work them out in detail as design documentation before you start programming: example stories, and how one solves them.

How should I design my DAO layer

Let's say I wanted a web page that represents a zoo. There should be a list of enclosures (about ten thousand of them), and it should be possible to display it in three ways:
all enclosures,
only enclosures that the currently logged in user has marked as favorite,
only enclosures that the currently logged in user has commented on.
In all of these cases the list could be too long to fit on a single page and therefore should be divided into multiple pages with a pagination bar.
To make it easier to find a particular enclosure, all three modes should support additional filtering by a keyword (full-text search in enclosure names). That is, the user should be able to, for example, display all enclosures marked as favorite that contain a given string in their names. Of course, the list can still be too large, so pagination applies here as well.
The question is: how do I design the DAO layer to avoid code duplication and spaghetti code full of conditions? It would also be good to have the code divided into layers/areas of abstraction, so that, for example, the code for building the final SQL queries is not scattered inconsistently across many classes from different abstraction layers.
Assuming a traditional request/response web application style here is a sketch:
Represent the various filtering options as classes in supporting code for your DAO. Have the web client specify URL parameters representing the filtering options. You'll need a way to ensure that the filtering options are always sent in on each request, or store them on the user's session.
Map the filtering parameters to the filtering options and pass the options to your DAO. In your DAO's queries, "expand" the filtering options into the appropriate WHERE clause(s) against the database.
For paging, have the concept of a paging "window". For example, you could have a class that represents the starting row and how many rows to return. Again, expand that class into a predicate executed against the database.
There are other ways to accomplish this (perhaps with one of the million frameworks that are around), but this is how I'd approach it if I had to develop it all from scratch.
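To make the sketch above a bit more concrete, here is roughly how the filter options and the paging window could expand into one parameterized query. All class names, table names and column names (enclosure, favorite, comment, user_id and so on) are assumptions for illustration. The keyword filter uses a plain LIKE instead of real full-text search, and LIMIT ... OFFSET is assumed to be supported by the database.

```java
import java.util.ArrayList;
import java.util.List;

enum ListMode { ALL, FAVORITES, COMMENTED }

// Filtering options as sent by the web client (hypothetical names).
final class EnclosureFilter {
    final ListMode mode;
    final long userId;
    final String keyword; // may be null when no keyword filter is active

    EnclosureFilter(ListMode mode, long userId, String keyword) {
        this.mode = mode;
        this.userId = userId;
        this.keyword = keyword;
    }
}

// Paging "window": starting row and number of rows to return.
final class PageWindow {
    final int offset;
    final int limit;

    PageWindow(int offset, int limit) {
        this.offset = offset;
        this.limit = limit;
    }
}

// SQL text plus positional parameters, ready for a PreparedStatement.
final class SqlQuery {
    final String sql;
    final List<Object> params;

    SqlQuery(String sql, List<Object> params) {
        this.sql = sql;
        this.params = params;
    }
}

final class EnclosureQueryBuilder {
    static SqlQuery build(EnclosureFilter filter, PageWindow page) {
        List<String> where = new ArrayList<>();
        List<Object> params = new ArrayList<>();
        switch (filter.mode) {
            case FAVORITES:
                where.add("EXISTS (SELECT 1 FROM favorite f WHERE f.enclosure_id = e.id AND f.user_id = ?)");
                params.add(filter.userId);
                break;
            case COMMENTED:
                where.add("EXISTS (SELECT 1 FROM comment c WHERE c.enclosure_id = e.id AND c.user_id = ?)");
                params.add(filter.userId);
                break;
            default:
                break; // ALL: no mode-specific condition
        }
        if (filter.keyword != null && !filter.keyword.isEmpty()) {
            where.add("e.name LIKE ?");
            params.add("%" + filter.keyword + "%");
        }
        StringBuilder sql = new StringBuilder("SELECT e.id, e.name FROM enclosure e");
        if (!where.isEmpty()) {
            sql.append(" WHERE ").append(String.join(" AND ", where));
        }
        sql.append(" ORDER BY e.name LIMIT ? OFFSET ?");
        params.add(page.limit);
        params.add(page.offset);
        return new SqlQuery(sql.toString(), params);
    }
}
```

All three list modes and the keyword filter go through the same builder, so the SQL assembly lives in one place and the DAO only binds the parameters and runs the statement.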
Editing my original answer since I misread your criteria. Your DAO will be the same as any other basic DAO. It will (essentially) have a GET method for each of the three queries. If the user wants to narrow down the criteria after that, I would suggest using a jQuery plugin like DataTables, assuming the amount of data returned by the DAO methods isn't outrageously large. That plugin lets you add filters to each column that update as you type, and it also has sort, search, and paginate functionality.

Should I use a nested enum?

Say I need a data structure in Java involving one set of categories, each with one set of subcategories. For example, let's say the main category is 'brand' (like, of a product) and the subcategory is 'product'. I want to be able to map the combination of brand+product to a piece of data e.g. a price.
I'd like to use an enum type for both 'brand' and 'product' if they were on their own, because
Brand+product has only a small, single piece of data tied to it (the price)
I need to refer to them many times throughout a reasonably large program, so the chance that I'll mistype any string literal keys I assign to them is basically one.
However, the number of brands/products is too large to have a single enum for each brand/product combination (around twenty brands each with ten products and a good chance of adding more later). I'd like to be able to use the structure like this:
getPrice(APPLE.IPOD)
getPrice(APPLE.MACBOOK)
getPrice(HERSHEYS.PEANUT_BUTTER_CUPS)
Should I use some sort of nested enum? If so, how would that be implemented?
Bonus information: I've spent a bit of time googling 'java nested enum' but haven't come up with anything. The problem with structures like the first one in the ticked answer here or thelosts's answer here is that I have too many categories all exhibiting the same behavior to write out very similar enum definitions so many times.
I wouldn't use an enum for this.
I would suggest you load this information from a file or database. Java is not a good place for storing large amounts of data.
You could add a getter and setter to the Brand enum that allows setting a Product enum, but that will not enforce that a Product is actually manufactured by that Brand. Besides, there is only ever one instance of each enum value, so you could never have APPLE.IPOD and APPLE.IPAD. You either need a single enum type that represents the Cartesian product, or you need to load your values from a data store like Peter Lawrey suggests.
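For what it's worth, a small sketch of the "single enum type for the Cartesian product" option might look like this. The constant names follow the question's examples, while the PriceList class is hypothetical; the prices themselves are expected to be loaded from a file or database, not hard-coded.

```java
import java.math.BigDecimal;
import java.util.EnumMap;
import java.util.Map;

enum Brand { APPLE, HERSHEYS }

// One constant per brand/product combination; each constant records its brand.
enum Product {
    APPLE_IPOD(Brand.APPLE),
    APPLE_MACBOOK(Brand.APPLE),
    HERSHEYS_PEANUT_BUTTER_CUPS(Brand.HERSHEYS);

    private final Brand brand;

    Product(Brand brand) {
        this.brand = brand;
    }

    Brand brand() {
        return brand;
    }
}

// Hypothetical price store; prices are set at runtime, e.g. after reading a file.
final class PriceList {
    private final Map<Product, BigDecimal> prices = new EnumMap<>(Product.class);

    void setPrice(Product product, BigDecimal price) {
        prices.put(product, price);
    }

    BigDecimal getPrice(Product product) {
        BigDecimal price = prices.get(product);
        if (price == null) {
            throw new IllegalArgumentException("No price loaded for " + product);
        }
        return price;
    }
}
```

A call then reads getPrice(Product.APPLE_IPOD) in place of getPrice(APPLE.IPOD). The compiler still catches typos, but every new product means a new constant and a recompile, which is the main argument for the data-store route.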

google appengine mapper - map over range of dates

I would like to use the appengine mapper to iterate over a range of dates (from-date and to-date passed as properties to the configuration). For each date in the range, I would retrieve the entities that have this date as a property and operate on this set.
For example, if I have the following set of entities:
Key Date Value
a 2011/09/09 323
b 2011/09/09 132
c 2011/09/08 354
d 2011/09/08 432
e 2011/09/08 234
f 2011/09/07 423
g 2011/09/07 543
I would like to specify a date range of 2011/09/09 - 2011/09/07 which would create three mapper instances, for 2011/09/09, 2011/09/08 and 2011/09/07. In turn these would query for entities a+b, c+d+e and f+g respectively, and perform some operations on the values. (Each of the mappers would also make other datastore queries for additional data, hence the 'bonus question' below)
Presumably I need to create a custom InputFormat class, however I'm quite new to mapreduce/hadoop and I was hoping someone had some examples?
Bonus question: is it "bad form" to use a dao to load data in a mapper? Other distributed computing platforms I have worked with (eg DataSynapse) would require that you parcel all inputs up and provide with the task to prevent too much contention on a dataserver. However, with the appengine HR datastore I presume this isn't a concern?
It's not currently possible to iterate over a subset of entities of a given kind in App Engine's mapreduce implementation. If the entities make up a large proportion of the data, you can simply iterate over everything and ignore the unwanted entities; if they only make up a small proportion, you will have to roll your own update procedure using the task queue.
Building on Nick Johnson's answer: you will need to retrieve your date range from the context using custom parameters. The mapper then filters out (ignores) any entity that falls outside the range before processing it.
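The filtering step itself is simple. Here is a sketch in plain Java that deliberately does not use the App Engine mapper API; DatedValue, DateRangeCollector and the map method are made-up names, and in a real mapper the two bounds would come from the job configuration.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Stand-in for a datastore entity with a date property (hypothetical).
final class DatedValue {
    final String key;
    final LocalDate date;
    final int value;

    DatedValue(String key, LocalDate date, int value) {
        this.key = key;
        this.date = date;
        this.value = value;
    }
}

final class DateRangeCollector {
    private final LocalDate from;
    private final LocalDate to;
    private final Map<LocalDate, List<Integer>> valuesByDate = new TreeMap<>();

    // Accepts the bounds in either order, e.g. 2011/09/09 - 2011/09/07.
    DateRangeCollector(LocalDate a, LocalDate b) {
        this.from = a.isBefore(b) ? a : b;
        this.to = a.isBefore(b) ? b : a;
    }

    // Called once per entity, like a map() callback; out-of-range entities are ignored.
    void map(DatedValue entity) {
        if (entity.date.isBefore(from) || entity.date.isAfter(to)) {
            return;
        }
        valuesByDate.computeIfAbsent(entity.date, d -> new ArrayList<>()).add(entity.value);
    }

    // Values grouped per date, in ascending date order.
    Map<LocalDate, List<Integer>> result() {
        return valuesByDate;
    }
}
```

Note that every entity of the kind is still visited; the range only decides which ones are processed.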
But if you insist on mapping across all entities of a given kind, then there is a workaround that may or may not be feasible depending on your requirements. Suppose that your date ranges are pretty much fixed (sounds unlikely, but just maybe). Then for each expected range you create a corresponding child entity kind with a parent key pointing to the main entity (or just a reference, but a parent key works better for consistency: think of a transaction across the entity group).
Thus each entity in the range receives a child entity of the kind corresponding to that range. Then set up a mapper on the child entity kind corresponding to the range and retrieve its parent to work on it.
I do something similar, but in the opposite direction and for a single child entity kind, when populating my data for the Relation Index Entity pattern. Hence the answer to your bonus question: go ahead and use a DAO or whatever your data layer consists of.
While the first approach is more sound, the latter may be feasible when your ranges are not very dynamic and are manageable. Given the schema-less nature of the datastore, creating new entity kinds is neither expensive nor bad practice.

Multilingual fields in DB tables

I have an application that needs to support a multilingual interface, five languages to be exact. For the main part of the interface the standard ResourceBundle approach can be used to handle this.
However, the database contains numerous tables whose elements contain human readable names, descriptions, abstracts etc. It needs to be possible to enter each of these in all five languages.
While I suppose I could simply have fields on each table like
NameLang1
NameLang2
...
I feel that this leads to a significant amount of largely identical code when writing the beans that represent each table.
From a purely object-oriented point of view, however, the solution is simple. Each class simply has a Text object that contains the relevant text in each of the languages. This is further helpful in that only one of the languages is mandatory; the others have fallback rules (e.g. if language 4 is missing, return language 2, which falls back to language 1, which is mandatory).
Unfortunately, mapping this back to a relational database means that I wind up with a single table that some 10-12 other tables have an FK to (some tables in fact have more than one FK to it).
This approach seems to work and I've been able to map the data to POJOs with Hibernate. About the only thing you can't do is map from a Text object to its parent (since you have no way of knowing which table you should link to), but then there is hardly any need to do that.
So, overall this seems to work but it just feels wrong to have multiple tables reference one table like this. Anyone got a better idea?
If it matters I'm using MySQL...
I had to do that once... multilingual text for some tables... I don't know if I found the best solution but what I did was have the table with the language-agnostic info and then a child table with all the multilingual fields. At least one record was required in the child table, for the default language; more languages could be added later.
With Hibernate you can map the info from the child tables as a Map and get the info for the language you want, implementing the fallback on your POJO like you said. You can have different getters for the multilingual fields that internally call the fallback method to get the appropriate child object for the needed language and then just return the required field.
This approach uses more tables (one extra table for every table that needs multilingual info), but the performance is much better, as well as the maintenance, I think...
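The fallback part on the POJO side could be sketched like this. It is plain Java without any Hibernate mapping; the LocalizedText class and the language codes used as map keys are assumptions, and the rules map (language to the language to try next) follows the example in the question.

```java
import java.util.HashMap;
import java.util.Map;

final class LocalizedText {
    private final String mandatoryLanguage;
    // Fallback rules: language -> language to try next when a text is missing.
    private final Map<String, String> fallbackRules;
    private final Map<String, String> textByLanguage = new HashMap<>();

    LocalizedText(String mandatoryLanguage, String mandatoryText, Map<String, String> fallbackRules) {
        this.mandatoryLanguage = mandatoryLanguage;
        this.fallbackRules = new HashMap<>(fallbackRules);
        textByLanguage.put(mandatoryLanguage, mandatoryText);
    }

    void put(String language, String text) {
        textByLanguage.put(language, text);
    }

    // Follows the fallback chain; ends at the mandatory language if nothing else matches.
    String get(String language) {
        String current = language;
        // The loop is bounded so that an accidental cycle in the rules cannot hang it.
        for (int i = 0; i <= fallbackRules.size() && current != null; i++) {
            String text = textByLanguage.get(current);
            if (text != null) {
                return text;
            }
            current = fallbackRules.get(current);
        }
        return textByLanguage.get(mandatoryLanguage);
    }
}
```

With a mapped Map of language to child row, the same loop would look up child objects instead of strings, and each multilingual getter would delegate to it.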
The standard translation approach as used, for example, in gettext is to use a single string to describe the concept and make a call to a translate method which translates to the destination language.
This way you only need to store a single string (the canonical representation) in the database and then call the translate method in your application to get the translated string. No FKs and total flexibility, at the cost of a little runtime performance (and maybe a bit more maintenance trouble, but with some thought there's no need for maintenance to become a problem in this scenario).
The approach I've seen in an application with a similar problem is that we use a "text id" column to store a reference, and we have a single table with all the translations. This provides some flexibility also in reusing the same keys to reduce the amount of required translations, which is an expensive part of the project.
It also provides a good separation between the data and the translations, which in my opinion are more of a UI thing.
If it is the case that the strings you require are not that many after all, then you can just load them all in memory once and use some method to provide translations by checking a data structure in memory.
With this approach, your beans won't have getters for each language, but you would use some other translator object:
MyTranslator.translate(myBean.getNameTextId());
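As a rough illustration of the in-memory variant, something like the following could sit behind that call. InMemoryTranslator and its methods are hypothetical names; unlike the one-argument static call above, this sketch takes the target language as an explicit parameter and falls back to a default language.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

final class InMemoryTranslator {
    private final String defaultLanguage;
    // text id -> (language -> translated text), loaded once from the translations table.
    private final Map<Long, Map<String, String>> byTextId = new HashMap<>();

    InMemoryTranslator(String defaultLanguage) {
        this.defaultLanguage = defaultLanguage;
    }

    void load(long textId, String language, String text) {
        byTextId.computeIfAbsent(textId, id -> new HashMap<>()).put(language, text);
    }

    // Returns the text in the requested language, else in the default language, else empty.
    Optional<String> translate(long textId, String language) {
        Map<String, String> texts = byTextId.getOrDefault(textId, Collections.emptyMap());
        String text = texts.get(language);
        if (text == null) {
            text = texts.get(defaultLanguage);
        }
        return Optional.ofNullable(text);
    }
}
```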
Depending on your requirements, it may be best to have a separate label table for each table that needs to be multilingual. E.g. you have an XYZ table with an xyz_id column, and an XYZ_Label table with xyz_id, language_code, label, other_label, etc.
The advantage of this over having a single huge labels table is that you can put unique constraints on the XYZ_Label table (e.g. the English name for XYZ must be unique), and you can do indexed lookups much more efficiently, since the index will only be covering a single table at a time (e.g. if you need to look up XYZ entities by English name).
What about this:
http://rob.purplerockscissors.com/2009/07/24/internationalizing-websites/
...that is what user "Chochos" says in response #2
