How properly work with big data using Swing - performance tips

How properly work with big data using Swing - performance tips - java

There is an application that I'm basically writing with Swing, JDBC and MySQL.
In DB there are tables like Article, Company, Order, Transaction, Client etc.
So also there are java classes which describes them.
User can create, update, delete information about them.
I give an example of my problem. The article characterizes with id, name, price, company, unit. And when user wants to save new article he chooses the company for this article from the list of all companies. This list in perspective could be really big.
Now I could think of two ways to solve this.
When application starts, it connects to the DB and load all the data with which then I will work.
public final class AllInformationController {
public static final Collection<Company> COMPANIES= new HashSet<>(1_000_000);
public static final Collection<Article> ARTICLES= new HashSet<>(1_000_000);
public static final Collection<Order> ORDERS= new HashSet<>(1_000_000);
public static final Collection<Transaction> transactionsHistory= new HashSet<>(10_000_000);
//etc...
private AllInformationController() {}
}
Then if user wants for example to change some Company data (like address or telephone etc.), after doing it the program should update the DB info for that company.
The second approach is basically to connect to the database every time user queries or changes some information. So then I will mostly work with ResultSet's.
I prefer the second way, but just not sure if it's the best one. Think there should be more productive ways to work with data that could be less expensive.

The 2nd approach is better, although there's probably a best case that lies somewhere between them. The 2nd approach here allows multiple applications (or users of the same application) to modify the data at the same time, as the 1st approach may end up using old data if you load all the data at once (especially if a user leaves the application on a while). I would go with the 2nd approach and then figure out what optimizations to make.
Since you think the 1st approach may be usable, I'd assume then you don't have too many users who would use the tool at the same time. If that is the case then, perhaps then you don't need to use any optimizations that the 1st method itself would give you as there's not going to be too much database usage.
When you say you working with ResultSets more often in the 2nd approach than the 1st, well it doesn't need to be that way. You can use the same methods from the 1st approach which translates your data into Java data structures to be used in the 2nd approach.

You already made a very bad decision here:
And when user wants to save new article he chooses the company for this article
from the _list_ of all companies
A list works only reasonably if the number of choices is fairly limited; below 10-20 you may get away with a combo box. For thousands of choices a list is very cumbersome, and the further it grows the slower and more unwieldly chosing from a list becomes.
This is typically solved by some kind of search field (e.g. user types customer number, presses tab and information is fetched), possibly combined with a search dialog (with more search options and a way to select a result found as "it").
Since you will typically be selecting only a few items with a search request, directly quering the DB is usually practical. For a search dialog you may need to artificially limit the number of results (using specific SQL clauses for paging).

Related

How is the best way to have a one-to-many relationship with an arbitrary number of fields in an apache derby database?

I am making an application for a campground that has a list of camping sites, each of which has a list of reservations. There can be any number of reservations for each camping site and any number of camping sites.
I started off by just storing this data in a csv and using that to handle everything. This is working really well because after a few columns of identifying information for each site, I just have an arbitrary number of reservations that I can go through. It's also really easy for me to add a new site if I want to add that functionality to my application since I can just add a new line and leave the reservation part blank if there aren't any reservations.
I'm thinking that I should instead use a database to store this information for the following reasons:
I need to be able to search and modify specific sites in addition to just iterating through them in order. With the CSV solution, iterating through them in order has been great when I need to use them all but I don't want to have to iterate through to find a specific reservation. Plus modifying the csv isn't looking to be that great either.
I am hoping to gain a better understanding of databases in general and I don't believe using a csv is the most professional way to do this, nor would it be the best practice.
How exactly could I do this? If, for example, I had only one camping site then I would just add another table that represents the reservations as an entry to the table but since I have at least 100 camping sites and can have an arbitrary amount, this doesn't seem to make sense. Am I just not understanding something or am I correct in coming to that conclusion?
Some more information:
Since I haven't said it yet, this is a java application with an apache derby database.
This project is not for a class or a job, I'm just trying to practice my programming skills and help someone out who runs a campground at the same time.

In populating an ObservableList, do I have to load ALL the records from my database?

So I'm porting my Swing Java database application to Java FX (still a beginner here, I recently just learned the basics of FXML and the MVC pattern, so please bear with me).
I intend to load the data from my existing database to the "students" ObservableList so I can show it on a TableView, but on my original Swing code, I have a search TextField, and when the user clicks on a button or presses Enter, the program:
Executes an SQLite command that searches for specific records, and retrieves the RecordSet.
Creates a DefaultTableModel based on the RecordSet contents
And throws that TableModel to the JTable.
However, Java FX is a completely different beast (or at least it seems so to me--don't get me wrong, I love Java FX :D ) so I'm not sure what to do.
So, my question is, do I have to load ALL the students in the database, then use some Java code to filter out students that don't fit the search criteria (and display all students when the search text is blank), or do I still use SQLite in filtering and retrieving records (which means I need to clear the list then add students every time a search is performed, and maybe it will also mess up with the bindings? Maybe there will be a speed penalty on this method also? Besides that, it will also reset the currently selected record because I clear the list--basically, bad UI design and will negatively impact the usability)
Depending on the right approach, there is also a follow-up question (sorry, I really can't find the answer to these even after Googling):
If I get ALL students from database and implement a search feature in Java, won't it use up more RAM than it should, because I am storing ALL the database data in RAM, instead of just the ones searched for? I mean, sure, even my lowly laptop has 4GB RAM, but the feeling of using more memory than I should makes me feel somewhat guilty LOL
If I choose to just update the contents of the ObservableList every time a new search has been performed, will it mess up with the bindings? Do I have to set up bindings again? How do I clear the contents of the ObservableList before adding the new contents?
I also have the idea of just setting the selected table item to the first record that matches the search string but I think it will be difficult to use, since only one record can be highlighted per search. Even if we highlight multiple rows, it'd be difficult to browse all selected items.
Please give me the proper way, not the "easy" way. This is my first time implementing a pattern (MVC or am I actually doing MVP, I don't know) and I realized how unmaintainable and ugly my previous programs are because I used my own style. This is a relatively big project that I need to support and improve for several years so having clean code and doing stuff the right way should help in maintaining the functionality of this program.
Thank you very much in advance for your help, and I hope I don't come off as a "dumb person who can't even Google" in asking these questions. Please bear with me here.

Basic design tradeoffs
You can, of course, do this either of the ways you describe. The basic tradeoffs are:
If you load everything from the database, and filter the table in Java, you use more memory (though not as much as you might think, as explained below)
If you filter from the database and reload every time the user changes the filter, there will be a bigger latency (delay) in displaying the data, as a new query will be executed on the database, with (usually) network communication between the database and the application being the biggest bottleneck (though there are others).
Database access and concurrency
In general, you should perform database queries on a background thread (see Using threads to make database requests); if you are frequently making database queries (i.e. filtering via the database), this gets complex and involves frequently disabling controls in the UI while a background task is running.
TableView design and memory management
The JavaFX TableView is a virtualized control. This means that the visual components (cells) are created only for visible elements (plus, perhaps, a small amount of caching). These cells are then reused as the user scrolls around, displaying different "items" as required. The visual components are typically quite memory-consumptive (they have hundreds of properties - colors, font properties, dimensions, layout properties, etc etc - most of which have CSS representations), so limiting the number created saves a lot of memory, and the memory consumption of the visible part of the table view is essentially constant, no matter how many items are in the table's backing list.
General memory consumption computations
The items observable list that forms the table's backing list contains only the data: it is not hard to ballpark-estimate the amount of memory consumed by a list of a given size. Strings use 2 bytes per character, plus a small fixed overhead, doubles use 8 bytes, ints use 4 bytes, etc. If you wrap the fields in JavaFX properties (which is recommended), there will be a few bytes overhead for each; each object has an overhead of ~16 bytes, and references themselves typically use up to 8 bytes. So a typical Student object that stores a few string fields will usually consume of the order of a few hundred bytes in memory. (Of course, if each has an image associated with it, for example, it could be a lot more.) Thus if you load, say 100,000 students from a database, you would use up of the order of 10-100MB of RAM, which is pretty manageable on most personal computer systems.
Rough general guidelines
So normally, for the kind of application you describe, I would recommend loading what's in your database and filtering it in memory. In my usual field of work (genomics), where we sometimes need 10s or 100s of millions of entities, this can't be done. (If your database contains, say, all registered students in public schools in the USA, you may run into similar issues.)
As a general rule of thumb, though, for a "normal" object (i.e. one that doesn't have large data objects such as images associated with it), your table size will be prohibitively large for the user to comfortably manage (even with filtering) before you seriously stretch the memory capacity of the user's machine.
Filtering a table in Java (all objects in memory)
Filtering in code is pretty straightforward. In brief, you load everything into an ObservableList, and wrap the ObservableList in a FilteredList. A FilteredList wraps a source list and a Predicate, which returns true is an item should pass the filter (be included) or false if it is excluded.
So the code snippets you would use might look like:
ObservableList<Student> allStudents = loadStudentsFromDatabase();
FilteredList<Student> filteredStudents = new FilteredList<>(allStudents);
studentTable.setItems(filteredStudents);
And then you can modify the predicate based on a text field with code like:
filterTextField.textProperty().addListener((obs, oldText, newText) -> {
if (newText.isEmpty()) {
// no filtering:
filteredStudents.setPredicate(student -> true);
} else {
filteredStudents.setPredicate(student ->
// whatever logic you need:
student.getFirstName().contains(newText) || student.getLastName().contains(newText));
}
});
This tutorial has a more thorough treatment of filtering (and sorting) tables.
Comments on implementing "filtering via queries"
If you don't want to load everything from the database, then you skip the filtered list entirely. Querying the database will almost certainly not work fast enough to filter (using a new database query) as the user types, so you would need an "Update" button (or action listener on the text field) which recomputed the new filtered data. You would probably need to do this in a background thread too. You would not need to set new cellValueFactorys (or cellFactorys) on the table's columns, or reload the columns; you would just call studentTable.setItems(newListOfStudents); when the database query finished.

Overcoming 30 sub-query limit for google datastore

Google datastore started off looking so good and has become so frustrating, but maybe it's just that I'm used to relational databases. I'm pretty new to datastore and nosql in general and have done a ton of research but can't seem to find a solution to this problem.
Assume I have a User class that looks like this
class User{
#Id
Long id;
String firstName, lastName;
List<Key<User>> friends;
}
I have another class that will model Events that users have done like so
class Event{
Key<User> user;
Date eventTime;
List<Key<User>> receivers;
}
and now what I'm trying to do is query for events that my friends have done.
In a usual relational way I would say :
select * from Event where user in (select friends from User where id = ?)
Taking that as a starting point I tried doing
// Key<User> userKey = ...
User user = ofy.load.type(User.class).key(userKey).first.now;
List<Key<User>> friends = user.getFriends();
ofy.load.type(Event.class).filter("user in", friends).order("-eventTime")list();
But I heard about this 30 sub-query limit making this unsustainable since I assume eventually someone will have more than 30 friends, not to mention using an 'in' clause will guarantee that you cannot get a cursor to continue loading events. I've done so much research and tried so many options but have yet to find a good way to approach this problem except to say "why Google, why."
Things I've considered :
add an extra field in event that is a copy of the users friendlist and use a single equals on MVP to find events (extremely wasteful since there may be many many events.
split event query up into batches of 30 friends at a time and somehow determine a way to ensure continued retrieval from a synthetic cursor based on time, and merge them (problem is waay too many edge cases and makes reading events very difficult.)
I would really appreciate any input you could offer since I am 100% out of ideas
TL;DR ~ GAE has limit on how many items an in-clause can handle and fml.

You come from a relational database background, so the concept of denormalization is probably a bit painful - I know it was for me.
Right now you have a single table that contains all events from all users. This approach works well in relational databases but is a nightmare in the datastore for the reasons you named.
So to solve this concrete problem you could restructure your data as follows:
All users have two timelines. One for their own posts and one from friends' posts. (There could be a third timeline for public stuff.)
When a new event is published, it is written to the timeline of the user who created it, and to all the timelines of the receiving users. (You may want to add references of the third-party timelines in the user's timeline, so you know what to delete when the user decides to delete an event)
Now every user has access to complete timelines, his/her own and the timeline that was created by third-party events. Those timelines are easy to query and you will not require sub-selects at all.
There are downsides to this approach:
Writing cost is higher. You have to write way more timelines than you had to until now. You will probably have to put this in a task queue to have enough time to write to all those timelines.
You're using a lot more storage, BUT storage is really cheap, I'm guessing the storage will be cheaper than running expensive queries in the long run.
What you get in return though is lightning fast responses with simple queries through this denormalization. All that remains is to merge the responses from the different timelines in the UI (you can do it on the server side, but i would do it in the UI)

Exploring user specific data in webapps

I am busy practicing on designing a simple todo list webapp whereby a user can authenticate into the app and save todo list items. The user is also only able to to view/edit the todo list items that they added.
This seems to be a general feature (authenticated user only views their own data) in most web applications (or applications in general).
To me what is important is having knowledge of the different options for accomplishing this. What I would like to achieve is a solution that can handle lots of users' data effectively. At the moment I am doing this using a Relational Database, but noSQL answers would be useful to me as well.
The following ideas came to mind:
Add a user_id column each time this "feature" is needed.
Add an association table (in the example above a user_todo_list_item table) that associates the data.
Design in such a way that you have a table per user per "feature" ... so you would have a todolist_userABC table. It's an option but I do not like it much since a thousand user's means a thousand tables?!
Add row level security to the specific "feature". I am not familiar on how this works but it seems to be a valid option. I am also not sure whether this is database vendor specific.
Of my choices I went with the user_id column on the todolist_item table. Although it can do the job, I feel that a user_id column might be problematic when reading data if the data within the table gets large enough. One could add an index I guess but I am not sure of the index's effectiveness.
What I don't like about it is that I need to have a user_id for every table where I desire this type of feature which doesn't seem correct to me? It also seems that when I implement the database layer I would have to add this to my queries for every feature (unless I use some AOP)?
I had a look around (How does Trello store data in MongoDB? (Collection per board?)), but it does not speak about the techniques regarding user_id columns or things like that. I also tried reading about this in some security frameworks (Spring Security to be specific) but it seems that it only goes into privileges/permissions on a table level and not a row level?
So the question is whether my choice was appropriate and if there are better techniques to do this?

Your choice is the natural thing to do.
The table-per-user is a non-starter (anything that modifies the database structure in response to user action is usually suspect).
Row-level security isn't really an option for webapps - it requires each user session to have a separate, persistent connection to the database, which is rarely practical. And yes, it is vendor-specific.
How you index your tables depends entirely on your usage patterns and types of queries you want to run. Is 'show all TODOs for a user' a query you want to support (seems like it would be)? Then and index on the user id is obviously needed.
Why does having a user_id column seem wrong to you? If you want to restrict access by user, you need to be able to identify which user the record belongs to. Doesn't actually mean that every table needs it - for example, if one record composes another (say, your TODOs have 'steps', each step belongs to a single TODO), only the root of the object graph needs the user id.

Google app engine: Poor Performance with JDO + Datastore

I have a simple data model that includes
USERS: store basic information (key, name, phone # etc)
RELATIONS: describe, e.g. a friendship between two users (supplying a relationship_type + two user keys)
COMMENTS: posted by users (key, comment text, user_id)
I'm getting very poor performance, for instance, if I try to print the first names of all of a user's friends. Say the user has 500 friends: I can fetch the list of friend user_ids very easily in a single query. But then, to pull out first names, I have to do 500 back-and-forth trips to the Datastore, each of which seems to take on the order of 30 ms. If this were SQL, I'd just do a JOIN and get the answer out fast.
I understand there are rudimentary facilities for performing two-way joins across un-owned relations in a relaxed implementation of JDO (as described at http://gae-java-persistence.blogspot.com) but they sound experimental and non-standard (e.g. my code won't work in any other JDO implementation).
Worse yet, what if I want to pull out all the comments posted by a user's friends. Then I need to get from User --> Relation --> Comments, i.e. a three-way join, which isn't even supported experimentally. The overhead of 500 back-and-forths to get a friend list + another 500 trips to see if there are any comments from a user's friends is already enough to push runtime >30 seconds.
How do people deal with these problems in real-world datastore-backed JDO applications? (Or do they?)
Has anyone managed to extract satisfactory performance from JDO/Datastore in this kind of (very common) situation?
-Bosh

First of all, for objects that are frequently accessed (like users), I rely on the memcache. This should speedup your application quite a bit.
If you have to go to the datastore, the right way to do this should be through getObjectsById(). Unfortunately, it looks like GAE doesn't optimize this call. However, a contains() query on keys is optimized to fetch all the objects in one trip to the datastore, so that's what you should use:
List myFriendKeys = fetchFriendKeys();
Query query = pm.newQuery(User.class, ":p.contains(key)");
query.execute(myFriendKeys);
You could also rely on the low-level API get() that accept multiple keys, or do like me and use objectify.
A totally different approach would be to use an equality filter on a list property. This will match if any item in the list matches. So if you have a friendOf list property in your user entity, you can issue a single Query friendOf == theUser. You might want to check this: http://www.scribd.com/doc/16952419/Building-scalable-complex-apps-on-App-Engine

You have to minimize DB reads. That must be a huge focus for any GAE project - anything else will cost you. To do that, pre-calculate as much as you can, especially oft-read information. To solve the issue of reading 500 friends' names, consider that you'll likely be changing the friend list far less than reading it, so on each change, store all names in a structure you can read with one get.
If you absolutely cannot then you have to tweak each case by hand, e.g. use the low-level API to do a batch get.
Also, rather optimize for speed and not data size. Use extra structures as indexes, save objects in multiple ways so you can read it as quickly as possible. Data is cheap, CPU time is not.

Unfortunately Phillipe's suggestion
Query query = pm.newQuery(User.class, ":p.contains(key)");
is only optimized to make a single query when searching by primary key. Passing in a list of ten non-primary-key values, for instance, gives the following trace
alt text http://img293.imageshack.us/img293/7227/slowquery.png
I'd like to be able to bulk-fetch comments, for example, from all a user's friends. If I do store a List on each user, this list can't be longer than 1000 elements long (if it's an indexed property of the user) as described at: http://code.google.com/appengine/docs/java/datastore/overview.html .
Seems increasingly like I'm using the wrong toolset here.
-B

Facebook has 28 Terabytes of memory cache... However, making 500 trips to memcached isn't very cheap either. It can't be used to store a gazillion pieces of small items. "Denomalization" is the key. Such applications do not need to support ad-hoc queries. Compute and store the results directly for the few supported queries.
in your case, you probably have just 1 type of query - return data of this, that and the others that should be displayed on a user page. You can precompute this big ball of mess, so later one query based on userId can fetch it all.
when userA makes a comment to userB, you retrieve userB's big ball of mess, insert userA's comment in it, and save it.
Of course, there are a lot of problems with this approach. For giant internet companies, they probably don't have a choice, generic query engines just don't cut it. But for others? Wouldn't you be happier if you can just use the good old RDBMS?

If it is a frequently used query, you can consider preparing indexes for the same.
http://code.google.com/appengine/articles/index_building.html

The indexed property limit is now raised to 5000.
However you can go even higher than that by using the method described in http://www.scribd.com/doc/16952419/Building-scalable-complex-apps-on-App-Engine
Basically just have a bunch of child entities for the User called UserFriends, thus splitting the big list and raising the limit to n*5000, where n is the number of UserFriends entities.

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.