Maintaining preprocessed data from a large, continuous data feed in MySQL - Java

I'm currently working on an analytics tool that every night (with a Java program) parses huge event logs (approx. 1 GB each) into a MySQL database - each event has about 40 attributes. The event logs are parsed "raw" into the database.
The user of the application needs to see different graphs and charts based on complicated calculations on the log data. So that the user doesn't have to wait several minutes for a chart request to be fulfilled, we need to store the preprocessed data somehow, ready to display (the user is able to filter by dates, units etc., but the largest part of the calculations can be done beforehand). My question is about how to maintain such preprocessed data - currently, all calculations are expressed in SQL, which we assume is the most efficient way (is this a correct assumption?). We need to be able to easily expand with new calculations for new charts, customer-specific wishes etc.
Some kind of materialized view comes to mind, but MySQL doesn't seem to support this feature. Similarly, we could execute the SQL calculations each night after the event logs have been imported, but in that case each calculation/preprocessed data table needs to know which events it has processed and which it hasn't. The table will contain up to a year's worth of data (i.e. events), so simply truncating the table and redoing all calculations doesn't seem to be the solution. Using triggers doesn't seem right either, as some calculations need to consider, for example, the time difference between two specific kinds of events.
I'm having a hard time weighing the pros and cons of possible solutions.

"Materialized Views" are not directly supported by MySQL. "Summary Tables" is another name for them in this context. Yes, that is the technique to use. You must create and maintain the summary table(s) yourself. They would be updated either as you insert data into the 'Fact' table, or periodically through a cron job, or simply after uploading the nightly dump.
The details are more than can be laid out in this forum, and the specific techniques that would work best for you depend on many factors. I have covered much of it in three blogs: DW, Summary Tables, and High speed ingestion. If you have further, more specific questions, please open a new Question and I will dig into more details as needed.
I have done this in several projects; usually the performance is 10x better than reading the Fact table; in one extreme case, it was 1000x. I always end up with UI-friendly "reports" coming from the Summary Table(s).
In some situations, you are actually better off building the Summary Tables and not saving the Fact rows in a table. Optionally, you could simply keep the source file in case of a need to reprocess it. Not building the Fact table will get the summary info to the end-user even faster.
If you are gathering data for a year, and then purging the 'old' data, see my blog on partitioning. I often use that on the Fact table, but rarely feel the need on a Summary Table, since the Summary table is much smaller (that is, not filling up disk).
One use case had a 1GB dump every hour. A Perl script moved the data to a Fact table and augmented 7 Summary Tables in less than 10 minutes. The system was also replicated, which added some extra challenges. So, I can safely say that 1GB a day is not a problem.
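To make the Summary Table approach concrete, here is a minimal sketch of a nightly refresh done from Java over plain JDBC. All table and column names (event_fact, daily_unit_summary, summary_watermark) are hypothetical, not from the question; the point is the pattern: aggregate only the rows loaded since the last run, fold them into the summary with INSERT ... ON DUPLICATE KEY UPDATE, and advance a watermark so each run knows which events it has already processed.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

public class NightlySummaryRefresh {

    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection(
                "jdbc:mysql://localhost/analytics", "user", "password")) {
            con.setAutoCommit(false);

            // Which fact rows have already been rolled up, and how far does the table go now?
            // (the watermark row is assumed to already exist)
            long lastDone = scalarLong(con,
                    "SELECT last_id FROM summary_watermark WHERE summary = 'daily_unit'");
            long newMax = scalarLong(con,
                    "SELECT COALESCE(MAX(id), 0) FROM event_fact");

            if (newMax > lastDone) {
                // Aggregate only the newly loaded events and merge them into the summary;
                // ON DUPLICATE KEY UPDATE folds them into existing (event_date, unit) rows.
                try (PreparedStatement ps = con.prepareStatement(
                        "INSERT INTO daily_unit_summary (event_date, unit, event_count, total_duration) " +
                        "SELECT DATE(event_time), unit, COUNT(*), SUM(duration) " +
                        "FROM event_fact WHERE id > ? AND id <= ? " +
                        "GROUP BY DATE(event_time), unit " +
                        "ON DUPLICATE KEY UPDATE " +
                        "event_count = event_count + VALUES(event_count), " +
                        "total_duration = total_duration + VALUES(total_duration)")) {
                    ps.setLong(1, lastDone);
                    ps.setLong(2, newMax);
                    ps.executeUpdate();
                }

                // Remember how far we got, so the next run skips these rows.
                try (PreparedStatement ps = con.prepareStatement(
                        "UPDATE summary_watermark SET last_id = ? WHERE summary = 'daily_unit'")) {
                    ps.setLong(1, newMax);
                    ps.executeUpdate();
                }
            }
            con.commit();
        }
    }

    private static long scalarLong(Connection con, String sql) throws Exception {
        try (Statement st = con.createStatement(); ResultSet rs = st.executeQuery(sql)) {
            rs.next();
            return rs.getLong(1);
        }
    }
}

Whether this runs right after the nightly import or from cron is a detail; the important part is that the charts read the summary table, not the fact table.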


Cache query results

Let's say, we have a highly configurable report system, which allows users to select columns, filters, and sorting.
All this configuration is sent to the backend, where it is transformed into SQL, executed against the DB, and then the user sees the report and can continue to work with it. But on each operation, like sorting, we still build a new query.
The transformation itself takes a few milliseconds, but the query execution against the DB can take 3-5 seconds (up to 20 if there are a lot of parallel executions).
So, I'm thinking about adding some sort of cache.
Currently, I see 3 ways:
Add one table to cache all results without filtering, and then on user request sort/filter it on Java side.
Add one table per result, still without the filters. In this case, I will have the possibility to sort/filter on a much smaller amount of data, but there are more than 10k different reports, and I don't think it would be good to create 10k small tables.
Like the first option, but with an LRU cache on the Java side. We can fit 2-3k report results in memory. It will usually be faster than the first option since we don't have a lot of parallel users, just users with lots of reports.
Cache invalidation will happen a few times a day.
What do you think is the best way to make it faster? What are the pros and cons of the proposed solutions from your perspective? What would you do if you were free to choose the database and technology (Java stack)?
OK, let's make sure I got it right.
there are more than 10k different reports
So it doesn't make sense to pre-calculate and pre-cache them; they have to be generated on demand.
there is not a lot of data in rows, just short strings, dates and integers. It’s not costly to fetch it in memory and even save there for a while
So caching a small amount of data can avoid a big, costly query - that's good.
Add one table to cache all results without filtering, and then on user request sort/filter it on Java side.
Problem is, most likely every report query will have different columns, with different names, so that doesn't fit a single table well unless you use a format like JSON, storing each cached result row as a JSON dictionary. And in that case indexing would be a problem: even if you create indexes on fields inside JSON values, if you have a zillion different column names across your many reports you'll need a zillion indexes too.
Smells like a can of worms.
Add one table per result, still without the filters. In this case, I will have the possibility to sort/filter on a much smaller amount of data, but there are more than 10k different reports, and I don't think it would be good to create 10k small tables.
Pros: each cache table can have the proper columns, data types and indexes. It is easy to invalidate the cache - just truncate it. You can make all the cache tables UNLOGGED to speed them up. And you can do all the extra sorting/filtering on the cached result using the same SQL queries you were using before, so this might be the simplest option to code. It is also nice for pagination if you only want to fetch part of the result. And it will be the fastest option as far as copying the results of reporting queries into the cache, since the cache is already in Postgres and there is no need to transfer data. You can also store the cache on another drive/SSD.
Cons: I've heard the main issue with tons of tables is that some filesystems slow down on directories with large numbers of files. That shouldn't be an issue on modern filesystems though, and I don't think Postgres itself is going to be bothered at all by 10k tables.
It might make queries on information_schema slow, and stuff like "\dt" in psql problematic, so the cache tables would be better hidden away in a "cache" schema so they don't interfere. This will also make it easier to exclude them from backups.
It will also use some RAM on the Postgres server to cache the cache tables; how much depends on the number of online users.
I'd say it would be worth a little bit of benchmarking. Create a schema, add 10k tables, see if something breaks.
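For what it's worth, here is a rough sketch (plain JDBC; the cache schema, the report_<id> naming and the refresh method are placeholders, not an established API) of what option 2 could look like: materialize a report's result once into an UNLOGGED table in a dedicated cache schema, and let later sort/filter/pagination requests run ordinary SQL against that small table.

import java.sql.Connection;
import java.sql.Statement;

public class ReportCache {

    // Materializes the (already validated) report query into cache.report_<id>.
    // reportId is assumed to be a trusted integer, never user-supplied text.
    public static void refresh(Connection con, int reportId, String reportSql) throws Exception {
        String cacheTable = "cache.report_" + reportId;
        try (Statement st = con.createStatement()) {
            st.executeUpdate("CREATE SCHEMA IF NOT EXISTS cache");
            st.executeUpdate("DROP TABLE IF EXISTS " + cacheTable);
            // UNLOGGED skips the WAL, which is fine for a cache that can be rebuilt at any time.
            st.executeUpdate("CREATE UNLOGGED TABLE " + cacheTable + " AS " + reportSql);
        }
    }

    // Later requests sort/filter/paginate the cached rows instead of re-running the big query:
    //   SELECT * FROM cache.report_42 WHERE region = ? ORDER BY total DESC LIMIT 50 OFFSET 100
}

Invalidation then really is just dropping or truncating the tables in the cache schema a few times a day.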
Like the first option, but with an LRU cache on the Java side. We can fit 2-3k report results in memory. It will usually be faster than the first option since we don't have a lot of parallel users, just users with lots of reports.
That's a bit of reinventing the wheel, and you'd have to reimplement the sort/filter in Java... plus the cache algorithms... meh.
There are other options though:
Put the cache in another database, on another machine. This may be a postgres instance, or another database (which may require rewriting some queries). Could be interesting only if the cache eats too much RAM on your database.
Put the cache in the web browser, and use JavaScript to filter/sort. That could be faster depending on the speed of the internet connection, and it would reduce server load, but you'd have to write lots of JavaScript code.
IMO you're right to be cautious about the large number of tables, but if it works well, it really is the simplest solution...

Collection processing or database request? Which one is better

This is my first post on stackoverflow, so please be nice to me :-)
So let me explain the context. I'm developing a web service with standard layers (resources, services, DAO layer...). I use JPA with the Hibernate implementation to map my object model to the database.
For a parent class A and a child class B, most of the time when I want to find a B object in a collection, I use the Stream API to filter the collection based on what I want. My question here is more general: is it better to search for an object by querying the database (from my point of view this will cause a lot of calls to the database but use less CPU), or to do the opposite and search the object model, processing the collection in memory (this will cause fewer database calls, but more CPU processing)?
If you consider latency, the database will always be slower.
So you gotta ask yourself some questions:
How far away is the database (latency)?
How big is the dataset?
How do I process it?
Do I have any major runtime issues?
from my point of view this will cause a lot of calls to the database but use less CPU), or to do the opposite and search the object model, processing the collection in memory (this will cause fewer database calls, but more CPU processing)
Your program is probably not written very efficiently. I suggest you check the big-O complexity of your processing if you have any major runtime issues.
Your question is very broad, so it's hard to say which approach would be best for your use case.
Use the database to return the data you need, and use Java to perform any processing on it that would be complicated to do in a JPQL/SQL query.
Databases are designed to perform queries more efficiently than Java (stream or no).
Besides, fetching a lot of data from the database only to keep part of it is not efficient.
The database is usually faster since it is optimized for requesting specific data. Usually one would add indexes to speed up querying on certain fields.
TL;DR: Filter your data in the database and process it in Java.
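As a hedged illustration of that TL;DR (the Parent/Child entities, the Status enum and the EntityManager wiring are made up for the example, not taken from the question): the first method loads the whole collection and throws most of it away in Java, while the second lets the database return only the matching rows.

import java.util.List;
import java.util.stream.Collectors;
import javax.persistence.EntityManager;

public class ChildLookup {

    private final EntityManager em;

    public ChildLookup(EntityManager em) {
        this.em = em;
    }

    // In-memory filtering: every child of the parent is fetched, then most are discarded.
    public List<Child> activeChildrenInMemory(Parent parent) {
        return parent.getChildren().stream()
                .filter(c -> c.getStatus() == Status.ACTIVE)
                .collect(Collectors.toList());
    }

    // Database filtering: only the rows matching the predicate are transferred.
    public List<Child> activeChildrenFromDb(Parent parent) {
        return em.createQuery(
                "SELECT c FROM Child c WHERE c.parent = :parent AND c.status = :status",
                Child.class)
                .setParameter("parent", parent)
                .setParameter("status", Status.ACTIVE)
                .getResultList();
    }
}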
This isn't an easy question to answer, since there are many different factors that would influence my decision to go to the db or not. First, I think it's fair to say that, for almost every app I've worked on in the past 20 years, hitting the DB for information is the default strategy. More recently (say past 10 or so years) data access through web service calls has become common as well.
For me, the main question would be something along the lines of, "Are there any situations when I would not hit an external resource (DB, Service, or even file read) for data every time I need it?"
So, I'll outline some of the things I would consider.
Is the data search space very small?
If you are searching a data space of tens of different records, then this information might be a candidate for non-DB storage. On the other hand, once you get past a fairly small set of records, this approach becomes increasingly untenable. Examples of these "small sets" might be something like salutations (Mr., Ms., Dr., Mrs., Lord). I look for small sets of data that rarely change, which I, as a lazy developer, wouldn't mind typing into a configuration file. Once I get past something like 50 different records (like US States, for example), I want to pull that info from a DB or service call.
Are the data cacheable?
If you have multiple requests that could legitimately use the exact same data, then leverage caching in your application. Examine the data and expected usage of your service for opportunities to leverage regularities in data and likely requests to cache data whenever possible. Remember to consider cache keys, how long items should be cached, and when cached items should be evicted.
In many web usage scenarios, it's not uncommon that each display includes a fairly large amount of cached information and a small amount of dynamic data. Menus and other navigation items are good candidates for caching. User-specific data, such as contract-specific pricing in an eCommerce app, are often poor candidates.
Can you pre-load some data into cache?
Some items can be read once and cached for the entire duration of your application. A list of US States and/or Canadian Provinces is a good example here. These almost never change, so once read from the db, you would rarely need to read them again. Consider application components that can load such data on startup, and then hold this data in an appropriate collection.
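A minimal sketch of that last idea (the StateDao and State types are hypothetical): read the reference data once at startup and answer every later lookup from an immutable in-memory map, so the database is never hit again for it.

import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

public class StateCache {

    private final Map<String, State> statesByCode;

    public StateCache(StateDao stateDao) {
        Map<String, State> byCode = new HashMap<>();
        // Single read at application startup; US States / Canadian Provinces almost never change.
        for (State state : stateDao.findAll()) {
            byCode.put(state.getCode(), state);
        }
        this.statesByCode = Collections.unmodifiableMap(byCode);
    }

    // All later lookups are served from memory, with no database round trip.
    public State byCode(String code) {
        return statesByCode.get(code);
    }
}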

More efficient to do SELECT and compare in Java or DELETE and INSERT

I am hitting a REST API to get data from a service. I transform this data and store it in a database. I will have to do this at some interval, say every 15 minutes, and make sure the database has the latest information.
I am doing this in a Java program. I am wondering if, after I have queried all the data, it would be better to do:
1. SELECT statements, compare against the transformed data, and do UPDATEs (DELETE all records associated with what changed and INSERT the new ones)
OR
2. DELETE ALL and INSERT ALL every time.
Option 1 has the potential for far fewer transactions, but it guarantees a SELECT on all records because we are comparing, and probably not many UPDATEs since I don't expect the data to change much. Its downside is doing comparisons on all records to detect a change.
I am planning on doing this using Spring Boot, a JPA layer, and possibly Postgres.
The short answer is "It depends. Test and see for your usecase."
The longer answer: this feels like premature optimization. And the general response to premature optimization is "don't." Especially in the DB realm, what would be best in one situation can be awful in another. There are a number of factors, including (but not limited to) schema, indexes, HDD backing speed, concurrency, amount of data, network speed, latency, and so on:
1. First, get it working
2. Identify what's wrong → get a metric
3. Measure against that metric
4. Make any obvious or necessary changes
5. Repeat 1 through 4 as appropriate
The first question I would ask of you is "What does better mean?" Once you define that, the path forward will likely become clearer.

In populating an ObservableList, do I have to load ALL the records from my database?

So I'm porting my Swing Java database application to Java FX (still a beginner here, I recently just learned the basics of FXML and the MVC pattern, so please bear with me).
I intend to load the data from my existing database to the "students" ObservableList so I can show it on a TableView, but on my original Swing code, I have a search TextField, and when the user clicks on a button or presses Enter, the program:
Executes an SQLite command that searches for specific records, and retrieves the RecordSet.
Creates a DefaultTableModel based on the RecordSet contents
And throws that TableModel to the JTable.
However, Java FX is a completely different beast (or at least it seems so to me--don't get me wrong, I love Java FX :D ) so I'm not sure what to do.
So, my question is: do I have to load ALL the students in the database and then use Java code to filter out students that don't fit the search criteria (and display all students when the search text is blank), or do I still use SQLite to filter and retrieve records (which means I need to clear the list and add students every time a search is performed, and maybe it will also mess up the bindings? Maybe there will be a speed penalty with this method too? Besides that, it will also reset the currently selected record because I clear the list - basically, bad UI design that will negatively impact usability)?
Depending on the right approach, there is also a follow-up question (sorry, I really can't find the answer to these even after Googling):
If I get ALL students from database and implement a search feature in Java, won't it use up more RAM than it should, because I am storing ALL the database data in RAM, instead of just the ones searched for? I mean, sure, even my lowly laptop has 4GB RAM, but the feeling of using more memory than I should makes me feel somewhat guilty LOL
If I choose to just update the contents of the ObservableList every time a new search is performed, will it mess up the bindings? Do I have to set up the bindings again? How do I clear the contents of the ObservableList before adding the new contents?
I also have the idea of just setting the selected table item to the first record that matches the search string but I think it will be difficult to use, since only one record can be highlighted per search. Even if we highlight multiple rows, it'd be difficult to browse all selected items.
Please give me the proper way, not the "easy" way. This is my first time implementing a pattern (MVC or am I actually doing MVP, I don't know) and I realized how unmaintainable and ugly my previous programs are because I used my own style. This is a relatively big project that I need to support and improve for several years so having clean code and doing stuff the right way should help in maintaining the functionality of this program.
Thank you very much in advance for your help, and I hope I don't come off as a "dumb person who can't even Google" in asking these questions. Please bear with me here.
Basic design tradeoffs
You can, of course, do this either of the ways you describe. The basic tradeoffs are:
If you load everything from the database, and filter the table in Java, you use more memory (though not as much as you might think, as explained below)
If you filter from the database and reload every time the user changes the filter, there will be a bigger latency (delay) in displaying the data, as a new query will be executed on the database, with (usually) network communication between the database and the application being the biggest bottleneck (though there are others).
Database access and concurrency
In general, you should perform database queries on a background thread (see Using threads to make database requests); if you are frequently making database queries (i.e. filtering via the database), this gets complex and involves frequently disabling controls in the UI while a background task is running.
TableView design and memory management
The JavaFX TableView is a virtualized control. This means that the visual components (cells) are created only for visible elements (plus, perhaps, a small amount of caching). These cells are then reused as the user scrolls around, displaying different "items" as required. The visual components are typically quite memory-consumptive (they have hundreds of properties - colors, font properties, dimensions, layout properties, etc etc - most of which have CSS representations), so limiting the number created saves a lot of memory, and the memory consumption of the visible part of the table view is essentially constant, no matter how many items are in the table's backing list.
General memory consumption computations
The items observable list that forms the table's backing list contains only the data: it is not hard to ballpark-estimate the amount of memory consumed by a list of a given size. Strings use 2 bytes per character, plus a small fixed overhead, doubles use 8 bytes, ints use 4 bytes, etc. If you wrap the fields in JavaFX properties (which is recommended), there will be a few bytes overhead for each; each object has an overhead of ~16 bytes, and references themselves typically use up to 8 bytes. So a typical Student object that stores a few string fields will usually consume of the order of a few hundred bytes in memory. (Of course, if each has an image associated with it, for example, it could be a lot more.) Thus if you load, say 100,000 students from a database, you would use up of the order of 10-100MB of RAM, which is pretty manageable on most personal computer systems.
Rough general guidelines
So normally, for the kind of application you describe, I would recommend loading what's in your database and filtering it in memory. In my usual field of work (genomics), where we sometimes need 10s or 100s of millions of entities, this can't be done. (If your database contains, say, all registered students in public schools in the USA, you may run into similar issues.)
As a general rule of thumb, though, for a "normal" object (i.e. one that doesn't have large data objects such as images associated with it), your table size will be prohibitively large for the user to comfortably manage (even with filtering) before you seriously stretch the memory capacity of the user's machine.
Filtering a table in Java (all objects in memory)
Filtering in code is pretty straightforward. In brief, you load everything into an ObservableList, and wrap the ObservableList in a FilteredList. A FilteredList wraps a source list and a Predicate, which returns true if an item should pass the filter (be included) or false if it should be excluded.
So the code snippets you would use might look like:
ObservableList<Student> allStudents = loadStudentsFromDatabase();
FilteredList<Student> filteredStudents = new FilteredList<>(allStudents);
studentTable.setItems(filteredStudents);
And then you can modify the predicate based on a text field with code like:
filterTextField.textProperty().addListener((obs, oldText, newText) -> {
    if (newText.isEmpty()) {
        // no filtering:
        filteredStudents.setPredicate(student -> true);
    } else {
        filteredStudents.setPredicate(student ->
            // whatever logic you need:
            student.getFirstName().contains(newText) || student.getLastName().contains(newText));
    }
});
This tutorial has a more thorough treatment of filtering (and sorting) tables.
Comments on implementing "filtering via queries"
If you don't want to load everything from the database, then you skip the filtered list entirely. Querying the database will almost certainly not work fast enough to filter (using a new database query) as the user types, so you would need an "Update" button (or action listener on the text field) which recomputed the new filtered data. You would probably need to do this in a background thread too. You would not need to set new cellValueFactorys (or cellFactorys) on the table's columns, or reload the columns; you would just call studentTable.setItems(newListOfStudents); when the database query finished.
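To sketch what that could look like (the studentDao.searchStudents(...) call and the Student class are assumptions for the example, not part of the answer): the query runs in a javafx.concurrent.Task on a background thread, and the table's items are replaced on the FX Application Thread only when the task succeeds.

Task<List<Student>> searchTask = new Task<List<Student>>() {
    @Override
    protected List<Student> call() throws Exception {
        // runs on a background thread, not the FX Application Thread
        return studentDao.searchStudents(filterTextField.getText());
    }
};
searchTask.setOnSucceeded(e ->
    // back on the FX Application Thread: safe to update the UI
    studentTable.setItems(FXCollections.observableArrayList(searchTask.getValue())));
searchTask.setOnFailed(e -> searchTask.getException().printStackTrace());

Thread thread = new Thread(searchTask, "student-search");
thread.setDaemon(true);
thread.start();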

Data Stitching Join/Merge - Oracle Vs Java based technique

Currently I am facing a distinct issue where I receive data from a web service call, and the same needs to be loaded into an Oracle table.
Scenario:
- I have a very huge table with 500 columns - all columns are mandatory, and there is no option to split the table.
- The dataset is 50m records, which I am trying to export from the source system - and it is continuously increasing.
- In each request to the web service (at the source system) I receive data for 50 columns, hence I need to submit 10 requests of 50 columns each to get a full record.
- Also, I can only receive 100,000 (1 lakh) records per request for a specific set of columns.
Now, to import the same data into the Oracle DB at the destination system I have the following two choices:
1. First export the data into temporary tables of 50 columns each and then run a join across all of them to create the final table with all 500 columns
2. Fire 10 parallel requests of 50 columns each, stitch the data together in my Java program, and then send an insert query with all 500 columns
Here I would like to know which technique works out better: an Oracle-based table join, or stitching on the Java side using the primary key column?
As the data set is very large, I am looking purely at the performance aspect. Are there also any more optimized ways to solve the same problem?
From a performance point of view the Oracle-based solution would clearly win. From an implementation point of view (aiming for a clear and simple solution) Oracle tables win again. Here is why:
Architecture point of view: combining the data in your app will make your app stateful. From a simple stateless (receive-save-forget) application you would turn it into a complex state-aware one (save, look for matching records, didn't find anything, store, wait, look again, etc.). This is much harder to develop, maintain and debug.
Performance point of view: saving data into multiple tables and later combining them into one (either via views, stored procedures or simple selects) is something Oracle is designed for. An immense amount of development time has been spent on optimizing these basic features. Whatever you came up with to implement the same features (even though you are aware of some specifics of your data) would likely perform worse.
So overall I would strongly suggest Option #1: leave it to Oracle to do the hard part. Depending on how you want to use this data after the import (almost real-time / once in a while / after extra filtering is applied) you can choose how you construct the final records, using one of these:
stored procedures
Oracle jobs
views.
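As a rough illustration of Option #1 (plain JDBC with made-up staging and column names; the real statement would list all 500 columns): each 50-column extract is bulk-loaded into its own staging table keyed by the same primary key, and a single joined INSERT ... SELECT lets Oracle assemble the full rows.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class StitchInOracle {

    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection(
                "jdbc:oracle:thin:@//dbhost:1521/ORCL", "user", "password");
             Statement st = con.createStatement()) {
            con.setAutoCommit(false);

            // Staging tables stg_part1 .. stg_part10 each hold the primary key plus
            // their 50 columns; Oracle joins them into the final 500-column table.
            st.executeUpdate(
                "INSERT /*+ APPEND */ INTO final_table (pk, col_1_01 /* ..., col_10_50 */) " +
                "SELECT p1.pk, p1.col_1_01 /* ..., p10.col_10_50 */ " +
                "FROM stg_part1 p1 " +
                "JOIN stg_part2 p2 ON p2.pk = p1.pk " +
                // joins for stg_part3 .. stg_part9 omitted for brevity
                "JOIN stg_part10 p10 ON p10.pk = p1.pk");

            con.commit();
        }
    }
}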
