Working with files using Java

I have a question about the best way of handling huge files in Java.
Should we use a NoSQL database like Cassandra, or try to use our existing Oracle database (and dump the content of the file into it)?
Each record in my file has at most 1 or 2 fields, and mostly all I need to do with the file content is search for an Id and return a boolean.
The file can contain tens of millions of records, or as few as a few thousand.
The file can also be refreshed on a daily basis; whenever it is refreshed I need to clear all previous values.
Any suggestions would be helpful!
Regards,
Vicky

As per your requirements:
Oracle
Is good for indexing and fits your requirements even if the daily data runs to tens of millions of rows.
The index will be held in memory and searches on such a small data set will be fast. If the table is also small, you can ask Oracle to keep the whole table in memory too, which will be even faster if you ever need another column.
You can drop the table every day and import the file again as a new table; a minimal sketch follows. This should work.
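For illustration, a minimal JDBC sketch of the daily refresh (here just a TRUNCATE before the reload) and the search-an-Id-return-boolean query; the daily_ids table and id column are hypothetical names:

```java
import java.sql.*;

public class IdLookup {
    // Daily refresh: clear yesterday's values before loading today's file.
    static void clearForReload(Connection con) throws SQLException {
        try (Statement st = con.createStatement()) {
            st.execute("TRUNCATE TABLE daily_ids");
            // ...bulk-load today's file here (e.g. SQL*Loader or batched inserts)
        }
    }

    // Search an Id and return a boolean, served from the index.
    static boolean idExists(Connection con, long id) throws SQLException {
        try (PreparedStatement ps =
                 con.prepareStatement("SELECT 1 FROM daily_ids WHERE id = ?")) {
            ps.setLong(1, id);
            try (ResultSet rs = ps.executeQuery()) {
                return rs.next();
            }
        }
    }
}
```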
Cassandra
Is also good for indexing, and all your searches will be fast as well (similar to Oracle for such small data).
Cassandra is a NoSQL database designed for scalability, high write throughput, and availability for high-volume data and queries.
To deliver those properties, Cassandra generally runs in clustered environments.
I would suggest checking your requirements: if you just want to keep the data in a DB and query it once in a while, or at maybe 100 requests per second, then using Cassandra is like driving a nail with a sledgehammer when a small hammer or mallet is enough.

Related

Cache query results

Let's say, we have a highly configurable report system, which allows users to select columns, filters, and sorting.
All this configuration comes to the back end, where it is transformed into SQL, executed against the DB, and the user then sees his report and can continue to work with it. But on each operation, like sorting, we still build and run a new query.
The transformation itself takes a few milliseconds, but the query execution against the DB can take 3-5 seconds (up to 20 if there are a lot of parallel executions).
So, I'm thinking about adding some sort of cache.
Currently, I see 3 ways:
Add one table to cache all results without filtering, and then on user request sort/filter it on Java side.
Add one table per result, still without the filters. In this case I would be able to sort/filter on much less data, but there are more than 10k different reports, and I don't think it would be good to create 10k small tables.
Like the first option, but with an LRU cache on the Java side. We can fit 2-3k report results in memory. It will usually be faster than the first option, since we don't have a lot of parallel users, just users with lots of reports.
The cache will be invalidated a few times a day.
What do you see as the best way to make it faster? What are the pros and cons of the proposed solutions from your perspective? What would you do if you were free to select the database and technology (Java stack)?
OK, let's make sure I got it right.
there are more than 10k different reports
So it doesn't make sense to pre-calculate and pre-cache them, they have to be generated on-demand.
there is not a lot of data in rows, just short strings, dates and integers. It’s not costly to fetch it in memory and even save there for a while
So caching a small amount of data can avoid a big, costly query; that's good.
Add one table to cache all results without filtering, and then on user request sort/filter it on Java side.
Problem is, every report query will most likely have different columns with different names, so a single table doesn't fit well unless you use a format like JSON, storing each cached result row as a JSON dictionary. And in that case indexing becomes a problem: even if you create indexes on fields inside the JSON values, with a zillion different column names across your many reports you'll need a zillion indexes too...
Smells like a can of worms.
Add one table per result, still without the filters. In this case, I will have the possibility to sort/filter on much less amount of data, but there are more than 10k different reports, and I don't think it would be good to create 10k small tables.
Pros: each cache table can have the proper columns, data types, and indexes. The cache is easy to invalidate: just truncate the table. You can set the cache tables to UNLOGGED to make them faster. You can do all the extra sorting/filtering on the cached result with the same SQL queries you were using before, so this may be the simplest option to code. It is also nice for pagination if you only want to fetch part of the result. And it will be the fastest option for copying the results of reporting queries into the cache, since the cache already lives in Postgres and no data needs to be transferred. You can also store the cache on another drive/SSD.
Cons: I've heard the main issue with tons of tables is filesystems that slow down on directories with large numbers of files. That shouldn't be an issue on modern filesystems, though, and I don't think Postgres itself is going to be bothered at all by 10k tables.
It might make queries on information_schema slow, and stuff like "\dt" in psql problematic, so the cache tables would be better hidden away in a "cache" schema so they don't interfere. This will also make it easier to exclude them from backups.
It will also use some RAM on the Postgres server to cache the cache tables; how much depends on the number of online users.
I'd say it would be worth a little bit of benchmarking. Create a schema, add 10k tables, see if something breaks.
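If you want to try that benchmark, here is a minimal sketch from Java against Postgres; the connection string, schema name, and column layout are all hypothetical:

```java
import java.sql.*;

public class CacheTableBenchmark {
    public static void main(String[] args) throws SQLException {
        try (Connection con = DriverManager.getConnection(
                 "jdbc:postgresql://localhost/reports", "user", "pass");
             Statement st = con.createStatement()) {
            st.execute("CREATE SCHEMA IF NOT EXISTS cache");
            // Create 10k small UNLOGGED cache tables and see if anything breaks.
            for (int i = 0; i < 10_000; i++) {
                st.execute("CREATE UNLOGGED TABLE IF NOT EXISTS cache.report_" + i
                        + " (row_id int, col_a text, col_b date, col_c int)");
            }
        }
    }
}
```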
Like the first option, but LRU cache on Java side. We can fit in memory 2-3k report results. It will be usually faster than in the first option since we don't have a lot of parallel users, just users with lots of reports.
That's a bit of reinventing the wheel, and you'd have to reimplement the sort/filter in Java... plus the cache algorithms... meh.
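(For what it's worth, the LRU bookkeeping itself is tiny; the real wheel-reinvention is the sort/filter code. A minimal LRU sketch with LinkedHashMap, where the key and value types are placeholders:)

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Minimal LRU cache: access-ordered LinkedHashMap that evicts the eldest entry.
class LruReportCache extends LinkedHashMap<String, List<Object[]>> {
    private final int maxEntries;

    LruReportCache(int maxEntries) {
        super(16, 0.75f, true); // true = access order, which gives LRU behavior
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<String, List<Object[]>> eldest) {
        return size() > maxEntries; // evict when over capacity, e.g. 2-3k results
    }
}
```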
There are other options though:
Put the cache in another database, on another machine. This may be another Postgres instance, or a different database (which may require rewriting some queries). It could be interesting, but only if the cache eats too much RAM on your database server.
Put the cache in the web browser, and use JavaScript to filter/sort. That could be faster depending on the speed of the internet connection, and it would reduce server load, but you'd have to write a lot of JavaScript code.
IMO you're being cautious about the large number of tables, and it is good to be cautious, but if it works well, it really is the simplest solution...

How to filter and sort a few GB of data?

I have a 36 GB file which has about 600 million lines of data in this structure:
LogID,SensorID,ValueNumeric,ValueString,DateAdded,VariableName,Name
1215220724,1182,1,0000,,2016-01-04 12:56:57.7770000,Scan_Frequency,MIC_2
I am only interested in the date, value, variable name and stream (name). The problem is that there are a lot of duplicate entries and the data is not ordered by date.
My current solution is that I go through the first 100,000 lines, read the names of the variables (there are 833 of them) and create a table in the DB for each of them. For the primary key I use the date (I cut off the seconds and milliseconds), so the DB will be free of duplicates. I know that a string is not the best choice for a primary key.
Then I read the file again and insert the data into the tables, but it's too slow. My estimate is that I should end up with at least 10x fewer lines.
Does anyone have a better idea of how to read such a big file, sort it by date and remove duplicates? It would be enough to save data at 5-minute intervals.
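(For reference, the "5-minute intervals" part, truncating each timestamp to its 5-minute bucket so the bucket can serve as the dedup key, is small in Java:)

```java
import java.time.LocalDateTime;

public class TimeBucket {
    // Truncate a timestamp to its 5-minute bucket, e.g. 12:56:57.777 -> 12:55:00.
    static LocalDateTime toFiveMinuteBucket(LocalDateTime t) {
        return t.withSecond(0).withNano(0).withMinute(t.getMinute() / 5 * 5);
    }

    public static void main(String[] args) {
        LocalDateTime t = LocalDateTime.parse("2016-01-04T12:56:57.777");
        System.out.println(toFiveMinuteBucket(t)); // prints 2016-01-04T12:55
    }
}
```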
I'd use an Elasticsearch + Logstash based solution (they are free and work very well with their installation defaults). Logstash is designed to ingest data from several sources (including CSV files) and Elasticsearch is a NoSql database that does an amazing job both indexing documents and querying them.
See this question and answer for a starting point, and here is the documentation.
Your database will offer a tool to import CSV files directly. This is most likely much faster than using JDBC. Furthermore, chances are high that it also offers a tool to remove the duplicates you mention during the import. Once the data is in the database, it will take care of sorting it for you.
Just to give you an example: if you were using MySQL, there is the mysqlimport utility, which offers an option to remove duplicates during the import using --replace.
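A hedged sketch of the same idea from Java via JDBC with MySQL's LOAD DATA; the table, columns and file path are hypothetical, and the driver must be allowed to read local files:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class CsvImport {
    public static void main(String[] args) throws Exception {
        // allowLoadLocalInfile must be enabled for LOAD DATA LOCAL to work.
        String url = "jdbc:mysql://localhost/sensors?allowLoadLocalInfile=true";
        try (Connection con = DriverManager.getConnection(url, "user", "pass");
             Statement st = con.createStatement()) {
            // REPLACE makes rows with a duplicate key overwrite the old row,
            // which removes duplicates during the import itself.
            st.execute("LOAD DATA LOCAL INFILE '/data/log.csv' "
                     + "REPLACE INTO TABLE readings "
                     + "FIELDS TERMINATED BY ',' LINES TERMINATED BY '\\n' "
                     + "IGNORE 1 LINES "
                     + "(log_id, sensor_id, value_numeric, value_string, "
                     + " date_added, variable_name, name)");
        }
    }
}
```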

Java File IO vs Local database

I am working on a project that involves parsing through a LARGE amount of data rapidly. Currently this data is on disk and broken down into a directory hierarchy:
(Folder: DataSource) -> (Files: Day1, Day2, Day3...Day1000...)
(Folder: DataSource2) -> (Files: Day1, Day2, Day3...Day1000...)
...
(Folder: DataSource1000) -> ...
...
Each Day file consists of entries that need to be accessed very quickly.
My initial plan was to use traditional FileIO in Java to access these files, but upon further reading I began to fear that this might be too slow.
In short, what is the fastest way I can selectively load entries from my filesystem from varying DataSources and Days?
The issue could be solved both ways, but it depends on a few factors.
Go for file I/O:
if the volume is under millions of rows
if you don't do complicated queries, as Jon Skeet said
if your reference for fetching a row is the folder name "DataSource" as the key
Go for a DB:
if your program will read through millions of records
if you need complicated selection, even fetching multiple rows with a single select
if you have knowledge of creating a basic table structure for the DB
Depending on the architecture you are using, you can implement different ways of caching. In JBoss there is the built-in JBoss Cache, and there is also third-party open-source software that provides caching, like Redis or EhCache, depending on your needs. Basically, caches store objects in memory; some entries are passivated/activated on demand, and when memory is exhausted they are stored in a physical file on disk, from which the caching mechanism can easily reactivate and unmarshal them. This lowers the number of database connections held by your program. There are other caches, but here are some of the ones I've worked with (a small Redis example follows the list):
JBoss: http://www.jboss.org/jbosscache/
Redis: http://redis.io/
EhCache: http://ehcache.org/
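As an illustration of the cache-aside pattern with one of these, here is a minimal sketch using the Jedis client for Redis; the key name, TTL and the loader are placeholders:

```java
import redis.clients.jedis.Jedis;

public class CacheExample {
    public static void main(String[] args) {
        try (Jedis jedis = new Jedis("localhost", 6379)) {
            String key = "datasource1:day42";   // hypothetical cache key
            String cached = jedis.get(key);
            if (cached == null) {
                cached = loadFromDisk();        // expensive read, stubbed below
                jedis.setex(key, 3600, cached); // cache for an hour
            }
            System.out.println(cached);
        }
    }

    private static String loadFromDisk() {
        return "entries...";                    // placeholder for real file I/O
    }
}
```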
what is the fastest way I can selectively load entries from my filesystem from varying DataSources and Days?
selectively means filtering, so my answer is a localhost database. Generally speaking, if you filter, sort, paginate or extract distinct records from a large number of records, it's hard to beat a localhost SQL server. You get a query optimizer (nobody does that in Java), a cache (which requires effort in Java, especially the invalidation), database indexes (I haven't seen that done in Java either), etc. It's possible to implement these things manually, but then you are writing a database in Java.
On top of this you gain access to higher-level SQL features like window aggregates, so in most cases there is no need to post-process the data in Java.
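For instance, a window function can compute the latest entry per data source directly in SQL, so Java only reads the final result; the schema and connection URL here are hypothetical:

```java
import java.sql.*;

public class LatestPerSource {
    public static void main(String[] args) throws SQLException {
        // Hypothetical schema: entries(data_source, day, payload).
        String sql = "SELECT data_source, day, payload FROM ("
                   + "SELECT e.*, ROW_NUMBER() OVER ("
                   + "PARTITION BY data_source ORDER BY day DESC) AS rn "
                   + "FROM entries e) t WHERE rn = 1"; // latest entry per source
        try (Connection con = DriverManager.getConnection(
                 "jdbc:postgresql://localhost/data", "user", "pass");
             Statement st = con.createStatement();
             ResultSet rs = st.executeQuery(sql)) {
            while (rs.next()) {
                System.out.println(rs.getString("data_source")
                        + " " + rs.getDate("day"));
            }
        }
    }
}
```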

Fastest way for inserting very large number of records into a Table in SQL

The problem is, we have a huge number of records (more than a million) to be inserted into a single table from a Java application. The records are created by the Java code, it's not a move from another table, so INSERT/SELECT won't help.
Currently, my bottleneck is the INSERT statements. I'm using PreparedStatement to speed up the process, but I can't get more than 50 records per second on a normal server. The table is not complicated at all, and there are no indexes defined on it.
The process takes too long, and the time it takes will make problems.
What can I do to get the maximum speed (INSERT per second) possible?
Database: MS SQL 2008. Application: Java-based, using Microsoft JDBC driver.
Batch the inserts. That is, send 1000 rows at a time rather than one row at a time, so you hugely reduce the number of round trips/server calls.
See Performing Batch Operations on MSDN for the JDBC driver. This is the easiest method without re-engineering to use genuine bulk methods.
Each insert must be parsed, compiled and executed. Batching means much less parsing/compiling, because a batch of 1000 (for example) inserts is compiled in one go.
There are better ways, but this works if you are limited to generated INSERTs
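A minimal sketch of batched inserts with PreparedStatement; the table and column names are hypothetical:

```java
import java.sql.*;
import java.util.List;

public class BatchInsert {
    record Row(long id, String name) {}

    static void insertAll(Connection con, List<Row> rows) throws SQLException {
        String sql = "INSERT INTO items (id, name) VALUES (?, ?)";
        con.setAutoCommit(false);                       // commit once, not per row
        try (PreparedStatement ps = con.prepareStatement(sql)) {
            int n = 0;
            for (Row r : rows) {
                ps.setLong(1, r.id());
                ps.setString(2, r.name());
                ps.addBatch();
                if (++n % 1000 == 0) ps.executeBatch(); // 1000 rows per round trip
            }
            ps.executeBatch();                          // flush the remainder
            con.commit();
        }
    }
}
```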
Use BULK INSERT - it is designed for exactly what you are asking and significantly increases the speed of inserts.
Also (just in case you really do have no indexes), you may want to consider adding an index; some indexes (most obviously an index on the primary key) can improve the performance of inserts.
The actual rate at which you should be able to insert records will depend on the exact data, the table structure and also on the hardware / configuration of the SQL server itself, so I can't really give you any numbers.
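If you do try BULK INSERT, it can be issued from Java like any other statement. A sketch with hypothetical table, path and connection settings; note the file must be visible to the SQL Server machine, not the client:

```java
import java.sql.*;

public class BulkInsertExample {
    public static void main(String[] args) throws SQLException {
        String url = "jdbc:sqlserver://localhost;databaseName=app;encrypt=false";
        try (Connection con = DriverManager.getConnection(url, "user", "pass");
             Statement st = con.createStatement()) {
            // The path is resolved on the SQL Server host, not the Java client.
            st.execute("BULK INSERT items FROM 'C:\\data\\items.csv' "
                     + "WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\\n')");
        }
    }
}
```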
Have you looked into bulk operations?
Have you considered using batch updates?
Is there any integrity constraint or trigger on the table?
If so, dropping it before the inserts will help, but you have to be sure you can afford the consequences.
Look into Sql Server's bcp utility.
This would mean a big change in your approach in that you'd be generating a delimited file and using an external utility to import the data. But this is the fastest method for inserting a large number of records into a Sql Server db and will speed up your load time by many orders of magnitude.
Also, is this a one-time operation you have to perform or something that will occur on a regular basis? If it's one time I would suggest not even coding this process but performing an export/import with a combination of db utilities.
I would recommend using an ETL engine for this. You can use Pentaho; it's free. ETL engines are optimized for bulk-loading data, and for any transformation/validation that is required.

Speed up operation on MySQL

I'm currently writing a Java project against MySQL in a cluster with ten nodes. The program simply pulls some information from the database, does some calculation, then pushes some data back to the database. However, there are millions of rows in the table. Is there any way to split up the job and utilize the cluster architecture? How do I do multi-threading on different nodes?
I watched an interesting presentation on using Gearman to do Map/Reduce-style things on a MySQL database. It might be what you are looking for: see here. There is a recording on the MySQL webpage here (you have to register for mysql.com, though).
I'd think about doing that calculation in a stored procedure on the database server and pass on bringing millions of rows to the middle tier. You'll save yourself a lot of bytes on the wire. Depending on the nature of the calculation, your schema, indexing, etc. you might find that the database server is well equipped to do that calculation without having to resort to multi-threading.
I could be wrong, but it's worth a prototype to see.
Assume the table (A) you want to process has 10 million rows. Create a table B in the database to store the set of rows processed by each node. Write the Java program so that it first fetches the last row processed by other nodes, then adds an entry to the same table announcing the range of rows it is going to process (you decide this number).
In our case, assume each node can process 1000 rows at a time. Node 1 fetches table B and finds it empty, so it inserts a row ('Node1', 1000), announcing that it is processing table A up to primary key <= 1000 (assuming the primary key of table A is numeric and ascending). Node 2 comes along, finds that 1000 primary keys are already claimed by another node, and inserts a row ('Node2', 2000), telling the others that it is processing rows 1001 to 2000. Please note that access to table B must be synchronized, i.e. only one node can work on it at a time; a sketch of the claim step follows.
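A rough sketch of that claim step over JDBC, with a hypothetical table_b(node_name, upper_bound) schema and MySQL table locks standing in for the synchronization:

```java
import java.sql.*;

public class RangeClaimer {
    static final int BATCH = 1000;

    // Claim the next BATCH rows of table A; returns the upper bound claimed.
    static long claimNextRange(Connection con, String nodeName) throws SQLException {
        try (Statement st = con.createStatement()) {
            st.execute("LOCK TABLES table_b WRITE"); // serialize concurrent claimers
            try {
                ResultSet rs = st.executeQuery(
                        "SELECT COALESCE(MAX(upper_bound), 0) FROM table_b");
                rs.next();
                long upper = rs.getLong(1) + BATCH;
                try (PreparedStatement ps = con.prepareStatement(
                        "INSERT INTO table_b (node_name, upper_bound) VALUES (?, ?)")) {
                    ps.setString(1, nodeName);
                    ps.setLong(2, upper);
                    ps.executeUpdate();
                }
                return upper; // this node now owns rows (upper - BATCH, upper]
            } finally {
                st.execute("UNLOCK TABLES");
            }
        }
    }
}
```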
Since you only have one MySQL server, make sure you're using the InnoDB engine to reduce table locking on updates.
Also, I'd try to keep your queries as simple as possible, even if you have to run more of them. This can increase the chances of query cache hits, and it reduces the overall workload on the backend by offloading some of the query matching and processing to the frontends (where you have more resources). It will also reduce the time a row lock is held, therefore decreasing contention.
The proposed Gearman solution is probably the right tool for this job, as it will let you offload batch processing from MySQL to the cluster transparently.
You could set up sharding with a MySQL instance on each machine, but the setup time, maintenance and changes to the database access layer might be a lot of work compared to a Gearman solution. You might also want to look at the experimental Spider engine, which could allow you to use multiple MySQL servers in unison.
Unless your calculation is very complex, most of the time will be spent retrieving data from MySQL and sending the results back to MySQL.
As you have a single database, no amount of parallelism or clustering on the application side will make much difference.
So your best options are to do the update in pure SQL if that is at all possible, or to use a stored procedure so that all processing takes place within the MySQL server and no data movement is required.
If this is not fast enough, then you will need to split your database among several MySQL instances and come up with a scheme to partition the data based on some application key.
