How to efficiently process strings in Java

I am facing an optimization problem in Java. I have to process a table which has 5 attributes and contains about 5 million records. To simplify the problem, let's say I have to read every record one by one and then process it. From each record I have to generate a mathematical lattice structure which has 500 nodes. In other words, each record generates 500 new records, which can be referred to as the parents of the original record. So in total there are 500 x 5 million records, including the original and parent records. Now the job is to find the number of distinct records out of all 500 x 5 million records, together with their frequencies.

Currently I have solved this problem as follows. I convert every record to a string, with the value of each attribute separated by "-", and I count the strings in a Java HashMap. Since these records involve intermediate processing, a record is converted to a string and then back to a record during the intermediate steps. The code is tested, works fine, and produces accurate results for a small number of records, but it cannot process 500 x 5 million records.
For a large number of records it produces the following error:
java.lang.OutOfMemoryError: GC overhead limit exceeded
I understand that the number of distinct records is not more than 50 thousand for sure, which means that the data itself should not cause a memory or heap overflow. Can anyone suggest any options? I will be very thankful.

Most likely, you have some data structure somewhere which is keeping references to the processed records, also known as a "memory leak". It sounds like you intend to process each record in turn and then throw away all the intermediate data, but in fact the intermediate data is being kept around. The garbage collector can't throw away this data if you have some collection or something else still pointing to it.
Note also that there is the very important Java runtime parameter -Xmx. Without any further detail than what you've provided, I would have thought that 50,000 distinct records would fit easily into the default heap, but maybe not. Try doubling -Xmx (hopefully your computer has enough RAM). If this solves the problem, great. If it just gets you twice as far before it fails, then you know it's an algorithm problem.
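If doubling -Xmx only postpones the failure, the next thing to check is that nothing except the frequency map outlives each record. A minimal sketch of that counting loop, assuming hypothetical LatticeRecord and generateParents placeholders since the original code isn't shown:

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class DistinctCounter {
        // At most ~50,000 distinct keys, so this map stays small.
        private final Map<String, Integer> counts = new HashMap<>();

        // 'LatticeRecord' and 'generateParents' are hypothetical stand-ins for the
        // question's own types; only the counting pattern matters here.
        void process(LatticeRecord record) {
            for (LatticeRecord parent : generateParents(record)) {    // ~500 per record
                counts.merge(parent.asKey(), 1, Integer::sum);        // key e.g. "a-b-c-d-e"
            }
            // Nothing references 'record' or its parents after this point, so the
            // garbage collector can reclaim them before the next record is read.
        }

        private List<LatticeRecord> generateParents(LatticeRecord r) {
            return List.of();   // placeholder for the real lattice generation
        }

        interface LatticeRecord {
            String asKey();
        }
    }

If memory still runs out with this shape of code, that is a strong sign some other collection is accumulating the intermediate strings or records.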

Using an SQLite database could be a way to save the (1.3 TB?) data. With queries you can quickly look the information up again. The data is also persisted when your program ends.

You probably need to adopt a different approach to calculating the frequencies of occurrence. Brute force is great when you only have a few million :)
For instance, after your calculation of the 'lattice structure' you could combine it with the original data and take either the MD5 or SHA-1 hash. This should be unique except when the data is not "distinct", which should then reduce your total data back down to below 5 million.
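A minimal sketch of that hashing idea with java.security.MessageDigest, assuming the record string is the same "-"-separated format the question already uses:

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;
    import java.util.HashMap;
    import java.util.Map;

    public class HashedCounter {
        private final Map<String, Integer> counts = new HashMap<>();
        private final MessageDigest md5;

        public HashedCounter() throws NoSuchAlgorithmException {
            md5 = MessageDigest.getInstance("MD5");
        }

        // Count by the 16-byte MD5 of the "-"-separated record string instead of
        // the full string, so every key has a small, fixed size.
        public void count(String recordKey) {
            byte[] digest = md5.digest(recordKey.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(32);
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            counts.merge(hex.toString(), 1, Integer::sum);
        }
    }

Keep in mind that an MD5 collision would silently merge two genuinely distinct records; with roughly 50,000 distinct values the probability is negligible, but it is a trade-off.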

Related

SQLite doing too many small size disk reads

Background: I am using SQLite to store around 10M entries, where the size of each entry is around 1 KB. I am reading this data back in chunks of around 100K entries at a time, using multiple parallel threads. Reads and writes do not run in parallel, and all the writes are done before the reads start.
Problem: I am experiencing too many disk reads. Each second around 3K reads happen, and I am reading only 30 KB of data in those 3K reads (hence around 100 bytes per disk read). As a result, I am seeing really horrible performance (it takes around 30 minutes to read the data).
Questions:
Are there any SQLite settings/pragmas that I can use to avoid the small disk reads?
Are there any best practices for batched parallel reads in SQLite?
Does SQLite read all the results of a query in one go, or does it read the results in smaller chunks? If the latter is the case, where does it store the partial results of a query?
Implementation details: I am using SQLite with Java and my application runs on Linux. The JDBC library is https://github.com/xerial/sqlite-jdbc (version 3.20.1).
P.S. I have already built the necessary indexes and verified that no table scans are going on (using EXPLAIN QUERY PLAN).
When you are searching for data with an index, the database first looks up the value in the index, and then goes to the corresponding table row to read all the other columns.
Unless the table rows happen to be stored in the same order as the values in the index, each such table read must go to a different page.
Indexes speed up searches only if the search reduces the number of rows. If you're going to read all (or most of) the rows anyway, a table scan will be much faster.
Parallel reads will be more efficient only if the disk can actually handle the additional I/O. On rotating disks, the additional seeks will just make things worse.
(SQLite tries to avoid storing temporary results. Result rows are computed on the fly (as much as possible) while you're stepping through the cursor.)
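One way to act on the point about per-row page reads is a covering index, so a query can be answered from the index pages alone. A hedged JDBC sketch, with an invented table entries(chunk_id, key, payload) and example pragma values; adjust to your actual schema:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class CoveringIndexExample {
        public static void main(String[] args) throws Exception {
            try (Connection conn = DriverManager.getConnection("jdbc:sqlite:data.db");
                 Statement st = conn.createStatement()) {
                // Bigger cache / memory-mapped I/O mean fewer, larger disk reads
                // (the numbers below are just examples).
                st.execute("PRAGMA cache_size = -200000");    // ~200 MB page cache
                st.execute("PRAGMA mmap_size = 1073741824");  // map up to 1 GB of the file

                // A covering index: payload is stored in the index itself, so a read
                // by chunk_id never has to jump to the table's own pages.
                st.execute("CREATE INDEX IF NOT EXISTS idx_chunk_covering " +
                           "ON entries(chunk_id, key, payload)");
            }
        }
    }

Whether the covering index helps depends on whether your queries only touch the indexed columns; for full-row reads of most of the table, the table-scan advice above still applies.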

How do I efficiently store frequent data (source-destination combinations) in a database to get the top 10 searches for the past 30 days

I am trying to write an algorithm which handles inserts of frequent search data.
Let's say a user can search for different combinations of two entities (source-destination). Each time a user searches, I want to store the combination with a count, and if the same combination (source-destination) is searched again, I will update the count.
In this case, if there are 1000 users, and each user searches for 0 different combinations (source-destination), and the data is stored for 30 days,
the total number of rows will be 100000*30*30 = 13500000 (1.3 billion) rows. (using MySQL)
Please suggest a better way to do this if there is one.
GOAL: I want to get the top 10 search combinations of users at any point in time.
1,000 users and 60,000 rows are nothing by today's standards. Don't even think about it, there is no performance concern whatsoever, so just focus on doing it properly instead of worrying about slowness. There will be no slowness.
The proper way of doing it is by creating a table in which each row contains the search terms ([source, destination] in your case) and a count, and using a unique index on the [source, destination] pair of columns, which is the same as making those two columns the primary key.
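A hedged sketch of that table and the counting upsert through JDBC (table and column names are illustrative; the SQL uses MySQL's ON DUPLICATE KEY UPDATE, since the question mentions MySQL):

    import java.sql.Connection;
    import java.sql.PreparedStatement;

    public class SearchCounter {
        // Illustrative schema:
        //   CREATE TABLE search_count (
        //     source      VARCHAR(64) NOT NULL,
        //     destination VARCHAR(64) NOT NULL,
        //     cnt         BIGINT      NOT NULL,
        //     PRIMARY KEY (source, destination)
        //   );
        public static void recordSearch(Connection conn, String src, String dst) throws Exception {
            String sql = "INSERT INTO search_count (source, destination, cnt) VALUES (?, ?, 1) "
                       + "ON DUPLICATE KEY UPDATE cnt = cnt + 1";
            try (PreparedStatement ps = conn.prepareStatement(sql)) {
                ps.setString(1, src);
                ps.setString(2, dst);
                ps.executeUpdate();
            }
        }

        // Top 10 at any point in time:
        //   SELECT source, destination, cnt FROM search_count ORDER BY cnt DESC LIMIT 10;
    }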
If you had 100,000,000 rows, and performance was critical, and you also had a huge budget affording you the luxury to do whatever weird thing it takes to make ends meet, then you would perhaps want to do something exotic, like appending each search to an indexless table (allowing the fastest appends possible) and then computing the sums in a nightly batch process. But with less than a million rows such an approach would be complete overkill.
Edit:
Aha, so the real issue is the OP's need for a "sliding window". Well, in that case, I cannot see any approach other than saving every single search, along with the time that it happened, and in a batch process a) computing sums, and b) deleting entries that are older than the "window".
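A hedged sketch of that sliding-window variant, again with illustrative names: every raw search is appended to a log table search_log(source, destination, searched_at), old rows are purged in a batch job, and the top 10 is computed over the remaining window:

    import java.sql.Connection;
    import java.sql.Statement;

    public class SlidingWindowBatch {
        // Batch step: drop entries that have left the 30-day window.
        public static void purgeOldSearches(Connection conn) throws Exception {
            try (Statement st = conn.createStatement()) {
                st.executeUpdate(
                    "DELETE FROM search_log WHERE searched_at < NOW() - INTERVAL 30 DAY");
            }
        }

        // Top 10 over the window, computed straight from the log:
        //   SELECT source, destination, COUNT(*) AS cnt
        //   FROM search_log
        //   WHERE searched_at >= NOW() - INTERVAL 30 DAY
        //   GROUP BY source, destination
        //   ORDER BY cnt DESC
        //   LIMIT 10;
    }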

Which data structure should I use to represent this data set?

Suppose I have a data set as follows:
Screen ID    User ID
1            24
2            50
2            80
3            23
5            50
3            60
6            64
...          ...
400,000      200,000
and I want to track the screens that each user visited. My first approach would be to create a HashMap where the keys would be the user ids and the values would be the screen ids. However, I get an OutOfMemoryError when using Java. Are there efficient data structures that can handle this volume of data? There will be about 3,000,000 keys and, for each key, about 1000 values. Would Spark (Python) be the way to go for this? The original dataset has around 300,000,000 rows and 2 columns.
Why do you want to store such a large amount of data in memory? It would be better to store it in a database and use only the data you need, since any data structure in any language will consume roughly the same amount of memory.
A HashMap will not work for what you're describing, because the keys must be unique and your scenario duplicates the keys.
If you want to be more memory efficient and don't have access to a relational database or an external file, consider designing something using arrays.
The advantage of arrays is the ability to store primitives which use less data than objects. Collections will always implicitly convert a primitive into its wrapper type when stored.
You could have your array index represent the screen id, and the value stored at the index could be another array or collection which stores the associated user ids.
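A hedged sketch of that array-based layout (class and method names are made up; 400,000 screens as in the question's sample data). The index into the outer array is the screen id, and each inner int[] grows like an ArrayList but stores primitives:

    import java.util.Arrays;

    public class ScreenVisits {
        private final int[][] usersByScreen;   // usersByScreen[screenId] -> user ids
        private final int[] used;              // how many slots of each inner array are filled

        public ScreenVisits(int maxScreenId) {
            usersByScreen = new int[maxScreenId + 1][];
            used = new int[maxScreenId + 1];
        }

        public void addVisit(int screenId, int userId) {
            int[] arr = usersByScreen[screenId];
            if (arr == null) {
                arr = new int[4];                           // lazily created per screen
            } else if (used[screenId] == arr.length) {
                arr = Arrays.copyOf(arr, arr.length * 2);   // grow geometrically
            }
            arr[used[screenId]++] = userId;
            usersByScreen[screenId] = arr;
        }

        public int[] usersFor(int screenId) {
            int[] arr = usersByScreen[screenId];
            return arr == null ? new int[0] : Arrays.copyOf(arr, used[screenId]);
        }
    }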
What data type are you using? Let's say you are using a
Map<Integer,Integer>
Then each entry takes 8 bytes (32-bit) or 16 bytes (64-bit). Let's calculate your memory consumption:
8 * 400,000 = 3,200,000 bytes / 1024 = 3,125 KB / 1024 = 3.05 MB
or 6.1 MB in the case of a 64-bit data type (like Long).
In short: 3.05 MB or 6 MB is nothing for your hardware.
Even if we calculate with 3 million entries, we end up with a memory usage of about 22 MB (in the case of an Integer entry set). I don't think an OutOfMemory exception is caused by the data size. Check your data type, or
switch to MapDB for a quick prototype (supports off-heap memory, see below).
Yes, handling 3,000,000,000 entries is a more serious matter. We end up with a memory usage of about 22.8 GB. In this case you should consider
a data store that can handle this amount of data efficiently. I don't think a Java Map (or a vector in another language) is a good fit for such an amount of data
(as Brain wrote, with this amount of data you have to increase the JVM heap space or use MapDB). Also think about your deployment; if your product needs 22 GB of memory, that
means high hardware costs. Then the question of cost versus in-memory performance has to be balanced... I would go with one of the following alternatives:
Riak (Key-Value Storage, fits your data structure)
Neo4J (your data structure can be handled as a graph; in this case a screen can have relationships to multiple users and vice versa)
Or for a quick prototype consider MapDB (http://www.mapdb.org/)
For a professional, high-performance solution, you can look at SAP Hana (but it's not free)
H2 (http://www.h2database.com/html/main.html) can be also a good choice. It's an SQL in-memory database.
With one of the solutions above, you can also persist and query your data (without coding indexing, B-trees and such yourself). And this is what you want to do, I guess:
process and operate on your data. In the end, only tests can show which technology offers the best performance for your needs.
The OutOfMemory exception has nothing to do with Java or Python. Your use case can be implemented in Java with no problems.
Just looking at the data structure: you have a two-dimensional matrix, indexed by user id and screen id, containing a single boolean value per cell, namely whether that screen was visited by that user or not: visited[screen-id, user-id]
In the case where each user visits almost every screen, the optimal representation would be a set of bits. This means you need 400K x 200K bits, which is roughly 10 GB. In Java I would use a BitSet and linearize the access, e.g. BitSet.get(screen-id + 400000 * user-id)
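A hedged sketch of that dense representation: 400K x 200K bits is more than a single java.util.BitSet can index (its indices are ints), so this version deviates slightly from the single linearized bit set and keeps one lazily created BitSet per user instead:

    import java.util.BitSet;

    public class VisitMatrix {
        private final BitSet[] visitedByUser;   // one bit set per user, one bit per screen
        private final int screenCount;

        public VisitMatrix(int userCount, int screenCount) {
            this.visitedByUser = new BitSet[userCount + 1];
            this.screenCount = screenCount;
        }

        public void markVisited(int userId, int screenId) {
            BitSet bits = visitedByUser[userId];
            if (bits == null) {
                bits = new BitSet(screenCount);
                visitedByUser[userId] = bits;
            }
            bits.set(screenId);
        }

        public boolean visited(int userId, int screenId) {
            BitSet bits = visitedByUser[userId];
            return bits != null && bits.get(screenId);
        }
    }

Allocating the per-user sets lazily already helps when many users visit only a few screens, which is the sparse case discussed next.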
If each user only visits a few screens, then there are a lot of repeating false-values in the bit set. This is what is called a sparse matrix. Actually, this is a well researched problem in computer science and you will find lots of different solutions for it.
This answers your original question, but probably does not solve your problem. In the comment you stated that you want to look up the users that visited a specific screen. That's a different problem domain: we are shifting from efficient data representation and storage to efficient data access.
Looking up the users that visited a set of screens is essentially the same problem as looking up the documents that contain a set of words. That is a basic information retrieval problem. For this problem you need a so-called inverted index data structure. One popular library for this is Apache Lucene.
You can also read in the visits and build a data structure yourself. Essentially it is a map, addressed by the screen id, returning the set of affected users, i.e. Map<Integer, Set<Integer>>. For the set of integers the first choice would be a HashSet, which is not very memory efficient. I recommend using a high-performance set library targeted at integer values instead, e.g. IntOpenHashSet. This will probably still not fit in memory; however, if you use Spark you can split your processing into slices and join the processing results later.
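A sketch of that inverted index with plain JDK collections (IntOpenHashSet is part of the fastutil library; this version sticks to HashSet to stay dependency-free, at the memory cost mentioned above):

    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;

    public class InvertedIndex {
        // screen id -> set of user ids that visited it
        private final Map<Integer, Set<Integer>> usersByScreen = new HashMap<>();

        public void addVisit(int screenId, int userId) {
            usersByScreen.computeIfAbsent(screenId, k -> new HashSet<>()).add(userId);
        }

        // Users that visited *all* of the given screens: intersect the per-screen sets.
        public Set<Integer> usersWhoVisitedAll(int... screenIds) {
            Set<Integer> result = null;
            for (int screenId : screenIds) {
                Set<Integer> users = usersByScreen.getOrDefault(screenId, Set.of());
                if (result == null) {
                    result = new HashSet<>(users);
                } else {
                    result.retainAll(users);
                }
            }
            return result == null ? Set.of() : result;
        }
    }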

Storing a large number of geolocation records in cached ArrayList or always query them from MongoDB?

I'm working on a geolocation app. The app holds about 500K records in MongoDB, properly indexed. Each row has its own latitude and longitude values recorded. A client must retrieve the 200 nearest points from those 500K rows, and I have a concern about performance. At first I thought of keeping all records (lat/lng info) in a cache manager or an in-memory database, so that a given point (lat/lng) could be compared against the values in the cache. This is where my doubts come in.
Would it be good to store all those records in an ArrayList in a cache manager and then compare the geolocations of the records to the geolocations in the ArrayList in order to calculate the distances?
With that approach I avoid a huge number of queries against MongoDB; on the other hand, it could be wrong to keep about 500K records (geolocations) in an ArrayList and then scan that list to retrieve the 200 nearest. If not wrong, it is at least a performance penalty, I think.
How can I deal with that issue?
Thanks in advance.
Keeping your data in-memory could be a performance enhancement. But when you have 500k records in an ArrayList and want to search for the 200 nearest to a given point, this means that every single one of the 500k records will have to be checked for every single request. This will take a while. Likely much, much longer than MongoDB would take.
But you can improve performance by doing the same thing MongoDB is doing with their geo-indexes: Use a smarter data-structure optimized for searching. An R-Tree, for example. In a well-balanced R-Tree, searching for all records in a given area is an operation with a runtime complexity of log n instead of n for an array-list. For 500k entries, that would be an improvement of several orders of magnitude.
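For illustration, a hedged sketch using the JTS library's STRtree, one common R-tree variant (this treats lat/lng as plain x/y coordinates, which ignores the spherical corrections a real geo index such as MongoDB's applies):

    import java.util.List;
    import org.locationtech.jts.geom.Envelope;
    import org.locationtech.jts.index.strtree.STRtree;

    public class NearbyIndex {
        private final STRtree index = new STRtree();

        // Insert each record as a point envelope, keyed by its id.
        public void add(long recordId, double lng, double lat) {
            index.insert(new Envelope(lng, lng, lat, lat), recordId);
        }

        // Candidate records inside a bounding box around the query point; the exact
        // 200 nearest would then be picked from these candidates by distance.
        @SuppressWarnings("unchecked")
        public List<Long> candidatesNear(double lng, double lat, double radiusDeg) {
            Envelope box = new Envelope(lng - radiusDeg, lng + radiusDeg,
                                        lat - radiusDeg, lat + radiusDeg);
            return (List<Long>) index.query(box);
        }
    }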

Kyoto Cabinet / Berkeley DB : Hash table size limitations

I am having a hard time storing hundreds of millions of 16/32-byte key/value pairs with a hash array on my SSD.
With Kyoto Cabinet: when it works fine, it inserts at 70,000 records/s. Once it drops, it goes down to 10-500 records/s. With the default settings, the drop happens after around a million records. Looking at the documentation, that is the default number of buckets in the array, so it makes sense. I increased this number to 25 million and indeed, it works fine until around 25 million records. The problem is, as soon as I push the number of buckets to 30 million or over, the insert rate drops to 10-500 records/s from the beginning. Kyoto Cabinet is not designed to increase the number of buckets after the database is created, so I cannot insert more than 25 million records.
1/ Why would KC's insert rate get very low once the bucket number exceeds 25M ?
With Berkeley DB: the best speed I got is slightly lower than KC, closer to 50,000 records/s, but still OK. With the default settings, just like KC, the speed drops suddenly after around a million records. I know BDB is designed to extend its number of buckets gradually. Regardless of that, I tried to increase the initial number, playing with HashNumElements and FillFactor, but all of these attempts made the situation worse. So I still cannot insert more than 1-2 million records with BDB. I tried activating non-synchronized transactions, tried different checkpoint rates, and increased the caches. Nothing prevents the drop.
2/ What could cause BDB's insert rate to drop after 1-2 million inserts ?
Note: I'm working with Java, and when the speed drops, CPU usage falls to 0-30%, while it is at 100% when working at the correct speed.
Note: Stopping the process and resuming the insertion changes nothing. So I don't think that is related to memory limits or garbage collection.
Thx.
Below is how I managed to store billions of records despite the writing limitations encountered with KC.
Despite much effort, I still haven't solved the problem for either Kyoto Cabinet or Berkeley DB. However, I came up with an interesting workaround using Kyoto Cabinet.
I noticed I cannot write more than 25M records to one KC file, but reading has no such limitation: it is always fast, regardless of the size of the database. The solution I found is to create a new KC file (a new database) for every 25M new records. That way the reading happens across many KC files and is still fast, and the writing happens only on the most recently created file and is fast as well. The only remaining problem was to allow updates/deletions of the records in the previous files. For that, I copied the SSTables approach, which is:
All the 0 to N-1 files are read-only, file N is read+write.
Any insert/update/deletion is written in file N.
Reads look into files N to 0, and return the first-seen/last-written insertion/update/deletion.
A bloom filter is attached to each file to avoid accessing a file that doesn't have the wanted record.
As soon as file N reaches 25M records, it becomes read-only and file N+1 is created.
Notes:
Just like with SSTables, if a lot of updates/deletions are performed, we might want to perform compaction. However, contrary to SSTables, compaction here doesn't require rewriting the file. Outdated records are simply removed from the KC files, and if a KC file gets very small, it can either be removed (reinserting its records into file N) or reopened for new insertions (provided the next files are compact).
A deletion does not delete the record, but writes a special value that marks the record as deleted. During compaction, deleted records are removed for real.
Checking whether a record exists usually requires looking into the database. Thanks to the bloom filters, most negative answers can be given without any disk access.
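A hedged sketch of the read path described above; SegmentFile is a hypothetical wrapper standing in for one Kyoto Cabinet file plus its attached bloom filter, since the real KC calls aren't shown here:

    import java.util.List;

    public class LayeredStore {
        // Hypothetical wrapper: the real implementation would delegate to Kyoto Cabinet.
        interface SegmentFile {
            boolean mightContain(byte[] key);   // answered by the attached bloom filter
            byte[] get(byte[] key);             // null if the key is not in this file
        }

        private final List<SegmentFile> files; // index 0 = oldest, last = current writable file

        public LayeredStore(List<SegmentFile> files) {
            this.files = files;
        }

        // Newest file first, so the most recent write wins; the bloom filter lets us
        // skip files that certainly don't contain the key, without touching the disk.
        public byte[] get(byte[] key) {
            for (int i = files.size() - 1; i >= 0; i--) {
                SegmentFile f = files.get(i);
                if (!f.mightContain(key)) {
                    continue;
                }
                byte[] value = f.get(key);
                if (value != null) {
                    return isTombstone(value) ? null : value;   // deletion marker -> not found
                }
            }
            return null;
        }

        private boolean isTombstone(byte[] value) {
            return value.length == 0;   // sketch: an empty value marks a deleted record
        }
    }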
