I have to process a big array of strings in Java that can't be kept in memory, so the array must be processed in several chunks. The size of each chunk can be specified by the program's user, but if the user doesn't specify a size, the program must decide the most appropriate one.
My first thought was to use an on-disk database like Cassandra. That way, every time I want to process a chunk of the big array, I would run a query against the database.
The problem I saw was that I'd need to keep track of the available JVM heap and RAM myself, which I think would be too difficult. I would also have to work out how to size each chunk to make the most of the available memory without exhausting it.
For that, I've thought about using something like Memcached or SSDB (an alternative to Redis that lets you keep part of the database on disk - https://github.com/ideawu/ssdb), but I'm not sure that's the best option. The idea is that Memcached or SSDB would manage the exchange of data between memory and disk so I wouldn't have to implement any memory-limiting logic myself.
Honestly, I don't much like the idea of adding dependencies (Memcached or SSDB) just to make my program work.
So my question is: are there any good alternatives for solving this problem? Is the reasoning above flawed?
Thanks in advance!
CLARIFICATIONS
---------------
What kind of processing do you have to do?
The processing applies data-analysis techniques to extract information from the existing data (the big array).
How big is the array? How big are the strings? Is your processing random access or sequential? Why can't you just use a file?
The size of the array can change; it doesn't have a fixed value. The idea is that a user (not an end user) can process an array in chunks when that is necessary. For example, one user may want to process an array of 100,000 elements in several chunks, while another user may not need to chunk the array at all until it grows beyond 1,000,000 elements (depending on how much memory each user has).
My processing is sequential.
I don't use a file because other questions on this site recommend using a database rather than a file. Moreover, if I used a file, I would still have to manage the available memory myself to keep it from filling up (and causing an error in the program).
Where are the Strings you wanna process? Are they already stored somewhere, or do you generate them somehow on the fly?
The strings are obtained from users and are currently stored entirely in an array. Now the idea is to store the strings passed by the user in the database and process them later (whenever the user decides); the processing doesn't have to happen immediately after the strings are stored in the database.
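Independently of which backing store ends up being used (Cassandra, SSDB, or a plain file), the chunking itself can stay simple. Below is a minimal, hypothetical sketch: the chunk size is derived from the JVM's maximum heap and a caller-supplied estimate of the average string footprint when the user doesn't specify one, and only one chunk is ever materialised at a time. The heap fraction and the upper bound are assumptions to be tuned, not recommendations.

import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class ChunkedProcessor {

    // Heuristic: use a fraction of the maximum heap, divided by an estimated
    // per-string footprint, to pick a chunk size when the user does not supply one.
    // The 1/4 fraction and the 500,000 cap are arbitrary placeholders.
    static int defaultChunkSize(long estimatedBytesPerString) {
        long usableHeap = Runtime.getRuntime().maxMemory() / 4;
        long size = usableHeap / Math.max(1L, estimatedBytesPerString);
        return (int) Math.min(size, 500_000L);
    }

    // Pulls strings from any streaming source (database cursor, file, queue, ...)
    // and materialises only one chunk at a time.
    static void processInChunks(Iterator<String> source, int chunkSize) {
        List<String> chunk = new ArrayList<>(chunkSize);
        while (source.hasNext()) {
            chunk.add(source.next());
            if (chunk.size() == chunkSize) {
                process(chunk);
                chunk.clear(); // drop references so the processed strings can be garbage-collected
            }
        }
        if (!chunk.isEmpty()) {
            process(chunk);
        }
    }

    private static void process(List<String> chunk) {
        // placeholder for the actual analysis step
    }
}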
Related
I want to preserve data during a service restart; the service uses an ArrayList of {ArrayLists of Integers} and some other variables.
Since it is about 40-60 MB, I don't want it to be regenerated each time the service restarts (that takes a lot of time); I want to generate the data once and reuse it on the next service restart.
How can it be done?
Before suggesting that I write the data to a file, please consider how I would put a data structure similar to a multidimensional array (3D or above) into a file; and even then, reading it back would likely take significant time too.
You can try writing your data to a file after generation. Then on the next service restart, you can simply read it back from the file (a minimal sketch of this follows below).
If you need the data to be persistent, put it into a database:
https://developer.android.com/guide/topics/data/data-storage
or try some object database like http://objectbox.io/
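As an illustration of the write-once/read-on-restart idea above, here is a minimal sketch using plain Java serialization, which works for the nested ArrayList structure because ArrayList and Integer are Serializable. The file name and buffering choices are just placeholders.

import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.util.ArrayList;

public class DataCache {

    // Write the generated structure to a file once...
    static void save(ArrayList<ArrayList<Integer>> data, File file) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(
                new BufferedOutputStream(new FileOutputStream(file)))) {
            out.writeObject(data);
        }
    }

    // ...and read it back on the next service start instead of regenerating it.
    @SuppressWarnings("unchecked")
    static ArrayList<ArrayList<Integer>> load(File file)
            throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(
                new BufferedInputStream(new FileInputStream(file)))) {
            return (ArrayList<ArrayList<Integer>>) in.readObject();
        }
    }
}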
So you're afraid reading from the file would take a long time due to its size and the number and size of the rows (the inner arrays).
I think it might be worth stopping for a minute and asking yourself whether you need all this data at once. Maybe you only need a portion of it at any given time, and there are scenarios in which you don't use some (or maybe most) of the data? If this is likely, I would suggest computing the data on demand, when required, and keeping only a memory-based cache for future requests in the current session (a small sketch of such a cache follows this answer).
Otherwise, if you do need all the data at a given time, you have a trade-off between size on disk and processing time. You can shrink the data using some compression algorithm, but that would be at the expense of processing time. On the other hand, you can just serialize your data object and save it to disk as is: less time, more disk space.
Another solution for your scenario could be to just use a DB and a cursor (Room on top of SQLite). I don't know exactly what it is you're trying to do, but your arrays can easily be modeled in a DB. Model a single row as you'd like and add the outer index of the array to that model. Then save the models into the DB, potentially making the outer-index field the primary key of the table.
Regardless of the things I wrote, think about whether you really need this data to be persistent on the client; maybe you can store it on the server side? If so, there are other storage and access solutions that are not tied to the Android client.
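To make the compute-on-demand suggestion concrete, here is a minimal session-scoped LRU cache sketch based on LinkedHashMap; the entry limit and the row type are illustrative assumptions.

import java.util.LinkedHashMap;
import java.util.Map;

// Keeps the most recently used rows in memory and recomputes evicted ones on demand.
public class RowCache extends LinkedHashMap<Integer, int[]> {

    private static final int MAX_ENTRIES = 1_000; // assumption: tune to your memory budget

    public RowCache() {
        super(16, 0.75f, true); // access order, so the least recently used row is evicted first
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<Integer, int[]> eldest) {
        return size() > MAX_ENTRIES;
    }

    // Compute a row only when it is first requested, then keep it for the session.
    public int[] getOrCompute(int rowIndex) {
        int[] row = get(rowIndex);
        if (row == null) {
            row = computeRow(rowIndex);
            put(rowIndex, row);
        }
        return row;
    }

    private static int[] computeRow(int rowIndex) {
        return new int[0]; // placeholder for the expensive generation step
    }
}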
Thank you all for answering this question.
This is what I have finally settled for:
Instead of making the structure part of the app, I turned this into a tool that prepares the data to be used by the main app. In doing so, it also removed the concern about service restarts.
The tool first reads all the strings from the input file(s).
Then it puts all of them into the structure one at a time. (This is the part I had doubts about and asked the question about: since all the data lives only in that in-memory structure, it becomes unusable as soon as the program terminates.)
Next, I prepared another structure for writing this data to a file, and wrote all of it out, so that I no longer need to read the entire input file again and again, only a few lines.
Then I thought: why spend time reading files at all when I can hard-code the data into my app? So, as the final step of this preprocessing tool, I made it generate a class containing switch(input) { case X: return Y; } lookups (sketched below).
Now I just have to put that generated class into the app I wanted to make.
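For illustration only, a generated lookup class of the kind described might look roughly like this (the class name, keys, and values are made up; the real Kompressor output may differ):

// Hypothetical shape of the generated class; every value is baked in at compile time.
public final class GeneratedData {

    private GeneratedData() {
    }

    public static String lookup(int input) {
        switch (input) {
            case 0:  return "first precomputed value";
            case 1:  return "second precomputed value";
            default: throw new IllegalArgumentException("unknown key: " + input);
        }
    }
}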
I know this all sounds very abstract, perhaps even stretching the concept of "abstract"; if you want the details, please let me know. I am also including a link to my "tool". Please take a look and let me know whether there would have been a better way.
P.S. There could still be errors in this tool; if you find any, please let me know so I can fix them.
P.P.S.
link: Kompressor Tool
I have a rather large dataset, ~68 million data points. The data is currently stored in MongoDB, and I have written a Java program that goes through the data to link data points together and place them in a Neo4j database using Cypher commands. I ran this program overnight with a test set of data (~1.5 million points) and it worked. Now when I try to import the whole dataset, the program is extremely slow: it ran the whole weekend and only ~350,000 data points have made it in. Through some short testing, it seems like Neo4j is the bottleneck. It's been half an hour since I stopped the Java program, but Neo4j's CPU usage is at 100% and new nodes are still being added (from the Java program). Is there any way to overcome this bottleneck? I've thought about multithreading, but since I'm trying to create a network, there are lots of dependencies and non-thread-safe operations being performed. Thanks for your help!
EDIT: The data I have is a list of users. The data that is contained is the user id, and an array of the user's friends' ids. My Cypher queries look a little like this:
"u:USER {id:" + currentID + "}) CREATE (u)-[:FRIENDS {ts:" + timeStamp}]->(u" + connectionID + ":USER {id:" + connectionID + "})"
Sorry if this is really terrible; I'm pretty new to this.
You should first look at this:
neo4j import slowing down
If you still decide to DIY, there are a few things you should look out for. First, make sure you don't try to import all your data in one transaction, otherwise your code will spend most of its time suspended by the garbage collector. Second, ensure you have given plenty of memory to the Neo4j process (or to your application, if you're using an embedded instance of Neo4j). 68 million nodes is trivial for Neo4j, but if the Cypher you're generating is constantly looking things up, e.g. to create new relationships, then you'll run into severe paging issues if you don't allocate enough memory. Finally, if you are looking up nodes by properties (rather than by id), then you should be using labels and schema indexes:
http://neo4j.com/news/labels-and-schema-indexes-in-neo4j/
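To make the transaction-batching and parameter advice concrete, here is a rough sketch against the embedded GraphDatabaseService API (Neo4j 2.x/3.x style); the batch size, data shape, and names are assumptions, and a Bolt-driver version would look similar. Note that using MERGE for the friend node avoids the duplicate USER nodes that a plain CREATE, as in the question's query, would produce, and that the MERGE lookups assume a schema index on :USER(id).

import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Transaction;

public class FriendImporter {

    private static final int BATCH_SIZE = 10_000; // assumption: tune to your heap

    // Commits every BATCH_SIZE relationships instead of one huge transaction,
    // and passes values as parameters so the query plan can be reused.
    // Assumes "CREATE INDEX ON :USER(id)" has already been run.
    static void importFriendships(GraphDatabaseService db, List<long[]> userFriendPairs) {
        String cypher =
                "MERGE (u:USER {id: {userId}}) "
              + "MERGE (f:USER {id: {friendId}}) "
              + "CREATE (u)-[:FRIENDS {ts: {ts}}]->(f)";

        int inBatch = 0;
        Transaction tx = db.beginTx();
        try {
            for (long[] pair : userFriendPairs) {
                Map<String, Object> params = new HashMap<>();
                params.put("userId", pair[0]);
                params.put("friendId", pair[1]);
                params.put("ts", System.currentTimeMillis());
                db.execute(cypher, params);

                if (++inBatch == BATCH_SIZE) {
                    tx.success();      // commit this batch...
                    tx.close();
                    tx = db.beginTx(); // ...and start a fresh transaction
                    inBatch = 0;
                }
            }
            tx.success();
        } finally {
            tx.close();
        }
    }
}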
Did you configure neo4j.properties and neo4j-wrapper.conf files?
It is highly recommended to adjust the values according to the amount of RAM available on your machine.
In conf/neo4j-wrapper.conf, for a server with 12 GB of RAM I usually use:
wrapper.java.initmemory=8000
wrapper.java.maxmemory=8000
in conf/neo4j.properties I set
dbms.pagecache.memory=8000
See http://neo4j.com/blog/import-10m-stack-overflow-questions/ for a complete example of importing 10M nodes in a few minutes; it's a good starting point.
An SSD is also recommended to speed up the import.
One thing I learned when loading bulk data into a database was to switch off indexing temporarily on the destination table(s). Otherwise every new record added caused a separate update to the indexes, resulting in a lot of work on the disk. It was much quicker to re-index the whole table in a separate operation after the data load was complete. YMMV.
I guess this is a beginner's theoretical question:
I'm thinking of a program that will store data "internally" rather than processing data from a file as its source. Data is input to the program by the user, and the output is to stay "within" the program (does that make sense??).
I'm currently using a single array of size 12 to take in an unlimited number of string tokens from the user.
I plan to perform mostly searching and sorting operations.
As the container grows, does the program process and output the data internally the same way it would with a file external to the program?
I guess the real question is: is it better to store the output in an external file and process it from there, or is it better to keep the data inside the program?
For memory and speed purposes, is an array the best choice, or are there better containers I should use?
I'm sure I'd find an answer in a book, but I just wanted to get your opinions.
Thanks
It is certainly possible to store data only in memory. Java provides a number of different containers to handle the details of this for you. Holding data in memory is often easier than using a file because:
It is much faster to access the data
You can jump around in the data easily
You can more easily build complex structures in memory than you can on disk
It is possible, if you are dealing with a very large amount of data, to exceed the size that can be reasonably held in memory. For a typical desktop computer of today, that size would probably be a few gigabytes.
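As a small illustration of the in-memory approach (the container choice and sample data are placeholders), a growable ArrayList handles the searching and sorting mentioned in the question without any fixed-size array:

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class TokenStore {
    public static void main(String[] args) {
        // A growable in-memory container; no fixed size-12 array needed.
        List<String> tokens = new ArrayList<>();
        tokens.add("delta");
        tokens.add("alpha");
        tokens.add("charlie");

        // Sorting and searching happen entirely in memory.
        Collections.sort(tokens);
        int pos = Collections.binarySearch(tokens, "charlie");
        System.out.println("found at index " + pos);
    }
}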
We have an Autosys job running daily in our production environment. It calls a shell script, which in turn calls a Java servlet. This servlet reads the input files, inserts the data into two different tables, and then does some processing. The Java version is 1.6, the application server is WAS7, and the database is Oracle 11g.
We run into several issues with this process: it takes a long time, it runs out of memory, and so on. Below are the details of how we have coded this process. Please let me know if it can be improved.
When we read the file using BufferedReader, do we really get a lot of strings created in memory, as returned by the readLine() method of BufferedReader? These files contain 4-5 lakh (400,000-500,000) lines, with records separated by newline characters. Is there a more efficient way to read files in Java? I couldn't find one, given that all the record lines in the file are of variable length.
When we insert the data, we do a batch insert with a statement/prepared statement, putting all the records of the file into one batch. Does breaking the batch into smaller sizes really give better performance?
If the tables have no indexes defined nor any other constraints, and all the columns are of VARCHAR type, which operation will be faster: inserting a new row, or updating an existing row based on some matching condition?
Reading the File
It is fine to use BufferedReader. The key thing here is to read a bunch of lines, process them, then read another bunch, and so on. An important implication is that when you process the second bunch of lines, you no longer reference the previous bunch. This way, you ensure you don't retain memory unnecessarily. If, however, you retain references to all the lines, you are likely to run into memory issues.
If you do need to reference all the lines, you can either increase your heap size or, if many of the lines are duplicates, use the technique of intern() or something similar to save memory.
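A minimal sketch of that read-a-bunch-then-drop-it pattern (the chunk size and file handling are assumptions to adapt):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class ChunkedFileReader {

    private static final int LINES_PER_CHUNK = 10_000; // assumption: tune to your heap

    static void readInChunks(String path) throws IOException {
        try (BufferedReader reader = new BufferedReader(new FileReader(path))) {
            List<String> chunk = new ArrayList<>(LINES_PER_CHUNK);
            String line;
            while ((line = reader.readLine()) != null) {
                chunk.add(line);
                if (chunk.size() == LINES_PER_CHUNK) {
                    process(chunk);
                    chunk = new ArrayList<>(LINES_PER_CHUNK); // drop references to the previous bunch
                }
            }
            if (!chunk.isEmpty()) {
                process(chunk);
            }
        }
    }

    private static void process(List<String> lines) {
        // parse/insert the lines here, then let them go out of scope
    }
}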
Modifying the Table
It is always better to limit a batch to a reasonable size. The larger the batch, the more resource pressure you put on the database side and probably on your JVM side as well.
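For example, a sketch of flushing a JDBC batch every N records instead of batching the whole file at once (table name, columns, and batch size are illustrative):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

public class BatchInserter {

    private static final int BATCH_SIZE = 1_000; // assumption: tune per driver/DB

    // Flushes the batch every BATCH_SIZE rows rather than accumulating
    // all records of the file in a single batch.
    static void insert(Connection conn, List<String[]> records) throws SQLException {
        String sql = "INSERT INTO staging_table (col1, col2) VALUES (?, ?)";
        conn.setAutoCommit(false);
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            int pending = 0;
            for (String[] rec : records) {
                ps.setString(1, rec[0]);
                ps.setString(2, rec[1]);
                ps.addBatch();
                if (++pending == BATCH_SIZE) {
                    ps.executeBatch();
                    pending = 0;
                }
            }
            if (pending > 0) {
                ps.executeBatch();
            }
            conn.commit();
        }
    }
}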
Insert or Update
If you have indexes defined, I would say updating performs better. However, if you don't have indexes, insert should be better. (You have access to the environment, perhaps you can do a test and share the result with us?)
Lastly, you can also consider using multiple threads for the "Modifying the Table" part to improve overall performance and efficiency.
What is the best way of storing key-value pairs of Strings in a file in Java that is scalable (it can work with a large number of pairs, i.e. it doesn't read or write the entire file on access) but is as lightweight as possible?
I am asking because even the lightest database libraries, like SQLite and H2, seem like overkill for this purpose, and are even impossible to use for ME programs (although for now I would need this mainly for SE programs).
Oracle Berkeley DB Java Edition lets you store key-value objects; it is simple to use and administer, and it scales up to heaven (or so). At about 820 KB, it is not that big.
But if you are thinking about scaling down to J2ME, you may try TinySQL.
Pros:
It is small (93k!)
It is embeddable
It uses DBF or text files to store data, so they are easy to read.
Cons:
It is an old unmaintained project
It is not designed to work in J2ME, but since it can run on JDK 1.1.8 it won't be hard to make it work in J2ME. Of course, you will have to change some code from using RandomAccessFile to FileConnection and the like, but at least you won't need to mess with generics-related code.
It is not very fast, because it does not use indexes, so you need to try it and see whether it fits your needs.
It is not feature-complete; it just gives you a small subset of SQL.
There are some good ideas in this SO answer. My own inclination would be to use NoSQL or something similar, although that discussion is more centered on HashMap. Either will do, I believe.
For a static set of key-value pairs, Dan Bernstein's cdb comes to mind. To quote from the cdb description:
cdb is a fast, reliable, simple package for creating and reading constant databases. Its database structure provides several features:
Fast lookups: A successful lookup in a large database normally takes just two disk accesses. An unsuccessful lookup takes only one.
Low overhead: A database uses 2048 bytes, plus 24 bytes per record, plus the space for keys and data.
No random limits: cdb can handle any database up to 4 gigabytes. There are no other restrictions; records don't even have to fit into memory. Databases are stored in a machine-independent format.
Fast atomic database replacement: cdbmake can rewrite an entire database two orders of magnitude faster than other hashing packages.
Fast database dumps: cdbdump prints the contents of a database in cdbmake-compatible format.
cdb is designed to be used in mission-critical applications like e-mail. Database replacement is safe against system crashes. Readers don't have to pause during a rewrite.
It appears there is a Java implementation available at http://www.strangegizmo.com/products/sg-cdb/ with a BSD license.
An obvious initial thought is to use Properties: these are streamed, but they are ultimately fully loaded into memory, and you can't partially read a buffered set.
With that in mind, you could look at this additional SO response. It refers to navigating (albeit imperfectly) around a stream so that you can reposition your read:
changing the index positioning in InputStream
With a separate index (say by initial character) you could intelligently reposition the cursor in the stream, perhaps.
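For completeness, the Properties approach mentioned above looks roughly like this; note that load() pulls every pair into memory at once, which is exactly the scalability limitation described (the file name and keys are placeholders):

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.Properties;

public class PropertiesStore {
    public static void main(String[] args) throws IOException {
        Properties props = new Properties();

        // Write a few pairs; store() streams them out, but...
        props.setProperty("host", "example.org");
        props.setProperty("port", "8080");
        try (OutputStream out = new FileOutputStream("config.properties")) {
            props.store(out, "demo");
        }

        // ...load() reads the whole file into memory at once,
        // which is the limitation mentioned above.
        Properties loaded = new Properties();
        try (InputStream in = new FileInputStream("config.properties")) {
            loaded.load(in);
        }
        System.out.println(loaded.getProperty("port"));
    }
}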
Chronicle Map is a modern off-heap key-value store for Java. It can optionally be persisted to disk, acting like an eventually-consistent database. Chronicle Map features:
Queries faster than 1 µs, in some use cases as fast as 100 ns (see the comparison with other similar libraries for Java).
Perfect scalability for processing from multiple threads and even processes, thanks to segmented shared-nothing design and multi-level locks, allowing multiple operations to access the same data concurrently.
Very low overhead per entry, less than 20 bytes / entry is achievable.
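A minimal usage sketch, assuming the Chronicle Map 3.x builder API (the entry count, average key/value samples, and file name are placeholders that should reflect your real data):

import java.io.File;
import java.io.IOException;

import net.openhft.chronicle.map.ChronicleMap;

public class ChronicleKvStore {
    public static void main(String[] args) throws IOException {
        // Persisted off-heap map of String keys to String values.
        try (ChronicleMap<String, String> map = ChronicleMap
                .of(String.class, String.class)
                .averageKey("someKey")       // sizing hints required for variable-length types
                .averageValue("someValue")
                .entries(1_000_000)          // expected number of pairs
                .createPersistedTo(new File("kv-store.dat"))) {

            map.put("greeting", "hello");
            System.out.println(map.get("greeting"));
        }
    }
}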