First of all, sorry for my English and for the length of the message.
I'm writing a simple Java application for visual cryptography for a school project. It takes a schema file and a secret image, then creates n images (shares) using the information contained in the schema.
For each pixel in the secret image the application looks up a matrix in the schema file and writes m pixels to the n shares (one row of the matrix per share).
A schema file contains the n*m matrices for every color needed for the encoding, and it is laid out as follows:
COLLECTION COLOR 1
START MATRIX 1
RGB
GBR
BGR
END
START MATRIX 2
.....
COLLECTION COLOR 2
START MATRIX 1
XXX
XXX
XXX
END
......
//
This file can be a few lines or many thousands of lines, so I can't keep all the matrices in memory inside the application; I always have to read them from the file.
To test the performance I wrote a parser that simply searches for the matrix line by line, but it is very slow.
I thought I'd save the line number of each matrix and then use RandomAccessFile to read it, but I wanted to know if there is a more powerful method for doing this.
Thanks
If you are truly dealing with massive input files that exceed your ability to load the entire thing into RAM, then using a persistent key/value store like MapDB may be an easy way to do this. Parse the file once and build an efficient [Collection+Color] -> Matrix map. Store that in a persistent HTree. That will take care of all of the caching and so on for you. Make sure to create a good hash function for the Collection+Color tuple, and it should be very performant.
If your data access pattern tends to clump together, it may be faster to store in a B+Tree index - you can play with that and see what works best.
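A minimal sketch of that idea, assuming MapDB 3.x (the file name, key format and flattened matrix value below are illustrative assumptions, not anything MapDB prescribes):

    import org.mapdb.DB;
    import org.mapdb.DBMaker;
    import org.mapdb.HTreeMap;
    import org.mapdb.Serializer;

    public class MatrixStore {
        public static void main(String[] args) {
            // Persistent single-file store; MapDB handles the caching and paging.
            DB db = DBMaker.fileDB("matrices.db")
                    .fileMmapEnableIfSupported()
                    .make();

            // Hash-based map (HTree). For clumped access patterns, swap in
            // db.treeMap(...) to get the B+Tree variant mentioned above.
            HTreeMap<String, String> matrices = db
                    .hashMap("matrices", Serializer.STRING, Serializer.STRING)
                    .createOrOpen();

            // Written once by the parser, then looked up per secret-image pixel.
            matrices.put("collection1:color1:matrix1", "RGB\nGBR\nBGR");
            System.out.println(matrices.get("collection1:color1:matrix1"));

            db.close();
        }
    }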
For your schema file, use a FileChannel and call .map() on it. With a little effort, you can calculate the necessary offsets into the mapped representation of the file and use that, or even encapsulate this mapping into a custom structure.
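A rough sketch of that approach; the offset and length are placeholders for the values your own index of the schema file would supply:

    import java.io.IOException;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Paths;
    import java.nio.file.StandardOpenOption;

    public class MappedSchema {
        public static void main(String[] args) throws IOException {
            try (FileChannel channel = FileChannel.open(
                    Paths.get("schema.txt"), StandardOpenOption.READ)) {

                // Map the whole schema file once (fine as long as it is under 2 GB).
                MappedByteBuffer map =
                        channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());

                int matrixOffset = 1024;   // hypothetical byte offset of a matrix block
                int matrixLength = 64;     // hypothetical byte length of that block

                byte[] raw = new byte[matrixLength];
                map.position(matrixOffset);
                map.get(raw);
                System.out.println(new String(raw, StandardCharsets.UTF_8));
            }
        }
    }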
I need to enlarge already-big JPEGs; they are used for printing, so they need to be really big 300 PPI files. The resulting image will be too big to hold fully in memory. What I thought of was something like breaking the original image into small strips, enlarging each one of them separately, and writing it to the output file (another JPEG), never keeping the final image fully in memory. I've read about lossless operations on JPEGs, and that seems the way to go (create a file with the strip and copy the MCUs, Huffman tables and quantization tables to the final file); I've also read something about abbreviated streams in Java. What is a good way to do this?
Your best bet is to leave the JPEGs alone and have the printing software scale the output to the device.
If you really want to double the size, you could use subsampling: just double the Y component in each direction and change the sampling for Cb and Cr while leaving the data alone.
You could also do as you say, and recompress in strips of MCUs.
I have a huge file with a unique word on each line. The file is around 1.6 GB (I have to sort other files after this which are around 15 GB). Until now, for smaller files I used Arrays.sort(), but for this file I get java.lang.OutOfMemoryError: Java heap space. I know the reason for this error. Is there any way to do this short of writing a complete quicksort or merge sort program myself?
I read that Arrays.sort() uses a quicksort or hybrid sort internally. Is there any ready-made procedure like Arrays.sort() for this?
If I have to write a sorting program, which one should I use, quicksort or merge sort? I'm worried about the worst case.
Depending on the structure of the data to store, you can do many different things.
In the case of well-structured data, where you need to sort by one or more specific fields (in which case system tools might not be helpful), you are probably better off using a datastore that allows sorting. MongoDB comes to mind as a good fit for this, given that the size doesn't exceed a few hundred GB. Other NoSQL datastores might also fit the bill nicely, although Mongo's simplicity of use and installation and its support for JSON data make it a really great candidate.
If you really want to go with the Java approach, it gets really tricky. This is the kind of question you get asked at job interviews, and I would never actually expect anybody to implement the code. However, the general solution is merge sort (using random access files is a bad idea because it amounts to insertion sort, i.e., non-optimal run time, which can be bad given the size of your file).
By merge sort I mean reading one chunk of the file at a time, small enough to fit in memory (so it depends on how much RAM you have), sorting it and then writing it back to a new file on disk. After you have read the whole file you can start merging the chunk files two at a time, by reading just the head of each and writing the smaller of the two records back to a third file. Do that for the 'first generation' of files and then continue with the second one, until you end up with one big sorted file. Note that this is basically a bottom-up way of implementing merge sort, the academic recursive algorithm being the top-down approach.
Note that having intermediate files can be avoided altogether by using a multiway merge algorithm. This is typically based on a heap/priority queue, so the implementation might get slightly more complex but it reduces the number of I/O operations required.
Implementing the above in Java shouldn't be too difficult with some careful design, although it can definitely get tricky. I still highly recommend an out-of-the-box solution like Mongo.
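If you do go the pure-Java route, a rough sketch of the bottom-up external merge sort described above might look like this (chunk size and file names are illustrative assumptions; the final merge uses a priority queue, i.e. the multiway variant):

    import java.io.*;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.*;
    import java.util.*;

    public class ExternalSort {
        // Lines per in-memory chunk; tune this to your heap size.
        static final int CHUNK_LINES = 1_000_000;

        public static void main(String[] args) throws IOException {
            sort(Paths.get("words.txt"), Paths.get("words-sorted.txt"));
        }

        public static void sort(Path input, Path output) throws IOException {
            // Phase 1: read chunks that fit in memory, sort them, spill to temp files.
            List<Path> chunks = new ArrayList<>();
            try (BufferedReader in = Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
                List<String> buffer = new ArrayList<>(CHUNK_LINES);
                String line;
                while ((line = in.readLine()) != null) {
                    buffer.add(line);
                    if (buffer.size() == CHUNK_LINES) {
                        chunks.add(writeSortedChunk(buffer));
                        buffer.clear();
                    }
                }
                if (!buffer.isEmpty()) {
                    chunks.add(writeSortedChunk(buffer));
                }
            }

            // Phase 2: k-way merge, holding only one line per chunk in memory.
            PriorityQueue<ChunkReader> heap =
                    new PriorityQueue<>(Comparator.comparing((ChunkReader c) -> c.head));
            for (Path chunk : chunks) {
                heap.add(new ChunkReader(chunk));
            }
            try (BufferedWriter out = Files.newBufferedWriter(output, StandardCharsets.UTF_8)) {
                while (!heap.isEmpty()) {
                    ChunkReader smallest = heap.poll();
                    out.write(smallest.head);
                    out.newLine();
                    if (smallest.advance()) {
                        heap.add(smallest);   // re-insert with its next line
                    } else {
                        smallest.close();
                    }
                }
            }
        }

        private static Path writeSortedChunk(List<String> lines) throws IOException {
            Collections.sort(lines);
            Path tmp = Files.createTempFile("chunk", ".txt");
            Files.write(tmp, lines, StandardCharsets.UTF_8);
            return tmp;
        }

        // Wraps one sorted chunk file and exposes its current smallest line.
        private static final class ChunkReader implements Closeable {
            final BufferedReader reader;
            String head;

            ChunkReader(Path p) throws IOException {
                reader = Files.newBufferedReader(p, StandardCharsets.UTF_8);
                head = reader.readLine();
            }

            boolean advance() throws IOException {
                head = reader.readLine();
                return head != null;
            }

            public void close() throws IOException {
                reader.close();
            }
        }
    }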
As it turns out, your problem is that your heap cannot accommodate such a large array, so you must forget any solution that implies loading the whole file content in an array (as long as you can't grow your heap).
So you're left with streaming. It's the only (and typical) solution when you have to handle input sources that are larger than your available memory. I would suggest streaming the file content through your program, performing the sort by writing either to a random access file (trickier) or to a database.
I'd take a different approach.
Given a file, say with a single element per line, I'd read the first n elements. I would repeat this m times, such that the number of lines in the file is n * m + C, with C being the left-over lines.
When dealing with integers, you may wish to use around 100,000 elements per read; with strings I would use fewer, maybe around 1,000. It depends on the data type and the memory needed per element.
From there, I would sort the n elements and write them to a temporary file with a unique name.
Now, since all the files are sorted, the smallest elements are at the start of each. You can then just iterate over the files until you have processed all the elements, repeatedly finding the smallest head element and writing it to the final output.
This approach reduces the amount of RAM needed, relying on drive space instead, and lets you sort a file of any size.
Build an array of the record positions inside the file (a kind of index); maybe that would fit into memory instead. You need an 8-byte Java long per file record. Sort the array, loading records only for comparison and not retaining them (use RandomAccessFile). After sorting, write the new final file, using the index pointers to fetch the records in the required order.
This will also work if the records are not all the same size.
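A sketch of that idea, with file names assumed for illustration (note that RandomAccessFile.readLine() assumes a single-byte encoding, and that every comparison costs a seek):

    import java.io.*;
    import java.util.*;

    public class OffsetSort {
        public static void main(String[] args) throws IOException {
            File input = new File("words.txt");   // assumed input, one record per line

            try (RandomAccessFile raf = new RandomAccessFile(input, "r");
                 PrintWriter out = new PrintWriter(new FileWriter("sorted.txt"))) {

                // 1. One pass to record the byte offset of every line.
                List<Long> offsets = new ArrayList<>();
                long pos = 0;
                while (raf.readLine() != null) {
                    offsets.add(pos);
                    pos = raf.getFilePointer();
                }

                // 2. Sort the offsets, loading records only for comparison.
                offsets.sort(Comparator.comparing((Long off) -> readLineAt(raf, off)));

                // 3. Write the final file by following the sorted index.
                for (long off : offsets) {
                    out.println(readLineAt(raf, off));
                }
            }
        }

        private static String readLineAt(RandomAccessFile raf, long offset) {
            try {
                raf.seek(offset);
                return raf.readLine();
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
    }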
My application needs to use data in a text file which is up to 5 GB in size. I cannot load all of this data into RAM as it is far too large.
The data is stored like a table: 5 million records (rows) and 40 columns, each containing text that will be converted in memory to either strings, ints, or doubles.
I've tried caching only 10-100 MB of the data in memory and reloading from the file when I need data outside of that window, but it is way too slow: because my calculations can jump to any random row within the table, it constantly needs to open the file, read and close it.
I need something fast. I was thinking of using some sort of DB. I know calculations with data this large may take a while, which is fine. If I do use a DB, it needs to be set up on launch of the desktop application and must not require some server component to be installed beforehand.
Any tips? Thanks
I think you need to clarify some things:
Is this a desktop application (I assume yes)? What is its memory limit?
Do you use your file in read-only mode?
What kind of calculations are you trying to do? (How often are random rows accessed, how often are consecutive rows read, do you need to modify the data?)
Currently I see two ways for further investigation:
Use SQLite. This is a small single-file DB, oriented mainly at desktop applications and single-user use. It doesn't require any server; all you need is the appropriate JDBC library (a minimal sketch follows after the next option).
Create some kind of index, using, for example, a binary tree. The first time you read your file, index the start position of each row within the file. In conjunction with a permanently open random access file, this will let you seek to and read the desired row quickly. For a binary tree, your index may be approximately 120 MB (RowsCount * 2 * IndexValueSize).
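For the first option, a minimal sketch assuming the xerial sqlite-jdbc driver is on the classpath (table and column names are made up for illustration):

    import java.sql.*;

    public class SqliteLookup {
        public static void main(String[] args) throws SQLException {
            try (Connection conn = DriverManager.getConnection("jdbc:sqlite:table-data.db");
                 Statement st = conn.createStatement()) {

                st.execute("CREATE TABLE IF NOT EXISTS rows (id INTEGER PRIMARY KEY, c1 TEXT, c2 REAL)");

                // Random access by key is now an indexed lookup instead of a file scan.
                try (PreparedStatement ps = conn.prepareStatement("SELECT c1, c2 FROM rows WHERE id = ?")) {
                    ps.setInt(1, 12345);
                    try (ResultSet rs = ps.executeQuery()) {
                        if (rs.next()) {
                            System.out.println(rs.getString("c1") + " / " + rs.getDouble("c2"));
                        }
                    }
                }
            }
        }
    }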
You can use an embedded database; you can find a comparison here: Java Embedded Databases Comparison.
Or, depending on your use case, you may even try Lucene, which is a full-text search engine.
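If Lucene looks interesting, a rough index-then-search sketch, assuming the lucene-core and lucene-queryparser modules (field name and text are placeholders):

    import java.nio.file.Paths;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.queryparser.classic.QueryParser;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.store.FSDirectory;

    public class RowSearch {
        public static void main(String[] args) throws Exception {
            FSDirectory dir = FSDirectory.open(Paths.get("lucene-index"));
            StandardAnalyzer analyzer = new StandardAnalyzer();

            // Index a row (normally you would loop over the whole file here).
            try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer))) {
                Document doc = new Document();
                doc.add(new TextField("content", "example row text", Field.Store.YES));
                writer.addDocument(doc);
            }

            // Query the index.
            try (DirectoryReader reader = DirectoryReader.open(dir)) {
                IndexSearcher searcher = new IndexSearcher(reader);
                TopDocs hits = searcher.search(
                        new QueryParser("content", analyzer).parse("example"), 10);
                System.out.println(hits.scoreDocs.length + " matching rows");
            }
        }
    }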
If I have a CSV file, is it faster to keep the file as plain text or to convert it to some other format (for searching)?
In terms of searching a CSV file, what is the fastest method of retrieving a particular row (by key)? I'm not referring to sorting the file, sorry; what I mean is looking up an arbitrary key in the file.
Some updates:
the file will be read-only
the file can be read and kept in memory
There are several things to consider for this:
What kind of data do you store? Does it actually make sense to convert it to a binary format? Will the binary format take up less space (the time it takes to read the file depends on its size)?
Do you have multiple queries for the same file while the system is running, or do you have to load the file each time someone does a query?
Do you need to efficiently transfer the file between different systems?
All these factors are very important for a decision. The common case is that you only need to load the file once and then do many queries. In that case it hardly matters what format you store the data in, because it will be stored in memory afterwards anyway. Spend more time thinking about good data structures to handle the queries.
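For that common case, a minimal sketch: read the CSV once at startup and index the rows by their key column in a HashMap (the file name and key position are assumptions, and the naive comma split ignores quoting):

    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class CsvIndex {
        public static void main(String[] args) throws IOException {
            Map<String, String[]> byKey = new HashMap<>();

            // One-time load; the question states the file fits in memory.
            List<String> lines = Files.readAllLines(Paths.get("data.csv"), StandardCharsets.UTF_8);
            for (String line : lines) {
                String[] fields = line.split(",");
                byKey.put(fields[0], fields);     // assume column 0 is the lookup key
            }

            // After the load, every lookup is O(1) regardless of the on-disk format.
            String[] row = byKey.get("someKey");
            System.out.println(row == null ? "not found" : String.join(" | ", row));
        }
    }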
Another common case is that you cannot keep the main application running, and hence you cannot keep the file in memory. In that case, get rid of the file and use a database. Any database you can use will most likely be faster than anything you could come up with. However, it is not as easy to transfer a database between systems.
Most likely though, the file format will not be a real issue to consider. I've read quite a few very long CSV files and most often the time it took to read the file was negligible compared to what I needed to do with the data afterwards.
If you have a lot of data and it is truly at production level, then use Apache Lucene.
If it's a small dataset, or this is about learning, then read up on suffix trees and tries.
"Convert" it (i.e. import it) into a database table (or preferably normalised tables) with indexes on the searchable columns and a primary key on the column that has the highest cardinality - no need to re-invent the wheel... you'll save yourself a lot of issues - transaction management, concurrency.... really - if it will be in production, the chance that you will want to keep it in csv format is slim to zero.
If the file is too large to keep in memory, then just keep the keys in memory. Some number of rows can also be kept in memory, with the least-recently-accessed rows paged out as additional rows are needed. Use seeks (directed by the keys) to find the row in the file itself, then load that row into memory in case other entries on it are needed.
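A sketch of that keys-in-memory approach (the key is assumed to be the first comma-separated field of each row; RandomAccessFile.readLine() assumes a single-byte encoding):

    import java.io.*;
    import java.util.*;

    public class KeyedRowStore implements Closeable {
        private final RandomAccessFile raf;
        private final Map<String, Long> offsetsByKey = new HashMap<>();

        public KeyedRowStore(File csv) throws IOException {
            raf = new RandomAccessFile(csv, "r");
            // One pass: remember the byte offset of every row under its key.
            long pos = 0;
            String line;
            while ((line = raf.readLine()) != null) {
                int comma = line.indexOf(',');
                String key = comma >= 0 ? line.substring(0, comma) : line;
                offsetsByKey.put(key, pos);
                pos = raf.getFilePointer();
            }
        }

        // Later lookups are a single seek directed by the key.
        public String lookup(String key) throws IOException {
            Long offset = offsetsByKey.get(key);
            if (offset == null) {
                return null;
            }
            raf.seek(offset);
            return raf.readLine();
        }

        @Override
        public void close() throws IOException {
            raf.close();
        }
    }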
I'm working with a big matrix (not sparse); it contains about 10^10 doubles.
Of course I cannot keep it in memory, and I need just one row at a time.
I thought of splitting it into files, one row per file (which requires a lot of files), and just reading a file every time I need a row. Do you know a more efficient way?
Why do you want to store it in different files? Can't you use a single file?
You could use the methods of the RandomAccessFile class to perform the reading from that file.
So, 800 KB per file sounds like a good division. Nothing really stops you from using one giant file, of course. A matrix, at least one like yours that isn't sparse, can be treated as a file of fixed-length records, which makes random access a trivial matter.
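A sketch of the fixed-length-record idea for a single binary file of doubles (dimensions and file layout are assumptions; the byte order must match whatever wrote the file, big-endian here as with DataOutputStream):

    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.nio.ByteBuffer;

    public class MatrixFile implements AutoCloseable {
        private final RandomAccessFile raf;
        private final int cols;

        public MatrixFile(String path, int cols) throws IOException {
            this.raf = new RandomAccessFile(path, "r");
            this.cols = cols;
        }

        // Row i starts at byte i * cols * 8, so any row is one seek away.
        public double[] readRow(long rowIndex) throws IOException {
            raf.seek(rowIndex * cols * 8L);
            byte[] bytes = new byte[cols * 8];
            raf.readFully(bytes);
            double[] row = new double[cols];
            ByteBuffer.wrap(bytes).asDoubleBuffer().get(row);   // big-endian by default
            return row;
        }

        @Override
        public void close() throws IOException {
            raf.close();
        }
    }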
If you do store it one file per row, I might suggest making a directory tree corresponding to decimal digits, so 0/0/0/0 through 9/9/9/9.
Considerations one way or the other...
is it being backed up? Do you have high-capacity backup media or something ordinary?
does this file ever change?
if it does change and it is backed up, does it change all at once or are changes localized?
It depends on the algorithms you want to execute, but I guess that in most cases a representation where each file contains some square or rectangular region would be better.
For example, matrix multiplication can be done recursively by breaking a matrix into submatrices.
If you are going to be saving it in a file, I believe serializing it will save space/time over storing it as text.
Serializing the doubles will store them as 8 bytes each (plus serialization overhead) and means that you will not have to convert these doubles back and forth to and from Strings when saving or loading the file.
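For instance, writing the raw binary representation with a DataOutputStream (rather than full object serialization, which adds header overhead) costs exactly 8 bytes per double; the file name here is just an example:

    import java.io.BufferedOutputStream;
    import java.io.DataOutputStream;
    import java.io.FileOutputStream;
    import java.io.IOException;

    public class BinaryWrite {
        public static void main(String[] args) throws IOException {
            try (DataOutputStream out = new DataOutputStream(
                    new BufferedOutputStream(new FileOutputStream("row.bin")))) {
                out.writeDouble(0.12345678901234567);   // exactly 8 bytes on disk
            }
        }
    }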
I'd suggest using a disk-persistent cache like Ehcache. Just configure it to keep as many fragments of your matrix in memory as you like, and it will take care of the serialization. All you have to do is decide on the way of fragmentation.
Another approach that comes to mind is Terracotta (which recently bought Ehcache, by the way). It's great for getting a large network-attached heap that can easily manage your 10^10 double values without you having to care about it in code at all.