Java: Should I use a list or an array?

I am using Java 7 but could upgrade to 8 if deemed worthwhile.
I am creating a program to compare the files in 2 folders.
I am using Files.newDirectoryStream to extract the file data, although I still have to work out how to do that recursively, but I'm sure that should not be too difficult.
So to the point: I will do the compare and then add the missing items so the folders are in sync.
This means I will need to store:
1) the name &
2) the path
Therefore I will need to use something to store this. I know I can do it with an Array[][], but is this the best way, or are lists more efficient?
I imagine the largest folder to hold 200 files.
Thanks in advance.

You can use ArrayList<File>, which can be seen as a "wrapper around plain arrays" (for the sake of simplicity and understanding). As each element in the List will be of type File, you already have access to the path and name of the file and do not need to store them separately.
Of course the ArrayList has a bit more overhead than a simple array, but if you expect the largest folder to hold 200 files, it's not a big deal and nothing to worry about. Unless you run your program on your calculator ;)
If you can, you should first fetch the number of files in the directory and use that number as the initial capacity of the List:
int numberOfFiles = fetchFileCount(directory);
ArrayList<File> currentFiles = new ArrayList<>(numberOfFiles);
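
Since the question already mentions Files.newDirectoryStream and recursion, here is a minimal sketch (my own illustration, not part of the original answer; the class name and "folderA" are made up) of collecting a folder's contents into an ArrayList<Path> -- a Path, like a File, already carries both the name and the full path:

import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class FolderScanner {
    // Recursively collect every regular file below 'dir' into 'out'.
    static void collect(Path dir, List<Path> out) throws IOException {
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir)) {
            for (Path entry : stream) {
                if (Files.isDirectory(entry)) {
                    collect(entry, out);      // recurse into subfolders
                } else {
                    out.add(entry);           // entry.getFileName() is the name, entry itself is the path
                }
            }
        }
    }

    public static void main(String[] args) throws IOException {
        List<Path> files = new ArrayList<>();
        collect(Paths.get("folderA"), files);  // "folderA" is a placeholder
        System.out.println(files.size() + " files found");
    }
}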

Related

Reading in an unknown number of elements from a file to an array?

As the title states, I'm trying to read an unknown number of elements from a file into an array. What's the simplest (the professor wants us to avoid using things she hasn't taught) yet effective way of going about this?
I've thought about reading and counting the elements in the file one by one, then create the array after I know what size to make it, then actually store the elements in there. But that seems a little inefficient. Is there a better way?
There are only two ways to do this: the way you suggested (count, then read), and making an array you hope is big enough, then resizing if it's not (which is the easier of the two, as ArrayList does that automatically for you).
Which is better depends on whether you're more limited by time or memory (as typically reading the file twice will be slower than reallocating an array even multiple times).
EDIT: There is a third way, which is only available if each record in the file has a fixed width, and the file is not compressed (or encoded in any other way that would mess with the content layout): get the size of the file, divide by record size, and that's exactly how many records you have to allocate for. Unfortunately, life is not always that easy. :)
As you have not mentioned whether your professor has taught ArrayList or not, I would go for ArrayList for sure. The sole purpose of ArrayList is to deal with this kind of situation. Features like dynamic resizing give ArrayList some advantages over a plain array.
You could count first and then create the array accordingly, but that would be slow, and why reinvent the wheel?
So your approach should be to use an ArrayList: read each element and just call ArrayList.add()...
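
For illustration, a minimal sketch of that approach (my own code; the file name "numbers.txt" is a placeholder), assuming one integer per line:

import java.io.File;
import java.io.FileNotFoundException;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

public class ReadUnknownCount {
    public static void main(String[] args) throws FileNotFoundException {
        List<Integer> values = new ArrayList<>();
        try (Scanner in = new Scanner(new File("numbers.txt"))) {
            while (in.hasNextInt()) {
                values.add(in.nextInt());   // the list grows as needed
            }
        }
        // If a plain array is required afterwards:
        int[] array = new int[values.size()];
        for (int i = 0; i < values.size(); i++) {
            array[i] = values.get(i);
        }
        System.out.println(array.length + " elements read");
    }
}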

Sort huge file in java

I have a huge file with a unique word on each line. The size of the file is around 1.6 GB (I have to sort other files after this which are around 15 GB). Until now, for smaller files I used Arrays.sort(), but for this file I get java.lang.OutOfMemoryError: Java heap space. I know the reason for this error. Is there any way to do it short of writing a complete quicksort or merge sort program myself?
I read that Arrays.sort() uses quicksort or a hybrid sort internally. Is there any ready-made procedure like Arrays.sort()?
If I have to write a program for the sorting, which one should I use, quicksort or merge sort? I'm worried about the worst case.
Depending on the structure of the data to store, you can do many different things.
In the case of well structured data, where you need to sort by one or more specific fields (in which case system tools might not be helpful), you are probably better off using a datastore that allows sorting. MongoDB comes to mind as a good fit for this, given that the size doesn't exceed a few hundred GB. Other NoSQL datastores might also fit the bill nicely, although Mongo's simplicity of use and installation and its support for JSON data make it a really great candidate.
If you really want to go with the pure Java approach, it gets really tricky. This is the kind of question you ask at job interviews, and I would never actually expect anybody to implement the code. However, the general solution is merge sort (using random access files is a bad idea because it amounts to insertion sort, i.e., a non-optimal running time, which can be bad given the size of your file).
By merge sort I mean reading one chunk of the file at a time, small enough to fit in memory (so the chunk size depends on how much RAM you have), sorting it and then writing it back to a new file on disk. After you have read the whole file you can start merging the chunk files two at a time, by reading just the head of each and writing the smaller of the two records back to a third file. Do that for the 'first generation' of files and then continue with the second one, until you end up with one big sorted file. Note that this is basically a bottom-up way of implementing merge sort, the academic recursive algorithm being the top-down approach.
Note that having intermediate files can be avoided altogether by using a multiway merge algorithm. This is typically based on a heap/priority queue, so the implementation might get slightly more complex but it reduces the number of I/O operations required.
Implementing the above in Java shouldn't be too difficult with some careful design, although it can definitely get tricky. I still highly recommend an out-of-the-box solution like Mongo.
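
To make the two phases concrete, here is a rough sketch of that approach (my own illustration, not the answerer's code; the chunk size, class names and command-line usage are arbitrary), using a priority queue for the final multiway merge so no intermediate "generations" of files are needed:

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

public class ExternalSort {

    static final int CHUNK_LINES = 1_000_000;   // tune to the available heap

    // One open chunk file plus the line currently at its head.
    static class ChunkCursor {
        final BufferedReader reader;
        String current;

        ChunkCursor(Path p) throws IOException {
            reader = Files.newBufferedReader(p, StandardCharsets.UTF_8);
            current = reader.readLine();
        }

        boolean advance() throws IOException {
            current = reader.readLine();
            return current != null;
        }
    }

    // Phase 1: read the input in chunks, sort each chunk in memory,
    // write each one to its own temporary file.
    static List<Path> splitIntoSortedChunks(Path input) throws IOException {
        List<Path> chunks = new ArrayList<>();
        try (BufferedReader in = Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
            List<String> buffer = new ArrayList<>();
            String line;
            while ((line = in.readLine()) != null) {
                buffer.add(line);
                if (buffer.size() == CHUNK_LINES) {
                    chunks.add(writeSortedChunk(buffer));
                    buffer.clear();
                }
            }
            if (!buffer.isEmpty()) {
                chunks.add(writeSortedChunk(buffer));
            }
        }
        return chunks;
    }

    static Path writeSortedChunk(List<String> lines) throws IOException {
        Collections.sort(lines);
        Path chunk = Files.createTempFile("chunk", ".txt");
        Files.write(chunk, lines, StandardCharsets.UTF_8);
        return chunk;
    }

    // Phase 2: multiway merge of all chunk files with a priority queue,
    // avoiding the intermediate generations of two-way merges.
    static void mergeChunks(List<Path> chunks, Path output) throws IOException {
        PriorityQueue<ChunkCursor> queue = new PriorityQueue<>(Math.max(1, chunks.size()),
                new Comparator<ChunkCursor>() {
                    public int compare(ChunkCursor a, ChunkCursor b) {
                        return a.current.compareTo(b.current);
                    }
                });
        for (Path chunk : chunks) {
            ChunkCursor cursor = new ChunkCursor(chunk);
            if (cursor.current != null) {
                queue.add(cursor);
            }
        }
        try (BufferedWriter out = Files.newBufferedWriter(output, StandardCharsets.UTF_8)) {
            while (!queue.isEmpty()) {
                ChunkCursor smallest = queue.poll();
                out.write(smallest.current);
                out.newLine();
                if (smallest.advance()) {
                    queue.add(smallest);        // re-insert with its next line
                } else {
                    smallest.reader.close();
                }
            }
        }
    }

    public static void main(String[] args) throws IOException {
        List<Path> chunks = splitIntoSortedChunks(Paths.get(args[0]));
        mergeChunks(chunks, Paths.get(args[1]));
    }
}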
As it turns out, your problem is that your heap cannot accommodate such a large array, so you must forget any solution that implies loading the whole file content in an array (as long as you can't grow your heap).
So you're facing streaming. It's the only (and typical) solution when you have to handle input sources that are larger than your available memory. I would suggest streaming the file content to your program, which should perform the sorting by either outputting to a random access file (trickier) or to a database.
I'd take a different approach.
Given a file, say with a single element per line, I'd read the first n elements. I would repeat this m times, such that the number of lines in the file is n * m + C, with C being the left-over lines.
When dealing with Integers, you may wish to use around 100,000 elements per read; with Strings I would use fewer, maybe around 1,000. It depends on the data type and the memory needed per element.
From there, I would sort those n elements and write them to a temporary file with a unique name.
Now, since all of the files are sorted, the smallest elements will be at the start of each. You can then just iterate over the files until you have processed all the elements, each time finding the smallest head element and printing it to the new final output.
This approach reduces the amount of RAM needed, relies on drive space instead, and allows you to handle sorting of any file size.
Build an array of the record positions inside the file (a kind of index); maybe that would fit into memory instead. You need an 8-byte Java long per file record. Sort the array, loading records only for comparison and not retaining them (use RandomAccessFile). After sorting, write the new final file using the index pointers to read the records in the needed order.
This will also work if the records are not all the same size.
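
A rough sketch of that index idea (my own illustration, assuming one record per line and a single-byte-per-character encoding, since RandomAccessFile.readLine does not decode multi-byte charsets; the input and output file names come from the command line):

import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.io.RandomAccessFile;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

public class IndexSort {
    public static void main(String[] args) throws IOException {
        // 1. Build the index: the byte offset of every line start (one long per record).
        List<Long> offsets = new ArrayList<>();
        try (RandomAccessFile raf = new RandomAccessFile(args[0], "r")) {
            long pos = 0;
            while (pos < raf.length()) {
                offsets.add(pos);
                raf.readLine();              // skip to the next line
                pos = raf.getFilePointer();
            }
        }

        try (final RandomAccessFile raf = new RandomAccessFile(args[0], "r")) {
            // 2. Sort the index, loading records only for the comparison.
            Collections.sort(offsets, new Comparator<Long>() {
                public int compare(Long a, Long b) {
                    try {
                        raf.seek(a);
                        String left = raf.readLine();
                        raf.seek(b);
                        String right = raf.readLine();
                        return left.compareTo(right);
                    } catch (IOException e) {
                        throw new RuntimeException(e);
                    }
                }
            });

            // 3. Write the output in index order.
            try (PrintWriter out = new PrintWriter(new FileWriter(args[1]))) {
                for (long offset : offsets) {
                    raf.seek(offset);
                    out.println(raf.readLine());
                }
            }
        }
    }
}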

Java- Parsing a large text file

I had a quick question. I'm working on a school project and I need to parse an extremely large text file. It's for a database class, so I need to get unique actor names from the file because actors will be a primary key in the MySQL database. I've already written the parser and it works great, but at the time I forgot to remove the duplicates. So, I decided the easiest way would be to create an actors ArrayList (using the ArrayList ADT), then use the contains() method to check whether the actor name is already in the ArrayList before I print it to a new text file. If it is I do nothing; if it isn't I add it to the ArrayList and print it to the page. Now the program is running extremely slowly. Before the ArrayList, it took about 5 minutes. The old actor file was 180k without duplicates removed. Now it's been running for 30 minutes and is at 12k so far. (I'm expecting 100k-150k total this time.)
I left the size of the ArrayList blank because I don't know how many actors are in the file, but at least 1-2 million. I was thinking of just putting 5 million in for its size and checking to see if it got them all after. (Simply check the last ArrayList index and, if it's empty, it didn't run out of space.) Would this reduce time because the ArrayList isn't redoubling constantly and recopying everything over? Is there another method which would be faster than this? I'm also concerned my computer might run out of memory before it completes. Any advice would be great.
(Also, I did try running the 'unique' command on the text file without success. The actor names print out 1 per line, in one column. I was thinking maybe the command was wrong. How would you remove duplicates from a text file column in a Windows or Linux command prompt?) Thank you and sorry for the long post. I have a midterm tomorrow and am starting to get stressed.
Use a Set instead of a List so you don't have to check whether the collection already contains the element; a Set doesn't allow duplicates.
A lookup with ArrayList.contains() is roughly O(n).
Doing this a million times is, I think, what is killing your program.
Use a HashSet implementation of Set. It will afford you theoretically constant-time lookups and will automatically remove duplicates for you.
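
A minimal sketch of that (my own illustration; the file names are placeholders), de-duplicating the actor names with a HashSet while streaming through the file:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.HashSet;
import java.util.Set;

public class UniqueActors {
    public static void main(String[] args) throws IOException {
        Set<String> seen = new HashSet<>(2_000_000);   // pre-size for roughly 1-2 million names
        try (BufferedReader in = new BufferedReader(new FileReader("actors.txt"));
             PrintWriter out = new PrintWriter(new FileWriter("actors-unique.txt"))) {
            String line;
            while ((line = in.readLine()) != null) {
                if (seen.add(line)) {   // add() returns false if the name was already present
                    out.println(line);
                }
            }
        }
    }
}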
- Try using a memory-mapped file in Java for faster access to the large file.
- And instead of an ArrayList, use a HashMap collection where the key is the actor's name (or its hash code); this will greatly improve the speed, since looking up a key in a HashMap is very fast.

Fast way to alphabetically sort the contents of a file in java

Can anyone recommend a fast way to sort the contents of a text file, based on the first X characters of each line?
For example, if I have the following text in the file:
Adrian Graham some more text here
John Adams some more text here
Then another record needs to be inserted for eg.
Bob Something some more text here
I need to keep the file sorted, but this is a rather big file and I'd rather not load it entirely into memory at once.
By big I mean about 500,000 lines, so perhaps not terribly huge.
I've had a search around and found http://www.codeodor.com/index.cfm/2007/5/14/Re-Sorting-really-BIG-files---the-Java-source-code/1208
and I wanted to know if anyone could suggest any other ways, for the sake of a second opinion?
My initial idea before I read the above linked article was:
Read the file
Split it into several files, for eg A to Z
If a line begins with "a" then it is written to the file called A.txt
Each of the files then has its contents sorted (no clear idea how just yet, apart from alphabetical order)
Then when it comes to reading data, I know that if I want to find a line which starts with A then I open A.txt
When inserting a new line the same thing applies and I just append to the end of the file. Later, after the insert, when there is time, I can invoke my sorting program to reorder the files that have had stuff appended to them.
I realise that there are a few flaws in this; for example, there won't be an even number of lines that start with a particular letter, so some files may be bigger than others, etc.
Which again is why I need a second opinion and suggestions on how to approach this.
The current program is in Java, but any programming language could be used for an example that would achieve this... I'll port what I need to.
(If anyone's wondering, I'm not deliberately trying to give myself a headache by storing info this way; I inherited a painful little program which stores data to files instead of using some kind of database.)
Thanks in advance
You may also want to simply call the DOS "sort" command to sort the file. It is quick and will require next to no programming on your part.
In a DOS box, type help sort|more for the sort syntax and options.
500,000 lines shouldn't really be that much to sort. Read the whole thing into memory, and then sort it using the standard built-in functions. If you really find that these are too slow, then move on to something more complicated. 500,000 lines x about 60 bytes per line still only ends up being about 30 MB.
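
A minimal sketch of that in-memory approach (my own illustration; the file names and the prefix length of 20 are placeholders): read all lines, sort by the first X characters, and write the result back out:

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

public class SortByPrefix {
    public static void main(String[] args) throws IOException {
        final int prefixLength = 20;   // the "first X characters"
        List<String> lines = Files.readAllLines(Paths.get("records.txt"), StandardCharsets.UTF_8);
        Collections.sort(lines, new Comparator<String>() {
            public int compare(String a, String b) {
                String ka = a.substring(0, Math.min(prefixLength, a.length()));
                String kb = b.substring(0, Math.min(prefixLength, b.length()));
                return ka.compareTo(kb);
            }
        });
        Files.write(Paths.get("records-sorted.txt"), lines, StandardCharsets.UTF_8);
    }
}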
Another option might be to read the file and put it in a lightweight db (for example HSQLDB in file mode).
Then get the data out sorted, and write it back to a file. (Or simply migrate the program so it uses a db.)
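
A rough sketch of that route (my own illustration; it assumes a recent HSQLDB jar on the classpath so the JDBC 4 driver is auto-loaded, and the table and file names are placeholders): load the lines into an embedded, file-backed database and read them back sorted:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.PrintWriter;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

public class HsqldbSort {
    public static void main(String[] args) throws Exception {
        // Embedded, file-backed database in the working directory.
        try (Connection conn = DriverManager.getConnection("jdbc:hsqldb:file:recordsdb", "SA", "")) {
            try (Statement st = conn.createStatement()) {
                st.execute("CREATE TABLE records (line VARCHAR(1000))");
            }

            // Load the text file into the table.
            try (BufferedReader in = new BufferedReader(new FileReader("records.txt"));
                 PreparedStatement insert = conn.prepareStatement("INSERT INTO records VALUES (?)")) {
                String line;
                while ((line = in.readLine()) != null) {
                    insert.setString(1, line);
                    insert.executeUpdate();
                }
            }

            // Read it back sorted and write it out again.
            try (Statement st = conn.createStatement();
                 ResultSet rs = st.executeQuery("SELECT line FROM records ORDER BY line");
                 PrintWriter out = new PrintWriter(new FileWriter("records-sorted.txt"))) {
                while (rs.next()) {
                    out.println(rs.getString(1));
                }
            }
        }
    }
}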

The right way to manage a big matrix in Java

I'm working with a big matrix (not sparse); it contains about 10^10 doubles.
Of course I cannot keep it in memory, and I need just 1 row at a time.
I thought of splitting it into files, one row per file (it requires a lot of files), and just reading a file every time I need a row. Do you know any more efficient way?
Why do you want to store it in different files? Can't you use a single file?
You could use the methods of the RandomAccessFile class to perform the reading from that file.
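
For illustration, a rough sketch (my own code, not the answerer's) of reading one row at a time from a single binary file of fixed-length records with RandomAccessFile; the 100,000 x 100,000 dimensions and the file name are only assumptions to match the ~10^10 figure:

import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;

public class MatrixFile {
    static final int COLS = 100_000;                 // doubles per row (assumed)
    static final long ROW_BYTES = (long) COLS * 8;   // a double is 8 bytes

    static double[] readRow(RandomAccessFile raf, long row) throws IOException {
        raf.seek(row * ROW_BYTES);                   // jump straight to the wanted row
        byte[] buffer = new byte[(int) ROW_BYTES];   // 800 KB per row
        raf.readFully(buffer);
        double[] values = new double[COLS];
        ByteBuffer.wrap(buffer).asDoubleBuffer().get(values);
        return values;
    }

    public static void main(String[] args) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile("matrix.bin", "r")) {
            double[] row = readRow(raf, 42);         // "matrix.bin" and row 42 are placeholders
            System.out.println("first value of the row: " + row[0]);
        }
    }
}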
So, 800 KB per file sounds like a good division. Nothing really stops you from using one giant file, of course. A matrix, at least one like yours that isn't sparse, can be considered a file of fixed-length records, making random access a trivial matter.
If you do store it one file per row, I might suggest making a directory tree corresponding to decimal digits, so 0/0/0/0 through 9/9/9/9.
Considerations one way or the other...
is it being backed up? Do you have high-capacity backup media or something ordinary?
does this file ever change?
if it does change and it is backed up, does it change all at once or are changes localized?
It depends on the algorithms you want to execute, but I guess that in most cases a representation where each file contains some square or rectangular region would be better.
For example, matrix multiplication can be done recursively by breaking a matrix into submatrices.
If you are going to be saving it in a file, I believe serializing it will save space/time over storing it as text.
Serializing the doubles will store them as 8 bytes each (plus serialization overhead) and means that you will not have to convert these doubles back and forth to and from Strings when saving or loading the file.
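
A small sketch of writing a row in binary rather than text (my own illustration; the class name is made up): DataOutputStream writes each double as exactly 8 bytes, with no String conversion involved.

import java.io.BufferedOutputStream;
import java.io.DataOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;

public class BinaryRowWriter {
    static void writeRow(File file, double[] row) throws IOException {
        try (DataOutputStream out =
                 new DataOutputStream(new BufferedOutputStream(new FileOutputStream(file)))) {
            for (double value : row) {
                out.writeDouble(value);   // 8 bytes per value
            }
        }
    }
}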
I'd suggest using a disk-persistent cache like Ehcache. Just configure it to keep as many fragments of your matrix in memory as you like, and it will take care of the serialization. All you have to do is decide on the way of fragmentation.
Another approach that comes to mind is using Terracotta (which recently bought Ehcache, by the way). It's great for getting a large network-attached heap that can easily manage your 10^10 double values without your having to care about it in code at all.
