Parsing a large text file in Java

I have a quick question. I'm working on a school project and I need to parse an extremely large text file. It's for a database class, so I need to get unique actor names from the file, because actors will be a primary key in the MySQL database. I've already written the parser and it works great, but at the time I forgot to remove the duplicates. So I decided the easiest way would be to create an actors ArrayList, then use the contains() method to check whether the actor name is already in the list before I print it to a new text file. If it is, I do nothing; if it isn't, I add it to the ArrayList and print it to the file. Now the program is running extremely slow. Before the ArrayList check, it took about 5 minutes; the old actor file was 180k, without duplicates removed. Now it's been running for 30 minutes and is only at 12k so far. (I'm expecting 100k-150k total this time.)
I left the initial capacity of the ArrayList blank because I don't know how many actors are in the file, but there are at least 1-2 million. I was thinking of just passing 5 million as its capacity and checking afterwards whether it got them all (simply check the last ArrayList index: if it's empty, it didn't run out of space). Would this reduce the time, because the ArrayList isn't constantly doubling and recopying everything? Is there another method that would be faster than this? I'm also concerned my computer might run out of memory before it completes. Any advice would be great.
(Also, I did try running the 'unique' command on the text file without success. The actor names print out one per line, in a single column, so I was thinking maybe the command was wrong. How would you remove duplicates from a text file column at a Windows or Linux command prompt?) Thank you, and sorry for the long post. I have a midterm tomorrow and I'm starting to get stressed.

Use a Set instead of a List so you don't have to check whether the collection already contains the element; a Set doesn't allow duplicates.

Lookup via ArrayList's contains() is roughly O(n). Doing that a million times is, I think, what is killing your program.
Use a HashSet implementation of Set. It will afford you theoretically constant time lookup and will automatically remove duplicates for you.
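For illustration, a minimal sketch of that idea, assuming the actor names sit one per line in a plain text file (the file names here are placeholders):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.HashSet;
import java.util.Set;

public class UniqueActors {
    public static void main(String[] args) throws IOException {
        Set<String> seen = new HashSet<>();
        try (BufferedReader in = new BufferedReader(new FileReader("actors_raw.txt"));
             PrintWriter out = new PrintWriter("actors_unique.txt")) {
            String line;
            while ((line = in.readLine()) != null) {
                String actor = line.trim();
                // add() returns false if the element was already present, so the
                // "contains" check and the insert happen in one near-O(1) step.
                if (!actor.isEmpty() && seen.add(actor)) {
                    out.println(actor);
                }
            }
        }
    }
}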

- Try using a memory-mapped file in Java for faster access to the large file.
- Instead of an ArrayList, use a hash-based collection, e.g. a HashMap where the key is the actor's name (or a HashSet of names). This will improve the speed a lot, since looking up a key in a hash table is very fast. (A rough sketch combining both ideas follows below.)
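A rough sketch of those two suggestions combined: the file is memory-mapped and scanned byte by byte, and names are deduplicated with a HashSet. The file name, the single-byte-character assumption, and the under-2-GB mapping limit are assumptions of this sketch, not part of the suggestion above.

import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.HashSet;
import java.util.Set;

public class MappedDedup {
    public static void main(String[] args) throws IOException {
        Set<String> actors = new HashSet<>();
        try (FileChannel ch = FileChannel.open(Paths.get("actors.txt"), StandardOpenOption.READ)) {
            // A single mapping only works for files under ~2 GB.
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            StringBuilder line = new StringBuilder();
            while (buf.hasRemaining()) {
                byte b = buf.get();
                if (b == '\n') {
                    actors.add(line.toString().trim());
                    line.setLength(0);
                } else {
                    line.append((char) b);   // assumes single-byte (ASCII-like) characters
                }
            }
            if (line.length() > 0) {
                actors.add(line.toString().trim());   // last line without a trailing newline
            }
        }
        System.out.println("unique actors: " + actors.size());
    }
}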

Related

Checking if a String array contains a specific string

I have a string array (String[] words) which gets populated at runtime and contains almost 400k members. I tried to check whether the array contains a specific string:
boolean check = Arrays.asList(words).contains("code");
I know "code" is already in the array, but check never comes back true, as if the check isn't even taking place. I also tried with hash sets, but with no success. Can anyone tell me where the problem is?
For the given problem the answer isn't really clear, as the question is a bit unclear. Most likely, your 400K elements don't actually contain "code" (for example, you forgot to trim the input and there are spaces/newlines around it).
But beyond that, a distinct non-answer here: when you have 400K elements to search, your idea of doing Arrays.asList(words).contains("code") is simply a bad one.
Walking through 400K elements to figure out whether one entry matches a search word is terribly inefficient. If you only wanted to look up a single word, why store all the data in memory? So you probably intend to search for different words over time. And each time, you want to iterate over 400K elements to figure out whether a word is present?
Instead, you should invest in a more appropriate data structure. That could be a (Hash)Set, or a (Hash)Map, or even something beyond that (like a full-text-search service such as Solr or ElasticSearch, ...).
Seriously: repeatedly iterating over 400K elements on a mobile device is not something your users will appreciate. Not at all.
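As a small illustration of the HashSet route (the class and method names are just placeholders): build the set once, and every later lookup is a hash probe instead of a 400K-element scan.

import java.util.HashSet;
import java.util.Set;

public class WordLookup {
    private final Set<String> wordSet = new HashSet<>();

    // Build the set once, O(n) over the 400K entries.
    public WordLookup(String[] words) {
        for (String w : words) {
            wordSet.add(w.trim());   // trim to avoid misses from stray whitespace/newlines
        }
    }

    // Each lookup is then roughly O(1) instead of a full array scan.
    public boolean contains(String word) {
        return wordSet.contains(word.trim());
    }
}

Usage would be something like new WordLookup(words).contains("code").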
Try the code below; I hope it helps. (Note that this does a regex/substring match rather than an exact equality check.)
boolean check = Arrays.stream(words).anyMatch(w -> w.matches("(.*)code(.*)"));

Speed up a search cache without using too much memory

I have to access a database with 380,000 entries. I don't have write access to the DB; I can only read it. I've made a search function using a map to search for users by first name. Here is my process:
1 - Load everything from the DB
2 - Store everything into a Map<Character, ArrayList<User>>, using alphabet letters as keys to group users by the first letter of their first name.
<A> {Alba, jessica, Alliah jane, etc ...}
<B> {Birsmben bani, etc ...}
When someone searches for a user, I take the first letter of the first name typed, call map.get(firstLetter), and then iterate over the ArrayList to find all the matching users.
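Roughly, the lookup side of what I do looks like this (just a sketch; the User fields here are stand-ins for my real class):

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class FirstLetterSearch {
    // Minimal stand-in for the real User class; the fields are assumptions.
    record User(String firstName, String lastName) {}

    // Sketch of the first-letter lookup described above.
    static List<User> findByFirstName(Map<Character, ArrayList<User>> index, String query) {
        List<User> matches = new ArrayList<>();
        if (query == null || query.isEmpty()) {
            return matches;
        }
        char firstLetter = Character.toUpperCase(query.charAt(0));
        List<User> bucket = index.getOrDefault(firstLetter, new ArrayList<>());
        for (User u : bucket) {                 // linear scan within one letter's bucket
            if (u.firstName().equalsIgnoreCase(query)) {
                matches.add(u);
            }
        }
        return matches;
    }
}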
The Map takes up a huge amount of memory, I guess (380,000 User objects); I had to increase the heap size.
I want to make it faster, for instance by using the first name as the key for the Map (though there are many people with the same first name).
I have two solutions in mind:
1 - Still use a map, with the first name as the key (increasing the heap size again?)
2 - Use files on disk instead of the Map (Alba.dat would contain all the Albas, for example) and open the right file for each search. No need to increase the heap size, but are there any side effects?
Which one is better? (pros and cons)
Update with more info
It's a database of customers who call our customer service line. The person who takes the call has to search by the customer's name (usually first name and then last name). Using the DB directly is too slow to search. The solution I've implemented is already much faster (1/2 seconds vs. 26 seconds using the DB), but I want to improve it.
IMHO, you don't have to cache all the entries in memory, just a part of them. Maybe:
just use a ring buffer, or
(more complicated, but it makes more sense) implement an LFU cache that keeps only the N most frequently accessed items. See this question for a hint on how to implement such a cache.
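For a rough idea, here is a minimal LRU cache (a simpler relative of LFU, shown only as a sketch) built on LinkedHashMap's access-order mode; the capacity and generic types are assumptions:

import java.util.LinkedHashMap;
import java.util.Map;

public class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;

    public LruCache(int capacity) {
        super(16, 0.75f, true);   // accessOrder = true: iteration order follows access order
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > capacity;   // evict the least recently used entry
    }
}

A search would then consult the cache first, e.g. LruCache<String, List<User>> cache = new LruCache<>(10_000);, and fall back to the full data set only on a miss.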
There are several issues with your approach:
It implies that the number of users doesn't change; a good application design would work with any number of users without software changes.
It implies that the current problem is the only one. What happens if the next requirement that needs implementing is "search by caller ID" or "search by zip code"?
It is reinventing the wheel: you are effectively starting to write a database, index, or information-retrieval solution (however you want to name it) from scratch.
The right thing to do is to export the user data into a database engine that provides proper search capabilities. The export/extraction can hopefully be sped up if you have modification timestamps or if you can intercept updates and reapply them to your search index.
What you use for your search does not matter too much; a simple database on a modern system is fast enough, and most also provide indexing capabilities to speed up your search. If you want something that can be embedded in your application, is specialized for search, and solves the problems above, I'd recommend Lucene.

Sort a huge file in Java

I have a huge file with unique words on each line. The size of the file is around 1.6 GB (I have to sort other files after this which are around 15 GB). Until now, for smaller files, I used Arrays.sort(), but for this file I get java.lang.OutOfMemoryError: Java heap space. I know the reason for this error. Is there any way to do this short of writing a complete quicksort or merge sort program myself?
I read that Arrays.sort() uses quicksort or a hybrid sort internally. Is there any ready-made procedure like Arrays.sort() for this?
If I have to write a sorting program, which one should I use, quicksort or merge sort? I'm worried about the worst case.
Depending on the structure of the data to store, you can do many different things.
In the case of well-structured data, where you need to sort by one or more specific fields (in which case system tools might not be helpful), you are probably better off using a datastore that allows sorting. MongoDB comes to mind as a good fit, given that the size doesn't exceed a few hundred GB. Other NoSQL datastores might also fit the bill nicely, although Mongo's simplicity of use and installation and its support for JSON data make it a really good candidate.
If you really want to go with the Java approach, it gets really tricky. This is the kind of question you ask at job interviews, and I would never actually expect anybody to implement the code. However, the general solution is merge sort (using random-access files is a bad idea because it means insertion sort, i.e. non-optimal run time, which can be bad given the size of your file).
By merge sort I mean reading one chunk of the file at a time, small enough to fit in memory (so it depends on how much RAM you have), sorting it, and then writing it back to a new file on disk. After you have read the whole file, you can start merging the chunk files two at a time by reading just the head of each and writing the smaller of the two records back to a third file. Do that for the 'first generation' of files, then continue with the second one, until you end up with one big sorted file. Note that this is basically a bottom-up way of implementing merge sort; the academic recursive algorithm is the top-down approach.
Note that the intermediate files can be avoided altogether by using a multiway merge algorithm. This is typically based on a heap/priority queue, so the implementation might get slightly more complex, but it reduces the number of I/O operations required.
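As a rough sketch of that chunk-and-merge idea (the chunk size, temp-file handling, and charset are assumptions of this sketch, not part of the answer above):

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

public class ExternalSort {
    private static final int LINES_PER_CHUNK = 500_000;   // tune to the heap you actually have

    private record Entry(String line, int source) {}

    public static void sort(Path input, Path output) throws IOException {
        // Phase 1: read chunks that fit in memory, sort each, write each to a temp file.
        List<Path> chunks = new ArrayList<>();
        try (BufferedReader in = Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
            List<String> buffer = new ArrayList<>(LINES_PER_CHUNK);
            String line;
            while ((line = in.readLine()) != null) {
                buffer.add(line);
                if (buffer.size() == LINES_PER_CHUNK) {
                    chunks.add(writeSortedChunk(buffer));
                    buffer.clear();
                }
            }
            if (!buffer.isEmpty()) {
                chunks.add(writeSortedChunk(buffer));
            }
        }

        // Phase 2: multiway merge of all chunk files via a priority queue,
        // so no intermediate "generations" of merge files are needed.
        List<BufferedReader> readers = new ArrayList<>();
        PriorityQueue<Entry> heap = new PriorityQueue<>(Comparator.comparing(Entry::line));
        try (BufferedWriter out = Files.newBufferedWriter(output, StandardCharsets.UTF_8)) {
            for (int i = 0; i < chunks.size(); i++) {
                BufferedReader r = Files.newBufferedReader(chunks.get(i), StandardCharsets.UTF_8);
                readers.add(r);
                String first = r.readLine();
                if (first != null) {
                    heap.add(new Entry(first, i));
                }
            }
            while (!heap.isEmpty()) {
                Entry smallest = heap.poll();   // smallest head line among all chunk files
                out.write(smallest.line());
                out.newLine();
                String next = readers.get(smallest.source()).readLine();
                if (next != null) {
                    heap.add(new Entry(next, smallest.source()));
                }
            }
        } finally {
            for (BufferedReader r : readers) {
                r.close();
            }
            for (Path p : chunks) {
                Files.deleteIfExists(p);
            }
        }
    }

    private static Path writeSortedChunk(List<String> lines) throws IOException {
        Collections.sort(lines);                // natural (lexicographic) order
        Path tmp = Files.createTempFile("sort-chunk", ".txt");
        Files.write(tmp, lines, StandardCharsets.UTF_8);
        return tmp;
    }
}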
Implementing the above in Java shouldn't be too difficult with some careful design, although it can definitely get tricky. I still highly recommend an out-of-the-box solution like Mongo.
As it turns out, your problem is that your heap cannot accommodate such a large array, so you must forget any solution that implies loading the whole file content in an array (as long as you can't grow your heap).
So you're left with streaming. It's the only (and typical) solution when you have to handle input sources that are larger than your available memory. I would suggest streaming the file content through your program, performing the sort by outputting either to a random-access file (trickier) or to a database.
I'd take a different approach.
Given a file, say with a single element per line, I'd read the first n elements. I would repeat this m times, such that the number of lines in the file is n * m + C, with C being the left-over lines.
When dealing with integers you may wish to use around 100,000 elements per read; with strings I would use fewer, maybe around 1,000. It depends on the data type and the memory needed per element.
From there, I would sort the n elements and write them to a temporary file with a unique name.
Now, since all the files are sorted, the smallest elements will be at the start of each. You can then iterate over the files until you have processed all the elements, finding the smallest element and printing it to the new final output.
This approach reduces the amount of RAM needed, relying on drive space instead, and lets you handle sorting of any file size.
Build an array of record positions inside the file (a kind of index); maybe that would fit into memory instead. You need an 8-byte Java long per file record. Sort the array, loading records only for comparison and not retaining them (use RandomAccessFile). After sorting, write the new final file using the index pointers to fetch the records in the needed order.
This will also work if the records are not all the same size.
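A small sketch of that offset-index idea (file names are placeholders; note that RandomAccessFile.readLine assumes a single-byte encoding, and every comparison costs two seeks, so this deliberately trades speed for memory):

import java.io.BufferedWriter;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class IndexSort {
    public static void main(String[] args) throws IOException {
        Path input = Paths.get("huge.txt");
        Path output = Paths.get("huge-sorted.txt");

        try (RandomAccessFile raf = new RandomAccessFile(input.toFile(), "r")) {
            // Build the index: one byte offset per record, kept in memory.
            List<Long> offsets = new ArrayList<>();
            long pos = 0;
            while (pos < raf.length()) {
                offsets.add(pos);
                raf.seek(pos);
                raf.readLine();                 // skip over the current record
                pos = raf.getFilePointer();
            }

            // Sort the offsets by the record they point to; records are loaded
            // only for comparison and never retained.
            offsets.sort((a, b) -> readLineAt(raf, a).compareTo(readLineAt(raf, b)));

            // Write the final file by following the sorted index.
            try (BufferedWriter out = Files.newBufferedWriter(output)) {
                for (long offset : offsets) {
                    out.write(readLineAt(raf, offset));
                    out.newLine();
                }
            }
        }
    }

    private static String readLineAt(RandomAccessFile raf, long offset) {
        try {
            raf.seek(offset);
            return raf.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}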

A good way to store/read a large amount of strings?

I'm developing a "funny quotes" app for Android. I have over 1000 quotes which I want to use inside my app, but I don't know whether I should use a database or a text file. Please note that the app should not read the same sentence twice, and it has previous/next buttons, so I need to keep track of the previous quotes. Please tell me which one is better and more optimized. Also, if you can, please link me to a good tutorial about storing/reading the data.
Thanks
Use a database. It's faster and more flexible than a text file, and one day you will extend the app and be glad you used one. I recommend that, when the app starts up, you select all the rows using the built-in random-ordering functionality of your database; 1000 rows should not take too long. Then just iterate through the resulting ArrayList (or whatever you choose to use) of strings: the first quote you show is element 0 from that list, the second is element 1, and so on. With this approach you won't need any other structure to keep track of used quotes; just use the iterator variable that you use for indexing the quote list.
fetchAllRows on this page seems to be what you want for getting the data.
If you choose not to keep too much in memory, you could keep just a list of quote IDs that have been used so far. The last element of that list would be the current quote, and the previous elements would be what the user should see when they press the back button.
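A minimal sketch of that ID-history idea (how the IDs are loaded from the database is left out, and the class name is just a placeholder):

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class QuoteHistory {
    private final List<Integer> remaining;          // quote IDs not shown yet, in random order
    private final List<Integer> seen = new ArrayList<>();
    private int cursor = -1;                        // index of the current quote within 'seen'

    public QuoteHistory(List<Integer> allQuoteIds) {
        remaining = new ArrayList<>(allQuoteIds);
        Collections.shuffle(remaining);
    }

    // Next button: move forward through history first, otherwise draw a fresh quote.
    public Integer next() {
        if (cursor < seen.size() - 1) {
            return seen.get(++cursor);
        }
        if (remaining.isEmpty()) {
            return null;                            // every quote has been shown once
        }
        Integer id = remaining.remove(remaining.size() - 1);
        seen.add(id);
        cursor = seen.size() - 1;
        return id;
    }

    // Previous button: step back through what has already been shown.
    public Integer previous() {
        if (cursor <= 0) {
            return null;
        }
        return seen.get(--cursor);
    }
}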
If you will never read the same string twice, then I recommend not using the String class, as its objects are immutable and will stick in the string pool waiting to be reassigned to a reference, which will never happen since you never read the same string twice.
Using a DB would overcomplicate things.
I suggest reading the flat file as bytes and then translating them into StringBuilder objects, keeping things simple while still preventing intensive GC.
I hope it helps.
Using a DB should be fine, as I think you would not want all the data in memory. You can keep all the quotes in the DB along with a flag that tracks whether a quote has been read (simply update it to true once read).
This way you can pick from any of the quotes whose flag is still false.
Have you considered CsvJdbc? You have the benefit of simple csv files with an easy upgrade path to a real database later when you have a significant number of records.
1k records is quite small and in my opinion not sufficient to merit a database.

Fast way to alphabetically sort the contents of a file in Java

Can anyone recommend a fast way to sort the contents of a text file, based on the first X characters of each line?
For example, if the text file contains the following:
Adrian Graham some more text here
John Adams some more text here
Then another record needs to be inserted, e.g.:
Bob Something some more text here
I need to keep the file sorted, but this is a rather big file and I'd rather not load it entirely into memory at once.
By big I mean about 500,000 lines, so perhaps not terribly huge.
I've had a search around and found http://www.codeodor.com/index.cfm/2007/5/14/Re-Sorting-really-BIG-files---the-Java-source-code/1208
and I wanted to know if anyone could suggest other ways, for the sake of a second opinion?
My initial idea, before I read the article linked above, was:
Read the file
Split it into several files, e.g. A to Z
If a line begins with "a" then it is written to the file called A.txt
Each of the files then has its contents sorted (no clear idea how yet, apart from alphabetical order)
Then when it comes to reading data, I know that if I want to find a line which starts with A, I open A.txt
When inserting a new line the same thing applies: I just append to the end of the file. Later, after the insert, when there is time, I can invoke my sorting program to reorder the files that have had lines appended to them.
I realise that there are a few flaws in this, e.g. there won't be an even number of lines that start with a particular letter, so some files may be bigger than others, etc.
Which again is why I need a second opinion with suggestions on how to approach this.
The current program is in Java, but an example in any programming language that achieves this would do; I'll port what I need.
(If anyone's wondering, I'm not deliberately trying to give myself a headache by storing info this way; I inherited a painful little program which stores data in files instead of using some kind of database.)
Thanks in advance
You may also want to simply call the DOS "sort" command to sort the file. It is quick and will require next to no programming on your part.
In a DOS box, type help sort|more for the sort syntax and options.
500,000 lines shouldn't really be that much to sort. Read the whole thing into memory and then sort it using the standard built-in functions. If you really find that this is too slow, then move on to something more complicated. 500,000 lines at about 60 bytes per line still only ends up being about 30 MB.
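A sketch of that in-memory approach, sorting on the first X characters of each line (the file name and prefix length are assumptions):

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Comparator;
import java.util.List;

public class SortByPrefix {
    public static void main(String[] args) throws IOException {
        Path file = Paths.get("records.txt");
        int prefixLength = 20;                    // the "first X characters"

        List<String> lines = Files.readAllLines(file, StandardCharsets.UTF_8);
        // Sort by the leading prefix of each line, guarding against short lines.
        lines.sort(Comparator.comparing(
                line -> line.substring(0, Math.min(prefixLength, line.length()))));
        Files.write(file, lines, StandardCharsets.UTF_8);
    }
}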
Another option might be to read the file and put it in a lightweight DB (for example, HSQLDB in file mode).
Then get the data out sorted and write it back to a file. (Or simply migrate the program so it uses a DB.)
