Find a string in a 10k-line file in Java efficiently - java

I need to check whether the password a user entered is contained in a 10k-line .txt file that is stored locally on my computer. I've been asked to do this for a college project, and they've been very emphatic about achieving it in an efficient manner, not taking too long to find the match.
The thing is that, reading the file line by line with a BufferedReader, the match is found almost instantly.
I've tested it on two computers, one with an SSD and the other with an HDD, and I cannot tell the difference.
Am I missing something? Is there another, more efficient way to do it? For example, I could load the file or chunks of the file into memory, but is it worth it?

10k passwords isn't all that much and should easily fit in RAM. You can read the file into memory when your application starts and then only access the in-memory structure. That structure can even be chosen to provide more efficient lookup (e.g. a HashMap or HashSet), or you can sort the list in memory for a one-time cost of O(n log n) and then binary-search it (10k items can be searched in at most 14 steps). Or you could use even fancier data structures such as a Bloom filter.
Just keep in mind: if, as you write, it is "almost instant", then it probably already is efficient enough. (Again, 10k passwords isn't all that much; the file is probably only ~100 kB in size.)
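As a minimal sketch of the in-memory approach, assuming the passwords live in a local file called passwords.txt with one entry per line (the file name and the trimming behaviour are assumptions, not part of the question):

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.Set;

public class PasswordBlacklist {
    private final Set<String> passwords = new HashSet<>();

    // Read the whole file once at start-up; 10k short lines is well under a megabyte.
    public PasswordBlacklist(Path file) throws IOException {
        for (String line : Files.readAllLines(file, StandardCharsets.UTF_8)) {
            passwords.add(line.trim());
        }
    }

    // Average-case O(1) lookup from here on; no file access per check.
    public boolean contains(String candidate) {
        return passwords.contains(candidate);
    }

    public static void main(String[] args) throws IOException {
        PasswordBlacklist blacklist = new PasswordBlacklist(Paths.get("passwords.txt"));
        System.out.println(blacklist.contains("hunter2"));
    }
}
```

Loading happens once at start-up; every later check is a hash lookup instead of a file scan.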

Related

Binary Search a file with varying word length

I am making a crude Java spellchecker that takes an article and a pre-sorted dictionary file. The lengths of the article's words vary, and therefore I tried making a stack that takes in the words given by the file.
This unfortunately didn't work because the stack ran out of space (even with the shortened dictionary file) and due to performance concerns, I decided to read from the text file directly.
The issue is that the file doesn't have words all of the same length. Because the lengths of the words vary, I cannot (and should not) expect the length of a single word to tell me how many words are in the dictionary file, or where a given word starts, based on how large that file is.
Because of this I am stuck. I need to execute a binary search on that file in order to make the spellcheck program work. But I can't execute a binary search if there's no clear way to treat the file as an array, especially when the array is just too large to put into the program's memory.
What should I do?
The Oxford English Dictionary suggests that there are about 250,000 words you need to consider for your dictionary (not counting words used only in highly specific domains). This is important design information for you.
I see some solutions:
1) Simply using a HashSet<>
In theory, you can use a HashSet<> for this number of elements (this SO post discusses the theoretical limits of HashSets and other collections in detail).
However, this brings (as you have observed) a couple of problems:
It takes time (on every application start-up) to read this into RAM
It eats up RAM
Of course you can increase the heap size of your JRE, but there is a natural limit to that (@StvnBrkddll linked an SO post in the comments that describes this perfectly)
2) Using a Database
I'd consider storing the valid words in a (relational) database:
You don't need to load everything on application start up
It does not weigh as heavy on your RAM as option (1)
It gives you more options if you later want your application to also suggest similar, correctly spelled words to the user (e.g. with PostgreSQL you could use the pg_trgm extension)
It has some drawbacks though:
You mentioned your application is simple: Having a database system adds complexity
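To make option (2) concrete, here is a hedged sketch of a dictionary lookup against an embedded database over JDBC. The JDBC URL assumes the sqlite-jdbc driver is on the classpath, and the table words(word TEXT PRIMARY KEY) is an assumed schema, not something from the question:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class DictionaryLookup {
    private final Connection conn;

    public DictionaryLookup(String jdbcUrl) throws SQLException {
        this.conn = DriverManager.getConnection(jdbcUrl);
    }

    // Relies on the primary-key index on words.word, so the lookup never scans the table.
    public boolean isKnownWord(String word) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement("SELECT 1 FROM words WHERE word = ?")) {
            ps.setString(1, word.toLowerCase());
            try (ResultSet rs = ps.executeQuery()) {
                return rs.next();
            }
        }
    }

    public static void main(String[] args) throws SQLException {
        // "jdbc:sqlite:dictionary.db" assumes the xerial sqlite-jdbc driver is available.
        DictionaryLookup dict = new DictionaryLookup("jdbc:sqlite:dictionary.db");
        System.out.println(dict.isKnownWord("spellchecker"));
    }
}
```

Because the word column is indexed, each check is a single index probe, and nothing has to be loaded at application start-up.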

Sort huge file in java

I have a huge file with a unique word on each line. The size of the file is around 1.6 GB (I have to sort other files after this which are around 15 GB). Until now, for smaller files I used Arrays.sort(), but for this file I get java.lang.OutOfMemoryError: Java heap space. I know the reason for this error. Is there any way to do this short of writing a complete quicksort or merge sort program myself?
I read that Arrays.sort() uses quicksort or a hybrid sort internally. Is there any ready-made procedure like Arrays.sort() for this?
If I have to write a sorting program, which one should I use, quicksort or merge sort? I'm worried about the worst case.
Depending on the structure of the data to store, you can do many different things.
In the case of well-structured data, where you need to sort by one or more specific fields (in which case system tools might not be helpful), you are probably better off using a datastore that allows sorting. MongoDB comes to mind as a good fit for this, given that the size doesn't exceed a few hundred GB. Other NoSQL datastores might also fit the bill nicely, although Mongo's simplicity of use and installation and its support for JSON data make it a really great candidate.
If you really want to go with the Java approach, it gets really tricky. This is the kind of question you ask at job interviews, and I would never actually expect anybody to implement the code. However, the general solution is merge sort (using random access files is a bad idea because it amounts to insertion sort, i.e. non-optimal running time, which can be bad given the size of your file).
By merge sort I mean reading one chunk of the file at a time, small enough to fit in memory (so it depends on how much RAM you have), sorting it, and then writing it back to a new file on disk. After you have read the whole file you can start merging the chunk files two at a time, by reading just the head of each and writing the smaller of the two records back to a third file. Do that for the 'first generation' of files and then continue with the second one until you end up with one big sorted file. Note that this is basically a bottom-up way of implementing merge sort, the academic recursive algorithm being the top-down approach.
Note that having intermediate files can be avoided altogether by using a multiway merge algorithm. This is typically based on a heap/priority queue, so the implementation might get slightly more complex but it reduces the number of I/O operations required.
Implementing the above in Java shouldn't be too difficult with some careful design, although it can definitely get tricky. I still highly recommend an out-of-the-box solution like Mongo.
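Below is a rough sketch of that bottom-up external merge sort, combined with the multiway merge mentioned above (a PriorityQueue always yields the smallest head line across all chunk files). The input/output file names and the chunk size of one million lines are arbitrary assumptions; tune the chunk size to your heap:

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

public class ExternalMergeSort {

    // Phase 1: read chunks that fit in memory, sort them, write each to a temp file.
    static List<Path> sortChunks(Path input, int linesPerChunk) throws IOException {
        List<Path> chunks = new ArrayList<>();
        try (BufferedReader in = Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
            List<String> buffer = new ArrayList<>(linesPerChunk);
            String line;
            while ((line = in.readLine()) != null) {
                buffer.add(line);
                if (buffer.size() == linesPerChunk) {
                    chunks.add(writeSortedChunk(buffer));
                    buffer.clear();
                }
            }
            if (!buffer.isEmpty()) {
                chunks.add(writeSortedChunk(buffer));
            }
        }
        return chunks;
    }

    static Path writeSortedChunk(List<String> lines) throws IOException {
        Collections.sort(lines);
        Path chunk = Files.createTempFile("chunk-", ".txt");
        Files.write(chunk, lines, StandardCharsets.UTF_8);
        return chunk;
    }

    // Phase 2: multiway merge; the priority queue holds one pending line per chunk file.
    static void merge(List<Path> chunks, Path output) throws IOException {
        PriorityQueue<ChunkHead> heap = new PriorityQueue<>();
        List<BufferedReader> readers = new ArrayList<>();
        try (BufferedWriter out = Files.newBufferedWriter(output, StandardCharsets.UTF_8)) {
            for (Path chunk : chunks) {
                BufferedReader r = Files.newBufferedReader(chunk, StandardCharsets.UTF_8);
                readers.add(r);
                String first = r.readLine();
                if (first != null) heap.add(new ChunkHead(first, r));
            }
            while (!heap.isEmpty()) {
                ChunkHead head = heap.poll();
                out.write(head.line);
                out.newLine();
                String next = head.reader.readLine();
                if (next != null) heap.add(new ChunkHead(next, head.reader));
            }
        } finally {
            for (BufferedReader r : readers) r.close();
        }
    }

    static final class ChunkHead implements Comparable<ChunkHead> {
        final String line;
        final BufferedReader reader;
        ChunkHead(String line, BufferedReader reader) { this.line = line; this.reader = reader; }
        public int compareTo(ChunkHead other) { return this.line.compareTo(other.line); }
    }

    public static void main(String[] args) throws IOException {
        List<Path> chunks = sortChunks(Paths.get("huge-input.txt"), 1_000_000);
        merge(chunks, Paths.get("sorted-output.txt"));
    }
}
```

Phase 1 is the only place that holds a full chunk in memory; phase 2 keeps just one line per chunk file plus the output buffer.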
As it turns out, your problem is that your heap cannot accommodate such a large array, so you must forget any solution that implies loading the whole file content into an array (as long as you can't grow your heap).
So you're facing a streaming problem. It's the only (and typical) solution when you have to handle input sources that are larger than your available memory. I would suggest streaming the file content through your program, which should perform the sorting by outputting either to a random access file (trickier) or to a database.
I'd take a different approach.
Given a file, say with a single element per line, I'd read the first n elements. I would repeat this m times, such that the number of lines in the file is n * m + C, with C being the left-over lines.
When dealing with Integers, you may wish to use around 100,000 elements per read; with Strings I would use fewer, maybe around 1,000. It depends on the data type and the memory needed per element.
From there, I would sort those n elements and write them to a temporary file with a unique name.
Now, since you have all the files sorted, the smallest elements will be at the start. You can then just iterate over the files until you have processed all the elements, finding the smallest element and printing it to the new final output.
This approach will reduce the amount of RAM needed and instead rely on drive space and will allow you to handle sorting of any file size.
Build an array of record positions inside the file (a kind of index); maybe that would fit into memory instead. You need an 8-byte Java long per file record. Sort the array, loading records only for the comparison and not retaining them (use RandomAccessFile). After sorting, write the new final file using the index pointers to fetch the records in the needed order.
This will also work if the records are not all the same size.
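A hedged sketch of that index-based approach is below. It assumes one record per line and single-byte text (RandomAccessFile.readLine does not decode UTF-8), and the file names are made up; the trade-off is that every comparison re-reads two records from disk:

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class OffsetIndexSort {
    public static void main(String[] args) throws IOException {
        // Pass 1: record the byte offset of every line (one long per record).
        List<Long> offsets = new ArrayList<>();
        try (RandomAccessFile raf = new RandomAccessFile("records.txt", "r")) {
            long pos = 0;
            while (raf.readLine() != null) {
                offsets.add(pos);
                pos = raf.getFilePointer();
            }

            // Sort the offsets by the records they point at; records are loaded
            // only for the comparison and are not retained.
            offsets.sort((a, b) -> {
                try {
                    raf.seek(a);
                    String left = raf.readLine();
                    raf.seek(b);
                    String right = raf.readLine();
                    return left.compareTo(right);
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });

            // Pass 2: write the records out in index order.
            try (BufferedWriter out = Files.newBufferedWriter(
                    Paths.get("records-sorted.txt"), StandardCharsets.UTF_8)) {
                for (long offset : offsets) {
                    raf.seek(offset);
                    out.write(raf.readLine());
                    out.newLine();
                }
            }
        }
    }
}
```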

CSV file, faster in binary format? Fastest search?

If I have a CSV file, is it faster to keep the file as plain text or to convert it to some other format? (for searching)
In terms of searching a CSV file, what is the fastest method of retrieving a particular row (by key)? I'm not referring to sorting the file, sorry; what I mean is looking up an arbitrary key in the file.
Some updates:
the file will be read-only
the file can be read and kept in memory
There are several things to consider for this:
What kind of data do you store? Does it actually make sense to convert it to a binary format? Will the binary format take up less space (the time it takes to read the file depends on its size)?
Do you have multiple queries for the same file, while the system is running, or do you have to load the file each time someone does a query?
Do you need to efficiently transfer the file between different systems?
All these factors are very important for a decision. The common case is that you only need to load the file once and then do many queries. In that case it hardly matters what format you store the data in, because it will be stored in memory afterwards anyway. Spend more time thinking about good data structures to handle the queries.
Another common case is that you cannot keep the main application running and hence you cannot keep the file in memory. In that case, get rid of the file and use a database. Any database you can use will most likely be faster than anything you could come up with. However, it is not easy to transfer a database between systems.
Most likely though, the file format will not be a real issue to consider. I've read quite a few very long CSV files and most often the time it took to read the file was negligible compared to what I needed to do with the data afterwards.
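For the load-once-then-query case, a minimal sketch might look like the following. It assumes a plain comma-separated file called data.csv whose first column is the lookup key, with no quoted fields or escaping (a real CSV parser would be needed for those):

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

public class CsvIndex {
    public static void main(String[] args) throws IOException {
        Map<String, String[]> rowsByKey = new HashMap<>();

        // Load once at start-up; after this the on-disk format no longer matters.
        for (String line : Files.readAllLines(Paths.get("data.csv"), StandardCharsets.UTF_8)) {
            String[] fields = line.split(",", -1);   // naive split: quoted commas are not handled
            rowsByKey.put(fields[0], fields);
        }

        // Every subsequent lookup is an average-case O(1) hash probe.
        String[] row = rowsByKey.get("some-key");
        System.out.println(row == null ? "not found" : String.join(",", row));
    }
}
```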
If you have a lot of data and it is very production-level, then use Apache Lucene.
If it's a small dataset or it's about learning, then read up on suffix trees and tries.
"Convert" it (i.e. import it) into a database table (or preferably normalised tables) with indexes on the searchable columns and a primary key on the column that has the highest cardinality - no need to re-invent the wheel... you'll save yourself a lot of issues - transaction management, concurrency.... really - if it will be in production, the chance that you will want to keep it in csv format is slim to zero.
If the file is too large to keep in memory, then just keep the keys in memory. Some number of rows can also be kept in memory, with the least-recently-accessed rows paged out as additional rows are needed. Use seeks (directed by the keys) into the file to find the row in the file itself. Then load that row into memory, in case other entries on that row might be needed.
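A sketch of that keys-in-memory idea, using RandomAccessFile.seek as the Java counterpart of fseek. The file name, the comma-separated layout and the choice of the first column as key are assumptions, and the suggested LRU row cache is left out for brevity:

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.HashMap;
import java.util.Map;

public class KeyOffsetIndex {
    private final RandomAccessFile file;
    private final Map<String, Long> offsetByKey = new HashMap<>();

    // One pass over the file stores only key -> byte offset; the rows stay on disk.
    public KeyOffsetIndex(String path) throws IOException {
        this.file = new RandomAccessFile(path, "r");
        long pos = 0;
        String line;
        while ((line = file.readLine()) != null) {
            String key = line.split(",", 2)[0];
            offsetByKey.put(key, pos);
            pos = file.getFilePointer();
        }
    }

    // Seek straight to the row; an LRU cache of recently used rows could sit in front of this.
    public String lookup(String key) throws IOException {
        Long offset = offsetByKey.get(key);
        if (offset == null) return null;
        file.seek(offset);
        return file.readLine();
    }

    public static void main(String[] args) throws IOException {
        KeyOffsetIndex index = new KeyOffsetIndex("data.csv");
        System.out.println(index.lookup("some-key"));
    }
}
```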

Efficient and scalable way to sort large amount of strings in Java

I am looking for some ideas on sorting a large amount of strings from an input file and printing the sorted results to a new file in Java. The requirement is that the input file could be extremely large. I need to consider performance in the solution, so any ideas?
The external sorting technique is generally used to sort huge amounts of data. Maybe this is what you need.
externalsortinginjava is a Java library for this.
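If I remember that library's API correctly, usage is a two-phase call along the lines of the sketch below; the class and method names are quoted from memory and should be checked against the project's README before relying on them:

```java
import java.io.File;
import java.util.List;

// NOTE: package, class and method names below are from memory and may differ
// between versions of externalsortinginjava; verify against its documentation.
import com.google.code.externalsorting.ExternalSort;

public class LibrarySortExample {
    public static void main(String[] args) throws Exception {
        // Phase 1: split the input into sorted temporary batch files.
        List<File> batches = ExternalSort.sortInBatch(new File("huge-input.txt"));
        // Phase 2: merge the batches into one sorted output file.
        ExternalSort.mergeSortedFiles(batches, new File("sorted-output.txt"));
    }
}
```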
Is an SQL database available? If you inserted all the data into a table, with the sortable column or section indexed, you may (or may not) be able to output the sorted result more efficiently. This solution may also be helpful if the amount of data outweighs the amount of RAM available.
It would be interesting to know how large, and what the purpose is.
Break the file into amounts you can read in memory.
Sort each amount and write to a file. (If you could fit everything into memory you are done)
Merge sort the resulting files into a single sorted file.
You can also do a form of radix sort to improve CPU efficiency, but the main bottleneck is all the re-writing and re-reading you have to do.
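One way to read the radix-sort suggestion in this external setting is to partition lines into bucket files by their leading character, sort each bucket independently, and append the buckets in key order; since the buckets cover disjoint ranges, the concatenation is fully sorted. The sketch below assumes each individual bucket fits in memory, and the file names are placeholders:

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class BucketByFirstChar {
    public static void main(String[] args) throws IOException {
        // Phase 1: distribute lines into one temp file per leading character.
        Map<Character, Path> buckets = new TreeMap<>();
        Map<Character, BufferedWriter> writers = new TreeMap<>();
        try (BufferedReader in = Files.newBufferedReader(Paths.get("huge-input.txt"), StandardCharsets.UTF_8)) {
            String line;
            while ((line = in.readLine()) != null) {
                char c = line.isEmpty() ? '\0' : line.charAt(0);
                BufferedWriter w = writers.get(c);
                if (w == null) {
                    Path bucket = Files.createTempFile("bucket-", ".txt");
                    buckets.put(c, bucket);
                    w = Files.newBufferedWriter(bucket, StandardCharsets.UTF_8);
                    writers.put(c, w);
                }
                w.write(line);
                w.newLine();
            }
        }
        for (BufferedWriter w : writers.values()) w.close();

        // Phase 2: the buckets hold disjoint ranges, so sorting each one in memory
        // and appending them in key order yields a fully sorted output.
        try (BufferedWriter out = Files.newBufferedWriter(Paths.get("sorted-output.txt"), StandardCharsets.UTF_8)) {
            for (Path bucket : buckets.values()) {
                List<String> lines = Files.readAllLines(bucket, StandardCharsets.UTF_8);
                Collections.sort(lines);
                for (String s : lines) {
                    out.write(s);
                    out.newLine();
                }
            }
        }
    }
}
```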

File processing in java

I have a file of size 2 GB which has student records in it. I need to find students based on certain attributes in each record and create a new file with the results. The order of the filtered students should be the same as in the original file. What's the most efficient and fastest way of doing this using the Java I/O API and threads without running into memory issues? The max heap size for the JVM is set to 512 MB.
What kind of file? Text-based, like CSV?
The easiest way would be to do something like grep does: read the file line by line, parse the line, check your filter criterion, and if it matches, output a result line; then go on to the next line, until the file is done. This is very memory-efficient, as you only have the current line (or a buffer a little larger) loaded at any one time. Your process needs to read through the whole file just once.
I do not think multiple threads are going to help much. It would make things much more complicated, and since the process seems to be I/O bound anyway, trying to read the same file with multiple threads probably does not improve throughput.
If you find that you need to do this often, and going through the file each time is too slow, you need to build some kind of index. The easiest way to do that would be to import the file into a DB (can be an embedded DB like SQLite or HSQL) first.
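A bare-bones version of that grep-style pass might look like this; the file names, the comma-separated record layout and the matches() predicate (keep records whose third column equals "CS") are all invented for the example:

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class StudentFilter {

    // Hypothetical filter criterion: keep records whose third column equals "CS".
    static boolean matches(String[] fields) {
        return fields.length > 2 && "CS".equals(fields[2]);
    }

    public static void main(String[] args) throws IOException {
        try (BufferedReader in = Files.newBufferedReader(Paths.get("students.txt"), StandardCharsets.UTF_8);
             BufferedWriter out = Files.newBufferedWriter(Paths.get("filtered.txt"), StandardCharsets.UTF_8)) {
            String line;
            // Only one line is in memory at a time, and input order is preserved.
            while ((line = in.readLine()) != null) {
                if (matches(line.split(","))) {
                    out.write(line);
                    out.newLine();
                }
            }
        }
    }
}
```

Because matching lines are written in the order they are read, the filtered file keeps the original ordering for free.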
I wouldn't overcomplicate this until you find that the boringly simple way doesn't work for what you need. Essentially you just need to:
open input stream to 2GB file, remembering to buffer (e.g. by wrapping with BufferedInputStream)
open output stream to filtered file you're going to create
read first record from input stream, look at whatever attribute to decide if you "need" it; if you do, write it to output file
repeat for remaining records
On one of my test systems with extremely modest hardware, a BufferedInputStream around a FileInputStream out of the box read about 500 MB in 25 seconds, i.e. probably under 2 minutes to process your 2 GB file, and the default buffer size is basically as good as it gets (see the BufferedInputStream timings I made for more details). I imagine with state-of-the-art hardware it's quite possible the time would be halved.
Whether you need to go to a lot of effort to reduce the 2-3 minutes or just go for a wee while you're waiting for it to run is a decision that you'll have to make depending on your requirements. I think the database option won't buy you much unless you need to do a lot of different processing runs on the same set of data (and there are other solutions to this that don't automatically mean a database).
2 GB for a file is huge; you SHOULD go for a DB.
If you really want to use Java I/O API, then try out this: Handling large data files efficiently with Java and this: Tuning Java I/O Performance
I think you should use memory-mapped files. They let you map a bigger file into a smaller amount of memory; this acts like virtual memory, and as far as performance is concerned, mapped files are faster than stream reads/writes.
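A small sketch of that approach with FileChannel.map is below. A single mapping is capped at 2 GB, so the file is mapped in overlapping windows; the search term and file name are assumptions, and the scan simply reports byte offsets of raw matches:

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class MappedScan {
    public static void main(String[] args) throws IOException {
        byte[] needle = "some-student-id".getBytes(StandardCharsets.UTF_8); // assumed search term
        long windowSize = 256L * 1024 * 1024; // one map() call cannot exceed 2 GB, so map in windows

        try (FileChannel ch = FileChannel.open(Paths.get("students.txt"), StandardOpenOption.READ)) {
            long fileSize = ch.size();
            // Windows overlap by needle.length so matches on a boundary are not missed.
            for (long start = 0; start < fileSize; start += windowSize - needle.length) {
                long len = Math.min(windowSize, fileSize - start);
                MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, start, len);
                // Naive byte-by-byte scan of the mapped window.
                outer:
                for (int i = 0; i + needle.length <= buf.limit(); i++) {
                    for (int j = 0; j < needle.length; j++) {
                        if (buf.get(i + j) != needle[j]) continue outer;
                    }
                    System.out.println("match at byte offset " + (start + i));
                }
            }
        }
    }
}
```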
