Binary Search a file with varying word length

Binary Search a file with varying word length - java

I am making a crude Java spellchecker that takes an article and a pre-sorted dictionary file. The length of the article's words vary, and therefore I tried making a stack that takes in the words given by the file.
This unfortunately didn't work because the stack ran out of space (even with the shortened dictionary file) and due to performance concerns, I decided to read from the text file directly.
The issue is that the file doesn't have words of the same length. Because that the length of the words vary, I cannot and should not expect that the length of a single word is useful in determining both how many words are in the dictionary file from how large that file is.
Because of this I am stuck. I need to execute a binary search on that file in order to make the spellcheck program work. But I can't execute a binary search if there's no clear way to treat the file as an array, especially when the array is just too large to put into the program's memory.
What should I do?

The Oxford English dictionary suggets that there are about ~250,000 words that you need to consider for your dictionary (not factoring in words only used in highly specific domain). This is an important design information for you.
I see some solutions:
1) Simply using a HashSet<>
In theory, you can use a HashSet<> for this amount of elements (this SO post discusses theoretical limits of HashSets and other in detail).
However, this brings (as you have observed) a couple of problems:
It takes time (on every application start up) to read this into the RAM
It eats up RAM
Of course you can increase the heap size of your JRE but there is a natural limit to that (#StvnBrkddll linked a SO post that describes this perfectly in the comments)
2) Using a Database
I'd consider storing the valid words in a (relational) database:
You don't need to load everything on application start up
It does not weigh as heavy on your RAM as option (1)
It gives you more options, if you want to change your application to also suggest similar words without typos to the user (e.g. if you use PostgreSQL you could achiev pg_trgm)
It has some drawbacks though:
You mentioned your application is simple: Having a database system adds complexity

Related

Find a String in a 10k line files in Java efficiently

I need to check if the password that an user entered is contained in a 10k lines .txt file that is locally stored in my computer. I've been asked to do this for a college project and they've been very emphatic about achieving this in an efficent manner, not taking too long to find the match.
The thing is that reading the file line by line using a BufferedReader the match is done almost instantly.
I've tested it in two computers, one with an ssd and the other one with an hdd and I cannot tell the difference.
Am I missing something? Is there another and more efficent way to do it? For example I could load the file or chunks of the file into memory, but is it worth it?

10k passwords isn't all that much and should easily fit in RAM. You can read the file into memory when your application starts and then only access the in-memory structure. The in-memory structure could even be parsed to provide more efficient lookup (i.e. using a HashMap or HashSet) or sort it in memory for the one-time cost of O(n × log n) to enable binary-searching the list (10k items can be searched with max. 14 steps). Or you could use even fancier data structures such as a bloom filter.
Just keep in mind: when you write "it is almost instant", then it probably already is efficient enough. (Again, 10k passwords isn't all that much, probably the file is only ~100kB in size)

Performance tuning for searching

I am fairly new to DS and Algorithms and recently at a job interview I was asked a question on performance tuning along with code. We have a Data Structure which contains multi-billion entries and we need to search a particular word in that data structure. So which Java feature/library can we use to do the searching in the quickest time possible ?
On the spot I could not think of exact answer so I wrote that:
We can store the values in a map and search words in the map (but got stuck how to decide key-value pair in the map).
How can I understand the exact answer to this question and what can be the optimal solution(s) ?

After reading the question and getting clarification in the comments, I think what has become apparent to me is that: you needed to ask follow-up questions.
I'll try to break it down and provide comments that I hope will be helpful, because I also know what it's like to be "in the moment" and how nerves can stab you in the back when you least need them to.
We have a Data Structure which contains multi-billion entries and we need to search a particular word in that data structure.
I think a good follow-up question here would've been:
Q: What specific data structure is being used to contain all this data?
I would press until they give me an actual name and explain why it is not possible to name a Java algorithm/library. For all you know, the data structure could've been String[], a Set<String>, or even a fancy name for a file on disk (if they're trying to throw you off). They could've also clarified and said the DS was not relevant and that you could pick whichever DS you thought was best.
The wording also implies that they implemented the structure and that it's already populated in a system with, presumably, enough memory to hold all of it. Asking to confirm that this is really the case could've given you helpful information.
For example: "Based on the wording, it seems this mystery data structure is already implemented and fully populated in memory in a system with enough memory to hold it. Can you confirm my understanding here is correct? If not, could you clarify further?"
Given the suggested wording, and the fact that we don't have additional clarifications to go from, I will assume, for the purposes of this answer, that my suppositions are indeed correct.
Note that if you had been asked to design the data structure to hold all of this info, you would've had to ask very different questions, take memory constraints into account, and perhaps even ask about character sets/encodings (e.g. ASCII vs multi-byte Unicode).
Also, if you had been asked to design the search algorithm, then knowing the DS is a pre-requisite, and not knowing this could've made the task impossible. For example, the binary search algorithm implementation will look very different if you're working on an array vs a binary search tree, even though both would offer O(lg n) time complexity.
So which java feature/library can we use to do the searching in the quickest time possible?
Consistent with the 1st part, this question only asks what pre-existing/built-in Java code you would choose to perform the search for you. The "quickest time possible" here should make you think about solutions that are in O(1), i.e. are constant time. However, the data structure may open/close doors for you.
Some search algorithms in Java work on generics and others work on other types like arrays. Some algorithms work on Maps while others work on Lists, Sets, and so on. The follow-up question from the first part could've helped in answering this question.
That said, even if you knew the DS, but couldn't think of a specific method name or such at the time, I also think it should be considered reasonable to mention the interface or at least a relevant package and say that further details can be checked on the the Java documentation if you're pressed for more specificity, given that's what it's there for in the first place.
We can store the values in a map and search words in the map (but got stuck how to decide key-value pair in the map).
Given the wording, my interpretation of their question was not "which data structure would you use?", but rather, "which pre-existing search algorithm would you choose?". It seems to me like it was them who needed to answer the question regarding DS.
That said, if you had indeed been asked "which data structure would you use?", then a Map would've still worked against you, since you didn't really need to map a key to a value. You only needed to store a value (i.e. the words). Therefore, a Set, specifically a HashSet, would've been a better candidate, since it also avoids duplicates and should consume less memory in the process because it stores singular values, rather than key/value pairs.
Of course, that's still under the assumption(s) I made earlier. If memory constraints are said to be an issue, then scaling horizontally to multiple servers and so on would've likely been necessary.
How can I understand the exact answer to this question and what can be the optimal solution(s)?
It is probably the case that they wanted to see if you would follow up with questions, given the lack of information they gave you.

There are a couple data structures that allow for efficient searching, assuming that memory requirements aren't an issue and the data structure is already populated.
Regarding time complexity, Set#contains and Map#containsKey are both O(1), assuming that the hash function isn't expensive and that there aren't many collisions.
Because the data structure stores words (assuming you're referring to Strings), then it could also be relatively efficient to use a trie (radix tree, prefix tree, etc.), which would allow you to search by character (which I believe would be O(log n)). If the hash function is expensive or there are many collisions, this could be a good alternative!
The answer that you gave to the interviewer should suffice since hashing is an effective searching method, even for billions of entries.

You did not mention whether the entries are words or documents (multiple words). In both cases a search index could be suitable.
Search indexes extract words from the billion document entries and manage a map of these words to the documents they are used in. Frameworks like Lucene (e.g. as part of SOLR or ElasticSearch) manage memory and persistence for you.
If it were only multiple of thousands of entries, a simple HashMap would be sufficient because there is no need for memory management then. If all of the billion entries are single words, a database could be a slightly better choice.

The hashmap solution is reasonable as stated by others but there are doubts with respect to scalability.
Here is a possible solution for the problem as discussed in the below post
Sub-string match If your entry blob is a single sting or word (without any white space) and you need to search arbitrary sub-string within it. In such cases you need to parse every entry to find best possible entries that matches. One uses algorithms like Boyer Moor algorithm. See this and this for details. This is also equivalent to grep - because grep uses similar stuff inside
Indexed search. Here you are assuming that entry contains set of words and search is limited to fixed word lengths. In this case, entries are indexed over all the possible occurrences of words. This is often called "Full Text search". There are number of algorithms to do this and number of open source projects that can be used directly. Many of them, also support wild card search, approximate search etc. as below :
a. Apache Lucene : http://lucene.apache.org/java/docs/index.html
b. OpenFTS : http://openfts.sourceforge.net/
c. Sphinx http://sphinxsearch.com/
Most likely if you need "fixed words" as queries, the approach two will be very fast and effective
Reference - https://softwareengineering.stackexchange.com/questions/118759/how-to-quickly-search-through-a-very-large-list-of-strings-records-on-a-databa

Multi-billion entries lie at the edge of what might conceivably be stored in main memory (for instance, storing 10 billion entries at 100 bytes per entry will take 1000 GB main memory).
While storing the data in main memory offers a very high throughput (thousands to millions of requests per second), you'd likely need special hardware (typical blade servers only offers 16 GB, but there are commodity servers that permit installation of up to 3000 GB of main memory). Also, keeping this much data in the Java Heap will likely cause garbage collector pauses of seconds or minutes unless special care is taken.
Therefore, unless the structure of your data admits a very compact representation in main memory (say, you only need membership checking among ints, which is possible with a 512 MB Bitset), you'll not want to store it in main memory, but on disk.
Therefore, you'll need persistence. Any relational or NoSQL database permits efficient searching by key and can handle such amounts of data with ease. To talk to a relational database, use JPA or JDBC. To talk to a non-relational database, you can use their proprietary Java API or an abstraction layer such as Spring Data.
You could also implement persistence from scratch if you wanted to (i.e. the interviewer asks for that). A data structure optimized for efficient lookup in external memory is the B-Tree, that's what many databases use internally :-)

Sort huge file in java

I've huge file with unique words in each line. Size of file is around 1.6 GB(I've to sort other files after this which are around 15GB). Till now, for smaller files I used Array.sort(). But for this file I get java.lang.OutOfMemoryError: Java heap space. I know the reason for this error. Is there any way instead of writing complete quick sort or merge sort program.
I read that Array.sort() uses Quicksort or Hybrid Sort internally. Is there any procedure like Array.sort() ??
If I have to write a program for sorting, which one should I use? Quicksort or Merge sort. I'm worried about worst case.

Depending on the structure of the data to store, you can do many different things.
In case of well structured data, where you need to sort by one or more specific fields (in which case system tools might not be helpful), you are probably better off using a datastore that allows sorting. MongoDB comes to mind as a good fit for this given that the size doesn't exceed few 100s of GBs. Other NoSQL datastores might also fit the bill nicely, although Mongo's simplicity of use and installation and support for JSON data makes it a really great candidate.
If you really want to go with the java approach, it gets real tricky. This is the kind of questions you ask at job interviews and I would never actually expect anybody to implement code. However, the general solution is merge sort (using random access files is a bad idea because it means insertion sort, i.e., non optimal run time which can be bad given the size of your file).
By merge sort I mean reading one chunk of the file at a time small enough to fit it in memory (so it depends on how much RAM you have), sorting it and then writing it back to a new file on disk. After you read the whole file you can start merging the chunk files two at a time by reading just the head of each and writing (the smaller of the two records) back to a third file. Do that for the 'first generation' of files and then continue with the second one until you end up with one big sorted file. Note that this is basically a bottom up way of implementing merge sort, the academic recursive algorithm being the top down approach.
Note that having intermediate files can be avoided altogether by using a multiway merge algorithm. This is typically based on a heap/priority queue, so the implementation might get slightly more complex but it reduces the number of I/O operations required.
Please also see these links.
Implementing the above in java shouldn't be too difficult with some careful design although it can definitely get tricky. I still highly recommend an out-of-the-box solution like Mongo.

As it turns out, your problem is that your heap cannot accommodate such a large array, so you must forget any solution that implies loading the whole file content in an array (as long as you can't grow your heap).
So you're facing streaming. It's the only (and typical) solution when you have to handle input sources that are larger than your available memory. I would suggest streaming the file content to your program, which should perform the sorting by either outputting to a random access file (trickier) or to a database.

I'd take a different approach.
Given a file, say with a single element per line, I'd read the first n elements. I would repeat this m times, such that the amount of lines in the file is n * m + C with C being left-over lines.
When dealing with Integers, you may wish to use around 100,000 elements per read, with Strings I would use less, maybe around 1,000. It depends on the data type and memory needed per element.
From there, I would sort the n amount of elements and write them to a temporary file with a unique name.
Now, since you have all the files sorted, the smallest elements will be at the start. You can then just iterate over the files until you have processed all the elements, finding the smallest element and printing it to the new final output.
This approach will reduce the amount of RAM needed and instead rely on drive space and will allow you to handle sorting of any file size.

Build the array of record positions inside the file (kind of index), maybe it would fit into memory instead. You need a 8 byte java long per file record. Sort the array, loading records only for comparison and not retaining (use RandomAccessFile). After sorting, write the new final file using index pointers to get the records in the needed order.
This will also work if the records are not all the same size.

how to search for a given word from a huge database?

What's the most efficient method to search for a word from a dictionary database. I searched for the answer and people had suggested to use trie data structure. But the strategy for creating the tree for a huge amount of words would be to load the primary memory. I am trying to make an android app which involves this implementation for my data structure project. So could anyone tell me how do the dictionary work.
Even when I use the t9 dictionary in my phone, the suggestions for words appear very quickly on the screen. Curious to know the algorithm and the design behind it.

You can use Trie which is most usefull for searching big dictionaries. Because too many words are using similar startup, trie brgins around constant factor search also you can use in place, with limited number of access to physical memory. You can find lots of implementation in the web.
If someone is not familiar with trie, I think this site is good and I'm just quoting their sample here:
A trie (from retrieval), is a multi-way tree structure useful for
storing strings over an alphabet. It has been used to store large
dictionaries of English (say) words in spelling-checking programs and
in natural-language "understanding" programs. Given the data:
an, ant, all, allot, alloy, aloe, are, ate, be
the corresponding trie would be:
This is good practical Trie implementation in java:
http://code.google.com/p/google-collections/issues/detail?id=5

There are lots of ways to do that. The one that I used some time ago (which is especially good if you don't make changes to your dictionary) is to create a prefix index.
That is, you sort your entries lexicologicaly. Then, you save the (end) positions of the ranges for different first letters. That is, if your entries have indexes from 1 to 1000, and words "aardvark -- azerbaijan" take the range from 1 to 200, you make an entry in a separate table "a | 200", then you do the same for first and second letters. Then, if you need to find a particular word, you greatly reduce the search scope. In my case, the index on first two letters was quite sufficient.
Again, this method requires you to use a DB, like SQLite, which I think is present on the Android.

Using a trie is indeed space conscious, just realized when I checked my RAM usage after loading 150,000 words in to trie, the usage was 150 MB (Trie was implemented in C++).The memory consumption was hugely due to pointers. I ended up with ternary tries which had very less memory wastage around 30 MB (compared to 150 MB) but the time complexity had increased a bit. Another option is to use "Left child Right sibling " in which there is very less wastage of memory but time complexity is more than that of ternary trie.

Is there a fast Java library to search for a string and its position in file?

I need to search a big number of files (i.e. 600 files, 0.5 MB each) for a specific string.
I'm using Java, so I'd prefer the answer to be a Java library or in the worst case a library in a different language which I could call from Java.
I need the search to return the exact position of the found string in a file (so it seems Lucene for example is out of the question).
I need the search to be as fast as possible.
EDIT START:
The files might have different format (i.e. EDI, XML, CSV) and contain sometimes pretty random data (i.e. numerical IDs etc.). This is why I preliminarily ruled out an index-based searching engine.
The files will be searched multiple times for similar but different strings (i.e. for IDs which might have similar length and format, but they will usually be different).
EDIT END
Any ideas?

600 files of 0.5 MB each is about 300MB - that can hardly be considered big nowadays, let alone large. A simple string search on any modern computer should actually be more I/O-bound than CPU-bound - a single thread on my system can search 300MB for a relatively simple regular expression in under 1.5 seconds - which goes down to 0.2 if the files are already present in the OS cache.
With that in mind, if your purpose is to perform such a search infrequently, then using some sort of index may result in an overengineered solution. Start by iterating over all files, reading each block-by-block or line-by-line and searching - this is simple enough that it barely merits its own library.
Set down your performance requirements, profile your code, verify that the actual string search is the bottleneck and then decide whether a more complex solution is warranted. If you do need something faster, you should first consider the following solutions, in order of complexity:
Use an existing indexing engine, such as Lucene, to filter out the bulk of the files for each query and then explicitly search in the (hopefully few) remaining files for your string.
If your files are not really text, so that word-based indexing would work, preprocess the files to extract a term list for each file and use a DB to create your own indexing system - I doubt you will find an FTS engine that uses anything else than words for its indexing.
If you really want to reduce the search time to the minimum, extract term/position pairs from your files, and enter those in your DB. You may still have to verify by looking at the actual file, but it would be significantly faster.
PS: You do not mention at all what king of strings we are discussing about. Does it contain delimited terms, e.g. words, or do your files contain random characters? Can the search string be broken into substrings in a meaningful manner, or is it a bunch of letters? Is your search string fixed, or could it also be a regular expression? The answer to each of these questions could significantly limit what is and what is not actually feasible - for example indexing random strings may not be possible at all.
EDIT:
From the question update, it seems that the concept of a term/token is generally applicable, as opposed to e.g. searching for totally random sequences in a binary file. That means that you can index those terms. By searching the index for any tokens that exist in your search string, you can significantly reduce the cases where a look at the actual file is needed.
You could keep a term->file index. If most terms are unique to each file, this approach might offer a good complexity/performance trade-off. Essentially you would narrow down your search to one or two files and then perform a full search on those files only.
You could keep a term->file:position index. For example, if your search string is "Alan Turing". you would first search the index for the tokens "Alan" and "Turing". You would get two lists of files and positions that you could cross-reference. By e.g. requiring that the positions of the token "Alan" precede those of the token "Turing" by at most, say, 30 characters, you would get a list of candidate positions in your files that you could verify explicitly.
I am not sure to what degree existing indexing libraries would help. Most are targeted towards text indexing and may mishandle other types of tokens, such as numbers or dates. On the other hand, your case is not fundamentally different either, so you might be able to use them - if necessary, by preprocessing the files you feed them to make them more palatable. Building an indexing system of your own, tailored to your needs, does not seem too difficult either.
You still haven't mentioned if there is any kind of flexibility in your search string. Do you expect being able to search for regular expressions? Is the search string expected to be found verbatim, or do you need to find just the terms in it? Does whitespace matter? Does the order of the terms matter?
And more importantly, you haven't mentioned if there is any kind of structure in your files that should be considered while searching. For example, do you want to be able to limit the search to specific elements of an XML file?

Unless you have an SSD, your main bottleneck will be all the file accesses. Its going to take about 10 seconds to read the files, regardless of what you in Java.
If you have an SSD, reading the files won't be a problem, and the CPU speed in Java will matter more.
If you can create an index for the files this will help enormously.

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.