RandomAccessFile, what is actually random? - java

Having read this oracle java link, I would like to know what the writers of this class actually meant by the term "Random", given that the buffer has its own position, limit, and capacity indicators. What would be done randomly? I think I am just misinterpreting the word "Random" in that context.
Is anybody able to clarify the point in other terms?
Thanks in advance.

Random access refers to the ability to access data at an arbitrary position directly. The opposite of random access is sequential access: to go from point A to point Z in a sequential-access system, you must pass through every point in between.
For example, in a random access file you can jump to any position you want, whereas in a sequential access file you have to read from the beginning up to that specific point to get the data.
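For instance, here is a minimal sketch of jumping around a file with java.io.RandomAccessFile (the file name "data.bin" is just a placeholder):

    import java.io.IOException;
    import java.io.RandomAccessFile;

    public class SeekDemo {
        public static void main(String[] args) throws IOException {
            // "data.bin" is a placeholder; any existing file of sufficient size works.
            try (RandomAccessFile raf = new RandomAccessFile("data.bin", "r")) {
                raf.seek(1024);          // jump straight to byte offset 1024
                int b = raf.read();      // read there without touching bytes 0..1023
                System.out.println("Byte at offset 1024: " + b);
            }
        }
    }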
According to Wikipedia:
Random access (sometimes called direct access) is the ability to access an element at an arbitrary position in a sequence in equal time, independent of sequence size. The position is arbitrary in the sense that it is unpredictable, thus the use of the term "random" in "random access".
You can also see this link (or thousands of other search results) to get a clear idea about "random access".

Related

Performance tuning for searching

I am fairly new to DS and Algorithms, and recently at a job interview I was asked a question on performance tuning along with code. We have a Data Structure which contains multi-billion entries and we need to search a particular word in that data structure. So which Java feature/library can we use to do the searching in the quickest time possible?
On the spot I could not think of an exact answer, so I wrote:
We can store the values in a map and search words in the map (but I got stuck on how to decide the key-value pairs for the map).
How can I understand the exact answer to this question, and what would the optimal solution(s) be?
After reading the question and getting clarification in the comments, I think what has become apparent to me is that: you needed to ask follow-up questions.
I'll try to break it down and provide comments that I hope will be helpful, because I also know what it's like to be "in the moment" and how nerves can stab you in the back when you least need them to.
We have a Data Structure which contains multi-billion entries and we need to search a particular word in that data structure.
I think a good follow-up question here would've been:
Q: What specific data structure is being used to contain all this data?
I would press until they gave me an actual name, explaining that without one it is not possible to pick a specific Java algorithm/library. For all you know, the data structure could've been a String[], a Set<String>, or even a fancy name for a file on disk (if they're trying to throw you off). They could've also clarified and said the DS was not relevant and that you could pick whichever DS you thought was best.
The wording also implies that they implemented the structure and that it's already populated in a system with, presumably, enough memory to hold all of it. Asking to confirm that this is really the case could've given you helpful information.
For example: "Based on the wording, it seems this mystery data structure is already implemented and fully populated in memory in a system with enough memory to hold it. Can you confirm my understanding here is correct? If not, could you clarify further?"
Given the suggested wording, and the fact that we don't have additional clarifications to go from, I will assume, for the purposes of this answer, that my suppositions are indeed correct.
Note that if you had been asked to design the data structure to hold all of this info, you would've had to ask very different questions, take memory constraints into account, and perhaps even ask about character sets/encodings (e.g. ASCII vs multi-byte Unicode).
Also, if you had been asked to design the search algorithm, then knowing the DS is a prerequisite, and not knowing this could've made the task impossible. For example, a binary search implementation will look very different if you're working on an array vs a binary search tree, even though both offer O(lg n) time complexity.
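To illustrate that point, here is a minimal sketch (the class and method names are my own, not from the interview) contrasting the same O(lg n) idea over a sorted array versus a hand-rolled binary search tree:

    public class BinarySearchShapes {
        // Binary search on a sorted array: index arithmetic, no extra structure.
        static boolean inSortedArray(String[] sorted, String target) {
            int lo = 0, hi = sorted.length - 1;
            while (lo <= hi) {
                int mid = (lo + hi) >>> 1;  // unsigned shift avoids overflow on huge arrays
                int cmp = sorted[mid].compareTo(target);
                if (cmp == 0) return true;
                if (cmp < 0) lo = mid + 1; else hi = mid - 1;
            }
            return false;
        }

        // The same O(lg n) idea on a binary search tree: pointer chasing instead of indices.
        static final class Node {
            final String word;
            Node left, right;
            Node(String word) { this.word = word; }
        }

        static boolean inTree(Node root, String target) {
            Node cur = root;
            while (cur != null) {
                int cmp = target.compareTo(cur.word);
                if (cmp == 0) return true;
                cur = (cmp < 0) ? cur.left : cur.right;
            }
            return false;
        }
    }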
So which Java feature/library can we use to do the searching in the quickest time possible?
Consistent with the 1st part, this question only asks what pre-existing/built-in Java code you would choose to perform the search for you. The "quickest time possible" here should make you think about solutions that are in O(1), i.e. are constant time. However, the data structure may open/close doors for you.
Some search algorithms in Java work on generics and others work on other types like arrays. Some algorithms work on Maps while others work on Lists, Sets, and so on. The follow-up question from the first part could've helped in answering this question.
That said, even if you knew the DS but couldn't think of a specific method name at the time, I also think it should be considered reasonable to mention the interface, or at least a relevant package, and say that further details can be checked in the Java documentation if you're pressed for more specificity, given that's what it's there for in the first place.
We can store the values in a map and search words in the map (but I got stuck on how to decide the key-value pairs for the map).
Given the wording, my interpretation of their question was not "which data structure would you use?", but rather, "which pre-existing search algorithm would you choose?". It seems to me like it was them who needed to answer the question regarding DS.
That said, if you had indeed been asked "which data structure would you use?", then a Map would've still worked against you, since you didn't really need to map a key to a value. You only needed to store a value (i.e. the words). Therefore, a Set, specifically a HashSet, would've been a better candidate, since it also avoids duplicates and should consume less memory in the process because it stores singular values, rather than key/value pairs.
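For illustration, a minimal sketch of the Set-based approach, assuming the entries are plain Strings (the words here are placeholders):

    import java.util.HashSet;
    import java.util.Set;

    public class WordLookup {
        public static void main(String[] args) {
            // Hypothetical word list; in the interview scenario this would be
            // the multi-billion-entry collection, already populated.
            Set<String> words = new HashSet<>();
            words.add("apple");
            words.add("banana");

            // Expected O(1) membership check; no key/value pair needed.
            System.out.println(words.contains("banana")); // true
            System.out.println(words.contains("cherry")); // false
        }
    }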
Of course, that's still under the assumption(s) I made earlier. If memory constraints are said to be an issue, then scaling horizontally to multiple servers and so on would've likely been necessary.
How can I understand the exact answer to this question, and what would the optimal solution(s) be?
It is probably the case that they wanted to see if you would follow up with questions, given the lack of information they gave you.
There are a couple data structures that allow for efficient searching, assuming that memory requirements aren't an issue and the data structure is already populated.
Regarding time complexity, Set#contains and Map#containsKey are both O(1), assuming that the hash function isn't expensive and that there aren't many collisions.
Because the data structure stores words (assuming you're referring to Strings), it could also be relatively efficient to use a trie (radix tree, prefix tree, etc.), which would allow you to search character by character, with a lookup cost proportional to the length of the word rather than the number of entries. If the hash function is expensive or there are many collisions, this could be a good alternative!
The answer that you gave to the interviewer should suffice since hashing is an effective searching method, even for billions of entries.
You did not mention whether the entries are words or documents (multiple words). In both cases a search index could be suitable.
Search indexes extract words from the billion document entries and manage a map of these words to the documents they are used in. Frameworks like Lucene (e.g. as part of SOLR or ElasticSearch) manage memory and persistence for you.
If there were only thousands of entries, a simple HashMap would be sufficient, because there would be no need for memory management. If all of the billion entries are single words, a database could be a slightly better choice.
The hashmap solution is reasonable, as stated by others, but there are doubts with respect to scalability.
Here is a possible solution for the problem, as discussed in the post linked below.
1. Sub-string match. If your entry blob is a single string or word (without any whitespace) and you need to search for an arbitrary sub-string within it, then you need to parse every entry to find the entries that best match. One uses algorithms like the Boyer–Moore algorithm. See this and this for details. This is also equivalent to grep, because grep uses similar machinery inside.
2. Indexed search. Here you are assuming that each entry contains a set of words and the search is limited to fixed word lengths. In this case, entries are indexed over all the possible occurrences of words. This is often called "full-text search". There are a number of algorithms to do this and a number of open source projects that can be used directly. Many of them also support wildcard search, approximate search, etc., as below:
a. Apache Lucene : http://lucene.apache.org/java/docs/index.html
b. OpenFTS : http://openfts.sourceforge.net/
c. Sphinx http://sphinxsearch.com/
Most likely, if you need "fixed words" as queries, approach two will be very fast and effective.
Reference - https://softwareengineering.stackexchange.com/questions/118759/how-to-quickly-search-through-a-very-large-list-of-strings-records-on-a-databa
Multi-billion entries lie at the edge of what might conceivably be stored in main memory (for instance, storing 10 billion entries at 100 bytes per entry would take 1000 GB of main memory).
While storing the data in main memory offers very high throughput (thousands to millions of requests per second), you'd likely need special hardware (typical blade servers only offer 16 GB, but there are commodity servers that permit installation of up to 3000 GB of main memory). Also, keeping this much data in the Java heap will likely cause garbage collector pauses of seconds or minutes unless special care is taken.
Therefore, unless the structure of your data admits a very compact representation in main memory (say, you only need membership checking among ints, which is possible with a 512 MB bit set; a rough sketch follows), you won't want to store it in main memory, but on disk.
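For concreteness, here is a rough sketch of that bit-set idea. Note that java.util.BitSet only takes non-negative int indices, so covering all 2^32 possible int values is easiest with a raw long[] (512 MB, matching the figure above):

    // Membership checking over all 2^32 possible int values using a flat bit array.
    // 2^32 bits = 2^26 longs = 512 MB.
    public class IntBitSet {
        private final long[] bits = new long[1 << 26]; // 512 MB of heap

        private static long index(int value) {
            return value & 0xFFFFFFFFL; // treat the int as unsigned 0..2^32-1
        }

        public void add(int value) {
            long i = index(value);
            bits[(int) (i >>> 6)] |= 1L << (i & 63);
        }

        public boolean contains(int value) {
            long i = index(value);
            return (bits[(int) (i >>> 6)] & (1L << (i & 63))) != 0;
        }
    }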
Therefore, you'll need persistence. Any relational or NoSQL database permits efficient searching by key and can handle such amounts of data with ease. To talk to a relational database, use JPA or JDBC. To talk to a non-relational database, you can use their proprietary Java API or an abstraction layer such as Spring Data.
You could also implement persistence from scratch if you wanted to (i.e. if the interviewer asks for that). A data structure optimized for efficient lookup in external memory is the B-tree; that's what many databases use internally :-)

Name this collection

This question is language-agnostic (although it assumes one that is both procedural and OO).
I'm having trouble finding if there is a standard name for a collection with the following behavior:
-Fixed-capacity of N elements, maintaining insertion order.
-Elements are added to the 'Tail'
-Whenever an item is added, the head of the collection is returned (FIFO), although not necessarily removed.
-If the collection now contains more than N elements, the Head is removed - otherwise it remains in the collection (now having advanced one step further towards its ultimate removal).
I often use this structure to keep a running count - i.e. the frame lengths of the past N frames - so as to provide a 'moving window' across which I can average, sum, etc.
Sounds very similar to a circular buffer to me, with the exception that you are probably under-defining or over-constraining the add/remove behavior.
Note that there are two "views" of a circular buffer. One is the layout view, which has a section of memory being written to with "head" and "tail" indexes and a bit of logic to "wrap" around when the tail is "before" the head. The other is a "logical" view, where you have a queue that doesn't expose how it is laid out, but definitely has a limited number of slots which it can "grow to".
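A minimal sketch of the logical view, matching the behavior described in the question (the class and method names are mine):

    import java.util.ArrayDeque;
    import java.util.Deque;

    // Fixed capacity N, insertion order kept, add() returns the current head,
    // and the head is only evicted once the structure would exceed N elements.
    public class SlidingWindow<T> {
        private final int capacity;
        private final Deque<T> items = new ArrayDeque<>();

        public SlidingWindow(int capacity) { this.capacity = capacity; }

        public T add(T item) {
            items.addLast(item);            // elements enter at the tail
            T head = items.peekFirst();     // FIFO: report the current head...
            if (items.size() > capacity) {
                items.removeFirst();        // ...and evict it only when over capacity
            }
            return head;
        }
    }

Backing it with an ArrayDeque keeps add O(1); a fixed array with wrapping indexes would give the layout view of the same structure.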
Within the context of doing computation, there is a very long-standing project that I love (although the CLI interface is a bit foreign if you're not used to such things). It's called the RoundRobinDatabase (RRDtool), where each database stores exactly N values of a single metric (providing graphs, averages, etc.). It adjusts the next bin based on a number of parameters, but most often it advances bins based on time. It's often the tool behind a large number of network throughput graphs, and it has configurable bin collision resolution, etc.
In general, algorithms that are sensitive to the last "some number" of entries are often called "sliding window" algorithms, but that's focusing on the algorithm and not on the data structure :)
The programming riddle sounds like a circular linked list to me.
Well, all these descriptions fit, don't they?
• Fixed-capacity of N elements, maintaining insertion order.
• Elements are added to the 'Tail'
• Whenever an item is added, the head of the collection is returned (FIFO), although not necessarily removed.
This link, with source code for counting frames, probably helps too: frameCounter

Serializing java.util.Random

I'm working on a small, simple game (mostly to learn what's new in Java 8 and JavaFX). One of the features I have is the ability to seed the game's random number generator so you can play roughly the same game as a friend on a different system (think Minecraft Maps or The Binding of Isaac games).
I would like to add the ability to save the game to be resumed at a later time. After looking over the documentation for the java.util.Random class, I can't find a way to get the current seed of the random number generator. The only ways I have come up with to restore the random number generator after saving the game are to either access the seed via reflection at save time and use that, or to set the initial seed at load time and just call nextInt() over and over until the random number generator has been rolled forward to where it was before the game was saved.
First of all, as @user2357112 points out, Random implements Serializable, and does so by writing the seed field (along with the nextNextGaussian and haveNextNextGaussian fields). Have you tried simply serializing it? That should 'just work'™. Other serializers, like Gson, also work: gson.fromJson(gson.toJson(r), Random.class); returns an identical object.
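A quick round-trip sketch with plain Java serialization (buffering in memory here just for the demo; a save file would work the same way):

    import java.io.*;
    import java.util.Random;

    public class RandomRoundTrip {
        public static void main(String[] args) throws IOException, ClassNotFoundException {
            Random original = new Random(42);
            original.nextInt(); // advance the internal state a bit

            // Serialize the Random, internal state included.
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bos)) {
                out.writeObject(original);
            }

            // Deserialize a copy and confirm it continues the same sequence.
            Random restored;
            try (ObjectInputStream in = new ObjectInputStream(
                    new ByteArrayInputStream(bos.toByteArray()))) {
                restored = (Random) in.readObject();
            }
            System.out.println(original.nextInt() == restored.nextInt()); // true
        }
    }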
You don't necessarily need the same Random instance, just a consistent one. You could simply call nextLong() and write that value to your save file as random_seed or whatever. Then just initialize a Random instance with that seed, and now all runs loaded from that file will behave the same. If you wanted, you could even reset the Random instance in your currently running game to the same seed too.
On the other hand if you're generating maps or other seemingly-constant content randomly and want it to persist between loads, I'd think you'd do better to simply seed your Random at the start, and save that value like you describe. To save on computation you could do this in chunks smaller than a whole level. For example, split each level into 10ths, and use (and save) a different seed for each 10th. Then you just have to generate the portion the user's on now, and not the parts they've already crossed. If you were to only save the current state like you propose, the user couldn't go backwards in the map (which might not be a problem for your game in particular, but wouldn't be a great practice in general).
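A rough sketch of that chunking idea, with hypothetical names and a made-up master seed:

    import java.util.Random;

    // One master seed per save file, one derived seed per level chunk, so any
    // chunk can be regenerated on demand without replaying the rest.
    public class ChunkedLevel {
        public static void main(String[] args) {
            long masterSeed = 123456789L;          // stored in the save file
            Random master = new Random(masterSeed);

            int chunks = 10;                       // e.g. split the level into 10ths
            long[] chunkSeeds = new long[chunks];
            for (int i = 0; i < chunks; i++) {
                chunkSeeds[i] = master.nextLong(); // save these alongside masterSeed
            }

            // Regenerate only chunk 7, deterministically, when the player reaches it.
            Random chunk7 = new Random(chunkSeeds[7]);
            System.out.println("first terrain value of chunk 7: " + chunk7.nextInt(100));
        }
    }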
UX caveat: saving your game's randomness seems potentially over-engineered. As a user, I don't generally expect a save file to persist randomness. In fact, sometimes players take advantage of that, for instance if they die in a random encounter right after saving, reloading the game doesn't immediately drop them back into the same encounter. I would consider just leaving your game's Random unseeded and let each game be slightly unique.

Is there a way to use limitless lists in java?

I'm trying to make a randomly-generated 2-d game, which I plan to do with a list of terrain to the right of the spawn point and a list of terrain to the left of the spawn point. However, I need these lists to not have a length limit, as I want the world to be infinite. If I can't find a way I will make the world "round" but infinite would be preferable. Is this possible?
An ArrayList is infinite... until memory runs out. But I guess that was not the question.
Update: Right, this is limited, even though I'd argue nobody will notice the world restarting after two billion units.
Thought about that again. What you need is a random function that produces the same value again and again when you give it the seed and the current position. That way you do not store the world; you recalculate it on the fly.
So you need an infinite counter only for the position in your world. The only challenge will be storing the results of events, such as eaten mushrooms and destroyed bridges.
Storing all the data in a list will have a lot of limitations.
If you use an ArrayList, you can't have infinite elements.
If you use a LinkedList, you lose random access, so speed is a lot slower.
And for any list, RAM is an issue.
You'd be better off splitting generated areas into chunks, then storing those on the hard drive.
Now, you'd still want a list of loaded areas, but this will be limited in scope. If you're 2 game-miles to the east of some town, there's no point keeping the town's information loaded (I hope).
One very popular game that does this is Minecraft. Attempting to load the entire Minecraft world into your RAM won't happen - yet it still has the potential for infinite worlds.
If the world is going to be huge, I wouldn't store it in an ArrayList or a LinkedList. Instead you can make the whole world depend on a randomly selected long value seed. The terrain at position i can then be found using new Random(seed ^ i).nextInt() (or something). That way the world will be (effectively) infinite and you won't have to save the terrain in memory. Whenever you return to a previously visited part of the world it will be the same as it was before. The number of different worlds is 2^64 so you'd have to live a very long time before you saw the same world again.
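A minimal sketch of that approach (the tile-type count and seed below are arbitrary):

    import java.util.Random;

    // Terrain is a pure function of (worldSeed, position), so nothing needs
    // to be stored for unvisited areas.
    public class ProceduralTerrain {
        private final long worldSeed;

        public ProceduralTerrain(long worldSeed) { this.worldSeed = worldSeed; }

        // The same position always yields the same terrain value.
        public int terrainAt(long position) {
            return new Random(worldSeed ^ position).nextInt(5); // e.g. 5 tile types
        }

        public static void main(String[] args) {
            ProceduralTerrain world = new ProceduralTerrain(2024L);
            System.out.println(world.terrainAt(-1_000_000L)); // works left of spawn too
            System.out.println(world.terrainAt(-1_000_000L)); // identical on revisit
        }
    }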
An ArrayList can contain up to about 2^31 values, because the length of an array is an int, a signed 32-bit type (so the exact limit is 2^31 - 1).
A LinkedList has no such structural limit; the only practical limit is the memory of the JVM.

How to give each object in a document a unique ID?

I'm making a bitmap editor where a document consists of several layers where each layer represents a bitmap. Each layer must have a unique ID compared to all other layers that currently exist in the document. I also need to take into account that I need to save and load documents along with the layer IDs.
I'm using the command pattern to store actions that are performed on the document and the unique IDs are used to keep track of which layer an action should be performed on.
At the moment, I just keep a counter called X, when a new layer is created its ID is set to X then X is incremented. When loading, I need to make sure X is set to an appropriate number so that new layers are given unique IDs i.e. I could save the value of X and restore that, or set X based on the biggest layer ID loaded.
Given X is a 32-bit number, the user would need to create 4,294,967,296 layers while working on the same file before IDs start to be reused, which would cause weird behaviour. Should I implement a better unique ID system, or is this generally good enough?
I'm in Java so I could use the UUID library which creates 128 bit unique IDs according to a standard algorithm. This seems overkill though.
Is there some general approach to this kind of problem?
This is perfectly good enough. At a rate of ten new layers per second, 24/365 (which is silly), it'll run fine for over thirteen years (4,294,967,296 IDs at 10 per second is about 13.6 years).
If you think you might manipulate layers programmatically, and thus have some possibility of having 2^32 layers in the lifetime of the image, then throw all the layer IDs into a HashSet when you read the file, and update that set when you add/remove layers. (You don't need to explicitly store the set; the ID associated with each layer is enough.) Then instead of taking "the next number", take "the next number not in the set". Having a counter is still useful, but whether you set it to 0 or (max+1) when you read a file is immaterial; it won't take a significant amount of time to find an empty space unless you envision bafflingly large numbers of layers present simultaneously.
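A small sketch of that scheme (the names here are illustrative):

    import java.util.HashSet;
    import java.util.Set;

    // "The next number not in the set": scan forward from the counter until a
    // free ID is found.
    public class LayerIds {
        private final Set<Integer> usedIds = new HashSet<>();
        private int counter = 0;

        // Called when loading a document: register every existing layer ID.
        public void register(int id) { usedIds.add(id); }

        public int nextId() {
            while (usedIds.contains(counter)) {
                counter++; // skip IDs already present in the loaded file
            }
            usedIds.add(counter);
            return counter++;
        }
    }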
But since you're using Java, you could just use a long instead of an int, and then you wouldn't ever (practically speaking) be able to overflow the number even if all you did was create a one-pixel mask and destroy it over and over again.
