Java algorithm for hierarchical parent-child String ID generation

What is a good algorithm to generate unique IDs to be put into a Map<String, Entity> with Entity being a container/folder class that can contain other Entities and String being the ID? I think when generating a new Entity it should always use the ID of its parent, so right now what I do is
String id = String.valueOf(Math.abs((parentName + entityName).hashCode()));
But it seems pretty inefficient: the ID can be any String as long as it doesn't contain "-", yet this one contains only digits when it could just as well contain letters, and Math.abs halves the number of possible IDs. Oh, and the ID has to be of fixed length (8 characters). It only has to function as a key in the map and inside an XML file and does not have to be secure.

There doesn't seem to be any advantage to including the parent id in the child id. A potential advantage of this would be to find all children by their parent id (i.e. return all ids that start with parent_id), but you're hashing the concatenated id and you have a max id length which makes this approach infeasible.
If your keys don't have to be secure then a counter is efficient and guarantees uniqueness. A sample implementation generates ids composed of case-sensitive alphanumerics, which gives you about 10^14 ids (you can also add special characters to increase the number of ids). You'll need an array of 62 characters: indices 0-25 hold the lowercase letters, indices 26-51 the uppercase letters, and indices 52-61 the digits. You'll also need a state array of 8 integers (or shorts or bytes), initialized to all 0's. To retrieve an id, use the state array to look up characters in the character array and concatenate them (so a state of {0, 1, 2, 0, 1, 2, 0, 1} generates the id "abcabcab"); then increment the 0th index of the state array; if that exceeds 61, reset the 0th index to 0 and increment the 1st index; if that exceeds 61, reset the 1st index to 0 and increment the 2nd index; and so on.
I suggest that you use a StringBuilder to concatenate the substrings, otherwise you're going to generate a lot of garbage strings. You might also be able to replace the state array with a StringBuilder, using StringBuilder#replace in place of the int/short/byte increment operations.
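A minimal single-threaded sketch of the counter just described (the class and method names are illustrative, not a prescribed API):

class IdGenerator {
    private static final char[] ALPHABET = new char[62];
    static {
        int k = 0;
        for (char c = 'a'; c <= 'z'; c++) ALPHABET[k++] = c; // indices 0-25
        for (char c = 'A'; c <= 'Z'; c++) ALPHABET[k++] = c; // indices 26-51
        for (char c = '0'; c <= '9'; c++) ALPHABET[k++] = c; // indices 52-61
    }

    private final int[] state = new int[8]; // 62^8 ≈ 2.2 * 10^14 distinct ids

    String nextId() {
        StringBuilder sb = new StringBuilder(8);
        for (int digit : state) sb.append(ALPHABET[digit]);
        // increment with carry, least significant digit first
        for (int i = 0; i < state.length; i++) {
            if (++state[i] <= 61) break;
            state[i] = 0; // carry into the next digit
        }
        return sb.toString();
    }
}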
If your application is multi-threaded then the counter can become a bottleneck. One way to fix this is for each worker thread to reserve either 62 or 62^2 ids at a time. For example: ID_Thread is the thread with the id generator, and its synchronized getBatchId method returns a copy of the state array. ID_Thread increments the 2nd index of the state array (not the 0th index), carrying into the 3rd index when it exceeds 61, and so on. Meanwhile, a Worker_Thread that has called getBatchId holds its own copy of a state array; it generates ids from that copy, incrementing the 0th index and carrying into the 1st index as above, and when the 1st index would overflow it calls getBatchId for a fresh state array. This means the Worker_Thread instances only need to call a synchronized method for one out of every 62^2 ids.
An alternative multi-threaded implementation would be for ID_Thread to continually generate ids and place them in a BlockingQueue (with a maximum queue size of, say, 32), with the Worker_Thread instances pulling ids from this queue.
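A hedged sketch of that queue variant, reusing the IdGenerator sketched above (all names are illustrative):

import java.util.concurrent.*;

class IdService {
    private final BlockingQueue<String> ids = new ArrayBlockingQueue<>(32);

    IdService(IdGenerator gen) {
        Thread producer = new Thread(() -> {
            try {
                while (true) ids.put(gen.nextId()); // blocks while the queue is full
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        producer.setDaemon(true);
        producer.start();
    }

    String take() throws InterruptedException {
        return ids.take(); // blocks while the queue is empty
    }
}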

Related

Which Pattern Matching Algorithm fits for my case?

I have a project in which I need to compare two text documents and find the similarity rate between every single sentence as well as the overall similarity of the texts.
I did some transformations on the texts, like lowercasing all words, deleting duplicate words, and deleting punctuation except full stops. After these operations, I had 2 ArrayLists containing the sentences with all the words separated. It looks like
[["hello","world"],["welcome","here"]]
Then I sorted every sentence alphabetically. After all this, I compare the words one by one, doing linear search, but if the word I'm searching for is greater than the word I'm looking at (by the ASCII value of the first character, e.g. world > burger), I skip the remaining part and jump to the next word. It seems complicated, but what I need is an answer to: "Is there any faster, more efficient common algorithm, like Boyer-Moore, hashing, or others?" I'm not asking for any piece of code, but I need some theoretical advice. Thank you.
EDIT:
I should have explained the main purpose of the project. It is actually a kind of plagiarism detector. There are two txt files, main.txt and sub.txt. The program will compare them and give an output something like this:
Output:
Similarity rate of two texts is: %X
{The most similar sentence}
{The most similar 2nd sentence}
{The most similar 3rd sentence}
{The most similar 4th sentence}
{The most similar 5th sentence}
So I need to find out the similarity rate of sub.txt relative to the main.txt file. I thought that I need to compare all the sentences in the two files with each other.
For instance, if main.txt has 10 sentences and sub.txt has 5 sentences,
there will be 50 comparisons, and 50 similarity rates will be calculated
and stored.
Finally, I sort the similarity rates and print the 5 most similar sentences. Actually, I've done the project, but it's not efficient. It has 4 nested for loops and compares all the words countless times, so the complexity becomes something like O(n^4) (maybe not quite that much), but it's really huge even in the worst case. I found the Levenshtein distance and cosine similarity algorithms, but I'm not sure about them. Thanks for any suggestion!
EDIT2:
For my case, the similarity between 2 sentences is defined like this:
main_sentence:"Hello dude how are you doing?"
sub_sentence:"Hello i'm fine dude."
Since the intersection is 2 words, ["hello","dude"],
the similarity is: (number of intersected words)*100/(number of words in the main sentence).
For this case it's: 2*100/6 = 33.3%
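Expressed as code, the measure above could look like this sketch (assuming simple \W+ tokenization, which splits "i'm" into two tokens; that doesn't affect this example, since the denominator counts main-sentence words):

import java.util.*;

class Similarity {
    // (number of intersected words) * 100 / (number of words in the main sentence)
    static double rate(String mainSentence, String subSentence) {
        Set<String> mainWords = new HashSet<>(Arrays.asList(mainSentence.toLowerCase().split("\\W+")));
        Set<String> common = new HashSet<>(Arrays.asList(subSentence.toLowerCase().split("\\W+")));
        mainWords.remove(""); // guard against a leading separator producing an empty token
        common.remove("");
        common.retainAll(mainWords);
        return common.size() * 100.0 / mainWords.size();
    }
}

With the question's example, rate("Hello dude how are you doing?", "Hello i'm fine dude.") returns 2 * 100.0 / 6 ≈ 33.3.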
As a suggestion, and even if this is not a "complete answer" to your issue: comparing Strings is usually a "heavy" operation (even if you first check their lengths, which, in fact, is one of the first things the equals() method already does when comparing Strings).
What I suggest is the following: create a dummy hashCode()-like method. It won't be a real hashCode(), but simply the number associated with the order in which that word was read by your code. Something like a cryptographic method, but much simpler.
Note that String.hashCode() alone wouldn't be safe here: the word "Hello" from the first document does return the same hash code as the word "Hello" from the second document (String hash codes are computed from the characters), but two different words can collide on the same hash code, so it cannot serve as a unique word ID.
Data "Warming" - PreConversion
Imagine you have a shared HashMap<String, Integer> (myMap), whose key is a String and whose value is an Integer. Note that HashMap's hashing in Java with String keys shorter than 10 characters (which English words usually are) is incredibly fast. Without any check, just put each word with its counter value:
myMap.put(yourString, ++counter);
Let's say you have 2 documents:
1.txt- Welcome mate what are you doing here
2.txt- Mate I was here before are you dumb
I assume you already lowercased all words, and removed duplicates.
You start reading the first document and assigning each word to a number. The map would look like:
KEY VALUE
welcome 1
mate 2
what 3
are 4
you 5
doing 6
here 7
Now with the second document. If a key is repeated, the put() method will update its value. So:
KEY VALUE
welcome 1
mate 8
what 3
are 13
you 14
doing 6
here 11
I 9
was 10
before 12
dumb 15
Once complete, you create another HashMap<Integer, String> (reverseMap), mapping the other way around:
KEY VALUE
1 welcome
8 mate
3 what
13 are
14 you
6 doing
11 here
9 I
10 was
12 before
15 dumb
You convert both documents into a List of Integers, so they look like:
1.txt- Welcome mate what are you doing here
2.txt- Mate I was here before are you dumb
to:
listOne - [1, 8, 3, 13, 14, 6, 11]
listTwo - [8, 9, 10, 11, 12, 13, 14, 15]
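As a compact sketch of this pre-conversion (names are illustrative; each document is assumed to already be tokenized into a word list):

import java.util.*;

class PreConversion {
    private final Map<String, Integer> myMap = new HashMap<>();
    private int counter = 0;

    // put every word with its counter value; a repeated key gets its value updated
    void index(List<String> documentWords) {
        for (String word : documentWords) myMap.put(word, ++counter);
    }

    // convert a document to ints using the final values, as in listOne/listTwo above
    List<Integer> convert(List<String> documentWords) {
        List<Integer> out = new ArrayList<>();
        for (String word : documentWords) out.add(myMap.get(word));
        return out;
    }

    Map<Integer, String> reverseMap() {
        Map<Integer, String> reverse = new HashMap<>();
        for (Map.Entry<String, Integer> e : myMap.entrySet()) reverse.put(e.getValue(), e.getKey());
        return reverse;
    }
}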
Duplicate words, positions and sequences
To find the duplicates within both documents:
First, make a copy of one of the lists, for example listTwo; call it listDuplicates, as that will be its purpose. (Integers are immutable, so a plain copy is enough; no deep clone is needed.)
List<Integer> listDuplicates = new ArrayList<>(listTwo);
Call retainAll:
listDuplicates.retainAll(listOne);
The result would be:
listDuplicates- [8,11,13,14]
So, of the 11 distinct words found in the 2 documents (listOne.size() + listTwo.size() = 15 words in total), 4 are duplicates and 7 are unique.
In order to get the words back from the values, just call:
for (Integer i : listDuplicates)
    System.out.println(reverseMap.get(i)); // mate, here, are, you
Now that the duplicates are identified, listOne and listTwo can also be used to identify the position of each duplicate in each list, so we can get the difference in positions between consecutive duplicates. The first duplicate gets a diff of -1 by convention, as there is no previous duplicate to compare against; that doesn't necessarily mean it is adjacent to any other duplicate.
If the next entry also showed a diff of -1, it would mean that [8] and [11] were consecutive as well:
      doc1  doc2  difDoc1    difDoc2
[8]    2     1    -1 (0-1)   -1 (0-1)
[11]   7     4    -5 (2-7)   -3 (1-4)
[13]   4     6     3 (7-4)   -2 (4-6)
[14]   5     7    -1 (4-5)   -1 (6-7)
In this case, the diff shown for [14] against its previous duplicate (the diff between [13] and [14]) is the same in both documents: -1. That means they are not only duplicates, but are also placed consecutively in both documents.
Hence, we've found not only duplicate words, but also a duplicate sequence of two words between those lines:
[13][14]--are you
The same mechanism (finding a diff of -1 for the same entry in both documents) also helps to find complete duplicate sequences of 2 or more words. If all the duplicates show a diff of -1 in both documents, we've found a completely duplicated line, as this example shows more clearly:
doc1- "here i am" [4,5,6]
doc2- "here i am" [4,5,6]
listDuplicates - [4,5,6]
     doc1  doc2  difDoc1    difDoc2
[4]   1     1    -1 (0-1)   -1 (0-1)
[5]   2     2    -1 (1-2)   -1 (1-2)
[6]   3     3    -1 (2-3)   -1 (2-3)
All the diffs are -1 for the same entries in both documents -> all duplicates are next to each other in both documents -> the sentence is exactly the same in both documents. So, this time, we've found a complete duplicate line of 3 words.
[4][5][6] -- here i am
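A small sketch that reproduces the difference tables above (1-based positions, with the first duplicate fixed at -1 by convention; names are illustrative):

import java.util.*;

class DiffTable {
    // One row per duplicate: positions in each document and the diff against the
    // previous duplicate; a diff of -1 in both columns extends a shared sequence.
    static void print(List<Integer> listOne, List<Integer> listTwo, List<Integer> duplicates) {
        int prev1 = -1, prev2 = -1;
        for (int value : duplicates) {
            int pos1 = listOne.indexOf(value) + 1;
            int pos2 = listTwo.indexOf(value) + 1;
            int d1 = (prev1 < 0) ? -1 : prev1 - pos1;
            int d2 = (prev2 < 0) ? -1 : prev2 - pos2;
            System.out.printf("[%d]\t%d\t%d\t%d\t%d%n", value, pos1, pos2, d1, d2);
            prev1 = pos1;
            prev2 = pos2;
        }
    }
}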
Apart from this duplicate-sequence search, the difference table would also be helpful when calculating the variance, median, etc. of the duplicate words' positions, in order to get some kind of "similarity" factor (a basic indicator of how alike the documents are; by no means definitive, but somewhat helpful).
Unique values - helpful as a dissimilarity indicator?
Similar mechanisms would be used to get those unique values. For example, by removing the duplicates from the reverseMap:
for (Integer i: listDuplicates)
reverseMap.remove(i);
Now the reverseMap only contains unique values. reverseMap.size() = 7
KEY VALUE
1 welcome
3 what
6 doing
9 I
10 was
12 before
15 dumb
In order to get the unique words:
reverseMap.values() = {welcome,what,doing,I,was,before,dumb}
If you need to know which unique words come from which document, you can use the reverseMap (as the Lists may have been altered by methods such as retainAll executed on them):
Count the number of words from the 1st document. This time, 7.
If the key of the reverseMap is <=7, that unique word comes from the 1st document. {welcome,what,doing}
If the key is >7, that unique word comes from the 2nd document. {I,was,before,dumb}
The uniqueness factor could also be another indicator, in this case a negative one (as we are searching for similarities here). It could still be really helpful.
equals and hashCode - avoid
As noted above, the hashCode() method for Strings can collide for different words, so it cannot act as a unique ID here. The String.equals() method works by comparing the chars one by one (after checking the lengths, as you do manually), which is total overkill if done over and over for big documents:
public boolean equals(Object anObject) {
    if (this == anObject) {
        return true;
    }
    if (anObject instanceof String) {
        String anotherString = (String) anObject;
        int n = value.length;
        if (n == anotherString.value.length) {
            char v1[] = value;
            char v2[] = anotherString.value;
            int i = 0;
            while (n-- != 0) {
                if (v1[i] != v2[i])
                    return false;
                i++;
            }
            return true;
        }
    }
    return false;
}
My opinion is to avoid these repeated String comparisons as much as possible. A correction on the details, though:
String one = "hello";
String two = "hello";
one.hashCode() == two.hashCode() --> true
one == two --> true
Here both are true, but only because the compiler interns string literals, so one and two reference the same cached object. For Strings constructed at runtime (read from a file, built by concatenation, or created with new String("hello")), == compares memory addresses and returns false even when the values are equal, so it must never be used to compare String values. And while one.hashCode() == two.hashCode() always holds for equal values (String.hashCode() is computed from the characters), the reverse is not guaranteed: two different words can share a hash code, which is why a hash alone cannot act as a unique word ID and the counter-based mapping above is preferable.
The essential technique is to see this as a multi-stage process. The key is that you're not trying to compare every document with every other document; rather, you have a first pass that identifies small clusters of likely matches in essentially a one-pass process:
(1) Index or cluster the documents in a way that will allow candidate matches to be identified;
(2) Identify candidate documents that may be a match based on those indexes/clusters;
(3) For each cluster or index match, have a scoring algorithm that scores the similarity of a given pair of documents.
There are a number of ways to solve (1) and (3), depending on the nature and number of the documents. Options to consider:
For certain datasets, (1) could be as simple as indexing on unusual words/combinations of words
For more complex documents and/or larger datasets, you will need to do something sometimes called 'dimension reduction': rather than clustering on shared combinations of words, you'll need to cluster on combinations of shared features, where each feature is identified by a set of words. Look at a feature extraction technique sometimes referred to as "latent semantic indexing": essentially, you model the documents mathematically as a matrix of "words per feature" multiplied by "features per document", and then by factorising the matrix you arrive at an approximation of a set of features, along with a weighted list of which words make up each feature.
Then, once you have a means of identifying a set of words/features to index on, you need some kind of indexing function that will mean that candidate document matches have identical/similar index keys. Look at cosine similarity and "locality-sensitive hashing" such as SimHash.
Then for (3), given a small set of candidate documents (or documents that cluster together in your hashing system), you need a similarity metric. Again, what method is appropriate depends on your data, but conceptually, one way to see this is: "for each sentence in document X, find the most similar sentence in document Y and score its similarity; the 'plagiarism score' is the sum of these values". There are various ways to define a 'similarity score' between two strings, e.g. longest common subsequence, edit distance, number of common word pairs/sequences...
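As an illustration of one of the metrics mentioned above, here is a minimal cosine-similarity sketch over word-count vectors (the \W+ tokenization is a simplifying assumption):

import java.util.*;

class CosineSimilarity {
    static Map<String, Integer> counts(String sentence) {
        Map<String, Integer> c = new HashMap<>();
        for (String w : sentence.toLowerCase().split("\\W+"))
            if (!w.isEmpty()) c.merge(w, 1, Integer::sum);
        return c;
    }

    // dot(a, b) / (|a| * |b|), in [0, 1] for non-negative count vectors
    static double cosine(String a, String b) {
        Map<String, Integer> ca = counts(a), cb = counts(b);
        long dot = 0;
        for (Map.Entry<String, Integer> e : ca.entrySet())
            dot += (long) e.getValue() * cb.getOrDefault(e.getKey(), 0);
        double na = 0, nb = 0;
        for (int v : ca.values()) na += (double) v * v;
        for (int v : cb.values()) nb += (double) v * v;
        return (na == 0 || nb == 0) ? 0 : dot / Math.sqrt(na * nb);
    }
}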
As you can probably imagine from all of this, there's no single algorithm that will hand you exactly what you need on a plate. (That's why entire companies and research departments are dedicated to this problem...) But hopefully the above will give you some pointers.

A collection which holds a specific number of values before being overwritten

I need to use a Collection type in Java which will allow me to add values input by the user one at a time. When a new value is added it is added at index 0 and the previous index 0 value is moved to index 1 etc. Once there are 20 values I then want it to start replacing the 1st value, then the 2nd value and so on whilst moving the other values up 1 index.
i.e. once all 20 values are filled, the next input becomes index 0, old index 1 moves to index 2, and so on; what would become index 20 is forgotten.
I have so far used an ArrayList of Integers but now I have come across this problem. I am wondering if another type of Collection would be best? The Collection will more than likely hold duplicate values.
From the 20 values I will want to sort in ascending order and then find the average of the top 8. I am currently doing that by copying to a second Arraylist within a method, sorting and then adding up the top 8 values and dividing by 8. This way the master list remains the same.
I am not sure there is an efficient way to do what I need with an ArrayList.
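For what it's worth, one way to sketch the described behaviour is an ArrayDeque capped at 20, newest value first, combined with the sort-a-copy approach for the top-8 average (the names and the choice of ArrayDeque are my assumptions, not a prescribed solution):

import java.util.*;

class BoundedHistory {
    private static final int CAPACITY = 20;
    private final Deque<Integer> values = new ArrayDeque<>(CAPACITY);

    void add(int value) {
        if (values.size() == CAPACITY) values.removeLast(); // forget the oldest
        values.addFirst(value);                             // newest at index 0
    }

    double averageOfTop8() {
        List<Integer> copy = new ArrayList<>(values);       // master list stays untouched
        copy.sort(Collections.reverseOrder());
        int n = Math.min(8, copy.size());
        long sum = 0;
        for (int i = 0; i < n; i++) sum += copy.get(i);
        return (double) sum / n;
    }
}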

Delete the ith element in constant time [duplicate]

This question already has answers here:
Ordered list with O(1) random access and removal
(2 answers)
Data structure that allows accessing elements by index and delete them in O(1)
(4 answers)
Closed 4 years ago.
I'm looking for a data structure in Java, that has the following properties:
Deletion in O(1) time using the index inside the structure, while maintaining the relative order of the elements (sorted initially).
Addition, only at the end of the structure.
No updates are required.
Single traversal after all deletions.
Options that I've tried:
Array: cannot delete in O(1) time, as shifting is required. Plus, if I use a HashSet of deleted (or not deleted) elements, I would still have to pass over the deleted elements while traversing the array.
Linked List: deletion is O(1) (if you have a reference to the Node to be deleted, preferably in a doubly linked list), and no shifting is necessary. But there is no indexing, so I have to traverse from the start to determine the Node that has to be deleted.
TreeSet: a TreeSet can maintain the order, and deletion by element is O(log n), but it offers no deletion by the index inside the structure.
I'm looking for a data structure that can help me in the tasks mentioned above, if possible, in Java. If it is not built-in, then I would like to know the implementation of the said data structure.
The need:
I was trying to solve this question. A string of English characters is given initially; then a number of operations are to be performed on the string. Each operation consists of a character c and a number n, meaning that we have to delete the nth occurrence of the character c.
My solution:
I would create an array of type X (the data structure I am looking for) of length 26 (one slot per character). For each occurrence of a character, say d, I would add an object containing its index in the String itself to the slot with index 3 (counting from 0). I would do this for all the characters of the String, taking a total of O(n) time if the length of the string is n.
Once this is done, I would start processing the queries. Each query requires us to delete the nth occurrence of the character c (a variable, not the literal English character c), which we want to do in O(1) time (as required). So all the deletions together would take O(q) time, where q is the number of queries.
Then we can make a charArray with the length of the original string, traverse the elements remaining in each slot of the array of objects of type X, and put them in their respective places. Once this is done, we can traverse the charArray again, ignoring all the empty places, and construct a string from the remaining elements.
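True O(1) deletion of the nth remaining occurrence is the hard part: once earlier deletions renumber the occurrences, no standard Java structure supports it directly. As a hedged sketch of a close alternative, a Fenwick (binary indexed) tree per character answers each query in O(log n); lowercase English letters and valid queries are assumed, and all names are illustrative:

class NthOccurrenceDeleter {
    private final char[] s;
    private final boolean[] deleted;
    private final int[][] occ = new int[26][]; // occ[c][j] = string index of the (j+1)th occurrence of c
    private final int[][] bit = new int[26][]; // Fenwick tree of "still alive" flags, 1-indexed

    NthOccurrenceDeleter(String str) {
        s = str.toCharArray();
        deleted = new boolean[s.length];
        int[] cnt = new int[26];
        for (char ch : s) cnt[ch - 'a']++;
        for (int c = 0; c < 26; c++) {
            occ[c] = new int[cnt[c]];
            bit[c] = new int[cnt[c] + 1];
        }
        int[] fill = new int[26];
        for (int i = 0; i < s.length; i++) {
            int c = s[i] - 'a';
            occ[c][fill[c]++] = i;
        }
        for (int c = 0; c < 26; c++)
            for (int j = 1; j <= cnt[c]; j++) add(bit[c], j, 1);
    }

    private static void add(int[] b, int i, int delta) {
        for (; i < b.length; i += i & -i) b[i] += delta;
    }

    // Deletes the nth remaining occurrence of ch in O(log n); assumes the query is valid.
    void delete(char ch, int n) {
        int[] b = bit[ch - 'a'];
        int pos = 0, rem = n;
        // Fenwick descent: find the largest pos whose prefix sum is < n.
        for (int step = Integer.highestOneBit(b.length); step > 0; step >>= 1) {
            if (pos + step < b.length && b[pos + step] < rem) {
                pos += step;
                rem -= b[pos];
            }
        }
        add(b, pos + 1, -1);                 // occurrence pos+1 is the nth alive one
        deleted[occ[ch - 'a'][pos]] = true;
    }

    // Single traversal after all deletions, as the question requires.
    String result() {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length; i++)
            if (!deleted[i]) sb.append(s[i]);
        return sb.toString();
    }
}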

Having a huge list of numbers and an order with unique order numbers, how to make both O(1) accessible?

Imagine I have a huge list with values
123
567
2355
479977
....
These are say ordered ascending
so
123 - 1
567 - 2
2355 - 3
479977 - 4
...
I want to have a single object that gives me access with the order number (1 or 2 or 3 ...) for the value, as well as with the actual value (123 or 567 or ...) for the order number. Does a structure like this exist?
EDIT: insertions and deletions should be possible.
If I have 2 Hashmaps, I need twice the memory and have to perform the operations twice.
You can maintain an ArrayList<Integer>, which has O(1) index lookup, to store all your ints and the (index -> int) relationship, and a HashMap<Integer, Integer>, which also has O(1) lookup, to store the (int -> index) relationship.
Doing so, you have O(1) for each lookup direction.
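A minimal sketch of that pairing (0-based order, append-only; the names are illustrative):

import java.util.*;

class OrderedLookup {
    private final List<Integer> byOrder = new ArrayList<>();       // order (0-based) -> value
    private final Map<Integer, Integer> byValue = new HashMap<>(); // value -> order

    void append(int value) {               // values are assumed to arrive in ascending order
        byValue.put(value, byOrder.size());
        byOrder.add(value);
    }

    int valueAt(int order) { return byOrder.get(order); } // O(1)
    int orderOf(int value) { return byValue.get(value); } // O(1) expected
}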
If you can have multiple datastructures, I would recommend having 2 Maps :
1) Since the data is in an array, you can directly access the value at a given order (position). Considering the update in the question that insertions and deletions need to be supported, this can be achieved with a HashMap where key = order and value = value at that order.
2) Another, reverse HashMap where key = value and value = order.
Then you can have O(1) lookup time for both the cases.
If I understand correctly, you're looking for something like a HashMap. You can read about them on Oracle's website: JavaDocs
Using a sorted tree, you can get O(log n) time for both. Granted, O(log n) is not O(1), but it is close in practice: with 4 billion elements the difference is 1 lookup (theoretically, though not always) vs about 32.
If you need both to be O(1), you should create an array of sorted ints / longs / BigIntegers (or whatever you need) for lookup in one direction, and a HashMap of ints / longs / BigIntegers to position indices for the other.
Answer for static data:
O(1) is not possible for value-to-order lookups (a hashmap does not have true O(1) complexity), but O(1) is easy for order-to-value.
About value to order:
The lookup will cost about log2(N): you need either a binary search (log2(n)) or a hashmap with similar effort. I expect the binary search to be faster (less object overhead) than the standard HashMap implementation, where you would map each number to its order value.
Update: why the binary search:
With binary search you find the array position of the element, which is equal to the order. This needs no additional memory, far less than the standard HashMap implementation.
For dynamic data, it depends how often one inserts and how often one searches.
For a high number of searches and few insertions I would stay with an Array(List).
About order to value:
Simply store the values in ascending order in an array (or ArrayList) a[]; a[orderNr] gives the value.

(Java) data structure for fast insertion, deletion, and RANDOM SELECTION

I need a data structure that supports the following operations in O(1):
myList.add(Item)
myList.remove(Item.ID) ==> It actually requires random access
myList.getRandomElement() (with equal probability)
--(Please note that getRandomElement() does not mean random access, it just means: "Give me one of the items at random, with equal probability")
Note that my items are unique, so I don't care if a List or Set is used.
I checked some java data structures, but it seems that none of them is the solution:
HashSet supports 1 and 2 in O(1), but it cannot give me a random element in O(1); I would need to advance mySet.iterator() to a random position to select an element, which takes O(n).
ArrayList does 1 and 3 in O(1), but it needs to do a linear search to find the element I want to delete, which takes O(n).
Any suggestions? Please tell me which functions I should call.
If java does not have such data structure, which algorithm should I use for such purpose?
You can use a combination of a HashMap and an ArrayList, if memory permits, as follows:
Store the numbers in an ArrayList arr as they come.
Use a HashMap to maintain the mapping arr[i] => i.
To get a random element, select a random index of the ArrayList.
To delete num:
look up num => i in the HashMap
swap(i, arr.size()-1)
HashMap.remove(num)
re-map the swapped element: arr[i] => i
arr.remove(arr.size()-1)
All operations are O(1), at the cost of O(N) extra space.
You can use a HashMap (of ID to array index) in conjunction with an array (or ArrayList).
add could be done in O(1) by simply adding to the array and adding the ID and index to the HashMap.
remove could be done in O(1) by doing a lookup (and removal) from the HashMap to find the index, then move the last index in the array to that index, update that element's index in the HashMap and decreasing the array size by one.
getRandomElement could be done in O(1) by returning a random element from the array.
Example:
Array: [5,3,2,4]
HashMap: [5->0, 3->1, 2->2, 4->3]
To remove 3:
Look up (and remove) key 3 in the HashMap (giving 3->1)
Swap 3 with the last element, 4, in the array
Update 4's index in the HashMap to 1
Decrease the size of the array by 1
Array: [5,4,2]
HashMap: [5->0, 2->2, 4->1]
To add 6:
Simply add it to the array and HashMap
Array: [5,4,2,6]
HashMap: [5->0, 2->2, 4->1, 6->3]
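A minimal sketch of the scheme just walked through, assuming int items (the names are illustrative):

import java.util.*;

class RandomizedSet {
    private final List<Integer> items = new ArrayList<>();
    private final Map<Integer, Integer> indexOf = new HashMap<>(); // item -> index in items
    private final Random rnd = new Random();

    boolean add(int item) {
        if (indexOf.containsKey(item)) return false;
        indexOf.put(item, items.size());
        items.add(item);
        return true;
    }

    boolean remove(int item) {
        Integer idx = indexOf.remove(item);
        if (idx == null) return false;
        int last = items.get(items.size() - 1);
        if (idx < items.size() - 1) {        // move the last element into the hole
            items.set(idx, last);
            indexOf.put(last, idx);
        }
        items.remove(items.size() - 1);
        return true;
    }

    int getRandomElement() {
        return items.get(rnd.nextInt(items.size())); // equal probability
    }
}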
