Find super-string in a set of strings - java

I have a list of strings, like:
cargo
cargo pants
cargo pants men buy
cargo pants men
cargo pants men melbourne buy
In this, the string that contains all remaining strings is cargo pants men melbourne buy. I'd like to remove all the shorter strings and preserve only the longest "super string".
Note, if 2 queries cargo pants and cargo shorts exist, they will be treated as 2 different queries and won't be combined.
So far, I've been doing this the brute force way - pick a string from set and walk through the same set deleting all other strings that are "substrings" of the current string. Roughly,
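for (Iterator<String> it = big_set.iterator(); it.hasNext(); ) {
    String p = it.next();
    for (String q : big_set) {
        /* If all words in 'p' are also in 'q', then 'p' is redundant */
        if (!p.equals(q) && has_all_words(p, q)) {
            // Remove through the iterator: calling big_set.remove(p) inside a
            // for-each loop can throw ConcurrentModificationException.
            it.remove();
            break;
        }
    }
}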
Is there an intelligent algorithm to do this in less than O(n^2) time? In this function, has_all_words will preserve the order of words while comparing.
For the curious, I have a massive list of a few billion search queries (like the ones sent to Google/Yahoo/Bing) and I'm trying to find hypernyms for these queries. There's a server that parses each query string and produces various interesting categories. I am trying to compress the query list in the hope of minimizing compute cost and bandwidth. This method certainly reduces bandwidth significantly (because people don't come up with buy cargo pants melbourne in one go), but the pre-computation cost is prohibitive. So I've been hunting for algorithms that can do this, but I haven't come across anything yet.

I think all you are asking for is to remove every substring that can be found in a super-string. In a case like ["foo bar", "foo baz"] you will have to store both strings.
If my guess is right, then yes, you can achieve it in less than O(n^2).
Before starting with anything, sort the words within each string alphabetically, so that cases like "cargo pants" and "pants cargo men buy" are still matched.
First, sort your strings in decreasing order of length.
Then pick substrings of the longest string (as we are iterating from the first index and have sorted in reverse order) and start searching for them in the rest of the strings.
If a string is found, remove it. Once searching and removing completes, iterate again with the next substring of the same super-string, with the last substring included.
In the end you will be left with only the strings that are unique (if you consider ["foo bar", "foo baz"] to be unique strings).
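The "sort by decreasing length, then drop anything already covered" idea can be sketched roughly as below. This is only an illustration under some assumptions: the class and method names are made up, hasAllWords is a guess at what the question's has_all_words does (words must appear in the same order), and the alphabetical word-sorting step is skipped because the question says word order matters. The worst case is still quadratic; sorting only guarantees that each string is compared against strings that could contain it.

```java
import java.util.*;

class SuperStrings {
    // Hypothetical helper: true if every word of 'sub' appears in 'sup' in the same order.
    static boolean hasAllWords(String sub, String sup) {
        String[] a = sub.trim().split("\\s+");
        String[] b = sup.trim().split("\\s+");
        int i = 0;
        for (String w : b) {
            if (i < a.length && a[i].equals(w)) {
                i++;
            }
        }
        return i == a.length;
    }

    // Keeps only the queries that are not covered by another kept query.
    static List<String> keepSuperStrings(Collection<String> queries) {
        // Remove exact duplicates, then put the queries with the most words first.
        List<String> sorted = new ArrayList<>(new LinkedHashSet<>(queries));
        sorted.sort(Comparator.comparingInt(
                (String s) -> s.trim().split("\\s+").length).reversed());

        List<String> kept = new ArrayList<>();
        for (String candidate : sorted) {
            boolean covered = false;
            for (String k : kept) {
                if (hasAllWords(candidate, k)) {
                    covered = true;
                    break;
                }
            }
            if (!covered) {
                kept.add(candidate);
            }
        }
        return kept;
    }
}
```

With the list from the question this keeps only cargo pants men melbourne buy; a separate query such as cargo shorts would be kept as well.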

Related

Custom Sorting Algorithm in Java?

Hey guys so I want to write a small Java Program that helps me sort a list. Imagine the list looks like this:
Apples, Grapefruit, Bananas, Pineapples, Coconuts
Now I don't want to sort alphabetically or anything like that but for example by what fruit I like the most, so the sorted list could look like this: Coconuts, Bananas, Apples, Pineapples, Grapefruit
My idea so far was that it could go something like this: Apples is written into the list. Then Grapefruit and Apples are compared and the user says which one they like more (here Apples), so Grapefruit moves below Apples. Then it compares Bananas with e.g. Apples, and the user tells the program they like Bananas more, so it goes above Apples and doesn't have to be compared with Grapefruit anymore, which saves a lot of time. The program should handle a few hundred entries and comparisons in the end, so asking fewer questions will save a lot of time. Am I on the right track? Also, what would be the best way to input the list: an array, an ArrayList, or...?
How should this be implemented? Is there a fitting sorting Algorithm? Thanks in advance!
You should build a Binary Search Tree.
As you're inserting new fruits, you ask the user which they like best, to find where to insert the new fruit node. To keep the number of questions down, keep the tree Balanced.
Once the "preference tree" has been built, you can traverse the tree in order (depth-first), assigning incremental "preference values" to each fruit, and build a Map<String, Integer>, so you can quickly look up any fruit's preference value, aka sort sequence number.
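A minimal sketch of the preference-tree idea, assuming java.util.TreeSet (a red-black tree, so it stays balanced) is an acceptable stand-in for a hand-written balanced BST. The class name and the likesMore comparator are hypothetical; in a real program the comparator would ask the user, and it should return 0 without asking when both arguments are the same item, because TreeSet drops items that compare as equal.

```java
import java.util.*;

class PreferenceTree {
    // likesMore must return a negative number when the first argument is liked more.
    static Map<String, Integer> rank(List<String> items, Comparator<String> likesMore) {
        // Each insertion needs about log2(n) comparisons, i.e. questions to the user.
        TreeSet<String> tree = new TreeSet<>(likesMore);
        tree.addAll(items);

        // Iterating a TreeSet is an in-order traversal: most liked item first.
        Map<String, Integer> rank = new LinkedHashMap<>();
        int next = 0;
        for (String item : tree) {
            rank.put(item, next++);
        }
        return rank;
    }
}
```

The returned map gives every fruit its sort sequence number, so later lookups need no further questions.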
The simplest way is to use Collections.sort with a custom comparator that asks for user input.
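import java.util.Collections;
import java.util.Scanner;

Scanner sc = new Scanner(System.in);
Collections.sort(fruits, (a, b) -> {
    System.out.println("Do you prefer " + a + " or " + b + "?");
    String preference = sc.next();
    // Negative: a sorts first; positive: b sorts first; 0 if the answer matches neither
    return preference.equals(a) ? -1 : preference.equals(b) ? 1 : 0;
});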
The accepted answer is fine, but consider also this alternative:
You could store the fruits in a max-heap, which can be done with an ArrayList (when you don't know the number of fruits beforehand) or a plain array.
When you insert a new fruit, it is appended at the end of the array, and as you let it sift up (according to the heap algorithm), you ask the user for the result of the comparison(s), until the fruit is considered less liked than the one compared with (or it becomes the root -- at index 0).
As post processing, you need to pull out all elements from the top of the heap, again using the appropriate heap algorithm. They will come out in sorted order.
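As a rough sketch of the heap variant, java.util.PriorityQueue (a binary heap) can do the sift-up and sift-down work, so only the comparison needs to come from the user. The class name and the likesMore comparator are assumptions for illustration; the comparator should return a negative number when its first argument is liked more, which puts the most liked fruit at the head of the queue.

```java
import java.util.*;

class PreferenceHeap {
    static List<String> sortByPreference(List<String> items, Comparator<String> likesMore) {
        PriorityQueue<String> heap = new PriorityQueue<>(likesMore);
        for (String item : items) {
            heap.add(item); // sift-up: at most about log2(n) comparisons per insert
        }
        List<String> sorted = new ArrayList<>();
        while (!heap.isEmpty()) {
            sorted.add(heap.poll()); // removes the current favourite, then sifts down
        }
        return sorted;
    }
}
```

Note that pulling the elements out also triggers comparisons, so the user is asked questions during the post-processing step as well.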

Query Dynamodb with more than one sort keys

How can I query a DynamoDB table sorted by two attributes?
Let’s say for example my entity is github_repository. With the attributes name (S), owner (S), watches (I), stars (I) and forks (I).
(S) - string: (I) - int
And I want to return the top 10 github_repositories by stars and forks. Let’s say if two items have the same stars, I use forks as the tie breaker in the sorting.
Thanks
If you know how you want to sort the (stars, forks) tuple, you can serialize it in an appropriate way as a string, and use that as the sort key.
In your case, you said:
I want to return the top 10 github_repositories by stars and forks. Let’s say if two items have the same stars, I use forks as the tie breaker in the sorting.
So if "stars" are numbers 0 through 9, and "forks" are also 0 through 9, you just concatenate the two digits together to get one string. Or take 9 - number for each digit if you want reverse sorting (highest stars first). For example, an item with 3 stars and 6 forks will get the string "63". With larger values, pad each number to a fixed width so that lexicographic order matches numeric order. If one item has more stars than another, it will be first in the sort order. If two items have the same stars, then the one with more forks will be first, as you required.
If the meaning of your stars and forks are different, or your sorting requirements are different, the details will be different - but in many cases you can find a way to encode what you want to sort into a single string that DynamoDB can sort lexicographically.
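For counts larger than a single digit, the same encoding might look like the sketch below. The class name, the method name, the '#' separator and the MAX bound are all assumptions made for illustration, not part of any DynamoDB API; the only point is that fixed-width, inverted numbers sort lexicographically with the highest stars (then forks) first.

```java
import java.util.Locale;

class RepoSortKey {
    // Assumed upper bound for stars and forks; it fixes the padded width at 9 digits.
    static final int MAX = 999_999_999;

    // Builds a string whose ascending lexicographic order is "most stars first,
    // then most forks first".
    static String of(int stars, int forks) {
        if (stars < 0 || stars > MAX || forks < 0 || forks > MAX) {
            throw new IllegalArgumentException("value out of range");
        }
        return String.format(Locale.ROOT, "%09d#%09d", MAX - stars, MAX - forks);
    }
}
```

An item with 3 stars and 6 forks gets "999999996#999999993", which sorts before the key of an item with 3 stars and 5 forks.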
You can only sort by one key. For your issue there are two possible solutions I can think of:
Use a sort key that is a sum of stars and forks (if that suits your use case).
Fetch the top ten by stars and then sort locally by forks (but that can give a wrong result if item number 10 has the same stars as item number 11 but fewer forks).

Algorithm for printing all accepted strings (to an upper bound if infinite amount) from a given DFA

As the title states, I'm trying to write an algorithm that generates accepted strings, up to an upper bound, from a given DFA (Deterministic Finite Automaton) as input.
It should not generate more strings than the upper bound n if the DFA contains cyclic patterns, because obviously I can't print an infinite number of strings, which leads me to my problem.
Machines with finite languages are very straightforward, as I can just do a DFS, traverse the graph and recursively concatenate all the letters that connect the states. But I have no clue how I should deal with infinite-language DFAs, unless I hardcode a limit on how many times the DFS may traverse states that can potentially lead to cycles.
So my question is: how should I go about approaching this problem? Are there any known algorithms that I could use to tackle this task?
The bound specifies the number of strings, not their length. The string length is not allowed to exceed 5000 characters, but should preferably not come near that, as the max bound, n, is at most 1000 in the tests.
My current and very naive solution is the following:
public void generateStrings(int currentStateNum, Set<String> output, String prefix, Set<Integer> traversedStates){
    State currentState, tempState;
    String nextString;
    currentState = dfa.get(currentStateNum);
    //keeps track of recursion track, i.e the states we've already been to.
    //not in use because once there are cyclic patterns the search terminates too quickly
    //traversedStates.add(currentStateNum);
    //My current, bad solution to avoid endless loops is by checking how many times we've already visited a state
    currentState.incrementVisited();
    //if we're currently looking at an accepted state we add the current string to our list of accepted strings
    if(currentState.acceptedState){
        output.add(prefix);
    }
    //Check all transitions from the current state by iterating through them.
    HashMap<Character, State> transitions = currentState.getTransitions();
    for (Map.Entry<Character, State> table : transitions.entrySet()) {
        //if we've reached the max count of strings, return, we're done.
        if (output.size() == maxCount) {
            return;
        }
        tempState = table.getValue();
        //new appended string. I realize stringbuilder is more efficient and I will switch to that
        //once core algorithm works
        nextString = prefix + table.getKey().toString();
        //my hardcoded limit, will now traverse through the same states as many times as there are states
        if (tempState.acceptedState || tempState.getVisitedCount() <= stateCount) {
            generateStrings(tempState.getStateNum(), output, nextString, traversedStates);
        }
    }
}
It is not really a dfs because I don't check which states I've already visited, because if I do that, everything that will be printed is the simplest path to the nearest accept state, which is not what I want. I want to generate as many strings as required (if the language for the DFA is not finite, that is).
This solution works up to the point where the "visit limit" that I chose arbitrarily no longer cuts it, so my solution is somewhat or entirely incorrect.
As you can see my datastructure for representing automata is an ArrayList with states, where State is a separate class that contains a HashMap with transitions, where the key is the edge char and the value is the state that the transition leads to.
Does anyone have any idea how I should proceed with this problem? I tried hard to find similar questions but I couldn't find anything helpful more than some github repos with code that is way too complicated for me to learn anything from.
Thanks a lot in advance!
I would use a bounded queue of objects representing the current state and the string generated so far, and then proceed with something like the following:
Add {"", start} to the queue, representing the string created so far (which is empty) and the start state of the DFA.
As long as there is something on the queue:
    Pop the front of the queue.
    If the current state is accepting, add the currentString to your result set.
    For each transition from this state, add entries to the queue of the form {currentString+nextCharacter, nextState}.
Stop when you've hit enough strings, or if the strings are getting too long, or if the queue is empty (finite language).
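The queue-based steps above could be sketched as follows. The DFA representation here (a list of transition maps indexed by state number, plus a set of accepting state numbers) is a simplification assumed for the example, not the State class from the question, and the names are hypothetical. Because the search is breadth-first, shorter strings come out first; if the DFA has states that can never reach an accepting state, the queue can still grow large, so pruning those states beforehand may be worthwhile.

```java
import java.util.*;

class DfaEnumerator {
    // transitions.get(s) maps an input character to the next state number.
    static List<String> enumerate(List<Map<Character, Integer>> transitions,
                                  Set<Integer> accepting, int start,
                                  int maxCount, int maxLength) {
        List<String> result = new ArrayList<>();
        Deque<Map.Entry<String, Integer>> queue = new ArrayDeque<>();
        queue.add(new AbstractMap.SimpleEntry<>("", start));

        while (!queue.isEmpty() && result.size() < maxCount) {
            Map.Entry<String, Integer> item = queue.poll();
            String current = item.getKey();
            int state = item.getValue();

            if (accepting.contains(state)) {
                result.add(current);
            }
            if (current.length() >= maxLength) {
                continue; // do not extend strings beyond the length limit
            }
            for (Map.Entry<Character, Integer> t : transitions.get(state).entrySet()) {
                queue.add(new AbstractMap.SimpleEntry<>(current + t.getKey(), t.getValue()));
            }
        }
        return result;
    }
}
```

Since the automaton is deterministic, each string leads to exactly one state, so no string is produced twice.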

Anagram Algorithm using a hashtable and/or tries

I have been searching the internet for a while now for steps to find all the anagrams of a string (word) (e.g. team produces the word tame) using a hashtable and a trie. All I have found here on SO is how to verify that 2 words are anagrams. I would like to take it a step further and find an algorithm in English so that I can program it in Java.
For example,
Loop through all the characters.
For each unique character insert into the hashtable.
and so forth.
I don't want a complete program. Yes, I am practicing for an interview. If this question comes up then I will know it and know how to explain it not just memorize it.
The most succinct answer, due to some guy quoted in the "Programming Pearls" book, is (paraphrasing):
"sort it this way (waves hand horizontally left to right), and then that way (waves hand vertically top to bottom)"
This means: starting from a one-column table (word), create a two-column table (sorted_word, word), then sort it on the first column.
Now, to find anagrams of a word, first compute its sorted word, do a binary search for its first occurrence in the first column of the table, and read off the second-column values while the first column stays the same.
input (does not need to be sorted):
mate
tame
mote
team
tome
sorted "this way" (horizontally):
aemt, mate
aemt, tame
emot, mote
aemt, team
emot, tome
sorted "that way" (vertically):
aemt, mate
aemt, tame
aemt, team
emot, mote
emot, tome
lookup "team" -> "aemt"
aemt, mate
aemt, tame
aemt, team
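A small sketch of the two-column table with a binary search for the first matching row; the class and method names are made up for the example. Rows are ordered by the sorted word and then by the word itself, which is one possible tie-break.

```java
import java.util.*;

class AnagramTable {
    // Each row is {sortedWord, word}.
    private final List<String[]> rows = new ArrayList<>();

    static String sortLetters(String word) {
        char[] c = word.toCharArray();
        Arrays.sort(c); // sort "this way": the letters within the word
        return new String(c);
    }

    AnagramTable(Collection<String> words) {
        for (String w : words) {
            rows.add(new String[] { sortLetters(w), w });
        }
        // sort "that way": the rows of the table, by first column, then by word
        rows.sort((x, y) -> {
            int c = x[0].compareTo(y[0]);
            return c != 0 ? c : x[1].compareTo(y[1]);
        });
    }

    List<String> anagramsOf(String word) {
        String key = sortLetters(word);
        int lo = 0, hi = rows.size();
        while (lo < hi) { // binary search for the first row whose key is >= key
            int mid = (lo + hi) >>> 1;
            if (rows.get(mid)[0].compareTo(key) < 0) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        List<String> out = new ArrayList<>();
        for (int i = lo; i < rows.size() && rows.get(i)[0].equals(key); i++) {
            out.add(rows.get(i)[1]);
        }
        return out;
    }
}
```

For the five words in the example, looking up "team" returns mate, tame and team.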
As for hash tables and tries, they only come into the picture if you want a slightly speedier lookup. Using hash tables, you can partition the two-column, vertically sorted table into k partitions based on the hash of the first column. This gives you a constant-factor speedup because you only have to do a binary search within one partition. Tries are a different way of optimizing: they help you avoid doing too many string comparisons. For each terminal node in the trie, you store the index of the first row of the corresponding section of the table.
Hash tables are not the best solution, so I doubt you would be required to use them.
The simplest approach to finding anagram pairs (that I know of) is as follows:
Map characters as follows:
a -> 2
b -> 3
c -> 5
d -> 7
and so on, such that letters a..z are mapped to the first 26 primes.
Multiply the character values for each character in the word; let's call the result the "anagram number". It's pretty easy to see that TEAM and TAME will produce the same number. Indeed, by unique prime factorization, the anagram numbers of two different words will be the same if and only if they are anagrams.
Thus the problem of finding anagrams between the two lists reduces to finding anagram numbers that appear on both lists. This is easily done by sorting each list by anagram number and stepping through to find common values, in O(n log n) time.
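A minimal sketch of the "anagram number", assuming words contain only the letters a..z (case is ignored). The class name is hypothetical. BigInteger is used because the product overflows a long quite quickly for longer words; that choice is mine, not something the answer specifies.

```java
import java.math.BigInteger;
import java.util.Locale;

class AnagramNumber {
    // The first 26 primes, so that a -> 2, b -> 3, ..., z -> 101.
    private static final int[] PRIMES = {
        2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41,
        43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101
    };

    static BigInteger of(String word) {
        BigInteger product = BigInteger.ONE;
        for (char ch : word.toLowerCase(Locale.ROOT).toCharArray()) {
            if (ch < 'a' || ch > 'z') {
                throw new IllegalArgumentException("only the letters a..z are supported: " + word);
            }
            product = product.multiply(BigInteger.valueOf(PRIMES[ch - 'a']));
        }
        return product;
    }
}
```

Two words are anagrams exactly when AnagramNumber.of returns equal values for them.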
Convert the String to a char[].
Sort the char[].
Generate a String from the sorted char[].
Use it as the key into a HashMap<String, List<String>>.
Insert the original String into the list of values associated with that key.
for example for
car, acr, rca, abc it would have
acr: car, acr, rca
abc: abc
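These steps translate almost directly into Java; the class and method names below are only illustrative.

```java
import java.util.*;

class AnagramGroups {
    // Groups words by their sorted letters, so all anagrams share one key.
    static Map<String, List<String>> group(Collection<String> words) {
        Map<String, List<String>> groups = new HashMap<>();
        for (String word : words) {
            char[] chars = word.toCharArray();
            Arrays.sort(chars);
            String key = new String(chars);
            groups.computeIfAbsent(key, k -> new ArrayList<>()).add(word);
        }
        return groups;
    }
}
```

Finding all anagrams of a word is then a single lookup with the word's sorted letters as the key.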

Returning a Subset of Strings from 10000 ascii strings

My college is getting over so I have started preparing for the interviews to get the JOB and I came across this interview question while I was preparing for the interview
You have a set of 10000 ascii strings (loaded from a file)
A string is input from stdin.
Write a pseudocode that returns (to stdout) a subset of strings in (1) that contain the same distinct characters (regardless of order) as
input in (2). Optimize for time.
Assume that this function will need to be invoked repeatedly. Initializing the string array once and storing in memory is okay .
Please avoid solutions that require looping through all 10000 strings.
Can anyone provide me a general pseudocode/algorithm kind of thing how to solve this problem? I am scratching my head thinking about the solution. I am mostly familiar with Java.
Here is an algorithm that is O(1) with respect to the number of stored strings (a lookup depends only on the length of the input)!
Initialization:
For each string, sort its characters, removing duplicates - e.g. "trees" becomes "erst"
Load the sorted word into a trie using the sorted characters, adding a reference to the original word to the list of words stored at the node where the sorted characters end
Search:
Sort the input string (removing duplicates) the same way as the source strings during initialization
Follow the trie using those characters and, at the end node, return all words referenced there
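A rough sketch of this trie, with hypothetical class and method names. It assumes "same distinct characters" means exactly the same set of characters, so each word is stored only at the node where its sorted, de-duplicated key ends.

```java
import java.util.*;

class DistinctCharIndex {
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        final List<String> words = new ArrayList<>();
    }

    private final Node root = new Node();

    // Sorted distinct characters of s, e.g. "trees" -> "erst".
    static String key(String s) {
        TreeSet<Character> distinct = new TreeSet<>();
        for (char c : s.toCharArray()) {
            distinct.add(c);
        }
        StringBuilder sb = new StringBuilder();
        for (char c : distinct) {
            sb.append(c);
        }
        return sb.toString();
    }

    void add(String word) {
        Node node = root;
        for (char c : key(word).toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new Node());
        }
        node.words.add(word);
    }

    List<String> lookup(String input) {
        Node node = root;
        for (char c : key(input).toCharArray()) {
            node = node.children.get(c);
            if (node == null) {
                return Collections.emptyList();
            }
        }
        return node.words;
    }
}
```

A plain HashMap keyed by the same sorted string would behave the same way for exact lookups; the trie mainly helps if prefix queries are needed later.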
They say optimise for time, so I guess we're safe to abuse space as much as we want.
In that case, you could do an initial pass on the 10000 strings and build a mapping from each of the unique characters present in the 10000 strings to the set of indices of the strings that contain it. That way you can ask the mapping the question: which strings contain character 'x'? Call this mapping M. (Order: O(nm), where n is the number of strings and m is their maximum length.)
To optimise in time again, you could reduce the stdin input string to unique characters, and put them in a queue, Q. (order O(p), p is the length of the input string)
Start a new disjoint set, say S. Then let S = Q.extractNextItem.
Now you could loop over the rest of the unique characters and find which sets contain all of them.
While (Q is not empty) (loops O(p)) {
S = S intersect Q.extractNextItem (close to O(1) depending on your implementation of disjoint sets)
}
voila, return S.
Total time: O(mn + p + p*1) = O(mn + p)
(Still early in the morning here, I hope that time analysis was right)
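One way to sketch the mapping M and the intersection loop is with java.util.BitSet, where bit i stands for the i-th string. The class and method names are assumptions. As described above, this returns the strings that contain all of the input's characters; strings with additional characters are not filtered out. Each intersection costs time proportional to the number of strings divided by the machine word size, rather than O(1).

```java
import java.util.*;

class CharInvertedIndex {
    private final List<String> strings;
    // M: character -> indices of the strings that contain it
    private final Map<Character, BitSet> index = new HashMap<>();

    CharInvertedIndex(List<String> strings) {
        this.strings = strings;
        for (int i = 0; i < strings.size(); i++) {
            for (char c : strings.get(i).toCharArray()) {
                index.computeIfAbsent(c, k -> new BitSet()).set(i);
            }
        }
    }

    List<String> containingAll(String input) {
        BitSet result = new BitSet();
        result.set(0, strings.size()); // start with every string as a candidate
        for (char c : input.toCharArray()) {
            BitSet withChar = index.get(c);
            if (withChar == null) {
                return Collections.emptyList(); // no string contains this character
            }
            result.and(withChar); // S = S intersect M[c]
        }
        List<String> out = new ArrayList<>();
        for (int i = result.nextSetBit(0); i >= 0; i = result.nextSetBit(i + 1)) {
            out.add(strings.get(i));
        }
        return out;
    }
}
```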
As Bohemian says, a trie tree is definitely the way to go!
This sounds like the way an address book lookup would work on a phone. Start punching digits in, and then filter the address book based on the number representation as well as any of the three (or actually more if using international chars) letters that number would represent.
