Spell-Check: Find one-to-one token difference mapping between two strings

I recently stumbled over this question on an internet archive and am having some difficulty wrapping my head around it. I want to find the mapping between the differing tokens of two strings. The output should be a String-to-String map.
For example:
String1: hewlottpackardenterprise helped american raleways in N Y
String2: hewlett packard enterprise helped american railways in NY
Output:
hewlottpackardenterprise -> hewlett packard enterprise
hewlott -> hewlett
raleways -> railways
N Y -> NY
Note: I have already written an edit-distance method that finds all edits, grouped by type (deletion, substitution, etc.), and a convert method that uses them to transform the first string into the second.
What have I tried so far?
Approach 1: I began with a naive approach of splitting both strings by space, inserting the tokens of the first string into a hash map and comparing the tokens of the other string against this hash map. However, this approach quickly fails because it misses relevant mappings.
Approach 2: I use my convert method to find the edit positions in the string and the types of the edits. Using the space edits, I'm able to create a mapping from hewlottpackardenterprise -> hewlett packardenterprise. However, the method explodes as more and more pieces need to be split off within the same word.
Appreciate any thoughts in this regard! Will clear any doubts in the comments.
public String returnWhiteSpaceEdittoken(EditDone e, List<String> testTokens) {
    int pos = e.pos, count = 0, i = 0;
    String resultToken = null;
    if (e.type.equals(DeleteEdit)) {
        for (i = 0; i < testTokens.size(); i++) {
            count += testTokens.get(i).length();
            if (count == pos) {
                break;
            }
            if (i != testTokens.size() - 1) {
                count++;
            }
        }
        resultToken = testTokens.get(i) + " " + testTokens.get(i + 1);
    } else if (e.type.equals(InsertEdit)) {
        for (i = 0; i < testTokens.size(); i++) {
            count += testTokens.get(i).length();
            if (count > pos) {
                break;
            }
            if (i != testTokens.size() - 1) {
                count++;
            }
        }
        String token = testTokens.get(i);
        resultToken = token.substring(count - token.length(), pos) + token.substring(pos, count);
    }
    return resultToken;
}

A pretty common way of handling problems like this is to find the longest common subsequence (or its dual, the shortest edit script) between the two strings and then post-process the output to get the specific format you want; in your case, the string map.
Wikipedia has a pretty decent introduction to the problem here: https://en.wikipedia.org/wiki/Longest_common_subsequence_problem
A great paper, "An O(ND) Difference Algorithm and Its Variations" by Myers, can be found here: http://www.xmailserver.org/diff2.pdf
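As a rough illustration of the "LCS, then post-process" idea, here is a minimal sketch that works on whitespace-separated tokens instead of characters. The class and method names (TokenDiffMapper, mapDifferences) are made up for this example and are not from the question. Tokens that appear in both strings act as anchors, and each run of unmatched tokens between two anchors becomes one map entry. It does not produce finer-grained entries such as hewlott -> hewlett; those would need a second, character-level pass over each entry.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class TokenDiffMapper {
    // Maps each run of unmatched tokens in a to the corresponding run in b.
    static Map<String, String> mapDifferences(String a, String b) {
        String[] x = a.trim().split("\\s+");
        String[] y = b.trim().split("\\s+");
        int n = x.length, m = y.length;

        // lcs[i][j] = length of the longest common subsequence of x[i..] and y[j..]
        int[][] lcs = new int[n + 1][m + 1];
        for (int i = n - 1; i >= 0; i--) {
            for (int j = m - 1; j >= 0; j--) {
                lcs[i][j] = x[i].equals(y[j])
                        ? lcs[i + 1][j + 1] + 1
                        : Math.max(lcs[i + 1][j], lcs[i][j + 1]);
            }
        }

        Map<String, String> result = new LinkedHashMap<>();
        List<String> left = new ArrayList<>();
        List<String> right = new ArrayList<>();
        int i = 0, j = 0;
        while (i < n || j < m) {
            if (i < n && j < m && x[i].equals(y[j])) {
                flush(left, right, result); // a common token closes the current gap
                i++;
                j++;
            } else if (j == m || (i < n && lcs[i + 1][j] >= lcs[i][j + 1])) {
                left.add(x[i++]);           // token only in the first string
            } else {
                right.add(y[j++]);          // token only in the second string
            }
        }
        flush(left, right, result);
        return result;
    }

    // Emits the pending gap (if any) as one map entry and resets both buffers.
    static void flush(List<String> left, List<String> right, Map<String, String> out) {
        if (!left.isEmpty() || !right.isEmpty()) {
            out.put(String.join(" ", left), String.join(" ", right));
        }
        left.clear();
        right.clear();
    }

    public static void main(String[] args) {
        mapDifferences(
                "hewlottpackardenterprise helped american raleways in N Y",
                "hewlett packard enterprise helped american railways in NY")
                .forEach((k, v) -> System.out.println(k + " -> " + v));
    }
}
```

For the two strings from the question this yields three entries: hewlottpackardenterprise -> hewlett packard enterprise, raleways -> railways, and N Y -> NY.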

Related

Why is my collections.sort leading to different outputs in 2 arraylists with the same data [closed]

I am having trouble sorting two ArrayLists that hold exactly the same data. One is received through my API and the other is parsed from in-game text fields. For some reason they end up sorted differently even though their contents look identical.
public ArrayList<String> names = new ArrayList<>();
AbowAPI api = new AbowAPI();
@Override
public void onStart() throws InterruptedException {
    try {
        names = api.getTargets();
        Collections.sort(names);
    } catch (IOException e) {
        log(e.getStackTrace());
    }
}
@Override
public int onLoop() throws InterruptedException {
    if (tabs.getOpen() != Tab.FRIENDS) {
        tabs.open(Tab.FRIENDS);
    } else {
        ArrayList<String> friendList = getFriendNames();
        Collections.sort(friendList);
        log(friendList);
    }
This here is the resulting output
[INFO][Bot #1][05/06 07:59:51 em]: [abc, abow42069, adam, bad, bl ack, blood, blue, bye, dead, dog, google, her, him, john wick, light, lol, mad, red]
[INFO][Bot #1][05/06 07:59:51 em]: [abc, abow42069, adam, bad, blood, blue, bl ack, bye, dog, google, her, him, john wick, light, lol, mad, red]
When I compare the two ArrayLists they are also not equal. I need them to be sorted the same way so that they match. Any idea why they sort differently?
This is my API call to get the targets; maybe this is what is causing the weird bug?
URL url = new URL("http://127.0.0.1:5000/snipes");
HttpURLConnection con = (HttpURLConnection) url.openConnection();
con.setRequestMethod("GET");
BufferedReader in = new BufferedReader(
        new InputStreamReader(con.getInputStream()));
String inputLine;
ArrayList<String> content = new ArrayList<>();
while ((inputLine = in.readLine()) != null) {
    content.add(inputLine);
}
in.close();
return content;

The third character in "bl ack" is in one case a character that sorts before the lower-case alphabetics, and in the other case a character that sorts after them.
My hunch is that the first one is the "normal" space and the second one is some other kind of space (there are several). The second one would then have a character code greater than 127, i.e. outside the ASCII range, which is possible because Java strings are UTF-16, not ASCII.
To debug, I'd insert code like this:
for (String f : friendList) {
    for (int k = 0; k < f.length(); k++) {
        char c = f.charAt(k);
        if (c > 127) {
            // cast to int: %d and %x do not accept a char argument
            System.out.printf("String '%s' char %d has code %d (%x)%n",
                    f, k, (int) c, (int) c);
        }
    }
}
It's not pretty but it'll get the job done. It'll work equally well if I'm wrong about the 'funny' character being a form of space and it is instead something the logger merely displays as a space. Armed with the character code in hex, you can look it up at unicode.org.
Replace 'printf' with anything more suitable to your development environment, if appropriate.
(For the purists, I'm guessing it's not going to be a surrogate pair, thus I'm using char rather than codepoint).
Once you know what you're dealing with, you can devise a handling strategy, which might be "replace the funny character with plain old space".
Edited since we now know the character is non-breaking space, 00a0 in hex.
You could, for example, change this:
ArrayList<String> friendList = getFriendNames();
Collections.sort(friendList);
to this:
ArrayList<String> friendList = new ArrayList<>();
ArrayList<String> temp = getFriendNames();
for (String t : temp) {
    friendList.add(t.replaceAll("\\h", " "));
}
Collections.sort(friendList);
The \h represents any horizontal whitespace in a Java regular expression Pattern, which includes the non-breaking space and others. So we're going above and beyond the observed problem, normalizing all possible "spaces".
It would probably be better to make the same replacement in the getFriendNames method, but I don't think you've shown that code. Nevertheless, I hope you get the idea.
(Code typed in, not tested).
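To see the effect in isolation, here is a small self-contained sketch. The class name NbspSortDemo and the three sample names are made up for the demonstration; the only thing taken from the question is that one list contains "bl ack" with a plain space and the other contains it with a non-breaking space (U+00A0).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class NbspSortDemo {
    // Replaces every horizontal-whitespace character (tab, NBSP, ...) with a
    // plain space, then sorts a copy of the list.
    static List<String> normalizeAndSort(List<String> names) {
        List<String> result = new ArrayList<>();
        for (String name : names) {
            result.add(name.replaceAll("\\h", " "));
        }
        Collections.sort(result);
        return result;
    }

    public static void main(String[] args) {
        List<String> plain = new ArrayList<>(Arrays.asList("blue", "bl ack", "blood"));
        List<String> withNbsp = new ArrayList<>(Arrays.asList("blue", "bl\u00A0ack", "blood"));

        Collections.sort(plain);    // space (32) sorts before 'o', so "bl ack" comes first
        Collections.sort(withNbsp); // NBSP (160) sorts after 'u', so "bl ack" comes last

        System.out.println(plain.equals(withNbsp));                   // prints false
        System.out.println(normalizeAndSort(withNbsp).equals(plain)); // prints true
    }
}
```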

Looking for similar strings in a string array [duplicate]

I have a string array. For example:
["Tartrazine","Orange GGN", "Riboflavin-5-Phosphate"]
And I have a string. For example:
"Riboflvin"
I want to look for the most similar string in the array and get it if it exists.
So I need this output:
"Riboflavin-5-Phosphate"
But if the array looks like this:
["Tartrazine","Orange GGN", "Quinoline"]
I want something like this output:
"No similar strings found"
I tried using FuzzyWuzzy library, but it shows a lot of false alarms.

You can use the String#contains method, sequentially reducing the length of the search string if the full string is not found:
String[] arr = {"Tartrazine", "Orange GGN", "Riboflavin-5-Phosphate"};
String element = "Riboflvin";
boolean found = false;
for (int i = 0; i < element.length(); i++) {
    // take shorter substring if nothing found at previous step
    String part = element.substring(0, element.length() - i);
    // if any string from array contains this substring
    if (Arrays.stream(arr).anyMatch(str -> str.contains(part))) {
        System.out.println("Found part: " + part);
        // then print these strings one by one
        Arrays.stream(arr).filter(str -> str.contains(part))
                .forEach(System.out::println);
        found = true;
        break;
    }
}
// if nothing found
if (!found) {
    System.out.println("No similar strings found");
}
Output:
Found part: Ribofl
Riboflavin-5-Phosphate

Well, it depends what you want to do exactly.
There are a couple of things you can do. You can check whether the collection contains an exact match of the string you are looking for by calling list.contains("yourStr") on the list directly. You could also check each value to see whether it contains a certain substring, like so:
for (String s : list) {
    if (s.contains(subStr)) {
        return s;
    }
}
Otherwise, if you really want to check similarity, it becomes a bit more complicated. Then we have to answer the question: "how similar is similar enough?". I think this post has a decent answer to that problem: Similarity String Comparison in Java
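One possible way to make "similar enough" concrete is sketched below. It is an assumption on my part, not something taken from the linked post: it computes the smallest edit distance between the search string and any substring of each candidate (a standard variant of the Levenshtein dynamic program in which a match may start anywhere in the candidate), and accepts the best candidate only if that distance stays under a threshold. The names FuzzyLookup, bestSubstringDistance and findMostSimilar, and the threshold of 3, are chosen for this example only.

```java
import java.util.Optional;

class FuzzyLookup {
    // Smallest edit distance between query and any substring of text.
    static int bestSubstringDistance(String query, String text) {
        int m = query.length();
        int[] prev = new int[m + 1];
        int[] curr = new int[m + 1];
        for (int i = 0; i <= m; i++) {
            prev[i] = i; // query prefix of length i against an empty substring
        }
        int best = prev[m];
        for (int j = 1; j <= text.length(); j++) {
            curr[0] = 0; // a match may start at any position in text
            for (int i = 1; i <= m; i++) {
                int cost = query.charAt(i - 1) == text.charAt(j - 1) ? 0 : 1;
                curr[i] = Math.min(prev[i - 1] + cost,
                        Math.min(prev[i] + 1, curr[i - 1] + 1));
            }
            best = Math.min(best, curr[m]); // a match may end at any position too
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return best;
    }

    // Returns the closest candidate, or empty if none is within maxDistance edits.
    static Optional<String> findMostSimilar(String query, String[] candidates, int maxDistance) {
        String best = null;
        int bestDistance = Integer.MAX_VALUE;
        for (String candidate : candidates) {
            int d = bestSubstringDistance(query, candidate);
            if (d < bestDistance) {
                bestDistance = d;
                best = candidate;
            }
        }
        return bestDistance <= maxDistance ? Optional.of(best) : Optional.empty();
    }

    public static void main(String[] args) {
        String[] first = {"Tartrazine", "Orange GGN", "Riboflavin-5-Phosphate"};
        String[] second = {"Tartrazine", "Orange GGN", "Quinoline"};
        System.out.println(findMostSimilar("Riboflvin", first, 3).orElse("No similar strings found"));
        System.out.println(findMostSimilar("Riboflvin", second, 3).orElse("No similar strings found"));
    }
}
```

With the arrays from the question, the first call finds "Riboflavin-5-Phosphate" (one missing letter) and the second finds nothing within the threshold. The threshold is the knob that trades missed matches against false positives, so it would need tuning on real data.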

String Fragment Combinations Puzzle

Let's say I am given a list of String fragments. Two fragments can be concatenated on their overlapping substrings.
e.g.
"sad" and "den" = "saden"
"fat" and "cat" = cannot be combined.
Sample input:
aw was poq qo
Sample output:
awas poqo
So, what's the best way to write a method which finds the longest string that can be made by combining the strings in a list? If the string is infinite, the output should be "infinite".
public class StringUtil {
    public static String combine(List<String> fragments) {
        StringBuilder combined = new StringBuilder();
        for (int i = 0; i < fragments.size(); i++) {
            char last = (char) (fragments.get(i).length() - 1);
            if (Character.toString(last).equals(fragments.get(i).substring(0))) {
                combined.append(fragments.get(i)).append(fragments.get(i+1));
            }
        }
        return combined.toString();
    }
}
Here's my JUnit test:
public class StringUtilTest {
    @Test
    public void combine() {
        List<String> fragments = new ArrayList<String>();
        fragments.add("aw");
        fragments.add("was");
        fragments.add("poq");
        fragments.add("qo");
        String result = StringUtil.combine(fragments);
        assertEquals("awas poqo", result);
    }
}
This code doesn't seem to be working on my end... It is returning an empty string:
org.junit.ComparisonFailure: expected:<[awas poqo]> but was:<[]>
How can I get this to work? And also how can I get it to check for infinite strings?

I don't understand how fragments.get(i).length() - 1 is supposed to be a char. You clearly cast it on purpose, but I can't for the life of me tell what that purpose is. For any fragment shorter than 66 characters, the cast yields a character code below 65 ('A'), which is a control character, punctuation mark or digit rather than a letter.
I'm thinking you meant to compare the last character in one fragment to the first character in another, but I don't think that's what that code is doing.
My helpful answer is to undo some of the method chaining (function().otherFunction()), store the results in temporary variables, and step through it with a debugger. Break the problem down into small steps that you understand and verify the code is doing what you think it SHOULD be doing at each step. Once it works, then go back to chaining.
Edit: ok I'm bored and I like teaching. This smells like homework so I won't give you any code.
1) method chaining is just convenience. You could (and should) do:
String tempString = fragments.get(i);
int lengthOfString = tempString.length() - 1;
char lastChar = (char) lengthOfString;//WRONG
Etc.
This lets you SEE the intermediate steps, and THINK about what you are doing. You are literally taking the length of a string, say 3, and converting that Integer to a Char. You really want the last character in the string. When you don't use method chaining, you are forced to declare a Type of intermediate variable, which of course forces you to think about what the method ACTUALLY RETURNS. And this is why I told you to forgo method chaining until you are familiar with the functions.
2) I'm guessing at the point you wrote the function, the compiler complained that it couldn't implicitly cast to char from int. You then explicitly cast to a char to get it to shut up and compile. And now you are trying to figure out why it's failing at run time. The lesson is to listen to the compiler while you are learning. If it's complaining, you're messing something up.
3) I knew there was something else. Debugging. If you want to code, you'll need to learn how to do this. Most IDEs will give you an option to set a breakpoint. Learn how to use this feature and "step through" your code line by line. THINK about exactly what step you are doing. Write down the algorithm for a short two-letter pair, and execute it by hand on paper, one step at a time. Then look at what the code DOES, step by step, until you see somewhere it does something that you don't think is right. Finally, fix the section that isn't giving you the desired result.
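To make point 1 concrete without solving the puzzle itself, here is a tiny sketch contrasting the two expressions. The class and method names are invented for the illustration.

```java
class LastCharDemo {
    // What the question's code computes: the number (length - 1) reinterpreted as a char.
    static char lengthCastToChar(String s) {
        return (char) (s.length() - 1);
    }

    // What was probably intended: the character stored at index (length - 1).
    static char lastChar(String s) {
        return s.charAt(s.length() - 1);
    }

    public static void main(String[] args) {
        String fragment = "was";
        System.out.println((int) lengthCastToChar(fragment)); // prints 2 (a control character)
        System.out.println(lastChar(fragment));               // prints s
    }
}
```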

Looking at your unit test, the answer seems to be quite simple.
public static String combine(List<String> fragments) {
    StringBuilder combined = new StringBuilder();
    for (String fragment : fragments) {
        if (combined.length() == 0) {
            combined.append(fragment);
        } else if (combined.charAt(combined.length() - 1) == fragment.charAt(0)) {
            combined.append(fragment.substring(1));
        } else {
            combined.append(" " + fragment);
        }
    }
    return combined.toString();
}
But judging by your inquisition example, you might be looking for something like this:
public static String combine(List<String> fragments) {
    StringBuilder combined = new StringBuilder();
    for (String fragment : fragments) {
        if (combined.length() == 0) {
            combined.append(fragment);
        } else if (combined.charAt(combined.length() - 1) == fragment.charAt(0)) {
            int i = 1;
            while (i < fragment.length() && i < combined.length() && combined.charAt(combined.length() - i - 1) == fragment.charAt(i))
                i++;
            combined.append(fragment.substring(i));
        } else {
            combined.append(" " + fragment);
        }
    }
    return combined.toString();
}
Note, however, that for your test it will generate "aws poq", which seems logical.
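The overlap rule stated in the question ("sad" and "den" = "saden", while "fat" and "cat" cannot be combined) can also be sketched as a standalone helper that joins two fragments on the longest suffix of the first that is also a prefix of the second. This is only the pairwise building block under that reading of the rule; it does not search for the longest overall combination or detect the "infinite" case. FragmentOverlap and merge are names made up for this sketch.

```java
import java.util.Optional;

class FragmentOverlap {
    // Joins a and b on the longest suffix of a that is also a prefix of b.
    // Returns empty when the fragments share no such overlap.
    static Optional<String> merge(String a, String b) {
        int max = Math.min(a.length(), b.length());
        for (int len = max; len >= 1; len--) {
            if (a.regionMatches(a.length() - len, b, 0, len)) {
                return Optional.of(a + b.substring(len));
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(merge("sad", "den").orElse("cannot be combined")); // prints saden
        System.out.println(merge("fat", "cat").orElse("cannot be combined")); // prints cannot be combined
    }
}
```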

transform short word to original word

I used a word-counting algorithm and, on closer inspection, noticed that it reported fewer words than the text actually contains, because it counts a contraction such as "it's" as one word. I tried to find a solution without success, so I wonder whether anything exists that transforms a "short word" like "it's" into its "base words", say "it is".

Well, basically you need to provide a data structure that maps abbreviated terms to their corresponding long versions. However, this will not be as simple as it sounds, for example you won't want to transform "The client's car." to "The client is car."
To manage these cases, you will probably need a heuristic that has a deeper understanding of the language you are processing and the grammar rules it incorporates.
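A minimal sketch of the mapping idea follows. The class name ContractionExpander and the handful of table entries are hypothetical; a real table would be much larger. Because only whole words listed in the table are expanded, a possessive such as "client's" is left alone, but an ambiguous form like "he's" ("he is" or "he has") would still need the language-aware heuristic mentioned above.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

class ContractionExpander {
    // Hypothetical lookup table; extend as needed.
    static final Map<String, String> EXPANSIONS = new HashMap<>();
    static {
        EXPANSIONS.put("it's", "it is");
        EXPANSIONS.put("don't", "do not");
        EXPANSIONS.put("i'm", "i am");
    }

    static String expand(String text) {
        List<String> parts = new ArrayList<>();
        for (String token : text.split(" ")) {
            // separate trailing punctuation such as "!" or "." from the word
            int end = token.length();
            while (end > 0 && !Character.isLetter(token.charAt(end - 1))) {
                end--;
            }
            String word = token.substring(0, end);
            String tail = token.substring(end);
            // lookup is case-insensitive; the expansion itself is lower-case
            String expansion = EXPANSIONS.get(word.toLowerCase(Locale.ROOT));
            parts.add((expansion != null ? expansion : word) + tail);
        }
        return String.join(" ", parts);
    }

    public static void main(String[] args) {
        System.out.println(expand("it's such a lovely day! it's really amazing!"));
        System.out.println(expand("The client's car."));
    }
}
```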

I just built this from scratch for the challenge. It seems to be working on my end. Let me know how it works for you.
public static void main(String[] args) {
    String s = "it's such a lovely day! it's really amazing!";
    System.out.println(convertText(s));
    // output: it is such a lovely day! it is really amazing!
}
public static String convertText(String text) {
    // start from the input so text without contractions is returned unchanged
    String replaced = text;
    // Java's split takes a String (a regex), not a char as in C#
    String[] words = text.split(" ");
    for (String word : words) {
        if (word.contains("'s")) {
            // keep the part before the apostrophe and append " is"
            String[] splitWord = word.split("'");
            String noContraction = splitWord[0] + " is";
            // replace in the running result so earlier replacements are kept
            replaced = replaced.replace(word, noContraction);
        }
    }
    return replaced;
}
I did this in C# and tried to convert it into Java. If you see any syntax errors, please point them out.

Java: simpler way of cutting off end of string

Every time I encounter this I ask myself the same question: isn't there a simpler and less annoying way of cutting X characters off the end of a string?
Let's say I have "Helly there bla bla" and, for whatever reason, I need to cut off the last 2 characters, resulting in "Helly there bla b".
I now would do the following:
String result = text.substring(0, text.length() - 2);
I rather want to do something like:
String result = text.cutOffEnd(2);
I know there are many String libraries out there, but I don't know many of them and have never seen something like that, so I hoped one of you might know better :)
EDIT:
Q: Why don't you just build your own util method / class?
A: I don't want to use my own util method. I don't write a util method for "null or empty" or other trivial things. My view is that there MUST BE something available already, as I would say that tons of people need this kind of function pretty often in their lifetime.
Plus: I work in many different projects and just want to rely on a simple library call like "Strings.nullToEmpty(str)" etc. I just don't build something like that on my own, although it's trivial.
Q: why is text.substring(0, text.length() - 2); not good enough?
A: It's very bulky compared with my desired function. Also, consider this: if the string itself is the result of a method call, it becomes even more unwieldy:
String result = otherClass.getComplicatedCalculatedText(par1, par2).substring(0,
otherClass.getComplicatedCalculatedText(par1, par2).length() - 2);
Obviously I'd need to use a local variable, which is so unnecessary at this point... As it could simply be:
String result = otherClass.getComplicatedCalculatedText(par1, par2).cutOffEnd(2);

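Use a string library. I suggest Apache Commons Lang.
For your example this is enough. Note that removeEnd strips a given suffix rather than a number of characters; to cut by count, StringUtils.substring(text, 0, -2) accepts a negative end index.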
import org.apache.commons.lang.StringUtils;
String result = StringUtils.removeEnd( "Helly there bla bla", "la");

Go through the following code
public class OddNumberLoop {
    public static void main(String[] args) {
        String st1 = "Helly there bla bla";
        String st2 = st1.substring(0, st1.length() - 2);
        System.out.println(st2);
    }
}
Good Luck !!!

There is no built-in utility for this in the standard library, but how hard is it to write a util method for this yourself?
public static String cutOffEnd(String s, int n) {
    return s == null || s.length() <= n ? "" : s.substring(0, s.length() - n);
}
A complete solution with checks included:
public static String cutOffEnd(String s, int n) {
    if (s == null)
        throw new NullPointerException("s cannot be null!");
    if (n > s.length())
        throw new IllegalArgumentException("n cannot be greater"
                + " than the length of string!");
    return s.substring(0, s.length() - n);
}

By the way: feature requests like this are often made for the Java language itself. Usually such features are not considered compelling enough for the language and will not be added.
Why do you want to cut the string off? If it serves a higher-level purpose, you are really matching a pattern or structure, and in that case you should use your own util method for parsing anyway.
A realistic Example:
String url = "http://www.foo.bar/#abc";
String site = url.substring(0, url.indexOf("#"));
// this shall be extracted into a utils-method
// anyway like `MyURLParser.cutOfAnchor()`.
Asking for a specific library is off-topic here.
