How to make unique string out of substrings? - java

When I concatenate two strings (e.g. "q5q3q2q1" and "q5q4q3q2q1"), I get a string with duplicate substrings: q5, q3, q2 and q1 each appear twice.
The resulting string is "q5q3q2q1q5q4q3q2q1", and I need each substring (q[number]) to appear only once, i.e. "q5q4q3q2q1".
A substring doesn't have to start with 'q', but I could add the restriction that it doesn't start with a digit. The number can also have multiple digits, like q11.
What could I use to get this string? A solution in Java would be ideal; otherwise, just the algorithm would be useful.

You can split the concatenated string into groups and then use a set if the order of the groups doesn't matter, or a dictionary if it does.
a = "q5q3q2q1"
b = "q5q4q3q2q1"
# Concatenate strings
c = a + b
print(c)
# Create the groups
d = ["q" + item for item in c.split("q") if item != ""]
print(d)
# If order does not matter
print("".join(set(d)))
# If order does matter (dicts preserve insertion order in Python 3.7+)
print("".join({key: 1 for key in d}.keys()))

Another solution uses a regular expression. Concatenate the strings and find all matches of the pattern ([^\d]+\d+) (regex101). Then add the matches to a set to remove duplicates and join them:
import re
s1 = "q5q3q2q1"
s2 = "q5q4q3q2q1"
out = "".join(set(re.findall(r"([^\d]+\d+)", s1 + s2)))
print(out)
Prints (the order can vary between runs, because sets are unordered):
q5q2q1q4q3

A quick way to do this in Java, as the question asked:
String a = "q1q2q3";
String b = "q1q2q3q4q5q11";
// split("q") yields a leading empty string; String.join below relies on it to restore the first "q".
List<String> l1 = Arrays.asList(a.split("q"));
List<String> l2 = Arrays.asList(b.split("q"));
List<String> l3 = new ArrayList<>(l1);
List<String> l4 = new ArrayList<>(l2);
l4.removeAll(l3); // keep only the tokens of b that a lacks
l3.addAll(l4);
System.out.println(String.join("q", l3));
Output:
q1q2q3q4q5q11

This is a variation of @DanConstantinescu's solution in JS:
Start with the concatenated string.
Split string at the beginning of a substring composed of text followed by a number. This is implemented as a regex lookahead, so split returns the string portions as an array.
Build a set from this array. The constructor performs deduplication.
Turn the set into an array again
Concat the elements with the empty string.
While this code is not Java it should be straightforward to port the idea to other (imperative or object-oriented) languages.
let s_concatenated = "q5q3q2q1" + "q5q4q3q2q1" + "q11a13b4q11"
, s_dedup
;
s_dedup =
Array.from(
new Set(s_concatenated
.split(/(?=[^\d]+\d+)/) // Split into an array
) // Build a set, deduplicating
) // Turn the set into an array again
.join('') // Concat the elements with the empty string.
;
console.log(`'${s_concatenated}' -> '${s_dedup}'.`);
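Since the question asked for Java, here is a minimal sketch of the same idea in Java. It assumes every token is a run of non-digits followed by a run of digits (anything that does not fit that shape is silently dropped), and the class and method names are made up for illustration. A LinkedHashSet keeps the tokens in first-seen order, so the result is "q5q3q2q1q4" rather than the numerically sorted "q5q4q3q2q1" shown in the question.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class UniqueTokens {
    // One token = a run of non-digits followed by a run of digits, e.g. "q11".
    private static final Pattern TOKEN = Pattern.compile("\\D+\\d+");

    /** Returns the tokens of s with duplicates dropped, in first-seen order. */
    static String dedupe(String s) {
        Set<String> seen = new LinkedHashSet<>(); // insertion-ordered set
        Matcher m = TOKEN.matcher(s);
        while (m.find()) {
            seen.add(m.group()); // adding a duplicate is a no-op
        }
        return String.join("", seen);
    }

    public static void main(String[] args) {
        System.out.println(dedupe("q5q3q2q1" + "q5q4q3q2q1")); // q5q3q2q1q4
    }
}
```

If the order does not matter, a plain HashSet would do as well.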

Related

Only display part of string after a certain word in Java [duplicate]

I am trying to only display data after a certain static word (in)
Example:
String jobName = job.getDescription();
returns the following:
XYZ/LMNOP in ABCEFG
I only want the data after the "in" in this scenario. However the XYZ/LMNOP is different in almost every case so I cannot simply call out that section of the string.
You can use split() in the String class.
String jobName = job.getDescription();
String[] parts = jobName.split(" in "); // { "XYZ/LMNOP", "ABCEFG" }
String before = parts[0]; // XYZ/LMNOP
String after = parts[1]; // ABCEFG
Find the index of "in" in the string, then take the substring from that index + 3 to the end.
int k = p.indexOf("in");
System.out.println(p.substring(k+3));
It is index + 3 because "i", "n" and " " are 3 characters.
First you need to understand your string's possible data values. If it is always <some_text> in <some_text>, then there are multiple ways, as other users have mentioned.
Here is another way, which is a bit simpler:
String[] strArray = job.getDescription().split(" "); // splits the string on spaces
System.out.println(strArray[2]);
try using this
String search = " in ";
String d = job.getDescription();
d = d.substring(d.indexOf(search) + search.length(), d.length());
Outputs, given the inputs:
[find something in a string] -> [a string]
[finding something in a string] -> [a string] // note findINg, that does not match
The search key can be changed to simply "in" if desired, or left as " in " to match the question as posted and avoid accidentally matching an "in" inside a word.
If you so choose, you can also call .toLowerCase() on the result of getDescription() if you want the match on the word "in" to be case insensitive as well.
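A small self-contained sketch of the indexOf approach, with a guard for the case where the keyword is missing (indexOf returns -1 then, and the unguarded snippets above would return the wrong slice). The helper name and the choice to return an empty string when nothing is found are assumptions for illustration.

```java
final class AfterKeyword {
    /** Returns the text that follows the first occurrence of keyword, or "" if it is absent. */
    static String after(String text, String keyword) {
        int idx = text.indexOf(keyword);
        if (idx < 0) {
            return ""; // keyword not present
        }
        return text.substring(idx + keyword.length());
    }

    public static void main(String[] args) {
        System.out.println(after("XYZ/LMNOP in ABCEFG", " in ")); // ABCEFG
    }
}
```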

How to get the desired character from the variable sized strings?

I need to extract the desired string which attached to the word.
For example
pot-1_Sam
pot-22_Daniel
pot_444_Jack
pot_5434_Bill
I need to get the names from the above strings, i.e. Sam, Daniel, Jack and Bill.
The problem is that if I use substring, the position keeps changing with the length of the number. How can I achieve this using a regex?
Update:
Some strings have two underscores, like
pot_US-1_Sam
pot_RUS_444_Jack
Assuming your strings always follow the formats above, you don't need a regex at all; you can use the lastIndexOf and substring methods.
String result = yourString.substring(yourString.lastIndexOf("_")+1, yourString.length());
Your answer is:
String[] s = new String[4];
s[0] = "pot-1_Sam";
s[1] = "pot-22_Daniel";
s[2] = "pot_444_Jack";
s[3] = "pot_5434_Bill";
ArrayList<String> result = new ArrayList<String>();
for (String value : s) {
String[] splitedArray = value.split("_");
result.add(splitedArray[splitedArray.length-1]);
}
for(String resultingValue : result){
System.out.println(resultingValue);
}
You have 2 options:
Use the lastIndexOf method to get the index of the last _ (this assumes that there is no _ in the names you are after). Once you have the index of the last _ character, you can use the substring method to get the bit you are after.
Use a regular expression. The strings you have shown essentially follow a pattern in which you have digits, followed by an underscore, which is in turn followed by the word you are after. You can use a regular expression such as \\d+_ (which matches one or more digits followed by an underscore) in combination with the split method. The string you are after will be in the last array position.
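Both options can be sketched roughly as follows. The class and method names are illustrative only, and both versions assume the name itself contains neither an underscore nor a digits-plus-underscore sequence.

```java
final class LastSegment {
    /** Option 1: everything after the last underscore (the whole string if there is none). */
    static String byLastIndexOf(String s) {
        return s.substring(s.lastIndexOf('_') + 1);
    }

    /** Option 2: split on "digits followed by an underscore" and take the last piece. */
    static String byDigitSplit(String s) {
        String[] parts = s.split("\\d+_");
        return parts[parts.length - 1];
    }

    public static void main(String[] args) {
        String[] samples = { "pot-1_Sam", "pot-22_Daniel", "pot_444_Jack", "pot_RUS_444_Jack" };
        for (String sample : samples) {
            System.out.println(byLastIndexOf(sample) + " / " + byDigitSplit(sample));
        }
    }
}
```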
Use a string tokenizer based on '_' and get the last element. No need for REGEX.
Or use the split method on the string object like so :
String[] strArray = strValue.split("_");
String lastToken = strArray[strArray.length -1];
String[] s = {
"pot-1_Sam",
"pot-22_Daniel",
"pot_444_Jack",
"pot_5434_Bill"
};
for (String e : s)
System.out.println(e.replaceAll(".*_", ""));

Select words with at least two different letters

I am using this code
Matcher m2 = Pattern.compile("\\b[ABE]+\\b").matcher(key);
to only get keys from a HashMap that contain the letters A, B or E
I am not though interested in words such as AAAAAA or EEEEE I need words with at least two different letters (in the best case, three).
Is there a way to modify the regex ? Can anyone offer insight on this?
Replace everything except your letters, make a Set of the result, test the Set for size.
public static void main (String args[])
{
String alphabet = "ABC";
String totest = "BBA";
if (args.length == 2)
{
alphabet = args[0];
totest = args[1];
}
String cleared = totest.replaceAll ("[^" + alphabet + "]", "");
char[] ca = cleared.toCharArray ();
Set <Character> unique = new HashSet <Character> ();
for (char c: ca)
unique.add (c);
System.out.println ("Result: " + (unique.size () > 1));
}
Example implementation
You could use a more complicated regex to do it e.g.
(.*A.*[BE].*|.*[BE].*A.*)|(.*B.*[AE].*|.*[AE].*B.*)|(.*E.*[BA].*|.*[BA].*E.*)
But it will probably be easier to understand if you do some kind of replacement instead: for instance, loop over the desired letters, replace one letter at a time with '', and check the size of the new string each time. If the size changes twice, you've got two of your desired characters. EDIT: actually, if you know the set of desired characters before you do the check, NullUserException had it right in his comment: indexOf or contains will be more efficient and probably more readable than this.
Note that if your set of desired characters is unknown at compile time (or at least before the string check at runtime), the second option is preferable. If you're looking for any characters, just replace all occurrences of the first character in a while (str.length > 0) loop; the number of times it goes through the loop is the number of different characters you've got.
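The replace-in-a-loop idea could look roughly like this in Java. This is only a sketch with made-up names; it counts distinct characters of any kind, so filter the word against [ABE] first if that matters.

```java
final class DistinctLetters {
    /** Counts distinct characters by repeatedly deleting every copy of the first one. */
    static int countDistinct(String s) {
        int count = 0;
        while (!s.isEmpty()) {
            // Remove all occurrences of the current first character.
            s = s.replace(String.valueOf(s.charAt(0)), "");
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countDistinct("AAAAAA") >= 2); // false
        System.out.println(countDistinct("ABBA") >= 2);   // true
    }
}
```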
Explicitly bound the number of repetitions of the desired letters.
It would look like this:
\b[ABE]{1,3}\b
It matches AAE, EEE, AEE but not AAAA, AAEE
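For comparison, here is a sketch of a regex that checks for two different letters rather than bounding the length. It is my own variation, not something taken from the answers: it captures the first letter and uses a negative lookahead with a backreference to reject words that consist only of repeats of that letter. The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

final class MixedLetters {
    // ([ABE])       captures the first letter of the word
    // (?!\1*\b)     fails if only repeats of that letter follow up to the word end
    // [ABE]+\b      the rest of the word must still consist of A, B or E
    private static final Pattern MIXED =
            Pattern.compile("\\b([ABE])(?!\\1*\\b)[ABE]+\\b");

    /** True if key contains a word made of A/B/E with at least two different letters. */
    static boolean hasMixedWord(String key) {
        return MIXED.matcher(key).find();
    }

    public static void main(String[] args) {
        System.out.println(hasMixedWord("AAAAAA")); // false
        System.out.println(hasMixedWord("AAE"));    // true
    }
}
```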

Efficient way to search for a set of strings in a string in Java

I have a set of elements of size about 100-200. Let a sample element be X.
Each of the elements is a set of strings (number of strings in such a set is between 1 and 4). X = {s1, s2, s3}
For a given input string (about 100 characters), say P, I want to test whether any of the X is present in the string.
X is present in P iff for all s belong to X, s is a substring of P.
The set of elements is available for pre-processing.
I want this to be as fast as possible within Java. Possible approaches which do not fit my requirements:
Checking whether all the strings s are substring of P seems like a costly operation
Because s can be any substring of P (not necessarily a word), I cannot use a hash of words
I cannot directly use regex as s1, s2, s3 can be present in any order and all of the strings need to be present as substring
Right now my approach is to construct a huge regex out of each X with all possible permutations of the order of strings. Because number of elements in X <= 4, this is still feasible. It would be great if somebody can point me to a better (faster/more elegant) approach for the same.
Please note that the set of elements is available for pre-processing and I want the solution in java.
You can use regex directly:
Pattern regex = Pattern.compile(
"^ # Anchor search to start of string\n" +
"(?=.*s1) # Check if string contains s1\n" +
"(?=.*s2) # Check if string contains s2\n" +
"(?=.*s3) # Check if string contains s3",
Pattern.DOTALL | Pattern.COMMENTS);
Matcher regexMatcher = regex.matcher(subjectString);
boolean foundMatch = regexMatcher.find();
foundMatch is true if all three substrings are present in the string.
Note that you might need to escape your "needle strings" if they could contain regex metacharacters.
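A sketch of how such a pattern could be generated from a list of needles during pre-processing, with each needle escaped through Pattern.quote. The class and method names are made up, and List.of assumes Java 9 or newer.

```java
import java.util.List;
import java.util.regex.Pattern;

final class AllSubstringsRegex {
    /** Builds ^(?=.*n1)(?=.*n2)... with every needle quoted as a literal. */
    static Pattern compileAll(List<String> needles) {
        StringBuilder sb = new StringBuilder("^");
        for (String needle : needles) {
            sb.append("(?=.*").append(Pattern.quote(needle)).append(")");
        }
        return Pattern.compile(sb.toString(), Pattern.DOTALL);
    }

    public static void main(String[] args) {
        Pattern p = compileAll(List.of("a.b", "xyz"));
        System.out.println(p.matcher("xyz--a.b").find()); // true
        System.out.println(p.matcher("xyz--aXb").find()); // false, the dot is literal
    }
}
```

Compile one pattern per X up front and reuse it for every input string P.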
It sounds like you're prematurely optimising your code before you've actually discovered a particular approach is actually too slow.
The nice property about your set of strings is that the string must contain all elements of X as a substring -- meaning we can fail fast if we find one element of X that is not contained within P. This might turn out a better time saving approach than others, especially if the elements of X are typically longer than a few characters and contain no or only a few repeating characters. For instance, a regex engine need only check 20 characters in 100 length string when checking for the presence of a 5 length string with non-repeating characters (eg. coast). And since X has 100-200 elements you really, really want to fail fast if you can.
My suggestion would be to sort the strings in order of length and check for each string in turn, stopping early if one string is not found.
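A minimal sketch of that fail-fast check using plain String.contains. The names are illustrative, and checking the longest string first is my assumption (on the guess that longer strings are rarer and so more likely to fail early); the answer only says to sort by length.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class FailFastMatcher {
    private final List<List<String>> elements = new ArrayList<>();

    /** Pre-processing: copy each element X and sort its strings longest first (an assumption). */
    FailFastMatcher(List<List<String>> rawElements) {
        for (List<String> x : rawElements) {
            List<String> sorted = new ArrayList<>(x);
            sorted.sort(Comparator.<String>comparingInt(String::length).reversed());
            elements.add(sorted);
        }
    }

    /** True if at least one element X has all of its strings as substrings of p. */
    boolean anyPresent(String p) {
        for (List<String> x : elements) {
            if (containsAll(p, x)) {
                return true;
            }
        }
        return false;
    }

    private static boolean containsAll(String p, List<String> x) {
        for (String s : x) {
            if (!p.contains(s)) {
                return false; // fail fast: skip the remaining strings of this X
            }
        }
        return true;
    }

    public static void main(String[] args) {
        FailFastMatcher m = new FailFastMatcher(
                List.of(List.of("foo", "barbaz"), List.of("qux")));
        System.out.println(m.anyPresent("xx barbaz foo")); // true
        System.out.println(m.anyPresent("foo only"));      // false
    }
}
```

It is worth measuring this against the regex version before reaching for anything more elaborate.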
Looks like a perfect case for the Rabin–Karp algorithm:
Rabin–Karp is inferior for single pattern searching to Knuth–Morris–Pratt algorithm, Boyer–Moore string search algorithm and other faster single pattern string searching algorithms because of its slow worst case behavior. However, Rabin–Karp is an algorithm of choice for multiple pattern search.
When the preprocessing time doesn't matter, you could create a hash table which maps every one-letter, two-letter, three-letter etc. combination which occurs in at least one string to a list of strings in which it occurs.
The algorithm to index a string would look like this (untested):
HashMap<String, Set<String>> indexes = new HashMap<String, Set<String>>();
for (int pos = 0; pos < string.length(); pos++) {
for (int end = pos + 1; end <= string.length(); end++) {
String substring = string.substring(pos, end); // every non-empty substring starting at pos
Set<String> stringsForThisKey = indexes.get(substring);
if (stringsForThisKey == null) {
stringsForThisKey = new HashSet<String>();
indexes.put(substring, stringsForThisKey);
}
stringsForThisKey.add(string);
}
}
Indexing each string that way takes time quadratic in the length of the string, but it only needs to be done once for each string.
The result is constant-time access to the list of strings in which a specific substring occurs.
You are probably looking for the Aho-Corasick algorithm, which constructs an automaton (trie-like) from the set of strings (the dictionary) and then matches the input string against the dictionary using this automaton.
You might want to consider using a "Suffix Tree" as well. I haven't used this code, but there is one described here
I have used proprietary implementations (that I no longer even have access to) and they are very fast.
One way is to generate every possible substring and add this to a set. This is pretty inefficient.
Instead you can create all the strings from any point to the end into a NavigableSet and search for the closest match. If the closest match starts with the string you are looking for, you have a substring match.
static class SubstringMatcher {
final NavigableSet<String> set = new TreeSet<String>();
SubstringMatcher(Set<String> strings) {
for (String string : strings) {
for (int i = 0; i < string.length(); i++)
set.add(string.substring(i));
}
// remove duplicates.
String last = "";
for (String string : set.toArray(new String[set.size()])) {
if (string.startsWith(last))
set.remove(last);
last = string;
}
}
public boolean findIn(String s) {
String s1 = set.ceiling(s);
return s1 != null && s1.startsWith(s);
}
}
public static void main(String... args) {
Set<String> strings = new HashSet<String>();
strings.add("hello");
strings.add("there");
strings.add("old");
strings.add("world");
SubstringMatcher sm = new SubstringMatcher(strings);
System.out.println(sm.set);
for (String s : "ell,he,ow,lol".split(","))
System.out.println(s + ": " + sm.findIn(s));
}
prints
[d, ello, ere, hello, here, ld, llo, lo, old, orld, re, rld, there, world]
ell: true
he: true
ow: false
lol: false

Characters Being Added To List<String>

I have a List called dbData and two StringBuilders called infoSB and historySB. I've debugged my project and the two StringBuilders have all the data they are supposed to have, but for some reason it also adds some random characters to the data. All I've done to add the data is the code below:
dbData.add(infoSB.toString());
dbData.add(historySB.toString());
The characters being added are [ ] and ,
Has anyone run into this before and knows how to keep it from happening?
UPDATE: Here is how I'm getting the data and assigning it to the StringBuilder.
JSONObject json_data = jArray.getJSONObject(i);
double altitudeData = json_data.getDouble("altitude");
double altitudeInFeet = altitudeData * 3.281;
historySB.append("Altitude: " + df.format(altitudeInFeet) + "ft\n");
Are these characters at the beginning and end of the string, and is there a comma somewhere in the middle? This is what the toString method of List is meant to do.
If you have a list of three elements
"car"
"van"
"bike"
Then the list will create the following string [car, van, bike]. The [] denote the beginning and end of the list, and commas denote the boundary between elements.
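A tiny sketch to reproduce this, with illustrative names: printing the list itself gives the bracketed form, while String.join concatenates the elements without any decoration.

```java
import java.util.Arrays;
import java.util.List;

final class ListToStringDemo {
    /** What you get when the list itself is converted to a string. */
    static String bracketed(List<String> items) {
        return items.toString();
    }

    /** The elements joined with no separator and no brackets. */
    static String concatenated(List<String> items) {
        return String.join("", items);
    }

    public static void main(String[] args) {
        List<String> items = Arrays.asList("car", "van", "bike");
        System.out.println(bracketed(items));    // [car, van, bike]
        System.out.println(concatenated(items)); // carvanbike
    }
}
```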
If you just want to concatenate the strings, then use either the + operator or a StringBuilder / StringBuffer.
eg.
String data = infoSB.toString() + historySB.toString();
- First, check your data source: does it contain some odd characters that you are then receiving?
- I hope what those StringBuilders hold is not JSON. I suspected it because you are getting [ ] and , in your StringBuilders; if it is JSON, you need to parse it first and then retrieve the specific data you want.
- You can remove leading and trailing whitespace using the trim() method.
Eg:
dbData.add(infoSB.toString().trim());
dbData.add(historySB.toString().trim());
///////////////////Edited Part///////////////////
DecimalFormat df = new DecimalFormat("##.###");
String altitudeData = json_data.getString("altitude");
double altitudeInFeet = Double.parseDouble(altitudeData) * 3.281;
historySB.append("Altitude: " + df.format(altitudeInFeet) + "ft\n");
