Java splitting string at index without cutting the word [duplicate] - java

This question already has answers here:
Large string split into lines with maximum length in java
(8 answers)
Closed 4 years ago.
I was just wondering it here is an API or some easy and quick way to split String at given index into String[] array but if there is a word at that index then put it to other String.
So lets say I have a string: "I often used to look out of the window, but I rarely do that anymore"
The length of that string is 68 and I have to cut it at 36, which is in this given sentence n, but now it should split the word at the so that the array would be ["I often used to look out of the", "window, but I rarely do that anymore"].
And if the new sentence is longer than 36 then it should be split aswell, so if I had a bit longer sentence: "I often used to look out of the window, but I rarely do that anymore, even though I liked it"
Would be ["I often used to look out of the", "window, but I rarely do that anymore", ",even though I liked it"]

Here's an old-fashioned, non-stream, non-regex solution:
public static List<String> chunk(String s, int limit)
{
List<String> parts = new ArrayList<String>();
while(s.length() > limit)
{
int splitAt = limit-1;
for(;splitAt>0 && !Character.isWhitespace(s.charAt(splitAt)); splitAt--);
if(splitAt == 0)
return parts; // can't be split
parts.add(s.substring(0, splitAt));
s = s.substring(splitAt+1);
}
parts.add(s);
return parts;
}
This doesn't trim additional spaces either side of the split point. Also, if a string cannot be split, because it doesn't contain any whitespace in the first limit characters, then it gives up and returns the partial result.
Test:
public static void main(String[] args)
{
String[] tests = {
"This is a short string",
"This sentence has a space at chr 36 so is a good test",
"I often used to look out of the window, but I rarely do that anymore, even though I liked it",
"I live in Llanfairpwllgwyngyllgogerychwyrndrobwllllantysiliogogogoch",
};
int limit = 36;
for(String s : tests)
{
List<String> chunks = chunk(s, limit);
for(String st : chunks)
System.out.println("|" + st + "|");
System.out.println();
}
}
Output:
|This is a short string|
|This sentence has a space at chr 36|
|so is a good test|
|I often used to look out of the|
|window, but I rarely do that|
|anymore, even though I liked it|
|I live in|

This matches between 1 and 30 characters repetitively (greedy) and requires a whitespace behind each match.
public static List<String> chunk(String s, int size) {
List<String> chunks = new ArrayList<>(s.length()/size+1);
Pattern pattern = Pattern.compile(".{1," + size + "}(=?\\s|$)");
Matcher matcher = pattern.matcher(s);
while (matcher.find()) {
chunks.add(matcher.group());
}
return chunks;
}
Note that it doesn't work if there's a long string (>size) whitout whitespace.

Related

Java extracting substring from sentences

There are combination of words like is, is not, does not contain. We have to match these words in a sentence and have to split it.
Intput : if name is tom and age is not 45 or name does not contain tom then let me know.
Expected output:
If name is
tom and age is not
45 or name does not contain
tom then let me know
I tried below code to split and extract but the occurrence of "is" is in "is not" as well which my code is not able to find out:
public static void loadOperators(){
operators.add("is");
operators.add("is not");
operators.add("does not contain");
}
public static void main(String[] args) {
loadOperators();
for(String s : operators){
System.out.println(str.split(s).length - 1);
}
}
Since there could be multiple occurence of a word split wouldn't solve your use case, as in is and is not being different operators for you. You would ideally :
Iterate :
1. Find the index of the 'operator'.
2. Search for the next space _ or word.
3. Then update your string as substring from its index to length-1.
I am not entirely sure about what you try to achieve, but let's give it a shot.
For your case, a simple "workaround" might work just fine:
Sort the operators by their length, descending. This way the "largest match" will get found first. You can define "largest" as either literally the longest string, or preferably the number of words (number of spaces contained), so is a has precedence over contains
You'll need to make sure that no matches overlap though, which can be done by comparing all matches' start and end indices and discarding overlaps by some criteria, like first match wins
This code does what you seem to be wanting to do (or what I guessed you are wanting to do):
public static void main(String[] args) {
List<String> operators = new ArrayList<>();
operators.add("is");
operators.add("is not");
operators.add("does not contain");
String input = "if name is tom and age is not 45 or name does not contain tom then let me know.";
List<String> output = new ArrayList<>();
int lastFoundOperatorsEndIndex = 0; // First start at the beginning of input
for (String operator : operators){
int indexOfOperator = input.indexOf(operator); // Find current operator's position
if (indexOfOperator > -1) { // If operator was found
int thisOperatorsEndIndex = indexOfOperator + operator.length(); // Get length of operator and add it to the index to include operator
output.add(input.substring(lastFoundOperatorsEndIndex, thisOperatorsEndIndex).trim()); // Add operator to output (and remove trailing space)
lastFoundOperatorsEndIndex = thisOperatorsEndIndex; // Update startindex for next operator
}
}
output.add(input.substring(lastFoundOperatorsEndIndex, input.length()).trim()); // Add rest of input as last entry to output
for (String part : output) { // Output to console
System.out.println(part);
}
}
But it is highly dependant on the order of the sentence and the operators. If we're talking about user-input, the task will be much more complicated.
A better method using regular expressions (regExp) would be:
public static void main(String... args) {
// Define inputs
String input1 = "if name is tom and age is not 45 or name does not contain tom then let me know.";
String input2 = "the name is tom and he is 22 years old but the name does not contain jack, but merry is 24 year old.";
// Output split strings
for (String part : split(input1)) {
System.out.println(part.trim());
}
System.out.println();
for (String part : split(input2)) {
System.out.println(part.trim());
}
}
private static String[] split(String input) {
// Define list of operators - 'is not' has to precede 'is'!!
String[] operators = { "\\sis not\\s", "\\sis\\s", "\\sdoes not contain\\s", "\\sdoes contain\\s" };
// Concatenate operators to regExp-String for search
StringBuilder searchString = new StringBuilder();
for (String operator : operators) {
if (searchString.length() > 0) {
searchString.append("|");
}
searchString.append(operator);
}
// Replace all operators by operator+\n and split resulting string at \n-character
return input.replaceAll("(" + searchString.toString() + ")", "$1\n").split("\n");
}
Notice the order of the operators! 'is' has to come after 'is not' or 'is not' will always be split.
You can prevent this by using a negative lookahead for the operator 'is'.
So "\\sis\\s" would become "\\sis(?! not)\\s" (reading like: "is", not followed by a " not").
A minimalist Version (with JDK 1.6+) could look like this:
private static String[] split(String input) {
String[] operators = { "\\sis(?! not)\\s", "\\sis not\\s", "\\sdoes not contain\\s", "\\sdoes contain\\s" };
return input.replaceAll("(" + String.join("|", operators) + ")", "$1\n").split("\n");
}

How can I eliminate duplicate words from String in Java?

I have an ArrayList of Strings and it contains records such as:
this is a first sentence
hello my name is Chris
what's up man what's up man
today is tuesday
I need to clear this list, so that the output does not contain repeated content. In the case above, the output should be:
this is a first sentence
hello my name is Chris
what's up man
today is tuesday
as you can see, the 3rd String has been modified and now contains only one statement what's up man instead of two of them.
In my list there is a situation that sometimes the String is correct, and sometimes it is doubled as shown above.
I want to get rid of it, so I thought about iterating through this list:
for (String s: myList) {
but I cannot find a way of eliminating duplicates, especially since the length of each string is not determined, and by that I mean there might be record:
this is a very long sentence this is a very long sentence
or sometimes short ones:
single word singe word
is there some native java function for that maybe?
Assuming the String is repeated just twice, and with an space in between as in your examples, the following code would remove repetitions:
for (int i=0; i<myList.size(); i++) {
String s = myList.get(i);
String fs = s.substring(0, s.length()/2);
String ls = s.substring(s.length()/2+1, s.length());
if (fs.equals(ls)) {
myList.set(i, fs);
}
}
The code just split each entry of the list into two substrings (dividing by the half point). If both are equal, substitute the original element with only one half, thus removing the repetition.
I was testing the code and did not see #Brendan Robert answer. This code follows the same logic as his answer.
I would suggest using regular expressions. I was able to remove duplicates using this pattern: \b([\w\s']+) \1\b
public class Main {
static String [] phrases = {
"this is a first sentence",
"hello my name is Chris",
"what's up man what's up man",
"today is tuesday",
"this is a very long sentence this is a very long sentence",
"single word single word",
"hey hey"
};
public static void main(String[] args) throws Exception {
String duplicatePattern = "\\b([\\w\\s']+) \\1\\b";
Pattern p = Pattern.compile(duplicatePattern);
for (String phrase : phrases) {
Matcher m = p.matcher(phrase);
if (m.matches()) {
System.out.println(m.group(1));
} else {
System.out.println(phrase);
}
}
}
}
Results:
this is a first sentence
hello my name is Chris
what's up man
today is tuesday
this is a very long sentence
single word
hey
Assumptions:
Uppercase words are equal to lowercase counterparts.
String fullString = "lol lol";
String[] words = fullString.split("\\W+");
StringBuilder stringBuilder = new StringBuilder();
Set<String> wordsHashSet = new HashSet<>();
for (String word : words) {
// Check for duplicates
if (wordsHashSet.contains(word.toLowerCase())) continue;
wordsHashSet.add(word.toLowerCase());
stringBuilder.append(word).append(" ");
}
String nonDuplicateString = stringBuilder.toString().trim();
simple logic : split every word by token space i.e " " and now add it in LinkedHashSet , Retrieve back, Replace "[","]",","
String s = "I want to walk my dog I want to walk my dog";
Set<String> temp = new LinkedHashSet<>();
String[] arr = s.split(" ");
for ( String ss : arr)
temp.add(ss);
String newl = temp.toString()
.replace("[","")
.replace("]","")
.replace(",","");
System.out.println(newl);
o/p : I want to walk my dog
It depends on the situation that you have but assuming that the string can be repeated at most twice and not three or more times you could find the length of the entire string, find the halfway point and compare each index after the halfway point with the matching beginning index. If the string can be repeated more than once you will need a more complicated algorithm that would first determine how many times the string is repeated and then finds the starting index of each repeat and truncates all index's from the beginning of the first repeat onward. If you can provide some more context into what possible scenarios you expect to handle we can start putting together some ideas.
//Doing it in Java 8
String str1 = "I am am am a good Good coder";
String[] arrStr = str1.split(" ");
String[] element = new String[1];
return Arrays.stream(arrStr).filter(str1 -> {
if (!str1.equalsIgnoreCase(element[0])) {
element[0] = str1;
return true;
}return false;
}).collect(Collectors.joining(" "));

Get certain data from text - Java

I am creating a bukkit plugin for minecraft and i need to know a few things before i move on.
I want to check if a text has this layout: "B:10 S:5" for example.
It stands for Buy:amount and Sell:amount
How can i check the easiest way if it follows the syntax?
It can be any number that is 0 or over.
Another problem is to get this data out of the text. how can i check what text is after B: and S: and return it as an integer
I have not tried out this yet because i have no idea where to start.
Thanks for help!
In the simple problem you gave, you can get away with a simple answer. Otherwise, see the regex answer below.
boolean test(String str){
try{
//String str = "B:10 S:5";
String[] arr = str.split(" ");//split to left and right of space = [B:10,S:5]
String[] bArr = arr[0].split(":");//split ...first colon = [B,10]
String[] sArr = arr[1].split(":");//... second colon = [S,5]
//need to use try/catch here in case the string is not an int value.
String labelB = bArr[0];
Integer b = Integer.parseInt(bArr[1]);
String labelS = sArr[0];
Integer s = Integer.parseInt(sArr[1]);
}catch( Exception e){return false;}
return true;
}
See my answer here for a related task. More related details below.
How can I parse a string for a set?
Essentially, you need to use regex and iterate through the groups. Just in case the grammar is not always B and S, I made this more abstract.Also, if there are extra white spaces in the middle for what ever reason, I made that more broad too. The pattern says there are 4 groups (indicated by parentheses): label1, number1, label2, and number2. + means 1 or more. [] means a set of characters. a-z is a range of characters (don't put anything between A-Z and a-z). There are also other ways of showing alpha and numeric patterns, but these are easier to read.
//this is expensive
Pattern p=Pattern.compile("([A-Za-z]+):([0-9]+)[ ]+([A-Za-z]+):([0-9]+)");
boolean test(String txt){
Matcher m=p.matcher(txt);
if(!m.matches())return false;
int groups=m.groupCount();//should only equal 5 (default whole match+4 groups) here, but you can test this
System.out.println("Matched: " + m.group(0));
//Label1 = m.group(1);
//val1 = m.group(2);
//Label2 = m.group(3);
//val2 = m.group(4);
return true;
}
Use Regular Expression.
In your case,^B:(\d)+ S:(\d)+$ is enough.
In java, to use a regular expression:
public class RegExExample {
public static void main(String[] args) {
Pattern p = Pattern.compile("^B:(\d)+ S:(\d)+$");
for (int i = 0; i < args.length; i++)
if (p.matcher(args[i]).matches())
System.out.println( "ARGUMENT #" + i + " IS VALID!")
else
System.out.println( "ARGUMENT #" + i + " IS INVALID!");
}
}
This sample program take inputs from command line, validate it against the pattern and print the result to STDOUT.

Splitting a string in the middle of a words issue

I have automated some flow of filling in a form from a website by taking the fields data from a csv.
Now, for the address there are 3 fields in the form:
Address 1 ____________
Address 2 ____________
Address 3 ____________
Each field have a limit of 35 characters, so whenever I get to 35 characters im continuing the address string in the second address field...
Now, the issue is that my current solution will split it but it will cut the word if it got to 35 chars, for instant, if the word 'barcelona' in the str and 'o' is the 35th char so in the address 2 will be 'na'.
in that case I want to identify if the 35th char is a middle of a word and take the whole word to the next field.
this is my current solution:
private def enterAddress(purchaseInfo: PurchaseInfo) = {
val webElements = driver.findElements(By.className("address")).toList
val strings = purchaseInfo.supplierAddress.grouped(35).toList
strings.zip(webElements).foreach{
case (text, webElement) => webElement.sendKeys(text)
}
}
I would appreciate some help here, preferably with Scala but java will be fine as well :)
thanks allot!
Since you said you'd accept Java code as well... the following code will wrap a given input string to several lines of a given maximum length:
import java.util.ArrayList;
import java.util.List;
public class WordWrap {
public static void main(String[] args) {
String input = "This is a rather long address, somewhere in a small street in Barcelona";
List<String> wrappedLines = wrap(input, 35);
for (String line : wrappedLines) {
System.out.println(line);
}
}
private static List<String> wrap(String input, int maxLength) {
String[] words = input.split(" ");
List<String> lines = new ArrayList<String>();
StringBuilder sb = new StringBuilder();
for (String word : words) {
if (sb.length() == 0) {
// Note: Will not work if a *single* word already exceeds maxLength
sb.append(word);
} else if (sb.length() + word.length() < maxLength) {
// Use < maxLength as we add +1 space.
sb.append(" " + word);
} else {
// Line is full
lines.add(sb.toString());
// Restart
sb = new StringBuilder(word);
}
}
// Add the last line
if (sb.length() > 0) {
lines.add(sb.toString());
}
return lines;
}
}
Output:
This is a rather long address,
somewhere in a small street in
Barcelona
This is not necessarily the best approach, but I guess you'll have to adapt it to Scala anyway.
If you prefer a library solution (because... why re-invent the wheel?) you can also have a look at WordUtils.wrap() from Apache Commons.
Words in the English language are delimited by space (or other punctuation, but that is irrelevant in this case unless you actually want to wrap lines based on that), and there are a couple of options for using this to your advantage:
One thing you could potentially do is take a substring of 35 characters from your string, use String.lastIndexOf to figure out where the space is, and add only up to that space to your address line, then repeating the process starting from that space character until you have entered the string.
Another method (showcased in Marvin's answer) is to just use String.split on spaces and concatenate them back together until the next word would cause the string to exceed 35 characters.

Efficient Regular Expression for big data, if a String contains a word

I have a code that works but is extremely slow. This code determines whether a string contains a keyword. The requirements I have need to be efficient for hundreds of keywords that I will search for in thousands of documents.
What can I do to make finding the keywords (without falsely returning a word that contains the keyword) efficiently?
For example:
String keyword="ac";
String document"..." //few page long file
If i use :
if(document.contains(keyword) ){
//do something
}
It will also return true if document contains a word like "account";
so I tried to use regular expression as follows:
String pattern = "(.*)([^A-Za-z]"+ keyword +"[^A-Za-z])(.*)";
Pattern r = Pattern.compile(pattern);
Matcher m = r.matcher(document);
if(m.find()){
//do something
}
Summary:
This is the summary: Hopefully it will be useful to some one else:
My regular expression would work but extremely impractical while
working with big data. (it didn't terminate)
#anubhava perfected the regular expression. it was easy to
understand and implement. It managed to terminate which is a big
thing. but it was still a bit slow. (Roughly about 240 seconds)
#Tomalak solution is abit complex to implement and understand but it
was the fastest solution. so hats off mate.(18 seconds)
so #Tomalak solution was ~15 times faster than #anubhava.
Don't think you need to have .* in your regex.
Try this regex:
String pattern = "\\b"+ Pattern.quote(keyword) + "\\b";
Here \\b is used for word boundary. If the keyword can contain special characters, make sure they are not at the start or end of the word, or the word boundaries will fail to match.
Also you must be using Pattern.quote if your keyword contains special regex characters.
EDIT: You might use this regex if your keywords are separated by space.
String pattern = "(?<=\\s|^)"+ Pattern.quote(keyword) + "(?=\\s|$)";
The fastest-possible way to find substrings in Java is to use String.indexOf().
To achieve "entire-word-only" matches, you would need to add a little bit of logic to check the characters before and after a possible match to make sure they are non-word characters:
public class IndexOfWordSample {
public static void main(String[] args) {
String input = "There are longer strings than this not very long one.";
String search = "long";
int index = indexOfWord(input, search);
if (index > -1) {
System.out.println("Hit for \"" + search + "\" at position " + index + ".");
} else {
System.out.println("No hit for \"" + search + "\".");
}
}
public static int indexOfWord(String input, String word) {
String nonWord = "^\\W?$", before, after;
int index, before_i, after_i = 0;
while (true) {
index = input.indexOf(word, after_i);
if (index == -1 || word.isEmpty()) break;
before_i = index - 1;
after_i = index + word.length();
before = "" + (before_i > -1 ? input.charAt(before_i) : "");
after = "" + (after_i < input.length() ? input.charAt(after_i) : "");
if (before.matches(nonWord) && after.matches(nonWord)) {
return index;
}
}
return -1;
}
}
This would print:
Hit for "long" at position 44.
This should perform better than a pure regular expressions approach.
Think if ^\W?$ already matches your expectation of a "non-word" character. The regular expression is a compromise here and may cost performance if your input string contains many "almost"-matches.
For extra speed, ditch the regex and work with the Character class, checking a combination of the many properties it provides (like isAlphabetic, etc.) for before and after.
I've created a Gist with an alternative implementation that does that.

Categories

Resources