Java extracting substring from sentences


There are combinations of words such as is, is not, and does not contain. We have to match these words in a sentence and split the sentence after each match.
Input: if name is tom and age is not 45 or name does not contain tom then let me know.
Expected output:
If name is
tom and age is not
45 or name does not contain
tom then let me know
I tried the code below to split and extract, but "is" also occurs inside "is not", which my code cannot distinguish:
public static void loadOperators(){
operators.add("is");
operators.add("is not");
operators.add("does not contain");
}
public static void main(String[] args) {
loadOperators();
for(String s : operators){
System.out.println(str.split(s).length - 1);
}
}

Since a word can occur multiple times, and is and is not are different operators for you, split won't solve your use case. Ideally you would iterate:
1. Find the index of the operator.
2. Search for the next space or word after it.
3. Then continue with the substring that starts at that index and runs to the end of the string.

I am not entirely sure what you are trying to achieve, but let's give it a shot.
For your case, a simple workaround might work just fine:
Sort the operators by their length, descending. This way the largest match is found first. You can define "largest" either as literally the longest string or, preferably, as the number of words (number of spaces contained), so that "is a" takes precedence over "contains".
You'll need to make sure that no matches overlap, though. This can be done by comparing the start and end indices of all matches and discarding overlaps by some criterion, such as "first match wins".
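As a rough sketch of the longest-first idea (not the asker's code; the class and method names are made up for illustration): sort the operators by length, join them into one regex alternation, and let Matcher.find() walk the sentence. Because find() continues after the previous match, matches cannot overlap. The \b word boundaries are an added assumption, so that "is" does not match inside words like "this".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class OperatorSplitter {

    // Splits the input after every operator, trying longer operators first.
    static List<String> splitAfterOperators(String input, List<String> operators) {
        // Longest operator first, so "is not" is tried before "is".
        String alternation = operators.stream()
                .sorted((a, b) -> Integer.compare(b.length(), a.length()))
                .map(Pattern::quote)
                .collect(Collectors.joining("|"));
        // Assumption: operators are whole words, hence the \b boundaries.
        Matcher matcher = Pattern.compile("\\b(?:" + alternation + ")\\b").matcher(input);

        List<String> parts = new ArrayList<>();
        int start = 0;
        // find() scans left to right and never returns overlapping matches.
        while (matcher.find()) {
            parts.add(input.substring(start, matcher.end()).trim());
            start = matcher.end();
        }
        // Whatever follows the last operator becomes the final part.
        String rest = input.substring(start).trim();
        if (!rest.isEmpty()) {
            parts.add(rest);
        }
        return parts;
    }

    public static void main(String[] args) {
        List<String> operators = Arrays.asList("is", "is not", "does not contain");
        String input = "if name is tom and age is not 45 or name does not contain tom then let me know.";
        splitAfterOperators(input, operators).forEach(System.out::println);
    }
}
```

Unlike the indexOf approach, this does not depend on the order in which the operators are listed, and it finds every occurrence rather than only the first one.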

This code does what you seem to be wanting to do (or what I guessed you are wanting to do):
public static void main(String[] args) {
List<String> operators = new ArrayList<>();
operators.add("is");
operators.add("is not");
operators.add("does not contain");
String input = "if name is tom and age is not 45 or name does not contain tom then let me know.";
List<String> output = new ArrayList<>();
int lastFoundOperatorsEndIndex = 0; // First start at the beginning of input
for (String operator : operators){
int indexOfOperator = input.indexOf(operator); // Find current operator's position
if (indexOfOperator > -1) { // If operator was found
int thisOperatorsEndIndex = indexOfOperator + operator.length(); // Get length of operator and add it to the index to include operator
output.add(input.substring(lastFoundOperatorsEndIndex, thisOperatorsEndIndex).trim()); // Add operator to output (and remove trailing space)
lastFoundOperatorsEndIndex = thisOperatorsEndIndex; // Update startindex for next operator
}
}
output.add(input.substring(lastFoundOperatorsEndIndex, input.length()).trim()); // Add rest of input as last entry to output
for (String part : output) { // Output to console
System.out.println(part);
}
}
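But it is highly dependent on the order of the sentence and the operators: it only finds the first occurrence of each operator, and only works if the operators appear in the sentence in the same order as in the list. If we're talking about user input, the task will be much more complicated.
A better method using regular expressions (regex) would be: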
public static void main(String... args) {
// Define inputs
String input1 = "if name is tom and age is not 45 or name does not contain tom then let me know.";
String input2 = "the name is tom and he is 22 years old but the name does not contain jack, but merry is 24 year old.";
// Output split strings
for (String part : split(input1)) {
System.out.println(part.trim());
}
System.out.println();
for (String part : split(input2)) {
System.out.println(part.trim());
}
}
private static String[] split(String input) {
// Define list of operators - 'is not' has to precede 'is'!!
String[] operators = { "\\sis not\\s", "\\sis\\s", "\\sdoes not contain\\s", "\\sdoes contain\\s" };
// Concatenate operators to regExp-String for search
StringBuilder searchString = new StringBuilder();
for (String operator : operators) {
if (searchString.length() > 0) {
searchString.append("|");
}
searchString.append(operator);
}
// Replace all operators by operator+\n and split resulting string at \n-character
return input.replaceAll("(" + searchString.toString() + ")", "$1\n").split("\n");
}
Notice the order of the operators! 'is' has to come after 'is not', or 'is not' will always be split after its 'is'.
You can avoid this ordering requirement by using a negative lookahead for the operator 'is'.
So "\\sis\\s" would become "\\sis(?! not)\\s" (reading like: "is", not followed by " not").
A minimalist version (with JDK 1.8+, since it uses String.join) could look like this:
private static String[] split(String input) {
String[] operators = { "\\sis(?! not)\\s", "\\sis not\\s", "\\sdoes not contain\\s", "\\sdoes contain\\s" };
return input.replaceAll("(" + String.join("|", operators) + ")", "$1\n").split("\n");
}

Related

Java splitting string at index without cutting the word [duplicate]

I was wondering if there is an API or some quick and easy way to split a String at a given index into a String[] array, such that if there is a word at that index, the whole word goes into the next String.
So let's say I have a string: "I often used to look out of the window, but I rarely do that anymore"
The length of that string is 68 and I have to cut it at 36, which in this sentence lands inside the word "window". It should therefore split after "the", so that the array would be ["I often used to look out of the", "window, but I rarely do that anymore"].
And if the remaining part is longer than 36, it should be split as well, so if I had a slightly longer sentence: "I often used to look out of the window, but I rarely do that anymore, even though I liked it"
the result would be ["I often used to look out of the", "window, but I rarely do that anymore", ",even though I liked it"]

Here's an old-fashioned, non-stream, non-regex solution:
public static List<String> chunk(String s, int limit)
{
List<String> parts = new ArrayList<String>();
while(s.length() > limit)
{
int splitAt = limit-1;
for(;splitAt>0 && !Character.isWhitespace(s.charAt(splitAt)); splitAt--);
if(splitAt == 0)
return parts; // can't be split
parts.add(s.substring(0, splitAt));
s = s.substring(splitAt+1);
}
parts.add(s);
return parts;
}
This doesn't trim additional spaces either side of the split point. Also, if a string cannot be split, because it doesn't contain any whitespace in the first limit characters, then it gives up and returns the partial result.
Test:
public static void main(String[] args)
{
String[] tests = {
"This is a short string",
"This sentence has a space at chr 36 so is a good test",
"I often used to look out of the window, but I rarely do that anymore, even though I liked it",
"I live in Llanfairpwllgwyngyllgogerychwyrndrobwllllantysiliogogogoch",
};
int limit = 36;
for(String s : tests)
{
List<String> chunks = chunk(s, limit);
for(String st : chunks)
System.out.println("|" + st + "|");
System.out.println();
}
}
Output:
|This is a short string|
|This sentence has a space at chr 36|
|so is a good test|
|I often used to look out of the|
|window, but I rarely do that|
|anymore, even though I liked it|
|I live in|

This repeatedly (and greedily) matches between 1 and size characters and requires whitespace, or the end of the string, after each match.
public static List<String> chunk(String s, int size) {
List<String> chunks = new ArrayList<>(s.length()/size+1);
// The trailing group consumes the whitespace, so a chunk may end with one whitespace character.
Pattern pattern = Pattern.compile(".{1," + size + "}(?:\\s|$)");
Matcher matcher = pattern.matcher(s);
while (matcher.find()) {
chunks.add(matcher.group());
}
return chunks;
}
Note that it doesn't work if there's a long run of characters (> size) without whitespace.

How can I eliminate duplicate words from String in Java?

I have an ArrayList of Strings and it contains records such as:
this is a first sentence
hello my name is Chris
what's up man what's up man
today is tuesday
I need to clear this list, so that the output does not contain repeated content. In the case above, the output should be:
this is a first sentence
hello my name is Chris
what's up man
today is tuesday
As you can see, the 3rd String has been modified and now contains the statement what's up man only once instead of twice.
In my list, the String is sometimes correct and sometimes doubled as shown above.
I want to get rid of the doubling, so I thought about iterating through this list:
for (String s: myList) {
but I cannot find a way of eliminating the duplicates, especially since the length of each string is not fixed. By that I mean there might be a record like:
this is a very long sentence this is a very long sentence
or sometimes a short one:
single word single word
Is there some native Java function for that, maybe?

Assuming the String is repeated exactly twice, with a single space in between as in your examples, the following code removes the repetition:
for (int i=0; i<myList.size(); i++) {
String s = myList.get(i);
if (s.length() < 3) continue; // too short to be "x x"; also avoids an out-of-range substring on empty strings
String fs = s.substring(0, s.length()/2); // first half
String ls = s.substring(s.length()/2+1, s.length()); // second half, skipping the separating space
if (fs.equals(ls)) {
myList.set(i, fs);
}
}
The code splits each entry of the list into two substrings at the midpoint. If both are equal, it replaces the original element with just one half, thus removing the repetition.
I was testing the code and did not see @Brendan Robert's answer. This code follows the same logic as his answer.

I would suggest using regular expressions. I was able to remove duplicates using this pattern: \b([\w\s']+) \1\b
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Main {
static String [] phrases = {
"this is a first sentence",
"hello my name is Chris",
"what's up man what's up man",
"today is tuesday",
"this is a very long sentence this is a very long sentence",
"single word single word",
"hey hey"
};
public static void main(String[] args) throws Exception {
String duplicatePattern = "\\b([\\w\\s']+) \\1\\b";
Pattern p = Pattern.compile(duplicatePattern);
for (String phrase : phrases) {
Matcher m = p.matcher(phrase);
if (m.matches()) {
System.out.println(m.group(1));
} else {
System.out.println(phrase);
}
}
}
}
Results:
this is a first sentence
hello my name is Chris
what's up man
today is tuesday
this is a very long sentence
single word
hey

Assumptions:
Uppercase words are equal to lowercase counterparts.
String fullString = "lol lol";
String[] words = fullString.split("\\W+");
StringBuilder stringBuilder = new StringBuilder();
Set<String> wordsHashSet = new HashSet<>();
for (String word : words) {
// Check for duplicates
if (wordsHashSet.contains(word.toLowerCase())) continue;
wordsHashSet.add(word.toLowerCase());
stringBuilder.append(word).append(" ");
}
String nonDuplicateString = stringBuilder.toString().trim();

Simple logic: split the string on spaces, add each word to a LinkedHashSet (which keeps insertion order and drops duplicates), then read the set back and strip the "[", "]" and "," that toString() adds.
String s = "I want to walk my dog I want to walk my dog";
Set<String> temp = new LinkedHashSet<>();
String[] arr = s.split(" ");
for ( String ss : arr)
temp.add(ss);
String newl = temp.toString()
.replace("[","")
.replace("]","")
.replace(",","");
System.out.println(newl);
Output: I want to walk my dog
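A small variation, as a sketch: joining the set with String.join avoids the toString() clean-up, which would otherwise also strip commas that belong to the text. The helper name uniqueWords is made up for this example.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

class UniqueWords {

    // Keeps the first occurrence of every word, in the original order.
    static String uniqueWords(String s) {
        Set<String> seen = new LinkedHashSet<>(Arrays.asList(s.trim().split("\\s+")));
        // String.join leaves punctuation inside the words untouched.
        return String.join(" ", seen);
    }

    public static void main(String[] args) {
        System.out.println(uniqueWords("I want to walk my dog I want to walk my dog"));
    }
}
```

Keep in mind that any set-based approach drops every repeated word, not just a doubled phrase, so a legitimate sentence such as "this is what it is" would lose its second "is".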

It depends on your situation, but assuming that the string is repeated at most twice (not three or more times), you could take the length of the entire string, find the halfway point, and compare each index after the halfway point with the matching index at the beginning. If the string can be repeated more than twice, you will need a more complicated algorithm that first determines how many times the string is repeated, then finds the starting index of each repeat and truncates everything from the beginning of the first repeat onward. If you can provide some more context about the scenarios you expect to handle, we can start putting together some ideas.
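One possible way to handle more than two repetitions, sketched under the assumption that the whole string is a single phrase repeated back to back and that words are separated by whitespace (collapse is a hypothetical helper name): try every candidate phrase length in words and check whether the word list is just that phrase repeated.

```java
import java.util.Arrays;

class RepeatedPhrase {

    // Returns the shortest phrase whose repetition makes up the whole string,
    // or the original string if it is not a pure repetition.
    static String collapse(String s) {
        String[] words = s.trim().split("\\s+");
        int n = words.length;
        for (int unit = 1; unit <= n / 2; unit++) {
            if (n % unit != 0) {
                continue; // the phrase length must divide the word count
            }
            boolean repeats = true;
            for (int i = unit; i < n && repeats; i++) {
                // every word must equal the word at the same position in the first phrase
                repeats = words[i].equals(words[i % unit]);
            }
            if (repeats) {
                return String.join(" ", Arrays.copyOfRange(words, 0, unit));
            }
        }
        return s;
    }

    public static void main(String[] args) {
        System.out.println(collapse("what's up man what's up man"));
        System.out.println(collapse("today is tuesday"));
    }
}
```

Comparing words rather than characters keeps the check independent of the exact string length, at the cost of normalising runs of whitespace in the collapsed result.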

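// Doing it in Java 8: drops a word when it equals the previously kept word (ignoring case)
String str1 = "I am am am a good Good coder";
String[] arrStr = str1.split(" ");
String[] element = new String[1]; // one-element array so the lambda can update the last kept word
// the lambda parameter must not be called str1, which would clash with the local variable above
String result = Arrays.stream(arrStr).filter(word -> {
if (!word.equalsIgnoreCase(element[0])) {
element[0] = word;
return true;
}
return false;
}).collect(Collectors.joining(" "));
// result: "I am a good coder"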

Splitting a string in the middle of a words issue

I have automated some flow of filling in a form from a website by taking the fields data from a csv.
Now, for the address there are 3 fields in the form:
Address 1 ____________
Address 2 ____________
Address 3 ____________
Each field has a limit of 35 characters, so whenever I reach 35 characters I continue the address string in the next address field.
Now, the issue is that my current solution splits the string, but it cuts a word in two when the word straddles the 35-character limit. For instance, if the word 'barcelona' is in the string and its 'o' is the 35th char, then address 2 will start with 'na'.
In that case I want to detect that the 35th char is in the middle of a word and move the whole word to the next field.
This is my current solution:
private def enterAddress(purchaseInfo: PurchaseInfo) = {
val webElements = driver.findElements(By.className("address")).toList
val strings = purchaseInfo.supplierAddress.grouped(35).toList
strings.zip(webElements).foreach{
case (text, webElement) => webElement.sendKeys(text)
}
}
I would appreciate some help here, preferably in Scala, but Java will be fine as well :)
Thanks a lot!

Since you said you'd accept Java code as well... the following code will wrap a given input string to several lines of a given maximum length:
import java.util.ArrayList;
import java.util.List;
public class WordWrap {
public static void main(String[] args) {
String input = "This is a rather long address, somewhere in a small street in Barcelona";
List<String> wrappedLines = wrap(input, 35);
for (String line : wrappedLines) {
System.out.println(line);
}
}
private static List<String> wrap(String input, int maxLength) {
String[] words = input.split(" ");
List<String> lines = new ArrayList<String>();
StringBuilder sb = new StringBuilder();
for (String word : words) {
if (sb.length() == 0) {
// Note: Will not work if a *single* word already exceeds maxLength
sb.append(word);
} else if (sb.length() + word.length() < maxLength) {
// Use < maxLength as we add +1 space.
sb.append(" " + word);
} else {
// Line is full
lines.add(sb.toString());
// Restart
sb = new StringBuilder(word);
}
}
// Add the last line
if (sb.length() > 0) {
lines.add(sb.toString());
}
return lines;
}
}
Output:
This is a rather long address,
somewhere in a small street in
Barcelona
This is not necessarily the best approach, but I guess you'll have to adapt it to Scala anyway.
If you prefer a library solution (because... why re-invent the wheel?) you can also have a look at WordUtils.wrap() from Apache Commons (originally in Commons Lang, now maintained in Commons Text).

Words in the English language are delimited by spaces (or punctuation, but that is irrelevant here unless you actually want to wrap lines based on it), and there are a couple of ways to use this to your advantage:
One thing you could do is take a substring of 35 characters from your string, use String.lastIndexOf to find the last space in it, and add only the text up to that space to your address line, then repeat the process starting from that space until the whole string has been consumed.
Another method (showcased in Marvin's answer) is to use String.split on spaces and concatenate the words back together until the next word would cause the line to exceed 35 characters.
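The lastIndexOf variant could look roughly like the following. This is a sketch with made-up names (AddressLines, split), and cutting an over-long word at the limit is an assumption of mine, since the question does not say what should happen in that case.

```java
import java.util.ArrayList;
import java.util.List;

class AddressLines {

    // Splits the address into lines of at most maxLength characters, breaking at spaces.
    static List<String> split(String address, int maxLength) {
        List<String> lines = new ArrayList<>();
        String rest = address.trim();
        while (rest.length() > maxLength) {
            // Last space at or before index maxLength; the line before it has at most maxLength characters.
            int cut = rest.lastIndexOf(' ', maxLength);
            if (cut <= 0) {
                // No space in range: fall back to cutting the word itself (assumption).
                cut = maxLength;
            }
            lines.add(rest.substring(0, cut).trim());
            rest = rest.substring(cut).trim();
        }
        if (!rest.isEmpty()) {
            lines.add(rest);
        }
        return lines;
    }

    public static void main(String[] args) {
        String input = "This is a rather long address, somewhere in a small street in Barcelona";
        split(input, 35).forEach(System.out::println);
    }
}
```

The resulting list can then be zipped with the address fields just as in the original enterAddress method. If the address needs more lines than there are fields, the extra lines still have to be handled somehow.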

Java String.contains() to take care of natural numbers

I'm a computer science student learning Java, and as an exercise, we're writing a permutation algorithm.
Now I'm stuck at a point where I need to search for a natural number within a String full of numbers separated by commas:
String myString = "0,1,2,10,14,";
The problem is I'm using...
myString.contains(String.valueOf(anInteger));
...to check for the presence of a specific number. This works for numbers from 0 to 9, but when a number has more than one digit, the program does not treat it as a single natural number.
In other words, and as an example: "14" is not the integer 14, it's just a string with a "1" and a "4"; so, if I run...
String myString = "0,1,2,10,14,";
if (myString.contains(String.valueOf(4))) { doSomething(); }
...the "if" condition will be true, since the digit "4" is present in the string as part of the natural number "14".
At this point, I've been searching through StackOverflow and other pages for a solution, and learnt I should use Pattern and Matcher.
My question is: what's the best way to use them?
Relevant part of my code:
for (int i = 0; i<r; i++)
{
if (!act.contains(String.valueOf(i)))
{
...
}
...
}
I use this method several times in my code, so an exact substitution would be nice.
Thank you all in advance!

You only need a method call to matches():
if (myString.matches(".*\\b" + anInteger + "\\b.*"))
// string contains the number
This works by creating a regex that has a word boundary (\b) at either end of the target number. The leading and trailing .* are required because matches() must match the whole string to return true.
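Since the question asks about Pattern and Matcher specifically, the same word-boundary idea can be sketched with Matcher.find(), which searches for the pattern anywhere in the string and so needs no surrounding .*. The helper name containsNumber is made up, and the sketch assumes non-negative numbers only.

```java
import java.util.regex.Pattern;

class NumberSearch {

    // True if the comma-separated list contains the number as a whole token.
    // Assumes natural numbers: a leading minus sign is not treated as part of the number.
    static boolean containsNumber(String csv, int number) {
        return Pattern.compile("\\b" + number + "\\b").matcher(csv).find();
    }

    public static void main(String[] args) {
        String myString = "0,1,2,10,14,";
        System.out.println(containsNumber(myString, 4));
        System.out.println(containsNumber(myString, 14));
    }
}
```

With this helper, the check in the loop would become something like if (!containsNumber(act, i)).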

Look into how to split a String into an array of Strings. So:
String[] splitStrings = myString.split(",");
ArrayList<Integer> parsedInts = new ArrayList<Integer>();
for (String str : splitStrings) {
parsedInts.add(Integer.parseInt(str));
}
then in your for loop:
if (parsedInts.contains(i)) {
// body
}

Something like this:
String myString = "0,1,2,10,14,";
String[] split = myString.split(",");
for (String string : split) {
int num = Integer.parseInt(string);
if (num == 4) {
System.out.println(num);
// ...
}
}

String myString = "0,1,2,10,14,2323232";
String[] allList = myString.split(",");
for (String string : allList) {
if(string.matches("[0-9]+")) // + rather than *, so that an empty token does not count as a number
{
System.out.println("It's a number with value "+string);
}
}

I think you need to pick out all the numbers in the given string and then compute the permutations.
To do that, tokenize the given string using the comma as the separator.
When I write such a program, I keep the parsing of the String separate from the rest of the logic, which goes in another method. Below is the snippet:
String myString = "0,1,2,10,14,";
StringTokenizer st2 = new StringTokenizer(myString , ",");
while (st2.hasMoreTokens()) {
doSomething(st2.nextToken()); // nextToken() returns a String, whereas nextElement() returns an Object
}

Checking if a character is an integer or letter

I am modifying a file using Java. Here's what I want to accomplish:
If an & symbol followed by an integer is detected while reading, I want to drop the & symbol and translate the integer to binary.
If an & symbol followed by a (random) word is detected while reading, I want to drop the & symbol and replace the word with the integer 16; and if a different word is used with the & symbol, I want to assign it the next number after 16.
Here's an example of what I mean. If the input file contains these strings:
&myword
&4
&anotherword
&9
&yetanotherword
&10
&myword
The output should be:
&0000000000010000 (which is 16 in decimal)
&0000000000000100 (or the number '4' in decimal)
&0000000000010001 (which is 17 in decimal, since 16 is already used, so 16+1=17)
&0000000000000101 (or the number '9' in decimal)
&0000000000010001 (which is 18 in decimal, or 17+1=18)
&0000000000000110 (or the number '10' in decimal)
&0000000000010000 (which is 16 because value of myword = 16)
Here's what I tried so far, but haven't succeeded yet:
for (i=0; i<anyLines.length; i++) {
char[] charray = anyLines[i].toCharArray();
for (int j=0; j<charray.length; j++)
if (Character.isDigit(charray[j])) {
anyLines[i] = anyLines[i].replace("&","");
anyLines[i] = Integer.toBinaryString(Integer.parseInt(anyLines[i]);
}
else {
continue;
}
if (Character.isLetter(charray[j])) {
anyLines[i] = anyLines[i].replace("&","");
for (int k=16; j<charray.length; k++) {
anyLines[i] = Integer.toBinaryString(Integer.parseInt(k);
}
}
}
}
I hope that I am articulate enough. Any suggestions on how to accomplish this task?

Character.isLetter() // tests whether the character is a letter
Character.isDigit() // tests whether the character is a digit
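For example, these two methods could be used to decide what kind of token follows the & symbol. This is only a sketch: the classify helper and its return values are made up for illustration.

```java
class TokenKind {

    // Classifies the text after the '&' as "number", "word", "mixed" or "empty".
    static String classify(String token) {
        if (token.isEmpty()) {
            return "empty";
        }
        // chars() yields int values, so the int overloads of isDigit/isLetter are used.
        if (token.chars().allMatch(Character::isDigit)) {
            return "number";
        }
        if (token.chars().allMatch(Character::isLetter)) {
            return "word";
        }
        return "mixed";
    }

    public static void main(String[] args) {
        for (String line : new String[] { "&myword", "&4", "&10" }) {
            System.out.println(line + " -> " + classify(line.substring(1)));
        }
    }
}
```

Checking the whole token at once avoids the problem in the question's loop, where the line is replaced while its characters are still being iterated.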

It looks like something you could match against a regex. I don't know Java but you should have at least one regex engine at your disposal. Then the regex would be:
regex1: &(\d+)
and
regex2: &(\w+)
or
regex3: &(\d+|\w+)
In the first case, if regex1 matches, you know you ran into a number, and that number is in the first capturing group (e.g. match.group(1)). If regex2 matches, you know you have a word. You can then look up that word in a dictionary and see what its associated number is, or, if it is not present, add it to the dictionary and associate it with the next free number (16 + the dictionary size before the word is added). Try regex1 first, because \w also matches digits.
regex3 on the other hand will match both numbers and words, so it's up to you to see what's in the capturing group (it's just a different approach).
If neither regex matches, then you have an invalid sequence, or you need some other action. Note that \w in a regex only matches word characters (i.e. letters, digits and _; in Java only the ASCII ones by default), so &çSomeWord or &*SomeWord won't match at all, while the captured group in &Hello.World would be just "Hello".
Regex libs usually provide a length for the matched text, so you can move i forward by that much in order to skip already matched text.
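In Java, the approach could be sketched roughly as follows. This is not a tested solution for the asker's file format: the class and method names are made up, the pattern is simply &(\w+) with a digits check afterwards (since \w also covers digits), and the 16-bit zero padding is taken from the expected output in the question.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AmpersandTokens {

    private static final Pattern TOKEN = Pattern.compile("&(\\w+)");

    // Replaces every &number or &word in the text with & plus a 16-digit binary value.
    static String translate(String text) {
        Map<String, Integer> words = new HashMap<>(); // the dictionary of words seen so far
        Matcher m = TOKEN.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String token = m.group(1);
            int value;
            if (token.matches("\\d+")) {
                value = Integer.parseInt(token); // assumes the number fits into an int
            } else {
                Integer known = words.get(token);
                if (known == null) {
                    known = 16 + words.size(); // next free number, starting at 16
                    words.put(token, known);
                }
                value = known;
            }
            // Left-pad with spaces to 16 characters, then turn the spaces into zeros.
            String binary = String.format("%16s", Integer.toBinaryString(value)).replace(' ', '0');
            m.appendReplacement(out, Matcher.quoteReplacement("&" + binary));
        }
        m.appendTail(out); // copy whatever follows the last match
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(translate("&myword\n&4\n&anotherword\n&9\n&yetanotherword\n&10\n&myword"));
    }
}
```

appendReplacement and appendTail take care of copying the text between matches, so there is no need to advance an index by hand.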

You have to tokenize your input somehow. It seems you are splitting it into lines and then analyzing each line individually. If this is what you want, okay. If not, you could simply search for & (indexOf('&')) and then determine what the next token is (either a number or a "word", however you want to define a word).
What do you want to do with input which does not match your pattern? Neither the description of the task nor the example really covers this.
You need to have a dictionary of already read strings. Use a Map<String, Integer>.

I would post this as a comment, but don't have the ability yet. What is the issue you are running into? Error? Incorrect Results? 16's not being correctly incremented? Also, the examples use a '%' but in your description you say it should start with a '&'.
Edit2: Was thinking it was line by line, but re-reading indicates you could be trying to find say "I went to the &store" and want it to say "I went to the &000010000". So you would want to split by whitespace and then iterate through and pass the strings into your 'replace' method, which is similar to below.
Edit1: If I understand what you are trying to do, code like this should work.
Map<String, Integer> usedWords = new HashMap<String, Integer>();
List<String> output = new ArrayList<String>();
int wordIncrementer = 16;
String[] arr = test.split("\n");
for(String s : arr)
{
if(s.startsWith("&"))
{
String line = s.substring(1).trim(); //Removes &
try
{
Integer lineInt = Integer.parseInt(line);
output.add("&" + Integer.toBinaryString(lineInt));
}
catch(Exception e)
{
System.out.println("Line was not an integer. Parsing as a String.");
String outputString = "&";
if(usedWords.containsKey(line))
{
outputString += Integer.toBinaryString(usedWords.get(line));
}
else
{
outputString += Integer.toBinaryString(wordIncrementer);
usedWords.put(line, wordIncrementer++);
}
output.add(outputString);
}
}
else
{
continue; //Nothing indicating that we should parse the line.
}
}

How about this?
String input = "&myword\n&4\n&anotherword\n&9\n&yetanotherword\n&10\n&myword";
String[] lines = input.split("\n");
int wordValue = 16;
// to keep track words that are already used
Map<String, Integer> wordValueMap = new HashMap<String, Integer>();
for (String line : lines) {
// if line doesn't begin with &, then ignore it
if (!line.startsWith("&")) {
continue;
}
// remove &
line = line.substring(1);
Integer binaryValue = null;
if (line.matches("\\d+")) {
binaryValue = Integer.parseInt(line);
}
else if (line.matches("\\w+")) {
binaryValue = wordValueMap.get(line);
// if the map doesn't contain the word value, then assign and store it
if (binaryValue == null) {
binaryValue = wordValue;
wordValueMap.put(line, binaryValue);
wordValue++;
}
}
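// skip lines that are neither a number nor a word (binaryValue would still be null here)
if (binaryValue == null) {
continue;
}
// I'm using Commons Lang's StringUtils.leftPad(..) to create the zero padded string
String out = "&" + StringUtils.leftPad(Integer.toBinaryString(binaryValue), 16, "0");
System.out.println(out);
}
Here's the printout: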
&0000000000010000
&0000000000000100
&0000000000010001
&0000000000001001
&0000000000010010
&0000000000001010
&0000000000010000
Just FYI, the binary value for 10 is "1010", not "110" as stated in your original post.
