Find space separated names using Apache OpenNLP - java

I am using the NER (name finder) of Apache OpenNLP and have successfully trained a model on my custom data. When using the name finder, I split the input string on whitespace and pass the resulting string array, as shown below.
NameFinderME nameFinder = new NameFinderME(model);
String []sentence = input.split(" "); //eg:- input = Give me list of test case in project X
Span nameSpans[] = nameFinder.find(sentence);
Here, when I use split, test and case are passed as separate tokens and are never detected by the name finder. How can I overcome this? Is there a way to pass the complete string (without splitting it into an array) so that test case is treated as a single unit?

You can do it using regular expressions. Try replacing the second line with this:
String []sentence = input.split("\\s(?<!(\\stest\\s(?=case\\s)))");
Maybe there is a better way to write the expression, but this works for me and the output is:
Give
me
list
of
test case
in
project
X
EDIT: If you are interested in the details check here where I split: https://regex101.com/r/6HLBnL/1
EDIT 2: If you have many word pairs that should not be separated, I wrote a method that generates the regex for you. This is how the regex should look in this case (if you don't want to separate 'test case' and 'in project'):
\s(?<!(\stest\s(?=case\s))|(\sin\s(?=project\s)))
Following is a simple program to demonstrate it. In this example you just put the words that don't need separation in the array unseparated.
class NoSeparation {
private static String[][] unseparated = {{"test", "case"}, {"in", "project"}};
private static String getRegex() {
String regex = "\\s(?<!";
for (int i = 0; i < unseparated.length; i++)
regex += "(\\s" + unseparated[i][0] + "\\s(?=" + unseparated[i][1] + "\\s))|";
// Remove the last |
regex = regex.substring(0, regex.length() - 1);
return (regex + ")");
}
public static void main(String[] args) {
String input = "Give me list of test case in project X";
String []sentence = input.split(getRegex());
for (String i: sentence)
System.out.println(i);
}
}
EDIT 3: The following is a very dirty way to handle phrases with more than 2 words. It works, but I am pretty sure it can be done more efficiently. It will work fine on short inputs, but on longer ones it will probably be slow.
You have to put the words that should not be split in a 2D array, as in unseparated. You should also choose a different separator if you don't want to use %% for some reason (e.g. if there is a chance your input contains it).
class NoSeparation {
private static final String SEPARATOR = "%%";
private static String[][] unseparated = {{"of", "test", "case"}, {"in", "project"}};
private static String[] splitString(String in) {
String[] splitted;
for (int i = 0; i < unseparated.length; i++) {
String toReplace = "";
String replaceWith = "";
for (int j = 0; j < unseparated[i].length; j++) {
toReplace += unseparated[i][j] + ((j < unseparated[i].length - 1)? " " : "");
replaceWith += unseparated[i][j] + ((j < unseparated[i].length - 1)? SEPARATOR : "");
}
in = in.replaceAll(toReplace, replaceWith);
}
splitted = in.split(" ");
for (int i = 0; i < splitted.length; i++)
splitted[i] = splitted[i].replaceAll(SEPARATOR, " ");
return splitted;
}
public static void main(String[] args) {
String input = "Give me list of test case in project X";
// Uncomment this if there is a chance to have multiple spaces/tabs
// input = input.replaceAll("\\s+", " ");
for (String str: splitString(input))
System.out.println(str);
}
}
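A possible alternative to the placeholder trick in EDIT 3 is to split on whitespace first and then greedily merge known multi-word phrases back into single tokens. This is only a sketch: the class name PhraseTokenizer, the PHRASES table and the tokenize method are hypothetical names and not part of OpenNLP, and whether merged tokens actually help the name finder depends on how the training data for the model was tokenized.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class PhraseTokenizer {
    // Hypothetical phrase table: each row is a word sequence to keep together.
    private static final String[][] PHRASES = {{"of", "test", "case"}, {"in", "project"}};

    // Splits on whitespace, then merges any phrase from PHRASES into one token.
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        String trimmed = input.trim();
        if (trimmed.isEmpty()) {
            return tokens; // nothing to tokenize
        }
        String[] words = trimmed.split("\\s+");
        int i = 0;
        while (i < words.length) {
            int matched = longestPhraseAt(words, i);
            if (matched > 0) {
                // Join the matched words with single spaces into one token.
                tokens.add(String.join(" ", Arrays.copyOfRange(words, i, i + matched)));
                i += matched;
            } else {
                tokens.add(words[i]);
                i++;
            }
        }
        return tokens;
    }

    // Returns the length of the longest phrase starting at 'start', or 0 if none matches.
    private static int longestPhraseAt(String[] words, int start) {
        int best = 0;
        for (String[] phrase : PHRASES) {
            if (phrase.length > best
                    && start + phrase.length <= words.length
                    && Arrays.equals(phrase, Arrays.copyOfRange(words, start, start + phrase.length))) {
                best = phrase.length;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        for (String token : tokenize("Give me list of test case in project X")) {
            System.out.println(token);
        }
    }
}
```

The result can be turned into the array that find expects with tokens.toArray(new String[0]). The input is scanned once per phrase and no separator string has to be chosen.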

Related

How to split a string after every 10 words?

I am looking for a way to split my chunk of string every 10 words.
I am working with the below code.
My input will be a long string.
Ex: this is an example file that can be used as a reference for this program, i want this line to be split (newline) by every 10 words each.
private void jButton27ActionPerformed(java.awt.event.ActionEvent evt) {
String[] names = jTextArea13.getText().split("\\n");
var S = names.Split().ToList();
for (int k = 0; k < S.Count; k++) {
nam.add(S[k]);
if ((k%10)==0) {
nam.add("\r\n");
}
}
jTextArea14.setText(nam);
output:
this is an example file that can be used as
a reference for this program, i want this line to
be split (newline) by every 10 words each.
Any help is appreciated.
"I am looking for a way to split my chunk of string every 10 words"
A regex with a non-capturing group is a more concise way of achieving that:
str = str.replaceAll("((?:[^\\s]*\\s){9}[^\\s]*)\\s", "$1\n");
The 9 in the above example is just words-1, so if you want that to split every 20 words for instance, change it to 19.
That means your code could become:
jTextArea14.setText(jTextArea13.getText().replaceAll("((?:[^\\s]*\\s){9}[^\\s]*)\\s", "$1\n"));
To me, that's much more readable. Whether it's more readable in your case of course depends on whether users of your codebase are reasonably proficient in regex.
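To make the "words-1" rule concrete, here is a small sketch that builds the same kind of pattern for an arbitrary line length. The class WordWrap and the method wrapEvery are hypothetical names used only for illustration, and \S is used as a shorthand for [^\s].

```java
class WordWrap {
    // Hypothetical helper: replaces the whitespace after every
    // wordsPerLine-th word with a newline, using the regex idea above.
    static String wrapEvery(String text, int wordsPerLine) {
        if (wordsPerLine < 1) {
            throw new IllegalArgumentException("wordsPerLine must be at least 1");
        }
        // (wordsPerLine - 1) words each followed by whitespace, then one more word.
        String regex = "((?:\\S*\\s){" + (wordsPerLine - 1) + "}\\S*)\\s";
        return text.replaceAll(regex, "$1\n");
    }

    public static void main(String[] args) {
        // Prints three lines: "a b c", "d e f" and "g".
        System.out.println(wrapEvery("a b c d e f g", 3));
    }
}
```

Note that the pattern assumes words are separated by single whitespace characters; runs of several spaces are not collapsed.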
You can also try this, which uses java.util.StringTokenizer:
public static final String WHITESPACE = " ";
public static final String LINEBREAK = System.getProperty("line.separator");
public static String splitString(String text, int wordsPerLine)
{
final StringBuilder newText = new StringBuilder();
final StringTokenizer wordTokenizer = new StringTokenizer(text);
long wordCount = 1;
while (wordTokenizer.hasMoreTokens())
{
newText.append(wordTokenizer.nextToken());
if (wordTokenizer.hasMoreTokens())
{
if (wordCount++ % wordsPerLine == 0)
{
newText.append(LINEBREAK);
}
else
{
newText.append(WHITESPACE);
}
}
}
return newText.toString();
}
You were so close.
You were not appending your split words before setting it back into your text box. StringBuilder sb.append(S[k]) will add your split name to a buffer. sb.append(" ") will then add a space. Each line will be of 10 space separated names.
StringBuilder sb = new StringBuilder();
String[] S = jTextArea13.getText().split(" ");
for (int k = 0; k < S.length; k++) {
sb.append(S[k]).append(" ");
if (((k+1)%10)==0) {
sb.append("\r\n");
}
}
Finally, write it back to your jTextArea using:
jTextArea14.setText(sb.toString());
Just a side note: since sb is a StringBuilder, you need to convert it to a String using the toString method.

split a string when there is a change in character without a regular expression

There is a way to split a string into runs of repeating characters using a regex, but I want to do it without one.
For example, given a string like "EE B", my output should be an array of strings, e.g.
{"EE", " ", "B"}
My approach is:
Given a string, I first find the number of unique characters so I know the size of the array. Then I convert the string to an array of characters and check whether the next character is the same as the current one. If it is, I append it to the current string; if not, I begin a new string.
My code so far:
String myinput = "EE B";
char[] cinput = new char[myinput.length()];
cinput = myinput.toCharArray(); //turn string to array of characters
int uniquecha = myinput.length();
for (int i = 0; i < cinput.length; i++) {
if (i != myinput.indexOf(cinput[i])) {
uniquecha--;
} //this should give me the number of unique characters
String[] returninput = new String[uniquecha];
Arrays.fill(returninput, "");
for (int i = 0; i < uniquecha; i++) {
returninput[i] = "" + myinput.charAt(i);
for (int j = 0; j < myinput.length - 1; j++) {
if (myinput.charAt(j) == myinput.charAt(j + 1)) {
returninput[j] += myinput.charAt(j + 1);
} else {
break;
}
}
} return returninput;
But there is something wrong with the second part: I can't figure out why it does not begin a new string when the character changes.
Your question says that you don't want to use regex, but I see no reason for that requirement, other than this maybe being homework. If you are open to using regex here, then there is a one-line solution which splits your input string on the following pattern:
(?<=\S)(?=\s)|(?<=\s)(?=\S)
This pattern uses lookarounds to split whenever what precedes is a non-whitespace character and what follows is a whitespace character, or vice versa.
String input = "EE B";
String[] parts = input.split("(?<=\\S)(?=\\s)|(?<=\\s)(?=\\S)");
System.out.println(Arrays.toString(parts));
[EE, , B]
(the middle element is a single space character)
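The pattern above only splits at whitespace boundaries, so an input such as "EEBB" would come back as one piece. If the goal is to split wherever the character changes, a lookbehind with a backreference is one possible variant. This is a sketch; the class RunSplitter and the method splitRuns are hypothetical names.

```java
import java.util.Arrays;

class RunSplitter {
    // Splits at every position where the previous character (captured in
    // the lookbehind) is not followed by the same character again.
    static String[] splitRuns(String s) {
        return s.split("(?<=(.))(?!\\1)");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitRuns("EE B")));  // [EE,  , B]
        System.out.println(Arrays.toString(splitRuns("EEBB")));  // [EE, BB]
    }
}
```

Because . does not match line terminators by default, this sketch assumes the input contains no newlines.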
If I understood correctly, you want to split the characters in a string so that similar-consecutive characters stay together. If that's the case, here is how I would do it:
public static ArrayList<String> splitString(String str)
{
ArrayList<String> output = new ArrayList<>();
String combo = "";
//iterates through all the characters in the input
for(char c: str.toCharArray()) {
//check if the current char is equal to the last added char
if(combo.length() > 0 && c != combo.charAt(combo.length() - 1)) {
output.add(combo);
combo = "";
}
combo += c;
}
output.add(combo); //adds the last group of characters
return output;
}
Note that instead of using an array (has a fixed size) to store the output, I used an ArrayList, which has a variable size. Also, instead of checking the next character for equality with the current one, I preferred to use the last character for that. The variable combo is used to temporarily store the characters before they go to output.
Now, here is one way to print the result following your guidelines:
public static void main(String[] args)
{
String input = "EEEE BCD DdA";
ArrayList<String> output = splitString(input);
System.out.print("[");
for(int i = 0; i < output.size(); i++) {
System.out.print("\"" + output.get(i) + "\"");
if(i != output.size()-1)
System.out.print(", ");
}
System.out.println("]");
}
The output when running the above code will be:
["EEEE", " ", "B", "C", "D", " ", "D", "d", "A"]

break large String into small Strings

I have a large string which contains IDs, for example:
HD47-4585-GG89
The string above holds the ID of a single object, but sometimes it contains the IDs of multiple objects, like this:
HD47-4585-GG89-KO89-9089-RT45
This one holds the IDs of 2 objects. I want to convert such a string into an array or into multiple smaller strings, something like:
id1 = HD47-4585-GG89
id2 = KO89-9089-RT45
Every ID has a fixed number of characters, here 14 (counting the symbols too), and the total number of IDs in a single string is not known in advance.
I don't know how to do this. Can anyone guide me?
I think all I have to do is clip the first 14 characters of the string, assign them to a variable, and repeat this until the string is empty.
You could also use regex:
String input = "HD47-4585-GG89-KO89-9089-RT45";
Pattern id = Pattern.compile("(\\w{4}-\\w{4}-\\w{4})");
Matcher matcher = id.matcher(input);
List<String> ids = new ArrayList<>();
while(matcher.find()) {
ids.add(matcher.group(1));
}
System.out.println(ids); // [HD47-4585-GG89, KO89-9089-RT45]
Note that this assumes each group of characters (e.g. HD47) is 4 characters long.
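Using Guava's Splitter:
import com.google.common.base.CharMatcher;
import com.google.common.base.Splitter;
import com.google.common.collect.Iterables;
class SplitIt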
{
public static void main (String[] args) throws java.lang.Exception
{
String idString = "HD47-4585-GG89-KO89-9089-RT45-HD47-4585-GG89";
Iterable<String> result = Splitter
.fixedLength(15)
.trimResults(CharMatcher.inRange('-', '-'))
.split(idString);
String[] parts = Iterables.toArray(result, String.class);
for (String id : parts) {
System.out.println(id);
}
}
}
StringTokenizer st = new StringTokenizer(input, "-"); // input is the string holding the IDs
while (st.hasMoreTokens()) {
System.out.println(st.nextToken());
}
These tokens can be stored in an array, and you can then use the index to get the required data.
String text = "HD47-4585-GG89-KO89-9089-RT45";
String first = "";
String second = "";
List<String> textList = Arrays.asList(text.split("-"));
for (int i = 0; i < textList.size() / 2; i++) {
first += textList.get(i) + "-";
}
for (int i = textList.size() / 2; i < textList.size(); i++) {
second += textList.get(i) + "-";
}
first = first.substring(0, first.length() - 1);
second = second.substring(0, second.length() - 1);
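The idea from the question (clip 14 characters, then repeat) also works for any number of IDs if the joining hyphen is skipped on each step. The following is a minimal sketch under the question's assumption that every ID is exactly 14 characters long; IdSplitter, splitIds and ID_LENGTH are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

class IdSplitter {
    // Assumption taken from the question: every ID is 14 characters long.
    private static final int ID_LENGTH = 14;

    // Cuts the input into 14-character IDs, skipping the single '-' between them.
    static List<String> splitIds(String joined) {
        List<String> ids = new ArrayList<>();
        for (int start = 0; start < joined.length(); start += ID_LENGTH + 1) {
            int end = Math.min(start + ID_LENGTH, joined.length());
            ids.add(joined.substring(start, end));
        }
        return ids;
    }

    public static void main(String[] args) {
        // Prints [HD47-4585-GG89, KO89-9089-RT45]
        System.out.println(splitIds("HD47-4585-GG89-KO89-9089-RT45"));
    }
}
```

If the input can be malformed, the last element may be shorter than 14 characters, so a length check on each ID may be worth adding.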

How to retain matched sub string and replace unmatched sub strings in Java String

Hello, I am trying to build a String in the following way:
Input: big = "12xy34", small = "xy" output: "**xy**"
Input: big = "12xt34", small = "xy" output: "******"
Input: big = "12xy34", small = "1" output: "1*****"
Input: big = "12xy34xyabcxy", small = "xy" output: "**xy**xy***xy"
Input: big = "78abcd78cd", small = "78" output: "78****78**"
What condition do I need to write to get the output above?
public static String stars(String big, String small) {
//throw new RuntimeException("not implemented yet ");
char[] arr = big.toCharArray();
for (int i = 0; i < arr.length; i++) {
if (big.contains(small) ) {
arr[i] = '*';
}
}
String a = Arrays.toString(arr);
return big+""+a;
}
Algorithm:
Convert the big and small Strings to char[] arrays bigC and smallC respectively
Iterate over each character of the big String
At every index during the iteration, check whether the small String occurs starting at the current character
If it does, advance the index in the big String iteration by the length of the small String
Otherwise, replace the character with *
Code:
public class StringRetainer {
public static void main(String args[]) {
String big[] = {"12xy34", "12xt34", "12xy34", "12xy34xyabcxy", "78abcd78cd"};
String small[] = {"xy", "xy", "1", "xy", "78"};
for(int i = 0; i < big.length && i < small.length; i++) {
System.out.println("Input: big = \"" + big[i] + "\", small = \"" + small[i] + "\" output : \"" + stars(big[i], small[i]) + "\"");
}
}
public static String stars(String big, String small) {
//String to char[] array conversions
char[] bigC = big.toCharArray();
char[] smallC = small.toCharArray();
//iterate through every character of big String and selectively replace
for(int i = 0; i < bigC.length; i++) {
//flag to determine whether small String occurs in big String
boolean possibleSubString = true;
int j = 0;
//iterate through every character of small String to determine
//the possibility of character replacement
for(; j < smallC.length && (i+j) < bigC.length; j++) {
//if there is a mismatch of at least one character in big String
if(bigC[i+j] != smallC[j]) {
//set the flag indicating sub string is not possible and break
possibleSubString = false;
break;
}
}
//if small String is part of big String,
//advance the loop index with length of small String
//replace with '*' otherwise
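//a match only counts if all of small was compared (not cut off by the end of big)
if(possibleSubString && j > 0 && j == smallC.length)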
i = i+j-1;
else
bigC[i] = '*';
}
big = String.copyValueOf(bigC);
return big;
}
}
Note:
This is one possible solution (the legacy way of doing it)
It looks like there is no straightforward way of achieving this using built-in String/StringBuffer/StringBuilder methods
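For comparison, here is a sketch of a variant that lets String.indexOf locate each occurrence and only fills the gaps with stars. It is an alternative under the same input/output expectations as the question, not the method used in the answer above; the class name StarMask is a hypothetical name.

```java
class StarMask {
    // Keeps every occurrence of 'small' in 'big' and replaces all other characters with '*'.
    static String stars(String big, String small) {
        StringBuilder out = new StringBuilder(big.length());
        int pos = 0;
        while (pos < big.length()) {
            // An empty 'small' is treated as "never found" to avoid looping forever.
            int hit = small.isEmpty() ? -1 : big.indexOf(small, pos);
            int starsEnd = (hit < 0) ? big.length() : hit;
            for (int i = pos; i < starsEnd; i++) {
                out.append('*'); // characters before the next occurrence
            }
            if (hit < 0) {
                break; // no further occurrence, the rest is already starred
            }
            out.append(small);
            pos = hit + small.length();
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(stars("12xy34xyabcxy", "xy")); // **xy**xy***xy
        System.out.println(stars("78abcd78cd", "78"));    // 78****78**
    }
}
```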

How can I split a string in to multiple parts?

I have a string with several words separated by spaces, e.g. "firstword second third", and an ArrayList. I want to split the string into several pieces, and add the 'piece' strings to the ArrayList.
For example,"firstword second third" can be split to three separate strings , so the ArrayList would have 3 elements; "1 2 3 4" can be split into 4 strings, in 4 elements of the ArrayList. See the code below:
public void separateAndAdd(String notseparated) {
for(int i=0;i<canBeSeparatedinto(notseparated);i++){
//what should i put here in order to split the string via spaces?
thearray.add(separatedstring);
}
}
public int canBeSeparatedinto(String string) {
//what do i put here to find out the amount of spaces inside the string?
return ....
}
Please leave a comment if you don't get what I mean or if I should fix some errors in this post. Thanks for your time!
You can split the String at the spaces using split():
String[] parts = inputString.split(" ");
Afterwards, iterate over the array and add the individual parts to the list (if !"".equals(parts[i])).
If you want to split on one space, you can use .split(" ");. If you want to split on all spaces in a row, use .split(" +");.
Consider the following example:
class SplitTest {
public static void main(String...args) {
String s = "This is a  test"; // note two spaces between 'a' and 'test'
String[] a = s.split(" ");
String[] b = s.split(" +");
System.out.println("a: " + a.length);
for(int i = 0; i < a.length; i++) {
System.out.println("i " + a[i]);
}
System.out.println("b: " + b.length);
for(int i = 0; i < b.length; i++) {
System.out.println("i " + b[i]);
}
}
}
If you are worried about non-standard spaces, you can use "\\s+" instead of " +", as "\\s" will capture any white space, not just the 'space character'.
So your separate and add method becomes:
void separateAndAdd(String raw) {
String[] tokens = raw.split("\\s+");
theArray.ensureCapacity(theArray.size() + tokens.length); // prevent unnecessary resizes
for(String s : tokens) {
theArray.add(s);
}
}
Here's a more complete example - note that there is a small modification in the separateAndAdd method that I discovered during testing.
import java.util.*;
class SplitTest {
public static void main(String...args) {
SplitTest st = new SplitTest();
st.separateAndAdd("This is a test");
st.separateAndAdd("of the emergency");
st.separateAndAdd("");
st.separateAndAdd("broadcast system.");
System.out.println(st);
}
ArrayList<String> theArray = new ArrayList<String>();
void separateAndAdd(String raw) {
String[] tokens = raw.split("\\s+");
theArray.ensureCapacity(theArray.size() + tokens.length); // prevent unnecessary resizes
for(String s : tokens) {
if(!s.isEmpty()) theArray.add(s);
}
}
public String toString() {
StringBuilder sb = new StringBuilder();
for(String s : theArray)
sb.append(s).append(" ");
return sb.toString().trim();
}
}
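The empty-string check in separateAndAdd matters for two reasons: "".split("\\s+") returns an array containing one empty string, and an input with leading whitespace produces an empty first token. A compact sketch of the same idea follows; the class Tokens and the method tokens are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

class Tokens {
    // Splits on runs of whitespace and drops empty tokens.
    static List<String> tokens(String raw) {
        List<String> result = new ArrayList<>();
        for (String s : raw.trim().split("\\s+")) {
            if (!s.isEmpty()) { // "".split(...) yields one empty string
                result.add(s);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(tokens("  This is a  test ")); // [This, is, a, test]
        System.out.println(tokens(""));                    // []
    }
}
```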
I would suggest using StringUtils from Apache Commons Lang (org.apache.commons.lang.StringUtils).
It is the easiest option and covers the different conditions you may want when splitting up a string, with minimal code.
See the documentation of its split method, which also describes the other options available for splitting.
Do this:
thearray = new ArrayList<String>(Arrays.asList(notseparated.split(" ")));
or, if thearray is already instantiated:
thearray.addAll(Arrays.asList(notseparated.split(" ")));
If you want to split a string into different parts, here is how to split the string 14-03-2016 into day, month and year:
String[] parts = myDate.split("-");
day=parts[0];
month=parts[1];
year=parts[2];
You can do that using .split(). Try this:
String[] words= inputString.split("\\s");
Try this to split a string into 2 parts:
public String[] get(String s){
int l = s.length();
int t = l / 2;
String first = "";
String sec = "";
for(int i =0; i<l; i++){
if(i < t){
first += s.charAt(i);
}else{
sec += s.charAt(i);
}
}
String[] result = {first, sec};
return result;
}
example:
String s = "HelloWorld";
String[] res = get(s);
System.out.println(res[0]+" "+res[1]);
output:
Hello World
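The same two halves can also be obtained with substring instead of a character loop. This is a minimal sketch; the class Halves and the method halves are hypothetical names.

```java
class Halves {
    // Splits s at its midpoint; for odd lengths the second half is one character longer.
    static String[] halves(String s) {
        int mid = s.length() / 2;
        return new String[] { s.substring(0, mid), s.substring(mid) };
    }

    public static void main(String[] args) {
        String[] res = halves("HelloWorld");
        System.out.println(res[0] + " " + res[1]); // Hello World
    }
}
```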
