Stop words not being correctly removed from string - java

I have a function which reads stop words from a file and saves them in a HashSet.
HashSet<String> hset = readFile();
This is my string
String words = "the plan crash is invisible";
I am trying to remove all the stop words from the string, but it is not working correctly.
The output I am getting: plan crash invible
The output I want: plan crash invisible
Code:
HashSet<String> hset = readFile();
String words = "the plan crash is invisible";
String s = words.toLowerCase();
String[] split = s.split(" ");
for(String str: split){
if (hset.contains(str)) {
s = s.replace(str, "");
} else {
}
}
System.out.println("\n" + "\n" + s);

While hset.contains(str) matches full words, s.replace(str, ""); can replace occurrences of the "stop" words which are part of words of the input String. Hence "invisible" becomes "invible".
Since you are iterating over all the words of s anyway, you can construct a String that contains all the words not contained in the Set:
StringBuilder sb = new StringBuilder();
for(String str: split){
if (!hset.contains(str)) {
if (sb.length() > 0) {
sb.append(' ');
}
sb.append(str);
}
}
System.out.println("\n" + "\n" + sb.toString());
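A self-contained version of the same idea, as a rough sketch. The class name StopWordFilter, the method name removeStopWords and the inline stop-word set are made up for illustration; the set stands in for whatever readFile() returns.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.stream.Collectors;

final class StopWordFilter {
    // Keeps only the whole words that are not in the stop-word set.
    static String removeStopWords(String text, Set<String> stopWords) {
        return Arrays.stream(text.toLowerCase().split("\\s+"))
                .filter(word -> !word.isEmpty() && !stopWords.contains(word))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for readFile()
        Set<String> stopWords = new HashSet<>(Arrays.asList("the", "is"));
        String words = "the plan crash is invisible";

        // Substring replacement also removes the "is" inside "invisible".
        System.out.println(words.replace("is", ""));

        // Whole-word filtering leaves "invisible" intact.
        System.out.println(removeStopWords(words, stopWords)); // plan crash invisible
    }
}
```

Splitting on "\\s+" rather than " " also copes with repeated spaces in the input.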

There is no need to split the string or check each word yourself: you can use replaceAll, which takes a regex, and match each stop word only at word boundaries (\b) so that words such as "invisible" are left alone. Pattern.quote (from java.util.regex.Pattern) keeps any regex metacharacters in a stop word literal:
for (String str : hset) {
    s = s.replaceAll("\\b" + Pattern.quote(str) + "\\b", " ");
}
Example:
HashSet<String> hset = new HashSet<>();
hset.add("is");
hset.add("the");
String words = "the plan crash is invisible";
String s = words.toLowerCase();
for (String str : hset) {
    s = s.replaceAll("\\b" + Pattern.quote(str) + "\\b", " ");
}
s = s.replaceAll("\\s+", " ").trim(); // collapse leftover spaces (idea from davidxxx)
System.out.println(s);
This prints:
plan crash invisible
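If the stop-word set is large, calling replaceAll once per word recompiles a regex every time. One possible variation, sketched here under the assumption that the set is non-empty, is to compile a single alternation with word boundaries. The names StopWordRegex, compile and strip are hypothetical.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

final class StopWordRegex {
    // Builds \b(?:word1|word2|...)\b with every word quoted as a literal.
    // Assumes stopWords is not empty.
    static Pattern compile(Set<String> stopWords) {
        String alternation = stopWords.stream()
                .map(Pattern::quote)
                .collect(Collectors.joining("|"));
        return Pattern.compile("\\b(?:" + alternation + ")\\b");
    }

    // Replaces every whole-word match with a space, then collapses whitespace.
    static String strip(String text, Pattern stopWordPattern) {
        String withoutStopWords = stopWordPattern.matcher(text.toLowerCase()).replaceAll(" ");
        return withoutStopWords.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        Set<String> stopWords = new LinkedHashSet<>(Arrays.asList("is", "the"));
        Pattern pattern = compile(stopWords);
        System.out.println(strip("the plan crash is invisible", pattern)); // plan crash invisible
        System.out.println(strip("This is the thesis", pattern));          // this thesis
    }
}
```

The \b anchors are what prevent "is" from matching inside "this" or "thesis".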

Related

split String If get any capital letters

My String:
BByTTheWay. I want to split the string as B By T The Way BByTTheWay. That is, I want to split the string at every capital letter and then append the original string unchanged at the end. This is what I have tried so far in Java:
public String breakWord(String fileAsString) throws FileNotFoundException, IOException {
String allWord = "";
String allmethod = "";
String[] splitString = fileAsString.split(" ");
for (int i = 0; i < splitString.length; i++) {
String k = splitString[i].replaceAll("([A-Z])(?![A-Z])", " $1").trim();
allWord = k.concat(" " + splitString[i]);
allWord = Arrays.stream(allWord.split("\\s+")).distinct().collect(Collectors.joining(" "));
allmethod = allmethod + " " + allWord;
// System.out.print(allmethod);
}
return allmethod;
}
It gives me the output: B ByT The Way BByTTheWay. I hope the Stack Overflow community can help me solve this.
You may use this code:
Code 1
String s = "BByTTheWay";
Pattern p = Pattern.compile("\\p{Lu}\\p{Ll}*");
String out = p.matcher(s)
.results()
.map(MatchResult::group)
.collect(Collectors.joining(" "))
+ " " + s;
//=> "B By T The Way BByTTheWay"
RegEx \\p{Lu}\\p{Ll}* matches any unicode upper case letter followed by 0 or more lowercase letters.
Or use String.split with a lookahead for an uppercase letter, (?=\\p{Lu}), and join the pieces back together afterwards:
Code 2
String out = Arrays.stream(s.split("(?=\\p{Lu})"))
.collect(Collectors.joining(" ")) + " " + s;
//=> "B By T The Way BByTTheWay"
Use
String s = "BByTTheWay";
Pattern p = Pattern.compile("[A-Z][a-z]*");
Matcher m = p.matcher(s);
String r = "";
while (m.find()) {
r = r + m.group(0) + " ";
}
System.out.println(r + s);
Results: B By T The Way BByTTheWay
EXPLANATION
[A-Z]     any character from 'A' to 'Z'
[a-z]*    any character from 'a' to 'z', 0 or more times (matching as many as possible)
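The two patterns used in the answers above behave the same on ASCII input but differ on other alphabets: [A-Z] only covers 'A' to 'Z', while \p{Lu} covers any Unicode uppercase letter. A small sketch to compare them (the class name CapitalSplitter and the accented sample string are made up for illustration):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CapitalSplitter {
    static final Pattern ASCII = Pattern.compile("[A-Z][a-z]*");
    static final Pattern UNICODE = Pattern.compile("\\p{Lu}\\p{Ll}*");

    // Collects every match of the pattern, in order of appearance.
    static List<String> pieces(Pattern pattern, String input) {
        List<String> result = new ArrayList<>();
        Matcher m = pattern.matcher(input);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        String s = "BByTTheWay";
        // Both patterns give the same pieces for plain ASCII input.
        System.out.println(String.join(" ", pieces(ASCII, s)) + " " + s);
        System.out.println(String.join(" ", pieces(UNICODE, s)) + " " + s);

        // Hypothetical non-ASCII input, written with Unicode escapes.
        String accented = "\u00C9cole\u00DCber";
        System.out.println(pieces(ASCII, accented).size());   // 0
        System.out.println(pieces(UNICODE, accented).size()); // 2
    }
}
```

This loop also works on Java 8, whereas Matcher.results() needs Java 9 or later.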
Depending on your requirements, you can also do it without a regex, by checking whether each character is an uppercase letter:
char[] chars = fileAsString.toCharArray();
StringBuilder fragment = new StringBuilder();
for (char ch : chars) {
if (Character.isLetter(ch) && Character.isUpperCase(ch)) { // it works as internationalized check
fragment.append(" ");
}
fragment.append(ch);
}
String result = (fragment.toString().trim() + " " + fileAsString).trim(); // B By T The Way BByTTheWay

how to split string by reading text from file

I am trying to read the os-release file on Linux and get the OS version from the VERSION_ID="12.3" line. I wrote the code below, but after the first split I cannot get any further: I get as far as "12.3" from VERSION_ID, but when I apply split again I get a "java.lang.ArrayIndexOutOfBoundsException: Index 1 out of bounds for length 0" error. I need your help and suggestions.
File fobj2 = new File("C:\\os-release");
if(fobj2.exists() && !fobj2.isDirectory())
{
Scanner sc = new Scanner(fobj2);
while(sc.hasNextLine())
{
String line = sc.nextLine();
//VERSION_ID="12.3"
if(line.contains("VERSION_ID="))
{
System.out.println(" VERSION_ID= " + line );
String [] ver = line.split("=");
if(ver.length > 1)
{
String [] ver_id = line.split("=");
System.out.println(" ver_id.length " + ver_id.length );
System.out.println(" ver_id[0] " + ver_id[0] );
System.out.println(" ver_id[1] " + ver_id[1] );
System.out.println(" ver_id[1].length() " + ver_id[1].length() );
String [] FinalVer1 = ver_id[1].split(".");
System.out.println(" FinalVer1[1].length() " + FinalVer1[1].length() );
System.out.println(" FinalVer1[1] " + FinalVer1[1] );
}
}
}
Here is yet another way to gather the Major and Minor version values from the line String:
String line = "VERSION_ID=\"12.3\"";
// Bring line to lowercase to eliminate possible letter case
// variations (if any) then see if 'version_id' is in the line:
if (line.toLowerCase().contains("version_id")) {
// Remove all whitespaces and double-quotes from line then
// split the line regardless of whether or not there are
// whitespaces before or after the '=' character.
String version = line.replaceAll("[ \"]", "").split("\\s*=\\s*")[1];
//Get version Major.
String major = version.split("[.]")[0];
// Get version Minor.
String minor = version.split("[.]")[1];
// Display contents of all variables
System.out.println("Line: --> " + line);
System.out.println("Version: --> " + version);
System.out.println("Major: --> " + major);
System.out.println("Minor: --> " + minor);
}
Try this:
File fobj2 = new File("C:\\os-release");
if(fobj2.exists() && !fobj2.isDirectory())
{
Scanner sc = new Scanner(fobj2);
while(sc.hasNextLine())
{
String line = sc.nextLine();
//VERSION_ID="12.3"
if(line.contains("VERSION_ID="))
{
System.out.println(" VERSION_ID= " + line );
String [] ver = line.split("=");
if(ver.length > 1)
{
String [] ver_id = line.split("=");
System.out.println(" ver_id.length " + ver_id.length );
System.out.println(" ver_id[0] " + ver_id[0] );
System.out.println(" ver_id[1] " + ver_id[1] );
System.out.println(" ver_id[1].length() " + ver_id[1].length() );
// Here
String [] FinalVer1 = ver_id[1].split("\\.");
System.out.println(" FinalVer1[1].length() " + FinalVer1[1].length() );
System.out.println(" FinalVer1[1] " + FinalVer1[1] );
}
}
}
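You just need to escape the dot, as shown above. String.split takes a regular expression, and an unescaped "." matches any character, so every character is treated as a delimiter and the resulting array is empty.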
I think that when you use the split method on a string like VERSION_ID="12.3", the output is String[] arr = {"VERSION_ID", "\"12.3\""}. So why are you splitting on "=" a second time for the ver_id array? From my understanding you are looking for the number after the dot, so split ver_id[1] on the dot instead (escaped as "\\.", because split takes a regex).
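Putting the answers together, a minimal sketch of the parsing step might look like this. The class name OsRelease and the method versionParts are hypothetical, and the quote handling assumes the value is wrapped in plain double quotes as in the question.

```java
final class OsRelease {
    // Returns the dot-separated parts of a line such as VERSION_ID="12.3",
    // or an empty array if the line is not a VERSION_ID entry.
    static String[] versionParts(String line) {
        // Limit 2: split only on the first '=' so the value stays in one piece.
        String[] keyValue = line.split("=", 2);
        if (keyValue.length < 2 || !keyValue[0].trim().equals("VERSION_ID")) {
            return new String[0];
        }
        // Drop the surrounding double quotes before splitting.
        String value = keyValue[1].trim().replace("\"", "");
        // "\\." is a literal dot; a bare "." would match every character.
        return value.split("\\.");
    }

    public static void main(String[] args) {
        String[] parts = versionParts("VERSION_ID=\"12.3\"");
        System.out.println("Major: " + parts[0]); // Major: 12
        System.out.println("Minor: " + parts[1]); // Minor: 3

        // The unescaped dot from the question yields an empty array.
        System.out.println("12.3".split(".").length); // 0
    }
}
```

Callers should still check the array length before indexing, since a value without a dot gives only one part.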

How do I build these words into one sentence after looping?

I am looping through a sentence, splitting it into words to capitalize each one, but I am struggling to build the sentence back from the individual words.
String str = "Not the answer you're looking for.";
StringBuilder stringBuilder = new StringBuilder();
String oneWord =" ";
for (String word : str.toLowerCase().split(" ")){
char firstLetter = word.substring(0,1).toUpperCase().charAt(0);
oneWord = firstLetter + word.substring(1);
System.out.println(stringBuilder.append(oneWord + " "));
}
}
I expect to get only one fully built String "Not The Answer You're Looking For."
String str = "Not the answer you're looking for.";
StringBuilder stringBuilder = new StringBuilder();
String oneWord =" ";
for (String word : str.toLowerCase().split(" ")){
char firstLetter = word.substring(0,1).toUpperCase().charAt(0);
oneWord = firstLetter + word.substring(1);
stringBuilder.append(oneWord + " ");
}
System.out.println(stringBuilder.toString());
You are not getting just one string because you call System.out.println inside the for loop.
Consider my example above
oneWord += firstLetter + word.substring(1) + " ";
after loop
oneWord = oneWord.trim();
System.out.println(oneWord);
So the solution is:
String str = "Not the answer you're looking for.";
StringBuilder sb = new StringBuilder();
for (String word : str.toLowerCase().split(" ")) {
    // Capitalize the current word, not the whole sentence
    sb.append(word.substring(0, 1).toUpperCase());
    sb.append(word.substring(1));
    sb.append(" ");
}
System.out.println(sb.toString().trim());
Also, your solution is not optimal.
Check String.join(), or use something like this:
String result = Arrays.stream(str.toLowerCase().split(" "))
    .map(word -> word.substring(0, 1).toUpperCase() + word.substring(1))
    .collect(Collectors.joining(" "));
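One thing none of the snippets above handle is repeated spaces or an empty input, where split(" ") yields empty tokens and substring(0, 1) throws. A hedged sketch that guards against that (TitleCase and capitalizeWords are made-up names):

```java
import java.util.Arrays;
import java.util.stream.Collectors;

final class TitleCase {
    // Lowercases the sentence, then upper-cases the first character of each word.
    static String capitalizeWords(String sentence) {
        return Arrays.stream(sentence.trim().toLowerCase().split("\\s+"))
                .filter(word -> !word.isEmpty()) // "".split(...) yields one empty token
                .map(word -> Character.toUpperCase(word.charAt(0)) + word.substring(1))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        System.out.println(capitalizeWords("Not the answer you're looking for."));
        // Not The Answer You're Looking For.
    }
}
```

Collectors.joining(" ") puts the separator only between words, so no trailing space has to be trimmed afterwards.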

Can StringTokenizer countTokens ever be zero?

I just found a piece of Java code inside a method:
if (param.contains("|")) {
StringTokenizer st = new StringTokenizer(param.toLowerCase().replace(" ", ""), "|");
if (st.countTokens() > 0) {
...
}
} else {
return myString.contains(param);
}
Can countTokens in the above case ever be less than 1?
It can: if the string you are tokenizing is empty, or consists only of delimiter characters, countTokens() returns 0. Otherwise it is always at least 1.
Example 1:
String myStr = "abcdefg";
StringTokenizer st = new StringTokenizer(myStr, ";");
int tokens = st.countTokens();
System.out.println("Number of tokens: " + tokens);
> "Number of tokens: 1"
Example 2:
String myStr = "";
StringTokenizer st = new StringTokenizer(myStr, ";");
int tokens = st.countTokens();
System.out.println("Number of tokens: " + tokens);
> "Number of tokens: 0"
Example 3:
String myStr = "abc;defg";
StringTokenizer st = new StringTokenizer(myStr, ";");
int tokens = st.countTokens();
System.out.println("Number of tokens: " + tokens);
> "Number of tokens: 2"
All of the following return 0:
new StringTokenizer("", "|").countTokens()
new StringTokenizer("|", "|").countTokens()
new StringTokenizer("||||", "|").countTokens()
So countTokens() returns 0 when:
the String is empty
the String contains only the delimiter
Look at this
String param="";
StringTokenizer st = new StringTokenizer(param.toLowerCase().replace(" ", ""), "|");
System.out.println(st.countTokens());
The output is 0 (zero).
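Note that in the code from the question the tokenizer is only created when param contains "|", so param is never empty there; the count can still be 0 when param consists of nothing but "|" characters and spaces. A small sketch that mirrors that construction (TokenDemo and tokens are hypothetical names):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

final class TokenDemo {
    // Tokenizes the same way as the snippet in the question and returns the tokens.
    static List<String> tokens(String param) {
        StringTokenizer st = new StringTokenizer(param.toLowerCase().replace(" ", ""), "|");
        List<String> result = new ArrayList<>();
        while (st.hasMoreTokens()) {
            result.add(st.nextToken());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(tokens("|").size());     // 0
        System.out.println(tokens(" | | ").size()); // 0
        System.out.println(tokens("A | b"));        // [a, b]
    }
}
```

So the countTokens() > 0 check in the question is not redundant.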

removing comma from string array

I want to execute a query like
select ID from "xyz_DB"."test" where user in ('a','b')
so the corresponding code is like
String s="(";
for(String user:selUsers){
s+= " ' " + user + " ', ";
}
s+=")";
Select ID from test where userId in s;
The code above builds the value of s as ('a','b',)
I want to remove the comma after the last element of the array. How can I do this?
Here is one way to do this:
String s = "(";
boolean first = true;
for(String user : selUsers){
if (first) {
first = false;
} else {
s += ", ";
}
s += " ' " + user + " '";
}
s += ")";
But it is more efficient to use a StringBuilder to assemble a String if there is looping involved.
StringBuilder sb = new StringBuilder("(");
boolean first = true;
for(String user : selUsers){
if (first) {
first = false;
} else {
sb.append(", ");
}
sb.append(" ' ").append(user).append(" '");
}
sb.append(")");
String s = sb.toString();
This does the trick.
String s = "";
for(String user : selUsers)
s += ", '" + user + "'";
if (selUsers.size() > 0)
s = s.substring(2);
s = "(" + s + ")";
But, a few pointers:
When concatenating strings like this, you are advised to work with StringBuilder and append.
If this is part of an SQL-query, you probably want to sanitize the user-names. See xkcd: Exploits of a Mom for an explanation.
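On the second pointer: rather than sanitizing the names yourself, a common approach is to generate one ? placeholder per user and bind the values with a PreparedStatement. The sketch below only builds the query text; the class name InClause is made up, and the table and column names are taken from the question's pseudo-query.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

final class InClause {
    // Builds "(?, ?, ?)" with one placeholder per value.
    static String placeholders(int count) {
        return "(" + String.join(", ", Collections.nCopies(count, "?")) + ")";
    }

    // Query text only; the user names are bound later, for example with
    // PreparedStatement.setString(i + 1, users.get(i)) for each index i.
    static String query(List<String> users) {
        return "select ID from test where userId in " + placeholders(users.size());
    }

    public static void main(String[] args) {
        List<String> selUsers = Arrays.asList("a", "b");
        System.out.println(query(selUsers)); // select ID from test where userId in (?, ?)
    }
}
```

An empty list produces "()", which most databases reject, so that case is worth handling before running the query.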
For fun, a variation of Stephen C's answer:
StringBuilder sb = new StringBuilder("(");
boolean first = true;
for(String user : selUsers){
if (!first || (first = false))
sb.append(", ");
sb.append('\'').append(user).append('\'');
}
sb.append(')');
You could even write the loop like this :-)
for(String user : selUsers)
sb.append(!first || (first=false) ? ", \'" : "\'").append(user).append('\'');
Use the 'old style' of loop where you have the index, then you add the comma on every username except the last:
String[] selUsers = {"a", "b", "c"};
String s="(";
for(int i = 0; i < selUsers.length; i++){
s+= " ' " + selUsers[i] + " ' ";
if(i < selUsers.length -1){
s +=" , ";
}
}
s+=")";
But as others already mentioned, use a StringBuffer (or, in single-threaded code, the unsynchronized StringBuilder) when concatenating strings in a loop:
String[] selUsers = {"a", "b", "c"};
StringBuffer s = new StringBuffer("(");
for(int i = 0; i < selUsers.length; i++){
s.append(" ' " + selUsers[i] + " ' ");
if(i < selUsers.length -1){
s.append(" , ");
}
}
s.append(")");
Use StringUtils.join from Apache Commons Lang.
Prior to adding the trailing ')' I'd strip off the last character of the string if it's a comma, or perhaps just replace the trailing comma with a right parenthesis - in pseudo-code, something like
if s.last == ',' then
s = s.left(s.length() - 1);
end if;
s = s + ')';
or
if s.last == ',' then
s.last = ')';
else
s = s + ')';
end if;
Share and enjoy.
I would do s+= " ,'" + user + "'"; (placing the comma before the value) and add a counter to the loop so that I just do s = "'" + user + "'"; when the counter is 1 (or 0, depending on where you start counting).
(N.B. - I'm not a Java guy, so the syntax may be wrong here - apologies if it is).
If selUsers is an array, why not do:
selUsers.join(',');
This should do what you want.
EDIT:
Looks like I was wrong - I figured Java had this functionality built-in. Looks like the Apache project has something that does what I meant, though. Check out this SO answer: Java: function for arrays like PHP's join()?
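Since Java 8 the standard library does have this built in: String.join for plain joining, and Collectors.joining when a prefix and suffix are needed. A minimal sketch (QuotedList and quoted are hypothetical names):

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

final class QuotedList {
    // Wraps each value in single quotes and joins them as ('a','b',...).
    // No escaping is done here, so this is for illustration only and is
    // not safe for building SQL from untrusted input.
    static String quoted(List<String> users) {
        return users.stream()
                .map(user -> "'" + user + "'")
                .collect(Collectors.joining(",", "(", ")"));
    }

    public static void main(String[] args) {
        System.out.println(quoted(Arrays.asList("a", "b"))); // ('a','b')
        System.out.println(String.join(",", "a", "b"));      // a,b
    }
}
```

Because the separator is only placed between elements, there is no trailing comma to remove.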
I fully support Stephen C's answer - that's the one I wanted to suggest as well: simply avoid adding the additional comma.
But in your special case, the following should work too:
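s = s.replace(", )", ")");
It simply replaces the substring ", )" with a single closing parenthesis ")". String.replace takes literal text rather than a regex, so the parenthesis must not be escaped.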
Java 1.4+
s = s.replaceFirst("\\, \\)$", ")");
Edited: I forgot the last space before the parenthesis
StringBuilder has a perfectly good append(int) overload.
String[] array = {"1", "2", "3" /* ... */};
StringBuilder builder = new StringBuilder("( ");
String separator = ""; // empty before the first element, "," afterwards
for (String i : array)
{
    builder.append(separator).append(i);
    separator = ",";
}
builder.append(" )");
Answer shamelessly copied from here
