I have a fairly simple question. I have a string such as 0-1000,
say str = "0-1000"
I can successfully extract 0 and 1000 by using str.split("-").
Now I need to handle the case where either of the two numbers can be negative.
If I keep using str.split("-"), the negative sign is lost as well.
Could anyone suggest a method for this?
Since String.split() uses regular expressions to split, you could do something like this:
String[] nos = "-1000--1".split("(?<=\\d)-");
This means you split at minus characters that follow a digit, i.e. must be an operator.
Note that the positive look-behind (?<=\d) needs to be used since you only want to match the minus character. String.split() removes all matching separators and thus something like \d- would remove digits as well.
To parse the numbers you'd then iterate over the array elements and call Integer.valueOf(element) or Integer.parseInt(element).
Note that this assumes the input string is valid. Depending on what you want to achieve, you might first have to check the input for a match, e.g. by using -?\d+--?\d+ to check whether the string is in the format x-y, where x and y can be positive or negative integers.
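As a rough sketch of how the validation and parsing steps above might fit together (the class name RangeParser and the method parse are made up for illustration, and the pattern uses \d+ so that multi-digit numbers are accepted):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RangeParser {
    // Two optionally negative integers separated by a minus sign.
    private static final Pattern RANGE = Pattern.compile("(-?\\d+)-(-?\\d+)");

    // Hypothetical helper: returns {x, y} or throws if the input is not in x-y format.
    static int[] parse(String s) {
        Matcher m = RANGE.matcher(s);
        if (!m.matches()) {
            throw new IllegalArgumentException("Not in x-y format: " + s);
        }
        return new int[] { Integer.parseInt(m.group(1)), Integer.parseInt(m.group(2)) };
    }

    public static void main(String[] args) {
        int[] r = parse("-1000--1");
        System.out.println(r[0] + " to " + r[1]); // prints "-1000 to -1"
    }
}
```

Using capturing groups here validates and extracts in one pass, so no separate split is needed; the look-behind split remains the shorter option when the input is already known to be valid.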
You can use a regex like this (it works for all of these cases):
public static void main(String[] args) {
String s = "-500--578";
String[] arr = s.split("(?<=\\d)-"); // split on "-" only if it is preceded by a digit
for (String str : arr)
System.out.println(str);
}
Output:
-500
-578
I am working on Twitter data normalization. Twitter users frequently use terms like "I looooooove it" in order to emphasize the word love. I want to reduce such repeated characters to a proper English word by removing repeated characters until I get a meaningful word (I am aware that I cannot differentiate between good and god by this mechanism).
My strategy would be:
1. Identify the existence of such repeated strings. I would look for more than 2 identical consecutive characters, as there is probably no English word with more than two repeated characters.
String[] strings = { "stoooooopppppppppppppppppp","looooooove", "good","OK", "boolean", "mee", "claaap" };
String regex = "([a-z])\\1{2,}";
Pattern pattern = Pattern.compile(regex);
for (String string : strings) {
Matcher matcher = pattern.matcher(string);
if (matcher.find()) {
System.out.println(string+" TRUE ");
}
}
2. Search for such words in a lexicon like WordNet.
3. Replace all but two such repeated characters and check the lexicon.
4. If the word is not in the lexicon, remove one more repeated character (otherwise treat it as a misspelling).
Due to my poor Java knowledge I am unable to manage steps 3 and 4. The problem is that I cannot replace all but two repeated consecutive characters.
The following code snippet replaces all but one of the repeated characters: System.out.println(data.replaceAll("([a-zA-Z])\\1{2,}", "$1"));
Help is required to find out
A. How to replace all but 2 consecutive repeat characters
B. How to remove one more consecutive character from the output of A
[I think B can be managed by the following code snippet]
System.out.println(data.replaceAll("([a-zA-Z])\\1{1,}", "$1"));
Edit: The solution provided by Wiktor Stribiżew works perfectly in Java. I was wondering what changes are required to get the same result in Python.
In Python, use re.sub with the same pattern and \1 instead of $1 in the replacement.
Your regex ([a-z])\\1{2,} matches and captures a lowercase ASCII letter into Group 1 and then matches 2 or more further occurrences of this value. So all you need is to replace the match with a backreference, $1, that holds the captured value. If you use a single $1, aaaaa will be replaced with a single a, and if you use $1$1, it will be replaced with aa.
String twoConsecutivesOnly = data.replaceAll(regex, "$1$1");
String noTwoConsecutives = data.replaceAll(regex, "$1");
See the Java demo.
If you need to make your regex case insensitive, use "(?i)([a-z])\\1{2,}" or even "(\\p{Alpha})\\1{2,}". If any Unicode letters must be handled, use "(\\p{L})\\1{2,}".
BONUS: In a general case, to replace any amount of any repeated consecutive chars use
text = text.replaceAll("(?s)(.)\\1+", "$1"); // any chars
text = text.replaceAll("(.)\\1+", "$1"); // any chars but line breaks
text = text.replaceAll("(\\p{L})\\1+", "$1"); // any letters
text = text.replaceAll("(\\w)\\1+", "$1"); // any ASCII alnum + _ chars
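To tie this back to the lexicon strategy in the question, here is a minimal sketch. It assumes the lexicon is available as a plain Set<String> (a stand-in for WordNet), and the names ElongationNormalizer and normalize are hypothetical. It tries the word as-is, then with elongated runs cut to two characters, then to one.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class ElongationNormalizer {
    // A letter followed by two or more copies of itself (a run of 3+).
    private static final String RUN = "([a-zA-Z])\\1{2,}";

    // Hypothetical helper: 'lexicon' is an assumed stand-in for a real word list.
    static String normalize(String word, Set<String> lexicon) {
        if (lexicon.contains(word)) {
            return word;                                  // already a known word
        }
        String two = word.replaceAll(RUN, "$1$1");        // keep two of each run
        if (lexicon.contains(two)) {
            return two;
        }
        String one = word.replaceAll(RUN, "$1");          // keep one of each run
        return lexicon.contains(one) ? one : word;        // give up: leave unchanged
    }

    public static void main(String[] args) {
        Set<String> lexicon = new HashSet<>(Arrays.asList("stop", "love", "good", "clap"));
        String[] words = { "stoooooopppppppppppppppppp", "looooooove", "good", "claaap" };
        for (String w : words) {
            System.out.println(w + " -> " + normalize(w, lexicon));
        }
    }
}
```

Note one design choice: the last step collapses only the runs that were elongated in the original word, so a legitimate double letter elsewhere in the word is left alone. Words that need different treatment for different runs would need a more elaborate search.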
/*This code checks a character in a given string repeated consecutively 3 times
if you want to check for 4 consecutive times change count==2--->count==3 OR
if you want to check for 2 consecutive times change count==2--->count==1*/
public class Test1 {
static char ch;
public static void main(String[] args) {
String str="aabbbbccc";
char[] charArray = str.toCharArray();
int count=0;
for(int i=0;i<charArray.length;i++){
if(i!=0 ){
if(charArray[i]==ch)continue;//ddddee
if(charArray[i]==charArray[i-1]) {
count++;
if(count==2){
System.out.println(charArray[i]);
count=0;
ch=charArray[i];
}
}
else{
count=0;//aabb
}
}
}
}
}
I've got a string in my Java project which looks something like this
9201,92710,94500,920,1002
How can I enter a dot 2 places before the comma? So it looks like
this:
920.1,9271.0,9450.0,92.0,100.2
I had an attempt at it but I can't get the last number to get a dot.
numbers = numbers.replaceAll("([0-9],)", "\\.$1");
The result I got is
920.1,9271.0,9450.0,92.0,1002
Note: The length of the string is not always the same. It can be longer / shorter.
Check if string ends with ",". If not, append a "," to the string, run the same replaceAll, remove "," from end of String.
Split string by the "," delimiter, process each piece adding the "." where needed.
Just add a "." at numbers.length-1 to solve the issue with the last number
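The split-based suggestion above could be sketched roughly as follows. The class and method names are invented for illustration, and the sketch assumes every piece between commas should get a dot before its last character.

```java
public class DotInserter {
    // Hypothetical helper: inserts a '.' before the last character of each
    // comma-separated piece.
    static String insertDots(String numbers) {
        String[] parts = numbers.split(",");
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < parts.length; i++) {
            if (i > 0) {
                out.append(',');
            }
            String p = parts[i];
            if (p.isEmpty()) {
                continue; // nothing to put a dot into
            }
            out.append(p, 0, p.length() - 1)      // everything but the last character
               .append('.')
               .append(p.charAt(p.length() - 1)); // the last character
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(insertDots("9201,92710,94500,920,1002"));
        // prints 920.1,9271.0,9450.0,92.0,100.2
    }
}
```

This is longer than a single replaceAll, but it handles the last number without any special case because each piece is processed the same way.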
As your problem is not only inserting the dot before every comma, but also before end of string, you just must add this additional condition to your capturing group:
numbers = numbers.replaceAll("([0-9](,|$))", "\\.$1");
As suggested by Siguza, you could as well use a non-capturing group which is even more what a "human" would expect to be captured in the capturing group:
numbers = numbers.replaceAll("([0-9](?:,|$))", "\\.$1");
But since non-capturing groups are not supported by every regex flavor (although they are a really nice feature, and Java does support them) and the overhead is not that significant here, I would recommend using the first option.
You could use word boundary:
numbers = numbers.replaceAll("(\\d)\\b", ".$1");
Your solution is fine, as long as you put a comma at the end like dan said.
So instead of:
numbers = numbers.replaceAll("([0-9],)", "\\.$1");
write:
numbers = (numbers+",").replaceAll("([0-9],)", "\\.$1");
numbers = numbers.substring(0, numbers.length() - 1);
You may use a positive lookahead to check for the , or end of string right after a digit and a zeroth backreference to the whole match:
String s = "9201,92710,94500,920,1002";
System.out.println(s.replaceAll("\\d(?=,|$)", ".$0"));
// => 920.1,9271.0,9450.0,92.0,100.2
See the Java demo and a regex demo.
Details:
\\d - exactly 1 digit...
(?=,|$) - that must be before a , or end of string ($).
A capturing variation (Java demo):
String s = "9201,92710,94500,920,1002";
System.out.println(s.replaceAll("(\\d)(,|$)", ".$1$2"));
You were right to go for the replaceAll method. But your regex was not matching the end of the string, so it missed the last set of numbers.
Here is my take on your problem:
public static void main(String[] args) {
String numbers = "9201,92710,94500,920,1002";
System.out.println(numbers.replaceAll("(\\d,|\\d$)", ".$1"));
}
The regex (\\d,|\\d$) matches a digit followed by a comma (\d,) OR (|) a digit followed by the end of the string (\d$).
I have tested it and found it to work.
As others have suggested, you could add a comma at the end, run the replaceAll and then remove it. But that seems like extra effort.
Example:
public static void main(String[] args) {
String numbers = "9201,92710,94500,920,1002";
//add on the comma
numbers += ",";
numbers = numbers.replaceAll("(\\d,)", "\\.$1");
//remove the comma
numbers = numbers.substring(0, numbers.length()-1);
System.out.println(numbers);
}
I am trying to create a String[] which contains only words that are made up of certain characters. For example, I have a dictionary containing a number of words like so:
arm
army
art
as
at
attack
attempt
attention
attraction
authority
automatic
awake
baby
back
bad
bag
balance
I want to narrow the list down so that it only contains words with the characters a, b and g. Therefore the list should only contain the word 'bag' in this example.
Currently I am trying to do this using regexes but having never used them before I can't seem to get it to work.
Here is my code:
public class LetterJugglingMain {
public static void main(String[] args) {
String dictFile = "/Users/simonrhillary/Desktop/Dictionary(3).txt";
fileReader fr = new fileReader();
fr.openFile(dictFile);
String[] dictionary = fr.fileToArray();
String regx = "able";
String[] newDict = createListOfValidWords(dictionary, regx);
printArray(newDict);
}
public static String[] createListOfValidWords(String[] d, String regex){
List<String> narrowed = new ArrayList<String>();
for(int i = 0; i<d.length; i++){
if(d[i].matches(regex)){
narrowed.add(d[i]);
System.out.println("added " + d[i]);
}
}
String[] narrowArray = narrowed.toArray(new String[0]);
return narrowArray;
}
however the array returned is always empty unless the String regex is the exact word! Any ideas? I can post more code if needed...I think I must be trying to initialise the regex wrong.
The narrowed down list must contain ONLY the characters from the regex.
Frankly, I'm not an expert in regexes, but I don't think it's the best tool to do what you want. I would use a method like the following:
public boolean containsAll(String s, Set<Character> chars) {
Set<Character> copy = new HashSet<Character>();
for (int i = 0; i < s.length() && copy.size() < chars.size(); i++) {
char c = s.charAt(i);
if (chars.contains(c)) {
copy.add(c);
}
}
return copy.size() == chars.size();
}
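A possible way to use this method to narrow down the dictionary is sketched below. The class name and the filter helper are assumptions added for illustration; containsAll is the method from above, made static so it can be called from main.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ContainsAllDemo {
    // Same logic as above: true if s contains every character in chars at least once.
    static boolean containsAll(String s, Set<Character> chars) {
        Set<Character> copy = new HashSet<Character>();
        for (int i = 0; i < s.length() && copy.size() < chars.size(); i++) {
            char c = s.charAt(i);
            if (chars.contains(c)) {
                copy.add(c);
            }
        }
        return copy.size() == chars.size();
    }

    // Hypothetical helper: keeps only the words that contain every letter in 'letters'.
    static List<String> filter(String[] dictionary, String letters) {
        Set<Character> required = new HashSet<Character>();
        for (char c : letters.toCharArray()) {
            required.add(c);
        }
        List<String> narrowed = new ArrayList<String>();
        for (String word : dictionary) {
            if (containsAll(word, required)) {
                narrowed.add(word);
            }
        }
        return narrowed;
    }

    public static void main(String[] args) {
        String[] dictionary = { "arm", "attack", "baby", "back", "bad", "bag", "balance" };
        System.out.println(filter(dictionary, "abg")); // prints [bag]
    }
}
```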
The regex able will match only the string "able". However, if you want a regular expression that matches any one of the characters a, b, l or e, the regex you're looking for is [able] (in brackets). If you want words made up of several such characters, add a + to repeat the pattern: [able]+.
The OP wants words that contain every character. Not just one of them.
And other characters are not a problem.
If this is the case, I think the simplest way would be to loop through the entire string, character by character, and check whether it contains all of the characters you want. Keep flags to track whether every character has been found.
If this isn't the case.... :
Try using the regex:
^[able]+$
Here's what it does:
^ matches the beginning of the string and $ matches the end of the string. This makes sure that you're not getting a partial match.
[able] matches the characters you want the string to consist of, in this case a, b, l, and e. + Makes sure that there are 1 or more of these characters in the string.
Note: This regex will match a string that consists only of these 4 letters, in any order and any number of times. For example, it will match:
able, albe, aeble, aaaabbblllleeee
and will not match
qable, treatable, and abled.
Here is a sample regex that keeps only words containing at least one occurrence of every character in a set. This will match any English word (case-insensitive) that contains at least one occurrence of each of the characters a, b, g:
(?i)(?=.*a)(?=.*b)(?=.*g)[a-z]+
Example of strings that match would be bag, baggy, grab.
Example of strings that don't match would be big, argument, nothing.
The (?i) turns on the case-insensitive flag.
You need to append as many (?=.*<character>) as the number of characters in the set, for each of the characters.
I assume a word only contains English alphabet, so I specify [a-z]. Specify more if you need space, hyphen, etc.
I assume you are using the matches(String regex) method of the String class, which must match the whole string, so I omitted the ^ and $.
The performance may be bad, since in the worst case (the characters are found at the end of the words), I think that the regex engine may go through the string for around n times where n is the number of characters in the set. It may not be an actual concern at all, since the words are very short, but if it turns out that this is a bottleneck, you may consider doing simple looping.
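Since one look-ahead is needed per required character, the regex can be built in a loop. The following is only a sketch: the names LookaheadFilter, buildRegex and filter are invented here, and it assumes the required characters are plain letters that need no escaping.

```java
import java.util.ArrayList;
import java.util.List;

public class LookaheadFilter {
    // Hypothetical helper: builds (?i)(?=.*a)(?=.*b)...[a-z]+ from the given letters.
    // Assumes 'required' holds only letters, so no regex escaping is done.
    static String buildRegex(String required) {
        StringBuilder sb = new StringBuilder("(?i)");
        for (char c : required.toCharArray()) {
            sb.append("(?=.*").append(c).append(')');
        }
        return sb.append("[a-z]+").toString();
    }

    // Keeps the words that contain every required letter at least once.
    static List<String> filter(String[] words, String required) {
        String regex = buildRegex(required);
        List<String> out = new ArrayList<String>();
        for (String w : words) {
            if (w.matches(regex)) { // matches() must cover the whole word
                out.add(w);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String[] words = { "bag", "baggy", "grab", "big", "argument", "nothing" };
        System.out.println(filter(words, "abg")); // prints [bag, baggy, grab]
    }
}
```

If the same set of letters is used for a large dictionary, compiling the regex once with Pattern.compile and reusing the Pattern would avoid recompiling it for every word.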
Alright, so here is my problem. Basically I have a string with 4 words in it, with each word separated by a #. What I need to do is use the substring method to extract each word and print it out. I am having trouble figuring out the parameters for it though. I can always get the first one right, but the following ones generally have problems.
Here is the first piece of the code:
word = format.substring( 0 , format.indexOf('#') );
Now from what I understand this basically means start at the beginning of the string, and end right before the #. So using the same logic, I tried to extract the second word like so:
wordTwo = format.substring ( wordlength + 1 , format.indexOf('#') );
//The plus one so I don't start at the #.
But with this I continually get errors saying it doesn't exist. I figured that the compiler was trying to read the first # before the second word, so I rewrote it like so:
wordTwo = format.substring (wordlength + 1, 1 + wordLength + format.indexOf('#') );
And with this it just completely screws it up, either not printing the second word or not stopping in the right place. If I could get any help on the formatting of this, it would be greatly appreciated. Since this is for a class, I am limited to using very basic methods such as indexOf, length, substring etc., so if you could refrain from using anything too complex that would be amazing!
If you have to use substring then you need to use the variant of indexOf that takes a start index. This means you can look for the second # by starting the search after the first one, i.e.
wordTwo = format.substring ( wordlength + 1 , format.indexOf('#', wordlength + 1 ) );
There are however much better ways of splitting a string on a delimiter like this. You can use a StringTokenizer. This is designed for splitting strings like this. Basically:
StringTokenizer tok = new StringTokenizer(format, "#");
String word = tok.nextToken();
String word2 = tok.nextToken();
String word3 = tok.nextToken();
Or you can use the String.split method which is designed for splitting strings. e.g.
String[] parts = format.split("#");
String word = parts[0];
String word2 = parts[1];
String word3 = parts[2];
You can go with split() for this kind of formatting strings.
For instance if you have string like,
String text = "Word1#Word2#Word3#Word4";
You can use delimiter as,
String delimiter = "#";
Then create an string array like,
String[] temp;
For splitting string,
temp = text.split(delimiter);
You can get words like this,
// temp[0] is "Word1"
// temp[1] is "Word2"
// temp[2] is "Word3"
// temp[3] is "Word4"
Use split() method to do this with "#" as the delimiter
String s = "hi#vivek#is#good";
String temp = new String();
String[] arr = s.split("#");
for(String x : arr){
temp = temp + x;
}
Or if you want to extract each word, you already have it in arr:
arr[0] ---> First Word
arr[1] ---> Second Word
arr[2] ---> Third Word
I suggest that you have a look at the Javadoc for String before you proceed further.
Since this is your homework, I'll give you a couple of hints and maybe you can solve it yourself:
The signature of substring is public String substring(int beginIndex, int endIndex). As per the Javadoc for this method:
Returns a new string that is a substring of this string. The substring
begins at the specified beginIndex and extends to the character at
index endIndex - 1. Thus the length of the substring is
endIndex-beginIndex.
Note that if you have to use this method, you'll have to shift your beginIndex and endIndex each time because, in your situation, you have multiple words that are separated by #.
However if you look closely, there's another method in String class that might be helpful to you. That's the public String[] split(String regex) method. The javadoc for this one states:
Splits this string around matches of the given regular expression.
This method works as if by invoking the two-argument split method with
the given expression and a limit argument of zero. Trailing empty
strings are therefore not included in the resulting array.
The split() method looks pretty interesting for your case. You can split your String with the delimiter that you have as the parameter to this method, get the String array and work with that.
Hope this helps you to understand your problem and get started towards a solution :)
Since this is homework, it may be better to try to write it yourself. But I will give a clue.
Clue:
The indexOf method has another overload: int indexOf(int chr,
int fromIndex), which finds the first occurrence of the character chr
in the string, starting the search at fromIndex.
http://docs.oracle.com/javase/1.4.2/docs/api/java/lang/String.html
From this clue, the program will look something like this:
Find the index of the first '#' from the start of the string.
Extract the word from 0th character to that index.
Find the index of the first '#' from the character AFTER the first '#'.
Extract the word from just after the first '#' to that index.
... Just do it until you get 4 words or the string ends.
Hope this helps.
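The steps above can be sketched as a loop that uses only indexOf and substring. The class name, the extract method and the use of an array to hold the results are assumptions for illustration; if arrays are not allowed in the assignment, the same loop body can print each word directly.

```java
public class HashWords {
    // Hypothetical helper: extracts 'count' words separated by '#'.
    static String[] extract(String format, int count) {
        String[] words = new String[count];
        int start = 0; // index where the current word begins
        for (int i = 0; i < count; i++) {
            int end = format.indexOf('#', start); // next '#' at or after start
            if (end == -1) {
                end = format.length();            // last word runs to the end
            }
            words[i] = format.substring(start, end);
            start = Math.min(end + 1, format.length()); // skip past the '#'
        }
        return words;
    }

    public static void main(String[] args) {
        for (String w : extract("one#two#three#four", 4)) {
            System.out.println(w);
        }
    }
}
```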
I don't know why you're forced to use String#substring, but as others have mentioned, it seems like the wrong method for the kind of functionality you need.
String#split(String regex) is what you would use for such a problem, or, if your input sequence is something you don't control, I would suggest you look at the overloaded method String#split(String regex, int limit); this way you can impose a limit on the amount of matches you make, controlling your resulting array.
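A small illustration of the limit argument, using a made-up input string and hypothetical names: with a limit of 2 the pattern is applied at most once, so everything after the first # stays in the last element.

```java
import java.util.Arrays;

public class SplitLimitDemo {
    // Thin wrapper, for illustration only, around String#split(String, int).
    static String[] splitAtMost(String s, int limit) {
        return s.split("#", limit);
    }

    public static void main(String[] args) {
        String text = "Word1#Word2#Word3#Word4";
        System.out.println(Arrays.toString(splitAtMost(text, 2)));
        // prints [Word1, Word2#Word3#Word4]
        System.out.println(Arrays.toString(text.split("#")));
        // prints [Word1, Word2, Word3, Word4]
    }
}
```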
Let's say I have a String, foo, with values separated by whitespace:
[value 1] [value 2] [value 3] ... [value n]
What would the regular expression be to split(...) foo, such that the resulting String array contained all values except value 1? Is this even possible?
Here's what I have thus far:
String[] array = foo.split("\\s");
And that's not much, I know.
EDIT:
I'm looking to accomplish this purely through regular expressions. If this is not possible, please let me know!
Once you've split your string into an array of values, loop through the array and do whatever you need, skipping the first iteration.
for (int i = 1; i < array.length; i++) {
//Act on the data value
}
Your delimiter could be "either a whitespace sequence OR a chunk of non-whitespace at the beginning of the string", but this leaves you an empty string at the front:
Arrays.toString("abc def ghi jkl".split("\\s+|^\\S+\\s+"))
produces
[, def, ghi, jkl]
That is the problem with split -- you will, I think, always get something at the beginning of your array. Unfortunately I think you need to whack off the front of the string before splitting, or use Java's Arrays.copyOfRange() or similar to post-process.
Dropping the beginning can be done with replaceFirst:
import java.util.Arrays;
public class SplitExample {
public static final String data = "abc def ghi";
public static void main(String[] args) {
System.out.println(Arrays.toString(data.split("\\s+")));
System.out.println(Arrays.toString(data.split("\\s+|^\\S+\\s+")));
System.out.println(Arrays.toString(data.replaceFirst("^\\S+\\s+", "").split("\\s+")));
}
}
The final line is as close as I can get, because split produces matches AROUND your delimiters. How can you avoid the blank string at the front with a single split? I am not sure there is a way....
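For completeness, the Arrays.copyOfRange post-processing mentioned above might look like this. The class and method names are hypothetical, and the sketch assumes the values themselves contain no whitespace.

```java
import java.util.Arrays;

public class DropFirst {
    // Hypothetical helper: splits on whitespace, then drops the first value.
    static String[] allButFirst(String foo) {
        String[] all = foo.trim().split("\\s+");
        // Copy elements 1..end; yields an empty array if there is only one value.
        return Arrays.copyOfRange(all, Math.min(1, all.length), all.length);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(allButFirst("abc def ghi")));
        // prints [def, ghi]
    }
}
```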
Since the consensus seems to be that it is impossible, I propose this solution:
Assuming you only have ONE space in the 'junk' value,
int ix = theString.indexOf(" ");
ix = theString.indexOf(" ", ix + 1); // search from just after the first space
String[] values = theString.substring(ix + 1).split("\\s");
This gets the substring from the second space in the string (the first space after the space in the 'junk' value) then splits it.
Would you be able to do this?
String[] array = foo.split("\\S*\\s", 2)[1].split("\\s");
It's two regex splits instead of one, but it's neater than looping later. The first split, limited to two pieces, separates the string into "any number of non-whitespace characters followed by a whitespace" and everything else. You can then split everything else by whitespace only, and you'll be left with an array excluding the first element.
Edit: Yeah, it can't be done with a single split since the only way to have anything other than "" as the first element in your array is to have something you're not removing with the split at the front of the string.