Regex to split a string into different parts (using Java) - java

I'm looking for a regex to split the following strings
red 12478
blue 25 12375
blue 25, 12364
This should give
Keywords red, ID 12478
Keywords blue 25, ID 12375
Keywords blue IDs 25, 12364
Each line has 2 parts, a set of keywords and a set of IDs. Keywords are separated by spaces and IDs are separated by commas.
I came up with the following regex: \s*((\S+\s+)+?)([\d\s,]+)
However, it fails for the second one. I've been trying to work with lookahead, but can't quite work it out
I am trying to split the string into its component parts (keywords and IDs)
The format of each line is one or more space separated keywords followed by one or more comma separated IDs. IDs are numeric only and keywords do not contain commas.
I'm using Java to do this.

I found a two-line solution using replaceAll and split:
String pattern = "(\\S+(?<!,)\\s+(\\d+\\s+)*)";
String[] keywords = theString.replaceAll(pattern+".*","$1").split(" ");
String[] ids = theString.split(pattern)[1].split(",\\s?");
I assumed that the comma will always be immediately after the ID for each ID (this can be enforced by removing spaces adjacent to a comma), and that there is no trailing space.
I also assumed that the first keyword is a sequence of non-whitespace chars (without trailing comma) \\S+(?<!,)\\s+, and the rest of the keywords (if any) are digits (\\d+\\s+)*. I made this assumption based on your regex attempt.
The regex here is very simple, just take (greedily) any sequence of valid keywords that is followed by a space (or whitespaces). The longest will be the list of keywords, the rest will be the IDs.
Full Code:
public static void main(String[] args){
String pattern = "(\\S+(?<!,)\\s+(\\d+\\s+)*)";
Scanner sc = new Scanner(System.in);
while(true){
String theString = sc.nextLine();
String[] keywords = theString.replaceAll(pattern+".*","$1").split(" ");
String[] ids = theString.split(pattern)[1].split(",\\s?");
System.out.println("Keywords:");
for(String keyword: keywords){
System.out.println("\t"+keyword);
}
System.out.println("IDs:");
for(String id: ids){
System.out.println("\t"+id);
}
System.out.println();
}
}
Sample run:
red 124
Keywords:
red
IDs:
124
red 25 124
Keywords:
red
25
IDs:
124
red 25, 124
Keywords:
red
IDs:
25
124
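As an alternative sketch for the format described in the question (space-separated keywords, then comma-separated numeric IDs running to the end of the line), a single Matcher pass can capture both parts. The class name KeywordIdSplitter and the method split are made up for illustration, and the pattern assumes every line ends with its ID list.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class KeywordIdSplitter {
    // Group 1: the keywords, matched lazily so the group ends as early as possible.
    // Group 2: one number, optionally followed by more comma-separated numbers,
    //          which must run to the end of the line (assumed format).
    private static final Pattern LINE =
        Pattern.compile("^\\s*(\\S.*?)\\s+(\\d+(?:\\s*,\\s*\\d+)*)\\s*$");

    /** Returns {keywords, ids}, or null if the line does not fit the assumed format. */
    static String[][] split(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            return null;
        }
        String[] keywords = m.group(1).split("\\s+");
        String[] ids = m.group(2).split("\\s*,\\s*");
        return new String[][] { keywords, ids };
    }

    public static void main(String[] args) {
        for (String line : new String[] {"red 12478", "blue 25 12375", "blue 25, 12364"}) {
            String[][] parts = split(line);
            System.out.println("Keywords " + String.join(" ", parts[0])
                + ", IDs " + String.join(", ", parts[1]));
        }
    }
}
```

Because the keyword group is lazy and the ID group is anchored to the end of the line, a number such as 25 is only treated as an ID when a comma ties it to the trailing list.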

I came up with:
(red|blue)( \d+(?!$)(?:, \d+)*)?( \d+)?$
as illustrated in http://rubular.com/r/y52XVeHcxY which seems to pass your tests. It's a straightforward matter to insert your keywords between the match substrings.

Ok since the OP didn't specify a target language, I am willing to tilt at this windmill over lunch as a brain teaser and provide a C#/.Net Regex replace with match evaluator which gives the required output:
Keywords red, ID 12478
Keywords blue 25 ID 12375
Keywords blue IDs 25, 12364
Note that there is no error checking. It is a fine example of using a lambda expression as the match evaluator, which returns the replacement according to the rules. Also note that, because the sample data is so small, it does not handle multiple IDs/keywords, as may actually be required.
string data = @"red 12478
blue 25 12375
blue 25, 12364";
var pattern = @"(?xmn) # x=IgnorePatternWhiteSpace m=multiline n=explicit capture
^
(?<Keyword>[^\s]+) # Match Keyword Color
[\s,]+
(
(?<Numbers>\d+)
(?<HasComma>,)? # If there is a comma that signifies IDs
[,\s]*
)+ # 1 or more values
$";
Console.WriteLine (Regex.Replace(data, pattern, (mtch) =>
{
StringBuilder sb = new StringBuilder();
sb.AppendFormat("Keywords {0}", mtch.Groups["Keyword"].Value);
var values = mtch.Groups["Numbers"]
.Captures
.OfType<Capture>()
.Select (cp => cp.Value)
.ToList();
if (mtch.Groups["HasComma"].Success)
{
sb.AppendFormat(" IDs {0}", string.Join(", ", values));
}
else
{
if (values.Count() > 1)
sb.AppendFormat(" {0} ID {1}", values[0], values[1] );
else
sb.AppendFormat(", ID {0}", values[0]);
}
return sb.ToString();
}));

Related

Replacing consecutive repeated characters in java

I am working on Twitter data normalization. Twitter users frequently use terms like "I looooooove it" in order to emphasize the word love. I want to reduce such repeated characters to a proper English word by removing repeated characters until I get a meaningful word (I am aware that I cannot differentiate between good and god by this mechanism).
My strategy would be
1. Identify the existence of such repeated strings. I would look for more than 2 of the same character, as there is probably no English word with more than two repeated characters.
String[] strings = { "stoooooopppppppppppppppppp","looooooove", "good","OK", "boolean", "mee", "claaap" };
String regex = "([a-z])\\1{2,}";
Pattern pattern = Pattern.compile(regex);
for (String string : strings) {
Matcher matcher = pattern.matcher(string);
if (matcher.find()) {
System.out.println(string+" TRUE ");
}
}
2. Search for such words in a lexicon like WordNet
3. Replace all but two of the repeated characters and check the lexicon
4. If the word is not in the lexicon, remove one more repeated character (otherwise treat it as a misspelling).
Due to my poor Java knowledge I am unable to manage 3 and 4. The problem is that I cannot replace all but two of the repeated consecutive characters.
The following code snippet replaces all but one of the repeated characters: System.out.println(data.replaceAll("([a-zA-Z])\\1{2,}", "$1"));
Help is required to find out
A. How to replace all but 2 consecutive repeat characters
B. How to remove one more consecutive character from the output of A
[I think B can be managed by the following code snippet]
System.out.println(data.replaceAll("([a-zA-Z])\\1{1,}", "$1"));
Edit: Solution provided by Wiktor Stribiżew works perfectly in Java. I was wondering what changes are required to get the same result in python.
Python uses re.sub.
Your regex ([a-z])\\1{2,} matches and captures an ASCII letter into Group 1 and then matches 2 or more occurrences of this value. So, all you need is to replace the match with a backreference, $1, that holds the captured value. If you use one $1, aaaaa will be replaced with a single a, and if you use $1$1, it will be replaced with aa.
String twoConsecutivesOnly = data.replaceAll(regex, "$1$1");
String noTwoConsecutives = data.replaceAll(regex, "$1");
See the Java demo.
If you need to make your regex case insensitive, use "(?i)([a-z])\\1{2,}" or even "(\\p{Alpha})\\1{2,}". If any Unicode letters must be handled, use "(\\p{L})\\1{2,}".
BONUS: In a general case, to replace any amount of any repeated consecutive chars use
text = text.replaceAll("(?s)(.)\\1+", "$1"); // any chars
text = text.replaceAll("(.)\\1+", "$1"); // any chars but line breaks
text = text.replaceAll("(\\p{L})\\1+", "$1"); // any letters
text = text.replaceAll("(\\w)\\1+", "$1"); // any ASCII alnum + _ chars
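Steps 3 and 4 from the question can be sketched by chaining the two replacements and checking a lexicon after each one. This is only an illustration: the Set<String> stands in for a real lexicon such as WordNet, and the class and method names (ElongationNormalizer, normalize) are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class ElongationNormalizer {
    /**
     * Shortens runs of repeated letters until the word is found in the lexicon.
     * Returns the original word if no shortened form is found (treated as a misspelling).
     */
    static String normalize(String word, Set<String> lexicon) {
        if (lexicon.contains(word)) {
            return word;
        }
        // Step 3: keep two of every letter that is repeated three or more times.
        String twoLeft = word.replaceAll("([a-zA-Z])\\1{2,}", "$1$1");
        if (lexicon.contains(twoLeft)) {
            return twoLeft;
        }
        // Step 4: collapse the remaining doubled letters to a single letter.
        String oneLeft = twoLeft.replaceAll("([a-zA-Z])\\1+", "$1");
        if (lexicon.contains(oneLeft)) {
            return oneLeft;
        }
        return word;
    }

    public static void main(String[] args) {
        // Tiny stand-in lexicon, an assumption for the demo only.
        Set<String> lexicon = new HashSet<>(Arrays.asList("stop", "love", "good", "boolean"));
        for (String w : new String[] {"stoooooopppppppppppppppppp", "looooooove", "good", "claaap"}) {
            System.out.println(w + " -> " + normalize(w, lexicon));
        }
    }
}
```

Note that step 4 here collapses every doubled letter at once, so a word that needs one run kept double and another reduced to single would not be found by this sketch.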
/*This code checks a character in a given string repeated consecutively 3 times
if you want to check for 4 consecutive times change count==2--->count==3 OR
if you want to check for 2 consecutive times change count==2--->count==1*/
public class Test1 {
static char ch;
public static void main(String[] args) {
String str="aabbbbccc";
char[] charArray = str.toCharArray();
int count=0;
for(int i=0;i<charArray.length;i++){
if(i!=0 ){
if(charArray[i]==ch)continue;//ddddee
if(charArray[i]==charArray[i-1]) {
count++;
if(count==2){
System.out.println(charArray[i]);
count=0;
ch=charArray[i];
}
}
else{
count=0;//aabb
}
}
}
}
}

How do I count repetitive/continuous appearance of a character in String(When I don't know index of start/end)?

So if I have 22332, I want to replace it with BEA, as on a mobile keypad. I want to see how many times a digit appears so that I can count A--2, B--22, C--222, D--3, E--33, F--333, etc. (and a 0 is a pause). I want to write a decoder that takes in a digit string and replaces digit occurrences with letters. Example: 44335557075557777 will be decoded as HELP PLS.
This is the key portion of the code:
public void printMessages() throws Exception {
    File msgFile = new File("messages.txt");
    Scanner input = new Scanner(msgFile);
    while(input.hasNext()) {
        String x = input.next();
        String y = input.nextLine();
        System.out.println(x+":"+y);
    }
}
It takes the input from a file as a digit string. Then the Scanner prints the digits. I tried to split the string of digits, but I don't know how to evaluate the repeated appearances described in the question.
for(String x : b.split(""))
System.out.print(x);
gives: 44335557075557777(input from the file).
I don't know how I can find each run of repeated digits and see how they form the patterns used on a mobile keypad. If I use a for loop then I have to cycle through the whole string and use lots of if statements. There must be some other way.
Another suggestion is to use a regex to break up the encoded string.
Using look-around plus a back-reference makes it easy to split the string at positions where the preceding and following characters are different.
e.g.
String line = "44335557075557777";
String[] tokens = line.split("(?<=(.))(?!\\1)");
// tokens will contain ["44", "33", "555", "7", "0", "7", "555", "7777"]
Then it should be trivial for you to map each string to its corresponding character, either with a Map or even naively with a bunch of if-else statements.
Edit: Some background on the regex
(?<=(.))(?!\1)
(?<= ) : Look behind group, which means finding
something (a zero-length pattern in this example)
preceded by this group of pattern
( ) : capture group #1
. : any char
: zero-length pattern between look behind and look
ahead group
(?! ) : Negative look ahead group, which means finding
a pattern (zero-length in this example) NOT followed
by this group of pattern
\1 : back-reference, whatever matched by
capture group #1
So it means, find any zero-length positions, for which the character before and after such position is different, and use such positions to do splitting.
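Putting that split together with a lookup table gives a small decoder. The sketch below is an assumption-laden illustration, not part of the original answer: the KEYS table assumes the classic keypad layout, 0 is decoded as a space (as in the HELP PLS example), and a run longer than the key's letter count wraps around.

```java
class KeypadDecoder {
    // Letters on each key, indexed by digit (assumed classic keypad layout).
    // Key 0 is treated as a space and key 1 has no letters.
    private static final String[] KEYS = {
        " ", "", "ABC", "DEF", "GHI", "JKL", "MNO", "PQRS", "TUV", "WXYZ"
    };

    static String decode(String digits) {
        StringBuilder out = new StringBuilder();
        // Split between any two neighbouring characters that differ.
        for (String run : digits.split("(?<=(.))(?!\\1)")) {
            if (run.isEmpty()) {
                continue; // empty input yields one empty token
            }
            char c = run.charAt(0);
            if (c < '0' || c > '9') {
                throw new IllegalArgumentException("Not a digit: " + c);
            }
            String letters = KEYS[c - '0'];
            if (letters.isEmpty()) {
                continue; // nothing to emit for key 1
            }
            // Run length selects the letter; longer runs wrap around (an assumption).
            out.append(letters.charAt((run.length() - 1) % letters.length()));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(decode("44335557075557777"));
        System.out.println(decode("22332"));
    }
}
```

An array indexed by digit is used instead of a Map only because the keys are the digits 0 to 9; a Map<String, Character> keyed by the whole run would work equally well.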

Regex to match continuous pattern of integer then space

I'm asking the user for input through the Scanner in Java, and now I want to parse out their selections using a regular expression. In essence, I show them an enumerated list of items, and they type in the numbers for the items they want to select, separated by a space. Here is an example:
1 yorkshire terrier
2 staffordshire terrier
3 goldfish
4 basset hound
5 hippopotamus
Type the numbers that correspond to the words you wish to exclude: 3 5
The enumerated list of items can be just a few elements or several hundred. The current regex I'm using looks like this ^|\\.\\s+)\\d+\\s+, but I know it's wrong. I don't fully understand regular expressions yet, so if you can explain what it is doing that would be helpful too!
Pattern pattern = Pattern.compile("^([0-9]*\\s+)*[0-9]*$");
Explanation of the RegEx:
^ : beginning of input
[0-9] : only digits
'*' : any number of digits
\s : a space
'+' : at least one space
'()*' : any number of this digit space combination
$: end of input
This treats all of the following inputs as valid:
"1"
"123 22"
"123 23"
"123456 33 333 3333 "
"12321 44 452 23 "
etc.
You want integers:
\d+
followed by any number of space, then another integer:
\d+( \d+)*
Note that if you want to use a regex in a Java string you need to escape every \ as \\.
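As a small illustration of that escaping, the check below uses the pattern above written as a Java string literal. The class and method names are hypothetical, and the input is trimmed first on the assumption that leading and trailing spaces should be tolerated.

```java
class SelectionValidator {
    // The regex \d+( \d+)* becomes "\\d+( \\d+)*" inside a Java string literal.
    static boolean isValid(String input) {
        // String.matches requires the whole string to match, so no ^ or $ is needed.
        return input.trim().matches("\\d+( \\d+)*");
    }

    public static void main(String[] args) {
        System.out.println(isValid("3 5"));
        System.out.println(isValid("3 five"));
    }
}
```

This version accepts exactly one space between numbers; replacing the literal space with \\s+ would allow any run of whitespace.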
To "parse out" the integers, you don't necessarily want to match the input, but rather you want to split it on spaces (which uses regex):
String[] nums = input.trim().split("\\s+");
If you actually want int values:
List<Integer> selections = new ArrayList<>();
for (String num : input.trim().split("\\s+"))
selections.add(Integer.parseInt(num));
If you want to ensure that your string contains only numbers and spaces (with a variable number of spaces and trailing/leading spaces allowed) and extract number at the same time, you can use the \G anchor to find consecutive matches.
String source = "1 3 5 8";
List<String> result = new ArrayList<String>();
Pattern p = Pattern.compile("\\G *(\\d++) *(?=[\\d ]*$)");
Matcher m = p.matcher(source);
while (m.find()) {
result.add(m.group(1));
}
for (int i=0;i<result.size();i++) {
System.out.println(result.get(i));
}
Note: at the beginning of a global search, \G matches the start of the string.

I want to perform a split() on a string using a regex in Java, but would like to keep the delimited tokens in the array [duplicate]

This question already exists:
Is there a way to split strings with String.split() and include the delimiters? [duplicate]
Closed 8 years ago.
How can I format my regex to allow this?
Here's the regular expression:
"\\b[(\\w'\\-)&&[^0-9]]{4,}\\b"
It's looking for any word that is 4 letters or greater.
If I want to split, say, an article, I want an array that includes all the delimited values, plus all the values between them, all in the order that they originally appeared in. So, for example, if I want to split the following sentence: "I need to purchase a new vehicle. I would prefer a BMW.", my desired result from the split would be the following, where the italicized values are the delimiters.
"I ", "need", " to ", "purchase", " a new ", "vehicle", ". I ", "would", " ", "prefer", " a BMW."
So, all words with 4 or more characters are one token, while everything in between each delimited value is also a single token (even if it is multiple words with whitespace). I will only be modifying the delimited values and would like to keep everything else the same, including whitespace, new lines, etc.
I read in a different thread that I could use a lookaround to get this to work, but I can't seem to format it correctly. Is it even possible to get this to work the way I'd like?
I am not sure what you are trying to do, but in case you want to modify words that have at least four letters, you can use something like this (it changes words with 4 or more letters to their upper-case version):
String data = "I need to purchase a new vehicle. I would prefer a BMW.";
Pattern patter = Pattern.compile("(?<![a-z\\-_'])[a-z\\-_']{4,}(?![a-z\\-_'])",
Pattern.CASE_INSENSITIVE);
Matcher matcher = patter.matcher(data);
StringBuffer sb = new StringBuffer();// holder of new version of our
// data
while (matcher.find()) {// lets find all words
// and change them with its upper case version
matcher.appendReplacement(sb, matcher.group().toUpperCase());
}
matcher.appendTail(sb);// lets not forget about part after last match
System.out.println(sb);
Output:
I NEED to PURCHASE a new VEHICLE. I WOULD PREFER a BMW.
OR if you change replacing code to something like
matcher.appendReplacement(sb, "["+matcher.group()+"]");
you will get
I [need] to [purchase] a new [vehicle]. I [would] [prefer] a BMW.
Now you can just split such string on every [ and ] to get your desired array.
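If the array itself is the goal, the bracket step can be skipped by collecting tokens directly while matching. The following is a minimal sketch that reuses the pattern from this answer; the class name SplitKeepingMatches and the method tokenize are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SplitKeepingMatches {
    // Same word pattern as above: runs of 4+ letters, hyphens, underscores or apostrophes.
    private static final Pattern WORD = Pattern.compile(
        "(?<![a-z\\-_'])[a-z\\-_']{4,}(?![a-z\\-_'])", Pattern.CASE_INSENSITIVE);

    /** Returns matched words and the text between them, in original order. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = WORD.matcher(text);
        int last = 0; // end of the previous match
        while (m.find()) {
            if (m.start() > last) {
                tokens.add(text.substring(last, m.start())); // text between matches
            }
            tokens.add(m.group()); // the matched word itself
            last = m.end();
        }
        if (last < text.length()) {
            tokens.add(text.substring(last)); // tail after the last match
        }
        return tokens;
    }

    public static void main(String[] args) {
        String data = "I need to purchase a new vehicle. I would prefer a BMW.";
        for (String token : tokenize(data)) {
            System.out.print("[" + token + "]");
        }
        System.out.println();
    }
}
```

Unlike splitting on inserted brackets, this does not break if the input text already contains [ or ] characters.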
Assuming that "word" is defined as [A-Za-z], you can use this regex:
(?<=(\\b[A-Za-z]{4,50}\\b))|(?=(\\b[A-Za-z]{4,50}\\b))
Full code:
class RegexSplit{
public static void main(String[] args){
String str = "I need to purchase a new vehicle. I would prefer a BMW.";
String[] tokens = str.split("(?<=(\\b[A-Za-z]{4,50}\\b))|(?=(\\b[A-Za-z]{4,50}\\b))");
for(String token: tokens){
System.out.print("["+token+"]");
}
System.out.println();
}
}
to get this output:
[I ][need][ to ][purchase][ a new ][vehicle][. I ][would][ ][prefer][ a BMW.]

StringTokenizer -How to ignore spaces within a string

I am trying to use a StringTokenizer on a list of words as below
String sentence = "\"Name\":\"jon\" \"location\":\"3333 abc street\" \"country\":\"usa\""; // etc.
When I use StringTokenizer and give space as the delimiter as below
StringTokenizer tokens = new StringTokenizer(sentence, " ");
I was expecting my output as different tokens as below
Name:jon
location:3333 abc street
country:usa
But the string tokenizer tries to tokenize on the value of location also and it appears like
Name:jon
location:3333
abc
street
country:usa
Please let me know how I can fix the above, and if I need to use a regex, what kind of expression I should specify.
This can be easily handled using a CSV Reader.
String str = "\"Name\":\"jon\" \"location\":\"3333 abc street\" \"country\":\"usa\"";
// prepare String for CSV parsing
CsvReader reader = CsvReader.parse(str.replaceAll("\" *: *\"", ":"));
reader.setDelimiter(' '); // use space a delimiter
reader.readRecord(); // read CSV record
for (int i=0; i<reader.getColumnCount(); i++) // loop thru columns
System.out.printf("Scol[%d]: [%s]%n", i, reader.get(i));
Update: And here is pure Java SDK solution:
Pattern p = Pattern.compile("(.+?)(\\s+(?=(?:(?:[^\"]*\"){2})*[^\"]*$)|$)");
Matcher m = p.matcher(str);
for (int i=0; m.find(); i++)
System.out.printf("Scol[%d]: [%s]%n", i, m.group(1).replace("\"", ""));
OUTPUT:
Scol[0]: [Name:jon]
Scol[1]: [location:3333 abc street]
Scol[2]: [country:usa]
Live Demo: http://ideone.com/WO0NK6
Explanation: As per OP's comments:
I am using this regex:
(.+?)(\\s+(?=(?:(?:[^\"]*\"){2})*[^\"]*$)|$)
Breaking it down now into smaller chunks.
PS: DQ represents Double quote
(?:[^\"]*\") 0 or more non-DQ characters followed by one DQ (RE1)
(?:[^\"]*\"){2} Exactly a pair of above RE1
(?:(?:[^\"]*\"){2})* 0 or more occurrences of pair of RE1
(?:(?:[^\"]*\"){2})*[^\"]*$ 0 or more occurrences of pair of RE1 followed by 0 or more non-DQ characters followed by end of string (RE2)
(?=(?:(?:[^\"]*\"){2})*[^\"]*$) Positive lookahead of above RE2
.+? Match 1 or more characters (? is for non-greedy matching)
\\s+ Should be followed by one or more spaces
(\\s+(?=RE2)|$) Should be followed by space or end of string
In short: it matches one or more of any character followed by "a space OR end of string". The space must be followed by an EVEN number of DQs. Hence a space outside double quotes will be matched and a space inside double quotes will not (since those are followed by an odd number of DQs).
StringTokenizer is too simple-minded for this job. If you don't need to deal with quote marks inside the values, you can try this regex:
String s = "\"Name\":\"jon\" \"location\":\"3333 abc street\" \"country\":\"usa\"";
Pattern p = Pattern.compile("\"([^\"]*)\"");
Matcher m = p.matcher(s);
while (m.find()) {
System.out.println(m.group(1));
}
Output:
Name
jon
location
3333 abc street
country
usa
This won't handle internal quote marks within values—where the output should be, e.g.,
Name:Fred ("Freddy") Jones
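To get the key:value lines the question asks for, the same idea can be extended to capture each quoted key and quoted value as a pair. This is a sketch under the same limitation (no quote marks inside keys or values); the class name QuotedPairs and the method parse are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuotedPairs {
    // "key" : "value", where neither part may contain a double quote.
    private static final Pattern PAIR =
        Pattern.compile("\"([^\"]*)\"\\s*:\\s*\"([^\"]*)\"");

    /** Returns one "key:value" string per quoted pair found in the input. */
    static List<String> parse(String s) {
        List<String> pairs = new ArrayList<>();
        Matcher m = PAIR.matcher(s);
        while (m.find()) {
            pairs.add(m.group(1) + ":" + m.group(2));
        }
        return pairs;
    }

    public static void main(String[] args) {
        String s = "\"Name\":\"jon\" \"location\":\"3333 abc street\" \"country\":\"usa\"";
        for (String pair : parse(s)) {
            System.out.println(pair);
        }
    }
}
```

Spaces inside a value are kept because the value group only stops at the closing double quote.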
You can use JSON; it looks like you are using a JSON-like schema.
Search a bit and try to implement it with JSON.
String sentence=""Name":"jon" "location":"3333 abc street" "country":"usa"" etc
This will be a key/value pair in JSON: Name is the key and jon is the value, location is the key and 3333 abc street is the value, and so on.
Give it a try.
Here is one link
http://www.mkyong.com/java/json-simple-example-read-and-write-json/
Edit:
It's a bit of a silly answer, but you can try something like this:
sentence = sentence.replaceAll("\" ", "");
StringTokenizer tokens=new StringTokenizer(sentence,"");
