StringTokenizer -How to ignore spaces within a string

I am trying to use a StringTokenizer on a list of words like the one below:
String sentence = "\"Name\":\"jon\" \"location\":\"3333 abc street\" \"country\":\"usa\""; // and so on
When I use StringTokenizer with a space as the delimiter, as below,
StringTokenizer tokens = new StringTokenizer(sentence, " ");
I expected the output to be separate tokens like these:
Name:jon
location:3333 abc street
country:usa
But the StringTokenizer also splits the value of location, so the output looks like this:
Name:jon
location:3333
abc
street
country:usa
Please let me know how I can fix this. If I need a regex, what kind of expression should I specify?

This can be easily handled using a CSV Reader.
String str = "\"Name\":\"jon\" \"location\":\"3333 abc street\" \"country\":\"usa\"";
// prepare String for CSV parsing
CsvReader reader = CsvReader.parse(str.replaceAll("\" *: *\"", ":"));
reader.setDelimiter(' '); // use a space as the delimiter
reader.readRecord(); // read the CSV record
for (int i = 0; i < reader.getColumnCount(); i++) // loop through the columns
System.out.printf("Scol[%d]: [%s]%n", i, reader.get(i));
Update: Here is a pure Java solution that needs no external library:
Pattern p = Pattern.compile("(.+?)(\\s+(?=(?:(?:[^\"]*\"){2})*[^\"]*$)|$)");
Matcher m = p.matcher(str);
for (int i=0; m.find(); i++)
System.out.printf("Scol[%d]: [%s]%n", i, m.group(1).replace("\"", ""));
OUTPUT:
Scol[0]: [Name:jon]
Scol[1]: [location:3333 abc street]
Scol[2]: [country:usa]
Live Demo: http://ideone.com/WO0NK6
Explanation: As per OP's comments:
I am using this regex:
(.+?)(\\s+(?=(?:(?:[^\"]*\"){2})*[^\"]*$)|$)
Breaking it down now into smaller chunks.
PS: DQ represents Double quote
(?:[^\"]*\") 0 or more non-DQ characters followed by one DQ (RE1)
(?:[^\"]*\"){2} Exactly a pair of above RE1
(?:(?:[^\"]*\"){2})* 0 or more occurrences of pair of RE1
(?:(?:[^\"]*\"){2})*[^\"]*$ 0 or more occurrences of pair of RE1 followed by 0 or more non-DQ characters followed by end of string (RE2)
(?=(?:(?:[^\"]*\"){2})*[^\"]*$) Positive lookahead of above RE2
.+? Match 1 or more characters (? is for non-greedy matching)
\\s+ Should be followed by one or more spaces
(\\s+(?=RE2)|$) Should be followed by space or end of string
In short: it matches one or more characters followed by either a run of whitespace or the end of the string. The whitespace must be followed by an EVEN number of DQs, so whitespace outside double quotes is matched, while whitespace inside double quotes is not (it is followed by an odd number of DQs).
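The same lookahead can also be used as a split delimiter instead of being matched together with the token. The following is a minimal sketch of that idea, not the answer's original code; the class name QuoteAwareSplit and the method tokenize are made up for illustration, and it assumes the quotes in the input are balanced.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class QuoteAwareSplit {
    // Whitespace that is followed by an even number of double quotes up to
    // the end of the string, i.e. whitespace outside any quoted section.
    private static final Pattern OUTSIDE_QUOTES =
            Pattern.compile("\\s+(?=(?:(?:[^\"]*\"){2})*[^\"]*$)");

    // Hypothetical helper: splits on unquoted whitespace and strips the quotes.
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        for (String part : OUTSIDE_QUOTES.split(input)) {
            tokens.add(part.replace("\"", ""));
        }
        return tokens;
    }

    public static void main(String[] args) {
        String str = "\"Name\":\"jon\" \"location\":\"3333 abc street\" \"country\":\"usa\"";
        for (String token : tokenize(str)) {
            System.out.println(token);
        }
    }
}
```

Because the lookahead scans to the end of the string for every candidate space, this approach can become slow on very long inputs.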

StringTokenizer is too simple-minded for this job. If you don't need to deal with quote marks inside the values, you can try this regex:
String s = "\"Name\":\"jon\" \"location\":\"3333 abc street\" \"country\":\"usa\"";
Pattern p = Pattern.compile("\"([^\"]*)\"");
Matcher m = p.matcher(s);
while (m.find()) {
System.out.println(m.group(1));
}
Output:
Name
jon
location
3333 abc street
country
usa
This won't handle internal quote marks within values—where the output should be, e.g.,
Name:Fred ("Freddy") Jones
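If the keys and values should stay paired rather than being printed one per line, the same quoted-string idea can be extended with a second capturing group. This is a sketch under the same assumption (no quote marks inside keys or values); the class QuotedPairs and its parse method are illustrative names, not part of the answer above.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuotedPairs {
    // "key":"value" where neither side contains a double quote.
    private static final Pattern PAIR =
            Pattern.compile("\"([^\"]*)\"\\s*:\\s*\"([^\"]*)\"");

    // Hypothetical helper: collects the pairs in the order they appear.
    static Map<String, String> parse(String input) {
        Map<String, String> result = new LinkedHashMap<>();
        Matcher m = PAIR.matcher(input);
        while (m.find()) {
            result.put(m.group(1), m.group(2));
        }
        return result;
    }

    public static void main(String[] args) {
        String s = "\"Name\":\"jon\" \"location\":\"3333 abc street\" \"country\":\"usa\"";
        for (Map.Entry<String, String> e : parse(s).entrySet()) {
            System.out.println(e.getKey() + ":" + e.getValue());
        }
    }
}
```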

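You can use JSON; it looks like your data follows a JSON-like schema (to be valid JSON it would also need surrounding braces and commas between the pairs).
Search for a Java JSON library and try parsing the data with it.
String sentence = "\"Name\":\"jon\" \"location\":\"3333 abc street\" \"country\":\"usa\""; // and so on
This becomes key-value pairs in JSON: Name is a key and jon is its value, location is a key and 3333 abc street is its value, and so on.
Give it a try.
Here is one link:
http://www.mkyong.com/java/json-simple-example-read-and-write-json/
Edit:
This is a slightly silly answer, but you can also try something like the following. It replaces the space between a closing and an opening quote with a newline (assumed not to occur in the values) and tokenizes on that:
sentence = sentence.replaceAll("\" \"", "\"\n\"");
StringTokenizer tokens = new StringTokenizer(sentence, "\n");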

Related

How to build a Regex in java to detect a whitespace or end of a string?

I am trying to build a regex to find and extract the part of a string that contains a post office box.
Here are some examples:
str = "some text p.o. box 12456 Floor 105 streetName Street";
str = "po box 1011";
str = "post office Box 12 Floor 105 Tallapoosa Street";
str = "leclair ryan pc p.o. Box 2499 8th floor 951 east byrd street";
str = "box 1 slot 3 building 2 136 harvey road";
Here is my pattern and code:
Pattern p = Pattern.compile("p.*o.*box \\d+(\\z|\\s)");
Matcher m = p.matcher(str);
int count =0;
while(m.find()) {
count++;
System.out.println("Match number "+count);
System.out.println("start(): "+m.start());
System.out.println("end(): "+m.end());
}
It works with the second example but not with the first one!
If I change my pattern to the following:
Pattern p = Pattern.compile("p.*o.*box \\d+ ");
it works only for the first example.
The question is how to group the regex for the end of the string "\z" with the regex for whitespace "\s" or " ".
New pattern:
Pattern p = Pattern.compile("(?i)((p.*o.box\\s\\w\\s*\\d*(\\z|\\s*)|(box\\s*\\w\\s*\\d*(\\z|\\s*)) ))");

You can leverage the following code:
String str = "some text p.o. box 12456 Floor 105 streetName Street";
Pattern p = Pattern.compile("(?i)\\bp\\.?\\s*o\\.?\\s*box\\s*(\\d+)(?:\\z|\\s)");
Matcher m = p.matcher(str);
int count =0;
while(m.find()) {
count++;
System.out.println("Match: "+m.group(0));
System.out.println("Digits: "+m.group(1));
System.out.println("Match number "+count);
System.out.println("start(): "+m.start());
System.out.println("end(): "+m.end());
}
To make the pattern case insensitive, either pass the Pattern.CASE_INSENSITIVE flag to Pattern.compile or prepend the inline (?i) modifier to the pattern.
Also, .* matches any character other than a newline, zero or more times; I guess you wanted to match a literal . optionally. For that you need just the ? quantifier, and you must escape the dot so that it matches a literal dot. Note how I used (...) to capture the digits into Group 1 (this is called a capturing group). The group that matches the end of the string or a whitespace character is a non-capturing group ((?:...)), which is used for grouping only and does not store its value. Since you wanted to match a word boundary there, I suggest replacing (?:\\z|\\s) with a mere \\b:
Pattern p = Pattern.compile("(?i)\\bp\\.?\\s*o\\.?\\s*box\\s*(\\d+)\\b");
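To see how the \\b variant behaves on the sample strings from the question, here is a small self-contained sketch. The class PoBoxFinder and the method boxNumber are illustrative names. Note that this pattern only recognises the "p.o. box" and "po box" spellings; the "post office Box" and bare "box" samples are not matched by it.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PoBoxFinder {
    private static final Pattern PO_BOX =
            Pattern.compile("(?i)\\bp\\.?\\s*o\\.?\\s*box\\s*(\\d+)\\b");

    // Hypothetical helper: returns the box number, or null when the
    // pattern does not occur in the given address.
    static String boxNumber(String address) {
        Matcher m = PO_BOX.matcher(address);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String[] samples = {
            "some text p.o. box 12456 Floor 105 streetName Street",
            "po box 1011",
            "post office Box 12 Floor 105 Tallapoosa Street",
            "leclair ryan pc p.o. Box 2499 8th floor 951 east byrd street",
            "box 1 slot 3 building 2 136 harvey road"
        };
        for (String s : samples) {
            System.out.println(boxNumber(s) + " <- " + s);
        }
    }
}
```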

There are a couple of items in your regex that look like they need work. From what I understand, you want to extract the P.O. Box number from strings in the format you've provided. Given that, the following regex will accomplish what you want; an explanation follows. See it in action here: https://regex101.com/r/cQ8lH3/2
Pattern p = Pattern.compile("p\\.?o\\.? box [^ \\r\\n\\t]+");
Firstly, the regex itself uses only ONE backslash for escape sequences (that is how it appears on regex101); inside a Java string literal each of those backslashes must be doubled, as in the code above. Also, you must escape the dots. If you do not escape the dots, the regex will match . as ANY single character. \. will instead match a literal dot.
Next, you need to change the * quantifier after the dot to a ?. Why? The * quantifier matches zero or more of the preceding symbol, while the ? quantifier matches only one or none.
Finally, rethink how you're matching the box number. Instead of matching all characters AND THEN whitespace, just match everything that isn't whitespace. [^ \r\n\t]+ will match all characters that are NOT a space ( ), carriage return (\r), newline (\n), or tab (\t). Therefore it will consume the box number and stop as soon as it hits any whitespace or the end of the string.
Some of these changes may not be necessary to get your code to work for the examples you gave, but they are the proper way to build the regex you want.

How to get the desired character from the multiple underscored variable sized strings?

This is a continuation of "How to get the desired character from the variable sized strings?"
but includes additional scenarios from which a person's name is to be extracted:
pot-1_Sam [Sam is the word to be extracted]
pot_444_Jack [Jack is the word to be extracted]
pot_US-1_Sam [Sam is the word to be extracted]
pot_RUS_444_Jack [Jack is the word to be extracted]
pot_UK_3_Nick_Samuel [Nick_Samuel is the word to be extracted]
pot_8_James_Baldwin [James_Baldwin is the word to be extracted]
pot_8_Jack_Furleng_Derik [Jack_Furleng_Derik is the word to be extracted]
Above are the sample words from which the person names are to be extracted. One thing to note is that the person name always starts after a number and an underscore. How can I achieve this using Java?

String[] strings = {
"pot-1_Sam",
"pot_444_Jack",
"pot_US-1_Sam",
"pot_RUS_444_Jack",
"pot_UK_3_Nick_Samuel",
"pot_8_James_Baldwin",
"pot_8_Jack_Furleng_Derik"
};
Pattern pattern = Pattern.compile("\\d_(\\w+)$");
for (String s : strings ){
Matcher matcher = pattern.matcher(s);
if (matcher.find()) {
System.out.println(matcher.group(1));
}
}

Use the following regular expression:
^.*\d+_(.*)$
... and extract the value of the 1st group.

String str = "pot_8_Jack_Furleng_Derik";
// the name starts after a number followed by an underscore
Pattern p = Pattern.compile("\\d+_");
Matcher m = p.matcher(str);
// m.end() is the index just past the first "<digits>_", which is where the name starts;
// m.start() + 2 would only be correct for single-digit numbers
if (m.find())
str = str.substring(m.end());
System.out.print(str);

Here is the regex: \d_(.*$)
\d stands for one digit
_ stands for underscore
() stands for group
.*$ stands for "all till the end of line"
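As a quick check, the regex can be run against all seven sample strings from the question. This is a minimal sketch; the class NameExtractor and the method extractName are illustrative names. Since find() returns the leftmost match, the name is taken to start after the first digit that is directly followed by an underscore.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NameExtractor {
    // One digit, an underscore, then everything up to the end of the string.
    private static final Pattern NAME = Pattern.compile("\\d_(.*)$");

    // Hypothetical helper: returns the captured name, or null if the
    // string contains no digit followed by an underscore.
    static String extractName(String s) {
        Matcher m = NAME.matcher(s);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String[] samples = {
            "pot-1_Sam", "pot_444_Jack", "pot_US-1_Sam", "pot_RUS_444_Jack",
            "pot_UK_3_Nick_Samuel", "pot_8_James_Baldwin", "pot_8_Jack_Furleng_Derik"
        };
        for (String s : samples) {
            System.out.println(s + " -> " + extractName(s));
        }
    }
}
```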

You can try using reflection to get the actual name of the variable, as seen here.
You can further break that down with a regex, as also mentioned.
Extracting numbers into a string array

I have a string which is of the form
String str = "124333 is the otp of candidate number 9912111242. Please refer txn id 12323335465645 while referring blah blah.";
I need 124333, 9912111242 and 12323335465645 in a string array. I have tried this with
while (Character.isDigit(sms.charAt(i)))
I feel that running the above method on every character is inefficient. Is there a way to get a string array of all the numbers?

Use a regex (see Pattern and Matcher):
Pattern p = Pattern.compile("\\d+");
Matcher m = p.matcher(str); // str is your input string
while (m.find()) {
// m.group() contains the digits you want
}
You can easily build an ArrayList that contains each match you find.
Or, as others suggested, you can split on non-digit characters (\D):
"blabla 123 blabla 345".split("\\D+")
Note that when the input starts with a non-digit, as it does here, the first element of the resulting array is an empty string. Also note that \ has to be escaped in a Java string literal, hence the \\.
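The Matcher loop can be completed into a helper that returns the String array the question asks for. This is a minimal sketch; the class NumberExtractor and the method extractNumbers are illustrative names. Unlike split, it never produces an empty first element when the text starts with a non-digit.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NumberExtractor {
    private static final Pattern DIGITS = Pattern.compile("\\d+");

    // Hypothetical helper: collects every run of digits, in order.
    static String[] extractNumbers(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = DIGITS.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found.toArray(new String[0]);
    }

    public static void main(String[] args) {
        String str = "124333 is the otp of candidate number 9912111242. "
                + "Please refer txn id 12323335465645 while referring blah blah.";
        System.out.println(Arrays.toString(extractNumbers(str)));
        // For comparison: split yields a leading empty string for this input.
        System.out.println(Arrays.toString("blabla 123 blabla 345".split("\\D+")));
    }
}
```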

You can use String.split():
String[] nbs = str.split("[^0-9]+");
This will split the String on any run of non-digit characters.

This works for your input:
String str = "124333 is the otp of candidate number 9912111242. Please refer txn id 12323335465645 while referring blah blah.";
System.out.println(Arrays.toString(str.split("\\D+")));
Output:
[124333, 9912111242, 12323335465645]
\\D+ Matches one or more non-digit characters. Splitting the input according to one or more non-digit characters will give you the desired output.

Java 8 style:
long[] numbers = Pattern.compile("\\D+")
.splitAsStream(str)
.mapToLong(Long::parseLong)
.toArray();
If you only need a String array, you can just use String.split as the other answers suggest.
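One caveat with the stream version: if the text starts with a non-digit, the first token produced by splitAsStream is an empty string, and Long.parseLong then throws a NumberFormatException. A hedged variant that filters empty tokens first is sketched below; the class NumberStream and the method parseNumbers are illustrative names. Digit runs too long to fit in a long would still fail to parse.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class NumberStream {
    private static final Pattern NON_DIGITS = Pattern.compile("\\D+");

    // Hypothetical helper: parses every run of digits as a long,
    // skipping the empty token that appears when the text starts
    // with a non-digit character.
    static long[] parseNumbers(String text) {
        return NON_DIGITS.splitAsStream(text)
                .filter(token -> !token.isEmpty())
                .mapToLong(Long::parseLong)
                .toArray();
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(parseNumbers("otp 124333, id 9912111242.")));
    }
}
```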

Alternatively, you can try this:
String str = "124333 is the otp of candidate number 9912111242. Please refer txn id 12323335465645 while referring blah blah.";
str = str.replaceAll("\\D+", ",");
System.out.println(Arrays.asList(str.split(",")));
\\D+ matches one or more non digits
Output
[124333, 9912111242, 12323335465645]

The first thing that came to my mind was to filter and split, but then I realized it can be done with
String[] result = str.split("\\D+");
\D matches any non-digit character and + says that one or more of them are needed. The leading \ escapes the other \, because in a Java string literal "\D" on its own is an invalid escape sequence.

Java Regex Remove comma's between numbers from String

I am trying to parse a string to remove the commas between numbers. Please read the complete question before answering.
Consider the following string, as is:
John loves cakes and he always orders them by dialing "989,444 1234". Johns credentials are as follows"
"Name":"John", "Jr", "Mobile":"945,234,1110"
Assuming I have the above text in a Java string, I would like to remove all commas between numbers. I would like to make the following replacements in the same string:
"989,444 1234" with "989444 1234"
"945,234,1110" with "9452341110"
without making any other changes to the string.
I could loop over the string and, whenever a comma is found, check the previous and next index for digits and then delete the comma. But that looks ugly, doesn't it?
If I use the regex "[0-9],[0-9]", I lose two characters: the ones before and after the comma.
I am looking for an efficient solution rather than a brute-force "search and replace" over the complete string. The real string length is ~80K characters. Thanks.

public static void main(String args[]) throws IOException
{
String regex = "(?<=[\\d])(,)(?=[\\d])";
Pattern p = Pattern.compile(regex);
String str = "John loves cakes and he always orders them by dialing \"989,444 1234\". Johns credentials are as follows\" \"Name\":\"John\", \"Jr\", \"Mobile\":\"945,234,1110\"";
Matcher m = p.matcher(str);
str = m.replaceAll("");
System.out.println(str);
}
Output
John loves cakes and he always orders them by dialing "989444 1234". Johns credentials are as follows" "Name":"John", "Jr", "Mobile":"9452341110"

This regex uses a positive lookbehind and a positive lookahead to only match commas with a preceding digit and a following digit, without including those digits in the match itself:
(?<=\d),(?=\d)

You could try a regex like this:
public static void main(String[] args) {
String s = "asd,asdafs,123,456,789,asda,dsfds";
// positive lookbehind for a digit and positive lookahead for a digit,
// i.e. only remove a comma that is preceded by a digit and followed by another digit
System.out.println(s.replaceAll("(?<=\\d),(?=\\d)", ""));
}
Output:
asd,asdafs,123456789,asda,dsfds
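To make the difference from the question's "[0-9],[0-9]" attempt concrete, the sketch below runs both patterns side by side. The class CommaRemover and its two methods are illustrative names. The pattern is compiled once, which is a reasonable choice if it is applied repeatedly to large inputs such as the ~80K-character string mentioned in the question.

```java
import java.util.regex.Pattern;

class CommaRemover {
    // A comma with a digit on both sides; the lookarounds are zero-width,
    // so only the comma itself is part of the match.
    private static final Pattern DIGIT_COMMA = Pattern.compile("(?<=\\d),(?=\\d)");

    // Hypothetical helper: removes only commas that sit between two digits.
    static String removeDigitCommas(String text) {
        return DIGIT_COMMA.matcher(text).replaceAll("");
    }

    // The attempt from the question: the digits are part of the match,
    // so they are deleted together with the comma.
    static String naive(String text) {
        return text.replaceAll("[0-9],[0-9]", "");
    }

    public static void main(String[] args) {
        String s = "\"Name\":\"John\", \"Jr\", \"Mobile\":\"945,234,1110\"";
        System.out.println(removeDigitCommas(s));
        System.out.println(naive(s));
    }
}
```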

Java Regex for changing every ith index in every word of a string

I've written a regex \b\S\w(\S(?=.)) to find every third symbol in a word and replace it with '1'. Now I'm trying to use this expression but really don't know how to do it right.
Pattern pattern = Pattern.compile("\\b\\S\\w(\\S(?=.))");
Matcher matcher = pattern.matcher("lemon apple strawberry pumpkin");
while (matcher.find()) {
System.out.print(matcher.group(1) + " ");
}
So result is:
m p r m
How can I use this to produce a string like this?
le1on ap1le st1awberry pu1pkin

You could use something like this:
"lemon apple strawberry pumpkin".replaceAll("(?<=\\b\\S{2})\\S", "1")
This produces your example output. The regex replaces any non-space character that is preceded by exactly two non-space characters starting at a word boundary.
This means that "words" like 12345 would be changed into 12145, since 3 is matched by \\S (any non-space character).
Edit:
Updated the regex to better match the revised question title: change 2 to i-1 to replace the ith character of each word.
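The "change 2 to i-1" remark can be wrapped in a small helper that builds the lookbehind for a given position. This is a sketch only; the class IthCharReplacer and the method replaceIth are illustrative names, and the restriction to i >= 2 is an assumption made here because with i = 1 the lookbehind reduces to a bare word boundary, which also matches right after the end of a word.

```java
import java.util.regex.Matcher;

class IthCharReplacer {
    // Hypothetical helper: replaces the i-th character (1-based) of every
    // word, where a "word" character is anything matched by \S.
    static String replaceIth(String text, int i, char replacement) {
        if (i < 2) {
            throw new IllegalArgumentException("this sketch supports i >= 2 only");
        }
        // Lookbehind: exactly i-1 non-space characters that start at a word boundary.
        String regex = "(?<=\\b\\S{" + (i - 1) + "})\\S";
        // quoteReplacement keeps characters such as '$' literal in the replacement.
        return text.replaceAll(regex, Matcher.quoteReplacement(String.valueOf(replacement)));
    }

    public static void main(String[] args) {
        System.out.println(replaceIth("lemon apple strawberry pumpkin", 3, '1'));
    }
}
```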

There is another way: use the index of each match reported by the matcher, like this:
Pattern pattern = Pattern.compile("\\b\\S\\w(\\S(?=.))");
String string = "lemon apple strawberry pumpkin";
char[] c = string.toCharArray();
Matcher matcher = pattern.matcher(string);
while (matcher.find()) {
c[matcher.end() - 1] = '1'; // matcher.end() - 1 is the index of the last matched character, i.e. the third character of the word
}
System.out.println(c);
