Count and split by regex pattern in Java

I have a string in the format below.
-52/ABC/35/BY/200/L/DEF/307/C/110/L
I need to do the following:
1. Find the number of occurrences of three-letter words such as ABC and DEF in the text above.
2. Split the string at ABC and DEF, as shown below.
ABC/35/BY/200/L
DEF/307/C/110/L
I tried the regex code below, but the match count is always zero. What is a simple way to approach this?
static String DEST_STRING = "^[A-Z]{3}$";
static Pattern DEST_PATTERN = Pattern.compile(DEST_STRING,
Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
public static void main(String[] args) {
String test = "-52/ABC/35/BY/200/L/DEF/307/C/110/L";
Matcher destMatcher = DEST_PATTERN.matcher(test);
int destCount = 0;
while (destMatcher.find()) {
destCount++;
}
System.out.println(destCount);
}
Please note that I need to use JDK 6 for this.
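A likely reason for the zero count, offered as a sketch rather than a definitive diagnosis: `^` and `$` anchor the pattern to the start and end of the whole input, so `^[A-Z]{3}$` can only match when the entire string is exactly three letters. Dropping the anchors and guarding against longer runs of letters makes the same counting loop work. The class and method names below are made up for illustration, and the code uses nothing newer than JDK 6.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AnchorDemo {
    // Counts how often the given regex matches inside the input.
    static int count(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        String test = "-52/ABC/35/BY/200/L/DEF/307/C/110/L";
        // Anchored: matches only if the whole input is three letters.
        System.out.println(count("^[A-Z]{3}$", test)); // prints 0
        // Unanchored, with lookarounds so that longer letter runs are skipped.
        System.out.println(count("(?<![A-Z])[A-Z]{3}(?![A-Z])", test)); // prints 2
    }
}
```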

You can use this code:
public static void main(String[] args) throws Exception {
String s = "-52/ABC/35/BY/200/L/DEF/307/C/110/L";
// Pattern to find all 3 letter words . The \\b means "word boundary", which ensures that the words are of length 3 only.
Pattern p = Pattern.compile("(\\b[a-zA-Z]{3}\\b)");
Matcher m = p.matcher(s);
Map<String, Integer> countMap = new HashMap<String, Integer>(); // no diamond operator, so this compiles on JDK 6
// Count how many times each 3 letter word is used.
// Find each 3 letter word.
while (m.find()) {
// Get the 3 letter word.
String val = m.group();
// If the word is present in the map, get old count and add 1, else add new entry in map and set count to 1
if (countMap.containsKey(val)) {
countMap.put(val, countMap.get(val) + 1);
} else {
countMap.put(val, 1);
}
}
System.out.println(countMap);
// Get ABC.. and DEF.. using positive lookahead for a 3 letter word or end of String
// Finds and selects everything starting from a 3 letter word until another 3 letter word is found or until string end is found.
p = Pattern.compile("(\\b[a-zA-Z]{3}\\b.*?)(?=/[A-Za-z]{3}|$)");
m = p.matcher(s);
while (m.find()) {
String val = m.group();
System.out.println(val);
}
}
Output:
{ABC=1, DEF=1}
ABC/35/BY/200/L
DEF/307/C/110/L
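As an alternative sketch of the same lookahead idea, `String.split` can cut the input at each `/` that is directly followed by a three-letter code. The class and method names here are hypothetical, and the filter that drops the leading `-52` piece assumes that every wanted segment starts with such a code.

```java
import java.util.ArrayList;
import java.util.List;

class SplitOnCodes {
    // Splits at every "/" that is directly followed by a three-letter code.
    static List<String> segments(String s) {
        String[] parts = s.split("/(?=[A-Za-z]{3}(?:/|$))");
        List<String> result = new ArrayList<String>();
        for (String part : parts) {
            // Keep only pieces that begin with a three-letter code;
            // this drops a prefix such as "-52".
            if (part.matches("[A-Za-z]{3}(/.*)?")) {
                result.add(part);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(segments("-52/ABC/35/BY/200/L/DEF/307/C/110/L"));
        // prints [ABC/35/BY/200/L, DEF/307/C/110/L]
    }
}
```

The size of the returned list also gives the number of three-letter codes, provided each code starts its own segment.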

Check this one:
String stringToSearch = "-52/ABC/35/BY/200/L/DEF/307/C/110/L";
Pattern p1 = Pattern.compile("\\b[a-zA-Z]{3}\\b");
Matcher m = p1.matcher(stringToSearch);
int startIndex = -1;
while (m.find())
{
//Try to use Apache Commons' StringUtils
int count = StringUtils.countMatches(stringToSearch, m.group());
System.out.println(m.group() +":"+ count);
if(startIndex != -1){
System.out.println(stringToSearch.substring(startIndex,m.start()-1));
}
startIndex = m.start();
}
if(startIndex != -1){
System.out.println(stringToSearch.substring(startIndex));
}
output:
ABC:1
DEF:1
ABC/35/BY/200/L
DEF/307/C/110/L

Related

Get substring in a string with multiple occurring string

I have a string something like
(D#01)5(D#02)14100319530033M(D#03)1336009-A-A(D#04)141002A171(D#05)1(D#06)
Now I want to get the substring between (D#01) and (D#02), which is 5 here.
If I have something like
(D#01)5(D#02)
I can get the value with
quantity = content.substring(content.indexOf("(D#01)") + 6, content.indexOf("(D#02)"));
But sometimes the marker after the value is not D#02; it can be a different one, such as D#05. How can I search for just (D# to get the string in between, given that (D# occurs many times?
Basically, this is what I want to do:
content.substring(content.indexOf("(D#01)") + 6, content.nextOccurringIndexOf("(D#"));
I suppose you can do
int start = content.indexOf("(D#01)");
int fromIndex = start + 6;
int toIndex = content.indexOf("(D#", fromIndex); // next occurrence after the value
if (start != -1 && toIndex != -1) // check start, because fromIndex is never -1
str = content.substring(fromIndex, toIndex);
Output
5
See http://ideone.com/RrUtBy demo.
Assuming that each marker and the value after it are linked (for example, (D#01) goes with 5) and you want every pair, you can make use of the Pattern/Matcher API, for example:
String text = "(D#01)5(D#02)14100319530033M(D#03)1336009-A-A(D#04)141002A171(D#05)1(D#06)";
Pattern p = Pattern.compile("\\(D#[0-9]+\\)");
Matcher m = p.matcher(text);
while (m.find()) {
String name = m.group();
if (m.end() < text.length()) {
String content = text.substring(m.end()); // everything after this marker
content = content.substring(0, content.indexOf("("));
System.out.println(name + " = " + content);
}
}
Which outputs
(D#01) = 5
(D#02) = 14100319530033M
(D#03) = 1336009-A-A
(D#04) = 141002A171
(D#05) = 1
Now, this is a little heavy-handed. I'd create some kind of "marker" object which contains the key (D#01) and its start and end indices. I'd then keep this information in a List and cut out each value based on the end of one key and the start of the next key... but that's just me ;)
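The "marker" idea could look roughly like this. It is only a sketch: the `Marker` class, the `parse` method and the choice of a `LinkedHashMap` for the result are assumptions made for illustration, not part of the answer above.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MarkerDemo {
    // Hypothetical holder for one "(D#nn)" key and its position in the text.
    static class Marker {
        final String key;
        final int start;
        final int end;

        Marker(String key, int start, int end) {
            this.key = key;
            this.start = start;
            this.end = end;
        }
    }

    // Maps each key to the text between it and the next key.
    static Map<String, String> parse(String text) {
        List<Marker> markers = new ArrayList<Marker>();
        Matcher m = Pattern.compile("\\(D#[0-9]+\\)").matcher(text);
        while (m.find()) {
            markers.add(new Marker(m.group(), m.start(), m.end()));
        }
        Map<String, String> values = new LinkedHashMap<String, String>();
        // The last marker has no following marker, so it gets no value.
        for (int i = 0; i + 1 < markers.size(); i++) {
            Marker current = markers.get(i);
            Marker next = markers.get(i + 1);
            values.put(current.key, text.substring(current.end, next.start));
        }
        return values;
    }

    public static void main(String[] args) {
        String text = "(D#01)5(D#02)14100319530033M(D#03)1336009-A-A(D#04)141002A171(D#05)1(D#06)";
        System.out.println(parse(text));
        // prints {(D#01)=5, (D#02)=14100319530033M, (D#03)=1336009-A-A, (D#04)=141002A171, (D#05)=1}
    }
}
```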
You can use regex capture groups if you want the content between the (D#nn) markers:
Pattern p = Pattern.compile("(\\(D#\\d+\\))(.*?)(?=\\(D#\\d+\\))");
Matcher matcher = p.matcher("(D#01)5(D#02)14100319530033M(D#03)1336009-A-A(D#04)141002A171(D#05)1(D#06)");
while(matcher.find()) {
System.out.println(String.format("%s start: %2s end: %2s matched: %s ",
matcher.group(1), matcher.start(2), matcher.end(2), matcher.group(2)));
}
(D#01) start: 6 end: 7 matched: 5
(D#02) start: 13 end: 28 matched: 14100319530033M
(D#03) start: 34 end: 45 matched: 1336009-A-A
(D#04) start: 51 end: 61 matched: 141002A171
(D#05) start: 67 end: 68 matched: 1
You can use a regex to split the input, as suggested by @MadProgrammer. The split() method returns an array of Strings, so the values appear in the array in the same order as they occur in the input. For example:
String input = "(D#01)5(D#02)14100319530033M(D#03)1336009-A-A(D#04)141002A171(D#05)1(D#06)";
String[] table = input.split("\\(D#[0-9]+\\)"); // table[0] is "" because the input starts with a marker
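To make the result of the split visible, here is a small runnable sketch (the class and method names are made up). It relies on two documented `String.split` behaviours: a leading empty string is kept when the input starts with a delimiter, and trailing empty strings are dropped.

```java
import java.util.Arrays;

class SplitMarkers {
    // Splits the input on every "(D#nn)" marker.
    static String[] values(String input) {
        return input.split("\\(D#[0-9]+\\)");
    }

    public static void main(String[] args) {
        String input = "(D#01)5(D#02)14100319530033M(D#03)1336009-A-A(D#04)141002A171(D#05)1(D#06)";
        System.out.println(Arrays.toString(values(input)));
        // prints [, 5, 14100319530033M, 1336009-A-A, 141002A171, 1]
    }
}
```

The value that follows the n-th marker is therefore at index n of the array, counting markers from 1.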
Try this:
public static void main(String[] args) {
String input = "(D#01)5(D#02)14100319530033M(D#03)1336009-A-A(D#04)141002A171(D#05)1(D#06)";
Pattern p = Pattern.compile("\\(D#\\d+\\)(.*?)(?=\\(D#\\d+\\))");
Matcher matches = p.matcher(input);
while(matches.find()) {
int number = getNum(matches.group(0)); // parses the number
System.out.printf("%d. %s\n", number, matches.group(1)); // print the string
}
}
public static int getNum(String str) {
int start = str.indexOf('#') + 1;
int end = str.indexOf(')', start);
return Integer.parseInt(str.substring(start,end));
}
Result:
1. 5
2. 14100319530033M
3. 1336009-A-A
4. 141002A171
5. 1

How to Split a string in java based on limit

I have the following String and I want to split it into substrings, using ',' as the delimiter, once its length reaches 36. The split does not have to fall exactly on the 36th position.
String message = "This is some(sampletext), and has to be splited properly";
I want to get two substrings as output, as follows:
1. 'This is some (sampletext)'
2. 'and has to be splited properly'
Thanks in advance.
A solution based on regex:
String s = "This is some sample text and has to be splited properly";
Pattern splitPattern = Pattern.compile(".{1,15}\\b");
Matcher m = splitPattern.matcher(s);
List<String> stringList = new ArrayList<String>();
while (m.find()) {
stringList.add(m.group(0).trim());
}
Update:
trim() can be dropped by changing the pattern so that it ends with a space or the end of the string:
String s = "This is some sample text and has to be splited properly";
Pattern splitPattern = Pattern.compile("(.{1,15})\\b( |$)");
Matcher m = splitPattern.matcher(s);
List<String> stringList = new ArrayList<String>();
while (m.find()) {
stringList.add(m.group(1));
}
group(1) means that I only need the first part of the pattern (.{1,15}) as output.
.{1,15} - a sequence of any characters (".") with any length between 1 and 15 ({1,15})
\b - a word boundary (the position between a word character and a non-word character)
( |$) - space or end of string
In addition I've added () surrounding .{1,15} so I can use it as a whole group (m.group(1)).
Depending on the desired result, this expression can be tweaked.
Update:
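If you want to split the message at a comma only if its length would be over 36, try the following expression (the \b is left out because there is no word boundary between ')' and ',' in the sample message; call trim() on the groups to drop the leading space):
Pattern splitPattern = Pattern.compile("(.{1,36})(,|$)");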
The best solution I can think of is to make a function that iterates through the string. In the function you could keep track of whitespace characters, and for each 16th position you could add a substring to a list based on the position of the last encountered whitespace. After it has found a substring, you start anew from the last encountered whitespace. Then you simply return the list of substrings.
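A minimal sketch of such a function follows. To stay close to the question, it takes the delimiter and the limit as parameters instead of hard-coding whitespace and 16, and it uses `lastIndexOf` rather than a hand-written scan; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class LimitSplitter {
    // Cuts the text at the last delimiter found at or before the limit and repeats on the rest.
    static List<String> splitAtDelimiter(String text, char delimiter, int limit) {
        List<String> parts = new ArrayList<String>();
        String rest = text;
        while (rest.length() > limit) {
            int cut = rest.lastIndexOf(delimiter, limit);
            if (cut < 0) {
                break; // no delimiter within the limit: keep the remainder whole
            }
            String head = rest.substring(0, cut).trim();
            if (head.length() > 0) {
                parts.add(head);
            }
            rest = rest.substring(cut + 1).trim();
        }
        if (rest.length() > 0) {
            parts.add(rest);
        }
        return parts;
    }

    public static void main(String[] args) {
        String message = "This is some(sampletext), and has to be splited properly";
        System.out.println(splitAtDelimiter(message, ',', 36));
        // prints [This is some(sampletext), and has to be splited properly]
    }
}
```

The printed list has two elements; the comma in the output is only the list separator. Calling it with `' '` and `16` on the space-separated sample sentence gives `[This is some, sample text and, has to be, splited properly]`.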
Here's a tidy answer:
String message = "This is some sample text and has to be splited properly";
String[] temp = message.split("(?<=^.{1,16}) ");
String part1 = message.substring(0, message.length() - temp[temp.length - 1].length() - 1);
String part2 = message.substring(message.length() - temp[temp.length - 1].length());
This should work on all inputs, except when there are sequences of chars without whitespace longer than 16. It also creates the minimum amount of extra Strings by indexing into the original one.
public static void main(String[] args) throws IOException
{
String message = "This is some sample text and has to be splited properly";
List<String> result = new ArrayList<String>();
int start = 0;
while (start + 16 < message.length())
{
int end = start + 16;
while (!Character.isWhitespace(message.charAt(end--)));
result.add(message.substring(start, end + 1));
start = end + 2;
}
result.add(message.substring(start));
System.out.println(result);
}
If you have a simple text as the one you showed above (words separated by blank spaces) you can always think of StringTokenizer. Here's some simple code working for your case:
public static void main(String[] args) {
String message = "This is some sample text and has to be splited properly";
while (message.length() > 0) {
String token = "";
StringTokenizer st = new StringTokenizer(message);
while (st.hasMoreTokens()) {
String nt = st.nextToken();
String foo = "";
if (token.length()==0) {
foo = nt;
}
else {
foo = token + " " + nt;
}
if (foo.length() < 16)
token = foo;
else {
System.out.print("'" + token + "' ");
message = message.substring(token.length() + 1, message.length());
break;
}
if (!st.hasMoreTokens()) {
System.out.print("'" + token + "' ");
message = message.substring(token.length(), message.length());
}
}
}
}

Break a long string into lines with proper word wrapping

String original = "This is a sentence.Rajesh want to test the application for the word split.";
List matchList = new ArrayList();
Pattern regex = Pattern.compile(".{1,10}(?:\\s|$)", Pattern.DOTALL);
Matcher regexMatcher = regex.matcher(original);
while (regexMatcher.find()) {
matchList.add(regexMatcher.group());
}
System.out.println("Match List "+matchList);
I need to parse the text into a list of lines that do not exceed 10 characters in length, without breaking a word at the end of a line.
I used the logic above in my scenario, but the problem is that it parses to the nearest whitespace after 10 characters when a word would be broken at the end of a line.
For example, the actual sentence is "This is a sentence.Rajesh want to test the application for the word split." but after running the logic I get the following:
Match List [This is a , nce.Rajesh , want to , test the , pplication , for the , word , split.]
OK, so I've managed to get the following working, with max line length of 10, but also splitting the words that are longer than 10 correctly!
String original = "This is a sentence. Rajesh want to test the applications for the word split handling.";
List matchList = new ArrayList();
Pattern regex = Pattern.compile("(.{1,10}(?:\\s|$))|(.{0,10})", Pattern.DOTALL);
Matcher regexMatcher = regex.matcher(original);
while (regexMatcher.find()) {
matchList.add(regexMatcher.group());
}
System.out.println("Match List "+matchList);
This is the result (each match keeps its trailing space, and the last match in the list is an empty string):
This is a
sentence.
Rajesh
want to
test the
applicatio
ns for the
word split
handling.
This question was tagged as Groovy at some point. Assuming a Groovy answer is still valid and you are not worried about preserving multiple white spaces (e.g. ' '):
def splitIntoLines(text, maxLineSize) {
def words = text.split(/\s+/)
def lines = ['']
words.each { word ->
def lastLine = (lines[-1] + ' ' + word).trim()
if (lastLine.size() <= maxLineSize)
// Change last line.
lines[-1] = lastLine
else
// Add word as new line.
lines << word
}
lines
}
// Tests...
def original = "This is a sentence. Rajesh want to test the application for the word split."
assert splitIntoLines(original, 10) == [
"This is a",
"sentence.",
"Rajesh",
"want to",
"test the",
"application",
"for the",
"word",
"split."
]
assert splitIntoLines(original, 20) == [
"This is a sentence.",
"Rajesh want to test",
"the application for",
"the word split."
]
assert splitIntoLines(original, original.size()) == [original]
I avoided regex as it doesn't pull its weight here. This code word-wraps, and if a single word is longer than 10 chars, breaks it. It also takes care of excess whitespace.
import static java.lang.Character.isWhitespace;
public static void main(String[] args) {
final String original =
"This is a sentence.Rajesh want to test the application for the word split.";
final StringBuilder b = new StringBuilder(original.trim());
final List<String> matchList = new ArrayList<String>();
while (true) {
b.delete(0, indexOfFirstNonWsChar(b));
if (b.length() == 0) break;
final int splitAt = lastIndexOfWsBeforeIndex(b, 10);
matchList.add(b.substring(0, splitAt).trim());
b.delete(0, splitAt);
}
System.out.println("Match List "+matchList);
}
static int lastIndexOfWsBeforeIndex(CharSequence s, int i) {
if (s.length() <= i) return s.length();
for (int j = i; j > 0; j--) if (isWhitespace(s.charAt(j-1))) return j;
return i;
}
static int indexOfFirstNonWsChar(CharSequence s) {
for (int i = 0; i < s.length(); i++) if (!isWhitespace(s.charAt(i))) return i;
return s.length();
}
Prints:
Match List [This is a, sentence.R, ajesh, want to, test the, applicatio, n for the, word, split.]

How to Insert Commas Into a Number WITHIN a String of Other Words

I have a String like the following:
"The answer is 1000"
I want to insert commas into the number 1000 without destroying the rest of the String.
NOTE: I also want to use this for other Strings of differing lengths, so substring(int index) would not be advised for getting the number.
The best way that I can think of is to use a regex command, but I have no idea how.
Thanks in advance!
The following formats all the non-decimal numbers:
public String formatNumbers(String input) {
Pattern p = Pattern.compile("\\d+");
Matcher m = p.matcher(input);
NumberFormat nf = NumberFormat.getInstance();
StringBuffer sb = new StringBuffer();
while(m.find()) {
String g = m.group();
m.appendReplacement(sb, nf.format(Double.parseDouble(g)));
}
return m.appendTail(sb).toString();
}
e.g. if you call: formatNumbers("The answer is 1000 1000000")
Result is: "The answer is 1,000 1,000,000"
See: NumberFormat and Matcher.appendReplacement().
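One caveat worth sketching: `NumberFormat.getInstance()` uses the default locale, so the grouping separator is not a comma everywhere. The variant below pins the locale to `Locale.US` and parses with `Long.parseLong` so that large integers are not rounded through a double. It is an illustration under those assumptions, not a drop-in replacement: it throws a NumberFormatException for digit runs that do not fit in a long, and it drops leading zeros.

```java
import java.text.NumberFormat;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LocaleFormat {
    // Formats every run of digits with US-style grouping commas.
    static String formatNumbers(String input) {
        NumberFormat nf = NumberFormat.getIntegerInstance(Locale.US);
        Matcher m = Pattern.compile("\\d+").matcher(input);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            String formatted = nf.format(Long.parseLong(m.group()));
            // quoteReplacement guards against '$' or '\' in the replacement text.
            m.appendReplacement(sb, Matcher.quoteReplacement(formatted));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(formatNumbers("The answer is 1000 1000000"));
        // prints The answer is 1,000 1,000,000
    }
}
```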
modified from Most efficient way to extract all the (natural) numbers from a string:
import java.util.regex.Pattern;
import java.util.regex.Matcher;
public class Example {
private static final String REGEX = "\\d+";
public static void main(String[] args) {
String input = "dog dog 1342 dog doggie 2321 dogg";
Pattern p = Pattern.compile(REGEX);
Matcher m = p.matcher(input); // get a matcher object
int end = 0;
String result = "";
while (m.find()) {
result = result + input.substring(end, m.start());
result = result
+ addCommas(
input.substring(
m.start(), m.end()));
end = m.end();
}
result = result + input.substring(end); // keep the text after the last number
System.out.println(result);
}
private static String addCommas(String s) {
char[] c = s.toCharArray();
String result = "";
for (int i = 0; i < s.length(); i++) {
if (i > 0 && s.length() % 3 == i % 3) // i > 0 avoids a leading comma, e.g. ",300"
result += ",";
result += c[i];
}
return result;
}
}
You could use the regular expression:
[0-9]+
to find contiguous runs of digits, so it would match 1000, 7500, 22387234, and so on. You can test this on http://regexpal.com/. By the way, this doesn't handle numbers that contain decimal points.
This isn't a complete answer with code, but the basic algorithm is as follows:
1. Use that pattern to find the index of each match within the string (the index of the character where each match starts).
2. From each of those indexes, copy the digits into a temporary string that contains only the digits of that number.
3. Write a function that starts at the end of that temporary string and inserts a comma before every third digit (counting from the end), unless the index of the current character is 0 (which prevents 300 from being turned into ,300).
4. Replace the original number in the source string with the comma-separated string, using the replace() method.
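The steps above can be sketched roughly as follows. One deliberate deviation: instead of calling replace() on the source string, which could also touch other occurrences of the same digits, the sketch uses `Matcher.appendReplacement` and `appendTail` to rebuild the string. The class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ManualCommas {
    // Inserts a comma before every third digit, counting from the end.
    static String addCommas(String digits) {
        StringBuilder sb = new StringBuilder(digits);
        // i > 0 keeps a comma from landing at index 0, so "300" stays "300".
        for (int i = digits.length() - 3; i > 0; i -= 3) {
            sb.insert(i, ',');
        }
        return sb.toString();
    }

    // Finds each run of digits and swaps in its comma-separated form.
    static String insertCommas(String input) {
        Matcher m = Pattern.compile("[0-9]+").matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            m.appendReplacement(out, addCommas(m.group()));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(insertCommas("The answer is 1000"));
        // prints The answer is 1,000
    }
}
```

As with the regex itself, numbers with decimal points are not handled: the digits after the point would be grouped as well.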

How can I count the number of matches for a regex?

Let's say I have a string which contains this:
HelloxxxHelloxxxHello
I compile a pattern to look for 'Hello'
Pattern pattern = Pattern.compile("Hello");
Matcher matcher = pattern.matcher("HelloxxxHelloxxxHello");
It should find three matches. How can I get a count of how many matches there were?
I've tried various loops and using the matcher.groupCount() but it didn't work.
matcher.find() does not find all matches, only the next match.
Solution for Java 9+
long matches = matcher.results().count();
Solution for Java 8 and older
You'll have to do the following. (Starting from Java 9, there is a nicer solution)
int count = 0;
while (matcher.find())
count++;
Btw, matcher.groupCount() is something completely different.
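To illustrate the difference: `groupCount()` reports how many capturing groups the pattern contains, regardless of the input or of how many matches were found. A small sketch, with a made-up class name:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupCountDemo {
    // Returns the number of capturing groups in the regex itself.
    static int groups(String regex) {
        return Pattern.compile(regex).matcher("").groupCount();
    }

    public static void main(String[] args) {
        String input = "HelloxxxHelloxxxHello";
        Matcher plain = Pattern.compile("Hello").matcher(input);
        System.out.println(plain.groupCount()); // prints 0: the pattern has no capturing groups
        Matcher grouped = Pattern.compile("(Hel)(lo)").matcher(input);
        System.out.println(grouped.groupCount()); // prints 2: two pairs of parentheses
    }
}
```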
Complete example:
import java.util.regex.*;
class Test {
public static void main(String[] args) {
String hello = "HelloxxxHelloxxxHello";
Pattern pattern = Pattern.compile("Hello");
Matcher matcher = pattern.matcher(hello);
int count = 0;
while (matcher.find())
count++;
System.out.println(count); // prints 3
}
}
Handling overlapping matches
When counting matches of aa in aaaa the above snippet will give you 2.
aaaa
aa
  aa
To get 3 matches, i.e. this behavior:
aaaa
aa
 aa
  aa
You have to search for a match at index <start of last match> + 1 as follows:
String hello = "aaaa";
Pattern pattern = Pattern.compile("aa");
Matcher matcher = pattern.matcher(hello);
int count = 0;
int i = 0;
while (matcher.find(i)) {
count++;
i = matcher.start() + 1;
}
System.out.println(count); // prints 3
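Another way to count overlapping occurrences, shown here only as a sketch for literal search strings, is to wrap the text in a zero-width lookahead. The lookahead consumes no characters, so `find()` tests every position in turn. The class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OverlapCount {
    // Counts the positions at which the literal text starts, overlaps included.
    static int countOverlapping(String literal, String input) {
        // Pattern.quote treats the search text literally, not as a regex.
        Matcher m = Pattern.compile("(?=" + Pattern.quote(literal) + ")").matcher(input);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countOverlapping("aa", "aaaa")); // prints 3
    }
}
```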
This should work for matches that might overlap:
public static void main(String[] args) {
String input = "aaaaaaaa";
String regex = "aa";
Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(input);
int from = 0;
int count = 0;
while(matcher.find(from)) {
count++;
from = matcher.start() + 1;
}
System.out.println(count);
}
From Java 9, you can use the stream provided by Matcher.results()
long matches = matcher.results().count();
If you want to use Java 8 streams and are allergic to while loops, you could try this:
public static int countPattern(String references, Pattern referencePattern) {
Matcher matcher = referencePattern.matcher(references);
return Stream.iterate(0, i -> i + 1)
.filter(i -> !matcher.find())
.findFirst()
.get();
}
Disclaimer: this only works for disjoint matches.
Example:
public static void main(String[] args) throws ParseException {
Pattern referencePattern = Pattern.compile("PASSENGER:\\d+");
System.out.println(countPattern("[ \"PASSENGER:1\", \"PASSENGER:2\", \"AIR:1\", \"AIR:2\", \"FOP:2\" ]", referencePattern));
System.out.println(countPattern("[ \"AIR:1\", \"AIR:2\", \"FOP:2\" ]", referencePattern));
System.out.println(countPattern("[ \"AIR:1\", \"AIR:2\", \"FOP:2\", \"PASSENGER:1\" ]", referencePattern));
System.out.println(countPattern("[ ]", referencePattern));
}
This prints out:
2
0
1
0
This is a solution for overlapping (non-disjoint) matches with streams:
public static int countPattern(String references, Pattern referencePattern) {
return StreamSupport.stream(Spliterators.spliteratorUnknownSize(
new Iterator<Integer>() {
Matcher matcher = referencePattern.matcher(references);
int from = 0;
@Override
public boolean hasNext() {
return matcher.find(from);
}
@Override
public Integer next() {
from = matcher.start() + 1;
return 1;
}
},
Spliterator.IMMUTABLE), false).reduce(0, (a, c) -> a + c);
}
Use the code below to count the number of matches that the regex finds in your input:
Pattern p = Pattern.compile(regex, Pattern.MULTILINE | Pattern.DOTALL); // "regex" is your predefined regex
Matcher m = p.matcher(input); // "input" is the string to match the pattern against
int count = 0;
while (m.find()) // no call to matches(): it tests the whole input as a single match and would skew the count
count++;
This is generalized code rather than a specific solution, so tailor it to suit your needs.
Please feel free to correct me if there is any mistake.
