Extracting a word containing a symbol from a string in Java - java

The basic idea is that I want to pull out any part of the string with the form "text1.text2". Some examples of the input and output of what I'd like to do would be:
"employee.first_name" ==> "employee.first_name"
"2 * employee.salary AS double_salary" ==> "employee.salary"
So far I just call .split(" "), pick out the piece I need, and then call .split(".") on it. Is there a cleaner way?

I would go with an actual Pattern and an iterative find, instead of splitting the String.
For instance:
String test = "employee.first_name 2 * ... employee.salary AS double_salary blabla e.s blablabla";
// searching for a number of word characters or punctuation, followed by dot,
// followed by a number of word characters or punctuation
// note also we're avoiding the "..." pitfall
Pattern p = Pattern.compile("[\\w\\p{Punct}&&[^\\.]]+\\.[\\w\\p{Punct}&&[^\\.]]+");
Matcher m = p.matcher(test);
while (m.find()) {
System.out.println(m.group());
}
Output:
employee.first_name
employee.salary
e.s
Note: to simplify the Pattern, you could list only the punctuation characters you actually allow in your "."-separated words in the character classes.
For instance:
Pattern p = Pattern.compile("[\\w_]+\\.[\\w_]+");
This way, foo.bar*2 would be matched as foo.bar
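A minimal sketch of that simplified pattern, assuming only word characters are allowed on each side of the dot. The class and method names are made up for illustration; \w already includes the underscore, so it is not listed separately here:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DottedWords {
    // Word characters, a literal dot, then word characters again.
    private static final Pattern DOTTED = Pattern.compile("\\w+\\.\\w+");

    // Hypothetical helper: returns every "text1.text2" fragment in the input.
    static List<String> findDotted(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = DOTTED.matcher(input);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(findDotted("2 * employee.salary AS double_salary")); // [employee.salary]
        System.out.println(findDotted("foo.bar*2"));                            // [foo.bar]
    }
}
```

Compiling the Pattern once as a constant avoids recompiling it on every call.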

Use split to break the string into fragments, then check each fragment for a "." with the contains method to pick out the ones you want:
Here you go:
public static void main(String args[]) {
String str = "2 * employee.salary AS double_salary";
String arr[] = str.split("\\s");
for (int i = 0; i < arr.length; i++) {
if (arr[i].contains(".")) {
System.out.println(arr[i]);
}
}
}

String mydata = "2 * employee.salary AS double_salary";
Pattern pattern = Pattern.compile("(\\w+\\.\\w+)");
Matcher matcher = pattern.matcher(mydata);
if (matcher.find())
{
System.out.println(matcher.group(1));
}

I'm not an expert in Java, but since I have used regex in Python, and based on internet tutorials, I suggest using r'(\S*)\.(\S*)' as the pattern. I tried it in Python and it worked well on your example.
But it has a bug if the input contains several dots in a row. If you try to match something like first.second.third, this pattern yields ('first.second', 'third') as the matched groups, which I think is because the first \S* is greedy.
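The same behaviour can be reproduced in Java. The sketch below (class and method names are hypothetical) first shows the greedy split described above, then one possible alternative, assuming each part of a chain is made of word characters only: match the whole dotted chain with a repeated non-capturing group.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DotChains {
    // The pattern from the answer above: the first \S* is greedy,
    // so it swallows everything up to the LAST dot.
    static String[] greedyGroups(String input) {
        Matcher m = Pattern.compile("(\\S*)\\.(\\S*)").matcher(input);
        return m.find() ? new String[] { m.group(1), m.group(2) } : new String[0];
    }

    // Alternative (an assumption, not from the original answers):
    // one word followed by one or more ".word" parts, matched as a whole.
    static String wholeChain(String input) {
        Matcher m = Pattern.compile("\\w+(?:\\.\\w+)+").matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String[] g = greedyGroups("first.second.third");
        System.out.println(g[0] + " | " + g[1]);                    // first.second | third
        System.out.println(wholeChain("x first.second.third y"));   // first.second.third
    }
}
```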

Related

extract values from string with Regular Expression

I have this java code
String msg = "*1*20*11*30*IGNORE*53*40##";
String regex = "\\*1\\*(.*?)\\*11\\*(.*?)\\*(.*?)\\*53\\*(.*?)##";
Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(msg);
if (matcher.find()) {
for (int i = 0; i < matcher.groupCount(); i++) {
System.out.println(matcher.group((i+1)));
}
}
the output is
20
30
IGNORE
40
How do I have to change the regex so that the field which contains IGNORE is ignored?
I want whatever is written there not to be found by the matcher.
The positions where 20, 30 and 40 appear hold the values I need to extract; IGNORE in my case is a protocol-specific counter that I have no use for.
Always ignore the 3rd parameter:
Simply don't create a capture (don't use parentheses).
\\*1\\*(.*?)\\*11\\*(.*?)\\*.*?\\*53\\*(.*?)##
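As a quick check of the non-capturing variant, here is a small sketch (the class and method names are made up) that collects the remaining groups into a list. With the parentheses removed from the third field, the pattern has only three capture groups:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SkipThirdField {
    // Same regex as above; the field between *11*..* and *53* is matched but not captured.
    private static final Pattern MSG =
            Pattern.compile("\\*1\\*(.*?)\\*11\\*(.*?)\\*.*?\\*53\\*(.*?)##");

    static List<String> values(String msg) {
        List<String> out = new ArrayList<>();
        Matcher m = MSG.matcher(msg);
        if (m.find()) {
            // Group 0 is the whole match, so the captured values start at 1.
            for (int i = 1; i <= m.groupCount(); i++) {
                out.add(m.group(i));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(values("*1*20*11*30*IGNORE*53*40##")); // [20, 30, 40]
    }
}
```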
Ignore independently of position:
You need to capture the IGNORE part just like you're doing, and check in your loop if it needs to be ignored:
String msg = "*1*20*11*30*IGNORE*53*40##";
String regex = "\\*1\\*(.*?)\\*11\\*(.*?)\\*(.*?)\\*53\\*(.*?)##";
Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(msg);
if (matcher.find()) {
for (int i = 0; i < matcher.groupCount(); i++) {
if (!matcher.group(i+1).equals("IGNORE")) {
System.out.println(matcher.group(i+1));
}
}
}
You can use a tempered greedy token to make sure you do not get a match when IGNORE is in-between the 2nd and 3rd capture groups:
\\*1\\*(.*?)\\*11\\*(.*?)\\*(?:(?!IGNORE).)*\\*53\\*(.*?)##
In this case, the field between the 2nd and 3rd capture groups cannot contain IGNORE.
The token is useful when you need to match the closest window between two subpatterns that does not contain some substring.
In case you just do not want the 3rd group to be equal to IGNORE, use a negative look-ahead:
\\*1\\*(.*?)\\*11\\*(.*?)\\*(?!IGNORE\\*)(.*?)\\*53\\*(.*?)##
The added part is the look-ahead (?!IGNORE\\*) placed right before the third capture group.
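To illustrate the look-ahead variant, the sketch below (hypothetical class and method names) checks two messages: one whose third field is exactly IGNORE and one with an ordinary value there. Note that with this pattern the whole message fails to match when the field equals IGNORE, rather than the field simply being skipped.

```java
import java.util.regex.Pattern;

public class RejectIgnore {
    // Same regex as above: the third field must not be exactly "IGNORE".
    private static final Pattern MSG = Pattern.compile(
            "\\*1\\*(.*?)\\*11\\*(.*?)\\*(?!IGNORE\\*)(.*?)\\*53\\*(.*?)##");

    static boolean accepts(String msg) {
        return MSG.matcher(msg).find();
    }

    public static void main(String[] args) {
        System.out.println(accepts("*1*20*11*30*IGNORE*53*40##")); // false
        System.out.println(accepts("*1*20*11*30*7*53*40##"));      // true
    }
}
```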
Split the input on * and treat IGNORE as an optional part of the delimiter, having first trimmed off the prefix and suffix:
String[] parts = msg.replaceAll("^\\*\\d\\*|##$","").split("(\\*IGNORE)?\\*\\d+\\*");
Some test code:
String msg = "*1*20*11*30*IGNORE*53*40##";
String[] parts = msg.replaceAll("^\\*\\d\\*|##$","").split("(\\*IGNORE)?\\*\\d+\\*");
System.out.println(Arrays.toString(parts));
Output:
[20, 30, 40]

Pattern Matcher Vs String Split, which should I use?

First time posting.
First, I know how to use both Pattern/Matcher and String.split.
My question is which is best to use in my example, and why?
Suggestions for better alternatives are also welcome.
Task:
I need to extract an unknown NOUN between two known regexp in an unknown string.
My Solution:
get the Start and End of the noun (from Regexp 1&2) and substring to extract the noun.
String line = "unknownXoooXNOUNXccccccXunknown";
int goal = 12 ;
String regexp1 = "Xo+X";
String regexp2 = "Xc+X";
I need to locate the index position AFTER the first regex.
I need to locate the index position BEFORE the second regex.
A) I can use pattern matcher
Pattern p = Pattern.compile(regexp1);
Matcher m = p.matcher(line);
if (m.find()) {
int afterRegex1 = m.end();
} else {
throw new IllegalArgumentException();
//TODO Exception Management;
}
B) I can use String Split
String[] split = line.split(regexp1,2);
if (split.length != 2) {
throw new UnsupportedOperationException();
//TODO Exception Management;
}
int afterRegex1 = line.indexOf(split[1]);
Which Approach should I use and why?
I don't know which is more efficient on time and memory.
Both are near enough as readable to myself.
I'd do it like this:
String line = "unknownXoooXNOUNXccccccXunknown";
String regex = "Xo+X(.*?)Xc+X";
Pattern p = Pattern.compile(regex);
Matcher m = p.matcher(line);
if (m.find()) {
String noun = m.group(1);
}
The (.*?) is used to make the inner match on the NOUN reluctant. This protects us from a case where our ending pattern appears again in the unknown portion of the string.
EDIT
This works because the (.*?) defines a capture group. There's only one such group defined in the pattern, so it gets index 1 (the parameter to m.group(1)). These groups are indexed from left to right starting at 1. If the pattern were defined like this
String regex = "(Xo+X)(.*?)(Xc+X)";
Then there would be three capture groups, such that
m.group(1); // yields "XoooX"
m.group(2); // yields "NOUN"
m.group(3); // yields "XccccccX"
There is a group 0, but that matches the whole pattern, and it's equivalent to this
m.group(); // yields "XoooXNOUNXccccccX"
For more information about what you can do with the Matcher, including ways to get the start and end positions of your pattern within the source string, see the Matcher JavaDocs
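If the index positions from the question are still needed, the capture group gives them directly through start(int) and end(int). A minimal sketch, with made-up class and method names:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NounPosition {
    private static final Pattern NOUN = Pattern.compile("Xo+X(.*?)Xc+X");

    // Returns {start, end} of group 1; end is exclusive, as with substring().
    static int[] nounSpan(String line) {
        Matcher m = NOUN.matcher(line);
        if (!m.find()) {
            throw new IllegalArgumentException("delimiters not found");
        }
        return new int[] { m.start(1), m.end(1) };
    }

    public static void main(String[] args) {
        String line = "unknownXoooXNOUNXccccccXunknown";
        int[] span = nounSpan(line);
        System.out.println(span[0] + ".." + span[1]);              // 12..16
        System.out.println(line.substring(span[0], span[1]));      // NOUN
    }
}
```

The start index 12 agrees with the goal value given in the question.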
You should use String.split() for readability unless you're in a tight loop.
Per split()'s javadoc, split() does the equivalent of Pattern.compile(), which you can optimize away if you're in a tight loop.
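One way to do that optimization, sketched with hypothetical names: compile the delimiter once and call Pattern.split(input, limit), which the String.split javadoc describes as yielding the same result.

```java
import java.util.regex.Pattern;

public class PrecompiledSplit {
    // Compiled once, reused on every call.
    private static final Pattern START = Pattern.compile("Xo+X");

    // Returns everything after the first occurrence of the start delimiter.
    static String afterStart(String line) {
        String[] parts = START.split(line, 2);
        if (parts.length != 2) {
            throw new IllegalArgumentException("start delimiter not found");
        }
        return parts[1];
    }

    public static void main(String[] args) {
        System.out.println(afterStart("unknownXoooXNOUNXccccccXunknown")); // NOUNXccccccXunknown
    }
}
```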
It looks like you want to get a unique occurrence. For this do simply
input.replaceAll(".*Xo+X(.*)Xc+X.*", "$1")
For efficiency, compile the Pattern once and call pattern.matcher(input).replaceAll("$1") instead.
If your input contains line breaks, use Pattern.DOTALL or the (?s) flag.
If you want to use split, consider Guava's Splitter. It behaves more sanely and also accepts a Pattern, which is good for speed.
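For the replaceAll suggestion, a rough sketch of the precompiled form with DOTALL might look like this (class and method names are illustrative only). The sample input contains line breaks to show why the flag matters:

```java
import java.util.regex.Pattern;

public class DotAllExtract {
    // DOTALL lets '.' match line terminators, so the leading and trailing .*
    // can swallow the lines around the delimiters.
    private static final Pattern WRAP =
            Pattern.compile(".*Xo+X(.*)Xc+X.*", Pattern.DOTALL);

    static String extract(String input) {
        return WRAP.matcher(input).replaceAll("$1");
    }

    public static void main(String[] args) {
        System.out.println(extract("line1\nXoooXNOUNXccX\nline3")); // NOUN
    }
}
```

Keep in mind that replaceAll returns the input unchanged when the pattern does not match, so this form cannot tell "no match" apart from a match by itself.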
If you really need the locations you can do it like this:
String line = "unknownXoooXNOUNXccccccXunknown";
String regexp1 = "Xo+X";
String regexp2 = "Xc+X";
Matcher m=Pattern.compile(regexp1).matcher(line);
if(m.find())
{
int start=m.end();
if(m.usePattern(Pattern.compile(regexp2)).find())
{
final int end = m.start();
System.out.println("from "+start+" to "+end+" is "+line.substring(start, end));
}
}
But if you just need the word in between, I recommend the way Ian McLaird has shown.

Java regex pattern for number

I know that this question can be stupid but I am trying to get some information from text and you are my last hope after last three hours of trying..
DIC: C/40764176 IC: 407641'6
Dekujerne a t8ime se na shledanou
I need to get for example this 40764176
I need to get a string of length 8-10; sometimes it can contain some special chars (like I, i, G, S, O, ó, l), but I have tried a lot of patterns for this and none of them works...
I tried:
String generalDicFormatPattern = "([0-9IiGSOól]{8,10})";
String generalDicFormatPattern = ".*([0-9IiGSOól]{8,10}).*";
String generalDicFormatPattern = "\\b([0-9IiGSOól]{8,10})\\b";
Nothing works... do you know where the problem is?
edit:
I use regex in this way:
private List<String> getGeneralDicFromLine(String concreteLine) {
List<String> allMatches = new ArrayList<String>();
Pattern pattern = Pattern.compile(generalDicFormatPattern);
Matcher matcher = pattern.matcher(concreteLine);
while (matcher.find()) {
allMatches.add(matcher.group(1));
}
return allMatches;
}
If your string's pattern is fixed you can use the regex
C/([^\s]{8,10})\sIC:
Sample code:
String s = "DIC: C/40764176 IC: 407641'6";
Pattern p = Pattern.compile("C/([^\\s]{8,10})\\sIC:");
Matcher m = p.matcher(s);
if (m.find()) {
System.out.println(m.group(1)); // 40764176
}
I'm expecting any character (includes the special ones you've shown in examples) but a white space.
Maybe you can split your string on spaces (string.split("\\s");); then you should have an array like this:
DIC:
C/40764176
IC:
407641'6
...
shledanou
Get the second string, split it using "/", and take the second element.
I hope it helped you.
Tip: you can then check the result using a regex such as ([0-9IiGSOól]{8,10}).
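Put together, the split-then-validate idea could look roughly like the sketch below. It assumes the value is always the second whitespace-separated token and follows a single "/"; the class and method names are made up.

```java
import java.util.regex.Pattern;

public class DicSplit {
    // Character class from the question; the unicode escape is the letter o with an acute accent.
    private static final Pattern DIC = Pattern.compile("[0-9IiGSO\u00f3l]{8,10}");

    // Returns the candidate after "C/" or null when the line does not fit the assumed layout.
    static String extractDic(String line) {
        String[] tokens = line.trim().split("\\s+");
        if (tokens.length < 2) {
            return null;
        }
        String[] parts = tokens[1].split("/");
        if (parts.length < 2) {
            return null;
        }
        String candidate = parts[1];
        // matches() requires the whole candidate to fit the character class.
        return DIC.matcher(candidate).matches() ? candidate : null;
    }

    public static void main(String[] args) {
        System.out.println(extractDic("DIC: C/40764176 IC: 407641'6")); // 40764176
    }
}
```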

How would I do this in Java Regex?

I'm trying to write a regex that grabs all occurrences of a word, let's just say chicken, that are not in brackets. So
chicken
Would be selected but
[chicken]
Would not. Does anyone know how to do this?
String template = "[chicken]";
String pattern = "\\G(?<!\\[)(\\w+)(?!\\])";
Pattern p = Pattern.compile(pattern);
Matcher m = p.matcher(template);
while (m.find())
{
System.out.println(m.group());
}
It uses a combination of negative look-behind and negative look-aheads and boundary matchers.
(?<!\\[) //negative look behind
(?!\\]) //negative look ahead
(\\w+) //capture group for the word
\\G //is a boundary matcher for marking the end of the previous match
(please read the following edits for clarification)
EDIT 1:
If one needs to account for situations like:
"chicken [chicken] chicken [chicken]"
We can replace the regex with:
String regex = "(?<!\\[)\\b(\\w+)\\b(?!\\])";
EDIT 2:
If one also needs to account for situations like:
"[chicken"
"chicken]"
As in one still wants the "chicken", then you could use:
String pattern = "(?<!\\[)?\\b(\\w+)\\b(?!\\])|(?<!\\[)\\b(\\w+)\\b(?!\\])?";
Which essentially accounts for the two cases of having only one bracket on either side. It accomplishes this through the | which acts as an or, and by using ? after the look-ahead/behinds, where ? means 0 or 1 of the previous expression.
I guess you want something like:
final Pattern UNBRACKETED_WORD_PAT = Pattern.compile("(?<!\\[)\\b\\w+\\b(?!])");
private List<String> findAllUnbracketedWords(final String s) {
final List<String> ret = new ArrayList<String>();
final Matcher m = UNBRACKETED_WORD_PAT.matcher(s);
while (m.find()) {
ret.add(m.group());
}
return Collections.unmodifiableList(ret);
}
Use this:
/(?<![\[\w])\w+(?![\w\]])/
i.e., consecutive word characters with no square bracket or word character before or after.
This needs to check both left and right for both a square bracket and a word character, else for your input of [chicken] it would simply return
hicke
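In Java source the same expression needs its backslashes doubled. A small sketch (names are illustrative) that runs it over a mixed input:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UnbracketedWords {
    // No '[' or word character before the word, no word character or ']' after it.
    private static final Pattern WORD =
            Pattern.compile("(?<![\\[\\w])\\w+(?![\\w\\]])");

    static List<String> find(String s) {
        List<String> out = new ArrayList<>();
        Matcher m = WORD.matcher(s);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(find("pig [cow] chicken bull] [grain")); // [pig, chicken]
    }
}
```

With this pattern a word that touches a bracket on either side is rejected, so "bull]" and "[grain" are skipped as well.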
Without look around:
import java.util.regex.Pattern;
import java.util.regex.Matcher;
public class MatchingTest
{
private static String x = "pig [cow] chicken bull] [grain";
public static void main(String[] args)
{
Pattern p = Pattern.compile("(\\[?)(\\w+)(\\]?)");
Matcher m = p.matcher(x);
while(m.find())
{
String firstBracket = m.group(1);
String word = m.group(2);
String lastBracket = m.group(3);
if ("".equals(firstBracket) && "".equals(lastBracket))
{
System.out.println(word);
}
}
}
}
Output:
pig
chicken
A bit more verbose, sure, but I find it more readable and easier to understand. Certainly simpler than a huge regular expression trying to handle all possible combinations of brackets.
Note that this won't filter out input like [fence tree grass]; it will indicate that tree is a match. You cannot skip tree in that without a parser. Hopefully, this is not a case you need to handle.
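If that case does matter, a tiny hand-written scanner that tracks bracket depth is one option. This is only a sketch under stated assumptions: words are runs of letters, digits and underscores, a stray "]" with no opening bracket is ignored, and an unclosed "[" hides everything after it. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class BracketAwareScanner {
    static List<String> wordsOutsideBrackets(String s) {
        List<String> words = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int depth = 0; // number of currently open '[' brackets
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            boolean wordChar = Character.isLetterOrDigit(c) || c == '_';
            if (wordChar && depth == 0) {
                current.append(c);
                continue;
            }
            // Any other character ends the word being collected.
            if (current.length() > 0) {
                words.add(current.toString());
                current.setLength(0);
            }
            if (c == '[') {
                depth++;
            } else if (c == ']' && depth > 0) {
                depth--;
            }
        }
        if (current.length() > 0) {
            words.add(current.toString());
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(wordsOutsideBrackets("pig [fence tree grass] chicken")); // [pig, chicken]
    }
}
```

Unlike the regex versions, this treats "bull]" as an unbracketed word, because the closing bracket has no matching opener.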

how can i extract an value using regex java?

I need to extract just the number from this text. I currently use substring to pull it out, but sometimes the number has fewer digits, so I get a wrong value...
example(16656);
Use Pattern to compile your regular expression and Matcher to get a particular captured group. The regex I'm using is:
example\((\d+)\)
which captures the digits (\d+) within the parentheses. So:
Pattern p = Pattern.compile("example\\((\\d+)\\)");
Matcher m = p.matcher(text);
if (m.find()) {
int i = Integer.parseInt(m.group(1));
...
}
Take a look at the Java regular expression samples here:
http://java.sun.com/developer/technicalArticles/releases/1.4regex/
Focus especially on the find method.
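The difference between find and matches matters for this input, since the text ends with a semicolon that the shorter pattern does not cover. A small sketch, with made-up class and method names:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FindVsMatches {
    private static final Pattern NUMBER = Pattern.compile("example\\((\\d+)\\)");

    // matches() succeeds only if the pattern covers the entire input.
    static boolean matchesWhole(String text) {
        return NUMBER.matcher(text).matches();
    }

    // find() succeeds if the pattern occurs anywhere in the input.
    static int findNumber(String text) {
        Matcher m = NUMBER.matcher(text);
        if (!m.find()) {
            throw new IllegalArgumentException("no number found");
        }
        return Integer.parseInt(m.group(1));
    }

    public static void main(String[] args) {
        String text = "example(16656);";
        System.out.println(matchesWhole(text)); // false, the trailing ';' is not matched
        System.out.println(findNumber(text));   // 16656
    }
}
```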
String yourString = "example(16656);";
Pattern pattern = Pattern.compile("\\w+\\((\\d+)\\);");
Matcher matcher = pattern.matcher(yourString);
if (matcher.matches())
{
int value = Integer.parseInt(matcher.group(1));
System.out.println("Your number: " + value);
}
I would suggest writing your own logic for this. Using Pattern and Matcher is good practice, but they are general-purpose tools and are not always the most efficient fit. cletus provided a very neat solution, but behind the scenes it runs a pattern-matching algorithm to locate the digits. I don't think you need pattern matching here; you just need to extract the digits from a string (like 123 from "a1b2c3"). The following code does that cleanly in O(n), without the extra work that the Pattern and Matcher classes do for you (just copy, paste and run :) ):
public class DigitExtractor {
/**
* @param args
*/
public static void main(String[] args) {
String sample = "sdhj12jhj345jhh6mk7mkl8mlkmlk9knkn0";
String digits = getDigits(sample);
System.out.println(digits);
}
private static String getDigits(String sample) {
StringBuilder out = new StringBuilder(10);
int stringLength = sample.length();
for(int i = 0; i <stringLength ; i++)
{
char currentChar = sample.charAt(i);
// a char is an ASCII digit when it lies between '0' and '9'
boolean isDigit = currentChar >= '0' && currentChar <= '9';
if(isDigit)
out.append(currentChar);
}
return out.toString();
}
}
