Java regex matching - java

I need to parse through a file(built from a string) searching for occurences of a single or multiline text. Will this solution always work ? If not - how should I change it ?
private int parseString(String s){
Pattern p = Pattern.compile(searchableText);
Matcher m = p.matcher(s);
int count = 0;
while(m.find()) {
count++;
}
return count;
}

Consider Pattern.quote if text can contain regex metacharacters.
Consider java.util.Scanner and findWithinHorizon

For multiline strings you need to adjust the newline codes to what is the file, or (probably better) use a regular expression to be able to accept all the various combinations of \n and \r.

Consider searching for a\s+b\s+c\s+ where a, b, and c are the sequence of words (or characters if you wish) you are searching for. It may be (over) greedy, but hopefully it should give an insight. You can also perform a normalization of the text first.

Related

Regex matching up to a character if it occurs

I need to match string as below:
match everything upto ;
If - occurs, match only upto - excluding -
For e.g. :
abc; should return abc
abc-xyz; should return abc
Pattern.compile("^(?<string>.*?);$");
Using above i can achieve half. but dont know how to change this pattern to achieve the second requirement. How do i change .*? so that it stops at forst occurance of -
I am not good with regex. Any help would be great.
EDIT
I need to capture it as group. i cant change it since there many other patterns to match and capture. Its only part of it that i have posted.
Code looks something like below.
public static final Pattern findString = Pattern.compile("^(?<string>.*?);$");
if(findString.find())
{
return findString.group("string"); //cant change anything here.
}
Just use a negated char class.
^[^-;]*
ie.
Pattern p = Pattern.compile("^[^-;]*");
Matcher m = p.matcher(str);
while(m.find()) {
System.out.println(m.group());
}
This would match any character at the start but not of - or ;, zero or more times.
This should do what you are looking for:
[^-;]*
It matches characters that are not - or ;.
Tipp: If you don't feel sure with regular expressions there are great online solutions to test your input, e.g. https://regex101.com/
UPDATE
I see you have an issue in the code since you try to access .group in the Pattern object, while you need to use the .group method of the Matcher object:
public static String GetTheGroup(String str) {
Pattern findString = Pattern.compile("(?s)^(?<string>.*?)[;-]");
Matcher matcher = findString.matcher(str);
if (matcher.find())
{
return matcher.group("string"); //you have to change something here.
}
else
return "";
}
And call it as
System.out.println(GetTheGroup("abc-xyz;"));
See IDEONE demo
OLD ANSWER
Your ^(?<string>.*?);$ regex only matches 0 or more characters other than a newline from the beginning up to the first ; that is the last character in the string. I guess it is not what you expect.
You should learn more about using character classes in regex, as you can match 1 symbol from a specified character set that is defined with [...].
You can achieve this with a String.split taking the first element only and a [;-] regex that matches a ; or - literally:
String res = "abc-xyz;".split("[;-]")[0];
System.out.println(res);
Or with replaceAll with (?s)[;-].*$ regex (that matches the first ; or - and then anything up to the end of string:
res = "abc-xyz;".replaceAll("(?s)[;-].*$", "");
System.out.println(res);
See IDEONE demo
I have found the solution without removing groupings.
(?<string>.*?) matches everything upto next grouping pattern
(?:-.*?)? followed by a non grouping pattern starts with - and comes zero or once.
; end character.
So putting all together:
public static final Pattern findString = Pattern.compile("^(?<string>.*?)(?:-.*?)?;$");
if(findString.find())
{
return findString.group("string"); //cant change anything here.
}

Regular expression for counting words in a sentence

public static int getWordCount(String sentence) {
return sentence.split("(([a-zA-Z0-9]([-][_])*[a-zA-Z0-9])+)", -1).length
+ sentence.replaceAll("([[a-z][A-Z][0-9][\\W][-][_]]*)", "").length() - 1;
}
My intention is to count the number of words in a sentence. The input to this function is the the lengthy sentence. It may have 255 words.
The word should take hyphens or underscores in between
Function should only count valid words means special character should not be counted eg. &&&& or #### should not count as a word.
The above regular expression is working fine, but when hyphen or underscore comes in between the word eg: co-operation, the count returning as 2, it should be 1. Can anyone please help?
Instead of using .split and .replaceAll which are quite expensive operations, please use an approach with constant memory usage.
Based on your specifications, you seem to look for the following regex:
[\w-]+
Next you can use this approach to count the number of matches:
public static int getWordCount(String sentence) {
Pattern pattern = Pattern.compile("[\\w-]+");
Matcher matcher = pattern.matcher(sentence);
int count = 0;
while (matcher.find())
count++;
return count;
}
online jDoodle demo.
This approach works in (more) constant memory: when splitting, the program constructs an array, which is basically useless, since you never inspect the content of the array.
If you don't want words to start or end with hyphens, you can use the following regex:
\w+([-]\w+)*
This part ([-][_])* is wrong. The notation [xyz] means "any single one of the characters inside the brackets" (see http://www.regular-expressions.info/charclass.html). So effectively, you allow exactly the character - and exactly the character _, in that order.
Fixing your group makes it work:
[a-zA-Z0-9]+([-_][a-zA-Z0-9]+)*
and it can be further simplified using \w to
\w+(-\w+)*
because \w matches 0..9, A..Z, a..z and _ (http://www.regular-expressions.info/shorthand.html) and so you only need to add -.
if you can use java 8:
long wordCount = Arrays.stream(sentence.split(" ")) //split the sentence into words
.filter(s -> s.matches("[\\w-]+")) //filter only matching words
.count();
With java 8
public static int getColumnCount(String row) {
return (int) Pattern.compile("[\\w-]+")
.matcher(row)
.results()
.count();
}

How to best strip out certain strings in a file?

If I have a file with the following content:
11:17 GET this is my content #2013
11:18 GET this is my content #2014
11:19 GET this is my content #2015
How can I use a Scanner and ignore certain parts of a `String line = scanner.nextLine();?
The result that I like to have would be:
this is my content
this is my content
this is my content
So I'd like to trip everything from the start until GET, and then take everything until the # char.
How could this easily be done?
You can use the String.indexOf(String str) and String.indexOf(char ch) methods. For example:
String line = scanner.nextLine();
int start = line.indexOf("GET");
int end = line.indexOf('#');
String result = line.substring(start + 4, end);
One way might be
String strippedStart = scanner.nextLine().split(" ", 3)[2];
String result = strippedStart.substring(0, strippedStart.lastIndexOf("#")).trim();
This assumes the are always two space separated tokens at the beginning (11:22 GET or 11:33 POST, idk).
You could do something like this:-
String line ="11:17 GET this is my content #2013";
int startIndex = line.indexOf("GET ");
int endIndex = line.indexOf("#");
line = line.substring(startIndex+4, endIndex-1);
System.out.println(line);
In my opinion the best solution for your problem would be using Java regex. Using regex you can define which group or groups of text you want to retrieve and what kind of text comes where. I haven't been working with Java in a long time, so I'll try to help you out from the top of my head. I'll try to give you a point in the right direction.
First off, compile a pattern:
Pattern pattern = Pattern.compile("^\d{1,2}:\d{1,2} GET (.*?) #\d+$", Pattern.MULTILINE);
First part of the regex says that you expect one or two digits followed by a colon followed by one or two digits again. After that comes the GET (you can use GET|POST if you expect those words or \w+? if you expect any word). Then you define the group you want with the parentheses. Lastly, you put the hash and any number of digits with at least one digit. You might consider putting flags DOTALL and CASE_INSENSITIVE, although I don't think you'll be needing them.
Then you continue with the matcher:
Matcher matcher = pattern.matcher(textToParse);
while (matcher.find())
{
//extract groups here
String group = matcher.group(1);
}
In the while loop you can use matcher.group(1) to find the text in the group you selected with the parentheses (the text you'd like extracted). matcher.group(0) gives the entire find, which is not what you're currently looking for (I guess).
Sorry for any errors in the code, it has not been tested. Hope this puts you on the right track.
You can try this rather flexible solution:
Scanner s = new Scanner(new File("data"));
Pattern p = Pattern.compile("^(.+?)\\s+(.+?)\\s+(.*)\\s+(.+?)$");
Matcher m;
while (s.hasNextLine()) {
m = p.matcher(s.nextLine());
if (m.find()) {
System.out.println(m.group(3));
}
}
This piece of code ignores first, second and last words from every line before printing them.
Advantage is that it relies on whitespaces rather than specific string literals to perform the stripping.

Get substring between two characters

How do you build a regex to return for the characters between < and # of a string?
For example <1001#10.2.2.1> would return 1001.
Would something using <.?> work?
Would something using "<.?>" work?
A slightly modified version of it would work: <.*?# (you need an # at the end, and you need a reluctant quantifier *? in place of an optional mark ?). However it could be inefficient because of backtracking. Something like this would be better:
<([^#]*)#
This expression starts by finding <, taking as many non-# characters as it could, and capturing the # before stopping.
Parentheses denote a capturing group. Use regex API to extract it:
Pattern p = Pattern.compile("<([^#]*)#");
Matcher m = p.matcher("<1001#10.2.2.1>");
if (m.find()) {
System.out.println(m.group(1));
}
This prints 1001 (demo).
What about the next:
(?<=<)[^#]*
e.g.:
private static final Pattern REGEX_PATTERN =
Pattern.compile("(?<=<)[^#]*");
public static void main(String[] args) {
String input = "<1001#10.2.2.1>";
Matcher matcher = REGEX_PATTERN.matcher(input);
while (matcher.find()) {
System.out.println(matcher.group());
}
}
Output:
1001
Um.
<([0-9]*?)#
I'm assuming it's numbers only.
if all characters use this..
<(.*?)#
tested here..
Maybe i'm lacking knowledge but my understanding of regex is that you need () to get the capture groups... otherwise if you don't you'll just be selecting characters without actually "capturing" them.
so this..
<.?>
won't do anything .

How to remove special characters from a string?

I want to remove special characters like:
- + ^ . : ,
from an String using Java.
That depends on what you define as special characters, but try replaceAll(...):
String result = yourString.replaceAll("[-+.^:,]","");
Note that the ^ character must not be the first one in the list, since you'd then either have to escape it or it would mean "any but these characters".
Another note: the - character needs to be the first or last one on the list, otherwise you'd have to escape it or it would define a range ( e.g. :-, would mean "all characters in the range : to ,).
So, in order to keep consistency and not depend on character positioning, you might want to escape all those characters that have a special meaning in regular expressions (the following list is not complete, so be aware of other characters like (, {, $ etc.):
String result = yourString.replaceAll("[\\-\\+\\.\\^:,]","");
If you want to get rid of all punctuation and symbols, try this regex: \p{P}\p{S} (keep in mind that in Java strings you'd have to escape back slashes: "\\p{P}\\p{S}").
A third way could be something like this, if you can exactly define what should be left in your string:
String result = yourString.replaceAll("[^\\w\\s]","");
This means: replace everything that is not a word character (a-z in any case, 0-9 or _) or whitespace.
Edit: please note that there are a couple of other patterns that might prove helpful. However, I can't explain them all, so have a look at the reference section of regular-expressions.info.
Here's less restrictive alternative to the "define allowed characters" approach, as suggested by Ray:
String result = yourString.replaceAll("[^\\p{L}\\p{Z}]","");
The regex matches everything that is not a letter in any language and not a separator (whitespace, linebreak etc.). Note that you can't use [\P{L}\P{Z}] (upper case P means not having that property), since that would mean "everything that is not a letter or not whitespace", which almost matches everything, since letters are not whitespace and vice versa.
Additional information on Unicode
Some unicode characters seem to cause problems due to different possible ways to encode them (as a single code point or a combination of code points). Please refer to regular-expressions.info for more information.
This will replace all the characters except alphanumeric
replaceAll("[^A-Za-z0-9]","");
As described here
http://developer.android.com/reference/java/util/regex/Pattern.html
Patterns are compiled regular expressions. In many cases, convenience methods such as String.matches, String.replaceAll and String.split will be preferable, but if you need to do a lot of work with the same regular expression, it may be more efficient to compile it once and reuse it. The Pattern class and its companion, Matcher, also offer more functionality than the small amount exposed by String.
public class RegularExpressionTest {
public static void main(String[] args) {
System.out.println("String is = "+getOnlyStrings("!&(*^*(^(+one(&(^()(*)(*&^%$##!#$%^&*()("));
System.out.println("Number is = "+getOnlyDigits("&(*^*(^(+91-&*9hi-639-0097(&(^("));
}
public static String getOnlyDigits(String s) {
Pattern pattern = Pattern.compile("[^0-9]");
Matcher matcher = pattern.matcher(s);
String number = matcher.replaceAll("");
return number;
}
public static String getOnlyStrings(String s) {
Pattern pattern = Pattern.compile("[^a-z A-Z]");
Matcher matcher = pattern.matcher(s);
String number = matcher.replaceAll("");
return number;
}
}
Result
String is = one
Number is = 9196390097
Try replaceAll() method of the String class.
BTW here is the method, return type and parameters.
public String replaceAll(String regex,
String replacement)
Example:
String str = "Hello +-^ my + - friends ^ ^^-- ^^^ +!";
str = str.replaceAll("[-+^]*", "");
It should remove all the {'^', '+', '-'} chars that you wanted to remove!
To Remove Special character
String t2 = "!##$%^&*()-';,./?><+abdd";
t2 = t2.replaceAll("\\W+","");
Output will be : abdd.
This works perfectly.
Use the String.replaceAll() method in Java.
replaceAll should be good enough for your problem.
You can remove single char as follows:
String str="+919595354336";
String result = str.replaceAll("\\\\+","");
System.out.println(result);
OUTPUT:
919595354336
If you just want to do a literal replace in java, use Pattern.quote(string) to escape any string to a literal.
myString.replaceAll(Pattern.quote(matchingStr), replacementStr)

Categories

Resources