regular expression for specific URL - java

How can I use a regular expression in Java to extract URLs of the form
/p/{any set of characters}/bugs/{any number from 0 to 999}
from a text file? I tried the following:
final String regex = "(\\/p\\/.*\\/bugs\\/(\\d{0,3}))";
But it didn't work for me.

Not sure why you have those backslashes in your regex; a forward slash does not need escaping in Java. Additionally, use a negated character class to match any set of characters in the pattern, so the match cannot run past the next slash. The following might work for you:
/p/[^/]*/bugs/[0-9]{1,3}

Try this regex:
final String regex = "/p/.+/bugs/[0-9]{1,3}";
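A minimal sketch of how the negated-character-class pattern could be applied with Pattern and Matcher. The class name BugUrlExtractor, the method extract, and the sample text are hypothetical. The trailing (?![0-9]) lookahead and the use of [^/\\s]+ instead of [^/]* are assumptions added here so that a four-digit number or a match spanning whitespace is rejected. Reading the text file is left out; any String will do.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BugUrlExtractor {
    // /p/<one or more non-slash, non-space chars>/bugs/<1 to 3 digits>,
    // not followed by a further digit (assumption: 1234 should not match as 123).
    private static final Pattern BUG_URL =
            Pattern.compile("/p/[^/\\s]+/bugs/[0-9]{1,3}(?![0-9])");

    // Returns every match found in the text, in order of appearance.
    public static List<String> extract(String text) {
        List<String> urls = new ArrayList<>();
        Matcher m = BUG_URL.matcher(text);
        while (m.find()) {
            urls.add(m.group());
        }
        return urls;
    }

    public static void main(String[] args) {
        String text = "see /p/demo/bugs/42 and /p/other-proj/bugs/999, not /p/demo/bugs/1234";
        // prints [/p/demo/bugs/42, /p/other-proj/bugs/999]
        System.out.println(extract(text));
    }
}
```

Matcher.find() is used rather than String.matches() because the URLs are embedded in longer text; matches() would require the whole input to be a single URL.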

Related

Java RegEx backreference not working

In Java I tried to replace the .JPG.json part with .JPG, preserving letter case (sometimes it may be .jpg.json).
My code:
String Path="MyImage.JPG.json";
Result=Path.replaceFirst("/(.jpg)\\.json/i","$1");
But it returns :
MyImage.JPG.json
Instead of:
MyImage.JPG
You need to remove the / slashes. In Java, you don't include / as a regex delimiter. You must also escape the dots. To do a case-insensitive match, add the (?i) modifier at the start.
Path.replaceFirst("(?i)(\\.jpg)\\.json", "$1");
OR
You could also use a lookbehind assertion.
Path.replaceFirst("(?i)(?<=\\.jpg)\\.json", "");
(?<=\\.jpg) asserts that the text about to be matched is preceded by .jpg. If so, only the following .json is matched. Replacing the matched .json with an empty string gives the desired output.
Try this command. It should work for you:
Path.replaceFirst("(?i)(\\.jpg)\\.json","$1")
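A short sketch of both suggestions above, mainly to show that the result of replaceFirst must be assigned or returned, since Java strings are immutable. The class and method names are hypothetical.

```java
public class JsonSuffixStripper {
    // Variant 1: capture the .jpg part (any letter case) and put it back via $1.
    public static String withBackreference(String path) {
        return path.replaceFirst("(?i)(\\.jpg)\\.json", "$1");
    }

    // Variant 2: match only the .json that is preceded by .jpg and drop it.
    public static String withLookbehind(String path) {
        return path.replaceFirst("(?i)(?<=\\.jpg)\\.json", "");
    }

    public static void main(String[] args) {
        System.out.println(withBackreference("MyImage.JPG.json")); // prints MyImage.JPG
        System.out.println(withLookbehind("photo.jpg.json"));      // prints photo.jpg
        System.out.println(withBackreference("notes.txt.json"));   // prints notes.txt.json
    }
}
```

Neither pattern is anchored, so a .jpg.json in the middle of a longer path would also be rewritten; appending $ to the pattern would restrict it to the end of the string.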

Remove everything from a string up to a certain character and optionally a string if it follows too

I am looking to write a regex that removes all characters up to the first &emsp and, if a (new section) follows the &emsp, removes that as well. But the following regex doesn't seem to work. Why? How do I correct this?
String removeEmsp =" “[<centd>[</centd>]§ 431:10A–126 (new section)[<centd>]Chemotherapy services.</centd>] <centa>Cancer treatment.</centa>test snl.";
Pattern removeEmspPattern1 = Pattern.compile("(.*( (\\(new section\\)))?)(.*)", Pattern.MULTILINE);
System.out.println(removeEmspPattern1.matcher(removeEmsp).replaceAll("$2"));
Have you tried String.split? It creates an array of strings from a string, based on a delimiter.
Once you have split the string, just select the elements of the array that you need for the print statement.
Your regex is very long and I do not want to debug it. The tip, however, is that some characters have special meaning in regular expressions. For example, square brackets define character classes, and && inside a character class means intersection. Such characters must be escaped if you want them to be interpreted as plain characters rather than regex syntax. To escape a special character, write \ in front of it. But \ is also the escape character in Java string literals, so it must be doubled.
For example, to replace an ampersand with the letter A you can write str.replaceAll("\\&", "A"). A lone & is not special, so escaping it here is harmless but not required.
Now you have all the information you need. Try starting from a simpler regex and then expand it to what you need. Good luck.
EDIT
BTW, parsing XML and/or HTML with regular expressions is possible but strongly discouraged. Use a dedicated parser for such formats.
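The escaping advice in the answer above can be sketched as follows. The class name EscapeDemo and its methods are hypothetical. Pattern.quote is shown as an alternative to manual escaping when the text to match is a plain literal.

```java
import java.util.regex.Pattern;

public class EscapeDemo {
    // Unescaped: '.' is a metacharacter and matches every character.
    public static String unescapedDot(String s) {
        return s.replaceAll(".", "X");
    }

    // Escaped: one backslash for the regex, doubled for the Java string literal.
    public static String escapedDot(String s) {
        return s.replaceAll("\\.", "X");
    }

    // Pattern.quote turns arbitrary text into a literal pattern.
    public static String quotedDot(String s) {
        return s.replaceAll(Pattern.quote("."), "X");
    }

    // '&' on its own is not special, so no escaping is needed.
    public static String ampersand(String s) {
        return s.replaceAll("&", "A");
    }

    public static void main(String[] args) {
        System.out.println(unescapedDot("a.b")); // prints XXX
        System.out.println(escapedDot("a.b"));   // prints aXb
        System.out.println(quotedDot("a.b"));    // prints aXb
        System.out.println(ampersand("a&b"));    // prints aAb
    }
}
```

When no regex features are needed at all, String.replace(CharSequence, CharSequence) performs a literal replacement and avoids the escaping question entirely.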
Try this:
String removeEmsp =" “[<centd>[</centd>]§ 431:10A–126 (new section)[<centd>]Chemotherapy services.</centd>] <centa>Cancer treatment.</centa>test snl.";
System.out.println(removeEmsp.replaceFirst("^.*?\\ (\\(new\\ssection\\))?", ""));
System.out.println(removeEmsp.replaceAll("^.*?\\ (\\(new\\ssection\\))?", ""));
Output:
[<centd>]Chemotherapy services.</centd>] <centa>Cancer treatment.</centa>test snl.
[<centd>]Chemotherapy services.</centd>] <centa>Cancer treatment.</centa>test snl.
It will remove everything up to " " and optionally, the following "(new section)" text if any.
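A self-contained sketch of the same idea, under the assumption that the separator in the input is the EM SPACE character (U+2003, which is what the &emsp; entity decodes to). The class name, method name, and shortened sample strings are hypothetical.

```java
public class EmspPrefixRemover {
    // Assumption: the separator is U+2003 EM SPACE.
    private static final String EM_SPACE = "\u2003";

    // Removes everything up to and including the first em space,
    // plus a directly following "(new section)" if present.
    public static String stripPrefix(String s) {
        return s.replaceFirst("^.*?" + EM_SPACE + "(\\(new section\\))?", "");
    }

    public static void main(String[] args) {
        String withMarker = "431:10A-126" + EM_SPACE + "(new section)Chemotherapy services.";
        String withoutMarker = "431:10A-126" + EM_SPACE + "Chemotherapy services.";
        System.out.println(stripPrefix(withMarker));    // prints Chemotherapy services.
        System.out.println(stripPrefix(withoutMarker)); // prints Chemotherapy services.
    }
}
```

The reluctant .*? stops at the first em space, and the optional group consumes "(new section)" only when it follows immediately. Because . does not match line terminators by default, input spanning several lines would need the DOTALL flag, (?s).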

String split, words including accented characters

I'm using this regex:
x.split("[^a-zA-Z0-9']+");
This returns an array of strings with letters and/or numbers.
If I use this:
String name = "CEN01_Automated_TestCase.java";
String[] names = name.split("[^a-zA-Z0-9']+");
I got:
CEN01
Automated
TestCase
java
But if I use this:
String name = "CEN01_Automação_Caso_Teste.java";
String[] names = name.split("[^a-zA-Z0-9']+");
I got:
CEN01
Automa
o
Caso
Teste
java
How can I modify this regex to include accented characters? (á,ã,õ, etc...)
From http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html
Categories that behave like the java.lang.Character boolean ismethodname methods (except for the deprecated ones) are available through the same \p{prop} syntax where the specified property has the name javamethodname.
Since the Character class has an isAlphabetic method, you can use
name.split("[^\\p{IsAlphabetic}0-9']+");
You can also use
name.split("(?U)[^\\p{Alpha}0-9']+");
but this requires the UNICODE_CHARACTER_CLASS flag, which you can enable by adding (?U) to the regex.
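A runnable sketch of the \p{IsAlphabetic} variant from this answer. The class and method names are hypothetical, and the accented letters are written as Unicode escapes only so that the example does not depend on the source file encoding.

```java
import java.util.Arrays;

public class UnicodeSplit {
    // Splits on runs of characters that are neither alphabetic (in the
    // Unicode sense), nor ASCII digits, nor an apostrophe.
    public static String[] splitWords(String name) {
        return name.split("[^\\p{IsAlphabetic}0-9']+");
    }

    public static void main(String[] args) {
        // "CEN01_Automação_Caso_Teste.java" with the accented letters escaped
        String name = "CEN01_Automa\u00e7\u00e3o_Caso_Teste.java";
        // prints [CEN01, Automação, Caso, Teste, java]
        System.out.println(Arrays.toString(splitWords(name)));
    }
}
```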
I would check out the Java documentation on regular expressions. There is a Unicode section which I believe is what you are looking for.
EDIT: Example
Another way would be to match on the character code you are looking for. For example
\uFFFF where FFFF is the hexadecimal number of the character you are trying to match.
Example: \u00E0 matches à
Note that the backslash needs to be escaped in Java if you are using it in a string literal.
You can use this:
String[] names = name.split("[^a-zA-Z0-9'\\p{L}]+");
System.out.println(Arrays.toString(names));
will output:
[CEN01, Automação, Caso, Teste, java]
Why not split on the separator characters?
String[] names = name.split("[_.]");
Instead of whitelisting all the characters you want, you could always blacklist the characters you don't want, like:
^[^<>%$]*$
The expression [^(many characters here)] just matches any character that is not listed.
But that is a personal opinion.

Extract words (multiple whitespace) starting with # by regular expression

I have a problem with my regular expression:
String regex = "(?<=[\\s])#\\w+\\s";
I want a regex that formats a string like this:
"This is a Text #tag1 #tag2 #tag3"
With the regular expression, I get the last two values as a result but not tag1, because there is more than one whitespace character. But I want all three of them!
I tried some variations, but nothing worked.
Use this regular expression:
(?<=(^|\\S)\\s)#\\w+(?=\\s|$)
It's a bit unclear from your question what you're really after, so I've put up some simple alternatives:
To capture all the tags in the string, we can use a lookbehind:
((?<=\\s|^)#\\w+)
To capture all the tags at the end of the string, we can use a lookahead:
(#\\w+(?=\\s#)|#\\w+$)
If there are always three tags at the end, there's no need for a lookaround:
(#\\w+)\\s(#\\w+)\\s(#\\w+)$
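A minimal sketch of collecting all tags regardless of how much whitespace separates them. It uses the negative lookbehind (?<!\\S), which accepts the same positions as (?<=\\s|^) above: the # must be at the start of the input or directly after whitespace. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HashtagExtractor {
    // '#' plus word characters, not directly preceded by a non-whitespace character.
    private static final Pattern TAG = Pattern.compile("(?<!\\S)#\\w+");

    public static List<String> extractTags(String text) {
        List<String> tags = new ArrayList<>();
        Matcher m = TAG.matcher(text);
        while (m.find()) {
            tags.add(m.group());
        }
        return tags;
    }

    public static void main(String[] args) {
        // prints [#tag1, #tag2, #tag3]
        System.out.println(extractTags("This is a Text   #tag1  #tag2 #tag3"));
    }
}
```

Because the pattern does not consume any whitespace itself, runs of several spaces between tags do not cause a tag to be skipped.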

How do I write a regular expression to find the following pattern?

I am trying to write a regular expression to do a find and replace operation. Assume Java regex syntax. Below are examples of what I am trying to find:
12341+1
12241+1R1
100001+1R2
So, I am searching for a string beginning with one or more digits, followed by a "1+1" substring, followed by 0 or more characters. I have the following regex:
^(\d+)(1\\+1).*
This regex will successfully find the examples above, however, my goal is to replace the strings with everything before "1+1". So, 12341+1 would become 1234, and 12241+1R1 would become 1224. If I use the first grouped expression $1 to replace the pattern, I get the wrong result as follows:
12341+1 becomes 12341
12241+1R1 becomes 12241
100001+1R2 becomes 100001
Any ideas?
Your existing regex works fine; you are just missing a \ before \d:
String str = "100001+1R2";
str = str.replaceAll("^(\\d+)(1\\+1).*","$1");
IMHO, the regex is correct.
Perhaps you wrote it incorrectly in the code. If you want to put the regex ^(\d+)(1\+1).* in a string, you have to write something like String regex = "^(\\d+)(1\\+1).*".
Your output is probably the result of a ^(\d+)(1+1).* replacement, because a backslash is missing in the string (e.g. "^(\\d+)(1\+1).*").
Your regex looks fine to me. I don't have access to Java, but in JavaScript the code
"12341+1".replace(/(\d+)(1\+1)/g, "$1");
Returns 1234 as you'd expect. This works on a string with many 'codes' in too e.g.
"12341+1 54321+1".replace(/(\d+)(1\+1)/g, "$1");
gives 1234 5432.
Personally, I wouldn't use a regex at all (it'd be like using a hammer on a thumbtack); I'd just take a substring (pseudocode):
stringName.substring(0, stringName.indexOf("1+1"))
But it looks like other posters have already mentioned the non-greedy operator.
In most regex syntaxes you can add a '?' after a '+' or '*' to indicate that you want it to match as little as possible before moving on in the pattern. Thus ^(\d+?)(1\+1) captures digits only until it reaches the first "1+1" and leaves the "1+1" itself to the second group, whereas a greedy \d+ first consumes that 1 as well and gives it back only through backtracking.
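A small sketch comparing the regex and substring approaches from the answers above on the three sample inputs. The class and method names are hypothetical. The substring variant returns the input unchanged when "1+1" is absent, which is an assumption about the desired behavior.

```java
public class PrefixBeforeOnePlusOne {
    // Regex approach: leading digits, then a literal "1+1", then anything.
    // The '+' is escaped for the regex and the backslash doubled for Java.
    public static String viaRegex(String s) {
        return s.replaceAll("^(\\d+)1\\+1.*", "$1");
    }

    // Substring approach: cut at the first occurrence of "1+1".
    public static String viaSubstring(String s) {
        int idx = s.indexOf("1+1");
        return idx < 0 ? s : s.substring(0, idx);
    }

    public static void main(String[] args) {
        for (String s : new String[] {"12341+1", "12241+1R1", "100001+1R2"}) {
            System.out.println(s + " -> " + viaRegex(s) + " / " + viaSubstring(s));
        }
        // prints:
        // 12341+1 -> 1234 / 1234
        // 12241+1R1 -> 1224 / 1224
        // 100001+1R2 -> 10000 / 10000
    }
}
```

The two differ on input that does not start with digits: the regex leaves such a string untouched, while the substring version still cuts at "1+1".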
