Matching Urls Inside Strings

Matching Urls Inside Strings - java

I am trying to write a regex that will match urls inside strings of text that may be html-encoded. I am having a considerable amount of trouble with lookaround though. I need something that would correctly match both links in the string below:
some text "http://www.notarealwebsite.com/?q=asdf&searchOrder=1" "http://www.notarealwebsite.com" some other text
A verbose description of what I want would be: "http://" followed by any number of characters that are not spaces, quotes, or the string "&quot[semicolon]" (I don't care about accepting other non-url-safe characters as delimiters)
I have tried a few regexes using lookahead to check for &'s followed by q's followed by u's and so on, but as soon as I put one into the [^...] negation it just completely breaks down and evaluates more like: "http:// followed by any number of characters that are not spaces, quotes, ampersands, q's, u's, o's, t's, or semicolons" which is obviously not what I am looking for.
This will correctly match the &'s at the beginning of the &quot[semicolon]:
&(?=q(?=u(?=o(?=t(?=;)))))
But this does not work:
http://[^ "&(?=q(?=u(?=o(?=t(?=;)))))]*
I know just enough about regexes to get into trouble, and that includes not knowing why this won't work the way I want it to. I understand to some extent positive and negative lookaround, but I don't understand why it breaks down inside the [^...]. Is it possible to do this with regexes? Or am I wasting my time trying to make it work?

If your regex implementation supports it, use a positive look ahead and a backreference with a non-greedy expression in the body.
Here is one with your conditions: (["\s]|")(http://.*?)(?=\1)
For example, in Python:
import re
p = re.compile(r'(["\s]|")(https?://.*?)(?=\1)', re.IGNORECASE)
url = "http://test.url/here.php?var1=val&var2=val2"
formatstr = 'text "{0}" more text {0} and more "{0}" test greed"'
data = formatstr.format(url)
for m in p.finditer(data):
print "Found:", m.group(2)
Produces:
Found: http://test.url/here.php?var1=val&var2=val2
Found: http://test.url/here.php?var1=val&var2=val2
Found: http://test.url/here.php?var1=val&var2=val2
Or in Java:
#Test
public void testRegex() {
Pattern p = Pattern.compile("([\"\\s]|")(https?://.*?)(?=\\1)",
Pattern.CASE_INSENSITIVE);
final String URL = "http://test.url/here.php?var1=val&var2=val2";
final String INPUT = "some text " + URL + " more text + \"" + URL +
"\" more then "" + URL + "" testing greed "";
Matcher m = p.matcher(INPUT);
while( m.find() ) {
System.out.println("Found: " + m.group(2));
}
}
Produces the same output.

Related

How to parse string using regex

I'm pretty new to java, trying to find a way to do this better. Potentially using a regex.
String text = test.get(i).toString()
// text looks like this in string form:
// EnumOption[enumId=test,id=machine]
String checker = text.replace("[","").replace("]","").split(",")[1].split("=")[1];
// checker becomes machine
My goal is to parse that text string and just return back machine. Which is what I did in the code above.
But that looks ugly. I was wondering what kinda regex can be used here to make this a little better? Or maybe another suggestion?

Use a regex' lookbehind:
(?<=\bid=)[^],]*
See Regex101.
(?<= ) // Start matching only after what matches inside
\bid= // Match "\bid=" (= word boundary then "id="),
[^],]* // Match and keep the longest sequence without any ']' or ','
In Java, use it like this:
import java.util.regex.*;
class Main {
public static void main(String[] args) {
Pattern pattern = Pattern.compile("(?<=\\bid=)[^],]*");
Matcher matcher = pattern.matcher("EnumOption[enumId=test,id=machine]");
if (matcher.find()) {
System.out.println(matcher.group(0));
}
}
}
This results in
machine

Assuming you’re using the Polarion ALM API, you should use the EnumOption’s getId method instead of deparsing and re-parsing the value via a string:
String id = test.get(i).getId();

Using the replace and split functions don't take the structure of the data into account.
If you want to use a regex, you can just use a capturing group without any lookarounds, where enum can be any value except a ] and comma, and id can be any value except ].
The value of id will be in capture group 1.
\bEnumOption\[enumId=[^=,\]]+,id=([^\]]+)\]
Explanation
\bEnumOption Match EnumOption preceded by a word boundary
\[enumId= Match [enumId=
[^=,\]]+, Match 1+ times any char except = , and ]
id= Match literally
( Capture group 1
[^\]]+ Match 1+ times any char except ]
)\]
Regex demo | Java demo
Pattern pattern = Pattern.compile("\\bEnumOption\\[enumId=[^=,\\]]+,id=([^\\]]+)\\]");
Matcher matcher = pattern.matcher("EnumOption[enumId=test,id=machine]");
if (matcher.find()) {
System.out.println(matcher.group(1));
}
Output
machine
If there can be more comma separated values, you could also only match id making use of negated character classes [^][]* before and after matching id to stay inside the square bracket boundaries.
\bEnumOption\[[^][]*\bid=([^,\]]+)[^][]*\]
In Java
String regex = "\\bEnumOption\\[[^][]*\\bid=([^,\\]]+)[^][]*\\]";
Regex demo

A regex can of course be used, but sometimes is less performant, less readable and more bug-prone.
I would advise you not use any regex that you did not come up with yourself, or at least understand completely.
PS: I think your solution is actually quite readable.
Here's another non-regex version:
String text = "EnumOption[enumId=test,id=machine]";
text = text.substring(text.lastIndexOf('=') + 1);
text = text.substring(0, text.length() - 1);
Not doing you a favor, but the downvote hurt, so here you go:
String input = "EnumOption[enumId=test,id=machine]";
Matcher matcher = Pattern.compile("EnumOption\\[enumId=(.+),id=(.+)\\]").matcher(input);
if(!matcher.matches()) {
throw new RuntimeException("unexpected input: " + input);
}
System.out.println("enumId: " + matcher.group(1));
System.out.println("id: " + matcher.group(2));

Regex extract last numbers from String

I have some strings which are indexed and are dynamic.
For example:
name01,
name02,
name[n]
now I need to separate name from index.
I've come up with this regex which works OK to extract index.
([0-9]+(?!.*[0-9]))
But, there are some exceptions of these names. Some of them may have a number appended which is not the index.(These strings are limited and I know them, meaning I can add them as "exceptions" in the regex)
For example,
panLast4[01]
Here the last '4' is not part of the index, so I need to distinguish.
So I tried:
[^panLast4]([0-9]+(?!.*[0-9]))
Which works for panLast4[123] but not panLast4[43]
Note: the "[" and "]" is for explanation purposes only, it's not present in the strings
What is wrong?
Thanks

You can use the split method with this pattern:
(?<!^panLast(?=4)|^nm(?=14)|^nm1(?=4))(?=[0-9]+$)
The idea is to find the position where there are digits until the end of the string (?=[0-9]+$). But the match will succeed if the negative lookbehind allows it (to exclude particular names (panLast4 and nm14 here) that end with digits). When one of these particular names is found, the regex engine must go to the next position to obtain a match.
Example:
String s ="panLast412345";
String[] res = s.split("(?<!^panLast(?=4)|^nm(?=14)|^nm1(?=4))(?=[0-9]+$)", 2);
if ( res.length==2 ) {
System.out.println("name: " + res[0]);
System.out.println("ID: " + res[1]);
}
An other method with matches() that simply uses a lazy quantifier as last alternative:
Pattern p = Pattern.compile("(panLast4|nm14|.*?)([0-9]+)");
String s = "panLast42356";
Matcher m = p.matcher(s);
if ( m.matches() && m.group(1).length()>0 ) {
System.out.println("name: "+ m.group(1));
System.out.println("ID: "+ m.group(2));
}

String Matches, Java

I have a sort of a problem with this code:
String[] paragraph;
if(paragraph[searchKeyword_counter].matches("(.*)(\\b)"+"is"+"(\\b)(.*)")){
if i am not mistaken to use .matches() and search a particular character in a string i need a .* but what i want to happen is to search a character without matching it to another word.
For example is the keyword i am going to search I do not want it to match with words that contain is character like ship, his, this. so i used \b for boundary but the code above is not working for me.
Example:
String[] Content= {"is,","his","fish","ish","its","is"};
String keyword = "is";
for(int i=0;i<Content.length;i++){
if(content[i].matches("(.*)(\\b)"+keyword+"(\\b)(.*)")){
System.out.println("There are "+i+" is.");
}
}
What i want to happen here is that it will only match with is is, but not with his fish. So is should match with is, and is meaning I want it to match even the character is beside a non-alphanumerical character and spaces.
What is the problem with the code above?
what if one of the content has a uppercase character example IS and it is compared with is, it will be unmatched. Correct my if i am wrong. How to match a lower cased character to a upper cased character without changing the content of the source?

String string = "...";
String word = "is";
Pattern p = Pattern.compile("\\b" + Pattern.quote(word) + "\\b");
Matcher m = p.matcher(string);
if (m.find()) {
...
}

just add spaces like this:
suppose message equal your content string and pattern is your keyword
if ((message).matches(".* " + pattern + " .*")||(message).matches("^" + pattern + " .*")
||(message).matches(".* " + pattern + "$")) {

Regular expression to match unescaped special characters only

I'm trying to come up with a regular expression that can match only characters not preceded by a special escape sequence in a string.
For instance, in the string Is ? stranded//? , I want to be able to replace the ? which hasn't been escaped with another string, so I can have this result : **Is Dave stranded?**
But for the life of me I have not been able to figure out a way. I have only come up with regular expressions that eat all the replaceable characters.
How do you construct a regular expression that matches only characters not preceded by an escape sequence?

Use a negative lookbehind, it's what they were designed to do!
(?<!//)[?]
To break it down:
(
?<! #The negative look behind. It will check that the following slashes do not exist.
// #The slashes you are trying to avoid.
)
[\?] #Your special charactor list.
Only if the // cannot be found, it will progress with the rest of the search.
I think in Java it will need to be escaped again as a string something like:
Pattern p = Pattern.compile("(?<!//)[\\?]");

Try this Java code:
str="Is ? stranded//?";
Pattern p = Pattern.compile("(?<!//)([?])");
m = p.matcher(str);
StringBuffer sb = new StringBuffer();
while (m.find()) {
m.appendReplacement(sb, m.group(1).replace("?", "Dave"));
}
m.appendTail(sb);
String s = sb.toString().replace("//", "");
System.out.println("Output: " + s);
OUTPUT
Output: Is Dave stranded?

I was thinking about this and have a second simplier solution, avoiding regexs. The other answers are probably better but I thought I might post it anyway.
String input = "Is ? stranded//?";
String output = input
.replace("//?", "a717efbc-84a9-46bf-b1be-8a9fb714fce8")
.replace("?", "Dave")
.replace("a717efbc-84a9-46bf-b1be-8a9fb714fce8", "?");
Just protect the "//?" by replacing it with something unique (like a guid). Then you know any remaining question marks are fair game.

Use grouping. Here's one example:
import java.util.regex.*;
class Test {
public static void main(String[] args) {
Pattern p = Pattern.compile("([^/][^/])(\\?)");
String s = "Is ? stranded//?";
Matcher m = p.matcher(s);
if (m.matches)
s = m.replaceAll("$1XXX").replace("//", "");
System.out.println(s + " -> " + s);
}
}
Output:
$ java Test
Is ? stranded//? -> Is XXX stranded?
In this example, I'm:
first replacing any non-escaped ? with "XXX",
then, removing the "//" escape sequences.
EDIT Use if (m.matches) to ensure that you handle non-matching strings properly.
This is just a quick-and-dirty example. You need to flesh it out, obviously, to make it more robust. But it gets the general idea across.

Match on a set of characters OTHER than an escape sequence, then a regex special character. You could use an inverted character class ([^/]) for the first bit. Special case an unescaped regex character at the front of the string.

String aString = "Is ? stranded//?";
String regex = "(?<!//)[^a-z^A-Z^\\s^/]";
System.out.println(aString.replaceAll(regex, "Dave"));
The part of the regular expression [^a-z^A-Z^\\s^/] matches non-alphanumeric, whitespace or non-forward slash charaters.
The (?<!//) part does a negative lookbehind - see docco here for more info
This gives the output Is Dave stranded//?

try matching:
(^|(^.)|(.[^/])|([^/].))[special characters list]

I used this one:
((?:^|[^\\])(?:\\\\)*[ESCAPABLE CHARACTERS HERE])
Demo: https://regex101.com/r/zH1zO3/4

how to read string part in java

I have this string :
<meis xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" uri="localhost/naro-nei" onded="flpSW531213" identi="lemenia" id="75" lastStop="bendi" xsi:noNamespaceSchemaLocation="http://localhost/xsd/postat.xsd xsd/postat.xsd">
How can I get lastStop property value in JAVA?
This regex worked when tested on http://www.myregexp.com/
But when I try it in java I don't see the matched text, here is how I tried :
import java.util.regex.Pattern;
import java.util.regex.Matcher;
public class SimpleRegexTest {
public static void main(String[] args) {
String sampleText = "<meis xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\" uri=\"localhost/naro-nei\" onded=\"flpSW531213\" identi=\"lemenia\" id=\"75\" lastStop=\"bendi\" xsi:noNamespaceSchemaLocation=\"http://localhost/xsd/postat.xsd xsd/postat.xsd\">";
String sampleRegex = "(?<=lastStop=[\"']?)[^\"']*";
Pattern p = Pattern.compile(sampleRegex);
Matcher m = p.matcher(sampleText);
if (m.find()) {
String matchedText = m.group();
System.out.println("matched [" + matchedText + "]");
} else {
System.out.println("didn’t match");
}
}
}
Maybe the problem is that I use escape char in my test , but real string doesn't have escape inside. ?
UPDATE
Does anyone know why this doesn't work when used in java ? or how to make it work?

(?<=lastStop=[\"']?)[^\"]+

The reason it doesn't work as you expect is because of the * in [^\"']*. The lookbehind is matching at the position before the " in lastStop=", which is permitted because the quote is optional: [\"']?. The next part is supposed to match zero or more non-quote characters, but because the next character is a quote, it matches zero characters.
If you change that * to a +, the second part will fail to match at that position, forcing the regex engine to bump ahead one more position. The lookbehind will match the quote, and [^\"']+ will match what follows. However, you really shouldn't be using a lookbehind for this in the first place. It's much easier to just match the whole sequence in the normal way and extract the part you want to keep via a capturing group:
String sampleRegex = "lastStop=[\"']?([^\"']*)";
Pattern p = Pattern.compile(sampleRegex);
Matcher m = p.matcher(sampleText);
if (m.find()) {
String matchedText = m.group(1);
System.out.println("matched [" + matchedText + "]");
} else {
System.out.println("didn’t match");
}
It will also make it easier to deal with the problem #Kobi mentioned. You're trying to allow for values contained in double-quotes, single-quotes or no quotes, but your regex is too simplistic. For one thing, a quoted value can contain whitespace, but an unquoted one can't. To deal with all three possibilities, you'll need two or three capturing groups, not just one.

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

Matching Urls Inside Strings - java

Related

How to parse string using regex

Regex extract last numbers from String

String Matches, Java

Regular expression to match unescaped special characters only

how to read string part in java

Categories

Resources