issues with my regex to detect urls in a string? - java

Greetings all.
I am using the following regex to detect urls in a string
and wrap them inside the < a > tag
public static String detectUrls(String text) {
String newText = text
.replaceAll("(?:https?|ftps?|http?)://[\\w/%.-?&=]+",
"<a href='$0'>$0</a>").replaceAll(
"(www\\.)[\\w/%.-?&=]+", "<a href='http://$0'>$0</a>");
return newText;
}
I am not that good with regex, so please advise.
My problem is that the following links are not detected correctly:
http://code.google.com/p/shindig-dnd/
http://confluence.atlassian.com/display/GADGETDEV/Gadgets+and+JIRA+Portlets
www.liferay.com/web/raymond.auge/blog/
(www.opensocial.org/)
http://www.google.com

I'm using this:
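private static final String URL_REGEX =
"http(s)?://([\\w+?\\.\\w+])+([a-zA-Z0-9\\~\\!\\@\\#\\$\\%\\^\\&\\*\\(\\)_\\-\\=\\+\\\\\\/\\?\\.\\:\\;\\'\\,]*)?";
// Compile once; the original snippet used URL_PATTERN without declaring it.
private static final Pattern URL_PATTERN = Pattern.compile(URL_REGEX);
Matcher matcher = URL_PATTERN.matcher(text);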
text = matcher.replaceAll("<a href=\"$0\">$0</a>");
return text;

The problem you have is that you are using - within a character group ([]) without escaping it, which is being used to define the range .-? (i.e. the characters ./0123456789:;<=>?). Either escape it \\- or put it at the end of the character class so that it doesn't complete a range.
public static String detectUrls(String text) {
String newText = text
.replaceAll("(?:https?|ftps?|http?)://[\\w/%.\\-?&=]+",
"<a href='$0'>$0</a>").replaceAll(
"(www\\.)[\\w/%.\\-?&=]+", "<a href='http://$0'>$0</a>");
return newText;
}
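To see the range effect in isolation, here is a minimal sketch that tests single characters against the unescaped and the escaped character class. The class and method names (HyphenRangeDemo, unescapedAccepts, escapedAccepts) are made up for illustration.

```java
import java.util.regex.Pattern;

class HyphenRangeDemo {
    // Unescaped: ".-?" is read as a range from '.' (0x2E) to '?' (0x3F).
    static final Pattern UNESCAPED = Pattern.compile("[\\w/%.-?&=]");
    // Escaped: "\\-" is a literal hyphen, so no range is formed.
    static final Pattern ESCAPED = Pattern.compile("[\\w/%.\\-?&=]");

    static boolean unescapedAccepts(String ch) {
        return UNESCAPED.matcher(ch).matches();
    }

    static boolean escapedAccepts(String ch) {
        return ESCAPED.matcher(ch).matches();
    }

    public static void main(String[] args) {
        // Print how each class treats a few characters of interest.
        for (String ch : new String[] {"-", ":", "<", "+"}) {
            System.out.println("'" + ch + "' unescaped=" + unescapedAccepts(ch)
                    + " escaped=" + escapedAccepts(ch));
        }
    }
}
```

The unescaped class rejects "-" (so a match stops at the hyphen in shindig-dnd) while accepting characters such as ":" and "<". Neither class accepts "+", so the Gadgets+and+JIRA+Portlets link would still be cut short unless "+" is added to the class.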

As marcog said, you should escape the -. To match the last 2 examples you gave, you also have to make the protocol prefix optional. Also, http? matches htt, which is not a valid protocol.
So the regex will be:
"(?:(?:https?|ftps?)://)?[\\w/%.?&=-]+"

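One thing to keep in mind: with the protocol fully optional, a pattern like this also matches ordinary words, and chaining two replaceAll calls lets the second call re-match www. inside links the first call already created. A possible way around both, as a rough sketch only: require either a protocol or a leading www. and do everything in one pass. The class name LinkWrapper and the extra "+" in the character class are my own assumptions, not part of the answers above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LinkWrapper {
    // Either an explicit protocol or a leading "www."; '-' is last so it is literal.
    private static final Pattern URL = Pattern.compile(
            "(?:(?:https?|ftps?)://|www\\.)[\\w/%.?&=+-]+");

    static String detectUrls(String text) {
        Matcher m = URL.matcher(text);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            String url = m.group();
            // Bare "www." links get a default http:// prefix in the href only.
            String href = url.startsWith("www.") ? "http://" + url : url;
            String link = "<a href='" + href + "'>" + url + "</a>";
            // quoteReplacement keeps '$' and '\' in the URL from being interpreted.
            m.appendReplacement(out, Matcher.quoteReplacement(link));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(detectUrls("(www.opensocial.org/)"));
        System.out.println(detectUrls("http://www.google.com"));
    }
}
```

This still treats a trailing "." or "?" at the end of a sentence as part of the URL, so it is a starting point rather than a complete URL detector.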
Related

Split a string in java based on custom logic

I have a string
"target/abcd12345671.csv"
and I need to extract
"abcd12345671"
from the string using Java. Can anyone suggest me a clean way to extract this.
Core Java
String fileName = Paths.get("target/abcd12345671.csv").getFileName().toString();
fileName = fileName.replaceFirst("[.][^.]+$", ""); // strip the last extension
Using apache commons
import org.apache.commons.io.FilenameUtils;
String fileName = Paths.get("target/abcd12345671.csv").getFileName().toString();
String fileNameWithoutExt = FilenameUtils.getBaseName(fileName);
I like a regex replace approach here:
String filename = "target/abcd12345671.csv";
String output = filename.replaceAll("^.*/|\\..*$", "");
System.out.println(output); // abcd12345671
Here we use a regex alternation to remove all content up, and including, the final forward slash, as well as all content from the dot in the extension to the end of the filename. This leaves behind the content you actually want.
Here is an approach with using regex
String filename = "target/abcd12345671.csv";
var pattern = Pattern.compile("target/(.*)\\.csv"); // escape the dot so it only matches a literal "."
var matcher = pattern.matcher(filename);
if (matcher.find()) {
// Whole matched expression -> "target/abcd12345671.csv"
System.out.println(matcher.group(0));
// Matched in the first group -> in regex it is the (.*) expression
System.out.println(matcher.group(1));
}
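If you would rather avoid regex altogether, plain index arithmetic also works. This is only a sketch; the class and method name (BaseName.baseName) are hypothetical, and it assumes "/" is the path separator.

```java
class BaseName {
    // Returns the file name without its directory and without its last extension.
    static String baseName(String path) {
        int slash = path.lastIndexOf('/');
        // substring(0) when there is no slash, because lastIndexOf returns -1.
        String name = path.substring(slash + 1);
        int dot = name.lastIndexOf('.');
        // dot > 0 leaves names like ".bashrc" and names without a dot untouched.
        return dot > 0 ? name.substring(0, dot) : name;
    }

    public static void main(String[] args) {
        System.out.println(baseName("target/abcd12345671.csv"));
    }
}
```

Note that this strips only the last extension, so "archive.tar.gz" becomes "archive.tar", whereas a pattern that removes everything from the first dot would leave "archive".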

Java won't replace all strings, because there is text next to the tags (post improved)

I'm working on a program, which formats HTML Code, extracted from a PDF file.
I have a String array that contains the text, split into paragraphs.
As the PDF has hyperlinks, I decided to replace them with a footnote number such as "[1]".
This will be used for citing sources. I eventually plan to put it at the end of a paragraph or sentence, so you can look up the sources like you would in a book.
My Problem
For some reason not all the hyperlinks are replaced.
The reason is most likely, that there is text directly next to the tag.
Hell<a href="http://www.example.com">o old chap!
Specifically, the "o" part and the "Hell" part are blocking Java's .replaceAll function from doing its job.
Expected Result
Hello [1] old chap!
EDIT:
If I would just add space, before and after the URL, it might split some words like "help", into "hel p", which is also not an option.
My code would have to replace the URL tag (without the ) and create no new extra spaces.
This is some of my code, where the problem occurs:
for (int i = 0; i < EN.length; i++) {
Pattern pattern_URL = Pattern.compile("<a(.+?)\">", Pattern.DOTALL);
Matcher matcher_URL = pattern_URL.matcher(EN[i]); // Checks the current array element.
if (matcher_URL.find()) {
source_number++;
String extractedURL = matcher_URL.group(0);
//System.out.println(extractedURL);
String extractedURL_fully = extractedURL.replaceAll("href=\"", ""); // Strip href= and the opening quotation mark
//System.out.println(extractedURL_fully);
String nobracketURL = extractedURL.replaceAll("\\)", ""); //Remove round brackets from URL
EN[i] = EN[i].replaceAll("\\)\"", "\""); /*Replace round brackets from URL in Array. (For some reasons there have been href URLs, with an bracket at the end. This was already in the PDF. They were causing massive problems, because it didn't comment them out, so the entire replaceAll command didn't function.)*/
EN[i] = EN[i].replaceAll(nobracketURL, "[" + source_number + "]"); //Replace URL tags with number and Edgy brackets
}
else{
//System.out.println("FALSE: " + "[" + i + "]");
}
}
The whole idea of this is that it loops through the array and replaces all the URLs, from the start of the opening tag <a to its end "> (which can also be seen in the pattern regex).
Correct me if I'm wrong, but what you need is to eliminate all the <a> tags from a given string, right? If that's the case, all you need is code like the following:
final String string = "<a href=\"http://www.example.com\">Sen";
final Pattern pattern = Pattern.compile("<a(.+?)>", Pattern.DOTALL);
final Matcher matcher = pattern.matcher(string);
final String result = matcher.replaceAll("");
System.out.println(result); // prints "Sen"
Notice I didn't use the replaceAll from the String object, but from the Matcher object. This replaces all matches with the empty string "".
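If the goal is a running footnote number instead of an empty string, the same Matcher can build the result piece by piece. A likely cause of the original trouble is that the extracted tag is passed back into String.replaceAll as a regex, so characters such as "?", "." or ")" inside the URL are interpreted as regex syntax; replacing at the match position avoids that. The sketch below is an assumption about what is wanted (names FootnoteLinks and numberLinks are invented), and it only touches opening <a ...> tags.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FootnoteLinks {
    // Opening <a ...> tag; [^>]* keeps the match inside a single tag.
    private static final Pattern OPEN_A =
            Pattern.compile("<a\\b[^>]*>", Pattern.CASE_INSENSITIVE);

    static String numberLinks(String html) {
        Matcher m = OPEN_A.matcher(html);
        StringBuilder out = new StringBuilder();
        int sourceNumber = 0;
        while (m.find()) {
            sourceNumber++;
            // Replace exactly the matched tag; surrounding text is copied as-is.
            m.appendReplacement(out, Matcher.quoteReplacement("[" + sourceNumber + "]"));
        }
        m.appendTail(out); // copy whatever follows the last tag
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(numberLinks("Hell<a href=\"http://www.example.com\">o old chap!"));
    }
}
```

No spaces are added or removed, so text directly next to the tag stays intact; where the "[n]" marker should finally sit in the sentence is a separate decision.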

How to extract word from string?

Suppose I have a string:
String message = "you should try http://google.com/";
Now, I want to send "http://google.com/" to a new
String url
What I want to do is:
check if a "word" in the string begins with "http://" and extract that word, where a word is
something that's surrounded by spaces (general english definition of word).
I have no idea how to extract the string, and the best I can do is use startsWith on the string. How do I use startsWith on a word, and extract the word?
Sorry if this is a little bit difficult to explain.
Thanks in advance!
EDIT: Also, what should I do to extract the word from the REGEX operation? And how should I handle it if there is more than 1 url in the string?
Use Pattern & Matcher classes.
String str = "blabla http://www.mywebsite.com blabla";
String regex = "((https?:\\/\\/)?(www.)?(([a-zA-Z0-9-]){2,}\\.){1,4}([a-zA-Z]){2,6}(\\/([a-zA-Z-_/.0-9#:+?%=&;,]*)?)?)";
Matcher m = Pattern.compile(regex).matcher(str);
if (m.find()) {
String url = m.group(); //value "http://www.mywebsite.com"
}
This regex will work for http://..., https://... and even www... URLs. Other regexes can easily be found online.
You can try this:
String str = "blabla http://www.mywebsite.com blabla";
Matcher m = Pattern.compile("(http://.*)").matcher(str);
if (m.find()) {
String url = (new StringTokenizer(m.group(), " ")).nextToken();
}
The "correct" way to perform this task is to split the String on whitespace with String#split("\\s+") and then pass each word to the URL constructor. If a word starts with your prefix but a MalformedURLException is thrown, it is invalid. The URL class constructor is far better tested and more robust than any solution that you or I could come up with, so please use it and don't reinvent the wheel.
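A minimal sketch of that split-and-validate idea, assuming only http:// and https:// prefixes are of interest. The class and method names (UrlWords, extractUrls) are invented for the example. Note that the URL(String) constructor is deprecated in recent JDKs, hence the annotation.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;

class UrlWords {
    @SuppressWarnings("deprecation") // new URL(String) is deprecated since Java 20
    static List<String> extractUrls(String message) {
        List<String> urls = new ArrayList<>();
        // A "word" is anything separated by one or more whitespace characters.
        for (String word : message.trim().split("\\s+")) {
            if (!word.startsWith("http://") && !word.startsWith("https://")) {
                continue;
            }
            try {
                new URL(word); // throws MalformedURLException if it cannot be parsed
                urls.add(word);
            } catch (MalformedURLException e) {
                // Not a usable URL; skip this word.
            }
        }
        return urls;
    }

    public static void main(String[] args) {
        System.out.println(extractUrls("you should try http://google.com/"));
    }
}
```

Because every word is checked, this also covers the case of more than one URL in the string.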
You can use Java Regex for this:
The following regex catches any string starting with http:// or https:// till the next whitespace character:
Pattern urlPattern = Pattern.compile("(http(s)?://\\S*)");
Matcher matcher = urlPattern.matcher(myString);
if (matcher.find()) {
String url = matcher.group();
}
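For the follow-up about more than one URL in the string: calling find() in a loop walks through every match. A short sketch along those lines, with a simplified pattern and invented names (AllUrls, findAll):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AllUrls {
    // http:// or https:// followed by at least one non-whitespace character.
    private static final Pattern URL = Pattern.compile("https?://\\S+");

    static List<String> findAll(String text) {
        List<String> urls = new ArrayList<>();
        Matcher m = URL.matcher(text);
        // Each find() call continues searching after the previous match.
        while (m.find()) {
            urls.add(m.group());
        }
        return urls;
    }

    public static void main(String[] args) {
        System.out.println(findAll("see http://a.com and https://b.org/x now"));
    }
}
```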

Java Regex - How to replace a pattern or how to

I have a bunch of HTML files. In these files I need to correct the src attribute of the IMG tags.
The IMG tags look typically like this:
<img alt="" src="./Suitbert_files/233px-Suitbertus.jpg" class="thumbimage" height="243" width="233" />
where the attributes are NOT in any specific order.
I need to remove the dot and the forward slash at the beginning of the src attribute of the IMG tags so they look like this:
<img alt="" src="Suitbert%20%E2%80%93%20Wikipedia_files/233px-Suitbertus.jpg" class="thumbimage" height="243" width="233" />
I have the following class so far:
import java.util.regex.*;
public class Replacer {
// this PATTERN should find all img tags with 0 or more attributes before the src-attribute
private static final String PATTERN = "<img\\.*\\ssrc=\"\\./";
private static final String REPLACEMENT = "<img\\.*\\ssrc=\"";
private static final Pattern COMPILED_PATTERN = Pattern.compile(PATTERN, Pattern.CASE_INSENSITIVE);
public static void findMatches(String html){
Matcher matcher = COMPILED_PATTERN.matcher(html);
// Print all occurrences
System.out.println("------------------------");
System.out.println("Following Matches found:");
while (matcher.find()) {
System.out.print("Start index: " + matcher.start());
System.out.print(" End index: " + matcher.end() + " ");
System.out.println(matcher.group());
}
System.out.println("------------------------");
}
public static String replaceMatches(String html){
//Pattern replace = Pattern.compile("\\s+");
Matcher matcher = COMPILED_PATTERN.matcher(html);
html = matcher.replaceAll(REPLACEMENT);
return html;
}
}
So, my method findMatches(String html) seems to correctly find all IMG tags where the src attribute starts with ./.
Now my method replaceMatches(String html) does not correctly replace the matches.
I am a newbie to regex, but I assume that either the REPLACEMENT regex is incorrect, or the usage of the replaceAll method, or both.
As you can see, the replacement String contains 2 parts which are identical in all IMG tags:
<img and src="./. In between these 2 parts, there should be the 0 or more HTML attributes from the original string.
How do I formulate such a REPLACEMENT string?
Can somebody please enlighten me?
Don't use regex for HTML. Use a parser, obtain the src attribute and replace it.
Try these:
PATTERN = "(<img[^>]*\\ssrc=\")\\./"
REPLACEMENT = "$1"
Basically, you capture everything except the ./ in group #1, then plug it back in using the $1 placeholder, effectively stripping off the ./.
Notice how I changed your .* to [^>]*, too. If there happened to be two IMG tags on the same line, like this:
<img src="good" /><img src="./bad" />
...your regex would match this:
<img src="good" /><img src="./
It would do that even if you used a non-greedy .*?. [^>]* makes sure the match is always contained within the one tag.
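Here is a small sketch that puts the two patterns side by side on the two-tag example, and then applies the $1 replacement. The class and method names (ImgSrcFix, firstMatch, stripDotSlash) are made up for the demonstration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ImgSrcFix {
    // Lazy ".*?" can still run across a tag boundary to reach a later src="./".
    static final Pattern LAZY = Pattern.compile("<img.*?\\ssrc=\"\\./");
    // "[^>]*" cannot cross '>', so the match stays inside one tag.
    static final Pattern BOUNDED =
            Pattern.compile("(<img[^>]*\\ssrc=\")\\./", Pattern.CASE_INSENSITIVE);

    static String firstMatch(Pattern p, String html) {
        Matcher m = p.matcher(html);
        return m.find() ? m.group() : null;
    }

    static String stripDotSlash(String html) {
        // $1 re-inserts everything captured before the "./".
        return BOUNDED.matcher(html).replaceAll("$1");
    }

    public static void main(String[] args) {
        String html = "<img src=\"good\" /><img src=\"./bad\" />";
        System.out.println(firstMatch(LAZY, html));
        System.out.println(firstMatch(BOUNDED, html));
        System.out.println(stripDotSlash(html));
    }
}
```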
Your replacement is incorrect. replaceAll replaces each matched string with the replacement, which is not interpreted as a regexp. To achieve what you want, you need to use groups. A group is delimited by parentheses in the regexp; each opening parenthesis starts a new group.
You can use $i in the replacement string to reproduce what a group has matched, where i is the group number. See the documentation of appendReplacement for the details.
// Here is an example (it looks a bit like your case but not exactly)
String input = "<img name=\"foobar\" src=\"img.png\">";
String regexp = "<img(.+)src=\"[^\"]+\"(.*)>";
Matcher m = Pattern.compile(regexp).matcher(input);
StringBuffer sb = new StringBuffer();
while(m.find()) {
// Found a match!
// Append all chars before the match and then replaces the match by the
// replacement (the replacement refers to group 1 & 2 with $1 & $2
// which match respectively everything between '<img' and 'src' and,
// everything after the src value and the closing >
m.appendReplacement(sb, "<img$1src=\"something else\"$2>");
}
m.appendTail(sb);// No more match, we append the end of input
Hope this helps you
If src attributes only occur in your HTML within img tags, you can just do this:
input.replace("src=\"./", "src=\"")
You could also do this without java by using sed if you're using a *nix OS

String ReplaceAll method not working

I'm using this method to parse out plain text URLs in some HTML and make them links
private String fixLinks(String body) {
String regex = "^(https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]";
body = body.replaceAll(regex, "<a href=\"$1\">$1</a>");
Log.d(TAG, body);
return body;
}
No URLs are replaced in the HTML however. The regular expression seems to be matching URLs in other regular expression testers. What's going on?
The ^ anchor means the regex can only match at the start of the string. Try removing it.
Also, it looks like you mean $0 rather than $1, since you want the entire match and not the first capture group, which is (https?|ftp|file).
In summary, the following works for me:
private String fixLinks(String body) {
String regex = "(https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]";
body = body.replaceAll(regex, "<a href=\"$0\">$0</a>");
Log.d(TAG, body);
return body;
}
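Both points can be checked quickly with a simplified stand-in pattern (not the full regex from the question). The class and method names here are invented, and the square brackets in the replacement are only there to make the effect visible.

```java
class AnchorDemo {
    // With '^' the pattern can only match at the very start of the input.
    static String anchored(String body) {
        return body.replaceAll("^(https?|ftp|file)://\\S+", "[$0]");
    }

    // Without '^' the pattern can match anywhere; $0 is the whole match.
    static String unanchored(String body) {
        return body.replaceAll("(https?|ftp|file)://\\S+", "[$0]");
    }

    // $1 is only the first capture group, i.e. the protocol.
    static String firstGroup(String body) {
        return body.replaceAll("(https?|ftp|file)://\\S+", "[$1]");
    }

    public static void main(String[] args) {
        String body = "visit http://a.com today";
        System.out.println(anchored(body));
        System.out.println(unanchored(body));
        System.out.println(firstGroup(body));
    }
}
```

If the URLs are expected at the start of individual lines rather than anywhere in the text, compiling the pattern with Pattern.MULTILINE makes ^ match at each line start instead of only at the start of the input.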
