java regex to retrieve link from text


I have an input String:
String text = "Some content which contains link as <A HREF=\"/relative-path/fruit.cgi?param1=abc&param2=xyz\">URL Label</A> and some text after it";
I want to convert this text to:
Some content which contains link as http://www.google.com/relative-path/fruit.cgi?param1=abc&param2=xyz&myParam=pqr (URL Label) and some text after it
So here:
1) I want to replace the link tag with a plain link. If the tag contains a label, it should go in parentheses after the URL.
2) If the URL is relative, I want to prefix the base URL (http://www.google.com).
3) I want to append a parameter to the URL. (&myParam=pqr)
I am having issues retrieving the tag with URL and label, and replacing it.
I wrote something like:
public static void main(String[] args) {
String text = "Some content which contains link as <A HREF=\"/relative-path/fruit.cgi?param1=abc&param2=xyz\">URL Label</A> and some text after it";
// decode the HTML entities in case the input is escaped
text = text.replaceAll("&lt;", "<");
text = text.replaceAll("&gt;", ">");
text = text.replaceAll("&amp;", "&");
// this is not working
Pattern p = Pattern.compile("href=\"(.*?)\"");
Matcher m = p.matcher(text);
String url = null;
if (m.find()) {
url = m.group(1);
}
}
// helper method to append new query params once I have the url
public static URI appendQueryParams(String uriToUpdate, String queryParamsToAppend) throws URISyntaxException {
URI oldUri = new URI(uriToUpdate);
String newQueryParams = oldUri.getQuery();
if (newQueryParams == null) {
newQueryParams = queryParamsToAppend;
} else {
newQueryParams += "&" + queryParamsToAppend;
}
URI newUri = new URI(oldUri.getScheme(), oldUri.getAuthority(),
oldUri.getPath(), newQueryParams, oldUri.getFragment());
return newUri;
}
Edit1:
Pattern p = Pattern.compile("HREF=\"(.*?)\"");
This works, but I want it to be case-insensitive: Href, HRef, href, hrEF, etc. should all work.
Also, how do I handle text that contains several URLs?
Edit2:
Some progress.
Pattern p = Pattern.compile("href=\"(.*?)\"");
Matcher m = p.matcher(text);
String url = null;
while (m.find()) {
url = m.group(1);
System.out.println(url);
}
This handles the case of multiple URLs.
The last pending issue: how do I get hold of the label and replace the anchor tags in the original text with the URL and label?
Edit3:
By the multiple-URL case, I mean that several URLs are present in the given text.
String text = "Some content which contains link as <A HREF=\"/relative-path/fruit.cgi?param1=abc&param2=xyz\">URL Label</A> and some text after it and another link <A HREF=\"/relative-path/vegetables.cgi?param1=abc&param2=xyz\">URL2 Label</A> and some more text";
Pattern p = Pattern.compile("href=\"(.*?)\"", Pattern.CASE_INSENSITIVE);
Matcher m = p.matcher(text);
String url = null;
while (m.find()) {
url = m.group(1); // this variable should contain the link URL
url = appendBaseURI(url);
url = appendQueryParams(url, "license=ABCXYZ");
System.out.println(url);
}

public static void main(String args[]) throws URISyntaxException {
String text = "Some content which contains link as <A HREF=\"/relative-path/fruit.cgi?param1=abc&param2=xyz\">URL Label</A> and some text after it and another link <A HREF=\"/relative-path/vegetables.cgi?param1=abc&param2=xyz\">URL2 Label</A> and some more text";
text = StringEscapeUtils.unescapeHtml4(text);
// group 1 = href value, group 2 = label
Pattern p = Pattern.compile("<a href=\"(.*?)\">(.*?)</a>", Pattern.CASE_INSENSITIVE);
Matcher m = p.matcher(text);
while (m.find()) {
// the matcher keeps scanning the original string; each matched tag is replaced in the updated copy
text = text.replace(m.group(0), cleanUrlPart(m.group(1), m.group(2)));
}
System.out.println(text);
}
private static String cleanUrlPart(String url, String label) throws URISyntaxException {
// prefix the base URL when the link is relative
if (!url.startsWith("http") && !url.startsWith("www")) {
if (url.startsWith("/")) {
url = "http://www.google.com" + url;
} else {
url = "http://www.google.com/" + url;
}
}
url = appendQueryParams(url, "myParam=pqr").toString();
if (label != null && !label.isEmpty()) url += " (" + label + ")";
return url;
}
Output
Some content which contains link as http://www.google.com/relative-path/fruit.cgi?param1=abc&param2=xyz&myParam=pqr (URL Label) and some text after it and another link http://www.google.com/relative-path/vegetables.cgi?param1=abc&param2=xyz&myParam=pqr (URL2 Label) and some more text

You can use StringEscapeUtils from Apache Commons Text to decode the HTML entities and then call replaceAll, i.e.:
import org.apache.commons.text.StringEscapeUtils;
String text = "Some content which contains link as <A HREF=\"/relative-path/fruit.cgi?param1=abc&param2=xyz\">URL Label</A> and some text after it";
String output = StringEscapeUtils.unescapeHtml4(text).replaceAll("([^<]+).+\"(.*?)\">(.*?)<[^>]+>(.*)", "$1https://google.com$2&your_param ($3)$4");
System.out.print(output);
// Some content which contains link as https://google.com/relative-path/fruit.cgi?param1=abc&param2=xyz&your_param (URL Label) and some text after it
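The single replaceAll above spans the whole string, so it rewrites one link per input. A minimal sketch of a per-tag variant follows, assuming every href is relative, starts with "/", and already carries a query string. The class name LinkFlattener and the method flatten are made up for illustration, and the base URL and parameter are the ones from the question.

```java
import java.util.regex.Pattern;

class LinkFlattener {
    // One match per <a> tag: group 1 is the href value, group 2 is the label.
    private static final Pattern ANCHOR = Pattern.compile(
            "<a\\s+href=\"([^\"]*)\"[^>]*>(.*?)</a>", Pattern.CASE_INSENSITIVE);

    // Replaces every anchor with "base + href + extra param (label)".
    // Assumes relative hrefs that already contain a '?' query part.
    static String flatten(String text) {
        return ANCHOR.matcher(text)
                .replaceAll("http://www.google.com$1&myParam=pqr ($2)");
    }

    public static void main(String[] args) {
        String text = "see <A HREF=\"/a.cgi?x=1\">One</A> and <a href=\"/b.cgi?y=2\">Two</a>.";
        System.out.println(flatten(text));
    }
}
```

Because the pattern is anchored on a single tag, replaceAll visits each link in turn. Absolute links or links without a query string would need the extra handling that a find() loop allows.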

// this is not working
That is because your regex is case-sensitive.
Try:
Pattern p = Pattern.compile("href=\"(.*?)\"", Pattern.CASE_INSENSITIVE);
Edit1:
To get the label, use Pattern.compile("(?<=>).*?(?=</a>)", Pattern.CASE_INSENSITIVE) and m.group(0).
Edit2:
To replace the tag (including the label) with your final string, use:
text.replaceAll("(?i)<a href=\"(.*?)</a>", "new substring here")
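One caveat with the lookbehind pattern for the label: when the text holds several links, a position right after a closing </A> is also preceded by ">", so the second match can start too early. A sketch that captures URL and label together with named groups avoids this. The class AnchorExtractor and the group names url and label are illustrative assumptions, not part of the question's code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AnchorExtractor {
    // Matches one whole anchor tag, so URL and label always belong together.
    private static final Pattern ANCHOR = Pattern.compile(
            "<a\\s+href=\"(?<url>[^\"]*)\"[^>]*>(?<label>.*?)</a>",
            Pattern.CASE_INSENSITIVE);

    // Returns one {url, label} pair per anchor tag, in document order.
    static List<String[]> extract(String text) {
        List<String[]> links = new ArrayList<>();
        Matcher m = ANCHOR.matcher(text);
        while (m.find()) {
            links.add(new String[] { m.group("url"), m.group("label") });
        }
        return links;
    }

    public static void main(String[] args) {
        String text = "see <A HREF=\"/a.cgi?x=1\">One</A> and <a href=\"/b.cgi?y=2\">Two</a>.";
        for (String[] link : extract(text)) {
            System.out.println(link[0] + " -> " + link[1]);
        }
    }
}
```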

Almost there:
public static void main(String[] args) throws URISyntaxException {
String text = "Some content which contains link as <A HREF=\"/relative-path/fruit.cgi?param1=abc&param2=xyz\">URL Label</A> and some text after it and another link <A HREF=\"/relative-path/vegetables.cgi?param1=abc&param2=xyz\">URL2 Label</A> and some more text";
text = StringEscapeUtils.unescapeHtml4(text);
System.out.println(text);
System.out.println("**************************************");
Pattern patternTag = Pattern.compile("<a([^>]+)>(.+?)</a>", Pattern.CASE_INSENSITIVE);
Pattern patternLink = Pattern.compile("href=\"(.*?)\"", Pattern.CASE_INSENSITIVE);
Matcher matcherTag = patternTag.matcher(text);
while (matcherTag.find()) {
String href = matcherTag.group(1); // href
String linkText = matcherTag.group(2); // link text
System.out.println("Href: " + href);
System.out.println("Label: " + linkText);
Matcher matcherLink = patternLink.matcher(href);
String finalText = null;
while (matcherLink.find()) {
String link = matcherLink.group(1);
System.out.println("Link: " + link);
finalText = getFinalText(link, linkText);
break;
}
System.out.println("***************************************");
// replacing logic goes here
}
System.out.println(text);
}
public static String getFinalText(String link, String label) throws URISyntaxException {
link = appendBaseURI(link);
link = appendQueryParams(link, "myParam=ABCXYZ");
return link + " (" + label + ")";
}
public static String appendQueryParams(String uriToUpdate, String queryParamsToAppend) throws URISyntaxException {
URI oldUri = new URI(uriToUpdate);
String newQueryParams = oldUri.getQuery();
if (newQueryParams == null) {
newQueryParams = queryParamsToAppend;
} else {
newQueryParams += "&" + queryParamsToAppend;
}
URI newUri = new URI(oldUri.getScheme(), oldUri.getAuthority(),
oldUri.getPath(), newQueryParams, oldUri.getFragment());
return newUri.toString();
}
public static String appendBaseURI(String url) {
String baseURI = "http://www.google.com/";
if (url.startsWith("/")) {
url = url.substring(1, url.length());
}
if (url.startsWith(baseURI)) {
return url;
} else {
return baseURI + url;
}
}
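The "replacing logic goes here" step could be filled in with Matcher.appendReplacement and appendTail, which rebuild the text while the matcher walks it. The following is only a sketch under a few assumptions: hrefs are double-quoted, the base URL is the one from the question, and the class name LinkRewriter and the method rewrite are hypothetical.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LinkRewriter {
    // Base URL taken from the question; relative hrefs are resolved against it.
    private static final URI BASE = URI.create("http://www.google.com/");
    // Group 1 is the href value, group 2 is the label.
    private static final Pattern TAG = Pattern.compile(
            "<a\\b[^>]*?\\bhref\\s*=\\s*\"([^\"]*)\"[^>]*>(.*?)</a>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    static String rewrite(String text, String extraParam) throws URISyntaxException {
        Matcher m = TAG.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // resolve() leaves absolute URLs untouched and prefixes relative ones
            URI abs = BASE.resolve(m.group(1));
            String query = abs.getQuery() == null
                    ? extraParam
                    : abs.getQuery() + "&" + extraParam;
            URI withParam = new URI(abs.getScheme(), abs.getAuthority(),
                    abs.getPath(), query, abs.getFragment());
            String label = m.group(2).trim();
            String replacement = label.isEmpty()
                    ? withParam.toString()
                    : withParam.toString() + " (" + label + ")";
            // quoteReplacement stops '$' and '\' from being read as group references
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) throws URISyntaxException {
        String text = "see <A HREF=\"/a.cgi?x=1\">One</A> and "
                + "<a href=\"https://example.com/page\">Two</a>.";
        System.out.println(rewrite(text, "myParam=pqr"));
    }
}
```

Unlike String.replace on the matched text, appendReplacement substitutes each match at its own position, so two identical tags are handled independently.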

Related

How to ignore encoding certain characters in a url in java?

I have a url that looks like this: https://123.com/screen-shot-2021-02-25-at-7.31.10%2520PM.png
screen-shot-2021-02-25-at-7.31.10%2520PM.png is the file name and %25 is the encoded value for %
This gives me a 404. I need % to not be encoded. What is the proper way to ignore this when encoding a url using Google's UrlEscapers.urlFragmentEscaper().escape(); for Java other than using a replace() method?
Code for encoding:
private static String FILENAME_REGEX = ".*//?(.*)$";
private static Pattern FILENAME_PATTERN = Pattern.compile(FILENAME_REGEX);
public String sanitizedURL(@NonNull String url) throws URISyntaxException {
String contentUrl = url;
Matcher matcher = FILENAME_PATTERN.matcher(url);
if (matcher.matches()) {
String filename = matcher.group(1);
String encodedFilename = UrlEscapers.urlFragmentEscaper().escape(filename);
contentUrl = url.replace(filename, encodedFilename);
//contentUrl = contentUrl.replace("%25", "%");
}
// validate this is a good URI
URI uri = new URI(contentUrl);
return uri.toString();
}

Try URLDecoder.decode(String s, String enc)
e.g.
jshell> URLDecoder.decode("https://123.com/screen-shot-2021-02-25-at-7.31.10%2520PM.png", "UTF-8")
$1 ==> "https://123.com/screen-shot-2021-02-25-at-7.31.10%20PM.png"
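To avoid the double encoding in the first place, one option is to decode the path once and let the multi-argument java.net.URI constructor quote it again. This is a sketch using only the JDK rather than Guava's UrlEscapers, and the class EncodeOnce and its method are invented for illustration. Note that URLDecoder also turns "+" into a space, which may not suit every file name.

```java
import java.net.URI;
import java.net.URLDecoder;

class EncodeOnce {
    // Decodes any existing percent-escapes, then lets URI quote the path again,
    // so "%20" and a raw space both end up as a single "%20".
    static String encodeOnce(String scheme, String host, String path) throws Exception {
        String decoded = URLDecoder.decode(path, "UTF-8");
        return new URI(scheme, host, decoded, null).toASCIIString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(encodeOnce("https", "123.com",
                "/screen-shot-2021-02-25-at-7.31.10%20PM.png"));
        System.out.println(encodeOnce("https", "123.com",
                "/screen-shot-2021-02-25-at-7.31.10 PM.png"));
    }
}
```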

Parsing Google search result Error

I am following the answer to the question below to parse Google search results:
How can you search Google Programmatically Java API
However, when I try the code, an error occurs.
How should I modify it?
import java.net.URLDecoder;
import java.net.URLEncoder;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements ;
public class JavaApplication22 {
public static void main(String[] args) {
String google = "http://www.google.com/search?q=";
String search = "stackoverflow";
String charset = "UTF-8";
String userAgent = "ExampleBot 1.0 (+http://example.com/bot)"; // Change this to your company's name and bot homepage!
Elements links = Jsoup.connect(google + URLEncoder.encode(search, charset)).userAgent(userAgent).get().select(".g>.r>a");
for (Element link : links) {
String title = link.text();
String url = link.absUrl("href"); // Google returns URLs in format "http://www.google.com/url?q=<url>&sa=U&ei=<someKey>".
url = URLDecoder.decode(url.substring(url.indexOf('=') + 1, url.indexOf('&')), "UTF-8");
if (!url.startsWith("http")) {
continue; // Ads/news/etc.
}
System.out.println("Title: " + title);
System.out.println("URL: " + url);
}
}
}
I guess the libraries are the problem,
but when I press Ctrl+Shift+I, it reports that there is nothing to fix in the import statements.
Error
Exception in thread "main" java.lang.RuntimeException: Uncompilable
source code - unreported exception java.io.IOException; must be caught
or declared to be thrown at
javaapplication22.JavaApplication22.main(JavaApplication22.java:32)
How should I modify the code so that I can parse the Google search results?

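Replace your main method with the code below. Jsoup's get() throws the checked IOException, so main has to declare (or catch) it; you also need imports for java.io.IOException and java.io.UnsupportedEncodingException: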
public static void main(String[] args) throws UnsupportedEncodingException, IOException {
String google = "http://www.google.com/search?q=";
String search = "stackoverflow";
String charset = "UTF-8";
String userAgent = "ExampleBot 1.0 (+http://example.com/bot)"; // Change this to your company's name and bot homepage!
Elements links = Jsoup.connect(google + URLEncoder.encode(search, charset)).userAgent(userAgent).get().select(".g>.r>a");
for (Element link : links) {
String title = link.text();
String url = link.absUrl("href"); // Google returns URLs in format "http://www.google.com/url?q=<url>&sa=U&ei=<someKey>".
url = URLDecoder.decode(url.substring(url.indexOf('=') + 1, url.indexOf('&')), "UTF-8");
if (!url.startsWith("http")) {
continue; // Ads/news/etc.
}
System.out.println("Title: " + title);
System.out.println("URL: " + url);
}
}
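The substring call in the loop throws if the link has no "&" after the "q=" value. A slightly more defensive sketch of that extraction step is below. It assumes the redirect format described in the code comment, which may not match what Google returns today, and the class GoogleRedirect and the method targetOf are hypothetical names.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

class GoogleRedirect {
    // Pulls the decoded target out of ".../url?q=<url>&sa=U&ei=<someKey>".
    // Returns null when there is no "?q=" parameter.
    static String targetOf(String redirectUrl) throws UnsupportedEncodingException {
        int q = redirectUrl.indexOf("?q=");
        if (q < 0) {
            return null;
        }
        int start = q + 3;
        int end = redirectUrl.indexOf('&', start);
        if (end < 0) {
            end = redirectUrl.length(); // "q" was the last parameter
        }
        return URLDecoder.decode(redirectUrl.substring(start, end), "UTF-8");
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        System.out.println(targetOf(
                "http://www.google.com/url?q=https%3A%2F%2Fstackoverflow.com%2F&sa=U&ei=abc"));
    }
}
```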

How to add CDATA in a XML without the Loss of <br/> tag in java?

I need to wrap the string temp1 in CDATA while retaining the <br/> tag.
The program and samples are below:
i) program-AddCDATASectionToDOMDocument.java
ii) input xml
iii) required output
i) program-AddCDATASectionToDOMDocument.java
public class AddCDATASectionToDOMDocument {
public static void main(String[] args) throws Exception {
xmlreader xmlr = new xmlreader();
String temp1 = xmlr.xmlFileReader("example.xml", "contentmeta","subtitle");
String temp2 = "<![CDATA[" + temp1 + "]]>";
xmlr.xmlFileWriter("example.xml", "contentmeta", "subtitle", temp2);
}
}
ii) example.xml
iii) required output

How about using regular expressions instead of parsing it with DOM? This code may work with your example:
// assuming the file is UTF-8
String input = new String(Files.readAllBytes(Paths.get("file1.xml")), StandardCharsets.UTF_8);
final Pattern regex = Pattern.compile("<subtitle>(.+?)</subtitle>");
final Matcher matcher = regex.matcher(input);
if (matcher.find()) {
String modification = "<subtitle><![CDATA[" + matcher.group(1) + "]]></subtitle>";
// quoteReplacement keeps any '$' or '\' in the subtitle from being read as a group reference
String output = matcher.replaceFirst(Matcher.quoteReplacement(modification));
System.out.println(output);
// Files.write opens, writes, and closes the file
Files.write(Paths.get("file2.xml"), output.getBytes(StandardCharsets.UTF_8));
}
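The same idea can be tried on an in-memory string without touching any files. This sketch adds Pattern.DOTALL so a subtitle that spans several lines still matches. The sample XML is made up because the original example.xml is not shown, and the class CdataWrapper is a hypothetical name.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CdataWrapper {
    // DOTALL lets '.' cross line breaks inside the subtitle element.
    private static final Pattern SUBTITLE =
            Pattern.compile("<subtitle>(.+?)</subtitle>", Pattern.DOTALL);

    // Wraps the content of the first <subtitle> element in a CDATA section,
    // leaving inner markup such as <br/> exactly as it was.
    static String wrapSubtitle(String xml) {
        Matcher m = SUBTITLE.matcher(xml);
        if (!m.find()) {
            return xml; // nothing to wrap
        }
        String wrapped = "<subtitle><![CDATA[" + m.group(1) + "]]></subtitle>";
        return m.replaceFirst(Matcher.quoteReplacement(wrapped));
    }

    public static void main(String[] args) {
        String xml = "<contentmeta><subtitle>Price $5<br/>second line</subtitle></contentmeta>";
        System.out.println(wrapSubtitle(xml));
    }
}
```

A subtitle that already contains "]]>" would end the CDATA section early, so this approach only fits content where that sequence cannot occur.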

Read JSP page and write HTML file: UTF-8 issues

I want to read a JSP page and write it out as an HTML page. My parse class has three methods: readHTMLBody(), WriteNewHTML(), and ZipToEpub().
When I call these methods from within the parse class, they all work. But when they are called from a JSP or a web service, UTF-8 characters come out as "?" in readHTMLBody(). How can I fix this?
public String readHTMLBody() {
try {
String url = "http://localhost:8080/Library/part.jsp";
Document doc = Jsoup.parse((new URL(url)).openStream(), "utf-8", url);
String body = doc.html();
Elements title = doc.select("xxx");
linkURI = title.toString();
linkURI = linkURI.replaceAll("<xxx>", "");
linkURI = linkURI.replaceAll("</xxx>", "");
linkURI = linkURI.replaceAll("\\s", "");
resultBody = body;
resultBody = resultBody.replaceAll("part/" + linkURI + "/assets/", "assets/");
} catch (IOException e) {
}
return resultBody;
}

Apache Commons UrlValidator does not support Unicode. Is an alternative available?

I am trying to validate URLs,
but UrlValidator does not support Unicode.
Here is the code:
public static boolean isValidHttpUrl(String url) {
String[] schemes = {"http", "https"};
UrlValidator urlValidator = new UrlValidator(schemes);
if (urlValidator.isValid(url)) {
System.out.println("url is valid");
return true;
}
System.out.println("url is invalid");
return false;
}
String url = "ftp://hi.com";
boolean isValid = isValidHttpUrl(url);
assertFalse(isValid);
url = "http:// hi.com";
isValid = isValidHttpUrl(url);
assertFalse(isValid);
url = "http://hi.com";
isValid = isValidHttpUrl(url);
assertTrue(isValid);
// this is problem... it's not true...
url = "http://안녕.com";
isValid = isValidHttpUrl(url);
assertTrue(isValid);
Do you know of an alternative URL validator that supports Unicode?
I added another case: http://seapy_hi.com is reported as invalid. Why?
Isn't an underscore valid in a domain name?

UrlValidator doesn't support IDN. You need to convert the URL to Punycode first. Try this:
isValid = isValidHttpUrl(IDN.toASCII(url));
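IDN.toASCII works on host names, so a cautious variant passes it only the host part rather than the whole URL. The sketch below shows the conversion and its inverse. The class IdnHost is a made-up name, and the Korean host from the question is written with Unicode escapes so the source file stays ASCII.

```java
import java.net.IDN;

class IdnHost {
    // Converts an internationalized host name to its ASCII (Punycode) form.
    static String toAsciiHost(String host) {
        return IDN.toASCII(host);
    }

    public static void main(String[] args) {
        String host = "\uC548\uB155.com"; // the Korean host from the question
        String ascii = toAsciiHost(host);
        System.out.println(ascii);                 // ASCII form, labels prefixed with "xn--"
        System.out.println(IDN.toUnicode(ascii));  // converts back to the Unicode form
    }
}
```

The ASCII form can then be placed back into the URL before it is handed to the validator.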

There may be a more recent RFC that supersedes this one, but technically speaking, URLs do not support Unicode; see RFC 1738.
The relevant section in particular:
No corresponding graphic US-ASCII:
URLs are written only with the graphic printable characters of the US-ASCII coded character set. The octets 80-FF hexadecimal are not used in US-ASCII, and the octets 00-1F and 7F hexadecimal represent control characters; these must be encoded.
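For the path part of a URL, "must be encoded" means percent-encoding, and java.net.URI can produce that form. The following is a small sketch with a hypothetical class AsciiPath and the placeholder host example.com. The host name itself needs Punycode rather than percent-encoding.

```java
import java.net.URI;
import java.net.URISyntaxException;

class AsciiPath {
    // Builds a URI from its parts and returns it with every non-ASCII
    // character written as UTF-8 percent-escapes.
    static String toAscii(String scheme, String host, String path) throws URISyntaxException {
        return new URI(scheme, host, path, null).toASCIIString();
    }

    public static void main(String[] args) throws URISyntaxException {
        // "\uC548\uB155" is the Korean word used in the question
        System.out.println(toAscii("http", "example.com", "/\uC548\uB155"));
    }
}
```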

As Kaerber mentions in a comment on the accepted answer, that answer has a bug when the string starts with a scheme.
Here is my solution, which fixes that:
public static String convertUnicodeURLToAscii(String url) throws URISyntaxException {
if(url == null) {
return null;
}
url = url.trim();
URI uri = new URI(url);
boolean includeScheme = true;
// URI needs a scheme to work properly with authority parsing
if(uri.getScheme() == null) {
uri = new URI("http://" + url);
includeScheme = false;
}
String scheme = uri.getScheme() != null ? uri.getScheme() + "://" : null;
String authority = uri.getRawAuthority() != null ? uri.getRawAuthority() : ""; // includes domain and port
String path = uri.getRawPath() != null ? uri.getRawPath() : "";
String queryString = uri.getRawQuery() != null ? "?" + uri.getRawQuery() : "";
String fragment = uri.getRawFragment() != null ? "#" + uri.getRawFragment() : "";
// Must convert domain to punycode separately from the path
url = (includeScheme ? scheme : "") + IDN.toASCII(authority) + path + queryString + fragment;
// Convert path from unicode to ascii encoding
return new URI(url).normalize().toASCIIString();
}
