Strip off a sentence that contains a URL - java

I am looking for a way to remove a sentence that contains a URL in Java. Note that I want to remove the entire sentence, not just the URL.
I found an approach that works, but I am looking for something simpler, maybe a single regex?
Detect each sentence (a sentence can end with ., ? or !) using BreakIterator (see "Split string into sentences").
Use a regex on each sentence to detect a URL (see "Detect and extract url from a string?"). If one is found, remove the sentence.
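import java.text.BreakIterator;
import java.util.Locale;
import java.util.regex.Pattern;

BreakIterator iterator = BreakIterator.getSentenceInstance(Locale.US);
// SENT is the compiled URL pattern. This simple one is an assumption; substitute your own.
Pattern SENT = Pattern.compile("https?://\\S+");
String source = "Sorry, we are closed today. Visit our website tomorrow at https://www.google.com. Thank you and have a nice day!";
iterator.setText(source);
int start = iterator.first();
int end = iterator.next();
while (end != BreakIterator.DONE) {
    if (SENT.matcher(source.substring(start, end)).find()) {
        // Drop the sentence, then restart the iteration on the shortened text
        source = source.substring(0, start) + source.substring(end);
        iterator.setText(source);
        start = iterator.first();
    } else {
        start = end;
    }
    end = iterator.next();
}
System.out.println(source);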
This prints : Sorry, we are closed today. Thank you and have a nice day!

"(?<=^|[?!.])[^?!.]+" + urlRegex + ".*?(?:$|[?!.])"
This will match each whole sentence whose part matches urlRegex, according to your definition of a sentence; you can use replaceAll to get rid of them. (There are many URL regexes around, and you didn't specify which one you were using, so I left the exact definition of URL to you.)
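As a minimal sketch of how this pattern could be used with replaceAll, the following assumes a deliberately simple URL regex, https?://\S*\w. That regex is an assumption, not the asker's pattern; it is chosen to end on a word character so that the URL part does not swallow the period that closes the sentence. The class and method names are hypothetical. Because a sentence is defined here as a run of characters other than ., ? and !, any dot in the text before the URL would also be treated as a sentence boundary.

```java
import java.util.regex.Pattern;

class UrlSentenceStripper {
    // Assumed stand-in for urlRegex: it must end on a word character so the
    // sentence-ending period is left for the trailing part of the pattern.
    private static final String URL_REGEX = "https?://\\S*\\w";

    private static final Pattern SENTENCE_WITH_URL = Pattern.compile(
            "(?<=^|[?!.])[^?!.]+" + URL_REGEX + ".*?(?:$|[?!.])");

    // Removes every sentence (as defined by the pattern) that contains a URL.
    static String stripUrlSentences(String text) {
        return SENTENCE_WITH_URL.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        String source = "Sorry, we are closed today. Visit our website tomorrow at "
                + "https://www.google.com. Thank you and have a nice day!";
        System.out.println(stripUrlSentences(source));
    }
}
```

With a URL regex that can consume the final period (for example one ending in \S+), the lazy tail would run on to the next punctuation mark and remove the following sentence as well, so the choice of URL regex matters here.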

It would be best to split the text into sentences first, and only then pass each sentence through an expression.
This expression then matches only those lines (sentences) that do not contain a URL:
^(?!.*https?[^\s]+.*).*$
Here, a URL is defined as https?[^\s]+.
Test
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final String regex = "^(?!.*https?[^\\s]+.*).*$";
final String string = "Sorry, we are closed today. Visit our website tomorrow at https://www.google.com. Thank you and have a nice day!\n\n"
        + "Sorry, we are closed today. Visit our website tomorrow at. Thank you and have a nice day!\n\n"
        + "Sorry, we are closed today. Visit our website tomorrow at https://www.goog. Thank you and have a nice day!\n";
final Pattern pattern = Pattern.compile(regex, Pattern.MULTILINE);
final Matcher matcher = pattern.matcher(string);
while (matcher.find()) {
    System.out.println("Full match: " + matcher.group(0));
    for (int i = 1; i <= matcher.groupCount(); i++) {
        System.out.println("Group " + i + ": " + matcher.group(i));
    }
}

Related

Java - replaceFirst - jump to next match

I am trying to escape the HTML only inside <pre> tags that I meet ( don't ask me if there's much logic in this )
I did write this short program and it works fine, but I want to jump to the next match, without actually adding the id="ProcessedTag" so it doesn't replace the first match only. Here's my code :
import java.util.regex.Pattern;
import java.util.regex.Matcher;
import static org.apache.commons.lang3.StringEscapeUtils.escapeHtml4;

public class ReplaceHTML {
    public static void main(String[] args) {
        String html = "something something < > && \"\" <pre> text\n" +
                "< >\n" +
                "more text\n" +
                "&\n" +
                "<\n" +
                "</pre>\n" +
                "and some more text\n" +
                "<pre> text < </pre>";
        Pattern pattern = Pattern.compile("(?i)(?s)<pre>(.*?)</pre>");
        Matcher matcher = pattern.matcher(html);
        while (matcher.find()) {
            html = html.replaceFirst("(?i)(?s)<pre>(.*?)</pre>", "<pre id=\"ProcessedTag\">" + escapeHtml4(matcher.group(1)) + "</pre>");
        }
        System.out.println(html);
    }
}
So in order not to replace the first occurrence only, I decided to add this id="ProcessedTag", so the replaceFirst can move to the next match. I guess there should be a smarter way of doing this without adding anything additional.
Excuse me if this is a stupid question or it has been asked before ( couldn't find anything useful )
Regards.
You should be using Matcher#appendReplacement here:
Pattern pattern = Pattern.compile("(?i)(?s)<pre>(.*?)</pre>");
Matcher matcher = pattern.matcher(html);
StringBuffer buffer = new StringBuffer();
while (matcher.find()) {
    // quoteReplacement keeps any $ or \ in the escaped text from being read as a group reference
    matcher.appendReplacement(buffer, Matcher.quoteReplacement("<pre>" + escapeHtml4(matcher.group(1)) + "</pre>"));
}
matcher.appendTail(buffer);
System.out.println(buffer);
Note that, in general, it is not desirable to use regex against HTML content. In this case, however, the tags you want to replace are not nested, so regex is potentially viable.
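For readers who want to try the appendReplacement loop without the commons-lang dependency, here is a self-contained sketch. The escapeBasic helper is a hypothetical, minimal stand-in for escapeHtml4 (it only handles &, <, > and "), and the class and method names are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PreEscaper {
    private static final Pattern PRE =
            Pattern.compile("<pre>(.*?)</pre>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Hypothetical minimal stand-in for escapeHtml4: handles only & < > and ".
    static String escapeBasic(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    // Escapes the content of every <pre>...</pre> block and leaves the rest untouched.
    static String escapeInsidePre(String html) {
        Matcher matcher = PRE.matcher(html);
        StringBuffer buffer = new StringBuffer();
        while (matcher.find()) {
            String replacement = "<pre>" + escapeBasic(matcher.group(1)) + "</pre>";
            // quoteReplacement keeps any '$' or '\' in the text literal.
            matcher.appendReplacement(buffer, Matcher.quoteReplacement(replacement));
        }
        matcher.appendTail(buffer);
        return buffer.toString();
    }

    public static void main(String[] args) {
        System.out.println(escapeInsidePre("a < b <pre>x < y & $1</pre> c <pre>1 > 0</pre>"));
    }
}
```

Each call to find() continues after the previous match, so every block is processed exactly once and no marker attribute is needed.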

Can't understand why my regex in Java doesn't work

I'm trying to find pieces of text on the webpage I fetch that lie between the 'align="left">\n' and '</form>\n</td>' substrings.
I wrote a regex:
(align=\"left\">\\n)(?<part>.*?)(<\/form>\\n<\/td>)
and tested it at https://www.freeformatter.com/java-regex-tester.html, where it works just as I need.
But in my Java code it doesn't find anything.
This is the test code that I'm trying to get working:
String frontPage = "<html>\n<head>\n<title>Hello</title>\n</head>\n" +
        "<body>\n<table>\n<tr align=\"left\">\n" +
        "<td>Hello \n<form>\n<input type=\"submit\" value=\"ok\">\n" +
        "</form>\n</td>\n" +
        "<td>World \n<form>\n<input type=\"submit\" value=\"ok\">\n" +
        "</form>\n</td>\n" +
        "</tr>\n</table>\n</body>\n</html>";
java.util.regex.Pattern p =
        java.util.regex.Pattern.compile(
                "(align=\"left\">\\n)(?<part>.*?)(<\\/form>\\n<\\/td>)");
java.util.regex.Matcher m = p.matcher(frontPage);
List<String> parts = new ArrayList<>();
while (m.find()) {
    parts.add(m.group("part"));
}
if (parts.size() == 0)
    System.out.println("No page parts found");
else {
    System.out.println("Something matches at least");
}
It finds matches if only the first two groups are specified, but as soon as I add even a simple (form) sequence to the last group, it stops matching anything, and I can't even guess why.
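Add the DOTALL flag when you compile the pattern. By default, . does not match line terminators, so .*? cannot span the newlines in your input. Like this:
java.util.regex.Pattern.compile(
        "(align=\"left\">\\n)(?<part>.*?)(<\\/form>\\n<\\/td>)",
        java.util.regex.Pattern.DOTALL
);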
See it here at ideone.
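The effect of the flag can be checked with a small, self-contained sketch. The shortened page string and the class and method names below are made up for illustration; the regex is the one from the question with the unnecessary slash escapes dropped.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DotAllDemo {
    static final String REGEX = "(align=\"left\">\\n)(?<part>.*?)(</form>\\n</td>)";

    // Collects the named group "part" for every match, using the given flags.
    static List<String> findParts(String page, int flags) {
        List<String> parts = new ArrayList<>();
        Matcher m = Pattern.compile(REGEX, flags).matcher(page);
        while (m.find()) {
            parts.add(m.group("part"));
        }
        return parts;
    }

    public static void main(String[] args) {
        // Shortened, hypothetical page fragment with newlines inside the wanted part.
        String page = "<tr align=\"left\">\n<td>Hello \n<form>\n</form>\n</td>\n</tr>";
        System.out.println("default flags: " + findParts(page, 0).size());
        System.out.println("DOTALL:        " + findParts(page, Pattern.DOTALL).size());
    }
}
```

Without the flag, .*? stops at the first newline after "Hello " and the closing group can never match; with DOTALL the lazy quantifier is free to cross line breaks.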

take replaced variable from replaceAll

I have a big string and I want to extract the links from it. I can print the line that contains a link with this code:
Pattern pattern = Pattern.compile(".*(?<=overlay-link\" href=\").*?(?=\">).*");
Example output:
<a title="TITLE" class="overlay-link" href="LINK HERE"></a>
But when I try string.replaceAll, the regex deletes the link and prints everything else.
Example: <a title="TITLE" class="overlay-link" href=""></a>
I am new to regex. Can you help me?
Here is the full code:
String content;
Pattern pattern = Pattern.compile(".*(?<=overlay-link\" href=\").*?(?=\">).*");
try {
    Scanner scanner = new Scanner(new File("sourceCode.txt"));
    while (scanner.hasNext()) {
        content = scanner.nextLine();
        if (pattern.matcher(content).matches()) {
            System.out.println(content.replaceAll("(?<=overlay-link\" href=\").*?(?=\">)", ""));
        }
    }
} catch (IOException ex) {
    Logger.getLogger(SourceCodeExample.class.getName()).log(Level.SEVERE, null, ex);
}
If I understand your question correctly, you want to pull out just the link given in the href attribute.
To do this, use a capture group in the regex itself instead of trying to use replaceAll.
replaceAll does exactly what the docs say: it finds the link, replaces it with an empty string, and returns the full resulting string, which is not what you want.
The regex you should use is: .*(?<=overlay-link\" href=\")(.*?)(?=\">).* Notice the capture group () around the link.
This lets you find the matches and read capture group 1. I found a good example of how to do this in another question (important snippet pasted below).
String line = "This order was placed for QT3000! OK?"; //<a> tag string
Pattern pattern = Pattern.compile("(.*?)(\\d+)(.*)"); //insert regex provided above
Matcher matcher = pattern.matcher(line);
while (matcher.find()) {
    System.out.println("group 1: " + matcher.group(1)); //This will be your link
    System.out.println("group 2: " + matcher.group(2));
    System.out.println("group 3: " + matcher.group(3));
}
The comments were added by me.
Note: group 0 represents the whole match.
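Applied to the anchor tag from the question, the capture-group idea might look like the sketch below. It uses a simplified pattern without the lookarounds (an assumption for readability, since the literal text around the group does the same job here), and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OverlayLinkExtractor {
    // Group 1 is the href value of an element whose class ends in overlay-link.
    private static final Pattern LINK =
            Pattern.compile("overlay-link\" href=\"(.*?)\">");

    // Returns the href value of every overlay-link found in the content.
    static List<String> extractLinks(String content) {
        List<String> links = new ArrayList<>();
        Matcher matcher = LINK.matcher(content);
        while (matcher.find()) {
            links.add(matcher.group(1));
        }
        return links;
    }

    public static void main(String[] args) {
        String line = "<a title=\"TITLE\" class=\"overlay-link\" href=\"LINK HERE\"></a>";
        System.out.println(extractLinks(line));
    }
}
```

Using find() instead of matches() also means the pattern no longer needs the leading and trailing .* to cover the whole line.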

Regex for finding mp4 in string

I want to get all .mp4 URLs of this String using Regex.
Also I want to know how to get only the last .mp4 URL using Regex.
Thanks
contentType=application/x-mpegURL, url=https://video.twimg.com/amplify_video/822938952332144642/pl/BjHU8aBCbOgZNzXQ.m3u8},
Variant{bitrate=0, contentType=application/dash+xml, url=https://video.twimg.com/amplify_video/822938952332144642/pl/BjHU8aBCbOgZNzXQ.mpd},
Variant{bitrate=320000, contentType=video/mp4, url=https://video.twimg.com/amplify_video/822938952332144642/vid/320x180/YqZ72rzLj3VWVhy4.mp4},
Variant{bitrate=832000, contentType=video/mp4, url=https://video.twimg.com/amplify_video/822938952332144642/vid/640x360/A2vMgzo2ElpPP6TE.mp4},
Variant{bitrate=2176000, contentType=video/mp4, url=https://video.twimg.com/amplify_video/822938952332144642/vid/1280x720/j9xbNzRZqEbYs_2s.mp4}]}]";
Regex:
https?.*?\.mp4
Literal http
Followed by an optional 's': s?
Remove the question mark if they will all use HTTPS.
Followed by as few characters as possible: .*?
Followed by an mp4 extension (literal dot) \.mp4
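One caveat with .*? is that, when all variants sit on a single line, a match starting at an earlier non-mp4 URL can run on lazily until the first ".mp4" and so span several URLs. The sketch below avoids that by assuming a URL ends at whitespace, a comma, or a closing brace; that assumption, the example.com URLs, and the class and method names are all made up for illustration. It also shows one way to get only the last .mp4 URL, by collecting all matches and taking the final one.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class Mp4UrlFinder {
    // Assumption: a URL ends at whitespace, a comma, or a closing brace.
    private static final Pattern MP4_URL = Pattern.compile("https?://[^\\s,}]+\\.mp4");

    // Returns every .mp4 URL in the text, in order of appearance.
    static List<String> findMp4Urls(String text) {
        List<String> urls = new ArrayList<>();
        Matcher m = MP4_URL.matcher(text);
        while (m.find()) {
            urls.add(m.group());
        }
        return urls;
    }

    // Returns the last .mp4 URL, or null when there is none.
    static String lastMp4Url(String text) {
        List<String> urls = findMp4Urls(text);
        return urls.isEmpty() ? null : urls.get(urls.size() - 1);
    }

    public static void main(String[] args) {
        // Hypothetical single-line input shaped like the data in the question.
        String text = "Variant{url=https://example.com/a.m3u8}, "
                + "Variant{contentType=video/mp4, url=https://example.com/low.mp4}, "
                + "Variant{contentType=video/mp4, url=https://example.com/high.mp4}]";
        System.out.println(findMp4Urls(text));
        System.out.println(lastMp4Url(text));
    }
}
```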
Two approaches:
If you're sure the URLs will always begin with https:// and that no further "mp4" appears on the line after the URL ends, then you can use
pattern = "https://.*mp4";
String[] arr = {
"contentType=application/x-mpegURL, url=https://video.twimg.com/amplify_video/822938952332144642/pl/BjHU8aBCbOgZNzXQ.m3u8}",
"Variant{bitrate=0, contentType=application/dash+xml, url=https://video.twimg.com/amplify_video/822938952332144642/pl/BjHU8aBCbOgZNzXQ.mpd}",
"Variant{bitrate=320000, contentType=video/mp4, url=https://video.twimg.com/amplify_video/822938952332144642/vid/320x180/YqZ72rzLj3VWVhy4.mp4}",
"Variant{bitrate=832000, contentType=video/mp4, url=https://video.twimg.com/amplify_video/822938952332144642/vid/640x360/A2vMgzo2ElpPP6TE.mp4}",
"Variant{bitrate=2176000, contentType=video/mp4, url=https://video.twimg.com/amplify_video/822938952332144642/vid/1280x720/j9xbNzRZqEbYs_2s.mp4}]}]"
};
String pattern = "https://.*mp4";
Pattern r = Pattern.compile(pattern);
for (String line : arr) {
    Matcher m = r.matcher(line);
    if (m.find()) {
        System.out.println(m.group(0));
    } else {
        System.out.println("NO MATCH");
    }
}
If not, and you need to support all types of URLs, change your pattern to the one defined here, with a small modification:
String pattern =
"(((ht|f)tp(s?)\\:\\/\\/|~\\/|\\/)|www.)" +
"(\\w+:\\w+#)?(([-\\w]+\\.)+(com|org|net|gov" +
"|mil|biz|info|mobi|name|aero|jobs|museum" +
"|travel|[a-z]{2}))(:[\\d]{1,5})?" +
"(((\\/([-\\w~!$+|.,=]|%[a-f\\d]{2})+)+|\\/)+|\\?|#)?" +
"((\\?([-\\w~!$+|.,*:]|%[a-f\\d{2}])+=?" +
"([-\\w~!$+|.,*:=]|%[a-f\\d]{2})*)" +
"(&(?:[-\\w~!$+|.,*:]|%[a-f\\d{2}])+=?" +
"([-\\w~!$+|.,*:=]|%[a-f\\d]{2})*)*)*" +
"(#([-\\w~!$+|.,*:=]|%[a-f\\d]{2})*)?\\b"+"mp4";
Output:
NO MATCH
NO MATCH
https://video.twimg.com/amplify_video/822938952332144642/vid/320x180/YqZ72rzLj3VWVhy4.mp4
https://video.twimg.com/amplify_video/822938952332144642/vid/640x360/A2vMgzo2ElpPP6TE.mp4
https://video.twimg.com/amplify_video/822938952332144642/vid/1280x720/j9xbNzRZqEbYs_2s.mp4

Regex expression to get the file name

I want to extract only the file name from a complete file name that includes a timestamp. Below is the input.
String filePath = "fileName1_20150108.csv";
The expected output is: "fileName1"
String filePath2 = "fileName1_filedesc1_20150108_002_20150109013841.csv";
And the expected output is: "fileName1_filedesc1"
I wrote the code below in Java to get the file name. It works for the first input (filePath) but not for filePath2.
Pattern pattern = Pattern.compile(".*.(?=_)");
String filePath = "fileName1_20150108.csv";
String filePath2 = "fileName1_filedesc1_20150108_002_20150109013841.csv";
Matcher matcher = pattern.matcher(filePath);
while (matcher.find()) {
    System.out.print("Start index: " + matcher.start());
    System.out.print(" End index: " + matcher.end() + " ");
    System.out.println(matcher.group());
}
Can somebody please help me correct the regex so that I can parse both file paths with the same regex?
Thanks
Anchor the start, and make the .* non-greedy:
^.*?(_\D.*?)?(?=[_.])
Update: change the second group (for fileDesc) to optional, and enforce that it starts with a non-digit character. This will work as long as your fileDesc strings never start with numbers.
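A quick way to check this pattern against both inputs from the question is sketched below; the class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FileNamePrefix {
    // Lazy prefix, an optional "_<non-digit>..." part, then a lookahead for _ or .
    private static final Pattern PREFIX = Pattern.compile("^.*?(_\\D.*?)?(?=[_.])");

    // Returns the matched prefix, or null when the name contains neither _ nor .
    static String prefixOf(String fileName) {
        Matcher m = PREFIX.matcher(fileName);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(prefixOf("fileName1_20150108.csv"));
        System.out.println(prefixOf("fileName1_filedesc1_20150108_002_20150109013841.csv"));
    }
}
```

For the first input the optional group is skipped because the character after the underscore is a digit; for the second it picks up "_filedesc1".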
You can get the characters before the first underscore, the first underscore, and then the characters until the next underscore:
^[^_]*_[^_]*
This should work: "^(.*?)_([0-9_]*)\\.([^.]*)$"
It will give you 3 groups:
the base name (assuming not a single part will be all numbers)
the timestamp info
the extension.
You can test here: http://fiddle.re/v0hne6 (RegexPlanet)
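The three-group pattern can also be tried locally with a small sketch like this one, where the class and method names are hypothetical and matches() is used because the pattern is anchored at both ends.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FileNameParts {
    // Group 1: base name, group 2: digits and underscores (timestamp info), group 3: extension.
    private static final Pattern PARTS = Pattern.compile("^(.*?)_([0-9_]*)\\.([^.]*)$");

    // Returns {baseName, timestampInfo, extension}, or null when the name does not match.
    static String[] split(String fileName) {
        Matcher m = PARTS.matcher(fileName);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    public static void main(String[] args) {
        String[] names = {
            "fileName1_20150108.csv",
            "fileName1_filedesc1_20150108_002_20150109013841.csv"
        };
        for (String name : names) {
            System.out.println(String.join(" | ", split(name)));
        }
    }
}
```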
