Can't understand why my regex in Java doesn't work - java

I'm trying to find the pieces of text on a webpage I fetch that lie between the 'align="left">\n' and '</form>\n</td>' substrings.
I wrote a regex:
(align=\"left\">\\n)(?<part>.*?)(<\/form>\\n<\/td>)
and tested it at https://www.freeformatter.com/java-regex-tester.html where it works just as I need.
But in the Java code it can't find anything.
The test code that I'm trying to get working:
String frontPage = "<html>\n<head>\n<title>Hello</title>\n</head>\n" +
        "<body>\n<table>\n<tr align=\"left\">\n" +
        "<td>Hello \n<form>\n<input type=\"submit\" value=\"ok\">\n" +
        "</form>\n</td>\n" +
        "<td>World \n<form>\n<input type=\"submit\" value=\"ok\">\n" +
        "</form>\n</td>\n" +
        "</tr>\n</table>\n</body>\n</html>";
java.util.regex.Pattern p =
        java.util.regex.Pattern.compile(
                "(align=\"left\">\\n)(?<part>.*?)(<\\/form>\\n<\\/td>)");
java.util.regex.Matcher m = p.matcher(frontPage);
List<String> parts = new ArrayList<>();
while (m.find()) {
    parts.add(m.group("part"));
}
if (parts.size() == 0)
    System.out.println("No page parts found");
else {
    System.out.println("Something matches at least");
}
It finds matches if only the first two groups are specified, but as soon as I add even a simple sequence such as (form) to the last group, it stops matching anything, and I can't even guess why.

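Add DOTALL to the compile call. By default, . does not match line terminators, so .*? cannot cross the \n characters that sit between the opening tag and </form>. Like this: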
java.util.regex.Pattern.compile(
"(align=\"left\">\\n)(?<part>.*?)(<\\/form>\\n<\\/td>)",
java.util.regex.Pattern.DOTALL
);
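To see the effect of the flag in isolation, here is a minimal sketch that runs the same kind of pattern with and without DOTALL. The class name DotallDemo, the helper findParts, and the shortened sample HTML are made up for illustration, and the capture groups around the delimiters are dropped because only the named group is read.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DotallDemo {
    // Hypothetical helper: collects the named group "part" for every match,
    // compiling the pattern with the given flags (0 means no flags).
    static List<String> findParts(String html, int flags) {
        Pattern p = Pattern.compile(
                "align=\"left\">\\n(?<part>.*?)</form>\\n</td>", flags);
        Matcher m = p.matcher(html);
        List<String> parts = new ArrayList<>();
        while (m.find()) {
            parts.add(m.group("part"));
        }
        return parts;
    }

    public static void main(String[] args) {
        // Shortened sample input, assumed for this sketch.
        String html = "<tr align=\"left\">\n<td>Hello \n<form>\n</form>\n</td>\n</tr>";

        // Without DOTALL, '.' stops at the first '\n', so nothing matches.
        System.out.println(findParts(html, 0).size());              // prints 0

        // With DOTALL, '.' also matches '\n', so the lazy .*? can span lines.
        System.out.println(findParts(html, Pattern.DOTALL).size()); // prints 1
    }
}
```

The same effect can be had by prefixing the pattern with the inline flag (?s) instead of passing Pattern.DOTALL.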

Related

How can I get non-matching groups using a Matcher in Java?

I'm trying to write a Java regex to capture some groups of words from a String using a Matcher.
Say I have this string: "Hello, we are #happy# to see you today".
I would like to get two groups of matches, one having
Hello, we are
to see you today
and the other
happy
So far, I was only able to match the word between the #s using this Pattern:
Pattern p = Pattern.compile("#(.+?)#");
I've read about negative lookahead and lookaround, played a bit with it but without success.
I assume I should do some sort of negation of the regex so far, but I couldn't come up with anything.
Any help would be really appreciated, thank you.
From comment:
I may run into a string with more than one instance of words wrapped in #, such as "#Hello# kind #stranger#"
From comment:
I need to apply some different style format to both the text inside and outside.
Since you need to apply different styling, the code needs to process each block of text separately and needs to know whether the text is inside or outside a #..# section.
Note that the following code will silently skip the last # if there is an odd number of them.
String input = ...
for (Matcher m = Pattern.compile("([^#]+)|#([^#]+)#").matcher(input); m.find(); ) {
    if (m.start(1) != -1) {
        String outsideText = m.group(1);
        System.out.println("Outside: \"" + outsideText + "\"");
    } else {
        String insideText = m.group(2);
        System.out.println("Inside: \"" + insideText + "\"");
    }
}
Output for input = "Hello, we are #happy# to see you today"
Outside: "Hello, we are "
Inside: "happy"
Outside: " to see you today"
Output for input = "#Hello# kind #stranger#"
Inside: "Hello"
Outside: " kind "
Inside: "stranger"
Output for input = "This #text# has unpaired # characters"
Outside: "This "
Inside: "text"
Outside: " has unpaired "
Outside: " characters"
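Since the stated goal is to style the two kinds of text differently, the same alternation can also rebuild the string in one pass. The following is only a sketch: the class name HashStyler and the choice of wrapping inside text in <b> tags are assumptions for illustration, not something the question specifies.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HashStyler {
    // Group 1: a run of text outside #..#; group 2: the text inside #..#.
    private static final Pattern SEGMENT = Pattern.compile("([^#]+)|#([^#]+)#");

    // Hypothetical styling: outside text is copied as is, inside text is
    // wrapped in <b>...</b>. An unpaired '#' is silently dropped.
    static String style(String input) {
        StringBuilder out = new StringBuilder();
        Matcher m = SEGMENT.matcher(input);
        while (m.find()) {
            if (m.group(1) != null) {
                out.append(m.group(1));
            } else {
                out.append("<b>").append(m.group(2)).append("</b>");
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(style("Hello, we are #happy# to see you today"));
        // prints: Hello, we are <b>happy</b> to see you today
        System.out.println(style("#Hello# kind #stranger#"));
        // prints: <b>Hello</b> kind <b>stranger</b>
    }
}
```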
The best I could do is to split the string into 3 groups, then merge groups 1 and 4:
(^.*)(\#(.+?)\#)(.*)
EDIT: Taking remarks from the comments :
(^[^\#]*)(?:\#(.+?)\#)([^\#]*)
Thanks to @Lino, we no longer capture the useless group that includes the # characters, and we capture anything except # (instead of any non-whitespace character) in the 1st and 2nd groups.
Is this solution fine?
Pattern pattern = Pattern.compile("([^#]+)|#([^#]*)#");
Matcher matcher = pattern.matcher("Hello, we are #happy# to see you today");
List<String> notBetween = new ArrayList<>(); // not surrounded by #
List<String> between = new ArrayList<>(); // surrounded by #
while (matcher.find()) {
    if (Objects.nonNull(matcher.group(1))) notBetween.add(matcher.group(1));
    if (Objects.nonNull(matcher.group(2))) between.add(matcher.group(2));
}
System.out.println("Printing group 1");
for (String string : notBetween) {
    System.out.println(string);
}
System.out.println("Printing group 2");
for (String string : between) {
    System.out.println(string);
}

Strip off a sentence that contains a URL

I am looking for a way to remove a sentence that contains a URL in Java. Note that I want to remove the entire sentence, not just the URL.
I found a way that works, but I am looking for a simpler one, maybe with just one regex?
Detect each sentence (it can end with ., ? or !) using BreakIterator, as in "Split string into sentences".
Use a regex to check every sentence for the URL pattern, as in "Detect and extract url from a string?". If it is found, just remove the sentence.
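String source = "Sorry, we are closed today. Visit our website tomorrow at https://www.google.com. Thank you and have a nice day!";
// Assumed setup, not shown in the original post: a sentence iterator and a URL pattern named SENT.
BreakIterator iterator = BreakIterator.getSentenceInstance(Locale.US);
Pattern SENT = Pattern.compile("https?://\\S+"); // assumption: any URL-detecting regex will do
iterator.setText(source);
int start = iterator.first();
int end = iterator.next();
while (end != BreakIterator.DONE) {
    if (SENT.matcher(source.substring(start, end)).find()) {
        // Cut the sentence out and restart iteration on the shortened text
        source = source.substring(0, start) + source.substring(end);
        iterator.setText(source);
        start = iterator.first();
    } else {
        start = end;
    }
    end = iterator.next();
}
System.out.println(source);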
This prints : Sorry, we are closed today. Thank you and have a nice day!
"(?<=^|[?!.])[^?!.]+" + urlRegex + ".*?(?:$|[?!.])"
This will match each whole sentence whose part matches urlRegex, according to your definition of a sentence; you can use replaceAll to get rid of them. (There are many URL regexes around, and you didn't specify which one you were using, so I left the exact definition of URL to you.)
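As a rough illustration of the single-regex approach, here is a sketch that plugs one possible URL definition into the pattern above. The URL regex, the class name UrlSentenceStripper and the method strip are assumptions made for this example. The URL part is written so that it cannot end in ., ? or !, because otherwise the URL would swallow the sentence-ending period and the lazy .*? would run on into the next sentence.

```java
import java.util.regex.Pattern;

class UrlSentenceStripper {
    // Assumed URL definition: "http://" or "https://" followed by
    // non-whitespace characters, the last of which is not '.', '?' or '!'.
    private static final String URL = "https?://\\S*[^\\s.?!]";

    // Sentence pattern from the answer, with the assumed URL regex plugged in.
    private static final Pattern SENTENCE_WITH_URL = Pattern.compile(
            "(?<=^|[?!.])[^?!.]+" + URL + ".*?(?:$|[?!.])");

    static String strip(String text) {
        return SENTENCE_WITH_URL.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        String source = "Sorry, we are closed today. Visit our website tomorrow at "
                + "https://www.google.com. Thank you and have a nice day!";
        System.out.println(strip(source));
        // prints: Sorry, we are closed today. Thank you and have a nice day!
    }
}
```

Note that this treats every ., ? or ! before the URL as a sentence boundary, so abbreviations such as "e.g." earlier in the same sentence would split it; the BreakIterator approach is more careful in that respect.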
It would be best to split the text into sentences first, before passing each one through an expression.
Then this expression would simply return only those lines (sentences) that do not contain a URL:
^(?!.*https?[^\s]+.*).*$
Here, we'd be defining a URL as https?[^\s]+.
Test
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Main {
    public static void main(String[] args) {
        final String regex = "^(?!.*https?[^\\s]+.*).*$";
        final String string = "Sorry, we are closed today. Visit our website tomorrow at https://www.google.com. Thank you and have a nice day!\n\n"
                + "Sorry, we are closed today. Visit our website tomorrow at. Thank you and have a nice day!\n\n"
                + "Sorry, we are closed today. Visit our website tomorrow at https://www.goog. Thank you and have a nice day!\n";
        // MULTILINE makes ^ and $ match at every line break, so each line is tested separately
        final Pattern pattern = Pattern.compile(regex, Pattern.MULTILINE);
        final Matcher matcher = pattern.matcher(string);
        while (matcher.find()) {
            System.out.println("Full match: " + matcher.group(0));
            // The pattern has no capture groups, so this loop prints nothing here
            for (int i = 1; i <= matcher.groupCount(); i++) {
                System.out.println("Group " + i + ": " + matcher.group(i));
            }
        }
    }
}

Java - replaceFirst - jump to next match

I am trying to escape the HTML only inside <pre> tags that I meet ( don't ask me if there's much logic in this )
I did write this short program and it works fine, but I want to jump to the next match, without actually adding the id="ProcessedTag" so it doesn't replace the first match only. Here's my code :
import java.util.regex.Pattern;
import java.util.regex.Matcher;
import static org.apache.commons.lang3.StringEscapeUtils.escapeHtml4;
public class ReplaceHTML {
    public static void main(String[] args) {
        String html = "something something < > && \"\" <pre> text\n" +
                "< >\n" +
                "more text\n" +
                "&\n" +
                "<\n" +
                "</pre>\n" +
                "and some more text\n" +
                "<pre> text < </pre>";
        Pattern pattern = Pattern.compile("(?i)(?s)<pre>(.*?)</pre>");
        Matcher matcher = pattern.matcher(html);
        while (matcher.find()) {
            html = html.replaceFirst("(?i)(?s)<pre>(.*?)</pre>", "<pre id=\"ProcessedTag\">" + escapeHtml4(matcher.group(1)) + "</pre>");
        }
        System.out.println(html);
    }
}
So in order not to replace the first occurrence only, I decided to add this id="ProcessedTag", so the replaceFirst can move to the next match. I guess there should be a smarter way of doing this without adding anything additional.
Excuse me if this is a stupid question or it has been asked before ( couldn't find anything useful )
Regards.
You should be using Matcher#appendReplacement here:
Pattern pattern = Pattern.compile("(?i)(?s)<pre>(.*?)</pre>");
Matcher matcher = pattern.matcher(html);
StringBuffer buffer = new StringBuffer();
while (matcher.find()) {
    // quoteReplacement keeps any '$' or '\' in the escaped text from being read as a group reference
    String replacement = "<pre>" + escapeHtml4(matcher.group(1)) + "</pre>";
    matcher.appendReplacement(buffer, Matcher.quoteReplacement(replacement));
}
matcher.appendTail(buffer);
System.out.println(buffer);
Note that in general it is not desirable to use regex against HTML content. But in this case, since the tags you want to replace are not nested, regex is potentially viable.
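One caveat with appendReplacement is that it interprets $ and \ in the replacement string as group references and escapes, so text taken from the input should go through Matcher.quoteReplacement first. Here is a self-contained sketch of the loop. The class name PreEscaper and the tiny escape helper are hypothetical stand-ins so that the example runs without commons-lang; the helper is not a full replacement for escapeHtml4.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PreEscaper {
    private static final Pattern PRE = Pattern.compile("(?is)<pre>(.*?)</pre>");

    // Hypothetical stand-in for escapeHtml4: escapes only &, < and >.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    static String escapePreBlocks(String html) {
        Matcher m = PRE.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String replacement = "<pre>" + escape(m.group(1)) + "</pre>";
            // Without quoteReplacement, the "$5" below would be read as a
            // reference to group 5 and throw IndexOutOfBoundsException.
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out); // copy whatever follows the last match
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escapePreBlocks("a < b <pre>cost < $5 & up</pre> tail"));
        // prints: a < b <pre>cost &lt; $5 &amp; up</pre> tail
    }
}
```

Each call to find() resumes after the previous match, which is what removes the need for the id="ProcessedTag" marker.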

Match everything but the following regex? How can I go about this?

I'm using this REGEX which selects the part (2017-03-06T17:32:33.618) which I need to ignore while matching: \d{4}(-)\d{2}(-)\d{2}T\d{2}:\d{2}:\d{2}.\d{3}.
I tried every combination I could think of to get the "match everything but the following regex" result I need, but I can't seem to get it working.
String test =
" drawId, MIN(draw.draw_date_time) nearestFeatureDraw FROM draw WHERE draw.draw_date_time > " +
"'2017-03-06T17:32:33.618' GROUP BY draw.lottery_info_id ) nearestDraw on nearestDraw.lotto_id = li.id " +
"WHERE 1 = 1 AND li.id = 3 AND lower(li.name) LIKE '%blablabla%' " +
"ORDER BY jackPot DESC ORDER BY nearestFeatureDraw DESC ";
boolean pleaseBeTrue = test.matches("Input your Regex here please, and return True");
System.out.println(pleaseBeTrue);
I would appreciate your help to get the right Regex to match everything but that exact DateTime.
Make a temporary variable temp that does not contain the timestamp, then invoke the matches() method on it:
// The dot before the milliseconds is escaped so that it only matches a literal '.'
String temp = test.replaceAll("\\d{4}(-)\\d{2}(-)\\d{2}T\\d{2}:\\d{2}:\\d{2}\\.\\d{3}", "");
boolean pleaseBeTrue = temp.matches("Input your Regex here please, and return True");
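To make the idea concrete, here is a small sketch that strips the timestamp from two strings and then compares what is left. The class name TimestampMask, the method strip and the shortened SQL fragments are invented for illustration, and the pattern escapes the dot before the milliseconds.

```java
import java.util.regex.Pattern;

class TimestampMask {
    // ISO-like timestamp with milliseconds, e.g. 2017-03-06T17:32:33.618
    private static final Pattern TIMESTAMP = Pattern.compile(
            "\\d{4}-\\d{2}-\\d{2}T\\d{2}:\\d{2}:\\d{2}\\.\\d{3}");

    // Returns the input with every timestamp removed.
    static String strip(String sql) {
        return TIMESTAMP.matcher(sql).replaceAll("");
    }

    public static void main(String[] args) {
        // Shortened, made-up fragments that differ only in the timestamp.
        String a = "WHERE draw.draw_date_time > '2017-03-06T17:32:33.618' GROUP BY id";
        String b = "WHERE draw.draw_date_time > '2018-01-01T00:00:00.000' GROUP BY id";

        System.out.println(strip(a));                  // prints: WHERE draw.draw_date_time > '' GROUP BY id
        System.out.println(strip(a).equals(strip(b))); // prints: true
    }
}
```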

take replaced variable from replaceAll

I have a big string and I want to extract the links from it. I can print the lines that contain a link
Pattern pattern = Pattern.compile(".*(?<=overlay-link\" href=\").*?(?=\">).*");
with that code. Example output:
<a title="TITLE" class="overlay-link" href="LINK HERE"></a>
When I try string.replaceAll, the regex deletes the link and prints everything else.
E.g.: <a title="TITLE" class="overlay-link" href=""></a>
I am new to regex. Can you help me?
Here is full code :
String content;
Pattern pattern = Pattern.compile(".*(?<=overlay-link\" href=\").*?(?=\">).*");
try {
    Scanner scanner = new Scanner(new File("sourceCode.txt"));
    while (scanner.hasNext()) {
        content = scanner.nextLine();
        if (pattern.matcher(content).matches()) {
            System.out.println(content.replaceAll("(?<=overlay-link\" href=\").*?(?=\">)", ""));
        }
    }
} catch (IOException ex) {
    Logger.getLogger(SourceCodeExample.class.getName()).log(Level.SEVERE, null, ex);
}
If I understand your question correctly, you are looking to pull out just the link specified in the href attribute.
To do this, you should use a capture group in the regex itself instead of trying to use replaceAll.
The replaceAll method is correctly finding the link, replacing it with an empty string, and returning the full resulting string, as per the docs; that is just not the result you want.
The regex you should use is this: .*(?<=overlay-link\" href=\")(.*?)(?=\">).* Notice the capture group () around the link.
This will allow you to find the matches and access capture group 1. I found a good example of how to do this in another question (important snippet pasted below).
String line = "This order was placed for QT3000! OK?"; // in your case, the <a> tag string
Pattern pattern = Pattern.compile("(.*?)(\\d+)(.*)"); // in your case, the regex provided above
Matcher matcher = pattern.matcher(line);
while (matcher.find()) {
    System.out.println("group 1: " + matcher.group(1)); // with the regex above, this will be your link
    System.out.println("group 2: " + matcher.group(2));
    System.out.println("group 3: " + matcher.group(3));
}
The comments were added by me.
Note: group index 0 represents the whole match.
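Applied to the link case, the capture-group approach could look roughly like the sketch below. The class name HrefExtractor and the method extractLinks are invented for the example, and the pattern is a simplified variant that uses find() with a single capture group rather than the lookaround version above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HrefExtractor {
    // Group 1 captures whatever sits between href=" and the next quote.
    private static final Pattern LINK = Pattern.compile(
            "class=\"overlay-link\" href=\"(.*?)\"");

    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = LINK.matcher(html);
        while (m.find()) {
            links.add(m.group(1)); // group(0) would be the whole match
        }
        return links;
    }

    public static void main(String[] args) {
        String line = "<a title=\"TITLE\" class=\"overlay-link\" href=\"LINK HERE\"></a>";
        System.out.println(extractLinks(line)); // prints: [LINK HERE]
    }
}
```

Because find() scans for the pattern anywhere in the input, the leading and trailing .* are not needed, and one line containing several links yields several results.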
