How to write Java string literals that contain double-quotes (")? - java

I am getting a compile-time error with the following code:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
class gfile
{
public static void main(String args[]) {
// create a Pattern
Pattern p = Pattern.compile("<div class="dinner">(.*?)</div>");//some prob with this line
// create a Matcher and use the Matcher.group() method
String can="<tr>"+
"<td class="summaryinfo">"+
"<div class="dinner">1,000</div>" +
"<div style="margin-top:5px " +
"font-weight:bold">times</div>"+
"</td>"+
"</tr>";
Matcher matcher = p.matcher(can);
// extract the group
if(matcher.find())
{
System.out.println(matcher.group());
}
else
System.out.println("could not find");
}
}

You have unescaped quotes inside your call to Pattern.compile.
Change:
Pattern p = Pattern.compile("<div class="dinner">(.*?)</div>");
To:
Pattern p = Pattern.compile("<div class=\"dinner\">(.*?)</div>");
Note: I just saw the same problem in your String can.
Change it to:
String can="<tr>"+
"<td class=\"summaryinfo\">"+
"<div class=\"dinner\">1,000</div>" +
"<div style=\"margin-top:5px " +
"font-weight:bold\">times</div>"+
"</td>"+
"</tr>";
I don't know if this fixes it, but it will at least compile now.
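To see what the escaped form actually produces, here is a minimal sketch (the class and method names are only illustrative). Inside a Java string literal, \" stands for one literal double-quote character, so the text handed to the regex engine contains plain quotes and no backslashes.

```java
import java.util.regex.Pattern;

class EscapedQuotesDemo {
    // Each \" in the literal becomes a single " character in the String.
    static String patternText() {
        return "<div class=\"dinner\">(.*?)</div>";
    }

    public static void main(String[] args) {
        String text = patternText();
        // Prints the pattern with ordinary quotes around dinner.
        System.out.println(text);
        // The compiled pattern matches markup that uses plain double quotes.
        Pattern p = Pattern.compile(text);
        System.out.println(p.matcher("<div class=\"dinner\">1,000</div>").matches());
    }
}
```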

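Also note what (.*?) does in your regex: it matches any character, any number of times, but as few as possible (a lazy quantifier).
On its own that can match nothing or everything; here it only expands until the following </div> can match.
The compile error itself comes from the fact that your quotes aren't escaped.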

You should use an HTML parser to parse and process HTML - not a regular expression.

As already pointed out, you'll need to escape the double quotes inside all of your strings.
And, if you want to have "1,000" as result, you'll need to use group(1), else you'll get the complete match of the pattern.
Resulting code:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
class gfile
{
public static void main(String args[]) {
// create a Pattern
Pattern p = Pattern.compile("<div class=\"dinner\">(.*?)</div>");
// create a Matcher and use the Matcher.group() method
String can="<tr>"+
"<td class=\"summaryinfo\">"+
"<div class=\"dinner\">1,000</div>" +
"<div style=\"margin-top:5px " +
"font-weight:bold\">times</div>"+
"</td>"+
"</tr>";
Matcher matcher = p.matcher(can);
if(matcher.find())
{
System.out.println(matcher.group(1));
}
else
System.out.println("could not find");
}
}

(.*?) might need to be (.*)?
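For comparison, here is a small sketch of how the two spellings behave on the markup from the question (the class and method names are illustrative). With the lazy (.*?) the capture stops at the first </div>. With (.*)? the group is optional but greedy, so it runs up to the last </div>.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LazyVsGreedyDemo {
    // The same markup as the question's string, joined into one value.
    static final String HTML =
        "<tr><td class=\"summaryinfo\">"
        + "<div class=\"dinner\">1,000</div>"
        + "<div style=\"margin-top:5px font-weight:bold\">times</div>"
        + "</td></tr>";

    // Returns group 1 of the first match, or null if the regex does not match.
    static String firstGroup(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Lazy: captures only the text of the first div.
        System.out.println(firstGroup("<div class=\"dinner\">(.*?)</div>", HTML));
        // Greedy: captures everything up to the last closing div.
        System.out.println(firstGroup("<div class=\"dinner\">(.*)?</div>", HTML));
    }
}
```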

Related

use regex to map TY_111.22-L007-C010 from a text

I want to get TY_111.22-L007-C010, Tzo11-L010-C100 and Tff-L010-C110 from this string with a regex:
"12.5*MAX(\"TY_111.22-L007-C010\";\"Tzo11-L010-C100\";\"Tff-L010-C110\")
I tested T.*-L\d*-C\d* but it doesn't give the result I want.
My Java test code:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
final String regex = "T.*-L\\d*-C\\d*";
final String string = "\"12.5*MAX(\\\"TY_111.22-L007-C010\\\";\\\"Tzo11-L010-C100\\\";\\\"Tff-L010-C110\\\"";
final Pattern pattern = Pattern.compile(regex, Pattern.MULTILINE);
final Matcher matcher = pattern.matcher(string);
while (matcher.find()) {
System.out.println("Full match: " + matcher.group(0));
for (int i = 1; i <= matcher.groupCount(); i++) {
System.out.println("Group " + i + ": " + matcher.group(i));
}
}
You need to use this regex: T.*?\-L\d*?\-C\d*
final String regex = "T.*?\\-L\\d*?\\-C\\d*";
Note: the important change is the non-greedy quantifier .*? instead of .*, so that each match stops at the first -L...-C... it can reach rather than the last one. Escaping the hyphens as \- is harmless but not required outside a character class. Also, matcher.group() is equivalent to matcher.group(0); since your regex has no capturing groups, the 0 can be dropped.
Output for the string in the question:
Full match: TY_111.22-L007-C010
Full match: Tzo11-L010-C100
Full match: Tff-L010-C110
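The difference between the greedy and the lazy quantifier can be checked with a short sketch like the following. The class name, the findAll helper and the INPUT constant are illustrative, and the hyphens are left unescaped on the assumption that they sit outside any character class.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LazyMatchDemo {
    static final String INPUT =
        "12.5*MAX(\"TY_111.22-L007-C010\";\"Tzo11-L010-C100\";\"Tff-L010-C110\")";

    // Collects every full match of the regex in the input, in order.
    static List<String> findAll(String regex, String input) {
        List<String> result = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        // Lazy .*? yields three separate identifiers.
        System.out.println(findAll("T.*?-L\\d*-C\\d*", INPUT));
        // Greedy .* yields one long match spanning all three.
        System.out.println(findAll("T.*-L\\d*-C\\d*", INPUT));
    }
}
```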
Why use a verbose regex pattern matcher when you can handle the problem with one line of code:
String input = "12.5*MAX(\"Txxxx-L007-C010\";\"Txxxx-L010-C100\";\"Txxxx-L010-C110\")";
String[] matches = input.replaceAll("^.*?\"|\"[^\"]*$", "")
.split("\";\"");
System.out.println(Arrays.toString(matches));
This prints:
[Txxxx-L007-C010, Txxxx-L010-C100, Txxxx-L010-C110]
OK...I used three lines of code, but the first and third are just for setting up the data and printing it.

Java - replaceFirst - jump to next match

I am trying to escape the HTML only inside the <pre> tags that I encounter (don't ask me if there's much logic in this).
I wrote this short program and it works fine, but I want to jump to the next match without actually adding id="ProcessedTag", which is only there so that replaceFirst does not keep replacing the first match. Here's my code:
import java.util.regex.Pattern;
import java.util.regex.Matcher;
import static org.apache.commons.lang3.StringEscapeUtils.escapeHtml4;
public class ReplaceHTML {
public static void main(String[] args) {
String html = "something something < > && \"\" <pre> text\n" +
"< >\n" +
"more text\n" +
"&\n" +
"<\n" +
"</pre>\n" +
"and some more text\n" +
"<pre> text < </pre>";
Pattern pattern = Pattern.compile("(?i)(?s)<pre>(.*?)</pre>");
Matcher matcher = pattern.matcher(html);
while(matcher.find()) {
html = html.replaceFirst("(?i)(?s)<pre>(.*?)</pre>", "<pre id=\"ProcessedTag\">" + escapeHtml4(matcher.group(1)) + "</pre>");
}
System.out.println(html);
}
}
So in order not to replace the first occurrence only, I decided to add this id="ProcessedTag", so the replaceFirst can move to the next match. I guess there should be a smarter way of doing this without adding anything additional.
Excuse me if this is a stupid question or it has been asked before ( couldn't find anything useful )
Regards.
You should be using Matcher#appendReplacement here:
Pattern pattern = Pattern.compile("(?i)(?s)<pre>(.*?)</pre>");
Matcher matcher = pattern.matcher(html);
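StringBuffer buffer = new StringBuffer();
while (matcher.find()) {
// quoteReplacement keeps any '$' or '\' in the escaped text from being read as a group reference
matcher.appendReplacement(buffer, Matcher.quoteReplacement("<pre>" + escapeHtml4(matcher.group(1)) + "</pre>"));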
}
matcher.appendTail(buffer);
System.out.println(buffer);
Note that in general it is not desirable to use regex against HTML content. But in this case, since the tags you want to replace are not nested, a regex is potentially viable.
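A self-contained sketch of the same appendReplacement loop is below. To keep it runnable without Apache Commons, escapeBasic is a hypothetical stand-in for escapeHtml4 that handles only &, < and >. The class and method names are also illustrative. Matcher.quoteReplacement is used so that a $ or \ inside the <pre> content is inserted literally.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PreEscaper {
    private static final Pattern PRE = Pattern.compile("(?is)<pre>(.*?)</pre>");

    // Hypothetical stand-in for escapeHtml4: escapes only &, < and >.
    static String escapeBasic(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    // Escapes the content of every <pre> block and leaves the rest untouched.
    static String escapePreBlocks(String html) {
        Matcher m = PRE.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String replacement = "<pre>" + escapeBasic(m.group(1)) + "</pre>";
            // Copies the text before this match, then the literal replacement.
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        // Copies whatever follows the last match.
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escapePreBlocks("a < b <pre>x < y $1</pre> c <pre>&</pre>"));
    }
}
```

Because the matcher walks the original string once, there is no need for a marker attribute to skip blocks that were already processed.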

RegEx for matching mp3 URLs

How can I get an mp3 url with REGEX?
This mp3 url, for example:
https://www.soundhelix.com/examples/mp3/SoundHelix-Song-1.mp3
This is what I've tried so far, but I want it to only accept a URL with '.mp3' at the end.
(https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]
This expression would likely pass your desired inputs:
^(https?|ftp|file):\/\/(www.)?(.*?)\.(mp3)$
If you wish to make it more restrictive, you can do that. For instance, you can use a character class listing the allowed characters instead of .*?.
I have added several capturing groups so that the individual parts are easy to retrieve, if necessary.
RegEx
If this wasn't your desired expression, you can modify or change it on regex101.com.
RegEx Circuit
You can also visualize your expressions on jex.im.
JavaScript Test
const regex = /^(https?|ftp|file):\/\/(www.)?(.*?)\.(mp3)$/gm;
const str = `https://www.soundhelix.com/examples/mp3/SoundHelix-Song-1.mp3
http://soundhelix.com/examples/mp3/SoundHelix-Song-1.mp3
http://www.soundhelix.com/examples/mp3/SoundHelix-Song-1.mp3
ftp://soundhelix.com/examples/mp3/SoundHelix-Song-1.mp3
file://localhost/examples/mp3/SoundHelix-Song-1.mp3
file://localhost/examples/mp3/SoundHelix-Song-1.wav
file://localhost/examples/mp3/SoundHelix-Song-1.avi
file://localhost/examples/mp3/SoundHelix-Song-1.m4a`;
let m;
while ((m = regex.exec(str)) !== null) {
// This is necessary to avoid infinite loops with zero-width matches
if (m.index === regex.lastIndex) {
regex.lastIndex++;
}
// The result can be accessed through the `m`-variable.
m.forEach((match, groupIndex) => {
console.log(`Found match, group ${groupIndex}: ${match}`);
});
}
Java Test
import java.util.regex.Matcher;
import java.util.regex.Pattern;
final String regex = "^(https?|ftp|file):\\/\\/(www.)?(.*?)\\.(mp3)$";
final String string = "https://www.soundhelix.com/examples/mp3/SoundHelix-Song-1.mp3\n"
+ "http://soundhelix.com/examples/mp3/SoundHelix-Song-1.mp3\n"
+ "http://www.soundhelix.com/examples/mp3/SoundHelix-Song-1.mp3\n"
+ "ftp://soundhelix.com/examples/mp3/SoundHelix-Song-1.mp3\n"
+ "file://localhost/examples/mp3/SoundHelix-Song-1.mp3\n"
+ "file://localhost/examples/mp3/SoundHelix-Song-1.wav\n"
+ "file://localhost/examples/mp3/SoundHelix-Song-1.avi\n"
+ "file://localhost/examples/mp3/SoundHelix-Song-1.m4a";
final Pattern pattern = Pattern.compile(regex, Pattern.MULTILINE);
final Matcher matcher = pattern.matcher(string);
while (matcher.find()) {
System.out.println("Full match: " + matcher.group(0));
for (int i = 1; i <= matcher.groupCount(); i++) {
System.out.println("Group " + i + ": " + matcher.group(i));
}
}
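The Java test above passes Pattern.MULTILINE because the input holds several URLs separated by newlines. The following sketch shows what that flag changes for ^ and $. The class name, helper and sample URLs are illustrative, and the simpler pattern ^https?://\S+\.mp3$ is used here as an assumption rather than the exact expression above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MultilineDemo {
    static final String TWO_LINES =
        "http://soundhelix.com/a.mp3\nhttp://soundhelix.com/b.mp3";

    // Counts matches of an anchored mp3 URL pattern under the given flags.
    static int countMatches(String input, int flags) {
        Matcher m = Pattern.compile("^https?://\\S+\\.mp3$", flags).matcher(input);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Without MULTILINE, ^ and $ anchor to the whole input only.
        System.out.println(countMatches(TWO_LINES, 0));
        // With MULTILINE, ^ and $ also match at each line boundary.
        System.out.println(countMatches(TWO_LINES, Pattern.MULTILINE));
    }
}
```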
If you want it to match only inputs ending with '.mp3', add \.mp3$ at the end of your regex.
$ asserts the end of the input:
(https?|ftp|file):\/\/[-a-zA-Z0-9+&@#\/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#\/%=~_|]\.mp3$
Matching:
https://www.soundhelix.com/examples/mp3/SoundHelix-Song-1.mp3 => Match
https://www.soundhelix.com/examples/mp3/SoundHelix-Song-1.mp4 => No Match
You could use anchors to assert the start ^ and the end $ of the string and end the pattern with .mp3:
^https?://\S+\.mp3$
Explanation
^ Assert start of string
https?:// Match http with optional s and ://
\S+ Match 1+ times a non whitespace char
\.mp3 Match .mp3
$ Assert end of string
For example:
String regex = "^https?://\\S+\\.mp3$";
String[] strings = {
"https://www.soundhelix.com/examples/mp3/SoundHelix-Song-1.mp3",
"https://www.soundhelix.com/examples/mp3/SoundHelix-Song-1.mp4"
};
Pattern pattern = Pattern.compile(regex);
for (String s : strings) {
Matcher matcher = pattern.matcher(s);
if (matcher.find()) {
System.out.println(matcher.group(0));
}
}
Result
https://www.soundhelix.com/examples/mp3/SoundHelix-Song-1.mp3
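When each candidate is a single string, as in the example above, a small validation helper is another option. This is only a sketch with illustrative names. It relies on Matcher.matches(), which requires the whole input to match, so the ^ and $ anchors can be omitted.

```java
import java.util.regex.Pattern;

class Mp3UrlCheck {
    // Compiled once; matches() makes explicit ^ and $ anchors unnecessary.
    private static final Pattern MP3_URL = Pattern.compile("https?://\\S+\\.mp3");

    static boolean isMp3Url(String candidate) {
        return MP3_URL.matcher(candidate).matches();
    }

    public static void main(String[] args) {
        System.out.println(isMp3Url("https://www.soundhelix.com/examples/mp3/SoundHelix-Song-1.mp3"));
        System.out.println(isMp3Url("https://www.soundhelix.com/examples/mp3/SoundHelix-Song-1.mp4"));
    }
}
```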

How do I escape '+' in pattern matching to highlight keyword?

I'm implementing a keyword highlighter in Java. I'm using java.util.regex.Pattern to highlight (making bold) keyword within String content. The following piece of code is working fine for alphanumeric keywords, but it is not working for some special characters. For example, in String content, I would like to highlight the keyword c++ which has the special character + (plus), but it's not getting highlighted properly. How do I escape + character so that c++ is highlighted?
public static void main(String[] args)
{
String content = "java,c++,ejb,struts,j2ee,hibernate";
System.out.println("CONTENT: " + content);
String highlight = "C++";
System.out.println("HIGHLIGHT KEYWORD: " + highlight);
//highlight = highlight.replaceAll(Pattern.quote("+"), "\\\\+");
java.util.regex.Pattern pattern = java.util.regex.Pattern.compile("\\b" + highlight + "\\b", java.util.regex.Pattern.CASE_INSENSITIVE);
System.out.println("PATTERN: " + pattern.pattern());
java.util.regex.Matcher matcher = pattern.matcher(content);
while (matcher.find()) {
System.out.println("Match found!!!");
for (int i = 0; i <= matcher.groupCount(); i++) {
System.out.println(matcher.group(i));
content = matcher.replaceAll("<B>" + matcher.group(i) + "</B>");
}
}
System.out.println("RESULT: " + content);
}
Output:
CONTENT: java,c++,ejb,struts,j2ee,hibernate
HIGHLIGHT KEYWORD: C++
PATTERN: \bC++\b
Match found!!!
c
RESULT: java,c++,ejb,struts,j2ee,hibernate
I even tried to escape '+' before calling Pattern.compile like this,
highlight = highlight.replaceAll(Pattern.quote("+"), "\\\\+");
but still I'm not able to get the syntax right. Can somebody help me solve this?
This should do what you need:
Pattern pattern = Pattern.compile(
"\\b"
+ Pattern.quote(highlight)
+ "\\b",
Pattern.CASE_INSENSITIVE);
Update: you are right, the above doesn't work for C++ (\b matches word boundaries and doesn't recognize ++ as a word). We need a more complicated solution:
Pattern pattern = Pattern.compile(
"\\b"
+ Pattern.quote(highlight)
+ "(?![^\\p{Punct}\\s])", // matches if the match is not followed by
// anything other than whitespace or punctuation
Pattern.CASE_INSENSITIVE);
Update in response to comments: it seems that you need more logic in your pattern creation. Here's a helper method to create the pattern for you:
private static final String WORD_BOUNDARY = "\\b";
// edit this to suit your needs:
private static final String ALLOWED = "[^,.!\\-\\s]";
private static final String LOOKAHEAD = "(?!" + ALLOWED + ")";
private static final String LOOKBEHIND = "(?<!" + ALLOWED + ")";
public static Pattern createHighlightPattern(final String highlight) {
final Pattern pattern = Pattern.compile(
(Character.isLetterOrDigit(highlight.charAt(0))
? WORD_BOUNDARY : LOOKBEHIND)
+ Pattern.quote(highlight)
+ (Character.isLetterOrDigit(highlight.charAt(highlight.length() - 1))
? WORD_BOUNDARY : LOOKAHEAD),
Pattern.CASE_INSENSITIVE);
return pattern;
}
And here is some test code to check that it works:
private static void testMatch(final String haystack, final String needle) {
final Matcher matcher = createHighlightPattern(needle).matcher(haystack);
if (!matcher.find())
System.out.println("Failed to find pattern " + needle);
while (matcher.find())
System.out.println("Found additional match: " + matcher.group() +
" for pattern " + needle);
}
public static void main(final String[] args) {
final String testString = "java,c++,hibernate,.net,asp.net,c#,spring";
testMatch(testString, "java");
testMatch(testString, "c++");
testMatch(testString, ".net");
testMatch(testString, "c#");
}
When I run this method, I don't see any output (which is good :-))
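For the highlighting step itself, a simplified sketch is shown below. It is not the helper above: it assumes that a keyword is delimited by anything that is not a word character, using the lookarounds (?<!\w) and (?!\w), and the class and method names are illustrative. Pattern.quote makes the + characters literal, and $0 in the replacement re-inserts the matched text with its original casing.

```java
import java.util.regex.Pattern;

class KeywordHighlighter {
    // Wraps every standalone occurrence of the keyword in <B>...</B>.
    static String highlight(String content, String keyword) {
        Pattern p = Pattern.compile(
            "(?<!\\w)" + Pattern.quote(keyword) + "(?!\\w)",
            Pattern.CASE_INSENSITIVE);
        // $0 refers to the whole match as it appeared in the content.
        return p.matcher(content).replaceAll("<B>$0</B>");
    }

    public static void main(String[] args) {
        String content = "java,c++,ejb,struts,j2ee,hibernate";
        System.out.println(highlight(content, "C++"));
    }
}
```

One limitation of this assumption: a keyword such as c would also match the c in c++, because + is not a word character. The ALLOWED character class in the helper above is the place to tune that.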
The problem is that the \b word boundary anchor does not match, because + is a non-word character and I assume the whitespace following it is also a non-word character.
A word boundary \b matches at a change from a word character (a member of \w) to a non-word character (not a member of \w), or the other way round.
Also, if you want to match a + literally you have to escape it. As written, C++ means "match at least one C": the ++ is a possessive quantifier, which matches one or more of the preceding C and does not backtrack.
Try changing your pattern to something like this:
java.util.regex.Pattern.compile("\\b" + highlight + "(?=\\s)", java.util.regex.Pattern.CASE_INSENSITIVE);
(?=\s) is a positive lookahead that checks whether whitespace follows your highlight (the backslash is doubled inside the Java string literal).
Additionally, you will need to escape the + you are searching for.
All you need is this:
Pattern.compile("\\Q"+highlight+"\\E", java.util.regex.Pattern.CASE_INSENSITIVE);
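A quick way to check the effect of quoting is sketched below, with illustrative names. Unquoted, c++ is read as a c with a possessive quantifier. Wrapped in \Q...\E, which is also what Pattern.quote produces for this keyword, it matches the literal text. Pattern.quote is the safer spelling, since it also copes with keywords that themselves contain \E.

```java
import java.util.regex.Pattern;

class QuoteDemo {
    // True if the regex matches the entire input.
    static boolean matchesWhole(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    public static void main(String[] args) {
        // Unquoted: "one or more c", so it matches ccc but not the text c++.
        System.out.println(matchesWhole("c++", "ccc"));
        System.out.println(matchesWhole("c++", "c++"));
        // Quoted: the + characters are literal.
        System.out.println(matchesWhole(Pattern.quote("c++"), "c++"));
        // Shows the quoted form that Pattern.quote builds.
        System.out.println(Pattern.quote("c++"));
    }
}
```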
Assuming your keyword does not begin or end with punctuation, here is a commented regex which uses lookahead and lookbehind to achieve your desired matching behavior:
// Compile regex to match a keyword or keyphrase.
java.util.regex.Pattern pattern = java.util.regex.Pattern.compile(
"(?<=[\\s'\".?!,;:]|^) # Word preceded by ws, quote, punct or BOS.\n" +
// Escape any regex metacharacters in the keyword phrase.
java.util.regex.Pattern.quote(highlight) + " # Keyword to be matched.\n" +
"(?=[\\s'\".?!,;:]|$) # Word followed by ws, quote, punct or EOS.",
Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE | Pattern.COMMENTS);
Note that this solution works even if your keyword is a phrase containing spaces.

Extract text with java

If I have the string below, how can I extract the EDITORS PREFACE text with java? Thanks.
<div class='chapter'><a href='page.php?page=1&filename=SomeFile&chapter=EDITORS PREFACE'>EDITORS PREFACE</a></div>
Since you wrote in a comment on your question that you want what is within href, here is a regex for it:
<a[^>]*? href=\"(?<url>[^\"]+)\"[^>]*?>
This regex works with the Microsoft .NET Framework. It captures the content of href in a group called url.
I just noticed that this question is tagged with Java. Java has no named groups as of JDK 6 (they were added in Java 7), so here's the solution for Java:
<a[^>]*? href="([^"]+)"[^>]*?>
The above regex will capture the content within href putting it in group 1.
Test it here: http://www.regexplanet.com/simple/index.html
Run this program:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class RegexMatches
{
public static void main( String args[] ){
// String to be scanned to find the pattern.
String line = "<a href='page.php?page=1&filename=SomeFile&chapter=EDITORS PREFACE'>EDITORS PREFACE</a>";
String pattern = "<a[^>]*? href=\'([^\']+)\'[^>]*?>";
// Create a Pattern object
Pattern r = Pattern.compile(pattern);
// Now create matcher object.
Matcher m = r.matcher(line);
if (m.find( ))
{
// Found value: <a href='page.php?page=1&filename=SomeFile&chapter=EDITORS PREFACE'>
System.out.println("Found value: " + m.group(0) );
// Found value: page.php?page=1&filename=SomeFile&chapter=EDITORS PREFACE
System.out.println("Found value: " + m.group(1) );
}
else
{
System.out.println("NO MATCH");
}
}
}
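If the goal is the link text (EDITORS PREFACE) rather than the href value, a similar sketch can capture what sits between the opening and closing a tags. The pattern and names here are assumptions for illustration. It expects simple, non-nested markup like the sample and is not a general HTML parser.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LinkTextExtractor {
    static final String LINE =
        "<div class='chapter'><a href='page.php?page=1&filename=SomeFile"
        + "&chapter=EDITORS PREFACE'>EDITORS PREFACE</a></div>";

    // Group 1 is the text between <a ...> and </a>; it must not contain '<'.
    private static final Pattern LINK_TEXT = Pattern.compile("<a\\b[^>]*>([^<]*)</a>");

    // Returns the text of the first link, or null if there is none.
    static String linkText(String html) {
        Matcher m = LINK_TEXT.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(linkText(LINE));
    }
}
```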
