Extract text with java


If I have the string below, how can I extract the EDITORS PREFACE text with java? Thanks.
<div class='chapter'><a href='page.php?page=1&filename=SomeFile&chapter=EDITORS PREFACE'>EDITORS PREFACE</a></div>

You wrote in a comment on your question that you want the content of href, so here is a regex for it:
<a[^>]*? href=\"(?<url>[^\"]+)\"[^>]*?>
This regex works with the Microsoft .NET Framework. It captures the content of href in a group called url.
I just noticed that this question is tagged Java. Java has no named groups as of JDK 6 (they were added in Java 7), so here is the solution for Java:
<a[^>]*? href="([^"]+)"[^>]*?>
The above regex captures the content of href in group 1. Your sample uses single quotes around the attribute value, so the program below matches ' instead of ".
Test it here: http://www.regexplanet.com/simple/index.html
Run this program:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class RegexMatches {
    public static void main(String[] args) {
        // String to be scanned to find the pattern.
        String line = "<a href='page.php?page=1&filename=SomeFile&chapter=EDITORS PREFACE'>EDITORS PREFACE</a>";
        String pattern = "<a[^>]*? href='([^']+)'[^>]*?>";
        // Create a Pattern object.
        Pattern r = Pattern.compile(pattern);
        // Now create a Matcher object.
        Matcher m = r.matcher(line);
        if (m.find()) {
            // Found value: <a href='page.php?page=1&filename=SomeFile&chapter=EDITORS PREFACE'>
            System.out.println("Found value: " + m.group(0));
            // Found value: page.php?page=1&filename=SomeFile&chapter=EDITORS PREFACE
            System.out.println("Found value: " + m.group(1));
        } else {
            System.out.println("NO MATCH");
        }
    }
}
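The program above captures the href value. If what you actually want is the visible link text (EDITORS PREFACE), a second capturing group can pick it up. This is only a sketch: the class and method names (LinkTextExtractor, linkText) are made up for illustration, and it assumes single-quoted attribute values and no tags nested inside the link.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LinkTextExtractor {
    // Group 1: the href value. Group 2: the text between <a ...> and </a>.
    private static final Pattern LINK =
            Pattern.compile("<a[^>]*?\\shref='([^']+)'[^>]*>(.*?)</a>");

    // Returns the text of the first link, or null if no link is found.
    static String linkText(String html) {
        Matcher m = LINK.matcher(html);
        return m.find() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        String line = "<div class='chapter'><a href='page.php?page=1&filename=SomeFile&chapter=EDITORS PREFACE'>EDITORS PREFACE</a></div>";
        System.out.println(linkText(line)); // EDITORS PREFACE
    }
}
```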

Related

Java - replaceFirst - jump to next match

I am trying to escape the HTML only inside <pre> tags that I meet ( don't ask me if there's much logic in this )
I did write this short program and it works fine, but I want to jump to the next match, without actually adding the id="ProcessedTag" so it doesn't replace the first match only. Here's my code :
import java.util.regex.Pattern;
import java.util.regex.Matcher;
import static org.apache.commons.lang3.StringEscapeUtils.escapeHtml4;
public class ReplaceHTML {
    public static void main(String[] args) {
        String html = "something something < > && \"\" <pre> text\n" +
                "< >\n" +
                "more text\n" +
                "&\n" +
                "<\n" +
                "</pre>\n" +
                "and some more text\n" +
                "<pre> text < </pre>";
        Pattern pattern = Pattern.compile("(?i)(?s)<pre>(.*?)</pre>");
        Matcher matcher = pattern.matcher(html);
        while (matcher.find()) {
            html = html.replaceFirst("(?i)(?s)<pre>(.*?)</pre>", "<pre id=\"ProcessedTag\">" + escapeHtml4(matcher.group(1)) + "</pre>");
        }
        System.out.println(html);
    }
}
So in order not to replace the first occurrence only, I decided to add this id="ProcessedTag", so the replaceFirst can move to the next match. I guess there should be a smarter way of doing this without adding anything additional.
Excuse me if this is a stupid question or it has been asked before ( couldn't find anything useful )
Regards.

You should be using Matcher#appendReplacement here:
Pattern pattern = Pattern.compile("(?i)(?s)<pre>(.*?)</pre>");
Matcher matcher = pattern.matcher(html);
StringBuffer buffer = new StringBuffer();
while (matcher.find()) {
    // quoteReplacement keeps any '$' or '\' in the escaped text from being read as a group reference.
    matcher.appendReplacement(buffer, Matcher.quoteReplacement("<pre>" + escapeHtml4(matcher.group(1)) + "</pre>"));
}
matcher.appendTail(buffer);
System.out.println(buffer);
Note that in general it is not desirable to use regex against HTML content. In this case, however, the tags you want to replace are not nested, so regex is a viable option.
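Here is a self-contained sketch of the same appendReplacement loop that runs without commons-lang. The names PreEscaper, escapeBasic and escapeInsidePre are hypothetical, and escapeBasic is only a stand-in for escapeHtml4 that handles &, < and >. The sample text contains a $ on purpose, because that is the character that trips up appendReplacement when the replacement is not quoted.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PreEscaper {
    private static final Pattern PRE =
            Pattern.compile("<pre>(.*?)</pre>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Hypothetical stand-in for escapeHtml4: escapes only &, < and >.
    static String escapeBasic(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    // Escapes the content of every <pre>...</pre> block and leaves the rest untouched.
    static String escapeInsidePre(String html) {
        Matcher matcher = PRE.matcher(html);
        StringBuffer buffer = new StringBuffer();
        while (matcher.find()) {
            String replacement = "<pre>" + escapeBasic(matcher.group(1)) + "</pre>";
            // Without quoteReplacement, "$5" below would be read as a reference to group 5.
            matcher.appendReplacement(buffer, Matcher.quoteReplacement(replacement));
        }
        matcher.appendTail(buffer);
        return buffer.toString();
    }

    public static void main(String[] args) {
        String html = "a < b <pre> x < y & cost $5 </pre> c <pre>1 > 0</pre>";
        System.out.println(escapeInsidePre(html));
    }
}
```

Each find() continues from the end of the previous match, so no marker attribute is needed to skip blocks that were already processed.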

Regex back reference to match a number (or any char sequence) with itself

I am missing something basic here. I have this regex (.*)=\1 and I am using it to match 100=100, and it's failing. When I remove the back reference from the regex and keep the capturing group, it shows that the captured group is '100'. Why does it not work when I try to use the back reference?
package test;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class RegexTest {
    public static void main(String[] args) {
        String eqPattern = "(.*)=\1";
        String[] input = {"100=100"};
        testAndPrint(eqPattern, input); // this does not work
        eqPattern = "(.*)=";
        input = new String[]{"100=100"};
        testAndPrint(eqPattern, input); // this works when the backreference is removed from the expr
    }
    static void testAndPrint(String regexPattern, String[] input) {
        System.out.println("\n Regex pattern is " + regexPattern);
        Pattern p = Pattern.compile(regexPattern, Pattern.CASE_INSENSITIVE);
        boolean found = false;
        for (String str : input) {
            System.out.println("Testing " + str);
            Matcher matcher = p.matcher(str);
            while (matcher.find()) {
                System.out.println("I found the text " + matcher.group() + " starting at " + "index " + matcher.start() + " and ending at index " + matcher.end());
                found = true;
                System.out.println("Group captured " + matcher.group(1));
            }
            if (!found) {
                System.out.println("No match found");
            }
        }
    }
}
When I run this, I get the following output
Regex pattern is (.*)=\1
Testing 100=100
No match found
Regex pattern is (.*)=
Testing 100=100
I found the text 100= starting at index 0 and ending at index 4
Group captured 100 --> If the group contains 100, why doesn't it match when I add \1 above?

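You have to escape the backslash in the pattern string. In a Java string literal, \1 is an octal escape for the character U+0001, so the regex engine never sees a backreference.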
String eqPattern = "(.*)=\\1";

I think you need to escape the backslash.
String eqPattern = "(.*)=\\1";
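To see the difference for yourself, a small check like the following may help. It is just a sketch; the class and method names (BackreferenceDemo, sameOnBothSides) are invented for the example. It uses matches() rather than find(), so the whole input has to fit the pattern.

```java
import java.util.regex.Pattern;

class BackreferenceDemo {
    // "\\1" in Java source is the two characters \ and 1,
    // which the regex engine reads as a backreference to group 1.
    static final Pattern SAME_ON_BOTH_SIDES = Pattern.compile("(.*)=\\1");

    static boolean sameOnBothSides(String s) {
        return SAME_ON_BOTH_SIDES.matcher(s).matches();
    }

    public static void main(String[] args) {
        // "\1" is an octal escape: one character with code point 1.
        System.out.println("\1".length());              // 1
        System.out.println((int) "\1".charAt(0));       // 1
        // "\\1" is a backslash followed by the digit 1.
        System.out.println("\\1".length());             // 2
        System.out.println(sameOnBothSides("100=100")); // true
        System.out.println(sameOnBothSides("100=101")); // false
    }
}
```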

Find all <a href>link</a> in a string with java regex

I have a String that contains some URLs. How can I find all the hrefs with a regular expression?
prodotto di prova
Right now I have this, which finds all Amazon links. I need to add the href to this regex as well:
String regex="(http|www\\.)(amazon|AMAZON)\\.(com|it|uk|fr|de)\\/(?:gp\\/product|gp\\/product\\/glance|[^\\/]+\\/dp|dp|[^\\/]+\\/product-reviews)\\/([^\\/]{10})";

This pattern works for me in Java:
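String input = "<a href=\"http://www.amazon.it/Die-10-Symphonien-Orchesterlieder-Sinfonie-Complete/dp/B003LQSHBO/ref=sr_1_2?ie=UTF8&qid=1440101590&sr=8-2&keywords=mahler\">prodotto di prova</a>";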
String pattern = "href=(?<link>['\\\"](?:https?:\\/\\/)?(?:www\\.)?(?:amazon|AMAZON)\\.(?:com|it|uk|fr|de)\\/(?<product>gp\\/product|gp\\/product\\/glance|[^\\/]+\\/dp|dp|[^\\/]+\\/product-reviews)\\/(?<productID>[^\\/]{10})\\/(?<queryString>.*?)\\\")";
Pattern r = Pattern.compile(pattern);
Matcher m = r.matcher(input);
if (m.find()) {
    System.out.println("Amazon link: " + m.group(0));
    System.out.println("product: " + m.group("product"));
    System.out.println("productID: " + m.group("productID"));
    System.out.println("querystring: " + m.group("queryString"));
} else {
    System.out.println("NO MATCH");
}
Output:
Amazon link: href="http://www.amazon.it/Die-10-Symphonien-Orchesterlieder-Sinfonie-Complete/dp/B003LQSHBO/ref=sr_1_2?ie=UTF8&qid=1440101590&sr=8-2&keywords=mahler"
product: Die-10-Symphonien-Orchesterlieder-Sinfonie-Complete/dp
productID: B003LQSHBO
querystring: ref=sr_1_2?ie=UTF8&qid=1440101590&sr=8-2&keywords=mahler
Java's rules for backslashes and escapes in strings are absolutely infuriating to me and I never get it right. You may find it helpful to go to http://www.regexplanet.com/advanced/java/index.html and enter a regex, which it will convert into a java string with the proper escapes. (I couldn't get mine working until I did this!)
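The snippet above stops at the first match. If you need every href in the string, whatever the host, a while loop over find() collects them all. This is a rough sketch: HrefFinder and findHrefs are made-up names, the sample URLs are placeholders, and it assumes every href value is quoted.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HrefFinder {
    // Group 1 is the opening quote; \1 requires the same quote to close the value.
    // Group 2 is the href value itself.
    private static final Pattern HREF =
            Pattern.compile("href\\s*=\\s*(['\"])(.*?)\\1", Pattern.CASE_INSENSITIVE);

    // Returns every quoted href value in the order it appears.
    static List<String> findHrefs(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(2));
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<a href=\"http://example.com/a\">one</a> "
                + "<a href='http://example.com/b'>two</a>";
        System.out.println(findHrefs(html)); // [http://example.com/a, http://example.com/b]
    }
}
```

You could then run your Amazon-specific regex against each collected value instead of folding everything into one large pattern.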

Regex to find all the img tags with http url

I have this JavaScript regex to extract all the <img> tags whose src starts with http:// from a string.
regex = /<img[^>]+src="?(http:\/\/[^">]+)"?\s*\/>/g;
My question is how to do this in Java. Secondly, the above regex only gives the content of src; I want to extract the whole <img> tag and replace it with blank spaces.
PS. The tags may have many other attributes along with src, such as class and alt.

Try this solution. I tested it, and I hope it is what you're looking for:
Pattern p = Pattern.compile("<img\\b([^>]*?)\\s*/>");
Matcher m = p.matcher("<img src=\"http://google.com\"/>");
if (m.find()) {
    // Prints the attributes inside the tag: src="http://google.com"
    System.out.println(m.group(1));
}

Try this one:
regex = /(<img[^>]+src="?http:\/\/[^">]+"?[^>]+\/>)/g
It should get all img tags. (I changed the end of the regexp and moved the brackets around the whole img tag.)

Please try the segment below:
.*(<img\s+.*src\s*=\s*"([^"]+)".*>).*
It creates two capturing groups:
1. Group 1 is the complete img tag.
2. Group 2 holds only the URL of the image.
Example:
package com.company;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Main {
    public static void main(String[] args) {
        String htmlFragment = "<img src='http://img01.ibnlive.in/ibnlive/uploads/2015/11/Videocon-Delite.gif' width='90' height='62'>Videocon Mobile Phones has launched three new Android smartphones - Z55 Delite, Z45 Dazzle, and Z45 Amaze with prices starting at Rs 4,599.";
        Pattern pattern = Pattern.compile(".*(<img\\s+.*src\\s*=\\s*'([^']+)'.*>).*");
        Matcher matcher = pattern.matcher(htmlFragment);
        if (matcher.matches()) {
            String imgTag = matcher.group(1); // the complete img tag
            String imgUrl = matcher.group(2); // the image URL only
            System.out.println(imgTag);
            System.out.println(imgUrl);
            // replace() treats the tag as literal text; replaceAll() would read it as a regex.
            String newString = htmlFragment.replace(imgTag, "");
            System.out.println(newString);
        }
    }
}
The example uses a single-quoted image URL, but the regex at the top is for your case with double quotes.
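For the second part of the question (replacing each whole tag with a blank), something along these lines might work. Treat it as a sketch, not a full HTML solution: ImgStripper and blankOutHttpImages are invented names, the sample markup is made up, and the pattern assumes that no attribute value contains a > character.

```java
import java.util.regex.Pattern;

class ImgStripper {
    // Matches a whole <img ...> tag whose src value starts with http://,
    // with or without quotes around the value and with or without a closing slash.
    private static final Pattern HTTP_IMG = Pattern.compile(
            "<img\\b[^>]*?\\bsrc\\s*=\\s*[\"']?http://[^\"'>\\s]+[\"']?[^>]*>",
            Pattern.CASE_INSENSITIVE);

    // Replaces every matching tag with a single space; other tags are left alone.
    static String blankOutHttpImages(String html) {
        return HTTP_IMG.matcher(html).replaceAll(" ");
    }

    public static void main(String[] args) {
        String html = "a<img class=\"x\" src=\"http://example.com/a.gif\" alt=\"A\"/>b"
                + "<img src='/local.png'>c";
        System.out.println(blankOutHttpImages(html)); // a b<img src='/local.png'>c
    }
}
```

Because the replacement is applied through the compiled Pattern, the tag text is never reinterpreted as a regex, and all matching tags are handled in one pass.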

How to write Java string literals that contain double-quotes (")?

I am getting a compile-time error.
import java.util.regex.Matcher;
import java.util.regex.Pattern;
class gfile {
    public static void main(String args[]) {
        // create a Pattern
        Pattern p = Pattern.compile("<div class="dinner">(.*?)</div>");//some prob with this line
        // create a Matcher and use the Matcher.group() method
        String can="<tr>"+
                "<td class="summaryinfo">"+
                "<div class="dinner">1,000</div>" +
                "<div style="margin-top:5px " +
                "font-weight:bold">times</div>"+
                "</td>"+
                "</tr>";
        Matcher matcher = p.matcher(can);
        // extract the group
        if (matcher.find()) {
            System.out.println(matcher.group());
        } else {
            System.out.println("could not find");
        }
    }
}

You have unescaped quotes inside your call to Pattern.compile.
Change:
Pattern p = Pattern.compile("<div class="dinner">(.*?)</div>");
To:
Pattern p = Pattern.compile("<div class=\"dinner\">(.*?)</div>");
Note: I just saw the same problem in your String can.
Change it to:
String can="<tr>"+
"<td class=\"summaryinfo\">"+
"<div class=\"dinner\">1,000</div>" +
"<div style=\"margin-top:5px " +
"font-weight:bold\">times</div>"+
"</td>"+
"</tr>";
I don't know if this fixes it, but it will at least compile now.

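Also, your regex uses (.*?), which means "any character, any number of repetitions, as few as possible". Because it is followed by </div>, it expands only as far as the first closing </div>, which is what you want here.
The compile error itself comes from the fact that your quotes aren't escaped.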

You should use an HTML parser to parse and process HTML - not a regular expression.

As already pointed out, you'll need to escape the double quotes inside all of your strings.
And, if you want to have "1,000" as result, you'll need to use group(1), else you'll get the complete match of the pattern.
Resulting code:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
class gfile {
    public static void main(String[] args) {
        // create a Pattern
        Pattern p = Pattern.compile("<div class=\"dinner\">(.*?)</div>");
        // create a Matcher and use the Matcher.group() method
        String can = "<tr>" +
                "<td class=\"summaryinfo\">" +
                "<div class=\"dinner\">1,000</div>" +
                "<div style=\"margin-top:5px " +
                "font-weight:bold\">times</div>" +
                "</td>" +
                "</tr>";
        Matcher matcher = p.matcher(can);
        if (matcher.find()) {
            // group(1) is the text inside the div; group() would be the whole match
            System.out.println(matcher.group(1));
        } else {
            System.out.println("could not find");
        }
    }
}

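(.*?) should stay as it is. Changing it to (.*)? makes the group greedy and optional, so it would run to the last </div> in the string instead of the first.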
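The difference between (.*?) and (.*)? is easy to check on a shortened version of the string from the question. This is a small sketch; LazyVsGreedy and firstGroup are hypothetical names used only here.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LazyVsGreedy {
    // Returns group 1 of the first match of regex in input, or null if there is no match.
    static String firstGroup(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String can = "<div class=\"dinner\">1,000</div><div style=\"font-weight:bold\">times</div>";
        // Lazy: stops at the first </div>.
        System.out.println(firstGroup("<div class=\"dinner\">(.*?)</div>", can));
        // prints: 1,000
        // Greedy and optional: runs to the last </div>.
        System.out.println(firstGroup("<div class=\"dinner\">(.*)?</div>", can));
        // prints: 1,000</div><div style="font-weight:bold">times
    }
}
```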
