How do I regex match (Non inclusive)? - java

I want to get the String between (not including): alt=" and "
Here is a small sample of my code:
Pattern p2 = compile("alt=\"(.*?)\");
Matcher m2 = p2.matcher(result);
while (m2.find()) {
names.add(m2.group());
}
The output is for example: alt="Harry Potter"
when I want the output to be just: Harry Potter

Your code has a typo (a missing double quote in compile) and the group you need to access is Group 1 (use compile("alt=\"(.*?)\"") and m2.group(1)).
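As a rough sketch of the corrected regex approach (the class name AltExtractor, the method extractAltValues and the sample HTML are made up for illustration, not taken from your code):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AltExtractor {
    // Captures everything between alt=" and the next double quote.
    private static final Pattern ALT = Pattern.compile("alt=\"(.*?)\"");

    // Hypothetical helper: collects every captured alt value.
    static List<String> extractAltValues(String html) {
        List<String> names = new ArrayList<>();
        Matcher m = ALT.matcher(html);
        while (m.find()) {
            // group() would return the whole match, including alt="...";
            // group(1) returns only the text inside the parentheses.
            names.add(m.group(1));
        }
        return names;
    }

    public static void main(String[] args) {
        // Sample input assumed for illustration.
        String result = "<img src=\"hp.jpg\" alt=\"Harry Potter\"><img alt=\"Ron Weasley\">";
        System.out.println(extractAltValues(result)); // [Harry Potter, Ron Weasley]
    }
}
```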
You should think about using an HTML parser for getting values from HTML, like jsoup. Here is a way to get what you need with it:
Document doc = Jsoup.parse(html_contents);
for (Element element : doc.getAllElements())
{
for (Attribute attribute : element.attributes())
{
if(attribute.getKey().equalsIgnoreCase("alt"))
{
names.add(attribute.getValue());
}
}
}

Related

Extract between html tag with unknown tagname?

<b>Topic1</b><ul>asdasd</ul><br/><b>Topic2</b><ul>....
I want to extract everything between <b>Topic1</b> and the next <b> start tag. In this case that would be: <ul>asdasd</ul><br/>.
Problem: the tag is not necessarily <b>; it could be any other repeating tag.
So my question is: how can I extract that text dynamically? The only static things are:
The signal keyword to look for is always "Topic1". I'd like to take the surrounding tags as the one to look for.
The tag is always repeated. In this case it's always <b>, it might as well be <i> or <strong> or <h1> etc.
I know how to write the java code, but what would the regex be like?
String regex = ">Topic1<";
Matcher m = Pattern.compile(regex).matcher(text);
while (m.find()) {
for (int i = 1; i <= m.groupCount(); i++) {
System.out.println(m.group(i));
}
}
The following should work
Topic1</(.+?)>(.*?)<\\1>
Input: <b>Topic1</b><ul>asdasd</ul><br/><b>Topic2</b><ul>
Output: <ul>asdasd</ul><br/>
Code:
Pattern p = Pattern.compile("Topic1</(.+?)>(.*?)<\\1>");
// get a matcher object
Matcher m = p.matcher("<b>Topic1</b><ul>asdasd</ul><br/><b>Topic2</b><ul>");
while(m.find()) {
System.out.println(m.group(2)); // <ul>asdasd</ul><br/>
}
Try this
String pattern = "\\<.*?\\>Topic1\\<.*?\\>"; // this will see the tag no matter what tag it is
String text = "<b>Topic1</b><ul>asdasd</ul><br/><b>Topic2</b>"; // your string to be split
String[] attributes = text.split(pattern);
for(String atr : attributes)
{
System.out.println(atr);
}
Will print out (after an empty first line, because the text starts with the delimiter):
<ul>asdasd</ul><br/><b>Topic2</b>

How to detect URL to different page (also in the same domain)

I have a question about detecting URLs in a page, and I'm looking for the best way to solve it. I use Jsoup to download the page.
URI uri = new URI("http://www.niocchi.com/");
Document doc = Jsoup.connect(uri.toString()).get();
Elements links = doc.select("a");
This gives me some links, for example:
http://www.niocchi.com/#Package organization
http://www.niocchi.com/#Architecture
http://www.linkedin.com/in/ivanprado
http://www.niocchi.com/examples/
I need only the links to different pages, without references to sections inside a page.
From the example above I would like to get:
http://www.linkedin.com/in/ivanprado
http://www.niocchi.com/examples/
It looks like you want to select only those <a> elements whose href attribute value is built entirely from characters other than #. In that case you can use
doc.select("a[href~=^[^#]+$]")
attribute~=regex is the syntax for checking whether part of the attribute's value can be matched by the regex.
A regex accepting one or more non-# characters can look like this: [^#]+
A regex that must match the entire string (not only part of it) needs to be surrounded with the ^ and $ anchors, which represent
^ - start of the string,
$ - end of the string.
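To see the effect of that regex without jsoup, here is a small sketch using plain java.util.regex. The class name HashFreeLinks and the method withoutFragment are invented for illustration; find() is used here to mimic the "part of the value can match" behaviour described above, with the anchors forcing a whole-string match.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class HashFreeLinks {
    // One or more non-# characters, anchored to the whole string.
    private static final Pattern NO_HASH = Pattern.compile("^[^#]+$");

    // Hypothetical helper: keeps only hrefs that contain no '#'.
    static List<String> withoutFragment(List<String> hrefs) {
        List<String> kept = new ArrayList<>();
        for (String href : hrefs) {
            if (NO_HASH.matcher(href).find()) {
                kept.add(href);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> hrefs = Arrays.asList(
                "http://www.niocchi.com/#Package organization",
                "http://www.niocchi.com/#Architecture",
                "http://www.linkedin.com/in/ivanprado",
                "http://www.niocchi.com/examples/");
        // Prints the two links without a '#', one per line.
        for (String href : withoutFragment(hrefs)) {
            System.out.println(href);
        }
    }
}
```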
You could convert them to strings and then split them based on the # mark.
for example:
public void stringSplitter() {
String result = null;
// example
String[] stringURL = {"http://www.niocchi.com/#Package organization", "http://www.niocchi.com/#Architecture",
"http://www.linkedin.com/in/ivanprado", "http://www.niocchi.com/examples/"};
try {
for (int i = 0; i < stringURL.length; i++) {
String [] parts = stringURL[i].split("#");
result = parts[0];
System.out.println(result);
}
}catch (Exception ex) {
ex.printStackTrace();
}
}
The output is:
http://www.niocchi.com/
http://www.niocchi.com/
http://www.linkedin.com/in/ivanprado
http://www.niocchi.com/examples/
I would even think about making part of the method return only unique URLs.
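One possible way to do the unique part, as a sketch only: strip the fragment and collect the results in a LinkedHashSet, which drops duplicates while keeping the original order. The class name UniquePages and the method uniquePages are hypothetical.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class UniquePages {
    // Hypothetical helper: removes the #fragment and de-duplicates.
    static Set<String> uniquePages(List<String> urls) {
        Set<String> pages = new LinkedHashSet<>();
        for (String url : urls) {
            int hash = url.indexOf('#');
            // Keep everything before the first '#', or the whole URL if there is none.
            String page = (hash >= 0 ? url.substring(0, hash) : url).trim();
            pages.add(page);
        }
        return pages;
    }

    public static void main(String[] args) {
        List<String> urls = Arrays.asList(
                "http://www.niocchi.com/#Package organization",
                "http://www.niocchi.com/#Architecture",
                "http://www.linkedin.com/in/ivanprado",
                "http://www.niocchi.com/examples/");
        for (String page : uniquePages(urls)) {
            System.out.println(page);
        }
    }
}
```

Note that this still keeps http://www.niocchi.com/ once, since stripping the fragment turns the two anchor links into the page itself.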

java : generating xpath using string matcher regex

I want to generate an XPath from an HTML file. So far I have succeeded in storing the HTML source in a String and generating a basic XPath using a regex Matcher, as follows:
String text = "<html><body><table><tr id=\"x\"><td>abc</td><td></td><td>xyz</td></tr></table></body></html>";
//I want xpath till label "xyz"
String unwanted= "xyz";
//so splitting and storing needed String
String[] neededString=text.split(unwanted);
String a="";
//pattern for extracting tags
String patternString1 = "<(.+?)>";
Pattern pattern = Pattern.compile(patternString1);
Matcher matcher = pattern.matcher(neededString[0]);
while(matcher.find()) {
a=a.concat(matcher.group(1)+"/");
System.out.println(a);
}
This code works for a basic tag structure without multiple child nodes, such as multiple <td>s in a <tr>. Can anyone improve the code above so that it generates XPaths for multiple children and also captures attributes like id, class, etc.?
Any help is much appreciated.
Thanks in advance.
Regex is not accurate enough for extracting HTML content.
Use the Jsoup HTML parser:
public static void main(String[] args){
String html = "<html><body><table><tr id=\"x\"><td>abc</td><td></td>" +
"<td>xyz</td></tr></table></body></html>";
Document doc = Jsoup.parse(html);
for (Element table : doc.select("table")) {
for (Element row : table.select("tr[id=x]")) {
Elements tds = row.select("td");
System.out.println(tds.get(2).text());
}
}
}

Regex for a particular pattern

I am trying to extract a string that looks something like this(below) using java regex.
Automotive Vehicles (154949)
Cars (91364)
Auto Parts & Accessories (29987)
Motorcycles & Scooters (11648)
I have tried the following:
for (Element link : links) {
String cat = link.text();
String pattern = "(\\w+\\w+?\\s?.?\\w+)";
Pattern p = Pattern.compile(pattern);
Matcher m = p.matcher(cat);
while (m.find( )) {
System.out.println("Category: "+m.group(0));
}
}
Extract the text and the number with a vim regex
\(.*\)(\(\d*\))
Group 1 is the text, Group 2 is the number
It's been a while since I've written regexes in Java, but I think it's:
(.*)\((\d+)\)
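In Java source the backslashes have to be doubled inside the string literal. A minimal sketch of how that pattern could be applied to the sample categories (the class name CategoryParser and the method parse are made up for illustration):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CategoryParser {
    // Same pattern as above, with backslashes doubled for a Java string literal.
    private static final Pattern CATEGORY = Pattern.compile("(.*)\\((\\d+)\\)");

    // Hypothetical helper: returns {name, count}, or null when the text does not match.
    static String[] parse(String text) {
        Matcher m = CATEGORY.matcher(text);
        if (!m.find()) {
            return null;
        }
        // Group 1 still contains the space before "(", so trim it.
        return new String[] { m.group(1).trim(), m.group(2) };
    }

    public static void main(String[] args) {
        String[] samples = {
            "Automotive Vehicles (154949)",
            "Cars (91364)",
            "Auto Parts & Accessories (29987)",
            "Motorcycles & Scooters (11648)"
        };
        for (String sample : samples) {
            String[] parts = parse(sample);
            System.out.println("Category: " + parts[0] + ", count: " + parts[1]);
        }
    }
}
```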

Java: I have a big string of html and need to extract the href="..." text

I have this string containing a large chunk of html and am trying to extract the link from href="..." portion of the string. The href could be in one of the following forms:
<a href="..." />
<a class="..." href="..." />
I don't really have a problem with regex but for some reason when I use the following code:
String innerHTML = getHTML();
Pattern p = Pattern.compile("href=\"(.*)\"", Pattern.DOTALL);
Matcher m = p.matcher(innerHTML);
if (m.find()) {
// Get all groups for this match
for (int i=0; i<=m.groupCount(); i++) {
String groupStr = m.group(i);
System.out.println(groupStr);
}
}
Can someone tell me what is wrong with my code? I did this stuff in php but in Java I am somehow doing something wrong... What is happening is that it prints the whole html string whenever I try to print it...
EDIT: Just so that everyone knows what kind of a string I am dealing with:
<a class="Wrap" href="item.php?id=43241"><input type="button">
<span class="chevron"></span>
</a>
<div class="menu"></div>
Every time I run the code, it prints the whole string... That's the problem...
And about using jTidy... I'm on it but it would be interesting to know what went wrong in this case as well...
.*
This is a greedy operation that will match any character, including the quotes.
Try something like:
"href=\"([^\"]*)\""
There are two problems with the code you've posted:
Firstly the .* in your regular expression is greedy. This will cause it to match all characters until the last " character that can be found. You can make this match be non-greedy by changing this to .*?.
Secondly, to pick up all the matches, you need to keep iterating with Matcher.find rather than looking for groups. Groups give you access to each parenthesized section of the regex. You however, are looking for each time the whole regular expression matches.
Putting these together gives you the following code which should do what you need:
Pattern p = Pattern.compile("href=\"(.*?)\"", Pattern.DOTALL);
Matcher m = p.matcher(innerHTML);
while (m.find())
{
System.out.println(m.group(1));
}
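For comparison, the negated character class [^\"]* can be used with the same find() loop. It stops at the first closing quote without needing a lazy quantifier or DOTALL. This is only a sketch run against the sample markup from the question; the class name HrefExtractor and the method extractHrefs are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HrefExtractor {
    // [^"]* matches any run of characters that are not a double quote.
    private static final Pattern HREF = Pattern.compile("href=\"([^\"]*)\"");

    // Hypothetical helper: returns every href value found in the HTML.
    static List<String> extractHrefs(String html) {
        List<String> hrefs = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            hrefs.add(m.group(1));
        }
        return hrefs;
    }

    public static void main(String[] args) {
        // The sample string from the question.
        String innerHTML = "<a class=\"Wrap\" href=\"item.php?id=43241\"><input type=\"button\">\n"
                + "<span class=\"chevron\"></span>\n"
                + "</a>\n"
                + "<div class=\"menu\"></div>";
        System.out.println(extractHrefs(innerHTML)); // [item.php?id=43241]
    }
}
```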
Regex is great but not the right tool for this particular purpose. Normally you want to use a stack-based parser for this. Have a look at Java HTML parser APIs like jTidy.
Use a built in parser. Something like:
EditorKit kit = new HTMLEditorKit();
HTMLDocument doc = (HTMLDocument)kit.createDefaultDocument();
doc.putProperty("IgnoreCharsetDirective", Boolean.TRUE);
kit.read(reader, doc, 0);
HTMLDocument.Iterator it = doc.getIterator(HTML.Tag.A);
while (it.isValid())
{
SimpleAttributeSet s = (SimpleAttributeSet)it.getAttributes();
String href = (String)s.getAttribute(HTML.Attribute.HREF);
System.out.println( href );
it.next();
}
Or use the ParserCallback:
import java.io.*;
import java.net.*;
import javax.swing.text.*;
import javax.swing.text.html.parser.*;
import javax.swing.text.html.*;
public class ParserCallbackText extends HTMLEditorKit.ParserCallback
{
public void handleStartTag(HTML.Tag tag, MutableAttributeSet a, int pos)
{
if (tag.equals(HTML.Tag.A))
{
String href = (String)a.getAttribute(HTML.Attribute.HREF);
System.out.println(href);
}
}
public static void main(String[] args)
throws Exception
{
Reader reader = getReader(args[0]);
ParserCallbackText parser = new ParserCallbackText();
new ParserDelegator().parse(reader, parser, true);
}
static Reader getReader(String uri)
throws IOException
{
// Retrieve from Internet.
if (uri.startsWith("http:"))
{
URLConnection conn = new URL(uri).openConnection();
return new InputStreamReader(conn.getInputStream());
}
// Retrieve from file.
else
{
return new FileReader(uri);
}
}
}
The Reader could be a StringReader.
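For the case in the question, where the HTML is already in a String, a sketch along these lines might do. The class name StringHrefParser and the method hrefs are invented for illustration; the parser calls are the same ones used above.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

public class StringHrefParser {
    // Hypothetical helper: parses an in-memory HTML string and collects <a href> values.
    static List<String> hrefs(String html) throws IOException {
        final List<String> found = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet a, int pos) {
                if (tag == HTML.Tag.A) {
                    Object href = a.getAttribute(HTML.Attribute.HREF);
                    if (href != null) {
                        found.add(href.toString());
                    }
                }
            }
        };
        // StringReader wraps the String so no file or network access is needed.
        new ParserDelegator().parse(new StringReader(html), callback, true);
        return found;
    }

    public static void main(String[] args) throws IOException {
        String innerHTML = "<a class=\"Wrap\" href=\"item.php?id=43241\"><input type=\"button\">"
                + "<span class=\"chevron\"></span></a><div class=\"menu\"></div>";
        for (String href : hrefs(innerHTML)) {
            System.out.println(href);
        }
    }
}
```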
Another easy and reliable way to do it is by using Jsoup
Document doc = Jsoup.connect("http://example.com/").get();
Elements links = doc.select("a[href]");
for (Element link : links){
System.out.println(link.attr("abs:href"));
}
You may use an HTML parser library. jTidy, for example, gives you a DOM model of the HTML, from which you can extract all "a" elements and read their "href" attribute.
"href=\"(.*?)\"" should also work, but I think Kugel's answer will work faster.
