Getting substring from a given string in Java - java

I am reading the content of a web page and parsing it with the Jsoup parser to get only the hyperlinks that exist in the body section. I am getting output like this:
<a href="/sports/sports.asp"><font color="#0000FF">Sports</font></a>
<a href="/titanic/titanic.asp"><font color="#0000FF">Titanic</font></a>
<a href="gastheft.asp">license plates</a>
<a href="miracle.asp">miracle cars</a>
<a href="/crime/warnings/clear.asp">Clear</a>
and even more hyperlinks.
From all of them, all I am interested in is data like
/sports/sports.asp
/titanic/titanic.asp
gastheft.asp
miracle.asp
/crime/warnings/clear.asp
How can I do this using String methods, or is there another way to extract this information with the Jsoup parser itself?

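You can try this; it works.
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.parser.Parser;
public class AttributeParsing {
    /**
     * @param args
     */
    public static void main(String[] args) {
        final String html = "<a href=\"/sports/sports.asp\"><font color=\"#0000FF\">Sports</font></a>";
        Document doc = Jsoup.parse(html, "", Parser.xmlParser());
        // Select the first anchor that carries an href attribute.
        Element th = doc.select("a[href]").first();
        String href = th.attr("href");
        System.out.println("th : " + th);
        System.out.println("href : " + href);
    }
}
Output:
th : <a href="/sports/sports.asp"><font color="#0000FF">Sports</font></a>
href : /sports/sports.asp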

Try this; it may help:
String html = "<p>An <a href='http://example.com/'><b>example</b></a> link.</p>";
Document doc = Jsoup.parse(html);
Element link = doc.select("a").first();
String text = doc.body().text(); // "An example link."
String linkHref = link.attr("href"); // "http://example.com/"
// attr() already returns the bare attribute value, so no further indexOf search for quotes is needed.

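This should be a basic bit of parsing using
String.indexOf
as in
int index = jsoupOutput.indexOf("href=\"");
and
int nextIndex = jsoupOutput.indexOf("\"", index + 6); // start after href=" so the opening quote is skipped
with the necessary checks in place (indexOf returns -1 when nothing is found).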
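The "necessary checks" mostly mean handling the -1 that String.indexOf returns when the marker or the closing quote is missing. A minimal sketch, assuming the input is a single anchor tag held in a string; the class and method names (HrefByIndexOf, extractHref) are made up for illustration:

```java
import java.util.Optional;

class HrefByIndexOf {
    private static final String MARKER = "href=\"";

    // Returns the text between href=" and the next double quote,
    // or an empty Optional when either delimiter is missing.
    static Optional<String> extractHref(String jsoupOutput) {
        int index = jsoupOutput.indexOf(MARKER);
        if (index < 0) {
            return Optional.empty(); // no href attribute at all
        }
        int begin = index + MARKER.length(); // first character of the value
        int nextIndex = jsoupOutput.indexOf('"', begin);
        if (nextIndex < 0) {
            return Optional.empty(); // opening quote without a closing one
        }
        return Optional.of(jsoupOutput.substring(begin, nextIndex));
    }

    public static void main(String[] args) {
        String anchor = "<a href=\"gastheft.asp\">license plates</a>";
        System.out.println(extractHref(anchor).orElse("(no href found)"));
    }
}
```

Returning an Optional instead of throwing keeps the caller in charge of what a missing link means. This only handles double-quoted attributes; single-quoted or unquoted values would need extra cases.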

Let's assume that the String anchor contains one of these links. The beginning index of the substring is then just after href=" and the end index is the first quotation mark after that position (index 9 here):
String anchor = "<a href=\"/sports/sports.asp\"><font color=\"#0000FF\">Sports</font></a>";
int beginIndex = anchor.indexOf("href=\"") + 6; // start right after <a href="
int endIndex = anchor.indexOf("\"", beginIndex);
String desiredPart = anchor.substring(beginIndex, endIndex);
And that's it, provided the anchor always has this shape. Better options are regular expressions, and the best would be using an XML parser.

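Use this as a reference:
import java.util.regex.*;
public class HelloWorld {
    public static void main(String[] args) {
        String s = "<a href=\"/sports/sports.asp\"><font color=\"#0000FF\">Sports</font></a>" +
            "<a href=\"/titanic/titanic.asp\"><font color=\"#0000FF\">Titanic</font></a>" +
            "<a href=\"gastheft.asp\">license plates</a>" +
            "<a href=\"miracle.asp\">miracle cars</a>" +
            "<a href=\"/crime/warnings/clear.asp\">Clear</a>";
        // Capture the attribute value in group 1 so that a '=' inside the URL cannot break the extraction.
        Pattern p = Pattern.compile("href=\"(.+?)\"");
        Matcher m = p.matcher(s);
        while (m.find()) {
            System.out.println(m.group(1));
        }
    }
}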
Output
/sports/sports.asp
/titanic/titanic.asp
gastheft.asp
miracle.asp
/crime/warnings/clear.asp

You can do it in one line:
String[] paths = str.replaceAll("(?m)^.*?\"(.*?)\".*?$", "$1").split("(?ms)$.*?^");
The first method call removes everything except the target from each line, and the second splits on newlines (will work on all OS terminators).
FYI (?m) turns on "multiline mode" and (?ms) also turns on the "dotall" flag.
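To see what the one-liner does, here is a small sketch that runs it on sample input with mixed line terminators. The sample lines and the names (MultilinePaths, extractPaths) are assumptions for illustration, and the approach relies on the first quoted value on each line being the href:

```java
import java.util.Arrays;

class MultilinePaths {
    // Step 1: per line, keep only the first double-quoted value.
    // Step 2: split the remaining text on line terminators.
    static String[] extractPaths(String str) {
        return str.replaceAll("(?m)^.*?\"(.*?)\".*?$", "$1")
                  .split("(?ms)$.*?^");
    }

    public static void main(String[] args) {
        String str = "<a href=\"/sports/sports.asp\"><font color=\"#0000FF\">Sports</font></a>\r\n"
                   + "<a href=\"gastheft.asp\">license plates</a>\n"
                   + "<a href=\"miracle.asp\">miracle cars</a>";
        System.out.println(Arrays.toString(extractPaths(str)));
    }
}
```

If an attribute such as class="..." ever precedes href on a line, this picks up the wrong value, so it is only suitable for input as regular as the sample.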

Related

Extract between html tag with unknown tagname?

<b>Topic1</b><ul>asdasd</ul><br/><b>Topic2</b><ul>....
I want to extract everything that comes after <b>Topic1</b> and the next <b> starting tag. Which in this case would be: <ul>asdasd</ul><br/>.
Problem: it does not necessarily have to be the <b> tag; it could be any other repeating tag.
So my question is: how can I extract that text dynamically? The only static things are:
The signal keyword to look for is always "Topic1". I'd like to take the surrounding tags as the one to look for.
The tag is always repeated. In this case it's always <b>, it might as well be <i> or <strong> or <h1> etc.
I know how to write the java code, but what would the regex be like?
String regex = ">Topic1<";
Matcher m = Pattern.compile(regex).matcher(text);
while (m.find()) {
for (int i = 1; i <= m.groupCount(); i++) {
System.out.println(m.group(i));
}
}
The following should work
Topic1</(.+?)>(.*?)<\\1>
Input: <b>Topic1</b><ul>asdasd</ul><br/><b>Topic2</b><ul>
Output: <ul>asdasd</ul><br/>
Code:
Pattern p = Pattern.compile("Topic1</(.+?)>(.*?)<\\1>");
// get a matcher object
Matcher m = p.matcher("<b>Topic1</b><ul>asdasd</ul><br/><b>Topic2</b><ul>");
while(m.find()) {
System.out.println(m.group(2)); // <ul>asdasd</ul><br/>
}
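The backreference idea can be pushed a little further by also capturing the opening tag in front of the keyword. The following is a sketch under assumptions, not a general HTML solution: the names (BetweenRepeatedTag, extract) are hypothetical, and it expects the keyword to be wrapped directly in a simple tag that repeats later in the text.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BetweenRepeatedTag {
    // Finds <tag>keyword</tag> and returns everything up to the next
    // opening <tag ...>, or null when the structure is not present.
    static String extract(String html, String keyword) {
        Pattern p = Pattern.compile(
                "<(\\w+)[^>]*>" + Pattern.quote(keyword) + "</\\1>" // <b>Topic1</b>, tag name in group 1
                + "(.*?)"                                           // the wanted text, as short as possible
                + "<\\1[\\s>]",                                     // next opening tag with the same name
                Pattern.DOTALL);
        Matcher m = p.matcher(html);
        return m.find() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        String text = "<b>Topic1</b><ul>asdasd</ul><br/><b>Topic2</b><ul>";
        System.out.println(extract(text, "Topic1"));
    }
}
```

Requiring whitespace or > after the repeated tag name keeps <br/> from being mistaken for the next <b> tag. Pattern.quote guards against keywords that contain regex metacharacters.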
Try this
String pattern = "\\<.*?\\>Topic1\\<.*?\\>"; // this will see the tag no matter what tag it is
String text = "<b>Topic1</b><ul>asdasd</ul><br/><b>Topic2</b>"; // your string to be split
String[] attributes = text.split(pattern);
for(String atr : attributes)
{
System.out.println(atr);
}
Will print out (preceded by an empty line, because the pattern matches at the very start of the text):
<ul>asdasd</ul><br/><b>Topic2</b>

Can I include white space between all html text() elements in Jsoup

I want to use Jsoup to extract all text from an HTML page and return a single string of all the text without any HTML. The code I am using half works, but has the effect of joining elements which affects my keyword searches against the string.
This is the Java code I am using:
String resultText = scrapePage(htmldoc);
private String scrapePage(Document doc) {
Element allHTML = doc.select("html").first();
return allHTML.text();
}
Run against the following HTML:
<html>
<body>
<h1>Title</h1>
<p>here is para1</p>
<p>here is para2</p>
</body>
</html>
Outputting resultText gives "Titlehere is para1here is para2" meaning I can't search for the word "para1" as the only word is "para1here".
I don't want to split the document into more elements than necessary (for example, getting all h1 and p text elements), as there is such a wide range of tags I could be matching
(e.g. data1data2 would come from):
<td>data1</td><td>data2</td>
Is there a way I can get all the text from the page but also include a space between the tags? I don't need to preserve whitespace otherwise; there is no need to keep line breaks etc., as I am just preparing a keyword string. I will probably collapse all other whitespace to a single space for this reason.
I don't have this issue using JSoup 1.7.3.
Here's the full code I used for testing:
final String html = "<html>\n"
+ " <body>\n"
+ " <h1>Title</h1>\n"
+ " <p>here is para1</p>\n"
+ " <p>here is para2</p>\n"
+ " </body>\n"
+ "</html>";
Document doc = Jsoup.parse(html);
Element element = doc.select("html").first();
System.out.println(element.text());
And the output:
Title here is para1 here is para2
Can you run my code? Also update to a newer version of jsoup if you don't have 1.7.3 yet.
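The previous answer is not right: it works only thanks to the "\n" line endings added to each line, but in reality you may not have a line ending at the end of each HTML line...
// Needs: java.io.IOException, java.net.URL, java.util.Scanner, org.jsoup.Jsoup,
// org.jsoup.nodes.Element, org.jsoup.nodes.TextNode, org.jsoup.select.Elements
void example2text() throws IOException {
    String url = "http://www.example.com/";
    // Read the whole response body into one string.
    String out = new Scanner(new URL(url).openStream(), "UTF-8").useDelimiter("\\A").next();
    org.jsoup.nodes.Document doc = Jsoup.parse(out);
    StringBuilder text = new StringBuilder();
    Elements tags = doc.select("*");
    for (Element tag : tags) {
        // Only the element's own text nodes, so nested text is not added twice.
        for (TextNode tn : tag.textNodes()) {
            String tagText = tn.text().trim();
            if (tagText.length() > 0) {
                text.append(tagText).append(' ');
            }
        }
    }
    System.out.println(text);
}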
By using answer: https://stackoverflow.com/a/35798214/4642669

java, parsing xml with specific symbols

I am trying to get XML from a string.
Special characters occur inside the title tags.
This is what I did:
public class Demo {
public static void main(String[] args) throws Exception {
String data = "<title> \"sad\" <<dd> ><\n </title>";
String pattern = "(<title>)(.+?)([<>'\"&])(.+?)(\n </title>)";
Matcher m = Pattern.compile(pattern).matcher(data);
while (m.find()) {
String bugString = m.group(3) + m.group(4);
// Escape & first so that the entities added below are not escaped a second time.
String fixed = bugString.replace("&", "&amp;");
fixed = fixed.replace("<", "&lt;");
fixed = fixed.replace(">", "&gt;");
fixed = fixed.replace("'", "&apos;");
fixed = fixed.replace("\"", "&quot;");
data = data.replace(bugString, fixed);
}
System.out.println(data);
}
}
But it looks a little ugly. How can I improve it if I don't want to use an additional library?
If you can influence the string, you could put the title tag's text within a CDATA section. Within it you do not have to encode the special XML characters.
CDATA sections are explained here, for example: http://en.m.wikipedia.org/wiki/CDATA
So your title could be like
<title> <![CDATA[ here comes my special title with "/<> ]]> </title>
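Both options can be done with the standard library only. The sketch below is an illustration, not the asker's code: the names (XmlText, escape, cdata) are made up, escape replaces the five predefined XML entities in a single pass, and cdata splits any "]]>" inside the text, since that sequence would otherwise end the CDATA section early.

```java
class XmlText {
    // Escapes the five characters that have predefined XML entities.
    static String escape(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '\'': out.append("&apos;"); break;
                case '"':  out.append("&quot;"); break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    // Wraps text in a CDATA section; an embedded "]]>" is split across
    // two sections so that it cannot terminate the first one.
    static String cdata(String text) {
        return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>";
    }

    public static void main(String[] args) {
        String title = "\"sad\" <<dd> ><";
        System.out.println("<title>" + escape(title) + "</title>");
        System.out.println("<title>" + cdata(title) + "</title>");
    }
}
```

Working character by character avoids the ordering problem of chained replacements, where escaping & after the other characters would escape the freshly inserted entities again.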

Customize HTML img tags with Java Regular Expressions

I'm new to regular expressions, but I believe this is the method for my solution. I'm trying to take an arbitrary HTML snippet and customize the image tags. For example,
If I had this HTML code:
<><><><><img src="blah.jpg"><><><><><><><><img src="blah2.jpg"><><><>
I want to turn it into:
<><><><><img src="images/blah.jpg"><><><><><><><><img src="images/blah2.jpg"><><><>
The Code I have now is this:
Pattern p = Pattern.compile("<img.*src=\".*\\..*\"");
Matcher m = p.matcher(htmlString);
boolean b = m.find();
String imgPath = "src=\"images/";
while(b)
{
//Get file name.
String name="test.jpg\"";
//Assign new path.
m.group().replaceAll("src=\".*\"",imgPath+name);
}
Regular expressions are not the correct way to parse HTML. Don't do it. It's not possible to do correctly.
Use a proper parser.
Document doc = Jsoup.parse(someHtml);
Elements imgs = doc.select("img");
for (Element img : imgs) {
img.attr("src", "images/" + img.attr("src")); // or whatever
}
doc.outerHtml(); // returns the modified HTML
This code is almost perfect. It prints out a lot of info, so look for where it says "Final Result" and "Original" to see the result of customizing the IMG tags. There's a small flaw that I'm still not sure how to fix. "in10" is the variable holding the test input string. The rest are regexes.
I noticed that problems occur when I use newline characters and when "src=" is left blank instead of "src=\"\"" or "src=''". The quotes seem to affect the results.
private static String r16 = "(?s)(<img.*?)(src\\s*?=\\s*?(?:\"|').*?(?:\"|'))";
private static String in10 = "<><><><><img width=1 height=888 src=\"bnm.jpg\"<><><><><img src=\"\"> <img src = \"\"><img src ='folder1/folder2/bnm.jpg'><><><img src =\"'>";
private static String r14 = "(?s)\\/|\\=";
String path="images/";
String name="";
Pattern p = Pattern.compile(r16);
Matcher m = p.matcher(in10);
StringBuffer sb = new StringBuffer();
int i=1;
while(m.find())
{
String g0 = m.group();
String g2 = m.group(2);
System.out.println("Main group"+i+":"+g0);
System.out.println("Inner group1:"+m.group(1));
System.out.println("Inner group2:"+g2);
String[] names=g2.split(r14);
printNames(names);
/*
* src="/folder1/folder2/blah.jpg" ---> blah.jpg
* src="bnm.jpg" ---> src="bnm.jp"
*/
if(names.length>=1)
{
name = names[names.length-1];
}
else
{
name = "";
}
//Name might be empty string.
name = name.replaceAll("\"|'","");
System.out.println("Retrieved Name:"+name);
m.appendReplacement(sb,"$1src=\""+path+name+"\"");
i++;
}
m.appendTail(sb);
INPUT=sb.toString();
System.out.println("Final Result:"+INPUT);
System.out.println("Original____:"+in10);
System.out.println("Count:"+m.groupCount());
}
You should not use regex for this. The approach josh3736 described is robust. But if you want to use regex, you could use:
String s = "<><><><><img src=\"blah.jpg\"><><><><><><><><img src=\"blah2.jpg\"><><><>";
s = s.replaceAll("(?<=img src=\")([^\"]+)(?=\">)","images/$1");
System.out.println(s);
output :
<><><><><img src="images/blah.jpg"><><><><><><><><img src="images/blah2.jpg"><><><>
Although I agree with the others that doing this with regular expressions is the wrong way to modify html fragments, here is a JUnit test case that shows how to replace src elements with a Pattern in Java:
import static org.junit.Assert.*;
import static org.hamcrest.CoreMatchers.*;
import java.util.regex.Pattern;
import java.util.regex.Matcher;
import org.junit.Test;
public class ImgSrcReplace {
@Test
public void replaceWithRegex() {
String dir = "image/";
String htmlFragment = "<body>\n"+
"<img src=\"single-line.jpg\">"+
"<img src=\n"+
"\"multiline.jpg\">\n"+
"<img src='single-quote.jpg'><img src=\"broken.gif\'>"+
"<img class=\"before\" src=\"class-before.jpg\">"+
"<img src=\"class-after.gif\" class=\"after\">"+
"</body>";
Pattern replaceImgSrc =
Pattern.compile(
"(<img\\b[^>]*\\bsrc\\s*=\\s*)([\"\'])((?:(?!\\2)[^>])*)\\2(\\s*[^>]*>)",
Pattern.CASE_INSENSITIVE | Pattern.MULTILINE); // flags are combined with |; using & here would yield 0 (no flags)
String result =
replaceImgSrc.matcher(htmlFragment)
.replaceAll("$1$2"+Matcher.quoteReplacement(dir)+"$3$2$4");
assertThat("the single line image tag was updated", result,
containsString("image/single-line.jpg"));
assertThat("the multiline image tag was updated", result,
containsString("image/multiline.jpg"));
assertThat("the single quote image tag was updated", result,
containsString("image/single-quote.jpg"));
assertThat("the broken gif was ignored.", result,
containsString("\"broken.gif'"));
assertThat("attributes before are preserved.", result,
containsString("<img class=\"before\" src=\"image/class-before.jpg\">"));
assertThat("attributes after are preserved.", result,
containsString("<img src=\"image/class-after.gif\" class=\"after\">"));
}
}

Java: I have a big string of html and need to extract the href="..." text

I have this string containing a large chunk of html and am trying to extract the link from href="..." portion of the string. The href could be in one of the following forms:
<a href="..." />
<a class="..." href="..." />
I don't really have a problem with regex but for some reason when I use the following code:
String innerHTML = getHTML();
Pattern p = Pattern.compile("href=\"(.*)\"", Pattern.DOTALL);
Matcher m = p.matcher(innerHTML);
if (m.find()) {
// Get all groups for this match
for (int i=0; i<=m.groupCount(); i++) {
String groupStr = m.group(i);
System.out.println(groupStr);
}
}
Can someone tell me what is wrong with my code? I did this stuff in php but in Java I am somehow doing something wrong... What is happening is that it prints the whole html string whenever I try to print it...
EDIT: Just so that everyone knows what kind of a string I am dealing with:
<a class="Wrap" href="item.php?id=43241"><input type="button">
<span class="chevron"></span>
</a>
<div class="menu"></div>
Everytime I run the code, it prints the whole string... That's the problem...
And about using jTidy... I'm on it but it would be interesting to know what went wrong in this case as well...
.*
This is a greedy match that consumes any character, including the quotes.
Try something like:
"href=\"([^\"]*)\""
There are two problems with the code you've posted:
Firstly the .* in your regular expression is greedy. This will cause it to match all characters until the last " character that can be found. You can make this match be non-greedy by changing this to .*?.
Secondly, to pick up all the matches, you need to keep iterating with Matcher.find rather than looking for groups. Groups give you access to each parenthesized section of the regex. You however, are looking for each time the whole regular expression matches.
Putting these together gives you the following code which should do what you need:
Pattern p = Pattern.compile("href=\"(.*?)\"", Pattern.DOTALL);
Matcher m = p.matcher(innerHTML);
while (m.find())
{
System.out.println(m.group(1));
}
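The difference between the greedy and the restricted pattern is easy to see on the sample markup from the question. This is a small sketch using only java.util.regex; the class and method names (GreedyVsLazy, greedyFirst, hrefs) are assumptions for illustration:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyVsLazy {
    // The original pattern: .* with DOTALL runs on to the LAST quote in the input.
    static String greedyFirst(String html) {
        Matcher m = Pattern.compile("href=\"(.*)\"", Pattern.DOTALL).matcher(html);
        return m.find() ? m.group(1) : null;
    }

    // [^"]* cannot cross a quote, so each match stops at the attribute's own closing quote.
    static List<String> hrefs(String html) {
        List<String> result = new ArrayList<>();
        Matcher m = Pattern.compile("href=\"([^\"]*)\"").matcher(html);
        while (m.find()) {
            result.add(m.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        String html = "<a class=\"Wrap\" href=\"item.php?id=43241\"><input type=\"button\">\n"
                    + "<span class=\"chevron\"></span>\n"
                    + "</a>\n"
                    + "<div class=\"menu\"></div>";
        System.out.println("greedy: " + greedyFirst(html));
        System.out.println("hrefs:  " + hrefs(html));
    }
}
```

The greedy version swallows nearly all of the remaining markup, which is why the original code appeared to print the whole string.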
Regex is great but not the right tool for this particular purpose. Normally you want to use a stack-based parser for this. Have a look at Java HTML parser APIs like jTidy.
Use a built in parser. Something like:
EditorKit kit = new HTMLEditorKit();
HTMLDocument doc = (HTMLDocument)kit.createDefaultDocument();
doc.putProperty("IgnoreCharsetDirective", Boolean.TRUE);
kit.read(reader, doc, 0);
HTMLDocument.Iterator it = doc.getIterator(HTML.Tag.A);
while (it.isValid())
{
SimpleAttributeSet s = (SimpleAttributeSet)it.getAttributes();
String href = (String)s.getAttribute(HTML.Attribute.HREF);
System.out.println( href );
it.next();
}
Or use the ParserCallback:
import java.io.*;
import java.net.*;
import javax.swing.text.*;
import javax.swing.text.html.parser.*;
import javax.swing.text.html.*;
public class ParserCallbackText extends HTMLEditorKit.ParserCallback
{
public void handleStartTag(HTML.Tag tag, MutableAttributeSet a, int pos)
{
if (tag.equals(HTML.Tag.A))
{
String href = (String)a.getAttribute(HTML.Attribute.HREF);
System.out.println(href);
}
}
public static void main(String[] args)
throws Exception
{
Reader reader = getReader(args[0]);
ParserCallbackText parser = new ParserCallbackText();
new ParserDelegator().parse(reader, parser, true);
}
static Reader getReader(String uri)
throws IOException
{
// Retrieve from Internet.
if (uri.startsWith("http:"))
{
URLConnection conn = new URL(uri).openConnection();
return new InputStreamReader(conn.getInputStream());
}
// Retrieve from file.
else
{
return new FileReader(uri);
}
}
}
The Reader could be a StringReader.
Another easy and reliable way to do it is by using Jsoup
Document doc = Jsoup.connect("http://example.com/").get();
Elements links = doc.select("a[href]");
for (Element link : links){
System.out.println(link.attr("abs:href"));
}
You may use an HTML parser library. jTidy, for example, gives you a DOM model of the HTML, from which you can extract all "a" elements and read their "href" attribute.
"href=\"(.*?)\"" should also work, but I think Kugel's answer will work faster.
