Jsoup matching with regex


I ran into a problem using jsoup. I can't match <div id="shout_132684"> because the digits in the id keep changing. How should I match those elements?
Elements content = doc.select("div:matches(id=\"shout_.+?\")");
This doesn't work.

You can use the CSS "starts with" attribute selector ^=. It is supported by Jsoup's .select(...).
You can do it like this:
doc.select("div[id^=shout]");
This is a full example:
public static void main(String[] args) {
Document parse = Jsoup.parse("<div id=\"shout_23\"/>" +
"<div id=\"shout_42\"/>" +
"<div id=\"notValidId\"/>" +
"<div id=\"shout_1337\"/>");
Elements divs = parse.select("div[id^=shout]");
for (Element element : divs) {
System.out.println(element);
}
}
It will print:
<div id="shout_23"></div>
<div id="shout_42"></div>
<div id="shout_1337"></div>

For a stricter match you can still use a regular expression with the ~= attribute selector:
Elements content = doc.select("div[id~=(shout_)[0-9]+]");
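One thing to watch with ~=: as far as I know, jsoup tests the pattern with find-style matching, so an unanchored pattern also accepts ids that merely contain shout_ followed by a digit. Below is a minimal sketch of the difference, assuming jsoup is on the classpath. The class name ShoutIdMatch and the sample ids are made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class ShoutIdMatch {
    // Hypothetical sample markup; only the first div has a "shout_<digits>" id.
    static final String HTML =
            "<div id=\"shout_23\"></div>"
          + "<div id=\"shout_abc\"></div>"
          + "<div id=\"xshout_7y\"></div>"
          + "<div id=\"notValidId\"></div>";

    // Counts the elements in the sample markup that match the selector.
    static int count(String selector) {
        Document doc = Jsoup.parse(HTML);
        return doc.select(selector).size();
    }

    public static void main(String[] args) {
        // Unanchored: also picks up xshout_7y, because "shout_7" occurs inside it.
        System.out.println(count("div[id~=(shout_)[0-9]+]"));
        // Anchored with ^ and $: only ids that are exactly shout_ plus digits.
        System.out.println(count("div[id~=^shout_[0-9]+$]"));
    }
}
```

If every id on your page already has the shout_<digits> form, the simpler div[id^=shout_] selector is probably enough.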

Related

get all links from a div with JSoup

Basically, I am using Jsoup to parse a site, I want to get all the links from the following html:
<ul class="detail-main-list">
<li>
<a href="/manga/toki_wa/v01/c001/1.html">Dis Be the link</a>
</li>
</ul>
Any idea how?

Straight from jsoup.org, right there, first thing you see:
Document doc = Jsoup.connect("https://en.wikipedia.org/").get();
log(doc.title());
Elements newsHeadlines = doc.select("#mp-itn b a");
for (Element headline : newsHeadlines) {
log("%s\n\t%s",
headline.attr("title"), headline.absUrl("href"));
}
Modifying this to what you need seems trivial:
Document doc = Jsoup.connect("https://en.wikipedia.org/").get();
Elements anchorTags = doc.select("ul.detail-main-list a");
for (Element anchorTag : anchorTags) {
System.out.println("Links to: " + anchorTag.attr("href"));
System.out.println("In absolute form: " + anchorTag.absUrl("href"));
System.out.println("Text content: " + anchorTag.text());
}
The ul.detail-main-list a part is a so-called selector string. A real short tutorial on these:
foo means: Any HTML element with that tag name, i.e. <foo></foo>.
.bar means: Any HTML element with class bar, i.e. <foo class="bar baz"></foo>
#bar means: Any HTML element with id bar, i.e. <foo id="bar">
These can be combined: ul.detail-main-list matches any <ul> tags that have the string detail-main-list in their list of classes.
a b means: everything matching the 'b' selector that has something matching 'a' as an ancestor (not necessarily the direct parent). So ul a matches all <a> tags that have a <ul> tag around them someplace.
The JSoup docs are excellent.
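To see the selector rules above in action, here is a minimal sketch that runs each kind of selector against a small hand-written page. The markup, the id menu and the class name SelectorBasics are invented for illustration, and jsoup has to be on the classpath.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class SelectorBasics {
    // Made-up page: one link nested inside the list, one link outside it.
    static final String HTML =
            "<ul id=\"menu\" class=\"detail-main-list other\">"
          + "<li><span><a href=\"/a.html\">inside</a></span></li>"
          + "</ul>"
          + "<p><a href=\"/b.html\">outside</a></p>";

    // Counts the elements in the sample page that match the selector.
    static int count(String selector) {
        Document doc = Jsoup.parse(HTML);
        return doc.select(selector).size();
    }

    public static void main(String[] args) {
        System.out.println(count("a"));                      // by tag name
        System.out.println(count(".detail-main-list"));      // by class
        System.out.println(count("#menu"));                  // by id
        System.out.println(count("ul.detail-main-list a"));  // descendant at any depth
        System.out.println(count("ul > a"));                 // direct child only
    }
}
```

The last two lines show the difference between the descendant combinator (a space) and the child combinator (>): the link sits inside <li> and <span>, so it is a descendant of the <ul> but not a direct child.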

You can extract the href of each link like this; the same approach works for HTML from any website.
public static void main(String[] args) {
    String htmlString = "<html>\n" +
            "  <head></head>\n" +
            "  <body>\n" +
            "<ul class=\"detail-main-list\">\n" +
            "  <li>\n" +
            "    <a href=\"/manga/toki_wa/v01/c001/1.html\">Dis Be the link</a>\n" +
            "  </li>\n" +
            "</ul>" +
            "  </body>\n" +
            "</html>";
    Document html = Jsoup.parse(htmlString);
    // Select every anchor tag and print its href attribute.
    Elements elements = html.select("a");
    for (Element element : elements) {
        System.out.println(element.attr("href"));
    }
}
Output:
/manga/toki_wa/v01/c001/1.html

Replace some HTML attributes with Jsoup without changing rest of input

I work with incoming html text blocks, like this:
String html = "<p>Some text here with already existing tags and it's escaped symbols.\n" +
" More text here:<br/>\\r\\n---<br/>\\r\\n" +
" <img src=\"/attachments/a0d4789a-1575-4b70-b57f-9e8fe21df46b\" sha256=\"2957635fcf46eb54d99f4f335794bd75a89d2ebc1663f5d1708a2fc662ee065c\"></a>" +
" It was img tag with attr to replace above</p>\\r\\n\\r\\n<p>More text here\n" +
" and here.<br/>\\r\\n---</p>";
I need to replace src attribute value in img tags with slightly modified sha256 attribute value in the same tag. I can do it easily with Jsoup like this:
Document doc = Jsoup.parse(html);
Elements elementsByAttribute = doc.select("img[src]");
elementsByAttribute.forEach(x -> x.attr("src", "/usr/myfolder/" + x.attr("sha256") + ".zip"));
But there is a problem. Incoming text already has some formatting, html tags, escaping etc that need to be preserved. But Jsoup removes tags / adds tags / unescapes / escapes and does some other modifications to the original input.
For example, System.out.println(doc); or System.out.println(doc.html()); gives me following:
<html>
<head></head>
<body>
<p>Some text here with already existing tags and it's escaped symbols. More text here:<br>\r\n---<br>\r\n <img src="/usr/myfolder/2957635fcf46eb54d99f4f335794bd75a89d2ebc1663f5d1708a2fc662ee065c.zip" sha256="2957635fcf46eb54d99f4f335794bd75a89d2ebc1663f5d1708a2fc662ee065c"> It was img tag with attr to replace above</p>\r\n\r\n
<p>More text here and here.<br>\r\n---</p>
</body>
</html>
My src attribute is replaced, but extra HTML tags are added and the escaped apostrophe in it's is unescaped.
If I use System.out.println(doc.text()); I get the following:
Some text here with already existing tags and it's escaped symbols. More text here: \r\n--- \r\n It was img tag with attr to replace above\r\n\r\n More text here and here. \r\n---
My tags are removed here, and the escaped apostrophe in it's is unescaped again.
I tried some other Jsoup features but didn't find how to solve this problem.
Question: is there any way to replace only some attributes with Jsoup without changing other tags? Maybe there is some other library for that purpose? Or is regex my only option?

I encountered the same problem, and in my case this code doesn't change the original formatting.
Try this:
public static void m(){
String html = "<p>Some text here with already existing tags and it's escaped symbols.\n" +
" More text here:<br/>\r\n---<br/>\r\n" +
" <img src=\"/attachments/a0d4789a-1575-4b70-b57f-9e8fe21df46b\" sha256=\"2957635fcf46eb54d99f4f335794bd75a89d2ebc1663f5d1708a2fc662ee065c\"></a>" +
" It was img tag with attr to replace above</p>\r\n\r\n<p>More text here\n" +
" and here.<br/>\r\n---</p>";
Document doc = Jsoup.parseBodyFragment(html);
doc.outputSettings().prettyPrint(false);
Elements elementsByAttribute = doc.select("img[src]");
elementsByAttribute.forEach(x -> x.attr("src", "/usr/myfolder/" + x.attr("sha256") + ".zip"));
String result = doc.body().html();
System.out.println(result);
}
It prints the following to the console (your example has a dangling </a> after the <img>, so the library removes it):
<p>Some text here with already existing tags and it's escaped symbols.
More text here:<br>
---<br>
<img src="/usr/myfolder/2957635fcf46eb54d99f4f335794bd75a89d2ebc1663f5d1708a2fc662ee065c.zip" sha256="2957635fcf46eb54d99f4f335794bd75a89d2ebc1663f5d1708a2fc662ee065c"> It was img tag with attr to replace above</p>
<p>More text here
and here.<br>
---</p>
And here is my case in Kotlin, which rewrites the src attribute of <img> tags and removes <script> tags (text is an input String? variable from the outer scope):
val content: String get(){
var c = text ?: ""
//val document = Jsoup.parse(c)
val document = Jsoup.parseBodyFragment(c)
document.outputSettings().prettyPrint(false)
val elementsByAttr = document.select("img[src]")
elementsByAttr.forEach {
val srcContent = it.attr("src")
val (type,value) = srcContent.let {
val eqIdx = it.indexOf('=')
it.substring(0, max(0,eqIdx)) to it.substring(eqIdx+1)
}
if (type=="path"){
it.attr("src", ArticleRepo.imgPathPrefix+value)
}
}
document.select("script").remove()
c = document.body().html()
return c
}

Can I include white space between all html text() elements in Jsoup

I want to use Jsoup to extract all text from an HTML page and return a single string of all the text without any HTML. The code I am using half works, but has the effect of joining elements which affects my keyword searches against the string.
This is the Java code I am using:
String resultText = scrapePage(htmldoc);
private String scrapePage(Document doc) {
Element allHTML = doc.select("html").first();
return allHTML.text();
}
Run against the following HTML:
<html>
<body>
<h1>Title</h1>
<p>here is para1</p>
<p>here is para2</p>
</body>
</html>
Outputting resultText gives "Titlehere is para1here is para2" meaning I can't search for the word "para1" as the only word is "para1here".
I don't want to split the document into more elements than necessary (for example, getting the text of every h1 and p element separately), as there is such a wide range of tags I could be matching.
For example, data1data2 would come from:
<td>data1</td><td>data2</td>
Is there a way I can get all the text from the page but also include a space between the tags? I don't want to preserve whitespace otherwise; there is no need to keep line breaks etc., as I am just preparing a keyword string. I will probably collapse all other whitespace to a single space for this reason.

I don't have this issue using JSoup 1.7.3.
Here's the full code I used for testing:
final String html = "<html>\n"
+ " <body>\n"
+ " <h1>Title</h1>\n"
+ " <p>here is para1</p>\n"
+ " <p>here is para2</p>\n"
+ " </body>\n"
+ "</html>";
Document doc = Jsoup.parse(html);
Element element = doc.select("html").first();
System.out.println(element.text());
And the output:
Title here is para1 here is para2
Can you run my code? Also update to a newer version of jsoup if you don't have 1.7.3 yet.

The previous answer is not right: it only works thanks to the "\n" line endings added to each line, and in reality you may not have a line break at the end of each HTML line.

void example2text() throws IOException {
String url = "http://www.example.com/";
String out = new Scanner(new URL(url).openStream(), "UTF-8").useDelimiter("\\A").next();
org.jsoup.nodes.Document doc = Jsoup.parse(out);
String text = "";
Elements tags = doc.select("*");
for (Element tag : tags) {
for (TextNode tn : tag.textNodes()) {
String tagText = tn.text().trim();
if (tagText.length() > 0) {
text += tagText + " ";
}
}
}
System.out.println(text);
}
Based on this answer: https://stackoverflow.com/a/35798214/4642669
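A variation on the same idea, offered as a sketch rather than a drop-in fix: walking the tree with a NodeVisitor visits the text nodes in document order, so text that is interleaved with child elements keeps its position. The class name SpacedText and the sample strings are made up, and jsoup has to be on the classpath.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;
import org.jsoup.select.NodeVisitor;

class SpacedText {
    // Joins every non-blank text node with a single space, in document order.
    static String spacedText(String html) {
        Document doc = Jsoup.parse(html);
        final StringBuilder out = new StringBuilder();
        doc.traverse(new NodeVisitor() {
            @Override
            public void head(Node node, int depth) {
                if (node instanceof TextNode) {
                    String text = ((TextNode) node).text().trim();
                    if (!text.isEmpty()) {
                        if (out.length() > 0) {
                            out.append(' ');
                        }
                        out.append(text);
                    }
                }
            }

            @Override
            public void tail(Node node, int depth) {
                // Nothing to do when leaving a node.
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(spacedText("<h1>Title</h1><p>here is para1</p><p>here is para2</p>"));
        System.out.println(spacedText("<table><tr><td>data1</td><td>data2</td></tr></table>"));
    }
}
```

Note that this puts a space at every tag boundary, including inline tags, so a word split by markup such as para<b>1</b> comes out as two tokens. For a keyword string that is usually acceptable, but it is worth knowing.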

What regular expression needs to be used to extract a particular value from an HTML tag?

What regular expression can be used to extract the value of src attribute in the iframe tag?

If you really are using Java (not JavaScript) and you only have the iframe, you can try the regular expression:
(?<=src=")[^"]*(?<!")
e.g.:
private static final Pattern REGEX_PATTERN =
Pattern.compile("(?<=src=\")[^\"]*(?<!\")");
public static void main(String[] args) {
String input = "<iframe name=\"I1\" id=\"I1\" marginwidth=\"1\" marginheight=\"1\" height=\"430px\" width=\"100%\" border=\"0\" frameborder=\"0\" scrolling=\"no\" src=\"report.htm?view=country=us\">";
System.out.println(
REGEX_PATTERN.matcher(input).matches()
); // prints "false"
Matcher matcher = REGEX_PATTERN.matcher(input);
while (matcher.find()) {
System.out.println(matcher.group());
}
}
Output:
report.htm?view=country=us

I would look into DOM parsing; from there it is very similar to the JavaScript answer.
A DOM parser turns the HTML into a document, and from there you can do:
iframe = document.getElementById("I1");
src = iframe.getAttribute("src");
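With jsoup as the DOM parser, that could look roughly like the following. This is a sketch that assumes jsoup is on the classpath; the class name IframeSrc is made up, and the markup is a shortened version of the iframe from the other answers.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class IframeSrc {
    // Returns the src of the element with the given id, or null if there is no such element.
    static String srcOf(String html, String id) {
        Document doc = Jsoup.parse(html);
        Element iframe = doc.getElementById(id);
        return iframe == null ? null : iframe.attr("src");
    }

    public static void main(String[] args) {
        String html = "<iframe name=\"I1\" id=\"I1\" height=\"430px\" width=\"100%\""
                + " scrolling=\"no\" src=\"report.htm?view=country=us\"></iframe>";
        System.out.println(srcOf(html, "I1"));
    }
}
```

Unlike the regex and indexOf approaches, this does not care about attribute order or about single versus double quotes.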

Regex is a little bit costlier, so don't use it when a simpler solution is available. In Java, try this:
String src="<iframe name='I1' id='I1' marginwidth='1' marginheight='1'" +
" height='430px' width='100%' border='0' frameborder='0' scrolling='no'" +
" src='report.htm?view=country=us'>";
int position1 = src.indexOf("src") + 5;
System.out.println(position1);
int position2 = src.indexOf("\'", position1);
System.out.println(position2);
System.out.println(src.substring(position1, position2));
Output:
134
160
report.htm?view=country=us

In case you meant javascript instead of java:
var iframe = document.getElementById("I1");
var src = iframe.getAttribute("src");
alert(src); //outputs the value of the src attribute

src="(.*?)"
The regular expression will match src="report.htm?view=country=us", but you will find only the part between the " in the first (and only) submatch.
When you only want to match src-attributes when they are in an iframe, do this:
<iframe.*?src="(.*?)".*?>
but there are certain corner-cases where this could fail due to the inherently non-regular nature of HTML. See the top answer to RegEx match open tags except XHTML self-contained tags for an amusing rant about this problem.
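In Java, the first pattern can be applied with Matcher.find() and group(1). Here is a minimal sketch using only java.util.regex. The class name SrcRegex is made up, and the code assumes double-quoted attribute values.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SrcRegex {
    // Group 1 captures whatever sits between the quotes after src=.
    private static final Pattern SRC = Pattern.compile("src=\"(.*?)\"");

    // Returns the first src value, or null when the input has no src="..." attribute.
    static String firstSrc(String html) {
        Matcher m = SRC.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String input = "<iframe name=\"I1\" id=\"I1\" src=\"report.htm?view=country=us\">";
        System.out.println(firstSrc(input));
    }
}
```

Be aware that this pattern also matches inside attribute names that end in src, such as data-src, which is one of the corner cases where a real parser is safer.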

Why Jsoup cannot select td element?

I made a little test (with Jsoup 1.6.1):
String s = "" +Jsoup.parse("<td></td>").select("td").size();
System.out.println("Selected elements count : " + s);
It outputs:
Selected elements count : 0
But it should return 1, because I parsed HTML containing a td element. Is something wrong with my code, or is there a bug in Jsoup?

Because Jsoup is an HTML5-compliant parser and you fed it invalid HTML. A <td> has to go inside at least a <table>.
int size = Jsoup.parse("<table><td></td></table>").select("td").size();
System.out.println("Selected elements count : " + size);
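To see what the parser does with the bare and the wrapped fragment, the sketch below counts the <td> elements in each case and then checks that the parser has supplied the missing <tbody> and <tr> wrappers. It assumes jsoup is on the classpath; the class name TdParsing is made up.

```java
import org.jsoup.Jsoup;

class TdParsing {
    // Parses the HTML with the default HTML parser and counts the matches.
    static int count(String html, String selector) {
        return Jsoup.parse(html).select(selector).size();
    }

    public static void main(String[] args) {
        // A bare <td> outside any table is dropped by the HTML parser.
        System.out.println(count("<td></td>", "td"));
        // Inside a <table> it is kept.
        System.out.println(count("<table><td></td></table>", "td"));
        // The parser also inserts the implied <tbody> and <tr> around it.
        System.out.println(count("<table><td></td></table>", "table > tbody > tr > td"));
    }
}
```

This means selectors written against the parsed document should expect the <tbody> and <tr> levels even when the source markup omits them.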

String url = "http://foobar.com";
Document doc = Jsoup.connect(url).get();
Elements td = doc.select("td");

Jsoup 1.6.2 allows parsing with a different parser, and a simple XML parser is provided. With the following code I could solve my problem. You can later parse your fragment with the HTML parser to get valid HTML.
// Jsoup 1.6.2
String s = "" + Jsoup.parse("<td></td>", "", Parser.xmlParser()).select("td").size();
System.out.println("Selected elements count : " + s);


