How to extract specific text in html code with JSoup - java

I have a website that I want to extract some data from. I want to extract the 8a on the second line (the a element) with JSoup. I cannot use a regex because the value is not always 8a; sometimes it is just 2 or 7c+, and these same values can also appear in the text between the a tags. Any ideas?
<div class="vsr">
L'Américain (intégral) 8a
<span class="ag">7c+</span>
<em>Tony Fouchereau</em>
<span class="btype">traversée d-g, surplomb, départ assis</span>
<span class="glyphicon glyphicon-camera" aria-hidden="true"></span>
<span class="glyphicon glyphicon-film" aria-hidden="true"></span>
</div>

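You can use jsoup CSS selectors to extract specific information. ownText() returns only the text that belongs directly to the selected div, without the text of its child elements.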
https://jsoup.org/cookbook/extracting-data/selector-syntax
@Test
public void extract8a() {
Document doc = Jsoup.parse("<div class=\"vsr\"> \n" +
" L'Américain (intégral) 8a \n" +
" <span class=\"ag\">7c+</span> \n" +
" <em>Tony Fouchereau</em> \n" +
" <span class=\"btype\">traversée d-g, surplomb, départ assis</span> \n" +
" <span class=\"glyphicon glyphicon-camera\" aria-hidden=\"true\"></span> \n" +
" <span class=\"glyphicon glyphicon-film\" aria-hidden=\"true\"></span> \n" +
"</div>");
System.out.println(doc.select("div.vsr").first().ownText());
}
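Note that ownText() here returns the whole first line ("L'Américain (intégral) 8a"), not only the grade. A minimal sketch of one way to isolate it, assuming the grade is always the last whitespace-separated token of the div's own text. That assumption, the class name GradeExtractor and the method lastToken are illustrative and not part of the original answer:

```java
final class GradeExtractor {

    // Returns the last whitespace-separated token of the text, or "" if there is none.
    // Assumption: the grade ("8a", "2", "7c+") always comes last in the div's own text.
    static String lastToken(String ownText) {
        String trimmed = ownText.trim();
        if (trimmed.isEmpty()) {
            return "";
        }
        String[] tokens = trimmed.split("\\s+");
        return tokens[tokens.length - 1];
    }

    public static void main(String[] args) {
        // Stand-in for doc.select("div.vsr").first().ownText()
        // (accented letters are written as Unicode escapes).
        String ownText = " L'Am\u00e9ricain (int\u00e9gral) 8a ";
        System.out.println(lastToken(ownText)); // prints 8a
    }
}
```

Because only the last token is taken, a grade-like value that appears earlier in the name does not interfere. If the grade can be missing entirely, this sketch would return the last word of the name instead, so an extra check would be needed.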

Related

JSOUP - Select only some text from html

I am trying to select some text from the HTML using Jsoup in Android.
My HTML code looks like this:
<tr class="tip " data-original-title="">
<td>
!!! NOT That !!! </td>
<td>
A205 </td>
<td>
I want to get this </td>
<td>
And this </td>
<td>
!!! And not this !!! </td>
<td>
</td>
</tr>
How can I do that? Thank you so much!
For example:
package ru.java.study;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
public class Main {
private static String htmlText =
"<tr class=\"tip \" data-original-title=\"\">" +
"<td>!!! NOT That !!!</td>" +
" <td>" +
" A205 </td>" +
" <td>" +
" I want to get this </td>" +
" <td>" +
" And this </td>" +
" <td>" +
" !!! And not this !!! </td>" +
" <td>" +
" </td>" +
" </tr>";
public static void main(String[] args) {
Document document = Jsoup.parse("<table>"+htmlText); //Add <table>
String first_TD = document.select("td").get(2).text();
String second_TD = document.select("td").get(3).text();
System.out.println(first_TD);
System.out.println(second_TD);
}
}
You must be more specific in your selection. There should be id="..." or class="..." attributes in <table> tag to precisely identify the table that you need.
// Don't forget about <table> tag
String html = "<table>" +
"<tr class=\"tip \" data-original-title=\"\">" +
"<td>!!! NOT That !!!</td>" +
"<td>A205</td>" +
"<td>I want to get this</td>" +
"<td>And this</td>" +
"<td>!!! And not this !!!</td>" +
"<td></td>" +
"</tr>" +
"</table>";
Document doc = Jsoup.parseBodyFragment(html);
// You should use more specific selector.
// For example if table tag looks like this: <table id="myID">...</table>
// then selector should look like this "table#myID tr.tip > td"
Elements cells = doc.select("tr.tip > td");
String cell3content = cells.get(2).html(); // use .text() for content without html tags
String cell4content = cells.get(3).html();
System.out.println(cell3content);
System.out.println(cell4content);

how to retrieve data from strong tags in html file using jsoup?

I have some html data like
<div class="bs-example">
<div class="panel panel-primary">
<div class="panel-heading">
<h3 class="panel-title">ABC</h3>
</div>
<div class="panel-body">
<div class="slimScroller" style="height:280px; position: relative;" data-rail-visible="1" data-always-visible="1">
<strong>Name:</strong>
<br />
<strong>ID No:</strong> XXXXX<br />
<strong>Status:</strong> ACTIVE<br />
<strong>Class:</strong> 5<br />
<strong>Category:</strong> A<br />
<strong>Marks:</strong> 500<br />
</div>
</div>
</div>
</div>
I want output as (multiple students data):
Name: ABC
ID No.: XXXXX
Status: Active
Class: 5
Category: A
Marks: 500
How to get this data using jsoup or any other way? Please help.
You can use Element.nextElementSibling() and/or Element.nextSibling() to get the output you need.
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
public class Exam {
public static void main(String[] args) {
String html = "<div class=\"bs-example\">" +
" <div class=\"panel panel-primary\">" +
" <div class=\"panel-heading\">" +
" <h3 class=\"panel-title\">ABC</h3>" +
" </div>" +
" <div class=\"panel-body\">" +
" <div class=\"slimScroller\" style=\"height:280px; position: relative;\" data-rail-visible=\"1\" data-always-visible=\"1\">" +
" <strong>Name:</strong>" +
" <br />" +
" <strong>ID No:</strong> XXXXX<br />" +
" <strong>Status:</strong> ACTIVE<br />" +
" <strong>Class:</strong> 5<br />" +
" <strong>Category:</strong> A<br />" +
" <strong>Marks:</strong> 500<br />" +
" </div>" +
" </div>" +
" </div>" +
"</div>";
Document doc = Jsoup.parse(html);
Elements eles = doc.select("div.slimScroller strong");
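for (Element e : eles) {
// nextElementSibling() is null when the label is the last element, so guard before use
Element next = e.nextElementSibling();
boolean isLink = next != null && next.tagName().equals("a");
// Print the link target for <a> siblings, otherwise the node that follows the label
System.out.println(e.text() +
(isLink ? next.attr("href").replace("https://", "") : e.nextSibling().toString()));
}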
}
}
The following code should produce the output you specified, based on your comment describing your a tags:
private static void printStudentInfo(Document document){
Elements students = document.select("div.slimScroller strong");
for(Element student : students){
// nextElementSibling() is null when the label is the last element
Element next = student.nextElementSibling();
System.out.print(student.text());
System.out.println(next != null && next.tagName().equals("a") ?
next.text() : student.nextSibling().toString());
}
}
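Neither snippet above fills in the empty Name field or handles several students. A minimal sketch of one way to do both, assuming each student sits in its own div.panel and that the student's name is the panel title (as the desired output "Name: ABC" suggests). The class name StudentPanels, the method describe and the shortened HTML are illustrative only:

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

final class StudentPanels {

    // Builds one "Label: value" line per <strong> label in every panel.
    static List<String> describe(String html) {
        Document doc = Jsoup.parse(html);
        List<String> lines = new ArrayList<>();
        for (Element panel : doc.select("div.panel")) {
            // Assumption: the student's name is the panel title.
            String name = panel.select("h3.panel-title").text();
            for (Element label : panel.select("div.slimScroller strong")) {
                // The value, when present, is the text node right after the label.
                Node next = label.nextSibling();
                String value = (next instanceof TextNode)
                        ? ((TextNode) next).text().trim()
                        : "";
                if (value.isEmpty() && label.text().startsWith("Name")) {
                    value = name; // fall back to the panel title for the empty Name field
                }
                lines.add((label.text() + " " + value).trim());
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        String html = "<div class=\"panel panel-primary\">"
                + "<div class=\"panel-heading\"><h3 class=\"panel-title\">ABC</h3></div>"
                + "<div class=\"panel-body\"><div class=\"slimScroller\">"
                + "<strong>Name:</strong><br />"
                + "<strong>ID No:</strong> XXXXX<br />"
                + "<strong>Status:</strong> ACTIVE<br />"
                + "</div></div></div>";
        describe(html).forEach(System.out::println);
    }
}
```

Checking the sibling with instanceof TextNode avoids printing a stray `<br>` tag when a label has no value after it.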

Jsoup how to parse text inside span class="hps"

<span id="result_box" class="short_text" lang="es">
<span class="hps">
hello
</span>
<span class="hps">
world
</span>
</span>
I want to get the string "hello world" using Jsoup, but I have no idea how to do this.
Use Jsoup.parse to get the HTML Document, then select the elements you want with a CSS selector such as span.hps (http://jsoup.org/apidocs/org/jsoup/select/Selector.html).
Document doc = Jsoup.parse("<span id=\"result_box\" class=\"short_text\" lang=\"es\">\n" +
" <span class=\"hps\">\n" +
" hello\n" +
" </span>\n" +
" <span class=\"hps\">\n" +
" world\n" +
" </span>\n" +
"</span>");
System.out.println(doc.html());
Elements els = doc.select("span.hps");
for(Element e:els){
System.out.print(e.text() + " "); // add a space so the words are not glued together
}
If you don't need each element's text separately, you can replace the for loop with a single call, which joins the texts with a space:
System.out.println(els.text());

JSOUP extract data from <select name=...>

Here is some HTML. I want to print "Color:" and the color options that are available, and I want to find them through the select element's name attribute (select name="att1").
<div class="box-body">
<div id="attributeInputs" class="attribute-inputs" data-defcolor="Palm">
<div class="row thinpad-top att1row">
<div class="small-24 columns">
<label for="att1_BA0FEDC6-8BF1-11E4-B816-87E377679EE2">Color:</label>
</div>
<div class="small-24 columns">
<select name="att1" id="att1_BA0FEDC6-8BF1-11E4-B816-87E377679EE2">
<option value="">Please Select Color</option>
<option value="Black">Black</option>
<option value="Palm">Palm</option>
</select>
</div>
I've tried many jsoup selectors, but I'm not able to get the required output.
I want output something like this:
Please Select Color:
Black
Palm
Please help.
This code extracts the text of each option element inside the select element:
String html="<div class=\"box-body\">\n" +
"\n" +
" <div id=\"attributeInputs\" class=\"attribute-inputs\" data-defcolor=\"Palm\">\n" +
"\n" +
" <div class=\"row thinpad-top att1row\">\n" +
" <div class=\"small-24 columns\">\n" +
" <label for=\"att1_BA0FEDC6-8BF1-11E4-B816-87E377679EE2\">Color:</label>\n" +
" </div>\n" +
" <div class=\"small-24 columns\">\n" +
" <select name=\"att1\" id=\"att1_BA0FEDC6-8BF1-11E4-B816-87E377679EE2\">\n" +
" <option value=\"\">Please Select Color</option>\n" +
" <option value=\"Black\">Black</option>\n" +
" <option value=\"Palm\">Palm</option>\n" +
" </select>\n" +
" </div>";
Document doc = Jsoup.parse(html);
Elements options = doc.select("select option");
for (Element option : options) {
String optionText = option.text();
System.out.println(optionText);
}
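The selector above matches every select on the page, while the question asks to go through the name attribute. A minimal sketch using jsoup's attribute selector, assuming the label is tied to the select through its for/id pair as in the posted HTML. The class name SelectByName and both helper methods are illustrative only:

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

final class SelectByName {

    // Text of every <option> directly inside the <select> with the given name attribute.
    static List<String> optionTexts(Document doc, String selectName) {
        List<String> texts = new ArrayList<>();
        for (Element option : doc.select("select[name=" + selectName + "] > option")) {
            texts.add(option.text());
        }
        return texts;
    }

    // Text of the <label> whose "for" attribute points at that <select>'s id, or "" if none.
    static String labelText(Document doc, String selectName) {
        Element select = doc.select("select[name=" + selectName + "]").first();
        if (select == null || select.id().isEmpty()) {
            return "";
        }
        return doc.select("label[for=" + select.id() + "]").text();
    }

    public static void main(String[] args) {
        String html = "<label for=\"att1_BA0FEDC6-8BF1-11E4-B816-87E377679EE2\">Color:</label>"
                + "<select name=\"att1\" id=\"att1_BA0FEDC6-8BF1-11E4-B816-87E377679EE2\">"
                + "<option value=\"\">Please Select Color</option>"
                + "<option value=\"Black\">Black</option>"
                + "<option value=\"Palm\">Palm</option>"
                + "</select>";
        Document doc = Jsoup.parse(html);
        System.out.println(labelText(doc, "att1"));
        optionTexts(doc, "att1").forEach(System.out::println);
    }
}
```

To skip the placeholder entry, the option selector could be narrowed further, for example by filtering out options whose value attribute is empty.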

Copy to clipboard script with breaks and more text

I am trying to make a fun script for work documentation. Here is what I have so far.
<script type="text/javascript">
function ClipBoard() {
window.clipboardData.setData('text',
document.getElementById('name').value +
document.getElementById('phone').value +
document.getElementById('serial').value +
document.getElementById('new').value +
document.getElementById('cuts').value +
document.getElementById('agts').value
);
}
</script>
<form id="form1">
Name: <input id="name" /><br />
Phone Number: <input id="phone" maxlength="10" /><br />
Serial Number: <input id="serial" maxlength="10" /><br />
New/Existing: <input id="new" /><br />
CU TS: <input id="cuts" /><br />
Agent TS: <input id="agts" /><br />
<input type="button" onclick="ClipBoard()" value="Copy"/>
<input type="reset" />
Right now, the pasted text has no line breaks; everything is copied onto a single line. Ex: namephoneserialnew etc.
I would like:
Name
Phone
Serial
New
Etc. with breaks.
If at all possible.
Also, when copying the inputs, is there a way to copy the label text that comes before each input?
Ex: Name: (with input), Phone Number: (with input) etc.
Any suggestions will be very helpful; this is just a basic script nothing serious. Thanks everybody!
Try adding '\r\n' after each control's value.
function ClipBoard() {
window.clipboardData.setData('text',
document.getElementById('name').value + '\r\n' +
document.getElementById('phone').value + '\r\n' +
document.getElementById('serial').value + '\r\n' +
document.getElementById('new').value + '\r\n' +
document.getElementById('cuts').value + '\r\n' +
document.getElementById('agts').value
);
}
If '\r\n' does not produce line breaks where you paste, you may try adding "<br>" instead. Note that a plain-text target will show the "<br>" literally.
function ClipBoard() {
window.clipboardData.setData('text',
document.getElementById('name').value + "<br>" +
document.getElementById('phone').value + "<br>" +
document.getElementById('serial').value + "<br>" +
document.getElementById('new').value + "<br>" +
document.getElementById('cuts').value + "<br>" +
document.getElementById('agts').value
);
}
And if you want labels before the values, you can just put a string before each of them.
function ClipBoard() {
window.clipboardData.setData('text',
"Name: " + document.getElementById('name').value + "<br>" +
"Phone: " + document.getElementById('phone').value + "<br>" +
"Serial: " + document.getElementById('serial').value + "<br>" +
"New: " + document.getElementById('new').value + "<br>" +
"Cuts: " + document.getElementById('cuts').value + "<br>" +
"Agts: " + document.getElementById('agts').value
);
}
BTW, you forgot the closing </form> tag in the HTML.
