How to get information using jsoup

I have this HTML:
<td rowspan="3" class="event start"> INFO1<br>DETAIL1 DETAIL2<br>INFO2 </td>
How can I get INFO1 and INFO2 using jsoup?

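Try something like this (it assumes the text sits inside an <a> element within the cell, as in the test page below):
// imports: org.jsoup.Jsoup, org.jsoup.nodes.Document, org.jsoup.select.Elements
Document doc = Jsoup.connect(url).get();
Elements td = doc.select("td.event.start");
Elements a = td.first().getElementsByTag("a");
// text() joins the pieces around each <br> with a space: INFO1 DETAIL1 DETAIL2 INFO2
String[] words = a.text().split(" ");
System.out.println(words[0] + " " + words[3]);
Tested on: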
<!doctype html>
<html>
<body>
<table>
<tr>
<td rowspan="3" class="event start">
<a href="/search/1065650;1/note">
INFO1<br>DETAIL1 DETAIL2<br>INFO2
</a>
</td>
</tr>
</table>
</body>
</html>
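If the cell has no <a> wrapper, as in the HTML from the question, another option is to read the cell's own text nodes, which jsoup splits at each <br>. This is only a minimal sketch: it assumes the cell sits inside a proper table, and the class name TdTextNodes and the helper firstAndLast are made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.TextNode;

import java.util.List;

public class TdTextNodes {
    // Returns the first and the last piece of text that is a direct child of the cell.
    static String[] firstAndLast(String html) {
        Document doc = Jsoup.parse(html);
        Element td = doc.select("td.event.start").first();
        // textNodes() returns only the text children, so the <br> elements act as separators
        List<TextNode> parts = td.textNodes();
        String first = parts.get(0).text().trim();
        String last = parts.get(parts.size() - 1).text().trim();
        return new String[] { first, last };
    }

    public static void main(String[] args) {
        String html = "<table><tr><td rowspan=\"3\" class=\"event start\">"
                + " INFO1<br>DETAIL1 DETAIL2<br>INFO2 </td></tr></table>";
        String[] result = firstAndLast(html);
        System.out.println(result[0] + " " + result[1]);
    }
}
```

For the cell from the question this prints INFO1 INFO2, without depending on how many words the middle line contains.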

Related

Get data from table(html) except div tag by jsoup

I have this HTML:
<table width="100%" cellpadding="5" cellspacing="2" class="zebra">
<tr>
<td colspan="5">
<div class="paginator">
2
</div>
</td>
</tr>
<tr>
<td>some_value</td>
</tr>
<tr>
<td>some_value</td>
</tr>
<tr>
<td colspan="2">
<div class="paginator">
2
</div>
</td>
</tr>
</table>
I use jsoup. How can I get all links except the links inside the div tags?
I tried something like this, but it doesn't work: the result still contains all the links.
org.jsoup.select.Elements tableText = doc.select("table.zebra").not("tr td div.paginator");
for (org.jsoup.nodes.Element td : tableText.select("td a")) {
System.out.println(td.attr("href")); // http://some_link
....
}

You can use the code below:
// imports: org.jsoup.Jsoup, org.jsoup.nodes.Document, org.jsoup.nodes.Element
Document html = Jsoup.parse(htmlStr);
for (Element e : html.getElementsByTag("a")) {
// skip anchors whose direct parent is a <div>
if (!"div".equalsIgnoreCase(e.parentNode().nodeName())) {
System.out.println(e.attr("href"));
}
}
Here I check whether the parent node of each anchor element is a div. If it is not, I print the URL.
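Note that the parent check only skips anchors whose direct parent is a div. As for the original attempt, Elements.not() filters the matched set itself, which here contains only the table, so it removes nothing. Below is a sketch that skips any link with a div.paginator ancestor at any depth. The sample HTML, the href values and the class name LinksOutsidePaginator are assumptions made for illustration, since the question's markup does not show the links.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

public class LinksOutsidePaginator {
    // Collects the href of every link in table.zebra that is not nested in div.paginator.
    static List<String> hrefsOutsidePaginator(String html) {
        Document doc = Jsoup.parse(html);
        List<String> hrefs = new ArrayList<>();
        for (Element a : doc.select("table.zebra a[href]")) {
            // parents() lists every ancestor; is() is true if any of them matches the query
            if (!a.parents().is("div.paginator")) {
                hrefs.add(a.attr("href"));
            }
        }
        return hrefs;
    }

    public static void main(String[] args) {
        // Hypothetical markup: one link inside the paginator (nested in a span), one outside
        String html = "<table class=\"zebra\">"
                + "<tr><td><div class=\"paginator\"><span><a href=\"/page/2\">2</a></span></div></td></tr>"
                + "<tr><td><a href=\"/item/1\">some_value</a></td></tr>"
                + "</table>";
        System.out.println(hrefsOutsidePaginator(html));
    }
}
```

With the sample markup only /item/1 is kept, even though the paginator link's direct parent is a span rather than the div.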

Character encoding not working for Japanese, Chinese and Korean

I have Unicode strings for all the European countries and for a few Asian countries: Japan, China and Korea. The strings work fine for the European countries, but not for Japan, China or Korea.
Example for Japan:
dear_name=\u30c7\u30a3\u30fc\u30e9\u30fc
Example for China:
dear_name=\u4eb2\u7231\u7684
Example for Korea:
dear_name=\uce5c\uc560\ud558\ub294
Example for Sweden (this one is working fine):
dear_name=Till
Default character encoding is UTF-8.
Template template = VelocityFactory.getTemplate("test.vm", "UTF-8");
String messageText = VelocityFactory.merge(context, template, charset);
While debugging the merge method I found that the merged result is already garbled at this point for Chinese, Japanese and Korean.
public static String merge(VelocityContext context, Template template, String charset) throws Exception {
String newResult = null;
ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
OutputStreamWriter streamWriter;
if(charset != null && charset.length() > 0) {
streamWriter = new OutputStreamWriter(outputStream, charset);
} else {
streamWriter = new OutputStreamWriter(outputStream);
}
template.merge(context, streamWriter);
streamWriter.close();
newResult = outputStream.toString();
outputStream.close();
return newResult;
}
Below is the mail template. Only the header is displayed correctly for Japanese, Chinese and Korean; the body is not:
<html>
<head>
<meta http-equiv="Content-Type" content="$contentType">
</head>
<body>
<div id="content">
<table border="0" cellpadding="0" cellspacing="0" style="margin-left: 0px;">
<tr>
<td>
<table border="0" cellpadding="0" cellspacing="0" class="textBody" style="margin-bottom: 120px;">
<tr>
<td valign="bottom" class="mainHeader" nowrap>
$velocityUtils.getMessage("test")
</td>
</tr>
<tr>
<td colspan="2">
<img src="$imageBar" class="clipped">
</td>
</tr>
</table>
<div id="info" class="textBody">$velocityUtils.getMessage("test1")<br><br></div>
</td>
</tr>
</table>
</div>
</body>
</html>
Any idea how to fix this? How do I encode the text correctly?

Try adding this to the top of your JSP:
<%@ page language="java" pageEncoding="UTF-8"%>

You need to specify the character sets for Japanese, Korean and Chinese.
For Japanese try: charset=iso-2022-jp
For Korean try: charset=iso-2022-kr
For Chinese try: charset=big5
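One more thing worth checking in the merge method from the question: the writer encodes with the given charset, but outputStream.toString() with no argument decodes the bytes with the platform default charset. If that default is not UTF-8, non-Latin text can get garbled at exactly that point. Here is a minimal sketch of the round trip, outside Velocity; the class name CharsetRoundTrip and the helper writeAndRead are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;

public class CharsetRoundTrip {
    // Encodes the text to bytes and decodes it again, using the same charset both ways.
    static String writeAndRead(String text, String charset) {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            try (Writer writer = new OutputStreamWriter(out, charset)) {
                writer.write(text);
            }
            // toString(charset) decodes with the named charset instead of the platform default
            return out.toString(charset);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String japanese = "\u30c7\u30a3\u30fc\u30e9\u30fc"; // the dear_name value from the question
        System.out.println(japanese.equals(writeAndRead(japanese, "UTF-8")));
    }
}
```

Since template.merge accepts any Writer, passing a java.io.StringWriter and calling its toString() would avoid the byte round trip altogether.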

angular - catch new line(\n) from java

I am having a problem detecting a new line (\n) that is inserted on my server side (Java servlet) when I display the value in my Angular ng-repeat.
Server side:
StringBuilder name = new StringBuilder();
name.append(getFirst(id)); //id - @param
name.append("\n"); //new line
name.append(getLast(id)); //id - @param
But on my client side, in my ng-repeat:
{{reporter.name}} //jhon doe
instead of:
jhon
doe
If more code is needed I will post it.

You can use the CSS white-space property for this:
white-space: pre|pre-line|pre-wrap;
pre - Whitespace is preserved by the browser. Text will only wrap on line breaks. Acts like the <pre> tag in HTML.
pre-line - Sequences of whitespace will collapse into a single whitespace. Text will wrap when necessary, and on line breaks.
pre-wrap - Whitespace is preserved by the browser. Text will wrap when necessary, and on line breaks.
<!doctype html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>Example - example-example111-production</title>
<script src="//ajax.googleapis.com/ajax/libs/angularjs/1.5.6/angular.min.js"></script>
<script src="//ajax.googleapis.com/ajax/libs/angularjs/1.5.6/angular-sanitize.js"></script>
<style>
table {
border-collapse: collapse;
}
table, td, th {
border: 1px solid black;
}
</style>
</head>
<body ng-app="sanitizeExample">
<script>
angular.module('sanitizeExample', [])
.controller('ExampleController', ['$scope', function($scope) {
$scope.friends = [
{name:'John\nDoe', age:25, gender:'boy'},
{name:'Jessie\nSimpson', age:30, gender:'girl'},
{name:'Johanna\nLN', age:28, gender:'girl'},
{name:'Joy\nLN', age:15, gender:'girl'},
{name:'Mary\nLN', age:28, gender:'girl'},
{name:'Peter\nLN', age:95, gender:'boy'},
{name:'Sebastian\nLN', age:50, gender:'boy'},
{name:'Erika\nLN', age:27, gender:'girl'},
{name:'Patrick\nLN', age:40, gender:'boy'},
{name:'Samantha\nLN', age:60, gender:'girl'}
];
}]);
</script>
<div ng-controller="ExampleController">
<table
class="table-bordered">
<thead>
<tr>
<th>Name</th>
<th>Age</th>
<th>Gender</th>
</tr>
</thead>
<tbody>
<tr data-ng-repeat="friend in friends">
<td style="white-space: pre;">{{friend.name}}</td>
<td>{{friend.age}}</td>
<td>{{friend.gender}}</td>
</tr>
</tbody>
</table>
</div>
</body>
</html>

Replace every \n with <br>, for example:
str = str.replace(/(?:\r\n|\r|\n)/g, '<br />');

Wrap it in a <pre> tag:
<pre>{{reporter.name}}</pre>
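Since the string is built in a Java servlet, the same replacement can also be done on the server. This is a sketch under the assumption that the client then renders the value as HTML (for example with ng-bind-html, because {{ }} interpolation escapes markup and would show the <br /> literally). The class and method names are made up for illustration.

```java
public class NewlineToBr {
    // Escapes the basic HTML metacharacters, then turns every line break into <br />.
    static String toHtml(String text) {
        String escaped = text
                .replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;");
        // \r\n is listed first so a Windows line break yields a single <br />
        return escaped.replaceAll("\\r\\n|\\r|\\n", "<br />");
    }

    public static void main(String[] args) {
        System.out.println(toHtml("jhon\ndoe"));
    }
}
```

Escaping before the replacement keeps any markup in the names themselves from being rendered. If the value only needs to be displayed, the white-space approach above avoids touching the data at all.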

Html parsing text from TD Tag

I have this HTML data:
<table border='0' cellpadding='3' bgcolor="#CCCCCC" class="hostinfo_title2" width='100%' align="center">
<tr align='center' bgcolor="#ffffff">
<td width='26%' class="hostinfo_title3">Archive Url</td>
</tr>
<tr bgcolor="#ffffff">
<td height="25" align="center">http://www.toradio.com/prgramdetails/20130413_vali_mm.mp3</td>
</tr>
</table>
I want to get the mp3 URL (http://www.toradio.com/prgramdetails/20130413_vali_mm.mp3) from the HTML above.
I'm following this link. Is it correct, or is there a better way to parse this text?
Could anyone help?

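Check out jsoup. It's a nice HTML parser for Java.
You should be able to do that with something like this:
// imports: org.jsoup.Jsoup, org.jsoup.nodes.Document, org.jsoup.nodes.Element, org.jsoup.select.Elements
String html = "<YOUR HTML HERE>";
Document doc = Jsoup.parse(html);
Elements tds = doc.select("table.hostinfo_title2").select("td");
String mp3Link = "";
for (Element td : tds) {
// the URL is the cell's text, so look for the cell that mentions "mp3"
if (td.text().contains("mp3")) {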
mp3Link = td.text();
// do something with mp3Link
}
}

Parse the inner most html tags using jSoup

Here is my code.
String tags="<html><head></head><body><table><tr><td>1</td></tr><tr><td><table><tr><td>3</td><td>4</td></tr></table></td></tr></table><body></html>";
Document document = Jsoup.parse(tags);
for(int i=0;i<document.body().childNodes().size();i++)
{
if(!document.body().childNodes().get(i).nodeName().startsWith("#"))
{
System.out.println("1st Level Nodes:"+document.body().childNodes().get(i).nodeName());
while(document.body().childNodes().get(i).childNodes().size()>1)
{
System.out.println("2nd Level: "+document.body().childNodes().get(i).childNodes().get(0).nodeName());
}
}
}
How can I parse the HTML so that it is returned tag by tag? My loop does not reach the innermost tags.
Here is the same HTML, formatted. I want to walk through all the tags, down to the innermost ones.
<html>
<head></head>
<body>
<table>
<tr>
<td>1</td>
</tr>
<tr>
<td>
<table>
<tr>
<td>3</td>
<td>4</td>
</tr>
</table>
</td>
</tr>
</table>
<body>
</html>
I want to get all the tags as a hierarchy, as shown in the HTML above, so I would like to get the tags one after another in parent-child order.

If you only need the tags, you can use this:
String tags = "<html><head></head><body><table><tr><td>1</td></tr><tr><td><table><tr><td>3</td><td>4</td></tr></table></td></tr></table><body></html>";
Document doc = Jsoup.parse(tags);
for (Element e : doc.select("*")) // you can use 'doc.getAllElements()' here too
{
System.out.println(e.tag());
}
Output:
#root
html
head
body
table
tbody
tr
td
tr
td
table
tbody
tr
td
td
