Parse xml with empty valued attribute

I have input in this format:
<table>
<tbody>
<tr bgcolor='#999999'>
<td nowrap width='1%'>
</td>
<td nowrap width='3%' align='center'>
<font style='font-size: 8pt'> System ID </font>
</td>
<td nowrap width='5%' align='center'>
To remove the nowrap attribute, I was previously using this code:
if (deletedString == null)
{
return exportedTable;
}
int tagPos = 0;
String resultTable = exportedTable;
while (resultTable.indexOf(deletedString) != -1)
{
tagPos = resultTable.indexOf(deletedString, tagPos);
String beforTag = resultTable.substring(0, tagPos);
String afterTag = resultTable.substring(tagPos + deletedString.length());
resultTable = beforTag + afterTag;
}
return resultTable;
Here deletedString is nowrap and the input is exportedTable.
But this is causing performance issues. Is there a better way to do it?

My recommendation: StringUtils.remove(source, substring) from Apache Commons Lang removes all instances of the substring from the source string. A benchmark in another answer found this method to be five times faster than a few alternatives.
Alternatively, use a StringBuilder to aggregate your substrings. Every time you concatenate two strings you create a new string, whereas a StringBuilder is mutable and does not need to create a new copy on each update.
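As a rough illustration of the StringBuilder suggestion, the sketch below removes every occurrence of a substring in a single forward pass, appending the kept pieces to one buffer instead of rebuilding the whole string for each match. The class and method names (SubstringRemover, removeAll) are made up for this example and are not part of any library.

```java
final class SubstringRemover {

    /** Returns source with every occurrence of target removed (hypothetical helper). */
    static String removeAll(String source, String target) {
        if (source == null || target == null || target.isEmpty()) {
            return source; // nothing to remove
        }
        StringBuilder result = new StringBuilder(source.length());
        int from = 0;
        int hit;
        // Search forward from the end of the previous match,
        // so the already-processed prefix is never scanned again.
        while ((hit = source.indexOf(target, from)) != -1) {
            result.append(source, from, hit); // keep the text before the match
            from = hit + target.length();     // skip the match itself
        }
        result.append(source, from, source.length()); // keep the tail
        return result.toString();
    }

    public static void main(String[] args) {
        String row = "<td nowrap width='1%'></td><td nowrap width='3%' align='center'>";
        System.out.println(removeAll(row, "nowrap "));
    }
}
```

Note that removing the plain text "nowrap" also removes it wherever it appears outside a tag, which is the same limitation the original loop has.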

You could create an XMLStreamReader and use a while loop that steps through the XML for as long as streamReader.hasNext() returns true.
Format:
//Create stream reader
//Position at beginning of document
//While the stream reader has next (another parsing event is available)
//perform action
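The outline above could look roughly like the following sketch, which uses the JDK's StAX API (javax.xml.stream) to count the elements that carry a given attribute. The class and method names are hypothetical. One caveat: StAX only accepts well-formed XML, so a bare nowrap with no value, as in the question's input, is rejected; the sketch assumes the attribute is written as nowrap="".

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

final class StaxAttributeScan {

    /** Counts the elements that carry the given attribute (hypothetical helper). */
    static int countElementsWithAttribute(String xml, String attributeName) {
        try {
            // Create the stream reader; it starts at the beginning of the document.
            XMLStreamReader reader =
                    XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
            int count = 0;
            try {
                // Loop while another parsing event is available.
                while (reader.hasNext()) {
                    if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                        // Perform the action: inspect this element's attributes.
                        for (int i = 0; i < reader.getAttributeCount(); i++) {
                            if (attributeName.equals(reader.getAttributeLocalName(i))) {
                                count++;
                                break;
                            }
                        }
                    }
                }
            } finally {
                reader.close();
            }
            return count;
        } catch (XMLStreamException e) {
            // Thrown for input that is not well-formed XML, e.g. <td nowrap>.
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<table><tr><td nowrap=\"\" width=\"1%\"/><td width=\"3%\"/></tr></table>";
        System.out.println(countElementsWithAttribute(xml, "nowrap"));
    }
}
```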

Related

Unable to grab attribute value having a space in attribute name using jsoup java instead getting empty string

I'm new to jsoup and am trying to grab the value of the "title data-original-title" attribute, but I am getting an empty string. I want the value
Jul-30-2015 03:26:13 PM
<table class="table table-hover">
<thead>
<tr style="border-color: #E1E1E1; border-width: 1px; background-color: #F9F9F9; border-top-style: solid;">
<th>Height</th>
<th>Age</th>
<th>txn</th>
<th>Uncles</th>
<th>Miner</th>
<th>GasUsed</th>
<th>GasLimit</th>
<th>Avg.GasPrice</th>
<th>Reward</th>
</tr>
</thead>
<tbody>
<tr><td></td>
<td>
<span rel="tooltip" data-placement="bottom" title="" data-original-title="Jul-30-2015 03:26:13 PM">1149 days 18 hrs ago</span>
</td>
My code is
for (int i = total_pages; i >= 1; i--) {
System.out.println("\nDisplaying blocks on page " + i);
String newString = "https://etherscan.io/blocks?p=" + i;
Document d3 = Jsoup.connect(newString).get();
Elements e = d3.select("table.table-hover > tbody");
Elements r = e.get(0).select("tr");
for (Element cr : r) {
Elements test = d3.select("span");
System.out.println(test.attr("data-original-title"));
}
}
Any help would be appreciated. When I change the attribute name to data-placement, its value is retrieved correctly, but data-original-title still returns an empty string.

Data attributes (data-*) can also be read through dataset(), which returns a map keyed by the attribute name without the data- prefix.
Instead of
System.out.println(test.attr("data-original-title"));
use:
System.out.println(test.first().dataset().get("original-title"));

You can check whether this works:
d3.select("span[data-original-title]").get(0).attr("data-original-title")
Explanation:
This selects the first span that has a data-original-title attribute and returns that attribute's value. Note that get(0) throws an IndexOutOfBoundsException if no span matches.

Html parsing in Java using Jsoup

I've been using Jsoup for HTML parsing, but I've run into a big problem: it takes far too long, about an hour.
Here's the site that I am parsing.
<tr>
<td class="class1">value1 </td>
<td class="class1">value2</td>
<td class="class1">value3</td>
<td class="class1">value4</td>
<td class="class1">value5 </td>
<td class="class1">value6</td>
<td class="class1">value7</td>
<td class="class1">value8</td>
<td class="class1">value9</td>
</tr>
In the site, there are thousands of tables like this, and I need to parse them all to a list. I only need value1 and value6, so to do that I am using this code.
Document doc = Jsoup.connect(url).get();
ls = new LinkedList();
for(int i = 15; i<doc.text().length(); i++) {//15 because the tables I want starting from 15
Element element = doc.getElementsByTag("tr").get(i);//table index
Elements row = element.getElementsByTag("td");
value6 = row.get(5).text();//getting value6
value1 = row.get(0).text();//getting value1
node = new Node(value1, value6);
ls.insert(node);
As I said, it takes too much time, so I need to make it faster. Any ideas on how to fix this problem?

I think your problem stems from the loop condition in for(int i = 15; i<doc.text().length(); i++). It bounds the loop by the number of characters in the document's text rather than by the number of rows, and both doc.text() and doc.getElementsByTag("tr") are recomputed on every iteration. I highly doubt that this is what you want. I think you want to iterate over the table rows instead, so something like this should work:
Document doc = Jsoup.connect(url).get();
Elements trs = doc.select("tr"); //select the rows once, outside the loop
for (int i = 15; i < trs.size(); i++){
Element tr = trs.get(i);
Elements tds = tr.select("td");
String value6 = tds.get(5).text(); //getting value6
String value1 = tds.get(0).text(); //getting value1
//do whatever you need to do with the values
}

Select link in a table with jsoup using Java code

I need to get the download link in this table:
<table cellpadding="0" cellspacing="3" border="0">
<tr>
<td><img class="img" src="...path" /></td>
<td>File -
<a id="1569" class="tepLink" href="javascript:void(0);">[Click me]</a>
</td>
</tr>
</table>
and this is what I tried:
Element table = doc.select("table[cellpadding=\"0\" cellspacing=\"3\" border=\"0\"]").first();
Element dwlLink = table.select("td:has(a)").first();
String absPath = dwlLink.attr("abs:href");
//use download manager to download from string absPath
I always get a "null object reference" error, so something must be wrong with that code. What should I do?

Just select all anchor tags and then get the first element in the Elements object.
Elements anchorTags = doc.select("table[cellpadding=0][cellspacing=3][border=0] a");
if(anchorTags.isEmpty())
{
System.out.println("Not found");
}
else
{
System.out.println(anchorTags.first());
}
EDIT:
I changed the selector to include the cellpadding, cellspacing and border attributes, since that seems to be what you were after in one of your examples.
Also, the Elements.first() method returns null if the Elements list is empty. Always check for null when calling that method to prevent NullPointerExceptions.

table.select("td:has(a)").first(); will select the first <td> element that contains an anchor. It will not select the anchor <a> itself.
Here is what you can do:
Element aEl = doc.select("table[cellpadding] td a").first();

I would like to display hashmap data one by one using foreach?

I want to display a HashMap by key, with one td per value instead of the whole lot in one. Here is what I mean:
String getDescription = rs.getString("description");
int level = rs.getInt("level");
Timestamp startDate = rs.getTimestamp("startDateTime");
Timestamp endDate = rs.getTimestamp("endDateTime");
String LessonId = rs.getString("lessonid");
this.less = new Lesson(getDescription, startDate, endDate, level, LessonId);
putDescriptions.add(less.description);
putStartTime.add(less.startTime);
endTime.add(less.endTime);
List list = Arrays.asList(less.date.split("2010"));
for (int i = 0; i < list.size(); i++) {
putDates.add(list.get(i).toString());
Level.add(less.level);
LessonID.add(less.ID);
this.lessons.put("description", putDescriptions);
this.lessons.put("StartDate", putDates);
this.lessons.put("StartTime", putStartTime);
this.lessons.put("EndTime", endTime);
this.lessons.put("Level", Level);
this.lessons.put("LessonID", LessonID);
//above is bean code
JSTL:
<c:forEach var="temp" items="${sessionScope.AvailableLessons['description']}">
<tbody>
<tr>
<form action="" method="POST">
<td>
<c:out value="${temp}"/>

The following should be sufficient:
1.) The loop should be outside the row markup, but inside your table tag.
2.) Inside the forEach loop, open a tr or td as needed.
<c:forEach var="myVar" items="${myMap}">
${myVar.key} ${myVar.value}
</c:forEach>
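When c:forEach iterates over a Map, each item is a Map.Entry, which is why key and value are available on myVar. For the layout asked about (one row per lesson, one td per value), the same iteration can be sketched in plain Java as below. This assumes the bean's map has the shape Map<String, List<String>>, which the question does not state; the class and method names are hypothetical, and HTML escaping is left out for brevity.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class LessonRows {

    /** Builds one tr per list index and one td per map key (hypothetical helper). */
    static String toRows(Map<String, List<String>> columns) {
        // The longest column decides how many rows are produced.
        int rowCount = 0;
        for (List<String> values : columns.values()) {
            rowCount = Math.max(rowCount, values.size());
        }
        StringBuilder html = new StringBuilder();
        for (int row = 0; row < rowCount; row++) {
            html.append("<tr>");
            // Each entry is a key/value pair, as with ${myVar.key} and ${myVar.value}.
            for (Map.Entry<String, List<String>> column : columns.entrySet()) {
                List<String> values = column.getValue();
                String cell = row < values.size() ? values.get(row) : "";
                html.append("<td>").append(cell).append("</td>");
            }
            html.append("</tr>\n");
        }
        return html.toString();
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the columns in insertion order.
        Map<String, List<String>> lessons = new LinkedHashMap<>();
        lessons.put("description", Arrays.asList("Beginner ski", "Advanced ski"));
        lessons.put("Level", Arrays.asList("1", "3"));
        System.out.print(toRows(lessons));
    }
}
```

A plain HashMap does not guarantee column order, so an insertion-ordered map is used in the usage example.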

Jsoup returned string " " is not returning true on equals(" ")

I was just playing around, pulling some data off a site to manipulate, when I came across this:
String request = "http://foo";
String data = "bar";
Connection.Response res = Jsoup.connect(request).data(data).method(Method.POST).execute();
Document doc = res.parse();
Elements all = doc.select("td");
for(Element elem : all){
String test = elem.text();
if(test.equals(" ")){
//redefine test to 0 and print it
}
else{
//print it
}
The site in question is coded as so:
<td align="center">Henry</td>
<td>23</td>
<td align="center">Savannah</td>
<td>15</td></tr>
...
<td align="center"> </td>
<td> </td>
<td align="center">Jane</td>
<td>15</td></tr>
In my for loop, test is never redefined.
I've debugged in Eclipse and inspected String test.
Edit:
I also inspected test.charAt(0) in the debugger.
org.jsoup.nodes.Element.text() says "Returns unencoded text or empty string if none". I'm assuming the unencoded part has something to do with this, but I can't figure it out.
I ran a test program:
public static void main(String[] args) {
String str = " ";
if (str.equals(" ")){
System.out.println("True");
}
}
and it returns true.
What gives?

I don't know if you control the HTML being sent in the body of the response, or if this is what you see in a browser's source page or elsewhere:
<td> </td>
But it's possible the actual content is
<td>&nbsp;</td> // or &#160;
where &nbsp; is the HTML entity for the non-breaking space.
In Java, you can represent it as
char nbsp = 160;
So you could just check for both char values, the one for a regular space and the one for a non-breaking space.
Note that there might be other code points that are rendered as white space. You need to know what you're looking for.
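A minimal sketch of that check, assuming a cell counts as blank when it contains nothing but regular spaces and non-breaking spaces. The helper name isBlankCell is made up for this example. The sketch also shows why the comparison in the question fails: a string holding U+00A0 is not equal to " ", and trim() does not strip it.

```java
final class BlankCellCheck {

    static final char NBSP = '\u00A0'; // code point 160, what &nbsp; decodes to

    /** True if text holds only regular or non-breaking spaces (hypothetical helper). */
    static boolean isBlankCell(String text) {
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c != ' ' && c != NBSP) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String cell = String.valueOf(NBSP);
        System.out.println(cell.equals(" "));       // the comparison used in the question
        System.out.println(cell.trim().isEmpty());  // trim() only strips chars up to U+0020
        System.out.println(isBlankCell(cell));      // explicit check for both characters
    }
}
```

If other whitespace-like code points need to be covered, Character.isSpaceChar(c) could be used in place of the two explicit comparisons, since it also accepts U+00A0.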
