Get Index on JSOUP not working - java

try {
String url = "http://www.billboard.com/charts/artist-100";
String urlFound;
String closing = ")";
String start = "h";
Document doc = Jsoup.connect(url).get();
Elements urls = doc.getElementsByClass("chart-row__image");
for (Element u : urls) {
urlFound = u.attr("style");
String sub = urlFound.substring(urlFound.indexOf(start), urlFound.indexOf(closing));
System.out.println(sub);
//Log.d("URLS,", attr.substring(attr.indexOf("http://"), attr.indexOf(")")));
}
}
catch(IOException ex){
}
I tried debugging this several times, but I keep getting the error: Exception in thread "main" java.lang.StringIndexOutOfBoundsException: String index out of range: -1. I'm not sure why this is happening. Can someone give me an idea of what could be wrong?

You're extracting the style attribute strings from all the div class="chart-row__image" elements, but many elements in this group don't have a style attribute. For those elements jsoup returns an empty String, so indexOf returns -1 and substring throws the exception you see. The solution is to let jsoup select only those elements that have a style attribute.
For instance, not:
Elements urls = doc.getElementsByClass("chart-row__image");
but rather:
Elements urls = doc.select(".chart-row__image[style]");
And yeah, don't ignore exceptions.
So
String url = "http://www.billboard.com/charts/artist-100";
String urlFound;
String closing = ")";
String start = "h";
Document doc;
try {
doc = Jsoup.connect(url).get();
// Elements urls = doc.getElementsByClass("chart-row__image");
Elements urls = doc.select(".chart-row__image[style]");
for (Element u : urls) {
urlFound = u.attr("style");
int startingIndex = urlFound.indexOf(start);
int endingIndex = urlFound.indexOf(closing);
if (startingIndex >= 0 && endingIndex > startingIndex) {
String sub = urlFound.substring(startingIndex, endingIndex);
System.out.println(sub);
}
}
} catch (IOException e) {
e.printStackTrace();
}
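The indexOf/substring approach above depends on the first "h" and the first ")" in the style value being the ones you want. As a rough alternative, here is a minimal sketch that pulls the address out of a url(...) token with a regular expression. It assumes the style value has the usual CSS form, for example background-image: url(...). The class name StyleUrlExtractor and the sample values are made up for illustration, and the sketch works on the plain String you would get from u.attr("style"), so it needs no jsoup calls.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of jsoup: extracts the address from a CSS url(...) token.
class StyleUrlExtractor {
    // Matches url(...) with optional whitespace and optional single or double quotes.
    private static final Pattern CSS_URL =
            Pattern.compile("url\\(\\s*['\"]?([^'\")]+?)['\"]?\\s*\\)");

    // Returns the first url(...) address in the style value, or empty if there is none.
    static Optional<String> extractUrl(String style) {
        if (style == null) {
            return Optional.empty();
        }
        Matcher m = CSS_URL.matcher(style);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        // Sample values (assumed, not taken from the real page).
        System.out.println(extractUrl("background-image: url(http://example.com/a.jpg)"));
        System.out.println(extractUrl("")); // an element without a usable style
    }
}
```

Because the result is an Optional, an element with an empty or unexpected style value simply yields no URL instead of throwing StringIndexOutOfBoundsException.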

Related

How to get hyperlink boundaries of inline words with Aspose Words for Android?

The Android app reads paragraphs and some of their properties from an MS Word document with the Aspose.Words for Android library. It gets the paragraph text, the style name, and the style separator value. Some words in a paragraph line have hyperlinks. How do I get the start and end boundaries of the hyperlinked words? For example:
This is an inline hyperlink paragraph example that the start bound is 18 and end bound is 27.
public static ArrayList<String[]> GetBookLinesByTag(String file) {
ArrayList<String[]> bookLines = new ArrayList<>();
try {
Document doc = new Document(file);
ParagraphCollection paras = doc.getFirstSection().getBody().getParagraphs();
for(int i = 0; i < paras.getCount(); i++){
String styleName = paras.get(i).getParagraphFormat().getStyleName().trim();
String isStyleSeparator = Integer.toString(paras.get(i).getBreakIsStyleSeparator() ? 1 : 0);
String content = paras.get(i).toString(SaveFormat.TEXT).trim();
bookLines.add(new String[]{content, styleName, isStyleSeparator});
}
} catch (Exception e){}
return bookLines;
}
Edit:
Thanks to Alexey Noskov, this is solved:
public static ArrayList<String[]> GetBookLinesByTag(String file) {
ArrayList<String[]> bookLines = new ArrayList<>();
try {
Document doc = new Document(file);
ParagraphCollection paras = doc.getFirstSection().getBody().getParagraphs();
for(int i = 0; i < paras.getCount(); i++){
String styleName = paras.get(i).getParagraphFormat().getStyleName().trim();
String isStyleSeparator = Integer.toString(paras.get(i).getBreakIsStyleSeparator() ? 1 : 0);
String content = paras.get(i).toString(SaveFormat.TEXT).trim();
for (Field field : paras.get(i).getRange().getFields()) {
if (field.getType() == FieldType.FIELD_HYPERLINK) {
FieldHyperlink hyperlink = (FieldHyperlink) field;
String urlId = hyperlink.getSubAddress();
String urlText = hyperlink.getResult();
// Reformat linked text: urlText:urlId
content = urlText + ":" + urlId;
}
}
bookLines.add(new String[]{content, styleName, isStyleSeparator});
}
} catch (Exception e){}
return bookLines;
}
Hyperlinks in MS Word documents are represented as fields. If you press Alt+F9 in MS Word you will see something like this
{ HYPERLINK "https://aspose.com" }
Follow the link to learn more about fields in Aspose.Words document model and in MS Word.
https://docs.aspose.com/display/wordsjava/Introduction+to+Fields
In your case you need to locate the position of FieldStart; this will be the start position. Then measure the length of the content between FieldSeparator and FieldEnd; the start position plus the calculated length will be the end position.
Disclosure: I work on the Aspose.Words team.
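The arithmetic described above (end = start + length of the displayed link text) can be illustrated without the Aspose API. The sketch below is a simplified stand-in, not the FieldStart/FieldSeparator/FieldEnd walk itself: it assumes you already have the paragraph's plain text and the hyperlink's displayed text as Strings (for instance the values the question's code stores in content and urlText), and that the first occurrence of the link text in the paragraph is the linked one. The class and method names are hypothetical.

```java
// Hypothetical helper: computes hyperlink bounds from plain strings only.
class HyperlinkBounds {
    // Returns {start, end} with end exclusive, or null if linkText is not found.
    static int[] boundsOf(String paragraphText, String linkText) {
        if (paragraphText == null || linkText == null || linkText.isEmpty()) {
            return null;
        }
        int start = paragraphText.indexOf(linkText);
        if (start < 0) {
            return null;
        }
        // End position is the start position plus the length of the displayed text.
        return new int[] {start, start + linkText.length()};
    }

    public static void main(String[] args) {
        String paragraph = "This is an inline hyperlink paragraph example";
        int[] bounds = boundsOf(paragraph, "hyperlink");
        System.out.println(bounds[0] + " " + bounds[1]); // prints "18 27"
    }
}
```

If the same word appears more than once in a paragraph, this shortcut picks the first occurrence, which is one reason to prefer locating the field nodes as the answer describes.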

Using Selenium and JUnit to parse HTML Document for links

NullPointerException at if (hrefAttr.contains("?"))
I'm running into a problem. I'm using selenium and JUnit to parse through links and compare them to a list of links provided from a CSV file.
Everything was going well until I realized that I have to test the URLs and the query strings separately. I attempted to create an if statement saying that if the href attribute contains a "?", split the entire URL into an array of two strings: the URL destination as the first element and the query string as the second. Then return the URL destination and append it to an ID. If there is no "?" in the URL string, just return the URL string and append it to an ID.
I think the logic looks accurate, but I keep getting a NullPointerException at line 76 (where the hrefAttr.contains("?") condition is located). Code below:
public static ArrayList<String> getURLSFromHTML(WebDriver driver) {
// prepares variable for array of html link URLs
ArrayList <String> pageLinksList = new ArrayList<String>();
// prepares array to place all of the <a></a> tags found in the HTML
List <WebElement> aElements = driver.findElements(By.tagName("a"));
// loops through all the <a></a> tags found in the HTML
for (WebElement aElement : aElements) {
/*
* grabs the href attribute value and stores it into a variable
* grabs the QA_ID attribute value and stores it in a variable
* concatenates the QA_ID value with the href value and stores them in a variable
*/
String hrefAttr = aElement.getAttribute("href");
String QA_ID = aElement.getAttribute("QA_ID");
String linkConcat;
if (hrefAttr.contains("?")) {
String[] splitHref = hrefAttr.split("\\?");
String URL = splitHref[0];
linkConcat = QA_ID + "_" + URL;
} else {
linkConcat = QA_ID + "_" + hrefAttr;
}
String urlIgnoreAttr = aElement.getAttribute("URL_ignore");
String combIgnore = QA_ID + "_" + urlIgnoreAttr;
String combIgnoreVal = "ignore";
/*
* if the QA_ID is not null then add value to pageLinksList
* if URL_ignore attribute="ignore" in html, then add combIgnore value to pageLinksList
* else add linkConcat to pageLinksList
*/
if(!Objects.isNull(QA_ID)) {
if (Objects.equals(urlIgnoreAttr, combIgnoreVal)) {
pageLinksList.add(combIgnore);
}else {
pageLinksList.add(linkConcat);
}
}
}
System.out.println(pageLinksList);
return pageLinksList;
}
Please help!
The obvious solution is to check for null:
if (hrefAttr != null && hrefAttr.contains("?")) {
String[] splitHref = hrefAttr.split("\\?");
String URL = splitHref[0];
linkConcat = QA_ID + "_" + URL;
} else {
linkConcat = QA_ID + "_" + hrefAttr;
}
An anchor tag without an href attribute can still be valid. Without the HTML source we cannot explain why the href attributes are missing. The else branch will not throw an NPE, but it may be useless when hrefAttr == null.
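One way to keep the null handling in a single place is to move it into small helpers. The following is a minimal sketch under the assumption that links with a missing QA_ID or a missing href should simply be skipped. The names LinkKeys, stripQuery and linkKey are invented for this example, and the methods take plain Strings (the values returned by getAttribute), so no Selenium types are involved.

```java
import java.util.Optional;

// Hypothetical helpers for building "QA_ID_url" keys without risking an NPE.
class LinkKeys {
    // Returns the part of href before the first '?', or href unchanged if there is none.
    static String stripQuery(String href) {
        if (href == null) {
            return null;
        }
        int q = href.indexOf('?');
        return q >= 0 ? href.substring(0, q) : href;
    }

    // Builds "qaId_url"; empty when either attribute is missing on the element.
    static Optional<String> linkKey(String qaId, String href) {
        if (qaId == null || href == null) {
            return Optional.empty();
        }
        return Optional.of(qaId + "_" + stripQuery(href));
    }

    public static void main(String[] args) {
        System.out.println(linkKey("nav1", "https://example.com/page?x=1&y=2"));
        System.out.println(linkKey("nav2", null)); // anchor without href
    }
}
```

In the loop from the question, the caller could then add the key only when it is present, for example with linkKey(QA_ID, hrefAttr).ifPresent(pageLinksList::add).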

How to use Jsoup to get href link without the extra characters?

I have a list of Elements on which I'm using jsoup's attr() method to get the href attribute.
Here is part of my code:
String searchTerm = "tutorial+programming+"+i_SearchPhrase;
int num = 10;
String searchURL = GOOGLE_SEARCH_URL + "?q="+searchTerm+"&num="+num;
Document doc = Jsoup.connect(searchURL).userAgent("chrome/5.0").get();
Elements results = doc.select("h3.r > a");
String linkHref;
for (Element result : results) {
linkHref = result.attr("href").replace("/url?q=","");
//some more unrelated code...
}
So for example, when I use the search phrase "test", attr("href") produces (first in the list):
linkHref = https://www.tutorialspoint.com/software_testing/&sa=U&ved=0ahUKEwi_lI-T69jTAhXIbxQKHU1kBlAQFggTMAA&usg=AFQjCNHr6EzeYegPDdpHJndLJ-889Sj3EQ
where I only want: https://www.tutorialspoint.com/software_testing/
What is the best way to fix this? Do I just add some string operations on linkHref (which I know how to do), or is there a way to make the href attribute contain the shorter link to begin with?
Thank you in advance.
If you always want to remove the query parameters, you can make use of String.indexOf(), e.g.:
int lastPos;
if (linkHref.indexOf("?") > 0) {
lastPos = linkHref.indexOf("?");
} else if (linkHref.indexOf("&") > 0) {
// no '?', so cut at the first '&' instead
lastPos = linkHref.indexOf("&");
} else {
lastPos = -1;
}
if (lastPos != -1) {
linkHref = linkHref.substring(0, lastPos);
}
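Another way to look at the href from the question is as a redirect of the form /url?q=TARGET&sa=..., where the wanted link is the value of the q parameter. The sketch below assumes exactly that shape (taken from the example in the question) and also URL-decodes the value, since query parameter values are normally percent-encoded. The class name GoogleRedirect and the method targetOf are made up for illustration.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Hypothetical helper: extracts the target link from an href like "/url?q=TARGET&sa=U...".
class GoogleRedirect {
    private static final String PREFIX = "/url?q=";

    // Returns the decoded q value, or href unchanged if it does not start with the prefix.
    static String targetOf(String href) {
        if (href == null || !href.startsWith(PREFIX)) {
            return href;
        }
        String rest = href.substring(PREFIX.length());
        int amp = rest.indexOf('&'); // the next parameter, if any, starts here
        String encoded = amp >= 0 ? rest.substring(0, amp) : rest;
        try {
            return URLDecoder.decode(encoded, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is a charset every Java runtime must provide.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String href = "/url?q=https://www.tutorialspoint.com/software_testing/&sa=U&ved=abc";
        System.out.println(targetOf(href)); // prints https://www.tutorialspoint.com/software_testing/
    }
}
```

With this helper the loop would call targetOf(result.attr("href")) instead of chaining replace("/url?q=", "") with a later cut at "&".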

Read a specified line of text from a webpage with Jsoup

So I am trying to get the data from this webpage using Jsoup...
I've tried looking up many different ways of doing it and I've gotten close but I don't know how to find tags for certain stats (Attack, Strength, Defence, etc.)
So let's say, for example's sake, I wanted to print out
'Attack', '15', '99', '200,000,000'
How should I go about doing this?
You can use CSS selectors in Jsoup to easily extract the column data.
// retrieve page source code
Document doc = Jsoup
.connect("http://services.runescape.com/m=hiscore_oldschool/hiscorepersonal.ws?user1=Lynx%A0Titan")
.get();
// find all of the table rows
Elements rows = doc.select("div#contentHiscores table tr");
ListIterator<Element> itr = rows.listIterator();
// loop over each row
while (itr.hasNext()) {
Element row = itr.next();
// does the second col contain the word attack?
if (row.select("td:nth-child(2) a:contains(attack)").first() != null) {
// if so, assign each sibling col to variable
String rank = row.select("td:nth-child(3)").text();
String level = row.select("td:nth-child(4)").text();
String xp = row.select("td:nth-child(5)").text();
System.out.printf("rank=%s level=%s xp=%s", rank, level, xp);
// stop looping rows, found attack
break;
}
}
A very rough implementation would be as below. I have only shown a snippet; optimizations and other conditionals still need to be added.
public static void main(String[] args) throws Exception {
Document doc = Jsoup
.connect("http://services.runescape.com/m=hiscore_oldschool/hiscorepersonal.ws?user1=Lynx%A0Titan")
.get();
Element contentHiscoresDiv = doc.getElementById("contentHiscores");
Element table = contentHiscoresDiv.child(0);
for (Element row : table.select("tr")) {
Elements tds = row.select("td");
for (Element column : tds) {
if (column.children() != null && column.children().size() > 0) {
Element anchorTag = column.getElementsByTag("a").first();
if (anchorTag != null && anchorTag.text().contains("Attack")) {
System.out.println(anchorTag.text());
Elements attributeSiblings = column.siblingElements();
for (Element attributeSibling : attributeSiblings) {
System.out.println(attributeSibling.text());
}
}
}
}
}
}
Attack
15
99
200,000,000

Empty / Null Nodes returned from getChildNodes

I'm trying to parse the following XML:
<?xml version="1.0" encoding="UTF-8"?>
<docusign-cfg>
<tagConfig>
<tags>
<approve>approve</approve>
<checkbox>checkbox</checkbox>
<company>company</company>
<date>date</date>
<decline>decline</decline>
<email>email</email>
<emailAddress>emailAddress</emailAddress>
<envelopeID>envelopeID</envelopeID>
<firstName>firstName</firstName>
<lastName>lastName</lastName>
<number>number</number>
<ssn>ssn</ssn>
<zip>zip</zip>
<signHere>signHere</signHere>
<checkbox>checkbox</checkbox>
<initialHere>initialHere</initialHere>
<dateSigned>dateSigned</dateSigned>
<fullName>fullName</fullName>
</tags>
</tagConfig>
</docusign-cfg>
I want to read either the name or content of each tag in the <tags> tag. I can do so with the following code:
public String[] getAvailableTags() throws Exception
{
String path = "/docusign-cfg/tagConfig/tags";
XPathFactory f = XPathFactory.newInstance();
XPath x = f.newXPath();
Object result = null;
try
{
XPathExpression expr = x.compile(path);
result = expr.evaluate(doc, XPathConstants.NODE);
}
catch (XPathExpressionException e)
{
throw new Exception("An error occurred while trying to retrieve the tags");
}
Node node = (Node) result;
NodeList childNodes = node.getChildNodes();
String[] tags = new String[childNodes.getLength()];
System.out.println(tags.length);
for(int i = 0; i < tags.length; i++)
{
String content = childNodes.item(i).getNodeName().trim().replaceAll("\\s", "");
if(childNodes.item(i).getNodeType() == Node.ELEMENT_NODE &&
childNodes.item(i).getNodeName() != null)
{
tags[i] = content;
}
}
return tags;
}
After some searching I found that parsing it this way causes the whitespace between nodes/tags to be read as children. In this case the whitespace is considered a child of <tags>.
My output:
37
null
approve
null
checkbox
null
company
null
date
null
decline
null
email
null
emailAddress
null
envelopeID
null
firstName
null
lastName
null
number
null
ssn
null
zip
null
signHere
null
checkbox
null
initialHere
null
dateSigned
null
fullName
null
37 is the number of nodes it found in <tags>
Everything below 37 is the content of the tag array.
How are these null elements being added to the tag array despite my checking for null?
I think that is because of the indexing of the tags array. When the if check fails, that index is skipped, so even though no value is inserted, the slot stays null. Use a separate index for the tags array (the unused slots at the end of the array will still be null):
int j = 0;
for(int i = 0; i < tags.length; i++)
{
String content = childNodes.item(i).getNodeName().trim().replaceAll("\\s", "");
if(childNodes.item(i).getNodeType() == Node.ELEMENT_NODE &&
childNodes.item(i).getNodeName() != null)
{
tags[j++] = content;
}
}
Since you are omitting some of the child nodes, creating an array as long as the entire child node list wastes memory. You can use a List instead. If you specifically need a String array, you can convert the list to one afterwards.
public String[] getAvailableTags() throws Exception
{
String path = "/docusign-cfg/tagConfig/tags";
XPathFactory f = XPathFactory.newInstance();
XPath x = f.newXPath();
Object result = null;
try
{
XPathExpression expr = x.compile(path);
result = expr.evaluate(doc, XPathConstants.NODE);
}
catch (XPathExpressionException e)
{
throw new Exception("An error occurred while trying to retrieve the tags");
}
Node node = (Node) result;
NodeList childNodes = node.getChildNodes();
List<String> tags = new ArrayList<String>();
for(int i = 0; i < childNodes.getLength(); i++)
{
String content = childNodes.item(i).getNodeName().trim().replaceAll("\\s", "");
if(childNodes.item(i).getNodeType() == Node.ELEMENT_NODE &&
childNodes.item(i).getNodeName() != null)
{
tags.add(content);
}
}
String[] tagsArray = tags.toArray(new String[tags.size()]);
return tagsArray;
}
The contents of the tags array default to null.
So it is not a case of how the element becomes null; it is a case of it being left as null.
To prove this to yourself, add an else block like this:
if(childNodes.item(i).getNodeType() == Node.ELEMENT_NODE &&
childNodes.item(i).getNodeName() != null)
{
tags[i] = content;
} else {
tags[i] = "Foo Bar";
}
You should now see 'Foo Bar' instead of null.
A better solution here would be to use an ArrayList and append the tags to it instead of using an array. Then you do not need to track the indexes, so there is less chance of this type of bug.
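To see both effects together (whitespace text nodes counted as children, and a list that only receives the element names), here is a small self-contained sketch using the JDK's built-in DOM parser. It parses a shortened, made-up version of the <tags> block from a String rather than the original file, and the class name ElementChildNames is invented for the example.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

class ElementChildNames {
    // Collects the names of element children only, skipping whitespace text nodes.
    static List<String> elementChildNames(Node parent) {
        List<String> names = new ArrayList<>();
        NodeList children = parent.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.ELEMENT_NODE) {
                names.add(child.getNodeName());
            }
        }
        return names;
    }

    // Parses an XML string with the default (non-validating) parser settings.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<tags>\n  <approve>approve</approve>\n  <checkbox>checkbox</checkbox>\n</tags>";
        Node tags = parse(xml).getDocumentElement();
        System.out.println(tags.getChildNodes().getLength()); // prints 5: 2 elements + 3 whitespace text nodes
        System.out.println(elementChildNames(tags));          // prints [approve, checkbox]
    }
}
```

The same counting explains the 37 in the question: 18 element children plus 19 whitespace text nodes around them.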
