How to use Jsoup to get href link without the extra characters? - java

I have an Elements list, and I'm using jsoup's attr() method to get each element's href attribute.
Here is part of my code:
String searchTerm = "tutorial+programming+"+i_SearchPhrase;
int num = 10;
String searchURL = GOOGLE_SEARCH_URL + "?q="+searchTerm+"&num="+num;
Document doc = Jsoup.connect(searchURL).userAgent("chrome/5.0").get();
Elements results = doc.select("h3.r > a");
String linkHref;
for (Element result : results) {
linkHref = result.attr("href").replace("/url?q=","");
//some more unrelated code...
}
For example, when I use the search phrase "test", attr("href") produces (for the first result in the list):
linkHref = https://www.tutorialspoint.com/software_testing/&sa=U&ved=0ahUKEwi_lI-T69jTAhXIbxQKHU1kBlAQFggTMAA&usg=AFQjCNHr6EzeYegPDdpHJndLJ-889Sj3EQ
whereas I only want: https://www.tutorialspoint.com/software_testing/
What is the best way to fix this? Should I just add some string operations on linkHref (which I know how to do), or is there a way to make the href attribute contain the shorter link to begin with?
Thank you in advance.

If you always want to remove the query parameters, you can use String.indexOf(), e.g.:
int lastPos;
if(linkHref.indexOf("?") > 0) {
lastPos = linkHref.indexOf("?");
} else if (linkHref.indexOf("&") > 0){
lastPos = linkHref.indexOf("&");
}
else lastPos = -1;
if(lastPos != -1)
linkHref = linkHref.substring(0, lastPos);
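As a variation on the indexOf() approach above, here is a minimal sketch that pulls the target out of the "q" parameter instead of cutting at the first "?". It assumes the href has the "/url?q=<target>&sa=..." shape shown in the question and that the target itself is percent-encoded, so its first literal "&" starts the tracking parameters. HrefCleaner and extractTarget are hypothetical names, not part of jsoup.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

class HrefCleaner {
    // Hypothetical helper: returns the URL carried in the "q" parameter of a
    // "/url?q=...&sa=..." style href, or the href unchanged if it has another shape.
    static String extractTarget(String href) {
        String prefix = "/url?q=";
        if (!href.startsWith(prefix)) {
            return href;
        }
        String rest = href.substring(prefix.length());
        int amp = rest.indexOf('&'); // start of the extra parameters, if any
        String encoded = amp >= 0 ? rest.substring(0, amp) : rest;
        try {
            // Assumption: the target is percent-encoded inside the parameter.
            return URLDecoder.decode(encoded, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    public static void main(String[] args) {
        String href = "/url?q=https://www.tutorialspoint.com/software_testing/&sa=U&ved=0ahUKEwi_lI-T69jTAhXIbxQKHU1kBlAQFggTMAA";
        System.out.println(extractTarget(href));
    }
}
```

In the loop from the question, this would be called as extractTarget(result.attr("href")) in place of the replace() call.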

Related

Java - Pair variable resets between consecutive for loop body executions

This is an excerpt from my project:
import javafx.util.Pair;
import org.w3c.dom.*;
private Pair<Element, Integer> findBestAlbumElement(Element recording) {
Pair<Element, Integer> best = new Pair<>(null, Integer.MIN_VALUE);
NodeList list = recording.getElementsByTagName("release");
for (int i = 0; i < list.getLength(); i++) {
System.out.println((best.getKey() == null ? "null" : best.getKey().getTextContent()) + "; " + best.getValue());
Element album = (Element) list.item(i);
int mark = getAlbumAndYearMark(recording, album);
if (mark > best.getValue()) best = new Pair<>(album, mark);
System.out.println((best.getKey() == null ? "null" : best.getKey().getTextContent()) + "; " + best.getValue());
}
return best;
}
and I'm running into a strange problem in this piece of code. The variable best resets between loop iterations, as seen in the beginning of the printout to console:
null; -2147483648
Live USABootlegAlbumLive1990DE1990GermanyGermanyDE212CD2T.N.T255240; 6
null; -2147483648
...
The first line is the first System.out.println(), the second line is the second one (where the variable best is properly set as expected) and the third line is the first one again (where the variable best seemingly just resets of its own accord).
I've tried to replicate the problem with the following code:
Pair<String, Integer> best = new Pair<>("", Integer.MIN_VALUE);
String[] strings = {"asdf", "fdsa", "dsaf"};
int[] marks = {1, 5, 3};
for (int i = 0; i < strings.length; i++) {
System.out.println(best.getKey() + " " + best.getValue());
if (marks[i] > best.getValue()) best = new Pair<>(strings[i], marks[i]);
System.out.println(best.getKey() + " " + best.getValue());
}
which replaces the NodeList with a String array, but this code works as expected.
My problem is, I don't even know how to approach this issue. I don't know how to debug this further or even reproduce the problem in a smaller example, as I don't know how to create a valid NodeList (since it's an interface, so I can't just new NodeList).
I'm also at a bit of a loss, as it looks to me like the bug appears in a place where it shouldn't even be possible, since the only code that is supposed to execute between the two println calls is i++ (not altering or even accessing best in any way). Am I wrong about this?
Does anyone have any idea what could be going on, or even how I would get closer to pinpointing the issue?
EDIT
As per request, here's getAlbumAndYearMark, which uses the jaudiotagger library (apologies for the ugly long lined code, this is a fairly old project).
private Tag tag;
private int getAlbumAndYearMark(Element recording, Element album) {
int mark = 0;
if (album == null) return tag.hasField(FieldKey.YEAR) ? getYearMark(album) : 0;
if (contains(album.getElementsByTagName("primary-type"), "Album")) mark += 2;
else if (!contains(album.getElementsByTagName("secondary-type"), "Album")) return Integer.MIN_VALUE;
Node title = album.getElementsByTagName("title").item(0);
if (title != null && tag.hasField(FieldKey.ALBUM)) mark += title.getTextContent().equals(tag.getFirst(FieldKey.ALBUM)) ? 7 : -4;
Node date = album.getElementsByTagName("date").item(0);
if (date != null && tag.hasField(FieldKey.YEAR)) mark += date.getTextContent().equals(tag.getFirst(FieldKey.YEAR).trim()) ? 3 : -3;
Node track = album.getElementsByTagName("number").item(0);
if (track != null && tag.hasField(FieldKey.TRACK)) mark += track.getTextContent().equals(tag.getFirst(FieldKey.TRACK).trim()) ? 3 : -1;
return mark;
}
private int getYearMark(Element element) {
NodeList dates = element.getElementsByTagName("date");
for (int i = 0; i < dates.getLength(); i++)
if (dates.item(i).getTextContent().substring(0, 4).equals(tag.getFirst(FieldKey.YEAR))) return 7;
return -7;
}
private static boolean contains(NodeList list, String string) {
for (int i = 0; i < list.getLength(); i++)
if (list.item(i).getTextContent().trim().equalsIgnoreCase(string)) return true;
return false;
}
However, I don't believe this method is the problem, as I still have the same issue if I replace int mark = getAlbumAndYearMark(recording, album); with int mark = (int) (Math.random() * 10);
Here's a (heavily trimmed) example XML file, printed directly from the program:
<?xml version="1.0" encoding="UTF-8"?><metadata xmlns="http://musicbrainz.org/ns/mmd-2.0#" xmlns:ext="http://musicbrainz.org/ns/ext#-2.0" created="2018-02-16T02:07:28.816Z">
<recording-list count="72" offset="0">
<recording ext:score="100" id="6e702972-00c2-4725-b3e5-60e85ef0de25">
<title>T.N.T</title>
<artist-credit>
<name-credit>
<artist id="66c662b6-6e2f-4930-8610-912e24c63ed1">
<name>AC/DC</name>
</artist>
</name-credit>
</artist-credit>
<release-list>
<release id="ddaa5690-df97-4bb2-b93d-396fe5fb49d5">
<title>Live USA</title>
<release-group id="6b1ace64-bf92-3c42-8a1f-aea6fa08edec" type="Live">
<primary-type>Album</primary-type>
<secondary-type-list>
<secondary-type>Live</secondary-type>
</secondary-type-list>
</release-group>
<date>1990</date>
<country>DE</country>
<release-event-list>
<release-event>
<date>1990</date>
<area id="85752fda-13c4-31a3-bee5-0e5cb1f51dad">
<name>Germany</name>
<sort-name>Germany</sort-name>
<iso-3166-1-code-list>
<iso-3166-1-code>DE</iso-3166-1-code>
</iso-3166-1-code-list>
</area>
</release-event>
</release-event-list>
<medium-list>
<track-count>21</track-count>
<medium>
<position>2</position>
<format>CD</format>
<track-list count="11" offset="1">
<track id="caadf3b8-4a44-34c6-b9dc-c9870c5d9bc0">
<number>2</number>
</track>
</track-list>
</medium>
</medium-list>
</release>
</release-list>
</recording>
</recording-list>
</metadata>
You can see an untrimmed example by querying the musicbrainz database directly, for example this query.
This is not an answer (yet). I'm testing with the following XML:
<root>
<release-list>
<release id="ddaa5690-df97-4bb2-b93d-396fe5fb49d5">
<title>Live USA</title>
<date>1990</date>
<country>DE</country>
</release>
<release id="qqqa5690-df97-4bb2-b93d-396fe5fb49d5">
<title>German collections</title>
<date>1991</date>
<country>DE</country>
</release>
</release-list>
<release-list>
<release id="zzza5690-df97-4bb2-b93d-396fe5fb49d5">
<title>Just USA</title>
<date>1995</date>
<country>US</country>
</release>
</release-list>
<release-list>
<release id="aaaa5690-df97-4bb2-b93d-396fe5fb49d5">
<title>Anoother USA</title>
<primary-type>Album</primary-type>
<date>1999</date>
<country>RUS</country>
</release>
</release-list>
</root>
With it, I have no issues. Could you please try with this XML?
I'm also using both an incremental mark, like return mockMark++;, and an array-based mark, like:
private int getAlbumAndYearMark(Element recording, Element album) {
int[] arr = {1,0,5,7,3,6,8,9,10};
return arr[mockMark++];
}
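On the point about not being able to create a NodeList directly: one way to get a real one for a small reproduction is to parse an XML string with the JDK's DocumentBuilderFactory. The sketch below is only an illustration under assumptions: it uses java.util.AbstractMap.SimpleEntry in place of javafx.util.Pair to avoid the JavaFX dependency, takes the marks from an array, and the class and method names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

class NodeListRepro {
    // Parses the XML text and returns every <release> element as a NodeList.
    static NodeList releases(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return doc.getDocumentElement().getElementsByTagName("release");
        } catch (Exception e) {
            throw new IllegalStateException("could not parse test XML", e);
        }
    }

    // Same shape as the loop in the question; marks[i] stands in for getAlbumAndYearMark.
    static Map.Entry<Element, Integer> findBest(NodeList list, int[] marks) {
        Map.Entry<Element, Integer> best = new SimpleEntry<>(null, Integer.MIN_VALUE);
        for (int i = 0; i < list.getLength(); i++) {
            Element album = (Element) list.item(i);
            if (marks[i] > best.getValue()) {
                best = new SimpleEntry<>(album, marks[i]);
            }
        }
        return best;
    }

    public static void main(String[] args) {
        String xml = "<r><release><title>A</title></release>"
                + "<release><title>B</title></release>"
                + "<release><title>C</title></release></r>";
        Map.Entry<Element, Integer> best = findBest(releases(xml), new int[] {1, 5, 3});
        System.out.println(best.getKey().getTextContent() + "; " + best.getValue());
    }
}
```

If the loop behaves here but not in the project, logging at the method's entry would show whether the "reset" is simply findBestAlbumElement being called again for another recording.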

Get Index on JSOUP not working

try {
String url = "http://www.billboard.com/charts/artist-100";
String urlFound;
String closing = ")";
String start = "h";
Document doc = Jsoup.connect(url).get();
Elements urls = doc.getElementsByClass("chart-row__image");
for (Element u : urls) {
urlFound = u.attr("style");
String sub = urlFound.substring(urlFound.indexOf(start), urlFound.indexOf(closing));
System.out.println(sub);
//Log.d("URLS,", attr.substring(attr.indexOf("http://"), attr.indexOf(")")));
}
}
catch(IOException ex){
}
I tried debugging this several times, but I keep getting the error: Exception in thread "main" java.lang.StringIndexOutOfBoundsException: String index out of range: -1. I'm not sure why this is happening. Can someone give me an idea of what could be wrong?
You're extracting the style attribute strings from all the div class="chart-row__image" elements, but many elements in this group don't have a style attribute. For those, jsoup returns an empty String, so indexOf() returns -1 and substring() throws. The solution is to let jsoup select only those elements that have a style attribute.
For instance, not:
Elements urls = doc.getElementsByClass("chart-row__image");
but rather:
Elements urls = doc.select(".chart-row__image[style]");
Also, don't ignore exceptions.
So:
String url = "http://www.billboard.com/charts/artist-100";
String urlFound;
String closing = ")";
String start = "h";
Document doc;
try {
doc = Jsoup.connect(url).get();
// Elements urls = doc.getElementsByClass("chart-row__image");
Elements urls = doc.select(".chart-row__image[style]");
for (Element u : urls) {
urlFound = u.attr("style");
int startingIndex = urlFound.indexOf(start);
int endingIndex = urlFound.indexOf(closing);
if (startingIndex >= 0 && endingIndex > startingIndex) {
String sub = urlFound.substring(startingIndex, endingIndex);
System.out.println(sub);
}
}
} catch (IOException e) {
e.printStackTrace();
}
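As an alternative to the two indexOf() calls, the URL can be matched with a regular expression, which also copes with an empty style string. This is a minimal sketch and assumes the style value looks like "background-image: url(...)"; the exact markup of the page is not shown in the question, and StyleUrl and extract are hypothetical names.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class StyleUrl {
    // Matches url(...), url('...') or url("...") and captures the part inside.
    private static final Pattern CSS_URL =
            Pattern.compile("url\\(\\s*['\"]?([^'\")]+)['\"]?\\s*\\)");

    // Returns the first url(...) value in the style string, if there is one.
    static Optional<String> extract(String style) {
        Matcher m = CSS_URL.matcher(style);
        return m.find() ? Optional.of(m.group(1).trim()) : Optional.empty();
    }

    public static void main(String[] args) {
        // An empty style attribute yields Optional.empty instead of an exception.
        System.out.println(extract("").isPresent());
        System.out.println(extract("background-image: url(http://example.com/a.jpg)").orElse("none"));
    }
}
```

Inside the loop, extract(u.attr("style")).ifPresent(System.out::println) would print only the elements that actually carry a URL.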

Read a specified line of text from a webpage with Jsoup

So I am trying to get the data from this webpage using Jsoup...
I've tried looking up many different ways of doing it and I've gotten close but I don't know how to find tags for certain stats (Attack, Strength, Defence, etc.)
So let's say, for example's sake, I wanted to print out
'Attack', '15', '99', '200,000,000'
How should I go about doing this?
You can use CSS selectors in Jsoup to easily extract the column data.
// retrieve page source code
Document doc = Jsoup
.connect("http://services.runescape.com/m=hiscore_oldschool/hiscorepersonal.ws?user1=Lynx%A0Titan")
.get();
// find all of the table rows
Elements rows = doc.select("div#contentHiscores table tr");
ListIterator<Element> itr = rows.listIterator();
// loop over each row
while (itr.hasNext()) {
Element row = itr.next();
// does the second col contain the word attack?
if (row.select("td:nth-child(2) a:contains(attack)").first() != null) {
// if so, assign each sibling col to variable
String rank = row.select("td:nth-child(3)").text();
String level = row.select("td:nth-child(4)").text();
String xp = row.select("td:nth-child(5)").text();
System.out.printf("rank=%s level=%s xp=%s", rank, level, xp);
// stop looping rows, found attack
break;
}
}
A very rough implementation would be as below. I have only shown a snippet; optimizations and other conditionals still need to be added.
public static void main(String[] args) throws Exception {
Document doc = Jsoup
.connect("http://services.runescape.com/m=hiscore_oldschool/hiscorepersonal.ws?user1=Lynx%A0Titan")
.get();
Element contentHiscoresDiv = doc.getElementById("contentHiscores");
Element table = contentHiscoresDiv.child(0);
for (Element row : table.select("tr")) {
Elements tds = row.select("td");
for (Element column : tds) {
if (column.children() != null && column.children().size() > 0) {
Element anchorTag = column.getElementsByTag("a").first();
if (anchorTag != null && anchorTag.text().contains("Attack")) {
System.out.println(anchorTag.text());
Elements attributeSiblings = column.siblingElements();
for (Element attributeSibling : attributeSiblings) {
System.out.println(attributeSibling.text());
}
}
}
}
}
}
Attack
15
99
200,000,000

How do I check for empty tags while parsing xml?

I am using the Document object to extract all the tags from an xml. If the xml has an empty tag, I get a null pointer exception. How do I guard against this? How do I check for an empty tag?
<USTrade>
<CreditorId>
<CustomerNumber>xxxx</CustomerNumber>
<Name></Name>
<Industry code="FY" description="Factor"/>
</CreditorId>
<DateReported format="MM/CCYY">02/2012</DateReported>
<AccountNumber>54000</AccountNumber>
<HighCreditAmount>0000299</HighCreditAmount>
<BalanceAmount>0000069</BalanceAmount>
<PastDueAmount>0000069</PastDueAmount>
<PortfolioType code="O" description="Open Account (30, 60, or 90 day account)"/>
<Status code="5" description="120 Dys or More PDue"/>
<Narratives>
<Narrative code="GS" description="Medical"/>
<Narrative code="CZ" description="Collection Account"/>
</Narratives>
</USTrade>
<USTrade>
So, when I use:
NodeList nm = docElement.getElementsByTagName("Name");
if (nm.getLength() > 0)
name = nullIfBlank(((Element) nm.item(0))
.getFirstChild().getTextContent());
The NodeList has a length of 1, because the tag is there, but the getTextContent() call throws a NullPointerException because getFirstChild() returns null for the empty Name tag.
I have done this for each XML tag. Is there a simple check I can do before every tag extraction?
The first thing I would do would be to unchain your calls. This will give you the chance to determine exactly which reference is null and which reference you need to do a null check for:
NodeList nm = docElement.getElementsByTagName("Name");
if (nm.getLength() > 0) {
Node n = nm.item(0);
Node child = n.getFirstChild();
if(child == null) {
// null handling
name = null;
}
else {
name = nullIfBlank(child.getTextContent());
}
}
Also, check out the hasChildNodes() method on Node! http://docs.oracle.com/javase/1.4.2/docs/api/org/w3c/dom/Node.html#hasChildNodes%28%29
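A related option is to skip getFirstChild() entirely: Node.getTextContent() called on the element itself returns an empty string for an element with no children, so there is nothing to dereference. Below is a minimal sketch of a reusable helper built on that. textOrNull and parseRoot are hypothetical names, and the blank check is only an assumption about what the asker's nullIfBlank does.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

class DomText {
    // Returns the trimmed text of the first <tagName> under parent,
    // or null when the tag is missing, empty, or blank.
    static String textOrNull(Element parent, String tagName) {
        NodeList nodes = parent.getElementsByTagName(tagName);
        if (nodes.getLength() == 0) {
            return null;
        }
        // getTextContent() on an element with no children is "", not null.
        String text = nodes.item(0).getTextContent();
        return (text == null || text.trim().isEmpty()) ? null : text.trim();
    }

    // Test helper: parses an XML string and returns its root element.
    static Element parseRoot(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        Element root = parseRoot("<CreditorId><CustomerNumber>xxxx</CustomerNumber>"
                + "<Name></Name><Industry code=\"FY\"/></CreditorId>");
        System.out.println(textOrNull(root, "CustomerNumber"));
        System.out.println(textOrNull(root, "Name"));
    }
}
```

With a helper like this, each extraction becomes a single call such as name = textOrNull(docElement, "Name"); and the check does not have to be repeated per tag.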
while(current != null){
if(current.getNodeType() == Node.ELEMENT_NODE){
String nodeName = current.getNodeName();
System.out.println("\tNode: "+nodeName);
NamedNodeMap attributes = current.getAttributes();
System.out.println("\t\tNumber of Attributes: "+attributes.getLength());
for(int i=0; i<attributes.getLength(); i++){
Node attr = attributes.item(i);
String attName = attr.getNodeName();
String attValue= attr.getNodeValue();
System.out.println("\t\tAttribute Name: "+ attName+ "\tAttribute Value:"+ attValue);
}
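}
current = current.getNextSibling(); // assumed: the snippet was cut off before advancing to the next node
}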
Do you also want to print out the value of the node? If so, it's one line of code you would have to add to my example, and I can share that as well.
Did you try something like this?
NodeList nm = docElement.getElementsByTagName("Name");
if (nm.item(0) != null && nm.item(0).getFirstChild() != null)
name = nullIfBlank(((Element) nm.item(0)).getFirstChild().getTextContent());

how can i get data out of DIV using html parser in java

I am using the Java HTML Parser to try to parse this line.
<td class=t01 align=right><div id="OBJ123" name=""></div></td>
But I am looking for the value like I see on my web browser, which is a number. Can you help me get the value?
Please let me know if you need more details.
Thanks
From the documentation, all you have to do is find all of the DIV elements that also have an id of OBJ123 and take the first result's value.
NodeList nl = parser.parse(null); // you can also filter here
NodeList divs = nl.extractAllNodesThatMatch(
new AndFilter(new TagNameFilter("DIV"),
new HasAttributeFilter("id", "OBJ123")));
if( divs.size() > 0 ) {
Tag div = (Tag) divs.elementAt(0); // elementAt() returns a Node, so a cast is needed
String text = div.getText(); // this is the text of the div
}
UPDATE: if you're looking at the ajax url, you can use similar code like:
// make some sort of constants for all the positions
final int OPEN_PRICE = 0;
final int HIGH_PRICE = 1;
final int LOW_PRICE = 2;
// ....
NodeList nl = parser.parse(null); // you can also filter here
NodeList values = nl.extractAllNodesThatMatch(
new AndFilter(new TagNameFilter("TD"),
new HasAttributeFilter("class", "t1")));
if( values.size() > 0 ) {
Tag openPrice = (Tag) values.elementAt(OPEN_PRICE); // elementAt() returns a Node, so a cast is needed
String openPriceValue = openPrice.getText(); // this is the text of the table cell
}
