Parsing a link from HTML with Jsoup

I'm trying to get a link from a page, but it doesn't seem to work. I need to catch a single link containing a keyword, but it seems to catch nothing...
I use this code:
Element summonerLink = doc.select("a:[href]:contanis(summoner)").first();
summonerURL = summonerLink.attr("href");
I tried to use setText(summonerURL), but it comes up blank... (Of course, summonerURL is a String.)

This should work:
doc.select("a[href*=summoner]")
For more info, check the jsoup documentation and look for [attr*=value].
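As a minimal sketch of that selector in context: the selector in the question is not valid (there is a stray colon before [href], and :contains is misspelled), and :contains would match an element's text rather than its href in any case. The class name, method name, and HTML snippet below are made up for illustration, and the keyword is assumed to be a plain word with no characters that need escaping in a CSS query.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class SummonerLinkExample {

    // Returns the href of the first <a> whose href contains the keyword,
    // or null when no link matches.
    public static String firstLinkContaining(String html, String keyword) {
        Document doc = Jsoup.parse(html);
        // [href*=value] matches elements whose href attribute contains the substring.
        Element link = doc.select("a[href*=" + keyword + "]").first();
        // first() returns null when nothing matched, so guard before calling attr().
        return link == null ? null : link.attr("href");
    }

    public static void main(String[] args) {
        // Hypothetical markup, invented for this example.
        String html = "<a href=\"/about\">About</a>"
                + "<a href=\"/summoner/na/12345\">Profile</a>";
        System.out.println(firstLinkContaining(html, "summoner"));
        System.out.println(firstLinkContaining(html, "missing"));
    }
}
```

The null check matters here: if the selector matches nothing, calling attr("href") directly on the result of first() throws a NullPointerException.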

Related

Beginner to API Json - Attempting to understand URL output

I am using FME to get output from the following: https://coronavirus.data.gov.uk/developers-guide
I am just a beginner, and this is the first time I have tried something like this. Can anyone explain what I need in order to build a request URL that returns the relevant columns? I am using GET as the HTTP method.
An example of a URL I do get data out of is https://soa.smext.faa.gov/asws/api/airport/status/SFO
But when I try the links below, I do not get any output: https://api.coronavirus.data.gov.uk/v1/data?filters=areaType=nation;areaName=england&structure=%7B%22name%22:%22areaName%22%7D

https://api.coronavirus.data.gov.uk//v1/data?filters=areaType=nation;areaName=england&structure={"date":"date","areaName":"areaName","areaCode":"areaCode","newCasesByPublishDate":"2020-07-07","cumCasesByPublishDate":"2020-08-08","newDeathsByDeathDate":"2020-02-06","cumDeathsByDeathDate":"2020-06-09"}

This is a better link for you to try out.
https://api.coronavirus.data.gov.uk/v1/data?filters=areaType=nation;areaName=england&structure={%22date%22:%22date%22,%22areaName%22:%22areaName%22,%22areaCode%22:%22areaCode%22,%22newCasesByPublishDate%22:%22newCasesByPublishDate%22,%22cumCasesByPublishDate%22:%22cumCasesByPublishDate%22,%22newDeaths28DaysByDeathDate%22:%22newDeaths28DaysByDeathDate%22}
Took me a while to suss it out.
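Judging from the working link above, each value in the structure parameter is a metric name rather than a date. A rough sketch of assembling such a URL in Java follows; the class and method names are hypothetical, the code only builds the string and does not send the request, and it assumes the server accepts a fully percent-encoded query.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.stream.Collectors;

public class CovidApiUrl {

    // Builds the JSON "structure" value, mapping each output name to the
    // metric of the same name, e.g. {"date":"date","areaName":"areaName"}.
    public static String structureJson(String... metrics) {
        return Arrays.stream(metrics)
                .map(m -> "\"" + m + "\":\"" + m + "\"")
                .collect(Collectors.joining(",", "{", "}"));
    }

    // Assembles the request URL; both query values are percent-encoded.
    public static String requestUrl(String areaType, String areaName, String... metrics) {
        String filters = "areaType=" + areaType + ";areaName=" + areaName;
        return "https://api.coronavirus.data.gov.uk/v1/data"
                + "?filters=" + URLEncoder.encode(filters, StandardCharsets.UTF_8)
                + "&structure=" + URLEncoder.encode(structureJson(metrics), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(requestUrl("nation", "england",
                "date", "areaName", "areaCode", "newCasesByPublishDate"));
    }
}
```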

Document.select("a[href]") not getting all the href

I am using JSOUP to fetch the documents from a website.
Below is my code
String webPageUrl = "https://mwcc.ms.gov/#/electronicDataInterchange";
Document doc = Jsoup.connect(webPageUrl).get();
Elements links = doc.getElementsByAttribute("a[href]");
The line below is not working. It is supposed to return the matching elements but doesn't:
doc.getElementsByAttribute("a[href]")
Can someone please point out the mistake in my code?

That page seems to be an Angular application, which means it loads some (probably all or most) of its content via JavaScript scripts.
The fact that the URL contains the fragment separator # is already a strong indicator of that, because when you make an HTTP request, everything after that separator is cut off (i.e. not sent to the server), so the actual request will just be for https://mwcc.ms.gov/.
As far as I know JSoup does not support running JavaScript, so you might need to look into a more involved scraping tool (possibly running a full browser engine).
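To make the fragment point concrete, here is a small sketch using only the JDK. The class and method names are invented for this example; it splits the URL from the question into the part a client would request and the fragment that stays on the client side.

```java
import java.net.URI;

public class FragmentDemo {

    // Returns the URL with the fragment (everything from '#') dropped,
    // which is what would actually be requested from the server.
    public static String withoutFragment(String url) {
        int hash = url.indexOf('#');
        return hash < 0 ? url : url.substring(0, hash);
    }

    // Returns the fragment part of the URL, or null if there is none.
    public static String fragmentOf(String url) {
        return URI.create(url).getFragment();
    }

    public static void main(String[] args) {
        String url = "https://mwcc.ms.gov/#/electronicDataInterchange";
        System.out.println("requested: " + withoutFragment(url));
        System.out.println("fragment:  " + fragmentOf(url));
    }
}
```

Separately, getElementsByAttribute expects an attribute name such as "href", whereas a CSS query like a[href] belongs in select. Even with that corrected, the links can only be found if they are present in the HTML the server returns.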

Find the code that this line is referring to?

This is a batch downloader for images on Flickr. I'm curious about how the program gets the original url, so looking through the source code (Favorites.java line 271) I see this, but I wasn't able to find what it's referring to.
String originalUrl = null;
try {
originalUrl = curPhoto.getOriginalUrl();
} catch (FlickrException e) {
// if the original url just isn't available, fine. no need
// to panic.
}
https://github.com/magnusvk/flickrfaves
I'm using Netbeans right now and it's not finding anything when I click on any of the Navigate > Go To buttons on curPhoto. I'd imagine there's an easy way to find the code that it's referring to, but I don't really know what to search on google to learn how to do it.
My question is, where can I find the code for curPhoto.getOriginalUrl() and how should I be finding things like this on my own?

Looks like they are using flickrj to interface with Flickr.
curPhoto is of type Photo, and if you look in the imports, Photo is imported as import com.aetrion.flickr.photos.Photo;. I did a google search for com.aetrion.flickr and it turned up that library.
The documentation for that function can be found here

Url working in Google chrome inaccessible by Java w/Jsoup?

I'm having quite a confusing problem. I have literally only been doing networking for a day, so please forgive me and I apologize if I am making a dumb error. My issue is that I cannot access a URL in a programmatic fashion which I can access through copy-pasting into chrome.
I am using a library called jsoup (http://jsoup.org/apidocs/) which parses text out of raw html from a website. My goal in general is to use a base url to which I can attach a string, and get a webpage from it. I am using the code (edit for those who asked for more code, I know this is still sparse but this is the only code preceding the error)
String url = "https://www.google.com/search?q=definition+of+";
url += search; //search is the passed in string
Document doc = Jsoup.connect(url).get(); //url is the String in question
to get the webpage. My ultimate goal is to use this method to get the text of the box at the top of chrome searches when you search for the definition of a word, i.e. the box at the top here: https://www.google.com/search?q=definition+of+apple
However, I run into an issue when I attempt to use the above link as my url: I get an org.jsoup.HttpStatusException, so I think it is a networking problem. What causes this url to work when typed into chrome, but not in Java? (I would also not be averse to different ways to get the information in that box, since my current method feels a bit roundabout.)
The full error message (edited in)
Exception in thread "main" org.jsoup.HttpStatusException: HTTP error fetching URL. Status=403, URL=https://www.google.com/search?q=definition+of+apple
at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:435)
at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:410)
at org.jsoup.helper.HttpConnection.execute(HttpConnection.java:164)
at org.jsoup.helper.HttpConnection.get(HttpConnection.java:153)
at test.Test.parseDef(Test.java:68)
at test.Test.main(Test.java:112)
To whomever answers, thank you for spending your time to help a networking newbie!

Most likely, Google is accurately identifying your program as a "robot" and acting accordingly. Google encourages robots to use the Google Custom Search API and discourages them from using the human-oriented search interface.
In fact, all web spiders are supposed to check robots.txt, right? Here is Google's: http://www.google.com/robots.txt. Note that /search is disallowed.
Please see this question for further information; it's basically the Python version of yours: "Why does Google Search return HTTP Error 403?"

If you use Jsoup you have to replace spaces with %20 and not with +.
Try this URL:
https://www.google.com/search?q=definition%20of%20apple
String url = "https://www.google.com/search?q=definition%20of%20";
url += search; //search is the passed in string
Document doc = Jsoup.connect(url).get();
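If the search term itself can contain spaces or other special characters, it needs encoding too. The following is only a sketch of one way to do that with the JDK (the class and method names are hypothetical): URLEncoder produces form encoding, where a space becomes +, so the result is post-processed to use %20 as suggested above. It builds the URL only; whether this changes the 403 response is not something the sketch can show.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class SearchUrl {

    private static final String BASE = "https://www.google.com/search?q=";

    // Builds the search URL for "definition of <search>" with spaces as %20.
    public static String build(String search) {
        // URLEncoder turns spaces into '+' and literal '+' into %2B,
        // so replacing '+' afterwards only touches the encoded spaces.
        String encoded = URLEncoder.encode("definition of " + search, StandardCharsets.UTF_8)
                .replace("+", "%20");
        return BASE + encoded;
    }

    public static void main(String[] args) {
        System.out.println(build("apple"));
        System.out.println(build("ice cream"));
    }
}
```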

public static void main(String[] args) throws IOException {
    // link holds the URL to request; it must be defined elsewhere.
    Document doc = Jsoup.connect(link)
            .data("query", "Java")
            .userAgent("Mozilla")
            .cookie("auth", "token")
            .timeout(1000)
            .post();
}

Same Jsoup code behaving differently on Android and desktop

I've got some simple, five-line Jsoup code that parses some strings. On desktop it runs smoothly and returns an array list with the values that I want; however, on an Android emulator and on a phone, it just returns nothing without even giving an error.
That's the whole code:
Document doc = Jsoup.connect(myURL).get();
Elements els = doc.select("div font a");
for (int i = 3; i < els.size(); i++) {
latestNews.add(els.get(i).text());
}
On desktop, it adds elements to the array list; on the device, however, nothing happens. Could anyone help with this?

Are you sure you are receiving the same HTML from the site? You should debug and check your doc variable to make sure it contains the HTML you expect. It is possibly a case of grabbing the mobile site when your selectors are written for the full site (I'm not sure whether Jsoup prevents getting the mobile site or not). You likely need to set the user agent so that you receive the full desktop variant of the website.
For example:
Document doc = Jsoup.connect(myURL).userAgent("Mozilla").get();
