Scraping wikipedia URLs through jsoup using keywords/macros - java

I'm not sure what the best way to frame this question is but here goes:
I have a program that needs to do a keyword search on the web, find the corresponding URLs, and in particular the Wikipedia URL, so it can extract summary information from the page. The question is: using jsoup, instead of supplying the entire URL, can I build it from input like this - http://en.wikipedia.org/wiki/keyword - where keyword is user input and can be anything like Coffee, Hinduism, World War II, etc.?
Also, how would I check whether the corresponding link exists?

If I understand the question correctly, you want to connect to Wikipedia pages based on a keyword input and get back a jsoup Document that you can then work with.
If that's the case, you could write a method that takes a keyword parameter and returns the document from the corresponding Wikipedia page:
Document getWikiDocumentByKeyword(String keyword) throws IOException {
    // Wikipedia article titles use underscores in place of spaces,
    // e.g. "World War II" -> "World_War_II"
    String title = keyword.trim().replace(' ', '_');
    return Jsoup.connect("https://en.wikipedia.org/wiki/" + title).get();
}
This will connect to the page for the given keyword and return the Document, which you can then do whatever you like with.
To address your second question: this method throws an exception if it can't connect to the page for whatever reason; in particular, a page that doesn't exist produces an HttpStatusException (a subclass of IOException) with status 404.
You would then want to handle that exception wherever you call the method:
try {
    Document document = getWikiDocumentByKeyword("Coffee");
} catch (IOException ex) {
    // Trouble connecting; handle the exception however you want.
    System.out.println(ex);
}
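If you would rather test whether the page exists without relying on exceptions, jsoup's Connection API can issue a HEAD request and be told not to throw on HTTP errors. A minimal sketch (the helper name and the 200-only check are my own choices, not from the original answer):
boolean wikiPageExists(String keyword) throws IOException {
    Connection.Response res = Jsoup
            .connect("https://en.wikipedia.org/wiki/" + keyword.trim().replace(' ', '_'))
            .method(Connection.Method.HEAD) // fetch headers only, no body
            .ignoreHttpErrors(true)         // return the response instead of throwing on 404
            .execute();
    return res.statusCode() == 200;         // 404 means the article doesn't exist
}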

Related

Document.select("a[href]") not getting all the href

I am using jsoup to fetch documents from a website.
Below is my code:
String webPageUrl = "https://mwcc.ms.gov/#/electronicDataInterchange";
Document doc = Jsoup.connect(webPageUrl).get();
Elements links = doc.getElementsByAttribute("a[href]");
The line below is not working. It is supposed to return elements but doesn't:
doc.getElementsByAttribute("a[href]")
Can someone please point out the mistake in my code?
That page seems to be an Angular application, which means it loads some (probably all or most) of its content via JavaScript scripts.
The fact that the URL contains the fragment separator # is already a strong indicator: when you make an HTTP request, everything after that separator is cut off (i.e. not sent to the server), so the actual request is just for https://mwcc.ms.gov/.
As far as I know JSoup does not support running JavaScript, so you might need to look into a more involved scraping tool (possibly running a full browser engine).
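As a side note, even on a page with static HTML the original call would return nothing: getElementsByAttribute() expects a bare attribute name such as "href", while "a[href]" is a CSS selector and belongs in select(). A small sketch of the working pattern on a static page (the URL is a placeholder):
Document doc = Jsoup.connect("https://example.com/").get();
Elements links = doc.select("a[href]");        // CSS selector: every anchor with an href
for (Element link : links) {
    System.out.println(link.attr("abs:href")); // resolve relative URLs against the base
}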

Parsing link from HTML with Jsoup

I'm trying to get a link from a page, but it doesn't seem to be working. I need to catch a single link containing a keyword, but it seems to catch nothing...
I use this code:
Element summonerLink = doc.select("a:[href]:contanis(summoner)").first();
summonerURL = summonerLink.attr("href");
I tried to use setText(summonerURL), but it comes up blank... (summonerURL is, of course, a String.)
This should work:
doc.select("a[href*=summoner]")
For more info, check the jsoup Selector documentation and look for the [attr*=value] syntax.
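Put together with the original code, a minimal sketch (with a null check added, since first() returns null when nothing matches):
Element summonerLink = doc.select("a[href*=summoner]").first();
if (summonerLink != null) {                             // null when no link contains "summoner"
    String summonerURL = summonerLink.attr("abs:href"); // absolute form of the href
    System.out.println(summonerURL);
}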

Url working in Google chrome inaccessible by Java w/Jsoup?

I'm having quite a confusing problem. I have literally only been doing networking for a day, so please forgive me, and I apologize if I am making a dumb error. My issue is that I cannot access a URL programmatically that I can access by copy-pasting it into Chrome.
I am using a library called jsoup (http://jsoup.org/apidocs/), which parses text out of raw HTML from a website. My goal in general is to take a base URL, attach a string to it, and get a webpage from it. I am using this code (edited in for those who asked for more code; I know this is still sparse, but this is the only code preceding the error):
String url = "https://www.google.com/search?q=definition+of+";
url += search; //search is the passed in string
Document doc = Jsoup.connect(url).get(); //url is the String in question
to get the webpage. My ultimate goal is to use this method to get the text of the box at the top of a Chrome search when you search for the definition of a word, i.e. the box at the top here: https://www.google.com/search?q=definition+of+apple
However, I run into an issue when I use the above link as my url: I get an org.jsoup.HttpStatusException, so I think it is a networking problem. What causes this URL to work when typed into Chrome, but not in Java? (I would also not be averse to different ways to get the information in that box, since my current method feels a bit roundabout.)
The full error message (edited in):
Exception in thread "main" org.jsoup.HttpStatusException: HTTP error fetching URL. Status=403, URL=https://www.google.com/search?q=definition+of+apple
    at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:435)
    at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:410)
    at org.jsoup.helper.HttpConnection.execute(HttpConnection.java:164)
    at org.jsoup.helper.HttpConnection.get(HttpConnection.java:153)
    at test.Test.parseDef(Test.java:68)
    at test.Test.main(Test.java:112)
To whoever answers, thank you for spending your time to help a networking newbie!
Most likely, Google is accurately identifying your program as a "robot" and acting accordingly. Google encourages robots to use the Google Custom Search API and discourages them from using the human-oriented search interface.
In fact, all web spiders are supposed to check robots.txt, right? Here is Google's: http://www.google.com/robots.txt. Note that /search is disallowed.
Please see this question for further information; it's basically the Python version of your question: Why does Google Search return HTTP Error 403?
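For completeness, a rough sketch of calling the Custom Search JSON API with the JDK 11+ java.net.http client; YOUR_API_KEY and YOUR_ENGINE_ID are placeholders you get from Google when you register a custom search engine:
HttpClient client = HttpClient.newHttpClient();
String url = "https://www.googleapis.com/customsearch/v1"
        + "?key=YOUR_API_KEY&cx=YOUR_ENGINE_ID&q=definition+of+apple";
HttpResponse<String> res = client.send(
        HttpRequest.newBuilder(URI.create(url)).build(),
        HttpResponse.BodyHandlers.ofString());
System.out.println(res.body()); // results arrive as JSON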
If you use Jsoup, you have to encode spaces as %20 rather than +.
Try this URL:
https://www.google.com/search?q=definition%20of%20apple
String url = "https://www.google.com/search?q=definition%20of%20";
url += search; //search is the passed in string
Document doc = Jsoup.connect(url).get();
A fuller example (the undefined link variable is filled in here; note the user agent, which helps avoid Google's bot blocking):
public static void main(String[] args) throws IOException {
    String link = "https://www.google.com/search?q=definition%20of%20apple";
    Document doc = Jsoup.connect(link)
            .userAgent("Mozilla")  // a browser-like User-Agent; jsoup's default is often blocked
            .timeout(10_000)       // milliseconds
            .get();                // a search URL is fetched with GET, not POST
}

Gathering data from inconsistent HTML pages - JSoup

I'm trying to get a lot of data from multiple pages, but it's not always consistent. Here is an example of the HTML I am working with:
Example HTML
I need to get something like Team | Team | Result into different variables or lists.
I just need some help on where to start, because the main table I'm working with isn't the same on every page.
Here's my Java so far:
try {
    Document team_page = Jsoup.connect("http://www.soccerstats.com/team.asp?league=" + league + "&teamid=" + teamNumber).get();
    Element home_team = team_page.select("[class=homeTitle]").first();
    String teamName = home_team.text();
    System.out.println(teamName + "'s Latest Results: ");
    Elements main_page = team_page.select("[class=stat]");
    System.out.println(main_page);
} catch (IOException e) {
    System.out.println("unable to parse content");
}
I am getting the league and teamid from different methods of my program.
Thanks!
Yes. This is one of the problems with webpage scraping.
You have to figure out one or more heuristics that will extract the information that you need across all of the pages that you need to access. There's no magic bullet. Just hard work. (And you'll have to do it all over again if the site changes its page layout.)
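To make the heuristics idea concrete, here is a minimal sketch (the selector strings are hypothetical, not taken from soccerstats.com): try a list of candidate selectors in order and fall back gracefully when a page's layout doesn't match.
static String firstMatch(Document page, String... selectors) {
    for (String sel : selectors) {
        Element el = page.selectFirst(sel); // null when the selector matches nothing
        if (el != null) {
            return el.text();
        }
    }
    return null; // none of the heuristics matched this page's layout
}
// Usage: firstMatch(team_page, ".homeTitle", "h1.team-name", "title");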
A better idea is to request the information as XML or JSON using the site or sites' RESTful APIs ... assuming they exist and are available to you.
(And if you continue with the web-scraping approach, check the site's Terms of Service to make sure that your activity is acceptable.)

Alert if webpage has been updated

I am creating an app in Java that checks whether a webpage has been updated.
However, some webpages don't have a "Last-Modified" header.
I even tried checking for a change in content length, but this method is not reliable: sometimes the content length changes without any modification to the webpage, giving a false alarm.
I really need some help here, as I am not able to think of a single foolproof method.
Any ideas?
If you poll the webpage continuously, code like this can help:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class Main {

    public static void main(String[] args) {
        String updatecheck = "";
        // Constantly trying to load the page
        while (true) {
            try {
                System.out.println("Loading page...");
                // Connecting to a website with Jsoup
                Document doc = Jsoup.connect("URL").userAgent("CHROME").get();
                // Selecting a part of this website with Jsoup
                String pick = doc.select("div.selection").get(0).text();
                // Printing out when the selected part has changed
                // (strings are compared with equals(), not != )
                if (!updatecheck.equals(pick)) {
                    updatecheck = pick;
                    System.out.println("Page is changed.");
                }
                Thread.sleep(60_000); // pause between polls so the server isn't hammered
            } catch (Exception e) {
                e.printStackTrace();
                System.out.println("Exception occurred... going to retry...\n");
            }
        }
    }
}

How to get notified after a webpage changes instead of refreshing?

Probably the most reliable option would be to store a hash of the page content.
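A minimal sketch of that idea, using java.security.MessageDigest and assuming you only care about the visible body text (the choice of SHA-256 is arbitrary):
static String pageHash(String url) throws Exception {
    String body = Jsoup.connect(url).get().body().text(); // visible text only, so markup churn is ignored
    byte[] digest = MessageDigest.getInstance("SHA-256")
            .digest(body.getBytes(StandardCharsets.UTF_8));
    StringBuilder hex = new StringBuilder();
    for (byte b : digest) {
        hex.append(String.format("%02x", b));
    }
    return hex.toString(); // compare against the previously stored hash
}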
If the content length changes, then the webpages you are trying to check are probably dynamically generated and not static in nature. If that is the case, then even checking the Last-Modified header won't reflect changes in content in most cases anyway.
I guess the only solution would be a page-specific one: for one page you could parse and look for content changes in some parts of the page, another page you could check by its Last-Modified header, and some other pages you would have to check using the content length. In my opinion there is no way to do it in a unified way for all pages on the internet. Another option would be to talk with the people developing the pages you are checking about markers that help you determine whether the page changed, but that of course depends on your specific use case and what you are doing with it.
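For the pages that do expose it, reading the Last-Modified header is straightforward; a sketch with the JDK 11+ java.net.http client (the URL is a placeholder):
HttpResponse<Void> res = HttpClient.newHttpClient().send(
        HttpRequest.newBuilder(URI.create("https://example.com/"))
                .method("HEAD", HttpRequest.BodyPublishers.noBody()) // headers only
                .build(),
        HttpResponse.BodyHandlers.discarding());
// the Optional is empty when the server doesn't send the header
System.out.println(res.headers().firstValue("Last-Modified").orElse("(absent)"));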
