Url working in Google chrome inaccessible by Java w/Jsoup?


I'm having quite a confusing problem. I have only been doing networking for a day, so please forgive me if I am making a dumb error. My issue is that I cannot programmatically access a URL that I can open by copy-pasting it into Chrome.
I am using a library called jsoup (http://jsoup.org/apidocs/), which parses text out of raw HTML from a website. My general goal is to take a base URL, append a string to it, and fetch the resulting web page. I am using this code (edit for those who asked for more code: I know this is still sparse, but it is the only code preceding the error):
String url = "https://www.google.com/search?q=definition+of+";
url += search; //search is the passed in string
Document doc = Jsoup.connect(url).get(); //url is the String in question
to get the webpage. My ultimate goal is to use this method to get the text of the box at the top of Google's results when you search for the definition of a word, i.e. the box at the top here: https://www.google.com/search?q=definition+of+apple
However, I run into an issue when I use the above link as my URL: I get an org.jsoup.HttpStatusException, so I think it is a networking problem. Why does this URL work when typed into Chrome, but not from Java? (I would also not be averse to different ways of getting the information in that box, since my current method feels a bit roundabout.)
The full error message (edited in):
Exception in thread "main" org.jsoup.HttpStatusException: HTTP error fetching URL. Status=403, URL=https://www.google.com/search?q=definition+of+apple
at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:435)
at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:410)
at org.jsoup.helper.HttpConnection.execute(HttpConnection.java:164)
at org.jsoup.helper.HttpConnection.get(HttpConnection.java:153)
at test.Test.parseDef(Test.java:68)
at test.Test.main(Test.java:112)
To whomever answers, thank you for spending your time to help a networking newbie!

Most likely, Google is accurately identifying your program as a "robot" and acting accordingly. Google encourages robots to use the Google Custom Search API and discourages them from using the human-oriented search interface.
In fact, all web spiders are supposed to check robots.txt, right? Here is Google's: http://www.google.com/robots.txt. Note that /search is disallowed.
Please see this question for further information; it's basically the Python version of your question: Why does Google Search return HTTP Error 403?
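As a rough illustration of the API route, the sketch below only builds a request URL for the Custom Search JSON API using the standard library. The class name is made up, and YOUR_API_KEY and YOUR_ENGINE_ID are placeholders for credentials you would have to create yourself; it does not send the request or parse the JSON response.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

class CustomSearchUrl {
    // Endpoint of the Custom Search JSON API.
    private static final String ENDPOINT = "https://www.googleapis.com/customsearch/v1";

    /** Builds the request URL; apiKey and engineId are values you must supply. */
    static String build(String apiKey, String engineId, String query) {
        return ENDPOINT
                + "?key=" + URLEncoder.encode(apiKey, StandardCharsets.UTF_8)
                + "&cx=" + URLEncoder.encode(engineId, StandardCharsets.UTF_8)
                + "&q=" + URLEncoder.encode(query, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Placeholder credentials, for illustration only.
        System.out.println(build("YOUR_API_KEY", "YOUR_ENGINE_ID", "definition of apple"));
    }
}
```

The response is JSON rather than HTML, so it would be read with a JSON parser instead of jsoup.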

If you use jsoup, you have to replace spaces with %20 and not with +.
Try this URL:
https://www.google.com/search?q=definition%20of%20apple
String url = "https://www.google.com/search?q=definition%20of%20";
url += search; //search is the passed in string
Document doc = Jsoup.connect(url).get();
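Whichever form of space is used, it is safer to encode the passed-in search string than to concatenate it raw, because it may contain characters such as & or +. A minimal sketch using only the standard library (the class and method names are hypothetical):

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

class QueryEncoding {
    /** Percent-encodes a search string, writing spaces as %20 rather than +. */
    static String encodeQuery(String search) {
        // URLEncoder performs form encoding, so a space becomes "+".
        String formEncoded = URLEncoder.encode(search, StandardCharsets.UTF_8);
        // A literal "+" in the input is already "%2B" at this point,
        // so every remaining "+" stands for a space.
        return formEncoded.replace("+", "%20");
    }

    public static void main(String[] args) {
        String url = "https://www.google.com/search?q=definition%20of%20" + encodeQuery("ice cream");
        System.out.println(url);
    }
}
```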

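import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class Example {
    public static void main(String[] args) throws IOException {
        String link = "https://example.com/"; // placeholder: replace with the page you want
        Document doc = Jsoup.connect(link)
                .data("query", "Java")
                .userAgent("Mozilla")   // send a browser-like User-Agent header
                .cookie("auth", "token")
                .timeout(1000)          // milliseconds
                .post();
    }
}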

Related

Document.select("a[href]") not getting all the href

I am using JSOUP to fetch the documents from a website.
Below is my code
String webPageUrl = "https://mwcc.ms.gov/#/electronicDataInterchange";
Document doc = Jsoup.connect(webPageUrl).get();
Elements links = doc.getElementsByAttribute("a[href]");
Below line of code is not working. It is supposed to return an element but doesn't:
doc.getElementsByAttribute("a[href]")
Can someone please point out the mistake in my code?

That page seems to be an Angular application, which means it loads some (probably all or most) of its content via JavaScript scripts.
The fact that the URL contains the fragment separator # is already a strong indicator of that, because when you make an HTTP request, everything after the separator is cut off (i.e. not sent to the server), so the actual request will just be for https://mwcc.ms.gov/.
As far as I know, jsoup does not support running JavaScript, so you might need to look into a more involved scraping tool (possibly one running a full browser engine).
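The split between the requested path and the fragment can be seen with java.net.URI alone. This is only a small sketch (the class and method names are invented for illustration):

```java
import java.net.URI;

class FragmentDemo {
    /** The path component, which is what an HTTP client asks the server for. */
    static String pathSentToServer(String url) {
        return URI.create(url).getRawPath();
    }

    /** The fragment, which stays on the client and is never part of the request. */
    static String fragmentKeptOnClient(String url) {
        return URI.create(url).getRawFragment();
    }

    public static void main(String[] args) {
        String url = "https://mwcc.ms.gov/#/electronicDataInterchange";
        System.out.println("path:     " + pathSentToServer(url));
        System.out.println("fragment: " + fragmentKeptOnClient(url));
    }
}
```

The route /electronicDataInterchange lives entirely in the fragment, so the server only ever sees a request for the root path, and the client-side script decides what to render.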

Scraping wikipedia URLs through jsoup using keywords/macros

I'm not sure what the best way to frame this question is but here goes:
I have a program that needs to do a keyword search on the web and looks for corresponding URLs and in particular the Wikipedia URLs to extract summary information from it. The question is - using jsoup, instead of giving the entire URL can I input information like this - http://en.wikipedia.org/wiki/keyword where keyword is a user input and can be anything like Coffee, Hinduism, World War II etc.
Also how would I check if the corresponding link exists?

If I understand the question correctly, you want to connect to Wikipedia pages based off of a keyword input and get out a JSoup Document that you can apply JSoup to.
If that's the case, you could make a method that takes in a keyword parameter and returns the document from the Wikipedia page:
Document getWikiDocumentByKeyword(String keyword) throws Exception {
Document doc = Jsoup.connect("https://en.wikipedia.org/wiki/" + keyword).get();
return doc;
}
This will connect to the page for the given keyword and return the Document which you can do whatever you like with.
To address your second question, this method will throw an exception if it can't connect to the page for whatever reason (e.g. the page doesn't exist).
You would then want to handle that exception in your call of the method somehow:
try{
Document document = getWikiDocumentByKeyword("Coffee");
} catch(Exception ex){
// Trouble connecting, handle the exception however you want.
System.out.println(ex);
}
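One detail the method above glosses over is that keywords such as "World War II" contain spaces. A possible way to build the URL first, assuming Wikipedia's usual convention of underscores in page titles (the class and method names here are hypothetical):

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

class WikiUrl {
    private static final String BASE = "https://en.wikipedia.org/wiki/";

    /** Turns user input such as "World War II" into a Wikipedia article URL. */
    static String forKeyword(String keyword) {
        // Page titles use underscores where the keyword has spaces.
        String title = keyword.trim().replace(' ', '_');
        // Percent-encode anything else that is not safe in a URL path.
        return BASE + URLEncoder.encode(title, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(forKeyword("World War II"));
        System.out.println(forKeyword("Coffee"));
    }
}
```

The resulting string can be passed to Jsoup.connect(...). When the page is missing, jsoup throws an org.jsoup.HttpStatusException, and checking its getStatusCode() for 404 distinguishes "no such article" from other connection problems.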

I am coding in Android Studio, and I need to fetch and display a specific line of data from a specific webpage

I am very new to coding in Java/Android Studio. I have everything set up that I have been able to figure out thus far. I have a button, and I need to put code inside the button's click event that will fetch information from a website, convert it to a string, and display it. I figured I would have to use the HTML source in order to do this, so I have installed the Jsoup HTML parser. All of the help with Jsoup I have found only gets me as far as loading the HTML into a "Document", and I am not sure that is the best way to accomplish what I need. Can anyone tell me what code to use to fetch the HTML from the website, search through it for a specific match, and convert that match to a string? Or can anyone tell me if there is a better way to do this? I only need to grab one piece of information and display it.
Here is the piece of html code that contains the value I want:
writeBidRow('Wheat',-60,false,false,false,0.5,'01/15/2015','02/26/2015','All',' ',' ',60,'even','c=2246&l=3519&d=G15',quotes['KEH15'], 0-0);
I need to grab and display whatever value represents the quotes['KEH15'], in that html code.
Thank you in advance for your help.
Keith

Grabbing raw HTML is an extremely tedious way to access information from the web, bad practice, and difficult to maintain in the case that wherever you are fetching the info from changes their HTML.
I don't know your specific situation and what the data is that you are fetching, but if there is another way for you to fetch that data via an API, use that instead.
Since you say you are pretty new to Android and Java, let me explain something I wish had been explained to me very early on (although I am mostly self taught).
The way people access information across the Internet is traditionally through HTML and JavaScript (which is interpreted by your browser like Chrome or Firefox to look pretty), which are transferred over the internet using the protocol called HTTP. This is a great way for humans to communicate with computers that are far away, and the average person probably doesn't realize that there is more to the internet than this--your browser and the websites you can go to.
Although there are multiple methods, for the purpose of what I think you're looking for, applications communicate over the internet a slightly different way:
When an android application asks a server for some information, rather than returning HTML and JavaScript which is intended for human consumption, the server will (traditionally) return what's called JSON (or sometimes XML, which is very similar). JSON is a very simple way to get information about an object, and put it into a form that is readable easily by both humans (developers) and computers, and can be transmitted over the internet easily. For example, let's say you ask a server for some kind of "Video" object for an app that plays video, it may give you something like this:
{
  "name": "Gangnam Style",
  "metadata": {
    "url": "https://www.youtube.com/watch?v=9bZkp7q19f0",
    "views": 2000000000,
    "ageRestricted": false,
    "likes": 43434,
    "dislikes": 124
  },
  "comments": [
    {
      "username": "John",
      "comment": "10/10 would watch again"
    },
    {
      "username": "Jane",
      "comment": "12/10 with rice"
    }
  ]
}
That is very readable by us humans, but also by computers! We know the name is "Gangnam Style", the link of the video, etc.
A very helpful way to work with JSON in Java and Android is Google's Gson library, which lets you serialize a Java object to JSON or parse JSON into a Java object.
To get this information in the first place, you have to make a network call to an API, Application Programming Interface. Just a fancy term for communication between a server and a client. One very cool, free, and easy to understand API that I will use for this example is the OMDB API, which just spits back information about movies from IMDB. So how do you talk to the API? Well luckily they've got some nice documentation, which says that to get information on a movie we need to use some parameters in the url, like perhaps
http://www.omdbapi.com/?t=Interstellar
They want a title with the parameter "t". We could put a year, or return type, but this should be good to understand the basics. If you go to that URL in your browser, it spits back lots of information about Interstellar in JSON form. That stuff we were talking about! So how would you get this information from your Android application?
Well, you could use Android's built-in HttpURLConnection class and spend a few hours researching why your calls aren't working. But doesn't essentially every app now use networking? Why reinvent the wheel when virtually every valuable app out there has probably done this work before? Perhaps we can find some code online to do this work for us.
Or even better, a library! In particular, Retrofit, an open-source library developed by Square. There are multiple libraries like it (go ahead and research them; it's best to find the right fit for your project), but the idea is that they do the hard work, such as low-level network programming, for you. Following their guides, you can reduce a lot of code to just a few lines. So for our OMDB API example, we can set up our network calls like this:
import com.google.gson.JsonObject;
import retrofit.Callback;
import retrofit.RestAdapter;
import retrofit.http.GET;
import retrofit.http.Query;

// OMDB API client (Retrofit 1.x style)
public class ApiClient {
    // Cached instance of the API interface
    private static OmdbApiInterface sOmdbApiInterface;

    // Return the cached interface, building it on first use.
    public static OmdbApiInterface getOmdbApiClient() {
        if (sOmdbApiInterface == null) {
            RestAdapter restAdapter = new RestAdapter.Builder()
                    .setEndpoint("http://www.omdbapi.com")
                    .build();
            sOmdbApiInterface = restAdapter.create(OmdbApiInterface.class);
        }
        return sOmdbApiInterface;
    }

    public interface OmdbApiInterface {
        // GET /?t=<title>; the result is delivered to the callback.
        @GET("/")
        void getInfo(@Query("t") String title, Callback<JsonObject> callback);
    }
}
After you have researched and understand what's going on up there using their documentation, we can now use this class that we have set up anywhere in your application to call the API:
// You could take a user-input string and pass it in as movieName.
ApiClient.getOmdbApiClient().getInfo(movieName, new Callback<JsonObject>() {
    // Retrofit parses the JSON for you, so you can read fields straight from the JsonObject.
    @Override
    public void success(JsonObject movie, Response response) {
        Log.i("TAG", "Movie name is " + movie.get("Title").getAsString());
    }

    @Override
    public void failure(RetrofitError error) {
        Log.e("TAG", error.getMessage());
    }
});
Now you've made an API call to get info from across the web! Congratulations! Now do what you want with the data. In this case we used Omdb but you can use anything that has this method of communication. For your purposes, I don't know exactly what data you are trying to get, but if it's possible, try to find a public API or something where you can get it using a method similar to this.
Let me know if you've got any questions.
Cheers!

As @caleb-allen said, if an API is available to you, it's better to use that.
However, I'm assuming that the web page is all you have to work with.
There are many libraries that can be used on Android to get the content of a URL.
Choices range from the bare-bones HttpURLConnection, to the slightly higher-level HttpClient, to robust libraries like Retrofit. I personally recommend Retrofit. Whatever you do, make sure that your HTTP access is asynchronous and not done on the UI thread. Retrofit will handle this for you by default.
For parsing the results, I've had good results in the past using the open-source HtmlCleaner library - see http://htmlcleaner.sourceforge.net
Similar to jsoup, it takes a possibly badly formed HTML document and creates a valid XML document from it.
Once you have a valid XML document, you can use HtmlCleaner's implementation of the XML DOM to parse the document and find what you need.
Here, for example, is a method that I use to parse the names of 'projects' from a <table> element on a web page where projects are links within the table:
private List<Project> parseProjects(String html) throws Exception {
    List<Project> parsedProjects = new ArrayList<Project>();
    HtmlCleaner pageParser = new HtmlCleaner();
    TagNode node = pageParser.clean(html);

    // Select the table that holds the project links.
    String xpath = "//table[@class='listtable']";
    Object[] tables = node.evaluateXPath(xpath);
    TagNode tableNode;
    if (tables.length > 0) {
        tableNode = (TagNode) tables[0];
    } else {
        throw new Exception("projects table not found in html");
    }

    // Every <a> inside the table is one project; its id is the value after '=' in the href.
    TagNode[] projectLinks = tableNode.getElementsByName("a", true);
    for (int i = 0; i < projectLinks.length; i++) {
        TagNode link = projectLinks[i];
        String projectName = link.getText().toString();
        String href = link.getAttributeByName("href");
        String projectIdString = href.split("=")[1];
        int projectId = Integer.parseInt(projectIdString);
        Project project = new Project(projectId, projectName);
        parsedProjects.add(project);
    }
    return parsedProjects;
}
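For the specific line in the question, the value sits inside an inline script call rather than in an element, so a DOM query will not reach it directly. One option is a regular expression over the script text. The sketch below is only an illustration (the class and method names are made up), and it extracts just the key inside quotes['...']; the number itself would come from wherever the page defines the quotes object, which the question does not show.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuoteKeyFinder {
    // Matches quotes['KEY'] and captures KEY.
    private static final Pattern QUOTE_REF = Pattern.compile("quotes\\['([^']+)'\\]");

    /** Returns the first key referenced as quotes['...'], if any. */
    static Optional<String> findQuoteKey(String scriptText) {
        Matcher m = QUOTE_REF.matcher(scriptText);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        String line = "writeBidRow('Wheat',-60,false,false,false,0.5,'01/15/2015','02/26/2015',"
                + "'All',' ',' ',60,'even','c=2246&l=3519&d=G15',quotes['KEH15'], 0-0);";
        System.out.println(findQuoteKey(line).orElse("not found"));
    }
}
```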

If you have permission to edit the web page, you can add an anchor to the line you want and link straight to it.
First, give the element at the start of that line an id in your page, for example:
<h1 id="target">your heading text, if you want one</h1>
Then, in your app, put this in the control's click handler:
this.mWebView.loadUrl("https:#######.com.html#target");
Put the address of your web page to the left of the #, followed by #target (in this example the id is target).
Excuse me if my English isn't good.

HtmlUnit to take snapshot of Ajax applications

I created a basic GWT (Google Web Toolkit) Ajax application, and now I'm trying to create snapshots so that crawlers can read the page.
I created a Servlet that responds to the crawlers using HtmlUnit.
My application runs perfectly in a browser. But in HtmlUnit, it throws a lot of errors about the special characters I have in the HTML. These characters are content, and I would rather not replace them with character entities just because of HtmlUnit, since the page currently works. (At the very least I should first check whether I'm using HtmlUnit correctly.)
I think HtmlUnit should read the page's charset information and render it as a browser would, since that is the objective of the project.
I haven't found good information about this problem. Is this an HtmlUnit limitation? Do I need to change all the content of my website to use this Java library to take snapshots?
Here's my code:
if ((queryString != null) && (queryString.contains("_escaped_fragment_"))) {
// ok its the crawler
// rewrite the URL back to the original #! version
// remember to unescape any %XX characters
url = URLDecoder.decode(url, "UTF-8");
String ajaxURL = url.replace("?_escaped_fragment_=", "#!");
final WebClient webClient = new WebClient(BrowserVersion.FIREFOX_24);
HtmlPage page = webClient.getPage(ajaxURL);
// important! Give the headless browser enough time to execute JavaScript
// The exact time to wait may depend on your application.
webClient.waitForBackgroundJavaScript(3000);
// return the snapshot
response.getWriter().write(page.asXml());

The problem was XML conflicting with the HTML. @ColinAlworth's comments helped me.
I followed Google's example, and it was not working.
To make it work, you need to remove the XML declaration so that only the HTML is sent in the response, changing the line:
// return the snapshot
response.getWriter().write(page.asXml());
to
response.getWriter().write(page.asXml().replaceFirst("<\\?.*>",""));
Now it's rendering.
But although the page is rendered, the CSS is not working, and the DOM is not updated (GWT updates the page title when the page opens). HtmlUnit threw a lot of errors about the CSS, even though I'm using Twitter Bootstrap without any changes. Apparently the HtmlUnit project has a lot of bugs: it is good for small tests, but not for parsing complex (or even simple) HTML.
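One caveat about the replaceFirst call above: the pattern <\\?.*> is greedy, so if the declaration and the markup ever end up on the same line, it removes that whole line. A slightly stricter variant, as a sketch (the class and method names are made up), matches only a leading <?xml ... ?> declaration:

```java
import java.util.regex.Pattern;

class XmlDeclaration {
    // Matches only a leading <?xml ... ?> declaration and the whitespace around it.
    private static final Pattern DECLARATION = Pattern.compile("^\\s*<\\?xml[^>]*\\?>\\s*");

    /** Removes the XML declaration from the start of the text, if there is one. */
    static String strip(String xml) {
        return DECLARATION.matcher(xml).replaceFirst("");
    }

    public static void main(String[] args) {
        String snapshot = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\r\n<html><body>hi</body></html>";
        System.out.println(strip(snapshot));
    }
}
```

In the servlet this would be used as response.getWriter().write(XmlDeclaration.strip(page.asXml())).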

Parsing link from HTM with Jsoup

I'm trying to get a link from a page, but it doesn't seem to work. I need to catch a single link containing a keyword, but it seems to catch nothing...
I use this code:
Element summonerLink = doc.select("a:[href]:contanis(summoner)").first();
summonerURL = summonerLink.attr("href");
I tried to use setText(summonerURL) but it goes blank... (Of course summonerURL is String)

This should work:
doc.select("a[href*=summoner]")
For more info, check the jsoup documentation and look for [attr*=value].
