Document.select("a[href]") not getting all the href - java

I am using JSOUP to fetch the documents from a website.
Below is my code:
String webPageUrl = "https://mwcc.ms.gov/#/electronicDataInterchange";
Document doc = Jsoup.connect(webPageUrl).get();
Elements links = doc.select("a[href]");
The line below is not working. It is supposed to return the link elements but comes back empty:
doc.select("a[href]")
Can someone please point out the mistake in my code?

That page is an Angular application, which means it loads some (probably all or most) of its content via JavaScript.
The fragment separator # in the URL is already a strong indicator of that: everything after the # is not sent to the server in an HTTP request, so the actual request is just for https://mwcc.ms.gov/.
As far as I know, JSoup does not support running JavaScript, so you might need to look into a more involved scraping tool, possibly one running a full browser engine.
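For example, a minimal sketch using HtmlUnit, a headless browser that can run the page's scripts, and then handing the rendered HTML to Jsoup (the wait time is an assumption you would tune for the actual page):

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class RenderedLinks {
    public static void main(String[] args) throws Exception {
        try (WebClient webClient = new WebClient()) {
            webClient.getOptions().setThrowExceptionOnScriptError(false);
            // Load the page and let the Angular scripts build the content
            HtmlPage page = webClient.getPage("https://mwcc.ms.gov/#/electronicDataInterchange");
            webClient.waitForBackgroundJavaScript(5000); // tune for the page
            // Hand the rendered HTML to Jsoup for the usual selector work
            Document doc = Jsoup.parse(page.asXml(), page.getUrl().toString());
            Elements links = doc.select("a[href]");
            links.forEach(a -> System.out.println(a.absUrl("href")));
        }
    }
}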

Related

How to get video src with jsoup (twitch)?

I am getting a NullPointerException. Is it possible to get this link?
Element element = document.select("div.tw-absolute.tw-bottom-0.tw-left-0.tw-overflow-hidden.tw-right-0.tw-top-0.video-player__container").first();
System.out.println(element.absUrl("src"));
I tried this too, and got a NullPointerException as well:
Element video = document.select("video").first();
String absSrc = video.absUrl("src");
System.out.println(absSrc);
The HTML part:
<div class="tw-absolute tw-bottom-0 tw-left-0 tw-overflow-hidden tw-right-0 tw-top-0 video-player__container" data-test-selector="video-player__video-container">
<video playsinline="" webkit-playsinline="" src="https://clips-media-assets2.twitch.tv/40487770748-offset-9048.mp4?token=%7B%22authorization%22:%7B%22forbidden%22:false,%22reason%22:%22%22%7D,%22chansub%22:%7B%22restricted_bitrates%22:%5B%5D%7D,%22device_id%22:%226518a1542e035018%22,%22expires%22:1609419047,%22https_required%22:true,%22privileged%22:false,%22user_id%22:500437676,%22version%22:2,%22vod_id%22:850278065%7D&sig=5e17731db577b99e535c4aad3eacc70c0cc34521"></video>
link: https://www.twitch.tv/scream/clip/BrightOilyAppleMcaT
Looks like this one will again require a lot of work to unpick.
Here's what I can tell you just from a quick look:
when you make the initial request, the response does not contain the result you're looking for in the HTML. It must therefore be coming from a subsequent HTTP request fired off once the page is loaded, i.e. there's JavaScript communicating with back-end servers to fetch JSON payloads. In one of those payloads you'll find ".mp4".
If you use the Chrome developer tools, you can flick over to the "Network" tab, click on each request following the first one, and check the "Preview" tab. You will find that some requests contain JSON responses; others are just .css, .png, etc., which you can ignore. In the JSON responses, search for some generic value you're interested in, like ".mp4". Once you've found it:
you then need to recreate the headers, the request body (as it's not empty), the HTTP method (POST), and pass any relevant cookies (in the headers).
You're going to have to make anywhere between 1 and 5 HTTP requests to get this JSON payload. Once you have it, you can parse it.
This is another one of those jobs that's so big I'm not going to begin to try to do it for you.
If it were me doing the job, I'd check the Twitch API docs at https://dev.twitch.tv/docs/api/ to see if there's a better/easier way that needs just 1-2 requests.
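If you do go down the manual route, a rough sketch of replaying one of those JSON requests with Jsoup might look like the following; the endpoint, headers, and payload are assumptions you would copy from the dev tools "Network" tab:

import java.io.IOException;
import org.jsoup.Connection;
import org.jsoup.Jsoup;

public class TwitchJsonFetch {
    // endpoint, payload and clientId are placeholders -- copy the real values
    // from the request you found in the "Network" tab
    static String fetchJson(String endpoint, String payload, String clientId) throws IOException {
        Connection.Response res = Jsoup.connect(endpoint)
                .method(Connection.Method.POST)
                .ignoreContentType(true)                    // the response is JSON, not HTML
                .header("Content-Type", "application/json")
                .header("Client-ID", clientId)              // assumed required, as seen in the browser request
                .requestBody(payload)
                .execute();
        return res.body();                                  // search this for ".mp4"
    }
}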
You can change the CSS query as below.
Element element = document.select("div.tw-absolute.tw-bottom-0.tw-left-0.tw-overflow-hidden.tw-right-0.tw-top-0.video-player__container > video").first();
String src = element.attr("src");
System.out.println(src);

Make gwt website crawlable without hash symbol?

In GWT we need to use # in a URL to navigate from one page to another, i.e. for creating history, e.g. www.abc.com/#questions/10245857. Because of this I am facing a problem sharing the URL: Google scrapers read only the part of the URL before the #, i.e. www.abc.com.
Now I want to remove the # from my URL and keep it straight, as www.abc.com/question/10245857.
I am unable to do so. How can I do this?
When the user navigates the app I use hash URLs and the History object (so as not to reload the pages). However, sometimes it's nice/needed to have a pretty URL (e.g. for sharing, showing in public, etc.), so I would like to know how to provide the pretty URL of the same page.
Note:
We have to do this to make our webpage URLs crawlable and to link the website with the outside world.
There are 3 issues here, and each can be solved:
The URL should appear prettier to the user
Going directly to the pretty URL should work.
WebCrawlers should be able to get the content
These may all seem like the same issue, but they are quite distinct in this context.
Display Pretty URLs
Can be done with a small JavaScript file which uses the HTML5 history state methods. You can see a simple demo here, with source here. This makes all changes to "#" appear without the "#" (on modern browsers).
Relevant code from the fiddle:
// hash comes from window.location.hash; baseURL is the page URL without the fragment
var stateObj = {locationHash: hash};
// Swap the visible "#..." URL for the pretty path without reloading the page
history.replaceState(stateObj, "Page Title", baseURL + hash.substring(1));
Respond to Pretty URLs
This is relatively simple, as long as you already have a listener in GWT that loads the right view based on the "#" at page load. You can just throw up a simple redirect servlet which reinserts the "#" where it belongs when requests come in.
For a servlet listening for the pretty URL:
if (request.getPathInfo() != null && request.getPathInfo().length() > 1) {
    // Reinsert the '#' and send the browser back to the hash-based URL
    response.sendRedirect("#" + request.getPathInfo());
    return;
}
Alternatively, you can serve up your GWT app directly from this servlet, and initialize it with parameters from the URL, but there's a bit of relative-path bookkeeping to be aware of.
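A minimal sketch of that alternative, assuming a hypothetical GWT host page at /index.html:

protected void doGet(HttpServletRequest request, HttpServletResponse response)
        throws ServletException, IOException {
    // Serve the GWT host page for the pretty URL; on load, the app can read
    // the path or query parameters to restore the matching "#" state.
    request.getRequestDispatcher("/index.html").forward(request, response);
}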
WebCrawlers
This is the trickiest one. Basically you can't get around having static(ish) pages here. That's not too hard if there is a finite set of simple states that you're indexing. One simple scheme is to have a separate servlet which returns the raw content you normally fetch with GWT, in minimally formatted HTML. This servlet can have a different URL pattern, like "/indexing/". These pages wouldn't be meant for humans, just for the webcrawlers. You can attach a simple script in the <head> to redirect users to the pretty URL once the page loads.
Here's an example for the doGet method of such a servlet:
response.setContentType("text/html;charset=UTF-8");
response.setStatus(200);
PrintWriter pw = response.getWriter();
pw.println("<html>");
pw.println("<head><script>");
// Redirect human visitors to the pretty URL once the page loads
pw.println("window.location.href='http://www.example.com/#"
        + request.getPathInfo() + "';");
pw.println("</script></head>");
pw.println("<body>");
pw.println(getRawPageContent(request.getPathInfo()));
pw.println("</body>");
pw.println("</html>");
pw.flush();
pw.close();
return;
You should then just have some links to these indexing pages hidden somewhere on your main app URL (or behind a link on your main app URL).

HtmlUnit to take snapshot of Ajax applications

I created a basic GWT (Google Web Toolkit) Ajax application, and now I'm trying to create snapshots so the crawlers can read the page.
I created a Servlet to respond to the crawlers, using HtmlUnit.
My application runs perfectly in a browser. But in HtmlUnit, it throws a lot of errors about the special chars I have in the HTML. These chars are content, though, and I wouldn't like to replace them with escape codes just because of HtmlUnit, since everything currently works. (At least I should check first whether I'm using HtmlUnit correctly.)
I think HtmlUnit should read the charset information of the page and render it as a browser would, since that's the objective of the project, I think.
I haven't found good information about this problem. Is this an HtmlUnit limitation? Do I need to change all the content of my website to use this Java library to take snapshots?
Here's my code:
if ((queryString != null) && (queryString.contains("_escaped_fragment_"))) {
    // ok, it's the crawler
    // rewrite the URL back to the original #! version
    // remember to unescape any %XX characters
    url = URLDecoder.decode(url, "UTF-8");
    String ajaxURL = url.replace("?_escaped_fragment_=", "#!");
    final WebClient webClient = new WebClient(BrowserVersion.FIREFOX_24);
    HtmlPage page = webClient.getPage(ajaxURL);
    // important! Give the headless browser enough time to execute JavaScript.
    // The exact time to wait may depend on your application.
    webClient.waitForBackgroundJavaScript(3000);
    // return the snapshot
    response.getWriter().write(page.asXml());
}
The problem was the XML declaration conflicting with the HTML. @ColinAlworth's comments helped me.
I followed Google's example, and it was not working.
To make it work, you need to remove the XML declaration and let just the HTML be returned, changing the line:
// return the snapshot
response.getWriter().write(page.asXml());
to
response.getWriter().write(page.asXml().replaceFirst("<\\?.*>",""));
(The replaceFirst strips the leading <?xml ... ?> declaration that page.asXml() emits, leaving just the HTML.)
Now it renders.
But although the page is rendered, the CSS is not working, and the DOM is not updated (GWT updates the page title when the page opens). HtmlUnit threw a lot of errors about CSS, even though I'm using Twitter Bootstrap without any changes. Apparently, the HtmlUnit project has a lot of bugs; it's good for small tests, but not for parsing complex (or even simple) HTML pages.
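That said, if you mainly want to quiet HtmlUnit down, its options can ignore script errors, and a silent CSS error handler swallows the stylesheet warnings. A sketch (exact method names vary a little between HtmlUnit versions):

import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.SilentCssErrorHandler;
import com.gargoylesoftware.htmlunit.WebClient;

final WebClient webClient = new WebClient(BrowserVersion.FIREFOX_24);
// Don't abort when page scripts throw
webClient.getOptions().setThrowExceptionOnScriptError(false);
// Swallow the noisy CSS warnings from Bootstrap and friends
webClient.setCssErrorHandler(new SilentCssErrorHandler());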

Url working in Google chrome inaccessible by Java w/Jsoup?

I'm having quite a confusing problem. I have literally only been doing networking for a day, so please forgive me and I apologize if I am making a dumb error. My issue is that I cannot access a URL in a programmatic fashion which I can access through copy-pasting into chrome.
I am using a library called jsoup (http://jsoup.org/apidocs/) which parses text out of raw HTML from a website. My goal in general is to use a base URL to which I can attach a string and get a webpage from it. I am using the code below (edited in for those who asked for more code; I know this is still sparse, but it is the only code preceding the error):
String url = "https://www.google.com/search?q=definition+of+";
url += search; //search is the passed in string
Document doc = Jsoup.connect(url).get(); //url is the String in question
to get the webpage. My ultimate goal is to use this method to get the text of the box at the top of Chrome searches when you search for the definition of a word, i.e. the box at the top here: https://www.google.com/search?q=definition+of+apple
However, I run into an issue when I attempt to use the above link as my url: I get an org.jsoup.HttpStatusException, so I think it is a networking problem. What causes this URL to work when typed into Chrome, but not in Java? (I would also not be averse to different ways to get the information in that box, since my current method feels a bit roundabout.)
The full error message (edited in)
Exception in thread "main" org.jsoup.HttpStatusException: HTTP error fetching URL. Status=403, URL=https://www.google.com/search?q=definition+of+apple
at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:435)
at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:410)
at org.jsoup.helper.HttpConnection.execute(HttpConnection.java:164)
at org.jsoup.helper.HttpConnection.get(HttpConnection.java:153)
at test.Test.parseDef(Test.java:68)
at test.Test.main(Test.java:112)
To whomever answers, thank you for spending your time to help a networking newbie!
Most likely, Google is accurately identifying your program as a "robot" and acting accordingly. Google encourages robots to use the Google Custom Search API and discourages them from using the human-oriented search interface.
In fact, all web spiders are supposed to check robots.txt, right? Here is Google's: http://www.google.com/robots.txt. Note that /search is disallowed.
Please see this question for further information; it's basically the Python version of your question: Why does Google Search return HTTP Error 403?
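A sketch of calling the Custom Search JSON API instead; the apiKey and cx values are placeholders you would create in the Google API console:

import java.io.IOException;
import java.net.URLEncoder;
import org.jsoup.Jsoup;

public class DefinitionSearch {
    static String searchJson(String apiKey, String cx, String query) throws IOException {
        String url = "https://www.googleapis.com/customsearch/v1"
                + "?key=" + apiKey + "&cx=" + cx
                + "&q=" + URLEncoder.encode(query, "UTF-8");
        // ignoreContentType because the response is JSON, not HTML
        return Jsoup.connect(url).ignoreContentType(true).execute().body();
    }
}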
If you use Jsoup you have to replace spaces with %20, not with +.
Try this URL:
https://www.google.com/search?q=definition%20of%20apple
String url = "https://www.google.com/search?q=definition%20of%20";
url += search; //search is the passed in string
Document doc = Jsoup.connect(url).get();
Setting a browser-like User-Agent (and, if needed, cookies and a longer timeout) can also help avoid being rejected:
public static void main(String[] args) throws IOException {
    Document doc = Jsoup.connect(link)   // link is the URL to fetch
            .data("query", "Java")
            .userAgent("Mozilla")        // pretend to be a regular browser
            .cookie("auth", "token")
            .timeout(1000)
            .post();
}

Java Servlet as an HTTP Proxy

I have read hundreds of SO posts and studied several of the Java HTTP proxy sources available... but I could not find a solution for my problem.
I wrote a WebApp that proxies HTTP requests. The WebApp is working, but links and referrers become broken because the "root" of the proxied page points to the root of my server and not to the path of my proxy servlet.
To make it more clear:
My ProxyServlet gets a request for "http://myserver.com/proxy/ProxyServlet?foo=bar"
The ProxyServlet then fetches the page content from ServerX (e.g. "http://original.com/test.html")
The content of the page is delivered to the browser by just reading and writing from one stream to the other and copying the headers.
The browser displays the page; the URL that the browser shows is the original request ("http://myserver.com/proxy/ProxyServlet?foo=bar"), but all relative links now point to
"http://myserver.com/XXX.html" instead of "http://myserver.com/proxy/ProxyServlet/XXX.html"
Is there a response-header where I can change the "path" so that relative links correctly point to my ProxyServlet?
(Rewriting the page content and replacing links would be too difficult, because the page contains relatively addressed elements such as JavaScript code and other active content...)
(Changing the mapping for my Servlet to "/*" is also not possible... it must be accessed via this path...)
You are inventing a "reverse proxy" and are missing the "URL rewriting" feature...
Off the top of my search results, here's an open source proxy servlet that does this:
http://j2ep.sourceforge.net/docs/rewrite.html
Also, you should know there is probably something wrong with the system architecture if you have to do this. Dropping in a standalone proxy like Apache, nginx, or Varnish should always be an option, as you will HAVE to add one (or more!) as you start scaling.
It sounds like the page you're proxying is using absolute links, e.g. <a href="/XXX.html">, which means "no matter where this link is found, look for it relative to the document root". If you have control of it, the best thing is for the proxy target to be more lenient in its linking and instead use <a href="XXX.html">. If you can't do that, then you need to rewrite these URLs. Some example code, using JSoup:
Document doc = Jsoup.parse(rawBody, getDisplayUrl());
for(Element cssALink : doc.select("link[rel=stylesheet],a[href]"))
{
cssALink.attr("href", cssALink.absUrl("href"));
}
for(Element imgJsLink : doc.select("script[src],img[src]"))
{
imgJsLink.attr("src", imgJsLink.absUrl("src"));
}
return doc.toString();
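Note that absUrl("href") resolves each link against the base URI passed to Jsoup.parse (getDisplayUrl() is assumed to return the original URL of the proxied page), so every relative link comes out as a fully qualified URL pointing at the original host rather than at your servlet's root.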
