Java Servlet as an HTTP Proxy

I have read hundreds of SO posts and studied several of the available Java HTTP proxy sources, but I could not find a solution to my problem.
I wrote a web app that proxies HTTP requests. The web app works, but links and referrers break because the "root" of the proxied page points to the root of my server and not to the path of my proxy servlet.
To make it more clear:
My ProxyServlet gets a Request "http://myserver.com/proxy/ProxyServlet?foo=bar"
The ProxyServlet now fetches the pagecontent from ServerX (e.g. "http://original.com/test.html")
The content of the page is delivered to the browser by just reading and writing from one stream to the other and copying the headers.
The browser displays the page, and the URL it shows is the original request ("http://myserver.com/proxy/ProxyServlet?foo=bar"), but all relative links now point to
"http://myserver.com/XXX.html" instead of "http://myserver.com/proxy/ProxyServlet/XXX.html".
Is there a response-header where I can change the "path" so that relative links correctly point to my ProxyServlet?
(Rewriting the page-content and replacing links would be too difficult, because the page contains relatively addressed elements such as javascript code and other active content...)
(Changing the mapping for my Servlet to "/*" is also not possible... it must be accessed via this path...)

You are reinventing a "reverse proxy", and what you are missing is its "URL rewriting" feature.
Off the top of my search results, here's an open source proxy servlet that does this:
http://j2ep.sourceforge.net/docs/rewrite.html
You should also know that there is probably something wrong with the system architecture if you have to do this. Dropping in a standalone proxy like Apache, nginx, or Varnish should always be an option, as you will HAVE to add one (or more!) as you start scaling.

It sounds like the page you're proxying is using root-relative links, e.g. <a href="/XXX.html">, which means "no matter where this link is found, look for it relative to the document root". If you have control of it, the best thing is for the proxy target to be more lenient in its linking and use <a href="XXX.html"> instead. If you can't do that, then you need to rewrite these URLs. Some example code, using JSoup:
Document doc = Jsoup.parse(rawBody, getDisplayUrl());
for(Element cssALink : doc.select("link[rel=stylesheet],a[href]"))
{
cssALink.attr("href", cssALink.absUrl("href"));
}
for(Element imgJsLink : doc.select("script[src],img[src]"))
{
imgJsLink.attr("src", imgJsLink.absUrl("src"));
}
return doc.toString();
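To see why the links escape the proxy path, here is a minimal sketch using only java.net.URI. It approximates how a browser resolves a link against the URL of the page it sits on. The class name LinkResolutionDemo is made up for illustration; the URLs are the ones from the question.

```java
import java.net.URI;

// Hypothetical demo class, not part of any proxy library.
final class LinkResolutionDemo {

    // Resolve href against the URL of the page that contains it.
    static String resolve(String pageUrl, String href) {
        return URI.create(pageUrl).resolve(href).toString();
    }

    public static void main(String[] args) {
        String page = "http://myserver.com/proxy/ProxyServlet?foo=bar";

        // Root-relative link: the whole path is replaced.
        System.out.println(resolve(page, "/XXX.html"));
        // prints http://myserver.com/XXX.html

        // Plain relative link: only the last path segment is replaced.
        System.out.println(resolve(page, "XXX.html"));
        // prints http://myserver.com/proxy/XXX.html

        // Same link, but the page URL ends with a slash.
        System.out.println(resolve("http://myserver.com/proxy/ProxyServlet/", "XXX.html"));
        // prints http://myserver.com/proxy/ProxyServlet/XXX.html
    }
}
```

Note that even a plain relative link only lands under /proxy/ProxyServlet/ when the page URL ends in a slash, so the trailing slash matters as much as the link style.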

Related

How to get video src with jsoup (twitch)?

I am getting a NullPointerException. Is it possible to get this link?
Element element = document.select("div.tw-absolute.tw-bottom-0.tw-left-0.tw-overflow-hidden.tw-right-0.tw-top-0.video-player__container").first();
System.out.println(element.absUrl("src"));
I tried this too, and it throws a NullPointerException as well:
Element video = document.select("video").first();
String absSrc = video.absUrl("src");
System.out.println(absSrc);
The relevant part of the HTML:
<div class = "tw-absolute tw-bottom-0 tw-left-0 tw-overflow-hidden tw-right-0 tw-top-0 video-player__container" data-test-selector="video-player__video-container">
<video playsinline="" webkit-playsinline="" src="https://clips-media-assets2.twitch.tv/40487770748-offset-9048.mp4?token=%7B%22authorization%22:%7B%22forbidden%22:false,%22reason%22:%22%22%7D,%22chansub%22:%7B%22restricted_bitrates%22:%5B%5D%7D,%22device_id%22:%226518a1542e035018%22,%22expires%22:1609419047,%22https_required%22:true,%22privileged%22:false,%22user_id%22:500437676,%22version%22:2,%22vod_id%22:850278065%7D&sig=5e17731db577b99e535c4aad3eacc70c0cc34521"></video>
link: https://www.twitch.tv/scream/clip/BrightOilyAppleMcaT
Looks like this one will, again, require a lot of work to unpick.
Here's what I can tell you just from a quick look:
When you make the initial request, the response does not contain the result you're looking for in the HTML. Therefore it must be coming from a subsequent HTTP request that is fired off once the page is loaded, i.e. there's JavaScript communicating with back-end servers to get JSON payloads. In one of those payloads you'll find ".mp4".
If you use Chrome developer tools, you can flick over to the "Network" tab, click on each request following the first one, and check the "Preview" tab. You will find that some requests contain JSON responses, while others are just .css, .png, etc.; ignore those. In the JSON responses, search for some generic value you're interested in, like ".mp4".
Once you've found it, you then need to recreate the headers, the request body (as it's not empty), the type of HTTP request (POST), and pass any relevant cookies (in the headers).
You're going to have to make anywhere between 1 and 5 HTTP requests to get what you need to get this JSON payload. Once you have it you can then parse it back.
This is another one of those jobs that's so big I'm not going to begin to try to do it for you.
If it were me doing the job, I'd check the Twitch API docs https://dev.twitch.tv/docs/api/ to see if there's a better/easier way that's just 1-2 requests.
You can change the CSS query as below.
Element element = document.select("div.tw-absolute.tw-bottom-0.tw-left-0.tw-overflow-hidden.tw-right-0.tw-top-0.video-player__container > video").first();
String src = element.attr("src");
System.out.println(src);

Document.select("a[href]") not getting all the href

I am using JSOUP to fetch the documents from a website.
Below is my code
String webPageUrl = "https://mwcc.ms.gov/#/electronicDataInterchange";
Document doc = Jsoup.connect(webPageUrl).get();
Elements links = doc.getElementsByAttribute("a[href]");
The line of code below is not working. It is supposed to return the matching elements but doesn't:
doc.getElementsByAttribute("a[href]")
Can someone please point out the mistake in my code?
That page seems to be an Angular application, which means it loads some (probably all or most) of its content via JavaScript scripts.
The fact that the URL contains the fragment separator # is already a strong indicator of that, because when you make an HTTP request, everything from the # onwards is cut off (i.e. not sent to the server), so the actual request will just be for https://mwcc.ms.gov/.
As far as I know JSoup does not support running JavaScript, so you might need to look into a more involved scraping tool (possibly running a full browser engine).
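A small sketch of the fragment point, using only java.net.URI. The class and method names are made up for illustration. It splits the URL from the question into the part that is actually requested and the fragment that stays in the client.

```java
import java.net.URI;

// Hypothetical demo class showing which part of a URL reaches the server.
final class FragmentDemo {

    // The fragment (the part after '#'), or null if there is none.
    static String fragmentOf(String url) {
        return URI.create(url).getFragment();
    }

    // The URL without its fragment, i.e. what an HTTP client actually requests.
    static String requestedUrl(String url) {
        int hash = url.indexOf('#');
        return hash < 0 ? url : url.substring(0, hash);
    }

    public static void main(String[] args) {
        String url = "https://mwcc.ms.gov/#/electronicDataInterchange";
        System.out.println(requestedUrl(url)); // prints https://mwcc.ms.gov/
        System.out.println(fragmentOf(url));   // prints /electronicDataInterchange
    }
}
```

So whatever the fragment selects is rendered by the page's own scripts after loading, which is why the fetched document does not contain those links.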

Make gwt website crawlable without hash symbol?

In GWT we need to use # in a URL to navigate from one page to another, i.e. to create history entries, e.g. www.abc.com/#questions/10245857. Because of this I am facing a problem when sharing the URL: Google's crawlers read the URL only up to the #, i.e. www.abc.com.
Now I want to remove the # from my URL and keep it plain, as in www.abc.com/question/10245857.
I am unable to do so. How can I do this?
When the user navigates the app I use the hash URLs and the History object (so as not to reload the pages). However, sometimes it's nice or necessary to have a pretty URL (e.g. for sharing, showing in public, etc.), so I would like to know how to provide the pretty URL of the same page.
Note:
We have to do this to make our web page URLs crawlable and to link the website with the outside world.
There are 3 issues here, and each can be solved:
The URL should appear prettier to the user
Going directly to the pretty URL should work.
WebCrawlers should be able to get the content
These may all seem like the same issue, but they are quite distinct in this context.
Display Pretty URLs
Can be done with a small javascript file which uses HTML5 state methods. You can see a simple demo here, with source here. This makes all changes to "#" appear without the "#" (on modern browsers).
Relevant code from the fiddle:
var stateObj = {locationHash: hash};
history.replaceState(stateObj, "Page Title", baseURL + hash.substring(1));
Respond to Pretty URLs
This is relatively simple, as long as you have a listener in GWT to load based on the "#" at page load already. You can just throw up a simple re-direct servlet which reinserts the "#" mark where it belongs when requests come in.
For a servlet, listening for the pretty URL:
if(request.getPathInfo()!=null && request.getPathInfo().length()>1){
response.sendRedirect("#" + request.getPathInfo());
return;
}
Alternatively, you can serve up your GWT app directly from this servlet, and initialize it with parameters from the URL, but there's a bit of relative-path bookkeeping to be aware of.
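As a rough sketch of the redirect idea, the mapping from pretty path to hash URL can be pulled into a small helper that needs no servlet classes. The class name, the method name and the appRoot parameter are assumptions for illustration. Unlike the snippet above, it drops the leading slash of the path info so the result matches the #questions/10245857 form from the question.

```java
// Hypothetical helper: maps the path info of a pretty URL back to the hash URL.
final class PrettyUrlRedirect {

    // Returns the hash URL to redirect to, or null when there is nothing
    // after the servlet path (pathInfo is null or just "/").
    static String toHashUrl(String appRoot, String pathInfo) {
        if (pathInfo == null || pathInfo.length() <= 1) {
            return null;
        }
        // pathInfo starts with '/', which is dropped before appending.
        return appRoot + "#" + pathInfo.substring(1);
    }

    public static void main(String[] args) {
        System.out.println(toHashUrl("http://www.abc.com/", "/questions/10245857"));
        // prints http://www.abc.com/#questions/10245857
    }
}
```

In the servlet you would pass request.getPathInfo() to this helper and call response.sendRedirect(...) only when the result is not null.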
WebCrawlers
This is the trickiest one. Basically you can't get around having static(ish) pages here. That's not too hard if there are a finite set of simple states that you're indexing. One simple scheme is to have a separate servlet which returns the raw content you normally fetch with GWT, in minimal formatted HTML. This servlet can have a different URL pattern like "/indexing/". These wouldn't be meant for humans, just for the webcrawlers. You can attach a simple javascript in the <head> to redirect users to the pretty url once the page loads.
Here's an example for the doGet method of such a servlet:
response.setContentType("text/html;charset=UTF-8");
response.setStatus(200);
PrintWriter pw = response.getWriter(); // java.io.PrintWriter
pw.println("<html>");
pw.println("<head><script>");
pw.println("window.location.href='http://www.example.com/#"
+ request.getPathInfo() + "';");
pw.println("</script></head>");
pw.println("<body>");
pw.println(getRawPageContent(request.getPathInfo()));
pw.println("</body>");
pw.println("</html>");
pw.flush();
pw.close();
return;
You should then just have some links to these indexing pages hidden somewhere on your main app URL (or behind a link on your main app URL).
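One thing to watch in a servlet like this is that the path info ends up inside a JavaScript string, so it is worth escaping before it is written out. Below is a minimal sketch of building the same page as a string with that escaping added. IndexingPage, jsEscape and build are made-up names, the escaping covers only the obvious cases, and bodyHtml stands in for whatever your own content lookup returns.

```java
// Hypothetical builder for the crawler-facing page described above.
final class IndexingPage {

    // Escape a value for use inside a single-quoted JavaScript string
    // that is embedded in a <script> element.
    static String jsEscape(String s) {
        return s.replace("\\", "\\\\")
                .replace("'", "\\'")
                .replace("<", "\\x3C")
                .replace("\n", "\\n")
                .replace("\r", "\\r");
    }

    // Minimal HTML page: the raw content plus a script redirect for humans.
    static String build(String appRoot, String pathInfo, String bodyHtml) {
        String path = pathInfo == null ? "" : pathInfo;
        return "<html>\n"
                + "<head><script>window.location.href='"
                + jsEscape(appRoot + "#" + path)
                + "';</script></head>\n"
                + "<body>\n" + bodyHtml + "\n</body>\n"
                + "</html>\n";
    }

    public static void main(String[] args) {
        System.out.print(build("http://www.example.com/", "/questions/1", "<p>content</p>"));
    }
}
```

The servlet's doGet would then just set the content type and write the result of build(...) to the response writer.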

vaadin legacy-styles.css doesn't load in IE8

I am using Vaadin to create a web app. I want to import legacy-styles.css into my styles.css.
My styles.css is as follows:
@import "../reindeer/legacy-styles.css";
.v-app {
background: yellow;
}
Then I use Modernizr to target IE8:
Element head = response.getDocument().head();
Element meta = head.appendElement("meta");
meta.attr("name", "viewport");
meta.attr("content", "width=device-width, initial-scale=1, maximum-scale=1");
// Meta tag to force IE8 to standard mode
String ie8Meta = "<meta http-equiv=\"X-UA-Compatible\" content=\"IE=edge,chrome=1\">";
head.prepend(ie8Meta);
// some other stuffs ...
// Adding modernizr library to target ie8
// http://modernizr.com/docs/
String modernizr = "<script type=\"text/javascript\" src=\"//cdnjs.cloudflare.com/ajax/libs/modernizr/2.6.2/modernizr.min.js\"></script>";
head.prepend(modernizr);
Strangely, legacy-styles.css loads properly in IE10, but it doesn't load in IE8.
The error reported is
com.vaadin.client.VConsole
SEVERE: CSS files may have not loaded properly.
I have tried moving modernizr.js (using append instead of prepend), but that didn't work.
Before we dig into the problem at hand, using @import is a super bad idea. While it is convenient for you, it means that every single user who comes to your site has to download a file (the document itself) in order to download another file (the one containing the @import rule), just to download yet another file. In addition to DNS resolution and TCP handshakes, you are looking at possibly up to a dozen round trips between your server and their computer just to get this one single document.
It would be much better to use a CSS preprocessor, such as Sass or Less, and concatenate the older file with the new one into a single compressed file.
Next, modernizr shouldn't be used to target IE8, or any browser for that matter. That is actually completely against the purpose of Modernizr - which specializes in feature detection, rather than browser detection. That means that rather than saying you want to "target IE8", you would think more along the lines of wanting to "target browsers lacking feature X" (svg, geolocation, rounded corners, etc).
That being said, there are a number of reasons to specifically target versions of Internet Explorer, but Modernizr is not the way to do it. The more correct way to do this would be to use IE's conditional comments:
<!--[if IE 8]>
// IE-specific styles here
<![endif]-->
Nothing you have shown has anything to do with IE-specific code, and since Modernizr doesn't add any features itself, it isn't really clear why IE 10 would behave any differently at all.

best way to externalize HTML in GWT apps?

What's the best way to externalize large quantities of HTML in a GWT app? We have a rather complicated GWT app of about 30 "pages"; each page has a sort of guide at the bottom that is several paragraphs of HTML markup. I'd like to externalize the HTML so that it can remain as "unescaped" as possible.
I know and understand how to use property files in GWT; that's certainly better than embedding the content in Java classes, but still kind of ugly for HTML (you need to backslashify everything, as well as escape quotes, etc.)
Normally this is the kind of thing you would put in a JSP, but I don't see any equivalent to that in GWT. I'm considering just writing a widget that will simply fetch the content from html files on the server and then add the text to an HTML widget. But it seems there ought to be a simpler way.
I've used ClientBundle in a similar setting. I've created a package my.resources and put my HTML document and the following class there:
package my.resources;
import com.google.gwt.core.client.GWT;
import com.google.gwt.resources.client.ClientBundle;
import com.google.gwt.resources.client.TextResource;
public interface MyHtmlResources extends ClientBundle {
public static final MyHtmlResources INSTANCE = GWT.create(MyHtmlResources.class);
@Source("intro.html")
public TextResource getIntroHtml();
}
Then I get the content of that file by calling the following from my GWT client code:
HTML htmlPanel = new HTML();
String html = MyHtmlResources.INSTANCE.getIntroHtml().getText();
htmlPanel.setHTML(html);
See http://code.google.com/webtoolkit/doc/latest/DevGuideClientBundle.html for further information.
You can use a templating mechanism. Try FreeMarker or Velocity templates. Your HTML then lives in files that are retrieved by the templating library. These files can be named with proper extensions, e.g. .html, .css, .js, so they can be viewed on their own.
I'd say you load the external html through a Frame.
Frame frame = new Frame();
frame.setUrl(GWT.getModuleBase() + getCurrentPageHelp());
add(frame);
You can arrange some convention or lookup for the getCurrentPageHelp() to return the appropriate path (eg: /manuals/myPage/help.html)
Here's an example of frame in action.
In GWT 2.0, you can do this using the UiBinder.
<ui:UiBinder xmlns:ui='urn:ui:com.google.gwt.uibinder'>
<div>
Hello, <span ui:field='nameSpan'/>, this is just good ol' HTML.
</div>
</ui:UiBinder>
These files are kept separate from your Java code and can be edited as HTML. They also provide integration with GWT widgets, so that you can easily access elements within the HTML from your GWT code.
GWT 2.0, when released, should have a ClientBundle, which probably tackles this need.
You could try implementing a Generator to load external HTML from a file at compile time and build a class that emits it. There doesn't seem to be too much help online for creating generators but here's a post to the GWT group that might get you started: GWT group on groups.google.com.
I was doing similar research and, so far, I see that the best way to approach this problem is via the DeclarativeUI or UiBinder. Unfortunately it is still in the incubator, so we need to work around the problem.
I solve it in a couple of different ways:
Active overlay, i.e. you create your standard HTML/CSS and inject the GWT code via a <script> tag. Wherever you need to access an element from GWT code, you write something like this:
RootPanel.get("element-name").setVisible(false);
You write your code 100% GWT and then, if a big HTML chunk is needed, you bring it to the client either via IFRAME or via AJAX and then inject it via HTML panel like this:
String html = "<div id='one' "
+ "style='border:3px dotted blue;'>"
+ "</div><div id='two' "
+ "style='border:3px dotted green;'"
+ "></div>";
HTMLPanel panel = new HTMLPanel(html);
panel.setSize("200px", "120px");
panel.addStyleName("demo-panel");
panel.add(new Button("Do Nothing"), "one");
panel.add(new TextBox(), "two");
RootPanel.get("demo").add(panel);
Why not use a good old IFRAME? Just create an iFrame where you wish to put a hint and change its location when the GWT 'page' changes.
Advantages:
Hints are stored in separate maintainable HTML files of any structure
AJAX-style loading with no coding at all on server side
If needed, application could still interact with loaded info
Disadvantages:
Each hint file should have link to shared CSS for common look-and-feel
Hard to internationalize
To make this approach a bit better, you might handle loading errors and redirect to the default language/topic on 404 errors. The search priority would then be:
Current topic for current language
Current topic for default language
Default topic for current language
Default error page
I think it's quite easy to create a GWT component like this that wraps the iFrame interactions.
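The search priority above could be sketched roughly like this. The /help/<language>/<topic>.html layout, the error page path and all the names are assumptions for illustration; the list is simply the order in which to try loading hint files until one is found.

```java
import java.util.List;

// Hypothetical lookup helper for hint files.
final class HintLookup {

    // Assumed file layout: /help/<language>/<topic>.html
    static String path(String language, String topic) {
        return "/help/" + language + "/" + topic + ".html";
    }

    // Candidate paths in the order they should be tried.
    static List<String> candidates(String topic, String language,
                                   String defaultTopic, String defaultLanguage) {
        return List.of(
                path(language, topic),          // current topic, current language
                path(defaultLanguage, topic),   // current topic, default language
                path(language, defaultTopic),   // default topic, current language
                "/help/error.html");            // default error page (assumed path)
    }

    public static void main(String[] args) {
        candidates("login", "de", "index", "en").forEach(System.out::println);
    }
}
```

On the client you would set the frame URL to the first candidate and move on to the next one whenever loading fails.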
The GWT Portlets framework (http://code.google.com/p/gwtportlets/) includes a WebAppContentPortlet. This serves up any content from your web app (static HTML, JSPs etc.). You can put it on a page with additional functionality in other Portlets and everything is fetched with a single async call when the page loads.
Have a look at the source for WebAppContentPortlet and WebAppContentDataProvider to see how it is done or try using the framework itself. Here are the relevant bits of source:
WebAppContentPortlet (client side)
((HasHTML)getWidget()).setHTML(html == null ? "<i>Web App Content</i>" : html);
WebAppContentDataProvider (server side):
HttpServletRequest servletRequest = req.getServletRequest();
String path = f.path.startsWith("/") ? f.path : "/" + f.path;
RequestDispatcher rd = servletRequest.getRequestDispatcher(path);
BufferedResponse res = new BufferedResponse(req.getServletResponse());
try {
rd.include(servletRequest, res);
res.getWriter().flush();
f.html = new String(res.toByteArray(), res.getCharacterEncoding());
} catch (Exception e) {
log.error("Error including '" + path + "': " + e, e);
f.html = "Error including '" + path +
"'<br>(see server log for details)";
}
You can use servlets with jsps for the html parts of the page and still include the javascript needed to run the gwt app on the page.
I'm not sure I understand your question, but I'm going to assume you've factored out this common summary into its own widget. If so, the problem is that you don't like the ugly way of embedding HTML into the Java code.
GWT 2.0 has UiBinder, which allows you to define the GUI in raw HTMLish template, and you can inject values into the template from the Java world. Read through the dev guide and it gives a pretty good outline.
Take a look at
http://code.google.com/intl/es-ES/webtoolkit/doc/latest/DevGuideClientBundle.html
You can try a GWT app with HTML templates that are generated and bound at run time rather than at compile time.
I don't know GWT, but can't you define an anchor div tag in your app's HTML, then perform a GET against the HTML files that you need and append the result to the div? How different would this be from a micro-template?
UPDATE:
I just found this nice jQuery plugin in an answer to another StackOverflow question.
