Tools/libraries to resolve/expand thousands of URLs

In a crawler-like project we have a common task: resolving/expanding thousands of URLs. Say we have (very simplified example):
http://bit.ly/4Agih5
A GET request to 'http://bit.ly/4Agih5' returns one of the 3xx codes, so we follow the redirect to:
http://stackoverflow.com
GET 'http://stackoverflow.com' returns 200. So 'stackoverflow.com' is the result we need.
Any URLs (not only well-known shorteners like bit.ly) are allowed as input. Some of them redirect once, some don't redirect at all (the result is the URL itself in this case), and some redirect multiple times. Our task is to follow all redirects, imitating browser behavior as closely as possible. In general, given some URL A, the resolver should return URL B, which should be the same URL we would end up at if A were opened in a browser.
So far we have used Java, a pool of threads and a simple URLConnection to solve this task. The advantages are obvious:
simplicity - just create URLConnection, set follow redirects and that's it (almost);
good HTTP support - Java provides everything we need to imitate a browser as closely as possible: automatic redirect following, cookie support.
Unfortunately, this approach also has drawbacks:
performance - threads are not free, and URLConnection starts downloading the document right after getInputStream(), even if we don't need it;
memory footprint - I'm not sure exactly, but it seems that URL and URLConnection are quite heavy objects, and again there is the buffering of the GET result right after the getInputStream() call.
Are there other solutions (or improvements to this one) which may significantly increase speed and decrease memory consumption? Presumably, we need something like:
high-performance lightweight Java HTTP client based on java.nio;
C HTTP client which uses poll() or select();
some ready-made library which resolves/expands URLs.

You can use Python, Gevent, and urlopen. Combine this gevent example with the redirect handling in this SO question.
I would not recommend Nutch, it is very complex to set up and has numerous dependencies (Hadoop, HDFS).

I'd use a selenium script to read URLs off of a queue and GET them. Then wait about 5 seconds per browser to see if a redirect occurs and if so put the new redirect URL back into the queue for the next instance to process. You can have as many instances running simultaneously as you want.
UPDATE:
If you only care about the Location header (what most non-JS or meta redirects use), simply check it, you never need to get the inputStream:
HttpURLConnection.setFollowRedirects(false);
URL url = new URL("http://bit.ly/abc123");
HttpURLConnection conn = (HttpURLConnection) url.openConnection();
String newLocation = conn.getHeaderField("Location");
If the newLocation is populated then stick that URL back into the queue and have that followed next round.
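The "check Location, then follow it next round" idea can be sketched as a loop that asks for one hop at a time. This is only a minimal sketch, not the original poster's code: the class name RedirectResolver, the helpers resolve and httpNextHop, the hop limit, the 5-second timeouts and the use of HEAD instead of GET are all assumptions made for illustration. It only sees Location-header redirects, not JavaScript or meta-refresh ones.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URI;
import java.util.HashSet;
import java.util.Optional;
import java.util.Set;
import java.util.function.Function;

// Hypothetical helper class, not part of any library.
final class RedirectResolver {

    /**
     * Follows a redirect chain until a URL has no next hop, a loop is
     * detected, or maxHops is exhausted. nextHop returns the raw Location
     * value (possibly relative) or empty when there is no redirect.
     */
    static String resolve(String start, Function<String, Optional<String>> nextHop, int maxHops) {
        Set<String> seen = new HashSet<>();
        String current = start;
        for (int hop = 0; hop < maxHops && seen.add(current); hop++) {
            Optional<String> location = nextHop.apply(current);
            if (!location.isPresent()) {
                return current; // no redirect: this is the final URL
            }
            // Location may be relative, so resolve it against the current URL.
            current = URI.create(current).resolve(location.get()).toString();
        }
        return current; // hop limit reached or loop detected
    }

    /** One HTTP round trip that never reads the body; returns the Location header, if any. */
    static Optional<String> httpNextHop(String url) {
        try {
            HttpURLConnection conn = (HttpURLConnection) URI.create(url).toURL().openConnection();
            conn.setInstanceFollowRedirects(false); // per-connection, unlike the static setter
            conn.setRequestMethod("HEAD");          // assumption: the server answers HEAD like GET
            conn.setConnectTimeout(5_000);
            conn.setReadTimeout(5_000);
            try {
                int status = conn.getResponseCode();
                boolean redirect = status >= 300 && status < 400;
                return redirect ? Optional.ofNullable(conn.getHeaderField("Location"))
                                : Optional.<String>empty();
            } finally {
                conn.disconnect();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A call would look like RedirectResolver.resolve("http://bit.ly/4Agih5", RedirectResolver::httpNextHop, 10). Keeping the hop lookup separate from the loop means the loop can be exercised with an in-memory map instead of the network.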

Related

POS Customer Display (Web Based Application) [duplicate]

I was searching for a way how to communicate between multiple tabs or windows in a browser (on the same domain, not CORS) without leaving traces. There were several solutions:
using the window object
postMessage
cookies
localStorage
The first is probably the worst solution - you need to open a window from your current window and then you can communicate only as long as you keep the windows open. If you reload the page in any of the windows, you will most likely lose the communication.
The second approach, using postMessage, probably enables cross-origin communication, but it suffers from the same problem as the first approach. You need to maintain a window object.
The third way, using cookies, stores some data in the browser, which can effectively look like sending a message to all windows on the same domain, but the problem is that you can never know whether all tabs have already read the "message" before cleaning up. You have to implement some sort of timeout to read the cookie periodically. Furthermore, you are limited by the maximum cookie length, which is 4 KB.
The fourth solution, using localStorage, seemed to overcome the limitations of cookies, and it can even be listened to using events. How to use it is described in the accepted answer.

You had better use BroadcastChannel for this purpose. See other answers below. Yet if you still prefer to use localStorage for communication between tabs, do it this way:
In order to get notified when a tab sends a message to other tabs, you simply need to bind on 'storage' event. In all tabs, do this:
$(window).on('storage', message_receive);
The function message_receive will be called every time you set any value of localStorage in any other tab. The event object also contains the data newly set to localStorage, so you don't even need to parse the localStorage object itself. This is very handy because you can reset the value right after it was set, to effectively clean up any traces. Here are functions for messaging:
// use local storage for messaging. Set message in local storage and clear it right away
// This is a safe way how to communicate with other tabs while not leaving any traces
//
function message_broadcast(message)
{
    localStorage.setItem('message',JSON.stringify(message));
    localStorage.removeItem('message');
}
// receive message
//
function message_receive(ev)
{
    if (ev.originalEvent.key!='message') return; // ignore other keys
    var message=JSON.parse(ev.originalEvent.newValue);
    if (!message) return; // ignore empty msg or msg reset
    // here you act on messages.
    // you can send objects like { 'command': 'doit', 'data': 'abcd' }
    if (message.command == 'doit') alert(message.data);
    // etc.
}
So now once your tabs bind on the onstorage event, and you have these two functions implemented, you can simply broadcast a message to other tabs calling, for example:
message_broadcast({'command':'reset'})
Remember that sending the exact same message twice will be propagated only once, so if you need to repeat messages, add some unique identifier to them, like
message_broadcast({'command':'reset', 'uid': (new Date).getTime()+Math.random()})
Also remember that the current tab which broadcasts the message doesn't actually receive it, only other tabs or windows on the same domain.
You may ask what happens if the user loads a different webpage or closes the tab just after the setItem() call and before the removeItem(). Well, from my own testing, the browser puts unloading on hold until the entire function message_broadcast() is finished. I tested putting a very long for() loop in there and it still waited for the loop to finish before closing. If the user kills the tab in between, then the browser won't have enough time to save the message to disk, so this approach seems to me like a safe way to send messages without leaving any traces.

There is a modern API dedicated for this purpose - Broadcast Channel
It is as easy as:
var bc = new BroadcastChannel('test_channel');
bc.postMessage('This is a test message.'); /* send */
bc.onmessage = function (ev) { console.log(ev); } /* receive */
There is no need for the message to be just a DOMString. Any kind of object can be sent.
Probably, apart from API cleanness, it is the main benefit of this API - no object stringification.
It is currently supported only in Chrome and Firefox, but you can find a polyfill that uses localStorage.

For those searching for a solution not based on jQuery, this is a plain JavaScript version of the solution provided by Thomas M:
window.addEventListener("storage", message_receive);
function message_broadcast(message) {
    localStorage.setItem('message',JSON.stringify(message));
}
function message_receive(ev) {
    if (ev.key == 'message') {
        var message=JSON.parse(ev.newValue);
    }
}

Check out AcrossTabs - Easy communication between cross-origin browser tabs. It uses a combination of the postMessage and sessionStorage APIs to make communication much easier and more reliable.
There are different approaches and each one has its own advantages and disadvantages. Let’s discuss each:
LocalStorage
Pros:
Web storage can be viewed simplistically as an improvement on cookies, providing much greater storage capacity. If you look at the Mozilla source code we can see that 5120 KB (5 MB which equals 2.5 million characters on Chrome) is the default storage size for an entire domain. This gives you considerably more space to work with than a typical 4 KB cookie.
The data is not sent back to the server for every HTTP request (HTML, images, JavaScript, CSS, etc.) - reducing the amount of traffic between client and server.
The data stored in localStorage persists until explicitly deleted. Changes made are saved and available for all current and future visits to the site.
Cons:
It works on the same-origin policy. So, stored data will only be available on the same origin.
Cookies
Pros:
Compared to others, there's nothing AFAIK.
Cons:
The 4 KB limit is for the entire cookie, including name, value, expiry date, etc. To support most browsers, keep the name under 4000 bytes, and the overall cookie size under 4093 bytes.
The data is sent back to the server for every HTTP request (HTML, images, JavaScript, CSS, etc.) - increasing the amount of traffic between client and server.
Typically, the following are allowed:
300 cookies in total
4096 bytes per cookie
20 cookies per domain
81920 bytes per domain (given 20 cookies of the maximum size 4096 = 81920 bytes.)
sessionStorage
Pros:
It is similar to localStorage.
Changes are only available per window (or tab in browsers like Chrome and Firefox). Changes made are saved and available for the current page, as well as future visits to the site on the same window. Once the window is closed, the storage is deleted
Cons:
The data is available only inside the window/tab in which it was set.
The data is not persistent, i.e., it will be lost once the window/tab is closed.
Like localStorage, it works on the same-origin policy. So, stored data will only be available on the same origin.
PostMessage
Pros:
Safely enables cross-origin communication.
As a data point, the WebKit implementation (used by Safari and Chrome) doesn't currently enforce any limits (other than those imposed by running out of memory).
Cons:
Need to open a window from the current window and then can communicate only as long as you keep the windows open.
Security concerns - when listening for messages sent via postMessage, you will also pick up postMessage events published by other JavaScript plugins, so be sure to set a targetOrigin and to sanity-check the data passed to the message listener.
A combination of PostMessage + SessionStorage
Using postMessage to communicate between multiple tabs and at the same time using sessionStorage in all the newly opened tabs/windows to persist data being passed. Data will be persisted as long as the tabs/windows remain opened. So, even if the opener tab/window gets closed, the opened tabs/windows will have the entire data even after getting refreshed.
I have written a JavaScript library for this, named AcrossTabs which uses postMessage API to communicate between cross-origin tabs/windows and sessionStorage to persist the opened tabs/windows identity as long as they live.

I've created a library sysend.js for sending messages between browser tabs and windows. The library doesn't have any external dependencies.
You can use it for communication between tabs/windows in the same browser and domain. The library uses BroadcastChannel, if supported, or storage event from localStorage.
The API is very simple:
sysend.on('foo', function(data) {
console.log(data);
});
sysend.broadcast('foo', {message: 'Hello'});
sysend.broadcast('foo', "hello");
sysend.broadcast('foo', ["hello", "world"]);
sysend.broadcast('foo'); // empty notification
When your browser supports BroadcastChannel it sends a literal object (but it's in fact auto-serialized by the browser) and if not, it's serialized to JSON first and deserialized on another end.
The recent version also has a helper API to create a proxy for cross-domain communication (it requires a single HTML file on the target domain).
Here is a demo.
The new version also supports cross-domain communication, if you include a special proxy.html file on the target domain and call proxy function from the source domain:
sysend.proxy('https://target.com');
(proxy.html is a very simple HTML file that only has one script tag with the library).
If you want two-way communication you need to do the same on other domains.
NOTE: If you implement the same functionality using localStorage, there is an issue in Internet Explorer. There, the storage event is also sent to the window that triggered it, whereas in other browsers it is only fired in other tabs/windows.

Another method that people should consider using is shared workers. I know it's a cutting-edge concept, but you can create a relay on a shared worker that is much faster than localstorage, and doesn't require a relationship between the parent/child window, as long as you're on the same origin.
See my answer here for some discussion I made about this.

There's a tiny open-source component to synchronise and communicate between tabs/windows of the same origin (disclaimer - I'm one of the contributors!) based around localStorage.
TabUtils.BroadcastMessageToAllTabs("eventName", eventDataString);
TabUtils.OnBroadcastMessage("eventName", function (eventDataString) {
DoSomething();
});
TabUtils.CallOnce("lockname", function () {
alert("I run only once across multiple tabs");
});
P.S.: I took the liberty to recommend it here since most of the "lock/mutex/sync" components fail on websocket connections when events happen almost simultaneously.

I wrote an article on this on my blog: Sharing sessionStorage data across browser tabs.
Using a library, I created storageManager. You can achieve this as follows:
storageManager.savePermanentData('data', 'key'); //saves permanent data
storageManager.saveSyncedSessionData('data', 'key'); //saves session data to all opened tabs
storageManager.saveSessionData('data', 'key'); //saves session data to current tab only
storageManager.getData('key'); //retrieves data
There are other convenient methods to handle other scenarios as well.

This is a development of the storage part of Tomas M's answer, for Chrome. We must add a listener:
window.addEventListener("storage", (e)=> { console.log(e) } );
Loading or saving an item in storage will not fire this event - we must trigger it manually with
window.dispatchEvent( new Event('storage') ); // THIS IS IMPORTANT ON CHROME
And now, all open tabs will receive the event.

What does cache mean in POST and GET

I have seen that one of the main differences between POST and GET is that POST is not cached but GET is cached.
Could you explain what is meant by "cache"?
Also, whether I use POST or GET, the server sends me a response. Is there any difference? In all cases I have request data and a response, don't I?
Thanks

To Cache (in the context of HTTP) means to store a page/response either on the client or some intermediate host - perhaps in a content distribution network. When the client requests a page, then the page can be served from the client's cache (if the client requested it before) or the intermediate host. This is faster and requires fewer resources than getting the page from the server that generated it.
One downside is that if the request changes some state on the server, that change won't happen if the page is served from a cache. This is why POST requests are usually not served from a cache.
Another downside to caching is that the cached copy may be out of date. The HTTP caching mechanisms try to prevent this.

The basic idea behind the GET and POST methods is that a GET message only retrieves information but never changes the state of the server. (Hence the name). As a result, just about any caching system will assume that you can remember the last GET response returned, and that the next one will look the same.
A POST, on the other hand, is a request that sends new information to the server. So not only can these not be cached (because there's no guarantee that the next POST won't modify things even more; think +1 like buttons for example) but they actually have to invalidate parts of the cache because they might modify pages.
As a result, your browser for example will warn you when you try to refresh a page to which you POSTed information, because you might make changes you did not want made by doing so. When GETting a page, it will not do so because you cannot change anything on the site by doing so.
(Or rather; it's your job as a programmer to make sure that nothing changes when GETting a page.)

GET is supposed to return the same result from the server without changing anything on the server side, and is hence idempotent.
POST, on the other hand, can modify something on the server (make an entry in the DB, delete something, etc.) and is hence not idempotent.
And with regards to caching the data in GET has been addressed here in a nice manner.
http://www.ebaytechblog.com/2012/08/20/caching-http-post-requests-and-responses/#.VGy9ovmUeeQ

Most efficient java way to test 300,000+ URLs [duplicate]

This question already has answers here:
Preferred Java way to ping an HTTP URL for availability
(6 answers)
Closed 9 years ago.
I'm trying to find the most efficient way to test 300,000+ URLs in a database to basically check if the URLs are still valid.
Having looked around the site I've found many excellent answers and am now using something along the lines of:
Read URL from file....
Test URL:
final URL url = new URL("http://" + address);
final HttpURLConnection urlConn = (HttpURLConnection) url.openConnection();
urlConn.setConnectTimeout(1000 * 10);
urlConn.connect();
urlConn.getResponseCode(); // Do something with the code
urlConn.disconnect();
Write details back to file....
So a couple of questions:
1) Is there a more efficient way to test URLs and get response codes?
2) Initially I am able to test about 50 URLs per minute, but after 5 or so minutes things really slow down - I imagine there are some resources I'm not releasing, but I am not sure which.
3) Certain URLs (e.g. www.bhs.org.au) will cause the above to hang for minutes (not good when I have so many URLs to test) even with the connect timeout set. Is there any way I can tighten this up?
Thanks in advance for any help, it's been quite a few years since I've written any code and I'm starting again from scratch :-)

By far the fastest way to do this would be to use java.nio to open a regular TCP connection to your target host on port 80. Then, simply send it a minimal HTTP request and process the result yourself.
The main advantage of this is that you can have a pool of 10 or 100 or even 1000 connections open and loading at the same time rather than having to do them one after the other. With this, for example, it won't matter much if one server (www.bhs.org.au) takes several minutes to respond. It'll simply hog one of your many connections in the pool, but others will keep running.
You could also achieve that same thing with a little more overhead but a lot less complex coding by using a Thread Pool to run many HttpURLConnections (the way you are doing it now) in parallel in multiple threads.
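The thread-pool variant could look roughly like the sketch below. It is an illustration under assumptions, not the answerer's code: the class ParallelUrlChecker, the methods checkAll and httpStatus, the -1 failure marker, the 10-second timeouts and the choice of HEAD are all made up for the example. Setting a read timeout in addition to the connect timeout is what keeps one silent server from blocking a worker thread for minutes.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical helper class, not part of any library.
final class ParallelUrlChecker {

    /** Runs checker for every URL on a fixed-size pool; -1 marks a check that threw. */
    static Map<String, Integer> checkAll(List<String> urls, Function<String, Integer> checker, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Integer>> futures = new ArrayList<>();
            for (String url : urls) {
                Callable<Integer> task = () -> checker.apply(url);
                futures.add(pool.submit(task));
            }
            Map<String, Integer> results = new LinkedHashMap<>(); // keeps input order
            for (int i = 0; i < urls.size(); i++) {
                int code;
                try {
                    code = futures.get(i).get();
                } catch (ExecutionException e) {
                    code = -1; // the checker itself failed
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    code = -1;
                }
                results.put(urls.get(i), code);
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    /** HEAD request with both timeouts set; returns the status code, or -1 on failure. */
    static int httpStatus(String url) {
        HttpURLConnection conn = null;
        try {
            conn = (HttpURLConnection) URI.create(url).toURL().openConnection();
            conn.setRequestMethod("HEAD");
            conn.setConnectTimeout(10_000);
            conn.setReadTimeout(10_000); // bounds the wait for a server that accepts but never answers
            return conn.getResponseCode();
        } catch (IOException | IllegalArgumentException e) {
            return -1;
        } finally {
            if (conn != null) {
                conn.disconnect();
            }
        }
    }
}
```

With a list of URLs read from the file, ParallelUrlChecker.checkAll(urls, ParallelUrlChecker::httpStatus, 50) would run up to 50 checks at once; the pool size is a guess to be tuned against the available bandwidth and memory.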

This may or may not help, but you might want to change your request method to HEAD instead of using the default, which is GET:
urlConn.setRequestMethod("HEAD");
This tells the server that you do not really need a response back, other than the response code.
The article What Is a HTTP HEAD Request Good for describes some uses for HEAD, including link verification:
[Head] asks for the response identical to the one that would correspond to a GET request, but without the response body. This is useful for retrieving meta-information written in response headers, without having to transport the entire content.... This can be used for example for creating a faster link verification service.

In Java, is it possible to determine the size of a web page before download?

I want to determine the size of a web page so that I can decide whether or not to download it depending on whether it is greater than some limit (e.g. 5 MB).
Can I have this information?

You can do a decent approximation with:
HttpURLConnection content = (HttpURLConnection) new URL("http://www.example.com").openConnection();
System.out.println(content.getContentLength());
However, this will only tell you the length of the specific resource you're requesting (e.g. the HTML at the base of the URL). You will also need to go through the HTML in the page, look at all the resources that it references (scripts from other sites, images, video, etc.) and total them all up.
That will get you fairly close to a total size, but even then you won't get a perfect count, because (a) not all URLs are going to return this information and you don't have any control over that, and (b) depending on how the content is loaded (such as through AJAX calls that hide the specifics) you won't be able to know ahead of time the complete list of resources to be downloaded.
Alternatively, if a URL doesn't return a result, I think Giacomo was suggesting the use of a CounterInputStream. Not a bad idea. You could maybe combine the above suggestion with the CounterInputStream to keep a count of the total that has been sent, and potentially stop the transfer when it reaches a specified maximum total transfer size. That way you'd essentially have a predicted size (say a site tells you it's going to be 3.3 MB), but as you're downloading you find out that it's actually 6 MB and hasn't stopped yet, and make the decision not to download any more than that.
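The counting idea can be sketched without any special stream class by tracking the byte total while reading. This is a minimal sketch under assumptions: CappedReader and readUpTo are hypothetical names, and this is not the CounterInputStream mentioned above, just the same principle written as a plain read loop.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.Optional;

// Hypothetical helper class, not part of any library.
final class CappedReader {

    /**
     * Reads the stream to the end and returns its bytes, or returns empty
     * as soon as more than maxBytes have arrived. The caller remains
     * responsible for closing the stream.
     */
    static Optional<byte[]> readUpTo(InputStream in, long maxBytes) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        long total = 0;
        try {
            int n;
            while ((n = in.read(buffer)) != -1) {
                total += n;
                if (total > maxBytes) {
                    return Optional.empty(); // over the limit: stop reading
                }
                out.write(buffer, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return Optional.of(out.toByteArray());
    }
}
```

With an HttpURLConnection, one would first look at getContentLengthLong() (which returns -1 when the server does not announce a length) and skip the download if it already exceeds the limit, then pass conn.getInputStream() to readUpTo as a safety net for missing or wrong headers.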

I may be wrong, but can't you just use
HttpURLConnection conn = (HttpURLConnection) new URL("http://www.google.com").openConnection();
System.out.println(conn.getContentLength());
?

How can I do sessions when the URL is very long and I cannot append JSESSIONID=389729387392? What is the solution for this?

I know that if cookies are disabled I can pass the JSESSIONID using URL redirect, but my URL is already very long, and since I use the GET method it has a length constraint. How then should I use my sessions? I want my application to be very security intensive.
This is one of the questions my friend was asked in a Google interview.

Apart from using one-letter parameter names (e.g. ?a=value1&b=value2&c=value3) or RESTful-like URLs (i.e. just the pathinfo, no query parameters, e.g. /value1/value2/value3, which is accessible by HttpServletRequest#getPathInfo() in the servlet) instead of ?name1=value1&name2=value2&name3=value3, you can also consider Gzipping and Base64-encoding the query string so that it becomes shorter. Both JavaScript and Java are capable of (de)compressing and (d)e(n)coding it. You could also format the query string as JSON before compressing/encoding; it will be shorter in the case of arrays/collections/maps.
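On the Java side, the Gzip plus Base64 step might look like the sketch below. The class QueryPacker and its pack/unpack methods are hypothetical names chosen for the example, and the URL-safe Base64 alphabet without padding is an assumption so that the result can go into a URL unescaped. Note that Gzip adds at least 18 bytes of header and trailer and Base64 grows the data by a third, so this only pays off for long, repetitive query strings; short ones get longer.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper class, not part of any library.
final class QueryPacker {

    /** Gzips the query string and encodes it with the URL-safe Base64 alphabet. */
    static String pack(String query) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(bytes)) {
            gzip.write(query.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // The gzip stream is closed here, so the trailer has been written.
        return Base64.getUrlEncoder().withoutPadding().encodeToString(bytes.toByteArray());
    }

    /** Reverses pack(): Base64-decodes, then gunzips back to the original string. */
    static String unpack(String packed) {
        byte[] compressed = Base64.getUrlDecoder().decode(packed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPInputStream gzip = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            byte[] buffer = new byte[4096];
            int n;
            while ((n = gzip.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }
}
```

The packed value could then travel as a single parameter (say ?q=...) and be unpacked in the servlet before parsing; it is compression only and provides no confidentiality.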
That said, are you sure that the request URLs are often that unfriendly and long (assuming that it's over 255 characters)? Why would you need to pass that much information in? Are they supposed to maintain the client state? If so, you shouldn't use the URL for this, but the HttpSession instance on the server side, which is already associated with the jsessionid cookie. Use HttpSession#setAttribute() to store some information in the session and use HttpSession#getAttribute() to retrieve it.

As far as I understand, your main problem with JSESSIONID in the URL is the total length.
Perhaps you should have a closer look at why the URLs are too long in the first place. Since you already have a session, it is not unlikely that you can move some GET parameters to the session. There are also lots of different ways to make shorter URLs for pages (a la mod_rewrite).
With regard to security, JSESSIONID is just as vulnerable with HTTP GET as with HTTP POST. The encoding applied to an HTTP POST body is not a security measure at all. The best way to gain a bit more security is to encrypt the transport channel through TLS/SSL, in effect enabling HTTPS. This will make sure that eavesdroppers (or man-in-the-middle attackers) will not have access to the plain text.

If you want your application to be security intensive, why are you using GET? Use POST. This will also reduce the URL length.
As such, as per the HTTP protocol there is no maximum limit on URL length. Most of the time it's the browser that imposes the maximum length. Try different browsers.
You should put forward the above points to the interviewer. They might be more interested in your ability to assess the system as a whole and identify any fundamental flaws.

If the URL is too long then you have to store that data somewhere else. Most sites would put the session ID in a cookie.
