Java concurrent recursive website download - java

I need to write a program that downloads websites: I give it a starting page, and it should download all the files on that site.
I would also pass a depth level, that is, how many levels of links the program should follow when downloading files.
I will develop this in Java, and I also need to use concurrency.
Please tell me how you would approach this.
Thanks for the help.
Thanks to everyone for the help.
I need to ask one more thing: how do I download a file from a website?
Thanks once more. =D

A very useful library for spiders and bots: htmlunit

Well, this is a bit hard to answer without knowing how detailed the guidance needs to be, but here's an overview. :)
Java makes such applications quite easy, actually, since both HTTP requests and threading are readily available. My solution would probably involve a global stack containing new URLs, and a farm of a constant number of threads that pop URLs from the stack. I'd store the URLs as a custom object, so that I could keep track of the depth.
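A minimal sketch of that design, under a few assumptions: the class name `DepthLimitedCrawler`, the `CrawlTask` holder, and the `linkExtractor` function (which stands in for "fetch a page and return the links found on it") are all hypothetical. A fixed thread pool plays the role of the thread farm, and its internal work queue replaces the global stack, so URLs are processed roughly first-in, first-out rather than last-in, first-out.

```java
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

/** Hypothetical crawler skeleton: a fixed pool of workers and depth-tagged URL tasks. */
final class DepthLimitedCrawler {
    /** A URL together with its link distance from the start page. */
    private static final class CrawlTask {
        final String url;
        final int depth;
        CrawlTask(String url, int depth) { this.url = url; this.depth = depth; }
    }

    // Assumption: fetches the page at the given URL and returns the links on it.
    private final Function<String, List<String>> linkExtractor;
    private final int maxDepth;
    private final int threads;

    DepthLimitedCrawler(Function<String, List<String>> linkExtractor, int maxDepth, int threads) {
        if (maxDepth < 0 || threads < 1) {
            throw new IllegalArgumentException("maxDepth must be >= 0 and threads >= 1");
        }
        this.linkExtractor = linkExtractor;
        this.maxDepth = maxDepth;
        this.threads = threads;
    }

    /** Crawls from startUrl and returns every URL visited within maxDepth links. */
    Set<String> crawl(String startUrl) {
        Set<String> visited = ConcurrentHashMap.newKeySet();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        AtomicInteger pending = new AtomicInteger(); // tasks scheduled but not yet finished
        CountDownLatch done = new CountDownLatch(1);
        try {
            schedule(new CrawlTask(startUrl, 0), visited, pool, pending, done);
            done.await(); // released when the last pending task finishes
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // give up and return what we have
        } finally {
            pool.shutdownNow();
        }
        return visited;
    }

    private void schedule(CrawlTask task, Set<String> visited, ExecutorService pool,
                          AtomicInteger pending, CountDownLatch done) {
        // Skip URLs that are too deep or that another worker has already claimed.
        if (task.depth > maxDepth || !visited.add(task.url)) {
            return;
        }
        pending.incrementAndGet();
        pool.execute(() -> {
            try {
                for (String link : linkExtractor.apply(task.url)) {
                    schedule(new CrawlTask(link, task.depth + 1), visited, pool, pending, done);
                }
            } catch (RuntimeException e) {
                // A failing page should not stop the rest of the crawl.
            } finally {
                if (pending.decrementAndGet() == 0) {
                    done.countDown();
                }
            }
        });
    }
}
```

Children are scheduled before the parent decrements the pending counter, so the counter can only reach zero once no work is left. Because workers run concurrently, a URL is recorded at whatever depth it is first seen, which is not necessarily its shortest distance from the start page.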
I think your main issue here will be with sites that don't respond, or don't follow the HTTP standard. I've noticed many times in similar applications that such requests sometimes don't time out properly, and eventually they end up blocking all the threads. Unfortunately I don't have any good solutions here.
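One partial mitigation is to set explicit connect and read timeouts on each connection, which also shows the basic way to download a file. This is only a sketch: the class name `TimedDownload` and the single `timeoutMillis` parameter are assumptions. Note that the read timeout bounds each individual blocking read, not the whole transfer, so a server that keeps trickling bytes can still occupy a thread for a long time.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical helper: downloads one URL to a file with explicit timeouts. */
final class TimedDownload {
    /** Copies the content at source into target and returns the number of bytes written. */
    static long download(URL source, Path target, int timeoutMillis) throws IOException {
        URLConnection connection = source.openConnection();
        connection.setConnectTimeout(timeoutMillis); // limit on establishing the connection
        connection.setReadTimeout(timeoutMillis);    // limit on each blocking read
        try (InputStream in = connection.getInputStream()) {
            return Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        }
    }
}
```

When a timeout expires, the call fails with a `java.net.SocketTimeoutException` (a subclass of `IOException`), which a worker thread can catch before moving on to the next URL.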
A few useful classes as a starting point:
http://java.sun.com/javase/6/docs/api/java/lang/Thread.html
http://java.sun.com/javase/6/docs/api/java/lang/ThreadGroup.html
http://java.sun.com/javase/6/docs/api/java/net/URL.html
http://java.sun.com/javase/6/docs/api/java/net/HttpURLConnection.html

I would look at these resources:
http://hc.apache.org/httpclient-3.x/
http://java.sun.com/javase/6/docs/api/java/util/concurrent/package-summary.html
http://java.sun.com/javase/6/docs/api/java/util/concurrent/locks/package-summary.html

I would have a look at the Java Executors package. You create a set of tasks (Runnables) and pass them to a suitably chosen Executor. When you submit a task to an ExecutorService you get a Future back, and you can then query this for its result.
The Executor will coordinate when each Runnable is executed. Implementations exist for single-threaded executors, executors with a pool of threads, etc., so you don't need to worry (too much) about the threading intricacies. The concurrency utilities will look after this for you.
Apache HTTP Client will look after the HTTP querying for you.
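The submit-and-query pattern described above might look like this. It is a sketch with assumed names (`ExecutorDemo`, `lengths`), and it uses `Callable` rather than `Runnable` because the tasks return a value; computing a string's length stands in for fetching a page.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical demo: run one task per input on a thread pool and collect the results. */
final class ExecutorDemo {
    static List<Integer> lengths(List<String> pages, int threads) {
        ExecutorService executor = Executors.newFixedThreadPool(threads);
        try {
            // Submit every task first so that they can run concurrently.
            List<Future<Integer>> futures = new ArrayList<>();
            for (String page : pages) {
                Callable<Integer> task = () -> page.length(); // stand-in for real work
                futures.add(executor.submit(task));
            }
            // Then query each Future; get() blocks until that task has finished.
            List<Integer> results = new ArrayList<>();
            for (Future<Integer> future : futures) {
                results.add(future.get());
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for results", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a task failed", e.getCause());
        } finally {
            executor.shutdown();
        }
    }
}
```

Results come back in submission order because the futures are queried in the order they were created, regardless of which task finishes first.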

Related

Advanced Java File I/O tutorials? Tips? Advice?

I'm working on a project right now that will make use of Java File I/O that goes beyond the simple "write this string to a file" documentation and tutorials that I find on the net. This project will essentially provide a database mechanism, similar to the popular "NoSQL" databases that are gaining a lot of press these days. However, I'm unable to find a ton of documentation that provides detailed information on which APIs to use, how to use them, etc. I've also been looking for any generally accepted design patterns around Java File I/O, but without any luck.
If I had to list a couple of requirements, I'd say:
Pseudo-transactional support (not a hard requirement, as it can be implemented higher up in the API stack)
Ability to write data of an arbitrary length in a structure that can be read back later on
Indexing
Ability to remove an object from the "database" efficiently
Fast searching
Possible multi-threaded access (multiple read threads, single write, most likely)
Can anyone point me to any tutorials, documentation, design patterns, etc. that may be helpful? Are there any open source frameworks that revolve around Java File I/O? I know of a lot of frameworks that provide wrappers around NIO for the purposes of Network I/O, but nothing File-related.
Thanks for any help you can provide!
Take a look at Apache Commons Transaction. It supports transactional file access, by performing the work in temporary files, and committing the work by moving them to the actual files.
You might also be interested in the XADisk project, although I haven't pored through its sources.
As far as searching is concerned, the Apache Solr and Lucene projects would be of help.

Restrict download file bandwidth/speed in Servlet

We have a high-load Java application that runs in clustered mode.
I need to add the ability for our customers to download and upload files.
For storing files I'm going to use GridFS. I'm not sure it's the best choice, but MongoDB can be clustered and can replicate data between different nodes.
That's exactly what I need.
Different groups of users should be limited to different bandwidths. Based on some business rules, I need to restrict the download speed for some users.
I have seen a few solutions for this.
Most of them work the same way:
Read bunch of bytes
Sleep thread
Repeat
MongoDB simply gives me an InputStream, and I can read from that stream and write to the servlet output stream. I'm not sure this is a valid approach. I'm also afraid that users could tie up a lot of concurrent threads during downloads, which could hurt performance.
Could it be an issue for the servlet container?
If it could be an issue, how can it be avoided? Perhaps by using NIO?
I'd prefer a pure Java solution.
Any help will be highly appreciated.
Leaky bucket or token bucket algorithms can be used to control the network bandwidth.
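As a rough illustration of the token bucket idea applied to a stream copy, here is a sketch; the class names `TokenBucket` and `ThrottledCopy`, the method names, and the choice of one token per byte are all assumptions. The bucket refills at a fixed rate up to a burst capacity, and the copy loop waits whenever the bucket cannot cover the next chunk.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InterruptedIOException;
import java.io.OutputStream;

/** Minimal token bucket: one token per byte, refilled continuously up to a burst capacity. */
final class TokenBucket {
    private final long capacity;
    private final long bytesPerSecond;
    private double tokens;
    private long lastRefillNanos;

    TokenBucket(long capacity, long bytesPerSecond, long startNanos) {
        if (capacity <= 0 || bytesPerSecond <= 0) {
            throw new IllegalArgumentException("capacity and rate must be positive");
        }
        this.capacity = capacity;
        this.bytesPerSecond = bytesPerSecond;
        this.tokens = capacity;            // start full, so an initial burst is allowed
        this.lastRefillNanos = startNanos;
    }

    /** Takes n tokens if they are available at time nowNanos; never blocks. */
    synchronized boolean tryConsume(long n, long nowNanos) {
        long elapsed = nowNanos - lastRefillNanos;
        if (elapsed > 0) {
            tokens = Math.min(capacity, tokens + elapsed * (double) bytesPerSecond / 1e9);
            lastRefillNanos = nowNanos;
        }
        if (tokens < n) {
            return false;
        }
        tokens -= n;
        return true;
    }

    long capacity() {
        return capacity;
    }
}

/** Copies a stream, pausing whenever the bucket cannot cover the next chunk. */
final class ThrottledCopy {
    static long copy(InputStream in, OutputStream out, TokenBucket bucket, int chunkSize)
            throws IOException {
        if (chunkSize <= 0 || chunkSize > bucket.capacity()) {
            throw new IllegalArgumentException("chunkSize must be between 1 and the bucket capacity");
        }
        byte[] buffer = new byte[chunkSize];
        long total = 0;
        int read;
        while ((read = in.read(buffer)) != -1) {
            while (!bucket.tryConsume(read, System.nanoTime())) {
                try {
                    Thread.sleep(1); // wait for the bucket to refill
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new InterruptedIOException("interrupted while throttled");
                }
            }
            out.write(buffer, 0, read);
            total += read;
        }
        return total;
    }
}
```

A per-user or per-group limit could be modelled by giving each group its own bucket. This blocking version still holds one thread per download while it waits.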
EDIT: I did some quick prototyping and implemented the algorithm leveraging Servlet 3.0 asynchronous processing. Results are pretty good. Full source code can be found on GitHub. Have fun!
Also I'm afraid that users can create a lot of concurrent threads during download and it can hurt performance.
Could it be an issue for the servlet container?
Yes, it could.
If it could be an issue, how can it be avoided? Probably using NIO?
NIO won't help per se. It certainly won't prevent the low-bandwidth responses tying up threads for long periods of time.
I think what you would need to do is to implement downloads in a special web container. I'm not sure, but I think that Servlet 3.0 with async mode might do the trick.

Is there a Java equivalent to libevent?

I've written a high-throughput server that handles each request in its own thread. For requests coming in it is occasionally necessary to do RPCs to one or more back-ends. These back-end RPCs are handled by a separate queue and thread-pool, which provides some bounding on the number of threads created and the maximum number of connections to the back-end (it does some caching to reuse clients and save the overhead of constantly creating connections). Having done all this, though, I'm beginning to think an event-based architecture would be more efficient.
In searching around I haven't found any equivalents to libevent for Java, but maybe I'm not looking in the right place? Mina-statemachine from Apache was the closest thing I found, but it looks more verbose than I need and there's no real release available.
Any suggestions?
I am a bit late but:
Have you looked at Netty?
Or Grizzly.
How about the Light Weight Event System? :) http://www.lwes.org/ and http://sourceforge.net/projects/lwes/files/
The answer seems to be 'no', though it looks like the Ruby EventMachine library provides a Java implementation for JRuby users that might be usable or at least serve as inspiration for writing my own:
http://github.com/eventmachine/eventmachine/tree/master/java/
You might be looking for a workflow engine like jBPM or any other open source tool listed here.

Disable libraries in Java?

Assume I have a webpage where people submit Java source code (a simple class).
I want to compile and run the code on my server, but naturally I want to prevent people from harming my server, so how do I disable java.io.* and other functions/libraries of my choice?
A regexp on the source code would be one way, but it would be "nicer" if one could pass some argument to javac or java.
(This could be useful when creating an AI competition or something where one implements a single class, but I want to prevent tampering with the java environment.)
If you are in complete control of the JVM, then you can use security policies to do this. It's the same approach taken by web browsers when they host applets.
http://java.sun.com/j2se/1.5.0/docs/guide/security/permissions.html
Hope this helps.
Depending on your intent, you might be able to speak with Nick Parlante, who runs javabat.com - it does pretty much exactly what you're describing. I don't know whether he's willing to share his solution, but he might be able to give you some specific help.
My advice is don't do it. At least, don't do it unless you are willing and prepared to accept the consequences of the machine that runs your server being hacked. And maybe other machines on the same network.
The Google App Engine uses an approach where classes are whitelisted - that is, they are probably either not loaded, or the classes themselves changed and the libraries recompiled, so that no IO or other system calls can be made. Perhaps you could try this by recompiling a JVM like http://jikesrvm.org/.
You can always run the code in a custom classloader. This gives you full control over which classes you are willing to load.
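A minimal sketch of the classloader idea, assuming a hypothetical `FilteringClassLoader` that refuses any class whose name starts with a blocked prefix. It only filters classes resolved through this loader, so the submitted code has to be defined by this loader (for example from a directory passed in `urls` that the parent loader cannot see). It is not a complete sandbox on its own: core classes loaded by the parent can still use the blocked packages internally.

```java
import java.net.URL;
import java.net.URLClassLoader;
import java.util.List;

/** Hypothetical loader that rejects classes from blocked packages before delegating. */
final class FilteringClassLoader extends URLClassLoader {
    private final List<String> blockedPrefixes;

    FilteringClassLoader(URL[] urls, ClassLoader parent, List<String> blockedPrefixes) {
        super(urls, parent);
        this.blockedPrefixes = List.copyOf(blockedPrefixes);
    }

    @Override
    protected Class<?> loadClass(String name, boolean resolve) throws ClassNotFoundException {
        // Check the name first, so that blocked classes never reach the parent loader.
        for (String prefix : blockedPrefixes) {
            if (name.startsWith(prefix)) {
                throw new ClassNotFoundException("Blocked by policy: " + name);
            }
        }
        return super.loadClass(name, resolve);
    }
}
```

Blocking a broad prefix such as `java.io.` also breaks harmless code, since even `System.out` is a `java.io.PrintStream`, so a list of allowed classes may turn out to be more practical than a list of blocked ones.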

Where can I find an AS400 to Java interface?

Does anyone have links and resources to connect to an AS400 from Java?
I remember that years ago somebody told me about a connector that simulates keystrokes from the keyboard, and about another, "purer" approach that connects directly.
On the web I have found a lot of links, but I cannot find a complete product to do this (I am probably not using the right keywords).
EDIT
Thanks for the answers:
What we are looking for is a way to access the data inside the AS400 and/or the screens it uses, and to expose them for reuse by new applications, either as a web service of some sort or directly through Java (in which case Java would expose the operations as web services).
Thanks in advance.
EDIT
As per MicSim's post, I've also found this link:
http://www.ibm.com/developerworks/library/ws-as400/index.html
What you are looking for is probably the Toolbox for Java™ & JTOpen from IBM. There is also an AS400 class in the toolbox for performing specific AS400 tasks. You can look here and here for more details. Just googled it and hope it's helpful.
IBM's 5250 screen-scraping technology was "WebFacing". I would post a link, but you're probably better off Googling it, since IBM's documentation is so scattered. There are other technologies available too, but screen-scraping was never anyone's favourite, since you typically end up with something that looks more up to date but is actually harder to use than a green screen and no more functional. The 5250 is probably the single best data entry platform I've ever used; web forms in a browser are one of the worst.
As mentioned, jt400 is the way to go for most other things. In particular:
JDBC - for all things SQL. If you do it right and address your files as though they really are tables, it's a way to get away from the 400 entirely.
Record-level access - write Java programs using a similar database API to RPGLE (all those chains, setlls that 400 programmers love)
Call programs, system commands, manage resources (data queues, data areas, prints / spools, jobs etc etc)
Good luck
If you just want to run Java on the AS/400 (or iSeries, or System i, or whatever IBM's marketing department has decided to call it this month), that's a supported language. You can access the pseudo-DB2 database directly. Or are you after some other form of integration?
This obviously depends on what you want to do, however if you want to simulate keystrokes across a network connection to an AS400 process then Expect4j may be the library you are looking for.
This is generally a really nasty hack though and there are frequently better ways to achieve your goals. What are you trying to do?
The Expect4j library can be found here. Expect was originally a Unix command that allowed you to specify a string that you are expecting to see and then a string of characters to return. It was frequently used for automating logins etc. and for screen-scraping applications.
Even better is the TN5250j Console, which can be used to extract data from the AS/400.
Jacada makes tools that do what you're looking for:
http://www.jacada.com/
