Parsing an HTML file in Java

I am currently developing an application that will request some information from websites. What I'm looking to do is parse the HTML files through a connection online. I was just wondering: will parsing the website put any strain on the server? Will it have to download any excess information, or will it simply connect to the site as I would through my browser and then scan the source?
If this is putting extra strain on the website then I'm going to have to make a special request to some of the companies I'm scanning. However, if not, then I already have permission to do this.
I hope this made some sort of sense.
Kind regards,
Jamie.

No extra strain on other people's servers. The server will get your simple HTTP GET request; it won't even be aware that you're then parsing the page/HTML.
Have you checked this: JSoup?
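For what it's worth, a minimal jsoup sketch looks something like this (the URL, user-agent string and selector are placeholders):

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    public class PageParser {
        public static void main(String[] args) throws Exception {
            // One GET request, the same as a browser fetching the page source.
            Document doc = Jsoup.connect("http://example.com")   // placeholder URL
                    .userAgent("MyParser/1.0")                   // placeholder identifier
                    .timeout(10000)
                    .get();

            System.out.println("Title: " + doc.title());
            // Extract whatever you need with CSS-style selectors.
            for (Element link : doc.select("a[href]")) {
                System.out.println(link.attr("abs:href"));
            }
        }
    }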

Consider doing the parsing and the crawling/scraping in separate steps. If you do that, you can probably use an existing open-source crawler such as crawler4j that already has support for politeness delays, robots.txt, etc. If you just blindly go grabbing content from somebody's site with a bot, the odds are good that you're going to get banned (or worse, if the admin is feeling particularly vindictive or creative that day).

Depends on the website. If you do this to Google then most likely you will be blocked for a day. If you parse Wikipedia (which I have done myself) it won't be a problem, because it's already a huge, huge website.
If you want to do it the right way, first respect robots.txt, then try to scatter your requests. Also try to do it when traffic is low, like around midnight, and not at 8 AM or 6 PM when people get to their computers.

Besides Hank Gay's recommendation, I can only suggest that you also re-use an existing open-source HTML parser, such as jsoup, for parsing/processing the downloaded HTML files.

You could use HtmlUnit. It gives you a virtual, GUI-less browser.
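A minimal sketch with HtmlUnit might look like this (exact method names vary a bit between HtmlUnit versions; the URL is a placeholder):

    import com.gargoylesoftware.htmlunit.WebClient;
    import com.gargoylesoftware.htmlunit.html.HtmlPage;

    public class HeadlessExample {
        public static void main(String[] args) throws Exception {
            // WebClient is the headless "browser"; it can even run the page's JavaScript.
            try (WebClient webClient = new WebClient()) {
                HtmlPage page = webClient.getPage("http://example.com"); // placeholder URL
                System.out.println(page.getTitleText());
                System.out.println(page.asText()); // page content as plain text
            }
        }
    }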

Your Java program hitting other people's servers to download the content of a URL won't put any more strain on them than a web browser doing so; essentially they're precisely the same operation. In fact, you probably put less strain on them, because your program probably won't bother downloading the images, scripts etc. that a web browser would.
BUT:
if you start bombarding the server of a company with moderate resources with downloads, or start exhibiting obvious "robot" patterns (e.g. downloading precisely every second), they'll probably block you; so put some sensible constraints on what you do (e.g. make each consecutive download to the same server happen at a random interval of between 10 and 20 seconds);
when you make your request, you probably want to set the "referer" request header either to mimic an actual browser, or to be open about what it is (invent a name for your "robot", create a page explaining what it does and include a URL to that page in the referer header); many server owners will let through legitimate, well-behaved robots, but block "suspicious" ones where it's not clear what they're doing (a rough sketch of these two points appears after this answer);
on a similar note, if you're doing things "legally", don't fetch pages that the site's robots.txt file prohibits you from fetching.
Of course, within some bounds of "non-malicious activity", in general it's perfectly legal for you to make whatever request you want whenever you want to whatever server. But equally, that server has a right to serve or deny you that page. So to prevent yourself from being blocked, one way or another, you need to either get approval from the server owners, or "keep a low profile" in your requests.
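To illustrate the header and timing points above, here is a rough sketch; the robot name and info-page URL are made up, and a real crawler would also check robots.txt before fetching:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.util.Random;

    public class PoliteFetcher {
        private static final Random RANDOM = new Random();

        // Fetch one page, openly identifying the robot in the request headers.
        static String fetch(String url) throws Exception {
            HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
            // Hypothetical robot name and explanation page; use your own.
            conn.setRequestProperty("User-Agent", "ExampleBot/1.0 (+http://example.com/bot-info)");
            conn.setRequestProperty("Referer", "http://example.com/bot-info");
            StringBuilder html = new StringBuilder();
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
                String line;
                while ((line = in.readLine()) != null) {
                    html.append(line).append('\n');
                }
            }
            return html.toString();
        }

        // Wait a random 10-20 seconds between consecutive requests to the same server.
        static void politePause() throws InterruptedException {
            Thread.sleep(10000 + RANDOM.nextInt(10000));
        }
    }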

Related

Creating a file for download (client side vs. backend)

I need to allow for CSV file downloads on my page, and I was going to try ngCsv (from Angular), but browser support for it seems fairly limited. I've seen quite a few examples of this being done with vanilla JavaScript. And after a "backend vs. frontend" discussion with a colleague, I'm feeling more and more unsure of what to do.
Are there any true optimization/efficiency reasons why I should avoid doing this on the client side (assuming the files are no more than 100MB each download)?
If the data on the .csv would be the same for each user, and only updated every now and then, I would suggest you have your server create / update a static .csv. It wouldn't be resource-intensive, and you wouldn't have to worry about browser compatibility / user resources.
If, however, the data you need to create a .csv for is different on a per-user basis, then you should consider creating the file client-side. If you can help it, you don't want your server having to dynamically generate 100MB .csv files each time a user clicks the link.
You could write a script that only generates the .csv client-side if the browser is not mobile and there is web-worker support. If either of those conditions are not met, you could fall back to having your server do it.
Ultimately, your answer is going to really depend on the requirements / context of this project. Try to cache the results where possible, and use common sense. Good luck :)
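If the backend happened to be Java, the "server creates/updates a static .csv" suggestion could be as simple as a scheduled job like this sketch (the path, schedule and rows are placeholders):

    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.Arrays;
    import java.util.List;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    public class StaticCsvJob {
        public static void main(String[] args) {
            ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
            // Regenerate the shared file every hour; users then download it as a plain static file.
            scheduler.scheduleAtFixedRate(StaticCsvJob::writeCsv, 0, 1, TimeUnit.HOURS);
        }

        static void writeCsv() {
            List<String> rows = Arrays.asList(   // replace with the real query results
                    "id,name,score",
                    "1,alice,97",
                    "2,bob,85");
            try {
                Files.write(Paths.get("/var/www/static/report.csv"), rows, StandardCharsets.UTF_8);
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }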

Can you programmatically connect to a sequence of web-pages and parse the source HTML without imposing stress on the system or raising red flags?

I am working on a project in NLP requiring me to download quite a few video game reviews --- about 10,000 per website. So, I am going to write a program that goes to each URL and pulls out the review part of each page as well as some additional metadata.
I'm using Java and was planning on just opening an HttpURLConnection and reading the text through an input stream. Then, closing the connection and opening the next one.
My questions are these:
1) Let's assume this is a site with medium-to-small amounts of traffic: normally, they receive about 1000 requests per second from normal users. Is it possible that my program would cause undue stress to their system, impacting the user experience for others?
2) Could these connections made one right after another appear as some kind of malicious attack?
Am I being paranoid, or is this an issue? Is there a better way to go about getting this data? I am going to several websites so working individually with site administrators is inconvenient and probably impossible.
If you mimic a web browser, and extract text at human speeds (that is, it normally takes a human several seconds to "click thru" to the next page even if they aren't reading the text), then the server can't really tell what the client is.
In other words, just throttle your slurping to 1 page per few seconds, and no problems.
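For instance, a minimal sketch of the HttpURLConnection approach you describe, with a random pause between pages (the URLs and timings are just placeholders):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.util.Arrays;
    import java.util.List;
    import java.util.Random;

    public class ReviewDownloader {
        public static void main(String[] args) throws Exception {
            // Hypothetical list of review URLs gathered beforehand.
            List<String> urls = Arrays.asList(
                    "http://example.com/review/1",
                    "http://example.com/review/2");
            Random random = new Random();

            for (String url : urls) {
                HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
                StringBuilder html = new StringBuilder();
                try (BufferedReader in = new BufferedReader(
                        new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
                    String line;
                    while ((line = in.readLine()) != null) {
                        html.append(line).append('\n');
                    }
                } finally {
                    conn.disconnect();
                }
                // ... extract the review text and metadata from html here ...

                // Pause a few seconds so requests arrive at roughly human speed.
                Thread.sleep(2000 + random.nextInt(3000));
            }
        }
    }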
The other concern you ought to have is legality. I assume these reviews are material that you didn't write and have no permission to create derivative works from. If you are just slurping them for personal use, then it's OK. If you are slurping them to create something (a derivative work), then you are breaking copyright.
I believe you are misunderstanding how HTTP requests work. You ask for a page and you get it; the fact that you're reading a stream one line at a time has no bearing on the HTTP request, and the site is perfectly happy to give you your one page at a time. It won't look malicious (because it's just one user reading pages, which is totally normal behavior). You're 100% OK to proceed with your plan (if it is as you described it).

Can page scraping be detected?

So I just created an application that does page scraping for me, and ran it. It worked fine. I was wondering whether someone would be able to figure out that the page was being scraped, whether or not they had written code for that purpose?
I wrote the code in Java, and it's pretty much just checking for one line of the HTML code.
I thought I'd get some insight on that before I add any more code to this program. I mean, it's useful and all, but it's almost like a hack.
Seems like the worst-case scenario as a result of this page scraper isn't too bad, as I can just use another device later and the IP will be different. Also, it might not matter in a month. The website seems to be getting quite a lot of web traffic at the moment anyway. Whoever edits the page is probably asleep now, and it really hasn't accomplished anything at this point, so this could go unnoticed.
Thanks for such fast responses. I think it might have gone unnoticed. All I did was copy a header, so just text. I guess that is probably similar to how browser copy-paste works. The page was just edited this morning, including the text I was trying to get. If they did notice anything, they haven't announced it, so all is good.
It is a hack. :)
There's no way to programmatically determine if a page is being scraped. But, if your scraper becomes popular or you use it too heavily, it's quite possible to detect scraping statistically. If you see one IP grab the same page or pages at the same time every day, you can make an educated guess. Same if you see requests on another timer.
You should try to obey the robots.txt file if you can, and rate limit yourself, to be polite.
As a sysadmin myself, yes, I'd probably notice, but ONLY based on the behavior of the client. If a client had a weird user agent, I'd be suspicious. If a client browsed the site too quickly or at very predictable intervals, I'd be suspicious. If certain support files were never requested (favicon.ico, various linked CSS and JS files), I'd be suspicious. If the client were accessing odd (not directly accessible) pages, I'd be suspicious.
Then again I'd have to actually be looking at my logs. And this week Slashdot has been particularly interesting, so no I probably wouldn't notice.
It depends on how you have implemented it and how smart the detection tools are.
First, take care with the User-Agent header. If you do not set it explicitly it will be something like "Java/1.6". Browsers send their own "unique" user agents, so you can just mimic browser behavior and send the User-Agent of MSIE or Firefox (for example).
Second, check the other HTTP headers. Browsers send their own specific headers; take one example and follow it, i.e. try to add those headers to your requests (even if you do not need them).
A human user acts relatively slowly. A robot may act very quickly, i.e. retrieve the page and then immediately "click" a link, i.e. perform yet another HTTP GET. Put random sleeps between these operations.
A browser retrieves not only the main HTML; it then downloads images and other resources. If you really do not want to be detected, you have to parse the HTML and download that stuff too, i.e. actually be a "browser".
And the last point: it is obviously not your case, but it is almost impossible to implement a robot that passes a CAPTCHA. This is yet another way to detect a robot.
Happy hacking!
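To illustrate the User-Agent and header points above, a connection can be made to look more like a browser with something along these lines (the header values are just example strings copied from a real browser):

    import java.net.HttpURLConnection;
    import java.net.URL;

    public class BrowserLikeRequest {
        // Opens a connection whose headers resemble a typical browser's,
        // instead of the default "Java/1.x" user agent.
        static HttpURLConnection open(String url) throws Exception {
            HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
            conn.setRequestProperty("User-Agent",
                    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:115.0) Gecko/20100101 Firefox/115.0");
            conn.setRequestProperty("Accept",
                    "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8");
            conn.setRequestProperty("Accept-Language", "en-US,en;q=0.5");
            return conn;
        }
    }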
If your scraper acts like a human then there is hardly any chance for it to be detected as a scraper. But if your scraper acts like a robot then it's not difficult to detect.
To act like a human you will need to:
Look at what a browser sends in the HTTP headers and simulate them.
Look at what a browser requests when accessing the page, and request the same things with the scraper
Time your scraper to access at the speed of a normal user
Send requests at random intervals of time instead of at fixed intervals
If possible make requests from a dynamic IP rather than a static one
Assuming you wrote the page scraper in a normal manner, i.e. it fetches the whole page and then does pattern recognition to extract what you want from it, all someone might be able to tell is that the page was fetched by a robot rather than a normal browser. All their logs will show is that the entire page was fetched; they can't tell what you do with it once it's in your RAM.
To the server serving the page, there's no difference whether you download a page into the browser or download a page and screen scrape it. Both actions just require an HTTP request, whatever you do with the resulting HTML on your end is none of the server's business.
Having said that, a sophisticated server could conceivably detect activity that doesn't look like a normal browser. For example, a browser should request any additional resources linked to from the page, something that usually doesn't happen when screen scraping. Or requests with an unusual frequency coming from a particular address. Or simply the HTTP User-Agent header.
Whether a server tries to detect these things or not depends on the server, most don't.
I'd like to put my two cents in for others that may be reading this. In the past couple of years web scraping has been frowned upon more and more by the court system. I've cited a lot of examples in a blog post I recently wrote.
You should definitely abide by robots.txt, but also look at the website's T&Cs to make sure you are not in violation. There are definitely ways that people can identify that you are web scraping, and there could be potential consequences for doing so. In the event that web scraping is not disallowed by the website's terms and conditions, then have fun, but make sure to still be conscientious. Don't destroy a web server with an out-of-control bot; throttle yourself to make sure you don't impact the server!
For full disclosure, I am a co-founder of Distil Networks and we help companies identify and stop web scrapers and bots.

Access Hotmail Unread Mail Count via Java

I want to write an application using Java 6 that can check a user's Hotmail inbox for the 'unread message count'.
There is a JavaScript API, but I will not have a browser instance, and it seems that I need one to use it (see Stack Overflow question 964392).
I can use POP3, but since it does not support flags, I can only tell how many 'new' messages there are in the users Inbox since the last time I checked, not how many unread messages there are. ( This is my current implementation, it's not what is required, but is currently all I can achieve )
There is IMAP access, but only for 'premium users' (Hotmail users who pay).
There's also HttpMail access, but this is poorly documented and, from testing, it seems it's also only for premium users.
Unfortunately, this similar question on MSDN suggests this is impossible.
EDIT:
All I can offer is a half-solution. You could create the html page containing the script suggested by the people on MSDN but instead of setting the value of an input box to the number of unread messages - you could use Ajax to post this number back to your application. This is, of course, not a very robust solution since it depends on the browser and may very well not be cross platform. Another thing you can do is read up on running Javascript on the JVM. I don't know how good that solution is, either, but I think it's more robust once (or rather if) you can get it to work.
One potential option could be to use the HTMLUnit Java headless web browser to make the requests. HTMLUnit has very good, but not perfect, JavaScript support to handle creating the dynamic content.

Best way to upload multiple files from a browser

I'm working on a web application. There is one place where the user can upload files with the HTTP protocol. There is a choice between the classic HTML file upload control and a Java applet to upload the files.
The classic HTML file upload isn't great because you can only select one file at a time, and it's quite hard to get any progress indication during the actual upload (I finally got it using a timer refreshing a progress indicator with data fetched from the server via an AJAX call). The advantage: it's always working.
With the Java applet I can do more things: select multiple files at once (even a folder), compress the files, get a real progress bar, drag'n'drop files on the applet, etc...
BUT there are a few drawbacks:
it's a nightmare to get it to work properly on Mac Safari and Mac Firefox (Thanks Liveconnect)
the UI isn't exactly the native UI and some people notice that
the applet isn't as responsive as it should (could be my fault, but everything looks ok to me)
there are bugs in the Java UrlConnection class with HTTPS, so I use the Apache Commons HTTP client to do the actual HTTP upload. It's quite a big package and slows down the download of the .jar file
the Apache Commons HTTP client sometimes has trouble going through proxies
the Java runtime is quite big
I've been maintaining this Java applet for a while, but now I'm fed up with all the drawbacks and am considering writing/buying a completely new component to upload these files.
Question
If you had the following requirements:
upload multiple files easily from a browser, through HTTP or HTTPS
compress the files to reduce the upload time
upload should work on any platform, with native UI
must be able to upload huge files, up to 2gb at least
you have carte blanche on the technology
What technology/component would you use?
Edit :
Drag'n'Drop of files on the component would be a great plus.
It looks like there are a lot of issues related to bugs with the Flash Player (swfupload known issues). Proper Mac support and upload through proxies with authentication are options I can not do without. This would probably rule out all Flash-based options :-( .
I rule out all HTML/JavaScript-only options because you can't select more than one file at a time with the classic HTML control. It's a pain to click the "browse" button n times when you want to select multiple files in a folder.
I implemented something very recently in Silverlight.
It basically uses HttpWebRequest to send a chunk of data to a generic handler.
On the first post, 4 KB of data is sent. For the second post, I send another 4 KB chunk. When the second chunk is received, I calculate the round trip it took between the first and second chunks, so the third chunk, when sent, will know to increase speed.
Using this method I can upload files of ANY size, and I can resume.
Each post I send along this info:
[PARAMETERS]
[FILEDATA]
Here, parameters contain the following:
[Chunk #]
[Filename]
[Session ID]
After each chunk is received, I send a response back to my Silverlight client saying how long it took, so that it can now send a larger chunk.
It's hard to explain without code, but that's basically how I did it.
At some point I will put together a quick writeup on how I did this.
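The Silverlight code itself isn't shown here; purely as an illustration of the chunking idea described above, a rough Java analogue might look like the sketch below (the endpoint, parameter names and sizing policy are all hypothetical, and a real version would URL-encode the parameters):

    import java.io.FileInputStream;
    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class ChunkedUploader {
        public static void upload(String fileName, String sessionId) throws Exception {
            int chunkSize = 4 * 1024;              // start small, like the 4 KB first post
            byte[] buffer = new byte[1024 * 1024]; // upper bound on a single chunk
            int chunkNumber = 0;

            try (FileInputStream in = new FileInputStream(fileName)) {
                int read;
                while ((read = in.read(buffer, 0, chunkSize)) > 0) {
                    long start = System.currentTimeMillis();

                    // Hypothetical handler URL; chunk number, file name and session id travel as parameters.
                    HttpURLConnection conn = (HttpURLConnection) new URL(
                            "http://example.com/upload?chunk=" + chunkNumber
                            + "&file=" + fileName + "&session=" + sessionId).openConnection();
                    conn.setDoOutput(true);
                    conn.setRequestMethod("POST");
                    try (OutputStream out = conn.getOutputStream()) {
                        out.write(buffer, 0, read);
                    }
                    conn.getResponseCode(); // wait for the server's acknowledgement

                    // If the round trip was quick, grow the next chunk to speed things up.
                    long elapsed = System.currentTimeMillis() - start;
                    if (elapsed < 500 && chunkSize < buffer.length) {
                        chunkSize = Math.min(chunkSize * 2, buffer.length);
                    }
                    chunkNumber++;
                }
            }
        }
    }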
I've never used it with files of 2GB in size, but the YUI File Uploader worked pretty well on a previous project. You may also be interested in this jQuery Plugin.
That said, I still think the Java Applet is the way to go. I think you'll end up with less portability and UI issues than you expect and Drag/Drop works great. For the record, Box.net uses a Java Applet for their multi-file quick uploads.
OK this is my take on this
I did some testing with SWFUpload, and I have my previous experience with Java, and my conclusion is that whatever technology is used, there is no perfect solution for doing uploads in the browser: you'll always end up with bugs when uploading huge files, going through proxies, using SSL, etc.
BUT :
a Flash uploader (à la SWFUpload) is really lightweight, doesn't need authorization from the user and has a native interface which is REALLY cool, methinks
a Java uploader needs authorization, but you can do whatever you want with the files selected by the user (e.g. compression if needed), and drag and drop works well. Be prepared for some epic bug debugging, though.
I didn't get a chance to play with Silverlight as long as I'd like; maybe that's the real answer, though the technology is still quite young, so... I'll edit this post if I get a chance to fiddle a bit with Silverlight.
Thanks for all the answers !!
There are a number of free Flash components that exist with nice multiple-file upload capability. They make use of ActionScript's FileReference class with a PHP (or whatever) receiver on the server side. Some have recently broken with the launch of FP10, but I know for certain that SWFUpload will work :)
Hope this helps!
What about these two?
Jupload
http://jupload.sourceforge.net/
and
jumploader
http://jumploader.com/
Both are Java applets, but they are also both really easy to use and implement.
What about Google Gears?
There are HTTP/HTTPS upload controls that allow multi-file upload. Here is one from Telerik, which I have found to be solid and reliable. The latest version looks to have most if not all of your feature requirements.
You can upload multiple files with HTTP forms as well, as Dave already pointed out, but if you're set on using something beyond what HTTP and Javascript offers I would heavily consider Flash. There are even some pre-existing solutions for it such as MultiPowUpload and it offers many of the features you're looking for. It's also easier to obtain progress information using a Flash client than with AJAX calls from Javascript since you have a little more flexibility.
You may check the Apache Commons FileUpload package. It allows you to upload multiple files, monitor the progress of the upload, and more. You can find more information here:
http://commons.apache.org/fileupload/
http://commons.apache.org/fileupload/using.html
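For reference, a bare-bones servlet using Commons FileUpload might look like this sketch (the target directory is a placeholder, and real code would add size limits and error handling):

    import java.io.File;
    import java.util.List;
    import javax.servlet.http.HttpServlet;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;
    import org.apache.commons.fileupload.FileItem;
    import org.apache.commons.fileupload.disk.DiskFileItemFactory;
    import org.apache.commons.fileupload.servlet.ServletFileUpload;

    public class UploadServlet extends HttpServlet {
        @Override
        protected void doPost(HttpServletRequest request, HttpServletResponse response) {
            try {
                ServletFileUpload upload = new ServletFileUpload(new DiskFileItemFactory());
                // Optional server-side progress tracking.
                upload.setProgressListener((bytesRead, contentLength, itemIndex) ->
                        System.out.println(bytesRead + " of " + contentLength + " bytes read"));

                List<FileItem> items = upload.parseRequest(request);
                for (FileItem item : items) {
                    if (!item.isFormField()) {
                        // "/tmp/uploads" is a placeholder target directory.
                        item.write(new File("/tmp/uploads", item.getName()));
                    }
                }
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        }
    }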
Good luck
