I am working on a product that needs a feature to crawl a user-given URL and publish a separate mobile site for that user. During crawling we want to fetch the site content, CSS, images and scripts. The product also handles other activities, such as scheduling marketing campaigns. What I want to ask is:
What are the best practices, and which open-source frameworks are suitable, for this task?
Should we do the crawling within the application itself, or should there be another server for this activity (in case it creates significant load)? Keep in mind that we have about 1 lakh (100,000) users visiting every month to publish their mobile sites, and around 1-2k concurrent users.
The application is built in Java on the Java EE platform, using Spring and Hibernate as server-side technologies.
We used Berkeley DB Java Edition to manage an off-heap queue of links to crawl and to distinguish between links pending download and those already downloaded.
For parsing HTML found in the wild, TagSoup is the best choice (a short example follows this list of tools).
Batik is the choice for parsing CSS and SVG.
PDFBox is awesome and allows you to extract links from PDFs.
The Quartz scheduler is an industry-proven choice for event scheduling.
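To illustrate the TagSoup part, here is a minimal sketch of extracting links from possibly malformed HTML; the class and method names around the parser call are my own invention, only the TagSoup/SAX API itself is real:

```java
// Minimal sketch (assumption: TagSoup is on the classpath as org.ccil.cowan.tagsoup).
// Extracts href attributes from <a> tags in possibly malformed HTML.
import java.io.StringReader;

import org.ccil.cowan.tagsoup.Parser;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class LinkExtractor {

    public static void printLinks(String html) throws Exception {
        XMLReader reader = new Parser();   // TagSoup's SAX-compatible, error-tolerant parser
        reader.setContentHandler(new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                if ("a".equalsIgnoreCase(localName)) {
                    String href = atts.getValue("href");
                    if (href != null) {
                        // In a crawler, this is where the link would be pushed
                        // onto the Berkeley DB backed queue of pending URLs.
                        System.out.println(href);
                    }
                }
            }
        });
        reader.parse(new InputSource(new StringReader(html)));
    }
}
```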
And yes, you will need one or more servers for crawling, one server for aggregating results and scheduling tasks, and perhaps another server for the web front end and back end.
This worked well for http://linktiger.com and http://pagefreezer.com
I'm implementing a crawling project based on the Selenium HtmlUnit driver. I think it's really the best Java framework for automating a headless browser.
Related
I'm looking for a Java framework which enables me to easily communicate with a website.
What I'd like to do is for example:
log into a website
open various pages
read information
submit information into forms
send ajax-requests
read ajax-response
What I'm not looking for is a browser automation plugin like Selenium. I'm trying to have my application communicate directly with the website.
That's the general outline. If you can think of a better solution for the following problem, I'm more than willing to follow your advice (:
We're working with a web application that has a gruesome GUI. Unfortunately we've got no means to tinker with said application or to request changes to it. What I'd like to do is build a client which logs into said application, fetches the data and displays it in a more appropriate manner, with additional information based on that data, while also providing tools to process this data and submit it back to that web application.
Thanks in advance.
Selenium is available for Java. You can download it from here: http://www.seleniumhq.org/download/
Here is a tutorial:
https://www.airpair.com/selenium/posts/selenium-tutorial-with-java
How Selenium WebDriver works
The Selenium WebDriver (e.g. the Firefox driver) will open a web browser (Firefox) for you, so you can actually see what's going on. Opening a browser window may not be what you need, in which case you can use one of the headless drivers below (a short HtmlUnitDriver sketch follows):
HtmlUnitDriver
PhantomJSDriver (PhantomJSDriverService)
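For example, a minimal HtmlUnitDriver sketch might look like this; the URL and form field names are placeholders, not from the question:

```java
// Minimal sketch of a headless login with Selenium's HtmlUnitDriver.
// https://example.com and the field names "username"/"password" are placeholders.
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.htmlunit.HtmlUnitDriver;

public class HeadlessLogin {
    public static void main(String[] args) {
        WebDriver driver = new HtmlUnitDriver(true); // true = enable JavaScript (for AJAX pages)
        try {
            driver.get("https://example.com/login");
            driver.findElement(By.name("username")).sendKeys("alice");
            driver.findElement(By.name("password")).sendKeys("secret");
            driver.findElement(By.cssSelector("button[type=submit]")).click();

            // Read information from the page after login
            System.out.println("Title after login: " + driver.getTitle());
        } finally {
            driver.quit();
        }
    }
}
```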
Take a look at
http://hc.apache.org/httpcomponents-client-ga/quickstart.html
It's not a framework but a library, but it should provide the methods you need to interact with your web application.
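A rough sketch of a form login followed by a page fetch with HttpClient (4.x) could look like this; the URLs and form field names are placeholders:

```java
// Minimal sketch: log in via a POSTed form, then fetch a page in the same session.
// The URLs and the "username"/"password" field names are placeholders.
import java.util.Arrays;

import org.apache.http.client.entity.UrlEncodedFormEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.message.BasicNameValuePair;
import org.apache.http.util.EntityUtils;

public class HttpClientExample {
    public static void main(String[] args) throws Exception {
        try (CloseableHttpClient client = HttpClients.createDefault()) {
            // Log in by posting the form fields the site expects
            HttpPost login = new HttpPost("https://example.com/login");
            login.setEntity(new UrlEncodedFormEntity(Arrays.asList(
                    new BasicNameValuePair("username", "alice"),
                    new BasicNameValuePair("password", "secret")), "UTF-8"));
            try (CloseableHttpResponse response = client.execute(login)) {
                EntityUtils.consume(response.getEntity()); // session cookie is kept by the client
            }

            // Subsequent requests with the same client reuse the session cookie
            HttpGet page = new HttpGet("https://example.com/dashboard");
            try (CloseableHttpResponse response = client.execute(page)) {
                System.out.println(EntityUtils.toString(response.getEntity()));
            }
        }
    }
}
```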
I want to write a client application for a site (e.g. to read something from the site, add comments, likes, etc.). I don't have access to the site's source and there isn't any API for working with it. So in my Android application I decided to parse this site (it has static pages) using the JSOUP library.
Using this library, I'm going to write an unofficial API for my own purposes to work with this site, and then use it in my Android application.
Can somebody tell me whether this is good practice, or whether there are better ways to do it? Is it a good idea at all to parse a site on an Android device?
As I wrote in a comment: in general, building your own application on top of a third-party web service is not a good idea. If you want to do it anyway, you have two options:
Use jSoup (or any other HTML parser, if one exists) and parse the third-party content on the device (a minimal jSoup sketch follows this list).
Set up a middleware server to parse the content and serve it in some more convenient way.
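For the first option, a minimal jSoup sketch might look like the following; the URL and CSS selectors are placeholders, since they depend entirely on the third-party site's markup:

```java
// Minimal sketch: fetch a page and pull out items with jSoup.
// The URL and the selectors "div.article", "h2", "a" are placeholders.
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class SiteScraper {
    public static void main(String[] args) throws Exception {
        Document doc = Jsoup.connect("https://example.com/articles")
                .userAgent("Mozilla/5.0")
                .timeout(10000)
                .get();

        for (Element article : doc.select("div.article")) {
            String title = article.select("h2").text();
            String link = article.select("a").attr("abs:href"); // resolved to an absolute URL
            System.out.println(title + " -> " + link);
        }
    }
}
```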
The second option has a few advantages: you can fix the application without forcing users to update it, and you'll probably save a bit of the device's bandwidth. The disadvantage, of course, is that you have to pay for the server.
The general problem with applications like this is that every single change to the layout, skin or server configuration can cause your application to stop working, and parsing HTML takes much more work than just connecting to an existing API.
Moreover, publishing your application can cause legal issues (copyright) and is against Google Play's policy:
Do not post an app where the primary functionality is to: Drive affiliate traffic to a website or Provide a webview of a website not owned or administered by you (unless you have permission from the website owner/administrator to do so)
I am working with the Alfresco 4.2 Community Edition. Now I have to use some kind of scanning feature to scan hard copies of documents and upload them.
I have googled but haven't found any good solution.
In addition to Alfresco you need so-called capture software, which handles the scanning, conversion to PDF, OCR and filing into Alfresco. There are several solutions available on the market, of varying quality and based on different concepts.
Here is a (not complete) list of working solutions I know of, in order of cost:
Quikdrop (Client-Installation): simple .NET-Client with Scan-Client, PDF-Conversion, OCR and limited Metadata-Support
Kofax-Express with Alfresco-Connector from ic-solution (Client-Installation): professional Capture Client supporting barcodes, scan optimizations, guided metadata extraction, validations, delivery to Alfresco supporting document types & metadata
Ephesoft (Server-Installation): web based capture solution available as a community, cloud and commercial version
Abbyy Flexicapture (Server-Installation): Local Capture Clients with a central Capture / Transformation and Extraction Service
Kofax with Alfresco-Kofax-Connector (Server-Installation): Local Capture Clients with a central Capture / Transformation and Extraction Service
The answer to your question is probably not related directly to Alfresco. Alfresco is excellent at managing documents, but not until you get them into Alfresco.
So first you have to scan the documents with a scanner and whatever scanning software you like. Once you do, you upload the documents using something like:
CIFS - you just mount a folder in Alfresco on your desktop, like any other network drive, and move the scanned documents into that folder. Usually you'll create an Alfresco rule on that folder to move the documents away, email somebody, start a workflow or anything really.
You can upload the documents using Explorer or Share. It is probably not efficient if you have a lot of documents to upload.
You can use another application to connect to Alfresco via its upload API and send the documents in (see the CMIS sketch after this list).
You can email the scanned documents to Alfresco (provided that you have configured an incoming email box in Alfresco).
Use Alfresco's built-in FTP server to upload the documents.
There are more ways to get the documents in, these are, I think, the common ones.
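To illustrate the "another application using the upload API" option, here is a rough sketch using Apache Chemistry OpenCMIS against Alfresco's CMIS endpoint; the URL, credentials, folder and file details are placeholders that depend on your installation:

```java
// Minimal sketch: upload one scanned document into Alfresco over CMIS (OpenCMIS client).
// The AtomPub URL, credentials and file name below are placeholders.
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

import org.apache.chemistry.opencmis.client.api.Folder;
import org.apache.chemistry.opencmis.client.api.Session;
import org.apache.chemistry.opencmis.client.api.SessionFactory;
import org.apache.chemistry.opencmis.client.runtime.SessionFactoryImpl;
import org.apache.chemistry.opencmis.commons.PropertyIds;
import org.apache.chemistry.opencmis.commons.SessionParameter;
import org.apache.chemistry.opencmis.commons.data.ContentStream;
import org.apache.chemistry.opencmis.commons.enums.BindingType;
import org.apache.chemistry.opencmis.commons.enums.VersioningState;

public class AlfrescoUploader {
    public static void main(String[] args) throws Exception {
        Map<String, String> params = new HashMap<>();
        params.put(SessionParameter.USER, "admin");        // placeholder credentials
        params.put(SessionParameter.PASSWORD, "admin");
        params.put(SessionParameter.ATOMPUB_URL,           // check your server for the exact URL
                "http://localhost:8080/alfresco/api/-default-/public/cmis/versions/1.0/atom");
        params.put(SessionParameter.BINDING_TYPE, BindingType.ATOMPUB.value());

        SessionFactory factory = SessionFactoryImpl.newInstance();
        Session session = factory.getRepositories(params).get(0).createSession();

        byte[] content = "scanned document bytes".getBytes(StandardCharsets.UTF_8);
        InputStream stream = new ByteArrayInputStream(content);
        ContentStream contentStream = session.getObjectFactory()
                .createContentStream("scan-001.pdf", content.length, "application/pdf", stream);

        Map<String, Object> props = new HashMap<>();
        props.put(PropertyIds.OBJECT_TYPE_ID, "cmis:document");
        props.put(PropertyIds.NAME, "scan-001.pdf");

        Folder root = session.getRootFolder();             // or navigate to a specific folder
        root.createDocument(props, contentStream, VersioningState.MAJOR);
    }
}
```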
You can use ChronoScan (http://www.chronoscan.org); it has a CMIS module to scan/OCR and send directly to Alfresco, SharePoint, etc. in text-searchable PDF or other formats.
The software is free for non-commercial use (with a nag screen) and is very similar to solutions costing ten times as much (Kofax Express, etc.).
In addition to @zladuric's answer, I would like to add that there is software such as Ephesoft and Kofax that can, for example, aid in extracting metadata from the scanned documents.
I'm looking for a way to see how often people use my application, along with some small usage statistics, e.g. time of day (derived from the message time), duration of usage (a program statistic), etc.
The application is written in Java and already connects to the internet, so I know I can send/request information from websites.
My question is: how best to do this? I know I could use Google Analytics and "ping" a specific web page, but ideally I'd like the extra statistics too. Can that be done with GA? And how do I separate that traffic from the other GA stats?
Are there existing code snippets I could utilise?
What do I need the server to do? I have hosting with SSH access, MySQL, etc. So can install packages if needed.
Edited to add
This is not a web application; it's a local program that runs on a client machine and connects to the internet to gather data. So there are no web pages into which I can insert JavaScript or other tracking scripts for true web analytics.
This is why I was thinking that I would have to "ping" or "poke" (no idea what the correct terms are) a specific web address, perhaps a PHP page, that would record the statistics.
The statistics that I would like to gather are:
IP Address (ONLY to determine the unique visitor and perhaps country of origin)
Time of execution (from the time the statistic was generated)
Number of items processed (program statistic)
Execution duration (program generated statistic)
As the program is usually run by the user an average of once per day, I don't anticipate massive load on the server (famous last words!)
Edited for clarification
The application is CLI-based (no GUI, web browser, web server, or other web application technologies are used). It runs locally on a user's machine, collects information on various files, downloads information about those files from the internet (yes, using a URL connection), and compiles that information into a database.
I have no view of, or access to, the users of the application. I do have a website for which I use Google Analytics to see who visits and from where, and all the usual stats.
I want to be able to capture a small set of stats (explained above) each time the application is run, so that I know the application is being used, by how many users, and for what.
I had thought I might be able to call a PHP web page with some arguments that could then be added to a database, e.g. http://omertron.com/stats?IP=192.168.2.0|processed=23|duration=270
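On the client side I imagine something like this minimal sketch; the stats.php endpoint and parameter names are just made up for illustration, and the server can work out the caller's IP address itself:

```java
// Minimal sketch: report one run's statistics to a (hypothetical) stats.php endpoint.
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

public class StatsReporter {

    public static void report(int itemsProcessed, long durationSeconds) {
        try {
            String query = "processed=" + itemsProcessed + "&duration=" + durationSeconds;
            URL url = new URL("http://omertron.com/stats.php?" + query); // hypothetical endpoint
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setConnectTimeout(5000);
            conn.setReadTimeout(5000);
            conn.getResponseCode();   // fire the request; the response body is not needed
            conn.disconnect();
        } catch (IOException e) {
            // Statistics reporting must never break the application itself.
        }
    }
}
```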
Or can Google Analytics be used to log that information somehow? I can't find much in the documentation about how I would do that.
Check out this list of web analytics software. Lots of free packages there, and once you find one that suits your needs, you'll be able to frame your question specifically to the challenges in using that particular package.
Your options are:
Tagging Systems (like Google Analytics)
Access Log File Analysis
With tagging, you create an account with Google Analytics and add some specific JavaScript code, which you get from Google, into the relevant places of your pages; this lets your visitors' browsers connect to GA and be captured there.
The access log file can hold all information about all sessions. A lot of detailed data is generated, so the data has to be extracted, transformed and loaded (ETL) into a database. The evaluation can then be performed in near real time. You can create a dashboard application that does the ETL and displays the status of your application.
A third option would be to combine tagging and log file analysis. This will give you more precise results.
Interesting; my thinking is that you would need a framework to accomplish this.
The Java application should be able to asynchronously log every event that happens in the application.
In Google Analytics you can define names and push events for those specific names. If you were able to use the following API, I don't think you would need to ping a specific web page to use Google Analytics.
http://download.oracle.com/javase/6/docs/technotes/guides/scripting/programmer_guide/index.html
I have not used this API; I hope this helps!
I would like to use a web crawler to crawl a particular website. The website is a learning management system where many students upload their assignments, project presentations and so on. My question is: can I use a web crawler to download the files that have been uploaded to the learning management system? After I download them, I would like to create an index on them so as to query the set of documents. Users could then use my application as a search engine. Can a crawler do this? I know about WebEater (a crawler written in Java).
Download the files in Java, single-threaded.
Parse the files (you can get ideas from the parse plugins of Nutch).
Create an index with Lucene (a minimal indexing sketch follows this list).
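To give an idea of step 3, here is a minimal Lucene indexing sketch; the field names and index path are illustrative, and the exact constructor signatures vary a little between Lucene versions (this roughly matches 5.x and later):

```java
// Minimal sketch: add one downloaded document's extracted text to a Lucene index.
// The index directory "lucene-index" and the field names are placeholders.
import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class AssignmentIndexer {

    public static void indexDocument(String path, String extractedText) throws Exception {
        Directory dir = FSDirectory.open(Paths.get("lucene-index"));     // index location on disk
        IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
        try (IndexWriter writer = new IndexWriter(dir, config)) {
            Document doc = new Document();
            doc.add(new StringField("path", path, Field.Store.YES));        // stored, not tokenized
            doc.add(new TextField("content", extractedText, Field.Store.NO)); // tokenized for search
            writer.addDocument(doc);
        }
    }
}
```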
If you want to use a real web crawler, use http://www.httrack.com/
It offers many options for copying websites or content on web pages, including Flash. It works on Windows and Mac.
Then you can do steps 2 and 3 as suggested above.