Crawl Web Data using Web Crawler - Java

I would like to use a web crawler to crawl a particular website. The website is a learning management system where many students upload their assignments, project presentations, and so on. My question is: can I use a web crawler to download the files that have been uploaded to the learning management system? After I download them, I would like to create an index on them so I can query the set of documents. Users could then use my application as a search engine. Can a crawler do this? I know about WebEater (a crawler written in Java).

Download the files in Java (single-threaded is fine to start).
Parse the files (you can get ideas from the parse plugins of Nutch).
Create an index with Lucene (a minimal sketch follows below).

If you want to use a real web crawler, use HTTrack: http://www.httrack.com/
It offers many options for copying websites or web-page content, including Flash. It works on Windows and Mac.
Then you can do steps 2 and 3 as suggested above.
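For reference, here is a minimal sketch of steps 1 and 3 (fetching a page and indexing it with Lucene), assuming a recent Lucene version (5+) is on the classpath. The URL and index directory are placeholders, and a real crawler would also need link extraction, politeness delays, and robots.txt handling.

```java
import java.io.InputStream;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

public class SimpleFetchAndIndex {

    // Step 1: download a single URL as text (Java 9+ readAllBytes for brevity).
    static String fetch(String pageUrl) throws Exception {
        try (InputStream in = new URL(pageUrl).openStream()) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws Exception {
        // Step 3: add the downloaded content to a Lucene index on disk.
        try (IndexWriter writer = new IndexWriter(
                FSDirectory.open(Paths.get("lucene-index")),
                new IndexWriterConfig(new StandardAnalyzer()))) {

            String url = "http://lms.example.edu/assignments/assignment1.html"; // placeholder
            Document doc = new Document();
            doc.add(new StringField("url", url, Field.Store.YES));          // stored, not tokenized
            doc.add(new TextField("content", fetch(url), Field.Store.NO));  // tokenized for search
            writer.addDocument(doc);
        }
    }
}
```

Querying the resulting index is then a matter of opening an IndexSearcher and running a QueryParser query over the content field.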

Related

How to integrate Java APIs with HTML pages

I have a JAR file which I want to integrate with a web page and run in a web browser (maybe Chrome). What I want to do is call a Java API that will give me some data, which I then use to populate the web page; the Java code runs in the background. The user can then select one of the options on the web page, and I want to send that user input back through the Java API.
The only thing I can think of currently is a Java applet. Is there any other way to do this? Maybe something similar to an applet is already available on the market.
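As a rough illustration of the background-Java-plus-web-page flow described above (one possible alternative to an applet, not a recommendation from the question), here is a minimal sketch that exposes a Java method over a local HTTP endpoint using the JDK's built-in com.sun.net.httpserver, which the web page could then call with JavaScript. The port, path, and JSON payload are placeholders.

```java
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

import com.sun.net.httpserver.HttpServer;

public class LocalApiBridge {
    public static void main(String[] args) throws Exception {
        // Tiny HTTP endpoint wrapping the Java API; the web page calls it via fetch()/XHR.
        HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
        server.createContext("/api/data", exchange -> {
            // Here you would call into the JAR's API; this JSON is just a stand-in.
            byte[] body = "{\"message\": \"Hello from Java\"}".getBytes(StandardCharsets.UTF_8);
            exchange.getResponseHeaders().add("Content-Type", "application/json");
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream os = exchange.getResponseBody()) {
                os.write(body);
            }
        });
        server.start(); // the page then does: fetch("http://localhost:8080/api/data")
    }
}
```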

How to build a full-text search as quick and as accurate as Google Drive's?

Google Drive search is really amazing.
It builds up a full-text search index instantly, right after I upload a document (PDF / MS Office document).
Since I want to use this technology in my own GAE project, I was wondering:
1. Is there an existing API (from Google or others) that provides this function?
2. How could I implement it myself?
On App Engine you can use the Search API to do full-text indexing and querying.
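To illustrate that answer, here is a minimal sketch of the App Engine Search API (com.google.appengine.api.search), assuming you extract the document text yourself before indexing; the index name "uploads" and the field names are placeholders.

```java
import com.google.appengine.api.search.Document;
import com.google.appengine.api.search.Field;
import com.google.appengine.api.search.Index;
import com.google.appengine.api.search.IndexSpec;
import com.google.appengine.api.search.Results;
import com.google.appengine.api.search.ScoredDocument;
import com.google.appengine.api.search.SearchServiceFactory;

public class DriveLikeSearch {

    private static final Index INDEX = SearchServiceFactory.getSearchService()
            .getIndex(IndexSpec.newBuilder().setName("uploads").build());

    // Index the extracted text of an uploaded document.
    public static void indexDocument(String docId, String title, String extractedText) {
        Document doc = Document.newBuilder()
                .setId(docId)
                .addField(Field.newBuilder().setName("title").setText(title))
                .addField(Field.newBuilder().setName("content").setText(extractedText))
                .build();
        INDEX.put(doc);
    }

    // Full-text query over everything indexed so far.
    public static void search(String queryString) {
        Results<ScoredDocument> results = INDEX.search(queryString);
        for (ScoredDocument match : results) {
            System.out.println(match.getId() + " -> " + match.getOnlyField("title").getText());
        }
    }
}
```

Note that the Search API indexes the text you hand it; extracting text from PDF or Office files before indexing is still up to your own code.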
You can use the following options:
Google Desktop, though I am not sure about its API.
IIS Index Server; you can query its index files too.
dtSearch is a tool with an API, but it is paid.
There is also an Apache API whose name I forget; I will post it shortly.

Scanning feature in Alfresco community edition?

I am working with Alfresco 4.2 Community Edition. Now I need some kind of scanning feature to scan hard copies of documents and upload them.
I have googled but haven't found any good solution.
In addition to Alfresco you need so-called capture software, which handles the scanning, conversion to PDF, OCR, and filing into Alfresco. There are several solutions available on the market, of varying quality and based on different concepts.
Here is a (not complete) list of working solutions I know of, in order of cost:
Quikdrop (client installation): simple .NET client with a scan client, PDF conversion, OCR, and limited metadata support
Kofax Express with the Alfresco connector from ic-solution (client installation): professional capture client supporting barcodes, scan optimization, guided metadata extraction, validations, and delivery to Alfresco with support for document types & metadata
Ephesoft (server installation): web-based capture solution available as community, cloud, and commercial versions
Abbyy FlexiCapture (server installation): local capture clients with a central capture / transformation and extraction service
Kofax with the Alfresco-Kofax connector (server installation): local capture clients with a central capture / transformation and extraction service
The answer to your question is probably not related directly to Alfresco. Alfresco is excellent at managing documents, but only once you get them into Alfresco.
So first you have to scan the documents with a scanner and whatever scanning software you like. Once you do, you upload the documents using something like:
CIFS - you just mount an Alfresco folder on your desktop like any other network drive and move the scanned documents into that folder. Usually you'll create an Alfresco rule on that folder to move the documents somewhere, email somebody, start a workflow, or anything really.
You can upload the documents using Explorer or Share. This is probably not efficient if you have a lot of documents to upload.
You can use another application to connect to Alfresco through the upload API and send the documents in (a CMIS-based sketch follows this list).
You can email the scanned documents to Alfresco (provided that you have configured an incoming email box on Alfresco).
Use Alfresco's built-in FTP server to upload the documents.
There are more ways to get the documents in; these are, I think, the most common ones.
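To illustrate the upload-API option, here is a minimal sketch using Apache Chemistry OpenCMIS against Alfresco's CMIS binding. The AtomPub URL (which varies by Alfresco version), credentials, and file name are placeholders, and error handling is omitted.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

import org.apache.chemistry.opencmis.client.api.Folder;
import org.apache.chemistry.opencmis.client.api.Session;
import org.apache.chemistry.opencmis.client.api.SessionFactory;
import org.apache.chemistry.opencmis.client.runtime.SessionFactoryImpl;
import org.apache.chemistry.opencmis.commons.PropertyIds;
import org.apache.chemistry.opencmis.commons.SessionParameter;
import org.apache.chemistry.opencmis.commons.data.ContentStream;
import org.apache.chemistry.opencmis.commons.enums.BindingType;
import org.apache.chemistry.opencmis.commons.enums.VersioningState;

public class ScanUploader {
    public static void main(String[] args) {
        // Connect to Alfresco over the CMIS AtomPub binding (URL and credentials are placeholders).
        SessionFactory factory = SessionFactoryImpl.newInstance();
        Map<String, String> parameters = new HashMap<>();
        parameters.put(SessionParameter.USER, "admin");
        parameters.put(SessionParameter.PASSWORD, "admin");
        parameters.put(SessionParameter.ATOMPUB_URL,
                "http://localhost:8080/alfresco/api/-default-/public/cmis/versions/1.1/atom");
        parameters.put(SessionParameter.BINDING_TYPE, BindingType.ATOMPUB.value());
        Session session = factory.getRepositories(parameters).get(0).createSession();

        // Create the scanned document under the repository root (use the real scanned PDF bytes in practice).
        byte[] scanned = "fake scanned content".getBytes(StandardCharsets.UTF_8);
        ContentStream content = session.getObjectFactory().createContentStream(
                "scan-001.pdf", scanned.length, "application/pdf", new ByteArrayInputStream(scanned));

        Map<String, Object> properties = new HashMap<>();
        properties.put(PropertyIds.OBJECT_TYPE_ID, "cmis:document");
        properties.put(PropertyIds.NAME, "scan-001.pdf");

        Folder root = session.getRootFolder();
        root.createDocument(properties, content, VersioningState.MAJOR);
    }
}
```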
You can use ChronoScan (http://www.chronoscan.org); it has a CMIS module to scan/OCR and send documents directly to Alfresco, SharePoint, etc. as text-searchable PDF or in other formats.
The software is free for non-commercial use (with a nag screen) and is very similar to solutions that cost ten times as much (Kofax Express, etc.).
Regards
In addition to @zladuric's answer, I would like to add that software like Ephesoft and Kofax can, for example, aid in the extraction of metadata from the scanned documents.

Best architecture for crawling website in application

I am working on a product that needs a feature to crawl a user-given URL and publish a separate mobile site for that user. In the crawling process we want to crawl the site content, CSS, images, and scripts. The product also does other activities, like scheduling some marketing activities and so on. What I want to ask:
What is the best practice, and which open-source framework should we use for this task?
Should we do it in the application itself, or should there be another server for this activity (if it takes significant load)? Keep in mind that we have about one lakh (100,000) users visiting every month to publish their mobile sites, and around 1-2k concurrent users.
The application is built on Java and the Java EE platform, using Spring and Hibernate as the server-side technology.
We used Berkeley DB Java Edition for managing an off-heap queue of links to crawl and for distinguishing links pending download from ones already downloaded.
For parsing HTML, TagSoup is the best choice for the wild internet (see the sketch after this answer).
Batik is the choice for parsing CSS and SVG.
PDFBox is awesome and lets you extract links from PDFs.
The Quartz scheduler is an industry-proven choice for event scheduling.
And yes, you will need one or more servers for crawling, one server for aggregating results and scheduling tasks, and perhaps another server for the web front end and back end.
This worked well for http://linktiger.com and http://pagefreezer.com
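As a small illustration of the TagSoup step mentioned above, here is a minimal link-extraction sketch, assuming the TagSoup jar is on the classpath; the URL is a placeholder.

```java
import java.net.URL;

import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class LinkExtractor {
    public static void main(String[] args) throws Exception {
        // TagSoup turns messy real-world HTML into a clean SAX event stream.
        XMLReader reader = new org.ccil.cowan.tagsoup.Parser();
        reader.setContentHandler(new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                if ("a".equalsIgnoreCase(localName) && atts.getValue("href") != null) {
                    System.out.println(atts.getValue("href")); // candidate link for the crawl queue
                }
            }
        });
        reader.parse(new InputSource(new URL("http://example.com/").openStream()));
    }
}
```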
I'm implementing a crawling project based on the Selenium HtmlUnit Driver. I think it's really the best Java framework for automating a headless browser.
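A minimal sketch of that approach, assuming the Selenium HtmlUnit driver dependency is available; the URL is a placeholder.

```java
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.htmlunit.HtmlUnitDriver;

public class HeadlessCrawl {
    public static void main(String[] args) {
        // Headless browser with JavaScript enabled, so dynamically generated content is rendered too.
        WebDriver driver = new HtmlUnitDriver(true);
        driver.get("http://example.com/"); // placeholder URL
        System.out.println(driver.getTitle());
        String renderedHtml = driver.getPageSource(); // HTML after scripts have run
        System.out.println(renderedHtml.length() + " characters fetched");
        driver.quit();
    }
}
```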

Some questions on the usage of java.net.URL in Google App Engine for Java

I want to use java.net.URL to crawl some websites and retrieve some data.
I am confused about the following issues:
(1) Suppose I configure the crawler to visit a video-sharing web page, e.g. YouTube, and it is set to visit a specific YouTube video page. Does this mean that when the crawler actually visits that page it will by default download all elements on the page, including the FLV video? Or can I control which files are retrieved? The aim is to minimise bandwidth utilisation on Google App Engine. Specifically, I initially want only the HTML page itself to be retrieved, without retrieving images/videos/other attachments on that page. Is this possible, either on Google App Engine or as part of a regular Java web app?
(2) What is a quick and easy way to know the exact bandwidth used when visiting a single specific site, so that I can keep track of bandwidth utilisation?
Also, keeping the above two issues in mind, do you recommend using java.net.URL or the low-level API? Or do you think I should not stick with App Engine (and use e.g. AWS instead)?
(1) Your crawler will only load what the web server returns for a specific URL, which is normally pure HTML. In the case of YouTube, just right-click on a page in your browser and select View Source: that is what you'll download if you load the page programmatically. No video, just text.
(2) When you read the content of the web page, just count the bytes you receive. That is your bandwidth.
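As a rough sketch of both points, the following uses plain java.net.URL (on classic App Engine runtimes this is typically backed by the URL Fetch service), fetches only the HTML of a page, and counts the bytes received. The URL is a placeholder, and the count covers the response body only, not headers or the request itself.

```java
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class BandwidthAwareFetcher {
    public static void main(String[] args) throws Exception {
        // Fetch only the HTML of the page; embedded images/videos are NOT downloaded
        // unless you explicitly request their URLs as well.
        URL page = new URL("http://www.youtube.com/watch?v=someVideoId"); // placeholder URL
        HttpURLConnection connection = (HttpURLConnection) page.openConnection();

        ByteArrayOutputStream body = new ByteArrayOutputStream();
        try (InputStream in = connection.getInputStream()) {
            byte[] buffer = new byte[8192];
            int read;
            while ((read = in.read(buffer)) != -1) {
                body.write(buffer, 0, read); // accumulate the raw response bytes
            }
        }
        long bytesReceived = body.size();   // approximate bandwidth used for this single request
        String html = new String(body.toByteArray(), StandardCharsets.UTF_8);
        System.out.println("Downloaded " + bytesReceived + " bytes of HTML (" + html.length() + " chars)");
    }
}
```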
