I'm working on a little project to analyze the content on some sites I find interesting; this is a real DIY project that I'm doing for my entertainment/enlightenment, so I'd like to code as much of it on my own as possible.
Obviously, I'm going to need data to feed my application, and I was thinking I would write a little crawler that would take maybe 20k pages of HTML and write them to text files on my hard drive. However, when I took a look on SO and other sites, I couldn't find any information on how to do this. Is it feasible? It seems like there are open-source options available (WebSPHINX?), but I would like to write this myself if possible.
Scheme is the only language I know well, but I thought I'd use this project to teach myself some Java, so I'd be interested to hear if there are any Racket or Java libraries that would be helpful for this.
So I guess to summarize my question: what are some good resources to get started on this? How can I get my crawler to request info from other servers? Will I have to write a simple parser for this, or is that unnecessary given that I want to take the whole HTML file and save it as txt?
This is entirely feasible, and you can definitely do it with Racket. You may want to take a look at the PLaneT libraries; in particular, Neil Van Dyke's HtmlPrag:
http://planet.racket-lang.org/display.ss?package=htmlprag.plt&owner=neil
... is probably the place to start. You should be able to pull the content of a web page into a parsed format in one or two lines of code.
Let me know if you have any questions about this.
Having done this myself in Racket, here is what I would suggest.
Start with a "Unix tools" approach:
Use curl to do the work of downloading each page (you can execute it from Racket using system) and storing the output in a temporary file.
Use Racket to extract the URIs from the <a> tags.
You can "cheat" and do a regular expression string search.
Or, do it "the right way" with a true HTML parser, as John Clements' great answer explains.
Consider maybe doing the cheat first, then looping back later to do it the right way.
At this point you could stop, or, you could go back and replace curl with your own code to do the downloads. For this you can use Racket's net/url module.
The reason I suggest trying curl first is that it handles things that are more complicated than they might seem:
Do you want to follow 30x redirects?
Do you want to accept/store/provide cookies (the site may behave differently otherwise)?
Do you want to use HTTP keep-alive?
And on and on.
For example, using curl like this:
(define curl-core-options
  (string-append
   "--silent "
   "--show-error "
   "--location "
   "--connect-timeout 10 "
   "--max-time 30 "
   "--cookie-jar " (path->string (build-path 'same "tmp" "cookies")) " "
   "--keepalive-time 60 "
   "--user-agent 'my crawler' "
   "--globoff "))

(define (curl/head url out-file)
  (system (format "curl ~a --head --output ~a --url \"~a\""
                  curl-core-options
                  (path->string out-file)
                  url)))

(define (curl/get url out-file)
  (system (format "curl ~a --output ~a --url \"~a\""
                  curl-core-options
                  (path->string out-file)
                  url)))
This represents a lot of code that you would otherwise need to write from scratch in Racket, to do all the things those curl command-line flags are doing for you.
In short: Start with the simplest case of using existing tools. Use Racket almost as a shell script. If that's good enough for you, stop. Otherwise go on to replace the tools one by one with your bespoke code.
I suggest looking into the open-source web crawler for Java known as crawler4j.
It is very simple to use and provides good documentation and plenty of configuration options for your crawl.
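As a rough illustration (method signatures vary a bit between crawler4j versions, and the seed URL, storage folder and page limit here are just placeholders), the usual setup looks something like this:
import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.parser.HtmlParseData;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;
import edu.uci.ics.crawler4j.url.WebURL;

public class MyCrawler extends WebCrawler {

    // Only follow links that stay on the seed site (placeholder domain).
    @Override
    public boolean shouldVisit(Page referringPage, WebURL url) {
        return url.getURL().toLowerCase().startsWith("https://example.com/");
    }

    // Called for every fetched page; dump the raw HTML wherever you like.
    @Override
    public void visit(Page page) {
        if (page.getParseData() instanceof HtmlParseData) {
            String html = ((HtmlParseData) page.getParseData()).getHtml();
            System.out.println(page.getWebURL().getURL() + " -> " + html.length() + " chars");
        }
    }

    public static void main(String[] args) throws Exception {
        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder("/tmp/crawl");   // intermediate crawl data
        config.setMaxPagesToFetch(20000);             // roughly the 20k pages mentioned

        PageFetcher fetcher = new PageFetcher(config);
        RobotstxtServer robots = new RobotstxtServer(new RobotstxtConfig(), fetcher);
        CrawlController controller = new CrawlController(config, fetcher, robots);

        controller.addSeed("https://example.com/");
        controller.start(MyCrawler.class, 4);         // 4 crawler threads
    }
}
crawler4j takes care of the URL frontier, politeness delays and robots.txt for you, so visit() only has to decide what to do with each page.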
If you know Scheme, and you want to ease into Java, why don't you start with Clojure?
You can leverage your Lisp knowledge, and take advantage of the Java HTML parsing libraries* out there to get something working. Then, if you want to start transitioning parts of it to Java to learn a bit, you can write bits of functionality in Java and wire that into the Clojure code.
Good luck!
* I've seen several SO questions on this.
If I were you, I wouldn't write a crawler -- I'd use one of the many free tools that download web sites locally for offline browsing (e.g. http://www.httrack.com/) to do the spidering. You may need to tweak the options to disable downloading images, etc, but those tools are going to be way more robust and configurable than anything you write yourself.
Once you do that, you'll have a whole ton of HTML files locally that you can feed to your application.
I've done a lot of textual analysis of HTML files; as a Java guy, my library of choice for distilling HTML into text (again, not something you want to roll yourself) is the excellent Jericho parser: http://jericho.htmlparser.net/docs/index.html
EDIT: re-reading your question, it does appear that you are set on writing your own crawler; if so, I would recommend Commons HttpClient to do the downloading, and still Jericho to pull out the links and process them into new requests.
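For a concrete feel, here is a minimal sketch of that pipeline; note I'm using the newer Apache HttpComponents HttpClient rather than the legacy Commons HttpClient, and the URL is a placeholder:
import java.util.ArrayList;
import java.util.List;

import net.htmlparser.jericho.Element;
import net.htmlparser.jericho.HTMLElementName;
import net.htmlparser.jericho.Source;

import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

public class PageGrabber {
    public static void main(String[] args) throws Exception {
        String url = "http://example.com/";

        // 1. Download the page.
        try (CloseableHttpClient client = HttpClients.createDefault()) {
            String html = EntityUtils.toString(
                    client.execute(new HttpGet(url)).getEntity());

            // 2. Parse it with Jericho.
            Source source = new Source(html);

            // Distilled text for your analysis...
            String text = source.getTextExtractor().toString();

            // ...and links to feed back into the crawl queue.
            List<String> links = new ArrayList<>();
            for (Element a : source.getAllElements(HTMLElementName.A)) {
                String href = a.getAttributeValue("href");
                if (href != null) {
                    links.add(href);
                }
            }
            System.out.println(text.length() + " chars of text, " + links.size() + " links");
        }
    }
}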
I did that in Perl years ago (much easier, even without the webcrawler module).
I suggest you read the wget documentation and use the tool for inspiration. Wget is the netcat of webcrawling; its feature set will inspire you.
Your program should accept a list of URLs to start with and add them to a list of URLs to try. You then have to decide if you want to collect every URL, or only add those from the domains (and subdomains?) provided in the initial list.
I made you a fairly robust starting point in Scheme:
(define (crawl . urls)
  ;; I would use regular expressions for this unless you have a special module for it
  ;; Hint: URLs tend to hide in comments, referral tags, cookies... Not just links.
  (define (parse url) ...)
  ;; For this I would convert URL strings to a standard form, then string=
  (define (url= x y) ...)
  ;; use whatever DNS lookup mechanism your implementation provides
  (define (get-dom url) ...)
  ;; the rest should work fine on its own unless you need to modify anything
  (if (null? urls)
      (error "No URLs!")
      (let ([doms (map get-dom urls)])
        (let loop ([todo urls] [done '()])
          (if (null? todo)
              done
              (let ([url (car todo)] [rest (cdr todo)])
                (if (or (member url done url=)
                        (not (member (get-dom url) doms url=)))
                    (loop rest done)
                    (begin (parse url) (display url) (newline)
                           (loop rest (cons url done))))))))))
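Since the original question also mentions wanting to learn Java, the same bookkeeping (a to-do queue, a done set, and a domain check) might be sketched roughly like this; parse() is a stand-in you would have to fill in, just like the Scheme placeholders above:
import java.net.URI;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class Frontier {
    // Stand-in: fetch the page and return the URLs found in it.
    static List<String> parse(String url) { throw new UnsupportedOperationException(); }

    static String domainOf(String url) {
        return URI.create(url).getHost();
    }

    static void crawl(List<String> seeds) {
        Set<String> allowedDomains = new HashSet<>();
        for (String s : seeds) allowedDomains.add(domainOf(s));

        Deque<String> todo = new ArrayDeque<>(seeds);
        Set<String> done = new HashSet<>();

        while (!todo.isEmpty()) {
            String url = todo.poll();
            if (done.contains(url) || !allowedDomains.contains(domainOf(url))) {
                continue;                    // already seen, or off-site
            }
            done.add(url);
            System.out.println(url);
            todo.addAll(parse(url));         // enqueue newly discovered URLs
        }
    }
}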
So I'm working on my first project and I'm trying to incorporate a neural net in it somehow. At the moment I've just created a web crawler that takes a word as input, performs a Google search, and retrieves the HTML data of the result links.
Now I am trying to use only the HTML data from specific types of websites, in my case websites that offer free educational content/courses, for example this site: https://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-092-java-preparation-for-6-170-january-iap-2006/index.htm
I'm new to neural nets, but is this something a neural net is able to do, or would another method be better?
Also, the rest of my code, such as the web crawler, is in Java, so if a neural net is applicable in this case, what library or tool would you recommend for building/training the neural net? I was thinking Neuroph but would love to hear some suggestions.
Neural networks are used for predicting something: for example, you give an image as input and the output tells you the nature of the image, such as whether its content is a cat or a dog, etc.
About the web crawler:
The web crawler you've been talking about doesn't necessarily need a neural network (for the idea you described), but if you want to add some prediction on top of it, then you can use one: for example, take a word as input, run a Google search on it, and then predict the nature of the content that comes back.
I don't know exactly what you want to predict or the nature of the prediction you want to do (classification or regression), but I can first suggest how to take HTML as input.
Taking HTML content as input:
First thing to mention: neural networks don't work on characters, they work on numbers, so if you want to process HTML content you'll have to convert it somehow, and that's not an easy step. There is a field called NLP (Natural Language Processing) that gives you some good ways to process text, and you can also use it for HTML content (or handle it in a different way if you want).
I have already built a text-suggestion project with a recurrent neural network that uses one of the NLP methods; you can check it on my GitHub, where the README explains all the steps in detail: https://github.com/KaramMed/Modele-de-Suggestion-du-Texte
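For illustration only (this is my own sketch, not from the project above): one very simple way to turn a page into numbers a network can consume is a bag-of-words vector, here using jsoup to strip the markup and a tiny hypothetical vocabulary:
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.jsoup.Jsoup;

public class BagOfWords {
    // Hypothetical vocabulary; in practice you would build it from your training pages.
    static final List<String> VOCAB = List.of("course", "lecture", "free", "exam", "java");

    static double[] vectorize(String html) {
        // Strip the tags and keep only the visible text.
        String text = Jsoup.parse(html).text().toLowerCase();

        // Count word occurrences.
        Map<String, Integer> counts = new HashMap<>();
        for (String word : text.split("\\W+")) {
            counts.merge(word, 1, Integer::sum);
        }

        // Project the counts onto the fixed vocabulary -> fixed-size numeric input.
        double[] features = new double[VOCAB.size()];
        for (int i = 0; i < VOCAB.size(); i++) {
            features[i] = counts.getOrDefault(VOCAB.get(i), 0);
        }
        return features;
    }
}
You would then feed such vectors (or proper embeddings) to whatever classifier you train.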
About the library:
I recommend TensorFlow for Java; it's one of the best deep-learning libraries and you can find plenty of tutorials about it.
I want to get notifications about any change to any issue on my JIRA server.
I have basic code for connecting to JIRA from Java using the jira-rest-java-client library that they provide.
I searched their javadocs and also went through some classes in that API library but I could not find any methods/classes which would be helpful to me.
Does anyone know if it is possible to get notification events from changes in JIRA into my Java code (maybe via polling or something like that)?
What do you want to achieve?
You want push notifications? There aren't any, IMHO.
UPDATE: However, there is this WebHook thingy: https://confluence.atlassian.com/display/JIRA/Managing+Webhooks.
I have no expertise with it, but it is promising, please read this short introduction also: http://blogs.atlassian.com/2012/10/jira-5-2-remote-integration-webhooks/.
Are you looking for something that gives you back what changed in the last N minutes, something like the Activity Stream? You can get the RSS feed of Activity Streams for projects and for users.
How
The base URL is https://jira.contoso.com/activity. Then you can append querystring parameters, like maxResults for paginating.
You select the data source through the filters you provide in the streams parameter. It looks like JQL, but it's not.
Examples:
List a project's activities: ?streams=key+IS+SOMEPROJ.
List a user's activities: ?streams=user+IS+foobar.
List events between two dates: ?streams=update-date+BETWEEN+1425300236000+1425300264999. (Note: the timestamps are millisecond-precision epoch values.)
List user activities in one project: ?streams=user+IS+JohnDoe&streams=key+IS+PROJECTKEY.
More complex ones: ?streams=user+IS+JohnDoe&streams=key+IS+PROJECTKEY&streams=activity+IS+issue:close
Watch out: it is case-sensitive. On my JIRA 6.1.9, if I write Is instead of IS, I get an error page (though, oddly, AFTER does not have to be all uppercase o.O).
Also note that spaces should be encoded as plus signs (+), not percent-encoded (%20).
If you go to your JIRA, and fetch the following URL: https://jira.yourserver.com/rest/activity-stream/1.0/config, it will list all the combinations it accepts.
What
The call returns a standard Atom feed. You can then process it with XML query tools, or with other Java-based RSS/ATOM reader libraries.
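As a rough polling sketch (assuming the feed is reachable with your session or anonymously; the base URL is the jira.contoso.com example above and the project key is a placeholder), plain JAXP is enough to read the entries:
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

import javax.xml.parsers.DocumentBuilderFactory;

import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class ActivityPoller {
    static final String ATOM_NS = "http://www.w3.org/2005/Atom";

    public static void main(String[] args) throws Exception {
        // Spaces in the filter become plus signs, as noted above.
        String streams = URLEncoder.encode("key IS SOMEPROJ", StandardCharsets.UTF_8.name());
        String feedUrl = "https://jira.contoso.com/activity?maxResults=20&streams=" + streams;

        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true);
        Document doc = dbf.newDocumentBuilder().parse(feedUrl);

        // Print the timestamp and title of each Atom entry.
        NodeList entries = doc.getElementsByTagNameNS(ATOM_NS, "entry");
        for (int i = 0; i < entries.getLength(); i++) {
            Element entry = (Element) entries.item(i);
            String title = entry.getElementsByTagNameNS(ATOM_NS, "title").item(0).getTextContent();
            String updated = entry.getElementsByTagNameNS(ATOM_NS, "updated").item(0).getTextContent();
            System.out.println(updated + "  " + title);
        }
    }
}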
Noteworthy document about this topic: https://developer.atlassian.com/docs/atlassian-platform-common-components/activity-streams/consuming-an-activity-streams-feed
O community, I'm in the process of writing the pseudocode for an application that extracts song lyrics from a remote host (web-server, not my own) by reading the page's source code.
This is assuming that:
Lyrics are being displayed in plaintext
Portion of source code containing lyrics is readable by Java front-end application
I'm not looking for source code to answer the question, but what is the technical term used for querying a remote webpage for plaintext content?
If I can determine the webpage naming scheme, I could set the pointer of the URL object to the appropriate webpage, right? The only limitations would be irregular capitalization, and it would only be effective if the plaintext was found in EXACTLY the same place.
Do you have any suggestions?
I was thinking something like this for "Buck 65", singing "I look good"
URL url = new URL("http://www.elyrics.net/read/b/buck-65-lyrics/i-look-good-lyrics.html");
I could substitute "buck-65-lyrics" & "i-look-good-lyrics" to reflect user input?
Input re-directed to PostgreSQL table
Current objective:
User will request name of {song, artist, album}, Java front-end will query remote webpage
Full source code (containing plaintext) will be extracted with Java front-end
Lyrics will be extracted from source code (somehow)
If a song is not currently indexed by the PostgreSQL server, it will be added to the table.
Operations will be made on the plaintext to suit the objectives of the program
I'm only looking for direction. If I'm headed completely in the wrong direction, please let me know. This is only for the pseudocode. I'm not looking for answers, or hand-outs, I need assistance in determining what I need to do. Are there external libraries for extracting plaintext that you know of? What technical names are there for what I'm trying to accomplish?
Thanks, Tyler
This approach is referred to as screen or data scraping. Note that employing it often breaks the target service's terms of service. Usually, this is not a robust approach, which is why API-like services with guarantees about how they operate are preferable.
Your approach sounds like it will work for the most part, but a few things to keep in mind.
If the web service you're interacting with requires a very precise URL scheme, you should not feed your user-provided data directly into it, since it is likely to be muddied by missing words, abbreviations, or misspellings. You might be better off doing some sort of search, first, and using that search's best result.
Reading HTML data is more complicated than you think. Use an existing library like jsoup to assist you.
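For example, a minimal jsoup sketch along the lines of the URL scheme in the question could look like this; the .lyrics selector is hypothetical, so inspect the page source to find the real container:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class LyricsFetcher {
    public static void main(String[] args) throws Exception {
        // Built from user input; the naming scheme is assumed, as in the question.
        String url = "http://www.elyrics.net/read/b/buck-65-lyrics/i-look-good-lyrics.html";

        Document doc = Jsoup.connect(url)
                .userAgent("my-lyrics-project")   // identify yourself politely
                .get();

        // ".lyrics" is a guess; replace it with the element that actually holds the text.
        String lyrics = doc.select(".lyrics").text();
        System.out.println(lyrics);
    }
}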
The technical term for extracting content from a site is web scraping; you can google that. There are a lot of libraries available; for Java there is jsoup. Though it's also easy to write your own regex.
The first thing I would do is use curl to get the content from the site, just for testing; this will give you a fair idea of what to do.
You will have to use an HTML parser. One of the most popular is jsoup.
Take care about the legal aspects of what you do ;)
So, I'm using HTTP Post Requests in Android Java to log into a website, before extracting the entire HTML code. After that, I use Pattern/Matcher (regex) to find all the elements I need before extracting them from the HTML data, and deleting everything unnecessary. For instance when I extract this:
String extractions = "<td>Good day sir</td>";
Then I use:
extractions = extractions.replaceAll("<td>", "").replaceAll("</td>", "");
I do this multiple times until I have all the data needed from that site, before I display it in some kind of list.
I'm not particularly stuck on anything, but please, can you tell me if this is an effective/efficient/fast way of getting data from a page and processing it, or are there ways to do this faster? Because sometimes it's like my program takes a lot of time to get certain data (although mostly that's when I'm on 3G on my phone).
Like others have said, regex is not the best tool for this job. But in this case, the particular way you use regex is even more inefficient than it would normally be.
In any case, let me offer one more possible solution (depending on your use case).
It's called YQL (Yahoo Query Language).
http://developer.yahoo.com/yql/
Here is a console for it so you can play around with it.
http://developer.yahoo.com/yql/console/
YQL is the lazy developer's way to build your own api on the fly. The main inconvenience is that you have to use Yahoo as a go-between, but if you're ok with that, then I'd suggest you go that route. Using YQL is probably the quickest way to get that kind of work done (especially if the html you're targeting keeps on changing and if its html tags are not always valid).
Using regex to parse a website is always a bad idea:
How to use regular expressions to parse HTML in Java?
Using regular expressions to parse HTML: why not?
Have a look at the Apache Tika library for extracting text from HTML; parsers for many other formats, such as PDF, are also available: http://tika.apache.org/
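As a rough illustration of the Tika facade (the file name is a placeholder; the same call works for PDF, Word and other formats):
import java.io.File;

import org.apache.tika.Tika;

public class HtmlToText {
    public static void main(String[] args) throws Exception {
        Tika tika = new Tika();
        // Detects the format and returns the distilled plain text.
        String text = tika.parseToString(new File("page.html"));
        System.out.println(text);
    }
}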
I want to write a Java function grabTopResults(String f) such that grabTopResults("automata theory") returns me a list of the top 100 cited papers on scholar.google.com for "automata theory".
Does anyone have suggestions for what libraries will make my life easy?
Thanks!
As I'm sure Google can afford the bandwidth, I'll ignore the question of whether this is immoral/illegal/prohibited by Google's T&C
First thing you need to do is figure out what HTTP request (or requests) you need to issue in order to obtain the page with the data you need. Once you've figured this out, use HttpClient to issue the same request from Java code. The previous link shows example code that explains how to do this.
Once you've downloaded the content of the relevant page, you'll need to use a HTML parser to extract the data you're interested in. The Jericho parser suggested by peperg is a good choice.
If the Google police come knocking, you've never heard of me, OK?
I use Jericho: http://jericho.htmlparser.net/docs/index.html . Google Scholar doesn't have an API (http://code.google.com/p/google-ajax-apis/issues/detail?id=109). Of course, it is not allowed by Google (read the terms of use; automatic requests are forbidden).
Below is a bit of example code which gets the titles on the first page using the open source product TestPlan. It is a standalone product, but if you really need it I could help you integrate it into your Java code (it is written in Java itself).
GotoURL http://scholar.google.com/
SubmitForm with
    %Params:q% automate theory
end
set %Items% as response //div[#class='gs_r']
foreach %Item% in %Items%
    set %Title% as selectIn %Item% h3
    Notice %Title%
end
This produces output like the below (my IP is Germany, thus a german response). Obviously you could format it however you like, or write it to a file; this is just a rough test.
00000000-00 GOTOURL http://scholar.google.com/
00000001-00 SUBMITFORM default
00000002-00 NOTICE [ZITATION] Stochastic complexity in statistical inquiry theory
00000003-00 NOTICE AUTOMATED THEORY FORMATION IN MATHEMATICS1
00000004-00 NOTICE Constraint generation via automated theory formation
00000005-00 NOTICE [BUCH] Automated theorem proving: after 25 years
00000006-00 NOTICE [BUCH] Introduction to the Theory of Computation
00000007-00 NOTICE [ZITATION] Computer-controlled systems: theory and design
00000008-00 NOTICE [BUCH] … , randomness & incompleteness: papers on algorithmic information theory
00000009-00 NOTICE [BUCH] Automatic control systems
00000010-00 NOTICE [BUCH] VLSI physical design automation: theory and practice
00000011-00 NOTICE Singular Control Systems.