I want to write a Java function grabTopResults(String f) such that grabTopResults("automata theory") returns a list of the top 100 cited papers on scholar.google.com for "automata theory".
Does anyone have suggestions for what libraries will make my life easy?
Thanks!
As I'm sure Google can afford the bandwidth, I'll ignore the question of whether this is immoral/illegal/prohibited by Google's T&C.
First thing you need to do is figure out what HTTP request (or requests) you need to issue in order to obtain the page with the data you need. Once you've figured this out, use HttpClient to issue the same request from Java code. The previous link shows example code that explains how to do this.
Once you've downloaded the content of the relevant page, you'll need to use an HTML parser to extract the data you're interested in. The Jericho parser suggested by peperg is a good choice.
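A rough sketch of how those two steps might fit together (this uses the newer Apache HttpClient 4.x API plus Jericho; the query URL and the guess that result titles sit in h3 elements are my assumptions, so verify them against the live page, which Google changes from time to time):
import net.htmlparser.jericho.Element;
import net.htmlparser.jericho.Source;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

public class ScholarFetch {
    public static void main(String[] args) throws Exception {
        // Step 1: issue the same request a browser would (hypothetical query URL)
        String url = "http://scholar.google.com/scholar?q=automata+theory";
        try (CloseableHttpClient client = HttpClients.createDefault()) {
            String html = EntityUtils.toString(
                    client.execute(new HttpGet(url)).getEntity());
            // Step 2: pull the result titles out of the HTML with Jericho
            Source source = new Source(html);
            for (Element h3 : source.getAllElements("h3")) {
                System.out.println(h3.getTextExtractor().toString());
            }
        }
    }
}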
If the Google police come knocking, you've never heard of me, OK?
I use http://jericho.htmlparser.net/docs/index.html . Google Scholar doesn't have an API ( http://code.google.com/p/google-ajax-apis/issues/detail?id=109 ). Of course, it is not allowed by Google (read the terms of use: automatic requests are forbidden).
Below is a bit of example code which gets the titles on the first page using the open source product TestPlan. It is a standalone product, but if you really need it I could help you integrate it into your Java code (it is written in Java itself).
GotoURL http://scholar.google.com/
SubmitForm with
    %Params:q% automate theory
end
set %Items% as response //div[@class='gs_r']
foreach %Item% in %Items%
    set %Title% as selectIn %Item% h3
    Notice %Title%
end
This produces output like the below (my IP is in Germany, hence the German response). Obviously you could format it however you like, or write it to a file; this is just a rough test.
00000000-00 GOTOURL http://scholar.google.com/
00000001-00 SUBMITFORM default
00000002-00 NOTICE [ZITATION] Stochastic complexity in statistical inquiry theory
00000003-00 NOTICE AUTOMATED THEORY FORMATION IN MATHEMATICS1
00000004-00 NOTICE Constraint generation via automated theory formation
00000005-00 NOTICE [BUCH] Automated theorem proving: after 25 years
00000006-00 NOTICE [BUCH] Introduction to the Theory of Computation
00000007-00 NOTICE [ZITATION] Computer-controlled systems: theory and design
00000008-00 NOTICE [BUCH] … , randomness & incompleteness: papers on algorithmic information theory
00000009-00 NOTICE [BUCH] Automatic control systems
00000010-00 NOTICE [BUCH] VLSI physical design automation: theory and practice
00000011-00 NOTICE Singular Control Systems.
As simple as this operation seems, I can't find any documentation on how to receive a multipart message using ZMQ (JeroMQ). I checked The Guide, but it only covers this with C code, and it seems that I'm supposed to receive messages the same way no matter what kind of message I'm receiving.
In reality, what happens is that I receive the multipart message as two separate messages with this code:
while (running.get()) {
    items.poll();
    if (items.pollin(0)) {
        byte[] message = receiver.recv(0);
        System.out.println("Received " + new String(message, Charset.forName("UTF-8")));
    }
}
The "Received" part will get printed twice if I send a multipart message like this:
publisher.sendMore(message.key);
publisher.send(objectMapper.writeValueAsString(message.data));
What am I doing wrong?
Edit: I know there is a language selector below the examples, but this particular problem is not present in any of the examples, only explained inline with the C code.
Edit
I tried to explore the API and found the hasReceiveMore() method. I tried using it, but it didn't work: I ended up with an infinite loop with this code:
List<String> parts = new ArrayList<>();
while (receiver.hasReceiveMore()) {
    parts.add(receiver.recvStr());
}
Q : "What am I doing wrong?"
Your code has to assume that any message it receives may have been composed as a multipart message (there are zero warranties about this, least of all a priori), and it has to actively check for the ZMQ_RECVMORE flag after each .recv() call, continuing to receive parts until getsockopt( ZMQ_RECVMORE ) says there are no more.
JeroMQ may have wrapped this native API behaviour in other utility methods, so it is best to re-read the JeroMQ source code to find where this native multipart-message handling "protocol" gets wrapped into the JeroMQ tooling.
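For illustration only, a minimal sketch of that receive "protocol" against JeroMQ's Socket API (it assumes the same receiver socket as in the question's snippets, and the recvStr() / hasReceiveMore() methods mentioned above; verify both against your JeroMQ version):
// Sketch: always receive a first frame, then drain any flagged continuation frames
List<String> frames = new ArrayList<>();
frames.add(receiver.recvStr());           // blocks until the first frame arrives
while (receiver.hasReceiveMore()) {       // RECVMORE set => another frame of the same message
    frames.add(receiver.recvStr());
}
// frames.get(0) would then be the key, frames.get(1) the JSON payload sent above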
EPILOGUE: Verba docent, exempla trahunt...
Having helped more than 1.3M Community members and countless anonymous site visitors, I got punished and censored for helping. The censorship of deleting comments continues; the spirit of StackOverflow is turning into digital totalitarianism: delete, delete, and punish those who keep thinking and offer help and advice to those who ask for it.
Let's review the facts:
"I couldn't find any docs regarding how I should receive the parts, but I tried something that looks like what you mentioned...it didn't work either. – Adam Arold 20 hours ago"
Was either the failure to find "any docs" or the "not working" (again without a published, reproducible MCVE) my fault or omission? (My answer to these false claims was administratively deleted a few minutes after being posted. Self-explanatory.)
"This is not an answer and it doesn't contain a solution. I'm not sure why you're surprised. What you cited is the C API that has nothing to do with the JeroMQ API. In the end the solution was that I have to recv before I try to check the RECVMORE flag. This was not in your answer. Alternatively ZMsg can be used. – Adam Arold 11 hours ago"
ANALYSIS OF THE CLAIMS:
Sentence #1: "This is not an answer and it doesn't contain a solution." This IS an answer, in spite of the claim. It contains several important pieces of information that anyone would otherwise have to spend hours (days? weeks?) seeking out, and then study and comprehend architecture-wise, so as not to produce ill-formulated code designs or get trapped by their own misconceived decisions, had they not been advised and warned about the shortcomings I have personally met (and help others avoid) throughout my 13+ years with Martin Sustrik's masterpiece, ZeroMQ, since v2.1+. So this claim is both wrong and unsupported by facts. The lesser claim that the answer did not contain a solution is absurd: StackOverflow Community members are not employees to be shouted at, much less do we bear a commitment to program code that will snap into and fit all the needs of an unpublished use case.
Sentence #2 (an expressed feeling): skipped, as it is more an insult than a fair argument, isn't it?
Sentence #3: "What you cited is the C API that has nothing to do with the JeroMQ API." Oh sure, YES, the C API (and the ZeroMQ RFC documents on the mandatory wire-line protocol properties that any peer implementation has to obey) is the starting point and a cardinal reference in all of this. And NO, both the published ZeroMQ RFC documents and the API are the rock-solid reference for anyone to start with, so as to best understand how the internal engines, and the protocol pumps obeying all the mandatory wire-line properties, work (and must work) in order for an implementation to declare itself ZeroMQ-compatible. The JeroMQ authors did their work based on these documented properties, didn't they? If they did not, or if they "cut some corners" doing so, it was not my fault that they failed to meet and/or cover all the ZMTP/ZeroMQ-RFC/API properties and requirements, was it?
That said, any wrapper/binding, including any version of JeroMQ, must also conform to these inner working rules, which are sufficiently self-documented and demonstrated, if nowhere else, in the JeroMQ source code (which warning was also part of the answer provided, wasn't it?), if it aspires to be a ZeroMQ-compatible tool. And should your current JeroMQ implementation lack the well-documented API description and code examples you would like to read for your use case, it is not the Community member who sponsored their time whom you should punish for that lack.
Sentences #4 and #5: This needs to be highlighted: "In the end the solution was that I have to recv before I try to check the RECVMORE flag. This was not in your answer." First of all, it WAS in the answer, in its very first sentence: "Your code has to assume that any message it receives may have been composed as a multipart message ... and it has to actively check for the ZMQ_RECVMORE flag after each .recv() call, continuing to receive parts until getsockopt( ZMQ_RECVMORE ) says there are no more." My generation grew up in the deep belief that if we made an error, or a poor decision based on an unsupported assumption, we never punished anyone else for our own mistake. Surprisingly, that does not seem to work here. Why would anyone ever punish a person who reached out to help them solve their problem and sponsored their personal need to get a step further? Be it called arrogance or dictator-like behaviour, it is neither fair nor a style to be promoted, much less rewarded, as Community-preferred behaviour. The "argument" per se is empty: without calling a .recv() method, nothing ever gets out from behind the ZeroMQ API's abstraction horizon, least of all the RECVMORE flag that .getsockopt() reports (which, of course, only makes sense after some .recv() has been confirmed to have fetched a message). That is elementary and self-explanatory; would anyone claim it was unfair not to explain that asking about an email's attachment makes no sense if no email was delivered so far? No fair person ever would. So the answer did the very opposite: it warned that for every .recv()-ed message, a professional designer ought always to assume zero or more RECVMORE-flagged parts that follow the first one and need to be dug out of the API.
The last sentence: "Alternatively ZMsg can be used." This claim remains an undecidable problem, as the O/P contains zero information about a version. The native ZeroMQ API has evolved since its premiere release via v2.0-v2.1-..-v2.11, via v3.0-v3.1-v3.2, and was refactored and extended via v4.0-v4.1-v4.2-v4.3, still counting, and the claimed ZMsg abstraction is certainly not present in earlier implementations, so the version number is cardinal here (it is also part of the StackOverflow best practices for how to ask good questions, with a problem-reproducing MCVE/MWE and all relevant details, the version number being one of them, isn't it?).
SO community, I'm in the process of writing the pseudocode for an application that extracts song lyrics from a remote host (a web server, not my own) by reading the page's source code.
This is assuming that:
Lyrics are being displayed in plaintext
Portion of source code containing lyrics is readable by Java front-end application
I'm not looking for source code to answer the question, but what is the technical term used for querying a remote webpage for plaintext content?
If I can determine the webpage naming scheme, I could set the pointer of the URL object to the appropriate webpage, right? The only limitations would be irregular capitalization, and it would only be effective if the plaintext was found in EXACTLY the same place.
Do you have any suggestions?
I was thinking something like this for "Buck 65", singing "I look good"
URL url = new URL("http://www.elyrics.net/read/b/buck-65-lyrics/i-look-good-lyrics.html");
I could substitute "buck-65-lyrics" & "i-look-good-lyrics" to reflect user input?
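Maybe something like this to normalize whatever the user types into those slugs (just a guess at the site's naming scheme, using plain java.net.URL; it only works if the site really follows this pattern):
// Hypothetical slug builder: "Buck 65" -> "buck-65", "I Look Good" -> "i-look-good"
static String slug(String input) {
    return input.trim()
                .toLowerCase()
                .replaceAll("[^a-z0-9]+", "-")
                .replaceAll("(^-+|-+$)", "");
}

static URL lyricsUrl(String artist, String song) throws MalformedURLException {
    String a = slug(artist);
    return new URL("http://www.elyrics.net/read/" + a.charAt(0) + "/"
                   + a + "-lyrics/" + slug(song) + "-lyrics.html");
}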
Input re-directed to PostgreSQL table
Current objective:
User will request name of {song, artist, album}, Java front-end will query remote webpage
Full source code (containing plaintext) will be extracted with Java front-end
Lyrics will be extracted from source code (somehow)
If the song is not currently indexed by the PostgreSQL server, it will be added to the table.
Operations will be made on the plaintext to suit the objectives of the program
I'm only looking for direction. If I'm headed completely in the wrong direction, please let me know. This is only for the pseudocode. I'm not looking for answers, or hand-outs, I need assistance in determining what I need to do. Are there external libraries for extracting plaintext that you know of? What technical names are there for what I'm trying to accomplish?
Thanks, Tyler
This approach is referred to as screen or data scraping. Note that employing it often breaks the target service's terms of service. Usually, this is not a robust approach, which is why API-like services with guarantees about how they operate are preferable.
Your approach sounds like it will work for the most part, but there are a few things to keep in mind.
If the web service you're interacting with requires a very precise URL scheme, you should not feed your user-provided data directly into it, since it is likely to be muddied by missing words, abbreviations, or misspellings. You might be better off doing some sort of search, first, and using that search's best result.
Reading HTML data is more complicated than you think. Use an existing library like jsoup to assist you.
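For instance, a minimal jsoup sketch along those lines (the ".lyrics-body" selector is a placeholder; you would have to inspect the real page to find the element that actually holds the lyrics):
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class LyricsScraper {
    public static String fetchLyrics(String pageUrl) throws Exception {
        Document doc = Jsoup.connect(pageUrl)
                            .userAgent("Mozilla/5.0 (lyrics-demo)")   // identify yourself politely
                            .timeout(10_000)
                            .get();
        // placeholder selector; view the page source to find the real container
        return doc.select(".lyrics-body").text();
    }
}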
The technical term for extracting content from a site is web scraping; you can google that. There are a lot of libraries available; for Java there is jsoup, though it's easy enough to write your own regex.
The first thing I would do is use curl to get the content from the site, just for testing; this will give you a fair idea of what to do.
You will have to use an HTML parser. One of the most popular is jsoup.
Take care about the legal aspect of what you do ;)
I'm working on a little project to analyze the content on some sites I find interesting; this is a real DIY project that I'm doing for my entertainment/enlightenment, so I'd like to code as much of it on my own as possible.
Obviously, I'm going to need data to feed my application, and I was thinking I would write a little crawler that would take maybe 20k pages of HTML and write them to text files on my hard drive. However, when I took a look on SO and other sites, I couldn't find any information on how to do this. Is it feasible? It seems like there are open-source options available (WebSPHINX?), but I would like to write this myself if possible.
Scheme is the only language I know well, but I thought I'd use this project to teach myself some Java, so I'd be interested if there are any Racket or Java libraries that would be helpful for this.
So I guess to summarize my question: what are some good resources to get started on this? How can I get my crawler to request info from other servers? Will I have to write a simple parser for this, or is that unnecessary given that I want to take the whole HTML file and save it as txt?
This is entirely feasible, and you can definitely do it with Racket. You may want to take a look at the PLaneT libraries; in particular, Neil Van Dyke's HtmlPrag:
http://planet.racket-lang.org/display.ss?package=htmlprag.plt&owner=neil
.. is probably the place to start. You should be able to pull the content of a web page into a parsed format in one or two lines of code.
Let me know if you have any questions about this.
Having done this myself in Racket, here is what I would suggest.
Start with a "Unix tools" approach:
Use curl to do the work of downloading each page (you can execute it from Racket using system) and storing the output in a temporary file.
Use Racket to extract the URIs from the <a> tags.
You can "cheat" and do a regular expression string search.
Or, do it "the right way" with a true HTML parser, as John Clements' great answer explains.
Consider maybe doing the cheat first, then looping back later to do it the right way.
At this point you could stop, or, you could go back and replace curl with your own code to do the downloads. For this you can use Racket's net/url module.
The reason I suggest trying curl first is that it handles things that are more complicated than they might seem:
Do you want to follow 30x redirects?
Do you want to accept/store/provide cookies (the site may behave differently otherwise)?
Do you want to use HTTP keep-alive?
And on and on.
For example, using curl like this:
(define curl-core-options
  (string-append
   "--silent "
   "--show-error "
   "--location "
   "--connect-timeout 10 "
   "--max-time 30 "
   "--cookie-jar " (path->string (build-path 'same "tmp" "cookies")) " "
   "--keepalive-time 60 "
   "--user-agent 'my crawler' "
   "--globoff "))

(define (curl/head url out-file)
  (system (format "curl ~a --head --output ~a --url \"~a\""
                  curl-core-options
                  (path->string out-file)
                  url)))

(define (curl/get url out-file)
  (system (format "curl ~a --output ~a --url \"~a\""
                  curl-core-options
                  (path->string out-file)
                  url)))
This represents a lot of code that you would otherwise need to write from scratch in Racket, to do all the things those curl command-line flags are doing for you.
In short: Start with the simplest case of using existing tools. Use Racket almost as a shell script. If that's good enough for you, stop. Otherwise go on to replace the tools one by one with your bespoke code.
I suggest looking into the open source web crawler for java known as crawler4j.
It is very simple to use and it provides very good resources and options for your crawling.
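To give a feel for how little code it takes, here is a sketch following the crawler4j 4.x API as I recall it (check the project's README for your version, since signatures such as shouldVisit have changed between releases):
import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;
import edu.uci.ics.crawler4j.url.WebURL;

public class MyCrawler extends WebCrawler {
    @Override
    public boolean shouldVisit(Page referringPage, WebURL url) {
        // stay on the seed domain; adjust to taste
        return url.getURL().startsWith("http://example.com/");
    }

    @Override
    public void visit(Page page) {
        // here you could write the page content out to a .txt file for later analysis
        System.out.println("Visited: " + page.getWebURL().getURL());
    }

    public static void main(String[] args) throws Exception {
        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder("/tmp/crawl");            // intermediate crawl data
        PageFetcher fetcher = new PageFetcher(config);
        RobotstxtServer robots = new RobotstxtServer(new RobotstxtConfig(), fetcher);
        CrawlController controller = new CrawlController(config, fetcher, robots);
        controller.addSeed("http://example.com/");
        controller.start(MyCrawler.class, 2);                  // 2 crawler threads
    }
}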
If you know Scheme, and you want to ease into Java, why don't you start with Clojure?
You can leverage your lisp knowledge, and take advantage of java html parsing libraries* out there to get something working. Then if you want to start transitioning parts of it to Java to learn a bit, you can write bits of functionality in Java and wire that into the Clojure code.
Good luck!
* I've seen several SO questions on this.
If I were you, I wouldn't write a crawler -- I'd use one of the many free tools that download web sites locally for offline browsing (e.g. http://www.httrack.com/) to do the spidering. You may need to tweak the options to disable downloading images, etc, but those tools are going to be way more robust and configurable than anything you write yourself.
Once you do that, you'll have a whole ton of HTML files locally that you can feed to your application.
I've done a lot of textual analysis of HTML files; as a Java guy, my library of choice for distilling HTML into text (again, not something you want to roll yourself) is the excellent Jericho parser: http://jericho.htmlparser.net/docs/index.html
EDIT: re-reading your question, it does appear that you are set on writing your own crawler; if so, I would recommend Commons HttpClient to do the downloading, and still Jericho to pull out the links and process them into new requests.
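As a rough illustration of the link-extraction half (Jericho's Source can read straight from a URL for quick experiments; in a real crawler you would feed it the HTML you already downloaded with HttpClient):
import java.net.URL;
import net.htmlparser.jericho.Element;
import net.htmlparser.jericho.HTMLElementName;
import net.htmlparser.jericho.Source;

public class LinkExtractor {
    public static void main(String[] args) throws Exception {
        Source source = new Source(new URL("http://example.com/"));
        for (Element a : source.getAllElements(HTMLElementName.A)) {
            String href = a.getAttributeValue("href");
            if (href != null) {
                System.out.println(href);   // feed these back into your request queue
            }
        }
    }
}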
I did that in Perl years ago (much easier, even without the webcrawler module).
I suggest you read the wget documentation and use the tool for inspiration. Wget is the netcat of webcrawling; its feature set will inspire you.
Your program should accept a list of URLs to start with and add them to a list of URLs to try. You then have to decide if you want to collect every url or only add those from the domains (and subdomains?) provided in the initial list.
I made you a fairly robust starting point in Scheme:
(define (crawl . urls)
  ;; I would use regular expressions for this unless you have a special module for it
  ;; Hint: URLs tend to hide in comments, referral tags, cookies... not just links.
  (define (parse url) ...)
  ;; For this I would convert URL strings to a standard form, then use string=
  (define (url= x y) ...)
  ;; use whatever DNS lookup mechanism your implementation provides
  (define (get-dom url) ...)
  ;; the rest should work fine on its own unless you need to modify anything
  (if (null? urls)
      (error "No URLs!")
      (let ([doms (map get-dom urls)])
        (let crawl ([urls urls] [done '()])
          (receive (url urls) (car+cdr urls)
            (if (or (member url done url=)
                    (not (member (get-dom url) doms url=)))
                (crawl urls done)
                (begin (parse url) (display url) (newline)
                       (crawl urls (cons url done)))))))))
I'm back with a question. I'm playing with RapidMiner for automatic text classification and can't get it to work. I'm getting an error that says, "no example set in the example, offending operator Performance". Any idea what that is referring to?
In RapidMiner you have to use converter components before something can be used as an example set. So, if you have an output of type 'doc', for example, you have to use the 'Documents to Data' component in order to link it to the next 'exa' input. That's all!
Could you provide more details about your RapidMiner text mining process?
Without more context, your question is difficult to answer.
For more help with RapidMiner, you may want to check out the RapidMiner user forum: http://forum.rapid-i.com/
At RapidMiner Resources, you can find RapidMiner tutorial videos about how to do text mining with RapidMiner:
http://rapidminerresources.com/index.php?page=text-mining-3
Rapid-I also offers a 90-minute text mining webinar. You can find it on the Rapid-I web page under "services" and "training", or in the web shop.
I hope these links help you to get started with automatic text classification with RapidMiner. If you provide more details about your RapidMiner text mining process, I may also be able to directly answer your question.
If it says that there is no Example Set, then the issue is probably with your original data. Can you post an image of your process?
For instance, make sure that you have connected the initial input to your operator - what two operators does the error occur at?
One thought: the example set in text mining is usually your document collection, but if you are really using documents (PDF, Word) then your format will be Documents (Doc), and you may need to transform your documents to data (Documents to Data operator). Then you should have an Example Set that you can feed into your Performance operator.
Hope this helps - as the earlier comment said, without knowing the process, it is hard to tell exactly where the error is.
What's the best way to do spreadsheet-like calculations in a programming language? Example: a multi-user application needs to be available over the web that crunches columns and cells of numbers like a spreadsheet, based on user submissions. What are the best data structures/database models/patterns to handle this type of work, so that the different columns are handled efficiently and easily in PHP, Java, or even .NET? Is it better to use data structures within the language, or is it better to use a database? If using a database is the way, how does one go about doing this?
To do the actual calculation, look at graph theory. Basically you want to represent each cell as a node in a graph and each dependency as a directed edge. Next, do a topological sort to calculate the value of each cell in the right order.
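A minimal sketch of that in Java (the cell names and the way dependencies get registered are invented for illustration; a real engine would discover the edges by parsing each cell's formula):
import java.util.*;

public class CellGraph {
    // cell -> cells that depend on it (a directed edge per dependency)
    private final Map<String, List<String>> dependents = new HashMap<>();
    private final Map<String, Integer> inDegree = new HashMap<>();

    public void addCell(String cell) {
        dependents.putIfAbsent(cell, new ArrayList<>());
        inDegree.putIfAbsent(cell, 0);
    }

    // "to" depends on "from", so "from" must be evaluated first
    public void addDependency(String from, String to) {
        addCell(from);
        addCell(to);
        dependents.get(from).add(to);
        inDegree.merge(to, 1, Integer::sum);
    }

    // Kahn's algorithm: returns an evaluation order, or throws on a circular reference
    public List<String> evaluationOrder() {
        Map<String, Integer> remaining = new HashMap<>(inDegree);
        Deque<String> ready = new ArrayDeque<>();
        remaining.forEach((cell, deg) -> { if (deg == 0) ready.add(cell); });
        List<String> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            String cell = ready.remove();
            order.add(cell);
            for (String next : dependents.get(cell)) {
                if (remaining.merge(next, -1, Integer::sum) == 0) ready.add(next);
            }
        }
        if (order.size() != remaining.size()) throw new IllegalStateException("circular reference");
        return order;
    }
}
For example, if B1 = A1 * 2 and C1 = B1 + A1, you would call addDependency("A1", "B1"), addDependency("A1", "C1"), and addDependency("B1", "C1"), then compute each cell's value in the order returned by evaluationOrder().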
Aspose.Cells (formerly Aspose.Excel.Web) is a good way to get the functionality you are looking for.
Unless you are asking more "How is it done?" than "I need to do it," in which case I would look at the other answers given.
Along the lines of "I need to do it"
Microsoft has Excel Services which does just what you want.
Spreadsheet operations on the server. It is available via a web services interface, so you can connect and drive calculations from Java, PHP, .NET, whatever.
Excel Services is part of Sharepoint 2007.
Resolver One is a Spreadsheet app made in IronPython.
There is an explanation of the overall mechanism for the calculation [pythonology.org] it uses for user-generated equations.
The relevant image showing Resolver One's overall algorithm.
It should be noted that users can write Python code to be interpreted both in the cells and in a special 'outside of sheet' place.
Look at another question here on SO, from which I reused my answer.
I can't tell you how to do it. But I would recommend you to look at the code of PHPExcel. PHPExcel is a library that allows you to create Excel files within PHP.
The workflow of PHPExcel is simplified like this:
Create an empty Excel file object
Add cells (with either data or formulas) to the "Excel file"
Call the create function, which generates the file itself
In your case you would have to replace 3. with something like "Create web interface".
Therefore I would recommend you to look at the code of this open source project and see how the general structure is laid out. This should help you solve your problem.
I once used a binary tree to store the output of parsing a string using BODMAS. Each node was an operation between two other nodes, which could be a number, a variable or another operation.
So y = x * x + 2
became:
      +
     / \
    *   2
   / \
  x   x
Sadly this was at school in Pascal and is stored on a 5 1/4" disk, so you don't want it :)
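A small sketch of the same structure in Java (modern Java records; the names are mine, not the original Pascal):
// Expression tree for y = x * x + 2, evaluated bottom-up
interface Node { double eval(double x); }
record Num(double value) implements Node {
    public double eval(double x) { return value; }
}
record Var() implements Node {
    public double eval(double x) { return x; }
}
record Op(char op, Node left, Node right) implements Node {
    public double eval(double x) {
        double l = left.eval(x), r = right.eval(x);
        return op == '+' ? l + r : op == '-' ? l - r : op == '*' ? l * r : l / r;
    }
}

class Demo {
    public static void main(String[] args) {
        Node y = new Op('+', new Op('*', new Var(), new Var()), new Num(2));
        System.out.println(y.eval(3));   // prints 11.0
    }
}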
SpreadsheetGear for .NET will let you load Excel workbooks, plug in values, calculate and then get the results.
You can see a few simple ASP.NET calculation samples here, other ASP.NET samples here and download a free trial here.
Disclaimer: I own SpreadsheetGear LLC
I must point out that Google Spreadsheets already does this kind of stuff.