I am working on scraping some data on a specific Web page:
http://www.studenti.ict.uniba.it/esse3/ListaAppelliOfferta.do
The data I have to scrape is shown in the table produced by the search, which you run by selecting one "Facoltà", one "Dipartimento" and then clicking "Avvia Ricerca".
I am very glad to say I was able to scrape 100% of the data in the table using JSoup, but in order to do so I need the HTML source code of the page containing the table.
The only way I was able to get that HTML is by manually selecting one "Facoltà", one "Dipartimento" and then clicking "Avvia Ricerca". The table is then shown and I can obtain the HTML of the whole page containing it by right-clicking and downloading the source code.
I want to write some Java code that automates these steps, given the above-mentioned URL:
selecting "Dipartimento di Informatica" among Facoltà
selecting "Informatica" (or one of the others available)
clicking "Avvia Ricerca"
downloading the HTML source code of the Web page in .html file
Then I can apply the code I wrote myself to scrape the data from the table I need.
Is there any library or something of this kind that can help me? I am sure there is no need to reinvent the wheel on this matter.
Please note I tried some code to do that:
try {
    URL url = new URL("http://www.studenti.ict.uniba.it/esse3/ListaAppelliOfferta.do");
    URLConnection urlConn = url.openConnection();
    // Read the response through the connection we just opened and print it
    BufferedReader dis = new BufferedReader(new InputStreamReader(urlConn.getInputStream()));
    String s;
    while ((s = dis.readLine()) != null) {
        System.out.println(s);
    }
    dis.close();
} catch (MalformedURLException mue) {
    mue.printStackTrace();
} catch (IOException ioe) {
    ioe.printStackTrace();
}
But this way I obtain only the HTML of the initial page, which does not yet contain the table I need to scrape data from.
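One option is HtmlUnit, a headless browser for Java that can fill in the form and click the button for you, letting the page's own JavaScript run in between. The sketch below is only an outline under that assumption: the element names "fac_id", "dip_id" and "btnRicerca" are placeholders, so you would need to inspect the page's HTML to find the real names of the two select boxes and the search button.

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import com.gargoylesoftware.htmlunit.html.HtmlSelect;
import com.gargoylesoftware.htmlunit.html.HtmlSubmitInput;
import java.io.PrintWriter;

public class AppelliDownloader {
    public static void main(String[] args) throws Exception {
        try (WebClient webClient = new WebClient()) {
            webClient.getOptions().setThrowExceptionOnScriptError(false);

            HtmlPage page = webClient.getPage(
                    "http://www.studenti.ict.uniba.it/esse3/ListaAppelliOfferta.do");

            // Placeholder element names: inspect the page to find the real ones.
            // Selecting the Facoltà may reload the Dipartimento list via JavaScript,
            // which HtmlUnit executes because JavaScript is enabled by default.
            HtmlSelect facolta = page.getElementByName("fac_id");
            facolta.setSelectedAttribute(facolta.getOptionByText("Dipartimento di Informatica"), true);

            HtmlSelect dipartimento = page.getElementByName("dip_id");
            dipartimento.setSelectedAttribute(dipartimento.getOptionByText("Informatica"), true);

            HtmlSubmitInput search = page.getElementByName("btnRicerca");
            HtmlPage resultPage = search.click();

            // Save the HTML of the result page, table included, for the JSoup scraper
            try (PrintWriter out = new PrintWriter("appelli.html", "UTF-8")) {
                out.print(resultPage.asXml());
            }
        }
    }
}

The saved appelli.html can then be fed straight into the JSoup code you already have.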
Related
If a URL points to a page that contains images, HTML, JavaScript, PDF files...
How can I determine how many requests it takes to get all those parts? And the size of each part?
My code looks like this:
try {
    url = new URL(aUrl);
    connection = (HttpURLConnection) url.openConnection();
    connection.setRequestMethod("HEAD");
    // connection.connect();
    size = connection.getContentLengthLong();
    Out("URL : " + aUrl);
    if (size < 0)
        Out("Could not determine file size.");
    else
        Out("Size : " + size + " bytes");
    connection.getInputStream().close();
} catch (Exception e) {
    e.printStackTrace();
}
It only gets the size reported by the HEAD request for the URL; I guess that's the total size. How can I figure out the size of each part: HTML, JavaScript, images...?
And more importantly, how many requests?
There is no easy way to get this information apart from fetching everything. The top-level HTML document you get with the first request contains links to other documents (images, style sheets, JavaScript, ...), which in turn could contain further links (e.g. a background image referenced from a style sheet). These other resources may even reside on other servers.
To make things even more complicated, the Javascript in the page may load further resources dynamically.
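If fetching everything is acceptable, a rough sketch of that approach is below: download the top-level HTML with Jsoup, collect the URLs of the images, stylesheets and scripts it references, and issue a HEAD request for each one to read its Content-Length. This only counts resources linked directly from the HTML; anything loaded later by JavaScript is not included, and servers that omit Content-Length will report -1.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import java.net.HttpURLConnection;
import java.net.URL;

public class PageParts {
    public static void main(String[] args) throws Exception {
        String pageUrl = "http://www.example.com/";  // any page you want to inspect

        Document doc = Jsoup.connect(pageUrl).get();  // request #1: the HTML itself
        int requests = 1;

        // Resources referenced directly from the HTML
        for (Element el : doc.select("img[src], script[src], link[href]")) {
            String resUrl = el.hasAttr("src") ? el.absUrl("src") : el.absUrl("href");
            if (resUrl.isEmpty()) continue;

            HttpURLConnection conn = (HttpURLConnection) new URL(resUrl).openConnection();
            conn.setRequestMethod("HEAD");           // ask only for the headers
            long size = conn.getContentLengthLong(); // -1 if the server does not say
            System.out.println(resUrl + " : " + size + " bytes");
            conn.disconnect();
            requests++;
        }
        System.out.println("Requests (HTML + linked resources): " + requests);
    }
}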
I have these lines of code that I have been trying to use to read a PDF file with Apache PDFBox.
private void readPdf() {
    try {
        File PDF_Path = new File("/home/olyjosh/Downloads/my project.pdf");
        PDDocument inputPDF = PDDocument.load(PDF_Path);
        List<PDPage> allPages = inputPDF.getDocumentCatalog().getAllPages();
        PDPage testPage = (PDPage) allPages.get(5);
        System.out.println("Number of pages " + allPages.size());
        PDFPagePanel pdfPanel = new PDFPagePanel();
        jPanel1.add(pdfPanel);
        pdfPanel.setPage(testPage);
        // this.revalidate();
        inputPDF.close();
    } catch (IOException ex) {
        Logger.getLogger(NewJFrame.class.getName()).log(Level.SEVERE, null, ex);
    }
}
I want this PDF to be displayed on a Swing component like a JPanel, but this only displays the panel without the expected content of the PDF file. However, I was able to display the PDF as an image using
convertToImage = testPage.convertToImage();
Please, how do I work around this, or what am I doing wrong?
Apache PDFBox has a mailing list where I was able to ask the same question, and this was the response I got:
This was removed in 2.0 because it made trouble. Obviously, it doesn't work for 1.8 either, at least for you, so why bother?
There are two ways to display, either get a BufferedImage (renderImage / renderImageWithDPI) and display that somehow (see in PDFDebugger how to do it), or renderPageToGraphics which renders to a graphics device object.
If you really want to get the source code of the deleted PDFReader application (which includes PDFPagePanel), use svn to get revision 1702125 or earlier, that should have it. But if it didn't work for you in 1.8, it won't work for you now.
The point is that Swing display of PDF pages isn't part of the API, it's part of some tool (now: in PDFDebugger, previously: in PDFReader).
You need to have some understanding of AWT / Swing. If you don't, learn it, or hire somebody. (That's what we did, and the best part is: Google paid for it, as part of the Google Summer of Code.)
Tilman
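For reference, here is a minimal sketch of the renderImage route Tilman describes, written against the PDFBox 2.x API; the file path, page index and DPI value are placeholders to adapt to your own project.

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.rendering.PDFRenderer;

import javax.swing.ImageIcon;
import javax.swing.JFrame;
import javax.swing.JLabel;
import javax.swing.JScrollPane;
import java.awt.image.BufferedImage;
import java.io.File;

public class PdfPageViewer {
    public static void main(String[] args) throws Exception {
        try (PDDocument document = PDDocument.load(new File("/path/to/some.pdf"))) {
            PDFRenderer renderer = new PDFRenderer(document);
            // Render page 0 at 150 DPI into a BufferedImage
            BufferedImage image = renderer.renderImageWithDPI(0, 150);

            // Show the rendered image inside a Swing component
            JFrame frame = new JFrame("PDF page");
            frame.add(new JScrollPane(new JLabel(new ImageIcon(image))));
            frame.setSize(800, 600);
            frame.setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE);
            frame.setVisible(true);
        }
    }
}

The BufferedImage is fully rendered before the document is closed, so it can safely be added to a JPanel or JLabel afterwards.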
I am creating a process in my Java program that needs to retrieve stock prices from Yahoo Finance, and I can't figure out how to do it, nor do I know where to start. So far I can connect to any specific stock page I want, but I'm not sure how to go about retrieving the current stock price.
urlName = "http://finance.yahoo.com/q?s=" + ticker + "&ql=0";
URL url = new URL(urlName);
// Get the input stream through URL Connection
URLConnection con = url.openConnection();
InputStream is =con.getInputStream();
con.connect();
You can go about it one of two ways:
(1) Easy way: use the Yahoo Finance API: http://yahoofinance-api.com/ (see the sketch after this answer)
(2) Hard way: Parse the html source code for the price.
Open a reader on the stream, read the source code into a string, and then analyze it for the tag that contains your info; use XML parsing to extract that info. You can use the JTidy library in Java.
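For option (1), assuming the quotes library from yahoofinance-api.com is on your classpath, the usage is roughly the following (whether it still returns data depends on Yahoo keeping the underlying endpoints alive):

import yahoofinance.Stock;
import yahoofinance.YahooFinance;
import java.math.BigDecimal;

public class QuoteExample {
    public static void main(String[] args) throws Exception {
        Stock stock = YahooFinance.get("GOOG");          // fetch the quote for the ticker
        BigDecimal price = stock.getQuote().getPrice();  // current/last price
        System.out.println("GOOG: " + price);
    }
}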
Do you specifically want to crawl the website to do this as an exercise? It would be much easier to use a library like this: https://code.google.com/p/yahoo-finance-managed/wiki/YahooFinanceAPIs
If you do want to crawl, you can use an HttpURLConnection or Apache HttpClient to get the HTML, then use a library like JSoup to parse and interpret the data.
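If you go the crawling route, a hedged sketch with Jsoup looks like this; the CSS selector is a placeholder, since Yahoo's markup changes over time and you would have to inspect the quote page to find the element that actually holds the price.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class YahooPriceScraper {
    public static void main(String[] args) throws Exception {
        String ticker = "GOOG";
        Document doc = Jsoup.connect("http://finance.yahoo.com/q?s=" + ticker + "&ql=0")
                .userAgent("Mozilla/5.0")   // some sites serve different HTML to unknown clients
                .get();

        // Placeholder selector: inspect the page to find the real element holding the price
        Element priceElement = doc.select("span.price").first();
        if (priceElement != null) {
            System.out.println(ticker + ": " + priceElement.text());
        } else {
            System.out.println("Price element not found; the selector needs updating.");
        }
    }
}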
When I run the following code, it prints out the contents of the given page.
However, when I manually do a select-all and copy of the actual page, I get different text. What must I do so that when I run the Java request I get the same text as when I do a Ctrl+A, Ctrl+C?
URL myUrl = new URL("http://www.oddsportal.com/matches/soccer/20131204/");
URLConnection yc = myUrl.openConnection();
BufferedReader in = new BufferedReader(new InputStreamReader(yc.getInputStream()));
String inputLine;
while ((inputLine = in.readLine()) != null)
    System.out.println(inputLine);
This could happen for various reasons. For example:
the browser may carry on a non-visible conversation with the server, receiving cookies from it and sending them back
the page can be modified dynamically with JavaScript
the page contents can be modified on the server depending on the browser name in the request headers
and so on.
When you run your code you get the HTML source of the page.
When you show the page in a browser and visually select and copy the page contents (ctrl-a, ctrl-c) you get a copy of the content rendered by a browser.
If you want to access the contents of the page programmatically you need to parse it somehow; JSoup library would be a good choice to select specific contents. HTMLUnit is a non-visual browser library that renders the page and lets you work with the result; this is closer to your current approach.
(Assuming you are not ctrl-a, ctrl-c in a source window, of course.)
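As a small sketch of the HTMLUnit route mentioned above: let the headless browser execute the page's JavaScript and then take the rendered text, which is much closer to what Ctrl+A, Ctrl+C gives you in a real browser. The wait time is a guess to adjust to the page, and the text method is asNormalizedText() in recent HtmlUnit versions (older ones call it asText()).

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class RenderedTextExample {
    public static void main(String[] args) throws Exception {
        try (WebClient webClient = new WebClient()) {
            webClient.getOptions().setThrowExceptionOnScriptError(false);

            HtmlPage page = webClient.getPage("http://www.oddsportal.com/matches/soccer/20131204/");
            webClient.waitForBackgroundJavaScript(5000);  // give AJAX calls time to finish

            // Text of the page as rendered, rather than the raw HTML source
            System.out.println(page.asNormalizedText());
        }
    }
}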
This is not an easy job! Here are some reasons why you get different results:
the server delivers different pages depending on the browser (technically, it is the HTTP 'User-Agent' header which often controls logic on the server side)
AJAX requests change the page content
client-side logic (e.g. Modernizr) controls the output variants
I don't know whether there is some kind of binding for Java, but a possible solution would be to use PhantomJS.
BTW, with your Java code you're eating the newline/carriage-return characters, because BufferedReader.readLine() strips off the \n.
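To keep the line breaks when you do read the raw source, append them back yourself, for example (reusing the in reader from the code above):

StringBuilder html = new StringBuilder();
String line;
while ((line = in.readLine()) != null) {
    html.append(line).append('\n');  // readLine() strips the line terminator, so re-add it
}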
I'm currently writing some MATLAB code to interact with my company's internal reports database. So far I can access the HTML abstract page using code which looks like this:
import com.mathworks.mde.desk.*;
wb = com.mathworks.mde.webbrowser.WebBrowser.createBrowser;
wb.setCurrentLocation(ReportURL(8:end));
pause(1);
s = {};
while isempty(s)
    s = char(wb.getHtmlText);
    pause(.1);
end
desk = MLDesktop.getInstance;
desk.removeClient(wb);
I can extract out various bits of information from the HTML text which ends up in the variable s, however the PDF of the report is accessed via what I believe is a JavaScript command (onClick="gotoFulltext('','[Report Number]')").
Any ideas as to how I execute this JavaScript command and get the contents of the PDF file into a MATLAB variable?
(MATLAB sits on top of Java, so I believe a Java solution would work...)
I think you should take a look at the JavaScript that is being called and see what the final request to the webserver looks like.
You can do this quite easily in Firefox using the FireBug plugin.
https://addons.mozilla.org/en-US/firefox/addon/1843
Once you have found the real server request, you can simply request or POST to that URL instead of trying to run the JavaScript.
Once you have gotten the correct URL (a la the answer from pjp), your next problem is to "get the contents of the PDF file into a MATLAB variable". Whether or not this is possible may depend on what you mean by "contents"...
If you want to get the raw data in the PDF file, I don't think there is a way currently to do this in MATLAB. The URLREAD function was the first thing I thought of to read content from a URL into a string, but it has this note in the documentation:
s = urlread('url') reads the content at a URL into the string s. If the server returns binary data, s will be unreadable.
Indeed, if you try to read a PDF as in the following example, s contains some text intermingled with mostly garbage:
s = urlread('http://samplepdf.com/sample.pdf');
If you want to get the text from the PDF file, you have some options. First, you can use URLWRITE to save the contents of the URL to a file:
urlwrite('http://samplepdf.com/sample.pdf','temp.pdf');
Then you should be able to use one of two submissions on The MathWorks File Exchange to extract the text from the PDF:
Extract text from a PDF document by Dimitri Shvorob
PDF Reader by Tom Gaudette
If you simply want to view the PDF, you can just open it in Adobe Acrobat with the OPEN function:
open('temp.pdf');
wb=com.mathworks.mde.webbrowser.WebBrowser.createBrowser;
wb.executeScript('javascript:alert(''Some code from a link'')');
desk=com.mathworks.mde.desk.MLDesktop.getInstance;
desk.removeClient(wb);