Java URL library for grabbing lines on a website

I want to be able to grab N lines (HTML text content that start on new lines) on a specific URL e.g. www.sitename.com and store them as strings in an array.
something like
public void grabLines(){
//create instance of class from imported library
//pass sitename into it
//from the instance, call a method for grabbing the lines on the site and pass in "N" as a parameter
//the method returns an array/list of N Strings that I can access later
}
Is there a native Java library I can import to do this? Does it allow me to do what I want easily?
Thanks

Are you trying to make a screen scraper? You will be pulling HTML, as opposed to just what you see. Also, if the website is dynamic, you won't be able to pull everything that you can see. If you just want the HTML, you can try something like this. I tried to build a Bloomberg screen scraper and then parse out the random HTML tags.
try {
    URL bbg = new URL("http://www.bloomberg.com/markets/economic-calendar/");
    BufferedReader r = new BufferedReader(new InputStreamReader(bbg.openStream()));
    String temp;
    while ((temp = r.readLine()) != null) {
        System.out.println(temp);
    }
    r.close();
} catch (Exception e) {
    e.printStackTrace();
}

Apache HttpClient is an abstraction over the URL/Reader technique above, but similar in spirit: Apache HttpClient
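There is no single built-in call for this, but the sketch from the question can be filled in with only java.net and java.io. A minimal version might look like the following; grabLines is the question's hypothetical method name, and the cap of N lines is applied while reading so the rest of the page is never buffered:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;

public class LineGrabber {
    // Read at most n lines from any Reader; the URL variant below just wraps this.
    public static List<String> readFirstLines(Reader source, int n) throws IOException {
        BufferedReader r = new BufferedReader(source);
        List<String> lines = new ArrayList<>();
        String line;
        while (lines.size() < n && (line = r.readLine()) != null) {
            lines.add(line);
        }
        return lines;
    }

    // Grab the first n lines of the page at the given address.
    public static List<String> grabLines(String address, int n) throws IOException {
        try (Reader r = new InputStreamReader(new URL(address).openStream())) {
            return readFirstLines(r, n);
        }
    }

    public static void main(String[] args) throws IOException {
        // "www.sitename.com" is the placeholder site from the question.
        for (String s : grabLines("http://www.sitename.com", 10)) {
            System.out.println(s);
        }
    }
}
```

Splitting the line-reading logic from the URL opening keeps the interesting part testable without a network connection.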


Handling file downloads via REST API

I want to set up a REST API that supports file downloads via Java (the Java part is not needed at the moment; I mention it here so you can make your answer more specific to my problem).
How would I do that?
For example, I have this file in a folder (./java.jar); how can I stream it in such a way that it can be downloaded by a Java client?
I forgot to say that this is for some paid content.
My app should be able to do this:
Client: POST to the server with username and password.
REST: Respond according to what the user has bought (so if they have bought that file, serve the download).
Client: Download the file and put it in folder x.
I thought of encoding the file in Base64 and then posting the encoded result in the usual JSON (maybe with a nice name, useful for the Java application, and with the content inside; though I would not know how to rebuild the file at that point). Is this plausible? Or is there an easier way?
Also, please do not downvote unnecessarily: although there is no code in the question, that doesn't mean I haven't researched it; it just means that I found nothing suitable for my situation.
Thanks.
What you need is regular file streaming, using a valid URL.
The code below is an excerpt from here:
import java.net.*;
import java.io.*;

public class URLReader {
    public static void main(String[] args) throws Exception {
        URL oracle = new URL("http://www.oracle.com/");
        BufferedReader in = new BufferedReader(
                new InputStreamReader(oracle.openStream()));
        String inputLine;
        while ((inputLine = in.readLine()) != null)
            System.out.println(inputLine);
        in.close();
    }
}
For your needs, based on your updated comments on the above answer, you could call your REST endpoint after the user logs in (with auth and any other headers/body you wish to receive) and proceed with the download.
Convert your jar/downloadable content to bytes. More on this
Java Convert File to Byte Array and vice versa
Later, in case you don't want the regular streaming described in the previous answers, you can put the byte content in the body as a Base64 string. You can encode your byte array to Base64 using something like the line below.
Base64.encodeToString(bytes, Base64.NO_WRAP | Base64.URL_SAFE); // android.util.Base64; flags are combined with |
Reference from here: How to send byte[] and strings to a restful webservice and retrieve this information in the web method implementation
Again, there are many ways to do this; this is one way you can do it with REST.
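On the plain JVM (outside Android), the same Base64-in-JSON idea can be sketched with the standard java.util.Base64 (Java 8+). This is only an illustration of the encode/rebuild round trip; the temp files stand in for the real .jar, and the JSON wrapping is omitted:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Base64;

public class Base64FileTransfer {
    // Encode a file's bytes as a Base64 string (what the server would put in the JSON body).
    public static String encodeFile(Path file) throws IOException {
        return Base64.getEncoder().encodeToString(Files.readAllBytes(file));
    }

    // Rebuild the file from the Base64 string (what the client would do after the download).
    public static void decodeToFile(String base64, Path target) throws IOException {
        Files.write(target, Base64.getDecoder().decode(base64));
    }

    public static void main(String[] args) throws IOException {
        Path original = Files.createTempFile("demo", ".jar");
        Files.write(original, new byte[] {1, 2, 3, 4});
        String encoded = encodeFile(original);
        Path rebuilt = Files.createTempFile("rebuilt", ".jar");
        decodeToFile(encoded, rebuilt);
        System.out.println(java.util.Arrays.equals(
                Files.readAllBytes(original), Files.readAllBytes(rebuilt))); // prints true
    }
}
```

Keep in mind Base64 inflates the payload by about a third, so for large jars the plain streaming approach above is usually the better choice.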

Transferring Emojis from Spreadsheet to Java

I would like to transfer data from a google sheet (or any spreadsheet) into java.
How emoji shows up in google sheets: "That picture 😍😍🔥🔥🔥"
How emoji shows up in downloaded TSV: "That picture "ðŸ˜ðŸ˜ðŸ”¥ðŸ”¥ðŸ”¥"
I have trouble understanding how I should be dealing with Emojis:
Is the following correct? I believe the way that emojis behave is that what I see in that first image is the HTML version of the emoji, and that there is an escaped version that looks something like \uD383\u2823
How do I proceed to transfer emojis into java:
What I want to do is be able to count the number of different emojis, so I need to separate them based on their codes.
So it seems I was freaking out for no reason and should have just gone straight into Java hands-first instead of worrying about encodings:
1. I downloaded my spreadsheet as a TSV file.
2. I parsed the TSV file using a regular BufferedReader and used
import org.apache.commons.lang3.StringEscapeUtils;
BufferedReader reader = null;
try {
    // Note: FileReader decodes with the platform default charset, not UTF-8.
    reader = new BufferedReader(new FileReader(filename));
} catch (FileNotFoundException e1) {
    e1.printStackTrace();
}
String line;
try {
    while ((line = reader.readLine()) != null) {
        System.out.println(StringEscapeUtils.escapeJava(line));
    }
} catch (IOException e) {
    e.printStackTrace();
}
3. output: \u00F0\u0178\u201D\u00A5\u00F0\u0178\u201D\u00A5 for input 🔥🔥
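That garbled output is the classic sign of UTF-8 bytes being decoded with the wrong charset (FileReader uses the platform default). A sketch of reading the file with an explicit UTF-8 charset and counting emoji by code point is below; "emoji.tsv" is a placeholder path, and the filter only catches code points outside the Basic Multilingual Plane, which covers most emoji (including the flames and heart-eyes above) but not older symbols like U+2764:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.Map;

public class EmojiCounter {
    // Count occurrences of each code point above U+FFFF in the given text.
    public static Map<String, Integer> countEmoji(String text) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        text.codePoints()
            .filter(cp -> cp > 0xFFFF)
            .forEach(cp -> counts.merge(new String(Character.toChars(cp)), 1, Integer::sum));
        return counts;
    }

    public static void main(String[] args) throws IOException {
        // Read the TSV with an explicit UTF-8 charset (Java 11+ for Path.of).
        try (BufferedReader r = Files.newBufferedReader(Path.of("emoji.tsv"), StandardCharsets.UTF_8)) {
            String line;
            while ((line = r.readLine()) != null) {
                System.out.println(countEmoji(line));
            }
        }
    }
}
```

Working with code points rather than chars matters here because each emoji occupies two Java chars (a surrogate pair), so char-based counting would double-count.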

Scrape website for one data

I would like to extract the value of <div class="score">4.1</div> from a website in Java (Android). I tried Jsoup, and even though it couldn't be simpler to use, it gives me the value in 8 seconds, which is very slow. You need to know that the page source of the site is 300,000 characters and this <div> is somewhere in the middle.
Even using HttpClient and getting the source into a StringBuilder then going through the whole string until the score part is found is faster (3-4 seconds).
I couldn't try out HtmlUnit as it requires a massive amount of jar files and after a while Eclipse always pissed itself in its confusion.
Is there a faster way?
You may simply send an XMLHttpRequest and then search the response using the search() function. I think this would be much faster.
Similar question: Retrieving source code using XMLHttpRequest in JavaScript
To make the search faster, you can simply use indexOf([string to search], [starting index]) and specify the starting index (it doesn't need to be very accurate; you just have to narrow your search area).
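The two-argument indexOf trick can be sketched like this; the HTML in main is made up for illustration, and in the real page the fromIndex would be some rough guess at where the tag sits:

```java
public class ScoreFinder {
    // Extract the text of <div class="score">...</div>, starting the search
    // at fromIndex to skip the part of the page that cannot contain it.
    public static String findScore(String page, int fromIndex) {
        String marker = "<div class=\"score\">";
        int start = page.indexOf(marker, fromIndex);
        if (start < 0) return null;
        start += marker.length();
        int end = page.indexOf("</div>", start);
        return end < 0 ? null : page.substring(start, end);
    }

    public static void main(String[] args) {
        String page = "...lots of html...<div class=\"score\">4.1</div>...more html...";
        System.out.println(findScore(page, 0)); // prints 4.1
    }
}
```

This only works reliably when the marker string is unique in the page, which is exactly the single-value case described in the question.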
Here is what I did. The problem was that I read the web page line by line, glued the lines together into a StringBuilder, and then searched for the specific part. Then I asked myself: why do I read the page line by line and then glue the lines together? So instead I read the page into a byte array and converted it into a String. The scraping time dropped below a second!
try {
    InputStream is = new URL(url).openStream();
    outputDoc = new ByteArrayOutputStream();
    byte[] buf = new byte[1024];
    int len;
    while ((len = is.read(buf)) > 0) {
        outputDoc.write(buf, 0, len);
    }
    is.close();
    outputDoc.close();
} catch (Exception e) {
    e.printStackTrace();
}
try {
    page = new String(outputDoc.toByteArray(), "UTF-8");
    // here I used page.indexOf to find the part
} catch (UnsupportedEncodingException e) {
    e.printStackTrace();
}

Getting HTML code of Web Page in Java

I am working on scraping some data on a specific Web page:
http://www.studenti.ict.uniba.it/esse3/ListaAppelliOfferta.do
The data I have to scrape is shown in the table produced by the search, which you run by selecting one "Facoltà" and one "Dipartimento" and then clicking "Avvia Ricerca".
I am very glad to say I was able to scrape 100% of the data in the table using JSoup, but in order to do so I need the HTML source code of the page containing the table.
The only way I was able to get that HTML is by manually selecting one "Facoltà" and one "Dipartimento" and then clicking "Avvia Ricerca". The table is then shown, and I can obtain the HTML of the whole page containing it by right-clicking and downloading the source code.
I want to write some Java code that automates these steps, given the above-mentioned URL:
selecting "Dipartimento di Informatica" among Facoltà
selecting "Informatica" (or one of the others available)
clicking "Avvia Ricerca"
downloading the HTML source code of the Web page to an .html file
So that I can then apply the code I wrote myself for scraping the data from the table I need.
Is there any library or something of this kind that can help me? I am sure there is no need to re-invent the wheel on this matter.
Please note I tried some code to do that:
try {
    URL url = new URL("http://www.studenti.ict.uniba.it/esse3/ListaAppelliOfferta.do");
    BufferedReader dis = new BufferedReader(new InputStreamReader(url.openStream()));
    String s;
    while ((s = dis.readLine()) != null) {
        System.out.println(s);
    }
    dis.close();
} catch (MalformedURLException mue) {
    mue.printStackTrace();
} catch (IOException ioe) {
    ioe.printStackTrace();
}
But this way I obtain only the HTML code of the page that does not yet contain the table I need to scrape data from.
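That is because the table is produced by submitting the form, not by loading the page. The missing step is a POST request carrying the form fields. A sketch with only the standard library is below; note that "fac_id" and "dip_id" are made-up names, the real field names must be read from the <form> in the page source (browser dev tools show them), and the site may additionally require session cookies from the initial GET. (Jsoup itself can also submit forms via Jsoup.connect(url).data(...).post().)

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

public class FormPoster {
    // Build an application/x-www-form-urlencoded body from the form fields.
    public static String formEncode(Map<String, String> fields) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : fields.entrySet()) {
            if (sb.length() > 0) sb.append('&');
            sb.append(URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8))
              .append('=')
              .append(URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8));
        }
        return sb.toString();
    }

    // POST the fields and return the response page as a string.
    public static String post(String url, Map<String, String> fields) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", "application/x-www-form-urlencoded");
        try (OutputStream out = conn.getOutputStream()) {
            out.write(formEncode(fields).getBytes(StandardCharsets.UTF_8));
        }
        StringBuilder page = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) page.append(line).append('\n');
        }
        return page.toString();
    }

    public static void main(String[] args) throws IOException {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("fac_id", "...");  // hypothetical field name
        fields.put("dip_id", "...");  // hypothetical field name
        System.out.println(post("http://www.studenti.ict.uniba.it/esse3/ListaAppelliOfferta.do", fields));
    }
}
```

The returned string is the page after the search, which can then be fed to the existing JSoup scraping code.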

How to work with HTML code read in Java?

I know how to read the HTML code of a website. For example, the following Java code reads all the HTML from http://www.transfermarkt.co.uk/en/fc-barcelona/startseite/verein_131.html, a website that shows all the football players of F.C. Barcelona.
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;

public class ReadWebPage {
    public static void main(String[] args) throws IOException {
        String urltext = "http://www.transfermarkt.co.uk/en/fc-barcelona/startseite/verein_131.html";
        URL url = new URL(urltext);
        BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()));
        String inputLine;
        while ((inputLine = in.readLine()) != null) {
            // Process each line.
            System.out.println(inputLine);
        }
        in.close();
    }
}
OK, but now I need to work with the HTML code. I need to obtain the names ("Valdés, Victor", "Pinto, José Manuel", etc.) and the positions (Goalkeeper, Defence, Midfield, Striker) of each of the team's players. For example, I need to create an ArrayList<String> PlayerNames and an ArrayList<String> PlayerPositions and put all the players' names and positions into these lists.
How can I do it? I can't find a code example that does it on Google... code examples are welcome.
thanks
I would recommend using HtmlUnit, which will give you access to the DOM tree of the HTML page, and even execute JavaScript in case the data are dynamically put in the page using AJAX.
You could also use JSoup: no JavaScript, but it is more lightweight and supports CSS selectors.
I think the best approach is first to purify the HTML into valid XHTML form and then apply an XSL transformation; for retrieving parts of the information you can use XPath expressions. The best available HTML tag balancer, in my opinion, is NekoHTML (http://nekohtml.sourceforge.net/).
You might like to take a look at htmlparser.
I used it for something similar.
Usage is something like this:
Parser fullWebpage = new Parser("WEBADDRESS");
NodeList nl = fullWebpage.extractAllNodesThatMatch(new TagNameFilter("<insert html tag>"));
NodeList tds = nl.extractAllNodesThatMatch(new TagNameFilter("a"), true);
String data = tds.toHtml();
Java has its own built-in HTML parser. A positive feature of this parser is that it is error tolerant and will assume some tags even if they are missing or misspelled. While called swing.text.html.parser.Parser, it actually has nothing to do with Swing (and with text only as much as HTML is text). Use ParserDelegator. You need to write a callback for use with this parser; otherwise it is not complex to use. A code example (written as a ParserDelegator test) can be found here. Some say it is a remnant of the HotJava browser. The only problem with it seems to be that it has not been upgraded to the most recent versions of HTML.
The simple code example would be
Reader reader; // read HTML from somewhere
HTMLEditorKit.ParserCallback callback = new MyCallBack(); // Implement that interface.
ParserDelegator delegator = new ParserDelegator();
delegator.parse(reader, callback, false);
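A self-contained sketch of that callback approach is below, here collecting the text of every <td> cell; the HTML fragment in main is made up for illustration, and on the real page you would additionally inspect attributes to tell names from positions:

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

public class TableTextExtractor {
    // Collect the text inside every <td> element of an HTML fragment.
    public static List<String> extractCells(Reader html) throws IOException {
        List<String> cells = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            private boolean inCell = false;

            @Override
            public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
                if (t == HTML.Tag.TD) inCell = true;
            }

            @Override
            public void handleEndTag(HTML.Tag t, int pos) {
                if (t == HTML.Tag.TD) inCell = false;
            }

            @Override
            public void handleText(char[] data, int pos) {
                if (inCell) cells.add(new String(data));
            }
        };
        new ParserDelegator().parse(html, callback, true);
        return cells;
    }

    public static void main(String[] args) throws IOException {
        String html = "<table><tr><td>Valdés, Victor</td><td>Goalkeeper</td></tr></table>";
        System.out.println(extractCells(new StringReader(html)));
    }
}
```

From a list of cells like this, alternating entries could then be split into the PlayerNames and PlayerPositions lists the question asks for.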
I've found a link that is just what you were looking for:
http://tiny-url.org/work_with_html_java
