I know how to read the HTML code of a website. For example, the following Java code reads all the HTML from http://www.transfermarkt.co.uk/en/fc-barcelona/startseite/verein_131.html, a page that lists all the football players of F.C. Barcelona.
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
public class ReadWebPage {
    public static void main(String[] args) throws IOException {
        String urltext = "http://www.transfermarkt.co.uk/en/fc-barcelona/startseite/verein_131.html";
        URL url = new URL(urltext);
        BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()));
        String inputLine;
        while ((inputLine = in.readLine()) != null) {
            // Process each line.
            System.out.println(inputLine);
        }
        in.close();
    }
}
OK, but now I need to work with the HTML code. I need to obtain the names ("Valdés, Victor", "Pinto, José Manuel", etc.) and the positions (Goalkeeper, Defence, Midfield, Striker) of each player on the team. For example, I need to create an ArrayList<String> PlayerNames and an ArrayList<String> PlayerPositions and fill these lists with the names and positions of all the players.
How can I do this? I can't find a code example for it on Google. Code examples are welcome.
thanks
I would recommend using HtmlUnit, which will give you access to the DOM tree of the HTML page, and even execute JavaScript in case the data are dynamically put in the page using AJAX.
You could also use jsoup: it does not execute JavaScript, but it is more lightweight and supports CSS selectors.
I think the best approach is first to clean the HTML into valid XHTML and then apply an XSL transformation; to retrieve specific pieces of information you can use XPath expressions. The best available HTML tag balancer is, in my opinion, NekoHTML (http://nekohtml.sourceforge.net/).
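To make the XPath step concrete, here is a minimal sketch that uses only the JDK's XML classes. It assumes the page has already been cleaned into well-formed XHTML (for example by NekoHTML). The table markup and the class names "name" and "pos" are invented for illustration, so the expressions would have to be adapted to the real page.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class PlayerXPath {

    // Returns the trimmed text of every node matched by the XPath expression.
    static List<String> extract(String xhtml, String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                result.add(nodes.item(i).getTextContent().trim());
            }
            return result;
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed XML or the XPath is invalid", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical, already well-formed markup; the real page will look different.
        String xhtml = "<table>"
                + "<tr><td class=\"name\">Vald\u00e9s, Victor</td><td class=\"pos\">Goalkeeper</td></tr>"
                + "<tr><td class=\"name\">Pinto, Jos\u00e9 Manuel</td><td class=\"pos\">Goalkeeper</td></tr>"
                + "</table>";
        List<String> playerNames = extract(xhtml, "//td[@class='name']");
        List<String> playerPositions = extract(xhtml, "//td[@class='pos']");
        System.out.println(playerNames);
        System.out.println(playerPositions);
    }
}
```

The two lists stay aligned only if every row has both cells; if some rows lack a position, selecting the rows first and reading both cells per row would be safer.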
You might like to take a look at htmlparser.
I used it for something similar.
Usage is something like this:
Parser fullWebpage = new Parser("WEBADDRESS");
// Replace the placeholder with a bare tag name, e.g. "td".
NodeList nl = fullWebpage.extractAllNodesThatMatch(new TagNameFilter("<insert html tag>"));
// Recursively collect the links inside the matched nodes.
NodeList tds = nl.extractAllNodesThatMatch(new TagNameFilter("a"), true);
String data = tds.toHtml();
Java has its own built-in HTML parser. A positive feature of this parser is that it is error tolerant and will assume some tags even if they are missing or misspelled. Although it lives in the javax.swing.text.html.parser package, it actually shares nothing with Swing (and with text only insofar as HTML is text). Use ParserDelegator. You need to write a callback for use with this parser; otherwise it is not complex to use. The code example (written as a ParserDelegator test) can be found here. Some say it is a remnant of the HotJava browser. The only problem is that it does not seem to have been updated for the most recent versions of HTML.
A simple code example would be:
Reader reader; // read HTML from somewhere
HTMLEditorKit.ParserCallback callback = new MyCallBack(); // Implement that interface.
ParserDelegator delegator = new ParserDelegator();
delegator.parse(reader, callback, false);
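MyCallBack is left for the reader to implement. As a rough sketch of what such a callback might look like, the hypothetical LinkTextCollector below collects the text of every <a> element. The class name, the collect helper and the sample markup are made up for illustration. It passes true as the third argument (ignoreCharSet) because the input is already a decoded string.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

class LinkTextCollector extends HTMLEditorKit.ParserCallback {
    private final List<String> linkTexts = new ArrayList<>();
    private boolean insideLink = false;

    @Override
    public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
        if (tag == HTML.Tag.A) {
            insideLink = true;
        }
    }

    @Override
    public void handleEndTag(HTML.Tag tag, int pos) {
        if (tag == HTML.Tag.A) {
            insideLink = false;
        }
    }

    @Override
    public void handleText(char[] data, int pos) {
        // Only keep text that appears between <a> and </a>.
        if (insideLink) {
            linkTexts.add(new String(data));
        }
    }

    // Parses the HTML string and returns the text of each link, in document order.
    static List<String> collect(String html) {
        LinkTextCollector callback = new LinkTextCollector();
        try (Reader reader = new StringReader(html)) {
            new ParserDelegator().parse(reader, callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return callback.linkTexts;
    }

    public static void main(String[] args) {
        // Hypothetical sample markup.
        String html = "<html><body><p><a href=\"/a\">First</a> and "
                + "<a href=\"/b\">Second</a></p></body></html>";
        System.out.println(collect(html));
    }
}
```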
I've found a link that is just what you were looking for:
http://tiny-url.org/work_with_html_java
I have used this method to retrieve a webpage into an org.jsoup.nodes.Document object:
myDoc = Jsoup.connect(myURL).ignoreContentType(true).get();
How should I write this object to a HTML file?
The methods myDoc.html(), myDoc.text() and myDoc.toString() don't output all elements of the document.
Some information in a javascript element can be lost in parsing it. For example, "timestamp" in the source of an Instagram media page.
Use doc.outerHtml().
import java.io.File;
import java.nio.charset.StandardCharsets;
import org.apache.commons.io.FileUtils;
import org.jsoup.Connection.Response;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
public void downloadPage() throws Exception {
    final Response response = Jsoup.connect("http://www.example.net").execute();
    final Document doc = response.parse();
    final File f = new File("filename.html");
    FileUtils.writeStringToFile(f, doc.outerHtml(), StandardCharsets.UTF_8);
}
Don't forget to catch exceptions. Add the Apache commons-io library as a dependency (or download it) for a quick and easy way to save files in UTF-8.
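If pulling in commons-io is not an option, the standard library can write the string as UTF-8 as well. A minimal sketch: the class and method names are made up for illustration, and the sample string stands in for whatever doc.outerHtml() returns.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class SaveHtml {

    // Writes the HTML to the target file, encoded as UTF-8.
    static void save(String html, Path target) {
        try {
            Files.write(target, html.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Saves to a temporary file and reads it back, to show nothing is lost.
    static String roundTrip(String html) {
        try {
            Path tmp = Files.createTempFile("page", ".html");
            try {
                save(html, tmp);
                return new String(Files.readAllBytes(tmp), StandardCharsets.UTF_8);
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Stand-in for doc.outerHtml().
        String html = "<html><body><p>caf\u00e9</p></body></html>";
        System.out.println(roundTrip(html).equals(html));
    }
}
```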
The ignored elements are most likely a result of jsoup normalizing the document when it parses it.
To get the server's exact output without any form of normalization, use this:
Connection.Response html = Jsoup.connect("PUT_URL_HERE").execute();
System.out.println(html.body());
I am getting this output when trying to use Jsoup to extract text from Wikipedia:
I don't have enough rep to post pictures as I am new to this site, but it's basically like this:
[]{k[]q[]f[]d[]d etc..
Here is part of my code:
public static void scrapeTopic(String url) {
    String html = getUrl("http://www.wikipedia.org/" + url);
    Document doc = Jsoup.parse(html);
    String contentText = doc.select("*").first().text();
    System.out.println(contentText);
}
It appears to get all the information but in the wrong format!
I appreciate any help given
Thanks in advance
Here are some suggestions. To fetch an ordinary web page that does not need HTTP header fields such as cookies or a user agent to be set, just call:
Document doc = Jsoup.connect("givenURL").get();
This method reads the web page using a GET request. The * selector matches every element, so doc.select("*") returns all the elements of the document. Hence doc.select("*").first() returns the #root element, which is the document itself. Try printing it to see:
System.out.println(doc.select("*").first().tagName()); // #root
System.out.println(doc.select("*").first()); // prints the whole document
System.out.println(doc); // also prints the whole document, so the select above is pointless
System.out.println(doc.select("*").first() == doc); // checks whether they are the same object; prints true
I am assuming that you are just playing around to learn this API. Selectors are very powerful, but a good place to start is the general document manipulation methods, e.g. doc.getElementsByTag().
However, on my local machine I was able to fetch the Document and parse it using your getUrl() function.
I want to be able to grab N lines (HTML text content that start on new lines) on a specific URL e.g. www.sitename.com and store them as strings in an array.
something like
public void grabLines() {
    // create instance of class from imported library
    // pass sitename into it
    // from the instance, call a method for grabbing the lines on the site and pass in "N" as a parameter
    // the method returns an array/list of N Strings that I can access later
}
Is there a native Java library I can import to do this? Would it let me do what I want easily?
Thanks
Are you trying to make a screen scraper? You will be pulling the HTML source, as opposed to just the text you see in the browser. Also, if the website is dynamic, you won't be able to pull everything that you can see. If the raw HTML is enough, you can try something like this. I used it to build a Bloomberg screen scraper and then parsed out the random HTML tags.
try {
    URL bbg = new URL("http://www.bloomberg.com/markets/economic-calendar/");
    BufferedReader r = new BufferedReader(new InputStreamReader(bbg.openStream()));
    String temp; // holds the current line
    while ((temp = r.readLine()) != null) {
        System.out.println(temp);
    }
    r.close();
} catch (Exception e) {
    e.printStackTrace();
}
Apache HttpClient is a higher-level abstraction over the URL/Reader technique shown above, but it works in a similar way.
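To get back to the "first N lines as a list" part of the question, the URL/Reader technique can be wrapped in a small helper. This is only a sketch using the standard library. The class name LineGrabber and the method firstLines are made up, and "lines" here means lines of the raw HTML source, not of the rendered page.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class LineGrabber {

    // Reads at most n lines from the source and returns them in order.
    static List<String> firstLines(Reader source, int n) {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while (lines.size() < n && (line = reader.readLine()) != null) {
                lines.add(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }

    public static void main(String[] args) {
        // For a real site, pass something like
        // new InputStreamReader(new URL("http://www.sitename.com").openStream())
        // instead of the StringReader used here.
        Reader source = new StringReader("<html>\n<head>\n<title>Demo</title>\n</head>\n</html>");
        System.out.println(firstLines(source, 3));
    }
}
```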
As an example, suppose I want my program to:
Visit Stack Overflow every day
Find the top question in some tag for that day
Format it and then send it to my email address
I don't know how to do it. I know PHP better, and I have some understanding of Java, J2EE and Spring MVC, but not of Java network programming.
Any guidelines on how I should go about it?
I'd start by looking at the Stack Exchange API.
One thing you can do is read the contents of the URL into a string buffer and then use jsoup (a library for parsing HTML) to pull out the content you want. Here is a small sample that does exactly that: it reads the whole page into a string and then selects elements by their class attribute (in this case question-hyperlink).
package com.tps.examples;
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;
public class URLGetcontent {
    public static void main(String args[]) {
        try {
            URL url = new URL("http://stackoverflow.com/questions");
            // A plain GET sends no request body, so setDoOutput(true) is not needed.
            URLConnection conn = url.openConnection();
            // Read the whole response into one string.
            StringBuilder str = new StringBuilder();
            try (BufferedReader rd = new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
                String line;
                while ((line = rd.readLine()) != null) {
                    // Keep the line breaks so words on adjacent lines do not run together.
                    str.append(line).append('\n');
                }
            }
            Document doc = Jsoup.parse(str.toString());
            // Select the elements that carry the class "question-hyperlink".
            Elements content = doc.getElementsByClass("question-hyperlink");
            for (int i = 0; i < content.size(); i++) {
                System.out.println("Question: " + (i + 1) + ". " + content.get(i).text());
                System.out.println("");
            }
            System.out.println("*********************************");
        } catch (Exception e) {
            // Report failures instead of silently swallowing them.
            e.printStackTrace();
        }
    }
}
Once the data is extracted, you can use the JavaMail API to send the content by email.
As you're wanting to retrieve data from a website (i.e. over HTTP), you probably want to look into using one of many HTTP clients already written in Java or PHP.
Apache HTTP Client is a good Java client used by many people. If you're invoking a RESTful interface, Jersey has a nice client library.
On the PHP side of things, someone already mentioned file_get_contents... but you could also look into the cURL library
As far as emailing goes, there's a JavaMail API (I'll admit I'm not familiar with it), or depending on your email server you might jump through other hoops (for example Exchange can send email through their SOAP interfaces.)
With file_get_contents() in PHP you can also fetch files via HTTP:
$stackoverflow = file_get_contents("http://stackoverflow.com");
Then you have to parse this. For many sites there are special APIs which you can request via JSON or XML.
If you know shell scripting (that's the way I do it for many sites; it works great with a cron job :)), then you can use sed, wget, w3m, grep and mail to do it...
Stack Overflow and the other Stack Exchange sites provide a simple API (see StackApps). Please check it out.
I would like to convert some HTML characters back to text using Java Standard Library. I was wondering whether any library would achieve my purpose?
/**
 * @param args the command line arguments
 */
public static void main(String[] args) {
    // "Happy & Sad" in HTML form.
    String s = "Happy &amp; Sad";
    System.out.println(s);
    try {
        // Change to "Happy & Sad". DOESN'T WORK!
        s = java.net.URLDecoder.decode(s, "UTF-8");
        System.out.println(s);
    } catch (UnsupportedEncodingException ex) {
        // Cannot happen: UTF-8 is always supported.
    }
}
I think the StringEscapeUtils.unescapeHtml3() and unescapeHtml4() methods are what you are looking for. They were introduced in Apache Commons Lang 3 and are now maintained in Apache Commons Text. See https://commons.apache.org/proper/commons-text/javadocs/api-release/org/apache/commons/text/StringEscapeUtils.html.
Just add the jsoup jar to your application's lib folder and then use this code.
import org.jsoup.Jsoup;
public class Encoder {
    public static void main(String args[]) {
        // The entities are decoded while the text is extracted.
        String s = Jsoup.parse("&lt;Fran&ccedil;ais&gt;").text();
        System.out.print(s);
    }
}
Link to download jsoup: http://jsoup.org/download
java.net.URLDecoder deals only with the application/x-www-form-urlencoded MIME format (e.g. "%20" represents space), not with HTML character entities. I don't think there's anything on the Java platform for that. You could write your own utility class to do the conversion, like this one.
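As a rough idea of what such a hand-written utility could look like, here is a minimal sketch. It is not the class linked above: the name BasicEntities is made up, and it only understands the five predefined entities (amp, lt, gt, quot, apos) plus numeric references, whereas real libraries cover the full HTML entity table. It uses Map.of, so it assumes Java 9 or later.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BasicEntities {
    // Matches &name; as well as decimal (&#231;) and hexadecimal (&#xE7;) references.
    private static final Pattern ENTITY =
            Pattern.compile("&(#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z]+);");

    // Deliberately tiny table; unknown names are left untouched.
    private static final Map<String, String> NAMED =
            Map.of("amp", "&", "lt", "<", "gt", ">", "quot", "\"", "apos", "'");

    static String unescape(String text) {
        Matcher m = ENTITY.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String body = m.group(1);
            String replacement;
            try {
                if (body.startsWith("#x") || body.startsWith("#X")) {
                    int codePoint = Integer.parseInt(body.substring(2), 16);
                    replacement = new String(Character.toChars(codePoint));
                } else if (body.startsWith("#")) {
                    int codePoint = Integer.parseInt(body.substring(1));
                    replacement = new String(Character.toChars(codePoint));
                } else {
                    replacement = NAMED.getOrDefault(body, m.group());
                }
            } catch (IllegalArgumentException e) {
                // Out-of-range or invalid code point: keep the original text.
                replacement = m.group();
            }
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(unescape("Happy &amp; Sad"));
    }
}
```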
The URL decoder should only be used for decoding strings from the urls generated by html forms which are in the "application/x-www-form-urlencoded" mime type. This does not support html characters.
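A quick way to see the difference is to feed URLDecoder both kinds of input. The sketch below is only a demonstration (the class name is made up) and uses the decode(String, Charset) overload, which assumes Java 10 or later.

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

class UrlDecoderDemo {

    static String decode(String s) {
        return URLDecoder.decode(s, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Form-encoded input: '+' becomes a space and %26 becomes '&'.
        System.out.println(decode("Happy+%26+Sad"));   // Happy & Sad
        // HTML entity input: nothing here is form-encoded, so it comes back unchanged.
        System.out.println(decode("Happy &amp; Sad")); // Happy &amp; Sad
    }
}
```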
After a search I found a Translate class within the HTML Parser library.
You can use the class org.apache.commons.lang.StringEscapeUtils:
String s = StringEscapeUtils.unescapeHtml("Happy &amp; Sad");
It works.
I'm not aware of any way to do it using the standard library. But I do know and use this class that deals with html entities.
"HTMLEntities is an Open Source Java class that contains a collection of static methods (htmlentities, unhtmlentities, ...) to convert special and extended characters into HTML entitities and vice versa."
http://www.tecnick.com/public/code/cp_dpage.php?aiocp_dp=htmlentities
Or you can use unescapeHtml4:
String miCadena = "GU&Iacute;A TELEF&Oacute;NICA";
System.out.println(StringEscapeUtils.unescapeHtml4(miCadena));
This code prints the line:
GUÍA TELEFÓNICA
As @jem suggested, it is possible to use jsoup.
With jsoup 1.8.3 it is possible to use the method Parser.unescapeEntities, which retains the original HTML.
import org.jsoup.parser.Parser;
...
String html = Parser.unescapeEntities(original_html, false);
This method does not seem to be present in some earlier releases.