Getting Content From a Website with Java - java

I was curious how to pull information from a website with Java, and I found JSoup ( HTML Parser) Was a popular suggestion. I have found quite a few examples online but nothing really explaining how to use it. Say I wanted to get the temperature for Toronto using this url, http://weather.gc.ca/city/pages/on-143_metric_e.html , how would I go about doing so?
I guess you have to specify tags, but in the html for that site, the information I want is in a tag, but so is more inforation so when when I run my code
String url = "http://weather.gc.ca/city/pages/on-4_metric_e.html";
Document document = Jsoup.connect(url).get();
String temp = document.select("dd").text();
System.out.println("Title: " + temp);
I get a lot more information than I want.

For the temperature try this:
String url = "http://weather.gc.ca/city/pages/on-4_metric_e.html";
Document document = Jsoup.connect(url).get();
String temp = document.select("p").get(1).text();
System.out.println("Temperature: " + temp);
For formulating the CSS queries refer to the syntax sheet: http://jsoup.org/cookbook/extracting-data/selector-syntax
Also try: http://try.jsoup.org/, great for testing!

Let say I want to read the contents of mywebsite.com. This is how i'll do it:
import java.net.*;
import java.io.*;
class MyClass {
public static void main(String[] arg) throws Exception {
URL u = new URL("http://www.mywebsite.com");
InputStream ins = u.openStream();
InputStreamReader isr = new InputStreamReader(ins);
BufferedReader br = new BufferedReader(isr);
System.out.println(br.readLine());
}
}
Hopefully this should get you started..

Related

how to exclude tag from XML String in java

I am making a piece of code to send and recieve data from and to an webpage. I am doeing this in java. But when i 'receive' the xml data it is still between tags like this
<?xml version='1.0'?>
<document>
<title> TEST </title>
</document>
How can i get the data without the tags in Java.
This is what i tried, The function writes the data and then should get the reponse and use that in a System.out.println.
public static String User_Select(String username, String password) {
String mysql_type = "1"; // 1 = Select
try {
String urlParameters = "mysql_type=" + mysql_type + "&username=" + username + "&password=" + password;
URL url = new URL("http://localhost:8080/HTTP_Connection/index.php");
URLConnection conn = url.openConnection();
conn.setDoOutput(true);
OutputStreamWriter writer = new OutputStreamWriter(conn.getOutputStream());
writer.write(urlParameters);
writer.flush();
String line;
BufferedReader reader = new BufferedReader(new InputStreamReader(conn.getInputStream()));
while ((line = reader.readLine()) != null) {
System.out.println(line);
//System.out.println("Het werkt!!");
}
writer.close();
reader.close();
return line;
} catch (IOException iox) {
iox.printStackTrace();
return null;
}
}
Thanks in advance
I would suggest simply using RegEx to read the XML, and get the tag content that you are after.
That simplifies what you need to do, and limits the inclusion of additional (unnecessary) libraries.
And then there are lots of StackOverflows on this topic: Regex for xml parsing and In RegEx, I want to find everything between two XML tags just to mention 2 of them.
use DOMParser in java.
Check further in java docs
Use an XML Parser to Parse your XML. Here is a link to Oracle's Tutorial
Oracle Java XML Parser Tutorial
Simply pass the InputStream from URLConnection
Document doc = DocumentBuilderFactory.
newInstance().
newDocumentBuilder().
parse(conn.getInputStream());
From there you could use xPath to query the contents of the document or simply walk the document model.
Take a look at Java API for XML Processing (JAXP) for more details
You have to use an XML Parser , in your case the perfect choice is JSoup which scrap data from the web and parse XML & HTML format ,it will load data and parse it and give you what you want , here is a an example of how it works :
1. XML From an URL
String xml = Jsoup.connect("http://localhost:8080/HTTP_Connection/index.php")
.get().toString();
Document doc = Jsoup.parse(xml, "", Parser.xmlParser());
String myTitle=doc.select("title").first();// myTitle contain now TEST
Edit :
to send GET or POST parameters with you request use this code:
String xml = Jsoup.connect("http://localhost:8080/HTTP_Connection/index.php")
.data("param1Name";"param1Value")
.data("param2Name","param2Value").get().toString();
you can use get() to invoke HTTP GET method or post() to invoke HTTP POST method.
2. XML From String
You can use JSoup to parse XML data in a String :
String xmlData="<?xml version='1.0'?><document> <title> TEST </title> </document>" ;
Document doc = Jsoup.parse(xmlData, "", Parser.xmlParser());
String myTitle=doc.select("title").first();// myTitle contain now TEST

Saving the first Image from URL

Here's my problem. I have a txt file called "sites.txt" . In these i type random internet sites. My Goal is to save the first image of each site. I tried to filter the Server response by the img tag and it actually works for some sites, but for some not.
The sites where it works the img src starts with http:// ... the sites it doesnt work start with anything else.
I also tried to add the http:// to the img src images which didnt have it, but i still get the same error:
Exception in thread "main" java.net.MalformedURLException: no protocol:
at java.net.URL.<init>(Unknown Source)
My current code is:
public static void main(String[] args) throws IOException{
try {
File file = new File ("sites.txt");
Scanner scanner = new Scanner (file);
String url;
int counter = 0;
while(scanner.hasNext())
{
url=scanner.nextLine();
URL page = new URL(url);
URLConnection yc = page.openConnection();
BufferedReader in = new BufferedReader(new InputStreamReader(yc.getInputStream()));
String inputLine = in.readLine();
while (!inputLine.toLowerCase().contains("img"))inputLine = in.readLine();
in.close();
String[] parts = inputLine.split(" ");
int i=0;
while(!parts[i].contains("src"))i++;
String destinationFile = "image"+(counter++)+".jpg";
saveImage(parts[i].substring(5,parts[i].length()-1), destinationFile);
String tmp=scanner.nextLine();
System.out.println(url);
}
scanner.close();
}
catch (FileNotFoundException e)
{
System.out.println ("File not found!");
System.exit (0);
}
}
public static void saveImage(String imageUrl, String destinationFile) throws IOException {
// TODO Auto-generated method stub
URL url = new URL(imageUrl);
String fileName = url.getFile();
String destName = fileName.substring(fileName.lastIndexOf("/"));
System.out.println(destName);
InputStream is = url.openStream();
OutputStream os = new FileOutputStream(destinationFile);
byte[] b = new byte[2048];
int length;
while ((length = is.read(b)) != -1) {
os.write(b, 0, length);
}
is.close();
os.close();
}
I also got a tip to use the apache jakarte http client libraries but i got absolutely no idea how i could use those i would appreciate any help.
A URL (a type of URI) requires a scheme in order to be valid. In this case, http.
When you type www.google.com into your browser, the browser is inferring you mean http:// and automatically prepends it for you. Java doesn't do this, hence your exception.
Make sure you always have http://. You can easily fix this using regex:
String fixedUrl = stringUrl.replaceAll("^((?!http://).{7})", "http://$1");
or
if(!stringUrl.startsWith("http://"))
stringUrl = "http://" + stringUrl;
An alternative solution
Simply try with ImageIO that contains static convenience methods for locating ImageReaders and ImageWriters, and performing simple encoding and decoding.
Sample code:
// read a image from the URL
// I used the URL that is your profile pic on StackOverflow
BufferedImage image = ImageIO
.read(new URL(
"https://www.gravatar.com/avatar/3935223a285ab35a1b21f31248f1e721?s=32&d=identicon&r=PG&f=1"));
// save the image
ImageIO.write(image, "jpg", new File("resources/avatar.jpg"));
When you're scraping the site's HTML for image elements and their src attributes, you'll run into several different representations of URLs.
Some examples are:
resource = https://google.com/images/srpr/logo9w.png
resource = google.com/images/srpr/logo9w.png
resource = //google.com/images/srpr/logo9w.png
resource = /images/srpr/logo9w.png
resource = images/srpr/logo9w.png
For the second through fifth ones, you'll need to build the rest of the URL.
The second one may be more difficult to differentiate from the fourth and fifth ones, but I'm sure there are workarounds. The URL Standard leads me to believe you won't see it as often, because I don't think it's technically valid.
The third case is pretty simple. If the resource variable starts with //, then you just need to prepend the protocol/scheme to it. You can do this with the site object you have:
url = site.getProtocol() + ":" + resource
For the fourth and fifth cases, you'll need to prepend the resource with the entire site's URL.
Here's a sample application that uses jsoup to parse the HTML, and a simple utility method to build the resource URL. You're interested in the buildResourceUrl method. Also, it doesn't handle the second case; I'll leave that to you.
import java.io.*;
import java.net.*;
import org.jsoup.*;
import org.jsoup.nodes.*;
import org.jsoup.select.*;
public class SiteScraper {
public static void main(String[] args) throws IOException {
URL site = new URL("https://google.com/");
Document doc = Jsoup.connect(site.toString()).get();
Elements images = doc.select("img");
for (Element image : images) {
String src = image.attr("src");
System.out.println(buildResourceUrl(site, src));
}
}
static URL buildResourceUrl(URL site, String resource)
throws MalformedURLException {
if (!resource.matches("^(http|https|ftp)://.*$")) {
if (resource.startsWith("//")) {
return new URL(site.getProtocol() + ":" + resource);
} else {
return new URL(site.getProtocol() + "://" + site.getHost() + "/"
+ resource.replaceAll("^/", ""));
}
}
return new URL(resource);
}
}
This obviously won't cover everything, but it's a start. You may run into problems when the URL you're trying to access is in a subdirectory of the root of the site (i.e., http://some.place/under/the/rainbow.html). You may even encounter base64 encoded data URI's in the src attribute... It really depends on the individual case and how far you're willing to go.

Need to convert from XML to HTML in java

Please give me suggestion as i need to convert from XML to HTML in java without using XSLT. As i was searching in the web but everywhere it was showing can convert from xml to html with use of only xslt/xsl?
Please guyz give me some suggestions?
You can parse xml data using jQuery.parseXML and use data of it.
$.get('/url_of_the_xml_resource')
.done(function(data){
// parse the xml
data = $.parseXML(data);
//
// do anything you want with the parsed data
})
.fail(function(){
alert('something went wrong!');
})
;
This will save root.xml's content as root.xml.html.
public static void main(String[] args) throws Exception {
String xmlFile = "root.xml";
Scanner scanner = new Scanner(new File(xmlFile)).useDelimiter("\\Z");
String xmlContent = scanner.next();
xmlContent = xmlContent.trim().replaceAll("<","<").replaceAll(">",">").replaceAll("\n", "<br />").replaceAll(" ", " ");
PrintWriter out = new PrintWriter(xmlFile+".html");
out.println("<html><body>" + xmlContent + "</body></html>");
scanner.close();
out.close();
}
Note: This will retain the XML's original indentation and line breaking.
You can use StringEscapeUtils
and use the method escapeHtml.
String yourXmlAsHtmlString = StringEscapeUtils.escapeHtml(yourXmlAsString);

Using boilerpipe to extract non-english articles

I am trying to use boilerpipe java library, to extract news articles from a set of websites.
It works great for texts in english, but for text with special characters, for example, words with accent marks (história), this special characters are not extracted correctly. I think it is an encoding problem.
In the boilerpipe faq, it says "If you extract non-English text you might need to change some parameters" and then refers to a paper. I found no solution in this paper.
My question is, are there any params when using boilerpipe where i can specify the encoding? Is there any way to go around and get the text correctly?
How i'm using the library:
(first attempt based on the URL):
URL url = new URL(link);
String article = ArticleExtractor.INSTANCE.getText(url);
(second on the HTLM source code)
String article = ArticleExtractor.INSTANCE.getText(html_page_as_string);
You don't have to modify inner Boilerpipe classes.
Just pass InputSource object to the ArticleExtractor.INSTANCE.getText() method and force encoding on that object. For example:
URL url = new URL("http://some-page-with-utf8-encodeing.tld");
InputSource is = new InputSource();
is.setEncoding("UTF-8");
is.setByteStream(url.openStream());
String text = ArticleExtractor.INSTANCE.getText(is);
Regards!
Well, from what I see, when you use it like that, the library will auto-chose what encoding to use. From the HTMLFetcher source:
public static HTMLDocument fetch(final URL url) throws IOException {
final URLConnection conn = url.openConnection();
final String ct = conn.getContentType();
Charset cs = Charset.forName("Cp1252");
if (ct != null) {
Matcher m = PAT_CHARSET.matcher(ct);
if(m.find()) {
final String charset = m.group(1);
try {
cs = Charset.forName(charset);
} catch (UnsupportedCharsetException e) {
// keep default
}
}
}
Try debugging their code a bit, starting with ArticleExtractor.getText(URL), and see if you can override the encoding
Ok, got a solution.
As Andrei said, i had to change the class HTMLFecther, which is in the package de.l3s.boilerpipe.sax
What i did was to convert all the text that was fetched, to UTF-8.
At the end of the fetch function, i had to add two lines, and change the last one:
final byte[] data = bos.toByteArray(); //stays the same
byte[] utf8 = new String(data, cs.displayName()).getBytes("UTF-8"); //new one (convertion)
cs = Charset.forName("UTF-8"); //set the charset to UFT-8
return new HTMLDocument(utf8, cs); // edited line
Boilerpipe's ArticleExtractor uses some algorithms that have been specifically tailored to English - measuring number of words in average phrases, etc. In any language that is more or less verbose than English (ie: every other language) these algorithms will be less accurate.
Additionally, the library uses some English phrases to try and find the end of the article (comments, post a comment, have your say, etc) which will clearly not work in other languages.
This is not to say that the library will outright fail - just be aware that some modification is likely needed for good results in non-English languages.
Java:
import java.net.URL;
import org.xml.sax.InputSource;
import de.l3s.boilerpipe.extractors.ArticleExtractor;
public class Boilerpipe {
public static void main(String[] args) {
try{
URL url = new URL("http://www.azeri.ru/az/traditions/kuraj_pehlevanov/");
InputSource is = new InputSource();
is.setEncoding("UTF-8");
is.setByteStream(url.openStream());
String text = ArticleExtractor.INSTANCE.getText(is);
System.out.println(text);
}catch(Exception e){
e.printStackTrace();
}
}
}
Eclipse:
Run > Run Configurations > Common Tab. Set Encoding to Other(UTF-8), then click Run.
I had the some problem; the cnr solution works great. Just change UTF-8 encoding to ISO-8859-1. Thank's
URL url = new URL("http://some-page-with-utf8-encodeing.tld");
InputSource is = new InputSource();
is.setEncoding("ISO-8859-1");
is.setByteStream(url.openStream());
String text = ArticleExtractor.INSTANCE.getText(is);

Extracting anchor tag from html using Java

I have several anchor tags in a text,
Input: <a href="http://stackoverflow.com" >Take me to StackOverflow</a>
Output:
http://stackoverflow.com
How can I find all those input strings and convert it to the output string in java, without using a 3rd party API ???
There are classes in the core API that you can use to get all href attributes from anchor tags (if present!):
import java.io.*;
import java.util.*;
import javax.swing.text.*;
import javax.swing.text.html.*;
import javax.swing.text.html.parser.*;
public class HtmlParseDemo {
public static void main(String [] args) throws Exception {
String html =
"<a href=\"http://stackoverflow.com\" >Take me to StackOverflow</a> " +
"<!-- " +
"<a href=\"http://ignoreme.com\" >...</a> " +
"--> " +
"<a href=\"http://www.google.com\" >Take me to Google</a> " +
"<a>NOOOoooo!</a> ";
Reader reader = new StringReader(html);
HTMLEditorKit.Parser parser = new ParserDelegator();
final List<String> links = new ArrayList<String>();
parser.parse(reader, new HTMLEditorKit.ParserCallback(){
public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
if(t == HTML.Tag.A) {
Object link = a.getAttribute(HTML.Attribute.HREF);
if(link != null) {
links.add(String.valueOf(link));
}
}
}
}, true);
reader.close();
System.out.println(links);
}
}
which will print:
[http://stackoverflow.com, http://www.google.com]
public static void main(String[] args) {
String test = "qazwsxTake me to StackOverflowfdgfdhgfd"
+ "Take me to StackOverflow2dcgdf";
String regex = "<a href=(\"[^\"]*\")[^<]*</a>";
Pattern p = Pattern.compile(regex);
Matcher m = p.matcher(test);
System.out.println(m.replaceAll("$1"));
}
NOTE: All Andrzej Doyle's points are valid and if you have more then simple Y in your input, and you are sure that is parsable HTML, then you are better with HTML parser.
To summarize:
The regex i posted doesn't work if you have <a> in comment. (you can treat it as special case)
It doesn't work if you have other attributes in the <a> tag. (again you can treat it as special case)
there are many other cases that regex wont work, and you can not cover all of them with regex, since HTML is not regular language.
However, if your req is always replace Y with "X" without considering the context, then the code i've posted will work.
You can use JSoup
String html = "<p>An <a href=\"http://stackoverflow.com\" >Take me to StackOverflow</a> link.</p>";
Document doc = Jsoup.parse(html);
Element link = doc.select("a").first();
String linkHref = link.attr("href"); // "http://stackoverflow.com"
Also See
Example
The above example works perfect; if you want to parse an HTML document say instead of concatenated strings, write something like this to compliment the code above.
Existing code above ~ modified to show: HtmlParser.java (HtmlParseDemo.java) above
complementing code with HtmlPage.java below. The content of the HtmlPage.properties file is at the bottom of this page.
The main.url property in the HtmlPage.properties file is:
main.url=http://www.whatever.com/
That way you can just parse the url that your after. :-)
Happy coding :-D
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;
public class HtmlParser
{
public static void main(String[] args) throws Exception
{
String html = HtmlPage.getPage();
Reader reader = new StringReader(html);
HTMLEditorKit.Parser parser = new ParserDelegator();
final List<String> links = new ArrayList<String>();
parser.parse(reader, new HTMLEditorKit.ParserCallback()
{
public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos)
{
if (t == HTML.Tag.A)
{
Object link = a.getAttribute(HTML.Attribute.HREF);
if (link != null)
{
links.add(String.valueOf(link));
}
}
}
}, true);
reader.close();
// create the header
System.out.println("<html>\n<head>\n <title>Link City</title>\n</head>\n<body>");
// spit out the links and create href
for (String l : links)
{
System.out.print(" " + l + "\n");
}
// create footer
System.out.println("</body>\n</html>");
}
}
import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.StringWriter;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.ResourceBundle;
public class HtmlPage
{
public static String getPage()
{
StringWriter sw = new StringWriter();
ResourceBundle bundle = ResourceBundle.getBundle(HtmlPage.class.getName().toString());
try
{
URL url = new URL(bundle.getString("main.url"));
HttpURLConnection connection = (HttpURLConnection) url.openConnection();
connection.setRequestMethod("GET");
connection.setDoOutput(true);
InputStream content = (InputStream) connection.getInputStream();
BufferedReader in = new BufferedReader(new InputStreamReader(content));
String line;
while ((line = in.readLine()) != null)
{
sw.append(line).append("\n");
}
} catch (Exception e)
{
e.printStackTrace();
}
return sw.getBuffer().toString();
}
}
For example, this will output links from http://ebay.com.au/ if viewed in a browser.
This is a subset, as there are a lot of links
Link City
#mainContent
http://realestate.ebay.com.au/
The most robust way (as has been suggested already) is to use regular expressions (java.util.regexp), if you are required to build this without using 3d party libs.
The alternative is to parse the html as XML, either using a SAX parser to capture and handle each instance of an "a" element or as a DOM Document and then searching it using XPATH (see http://download.oracle.com/javase/6/docs/api/javax/xml/parsers/package-summary.html). This is problematic though, since it requires the HTML page to be fully XML compliant in markup, a very dangerous assumption and not an approach I would recommend since most "real" html pages are not XML compliant.
Still, I would recommend also looking at existing frameworks out there built for this purpose (like JSoup, also mentioned above). No need to reinvent the wheel.

Categories

Resources