Jsoup href request and to output on file - java

I wrote this sample to send a search query to a URL from a Java application. The connection and the query work. What I'm missing is how to get all the href elements from the result and write them to an output file. Does anyone have any guidelines?
Thanks in advance
Document engineSearch=Jsoup.connect("http://ask.com/web?q="+URLEncoder.encode(query))
.userAgent("Mozilla/5.0 (X11; U; Linux x86_64; en-GB; rv:1.8.1.6) Gecko/20070723 Iceweasel/2.0.0.6 (Debian-2.0.0.6-0etch1)")
.get();
String title = engineSearch.title();
Elements links = engineSearch.select("a[href]").first().getAllElements();
String queryEncoding=engineSearch.outputSettings().charset().name();
file = new File(folder.getPath()+"\\"+date+" "+Tag+".html");
OutputStreamWriter writer = new OutputStreamWriter(new FileOutputStream(file),queryEncoding);
writer.write(engineSearch.html());
writer.close();

Here is an example of exactly what you want. I don't have a dev environment handy, but something along these lines should work:
http://jsoup.org/cookbook/extracting-data/attributes-text-html
Document doc = Jsoup.parse(html);
Elements links = doc.select("a");
for (Element link : links) {
String text = link.text(); // "An example link"
String linkHref = link.attr("href"); // "http://example.com/", which you can save to file
}
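For the "save to file" step, here is a minimal sketch using only the standard library. It assumes the href values have already been collected into a list; the class name LinkFileSketch, the methods saveLinks and readLinks, and the file name links.txt are hypothetical names chosen for illustration, not part of Jsoup.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

final class LinkFileSketch {
    // Writes one href per line to the target file, replacing any existing content.
    static void saveLinks(List<String> hrefs, Path target) {
        try {
            Files.write(target, hrefs, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads the file back, one href per list entry.
    static List<String> readLinks(Path source) {
        try {
            return Files.readAllLines(source, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical sample data standing in for the values of link.attr("href").
        List<String> hrefs = Arrays.asList("http://example.com/", "http://example.com/about");
        Path target = Paths.get("links.txt");
        saveLinks(hrefs, target);
        System.out.println(readLinks(target));
    }
}
```

In the loop above, each linkHref would be added to the list before calling saveLinks once at the end.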

Related

How to get absolute url using java or jsoup

I am having a textbox and submit button in my jsp page. When submitting this button with some url in textbox, I am getting the response of that url using URLConnection
String strUrl = request.getParameter("url");
URL url = new URL(strUrl);
HttpURLConnection connection = (HttpURLConnection) url.openConnection();
byte[] encodedBytes = Base64.encodeBase64("root:pass".getBytes());
String encoding = new String(encodedBytes);
connection.setRequestMethod("GET");
connection.connect();
InputStream content = (InputStream) connection.getInputStream();
BufferedReader in = new BufferedReader(new InputStreamReader(content));
try {
fWriter = new FileWriter(new File("f:\\new.html"));
writer = new BufferedWriter(fWriter);
while ((line = in.readLine()) != null) {
String s = line.toString();
writer.write(s);
}
writer.close();
} catch (Exception e) {
e.printStackTrace();
}
In the resulting HTML page, all the CSS, JS, and images are missing, because their paths are relative and now point to the local machine.
For example, a script is referenced as follows in my generated HTML page:
<script src="/ajax/libs/jquery/2.1.1/jquery.min.js"></script>
But the actual src is:
<script src="https://www.url.com/ajax/libs/jquery/2.1.1/jquery.min.js"></script>
I know there are many solutions that replace every src and href with the URL host; I found many answers about that.
I used the following solution:
if (s.contains(("href="))) {
if (s.contains("\"../") || s.contains("\"/")) {
s = s.replace("\"../", "\"http://" + url.getHost() + "/");
s = s.replace("\"/", "\"http://" + url.getHost() + "/");
writer.write(s);
out.println(s);
}
}
Now I am able to get the links, but this does not work for every website. It only helps for sites whose src and href values need just the host as a prefix.
On some websites, links are defined as href="frmArticles.aspx". In this case it is not enough to prefix the href with the host, because the resulting URL still differs from the real one. For example, the following URL has href links that differ from its own URL:
http://www.nakkheeran.in/Users/frmMagazine.aspx?M=2
தை தை தை
If I add the host to this href, it becomes:
தை தை தை
That page does not exist, because the actual URL is:
தை தை தை
There are essentially two ways to get the absolute URL:
Using Jsoup's abs:href attribute getter. It works like this:
Element a = myDoc.select("a").first(); // selects the first link on the page; replace with whatever selector you need to get your link (a element)
String url = a.attr("abs:href"); // gets the absolute URL of the link (href attribute)
Note that Jsoup needs the URL of the HTML document so that it can resolve links correctly. This happens automatically if you use Jsoup.connect(myHtmlUrl).get(). If you are parsing HTML from a String or from a file, you have to provide it yourself: use the Jsoup.parse() overload that accepts a base URL.
The other way is with Java's built-in URL class, which is probably what you should use in your case. You can use it like this:
String absoluteUrl = new URL(new URL("http://example.com/example.html"), "script.js").toString();
System.out.println(absoluteUrl);
Which would print:
http://example.com/script.js
To clarify a bit, the first parameter (in this case http://example.com/example.html) is the URL your HTML document came from, and the second parameter ("script.js") is the URL found in your HTML.
In your case, you could use it like:
String absoluteUrl = new URL(new URL("https://www.url.com/"), "/ajax/libs/jquery/2.1.1/jquery.min.js").toString();
System.out.println(absoluteUrl);
Which will print:
https://www.url.com/ajax/libs/jquery/2.1.1/jquery.min.js
The URL class has a constructor URL(URL context, String url) that does what you tried doing with regexps.
Edit: In your case the context URL is the source URL of the parsed resource. Let's say you parse something from URL context = new URL("http://example.com/path/to/some.html#where?is+carmen+sandiego"). Then you just take the reference of any link and create a URL ref = new URL(context, src).
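As a rough sketch of the URL(URL context, String spec) approach applied to both kinds of links from the question, the helper below resolves a link against the URL of the page it was found on. The class name AbsoluteUrlSketch and the method resolve are hypothetical names for illustration; only the java.net.URL constructor itself comes from the answers above.

```java
import java.net.MalformedURLException;
import java.net.URL;

final class AbsoluteUrlSketch {
    // Resolves a link found in a page against the URL that page was fetched from.
    static String resolve(String pageUrl, String link) {
        try {
            return new URL(new URL(pageUrl), link).toString();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("Cannot resolve " + link + " against " + pageUrl, e);
        }
    }

    public static void main(String[] args) {
        // A root-relative link is resolved against the host.
        System.out.println(resolve("https://www.url.com/",
                "/ajax/libs/jquery/2.1.1/jquery.min.js"));
        // A link without a leading slash is resolved against the page's directory.
        System.out.println(resolve("http://www.nakkheeran.in/Users/frmMagazine.aspx?M=2",
                "frmArticles.aspx"));
    }
}
```

The second call shows why prefixing only the host is not enough: the link is resolved relative to the /Users/ directory of the page, so the result keeps that path segment.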

Getting Content From a Website with Java

I was curious how to pull information from a website with Java, and I found that JSoup (an HTML parser) was a popular suggestion. I have found quite a few examples online, but nothing that really explains how to use it. Say I wanted to get the temperature for Toronto using this URL, http://weather.gc.ca/city/pages/on-143_metric_e.html , how would I go about doing so?
I guess you have to specify tags, but in the HTML for that site the information I want is in a tag, and so is more information, so when I run my code
String url = "http://weather.gc.ca/city/pages/on-4_metric_e.html";
Document document = Jsoup.connect(url).get();
String temp = document.select("dd").text();
System.out.println("Title: " + temp);
I get a lot more information than I want.
For the temperature try this:
String url = "http://weather.gc.ca/city/pages/on-4_metric_e.html";
Document document = Jsoup.connect(url).get();
String temp = document.select("p").get(1).text();
System.out.println("Temperature: " + temp);
For formulating the CSS queries refer to the syntax sheet: http://jsoup.org/cookbook/extracting-data/selector-syntax
Also try: http://try.jsoup.org/, great for testing!
Let's say I want to read the contents of mywebsite.com. This is how I'd do it:
import java.net.*;
import java.io.*;
class MyClass {
public static void main(String[] arg) throws Exception {
URL u = new URL("http://www.mywebsite.com");
InputStream ins = u.openStream();
InputStreamReader isr = new InputStreamReader(ins);
BufferedReader br = new BufferedReader(isr);
System.out.println(br.readLine());
}
}
Hopefully this gets you started.
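Note that the snippet above prints only the first line of the response. A minimal sketch for reading the whole body could look like the following; the class PageReaderSketch and the method readAll are hypothetical names, and a StringReader stands in for the network stream so the example runs offline.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

final class PageReaderSketch {
    // Reads every line from the source and closes it when done.
    static String readAll(Reader source) {
        StringBuilder body = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                body.append(line).append('\n');
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return body.toString();
    }

    public static void main(String[] args) {
        // For a real page, pass new InputStreamReader(u.openStream()) instead of the StringReader.
        System.out.print(readAll(new StringReader("<html>\n<body>hi</body>\n</html>")));
    }
}
```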

Jsoup: null result in absUrl (abs:)

I tried to build an image link downloader with jsoup. I wrote the part that downloads the HTML, and when I finished the parsing part I noticed that image links sometimes appear without the host part. I found the absUrl solution, but for some reason it did not work (it gave me null). Then I tried uri.resolve(), but it returned the link unchanged. Now I do not know how to solve it. Here is the part of my code that is responsible for parsing and writing the URLs to a string:
public static String finalcode(String textin) throws Exception {
String text = source(textin);
Document doc = Jsoup.parse(text);
Elements images = doc.getElementsByTag("img");
String Simages = images.toString();
int Limages = countLines(Simages);
StringBuilder src = new StringBuilder();
while (Limages > 0) {
Limages--;
Element image = images.get(Limages);
String href = image.attr("src");
src.append(href);
src.append("\n");
}
String result = src.toString();
return result;
}
It looks like you are parsing HTML from a String, not from a URL. Because of that, jsoup can't know which URL this HTML came from, so it can't create absolute paths.
To set this URL for the Document, parse it with the Jsoup.parse(String html, String baseUri) overload, like this:
String url = "http://server/pages/document.html";
String text = "<img src = '../images/image_name1.jpg'/><img src = '../images/image_name2.jpg'/>";
Document doc = Jsoup.parse(text, url);
Elements images = doc.getElementsByTag("img");
for (Element image : images){
System.out.println(image.attr("src")+" -> "+image.attr("abs:src"));
}
Output:
../images/image_name1.jpg -> http://server/images/image_name1.jpg
../images/image_name2.jpg -> http://server/images/image_name2.jpg
Another option is to let Jsoup fetch and parse the page directly by supplying the URL instead of a String with HTML:
Document doc = Jsoup.connect("http://example.com").get();
This way Document will know from which URL it came, so it will be able to create absolute paths.

how to exclude tag from XML String in java

I am writing a piece of code in Java to send data to and receive data from a web page. But when I receive the XML data, it is still wrapped in tags, like this:
<?xml version='1.0'?>
<document>
<title> TEST </title>
</document>
How can I get the data without the tags in Java?
This is what I tried. The function writes the data and should then read the response and print it with System.out.println.
public static String User_Select(String username, String password) {
String mysql_type = "1"; // 1 = Select
try {
String urlParameters = "mysql_type=" + mysql_type + "&username=" + username + "&password=" + password;
URL url = new URL("http://localhost:8080/HTTP_Connection/index.php");
URLConnection conn = url.openConnection();
conn.setDoOutput(true);
OutputStreamWriter writer = new OutputStreamWriter(conn.getOutputStream());
writer.write(urlParameters);
writer.flush();
String line;
BufferedReader reader = new BufferedReader(new InputStreamReader(conn.getInputStream()));
while ((line = reader.readLine()) != null) {
System.out.println(line);
//System.out.println("Het werkt!!");
}
writer.close();
reader.close();
return line;
} catch (IOException iox) {
iox.printStackTrace();
return null;
}
}
Thanks in advance
I would suggest simply using a regex to read the XML and get the tag content that you are after.
That simplifies what you need to do and avoids pulling in additional (unnecessary) libraries.
There are also lots of Stack Overflow questions on this topic: "Regex for xml parsing" and "In RegEx, I want to find everything between two XML tags", just to mention two of them.
Use DOMParser in Java.
See the Java docs for details.
Use an XML parser to parse your XML. Here is a link to Oracle's tutorial:
Oracle Java XML Parser Tutorial
Simply pass the InputStream from the URLConnection:
Document doc = DocumentBuilderFactory.
newInstance().
newDocumentBuilder().
parse(conn.getInputStream());
From there you could use XPath to query the contents of the document or simply walk the document model.
Take a look at the Java API for XML Processing (JAXP) for more details.
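To make the "walk the document model" step concrete, here is a small sketch that parses the XML from the question with the JDK's DocumentBuilderFactory and returns the text of the first matching tag. The class XmlTitleSketch and the method firstTagText are hypothetical names; the XML is supplied as a String so the example runs without the server from the question.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

final class XmlTitleSketch {
    // Returns the trimmed text of the first element with the given tag name, or null if there is none.
    static String firstTagText(String xml, String tag) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList matches = doc.getElementsByTagName(tag);
            return matches.getLength() == 0 ? null : matches.item(0).getTextContent().trim();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<?xml version='1.0'?>\n<document>\n<title> TEST </title>\n</document>";
        System.out.println(firstTagText(xml, "title")); // prints "TEST"
    }
}
```

With a live connection, conn.getInputStream() would be passed to parse() in place of the ByteArrayInputStream.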
You have to use an XML parser. A good choice in your case is JSoup, which fetches data from the web and parses both XML and HTML. It loads the data, parses it, and gives you what you want. Here is an example of how it works:
1. XML from a URL
String xml = Jsoup.connect("http://localhost:8080/HTTP_Connection/index.php")
.get().toString();
Document doc = Jsoup.parse(xml, "", Parser.xmlParser());
String myTitle = doc.select("title").first().text(); // myTitle now contains "TEST"
Edit:
To send GET or POST parameters with your request, use this code:
String xml = Jsoup.connect("http://localhost:8080/HTTP_Connection/index.php")
.data("param1Name", "param1Value")
.data("param2Name", "param2Value").get().toString();
You can use get() to invoke the HTTP GET method or post() to invoke the HTTP POST method.
2. XML from a String
You can use JSoup to parse XML data in a String:
String xmlData = "<?xml version='1.0'?><document> <title> TEST </title> </document>";
Document doc = Jsoup.parse(xmlData, "", Parser.xmlParser());
String myTitle = doc.select("title").first().text(); // myTitle now contains "TEST"

How to get all links (<a href>) in URL

I have a URL and I need to find all the links on that page and just show them, that's all.
I wrote this in Java:
PrintWriter writer=new PrintWriter("Web.txt");
URL oracle = new URL("http://edition.cnn.com/");
BufferedReader in = new BufferedReader(
new InputStreamReader(oracle.openStream()));
String inputLine;
while ((inputLine = in.readLine()) != null)
{
writer.println(inputLine);
System.out.println(inputLine);
}
in.close();
Now my question is: how can I find only the links in this huge file?
I thought about searching for <a href="..."> but it's not always right.
Thanks
JSOUP is the way to go! It's a Java API with which you can parse HTML documents (either local or external ones) and navigate their DOM structure using a jQuery-like syntax.
Your code to get all the links should look something like this:
Document doc = Jsoup.connect("http://edition.cnn.com").get(); // Parse this URL's HTML
Elements elements = doc.select("a"); // Search for all <a> elements
Then, to list every link and save it to your file:
for (Element element : elements) {
writer.println(element.attr("href")); // Get the "href" attribute from the element
}
