How to get all links (<a href>) in a URL - Java

I get a URL and I need to find all the links in it and just show them, that's all.
I wrote this in Java:
PrintWriter writer = new PrintWriter("Web.txt");
URL oracle = new URL("http://edition.cnn.com/");
BufferedReader in = new BufferedReader(
        new InputStreamReader(oracle.openStream()));
String inputLine;
while ((inputLine = in.readLine()) != null) {
    writer.println(inputLine);
    System.out.println(inputLine);
}
in.close();
writer.close();
Now my question is: how can I find only the links in this huge file?
I thought about searching for <a href="..."> but that isn't always reliable.
Thanks

Jsoup is the way to go! It's a Java API with which you can parse HTML documents (either local or external ones) and navigate their DOM structure using a jQuery-like syntax.
Your code to get all the links should look something like this:
Document doc = Jsoup.connect("http://edition.cnn.com").get(); // Parse this URL's HTML
Elements elements = doc.select("a"); // Search for all <a> elements
Then, to list every link and save it to your file:
for (Element element : elements) {
    writer.println(element.attr("href")); // Get the "href" attribute from the element
}
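For reference, here is a minimal self-contained sketch of the whole flow, assuming Jsoup is on the classpath; the class name LinkLister and the output file links.txt are just placeholders:

import java.io.PrintWriter;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class LinkLister {
    public static void main(String[] args) throws Exception {
        // Fetch and parse the page
        Document doc = Jsoup.connect("http://edition.cnn.com").get();

        // Select every <a> element that has an href attribute
        Elements links = doc.select("a[href]");

        // Print each link and write it to a file
        try (PrintWriter writer = new PrintWriter("links.txt")) {
            for (Element link : links) {
                String href = link.absUrl("href"); // resolve relative URLs against the page URL
                System.out.println(href);
                writer.println(href);
            }
        }
    }
}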

Related

Is it possible to download only the HEAD tag of a page?

I've done some research about this and had no conclusive answer.
This question lays some of the path through it: How can I download only part of a page?
But then again, I don't want to download only a random part of a page, but one of the first tags, the head.
Is it possible somehow to query the page, stream its content into a buffer, and stop downloading (discarding the rest) as soon as you find the closing tag </head>?
EDIT:
Adding stuff to the page itself is not possible, since I want to pull the header of other websites in my app.
Imagine http://stackoverflow.com is entered as the parameter. The whole page is around 240 kB, but if I stop downloading the moment I hit </head>, it's only 5 kB, saving around 97% of the bandwidth for this page.
Maybe this is enough for you: open a URLConnection and read from its input stream until you reach </head>.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

public class Test {
    public static void main(String[] args) throws Exception {
        URL oracle = new URL("http://www.oracle.com/");
        BufferedReader in = new BufferedReader(
                new InputStreamReader(oracle.openStream()));
        String inputLine;
        while ((inputLine = in.readLine()) != null) {
            if (inputLine.contains("</head>")) {
                // print this line up to and including </head>, then stop reading
                System.out.println(inputLine.substring(0, inputLine.indexOf("</head>") + "</head>".length()));
                break;
            }
            System.out.println(inputLine);
        }
        in.close();
    }
}
This is essentially the pattern from the Reading Directly from a URL lesson in the official Java Tutorials.

How to exclude tags from an XML String in Java

I am writing a piece of code to send data to and receive data from a web page. I am doing this in Java. But when I receive the XML data, it is still between tags, like this:
<?xml version='1.0'?>
<document>
<title> TEST </title>
</document>
How can I get the data without the tags in Java?
This is what I tried. The function writes the data and then should get the response and use it in a System.out.println.
public static String User_Select(String username, String password) {
    String mysql_type = "1"; // 1 = Select
    try {
        String urlParameters = "mysql_type=" + mysql_type + "&username=" + username + "&password=" + password;
        URL url = new URL("http://localhost:8080/HTTP_Connection/index.php");
        URLConnection conn = url.openConnection();
        conn.setDoOutput(true);
        OutputStreamWriter writer = new OutputStreamWriter(conn.getOutputStream());
        writer.write(urlParameters);
        writer.flush();
        String line;
        BufferedReader reader = new BufferedReader(new InputStreamReader(conn.getInputStream()));
        while ((line = reader.readLine()) != null) {
            System.out.println(line);
            //System.out.println("Het werkt!!");
        }
        writer.close();
        reader.close();
        return line;
    } catch (IOException iox) {
        iox.printStackTrace();
        return null;
    }
}
Thanks in advance
I would suggest simply using a regex to read the XML and pull out the tag content that you are after.
That simplifies what you need to do and avoids pulling in additional (unnecessary) libraries.
There are also lots of Stack Overflow questions on this topic: Regex for xml parsing and In RegEx, I want to find everything between two XML tags, just to mention two of them.
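As a rough illustration of that approach (a sketch only; regexes are fragile against general XML, and the pattern below assumes a single, non-nested <title> element):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TitleRegexDemo {
    public static void main(String[] args) {
        String xml = "<?xml version='1.0'?><document> <title> TEST </title> </document>";

        // Capture whatever sits between <title> and </title>
        Pattern pattern = Pattern.compile("<title>(.*?)</title>", Pattern.DOTALL);
        Matcher matcher = pattern.matcher(xml);

        if (matcher.find()) {
            System.out.println(matcher.group(1).trim()); // prints "TEST"
        }
    }
}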
Use a DOM parser in Java (for example, the standard javax.xml.parsers.DocumentBuilder).
Check the Java docs for further details.
Use an XML parser to parse your XML. Here is a link to Oracle's tutorial:
Oracle Java XML Parser Tutorial
Simply pass the InputStream from the URLConnection:
Document doc = DocumentBuilderFactory.newInstance()
        .newDocumentBuilder()
        .parse(conn.getInputStream());
From there you could use XPath to query the contents of the document or simply walk the document model.
Take a look at Java API for XML Processing (JAXP) for more details
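For instance, here is a minimal sketch of the XPath route, using the XML from the question as a literal string so the example is runnable on its own; when reading from the URLConnection you would pass conn.getInputStream() to parse() instead:

import java.io.StringReader;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;

import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class XPathDemo {
    public static void main(String[] args) throws Exception {
        String xml = "<?xml version='1.0'?><document> <title> TEST </title> </document>";

        // Parse the XML into a DOM document
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));

        // Query the document with an XPath expression
        XPath xpath = XPathFactory.newInstance().newXPath();
        String title = xpath.evaluate("/document/title", doc).trim();

        System.out.println(title); // prints "TEST"
    }
}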
You have to use an XML parser; in your case a good choice is Jsoup, which can scrape data from the web and parse both XML and HTML. It will load the data, parse it, and give you what you want. Here is an example of how it works:
1. XML from a URL
String xml = Jsoup.connect("http://localhost:8080/HTTP_Connection/index.php")
.get().toString();
Document doc = Jsoup.parse(xml, "", Parser.xmlParser());
String myTitle = doc.select("title").first().text(); // myTitle now contains "TEST"
Edit:
To send GET or POST parameters with your request, use this code:
String xml = Jsoup.connect("http://localhost:8080/HTTP_Connection/index.php")
        .data("param1Name", "param1Value")
        .data("param2Name", "param2Value").get().toString();
You can use get() to invoke the HTTP GET method or post() to invoke the HTTP POST method.
2. XML from a String
You can use JSoup to parse XML data in a String :
String xmlData="<?xml version='1.0'?><document> <title> TEST </title> </document>" ;
Document doc = Jsoup.parse(xmlData, "", Parser.xmlParser());
String myTitle = doc.select("title").first().text(); // myTitle now contains "TEST"

Extracting subcontents of a link

How can I extract the "read more" part of this news item? When I use Jsoup it only gives the content before the "read more" part. I want to extract the entire content of the news item.
Scanner sc=new Scanner(System.in);
String code=sc.nextLine();
doc = Jsoup.connect("http://ieee-link.org/category/events/" +code+ "/").get();
Elements els = doc.select("div.entry");
System.out.println(els.text());
The "read more" text seems to be a link. You can extract the target of that link and fetch that URL with Jsoup as well:
Elements els = doc.select("div.entry");
// inside each entry we can find something like:
// <a class="more-link" href="http://ieee-link.org/renesas/">Read More »</a>
for (Element el : els) {
    Element anchor = el.select("a.more-link").first();
    if (anchor != null) {
        Document moreDoc = Jsoup.connect(anchor.attr("href")).get();
        System.out.println(moreDoc);
    } else {
        System.out.println(el);
    }
}
Note that this code was written from memory, so some details may differ slightly.

Jsoup href request and output to a file

I made this sample to request a URL query from a Java application. The connection and the query are right, but I'm missing how to get all the href elements from the result and write them to an output file. Does anyone have any guidelines?
Thanks in advance
Document engineSearch=Jsoup.connect("http://ask.com/web?q="+URLEncoder.encode(query))
.userAgent("Mozilla/5.0 (X11; U; Linux x86_64; en-GB; rv:1.8.1.6) Gecko/20070723 Iceweasel/2.0.0.6 (Debian-2.0.0.6-0etch1)")
.get();
String title = engineSearch.title();
Elements links = engineSearch.select("a[href]").first().getAllElements();
String queryEncoding=engineSearch.outputSettings().charset().name();
file = new File(folder.getPath()+"\\"+date+" "+Tag+".html");
OutputStreamWriter writer = new OutputStreamWriter(new FileOutputStream(file),queryEncoding);
writer.write(engineSearch.html());
writer.close();
Here is an example of exactly what you want. I don't have a dev environment handy, but something along these lines should work:
http://jsoup.org/cookbook/extracting-data/attributes-text-html
Document doc = Jsoup.parse(html);
Elements links = doc.select("a");
for (Element link : links) {
    String text = link.text();           // "An example link"
    String linkHref = link.attr("href"); // "http://example.com/", which you can save to a file
}
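Putting that together with the file output from the question, a rough self-contained sketch could look like this; the query value, user agent string, and output file name are placeholders to adjust to your setup:

import java.io.PrintWriter;
import java.net.URLEncoder;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class SearchLinkDump {
    public static void main(String[] args) throws Exception {
        String query = "jsoup";

        // Run the search and parse the result page
        Document engineSearch = Jsoup.connect("http://ask.com/web?q=" + URLEncoder.encode(query, "UTF-8"))
                .userAgent("Mozilla/5.0")
                .get();

        // Write every href on the page to the output file, one per line
        try (PrintWriter writer = new PrintWriter("links.txt", "UTF-8")) {
            for (Element link : engineSearch.select("a[href]")) {
                writer.println(link.absUrl("href"));
            }
        }
    }
}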

Java - Parsing HTML - get text

I am trying to get text from a website; when you change the language, the HTML URL has an "/en" inside, but the page that has the information I want doesn't.
http://www.wippro.at/module/gallery/index.php?limitstart=0&picno=0&gallery_key=92
HTML tags (the text contains the description of the photo):
<div id="redx_gallery_pic_title"> text text </div>
The problem is that the website is in German and I want the text in English, but my script only gets the German version.
Any ideas how I can do it?
java code:
...
URL oracle = new URL(x);
BufferedReader in = new BufferedReader(new InputStreamReader(oracle.openStream()));
String inputLine=null;
StringBuffer theText = new StringBuffer();
while ((inputLine = in.readLine()) != null)
theText.append(inputLine+"\n");
String html = theText.toString();
in.close();
String[] name = StringUtils.substringsBetween(html, "redx_gallery_pic_title\">", "</div>");
That site is internationalized with German as default. You need to tell the server what language you're accepting by specifying the desired ISO 639-1 language code in the Accept-Language request header.
URLConnection connection = new URL(url).openConnection();
connection.setRequestProperty("Accept-Language", "en");
InputStream input = connection.getInputStream();
// ...
Unrelated to the concrete problem, may I suggest having a look at Jsoup as an HTML parser? It's much more convenient with its jQuery-like CSS selector syntax and therefore much less verbose than your attempt so far:
String url = "http://www.wippro.at/module/gallery/index.php?limitstart=0&picno=0&gallery_key=92";
Document document = Jsoup.connect(url).header("Accept-Language", "en").get();
String title = document.select("#redx_gallery_pic_title").text();
System.out.println(title); // Beech, glazing V3
That's all.
