Java - Parsing HTML - get text

I am trying to get text from a website. When you change the language, the URL gets an "/en" segment, but the page that has the information I want does not.
http://www.wippro.at/module/gallery/index.php?limitstart=0&picno=0&gallery_key=92
HTML tags (the text contains the description of the photo):
<div id="redx_gallery_pic_title"> text text </div>
The problem is that the website is in German and I want the text in English, but my script gets only the German version.
Any ideas how I can do it?
Java code:
...
URL oracle = new URL(x);
BufferedReader in = new BufferedReader(new InputStreamReader(oracle.openStream()));
String inputLine=null;
StringBuffer theText = new StringBuffer();
while ((inputLine = in.readLine()) != null)
theText.append(inputLine+"\n");
String html = theText.toString();
in.close();
String[] name = StringUtils.substringsBetween(html, "redx_gallery_pic_title\">", "</div>");

That site is internationalized with German as default. You need to tell the server what language you're accepting by specifying the desired ISO 639-1 language code in the Accept-Language request header.
URLConnection connection = new URL(url).openConnection();
connection.setRequestProperty("Accept-Language", "en");
InputStream input = connection.getInputStream();
// ...
Unrelated to the concrete problem, may I suggest you have a look at Jsoup as an HTML parser? Its jQuery-like CSS selector syntax makes it much more convenient, and the result is much less bloated than your attempt so far:
String url = "http://www.wippro.at/module/gallery/index.php?limitstart=0&picno=0&gallery_key=92";
Document document = Jsoup.connect(url).header("Accept-Language", "en").get();
String title = document.select("#redx_gallery_pic_title").text();
System.out.println(title); // Beech, glazing V3
That's all.
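If adding Jsoup is not an option, the header-based approach above can also be sketched with the JDK alone. The class and method names below (LocalizedFetch, openWithLanguage, readAll) are made up for illustration, and the sketch assumes that the page is served as UTF-8, which is an assumption about the site rather than something confirmed here.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;

class LocalizedFetch {

    // Prepares a connection that asks the server for the given language.
    // No network traffic happens until the response is actually read.
    static URLConnection openWithLanguage(String url, String languageTag) {
        try {
            URLConnection connection = new URL(url).openConnection();
            connection.setRequestProperty("Accept-Language", languageTag);
            return connection;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads the whole response body as text. UTF-8 is an assumption here.
    static String readAll(URLConnection connection) throws IOException {
        StringBuilder text = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(connection.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                text.append(line).append('\n');
            }
        }
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        URLConnection connection = openWithLanguage(
                "http://www.wippro.at/module/gallery/index.php?limitstart=0&picno=0&gallery_key=92",
                "en");
        System.out.println(readAll(connection));
    }
}
```

A value such as "en, de;q=0.5" would ask for English first with German as a fallback; whether the server honors it is up to the site.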

Related

How to get absolute url using java or jsoup

I have a text box and a submit button in my JSP page. When I submit a URL through the text box, I fetch the response of that URL using URLConnection:
String strUrl = request.getParameter("url");
URL url = new URL(strUrl);
HttpURLConnection connection = (HttpURLConnection) url.openConnection();
byte[] encodedBytes = Base64.encodeBase64("root:pass".getBytes());
String encoding = new String(encodedBytes);
connection.setRequestMethod("GET");
connection.connect();
InputStream content = (InputStream) connection.getInputStream();
BufferedReader in = new BufferedReader(new InputStreamReader(content));
try {
fWriter = new FileWriter(new File("f:\\new.html"));
writer = new BufferedWriter(fWriter);
while ((line = in.readLine()) != null) {
String s = line.toString();
writer.write(s);
}
writer.close();
} catch (Exception e) {
e.printStackTrace();
}
In the resulting HTML page, all the CSS, JS, and images were missing, because their relative URLs now point to the local machine.
For example, a script is referenced as follows in my generated HTML page:
<script src="/ajax/libs/jquery/2.1.1/jquery.min.js"></script>
But the actual src is:
<script src="https://www.url.com/ajax/libs/jquery/2.1.1/jquery.min.js"></script>
I know that there are many solutions that replace every src and href with the URL host; I found many answers about that.
I used the following solution:
if (s.contains(("href="))) {
if (s.contains("\"../") || s.contains("\"/")) {
s = s.replace("\"../", "\"http://" + url.getHost() + "/");
s = s.replace("\"/", "\"http://" + url.getHost() + "/");
writer.write(s);
out.println(s);
}
}
Now I am able to get the links, but this does not work for all websites: it only helps for sites whose src and href values are relative to the host root.
In some websites, links are defined as href="frmArticles.aspx". In this case it is not enough to prefix the href with the host, because the resulting URL still differs from the real one. For example, the following URL has href links that resolve to a different path than host plus href:
http://www.nakkheeran.in/Users/frmMagazine.aspx?M=2
தை தை தை
If I add the host to this href, it becomes:
தை தை தை
And this is not available, because the actual URL is:
தை தை தை

There are essentially two ways to get the absolute URL:
Using Jsoup's abs:href attribute getter. It works like this:
Element a = myDoc.select("a").first(); // selects the first link on the page; replace with whatever selector you need to get your link (a element)
String url = a.attr("abs:href"); //gets the absolute url of the link (href attribute)
Note that you need to provide Jsoup with the URL of the HTML document so that it can resolve the link correctly. This happens automatically if you use Jsoup.connect(myHtmlUrl).get(). If you are parsing HTML from a String or from a file, use the Jsoup.parse() overload that accepts a base URL.
The other way is with Java's built-in URL class, which is probably what you should use in your case. You can use it like this:
String absoluteUrl = new URL(new URL("http://example.com/example.html"), "script.js").toString();
Which would give:
http://example.com/script.js
To clarify a bit, the first parameter (in this case example.com) is the url your HTML document is from, and the second parameter ("script.js") is the URL found in your HTML.
In your case, you could use it like:
String absoluteUrl = new URL(new URL("https://www.url.com/"), "/ajax/libs/jquery/2.1.1/jquery.min.js").toString();
Which will give:
https://www.url.com/ajax/libs/jquery/2.1.1/jquery.min.js

The URL class has a constructor URL(URL context, String url) that does what you tried doing with regexps.
Edit: In your case the context URL is the source URL of the parsed resource. Let's say you parse something from URL context = new URL("http://example.com/path/to/some.html#where?is+carmen+sandiego"). Then you just take the reference of any link and create a URL ref = new URL(context, src).
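As a small sketch of the URL(URL context, String spec) constructor described above, the helper below resolves several kinds of links against the page they came from. The class name AbsoluteLinks, the method resolve, and the "../style.css" link are hypothetical and only serve as illustration; the page URL and "frmArticles.aspx" are taken from the question.

```java
import java.net.MalformedURLException;
import java.net.URL;

class AbsoluteLinks {

    // Resolves a link found in a page against the URL the page was loaded from.
    static String resolve(String pageUrl, String link) {
        try {
            return new URL(new URL(pageUrl), link).toString();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("Cannot resolve " + link + " against " + pageUrl, e);
        }
    }

    public static void main(String[] args) {
        String page = "http://www.nakkheeran.in/Users/frmMagazine.aspx?M=2";
        // Relative to the page's directory (/Users/).
        System.out.println(resolve(page, "frmArticles.aspx"));
        // Relative to the host root.
        System.out.println(resolve(page, "/ajax/libs/jquery/2.1.1/jquery.min.js"));
        // One directory up from the page's directory.
        System.out.println(resolve(page, "../style.css"));
        // Already absolute, so the context is ignored.
        System.out.println(resolve(page, "https://www.url.com/ajax/libs/jquery/2.1.1/jquery.min.js"));
    }
}
```

The point of resolving against the full page URL, rather than prefixing the host, is that links like "frmArticles.aspx" end up under /Users/ where the page itself lives.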

how to exclude tag from XML String in java

I am writing a piece of code to send data to and receive data from a web page. I am doing this in Java. But when I receive the XML data, it is still wrapped in tags like this:
<?xml version='1.0'?>
<document>
<title> TEST </title>
</document>
How can I get the data without the tags in Java?
This is what I tried. The function writes the data and then should get the response and use it in a System.out.println.
public static String User_Select(String username, String password) {
String mysql_type = "1"; // 1 = Select
try {
String urlParameters = "mysql_type=" + mysql_type + "&username=" + username + "&password=" + password;
URL url = new URL("http://localhost:8080/HTTP_Connection/index.php");
URLConnection conn = url.openConnection();
conn.setDoOutput(true);
OutputStreamWriter writer = new OutputStreamWriter(conn.getOutputStream());
writer.write(urlParameters);
writer.flush();
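StringBuilder response = new StringBuilder();
String line;
BufferedReader reader = new BufferedReader(new InputStreamReader(conn.getInputStream()));
while ((line = reader.readLine()) != null) {
System.out.println(line);
response.append(line).append('\n'); // keep each line; 'line' itself is null once the loop ends
//System.out.println("Het werkt!!");
}
writer.close();
reader.close();
return response.toString();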
} catch (IOException iox) {
iox.printStackTrace();
return null;
}
}
Thanks in advance

I would suggest simply using a regular expression to read the XML and get the tag content that you are after.
That simplifies what you need to do and avoids pulling in additional (unnecessary) libraries.
There are also many Stack Overflow questions on this topic, for example "Regex for xml parsing" and "In RegEx, I want to find everything between two XML tags", just to mention two of them.

Use a DOM parser in Java.
See the Java documentation for details.

Use an XML Parser to Parse your XML. Here is a link to Oracle's Tutorial
Oracle Java XML Parser Tutorial

Simply pass the InputStream from URLConnection
Document doc = DocumentBuilderFactory.
newInstance().
newDocumentBuilder().
parse(conn.getInputStream());
From there you could use XPath to query the contents of the document or simply walk the document model.
Take a look at Java API for XML Processing (JAXP) for more details
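To make the DOM approach above concrete, here is a minimal sketch that parses the XML from the question out of a String and returns the text inside the title element. The class name XmlTitle and the method extractTitle are illustrative only; with a live connection, conn.getInputStream() would be passed to parse() instead of the InputSource.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XmlTitle {

    // Parses the XML text and returns the trimmed text of the first <title>
    // element, or null when the document has no such element.
    static String extractTitle(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList titles = doc.getElementsByTagName("title");
            return titles.getLength() == 0 ? null : titles.item(0).getTextContent().trim();
        } catch (Exception e) {
            // Parser configuration, SAX, and I/O errors are wrapped for brevity.
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<?xml version='1.0'?>\n<document>\n<title> TEST </title>\n</document>";
        System.out.println(extractTitle(xml));
    }
}
```

The trim() call removes the spaces around TEST that are part of the element's text in the sample document.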

You should use an XML parser. In your case a good choice is Jsoup, which can fetch data from the web and parse both XML and HTML. Here is an example of how it works:
1. XML from a URL
String xml = Jsoup.connect("http://localhost:8080/HTTP_Connection/index.php")
.get().toString();
Document doc = Jsoup.parse(xml, "", Parser.xmlParser());
String myTitle = doc.select("title").first().text(); // myTitle now contains TEST
Edit:
To send GET or POST parameters with your request, use this code:
String xml = Jsoup.connect("http://localhost:8080/HTTP_Connection/index.php")
.data("param1Name", "param1Value")
.data("param2Name", "param2Value").get().toString();
You can use get() to invoke the HTTP GET method or post() to invoke the HTTP POST method.
2. XML from a String
You can use Jsoup to parse XML data in a String:
String xmlData = "<?xml version='1.0'?><document> <title> TEST </title> </document>";
Document doc = Jsoup.parse(xmlData, "", Parser.xmlParser());
String myTitle = doc.select("title").first().text(); // myTitle now contains TEST

How to get all links (<a href>) in URL

I have a URL and I need to find all the links on that page and just show them, that's all.
I wrote this in Java:
PrintWriter writer=new PrintWriter("Web.txt");
URL oracle = new URL("http://edition.cnn.com/");
BufferedReader in = new BufferedReader(
new InputStreamReader(oracle.openStream()));
String inputLine;
while ((inputLine = in.readLine()) != null)
{
writer.println(inputLine);
System.out.println(inputLine);
}
in.close();
Now my question is: how can I find only the links in this huge file?
I thought about searching for <a href="...">, but that is not always right.
Thanks

Jsoup is the way to go! It's a Java API with which you can parse HTML documents (either local or external ones) and navigate their DOM structure using a jQuery-like syntax.
Your code to get all the links should look something like this:
Document doc = Jsoup.connect("http://edition.cnn.com").get(); // Parse this URL's HTML
Elements elements = doc.select("a"); // Search for all <a> elements
Then, to list every link and save it to your file:
for (Element element : elements) {
writer.println(element.attr("href")); // Get the "href" attribute from the element
}
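If an extra library is not possible, a rough JDK-only approximation can be sketched with a regular expression. This is a swapped-in technique, not what the answer above recommends: it is not an HTML parser and will miss or misread unusual markup (unquoted attributes, links inside comments or scripts), so Jsoup remains the more robust option. The names RoughLinkFinder and findLinks are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RoughLinkFinder {

    // Matches <a ... href="..."> or <a ... href='...'>, case-insensitively.
    // Deliberately simple: it only handles quoted attribute values.
    private static final Pattern HREF = Pattern.compile(
            "<a\\b[^>]*?\\bhref\\s*=\\s*([\"'])(.*?)\\1",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the raw href values in document order; they may be relative.
    static List<String> findLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher matcher = HREF.matcher(html);
        while (matcher.find()) {
            links.add(matcher.group(2));
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<p><a href=\"/world\">World</a> "
                + "<A class='x' HREF='http://example.com/a'>A</A> "
                + "<a name=\"top\">no link</a></p>";
        for (String link : findLinks(html)) {
            System.out.println(link);
        }
    }
}
```

The returned values are whatever the page author wrote, so relative links such as "/world" still need to be resolved against the page URL before they can be fetched.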

How to scrape a website, http get vs http post?

I am new to programming and know very little about HTTP, but I wrote some code to scrape a website in Java. My code can scrape pages reached through HTTP GET calls (based on typing in a URL), but I do not know how to go about scraping data that is returned by an HTTP POST call.
After a brief overview of HTTP, I believe I will need to simulate the browser, but I do not know how to do this in Java. The website I have been trying to scrape is the one whose URL appears in the code below.
I need to scrape the source code of all the result pages, but the URL does not change when the next button is clicked. I have used Firefox's Firebug to look at what is going on when the button is clicked, but I do not know what I should be looking for.
My code to scrape the data as of now is:
public class Scraper {
private static String month = "11";
private static String day = "4";
private static String url = "http://cpdocket.cp.cuyahogacounty.us/SheriffSearch/results.aspx?q=searchType%3dSaleDate%26searchString%3d"+month+"%2f"+day+"%2f2013%26foreclosureType%3d%27NONT%27%2c+%27PAR%27%2c+%27COMM%27%2c+%27TXLN%27"; // the input website to be scraped
public static String sourcetext; //The source code that has been scraped
//scrapeWebsite runs the method to scrape the input URL and returns a string to be parsed.
public static void scrapeWebsite() throws IOException {
URL urlconnect = new URL(url); //creates the url from the variable
URLConnection connection = urlconnect.openConnection();
BufferedReader in = new BufferedReader(new InputStreamReader(
connection.getInputStream(), "UTF-8"));
String inputLine;
StringBuilder sourcecode = new StringBuilder(); // creates a stringbuilder which contains the sourcecode
while ((inputLine = in.readLine()) != null)
sourcecode.append(inputLine);
in.close();
sourcetext = sourcecode.toString();
}
}
What would be the best way to go about scraping all the pages for each "post" call?

Take a look at the Jersey client interface.
View the source of each page, determine the URL pattern for the next and previous pages, then loop through them.
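When the next button triggers a POST rather than a new URL, the request body has to be reproduced by the scraper. The sketch below uses only the JDK (Java 11 or later assumed) to send a form-encoded POST. The class FormPost and the field names in the usage note are placeholders: the real field names and values have to be copied from the request that the browser's network tools show, and pages of this kind often expect hidden form fields from the previous response to be sent back as well.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.StringJoiner;

class FormPost {

    // Encodes name/value pairs as application/x-www-form-urlencoded text.
    static String encodeForm(Map<String, String> fields) {
        StringJoiner body = new StringJoiner("&");
        for (Map.Entry<String, String> field : fields.entrySet()) {
            body.add(URLEncoder.encode(field.getKey(), StandardCharsets.UTF_8)
                    + "=" + URLEncoder.encode(field.getValue(), StandardCharsets.UTF_8));
        }
        return body.toString();
    }

    // Sends the fields in a POST request and returns the response body as text.
    // The response is assumed to be UTF-8, as in the question's code.
    static String post(String url, Map<String, String> fields) throws IOException {
        byte[] body = encodeForm(fields).getBytes(StandardCharsets.UTF_8);
        HttpURLConnection connection = (HttpURLConnection) new URL(url).openConnection();
        connection.setRequestMethod("POST");
        connection.setDoOutput(true);
        connection.setRequestProperty("Content-Type", "application/x-www-form-urlencoded");
        try (OutputStream out = connection.getOutputStream()) {
            out.write(body);
        }
        try (InputStream in = connection.getInputStream()) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        }
    }
}
```

A call would look like FormPost.post(url, fields), where fields is a LinkedHashMap filled with the hypothetical names observed in the browser, so that the order of the pairs matches the original request.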

How do I get parsed HTML special characters using JSOUP

I am using JSoup to get the H1 tag value from a webpage, this tag contains the following HTML.
Hexyl β-D-glucopyranoside
When I use the .text() method I get the following (note the ?). I assume this is because it cannot work out the HTML for the "β" character. How do I get this value as it is rendered on a web page?
Hexyl ?-D-glucopyranoside
Do I need to do some kind of conversion after I have picked up the text I want?
Here is my code.
String check = "<title>Hexyl β-D-glucopyranoside ≥98.0% (TLC) | ≥ ≥</title>";
Document doc3 = Jsoup.parse(check);
doc3.outputSettings().escapeMode(Entities.EscapeMode.base); // default
doc3.outputSettings().charset("UTF-8");
System.out.println("UTF-8: " + doc3.html());
//doc3.outputSettings().charset("ISO 8859-1");
doc3.outputSettings().charset("ASCII");
System.out.println("ASCII: " + doc3.html());
-----Output at console-----
UTF-8: <html>
<head>
<title>Hexyl ?-D-glucopyranoside ?98.0% (TLC) | ? ? </title>
</head>
<body></body>
</html>
ASCII: <html>
<head>
<title>Hexyl β-D-glucopyranoside ≥98.0% (TLC) | ≥ ≥</title>
</head>
<body></body>
</html>

It looks like the IDE you're using is using the wrong character encoding for its console.
It has nothing to do with your code: I've run it and it's fine (it outputs the special characters). If you're using Eclipse, go to the run configuration settings for that particular project, click the 'Common' tab, then choose UTF-8.
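The effect described above can be reproduced without Jsoup or an IDE. This is only a sketch with made-up names (ConsoleEncodingDemo, roundTrip), assuming Java 10 or later: it encodes the title with a charset and decodes it again, which is roughly what happens between the program and a console that uses that charset.

```java
import java.io.PrintStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ConsoleEncodingDemo {

    // Encodes the text with the given charset and decodes it again.
    // Characters the charset cannot represent are replaced during encoding.
    static String roundTrip(String text, Charset charset) {
        return new String(text.getBytes(charset), charset);
    }

    public static void main(String[] args) {
        // \u03B2 is the Greek letter beta, written as an escape so the source
        // file encoding does not matter.
        String title = "Hexyl \u03B2-D-glucopyranoside";

        // ASCII has no beta, so it is replaced by '?', as in the question.
        System.out.println(roundTrip(title, StandardCharsets.US_ASCII));

        // UTF-8 can represent it, so the text survives unchanged.
        System.out.println(roundTrip(title, StandardCharsets.UTF_8).equals(title));

        // Printing through an explicit UTF-8 stream avoids relying on the
        // platform default; the console must still be set to display UTF-8.
        PrintStream utf8Out = new PrintStream(System.out, true, StandardCharsets.UTF_8);
        utf8Out.println(title);
    }
}
```

So the String inside the program is intact; the question mark appears only when the output is encoded for a console whose charset lacks the character.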

It's too late to set charset after parsing a document. I had the same problem once, tried to do it your way and failed miserably.
This worked for me:
String url = "url to html page";
InputStream is = new URL(url).openStream();
org.jsoup.nodes.Document doc = org.jsoup.Jsoup.parse(is, "ISO-8859-2", url);
If I have the HTML text only as a String, I convert it to an InputStream first (http://www.kodejava.org/examples/265.html):
InputStream is = new ByteArrayInputStream(text.getBytes("UTF-8"));
then read it with correct charset:
BufferedReader r = new BufferedReader(new InputStreamReader(is, "UTF-8"), 4*1024);
StringBuilder total = new StringBuilder();
String line = "";
while ((line = r.readLine()) != null) {
total.append(line);
}
r.close();
is.close();
String html = total.toString();
...and parse:
doc = org.jsoup.Jsoup.parse(html);
The important thing is to get an InputStream object somehow; from there, there are ways to apply your desired charset to it. Maybe it can be done in a more straightforward way, but it works.
