How do I get parsed HTML special characters using JSOUP - java

I am using JSoup to get the H1 tag value from a webpage, this tag contains the following HTML.
Hexyl β-D-glucopyranoside
When I use the .text() method I get the following (note the ?). I assume this is because it cannot work out the HTML for the "β" character. How do I get this value as it is rendered on the webpage?
Hexyl ?-D-glucopyranoside
Do I need to do some kind of conversion after I have picked up the text I want?
Here is my code.
String check = "<title>Hexyl β-D-glucopyranoside ≥98.0% (TLC) | ≥ ≥</title>";
Document doc3 = Jsoup.parse(check);
doc3.outputSettings().escapeMode(Entities.EscapeMode.base); // default
doc3.outputSettings().charset("UTF-8");
System.out.println("UTF-8: " + doc3.html());
//doc3.outputSettings().charset("ISO 8859-1");
doc3.outputSettings().charset("ASCII");
System.out.println("ASCII: " + doc3.html());
-----Output at console-----
UTF-8: <html>
<head>
<title>Hexyl ?-D-glucopyranoside ?98.0% (TLC) | ? ? </title>
</head>
<body></body>
</html>
ASCII: <html>
<head>
<title>Hexyl β-D-glucopyranoside ≥98.0% (TLC) | ≥ ≥</title>
</head>
<body></body>
</html>

It looks like the IDE you're using is using the wrong character encoding for its console.
It has nothing to do with your code: I've run it and it's fine (it outputs the special characters). If you're using Eclipse, go to the run configuration settings for that particular project, click the 'Common' tab, then choose UTF-8.
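As a minimal sketch of where the ? comes from: Java substitutes ? whenever a character has no mapping in the target charset, which is what happens when console output goes through a non-UTF-8 encoding. The class and method names below are made up for illustration, and US-ASCII is only a stand-in for whatever encoding the console actually uses.

```java
import java.io.PrintStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of the original question.
class ConsoleEncodingDemo {

    // Encodes the text in the given charset and decodes it again,
    // mimicking what a console limited to that charset would display.
    static String roundTrip(String text, Charset charset) {
        return new String(text.getBytes(charset), charset);
    }

    public static void main(String[] args) throws Exception {
        // \u03B2 is the Greek beta, \u2265 is the greater-than-or-equal sign.
        String title = "Hexyl \u03B2-D-glucopyranoside \u226598.0%";

        // US-ASCII has no mapping for either character, so both become '?'.
        System.out.println(roundTrip(title, StandardCharsets.US_ASCII));

        // Printing through an explicit UTF-8 stream avoids the platform default;
        // the terminal itself must still be set to UTF-8 to render the result.
        PrintStream utf8Out = new PrintStream(System.out, true, "UTF-8");
        utf8Out.println(title);
    }
}
```

If the second line still shows ? in the IDE, the console setting rather than the Java code is the likely culprit.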

It's too late to set charset after parsing a document. I had the same problem once, tried to do it your way and failed miserably.
This worked for me:
String url = "url to html page";
InputStream is = new URL(url).openStream();
org.jsoup.nodes.Document doc = org.jsoup.Jsoup.parse(is , "ISO-8859-2", url);
If I have the HTML only as a string, I convert it to an InputStream first (http://www.kodejava.org/examples/265.html)
InputStream is = new ByteArrayInputStream(text.getBytes("UTF-8"));
then read it with correct charset:
BufferedReader r = new BufferedReader(new InputStreamReader(is, "UTF-8"), 4*1024);
StringBuilder total = new StringBuilder();
String line = "";
while ((line = r.readLine()) != null) {
total.append(line);
}
r.close();
is.close();
String html = total.toString();
...and parse:
doc = org.jsoup.Jsoup.parse(html);
The important thing is to somehow get an InputStream object; from there, there are ways to use your desired charset with it. Maybe it can be done in a more straightforward way, but it works.
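The effect of picking the charset at decoding time can be sketched with the standard library alone. This is an illustration under assumptions, not the answerer's code: the class name, the helper readAll, and the sample markup are invented, and an in-memory stream stands in for the network stream.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class for illustration only.
class StreamDecodingDemo {

    // Reads the whole stream, decoding bytes with the charset the caller names.
    static String readAll(InputStream in, Charset charset) {
        StringBuilder text = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, charset)) {
            char[] buffer = new char[4096];
            int n;
            while ((n = reader.read(buffer)) != -1) {
                text.append(buffer, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // Sample markup encoded as UTF-8; \u03B2 is the Greek beta.
        byte[] utf8Bytes = "<h1>Hexyl \u03B2-D-glucopyranoside</h1>"
                .getBytes(StandardCharsets.UTF_8);

        // Matching charset: the beta survives.
        System.out.println(readAll(new ByteArrayInputStream(utf8Bytes), StandardCharsets.UTF_8));

        // Mismatched charset: the two UTF-8 bytes of beta turn into two Latin-1 characters.
        System.out.println(readAll(new ByteArrayInputStream(utf8Bytes), StandardCharsets.ISO_8859_1));
    }
}
```

The bytes are identical in both calls; only the charset handed to the reader decides whether the text comes out intact.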

Related

Generated html shows ? instead of international characters

I have the following problem in my project:
If I run my project locally (and from the JAR), the .ftlh file that I am processing compiles just fine: it shows all international characters without any problems (like ą ę ć).
Now, if I deploy my project to the cloud, all of those international characters are displayed as ?. I have no idea what's going on, as I've set the following in the .ftlh file:
<#ftl encoding='UTF-8'>
<!DOCTYPE html>
<html lang="en">
<head>
<meta http-equiv="Content-type" content="text/html;charset=UTF-8">
</head>
<body>
And my configuration:
@Bean
public freemarker.template.Configuration templateConfiguration() throws IOException {
freemarker.template.Configuration configuration = new freemarker.template.Configuration(freemarker.template.Configuration.VERSION_2_3_24);
configuration.setTemplateLoader(new ClassTemplateLoader(this.getClass(), "/folder"));
configuration.setDefaultEncoding("UTF-8");
configuration.setTemplateExceptionHandler(TemplateExceptionHandler.RETHROW_HANDLER);
configuration.setLogTemplateExceptions(false);
return configuration;
}
And this is how I process the template:
@Qualifier("templateConfiguration")
@Autowired
private Configuration configuration;
public void generateEmail(Order order, OutputStream outputStream) throws IOException, TemplateException {
Template template = configuration.getTemplate(EMAIL, "UTF-8");
OutputStreamWriter out = new OutputStreamWriter(outputStream);
template.process(order, out);
}
When I generate the email, and use System.out.println on the following:
ByteArrayOutputStream byteArrayOutputStream = new ByteArrayOutputStream();
try{
emailsTemplateService.generateEmail(order, byteArrayOutputStream);
} catch (Exception e){
e.printStackTrace();
}
String htmlMessage = new String(byteArrayOutputStream.toByteArray(), StandardCharsets.UTF_8);
System.out.println(htmlMessage);
It will print the HTML file with international characters (when runs locally). But when I run in the cloud, it will display ? instead.
Any ideas on what am I doing wrong?
You've used a specified character encoding in almost all cases, which is good. But you forgot one.
This:
OutputStreamWriter out = new OutputStreamWriter(outputStream);
Should be this:
OutputStreamWriter out = new OutputStreamWriter(outputStream, StandardCharsets.UTF_8);
Since you didn't specify the encoding for the OutputStreamWriter, it used the platform default encoding, which differed between the two platforms on which you ran the code (and it was not UTF-8 in the cloud).
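The difference can be reproduced without FreeMarker. The sketch below is an assumption-laden illustration: the class and method names are invented, and US-ASCII merely stands in for a non-UTF-8 platform default, since the cloud's actual default is not known from the question.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class for illustration only.
class WriterCharsetDemo {

    // Writes text through an OutputStreamWriter using the given charset,
    // then decodes the resulting bytes as UTF-8, as the question's code does.
    static String writeThenReadAsUtf8(String text, Charset writerCharset) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (Writer out = new OutputStreamWriter(bytes, writerCharset)) {
            out.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(bytes.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The Polish letters from the question, written as escapes.
        String sample = "\u0105 \u0119 \u0107";

        // Writer limited to ASCII (stand-in for a non-UTF-8 default): letters become '?'.
        System.out.println(writeThenReadAsUtf8(sample, StandardCharsets.US_ASCII));

        // Writer given UTF-8 explicitly: letters survive on any platform.
        System.out.println(writeThenReadAsUtf8(sample, StandardCharsets.UTF_8));
    }
}
```

Once the writer has replaced a character with ?, no later decoding step can bring it back, which is why the fix belongs at the writer.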

how to exclude tag from XML String in java

I am writing a piece of code in Java to send data to and receive data from a webpage. But when I receive the XML data, it is still wrapped in tags like this:
<?xml version='1.0'?>
<document>
<title> TEST </title>
</document>
How can I get the data without the tags in Java?
This is what I tried. The function writes the data and should then read the response and print it with System.out.println.
public static String User_Select(String username, String password) {
String mysql_type = "1"; // 1 = Select
try {
String urlParameters = "mysql_type=" + mysql_type + "&username=" + username + "&password=" + password;
URL url = new URL("http://localhost:8080/HTTP_Connection/index.php");
URLConnection conn = url.openConnection();
conn.setDoOutput(true);
OutputStreamWriter writer = new OutputStreamWriter(conn.getOutputStream());
writer.write(urlParameters);
writer.flush();
String line;
BufferedReader reader = new BufferedReader(new InputStreamReader(conn.getInputStream()));
while ((line = reader.readLine()) != null) {
System.out.println(line);
//System.out.println("Het werkt!!");
}
writer.close();
reader.close();
return line;
} catch (IOException iox) {
iox.printStackTrace();
return null;
}
}
Thanks in advance
I would suggest simply using RegEx to read the XML, and get the tag content that you are after.
That simplifies what you need to do, and limits the inclusion of additional (unnecessary) libraries.
And then there are lots of StackOverflows on this topic: Regex for xml parsing and In RegEx, I want to find everything between two XML tags just to mention 2 of them.
use DOMParser in java.
Check further in java docs
Use an XML Parser to Parse your XML. Here is a link to Oracle's Tutorial
Oracle Java XML Parser Tutorial
Simply pass the InputStream from URLConnection
Document doc = DocumentBuilderFactory.
newInstance().
newDocumentBuilder().
parse(conn.getInputStream());
From there you could use xPath to query the contents of the document or simply walk the document model.
Take a look at Java API for XML Processing (JAXP) for more details
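A minimal sketch of the JAXP route, using only classes that ship with the JDK. The class and method names are invented for illustration, and the XML is fed from a string rather than from conn.getInputStream() so the example runs without a server.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;

// Hypothetical demo class for illustration only.
class XmlTitleDemo {

    // Parses the XML and returns the trimmed text of /document/title.
    static String extractTitle(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            // evaluate(String, Object) returns the string value of the first matching node.
            return XPathFactory.newInstance().newXPath()
                    .evaluate("/document/title", doc)
                    .trim();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<?xml version='1.0'?><document><title> TEST </title></document>";
        System.out.println(extractTitle(xml)); // prints TEST
    }
}
```

In the question's code, the ByteArrayInputStream would be replaced by the stream obtained from the URLConnection.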
You have to use an XML parser. In your case a good choice is Jsoup, which fetches data from the web and parses both XML and HTML. It loads the data, parses it, and gives you what you want. Here is an example of how it works:
1. XML From an URL
String xml = Jsoup.connect("http://localhost:8080/HTTP_Connection/index.php")
.get().toString();
Document doc = Jsoup.parse(xml, "", Parser.xmlParser());
String myTitle = doc.select("title").first().text(); // myTitle now contains TEST
Edit:
To send GET or POST parameters with your request, use this code:
String xml = Jsoup.connect("http://localhost:8080/HTTP_Connection/index.php")
.data("param1Name", "param1Value")
.data("param2Name", "param2Value").get().toString();
You can use get() to invoke the HTTP GET method or post() to invoke the HTTP POST method.
2. XML From String
You can use JSoup to parse XML data in a String :
String xmlData="<?xml version='1.0'?><document> <title> TEST </title> </document>" ;
Document doc = Jsoup.parse(xmlData, "", Parser.xmlParser());
String myTitle = doc.select("title").first().text(); // myTitle now contains TEST

Encoding problems in Android Application (WebView.LoadData())

I'm having a problem encoding part of a webpage in my Android application. What I've got is an application that collects part of a webpage and displays it to the user. For this question, let's say that I've got a webpage with some text, a table below the text, and below the table a lot of junk I'm not interested in. So I choose what to view using the position of the first element (for example a unique tag) and an end position (same there, something unique), using an InputStreamReader with a start/end position.
Then in my string ("string") I run:
String s = Uri.encode(string);
The string s is then used accordingly:
web.loadData(s, "text/html","ISO-8859-1");
But this gives me some unwanted chars in the middle of the text: "Â" appears. I've tried to in the string run .replace("Â", ""); but this doesn't solve the problem.
I've also tried following:
web.loadData(s, "text/html", "UTF-8");
web.loadData(s,"text/html;utf-8",null);
But the "Â" and one or two "*" still appears?
I've been searching the web and found loadDataWithBaseURL, but this doesn't solve it either, so I would very much like some assistance :)
On the top of the page:
<html xmlns="http://www.w3.org/1999/xhtml" lang="sv-se" dir="ltr">
On another page:
<html xmlns="http://www.w3.org/1999/xhtml" lang="en-us" dir="ltr">
So I've got one English and one Swedish page, but the error occurs with both URLs.
Best regards!
use this:
webview.loadData(html_content, "text/html; charset=utf-8", "utf-8");
I tested it, and it works.
This code worked for me.
String base64EncodedString = null;
try {
base64EncodedString = android.util.Base64.encodeToString((preString+mailContent.getBody()+postString).getBytes("UTF-8"), android.util.Base64.DEFAULT);
} catch (UnsupportedEncodingException e1) {
// TODO Auto-generated catch block
e1.printStackTrace();
}
if(base64EncodedString != null)
{
wvMailContent.loadData(base64EncodedString, "text/html; charset=utf-8", "base64");
}
else
{
wvMailContent.loadData(preString+mailContent.getBody()+postString, "text/html; charset=utf-8", "utf-8");
}
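A stray "Â" is commonly what UTF-8 text looks like when it is decoded as ISO-8859-1, for instance when the page contains a non-breaking space. Whether that is the exact cause in the question is an assumption; the sketch below only reproduces the symptom in plain Java, with invented class and method names and no Android dependency.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class for illustration only.
class MojibakeDemo {

    // Encodes text as UTF-8 and then decodes the bytes as ISO-8859-1,
    // which is what a viewer does when told the wrong charset.
    static String decodeUtf8AsLatin1(String text) {
        return new String(text.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // \u00A0 is a non-breaking space; in UTF-8 it is the two bytes C2 A0.
        String original = "10\u00A0kr";

        // Byte C2 is shown as the Latin-1 letter \u00C2 (A with circumflex),
        // followed by the non-breaking space itself.
        String garbled = decodeUtf8AsLatin1(original);
        System.out.println(garbled.length() - original.length()); // prints 1
    }
}
```

This is consistent with the answers above, which make the charset handed to loadData match the encoding of the bytes.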

Embedding image into HTML in java?

I have this code where I am trying to read an image from Url:
public class question_insert {
public static String latex(String tex) throws IOException {
String urltext = "http://chart.apis.google.com/chart?cht=tx&chl="+tex;
URL url = new URL(urltext);
BufferedReader in = new BufferedReader(new InputStreamReader(url
.openStream()));
String inputLine;
while ((inputLine = in.readLine()) != null) {
// Process each line.
System.out.println(inputLine.toString());
}
in.close();
return inputLine;}
But what I get is unreadable output. The URL returns just a single image; try this:
http://chart.apis.google.com/chart?cht=tx&chl=2+2%20\frac{3}{4}
What should I do to embed the image into Html?
First of all it is not clear what you mean by image in Html format ? You could Base64 encode its binary data, but is that what you really want?
How do you expect to output a PNG picture returned by your URL to a text console (that is System.out)?
Second, the way you're retrieving the image is not functional even if you were to store it on a disk as a PNG file, because Reader and its derivatives like BufferedReader are used to read character data. From Reader API:
Abstract class for reading character streams
You need to read binary (byte) data, so you need to stick with BufferedInputStream
After some thinking I realized that embedding image into HTML is what you really want:
public static void main(String[] args) throws Exception {
String urltext = "http://chart.apis.google.com/chart?cht=tx&chl=2+2%20\\frac{3}{4}";
URL url = new URL(urltext);
BufferedInputStream bis = new BufferedInputStream(url.openStream());
ByteArrayOutputStream buffer = new ByteArrayOutputStream();
byte[] chunk = new byte[4096];
int n;
// read() may return fewer bytes than requested, so copy only the n bytes actually read
while ((n = bis.read(chunk)) != -1) {
buffer.write(chunk, 0, n);
}
bis.close();
byte[] imageBytes = buffer.toByteArray();
System.out.println("<img src='data:image/png;base64," + DatatypeConverter.printBase64Binary(imageBytes) + "'>");
}
The result is:
<img src=''>
Isn't that great? Anything for you!
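On newer JDKs, javax.xml.bind.DatatypeConverter is no longer bundled, so the same tag can be built with java.util.Base64 (available since Java 8). The following is a sketch with invented names; the three placeholder bytes stand in for real PNG data.

```java
import java.util.Base64;

// Hypothetical demo class for illustration only.
class DataUriDemo {

    // Builds an <img> tag whose src is a data URI holding the bytes as Base64.
    static String toImgTag(byte[] imageBytes, String mimeType) {
        String base64 = Base64.getEncoder().encodeToString(imageBytes);
        return "<img src='data:" + mimeType + ";base64," + base64 + "'>";
    }

    public static void main(String[] args) {
        // Placeholder bytes, not a real image.
        byte[] fakeImage = {1, 2, 3};
        System.out.println(toImgTag(fakeImage, "image/png"));
        // prints <img src='data:image/png;base64,AQID'>
    }
}
```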
Well, I don't know if that is what you want because it seems that nobody does. But if you want to get this output
<img style="-webkit-user-select: none"
src="http://chart.apis.google.com/chart?cht=tx&chl=2+2%20\frac{3}{4}" />
you will have to use this code
public static String latex(String tex) {
String url = "http://chart.apis.google.com/chart?cht=tx&chl=" + tex;
return "<img style=\"-webkit-user-select: none\" src=\"" + url + "\"/>";
}
Also you might have to escape some characters like \ in the tex parameter.
To get your image, you should try to use ImageIO API like this
try {
URL url = new URL(urltext);
BufferedImage img = ImageIO.read(url);
} catch (IOException e) {
e.printStackTrace();
}
http://chart.apis.google.com/chart?cht=tx&chl=2+2%20\frac{3}{4}
Note that this URL is wrong. This shows 22 3/4 instead of the intended 2 + 2 3/4, because an unencoded + in a query string is decoded as a space. The request parameter containing special characters needs to be URL-encoded as follows.
http://chart.apis.google.com/chart?cht=tx&chl=2%2B2%20%5Cfrac%7B3%7D%7B4%7D
You can achieve this with URLEncoder#encode().
String chl = "2+2 \\frac{3}{4}";
String url = "http://chart.apis.google.com/chart?cht=tx&chl=" + URLEncoder.encode(chl, "UTF-8");
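As a runnable sketch of that call (class and method names are invented for illustration): note that URLEncoder produces form encoding, so the space comes out as + rather than %20. Both forms decode to a space in a query string.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical demo class for illustration only.
class ChartUrlDemo {

    // Appends the URL-encoded TeX expression to the chart endpoint from the question.
    static String chartUrl(String tex) {
        try {
            return "http://chart.apis.google.com/chart?cht=tx&chl="
                    + URLEncoder.encode(tex, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is a charset every Java platform must support.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(chartUrl("2+2 \\frac{3}{4}"));
        // prints http://chart.apis.google.com/chart?cht=tx&chl=2%2B2+%5Cfrac%7B3%7D%7B4%7D
    }
}
```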
Back to your functional requirement:
What should I do to embed the image into Html?
If your sole functional requirement is to display the image as available behind the mentioned URL by an HTML <img> element in a HTML/JSP page, then you need to use JSTL <c:url> tag to URL-encode request parameters containing special characters.
<%@ taglib prefix="c" uri="http://java.sun.com/jsp/jstl/core" %>
...
<c:url var="url" value="http://chart.apis.google.com/chart">
<c:param name="cht" value="tx" />
<c:param name="chl" value="2+2 \\frac{3}{4}" />
</c:url>
Then you can just refer it as ${url} (as declared in var attribute of <c:url>) in the src attribute of the HTML <img> element:
<img src="${url}" />
Reading a binary image stream from a URL as a character stream and storing it in a string, as you initially attempted, makes no sense at all. You wouldn't open image files in Notepad either, for example.

Java - Parsing HTML - get text

I am trying to get text from a website. When you change the language, the HTML URL has an "/en" inside, but the page that has the information I want doesn't.
http://www.wippro.at/module/gallery/index.php?limitstart=0&picno=0&gallery_key=92
html tags: (the text contains the description of the photo)
<div id="redx_gallery_pic_title"> text text </div>
The problem is that the website is in German and I want the text in English, but my script only gets the German version.
Any ideas how I can do it?
java code:
...
URL oracle = new URL(x);
BufferedReader in = new BufferedReader(new InputStreamReader(oracle.openStream()));
String inputLine=null;
StringBuffer theText = new StringBuffer();
while ((inputLine = in.readLine()) != null)
theText.append(inputLine+"\n");
String html = theText.toString();
in.close();
String[] name = StringUtils.substringsBetween(html, "redx_gallery_pic_title\">", "</div>");
That site is internationalized with German as default. You need to tell the server what language you're accepting by specifying the desired ISO 639-1 language code in the Accept-Language request header.
URLConnection connection = new URL(url).openConnection();
connection.setRequestProperty("Accept-Language", "en");
InputStream input = connection.getInputStream();
// ...
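The header can be checked before any request goes out, since openConnection() only creates the connection object and does not contact the server yet. The sketch below wraps the two lines in a helper; the class and method names are invented for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URL;
import java.net.URLConnection;

// Hypothetical demo class for illustration only.
class AcceptLanguageDemo {

    // Creates a connection with the Accept-Language header set; nothing is sent
    // until the caller asks for the input stream.
    static URLConnection openWithLanguage(String url, String languageCode) {
        try {
            URLConnection connection = new URL(url).openConnection();
            connection.setRequestProperty("Accept-Language", languageCode);
            return connection;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        URLConnection connection = openWithLanguage(
                "http://www.wippro.at/module/gallery/index.php?limitstart=0&picno=0&gallery_key=92",
                "en");
        // Reads back the header that will accompany the request.
        System.out.println(connection.getRequestProperty("Accept-Language")); // prints en
    }
}
```

Whether the server honors the header is up to the site; the answer above reports that this one does.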
Unrelated to the concrete problem, may I suggest you have a look at Jsoup as an HTML parser? It's much more convenient with its jQuery-like CSS selector syntax and therefore much less bloated than your attempt so far:
String url = "http://www.wippro.at/module/gallery/index.php?limitstart=0&picno=0&gallery_key=92";
Document document = Jsoup.connect(url).header("Accept-Language", "en").get();
String title = document.select("#redx_gallery_pic_title").text();
System.out.println(title); // Beech, glazing V3
That's all.
