Unicode chars are converted to broken symbols when I use wkhtmltopdf

Unicode chars are converted to broken symbols when I use wkhtmltopdf - java

I have HTML that contains some Unicode characters, and saved in "UTF-8" to disk. I can use less to display it, all characters displayed well:
<h1>什么是Action?</h1>
<p>Play程序接收到的大部分请求，都是由<code>Action</code>来处理的。
But when I use "wkhtmltopdf" to convert it to PDF, it shows broken characters:
My command is:
wkhtmltopdf --encoding utf-8 book.html book.pdf
How to fix this?

Finally I found the reason: I don't have unicode fonts in my ubuntu server.
I upload some truetype fonts from my local ubuntu to the server, everything works fine.
freewind#freewind:/usr/share/fonts$ cd truetype/
freewind#freewind:/usr/share/fonts/truetype$ ls
arphic ttf-dejavu ttf-lao
freefont ttf-devanagari-fonts ttf-liberation
kochi ttf-gujarati-fonts ttf-malayalam-fonts
msttcorefonts ttf-indic-fonts-core ttf-oriya-fonts
openoffice ttf-japanese-gothic.ttf ttf-punjabi-fonts
sazanami ttf-japanese-mincho.ttf ttf-tamil-fonts
takao ttf-kacst-one ttf-telugu-fonts
thai ttf-kannada-fonts unfonts
ttf-bengali-fonts ttf-khmeros-core wqy
I simply upload them all, it fix this problem, although I don't know which font is the key.

I was having this problem too. Turned out, the HTML file had a meta tag that was setting the wrong charset.
Eg the HTML file had
<head>
<meta http-equiv=Content-Type content="text/html; charset=windows-1252">
<style>
and the issue was resolved when I switched it to instead utf-8 for the charset, like so:
<head>
<meta http-equiv=Content-Type content="text/html; charset=utf-8">
<style>

Try
wkhtmltopdf-i386 book.html book.pdf

If you are on a MS Windows machine (the above answer is for X Windows font server), the following worked for me:
You can use YaHei or SimSun with wkhtmltoimage.
Explicitly set content using Chinese letters to the new font-family in your style:
.smsnotification_chinese {
font-size: 30px;
font-family: "Microsoft Yahei", SimSun;
}
This will work on stock US Windows machines. There is a more robust description of font fallbacks described here for others: Chinese Standard Web Fonts: A Guide to CSS Font Family Declarations for Web Design in Simplified Chinese.
Note: The wkhtmltoimage binary does not work on Azure worker machines due to GDI+ sandbox restrictions. You can get around this by writing your own web service wrapper or using this free wrapper: Convert HTML to PDF in .Net on Azure

Related

JSP mysql and utf8

I am trying to store data encoded in greek through my JSP page in a mysql database using an insert statement. I have set the mysql arrays collation to utf 8. I have already used the
request.setCharacterEncoding("UTF-8");
response.setCharacterEncoding("UTF-8");
statements and I have made the proper modification to the server.xml of the tomcat server ...
Any other ideas???

set collation greek_general_ci.
ALTER TABLE <table name> CONVERT TO CHARACTER SET utf8 COLLATE greek_general_ci;
EDIT:
In MySQL Workbench right click on table->Alter Table--> change Collation to greek_general_ci or greekbin

It seems to me that you are trying to read UTF-8 characters (data is inputed as UTF-8 as you stated) using the Greek charset or other encoding that do not have equivalent Unicode chars.
Question marks or equivalent are shown when the byte (or bytes) do not have any association with the encoding you are using. You are parsing bytes that your client does not understand as valid, there is no representation.
I recommend you to read this very helpful article before you continue:
For example, you could encode the Unicode string for Hello (U+0048
U+0065 U+006C U+006C U+006F) in ASCII, or the old OEM Greek Encoding,
or the Hebrew ANSI Encoding, or any of several hundred encodings that
have been invented so far, with one catch: some of the letters might
not show up! If there's no equivalent for the Unicode code point
you're trying to represent in the encoding you're trying to represent
it in, you usually get a little question mark: ? or, if you're really
good, a box. Which did you get? -> �
Also the problem is not in your database or table/column encoding. It would only apply if you were using stored procedures for example.
Make sure your browser is operating in UTF-8 when inputing or showing information. Using Chrome you can go to Tools > Encoding, the Unicode UTF-8 should be set after you open your JSP in the browser. You can also debug the request/response in the Network tab. If you are inputing data as UTF-8, read it as UTF-8.
You dont need to change Tomcat config or use the HttpFilter to set encoding if you are doing it in the JSP. I was able to simulate your problem using only the following config:
ISO-8859-7 is a Greek Encoding that I used to read unicode and simulate your problem
<%# page language="java" contentType="text/html; charset=ISO-8859-7" pageEncoding="ISO-8859-7"%>
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-7">
</head>
</html>
If this does not help I ask you to post complete details and source code of your application.

Wrong encoding in java servlet (tomcat)

I am trying to setup the right encoding for my JSP/servlet pages in Tomcat 7. Though, I have to be successful yet. I made some tries from the suggestions given by this stackexchange thread: Character encoding JSP -displayed wrong in JSP but not in URL: "á » Ã¡ é » Ã©", but they didn't work.
The curious fact lies on the fact that if I let the pages "as is" the browser recognise them as having the encoding Windows-CP 1252 and when I change for UTF-8 the text is displayed correctly. But applying filters and other mechanisms the browser put the encoding as UTF-8 and is not possibile to display it correctly. In fact for the latter if I change the encoding the results are horrible at minimum.

I got it right now. In pages JSP I am putting as first instruction:
<%# page pageEncoding="utf-8" %>
This fixes all problems. Other possibilities like to put response.setCharacterEncoding( "UTF-8" ) as first instruction don't work.
In relation to servlets I need to setup the character encoding before to get the PrintWriter object:
response.setContentType("text/html");
response.setCharacterEncoding("UTF-8");
PrintWriter out = response.getWriter();
These things have solved my problem of strange characters. To sum up: The problem was that the response coming out from JSP/servlet didn't have pointed that itself was encoded in UTF-8

Maybe is not a JSP problem. Have you tried doing that in the page, directly?
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
...
</head>
Also, try to save the page in UTF-8 format

retrieve and display Tamil characters from mysql database to Browser

I am using java language. I can store Tamil characters to database in the same format. But when I retrieve and display in the browser using jsp it displayed like boxes. I use the following code to save Tamil character in mysql database.
Properties pr = new Properties();
pr.put("user", "root");
pr.put("password", "root");
pr.put("characterEncoding", "UTF-8");
pr.put("useUnicode", "true");
Class.forName("com.mysql.jdbc.Driver");
connection = DriverManager.getConnection(connectionURL,pr);
I can see the Tamil characters in database. But I can't retrieve and display in the same format. Please Help me. Thanks in advance.

Add the following to the top of your JSPs:
<%# page pageEncoding="UTF-8" %>
It instructs the server to use UTF-8 to write the characters to the response. It also adds a HTTP response Content-Type header with a value of text/html;charset=UTF-8. This is quite different from a simple <meta> tag which is ignored by webbrowsers when the content is served over HTTP. For debugging purposes, you can see the real HTTP headers using for example Fiddler2 or Firebug.
That should be sufficient.

What is the character set of your JSP pages? Make sure that it is UTF-8.
<meta http-equiv="content-type" content="text/html; charset=UTF-8">

adarshr gave you the probable solution to your probleme, also you could read The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets.

Where to add the UTF-8 extension in the HTML page?

I need to add the charset="utf-8" at the end of the script tags to get the translation to another language done.
I don know where all I should add the tags. Any rules are followed. Please let me know where to add the charset. Do i need to add at the end of "ApplicationLoader.js" or only after the jquery plugins. Any suggestion please.
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title>My Web App</title>
<link href="css/jquery/jquery.ui.all.css" rel="stylesheet" type="text/css" />
<script type="text/javascript" src="js/jquery-1.4.2.min.js" charset="utf-8"></script>
<script type="text/javascript" src="js/jquery-ui-1.8.custom.min.js" charset="utf-8"></script>
<script type="text/javascript" src="js/jquery.depends.js" charset="utf-8"></script>
<script type="text/javascript" src="myemployeelist.js" ></script>
Update:
I will make it more easy for you to understand my situation in details and am not a more stuff guy am a newbie under training now.As far as I understood I will explain my problems to you.
I created a webapplication project in Eclipse, in which I created a class
for JDBC connection to MySQL.
I have a class in serverside which gets the users profile value from my webapp
textbox and saves in DB.
I am using jQuery plugin for doing translation from English to Arabic.
I have an HTML page in which as stated in the above part of question I have the
tags in which I added the charset="utf-8" to specify unicode.
I am using dwr to get the values in js and send it to server side.
I changed my computer input language to Arabic and my Mozilla firefox locale to
Arabic.
I am able to enter the English values in mysql and I am able to retrieve it but when I
enter a Arabic value it's not getting saved. The JDBC error is
java.sql.SQLException: Incorrect string value: '\xD8\xB3\xD9\x84\xD8\xA8...'
I dont know how to configure my server Jetty 6 LANG variable to utf-8. Any suggestion please. Thanks.

If a Content-Type header is present in the HTTP response headers, then this will override the meta headers. Very often, this header is already by default supplied by the webserver and more than often the charset is absent (which would assume client's default charset which is often ISO-8859-1). In other words, the meta headers are generally only interpreted whenever the resources are opened locally (not by HTTP). Big chance that this is the reason that your meta headers apparently didn't work when served over HTTP.
You can use Firebug or Fiddler2 to determine the HTTP response headers. Below is a Firebug screen:
You can configure the general setting for HTTP response headers at webserver level. You can also configure it on a request basis at programming language level. Since it's unclear what webserver / programming language you're using, I can't go in detail about how to configure it accordingly.
Update: as per the problem symptoms, which is the following typical MySQL exception:
java.sql.SQLException: Incorrect string value: '\xD8\xB3\xD9\x84\xD8\xA8...'
The byte sequence D8 B3 D9 84 D8 A8 is a valid UTF-8 sequence which represents those characters سلب (U+0633, U+0644 and U+0628). So the HTTP part is fine. You mentioned that you were using Jetty 6 as servletcontainer. The later builds of Jetty 6 already support UTF-8 out of the box.
However, the problem is in the DB part. This exception indicates that the charset the DB/table is been instructed to use doesn't support this byte sequence. This can happen when the DB/table isn't been instructed to use UTF-8.
To fix the DB part, issue those MySQL commands:
ALTER DATABASE db_name DEFAULT CHARACTER SET utf8 COLLATE utf8_general_ci;
ALTER TABLE table_name CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci;
And for future DB/tables, use CHARACTER SET utf8 COLLATE utf8_general_ci in CREATE statement as well.

It is usually enough to declare the main document's encoding once, either through a content-type header (See #BalusC's answer for an in-depth explanation) or, if that is not available, the Meta tag.
If you use the Meta tag, make sure it is in the first line of the head section.
There should be no need to specify the character set explicitly for the script files.
Of course, all the content you deal with needs to be UTF-8 encoded as well for this to work. It's not enough to just slap the content-type meta tag in front. (But you are probably aware of that.)

In this way:
<script type="text/javascript" src="[path]/myscript.js" charset="utf-8"></script>

GWT Characters with accent not recognised when added programmatically

I am using the UIBinder in GWT but I have problems displaying letters with an accent.
My xml looks like this
<!DOCTYPE ui:UiBinder SYSTEM "http://dl.google.com/gwt/DTD/xhtml.ent">
<ui:UiBinder xmlns:ui="urn:ui:com.google.gwt.uibinder"
xmlns:g="urn:import:com.google.gwt.user.client.ui">
...
<g:Label ui:field="lbl"></Label>
If I type my text directly in the xml <g:Label>éç</g:Label> the accents come out fine. But if I use the setText method in the associated class lbl.setText("éç") they are replaced by a diamond with a question mark in it.
Edit: If if type them in html it displays the ampersand and stuff
SOLUTION:
In fact when I tested the app after changing the file format to UTF-8 I hadn't went back through the code to retype all the accent which were broken during the change. So they still appeared the same in the browser.

You need to set the response encoding and client encoding to UTF-8 as well.
Add this to top of your page to instruct the XML parser to use UTF-8:
<?xml version="1.0" encoding="UTF-8"?>
Add this to the HTML <head> to instruct the client to use UTF-8:
<meta http-equiv="Content-Type" content="text/html;charset=UTF-8" />

After doing BalusC tasks check your File Save Option go to >> 'File\AdvancedSaveOptions...' and check if your page saved as Unicode (UTF-8 with signature) codepage 650001.
Your issue may be because of using Windows Codepage 1252
just note that you have to retype your Unicode String

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.