Why French characters don't work using utf-8 with Java? - java

I have an HTML file with some French characters in it. I need to replace some strings inside that file, so I do the following:
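import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public static void replaceStringInFile(String filePath, String oldText, String newText)
{
    try
    {
        Path path = Paths.get(filePath);
        Charset charset = StandardCharsets.UTF_8;
        // Decode the whole file with the chosen charset, replace, then encode it back
        String content = new String(Files.readAllBytes(path), charset);
        content = content.replace(oldText, newText);
        Files.write(path, content.getBytes(charset));
    }
    catch(Exception e)
    {
        e.printStackTrace();
    }
}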
My strings are replaced, but the French characters are gone, replaced with �
If I replace UTF_8 with ISO_8859_1, it's working.
I thought UTF_8 was universal? Should work with French? I tried to specify utf-8 in the html file header:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<meta charset="utf-8"/>
....
</style>
I would like to understand why UTF_8 isn't keeping my French characters...

You have to know the encoding of the text file before you read it. Apparently, it is originally an HTML file without meta charset.
You guessed UTF-8. It's not UTF-8: reading it hit bytes that are not valid UTF-8, so they were replaced with the Unicode replacement character U+FFFD (�), which you are then displaying(?) using the incorrect encoding, turning � into the mojibake "ï¿½".
So, you'd have to go back to the sender/writer to find out what the encoding is. Then you can write a program to read it.
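To make the effect concrete, here is a minimal sketch, assuming the file was actually saved as ISO-8859-1 (the question hints at this, since reading with ISO_8859_1 works). The class name ReplacementCharDemo and the sample text are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

class ReplacementCharDemo {
    // Encodes the text as ISO-8859-1 (the assumed real file encoding),
    // then decodes those bytes as UTF-8 (the wrong guess).
    static String decodeLatin1BytesAsUtf8(String text) {
        byte[] latin1Bytes = text.getBytes(StandardCharsets.ISO_8859_1);
        return new String(latin1Bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "café au lait", written with an escape so the source file encoding does not matter
        String original = "caf\u00e9 au lait";
        String decoded = decodeLatin1BytesAsUtf8(original);
        // The single byte 0xE9 is not a valid UTF-8 sequence here,
        // so the decoder silently substitutes U+FFFD for it.
        System.out.println(decoded.indexOf('\uFFFD'));
        System.out.println(decoded.contains("\u00e9"));
    }
}
```

Once the substitution has happened the original byte is gone, so writing the string back out cannot restore the é, whatever charset is used for writing.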

I think the problem is not that utf-8 is not working with Java.
The problem is your file is not utf-8.
To confirm that, you can run "file -I your_file_path" (macOS; on Linux the flag is lowercase, "file -i"). If the output is something like "your_file_name: text/plain; charset=unknown-8bit", or it names another charset such as iso-8859-1, the file is not UTF-8.
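The same check can be done from Java. This is only a sketch: the class and method names (Utf8Check, isValidUtf8) are hypothetical, and it relies on a strict CharsetDecoder that reports malformed input instead of replacing it.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class Utf8Check {
    // Returns true only if the bytes are well-formed UTF-8.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            // The decoder found a byte sequence that is not legal UTF-8
            return false;
        }
    }

    public static void main(String[] args) {
        String text = "caf\u00e9"; // "café"
        System.out.println(isValidUtf8(text.getBytes(StandardCharsets.UTF_8)));
        System.out.println(isValidUtf8(text.getBytes(StandardCharsets.ISO_8859_1)));
    }
}
```

In the question's setting this would be called as isValidUtf8(Files.readAllBytes(path)) before doing the replacement. A false result proves the file is not UTF-8; a true result is only a strong hint, since a pure ASCII file is valid in many encodings at once.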

Related

utf 8 encoding is not working properly using java

I have to write a log to an HTML file that contains currency symbols such as ¥ and €.
Below is the code I'm using to write to the output file:
File fileDir = new File("filename.html");
out = new BufferedWriter( new OutputStreamWriter(new FileOutputStream(fileDir,true ),"UTF-8"));
In the output file, 'ï¿¥ is printed instead of ¥ and â‚¬ instead of €.
Some possibilities:
The HTML file is not set to show UTF-8 encoding. Try setting UTF-8 in the browser you use to view the file and see if that works.
If the page is not showing in UTF-8 by default, update the header. It should contain:
<head>
<meta charset="UTF-8">
</head>
Garbage in, garbage out. Java won't fix wrong characters for you: the writer only encodes the characters you give it, so the String you pass in must already contain the right ones.
Where are your ¥ characters coming from? Are they embedded in your source file?
If they are in the Java source, your editor must save the file as UTF-8 and javac must read it as UTF-8 (for example with -encoding UTF-8); otherwise the literals will contain the wrong characters.
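One way to rule out the source file encoding is to write the symbols as Unicode escapes and inspect the bytes that come out. A minimal sketch, assuming a ByteArrayOutputStream as a stand-in for the question's FileOutputStream so the result can be examined; CurrencyWriteDemo and writeAsUtf8 are made-up names.

```java
import java.io.BufferedWriter;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;

class CurrencyWriteDemo {
    // Writes the text through a UTF-8 writer and returns the bytes produced.
    static byte[] writeAsUtf8(String text) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (BufferedWriter out = new BufferedWriter(
                new OutputStreamWriter(sink, StandardCharsets.UTF_8))) {
            out.write(text);
        } // closing the writer flushes everything into the sink
        return sink.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // U+00A5 is the yen sign, U+20AC the euro sign; escapes avoid any editor/compiler mismatch
        String symbols = "\u00A5 \u20AC";
        StringBuilder hex = new StringBuilder();
        for (byte b : writeAsUtf8(symbols)) {
            hex.append(String.format("%02X ", b));
        }
        // Correct UTF-8 is C2 A5 for the yen sign and E2 82 AC for the euro sign
        System.out.println(hex.toString().trim());
    }
}
```

If the bytes are right and the browser still shows garbage, the file is fine and the viewer is decoding it with the wrong charset.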

How to render special characters in html

I am reading HTML text from a resource file like this:
InputStream fstream = this.getClass().getClassLoader()
.getResourceAsStream(filename);
myString = IOUtils.toString(fstream, "UTF-8");
But if the HTML contains special characters, as in
McDonald's
it is converted to McDonald?s. I can work around it by replacing ' with apos, but is there another way to do it? Is it an encoding issue? Replacing every single character is very tedious, since my file contains thousands of these special characters.
Thanks,
Try a different encoding, possibly Cp1252 or ISO-8859-1. You can find more character encodings at http://www.iana.org/assignments/character-sets (use the preferred MIME name), or take a look at the Wikipedia article on character encoding.
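A small sketch of why the charset choice matters here, assuming the apostrophe in the file is the single byte 0x92 (the curly apostrophe as typically saved by Windows editors; the question does not confirm this). ApostropheDemo and its methods are hypothetical names, and "windows-1252" is assumed to be available in the running JDK.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ApostropheDemo {
    // Hypothetical sample: "McDonald", then the byte 0x92, then "s".
    static byte[] sampleBytes() {
        return new byte[] {'M', 'c', 'D', 'o', 'n', 'a', 'l', 'd', (byte) 0x92, 's'};
    }

    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        byte[] raw = sampleBytes();
        Charset cp1252 = Charset.forName("windows-1252");
        // As UTF-8, the lone byte 0x92 is malformed and becomes U+FFFD.
        System.out.println(decode(raw, StandardCharsets.UTF_8).indexOf('\uFFFD'));
        // As windows-1252, 0x92 maps to U+2019, the right single quotation mark.
        System.out.println(decode(raw, cp1252).equals("McDonald\u2019s"));
        // As ISO-8859-1, 0x92 maps to the control character U+0092, not to an apostrophe.
        System.out.println((int) decode(raw, StandardCharsets.ISO_8859_1).charAt(8));
    }
}
```

Under that assumption, Cp1252 is the encoding that recovers the apostrophe; ISO-8859-1 decodes without error but yields an invisible control character.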
Use this meta tag instead of UTF-8 if your site is in English only; for multiple languages you have to use UTF-8:
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=ISO-8859-1">

converting a String to UTF8 format

In Java code, I have a string name = "örebro"; // it contains a Swedish character.
But when I use this name in a web application, some special character is printed in place of the 'Ö' character.
Is there any way I can show the character as it is in "örebro"?
I tried something like this, but it did not work:
String name = "örebro";
byte[] utf8s = name.getBytes("UTF-8");
name = new String(utf8s, "UTF-8");
But the name at the end prints the same, something like this. �rebo
Please guide me
The Java code you've provided is pointless; it does nothing. Java Strings are already perfectly capable of holding any character (though you have to be careful with literals in the source code, as they depend on the encoding the compiler uses, which is platform-dependent).
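A quick sketch showing that the round trip really is a no-op (RoundTripDemo and roundTrip are made-up names; the literal uses a Unicode escape so the compiler's source encoding cannot interfere):

```java
import java.nio.charset.StandardCharsets;

class RoundTripDemo {
    // Encodes to UTF-8 and immediately decodes again, as in the question.
    static String roundTrip(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String name = "\u00F6rebro"; // "örebro"
        // The string after the round trip is equal to the string before it.
        System.out.println(roundTrip(name).equals(name));
        // The encoded form has one byte more than the string has chars, because ö takes two bytes.
        System.out.println(name.getBytes(StandardCharsets.UTF_8).length);
    }
}
```

The encoding only matters at the boundary where characters become bytes, which in a web application is the HTTP response.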
Most likely your problem is that your webpage does not declare the encoding correctly in the HTTP header or the HTML meta tags.
You need to set the encoding of your output to UTF8.
It is likely the browser that reads the page does not know the encoding.
Send the header (before any other output), in Java with something like ServletResponse resource; (...) resource.setContentType("text/html;charset=utf-8");
In your HTML page, declare the encoding by printing <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
If the page used to generate the output is a JSP, it's useful to specify
<%@ page contentType="text/html; charset=utf-8" %>

Java String Encoding to UTF-8

I have some HTML code that I store in a java.lang.String variable. I write that variable to a file, setting the encoding to UTF-8 when writing the contents of the string to the filesystem. I open up that file and everything looks great, e.g. → shows up as a right arrow.
However, if the same String (containing the same content) is used by a jsp page to render content in a browser, characters such as → show up as a question mark (?)
When storing content in the String variable, I make sure that I use:
String myStr = new String(bytes, charset);
instead of just:
String myStr = "<html><head/><body>→</body></html>";
Can someone please tell me why the String content gets written to the filesystem perfectly but does not render in the jsp/browser?
Thanks.
but does not render in the jsp/browser?
You need to set the response encoding as well. In a JSP you can do this using
<%@ page pageEncoding="UTF-8" %>
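For the browser this has much the same effect as the following meta tag in the HTML <head>, except that the charset is sent in the HTTP Content-Type header, which takes precedence over the meta tag: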
<meta http-equiv="content-type" content="text/html; charset=utf-8">
Possibilities:
The browser does not support UTF-8
You don't have Content-Type: text/html; charset=utf-8 in your HTTP Headers.
The lazy developer (=me) uses Apache Common Lang StringEscapeUtils.escapeHtml http://commons.apache.org/lang/api-release/org/apache/commons/lang/StringEscapeUtils.html#escapeHtml(java.lang.String) which will help you handle all 'odd' characters. Let the browser do the final translation of the html entities.
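For readers without Commons Lang on the classpath, the idea can be sketched with the standard library alone. This is not the library's implementation, only an illustration with made-up names (EntityEscapeDemo, escapeNonAscii), and unlike escapeHtml it leaves <, > and & untouched.

```java
class EntityEscapeDemo {
    // Replaces every non-ASCII code point with a decimal numeric character reference.
    static String escapeNonAscii(String text) {
        StringBuilder out = new StringBuilder(text.length());
        text.codePoints().forEach(cp -> {
            if (cp < 128) {
                out.appendCodePoint(cp); // plain ASCII passes through unchanged
            } else {
                out.append("&#").append(cp).append(';');
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        // U+2192 is the right arrow from the question
        System.out.println(escapeNonAscii("a \u2192 b"));
    }
}
```

Because the output is pure ASCII, it survives any charset mismatch between server and browser, which is why this workaround hides the problem rather than fixing the response encoding.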

java utf-8 encoding problem

I am using an HTML parser called HtmlCleaner to parse HTML pages.
The problem is that each page has a different encoding.
My question:
Can I change from any character encoding to UTF-8?
You cannot seamlessly "convert" from encoding X to encoding Y without knowing encoding X beforehand. Just check the HTTP response header which encoding it is using (if you're obtaining those HTML pages by HTTP) and then use the appropriate encoding in your HTML parser tool.
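Once the source encoding is known, the conversion itself is two steps: decode with the source charset, then encode as UTF-8. A minimal sketch with hypothetical names (TranscodeDemo, toUtf8); how the source charset is discovered (HTTP header, meta tag) is left out.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class TranscodeDemo {
    // Decodes the bytes with the known source charset and re-encodes them as UTF-8.
    static byte[] toUtf8(byte[] input, Charset sourceCharset) {
        String text = new String(input, sourceCharset);
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // 0xE9 is "é" in ISO-8859-1; in UTF-8 the same character is the two bytes C3 A9
        byte[] latin1 = {(byte) 0xE9};
        byte[] utf8 = toUtf8(latin1, StandardCharsets.ISO_8859_1);
        System.out.println(utf8.length);
    }
}
```

Passing the wrong sourceCharset does not fail; it quietly produces different characters, which is why the encoding has to be known beforehand.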
Where do you get the HTML page from? If you get it from the servlet request, you can use getReader() on it and pass that to clean(). This will use the right encoding. If you get it from an upload, pass the input stream to clean(). If you get it by HTTP client, you need to check the response header Content-Type using getResponseCharSet().
Can i change from any character encoding to UTF-8?
Yes, you can express any Unicode character in UTF-8 encoding.
There might be a problem when changing the encoding of HTML pages: if the page contains a "charset" meta tag, for example,
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
you have to update this tag so it corresponds to the actual encoding.
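// Replaces each character in the range U+00A1..U+00FF with its numeric HTML character reference
public String arreglarString(String cadena) {
    for (int i = 161; i < 256; i++) {
        char car = (char) i;
        // String.replace treats its arguments literally, so no regex is involved
        cadena = cadena.replace(String.valueOf(car), "&#" + i + ";");
    }
    return cadena;
}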
