I am facing an issue displaying Greek characters. The text should appear as σ μυστικός αυτό? but it is appearing as ó ìõóôéêüò áõôü?
Some other Greek characters appear fine, but the text above comes out garbled.
The content is read from an HTML file by a servlet using the following code:
public String getResponse() {
    StringBuffer sb = new StringBuffer();
    try {
        BufferedReader in = new BufferedReader(
                new InputStreamReader(new FileInputStream(fn), "8859_1"));
        String line = null;
        while ((line = in.readLine()) != null) {
            sb.append(line);
        }
        in.close();
        return sb.toString();
    } catch (IOException e) {
        return sb.toString();
    }
}
I am setting the encoding to UTF-8 while sending back the response:
PrintWriter out;
if ((encodings != null) && (encodings.indexOf("gzip") != -1)) {
    OutputStream out1 = response.getOutputStream();
    out = new PrintWriter(new GZIPOutputStream(out1), false);
    response.setHeader("Content-Encoding", "gzip");
} else {
    out = response.getWriter();
}
response.setCharacterEncoding("UTF-8");
response.setContentType("text/html;charset=UTF-8");
out.println(getResponse());
The characters appear fine on my local development machine (which is Windows), but come out garbled when deployed on a CentOS server. Both machines have JDK 7 and Tomcat 7 installed.
I'm 99% sure the problem is your input encoding (when you read the data). You're decoding the file as ISO-8859-1 when it's probably ISO-8859-7 instead, which would cause exactly the symptoms you see.
The simplest way to check is to open the HTML file in a hex editor and examine the bytes directly. If the Greek characters take up one byte each, it's almost certainly ISO-8859-7 (not -1); if they take up two bytes each, it's UTF-8.
From what you posted it looks like ISO-8859-7. In that character set, the lower-case sigma σ is 0xF3, while in ISO-8859-1 the same code maps to ó, which matches the data you showed. I'm sure that if you mapped the remaining characters you'd see a one-to-one match in the codes. Maybe your Windows system's default codepage is ISO-8859-7, which is why it happens to display correctly there?
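If that is the case, the fix is simply to name the right charset when you decode the file. A minimal sketch of the reading side, assuming the HTML file really is ISO-8859-7 and keeping the structure of your getResponse():
public String getResponse() throws IOException {
    StringBuilder sb = new StringBuilder();
    // Decode the bytes as ISO-8859-7 (Greek) rather than ISO-8859-1 (Latin-1),
    // so 0xF3 comes out as σ instead of ó.
    try (BufferedReader in = new BufferedReader(
            new InputStreamReader(new FileInputStream(fn), "ISO-8859-7"))) {
        String line;
        while ((line = in.readLine()) != null) {
            sb.append(line);
        }
    }
    return sb.toString();
}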
Related
I am consuming an API which returns a String containing special characters, so I replace them with a blank or some other user-readable character.
My code:
String text = response;
if (text != null) {
    text = text.replace("Â", "");
    // same for the other special characters
}
The above code works fine on a Windows machine, but on Linux "Â" is converted into "?", and all the other special characters are converted into "?" as well.
I am using Java, and UTF-8 in my HTML.
Please let me know of a platform-independent solution. Thanks.
I am consuming a REST API, so while reading the output I have to maintain UTF-8 encoding.
BufferedReader br = new BufferedReader(new InputStreamReader(inputStream, StandardCharsets.UTF_8));
I have added StandardCharsets.UTF_8
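For completeness, here is a minimal sketch of reading the whole body with an explicit charset so the result no longer depends on the platform default (inputStream stands for whatever stream your API client hands you; StandardCharsets lives in java.nio.charset):
StringBuilder sb = new StringBuilder();
try (BufferedReader br = new BufferedReader(
        new InputStreamReader(inputStream, StandardCharsets.UTF_8))) {
    String line;
    while ((line = br.readLine()) != null) {
        sb.append(line);
    }
}
// "Â" and friends survive intact here regardless of the OS locale
String text = sb.toString();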
I have a CSV file, which I saved as CSV UTF-8 encoded using Excel.
My Java code reads the file as a byte array,
then
String result = new String(b, 0, b.length, "UTF-8");
But somehow the content "Montréal" becomes "Montr?al" when saved to the DB. What might be the problem?
The environment is unix with:
-bash-4.1$ locale
LANG=
LC_CTYPE="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_COLLATE="C"
LC_MONETARY="C"
LC_MESSAGES="C"
LC_ALL=
BTW it works on my Windows machine: when I run my code there and look in the DB, I see the correct "Montréal". So my guess is that the environment has some default locale setting that forces the use of the default encoding.
Thanks
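One way to confirm that guess is to print the defaults the JVM actually picked up; with LANG empty and LC_CTYPE=C it will typically report something like US-ASCII or ANSI_X3.4-1968. A minimal sketch:
// Quick check of the defaults the JVM inherited from the environment.
System.out.println("file.encoding  = " + System.getProperty("file.encoding"));
System.out.println("defaultCharset = " + java.nio.charset.Charset.defaultCharset());
// If these report US-ASCII on the unix box, any code path that falls back to the
// default charset (console output, a JDBC driver setting, etc.) will turn é into ?.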
I don't have your complete code, but I tried the following code and it works for me:
String x = "c:/Book2.csv";
BufferedReader br = null;
try {
    br = new BufferedReader(new InputStreamReader(new FileInputStream(x), "UTF-8"));
    String b;
    while ((b = br.readLine()) != null) {
        System.out.println(b);
    }
} finally {
    if (br != null) {
        br.close();
    }
}
If you see "Montr?al" printed on your console, don't worry. It does not mean that the program is not working. Now, you may want to check if your console supports printing UTF-8 characters. So, you can put a debug and inspect the variable to check if has what you want.
If you see correct value in debug and it prints a "?" in your output, you can rest assured that the String variable is having the right value and you can write it back to any file or DB as needed.
If you see "?" when you query your DB, the tool you may be using is not printing the output correctly. Try reading the DB value in java code an check by putting a debug in you code. I usually use putty to query the DB to see the double byte characters correctly. That's all I have, hope that helps.
You have to use ISO/IEC 8859, not UTF-8; if you look at the list of character encodings on the Wikipedia page you'll understand the difference.
Basically, UTF-8 is the common encoding used by Western countries...
Also, you can check your terminal encoding; maybe the problem is there.
I am developing a spider that reads URLs from a text file, downloads each page, and writes the URL and the page content to another file with a \t between them.
When I get the page, it may contain line feed characters, which should be removed. But I do not know the page's encoding before I get it.
Right now I am using JSOUP, since it can handle the encoding problem for me. But I find that JSOUP parses the HTML to find the encoding, which makes it slow.
Is there an easy way to just remove the line feed characters from the string or byte array?
Will this code work with UTF-8 or GBK?
byte[] buffer = new byte[4096];
String page = "";
int len;
while ((len = input.read(buffer)) != -1) {
    for (int i = 0; i < len; i++) {
        if (buffer[i] == '\r' || buffer[i] == '\n') {
            buffer[i] = ' ';
        }
    }
    page += new String(buffer, 0, len);
}
I found the code above does not work with UTF-8, because a character in an Asian language may be longer than 8 or 16 bits, so when I convert the bytes to a String a character may be split.
The following code works fine for me:
int responseCode = connection.getResponseCode();
if (responseCode >= 200 && responseCode < 300) {
    InputStream input = connection.getInputStream();
    byte[] buffer = new byte[BUFFER_SIZE];
    byte[] urlBytes = (url + "\t").getBytes("ASCII");
    System.arraycopy(urlBytes, 0, buffer, 0, urlBytes.length);
    int t = 0, index = urlBytes.length;
    while ((t = input.read()) != -1) {
        if (index >= buffer.length - 1) {
            // grow the buffer when it is nearly full
            byte[] temp = new byte[buffer.length * 3 / 2];
            System.arraycopy(buffer, 0, temp, 0, buffer.length);
            buffer = temp;
        }
        if (t == '\n' || t == '\r') {
            t = ' ';
        }
        buffer[index++] = (byte) t;
    }
    buffer[index++] = '\n';
}
Depending on the operating system, new lines can be \n, \r\n, or sometimes \r, but these are ASCII characters, and they are encoded the same way in any encoding that is a superset of ASCII. In that case, just remove all \r and \n bytes in your pages.
However, this will not work for other encodings such as UTF-16.
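So for UTF-8, GBK, and the ISO-8859 family you can strip the line breaks at the byte level without decoding anything; a minimal sketch of that idea:
// Replace CR and LF at the byte level. This is safe for ASCII supersets such as
// UTF-8 and GBK, where 0x0A/0x0D never occur inside a multi-byte character.
static byte[] replaceNewlines(byte[] data, int len) {
    byte[] out = new byte[len];
    for (int i = 0; i < len; i++) {
        byte b = data[i];
        out[i] = (b == '\r' || b == '\n') ? (byte) ' ' : b;
    }
    return out;
}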
I've run into a problem when trying to parse a JSON string that I grab from a file. My problem is that the zero-width no-break space character (Unicode U+FEFF) is at the beginning of my string when I read it in, and I cannot get rid of it. I don't want to use a regex because there may be other hidden characters with different code points.
Here's what I have:
StringBuilder content = new StringBuilder();
try {
    BufferedReader br = new BufferedReader(new FileReader("src/test/resources/getStuff.json"));
    String currentLine;
    while ((currentLine = br.readLine()) != null) {
        content.append(currentLine);
    }
    br.close();
} catch (Exception e) {
    Assert.fail();
}
And this is the start of the JSON file (it's too long to copy-paste the whole thing, but I have confirmed it is valid):
{"result":{"data":{"request":{"year":null,"timestamp":1413398641246,...
Here's what I've tried so far:
Copying the JSON file to Notepad++ and showing all characters
Copying the file to Notepad++ and converting it to UTF-8 without BOM, and to ISO 8859-1
Opening the JSON file in other text editors such as Sublime and saving as UTF-8
Copying the JSON file to a txt file and reading that in
Trying Scanner instead of BufferedReader
In IntelliJ, trying View -> Active Editor -> Show Whitespaces
How can I read this file in without having the Zero width no-break space character at the beginning of the string?
0xEF 0xBB 0xBF is the UTF-8 BOM, 0xFE 0xFF is the UTF-16BE BOM, and 0xFF 0xFE is the UTF-16LE BOM. If U+FEFF appears at the front of your String, it means you created a UTF-encoded text file with a BOM. A UTF-16 BOM decodes to U+FEFF directly, whereas a UTF-8 BOM only shows up as U+FEFF if the reader decoded the BOM bytes as ordinary content instead of skipping them. In fact, Java is known not to handle UTF-8 BOMs for you (see bugs JDK-4508058 and JDK-6378911).
If you read the FileReader documentation, it says:
The constructors of this class assume that the default character encoding and the default byte-buffer size are appropriate. To specify these values yourself, construct an InputStreamReader on a FileInputStream.
You need to read the file content using a reader that recognizes charsets, preferably one that will read the BOM for you and adjust itself internally as needed. But worst case, you could just open the file yourself, read the first few bytes to detect whether a BOM is present, and then construct a reader with the appropriate charset to read the rest of the file. Here is an example using org.apache.commons.io.input.BOMInputStream that does exactly that:
(from https://stackoverflow.com/a/13988345/65863)
String defaultEncoding = "UTF-8";
InputStream inputStream = new FileInputStream(someFileWithPossibleUtf8Bom);
try {
    BOMInputStream bOMInputStream = new BOMInputStream(inputStream);
    ByteOrderMark bom = bOMInputStream.getBOM();
    String charsetName = bom == null ? defaultEncoding : bom.getCharsetName();
    InputStreamReader reader = new InputStreamReader(new BufferedInputStream(bOMInputStream), charsetName);
    // use reader
} finally {
    inputStream.close();
}
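Alternatively, if you would rather not add commons-io and the symptom really is just a single U+FEFF at the front of the decoded string (i.e. the default charset happened to decode the file correctly apart from the BOM), you can simply drop it after reading; a minimal sketch using the content builder from your code:
String json = content.toString();
// Drop the BOM character if the decoder left it at the front of the string.
if (!json.isEmpty() && json.charAt(0) == '\uFEFF') {
    json = json.substring(1);
}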
Right now, I have some code that reads a page and saves everything to an HTML file. However, there are some problems... some punctuation and special characters show up as question marks.
Of course, if I did this manually, I'd save the .txt file with Unicode encoding rather than the default ANSI. I looked around, and all I found was complaints that it's impossible in Java, or half-explanations that I don't understand...
In any case, can anyone help me correct the question marks? Here is the part of my code that downloads the page. (The lister creates an array of URLs to download, for use with sites that have multiple pages. You can ignore that; it works fine.)
public void URLDownloader(String site, int startPage, int endPage) throws Exception {
    String[] pages = URLLister(site, startPage, endPage);
    String webPage = pages[0];
    int fileNumber = startPage;
    if (startPage == 0)
        fileNumber++;
    // change pages
    for (int i = 0; i < pages.length; i++) {
        webPage = pages[i];
        URL url = new URL(webPage);
        BufferedReader in = new BufferedReader(
                new InputStreamReader(url.openStream()));
        PrintWriter out = new PrintWriter(name + (fileNumber + i) + ".html");
        String inputLine;
        // while stuff to read on current page
        while ((inputLine = in.readLine()) != null) {
            out.println(inputLine); // write line of text
        }
        out.close(); // end writing text
        in.close();
        if (startPage == 0)
            startPage++;
        console.append("Finished page " + startPage + "\n");
        startPage++;
    }
}
if I do this manually, I'd save the .txt file with Unicode encoding rather than the default ANSI
Windows is giving you misleading terminology here. There is no such encoding as ‘Unicode’; Unicode is the character set which is encoded in different ways into bytes. The encoding that Windows calls ‘Unicode’ is actually UTF-16LE. This is a two-byte-per-code-unit encoding that is not ASCII compatible and is generally inconvenient; Web pages tend not to work well with it.
(For what it's worth the ‘ANSI’ code page isn't anything to do with ANSI either. Plus ça change...)
PrintWriter out = new PrintWriter(name + (fileNumber+i) + ".html");
This creates a file using the Java default encoding, which is likely the ANSI code page in your case. To specify a different encoding, use the optional second argument to PrintWriter:
PrintWriter out = new PrintWriter(name + (fileNumber+i) + ".html", "utf-8");
UTF-8 is usually a good choice: being a UTF it can store all Unicode characters, and it's ASCII-compatible too.
However! You are also reading in the string using the default encoding:
BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()));
which probably isn't the encoding of the page. Again, you can specify the encoding using an optional parameter:
BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream(), "utf-8"));
and this will work fine if the web page was actually served as UTF-8.
But what if it wasn't? There are actually multiple ways the encoding of an HTML page can be determined:
From the Content-Type: text/html;charset=... header parameter, if present.
From the <?xml declaration, if it's served as application/xhtml+xml.
From the <meta> equivalent tag in the page, if (1) and (2) were not present.
From browser-specific guessing heuristics, which may depend on user settings.
You can get (1) by reading the URLConnection's getContentType() (via url.openConnection()) and parsing out the charset parameter. To get (2) or (3) you have to actually parse the file, which is kind of bad news. (4) is out of reach.
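For case (1), the parsing is simple enough to do by hand; a rough sketch that ignores quoting and other corner cases and falls back to UTF-8 when no charset parameter is present:
// Pull the charset out of e.g. "text/html; charset=ISO-8859-7".
URLConnection conn = url.openConnection();
String contentType = conn.getContentType();
String charset = "UTF-8";
if (contentType != null) {
    for (String param : contentType.split(";")) {
        param = param.trim();
        if (param.toLowerCase().startsWith("charset=")) {
            charset = param.substring("charset=".length()).trim();
        }
    }
}
BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream(), charset));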
Probably the most consistent thing you can do is just what web browsers (except IE) do when they save a standalone web page to disc: take the exact original bytes that were served and put them straight into a file without any attempt to decode them. Then you don't have to worry about encodings or line ending changes. It does mean any charset metadata in the HTTP headers gets lost, but there's not really much you can do about that short of parsing the HTML and inserting a <meta> tag yourself (probably far too much faff).
InputStream in = url.openStream();
OutputStream out = new FileOutputStream(name + (fileNumber + i) + ".html");
byte[] buffer = new byte[1024 * 1024];
int len;
while ((len = in.read(buffer)) != -1) {
    out.write(buffer, 0, len);
}
out.close();
in.close();
(nb buffer copy loop from this question which offers alternatives such as IOUtils.)
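If commons-io happens to be on the classpath already (it was used for BOMInputStream earlier), the copy loop collapses to a one-liner; a minimal sketch:
// Copy the raw bytes of the page straight to disk; no decoding involved.
try (InputStream in = url.openStream();
     OutputStream out = new FileOutputStream(name + (fileNumber + i) + ".html")) {
    IOUtils.copy(in, out);   // org.apache.commons.io.IOUtils
}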