I am having trouble reading international characters in Java.
The default character set being used is UTF-8 and my Eclipse workspace is also set to this.
I am reading the title of a video from the Internet (Gangnam Style, in fact ;) ), which contains Korean characters. I am doing this as follows:
BufferedReader stdInput = new BufferedReader(new InputStreamReader(shellCommand.getInputStream()));
String fileName = null, output = null;
while ((output = stdInput.readLine()) != null) {
    if (output.indexOf("Destination") > 0) {
        System.out.println(output);
    }
}
I know that the title it will read is: "PSY - GANGNAM STYLE (강남스타일) M/V", but the console displays the following instead: "PSY - GANGNAM STYLE () M V" which causes errors further along in my program.
It seems like the InputStream Reader isn't reading these characters correctly.
Does anyone have any ideas? I've spent the last hour scouring the Internet and haven't found any answers. Thanks in advance everyone.
The default character set being used is UTF-8
The default where? In Java itself, or in the video? It would be much clearer if you specified this explicitly. You should check that it's correct for the video data too.
It seems like the InputStream Reader isn't reading these characters correctly.
Well, all we know is that the text isn't showing properly on the console. Either it isn't being read correctly, or it's not being displayed correctly. You should print out each character's Unicode value so you can check the exact content of the string. For example:
static void logCharacters(String text) {
for (int i = 0; i < text.length(); i++) {
char c = text.charAt(i);
System.out.println(c + " " + Integer.toHexString(c));
}
}
You need to ensure the default charset is what you expect by checking Charset.defaultCharset().name(); otherwise, specify the charset explicitly:
InputStreamReader in = new InputStreamReader(shellCommand.getInputStream(), "UTF-8");
I tried a sample program and it prints correctly in Eclipse. It might be a problem with the Windows console, as AlexR has pointed out.
byte[] bytes = "PSY - GANGNAM STYLE (강남스타일) M/V".getBytes();
InputStreamReader reader = new InputStreamReader(new ByteArrayInputStream(bytes));
BufferedReader bufferedReader = new BufferedReader(reader);
String str = bufferedReader.readLine();
System.out.println(str);
Output:
PSY - GANGNAM STYLE (강남스타일) M/V
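The sample above relies on the platform default charset both for getBytes() and for the reader, so it only shows that the two agree with each other. Here is a minimal sketch of the same round trip with the charset named explicitly, assuming the command's output really is UTF-8. The class and method names are made up for illustration, and the Korean text is written with Unicode escapes so the source file's own encoding does not matter.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ExplicitCharsetDemo {

    // Hypothetical helper: decodes the bytes with the given charset and returns the first line.
    static String readFirstLine(byte[] data, Charset charset) {
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new ByteArrayInputStream(data), charset))) {
            return reader.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // The title from the question, with the Korean characters as escapes.
        String title = "PSY - GANGNAM STYLE (\uAC15\uB0A8\uC2A4\uD0C0\uC77C) M/V";
        byte[] utf8 = title.getBytes(StandardCharsets.UTF_8);

        // Decoding with the charset the bytes were written in restores the text.
        System.out.println(title.equals(readFirstLine(utf8, StandardCharsets.UTF_8)));      // true

        // Decoding the same bytes with a different charset mangles the Korean part.
        System.out.println(title.equals(readFirstLine(utf8, StandardCharsets.ISO_8859_1))); // false
    }
}
```

If the first comparison holds in your program but the console still shows blanks, the reading side is fine and the display side is the place to look.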
Related
I'm importing a file into my code and trying to print it. the file contains
i don't like cake.
pizza is good.
i don’t like "cookies" to.
17.
29.
The second "don’t" contains a right single quotation mark (’), and when I print it the output is
don�t
The question mark is displayed as a blank square. Is there a way to convert it to a regular apostrophe?
EDIT:
public class Somethingsomething {
public static void main(String[] args) throws FileNotFoundException,
IOException {
ArrayList<String> list = new ArrayList<String>();
File file = new File("D:\\project1Test.txt");//D:\\project1Test.txt
if(file.exists()){//checks if file exist
FileInputStream fileStream = new FileInputStream(file);
InputStreamReader input = new InputStreamReader(fileStream);
BufferedReader reader = new BufferedReader(input);
String line;
while( (line = reader.readLine()) != null) {
list.add(line);
}
for(int i = 0; i < list.size(); i ++){
System.out.println(list.get(i));
}
}
}}
it should print as normal but the second "don't" has a white block on the apostrophe
this is the file I'm using https://www.mediafire.com/file/8rk7nwilpj7rn7s/project1Test.txt
edit: if it helps, the full document where the character is found is here:
https://www.nytimes.com/2018/03/25/business/economy/labor-professionals.html
It’s all about character encoding. The way characters are represented isn't always the same and they tend to get misinterpreted.
Characters are usually stored as numbers that depend on the encoding standard (and there are many of them). For example, "a" is 97 (0x61 in hexadecimal) in both ASCII and UTF-8, but "é" is the single byte 0xE9 in ISO-8859-1 and the two bytes 0xC3 0xA9 in UTF-8.
Now when you see funny characters such as the question mark in this case (the replacement character, U+FFFD), it usually means that text in one encoding is being decoded as another, and the replacement character stands in for the bytes that could not be interpreted.
To fix your problem you need to tell your reader to read your file using a specific character encoding, say SOME-CHARSET.
Replace this:
InputStreamReader input = new InputStreamReader(fileStream);
with this:
InputStreamReader input = new InputStreamReader(fileStream, "SOME-CHARSET");
A list of charsets is available here. Unfortunately, you might want to go through them one by one. A short list of most common ones could be found here.
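To see where the replacement character comes from, here is a small sketch. It assumes the file was saved as windows-1252, where the right single quotation mark is the single byte 0x92; that is a guess about the asker's file, not something the question confirms. It also assumes the JDK in use ships the windows-1252 charset, which is not one of the charsets guaranteed by StandardCharsets. The class and helper names are invented for the example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ReplacementCharDemo {

    // Decodes raw bytes with the given charset; malformed input becomes U+FFFD.
    static String decode(byte[] data, Charset charset) {
        return new String(data, charset);
    }

    // Hypothetical helper: swaps the typographic apostrophe for a plain ASCII one.
    static String toAsciiApostrophe(String s) {
        return s.replace('\u2019', '\'');
    }

    public static void main(String[] args) {
        // "don’t" as windows-1252 bytes: the apostrophe is the single byte 0x92.
        byte[] raw = {'d', 'o', 'n', (byte) 0x92, 't'};

        // 0x92 on its own is not valid UTF-8, so the decoder substitutes U+FFFD.
        String wrong = decode(raw, StandardCharsets.UTF_8);
        System.out.println(Integer.toHexString(wrong.charAt(3))); // fffd

        // With the matching charset the byte becomes U+2019, the right single quotation mark.
        String right = decode(raw, Charset.forName("windows-1252"));
        System.out.println(Integer.toHexString(right.charAt(3))); // 2019

        // Optional: normalise to a regular apostrophe, as the question asks.
        System.out.println(toAsciiApostrophe(right)); // don't
    }
}
```

Note that once the text has been decoded to U+FFFD the original byte is lost, so the conversion to a regular apostrophe only works after reading with the right charset.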
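Your problem is almost certainly the encoding scheme you are using. You can read a file in almost any encoding scheme you want; just tell Java how your input was encoded. UTF-8 is common on Linux. On Windows the native encoding depends on the locale: windows-1252 (CP-1252) for English and Western European systems, CP-1250 for Central European ones.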
This is the sort of problem you have all the time if you are processing files created on a different OS.
See here and Here
I'll give you a different approach...
Use the appropriate means for reading plain text files. Try this:
public static String getTxtContent(String path)
{
try(BufferedReader br = new BufferedReader(new FileReader(path)))
{
StringBuilder sb = new StringBuilder();
String line = br.readLine();
while (line != null) {
sb.append(line);
sb.append(System.lineSeparator());
line = br.readLine();
}
return sb.toString();
}catch(IOException fex){ return null; }
}
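One caveat with the FileReader(path) constructor used above is that it decodes with the platform default charset, so the result can still differ between machines. As a hedged alternative, the sketch below uses Files.readAllLines with an explicit charset; unlike InputStreamReader, it throws an exception on malformed input instead of silently inserting replacement characters, which makes a wrong guess easy to detect. The file contents, class name and helper names are invented for the example, and windows-1252 is assumed to be available in the JDK.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class StrictReadDemo {

    // Hypothetical helper: writes the bytes to a temporary file and returns its path.
    static Path writeTemp(byte[] data) {
        try {
            Path p = Files.createTempFile("charset-demo", ".txt");
            Files.write(p, data);
            return p;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Returns true if every byte of the file is valid in the given charset.
    static boolean decodesCleanly(Path path, Charset charset) {
        try {
            Files.readAllLines(path, charset);
            return true;
        } catch (CharacterCodingException e) {
            // Thrown for byte sequences that are malformed or unmappable in this charset.
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // "don’t" saved as windows-1252 (assumed for the example).
        Path file = writeTemp(new byte[] {'d', 'o', 'n', (byte) 0x92, 't'});

        System.out.println(decodesCleanly(file, StandardCharsets.UTF_8));           // false
        System.out.println(decodesCleanly(file, Charset.forName("windows-1252"))); // true
    }
}
```

A clean decode does not prove the guess is right, since single-byte charsets accept almost any byte, but a failure does prove it is wrong.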
I am trying to read all the words as a String from the url - http://www.puzzlers.org/pub/wordlists/unixdict.txt
But the outputted String has some part of the strings missing whenever there is a '
Why am I getting this, and how can I avoid it?
I am getting the same error when using String builder instead of concatenating.
public static String getUrlContents(String theUrl) throws IOException {
URL url = new URL(theUrl);
URLConnection urlConnection = url.openConnection();
BufferedReader bufferedReader = new BufferedReader(new InputStreamReader(urlConnection.getInputStream()));
String line;
String text = "";
while ((line = bufferedReader.readLine()) != null) {
text += line + " ";
}
bufferedReader.close();
return text;
}
Output:
After ain' huge blank and then continues from 'd anyhow
and it continues after
So it's eating up the text between two subsequent ' and '
Looks like the text is there, since when I search for antony which is between the blanks the eclipse highlights the word as seen below, but it's not visible on my screen :O
As Naruto already answered, the console has a limited buffer size, and content beyond that size is not visible. To check whether your string is correct, copy the whole console content (Ctrl+A) and paste it into a Notepad file; I'm sure you will see the complete content.
Basically it's not a bug but the predefined size of the console, which you can change at Window > Preferences > Run/Debug > Console.
Another way is to enable the Fixed Width Console option and set its maximum width, which is 1000, and you will be able to see the content in the console.
I have a txt file with three rows of integers, after adding them to a List I'm finding a strange char at the beginning of the first index. I used an InputStream, BufferedReader and StringBuilder to read from the file. I tried to debug using println() statements at several places but I still can't figure out where that char came from.
File selectedFile = fileChooser.getSelectedFile();
inputStream = new FileInputStream(selectedFile);
BufferedReader reader = new BufferedReader(new InputStreamReader(inputStream));
StringBuilder out = new StringBuilder();
String line;
while ((line = reader.readLine()) != null) {
out.append(line);
items.add(line);
}
When I try to copy the output from printing out List items to this post somehow the char I'm talking about does not show, so I'll post a screenshot instead:
http://imgur.com/gjaF3no
http://imgur.com/JHAH6mV
The first is of the entire list, and the second should show the char I'm talking about more clearly; it looks like a dot before "3". Any help would be appreciated. Thank you.
You can try removing all control characters (strange characters) by doing the following:
strangeString.replaceAll("\\p{Cntrl}", "");
Reference: Java - removing strange characters from a String
Thank you all for the help. The problem was actually in the original txt file like #coder
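One common cause of an invisible character at the very start of a text file is a UTF-8 byte order mark (bytes EF BB BF) written by the editor that saved it. Whether that was the case here is an assumption; the thread only says the problem was in the original file. If it is, note that Java's UTF-8 decoder keeps the mark as U+FEFF, and that the \p{Cntrl} pattern above does not match it, because U+FEFF is a format character rather than a control character. A minimal sketch with a hypothetical stripBom helper:

```java
import java.nio.charset.StandardCharsets;

public class BomDemo {

    // Hypothetical helper: drops a leading U+FEFF if present, otherwise returns the input unchanged.
    static String stripBom(String s) {
        return (!s.isEmpty() && s.charAt(0) == '\uFEFF') ? s.substring(1) : s;
    }

    public static void main(String[] args) {
        // First line of a file that starts with a UTF-8 BOM followed by the digit 3.
        byte[] raw = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, '3'};
        String firstLine = new String(raw, StandardCharsets.UTF_8);

        System.out.println(firstLine.length());                               // 2: BOM + '3'
        System.out.println(firstLine.replaceAll("\\p{Cntrl}", "").length());  // 2: BOM survives
        System.out.println(stripBom(firstLine).length());                     // 1
    }
}
```

Applying stripBom only to the first line read from the file is enough, since the mark can only appear at the start.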
I am trying to read in some sentences from a file that contains unicode characters. It does print out a string but for some reason it messes up the unicode characters
This is the code I have:
public static String readSentence(String resourceName) {
String sentence = null;
try {
InputStream refStream = ClassLoader
.getSystemResourceAsStream(resourceName);
BufferedReader br = new BufferedReader(new InputStreamReader(
refStream, Charset.forName("UTF-8")));
sentence = br.readLine();
} catch (IOException e) {
throw new RuntimeException("Cannot read sentence: " + resourceName);
}
return sentence.trim();
}
The problem is probably in the way that the string is being output.
I suggest that you confirm that you are correctly reading the Unicode characters by doing something like this:
for (char ch : sentence.toCharArray()) {
    System.err.println("char '" + ch + "' is unicode codepoint " + ((int) ch));
}
and see if the Unicode codepoints are correct for the characters that are being messed up. If they are correct, then the problem is output side: if not, then input side.
First, you could create the InputStreamReader as
new InputStreamReader(refStream, "UTF-8")
Also, you should verify if the resource really contains UTF-8 content.
One of the most annoying reasons could be your IDE settings.
If your IDE's default console encoding is something like latin1, then you'll struggle for a long time with different variations of Java code, and nothing will help until you set the IDE's encoding options correctly.
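On the Java side, the output encoding can also be pinned instead of left to the default. The sketch below wraps an output stream in a PrintStream that always encodes as UTF-8; it only helps if the console that displays the bytes is itself configured for UTF-8, which remains an IDE or terminal setting. The class and method names are invented for the example.

```java
import java.io.OutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

public class Utf8ConsoleDemo {

    // Hypothetical helper: wraps any output stream in a PrintStream that encodes text as UTF-8.
    static PrintStream utf8Stream(OutputStream target) {
        try {
            return new PrintStream(target, true, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 support is mandatory on every Java platform, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        PrintStream out = utf8Stream(System.out);
        // Written with escapes so the source file encoding does not matter.
        out.println("caf\u00E9 \uAC15\uB0A8");
    }
}
```

If the text still looks wrong with this stream, the bytes leaving the program are correct UTF-8 and the remaining mismatch is in how the console decodes them.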
I tried to ask this question earlier, but I was unclear in my question. Java BufferedReader action on character?
Here is my problem.. I have a BufferedReader set to read from a device. It is reading well. I have it set to
if (Status.reader.ready()) {
Lines = Status.reader.readLine();
}
if (Lines.contains(">")) {
log.level1("ready to send data");
}
Buffered reader does not report the > until I've sent more data to the device. The problem is that when reader contains > it is not reporting ready. It holds onto the > until I input more data.
I tried the following and it returns nothing. It does not even return the log.level0()
Lines = "";
try {
Lines = Status.reader.readLine();
} catch (IOException e) {
Log.level0("Attempted to read blank line");
}
Here is the actual data sent:
^M^M01 02 F3^M00 01 F3 3E^M>
But BufferedReader ignores the > until more data has been sent then get a result like this:
>0102
When I check the actual data from the device from the command prompt, it returns what I'd expect, the > is present.
BufferedReader will not give me the >. Is there some way I can check for this char otherwise?
The BufferedReader.readLine() method reads data a line at a time. That is, it will attempt to read characters until it sees an end-of-line sequence (e.g. "\n", "\r" or "\r\n") or the end of stream.
If your input data is not line oriented, then you should not be using readLine() to read it. I suggest that you do your own record / message extraction; e.g.
BufferedReader br = ...
StringBuilder sb = new StringBuilder(...);
int ch = br.read();
while (ch != -1 && ch != '>') {
sb.append((char) ch);
ch = br.read();
}
String record = sb.toString();
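For a version of the loop above that can be run as-is, here is a sketch that uses a StringReader as a stand-in for the device. The input string mirrors the data shown in the question, with ^M written as a carriage return; the class name, the readUntil helper and the trailing "0102" are illustrative assumptions. Because the prompt character has no line terminator after it, readLine() would keep waiting at that point, whereas reading character by character returns as soon as the prompt arrives.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

public class PromptReaderDemo {

    // Reads characters until the terminator or end of stream; the terminator is consumed but not returned.
    static String readUntil(Reader in, char terminator) {
        StringBuilder sb = new StringBuilder();
        try {
            int ch;
            while ((ch = in.read()) != -1 && ch != terminator) {
                sb.append((char) ch);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Stand-in for the device stream: the response, the '>' prompt, then later data.
        Reader device = new StringReader("\r\r01 02 F3\r00 01 F3 3E\r>0102");

        String record = readUntil(device, '>');
        // Show the carriage returns as visible markers for printing.
        System.out.println(record.replace("\r", "<CR>"));

        // Whatever follows the prompt is still available for the next read.
        System.out.println(readUntil(device, '>'));
    }
}
```

With a real device, read() blocks until at least one character is available, so this loop is usually run on its own thread or guarded by a timeout.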
Check this:
http://download.oracle.com/docs/cd/E17476_01/javase/1.5.0/docs/api/java/io/BufferedReader.html
I recommend that you use the function public int read() instead.
You can find a lot of examples on Google.
With those F3s in there it looks to me like your data isn't even character-oriented let alone line-oriented. Is your device really Unicode-compliant?
I would use a BufferedInputStream.