Use SmbFileInputStream to read data in UTF-8 encoding - Java

I have a file containing the following string:
Vol conforme à la réglementation
However, when I read the file using SmbFileInputStream, I get:
Vol conforme � la r�glementation
Could you please let me know the best way to read this file so that I get the string exactly as it appears in the original file? I am decoding it as UTF-8, which I am not sure is correct. Here is the code I am currently using:
SmbFileInputStream smbFileInputStream = new SmbFileInputStream(fileURL);
BufferedReader bufferedFileReader = new BufferedReader(
        new InputStreamReader(smbFileInputStream, "UTF-8"));
String line = null;
StringBuilder stringBuilder = new StringBuilder();
try {
    while ((line = bufferedFileReader.readLine()) != null) {
        if (!line.trim().isEmpty()) {
            stringBuilder.append(line);
        }
    }
    return stringBuilder.toString();
} finally {
    bufferedFileReader.close();
}

Your file is not UTF-8 encoded. Based on the garbled output, it is probably ISO-8859-1 encoded, or Windows cp1252, or even ISO-8859-15.
You should pass one of these encodings instead. It won't be obvious which of them to use until your data contains a byte that maps to a different character in each.
The Euro symbol is a good test: it doesn't exist in ISO-8859-1 and sits at different positions in cp1252 and ISO-8859-15.
Notepad++ is an awesome tool for quickly checking files with different decodings.
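If you want to try it, here is a minimal sketch assuming the jcifs 1.x SmbFileInputStream API; the URL is a placeholder and windows-1252 is only the first guess to try:
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import jcifs.smb.SmbFileInputStream;

// Hypothetical share path; replace with your own. (Exception handling omitted.)
String fileURL = "smb://server/share/file.txt";
// Try windows-1252 (then ISO-8859-1 or ISO-8859-15) instead of UTF-8:
try (BufferedReader reader = new BufferedReader(
        new InputStreamReader(new SmbFileInputStream(fileURL),
                Charset.forName("windows-1252")))) {
    String line;
    while ((line = reader.readLine()) != null) {
        System.out.println(line);
    }
}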

Related

Characters not appearing when I print a file I import?

I'm importing a file into my code and trying to print it. The file contains:
i don't like cake.
pizza is good.
i don’t like "cookies" to.
17.
29.
The second "don't" has a right single quotation mark (’), and when I print it the output is
don�t
The question mark prints as a blank square. Is there a way to convert it to a regular apostrophe?
EDIT:
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;

public class Somethingsomething {
    public static void main(String[] args) throws FileNotFoundException, IOException {
        ArrayList<String> list = new ArrayList<String>();
        File file = new File("D:\\project1Test.txt");
        if (file.exists()) { // checks if the file exists
            FileInputStream fileStream = new FileInputStream(file);
            InputStreamReader input = new InputStreamReader(fileStream);
            BufferedReader reader = new BufferedReader(input);
            String line;
            while ((line = reader.readLine()) != null) {
                list.add(line);
            }
            for (int i = 0; i < list.size(); i++) {
                System.out.println(list.get(i));
            }
        }
    }
}
It should print normally, but the second "don't" shows a white block in place of the apostrophe.
This is the file I'm using: https://www.mediafire.com/file/8rk7nwilpj7rn7s/project1Test.txt
Edit: if it helps, the full document where the character appears is here:
https://www.nytimes.com/2018/03/25/business/economy/labor-professionals.html
It's all about character encoding. The way characters are represented isn't always the same, and they tend to get misinterpreted.
Characters are usually stored as numbers that depend on the encoding standard (and there are many of them). For example, in ASCII "a" is 97 in decimal, which is 0x61 in hexadecimal; UTF-8 encodes ASCII characters with those same values.
When you see odd characters such as the one in this case (the replacement character, �), it usually means one encoding standard is being misinterpreted as another, and the replacement character stands in for an unknown or misinterpreted character.
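A minimal sketch illustrating the mismatch (the class name is mine; only standard-library calls are used):
import java.nio.charset.StandardCharsets;

public class EncodingDemo {
    public static void main(String[] args) {
        String s = "à";
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);        // two bytes: 0xC3 0xA0
        byte[] latin1 = s.getBytes(StandardCharsets.ISO_8859_1); // one byte: 0xE0
        // Decoding the Latin-1 byte as UTF-8 yields the replacement character:
        System.out.println(new String(latin1, StandardCharsets.UTF_8));    // prints �
        // Decoding the UTF-8 bytes as Latin-1 yields mojibake:
        System.out.println(new String(utf8, StandardCharsets.ISO_8859_1)); // prints Ã plus a non-breaking space
    }
}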
To fix your problem you need to tell your reader to read your file using a specific character encoding, say SOME-CHARSET.
Replace this:
InputStreamReader input = new InputStreamReader(fileStream);
with this:
InputStreamReader input = new InputStreamReader(fileStream, "SOME-CHARSET");
A list of supported charsets is available in the java.nio.charset.Charset documentation. Unfortunately, you may have to go through them one by one; start with the most common ones, such as UTF-8, ISO-8859-1, windows-1252, and UTF-16.
Your problem is almost certainly the encoding scheme you are using. You can read a file in almost any encoding scheme you want; just tell Java how your input was encoded. UTF-8 is common on Linux; the Windows native encoding is typically Cp1252 in Western locales.
This is the sort of problem you run into all the time when processing files created on a different OS.
I'll give you a different approach...
Use the appropriate means for reading plain text files. Try this:
public static String getTxtContent(String path)
{
    // FileReader(String, Charset) is Java 11+; without the charset argument,
    // FileReader falls back to the platform default encoding.
    try (BufferedReader br = new BufferedReader(
            new FileReader(path, StandardCharsets.UTF_8)))
    {
        StringBuilder sb = new StringBuilder();
        String line = br.readLine();
        while (line != null) {
            sb.append(line);
            sb.append(System.lineSeparator());
            line = br.readLine();
        }
        return sb.toString();
    } catch (IOException fex) {
        return null; // swallows the error; callers must check for null
    }
}
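Usage, with a hypothetical path (remember that this version returns null on any IOException):
String text = getTxtContent("D:\\project1Test.txt");
if (text != null) {
    System.out.print(text);
}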

Remove illegal xml characters from UTF-16LE encoded file

I have a Java application that parses an XML file encoded in UTF-16LE. The XML has been erroring out during parsing due to illegal XML characters. My solution is to read the file into a Java String, then remove the illegal characters so it can be parsed successfully. It works 99% of the time, but there are slight differences between input and output from this process, not caused by removing the illegal characters, but (I think) by going from the UTF-16LE encoding to Java's internal UTF-16 String representation.
BufferedReader reader = null;
String fileText = ""; // stored as UTF-16
try {
    reader = new BufferedReader(new InputStreamReader(in, "UTF-16LE"));
    for (String line; (line = reader.readLine()) != null; ) {
        fileText += line;
    }
} catch (Exception ex) {
    logger.log(Level.WARNING, "Error removing illegal xml characters", ex);
} finally {
    if (reader != null) {
        reader.close();
    }
}
// code to remove illegal chars from string here, irrelevant to problem
ByteArrayInputStream inStream = new ByteArrayInputStream(fileText.getBytes("UTF-16LE"));
Document doc = XmlUtil.openDocument(inStream, XML_ROOT_NODE_ELEM);
Do characters get changed or lost when going from UTF-16LE to a UTF-16 Java String? Is there a way to do this in Java while ensuring the output is exactly the same as the input?
Certainly one problem is that readLine throws away the line ending.
You would need to do something like:
fileText += line + "\r\n";
Otherwise XML attributes, DTD entities, or something else could get glued together where at least a space was required. Also you do not want the text content to be altered when it contains a line break.
Performance (speed and memory) can be improved by using a StringBuilder:
StringBuilder fileText = new StringBuilder();
... fileText.append(line).append("\r\n");
... fileText.toString();
Then there might be a problem with the first character of the file: a redundant BOM character is sometimes present. You can strip it with:
line = line.replace("\uFEFF", "");
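Putting those three fixes together, a minimal sketch (reusing the in stream from the question; StandardCharsets is from java.nio.charset):
StringBuilder fileText = new StringBuilder();
try (BufferedReader reader = new BufferedReader(
        new InputStreamReader(in, StandardCharsets.UTF_16LE))) {
    String line;
    while ((line = reader.readLine()) != null) {
        fileText.append(line).append("\r\n"); // keep the line endings readLine() strips
    }
}
// Strip a stray BOM, then re-encode for the parser:
String cleaned = fileText.toString().replace("\uFEFF", "");
ByteArrayInputStream inStream =
        new ByteArrayInputStream(cleaned.getBytes(StandardCharsets.UTF_16LE));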

My (String).split("="); isn't working?

I wanted to turn a String into a String[], but it isn't working the way I wanted it to work! My code:
public static void get(HashMap<String, String> saves, File file)
        throws UnsupportedEncodingException, FileNotFoundException, IOException {
    if (!file.exists()) {
        return;
    }
    InputStreamReader reader;
    reader = new InputStreamReader(new FileInputStream(file), "UTF-16");
    String r = null;
    String[] s;
    BufferedReader bufreader = new BufferedReader(reader);
    while ((r = bufreader.readLine()) != null) {
        s = r.split("=");
        if (s.length < 2) {
            System.out.println(s.length);
            System.out.println(s[0]);
            return;
        }
        saves.put(s[0].toString(), s[1].toString());
        s = null;
    }
}
And when I tell it to println the String to the console:
System.out.println(s.length);
System.out.println(s[0]);
it just prints:
1
??????????????????
What it should be reading (What is in the file):
1=welcome
2=hello
3=bye
4=goodbye
So I want it to put the values into the HashMap:
saves.put("1", "welcome");
saves.put("2", "hello");
saves.put("3", "bye");
saves.put("4", "goodbye");
but s = r.split("=") is not splitting; it turns the whole String into "?????????".
Thank you!
It seems you're using the wrong encoding.
Your input file is not really UTF-16, as the Java code expects.
I saved your example data in a file, and the result was similarly broken.
The default encoding on my system is UTF-8, so I changed the encoding of the file with the command:
iconv -f utf-8 -t utf-16 orig.txt > converted.txt
When using your program on converted.txt,
it produces the expected output.
It also produces the expected output if I use orig.txt,
and make this simple change in your program:
reader = new InputStreamReader(new FileInputStream(file), "UTF-8");
You can either make sure the file is UTF-16 encoded,
and if not, convert it,
or use the correct encoding when you create the InputStreamReader.
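For reference, a minimal corrected sketch of the reading loop, assuming the file really is UTF-8:
try (BufferedReader bufreader = new BufferedReader(
        new InputStreamReader(new FileInputStream(file), StandardCharsets.UTF_8))) {
    String r;
    while ((r = bufreader.readLine()) != null) {
        String[] s = r.split("=", 2); // a limit of 2 keeps any '=' inside the value intact
        if (s.length == 2) {
            saves.put(s[0], s[1]);
        }
    }
}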

Greek characters read with UTF8 charset are printed as ����

I am trying to read a file containing Greek words in UTF-8 with the following code:
reader = new BufferedReader(new InputStreamReader(new FileInputStream(file), "UTF8"));
while ((line = reader.readLine()) != null) {
    tokenizer = new StringTokenizer(line, delimiter);
    while (tokenizer.hasMoreTokens()) {
        currentToken = tokenizer.nextToken();
        map.put(currentToken, 1);
    }
}
On every forum I looked at, I saw this same new FileInputStream(file), "UTF8") construction,
but the printed result still looks like ����.
P.S. When I print a variable containing a Greek word defined in the code, it prints successfully, so the problem must be in reading the file.
Any ideas?
There is no "UTF8" charset in Java. The correct charset name is "UTF-8":
new InputStreamReader(new FileInputStream(file), "UTF-8"))
Or use StandardCharsets.UTF_8 instead to avoid any ambiguity:
new InputStreamReader(new FileInputStream(file), StandardCharsets.UTF_8))
That being said, make sure the file is actually UTF-8 encoded. If it has a UTF-8 BOM in front, you will have to either strip it off from the file itself, or manually skip it when reading the file before then reading the lines. Java readers do not recognize or skip BOMs automatically.
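A minimal sketch of skipping the BOM manually (it decodes to U+FEFF, so peek at the first character and rewind if it isn't one):
BufferedReader reader = new BufferedReader(
        new InputStreamReader(new FileInputStream(file), StandardCharsets.UTF_8));
reader.mark(1);
if (reader.read() != 0xFEFF) {
    reader.reset(); // no BOM: rewind to the start
}
// ...then read lines as usual...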
Use this for a proper conversion - this one is from ISO-8859-1 to UTF-8:
public String to_utf8(String fieldvalue) throws UnsupportedEncodingException {
    String fieldvalue_utf8 = new String(fieldvalue.getBytes("ISO-8859-1"), "UTF-8");
    return fieldvalue_utf8;
}
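Note that this trick re-encodes the string's characters as ISO-8859-1 bytes and then decodes those bytes as UTF-8, so it only repairs strings whose UTF-8 bytes were previously mis-decoded as ISO-8859-1; it is not a general-purpose conversion between encodings.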

Read a CSV file in UTF-8 format

I am reading a CSV file in Java, adding a new column with new information, and exporting it back to a CSV file. I have a problem reading the CSV file in UTF-8 format. I read line by line and store it in a StringBuilder, but when I print the line I can see that the information I'm reading is not in UTF-8 but in ANSI. I used both System.out.print and a PrintStream in UTF-8, and the information still appears in ANSI. This is my code:
BufferedReader br;
try {
    br = new BufferedReader(new InputStreamReader(new FileInputStream(
            "./users.csv"), "UTF8"));
    String line;
    while ((line = br.readLine()) != null) {
        if (line.contains("none#none.com")) {
            continue;
        }
        if (!line.contains("#") && !line.contains("FirstName")) {
            continue;
        }
        PrintStream ps = new PrintStream(System.out, true, "UTF-8");
        ps.print(line + "\n");
        sbusers.append(line);
        sbusers.append("\n");
        sbusers2.append(line);
        sbusers2.append(",");
    }
    br.close();
} catch (IOException e) {
    System.out.println("Failed to read users file.");
} finally {
}
It prints out information like "Professor -P�s". Since the reading isn't being done correctly, the output to the new file is also exported in ANSI.
Are you sure your CSV is UTF-8 encoded? My guess is that it's not. Try using ISO-8859-1 for reading the file, but keep the output as UTF-8. (Both "UTF8" and "UTF-8" tend to work, but you should use "UTF-8", as @Marcelo suggested.)
In the line:
br = new BufferedReader(new InputStreamReader(new FileInputStream("./users.csv"),"UTF8"));
Your charset should be "UTF-8" not "UTF8".
Printing to System.out using UTF encoding? Why would you do that? System.out and the encoding it uses are determined at the OS level (it becomes the default charset in the JVM), and that's the only encoding you want to use on System.out.
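A quick way to check what that default is on your machine (standard JDK calls only):
System.out.println(java.nio.charset.Charset.defaultCharset()); // JVM default charset
System.out.println(System.getProperty("file.encoding"));       // related system property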
First, as suggested by @Marcelo, use "UTF-8" instead of "UTF8":
BufferedReader in = new BufferedReader(
        new InputStreamReader(
                new FileInputStream("./users.csv"), "UTF-8"));
Second, forget about the PrintStream, just use System.out, or better yet, a logging API. You don't need to worry about how Java will output your string to the console (number one rule about character encoding: After you've read things successfully, let Java handle the encoding and only worry about it again when you are writing to an external file / database / etc).
Third and more important, check that your file is really encoded in UTF-8, this is the cause of 99% of the encoding problems.
Make sure that you test with a real UTF-8 file (use tools like iconv to convert to UTF-8 and be sure about it).
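For the writing side of the question, a minimal sketch with an explicit charset (the output file name is a placeholder):
try (Writer out = new BufferedWriter(new OutputStreamWriter(
        new FileOutputStream("./users-out.csv"), StandardCharsets.UTF_8))) {
    out.write(sbusers.toString());
}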
I found a potential solution (I had the same problem). The file may not actually be UTF-8 encoded; depending on its real encoding you need to specify something else.
Replace:
br = new BufferedReader(new InputStreamReader(new FileInputStream(
        "./users.csv"), "UTF8"));
With:
br = new BufferedReader(new InputStreamReader(new FileInputStream(
        "./users.csv"), "ISO-8859-1"));
For further understanding: https://mincong.io/2019/04/07/understanding-iso-8859-1-and-utf-8/
