My (String).split("="); isn't working? - java

I wanted to turn a String into a String[], but it isn't working the way I wanted it to! My code:
public static void get(HashMap<String, String> saves, File file) throws UnsupportedEncodingException, FileNotFoundException, IOException {
    if (!file.exists()) {
        return;
    }
    InputStreamReader reader;
    reader = new InputStreamReader(new FileInputStream(file), "UTF-16");
    String r = null;
    String[] s;
    BufferedReader bufreader = new BufferedReader(reader);
    while ((r = bufreader.readLine()) != null) {
        s = r.split("=");
        if (s.length < 2) {
            System.out.println(s.length);
            System.out.println(s[0]);
            return;
        }
        saves.put(s[0].toString(), s[1].toString());
        s = null;
    }
}
Also, when I tell it to println the results to the console,
System.out.println(s.length);
System.out.println(s[0]);
it just prints:
1
??????????????????
What it should be reading (What is in the file):
1=welcome
2=hello
3=bye
4=goodbye
So I want it to put the values into the HashMap:
saves.put("1", "welcome");
saves.put("2", "hello");
saves.put("3", "bye");
saves.put("4", "goodbye");
but s = r.split("=") is not splitting; it is turning the String into "?????????".
Thank you!

It seems you're using the wrong encoding: your input file is not really UTF-16, which is what the Java code expects.
I saved your example data in a file, and the result was similarly broken.
The default encoding on my system is UTF-8, so I changed the encoding of the file with the command:
iconv -f utf-8 -t utf-16 orig.txt > converted.txt
When using your program on converted.txt,
it produces the expected output.
It also produces the expected output if I use orig.txt,
and make this simple change in your program:
reader = new InputStreamReader(new FileInputStream(file), "UTF-8");
You can either make sure the file is UTF-16 encoded,
and if not, convert it,
or use the correct encoding when you create the InputStreamReader.
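If you control how the file is produced, the least fragile approach is to pin the charset on both the write and the read side. Here is a minimal sketch of the reading half, assuming the file is UTF-8 (swap in StandardCharsets.UTF_16 if you convert the file instead):
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;

public class SavesReader {
    public static void get(HashMap<String, String> saves, Path file) throws IOException {
        if (!Files.exists(file)) {
            return;
        }
        // Files.readAllLines decodes with exactly the charset you pass in, no guessing
        for (String line : Files.readAllLines(file, StandardCharsets.UTF_8)) {
            String[] s = line.split("=", 2); // limit of 2 keeps any '=' inside the value intact
            if (s.length == 2) {
                saves.put(s[0], s[1]);
            }
        }
    }
}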

Related

Incorrect printing of non-English characters with Java

I thought this was only an issue with Python 2, but I have run into a similar issue now with Java (Windows 10, JDK 8).
My searches have led to little resolution so far.
I read this value from the 'stdin' input stream: Viļāni. When I print it to the console I get this: Vi????ni.
Relevant code snippets are as follows:
BufferedReader in = new BufferedReader(new InputStreamReader(System.in, StandardCharsets.UTF_8));
ArrayList<String> corpus = new ArrayList<String>();
String inputString = null;
while ((inputString = in.readLine()) != null) {
    corpus.add(inputString);
}
String[] allCorpus = new String[corpus.size()];
allCorpus = corpus.toArray(allCorpus);
for (String line : allCorpus) {
    System.out.println(line);
}
Further expansion of my problem follows:
I read a file containing the following 2 lines:
を
Sōten_Kōro
When I read this from disk and output to a second file I get the following output:
ã‚’
S�ten_K�ro
When I read the file from stdin using cat testinput.txt | java UTF8Tester I get the following output:
???
S??ten_K??ro
Both are obviously wrong. I need to be able to print the correct characters to console and file. My sample code is as follows:
public class UTF8Tester {
    public static void main(String args[]) throws Exception {
        BufferedReader stdinReader = new BufferedReader(new InputStreamReader(System.in, StandardCharsets.UTF_8));
        String[] stdinData = readLines(stdinReader);
        printToFile(stdinData, "stdin_out.txt");
        BufferedReader fileReader = new BufferedReader(new FileReader("testinput.txt"));
        String[] fileData = readLines(fileReader);
        printToFile(fileData, "file_out.txt");
    }

    private static void printToFile(String[] data, String fileName)
            throws FileNotFoundException, UnsupportedEncodingException {
        PrintWriter writer = new PrintWriter(fileName, "UTF-8");
        for (String line : data) {
            writer.println(line);
        }
        writer.close();
    }

    private static String[] readLines(BufferedReader reader) throws IOException {
        ArrayList<String> corpus = new ArrayList<String>();
        String inputString = null;
        while ((inputString = reader.readLine()) != null) {
            corpus.add(inputString);
        }
        String[] allCorpus = new String[corpus.size()];
        return corpus.toArray(allCorpus);
    }
}
Really stuck here and help would really be appreciated! Thanks in advance. Paul
System.in/out use the default Windows character set.
A Java String uses Unicode internally.
FileReader/FileWriter are old utility classes that use the default character set, so they are only suitable for non-portable, local files.
The error you saw was a special character encoded as a two-byte UTF-8 sequence, with each byte then interpreted in the default single-byte encoding; since neither byte value maps to a character there, each became a ? substitution.
What is required is that the character can be entered on System.in in the default charset; the String is then decoded from that default charset. Writing it to a file in UTF-8 requires specifying UTF-8 explicitly.
Hence:
BufferedReader stdinReader = new BufferedReader(new InputStreamReader(System.in));
String[] stdinData = readLines(stdinReader);
printToFile(stdinData, "stdin_out.txt");
Path path = Paths.get("testinput-utf8.txt");
List<String> lines = Files.readAllLines(path); // Here the default is UTF-8!
Path winPath = Paths.get("testinput-winlatin1.txt");
List<String> winLines = Files.readAllLines(winPath, Charset.forName("windows-1252"));
Files.write(Paths.get("file_out.txt"), lines, StandardCharsets.UTF_8);
To check whether your current computer system handles Japanese:
System.out.println("Hiragana letter Wo '\u3092'."); // Either を or ?.
If you see ?, the conversion to the default system encoding could not represent the character.
を is U+3092, written in ASCII source as \u3092.
To create a UTF-8 text file under Windows:
Files.write(Paths.get("out-utf8.txt"),
"\uFEFFHiragana letter Wo '\u3092'.".getBytes(StandardCharsets.UTF_8));
Here I use an ugly (generally unneeded) BOM marker char \uFEFF (the zero-width no-break space) that will let Windows Notepad recognize the text as being UTF-8.
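If you write the BOM like this, remember to strip it again when reading the file back: Java's UTF-8 decoder does not remove it for you. A small sketch (the file name is taken from the snippet above):
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class BomStrip {
    public static void main(String[] args) throws IOException {
        String text = new String(
                Files.readAllBytes(Paths.get("out-utf8.txt")), StandardCharsets.UTF_8);
        // The BOM survives decoding as U+FEFF at index 0; drop it if present
        if (!text.isEmpty() && text.charAt(0) == '\uFEFF') {
            text = text.substring(1);
        }
        System.out.println(text);
    }
}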

Characters not appearing when I print a file I've imported?

I'm importing a file into my code and trying to print it. The file contains
i don't like cake.
pizza is good.
i don’t like "cookies" to.
17.
29.
The second don't has a "right single quotation mark", and when I print it the output is
don�t
The question mark is printed as a blank square. Is there a way to convert it to a regular apostrophe?
EDIT:
public class Somethingsomething {
    public static void main(String[] args) throws FileNotFoundException, IOException {
        ArrayList<String> list = new ArrayList<String>();
        File file = new File("D:\\project1Test.txt");
        if (file.exists()) { // checks if file exists
            FileInputStream fileStream = new FileInputStream(file);
            InputStreamReader input = new InputStreamReader(fileStream);
            BufferedReader reader = new BufferedReader(input);
            String line;
            while ((line = reader.readLine()) != null) {
                list.add(line);
            }
            for (int i = 0; i < list.size(); i++) {
                System.out.println(list.get(i));
            }
        }
    }
}
It should print as normal, but the second "don't" has a white block in place of the apostrophe.
this is the file I'm using https://www.mediafire.com/file/8rk7nwilpj7rn7s/project1Test.txt
Edit: if it helps even more, the full document where the character is found is here:
https://www.nytimes.com/2018/03/25/business/economy/labor-professionals.html
It's all about character encoding. The way characters are represented isn't always the same, and they tend to get misinterpreted.
Characters are usually stored as numbers that depend on the encoding standard (and there are many of them). For example, in ASCII the letter "a" is 97 in decimal (0x61 in hex), and UTF-8 encodes it the same way; other encodings can assign the same numbers to entirely different characters.
When you see funny characters such as � (the replacement character), it usually means text in one encoding is being misinterpreted as another, and the replacement character stands in for the unknown or misinterpreted character.
To fix your problem you need to tell your reader to read your file using a specific character encoding, say SOME-CHARSET.
Replace this:
InputStreamReader input = new InputStreamReader(fileStream);
with this:
InputStreamReader input = new InputStreamReader(fileStream, "SOME-CHARSET");
A list of charsets is available here. Unfortunately, you might have to go through them one by one; a short list of the most common ones can be found here.
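Rather than editing the charset name by hand for every attempt, you can decode the same bytes with a handful of candidates and eyeball which result looks right. A rough sketch (the candidate list and the file path are assumptions, substitute your own):
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Paths;

public class CharsetGuess {
    public static void main(String[] args) throws IOException {
        byte[] bytes = Files.readAllBytes(Paths.get("D:\\project1Test.txt"));
        for (String name : new String[] {"UTF-8", "windows-1252", "ISO-8859-1", "UTF-16"}) {
            // Decode the same raw bytes with each candidate and compare the output
            System.out.println(name + ": " + new String(bytes, Charset.forName(name)));
        }
    }
}
Bear in mind the console itself can garble the output, so writing the decoded strings to a UTF-8 file is a more reliable way to compare.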
Your problem is almost certainly the encoding scheme you are using. You can read a file in almost any encoding scheme you want; just tell Java how the input was encoded. UTF-8 is common on Linux, while Windows locales default to an ANSI code page such as windows-1252 (Western European) or windows-1250 (Central European).
This is the sort of problem you have all the time when processing files created on a different OS. See here and here.
I'll give you a different approach...
Use the appropriate means for reading plain text files. Try this:
public static String getTxtContent(String path)
{
    try (BufferedReader br = new BufferedReader(new FileReader(path)))
    {
        StringBuilder sb = new StringBuilder();
        String line = br.readLine();
        while (line != null) {
            sb.append(line);
            sb.append(System.lineSeparator());
            line = br.readLine();
        }
        return sb.toString();
    } catch (IOException fex) {
        return null;
    }
}
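Note, though, that FileReader decodes with the platform default charset, so this only sidesteps the original problem when the file happens to match that default. A variant of the same helper with an explicit charset (UTF-8 is an assumption here, substitute the file's real encoding):
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class TxtContent {
    public static String getTxtContent(String path) {
        // Files.newBufferedReader decodes with exactly the charset named here
        try (BufferedReader br = Files.newBufferedReader(Paths.get(path), StandardCharsets.UTF_8)) {
            StringBuilder sb = new StringBuilder();
            String line;
            while ((line = br.readLine()) != null) {
                sb.append(line).append(System.lineSeparator());
            }
            return sb.toString();
        } catch (IOException fex) {
            return null;
        }
    }
}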

Read a CSV file in UTF-8 format

I am reading a CSV file in Java, adding a new column with new information, and exporting it back to a CSV file. I have a problem reading the CSV file in UTF-8 format. I read line by line and store it in a StringBuilder, but when I print the line I can see that the information I'm reading is not in UTF-8 but in ANSI. I used both System.out.print and a PrintStream in UTF-8, and the information still appears in ANSI. This is my code:
BufferedReader br;
try {
    br = new BufferedReader(new InputStreamReader(new FileInputStream(
            "./users.csv"), "UTF8"));
    String line;
    while ((line = br.readLine()) != null) {
        if (line.contains("none#none.com")) {
            continue;
        }
        if (!line.contains("#") && !line.contains("FirstName")) {
            continue;
        }
        PrintStream ps = new PrintStream(System.out, true, "UTF-8");
        ps.print(line + "\n");
        sbusers.append(line);
        sbusers.append("\n");
        sbusers2.append(line);
        sbusers2.append(",");
    }
    br.close();
} catch (IOException e) {
    System.out.println("Failed to read users file.");
} finally {
}
It prints out information like "Professor -P�s". Since the reading isn't being done correctly, the output to the new file is also exported in ANSI.
Are you sure your CSV is UTF-8 encoded? My guess is that it's not. Try using ISO-8859-1 for reading the file, but keep the output as UTF-8. ("UTF8" and "UTF-8" both tend to work, but you should use "UTF-8", as @Marcelo suggested.)
In the line:
br = new BufferedReader(new InputStreamReader(new FileInputStream("./users.csv"),"UTF8"));
Your charset name should be "UTF-8", not "UTF8" (the JDK accepts "UTF8" as an alias in the java.io APIs, but "UTF-8" is the canonical name).
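For what it's worth, you can check how your JVM resolves a charset name; on the JDKs I have used, "UTF8" is registered as an alias of UTF-8, so both spellings end up at the same charset in the java.io constructors:
import java.nio.charset.Charset;

public class CharsetAlias {
    public static void main(String[] args) {
        // Prints "UTF-8": the alias resolves to the canonical charset name
        System.out.println(Charset.forName("UTF8"));
        // Prints the registered alias set, e.g. [unicode-1-1-utf-8, UTF8]
        System.out.println(Charset.forName("UTF8").aliases());
    }
}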
Printing to System.out using UTF encoding? Why would you do that? System.out and the encoding it uses are determined at the OS level (that becomes the default charset in the JVM), and that is the only encoding you want to use on System.out.
First, as suggested by @Marcelo, use "UTF-8" instead of "UTF8":
BufferedReader in = new BufferedReader(
        new InputStreamReader(
                new FileInputStream("./users.csv"), "UTF-8"));
Second, forget about the PrintStream, just use System.out, or better yet, a logging API. You don't need to worry about how Java will output your string to the console (number one rule about character encoding: After you've read things successfully, let Java handle the encoding and only worry about it again when you are writing to an external file / database / etc).
Third and more important, check that your file is really encoded in UTF-8, this is the cause of 99% of the encoding problems.
Make sure that you test with a real UTF-8 file (use tools like iconv to convert to UTF-8 and be sure about it).
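One way to settle the "is it really UTF-8?" question from inside Java is to decode the raw bytes with a decoder configured to fail loudly instead of substituting replacement characters. A sketch (the file name is taken from the question):
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class Utf8Check {
    public static void main(String[] args) throws IOException {
        byte[] bytes = Files.readAllBytes(Paths.get("./users.csv"));
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes)); // throws on any invalid byte sequence
            System.out.println("File decodes cleanly as UTF-8");
        } catch (CharacterCodingException e) {
            System.out.println("Not valid UTF-8: " + e);
        }
    }
}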
I found a potential solution (I had the same problem). Depending on how the file is actually encoded, you may need to specify the charset differently...
Replace:
br = new BufferedReader(new InputStreamReader(new FileInputStream(
        "./users.csv"), "UTF8"));
With:
br = new BufferedReader(new InputStreamReader(new FileInputStream(
        "./users.csv"), "ISO-8859-1"));
For further understanding: https://mincong.io/2019/04/07/understanding-iso-8859-1-and-utf-8/

StringEscapeUtils.unescapeHtml doesn't work on strings read from files

I'm trying to read in a file that contains Unicode escapes, convert them to their corresponding symbols, and then print the resulting text to a new file. I'm trying to use StringEscapeUtils.unescapeHtml to do this, but the lines are just printed as-is, with the Unicode points still intact. I did a practice run by copying a single line from the file, making a String from it, and then calling StringEscapeUtils.unescapeHtml on that String, which works perfectly. My code is below:
class FileWrite
{
    public static void main(String args[])
    {
        try {
            String testString = " \"text\":\"Dude With Knit Hat At Party Calls Beer \u2018Libations\u2019 http://t.co/rop8NSnRFu\" ";
            FileReader instream = new FileReader("Home Timeline.txt");
            BufferedReader b = new BufferedReader(instream);
            FileWriter fstream = new FileWriter("out.txt");
            BufferedWriter out = new BufferedWriter(fstream);
            // This gives the desired output, with unicode points converted
            out.write(StringEscapeUtils.unescapeHtml3(testString) + "\n");
            String line = b.readLine();
            while (line != null) {
                out.write(StringEscapeUtils.unescapeHtml3(line) + "\n");
                line = b.readLine();
            }
            // Close the output streams
            b.close();
            out.close();
        }
        catch (Exception e) { // Catch exception if any
            System.err.println("Error: " + e.getMessage());
        }
    }
}
//This gives the desired output,
//with unicode points converted
out.write(StringEscapeUtils.unescapeHtml3(testString) + "\n");
You are mistaken. Java unescapes String literals of this form at compile time, when it builds them into the class file:
"\u2018Libations\u2019"
There are no HTML escapes left in this code. The method you have chosen is designed to unescape HTML entities such as &#8216;.
You probably want the unescapeJava method.
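For lines that really do contain the six-character sequence \u2018 (backslash, u, and four hex digits) at runtime, unescapeJava converts it to the actual code point. A small sketch using commons-lang3 (note the doubled backslashes, so the escapes survive compilation and reach the method as text):
import org.apache.commons.lang3.StringEscapeUtils;

public class UnescapeDemo {
    public static void main(String[] args) {
        // Simulates a line read from the file, containing literal \u escapes
        String raw = "Beer \\u2018Libations\\u2019";
        // unescapeJava turns the textual escapes into U+2018 and U+2019
        System.out.println(StringEscapeUtils.unescapeJava(raw));
    }
}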
Your strings are being both read and written using your platform's default encoding. You want to explicitly specify the character set as UTF-8:
Input stream:
BufferedReader b = new BufferedReader(new InputStreamReader(
new FileInputStream("Home Timeline.txt"),
Charset.forName("UTF-8")));
Output stream:
BufferedWriter out = new BufferedWriter(new OutputStreamWriter(
new FileOutputStream("out.txt"),
Charset.forName("UTF-8")));

Which encoding does Process.getInputStream() use?

In a Java program, I spawn a new Process via ProcessBuilder.
args[0] = directory.getAbsolutePath() + File.separator + program;
ProcessBuilder pb = new ProcessBuilder(args);
pb.directory(directory);
final Process process = pb.start();
Then I read the process's standard output on a new thread:
new Thread() {
    public void run() {
        try {
            BufferedReader reader = new BufferedReader(
                    new InputStreamReader(process.getInputStream()));
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        } catch (IOException e) { // readLine can throw; run() cannot declare it
            e.printStackTrace();
        }
    }
}.start();
However, when the process outputs non-ASCII characters (such as 'é'), the line has character '\uFFFD' instead.
What is the encoding in the InputStream returned by getInputStream (my platform is Windows in Europe)?
How can I change things so that line contains the expected data (i.e. '\u00E9' for 'é')?
Edit: I tried new InputStreamReader(...,"UTF-8"):
é becomes \uFFFD
An InputStream is a binary stream, so there is no encoding. When you create the Reader, you need to know what character encoding to use, and that would depend on what the program you called produces (Java will not convert it in any way).
If you do not specify anything for InputStreamReader, it will use the platform default encoding, which may not be appropriate. There is another constructor that allows you to specify the encoding.
If you know what encoding to use (and you really have to know):
new InputStreamReader(process.getInputStream(), "UTF-8") // for example
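Putting that together, a minimal sketch of reading a child process with an explicit charset; the command name is a placeholder, and UTF-8 is only correct if that is what the child actually emits:
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class ProcessRead {
    public static void main(String[] args) throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder("myprogram"); // hypothetical command
        pb.redirectErrorStream(true); // fold stderr into the same stream
        Process process = pb.start();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(process.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
        process.waitFor();
    }
}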
Interestingly enough, when running on Windows:
ProcessBuilder pb = new ProcessBuilder("cmd", "/c dir");
Process process = pb.start();
Then CP437 code page works quite well for
new InputStreamReader(process.getInputStream(), "CP437");
As I understand it, operating system streams are byte streams; there are no characters there. The InputStreamReader constructor uses the JVM default character set (java.nio.charset.Charset#defaultCharset()); you can use another constructor to explicitly specify a character set.
Note that '\uFFFD' is not 'é': 'é' is U+00E9 (see http://www.fileformat.info/info/unicode/char/e9/index.htm), while '\uFFFD' is the Unicode replacement character, which the decoder substitutes when it cannot map the bytes it reads. The Windows console also does not support Unicode by default, so to test your code reliably, open a file and write your stream there. But do not forget to set the encoding to UTF-8.
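To take the console out of the equation, you can dump the decoded lines straight to a UTF-8 file and inspect it in an editor. A sketch (the output file name is an assumption):
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

public class DumpUtf8 {
    public static void dump(List<String> lines) throws IOException {
        // Written with an explicit charset, so an editor or hex dump shows the truth
        Files.write(Paths.get("process_out.txt"), lines, StandardCharsets.UTF_8);
    }
}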
A more scientific approach, which works perfectly on Windows:
private static final Charset CONSOLE_ENCODING;

static {
    Charset enc = Charset.defaultCharset();
    try {
        String example = "äöüßДŹす";
        String command = File.separatorChar == '/' ? "echo " + example : "cmd.exe /c echo " + example;
        Process exec = Runtime.getRuntime().exec(command);
        InputStream inputStream = exec.getInputStream();
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        while (exec.isAlive()) {
            Thread.sleep(100);
        }
        byte[] buff = new byte[inputStream.available()];
        if (buff.length > 0) {
            int count = inputStream.read(buff);
            baos.write(buff, 0, count);
        }
        byte[] array = baos.toByteArray();
        for (Charset charset : Charset.availableCharsets().values()) {
            String s = new String(array, charset);
            if (s.equals(example)) {
                enc = charset;
                break;
            }
        }
    } catch (InterruptedException e) {
        throw new Error("Could not determine console charset.", e);
    } catch (IOException e) {
        throw new Error("Could not determine console charset.", e);
    }
    CONSOLE_ENCODING = enc;
}
According to the specification, there is no provision for changing the runtime encoding of the JVM; we also cannot be sure that the encoding does not change while the program is running, or that a detected charset is still correct after such a change.
If you, like me, know which encoding you want to use for all input/output, you can encode it in the Java API calls, in the Reader constructors that accept a charset, as some other answers have pointed out. But that hard-codes it in the source, which might or might not be OK. I found a better way after reading this answer, which reveals that you can set the encoding to what you need before the JVM starts up:
java -Dfile.encoding=ISO-8859-1 ...
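You can confirm what the flag did by printing the default charset at startup; this is what every no-charset constructor in java.io falls back to:
import java.nio.charset.Charset;

public class DefaultCharset {
    public static void main(String[] args) {
        // Reflects -Dfile.encoding; note that since JDK 18 the default is UTF-8 regardless of OS locale
        System.out.println(Charset.defaultCharset());
        System.out.println(System.getProperty("file.encoding"));
    }
}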
I put this as a comment, but I see there was an answer after it, so this might be redundant now :)
BufferedReader br = new BufferedReader(
new InputStreamReader(conn.getInputStream(), "UTF-8"));
Use StringEscapeUtils from the commons-lang jar for this; note it operates on Strings, not streams (and for this question you likely want unescapeHtml rather than escapeHtml), so read a line first:
BufferedReader br = new BufferedReader(
        new InputStreamReader(conn.getInputStream(), "UTF-8"));
String unescaped = StringEscapeUtils.unescapeHtml(br.readLine());
