I thought this was only an issue with Python 2, but I have now run into a similar issue with Java (Windows 10, JDK 8).
My searches have led to little resolution so far.
I read this value from the 'stdin' input stream: Viļāni. When I print it to the console I get this: Vi????ni.
Relevant code snippets are as follows:
BufferedReader in = new BufferedReader(new InputStreamReader(System.in, StandardCharsets.UTF_8));
ArrayList<String> corpus = new ArrayList<String>();
String inputString = null;
while ((inputString = in.readLine()) != null) {
corpus.add(inputString);
}
String[] allCorpus = new String[corpus.size()];
allCorpus = corpus.toArray(allCorpus);
for (String line : allCorpus) {
System.out.println(line);
}
Further expansion on my problem as follows:
I read a file containing the following 2 lines:
を
Sōten_Kōro
When I read this from disk and output to a second file I get the following output:
ã‚’
S�ten_K�ro
When I read the file from stdin using cat testinput.txt | java UTF8Tester I get the following output:
???
S??ten_K??ro
Both are obviously wrong. I need to be able to print the correct characters to console and file. My sample code is as follows:
public class UTF8Tester {
public static void main(String args[]) throws Exception {
BufferedReader stdinReader = new BufferedReader(new InputStreamReader(System.in, StandardCharsets.UTF_8));
String[] stdinData = readLines(stdinReader);
printToFile(stdinData, "stdin_out.txt");
BufferedReader fileReader = new BufferedReader(new FileReader("testinput.txt"));
String[] fileData = readLines(fileReader);
printToFile(fileData, "file_out.txt");
}
private static void printToFile(String[] data, String fileName)
throws FileNotFoundException, UnsupportedEncodingException {
PrintWriter writer = new PrintWriter(fileName, "UTF-8");
for (String line : data) {
writer.println(line);
}
writer.close();
}
private static String[] readLines(BufferedReader reader) throws IOException {
ArrayList<String> corpus = new ArrayList<String>();
String inputString = null;
while ((inputString = reader.readLine()) != null) {
corpus.add(inputString);
}
String[] allCorpus = new String[corpus.size()];
return corpus.toArray(allCorpus);
}
}
Really stuck here and help would really be appreciated! Thanks in advance. Paul
System.in/out will use the default Windows character set.
Java String will use Unicode internally.
FileReader/FileWriter are old utility classes that use the default character set, hence they are for non-portable local files only.
The error you saw was a special character stored as a two-byte UTF-8 sequence, with each of those bytes then interpreted in the default single-byte encoding. Neither byte value maps to a character there, hence the ? substitution appearing twice.
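To make the mechanism concrete, here is a minimal sketch. The class and method names are made up for illustration, and windows-1252 is only assumed to stand in for the default Windows charset; your machine may use a different code page. It shows both failure modes from the question: UTF-8 bytes decoded with a single-byte charset, and characters that the target charset cannot represent at all.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class MojibakeDemo {
    // Assumption: windows-1252 stands in for the platform default charset.
    static final Charset CP1252 = Charset.forName("windows-1252");

    /** Encodes text as UTF-8, then wrongly decodes those bytes as windows-1252. */
    static String utf8ReadAsCp1252(String text) {
        return new String(text.getBytes(StandardCharsets.UTF_8), CP1252);
    }

    /** Round-trips text through windows-1252; unmappable characters become '?'. */
    static String throughCp1252(String text) {
        return new String(text.getBytes(CP1252), CP1252);
    }

    public static void main(String[] args) {
        // Print code points rather than glyphs so the console encoding cannot interfere.
        for (char c : utf8ReadAsCp1252("\u3092").toCharArray()) {
            System.out.printf("U+%04X%n", (int) c);
        }
        System.out.println(throughCp1252("Vi\u013C\u0101ni"));
    }
}
```

The first helper turns the single character を into three characters (U+00E3, U+201A, U+2019), which is the ã‚’ pattern from the file output in the question. The second replaces each letter that windows-1252 lacks with a ?.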
For this to work, the character must be enterable on System.in in the default charset.
The String is then converted from the default charset.
Writing it to a file in UTF-8 requires specifying UTF-8 explicitly.
Hence:
BufferedReader stdinReader = new BufferedReader(new InputStreamReader(System.in));
String[] stdinData = readLines(stdinReader);
printToFile(stdinData, "stdin_out.txt");
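Path utf8Path = Paths.get("testinput-utf8.txt");
List<String> utf8Lines = Files.readAllLines(utf8Path); // Here the default is UTF-8!
Path latin1Path = Paths.get("testinput-winlatin1.txt");
// readAllLines takes a Charset, not a charset name
List<String> latin1Lines = Files.readAllLines(latin1Path, Charset.forName("Windows-1252"));
// Files.write expects the target path first, then the lines
Files.write(Paths.get("file_out.txt"), utf8Lines, StandardCharsets.UTF_8);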
To check whether your current computer system handles Japanese:
System.out.println("Hiragana letter Wo '\u3092'."); // Either を or ?.
If you see ?, the conversion to the default system encoding could not represent the character.
を is U+3092, written in ASCII-only source code as the escape \u3092.
To create a UTF-8 text file under Windows:
Files.write(Paths.get("out-utf8.txt"),
"\uFEFFHiragana letter Wo '\u3092'.".getBytes(StandardCharsets.UTF_8));
Here I use an ugly (generally unneeded) BOM character \uFEFF (a zero-width no-break space) that lets Windows Notepad recognize the text as UTF-8.
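For the console half of the question, one option is to wrap System.out in a PrintStream with an explicit encoding instead of relying on the default. The following is a sketch only: the class and helper names are hypothetical, and whether the characters actually display still depends on the terminal (on Windows, for instance, the console code page and font, which this code does not control).

```java
import java.io.OutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

class Utf8Console {
    /** Wraps a byte stream in an auto-flushing PrintStream that encodes text as UTF-8. */
    static PrintStream utf8(OutputStream out) {
        try {
            return new PrintStream(out, true, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        PrintStream console = utf8(System.out);
        // The bytes written are UTF-8; the terminal must also be set to UTF-8 to show them.
        console.println("Hiragana letter Wo '\u3092'.");
    }
}
```

When the output is redirected to a file or piped to another program, the bytes are UTF-8 regardless of the platform default.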
I'm importing a file into my code and trying to print it. The file contains:
i don't like cake.
pizza is good.
i don’t like "cookies" to.
17.
29.
The second "don’t" contains a right single quotation mark, and when I print it the output is
don�t
The question mark is printed as a blank square. Is there a way to convert it to a regular apostrophe?
EDIT:
public class Somethingsomething {
public static void main(String[] args) throws FileNotFoundException,
IOException {
ArrayList<String> list = new ArrayList<String>();
File file = new File("D:\\project1Test.txt");//D:\\project1Test.txt
if(file.exists()){//checks if file exist
FileInputStream fileStream = new FileInputStream(file);
InputStreamReader input = new InputStreamReader(fileStream);
BufferedReader reader = new BufferedReader(input);
String line;
while( (line = reader.readLine()) != null) {
list.add(line);
}
for(int i = 0; i < list.size(); i ++){
System.out.println(list.get(i));
}
}
}}
It should print normally, but the second "don't" has a white block in place of the apostrophe.
This is the file I'm using: https://www.mediafire.com/file/8rk7nwilpj7rn7s/project1Test.txt
Edit: in case it helps, the full document where the character appears is here:
https://www.nytimes.com/2018/03/25/business/economy/labor-professionals.html
It’s all about character encoding. The way characters are represented isn't always the same and they tend to get misinterpreted.
Characters are stored as numbers whose values depend on the encoding standard (and there are many of them). For example, "a" is 97 (0x61) in both ASCII and UTF-8, but "é" is the single byte 0xE9 in ISO-8859-1 and the two bytes 0xC3 0xA9 in UTF-8.
When you see funny characters such as the question mark in this case (the replacement character, U+FFFD), it usually means that text in one encoding is being interpreted as another, and the replacement character stands in for the unknown or misinterpreted character.
To fix your problem you need to tell your reader to read your file using a specific character encoding, say SOME-CHARSET.
Replace this:
InputStreamReader input = new InputStreamReader(fileStream);
with this:
InputStreamReader input = new InputStreamReader(fileStream, "SOME-CHARSET");
A list of charsets is available here. Unfortunately, you may have to go through them one by one. A short list of the most common ones can be found here.
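Trying charsets one by one can be automated. The sketch below uses hypothetical names and an assumed short candidate list; it decodes the same raw bytes with each candidate so you can see which one renders the apostrophe correctly. It also includes a small helper for the original request of turning the typographic apostrophe into a plain one.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Paths;

class CharsetProbe {
    // Assumption: a short list of likely candidates; extend it as needed.
    static final String[] CANDIDATES = {"UTF-8", "windows-1252", "ISO-8859-1"};

    /** Decodes the raw bytes with the named charset. */
    static String decode(byte[] raw, String charsetName) {
        return new String(raw, Charset.forName(charsetName));
    }

    /** Replaces the right single quotation mark (U+2019) with an ASCII apostrophe. */
    static String plainApostrophes(String text) {
        return text.replace('\u2019', '\'');
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java CharsetProbe <file>");
            return;
        }
        byte[] raw = Files.readAllBytes(Paths.get(args[0]));
        for (String name : CANDIDATES) {
            System.out.println("== " + name + " ==");
            System.out.println(plainApostrophes(decode(raw, name)));
        }
    }
}
```

If the file was saved as windows-1252 (a guess, since the question does not say), the byte 0x92 decodes to the right single quotation mark under that charset and to the replacement character under UTF-8.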
Your problem is almost certainly the encoding scheme you are using. You can read a file in almost any encoding scheme you want; just tell Java how your input was encoded. UTF-8 is common on Linux. Windows typically uses a legacy code page such as CP-1252 (Western European) or CP-1250 (Central European).
This is the sort of problem you have all the time if you are processing files created on a different OS.
I'll give you a different approach...
Use the appropriate means for reading plain text files. Try this:
public static String getTxtContent(String path)
{
try(BufferedReader br = new BufferedReader(new FileReader(path)))
{
StringBuilder sb = new StringBuilder();
String line = br.readLine();
while (line != null) {
sb.append(line);
sb.append(System.lineSeparator());
line = br.readLine();
}
return sb.toString();
}catch(IOException fex){ return null; }
}
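Note that FileReader, as used above, decodes with the platform default charset, so on its own it would not change the result. A variant that takes the charset explicitly could look like the following sketch; the class name, method name, and the windows-1252 choice in main are assumptions for illustration.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class TextFiles {
    /** Reads the whole file and decodes its bytes with the given charset. */
    static String read(Path path, Charset charset) throws IOException {
        return new String(Files.readAllBytes(path), charset);
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java TextFiles <file>");
            return;
        }
        // Assumption: the file was saved as windows-1252; substitute the real charset.
        System.out.print(read(Paths.get(args[0]), Charset.forName("windows-1252")));
    }
}
```

Reading all bytes at once is fine for small files; for large ones, Files.newBufferedReader(path, charset) gives the same explicit decoding line by line.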
I'm writing a small Java program to modify a text file.
Here's what I'm trying to work with: we have a long file to be imported into our accounting system, but it is generated with some information that doesn't matter to the accountant and just makes his work harder.
The file comes in lines like this:
00;1;1;22012018;"1779";"C";0;0;139084;0;2;0;"RECEBTO TITULO 001/000664/02 - EDGAR ROSA DA TRINDADE";7;19247028000;4561000150;1;
What I'm trying to do is remove the 001/ and the /02 from that line.
The output would then look like this (for 2000+ lines):
00;1;1;22012018;"1779";"C";0;0;139084;0;2;0;"RECEBTO TITULO 000664 - EDGAR ROSA DA TRINDADE";7;19247028000;4561000150;1;
These aren't fixed values, so I can't just call .replaceAll() with a literal on the file. The numbers vary from 2 to as many as 8 digits on both sides, and there may be spaces before or after the slash, like so: 0000123 / 123.
My question is:
What kind of function or replacement pattern should I use to cover all the variation in the data in the files?
Just read the file line by line, remove the character sequence with a Java regex, and write the lines to a new file.
public class Example {
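// Leading digits plus the slash, allowing optional spaces around the slash
private static final String PREFIX = "\\d+\\s*/\\s*";
// Slash plus trailing digits up to the dash; not greedy across the rest of the line
private static final String SUFFIX = "\\s*/\\s*\\d+\\s*-";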
public static void main(String[] args) throws IOException{
File target = new File("target.txt");
FileWriter fileWriter = new FileWriter(target);
File source = new File("source.txt");
FileReader fileReader = new FileReader(source);
try {
BufferedReader bufferedReader = new BufferedReader(fileReader);
String line;
while ((line = bufferedReader.readLine()) != null) {
fileWriter.write(line.replaceFirst(PREFIX, "").replaceFirst(SUFFIX, " -"));
fileWriter.write("\r\n");
}
} finally {
fileReader.close();
fileWriter.close();
}
}
}
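The two replacements can also be folded into a single pattern with a capture group. This is a sketch under the assumption, drawn only from the one sample line in the question, that the title number always has the shape digits/digits/digits with optional spaces around the slashes; the class and method names are made up.

```java
import java.util.regex.Pattern;

class TitleCleaner {
    // digits, slash, (digits to keep), slash, digits; spaces around slashes are optional
    private static final Pattern TITLE_NUMBER =
            Pattern.compile("\\d+\\s*/\\s*(\\d+)\\s*/\\s*\\d+");

    /** Keeps only the middle number of the first digits/digits/digits group in the line. */
    static String clean(String line) {
        return TITLE_NUMBER.matcher(line).replaceFirst("$1");
    }

    public static void main(String[] args) {
        System.out.println(clean("\"RECEBTO TITULO 001/000664/02 - EDGAR ROSA DA TRINDADE\""));
    }
}
```

Lines that do not contain such a group pass through unchanged, and only the first match per line is rewritten.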
I wanted to split a String into a String[], but it isn't working the way I want it to. My code:
public static void get(HashMap<String, String> saves, File file) throws UnsupportedEncodingException, FileNotFoundException, IOException{
if (!file.exists()){
return;
}
InputStreamReader reader;
reader = new InputStreamReader(new FileInputStream(file), "UTF-16");
String r = null;
String[] s;
BufferedReader bufreader = new BufferedReader(reader);
while((r=bufreader.readLine()) != null){
s = r.split("=");
if (s.length < 2){
System.out.println(s.length);
System.out.println(s[0]);
return;
}
saves.put(s[0].toString(), s[1].toString());
s = null;
}
}
Also, when I tell it to println the String to the console
System.out.println(s.length);
System.out.println(s[0]);
it just prints:
1
??????????????????
-
-
What it should be reading (What is in the file):
1=welcome
2=hello
3=bye
4=goodbye
So I want it to put the values into the HashMap:
saves.put("1", "welcome");
saves.put("2", "hello");
saves.put("3", "bye");
saves.put("4", "goodbye");
But s = r.split("=") is not splitting; it is turning the String into "?????????".
Thank you!
It seems you're using the wrong encoding.
Your input file is not really UTF-16, as the Java code expects it.
I saved your example data in a file, and the result was similarly broken.
The default encoding on my system is UTF-8, so I changed the encoding of the file with the command:
iconv -f utf-8 -t utf-16 orig.txt > converted.txt
When using your program on converted.txt,
it produces the expected output.
It also produces the expected output if I use orig.txt,
and make this simple change in your program:
reader = new InputStreamReader(new FileInputStream(file), "UTF-8");
You can either make sure the file is UTF-16 encoded,
and if not, convert it,
or use the correct encoding when you create the InputStreamReader.
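If you do not know in advance which of the two encodings a file uses, checking for a byte order mark can help. The following is only a heuristic sketch with hypothetical names: many UTF-8 files carry no BOM, so a fallback charset is needed, and the returned charsets do not strip the BOM for you.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class BomSniffer {
    /** Guesses the charset from a leading byte order mark, or returns the fallback. */
    static Charset detect(byte[] head, Charset fallback) {
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) return StandardCharsets.UTF_8;
        if (startsWith(head, 0xFE, 0xFF)) return StandardCharsets.UTF_16BE;
        if (startsWith(head, 0xFF, 0xFE)) return StandardCharsets.UTF_16LE;
        return fallback; // no BOM: the encoding cannot be told from the first bytes
    }

    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] plain = "1=welcome".getBytes(StandardCharsets.UTF_8);
        System.out.println(detect(plain, StandardCharsets.UTF_8));
    }
}
```

Pass the first few bytes of the file and use the result when constructing the InputStreamReader.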
I'm trying to read in a file that contains unicode characters, convert those characters to their corresponding symbols and then print the resulting text to a new file. I'm trying to use StringEscapeUtils.unescapeHtml to do this but the lines are just being printed as is, with the unicode points still intact. I did a practice run by copying a single line from the file, making a string from that and then calling StringEscapeUtils.unescapeHtml on that, which works perfectly. My code is below:
class FileWrite
{
public static void main(String args[])
{
try{
String testString = " \"text\":\"Dude With Knit Hat At Party Calls Beer \u2018Libations\u2019 http://t.co/rop8NSnRFu\" ";
FileReader instream = new FileReader("Home Timeline.txt");
BufferedReader b = new BufferedReader(instream);
FileWriter fstream = new FileWriter("out.txt");
BufferedWriter out = new BufferedWriter(fstream);
out.write(StringEscapeUtils.unescapeHtml3(testString) + "\n");//This gives the desired output,
//with unicode points converted
String line = b.readLine();
while(line != null){
out.write(StringEscapeUtils.unescapeHtml3(line) + "\n");
line = b.readLine();
}
//Close the output streams
b.close();
out.close();
}
catch (Exception e){//Catch exception if any
System.err.println("Error: " + e.getMessage());
}
}
}
//This gives the desired output,
//with unicode points converted
out.write(StringEscapeUtils.unescapeHtml3(testString) + "\n");
You are mistaken. Java unescapes String literals of this form at compile time when it builds them into the class file:
"\u2018Libations\u2019"
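There are no HTML 3 escapes in this code. The method you have chosen is designed to unescape sequences of the form &lsquo;. The \u2018 sequences in your file, by contrast, reach your program at run time as six literal characters each, so nothing converts them.
You probably want the unescapeJava method.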
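To illustrate what such unescaping does, here is a standard-library-only sketch. It is a simplified stand-in, not the Commons Lang implementation: it handles only four-hex-digit backslash-u sequences and ignores other escapes and escaped backslashes. The class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnicodeUnescaper {
    // A literal backslash, the letter u, then exactly four hex digits.
    private static final Pattern ESCAPE = Pattern.compile("\\\\u([0-9a-fA-F]{4})");

    /** Replaces each backslash-u escape found in the text with the character it names. */
    static String unescape(String text) {
        Matcher matcher = ESCAPE.matcher(text);
        StringBuffer result = new StringBuffer();
        while (matcher.find()) {
            char decoded = (char) Integer.parseInt(matcher.group(1), 16);
            // quoteReplacement guards against '$' or '\' being treated specially.
            matcher.appendReplacement(result, Matcher.quoteReplacement(String.valueOf(decoded)));
        }
        matcher.appendTail(result);
        return result.toString();
    }

    public static void main(String[] args) {
        // The doubled backslashes keep the escapes literal, as they would be in a file.
        System.out.println(unescape("Calls Beer \\u2018Libations\\u2019"));
    }
}
```

Each line read from the file would be passed through unescape before being written out.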
Your strings are being both read and written using your platform's default encoding. You want to explicitly specify the character set as 'UTF-8':
Input stream:
BufferedReader b = new BufferedReader(new InputStreamReader(
new FileInputStream("Home Timeline.txt"),
Charset.forName("UTF-8")));
Output stream:
BufferedWriter out = new BufferedWriter(new OutputStreamWriter(
new FileOutputStream("out.txt"),
Charset.forName("UTF-8")));
I'm calling grep from Java to count, separately for each word in a list, its occurrences in a corpus.
BufferedReader fb = new BufferedReader(
        new InputStreamReader(
                new FileInputStream("french.txt"), "UTF8"));
String l;
while ((l = fb.readLine()) != null) {
    String lpt = "\\b" + l + "\\b";
    // corpus holds the path of the corpus file (defined elsewhere)
    String[] args = new String[]{"grep", "-ic", lpt, corpus};
    Process grep = Runtime.getRuntime().exec(args);
    // Read grep's single line of output (the count) before waiting for it to exit
    BufferedReader grepInput = new BufferedReader(new InputStreamReader(grep.getInputStream()));
    int tmp = Integer.parseInt(grepInput.readLine());
    grep.waitFor();
    System.out.println(l + "\t" + tmp);
}
This works well for my English word list and corpus. But I also have a French word list and corpus. It doesn't work for French, and a sample of the output on the Java console looks like this:
� bord 0
� c�t� 0
Correct form: "à bord" and "à côté".
Now my question is: where is the problem? Should I fix my Java code, or is it a grep issue?
If so, how do I fix it? (I also can't see French characters correctly in my terminal even though I changed the encoding to UTF-8.)
The problem is in your design. Do not call grep from Java. Use a pure Java implementation instead: read the file line by line and implement your own "grep" using the Java API.
But seriously, I believe that the problem is in your shell. Did you try to run grep manually and filter French characters? I believe it will not work for you. It depends on your shell configuration and therefore on the platform. Java can provide a platform-independent solution. To achieve this you should avoid non-pure-Java techniques, including executing command-line utilities, as much as possible.
BTW, code that reads your file line by line and uses String.contains() or pattern matching to filter lines is even shorter than code that runs grep.
I would suggest that you read the file line by line then call split on the word boundary to get the number of words.
public static void main(String[] args) throws IOException {
final File file = new File("myFile");
try (final BufferedReader bufferedReader =
new BufferedReader(new InputStreamReader(new FileInputStream(file), "UTF-8"))) {
String line;
while ((line = bufferedReader.readLine()) != null) {
final String[] words = line.split("\\b");
System.out.println(words.length + " words in line \"" + line + "\".");
}
}
}
This avoids calling grep from your program.
The odd characters you are getting may well be to do with using the wrong encoding. Are you sure your file is in "UTF-8"?
EDIT
OP wants to read one file line by line and then search for occurrences of each line read in another file.
This can still be done more easily using Java. Depending on how big your other file is, you can either read it into memory first and search it, or search it line by line as well.
A simple example reading the file into memory:
public static void main(String[] args) throws UnsupportedEncodingException, IOException {
final File corpusFile = new File("corpus");
final String corpusFileContent = readFileToString(corpusFile);
final File file = new File("myEngramFile");
try (final BufferedReader bufferedReader =
new BufferedReader(new InputStreamReader(new FileInputStream(file), "UTF-8"))) {
String line;
while ((line = bufferedReader.readLine()) != null) {
final int matches = countOccurencesOf(line, corpusFileContent);
System.out.println(line + "\t" + matches);
}
}
}
private static String readFileToString(final File file) throws IOException {
// Read all bytes first and decode once, so a multi-byte UTF-8 sequence is never
// split across buffer boundaries (needs java.nio.file.Files and StandardCharsets).
final byte[] bytes = Files.readAllBytes(file.toPath());
return new String(bytes, StandardCharsets.UTF_8);
}
private static int countOccurencesOf(final String countMatchesOf, final String inString) {
final Matcher matcher = Pattern.compile("\\b" + countMatchesOf + "\\b").matcher(inString);
int count = 0;
while (matcher.find()) {
++count;
}
return count;
}
This should work fine if your "corpus" file is less than a hundred megabytes or so. Any bigger and you will want to change the "countOccurencesOf" method to something like this
private static int countOccurencesOf(final String countMatchesOf, final File inFile) throws IOException {
final Pattern pattern = Pattern.compile("\\b" + countMatchesOf + "\\b");
int count = 0;
try (final BufferedReader bufferedReader =
new BufferedReader(new InputStreamReader(new FileInputStream(inFile), "UTF-8"))) {
String line;
while ((line = bufferedReader.readLine()) != null) {
final Matcher matcher = pattern.matcher(line);
while (matcher.find()) {
++count;
}
}
}
return count;
}
Now you would just pass your "File" object into the method rather than the stringified file.
Note that the streaming approach reads files line by line and hence drops the line breaks; you need to add them back before matching if your Pattern relies on them being there.
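One caveat with building the pattern from "\\b" plus the raw search text: regex metacharacters in the word list are interpreted, and how \b treats accented letters has varied between Java versions. The sketch below sidesteps both under stated assumptions: it quotes the phrase and uses explicit letter/digit lookarounds instead of \b. The class and method names are hypothetical, and the case-insensitive flags mirror the -i passed to grep in the question. Unlike grep -c, which counts matching lines, this counts individual occurrences.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PhraseCounter {
    /** Counts whole-phrase, case-insensitive occurrences of phrase within text. */
    static int count(String phrase, String text) {
        Pattern pattern = Pattern.compile(
                // Not preceded or followed by a Unicode letter or digit.
                "(?<![\\p{L}\\p{N}])" + Pattern.quote(phrase) + "(?![\\p{L}\\p{N}])",
                Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
        Matcher matcher = pattern.matcher(text);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Escapes keep this source file ASCII-only; the text contains "a-grave bord" twice.
        String corpus = "Il est \u00E0 bord. \u00C0 bord du bateau.";
        System.out.println(count("\u00E0 bord", corpus));
    }
}
```

The corpus string would come from a file decoded as UTF-8, as in the examples above.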