StringEscapeUtils.unescapeHtml doesn't work on strings read from files - java

I'm trying to read in a file that contains unicode characters, convert those characters to their corresponding symbols and then print the resulting text to a new file. I'm trying to use StringEscapeUtils.unescapeHtml to do this but the lines are just being printed as is, with the unicode points still intact. I did a practice run by copying a single line from the file, making a string from that and then calling StringEscapeUtils.unescapeHtml on that, which works perfectly. My code is below:
class FileWrite
{
public static void main(String args[])
{
try{
String testString = " \"text\":\"Dude With Knit Hat At Party Calls Beer \u2018Libations\u2019 http://t.co/rop8NSnRFu\" ";
FileReader instream = new FileReader("Home Timeline.txt");
BufferedReader b = new BufferedReader(instream);
FileWriter fstream = new FileWriter("out.txt");
BufferedWriter out = new BufferedWriter(fstream);
out.write(StringEscapeUtils.unescapeHtml3(testString) + "\n");//This gives the desired output,
//with unicode points converted
String line = b.readLine().toString();
while(line != null){
out.write(StringEscapeUtils.unescapeHtml3(line) + "\n");
line = b.readLine();
}
//Close the output streams
b.close();
out.close();
}
catch (Exception e){//Catch exception if any
System.err.println("Error: " + e.getMessage());
}
}
}

//This gives the desired output,
//with unicode points converted
out.write(StringEscapeUtils.unescapeHtml3(testString) + "\n");
You are mistaken. Java unescapes String literals of this form at compile time when it builds them into the class file:
"\u2018Libations\u2019"
There are no HTML 3 escapes in this code. The method you have chosen is designed to unescape escape sequences of the form &lsquo;.
You probably want the unescapeJava method.
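To make the difference between a compiled literal and a line read from a file concrete, here is a minimal sketch. It does not use Commons Lang; the class UnicodeEscapeDemo and the method unescapeUnicode are hypothetical stand-ins that handle only the four-digit \u form, whereas unescapeJava covers more escapes.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnicodeEscapeDemo {
    // Matches a literal backslash, the letter u, and exactly four hex digits.
    private static final Pattern UNICODE_ESCAPE = Pattern.compile("\\\\u([0-9a-fA-F]{4})");

    // Hypothetical helper: replaces each escape with the character it names.
    static String unescapeUnicode(String input) {
        Matcher m = UNICODE_ESCAPE.matcher(input);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            char c = (char) Integer.parseInt(m.group(1), 16);
            m.appendReplacement(sb, Matcher.quoteReplacement(String.valueOf(c)));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        // The compiler decodes these escapes, so the String holds real quote characters.
        String compiled = "\u2018Libations\u2019";
        // A line read from a file still holds backslash, u and four digits.
        String fromFile = "\\u2018Libations\\u2019";

        System.out.println(compiled.length());  // 11
        System.out.println(fromFile.length());  // 21
        System.out.println(unescapeUnicode(fromFile).equals(compiled)); // true
    }
}
```

The test string in the question behaves like compiled here, which is why it appeared to work before any unescape method was called.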

Your strings are being both read and written using your platform's default encoding. You want to explicitly specify the character set to use as 'UTF-8':
Input stream:
BufferedReader b = new BufferedReader(new InputStreamReader(
new FileInputStream("Home Timeline.txt"),
Charset.forName("UTF-8")));
Output stream:
BufferedWriter out = new BufferedWriter(new OutputStreamWriter(
new FileOutputStream("out.txt"),
Charset.forName("UTF-8")));
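The same idea can be sketched with the java.nio.file API, which takes the charset as an argument. This is only an illustration: the class Utf8CopyDemo, the methods copyUtf8 and roundTrip, and the temp file names are made up for the example.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class Utf8CopyDemo {
    // Copies a text file line by line, decoding and encoding as UTF-8.
    static void copyUtf8(Path source, Path target) throws IOException {
        try (BufferedReader reader = Files.newBufferedReader(source, StandardCharsets.UTF_8);
             BufferedWriter writer = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                writer.write(line);
                writer.newLine();
            }
        }
    }

    // Writes text to a temp file, copies it, and returns the first line of the copy.
    static String roundTrip(String text) {
        try {
            Path in = Files.createTempFile("utf8-in", ".txt");
            Path out = Files.createTempFile("utf8-out", ".txt");
            Files.write(in, text.getBytes(StandardCharsets.UTF_8));
            copyUtf8(in, out);
            String first = Files.readAllLines(out, StandardCharsets.UTF_8).get(0);
            Files.delete(in);
            Files.delete(out);
            return first;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String original = "Beer \u2018Libations\u2019";
        System.out.println(original.equals(roundTrip(original))); // true
    }
}
```

Because both ends name UTF-8 explicitly, the result does not depend on the platform default encoding.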

Related

After reading with BufferedReader '\n' won't be accepted as a new line char, how to solve this?

I have a large text file I want to format. Say the input file is called inputFile and output file is called outputFile.
Here is my code, which uses BufferedReader and BufferedWriter:
public static void readAndWrite(String fileNameToRead, String fileNameToWrite) {
try{
BufferedReader fr = new BufferedReader(
new FileReader(String.format("%s.txt", fileNameToRead)));
BufferedWriter out = new BufferedWriter(
new FileWriter(String.format("%s.txt", fileNameToWrite), true));
String currentTmp = "";
String tmp = "";
String test = "work \nwork";
out.append(test);
while((tmp = fr.readLine()) != null) {
tmp = tmp.trim();
if(tmp.isEmpty()) {
currentTmp = currentTmp.trim();
out.append(currentTmp);
out.newLine();
out.newLine();
currentTmp = "";
} else {
currentTmp = currentTmp.concat(" ").concat(tmp);
}
}
if(!currentTmp.equals("")) {
out.write(currentTmp);
}
fr.close();
out.close();
} catch (IOException e) {
System.out.println("exception occurred: " + e);
}
}
public static void main(String[] args) {
String readFile = "inPutFile";
String writeFile = "outPutFile";
readAndWrite(readFile, writeFile);
}
The problem is that the test string inside the code, which contains '\n', is written with a real line break by BufferedWriter. But if I put the same string in the text file, it does not behave the same way.
Put more simply, I want my input file to contain this
work\n
work
and output as
work
work
I am using mac, so the separator should be '\n'
work\n
If you see the "\n" in your file, it is not a newline character. It is just two characters: a backslash followed by the letter n.
The trim() method will not remove those characters.
Instead you might have something like:
if (tmp.endsWith("\\n")) // two literal characters: backslash and n
tmp = tmp.substring(0, tmp.length() - 2);
I am using mac, so the separator should be '\n'
You should use the newline character for the platform. So when writing to your file the code should be:
} else {
currentTmp = currentTmp.concat(" ").concat(tmp);
out.append( currentTmp );
out.newLine();
}
The newLine() method will use the appropriate line separator String for the platform.
Edit:
You need to understand what an escape character is in Java. When you use:
String text = "test\n"
and write the string to a file, only 5 characters are written to the file, not 6. The "\n" is an escape sequence that causes the ASCII newline character to be written to the file. This character is not displayable, so you can't see it in the file.
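The character counts are easy to check, and the two-character sequence from a file can be turned into a real line break explicitly. A small sketch, where LiteralNewlineDemo and expandLiteralNewlines are hypothetical names chosen for the illustration:

```java
class LiteralNewlineDemo {
    // Replaces the two-character sequence backslash + n with a real line separator.
    static String expandLiteralNewlines(String line) {
        return line.replace("\\n", System.lineSeparator());
    }

    public static void main(String[] args) {
        String inSource = "test\n";  // the compiler turns the escape into one newline character
        String fromFile = "test\\n"; // what readLine() returns when the file holds backslash + n

        System.out.println(inSource.length()); // 5
        System.out.println(fromFile.length()); // 6

        // Prints "work" and "work" on two separate lines.
        System.out.println(expandLiteralNewlines("work\\nwork"));
    }
}
```

String.replace is used rather than replaceAll so that the backslash is treated as plain text and not as regex syntax.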
After @camickr's answer, I think I realized the problem. Somehow, if I have text in the file like this
work \nwork
the \n won't be treated as a single char ('\n'); it is treated as two chars. I think that's why, when the BufferedWriter writes the input string, it doesn't treat it as a new line.

Incorrect printing of non-English characters with Java

I thought this was only an issue with Python 2 but have run into a similar issue now with java (Windows 10, JDK8).
My searches have led to little resolution so far.
I read from 'stdin' input stream this value: Viļāni. When I print it to console I get this: Vi????ni.
Relevant code snippets are as follows:
BufferedReader in = new BufferedReader(new InputStreamReader(System.in, StandardCharsets.UTF_8));
ArrayList<String> corpus = new ArrayList<String>();
String inputString = null;
while ((inputString = in.readLine()) != null) {
corpus.add(inputString);
}
String[] allCorpus = new String[corpus.size()];
allCorpus = corpus.toArray(allCorpus);
for (String line : allCorpus) {
System.out.println(line);
}
Further expansion on my problem as follows:
I read a file containing the following 2 lines:
を
Sōten_Kōro
When I read this from disk and output to a second file I get the following output:
ã‚’
S�ten_K�ro
When I read the file from stdin using cat testinput.txt | java UTF8Tester I get the following output:
???
S??ten_K??ro
Both are obviously wrong. I need to be able to print the correct characters to console and file. My sample code is as follows:
public class UTF8Tester {
public static void main(String args[]) throws Exception {
BufferedReader stdinReader = new BufferedReader(new InputStreamReader(System.in, StandardCharsets.UTF_8));
String[] stdinData = readLines(stdinReader);
printToFile(stdinData, "stdin_out.txt");
BufferedReader fileReader = new BufferedReader(new FileReader("testinput.txt"));
String[] fileData = readLines(fileReader);
printToFile(fileData, "file_out.txt");
}
private static void printToFile(String[] data, String fileName)
throws FileNotFoundException, UnsupportedEncodingException {
PrintWriter writer = new PrintWriter(fileName, "UTF-8");
for (String line : data) {
writer.println(line);
}
writer.close();
}
private static String[] readLines(BufferedReader reader) throws IOException {
ArrayList<String> corpus = new ArrayList<String>();
String inputString = null;
while ((inputString = reader.readLine()) != null) {
corpus.add(inputString);
}
String[] allCorpus = new String[corpus.size()];
return corpus.toArray(allCorpus);
}
}
Really stuck here and help would really be appreciated! Thanks in advance. Paul
System.in/out use the default Windows character set.
A Java String uses Unicode internally.
FileReader/FileWriter are old utility classes that use the default character set, so they are suitable only for non-portable local files.
The error you saw came from a special character stored as a two-byte UTF-8 sequence, where each byte was interpreted in the default single-byte encoding. Neither byte value maps to a character there, so each was replaced with a ?, giving two ? per character.
What is required is that the character can be entered on System.in in the default charset.
The String is then converted from the default charset.
Writing it to a file as UTF-8 requires specifying UTF-8 explicitly.
Hence:
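BufferedReader stdinReader = new BufferedReader(new InputStreamReader(System.in));
String[] stdinData = readLines(stdinReader);
printToFile(stdinData, "stdin_out.txt");
// Input file saved as UTF-8:
Path utf8Path = Paths.get("testinput-utf8.txt");
List<String> utf8Lines = Files.readAllLines(utf8Path); // Here the default is UTF-8!
// Input file saved as Windows-1252 (the charset is passed as a Charset, not a String):
Path latin1Path = Paths.get("testinput-winlatin1.txt");
List<String> latin1Lines = Files.readAllLines(latin1Path, Charset.forName("Windows-1252"));
// Files.write takes the target path first, then the lines, then the charset.
Files.write(Paths.get("file_out.txt"), utf8Lines, StandardCharsets.UTF_8);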
To check whether your current computer system handles Japanese:
System.out.println("Hiragana letter Wo '\u3092'."); // Either を or ?.
If you see ?, the character could not be converted to the default system encoding.
を is U+3092, written in ASCII-only source as the escape \u3092.
To create a UTF-8 text file under Windows:
Files.write(Paths.get("out-utf8.txt"),
"\uFEFFHiragana letter Wo '\u3092'.".getBytes(StandardCharsets.UTF_8));
Here I use an ugly (generally unneeded) BOM character \uFEFF (a zero-width no-break space) that lets Windows Notepad recognize the text as UTF-8.
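Both kinds of wrong output from the question can be reproduced in memory, which may help in telling them apart. This is a sketch under the assumption that the default encoding involved was windows-1252 for the file case and a charset without the character for the console case; MojibakeDemo and its two methods are hypothetical names.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class MojibakeDemo {
    // Encodes text as UTF-8, then decodes the bytes with a different charset.
    static String decodeUtf8BytesAs(String text, Charset wrongCharset) {
        return new String(text.getBytes(StandardCharsets.UTF_8), wrongCharset);
    }

    // Encodes text into a charset that may lack the characters, then decodes it again.
    static String squeezeThrough(String text, Charset narrowCharset) {
        return new String(text.getBytes(narrowCharset), narrowCharset);
    }

    public static void main(String[] args) {
        String wo = "\u3092"; // Hiragana letter Wo
        Charset cp1252 = Charset.forName("windows-1252");

        // Three UTF-8 bytes read as three single-byte characters.
        String misread = decodeUtf8BytesAs(wo, cp1252);
        System.out.println(misread.length()); // 3

        // A character the target charset cannot represent becomes a question mark.
        System.out.println(squeezeThrough(wo, StandardCharsets.US_ASCII)); // ?
    }
}
```

The first case matches the three-character garbage in file_out.txt, the second the question marks on the console.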

Converting a .java file to a .txt document

I am trying to figure out how to load a .java file and output it into a text document...
What needs to be done:
Write a program that opens a Java source file, adds line numbers, and
saves the result in a new file. Line numbers are numbers which
indicate the different lines of a source file, they are useful when
trying to draw someone's attention to a particular line (e.g.,
"there's a bug on line 4"). Your program should prompt the user to
enter a filename, open it, and then save each line to an output file
with the line numbers prepended to the beginning of each line.
Afterward, display the name of the output file. The name of the output
file should be based on the input file with the '.' replaced by a '_',
and ".txt" added to the end. (Hint: if you are using a PrintWriter
object called pw to save the text file, then the line
"pw.printf("%03d", x);" will display an integer x padded to three
digits with leading zeros.)
The text.java needs to output into the text document with numbered lines such as:
001 public class dogHouse {
002 public static void main (String[] args) {
003 and so on...
004
import java.io.*;
public class dogHouse {
public static void main(String [] args) throws IOException {
// The name of the file to open.
String fileName = "test.java";
// This will reference one line at a time
String line = null;
try {
// FileReader reads text files in the default encoding.
FileReader fileReader =
new FileReader(fileName);
// Always wrap FileReader in BufferedReader.
BufferedReader bufferedReader =
new BufferedReader(fileReader);
while((line = bufferedReader.readLine()) != null) {
System.out.println(line);
}
// Always close files.
bufferedReader.close();
}
// The name of the file to open.
finally {
// Assume default encoding.
FileWriter fileWriter =
new FileWriter(fileName);
// Always wrap FileWriter in BufferedWriter.
BufferedWriter bufferedWriter =
new BufferedWriter(fileWriter);
// Note that write() does not automatically
// append a newline character.
bufferedWriter.write("Hello there,");
// Always close files.
bufferedWriter.close();
}
}
}
You need to print and count the lines as you read them. You also need to differentiate between your output file and your input file. And I would prefer to use a try-with-resources statement. Something like,
String fileName = "test.java";
String outputFileName = String.format("%s.txt", fileName.replace('.', '_'));
try (BufferedReader br = new BufferedReader(new FileReader(fileName));
PrintWriter pw = new PrintWriter(new FileWriter(outputFileName))) {
int count = 1;
String line;
while ((line = br.readLine()) != null) {
pw.printf("%03d %s%n", count, line);
count++;
}
} catch (Exception e) {
e.printStackTrace();
}
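The naming rule and the padding can be checked on their own, without touching any files. A small sketch; LineNumberDemo, outputNameFor and numbered are hypothetical names, and Locale.ROOT is an added assumption to keep the digits independent of the default locale.

```java
import java.util.Locale;

class LineNumberDemo {
    // Replaces every '.' in the input name with '_' and appends ".txt".
    static String outputNameFor(String inputName) {
        return String.format("%s.txt", inputName.replace('.', '_'));
    }

    // Prefixes a line with its number, zero-padded to three digits.
    static String numbered(int lineNumber, String line) {
        return String.format(Locale.ROOT, "%03d %s", lineNumber, line);
    }

    public static void main(String[] args) {
        System.out.println(outputNameFor("test.java"));             // test_java.txt
        System.out.println(numbered(1, "public class dogHouse {")); // 001 public class dogHouse {
    }
}
```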

How to write new line character to a file in Java

I have a string that contains new lines. I send this string to a function to write the String to a text file as:
public static void writeResult(String writeFileName, String text)
{
try
{
FileWriter fileWriter = new FileWriter(writeFileName);
BufferedWriter bufferedWriter = new BufferedWriter(fileWriter);
bufferedWriter.write(text);
// Always close files.
bufferedWriter.close();
}
catch(IOException ex) {
System.out.println("Error writing to file '"+ writeFileName + "'");}
} //end writeResult function
But when I open the file, I find it without any new lines.
When I display the text in the console screen, it is displayed with new lines. How can I write the new line character in the text file.
EDIT:
Assume this is the argument text that I sent to the function above:
I returned from the city about three o'clock on that
may afternoon pretty well disgusted with life.
I had been three months in the old country, and was
How can I write this string as it is (with new lines) to the text file? My function writes the string on one line. Can you provide me with a way to write the text to the file including new lines?
EDIT 2:
The text is originally in a .txt file. I read the text using:
while((line = bufferedReader.readLine()) != null)
{
sb.append(line); //append the lines to the string
sb.append('\n'); //append new line
} //end while
where sb is a StringBuffer
In EDIT 2:
while((line = bufferedReader.readLine()) != null)
{
sb.append(line); //append the lines to the string
sb.append('\n'); //append new line
} //end while
you are reading the text file and appending '\n' after each line. Don't append '\n', because it will not show up as a line break in some simple-minded Windows editors like Notepad. Instead append the OS-specific line separator string using:
sb.append(System.lineSeparator()); (Java 1.7 and later)
or
sb.append(System.getProperty("line.separator")); (Java 1.6 and below)
Alternatively, you can later use String.replaceAll() to replace "\n" in the string built in the StringBuffer with the OS-specific line separator:
String updatedText = text.replaceAll("\n", System.lineSeparator());
but it would be more efficient to append it while you are building the string, than append '\n' and replace it later.
Finally, as a developer, if you are using notepad for viewing or editing files, you should drop it, as there are far more capable tools like Notepad++, or your favorite Java IDE.
SIMPLE SOLUTION
File file = new File("F:/ABC.TXT");
FileWriter fileWriter = new FileWriter(file, true); // true = append to the file
fileWriter.write("\r\n"); // Windows-style line break
fileWriter.close();
The BufferedWriter class offers a newLine() method. Using this will ensure platform independence.
bufferedWriter.write(text + "\n"); This can work, but the newline character differs between platforms, so alternatively you can use:
bufferedWriter.write(text);
bufferedWriter.newLine();
Split the string into a string array and write it using the method above (I assume your text contains \n to mark new lines):
String[] lines = text.split("\n");
and then, inside a loop:
bufferedWriter.write(lines[i]);
bufferedWriter.newLine();
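Put together, the split-and-newLine() approach might look like the sketch below. It writes into a StringWriter instead of a file so the result is easy to inspect; NewLineWriterDemo and withPlatformNewlines are hypothetical names.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;

class NewLineWriterDemo {
    // Rewrites text that uses '\n' so that every line ends with the platform separator.
    static String withPlatformNewlines(String text) {
        StringWriter target = new StringWriter();
        try (BufferedWriter writer = new BufferedWriter(target)) {
            for (String line : text.split("\n")) {
                writer.write(line);
                writer.newLine(); // platform-specific line separator
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return target.toString();
    }

    public static void main(String[] args) {
        String text = "I returned from the city\nabout three o'clock";
        System.out.print(withPlatformNewlines(text));
    }
}
```

Note that split("\n") drops trailing empty strings, so blank lines at the very end of the text are not reproduced, while blank lines in the middle are.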
This approach always works for me:
String newLine = System.getProperty("line.separator");
String textInNewLine = "this is my first line " + newLine + "this is my second line ";
Put this code wherever you want to insert a new line:
bufferedWriter.newLine();
PrintWriter out = new PrintWriter(new FileWriter("out.txt")); // for writing to a file ("out.txt" is a placeholder name)
String newLine = System.getProperty("line.separator"); // platform line separator
out.print("1st Line" + newLine); // print with new line
out.print("2nd Line" + newLine); // print with new line
out.close();
Here is a snippet that gets the default newline character for the current platform.
Use
System.getProperty("os.name") and
System.getProperty("os.version").
Example:
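public static String getSystemNewline(){
String eol = null;
String os = System.getProperty("os.name").toLowerCase();
if(os.contains("mac")){
// os.version looks like "10.15.7", so parse only the major version
int v = Integer.parseInt(System.getProperty("os.version").split("\\.")[0]);
eol = (v <= 9 ? "\r" : "\n");
}
if(os.contains("nix") || os.contains("nux"))
eol = "\n";
if(os.contains("win"))
eol = "\r\n";
return eol; // null if the OS was not recognised; System.lineSeparator() avoids the guesswork
}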
Where eol is the newline

Java read file with strings from different languages

I made a program that reads different text files and combines them into a .csv file. It's a .csv file with translations into English, Dutch, French, Italian, Portuguese and Spanish.
Now here is my problem:
In the end I get a nicely filled .csv file with all the translations together. I read the files as UTF-8 and all the languages are shown correctly except for French. Some characters are shown as question marks, like this: "Mis ? jour" when it should be "Mis à jour".
Here is the method that reads the files for the different languages and makes objects from them so I can sort them and put them in the right spot in the .csv file.
The files are filled like this:
To Airport;A l’aéroport
Today;Aujourd’hui
public static Language getTranslations(String inputFileName) {
Language language = new Language();
FileInputStream fstream;
try {
fstream = new FileInputStream(inputFileName);
// Get the object of DataInputStream
DataInputStream in = new DataInputStream(fstream);
BufferedReader br = new BufferedReader( new InputStreamReader( new FileInputStream(inputFileName), "UTF-8"));
String strLine;
//Read File Line By Line
while ((strLine = br.readLine()) != null) {
// Print the content on the console
String[] values = strLine.split(";");
if(values.length == 2) {
language.putTranslationItem(values[0], values[1]);
}
}
//Close the input stream
in.close();
} catch (FileNotFoundException e) {
} catch (IOException e) {
}
return language;
}
I hope anybody can help out!
Thanks
I am not completely sure about this, but you can try converting the values[0] and values[1] strings into byte arrays
byte[] value_0_utfString = values[0].getBytes("UTF-8");
byte[] value_1_utfString = values[1].getBytes("UTF-8");
and then converting them back into strings
String str_0 = new String(value_0_utfString, "UTF-8");
String str_1 = new String(value_1_utfString, "UTF-8");
I'm not sure if this is the right or most efficient way, but since a single line contains both English and French, I thought splitting and encoding might help. I haven't tried this myself.
Resave the text file by clicking "Save As" in any text editor (e.g. memopad) and change the encoding type to ANSI instead of UTF-8.
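Before re-saving anything, it may be worth checking which encoding the French file actually uses. The sketch below assumes, without confirmation from the question, that the file was saved in a single-byte encoding such as ISO-8859-1 and then read as UTF-8; DecodeCheckDemo and decode are hypothetical names.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class DecodeCheckDemo {
    // Decodes raw bytes with the given charset.
    static String decode(byte[] raw, Charset charset) {
        return new String(raw, charset);
    }

    public static void main(String[] args) {
        // Bytes as they would sit on disk if the file were saved in a single-byte encoding.
        byte[] ansiBytes = "Mis \u00E0 jour".getBytes(StandardCharsets.ISO_8859_1);

        // Read as UTF-8, the lone byte for the accented letter is invalid
        // and becomes the replacement character U+FFFD.
        String asUtf8 = decode(ansiBytes, StandardCharsets.UTF_8);
        System.out.println(asUtf8.contains("\uFFFD")); // true

        // Read with the encoding it was saved in, the text survives.
        String asLatin1 = decode(ansiBytes, StandardCharsets.ISO_8859_1);
        System.out.println(asLatin1.equals("Mis \u00E0 jour")); // true
    }
}
```

If the bytes decode cleanly only with the single-byte charset, the fix is to read that one file with the matching charset, or to convert it to UTF-8 like the others.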
