I have a bunch of text files that were encoded in UTF-8. The text inside the files looks like this: \x6c\x69b/\x62\x2f\x6d\x69nd/m\x61x\x2e\x70h\x70.
I've copied all these text files and placed them into a directory /convert/.
I need to read each file, convert the encoded literals into characters, then save the result as filename.converted.txt.
What would be the smartest approach to do this? Is there a function for handling Unicode text that converts from these literals to characters? Should I be using a different programming language for this?
This is what I have at the moment:
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileWriter;

public class decode {
    public static void main(String args[]) {
        File directory = new File("C:/convert/");
        String[] files = directory.list();
        boolean success = false;

        for (String file : files) {
            System.out.println("Processing \"" + file + "\"");

            //TODO read each file and convert them into characters
            success = true;

            if (success) {
                System.out.println("Successfully converted \"" + file + "\"");
            } else {
                System.out.println("Failed to convert \"" + file + "\"");
            }

            //save file
            if (success) {
                try {
                    FileWriter open = new FileWriter("C:/convert/" + file + ".converted.txt");
                    BufferedWriter write = new BufferedWriter(open);
                    write.write("TODO: write converted text into file");
                    write.close();
                    System.out.println("Successfully saved \"" + file + "\" conversion.");
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }
        }
    }
}
(It looks like there's some confusion about what you mean - this answer assumes the input file is entirely in ASCII, and uses "\x" to hex-encode any bytes which aren't in the ASCII range.)
It sounds to me like the UTF-8 part of it is actually irrelevant. You can treat it as opaque binary data for output. Assuming the input file is entirely ASCII:
Open the input file as text (e.g. using FileInputStream wrapped in InputStreamReader specifying an encoding of "US-ASCII")
Open the output file as binary (e.g. using FileOutputStream)
Read each character from the input.
If it isn't '\', write the character's ASCII value to the output stream (just cast from char to byte).
If it is '\', look at the next character:
    If it's 'x', read the next two characters, convert them from hex to a byte (there's lots of code around to do this part), and write that byte to the output stream.
    If it's '\', write the ASCII value for '\' to the output stream.
    Otherwise, throw an exception indicating failure.
Loop until you've exhausted the input file.
Close both files in finally blocks.
You'll then have a "normal" UTF-8 file which should be readable by any text editor which supports UTF-8.
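The steps above can be sketched roughly like this (the class name HexUnescape and the in-memory demo in main are my own; it assumes the input is pure ASCII containing only \xNN escapes and doubled backslashes):

```java
import java.io.*;

public class HexUnescape {
    // Decodes "\xNN" escapes from an ASCII input stream into raw bytes.
    public static void decode(InputStream in, OutputStream out) throws IOException {
        Reader reader = new InputStreamReader(in, "US-ASCII");
        int c;
        while ((c = reader.read()) != -1) {
            if (c != '\\') {
                out.write(c);            // plain ASCII: pass through unchanged
                continue;
            }
            int next = reader.read();
            if (next == 'x') {
                // Read two hex digits and emit the corresponding byte.
                int hi = reader.read(), lo = reader.read();
                if (hi == -1 || lo == -1) throw new IOException("Truncated \\x escape");
                out.write((Character.digit(hi, 16) << 4) | Character.digit(lo, 16));
            } else if (next == '\\') {
                out.write('\\');         // "\\" means a literal backslash
            } else {
                throw new IOException("Unexpected escape: \\" + (char) next);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] input = "\\x6c\\x69b/\\x62\\x2f\\x6d\\x69nd/m\\x61x\\x2e\\x70h\\x70"
                .getBytes("US-ASCII");
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        decode(new ByteArrayInputStream(input), out);
        System.out.println(out.toString("UTF-8")); // prints lib/b/mind/max.php
    }
}
```

In your directory loop you'd wrap each file in a FileInputStream / FileOutputStream pair instead of the byte-array streams used here for demonstration.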
java.io.InputStreamReader can be used to convert an input stream from an arbitrary charset into Java chars. I'm not exactly sure how you want to write it back out, though. Do you want non-ASCII characters to be written out as ASCII Unicode escape sequences?
Related
I am trying to make a program that receives a specified String and removes every occurrence of this String in a text document. The text file that is used to read / write is the same. The arguments are received from cmd, in this order:
inputString filename
The program compiles fine, but after running it leaves the original text file blank. If I make one try-catch block for input handling and another for output handling, I am able to read and write to the same file. If I use a try-with-resources block, I am able to read a file and save the output to a different file than the original, with all occurrences of inputString from cmd removed. But it seems I can't read and write to the same file using try-with-resources, and the input.hasNext() call returns false when I try to do it this way.
Code example below:
package ch12;

import java.io.*;
import java.util.*;

public class Chapter_12_E11_RemoveText {
    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.out.println("Usage java ch12.Chapter_12_E11_RemoveText inputString filename");
            System.exit(1);
        }

        File filename = new File(args[1]);
        if (!filename.exists()) {
            System.out.println("Target file " + args[1] + " does not exist");
            System.exit(2);
        }

        try (
            Scanner input = new Scanner(filename);
            PrintWriter output = new PrintWriter(filename);
        ) {
            System.out.println("hasNext() is " + input.hasNext());
            System.out.println("hasNextLine() is " + input.hasNextLine());
            while (input.hasNext()) {
                String s1 = input.nextLine();
                System.out.println("String fetched from input.nextLine() " + s1);
                System.out.println("Attempting to replace all words equal to " + args[0] + " with \"\"");
                String s2 = s1.replaceAll(args[0], "");
                output.println(s2);
            }
        }
    }
}
I suspect that when I create a new PrintWriter object with the argument filename, the original file is overwritten by a blank file before the while-loop executes. Am I right here? Is it possible to read and write to the same file using try-with-resources?
From the PrintWriter docs:
If the file exists then it will be truncated to zero size; otherwise, a new file will be created.
So you are correct, by the time you initialize your PrintWriter, your Scanner has nothing to scan.
I would remove the PrintWriter initialization from the resources initialization block, build your file representation in memory, then replace the file contents in another block (or nest it).
That is, if the file has a reasonable size for your memory to handle the replacement.
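A sketch of that read-first, write-second approach (the RemoveText/removeAll names and the Path-based signature are mine; I've also used the literal replace rather than the regex-based replaceAll):

```java
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class RemoveText {
    public static void removeAll(Path file, String target) throws IOException {
        // Read the whole file into memory FIRST, so truncating it later is harmless.
        List<String> lines = Files.readAllLines(file, StandardCharsets.UTF_8);
        // Only now reopen the same file for writing (this truncates it).
        try (PrintWriter output = new PrintWriter(file.toFile(), "UTF-8")) {
            for (String line : lines) {
                output.println(line.replace(target, ""));  // literal match, not regex
            }
        }
    }
}
```

The key point is that the Scanner has finished before the PrintWriter ever touches the file, which is exactly what the single try-with-resources block in the question cannot guarantee.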
I have been unable to find the reason for this. The only problem I am having in this code is that when the FileWriter tries to put the new value into the text file, it instead puts a ?. I have no clue why, or even what it means. Here is the code:
if (secMessage[1].equalsIgnoreCase("add")) {
    if (secMessage.length == 2) {
        try {
            String deaths = readFile("C:/Users/Samboni/Documents/Stuff For Streaming/deaths.txt", Charset.defaultCharset());
            FileWriter write = new FileWriter("C:/Users/Samboni/Documents/Stuff For Streaming/deaths.txt");
            int comb = Integer.parseInt(deaths) + 1;
            write.write(comb);
            write.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
And here is the readFile method:
static String readFile(String path, Charset encoding) throws IOException {
    byte[] encoded = Files.readAllBytes(Paths.get(path));
    return new String(encoded, encoding);
}
Also, the secMessage array is an array of strings containing the words of an IRC message split into individual words, that way the program can react to the commands on a word-by-word basis.
You're calling Writer.write(int). That writes a single UTF-16 code unit to the file, taking just the bottom 16 bits of the int. If your platform's default encoding isn't able to represent the character you're trying to write, it will write '?' as a replacement character.
I suspect you actually want to write out a text representation of the number, in which case you should use:
write.write(String.valueOf(comb));
In other words, turn the value into a string and then write it out. So if comb is 123, you'll get three characters ('1', '2', '3') written to the file.
Personally I'd avoid FileWriter though - I prefer using OutputStreamWriter wrapping FileOutputStream so you can control the encoding. Or in Java 7, you can use Files.newBufferedWriter to do it more simply.
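A sketch of that read-increment-write cycle using Files.newBufferedWriter with an explicit charset (the DeathCounter name and the Path-based signature are mine, not part of the original code):

```java
import java.io.IOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class DeathCounter {
    public static int increment(Path file) throws IOException {
        // Read the current count as text and parse it.
        String deaths = new String(Files.readAllBytes(file), StandardCharsets.UTF_8).trim();
        int comb = Integer.parseInt(deaths) + 1;
        // Write the TEXT representation back, with the encoding stated explicitly.
        try (Writer write = Files.newBufferedWriter(file, StandardCharsets.UTF_8)) {
            // write.write(comb) would instead emit the single character with
            // that code unit, e.g. '*' (U+002A) for 42.
            write.write(String.valueOf(comb));
        }
        return comb;
    }
}
```

The try-with-resources block also fixes a lurking problem in the original snippet: the FileWriter was never closed if an exception occurred between creation and close().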
write.write(Integer.toString(comb));
You can convert the int into a string this way. Otherwise write(int) interprets the value as a character code, which only produces the digit you expect for a handful of values, so it is not recommended.
I was looking at a Properties file I'm testing, and I realized that every time I do a Properties.store(), values that contain characters like : and / receive a backslash. But I want my property file to be read by other programs that are not written in Java (so they will not use the Properties library), and those backslashes are causing problems for them. Is there any way to save the file without those?
I've tried building this function, which is called after the Properties file has been saved:
private void replaceInFile(File file) throws IOException {
    File tmpFile = new File("/sdcard/test.prop");
    FileWriter fw = new FileWriter(tmpFile);
    Reader fr = new FileReader(file);
    BufferedReader br = new BufferedReader(fr);
    while (br.ready()) {
        fw.write(br.readLine().replaceAll("\\", "") + "\n");
    }
    fw.close();
    br.close();
    fr.close();
}
But I'm getting this error when the function is called:
02-03 13:05:34.757: E/AndroidRuntime(15558): java.util.regex.PatternSyntaxException: Syntax error U_REGEX_BAD_ESCAPE_SEQUENCE near index 1:
\
^
These are special characters, and in a regular expression they must be escaped with a backslash. replaceAll() treats its first argument as a regex, and the string "\\" (a single backslash after Java's own string escaping) is an incomplete regex escape, hence the PatternSyntaxException. Use replaceAll("\\\\", ""), or the literal replace("\\", "").
= and : are the symbols that separate a key from its value. What if you have foo=bar=baz? Or foo:bar:baz? Which is the key and which is the value?
If you want to enforce different rules, then implement your own mechanism and don't use java.util.Properties. For the complete set of rules, see Properties.load(..).
You can, after storing the properties: 1. read the file into a string, 2. replace the escaped characters, 3. write the new string back to the file.
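A sketch of that post-processing step (PropertiesCleaner and stripEscapes are hypothetical names; note that replace() takes literal strings, whereas replaceAll() would need "\\\\" because its pattern argument is a regex):

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class PropertiesCleaner {
    // Removes the backslashes Properties.store() inserts before ':' and '='.
    public static String stripEscapes(String propsText) {
        // replace() treats both arguments literally: "\\:" is just backslash-colon.
        return propsText.replace("\\:", ":").replace("\\=", "=");
    }

    public static void cleanFile(Path file) throws IOException {
        // Properties.store(OutputStream, ...) writes ISO 8859-1, so read it back the same way.
        String text = new String(Files.readAllBytes(file), StandardCharsets.ISO_8859_1);
        Files.write(file, stripEscapes(text).getBytes(StandardCharsets.ISO_8859_1));
    }
}
```

Be aware that stripping these escapes makes the file unreadable by Properties.load() if any key itself contains ':' or '=', which is exactly the ambiguity the escaping exists to prevent.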
I am trying to read in some sentences from a file that contains Unicode characters. It does print out a string, but for some reason it messes up the Unicode characters.
This is the code I have:
public static String readSentence(String resourceName) {
    String sentence = null;
    try {
        InputStream refStream = ClassLoader
                .getSystemResourceAsStream(resourceName);
        BufferedReader br = new BufferedReader(new InputStreamReader(
                refStream, Charset.forName("UTF-8")));
        sentence = br.readLine();
    } catch (IOException e) {
        throw new RuntimeException("Cannot read sentence: " + resourceName);
    }
    return sentence.trim();
}
The problem is probably in the way that the string is being output.
I suggest that you confirm that you are correctly reading the Unicode characters by doing something like this:
for (char c : sentence.toCharArray()) {
    System.err.println("char '" + c + "' is unicode codepoint " + ((int) c));
}
and see if the Unicode codepoints are correct for the characters that are being messed up. If they are correct, then the problem is output side: if not, then input side.
First, you could create the InputStreamReader as
new InputStreamReader(refStream, "UTF-8")
Also, you should verify if the resource really contains UTF-8 content.
One of the most annoying causes could be... your IDE settings.
If your IDE's default console encoding is something like Latin-1, you can struggle for a long time with different variations of Java code, and nothing will help until you set the relevant IDE options correctly.
I'm still teaching myself Java, so I wanted to try to read a text file and step 1) output it to the console and step 2) write the contents to a new txt file.
Here is some code I googled to start with. It is reading the file, but when I output the line contents to the console I get the following (it looks like it's outputting in Unicode or something, as if every character has an extra byte associated with it):
ÿþFF□u□l□l□ □T□i□l□t□ □P□o□k□e□r□ <SNIP>
Here is what the first line of the file looks like when I open in via notepad:
Full Tilt Poker Game #xxxxxxxxxx: $1 + $0.20 Sit & Go (xxxxxxxx), Table 1 - 15/30 - No Limit Hold'em - 22:09:45 ET - 2009/12/26
Here is my code. Do I need to specify the encoding to display the txt file contents in the console? I assumed that simple text would be straightforward for Java... but I'm new and don't understand much about how finicky Java is yet.
EDIT: I don't know if it matters, but I'm currently using Eclipse as my IDE.
package readWrite;

import java.io.*;

public class Read {
    public static void main(String args[]) {
        BufferedReader reader = null;
        try {
            reader = new BufferedReader(new FileReader("C:\\Users\\brian\\workspace\\downloads\\poker_text.txt"));
            String line = reader.readLine();
            while (line != null) {
                // Print read line
                System.out.println(line);
                // Read next line for while condition
                line = reader.readLine();
            }
        } catch (IOException ioe) {
            System.out.println(ioe.getMessage());
        } finally {
            try { if (reader != null) reader.close(); } catch (Exception e) {}
        }
    }
}
The ÿþ at the beginning appears to be a Byte Order Mark for a UTF-16 encoded file.
http://en.wikipedia.org/wiki/Byte_order_mark#UTF-16
You might need to read the file in a different manner so Java can convert those UTF-16 characters to something your System.out can display.
Try something like this:
FileInputStream fis = new FileInputStream("filename");
BufferedReader reader = new BufferedReader(new InputStreamReader(fis, "UTF-16"));
Or:
Open up your text file in Notepad again, and do File / Save As. On the save screen (at least in Windows 7) there is a pulldown with the encoding setting. Choose ANSI or UTF-8.