Java FileWriter outputs a question mark

I have been unable to find the reason for this. The only problem I am having in this code is that when the FileWriter tries to put the new value into the text file, it instead puts a ?. I have no clue why, or even what it means. Here is the code:
if (secMessage[1].equalsIgnoreCase("add")) {
    if (secMessage.length == 2) {
        try {
            String deaths = readFile("C:/Users/Samboni/Documents/Stuff For Streaming/deaths.txt", Charset.defaultCharset());
            FileWriter write = new FileWriter("C:/Users/Samboni/Documents/Stuff For Streaming/deaths.txt");
            int comb = Integer.parseInt(deaths) + 1;
            write.write(comb);
            write.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
And here is the readFile method:
static String readFile(String path, Charset encoding) throws IOException {
    byte[] encoded = Files.readAllBytes(Paths.get(path));
    return new String(encoded, encoding);
}
Also, the secMessage array is an array of strings containing the words of an IRC message split into individual words, that way the program can react to the commands on a word-by-word basis.

You're calling Writer.write(int). That writes a single UTF-16 code unit to the file, taking just the bottom 16 bits of the int. If your platform default encoding isn't able to represent the character you're trying to write, it will write '?' as a replacement character.
I suspect you actually want to write out a text representation of the number, in which case you should use:
write.write(String.valueOf(comb));
In other words, turn the value into a string and then write it out. So if comb is 123, you'll get three characters ('1', '2', '3') written to the file.
Personally I'd avoid FileWriter though - I prefer using OutputStreamWriter wrapping FileOutputStream so you can control the encoding. Or in Java 7, you can use Files.newBufferedWriter to do it more simply.
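For example, here is a minimal sketch (Java 7+, class name made up, path taken from the question) of the whole read-increment-write cycle with an explicit charset:
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class DeathCounter {
    public static void main(String[] args) throws IOException {
        Path file = Paths.get("C:/Users/Samboni/Documents/Stuff For Streaming/deaths.txt");

        // Read the whole file as UTF-8 text and parse the current count (trim tolerates a trailing newline).
        String deaths = new String(Files.readAllBytes(file), StandardCharsets.UTF_8).trim();
        int comb = Integer.parseInt(deaths) + 1;

        // Write the count back as a string, with an explicit encoding;
        // try-with-resources closes the writer even if the write fails.
        try (BufferedWriter writer = Files.newBufferedWriter(file, StandardCharsets.UTF_8)) {
            writer.write(String.valueOf(comb));
        }
    }
}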

write.write(Integer.toString(comb));
You can convert the int into a string. Otherwise the int is treated as a single character code, which only makes sense for a tiny range of values, so it is not recommended.

Related

The proper uses of a print writer and file writer

Why should we use a FileWriter and then wrap it with a PrintWriter when we can use a PrintWriter directly? And we use buffered readers so they can read large chunks of data at once, but to get the output printed we have to loop over them in a while loop. Isn't there a simpler way to get the output printed?
Let's first have a look at the javadoc for the main differences.
FileWriter
Convenience class for writing character files. The constructors of this class assume that the default character encoding ... FileWriter is meant for writing streams of characters.
PrintWriter
Prints formatted representations of objects to a text-output stream.
This means FileWriter focuses on character-wise output and does not let you choose the character encoding, whereas PrintWriter focuses on formatted text output and does let you specify the encoding.
Here is a small example as a demonstration.
// we need this as there is no convenient method to output a platform
// specific line separator character(s)
String newLine = System.getProperty("line.separator");
try (FileWriter fw = new FileWriter("/tmp/fw.txt")) {
    fw.append('\u2126').append(newLine);
    fw.write(65);
    fw.append(newLine);
    fw.append(String.format("%10s: %s%n", "some", "value"));
    fw.append("some line").append(newLine);
} catch (IOException ex) {
    System.err.println("something failed: " + ex.getMessage());
}
// the println() methods will append the right platform specific line separator
// character(s)
try (PrintWriter pw = new PrintWriter("/tmp/pw.txt", "UTF8")) {
    pw.append('\u2126');
    pw.println();
    pw.write(65);
    pw.println();
    pw.printf("%10s: %s%n", "some", "value");
    pw.println("some line");
} catch (FileNotFoundException | UnsupportedEncodingException ex) {
    System.err.println(ex.getMessage());
}
If you run the snippet on a Unicode-aware machine (or run the code as java -Dfile.encoding=UTF-8 ...) the output will be:
fw.txt
Ω
A
some: value
some line
pw.txt
Ω
A
some: value
some line
For the above examples the code and the result look more or less the same. PrintWriter provides methods for formatted output, whereas with FileWriter you have to do the formatting before the output.
But the big difference shows up when your environment is not Unicode-aware (or when you run the code as java -Dfile.encoding=ISO-8859-1 ...):
fw.txt
?
A
some: value
some line
The Unicode omega character cannot be represented in the ISO-8859-1 encoding.
With the PrintWriter we defined the character encoding for the output, which is independent of the environment's default encoding.
pw.txt
Ω
A
some: value
some line
Back to your question: wrapping a FileWriter in a PrintWriter is possible, but you lose the main benefit, the ability to choose the character encoding.
try (PrintWriter pw = new PrintWriter(new FileWriter("/tmp/pwfw.txt"))) {
    pw.append('\u2126');
    pw.println();
} catch (IOException ex) {
    System.err.println("something failed: " + ex.getMessage());
}
The file pwfw.txt will contain the omega character only if the environment's default encoding can represent it, so you would have the same encoding limitation as with FileWriter.
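If you need both PrintWriter's formatting methods and control over the encoding, you can wrap an OutputStreamWriter with an explicit Charset instead of a FileWriter. A small sketch (the file name is made up):
// PrintWriter formatting plus an explicit encoding: wrap an OutputStreamWriter
// (with a Charset) instead of a FileWriter.
try (PrintWriter pw = new PrintWriter(new OutputStreamWriter(
        new FileOutputStream("/tmp/pwosw.txt"), StandardCharsets.UTF_8))) {
    pw.append('\u2126');
    pw.println();
} catch (IOException ex) {
    System.err.println("something failed: " + ex.getMessage());
}
This writes the omega character correctly regardless of the platform default encoding.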
Whether to use FileWriter or PrintWriter depends on your needs; I believe PrintWriter will do the job most of the time.

Safe implementation of BufferedReader

I want to use a BufferedReader to read a file uploaded to my server.
The file should be written as a CSV file, but I can't assume this, so I coded some tests where the file is an image or a binary file (supposing the client has sent me the wrong file, or an attacker is trying to break my service), or, even worse, the file is a valid CSV file but has a line of 100 MB.
My application can deal with this problem, but it has to read the first line of the file:
...
String firstLine = bufferedReader.readLine();
//Perform some validations and reject the file if it's not a CSV file
...
But, when I coded some tests, I found a potential risk: BufferedReader doesn't put any limit on the number of bytes it reads while looking for a line terminator, so it can end up throwing an OutOfMemoryError.
This is my test:
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.Arrays;

import org.junit.Test;

public class BufferedReaderTest {

    @Test(expected = OutOfMemoryError.class)
    public void testReadFileWithoutReturnLineCharacter() throws IOException {
        BufferedReader bf = new BufferedReader(getInfiniteReader());
        bf.readLine();
        bf.close();
    }

    private Reader getInfiniteReader() {
        return new Reader() {
            @Override
            public int read(char[] cbuf, int off, int len) throws IOException {
                // an endless stream of 'A's: fill the requested range and never signal end of stream
                Arrays.fill(cbuf, off, off + len, 'A');
                return len;
            }

            @Override
            public void close() throws IOException {
            }
        };
    }
}
I've been looking for a safe BufferedReader implementation on the internet, but I can't find anything. The only class I've found is BoundedInputStream from Apache Commons IO, which limits the number of bytes read by an input stream.
I need an implementation of BufferedReader that knows how to limit the number of bytes/characters read in each line.
Something like this:
The app calls 'readLine()'
The BufferedReader reads until it finds a line terminator or reaches the maximum number of bytes allowed
If it found a line terminator, it resets the byte count (so it can read the next line) and returns the content
If it reached the maximum number of bytes allowed, it throws an exception
Does anybody know of an implementation of BufferedReader that has this behaviour?
This is not how you should go about detecting whether a file is binary.
Here is how you can check whether a file is truly text; note that this requires that you know the encoding beforehand:
final Charset cs = StandardCharsets.UTF_8; // or another
final CharsetDecoder decoder = cs.newDecoder()
        .onMalformedInput(CodingErrorAction.REPORT); // default is REPLACE!

// Here, "in" is the input stream from the file
try (
    final Reader reader = new InputStreamReader(in, decoder);
) {
    final char[] buf = new char[4096]; // or other size
    while (reader.read(buf) != -1)
        ; // nothing
} catch (MalformedInputException e) {
    // cannot decode; binary, or wrong encoding
}
Now, since you can initialize a BufferedReader over a Reader, you can use:
try (
    final Reader r = new InputStreamReader(in, decoder);
    final BufferedReader reader = new BufferedReader(r);
) {
    // Read lines normally
} catch (CharacterCodingException e) {
    // Not a CSV, it seems
}
// etc
Now, a little more explanation about how this works... While this is a fundamental part of reading text in Java, it is a part which is equally fundamentally misunderstood!
When you read a file as text using a Reader, you have to specify a character encoding; in Java, this is a Charset.
What happens internally is that Java will create a CharsetDecoder from that Charset, read the byte stream and output a char stream. And there are three ways to deal with errors:
CodingErrorAction.REPLACE (the default): unmappable byte sequences are replaced with the Unicode replacement character (it does ring a bell, right?);
CodingErrorAction.IGNORE: unmappable byte sequences do not trigger the emission of a char;
CodingErrorAction.REPORT: unmappable byte sequences trigger a CharacterCodingException to be thrown, which inherits IOException; in turn, the two subclasses of CharacterCodingException are MalformedInputException and UnmappableCharacterException.
Therefore, what you need to do in order to detect whether a file is truly text is to:
know the encoding beforehand!
use a CharsetDecoder configured with CodingErrorAction.REPORT;
use it in an InputStreamReader.
This is one way; there are others. All of them however will use a CharsetDecoder at some point.
Similarly, there is a CharsetEncoder for the reverse operation (char stream to byte stream), and this is what is used by the Writer family.
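The same REPORT trick works on the writing side. Here is a minimal sketch (file name made up; needs java.io.* and java.nio.charset.*) that makes unencodable characters fail loudly instead of silently turning into replacement characters:
final CharsetEncoder encoder = StandardCharsets.US_ASCII.newEncoder()
        .onMalformedInput(CodingErrorAction.REPORT)
        .onUnmappableCharacter(CodingErrorAction.REPORT);

try (Writer writer = new OutputStreamWriter(new FileOutputStream("out.txt"), encoder)) {
    writer.write('\u2126'); // not representable in US-ASCII
} catch (CharacterCodingException e) {
    // thrown (on write, flush or close) instead of a '?' being written
} catch (IOException e) {
    // other I/O failure
}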
Thank you @fge for the answer. I ended up implementing a safe Reader that can deal with files with overly long lines (or without lines at all).
If anybody wants to see the code, the project (very small project even with many tests) is available here:
https://github.com/jfcorugedo/security-io
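For anyone who only needs the length cap described in the question, here is a minimal sketch of the idea (not the code from the repository above): read character by character and fail once a line gets too long. Wrap the underlying Reader in a BufferedReader first so the per-character reads stay cheap.
// Sketch: a readLine() with an upper bound on line length.
public static String readBoundedLine(Reader reader, int maxLineLength) throws IOException {
    StringBuilder line = new StringBuilder();
    int c;
    while ((c = reader.read()) != -1) {
        if (c == '\n') {
            return line.toString();
        }
        if (c != '\r') {
            line.append((char) c);
        }
        if (line.length() > maxLineLength) {
            throw new IOException("Line exceeds " + maxLineLength + " characters");
        }
    }
    return line.length() > 0 ? line.toString() : null; // end of stream
}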

Unexpected characters appear after writing to a text file

When I try to get the text from a document, if it contains some special characters such as ™ (trademark) or © (copyright) and so on, writing it to a text file adds some unexpected characters. As an example, consider the following:
If we have Apache™ Hadoop™! and we try to write it to a text file using FileOutputStream, the result looks like Apacheâ Hadoopâ, where the â is nonsense to me. In general I want a way to detect such characters in the text and skip them when writing. Is there a solution to this?
If you want just the printable ASCII range, then iterate over your string character by character building a new string. Include the character only if it's within the range 0x20 to 0x7E.
final StringBuilder buff = new StringBuilder();
for (char c : string.toCharArray())
{
    if (c >= 0x20 && c <= 0x7E)
    {
        buff.append(c);
    }
}
final FileWriter w = new FileWriter(...);
w.write(buff.toString());
w.close();
If you want to keep carriage returns and newlines, you also need to consider 0x0A and 0x0D.
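If you would rather not write the loop yourself, the same filter can be expressed as a single regular expression (a sketch; it keeps printable ASCII plus carriage returns and newlines):
// Drop everything outside printable ASCII, keeping \r and \n.
String cleaned = string.replaceAll("[^\\x20-\\x7E\\r\\n]", "");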
I mis-read the question originally and didn't notice you wanted to skip them. I'll leave this here for now and will delete it if someone posts something better.
To deal with the characters properly, you can explicitly set the charset to ISO-8859-1. To do this, you'll need to use something like an OutputStreamWriter.
final OutputStreamWriter writer;
writer = new OutputStreamWriter(new FileOutputStream(file),
Charset.forName("ISO-8859-1"));
writer.write(string);
writer.close();
This won't skip them, but should encode them properly.
The reason is a character-encoding problem. Before you write the string to a file, you need to encode its characters. You can do it like this:
Writer out = new OutputStreamWriter(new FileOutputStream(
        new File("D://helloWorld.txt")), "UTF8");
String tm = "Apache™ Hadoop™";
out.write(tm);
out.close();
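Note that when you read the file back (or open it in an editor), you must use the same charset, otherwise the "â"-style garbage reappears. A sketch of the matching read side, using the same file name as above:
BufferedReader in = new BufferedReader(new InputStreamReader(
        new FileInputStream("D://helloWorld.txt"), "UTF8"));
String line = in.readLine(); // "Apache™ Hadoop™"
in.close();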

How can I convert UTF-8 literals into their UTF-8 characters?

I have a bunch of text files that were encoded in UTF-8. The text inside the files look like this: \x6c\x69b/\x62\x2f\x6d\x69nd/m\x61x\x2e\x70h\x70.
I've copied all these text files and placed them into a directory /convert/.
I need to read each file, convert the encoded literals into characters, then save the result as filename.converted.txt.
What would be the smartest approach to do this? What can I do to convert to the new text? Is there a function for handling Unicode text to convert between the literal to character types? Should I be using a different programming language for this?
This is what I have at the moment:
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileWriter;

public class decode {
    public static void main(String args[]) {
        File directory = new File("C:/convert/");
        String[] files = directory.list();
        boolean success = false;
        for (String file : files) {
            System.out.println("Processing \"" + file + "\"");
            //TODO read each file and convert them into characters
            success = true;
            if (success) {
                System.out.println("Successfully converted \"" + file + "\"");
            } else {
                System.out.println("Failed to convert \"" + file + "\"");
            }
            //save file
            if (success) {
                try {
                    FileWriter open = new FileWriter("C:/convert/" + file + ".converted.txt");
                    BufferedWriter write = new BufferedWriter(open);
                    write.write("TODO: write converted text into file");
                    write.close();
                    System.out.println("Successfully saved \"" + file + "\" conversion.");
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }
        }
    }
}
(It looks like there's some confusion about what you mean - this answer assumes the input file is entirely in ASCII, and uses "\x" to hex-encode any bytes which aren't in the ASCII range.)
It sounds to me like the UTF-8 part of it is actually irrelevant. You can treat it as opaque binary data for output. Assuming the input file is entirely ASCII:
Open the input file as text (e.g. using FileInputStream wrapped in InputStreamReader specifying an encoding of "US-ASCII")
Open the output file as binary (e.g. using FileOutputStream)
Read each character from the input
Is it '\'?
If not, write the character's ASCII value to the output stream (just cast from char to byte)
What's the next character?
If it's 'x', read the next two characters, convert them from hex to a byte (there's lots of code around to do this part), and write that byte to the output stream
If it's '\', write the ASCII value for '\' to the output stream
Otherwise, possibly throw an exception indicating failure
Loop until you've exhausted the input file
Close both files in finally blocks
You'll then have a "normal" UTF-8 file which should be readable by any text editor which supports UTF-8.
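A sketch of those steps (the class name is made up, it assumes the input really is ASCII with \xNN escapes, and try-with-resources stands in for the finally blocks):
import java.io.*;

public class LiteralDecoder {
    public static void decode(File in, File out) throws IOException {
        try (Reader reader = new InputStreamReader(new FileInputStream(in), "US-ASCII");
             OutputStream output = new FileOutputStream(out)) {
            int c;
            while ((c = reader.read()) != -1) {
                if (c != '\\') {
                    output.write(c); // plain ASCII: copy through unchanged
                    continue;
                }
                int next = reader.read();
                if (next == 'x') {
                    // two hex digits -> one raw byte
                    int hi = reader.read();
                    int lo = reader.read();
                    output.write(Integer.parseInt("" + (char) hi + (char) lo, 16));
                } else if (next == '\\') {
                    output.write('\\'); // escaped backslash
                } else {
                    throw new IOException("Unexpected escape: \\" + (char) next);
                }
            }
        }
    }
}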
java.io.InputStreamReader can be used to convert an input stream from an arbitrary charset into Java chars. I'm not exactly sure how you want to write it back out, though. Do you want non-ASCII characters to be written out as ASCII Unicode escape sequences?

Java: reading strings from a random access file with buffered input

I've never had much experience with the Java I/O API before and I'm really frustrated now. I find it hard to believe how strange and complex it is and how hard it can be to do a simple task.
My task: I have 2 positions (starting byte, ending byte), pos1 and pos2. I need to read lines between these two bytes (including the starting one, not including the ending one) and use them as UTF8 String objects.
For example, in most scripting languages it would be a very simple two- or three-liner like this (in Ruby, but it would be essentially the same for Python, Perl, etc.):
f = File.open("file.txt")
f.seek(pos1)
while f.pos < pos2
  s = f.readline
  # do something with "s" here
end
It quickly becomes hell with the Java I/O APIs ;) In fact, I see only two ways to read lines (ending with \n) from regular local files:
RandomAccessFile has getFilePointer() and seek(long pos), but its readLine() reads non-UTF-8 strings (not even byte arrays, but very strange strings with broken encoding), and it has no buffering (which probably means that every read*() call is translated into a single underlying OS read(), so it is fairly slow).
BufferedReader has a great readLine() method, and it can even do some seeking with skip(long n), but there is no way to determine even the number of bytes already read, let alone the current position in the file.
I've tried to use something like:
FileInputStream fis = new FileInputStream(fileName);
FileChannel fc = fis.getChannel();
BufferedReader br = new BufferedReader(
new InputStreamReader(
fis,
CHARSET_UTF8
)
);
... and then using fc.position() to get the current reading position and fc.position(newPosition) to set it, but it doesn't seem to work in my case: it looks like it returns the position of the buffer pre-filling done by BufferedReader, or something like that; these counters seem to be rounded up in 16K increments.
Do I really have to implement it all by myself, i.e. a file reading interface which would:
allow me to get/set position in a file
buffer file reading operations
allow reading UTF8 strings (or at least allow operations like "read everything till the next \n")
Is there a quicker way than implementing it all myself? Am I overlooking something?
import org.apache.commons.io.input.BoundedInputStream;

FileInputStream file = new FileInputStream(filename);
file.skip(pos1);
BufferedReader br = new BufferedReader(
        new InputStreamReader(new BoundedInputStream(file, pos2 - pos1), "UTF-8")); // UTF-8, as the question requires
If you didn't care about pos2, then you wouldn't need Apache Commons IO.
I wrote this code to read UTF-8 using RandomAccessFile:
//File: CyclicBuffer.java
public class CyclicBuffer {
    private static final int size = 3;
    private FileChannel channel;
    private ByteBuffer buffer = ByteBuffer.allocate(size);

    public CyclicBuffer(FileChannel channel) {
        this.channel = channel;
    }

    private int read() throws IOException {
        return channel.read(buffer);
    }

    /**
     * Returns the byte read.
     *
     * @return the byte read, or -1 if the end of file is reached
     * @throws IOException
     */
    public byte get() throws IOException {
        if (buffer.hasRemaining()) {
            return buffer.get();
        } else {
            buffer.clear();
            int eof = read();
            if (eof == -1) {
                return (byte) eof;
            }
            buffer.flip();
            return buffer.get();
        }
    }
}
//File: UTFRandomFileLineReader.java
public class UTFRandomFileLineReader {
    private final Charset charset = Charset.forName("utf-8");
    private CyclicBuffer buffer;
    private ByteBuffer temp = ByteBuffer.allocate(4096);
    private boolean eof = false;

    public UTFRandomFileLineReader(FileChannel channel) {
        this.buffer = new CyclicBuffer(channel);
    }

    public String readLine() throws IOException {
        if (eof) {
            return null;
        }
        byte x = 0;
        temp.clear();
        while ((byte) -1 != (x = (buffer.get())) && x != '\n') {
            if (temp.position() == temp.capacity()) {
                temp = addCapacity(temp);
            }
            temp.put(x);
        }
        if (x == -1) {
            eof = true;
        }
        temp.flip();
        if (temp.hasRemaining()) {
            return charset.decode(temp).toString();
        } else {
            return null;
        }
    }

    private ByteBuffer addCapacity(ByteBuffer temp) {
        ByteBuffer t = ByteBuffer.allocate(temp.capacity() + 1024);
        temp.flip();
        t.put(temp);
        return t;
    }

    public static void main(String[] args) throws IOException {
        RandomAccessFile file = new RandomAccessFile("/Users/sachins/utf8.txt", "r");
        UTFRandomFileLineReader reader = new UTFRandomFileLineReader(file.getChannel());
        int i = 1;
        while (true) {
            String s = reader.readLine();
            if (s == null)
                break;
            System.out.println("\n line " + i++);
            s = s + "\n";
            for (byte b : s.getBytes(Charset.forName("utf-8"))) {
                System.out.printf("%x", b);
            }
            System.out.printf("\n");
        }
    }
}
For @Ken Bloom: a very quick go at a Java 7 version. Note: I don't think this is the most efficient way; I'm still getting my head around NIO.2. Oracle has started their tutorial here.
Also note that this isn't using Java 7's new ARM (try-with-resources) syntax, which takes care of the exception handling for file-based resources; it wasn't working in the latest OpenJDK build that I have. But if people want to see the syntax, let me know.
/*
 * Paths uses the default file system; note that no exception is thrown at this stage if
 * the file is missing
 */
Path file = Paths.get("C:/Projects/timesheet.txt");
ByteBuffer readBuffer = ByteBuffer.allocate(readBufferSize);
FileChannel fc = null;
try
{
    /*
     * Files.newByteChannel returns a SeekableByteChannel - one of the fun new constructs
     * for file-based I/O; e.g. if you declared an AsynchronousFileChannel you could read
     * and write to that channel simultaneously with multiple threads.
     */
    fc = (FileChannel) Files.newByteChannel(file, StandardOpenOption.READ);
    fc.position(startPosition);
    while (fc.read(readBuffer) != -1)
    {
        readBuffer.flip();  // flip before decoding so only the bytes just read are decoded
        System.out.println(Charset.forName(encoding).decode(readBuffer));
        readBuffer.clear(); // make room for the next read
    }
}
Start with a RandomAccessFile and use read or readFully to get a byte array between pos1 and pos2. Let's say that we've stored the data read in a variable named rawBytes.
Then create your BufferedReader using
new BufferedReader(new InputStreamReader(new ByteArrayInputStream(rawBytes)))
Then you can call readLine on the BufferedReader.
Caveat: this probably uses more memory than if you could make the BufferedReader seek to the right location itself, because it preloads everything into memory.
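Put together, that looks roughly like this (pos1 and pos2 are the positions from the question; the file name is made up):
// Read the byte range [pos1, pos2) with RandomAccessFile, then iterate over it line by line as UTF-8.
RandomAccessFile raf = new RandomAccessFile("file.txt", "r");
byte[] rawBytes = new byte[(int) (pos2 - pos1)];
raf.seek(pos1);
raf.readFully(rawBytes); // read exactly the bytes between pos1 and pos2
raf.close();

BufferedReader reader = new BufferedReader(
        new InputStreamReader(new ByteArrayInputStream(rawBytes), "UTF-8"));
String line;
while ((line = reader.readLine()) != null) {
    // do something with "line"
}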
I think the confusion is caused by the UTF-8 encoding and the possibility of double byte characters.
UTF-8 doesn't use a fixed number of bytes per character. I'm assuming from your post that you are using single-byte characters. For example, 412 bytes would mean 412 characters. But if the string were using double-byte characters, you would get only 206 characters.
The original java.io package didn't deal well with this multi-byte confusion. So, they added more classes to deal specifically with strings. The package mixes two different types of file handlers (and they can be confusing until the nomenclature is sorted out). The stream classes provide for direct data I/O without any conversion. The reader classes convert files to strings with full support for multi-byte characters. That might help clarify part of the problem.
Since you state you are using UTF-8 characters, you want the reader classes. In this case, I suggest FileReader. The skip() method in FileReader allows you to skip past X characters and then start reading text. Alternatively, I prefer the overloaded read() method since it allows you to grab all the text at one time.
If you assume your "bytes" are individual characters, try something like this:
FileReader fr = new FileReader(new File("x.txt"));
char[] buffer = new char[pos2 - pos1];
fr.skip(pos1);                     // skip to the starting position (in characters)
fr.read(buffer, 0, buffer.length); // the offset argument indexes the buffer, not the file
...
I'm late to the party here, but I ran across this problem in my own project.
After much traversal of Javadocs and Stack Overflow, I think I found a simple solution.
After seeking to the appropriate place in your RandomAccessFile, which I am here calling raFile, do the following:
FileDescriptor fd = raFile.getFD();
FileReader fr = new FileReader(fd);
BufferedReader br = new BufferedReader(fr);
Then you should be able to call br.readLine() to your heart's content, which will be much faster than calling raFile.readLine().
The one thing I'm not sure about is whether UTF8 strings are handled correctly.
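If UTF-8 matters, one workaround (a sketch, reusing raFile from above) is to avoid FileReader, which always uses the platform default encoding, and build the reader with an explicit charset instead:
FileDescriptor fd = raFile.getFD();
BufferedReader br = new BufferedReader(
        new InputStreamReader(new FileInputStream(fd), Charset.forName("UTF-8")));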
The Java I/O API is very flexible; unfortunately, sometimes the flexibility makes it verbose. The main idea is that there are many streams, writers and readers that implement the wrapper pattern. For example, BufferedInputStream wraps any other InputStream, and the same goes for output streams.
The difference between streams and readers/writers is that streams work with bytes while readers/writers work with characters.
Fortunately some streams, writers and readers have convenient constructors that simplify coding. If you want to read a file you just have to say:
InputStream in = new FileInputStream("/usr/home/me/myfile.txt");
if (in.markSupported()) {
    in.skip(1024);
    in.read();
}
It is not as complicated as you feared.
Channels are something different. They are part of the so-called "new I/O", or NIO. New I/O is non-blocking - that is its main advantage. You can search the internet for any "nio java tutorial" and read about it. But it is more complicated than regular I/O and is not needed for most applications.
