Removing the BOM character with Java [duplicate]

This question already has answers here:
Byte order mark screws up file reading in Java
(11 answers)
Closed 8 years ago.
I am trying to read files using FileReader and write them into a separate file.
These files are UTF-8 encoded, but unfortunately some of them still contain a BOM.
The relevant code I tried is this:
private final String UTF8_BOM = "\uFEFF";
private String removeUTF8BOM(String s)
{
if (s.startsWith(UTF8_BOM))
{
s=s.replace(UTF8_BOM, "");
}
return s;
}
line=removeUTF8BOM(line);
But for some reason the BOM is not removed. Is there any other way I can do this with FileReader? I know that there is the BOMInputStream that should work, but I'd rather find a solution using FileReader.

The class FileReader is an old utility class that uses the platform encoding (before Java 11 it had no constructor taking a charset). On Windows that is likely not UTF-8.
It is best to read with another class.
For amusement, and to clarify the error, here is a dirty hack that works on platforms with single-byte encodings:
private final String UTF8_BOM = new String("\uFEFF".getBytes(StandardCharsets.UTF_8));
This takes the UTF-8 bytes of the BOM and decodes them into a String using the current platform encoding, which is how FileReader sees them.
Needless to say, FileReader is non-portable, as it deals only with local files.
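A minimal sketch of what "reading with another class" could look like, assuming Java 7+ and that the files really are UTF-8. The names BomStrippingReadDemo and readLinesWithoutBom are illustrative, not from the answer above. Once the bytes are decoded as UTF-8, the BOM arrives as the single character U+FEFF, so a check like the one in the question can work on the first line.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

class BomStrippingReadDemo {
    private static final char BOM = '\uFEFF';

    // Reads all lines as UTF-8 and drops a leading BOM from the first line, if present.
    static List<String> readLinesWithoutBom(Path file) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            String line;
            boolean first = true;
            while ((line = reader.readLine()) != null) {
                if (first && !line.isEmpty() && line.charAt(0) == BOM) {
                    line = line.substring(1);
                }
                first = false;
                lines.add(line);
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("bom-demo", ".txt");
        // EF BB BF is the UTF-8 encoding of U+FEFF.
        byte[] content = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'H', 'i', '\n'};
        Files.write(tmp, content);
        System.out.println(readLinesWithoutBom(tmp));
        Files.delete(tmp);
    }
}
```

Only the first line is checked, because a BOM is only meaningful at the very start of the file.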

Naive Solution to the question as asked:
public static void main(final String[] args)
{
final String hasbom = "\uFEFF" + "Hello World!";
final String nobom = hasbom.charAt(0) == '\uFEFF' ? hasbom.substring(1) : hasbom;
System.out.println(hasbom.equals(nobom));
}
Outputs:
false
Proper Solution Approach:
You should never program to a File based API and instead program against InputStream/OutputStream so that your code is portable to different source locations.
This is just an untested example of how you might go about encapsulating this behavior into an InputStream to make it transparent.
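public class BomProofInputStream extends InputStream
{
private final PushbackInputStream is;
private boolean isFirstRead = true;
public BomProofInputStream(@Nonnull final InputStream is)
{
// a 3-byte pushback buffer lets us hand back bytes that turn out not to be a BOM
this.is = new PushbackInputStream(is, 3);
}
@Override
public int read() throws IOException
{
if (this.isFirstRead)
{
this.isFirstRead = false;
// at the byte level the UTF-8 BOM is EF BB BF, not the single char U+FEFF
final byte[] head = new byte[3];
int n = 0;
int r;
while (n < 3 && (r = this.is.read(head, n, 3 - n)) > 0) { n += r; }
final boolean bom = n == 3 && (head[0] & 0xFF) == 0xEF
&& (head[1] & 0xFF) == 0xBB && (head[2] & 0xFF) == 0xBF;
if (!bom && n > 0) { this.is.unread(head, 0, n); }
}
return this.is.read();
}
}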
Found a full-fledged example with some searching:

Related

Issue with encoding; .jar doesn't work with Cyrillic characters in UTF-8 files

So I have this regex as String literal in my code:
private static final String FILE_PATTERN = "((\\s*\".*НЕКОТОРЫЕ СИМВОЛЫ .*\"\\R)([^\"].* (?!-)\\d+\\s*)+)+";
Also I have input test files in UTF-8 encoding.
And the problem is that when I test my program in IDE (IntelliJ IDEA in my case) everything is OK. Particularly, regex works with Cyrillic characters in test files.
But when I built my program (Maven) and tested the .jar file with the same test files, it turned out that the regex most likely doesn't work with Cyrillic characters.
Then I tested it again with file in Windows 1251 encoding and it worked.
So my question is - how can I make my .jar work with UTF-8 files, just like in IDE?
Thanks in advance.
[UPDATE1]
two test files, one in UTF-8 and another in Windows 1251
I've tried to replace Cyrillic characters with \u codes like this:
private static final String FILE_PATTERN = "((\\s*\".*\\u041E\\u0442\\u0434\\u0435\\u043B .*\"\\R)([^\"].* (?!-)\\d+\\s*)+)+";
this doesn't work :(
[UPDATE2]
File processing starts like this:
static void processFile(String inputFile) {
try {
String fileStr = FileHandler.readFile(inputFile).toString();
if (!FileParser.validateFile(fileStr)) {
System.out.println("Sorry, input file format is invalid");
...
File validating looks like this:
public class FileParser {
private static final String FILE_PATTERN = "((\\s*\".*Отдел .*\"\\R)([^\"].* (?!-)\\d+\\s*)+)+";
public static boolean validateFile(String fileStr) {
return Pattern.compile(FILE_PATTERN).matcher(fileStr).matches();
}
...
File reading is very common I think:
public class FileHandler {
public static StringBuilder readFile(String fileName) {
StringBuilder res = new StringBuilder();
String temp;
try (BufferedReader r = new BufferedReader(new FileReader((fileName)))) {
while ((temp = r.readLine()) != null) {
res.append(temp).append("\n");
}
} catch (FileNotFoundException e) {
System.out.println("Input file not found!");
} catch (IOException e) {
// log exception
}
return res;
}
...

I'll throw some possibilities at the problem.
The classes FileReader and FileWriter use the default platform encoding, and before Java 11 they had no overload for specifying an encoding. I am not sure whether relying on the default is intended here, but here is one of the alternatives:
public static StringBuilder readFile(String fileName) {
StringBuilder res = new StringBuilder();
String temp;
Charset charset = StandardCharsets.UTF_8;
//Charset charset = Charset.forName("Windows-1251");
// Files.newBufferedReader takes a Path, not a String
try (BufferedReader r = Files.newBufferedReader(Paths.get(fileName), charset)) {
while ((temp = r.readLine()) != null) {
res.append(temp).append("\n");
}
} catch (NoSuchFileException e) {
// the NIO API reports a missing file with NoSuchFileException
System.out.println("Input file not found!");
} catch (IOException e) {
// log exception
}
return res;
}
Or:
String readFile(String fileName) throws IOException {
byte[] content = Files.readAllBytes(Paths.get(fileName));
return new String(content, StandardCharsets.UTF_8);
}
Then the editor encoding of the Java sources must match the encoding that the javac compiler is told to use. One can check this by using the \uXXXX ASCII representation of such special chars: if it then suddenly works, ...
You used two backslashes, but a single-backslash escape such as \u0063 (letter c) works at the Java source level, and in fact instead of public class you can write publi\u0063 \u0063lass.
private static final String FILE_PATTERN =
"((\\s*\".*\u041E\u0442\u0434\u0435\u043B .*\"\\R)([^\"].* (?!-)\\d+\\s*)+)+";
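Then there is the regular expression itself, which has two Unicode flags: (?u) for Unicode-aware case folding and (?U) for Unicode versions of the predefined character classes, which among other things determine what counts as a letter. That should not be a problem here.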
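To make the first possibility concrete, here is a small sketch showing that the same UTF-8 bytes match or fail to match depending only on the charset used to decode them. The class and method names are made up for illustration, and ISO-8859-1 merely stands in for whatever single-byte platform default the .jar happens to run under. The Cyrillic word is written with Unicode escapes so that the source file encoding cannot interfere.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;

class CharsetMismatchDemo {
    // The Cyrillic word from the question's pattern, spelled with Unicode escapes.
    static final String WORD = "\u041E\u0442\u0434\u0435\u043B";
    static final Pattern PATTERN = Pattern.compile(WORD);

    // Decodes the raw file bytes with the given charset and searches for the word.
    static boolean foundWhenDecodedAs(byte[] fileBytes, Charset cs) {
        return PATTERN.matcher(new String(fileBytes, cs)).find();
    }

    public static void main(String[] args) {
        // Simulated content of a UTF-8 encoded test file.
        byte[] utf8File = ("\"" + WORD + " 1\"\n").getBytes(StandardCharsets.UTF_8);
        System.out.println(foundWhenDecodedAs(utf8File, StandardCharsets.UTF_8));
        System.out.println(foundWhenDecodedAs(utf8File, StandardCharsets.ISO_8859_1));
    }
}
```

If the IDE run and the .jar run use different default charsets, a FileReader-based readFile would behave like the two calls above.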

How can I keep processed file as a kind of library?

Sorry for the confusing title, but here is my clarification.
I am making a program that reads a text file and keeps it in an array. Then there's a process which needs to read the file every time a new object is created.
(file)->(Array in one class)->(an object is created with the array from that class as a parameter, or has some methods that involve that kind of array)
The question is, is there any way to avoid reading the file every time? Like storing the array as a universal constant or something similar?
thx

This is not particularly good design, but it should give you some ideas.
public class CachedFile {
private static String contents;
// Reads the file once; call this before the first getContents().
public static void load(File file) throws IOException {
StringBuilder sb = new StringBuilder();
try (Reader r = new BufferedReader(new FileReader(file))) {
int ch;
while ((ch = r.read()) != -1) {
sb.append((char) ch);
}
}
contents = sb.toString();
}
public static String getContents() { return contents; }
}
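As one possible variation on the same idea, the sketch below loads the file lazily on first access and hands out a read-only list of lines, so a single cache instance can be passed to every new object. LineCache and its method names are hypothetical, and UTF-8 is an assumption about the file.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Collections;
import java.util.List;

class LineCache {
    private final Path file;
    private List<String> lines; // stays null until the first call to lines()

    LineCache(Path file) {
        this.file = file;
    }

    // Reads the file on the first call only; later calls return the cached list.
    synchronized List<String> lines() throws IOException {
        if (lines == null) {
            lines = Collections.unmodifiableList(
                    Files.readAllLines(file, StandardCharsets.UTF_8));
        }
        return lines;
    }
}
```

Usage would be along the lines of creating one `new LineCache(path)` at startup and passing it to each object's constructor instead of the file name.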

Java: reading strings from a random access file with buffered input

I've never had close experiences with Java IO API before and I'm really frustrated now. I find it hard to believe how strange and complex it is and how hard it could be to do a simple task.
My task: I have 2 positions (starting byte, ending byte), pos1 and pos2. I need to read lines between these two bytes (including the starting one, not including the ending one) and use them as UTF8 String objects.
For example, in most script languages it would be a very simple 1-2-3-liner like that (in Ruby, but it will be essentially the same for Python, Perl, etc):
f = File.open("file.txt").seek(pos1)
while f.pos < pos2 {
s = f.readline
# do something with "s" here
}
It quickly becomes hell with the Java IO APIs ;) In fact, I see two ways to read lines (ending with \n) from regular local files:
RandomAccessFile has getFilePointer() and seek(long pos), but its readLine() reads non-UTF8 strings (and not even byte arrays), but very strange strings with broken encoding, and it has no buffering (which probably means that every read*() call would be translated into a single underlying OS read() => fairly slow).
BufferedReader has a great readLine() method, and it can even do some seeking with skip(long n), but it has no way to determine even the number of bytes that have already been read, not to mention the current position in the file.
I've tried to use something like:
FileInputStream fis = new FileInputStream(fileName);
FileChannel fc = fis.getChannel();
BufferedReader br = new BufferedReader(
new InputStreamReader(
fis,
CHARSET_UTF8
)
);
... and then using fc.position() to get current file reading position and fc.position(newPosition) to set one, but it doesn't seem to work in my case: looks like it returns position of a buffer pre-filling done by BufferedReader, or something like that - these counters seem to be rounded up in 16K increments.
Do I really have to implement it all by myself, i.e. a file reading interface which would:
allow me to get/set position in a file
buffer file reading operations
allow reading UTF8 strings (or at least allow operations like "read everything till the next \n")
Is there a quicker way than implementing it all myself? Am I overlooking something?

import org.apache.commons.io.input.BoundedInputStream;
FileInputStream file = new FileInputStream(filename);
file.skip(pos1);
BufferedReader br = new BufferedReader(
new InputStreamReader(new BoundedInputStream(file, pos2 - pos1), StandardCharsets.UTF_8)
);
If you didn't care about pos2, then you wouldn't need Apache Commons IO.

I wrote this code to read UTF-8 lines using a RandomAccessFile:
//File: CyclicBuffer.java
public class CyclicBuffer {
private static final int size = 3;
private FileChannel channel;
private ByteBuffer buffer = ByteBuffer.allocate(size);
public CyclicBuffer(FileChannel channel) {
this.channel = channel;
}
private int read() throws IOException {
return channel.read(buffer);
}
/**
* Returns the byte read
*
* @return byte read -1 - end of file reached
* @throws IOException
*/
public byte get() throws IOException {
if (buffer.hasRemaining()) {
return buffer.get();
} else {
buffer.clear();
int eof = read();
if (eof == -1) {
return (byte) eof;
}
buffer.flip();
return buffer.get();
}
}
}
//File: UTFRandomFileLineReader.java
public class UTFRandomFileLineReader {
private final Charset charset = Charset.forName("utf-8");
private CyclicBuffer buffer;
private ByteBuffer temp = ByteBuffer.allocate(4096);
private boolean eof = false;
public UTFRandomFileLineReader(FileChannel channel) {
this.buffer = new CyclicBuffer(channel);
}
public String readLine() throws IOException {
if (eof) {
return null;
}
byte x = 0;
temp.clear();
while ((byte) -1 != (x = (buffer.get())) && x != '\n') {
if (temp.position() == temp.capacity()) {
temp = addCapacity(temp);
}
temp.put(x);
}
if (x == -1) {
eof = true;
}
temp.flip();
// return empty lines as ""; only return null at end of file with nothing left to decode
if (temp.hasRemaining() || !eof) {
return charset.decode(temp).toString();
} else {
return null;
}
}
private ByteBuffer addCapacity(ByteBuffer temp) {
ByteBuffer t = ByteBuffer.allocate(temp.capacity() + 1024);
temp.flip();
t.put(temp);
return t;
}
public static void main(String[] args) throws IOException {
RandomAccessFile file = new RandomAccessFile("/Users/sachins/utf8.txt",
"r");
UTFRandomFileLineReader reader = new UTFRandomFileLineReader(file
.getChannel());
int i = 1;
while (true) {
String s = reader.readLine();
if (s == null)
break;
System.out.println("\n line " + i++);
s = s + "\n";
for (byte b : s.getBytes(Charset.forName("utf-8"))) {
System.out.printf("%x", b);
}
System.out.printf("\n");
}
}
}

For @Ken Bloom: a very quick go at a Java 7 version. Note: I don't think this is the most efficient way, I'm still getting my head around NIO.2. Oracle has started their tutorial here
Also note that this isn't using Java 7's new ARM syntax (try-with-resources, which takes care of the exception handling for file based resources); it wasn't working in the latest OpenJDK build that I have. But if people want to see the syntax, let me know.
/*
* Paths uses the default file system; note that no exception is thrown at this
* stage if the file is missing
*/
Path file = Paths.get("C:/Projects/timesheet.txt");
ByteBuffer readBuffer = ByteBuffer.allocate(readBufferSize);
FileChannel fc = null;
try
{
/*
* FileChannel implements SeekableByteChannel, so it can be positioned
* anywhere in the file before reading.
*/
fc = FileChannel.open(file, StandardOpenOption.READ);
fc.position(startPosition);
while (fc.read(readBuffer) != -1)
{
// flip to read what was just written into the buffer, then clear it for the next read
readBuffer.flip();
// note: a multi-byte character split across two reads will be decoded incorrectly
System.out.println(Charset.forName(encoding).decode(readBuffer));
readBuffer.clear();
}
}
finally
{
if (fc != null) { fc.close(); }
}

Start with a RandomAccessFile and use read or readFully to get a byte array between pos1 and pos2. Let's say that we've stored the data read in a variable named rawBytes.
Then create your BufferedReader using
new BufferedReader(new InputStreamReader(new ByteArrayInputStream(rawBytes)))
Then you can call readLine on the BufferedReader.
Caveat: this probably uses more memory than if you could make the BufferedReader seek to the right location itself, because it preloads everything into memory.
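The steps above could be put together roughly as follows. This is only a sketch: ByteRangeLines and readLines are made-up names, the range is assumed to fit in memory (as the caveat says), and UTF-8 is passed explicitly because the question asks for UTF-8 strings.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class ByteRangeLines {
    // Returns the lines stored in bytes [pos1, pos2) of the file, decoded as UTF-8.
    static List<String> readLines(String fileName, long pos1, long pos2) throws IOException {
        byte[] rawBytes = new byte[(int) (pos2 - pos1)]; // assumes the range fits in memory
        try (RandomAccessFile raf = new RandomAccessFile(fileName, "r")) {
            raf.seek(pos1);
            raf.readFully(rawBytes); // throws EOFException if the file ends before pos2
        }
        List<String> lines = new ArrayList<>();
        try (BufferedReader br = new BufferedReader(new InputStreamReader(
                new ByteArrayInputStream(rawBytes), StandardCharsets.UTF_8))) {
            String line;
            while ((line = br.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }
}
```

If pos1 or pos2 falls in the middle of a multi-byte character, the first or last line will contain a replacement character, so the positions should sit on line boundaries.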

I think the confusion is caused by the UTF-8 encoding and the possibility of multi-byte characters.
UTF-8 does not use a fixed number of bytes per character; a character takes between one and four bytes. I'm assuming from your post that you are using single-byte characters, in which case 412 bytes would mean 412 characters. But if the string used two-byte characters throughout, the same 412 bytes would hold only 206 characters.
The original java.io package didn't deal well with this multi-byte confusion. So, they added more classes to deal specifically with strings. The package mixes two different types of file handlers (and they can be confusing until the nomenclature is sorted out). The stream classes provide for direct data I/O without any conversion. The reader classes convert files to strings with full support for multi-byte characters. That might help clarify part of the problem.
Since you state you are using UTF-8 characters, you want the reader classes. In this case, I suggest FileReader. The skip() method in FileReader allows you to pass by X characters and then start reading text. Alternatively, I prefer the overloaded read() method since it allows you to grab all the text at one time.
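If you assume your "bytes" are individual characters, try something like this:
FileReader fr = new FileReader( new File("x.txt") );
fr.skip( pos ); // skip() counts characters, not bytes
char[] buffer = new char[ pos2 - pos ];
// the second argument is the offset into the buffer, not into the file
fr.read( buffer, 0, buffer.length );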
...

I'm late to the party here, but I ran across this problem in my own project.
After much traversal of Javadocs and Stack Overflow, I think I found a simple solution.
After seeking to the appropriate place in your RandomAccessFile, which I am here calling raFile, do the following:
FileDescriptor fd = raFile.getFD();
FileReader fr = new FileReader(fd);
BufferedReader br = new BufferedReader(fr);
Then you should be able to call br.readLine() to your heart's content, which will be much faster than calling raFile.readLine().
The one thing I'm not sure about is whether UTF8 strings are handled correctly.
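On the UTF-8 doubt: FileReader decodes with the platform default charset, so one way to remove the uncertainty is to build the reader from a FileInputStream on the same descriptor and name the charset explicitly. The sketch below assumes that approach; SeekThenBufferedRead and readLineAt are hypothetical names.

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

class SeekThenBufferedRead {
    // Seeks to a byte offset and reads one line from there, decoded as UTF-8.
    static String readLineAt(String fileName, long pos) throws IOException {
        try (RandomAccessFile raFile = new RandomAccessFile(fileName, "r")) {
            raFile.seek(pos);
            // A FileInputStream on the same descriptor starts at the current file position.
            BufferedReader br = new BufferedReader(new InputStreamReader(
                    new FileInputStream(raFile.getFD()), StandardCharsets.UTF_8));
            return br.readLine();
        } // closing raFile also closes the shared descriptor
    }
}
```

Because BufferedReader reads ahead, raFile.getFilePointer() no longer reflects how far the lines have logically been consumed once reading starts.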

The Java IO API is very flexible. Unfortunately, that flexibility sometimes makes it verbose. The main idea here is that there are many streams, writers and readers that implement the wrapper pattern. For example, BufferedInputStream wraps any other InputStream. The same applies to output streams.
The difference between streams and readers/writers is that streams work with bytes while readers/writers work with characters.
Fortunately, some streams, writers and readers have convenient constructors that simplify coding. If you want to read a file you just have to say
InputStream in = new FileInputStream("/usr/home/me/myfile.txt");
in.skip(1024); // skip() works without mark support, which FileInputStream lacks
in.read();
It is not as complicated as you fear.
Channels are something different. They are part of the so-called "new IO", or NIO. NIO can operate in non-blocking mode, which is its main advantage. You can search the internet for any "nio java tutorial" and read about it. But it is more complicated than regular IO and is not needed for most applications.

Guava equivalent for IOUtils.toString(InputStream)

Apache Commons IO has a nice convenience method IOUtils.toString() to read an InputStream to a String.
Since I am trying to move away from Apache Commons and to Guava: is there an equivalent in Guava? I looked at all classes in the com.google.common.io package and I couldn't find anything nearly as simple.
Edit: I understand and appreciate the issues with charsets. It just so happens that I know that all my sources are in ASCII (yes, ASCII, not ANSI etc.), so in this case, encoding is not an issue for me.

You stated in your comment on Calum's answer that you were going to use
CharStreams.toString(new InputStreamReader(supplier.get(), Charsets.UTF_8))
This code is problematic because the overload CharStreams.toString(Readable) states:
Does not close the Readable.
This means that your InputStreamReader, and by extension the InputStream returned by supplier.get(), will not be closed after this code completes.
If, on the other hand, you take advantage of the fact that you appear to already have an InputSupplier<InputStream> and used the overload CharStreams.toString(InputSupplier<R extends Readable & Closeable>), the toString method will handle both the creation and closing of the Reader for you.
This is exactly what Jon Skeet suggested, except that there isn't actually any overload of CharStreams.newReaderSupplier that takes an InputStream as input... you have to give it an InputSupplier:
InputSupplier<? extends InputStream> supplier = ...
InputSupplier<InputStreamReader> readerSupplier =
CharStreams.newReaderSupplier(supplier, Charsets.UTF_8);
// InputStream and Reader are both created and closed in this single call
String text = CharStreams.toString(readerSupplier);
The point of InputSupplier is to make your life easier by allowing Guava to handle the parts that require an ugly try-finally block to ensure that resources are closed properly.
Edit: Personally, I find the following (which is how I'd actually write it, was just breaking down the steps in the code above)
String text = CharStreams.toString(
CharStreams.newReaderSupplier(supplier, Charsets.UTF_8));
to be far less verbose than this:
String text;
InputStreamReader reader = new InputStreamReader(supplier.get(),
Charsets.UTF_8);
boolean threw = true;
try {
text = CharStreams.toString(reader);
threw = false;
}
finally {
Closeables.close(reader, threw);
}
Which is more or less what you'd have to write to handle this properly yourself.
Edit: Feb. 2014
InputSupplier and OutputSupplier and the methods that use them have been deprecated in Guava 16.0. Their replacements are ByteSource, CharSource, ByteSink and CharSink. Given a ByteSource, you can now get its contents as a String like this:
ByteSource source = ...
String text = source.asCharSource(Charsets.UTF_8).read();

If you've got a Readable you can use CharStreams.toString(Readable). So you can probably do the following:
String string = CharStreams.toString( new InputStreamReader( inputStream, "UTF-8" ) );
Forces you to specify a character set, which I guess you should be doing anyway.

Nearly. You could use something like this:
InputSupplier<InputStreamReader> readerSupplier = CharStreams.newReaderSupplier
(streamSupplier, Charsets.UTF_8);
String text = CharStreams.toString(readerSupplier);
Personally I don't think that IOUtils.toString(InputStream) is "nice" - because it always uses the default encoding of the platform, which is almost never what you want. There's an overload which takes the name of the encoding, but using names isn't a great idea IMO. That's why I like Charsets.*.
EDIT: Note that the above needs an InputSupplier<InputStream> as the streamSupplier. If you've already got the stream you can implement that easily enough though:
InputSupplier<InputStream> supplier = new InputSupplier<InputStream>() {
@Override public InputStream getInput() {
return stream;
}
};

UPDATE: Looking back, I don't like my old solution. Besides it is 2013 now and there are better alternatives available now for Java7. So here is what I use now:
InputStream fis = ...;
String text;
try ( InputStreamReader reader = new InputStreamReader(fis, Charsets.UTF_8)){
text = CharStreams.toString(reader);
}
or if with InputSupplier
InputSupplier<InputStreamReader> spl = ...
try ( InputStreamReader reader = spl.getInput()){
text = CharStreams.toString(reader);
}

Another option is to read bytes from Stream and create a String from them:
new String(ByteStreams.toByteArray(inputStream))
new String(ByteStreams.toByteArray(inputStream), Charsets.UTF_8)
It's not 'pure' Guava, but it's a little bit shorter.
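In the same spirit, the read-the-bytes-then-decode idea can also be written without any library, assuming Java 9 or later for InputStream.readAllBytes(). This is a sketch with made-up names (StreamToString, readToString), not a Guava API.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class StreamToString {
    // Reads the stream to its end and decodes the bytes with the given charset.
    // The stream is not closed here; the caller remains responsible for that.
    static String readToString(InputStream in, Charset cs) throws IOException {
        return new String(in.readAllBytes(), cs); // readAllBytes() needs Java 9+
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "plain ASCII text".getBytes(StandardCharsets.US_ASCII);
        try (InputStream in = new ByteArrayInputStream(data)) {
            System.out.println(readToString(in, StandardCharsets.US_ASCII));
        }
    }
}
```

As with the Guava one-liner, the whole content is held in memory, so this suits small inputs.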

Based on the accepted answer, here is a utility method that mimics the behavior of IOUtils.toString() (and an overloaded version with a charset, as well). This version should be safe, right?
public static String toString(final InputStream is) throws IOException{
return toString(is, Charsets.UTF_8);
}
public static String toString(final InputStream is, final Charset cs)
throws IOException{
Closeable closeMe = is;
try{
final InputStreamReader isr = new InputStreamReader(is, cs);
closeMe = isr;
return CharStreams.toString(isr);
} finally{
Closeables.closeQuietly(closeMe);
}
}

There is a much shorter auto-closing solution for the case where the input stream comes from a classpath resource:
URL resource = classLoader.getResource(path);
byte[] bytes = Resources.toByteArray(resource);
String text = Resources.toString(resource, StandardCharsets.UTF_8);
Uses Guava Resources, inspired by IOExplained.

EDIT (2015): Okio is the best abstraction and tools for I/O in Java/Android that I know of. I use it all the time.
FWIW here's what I use.
If I already have a stream in hand, then:
final InputStream stream; // this is received from somewhere
String s = CharStreams.toString(CharStreams.newReaderSupplier(new InputSupplier<InputStream>() {
public InputStream getInput() throws IOException {
return stream;
}
}, Charsets.UTF_8));
If I'm creating a stream:
String s = CharStreams.toString(CharStreams.newReaderSupplier(new InputSupplier<InputStream>() {
public InputStream getInput() throws IOException {
return <expression creating the stream>;
}
}, Charsets.UTF_8));
As a concrete example, I can read an Android text file asset like this:
final Context context = ...;
String s = CharStreams.toString(CharStreams.newReaderSupplier(new InputSupplier<InputStream>() {
public InputStream getInput() throws IOException {
return context.getAssets().open("my_asset.txt");
}
}, Charsets.UTF_8));

For a concrete example, here's how I can read an Android text file asset:
public static String getAssetContent(Context context, String file) {
InputStreamReader reader = null;
InputStream stream = null;
String output = "";
try {
stream = context.getAssets().open(file);
reader = new InputStreamReader(stream, Charsets.UTF_8);
output = CharStreams.toString(reader);
} catch (IOException e) {
e.printStackTrace();
} finally {
if (stream != null) {
try {
stream.close();
} catch (IOException e) {
e.printStackTrace();
}
}
if (reader != null) {
try {
reader.close();
} catch (IOException e) {
e.printStackTrace();
}
}
}
return output;
}

Does Java have a path joining method? [duplicate]

This question already has answers here:
Closed 13 years ago.
Exact Duplicate:
combine paths in java
I would like to know if there is such a method in Java. Take this snippet as example :
// this will output a/b
System.out.println(path_join("a","b"));
// a/b
System.out.println(path_join("a","/b"));

This concerns Java versions 7 and earlier.
To quote a good answer to the same question:
If you want it back as a string later, you can call getPath(). Indeed, if you really wanted to mimic Path.Combine, you could just write something like:
public static String combine (String path1, String path2) {
File file1 = new File(path1);
File file2 = new File(file1, path2);
return file2.getPath();
}

Try:
String path1 = "path1";
String path2 = "path2";
String joinedPath = new File(path1, path2).toString();

One way is to get system properties that give you the path separator for the operating system, this tutorial explains how. You can then use a standard string join using the file.separator.
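A rough sketch of that approach, using only the file.separator system property and a plain loop. SeparatorJoin and pathJoin are illustrative names. Note that this naive join does not collapse duplicate separators, so the second example from the question would produce a doubled separator.

```java
class SeparatorJoin {
    // Joins the parts with the platform's separator (slash on Unix-like systems,
    // backslash on Windows), as reported by the file.separator system property.
    static String pathJoin(String... parts) {
        String sep = System.getProperty("file.separator");
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < parts.length; i++) {
            if (i > 0) {
                sb.append(sep);
            }
            sb.append(parts[i]);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(pathJoin("a", "b"));
    }
}
```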

This is a start, I don't think it works exactly as you intend, but it at least produces a consistent result.
import java.io.File;
public class Main
{
public static void main(final String[] argv)
throws Exception
{
System.out.println(pathJoin());
System.out.println(pathJoin(""));
System.out.println(pathJoin("a"));
System.out.println(pathJoin("a", "b"));
System.out.println(pathJoin("a", "b", "c"));
System.out.println(pathJoin("a", "b", "", "def"));
}
public static String pathJoin(final String ... pathElements)
{
final String path;
if(pathElements == null || pathElements.length == 0)
{
path = File.separator;
}
else
{
final StringBuilder builder;
builder = new StringBuilder();
for(final String pathElement : pathElements)
{
final String sanitizedPathElement;
// String.replace treats the separator literally, so no regex escaping
// is needed and this works on both Windows and Unix-like systems
sanitizedPathElement = pathElement.replace(File.separator, "");
if(sanitizedPathElement.length() > 0)
{
builder.append(sanitizedPathElement);
builder.append(File.separator);
}
}
path = builder.toString();
}
return (path);
}
}
