How to know when to use byteStream for reading data and when to use charStream for reading data from a file? - java

I am trying to understand if I need to read data from different types of files (.properties file, json file, text file etc) or from console, which java class should I use and why exactly.
Some classes are reading data in the form of bytes (8 bits) and some are reading in the form of characters (16 bits unicode). So, how do I decide which class to use for reading data?
enter image description here
In the above sample code, I am trying to read a .properties file. So, how to decide if I need to use FileInputStream or any other class to read the file?
I tried looking for answers online but I am still not clear.

It depends on the individual case.
For property files, there is an overloaded Properties.load() method. You can pass either an InputStream or a Reader.
Here it would be best to use the InputStream option (byte stream) because the load() function handles the correct character set on its own. If you use a Reader you are responsible to read the file with the correct encoding.
Same for parsing XML files. The encoding is part of the XML header (or UTF-8 as default) so it is the best option to let the parser read and handle it.
For JSON the default encoding is UTF-8. Other encodings are possible but I don't know whether it is possible to declare the encoding inside a JSON document like in XML.
So for other file types it depends on the use case. If you have a text file encoded as UTF-8 and you want to copy it to another location as is, you can simply treat the file as a byte block. But if you have to extract some words it is necessary to interpret the byte block as characters so you need a Reader and the right character encoding (in conjunction with an InputStreamReader).
Sometimes you get the right encoding via an API e.g. if you call a REST service via HTTP you can extract the encoding out of a HTTP header. Otherwise, you simply have to know the encoding.

Related

Java - possible to modify and parse gzipped xml files without unzipping?

I have an arraylist of gzipped xml files. Is it possible to view and manipulate the contents of these xml files all without unzipping them and taking up disk space? If so, what would be the correct class(es) to use for this task?
I know I can create a gzipinputstream from a fileinputstream of the zip file but from there I'm not sure what to do. I have only this written:
GZIPInputStream in = new GZIPInputStream(new FileInputStream(zippedFiles.get(i)));
I need some way to parse text within the xml files and modify the xml itself but again, extracting all of them would take up too much disk space.
What exactly are you going to achieve? You can extract the file into memory using a ByteArrayOutputStream and convert it into a byte-Array that you forward to your XML parser library (converting it to String and passing that is not recommended as the encoding is specified inside the XML file itself and the conversion to String must therefore be done by the XML parser internally). Most XML parsers also support reading directly from any InputStream, so you could pass yours directly to it which will probably further reduce your memory consumption. Disk space will only be occupied when writing data back to it by simply reversing the described procedure. Still, as you directly replace the source file by overwriting it, there is nowhere any disk space wasted.
The fact that they're in a list doesn't change much, but no.
Ignoring compression, files are stored linearly on disks. You can append to them cheaply, you can replace bytes cheaply, but you can't replace sequences of different lengths (like replace("Testing Procedure Specification", "TPS")) without rewriting the file after the modified substring.
Gziping the file complicates things, but the same rule applies. In general, making arbitrary modifications to a file requires rewriting the file.
Your code for reading the files is on the right track, though. You can easily read through gziped files as streams and without having to decompress the entire file.

Java: Reading a file containing both text and binary data

I'm having a problem with a new file format I'm being asked to implement at work.
Basically, the file is a text file which contains a bunch of headers containing information about the data in UTC-8, and then the rest of the file is the numerical data in binary. I can write the data and read it back just fine, and I recently added the code to write the headers.
The problem is that I don't know how to read a file that contains both text and binary data. I want to be able to read in and deal with the header information (which is fairly extensive) and then be able to continue reading the binary data without having to re-iterate through the headers. Is this possible?
I am currently using a FileInputStream to read the binary data, but I don't know how to start it at the beginning of the data, rather than the beginning of the whole file. One of the FileInputStream's constructors takes a FileDescriptor as the argument and I think that's my answer, but I don't know how to get one from another file reading class. Am I approaching this correctly?
You can reposition a FileInputStream to any arbitrary point by getting its channel via getChannel() and calling position() on that channel.
The one caveat is that this position affects all consumers of the stream. It is not suitable if you have different threads (for example) reading from different parts of the same file. In that case, create a separate FileInputStream for each consumer.
Also, this technique only works for file streams, because the underlying file can be randomly accessed. There is no equivalent for sockets, or named pipes, or anything else that is actually a stream.

Byte order mark removal in xml files using Java IO

The byte order mark is the first 3 bytes in my xml file. How do I remove the Byte order mark from the xml file programmatically? I want to completely discard it.
Rather than removing it, I use a special reader which reacts properly to BOM (and uses proper encoding, based on read BOM): I copied it from elsewhere (see note inside) but it is open-sourced in my android-menu-navigator project:
http://code.google.com/p/android-menu-navigator/source/browse/src/pl/polidea/navigator/UnicodeReader.java
You can use this reader anyway to read content of XML and write it elsewhere, effectively removing the BOM.

Get file's encoding in Java [duplicate]

This question already has answers here:
Closed 12 years ago.
Possible Duplicate:
Java : How to determine the correct charset encoding of a stream
User will upload a CSV file to the server, server need to check if the CSV file is encoded as UTF-8. If so need to inform user, (s)he uploaded a wrong encoding file. The problem is how to detect the file user uploaded is UTF-8 encoding? The back end is written in Java. So anyone get the suggestion?
At least in the general case, there's no way to be certain what encoding is used for a file -- the best you can do is a reasonable guess based on heuristics. You can eliminate some possibilities, but at best you're narrowing down the possibilities without confirming any one. For example, most of the ISO 8859 variants allow any byte value (or pattern of byte values), so almost any content could be encoded with almost any ISO 8859 variant (and I'm only using "almost" out of caution, not any certainty that you could eliminate any of the possibilities).
You can, however, make some reasonable guesses. For example, a file that start out with the three characters of a UTF-8 encoded BOM (EF BB BF), it's probably safe to assume it's really UTF-8. Likewise, if you see sequences like: 110xxxxx 10xxxxxx, it's a pretty fair guess that what you're seeing is encoded with UTF-8. You can eliminate the possibility that something is (correctly) UTF-8 enocded if you ever see a sequence like 110xxxxx 110xxxxx. (110xxxxx is a lead byte of a sequence, which must be followed by a non-lead byte, not another lead byte in properly encoded UTF-8).
You can try and guess the encoding using a 3rd party library, for example: http://glaforge.free.fr/wiki/index.php?wiki=GuessEncoding
Well, you can't. You could show kind of a "preview" (or should I say review?) with some sample data from the file so the user can check if it looks okay. Perhaps with the possibility of selecting different encoding options to help determine the correct one.

Detect if user changed file extension to upload?

Using a Java servlet, is it possible to detect the true file type of a file, regardless of its extension?
Scenario: You only allow plain text file uploads (.txt and .csv) The user takes the file, mypicture.jpg, renames it to mypicture.txt and proceeds to upload the file. Your servlet expects only text files and blows up trying to read the jpg.
Obviously this is user error, but is there a way to detect that its not plain text and not proceed?
You can do this using the builtin URLConnection#guessContentTypeFromStream() API. It's however pretty limited in content types it can detect, you can then better use a 3rd party library like jMimeMagic.
See also:
Best way to determine file type in Java
When do browsers send application/octet-stream as Content-Type?
No. There is no way to know what type of file you are being uploaded. You must make all verifications on the server before taking any actions with the file.
I think you should consider why your program might blow up when give a JPEG (say) and make it defensive against this. For example a JPEG file is likely to have apparently very long lines (any LF of CR LF will be soemwhat randomly spread). But a so called text file could equally have long lines that might kill your program,
What exactly do you mean by "plain text file"? Would a file consisting of Chinese text be a plain text file? If you assume English text in ASCII or ANSI coding, you would have to read the full file as binary file, and check that e. g. all byte values are between, say, 32 and 127 plus 13, 10, 9, maybe.

Categories

Resources