Detect Encoding with Java

I have an example that works fine. With this example (provided below), I can detect the encoding of a file using the universalchardet (UniversalDetector) library from Mozilla.
But I want this example to detect the encoding of input rather than of a file, for example input read with the Scanner class. How can I modify the code below to detect the encoding of input instead of a file?
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import org.mozilla.universalchardet.UniversalDetector;
public class TestDetector {
public static void main(String[] args) throws java.io.IOException {
byte[] buf = new byte[4096];
java.io.FileInputStream fis = new java.io.FileInputStream("C:\\Users\\khalat\\Desktop\\Java\\toti.txt");
// (1)
UniversalDetector detector = new UniversalDetector(null);
// (2)
int nread;
while ((nread = fis.read(buf)) > 0 && !detector.isDone()) {
detector.handleData(buf, 0, nread);
}
// (3)
detector.dataEnd();
// (4)
String encoding = detector.getDetectedCharset();
if (encoding != null) {
System.out.println("Detected encoding = " + encoding);
} else {
System.out.println("No encoding detected.");
}
// (5)
detector.reset();
}
}

I found an elegant example that can at least test whether a character is ISO-8859-1; see the code below.
import java.nio.charset.StandardCharsets;
public class TestIso88591 {
public static void main(String[] args){
if(TestIso88591.testISO("ü")){
System.out.println("True");
}
else{
System.out.println("False");
}
}
public static boolean testISO(String text){
// true if every character of text can be encoded in ISO-8859-1
return StandardCharsets.ISO_8859_1.newEncoder().canEncode(text);
}
}
Now I have a question for the Java experts: is there a way to test whether a character is ISO-8859-5 or ISO-8859-7? Yes, I know there is UTF-8, but my exact question is how I can test for ISO-8859-5 characters, because the input data should be stored in SAP and SAP can only handle ISO-8859-1 characters. I need this as soon as possible.
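The ISO-8859-1 check above carries over to other charsets by looking the charset up by name. The following is only a minimal sketch: the class and method names (CharsetCheck, fitsCharset) are made up for illustration, and it assumes the JDK in use ships the ISO-8859-5 and ISO-8859-7 charsets. Note that it answers "can this text be encoded in that charset", not "which charset did these bytes come from".

```java
import java.nio.charset.Charset;

class CharsetCheck {
    // Returns true if every character of text can be encoded in the named charset.
    // Charset.forName throws UnsupportedCharsetException if the JDK lacks the charset.
    static boolean fitsCharset(String text, String charsetName) {
        // A fresh encoder per call, because CharsetEncoder instances are not thread-safe.
        return Charset.forName(charsetName).newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        String cyrillic = "\u0416"; // Cyrillic capital letter ZHE
        String greek = "\u03b1";    // Greek small letter alpha
        String umlaut = "\u00fc";   // Latin small letter u with diaeresis

        System.out.println(fitsCharset(cyrillic, "ISO-8859-5")); // true
        System.out.println(fitsCharset(greek, "ISO-8859-7"));    // true
        System.out.println(fitsCharset(umlaut, "ISO-8859-1"));   // true
        System.out.println(fitsCharset(umlaut, "ISO-8859-5"));   // false
        System.out.println(fitsCharset(cyrillic, "ISO-8859-1")); // false
    }
}
```

The literals are written as Unicode escapes so the result does not depend on the encoding the source file is compiled with.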

OK, I researched a bit more, and the result is: it is useless to read bytes from stdin to guess the encoding, because the Java API lets you read the input directly as a String, which is already decoded ;) The only use case for this detector is when you get a stream of unknown bytes from a file, socket, etc. and need to guess how to decode it into a Java String.
What follows is pseudocode, only a theoretical approach. But as we figured out, it makes no sense ;)
It's very simple.
byte[] buf = new byte[4096];
java.io.FileInputStream fis = new java.io.FileInputStream("C:\\Users\\khalat\\Desktop\\Java\\toti.txt");
UniversalDetector detector = new UniversalDetector(null);
int nread;
while ((nread = fis.read(buf)) > 0 && !detector.isDone()) {
detector.handleData(buf, 0, nread);
}
What you are doing here is reading from the file into a byte array, which is then passed to the detector.
Replace your FileInputStream with another input stream.
For example, to read everything from standard input:
byte[] buf = new byte[4096];
// Read raw bytes; an InputStreamReader would already decode them into chars.
java.io.InputStream in = System.in;
UniversalDetector detector = new UniversalDetector(null);
int nread;
while ((nread = in.read(buf)) > 0 && !detector.isDone()) {
detector.handleData(buf, 0, nread);
}
ATTENTION!!
This code is not tested by me. It's only based on the Java API docs.
I would also put a buffered wrapper between the input stream and the read call. Note that read does not wait until the 4096-byte buffer is full: it returns as soon as some bytes are available, so the loop also runs for shorter input and only ends at end of stream or once the detector is done.
About the Reader API: the base class java.io.Reader (http://docs.oracle.com/javase/7/docs/api/java/io/Reader.html#read(char[],%20int,%20int)) declares the method read as abstract, and every Reader implementation has to implement it. SO IT IS THERE!!! It fills a char[] with already-decoded characters though, so the detector needs the bytes from InputStream.read(byte[]) instead.
About "you can't figure out the encoding of a chunk of unknown bytes": yes, that's right. But you can make a guess, as the detector from Mozilla tries to do, because you have some clues: 1. we expect the bytes to be text; 2. we know which byte sequences are valid in each specified encoding; 3. we can try to decode several bytes in a guessed encoding and compare the resulting strings.
About "we are experts":
Yes, most of us are ;) But we don't like doing someone else's homework. We like to fix bugs or give advice. So provide a full example that produces an error we can fix. Or, as happened here, we give you advice with some pseudocode. (I don't have the time to set up a project and write you a working example.)
Nice comment thread ;)

Related

How to calculate checksum with InputStream and then use it again

I want to calculate the CRC32 checksum of a given InputStream and then use the stream to get the string out of it. Here's what I've tried so far
private long calculateChecksum(InputStream stream) throws IOException {
CRC32 crc = new CRC32();
byte[] buffer = new byte[8192];
int length;
while ((length = stream.read(buffer)) > 0) {
crc.update(buffer, 0, length);
}
return crc.getValue();
}
and then
String text = IOUtils.toString(inputStream, UTF_8);
I also tried to reverse the order. First use it as string and then calculate the checksum. But it didn't work.
What seems to be my issue is that the index goes to the end while calculating the checksum and then doesn't reset. Any idea how to use InputStream after calculating the checksum?

As others said, a stream can be consumed only once. But you can consume it and calculate the CRC value at the same time by wrapping your InputStream with a java.util.zip.CheckedInputStream.
Here is a complete example, assuming the text file "test.txt" is in the current directory and contains only this one line: These are german umlauts: äöüÄÖÜß
import org.apache.commons.io.IOUtils;
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;
import java.util.zip.CheckedInputStream;
public class App {
private static final String INPUT_FILE = "test.txt";
public static void main( String[] args ) {
final CRC32 crc32 = new CRC32();
try(InputStream in = new CheckedInputStream(new BufferedInputStream(
new FileInputStream(INPUT_FILE)), crc32))
{
final String text = IOUtils.toString(in, StandardCharsets.UTF_8);
System.out.println(text);
System.out.println(String.format("CRC32: %x", crc32.getValue()));
} catch (IOException e) {
e.printStackTrace();
}
}
}
Output:
These are german umlauts: äöüÄÖÜß
CRC32: 84bcd851

Yes, an InputStream is consumed. You have a few options:
mark
mark() / reset() are optional methods of inputstreams; mark sets a mark (this does, by itself, nothing), and reset 'rewinds back' to the mark, replaying everything that was provided since the last time you called mark().
However, your average inputstream either does not support it, or, if it does, supports it by storing in memory all the bytes that are received since setting the mark. Meaning, if you do this to an inputstream that contains a few GB worth of data, you're going to get an OutOfMemoryError.
If there isn't a lot of data, just use mark and reset. Wrap in a BufferedInputStream which is specced to support mark/reset:
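private void example(InputStream in) throws IOException {
BufferedInputStream buffered = new BufferedInputStream(in);
// mark needs a read limit: how many bytes may be read before the mark becomes invalid
buffered.mark(Integer.MAX_VALUE);
long crc = calculateChecksum(buffered);
// rewind the buffered stream (not the wrapped one) to the mark
buffered.reset();
String text = IOUtils.toString(buffered, UTF_8);
}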
Duplicate
Your second option is to duplicate the inputstream, sending each retrieved byte both to IOUtils as well as to the CRC algorithm.
This is complicated and not recommended.
Checksum the string instead.
You already have a string of data. Just checksum that:
private void example(InputStream in) throws IOException {
String text = IOUtils.toString(in, UTF_8);
CRC32 crc = new CRC32();
crc.update(text.getBytes(UTF_8));
long checksum = crc.getValue();
}
Or, ditching IOUtils:
private void example(InputStream in) throws IOException {
byte[] data = in.readAllBytes();
CRC32 crc = new CRC32();
crc.update(data);
long checksum = crc.getValue();
String text = new String(data, UTF_8);
}

InputStream is a read-once stream. Once you've read it, you can't go back to start again. This is because InputStream is general-purpose: it could be the stream of bytes read from a keyboard, for example, or read from a real-time data feed.
If your input stream is in fact a FileInputStream, then you could use
inputStream.getChannel().position(0);
to reset it to the start of the file.
If it's a ByteArrayInputStream, then you already have a byte array so you might as well just use that instead.
If you want to write a general-purpose function that doesn't know what kind of InputStream it is given, then you can wrap it in a BufferedInputStream and use its mark() and reset() methods. This uses extra memory to buffer everything read since the mark.
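The general-purpose variant can be sketched with the standard library alone. This is an untested-in-production sketch under stated assumptions: the names ChecksumThenRead and readAfterChecksum are invented for illustration, readLimit is a caller-supplied upper bound on the stream length, and readAllBytes requires Java 9 or later.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

class ChecksumThenRead {
    // Feeds every byte of the stream into crc, then rewinds and decodes the same bytes as UTF-8.
    // readLimit is an assumed upper bound on the stream length; reading past it may invalidate the mark.
    static String readAfterChecksum(InputStream in, int readLimit, CRC32 crc) throws IOException {
        BufferedInputStream buffered = new BufferedInputStream(in);
        buffered.mark(readLimit);
        byte[] chunk = new byte[8192];
        int n;
        while ((n = buffered.read(chunk)) != -1) {
            crc.update(chunk, 0, n); // only the n bytes actually read
        }
        buffered.reset(); // back to the mark, i.e. the start of the stream
        return new String(buffered.readAllBytes(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "checksum me".getBytes(StandardCharsets.UTF_8);
        CRC32 crc = new CRC32();
        String text = readAfterChecksum(new ByteArrayInputStream(data), 1 << 20, crc);
        System.out.printf("CRC32: %x, text: %s%n", crc.getValue(), text);
    }
}
```

Everything read between mark and reset is held in memory, so this only suits streams known to be small.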

How to tell the original encoding of a file

I have a bunch of plain text files which I downloaded from 3rd party servers.
Some of them are gibberish; the server sent the information of ENCODING1 (e.g.: UTF8), but in reality the encoding of the file was ENCODING2 (e.g.: Windows1252).
Is there a way to somehow correct these files?
I presume the files were (ENCODING1) mostly encoded in UTF8, ISO-8859-2 and Windows1252 (and I presume they were mostly saved with one of these encodings). I was thinking about re-encoding every file's content with
new String(String.getBytes(ENCODING1), ENCODING2)
with all possibilities of ENCODING1 and ENCODING2 (for 3 encodings that would be 9 options)
then finding some way (for example: character frequency?) to tell which of the 9 results is the correct one.
Are there any 3rd party libraries for this?
I tried JChardet and ICU4J, but as far as I know both of them are only capable of detecting the encoding of the file before the step with ENCODING1 took place.
Thanks,
krisy

You can use the juniversalchardet library (a Java port of Mozilla's universalchardet, hosted on Google Code) to detect the character set of a file; please see the following:
import org.mozilla.universalchardet.UniversalDetector;
public class TestDetector
{
public static void main(String[] args) throws java.io.IOException
{
if (args.length != 1) {
System.err.println("Usage: java TestDetector FILENAME");
System.exit(1);
}
byte[] buf = new byte[4096];
String fileName = args[0];
java.io.FileInputStream fis = new java.io.FileInputStream(fileName);
// (1)
UniversalDetector detector = new UniversalDetector(null);
// (2)
int nread;
while ((nread = fis.read(buf)) > 0 && !detector.isDone()) {
detector.handleData(buf, 0, nread);
}
// (3)
detector.dataEnd();
// (4)
String encoding = detector.getDetectedCharset();
if (encoding != null) {
System.out.println("Detected encoding = " + encoding);
} else {
System.out.println("No encoding detected.");
}
// (5)
detector.reset();
}
}
You can also try jCharDet, which is hosted on SourceForge.
Cheers !!

Inside the JVM, Strings are always Unicode (converted on reading or creation), so aStringVariable.getBytes(ENCODING1) only makes sense for output.
For a basic understanding you should read http://www.joelonsoftware.com/articles/Unicode.html.
As mentioned in that article, there is no way to know for sure which original encoding was used; according to the article, Internet Explorer, for example, guesses by the frequency of different bytes.

So the original files are in UTF-8 (multibyte Unicode format), ISO-8859-2 (Latin-2) and Windows-1252 (MS Latin-1). You want to have them all in UTF-8.
First the download should not do any conversion, so the contents stay intact.
Otherwise you could only attempt to repair a wrong encoding, without guarantee.
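Such a repair attempt can be sketched as follows, for the case where UTF-8 bytes were wrongly decoded as Windows-1252. The class and method names (MojibakeRepair, redecode) are hypothetical. The round trip only recovers the text if the wrong decode kept every byte distinguishable; bytes the wrongly applied charset could not map are lost, which is why there is no guarantee.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class MojibakeRepair {
    // Undoes a wrong decode: turns the garbled text back into bytes using the charset
    // that was wrongly applied, then decodes those bytes with the charset actually used.
    static String redecode(String garbled, Charset wronglyApplied, Charset actual) {
        return new String(garbled.getBytes(wronglyApplied), actual);
    }

    public static void main(String[] args) {
        Charset cp1252 = Charset.forName("windows-1252");
        // Simulate the damage: the UTF-8 bytes of U+00FC (C3 BC) decoded as Windows-1252
        // become the two characters U+00C3 U+00BC.
        String garbled = new String("\u00fc".getBytes(StandardCharsets.UTF_8), cp1252);
        String repaired = redecode(garbled, cp1252, StandardCharsets.UTF_8);
        System.out.println(garbled.length());          // 2
        System.out.println(repaired.equals("\u00fc")); // true
    }
}
```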
Java uses Unicode for text internally. So create a String only with the correct encoding. For the file contents use byte[].
The functionality available:
If the file is in 7-bit US-ASCII, then it is already UTF-8.
If the file has only valid UTF-8 sequences, it most likely is UTF-8; this can be tested.
What remains is to distinguish between Latin-2 and MS Latin-1.
The latter can be done with some statistics. For instance, identifying the language by its 100 most frequent words works rather well.
I am aware of a couple of charset detectors. That one did not seem to work might also mean that the file is already corrupted. You can check with Notepad++, JEdit or some other editor that can convert encodings.
Charset detectCharset(Path path) throws IOException {
byte[] content = Files.readAllBytes(path);
boolean ascii = true;
boolean utf8 = true;
Map<Byte, Integer> specialCharFrequencies = new TreeMap<>();
for (int i = 0; i < content.length; ++i) {
byte b = content[i];
if (b < 0) {
ascii = false;
if ((b & 0xC0) == 0x80) { // UTF-8 continuation byte (10xxxxxx) must follow a non-ASCII byte
if (i == 0 || content[i - 1] >= 0) {
utf8 = false;
}
}
specialCharFrequencies.merge(b, 1, Integer::sum);
}
}
if (ascii || utf8) {
return StandardCharsets.UTF_8;
}
// ... determine by frequencies
Charset latin1 = Charset.forName("Windows-1252");
Charset latin2 = Charset.forName("ISO-8859-2");
System.out.println(" B Freq 1 2");
specialCharFrequencies.entrySet().stream()
.forEach(e -> System.out.printf("%02x %06d %s %s%n",
e.getKey() & 0xFF, e.getValue(),
new String(new byte[] {e.getKey()}, latin1),
new String(new byte[] {e.getKey()}, latin2)));
return null;
}
Illegal UTF-8 can slip through this check, but it would be easy to use a Charset decoder.
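A strict check with a Charset decoder could look like the sketch below. The class and method names (Utf8Check, isValidUtf8) are made up for illustration. The decoder is configured to report errors instead of silently substituting replacement characters, so any malformed or truncated sequence makes the check fail.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class Utf8Check {
    // Returns true only if content is a well-formed UTF-8 byte sequence.
    static boolean isValidUtf8(byte[] content) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(content)); // throws on the first invalid sequence
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] utf8 = "\u00fc".getBytes(StandardCharsets.UTF_8); // C3 BC
        byte[] latin1 = {(byte) 0xFC};                           // the same letter in ISO-8859-1
        System.out.println(isValidUtf8(utf8));   // true
        System.out.println(isValidUtf8(latin1)); // false
    }
}
```

Passing this check still only means the bytes are most likely UTF-8; short Latin-1 or Latin-2 texts can happen to be valid UTF-8 too.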

How do I open a file in Java and then call a hex editor to check the file type?

We are really stuck on this topic. This is the only code we have, which converts a file into hex, but we need to open a file and have the Java code read the hex and extract certain bytes (e.g. the first 4 bytes for the file extension):
import java.io.*;
public class FileInHexadecimal
{
public static void main(String[] args) throws Exception
{
FileInputStream fis = new FileInputStream("H://Sample_Word.docx");
int i = 0;
while ((i = fis.read()) != -1) {
if (i != -1) {
System.out.printf("%02X\n ", i);
}
}
fis.close();
}
}

Do not confuse internal and external representation - what you do when converting to hex is that you only create a different representation of the same bytes.
There is no need to convert to hex if you just want to read some bytes from the file - just read them. For example, to read the first four bytes, you can use something like
byte[] buffer = new byte[4];
FileInputStream fis = new FileInputStream("H://Sample_Word.docx");
int read = fis.read(buffer);
if (read != buffer.length) {
System.out.println("Short file!");
}
If you need to read data from an arbitrary position within the file, you might want to check RandomAccessFile instead of using a stream. RandomAccessFile lets you set the position where reading starts.
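A minimal sketch of the RandomAccessFile approach follows. The class and method names (ReadAtOffset, readBytesAt) are made up for illustration, and the file path is taken from the command line rather than hard-coded.

```java
import java.io.IOException;
import java.io.RandomAccessFile;

class ReadAtOffset {
    // Reads exactly length bytes starting at offset; throws EOFException if the file is too short.
    static byte[] readBytesAt(String path, long offset, int length) throws IOException {
        try (RandomAccessFile file = new RandomAccessFile(path, "r")) {
            file.seek(offset);          // jump to the requested position
            byte[] buffer = new byte[length];
            file.readFully(buffer);     // unlike read(), this fills the whole array or fails
            return buffer;
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("Usage: java ReadAtOffset FILENAME");
            return;
        }
        // Print the first four bytes in hex, e.g. to compare them with a known file signature.
        for (byte b : readBytesAt(args[0], 0, 4)) {
            System.out.printf("%02X ", b & 0xFF);
        }
        System.out.println();
    }
}
```

For a .docx file, which is a ZIP container, the first four bytes would be expected to be 50 4B 03 04.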

GZipping from standard input to standard output in Java

Preface: I'm a total Java noob...I just wrote Hello World yesterday. Have mercy on my noob self.
I'm not sure how to read from standard input or output to standard output in Java. I know there are things like Scanners and System.out.println, but this doesn't seem to apply directly to what I'm trying to do.
In particular, I'm trying to use GZip on standard input and output the compressed result to standard output. I see that there is a GZIPOutputStream class that I'll certainly want to use. However, how can I initialize the output stream to direct to std output? Further, how can I just read from standard input?
How can I accomplish this? How do I compress std input and output the result to std output?
(Here's a diagram of what I'm trying to accomplish: Std input -> GZIP (via my Java program) -> std output (the compressed version of the std input))

Take a look at the following constructor : GZIPInputStream(InputStream in). To get stdin as an InputStream, use System.in. Reading from the stream is done with the read(byte[] buf, int off, int len) method- take a look at the documentation for a detailed description.
The whole thing would be something like
GZIPInputStream i = new GZIPInputStream(System.in);
byte[] buffer = new byte[1024];
int n = i.read(buffer, 0, buffer.length);
System.out.println("Bytes read: " + n);
Personally, I found streams in Java to have a steep learning curve, so I do recommend reading any tutorial online.
I'll leave it as an exercise to figure out the output.
--
Disclaimer: haven't actually tried the code

import java.io.IOException;
import java.util.zip.GZIPOutputStream;
public class InToGzipOut {
private static final int BUFFER_SIZE = 512;
public static void main(String[] args) throws IOException {
byte[] buf = new byte[BUFFER_SIZE];
GZIPOutputStream out = new GZIPOutputStream(System.out);
int len;
while ((len = System.in.read(buf)) > 0) {
out.write(buf, 0, len);
}
out.finish();
}
}
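The same loop can be pulled into a method that takes any InputStream and OutputStream, which makes it possible to try the compression without piping data through the console. This is only a sketch; the names GzipPipe and gzip are invented for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.zip.GZIPOutputStream;

class GzipPipe {
    // Compresses everything readable from in and writes the gzip data to out.
    // Neither stream is closed, so System.in and System.out stay usable afterwards.
    static void gzip(InputStream in, OutputStream out) throws IOException {
        GZIPOutputStream gz = new GZIPOutputStream(out);
        byte[] buf = new byte[8192];
        int len;
        while ((len = in.read(buf)) != -1) {
            gz.write(buf, 0, len); // only the len bytes actually read
        }
        gz.finish(); // writes the remaining compressed data and the gzip trailer
        out.flush();
    }

    public static void main(String[] args) throws IOException {
        gzip(System.in, System.out);
    }
}
```

From a shell this would be run along the lines of `java GzipPipe < input.txt > input.txt.gz`, where the file names are placeholders.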

Java reading file into memory and how not to blow up memory

I'm a bit of a newbie in Java and I'm trying to perform a MAC calculation on a file.
Since the size of the file is not known until runtime, I can't just load the whole file into memory. So I wrote the code to read it in chunks (4k in this case).
The issue I'm having is that I tried loading the entire file into memory to see if both methods produce the same hash. However, they seem to produce different hashes.
Here's the chunk-by-chunk code:
FileInputStream fis = new FileInputStream("sbs.dat");
byte[] file = new byte[4096];
m = Mac.getInstance("HmacSHA1");
int i=fis.read(file);
m.init(key);
while (i != -1)
{
m.update(file);
i=fis.read(file);
}
mac = m.doFinal();
And here's the all at once approach:
File f = new File("sbs.dat");
long size = f.length();
byte[] file = new byte[(int) size];
fis.read(file);
m = Mac.getInstance("HmacSHA1");
m.init(key);
m.update(file);
mac = m.doFinal();
Shouldn't they both produce the same hash?
The question however is more generic. Is the first snippet the correct way of reading a file in pieces and doing whatever we want inside the while loop (socket send, ciphering a file, etc.)?
This question is useful because every tutorial I've seen just loads everything at once...
Update: Working :-D. Will this approach work properly sending a file in pieces through a socket?

No. You have no guarantee that fis.read(file) will read file.length bytes. This is why read() returns an int to tell you how many bytes it has actually read.
You should instead do this:
m.init(key);
int i=fis.read(file);
while (i != -1)
{
m.update(file, 0, i);
i=fis.read(file);
}
taking advantage of the Mac.update(byte[] data, int offset, int len) method, which lets you specify the length of the actual data in the byte[] array.

The read function will not necessarily fill up your entire array. So, you need to check how many bytes were returned from the read function, and only use that many bytes of your buffer.

Just like Jason LeBrun says - The read method will not always read the specified amount of bytes. For example: What do you think will happen if the file does not contain a multiple of 4096 bytes?
I would go for something like this:
FileInputStream fis = new FileInputStream(filename);
byte[] buffer = new byte[buffersize];
Mac m = Mac.getInstance("HmacSHA1");
m.init(key);
int n;
while ((n = fis.read(buffer)) != -1)
{
m.update(buffer, 0, n);
}
byte[] mac = m.doFinal();
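To convince yourself that the chunked loop and the all-at-once approach agree, the two can be compared on in-memory data. The sketch below uses a made-up key and test data purely for illustration; the names ChunkedMac, macOfStream and macOfBytes are not from the question.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.util.Arrays;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

class ChunkedMac {
    // Computes HmacSHA1 over a stream, feeding the Mac only the bytes each read returned.
    static byte[] macOfStream(InputStream in, byte[] keyBytes, int bufferSize)
            throws IOException, GeneralSecurityException {
        Mac mac = Mac.getInstance("HmacSHA1");
        mac.init(new SecretKeySpec(keyBytes, "HmacSHA1"));
        byte[] buffer = new byte[bufferSize];
        int n;
        while ((n = in.read(buffer)) != -1) {
            mac.update(buffer, 0, n);
        }
        return mac.doFinal();
    }

    // Computes HmacSHA1 over data that is already fully in memory.
    static byte[] macOfBytes(byte[] data, byte[] keyBytes) throws GeneralSecurityException {
        Mac mac = Mac.getInstance("HmacSHA1");
        mac.init(new SecretKeySpec(keyBytes, "HmacSHA1"));
        return mac.doFinal(data);
    }

    public static void main(String[] args) throws Exception {
        byte[] key = "demo-key".getBytes(StandardCharsets.UTF_8); // made-up key for illustration
        byte[] data = new byte[10_000]; // deliberately not a multiple of the buffer size
        Arrays.fill(data, (byte) 'x');
        byte[] chunked = macOfStream(new ByteArrayInputStream(data), key, 4096);
        byte[] whole = macOfBytes(data, key);
        System.out.println(Arrays.equals(chunked, whole)); // true
    }
}
```

The last chunk here is 1808 bytes long, which is exactly the case where updating with the whole buffer would mix in stale bytes from the previous read.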
