I have a bunch of plain text files which I downloaded from 3rd-party servers.
Some of them are gibberish: the server declared ENCODING1 (e.g. UTF-8), but in reality the file was encoded in ENCODING2 (e.g. Windows-1252).
Is there a way to somehow correct these files?
I presume the declared encodings (ENCODING1) were mostly UTF-8, ISO-8859-2 and Windows-1252, and I presume the files were actually saved (ENCODING2) with one of these encodings too. I was thinking about re-encoding every file's content with
new String(String.getBytes(ENCODING1), ENCODING2)
with all combinations of ENCODING1 and ENCODING2 (for 3 encodings that would be 9 options),
then finding some way (for example, character frequency?) to tell which of the 9 results is the correct one.
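A minimal sketch of that brute-force idea (the candidate charsets and the crude printability score below are my own placeholders for a real character-frequency check):

import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

public class MojibakeRepair {

    static final List<Charset> CANDIDATES = Arrays.asList(
            StandardCharsets.UTF_8,
            Charset.forName("ISO-8859-2"),
            Charset.forName("windows-1252"));

    // Re-encode the garbled text with every (ENCODING1, ENCODING2) pair and keep the best result.
    static String repair(String garbled) {
        String best = garbled;
        int bestScore = score(garbled);
        for (Charset declared : CANDIDATES) {
            for (Charset actual : CANDIDATES) {
                String candidate = new String(garbled.getBytes(declared), actual);
                int candidateScore = score(candidate);
                if (candidateScore > bestScore) {
                    bestScore = candidateScore;
                    best = candidate;
                }
            }
        }
        return best;
    }

    // Crude stand-in for a frequency check: reward letters, digits and whitespace,
    // punish replacement characters and stray control characters.
    static int score(String text) {
        int score = 0;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '\uFFFD' || (Character.isISOControl(c) && c != '\n' && c != '\r' && c != '\t')) {
                score -= 10;
            } else if (Character.isLetterOrDigit(c) || Character.isWhitespace(c)) {
                score++;
            }
        }
        return score;
    }
}

A real implementation would replace score() with character- or word-frequency statistics for the expected language.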
Are there any 3rd party libraries for this?
I tried JChardet and ICU4J, but as far as I know both of them are only capable of detecting the encoding of the file before the step with ENCODING1 took place
Thanks,
krisy
You can use the juniversalchardet library (a Java port of Mozilla's universal charset detector, hosted on Google Code) to detect the character set of a file; see the following:
import org.mozilla.universalchardet.UniversalDetector;

public class TestDetector {
    public static void main(String[] args) throws java.io.IOException {
        if (args.length != 1) {
            System.err.println("Usage: java TestDetector FILENAME");
            System.exit(1);
        }
        byte[] buf = new byte[4096];
        String fileName = args[0];
        java.io.FileInputStream fis = new java.io.FileInputStream(fileName);

        // (1)
        UniversalDetector detector = new UniversalDetector(null);

        // (2)
        int nread;
        while ((nread = fis.read(buf)) > 0 && !detector.isDone()) {
            detector.handleData(buf, 0, nread);
        }
        // (3)
        detector.dataEnd();

        // (4)
        String encoding = detector.getDetectedCharset();
        if (encoding != null) {
            System.out.println("Detected encoding = " + encoding);
        } else {
            System.out.println("No encoding detected.");
        }

        // (5)
        detector.reset();
    }
}
Read more at the following URL.
You can also try jCharDet from SourceForge; please see the following URL.
Cheers !!
Inside the JVM, Strings are always Unicode (converted when read or created), so aStringVariable.getBytes(ENCODING1) will only work for output.
For a basic understanding you should read http://www.joelonsoftware.com/articles/Unicode.html.
As mentioned in this article, there is no way to know for sure which original encoding was used; according to the article, Internet Explorer, for example, guesses by the frequency of different bytes.
So the original files are in UTF-8 (a multi-byte Unicode format), ISO-8859-2 (Latin-2) or Windows-1252 (MS Latin-1). You want to have them all in UTF-8.
First, the download should not do any conversion, so the contents stay intact.
Otherwise you can only attempt to repair a wrong encoding, with no guarantee.
Java uses Unicode for text internally. So create a String only with the correct encoding. For the file contents use byte[].
The checks available:
If the file is 7-bit US-ASCII, then it is already valid UTF-8.
If the file contains only valid UTF-8 sequences, it most likely is UTF-8; this can be tested.
What remains is to distinguish between Latin-2 and MS Latin-1.
The latter can be done with some statistics. For instance, identifying the language by its 100 most frequent words works rather well.
I am aware of a couple of charset detectors. That the one you tried did not seem to work might also mean that the file is already corrupted. You could check that with Notepad++, JEdit or some other encoding-converting editor.
Charset detectCharset(Path path) throws IOException {
    byte[] content = Files.readAllBytes(path);
    boolean ascii = true;
    boolean utf8 = true;
    Map<Byte, Integer> specialCharFrequencies = new TreeMap<>();
    for (int i = 0; i < content.length; ++i) {
        byte b = content[i];
        if (b < 0) {
            ascii = false;
            if ((b & 0xC0) == 0x80) { // UTF-8 continuation byte (10xxxxxx) must follow another non-ASCII byte
                if (i == 0 || content[i - 1] >= 0) {
                    utf8 = false;
                }
            }
            specialCharFrequencies.merge(b, 1, Integer::sum);
        }
    }
    if (ascii || utf8) {
        return StandardCharsets.UTF_8;
    }
    // ... determine by frequencies
    Charset latin1 = Charset.forName("Windows-1252");
    Charset latin2 = Charset.forName("ISO-8859-2");
    System.out.println(" B Freq 1 2");
    specialCharFrequencies.entrySet().stream()
            .forEach(e -> System.out.printf("%02x %06d %s %s%n",
                    e.getKey() & 0xFF, e.getValue(),
                    new String(new byte[] {e.getKey()}, latin1),
                    new String(new byte[] {e.getKey()}, latin2)));
    return null;
}
Illegal UTF-8 can still slip through this check, but it would be easy to use a CharsetDecoder instead.
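A minimal sketch of that stricter test, assuming the file content is already in memory as a byte[] and using a decoder that reports malformed input instead of replacing it:

import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Returns true only if the bytes form a completely valid UTF-8 sequence.
static boolean isValidUtf8(byte[] content) {
    CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT);
    try {
        decoder.decode(ByteBuffer.wrap(content));
        return true;
    } catch (CharacterCodingException e) {
        return false;
    }
}

This could stand in for the hand-written continuation-byte check in detectCharset above.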
I'm writing EXIF metadata to a JPEG using Apache Commons Imaging (Sanselan), and, at least in the 0.97 release of Sanselan, there were some bugs related to charset/encoding. The EXIF 2.2 standard requires that the encoding of fields of type UNDEFINED be prefixed with an 8-byte ASCII "signature", describing the encoding of the following content. The field/tag I'm writing to is the UserComment EXIF tag.
Windows expects the content to be encoded in UTF16, so the bytes written to the JPEG must contain a combination of (single byte) ASCII characters, followed by (double byte) Unicode characters. Furthermore, although UserComment doesn't seem to require it, I notice that often the content is "null-padded" to even length.
Here's the code I'm using to create and write the tag:
String textToSet = "Test";
byte[] ASCIIMarker = new byte[]{ 0x55, 0x4E, 0x49, 0x43, 0x4F, 0x44, 0x45, 0x00 }; // spells out "UNICODE"
byte[] comment = textToSet.getBytes("UnicodeLittle");
// pad with \0 if (total) length is odd (or is \0 byte automatically added by arraycopy?)
int pad = (ASCIIMarker.length + comment.length) % 2;
byte[] bytesComment = new byte[ASCIIMarker.length + comment.length + pad];
System.arraycopy(ASCIIMarker, 0, bytesComment, 0, ASCIIMarker.length);
System.arraycopy(comment, 0, bytesComment, ASCIIMarker.length, comment.length);
if (pad > 0) bytesComment[bytesComment.length-1] = 0x00;
TiffOutputField exif_comment = new TiffOutputField(TiffConstants.EXIF_TAG_USER_COMMENT,
TiffFieldTypeConstants.FIELD_TYPE_UNDEFINED, bytesComment.length - pad, bytesComment);
Then when I read the tag back from the JPEG, I do the following:
String textRead;
TiffField field = jpegMetadata.findEXIFValue(TiffConstants.EXIF_TAG_USER_COMMENT);
if (field != null) {
    textRead = new String(field.getByteArrayValue(), "UnicodeLittle");
}
What confuses me is this: The bytes written to the JPEG are prefixed with 8 ASCII bytes, which obviously need to be "stripped off" in order to compare what was written to what was read:
if (textRead != null) {
    if (textToSet.equals(textRead)) { // expecting this to FAIL
        System.out.println("Equal");
    } else {
        System.out.println("Not equal");
        if (textToSet.equals(textRead.substring(5))) { // this works
            System.out.println("Equal after all...");
        }
    }
}
But why substring(5), as opposed to... substring(8)? If it was 4, I might think that 4 double byte (UTF-16) symbols total 8 bytes, but it only works if I strip off 5 bytes. Is this an indication that I'm not creating the payload (byte array bytesComment) properly?
PS! I will update to Apache Commons Imaging RC 1.0, which came out in 2016 and hopefully has fixed these bugs, but I'd still like to understand why this works once I've gotten this far with 0.97 :-)
I have an example which is working fine. With this example (provided below), I can detect the encoding of a file using the UniversalDetector framework from Mozilla.
But I want this example to detect the encoding of user input rather than of a file, for example using the Scanner class. How can I modify the code below to detect the encoding of input instead of a file?
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import org.mozilla.universalchardet.UniversalDetector;

public class TestDetector {
    public static void main(String[] args) throws java.io.IOException {
        byte[] buf = new byte[4096];
        java.io.FileInputStream fis = new java.io.FileInputStream("C:\\Users\\khalat\\Desktop\\Java\\toti.txt");

        // (1)
        UniversalDetector detector = new UniversalDetector(null);

        // (2)
        int nread;
        while ((nread = fis.read(buf)) > 0 && !detector.isDone()) {
            detector.handleData(buf, 0, nread);
        }
        // (3)
        detector.dataEnd();

        // (4)
        String encoding = detector.getDetectedCharset();
        if (encoding != null) {
            System.out.println("Detected encoding = " + encoding);
        } else {
            System.out.println("No encoding detected.");
        }

        // (5)
        detector.reset();
    }
}
I found an elegant example which can at least test whether a character is ISO-8859-1; see the code below.
import java.nio.charset.Charset;
// CharEncoding comes from Apache Commons Lang (CharEncoding.ISO_8859_1 == "ISO-8859-1")

public class TestIso88591 {
    public static void main(String[] args) {
        if (TestIso88591.testISO("ü")) {
            System.out.println("True");
        } else {
            System.out.println("False");
        }
    }

    public static boolean testISO(String text) {
        return Charset.forName(CharEncoding.ISO_8859_1).newEncoder().canEncode(text);
    }
}
Now I have a question for the Java experts: is there a possibility to test whether a character is ISO-8859-5 or ISO-8859-7? Yes, I know there is UTF-8, but my exact question is how I can test for ISO-8859-5 characters, because the input data has to be stored in SAP and SAP can only handle ISO-8859-1 characters. I need this as soon as possible.
OK, I researched a bit more, and the result is: it is useless to read bytes from stdin to guess the encoding, because the Java API lets you read the input directly as a String, which is already decoded ;) The only use case for this detector is when you get a stream of unknown bytes from a file or socket etc. and have to guess how to decode it into a Java String.
Next is some pseudo code; it is only a theoretical approach, and as we figured out it makes no sense ;)
It's very simple.
byte[] buf = new byte[4096];
java.io.FileInputStream fis = new java.io.FileInputStream("C:\\Users\\khalat\\Desktop\\Java\\toti.txt");
UniversalDetector detector = new UniversalDetector(null);
int nread;
while ((nread = fis.read(buf)) > 0 && !detector.isDone()) {
    detector.handleData(buf, 0, nread);
}
What you are doing here is reading from the file into a byte array, which is then passed to the detector.
Replace your FileInputStream with another input source.
For example, to read everything from standard in:
byte[] buf = new byte[4096];
java.io.InputStream in = System.in; // feed the detector raw bytes, not decoded characters
UniversalDetector detector = new UniversalDetector(null);
int nread;
while ((nread = in.read(buf)) > 0 && !detector.isDone()) {
    detector.handleData(buf, 0, nread);
}
ATTENTION!!
This code is not tested by me. It is only based on the Java API docs.
I would also place a BufferedInputStream between standard input and the read, to buffer. Keep the 4096-byte buffer in mind as well: the detector only sees as many bytes as you actually feed it from standard in, so with only a few characters of input the guess will not be reliable.
About the Reader API: the base class java.io.Reader (http://docs.oracle.com/javase/7/docs/api/java/io/Reader.html#read(char[],%20int,%20int)) defines the read method as abstract, and any Reader-based implementation has to implement it. So it is there!
About not being able to figure out the encoding of a chunk of unknown bytes: yes, that's right. But you can make a guess, like Mozilla's detector does, because you have some clues: 1. we expect the bytes to be text; 2. we know every byte of any specified encoding; 3. we can try to decode several bytes in a guessed encoding and compare the resulting string.
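A minimal sketch of clue 3, assuming you have the raw bytes and a fixed list of candidate charsets; strict decoders rule out candidates that could not have produced those bytes:

import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.util.ArrayList;
import java.util.List;

// Keeps only the candidate charsets that decode the bytes without any error.
static List<Charset> possibleCharsets(byte[] data, List<Charset> candidates) {
    List<Charset> possible = new ArrayList<>();
    for (Charset cs : candidates) {
        try {
            cs.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            possible.add(cs); // decodes cleanly, so it stays in the running
        } catch (CharacterCodingException e) {
            // this charset cannot have produced these bytes
        }
    }
    return possible;
}

Note that a single-byte charset such as ISO-8859-2 decodes every possible byte sequence, so this mainly rules out multi-byte encodings like UTF-8; the surviving candidates still have to be compared, e.g. by frequency statistics.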
About "we are experts":
Yes, most of us are ;) But we don't like to do someone else's homework. We like to fix bugs or give advice. So provide a full example that produces an error we can fix. Or, as happened here, we give you advice with some pseudo code. (I don't have the time to set up a project and write you a working example.)
Nice comment thread ;)
My Java client sends a file to a C++ server using this code:
FileInputStream fileInputStream = new FileInputStream(path);
byte[] buffer = new byte[64 * 1024];
int bytesRead = 0;
while ((bytesRead = fileInputStream.read(buffer)) != -1) {
    if (bytesRead > 0) {
        this.outToServer.write(buffer, 0, bytesRead);
    }
}
My C++ server receives the bytes using this code:
vector<char> buf5(file_length);
size_t read_bytes;
do {
    read_bytes = socket.read_some(boost::asio::buffer(buf5, file_length));
    file_length -= read_bytes;
} while (read_bytes != 0);
string file(buf5.begin(), buf5.end());
And then creates the file using this code:
ofstream out_file( (some_path).c_str() );
out_file << file << endl;
out_file.close();
However, somehow the file gets corrupted during this process.
At the end of the process, both files(the one sent and the one created) have the same size.
What am I doing wrong? Any help would be appreciated!
Edit: tried to use different code for receiving the file, same result:
char buf[file_length];
size_t length = 0;
while (length < file_length) {
    length += socket.read_some(boost::asio::buffer(&buf[length], file_length - length), error);
}
string file(buf);
1) Is it a text file?
2) If not, try opening the file in binary mode before writing; also, do not use the << operator, use the write or put methods instead.
In your first example the problem appears to be this line:
read_bytes = socket.read_some(boost::asio::buffer(buf5,file_length));
This results in you overwriting the first N bytes of your buffer on every iteration instead of appending successive reads correctly.
In your second example the problem is likely:
string file(buf);
If buf contains any NUL characters then the string will be truncated. Use the same string creation as in your first example with a std::vector<char>.
If you still have problems I would recommend doing a binary diff of the source and copied files (most hex editors can do this). This should give you a better picture of exactly where the difference is and what may be causing it.
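If no hex editor is at hand, a quick throwaway sketch on the Java side can also locate the first difference (the file names below are placeholders):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

// Prints the offset and values of the first byte where the two files differ.
public class FirstDiff {
    public static void main(String[] args) throws IOException {
        byte[] a = Files.readAllBytes(Paths.get("original.bin"));
        byte[] b = Files.readAllBytes(Paths.get("copy.bin"));
        int common = Math.min(a.length, b.length);
        for (int i = 0; i < common; i++) {
            if (a[i] != b[i]) {
                System.out.printf("First difference at offset %d: %02x vs %02x%n", i, a[i], b[i]);
                return;
            }
        }
        System.out.println(a.length == b.length ? "Files are identical" : "Files differ only in length");
    }
}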
I have a normal PDF file, A.pdf; a third party encodes the file in Base64 and sends it to me through a web service as a long string (I have no control over the third party).
My problem is that when I decode the string with Java's org.apache.commons.codec.binary.Base64 and write the output to a file called B.pdf,
I expect B.pdf to be identical to A.pdf, but B.pdf turns out a little different than A.pdf. As a result, B.pdf is not recognized as a valid PDF by Acrobat.
Does Base64 have different types of encoding/charset mechanisms? Can I detect how the string I received is encoded so that B.pdf = A.pdf?
EDIT: this is the file I want to decode; after decoding it should open as a PDF:
my encoded file
This is the header of the two files opened in Notepad++:
**A.pdf**
%PDF-1.4
%±²³´
%Created by Wnv/EP PDF Tools v6.1
1 0 obj
<<
/PageMode /UseNone
/ViewerPreferences 2 0 R
/Type /Catalog
**B.pdf**
%PDF-1.4
%±²³´
%Created by Wnv/EP PDF Tools v6.1
1 0! bj
<<
/PageMode /UseNone
/ViewerPreferences 2 0 R
/]
pe /Catalog
This is how I decode the string:
private static void decodeStringToFile(String encodedInputStr,
        String outputFileName) throws IOException {
    BufferedReader in = null;
    BufferedOutputStream out = null;
    try {
        in = new BufferedReader(new StringReader(encodedInputStr));
        out = new BufferedOutputStream(new FileOutputStream(outputFileName));
        decodeStream(in, out);
        out.flush();
    } finally {
        if (in != null)
            in.close();
        if (out != null)
            out.close();
    }
}

private static void decodeStream(BufferedReader in, OutputStream out)
        throws IOException {
    while (true) {
        String s = in.readLine();
        if (s == null)
            break;
        //System.out.println(s);
        byte[] buf = Base64.decodeBase64(s);
        out.write(buf);
    }
}
You are breaking your decoding by working line-by-line. Base64 decoders simply ignore whitespace, which means that a byte in the original content could very well be broken into two Base64 text lines. You should concatenate all the lines together and decode the file in one go.
Prefer using byte[] rather than String when supplying content to the Base64 class methods. String implies character set encoding, which may not do what you want.
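One possible shape of that fix, reusing the commons-codec Base64 class from the question; since Base64.decodeBase64 ignores embedded whitespace, the whole string can be decoded in a single call and no Base64 quantum is ever split across a line boundary:

import java.io.FileOutputStream;
import java.io.IOException;
import org.apache.commons.codec.binary.Base64;

// Decodes the whole Base64 payload at once and writes the raw bytes to the output file.
private static void decodeStringToFile(String encodedInputStr, String outputFileName)
        throws IOException {
    byte[] decoded = Base64.decodeBase64(encodedInputStr);
    try (FileOutputStream out = new FileOutputStream(outputFileName)) {
        out.write(decoded);
    }
}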
I'm trying to recognize a BOM for UTF-8 when reading a file. Of course, Java likes to deal with 16-bit chars, and the BOM characters are 8-bit bytes.
My test code looks like:
public void testByteOrderMarks() {
    System.out.println("test byte order marks");
    byte[] bytes = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, (byte) 'a', (byte) 'b', (byte) 'c'};
    String test = new String(bytes, Charset.availableCharsets().get("UTF-8"));
    System.out.printf("test len: %s value %s\n", test.length(), test);
    String three = test.substring(0, 3);
    System.out.printf("len %d >%s<\n", three.length(), three);
    for (int i = 0; i < test.length(); i++) {
        byte b = bytes[i];
        char c = test.charAt(i);
        System.out.printf("b: %s %x c: %s %x\n", (char) b, b, c, (int) c);
    }
}
and the result is:
test byte order marks
test len: 4 value ?abc
len 3 >?ab<
b: ? ef> c: ? feff
b: ? bb c: a 61
b: ? bf c: b 62
b: a 61 c: c 63
I can't figure out why the length of "test" is 4 and not 6.
I can't figure out why I don't pick up each 8 bit byte to do the comparison.
Thanks
Don't use characters when trying to figure out the BOM header. The BOM is two to four bytes depending on the encoding (three bytes for UTF-8), so you should open a (File)InputStream, read the first few bytes and process them.
Incidentally, the XML header (<?xml version=... encoding=...?>) is pure ASCII, so it's safe to load that as a byte stream too (well, unless there is a BOM to indicate that the file is saved with 16-bit characters and not as UTF-8).
My solution (see DecentXML's XMLInputStreamReader) is to load the first few bytes of the file and analyze them. That gives me enough information to create a properly decoding Reader out of an InputStream.
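A minimal sketch of that approach for the UTF-8 case (not DecentXML's actual code), using a PushbackInputStream so that non-BOM bytes can be returned to the stream before the Reader is created:

import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackInputStream;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

// Wraps the stream in a UTF-8 Reader, skipping the EF BB BF marker if it is present.
static Reader utf8ReaderSkippingBom(InputStream in) throws IOException {
    PushbackInputStream pushback = new PushbackInputStream(in, 3);
    byte[] head = new byte[3];
    int n = pushback.read(head, 0, 3); // a robust version would loop until 3 bytes or EOF
    boolean hasBom = n == 3
            && head[0] == (byte) 0xEF && head[1] == (byte) 0xBB && head[2] == (byte) 0xBF;
    if (!hasBom && n > 0) {
        pushback.unread(head, 0, n); // not a BOM: give the bytes back to the stream
    }
    return new InputStreamReader(pushback, StandardCharsets.UTF_8);
}

A fuller version would also check the first bytes for UTF-16/UTF-32 BOMs and choose the Reader's charset accordingly.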
A character is a character. The Byte Order Mark is the Unicode character U+FEFF. In Java it is the character '\uFEFF'. There is no need to delve into bytes. Just read the first character of the file, and if it matches '\uFEFF' it is the BOM. If it doesn't match then the file was written without a BOM.
private final static char BOM = '\uFEFF'; // Unicode Byte Order Mark

String firstLine = readFirstLineOfFile("filename.txt");
if (firstLine.charAt(0) == BOM) {
    // We have a BOM
} else {
    // No BOM present.
}
If you want to recognize the encoding of a file (with or without a BOM), a better solution (which works for me) is to use the encoding detector library from Mozilla: http://code.google.com/p/juniversalchardet/
That page describes simply how to use it:
import org.mozilla.universalchardet.UniversalDetector;

public class TestDetector {
    public static void main(String[] args) throws java.io.IOException {
        byte[] buf = new byte[4096];
        String fileName = "testFile.";
        java.io.FileInputStream fis = new java.io.FileInputStream(fileName);

        // (1)
        UniversalDetector detector = new UniversalDetector(null);

        // (2)
        int nread;
        while ((nread = fis.read(buf)) > 0 && !detector.isDone()) {
            detector.handleData(buf, 0, nread);
        }
        // (3)
        detector.dataEnd();

        // (4)
        String encoding = detector.getDetectedCharset();
        if (encoding != null) {
            System.out.println("Detected encoding = " + encoding);
        } else {
            System.out.println("No encoding detected.");
        }

        // (5)
        detector.reset();
    }
}
If you are using Maven, the dependency is:
<dependency>
    <groupId>com.googlecode.juniversalchardet</groupId>
    <artifactId>juniversalchardet</artifactId>
    <version>1.0.3</version>
</dependency>