Convert Windows-1252 xml file to UTF-8 - java

Is there any approach to converting a large XML file (500+ MB) from Windows-1252 encoding to UTF-8 encoding in Java?

Sure:
Open a FileInputStream wrapped in an InputStreamReader with the Windows-1252 encoding for the input
Open a FileOutputStream wrapped in an OutputStreamWriter with the UTF-8 encoding for the output
Create a buffer char array (e.g. 16K)
Repeatedly read into the array and write out however much has been read:
char[] buffer = new char[16 * 1024];
int charsRead;
while ((charsRead = input.read(buffer)) > 0) {
output.write(buffer, 0, charsRead);
}
Don't forget to close the output afterwards! (Otherwise there could be buffered data which never gets written to disk.)
Note that as it's XML, you may well need to change the XML declaration manually as well, since it probably says the file is in Windows-1252...
The fact that this works on a streaming basis means you don't need to worry about the size of the file - it only reads up to 16K characters in memory at a time.
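Putting the steps above together, a minimal sketch might look like this (the file names are placeholders; try-with-resources takes care of flushing and closing):

```java
import java.io.*;
import java.nio.charset.StandardCharsets;

public class ConvertEncoding {
    public static void main(String[] args) throws IOException {
        try (Reader input = new InputStreamReader(
                 new FileInputStream("input.xml"), "windows-1252");
             Writer output = new OutputStreamWriter(
                 new FileOutputStream("output.xml"), StandardCharsets.UTF_8)) {
            char[] buffer = new char[16 * 1024];
            int charsRead;
            // Read up to 16K characters at a time, write out what was read
            while ((charsRead = input.read(buffer)) > 0) {
                output.write(buffer, 0, charsRead);
            }
        } // try-with-resources closes (and flushes) both streams
    }
}
```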

Is this a one-off or a job that you need to run repeatedly and make efficient?
If it's a one-off, I don't see the need for Java coding. Just run the identity query "." with Saxon, for example
java net.sf.saxon.Query -s:input.xml -qs:. -o:output.xml
making sure you allocate, say, 3GB of memory (e.g. with -Xmx3g).
If you're doing it repeatedly and want a streamed approach, you have to choose between handling it as text (as Jon Skeet suggests) or as XML. The advantage of doing it as XML is primarily that the XML declaration will get taken care of, and character references will be converted to characters. The simplest way is to use a JAXP identity transformation:
Source in = new StreamSource(new File("input.xml"));
TransformerFactory f = TransformerFactory.newInstance();
Result out = new StreamResult(new File("output.xml"));
f.newTransformer().transform(in, out);

If this is a one-off, Java may not be the most appropriate tool. Consider iconv:
iconv -f windows-1252 -t utf-8 <source.xml >target.xml
This has all the benefits of streaming without requiring you to write any code.
Unlike Michael's solution, this won't take care of the XML declaration. Edit it manually if necessary or, now that you're using UTF-8 (the default), omit it.

Related

Resource file format processing in Java

I am trying to implement a processor for a specific resource archive file format in Java. The format has a Header comprised of a three-char description, a dummy byte, plus a byte indicating the number of files.
Then each file has an entry consisting of a dummy byte, a twelve-char string describing the file name, a dummy byte, and an offset declared in a three-byte array.
What would be the proper class for reading this kind of structure? I have tried RandomAccessFile, but it does not allow reading arrays of data; e.g., I can only read three chars by calling readChar() three times, etc.
Of course I can extend RandomAccessFile to do what I want, but there's got to be a proper out-of-the-box class for this kind of processing, isn't there?
This is my reader for the header in C#:
protected override void ReadHeader()
{
    Header = new string(this.BinaryReader.ReadChars(3));
    byte dummy = this.BinaryReader.ReadByte();
    NFiles = this.BinaryReader.ReadByte();
}
I think you got lucky with your C# code, as it relies on the character encoding being set somewhere else; if that encoding didn't match the number of bytes per character in the file, your code would probably have failed.
The safest way to do this in Java would be to strictly read bytes and do the conversion to characters yourself. If you need seek ability, then indeed RandomAccessFile would be your easiest solution, but it should be pointed out that InputStream allows skipping, so if you don't need true random access but just want to skip parts of the file, you could certainly use it.
In either case, you should read the bytes from the file per the file specification, and then convert them to characters based on a known encoding. You should never trust a file that was not written by a Java program to contain any Java data types other than byte, and even if it was written by Java, it may well have been converted to raw bytes while writing.
So your code should be something along the lines of:
// Requires: import java.io.RandomAccessFile; import java.nio.charset.StandardCharsets;
RandomAccessFile raFile = new RandomAccessFile("filename", "r");
byte[] buffer = new byte[3];
int numRead = raFile.read(buffer);
String header = new String(buffer, StandardCharsets.US_ASCII);
int numSkipped = raFile.skipBytes(1); // the dummy byte
int nFiles = raFile.read(); // a byte is read as an integer between 0 and 255
Sanity checks (checking that all 3 bytes were read, 1 byte was skipped and nFiles is not -1) and exception handling have been skipped for brevity.
It's more or less the same if you use InputStream.
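For illustration, here is the same header read via a plain InputStream wrapped in a DataInputStream, which gives you readFully for whole arrays (the file name and the layout are assumptions based on the question):

```java
import java.io.*;
import java.nio.charset.StandardCharsets;

public class HeaderReader {
    public static void main(String[] args) throws IOException {
        try (DataInputStream in = new DataInputStream(
                 new BufferedInputStream(new FileInputStream("archive.bin")))) {
            byte[] desc = new byte[3];
            in.readFully(desc);                 // throws EOFException on a short file
            String header = new String(desc, StandardCharsets.US_ASCII);
            in.skipBytes(1);                    // the dummy byte
            int nFiles = in.readUnsignedByte(); // 0..255
            System.out.println(header + ": " + nFiles + " files");
        }
    }
}
```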
I would go with MappedByteBuffer. This will allow you to seek arbitrarily, but will also deal efficiently and transparently with large files that are too large to fit comfortably in RAM.
This is, to my mind, the best way of reading structured binary data like this from a file.
You can then build your own data structure on top of that, to handle the specific file format.
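A sketch of that approach, reading the header described in the question through a MappedByteBuffer (the file name is a placeholder):

```java
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;

public class MappedReader {
    public static void main(String[] args) throws Exception {
        try (RandomAccessFile raf = new RandomAccessFile("archive.bin", "r");
             FileChannel channel = raf.getChannel()) {
            // Map the whole file; the OS pages it in on demand, so large files are fine
            MappedByteBuffer map =
                channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
            byte[] desc = new byte[3];
            map.get(desc);                 // three-char description
            String header = new String(desc, StandardCharsets.US_ASCII);
            map.get();                     // dummy byte
            int nFiles = map.get() & 0xFF; // unsigned byte
            System.out.println(header + ": " + nFiles + " files");
        }
    }
}
```

Seeking is just map.position(offset), so per-file entries at known offsets are easy to reach.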

"FF FF" is getting dumped as "FD" on some of the computers with Java Scripting API

I am facing issues with the Java Scripting API together with JavaScript on some of the PCs. After analyzing the dumped file, I noticed that "FF FF" is getting printed as "FD" on some of the PCs. Below is the code snippet:
var outputfile = new RandomAccessFile(f, "rw");
var byte_data_array = getMyByteArrayData(somebytearray);
var data_string = new java.lang.String(byte_data_array);
outputfile.writeBytes(data_string);
You're converting the data from bytes to String without specifying an encoding (which uses the locale-dependent platform default encoding), then writing it to a file using the writeBytes() method, which is documented in the API doc as discarding the high-order byte of each character.
What did you expect? I'm actually surprised the result has any resemblance at all to the original data.
What you most likely should do is replace the last two lines with this:
outputfile.write(byte_data_array);
And always remember: bytes are for data, Strings are for text, and if you convert between them, you always need to pay attention to what encoding is used.

Does FileOutputStream truncate an existing file

Does
final OutputStream output = new FileOutputStream(file);
truncate the file if it already exists? Surprisingly, the API documentation for Java 6 does not say. Nor does the API documentation for Java 7. The specification for the language itself has nothing to say about the semantics of the FileOutputStream class.
I am aware that
final OutputStream output = new FileOutputStream(file, true);
causes appending to the file. But appending and truncating are not the only possibilities. If you write 100 bytes into a 1000 byte file, one possibility is that the final 900 bytes are left as they were.
FileOutputStream without the append option does truncate the file.
Note that FileOutputStream opens a stream, not a random access file, so I guess it does make sense that it behaves that way, although I agree that the documentation could be more explicit about it.
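You can verify the truncating behaviour with a quick single-process test (the file name is a placeholder):

```java
import java.io.FileOutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class TruncateDemo {
    public static void main(String[] args) throws Exception {
        Path path = Paths.get("demo.txt");
        Files.write(path, "0123456789".getBytes("US-ASCII")); // 10 bytes
        // No append flag, so the existing content is truncated on open
        try (FileOutputStream out = new FileOutputStream(path.toFile())) {
            out.write("ab".getBytes("US-ASCII"));
        }
        System.out.println(Files.size(path)); // prints 2, not 10
    }
}
```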
I tried this on Windows 2008 x86 and Java 1.6.0_32-b05.
I created two processes which wrote continually to the same file, one writing 1 MB of the character 'b' and the other 4 MB of the character 'a'. Unless I used
out = new RandomAccessFile(which, "rw");
out.setLength(0);
out.getChannel().lock();
I found that a third reader process could read what appeared to be a file which started with 1 MB of 'b's followed by 'a's.
I found that writing first to a temporary file and then renaming it (with File.renameTo) to the target file also worked.
I would not depend on FileOutputStream on Windows to truncate a file which may be being read by a second process...
None of the following truncated the file reliably in that scenario:
new FileOutputStream(file)
new FileOutputStream(file, false)
out = new FileOutputStream(which, false);
out.getChannel().truncate(0);
out.getChannel().force(true);
However, adding a lock:
out = new FileOutputStream(which, false);
out.getChannel().truncate(0);
out.getChannel().force(true);
out.getChannel().lock();
does work.
FileOutputStream is meant for writing binary data, which is most often overwritten.
If you are manipulating text data, you would be better off using a FileWriter, which has convenient append options.

How do you write any ASCII character to a file in Java?

Basically I'm trying to use a BufferedWriter to write to a file using Java. The problem is, I'm actually doing some compression, so I generate ints between 0 and 255, and I want to write the character whose ASCII value is equal to that int. When I try writing to the file, it writes many ? characters, so when I read the file back in, it reads those as 63, which is clearly not what I want. Any ideas how I can fix this?
Example code:
int a = generateCode(character); //a now has an int between 0 and 255
bw.write((char) a);
a is always between 0 and 255, but it sometimes writes '?'
You are really trying to write / read bytes to / from a file.
When you are processing byte-oriented data (as distinct from character-oriented data), you should be using InputStream and OutputStream classes and not Reader and Writer classes.
In this case, you should use FileInputStream / FileOutputStream, and wrap with a BufferedInputStream / BufferedOutputStream if you are doing byte-at-a-time reads and writes.
Those pesky '?' characters are due to issues in the encoding/decoding process that happens when Java converts between characters and the default text encoding for your platform. The conversion from bytes to characters and back is often "lossy", depending on the encoding scheme used. You can avoid this by using the byte-oriented stream classes.
(And the answers that point out that ASCII is a 7-bit not 8-bit character set are 100% correct. You are really trying to read / write binary octets, not characters.)
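As a sketch of the stream-based approach, here is a round trip that writes the values 0 to 255 as raw bytes and reads them back unchanged (the file name is made up):

```java
import java.io.*;

public class ByteRoundTrip {
    public static void main(String[] args) throws IOException {
        File file = new File("codes.bin");
        try (OutputStream out = new BufferedOutputStream(new FileOutputStream(file))) {
            for (int a = 0; a <= 255; a++) {
                out.write(a); // writes the low-order byte; no character encoding involved
            }
        }
        try (InputStream in = new BufferedInputStream(new FileInputStream(file))) {
            int b, expected = 0;
            while ((b = in.read()) != -1) {
                if (b != expected++) {
                    throw new IOException("mismatch at " + (expected - 1));
                }
            }
            System.out.println("Read back " + expected + " values intact");
        }
    }
}
```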
You need to make up your mind what are you really doing. Are you trying to write some bytes to a file, or are you trying to write encoded text? Because these are different concepts in Java; byte I/O is handled by subclasses of InputStream and OutputStream, while character I/O is handled by subclasses of Reader and Writer. If what you really want to write is bytes to a file (which I'm guessing from your mention of compression), use an OutputStream, not a Writer.
Then there's another confusion you have, which is evident from your mention of "ASCII characters from 0-255." There are no ASCII characters above 127. Please take 15 minutes to read this: "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)" (by Joel Spolsky). Pay particular attention to the parts where he explains the difference between a character set and an encoding, because it's critical for understanding Java I/O. (To review whether you understood, here's what you need to learn: Java Writers are classes that translate character output to byte output by applying a client-specified encoding to the text, and sending the bytes to an OutputStream.)
Java strings are based on 16-bit characters, and Java tries to perform conversions around that assumption if there is no explicit specification.
The following sample code writes and reads data directly as bytes, meaning 8-bit numbers which have an ASCII meaning associated with them.
import java.io.*;

public class RWBytes {
    public static void main(String[] args) throws IOException {
        String filename = "MiTestFile.txt";
        byte[] bArray1 = new byte[5];
        byte[] bArray2 = new byte[5];
        bArray1[0] = 65; // A
        bArray1[1] = 66; // B
        bArray1[2] = 67; // C
        bArray1[3] = 68; // D
        bArray1[4] = 69; // E
        FileOutputStream fos = new FileOutputStream(filename);
        fos.write(bArray1);
        fos.close();
        FileInputStream fis = new FileInputStream(filename);
        fis.read(bArray2);
        fis.close();
        for (int i = 0; i < bArray2.length; i++) {
            System.out.println("As the numeric byte value: " + bArray2[i]);
            System.out.println("Converted to char for printing: " + (char) bArray2[i]);
        }
    }
}
A fixed subset of the 7-bit ASCII code is printable: A=65, for example, while 10 corresponds to the "new line" character, which steps down one line on screen when "printed". Many other codes manipulate a character-oriented screen; these are invisible and affect the screen representation, like tabs and spaces. There are also other control characters which had purposes such as ringing a bell.
The upper range above 127 is defined as whatever the implementer wanted; only the lower half has standard meanings associated.
For general binary byte handling there are no such qualms: the bytes are numbers which represent the data. Only when you try to print them to the screen do they become meaningful in all kinds of ways.

Read first part of inputstream in Java

I have an XML file that I read from a URLConnection. The file grows over time; today it is 1.3 MB. I want to read the first 100k of the file, and then parse the part I have read.
How can I do that?
(From scratch; note that a single InputStream.read() call isn't guaranteed to fill the buffer, so use readFully)
int length = 100 * 1024;
byte[] buf = new byte[length];
new DataInputStream(urlConnection.getInputStream()).readFully(buf, 0, length);
InputStream in = new ByteArrayInputStream(buf);
new SAXParser().parse(in, myHandler);
As far as I understand, you're interested not just in 100k of a stream, but in 100k of a stream from which you can extract the data you need. This means taking 100k as proposed by Peter won't work, as it might result in non-well-formed XML.
Instead I'd suggest using a StAX parser, which gives you the ability to read and parse XML directly from the stream and to stop when you've reached the 100k (or thereabouts) limit.
For further information take a look at XMLStreamReader interface (and samples around its usage). For example you could loop until you get to the START_ELEMENT with name "result" and then use method getTextCharacters(int sourceStart, char[] target, int targetStart, int length) specifying 100k as buffer size.
Since you mentioned Android: it doesn't currently have a StAX parser available. However, it does have XmlPullParser with similar functionality.
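A minimal StAX sketch of that idea (the element name "result" and the sample XML are made up for illustration): pull events from the stream and stop as soon as you have what you need, so the rest of the growing file is never read:

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class PartialXmlRead {
    public static void main(String[] args) throws Exception {
        // Stand-in for urlConnection.getInputStream()
        String xml = "<log><result>first</result><result>second</result></log>";
        InputStream in = new ByteArrayInputStream(xml.getBytes("UTF-8"));
        XMLStreamReader reader =
            XMLInputFactory.newInstance().createXMLStreamReader(in);
        while (reader.hasNext()) {
            if (reader.next() == XMLStreamConstants.START_ELEMENT
                    && reader.getLocalName().equals("result")) {
                System.out.println(reader.getElementText());
                break; // stop here: the rest of the stream is never parsed
            }
        }
        reader.close();
    }
}
```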
