How to process a string with 823237 characters

How to process a string with 823237 characters - java

I have a string that has 823237 characters in it. its actually an xml file and for testing purpose I want to return as a response form a servlet.
I have tried everything I can possible think of
1) creating a constant with the whole string... in this case Eclipse complains (with a red line under servlet class name) -
The type generates a string that requires more than 65535 bytes to encode in Utf8 format in the constant pool
2) breaking the whole string into 20 string constants and writing to the out object directly
something like :
out.println( CONSTANT_STRING_PART_1 + CONSTANT_STRING_PART_2 +
CONSTANT_STRING_PART_3 + CONSTANT_STRING_PART_4 +
CONSTANT_STRING_PART_5 + CONSTANT_STRING_PART_6 +
// add all the string constants till .... CONSTANT_STRING_PART_20);
in this case ... the build fails .. complaining..
[javac] D:\xx\xxx\xxx.java:87: constant string too long
[javac] CONSTANT_STRING_PART_19 + CONSTANT_STRING_PART_20);
^
3) reading the xml file as a string and writing to out object .. in this case I get
SEVERE: Allocate exception for servlet MyServlet
Caused by: org.apache.xmlbeans.XmlException: error: Content is not allowed in prolog.
Finally my question is ... how can I return such a big string (as response) from the servlet ???

You can avoid to load all the text in memory using streams:
InputStream is = new FileInputStream("path/to/your/file"); //or the following line if the file is in the classpath
InputStream is = MyServlet.class.getResourceAsStream("path/to/file/in/classpath");
byte[] buff = new byte[4 * 1024];
int read;
while ((read = is.read(buff)) != -1) {
out.write(buff, 0, read);
}

The second approach might work the following way:
out.print(CONSTANT_STRING_PART_1);
out.print(CONSTANT_STRING_PART_2);
out.print(CONSTANT_STRING_PART_3);
out.print(CONSTANT_STRING_PART_4);
// ...
out.print(CONSTANT_STRING_PART_N);
out.println();
You can do this in a loop of course (which is highly recommended ;)).
The way you do it, you just temporarely create the large string again to then pass it to println(), which is the same problem as the first one.

Ropes: Theory and practice
Why and when to use Ropes for Java for string manipulations

You can read a 823K file into a String. Maybe not the most elegant method, but totally doable. Method 3 should have worked. There was an XML error, but that has nothing to do with reading from a file into a String, or the length of the data.
It has to be an external file, though, because it is too big to be inlined into a class file (there are size limits for those).
I recommend Commons IO FileUtils#readFileToString.

You have to deal with ByteArrayOutputStream and not with the String it self. If you want to send your String in the http response all you have to do is to read from that byteArray stream and write in the response stream like this :
ByteArrayOutputStream baos = new ByteArrayOutputStream(8232237);
baos.write(constant1.getBytes());
baos.write(constant2.getBytes());
...
baos.writeTo(response.getOutputStream());

Both problem 1) and 2) are due to the same fundamental issue. A String literal (or constant String expression) cannot be more than 65535 characters because there is a hard limit on string constants in the class file format.
The third problem sounds like a bug in the way you've implemented it rather than a fundamental problem. In fact, it sounds like you are trying to load the XML as a DOM and then unparse it (which is unnecessary), and that somehow you have managed to mangle the XML in the process. (Or maybe it is mangled in the file you are trying to read ...)
The simple and elegant solution is to save the stuff in a file, and then read it as plain text.
Or ... less elegant, but just as effective:
String[] strings = new String[](
"longString1",
"longString2",
...
"longStringN"};
for (String str : strings) {
out.write(str);
}
Of course, the problem with embedding test data as string literals is that you have to escape certain characters in the string to keep the compiler happy. That's tedious if you have to do it by hand.

Related

Converting string to byte[] returns wrong value (encoding?)

I read a byte[] from a file and convert it to a String:
byte[] bytesFromFile = Files.readAllBytes(...);
String stringFromFile = new String(bytesFromFile, "UTF-8");
I want to compare this to another byte[] I get from a web service:
String stringFromWebService = webService.getMyByteString();
byte[] bytesFromWebService = stringFromWebService.getBytes("UTF-8");
So I read a byte[] from a file and convert it to a String and I get a String from my web service and convert it to a byte[]. Then I do the following tests:
// works!
org.junit.Assert.assertEquals(stringFromFile, stringFromWebService);
// fails!
org.junit.Assert.assertArrayEquals(bytesFromFile, bytesFromWebService);
Why does the second assertion fail?

Other answers have covered the likely fact that the file is not UTF-8 encoded giving rise to the symptoms described.
However, I think the most interesting aspect of this is not that the byte[] assert fails, but that the assert that the string values are the same passes. I'm not 100% sure why this is, but I think the following trawl through the source code might give us the answer:
Looking at how new String(bytesFromFile, "UTF-8"); works - we see that the constructor calls through to StringCoding.decode()
This in turn, if supplied with tht UTF-8 character set, calls through to StringDecoder.decode()
This calls through to CharsetDecoder.decode() which decides what to do if the character is unmappable (which I guess will be the case if a non-UTF-8 character is presented)
In this case it uses an action defined by
private CodingErrorAction unmappableCharacterAction
= CodingErrorAction.REPORT;
Which means that it still reports the character it has decoded, even though it's technically unmappable.
I think this means that even when the code gets an umappable character, it substitutes its best guess - so I'm guessing that its best guess is correct and hence the String representations are the same under comparison, but the byte[] are no longer the same.
This hypothesis is kind of supported by the fact that the catch block for CharacterCodingException in StringCoding.decode() says:
} catch (CharacterCodingException x) {
// Substitution is always enabled,
// so this shouldn't happen

I don't understand it fully, but here's what I get so fare:
The problem is that the data contains some bytes which are not valid UTF-8 bytes as I know by the following check:
// returns false for my data!
public static boolean isValidUTF8(byte[] input) {
CharsetDecoder cs = Charset.forName("UTF-8").newDecoder();
try {
cs.decode(ByteBuffer.wrap(input));
return true;
}
catch(CharacterCodingException e){
return false;
}
}
When I change the encoding to ISO-8859-1 everything works fine. The strange thing (which a don't understand yet) is why my conversion (new String(bytesFromFile, "UTF-8");) doesn't throw any exception (like my isValidUTF8 method), although the data is not valid UTF-8.
However, I think I will go another and encode my byte[] in a Base64 string as I don't want more trouble with encoding.

The real problem in your code is that you don't know what the real file encoding.
When you read the string from the web service you get a sequence of chars; when you convert the string from chars to bytes the conversion is made right because you specify how to transform char in bytes with a specific encoding ("UFT-8"). when you read a text file you face a different problem. You have a sequence of bytes that needs to be converted to chars. In order to do it properly you must know how the chars where converted to bytes i.e. what is the file encoding. For files (unless specified) it's a platform constants; on windows the file are encoded in win1252 (which is very close to ISO-8859-1); on linux/unix it depends, I think UTF8 is the default.
By the way the web service call did a decond operation under the hood; the http call use an header taht defins how chars are encoded, i.e. how to read the bytes form the socket and transform then to chars. So calling a SOAP web service gives you back an xml (which can be marshalled into a Java object) with all the encoding operations done properly.
So if you must read chars from a File you must face the encoding issue; you can use BASE64 as you stated but you lose one of the main benefits of text files: the are human readable, easing debugging and developing.

Read special charatters ( æ ø å ) with Java from Oracle database

i have a problem when reading special charatters from oracle database (use JDBC driver and glassfish tooplink).
I store on database the name "GRØNLÅEN KJÆTIL" through WebService and, on database, the data are store correctly.
But when i read this String, print on log file and convert this in byte array whit this code:
int pos = 0;
byte[] msg=new byte[1024];
String F = "F" + passenger.getName();
logger.debug("Add " + F + " " + F.length());
msg = addStringToArrayBytePlusSeparator(msg, F,pos);
..............
private byte[] addStringToArrayBytePlusSeparator(byte[] arrDest,String strToAdd,int destPosition)
{
System.arraycopy(strToAdd.getBytes(Charset.forName("ISO-8859-1")), 0, arrDest, destPosition, strToAdd.getBytes().length);
arrDest = addSeparator(arrDest,destPosition+strToAdd.getBytes().length,1);
return arrDest;
}
1) In the log file there is:"Add FGRÃNLÃ " (the name isn't correct and the F.length() are not printed).
2) The code throw:
java.lang.ArrayIndexOutOfBoundsException
at java.lang.System.arraycopy(Native Method)
at it.edea.ebooking.business.chi.control.VingCardImpl.addStringToArrayBytePlusSeparator(Test.java:225).
Any solution?
Tanks

You're calling strToAdd.getBytes() without specifying the character encoding, within the System.arraycopy call - that will be using the system default encoding, which may well not be ISO-8859-1. You should be consistent in which encoding you use. Frankly I'd also suggest that you use UTF-8 rather than ISO-8859-1 if you have the choice, but that's a different matter.
Why are you dealing with byte arrays anyway at this point? Why not just use strings?
Also note that your addStringToArrayBytePlusSeparator method doesn't give any indication of how many bytes it's copied, which means the caller won't have any idea what to do with it afterwards. If you must use byte arrays like this, I'd suggest making addStringToArrayBytePlusSeparator return either the new "end of logical array" or the number of bytes copied. For example:
private static final Charset ISO_8859_1 = Charset.forName("ISO-8859-1");
/**
* (Insert fuller description here.)
* Returns the number of bytes written to the array
*/
private static int addStringToArrayBytePlusSeparator(byte[] arrDest,
String strToAdd,
int destPosition)
{
byte[] encodedText = ISO_8859_1.getBytes(strToAdd);
// TODO: Verify that there's enough space in the array
System.arraycopy(encodedText, 0, arrDest, destPosition, encodedText.length);
return encodedText.length;
}

Encoding/Decoding problems are hard. In every process step you have to do the correct encoding/decoding. So,
familiarize yourself with the difference of bytes (inputstream) and Characters (Readers, Strings)
Choose in which character encoding you want to store your data in the database, and in which character encoding you want to expose your webservice. Make sure when you load initial data in the database it's in the right encoding
connect with the right database properties. mysql requires an addition to the connection url:?useUnicode=true&characterEncoding=UTF-8 when using UTF-8, I don't know about oracle.
if you print/debug at a certain step and it looks ok, you can't be sure you did it right. The logger can write with the wrong encoding (sometimes making something look ok, while in fact it's broken). Your terminal might not handle strange byte encodings correct. The same holds for command-line database clients. Your data might wrongly be stored, but your wrongly configured terminal interprets/shows the data as correct.
In XML, it's not only the stream encoding that matters, but also the xml-encoding attribute.

StringBufferInputStream Question in Java

I want to read an input string and return it as a UTF8 encoded string. SO I found an example on the Oracle/Sun website that used FileInputStream. I didn't want to read a file, but a string, so I changed it to StringBufferInputStream and used the code below. The method parameter jtext, is some Japanese text. Actually this method works great. The question is about the deprecated code. I had to put #SuppressWarnings because StringBufferInputStream is deprecated. I want to know is there a better way to get a string input stream? Is it ok just to leave it as is? I've spent so long trying to fix this problem that I don't want to change anything now I seem to have cracked it.
#SuppressWarnings("deprecation")
private String readInput(String jtext) {
StringBuffer buffer = new StringBuffer();
try {
StringBufferInputStream sbis = new StringBufferInputStream (jtext);
InputStreamReader isr = new InputStreamReader(sbis,
"UTF8");
Reader in = new BufferedReader(isr);
int ch;
while ((ch = in.read()) > -1) {
buffer.append((char)ch);
}
in.close();
return buffer.toString();
} catch (IOException e) {
e.printStackTrace();
return null;
}
}
I think I found a solution - of sorts:
private String readInput(String jtext) {
String n;
try {
n = new String(jtext.getBytes("8859_1"));
return n;
} catch (UnsupportedEncodingException e) {
return null;
}
}
Before I was desparately using getBytes(UTF8). But I by chance I used Latin-1 "8859_1" and it worked. Why it worked, I can't fathom. This is what I did step-by-step:
OpenOffice CSV(utf8)------>SQLite(utf8, apparently)------->java encoded as Latin-1, somehow readable.

The reason that StringBufferInputStream is deprecated is because it is fundamentally broken ... for anything other than Strings consisting entirely of Latin-1 characters. According to the javadoc it "encodes" characters by simply chopping off the top 8 bits! You don't want to use it if your application needs to handle Unicode, etc correctly.
If you want to create an InputStream from a String, then the correct way to do it is to use String.getBytes(...) to turn the String into a byte array, and then wrap that in a ByteArrayInputStream. (Make sure that you choose an appropriate encoding!).
But your sample application immediately takes the InputStream, converts it to a Reader and then adds a BufferedReader If this is your real aim, then a simpler and more efficient approach is simply this:
Reader in = new StringReader(text);
This avoids the unnecessary encoding and decoding of the String, and also the "buffer" layer which serves no useful purpose in this case.
(A buffered stream is much more efficient than an unbuffered stream if you are doing small I/O operations on a file, network or console stream. But for a stream that is served from an in-memory data structure the benefits are much smaller, and possibly even negative.)
FOLLOWUP
I realized what you are trying to do now ... work around a character encoding / decoding issue.
My advice would be to try to figure out definitively the actual encoding of the character data that is being delivered by the database, then make sure that the JDBC drivers are configured to use the same encoding. Trying to undo the mis-translation by encoding with one encoding and decoding with another is dodgy, and can give you only a partial correction of the problems.
You also need to consider the possibility that the characters got mangled on the way into the database. If this is the case, then you may be unable to de-mangle them.

Is this what you are trying to do? Here is previous answer on similar question. I am not sure why you want to convert to a String to an exactly the same String.
Java String holds a sequence of chars in which each char represents a Unicode number. So it is possible to construct the same string from two different byte sequences, says one is encoded with UTF-8 and the other is encoded with US-ASCII.
If you want to write it to file, you can always convert it with String.getBytes("encoder");
private static String readInput(String jtext) {
byte[] bytes = jtext.getBytes();
try {
String string = new String(bytes, "UTF-8");
return string;
} catch (UnsupportedEncodingException ex) {
// do something
return null;
}
}
Update
Here is my assumption.
According to your comment, you SQLite DB store text value using one encoding, says UTF-16. For some reason, your SQLite APi cannot determine what the encoding it uses to encode the Unicode values to sequence of bytes.
So when you use getString method from your SQLite API, it reads a set of bytes form you DB, and convert them into Java String using incorrect encoding. If this is the case, you should use getBytes method and reconstruct the String yourself, i.e. new String(bytes, "encoding used in your DB"); If you DB is stored in UTF-16, then new String(bytes, "UTF-16"); should be readable.
Update
I wasn't talking about getBytes method on String class. I talked about getBytes method on your SQL result object, e.g. result.getBytes(String columnLabel).
ResultSet result = .... // from SQL query
String readableString = readInput(result.getBytes("my_table_column"));
You will need to change the signature of your readInput method to
private static String readInput(byte[] bytes) {
try {
// change encoding to your DB encoding.
// this can be UTF-8, UTF-16, 8859_1, etc.
String string = new String(bytes, "UTF-8");
return string;
} catch (UnsupportedEncodingException ex) {
// do something, at least return garbled text
return new String(bytes, "UTF-8");;
}
}
Whatever encoding you set in here which makes your String readable, it is definitely the encoding of your column in DB. This involves no unexplanable phenomenon and you know exactly what your column encoding is.
But it will be good to config your JDBC driver to use the correct encoding so that you will not need to use this readInput method to convert.
If no encoding can make your string readable, you will need consider the possibility of the characters got mangled when it was written to DB as #Stephen C said. If this is the case, using walk around method may cause you to lose some of the charaters during conversions. You will also need to solve encoding problem during writting as well.

The StringReader class is the new alternative to the deprecated StringBufferInputStream class.
However, you state that what you actually want to do is take an existing String and return it encoded as UTF-8. You should be able to do that much more simply I expect. Something like:
s8 = new String(jtext.getBytes("UTF8"));

How to check whether the file is binary?

I wrote the following method to see whether particular file contains ASCII text characters only or control characters in addition to that. Could you glance at this code, suggest improvements and point out oversights?
The logic is as follows: "If first 500 bytes of a file contain 5 or more Control characters - report it as binary file"
thank you.
public boolean isAsciiText(String fileName) throws IOException {
InputStream in = new FileInputStream(fileName);
byte[] bytes = new byte[500];
in.read(bytes, 0, bytes.length);
int x = 0;
short bin = 0;
for (byte thisByte : bytes) {
char it = (char) thisByte;
if (!Character.isWhitespace(it) && Character.isISOControl(it)) {
bin++;
}
if (bin >= 5) {
return false;
}
x++;
}
in.close();
return true;
}

Since you call this class "isASCIIText", you know exactly what you're looking for. In other words, it's not "isTextInCurrentLocaleEncoding". Thus you can be more accurate with:
if (thisByte < 32 || thisByte > 127) bin++;
edit, a long time later — it's pointed out in a comment that this simple check would be tripped up by a text file that started with a lot of newlines. It'd probably be better to use a table of "ok" bytes, and include printable characters (including carriage return, newline, and tab, and possibly form feed though I don't think many modern documents use those), and then check the table.

x doesn't appear to do anything.
What if the file is less than 500 bytes?
Some binary files have a situation where you can have a header for the first N bytes of the file which contains some data that is useful for an application but that the library the binary is for doesn't care about. You could easily have 500+ bytes of ASCII in a preamble like this followed by binary data in the following gigabyte.
Should handle exception if the file can't be opened or read, etc.

Fails badly if file size is less than 500 bytes
The line char it = (char) thisByte; is conceptually dubious, it mixes bytes and chars concepts, ie. assumes implicitly that the encoding is one-byte=one character (them, it excludes unicode encodings). In particular, it fails if the file is UTF-16 encoded.
The return inside the loop (slightly bad practice IMO) forgets to close the file.

The first thing I noticed - unrelated to your actual question, but you should be closing your input stream in a finally block to ensure it's always done. Usually this merely handles exceptions, but in your case you won't even close the streams of files when returning false.
Asides from that, why the comparison to ISO control characters? That's not a "binary" file, that's a "file that contains 5 or more control characters". A better way to approach the situation in my opinion, would be to invert the check - write an isAsciiText function instead which asserts that all the characters in the file (or in the first 500 bytes if you so wish) are in a set of bytes that are known good.
Theoretically, only checking the first few hundred bytes of a file could get you into trouble if it was a composite file of sorts (e.g. text with embedded pictures), but in practice I suspect every such file will have binary header data at the start so you're probably OK.

This would not work with the jdk install packages for linux or solaris. they have a shell-script start and then a bi data blob.
why not check the mime type using some library like jMimeMagic (http://http://sourceforge.net/projects/jmimemagic/) and deside based on the mimetype how to handle the file.

One could parse and compare ageinst a list of known binary file header bytes, like the one provided here.
Problem is, one needs to have a sorted list of binary-only headers, and the list might not be complete at all. For example, reading and parsing binary files contained in some Equinox framework jar. If one needs to identify the specific file types though, this should work.
If you're on Linux, for existing files on the disk, native file command execution should work well:
String command = "file -i [ZIP FILE...]";
Process process = Runtime.getRuntime().exec(command);
...
It will output information on the files:
...: application/zip; charset=binary
which you can furtherly filter with grep, or in Java, depending on, if you simply need estimation of the files' binary character, or if you need to find out their MIME types.
If parsing InputStreams, like content of nested files inside archives, this doesn't work, unfortunately, unless resorting to shell-only programs, like unzip - if you want to avoid creating temp unzipped files.
For this, a rough estimation of examining the first 500 Bytes worked out ok for me, so far, as was hinted in the examples above; instead of Character.isWhitespace/isISOControl(char), I used Character.isIdentifierIgnorable(codePoint), assuming UTF-8 default encoding:
private static boolean isBinaryFileHeader(byte[] headerBytes) {
return new String(headerBytes).codePoints().filter(Character::isIdentifierIgnorable).count() >= 5;
}
public void printNestedZipContent(String zipPath) {
try (ZipFile zipFile = new ZipFile(zipPath)) {
int zipHeaderBytesLen = 500;
zipFile.entries().asIterator().forEachRemaining( entry -> {
String entryName = entry.getName();
if (entry.isDirectory()) {
System.out.println("FOLDER_NAME: " + entryName);
return;
}
// Get content bytes from ZipFile for ZipEntry
try (InputStream zipEntryStream = new BufferedInputStream(zipFile.getInputStream(zipEntry))) {
// read and store header bytes
byte[] headerBytes = zipEntryStream.readNBytes(zipHeaderBytesLen);
// Skip entry, if nested binary file
if (isBinaryFileHeader(headerBytes)) {
return;
}
// Continue reading zipInputStream bytes, if non-binary
byte[] zipContentBytes = zipEntryStream.readAllBytes();
int zipContentBytesLen = zipContentBytes.length;
// Join already read header bytes and rest of content bytes
byte[] joinedZipEntryContent = Arrays.copyOf(zipContentBytes, zipContentBytesLen + zipHeaderBytesLen);
System.arraycopy(headerBytes, 0, joinedZipEntryContent, zipContentBytesLen, zipHeaderBytesLen);
// Output (default/UTF-8) encoded text file content
System.out.println(new String(joinedZipEntryContent));
} catch (IOException e) {
System.out.println("ERROR getting ZipEntry content: " + entry.getName());
}
});
} catch (IOException e) {
System.out.println("ERROR opening ZipFile: " + zipPath);
e.printStackTrace();
}
}

You ignore what read() returns, what if the files is shorter than 500 bytes?
When you return false, you don't close the file.
When converting byte to char, you assume your file is 7-bit ASCII.

Are there any Java Frameworks for binary file parsing?

My problem is, that I want to parse binary files of different types with a generic parser which is implemented in JAVA. Maybe describing the file format with a configuration file which is read by the parser or creating Java classes which parse the files according to some sort of parsing rules.
I have searched quite a bit on the internet but found almost nothing on this topic.
What I have found are just things which deal with compiler-generators (Jay, Cojen, etc.) but I don't think that I can use them to generate something for parsing binary files. But I could be wrong on that assumption.
Are there any frameworks which deal especially with easy parsing of binary files or can anyone give me a hint how I could use parser/compiler-generators to do so?
Update:
I'm looking for something where I can write a config-file like
file:
header: FIXED("MAGIC")
body: content(10)
content:
value1: BYTE
value2: LONG
value3: STRING(10)
and it generates automatically something which parses files which start with "MAGIC", followed by ten times the content-package (which itself consists of a byte, a long and a 10-byte string).
Update2:
I found something comparable what I'm looking for, "Construct", but sadly this is a Python-Framework. Maybe this helps someone to get an idea, what I'm looking for.

Using Preon:
public class File {
#BoundString(match="MAGIC")
private String header;
#BoundList(size="10", type=Body.class)
private List<Body> body;
private static class Body {
#Bound
byte value1;
#Bound
long value2;
#BoundString(size="10")
String value3;
}
}
Decoding data:
Codec<File> codec = Codecs.create(File.class);
File file = codecs.decode(codec, buffer);
Let me know if you are running into problems.

give a try to preon

I have used DataInputStream for reading binary files and I write the rules in Java. ;) Binary files can have just about any format so there is no general rule for how to read them.
Frameworks don't always make things simpler. In your case, the description file is longer than the code to just read the data using a DataInputStream.
public static void parse(DataInput in) throws IOException {
// file:
// header: FIXED("MAGIC")
String header = readAsString(in, 5);
assert header.equals("MAGIC");
// body: content(10)
// ?? not sure what this means
// content:
for(int i=0;i<10;i++) {
// value1: BYTE
byte value1 = in.readByte();
// value2: LONG
long value2 = in.readLong();
// value3: STRING(10)
String value3 = readAsString(in, 10);
}
}
public static String readAsString(DataInput in, int len) throws IOException {
byte[] bytes = new byte[len];
in.readFully(bytes);
return new String(bytes);
}
If you want to have a configuration file you could use a Java Configuration File. http://www.google.co.uk/search?q=java+configuration+file

Google's Protocol Buffers

Parser combinator library is an option. JParsec works fine, however it could be slow.

I have been developing a framework for Java which allows to parse binary data https://github.com/raydac/java-binary-block-parser
in the case you should just describe structure of your binary file in pseudolanguage

You can parse binary files with parsers like JavaCC. Here you can find a simple example. Probably it's a bit more difficult than parsing text files.

Have you looking into the world of parsers. A good parser is yacc, and there may be a port of it for java.

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.