Creating Java DataInputStream data in Python

I have a Java program that uses a DataInputStream for storing object data.
Example:
DataInputStream tInput = new DataInputStream(getClass().getResourceAsStream(aDirectory + "/ResultItemInfo.dat"));
this._text = tInput.readUTF();
this._image = tInput.readUTF();
this._audio = tInput.readUTF();
this._random = false;
if (tInput.read() == 1) {
    this._random = true;
}
this._hasMenu = false;
if (tInput.read() == 1) {
    this._hasMenu = true;
}
Nice, isn't it?
There is an existing dataset, and now I have to add some records. If the tool that I am required to make were written in Java too, this would be pretty easy. Unfortunately, it is written in Python, so I have to find a way to create, from Python, files that the Java application can read.
Is there any easy way to do this?
As a last resort, I could:
Modify the Java app and use XML. This would break compatibility with all existing data, so I really don't want to do it.
Use Jython. A possible solution, but I want pure CPython.
Reverse-engineer the data format. Not a good solution either.

For a string to be readUTF-able, it must contain a two-byte big-endian length followed by exactly that many bytes of UTF-8 encoded data.
So I suggest this piece of code to write a Unicode string data in a form that readUTF() can read:
encoded = data.encode('utf-8')
count = len(encoded)
msb, lsb = divmod(count, 256)  # split the length into two big-endian bytes
outfile.write(bytearray([msb, lsb]))  # bytearray writes correctly on both Python 2 and 3
outfile.write(encoded)
The outfile must be opened in binary mode (e.g. "wb").
This follows the description of the Java interface DataInput. Strictly speaking, readUTF expects Java's "modified UTF-8", which differs from standard UTF-8 only for NUL and characters outside the BMP, so for ordinary text this works. I did not try to run this, though.
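For reference, here is a sketch of the Java writer that produces exactly the layout the reader above expects (writeUTF itself emits the two-byte length followed by the string bytes); whatever the Python tool writes has to match this byte-for-byte:
DataOutputStream tOutput = new DataOutputStream(new FileOutputStream(aDirectory + "/ResultItemInfo.dat"));
tOutput.writeUTF(text);
tOutput.writeUTF(image);
tOutput.writeUTF(audio);
tOutput.write(random ? 1 : 0);   // one flag byte, matching tInput.read()
tOutput.write(hasMenu ? 1 : 0);
tOutput.close();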

Related

How to represent header value and actual message in Byte Array Java?

I need to make a byte array in which I will have header values initially and my actual message will come after the header values.
My header values will be: data_center, which is a string; client_id, which is an integer; pool_id, also an integer; and data_count, also an integer.
And my actual message, which will come after the header values, is: hello world
In my case, my header length may grow, so I need to keep it in a variable so that I can increase it later on as needed.
I am a little bit confused about how to use a byte array here. How can I represent this in a byte array in network byte order so that a C++ program can decode it properly on an Ubuntu 12.04 machine?
You can use Protocol Buffers to represent the messages (header and content). It handles the transformations between languages and different platforms, and it also provides room for further expansion and support for multiple message versions.
For your example you can define the message format like this (e.g. messageModel.proto):
package common;
option java_package = "my.java.package";
option java_outer_classname = "MessageProto";
message MyMessage {
    optional string dataCenter = 1 [default = "DEFAULT_DC"];
    optional int64 clientId = 2;
    optional int64 poolId = 3;
    optional int64 dataCount = 4;
    optional string body = 5;
}
Then compile it with protoc, like so:
protoc -I src/java/ --java_out=src/java/ messageModel.proto
This generates the transport objects and the utility classes to marshal them from one endpoint to another (even across different message versions). Please check the Java tutorial for more details.
To create a MyMessage from Java you will be able to do something like:
MessageProto.MyMessage.Builder mb = MessageProto.MyMessage.newBuilder();
mb.setDataCenter("aDC");
mb.setClientId(12);
mb.setPoolId(14);
mb.setDataCount(2);
mb.setBody("hello world");
MessageProto.MyMessage message = mb.build();
To transform the message into a byte array, you will use: message.toByteArray()
If C or C++ is your destination, you will need to generate (from the same model) the C++ builders and objects too. To decode the message on the Java side you would do something like:
MessageProto.MyMessage message = MessageProto.MyMessage.parseFrom(buffer);
Where buffer will represent the received content.
If this is only a homework assignment then you can serialize your header and body message using a DataOutputStream, but I would suggest investigating Protocol Buffers as well.
Try using a DataOutputStream that is targeted to a ByteArrayOutputStream. When you're done with writing the message to the DataOutputStream, you can obtain the constructed byte array from the ByteArrayOutputStream.
Like this:
ByteArrayOutputStream baos = new ByteArrayOutputStream();
DataOutputStream dos = new DataOutputStream(baos);
dos.writeInt(client_id);
dos.writeUTF(data_center);
// etc...
byte[] message = baos.toByteArray();
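Note that DataOutputStream writes multi-byte values big-endian, i.e. in network byte order, which is what the question asks for and what the C++ decoder should assume. For completeness, a sketch of reading the same message back in Java (the C++ side would mirror the same field order):
DataInputStream dis = new DataInputStream(new ByteArrayInputStream(message));
int clientId = dis.readInt();      // 4 bytes, big-endian / network order
String dataCenter = dis.readUTF(); // 2-byte length followed by the string bytes
// etc...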
Protocol Buffers are also a good option if you want more flexibility and higher performance. It depends on what you want out of this application: whether it needs high performance, and whether it is a one-off throwaway app or something you expect to grow and maintain over the long term. DataOutputStream and DataInputStream are simple to use and let you start right away; Protocol Buffers require a bit more of your time to learn.

Read special characters ( æ ø å ) with Java from Oracle database

I have a problem when reading special characters from an Oracle database (using the JDBC driver and GlassFish TopLink).
I store the name "GRØNLÅEN KJÆTIL" in the database through a web service and, in the database, the data is stored correctly.
But when I read this String, print it to the log file and convert it to a byte array with this code:
int pos = 0;
byte[] msg=new byte[1024];
String F = "F" + passenger.getName();
logger.debug("Add " + F + " " + F.length());
msg = addStringToArrayBytePlusSeparator(msg, F,pos);
..............
private byte[] addStringToArrayBytePlusSeparator(byte[] arrDest, String strToAdd, int destPosition)
{
    System.arraycopy(strToAdd.getBytes(Charset.forName("ISO-8859-1")), 0, arrDest, destPosition, strToAdd.getBytes().length);
    arrDest = addSeparator(arrDest, destPosition + strToAdd.getBytes().length, 1);
    return arrDest;
}
1) In the log file there is: "Add FGRÃNLÃ " (the name isn't correct, and the F.length() is not printed).
2) The code throw:
java.lang.ArrayIndexOutOfBoundsException
at java.lang.System.arraycopy(Native Method)
at it.edea.ebooking.business.chi.control.VingCardImpl.addStringToArrayBytePlusSeparator(Test.java:225).
Any solution?
Thanks
You're calling strToAdd.getBytes() without specifying the character encoding, within the System.arraycopy call - that will be using the system default encoding, which may well not be ISO-8859-1. You should be consistent in which encoding you use. Frankly I'd also suggest that you use UTF-8 rather than ISO-8859-1 if you have the choice, but that's a different matter.
Why are you dealing with byte arrays anyway at this point? Why not just use strings?
Also note that your addStringToArrayBytePlusSeparator method doesn't give any indication of how many bytes it's copied, which means the caller won't have any idea what to do with it afterwards. If you must use byte arrays like this, I'd suggest making addStringToArrayBytePlusSeparator return either the new "end of logical array" or the number of bytes copied. For example:
private static final Charset ISO_8859_1 = Charset.forName("ISO-8859-1");

/**
 * (Insert fuller description here.)
 * Returns the number of bytes written to the array.
 */
private static int addStringToArrayBytePlusSeparator(byte[] arrDest,
                                                     String strToAdd,
                                                     int destPosition)
{
    byte[] encodedText = strToAdd.getBytes(ISO_8859_1);
    // TODO: Verify that there's enough space in the array
    System.arraycopy(encodedText, 0, arrDest, destPosition, encodedText.length);
    return encodedText.length;
}
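At the call site the caller can then keep track of its own position, along the lines of:
int written = addStringToArrayBytePlusSeparator(msg, F, pos);
pos += written; // the next write starts here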
Encoding/decoding problems are hard. In every process step you have to do the correct encoding/decoding. So:
Familiarize yourself with the difference between bytes (InputStream) and characters (Readers, Strings).
Choose the character encoding in which you want to store your data in the database, and the character encoding in which you want to expose your web service. Make sure that when you load initial data into the database, it is in the right encoding.
Connect with the right database properties. MySQL requires an addition to the connection URL, ?useUnicode=true&characterEncoding=UTF-8, when using UTF-8; I don't know about Oracle.
If you print/debug at a certain step and it looks OK, you can't be sure you did it right. The logger can write with the wrong encoding (sometimes making something look OK while in fact it's broken). Your terminal might not handle strange byte encodings correctly. The same holds for command-line database clients. Your data might be stored wrongly, but your misconfigured terminal interprets/shows the data as correct.
In XML, it's not only the stream encoding that matters, but also the xml-encoding attribute.
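As a minimal illustration of why the encoding must match on both sides (this snippet is not from the question; it just reproduces the symptom), the same bytes decode correctly only with the charset they were encoded with:
byte[] bytes = "GRØNLÅEN KJÆTIL".getBytes(StandardCharsets.UTF_8);
String wrong = new String(bytes, StandardCharsets.ISO_8859_1); // mojibake starting "GRÃ...", as in the log above
String right = new String(bytes, StandardCharsets.UTF_8);      // round-trips correctly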

What is the proper way to write/read a file with different IO streams

I have a file that contains bytes, chars, and an object, all of which need to be written then read. What would be the best way to utilize Java's different IO streams for writing and reading these data types? More specifically, is there a proper way to add delimiters and recognize those delimiters, then triggering what stream should be used? I believe I need some clarification on using multiple streams in the same file, something I have never studied before. A thorough explanation would be a sufficient answer. Thanks!
As EJP already suggested, use ObjectOutputStream and ObjectInputStream and wrap your other elements as an object (or objects). I'm giving this as an answer so I can show an example (it's hard to do in a comment). EJP - if you want to embed it in your answer, please do and I'll delete this one.
class MyWrappedData implements Serializable {
    private String string1;
    private String string2;
    private char char1;
    // constructors
    // getters, setters
}
Write to file:
ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(fileName));
out.writeObject(myWrappedDataInstance);
out.flush();
out.close();
Read from file
ObjectInputStream in = new ObjectInputStream(new FileInputStream(fileName));
Object obj = in.readObject();
MyWrappedData wrapped = null;
if (obj instanceof MyWrappedData)   // instanceof is false for null anyway
    wrapped = (MyWrappedData) obj;
// get the specific elements from the wrapped object
See a very clear example here: Read and Write
Redesign the file. There is no sensible way of implementing it as presently designed. For example the object presupposes an ObjectOutputStream, which has a header - where's that going to go? And how are you going to know where to switch from bytes to chars?
I would probably use an ObjectOutputStream for the whole thing and write everything as objects. Then Serialization solves all those problems for you. After all you don't actually care what's in the file, only how to read and write it.
Can you change the structure of the file? It is unclear, because the first sentence of your question contradicts being able to add delimiters. If you can change the file structure, you could output the different data types into separate files. I would consider this the 'proper' way to delineate the data streams.
If you are stuck with the file the way it is, then you will need to write an interface to the file's structure, which in practice is a shopping list of read operations and a lot of exception handling. It's a hackish way to program, because it will require a hex editor and a lot of trial and error, but it works in certain cases.
Why not write the file as XML, possibly with a nice simple library like XStream? If you are concerned about space, wrap it in gzip compression.
If you have control over the file format, and it's not an exceptionally large file (i.e. < 1 GiB), have you thought about using Google's Protocol Buffers?
They generate code that parses (and serializes) file/byte[] content. Protocol buffers use a tagging approach on every value that includes (1) a field number and (2) a type, so they have nice properties such as forward/backward compatibility with optional fields, etc. They are fairly well optimized for both speed and file size, adding only ~2 bytes of overhead for a short byte[], with ~2-4 additional bytes to encode the length on larger byte[] fields (VarInt encoded lengths).
This could be overkill, but if you have a bunch of different fields & types, protobuf is really helpful. See: http://code.google.com/p/protobuf/.
An alternative is Thrift by Facebook, with support for a few more languages, although it possibly sees less use in the wild, last I checked.
If the structure of your file is not fixed, consider using a wrapper per type. First you need to create the interface of your wrapper classes…
interface MyWrapper extends Serializable {
    void accept(MyWrapperVisitor visitor);
}
Then you create the MyWrapperVisitor interface…
interface MyWrapperVisitor {
    void visit(MyString wrapper);
    void visit(MyChar wrapper);
    void visit(MyLong wrapper);
    void visit(MyCustomObject wrapper);
}
Then you create your wrapper classes…
class MyString implements MyWrapper {
    public final String value;

    public MyString(String value) {
        super();
        this.value = value;
    }

    @Override
    public void accept(MyWrapperVisitor visitor) {
        visitor.visit(this);
    }
}
.
.
.
And finally you read your objects…
final InputStream in = new FileInputStream(myfile);
final ObjectInputStream objIn = new ObjectInputStream(in);
final MyWrapperVisitor visitor = new MyWrapperVisitor() {
    @Override
    public void visit(MyString wrapper) {
        //your logic here
    }
    .
    .
    .
};
//loop over all your objects here
final MyWrapper wrapper = (MyWrapper) objIn.readObject();
wrapper.accept(visitor);
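The writing side is then symmetric; assuming MyChar and MyLong are implemented analogously to MyString above, a sketch:
final ObjectOutputStream objOut = new ObjectOutputStream(new FileOutputStream(myfile));
objOut.writeObject(new MyString("some text"));
objOut.writeObject(new MyChar('x'));
objOut.writeObject(new MyLong(42L));
objOut.close();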

How to check whether the file is binary?

I wrote the following method to see whether a particular file contains ASCII text characters only, or control characters in addition to that. Could you glance at this code, suggest improvements and point out oversights?
The logic is as follows: "If the first 500 bytes of a file contain 5 or more control characters, report it as a binary file."
Thank you.
public boolean isAsciiText(String fileName) throws IOException {
    InputStream in = new FileInputStream(fileName);
    byte[] bytes = new byte[500];
    in.read(bytes, 0, bytes.length);
    int x = 0;
    short bin = 0;
    for (byte thisByte : bytes) {
        char it = (char) thisByte;
        if (!Character.isWhitespace(it) && Character.isISOControl(it)) {
            bin++;
        }
        if (bin >= 5) {
            return false;
        }
        x++;
    }
    in.close();
    return true;
}
Since you call this method "isAsciiText", you know exactly what you're looking for. In other words, it's not "isTextInCurrentLocaleEncoding". Thus you can be more accurate with:
if (thisByte < 32 || thisByte > 127) bin++;
edit, a long time later — it's pointed out in a comment that this simple check would be tripped up by a text file that started with a lot of newlines. It'd probably be better to use a table of "ok" bytes, and include printable characters (including carriage return, newline, and tab, and possibly form feed though I don't think many modern documents use those), and then check the table.
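A sketch of that table-based approach (exactly which bytes count as "ok" is an assumption here; adjust to taste):
private static final boolean[] OK = new boolean[256];
static {
    for (int b = 32; b <= 126; b++) OK[b] = true; // printable ASCII
    OK['\t'] = OK['\n'] = OK['\r'] = true;        // tab, newline, carriage return
    OK[12] = true;                                // form feed, if you want it
}

static boolean isOkByte(byte b) {
    return OK[b & 0xFF]; // mask to index with the unsigned value
}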
x doesn't appear to do anything.
What if the file is less than 500 bytes?
Some binary files have a situation where you can have a header for the first N bytes of the file which contains some data that is useful for an application but that the library the binary is for doesn't care about. You could easily have 500+ bytes of ASCII in a preamble like this followed by binary data in the following gigabyte.
Should handle exception if the file can't be opened or read, etc.
Fails badly if file size is less than 500 bytes
The line char it = (char) thisByte; is conceptually dubious: it mixes the byte and char concepts, i.e. it implicitly assumes that the encoding is one byte = one character, which excludes the Unicode encodings. In particular, it fails if the file is UTF-16 encoded.
The return inside the loop (slightly bad practice IMO) forgets to close the file.
The first thing I noticed - unrelated to your actual question, but you should be closing your input stream in a finally block to ensure it's always done. Usually this merely handles exceptions, but in your case you won't even close the streams of files when returning false.
Aside from that, why the comparison to ISO control characters? That's not a "binary" file, that's a "file that contains 5 or more control characters". A better way to approach the situation, in my opinion, would be to invert the check: write an isAsciiText function which asserts that all the characters in the file (or in the first 500 bytes, if you so wish) are in a set of bytes that are known good.
Theoretically, only checking the first few hundred bytes of a file could get you into trouble if it was a composite file of sorts (e.g. text with embedded pictures), but in practice I suspect every such file will have binary header data at the start so you're probably OK.
This would not work with the JDK install packages for Linux or Solaris; they have a shell-script start followed by a binary data blob.
Why not check the MIME type using some library like jMimeMagic (http://sourceforge.net/projects/jmimemagic/) and decide, based on the MIME type, how to handle the file?
One could parse and compare against a list of known binary file header bytes, like the one provided here.
The problem is that one needs a sorted list of binary-only headers, and the list might not be complete at all; for example, consider reading and parsing the binary files contained in some Equinox framework jar. If one needs to identify specific file types, though, this should work.
If you're on Linux, for existing files on the disk, native file command execution should work well:
String command = "file -i [ZIP FILE...]";
Process process = Runtime.getRuntime().exec(command);
...
It will output information on the files:
...: application/zip; charset=binary
which you can further filter with grep, or in Java, depending on whether you simply need an estimate of the files' binary character or you need to find out their MIME types.
Unfortunately this doesn't work when parsing InputStreams, such as the content of nested files inside archives, unless you resort to shell-only programs like unzip, assuming you want to avoid creating temporary unzipped files.
For that case, a rough estimate based on examining the first 500 bytes has worked out OK for me so far, as was hinted at in the answers above; instead of Character.isWhitespace/isISOControl(char), I used Character.isIdentifierIgnorable(codePoint), assuming the UTF-8 default encoding:
private static boolean isBinaryFileHeader(byte[] headerBytes) {
    return new String(headerBytes).codePoints().filter(Character::isIdentifierIgnorable).count() >= 5;
}
public void printNestedZipContent(String zipPath) {
    try (ZipFile zipFile = new ZipFile(zipPath)) {
        int zipHeaderBytesLen = 500;
        zipFile.entries().asIterator().forEachRemaining(entry -> {
            String entryName = entry.getName();
            if (entry.isDirectory()) {
                System.out.println("FOLDER_NAME: " + entryName);
                return;
            }
            // Get content bytes from ZipFile for ZipEntry
            try (InputStream zipEntryStream = new BufferedInputStream(zipFile.getInputStream(entry))) {
                // Read and store the header bytes (may be fewer than 500 for small entries)
                byte[] headerBytes = zipEntryStream.readNBytes(zipHeaderBytesLen);
                // Skip entry if it is a nested binary file
                if (isBinaryFileHeader(headerBytes)) {
                    return;
                }
                // Continue reading the stream's bytes, if non-binary
                byte[] zipContentBytes = zipEntryStream.readAllBytes();
                // Join the already-read header bytes and the rest of the content, header first
                byte[] joinedZipEntryContent = Arrays.copyOf(headerBytes, headerBytes.length + zipContentBytes.length);
                System.arraycopy(zipContentBytes, 0, joinedZipEntryContent, headerBytes.length, zipContentBytes.length);
                // Output (default/UTF-8) encoded text file content
                System.out.println(new String(joinedZipEntryContent));
            } catch (IOException e) {
                System.out.println("ERROR getting ZipEntry content: " + entryName);
            }
        });
    } catch (IOException e) {
        System.out.println("ERROR opening ZipFile: " + zipPath);
        e.printStackTrace();
    }
}
You ignore what read() returns; what if the file is shorter than 500 bytes?
When you return false, you don't close the file.
When converting byte to char, you assume your file is 7-bit ASCII.
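Putting those three fixes together, a minimal sketch of a corrected version (names kept from the question; the byte ranges treated as "text" are an assumption):
public boolean isAsciiText(String fileName) throws IOException {
    try (InputStream in = new FileInputStream(fileName)) { // closed on every path
        byte[] bytes = new byte[500];
        int n = in.read(bytes, 0, bytes.length); // may be fewer than 500, or -1 for an empty file
        int bin = 0;
        for (int i = 0; i < n; i++) {
            int b = bytes[i] & 0xFF; // unsigned value, no byte-to-char conversion
            boolean ok = (b >= 32 && b <= 126) || b == '\t' || b == '\n' || b == '\r';
            if (!ok && ++bin >= 5) {
                return false;
            }
        }
        return true;
    }
}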

Java/ImageIO Validate format before reading the entire file?

I'm developing a Web application that will let users upload images.
My concern is the files' size, especially if they are in invalid formats.
I'm wondering if there's a way in Java (or a third-party library) to check for the allowed file formats (jpg, gif and png) before reading the entire file.
If you wish to support only a few types of images, you can start (up)loading the image and at some point use the first few bytes to check whether you wish to continue the upload.
Quite a lot of image formats can be recognized by their first few bytes, the magic number. If the number matches you still don't know whether the file is valid, of course, but you can use it to check that the extension and the magic number correspond at all.
Have a look at this page to check out some Java code which checks MIME types. Do read the docs or source to check whether any given method requires the entire file or can operate on the first few bytes. I've not used those libraries :)
Also check out this page, which lists some Java libraries and some papers on which detection is based.
Don't forget to put in some feedback if you managed to find something you like!
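For the three formats the question allows, a minimal sketch of such a magic-number check might look like this (the signatures are the published ones for PNG, JPEG and GIF; the class and method names are made up):
import java.io.IOException;
import java.io.InputStream;

public final class ImageMagic {
    private static final byte[] PNG = {(byte) 0x89, 'P', 'N', 'G', 0x0D, 0x0A, 0x1A, 0x0A};
    private static final byte[] JPG = {(byte) 0xFF, (byte) 0xD8, (byte) 0xFF};
    private static final byte[] GIF = {'G', 'I', 'F', '8'}; // covers GIF87a and GIF89a

    /** Tests only the first few bytes of the stream against the known signatures. */
    public static boolean looksLikeAllowedImage(InputStream in) throws IOException {
        byte[] head = new byte[8];
        int n = in.readNBytes(head, 0, head.length); // Java 9+; loop over read() on older JDKs
        return startsWith(head, n, PNG) || startsWith(head, n, JPG) || startsWith(head, n, GIF);
    }

    private static boolean startsWith(byte[] head, int len, byte[] magic) {
        if (len < magic.length) return false;
        for (int i = 0; i < magic.length; i++) {
            if (head[i] != magic[i]) return false;
        }
        return true;
    }
}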
You don't need 3rd party libraries. The code you have to write is simple.
At the point you are handling your uploads, filter the files by their extension. This isn't perfect, but will account for most of the cases.
However, this would mean the files are already uploaded to the server. You can use a bit of JavaScript on the client side to perform the same check first: test whether the value of the file-upload component ends with an allowed file type (.jpg, .png, etc.).
function extensionsOkay(fval) {
    // Allowed extensions; no other customization needed.
    var extension = [".png", ".gif", ".jpg", ".jpeg", ".bmp"];
    var thisext = fval.substr(fval.lastIndexOf('.')).toLowerCase();
    for (var i = 0; i < extension.length; i++) {
        if (thisext == extension[i]) {
            $('#support-documents').hide();
            return true;
        }
    }
    // Show client-side error message
    $('#span.failed').show();
    return false;
}
