Are there any design patterns for high-performance file parsing? - java

I've recently developed my own file parsing class called BufferedParseStream, and used it to decode PNG images. I've been comparing its performance against the open source project PNGJ, and have seen that for smaller image sizes, PNGJ can be up to twice as fast as my own implementation. I assume this is due to the overhead of building on BufferedInputStream, as PNGJ rolls its own equivalent instead.
Are there any existing design patterns that guide high-performance parsing of files into primitives such as int, float, etc.?
public class BufferedParseStream extends BufferedInputStream {
private final ByteBuffer mByteBuffer;
public BufferedParseStream(final InputStream pInputStream, final int pBufferSize) {
super(pInputStream, pBufferSize);
/* Initialize the ByteBuffer. */
this.mByteBuffer = DataUtils.delegateNative(new byte[8]);
}
private void buffer(final int pNumBytes) throws IOException {
/* Read into the ByteBuffer; read() may return fewer bytes than requested, so loop. */
int lOffset = 0;
while (lOffset < pNumBytes) {
final int lRead = this.read(this.getByteBuffer().array(), lOffset, pNumBytes - lOffset);
if (lRead < 0) { throw new java.io.EOFException("Unexpected end of stream."); }
lOffset += lRead;
}
/* Reset the ByteBuffer position. */
this.getByteBuffer().position(0);
}
public final char parseChar() throws IOException {
/* Read two bytes (a Java char is two bytes). */
this.buffer(DataUtils.BYTES_PER_CHAR);
/* Return the corresponding character. */
return this.getByteBuffer().getChar();
}
public final int parseInt() throws IOException {
/* Read four bytes. */
this.buffer(DataUtils.BYTES_PER_INT);
/* Return the corresponding integer. */
return this.getByteBuffer().getInt();
}
public final long parseLong() throws IOException {
/* Read eight bytes. */
this.buffer(DataUtils.BYTES_PER_LONG);
/* Return the corresponding long. */
return this.getByteBuffer().getLong();
}
public final void setParseOrder(final ByteOrder pByteOrder) {
this.getByteBuffer().order(pByteOrder);
}
private final ByteBuffer getByteBuffer() {
return this.mByteBuffer;
}
}

Java NIO should be faster than using input streams. The class you present seems odd to me (might just be me though :)) because it adds an extra layer on top of ByteBuffer which I don't think is required.
You should use the ByteBuffer directly; it has getInt, getFloat, etc. methods which you can feed directly into the required variables.
I think, though, that your performance problem could be in the PNG decoder code, as someone else has already mentioned. You should post that for further analysis.
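For illustration, here is a minimal sketch of that direct approach with NIO: map the file into a ByteBuffer and pull primitives straight out of it. The file name and the PNG header fields read here are assumptions for the example, not part of the original class:

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class NioParseSketch {
    public static void main(String[] args) throws IOException {
        // Map the whole file read-only; for very large files you would map or read windows instead.
        try (FileChannel channel = FileChannel.open(Paths.get("image.png"), StandardOpenOption.READ)) {
            ByteBuffer buffer = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
            buffer.order(ByteOrder.BIG_ENDIAN); // PNG data is big-endian
            long signature = buffer.getLong();  // the 8-byte PNG signature
            int chunkLength = buffer.getInt();  // length of the first chunk's data
            int chunkType = buffer.getInt();    // the chunk type code (IHDR for a valid PNG)
            System.out.printf("sig=%x length=%d type=%x%n", signature, chunkLength, chunkType);
        }
    }
}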

Related

Creating device driver packets from Java

Now that I have some spare time on my hands, I decided to create a Java program to connect my XBee (i.e. zigbee) chips to my new SmartThings hub. I found a nice tutorial on doing this by creating the packets by hand (https://nzfalco.jimdofree.com/electronic-projects/xbee-to-smartthings/). My next task is to create a set of Java routines to create, send, receive, and access the required packets (i.e. a sequence of bytes).
Having done similar in C for other projects, my first thought was to simply create a class with the packet structure and send it. Something like this:
class DeviceAnnounce {
public byte frameId;
public byte addr64[];
public byte addr16[];
public byte capability;
};
Problem is there does not appear to be a way to cast this "structure" to an array of bytes to send to the device.
Next I thought, we have a serialization capability built into the Java runtime. So I added Serializable to the class and used the writeObject() method to convert the instance into a byte stream. The problem here is that writeObject() encodes not only your bytes but also a description of the object itself. That works great for reading and writing objects to disk, but it does not produce the packet I need to send to the XBee device.
I finally coded it the hard way, explicitly adding a method to my class that creates the byte array.
class DeviceAnnounce {
public DeviceAnnounce(byte frameId, byte[] addr64, byte[] addr16, byte capability) {
super();
this.frameId = frameId;
this.addr64 = addr64;
this.addr16 = addr16;
this.capability = capability;
}
public byte frameId;
public byte addr64[];
public byte addr16[];
public byte capability;
byte[] getBytes() throws IOException {
byte[] data=new byte[12];
data[0]=frameId;
data[1]=addr64[7];
data[2]=addr64[6];
data[3]=addr64[5];
data[4]=addr64[4];
data[5]=addr64[3];
data[6]=addr64[2];
data[7]=addr64[1];
data[8]=addr64[0];
data[9]=addr16[1];
data[10]=addr16[0];
data[11]=capability;
return data;
}
@Override
public String toString() {
return "DeviceAnnounce [frameId=" + frameId + ", addr64=" + HexUtils.prettyHexString(addr64) + ", addr16="
+ HexUtils.prettyHexString(addr16) + ", capability=" + capability + "]";
}
}
It works, but I keep thinking there must be a better way. Now the 64-dollar (or maybe 64-bit) question: is there a way to convert a POJO into a simple byte stream/array?
To build a block of bytes for transmitting, I recommend using the built-in ByteBuffer, which e.g. has helpers for 16-, 32-, and 64-bit integers in big- or little-endian.
You would then store the values in the primitive widths you actually use, e.g.
public byte frameId;
public long addr64;
public short addr16;
public byte capability;
byte[] getBytes() {
ByteBuffer buf = ByteBuffer.allocate(12)
.order(ByteOrder.BIG_ENDIAN/*Network Byte Order*/);
buf.put(frameId);
buf.putLong(addr64);
buf.putShort(addr16);
buf.put(capability);
return buf.array(); // or return the ByteBuffer itself
}
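For completeness, the same ByteBuffer helpers work in the receive direction too. A minimal sketch, assuming a DeviceAnnounce constructor that takes the four scalar fields above:

static DeviceAnnounce fromBytes(byte[] packet) {
    // wrap() creates a view over the received array; nothing is copied
    ByteBuffer buf = ByteBuffer.wrap(packet).order(ByteOrder.BIG_ENDIAN);
    byte frameId = buf.get();
    long addr64 = buf.getLong();
    short addr16 = buf.getShort();
    byte capability = buf.get();
    return new DeviceAnnounce(frameId, addr64, addr16, capability);
}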

Filter out all text above a certain font size from PDF

As the title says, I want to filter out all text from a PDF that is above a certain font size. Currently, I am using the PDFBox library but I am open to using any other free library for Java.
My approach was to use a PDFStreamParser to iterate through the tokens. When I pass a Tf operator that has a size greater than my threshold, don't add the next Tj/TJ that is seen. However, it has become clear to me that this relatively simple approach will not work because the text may be scaled by the current transformation matrix.
Is there a better approach I could be taking, or a way to make my approach work without it getting too complicated?
Your approach
When I pass a Tf operator that has a size greater than my threshold, don't add the next Tj/TJ that is seen.
is too simple.
On one hand, as you remark yourself,
the text may be scaled by the current transformation matrix.
(Actually not only by the transformation matrix but also by the text matrix!)
Thus, you have to keep track of these matrices.
On the other hand Tf doesn't only set the base font size for the next text drawing instruction seen, it sets it until the size is explicitly changed by some other instruction.
Furthermore, the text font size and the current transformation matrix are part of the graphics state; thus, they are subject to save state and restore state instructions.
To edit a content stream with respect to the current state, therefore, you have to keep track of a lot of information. Fortunately, PDFBox contains classes to do the heavy lifting here, the class hierarchy based on the PDFStreamEngine, allowing you to concentrate on your task. To have as much information as possible available for editing, the PDFGraphicsStreamEngine class appears to be a good choice to build upon.
A generic content stream editor class
Thus, let's derive PdfContentStreamEditor from PDFGraphicsStreamEngine and add some code for generating a replacement content stream.
public class PdfContentStreamEditor extends PDFGraphicsStreamEngine {
public PdfContentStreamEditor(PDDocument document, PDPage page) {
super(page);
this.document = document;
}
/**
* <p>
* This method retrieves the next operation before its registered
* listener is called. The default does nothing.
* </p>
* <p>
* Override this method to retrieve state information from before the
* operation execution.
* </p>
*/
protected void nextOperation(Operator operator, List<COSBase> operands) {
}
/**
* <p>
* This method writes content stream operations to the target canvas. The default
* implementation writes them as they come, so it essentially generates identical
copies of the original instructions {@link #processOperator(Operator, List)}
* forwards to it.
* </p>
* <p>
* Override this method to achieve some fancy editing effect.
* </p>
*/
protected void write(ContentStreamWriter contentStreamWriter, Operator operator, List<COSBase> operands) throws IOException {
contentStreamWriter.writeTokens(operands);
contentStreamWriter.writeToken(operator);
}
// stub implementation of PDFGraphicsStreamEngine abstract methods
@Override
public void appendRectangle(Point2D p0, Point2D p1, Point2D p2, Point2D p3) throws IOException { }
@Override
public void drawImage(PDImage pdImage) throws IOException { }
@Override
public void clip(int windingRule) throws IOException { }
@Override
public void moveTo(float x, float y) throws IOException { }
@Override
public void lineTo(float x, float y) throws IOException { }
@Override
public void curveTo(float x1, float y1, float x2, float y2, float x3, float y3) throws IOException { }
@Override
public Point2D getCurrentPoint() throws IOException { return null; }
@Override
public void closePath() throws IOException { }
@Override
public void endPath() throws IOException { }
@Override
public void strokePath() throws IOException { }
@Override
public void fillPath(int windingRule) throws IOException { }
@Override
public void fillAndStrokePath(int windingRule) throws IOException { }
@Override
public void shadingFill(COSName shadingName) throws IOException { }
// PDFStreamEngine overrides to allow editing
@Override
public void processPage(PDPage page) throws IOException {
PDStream stream = new PDStream(document);
replacement = new ContentStreamWriter(replacementStream = stream.createOutputStream(COSName.FLATE_DECODE));
super.processPage(page);
replacementStream.close();
page.setContents(stream);
replacement = null;
replacementStream = null;
}
@Override
public void showForm(PDFormXObject form) throws IOException {
// DON'T descend into XObjects
// super.showForm(form);
}
@Override
protected void processOperator(Operator operator, List<COSBase> operands) throws IOException {
nextOperation(operator, operands);
super.processOperator(operator, operands);
write(replacement, operator, operands);
}
final PDDocument document;
OutputStream replacementStream = null;
ContentStreamWriter replacement = null;
}
(PdfContentStreamEditor class)
This code overrides processPage to create a new page content stream and eventually replace the old one with it. And it overrides processOperator to provide the processed instruction for editing.
For editing, one simply overrides write. The default implementation writes the instructions as they come, while your override may change, replace, or drop the instructions to be written. Overriding nextOperation allows you to peek at the graphics state before the current instruction is applied to it.
Applying the editor as is,
PDDocument document = PDDocument.load(SOURCE);
for (PDPage page : document.getDocumentCatalog().getPages()) {
PdfContentStreamEditor identity = new PdfContentStreamEditor(document, page);
identity.processPage(page);
}
document.save(RESULT);
(EditPageContent test testIdentityInput)
therefore, will create a result PDF with equivalent content streams.
Customizing the content stream editor for your use case
You want to
filter out all text from a PDF that is above a certain font size.
Thus, we have to check in write whether the current instruction is a text drawing instruction, and if it is, we have to check the current effective font size, i.e. the base font size transformed by the text matrix and the current transformation matrix. If the effective font size is too large, we have to drop the instruction.
This can be done as follows:
PDDocument document = PDDocument.load(SOURCE);
for (PDPage page : document.getDocumentCatalog().getPages()) {
PdfContentStreamEditor identity = new PdfContentStreamEditor(document, page) {
@Override
protected void write(ContentStreamWriter contentStreamWriter, Operator operator, List<COSBase> operands) throws IOException {
String operatorString = operator.getName();
if (TEXT_SHOWING_OPERATORS.contains(operatorString))
{
float fs = getGraphicsState().getTextState().getFontSize();
Matrix matrix = getTextMatrix().multiply(getGraphicsState().getCurrentTransformationMatrix());
Point2D.Float transformedFsVector = matrix.transformPoint(0, fs);
Point2D.Float transformedOrigin = matrix.transformPoint(0, 0);
double transformedFs = transformedFsVector.distance(transformedOrigin);
if (transformedFs > 100)
return;
}
super.write(contentStreamWriter, operator, operands);
}
final List<String> TEXT_SHOWING_OPERATORS = Arrays.asList("Tj", "'", "\"", "TJ");
};
identity.processPage(page);
}
document.save(RESULT);
(EditPageContent test testRemoveBigTextDocument)
Strictly speaking, completely dropping the instruction in question may not suffice; instead, one would have to replace it with an instruction that changes the text matrix just as the dropped text drawing instruction would have done. Otherwise the following, not-dropped text may be moved. Often, though, this works as is, because the text matrix is newly set for the following, different text. So let's keep it simple here.
Constraints and remarks
This PdfContentStreamEditor only edits the page content stream. XObjects and Patterns used from there are currently not edited. It should be easy, though, after editing the page content stream, to recursively iterate over the XObjects and Patterns and edit them in a similar fashion.
This PdfContentStreamEditor essentially is a port of the PdfContentStreamEditor for iText 5 (.Net/Java) from this answer and the PdfCanvasEditor for iText 7 from this answer. The examples for using those editor classes may give some hints on how to use this PdfContentStreamEditor for PDFBox.
A similar (but less generic) approach has been used previously in the HelloSignManipulator class in this answer.
Fixing a bug
In the context of this question a bug in the PdfContentStreamEditor was found which caused some text lines in the example PDF in focus there to be moved.
The background: Some PDF instructions are defined via other ones, e.g. tx ty TD is specified to have the same effect as -ty TL tx ty Td. The corresponding PDFBox OperatorProcessor implementations for simplicity work by feeding the equivalent instructions back into the stream engine.
In such a case, the PdfContentStreamEditor as implemented above sees both the original instruction and its replacement instructions, and writes them all back into the result stream. Thus, the effect of those instructions is doubled; e.g. in the case of the TD instruction, the text insertion point is advanced two lines instead of one...
Thus, we have to ignore the replacement instructions. To do so, replace the method processOperator above with:
@Override
protected void processOperator(Operator operator, List<COSBase> operands) throws IOException {
if (inOperator) {
super.processOperator(operator, operands);
} else {
inOperator = true;
nextOperation(operator, operands);
super.processOperator(operator, operands);
write(replacement, operator, operands);
inOperator = false;
}
}
boolean inOperator = false;

How to refer to part of an array?

Given a byte[] object, when we want to operate on it we often need pieces of it. In my particular example I get a byte[] from the wire, where the first 4 bytes describe the length of the message, then another 4 bytes the type of the message (an integer that maps to a concrete protobuf class), and the remaining bytes are the actual content of the message... like this
length|type|content
In order to parse this message I have to pass the content part to the specific class which knows how to parse an instance from it... The problem is that often there are no methods provided that let you specify from where to where the parser shall read the array...
So what we end up doing is copying the remaining chunks of that array, which is not efficient...
As far as I know, in Java it is not possible to create another byte[] reference that actually refers to part of some original, bigger byte[] array with just 2 indexes (this was the approach taken with String, and it led to memory leaks)...
I wonder how we solve situations like this? I suppose giving up on protobuf just because it does not provide some parseFrom(byte[], int, int) does not make sense... protobuf is just an example, anything could lack that API...
So does this force us to write inefficient code, or is there something that can be done (apart from adding that method)?
Normally you would tackle this kind of thing with streams.
A stream is an abstraction for reading just what you need to process the current block of data. So you can read the correct number of bytes into a byte array and pass it to your parse function.
You ask 'So does this force us to write inefficient code, or is there something that can be done?'
Usually you get your data in the form of a stream and then using the technique demonstrated below will be more performant because you skip making one copy. (Two copies instead of three; once by the OS and once by you. You skip making a copy of the total byte array before you start parsing.) If you actually start out with a byte[] but it is constructed by yourself then you may want to change to constructing an object such as { int length, int type, byte[] contentBytes } instead and pass contentBytes to your parse function.
If you really, really have to start out with byte[] then the below technique is just a more convenient way to parse it, it would not be more performant.
So suppose you got a buffer of bytes from somewhere and you want to read the contents of that buffer. First you convert it to a stream:
private static List<Content> read(byte[] buffer) {
try {
ByteArrayInputStream bytesStream = new ByteArrayInputStream(buffer);
return read(bytesStream);
} catch (IOException e) {
e.printStackTrace();
// a return is required on this path for the method to compile
return Collections.emptyList();
}
}
The above function wraps the byte array with a stream and passes it to the function that does the actual reading.
If you can start out from a stream then obviously you can skip the above step and just pass that stream into the below function directly:
private static List<Content> read(InputStream bytesStream) throws IOException {
List<Content> results = new ArrayList<Content>();
try {
// read the content...
Content content1 = readContent(bytesStream);
results.add(content1);
// I don't know if there's more than one content block but assuming
// that there is, you can just continue reading the stream...
//
// If it's a fixed number of content blocks then just read them one
// after the other... Otherwise make this a loop
Content content2 = readContent(bytesStream);
results.add(content2);
} finally {
bytesStream.close();
}
return results;
}
Since your byte-array contains content you will want to read Content blocks from the stream. Since you have a length and a type field, I am assuming that you have different kinds of content blocks. The next function reads the length and type and passes the processing of the content bytes on to the proper class depending on the read type:
private static Content readContent(InputStream stream) throws IOException {
final int CONTENT_TYPE_A = 10;
final int CONTENT_TYPE_B = 11;
// wrap the InputStream in a DataInputStream because the latter has
// convenience functions to convert bytes to integers, etc.
// Note that DataInputStream handles the stream in a BigEndian way,
// so check that your bytes are in the same byte order. If not you'll
// have to find another stream reader that can convert to ints from
// LittleEndian byte order.
DataInputStream data = new DataInputStream(stream);
int length = data.readInt();
int type = data.readInt();
// I'm assuming that above length field was the number of bytes for the
// content. So, read length number of bytes into a buffer and pass that
// to your `parseFrom(byte[])` function
byte[] contentBytes = new byte[length];
// readFully() loops until the buffer is filled and throws EOFException if
// the stream ends early; a single read() call may return fewer bytes.
data.readFully(contentBytes);
switch (type) {
case CONTENT_TYPE_A:
return ContentTypeA.parseFrom(contentBytes);
case CONTENT_TYPE_B:
return ContentTypeB.parseFrom(contentBytes);
default:
throw new UnsupportedOperationException();
}
}
I have made up the below Content classes. I don't know what protobuf is but it can apparently convert from a byte array to an actual object with its parseFrom(byte[]) function, so take this as pseudocode:
class Content {
// common functionality
}
class ContentTypeA extends Content {
public static ContentTypeA parseFrom(byte[] contentBytes) {
return null; // do the actual parsing of a type A content
}
}
class ContentTypeB extends Content {
public static ContentTypeB parseFrom(byte[] contentBytes) {
return null; // do the actual parsing of a type B content
}
}
In Java, an array is not just a section of memory - it is an object that has some additional fields (at least, length). So you cannot create a reference to part of an array - you should either:
Use array-copy functions, or
Implement and use some algorithm that works on only part of the byte array.
The concern seems to be that there is no way to create a view over an array (e.g., an array equivalent of List#subList()). A workaround might be making your parsing methods take the reference to the entire array plus two indices (or an index and a length) that specify the sub-array the method should work on.
This would not prevent the methods from reading or modifying sections of the array they should not touch. Perhaps a ByteArrayView class could be made to add a little bit of safety if this is a concern:
public class ByteArrayView {
private final byte[] array;
private final int start;
private final int length;
public ByteArrayView(byte[] array, int start, int length) {
this.array = array;
this.start = start;
this.length = length;
}
public byte get(int index) { // returns a single byte, so the return type is byte, not byte[]
if (index < 0 || index >= length) {
throw new IndexOutOfBoundsException("index: " + index);
}
return array[start + index];
}
}
But if, on the other hand, performance is a concern, then a method call to get() for fetching each byte is probably undesirable.
The code is for illustration; it's not tested or anything.
EDIT
On a second reading of my own answer, I realized that I should point this out: reading through a ByteArrayView still copies each byte you read out of the original array -- just byte by byte rather than as a chunk. It would be inadequate for the OP's performance concerns.
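One more option worth noting: java.nio.ByteBuffer can already act as a bounded view over part of an array without copying, so if the parsing API accepts a ByteBuffer (or an InputStream) you get sub-array semantics for free. A minimal sketch of carving out the content section of the length|type|content layout described in the question:

import java.nio.ByteBuffer;

public class ContentViewSketch {
    static ByteBuffer contentView(byte[] wire) {
        ByteBuffer view = ByteBuffer.wrap(wire); // wraps the array; no copy is made
        int length = view.getInt();              // first 4 bytes: content length
        int type = view.getInt();                // next 4 bytes: message type (ignored here)
        ByteBuffer content = view.slice();       // view starting at the content bytes
        content.limit(length);                   // bounded: reading past the content fails
        return content;
    }
}

Whether this avoids the copy depends on the library: if it only offers parseFrom(byte[]), you still end up copying, which is exactly the missing-overload problem the question describes.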

Streaming URL Encoder

In my Java app, I'm looking for a streaming version of URLEncoder.encode(String s, String enc). I'd like to stream a large HTTP post request using the "application/x-www-form-urlencoded" content type. Does such a thing exist either in a library, or an open source project? Or is there an easy way to implement it?
This was an early attempt, but it is incorrect because it doesn't handle characters that UTF-8 encodes as more than one byte:
// Incorrect attempt at creating a URLEncoder OutputStream
private class URLEncoderOutputStream extends FilterOutputStream
{
public URLEncoderOutputStream(OutputStream out)
{
super(out);
}
@Override
public void write(int b) throws IOException
{
String s = new String(new byte[] { (byte)b });
String enc = URLEncoder.encode(s, "UTF-8");
out.write(enc.getBytes("UTF-8"));
}
}
The problem is that OutputStreams don't know anything about characters, only bytes. What you really want is a Writer, e.g.
public class URLEncodedWriter extends FilterWriter {
public URLEncodedWriter(Writer out) {
super(out);
}
@Override
public void write(int c) throws IOException {
// URLEncoder.encode() takes a String, so wrap the char.
// Note: encoding one char at a time still splits surrogate pairs.
out.write(URLEncoder.encode(String.valueOf((char) c), "UTF-8"));
}
// ... same idea for the 2 other write() methods
}
I think the answer is I shouldn't be trying to do this. According to the HTML Specification:
The content type "application/x-www-form-urlencoded" is inefficient for sending large quantities of binary data or text containing non-ASCII characters. The content type "multipart/form-data" should be used for submitting forms that contain files, non-ASCII data, and binary data.
Most servers will reject HTTP headers that exceed a certain length in any case.
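That said, if you do end up needing to stream a large request body, here is a rough sketch of avoiding in-memory buffering with java.net.HttpURLConnection and chunked transfer encoding. The URL, boundary, and payload source are made up, and a real multipart body would also need boundary delimiters and part headers around the payload:

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class StreamingUploadSketch {
    static int streamUpload(InputStream payload) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL("https://example.com/upload").openConnection();
        conn.setDoOutput(true);
        conn.setRequestMethod("POST");
        conn.setChunkedStreamingMode(0); // stream the body instead of buffering it all in memory
        conn.setRequestProperty("Content-Type", "multipart/form-data; boundary=----formBoundary");
        try (OutputStream out = conn.getOutputStream()) {
            payload.transferTo(out); // Java 9+; on older JVMs copy with a manual loop
        }
        return conn.getResponseCode();
    }
}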

How can I create constrained InputStream to read only part of the file?

I want to create an InputStream that is limited to a certain range of bytes in a file, e.g. to bytes from position 0 to 100, so that the client code sees EOF once the 100th byte is reached.
The read() method of InputStream reads a single byte at a time. You could write a subclass of InputStream that maintains an internal counter; each time read() is called, update the counter. If you have hit your maximum, do not allow any further reads (return -1 or something like that).
You will also need to ensure that the other read methods (such as read(byte[], int, int)) are unsupported (e.g. override them and just throw new UnsupportedOperationException()).
I don't know what your use case is, but as a bonus you may want to implement buffering as well.
As danben says, just decorate your stream and enforce the constraint:
public class ConstrainedInputStream extends InputStream {
private final InputStream decorated;
private long length;
public ConstrainedInputStream(InputStream decorated, long length) {
this.decorated = decorated;
this.length = length;
}
@Override public int read() throws IOException {
return (length-- <= 0) ? -1 : decorated.read();
}
// TODO: override other methods if you feel it's necessary
// optionally, extend FilterInputStream instead
}
Consider using http://guava-libraries.googlecode.com/svn/trunk/javadoc/com/google/common/io/LimitInputStream.html
If you only need 100 bytes, then simple is probably best: I'd read them into an array and wrap that in a ByteArrayInputStream. E.g.
int length = 100;
byte[] data = new byte[length];
InputStream in = ...; //your inputstream
DataInputStream din = new DataInputStream(in);
din.readFully(data);
ByteArrayInputStream first100Bytes = new ByteArrayInputStream(data);
// pass first100Bytes to your clients
If you don't want to use DataInputStream.readFully, there is IOUtils.readFully from Apache commons-io, or you can implement the read loop explicitly.
If you have more advanced needs, such as reading a segment from the middle of the file or larger amounts of data, then extending InputStream and overriding read(byte[], int, int) as well as read() will give you better performance than overriding the read() method alone.
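For instance, a bounded bulk read for the ConstrainedInputStream shown above might look like this (an untested sketch building on its decorated and length fields):

@Override
public int read(byte[] b, int off, int len) throws IOException {
    if (length <= 0) {
        return -1; // the allowed range is exhausted
    }
    int toRead = (int) Math.min(len, length); // never request more than the limit allows
    int n = decorated.read(b, off, toRead);
    if (n > 0) {
        length -= n; // account for the bytes actually consumed
    }
    return n;
}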
You can use guava's ByteStreams.
Notice that you should use skipFully() before limit, for example:
ByteStreams.skipFully(tmpStream, range.start());
tmpStream = ByteStreams.limit(tmpStream, range.length());
In addition to this solution, using the skip method of an InputStream, you can also read a range starting in the middle of the file.
public class RangeInputStream extends InputStream
{
private InputStream parent;
private long remaining;
public RangeInputStream(InputStream parent, long start, long end) throws IOException
{
if (end < start)
{
throw new IllegalArgumentException("end < start");
}
if (parent.skip(start) < start)
{
throw new IOException("Unable to skip leading bytes");
}
this.parent=parent;
remaining = end - start;
}
@Override
public int read() throws IOException
{
return --remaining >= 0 ? parent.read() : -1;
}
}
I solved a similar problem for my project; you can see the working code here: PartInputStream.
I used it for asset and file input streams, but it is not suitable for streams whose length is not available initially, such as network streams.
