I know the Java libraries pretty well, so I was surprised when I realized that, apparently, there's no easy way to do something seemingly simple with a stream. I'm trying to read an HTTP request containing multipart form data (large, multiline tokens separated by delimiters that look like, for example, ------WebKitFormBoundary5GlahTkFmhDfanAn--), and I want to read until I encounter a part of the request with a given name, and then return an InputStream of that part.
I'm fine with just reading the stream into memory and returning a ByteArrayInputStream, because the files submitted should never be larger than 1MB. However, I want to make sure that the reading method throws an exception if the file is larger than 1MB, so that excessively-large files don't fill up the JVM's memory and crash the server. The file data may be binary, so that rules out BufferedReader.readLine() (it drops newlines, which could be any of \r, \n, or \r\n, resulting in loss of data).
All of the obvious tokenizing solutions, such as Scanner, read the tokens as Strings, not streams, which could cause OutOfMemoryErrors for large files--exactly what I'm trying to avoid. As far as I can tell, there's no equivalent to Scanner that returns each token as an InputStream without reading it into memory. Is there something I'm missing, or is there any way to create something like that myself, using just the standard Java libraries (no Apache Commons, etc.), that doesn't require me to read the stream a character at a time and write all of the token-scanning code myself?
Addendum: Shortly before posting this, I realized that the obvious solution to my original problem was simply to read the full request body into memory, failing if it's too large, and then to tokenize the resulting ByteArrayInputStream with a Scanner. This is inefficient, but it works. However, I'm still interested to know if there's a way to tokenize an InputStream into sub-streams, without reading them into memory, without using extra libraries, and without resorting to character-by-character processing.
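For what it's worth, here's a minimal sketch of that fallback (cap the read at 1 MB, fail if the body is larger, then tokenize the in-memory copy); the chunk size and exception message are my own assumptions:

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

public class CappedBodyReader {
    private static final int MAX_BODY_BYTES = 1024 * 1024; // the 1 MB limit from the question

    // Reads the whole request body into memory, failing fast if it exceeds the cap.
    static byte[] readCapped(InputStream in) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int read;
        while ((read = in.read(chunk)) != -1) {
            if (buffer.size() + read > MAX_BODY_BYTES) {
                throw new IOException("Request body exceeds " + MAX_BODY_BYTES + " bytes");
            }
            buffer.write(chunk, 0, read);
        }
        return buffer.toByteArray();
    }

    // Usage sketch: the byte[] can then be wrapped in a ByteArrayInputStream and handed to a
    // Scanner, e.g. new Scanner(new ByteArrayInputStream(body), "ISO-8859-1")
    //                      .useDelimiter(Pattern.quote(boundary))
    // (ISO-8859-1 maps bytes to chars one-to-one, and the boundary must be regex-quoted).
}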
It's not possible without loading the parts into memory (the solution you don't want) or saving them to disk (which becomes I/O-heavy). Tokenizing the stream into separate streams without loading it into memory implies that you can read the stream once (to tokenize it) and then read the same data again later. In short, what you want is impossible unless your stream is seekable, and seekable streams are generally specialized streams for very specific applications, or specialized I/O objects like RandomAccessFile.
Related
Good afternoon everyone,
First of all, this is mostly for personal purposes: it's meant for little projects to improve my Java knowledge. My idea is to build this kind of thing in order to better understand how developers work with sockets and bytes, since I'd really like that understanding for my future ideas.
I'm currently writing a lightweight HTTP server in Java to understand how it works, and I've been reading documentation but still have some difficulty understanding parts of the official docs. The main problem I'm facing (and I'd like to know whether it's related) is that the Content-Length reports more data than the amount I actually get from the BufferedReader. I don't know if the issue is the way bytes are decoded into chars by the BufferedReader, so that it ends up with less data; if so, what I probably have to do is treat that part as binary and read the bytes from the InputStream directly. But that brings me to the real problem I'm facing.
A Reader reads a certain number of bytes ahead and keeps them as its buffer, which means data from the InputStream has already been consumed by the Reader and is no longer available on the stream, so calling read() on the stream afterwards can return -1 because there are no more bytes to read. A multipart body is divided into multiple parts separated by a boundary, with a blank line delimiting each part's headers from its content. I still need the header information as a String to process it, but the content should be kept as binary data; and without changing the buffer length (which would require knowing the exact length of just the headers), the most probable result is that the content gets pulled into the BufferedReader's buffer. Is it possible to recover it from the data the BufferedReader has already consumed, or should I find a way to read that content as binary without it being processed at all?
As I said, I'm new to working with sockets and services, so I can't rule out that this is a different kind of issue entirely. Any help would be appreciated; thank you in advance.
Answer from Remy Lebeau, found in the comments, which turned out to be useful for me:
Since multipart data is both textual and binary, you are going to have to do your own buffering of the socket data so you have more control and know where the data switches back and forth. At the very least, since you can read binary data directly from a BufferedInputStream, and access its internal buffer, you can let it handle the actual buffering for you, and it is not difficult to write a custom readLine() method that can read a line of text from a BufferedInputStream without using BufferedReader.
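For illustration, a rough sketch of the kind of helper that comment describes (this is my own sketch, not Remy Lebeau's code; it assumes header and boundary lines are LF- or CRLF-terminated and single-byte encoded):

import java.io.BufferedInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

final class MultipartLines {

    // Reads one header/boundary line from a binary stream, treating LF or CRLF as the
    // terminator and leaving all remaining bytes untouched for later binary reads.
    // Returns null at end of stream.
    static String readLine(BufferedInputStream in) throws IOException {
        ByteArrayOutputStream line = new ByteArrayOutputStream();
        int b;
        while ((b = in.read()) != -1) {
            if (b == '\n') {          // end of line
                break;
            }
            if (b != '\r') {          // drop the CR of a CRLF pair
                line.write(b);
            }
        }
        if (b == -1 && line.size() == 0) {
            return null;              // nothing left in the stream
        }
        return new String(line.toByteArray(), StandardCharsets.ISO_8859_1);
    }
}

The binary content of each part can then be read from the same BufferedInputStream with read(byte[]), so nothing gets swallowed by a Reader.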
Currently I'm working with an SSH client API that provides me stdout and stderr as InputStreams. I have to read all the data from these streams on the client side and provide an API for implementors to work with the data the way they want (just drop it, write it to a DB, process it, etc.). First I tried to keep all the data read in byte arrays, but with huge amounts of data (which can happen sometimes) this can cause serious memory problems. But I don't want to write the data of every call to files if that isn't really necessary.
Does anyone know of a solution that reads data into memory until it reaches a limit (like 1 MB), then writes the buffered data from memory to a file and appends all the remaining data from the InputStream to the same file?
Commons IO has a workable solution: DeferredFileOutputStream.
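A minimal sketch of how that might look, assuming Commons IO is on the classpath; the 1 MB threshold and temp-file name are just placeholders:

import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import org.apache.commons.io.IOUtils;
import org.apache.commons.io.output.DeferredFileOutputStream;

class StreamSpooler {

    // Copies a stream into memory, spilling to a temp file once it grows past 1 MB.
    static DeferredFileOutputStream spool(InputStream in) throws IOException {
        File overflow = File.createTempFile("ssh-stream", ".tmp");
        DeferredFileOutputStream out =
                new DeferredFileOutputStream(1024 * 1024, overflow); // 1 MB threshold
        IOUtils.copy(in, out);
        out.close();
        // out.isInMemory() tells you which case you hit; out.getData() returns the
        // in-memory bytes, and out.getFile() the overflow file otherwise.
        return out;
    }
}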
Can you avoid reading the stream until you know what you are going to do with it?
If you use this approach, the consumers can simply drop the data, read portions and write them to a database as they go, or read and process the data as they read it.
This way you would not need to read more than 1 MB (or less) at any one time.
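For example, a plain fixed-size read loop never holds more than one buffer in memory at a time; the callback interface here is just a stand-in for whatever the implementors want to do with the data:

import java.io.IOException;
import java.io.InputStream;

class ChunkedReader {

    interface ChunkHandler {                       // hypothetical consumer callback
        void handle(byte[] data, int length) throws IOException;
    }

    static void forEachChunk(InputStream in, ChunkHandler handler) throws IOException {
        byte[] buffer = new byte[64 * 1024];       // only this much is ever held at once
        int read;
        while ((read = in.read(buffer)) != -1) {
            handler.handle(buffer, read);          // drop it, write it to a DB, process it...
        }
    }
}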
So, here is the situation:
I have to read big .gz archives (GBs) and "index" them, in a sense, so that later on I can retrieve specific pieces using random access.
In other words, I wish to read the archive line by line and be able to get the specific location in the file for any such line (so that I can jump directly to these specific locations on request). (PS: ...and it's UTF-8, so we cannot assume 1 byte == 1 char.)
So, basically, what I just need is a BufferedReader which keeps track of its location in the file. However, this doesn't seem to exist.
Is there anything available or do I have to roll my own?
A few additional comments:
I cannot use BufferedReader directly, since the file position it can give me corresponds to what has been buffered so far; in other words, a multiple of the internal buffer size rather than the position of the current line.
I cannot use InputStreamReader directly for performance reasons. Unbuffered would be way too slow, and, by the way, it lacks convenience methods for reading lines.
I cannot use RandomAccessFile since 1. it's zipped, and 2. RandomAccessFile uses "modified" UTF-8
I guess the best option would be to use some kind of buffered reader that keeps track of the file location and buffer offset... but this sounds quite cumbersome. Then again, maybe I've missed something; perhaps there is already something that can read files line by line and keep track of the location (even if zipped).
Thanks for the tips,
Arnaud
I think jzran could be pretty much what you're looking for:
It's a Java library based on the zran.c sample from zlib. You can preprocess a large gzip archive, producing an "index" that can be used for random read access. You can balance between index size and access speed.
What you are looking for is called mark(), markSupported() and skip().
These methods are declared in both InputStream and Reader, so you are welcome to use them.
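A short sketch of what those calls look like on a BufferedInputStream (which does support mark/reset); the read-ahead limit and offsets are arbitrary:

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;

class MarkSkipDemo {
    public static void main(String[] args) throws IOException {
        try (BufferedInputStream in =
                     new BufferedInputStream(new FileInputStream(args[0]))) {
            if (in.markSupported()) {
                in.mark(8192);        // remember this position (valid for up to 8 KB of reads)
                in.skip(100);         // jump ahead (check the return value in real code)
                int peek = in.read(); // look at a byte further on
                in.reset();           // ...and come back to the marked position
                System.out.println("byte at offset 100: " + peek);
            }
        }
    }
}

Note that this only repositions within the reader's buffered data, not within the compressed file itself.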
GZIP compression does not support seeking. Previous data blocks are needed to build compression tables...
We have a process that outputs the contents of a large XML file to System.out.
When this output is pretty printed (i.e., multiple lines) everything works. But when it's all on one line, Eclipse crashes with an OutOfMemoryError. Any ideas how to prevent this?
Sounds like it is the Console panel blowing up. Consider limiting its buffer size.
EDIT: It's in Preferences. Search for Console.
How do you print it on one line?
using several System.out.print(String s)
using System.out.println(String verybigstring)
In the second case, you need a lot more memory...
If you want more memory for Eclipse, you could try increasing it by changing the -Xmx value in eclipse.ini.
I'm going to assume that you're building an org.w3c.dom.Document, and writing it using a serializer. If you're hand-building an XML string, you're all but guaranteed to be producing something that's almost-but-not-quite XML, and I strongly suggest fixing that first.
That said, if you're writing to a stream from the serializer (and System.out is a stream), then you should be writing directly to the stream rather than writing to a string and printing that (which is what you'd be doing with a StringWriter). The reason for this is that the XML serializer will properly handle character encodings, while going from serializer to String to stream may not.
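For instance, with the JDK's own transformer, serializing the Document straight to System.out looks roughly like this (a sketch, assuming you already have a Document in hand):

import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;

class DomToStream {

    // Writes the DOM directly to the stream; the serializer handles the character encoding.
    static void write(Document doc) throws Exception {
        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
        t.transform(new DOMSource(doc), new StreamResult(System.out));
    }
}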
If you're not currently building a DOM, and are concerned about the memory requirements of doing so, then I suggest looking at the Practical XML library (which I maintain), in particular the builder package. It uses lightweight nodes that are then output via a serializer using a SAX transform.
Edit in response to comment:
OK, you've got the serializer covered with XStream. I'm next going to assume that you are calling XStream.toXML(Object) to produce the string, and recommend that you call the variant toXML(Object, OutputStream), and pass it the actual output. The reason for this is that XML is very sensitive to character encoding, which is something that often breaks when converting strings to streams.
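A hedged sketch of the difference, assuming a plain XStream instance and an already-open output stream for the request body:

import java.io.OutputStream;
import com.thoughtworks.xstream.XStream;

class PostBody {

    static void writeDirect(XStream xstream, Object payload, OutputStream out) {
        // Instead of: String xml = xstream.toXML(payload); followed by writing xml.getBytes(),
        // let XStream write to the stream itself so it controls the character encoding:
        xstream.toXML(payload, out);
    }
}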
This may, of course, cause issues with building your POST request, particularly if you're using a library that doesn't provide you an OutputStream.
What I want to accomplish is a tool that filters my files, replacing occurrences of strings in the format ${some.property} with a value taken from a properties file (just like Maven's or Ant's file filtering feature).
My first approach was to use the Ant API (copy task) or the Maven Filtering component, but both pull in many unnecessary dependencies, and my program should be lightweight. After that, I searched a little in Apache Commons but haven't found anything yet.
Is there an efficient (and elegant) solution to my problem?
The most efficient solution is using a templating engine. There are a few widely used engines that come in a single jar:
freemarker
apache velocity
stringtemplate (from antlr)
If this is configuration related, I would recommend Apache Commons Configuration. It will do variable replacement on the fly.
It has other nice features, like handling XML, properties, and Apple's pList formats.
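A small sketch of the variable replacement, assuming the Commons Configuration 1.x API and a properties file whose values reference other keys:

import org.apache.commons.configuration.ConfigurationException;
import org.apache.commons.configuration.PropertiesConfiguration;

class InterpolationDemo {
    public static void main(String[] args) throws ConfigurationException {
        // app.properties might contain:
        //   app.home = /opt/myapp
        //   app.logs = ${app.home}/logs
        PropertiesConfiguration config = new PropertiesConfiguration("app.properties");
        System.out.println(config.getString("app.logs")); // prints /opt/myapp/logs
    }
}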
The fastest and least encumbered way to do this will be to write your own. It shouldn't be that tough - probably take a couple of hours to write the tests and put the code together.
A suggested algorithm:
Start by loading the properties file into a Properties object.
Take an input reader (use BufferedReader if you will be reading files from a source with high latency), and grab each character, looking for a {. If the character isn't a {, emit the character to an output stream. If you find a {, start scanning for a }, accumulating the characters in a StringBuilder. If you hit another {, flush the StringBuilder to the output stream and start over. You may want to have some maximum # of characters that you allow the property key to contain. If you hit that limit, flush the StringBuilder to the output stream.
If you find a token surrounded by {}, grab the key name and do a Properties#getProperty() call. If you get a result, emit the result to the output stream. If you don't get a result, do something different.
If you want to get clever, once you get the result, instead of sending the result directly to the output stream, pre-pend it to the input stream (not literally - you'd do some logic to make it work), and continue. That way if any of the properties themselves refer to other properties, the algorithm effectively goes recursive.
If you are really going for performance, you could use a ByteBuffer instead of an input stream/writer
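A rough sketch of the scanning loop described above, adapted to the ${...} syntax from the question (simplified: no restart on a nested token and no recursive expansion; unresolved tokens are emitted unchanged, which is just one of the "do something different" options):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.Writer;
import java.util.Properties;

class PropertyFilter {
    private static final int MAX_KEY_LENGTH = 256;   // arbitrary guard against runaway keys

    static void filter(Reader source, Writer out, Properties props) throws IOException {
        BufferedReader in = new BufferedReader(source);
        int c;
        while ((c = in.read()) != -1) {
            if (c != '$') {
                out.write(c);                         // ordinary character, pass it through
                continue;
            }
            int next = in.read();
            if (next != '{') {                        // a lone '$', not the start of a token
                out.write(c);
                if (next != -1) out.write(next);
                continue;
            }
            StringBuilder key = new StringBuilder();
            int k;
            while ((k = in.read()) != -1 && k != '}' && key.length() < MAX_KEY_LENGTH) {
                key.append((char) k);
            }
            String value = (k == '}') ? props.getProperty(key.toString()) : null;
            if (value != null) {
                out.write(value);                     // resolved: emit the property value
            } else {
                out.write("${" + key);                // unresolved (or truncated): emit as-is
                if (k != -1) out.write(k);
            }
        }
        out.flush();
    }
}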