Good afternoon everyone,
First of all, I should say this is for personal purposes: it's for small projects to improve my Java knowledge. My goal with projects like this is to better understand how developers work with sockets and bytes, since I really want to understand these things for my future ideas.
I'm currently building a lightweight HTTP server in Java to understand how it works, and while I've been reading documentation, I still have difficulty understanding parts of the official docs. The main problem I'm facing (and I'd like to know whether it's related) is that the Content-Length header reports more bytes than the length of what I get from the BufferedReader. I suspect the issue is the way bytes are decoded into chars by the BufferedReader, so the char count comes out smaller. That suggests I should treat this part as binary and read the bytes from the InputStream directly, but that's where the real problem starts.
Since a Reader reads a chunk of bytes from the stream into its internal buffer, that data is consumed from the InputStream and is no longer available there, so a subsequent read() on the stream returns -1. A multipart body is divided into multiple parts separated by a boundary, with a blank line separating each part's headers from its content. I still need the headers as a String in order to process them, but the content should be kept as binary data. Without adjusting the buffer length (which would require knowing the exact number of bytes of just the headers), the content will most likely end up inside the BufferedReader's buffer. Is it possible to recover that data from the BufferedReader after it has been decoded, or should I find a way to read that content as binary before it is processed?
As I said, I'm new to working with sockets and services, so I don't know exactly what the options are or whether this is a different kind of issue entirely; any help would be appreciated. Thank you in advance.
Answer from Remy Lebeau, found in the comments, which turned out to be useful for me:
Since multipart data is both textual and binary, you are going to have to do your own buffering of the socket data so you have more control and know where the data switches back and forth. At the very least, since you can read binary data directly from a BufferedInputStream, and access its internal buffer, you can let it handle the actual buffering for you, and it is not difficult to write a custom readLine() method that can read a line of text from a BufferedInputStream without using BufferedReader.
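For reference, here is a minimal sketch of the kind of custom readLine() he describes, reading header text directly from a BufferedInputStream so the binary body bytes stay available on the same stream (the class name is mine; it assumes HTTP-style lines ending in CRLF and ISO-8859-1 header text):

import java.io.BufferedInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

public final class HttpLineReader {

    // Reads one line of text from the stream, decoding bytes as ISO-8859-1
    // (the traditional encoding for HTTP headers). Returns null at end of stream.
    public static String readLine(BufferedInputStream in) throws IOException {
        ByteArrayOutputStream line = new ByteArrayOutputStream();
        int b;
        while ((b = in.read()) != -1) {
            if (b == '\n') {        // LF terminates the line
                break;
            }
            if (b != '\r') {        // drop the CR of a CRLF pair
                line.write(b);
            }
        }
        if (b == -1 && line.size() == 0) {
            return null;            // nothing left to read
        }
        return new String(line.toByteArray(), StandardCharsets.ISO_8859_1);
    }
}

After the blank line that ends a part's headers, you can switch to in.read(byte[]) on the same BufferedInputStream to pull the binary content.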
Related
I know the Java libraries pretty well, so I was surprised when I realized that, apparently, there's no easy way to do something seemingly simple with a stream. I'm trying to read an HTTP request containing multipart form data (large, multiline tokens separated by delimiters that look like, for example, ------WebKitFormBoundary5GlahTkFmhDfanAn--), and I want to read until I encounter a part of the request with a given name, and then return an InputStream of that part.
I'm fine with just reading the stream into memory and returning a ByteArrayInputStream, because the files submitted should never be larger than 1MB. However, I want to make sure that the reading method throws an exception if the file is larger than 1MB, so that excessively large files don't fill up the JVM's memory and crash the server. The file data may be binary, which rules out BufferedReader.readLine() (it drops newlines, which could be any of \r, \n, or \r\n, resulting in loss of data).
All of the obvious tokenizing solutions, such as Scanner, read the tokens as Strings, not streams, which could cause OutOfMemoryErrors for large files, exactly what I'm trying to avoid. As far as I can tell, there's no equivalent of Scanner that returns each token as an InputStream without reading it into memory. Is there something I'm missing, or is there a way to create something like that myself, using just the standard Java libraries (no Apache Commons, etc.), that doesn't require me to read the stream a character at a time and write all of the token-scanning code myself?
Addendum: Shortly before posting this, I realized that the obvious solution to my original problem was simply to read the full request body into memory, failing if it's too large, and then to tokenize the resulting ByteArrayInputStream with a Scanner. This is inefficient, but it works. However, I'm still interested to know if there's a way to tokenize an InputStream into sub-streams, without reading them into memory, without using extra libraries, and without resorting to character-by-character processing.
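Concretely, that workaround looks something like this (a sketch; the 1MB cap, class and method names are my own choices):

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

public final class BoundedBodyReader {

    private static final int MAX_BODY_BYTES = 1024 * 1024; // 1 MB cap

    // Reads the whole request body into memory, failing fast once the cap is
    // exceeded so an oversized upload can never exhaust the JVM's heap.
    public static ByteArrayInputStream readFully(InputStream in) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            if (buffer.size() + n > MAX_BODY_BYTES) {
                throw new IOException("Request body exceeds " + MAX_BODY_BYTES + " bytes");
            }
            buffer.write(chunk, 0, n);
        }
        return new ByteArrayInputStream(buffer.toByteArray());
    }
}

The resulting ByteArrayInputStream can then be handed to a Scanner with the boundary as the delimiter pattern.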
It's not possible without loading them into memory (the solution you don't want) or saving them to disk (which becomes I/O heavy). Tokenizing the stream into separate streams without loading it into memory implies that you can read the stream (to tokenize it) and then be able to read it again later. In short, what you want is impossible unless your stream is seekable, and seekable streams are generally specialized streams for very specific applications and specialized I/O objects, like RandomAccessFile.
I keep running into the case where I want some structure with, let's say, a buffer size of 4096, where I can:
write bytes into it,
read bytes from it,
reset the read position back to the previous read,
and, MOST IMPORTANT, not have to deal with copying data around as the data window gets near the end of the byte array!!! (This is basically much like a circular buffer with wrap-around.)
ByteBuffer seems like just as much of a headache as byte[]: with both of them, as you write to and read from it, the beginning of the array starts to empty out. I almost want a structure like a List or something... I just want it all managed for me (or I may have to write my own structure). I think some kind of InputStream with mark and reset would be nice, so I can mark before I read and then reset if there isn't enough data in the buffer yet.
This is extremely useful in nearly all asynchronous programming, where data comes in and you may or may not have enough to parse: you fill the buffer, try to read and parse, and need to reset until you have more data.
ByteBuffer seems totally right for this, and ByteBuffer.compact() is exactly what you want when you need to move the remaining bytes back to the start.
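A minimal sketch of that fill/parse/compact cycle (FrameReader, onReadable() and the 4-byte record are invented for illustration):

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.ReadableByteChannel;

final class FrameReader {
    private final ByteBuffer buf = ByteBuffer.allocate(4096);

    // Call whenever the channel has data: fill, try to parse, keep any leftovers.
    void onReadable(ReadableByteChannel channel) throws IOException {
        channel.read(buf);      // append new bytes after any unparsed leftovers
        buf.flip();             // switch the buffer into reading mode
        buf.mark();             // remember where this parse attempt started
        if (!tryParse(buf)) {   // false means "not enough data yet"
            buf.reset();        // roll back to the mark and wait for more bytes
        }
        buf.compact();          // shift unconsumed bytes to the front, back to writing mode
    }

    private boolean tryParse(ByteBuffer b) {
        if (b.remaining() < 4) {
            return false;       // incomplete record: try again after the next read
        }
        int value = b.getInt(); // consume one complete 4-byte record
        System.out.println("parsed record: " + value);
        return true;
    }
}

No bytes are ever copied except by compact(), which only moves the (usually small) unconsumed tail back to index 0.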
You might be able to use this circular byte buffer: use the getAvailable() method rather than reading and then resetting.
Hmmm, I just found this non-GPL one... looks like an Apache license:
https://svn.apache.org/repos/asf/etch/releases/release-1.0.0/util/src/main/java/etch/util/CircularByteBuffer.java
Anyone used this? Looks OK to me.
Is it bad style to keep references to streams "further down" a filter chain and use those lower-level streams again, or even to swap one type of stream for another? For example:
OutputStream os = new FileOutputStream("file");
PrintWriter pw = new PrintWriter(os);
pw.print("print writer stream");
pw.flush();
pw = null;
DataOutputStream dos = new DataOutputStream(os);
dos.writeBytes("dos writer stream");
dos.flush();
dos = null;
os.close();
If so, what are the alternatives if I need to use the functionality of both streams, e.g. if I want to write a few lines of text to a stream, followed by binary data, or vice versa?
This can be done in some cases, but it's error-prone. You need to be careful about buffering and about things like the stream headers of ObjectOutputStream.
if I want to write a few lines of text to a stream, followed by binary data, or vice versa?
For this, all you need to know is that you can convert text to binary data and back, but you always need to specify an encoding. However, it is also error-prone, because people tend to use the API methods that fall back to the platform default encoding, and of course you're basically implementing a parser for a custom binary file format; lots of things can go wrong there.
All in all, if you're creating a file format, especially one mixing text and binary data, it's best to use an existing framework like Google protocol buffers.
If you have to do it, then you have to do it. So if you're dealing with an external dependency that you don't have control over, you just have to do it.
I think the bad style is the fact that you would need to do it at all. If you had to send binary data sometimes and text at other times, it would probably be best to have some kind of message object and send the object itself over the wire with serialization. The data overhead isn't too much if it's structured properly.
I don't see why not. I mean, the implementations of the various stream classes should protect you from writing invalid data. So long as you're reading it back the same way, and your code is otherwise understandable, I don't see why that would be a problem.
Style doesn't always mean you have to do it the way you've seen others do it. So long as it's logical, and someone reading the code would see what (and why) you're doing it without you needing to write a bunch of comments, then I don't see what the issue is.
Since you're flushing in between, it's probably fine. But it might be cleaner to use a single OutputStream and write the strings with os.write(string.getBytes(charset)), passing an explicit charset.
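For example, a sketch of writing text followed by binary on a single stream, with the charset made explicit (the length-prefix convention here is just one possible choice):

import java.io.DataOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

// One stream for everything; text is converted explicitly,
// never via the platform default encoding.
try (DataOutputStream out = new DataOutputStream(new FileOutputStream("file"))) {
    byte[] text = "a few lines of text\n".getBytes(StandardCharsets.UTF_8);
    out.writeInt(text.length);          // length prefix so a reader knows where the text ends
    out.write(text);                    // the text, as UTF-8 bytes
    out.write(new byte[] {1, 2, 3, 4}); // followed by the binary data
}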
Is there a way to accurately gather the byte offsets of xml tags using the XMLStreamReader?
I have a large xml file that I require random access to. Rather than writing the whole thing to a database, I would like to run through it once with an XMLStreamReader to gather the byte offsets of significant tags, and then be able to use a RandomAccessFile to retrieve the tag content later.
XMLStreamReader doesn't seem to have a way to track character offsets. Instead, people recommend attaching the XMLStreamReader to an input stream that counts how many bytes have been read (the CountingInputStream provided by Apache commons-io, for example), e.g.:
CountingInputStream countingReader = new CountingInputStream(new FileInputStream(xmlFile));
XMLInputFactory xmlStreamFactory = XMLInputFactory.newInstance(); // was undeclared in the original snippet
XMLStreamReader xmlStreamReader = xmlStreamFactory.createXMLStreamReader(countingReader, "UTF-8");
while (xmlStreamReader.hasNext()) {
    int eventCode = xmlStreamReader.next();
    switch (eventCode) {
        case XMLStreamReader.END_ELEMENT:
            System.out.println(xmlStreamReader.getLocalName() + " #" + countingReader.getByteCount());
    }
}
xmlStreamReader.close();
Unfortunately there must be some buffering going on, because the above code prints out the same byte offsets for several tags. Is there a more accurate way of tracking byte offsets in xml files (ideally without resorting to abandoning proper xml parsing)?
You could use getLocation() on the XMLStreamReader (or XMLEvent.getLocation() if you use XMLEventReader), but I remember reading somewhere that it is not reliable or precise. And it looks like it gives the end point of the tag, not the starting location.
I have a similar need to precisely know the location of tags within a file, and I'm looking at other parsers to see if there is one that guarantees to give the necessary level of location precision.
You could use a wrapper input stream around the actual input stream, deferring to the wrapped stream for the actual I/O operations but keeping an internal counting mechanism, with a method to retrieve the current offset.
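A bare-bones sketch of such a wrapper (essentially what commons-io's CountingInputStream does; the class name is made up):

import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

// Counts every byte that passes through, delegating the actual I/O to the wrapped stream.
final class ByteCountingInputStream extends FilterInputStream {
    private long count;

    ByteCountingInputStream(InputStream in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        int b = super.read();
        if (b != -1) {
            count++;
        }
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        if (n > 0) {
            count += n;
        }
        return n;
    }

    long getByteCount() {
        return count;
    }
}

Note that this has the same limitation discussed above: it reports how far the parser's internal buffering has read ahead, not the position of the current event.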
Unfortunately, Aalto doesn't implement the LocationInfo interface.
The latest Java VTD-XML implementation from XimpleWare, currently 2.11 (on SourceForge or on GitHub), provides code maintaining a byte offset after each call to the getChar() method of its IReader implementations. IReader implementations for various character encodings are available inside VTDGen.java and VTDGenHuge.java, covering the following encodings:
ASCII
ISO_8859_1
ISO_8859_10
ISO_8859_11
ISO_8859_12
ISO_8859_13
ISO_8859_14
ISO_8859_15
ISO_8859_16
ISO_8859_2
ISO_8859_3
ISO_8859_4
ISO_8859_5
ISO_8859_6
ISO_8859_7
ISO_8859_8
ISO_8859_9
UTF_16BE
UTF_16LE
UTF8
WIN_1250
WIN_1251
WIN_1252
WIN_1253
WIN_1254
WIN_1255
WIN_1256
WIN_1257
WIN_1258
Updating IReader with a getCharOffset() method, implementing it by adding a charCount member alongside the offset member of the VTDGen and VTDGenHuge classes, and incrementing it upon each getChar() and skipChar() call of each IReader implementation should give you the start of a solution.
I think I've found another option. If you replace your switch block with the following, it will dump the position immediately after the end element tag.
switch (eventCode) {
    case XMLStreamReader.END_ELEMENT:
        System.out.println(xmlStreamReader.getLocalName() + " end#" + xmlStreamReader.getLocation().getCharacterOffset());
}
This solution would also require manually calculating the actual start position of each end tag, but it has the advantage of not needing an external JAR file.
I was not able to track down some minor inconsistencies in the data management (I think it has to do with how I initialized my XMLStreamReader), but I always saw a consistent increase in the location as the reader moved through the content.
Hope this helps!
I recently worked out a solution for a similar question on How to find character offsets in big XML files using java?. I think it provides a good solution based on an ANTLR-generated XML parser.
I just burned a long weekend on this, and arrived at the solution partially thanks to some clues here. Remarkably, I don't think this has gotten much easier in the 10 years since the OP posted this question.
TL;DR Use Woodstox and char offsets
The first problem to contend with is that most XMLStreamReader implementations seem to provide inaccurate results when you ask them for their current offsets. Woodstox however seems to be rock-solid in this regard.
The second problem is the actual type of offset you use. Unfortunately, it seems that you have to use char offsets if you need to work with a multi-byte charset, which means random-access retrieval from the file is not going to be very efficient: you can't just set a pointer into the file at your offset and start reading; you have to read through until you get to the offset, then start extracting. There may be a more efficient way to do this that I haven't thought of, but the performance is acceptable for my case. 500MB files are pretty snappy.
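Extraction then looks something like this sketch, assuming UTF-8 and that charOffset and length were recorded during the indexing pass:

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

// Re-reads the file up to a previously recorded char offset, then extracts
// the element. Not true random access, but acceptable in practice.
static String extract(String path, long charOffset, int length) throws IOException {
    try (Reader reader = new InputStreamReader(
            new FileInputStream(path), StandardCharsets.UTF_8)) {
        long skipped = 0;
        while (skipped < charOffset) {   // skip() may skip fewer chars than asked
            long n = reader.skip(charOffset - skipped);
            if (n <= 0) {
                throw new IOException("Offset past end of file");
            }
            skipped += n;
        }
        char[] out = new char[length];
        int read = 0;
        while (read < length) {          // read() may likewise return short
            int n = reader.read(out, read, length - read);
            if (n == -1) {
                throw new IOException("Unexpected end of file");
            }
            read += n;
        }
        return new String(out);
    }
}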
[edit] So this turned into one of those splinter-in-my-mind things, and I ended up writing a FilterReader that keeps a buffer of byte offset to char offset mappings as the file is read. When we need to get the byte offset, we first ask Woodstox for the char offset, then get the custom reader to tell us the actual byte offset for the char offset. We can get the byte offset from the beginning and end of the element, giving us what we need to go in and surgically extract the element from the file by opening it as a RandomAccessFile.
I created a library for this, it's on GitHub and Maven Central. If you just want to get the important bits, the party trick is in the ByteTrackingReader.
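A heavily simplified sketch of the idea, assuming UTF-8 (the real ByteTrackingReader keeps a buffer of byte-to-char mappings so arbitrary char offsets can be answered; this version only tracks the position of the last char handed out, and doesn't override skip()):

import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;

final class Utf8OffsetTrackingReader extends FilterReader {
    private long byteOffset;

    Utf8OffsetTrackingReader(Reader in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        int c = in.read();
        if (c != -1) {
            byteOffset += utf8Length(c);
        }
        return c;
    }

    @Override
    public int read(char[] buf, int off, int len) throws IOException {
        int n = in.read(buf, off, len);
        for (int i = 0; i < n; i++) {
            byteOffset += utf8Length(buf[off + i]);
        }
        return n;
    }

    long byteOffset() {
        return byteOffset;
    }

    // UTF-8 byte length contributed by a single UTF-16 code unit; each half of
    // a surrogate pair contributes 2 bytes (2 + 2 = 4 for the full code point).
    private static int utf8Length(int c) {
        if (c < 0x80) return 1;
        if (c < 0x800) return 2;
        return (c >= 0xD800 && c <= 0xDFFF) ? 2 : 3;
    }
}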
[/edit]
There is another similar question on SO about this (but the accepted answer frightened and confused me), and some people commented that this whole thing is a bad idea, asking why you would want to do it: XML is a transport mechanism, and you should just import it into a DB and work with the data using more appropriate tools. For most cases this is true, but if you're building applications or integrations that communicate via XML (still going strong in 2020), you need tooling to analyze and operate on the files that are exchanged. I get daily requests to verify feed contents, and having the ability to quickly extract a specific set of items from a massive file and verify not only the contents but the format itself is essential.
Anyhow, hopefully this can save someone a few hours, or at least get them closer to a solution. God help you if you're finding this in 2030, trying to solve the same problem.
I could use some hints or tips for a decent interface for reading a file with special characteristics.
The files in question have a header (~120 bytes), a body (1 byte to 3 GB) and a footer (4 bytes).
The header contains information about the body, and the footer is only a simple CRC32 value of the body.
I use Java, so my idea was to extend the InputStream class and add a constructor such as public MyInStream(InputStream in), where I immediately read the header and then direct the overridden read() calls to the body.
Problem is, I can't give the user of the class the CRC32 value until the whole body has been read.
Because the file can be 3 GB large, putting it all in memory is a bad idea.
Reading it all into a temporary file is going to be a performance hit if there are many small files.
I don't know how large the file is because the InputStream doesn't have to be a file, it could be a socket.
Looking at it again, maybe extending InputStream is a bad idea.
Thank you for reading the confused thoughts of a tired programmer. :)
Looking at it again, maybe extending InputStream is a bad idea.
If users of the class need to access the body as a stream, it's IMO not a bad choice. Java's ObjectOutput/InputStream works like this.
I don't know how large the file is because the InputStream doesn't have to be a file, it could be a socket.
Um, then your problem is not with the choice of Java class, but with the design of the file format. If you can't change the format, there isn't really anything you can do to make the data at the end of the file available before all of it is read.
But perhaps you could encapsulate the processing of the checksum completely? Presumably it's a checksum for the body, so your class could always read ahead 4 bytes to see when the file ends, not return the last 4 bytes to the client as part of the body, and instead compare them with a CRC computed while reading the body, throwing an exception when they do not match.
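A sketch of that encapsulation (assuming the trailing 4 bytes are a big-endian CRC32 of the body; the class name is made up):

import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.CRC32;

// Stays 4 bytes ahead of the caller so the trailing footer is never returned
// as body data, and verifies the CRC32 when the end of the stream is reached.
final class Crc32CheckedInputStream extends FilterInputStream {
    private final byte[] lookahead = new byte[4];
    private final CRC32 crc = new CRC32();
    private boolean primed;

    Crc32CheckedInputStream(InputStream in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        if (!primed) {                  // fill the 4-byte lookahead window once
            for (int i = 0; i < 4; i++) {
                int b = in.read();
                if (b == -1) throw new IOException("File shorter than its footer");
                lookahead[i] = (byte) b;
            }
            primed = true;
        }
        int next = in.read();
        if (next == -1) {               // lookahead now holds the footer: verify it
            long stored = ((lookahead[0] & 0xFFL) << 24) | ((lookahead[1] & 0xFFL) << 16)
                        | ((lookahead[2] & 0xFFL) << 8) | (lookahead[3] & 0xFFL);
            if (stored != crc.getValue()) throw new IOException("CRC32 mismatch");
            return -1;
        }
        int out = lookahead[0] & 0xFF;  // the oldest lookahead byte is real body data
        System.arraycopy(lookahead, 1, lookahead, 0, 3);
        lookahead[3] = (byte) next;
        crc.update(out);
        return out;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        for (int i = 0; i < len; i++) { // simple but slow: delegate to read()
            int b = read();
            if (b == -1) return i == 0 ? -1 : i;
            buf[off + i] = (byte) b;
        }
        return len;
    }
}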