Getting location from XMLStreamReader

Getting location from XMLStreamReader - java

I'm working on a streaming SOAP-proxy that will be used for authentication and routing. Proxy must read and validate credentials and routing information from the SOAP header. It also must enrich the SOAP header with additional metadata that is consumed by the target services.
I first wrapped up a quick proof of concept using StAX (XMLStreamReader and XMLStreamWriter). It seemed to work without issues, so concept was proven, but performance was not what I was exepcting. While profiling the prototype I noticed that application spent considerable amount of cpu time on character encoding/decoding (30-80%)
I thought the solution would be simple:
Store data read by XMLStreamReader from InputStream to a buffer
Get the offset right before SOAP header's end tag (getLocation())
Send data from buffer until offset is reached
Send additional data
Send remaining data from buffer
Send data from InputStream
This would avoid all the unneccessary encoding/decoding. However to my surprise, I found that standard implementation does not support getLocation() - it returns -1.
Obsolete section - following does NOT actually work because XMLStreamReader consumes ridiculous amounts of memory when reading a byte at time. "I noticed that I can get the offset from InputStream, if I write a InputStreamReader that returns one byte at a time and tracks number of bytes read. This however doesn't feel like the right solution - absurd number of read calls will most likely cause greater performance hit than the extra decoding/encoding."
What are my options here, if writing custom parser is out of the question? Is there a standard compliant XMLStreamReader implementation that is both proven and supports the getLocation()?

Related

Handle HTTP POST multipart response through ServerSocket

Good afternoon everyone,
First of all, I'll say that it's only for personal purpose in a certain way, it's made to use for little projects to improve my Java knowledge, but my idea is to make this kind of things to understand better the way developers works with sockets and bytes, as I really like to understand this kind of things better for my future ideas.
Actually I'm making a lightweight HTTP server in Java to understand the way it works, and I've been reading documentation but still have some difficulties to actually understand part of the official documentation. The main problem I'm facing is that, something I'd like to know if it's related or not, the content-length seems to have a higher length than the one I get from the BufferedReader. I don't know if the issue is about the way chars are managed and bytes are being parsed to chars on the BufferedReader, so it has less data, so probably what I have to do is treat this part as a binary, so I'd have to read the bytes of the InputStream, but here comes the real deal I'm facing.
As Readers reads a certain amount of bytes, and then it stops and uses this as buffer, this means the data from the InputStream is being used on the Reader, and it's no longer on the stream, so using read() would end up on a -1 as there aren't more bytes to read. A multipart is divided in multiple elements separated with a boundary, and a newline that delimiters the information from the content. I still have to get the information as an String to process it, but the content should be parsed into a binary data, and, without modifying the buffer length, implying I'd require knowledge about the exact length I require to get only the information, the most probably result would be the content being transferred to the BufferedReader buffer. Is it possible to do it even with the processed data from the BufferedStream, or should I find a way to get that certain content as binary without being processed?
As I said, I'm new working with sockets and services, so I don't exactly know which are the possibilities it's another kind of issue, so any help would be appreciated, thank you in advance.

Answer from Remy Lebeau, that can be found on the comments, which become useful for me:
since multipart data is both textual and binary, you are going to have to do your own buffering of the socket data so you have more control and know where the data switches back and forth. At the very least, since you can read binary data directly from a BufferedInputStream, and access its internal buffer, you can let it handle the actual buffering for you, and it is not difficult to write a custom readLine() method that can read a line of text from a BufferedInputStream without using BufferedReader

Find byte offsets for e-mail attachments

I got a requirement to deliver emails to a legacy system that needs to read the attachments.
For each part in a multipart email I need to provide the byte offset for where the attachment starts in the email, so the legacy system doesn't need to know how to parse emails.
Performance and memory usage is an issue, so the solution can't load the entire email into memory. And to my eyes that leaves out javax.mail.
How would you go about it in Java?
My first idea was to use mime4j, but the library does not keep of byte offset or even the line number.
I investigated making a PR to mime4j to add tracking of line numbers and byte offsets. But it is not very easy, since it is a very mature project and it uses lots of buffering internally.
Now I am thinking that maybe I am going about this the wrong way. So I would very much appreciate any ideas of how to solve this in a simple matter.

You're going to run into issues just sending the byte offsets and the full email, as emails still can be base64 encoded or printed-quoteable encoded.
You'll want to use a MimeStreamParser and give your own ContentHandler and override the body method. You can then directly send the BodyDescriptor and InputStream to the legacy system. The InputStream is the "decoded" email (IE handles any Content-Transfer-Encoding). The BodyDescriptor is useful to extract stuff from the headers of the part that you may care about (MimeType and Charset are the most useful ones).
This does not buffer the whole email, and allows you to stream out just the body parts. I'm not sure how you're communicating with the legacy system (via the network or if it's an inprocess subcomponent) but hopefully that works!

Ways to buffer REST response

There's a REST endpoint, which serves large (tens of gigabytes) chunks of data to my application.
Application processes the data in it's own pace, and as incoming data volumes grow, I'm starting to hit REST endpoint timeout.
Meaning, processing speed is less then network throughoutput.
Unfortunately, there's no way to raise processing speed enough, as there's no "enough" - incoming data volumes may grow indefinitely.
I'm thinking of a way to store incoming data locally before processing, in order to release REST endpoint connection before timeout occurs.
What I've came up so far, is downloading incoming data to a temporary file and reading (processing) said file simultaneously using OutputStream/InputStream.
Sort of buffering, using a file.
This brings it's own problems:
what if processing speed becomes faster then downloading speed for
some time and I get EOF?
file parser operates with
ObjectInputStream and it behaves weird in cases of empty file/EOF
and so on
Are there conventional ways to do such a thing?
Are there alternative solutions?
Please provide some guidance.
Upd:
I'd like to point out: http server is out of my control.
Consider it to be a vendor data provider. They have many consumers and refuse to alter anything for just one.
Looks like we're the only ones to use all of their data, as our client app processing speed is far greater than their sample client performance metrics. Still, we can not match our app performance with network throughoutput.
Server does not support http range requests or pagination.
There's no way to divide data in chunks to load, as there's no filtering attribute to guarantee that every chunk will be small enough.
Shortly: we can download all the data in a given time before timeout occurs, but can not process it.
Having an adapter between inputstream and outpustream, to pefrorm as a blocking queue, will help a ton.

You're using something like new ObjectInputStream(new FileInputStream(..._) and the solution for EOF could be wrapping the FileInputStream first in an WriterAwareStream which would block when hitting EOF as long a the writer is writing.
Anyway, in case latency don't matter much, I would not bother start processing before the download finished. Oftentimes, there isn't much you can do with an incomplete list of objects.
Maybe some memory-mapped-file-based queue like Chronicle-Queue may help you. It's faster than dealing with files directly and may be even simpler to use.
You could also implement a HugeBufferingInputStream internally using a queue, which reads from its input stream, and, in case it has a lot of data, it spits them out to disk. This may be a nice abstraction, completely hiding the buffering.
There's also FileBackedOutputStream in Guava, automatically switching from using memory to using a file when getting big, but I'm afraid, it's optimized for small sizes (with tens of gigabytes expected, there's no point of trying to use memory).

Are there alternative solutions?
If your consumer (the http client) is having trouble keeping up with the stream of data, you might want to look at a design where the client manages its own work in progress, pulling data from the server on demand.
RFC 7233 describes the Range Requests
devices with limited local storage might benefit from being able to request only a subset of a larger representation, such as a single page of a very large document, or the dimensions of an embedded image
HTTP Range requests on the MDN Web Docs site might be a more approachable introduction.

This is the sort of thing that queueing servers are made for. RabbitMQ, Kafka, Kinesis, any of those. Perhaps KStream would work. With everything you get from the HTTP server (given your constraint that it cannot be broken up into units of work), you could partition it into chunks of bytes of some reasonable size, maybe 1024kB. Your application would push/publish those records/messages to the topic/queue. They would all share some common series ID so you know which chunks match up, and each would need to carry an ordinal so they can be put back together in the right order; with a single Kafka partition you could probably rely upon offsets. You might publish a final record for that series with a "done" flag that would act as an EOF for whatever is consuming it. Of course, you'd send an HTTP response as soon as all the data is queued, though it may not necessarily be processed yet.

not sure if this would help in your case because you haven't mentioned what structure & format the data are coming to you in, however, i'll assume a beautifully normalised, deeply nested hierarchical xml (ie. pretty much the worst case for streaming, right? ... pega bix?)
i propose a partial solution that could allow you to sidestep the limitation of your not being able to control how your client interacts with the http data server -
deploy your own webserver, in whatever contemporary tech you please (which you do control) - your local server will sit in front of your locally cached copy of the data
periodically download the output of the webservice using a built-in http querying library, a commnd-line util such as aria2c curl wget et. al, an etl (or whatever you please) directly onto a local device-backed .xml file - this happens as often as it needs to
point your rest client to your own-hosted 127.0.0.1/modern_gigabyte_large/get... 'smart' server, instead of the old api.vendor.com/last_tested_on_megabytes/get... server
some thoughts:
you might need to refactor your data model to indicate that the xml webservice data that you and your clients are consuming was dated at the last successful run^ (ie. update this date when the next ingest process completes)
it would be theoretically possible for you to transform the underlying xml on the way through to better yield records in a streaming fashion to your webservice client (if you're not already doing this) but this would take effort - i could discuss this more if a sample of the data structure was provided
all of this work can run in parallel to your existing application, which continues on your last version of the successfully processed 'old data' until the next version 'new data' are available
^
in trade you will now need to manage a 'sliding window' of data files, where each 'result' is a specific instance of your app downloading the webservice data and storing it on disc, then successfully ingesting it into your model:
last (two?) good result(s) compressed (in my experience, gigabytes of xml packs down a helluva lot)
next pending/ provisional result while you're streaming to disc/ doing an integrity check/ ingesting data - (this becomes the current 'good' result, and the last 'good' result becomes the 'previous good' result)
if we assume that you're ingesting into a relational db, the current (and maybe previous) tables with the webservice data loaded into your app, and the next pending table
switching these around becomes a metadata operation, but now your database must store at least webservice data x2 (or x3 - whatever fits in your limitations)
... yes you don't need to do this, but you'll wish you did after something goes wrong :)
Looks like we're the only ones to use all of their data
this implies that there is some way for you to partition or limit the webservice feed - how are the other clients discriminating so as not to receive the full monty?

You can use in-memory caching techniques OR you can use Java 8 streams. Please see the following link for more info:
https://www.conductor.com/nightlight/using-java-8-streams-to-process-large-amounts-of-data/

Camel could maybe help you the regulate the network load between the REST producer and producer ?
You might for instance introduce a Camel endpoint acting as a proxy in front of the real REST endpoint, apply some throttling policy, before forwarding to the real endpoint:
from("http://localhost:8081/mywebserviceproxy")
.throttle(...)
.to("http://myserver.com:8080/myrealwebservice);
http://camel.apache.org/throttler.html
http://camel.apache.org/route-throttling-example.html
My 2 cents,
Bernard.

If you have enough memory, Maybe you can use in-memory data store like Redis.
When you get data from your Rest endpoint you can save your data into Redis list (or any other data structure which is appropriate for you).
Your consumer will consume data from the list.

using chunked encoding request with variable chunk size in Java (HttpUrlConnection)

I'm searching for weeks now to find a solution how to use chunked transfer encoding in a Java client without coding my own myHttpURLConnection.
The HttpUrlConnection of Java expects a fixed chunk size, which is not usable for me. The data consists of several messages that are different in size and must be sent in neartime to the server. The system currently (in Prelive/UAT state) works based on having fixed 1024 byte chunks but since most messages are significantly shorter, this is a waste of band width not acceptable in PROD.
Furthermore, messages larger than 1024 bytes would be chopped apart so a) the server would need to assemble them again and b) the last part of the message would not be send until enough data is available for filling 1024 bytes (even worse, not neartime anymore).
Does anybody have an idea how to workaround the restriction of the HttpUrlConnection of Java (non compliant with RFC2616 as it does not fully implement it) without having to code everything on top of URLConnection?
I did not find any possibility to hook into the needed funcs for just setting a new chunk size for each heap of data.
My current option: douplicate all HTTPUrlConnection code and modify the parts dealing with CHUNKED (e.g. having some flush() function to adjust the chunk size and send what's there).

Rest calls-Large amount of data between calls

We are using Rest using Jersey. There are few scenarios where server(WAS 8.5) sends large amount of data to client, which is RCP application. In some cases data is so huge(150MB) in xml format that client gets an OutOfMemoryError exception.
I have below questions
How much size is increased when java object is converted in xml?
How we can send large java object to client and still use rest calls?

1) Tough question to answer without seeing the XML schema, I've seen well designed schemas that result in tight, lean XML, and others that are a mess and very bloated. To test it write some test code that serializes your Java objects to a byte[] and compare it's size to the XML payload you currently produce.
2) Might be worth looking into a chunking process, 150MB is pretty large for a single payload. Also are you using GZIP compression for this already? Also may be worth looking at Fast Infoset. Basically it's a binary encoding for XML that generally helps reduce the size of an XML Document.

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.