I am using ActiveMQ and Spring.
Is there any way I can keep track of all processed messages? I have to keep track of which messages have been processed, and I also want to review these processed messages at a later stage.
Should I use a database for this?
Is there any good library that can make this operation easy?
I do not want to make a table in the database for every kind of model object.
In general, I would suggest that you either log and/or record the messages into the database. If you simply want to review the messages later, simple logging may suffice. If you need to do transactional rollup/searching through a UI, then the database is better.
However, you can also achieve what you want with ActiveMQ virtual destinations. With this, you can have 1 destination forward to 2 other destinations. Then your app could listen on 1 destination, and a copy of the message would sit on the other for your review. For example:
<broker persistent="false" useJmx="false" xmlns="http://activemq.apache.org/schema/core">
  <destinationInterceptors>
    <virtualDestinationInterceptor>
      <virtualDestinations>
        <compositeQueue name="MY.QUEUE">
          <forwardTo>
            <queue physicalName="MY.QUEUE.PROCESS" />
            <topic physicalName="MY.QUEUE.REVIEW" />
          </forwardTo>
        </compositeQueue>
      </virtualDestinations>
    </virtualDestinationInterceptor>
  </destinationInterceptors>
</broker>
This would define a queue MY.QUEUE where each message ends up on BOTH the MY.QUEUE.PROCESS queue and the MY.QUEUE.REVIEW topic.
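For the consuming side, a minimal sketch with Spring JMS (assuming spring-jms and a ConnectionFactory pointed at the broker are configured; the class name is illustrative) could look like:

import org.springframework.jms.annotation.JmsListener;
import org.springframework.stereotype.Component;

@Component
public class ProcessQueueListener {

    // Normal business processing consumes the PROCESS copy;
    // the copy on MY.QUEUE.REVIEW stays untouched for later review.
    @JmsListener(destination = "MY.QUEUE.PROCESS")
    public void onMessage(String payload) {
        // process the message here
    }
}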
I would use a database.
Perhaps you could use an ORM such as Hibernate, but plain JDBC or Spring's JdbcTemplate may be better.
Rather than making a separate table for each model object, make a 'message' table and serialize the uncommon portions into a payload blob (or text). You could then use a utility to deserialize the message for review (or playback) later.
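As a rough sketch of that idea, assuming Spring's JdbcTemplate and Java serialization for the payload (the table and column names are made up for illustration):

import org.springframework.jdbc.core.JdbcTemplate;

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.sql.Timestamp;
import java.time.Instant;

public class ProcessedMessageDao {

    private final JdbcTemplate jdbcTemplate;

    public ProcessedMessageDao(JdbcTemplate jdbcTemplate) {
        this.jdbcTemplate = jdbcTemplate;
    }

    // assumes a table: message(id, message_type, processed_at, payload BLOB)
    public void record(String messageType, Serializable body) throws IOException {
        jdbcTemplate.update(
                "INSERT INTO message (message_type, processed_at, payload) VALUES (?, ?, ?)",
                messageType, Timestamp.from(Instant.now()), serialize(body));
    }

    private byte[] serialize(Serializable body) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(body);
        }
        return bytes.toByteArray();
    }
}

A matching reader would deserialize the payload column back into the object for review or playback.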
Kartik,
This is a good programming question, but it's more of a 'what should I program' question than a 'how can I do this' question. It's hard to answer a 'what should I program' question because what you should program depends directly on what you need. At best, we can only guess at what you really need.
If you need to update the processed JMS messages, then a database will make it easy to update. If you need to prove that nobody updated a "logged" entry, then a database might not do the job.
Let's say this log is used to see which very-slow-to-process messages still need to complete. Then a database will provide easy searching, provided that the person searching knows SQL. However, if the log is more of an archive, then the database just adds overhead to the entire process; a structured file will do.
In Java there is JDBC for writing to and reading from databases, and it is not a hard API to use. Then again, there are also a number of decent logging frameworks, and of course there is always FileOutputStream. Without knowing how this log is to be used, it is very difficult to determine which techniques are really overkill; likewise, it's not possible to know which techniques are not quite enough.
Go back and review how the log is to be used, and then evaluate if the features that databases provide are overkill.
Cheers,
Ed
I'm looking for a way to prevent some sensitive data from being logged.
Ideally I would like to prevent / capture things like
String sensitive = "";
log.info("This should be prevented or caught by something: {}", sensitive);
This post is a bit of a long shot; I'm willing to investigate any lead:
annotations, new types, Sonar rules, logger hacking, etc.
Thanks for your brainstorming :)
guillaume
Create a custom type for it.
Make sure that toString doesn't return the actual content.
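A minimal sketch of such a type (the class name and mask are just illustrative):

public final class Sensitive {

    private final String value;

    public Sensitive(String value) {
        this.value = value;
    }

    /** Only deliberate callers get the real content. */
    public String reveal() {
        return value;
    }

    @Override
    public String toString() {
        return "*****"; // never the actual content
    }
}

Then log.info("value: {}", sensitive) prints ***** instead of the real data, and only code that explicitly calls reveal() can see the value.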
I imagine there are multiple ways to do this, but one way is to use the Logback configuration file, to specify a message provider for the "arguments" and "message". In those providers, you define a "writeTo" method that looks for particular patterns in the output, and masks them.
This is the path to a solution, but I obviously don't provide many details here. I'm not aware of any "standard" solutions for this.
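Not the provider approach itself, but a related sketch in the same spirit: a custom Logback ClassicConverter that masks a (hypothetical) pattern in the rendered message, registered in logback.xml with a <conversionRule> and used in the layout pattern instead of %msg:

import ch.qos.logback.classic.pattern.ClassicConverter;
import ch.qos.logback.classic.spi.ILoggingEvent;

import java.util.regex.Pattern;

public class MaskingMessageConverter extends ClassicConverter {

    // illustrative pattern only; use whatever identifies your sensitive data
    private static final Pattern SENSITIVE = Pattern.compile("\\b\\d{3}-\\d{2}-\\d{4}\\b");

    @Override
    public String convert(ILoggingEvent event) {
        return SENSITIVE.matcher(event.getFormattedMessage()).replaceAll("***-**-****");
    }
}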
Another possibility would avail itself if your architecture has services running in transient containers, and the log output is sent to a centralized log aggregator, like Splunk. If you were ok with the initial logs written in the container having sensitive data, you could have the log aggregator look for the patterns and mask them.
I would recommend two options. First, can you split your PII data into a separate log and then log that data securely?
If not, consider something like Cribl Logstream. Point your log shipper at it and let it strip away any PII you are concerned about. LogStream makes it very very easy to remove/mask/encrypt sensitive data. It has all sorts of other features as well.
At my last job we used LogStream as the router to make decisions about the data based on the content. PII data was detected, and one copy was pushed to a secure, PII-certified logging platform while another copy was pushed to the operational logging platform with the PII data masked, so a wider audience could use the logging with no risk. It was a very useful workflow that solved a lot of problems.
There's a REST endpoint, which serves large (tens of gigabytes) chunks of data to my application.
The application processes the data at its own pace, and as incoming data volumes grow, I'm starting to hit the REST endpoint timeout.
Meaning, processing speed is less than network throughput.
Unfortunately, there's no way to raise processing speed enough, as there's no "enough" - incoming data volumes may grow indefinitely.
I'm thinking of a way to store incoming data locally before processing, in order to release REST endpoint connection before timeout occurs.
What I've come up with so far is downloading the incoming data to a temporary file and reading (processing) that file simultaneously using an OutputStream/InputStream.
Sort of buffering, using a file.
This brings its own problems:
- what if processing speed becomes faster than downloading speed for some time and I get EOF?
- the file parser operates with ObjectInputStream, and it behaves weirdly in cases of an empty file/EOF
- and so on
Are there conventional ways to do such a thing?
Are there alternative solutions?
Please provide some guidance.
Update:
I'd like to point out: the HTTP server is out of my control.
Consider it to be a vendor data provider. They have many consumers and refuse to alter anything for just one.
Looks like we're the only ones to use all of their data, as our client app's processing speed is far greater than their sample client performance metrics. Still, we cannot match our app's performance with the network throughput.
Server does not support http range requests or pagination.
There's no way to divide data in chunks to load, as there's no filtering attribute to guarantee that every chunk will be small enough.
In short: we can download all the data within the given time before the timeout occurs, but we cannot process it that fast.
Having an adapter between the InputStream and OutputStream, to perform as a blocking queue, will help a ton.
You're using something like new ObjectInputStream(new FileInputStream(...)) and the solution for EOF could be wrapping the FileInputStream first in a WriterAwareStream, which would block when hitting EOF as long as the writer is writing.
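There's no such class out of the box as far as I know, but a minimal sketch of the idea could look like this, polling for simplicity; the flag is flipped by the downloading thread once it has written and flushed everything:

import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.concurrent.atomic.AtomicBoolean;

public class WriterAwareStream extends FilterInputStream {

    private final AtomicBoolean writerActive; // set to false by the downloader when it is done

    public WriterAwareStream(InputStream in, AtomicBoolean writerActive) {
        super(in);
        this.writerActive = writerActive;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        while (true) {
            boolean active = writerActive.get(); // check the flag before attempting the read
            int n = super.read(buf, off, len);
            if (n != -1) {
                return n;
            }
            if (!active) {
                return -1; // the writer had already finished, so this EOF is the real one
            }
            sleepQuietly(); // EOF but the writer is still going: wait for the file to grow
        }
    }

    @Override
    public int read() throws IOException {
        byte[] single = new byte[1];
        int n = read(single, 0, 1);
        return n == -1 ? -1 : single[0] & 0xFF;
    }

    private void sleepQuietly() throws IOException {
        try {
            Thread.sleep(50);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IOException("Interrupted while waiting for the writer", e);
        }
    }
}

The ObjectInputStream wrapped around this then only ever sees a real EOF once the download has completed.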
Anyway, in case latency doesn't matter much, I would not bother starting to process before the download has finished. Oftentimes, there isn't much you can do with an incomplete list of objects.
Maybe some memory-mapped-file-based queue like Chronicle-Queue may help you. It's faster than dealing with files directly and may be even simpler to use.
You could also implement a HugeBufferingInputStream that internally uses a queue: it reads from its input stream and, in case it has a lot of data, spills it out to disk. This may be a nice abstraction, completely hiding the buffering.
There's also FileBackedOutputStream in Guava, automatically switching from using memory to using a file when getting big, but I'm afraid it's optimized for small sizes (with tens of gigabytes expected, there's no point in trying to use memory).
Are there alternative solutions?
If your consumer (the http client) is having trouble keeping up with the stream of data, you might want to look at a design where the client manages its own work in progress, pulling data from the server on demand.
RFC 7233 describes Range Requests:
devices with limited local storage might benefit from being able to request only a subset of a larger representation, such as a single page of a very large document, or the dimensions of an embedded image
HTTP Range requests on the MDN Web Docs site might be a more approachable introduction.
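If the server did honour them (the update above says this particular vendor does not), pulling one range at a time with java.net.http (Java 11+) could look roughly like this; the URL and chunk size are made up:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RangeDownload {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        long chunk = 10L * 1024 * 1024; // 10 MB per request
        long offset = 0;
        while (true) {
            HttpRequest request = HttpRequest.newBuilder(URI.create("https://example.com/big-resource"))
                    .header("Range", "bytes=" + offset + "-" + (offset + chunk - 1))
                    .build();
            HttpResponse<byte[]> response = client.send(request, HttpResponse.BodyHandlers.ofByteArray());
            byte[] body = response.body();
            // process(body) at the client's own pace before asking for the next range
            if (response.statusCode() != 206 || body.length < chunk) {
                break; // a non-partial response or a short final chunk means we're done
            }
            offset += body.length;
        }
    }
}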
This is the sort of thing that queueing servers are made for. RabbitMQ, Kafka, Kinesis, any of those. Perhaps KStream would work. With everything you get from the HTTP server (given your constraint that it cannot be broken up into units of work), you could partition it into chunks of bytes of some reasonable size, maybe 1024kB. Your application would push/publish those records/messages to the topic/queue. They would all share some common series ID so you know which chunks match up, and each would need to carry an ordinal so they can be put back together in the right order; with a single Kafka partition you could probably rely upon offsets. You might publish a final record for that series with a "done" flag that would act as an EOF for whatever is consuming it. Of course, you'd send an HTTP response as soon as all the data is queued, though it may not necessarily be processed yet.
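A rough sketch of the publishing side with the plain Kafka producer API; the topic name, chunk size, and the seriesId/ordinal/done header names are all illustrative, not a standard:

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Properties;
import java.util.UUID;

public class ChunkPublisher {

    public void publish(InputStream httpBody) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");

        String seriesId = UUID.randomUUID().toString(); // ties all chunks of one download together
        try (KafkaProducer<String, byte[]> producer = new KafkaProducer<>(props)) {
            byte[] buf = new byte[1024 * 1024];
            long ordinal = 0;
            int n;
            while ((n = httpBody.read(buf)) != -1) {
                ProducerRecord<String, byte[]> record =
                        new ProducerRecord<>("incoming-chunks", seriesId, Arrays.copyOf(buf, n));
                record.headers().add("ordinal", Long.toString(ordinal++).getBytes(StandardCharsets.UTF_8));
                producer.send(record);
            }
            // final marker record so the consumer knows the series is complete
            ProducerRecord<String, byte[]> done = new ProducerRecord<>("incoming-chunks", seriesId, new byte[0]);
            done.headers().add("done", "true".getBytes(StandardCharsets.UTF_8));
            producer.send(done);
        }
    }
}

Using the seriesId as the record key keeps all chunks of one download on the same partition, so the consumer can rely on offsets for ordering.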
Not sure if this would help in your case because you haven't mentioned what structure and format the data come to you in; however, I'll assume a beautifully normalised, deeply nested hierarchical XML (i.e. pretty much the worst case for streaming, right? ... Pega BIX?).
I propose a partial solution that could allow you to sidestep the limitation of not being able to control how your client interacts with the HTTP data server:
- deploy your own webserver, in whatever contemporary tech you please (which you do control); your local server will sit in front of your locally cached copy of the data
- periodically download the output of the webservice using a built-in HTTP querying library, a command-line util such as aria2c, curl, wget et al., an ETL (or whatever you please) directly onto a local device-backed .xml file; this happens as often as it needs to
- point your REST client to your own-hosted 127.0.0.1/modern_gigabyte_large/get... 'smart' server, instead of the old api.vendor.com/last_tested_on_megabytes/get... server
Some thoughts:
- you might need to refactor your data model to indicate that the XML webservice data that you and your clients are consuming was dated at the last successful run^ (i.e. update this date when the next ingest process completes)
- it would be theoretically possible for you to transform the underlying XML on the way through to better yield records in a streaming fashion to your webservice client (if you're not already doing this), but this would take effort - I could discuss this more if a sample of the data structure were provided
- all of this work can run in parallel to your existing application, which continues on your last version of the successfully processed 'old data' until the next version of 'new data' is available
^
In trade, you will now need to manage a 'sliding window' of data files, where each 'result' is a specific instance of your app downloading the webservice data and storing it on disc, then successfully ingesting it into your model:
- last (two?) good result(s), compressed (in my experience, gigabytes of XML pack down a helluva lot)
- the next pending/provisional result while you're streaming to disc / doing an integrity check / ingesting data (this becomes the current 'good' result, and the last 'good' result becomes the 'previous good' result)
- if we assume that you're ingesting into a relational DB: the current (and maybe previous) tables with the webservice data loaded into your app, and the next pending table
- switching these around becomes a metadata operation, but now your database must store the webservice data at least x2 (or x3 - whatever fits in your limitations)
... yes you don't need to do this, but you'll wish you did after something goes wrong :)
Looks like we're the only ones to use all of their data
This implies that there is some way for you to partition or limit the webservice feed - how are the other clients discriminating so as not to receive the full monty?
You can use in-memory caching techniques OR you can use Java 8 streams. Please see the following link for more info:
https://www.conductor.com/nightlight/using-java-8-streams-to-process-large-amounts-of-data/
Camel could maybe help you regulate the network load between the REST producer and consumer.
You might, for instance, introduce a Camel endpoint acting as a proxy in front of the real REST endpoint and apply some throttling policy before forwarding to the real endpoint:
from("http://localhost:8081/mywebserviceproxy")
    .throttle(...)
    .to("http://myserver.com:8080/myrealwebservice");
http://camel.apache.org/throttler.html
http://camel.apache.org/route-throttling-example.html
My 2 cents,
Bernard.
If you have enough memory, maybe you can use an in-memory data store like Redis.
When you get data from your REST endpoint, you can save it into a Redis list (or any other data structure that is appropriate for you).
Your consumer will consume data from the list.
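A small sketch with the Jedis client, assuming Redis on localhost and string records; a real setup would use a connection pool:

import redis.clients.jedis.Jedis;

import java.util.List;

public class RedisBuffer {

    // the downloading thread pushes records onto the tail of a list
    public void produce(String record) {
        try (Jedis jedis = new Jedis("localhost", 6379)) {
            jedis.rpush("incoming-data", record);
        }
    }

    // the processing thread pops records off at its own pace, blocking when the list is empty
    public String consume() {
        try (Jedis jedis = new Jedis("localhost", 6379)) {
            List<String> result = jedis.blpop(0, "incoming-data"); // returns [key, value]
            return result.get(1);
        }
    }
}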
I'm developing a 'WS oriented' application based on Spring/CXF/Oracle DB. Now, I'm stuck with some architectural considerations about the right approach to organize message processing (already stored in the DB).
Briefly, process looks as follows:
(A) Get the message from the client -> Validate -> Store -> Send response
(B) Process -> Update data
I consider two general approaches for part B of the process:
1) Use JMS queue
Just after validating and storing the incoming message details in the DB, publish a message to the JMS queue. On the other side, define a consumer which will retrieve the message and do the processing.
2) Fetch data to be processed
Manually fetch the data from the DB and process it.
Additional facts:
The processing won't be compute-intensive, so for now I don't think that work distribution will be needed (all in a single JVM).
All data is in a single DB schema.
So, I'm interested: what are the key factors for choosing JMS in such a case?
JMS would be a better approach. In the happy-path scenario, approach #2 works as well, but JMS would provide you some built-in capabilities, especially for the failure case. Though internally JMS may be using DB-based persistent storage, it provides a better interface to communicate that data.
For example, you could configure an error queue to track all the messages, whose processing failed.
It would also give you a scalable architecture, where some other app (in the future) could start consuming your messages and processing them.
Reliable: Due to asynchronous messaging, all the pieces don’t need to be up for the application to function as a whole.
Flexible: Think of a scenario in which you might want to process a certain type of data before all others (prioritization). JMS would provide a better approach than tweaking logic in a program.
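As a hedged sketch of option 1 with Spring JMS (the queue name and payload are illustrative; in practice the listener would run in a transacted session so a failure leads to redelivery or an error queue):

import org.springframework.jms.annotation.JmsListener;
import org.springframework.jms.core.JmsTemplate;
import org.springframework.stereotype.Component;

@Component
public class MessageProcessingFlow {

    private final JmsTemplate jmsTemplate;

    public MessageProcessingFlow(JmsTemplate jmsTemplate) {
        this.jmsTemplate = jmsTemplate;
    }

    // Part A: called right after the incoming message has been validated and stored
    public void enqueueForProcessing(long storedMessageId) {
        jmsTemplate.convertAndSend("message.processing", storedMessageId);
    }

    // Part B: consumer that loads the stored row by id, processes it, and updates the data
    @JmsListener(destination = "message.processing")
    public void process(Long storedMessageId) {
        // fetch the row, process, update
    }
}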
I'm making a server with Java that will provide chat services for Flash clients. The server will store data about each user in a .txt file somewhere on the server. For example, when a user logs in, information about this user is requested from the DatabaseManager class. It will then search through the database and return the information. The point is that when a lot of people log in in a short amount of time, the server is doing a lot of checks again and again.
The idea that I want to implement is that a connection class does something like this:
String userData = DatabaseManager.getUserData(this.username);
The DatabaseManager then doesn't search immediately; it stores this request in an array of requests, and then at a fixed interval it goes through the database one time and returns data to the clients that requested it. This way, when 15 people log in within a second, it won't go through all the information 15 times. How do I implement this?
You use a real DBMS like everyone else on the planet. I'm eager to hear a reason why someone wouldn't choose a DB for this application. I can't think of anything that would prevent it. Back in the day, RDBMS were ungainly, expensive, complicated beasts. Today, they're as readily available as tabloids at the checkout counter.
There are few excuses not to use a DB nowadays, and arguably there are more excuses to use the DB than the file system for most any application.
As above I'd recommend using an existing database solution like HSQLDB, you'd be far better off in the long run doing things this way rather than hacking your own solution together.
If you really want to do this anyway, have a look at the ScheduledExecutorService. You can then fire off a request to the executor service with a delay, and in that delay listen for more data and add it to the query.
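If you do go down that road, a rough sketch of the 'collect requests, run one pass' idea (names like UserDataRequest and lookupAll are made up; a real DBMS remains the better answer):

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class BatchingDatabaseManager {

    private static final class UserDataRequest {
        final String username;
        final CompletableFuture<String> result = new CompletableFuture<>();
        UserDataRequest(String username) { this.username = username; }
    }

    private final BlockingQueue<UserDataRequest> pending = new LinkedBlockingQueue<>();
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

    public BatchingDatabaseManager() {
        // flush the queued requests once per second and answer them all in a single pass
        scheduler.scheduleAtFixedRate(this::flush, 1, 1, TimeUnit.SECONDS);
    }

    /** Called by the connection handlers; blocks until the next batch run completes. */
    public String getUserData(String username) throws Exception {
        UserDataRequest request = new UserDataRequest(username);
        pending.add(request);
        return request.result.get();
    }

    private void flush() {
        List<UserDataRequest> batch = new ArrayList<>();
        pending.drainTo(batch);
        if (batch.isEmpty()) {
            return;
        }
        List<String> usernames = new ArrayList<>();
        for (UserDataRequest r : batch) {
            usernames.add(r.username);
        }
        Map<String, String> found = lookupAll(usernames); // one pass over the data store
        for (UserDataRequest r : batch) {
            r.result.complete(found.get(r.username));
        }
    }

    private Map<String, String> lookupAll(List<String> usernames) {
        // read the file (or, better, the database) once and return username -> data
        return Map.of();
    }
}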
First, this may be a stupid question, but I'm hoping someone will tell me so, and why. I also apologize if my explanation of what/why is lacking.
I am using a servlet to upload a HUGE (247MB) file, which is pipe (|) delimited. I grab about 5 of 20 fields, create an object, then add it to a list. Once this is done, I pass the list to an OpenJPA transactional method called persistList().
This would be okay, except for the size of the file. It's taking forever, so I'm looking for a way to improve it. An idea I had was to use a BlockingQueue in conjunction with the persist/persistList method in a new thread. Unfortunately, my skills in java concurrency are a bit weak.
Does what I want to do make sense? If so, has anyone done anything like it before?
Servlets should respond to requests within a short amount of time. In this case, the persist of the file contents needs to be an asynchronous job, so:
The servlet should respond with some text about the upload job, expected time to complete or something like that.
The uploaded content should be written to some temp space in binary form, rather than keeping it all in memory. This is the usual way the multi-part POST libraries do their work.
You should have a separate service that blocks on a queue of pending jobs. Once it gets a job, it processes it.
The 'job' is simply some handle to the temporary file that was written when the upload happened... and any metadata like who uploaded it, job id, etc.
The persisting service needs to insert a large number of rows but make the result appear 'atomic': either model the intermediate state as part of the table model(s), or write to temp spaces.
If you are writing to temp tables, and then copying all the content to the live table, remember to have enough log space and temp space at the database level.
If you have a full J2EE stack, consider modelling the job queue as a JMS queue, so recovery makes sense. Once again, remember to have proper XA boundaries, so all the row persists fall within an outer transaction.
Finally, consider also having a status check API and/or UI, where you can determine the state of any particular upload job: Pending/Processing/Completed.
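To make the queue-of-jobs idea concrete, here is a hedged sketch using a BlockingQueue and a single worker thread; the class and method names (UploadJob, persistList, the parsing step) are illustrative:

import java.nio.file.Path;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class UploadJobWorker {

    public static final class UploadJob {
        final Path tempFile;   // handle to the uploaded content written to temp space
        final String jobId;    // used for the status check API/UI
        public UploadJob(Path tempFile, String jobId) {
            this.tempFile = tempFile;
            this.jobId = jobId;
        }
    }

    private final BlockingQueue<UploadJob> jobs = new LinkedBlockingQueue<>();

    /** Called from the servlet after the multipart content has been written to temp space. */
    public void submit(UploadJob job) {
        jobs.add(job);
        // the servlet can now respond immediately with the jobId and an estimated completion time
    }

    public void startWorker() {
        Thread worker = new Thread(() -> {
            while (!Thread.currentThread().isInterrupted()) {
                try {
                    UploadJob job = jobs.take(); // blocks until a job is available
                    // parse job.tempFile (pipe-delimited), build the entities in batches,
                    // and call the transactional persist method (e.g. persistList) per batch,
                    // updating the job status to Pending/Processing/Completed as it goes
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        }, "upload-persist-worker");
        worker.setDaemon(true);
        worker.start();
    }
}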