I’m using Java to build a RESTful API around the functionality of Pentaho Data Integration.
I’ve implemented several endpoints, such as creating a repository containing jobs and transformations, running the jobs and transformations, displaying the image of a job or transformation, displaying the database connections within a repo, plus quite a few more.
I’m trying to build an endpoint that allows for the data sources to be changed, such as the hostname, database name, etc. But I’ve run into an issue when it comes to saving the new connection details.
Here’s a snippet of code I’ve got. I’ve hard coded the values simply for testing purposes.
I loop through an array list containing the DatabaseMeta and then change the values of the fields.
for (DatabaseMeta meta : databaseMeta) {
    meta.setHostName("test_host");
    meta.setDBPort("test_port");
    meta.setDBName("test_database");
    repositoryService.updateDataSource(databaseMeta);
}
The updateDataSource() method simply invokes repository.save() (which is part of the org.pentaho.di.repository package) and passes in the DatabaseMeta.
When this method executes, it creates a .kdb file in my repository, with the values I set above, and making a GET request to the endpoint returns the connection details from the new file.
However, I simply want to overwrite the values in the existing transformation connection and return them in the GET request.
Is there any way that this can be achieved?
Any help will be greatly appreciated.
Thanks.
I don't know about the Java integration part, but as far as pure Pentaho goes, the database connection defined in the KTR/KJB needs to use the same parameters that are declared in the KTR/KJB, as such:
This way, whatever parameters you pass through to the KTR/KJB will be swapped out in the connection.
There's a REST endpoint, which serves large (tens of gigabytes) chunks of data to my application.
The application processes the data at its own pace, and as incoming data volumes grow, I'm starting to hit the REST endpoint timeout.
That is, processing speed is lower than network throughput.
Unfortunately, there's no way to raise processing speed enough, as there's no "enough" - incoming data volumes may grow indefinitely.
I'm thinking of a way to store incoming data locally before processing, in order to release REST endpoint connection before timeout occurs.
What I've come up with so far is downloading the incoming data to a temporary file and reading (processing) that file simultaneously, using OutputStream/InputStream.
Sort of buffering, using a file.
This brings its own problems:
what if processing speed becomes faster than downloading speed for some time and I get EOF?
the file parser operates with ObjectInputStream, and it behaves weirdly on an empty file/EOF
and so on
Are there conventional ways to do such a thing?
Are there alternative solutions?
Please provide some guidance.
Upd:
I'd like to point out: http server is out of my control.
Consider it to be a vendor data provider. They have many consumers and refuse to alter anything for just one.
Looks like we're the only ones to use all of their data, as our client app processing speed is far greater than their sample client performance metrics. Still, we cannot match our app performance with network throughput.
Server does not support http range requests or pagination.
There's no way to divide data in chunks to load, as there's no filtering attribute to guarantee that every chunk will be small enough.
In short: we can download all the data in the given time before the timeout occurs, but we cannot process it.
Having an adapter between InputStream and OutputStream that acts as a blocking queue would help a ton.
You're using something like new ObjectInputStream(new FileInputStream(...)), and the solution for EOF could be to wrap the FileInputStream first in a WriterAwareStream, which would block on hitting EOF for as long as the writer is still writing.
Anyway, if latency doesn't matter much, I wouldn't bother to start processing before the download has finished. Oftentimes, there isn't much you can do with an incomplete list of objects.
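A minimal sketch of that wrapper idea, using only the JDK. The class name WriterAwareInputStream, the AtomicBoolean "writer finished" flag, and the 50 ms polling interval are assumptions made for illustration; there is no such class in the standard library. It relies on the fact that a file stream can return more data after an earlier EOF once the file has grown, and it assumes the downloader flushes and closes the file before setting the flag.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical wrapper: at EOF it waits and retries while the writer is still active. */
class WriterAwareInputStream extends InputStream {
    private final InputStream in;
    private final AtomicBoolean writerDone;

    WriterAwareInputStream(InputStream in, AtomicBoolean writerDone) {
        this.in = in;
        this.writerDone = writerDone;
    }

    @Override
    public int read() throws IOException {
        byte[] one = new byte[1];
        return read(one, 0, 1) == -1 ? -1 : one[0] & 0xFF;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        if (len == 0) {
            return 0;
        }
        while (true) {
            // Sample the flag BEFORE reading so the writer's last bytes cannot be missed.
            boolean doneBeforeRead = writerDone.get();
            int n = in.read(buf, off, len);
            if (n > 0) {
                return n;
            }
            if (doneBeforeRead) {
                return -1; // writer had finished and the file is drained: real EOF
            }
            try {
                Thread.sleep(50); // temporary EOF: give the writer time to catch up
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IOException("interrupted while waiting for the writer", e);
            }
        }
    }

    @Override
    public void close() throws IOException {
        in.close();
    }
}

class WriterAwareDemo {
    /** Simulates a slow download into a temp file while reading it concurrently. */
    static String run() throws Exception {
        Path file = Files.createTempFile("download", ".bin");
        AtomicBoolean writerDone = new AtomicBoolean(false);
        Thread writer = new Thread(() -> {
            try (OutputStream out = Files.newOutputStream(file)) {
                for (String part : new String[] {"slow ", "download ", "finished"}) {
                    out.write(part.getBytes(StandardCharsets.UTF_8));
                    out.flush();
                    Thread.sleep(100);
                }
            } catch (Exception e) {
                throw new IllegalStateException(e);
            } finally {
                writerDone.set(true); // runs after the file has been closed
            }
        });
        writer.start();
        try (InputStream in = new WriterAwareInputStream(Files.newInputStream(file), writerDone)) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        } finally {
            writer.join();
            Files.deleteIfExists(file);
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(run());
    }
}
```

An ObjectInputStream could be layered on top of such a wrapper, so that it simply waits instead of failing when the reader gets ahead of the download.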
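As a rough sketch of that simpler "download first, process afterwards" approach, assuming the payload is a plain sequence of serialized objects (the question does not say what the actual layout is). The class and method names are made up for illustration, and the byte array in main stands in for the HTTP response body.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.function.Consumer;

class SpoolThenProcess {
    /** Drains the remote stream to a temp file, then feeds each object to the handler. */
    static long spoolThenProcess(InputStream remoteBody, Consumer<Object> handler)
            throws IOException, ClassNotFoundException {
        Path spool = Files.createTempFile("spool", ".ser");
        try {
            // Phase 1: copy at network speed; the connection can be released afterwards.
            Files.copy(remoteBody, spool, StandardCopyOption.REPLACE_EXISTING);

            // Phase 2: process at the application's own pace from local disk.
            long count = 0;
            try (ObjectInputStream in = new ObjectInputStream(
                    new BufferedInputStream(Files.newInputStream(spool)))) {
                while (true) {
                    try {
                        handler.accept(in.readObject());
                        count++;
                    } catch (EOFException endOfFile) {
                        break; // the file is complete, so EOF really means "done"
                    }
                }
            }
            return count;
        } finally {
            Files.deleteIfExists(spool);
        }
    }

    /** Builds a small fake "response body" and runs it through the two phases. */
    static long demo() throws IOException, ClassNotFoundException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject("first");
            out.writeObject("second");
            out.writeObject("third");
        }
        return spoolThenProcess(new ByteArrayInputStream(bytes.toByteArray()),
                obj -> System.out.println("processing " + obj));
    }

    public static void main(String[] args) throws Exception {
        System.out.println(demo() + " objects processed");
    }
}
```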
Maybe some memory-mapped-file-based queue like Chronicle-Queue may help you. It's faster than dealing with files directly and may be even simpler to use.
You could also implement a HugeBufferingInputStream internally using a queue, which reads from its input stream, and, in case it has a lot of data, it spits them out to disk. This may be a nice abstraction, completely hiding the buffering.
There's also FileBackedOutputStream in Guava, which automatically switches from memory to a file when the data gets big, but I'm afraid it's optimized for small sizes (with tens of gigabytes expected, there's no point in trying to use memory).
Are there alternative solutions?
If your consumer (the http client) is having trouble keeping up with the stream of data, you might want to look at a design where the client manages its own work in progress, pulling data from the server on demand.
RFC 7233 describes the Range Requests
devices with limited local storage might benefit from being able to request only a subset of a larger representation, such as a single page of a very large document, or the dimensions of an embedded image
HTTP Range requests on the MDN Web Docs site might be a more approachable introduction.
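For illustration, this is roughly what a client-side range request looks like with the JDK's java.net.http API (Java 11+). The URL is a placeholder, and the approach only helps if the server honours the Range header: a server that supports it answers with 206 Partial Content, while one that does not will typically send the whole body with 200.

```java
import java.net.URI;
import java.net.http.HttpRequest;

class RangeRequestSketch {
    /** Builds a GET request asking only for bytes firstByte..lastByte (inclusive). */
    static HttpRequest rangeRequest(URI uri, long firstByte, long lastByte) {
        return HttpRequest.newBuilder(uri)
                .header("Range", "bytes=" + firstByte + "-" + lastByte)
                .GET()
                .build();
    }

    public static void main(String[] args) {
        // Placeholder URL, not a real endpoint.
        URI uri = URI.create("https://example.com/large-export");
        HttpRequest firstMegabyte = rangeRequest(uri, 0, 1024 * 1024 - 1);
        System.out.println(firstMegabyte.headers().firstValue("Range").orElse("(none)"));
    }
}
```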
This is the sort of thing that queueing servers are made for. RabbitMQ, Kafka, Kinesis, any of those. Perhaps KStream would work. With everything you get from the HTTP server (given your constraint that it cannot be broken up into units of work), you could partition it into chunks of bytes of some reasonable size, maybe 1024kB. Your application would push/publish those records/messages to the topic/queue. They would all share some common series ID so you know which chunks match up, and each would need to carry an ordinal so they can be put back together in the right order; with a single Kafka partition you could probably rely upon offsets. You might publish a final record for that series with a "done" flag that would act as an EOF for whatever is consuming it. Of course, you'd send an HTTP response as soon as all the data is queued, though it may not necessarily be processed yet.
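The chunking scheme can be sketched without committing to a particular broker. In the sketch below a java.util.concurrent.BlockingQueue stands in for the topic/queue, and the Chunk class, its field names, and the method names are all assumptions for illustration; with a real broker, publish and consume would call that broker's client instead.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

class ChunkedPublishSketch {
    /** One message: series ID, position within the series, payload, end-of-series flag. */
    static final class Chunk {
        final String seriesId;
        final long ordinal;
        final byte[] payload;
        final boolean done;

        Chunk(String seriesId, long ordinal, byte[] payload, boolean done) {
            this.seriesId = seriesId;
            this.ordinal = ordinal;
            this.payload = payload;
            this.done = done;
        }
    }

    /** Splits the body into fixed-size chunks and publishes them, then a "done" marker. */
    static void publish(InputStream body, String seriesId, int chunkSize,
                        BlockingQueue<Chunk> queue) throws IOException, InterruptedException {
        byte[] buf = new byte[chunkSize];
        long ordinal = 0;
        int n;
        while ((n = body.readNBytes(buf, 0, chunkSize)) > 0) {
            queue.put(new Chunk(seriesId, ordinal++, Arrays.copyOf(buf, n), false));
        }
        queue.put(new Chunk(seriesId, ordinal, new byte[0], true)); // acts as EOF
    }

    /** Reassembles one series, checking that the ordinals arrive in sequence. */
    static byte[] consume(BlockingQueue<Chunk> queue) throws InterruptedException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        long expected = 0;
        while (true) {
            Chunk chunk = queue.take();
            if (chunk.ordinal != expected++) {
                throw new IllegalStateException("chunk out of order: " + chunk.ordinal);
            }
            if (chunk.done) {
                return out.toByteArray();
            }
            out.write(chunk.payload, 0, chunk.payload.length);
        }
    }

    public static void main(String[] args) throws Exception {
        byte[] data = new byte[2500];
        Arrays.fill(data, (byte) 7);
        BlockingQueue<Chunk> queue = new LinkedBlockingQueue<>();
        publish(new ByteArrayInputStream(data), "series-1", 1024, queue);
        System.out.println("messages queued: " + queue.size());
        System.out.println("round trip ok: " + Arrays.equals(data, consume(queue)));
    }
}
```

A real consumer would process each chunk as it arrives instead of collecting everything in memory; the reassembly here is only there to show that the ordinals and the "done" marker are enough to restore the original byte order.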
not sure if this would help in your case because you haven't mentioned what structure & format the data are coming to you in, however, i'll assume a beautifully normalised, deeply nested hierarchical xml (ie. pretty much the worst case for streaming, right? ... pega bix?)
i propose a partial solution that could allow you to sidestep the limitation of your not being able to control how your client interacts with the http data server -
deploy your own webserver, in whatever contemporary tech you please (which you do control) - your local server will sit in front of your locally cached copy of the data
periodically download the output of the webservice using a built-in http querying library, a command-line util such as aria2c, curl, wget et al., an etl (or whatever you please) directly onto a local device-backed .xml file - this happens as often as it needs to
point your rest client to your own-hosted 127.0.0.1/modern_gigabyte_large/get... 'smart' server, instead of the old api.vendor.com/last_tested_on_megabytes/get... server
some thoughts:
you might need to refactor your data model to indicate that the xml webservice data that you and your clients are consuming was dated at the last successful run^ (ie. update this date when the next ingest process completes)
it would be theoretically possible for you to transform the underlying xml on the way through to better yield records in a streaming fashion to your webservice client (if you're not already doing this) but this would take effort - i could discuss this more if a sample of the data structure was provided
all of this work can run in parallel to your existing application, which continues on your last version of the successfully processed 'old data' until the next version 'new data' are available
^
in trade you will now need to manage a 'sliding window' of data files, where each 'result' is a specific instance of your app downloading the webservice data and storing it on disc, then successfully ingesting it into your model:
last (two?) good result(s) compressed (in my experience, gigabytes of xml packs down a helluva lot)
next pending/ provisional result while you're streaming to disc/ doing an integrity check/ ingesting data - (this becomes the current 'good' result, and the last 'good' result becomes the 'previous good' result)
if we assume that you're ingesting into a relational db, the current (and maybe previous) tables with the webservice data loaded into your app, and the next pending table
switching these around becomes a metadata operation, but now your database must store at least webservice data x2 (or x3 - whatever fits in your limitations)
... yes you don't need to do this, but you'll wish you did after something goes wrong :)
Looks like we're the only ones to use all of their data
this implies that there is some way for you to partition or limit the webservice feed - how are the other clients discriminating so as not to receive the full monty?
You can use in-memory caching techniques OR you can use Java 8 streams. Please see the following link for more info:
https://www.conductor.com/nightlight/using-java-8-streams-to-process-large-amounts-of-data/
Camel could maybe help you regulate the network load between the REST producer and consumer?
You might for instance introduce a Camel endpoint acting as a proxy in front of the real REST endpoint, apply some throttling policy, before forwarding to the real endpoint:
from("http://localhost:8081/mywebserviceproxy")
    .throttle(...)
    .to("http://myserver.com:8080/myrealwebservice");
http://camel.apache.org/throttler.html
http://camel.apache.org/route-throttling-example.html
My 2 cents,
Bernard.
If you have enough memory, maybe you can use an in-memory data store like Redis.
When you get data from your REST endpoint, you can save it into a Redis list (or any other data structure that is appropriate for you).
Your consumer will consume data from the list.
I'm a relatively novice programmer and newish to java, and I've been tasked with creating a distributed system that runs two types of applications; a single 'Server/Router' and as many Client-Server 'Nodes' as desired.
-The Server/Router will maintain a table of client connection info
-The Nodes will each send connection info when they spawn
-The Node A can request File F from Node B
---This Request is sent to the router, which looks up connection info for B and sends a request for F
---B begins streaming F to the router, which in turn streams F to A
That's the general idea. It sounds like it would be fairly simple if it weren't for the fact that I've never done ANY distributed computing before... So my question isn't one of how to correct code, but whether my design will work (and if it will not, how I might correct it)
So my idea is to create a public class AVRouter, another class AVNodeInfo, and a third class ConnectionMap; AVRouter will have a collection of AVNodeInfo objects which contain a name, port number, and IP address for each node. It will also have a Queue of ConnectionMap objects, which I'll get to shortly.
AVRouter will have a ServerSocket dedicated to receiving connection info from Nodes when they start up and populating its AVNodeInfo table with said data. It will have another port for file requests; it will use the connection info for the requester and the responder to generate a ConnectionMap object, which it will add to the queue.
The idea being that while there are ConnectionMaps in the queue, it will use the first one to facilitate a transfer.
The last class, AVNode, is much simpler; on spawning it sends its info to the Router, and then it waits for user input to name another node and a file it wishes to request from that node. It will send a completed request to the Router when one is available.
Logic for handling the AVNodeInfo table will probably just be handled with timeouts: if it's been X time since a node has made a request, the node will terminate on its own and the router will delete it from the table on its own... this is a small-scale proof-of-concept type project, so it's not really within the scope to handle this nitty-gritty just yet.
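A rough sketch of what such a table with timeout-based cleanup might look like, to make the design concrete. The field names, the use of a ConcurrentHashMap keyed by node name, and the touch/evictIdle methods are assumptions for illustration, not part of any existing code.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

class AVNodeTableSketch {
    /** Connection info for one node, plus the time of its last request. */
    static final class AVNodeInfo {
        final String name;
        final String ipAddress;
        final int port;
        final Instant lastRequest;

        AVNodeInfo(String name, String ipAddress, int port, Instant lastRequest) {
            this.name = name;
            this.ipAddress = ipAddress;
            this.port = port;
            this.lastRequest = lastRequest;
        }
    }

    private final Map<String, AVNodeInfo> nodes = new ConcurrentHashMap<>();
    private final Duration idleTimeout;

    AVNodeTableSketch(Duration idleTimeout) {
        this.idleTimeout = idleTimeout;
    }

    /** Called when a node announces itself or makes a request; refreshes its timestamp. */
    void touch(String name, String ipAddress, int port, Instant now) {
        nodes.put(name, new AVNodeInfo(name, ipAddress, port, now));
    }

    /** Looks up the connection info the router needs to reach a node. */
    Optional<AVNodeInfo> lookup(String name) {
        return Optional.ofNullable(nodes.get(name));
    }

    /** Drops every node that has been silent for longer than the timeout. */
    void evictIdle(Instant now) {
        nodes.values().removeIf(
                node -> Duration.between(node.lastRequest, now).compareTo(idleTimeout) > 0);
    }

    public static void main(String[] args) {
        AVNodeTableSketch table = new AVNodeTableSketch(Duration.ofSeconds(60));
        Instant start = Instant.now();
        table.touch("A", "10.0.0.1", 5001, start);
        table.touch("B", "10.0.0.2", 5002, start.plusSeconds(50));
        table.evictIdle(start.plusSeconds(90));
        System.out.println("A present: " + table.lookup("A").isPresent());
        System.out.println("B present: " + table.lookup("B").isPresent());
    }
}
```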
So I actually have two questions:
1) Will this design be fine, or should it be improved?
2) How exactly does one handle streaming data from source A through router B to destination C without first completing the transfer from A to B and only then sending from B to C?
I hope this question is within StackOverflow's scope; I know it's design rather than code, but I believe it's specific enough.
The principal concept sounds very OK and is probably the easiest to implement (with the fewest pitfalls). It's practically a standard approach.
You might want to consider, instead of sending a file from node to server and then forward to the requesting node, to have the nodes connect directly with each other. Node A would just ask the server where is file F and then connect directly to node B.
(that should reduce network load, since data travels a shorter route). But it adds quite some complexity (each node must be able to reach any other node and that makes each node a server). A composite approach would be to try direct connection and if that fails fall back to the via-server method.
You can just implement your original concept and when you have it working, see if you want/need that extension.
Edit: I would probably fuse NodeInfo and Connection (connection as a member of NodeInfo) - the server should have exactly one connection to each node (or if using multiple connections, have NodeInfo hold a collection of the open connections to that specific node).
EDIT: To add to the workability of your concept: it's generally what P2P sharing programs like BitTorrent implement. The "Tracker" acts as the initial "Router", telling each client about other peers. Peers then use direct connections to talk to each other. So it's practically identical to what you've come up with, only there is no data traffic using the Server/Router as a bridge (for obvious bandwidth concerns, and it would contradict the P2P idea).
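On the second question (streaming F from B through the router to A without holding the whole file on the router first), the usual pattern is a copy loop with a small fixed buffer: whatever arrives from B is written to A straight away. A minimal sketch follows; in a real router the two streams would come from the sockets' getInputStream() and getOutputStream(), and the in-memory streams in main are only stand-ins so the example runs on its own.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

class RelaySketch {
    /** Forwards bytes as they arrive; memory use is bounded by the buffer size. */
    static long relay(InputStream fromSource, OutputStream toDestination) throws IOException {
        byte[] buffer = new byte[8192];
        long total = 0;
        int n;
        while ((n = fromSource.read(buffer)) != -1) {
            toDestination.write(buffer, 0, n); // pass each piece on immediately
            total += n;
        }
        toDestination.flush();
        return total;
    }

    public static void main(String[] args) throws IOException {
        byte[] file = new byte[100_000]; // stands in for file F on node B
        ByteArrayOutputStream nodeA = new ByteArrayOutputStream();
        long relayed = relay(new ByteArrayInputStream(file), nodeA);
        System.out.println("relayed " + relayed + " bytes");
    }
}
```

Since Java 9, InputStream.transferTo(OutputStream) does the same job as this loop.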
Scroll towards the end for the solution to the topic's problem. The original question was asking for a somewhat different thing.
As part of a larger process, I need to fetch and link two related sets of data together. The way the data is retrieved (Dynamics CRM, n:n relationships...) forces us to retrieve the second set of data again so that it will have all the necessary information. During a part of a larger transformation of this data, I would like to access the http endpoint that is used to fetch the data from the CRM, retrieve the second set of data and process it. I can get the endpoint through DefaultEndpointFactory like so:
DefaultEndpointFactory def = new DefaultEndpointFactory();
def.getInboundEndpoint("uri").getConnector();
But there is no method to actually send the MuleMessage.
Solved:
The problem is that you cannot set inbound properties on the MuleMessage, and the flow depends on some of those to function (path, query params, etc.).
It seems you are able to set inbound-scoped properties with this:
m.setProperty("test", (Object)"test", PropertyScope.INBOUND);
Is there a way to make this approach work, or an alternative way to access the flow? I tried using mulecontext to get the flow:
muleContext.getRegistry().lookupFlowConstruct("myflow");
But it did not contain anything that looked useful.
Solution:
As David Dossot suggested in a comment on his answer, I was able to solve this with MuleClient's request method.
muleContext.getClient().request(url, timeout);
Then constructing the url as usual with GET parameters etc.
I'm not 100% sure about what you're trying to achieve but anyway, the correct way of using Mule transports from Java code is to use the MuleClient, which you can access with muleContext.getClient().
For example, the send method allows you to pass a map of properties that are automatically added to the inbound scope. Behind the scenes, Mule takes care of creating the endpoint needed for the operation.
Regarding the flow: what are you trying to do with it? Invoke it?
In my project, we have 2 REST calls which take too much time, so we are planning to optimize them. Here is how it works currently: we make the 1st call to system A and then pass the response to system B for further processing. Once we get the response from system B, we have to manipulate it further before passing it to the UI layer, and this entire process takes a lot of time. We planned on using Solr/Lucene, but since we are not the data owners, we can't implement that. Can someone please shed some light on how best this can be handled? We are using Spring MVC and Spring Web Flow. Thanks in advance!!
[EDIT:] This is not the actual scenario and I am writing this as an example for better understanding. Think of this as making a store locator call for a particular zip to get a list of 100 stores and then sending those 100 stores to another call to get a list of inventory etc. So, this list of stores would change for every zip code and also the inventory there.
If your query parameters to System A / System B are frequently the same, you can add a cache framework to your code. If you use Spring 3.1 or later, you can use the cache easily with an @Cacheable annotation on your code calling System A. See:
http://static.springsource.org/spring/docs/3.1.0.M1/spring-framework-reference/html/cache.html
The cache subsystem will cache the method's return value, so any processing done inside the annotated method is cached along with it.
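To show the effect without a Spring context, here is a plain-Java analogue of what such a cache does, built on ConcurrentHashMap.computeIfAbsent. This is not Spring's cache abstraction; the class name and the store-locator lookup are made up for illustration, following the zip code example from the question.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

class ZipLookupCacheSketch {
    private final Map<String, List<String>> cache = new ConcurrentHashMap<>();
    private final Function<String, List<String>> slowLookup;

    /** slowLookup stands in for "call System A, call System B, post-process". */
    ZipLookupCacheSketch(Function<String, List<String>> slowLookup) {
        this.slowLookup = slowLookup;
    }

    /** Runs the slow lookup only on the first request for a given zip code. */
    List<String> storesFor(String zip) {
        return cache.computeIfAbsent(zip, slowLookup);
    }

    public static void main(String[] args) {
        AtomicInteger backendCalls = new AtomicInteger();
        ZipLookupCacheSketch locator = new ZipLookupCacheSketch(zip -> {
            backendCalls.incrementAndGet(); // counts how often the slow path runs
            return List.of("store-" + zip + "-1", "store-" + zip + "-2");
        });
        locator.storesFor("10001");
        locator.storesFor("10001"); // served from the cache
        locator.storesFor("94105");
        System.out.println("backend calls: " + backendCalls.get());
    }
}
```

This sketch never evicts anything, so for data that changes (inventory, for example) a real cache would also need an expiry policy.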