Good day,
There is a SOAP interface which returns the data I need. The request passes initial parameters to the server, and returns a data set. I need to collect hundreds of data sets with different parameters, store them in an array and then iterate through it. This process may need to be repeated every few weeks.
I found that SoapUI can retrieve the data sets one by one, but that means a lot of manual work. In my experience SoapUI also freezes every 30 minutes or so, especially when large data sets are involved. I spent a few days retrieving the data via SoapUI, copy-pasting the responses into an array and then iterating through it with JavaScript, but this approach seems very inefficient.
I had never worked with SOAP before this, so most of the online tutorials fly over my head. SoapUI is quick and easy, but it doesn't seem suitable for automatically retrieving large quantities of data. I don't have a strong preference for a programming language; whatever is simplest to implement is fine with me. I'd be perfectly happy if the response came back as XML in plain string form, since I can then parse it with regex. I have a bit of experience with Java and Python, though, if that helps. Thank you!
You might want to take a look at Zeep, a Python SOAP client.
The response object you get back can be accessed just like a dictionary, so you can do things like response['key']. You can also serialize Zeep objects to native Python data structures (see zeep.helpers.serialize_object).
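If you would rather stay in Java (Zeep is Python-only), note that a SOAP call is ultimately just an HTTP POST of an XML envelope, which gives you the response as a plain XML string, exactly what you asked for. Here is a minimal sketch; the endpoint, SOAPAction and envelope below are placeholders, and a convenient trick is to copy the working request XML out of SoapUI and substitute your parameters:

    import java.io.*;
    import java.net.*;
    import java.nio.charset.StandardCharsets;

    public class SoapCall {
        // Posts a SOAP 1.1 envelope and returns the raw XML response as a string.
        static String call(String endpoint, String soapAction, String envelope) throws IOException {
            HttpURLConnection conn = (HttpURLConnection) new URL(endpoint).openConnection();
            conn.setRequestMethod("POST");
            conn.setDoOutput(true);
            conn.setRequestProperty("Content-Type", "text/xml; charset=utf-8");
            conn.setRequestProperty("SOAPAction", soapAction);
            try (OutputStream out = conn.getOutputStream()) {
                out.write(envelope.getBytes(StandardCharsets.UTF_8));
            }
            StringBuilder sb = new StringBuilder();
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
                String line;
                while ((line = in.readLine()) != null) sb.append(line).append('\n');
            }
            return sb.toString();
        }

        public static void main(String[] args) throws IOException {
            // Placeholder envelope; paste the request copied from SoapUI here.
            String envelope =
                "<soapenv:Envelope xmlns:soapenv=\"http://schemas.xmlsoap.org/soap/envelope/\">"
                + "<soapenv:Body><!-- request body goes here --></soapenv:Body>"
                + "</soapenv:Envelope>";
            String xml = call("http://example.com/service", "", envelope);
            System.out.println(xml);  // plain XML string; parse it however you like
        }
    }

Loop over your hundreds of parameter sets, call call(...) for each, and collect the strings in a list.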
This is not really a language-specific question; it's more of a pattern-related question, but I would like to tag it with some popular languages that I understand.
I don't have much experience with efficiently loading data in combination with searching it (especially in a mobile environment).
The strategy I've used before is to load everything into local memory and search there (e.g., using LINQ in C#).
Another strategy is to reload the data every time a new search is executed. This is of course inefficient, and we may also need to do something more complicated to sync the newly loaded data with the data already in local memory.
The last strategy I can think of is the hardest to implement: lazily load the data as part of the search execution. When a search runs, the returned results are cached locally, and each search looks in local memory before fetching new results from the service/server. The result of each search is therefore a combination of the local search and the server search. The goal is to reduce the amount of data reloaded from the server every time a search is run.
Here is what I can think of to implement this kind of strategy:
When a search runs, look in local memory first. Finishing this step yields the local result.
Before sending the search request to the server, we need to somehow pass along what is already in the local result so those items can be excluded from the server-side search. The request might therefore include a list of all the item IDs found in the first step.
With that request, the server can exclude the already-found items and return only new items to the client.
The last step is to merge the two results, local and server, into the final search result shown to the user in the UI.
I'm not sure this is the right approach, but what worries me is step 2: we need to send the list of item IDs found in step 1 to the server, and if there are hundreds or thousands of them, sending them may not be efficient. The query that excludes such a large number of items may not be efficient either (even in raw SQL or LINQ). I'm still stuck at this point.
Finally, if you have a better idea, ideally one you have implemented in a production project, please share it. I don't need concrete example code, just the idea or the steps to implement it.
Too long for a comment....
Concerning step 2, you can run into several problems:
Amount of data
Over time, you may accumulate so much data that even the set of its IDs gets bigger than a normal server answer. In the end, you could need to cache not only the server's previous answers on the client, but also the client's state on the server. What you're doing is a sort of synchronization, so look at rsync for inspiration; it's an old but smart Unix tool. git push might be inspiring as well.
Basically, by organizing your IDs into a tree, you can cheaply synchronize the information about what the client already knows between the server and the client. The price may be increased latency, as multiple round trips may be needed.
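To make that concrete, here is a minimal Java sketch of a one-level version of the tree idea (all names are made up; a real implementation would tune the range size and handle deletions): the client sends one hash per ID range, and full ID lists are exchanged only for ranges whose hashes differ.

    import java.util.*;

    public class IdRanges {

        // Group sorted IDs into fixed-size ranges and hash each range.
        // The client sends only (range, hash) pairs instead of all IDs.
        static Map<Long, Integer> rangeHashes(SortedSet<Long> ids, long rangeSize) {
            Map<Long, Integer> hashes = new TreeMap<Long, Integer>();
            for (long id : ids) {
                long range = id / rangeSize;        // which range this ID falls into
                Integer h = hashes.get(range);
                int prev = (h == null) ? 1 : h;
                // Order-dependent running hash; deterministic because ids are sorted.
                hashes.put(range, 31 * prev + (int) (id ^ (id >>> 32)));
            }
            return hashes;
        }

        public static void main(String[] args) {
            SortedSet<Long> client = new TreeSet<Long>(Arrays.asList(1L, 2L, 3L, 1001L));
            SortedSet<Long> server = new TreeSet<Long>(Arrays.asList(1L, 2L, 3L, 1001L, 1002L));
            Map<Long, Integer> c = rangeHashes(client, 1000);
            Map<Long, Integer> s = rangeHashes(server, 1000);
            for (Map.Entry<Long, Integer> e : s.entrySet()) {
                if (!e.getValue().equals(c.get(e.getKey()))) {
                    System.out.println("range " + e.getKey() + " differs; exchange its full ID list");
                }
            }
        }
    }

Nesting ranges inside ranges turns this into the tree: compare the root hashes first and descend only into subtrees that differ.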
Using the knowledge
It's quite possible that excluding the already-known objects from the SQL result could be more expensive than not excluding them, especially when you can't easily determine whether a to-be-excluded object would be part of the full answer anyway. Still, you can save bandwidth by post-filtering the data.
Being up to date
If your data change or get deleted, you may find your client keeping obsolete data. Having the client subscribe to relevant changes is one possibility; associating a (logical) timestamp with your IDs is another.
Summary
It can get pretty complicated, and you should measure before you even try. You may find out that the problem itself is hard enough, that achieving these savings is even harder, and that the gain is limited. You know the root of all evil, right?
I would approach the problem by treating local and remote as two different data sources (a sketch follows this list):
When a search is triggered, it is initiated against both data sources (local/in-memory and the server).
Most likely the local search will produce results first, so display them to the user.
When the results return from the server, append the non-duplicate ones.
Optional: in case server data has changed and some results were removed or modified, update or remove the corresponding local results and refresh the view.
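A rough Java sketch of that flow, assuming hypothetical searchLocal/searchServer methods and items with stable IDs (in a real app the append would have to be marshalled back onto the UI thread):

    import java.util.*;
    import java.util.concurrent.CompletableFuture;

    public class DualSourceSearch {

        static class Item {
            final String id;
            Item(String id) { this.id = id; }
        }

        interface ResultView {
            void show(List<Item> items);
            void append(List<Item> items);
        }

        // Hypothetical data sources; replace with your real implementations.
        List<Item> searchLocal(String query)  { return new ArrayList<Item>(); }
        List<Item> searchServer(String query) { return new ArrayList<Item>(); }

        void search(String query, ResultView view) {
            List<Item> local = searchLocal(query);   // in-memory, returns quickly
            view.show(local);                        // show local hits immediately

            Set<String> seen = new HashSet<String>();
            for (Item i : local) {
                seen.add(i.id);
            }

            // Fire the server search in the background and append what is new.
            CompletableFuture.supplyAsync(() -> searchServer(query))
                .thenAccept(remote -> {
                    List<Item> fresh = new ArrayList<Item>();
                    for (Item i : remote) {
                        if (seen.add(i.id)) {        // skip items already displayed
                            fresh.add(i);
                        }
                    }
                    view.append(fresh);              // in a real app: on the UI thread
                });
        }
    }

This avoids sending any ID list to the server at all; the client simply de-duplicates on arrival, trading a little bandwidth for a much simpler protocol.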
Here is a biological database: http://www.genecards.org/index.php?path=/GeneDecks
Usually, if I type in a gene name (a string, e.g. TF53) and submit it, it comes back with a result page, and the result can also be saved as a tab-delimited/XML file. However, I have a list of more than a thousand gene names. How can I automate this series of steps with a Java program?
I know this question is quite broad and there are probably various ways to do it. With only a little experience in Java programming, I'd really appreciate it if someone could suggest an easy way to do it. Thanks.
One possibility is to read the gene names sequentially from your list and, for each one, send a request like this:
http://www.genecards.org/index.php?path=/GeneDecks/ParalogHunter/<your gene name>/100/{%22Sequence_Paralogs%22:%221%22,%22Domains%22:%221%22,%22Super_Pathways%22:%221%22,%22Expression_Patterns%22:%221%22,%22Phenotypes%22:%221%22,%22Compounds%22:%221%22,%22Disorders%22:%221%22,%22Gene_Ontologies%22:%221%22}
(so basically mimic what the site does).
For example:
http://www.genecards.org/index.php?path=/GeneDecks/ParalogHunter/TNFRSF10B/100/{%22Sequence_Paralogs%22:%221%22,%22Domains%22:%221%22,%22Super_Pathways%22:%221%22,%22Expression_Patterns%22:%221%22,%22Phenotypes%22:%221%22,%22Compounds%22:%221%22,%22Disorders%22:%221%22,%22Gene_Ontologies%22:%221%22}
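A minimal Java sketch of that loop; the input file genes.txt (one gene name per line), the per-gene output files and the pause between requests are all assumptions:

    import java.io.*;
    import java.net.*;
    import java.nio.charset.StandardCharsets;

    public class GeneFetcher {

        // Query template from the URL above, split around the gene name.
        private static final String PREFIX =
            "http://www.genecards.org/index.php?path=/GeneDecks/ParalogHunter/";
        private static final String SUFFIX =
            "/100/{%22Sequence_Paralogs%22:%221%22,%22Domains%22:%221%22,"
            + "%22Super_Pathways%22:%221%22,%22Expression_Patterns%22:%221%22,"
            + "%22Phenotypes%22:%221%22,%22Compounds%22:%221%22,"
            + "%22Disorders%22:%221%22,%22Gene_Ontologies%22:%221%22}";

        public static void main(String[] args) throws Exception {
            try (BufferedReader genes = new BufferedReader(new FileReader("genes.txt"))) {
                String gene;
                while ((gene = genes.readLine()) != null) {
                    gene = gene.trim();
                    if (gene.isEmpty()) continue;
                    URL url = new URL(PREFIX + URLEncoder.encode(gene, "UTF-8") + SUFFIX);
                    try (BufferedReader in = new BufferedReader(
                             new InputStreamReader(url.openStream(), StandardCharsets.UTF_8));
                         PrintWriter out = new PrintWriter(gene + ".html", "UTF-8")) {
                        String line;
                        while ((line = in.readLine()) != null) {
                            out.println(line);       // save the raw response per gene
                        }
                    }
                    Thread.sleep(1000);              // be polite between requests
                }
            }
        }
    }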
However, they might not like people using their site this way (submitting a lot of automated requests), so you may want to check their policy on that. Another thing to check is whether they have an official API that can be used for batch retrieval of gene information.
I'm trying to transfer a stream of strings from my C++ program to my Java program in an efficient manner, but I'm not sure how to do this. Can anyone post links or explain the basic idea of how to implement this?
I was thinking of writing my data into a text file and then reading the text file from my Java program, but I'm not sure that this will be fast enough. I need a single string to be transferred in about 16 ms, so that the Java program receives around 60 data strings from the C++ program per second.
A text file with 60 strings' worth of content can easily be written and read back in merely a few milliseconds.
Some alternatives, if you find that you are running into timing troubles anyway:
Use socket programming (Beej's guide is a good primer: http://beej.us/guide/bgnet/output/html/multipage/index.html). Sockets should easily be fast enough; a minimal sketch of the Java side appears after this list.
There are other alternatives, such as the TIBCO messaging service, which would be an order of magnitude faster than what you need: http://www.tibco.com/
Another alternative would be to use a MySQL table to pass the data, potentially with an environment variable set to indicate that the table should be queried for the most recent entries.
Or I suppose you could just use an environment variable itself to convey all of the info -- 60 strings isn't very much.
The first two options are the more respectable solutions though.
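As promised, a minimal sketch of the Java receiving end of the socket option, assuming the C++ side connects to port 5000 and sends one newline-terminated string per message (the port and framing are assumptions):

    import java.io.*;
    import java.net.*;

    public class StringReceiver {
        public static void main(String[] args) throws IOException {
            try (ServerSocket server = new ServerSocket(5000);     // assumed port
                 Socket client = server.accept();                  // wait for the C++ sender
                 BufferedReader in = new BufferedReader(
                         new InputStreamReader(client.getInputStream(), "UTF-8"))) {
                String line;
                while ((line = in.readLine()) != null) {   // one string per line
                    process(line);
                }
            }
        }

        private static void process(String s) {
            System.out.println("received: " + s);  // placeholder for real handling
        }
    }

On the C++ side, connect() to localhost:5000 and write each string followed by '\n'; at 60 strings per second this is far below what a local socket can sustain.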
Serialization:
protobuf or s11n
Pretty much any way you do this will be fast enough. A file is likely to be the slowest, and even that could be around 10 ms total. A socket will take a similar time if you have to create a new connection as well (it's the connect, not the data, that takes most of the time). Using a socket has the advantage that the sender and receiver know how much data has been produced; if you use a file instead, you need another way to signal "the file is complete now, you should read it", e.g. a socket ;)
If the C++ and Java code run in the same process, you can use a direct ByteBuffer to wrap a C array and import it into Java in around one microsecond.
I am building an application which retrieves data and parses it into a two-dimensional array object before sending it back to the client. The application then uses the data to create an image on HTML5 canvas. The array contains several thousand entries, and when I built the application using GWT-RPC, it worked properly, but it took too long to transfer the array to the client (several minutes).
I found this issue when searching for solutions: http://code.google.com/p/google-web-toolkit/issues/detail?id=860
The last response there was from a couple of months ago, and there doesn't seem to be a conclusive answer on the best way to pass large arrays from server to client. Since deRPC is being deprecated (I have yet to actually try it), is RequestFactory the only choice? RequestFactory seems intended for accessing a database rather than for performing calculations and returning large results, and I haven't found an example where a request triggers a calculation and passes the result back. Should I create a JSON object instead of an array in my current implementation and keep RPC, or am I missing something about RequestFactory?
The issue you linked to is about the slow speed of de-serialization on the client, not the data transfer speed. You should first measure the transfer speed using Firebug or a similar tool, then subtract that time from the total time of the RPC call to find out how much time is spent in de-serialization. Roughly speaking, the breakdown goes like this:
Total RPC time = time-spent-on-server + network-time-in-out + deserialization-time
You should first find out which part is the real bottleneck; if it turns out to be the data transfer speed, you'll probably need to re-think your design. See my answer to a related question.
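For the measuring itself, here is a rough client-side sketch in GWT; dataService, fetchArray and params are hypothetical stand-ins for your service stub:

    // imports: com.google.gwt.core.client.Duration, com.google.gwt.core.client.GWT,
    //          com.google.gwt.user.client.rpc.AsyncCallback
    final Duration total = new Duration();          // starts the clock
    dataService.fetchArray(params, new AsyncCallback<double[][]>() {  // hypothetical stub
        public void onSuccess(double[][] result) {
            // total elapsed = server time + network + deserialization
            GWT.log("total RPC time: " + total.elapsedMillis() + " ms");
        }
        public void onFailure(Throwable caught) {
            GWT.log("RPC failed: " + caught);
        }
    });

If the server also reports its own processing time inside the response, subtracting it from the total gives you network-plus-deserialization, and Firebug's network panel then isolates the transfer time.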
EDIT:
IMO, until you have measured the above breakdown, you should put aside the question of whether JSON or another approach is right for you.
My need is to aggregate real-time statistics from a web application server.
For example:
How many requests of content type X have been done
How long it takes to process request of type Y
And so on.
This data has to be completely in memory, not in a file, for best performance. It doesn't log each and every request but instead only stores counters of various aspects.
The easiest way I know is to store the values in an SQL-like table and run SQL-like queries. The benefit is that indexing comes off the shelf without development effort. I guess an embedded Java database like Apache Derby would do the job.
The other way to go is to implement a collection (say, a list) plus a hash table for each "index column". That way it's all done with the Java/Scala collections API, but I actually have to implement the indexing mechanism myself, test it, maintain it, etc.
So my question is: which way do you think is preferable, and are there other ways to implement this feature easily and quickly?
Thanks.
I would choose the H2 database; I have very positive experience with it, and its performance is great as well.
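A minimal sketch of the counters-in-a-SQL-table idea with an in-memory H2 database (assuming the H2 jar is on the classpath; the table and column names are made up):

    import java.sql.*;

    public class StatsTable {
        public static void main(String[] args) throws SQLException {
            // "mem:" keeps the whole database in memory; nothing touches disk.
            try (Connection conn = DriverManager.getConnection("jdbc:h2:mem:stats")) {
                try (Statement st = conn.createStatement()) {
                    st.execute("CREATE TABLE request_stats ("
                            + "content_type VARCHAR(100) PRIMARY KEY, "
                            + "request_count BIGINT NOT NULL, "
                            + "total_millis BIGINT NOT NULL)");
                }
                record(conn, "text/html", 42);   // e.g. one request took 42 ms
                record(conn, "text/html", 10);
                record(conn, "image/png", 5);
                try (Statement st = conn.createStatement();
                     ResultSet rs = st.executeQuery(
                         "SELECT content_type, request_count, "
                         + "total_millis / request_count AS avg_ms "
                         + "FROM request_stats ORDER BY request_count DESC")) {
                    while (rs.next()) {
                        System.out.println(rs.getString(1) + ": " + rs.getLong(2)
                                + " requests, avg " + rs.getLong(3) + " ms");
                    }
                }
            }
        }

        // Increment the counter and accumulated time for one content type.
        static void record(Connection conn, String type, long millis) throws SQLException {
            try (PreparedStatement upd = conn.prepareStatement(
                    "UPDATE request_stats SET request_count = request_count + 1, "
                    + "total_millis = total_millis + ? WHERE content_type = ?")) {
                upd.setLong(1, millis);
                upd.setString(2, type);
                if (upd.executeUpdate() == 0) {          // first time we see this type
                    try (PreparedStatement ins = conn.prepareStatement(
                            "INSERT INTO request_stats VALUES (?, 1, ?)")) {
                        ins.setString(1, type);
                        ins.setLong(2, millis);
                        ins.executeUpdate();
                    }
                }
            }
        }
    }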
Are you sure an SQL database is well suited to your needs? Have a look at JavaMelody to see if it covers them, and if it doesn't, take a look at JRobin for a rolling-database implementation.
I would imagine you only need one collection per type of information you want to collect. To improve performance and simplify the code, I would use Trove's TObjectIntHashMap, e.g.
How many requests of content type X have been done
    // import gnu.trove.map.hash.TObjectIntHashMap; (Trove 3)
    TObjectIntHashMap<ContentType> contentTypeCount =
            new TObjectIntHashMap<ContentType>();
    // increment(key) only bumps keys that already exist, so use
    // adjustOrPutValue, which seeds the entry on first use:
    contentTypeCount.adjustOrPutValue(contentType, 1, 1);
How long it takes to process request of type Y
    // import gnu.trove.map.hash.TObjectLongHashMap; (Trove 3)
    TObjectLongHashMap<ProcessType> contentTypeTime =
            new TObjectLongHashMap<ProcessType>();
    // Accumulate total processing time; the last argument seeds the
    // entry the first time the key is seen:
    contentTypeTime.adjustOrPutValue(processType, processTime, processTime);
I don't see how you can make it any shorter/simpler/faster by using the other approaches you mentioned.
On my machine, performing increment(key) takes around 15 ns (billionths of a second) on average.
I would also mention Twitter's Ostrich, a statistics library for Scala.
It provides counters, gauges, and timing meters.
The data is accessible through an HTTP REST API.