I have a Java SOAP web service initially designed in Axis 1 which isn't meeting my performance requirements.
The request I'm most concerned about is one used to add lots (millions of rows) of data to the database. On the client side, I'll just be looping through files, pushing this data up to my web service. Each row has three elements, so the request looks something like:
<soap:Envelope>
  <soap:Header/>
  <soap:Body>
    <AddData>
      <Data>
        <FirstName>John</FirstName>
        <LastName>Smith</LastName>
        <Age>42</Age>
      </Data>
    </AddData>
  </soap:Body>
</soap:Envelope>
I'm finding the following performance trends:
When I do one row per request, I can get around 720 rows per minute.
When I encapsulate multiple rows into a single request, I can get up to 2,400 rows per minute (100 rows per request).
Unfortunately, that performance isn't going to meet our requirements, as we have hundreds of millions of rows to insert (at 2,500 rows per minute, it would take about 2 months to load all the data in).
So I've been looking into the application to see where our bottleneck is. Each request of 100 rows is taking about 2.5 seconds (I've tried a few different servers and get similar results). I've found the following:
Client-side overhead is negligible (from monitoring the performance of my own client and using SOAP UI)
The database activity only accounts for about 10% (.2s) of the total time, so Hibernate caching, etc. won't help out much.
The network overhead is negligible (<1ms ping time from client to server, getting >10MB/s throughput with each request sending <20KB).
So this leaves some 2 seconds unaccounted for. The only other piece of this puzzle that I can point a finger at is the overhead of deserializing the incoming requests on the server side. I noticed that Axis 2 claims speed improvements in this area, so I ported this function over to an Axis 2 web service, but didn't get the speedup I was looking for (the overall time per request improved by about 10%).
Am I underestimating the amount of time needed to deserialize 100 of the elements described above? I can't imagine that deserialization alone could possibly take ~2 seconds.
What can I do to optimize the performance of this web application and cut down on that 2 second overhead?
Thanks in advance!
========= The next day.... ===========
The plot thickens...
At the recommendation of @millhouse, I investigated single-row requests on a production server a bit more. I found that they could be suitably quick on good hardware. So I tried adding 1,000 rows using increments ranging from 1 (1,000 separate requests) to 1,000 (a single request).
1 row / Request - 14.5 seconds
3/req - 5.8s
5/req - 4.5s
6/req - 4.2s
7/req - 287s
25/req - 83s
100/req - 22.4s
1000/req - 4.4s
As you can see, the extra 2-second lag kicks in at 7 rows per request (approximately 2 extra seconds per request when compared to 6 rows per request). I can reproduce this consistently. Larger numbers of rows per request all had similar overhead, but it became less noticeable when inserting 1,000 rows per request. Database time grew linearly and was still fairly negligible compared to the overall request time.
So I've found that I get best performance using either 6 rows per request, or thousands of rows per request.
Is there any reason why 7 rows per request would perform so much worse than 6? The machine has 8 cores, and we have 10 connections in the session pool (i.e. I have no idea where the threshold of 6 is coming from).
I used Axis2 for a similar job about 5 years ago, but I'm afraid I can't offer any real "magic bullet" that will make it better. I recall our service performing at hundreds-per-second not seconds-per-hundred though.
I'd recommend either profiling your request-handling, or simply adding copious amounts of logging (possibly using one of the many stopwatch implementations around to give detailed timings) and seeing what's using the time. Does a request really take 2 seconds to get through the Axis layer to your code, or is it just accumulating through lots of smaller things?
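The "copious logging with a stopwatch" approach can be as little as one small helper class. This is an illustrative sketch (the class and method names are mine, not from Axis or any timing library):

```java
import java.util.concurrent.TimeUnit;

// Minimal stopwatch helper for coarse per-phase request timings.
public class PhaseTimer {
    private final long start = System.nanoTime();
    private long last = start;

    /** Logs the time since the previous lap and returns total elapsed ms. */
    public long lap(String label) {
        long now = System.nanoTime();
        long lapMs = TimeUnit.NANOSECONDS.toMillis(now - last);
        long totalMs = TimeUnit.NANOSECONDS.toMillis(now - start);
        System.out.printf("%s: +%d ms (total %d ms)%n", label, lapMs, totalMs);
        last = now;
        return totalMs;
    }
}
```

Create one at the top of your request handler and call something like lap("deserialized"), lap("validated"), lap("persisted") at the phase boundaries; whichever lap swallows the missing ~2 seconds is your culprit.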
If the processing for a single request in isolation is fast, but things get bogged down once you start loading the service up, investigate your app server's thread settings. I seem to recall having to break my processing into synchronous and asynchronous parts (i.e. the synchronous part doing the bare minimum to give a suitable response back to the client, and heavy-lifting being done in a thread from a pool), but that might not be appropriate for your situation.
Also make sure that construction of a new User object (or whatever it is) doesn't do anything too expensive (like grabbing a new ID, from a service, which wraps a DAO, which hits a slow database server, which runs a badly-written stored-procedure, which locks an entire table ;-) )
Related
I am trying to improve the response time of my REST APIs (I am using Spring Boot 2.1.3 with Groovy 2.4.14 and MSSQL). I noticed that at certain times the more popular GET APIs take much longer than they should (>4 seconds as opposed to 0.3 seconds). I've looked into CPU usage, memory, blocking threads, blocking DB calls, DeferredResult, fetching schemes, Spring Boot and JPA settings, etc.; none of these were a bottleneck or were simply not relevant (the database search is a simple repository.findById() for a domain object with a few primitive fields).
List<Object> getObjectForId(String id) {
    curCallCount++
    List<Object> objList = objectRepository.findAllById(id)
    curCallCount--
    objList
}
The issue seems to be that the more existing calls to the service that have not exited at the time of the call, the longer the response time of the API call. There is almost a linear correlation: if there are 50 existing calls to the service, repository.findById() takes 5 seconds, and if there are 200, it takes 20 seconds. Meanwhile, with 200 concurrent calls, a manual database query is still fast (0.3 seconds).
Is this expected for the Spring repository calls? Where is this overhead from repository.findById() coming from in an environment when there are many concurrent calls to the service, even though the manual database search performance is not affected?
Don't know about the API side, but I would certainly start by looking at the compilation/recompilation rate in SQL Server and looking at your plan cache usage. Using a string variable might be passing all your parameters in as nvarchar(max) and limiting the reuse of your query plans.
The issue was the Hikari connection pool size being too small (10 by default), so that when there are more than 10 concurrent calls, requests must wait for a free connection. Increasing this (to 150, for example) resolved the issue.
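For reference, in Spring Boot the pool size can be raised via the standard HikariCP property (150 here just mirrors the value above; tune it to your workload and your database's connection limits):

```properties
# application.properties
spring.datasource.hikari.maximum-pool-size=150
```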
I have a web application and have to fetch 1000 records using a REST API. Each record is around 500 bytes.
What is the best way to do it from the following and why? Is there another better way to do it?
1. Fetch one record at a time. Trigger 1000 calls in parallel.
2. Fetch in groups of 20. Trigger 50 calls in parallel.
3. Fetch in groups of 100. Trigger 10 calls in parallel.
4. Fetch all 1000 records together.
As @Dima said in the comments, it really depends on what you are trying to do.
How are the records being consumed?
Is it a back-end process or program-to-program communication? If so, then it depends on the difficulty of processing once the client receives the records. Is it going to take a long time to process each record: 1 ms per record, or 100 ms per record? This option depends entirely on the possible processing time per record.
Is there a front end consuming this for human users? If so, batch requesting would be good for reasons like paginating results. In such cases, I would go with option 2 or 3 personally.
In general though, depending upon the sheer volume of records, I would recommend considering batching requests (by triggering fewer calls). Heuristically speaking, you are likely to get better overall network throughput that way.
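For illustration, the grouping itself is trivial; here is a sketch of splitting the 1000 ids into fixed-size batches that could each be fetched by one parallel call (the class and method names are mine):

```java
import java.util.ArrayList;
import java.util.List;

// Splits a list of items into consecutive batches of at most batchSize.
public class Batcher {
    public static <T> List<List<T>> batches(List<T> items, int batchSize) {
        List<List<T>> out = new ArrayList<>();
        for (int i = 0; i < items.size(); i += batchSize) {
            // copy the sublist so each batch is independent of the source list
            out.add(new ArrayList<>(items.subList(i, Math.min(i + batchSize, items.size()))));
        }
        return out;
    }
}
```

With 1000 ids and a batch size of 100 you get the 10 groups of option 3; with a batch size of 20 you get the 50 groups of option 2.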
If you add more specifics, I'll happily update my answer, but until then, general will have to do!
Best for what case? What are you trying to optimize?
I did some tests a while back on a similar situation, with slightly larger payloads (images), where my goal was to utilize network efficiently on a high-latency setup (across continents).
My results were that after a minimal amount of parallelism (like 3-4 threads), the network was almost perfectly saturated. We compared it to specific (proprietary) UDP-based transfer protocols, and there was no measurable difference.
Anyway, it may be not what you are looking for, but sometimes having a "dumb" http endpoint is good enough.
I have a web service with a load balancer that maps requests to multiple machines. Each of these requests end up sending a http call to an external API, and for that reason I would like to rate limit the number of requests I send to the external API.
My current design:
Service has a queue in memory that stores all received requests
I rate limit how often we can grab a request from the queue and process it.
This doesn't work when I'm using multiple machines, because each machine has its own queue and rate limiter. For example: when I set my rate limiter to 10,000 requests/day, and I use 10 machines, I will end up processing 100,000 requests/day at full load because each machine processes 10,000 requests/day. I would like to rate limit so that only 10,000 requests get processed/day, while still load balancing those 10,000 requests.
I'm using Java and MYSQL.
Use memcached or Redis to keep an API request counter per client, and check on every request whether it is over the rate limit.
If you think checking on every request is too expensive, you can try Storm to process the request log and calculate the request counters asynchronously.
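To make the counter idea concrete, here is a minimal in-process sketch. In production the map would be replaced by a shared store such as Redis (e.g. INCR plus a daily EXPIRE on the key), so that every machine behind the load balancer sees the same count; all names here are illustrative:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Fixed-window daily counter: allows at most dailyLimit requests per client.
public class DailyRateLimiter {
    private final long dailyLimit;
    private final ConcurrentHashMap<String, AtomicLong> counters = new ConcurrentHashMap<>();

    public DailyRateLimiter(long dailyLimit) {
        this.dailyLimit = dailyLimit;
    }

    /** Returns true and counts the request if the client is under its limit. */
    public boolean tryAcquire(String clientId) {
        AtomicLong counter = counters.computeIfAbsent(clientId, k -> new AtomicLong());
        if (counter.incrementAndGet() > dailyLimit) {
            counter.decrementAndGet(); // roll back: request rejected
            return false;
        }
        return true;
    }
}
```

A real implementation would also reset (or expire) the counters at the day boundary, which the shared-store variant gets for free via key expiry.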
The two things you stated were:
1)"I would like to rate limit so that only 10,000 requests get processed/day"
2)"while still load balancing those 10,000 requests."
First off, it seems like you are using a divide and conquer approach where each request from your end user gets mapped to one of the n machines. So, for ensuring that only the 10,000 requests get processed within the given time span, there are two options:
1) Implement a combiner which will route the results from all n machines to another endpoint which the external API is then able to access. This endpoint is able to keep a count of the number of jobs being processed, and if it's over your threshold, reject the job.

2) Another approach is to store the number of jobs you've processed for the day as a value in your database. Then it's common practice to check whether your threshold value has been reached upon the initial request of the job (before you even pass it off to one of your machines). If the threshold value has been reached, reject the job at the beginning. This, coupled with an appropriate message, has the advantage of a better experience for the end user.
In order to ensure that all these 10,000 requests are still load balanced so that no one CPU processes more jobs than any other, you should use a simple round-robin approach to distribute your jobs over the n CPUs. With round robin, as opposed to a bin/categorization approach, you ensure that job requests are distributed as uniformly as possible over your n CPUs.

A downside to round robin is that, depending on the type of job you're processing, you might replicate a lot of data as you start to scale up. If this is a concern for you, think about implementing a form of locality-sensitive hashing (LSH). While a good hash function distributes the data as uniformly as possible, LSH exposes you to having one CPU process more jobs than the others if a skew in the attribute you choose to hash against has a high probability of occurring. As always, there are tradeoffs associated with both, so you'll know best for your use case.
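The round-robin distribution described above fits in a few lines; this sketch (names are mine) cycles through the workers in order, so each gets the same share of jobs:

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

// Thread-safe round-robin selector over a fixed list of workers.
public class RoundRobin<T> {
    private final List<T> workers;
    private final AtomicLong next = new AtomicLong();

    public RoundRobin(List<T> workers) {
        this.workers = workers;
    }

    /** Returns the next worker in rotation. */
    public T pick() {
        return workers.get((int) (next.getAndIncrement() % workers.size()));
    }
}
```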
Why not implement a simple counter in your database and make the API client implement the throttling?
User Agent -> LB -> Your Service -> Q -> Your Q consumers(s) -> API Client -> External API
API client checks the number (for today) and you can implement whatever rate limiting algorithm you like. eg if the number is > 10k the client could simply blow up, have the exception put the message back on the queue and continue processing until today is now tomorrow and all the queued up requests can get processed.
Alternatively you could implement a tiered throttling system, eg flat out til 8k, then 1 message every 5 seconds per node up til you hit the limit at which point you can send 503 errors back to the User Agent.
Otherwise you could go the complex route and implement a distributed queue (eg AMQP server) however this may not solve the issue entirely since your only control mechanism would be throttling such that you never process any faster than the something less than the max limit per day. eg your max limit is 10k so you never go any faster than 1 message every 8 seconds.
If you're not averse to using a library/service, https://github.com/jdwyah/ratelimit-java is an easy way to get distributed rate limits.
If performance is of utmost concern you can acquire more than 1 token in a request so that you don't need to make an API request to the limiter 100k times. See https://www.ratelim.it/documentation/batches for details of that
I'm currently writing a crawler in java, and I'm stuck by something.
In my crawler, I have threads downloading a static distant page, using HttpURLConnection.
I tried to download one small file (2kb) with different parameters. The connection has a timeout set to 1s.
I noticed that if I use 100 threads for the download, I succeed in making 3 times more requests per second (~10k requests per second), whereas when using 500 threads I succeed in making "only" 4k requests per second.
I would have expected to make at least as many requests per second as with 100 threads.
Could you explain why it behaves this way, and whether there is some parameter to activate somewhere to increase the maximum number of parallel connections?
Thanks :)
I think it's just a matter of your CPU: at a certain point, switching threads is more expensive than the time gained by not waiting for a single connection.
I would try to maximize parallel connections by finding the optimal upper limit experimentally.
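Capping parallelism with a fixed-size pool, as suggested above, might look like this sketch (the download body is a stand-in counter; in a real crawler it would open the HttpURLConnection):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Runs `tasks` downloads on a pool of `poolSize` threads and returns
// how many completed. The task body is a placeholder for a page fetch.
public class BoundedCrawler {
    public static int runAll(int tasks, int poolSize) {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize); // tune empirically
        AtomicInteger done = new AtomicInteger();
        for (int i = 0; i < tasks; i++) {
            Runnable fetch = done::incrementAndGet; // stand-in for downloading one page
            pool.submit(fetch);
        }
        pool.shutdown();
        try {
            pool.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return done.get();
    }
}
```

Sweeping poolSize (e.g. 50, 100, 200, 500) against measured requests per second is the simplest way to find the knee of the curve for your hardware.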
I work at a retailer and we are considering introducing CQ5 as a CMS.
However, after doing some research and talking to consultants it turns out, that there may be things that may be "complicated". Perhaps one of you can shed a little light on this.
The first thing is, we were told that when you use the Multi Site Manager to create multi-language pages (about 80 languages), the update process can take as long as half an hour until a change is ultimately published. Has anyone experienced something similar?
The other thing is that the TarOptimizer has pretty long running times. I was told that runs taking up to 24 hours are not uncommon. Again my question: has anyone had such a problem, or an explanation for this?
I am really looking forward to your response.
These are really 2 separate questions, but I'll address them based on my experience.
The update process for creating new multi-language pages will vary based on the number of languages, and also the number of publish instances and web-servers (assuming you're using dispatcher to cache) you are running. This is because the replication process is where the bottleneck is (at least in my experience), and as such if you're trying to push out a large amount of content across a large number of publishers with a large number of front-end web-servers whose cache needs to be cleared, there will be some delay in getting this to happen since replication is an asynchronous process. The longest delay I've seen for this has been in the 10-15 minute range, that was with 12 publishers and 12 front end webservers, but this comes with the obvious caveat that your mileage may vary.
For the Tar Optimization job, I'd encourage you to take a look at this page as it has a lot of good info about the Tar Optimizer job and how to tune it. The job can take a long time to run when you have a large repository, especially on an instance with a large number of write operations, but the run times can be configured so that it only runs during a given time period, and it will pick up where it left off the night before if the total run time is longer than the allowed run time. By default, it runs from 2-5 am each night, so if it takes more than that 3-hour window, it will continue where it left off the next night, allowing it to optimize the entire repository over a period of a few days if needed.