I am developing a Google App Engine application which reads and edits a big spreadsheet with around 150 columns and 500 rows. Aside from the specific size (it may vary), I am looking for a way to improve performance, since most of the time I get a 500 Internal Server Error (as you can see below).
java.lang.RuntimeException: Unable to complete the HTTP request
Caused by: java.net.SocketTimeoutException: Timeout while fetching URL:
https://spreadsheets.google.com/feeds/worksheets/xxxxxxxxxxxxxxxxxxxxxxx/private/full
In the code snippet below you can see how I read my SpreadSheet and which line throws the exception.
for (SpreadsheetEntry entry : spreadsheets) {
    if (entry.getTitle().getPlainText().compareTo(spreadsheetname) == 0) {
        spreadsheet = entry;
    }
}
WorksheetFeed worksheetFeed = service.getFeed(spreadsheet.getWorksheetFeedUrl(), WorksheetFeed.class);
List<WorksheetEntry> worksheets = worksheetFeed.getEntries();
WorksheetEntry worksheet = worksheets.get(0);
URL listFeedUrl = worksheet.getListFeedUrl();
// The following line is the one that generates the error
ListFeed listFeed = service.getFeed(listFeedUrl, ListFeed.class);
for (ListEntry row : listFeed.getEntries()) {
    String content = row.getCustomElements().getValue("rowname");
    String content2 = row.getCustomElements().getValue("rowname2");
}
I already improved performance using structured queries. Basically, I apply filters within the URL, which allows me to retrieve only the few rows I need. Note that I still sometimes get the above error no matter what.
URL listFeedUrl = new URI(worksheet.getListFeedUrl().toString() + "?sq=rowname=" + URLEncoder.encode("\"" + filter + "\"", "UTF-8")).toURL();
My problem, however, is different. First of all, there are certain times when I must read ALL rows but only a FEW columns (around 5). I still need to find a way to achieve that. I do know there is another parameter, "tq", which allows selecting columns, but that statement requires letter notation (such as A, B, AA); I'd like to use column names instead.
Most importantly, I need to get rid of the 500 Internal Server Error. Since it sounds like a timeout problem, I'd like to increase that value to a reasonable amount of time. My users can wait a few seconds, especially because the failure seems completely random. When it works, the page loads in around 2-3 seconds. When it doesn't, however, I get a 500 Internal Server Error, which is going to be really frustrating for the end user.
Any ideas? I couldn't find anything in the App Engine settings. The only idea I have had so far is to split the spreadsheet into multiple spreadsheets (or worksheets) in order to read fewer columns. However, if there is an option that allows me to increase the timeout, it would be awesome.
EDIT: I was looking around on the Internet and I may have found something that can help me. I just found out that the service object offers a setConnectTimeout method; I'm testing it right away.
// Set timeout
int timeout = 60000;
service.setConnectTimeout(timeout);
Time Out
I use a 10-second timeout with a retry. It works OK for me.
Sheet size
I have used it with 80,000 cells at a time. It works fine, and I have not seen the retry fail. I am using CellFeed, not ListFeed.
Yes, it does not like large sheets; small sheets of 1,000 cells or so are much faster. Even if I only write to part of the sheet, small sheets are much faster. (It feels like it recalculates whole sheets, as it does not look to be down to data volume, but I am not sure.)
Exponential backoff
Zig suggests an exponential backoff. I would be interested in numbers: what timeout values and failure rates people get with exponential backoff, and also the impact of sheet size.
I suspect starting with a 3-second timeout and doubling it with every retry might work, but I have not tested it.
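For what it's worth, a minimal sketch of that retry-with-exponential-backoff idea, assuming the same GData service and listFeedUrl as in the question (the starting timeout, cap and attempt count are arbitrary, untested values):

// Hedged sketch: retry the feed request with an exponentially growing timeout.
int timeoutMs = 3000;                               // start with 3 seconds
final int maxAttempts = 5;
ListFeed listFeed = null;
for (int attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
        service.setConnectTimeout(timeoutMs);
        service.setReadTimeout(timeoutMs);
        listFeed = service.getFeed(listFeedUrl, ListFeed.class);
        break;                                      // success, stop retrying
    } catch (IOException | ServiceException e) {
        if (attempt == maxAttempts) throw e;        // give up after the last attempt
        timeoutMs = Math.min(timeoutMs * 2, 60000); // double, capped at 60 s
    }
}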
The real problem is that you shouldn't use a spreadsheet for this. It will throw many errors, including rate limits, if you attempt to make heavy use of it.
At a minimum you will need to use exponential backoff to retry errors, but it will still be slow. Doing a query by URL is not efficient either.
The solution is to dump the spreadsheet into the datastore, then do your queries from there. Since you also edit the spreadsheet, it's not that easy to keep it in sync with your datastore data. A general solution requires task queues to correctly handle the timeouts and the large amount of data (cells).
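A minimal sketch of that direction, assuming the App Engine task queue API (com.google.appengine.api.taskqueue) and a hypothetical /importSheet handler that reads its slice of the worksheet and writes the rows as datastore entities:

// Hedged sketch: enqueue one background task per chunk of rows; the handler
// behind /importSheet (hypothetical) does the actual feed read and datastore writes.
Queue queue = QueueFactory.getDefaultQueue();
int chunkSize = 100;                               // rows per task, arbitrary
for (int start = 1; start <= worksheet.getRowCount(); start += chunkSize) {
    queue.add(TaskOptions.Builder.withUrl("/importSheet")
            .param("startIndex", Integer.toString(start))
            .param("maxResults", Integer.toString(chunkSize)));
}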
We have identified in our user base that, since the last Google Fit app update, there has been a dramatic drop in data, and since it began we have tried to identify the issue in our code. Given the timing, we thought the version we were using (18.0 at the time) was the problem.
Upgrading to SDK 20.0 did not improve the results, but it stopped the data from stalling. Currently we can assume that 50-60% of the users connected to Google Fit through the SDK are no longer correctly retrieving data according to the (previously working) implementation. They are not lost, and they still send some bits here and there, but it's no longer what it used to be.
This graph showcases the timeline of events that led us to the conclusion that one of the sides must be doing something wrong.
The code examples below have been stripped of most data processing code for readability, but it is there.
Our Fitness client requests FitnessOptions.ACCESS_READ for all the types mentioned below, plus others depending on the app, every time it is initialised, either in the foreground or in the background, making sure we only request those accepted by the user.
We can confirm that the following data types no longer return any value when requesting the daily total or the local device daily total, but do return data chunks for the same period when requested in a non-aggregated read:
DataType.TYPE_STEP_COUNT_DELTA
DataType.TYPE_CALORIES_EXPENDED
DataType.TYPE_HEART_RATE_BPM
We also tried changing those that allow it to their aggregate counterparts, to no avail:
DataType.AGGREGATE_CALORIES_EXPENDED
DataType.AGGREGATE_STEP_COUNT_DELTA
This is our current getDailyTotal implementation, which worked before the update and is written exactly as the examples on the developer site show:
Fitness.getHistoryClient(context, account)
.readDailyTotal(type)
.addOnSuccessListener {
Logger.i("${type.name}::DailyTotal::Success")
onResponse(it)
}
This currently returns 0 no matter what time of day it is asked.
Then we have our complementary code, which emulates what getDailyTotal does under the hood, also as per the developer site examples:
from: day start at 00:00:00, UTC+1
to: day end at 23:59:59, UTC+1
type: any DataType.
val readRequest = DataReadRequest.Builder()
.enableServerQueries()
.aggregate(type)
.bucketByTime(1, TimeUnit.DAYS)
.setTimeRange(from.time, to.time, TimeUnit.MILLISECONDS)
.build()
val account = GoogleSignIn
.getAccountForExtension(context, fitnessOptions!!)
GFitClient.request(context, account, readRequest) {
if (it == null) {
aggregatedRequestError(type)
} else {
Logger.i(TAG, "Aggregated ${type.name} received.")
}
}
The common result here is either 1) a null or empty result, 2) actually getting the result (in the case of DataType.TYPE_STEP_COUNT_DELTA it sometimes happens), or 3) an ApiException with code 5012: this data type can't be aggregated.
We are using the single-argument aggregate, since the two-argument form that could be called as (type, type.aggregate) has been deprecated for a couple of versions already, although some developer site examples still use it.
Using .enableServerQueries() or not does not change the final result.
Finally, we assume the worst, request everything for that day no matter what, and then aggregate manually. This usually reports results where the other approaches did not; sadly, those results are never conclusive enough to feel comfortable with.
val readRequest = DataReadRequest.Builder()
.enableServerQueries()
.read(type)
.bucketByTime(1, TimeUnit.DAYS)
.setTimeRange(from.time, to.time, TimeUnit.MILLISECONDS)
.build()
val account = GoogleSignIn
.getAccountForExtension(context, fitnessOptions!!)
This tends to work, but the manual processing of the data is complex given the intricately nested structure of buckets, datasets and data points.
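For illustration, a rough Java sketch of that manual aggregation (the question's code is Kotlin, but the SDK types are the same); response is assumed to be the DataReadResponse obtained from the readData(readRequest) call, and steps are used as the example field:

// Hedged sketch: walk the nested bucket -> dataset -> datapoint structure and
// sum the step counts by hand.
int totalSteps = 0;
for (Bucket bucket : response.getBuckets()) {
    for (DataSet dataSet : bucket.getDataSets()) {
        for (DataPoint point : dataSet.getDataPoints()) {
            // For TYPE_STEP_COUNT_DELTA the relevant field is FIELD_STEPS.
            totalSteps += point.getValue(Field.FIELD_STEPS).asInt();
        }
    }
}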
We have also noticed issues retrieving data that is clearly visible in the Fit app but doesn't appear through the SDK, for example Huawei Health activities appearing in the app while the SDK returns only a subset of them, and the other way around: the SDK returning us data (for example, a whole night's worth of sleep stages: light, REM, deep...) while the Fit app shows that same sleep as a single Sleep block without any stages.
Sleep session as shown in a third-party app, with the same data the SDK returns to us:
The same sleep session shown in the Google Fit app:
As far as the documentation says:
For the Android APIs, read by data type and the Fit platform will return the merged stream by default. This automatically includes all data available to your app, including data written by other apps. You won't be able to see a list of which apps or devices the data came from with the Android APIs.
We believe that the merged stream is not behaving properly: not in real time (which could be explained by a delay between the app showing data straight from the backend and the SDK not having the data written yet), but also not within minutes or hours; sometimes the data never shows up at all.
To understand how we retrieve this data: we have a background WorkManager CoroutineWorker that runs every once in a while (whenever the system allows it, given Doze mode restrictions; what we would prefer, and ask for via the WorkManager configuration, is once every hour or two, to keep the data up to date with what the Fit app displays). It requests data from the last update up to the end of the previous day, and/or it requests today's daily total (or the total up to the current time, depending on how far down the "doesn't work" funnel we go, and also on the last update's date).
Is there anything wrong with our implementation?
Has Google Fit changed the way it reports its data to connected apps?
Can we somehow get more accurate data?
Is there any way to request the same data differently, or more efficiently? We are mostly interested in getting daily summaries, totals and averages rather than time buckets/sessions. We request both, but they go into different data funnels covering different use cases.
Our solution has ended up being a rough succession of checks for data: on every failure we try a different way.
I want to know how I can fetch a complete Google spreadsheet with more than 80K rows using list-based feeds.
To make it more clear, the flow of the application is as follows:
Connect to the Google spreadsheet using service.getFeed().
Using list-based feeds, fetch all the rows and push a task into the task queue to enter the data into the datastore.
Problems:
1. The application works fine on localhost, but when deployed, a timeout error occurs stating "HardDeadlineExceededError". I have read the documentation for this exception and found that handling it would not be of much use. The following code is used to establish the connection and get the list-based feed:
// Presumably inside a retry loop that runs until timeoutflag is set.
try
{
    lf = service.getFeed(url, ListFeed.class); // Exception occurs at this point
    timeoutflag = 1;                           // success: stop retrying
}
catch (Exception e)
{
    // On failure, grow the connect/read timeouts before the next attempt.
    timeoutinc += 3;
    service.setConnectTimeout(timeoutinc * 10000);
    service.setReadTimeout(timeoutinc * 10000);
}
2. The second exception I get is an out-of-memory exception:
java.lang.OutOfMemoryError: Java heap space
at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse (AbstractSAXParser.java:1213)
at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse (SAXParserImpl.java:642)
at org.xml.sax.helpers.ParserAdapter.parse (ParserAdapter.java:430)
...
I have gone through Google's official documentation and found that I could use cell-based feeds, but since my application depends entirely on list-based feeds, shifting to cell-based feeds is not an optimal choice for my use case, as I need to fetch the data row by row and not cell by cell.
Please guide...!
1. The application works fine on localhost, but when deployed, a timeout error occurs stating "HardDeadlineExceededError". I have read the documentation for this exception and found that handling it would not be of much use.
Based on this documentation, if the DeadlineExceededException is not caught, an uncatchable HardDeadlineExceededError is thrown. The instance is terminated in both cases, but the HardDeadlineExceededError does not give any time margin to return a custom response. To make sure your request returns within the allowed time frame, you can use the ApiProxy.getCurrentEnvironment().getRemainingMillis() method to checkpoint your code and return if you have no time left.
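A rough sketch of that checkpointing idea, assuming the row processing happens in a loop and the remaining work can be handed off to a task queue (hasMoreRowsToProcess, processNextChunk, nextRowIndex and the /continueImport URL are hypothetical placeholders):

// Hedged sketch: stop before the request deadline and defer the rest to a task.
while (hasMoreRowsToProcess()) {
    if (ApiProxy.getCurrentEnvironment().getRemainingMillis() < 10000) {
        // Less than ~10 s left: enqueue the remainder instead of hitting the hard deadline.
        QueueFactory.getDefaultQueue().add(
                TaskOptions.Builder.withUrl("/continueImport")
                        .param("startIndex", Integer.toString(nextRowIndex)));
        break;
    }
    processNextChunk();
}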
2. The second exception I get is an out-of-memory exception.
Based on this related SO post, you may be getting the error because the heap is being over-allocated. The only way to solve it, other than increasing the heap space, is to see what is using all the heap space and then make sure that objects which stay around longer than they are needed can be collected. If a file or something else that can't be collected is making you run out of heap space, you should re-engineer your program if the file sizes aren't constant and keep changing; if they are constant, just increase the heap space above the file size. You can check this thread for more information.
For your question, "How to fetch a Google Spreadsheet having more than 80K rows using ListFeed in Java (GAE)?", I suggest checking this documentation. This sample code might also help:
// Make a request to the API and get all spreadsheets.
SpreadsheetFeed feed = service.getFeed(SPREADSHEET_FEED_URL,
SpreadsheetFeed.class);
List<SpreadsheetEntry> spreadsheets = feed.getEntries();
if (spreadsheets.size() == 0) {
    // TODO: There were no spreadsheets, act accordingly.
}
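One way to keep both the request deadline and the heap under control is to page through the list feed in fixed-size chunks rather than fetching all 80K rows in one call; a rough sketch, assuming the same GData service and feed URL (url) as above, with an arbitrary chunk size:

// Hedged sketch: read the list feed in chunks via start-index / max-results
// (com.google.gdata.client.spreadsheet.ListQuery).
int chunkSize = 1000;                              // rows per request, arbitrary
int startIndex = 1;                                // list feed rows are 1-based
while (true) {
    ListQuery query = new ListQuery(url);
    query.setStartIndex(startIndex);
    query.setMaxResults(chunkSize);
    ListFeed chunk = service.query(query, ListFeed.class);
    if (chunk.getEntries().isEmpty()) {
        break;                                     // no rows left
    }
    // e.g. push this chunk into the task queue / datastore here
    startIndex += chunk.getEntries().size();
}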
I've been facing an issue with adding a new sheet (tab) to an existing spreadsheet.
Before the main issue, I will try to explain why I need to do such a thing; maybe there is another solution. I'm trying to upload a spreadsheet with several sheets. When the sheets are smaller, everything is OK. But when I make the "service.spreadsheets().create(spreadsheet).execute()" request with really big sheets (like two sheets with 40k cells), I get a normal response, but the created spreadsheet contains only an empty "Untitled document" with an empty tab. That's the first thing that bothers me: why don't I receive something like "your insert is too big"?
So I would like to create the spreadsheet, insert the first tab (as a smaller request), then add another tab (sheet), and so on. But all I found on Stack Overflow and in the Google documentation is "BatchUpdateSpreadsheetRequest". This request doesn't allow me to add an already created sheet; it just creates a new empty sheet, which is really annoying.
Am I missing some API call? I also found some limits mentioned in the documentation and on Stack Overflow, but there is no clear information about how big requests with sheets can be (I've seen the 400k-rows figure, and what you can find here), but that didn't help much.
Can someone tell me how to "split" spreadsheet creation into several smaller requests so that the created spreadsheet contains all the data?
Thanks
The V4 API currently has a limit of 10 MB of data per request, though I don't think we advertise this fact anywhere in the documentation right now.
To work around it, you can use multiple different requests in a BatchUpdateSpreadsheetRequest: an AddSheetRequest as you mentioned, plus an UpdateCellsRequest, or any number of other requests. Check out the guide that details which requests deal with which portions of the spreadsheet.
If you have specific portions of the spreadsheet you're curious about how to set, please follow up.
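A minimal sketch of that batch, assuming the Sheets v4 Java client and the same service (plus a spreadsheetId) as in the question; the sheetId, tab title and sample cell are placeholders, and each batch should stay under the request size limit mentioned above:

// Hedged sketch: add an empty tab, then fill it with one or more UpdateCellsRequests.
List<Request> requests = new ArrayList<>();

// 1) Add the new (empty) tab with an explicit sheetId so later requests can target it.
requests.add(new Request().setAddSheet(new AddSheetRequest()
        .setProperties(new SheetProperties().setSheetId(2).setTitle("Data"))));

// 2) Write one slice of rows into that tab (repeat in further batches for more rows).
List<RowData> rows = new ArrayList<>();
rows.add(new RowData().setValues(Arrays.asList(
        new CellData().setUserEnteredValue(new ExtendedValue().setStringValue("hello")))));
requests.add(new Request().setUpdateCells(new UpdateCellsRequest()
        .setStart(new GridCoordinate().setSheetId(2).setRowIndex(0).setColumnIndex(0))
        .setRows(rows)
        .setFields("userEnteredValue")));

service.spreadsheets()
        .batchUpdate(spreadsheetId, new BatchUpdateSpreadsheetRequest().setRequests(requests))
        .execute();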
I have an RDF file that is 7 MB and has ~80k statements.
When starting the application, I have the following code, which retrieves a list of items I need to show to the user:
NodeIterator iterator = technologyModel.listObjectsOfProperty(subject);
while (iterator.hasNext()) {
    RDFNode node = iterator.nextNode();
    myCollection.add(node.asLiteral().getString().trim());
}
Note: this code works just fine, returns about 3k results, and is the first time "technologyModel" is accessed.
Obviously, before doing that, I have to load the dataset/model, and here is the problem.
Case (1) When I load the dataset/model from a RDF file, doing this:
InputStream in = FileManager.get().open(ParamsHelper.sourceRDF);
technologyModel.read(in, "RDF/XML-ABBREV");
the technologyModel seems instantly loaded and the first code posted runs in less than a second.
Case (2) However, when I try to load the model from a TDB database (previously loaded with the same RDF file used on first case), with this code:
dataset = TDBFactory.createDataset(ParamsHelper.tdbBaseDir);
dataset.begin(ReadWrite.READ) ;
technologyModel = dataset.getNamedModel("http://a.example.biz/technology");
dataset.end();
the technologyModel doesn't seem to be instantly loaded, and even though the first code posted returns as expected, it takes about 30 seconds on the first call.
If I call that same code after the first time, or, for example, insert another operation like technologyModel.listSubjects() before calling this code for the first time, it runs immediately, as expected.
It seems to me that in the second case the model is really loaded only after the first operation performed on it. Does that make any sense?
I don't want to keep my data in an RDF file, but rather have a TDB database storing the triples. That's why the second option seems to fit me better.
Can anyone help me with this? I hope I have explained the problem clearly.
Thanks in advance.
There are two effects here:
TDBFactory.createDataset doesn't load any data - it connects to the database. Data is loaded into memory (cached) as it is used, so when you call listObjectsOfProperty the first time, all caches are cold and the database may well be slow. It will be quite sensitive to the hardware you are running on at this point.
The second is that Model API calls can have access patterns that are database-unfriendly. It is better to use SPARQL on the dataset.
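A minimal sketch of that, run against the same TDB dataset and named graph as in the question (the property URI is a placeholder for whatever property is actually being queried):

// Hedged sketch: the same "objects of one property" lookup expressed as SPARQL,
// executed inside a read transaction on the TDB dataset.
dataset.begin(ReadWrite.READ);
QueryExecution qexec = QueryExecutionFactory.create(
        "SELECT ?o WHERE { GRAPH <http://a.example.biz/technology> "
        + "{ ?s <http://example.org/property> ?o } }", dataset);
try {
    ResultSet results = qexec.execSelect();
    while (results.hasNext()) {
        myCollection.add(results.next().getLiteral("o").getString().trim());
    }
} finally {
    qexec.close();
    dataset.end();
}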
By the way: listObjectsOfProperty does not take a subject - it takes a property, and it can touch a lot of the database. If myCollection is a set, then you may be adding a lot more than 3K items.
I'm querying data in the Facebook Graph API explorer:
access_token="SECRET"
GET https://graph.facebook.com/me/home?limit=20&until=1334555920&fields=id
result:
{
"data": [
]
}
I was shocked, since there are many feed items on my "home".
Then I set the limit to 100, and I got a list of feed items.
What's going on here? Does the "limit" parameter affect the Graph API's result?
I tried increasing the limit to 25 and querying again; there was one feed item.
So what's the relationship between "limit" and "until"?
Facebook's API can be a little weird sometimes because of the data you're trying to access, and there are a few parts to this question.
Limits
The limits are applied when data is returned, but before permissions and access controls are generated, which is explained in this blog post from last year: Limits in the Graph API.
Permissions
More importantly, even if you give yourself a token with every FB permission possible, you still won't be able to access everything that you created. Say you post something on a friend's feed, but their feed is not set to Public privacy. Any queries against that friend's feed with your token will never return data (or at least that was the case around a year ago).
API Itself
One of the most awesome bugs I found in the Graph API when I was working with it last year is the way it handles paging. The Graph API allows three filters: limit, offset, and since/until. Somewhere Facebook recommends (and rightly so) that you use the since/until dates exclusively for paging whenever possible. Ignoring theoretical debates as to why you would do that rather than use offsets, on a practical level the following query used to degrade over time:
// This obviously isn't valid as written, but the params change as described
limit=fixed-value&offset=programmatic-increase&since=some-fixed-date-here
The reason: Date ranges and offsets don't behave well with each other. As an example, say I made the following initial query:
// My example query
limit=20&since=1334555920
--> {#1,#2, ... #20}
Naturally you would want to page through more data. The result would be something like this (I can't remember the exact pattern, but the top n results would be repeats and the list of results would be truncated by n/2 or something similar):
// My example query
limit=20&since=1334555920&offset=20
---> {#10, #11 ... #25}
I never figured out why it happened, but eventually the query would taper off to return nothing and you would only get around 50-100 unique values. If you paged using dates exclusively however, you could go on for as long as the data would let you.
The caveat is that this was a bug and it was a while ago. The main lesson here is that I never would have found it without modifying my query in ways that should have produced exactly the same results (a particular date range based on posts #10-30, compared with limit=20, offset=10), yet the results were quite different.
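To make the date-only paging concrete, here is a rough sketch of the loop; fetchPage() and Post are hypothetical stand-ins for whatever HTTP/JSON handling is already in place, not part of any SDK:

// Hedged sketch: page purely by moving "until" below the oldest post seen so far,
// keeping "limit" fixed, instead of growing an offset.
long until = 1334555920L;                    // upper bound, epoch seconds
int limit = 20;
List<Post> allPosts = new ArrayList<>();
while (true) {
    List<Post> page = fetchPage("https://graph.facebook.com/me/home"
            + "?limit=" + limit + "&until=" + until
            + "&fields=id,created_time&access_token=" + accessToken);
    if (page.isEmpty()) {
        break;                               // nothing older left
    }
    allPosts.addAll(page);
    // Next request: strictly older than the oldest post returned in this page.
    until = page.get(page.size() - 1).createdTime() - 1;
}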