BigQuery Pagination through large result set with cloud library - java

I am working on accessing data from Google BigQuery. The data is about 500 MB, which I need to transform as part of the requirement. I am setting "allow large results", setting a destination table, etc.
I have written a Java job using Google's new cloud library, since that is what is recommended now: com.google.cloud:google-cloud-bigquery:0.21.1-beta (I have tried 0.20-beta as well, without any fruitful results).
I am having a problem with the pagination of this data; the library is inconsistent in fetching results page by page. Here is my code snippet:
Code Snippet
System.out.println("Accessing Handle of Response");
QueryResponse response = bigquery.getQueryResults(jobId, QueryResultsOption.pageSize(10000));
System.out.println("Got Handle of Response");
System.out.println("Accessing results");
QueryResult result = response.getResult();
System.out.println("Got handle of Result. Total Rows: " + result.getTotalRows());
System.out.println("Reading the results");
int pageIndex = 0;
int rowId = 0;
while (result != null) {
    System.out.println("Reading Page: " + pageIndex);
    if (result.hasNextPage()) {
        System.out.println("There is Next Page");
    } else {
        System.out.println("No Next Page");
    }
    for (List<FieldValue> row : result.iterateAll()) {
        System.out.println("Row: " + rowId);
        rowId++;
    }
    System.out.println("Getting Next Page: ");
    pageIndex++;
    result = result.getNextPage();
}
Output print statements
Accessing Handle of Response
Got Handle of Response
Accessing results
Got handle of Result. Total Rows: 9617008
Reading the results
Reading Page: 0
There is Next Page
Row: 0
Row: 1
Row: 2
Row: 3
:
:
Row: 9999
Row: 10000
Row: 10001
:
:
Row: 19999
:
:
Please note that it never hits/prints "Getting Next Page: ".
My expectation was that I would get the data in chunks of 10,000 rows at a time. Note that if I run the same code on a query that returns 10-15K rows and set the pageSize to 100 records, I do get "Getting Next Page: " after every 100 rows. Is this a known issue with this beta library?

This looks very close to a problem I struggled with for hours, and I just found the solution, so I will share it here even though you probably found one yourself a long time ago.
I did exactly what the documentation and tutorials said, but my page size was not respected and I kept getting all rows every time, no matter what I did. Eventually I found another example, official I think, right here.
What I learned from that example is that you should only use iterateAll() to get all the remaining rows. To get just the current page's rows, you need to use getValues() instead.
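A minimal plain-Java simulation of those Page semantics (my own stand-in classes for illustration, not the real BigQuery types) shows the difference: iterateAll() walks the current page and every page after it, while getValues() returns only the current page, which is why the question's while loop consumed everything before "Getting Next Page:" was ever reached.

```java
import java.util.ArrayList;
import java.util.List;

public class PagingDemo {
    // Stand-in for the google-cloud Page abstraction (an assumption, not the real class).
    static class Page {
        final List<Integer> values;
        final Page next;
        Page(List<Integer> values, Page next) { this.values = values; this.next = next; }
        List<Integer> getValues() { return values; }   // current page only
        Page getNextPage() { return next; }
        // iterateAll() walks this page AND every following page
        List<Integer> iterateAll() {
            List<Integer> all = new ArrayList<>();
            for (Page p = this; p != null; p = p.getNextPage()) all.addAll(p.values);
            return all;
        }
    }

    // Build totalRows rows split into pages of pageSize, last page first
    static Page buildPages(int totalRows, int pageSize) {
        Page next = null;
        int start = ((totalRows - 1) / pageSize) * pageSize;
        for (int s = start; s >= 0; s -= pageSize) {
            List<Integer> vals = new ArrayList<>();
            for (int r = s; r < Math.min(s + pageSize, totalRows); r++) vals.add(r);
            next = new Page(vals, next);
        }
        return next;
    }

    // Correct page-by-page loop: getValues() for the rows, then getNextPage()
    static List<Integer> pageSizes(int totalRows, int pageSize) {
        List<Integer> sizes = new ArrayList<>();
        for (Page p = buildPages(totalRows, pageSize); p != null; p = p.getNextPage()) {
            sizes.add(p.getValues().size());
        }
        return sizes;
    }

    public static void main(String[] args) {
        Page first = buildPages(25, 10);
        // iterateAll() on the FIRST page already yields every row
        System.out.println("iterateAll from page 0: " + first.iterateAll().size());
        System.out.println("rows per page via getValues(): " + pageSizes(25, 10));
    }
}
```

The same shape applies to the real QueryResult: replace iterateAll() with getValues() inside the while loop and the getNextPage() call is reached after every page.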

Related

Couchbase view pagination with java client

I am trying to pull records from a view which emits in the following way:
DefaultViewRow{id=0329a6ac-84cb-403e-9d1d, key=["X","Y","0329a6ac-84cb-403e-9d1d","31816552700"], value=1}
As we have millions of records, we are trying to implement pagination to pull 500 records per page, do some processing, and then get the next 500 records.
I have implemented the following code with the Java client:
def cluster = CouchbaseCluster.create(host)
def placesBucket = cluster.openBucket("pass", "pass")
def startKey = JsonArray.from(X, "Y")
def endKey = JsonArray.from(X, "Y", new JsonObject())
hasRow = true
rowPerPage = 500
page = 0
currentStartkey = ""
startDocId = ""
def viewResult
def counter = 0
while (hasRow) {
    hasRow = false
    def skip = page == 0 ? 0 : 1
    page = page + 1
    viewResult = placesBucket.query(ViewQuery.from("design", "view")
            .startKey(startKey)
            .endKey(endKey)
            .reduce(false)
            .inclusiveEnd()
            .limit(rowPerPage)
            .stale(Stale.FALSE)
            .skip(skip)
            .startKeyDocId(startDocId))
    def runResult = viewResult.allRows()
    for (ViewRow row : runResult) {
        hasRow = true
        println(row)
        counter++
        startDocId = row.document().id()
    }
    println("Page NUMBER " + page)
}
println("total " + counter)
println("total "+ counter)
After execution, I am getting a few repeated rows, and even though the total number of records is around 1,000 for this particular small scenario, I get around 3,000+ rows in the response and it keeps going.
Can someone please tell me if I am doing something wrong? PS: my start key value will be the same for each run, as I am trying to get each unique doc _id.
Please help.
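One likely cause of the repeats (an educated guess, not verified against a live cluster): the loop advances startDocId but never advances startKey, and startKeyDocId only takes effect among rows whose key equals the start key, so every query restarts from the top of the key range. The usual keyset-pagination pattern advances both. The sketch below simulates view-row ordering and resume semantics in plain Java (my own stand-in types, not the Couchbase client) to show that pattern:

```java
import java.util.ArrayList;
import java.util.List;

public class ViewPagingDemo {
    // Stand-in for a view row, ordered by (key, docId) - an assumption for illustration.
    record Row(String key, String docId) {}

    // One "query": position at the first row with key >= startKey (and, for rows with
    // that exact key, docId >= startDocId), skip `skip` rows, return up to `limit` rows.
    static List<Row> query(List<Row> all, String startKey, String startDocId, int skip, int limit) {
        List<Row> out = new ArrayList<>();
        for (Row r : all) {
            int byKey = r.key().compareTo(startKey);
            if (byKey < 0) continue;
            if (byKey == 0 && r.docId().compareTo(startDocId) < 0) continue;
            if (skip > 0) { skip--; continue; }
            out.add(r);
            if (out.size() == limit) break;
        }
        return out;
    }

    // Correct pagination: advance BOTH the key and the doc id each page,
    // and skip(1) on every page after the first to drop the resume row itself.
    static List<Row> paginate(List<Row> all, int pageSize) {
        List<Row> seen = new ArrayList<>();
        String startKey = "", startDocId = "";
        int skip = 0;
        while (true) {
            List<Row> page = query(all, startKey, startDocId, skip, pageSize);
            if (page.isEmpty()) break;
            seen.addAll(page);
            Row last = page.get(page.size() - 1);
            startKey = last.key();     // the question's code leaves this fixed -> duplicates
            startDocId = last.docId();
            skip = 1;
        }
        return seen;
    }

    public static void main(String[] args) {
        List<Row> rows = List.of(
                new Row("a", "1"), new Row("a", "2"),
                new Row("b", "1"), new Row("b", "2"), new Row("c", "1"));
        System.out.println(paginate(rows, 2)); // every row exactly once
    }
}
```

In the real code that would mean setting startKey to the last row's key (row.key()) in addition to startDocId inside the for loop.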

Extracting avg time spent at a place from Google

I'm trying to use jsoup to extract the average time spent at a place straight from Google's search results, as the Google API does not support fetching that information at the moment.
For example:
the URL is "https://www.google.com/search?q=vivocity" and the text to extract is "15 min to 2 hr".
I've tried the following code:
try {
    String url = "https://www.google.com.sg/search?q=vivocity";
    Document doc = Jsoup.connect(url).userAgent("mozilla/17.0").get();
    Elements ele = doc.select("div._B1k");
    for (Element qwer : ele) {
        temp += "Avg time spent: " + qwer.getElementsByTag("b").first().text() + "\n";
    }
} catch (IOException e) {
    e.printStackTrace();
}
I have also tried just outputting doc.text() and searching through the output; it doesn't seem to contain anything to do with the average time spent either.
The strange thing is that with other URLs and divs this works perfectly fine.
Any help will be appreciated, thank you.

Conversation ID leads to unknown path in Graph API

I have code that fetches conversations and the messages inside them (a specific number of pages). It works most of the time, but for certain conversations it throws an exception, such as:
Exception in thread "main" com.restfb.exception.FacebookOAuthException: Received Facebook error response of type OAuthException: Unknown path components: /[id of the message]/messages (code 2500, subcode null)
at com.restfb.DefaultFacebookClient$DefaultGraphFacebookExceptionMapper.exceptionForTypeAndMessage(DefaultFacebookClient.java:1192)
at com.restfb.DefaultFacebookClient.throwFacebookResponseStatusExceptionIfNecessary(DefaultFacebookClient.java:1118)
at com.restfb.DefaultFacebookClient.makeRequestAndProcessResponse(DefaultFacebookClient.java:1059)
at com.restfb.DefaultFacebookClient.makeRequest(DefaultFacebookClient.java:970)
at com.restfb.DefaultFacebookClient.makeRequest(DefaultFacebookClient.java:932)
at com.restfb.DefaultFacebookClient.fetchConnection(DefaultFacebookClient.java:356)
at test.Test.main(Test.java:40)
After debugging I found the ID that doesn't work and tried to access it from the Graph API, which results in an "unknown path components" error. I also attempted to manually find the conversation in me/conversations and click the next-page link in the Graph API Explorer, which also led to the same error.
Is there a different way to retrieve a conversation than by ID? And if not, could someone show me an example of how to verify first whether a conversation ID is valid, so that I can skip the conversations I can't retrieve instead of getting an error? Here's my current code:
Connection<Conversation> fetchedConversations = fbClient.fetchConnection("me/Conversations", Conversation.class);
int pageCnt = 2;
for (List<Conversation> conversationPage : fetchedConversations) {
    for (Conversation aConversation : conversationPage) {
        String id = aConversation.getId();
        // The line of code which causes the exception
        Connection<Message> messages = fbClient.fetchConnection(id + "/messages", Message.class,
                Parameter.with("fields", "message,created_time,from,id"));
        int tempCnt = 0;
        for (List<Message> messagePage : messages) {
            for (Message msg : messagePage) {
                System.out.println(msg.getFrom().getName());
                System.out.println(msg.getMessage());
            }
            if (tempCnt == pageCnt) {
                break;
            }
            tempCnt++;
        }
    }
}
Thanks in advance!
Update: I surrounded the problematic part with a try/catch as a temporary solution. I also counted the number of occurrences, and it only affects 3 out of 53 conversations. I printed all the IDs as well, and it seems that these 3 IDs are the only ones that contain a "/" symbol, so I'm guessing that has something to do with the exception.
The IDs that work look something like t_[text] (sometimes with a "." or a ":" symbol), and the ones that cause an exception are always t_[text]/[text].
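Based on that observation, one pragmatic option is to filter out the IDs containing "/" before calling fetchConnection, keeping the try/catch as a safety net for anything else. The helper name and the "/" heuristic below are mine, derived only from the pattern reported above, not from any documented Graph API rule:

```java
public class ConversationIdFilter {
    /**
     * Hypothetical helper: returns true when the conversation id looks safe to
     * embed in a Graph API path. A "/" in the id would be interpreted as an
     * extra path component, producing the "unknown path components" error.
     */
    static boolean looksFetchable(String conversationId) {
        return conversationId != null && !conversationId.contains("/");
    }

    public static void main(String[] args) {
        System.out.println(looksFetchable("t_abc123"));   // plain id
        System.out.println(looksFetchable("t_abc.123"));  // dots and colons are fine
        System.out.println(looksFetchable("t_abc/def"));  // would break the path
    }
}
```

In the loop, `if (!looksFetchable(id)) continue;` before the fetchConnection call, with the call itself still wrapped in a try/catch for com.restfb.exception.FacebookOAuthException.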
conv_id/messages is not a valid Graph API call; messages is a field of conversation.
Here is what you do (a single call to the API per page):
Connection<Conversation> conversations = facebookClient.fetchConnection("me/conversations", Conversation.class);
for (Conversation conv : conversations.getData()) {
    // To get the list of messages for the given conversation
    LinkedList<Message> allConvMessagesStorage = new LinkedList<Message>();
    Connection<Message> messages25 = facebookClient.fetchConnection(conv.getId() + "/messages", Message.class);
    // Add the messages returned
    allConvMessagesStorage.addAll(messages25.getData());
    // Check if there is a next page to fetch
    boolean progress = messages25.hasNext();
    while (progress) {
        messages25 = facebookClient.fetchConnectionPage(messages25.getNextPageUrl(), Message.class);
        // Append the next page of messages
        allConvMessagesStorage.addAll(messages25.getData());
        progress = messages25.hasNext();
    }
}

Using twitter4j to search through more than 100 queries [duplicate]

This question already has answers here:
How to retrieve more than 100 results using Twitter4j
(4 answers)
Closed 6 years ago.
I am trying to create a program that searches for a query on Twitter. The problem I am having is that the API returns only 100 results per query, and when I try to retrieve more it keeps giving me the same results again.
User user = twitter.showUser("johnny");
Query query = new Query("football");
query.setCount(100);
query.lang("en");
int i = 0;
try {
    QueryResult result = twitter.search(query);
    for (int z = 0; z < 2; z++) {
        for (Status status : result.getTweets()) {
            System.out.println("#" + status.getUser().getScreenName() + ":" + status.getText());
            i++;
        }
    }
} catch (TwitterException e) {
    e.printStackTrace();
}
The program will print 200 results relating to the query "football", but instead of giving me 200 different results it prints the same 100 results twice. My desired end result is to print as many different results as the rate limit allows. I have seen programs that return more than 100 responses for a specific user, but I haven't seen anything that can return more than 100 responses for a search query like "football".
To get more than 100 results for a search Query, you need to keep requesting the next page of the Query.
Query query = new Query("football");
QueryResult result;
int count = 0;
do {
    result = twitter.search(query);
    List<Status> tweets = result.getTweets();
    for (Status tweet : tweets) {
        System.out.println("#" + tweet.getUser().getScreenName() + ":" + tweet.getText());
        count++;
    }
    try {
        // Brief pause between pages to stay clear of rate limits
        Thread.sleep(500);
    } catch (InterruptedException e) {
        e.printStackTrace();
    }
} while ((query = result.nextQuery()) != null);
System.out.println(count);
System.exit(0);
I just tested it and got 275 tweets. Keep in mind this from the documentation:
The Search API is not a complete index of all Tweets, but instead an index of recent Tweets. At the moment that index includes between 6-9 days of Tweets.
And:
Before getting involved, it’s important to know that the Search API is focused on relevance and not completeness. This means that some Tweets and users may be missing from search results. If you want to match for completeness you should consider using a Streaming API instead.

Obtain a share UpdateKey from LinkedIn using LinkedIn J and getNetworkUpdates() with Coldfusion

Using the "Network Updates API" example at the following link, I am able to post network updates with no problem using client.postNetworkUpdate(updateText):
http://code.google.com/p/linkedin-j/wiki/GettingStarted
So posting works great. However, posting an update does not return an "UpdateKey", which is needed to retrieve stats for the post itself, such as comments, likes, etc. Without the UpdateKey I cannot retrieve stats. So what I would like to do is post, then retrieve the last post using the getNetworkUpdates() function; that retrieval will include the UpdateKey that I need later to retrieve stats. Here's a sample script in Java showing how to get network updates, but I need to do this in ColdFusion instead of Java.
Network network = client.getNetworkUpdates(EnumSet.of(NetworkUpdateType.STATUS_UPDATE));
System.out.println("Total updates fetched:" + network.getUpdates().getTotal());
for (Update update : network.getUpdates().getUpdateList()) {
    System.out.println("-------------------------------");
    System.out.println(update.getUpdateKey() + ":"
            + update.getUpdateContent().getPerson().getFirstName() + " "
            + update.getUpdateContent().getPerson().getLastName() + "->"
            + update.getUpdateContent().getPerson().getCurrentStatus());
    if (update.getUpdateComments() != null) {
        System.out.println("Total comments fetched:" + update.getUpdateComments().getTotal());
        for (UpdateComment comment : update.getUpdateComments().getUpdateCommentList()) {
            System.out.println(comment.getPerson().getFirstName() + " "
                    + comment.getPerson().getLastName() + "->" + comment.getComment());
        }
    }
}
Anyone have any thoughts on how to accomplish this using Coldfusion?
Thanks
I have not used that API, but I am guessing you could use the first two lines to grab the number of updates, then use the overloaded client.getNetworkUpdates(start, end) method to retrieve the last update and obtain its key.
Totally untested, but something along these lines:
<cfscript>
    ...
    // not sure about accessing the STATUS_UPDATE enum. One of these should work:
    // method 1
    STATUS_UPDATE = createObject("java", "com.google.code.linkedinapi.client.enumeration.NetworkUpdateType$STATUS_UPDATE");
    // method 2
    NetworkUpdateType = createObject("java", "com.google.code.linkedinapi.client.enumeration.NetworkUpdateType");
    STATUS_UPDATE = NetworkUpdateType.valueOf("STATUS_UPDATE");

    enumSet = createObject("java", "java.util.EnumSet");
    network = yourClientObject.getNetworkUpdates(enumSet.of(STATUS_UPDATE));
    numOfUpdates = network.getUpdates().getTotal();

    // Add error handling in case numOfUpdates = 0
    result = yourClientObject.getNetworkUpdates(numOfUpdates, numOfUpdates);
    lastUpdate = result.getUpdates().getUpdateList().get(0);
    key = lastUpdate.getUpdateKey();
</cfscript>
You can also use the socialauth library to retrieve updates and post status updates on LinkedIn:
http://code.google.com/p/socialauth
