Couchbase view pagination with Java client - java

I am trying to pull records from a view which emits rows in the following way:
DefaultViewRow{id=0329a6ac-84cb-403e-9d1d, key=["X","Y","0329a6ac-84cb-403e-9d1d","31816552700"], value=1}
As we have millions of records, we are trying to implement pagination to pull 500 records per page, do some processing, and then fetch the next 500 records.
I have implemented the following code with the Java client:
def cluster = CouchbaseCluster.create(host)
def placesBucket = cluster.openBucket("pass", "pass")
def startKey = JsonArray.from(X, "Y")
def endKey = JsonArray.from(X, "Y", new JsonObject())
hasRow = true
rowPerPage = 500
page = 0
currentStartkey = ""
startDocId = ""
def viewResult
def counter = 0
while (hasRow) {
    hasRow = false
    def skip = page == 0 ? 0 : 1
    page = page + 1
    viewResult = placesBucket.query(ViewQuery.from("design", "view")
            .startKey(startKey)
            .endKey(endKey)
            .reduce(false)
            .inclusiveEnd()
            .limit(rowPerPage)
            .stale(Stale.FALSE)
            .skip(skip)
            .startKeyDocId(startDocId))
    def runResult = viewResult.allRows()
    for (ViewRow row : runResult) {
        hasRow = true
        println(row)
        counter++
        startDocId = row.document().id()
    }
    println("Page NUMBER " + page)
}
println("total " + counter)
println("total "+ counter)
After execution, I am getting some repeated rows, and even though the total record count is around 1,000 for a particular small scenario, I get 3,000+ rows in the response and it keeps going.
Can someone please tell me if I am doing something wrong? PS: my start key value is the same for each run, as I am trying to get each unique doc _id.
Please help.
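For what it's worth, the pagination pattern usually shown for 2.x views carries forward both the key and the doc id of the last row of the previous page and skips exactly one row; startKeyDocId only takes effect together with a matching startKey, so keeping startKey fixed at the start of the range while only advancing startDocId can produce repeats. A minimal, untested Java sketch of that pattern (design document and view names from the question; bucket name and credentials are illustrative):

import com.couchbase.client.java.Bucket;
import com.couchbase.client.java.CouchbaseCluster;
import com.couchbase.client.java.document.json.JsonArray;
import com.couchbase.client.java.document.json.JsonObject;
import com.couchbase.client.java.view.Stale;
import com.couchbase.client.java.view.ViewQuery;
import com.couchbase.client.java.view.ViewResult;
import com.couchbase.client.java.view.ViewRow;

public class ViewPager {
    public static void main(String[] args) {
        CouchbaseCluster cluster = CouchbaseCluster.create("host");
        Bucket bucket = cluster.openBucket("bucketName", "password");

        int rowsPerPage = 500;
        JsonArray rangeStart = JsonArray.from("X", "Y");
        JsonArray rangeEnd = JsonArray.from("X", "Y", JsonObject.empty());

        JsonArray lastKey = null;   // key of the last row of the previous page
        String lastDocId = null;    // doc id of the last row of the previous page
        boolean hasRows = true;
        long total = 0;

        while (hasRows) {
            ViewQuery query = ViewQuery.from("design", "view")
                    .endKey(rangeEnd)
                    .reduce(false)
                    .stale(Stale.FALSE)
                    .limit(rowsPerPage);
            if (lastDocId == null) {
                query.startKey(rangeStart);    // first page: start of the range
            } else {
                query.startKey(lastKey)        // resume at the exact last key...
                     .startKeyDocId(lastDocId) // ...and the exact last doc id
                     .skip(1);                 // skip the row already processed
            }

            hasRows = false;
            ViewResult result = bucket.query(query);
            for (ViewRow row : result) {
                hasRows = true;
                lastKey = (JsonArray) row.key();
                lastDocId = row.id();
                total++;
            }
        }
        System.out.println("total " + total);
        cluster.disconnect();
    }
}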

Related

Iteration through JSON with multiple API calls for other requests

I am using Postman to iterate through a JSON of about 40 pairs of items. I need to take the resulting array and run an API call for each element in it to return a set of results. Using the code here, I'm only able to pull the final element in the array. I attempted to put postman.setNextRequest in the for loop, but then I found out that no matter where it is, it always executes last.
tests["Status code is 200 (that's good!)"] = (responseCode.code === 200);
if (responseCode.code === 200) {
var jsonData = pm.response.json();
var json = [];
postman.setEnvironmentVariable("json", jsonData)
postman.setNextRequest('GetAdmins');
for (var key in jsonData ) {
if (jsonData.hasOwnProperty(key)) {
postman.setEnvironmentVariable("organizationId", jsonData[key].id)
postman.setEnvironmentVariable("orgname", jsonData[key].name)
tests[jsonData[key].name + " " + jsonData[key].id] = !!jsonData[key].name;
}
}
}
else {
postman.setNextRequest(null);
}
GetAdmins is another GET that uses {{organizationId}} in the call.
I think what I'm looking for is: what is the best way to go about running another API call on each element in the JSON?
Thanks in advance!
EDIT: Adding JSON output
[
    {
        "id": XXXXXX,
        "name": "Name1"
    },
    {
        "id": XXXXXX,
        "name": "Name2"
    },
    {
        "id": XXXXXX,
        "name": "Name3"
    }
]
This might work to get the data - I’ve not tried it out yet though so it might not work first time.
var jsonData = pm.response.json();
var data = _.map(jsonData, item => ({
    organizationId: item.id,
    orgName: item.name
}));
pm.environment.set('organizationData', JSON.stringify(data));
Then you have all of your organization data in a variable and you can use it to iterate over the ids in the next "Get Admins" request.
You would need some code in the Pre-request Script of the next request to access each of the ids to iterate over. You need to parse the variable like this:
var orgID = JSON.parse(pm.environment.get("organizationData"))
Then orgID[0].organizationId would be the first one in the list.
Not a complete solution for your problem but it might help you get the data.
I was able to solve this using these two guides:
Loops and dynamic variables in Postman: part 1
Loops and dynamic variables in Postman: part 2
I also had to implement the bigint fix for Java, but in Postman, which was very annoying... that can be found here:
Hacking bigint in API testing with Postman Runner Newman in CI Environment
Gist
A lot of Googling plus trial and error got me up and running.
Thanks anyway for all your help everyone!
This ended up being my final code:
GetOrgs
tests["Status code is 200 (that's good!)"] = (responseCode.code === 200);
eval(postman.getGlobalVariable("bigint_fix"));
var jsonData = JSON.parse(responseBody);
var id_list = [];
jsonData.forEach(function(list) {
var testTitle = "Org: " + list.name + " has id: " + JSON.stringify(list.id);
id_list.push(list.id);
tests[testTitle] = !!list.id;
});
postman.setEnvironmentVariable("organizationId",JSON.stringify(id_list.shift()));
postman.setEnvironmentVariable("id_list", JSON.stringify(id_list));
postman.setNextRequest("GetAdmins");
GetAdmins
eval(postman.getGlobalVariable("bigint_fix"));
var jsonData = JSON.parse(responseBody);
jsonData.forEach(function(admin) {
    var testTitle = "Admin: " + admin.name + " has " + admin.orgAccess;
    tests[testTitle] = !!admin.name;
});
var id_list = JSON.parse(environment.id_list);
if (id_list.length > 0) {
    postman.setEnvironmentVariable("organizationId", JSON.stringify(id_list.shift()));
    postman.setEnvironmentVariable("id_list", JSON.stringify(id_list));
    postman.setNextRequest("GetAdmins");
} else {
    postman.clearEnvironmentVariable("organizationId");
    postman.clearEnvironmentVariable("id_list");
}

BigQuery Pagination through large result set with cloud library

I am working on accessing data from Google BigQuery. The data is 500MB, which I need to transform as part of the requirement. I am setting Allow Large Results, setting a destination table, etc.
I have written a Java job with Google's new cloud library, since that is what is recommended now - com.google.cloud:google-cloud-bigquery:0.21.1-beta (I have tried 0.20-beta as well without any fruitful results).
I am having a problem with pagination of this data; the library is inconsistent in fetching results page-wise. Here is my code snippet:
Code Snippet
System.out.println("Accessing Handle of Response");
QueryResponse response = bigquery.getQueryResults(jobId, QueryResultsOption.pageSize(10000));
System.out.println("Got Handle of Response");
System.out.println("Accessing results");
QueryResult result = response.getResult();
System.out.println("Got handle of Result. Total Rows: "+result.getTotalRows());
System.out.println("Reading the results");
int pageIndex = 0;
int rowId = 0;
while (result != null) {
System.out.println("Reading Page: "+ pageIndex);
if(result.hasNextPage())
{
System.out.println("There is Next Page");
}
else
{
System.out.println("No Next Page");
}
for (List<FieldValue> row : result.iterateAll()) {
System.out.println("Row: " + rowId);
rowId++;
}
System.out.println("Getting Next Page: ");
pageIndex++;
result = result.getNextPage();
}
Output print statements
Accessing Handle of Response
Got Handle of Response
Accessing results
Got handle of Result. Total Rows: 9617008
Reading the results
Reading Page: 0
There is Next Page
Row: 0
Row: 1
Row: 2
Row: 3
:
:
Row: 9999
Row: 10000
Row: 10001
:
:
Row: 19999
:
:
Please note that it never hits/prints "Getting Next Page: ".
My expectation was that I would get data in chunks of 10,000 rows at a time. Note that if I run the same code on a query which returns 10-15K rows and set the pageSize to 100 records, I do get "Getting Next Page:" after every 100 rows. Is this a known issue with this beta library?
This looks very close to a problem I have been struggling with for hours, and I just found the solution, so I will share it here, even though you have probably found a solution yourself a long time ago.
I did exactly what the documentation and tutorials said, but my page size was not respected and I kept getting all rows every time, no matter what I did. Eventually I found another example, official I think, right here.
What I learned from that example is that you should only use iterateAll() to get the rest of the rows. To get the current page's rows you need to use getValues() instead.
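Applied to the loop from the question, that roughly means swapping iterateAll() for getValues(). A sketch only, reusing the bigquery client, jobId, and beta QueryResult API from the question (untested):

QueryResponse response = bigquery.getQueryResults(jobId, QueryResultsOption.pageSize(10000));
QueryResult result = response.getResult();
int pageIndex = 0;
long rowId = 0;
while (result != null) {
    System.out.println("Reading Page: " + pageIndex);
    // getValues() returns only the rows of the current page;
    // iterateAll() would silently page through everything that is left.
    for (List<FieldValue> row : result.getValues()) {
        rowId++;   // process the row here
    }
    pageIndex++;
    result = result.getNextPage();   // null once the last page has been read
}
System.out.println("Total rows read: " + rowId);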

Add a counter which increases on every HTTP request

I want to add one condition to the scenario below.
I would like to exit from the scenario if (counter == 8 or WorkflowStatus == true).
Does anyone know how to add a counter which increases on every request, up to 8 times, and exits after 8, and also exits if a request gets WorkflowStatus == true, in the scenario below?
Let me know if you need more clarification.
Thanks.
class LaunchResources extends Simulation {
    val scenarioRepeatCount = Integer.getInteger("scenarioRepeatCount", 1).toInt
    val userCount = Integer.getInteger("userCount", 1).toInt
    val UUID = System.getProperty("UUID", "24d0e03")
    val username = System.getProperty("username", "p1")
    val password = System.getProperty("password", "P12")
    val testServerUrl = System.getProperty("testServerUrl", "https://someurl.net")

    val httpProtocol = http
        .baseURL(testServerUrl)
        .basicAuth(username, password)
        .connection("""keep-alive""")
        .contentTypeHeader("""application/vnd+json""")

    val headers_0 = Map(
        """Cache-Control""" -> """no-cache""",
        """Origin""" -> """chrome-extension://fdmmgasdw1dojojpjoooidkmcomcm""")

    val scn = scenario("LaunchAction")
        .repeat(scenarioRepeatCount) {
            exec(http("LaunchAResources")
                .post("""/api/actions""")
                .headers(headers_0)
                .body(StringBody(s"""{"UUID": "$UUID", "stringVariables" : {"externalFilePath" : "/Test.mp4"}}"""))
                .check(jsonPath("$.id").saveAs("WorkflowID")))
            .exec(http("SaveWorkflowStatus")
                .get("""/api/actions/{$WorkflowID}""")
                .headers(headers_0)
                .check(jsonPath("$.status").saveAs("WorkflowStatus")))
        }

    setUp(scn.inject(atOnceUsers(userCount))).protocols(httpProtocol)
}
Personally I use this trick to get a counter that increments on every request:
val scn = scenario("Scenario Conversion")
.exec{session => session.set("number",session.userId.split("-").last.toInt)}
You can reuse this in another session value:
val scn = scenario("Scenario Conversion")
.exec{session => session.set("number",session.userId.split("-").last.toInt)}
.exec{session => session.set("timestamp", nextDay(session("number").as[Int]/1000))}
You can use Redis to store your count, and check the Redis counter every time a request comes in.
I use Redis to count my HTTP POST requests over 3 minutes; if the count goes over 10 within 3 minutes, I disable that client's IP address, and that IP gets a 403 Forbidden error for the next 3 minutes.
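A minimal sketch of that counting pattern, assuming a Jedis client; the key name, window length, and limit are illustrative rather than taken from the original setup:

import redis.clients.jedis.Jedis;

public class PostRateLimiter {
    private static final int WINDOW_SECONDS = 3 * 60;  // 3-minute window
    private static final int MAX_POSTS = 10;

    private final Jedis jedis = new Jedis("localhost", 6379);

    // Returns true while the IP is under the limit; a false result is where the
    // caller would respond with 403 Forbidden.
    public boolean allowPost(String ip) {
        String key = "post:count:" + ip;
        long count = jedis.incr(key);            // atomic per-IP counter
        if (count == 1) {
            jedis.expire(key, WINDOW_SECONDS);   // start the window on the first hit
        }
        return count <= MAX_POSTS;
    }
}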

DataFrames are slow to parse through a small amount of data

I have 2 classes doing a similar task in Apache Spark, but the one using DataFrames is many times slower (about 30x) than the "regular" one using RDDs.
I would like to use DataFrames, since that would eliminate a lot of code and classes we have, but obviously I can't have it be that much slower.
The data set is nothing big. We have some 30 files with JSON data in each, about events triggered from activities in another piece of software. There are between 0 and 100 events in each file.
A data set with 82 events takes about 5 minutes to be processed with DataFrames.
Sample code:
public static void main(String[] args) throws ParseException, IOException {
    SparkConf sc = new SparkConf().setAppName("POC");
    JavaSparkContext jsc = new JavaSparkContext(sc);
    SQLContext sqlContext = new SQLContext(jsc);
    conf = new ConfImpl();
    HashSet<String> siteSet = new HashSet<>();

    // last month
    Date yesterday = monthDate(DateUtils.addDays(new Date(), -1)); // method that returns the date on the first of the month
    Date startTime = startofYear(new Date(yesterday.getTime()));   // method that returns the date on the first of the year

    // list all the sites with a metric file
    JavaPairRDD<String, String> allMetricFiles = jsc.wholeTextFiles("hdfs:///somePath/*/poc.json");
    for (Tuple2<String, String> each : allMetricFiles.toArray()) {
        logger.info("Reading from " + each._1);
        DataFrame metric = sqlContext.read().format("json").load(each._1).cache();
        metric.count();
        boolean siteNameDisplayed = false;
        boolean dateDisplayed = false;
        do {
            Date endTime = DateUtils.addMonths(startTime, 1);
            HashSet<Row> totalUsersForThisMonth = new HashSet<>();
            for (String dataPoint : Conf.DataPoints) { // This is a String[] with 4 elements for this specific case
                try {
                    if (siteNameDisplayed == false) {
                        String siteName = parseSiteFromPath(each._1); // method returning a parsed String
                        logger.info("Data for site: " + siteName);
                        siteSet.add(siteName);
                        siteNameDisplayed = true;
                    }
                    if (dateDisplayed == false) {
                        logger.info("Month: " + formatDate(startTime)); // SimpleFormatDate("yyyy-MM-dd")
                        dateDisplayed = true;
                    }
                    DataFrame lastMonth = metric.filter("event.eventId=\"" + dataPoint + "\"")
                            .filter("creationDate >= " + startTime.getTime())
                            .filter("creationDate < " + endTime.getTime())
                            .select("event.data.UserId").distinct();
                    logger.info("Distinct for last month for " + dataPoint + ": " + lastMonth.count());
                    totalUsersForThisMonth.addAll(lastMonth.collectAsList());
                } catch (Exception e) {
                    // data does not fit the expected model so there is nothing to print
                }
            }
            logger.info("Total Unique for the month: " + totalUsersForThisMonth.size());
            startTime = DateUtils.addMonths(startTime, 1);
            dateDisplayed = false;
        } while (startTime.getTime() < commonTmsMetric.monthDate(yesterday).getTime());

        // reset startTime for the next site
        startTime = commonTmsMetric.StartofYear(new Date(yesterday.getTime()));
    }
}
There are a few things that are not efficient in this code, but when I look at the logs they only add a few seconds to the whole processing.
I must be missing something big.
I have run this with 2 executors and with 1 executor, and the difference is 20 seconds out of 5 minutes.
This is running with Java 1.7 and Spark 1.4.1 on Hadoop 2.5.0.
Thank you!
So there are a few things, but it's hard to say without seeing the breakdown of the different tasks and their times. The short version is that you are doing way too much work in the driver and not taking advantage of Spark's distributed capabilities.
For example, you are collecting all of the data back to the driver program (toArray() and your for loop). Instead, you should just point Spark SQL at the files it needs to load.
For the operators, it seems like you're doing many aggregations in the driver; instead, you could use the driver to generate the aggregations and have Spark SQL execute them.
Another big difference between your in-house code and the DataFrame code is going to be schema inference. Since you've already created classes to represent your data, it seems likely that you know the schema of your JSON data. You can likely speed up your code by adding the schema information at read time so Spark SQL can skip inference.
I'd suggest re-visiting this approach and trying to build something using Spark SQL's distributed operators.
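A rough, untested sketch of that shape: read every file in one pass with an explicit schema so inference is skipped, then let Spark SQL compute the distinct users instead of filtering and collecting per data point in the driver. The column names and types here are guesses based on the filters in the question:

import java.util.Arrays;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.countDistinct;

public class EventAggregation {
    // Distinct users per eventId for one month, computed by Spark SQL rather than
    // by collecting rows into the driver.
    public static DataFrame distinctUsersForMonth(SQLContext sqlContext,
                                                  long monthStartMs, long monthEndMs) {
        StructType userSchema = DataTypes.createStructType(Arrays.asList(
                DataTypes.createStructField("UserId", DataTypes.StringType, true)));
        StructType eventSchema = DataTypes.createStructType(Arrays.asList(
                DataTypes.createStructField("eventId", DataTypes.StringType, true),
                DataTypes.createStructField("data", userSchema, true)));
        StructType schema = DataTypes.createStructType(Arrays.asList(
                DataTypes.createStructField("creationDate", DataTypes.LongType, true),
                DataTypes.createStructField("event", eventSchema, true)));

        DataFrame events = sqlContext.read()
                .format("json")
                .schema(schema)                          // skip schema inference
                .load("hdfs:///somePath/*/poc.json");    // all sites in one job

        return events
                .filter(col("creationDate").geq(monthStartMs)
                        .and(col("creationDate").lt(monthEndMs)))
                .groupBy(col("event.eventId"))
                .agg(countDistinct(col("event.data.UserId")).alias("distinctUsers"));
    }
}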

How to query nested MongoDB fields efficiently in Java?

I'm not very experienced with running Mongo queries from Java, so I'm no expert at commands. I have a Mongo collection with ~6500 documents, each containing multiple fields (some of which have sub-fields), like below:
"_id" : NumberLong(714847),
"franchiseIds" : [
NumberLong(714848),
NumberLong(714849)
],
"profileSettings" : {
"DISCLAIMER_SETUP" : {
"settingType" : "DISCLAIMER_SETUP",
"attributeMap" : {
...
I want to have an operation which will go through the entire collection from time to time and calculate how many franchiseIds are present, since different documents could have anywhere from 1 to 4 franchiseIds.
From the Mongo shell, I did a very simple script to get this, and it calculated the result immediately:
rs_default:SECONDARY> var totalCount = 0;
rs_default:SECONDARY> db.profiles.find().forEach( function(profile) { totalCount += profile.franchiseIds.length } );
rs_default:SECONDARY> totalCount
However, when I attempted to do the same thing in Java, which is where this would run on the server from time to time, it was much less performant, taking around 15 seconds to complete:
int result = 0;
List<Profile> allProfiles = mongoTemplate.findAll(Profile.class, PROFILE_COLLECTION);
for (Profile profile : allProfiles) {
    result += profile.getFranchiseIds().size();
}
return result;
I realize the above isn't performant in Java, as it has to allocate memory for all of the Profiles being loaded. In the Mongo shell script, is Mongo simply taking care of this itself?
Any ideas how I can do something similar in Java?
EDIT:
I returned only the franchiseIds field in the response from Mongo, and that helped significantly. Below is the improved code:
final Query query = new Query();
query.fields().include(FRANCHISE_IDS);
final List<Profile> allProfiles = mongoTemplate.find(query, Profile.class, PROFILE_COLLECTION);
int result = 0;
for (Profile profile : allProfiles) {
    result += profile.getFranchiseIds().size();
}
return result;
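If only the total is needed, another option is to push the whole sum to the server with the aggregation framework, so that a single small document comes back over the wire. A sketch only, assuming Spring Data MongoDB 2.x; the field name comes from the question, while the class and method names are illustrative:

import org.bson.Document;
import org.springframework.data.mongodb.core.MongoTemplate;
import org.springframework.data.mongodb.core.aggregation.Aggregation;
import org.springframework.data.mongodb.core.aggregation.AggregationResults;
import org.springframework.data.mongodb.core.aggregation.ArrayOperators;
import static org.springframework.data.mongodb.core.aggregation.Aggregation.*;

public class FranchiseIdCounter {

    // $project the size of each franchiseIds array, then $group and $sum them all.
    public long countFranchiseIds(MongoTemplate mongoTemplate, String collection) {
        Aggregation agg = newAggregation(
                project().and(ArrayOperators.Size.lengthOfArray("franchiseIds")).as("count"),
                group().sum("count").as("total"));

        AggregationResults<Document> results =
                mongoTemplate.aggregate(agg, collection, Document.class);
        Document doc = results.getUniqueMappedResult();
        return doc == null ? 0L : ((Number) doc.get("total")).longValue();
    }
}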
