We are currently evaluating Elasticsearch as our solution for Analytics. The main driver is the fact that once the data is populated into Elasticsearch, the reporting comes for free with Kibana.
Before adopting it, I have been tasked with doing a performance analysis of the tool.
The main requirement is supporting a PUT rate of 500 evt/sec.
I am currently starting with a small setup, as follows, just to get a sense of the API before moving this to a more serious lab environment.
My strategy is basically to go over CSVs of analytics that correspond to the format I need and put them into Elasticsearch. I am not using the bulk API because, in reality, the events will not arrive in bulk.
Following is the main code that does this:
// Created once, used for creating JSON from a bean
ObjectMapper mapper = new ObjectMapper();

// Creating a measurement for checking the count of sent events vs. ES stored events
AnalyticsMetrics metrics = new AnalyticsMetrics();
metrics.startRecording();

File dir = new File(mFolder);
for (File file : dir.listFiles()) {
    CSVReader reader = new CSVReader(new FileReader(file.getAbsolutePath()), '|');
    String[] nextLine;
    while ((nextLine = reader.readNext()) != null) {
        AnalyticRecord record = new AnalyticRecord();
        record.serializeLine(nextLine);

        // Generate JSON
        String json = mapper.writeValueAsString(record);

        // Index a single event (one request per event, no bulk)
        IndexResponse response = mClient.getClient().prepareIndex("sdk_sync_log", "sdk_sync")
                .setSource(json)
                .execute()
                .actionGet();

        // Recording metrics
        metrics.sent();
    }
    reader.close(); // close the CSV reader before moving to the next file
}
metrics.stopRecording();
return metrics;
I have the following questions:
How do I know through the API when all the requests have completed and the data is saved into Elasticsearch? I could query Elasticsearch for the object count in my particular index, but doing that would add a new performance factor of its own, so I am ruling that option out.
Is the above the fastest way to insert objects into Elasticsearch, or are there other optimizations I could apply? Keep in mind the bulk API is not an option for now.
Thx in advance.
P.S: the Elasticsearch version I am using on both client and server is 1.0.0.
The Elasticsearch index response has an isCreated() method that returns true if the document is new and false if it has been updated; it can be used to check whether the document was successfully inserted or updated.
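For the first question (knowing when all requests have completed without issuing a count query), one rough sketch against the 1.x Java API is to submit each index request asynchronously with an ActionListener and track completion with a CountDownLatch. Here, totalEvents is a placeholder for the number of events you send, and mClient/json reuse the names from the question's snippet:

final CountDownLatch pending = new CountDownLatch(totalEvents); // totalEvents is a placeholder
final AtomicInteger created = new AtomicInteger();

mClient.getClient().prepareIndex("sdk_sync_log", "sdk_sync")
        .setSource(json)
        .execute(new ActionListener<IndexResponse>() {
            @Override
            public void onResponse(IndexResponse response) {
                if (response.isCreated()) {
                    created.incrementAndGet(); // new document vs. update of an existing one
                }
                pending.countDown();
            }

            @Override
            public void onFailure(Throwable e) {
                pending.countDown(); // count failures too, or await() never returns
            }
        });

// After submitting all events:
pending.await(); // blocks until every request has received a response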
If bulk indexing is not an option, there are other areas that could be tweaked to improve performance (a short sketch follows this list), such as:
increasing the index refresh interval via index.refresh_interval
disabling replicas by setting index.number_of_replicas to 0
disabling the _source and _all fields if they are not needed
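As a rough sketch of applying the first two settings through the 1.x Java admin API (the index name is reused from the question, and the refresh interval value is only an example; _source and _all would be disabled in the mapping instead):

// Sketch only: relax refresh and replication on the index while loading data.
client.admin().indices().prepareUpdateSettings("sdk_sync_log")
        .setSettings(ImmutableSettings.settingsBuilder()
                .put("index.refresh_interval", "30s")  // example value; refresh less often during the load
                .put("index.number_of_replicas", 0)    // consider re-enabling replicas after the load
                .build())
        .execute()
        .actionGet();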
Related
This is most similar to this question.
I am creating a pipeline in Dataflow 2.x that takes streaming input from a Pubsub queue. Every single message that comes in needs to be streamed through a very large dataset that comes from Google BigQuery and have all the relevant values attached to it (based on a key) before being written to a database.
The trouble is that the mapping dataset from BigQuery is very large - any attempt to use it as a side input fails with the Dataflow runners throwing the error "java.lang.IllegalArgumentException: ByteString would be too long". I have attempted the following strategies:
1) Side input
As stated, the mapping data is (apparently) too large for this. If I'm wrong here, or there is a workaround, please let me know, because this would be the simplest solution.
2) Key-Value pair mapping
In this strategy, I read the BigQuery data and Pubsub message data in the first part of the pipeline, then run each through ParDo transformations that change every value in the PCollections to KeyValue pairs. Then, I run a Merge.Flatten transform and a GroupByKey transform to attach the relevant mapping data to each message.
The trouble here is that streaming data requires windowing to be merged with other data, so I have to apply windowing to the large, bounded BigQuery data as well. It also requires that the windowing strategies are the same on both datasets. But no windowing strategy for the bounded data makes sense, and the few windowing attempts I've made simply send all the BQ data in a single window and then never send it again. It needs to be joined with every incoming pubsub message.
3) Calling BQ directly in a ParDo (DoFn)
This seemed like a good idea - have each worker declare a static instance of the map data. If it's not there, then call BigQuery directly to get it. Unfortunately this throws internal errors from BigQuery every time (as in the entire message just says "Internal error"). Filing a support ticket with Google resulted in them telling me that, essentially, "you can't do that".
It seems this task doesn't really fit the "embarrassingly parallelizable" model, so am I barking up the wrong tree here?
EDIT :
Even when using a high-memory machine in Dataflow and attempting to make the side input into a map view, I get the error java.lang.IllegalArgumentException: ByteString would be too long.
Here is an example (pseudocode) of the code I'm using:
Pipeline pipeline = Pipeline.create(options);

PCollectionView<Map<String, TableRow>> mapData = pipeline
        .apply("ReadMapData", BigQueryIO.read().fromQuery("SELECT whatever FROM ...").usingStandardSql())
        .apply("BQToKeyValPairs", ParDo.of(new BQToKeyValueDoFn()))
        .apply(View.asMap());

PCollection<PubsubMessage> messages = pipeline.apply(PubsubIO.readMessages()
        .fromSubscription(String.format("projects/%1$s/subscriptions/%2$s", projectId, pubsubSubscription)));

messages.apply(ParDo.of(new DoFn<PubsubMessage, TableRow>() {
    @ProcessElement
    public void processElement(ProcessContext c) {
        JSONObject data = new JSONObject(new String(c.element().getPayload()));
        String key = getKeyFromData(data);
        TableRow sideInputData = c.sideInput(mapData).get(key);
        if (sideInputData != null) {
            LOG.info("holyWowItWOrked");
            c.output(new TableRow());
        } else {
            LOG.info("noSideInputDataHere");
        }
    }
}).withSideInputs(mapData));
The pipeline throws the exception and fails before logging anything from within the ParDo.
Stack trace:
java.lang.IllegalArgumentException: ByteString would be too long: 644959474+1551393497
com.google.cloud.dataflow.worker.repackaged.com.google.protobuf.ByteString.concat(ByteString.java:524)
com.google.cloud.dataflow.worker.repackaged.com.google.protobuf.ByteString.balancedConcat(ByteString.java:576)
com.google.cloud.dataflow.worker.repackaged.com.google.protobuf.ByteString.balancedConcat(ByteString.java:575)
com.google.cloud.dataflow.worker.repackaged.com.google.protobuf.ByteString.balancedConcat(ByteString.java:575)
com.google.cloud.dataflow.worker.repackaged.com.google.protobuf.ByteString.balancedConcat(ByteString.java:575)
com.google.cloud.dataflow.worker.repackaged.com.google.protobuf.ByteString.copyFrom(ByteString.java:559)
com.google.cloud.dataflow.worker.repackaged.com.google.protobuf.ByteString$Output.toByteString(ByteString.java:1006)
com.google.cloud.dataflow.worker.WindmillStateInternals$WindmillBag.persistDirectly(WindmillStateInternals.java:575)
com.google.cloud.dataflow.worker.WindmillStateInternals$SimpleWindmillState.persist(WindmillStateInternals.java:320)
com.google.cloud.dataflow.worker.WindmillStateInternals$WindmillCombiningState.persist(WindmillStateInternals.java:951)
com.google.cloud.dataflow.worker.WindmillStateInternals.persist(WindmillStateInternals.java:216)
com.google.cloud.dataflow.worker.StreamingModeExecutionContext$StepContext.flushState(StreamingModeExecutionContext.java:513)
com.google.cloud.dataflow.worker.StreamingModeExecutionContext.flushState(StreamingModeExecutionContext.java:363)
com.google.cloud.dataflow.worker.StreamingDataflowWorker.process(StreamingDataflowWorker.java:1000)
com.google.cloud.dataflow.worker.StreamingDataflowWorker.access$800(StreamingDataflowWorker.java:133)
com.google.cloud.dataflow.worker.StreamingDataflowWorker$7.run(StreamingDataflowWorker.java:771)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
java.lang.Thread.run(Thread.java:745)
Check out the section called "Pattern: Streaming mode large lookup tables" in Guide to common Cloud Dataflow use-case patterns, Part 2. It might be the only viable solution since your side input doesn't fit into memory.
Description:
A large (in GBs) lookup table must be accurate, and changes often or does not fit in memory.
Example:
You have point of sale information from a retailer and need to associate the name of the product item with the data record which contains the productID. There are hundreds of thousands of items stored in an external database that can change constantly. Also, all elements must be processed using the correct value.
Solution:
Use the "Calling external services for data enrichment" pattern but rather than calling a micro service, call a read-optimized NoSQL database (such as Cloud Datastore or Cloud Bigtable) directly. For each value to be looked up, create a Key Value pair using the KV utility class. Do a GroupByKey to create batches of the same key type to make the call against the database. In the DoFn, make a call out to the database for that key and then apply the value to all values by walking through the iterable. Follow best practices with client instantiation as described in "Calling external services for data enrichment".
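As a rough sketch of the shape of that enrichment DoFn (LookupClient, createClient and mergeIntoRow below are hypothetical stand-ins for a read-optimized store client and your own merge logic, not real library APIs):

import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubMessage;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;

// Hypothetical stand-in for a read-optimized store client (Bigtable/Datastore);
// only here to show the shape of the pattern.
interface LookupClient extends AutoCloseable {
    TableRow lookup(String key);
    void close();
}

class EnrichFromStoreFn extends DoFn<KV<String, Iterable<PubsubMessage>>, TableRow> {

    private transient LookupClient client;

    @Setup
    public void setup() {
        client = createClient(); // one client per DoFn instance, per the best-practices note above
    }

    @ProcessElement
    public void processElement(ProcessContext c) {
        // One store call per key; the grouped Pub/Sub messages all share that key.
        TableRow mapping = client.lookup(c.element().getKey());
        for (PubsubMessage msg : c.element().getValue()) {
            c.output(mergeIntoRow(msg, mapping));
        }
    }

    @Teardown
    public void teardown() {
        if (client != null) {
            client.close();
        }
    }

    private LookupClient createClient() {
        // Placeholder: build and return your real Bigtable/Datastore-backed client here.
        throw new UnsupportedOperationException("wire up a real client");
    }

    private TableRow mergeIntoRow(PubsubMessage msg, TableRow mapping) {
        // Placeholder: attach the relevant mapping values to the message here.
        return new TableRow();
    }
}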
Other relevant patterns are described in Guide to common Cloud Dataflow use-case patterns, Part 1:
Pattern: Slowly-changing lookup cache
Pattern: Calling external services for data enrichment
How can I do the following?
What I want to do is load Stanford NLP ONCE, then interact with it via an HTTP or other endpoint. The reason is that it takes a long time to load, and loading for every string to analyze is out of the question.
For example, here is Stanford NLP loading in a simple C# program that loads the jars... I'm looking to do what I did below, but in Java:
Reading POS tagger model from edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger ... done [9.3 sec].
Loading classifier from D:\Repositories\StanfordNLPCoreNLP\stanford-corenlp-3.6.0-models\edu\stanford\nlp\models\ner\english.all.3class.distsim.crf.ser.gz ... done [12.8 sec].
Loading classifier from D:\Repositories\StanfordNLPCoreNLP\stanford-corenlp-3.6.0-models\edu\stanford\nlp\models\ner\english.muc.7class.distsim.crf.ser.gz ... done [5.9 sec].
Loading classifier from D:\Repositories\StanfordNLPCoreNLP\stanford-corenlp-3.6.0-models\edu\stanford\nlp\models\ner\english.conll.4class.distsim.crf.ser.gz ... done [4.1 sec].
done [8.8 sec].
Sentence #1 ...
This is over 30 seconds. If these all have to load each time, yikes. To show what I want to do in Java, I wrote a working example in C#, and this complete example may help someone some day:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using System.IO;
using java.io;
using java.util;
using edu.stanford.nlp;
using edu.stanford.nlp.pipeline;
using Console = System.Console;

namespace NLPConsoleApplication
{
    class Program
    {
        static void Main(string[] args)
        {
            // Path to the folder with models extracted from `stanford-corenlp-3.6.0-models.jar`
            var jarRoot = @"..\..\..\..\StanfordNLPCoreNLP\stanford-corenlp-3.6.0-models";

            // Text for initial run processing
            var text = "Kosgi Santosh sent an email to Stanford University. He didn't get a reply.";

            // Annotation pipeline configuration
            var props = new Properties();
            props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner, parse, sentiment");
            props.setProperty("ner.useSUTime", "0");

            // Change the current directory so StanfordCoreNLP can find all the model files automatically
            var curDir = Environment.CurrentDirectory;
            Directory.SetCurrentDirectory(jarRoot);
            var pipeline = new StanfordCoreNLP(props);
            Directory.SetCurrentDirectory(curDir);

            // loop
            while (text != "quit")
            {
                // Annotation
                var annotation = new Annotation(text);
                pipeline.annotate(annotation);

                // Result - pretty print
                using (var stream = new ByteArrayOutputStream())
                {
                    pipeline.prettyPrint(annotation, new PrintWriter(stream));
                    Console.WriteLine(stream.toString());
                    stream.close();
                }

                edu.stanford.nlp.trees.TreePrint tprint = new edu.stanford.nlp.trees.TreePrint("words");

                Console.WriteLine();
                Console.WriteLine("Enter a sentence to evaluate, and hit ENTER (enter \"quit\" to quit)");
                text = Console.ReadLine();
            } // end while
        }
    }
}
So it takes the 30 seconds to load, but each time you give it a string on the console, it takes the smallest fraction of a second to parse & tokenize that string.
You can see that I loaded the jar files prior to the while loop.
This may end up being a socket service, an HTTP endpoint, or something else that will accept requests (in the form of strings) and spit back the parse.
My ultimate goal is to use a mechanism in NiFi, via a processor that can send strings to be parsed and have them returned in less than a second, versus the 30+ seconds it would take if a traditional threaded web server example (for instance) were used, where every request would load the whole thing for 30 seconds and THEN get down to business. I hope I made this clear!
How to do this?
Any of the mechanisms you list are perfectly reasonable routes forward for leveraging that service with Apache NiFi. Depending on your needs, some of the processors and extensions that are bundled with the standard release of NiFi may be sufficient to interact with your proposed web service or similar offering.
If you are striving for performing all of this within NiFi itself, a custom Controller Service might be a great path to provide this resource to NiFi that falls within the lifecycle of the application itself.
NiFi can be extended with items like controller services and custom processors and we have some documentation to get you started down that path.
Additional details could certainly help to provide some more information. Feel free to follow up on here with additional comments and/or reach out to the community via our mailing lists.
One item I did want to call out, in case it was unclear: NiFi is JVM driven, so the work would be done in Java or other JVM-friendly languages.
You should look at the new CoreNLP Server which Stanford NLP introduced in version 3.6.0. It seems like it does just what you want? Some other people such as ETS have done similar things.
Fine point: If using this heavily, you might (at present) want to grab the latest CoreNLP code from github HEAD, since it contains a few fixes to the server which will be in the next release.
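To give a rough idea of what using the server looks like, here is a minimal sketch, assuming the server has already been started separately (e.g. java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000) and is reachable on localhost:9000; the annotators chosen are just an example:

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;
import java.util.Scanner;

public class CoreNlpServerClient {

    public static void main(String[] args) throws Exception {
        // The properties go in a URL parameter; the text to analyze goes in the POST body.
        String props = URLEncoder.encode(
                "{\"annotators\": \"tokenize,ssplit,pos,ner\", \"outputFormat\": \"json\"}", "UTF-8");
        URL url = new URL("http://localhost:9000/?properties=" + props);

        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        try (OutputStream out = conn.getOutputStream()) {
            out.write("Kosgi Santosh sent an email to Stanford University.".getBytes("UTF-8"));
        }

        // The models stay loaded in the server process, so this round trip is fast.
        try (Scanner in = new Scanner(conn.getInputStream(), "UTF-8")) {
            while (in.hasNextLine()) {
                System.out.println(in.nextLine());
            }
        }
    }
}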
I have a Couchbase cluster which has around 25M documents. I am able to read them sequentially, and I also have a function that can read a specific number of documents from the database. But my use case is slightly different, since I cannot store all 25M documents (each document is huge) in memory.
I need to process the documents in batches, say 1M per batch: pull that batch into memory, do some operation on those documents, and then pull the next batch.
The function I have written to read a specific number of documents doesn't ensure that it returns a different set of documents when called again.
Is there a way I can accomplish this? I also have a function which can create documents in batches. I am not sure if I can write a similar function that can read the documents in batches. The function is given below.
public void createMultipleCustomerDocuments(String docId, Customer myCust, long numDocs) {
    Gson gson = new GsonBuilder().create();
    JsonObject content = JsonObject.fromJson(gson.toJson(myCust));
    JsonDocument document = JsonDocument.create(docId, content);
    jsonDocuments.add(document);
    documentCounter++;

    if (documentCounter == numDocs) {
        Observable.from(jsonDocuments).flatMap(new Func1<JsonDocument, Observable<JsonDocument>>() {
            public Observable<JsonDocument> call(final JsonDocument docToInsert) {
                return theBucket.async().upsert(docToInsert);
            }
        }).last().toBlocking().single();

        // Reset the batch so the next call starts from an empty list
        jsonDocuments.clear();
        documentCounter = 0;
        //System.out.println("Batch counter: " + batchCounter++);
    }
}
Can someone please help me with this?
I would try to create a view containing all of the documents, and then query the view with skip and limit. (You can use the .startKey() and .startKeyDocId() functions instead of skip() to avoid the overhead.)
But remember not to keep that view in a production environment; it will be a CPU hog.
Another option is to use the DCP protocol to replicate the database into your app, but that is more work.
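To make the view idea concrete, here is a minimal sketch against the 2.x Java SDK, assuming a design document and view I am calling docs/all_docs have been created to emit every document; note that large skip() values get progressively slower, which is why paginating by startKey()/startKeyDocId() is the better approach for deep batches:

import java.util.ArrayList;
import java.util.List;

import com.couchbase.client.java.Bucket;
import com.couchbase.client.java.document.JsonDocument;
import com.couchbase.client.java.view.ViewQuery;
import com.couchbase.client.java.view.ViewResult;
import com.couchbase.client.java.view.ViewRow;

public class BatchReader {

    // Reads one batch of documents by paging the view with skip/limit.
    // "docs" / "all_docs" are placeholder design document and view names.
    public List<JsonDocument> readBatch(Bucket bucket, int offset, int batchSize) {
        List<JsonDocument> batch = new ArrayList<JsonDocument>();
        ViewResult result = bucket.query(
                ViewQuery.from("docs", "all_docs").skip(offset).limit(batchSize));
        for (ViewRow row : result) {
            batch.add(row.document()); // fetches the full document for each emitted row
        }
        return batch;
    }
}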
I am new to Elasticsearch. I read Elasticsearch's Java client APIs and am able to build a query and send it to the Elasticsearch server via the transport client.
My query is quite complex, with multi-level filters, and I notice that it is cumbersome to build via the Java client. I feel it would be much simpler to build a JSON query string and then send it over to the Elasticsearch server via the Java client.
Is this something Elasticsearch offers?
I do like what the Elasticsearch Java API can do after receiving results, such as scrolling over them. I want to keep those features.
Thanks for any input and links!
Regards.
Did further research on Elasticsearch API and found out that Elasticsearch does offer this capability. Here is how:
SearchResponse scrollResp = client.prepareSearch("my-index")
        .setTypes("my-type")
        .setSearchType(SearchType.SCAN)
        .setQuery(query)   // <-- query string in JSON format
        .execute().actionGet();
You can no longer pass a string to the .setQuery() function; however, you can use a WrapperQueryBuilder like this:
WrapperQueryBuilder builder = QueryBuilders.wrapperQuery(searchQuery);
SearchRequestBuilder sr = client.prepareSearch()
        .setIndices(index)
        .setTypes(mapping)
        .setQuery(builder);
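For example, a rough sketch of combining that with the scrolling the question wants to keep (the index name and JSON body here are just placeholders):

// Sketch: wrap a raw JSON query body and still use the Java API's scrolling.
String searchQuery = "{ \"match\": { \"title\": \"elasticsearch\" } }"; // placeholder query
WrapperQueryBuilder builder = QueryBuilders.wrapperQuery(searchQuery);

SearchResponse resp = client.prepareSearch("my-index")
        .setQuery(builder)
        .setScroll(TimeValue.timeValueMinutes(1)) // keep the scroll context alive
        .setSize(100)                             // hits per scroll page
        .execute()
        .actionGet();

// Continue with client.prepareSearchScroll(resp.getScrollId()) ... as usual.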
I'd recommend using the Java API; it is very good once you get used to it, and in most cases it is less cumbersome. If you look through the Elasticsearch source code, you will see that the Java API builds the JSON under the hood. Here is an example from the MatchAllQueryBuilder:
@Override
public void doXContent(XContentBuilder builder, Params params) throws IOException {
    builder.startObject(MatchAllQueryParser.NAME);
    if (boost != -1) {
        builder.field("boost", boost);
    }
    if (normsField != null) {
        builder.field("norms_field", normsField);
    }
    builder.endObject();
}
Elasticsearch has built-in capabilities to do exactly what you need, in an organized manner.
To answer your question, please see this link (the material is gone on elastic's site, so it might no longer work):
https://web.archive.org/web/20150906215934/https://www.elastic.co/guide/en/elasticsearch/client/java-api/current/search.html
All you have to do is build a simple file which contains your search template, i.e. your complex search query.
It can be a simple JSON file or a text file.
Now you simply pass in your parameters through your Java code.
See the example in the link, it makes things amply clear.
Bhargav.
Hi, I want to get all issues stored in JIRA from Java, using JQL or any other way.
I tried to use this code:
for (String name : getProjectsNames()) {
    String jqlRequest = "project = \"" + name + "\"";
    SearchResult result = restClient.getSearchClient().searchJql(
            jqlRequest, 10000, 0, pm);
    final Iterable<BasicIssue> issues = result.getIssues();
    for (BasicIssue is : issues) {
        Issue issue = restClient.getIssueClient().getIssue(is.getKey(), pm);
        ...........
    }
}
It gives me the result, but it takes a very long time.
Is there a query, a REST API URL, or any other way that gives me all the issues?
Please help me.
The JIRA REST API will give you all the info from each issue at a rate of a few issues/second. The Inquisitor add-on at https://marketplace.atlassian.com/plugins/com.citrix.jira.inquisitor will give you thousands of issues per second but only the standard JIRA fields.
There is one other way. There is a table in the JIRA database named "dbo.jiraissue". If you have access to that database, you can fetch the IDs of all issues. After fetching this data, you can send the REST request "localhost/rest/api/2/issue/issue_id" for each ID and get a JSON response. Of course, you have to write some code for this, but this is one way I know to get all issues.
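As a rough illustration of that second approach (a sketch only; the base URL, credentials and issue ID are placeholders, and the JSON parsing is left out), fetching one issue over REST from Java could look like this:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Base64;

public class JiraIssueFetcher {

    public static void main(String[] args) throws Exception {
        String baseUrl = "http://localhost:8080"; // placeholder JIRA base URL
        String issueId = "10001";                 // placeholder issue id taken from the jiraissue table
        String auth = Base64.getEncoder().encodeToString("user:password".getBytes("UTF-8"));

        URL url = new URL(baseUrl + "/rest/api/2/issue/" + issueId);
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestProperty("Authorization", "Basic " + auth);
        conn.setRequestProperty("Accept", "application/json");

        StringBuilder json = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
            String line;
            while ((line = in.readLine()) != null) {
                json.append(line);
            }
        }

        // Parse the JSON response (e.g. with Jackson) and repeat for each issue id.
        System.out.println(json);
    }
}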