I'm developing an application that serves as a client for Solr. I have to do a multi-core search where the fields are exactly the same, and I do not know the best way to implement it. I'm using SolrJ in Java.
What would be better: using Solr's distributed search, or searching each core separately using threads on the application side?
Example
http://XXXX:8983/solr/core1
http://XXXX:8983/solr/core2
http://XXXX:8983/solr/core3
http://XXXX:8983/solr/core4
The fields in each core are the same.
I want to search efficiently across all cores and get a single combined result set.
Solr UI
At the moment I have 26 cores; the largest has
Num Docs: 4677529
Size: 56.7 GB
The others have similar values. The number of cores tends to increase.
Thanks
As far as I understand from the question and comments, your scenario is a perfect fit for SolrCloud, the name of the configuration that enables a set of distributed capabilities in Solr.
A collection is a complete logical index that can be physically distributed across multiple Solr instances.
When you submit a query to your collection, you simply refer to the collection the way you previously referred to your cores. The SolrJ client, however, is built differently: you specify the ZooKeeper connection string, use CloudSolrClient, and set the default collection.
import org.apache.solr.client.solrj.impl.CloudSolrClient;

// ZooKeeper ensemble coordinating the SolrCloud cluster (note the /solr chroot)
String zkHostString = "zkServerA:2181,zkServerB:2181,zkServerC:2181/solr";
CloudSolrClient solr = new CloudSolrClient.Builder().withZkHost(zkHostString).build();
solr.setDefaultCollection("collectionName");
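Once the client is built, queries go against the default collection just as they would against a single core. A minimal sketch (the match-all query is only an example; SolrQuery and QueryResponse come from the org.apache.solr.client.solrj packages):

SolrQuery query = new SolrQuery("*:*");       // placeholder query
query.setRows(10);
QueryResponse response = solr.query(query);   // hits the default collection; fan-out is handled by SolrCloud
System.out.println("Found " + response.getResults().getNumFound() + " documents");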
That said, you have the following options:
Your configuration is already a SolrCloud environment but you didn't know it. Open the Solr admin UI of one of your Solr instances and check whether there is a "Cloud" entry in the left-hand menu. See the attached image.
If there is, have a look at the Cloud menu: it shows the network topology of your cluster and the name of the collection to use in your SolrJ implementation. See the attached image.
If the "Cloud" menu is missing (image 1), you should move your existing cores from a standalone Solr configuration to SolrCloud.
To be clear, you cannot simply switch your existing Solr instances from standalone to SolrCloud. The simplest approach I would suggest is to create a new SolrCloud cluster and reindex all your cores. I also suggest having a look at the Solr terminology used in a SolrCloud configuration.
The steps to create a SolrCloud cluster are:
create a Zookeeper ensemble
create one or more Solr instances and start them in SolrCloud mode (i.e. specifying the zookeeper connection string parameter -z zk-node1:2181,zk-node2:2181,zk-node3:2181 at Solr start)
upload your Solr collection configuration to ZooKeeper (using the Solr zkcli.sh tool)
create your collection via the Collections API CREATE action (see the sketch below)
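The same CREATE action is also exposed in SolrJ, which can be handy if you script the migration. A rough sketch, assuming the configuration was already uploaded to ZooKeeper under the (made-up) name "myConfig", with placeholder shard/replica counts:

// 4 shards, 2 replicas per shard; "solr" is the CloudSolrClient built as shown earlier
CollectionAdminRequest.Create createRequest =
        CollectionAdminRequest.createCollection("collectionName", "myConfig", 4, 2);
createRequest.process(solr);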
Now you can start to move (reindex) your documents into the brand new collection you have created.
Given the size and the number of documents you're reindexing, you'll have to create a number of shards in order to split the collection across your SolrCloud instances.
I strongly suggest practicing in a playground first; for example, start a recent version of Solr (6.x) with the -e cloud example. This starts several Solr instances and a standalone ZooKeeper all on the same server, but consider it just a toy to see how things work.
If all the cores have the same config, and you are worried about efficiency etc., it would make more sense to run this setup under SolrCloud.
You can have all the data under a single, sharded collection,
or you can partition your data into multiple collections (which can be sharded as well, of course). For instance, it is typical to have monthly collections for log data.
Then you create an alias that points to all your collections (sketched below).
On the client side, you just query the alias and everything is transparent to you: all the relevant collections are hit, the search is distributed as needed, etc.
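For reference, the alias itself can be managed from SolrJ through the Collections API; a minimal sketch with made-up alias and collection names:

// Point the "logs" alias at the monthly collections; queries against "logs" hit all of them.
CollectionAdminRequest.CreateAlias aliasRequest =
        CollectionAdminRequest.createAlias("logs", "logs_2017_01,logs_2017_02,logs_2017_03");
aliasRequest.process(cloudSolrClient);   // any CloudSolrClient connected to the cluster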
If you don't want to change anything (i.e. these are cores that usually live and behave separate from other cores) and it's a small, contained task (i.e. not something you want to scale further in the future), you can use the explicit sharding support in Solr to query all the cores at the same time. This assumes that the documents are roughly evenly distributed across cores, as the scores are calculated locally before being aggregated in the node you're querying.
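A minimal sketch of that approach with SolrJ, using the standard shards request parameter (host and core names are the placeholders from the question; SolrQuery, QueryResponse and HttpSolrClient come from the org.apache.solr.client.solrj packages):

HttpSolrClient client = new HttpSolrClient.Builder("http://XXXX:8983/solr/core1").build();

SolrQuery query = new SolrQuery("field:value");   // placeholder query
// Ask core1 to fan the request out to all listed cores and merge the results.
query.set("shards",
        "XXXX:8983/solr/core1,XXXX:8983/solr/core2,XXXX:8983/solr/core3,XXXX:8983/solr/core4");
QueryResponse response = client.query(query);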
I've been playing around with a SpringBoot application, the Micrometer façade, the Statsd implementation for Micrometer and the AWS OpenTelemetry distro deployed on ECS/Fargate. So far, I've been able to export many different metrics (JVM, tomcat, data source, etc) to CloudWatch, adding the cluster name, the service name and the task ID as dimensions.
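For context, dimensions like these are typically attached as Micrometer common tags; a simplified sketch along those lines (not my exact setup; the class name and tag values are placeholders, and in a real deployment the values come from the ECS task metadata endpoint):

import io.micrometer.core.instrument.MeterRegistry;
import org.springframework.boot.actuate.autoconfigure.metrics.MeterRegistryCustomizer;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class MetricsConfig {
    @Bean
    public MeterRegistryCustomizer<MeterRegistry> commonTags() {
        // Placeholder values; resolve them from the ECS task metadata in a real app.
        return registry -> registry.config().commonTags(
                "cluster", "my-cluster",
                "service", "my-service",
                "taskId", "abc123");
    }
}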
My problem now is that I don't know how to handle that information. In a production deployment I may have more than one container, and I may need to scale them out and in. This makes it impossible (or at least I don't know how to do it) to create a dashboard, because I would need to select the task IDs up front. Another problem is that there is no way to add a filter to the dashboard that simply lists the available task IDs, so that I can select the one I want to monitor at that moment and remove the noise from the others. That is something QuickSight can do.
Am I better off just moving to something like Prometheus/Grafana for this? How do people handle monitoring of containers, especially Java applications?
AWS gives you the option to create alarms based on ECS metrics, but only at the service level (so I guess based on the average or max CPU usage, for example), and that isn't enough when you have a workload that is not evenly spread across your instances. Is alerting not possible at the container level (something like "alert me when the service is at 60% CPU or when a single container is at 80%", for example)?
I need to create an ETL process that will extract, transform, and then load 100+ tables from several instances of SQL Server to as many instances of Oracle, in parallel, on a daily basis. I understand that I can create multiple threads in Java to accomplish this, but if they all run on the same machine this approach won't scale. Another approach could be to get a bunch of EC2 instances and start transferring tables for each instance on a different EC2 instance. With this approach, though, I would have to take care of "elasticity" by adding/removing machines from my pool.
Somehow I think I can use "Apache Spark on Amazon EMR" to accomplish this, but in the past I've used Spark only to handle data on HDFS/Hive, so I'm not sure whether transferring data from one DB to another DB is a good use case for Spark - or is it?
Starting from your last question:
"Not sure if transferring data from one Db to another Db is a good use case for Spark":
It is, within the limitations of Spark's JDBC connector. For example, there is no support for updates, and parallelism when reading a table requires partitioning it by a numeric column.
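To make that concrete, here is a rough sketch of a per-table copy job using Spark's JDBC source and sink in Java; the connection strings, table names, credentials, and bounds are placeholders, and the write is append-only (no updates):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class TableCopyJob {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("sqlserver-to-oracle").getOrCreate();

        // Read one SQL Server table in parallel; partitionColumn must be numeric.
        Dataset<Row> source = spark.read()
                .format("jdbc")
                .option("url", "jdbc:sqlserver://source-host;databaseName=src")  // placeholder
                .option("dbtable", "dbo.MY_TABLE")
                .option("user", "srcUser")
                .option("password", "srcPassword")
                .option("partitionColumn", "ID")
                .option("lowerBound", "1")
                .option("upperBound", "10000000")
                .option("numPartitions", "8")
                .load();

        // Append into the matching Oracle table; the JDBC sink inserts only, it cannot update.
        source.write()
                .format("jdbc")
                .option("url", "jdbc:oracle:thin:@target-host:1521/ORCL")        // placeholder
                .option("dbtable", "TARGET_SCHEMA.MY_TABLE")
                .option("user", "tgtUser")
                .option("password", "tgtPassword")
                .mode(SaveMode.Append)
                .save();

        spark.stop();
    }
}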
Considering the I/O cost and the overall performance of an RDBMS, running the jobs in FIFO mode does not sound like a good idea. Instead, you can submit each job with a configuration that requests 1/x of the cluster's resources, so that x tables are processed in parallel.
I am working as a developer on a batch processing solution. It works by splitting a big file and processing it across JVMs. We have 4 processor JVMs, each of which takes a chunk of the file and processes it, and 1 gateway JVM. The gateway JVM's job is to split the file into as many chunks as there are processor JVMs (i.e. 4) and send a REST request, which is consumed by the processor JVMs; the REST request carries all the details, such as the file location each processor has to pick its chunk up from, among other things.
Now, if I want to add another processor JVM without any downtime, is there any way to do it? Currently we maintain the URLs of the 4 JVMs in a property file. Is there a better way to do this that would let me add more JVMs without restarting any component?
You can consider setting up a load balancer and putting your JVM(s) behind it. The load balancer would be responsible for distributing the incoming requests to the JVMs.
This way you can scale your JVMs up or down depending on the workload. Also, if one of the JVMs stops working, the rest of your system doesn't need to care about it anymore.
I'm not sure what your use case and tech stack are, but it seems that you need a distributed system with auto-scaling and dynamic provisioning capabilities. Have you considered Hadoop or Spark clusters, or Akka?
If you cannot use any of those, the solution is to maintain the list of JVMs in some datastore (let's say in a table); it is dynamic data, meaning you can add/remove/update JVMs. Then you need a resource manager that can decide whether to spin up a new JVM based on load or some other conditional logic; this resource manager needs to monitor the entire system. Also, whenever you create a task, chunk, or slice of data, distribute it using a message queue such as Apache ActiveMQ; you can also consider Kafka for more complex use cases. Nowadays, application servers such as WebSphere (Liberty profile) and WebLogic also provide auto-scaling capability, so if you are already using such an application server you can think about making use of that. I hope this helps.
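As a minimal illustration of the "list of JVMs in a table" idea, the gateway could look up the active processor endpoints on every dispatch instead of reading a static property file. A sketch under that assumption; the table and column names are made up:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;

// Reads the current set of processor JVM endpoints from a registry table,
// so a new worker is added by inserting a row instead of editing a property
// file and restarting the gateway.
public class WorkerRegistry {

    private final String jdbcUrl;

    public WorkerRegistry(String jdbcUrl) {
        this.jdbcUrl = jdbcUrl;
    }

    public List<String> activeWorkerUrls() throws Exception {
        List<String> urls = new ArrayList<>();
        try (Connection conn = DriverManager.getConnection(jdbcUrl);
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "SELECT url FROM processor_jvms WHERE active = 1")) {
            while (rs.next()) {
                urls.add(rs.getString("url"));
            }
        }
        return urls;
    }
}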
I am looking for a Java solution besides BigMemory and Hazelcast. Since we are using Hadoop/Spark, we should have access to ZooKeeper.
So I just want to know whether there is a solution satisfying our needs, or whether we need to build something ourselves.
What I need are reliable objects that are in-memory, replicated, and synchronized. For manipulation I would like lock support and atomic actions spanning an object.
I also need support for object references and List/Set/Map support.
The rest we can build ourselves.
The idea is simply to have a self-organizing network that configures itself based on the environment, and that is best done with replicated, synchronized objects that one can listen to.
Hazelcast has a split-brain detector in place. When a split-brain happens, Hazelcast will continue to accept updates, and when the cluster is merged back it gives you the ability to merge the updates according to the policy you prefer.
We are implementing a cluster quorum feature, which will hopefully be available in the next minor version (3.5). With cluster quorum you can define a minimum threshold, or a custom function of your own, to decide whether the cluster should continue to operate in a partitioned network.
For example, if you define a quorum size of 3 and there are fewer than 3 members in the cluster, the cluster will stop operating.
Currently Hazelcast behaves like an AP solution, but once cluster quorum is available you will be able to tune Hazelcast to behave like a CP solution.
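For reference, a rough sketch of what a quorum-of-3 setup looks like with the QuorumConfig API that eventually shipped in the Hazelcast 3.x line; the quorum name and map name are made-up examples, not part of this answer:

import com.hazelcast.config.Config;
import com.hazelcast.config.MapConfig;
import com.hazelcast.config.QuorumConfig;
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;

public class QuorumExample {
    public static void main(String[] args) {
        Config config = new Config();

        // Require at least 3 live members before operations are allowed.
        QuorumConfig quorumConfig = new QuorumConfig("atLeastThree", true, 3);
        config.addQuorumConfig(quorumConfig);

        // Attach the quorum to a specific map.
        MapConfig mapConfig = new MapConfig("importantData");
        mapConfig.setQuorumName("atLeastThree");
        config.addMapConfig(mapConfig);

        HazelcastInstance hz = Hazelcast.newHazelcastInstance(config);
    }
}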
Currently I have two separate applications.
The first is a RESTful API.
The second is a data storage service that can process raw data and store the processed data on the file system. This data is grouped into folders, and folder IDs are grouped by user IDs.
These applications are connected through a message queue (ActiveMQ) using queueCount queues.
Files are also sent through this queue, using an embedded file server.
I want to distribute this data storage across several nodes.
1) First variant
On each of the n nodes, set up ActiveMQ and the current storage application.
Create a master node that will serve queries to these shards.
This way, data for different users will be stored on different nodes.
2) Second
Set up n nodes with the storage app. Set up a single ActiveMQ instance. Create n*queueCount queues in ActiveMQ, and have the storage nodes consume messages from their corresponding queues.
But neither variant is perfect; maybe you can give me some advice?
Thanks in advance
Update:
What is the best way to evenly distribute data based on uuid?
Why don't you use a distributed file system like HDFS to distribute your data store? That way replication is covered, data is distributed, and you can even use Hadoop to send jobs to process your data in parallel.
#vvsh, what you are attempting is distributed storage with load balancing (although I did not understand how you plan to keep a specific user's files on a specific node and at the same time get even load distribution). Anyway, before I go any further: the mechanism you are attempting is quite difficult to achieve in a stable manner. Instead, consider using some of the infrastructures mentioned in the comments; they may not fit your requirements 100%, but they will do a much better job.
Now, to achieve even distribution, your architecture essentially needs to be some kind of hub-and-spoke model, where the hub (in your case the master server) collects the load from a single queue with multiple JMS clients running on multiple threads. The master server then essentially does round-robin dispatching (you may choose other schemes: based on file count if file sizes are fairly constant, or based on file size and the net total dispatched to each node); a sketch follows below.
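A minimal sketch of such a dispatcher on the master side; the node list is hypothetical, and a size-weighted scheme would replace the counter with a "least total bytes dispatched so far" lookup:

import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

// Each incoming file is assigned to the next node in the list, wrapping around.
public class RoundRobinDispatcher {

    private final List<String> nodeUrls;        // endpoints of the persistence agents
    private final AtomicLong counter = new AtomicLong();

    public RoundRobinDispatcher(List<String> nodeUrls) {
        this.nodeUrls = nodeUrls;
    }

    public String nextNode() {
        int index = (int) (counter.getAndIncrement() % nodeUrls.size());
        return nodeUrls.get(index);
    }
}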
The persistence agents must run on every node to actually take the files, process them, and persist them in the datastore. Communication between the master and the agents could be through a web service or a direct socket (depending on the performance you require); queue-based communication with the agents could potentially choke your JMS server.
One observation: the files could be staged in another location, such as a document store/CMS, and only the ID communicated to the master and the agents, thereby reducing the network load and the JMS persistence load.
The above mechanism needs to take care of exceptions, failures, and re-dispatching (i.e. guaranteed delivery), horizontal scaling, and concurrency handling, and it must be optimized for performance. In my view you would be better off using some proven infrastructure, but if you really want to do it yourself, the above architecture will get the job done.