We are working to build a big cluster of 100 nodes with 300 TB of storage. We then have to serve it to different users (clients) with restricted resource limits, i.e., we do not want to expose the complete cluster to each user. Is this possible? If it is not possible, what are other ways to do it? Are there any built-in solutions available? It is just like cluster partitioning on demand.
In Hadoop 2 there is the concept of HDFS Federation, which partitions the file system namespace over multiple separate namenodes, each of which manages a portion of the namespace.
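For illustration only, here is a minimal client-side sketch (the hostnames, paths, and the use of a ViewFs mount table are hypothetical, not part of the answer above) of how two federated namenodes can be presented as one namespace, with each namenode managing only its own portion of the tree:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FederatedClient {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Present the federated cluster as one logical namespace via ViewFs.
        conf.set("fs.defaultFS", "viewfs://myCluster");
        // Each mount point is backed by a different namenode (hypothetical hosts).
        conf.set("fs.viewfs.mounttable.myCluster.link./teamA", "hdfs://namenode1:8020/teamA");
        conf.set("fs.viewfs.mounttable.myCluster.link./teamB", "hdfs://namenode2:8020/teamB");
        FileSystem fs = FileSystem.get(conf);
        // The client sees /teamA and /teamB as one file system,
        // even though they are managed by separate namenodes.
        fs.mkdirs(new Path("/teamA/data"));
        fs.mkdirs(new Path("/teamB/data"));
    }
}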
I'm developing an application that serves as a client for Solr. I have to do a multi-core search where the fields are exactly the same. I do not know the best way to implement it. I'm using SolrJ in Java.
What would be best: using Solr's distributed search, or searching each separate core using threads on the application side?
Example
http://XXXX:8983/solr/core1
http://XXXX:8983/solr/core2
http://XXXX:8983/solr/core3
http://XXXX:8983/solr/core4
The fields in each core are the same.
I want to search efficiently across all cores and get a single merged result set.
Solr UI
At this moment I have 26 cores; the largest has:
Num Docs: 4677529
Size: 56.7 GB
The others have similar values. The number of cores tends to increase.
Thanks
As far as I understand from the question and comments, your scenario is a perfect fit for SolrCloud, which is the name of the configuration that enables a set of new distributed capabilities in Solr.
A collection is a complete logical index that can be physically distributed across multiple Solr instances.
When you have to submit a query to your collection, all you have to do is refer to the collection as you previously did with your cores. The SolrJ client has to be built in a different manner: you specify the ZooKeeper connection string, use CloudSolrClient, and set the default collection.
import org.apache.solr.client.solrj.impl.CloudSolrClient;

String zkHostString = "zkServerA:2181,zkServerB:2181,zkServerC:2181/solr";
CloudSolrClient solr = new CloudSolrClient.Builder().withZkHost(zkHostString).build();
solr.setDefaultCollection("collectionName");
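Once the client is built, querying the collection looks just like querying a single core. A minimal sketch (the query string is a placeholder):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

SolrQuery query = new SolrQuery("fieldName:value");   // hypothetical query string
query.setRows(10);
QueryResponse response = solr.query(query);           // fans out to all shards of the collection
for (SolrDocument doc : response.getResults()) {
    System.out.println(doc.getFieldValue("id"));
}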
This leaves you with the following options:
Your configuration is already a SolrCloud environment, but you didn't know it. Open the Solr admin UI (on one of your Solr instances) and check whether there is a "Cloud" entry in the left-hand menu. See the attached image.
In this case, have a look at the Cloud menu; it will show you the network topology of your cluster and the name of the collection to use in your SolrJ implementation. See the attached image:
In case the "Cloud" menu is missing (image 1). You should move your existing cores from a standalone Solr configuration to SolrCloud.
To be clear, you cannot simply switch your existing Solr instances from standalone to SolrCloud. The simplest way I can suggest is to create a new SolrCloud cluster and reindex all your cores. I also suggest having a look at the Solr terminology used in a SolrCloud configuration.
The following are the steps to create a SolrCloud cluster:
create a Zookeeper ensemble
create one or more Solr instances and start them in SolrCloud mode (i.e. specifying the zookeeper connection string parameter -z zk-node1:2181,zk-node2:2181,zk-node3:2181 at Solr start)
upload your Solr collection configuration to Zookeeper (use Solr zkcli.sh tool)
create your collection (Collections API: create a collection); see the sketch below
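For the last step, the collection can also be created programmatically through SolrJ's Collections API support. A minimal sketch, assuming the CloudSolrClient built as shown earlier and purely illustrative names and shard/replica counts:

import org.apache.solr.client.solrj.request.CollectionAdminRequest;

// "collectionName" uses the "myConfig" configset uploaded to ZooKeeper in the previous step,
// split into 4 shards with 2 replicas each (example values only).
CollectionAdminRequest.createCollection("collectionName", "myConfig", 4, 2).process(solr);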
Now you can start to move (reindex) your documents into the brand new collection you have created.
Based on the size and the number of documents you're re-indexing, you'll have to choose a suitable number of shards in order to split the collection across your SolrCloud instances.
I strongly suggest practicing with a playground first, for example by starting a recent version of Solr (6.x) with the -e cloud example parameter. This starts several Solr instances and a standalone ZooKeeper all on the same server, but consider it just a toy to see how things work.
If all the cores have the same config, and you are worried about efficiency etc., it would make more sense to have this setup under SolrCloud.
You can have all the data under a single collection that is then sharded,
or you can have your data partitioned into multiple different collections (which can of course be sharded as well). For instance, it is typical to have monthly collections for log data.
You then have an alias that points to all your collections.
On the client side, you just query the alias and everything is transparent to you: all the needed collections are hit, the search is distributed as needed, etc.
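A rough SolrJ sketch of that setup (the alias and collection names are hypothetical, not taken from the answer above): create the alias once, then query it like any collection.

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.request.CollectionAdminRequest;
import org.apache.solr.client.solrj.response.QueryResponse;

CloudSolrClient solr = new CloudSolrClient.Builder()
        .withZkHost("zk1:2181,zk2:2181,zk3:2181")
        .build();
// Point one alias at several (here monthly) collections.
CollectionAdminRequest.createAlias("logs", "logs_2017_01,logs_2017_02,logs_2017_03").process(solr);
// Queries against the alias fan out to every collection behind it.
solr.setDefaultCollection("logs");
QueryResponse rsp = solr.query(new SolrQuery("*:*"));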
If you don't want to change anything (i.e. these are cores that usually live and behave separately from other cores) and it's a small, contained task (i.e. not something you want to scale further in the future), you can use the explicit sharding support in Solr to query all the cores at the same time. This assumes that the documents are roughly evenly distributed across cores, as the scores are calculated locally before being aggregated on the node you're querying.
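A minimal SolrJ sketch of that approach, reusing the example core URLs from the question (the query string is a placeholder):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

HttpSolrClient solr = new HttpSolrClient.Builder("http://XXXX:8983/solr/core1").build();
SolrQuery query = new SolrQuery("fieldName:value");   // hypothetical query string
// Ask core1 to fan the query out to all listed cores and merge the scored results.
query.set("shards", "XXXX:8983/solr/core1,XXXX:8983/solr/core2,XXXX:8983/solr/core3,XXXX:8983/solr/core4");
QueryResponse rsp = solr.query(query);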
I'm brand-spanking-new to Hadoop and believe I'm beginning to see how different data analytics ("offline") is from the super-low-latency world of web apps. One major thing I'm still struggling to understand is how truly "big" data makes it onto HDFS in the first place.
Say I have 500TB of data stored across a variety of systems (RDBMS, NoSQL, log data, whatever). My understanding is that, if I want to write MR jobs to query and analyze this data, I need to first import/ingest it all into HDFS.
But even if I had, say, a 1 Gbps network connection between each disparate system and my Hadoop cluster, 500 TB is 500,000 GB, i.e. 4,000,000 Gb of data, which at 1 Gbps is about 4,000,000 seconds, or roughly 46 days, to port all the data onto my HDFS cluster. That's well over a month.
And, if my understanding of big data is correct, the terabyte scale is actually pretty low-key, with many big data systems scaling into the petabyte range. Now we'd be up to months, maybe even years, just to be able to run MR jobs against them. If we have systems that are orders of magnitude beyond petabytes, then we're looking at having "flying rocket scooters" buzzing around everywhere before the data is even ready to be queried.
Am I missing something fundamental here? This just doesn't seem right to me.
Typically data is loaded as it's being generated. However, there are a few tools out there to help with the loading to HDFS.
Apache Flume - https://flume.apache.org/ - Designed for aggregating large amounts of log data. Flume has many bundled 'sources' which can be used to consume log data, including reading from files, directories, and queuing systems, or even accepting incoming data over TCP/UDP/HTTP. With that you can set up Flume on a number of hosts to parallelize the data aggregation.
Apache Sqoop - http://sqoop.apache.org/ - Designed for bulk loading from structured datastores such as relational databases. Sqoop uses connectors to connect to a datastore and to structure and load the data into HDFS. The built-in one can connect to anything that adheres to the JDBC 4 specification.
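Purely as an illustration (the connection string, table, target directory, and mapper count are hypothetical), a Sqoop bulk import from a relational database might look like:

sqoop import \
  --connect jdbc:mysql://dbhost/salesdb \
  --table orders \
  --target-dir /data/salesdb/orders \
  --num-mappers 8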
500 TB is a lot of data to load, but if it's spread out across multiple systems and formats, using Sqoop and/or Flume should make relatively quick work of it.
Currently I have two separate applications.
The first is a RESTful API.
The second is a data store that can process raw data and store the processed data on the file system. This data is grouped into folders, and folder IDs are grouped by user IDs.
These applications are connected through a message queue (ActiveMQ) using queueCount queues.
Files are also sent through this queue using an embedded file server.
I want to distribute this data storage across several nodes.
1) First variant
On each of the n nodes, set up ActiveMQ and the current storage application.
Create a master node that will serve queries to these shards.
In this way, data for different users will be stored on different nodes.
2) Second variant
Set up n nodes with the storage app and a single ActiveMQ instance. Create n*queueCount queues in ActiveMQ and have the storage nodes consume messages from their corresponding queues.
But neither variant is perfect; maybe you can give me some advice?
Thanks in advance
Update:
What is the best way to evenly distribute data based on uuid?
Why don't you use a distributed file system like HDFS to distribute your data store? That way replication is covered, data is distributed, and you can even use Hadoop to send jobs to process your data in parallel.
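If you went that route, here is a minimal sketch of the storage app writing a processed document to HDFS from Java (the namenode host, directory layout, and the userId/folderId/documentId/processedBytes variables are placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

Configuration conf = new Configuration();
conf.set("fs.defaultFS", "hdfs://namenode:8020");   // hypothetical namenode address
FileSystem fs = FileSystem.get(conf);
// Keep the existing user-id / folder-id grouping as an HDFS directory layout.
Path target = new Path("/storage/" + userId + "/" + folderId + "/" + documentId);
try (FSDataOutputStream out = fs.create(target)) {
    out.write(processedBytes);   // replication across nodes is handled by HDFS itself
}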
@vvsh, what you are attempting is distributed storage with load balancing (but I did not understand how you plan to keep a specific user's files on a specific node and at the same time get even load distribution). Anyway, before I go any further: the mechanism you are attempting is quite difficult to achieve in a stable manner. Instead, consider using some of the infrastructures mentioned in the comments; they may not fit your requirement 100%, but they will do a much better job.
Now, to achieve even distribution, your architecture essentially needs to be some kind of hub-and-spoke model, where the hub (in your case the master server) collects the load from a single queue with multiple JMS clients running on multiple threads. The master server then has to do the round-robin dispatching (you may choose different schemes: based on file count if file sizes are fairly constant, or based on file size and the net total dispatched to each node).
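For the "evenly distribute data based on uuid" update, one common technique that fits this hub-and-spoke model is consistent hashing: the master hashes each user UUID onto a ring of storage nodes, so a given user's files always land on the same node while the overall load stays roughly even. A minimal sketch (node names are placeholders; node failure and rebalancing are not handled):

import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;
import java.util.zip.CRC32;

public class NodeRing {
    private final SortedMap<Long, String> ring = new TreeMap<>();

    public NodeRing(List<String> nodes, int virtualNodes) {
        for (String node : nodes) {
            for (int i = 0; i < virtualNodes; i++) {
                // Several points per physical node smooth out the distribution.
                ring.put(hash(node + "#" + i), node);
            }
        }
    }

    // Every file that belongs to a given user UUID lands on the same node.
    public String nodeFor(String userUuid) {
        SortedMap<Long, String> tail = ring.tailMap(hash(userUuid));
        return tail.isEmpty() ? ring.get(ring.firstKey()) : tail.get(tail.firstKey());
    }

    private static long hash(String key) {
        CRC32 crc = new CRC32();   // CRC32 is enough for an illustration; any decent hash works
        crc.update(key.getBytes(StandardCharsets.UTF_8));
        return crc.getValue();
    }
}

The master would then call something like new NodeRing(Arrays.asList("storage-1", "storage-2", "storage-3"), 100).nodeFor(userUuid) before dispatching a file.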
The persistence agents must run on every node to actually take the files, process them, and persist them in the data store. The communication between the master and the agents could be through a web service or a direct socket (depending on the performance you require); queue-based communication with the agents could potentially choke your JMS server.
One point of observation is that the files could be staged in another location, like a document store/CMS, with only the ID communicated to the master as well as the agents, thereby reducing the network load and the JMS persistence load.
The above mechanism needs to take care of exceptions, failures, and re-dispatching (i.e. guaranteed delivery), horizontal scaling, and concurrency handling, and it must be optimized for performance. In my view you would be better off using some proven infrastructure, but if you really want to do it, the above architecture will get the job done.
I have a cluster of 4 servers. A file consists of many logical documents. Each file is started as a workflow. So, in summary, a workflow runs on a server for each physical input file, which can contain as many as 300,000 logical documents. At any given time 80 workflows are running concurrently across the cluster. Is there a way to speed up the file processing? Is file splitting a good alternative? Any suggestions? Everything is Java based, running on a Tomcat servlet engine.
Try processing the files in Oracle Coherence. This gives you grid processing. Coherence also provides data persistence.
In a simplified manner my Java application can be described as follows:
It is a web application running on a Tomcat server with a SOAP interface. The application uses JPA/Hibernate to store data in a MySQL database. The stored data consists of a list of users, a list of hosts, and a list of URIs pointing to huge files (~10 GB) in the filesystem.
The whole system consists of a central server, where my application is running, and a bunch of worker hosts. A user can connect to the SOAP interface and ask the system to copy the files that belong to him to a specific worker host, where he can then analyze the data in some way (we cannot use NFS; we need to copy the data to the local disk storage of a worker host). The database then stores, for each user, on which worker host his files are stored.
At the moment the system is running with one central server (hosting the Tomcat application and the MySQL database), 10 worker hosts, and about 30 users who have 100 files (10 GB each on average) stored distributed over the worker hosts.
But in the future I will have to scale the system by a factor of 100-1000. So I might have to deal with 10,000 users, 100,000 files, and 10,000 hosts. The system should also become fault tolerant, so that I don't have a single central server (which is currently the single point of failure in the system), but maybe several. Also, if one of the worker hosts fails, the system should be notified, so it doesn't try to copy files to that server.
My question now is: which Java technologies could I use to make my application scalable and fault tolerant? What kind of architecture would you recommend? Should I still have one huge database storing all the information about files, hosts, and users in one place, or would it be better to distribute my database across several hosts and synchronize them somehow?
The technology you need is called Architecture.
No matter which technology you use, you need to have a well-architected system for scalability and redundancy. Make a diagram of the entire architecture of the system as it currently works. Mark each component with its limitations for users, jobs, bandwidth, hard drive space, memory, or whatever parts are limiting for your application. This will give you the baseline design.
Now draw that same diagram as it would need to be to meet your scalability and redundancy requirements. You might have to break apart pieces to make it work, or develop entirely new pieces. This diagram will make it very clear what you need.
One specific thing I want to address is the database. If you can split the database along logical lines so that you never join queries from one to another, then you should have separate databases. Beyond that, the best configuration for a database is to have each database on one fast machine with lots of storage and very fast access times. If you do this, the only things that will slow down your database are bad queries or poorly indexed tables. In my experience, synchronizing databases is to be avoided unless you have one master database with write access that replicates to other, read-only databases. Regardless, this can be a last step after you've profiled all of your queries and you literally need additional hardware.