Tuesday, June 23, 2015

Setting Up SolrCloud in Solr 5.x


While there is a lot of documentation on the Solr Confluence Wiki, it can be challenging to find all of the right levers to pull in order to start a multi-node SolrCloud instance without using the provided script, which creates an example SolrCloud for you via a command-line wizard.  This post is intended to be a step-by-step guide to manually creating a SolrCloud cluster without the use of the example script.

Installing a Zookeeper Ensemble


In order for your Solr instances to automatically receive configuration and participate in the cluster, you need to install a ZooKeeper ensemble.  It is possible to run SolrCloud with only one ZooKeeper instance; however, it is recommended to have at least three.  Why three and not two?  ZooKeeper requires a quorum, a majority of its instances, to be considered up and running.  With two instances, if one goes down, the single remaining instance is not a majority, so the ensemble stops.  With three instances, if one goes down you still have two out of three running, which is a majority, so the ensemble keeps running.

Let's get started.

Create a directory on the server named "solrcloud".  We'll refer to this as <BASE_INSTALL_DIR>.  

First, download ZooKeeper from the Apache project website at http://zookeeper.apache.org/releases.html.  At the time of this writing, SolrCloud uses version 3.4.6.

Once the distribution is downloaded, unzip/untar it to <BASE_INSTALL_DIR>.  A folder named "zookeeper-3.4.6" will be extracted.  We'll refer to this as <ZOOKEEPER_HOME> from now on.

We'll be creating three ZooKeeper instances, and each will need a data directory.  In <BASE_INSTALL_DIR>, create a directory named "zdata".  Under the zdata directory, create a directory for each instance named "1", "2", and "3".  In a production environment each ZooKeeper instance would be on a different server, so you would simply create a data directory wherever makes sense on each server.  To keep things simple, however, we'll be running all three instances on one machine, so we need all three data directories on the same machine.

Within each of the data directories, a file named "myid" must be created.  The only content that goes in the "myid" file is the id of the instance.  We'll use "1", "2", and "3" as the ids for the ZooKeeper instances, so add a "myid" file to each instance's data directory containing that instance's id.
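
From <BASE_INSTALL_DIR>, those steps boil down to the following commands:

# one data directory and one myid file per ZooKeeper instance
mkdir -p zdata/1 zdata/2 zdata/3
echo "1" > zdata/1/myid
echo "2" > zdata/2/myid
echo "3" > zdata/3/myid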

Your directory structure should look like the following:

<BASE_INSTALL_DIR>/
    zdata/
        1/myid
        2/myid
        3/myid
    zookeeper-3.4.6/

When running ZooKeeper on a single machine, it is not necessary to create multiple copies of the installation in order to have multiple instances.  You just need a separate ZooKeeper configuration file for each instance, distinguished by the instance id in the file name.  To create the configuration files, go to <ZOOKEEPER_HOME>/conf.  Copy zoo_sample.cfg and name the new configuration file zoo.cfg for instance 1, zoo2.cfg for instance 2, and zoo3.cfg for instance 3.  Open the config files and update the "clientPort" property in each.  For the purposes of this guide we'll increment the port number by one for each instance; in production the instances will be on different servers, so you can either leave the default port "2181" or change it to the port you wish to use.  You will also need to configure the ports the ZooKeeper instances use to communicate with each other.
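
For example:

cd <ZOOKEEPER_HOME>/conf
cp zoo_sample.cfg zoo.cfg
cp zoo_sample.cfg zoo2.cfg
cp zoo_sample.cfg zoo3.cfg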

The configuration file for instance 1 should look similar to the following:

tickTime=2000
initLimit=10
syncLimit=5
dataDir=<BASE_INSTALL_DIR>/zdata/1
clientPort=2181

server.1=localhost:2888:3888
server.2=localhost:2889:3889
server.3=localhost:2890:3890

The only differences in the configuration files for the other two ZooKeeper instances should be the "clientPort" values, "2182" for instance 2 and "2183" for instance 3, and the "dataDir" values, which should point to the data directories we created earlier for each instance.
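
For example, zoo2.cfg should differ from zoo.cfg only in these two lines:

dataDir=<BASE_INSTALL_DIR>/zdata/2
clientPort=2182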

Your <ZOOKEEPER_HOME>/conf directory should now contain the following configuration files:

zoo.cfg
zoo2.cfg
zoo3.cfg
zoo_sample.cfg

Now you are ready to start your ZooKeeper ensemble.  Before we do that, though, let's create a helper script to start all of the instances without having to type the startup command for each one every time.

In your <BASE_INSTALL_DIR>, create a file named "startZookeeper.sh".  In the file add the following:

#!/bin/sh
# Start all three ZooKeeper instances from the single install,
# one per configuration file.
cd ./zookeeper-3.4.6
bin/zkServer.sh start zoo.cfg
bin/zkServer.sh start zoo2.cfg
bin/zkServer.sh start zoo3.cfg

Ensure you give the script execute permission, then run it from the command line:
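
chmod +x startZookeeper.sh
./startZookeeper.sh

You should see output similar to the following: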

JMX enabled by default
Using config: <BASE_INSTALL_DIR>/zookeeper-3.4.6/bin/../conf/zoo.cfg
Starting zookeeper ... STARTED
JMX enabled by default
Using config: <BASE_INSTALL_DIR>/zookeeper-3.4.6/bin/../conf/zoo2.cfg
Starting zookeeper ... STARTED
JMX enabled by default
Using config: <BASE_INSTALL_DIR>/zookeeper-3.4.6/bin/../conf/zoo3.cfg
Starting zookeeper ... STARTED

Creating a configset


Before we create the Solr instances, we'll need a configset so that we can create a collection to shard and replicate across multiple instances.  A configset is very specific to your own collection, so creating one is out of the scope of this guide; however, I will add a couple of pointers.

If you use one of the pre-built configsets that come with Solr 5, located in solr-5.2.1/server/solr/configsets, you don't have to change anything.  However, if you do roll your own, keep in mind the following:
  • Any paths referenced in your solrconfig.xml must be updated to reflect paths relative to your Solr instance directories (which we will create in a later section).
  • If you need additional jar files, such as JDBC drivers, you can add a "lib" directory inside your collection-specific instance directories; jars there are automatically picked up by Solr, so you do not have to modify solrconfig.xml in order to use them (see the example after this list).
    • Note: The collection directories will be created by Solr once you create your collection, so you will have to add the lib directory and jars after completing the "Adding a Collection" section later in this guide.  The directory will be at <BASE_INSTALL_DIR>/solr-5.2.1/server/<instance>/<collection>, e.g. <BASE_INSTALL_DIR>/solr-5.2.1/server/solr/mycollection_shard1_replica1.  Restart the Solr instances once the jars are in place.
  • Ensure the appropriate <lib> tags are added to solrconfig.xml for any libraries you need in addition to the ones that may already be there.  For example, to use the DataImportHandler, you need to add the following lines if they don't already exist:
    •   <lib dir="${solr.install.dir:../../../..}/dist/" regex="solr-dataimporthandler-.*\.jar" />
    •   <lib dir="${solr.install.dir:../../../..}/contrib/dataimporthandler-extras/lib" regex=".*\.jar" />
    •   <lib dir="${solr.install.dir:../../../..}/contrib/extraction/lib" regex=".*\.jar" />
  • Create/update the schema.xml as necessary to map data from the source to a Solr document.
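
For example, to drop a JDBC driver into one of the collection directories once the collection exists (the driver jar name and path here are hypothetical):

# Solr automatically picks up jars in a core's lib directory at startup
cd <BASE_INSTALL_DIR>/solr-5.2.1/server/solr/mycollection_shard1_replica1
mkdir lib
cp /path/to/my-jdbc-driver.jar lib/

Repeat for each shard/replica directory on each instance, then restart the Solr instances.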


Uploading a configset to Zookeeper


Note: This section is only relevant if you want to upload your configuration ahead of time instead of specifying it in the "create" command used in the "Adding a Collection" section, or if you are using the Collections API to issue a "create" command via the REST interface.  When creating a collection via the REST interface, you cannot specify a configset directory like you can with the solr script from the command line, so the configuration must already be in Zookeeper.  Feel free to skip this section unless you plan on using the Collections API instead of the bin/solr script to create a collection.

In order for a configset to be used in SolrCloud, it needs to reside within Zookeeper.  Zookeeper distributes this configuration to the Solr instances automatically and uses it to create your collection on each instance.

To upload the configset, you will need to use zkcli.sh which is in <BASE_INSTALL_DIR>/solr-5.2.1/server/scripts/cloud-scripts.  So go to that directory and issue the following command:

./zkcli.sh -zkhost localhost:2181,localhost:2182,localhost:2183 -cmd upconfig -confname <your conf name> -confdir <BASE_INSTALL_DIR>/solr-5.2.1/server/solr/configsets/<your conf dir>/conf

The above assumes you have put your configset in the configsets directory, though it doesn't have to live there.  Also, in a production system you won't be using localhost and the ports may be different; just update the host and ports as necessary for your environment.

After running the command, your configset should be uploaded to Zookeeper.  We don't have a Solr instance up and running yet, though, so we won't be able to check it via the web interface quite yet.
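
If you'd rather check from the command line, zkcli.sh also provides a "list" command that prints the znode tree stored in Zookeeper; your configuration should appear under /configs/<your conf name>:

./zkcli.sh -zkhost localhost:2181,localhost:2182,localhost:2183 -cmd list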

Creating Solr Instances


In a production environment each instance will be on a separate server, so just like the ZooKeeper instances they would likely use the same port on different hosts.  For the purposes of this guide, however, we will create the instances on the same machine.  Luckily this is easy to do; not quite as easy as ZooKeeper, where you only add configuration files, but almost.  All you need to do is create an additional directory to serve as the Solr home directory for each instance.  The current Solr home is <BASE_INSTALL_DIR>/solr-5.2.1/server/solr, so we'll add three more directories for a total of four instances.  You can add as many as necessary; four is enough to demonstrate sharding and replication.

Under <BASE_INSTALL_DIR>/solr-5.2.1/server, add directories named "solr2", "solr3", and "solr4" to represent our additional instances.  Copy solr.xml from the original solr home directory into each of the newly created directories, then open each copy and update the port (in Solr 5 this is the "hostPort" value).  Updating the ports is only necessary because we are running multiple instances on the same machine and they can't all listen on the same port.  Use the following port numbers for the purposes of this guide (see the sketch after the list):

Instance 1: 8983
Instance 2: 8984
Instance 3: 8985
Instance 4: 8986
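
A minimal sketch of creating the additional Solr home directories:

cd <BASE_INSTALL_DIR>/solr-5.2.1/server
mkdir solr2 solr3 solr4
cp solr/solr.xml solr2/
cp solr/solr.xml solr3/
cp solr/solr.xml solr4/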

Your directory structure with the new instances should look similar to the following, with each Solr home containing its own solr.xml:

solr-5.2.1/server/
    solr/solr.xml
    solr2/solr.xml
    solr3/solr.xml
    solr4/solr.xml

That's all you need to do to create additional Solr instances.  Simple, right?  Of course, they don't do much right now since they have no collections configured, but that's what we'll fix in a minute.  First, however, we need to start up our Solr instances.

Starting Solr Instances


In order to start a Solr instance as part of the cloud and connected to the Zookeeper ensemble, issue the following commands from <BASE_INSTALL_DIR>/solr-5.2.1:

  • bin/solr start -cloud -s server/solr -p 8983 -z localhost:2181,localhost:2182,localhost:2183 -noprompt
  • bin/solr start -cloud -s server/solr2 -p 8984 -z localhost:2181,localhost:2182,localhost:2183 -noprompt
  • bin/solr start -cloud -s server/solr3 -p 8985 -z localhost:2181,localhost:2182,localhost:2183 -noprompt
  • bin/solr start -cloud -s server/solr4 -p 8986 -z localhost:2181,localhost:2182,localhost:2183 -noprompt
As with the Zookeeper instances, you can put these commands in a script so that you don't have to type them one by one each time you want to start your instances.  Name it something like "startSolr.sh", put it in <BASE_INSTALL_DIR>, and make sure you give it execute permission.

#!/bin/sh
# Start all four Solr instances in cloud mode, pointing each at the ZooKeeper ensemble.
cd solr-5.2.1
bin/solr start -cloud -s server/solr -p 8983 -z localhost:2181,localhost:2182,localhost:2183 -noprompt
bin/solr start -cloud -s server/solr2 -p 8984 -z localhost:2181,localhost:2182,localhost:2183 -noprompt
bin/solr start -cloud -s server/solr3 -p 8985 -z localhost:2181,localhost:2182,localhost:2183 -noprompt
bin/solr start -cloud -s server/solr4 -p 8986 -z localhost:2181,localhost:2182,localhost:2183 -noprompt

Upon successful execution of the startup commands, you should see output similar to the following:
Waiting to see Solr listening on port 8983 [/]  
Started Solr server on port 8983 (pid=37286). Happy searching!

Waiting to see Solr listening on port 8984 [/]  
Started Solr server on port 8984 (pid=37386). Happy searching!

Waiting to see Solr listening on port 8985 [/]  
Started Solr server on port 8985 (pid=37489). Happy searching!

Waiting to see Solr listening on port 8986 [/]  
Started Solr server on port 8986 (pid=37591). Happy searching!
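
You can also confirm that all of the instances are up with the solr script's status command, which prints information about each running instance, including its cloud state:

bin/solr status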

Once the instances are up, you can open a web browser and go to the Solr admin pages at:

http://localhost:8983/solr
http://localhost:8984/solr
http://localhost:8985/solr
http://localhost:8986/solr

Adding a Collection

First, let's verify that the configuration we uploaded earlier for our collection is in Zookeeper; otherwise we won't be able to create the collection.  Fire up a browser and go to http://localhost:8983/solr.

Navigate to the "Cloud" tab and open the "Tree" tab underneath it.  You should see a tree containing the files in your Zookeeper ensemble.  Within that set of files is a directory named "configs".  Open that up and you should see your configuration there.



Now that you have a running Zookeeper ensemble along with four Solr instances, you can easily add your custom Solr collection.  In order to do this, we'll use the solr utility in <BASE_INSTALL_DIR>/solr-5.2.1/bin.  You could also use the Collections API directly and issue commands via the REST interface running on your Solr instances; see https://cwiki.apache.org/confluence/display/solr/Collections+API for more details.  Note that when using the Collections API to issue a "create" command, the configuration must already be in Zookeeper.  Please refer to the "Uploading a configset to Zookeeper" section above for how to upload your configset.

Issue the following command to create your collection:

  • bin/solr create -c <collection name> -d <config directory> -n <config name> -p 8983 -s 2 -rf 2

Here's what each argument in the command above does:

  • -c specifies the name of the collection.  This can be anything you want to name your collection.
  • -d specifies the config directory where your configset resides.  The command looks in <BASE_INSTALL_DIR>/solr-5.2.1/server/solr/configsets for the directory name you specify and automatically adds the config to Zookeeper.
  • -n specifies the name to give this configuration in Zookeeper.  Name it something meaningful so you can find it in the Solr admin console later on.
  • -p specifies the port of the Solr instance you are creating this collection on.  Since Zookeeper coordinates the cluster, the collection will be propagated as necessary to the other instances even though you specify only one of them.
  • -s specifies the number of shards.
  • -rf specifies the replication factor, i.e., how many copies of each shard you want.

Since this example specifies two shards and a replication factor of two, Solr and Zookeeper will automatically place a primary/leader for each shard on a separate instance and a replica of each of those shards on other instances, using the four instances we configured, without us having to do any work other than creating the collection on one of the Solr instances.
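
For example, using a hypothetical collection named "mycollection" with the "basic_configs" configset that ships with Solr 5:

bin/solr create -c mycollection -d basic_configs -n myconfig -p 8983 -s 2 -rf 2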

If everything worked, you should see a new directory within each of your Solr instance directories with the name of your collection followed by shard and replica labels:

  • <collection_name>_shard1_replica1 
  • <collection_name>_shard1_replica2
  • <collection_name>_shard2_replica1
  • <collection_name>_shard2_replica2

The instance that each of these appears on may differ for each installation, since we had all of the Solr instances up before we created the collection.  If you want more control, you can forgo starting all of the Solr instances at once and bring up only one to start with.  If you do this, the first shard will be placed on the running instance.  Then start the next server, and the next core of the collection will be placed on the new instance.  Keep repeating this until all instances are up, and you will end up with specific shards and replicas on specific servers.

To view your SolrCloud, go to http://localhost:8983/solr/#/~cloud, which shows a diagram of all of the instances in your cloud.
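
You can also run a quick health check from the command line (again using the hypothetical collection name "mycollection"):

bin/solr healthcheck -c mycollection -z localhost:2181,localhost:2182,localhost:2183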


Stopping Zookeeper and Solr Instances

Stopping the Solr instances is very easy.  Just issue the following command from the <BASE_INSTALL_DIR>/solr-5.2.1 directory:

bin/solr stop -all

If you want to stop a particular instance, replace the "-all" argument with the "-p" argument and specify the port of the instance you want to stop:

bin/solr stop -p 8984

Stopping Zookeeper instances is also very easy.  From the <BASE_INSTALL_DIR>/zookeeper-3.4.6 directory run the following command:

bin/zkServer.sh stop zoo.cfg

Replace zoo.cfg with the appropriate instance configuration as necessary (e.g., zoo2.cfg or zoo3.cfg).

Hopefully this guide was helpful to you.  That's all for now!


