WYSIWYG

http://kufli.blogspot.com
http://github.com/karthik20522

Friday, November 28, 2014

Elasticsearch - No downtime reindexing

As you probably know, existing field mappings in Elasticsearch cannot be changed, for example changing a property's type from a string to an int. The only way to make such a change is to copy the entire index into a brand new index that has the new mappings.

Reindexing is an unavoidable and common exercise, as data model changes affect how data is indexed in Elasticsearch. So while designing the system, assigning an alias to every index is a good choice, since it lets us swap indexes in and out. An alias is basically an alternate name for an index. For example:
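
A minimal sketch of registering an alias; the index name keywords_v1 and the alias keywords are hypothetical names for illustration:

    curl -XPOST 'http://localhost:9200/_aliases' -d '{
      "actions": [
        { "add": { "index": "keywords_v1", "alias": "keywords" } }
      ]
    }'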

Now all you need to do is create a new index with the new mappings and copy the data over from the original index to the new index. To perform the bulk copy I prefer to use a tool such as elasticsearch-dump. The following command copies documents from the original keyword index into the second index:
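
A sketch of the copy, assuming the hypothetical index names keywords_v1 and keywords_v2 and that the elasticdump CLI from the elasticsearch-dump project is installed:

    # mappings for keywords_v2 are assumed to have been created up front
    elasticdump --input=http://localhost:9200/keywords_v1 \
                --output=http://localhost:9200/keywords_v2 \
                --type=data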

Once the copy completes, all that is left is to remove the alias from the original index and assign it to the new index. This way the calling clients, which query and index through the alias, see no impact. But what about the documents that were updated during the scan and scroll copy? Well, that's tricky, but if your model has an update-date property you can always re-run elasticsearch-dump with a query that fetches only the documents updated after a certain date/time.
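
The alias swap itself might look like the sketch below, again using the hypothetical keywords_v1/keywords_v2 names; both actions are applied atomically by the _aliases endpoint:

    curl -XPOST 'http://localhost:9200/_aliases' -d '{
      "actions": [
        { "remove": { "index": "keywords_v1", "alias": "keywords" } },
        { "add":    { "index": "keywords_v2", "alias": "keywords" } }
      ]
    }'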


Sunday, November 23, 2014

Elasticsearch - Dynamic Data Mapping

Data in Elasticsearch can be indexed without providing any information about its content, because ES accepts dynamic properties and detects whether a property value is a string, integer, datetime, boolean, etc. In this article, let's work on getting dynamic mapping set up the right way, along with some commonly performed search operations.

To start with a simple example, let's consider a small keyword object with an id, a keywordText and a keywordType. Indexing that JSON blob into Elasticsearch produces an automatically generated mapping: Elasticsearch detects id to be a long and keywordText and keywordType to be strings. This is great, but if you look carefully, keywordText and keywordType get the default setting of "analyzed". This means that those two fields are now available for partial text search. But I want keywordType to be "not_analyzed", as users would never partial-text search it. To overcome this while preserving the dynamic nature of the index, we can create the keywords index with a mapping provided for just that field, as in the sketch below: we keep dynamic set to "true" but tell the index that any field matching "keywordType" should use a specific mapping instead of letting ES figure it out for us.
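
A sketch of that flow; the index/type names keywords/keyword and the field values are assumptions for illustration:

    # 1) index a sample document and let ES derive the mapping dynamically
    curl -XPOST 'http://localhost:9200/keywords/keyword/1' -d '{
      "id": 1,
      "keywordText": "big data",
      "keywordType": "Topic"
    }'

    # 2) the generated mapping (curl 'http://localhost:9200/keywords/_mapping?pretty')
    #    shows id as a long and keywordText/keywordType as analyzed strings

    # 3) instead, create the index up front with dynamic mapping on,
    #    but keywordType pinned to not_analyzed
    curl -XPUT 'http://localhost:9200/keywords' -d '{
      "mappings": {
        "keyword": {
          "dynamic": true,
          "properties": {
            "keywordType": { "type": "string", "index": "not_analyzed" }
          }
        }
      }
    }'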

Now that keywordType is "not_analyzed", it is basically an "exact match" search, including case (upper and lower). But how do I make keywordType a case-insensitive exact match? One way is to lowercase keywordType at index time and have the calling system send lowercase searches only. For this, the following mapping changes need to happen: we basically use the keyword tokenizer that Elasticsearch provides, which keeps the exact-match behaviour, together with a lowercase filter that automatically converts the input to lower case. More info at Elasticsearch tokenizers.
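
A sketch of such an analyzer on the same hypothetical keywords index; the analyzer name lowercase_keyword is made up for illustration:

    curl -XPUT 'http://localhost:9200/keywords' -d '{
      "settings": {
        "analysis": {
          "analyzer": {
            "lowercase_keyword": {
              "type": "custom",
              "tokenizer": "keyword",
              "filter": ["lowercase"]
            }
          }
        }
      },
      "mappings": {
        "keyword": {
          "dynamic": true,
          "properties": {
            "keywordType": { "type": "string", "analyzer": "lowercase_keyword" }
          }
        }
      }
    }'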

So far so good, but I don't want users to search across all fields, which Elasticsearch provides by default through the _all field. I would rather have the user specify which field they want to search on. Why? Because an "_all" search on an index with hundreds of fields is a very expensive operation; that's why! So we disable "_all". OK great, now that the _all field search is disabled, there is another concern: since dynamic mapping is turned on, any new field is automagically indexed, and I don't want Elasticsearch to index a binary blob, as it would consume too much memory; I'd rather just store it and not index it. Setting "enabled": false on such a field lets Elasticsearch know that the field should not be indexed for search purposes but will still be part of the document result. So basically it's stored but not searchable.
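
A sketch combining both settings; the binaryBlob field name is hypothetical:

    curl -XPUT 'http://localhost:9200/keywords' -d '{
      "mappings": {
        "keyword": {
          "dynamic": true,
          "_all": { "enabled": false },
          "properties": {
            "keywordType": { "type": "string", "index": "not_analyzed" },
            "binaryBlob":  { "type": "object", "enabled": false }
          }
        }
      }
    }'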

Since dynamic mapping is enabled, Elasticsearch parses through every single property to determine its type. As much as I would love for Elasticsearch to perform all the magic mappings, let's give it a helping hand by letting it know that certain properties are DateTime types based on their names. So basically, if any property name ends with either "date" or "Date", assume it's a DateTime object. For example, "createDate" or "updateDate" would match the dynamic template sketched below. Also, as you may notice, "date_detection" is set to false, so ES does not try to guess dates from string values on its own.
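
A sketch of those dynamic templates; the template names are hypothetical:

    curl -XPUT 'http://localhost:9200/keywords' -d '{
      "mappings": {
        "keyword": {
          "dynamic": true,
          "date_detection": false,
          "dynamic_templates": [
            { "dates_lower": { "match": "*date", "mapping": { "type": "date" } } },
            { "dates_upper": { "match": "*Date", "mapping": { "type": "date" } } }
          ]
        }
      }
    }'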

How about making all strings an exact, lowercase match? A dynamic template can do that too, as sketched below. Providing dynamic templates when the properties are unknown helps a lot and avoids having every single field "analyzed", which takes up too much memory and extra processing time. The memory consumption analysis will be for another blog post.
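
A sketch, reusing the hypothetical lowercase_keyword analyzer from the earlier sketch for every dynamically added string field:

    # assumes the lowercase_keyword analyzer is defined in this index's settings
    curl -XPUT 'http://localhost:9200/keywords' -d '{
      "mappings": {
        "keyword": {
          "dynamic": true,
          "dynamic_templates": [
            {
              "strings_as_lowercase_exact": {
                "match_mapping_type": "string",
                "match": "*",
                "mapping": { "type": "string", "analyzer": "lowercase_keyword" }
              }
            }
          ]
        }
      }
    }'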

Just as an extra, Elasticsearch provides a way to match templates to index names. This means that we give Elasticsearch a template file with mapping information, and when an index is created, ES automatically matches the index name against the given template and auto-applies the mappings. This template needs to be saved in the "/etc/elasticsearch/templates" folder. An example template file:
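
A sketch of what such a file (for example /etc/elasticsearch/templates/keywords_template.json) might contain; the template name, index pattern and shard count are assumptions:

    {
      "keywords_template": {
        "template": "keywords*",
        "settings": { "number_of_shards": 5 },
        "mappings": {
          "keyword": {
            "dynamic": true,
            "_all": { "enabled": false },
            "properties": {
              "keywordType": { "type": "string", "index": "not_analyzed" }
            }
          }
        }
      }
    }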


Elasticsearch - Zen, AWS Cluster Setup

In a cluster environment, multiple Elasticsearch nodes/servers join to form a cluster where shards are distributed and replicated among these servers, but to the outside world it is presented as a single system. For Elasticsearch to connect the different nodes, ES provides two discovery methods: one being Zen discovery and the other being cloud-based discovery via plugins for Azure, AWS and Google Compute Engine.

Zen Discovery
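
A sketch of a Zen discovery section in elasticsearch.yml; the host names and timeout are hypothetical:

    discovery.type: zen
    discovery.zen.minimum_master_nodes: 2
    discovery.zen.ping.multicast.enabled: false
    discovery.zen.ping.unicast.hosts: ["es-node-1:9300", "es-node-2:9300"]
    discovery.zen.fd.ping_timeout: 30s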

From the above snippet, it's pretty straightforward to understand that the discovery.type is "zen", the minimum number of master-eligible nodes required to form the cluster is 2, and "unicast" pinging is used to find other hosts, with fault detection providing some sort of recovery mechanism if a server goes offline or there are network problems. This is pretty much all that Zen discovery has to offer, simple and easy! More info at Zen discovery.

AWS/EC2 Discovery

For EC2 discovery, we first need to install the cloud-aws plugin if it is not already installed, and then point the discovery settings at EC2, as sketched below. The discovery type is ec2, optionally given a region for the plugin to discover other nodes in, plus a security group. If there is no IAM role associated with the server, then an AWS secret_key and access_key need to be provided in order for the plugin to query AWS for node information.
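
A sketch of the plugin install and the corresponding elasticsearch.yml entries; the plugin version, region, security group name and key placeholders are assumptions:

    bin/plugin -install elasticsearch/elasticsearch-cloud-aws/2.4.1

    # elasticsearch.yml
    discovery.type: ec2
    cloud.aws.region: us-east-1
    discovery.ec2.groups: elasticsearch-nodes
    cloud.aws.access_key: <ACCESS_KEY>    # only if no IAM role is attached
    cloud.aws.secret_key: <SECRET_KEY>    # only if no IAM role is attached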

Having node.auto_attributes set to true adds an aws_availability_zone attribute to the node's attributes, which helps with shard allocation awareness. What this means is that, given an index with a replication factor of 1, ES uses this attribute to determine which node a particular shard is sitting on and makes sure the replicated shard ends up on a different box. More info at Shard Awareness. We can also make Elasticsearch node discovery a little faster by filtering the number of servers it needs to ping during the discovery process. This filter can be achieved using ec2.tag settings, provided tags are assigned to the EC2 servers. In an enterprise environment where hundreds of EC2 servers are deployed on AWS, pinging every single one of them would take a very long time, so this should help speed things up. More information on the EC2 discovery plugin at cloud-aws and on the various discovery mechanisms at Elasticsearch discovery.
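
A sketch of those settings; the tag name and value are assumptions:

    # elasticsearch.yml
    cloud.node.auto_attributes: true
    cluster.routing.allocation.awareness.attributes: aws_availability_zone
    discovery.ec2.tag.environment: production   # only ping instances tagged environment=production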


Elasticsearch - Advanced settings and Tweaks

Now that we have Elasticsearch installed and confirmed working, we can start looking into more advanced settings, more in the nature of tweaking, to improve Elasticsearch performance. For most use cases, the following three areas of Elasticsearch configuration need to be addressed:
  • Memory configuration
  • Threadpool configuration
  • Data Store configuration

Memory configuration:

By default Elasticsearch assigns a minimum heap size of 256MB and a maximum heap size of 1GB. But in real-world server environments with many GB of memory available, a good rule of thumb is to give the Elasticsearch process 50% of the server's memory. This is set via the ES_HEAP_SIZE environment variable, as in the sketch after this note. But providing the heap size alone is not enough, as the memory can be swapped out by the OS. To prevent this we need to lock the process address space assigned to Elasticsearch, which is done by adding bootstrap.mlockall: true to the elasticsearch.yml file and restarting Elasticsearch. After starting Elasticsearch, you can see whether the setting was applied successfully by checking the value of mlockall in the node info output. If you see that mlockall is false, it means that the mlockall request has failed. The most probable reason is that the user running Elasticsearch doesn't have permission to lock memory. This can be granted by running ulimit -l unlimited as root before starting Elasticsearch.
Note that you will always have to run ulimit -l unlimited before an Elasticsearch restart, or else mlockall is set back to false; this is probably because the user the ES process runs as is not root.
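
A sketch of the settings and checks described above; the 8g heap assumes a 16GB server and is only an example value:

    # environment (e.g. exported before startup or set in /etc/sysconfig/elasticsearch)
    export ES_HEAP_SIZE=8g

    # elasticsearch.yml
    bootstrap.mlockall: true

    # allow the process to lock memory (run as root), then verify after startup
    ulimit -l unlimited
    curl 'http://localhost:9200/_nodes/process?pretty'   # look for "mlockall" : true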

Threadpool Configuration:

Elasticsearch holds several thread pools, with a queue bound to each pool, which allows pending requests to be held instead of discarded. For example, by default the index operation has a fixed thread pool sized to the number of processors in the system and a queue_size of 200. So if there are more than 200 pending requests, new requests are discarded and the following exception is returned to the client: EsRejectedExecutionException[rejected execution (queue capacity 200)..]
To overcome this limitation and increase the concurrency with which Elasticsearch processes messages, the settings sketched below can be tweaked. If the use case is primarily searching, i.e. more search operations than indexing operations, the thread pool for search can be increased and the thread pool for indexing kept much lower. Though queuing up thousands of messages is probably not a wise decision, so tweak responsibly. More information about thread pool sizes and configuration can be found at Elasticsearch Threadpool.
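
A sketch of thread pool overrides in elasticsearch.yml; the sizes are illustrative assumptions, not recommendations:

    threadpool.search.type: fixed
    threadpool.search.size: 32
    threadpool.search.queue_size: 1000

    threadpool.index.type: fixed
    threadpool.index.size: 8
    threadpool.index.queue_size: 500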

ES by default assumes that you're going to use it mostly for searching and querying, so it allocates roughly 90% of its total heap for search and only a small buffer for indexing. This can be changed with the setting sketched below. Note that the implication of this setting can be significant, as you are reducing the memory allocated for search purposes! More at Indices Module.
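
A sketch, assuming the setting in question is the indexing buffer size (which defaults to 10% of the heap); 30% is just an example value:

    # elasticsearch.yml
    indices.memory.index_buffer_size: 30%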

Store and indices Configuration:

The store module allows you to control how index data is stored. The index can either be stored in memory (no persistence) or on disk (the default). Unless your data is temporary, using the in-memory store is a bad idea, as you will lose the data upon restart. For disk-based storage, we need fast disk seeks when the data being looked up is not in memory. The most optimal option is usually mmapfs, which is basically memory-mapped files. More information regarding storage options can be found at Elasticsearch Store.
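
A sketch of selecting the store type in elasticsearch.yml:

    index.store.type: mmapfs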


Elasticsearch - Installation and general settings

Installation of Elasticsearch is a breeze, by which I mean it's as simple as downloading the zip/tar file and unzipping it.
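
A sketch of such an install script; the version number, download URL and plugin choices are assumptions:

    #!/bin/bash
    wget https://download.elasticsearch.org/elasticsearch/elasticsearch/elasticsearch-1.4.1.tar.gz
    tar -xzf elasticsearch-1.4.1.tar.gz
    cd elasticsearch-1.4.1

    bin/plugin -install mobz/elasticsearch-head                        # administration UI
    bin/plugin -install elasticsearch/elasticsearch-cloud-aws/2.4.1    # cloud discovery
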
In the above bash script, we are essentially downloading the file, unzipping it and installing a couple of plugins for administration and cloud discovery. Now that we have Elasticsearch unzipped, we can optionally provide the locations of its data, log and configuration folders. There are two ways to give Elasticsearch this configuration, both sketched below. The first way is to provide the paths in the yml configuration file, elasticsearch.yml. The second way, when running Elasticsearch in daemon mode, is to set up the paths in the sysconfig file, generally located at /etc/sysconfig/elasticsearch; the configuration in this file is passed to ES as command-line settings when Elasticsearch is started. Note that the configuration in the elasticsearch.yml file overrides the sysconfig file.
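
A sketch of both options; the directory paths and the sysconfig variable names are assumptions for illustration:

    # elasticsearch.yml
    path.data: /data/elasticsearch/data
    path.logs: /data/elasticsearch/logs
    path.conf: /etc/elasticsearch

    # /etc/sysconfig/elasticsearch
    DATA_DIR=/data/elasticsearch/data
    LOG_DIR=/data/elasticsearch/logs
    CONF_DIR=/etc/elasticsearch
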
But let's say that we used a package manager or Puppet scripts to install Elasticsearch and now have no idea where the config files and data directories are located. One easy way to get this information is to curl the Elasticsearch nodes endpoint, which returns all the information about each node, including path and configuration details. More information on the directory structure can be found at Elasticsearch Directory layout.
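
A sketch of that lookup:

    curl 'http://localhost:9200/_nodes/settings?pretty'
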
OK, now that we have Elasticsearch unzipped and the data directory set up, let's update some minimal but essential Elasticsearch configuration, as sketched below. The list of all configuration options can be found at: Elasticsearch configuration file.
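
A minimal elasticsearch.yml sketch; the cluster and node names are hypothetical:

    cluster.name: my-search-cluster
    node.name: "es-node-1"
    path.data: /data/elasticsearch/data
    path.logs: /data/elasticsearch/logs
    bootstrap.mlockall: true
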
Note that if node.name is not provided, Elasticsearch automatically assigns a node name based on Marvel comic characters. This is fine as long as the Elasticsearch process does not restart, because on restart it will assign a new name, which can be trouble if you are monitoring the ES processes by node name.
Now that we have the basic Elasticsearch settings updated/added, we can start Elasticsearch by running:
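
A sketch of the start command; the -d flag runs it as a daemon, and the service form assumes a package-based install:

    bin/elasticsearch -d
    # or, if installed as a service:
    sudo service elasticsearch start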
