WYSIWYG

http://kufli.blogspot.com
http://github.com/karthik20522

Sunday, January 4, 2015

Scala Parser Combinators - SQL Parser Example

Scala Parser Combinators: https://github.com/scala/scala-parser-combinators

Scala Parser Combinators is a parsing framework for extracting structured data from input that follows a known pattern. It provides a more statically typed, functional way of extracting data than regular expressions, which can become hard to read as the grammar grows.

In this post, let's build a SQL parser that, given a valid SQL statement, identifies the table name, column names, and other SQL properties. The following are some fundamental combinators that Scala Parser Combinators provides which will help in parsing:
  • " | ": says “succeed if either the left or right operand parse successfully”
  • " ~ ": says “succeed if the left operand parses successfully, and then the right parses successfully on the remaining input”
  • " ~> ": says “succeed if the left operand parses successfully followed by the right, but do not include the left content in the result”
  • " <~ ": is the reverse, “succeed if the left operand is parsed successfully followed by the right, but do not include the right content in the result”
  • " ^^ ": says “if the left operand parses successfully, transform the result using the function on the right”
  • " ^^^ ": says “if the left operand parses successfully, ignore the result and use the value from the right”
  • " rep(fn) ": says "parse the given input using the parser function fn"
  • " repsep(ident, char) ": says "parse the given input and split the input using the given 'char'"
Let's start with a set of SQL statements and the associated parser code:

select * from users

select name,age from users

select count(name) from users

select * from users order by age desc

select * from users order by name, age desc

select age from users where age>30
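The statements above can be handled by a parser along the following lines. This is a minimal sketch built on `JavaTokenParsers` from scala-parser-combinators; the `SqlParser` class, the `Select` case class, and the grammar rules are illustrative and cover only the statements shown, not a complete SQL grammar.

```scala
import scala.util.parsing.combinator.JavaTokenParsers

// Illustrative result type: the pieces we want to extract from a SELECT
case class Select(columns: Seq[String], table: String,
                  where: Option[String], orderBy: Seq[String])

class SqlParser extends JavaTokenParsers {
  // Case-insensitive keyword helper
  def kw(s: String): Parser[String] = ("(?i)" + s).r

  // "*" or a comma-separated list of columns (repsep discards the commas)
  def columns: Parser[Seq[String]] =
    "*" ^^ (_ => Seq("*")) | repsep(column, ",")

  // Either an aggregate like count(name) or a plain identifier
  def column: Parser[String] =
    ident ~ "(" ~ ident ~ ")" ^^ { case fn ~ _ ~ col ~ _ => s"$fn($col)" } |
    ident

  // e.g. "where age>30" -- kept deliberately simple (numeric comparisons only)
  def whereClause: Parser[String] =
    kw("where") ~> (ident ~ ("=" | ">" | "<") ~ wholeNumber) ^^ {
      case col ~ op ~ v => s"$col$op$v"
    }

  // e.g. "order by name, age desc"
  def orderByClause: Parser[Seq[String]] =
    kw("order") ~> kw("by") ~> repsep(ident, ",") <~ opt(kw("asc") | kw("desc"))

  def select: Parser[Select] =
    kw("select") ~> columns ~ (kw("from") ~> ident) ~
      opt(whereClause) ~ opt(orderByClause) ^^ {
        case cols ~ table ~ where ~ order =>
          Select(cols, table, where, order.getOrElse(Nil))
      }

  def parseSql(sql: String): ParseResult[Select] = parseAll(select, sql)
}
```

Note how `~>` and `<~` drop the keywords from the result, while `^^` maps the raw parse tree into the `Select` case class.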


Elasticsearch - Cautionary and Useful Tips

Update/Delete Gotcha:

In Elasticsearch, an update to a document is essentially a delete followed by a reinsert, and a delete operation only marks the document as deleted rather than physically removing it. This is a problem especially under heavy update/delete workloads, since documents are never actually purged but merely marked for deletion, which takes up disk space. The following screenshot shows an index where the number of searchable documents does not match the actual total number of documents in the index.



To reclaim disk space, you have to optimize the index. More information at: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/indices-optimize.html
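For example, the ES 1.x `_optimize` API can be asked to expunge only the deleted documents (the index name `myindex` is a placeholder):

```shell
# Physically purge documents that are marked as deleted (ES 1.x)
curl -XPOST 'http://localhost:9200/myindex/_optimize?only_expunge_deletes=true'
```

This is an expensive operation, so it is best run during off-peak hours.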

Memory Limitation - Max ES Heap size:

By default, Elasticsearch allocates 1 GB of heap to its process. This is fine for development purposes, but in production you should generally give about half of the server's memory to Elasticsearch. The more memory Elasticsearch has, the more data it can hold in memory for faster searches/seeks, but there are a few gotchas to be aware of:
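Assuming ES 1.x, where the startup script reads the ES_HEAP_SIZE environment variable, the heap size can be set like this (16g is an example value for a 32 GB machine):

```shell
# Sets both -Xms and -Xmx for the Elasticsearch JVM (ES 1.x startup script)
export ES_HEAP_SIZE=16g
./bin/elasticsearch
```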
  • Do not cross 32GB
    As it turns out, the JVM uses a trick to compress object pointers when heaps are less than ~32 GB. Once you cross that magical ~30–32 GB boundary, the pointers switch back to ordinary object pointers. The size of each pointer grows, more CPU-memory bandwidth is used, and you effectively lose memory.
  • Give half of the memory to Lucene
    Lucene is designed to leverage the underlying OS for caching in-memory data structures. If you give all available memory to Elasticsearch’s heap, there won’t be any left over for Lucene. This can seriously impact the performance of full-text search.
  • Disable Memory Swapping
    Swapping main memory to disk severely degrades server and Elasticsearch performance. If memory swaps to disk, a 100-microsecond operation becomes one that takes 10 milliseconds. To avoid this you should enable mlockall, which allows the JVM to lock its memory and prevents it from being swapped out by the OS. Set this in your elasticsearch.yml:
More information at: http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/heap-sizing.html

Index name - Alias

It's always advisable to create an alias for the index and have the application use the alias instead of the actual index name. This is useful because we can switch indices without affecting the calling application. For example, we can create a brand-new index with new mappings, remove the alias from the old index, and assign it to the new one. This way, re-indexing becomes a zero-downtime operation.
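The swap can be done atomically with the `_aliases` API; the index names `users_v1`/`users_v2` and the alias `users` below are placeholders:

```shell
# Atomically move the "users" alias from the old index to the new one
curl -XPOST 'http://localhost:9200/_aliases' -d '{
  "actions": [
    { "remove": { "index": "users_v1", "alias": "users" } },
    { "add":    { "index": "users_v2", "alias": "users" } }
  ]
}'
```

Because both actions happen in one request, there is no moment when the alias points at neither index.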

Logging - Debug/Info/Error:

By default, the Elasticsearch log level is set to DEBUG in the logging.yml file. This is probably not a good choice, as ES then tends to log everything, which takes up a lot of disk space. I learned this the hard way when I copied data from one index to another for re-indexing purposes: Elasticsearch logged every single payload, and the log was almost the size of the index itself! It's best to set the log level to WARN instead of DEBUG or INFO.

Document Versioning:

For every insert or document update, Elasticsearch either auto-assigns a version number or expects the user to provide one. This is useful for concurrency control. There are four version types:
  • internal: (default) auto-assigned by Elasticsearch
  • external: version number provided by the user; must always be greater than the existing version of the document
  • external_gte: version number provided by the user; must be greater than or equal to the existing document version
  • force: version number provided by the user, which can be anything; not recommended
The above four version options are not very well documented, but can be understood by reading the ES source code at: Source
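As an illustration of external versioning (index, type, and document values below are placeholders):

```shell
# Index a document with a user-supplied version number
curl -XPUT 'http://localhost:9200/myindex/user/1?version=5&version_type=external' \
  -d '{"name": "jane"}'

# A later write with a lower (or equal) external version is rejected
# with a version conflict (HTTP 409)
curl -XPUT 'http://localhost:9200/myindex/user/1?version=3&version_type=external' \
  -d '{"name": "jane"}'
```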
