tag:blogger.com,1999:blog-81423177360198529912024-02-07T23:02:06.574-05:00WYSIWYG<a href="http://kufli.blogspot.com">http://kufli.blogspot.com</a><br>
http://github.com/karthik20522..:: karthik ::..http://www.blogger.com/profile/01347159551221689138noreply@blogger.comBlogger111125tag:blogger.com,1999:blog-8142317736019852991.post-89740563423403702782015-05-16T13:53:00.000-04:002015-06-19T13:56:37.062-04:00Trivial bash Script to restart services in AWSToo many AWS servers? Been there, and I hate it. The following is a simple script that I use to restart services running on EC2 instances. It uses the AWS CLI to look up private IP addresses by tag name, then SSHes into each box and runs the command. Note that the tag names are the same as the service names in my case.
<script type="syntaxhighlighter" class="brush: csharp">
#!/bin/bash
set -e
function usage {
echo >&2 "Usage: $0 [ -e environment -n service_name -a action ]"
exit 1
}
while getopts ":n:e:a:" FLAG; do
case $FLAG in
n) NAME=${OPTARG};;
e) ENV=${OPTARG};;
a) ACTION=${OPTARG};;
[?]) usage;;
esac
done
if [[ -z $NAME || -z $ENV || -z $ACTION ]]; then
usage
fi
for fid in $(aws ec2 describe-instances --filters "Name=tag:ServerEnv,Values=${ENV}" "Name=tag:SubSystem,Values=${NAME}" --query 'Reservations[*].Instances[*].PrivateIpAddress' --output text)
do
ssh -n $fid "sudo ${ACTION} ${NAME}; exit;"
done
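# Example invocation (hypothetical environment/service names, assuming the
# script above is saved as restart_service.sh):
#   ./restart_service.sh -e prod -n nginx -a restart
# This runs "sudo restart nginx" on every instance tagged
# ServerEnv=prod and SubSystem=nginx.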
</script>..:: karthik ::..http://www.blogger.com/profile/01347159551221689138noreply@blogger.comtag:blogger.com,1999:blog-8142317736019852991.post-8927376862269832202015-05-13T00:00:00.000-04:002015-09-09T08:16:52.109-04:00RabbitMQ Upgrade - Bash ScriptAn easy-to-read, trivial bash script for upgrading the RabbitMQ service.
<script type="syntaxhighlighter" class="brush: csharp">
function check_rmq_version {
rmq_version=$(sudo rabbitmqctl status | grep "rabbit," | cut -d',' -f3 | cut -d'}' -f1 | tr -d '"')
echo && echo " Version = $rmq_version" && echo
}
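# Example: given a "rabbitmqctl status" output line such as
#   {rabbit,"RabbitMQ","3.4.3"},
# the grep/cut/tr pipeline above extracts: 3.4.3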
function stop_rmq {
echo "Stopping RabbitMQ..."
sudo service rabbitmq-server stop
}
function kill_erlang {
echo "Killing stray RMQ/erlang processes..."
# the erlang VM process may show up as "beam" or "beam.smp" depending on the build
pids=$(pgrep -u rabbitmq beam)
if [ -n "$pids" ]; then
kill -9 $pids
else
echo && echo " RabbitMQ already stopped"
fi
echo
}
function upgrade_rmq_version {
echo "Changing directory to /tmp..."
cd /tmp
echo
echo "wgetting RabbitMQ .rpm file from official website..."
url="http://www.rabbitmq.com/releases/rabbitmq-server/v3.4.3/rabbitmq-server-3.4.3-1.noarch.rpm"
wget $url
echo
echo "Importing the RabbitMQ signing key..."
url="http://www.rabbitmq.com/rabbitmq-signing-key-public.asc"
sudo rpm --import $url
echo
echo "Upgrading RabbitMQ version..."
file="rabbitmq-server-3.4.3-1.noarch.rpm"
sudo yum -y install $file # -y avoids the interactive prompt
echo
}
function start_rmq {
echo "Starting RabbitMQ"
sudo service rabbitmq-server start
}
function main {
check_rmq_version # Checking the current version of RabbitMQ
stop_rmq # Stopping the rabbitmq-server service
kill_erlang # Killing erlang to ensure RMQ is stopped
upgrade_rmq_version # Upgrading RabbitMQ
start_rmq # Starting the rabbitmq-server service
check_rmq_version # Checking the current version of RabbitMQ
}
main
</script>..:: karthik ::..http://www.blogger.com/profile/01347159551221689138noreply@blogger.comtag:blogger.com,1999:blog-8142317736019852991.post-55194951503252939962015-04-14T17:05:00.000-04:002015-06-13T17:50:16.941-04:00ReIndexing Elasticsearch in ScalaThe following Scala script reads from one index and writes to another index using the scan and scroll method. The script also takes a function through which values from the source index can be transformed before being saved into the target index. It assumes you have a field called "id" and a field called "submitDate" so it can continually perform scan and scroll once the preliminary index copy is done, keeping the indexes in sync.
<script type="syntaxhighlighter" class="brush: csharp">
import org.json4s._
import org.json4s.JsonDSL._
import org.json4s.native.JsonMethods._ // provides parse; swap in the jackson variant if that backend is used
import spray.client.pipelining._
import spray.http._
import spray.httpx._
import spray.httpx.encoding._
import scala.concurrent._
import scala.concurrent.duration._
import akka.actor._
import scala.collection.mutable.ListBuffer
import scala.annotation.tailrec
class ElasticsearchReIndexerActor(esInputHost: String,
esOutputHost: String,
inputIndex: String,
outputIndex: String,
indexType: String,
processData: (JValue) => JValue) extends Actor {
import context.dispatcher
implicit val formats: Formats = DefaultFormats // needed by extract[...]
val ouputIndexClient = new ESClient(s"http://$esOutputHost", outputIndex, indexType, context)
val pipeline = addHeader("Accept-Encoding", "gzip") ~> sendReceive ~> decode(Gzip) ~> unmarshal[HttpResponse]
var lastUpdateDateTime: String = "1900-01-01"
def receive = {
case "init" => {
val scanId: String = (Await.result(getScanId(lastUpdateDateTime), 60 seconds) \\ "_scroll_id").extract[String]
self ! scanId
}
case scanId: String => iterateData(scanId)
case ReceiveTimeout => self ! "init"
}
def getScanId(startDate: String): Future[JValue] = {
println("Query data with date gte: " + lastUpdateDateTime)
val esQuery = "{\"query\":{\"bool\":{\"must\":[{\"range\":{\"submitDate\":{\"gte\":\"" + lastUpdateDateTime + "\"}}}]}}}"
val esURI = s"http://$esInputHost/$inputIndex/$indexType/_search?search_type=scan&scroll=5m&size=50"
val esResponse: Future[HttpResponse] = pipeline(Post(esURI, esQuery))
esResponse.map(r => { parse(r.entity.asString) })
}
def iterateData(scanId: String) = {
val esURI = Uri(s"http://$esInputHost/_search/scroll?scroll=5m")
// the scroll endpoint takes the raw scroll id as the request body
val esResponse: HttpResponse = Await.result(pipeline(Post(esURI, scanId)), 60 seconds)
val responseData: JValue = parse(esResponse.entity.asString)
val bulkList = new ListBuffer[JValue]()
val bulkData: ListBuffer[JValue] = (responseData \ "hits" \ "hits" \ "_source") match {
case JNothing | JNull => throw new Exception("Result set is empty")
case JArray(dataList) => {
dataList.foreach { data =>
val id = (data \ "id").extract[String]
val bulkIndexType = ("index" -> (("_index" -> outputIndex) ~
("_type" -> indexType) ~ ("_id" -> id)))
bulkList += bulkIndexType
bulkList += processData(data)
}
bulkList
}
case x => throw new Exception("UNKNOWN TYPE: " + x)
}
val bulkResponse: SearchQueryResponse = Await.result(ouputIndexClient.bulk(bulkList.toList), 60 seconds)
(responseData \\ "_scroll_id") match {
case JNothing | JNull => {
lastUpdateDateTime = DateTime.now.toString
context.setReceiveTimeout(1.minute)
println("Paused at: " + lastUpdateDateTime)
}
case x => self ! x.extract[String]
}
}
}
</script>
Notes:
<ul>
<li>The ESClient is an extension of the <a href="https://github.com/gphat/wabisabi">wabisabi</a> library for Elasticsearch</li>
<li>The actor initially performs a scan-scroll with a submitDate gte 1900</li>
<li>Once the initial scan-scroll is done, it pauses for a minute and performs a scan-scroll again with the submitDate of the previous end time (i.e. DateTime.now minus one minute)</li>
<li>This way, every minute after the previous run, it continually keeps the indexes in sync</li>
<li>The function "processData" provides a way to transform the original data before it is saved to the new index</li>
<li>Bulk indexing is used for saving to the new index, hence the "id" field is required to determine the "_id" of each new document</li>
</ul>
Usage:
<script type="syntaxhighlighter" class="brush: csharp">
val esReindexActor = system.actorOf(Props(new ElasticsearchReIndexerActor(
"localhost:9200",
"localhost:9200",
"inputIndex",
"outputIndex",
"someType",
doNothing)), name = "esReIndexer")
esReindexActor ! "init"
def doNothing(data: JValue): JValue = data
</script>
..:: karthik ::..http://www.blogger.com/profile/01347159551221689138noreply@blogger.comtag:blogger.com,1999:blog-8142317736019852991.post-7339189292917494882015-03-31T16:42:00.000-04:002015-06-13T16:55:23.462-04:00Calling SOAP Service in ScalaScala out of the box has limited capability for calling SOAP services, and neither do libraries such as http-dispatch or spray-client. A SOAP service in reality is just an XML request/response service and, lo and behold, XML is a first-class citizen in Scala. <br> <br>
One of the widely used libraries is <a href="http://scalaxb.org/">ScalaXB</a>, which generates case classes from an xsd or wsdl file. Scalaxb is an XML data-binding tool for Scala that supports W3C XML Schema (xsd) and Web Services Description Language (wsdl) as the input file. This is great, but the result is quite hard to maintain, and code readability goes down the drain because the code is generated. For example, the following screenshot shows what scalaxb generates when either a wsdl or xsd is provided:
<br>
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiyTRy-22V85_oqSzp0DFUrJww7maZKx_dn85TwAgLOb55q8b3nDMsxZQpATZXDojpu8xExci8nEqOzqeLqW0FSmRFX-mwAaOa96X6o3wgSCsbSF1C6ph0xcob_AC5jVgHsr2dFvsFMbcVg/s1600/scalaxb.JPG" imageanchor="1" ><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiyTRy-22V85_oqSzp0DFUrJww7maZKx_dn85TwAgLOb55q8b3nDMsxZQpATZXDojpu8xExci8nEqOzqeLqW0FSmRFX-mwAaOa96X6o3wgSCsbSF1C6ph0xcob_AC5jVgHsr2dFvsFMbcVg/s200/scalaxb.JPG" /></a>
<br>
<br>
But what we really need is a trivial way to call a web service using our existing HTTP clients. The following is one way of doing so.
<br>
In the example below, I am calling a service that returns a list of keywords given a list of keyword IDs.
<script type="syntaxhighlighter" class="brush: csharp">
import spray.client.pipelining._
import akka.actor.{ ActorRefFactory }
import spray.http._
import spray.httpx._
import scala.concurrent.Future
import scala.xml._
class KeywordService(keywordServiceURL: String, implicit val actorRefFactory: ActorRefFactory) {
import actorRefFactory.dispatcher
def sendAndReceive = sendReceive
def fetchKeywords(keywordIds: List[Int], language: String = "en-us"): Future[Elem] = {
if (keywordIds.isEmpty) {
Future { <xml/> }
} else {
val requestDetail = new GetKeywordDetailsRequest("test", 0, Terms(termIds = keywordIds), DesiredTermDetails(languageCodes = language))
doRequest(keywordServiceURL, wrap(requestDetail.toXML))
}
}
private val mapErrors = (response: HttpResponse) => {
response.status.isSuccess match {
case true => response
case false => throw new Exception(response.entity.asString)
}
}
private def doRequest(uri: String, data: Elem): Future[Elem] = {
val kwdServiceURI = Uri(uri)
val pipeline = addHeader("SOAPAction", "http://xxx.com/GetKeywordDetails") ~> sendAndReceive ~> mapErrors ~> unmarshal[HttpResponse]
val kwdServiceResponse: Future[HttpResponse] = pipeline(Post(kwdServiceURI, data))
kwdServiceResponse map {
r => XML.loadString(r.entity.asString(spray.http.HttpCharsets.`UTF-8`).replaceAll("[^\\x20-\\x7e]", ""))
} recover {
case any: UnsuccessfulResponseException => throw any;
}
}
def wrap(xml: Elem): Elem = {
val buf = new StringBuilder
buf.append("<s:Envelope xmlns:s=\"http://schemas.xmlsoap.org/soap/envelope/\">")
buf.append("<s:Body xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\" xmlns:xsd=\"http://www.w3.org/2001/XMLSchema\">")
buf.append(xml.toString.split('\n').map(_.trim.filter(_ >= ' ')).mkString)
buf.append("</s:Body>")
buf.append("</s:Envelope>")
XML.loadString(buf.toString)
}
}
</script>
<br>
In the above script, all I am doing is constructing the SOAP request envelope manually and performing a POST operation. There are a couple of things to note:
<ul>
<li>"SOAPAction" header is manually added to let the service know which service operation it is intended for</li>
<li>Setting the charset to UTF-8 and removing unicode characters "[^\\x20-\\x7e]"</li>
<li>Removing the non-ASCII characters is necessary, as Scala otherwise fails to parse the response. This mostly seems to happen when calling .NET WCF services</li>
</ul>
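For completeness, a hypothetical instantiation of the service above (the URL and keyword IDs are placeholders; an akka ActorSystem is assumed to be in scope to serve as the ActorRefFactory):
<script type="syntaxhighlighter" class="brush: csharp">
implicit val system = akka.actor.ActorSystem("soap-demo")
import system.dispatcher // execution context for the returned Future
val service = new KeywordService("http://example.com/KeywordService.svc", system)
service.fetchKeywords(List(61669, 61670)).foreach(xml => println(xml))
</script>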
<br>
<i>GetKeywordDetailsRequest</i> is a class that holds the input parameters and provides a function that generates the formatted XML for the SOAP request:
<script type="syntaxhighlighter" class="brush: csharp">
case class Terms(
termIds: List[Int],
status: Int = 0
)
case class DesiredTermDetails(
ancestors: Boolean = false,
category: Boolean = false,
children: Boolean = false,
translations: Boolean = true,
mappingSynonyms: Boolean = false,
searchSynonyms: Boolean = false,
requiredRelationships: Boolean = false,
suggestedRelationships: Boolean = false,
languageCodes: String = "en-us"
)
class GetKeywordDetailsRequest(user: String = "test", mode: Int = 0, terms: Terms, desiredTermDetails: DesiredTermDetails) {
def toXML = {
val requestXML = <GetKeywordDetails xmlns="http://xxx.com/">
<GetKeywordDetailsRequest xmlns="http://xxxx.com/zzzz.xsd">
<User>{ user }</User>
<Mode>{ mode }</Mode>
<Terms>
{
for { tID <- terms.termIds } yield <TermID>{ tID }</TermID>
}
<Status>{ terms.status }</Status>
</Terms>
<DesiredTermDetails>
<Ancestors>{ desiredTermDetails.ancestors }</Ancestors>
<Category>{ desiredTermDetails.category }</Category>
<Children>{ desiredTermDetails.children }</Children>
<Translations>{ desiredTermDetails.translations }</Translations>
<MappingSynonyms>{ desiredTermDetails.mappingSynonyms }</MappingSynonyms>
<SearchSynonyms>{ desiredTermDetails.searchSynonyms }</SearchSynonyms>
<RequiredRelationships>{ desiredTermDetails.requiredRelationships }</RequiredRelationships>
<SuggestedRelationships>{ desiredTermDetails.suggestedRelationships }</SuggestedRelationships>
<LanguageCodes>{ desiredTermDetails.languageCodes }</LanguageCodes>
</DesiredTermDetails>
</GetKeywordDetailsRequest>
</GetKeywordDetails>
requestXML
}
}
</script>
..:: karthik ::..http://www.blogger.com/profile/01347159551221689138noreply@blogger.comtag:blogger.com,1999:blog-8142317736019852991.post-56737860686650544022015-01-04T15:26:00.003-05:002015-01-04T16:24:09.485-05:00Scala Parser Combinators - SQL Parser ExampleScala Parser Combinators: <a href="https://github.com/scala/scala-parser-combinators">https://github.com/scala/scala-parser-combinators</a><br><br>
Scala Parser Combinators is basically a parsing framework for extracting data when there is a pattern in the given input. The framework provides a more statically typed, functional way of extracting data than regular expressions, which can get hard to read. <br><br>
In this post, let's build a SQL parser where, given a valid SQL statement, we can identify the "table" name, "column" names, and other SQL properties. The following are some fundamental operators the framework provides that help with parsing:
<ul>
<li>
"<b> | </b>": says “succeed if either the left or right operand parse successfully”
</li>
<li>
"<b> ~ </b>": says “succeed if the left operand parses successfully, and then the right parses successfully on the remaining input”
</li>
<li>
"<b> ~> </b>": says “succeed if the left operand parses successfully followed by the right, but do not include the left content in the result”
</li>
<li>
"<b> <~ </b>": is the reverse, “succeed if the left operand is parsed successfully followed by the right, but do not include the right content in the result”
</li>
<li>
"<b> ^^ </b>": says “if the left operand parses successfully, transform the result using the function on the right”
</li>
<li>
"<b> ^^^ </b>": says “if the left operand parses successfully, ignore the result and use the value from the right”
</li>
<li>
" rep(fn) ": says "apply the parser fn repeatedly (zero or more times) and collect the results in a list"
</li>
<li>
" repsep(ident, char) ": says "parse a repetition of ident separated by the given 'char', e.g. a comma-separated list of identifiers"
</li>
</ul>
Let's start with a set of SQL statements and their associated parser code.
<h4><i>select * from users</i></h4>
<script type="syntaxhighlighter" class="brush: csharp">
import scala.util.parsing.combinator._
import scala.util.parsing.combinator.syntactical._
// the parser defs below are assumed to live in a class extending JavaTokenParsers,
// which provides the "ident" and "wholeNumber" token parsers
case class Select(val fields: String*)
case class From(val table: String)
def selectAll: Parser[Select] = "select" ~ "*" ^^^ (Select("*")) //output: Select("*")
def from: Parser[From] = "from" ~> ident ^^ (From(_)) //output: From("users")
</script>
<h4><i>select name,age from users</i></h4>
<script type="syntaxhighlighter" class="brush: csharp">
def select: Parser[Select] = "select" ~ repsep(ident, ",") ^^ {
case "select" ~ f => Select(f: _*)
}
//output: Select(List[String]("name", "age"))
</script>
<h4><i>select count(name) from users</i></h4>
<script type="syntaxhighlighter" class="brush: csharp">
case class Count(val field: String)
def count: Parser[Count] = "select" ~ "count" ~> "(" ~> ident <~ ")" ^^ {
case exp => Count(exp)
}
//output: Count("name")
</script>
<h4><i>select * from users order by age desc</i></h4>
<script type="syntaxhighlighter" class="brush: csharp">
abstract class Direction
case class Asc(field: String*) extends Direction
case class Desc(field: String*) extends Direction
def order: Parser[Direction] = {
"order" ~> "by" ~> ident ~ ("asc" | "desc") ^^ {
case f ~ "asc" => Asc(f)
case f ~ "desc" => Desc(f)
}
}
//output: Desc("age")
</script>
<h4><i>select * from users order by name, age desc</i></h4>
<script type="syntaxhighlighter" class="brush: csharp">
abstract class Direction
case class Asc(field: String*) extends Direction
case class Desc(field: String*) extends Direction
def order: Parser[Direction] = {
("order" ~> "by" ~> ident ~ ("asc" | "desc") ^^ {
case f ~ "asc" => Asc(f)
case f ~ "desc" => Desc(f)
}) | ("order" ~> "by" ~> repsep(ident, ",") ~ ("asc" | "desc") ^^ {
case f ~ "asc" => Asc(f: _*)
case f ~ "desc" => Desc(f: _*)
})
}
//output: Desc("name", "age")
</script>
<h4><i>select age from users where age>30</i></h4>
<script type="syntaxhighlighter" class="brush: csharp">
case class Where(predicates: Predicate*)
sealed abstract class Predicate
case class NumberEquals(field: String, value: Int) extends Predicate
case class LessThan(field: String, value: Int) extends Predicate
case class GreaterThan(field: String, value: Int) extends Predicate
def where: Parser[Where] = "where" ~> rep(predicate) ^^ (Where(_: _*))
def predicate = (
ident ~ "=" ~ wholeNumber ^^ { case f ~ "=" ~ i => NumberEquals(f, i.toInt) }
| ident ~ "<" ~ wholeNumber ^^ { case f ~ "<" ~ i => LessThan(f, i.toInt) }
| ident ~ ">" ~ wholeNumber ^^ { case f ~ ">" ~ i => GreaterThan(f, i.toInt) })
//output: GreaterThan("age", 30)
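// Putting the pieces together, a full-statement parser could be sketched as
// (opt makes the where/order clauses optional):
//   def query = select ~ from ~ opt(where) ~ opt(order)
//   parseAll(query, "select name,age from users where age>30")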
</script>..:: karthik ::..http://www.blogger.com/profile/01347159551221689138noreply@blogger.comtag:blogger.com,1999:blog-8142317736019852991.post-60185467943435081592015-01-04T14:13:00.001-05:002015-01-04T14:54:25.752-05:00Elasticsearch - Cautionary and Useful Tips<h3>Update/Delete Gotcha:</h3>
In Elasticsearch, an <b>update</b> to a document is basically a delete and reinsert, and a <b>delete</b> operation basically marks the document as deleted rather than actually deleting it. This is a problem especially when you have heavy update/delete traffic: documents are never actually purged, only marked for deletion, which takes up disk space. The following screenshot shows an example where the total number of searchable documents in the index is not the same as the actual total number of documents in the index.
<br> <br>
<a href="http://1.bp.blogspot.com/-0l_DTumcpk0/VKmPid6F9lI/AAAAAAAAEM4/mgxXSqtUmsA/s1600/elasticsearch_delete_update.JPG" imageanchor="1" ><img border="0" src="http://1.bp.blogspot.com/-0l_DTumcpk0/VKmPid6F9lI/AAAAAAAAEM4/mgxXSqtUmsA/s640/elasticsearch_delete_update.JPG" /></a>
<br><br>
To reclaim disk space, you have to optimize the index:
<script type="syntaxhighlighter" class="brush: csharp">
curl -XPOST 'http://localhost:9200/_optimize?only_expunge_deletes=true'
</script>
More information at: <a href="http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/indices-optimize.html">http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/indices-optimize.html</a>
<h3>Memory Limitation - Max ES Heap size:</h3>
By default Elasticsearch allocates 1 GB of heap to its process. This is fine for development purposes, but in production you should generally give half of the server's memory to Elasticsearch. To set the heap size:
<script type="syntaxhighlighter" class="brush: csharp">
export ES_HEAP_SIZE=10g
</script>
The more memory given to Elasticsearch the better, as more data can be held in memory for faster searches/seeks, but there are a few gotchas to be aware of:
<ul>
<li>
<b>Do not cross 32GB</b>
<br>
As it turns out, the JVM uses a trick to compress object pointers when heaps are less than ~32 GB. Once you cross that magical ~30–32 GB boundary, the pointers switch back to ordinary object pointers. The size of each pointer grows, more CPU-memory bandwidth is used, and you effectively lose memory.
</li>
<li>
<b>Give half of the memory to Lucene</b><br>
Lucene is designed to leverage the underlying OS for caching in-memory data structures. If you give all available memory to Elasticsearch’s heap, there won’t be any left over for Lucene. This can seriously impact the performance of full-text search.
</li>
<li>
<b>Disable Memory Swapping</b><br>
Swapping main memory to disk will cripple server and Elasticsearch performance. If memory swaps to disk, a 100-microsecond operation becomes one that takes 10 milliseconds. To avoid this you should enable mlockall, which allows the JVM to lock its memory and prevent it from being swapped out by the OS. In your elasticsearch.yml, set:
<script type="syntaxhighlighter" class="brush: csharp">
bootstrap.mlockall: true
</script>
</li>
</ul>
More information at : <a href="http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/heap-sizing.html">http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/heap-sizing.html</a>
<h3>Index name - Alias</h3>
It's always advisable to assign an alias to an index and have the application use the alias instead of the actual index name. This lets us switch indexes without affecting the calling application. For example, we can create a brand-new index with new mappings, delete the alias from the old index, and assign it to the new one. This way re-indexing becomes a zero-downtime operation.
<script type="syntaxhighlighter" class="brush: csharp">
curl -XPOST 'http://localhost:9200/_aliases' -d '
{
"actions" : [
{ "remove" : { "index" : "kIndex_v1", "alias" : "assets" } },
{ "add" : { "index" : "kIndex_v2", "alias" : "assets" } }
]
}'
</script>
<h3>Logging - Debug/Info/Error:</h3>
By default, the Elasticsearch log level is set to DEBUG in the logging.yml file. This is probably not a good choice, as ES then tends to log everything, which takes up a lot of disk space. I learned this the hard way: I copied data from one index to another for reindexing purposes, Elasticsearch logged every single payload, and the log was almost the size of the index itself! It's best to set the log level to WARN instead of DEBUG or INFO.
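For reference, the change amounts to editing the top of logging.yml (a sketch assuming the stock 1.x layout of that file):
<script type="syntaxhighlighter" class="brush: csharp">
# logging.yml
es.logger.level: WARN
rootLogger: ${es.logger.level}, console, file
</script>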
<h3>Document Versioning:</h3>
For every insert or document update, Elasticsearch either auto-assigns a version number or expects the user to provide one. This is useful for concurrency control. There are four version types:
<ul>
<li><b>internal</b>: (Default) Auto assigned by elasticsearch</li>
<li><b>external</b>: Version number provided by user. Must always be greater than the existing version of the document</li>
<li><b>external_gte</b>: Version number provided by user but the version number should be at the very least equal to the existing document version.</li>
<li><b>force</b>: Version number provided by the user, where the number can be anything; not recommended</li>
</ul>
The above four version options are not very well documented, but can be understood by reading the ES source code at: <a href="https://github.com/elasticsearch/elasticsearch/blob/1816951b6b0320e7a011436c7c7519ec2bfabc6e/src/main/java/org/elasticsearch/index/VersionType.java#L275">Source</a>
<script type="syntaxhighlighter" class="brush: csharp">
//external_gte example
curl -XPOST "http://localhost:9200/designs/shirt/1?version=4&version_type=external_gte" -d'
{
"name": "elasticsearch",
"votes": 1
}'
</script>..:: karthik ::..http://www.blogger.com/profile/01347159551221689138noreply@blogger.comtag:blogger.com,1999:blog-8142317736019852991.post-64845944634650176972014-11-28T16:34:00.000-05:002014-11-28T16:35:11.189-05:00Elasticsearch - No downtime reindexingAs you probably know, mappings in Elasticsearch cannot be changed once set; for example, you cannot change a property type from a string to an int. The only way to make such a change is to copy the entire index into a brand-new index with new mappings.
<br><br>
Reindexing is an unavoidable, common practice, as data model changes affect how data is indexed in Elasticsearch. So while designing the system, assigning an alias to every index is a good choice, since it lets us swap indexes in and out. An alias is basically an alternate name for an index. For example:
<br><br>
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiZ49go7pqLib5PoB99lms7kiniRAv9-yfiDWdHmBPXGDeCGtz8GKLsghywFISZfSYO1xIOZnwXelpN4OwNmrMVMyY5GSmCApB8i_PWzl-3yGWz-Xoi3sELXJITYA8qQyR2A7r1JbbzH-tE/s1600/elasticsearch_alias.JPG" imageanchor="1" ><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiZ49go7pqLib5PoB99lms7kiniRAv9-yfiDWdHmBPXGDeCGtz8GKLsghywFISZfSYO1xIOZnwXelpN4OwNmrMVMyY5GSmCApB8i_PWzl-3yGWz-Xoi3sELXJITYA8qQyR2A7r1JbbzH-tE/s400/elasticsearch_alias.JPG" /></a>
<br><br>
Now all you need to do is create a new index with the new mappings and copy the data over from the original index. To perform this bulk copy, I prefer a tool such as <a href="https://github.com/taskrabbit/elasticsearch-dump">elasticsearch-dump</a>.
<br><br>
The following command copies data from one index to a second index:
<script type="syntaxhighlighter" class="brush: csharp">
elasticdump --input=http://localhost:9200/assets_v1 --output=http://locahost:9200/assets_v2 --type=data --bulk=true --limit=500 --bulk-use-output-index-name=true
</script>
Now all you need to do is delete the alias from the original index and assign it to the new index. This way the calling client, which uses the alias for querying and indexing, sees no impact:
<script type="syntaxhighlighter" class="brush: csharp">
curl -XPOST 'http://localhost:9200/_aliases' -d '
{
"actions" : [
{ "remove" : { "index" : "assets_v1", "alias" : "assets" } },
{ "add" : { "index" : "assets_v2", "alias" : "assets" } }
]
}'
</script>
But what about the documents that were updated during the scan and scroll process? That's tricky, but if your model has an update-date property, you can always re-run elasticdump to fetch only the documents updated after a certain date/time.
<script type="syntaxhighlighter" class="brush: csharp">
elasticdump --input=http://localhost:9200/assets_v1 --output=http://locahost:9200/assets_v2 --type=data --bulk=true --limit=500 --bulk-use-output-index-name=true --searchBody='{"query":{"bool":{"must":[{"range":{"asset.submitDate":{"gte":"2014-09-01","lte":"2014-09-21"}}}]}}}'
</script>..:: karthik ::..http://www.blogger.com/profile/01347159551221689138noreply@blogger.comtag:blogger.com,1999:blog-8142317736019852991.post-27004026496863728642014-11-23T16:57:00.001-05:002014-11-23T17:46:25.115-05:00Elasticsearch - Dynamic Data MappingData in Elasticsearch can be indexed without providing any information about its content, as ES accepts dynamic properties and detects whether a property value is a string, integer, datetime, boolean, etc. In this article, let's get dynamic mapping set up the right way, along with some commonly performed search operations.
<br><br>
To start with a simple example, let's consider the following object:
<script type="syntaxhighlighter" class="brush: csharp">
$ curl -XPOST http://localhost:9200/keywords/keyword/61669 -d
'{
"keywordId": 61669,
"keywordText": "Massaging",
"keywordType": "Submitted"
}'
</script>
Indexing the above JSON blob into Elasticsearch would result in the following mapping:
<script type="syntaxhighlighter" class="brush: csharp">
{
"keywords" : {
"mappings" : {
"keyword" : {
"properties" : {
"keywordId" : {
"type" : "long"
},
"keywordText" : {
"type" : "string"
},
"keywordType" : {
"type" : "string"
}
}
}
}
}
}
</script>
It's great that Elasticsearch automatically detected keywordId to be a long and keywordText and keywordType to be strings. But if you look carefully, keywordText and keywordType are set to the default of "analyzed". This means those two fields are now available for partial text search. But I want keywordType to be "not_analyzed", as users would never partial-text search it. To overcome this while preserving the dynamic nature of the index, we can create the keywords index with a mapping provided for certain fields:
<script type="syntaxhighlighter" class="brush: csharp">
$ curl -XPUT http://localhost:9200/keywords -d
'{
"mappings": {
"keyword": {
"dynamic": "true",
"properties": {
"keywordType": {
"type": "string",
"index": "not_analyzed"
}
}
}
}
}'
</script>
As you can see above, we have set dynamic to "true" but told the index that any field matching "keywordType" should use the specified mapping instead of ES figuring it out for us.
<br><br>
Now keywordType is "not_analyzed", which basically makes it an "exact match" search, including case (upper vs. lower). But how do I make keywordType a case-insensitive exact match? One way is to lowercase keywordType at index time and have the calling system send lowercase searches only. For this, the following mapping changes need to happen:
<script type="syntaxhighlighter" class="brush: csharp">
$ curl -XPUT http://localhost:9200/keywords -d
'{
"settings": {
"index": {
"analysis": {
"analyzer": {
"analyzer_keyword": {
"tokenizer": "keyword",
"filter": "lowercase"
}
}
}
}
},
"mappings": {
"keyword": {
"dynamic": "true",
"properties": {
"keywordType": {
"type": "string",
"analyzer": "analyzer_keyword"
}
}
}
}
}'
</script>
We are basically using the "keyword" tokenizer that Elasticsearch provides, which makes the field an exact-match search, plus a "lowercase" filter, which automatically converts the input to lower case. More info at <a href="http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/analysis-tokenizers.html">Elasticsearch tokenizers</a>.
<br><br>
So far so good, but I don't want users to search across all fields, which Elasticsearch provides by default; I would rather have the user specify which field they want to search on. Why? Doing an "_all" search on an index with hundreds of fields is a very expensive operation; that's why! To disable the "_all" search:
<script type="syntaxhighlighter" class="brush: csharp">
#mapping configuration from above
"mappings": {
"keyword": {
"_ttl" : { "enabled" : true, "default" : "5d" },
"dynamic": "true",
"_all": {
"enabled": false
},
...
}
</script>
OK, great, _all search is now disabled. But since dynamic mapping is turned on, any new field can automagically be indexed, and I don't want Elasticsearch to index a binary blob, as that would consume too much memory; I'd rather just store it and not index it. For this, the updated mapping would look like:
<script type="syntaxhighlighter" class="brush: csharp">
#mapping configuration from above
...
"properties": {
"keywordType": {
"type": "string",
"analyzer": "analyzer_keyword"
},
"blob": {
"type": "string",
"enabled": false
}
}
...
</script>
Setting "enabled": false lets Elasticsearch know that the field should not be indexed for search purposes but will still be part of the returned document. So basically it's stored but not searchable.
<br><br>
Since dynamic mapping is enabled, Elasticsearch parses through every single property to determine its type. As much as I would love for Elasticsearch to perform all the magic mappings, let's give it a helping hand by letting Elasticsearch know that certain properties are date types based on their names.
<script type="syntaxhighlighter" class="brush: csharp">
#mapping configuration from above
...
"mappings": {
"keyword": {
"dynamic": "true",
"date_detection": false,
"dynamic_templates": [
{
"date_index": {
"mapping": {
"type": "date"
},
"match": ".*Date|date",
"match_pattern": "regex"
}
}
]
...
</script>
So basically, if any property name ends with either "date" or "Date", assume it's a DateTime object. For example, "createDate" or "updateDate" would match the above template. Also, as you may notice, "date_detection" is set to false, so the template, not the built-in detection, decides.
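To sanity-check which field names a pattern like ".*Date|date" would catch, the regex can be exercised locally. A small sketch (plain Python, purely illustrative; ES itself evaluates the pattern with Java's regex engine against the field name):

```python
import re

# Same pattern as in the dynamic template above; assumed here to be
# applied as a full match against the field name.
DATE_PATTERN = re.compile(r".*Date|date")

def is_date_field(name):
    """Return True if the dynamic template would map this field as a date."""
    return DATE_PATTERN.fullmatch(name) is not None

for field in ["createDate", "updateDate", "date", "dateTime", "updated"]:
    print(field, is_date_field(field))
```

Note that "dateTime" does not match: the first alternative requires the name to end in "Date", and the second only matches the exact string "date".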
<br><br>
How about making all strings exact-match, lowercase searchable as well?
<script type="syntaxhighlighter" class="brush: csharp">
#mapping configuration from above
...
"dynamic_templates": [
{
"date_index": {
"mapping": {
"type": "date"
},
"match": ".*Date|date",
"match_pattern": "regex"
}
},
{
"string_index": {
"mapping": {
"analyzer": "analyzer_keyword",
"type": "string"
},
"match": "*",
"match_mapping_type": "string"
}
}
]
...
</script>
So providing dynamic templates when the properties are unknown helps a lot, and avoids having every single field "analyzed", which takes up too much memory and extra processing time. The memory consumption analysis will be for another blog post.
<br><br>
As an extra, Elasticsearch provides a way to match templates to index names. This means we give Elasticsearch a template file with mapping information, and when an index is created, ES automatically matches the index name against the given template and applies the mappings. This template needs to be saved in the "/etc/elasticsearch/templates" folder. An example template file:
<script type="syntaxhighlighter" class="brush: csharp">
#/etc/elasticsearch/templates/keywords_template.json
{
"keywords_template": {
"template": "keywords",
"order": 0,
"settings": {
"index.number_of_shards": 7,
"index.number_of_replicas": 1
},
"mappings": {
"keyword": {
"dynamic": "true",
"dynamic_templates": [
{
"disable_string_index": {
"mapping": {
"type": "string",
"index": "not_analyzed",
"enabled": false
},
"match": "*",
"match_mapping_type": "string"
}
}
],
"_all": { "enabled": false }
}
}
}
}
</script>
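As a side note, the "template" field in such a file is a pattern matched against the name of every newly created index, and it may contain wildcards. A quick sketch of the matching behavior (using Python's fnmatch as a stand-in for ES's wildcard matching; the index names are hypothetical):

```python
from fnmatch import fnmatch

def template_applies(template_pattern, index_name):
    """Rough model: does an index template's pattern match a new index name?"""
    return fnmatch(index_name, template_pattern)

print(template_applies("keywords", "keywords"))        # exact name, as above
print(template_applies("keywords*", "keywords_2015"))  # wildcard variant
```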
..:: karthik ::..http://www.blogger.com/profile/01347159551221689138noreply@blogger.comtag:blogger.com,1999:blog-8142317736019852991.post-32271597861873471942014-11-23T15:54:00.002-05:002014-11-23T16:38:34.374-05:00Elasticsearch - Zen, AWS Cluster SetupIn a cluster environment, multiple Elasticsearch nodes/servers join to form a cluster where shards are distributed and replicated among them, but to the outside world it is presented as a single system. For Elasticsearch nodes to find each other, ES provides two discovery methods: Zen discovery, and cloud-based discovery via plugins for Azure, AWS and Google Compute Engine.
<h3>Zen Discovery</h3>
<script type="syntaxhighlighter" class="brush: csharp">
#elasticsearch.yml
# Cluster name
cluster.name: es-test-cluster
#discovery
discovery.type: zen
# Minimum nodes alive to constitute an operational cluster
discovery.zen.minimum_master_nodes: 2
# Unicast Discovery
discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: [ "192.168.1.1", "192.168.1.2", "192.168.1.3" ]
#failure detection
discovery.zen.ping.fd.ping_interval: 60s
discovery.zen.ping.fd.ping_timeout: 60s
discovery.zen.ping.fd.ping_retries: 10
</script>
From the above snippet it's pretty straightforward: discovery.type is "zen", the minimum number of master-eligible nodes required to form an operational cluster is 2, unicast pinging is used to find the other hosts, and the fault-detection settings provide some recovery slack if a server goes offline or the network has problems. That is about all Zen discovery has to offer, simple and easy! More info at <a href="http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/modules-discovery-zen.html">Zen discovery</a>
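A commonly recommended formula for minimum_master_nodes is a majority of the master-eligible nodes, which avoids split-brain scenarios. A sketch of that rule (the helper function is my own, not an ES API):

```python
def minimum_master_nodes(master_eligible):
    """Majority quorum: more than half of the master-eligible nodes."""
    return master_eligible // 2 + 1

# With the three unicast hosts listed above, the setting of 2 is a majority:
print(minimum_master_nodes(3))  # → 2
```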
<h3>AWS/EC2 Discovery</h3>
For EC2 discovery, we first need to install the cloud-aws plugin if not already installed
<script type="syntaxhighlighter" class="brush: csharp">
$ /usr/share/elasticsearch/bin/plugin -install elasticsearch/elasticsearch-cloud-aws/2.3.0
</script>
<script type="syntaxhighlighter" class="brush: csharp">
#elasticsearch.yml
# Cluster name
cluster.name: es-test-cluster
#discovery type
discovery.type: ec2
#optional -Region setting to discover nodes
cloud.aws.region: us-west-2
#optional - Security groups
discovery.ec2.groups: sg-xxxxxxx
#optional - to store aws attributes to the nodes - for node awareness
cloud.node.auto_attributes: true
#If not using IAM roles
cloud.aws.access_key: xxxxxxxxxxxxx
cloud.aws.secret_key: xxxxxxxxxxxxxxxxx
</script>
From the above config, the discovery type is ec2, optionally scoped to a region and security group for the plugin to discover other nodes. If there is no IAM role associated with the server, then the AWS secret_key and access_key need to be provided in order for the plugin to query AWS for node information. <br><br>
Having cloud.node.auto_attributes set to true adds an aws_availability_zone attribute to the node's properties, which helps with allocation awareness. What this means is that, for an index with a replication factor of 1, ES uses this attribute to make sure a replica shard is placed on a different box, in a different availability zone, from its primary. More info at <a href="http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/modules-cluster.html#allocation-awareness">Shard Awareness</a>
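A simplified model of what this zone awareness buys you: the replica is forced into a different availability zone than its primary. This is a rough sketch; the real allocator also balances shard counts, disk usage and more.

```python
def pick_replica_zone(primary_zone, zones):
    """Return a zone for the replica that differs from the primary's zone."""
    for z in zones:
        if z != primary_zone:
            return z
    return None  # only one zone available: the replica stays unassigned

zones = ["us-west-2a", "us-west-2b"]
print(pick_replica_zone("us-west-2a", zones))  # → us-west-2b
```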
<script type="syntaxhighlighter" class="brush: csharp">
#running the following should provide you with region information in the attributes section
$ curl http://{ec2-esurl}/_nodes/process?pretty
{
"cluster_name" : "es-cluster-test",
"nodes" : {
"Oaa7jVWNSeyHlmhyYCr32g" : {
"name" : "Mister Fear",
...
"attributes" : {
"aws_availability_zone" : "us-west-2a",
"master" : "true"
}
},
"QibV2okLRyuH0ti6bxvANA" : {
"name" : "Blackheath",
...
"attributes" : {
"aws_availability_zone" : "us-west-2b",
"master" : "true"
}
},
"bEL8b2-lRZ2Cif1nM4WmIQ" : {
"name" : "Sayge",
...
"attributes" : {
"aws_availability_zone" : "us-west-2a",
"master" : "true"
}
}
}
}
</script>
We can make Elasticsearch node discovery a little faster by filtering the set of servers it needs to ping during discovery. This filtering can be achieved using ec2 tags, if they are assigned to the EC2 servers. In an enterprise environment where hundreds of EC2 servers are deployed on AWS, pinging every single one of them would take a very long time, so this should help speed things up.
<script type="syntaxhighlighter" class="brush: csharp">
#elasticsearch.yml
discovery.ec2.tag.env: prod
</script>
More information on the EC2 discovery plugin at <a href="https://github.com/elasticsearch/elasticsearch-cloud-aws">cloud-aws</a> and on the various discovery mechanisms at <a href="http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/modules-discovery.html">elasticsearch-discovery</a>
..:: karthik ::..http://www.blogger.com/profile/01347159551221689138noreply@blogger.comtag:blogger.com,1999:blog-8142317736019852991.post-16713152236765749512014-11-23T14:50:00.000-05:002014-11-23T15:42:05.948-05:00Elasticsearch - Advanced settings and TweaksNow that we have Elasticsearch <a href="http://kufli.blogspot.com/2014/11/elasticsearch-installation-and-general.html">installed</a> and confirmed working, we can start looking into more advanced settings and tweaks to improve Elasticsearch performance. For most use cases, the following three areas of Elasticsearch configuration need to be addressed:
<br>
<ul>
<li>Memory configuration</li>
<li>Threadpool configuration</li>
<li>Data Store configuration</li>
</ul>
<h3>Memory configuration:</h3>
By default Elasticsearch assigns a minimum heap size of 256MB and a maximum heap size of 1GB. But on real-world servers with many GB of memory available, a good rule of thumb is to give the Elasticsearch process 50% of the server's memory. This can be set using:
<script type="syntaxhighlighter" class="brush: csharp">
$ export ES_HEAP_SIZE=2048m
</script>
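The 50% rule is simple arithmetic; here is a small sketch that derives the value (the helper names are mine, and read_total_mb assumes a Linux /proc/meminfo):

```python
def heap_size_mb(total_mb):
    """Rule of thumb from above: give the ES process half of the machine's RAM."""
    return total_mb // 2

def read_total_mb(meminfo_path="/proc/meminfo"):
    """Read total RAM in MB on a Linux box (MemTotal is reported in kB)."""
    with open(meminfo_path) as f:
        for line in f:
            if line.startswith("MemTotal:"):
                return int(line.split()[1]) // 1024
    raise RuntimeError("MemTotal not found")

# e.g. a machine with 4096 MB of RAM:
print("export ES_HEAP_SIZE=%dm" % heap_size_mb(4096))  # → export ES_HEAP_SIZE=2048m
```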
But providing the heap size is just not enough as the memory can be swapped out by the OS. To prevent this we need to lock the process address space assigned to Elasticsearch. This can be done by adding the following line to elasticsearch.yml file and restarting elasticsearch:
<script type="syntaxhighlighter" class="brush: csharp">
#elasticsearch.yml
bootstrap.mlockall: true
</script>
After starting Elasticsearch, you can see whether this setting was applied successfully by checking the value of mlockall in the output from this request:
<script type="syntaxhighlighter" class="brush: csharp">
$curl http://localhost:9200/_nodes/process?pretty
"nodes" : {
"Oaa7jVWNSeyHlmhyYCr32g" : {
"name" : "Mister Fear",
"transport_address" : "inet[/127.0.0.1:9300]",
"host" : "ip-127-0-0-1",
"ip" : "127.0.0.1",
"version" : "1.3.2",
"build" : "dee175d",
"http_address" : "inet[/127.0.0.1:9200]",
"attributes" : {
"master" : "true"
},
"process" : {
"refresh_interval_in_millis" : 1000,
"id" : 3599,
"max_file_descriptors" : 100000,
"mlockall" : false
}
}
}
</script>
But here <b>mlockall is false</b>. If you see that mlockall is false, it means the mlockall request has failed. The most probable reason is that the user running Elasticsearch doesn't have permission to lock memory. This can be granted by running <b>ulimit -l unlimited</b> as root before starting Elasticsearch.
<br>
<i>Note that you will always have to run ulimit -l unlimited before an Elasticsearch restart, or else mlockall is set back to false; this is probably because the user the ES process runs as is not root</i>
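Rather than eyeballing the curl output, the mlockall flag can be checked programmatically by parsing the _nodes/process response. A sketch using a response shaped like the (truncated) output above:

```python
import json

# Sample shaped like the /_nodes/process output shown above (truncated)
response = json.loads("""
{
  "nodes": {
    "Oaa7jVWNSeyHlmhyYCr32g": {
      "name": "Mister Fear",
      "process": { "max_file_descriptors": 100000, "mlockall": false }
    }
  }
}
""")

def nodes_without_mlockall(resp):
    """Return the names of nodes where memory locking failed to apply."""
    return [n["name"] for n in resp["nodes"].values()
            if not n["process"]["mlockall"]]

print(nodes_without_mlockall(response))  # → ['Mister Fear']
```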
<h3>Threadpool Configuration:</h3>
Elasticsearch holds several thread pools, with a queue bound to each pool that allows pending requests to be held instead of discarded. For example, the <b>index</b> operation by default has a fixed thread pool whose size is the number of processors in the system and a queue_size of 200. So if there are more than 200 pending requests, new requests are discarded and the following exception is returned to the client: EsRejectedExecutionException[rejected execution (queue capacity 200)..]
<br>
To overcome this limitation and increase the concurrency of Elasticsearch's message processing, the following settings can be tweaked:
<script type="syntaxhighlighter" class="brush: csharp">
#elasticsearch.yml
#for search operation
threadpool.search.type: fixed
threadpool.search.size: 50
threadpool.search.queue_size: 200
#for bulk operations
threadpool.bulk.type: fixed
threadpool.bulk.size: 10
threadpool.bulk.queue_size: 100
#for indexing operations
threadpool.index.type: fixed
threadpool.index.size: 60
threadpool.index.queue_size: 1000
</script>
So if the use case is primarily search, i.e. more search operations than indexing operations, the thread pool for search can be increased and the thread pool for indexing kept much lower. That said, queuing up thousands of messages is probably not a wise decision, so tweak responsibly. More information about thread pool sizes and configuration can be found at <a href="http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/modules-threadpool.html">Elasticsearch Threadpool</a>
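The rejection behavior can be modeled as simple capacity arithmetic: a fixed pool of size p plus a queue of size q can hold at most p + q requests, and anything beyond that is rejected. A rough sketch (this deliberately ignores that busy threads drain the queue over time):

```python
def rejected(concurrent_requests, pool_size, queue_size):
    """Simplified model: requests beyond pool capacity plus queue are rejected."""
    capacity = pool_size + queue_size
    return max(0, concurrent_requests - capacity)

# Default index pool on a 4-core box: size 4, queue 200, so 204 in flight max
print(rejected(250, 4, 200))  # → 46
```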
<br><br>
ES by default assumes that you're going to use it mostly for searching and querying, so it allocates 90% of its total allocated heap memory for searching. This can be changed with the following setting. Note that the implications can be significant, as you are reducing the memory allocated for search purposes!
<script type="syntaxhighlighter" class="brush: csharp">
#elasticsearch.yml
indices.memory.index_buffer_size: 30%
#above settings grants ES 30% of it's heap memory for index buffer purpose
</script>
More at <a href="http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/modules-indices.html">Indices Module</a>
<h3>Store and indices Configuration:</h3>
The store module allows you to control how index data is stored. An index can either be stored in memory (no persistence) or on disk (the default). Unless your data is temporary, using the in-memory store is a bad idea as you will lose the data upon restart. For disk-based storage, we need fast disk seeks when the data being looked up is not in memory. The most optimal option is mmapfs, which uses memory-mapped files.
<script type="syntaxhighlighter" class="brush: csharp">
#elasticsearch.yml
index.store.type: mmapfs
</script>
More information regarding storage options can be found at <a href="http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/index-modules-store.html">Elasticsearch Store</a>
..:: karthik ::..http://www.blogger.com/profile/01347159551221689138noreply@blogger.comtag:blogger.com,1999:blog-8142317736019852991.post-80773782508201676542014-11-23T13:34:00.001-05:002014-11-23T14:17:45.079-05:00Elasticsearch - Installation and general settingsInstallation of Elasticsearch is a breeze by which I mean it's as simple as downloading the zip/tar file and unzipping it.
<script type="syntaxhighlighter" class="brush: csharp">
$ wget https://download.elasticsearch.org/elasticsearch/elasticsearch/elasticsearch-1.3.2.zip
$ unzip elasticsearch-1.3.2.zip -d /usr/share/elasticsearch
$ /usr/share/elasticsearch/bin/plugin -install mobz/elasticsearch-head
$ /usr/share/elasticsearch/bin/plugin -install elasticsearch/elasticsearch-cloud-aws/2.3.0
</script>
<br>
In the above bash script, we are essentially downloading the file, unzipping it and installing a couple of plugins for administration and cloud discovery. Now that we have Elasticsearch unzipped, we can optionally configure the locations of its data, log and configuration folders. There are two ways to provide Elasticsearch this configuration. The first way is to provide the paths in the elasticsearch.yml configuration file. For example:
<script type="syntaxhighlighter" class="brush: csharp">
#elasticsearch.yml
path.data: /var/elasticsearch/data
path.logs: /var/elasticsearch/logs
path.work: /tmp/elasticsearch
</script>
The second way, when running Elasticsearch in daemon mode, is to set up the paths in the sysconfig file, generally located at /etc/sysconfig/elasticsearch. The configuration in this file is passed to ES as command-line settings when Elasticsearch is started.
<script type="syntaxhighlighter" class="brush: csharp">
#vi /etc/sysconfig/elasticsearch
LOG_DIR=/var/log/elasticsearch
DATA_DIR=/var/lib/elasticsearch
WORK_DIR=/tmp/elasticsearch
CONF_DIR=/etc/elasticsearch
CONF_FILE=/etc/elasticsearch/elasticsearch.yml
</script>
<i>Note that the configuration in the elasticsearch.yml file overrides the sysconfig file</i>
<br>
But let's say we used a package manager or puppet scripts to install Elasticsearch and now have no idea where the config files and data directories are located. One easy way to get this information is to curl the Elasticsearch nodes endpoint, which returns all the information about each node, including paths and configuration:
<script type="syntaxhighlighter" class="brush: csharp">
$ curl http://localhost:9200/_nodes?pretty
"path" : {
"conf" : "/etc/elasticsearch",
"data" : "/var/elasticsearch/data",
"logs" : "/var/elasticsearch/logs",
"work" : "/tmp/elasticsearch",
"home" : "/usr/share/elasticsearch"
},
"node" : {
"data" : "true",
"master" : "true"
}
</script>
<i>More information on the directory structure can be found at <a href="http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/setup-dir-layout.html">Elasticsearch Directory layout</a></i>
<br>
OK, now that we have Elasticsearch unzipped and the data directory set up, let's update some minimal but essential Elasticsearch configuration:
<script type="syntaxhighlighter" class="brush: csharp">
#elasticsearch.yml
cluster.name: es-test-cluster
node.name: es-node-1
node.master: true
node.data: true
</script>
<i>List of all configuration can be found at: <a href="https://github.com/elasticsearch/elasticsearch/blob/master/config/elasticsearch.yml">Elasticsearch configuration file</a></i>
<br>
Note that if node.name is not provided, Elasticsearch automatically assigns a node name based on Marvel comic characters. This is fine as long as the Elasticsearch process never restarts, since a restart assigns a new name, which could be trouble if you are monitoring ES processes by node name.
<br>Now that we have the basic elasticsearch settings updated/added, we can start elasticsearch by running:
<script type="syntaxhighlighter" class="brush: csharp">
$ /usr/share/elasticsearch/bin/elasticsearch
#(or) in daemon mode
$ /usr/share/elasticsearch/bin/elasticsearch -d
#(or) using init.d script if one installed by the package manager
$ /etc/init.d/elasticsearch start
</script>
..:: karthik ::..http://www.blogger.com/profile/01347159551221689138noreply@blogger.comtag:blogger.com,1999:blog-8142317736019852991.post-74786589420352447292014-08-09T16:50:00.000-04:002014-08-09T16:55:20.469-04:00AKKA.NET Actor Creation - Typed, Untyped, Receive ActorThere are three potential ways of creating Actors using Akka.net (v0.6.2).
<br><br>
1. Using a <b>Receive Actor</b>: In order to use the Receive() method inside an actor, the actor must inherit from ReceiveActor. All message handlers must be registered in the constructor. Note that when multiple handlers match a message, the one that appears first is used. More info in the <a href="http://akkadotnet.github.io/wiki/ReceiveActor">Akka.net documentation</a>.
<br><br>
2. Using a <b>Typed Actor</b>: In a typed actor, you explicitly declare the message types that the actor receives. This helps code readability, as you are aware of the types of message the actor handles. To receive messages, you define overloaded "Handle" functions, one per message type.
<br><br>
3. Using an <b>Untyped Actor</b>: An untyped actor, similar to a receive actor, requires you to pattern-match on the message type. To use an untyped actor, override the OnReceive function.
<script type="syntaxhighlighter" class="brush: csharp">
public class ImagePersistanceActor : ReceiveActor, ILogReceive
{
public ImagePersistanceActor()
{
Receive<string>(message => {
Console.WriteLine("Echo from Receive actor: " + message);
});
Receive<Image>(image => {
Console.WriteLine("OK from Receive Actor: " + image.Id);
});
}
}
public class ImagePersistanceActor2 : TypedActor, IHandle<string>, IHandle<Image>
{
public void Handle(string message) {
Console.WriteLine("Echo From typed actor: " + message);
}
public void Handle(Image image) {
Console.WriteLine("OK from Typed Actor: " + image.Id);
}
}
public class ImagePersistanceActor3 : UntypedActor
{
protected override void OnReceive(object message)
{
if (message.GetType() == typeof(string)) {
Console.WriteLine("Echo from Untyped Actor: " + message);
}
else if (message.GetType() == typeof(Image)) {
var image = (Image)message;
Console.WriteLine("OK from Untyped Actor: " + image.Id);
}
}
}
</script>
<br>
Handling unknown messages can be done in two ways:
<script type="syntaxhighlighter" class="brush: csharp">
//Using ReceiveAny as a catch-all, registered after the typed handlers:
Receive<string>(s => Console.WriteLine("Received string: " + s));
ReceiveAny(o => Console.WriteLine("Received object: " + o));
//overriding Unhandled
protected override void Unhandled(object message)
{
//Do something with the message.
}
</script>..:: karthik ::..http://www.blogger.com/profile/01347159551221689138noreply@blogger.comtag:blogger.com,1999:blog-8142317736019852991.post-42075333506956015752014-04-24T01:02:00.000-04:002016-04-14T15:30:23.483-04:00Image Ranking by global feature estimationHaving an automated image ranking process would be very beneficial to companies such as <a href="http://500px.com">500px</a> or <a href="http://gettyimages.com">Gettyimages </a>where thousands of images are ingested/uploaded every day and are traditionally ranked manually by an editor or by crowdsourcing. This human intervention can be temporarily avoided if images can be ranked by estimating their quality.
<br><br>
To build an image quality estimator, following image quality properties were used for this proof-of-concept:
<ul>
<li>Blur Estimation:<br>
Basically, this technique estimates the proportion of blurred pixels. Results are in the range 0-1. A higher number implies a sharper image. <br>
Example code: <a href="https://github.com/tokenrove/blur-detection">Blur-detection</a>
</li>
<li>Sharpness:<br>
Sharpness measures the clarity and level of detail of an image.<br>
Example code: <a href="http://opencv-users.1802565.n2.nabble.com/Estimation-of-Image-Sharpness-td6052239.html">Estimation-of-Image-Sharpness</a>
</li>
<li>Colorfulness: <br>
Though there is no exact way to measure the colorfulness of an image, there are various algorithms that approximate it. One such algorithm is Hasler and Susstrunk's colorfulness metric. <a href="http://infoscience.epfl.ch/record/33994/files/HaslerS03.pdf?version=1">Paper</a>
</li>
<li>Naturalness: <br>
Naturalness is basically a single valued summary of how natural the colours in an image are. One such algorithm is Color Naturalness index (CNI) defined by Huang, Qiao & Wu. <a href="http://cilab.knu.ac.kr/seminar/Seminar/2007/20070804%20Natural%20color%20image%20enhancement%20and%20evaluation%20algorithm%20based%20on%20human%20visual%20system.pdf">Paper</a>
</li>
<li>Image Contrast:<br>
Estimating the contrast of an image.
</li>
<li>Colour Contrast:<br>
This is basically weighted average of the average colour difference of all the segments in the image.
</li>
<li>Brightness: <br>
Extract the average brightness of an image.</li>
</ul>
In the following examples there are 3 ranks: Rank 1 = good quality, Rank 2 = medium and Rank 3 = bad quality images. Following is a snippet of my rank calculation.
<script type="syntaxhighlighter" class="brush: java">
//Blur range: 0 to 1, where higher the value less blur the image is
//Sharpness range: 0 to 1, higher the better
//Color range: 0 to 0.5
//Naturalness range:0 to 1
val imageRank = (blurValue.get.value + sharpnessValue.get.value + colorValue.get.value + naturalnessValue.get.value) match {
case t if t >= 2 => 1
case t if t >= 1 && t < 2 => 2
case _ => 3
}
</script>
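For readers who want to check the thresholds without a Scala toolchain, here is the same ranking logic as a standalone sketch. Note one assumption on my part: I am treating the "quality" value from the result dumps below as the naturalness term.

```python
def image_rank(blur, sharpness, color, naturalness):
    """Mirror of the Scala thresholds above: a higher feature sum means a better rank."""
    total = blur + sharpness + color + naturalness
    if total >= 2:
        return 1
    elif total >= 1:
        return 2
    return 3

# Values from the first Rank 1 example below (sum ≈ 2.5, so rank 1)
print(image_rank(0.9660969387755102, 0.5067756918923018,
                 0.1462474560577043, 0.8798513615119794))  # → 1
```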
<br>
<table>
<tr><td colspan="3"><b>Image Rank 1 Examples</b></td></tr>
<tr>
<td><img width="230px" src="http://ppcdn.500px.org/68047517/b0f701bdc2dd313f0f4742aec4d236b031877a28/3.jpg"/></td>
<td><img width="230px" src="http://ppcdn.500px.org/68051789/088a642837c8f7075cb8dc8f1f4f0f19f122ab1f/3.jpg"/></td>
<td><img width="230px" src="http://ppcdn.500px.org/68040551/9d3e5415700c31dc38871e7ed8b2fee13da4ebff/3.jpg"/></td>
</tr>
<tr>
<td>
"blur":0.9660969387755102<br>
"sharpness":0.5067756918923018<br>
"color":0.1462474560577043<br>
"quality":0.8798513615119794<br>
"contrast":0.08234799394638759<br>
"colorContrast":8.328080618896797<br>
"brightness":0.14739056340426107
</td>
<td>
"blur":0.966530612244898<br>
"sharpness":0.49986904977578045<br>
"color":0.19059606130699966<br>
"quality":0.8476687344507642<br>
"contrast":0.08834899629937157<br>
"colorContrast":25.384664271394108<br>
"brightness":0.24995158193079098<br>
</td>
<td>
"blur":0.7220025510204081<br>
"sharpness":0.15089643141301262<br>
"color":0.4003952223119604<br>
"quality":0.9609951671164367<br>
"contrast":0.2292792429036637<br>
"colorContrast":32.053295457355276<br>
"brightness":0.4115245849574555<br>
</td>
</tr>
<tr><td colspan="3"><b>Image Rank 2 Examples</b></td></tr>
<tr>
<td><img width="230px" src="http://ppcdn.500px.org/68054375/3fe15ce1af0a413e54d3f6f71dab4c3d69f00efd/3.jpg"/></td>
<td><img width="230px" src="http://ppcdn.500px.org/57327734/b406167f2b1ca7e929654680c1c4cde2e9583ab6/3.jpg"/></td>
<td><img width="230px" src="http://ppcdn.500px.org/5392284/6dd034b976749d9bee7ffc119e52ea169c7e34b6/3.jpg"/></td>
</tr>
<tr>
<td>
"blur":0.8221428571428572<br>
"sharpness":0.15935783910386503<br>
"color":0.04443692154503701<br>
"quality":0.5085070823398357<br>
"contrast":0.14823618660855284<br>
"colorContrast":26.781796490365295<br>
"brightness":0.27380289202677777<br>
</td>
<td>
"blur":0.9705484693877551<br>
"sharpness":0.41994123003434913<br>
"color":0.19996393727671855<br>
"quality":0.3153617863913573<br>
"contrast":0.10908353919667178<br>
"colorContrast":43.63807505087183<br>
"brightness":0.18082135350839887<br>
<td>
"blur":0.8633290816326531<br>
"sharpness":0.2271008743069167<br>
"color":0.12817043260208896<br>
"quality":0.6680113666484762<br>
"contrast":0.05242860949574584<br>
"colorContrast":44.47249174965709<br>
"brightness":0.222476585047196<br>
</td>
</tr>
</table>
<br>
From the above results, a simple image ranking system can be built automatically by estimating these global image feature values. More complex algorithms can further improve the quality estimation, such as a bokeh estimator, which can detect background blur and camera focus[<a href="http://www.bobatkins.com/photography/technical/bokeh_background_blur.html">link</a>].<br>
<i>Note: source code to follow soon at my <a href="http://github.com/karthik20522">github</a></i>..:: karthik ::..http://www.blogger.com/profile/01347159551221689138noreply@blogger.comtag:blogger.com,1999:blog-8142317736019852991.post-21109535171708346052013-12-09T21:44:00.000-05:002019-02-07T12:23:42.814-05:00Image processing Benchmarks<br>
For this benchmark the following most widely used image processing libraries were considered.
<br><br>
- imagemagick [http://www.imagemagick.org/script/index.php]<br>
- graphicsmagick [http://www.graphicsmagick.org/]<br>
- epeg [https://github.com/mattes/epeg]<br>
- opencv [http://opencv.org/]<br>
- vips<br>
<br>
Test environment:<br>
Memory: 5.8 GB<br>
Processor: Intel Xeon CPU W3530 @ 2.80Ghz x 4 Core<br>
OS: Ubuntu 13.04 / 64 bit<br>
Graphics: Gallium 0.4 on AMD Redwood<br>
<br>
Original Image - 350KB - 3168x3168 pixels | Resized to 640x480<br>
imagemagick x 3.69 ops/sec ±2.27% (23 runs sampled)<br>
gm x 5.03 ops/sec ±0.68% (29 runs sampled)<br>
opencv x 19.18 ops/sec ±1.27% (49 runs sampled)<br>
epeg x 35.49 ops/sec ±1.16% (60 runs sampled)<br>
vips x 40.62 ops/sec ±5.01% (69 runs sampled)<br>
<br>
Original Image - 1 MB - 3000x2000 | Resized to 640x480<br>
imagemagick x 4.97 ops/sec ±2.35% (29 runs sampled)<br>
gm x 5.00 ops/sec ±0.54% (29 runs sampled)<br>
opencv x 15.15 ops/sec ±1.36% (41 runs sampled)<br>
epeg x 27.47 ops/sec ±0.98% (69 runs sampled)<br>
vips x 36.26 ops/sec ±6.05% (89 runs sampled)<br>
<br>
Original Image - 15MB - 5382x6254 pixels | Resized to 640x480<br>
imagemagick x 0.87 ops/sec ±1.20% (9 runs sampled)<br>
gm x 0.87 ops/sec ±0.66% (9 runs sampled)<br>
vips x 1.74 ops/sec ±0.43% (13 runs sampled)<br>
opencv x 1.88 ops/sec ±4.09% (9 runs sampled)<br>
epeg x 3.87 ops/sec ±0.78% (14 runs sampled)<br>
<br>
From the above results, VIPS is the fastest, followed by epeg and opencv.
But one thing to consider is features provided vs. performance: libraries such as VIPS
and epeg are optimized towards image resizing and cropping, while opencv,
graphicsmagick and imagemagick provide a slew of image processing and analysis features.
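To put those rates into more intuitive units, ops/sec converts directly to average milliseconds per resize. A quick sketch using the 15MB-image figures above:

```python
def ms_per_op(ops_per_sec):
    """Convert a benchmark rate into average milliseconds per operation."""
    return 1000.0 / ops_per_sec

# 15 MB image, rates taken from the benchmark results above
for name, rate in [("imagemagick", 0.87), ("vips", 1.74), ("epeg", 3.87)]:
    print("%-12s %.0f ms" % (name, ms_per_op(rate)))
```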
<br><br>
Code snippet for benchmarking: <a href="https://gist.github.com/karthik20522/7605083">https://gist.github.com/karthik20522/7605083</a>..:: karthik ::..http://www.blogger.com/profile/01347159551221689138noreply@blogger.comtag:blogger.com,1999:blog-8142317736019852991.post-31340683851452042542013-11-18T21:30:00.000-05:002013-11-18T21:32:39.459-05:00Speedier upload using Nodejs and Resumable.js<i>[updated]</i> View source code at <a href="https://github.com/karthik20522/MultiPortUpload">https://github.com/karthik20522/MultiPortUpload</a><br><br>
Resumable.js is by far one of the best file-upload plugins I have used, followed by Plupload. Resumable.js provides offline-mode features: if a user gets disconnected while uploading, it automatically resumes when back online. Similar to Plupload, it has chunking options. Node.js, on the other hand, provides non-blocking I/O, which is perfect for upload purposes.
<br><br>
There is no upload speed difference between upload plugins (resumable.js, Plupload etc.) except for a few features here and there. Recently I developed a proof of concept for speedier uploads using existing plugins and systems. Part of the research was to emulate file accelerators, where multiple ports are used to upload files in parallel, making uploads quicker.
<br><br>
Using the above concept, I modified resumable.js to accept multiple URLs as an array and upload individual chunks to the different URLs in round-robin style. On the backend I spawned Node.js on multiple ports. <b><i>But resumable.js only uploads multiple chunks in parallel, not multiple files.</i></b> This limitation was overcome with a simple code change; following are test results under various scenarios.
<br><br>
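The round-robin chunk distribution described above can be sketched as follows (a standalone illustration, not the actual resumable.js patch; the endpoint URLs are hypothetical):

```python
def assign_chunks(num_chunks, endpoints):
    """Round-robin: chunk i goes to endpoint i mod len(endpoints)."""
    return [endpoints[i % len(endpoints)] for i in range(num_chunks)]

urls = ["http://host:3000/upload", "http://host:3001/upload", "http://host:3002/upload"]
print(assign_chunks(5, urls))
```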
Note: in resumable.js, simultaneous sends option was set to 3
<br><br>
<table style='border:1px solid black;'>
<tr>
<td></td>
<td><b>Single Server single file upload</b></td>
<td><b>Multiple Server single file upload</b></td>
<td><b>Multiple server + multiple file upload</b></td>
</tr>
<tr>
<td>1 file (109MB)</td>
<td>54secs</td>
<td>56 secs</td>
<td>56 secs</td>
</tr>
<tr>
<td>59 file (109MB)</td>
<td>152secs</td>
<td>156 secs</td>
<td>17 secs</td>
</tr>
</table>
<br><br>
<b>Single Server single file upload</b> – default configuration on resumable.js and single Node.js server to accept files/chunks<br>
<b>Multiple Server single file upload</b> – modified resumable.js to take multiple urls and Node.js was configured to listen to different ports (3000, 3001, 3002). Resumable.js when uploading chunks would upload to different ports in parallel.<br>
<b>Multiple Server + multiple file upload</b> – modified resumable.js to upload multiple files and multiple chunks in parallel instead of one file at a time.
<br><br>
But the above test results are for only 3 simultaneous connections. Modern browsers can handle more than 3; the table below lists the number of connections per server supported by current browsers. The theory is that browsers open parallel connections when different domains are used, so uploading in parallel makes use of the user's full bandwidth for faster uploads.
<br><br>
<table style='border:1px solid black;'>
<tr><td style='width:150px;'><b>Browser</b></td><td style='width:150px;'><b>Connections</b></td></tr>
<tr><td>IE 6,7</td><td>2</td></tr>
<tr><td>IE8</td><td> 6</td></tr>
<tr><td>Firefox 2</td><td> 2</td></tr>
<tr><td>Firefox 3</td><td> 6</td></tr>
<tr><td>Firefox 4</td><td> 6 (12?)</td></tr>
<tr><td>Safari 4</td><td> 6</td></tr>
<tr><td>Opera</td><td> 4</td></tr>
<tr><td>Chrome 6</td><td> 7</td></tr>
</table>
<br><br>
Let’s test the above scenario with 10 simultaneous connections:
<br><br>
<table style='border:1px solid black;'>
<tr>
<td></td>
<td><b>Single Server single file upload</b></td>
<td><b>Multiple Server single file upload</b></td>
<td><b>Multiple server + multiple file upload</b></td>
</tr>
<tr>
<td>1 file (109MB)</td>
<td>27 secs</td>
<td>18 secs</td>
<td>18 secs</td>
</tr>
<tr>
<td>59 files (109MB)</td>
<td>156 secs</td>
<td>158 secs</td>
<td>14 secs</td>
</tr>
</table>
<br>
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgH0Px770rvw3UcI8YsgF2Ucr-eIV0jgPtdzvRtDEcKGdWs0cwacOKpEeO_tK_SKOJI1UBz4hZNG0WSQo9UJF4IGMxYgAUHZzYtjVYqI4brAprrf5k7wVsR1-RSuqejqxsP0W71AYgpmL9Q/s1600/resumableJS.png" imageanchor="1" ><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgH0Px770rvw3UcI8YsgF2Ucr-eIV0jgPtdzvRtDEcKGdWs0cwacOKpEeO_tK_SKOJI1UBz4hZNG0WSQo9UJF4IGMxYgAUHZzYtjVYqI4brAprrf5k7wVsR1-RSuqejqxsP0W71AYgpmL9Q/s640/resumableJS.png" /></a><br>
<i>Server was using almost entire Bandwidth on a multi-file upload!<b> ~1Gbps; 986Mbps!</b></i>
<br><br>
As you can clearly see from the above results having different upload endpoints (ports/host-names) would allow browser to make parallel connections as it would treat as a new host.
<br><br>
<b>Advantages</b>:
<ul>
<li>Customizable. In house development</li>
<li>As Fast as user bandwidth</li>
<li>Use Resumable.js plugin for offline support! Win-Win for everyone!</li>
</ul>
<b>Disadvantages</b>:
<ul>
<li>HTML5 only, i.e. no support for IE 9 and below!</li>
<li>Server software needs to be reliable enough to handle heavy data and IO operations</li>
</ul>
<i>Note: Maximum chunk size for the above tests was set to 1MB. There is a bit of code which estimates the user's Internet speed and derives the chunk size from it; I do this basically by downloading a JPEG file and measuring the time taken to download it. This chunkSize calculation is just a POC</i>
<script type="syntaxhighlighter" class="brush: javascript">
var chunkSize = 1048; //default chunk size (in KB)
var a = new XMLHttpRequest();
a.onreadystatechange = function () {
  if (a.readyState == 4 && a.status == 200) {
    var timeToLoad = new Date - startPingDate; //ms taken to fetch the ping image
    chunkSize = Math.round(chunkSize - (timeToLoad * Math.random()));
    if (chunkSize < 256)
      chunkSize = 256; //floor the chunk size if the connection is slow
    $("#chunkSize").html(chunkSize);
    $("#toSize").html(toSize(1 * 1024 * chunkSize));
    //chunkSize is only valid here, inside the async callback -
    //reading it right after send() would still return the default value
  }
};
var startPingDate = new Date;
a.open("GET", "/pingImage.jpg");
a.send(null);
</script>..:: karthik ::..http://www.blogger.com/profile/01347159551221689138noreply@blogger.comtag:blogger.com,1999:blog-8142317736019852991.post-70525246132729438552013-10-23T22:52:00.001-04:002013-10-24T14:06:04.374-04:00Smart Thumbnail Cropping
Scaling an image down to a thumbnail size is a common practice when hosting it on websites: it reduces page load time, saves bandwidth and so on. But very little has been done to optimize those thumbnails from a human view-ability point of view. Human view-ability, what? Take a large image where the background covers the major part of the frame, shrink it down to a thumbnail size (say 192 px), and notice that the details of the image are subdued by the background.
<br><br>
To solve this problem of smart cropping, I am using a variation of feature descriptors and image processing tricks to extract only the most feature-rich part of the image, preserving the aspect ratio while cropping.
Following are the test results of the algorithm:
<table>
<tr>
<td><b>Sample 1</b></td>
</tr>
<tr>
<td>
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiOqjcOD55ZlZLE0K0uLwkXyTl4irw13VEPV8uNp0TXxDInPy4jut5c_dtOnBXFakHVwAspefGHEUzbd2MvhLJ0rsKCg0okbYew8LPluTlCYL6mCsKc4_-Apu0xtKDh-3AykPrVCK0_e2Bz/s1600/test1_original.jpg" imageanchor="1" ><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiOqjcOD55ZlZLE0K0uLwkXyTl4irw13VEPV8uNp0TXxDInPy4jut5c_dtOnBXFakHVwAspefGHEUzbd2MvhLJ0rsKCg0okbYew8LPluTlCYL6mCsKc4_-Apu0xtKDh-3AykPrVCK0_e2Bz/s192/test1_original.jpg" /></a>
</td>
<td>
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgt7RoXTpc2mYXpPS3BBQLdk1teZJ_c1FQPoBW3tzjJS-w2X3fTneVK6wbSKf2KoYboGVo-uIipcLmDLCWfNTsWF9aHjzL47YBK7Hhl0Cm7D526nGrKw81FUC7wW86d9vDa2hdKWY6V_M2N/s1600/test1_surf.jpg" imageanchor="1" ><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgt7RoXTpc2mYXpPS3BBQLdk1teZJ_c1FQPoBW3tzjJS-w2X3fTneVK6wbSKf2KoYboGVo-uIipcLmDLCWfNTsWF9aHjzL47YBK7Hhl0Cm7D526nGrKw81FUC7wW86d9vDa2hdKWY6V_M2N/s192/test1_surf.jpg" /></a>
</td>
<td>
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgb1lJnNjsemCz1GYWaAOwND1kt9VShNSk8N_3Abj3xUATuBStWmctB1zkCHz5J_zlQFhspE9vD8cO1av3CodCLNZdUWV-1hdXet2ZI63_BEmt_q9uQ0BrWGQOWlbAJx2tBb64HpNMNIv-s/s1600/test1.jpg" imageanchor="1" ><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgb1lJnNjsemCz1GYWaAOwND1kt9VShNSk8N_3Abj3xUATuBStWmctB1zkCHz5J_zlQFhspE9vD8cO1av3CodCLNZdUWV-1hdXet2ZI63_BEmt_q9uQ0BrWGQOWlbAJx2tBb64HpNMNIv-s/s320/test1.jpg" /></a>
</td>
</tr>
<tr>
<td>
<b>Original Thumbnail</b>
</td>
<td>
<b>Feature Extraction</b>
</td>
<td>
<b>Cropped Thumbnail</b>
</td>
</tr>
<tr>
<td><b>Sample </b>2</td>
</tr>
<tr>
<td>
<a href="http://1.bp.blogspot.com/-oGp-vdBdX5g/UmiHj2eczeI/AAAAAAAAECI/E0wqd6wbkbw/s1600/test2_original.jpg" imageanchor="1" ><img border="0" src="http://1.bp.blogspot.com/-oGp-vdBdX5g/UmiHj2eczeI/AAAAAAAAECI/E0wqd6wbkbw/s200/test2_original.jpg" /></a>
</td>
<td>
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgR4wOsK-EU644XsxztmLjLuWhFKJPLilTEOJ-P3z1fZhi4cF2Q-LVli4T7vfj0mom-9nu4e0ehzcPOTmXyvhpDrLi2WAtP0aqV10vAXU6UVVBeEWN9zEYmiBftOz__ZYQY9O6c04uM1EqK/s1600/test2_surf.jpg" imageanchor="1" ><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgR4wOsK-EU644XsxztmLjLuWhFKJPLilTEOJ-P3z1fZhi4cF2Q-LVli4T7vfj0mom-9nu4e0ehzcPOTmXyvhpDrLi2WAtP0aqV10vAXU6UVVBeEWN9zEYmiBftOz__ZYQY9O6c04uM1EqK/s200/test2_surf.jpg" /></a>
</td>
<td>
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjmYeKDaC8fFsSiaLYcIxjfIbCw-KGJMTBqLKbCjGidBkgn5WhSAmu50zkK4J1QIPlI_db_w7H2-BK8yGQpkKpSVUcvGdDaPQssgZRUdl38-L05ZTKUVmSElUN97APPQXuIf2QnEtDfHDJK/s1600/test2.jpg" imageanchor="1" ><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjmYeKDaC8fFsSiaLYcIxjfIbCw-KGJMTBqLKbCjGidBkgn5WhSAmu50zkK4J1QIPlI_db_w7H2-BK8yGQpkKpSVUcvGdDaPQssgZRUdl38-L05ZTKUVmSElUN97APPQXuIf2QnEtDfHDJK/s320/test2.jpg" /></a>
</td>
</tr>
<tr>
<td>
<b>Original Thumbnail</b>
</td>
<td>
<b>Feature Extraction</b>
</td>
<td>
<b>Cropped Thumbnail</b>
</td>
</tr>
<tr><td><b>Sample 3</b></td></tr>
<tr>
<td>
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjTS3pnCM1egI5ZKwBN6M4BgSejdMMKfYC2QktJn3Aj1Lg7B5VJm5xKAQlEEzLWN9t5mNgtBd5bZVZELVu0hUptetoTamyVNnhC0Efl3rcZOr1WKLrhtJ7NeqRQeCZ3bb5kg1vKUgtssVcd/s1600/test4_original.jpg" imageanchor="1" ><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjTS3pnCM1egI5ZKwBN6M4BgSejdMMKfYC2QktJn3Aj1Lg7B5VJm5xKAQlEEzLWN9t5mNgtBd5bZVZELVu0hUptetoTamyVNnhC0Efl3rcZOr1WKLrhtJ7NeqRQeCZ3bb5kg1vKUgtssVcd/s200/test4_original.jpg" /></a>
</td>
<td>
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhv8LtScd0sSWPnFkRA6W6xeqdBIuU38dPjsncJY9aBanIcKdLrlIP8sXyZ9W5TxyygLbAz2bBYCgZrMHhQ-SWas6nLLYNQbqscwo9JoNHR8VhIxCTnl9PCX3SK9R0-DVpAU0eWVHwouAmK/s1600/test4_surf.JPG" imageanchor="1" ><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhv8LtScd0sSWPnFkRA6W6xeqdBIuU38dPjsncJY9aBanIcKdLrlIP8sXyZ9W5TxyygLbAz2bBYCgZrMHhQ-SWas6nLLYNQbqscwo9JoNHR8VhIxCTnl9PCX3SK9R0-DVpAU0eWVHwouAmK/s200/test4_surf.JPG" /></a>
</td>
<td>
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgB6piB4tGcdEbZ6vloACnnO-S9RfgKbt5M3Ndm2uMVqG7BZq3qUSTI7K1Of7XrVRkpB_nYrxSzaofXWTug6_ajB89TvN0Tx6hUWvw3npQ75eC9J4ZrLBhCsD_p67n3IhCh5yDX0Qq6TIFb/s1600/test4.jpg" imageanchor="1" ><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgB6piB4tGcdEbZ6vloACnnO-S9RfgKbt5M3Ndm2uMVqG7BZq3qUSTI7K1Of7XrVRkpB_nYrxSzaofXWTug6_ajB89TvN0Tx6hUWvw3npQ75eC9J4ZrLBhCsD_p67n3IhCh5yDX0Qq6TIFb/s200/test4.jpg" /></a>
</td>
</tr>
<tr>
<td>
<b>Original Thumbnail</b>
</td>
<td>
<b>Feature Extraction</b>
</td>
<td>
<b>Cropped Thumbnail</b>
</td>
</tr>
</table>
<br><br>
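The cropping step shown above can be sketched roughly as follows: take the detected feature keypoints, compute their bounding box, then grow that box to the target aspect ratio and clamp it to the image. This is only an illustration of the idea; the keypoint format and the helper name are assumptions, not the actual algorithm:

```javascript
// Compute a crop window of a fixed aspect ratio (width / height) that
// covers the feature-rich region and stays inside the image bounds.
// Keypoint format {x, y} is an assumption for this sketch.
function cropWindow(keypoints, imgW, imgH, aspect) {
  var minX = Infinity, minY = Infinity, maxX = -Infinity, maxY = -Infinity;
  keypoints.forEach(function (p) {
    if (p.x < minX) minX = p.x;
    if (p.y < minY) minY = p.y;
    if (p.x > maxX) maxX = p.x;
    if (p.y > maxY) maxY = p.y;
  });
  var w = maxX - minX, h = maxY - minY;
  // Grow the shorter side so the box matches the requested aspect ratio.
  if (w / h < aspect) w = h * aspect; else h = w / aspect;
  // Center the window on the features, then clamp it to the image.
  var cx = (minX + maxX) / 2, cy = (minY + maxY) / 2;
  var x = Math.max(0, Math.min(cx - w / 2, imgW - w));
  var y = Math.max(0, Math.min(cy - h / 2, imgH - h));
  return { x: x, y: y, w: Math.min(w, imgW), h: Math.min(h, imgH) };
}
```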
So what's in the pipeline:
<ul>
<li>Open source the image processing code</li>
<li>Build a http handler (ASP.NET http handler) for dynamic cropping</li>
</ul>
Thoughts?..:: karthik ::..http://www.blogger.com/profile/01347159551221689138noreply@blogger.comtag:blogger.com,1999:blog-8142317736019852991.post-44946517497296742572013-10-11T00:01:00.000-04:002013-10-11T00:51:01.502-04:00Event Viewer - Image Search"Event Viewer" is yet another attempt to visualize images, similar to my <a href="http://kufli.blogspot.com/2012/11/research-timeline-how-did-i-build-this.html">{re}Search Timeline</a> project.
<br>
Demo at : <a href="http://karthik20522.github.io/EventViewer">http://karthik20522.github.io/EventViewer</a>
<br><br>
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhQJX2IKem_lqxlL9OD_gBPyZcS5sgxQoqod-UwVNguaSbVjTTkWOvroRD10QPOmidVy9LdiZHC1CpOUFJJc_ZLtm5EagmOoZuZSbrumfYcx_BUnY_FtYUpWlosoqZw4DsJh6BLHuNaeqyF/s1600/eventViewer_new.png" imageanchor="1" ><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhQJX2IKem_lqxlL9OD_gBPyZcS5sgxQoqod-UwVNguaSbVjTTkWOvroRD10QPOmidVy9LdiZHC1CpOUFJJc_ZLtm5EagmOoZuZSbrumfYcx_BUnY_FtYUpWlosoqZw4DsJh6BLHuNaeqyF/s480/eventViewer_new.png" /></a>
<br><br>
The whole point of this proof-of-concept project is to visualize images from the perspective of the events rather than just displaying a grid of images. For example, a search on the GettyImages.com website basically displays a list of images in a tabular fashion, which provides no sense of association between the individual images being displayed. But having them grouped together as part of an event provides a sense of association and correlation between images.
<br><br>
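The event grouping described above boils down to bucketing a flat image list by an event identifier. A minimal sketch (the eventId field name is an assumption of mine, not the actual data model):

```javascript
// Group a flat list of images into a map keyed by event id, so the UI can
// render one block per event instead of a flat grid.
function groupByEvent(images) {
  return images.reduce(function (groups, img) {
    (groups[img.eventId] = groups[img.eventId] || []).push(img);
    return groups;
  }, {});
}
```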
Displaying images is always a tricky business. A dominant color filter could provide an alternate way of scanning through images, as a user might be more interested in images of a particular color than in the fine details of an image.
<br><br>
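For reference, a naive dominant-color pass can be sketched by quantizing each pixel's RGB channels into coarse buckets and picking the fullest bucket; the actual DominantBackgroundColor project may work differently:

```javascript
// Quantize each pixel's RGB channels into 8 coarse steps (>> 5) and count
// bucket membership; a pixel from the fullest bucket is the dominant color.
// pixels: array of [r, g, b] triplets (0-255).
function dominantColor(pixels) {
  var counts = {}, best = null, bestCount = 0;
  pixels.forEach(function (p) {
    var key = [p[0] >> 5, p[1] >> 5, p[2] >> 5].join(",");
    counts[key] = (counts[key] || 0) + 1;
    if (counts[key] > bestCount) { bestCount = counts[key]; best = p; }
  });
  return best; // a representative pixel from the dominant bucket
}
```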
From a technology standpoint, there was nothing special about building this project.
<ul>
<li>ASP.NET MVC 4 - Razor</li>
<li>Amazon SQS - for event scraping from GettyImages</li>
<li>Connect API for event and image detail lookup</li>
<li>MongoDB</li>
<li><a href="https://github.com/karthik20522/DominantBackgroundColor">Dominant Color Extraction</a></li>
</ul>
Source code at: <a href="https://github.com/karthik20522/EventViewer">https://github.com/karthik20522/EventViewer</a>..:: karthik ::..http://www.blogger.com/profile/01347159551221689138noreply@blogger.comtag:blogger.com,1999:blog-8142317736019852991.post-70668711981413935722013-10-01T16:26:00.000-04:002013-10-01T16:26:33.308-04:00Development Stack and stuffFor the past year or two, I had been dabbling with different technologies, frameworks looking for an ideal combination of frontend, backend and development tools. Following are what I tend to use and recommend for both personal and consultancy projects.
<br>
<i>I usually customize most of these open-source projects as per my needs</i><br><br>
<b>Project Management</b>
<ul><li>Asana - <a href="https://asana.com">https://asana.com</a> </li></ul>
<b>Database</b>
<ul><li>MongoDB</li></ul>
<b>CMS</b>
<ul><li>Calipso - <a href="http://calip.so/">http://calip.so/</a></li></ul>
<b>Bulletin Board</b>
<ul><li>NodeBB - <a href="http://nodebb.org/">http://nodebb.org/</a></li></ul>
<b>Configuration Management</b>
<ul><li>Etcd - <a href="https://github.com/coreos/etcd">https://github.com/coreos/etcd</a></li></ul>
<b>Logging and Analysis</b>
<ul><li>Logstash - <a href="http://logstash.net">http://logstash.net</a></li>
<li>Kibana - <a href="http://www.elasticsearch.org/overview/kibana">http://www.elasticsearch.org/overview/kibana</a> </li>
<li>Log.io - <a href="http://logio.org">http://logio.org</a></li></ul>
<b>Search</b>
<ul><li>ElasticSearch - <a href="http://www.elasticsearch.org">http://www.elasticsearch.org</a></li>
<li>Algolia - <a href="http://www.algolia.com">http://www.algolia.com</a></li></ul>
<b>Source Control</b>
<ul><li>Github – <a href="http://github.com">http://github.com</a></li></ul>
<b>AWS – Amazon Web Services</b>
<ul><li>SQS – Simple Queue service</li>
<li>SES – Simple Email Service</li></ul>
<b>Hosting</b>
<ul><li>Rackspace / Amazon</li>
<li>Digital Ocean - <a href="https://www.digitalocean.com">https://www.digitalocean.com</a></li></ul>
<b>Image processing</b>
<ul><li>GraphicsMagick with Node API - <a href="http://aheckmann.github.io/gm/">http://aheckmann.github.io/gm/</a></li></ul>
<b>Development Tools/Frameworks/Languages</b>
<ul><li>.NET
<ul><li>ASP.NET MVC </li>
<li>ASP.NET WebAPI for REST services - <a href="http://www.asp.net/web-api">http://www.asp.net/web-api</a></li>
<li><a href="http://code.google.com/p/dapper-dot-net/">Dapper.NET</a> or <a href="https://github.com/markrendle/Simple.Data">Simple.Data</a> for SQL ORM’s and MongoDB CSharp Driver for MongoDB access</li>
<li>FluentSecurity for user authentication and roles <a href="http://www.fluentsecurity.net/">http://www.fluentsecurity.net/</a></li></ul></li>
<li>NodeJS
<ul><li><a href="https://github.com/nodejitsu/node-http-proxy">NodeProxy</a> for proxy and load balancing service</li>
<li>Node Cluster for Multi Node process</li>
<li>Node Express Boilerplate - <a href="https://github.com/mape/node-express-boilerplate">https://github.com/mape/node-express-boilerplate</a></li></ul></li></ul>
<b>Misc Libraries and websites</b>
<ul><li>ASP.NET Razor Helpers - <a href="http://www.mikepope.com/blog/documents/WebHelpersAPI.html">http://www.mikepope.com/blog/documents/WebHelpersAPI.html</a></li>
<li>OverAPI - <a href="http://overapi.com/">http://overapi.com/</a></li>
<li>JQuery Plugins - <a href="http://www.unheap.com/">http://www.unheap.com/</a></li>
<li>MicroJs - <a href="http://microjs.com/">http://microjs.com/</a></li></ul>
..:: karthik ::..http://www.blogger.com/profile/01347159551221689138noreply@blogger.comtag:blogger.com,1999:blog-8142317736019852991.post-54118601300951369402013-09-06T15:59:00.000-04:002013-09-06T15:59:03.128-04:00Spray.io REST service - Http Error Codes - Lesson 8View the lessons list at <a href="https://github.com/karthik20522/SprayLearning">https://github.com/karthik20522/SprayLearning</a>
<br><br>
In the <a href="http://kufli.blogspot.com/2013/08/sprayio-rest-service-exception.html">previous post</a> in this series about the Spray.io REST API, I talked about simple error handling and rejections. What about errors in the context of RESTful API best practices? From the perspective of the developer consuming your Web API, everything is a black box. When developing their applications, developers depend on well-designed errors to troubleshoot and resolve issues while using your API.
<br><br>
There are over 70 HTTP status codes, but of course we don't need to use all of them. Check out this good Wikipedia entry listing all <a href="http://en.wikipedia.org/wiki/Http_error_codes">HTTP status codes</a>. In Spray, we can return any of these status codes from the complete magnet, depending on whether we are handling an Exception or a Rejection.
<br>
<script type="syntaxhighlighter" class="brush: java">
implicit def myExceptionHandler(implicit log: LoggingContext) =
ExceptionHandler.apply {
case m: MappingException => {
respondWithMediaType(`application/json`) {
val errorMsg = ReponseError("MalformedBody", m.getMessage)
ctx => ctx.complete(415, errorMsg)
}
}
case e: SomeCustomException => ctx => {
val errorMsg = ReponseError("BadRequest", e.getMessage)
ctx.complete(400, errorMsg)
}
case e: Exception => ctx => {
val errorMsg = ReponseError("InternalServerError", e.getMessage)
ctx.complete(500, errorMsg)
}
}
</script>
Check out these error/HTTP status codes from various REST APIs:
<ul>
<li><a href="https://dev.twitter.com/docs/error-codes-responses">Twitter</a></li>
<li><a href="http://msdn.microsoft.com/en-us/library/windowsazure/dd179357.aspx">Microsoft Azure</a></li>
<li><a href="https://www.dropbox.com/developers/core/docs">Dropbox</a></li>
<li><a href="http://developer.yahoo.com/social/rest_api_guide/http-response-codes.html">Yahoo</a></li>
</ul>..:: karthik ::..http://www.blogger.com/profile/01347159551221689138noreply@blogger.comtag:blogger.com,1999:blog-8142317736019852991.post-4851122571886035882013-09-05T17:54:00.001-04:002013-09-05T17:54:58.091-04:00Useful Git Alias and HooksGit by itself is very raw, and its console output is sometimes hard to read. Following are some git <b>aliases</b> that I tend to use on a daily basis.
<br>
<h1>git lg</h1>
This is a replacement for <b>git log</b>. git log doesn't provide useful information such as the branch name, nor is it pretty looking! Yes, pretty looking, as in colors and such. The following git alias solves this mundane problem.
<script type="syntaxhighlighter" class="brush: csharp">
$ git config --global alias.lg "log --color --graph --pretty=format:'%Cred%h%Creset -%C(yellow)%d%Creset %s %Cgreen(%cr) %C(bold blue)<%an>%Creset' --abbrev-commit"
</script>
<br>
Following is a comparison screenshot of <b>git lg</b> vs <b>git log</b><br><br>
Using <i>git log</i>:<br>
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjHoTYnkqEIKbBMSQXSvbPqLFtcLA2Ovm9jGUIoikWcgDUYB_1tYlcfKh0M6_xoiVhmwCsdd2P4KtFuPqvONuK6tZC1AgxAQzuKiWsnjBisucwtYdtwY9W-J-FkSOjOuv8hiXAc7whAubfP/s1600/git_log.png" imageanchor="1" ><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjHoTYnkqEIKbBMSQXSvbPqLFtcLA2Ovm9jGUIoikWcgDUYB_1tYlcfKh0M6_xoiVhmwCsdd2P4KtFuPqvONuK6tZC1AgxAQzuKiWsnjBisucwtYdtwY9W-J-FkSOjOuv8hiXAc7whAubfP/s400/git_log.png" /></a>
<br><br>
Using <i>git lg alias</i>:<br>
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiIuoowDsA4Hnjx5kXZn4h74PQg486riJmOQfxgrD-4lTCkuMKkOM3U5o8ml7mW475ao9vEd6tRiBV9lb7zNRhftd1gFao6IhlSFyf1odgWwKdODTXafkPbiPYfvj5xq7GoRhlYS6NU-QS7/s1600/git_lg_alias.png" imageanchor="1" ><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiIuoowDsA4Hnjx5kXZn4h74PQg486riJmOQfxgrD-4lTCkuMKkOM3U5o8ml7mW475ao9vEd6tRiBV9lb7zNRhftd1gFao6IhlSFyf1odgWwKdODTXafkPbiPYfvj5xq7GoRhlYS6NU-QS7/s400/git_lg_alias.png" /></a>
<br>
<h1>git unpushed</h1>
Every now and then I want to see all commits that haven't been pushed yet. But this is quite a pain to dig out from the git console. Problem solved with the following alias:
<script type="syntaxhighlighter" class="brush: csharp">
$ git config --global alias.unpushed "log --branches --not --remotes --color --graph --pretty=format:'%Cred%h%Creset -%C(yellow)%d%Creset %s %Cgreen(%cr) %C(bold blue)<%an>%Creset' --abbrev-commit"
</script>
<br>
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjubDNyyCyT9TLI7CyYVUP1IW-25FThIcEm1OXMv7bU9DfIMXHPzHk1bMy1Fw_-8t8SOn7dua_cuaNkQw5N4t5q4SxAaKoOtDViFCDP0nPGyOmkrCpd96jCI58eFJj3cU3cJMV7VNDkXtXV/s1600/git_unpushed.png" imageanchor="1" ><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjubDNyyCyT9TLI7CyYVUP1IW-25FThIcEm1OXMv7bU9DfIMXHPzHk1bMy1Fw_-8t8SOn7dua_cuaNkQw5N4t5q4SxAaKoOtDViFCDP0nPGyOmkrCpd96jCI58eFJj3cU3cJMV7VNDkXtXV/s400/git_unpushed.png" /></a>
<br>
<h1>git undo</h1>
git out-of-the-box doesn't provide an undo option for the previous commit. Pain! The following alias solves the problem (note that <b>reset --hard</b> discards the commit along with any uncommitted changes):
<script type="syntaxhighlighter" class="brush: csharp">
$ git config --global alias.undo "reset --hard HEAD~1"
</script>
<br>
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhPAgT-rAq9DxsruVtBn-uUKdFdv_LcdqpK86EOijxyZwe1FWInRaz48EyxHhJil7sIRE1_ZMgFoCkgYQOiCj5Tlrh-kR2jycIiVllKUjaN6pEWQRZDTXX_OTFhGjgXfLzfbY3IHLYUkcbe/s1600/git_undo.png" imageanchor="1" ><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhPAgT-rAq9DxsruVtBn-uUKdFdv_LcdqpK86EOijxyZwe1FWInRaz48EyxHhJil7sIRE1_ZMgFoCkgYQOiCj5Tlrh-kR2jycIiVllKUjaN6pEWQRZDTXX_OTFhGjgXfLzfbY3IHLYUkcbe/s400/git_undo.png" /></a>
<br><br>
<h1>Pre-push hooks</h1>
One of the most important hooks that I use is a pre-push hook that runs the unit tests before pushing to the "master" branch. This is important as it prevents broken test cases from being pushed to master. The hook script:
<script type="syntaxhighlighter" class="brush: csharp">
#!/bin/bash
# .git/hooks/pre-push (the shebang must be the first line of the hook)
CMD="sbt test" # Command to run your tests - I use sbt test
protected_branch='master'
# Check if we actually have commits to push
commits=`git log @{u}..`
if [ -z "$commits" ]; then
  exit 0
fi
current_branch=$(git symbolic-ref HEAD | sed -e 's,.*/\(.*\),\1,')
if [[ $current_branch = $protected_branch ]]; then
  $CMD
  RESULT=$?
  if [ $RESULT -ne 0 ]; then
    echo "failed $CMD"
    exit 1
  fi
fi
exit 0
</script>
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiBfii-6SOS3ynl7iyqJnzAmCUSDQVgteELeyIS3qaxNa3N9pb7vxjHTeJtugaK9h_BjJoCo0UCWaGx3IsJPjon1-St3CjtvddwsqWFxWAiAiz6WmUyk3hmBIOYijSpCcbEAfSJiL01chHA/s1600/git_pre_push_sbt.png" imageanchor="1" ><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiBfii-6SOS3ynl7iyqJnzAmCUSDQVgteELeyIS3qaxNa3N9pb7vxjHTeJtugaK9h_BjJoCo0UCWaGx3IsJPjon1-St3CjtvddwsqWFxWAiAiz6WmUyk3hmBIOYijSpCcbEAfSJiL01chHA/s400/git_pre_push_sbt.png" /></a>
..:: karthik ::..http://www.blogger.com/profile/01347159551221689138noreply@blogger.comtag:blogger.com,1999:blog-8142317736019852991.post-80276515311629065332013-08-19T21:19:00.000-04:002013-08-19T21:19:10.103-04:00Spray.io REST service - Exception, Rejection and Timeout Handling - Lesson 7View the lessons list at <a href="https://github.com/karthik20522/SprayLearning">https://github.com/karthik20522/SprayLearning</a>
<br><br>
<b>Handling exceptions</b> within the application and returning a valid HTTP response with a message is probably the way to go for building readable REST APIs. In Spray, exceptions thrown during route execution bubble up through the route structure to the next enclosing handleExceptions directive. If you'd like to customize the way certain exceptions are handled, simply bring a custom ExceptionHandler into the implicit scope of the runRoute wrapper. For example:
<br>
<script type="syntaxhighlighter" class="brush: java">
class CustomerServiceActor extends Actor with CustomerService with AjaxService {
implicit def json4sFormats: Formats = DefaultFormats
def actorRefFactory = context
def receive = runRoute(handleExceptions(myExceptionHandler)(
customerRoutes ~ ajaxRoutes))
//capture all exceptions within the above routes
implicit def myExceptionHandler(implicit log: LoggingContext) =
ExceptionHandler.apply {
case e: SomeCustomException => ctx => {
log.debug("%s %n%s %n%s".format(e.getMessage, e.getStackTraceString, e.getCause))
ctx.complete(404, e.getMessage)
}
case e: Exception => ctx => {
log.debug("%s %n%s %n%s".format(e.getMessage, e.getStackTraceString, e.getCause))
ctx.complete(500, e.getMessage)
}
}
}
</script>
<br>
More information on Handling Exceptions can be found at <a href="http://spray.io/documentation/1.2-M8/spray-routing/key-concepts/exception-handling/">spray-routing/key-concepts/exception-handling/</a>
<br><br>
How about handling <b>Rejections</b>? We can handle rejections in much the same fashion as exceptions. In this example I have created a separate trait for the rejection handler. (I came across a conflict between the shapeless "::" syntax and the "::" pattern used in the rejection handler.)
<br>
<script type="syntaxhighlighter" class="brush: java">
//separate trait file
trait CustomRejectionHandler extends HttpService {
implicit val myRejectionHandler = RejectionHandler {
case AuthenticationFailedRejection(credentials) :: _ => complete(Unauthorized, "Credential fail " + credentials)
case _ => complete(BadRequest, "Something went wrong here")
}
}
//HttpService
class CustomerServiceActor extends Actor with CustomerService with AjaxService with CustomRejectionHandler {
. . . .
def receive = runRoute(handleRejections(myRejectionHandler)(handleExceptions(myExceptionHandler)(
customerRoutes ~ ajaxRoutes)))
implicit def myExceptionHandler(implicit log: LoggingContext) =
ExceptionHandler.apply {
. . . .
}
}
</script>
More information on Handling Rejections at <a href="http://spray.io/documentation/1.2-M8/spray-routing/key-concepts/rejections/">spray-routing/key-concepts/rejections/</a>
<br><br>
<b>Timeout</b> Handling: spray-routing itself does not perform any timeout checking; it relies on the underlying spray-can to watch for request timeouts. The timeout value is defined in the config file (application.conf):
<script type="syntaxhighlighter" class="brush: java">
//application.conf
spray.can.server {
request-timeout = 10s
}
//in HttpService class
class CustomerServiceActor extends Actor with CustomerService with AjaxService with CustomRejectionHandler {
. . . .
def receive = handleTimeouts orElse runRoute(handleRejections(myRejectionHandler)(handleExceptions(myExceptionHandler)(
customerRoutes ~ ajaxRoutes)))
def handleTimeouts: Receive = {
case Timedout(x: HttpRequest) =>
sender ! HttpResponse(StatusCodes.InternalServerError, "Something is taking way too long.")
}
. . . .
</script>
More information on Timeout Handler at <a href="http://spray.io/documentation/1.2-M8/spray-routing/key-concepts/timeout-handling/">spray-routing/key-concepts/timeout-handling/</a>
..:: karthik ::..http://www.blogger.com/profile/01347159551221689138noreply@blogger.comtag:blogger.com,1999:blog-8142317736019852991.post-81504325006214442632013-08-18T16:05:00.000-04:002013-08-18T16:05:43.044-04:00Spray.io REST service - API Versioning - Lesson 6View the lessons list at <a href="https://github.com/karthik20522/SprayLearning">https://github.com/karthik20522/SprayLearning</a>
<br><br>
Before we get into Spray, I would recommend reading about the different ways of versioning an API and the best practices at <a href="http://stackoverflow.com/questions/389169/best-practices-for-api-versioning">best-practices-for-api-versioning</a> or <a href="http://stackoverflow.com/questions/10742594/versioning-rest-api">versioning-rest-api</a>.
<br><br>
To summarize the stackoverflow discussion, there are 3 ways to do versioning.
<ul>
<li>Header based - using X-API-Version</li>
<li>URL based - http://{uri}/v1/getCustomer</li>
<li>Content negotiation via Accept headers - application/vnd.example.v1+json (mediatype)</li>
</ul>
For this tutorial, I would be implementing the first two.<br><br>
<b>1) Header based - using X-API-Version</b><br>
Here I am building a Directive that extracts the version from the request header; if it is not present, I default to 1.
<script type="syntaxhighlighter" class="brush: java">
trait VersionDirectives {
val extractVersion: Directive[String :: HNil] =
extract { ctx =>
val header = ctx.request.headers.find(_.name == "X-API-Version")
header match {
case Some(head) => head.value
case _ => "1" //default to 1
}
}
def versioning: Directive[String :: HNil] =
extractVersion.flatMap { v =>
provide(v)
}
}
</script>
In the above code there are two keywords, "extract" and "provide". These are part of Spray's BasicDirectives. "extract" allows you to extract a single value, and "provide" allows you to inject a value into the Directive. But wait, we can make this trait even smaller by getting rid of the provide altogether, to something as follows:
<script type="syntaxhighlighter" class="brush: java">
trait VersionDirectives {
def versioning: Directive[String :: HNil] =
extract { ctx =>
val header = ctx.request.headers.find(_.name == "X-API-Version")
header match {
case Some(head) => head.value
case _ => "1" //default to 1
}
}
}
</script>
<i>Note: there is more than one way to write the same operation in scala/spray. Bane of my existence!</i>
<br>
More info at: <a href="https://github.com/spray/spray/blob/release/1.2/spray-routing/src/main/scala/spray/routing/directives/BasicDirectives.scala">spray/routing/directives/BasicDirectives.scala</a>
<br><br>
Now that we have defined the directive, we just need to mix it into the service trait and call versioning to extract the version number from the X-API-Version header field.
<script type="syntaxhighlighter" class="brush: java">
trait CustomerService extends HttpService with Json4sSupport with VersionDirectives {
val customerRoutes = {
path("getCustomer" / Segment) { customerId =>
get {
versioning { v =>
println(v)
//do something now that you have extracted the version number
. . . .
</script>
<b>2) URL based - http://{uri}/v1/getCustomer</b>
<br>
Here we are basically running a regex against the incoming request.uri and extracting the version out of the "v*" segment. Spray provides quite a few path filters, one of them being PathMatcher. More info at:<br>
<ul>
<li><a href="https://github.com/spray/spray/wiki/Path-Filters">https://github.com/spray/spray/wiki/Path-Filters</a><br></li>
<li><a href="https://github.com/spray/spray/blob/release/1.2/spray-routing/src/main/scala/spray/routing/directives/PathDirectives.scala">spray/routing/directives/PathDirectives</a><br></li>
<li><a href="https://github.com/spray/spray/blob/release/1.2/spray-routing/src/main/scala/spray/routing/PathMatcher.scala">spray/routing/PathMatcher</a></li>
<li><a href="https://github.com/spray/spray/blob/release/1.2/spray-routing-tests/src/test/scala/spray/routing/PathDirectivesSpec.scala">/scala/spray/routing/PathDirectivesSpec - Test cases</a></li>
</ul>
<script type="syntaxhighlighter" class="brush: java">
trait CustomerService extends HttpService with Json4sSupport {
val Version = PathMatcher("""v([0-9]+)""".r)
.flatMap {
case vString :: HNil => {
try Some(Integer.parseInt(vString) :: HNil)
catch {
case _: NumberFormatException => Some(1 :: HNil) //default to version 1
}
}
}
val customerRoutes =
pathPrefix(Version) {
apiVersion =>
{
. . . .
path("getCustomer" / Segment) { customerId =>
get {
complete {
apiVersion match {
case 1 => {
// do something if version 1
}
case 2 => ??? //do something if version 2
case _ => {
//do something if any other version
}
}
}
}
}
}
}
}
</script>..:: karthik ::..http://www.blogger.com/profile/01347159551221689138noreply@blogger.comtag:blogger.com,1999:blog-8142317736019852991.post-45214970266383745682013-08-17T19:58:00.000-04:002013-08-17T19:58:32.046-04:00Spray.io REST service - Authentication - Lesson 5View the lessons list at <a href="https://github.com/karthik20522/SprayLearning ">https://github.com/karthik20522/SprayLearning </a>
<br><br>
"Directives" are small building blocks from which you can construct arbitrarily complex route structures. A directive does one or more of the following:<br>
<ul>
<li>Transform the incoming RequestContext before passing it on to its inner route</li>
<li>Filter the RequestContext according to some logic, i.e. only pass on certain requests and reject all others</li>
<li>Apply some logic to the incoming RequestContext and create objects that are made available to inner routes as "extractions"</li>
<li>Complete the request</li>
</ul>
More detailed information on directives can be found at <a href="http://spray.io/documentation/1.2-M8/">http://spray.io/documentation/1.2-M8/</a> or here <a href="http://spray.io/documentation/1.2-M8/spray-routing/predefined-directives-alphabetically/">Spray-routing/predefined-directives-alphabetically/</a>. In future examples I will be using directives, so a basic understanding is useful. For this example I will use the authentication directive to validate the request with a username and password passed as part of the request URL.
<br><br>
So, for authentication, a new UserAuthentication trait is created which has a function that returns a ContextAuthenticator. The authenticate directive expects either a ContextAuthenticator or a Future[Authentication[T]]. At its core, a ContextAuthenticator is basically a function returning Future[Authentication[T]].
<br><br>
The code below, which defines the authentication types, shows that a ContextAuthenticator takes a RequestContext and returns a Future of Authentication, which is Either a Rejection or an object of type T. This means that either a Rejection is returned (Left), such as an AuthenticationFailedRejection when credentials are missing or wrong, or, on success, an object is returned (Right).
<br>
<script type="syntaxhighlighter" class="brush: java">
package object authentication {
type ContextAuthenticator[T] = RequestContext => Future[Authentication[T]]
type Authentication[T] = Either[Rejection, T]
. . . .
}
</script>
<br>
The UserAuthentication trait has two functions: one that returns a ContextAuthenticator and another that returns a Future. In this example, I am reading the username and password from the application.conf file and validating against them.<br>
<script type="syntaxhighlighter" class="brush: java">
import com.typesafe.config.ConfigFactory
import scala.concurrent.ExecutionContext.Implicits.global
import spray.routing.AuthenticationFailedRejection
case class User(userName: String, token: String) {}
trait UserAuthentication {
val conf = ConfigFactory.load()
lazy val configusername = conf.getString("security.username")
lazy val configpassword = conf.getString("security.password")
def authenticateUser: ContextAuthenticator[User] = {
ctx =>
{
//get username and password from the url query string
//(getOrElse avoids a NoSuchElementException when a parameter is missing)
val usr = ctx.request.uri.query.get("usr").getOrElse("")
val pwd = ctx.request.uri.query.get("pwd").getOrElse("")
doAuth(usr, pwd)
}
}
private def doAuth(userName: String, password: String): Future[Authentication[User]] = {
//here you can call database or a web service to authenticate the user
Future {
Either.cond(password == configpassword && userName == configusername,
User(userName = userName, token = java.util.UUID.randomUUID.toString),
AuthenticationFailedRejection("CredentialsRejected"))
}
}
}
</script>
<i>Note that I am importing "scala.concurrent.ExecutionContext.Implicits.global": futures require an ExecutionContext, and this import brings Scala's global one into scope. Alternatively, the ActorSystem's dispatcher could serve as the ExecutionContext. //implicit val system = ActorSystem("on-spray-can")</i>
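This is standard Scala Future behavior rather than anything Spray-specific: constructing a Future requires an implicit ExecutionContext in scope. A minimal stand-alone illustration using the global pool (Await is for demonstration only; in Spray the framework completes the future itself):

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._
// Removing this import makes the Future construction below fail to compile
import scala.concurrent.ExecutionContext.Implicits.global

val auth: Future[Either[String, String]] =
  Future { Right("user-token") } // body runs on the global thread pool

val result = Await.result(auth, 2.seconds)
```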
<br>
<br>
Now that we have the authentication set up, we mix in the UserAuthentication trait using "with" and use the "authenticate" directive, passing it the "authenticateUser" function defined above.
<br>
<script type="syntaxhighlighter" class="brush: java">
trait CustomerService extends HttpService with Json4sSupport with UserAuthentication {
val customerRoutes =
path("addCustomer") {
post {
authenticate(authenticateUser) { user =>
entity(as[JObject]) { customerObj =>
complete {
. . . .
}
}
}
}
} ~
path("getCustomer" / Segment) { customerId =>
get {
authenticate(authenticateUser) { user =>
{
complete {
. . .
}
}
}
}
}
}
// [Success GET] http://localhost:8080/getCustomer/520ed44941f19472d5f?usr=karthik&pwd=kufli
// [Failed GET]  http://localhost:8080/getCustomer/520ed44941f19472d5f9?usr=fail&pwd=wrong
</script>
<br>
<br>
Further readings on this topic:<br>
<a href="https://github.com/spray/spray/blob/release/1.2/spray-routing/src/main/scala/spray/routing/authentication/HttpAuthenticator.scala">Spray/routing/authentication/HttpAuthenticator.scala</a><br>
<a href="https://github.com/spray/spray/blob/release/1.2/spray-routing-tests/src/test/scala/spray/routing/SecurityDirectivesSpec.scala">Spray/routing/SecurityDirectivesSpec.scala</a><br>
<a href="https://groups.google.com/forum/#!topic/spray-user/5DBEZUXbjtw">https://groups.google.com/forum/#!topic/spray-user/5DBEZUXbjtw</a><br>
<a href="https://github.com/spray/spray/blob/master/spray-routing/src/main/scala/spray/routing/RejectionHandler.scala">Spray/routing/RejectionHandler.scala</a><br>
..:: karthik ::..http://www.blogger.com/profile/01347159551221689138noreply@blogger.comtag:blogger.com,1999:blog-8142317736019852991.post-91396676680386936252013-08-16T22:56:00.000-04:002013-08-16T22:59:15.250-04:00Spray.io REST service - MongoDB - Lesson 4View the lessons list at <a href="https://github.com/karthik20522/SprayLearning">https://github.com/karthik20522/SprayLearning</a>
<br><br>
Now that we have the Spray service <a href="http://kufli.blogspot.com/2013/08/sprayio-rest-web-api-basic-setup.html">setup</a> and <a href="http://kufli.blogspot.com/2013/08/sprayio-rest-service-controller-actions.html">routes</a> <a href="http://kufli.blogspot.com/2013/08/sprayio-rest-service-json-serialization.html">defined</a>, we can hook up the database to add and fetch customer information. For the database I am using MongoDB with <a href="http://mongodb.github.io/casbah/">Casbah</a>, a Scala toolkit for MongoDB. Casbah calls itself a "toolkit" rather than a "driver" because it is a layer on top of the official <a href="https://github.com/mongodb/mongo-java-driver">mongo-java-driver</a> that provides better integration with Scala.
<br><br>
To get Casbah set up, we need to add casbah and its related dependencies to the Build.scala file.
<script type="syntaxhighlighter" class="brush: java">
libraryDependencies ++= Seq(
. . . .
"org.mongodb" %% "casbah" % "2.6.2",
"com.typesafe" %% "scalalogging-slf4j" % "1.0.1",
"org.slf4j" % "slf4j-api" % "1.7.1",
"org.slf4j" % "log4j-over-slf4j" % "1.7.1",
"ch.qos.logback" % "logback-classic" % "1.0.3")
</script>
<br>
<i>Note that we have slf4j and Scala logging in the dependencies. Without slf4j you would get a "Failed to load class org.slf4j.impl.StaticLoggerBinder" error.</i>
<br><br>
In my example, I have created a MongoFactory object that has three functions: getConnection, getCollection and closeConnection.
<script type="syntaxhighlighter" class="brush: java">
import com.mongodb.casbah.MongoCollection
import com.mongodb.casbah.MongoConnection
object MongoFactory {
private val SERVER = "localhost"
private val PORT = 27017
private val DATABASE = "customerDb"
private val COLLECTION = "customer"
def getConnection: MongoConnection = MongoConnection(SERVER)
def getCollection(conn: MongoConnection): MongoCollection = conn(DATABASE)(COLLECTION)
def closeConnection(conn: MongoConnection) { conn.close }
}
</script>
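One caveat with this factory: every caller must remember to invoke closeConnection. A common alternative is the loan pattern, sketched here with a stub connection type (not Casbah's API) so the example is self-contained:

```scala
// Stub standing in for MongoConnection in this sketch
class StubConnection { var closed = false; def close(): Unit = closed = true }

object SafeMongoFactory {
  // "Loans" a fresh connection to f and guarantees it is closed afterwards,
  // even if f throws
  def withConnection[T](f: StubConnection => T): T = {
    val conn = new StubConnection
    try f(conn) finally conn.close()
  }
}

var loaned: StubConnection = null
val answer = SafeMongoFactory.withConnection { conn => loaned = conn; 42 }
```

With the real Casbah types, the data-access methods below could call such a helper instead of holding a connection for the lifetime of the class.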
<br>
Now that we have our factory object, the next step is building the data access class for inserting and fetching data. The following code snippet has two operations:
<ul>
<li>saveCustomer - which returns back the GUID after inserting into MongoDB</li>
<li>findCustomer - find customer by GUID</li>
</ul>
<script type="syntaxhighlighter" class="brush: java">
import com.mongodb.casbah.Imports._ //brings MongoDBObject and the query DSL into scope
class CustomerDal {
val conn = MongoFactory.getConnection
def saveCustomer(customer: Customer) = {
val customerObj = buildMongoDbObject(customer)
val result = MongoFactory.getCollection(conn).save(customerObj)
val id = customerObj.getAs[org.bson.types.ObjectId]("_id").get
id //return the generated ObjectId
}
def findCustomer(id: String) = {
val q = MongoDBObject("_id" -> new org.bson.types.ObjectId(id))
val collection = MongoFactory.getCollection(conn)
val result = collection.findOne(q)
val customerResult = result.get //throws if no document matches; consider returning an Option
val customer = Customer(firstName = customerResult.as[String]("firstName"),
lastName = customerResult.as[String]("lastName"),
_id = Some(customerResult.as[org.bson.types.ObjectId]("_id").toString()),
phoneNumber = Some(customerResult.as[String]("phoneNumber")),
address = Some(customerResult.as[String]("address")),
city = Some(customerResult.as[String]("city")),
country = Some(customerResult.as[String]("country")),
zipcode = Some(customerResult.as[String]("zipcode")))
customer //return the customer object
}
//Convert our Customer object into a BSON format that MongoDb can store.
private def buildMongoDbObject(customer: Customer): MongoDBObject = {
val builder = MongoDBObject.newBuilder
builder += "firstName" -> customer.firstName
builder += "lastName" -> customer.lastName
builder += "phoneNumber" -> customer.phoneNumber.getOrElse("")
builder += "address" -> customer.address.getOrElse("")
builder += "city" -> customer.city.getOrElse("")
builder += "country" -> customer.country.getOrElse("")
builder += "zipcode" -> customer.zipcode.getOrElse("")
builder.result
}
}
</script>
Now to integrate this into the service:
<script type="syntaxhighlighter" class="brush: java">
trait CustomerService extends HttpService with Json4sSupport {
val customerRoutes =
path("addCustomer") {
post {
entity(as[JObject]) { customerObj =>
complete {
val customer = customerObj.extract[Customer]
val customerDal = new CustomerDal
val id = customerDal.saveCustomer(customer)
id.toString()
}
}
}
} ~
path("getCustomer" / Segment) { customerId =>
get {
complete {
//get customer from db using customerId as Key
val customerDal = new CustomerDal
val customer = customerDal.findCustomer(customerId)
customer
}
}
}
}
</script>
More information and resources on Casbah:
<ul>
<li><a href="http://mongodb.github.io/casbah/">http://mongodb.github.io/casbah/</a></li>
<li><a href="http://mongocasbahcookbook.tumblr.com/post/44240846005/mongodbobject-as-criteria-and-as-actions">http://mongocasbahcookbook.tumblr.com/post/44240846005/mongodbobject-as-criteria-and-as-actions</a></li>
<li><a href="http://janxspirit.blogspot.com/2011/11/introduction-to-casbah-scala-mongodb.html">http://janxspirit.blogspot.com/2011/11/introduction-to-casbah-scala-mongodb.html</a></li>
<li><a href="http://stackoverflow.com/questions/tagged/casbah">http://stackoverflow.com/questions/tagged/casbah</a></li>
</ul>
..:: karthik ::..http://www.blogger.com/profile/01347159551221689138noreply@blogger.comtag:blogger.com,1999:blog-8142317736019852991.post-41009308615022322872013-08-15T22:51:00.000-04:002013-08-15T22:59:18.485-04:00Spray.io REST service - Json Serialization, De-serialization - Lesson 3
View the lessons list at <a href="https://github.com/karthik20522/SprayLearning">https://github.com/karthik20522/SprayLearning</a><br><br>
For a REST-based interface, JSON requests and responses are the norm. There are two ways to set the response format in Spray. <br><br>
1) <b>using respondWithMediaType</b>
<script type="syntaxhighlighter" class="brush: java">
get {
respondWithMediaType(`application/json`) {
complete {
//Some object response
}
}
}
</script>
In this method, each route/action has to be explicitly set with a media type of json. More info at <a href="https://github.com/spray/spray/wiki/Misc-Directives">https://github.com/spray/spray/wiki/Misc-Directives</a>
<br><br>
2) <b>Globally overriding the default format. </b>
<br><br>
For all JSON-related operations, I am using json4s as the JSON library. In Build.scala, add the following json4s dependency.
<script type="syntaxhighlighter" class="brush: java">
libraryDependencies ++= Seq(
. . . . .
"org.json4s" %% "json4s-native" % "3.2.4"
)
</script>
<i>Note: you will need to reload the project in sbt and regenerate the Eclipse project files; Eclipse does not refresh itself when new dependencies are added</i>
<br><br>
In the CustomerServiceActor we need to set the default format to json4s by adding the implicit formatter:
<script type="syntaxhighlighter" class="brush: java">
import spray.httpx.Json4sSupport
import org.json4s.Formats
import org.json4s.DefaultFormats
import com.example.model.Customer
class CustomerServiceActor extends Actor with CustomerService with AjaxService {
implicit def json4sFormats: Formats = DefaultFormats
. . . .
</script>
For example purposes, let's update the getCustomer action to return a mocked customer. The expected response should be a JSON-formatted customer object.
<script type="syntaxhighlighter" class="brush: java">
path("getCustomer" / Segment) { customerId =>
get {
complete {
val customer = Customer(id = Some(customerId),
firstName = "Karthik",
lastName = "Srinivasan")
customer //return customer obj
}
}
}
/*
[GET] http://localhost:8080/getCustomer/123
[Response] {"firstName":"Karthik","lastName":"Srinivasan","id":"123","city":"New York","country":"USA"}
*/
//The customer case class is as follows:
package com.example.model
case class Customer(firstName: String,
lastName: String,
id: Option[String] = None,
phoneNumber: Option[String] = None,
address: Option[String] = None,
city: Option[String] = Some("New York"),
country: Option[String] = Some("USA"),
zipcode: Option[String] = None) {
}
</script>
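Notice that city and country appear in the response even though the mocked customer never set them: the Option defaults on the case class supply them. A trimmed, stand-alone copy of the case class makes this easy to verify:

```scala
// Trimmed copy of the Customer case class to show the Option defaults in action
case class Customer(firstName: String,
                    lastName: String,
                    id: Option[String] = None,
                    city: Option[String] = Some("New York"),
                    country: Option[String] = Some("USA"),
                    zipcode: Option[String] = None)

// Only the required fields are supplied; the defaults provide the rest
val c = Customer(firstName = "Karthik", lastName = "Srinivasan")
```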
<br>
To post a customer object and deserialize it to the Customer type, we can use spray's entity directive [<a href="http://spray.io/documentation/1.1-M8/spray-httpx/unmarshalling/">http://spray.io/documentation/1.1-M8/spray-httpx/unmarshalling/</a>] and bind the posted value to a JObject
<script type="syntaxhighlighter" class="brush: java">
import org.json4s.JsonAST.JObject
. . .
path("addCustomer") {
post {
entity(as[JObject]) { customerObj =>
complete {
val customer = customerObj.extract[Customer]
//insert customer information into a DB and return back customer obj
customer
}
}
}
}
/*
[POST] http://localhost:8080/addCustomer { "firstName" : "karthik", "lastName" : "srinivasan", "zipcode" : "01010" }
[Should return] {"firstName":"karthik","lastName":"srinivasan","city":"New York","country":"USA","zipcode":"01010"}
*/
</script>
Further reading on json4s at <a href="https://github.com/json4s/json4s">https://github.com/json4s/json4s</a>..:: karthik ::..http://www.blogger.com/profile/01347159551221689138noreply@blogger.com