Original Link : http://blog.qbox.io/elasticsesarch-percolator
Our Qbox team has been asked about the Percolate API, and we’re glad to share an introduction to that very popular Elasticsearch feature. Before we get started, a small amount of setup is involved. To make sure we’re working with the same environment, we’ll start by installing v1.0.1 of Elasticsearch.
Install and Start Elasticsearch v1.0.1
http://www.elasticsearch.org/download/
Distributed Percolation is a feature of the v1.x series of Elasticsearch. We’ll be using the current version of the v1.x series, v1.0.1. If you’ve never installed Elasticsearch, take a moment to watch or read our Elasticsearch Tutorial Ep.1 for detailed instructions.
Mapping and Data
http://sense.qbox.io/gist/28ff2f1031e6d4a5904604d24d26b0bad6238720
For this introduction we've provided a Sense gist with runnable code examples at the link above. Once v1.0.1 of Elasticsearch is running locally, you can map your documents and begin using the Percolate API examples below.
Percolate
The Percolate API is a commonly used utility in Elasticsearch for alerting and monitoring documents. “Search in reverse” is a good way to conceptualize what Percolation does. Searching with Elasticsearch is usually done by querying a set of documents for relevance to the search. Percolate works the opposite way: you register queries (percolators) first, and then Percolate your documents against them to find out which queries match.
v1.0.0 of Elasticsearch brought a major change to how the Percolate API distributes its registered queries. The Percolator in 0.90.x and previous versions was restricted to a single shard index. With a single shard, performance continues to degrade as the number of registered queries grows.
To get around this bottleneck, queries could be partitioned across multiple single shard indices, or you could manipulate Percolate queries to reduce the execution time. These methods, however, still left fundamental scaling limits for any Percolator index shard. Having to “get around” this bottleneck was a concern for the Elasticsearch team, who wanted to make the Percolator distributed. Distributed Percolation in v1.0.0 put these issues to bed, dropping the previous single shard _percolator index in favor of a .percolator type inside an ordinary index.
Distributed Percolation
A .percolator type gives users a fully distributed Percolate API. You can now configure the number of shards your Percolator queries need, moving from a restricted single shard execution to a parallelized execution across all shards within that index. Multiple shards also means support for routing and preference, just like the other Search APIs (except the Explain API).
Dropping the old _percolator index shard restriction does break backwards compatibility with the 0.90.x Percolator, but a breaking change in Percolation is also a great opportunity for renovations and new features.
Structure of a Percolator in v1.x
Registering a .percolator has changed little from the Percolator of the 0.90.x series. The more substantial change, mentioned earlier, is that .percolator is now a type in an index, as shown in the example below. In this Percolator we register a match query for the “sport” field containing “baseball.”
curl -XPUT 'localhost:9200/sports/.percolator/1' -d '{
"query" : {
"match" : {
"sport" : "baseball"
}
}
}'
Default mapping for a .percolator type is a query field type of object, with “enabled” set to false. (Enabled allows disabling of parsing and indexing on a named object.) It is worth noting that this new index type could exist on a dedicated Percolator index. Remember when using a dedicated Percolator index to include the mapping of the documents you _percolate. Without the correct mapping for the documents you _percolate, .percolator queries can be (and probably will be) parsed incorrectly.
Request:
curl -XGET "http://localhost:9200/sports/_mapping"
Response:
{
"sports" : {
"mappings" : {
".percolator" : {
"_id" : {
"index" : "not_analyzed"
},
"properties" : {
"query" : {
"type" : "object",
"enabled" : false
}
}
}
}
}
}
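For the dedicated Percolator index mentioned above, the document mapping has to be created alongside the .percolator type. The following is a minimal sketch only: the index name sports-percolators and the field types chosen for the athlete documents are assumptions for illustration, not taken from the gist.

```shell
# Hypothetical name for a dedicated Percolator index (an assumption for this sketch).
PERCOLATOR_INDEX="sports-percolators"

# Assumed field types for the athlete documents; adjust to your real mapping.
ATHLETE_MAPPING='{
  "mappings": {
    "athlete": {
      "properties": {
        "name":      { "type": "string" },
        "birthdate": { "type": "date" },
        "sport":     { "type": "string" },
        "rating":    { "type": "integer" },
        "location":  { "type": "geo_point" }
      }
    }
  }
}'

# Create the index with the athlete mapping before registering any queries.
create_percolator_index() {
  curl -s -XPUT "localhost:9200/${PERCOLATOR_INDEX}" -d "${ATHLETE_MAPPING}"
}

# Register the baseball query in the dedicated index.
register_baseball_percolator() {
  curl -s -XPUT "localhost:9200/${PERCOLATOR_INDEX}/.percolator/1" -d '{
    "query": { "match": { "sport": "baseball" } }
  }'
}

# Usage, with Elasticsearch running locally:
#   create_percolator_index && register_baseball_percolator
```

With the mapping in place, documents of type athlete can be Percolated against this index in the same way as against the sports index.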
Percolate
Running _percolate against the .percolator below returns a match when the document satisfies the registered query. There are a few ways we can run our documents against our Percolator. First, we will use the standard “doc” body to call the _percolate API. Usually you would use this method on documents that have not been indexed yet.
Percolator:
curl -XPUT 'localhost:9200/sports/.percolator/1' -d '{
"query" : {
"match" : {
"sport" : "baseball"
}
}
}'
Percolating on a “doc” body:
curl -XPOST "http://localhost:9200/sports/athlete/_percolate/" -d '{
"doc": {
"name": "Jeff",
"birthdate": "1990-4-1",
"sport": "Baseball",
"rating": 2,
"location": "46.12,-68.55"
}
}'
This sports index has a single .percolator with “_id”: “1” that our document matches. You can see in the response below that it took 1ms, 5 out of 5 shards were successful, and we matched one Percolator in the sports index with “_id”: “1”.
Response:
{
"took": 1,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"total": 1,
"matches": [
{
"_index": "sports",
"_id": "1"
}
]
}
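For alerting, the interesting part of this response is usually just the list of matched ids. The sketch below pulls them out with grep and sed; the helper name matched_ids is hypothetical, and the text matching is deliberately rough, so a JSON-aware tool would be a more robust choice in practice.

```shell
# Print the "_id" of every matched percolator found in a _percolate response on stdin.
# Rough text matching only; it assumes ids do not contain double quotes.
matched_ids() {
  grep -o '"_id" *: *"[^"]*"' | sed 's/.*: *"\(.*\)"/\1/'
}

# Sample response in the shape shown above, used here instead of a live call.
SAMPLE_RESPONSE='{"took":1,"total":1,"matches":[{"_index":"sports","_id":"1"}]}'
printf '%s\n' "$SAMPLE_RESPONSE" | matched_ids   # prints 1
```

Against a running node, the output of the curl command above can be piped straight into matched_ids (add -s to curl to silence the progress meter).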
Bulk Percolating documents can be achieved with the multi-Percolate API (similar to the bulk API). Each request in a multi-Percolate body begins with a header line specifying the index, the type, and optionally the id. The header is followed by a line containing the JSON document body itself. No document is required when Percolating an existing document; only a reference to the “_id” of the document is needed in the header, and the body line is left as an empty JSON object.
Request:
curl -XGET 'localhost:9200/sports/athlete/_mpercolate' --data-binary @multi-percolate.txt; echo
multi-percolate.txt:
{"percolate" : {"index" : "sports", "type" : "athlete"}}
{"doc" : {"name":"Michael", "birthdate":"1989-10-1", "sport":"Baseball", "rating": ["5", "4"], "location":"46.22,-68.45"}}
{"percolate" : {"index" : "twitter", "type" : "tweet", "id" : "1"}}
{}
To _percolate a single existing document, simply include the “_id” of the document to Percolate in the URL:
curl -XGET 'localhost:9200/sports/athlete/1/_percolate'
Another format for the standard _percolate response is count, which only responds with the total number of matches.
curl -XPOST "http://localhost:9200/sports/athlete/_percolate/count" -d '{
"doc": {
"name": "Jeff",
"birthdate": "1990-4-1",
"sport": "Baseball",
"rating": 2,
"location": "46.12,-68.55"
}
}'
Response:
{
"took": 3,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"total": 1
}
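A filter is a way to Percolate only against the queries registered for the sport baseball. The filter narrows down which registered queries are executed, and it is applied to the fields stored alongside each .percolator query (its metadata) rather than to the document being Percolated, so a .percolator needs a “sport” field of its own to be selected by the filter below. We could then create a .percolator on another field about which we are curious, say, a specific birthdate.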
curl -XPOST "http://localhost:9200/sports/athlete/_percolate/" -d '{
"doc": {
"name": "Jeff",
"birthdate": "1990-4-1",
"sport": "Baseball",
"rating": 2,
"location": "46.12,-68.55"
},
"filter": {
"term": {
"sport": "baseball"
}
}
}'
curl -XPUT "http://localhost:9200/sports/.percolator/2" -d '{
"query":{
"match": {
"birthdate": "1990-4-1"
}
}
}'
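Because the filter is evaluated against fields stored next to the registered query, a .percolator that should be picked up by the term filter on “sport” has to carry that field itself. A minimal sketch, assuming a hypothetical percolator id 3 and a metadata field named sport (neither comes from the gist):

```shell
# Hypothetical percolator body: the "sport" key next to "query" is metadata,
# stored with the registered query and usable in a percolate filter.
BIRTHDATE_PERCOLATOR='{
  "query": { "match": { "birthdate": "1990-4-1" } },
  "sport": "baseball"
}'

register_tagged_percolator() {
  # Id 3 is an arbitrary choice for this sketch.
  curl -s -XPUT "localhost:9200/sports/.percolator/3" -d "${BIRTHDATE_PERCOLATOR}"
  # Refresh so that a filter can see the newly registered query.
  curl -s -XPOST "localhost:9200/sports/_refresh"
}

# Usage, with Elasticsearch running locally:
#   register_tagged_percolator
```

After registering it, the filtered _percolate request above should only execute queries tagged with sport baseball, such as this one.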
Other supported request body options for _percolate include size, track_scores, sort, facets, aggs, and highlight. The query and filter options differ only in that a score is computed for query. That score can be returned with each match (via track_scores); it reflects how the query matched the Percolate query’s metadata. You can also use highlight, facets, or aggregations in these request bodies. Use size to limit the number of matches returned (defaults to unlimited).
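As an illustration of these options, the sketch below builds a request body that scores matches and caps how many are returned. The function name percolate_body_with_scores is hypothetical, and the query clause assumes that the registered percolators carry a metadata field named sport stored next to their query.

```shell
# Build a percolate request body that scores matches and limits the result size.
# $1: maximum number of matches to return (the "size" option); 10 if omitted.
percolate_body_with_scores() {
  size="${1:-10}"
  cat <<EOF
{
  "doc": { "name": "Jeff", "birthdate": "1990-4-1", "sport": "Baseball", "rating": 2, "location": "46.12,-68.55" },
  "query": { "match": { "sport": "baseball" } },
  "size": ${size},
  "track_scores": true
}
EOF
}

# Usage, with Elasticsearch running locally:
#   curl -s -XPOST "localhost:9200/sports/athlete/_percolate" -d "$(percolate_body_with_scores 5)"
```

Using query rather than filter here is deliberate: track_scores needs a query to compute a score from.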
Distributed Percolation can be the solution for some of the most active databases in production today. Fascinating data and analytics can be gained from your real-time _percolate. With distribution, the Percolate API will only grow into more interesting use cases and ideas for Elasticsearch.
If you enjoyed this post, you’ll want to check out some of our other tutorials like An Introduction to Elasticsearch Aggregations for more.
You can use Elasticsearch v1.0.1 and 0.90.12 on Qbox today to try out Percolation on a dedicated cluster of your choosing. If you have any questions, feel free to leave a comment below or contact us. Runnable code for all the examples used in this tutorial can be found at this Sense gist.