Elastic/Elasticsearch 2013. 8. 12. 10:42
http://www.elasticsearch.org/download/
breaking changes:
- Java Client: Renamed IndicesAdminClient.existsAliases() to IndicesAdminClient.aliasesExist #3330
new features:
- Support for the pattern replace char filter has been added #3197
- A new API to check if there are pending cluster tasks has been added #3368
- A new completion suggestion based on prefix suggestions has been added (this is experimental) #3376
enhancements:
- Support for named filters has been added #3097
- The has_child query has been optimized to execute faster when matching parent count is low #3190
- Integer field data implementations have been merged #3220
- The rescore query now supports a score_mode #3258
- Mget fields parameter can now be a string or an array #3270
- XContentParser/Generator now can handle simple arrays #3279
- Bulk deletes now contain a found field #3320
- Zen discovery cluster events now have an urgent priority #3361
- A dedicated channel for pings has been added in order to be independent from huge cluster state updates #3362
- Cluster state update APIs now respect the master_timeout much better #
- A new dedicated thread pool for the optimize API has been added #3366
- FastVectorHighlighter now supports complex queries (such as multi phrase queries with two terms at the same position) #3357
- The recursion level of the hunspell filter is now configurable in the mapping #3369
- Every distribution now contains information about its git build #3370
- The dynamic flag in the root object mapper can now be configured dynamically at runtime #3384
- Fewer cluster state changes if auto_expand_replicas is set #3399
- Open/Close index API now supports an acknowledgement from other nodes instead of simply waiting for the change in the cluster state #3400
- Whenever analyzing strings, elasticsearch now uses Lucene methods introduced with Lucene 4.4, which reuse internal data structures #3409
- In addition, the formerly used methods have been deprecated #3411
- Improved alias handling in the cluster state (much faster if you have tens of thousands of aliases) #3410
- The delete API now waits until a shard is removed from disk #3413
- Rerouting of shards now happens on a shard started event #3417
- The Index Template API now is more RESTful, supports HEAD and returns a proper 404 if it does not exist #3434
- HighlightBuilder is now consistent with REST API #3435
- The header response (including the successful/failed shards) has been streamlined between different requests #3441
bug fixes:
- Timestamp index settings in a mapping are now correctly returned #3174
- Field data now supports more than 2B ordinals per segment #3189
- TokenStreams were reset twice when highlighting #3200
- The geo_shape filter now handles multiple shapes per document correctly #3242
- PluginManager fixes
  - The PluginManager now parses parameters correctly again (regression from 0.90.1) #3245
  - Calling the PluginManager while having a non-existing plugins directory is now handled #3253
- The index warmer setting to is now configurable at runtime #3246
- The order of fields in a suggest request can now be arbitrary #3247
- More-like-this now correctly returns an error message if used with numeric fields (that error can be simply ignored as well) #3252
- The parent option is now taken into account for delete requests #3257
- The Update API's doc_as_upsert option is now taken into account correctly #3265
- Mget requests do not abort completely anymore if any index is missing #3267
- Parent is taken into account in exists request #3276
- Removed java dependency from debian package, so arbitrarily installed java can be used #3284
- Partial fields filtering could return false matches #3288
- Caching of top_children, has_child and has_parent queries could lead to a ClassCastException #3290
- Script based sorting was applied after pagination #3309
- Unallocated indexes cannot be closed immediately to prevent indices which cannot be opened anymore #3313
- Thai analyzer now makes use of stopwords #3342
- Unset top level filter now behaves the same as inside a filtered query #3356
- Pattern replace filter now has an empty default set to ensure same behaviour on upgrades #3359
- Alias validation on adding aliases has been improved #3363
- Uncaught exceptions on cluster state updates could lead to hanging request #3364
- FuzzyLikeThisFieldQueryBuilder defaults are now consistent with the REST API #3374
- Updating a mapping with ignore_conflicts could hang and timeout #3381
- Setting index.gc_deletes on runtime is working properly now #3396
- MoreLikeThisFieldQueryBuilder defaults are now consistent with the REST API #3402
- Query/Filter facet counter is now 64bit #3419
- The pid file was not properly overwritten if it already existed #3425
- Search in a shard group while relocation final flip happens could have failed #3427
- UpsertRequests now contain all metadata fields (parent, routing, etc.) #3444
- Retry_on_conflict setting in a bulk request could lead to an NPE #3447
Elastic/Elasticsearch 2013. 7. 31. 14:01
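This is a description of the fields returned in an elasticsearch search result. Most of it should be self-explanatory.
- took : response time of the search query (milliseconds)
- timed_out : boolean indicating whether the query execution timed out inside the search engine
- _shards : shards on which the search was executed
  - total : total number of shards searched
  - successful : number of shards on which the search succeeded
  - failed : number of shards on which the search failed
- hits : search match results
  - total : total number of matched documents
  - max_score : highest relevance score among the matched documents
  - hits : the matched documents
    - _index : name of the matched index
    - _type : name of the matched type
    - _id : unique id of the matched document
    - _score : relevance score of the matched document
    - _source : returned when no output fields are specified; contains all fields
    - fields : contains the requested output fields
    - highlight : contains the highlighted fields
- facets : group counting results
  - groupby : name of the returned facet (the name can be chosen in the request)
    - _type : facet type
    - missing : count of documents missing the field
    - total : total count of facet targets
    - other : count of documents not included in the facet total
    - terms : facet terms
      - term : the facet term
      - count : count for that term
So which source code should you look at for these? Naturally, the response code.
The two representative classes to start with are SearchResponse.java and InternalSearchResponse.java.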
Elastic/Elasticsearch 2013. 7. 16. 20:21
https://github.com/elasticsearch/elasticsearch/issues/2707
https://github.com/tlrx/elasticsearch-custom-similarity-provider
https://github.com/lukapor/customsimilarity
As the links above show, you basically have to override the functions Lucene uses to calculate the score. In other words, you need to build and apply your own ranking algorithm that reflects the characteristics of your documents and service.
package org.elasticsearch.bcsocial.plugin.similarity;

import org.apache.lucene.index.FieldInvertState;
import org.apache.lucene.search.similarities.DefaultSimilarity;

// Returns 1.0 from every overridden scoring factor of DefaultSimilarity.
public class BcsocialSimilarity extends DefaultSimilarity {
  public BcsocialSimilarity() {}
  @Override public float lengthNorm(FieldInvertState state) { return 1.0f; }
  @Override public float coord(int overlap, int maxOverlap) { return 1.0f; }
  @Override public float queryNorm(float sumOfSquaredWeights) { return 1.0f; }
  @Override public float tf(float freq) { return 1.0f; }
  @Override public float idf(long docFreq, long numDocs) { return 1.0f; }
  @Override public String toString() { return "BcsocialSimilarity"; }
}
Elastic/Elasticsearch 2013. 7. 16. 19:21
Original URL : https://gist.github.com/UpOutServers/9a4466108d12452738e9
package org.elasticsearch.plugin.myplugin;

import org.elasticsearch.common.inject.Module;
import org.elasticsearch.plugins.AbstractPlugin;
import org.elasticsearch.rest.RestModule;
import org.elasticsearch.script.ScriptModule;

public class MyPlugin extends AbstractPlugin {
  public String name() { return "MyPlugin"; }
  public String description() { return "MyPlugin"; }
  // Called when the REST module is set up; register custom REST actions here.
  public void onModule(RestModule module) {
    module.addRestAction(/*Your rest class*/);
  }
  // Called when the script module is set up; register native scripts here.
  public void onModule(ScriptModule module) {
    module.registerScript(/*Your script name*/ ,/*Your script class*/);
  }
}
The part I actually built and used here is the ScriptModule hook. If you use AbstractPlugin, ES loads the plugin at startup without anything being registered in elasticsearch.yml, which removes the configuration hassle.
If writing the registration code feels like too much work, you can skip it. In that case, though, you have to register the script in elasticsearch.yml before you can use it.. :)
Below are a few links that may be helpful.
http://www.elasticsearch.org/guide/reference/modules/scripting/
https://github.com/imotov/elasticsearch-native-script-example
http://elasticsearch-users.115913.n3.nabble.com/Loading-and-Registering-Native-Scripts-td3088835.html
http://elasticsearch-users.115913.n3.nabble.com/Native-Script-Help-td2980754.html
One very important point: the API changed between 0.90.x and earlier versions, so be sure to check before you use it. I wasted a good part of today because of this..
Elastic/Elasticsearch 2013. 7. 15. 23:34
http://hnagtech.wordpress.com/2013/04/19/using-payloads-with-solr-4-x/
There are already quite a few good blogs on what Lucene payloads are and how they can be used and developed, either using the Lucene API or with Solr. I personally feel the following two blogs are worth viewing for a quick start. With Solr 4.x, indexing fields with payloads is made all the easier with some readily available factory objects. The recent Apache Solr sample "schema.xml" has some usage details. But the tricky part with Solr 4.x is making payloads work at all, and the above information isn't sufficient, thanks to the (ever-changing!) API changes with every Lucene/Solr release. This is the gap this blog tries to fill. There are 2 parts to the solution, and I will detail them accordingly.
#1 QueryParsing
Wrapping your specific query terms with a 'PayloadTermQuery' object in your query parser's parse() method wouldn't work. Rather, you should also override the SolrQueryParser.getFieldQuery() method, as in the sample below, to identify your payloaded terms.
  @Override
  protected Query getFieldQuery(String field, String queryText, boolean quoted) throws SyntaxError {
    SchemaField sf = this.schema.getFieldOrNull(field);
    // Fields whose type is named "payloads" are treated as payloaded fields.
    if (sf != null && sf.getType().getTypeName().equalsIgnoreCase("payloads")) {
      Term t = new Term(field, queryText);
      Query q = new PayloadTermQuery(t, new MaxPayloadFunction(), false);
      return q;
    }
    return super.getFieldQuery(field, queryText, quoted);
  }
In the above sample, a field of type 'payloads' is considered a payloaded field (you could give it a different name), and so the wrapping query is changed accordingly. Only if the above is done would your implementation of Similarity's scorePayload() function be invoked. This information on overriding 'getFieldQuery()' is of course available in this wiki link, Payloads; however, it is hidden somewhere, and a normal google search doesn't return this link (try testing!).
#2 Scoring using payloads
Talking about scorePayload(), the method's new signature in Lucene 4.1 is all the more confusing compared to what was available before.
  @Override
  public float scorePayload(int doc, int start, int end, BytesRef payload) {
    if (payload != null) {
      // Decode from payload.offset, not from index 0 of the shared byte array.
      float x = PayloadHelper.decodeFloat(payload.bytes, payload.offset);
      return x;
    }
    return 1.0F;
  }
The payload is available as a 'BytesRef' instance (unlike a byte array as in previous Lucene versions), and the developer is challenged to find out what method to invoke on that object to get the payload score! Developers may be tempted to play with the 'utf8ToString()' method, but beware: that isn't the solution. Just note that the member variable 'bytes', which is a byte array, is of public scope, and that is exactly what carries the score. IMHO, the previous idea of a 'byte[]' argument seemed much safer and more readable.
#3 Adding payloaded documents to index
Quite recently in the same article, I had written in this section that if we try to index payloaded documents as a collection using 'add()' or 'addBeans', then the payload value pertaining to the first document alone is considered, and the same value is taken as the score for other documents in the collection. So, I had suggested adding documents one by one, and committing each time (as given below).
  for (D doc : docsIterator) {
    server.addBean(doc);
    server.commit();
  }
Unfortunately, this is a big misunderstanding among a few Lucene-using developers like me, and I saw some forums also discussing this idea. So, I have re-edited this section for the better! There is no problem adding payloaded documents in bulk, but one has to be careful to include 'payload.offset' while implementing scorePayload() (as in section #2). Only then would the current document's payload value be considered correctly. As mentioned in the previous section, the new signature of scorePayload() hasn't been fun to understand, with the lack of proper getter methods in BytesRef leaving the developer's understanding quite vulnerable. This situation will continue to exist till amends are made to the method signature or the BytesRef API.
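To see why the offset matters, the effect can be reproduced without Lucene. The following is only a minimal sketch, assuming the payload floats are stored as 4 big-endian bytes each and that several payloads share one backing byte array (which is the situation a BytesRef with a non-zero offset describes). The class and method names are hypothetical and are not part of Lucene or Solr.

```java
import java.nio.ByteBuffer;

public class PayloadOffsetSketch {
    // Decodes a 4-byte big-endian float that starts at the given offset.
    public static float decodeFloat(byte[] bytes, int offset) {
        return ByteBuffer.wrap(bytes, offset, 4).getFloat();
    }

    // Packs several floats back to back into one shared buffer (hypothetical helper).
    public static byte[] pack(float... values) {
        ByteBuffer buf = ByteBuffer.allocate(4 * values.length);
        for (float v : values) {
            buf.putFloat(v);
        }
        return buf.array();
    }

    public static void main(String[] args) {
        byte[] shared = pack(5.8f, 2.5f, 9.0f);
        // Decoding at offset 0 always returns the first payload in the buffer,
        // whichever payload was actually meant.
        System.out.println(decodeFloat(shared, 0));
        // Decoding at the payload's own offset returns its own value.
        System.out.println(decodeFloat(shared, 4));
        System.out.println(decodeFloat(shared, 8));
    }
}
```

If every payload is decoded from offset 0, all of them appear to carry the first value, which matches the symptom described in section #3.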
Elastic/Elasticsearch 2013. 7. 15. 23:29
http://sujitpal.blogspot.kr/2010/10/denormalizing-maps-with-lucene-payloads.html
Last week, I tried out Lucene's Payload and SpanQuery features to do some position based custom scoring of terms. I've been interested in the Payload feature ever since I first read about it, because it looked like something I could use to solve another problem at work...
The problem is to be able to store a mapping of concepts to scores along with a document. Our search uses a medical taxonomy, basically a graph of medical concepts (nodes) and their relationships to each other (edges). During indexing, a document is analyzed and a map of node IDs and scores is created and stored in the index. The score is composed of various components, but for simplicity, it can be thought of as the number of occurrences of a node in the document. So after indexing, we would end up with something like this:
During search, the query is decomposed into concepts using a similar process, and a query consisting of one or more TermQueries (wrapped in a BooleanQuery) is used to pull documents out of the index. In pseudo-SQL, something like this:
  SELECT document FROM index
  WHERE nodeId = nodeID(1)
  ...
  AND/OR nodeId = nodeID(n)
  ORDER by (score(1) + score(n)) DESC
There are many approaches to efficiently model this sort of situation, and over the years we've tried a few. The approach I am going to describe uses Lucene's Payload feature. Basically, the concept map is "flattened" into the main Document, and the scores are farmed out to a Payload byte array, so we can use the scores for scoring our results. Obviously, this is nothing new... other people have used Payloads to do very similar things. In fact, a lot of the code that follows is heavily based on the example in this Lucid Imagination blog post.
Indexing
At index time, we flatten our concept map into a whitespace separated list of key-value pairs, and the key and value in each element are separated by a special character, in our case a "$" sign. So a concept map {p1 => 123.0, p2 => 234.0} would be transformed to "p1$123.0 p2$234.0". Lucene provides the DelimitedPayloadTokenFilter, a custom TokenFilter to parse this string and convert it to equivalent term and payload pairs, so all we have to build on our own is our custom Analyzer. The IndexWriter will use this custom Analyzer for the "data" field in the JUnit test (see below).
// Source: src/main/java/com/mycompany/payload/MyPayloadAnalyzer.java
package com.mycompany.payload;

import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;
import org.apache.lucene.analysis.payloads.DelimitedPayloadTokenFilter;
import org.apache.lucene.analysis.payloads.FloatEncoder;

public class MyPayloadAnalyzer extends Analyzer {

  @Override
  public TokenStream tokenStream(String fieldName, Reader reader) {
    // Split on whitespace, then split each token at '$' into term and float payload.
    return new DelimitedPayloadTokenFilter(
      new WhitespaceTokenizer(reader),
      '$', new FloatEncoder());
  }
}
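The flattening step described above can be illustrated without Lucene. This is only a sketch under stated assumptions: the class DelimitedPayloadSketch and its methods are hypothetical, and it mimics just the string convention ("term$score" pairs separated by whitespace), not the token filter itself.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class DelimitedPayloadSketch {
    // Flattens a concept map into "key$score key$score ..." in iteration order.
    public static String flatten(Map<String, Float> conceptMap, char delimiter) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, Float> e : conceptMap.entrySet()) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(e.getKey()).append(delimiter).append(e.getValue());
        }
        return sb.toString();
    }

    // Parses the flattened form back into term -> score pairs, keeping input order.
    public static Map<String, Float> parse(String flattened, char delimiter) {
        Map<String, Float> pairs = new LinkedHashMap<>();
        for (String token : flattened.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue;
            }
            int idx = token.indexOf(delimiter);
            if (idx < 0) {
                // This sketch expects every token to carry a score.
                throw new IllegalArgumentException("No delimiter in token: " + token);
            }
            pairs.put(token.substring(0, idx), Float.parseFloat(token.substring(idx + 1)));
        }
        return pairs;
    }

    public static void main(String[] args) {
        Map<String, Float> conceptMap = new LinkedHashMap<>();
        conceptMap.put("p1", 123.0f);
        conceptMap.put("p2", 234.0f);
        String flattened = flatten(conceptMap, '$');
        System.out.println(flattened);
        System.out.println(parse(flattened, '$'));
    }
}
```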
Searching
On the search side, we create a custom Similarity implementation that reads the score from the payload and returns it. We will tell our searcher to use this Similarity implementation. We want to use only our concept scores, not make them part of the full Lucene score, so we indicate that when we create our PayloadTermQuery in our JUnit test.
// Source: src/main/java/com/mycompany/payload/MyPayloadSimilarity.java
package com.mycompany.payload;

import org.apache.lucene.analysis.payloads.PayloadHelper;
import org.apache.lucene.search.DefaultSimilarity;

public class MyPayloadSimilarity extends DefaultSimilarity {

  private static final long serialVersionUID = -2402909220013794848L;

  @Override
  public float scorePayload(int docId, String fieldName,
      int start, int end, byte[] payload, int offset, int length) {
    if (payload != null) {
      // The payload holds the concept score encoded as a float.
      return PayloadHelper.decodeFloat(payload, offset);
    } else {
      return 1.0F;
    }
  }
}
The actual search logic is in the JUnit test shown below. Here I build a small index with some dummy data in RAM and query it using a straight PayloadTermQuery and two Boolean queries with embedded PayloadTermQueries.
// Source: src/test/java/com/mycompany/payload/MyPayloadQueryTest.java
package com.mycompany.payload;

import org.apache.commons.lang.StringUtils;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.Field.Index;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.payloads.AveragePayloadFunction;
import org.apache.lucene.search.payloads.PayloadTermQuery;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.junit.AfterClass;
import org.junit.BeforeClass;
import org.junit.Test;

public class MyPayloadQueryTest {

  private static IndexSearcher searcher;

  private static String[] data = {
    "p1$123.0 p2$2.0 p3$89.0",
    "p2$91.0 p1$5.0",
    "p3$56.0 p1$25.0",
    "p4$98.0 p5$65.0 p1$33.0"
  };

  @BeforeClass
  public static void setupBeforeClass() throws Exception {
    Directory directory = new RAMDirectory();
    IndexWriter writer = new IndexWriter(directory,
      new MyPayloadAnalyzer(), IndexWriter.MaxFieldLength.UNLIMITED);
    for (int i = 0; i < data.length; i++) {
      Document doc = new Document();
      doc.add(new Field("title", "Document #" + i, Store.YES, Index.NO));
      doc.add(new Field("data", data[i], Store.YES, Index.ANALYZED));
      writer.addDocument(doc);
    }
    writer.close();
    searcher = new IndexSearcher(directory);
    searcher.setSimilarity(new MyPayloadSimilarity());
  }

  @AfterClass
  public static void teardownAfterClass() throws Exception {
    if (searcher != null) {
      searcher.close();
    }
  }

  @Test
  public void testSingleTerm() throws Exception {
    PayloadTermQuery p1Query = new PayloadTermQuery(
      new Term("data", "p1"), new AveragePayloadFunction(), false);
    search(p1Query);
  }

  @Test
  public void testAndQuery() throws Exception {
    PayloadTermQuery p1Query = new PayloadTermQuery(
      new Term("data", "p1"), new AveragePayloadFunction(), false);
    PayloadTermQuery p2Query = new PayloadTermQuery(
      new Term("data", "p2"), new AveragePayloadFunction(), false);
    BooleanQuery query = new BooleanQuery();
    query.add(p1Query, Occur.MUST);
    query.add(p2Query, Occur.MUST);
    search(query);
  }

  @Test
  public void testOrQuery() throws Exception {
    PayloadTermQuery p1Query = new PayloadTermQuery(
      new Term("data", "p1"), new AveragePayloadFunction(), false);
    PayloadTermQuery p2Query = new PayloadTermQuery(
      new Term("data", "p2"), new AveragePayloadFunction(), false);
    BooleanQuery query = new BooleanQuery();
    query.add(p1Query, Occur.SHOULD);
    query.add(p2Query, Occur.SHOULD);
    search(query);
  }

  // Runs the query and prints title, raw data and score for the top 10 hits.
  private void search(Query query) throws Exception {
    System.out.println("=== Running query: " + query.toString() + " ===");
    ScoreDoc[] hits = searcher.search(query, 10).scoreDocs;
    for (int i = 0; i < hits.length; i++) {
      Document doc = searcher.doc(hits[i].doc);
      System.out.println(StringUtils.join(new String[] {
        doc.get("title"),
        doc.get("data"),
        String.valueOf(hits[i].score)
      }, " "));
    }
  }
}
The three tests (annotated with @Test) cover the basic use cases that I expect for this search - a single term search, an AND term search and an OR term search. The last two are done by embedding the individual PayloadTermQuery objects into a BooleanQuery. As you can see from the results below, this works quite nicely. This is good news for me, since based on my reading of the LIA2 book, I had (wrongly) concluded that Payloads can only be used with SpanQuery, and that you need special "payload aware" subclasses of SpanQuery to be able to use them (which is true in case of SpanQuery, BTW).
=== Running query: data:p1 ===
Document #0 p1$123.0 p2$2.0 p3$89.0 123.0
Document #3 p4$98.0 p5$65.0 p1$33.0 33.0
Document #2 p3$56.0 p1$25.0 25.0
Document #1 p2$91.0 p1$5.0 5.0
=== Running query: +data:p1 +data:p2 ===
Document #0 p1$123.0 p2$2.0 p3$89.0 125.0
Document #1 p2$91.0 p1$5.0 96.0
=== Running query: data:p1 data:p2 ===
Document #0 p1$123.0 p2$2.0 p3$89.0 125.0
Document #1 p2$91.0 p1$5.0 96.0
Document #3 p4$98.0 p5$65.0 p1$33.0 16.5
Document #2 p3$56.0 p1$25.0 12.5
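The scores in this output are consistent with simple arithmetic: each matching term contributes its payload, the boolean query sums the contributions, and the sum is scaled by a coord factor (matching clauses divided by total clauses). That reading is an assumption drawn from the numbers, not something the post states, and the class below is a hypothetical, Lucene-free sketch of it.

```java
public class BooleanPayloadScoreSketch {
    // payloads: one entry per query clause; null means the term is absent from the document.
    public static float score(Float... payloads) {
        float sum = 0f;
        int matched = 0;
        for (Float p : payloads) {
            if (p != null) {
                sum += p;
                matched++;
            }
        }
        if (matched == 0) {
            return 0f;
        }
        // coord: fraction of query clauses that matched the document.
        float coord = (float) matched / payloads.length;
        return sum * coord;
    }

    public static void main(String[] args) {
        System.out.println(score(123.0f, 2.0f));  // Document #0: both terms match, 125.0
        System.out.println(score(5.0f, 91.0f));   // Document #1: both terms match, 96.0
        System.out.println(score(33.0f, null));   // Document #3: only p1 matches, 16.5
        System.out.println(score(25.0f, null));   // Document #2: only p1 matches, 12.5
    }
}
```

Under this reading, the OR query halves the score of documents that contain only one of the two terms, which is why Document #3 scores 16.5 rather than 33.0.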
Performance
I also read (on the web, can't find the link now) that Payload queries are usually slower than their non-payload aware counterparts, so I decided to do a quick back-of-the-envelope calculation to see what sort of degradation to expect. I took an existing index containing approximately 30K documents, and its associated (denormalized) concept index, and merged the two into a single new index with the concept map flattened into the document as described above. I ran 5 concept queries, first as a TermQuery (with a custom sort on the concept score field) and then as a PayloadTermQuery, 5 times each, discarding the first query (to eliminate cache warmup overheads), and averaged the wallclock elapsed times for each query. Here are the results:
Query Term | #-results | TermQuery (ms) | PayloadTermQuery (ms)
2800541 | 46 | 0.25 | 1.5
2790981 | 39 | 0.25 | 1.75
5348177 | 50 | 0.75 | 7.0
2793084 | 50 | 0.5 | 1.75
2800232 | 50 | 0.5 | 0.75
So it appears that on average (excluding outliers), PayloadTermQuery calls are approximately 3-5 times slower than equivalent TermQuery calls. But they do offer a smaller disk (and consequently OS cache) footprint and a simpler programming model, so it remains to be seen if this makes sense for us to use.
Update: 2010-10-11
The situation changes when you factor in the actual document retrieval (i.e., page through the ScoreDoc array and get the Documents from the searcher using searcher.doc(ScoreDoc.doc)). It appears that the PayloadTermQuery approach is consistently faster, but not significantly so.
Query Term | #-results | TermQuery (ms) | PayloadTermQuery (ms)
2800541 | 46 | 12.5 | 9.25
2790981 | 39 | 10.0 | 6.75
5348177 | 50 | 9.25 | 9.0
2793084 | 50 | 6.5 | 6.0
2800232 | 50 | 5.75 | 4.5
Elastic/Elasticsearch 2013. 7. 15. 23:12
http://elasticsearch-users.115913.n3.nabble.com/custom-similarity-setting-does-not-work-with-version-0-20-2-td4029500.html
custom similarity setting does not work with version 0.20.2
Mustafa Sener:
Hi, I tried the following configurations for my custom similarity provider but none of them worked with version 0.20. Can anyone give me some information about this setting for version 0.20.2? Any sample usages will be enough for me.
curl -XPOST 'http://host:port/tweeter/' -d '
{
  "settings": {
    "similarity": {
      "index": {
        "type": "org.elasticsearch.index.similarity.CustomSimilarityProvider"
      },
      "search": {
        "type": "org.elasticsearch.index.similarity.CustomSimilarityProvider"
      }
    }
  }
}'
index.similarity.index.type index.similarity.search.type
None of these works.
Thanks...
-- Mustafa Sener www.ifountain.com
-- You received this message because you are subscribed to the Google Groups "elasticsearch" group. To unsubscribe from this group and stop receiving emails from it, send an email to [hidden email]. For more options, visit https://groups.google.com/groups/opt_out.
benjamin leviant:
Hello,
you should try this command :
curl -XPOST 'http://host:port/tweeter/' -d '
{
  "settings": {
    "index": {
      "similarity": {
        "index": {
          "type": "org.elasticsearch.index.similarity.CustomSimilarityProvider"
        },
        "search": {
          "type": "org.elasticsearch.index.similarity.CustomSimilarityProvider"
        }
      }
    }
  }
}'
I hope it can help you. Regards
Benjamin
Mustafa Sener:
That does not work either. I tried it before.
-- Mustafa Sener
benjamin leviant:
Hi,
What is your error message?
How do you get your custom class loaded?
Regards
Benjamin
Mustafa Sener:
I have no error messages. It seems that this setting was different before version 0.20. Since there is no documentation about this, I cannot configure it properly.
-- Mustafa Sener
Mustafa Sener:
I use a FloatingPayload filter to create payloads for each term which is separated by '|' character (school|5.8). Then I use the following similarity: https://gist.github.com/anonymous/4738207
-- Mustafa Sener
benjamin leviant:
Your implementation looks good.
But to get payloads working in elasticsearch, a custom similarity is not enough.
You also need to implement several custom elements :
- a token filter : to index payload values
- a query parser : to score using payload values
Do you have all these elements working?
Mustafa Sener:
I have a token filter, but not a query parser. I use the more like this query directly. I will not use token payloads in the query; I want to use the token payloads stored by the filter at index time. Do I still need a query parser even if I use the more like this query?
-- Mustafa Sener
benjamin leviant:
It is not working because the "more like this" query does not call the scorePayload method to score the results.
To use payloads, you need a custom query parser with a query whose scoring method takes payloads into account.
You can see the documentation about lucene PayloadTermQuery and PayloadNearQuery.
Regards
Benjamin
Mustafa Sener:
Ok, Thanks for your help. It was very beneficial for me. Regards...
-- Mustafa Sener
77 posts | The best option here to use morelikethis query is modify it to take payloads into account. I think I can do this by sub-classing payload query and multiply tf value used by query by payload of that term. What do you think about this design? Actually I need a categorizer which when I enter a text will return best matching categories based on predefined terms and payloads for categories. I select more like this query for this purpose. Payloads are important because I will assign negative payloads to negative samples or terms. On Fri, Feb 8, 2013 at 3:32 PM, Mustafa Sener <[hidden email]> wrote: Ok, Thanks for your help. It was very beneficial for me.
Regards...On Fri, Feb 8, 2013 at 3:20 PM, benjamin leviant <[hidden email]> wrote: It is not working, because the "more like this" query do not call the scorePayload method to score the results.
To use payloads, you need a custom query parser with a query having a scoring method that take in account payloads.
You can see documentation about lucene PayloadTermQuery and PayloadNearQuery.
Regards
Benjamin
-- Mustafa Sener
-- Mustafa Sener
-- You received this message because you are subscribed to the Google Groups "elasticsearch" group. To unsubscribe from this group and stop receiving emails from it, send an email to [hidden email]. For more options, visit https://groups.google.com/groups/opt_out. |
13 posts | Yes, you can try subclassing the more-like-this query to customize its scoring logic.
Sorry, but I cannot confirm that it will work.
Please give us an update on this.
Regards
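The design discussed in this thread (multiplying the tf factor by the payload of the term) can be sketched in plain Java without Lucene. This is only an illustration under stated assumptions: the class and method names below are hypothetical and are not Lucene or Elasticsearch APIs, the payload is assumed to hold a single float in 4 bytes, and tf is assumed to be sqrt(freq). Whether this can be wired into the more-like-this query is not confirmed by the thread.

```java
import java.nio.ByteBuffer;

// Hypothetical illustration; none of these names are Lucene or Elasticsearch APIs.
class PayloadWeightedTfSketch {

    // Assumption: a payload stores one float in 4 bytes (ByteBuffer defaults to big-endian).
    static byte[] encodePayload(float weight) {
        return ByteBuffer.allocate(4).putFloat(weight).array();
    }

    static float decodePayload(byte[] payload) {
        return ByteBuffer.wrap(payload).getFloat();
    }

    // tf assumed to be sqrt(freq), scaled by the term's payload.
    // A negative payload gives a negative contribution for negative sample terms.
    static float payloadWeightedTf(int freq, byte[] payload) {
        return (float) Math.sqrt(freq) * decodePayload(payload);
    }

    public static void main(String[] args) {
        byte[] positive = encodePayload(2.0f);
        byte[] negative = encodePayload(-1.5f);
        System.out.println("positive term: " + payloadWeightedTf(4, positive));
        System.out.println("negative term: " + payloadWeightedTf(4, negative));
    }
}
```

With a scheme like this, a term marked as a negative sample pulls a category's score down instead of up, which is the behavior the categorizer described above relies on.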
Elastic/Elasticsearch 2013. 7. 12. 17:29
Original URL : http://www.lucenetutorial.com/advanced-topics/scoring.html
Lucene Scoring
The authoritative document for scoring is found on the Lucene site here. Read that first. Lucene implements a variant of the TfIdf scoring model. That is documented here. The factors involved in Lucene's scoring algorithm are as follows:
- tf = term frequency in document = measure of how often a term appears in the document
- idf = inverse document frequency = measure of how often the term appears across the index
- coord = number of terms in the query that were found in the document
- lengthNorm = measure of the importance of a term according to the total number of terms in the field
- queryNorm = normalization factor so that queries can be compared
- boost (index) = boost of the field at index-time
- boost (query) = boost of the field at query-time
The implementation, implications and rationale of the first four factors (tf, idf, coord and lengthNorm) in DefaultSimilarity.java, which is what you get if you don't explicitly specify a similarity, are:
note: the implication of these factors should be read as, "Everything else being equal, ... [implication]"
1. tf
Implementation: sqrt(freq)
Implication: the more frequently a term occurs in a document, the greater its score
Rationale: documents which contain more of a term are generally more relevant
2. idf
Implementation: log(numDocs/(docFreq+1)) + 1
Implication: the greater the occurrence of a term in different documents, the lower its score
Rationale: common terms are less important than uncommon ones
3. coord
Implementation: overlap / maxOverlap
Implication: of the terms in the query, a document that contains more terms will have a higher score
Rationale: self-explanatory
4. lengthNorm
Implementation: 1/sqrt(numTerms)
Implication: a term matched in a field with fewer terms has a higher score
Rationale: a term in a field with fewer terms is more important than one in a field with more
queryNorm is not related to the relevance of the document, but rather tries to make scores between different queries comparable. It is implemented as 1/sqrt(sumOfSquaredWeights)
So, in summary (quoting Mark Harwood from the mailing list),
* Documents containing *all* the search terms are good
* Matches on rare words are better than for common words
* Long documents are not as good as short ones
* Documents which mention the search terms many times are good
The mathematical definition of the scoring can be found at http://lucene.apache.org/core/4_0_0/core/org/apache/lucene/search/similarities/Similarity.html
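The factor formulas listed above can be restated as a small plain-Java sketch. It is not the Lucene code itself: the class name is hypothetical, the methods only mirror the formulas given in this article, and the document counts in main are made-up numbers chosen to show the direction of each effect.

```java
// Plain-Java restatement of the formulas in this article; not the Lucene classes themselves.
class ScoringFactorsSketch {

    // tf: sqrt(freq)
    static float tf(float freq) {
        return (float) Math.sqrt(freq);
    }

    // idf: log(numDocs / (docFreq + 1)) + 1
    static float idf(long docFreq, long numDocs) {
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }

    // coord: overlap / maxOverlap
    static float coord(int overlap, int maxOverlap) {
        return overlap / (float) maxOverlap;
    }

    // lengthNorm: 1 / sqrt(numTerms)
    static float lengthNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    // queryNorm: 1 / sqrt(sumOfSquaredWeights)
    static float queryNorm(float sumOfSquaredWeights) {
        return (float) (1.0 / Math.sqrt(sumOfSquaredWeights));
    }

    public static void main(String[] args) {
        // Hypothetical index of 1000 documents.
        System.out.println("idf, rare term (docFreq=9):     " + idf(9, 1000));
        System.out.println("idf, common term (docFreq=499): " + idf(499, 1000));
        System.out.println("lengthNorm, 4-term field:       " + lengthNorm(4));
        System.out.println("lengthNorm, 100-term field:     " + lengthNorm(100));
        System.out.println("coord, 2 of 4 query terms:      " + coord(2, 4));
    }
}
```

Running it shows the summary points numerically: the rare term gets the larger idf, the shorter field gets the larger lengthNorm, and matching more query terms raises coord.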
Hint: look at NutchSimilarity in Nutch to see an example of how web pages can be scored for relevance
Customizing scoring
It's easy to customize the scoring algorithm. Subclass DefaultSimilarity and override the method you want to customize. For example, if you want to ignore how common a term is across the index,
    // Signatures as in the Lucene 3.x API (an assumption about the version targeted here);
    // in Lucene 4.x these methods take different parameters.
    Similarity sim = new DefaultSimilarity() {
        // Treat every term as equally rare: idf is always 1
        public float idf(int docFreq, int numDocs) {
            return 1;
        }
    };
and if you think that, for the title field, more terms is better,
    Similarity sim = new DefaultSimilarity() {
        // Reward longer titles; every other field keeps the default norm
        public float lengthNorm(String field, int numTerms) {
            if (field.equals("title")) return (float) (0.1 * Math.log(numTerms));
            else return super.lengthNorm(field, numTerms);
        }
    };
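To see what these two overrides do numerically without Lucene on the classpath, here is a minimal self-contained sketch. BaseSimilarity is a hypothetical stand-in that only copies the default formulas from this article; it is not the real DefaultSimilarity, and the other names are made up for illustration.

```java
// Stand-in model of the override pattern; BaseSimilarity is hypothetical, not a Lucene class.
class CustomSimilaritySketch {

    static class BaseSimilarity {
        public float idf(int docFreq, int numDocs) {
            return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
        }

        public float lengthNorm(String field, int numTerms) {
            return (float) (1.0 / Math.sqrt(numTerms));
        }
    }

    // Same two customizations as in the article, combined in one anonymous subclass.
    static BaseSimilarity custom() {
        return new BaseSimilarity() {
            @Override
            public float idf(int docFreq, int numDocs) {
                return 1; // ignore how common the term is
            }

            @Override
            public float lengthNorm(String field, int numTerms) {
                if ("title".equals(field)) {
                    return (float) (0.1 * Math.log(numTerms)); // longer titles score higher
                }
                return super.lengthNorm(field, numTerms);
            }
        };
    }

    public static void main(String[] args) {
        BaseSimilarity base = new BaseSimilarity();
        BaseSimilarity sim = custom();
        System.out.println("default title norm, 10 vs 100 terms: "
                + base.lengthNorm("title", 10) + " vs " + base.lengthNorm("title", 100));
        System.out.println("custom title norm, 10 vs 100 terms:  "
                + sim.lengthNorm("title", 10) + " vs " + sim.lengthNorm("title", 100));
        System.out.println("custom idf: " + sim.idf(9, 1000));
    }
}
```

One consequence worth noting: with 0.1 * log(numTerms), a title consisting of a single term gets a norm of 0, so a real implementation would probably want a floor value.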
Elastic/Elasticsearch 2013. 7. 12. 15:07
You cannot get a document score for the nested type. -_-;; That does not mean we can leave it unsolved, though. Hint: nested
Elastic/Elasticsearch 2013. 7. 1. 20:23
This post explains a parameter related to search query performance when using the elasticsearch REST API. Whereas the timeout covered last time limits the execution time of collecting, this setting controls the threads that handle querying and collecting those documents.
/_search?operation_threading=threadPerShard
As you can see, it is easy to apply. If this option is not set, the search internally runs single-threaded by default. In that case, performance inevitably degrades as the number of requests grows. Be sure to check this option before use.
Below is how to set it in the Java API.
.setOperationThreading(SearchOperationThreading.THREAD_PER_SHARD)
These are the related source files.
RestSearchAction.java
SearchOperationThreading.java
There is also search_type, but we will look at that later.
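As a small illustration of the REST form above, the following plain-Java sketch only builds the request URL with the operation_threading parameter attached; it does not send anything. The host localhost:9200, the index name myindex, and the class and method names are assumptions made for this example.

```java
// Builds a search URL carrying the operation_threading parameter; sends no request.
class OperationThreadingUrlSketch {

    // host and index are placeholders supplied by the caller.
    static String searchUrl(String host, String index, String operationThreading) {
        return "http://" + host + "/" + index
                + "/_search?operation_threading=" + operationThreading;
    }

    public static void main(String[] args) {
        // Assumption: a local node on port 9200 and a hypothetical index named "myindex".
        System.out.println(searchUrl("localhost:9200", "myindex", "threadPerShard"));
    }
}
```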
|