Posts tagged 'lucene' (Page 4)

[Elasticsearch] Lucene Arirang Analyzer Plugin for Elasticsearch 5.0.1

Elastic/Elasticsearch 2016. 11. 24. 19:02

First, here is the built plugin zip file.

I'll post the details of the work on GitHub later.

Between projects and operations work I've been swamped lately, so I barely found the time for this.

elasticsearch-analysis-arirang-5.0.1.zip

Installation)

$ bin/elasticsearch-plugin install --verbose file:///elasticsearch-analysis-arirang/target/elasticsearch-analysis-arirang-5.0.1.zip

Installation log)

-> Downloading file:///elasticsearch-analysis-arirang-5.0.1.zip

Retrieving zip from file:///elasticsearch-analysis-arirang-5.0.1.zip

[=================================================] 100%

- Plugin information:

Name: analysis-arirang

Description: Arirang plugin

Version: 5.0.1

* Classname: org.elasticsearch.plugin.analysis.arirang.AnalysisArirangPlugin

-> Installed analysis-arirang

Elasticsearch startup log)

$ bin/elasticsearch

[2016-11-24T18:49:09,922][INFO ][o.e.n.Node ] [] initializing ...

[2016-11-24T18:49:10,083][INFO ][o.e.e.NodeEnvironment ] [aDGu2B9] using [1] data paths, mounts [[/ (/dev/disk1)]], net usable_space [733.1gb], net total_space [930.3gb], spins? [unknown], types [hfs]

[2016-11-24T18:49:10,084][INFO ][o.e.e.NodeEnvironment ] [aDGu2B9] heap size [1.9gb], compressed ordinary object pointers [true]

[2016-11-24T18:49:10,085][INFO ][o.e.n.Node ] [aDGu2B9] node name [aDGu2B9] derived from node ID; set [node.name] to override

[2016-11-24T18:49:10,087][INFO ][o.e.n.Node ] [aDGu2B9] version[5.0.1], pid[56878], build[080bb47/2016-11-11T22:08:49.812Z], OS[Mac OS X/10.12.1/x86_64], JVM[Oracle Corporation/Java HotSpot(TM) 64-Bit Server VM/1.8.0_72/25.72-b15]

[2016-11-24T18:49:11,335][INFO ][o.e.p.PluginsService ] [aDGu2B9] loaded module [aggs-matrix-stats]

[2016-11-24T18:49:11,335][INFO ][o.e.p.PluginsService ] [aDGu2B9] loaded module [ingest-common]

[2016-11-24T18:49:11,335][INFO ][o.e.p.PluginsService ] [aDGu2B9] loaded module [lang-expression]

[2016-11-24T18:49:11,335][INFO ][o.e.p.PluginsService ] [aDGu2B9] loaded module [lang-groovy]

[2016-11-24T18:49:11,335][INFO ][o.e.p.PluginsService ] [aDGu2B9] loaded module [lang-mustache]

[2016-11-24T18:49:11,336][INFO ][o.e.p.PluginsService ] [aDGu2B9] loaded module [lang-painless]

[2016-11-24T18:49:11,336][INFO ][o.e.p.PluginsService ] [aDGu2B9] loaded module [percolator]

[2016-11-24T18:49:11,336][INFO ][o.e.p.PluginsService ] [aDGu2B9] loaded module [reindex]

[2016-11-24T18:49:11,336][INFO ][o.e.p.PluginsService ] [aDGu2B9] loaded module [transport-netty3]

[2016-11-24T18:49:11,336][INFO ][o.e.p.PluginsService ] [aDGu2B9] loaded module [transport-netty4]

[2016-11-24T18:49:11,336][INFO ][o.e.p.PluginsService ] [aDGu2B9] loaded plugin [analysis-arirang]

[2016-11-24T18:49:14,151][INFO ][o.e.n.Node ] [aDGu2B9] initialized

[2016-11-24T18:49:14,151][INFO ][o.e.n.Node ] [aDGu2B9] starting ...

[2016-11-24T18:49:14,377][INFO ][o.e.t.TransportService ] [aDGu2B9] publish_address {127.0.0.1:9300}, bound_addresses {[fe80::1]:9300}, {[::1]:9300}, {127.0.0.1:9300}

[2016-11-24T18:49:17,511][INFO ][o.e.c.s.ClusterService ] [aDGu2B9] new_master {aDGu2B9}{aDGu2B9mQ8KkWCe3fnqeMw}{_y9RzyKGSvqYAFcv99HBXg}{127.0.0.1}{127.0.0.1:9300}, reason: zen-disco-elected-as-master ([0] nodes joined)

[2016-11-24T18:49:17,584][INFO ][o.e.g.GatewayService ] [aDGu2B9] recovered [0] indices into cluster_state

[2016-11-24T18:49:17,588][INFO ][o.e.h.HttpServer ] [aDGu2B9] publish_address {127.0.0.1:9200}, bound_addresses {[fe80::1]:9200}, {[::1]:9200}, {127.0.0.1:9200}

[2016-11-24T18:49:17,588][INFO ][o.e.n.Node ] [aDGu2B9] started

Running Korean morphological analysis)

$ curl -X POST -H "Cache-Control: no-cache" -H "Postman-Token: 6d392d83-5816-71ad-556b-5cd6f92af634" -d '{
  "analyzer" : "arirang_analyzer",
  "text" : "[한국] 엘라스틱서치 사용자 그룹의 HENRY 입니다."
}' "http://localhost:9200/_analyze"

Morphological analysis result)

{
  "tokens": [
    { "token": "[", "start_offset": 0, "end_offset": 1, "type": "symbol", "position": 0 },
    { "token": "한국", "start_offset": 1, "end_offset": 3, "type": "korean", "position": 1 },
    { "token": "]", "start_offset": 3, "end_offset": 4, "type": "symbol", "position": 2 },
    { "token": "엘라스틱서치", "start_offset": 5, "end_offset": 11, "type": "korean", "position": 3 },
    { "token": "엘라", "start_offset": 5, "end_offset": 7, "type": "korean", "position": 3 },
    { "token": "스틱", "start_offset": 7, "end_offset": 9, "type": "korean", "position": 4 },
    { "token": "서치", "start_offset": 9, "end_offset": 11, "type": "korean", "position": 5 },
    { "token": "사용자", "start_offset": 12, "end_offset": 15, "type": "korean", "position": 6 },
    { "token": "그룹", "start_offset": 16, "end_offset": 18, "type": "korean", "position": 7 },
    { "token": "henry", "start_offset": 20, "end_offset": 25, "type": "word", "position": 8 },
    { "token": "입니다", "start_offset": 26, "end_offset": 29, "type": "korean", "position": 9 }
  ]
}

Attribution, NonCommercial, NoDerivatives

[Elasticsearch] This Week in Elasticsearch and Apache Lucene - 2016-06-27

Elastic/Elasticsearch 2016. 6. 28. 09:53

A few items caught my eye, so I'm clipping them here.

[Original]

https://www.elastic.co/blog/this-week-in-elasticsearch-and-apache-lucene-2016-06-27

[Key points]

- low-level Java REST client has landed.

It looks like we can use the client ES provides instead of building one on top of a separate HTTP client.

- index.store.preload

This appears to replace the warmer feature.

- no longer turns red when creating an index

The status used to flash red for a moment; this should cut down on false alarms.

- default similarity is now BM25

So the default moves from TF/IDF to BM25.

- wait for status yellow

Yellow also shows up from time to time, so I'll need to re-check how we handle status going forward.
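
To make the TF/IDF to BM25 note above a bit more concrete, here is a minimal sketch of the BM25 per-term formula with the commonly used defaults k1 = 1.2 and b = 0.75. The class and method names are made up for illustration, and the sketch ignores details a real implementation adds (norm encoding, boosts), so treat it as an approximation rather than Lucene's code. The classic term-frequency factor, sqrt(freq), is included for comparison.

```java
final class Bm25Sketch {
    static final double K1 = 1.2;  // term-frequency saturation (assumed default)
    static final double B = 0.75;  // strength of length normalization (assumed default)

    /** Inverse document frequency: rarer terms get a larger weight. */
    static double idf(long docFreq, long docCount) {
        return Math.log(1.0 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
    }

    /** BM25 term-frequency factor; it approaches K1 + 1 as freq grows. */
    static double tfNorm(double freq, double fieldLength, double avgFieldLength) {
        double lengthFactor = 1.0 - B + B * fieldLength / avgFieldLength;
        return freq * (K1 + 1.0) / (freq + K1 * lengthFactor);
    }

    /** Score contribution of a single term in a single document. */
    static double score(double freq, double fieldLength, double avgFieldLength,
                        long docFreq, long docCount) {
        return idf(docFreq, docCount) * tfNorm(freq, fieldLength, avgFieldLength);
    }

    /** Classic TF/IDF term-frequency factor, which keeps growing with freq. */
    static double classicTf(double freq) {
        return Math.sqrt(freq);
    }

    public static void main(String[] args) {
        // Same document length as the average, increasing term frequency.
        for (int freq : new int[] {1, 10, 100, 1000}) {
            System.out.printf("freq=%d bm25Tf=%.3f classicTf=%.3f%n",
                    freq, tfNorm(freq, 10, 10), classicTf(freq));
        }
    }
}
```

The practical difference is saturation: repeating a term many times keeps raising the classic factor, while the BM25 factor levels off below K1 + 1.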

Elasticsearch Core

Changes in 2.x:

The .scripts index now obeys the number_of_shards setting.
Deprecation logging for `_timestamp` and `_ttl`.
Failed synced flushes were reporting an incorrect number of failures.
The index-exists request shouldn't fail if the index is being recovered.
A valid translog file can be deleted incorrectly after a disk full exception and multiple attempts to recover.

Changes in master:

The low-level Java REST client has landed. It is functionally equivalent to the REST clients available in other languages.
The `index.store.preload` setting can preload the specified Lucene files (eg doc values, norms) into MMAP before a segment comes online. This completes the replacement of warmers.
The cluster health no longer turns red when creating an index, unless there is a problem assigning shards.
The default similarity is now BM25.
The `_timestamp` and `_ttl` fields will not be supported on indices created in 5.x.
The `fields` parameter has been removed in favour of `stored_fields`, `docvalue_fields` and (for `text` fields only) `fielddata_fields`.
Some percolator queries don't need in-memory validation to ensure that they match.
Painless now has capturing lambdas, supports adding static methods like `each` to whitelisted classes, has syntax for initialising arrays, lists and maps,
Nested inner hits no longer return _index, _type, and _id, and parent/child inner hits don't return _index.
`string` fields weren't upgraded to `text`/`keyword` if `include_in_all` was specified.
Getting a task with wait_for_completion will return the task result.
Nodes info returns the calculated size of the total indexing buffer.
Analysis factories are now MultiTermAware, which will help to remove the lowercase_expanded_terms from the query string query, and to support keyword analyzers on the `keyword` field.
JNA is now a required dependency.
Guice has been removed from the script service,

Ongoing changes:

Sequence number checkpoints are persisted to disk when a segment is flushed.
Reindex-from-remote now uses the Java REST client.
Ensure that primary handover while indexing does not cause a deadlock.
The index file which lists the snapshots in a repository should be written atomically.
The `discovery-azure` plugin doesn't work with the security manager.
It shouldn't be necessary to wait for status yellow before working with a newly created index.
Add helpers to make JSON easier to render in Mustache.
The SynonymQuery should be used for alternative terms, instead of the Bool query.
More time zone edge case bug fixes.
Changes to shard store fetching are required in order to allow for inline rerouting during node join.
Analysis components should implement AnalysisPlugin instead of calling registerTokenizer, allowing Guice to be removed from Hunspell.

Apache Lucene

5.5.2 RC2 release vote is underway
A tricky randomized explain test failure turns out to be a test bug in a recently added test case
Math.toRadians and Math.toDegrees are now banned, since their implementation changes slightly across java versions, impacting our geo tests
RandomAccessFilterStrategy comes back to life for faster filter intersection in some cases
Multi term queries that match no terms rewrite to MatchNoDocsQuery instead of an empty BooleanQuery, making it much simpler to add a helpful reason to MatchNoDocsQuery
The new Ukrainian lemmatizer uses MorfologikFilter with a custom dictionary for efficient dictionary-based Ukrainian analysis
Lucene's confusing and bushy IndexReader hierarchy strikes again
RAMDirectory now also enforces write-once files, and MockDirectoryWrapper now tries harder to corrupt unsync'd index files on close
GeoPoint gets some code cleanups
Eclipse now also fails on unused imports
Auto-prefix terms have been removed since dimensional points is better
CompressionTools has been removed
ForbiddenAPIs is upgraded to version 2.2
It's important to fsync files after copying them via Lucene's Directory!
A tricky test failure was holding up the 5.5.2 release process
Some minor code improvements to SearchGroup
Can we improve the default behavior of query parsers and multi-term queries?
A test bug in MoreLikeThisTest still remains tricky to fix
MoreLikeThis should not invoke toString on a Field object
ScandinavianFoldingFilterFactory and ScandinavianNormalizationFilterFactory are safe for multi-term queries
In the possibly not-rare case where many documents share the same point value, we can better compress the docIDs
The ancient query norm and coord blocks progress and should be removed
Should we add a light weight Ukrainian stemmer?
Updating doc values and then using delete-by-query with a doc values query doesn't always work, but fixing it is likely not feasible


[Elasticsearch] This Week in Elasticsearch and Apache Lucene - 2016-04-25

Elastic/Elasticsearch 2016. 4. 26. 15:48

Personally, the two items below are what stand out most in this week's update.

Thread local leaks when running in web containers have finally been fixed.
CamelCase support has been removed.

Original post)

https://www.elastic.co/blog/this-week-in-elasticsearch-and-apache-lucene-2016-04-25

Elasticsearch Core

Changes in 2.x:

The index name was missing from the search slowlog.
CamelCase is deprecated (and has deprecation logging).
MoreLikeThis now handles aliases correctly.

Changes in master:

The .percolator type has been replaced with the percolator field datatype.
Added a fingerprint token filter and fingerprint analyzer for duplicate detection.
TransportReplicationAction has been significantly refactored in order to make it unit testable.
RPM and Deb packages now set permissions explicitly, instead of relying on umasks.
Indexed scripts and templates are now stored in the cluster state, and are called "stored" scripts/templates.
Parameter names in ingest processors are now more consistent.
IP fields support range queries again.
readNamedWriteable and writeNamedWriteable are now public, and writable.readFrom is gone.
UUID generators moved out of Strings, to avoid spooky action at a distance.
The `action.realtime_get` setting has been removed.
Support for unquoted JSON keys can be allowed via a system property, for bwc purposes.
Cross-type mapping updates were not working for boolean fields.
Empty task IDs are now serialised in 1 byte, so that every task can have a parent ID.
Reindex child tasks weren't being marked as such.
Validation failures have been removed from the cluster health response.
Object fields now inherit their dynamic setting from their parent object or type.
Thread local leaks when running in web containers have finally been fixed.
Added a safeguard to protect against too-large rescore windows.
The elasticsearch-plugin script now prints the download URL of the plugin when in verbose mode, and has friendlier error messages.
The startup script now fails with an error code if the elasticsearch binary is not found or is not executable.
CamelCase support has been removed.
The ICU analyzer now accepts custom rule files.

Ongoing changes:

Dots in field names are now supported, but so far only if the parent fields already exist. Tests are being added to make sure supporting dots fully doesn't break anything.
Persistence of results of long running tasks.
A `minhash` token filter for estimating the Jaccard similarity coefficient between two docs.
Pipeline aggs are only needed on the coordinating node.
Adding searchable/aggregatable tags to fields in the field stats API.
Inner hits will no longer support the top-level syntax as the inline syntax has been improved.
It should be possible to pass include/exclude values to the terms aggs using the same format that was used to render bucket keys.
Deleted index tombstones are close to being merged.


[Elasticsearch] Preparing the Arirang Morphological Analyzer for the Lucene 6.0 Upgrade Ahead of Elastic Stack 5.0

Elastic/Elasticsearch 2016. 4. 26. 15:18

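Some preparation will be needed, so here is a short note for now.

Once Elastic Stack 5.0 is officially released, it will be based on Lucene 6.x.

The Arirang morphological analyzer has to be upgraded accordingly.

After bumping the version, I see an error in one place.

It looks like implementing a single method declared abstract is all that is needed.

Just implement the reflectWith(....) method in MorphemeAttributeImpl.java.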

@Override
public void reflectWith(AttributeReflector reflector) {
    reflector.reflect(MorphemeAttribute.class, "token", koreanToken);
}

I have not verified this code, so whether and how to use it is up to your own judgment.


[Elasticsearch] This Week in Elasticsearch and Apache Lucene - 2016-04-11

Elastic/Elasticsearch 2016. 4. 12. 09:59

I kept meaning to read this and am only now getting to it.

What caught my eye:

The `match`, `match_phrase`, and `match_phrase_prefix` queries are now separate queries, not just types of the `match` query.

The task manager response now tells you which tasks can be cancelled, and supports a `_cat/tasks` API.

Elasticsearch will no longer accept unquoted field names in JSON.

Now that we have removed the percolator API, we should also remove the percolator type and use percolator fields instead.

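The queries used to be separate, were then merged, and now appear to be split apart again.

I should test the task cancellation feature.

We'll need to be careful when writing field names from now on. JSON parsing has become stricter. ^^

- The flag in the code below changed from true to false; the line shows the earlier value. (Did this feature cause performance problems or other functional errors?)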

jsonFactory.configure(JsonParser.Feature.ALLOW_UNQUOTED_FIELD_NAMES, true);

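The percolator has moved to a field type. I should check how this works as well.

Judging from the filed issue, it looks more intuitive and a bit easier to use.

Most of what landed in core 2.x will probably be applied to v5.0.0 as well.

As for Lucene, 6.0.0 was in its release vote and was then released on April 8. Most of the remaining items concern geo points and location.

Highlights from the Lucene 6.0.0 release: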

Java 8 is the minimum Java version required.
Dimensional points, replacing legacy numeric fields, provides fast and space-efficient support for both single- and multi-dimension range and shape filtering. This includes numeric (int, float, long, double), InetAddress, BigInteger and binary range filtering, as well as geo-spatial shape search over indexed 2D LatLonPoints. See this blog post for details. Dependent classes and modules (e.g., MemoryIndex, Spatial Strategies, Join module) have been refactored to use new point types.
Lucene classification module now works on Lucene Documents using a KNearestNeighborClassifier or SimpleNaiveBayesClassifier.
The spatial module no longer depends on third-party libraries. Previous spatial classes have been moved to a new spatial-extras module.
Spatial4j has been updated to a new 0.6 version hosted by locationtech.
TermsQuery performance boost by a more aggressive default query caching policy.
IndexSearcher's default Similarity is now changed to BM25Similarity.
Easier method of defining custom CharTokenizer instances.

Original link)

https://www.elastic.co/blog/this-week-in-elasticsearch-and-apache-lucene-2016-04-11

Elasticsearch Core

Changes in 2.x:

Extended Stats could return the wrong result when some indices are missing a field.
Adding an object field with the same name as an existing field should fail.
Shadow replicas should be considered as having size zero.
CORS was broken for preflight requests.
Windows users can configure the Windows service name, description, and user.
Network addresses are now consistently displayed as the ip:port, instead of the hostname.

Changes in master:

Network partitions will no longer cause loss of in flight documents, and we have the test to prove it.
The `match`, `match_phrase`, and `match_phrase_prefix` queries are now separate queries, not just types of the `match` query.
The task manager response now tells you which tasks can be cancelled, and supports a `_cat/tasks` API.
Elasticsearch will no longer accept unquoted field names in JSON.
Elasticsearch now uses mmapfs for Lucene directories instead of a hybrid of niofs/mmapfs.
ParseField is now used to parse query names, which comes with deprecation logging for free.
Geo-points support ignore_malformed correctly again.
Moving averages threw an NPE when no window was specified.
MappedFieldType should be responsible for knowing which formatter applies, rather than the agg framework.
The allocation-explain API now includes the configured allocation_delay and remaining_delays times.
Hot threads now fail hard if the JVM doesn't support them.
Queries now have a registry, and queries have gradually been migrated to use it.

Ongoing changes:

Bulk request sizes will be subject to a circuit breaker.
Deleted index tombstones are complicated.
ObjectParser should allow constructor args.
Should we enable http compression by default?
Numeric and date fields in 5.0 should use the new Lucene points API.
Now that we have removed the percolator API, we should also remove the percolator type and use percolator fields instead.
Improvements to how we score the _all field based on per-field boosts.

Apache Lucene

The 6.0.0 release vote has passed and the bits were set free a few hours ago! Thank you Nick Knize for taking on the challenging role of release manager!
Many geo3d improvements this week:
- Polygon queries now accept Polygon... inputs, including random nested test polygons, matching our geo2d implementations and respecting the order of polygon vertices
- Geo3d seems to sometimes incorrectly think a polygon is concave when it's really convex
- Adjacent polygon points can now be coplanar
- The unique GeoPath support, which matches all points within X distance of a specified path (think road trip, looking for sushi nearby), now has a simple factory API as well
- Tests were not adequately testing the new simple factory methods for common shapes
- Geo3d now uses a similar encode/decode quantization approach as LatLonPoint
- After lively discussions, geo3d APIs no longer publicly expose classes and methods that could safely be private. APIs should start life private until proven worthy of being public!
Many geo2d improvements as well:
- LatLonPoint Polygon queries are faster using a cool pixelating grid approach, and we can do the same for GeoPointField
- We must improve debuggability of our geo test failures with nice 3D earth models like this example
- Here's a lively discussion about the pros and cons of having our geo tests quantize data only once
- Quantization issues are tricky, and geo2d queries were quantizing the edges of box queries incorrectly, resulting in false positive hits
- We have improved the geo2d tests to never allow "tolerance" on the returned results
- We have moved common geo encoding APIs to core so they can be shared across implementations
- Better random latitude/longitude generation for tests has exposed a tie-break bug in distance sorting, edge case bugs in box query, test bugs and polygon bugs
- Rectangle and Polygon classes have graduated into Lucene's core, to enable sharing across our numerous geo implementations
- A new encoding for GeoPointField will be consistent with LatLonPoint, and use all 64 available bits to minimize quantization error
- GeoPointField gets an efficient distance sort
- Randomized tests tried to create a too-big GeoPointDistanceQuery
- We will move BaseGeoPointTestCase from the spatial module to test-framework allowing us to remove the dependency of the sandbox module on spatial
- SloppyMath.haversin can now move to GeoUtils
The classification module now computes the f1-measure
A previously commented out test assertion comes half way back to life
Our "getting started with Lucene" docs were a bit buggy, but now fixed thanks to a user asking about it
We've upgraded our randomizedtesting dependency to 2.3.4, so we get better messages when there is a static leak in our tests
Points were missing from the codecs package documentation
The DataSplitter in Lucene's classification module should pay attention to classes when splitting
800+ new top-level-domains have been created since we last fixed StandardTokenizer to detect them, but we may need to wait for a JFlex release
Highlighting fails to find terms inside the child query of a BlockJoinQuery
Lucene doesn't have direct support for boolean subset matching, but a number of possible workarounds may work
Math.toRadians is changing its results slightly between Java 1.8 and 1.9
NRTCachingDirectory.listAll sometimes throws IllegalStateException
A scary random test failure is hopefully caused by bad hardware or buggy JVM
TestCoreParser gets some small improvements
A possibly new JVM bug causes JVM crash when decoding postings
JapaneseTokenizer should do a better job validating custom user-provided dictionaries
Another iteration for codec level encryption; this patch uses a new initialization vector for each data block, and seems not to impact search performance
Our release scripts still struggle with the switch from Subversion to git
Sometimes, BooleanQuery's explain method can lie about its score
Another user falls into the unfortunately common trap of thinking Lucene's stored fields store all information about a field


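[Lucene] On Offsets in TermVector Information

ITWeb/General Search 2016. 3. 30. 17:33

Even things I know are getting hazy these days, so I'm writing this down again.

While teaching basic Lucene theory in an in-house training session, I was explaining start offset and end offset.

Someone asked why the end offset is 1 greater than the index of the token's last character in the text.

I knew the answer but had to give a quick explanation and move on, so I said it is probably set that way for the highlight feature. Today I went back through the documentation and the source code.

Quoted from Lucene in Action)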

The start offset is the character position in the original text where the token text begins, and the end offset is the position just after the last character of the token text.

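The quote above documents why the end offset is 1 greater: it points just past the last character of the token.

To understand why it was designed this way, though, we need to look at how offsets are processed internally.

Since this concerns highlighting, it is enough to look at the classes involved and at the fragment-handling logic.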

protected String makeFragment( StringBuilder buffer, int[] index, Field[] values, WeightedFragInfo fragInfo,
    String[] preTags, String[] postTags, Encoder encoder ){
  StringBuilder fragment = new StringBuilder();
  final int s = fragInfo.getStartOffset();
  int[] modifiedStartOffset = { s };
  String src = getFragmentSourceMSO( buffer, index, values, s, fragInfo.getEndOffset(), modifiedStartOffset );
  int srcIndex = 0;
  for( SubInfo subInfo : fragInfo.getSubInfos() ){
    for( Toffs to : subInfo.getTermsOffsets() ){
      fragment
        .append( encoder.encodeText( src.substring( srcIndex, to.getStartOffset() - modifiedStartOffset[0] ) ) )
        .append( getPreTag( preTags, subInfo.getSeqnum() ) )
        .append( encoder.encodeText( src.substring( to.getStartOffset() - modifiedStartOffset[0],
          to.getEndOffset() - modifiedStartOffset[0] ) ) )
        .append( getPostTag( postTags, subInfo.getSeqnum() ) );
      srcIndex = to.getEndOffset() - modifiedStartOffset[0];
    }
  }
  fragment.append( encoder.encodeText( src.substring( srcIndex ) ) );
  return fragment.toString();
}

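The code makes it clear.

It relies on String.substring(inclusive begin index, exclusive end index), so the end offset has to be 1 greater than the index of the last character.

Seen another way, it is a convenient way to carry both the offset and the text length in the offsets alone (length = end offset - start offset).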
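
As a small self-contained illustration of the point above (not Lucene code; the class and method names are made up for this sketch), the exclusive end offset can be passed straight to String.substring, and the token length is just the difference between the two offsets:

```java
final class OffsetHighlight {
    /** Wraps text[startOffset, endOffset) in the given tags. */
    static String highlight(String text, int startOffset, int endOffset,
                            String preTag, String postTag) {
        return text.substring(0, startOffset)
                + preTag
                + text.substring(startOffset, endOffset) // end index is exclusive
                + postTag
                + text.substring(endOffset);
    }

    /** The token length falls out of the two offsets directly. */
    static int tokenLength(int startOffset, int endOffset) {
        return endOffset - startOffset;
    }

    public static void main(String[] args) {
        String text = "hello lucene";
        // "lucene" occupies indices 6..11, so startOffset = 6 and endOffset = 12.
        System.out.println(highlight(text, 6, 12, "<b>", "</b>")); // hello <b>lucene</b>
        System.out.println(tokenLength(6, 12));                    // 6
    }
}
```

If the end offset pointed at the last character instead, every caller would have to add 1 before slicing.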


[Elasticsearch] Introduction to Timeout.

Elastic/Elasticsearch 2016. 3. 23. 11:39

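This is less an introduction to timeouts than a second look at something I last studied a long time ago.

I last read this code in 2013, on version 0.90, so this summary is based on 2.2.0.

Reference link)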

https://www.elastic.co/guide/en/elasticsearch/guide/current/_search_options.html#_timeout_2

Original snippet)

By default, the coordinating node waits to receive a response from all shards. If one node is having trouble, it could slow down the response to all search requests.

Reference classes)

TransportService.java

SearchService.java

SearchRequestBuilder.java

Not much has changed since then.

The first timeout applies to the search operation on each shard.

As you know, when a search request is sent, one thread per shard performs the action. This timeout applies to each of those threads individually.

The second timeout is on the search coordinator node. That is, it bounds the wait until data has been received from all shards.
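
The two levels described above can be sketched with plain java.util.concurrent. This is not how Elasticsearch implements them; every name below is hypothetical, and the shard-level budget is counted in simulated cost units rather than wall-clock time so the behaviour is easy to follow. A shard that exceeds its own budget returns a partial result, while the coordinator drops any shard that has not answered before its overall deadline.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

final class TimeoutSketch {
    /** Result of one shard; timedOut means the shard stopped early with partial hits. */
    static final class ShardResult {
        final int shardId;
        final int hits;
        final boolean timedOut;

        ShardResult(int shardId, int hits, boolean timedOut) {
            this.shardId = shardId;
            this.hits = hits;
            this.timedOut = timedOut;
        }
    }

    /** Timeout 1: a per-shard budget, measured in simulated cost units. */
    static Callable<ShardResult> shard(int shardId, int docs, int costPerDoc, int budget) {
        return () -> {
            int spent = 0;
            int hits = 0;
            for (int i = 0; i < docs; i++) {
                if (spent + costPerDoc > budget) {
                    return new ShardResult(shardId, hits, true); // partial result
                }
                spent += costPerDoc;
                hits++;
            }
            return new ShardResult(shardId, hits, false);
        };
    }

    /** A shard that does not answer in time (hypothetical helper for the demo). */
    static Callable<ShardResult> stuckShard(int shardId) {
        return () -> {
            Thread.sleep(60_000); // interrupted when the coordinator gives up
            return new ShardResult(shardId, 0, false);
        };
    }

    /** Timeout 2: the coordinator waits for all shards, but only up to timeoutMillis. */
    static List<ShardResult> coordinate(Collection<Callable<ShardResult>> shards, long timeoutMillis) {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, shards.size()));
        try {
            List<ShardResult> received = new ArrayList<>();
            for (Future<ShardResult> f : pool.invokeAll(shards, timeoutMillis, TimeUnit.MILLISECONDS)) {
                if (f.isCancelled()) {
                    continue; // this shard did not finish before the deadline
                }
                try {
                    received.add(f.get());
                } catch (ExecutionException e) {
                    // the shard failed; skip it in this sketch
                }
            }
            return received;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for shards", e);
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        List<ShardResult> results = coordinate(
                Arrays.asList(shard(0, 5, 1, 100), shard(1, 50, 10, 100), stuckShard(2)), 500);
        for (ShardResult r : results) {
            System.out.println("shard " + r.shardId + " hits=" + r.hits + " timedOut=" + r.timedOut);
        }
    }
}
```

In the demo, shard 1 runs out of its own budget and reports partial hits, and shard 2 is missing from the output because the coordinator stopped waiting for it.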


[Elasticsearch] this-week-in-elasticsearch-and-apache-lucene-2016-03-14 Summary

Elastic/Elasticsearch 2016. 3. 18. 10:28

This is the Elasticsearch & Lucene news that gets posted almost every week.

I'm treating it as study material and will just summarize the key points.

Original link)

https://www.elastic.co/blog/this-week-in-elasticsearch-and-apache-lucene-2016-03-14

Summary of the original)

From what was posted, I kept only the items I personally want to hold on to.

Changes in master:

`string` fields will be replaced by `text` and `keyword` fields in 5.0, with the following bwc layer:
- String mappings in old indices will not be upgraded.
- Text/Keyword mappings can be added to old and new indices.
- String mappings on new indices will be upgraded automatically to text/keyword mappings, if possible, with deprecation logging.
- If it is not possible to automatically upgrade, an exception will be thrown.
Norms can no longer be lazy loaded. This is no longer needed as they are no longer loaded into memory. The `norms` setting now takes a boolean. Index time boosts are no longer stored as norms.
Queries deprecated in 2.0 have now been removed.
The generic thread pool is now bound to 4x the number of processors.

Ongoing changes:

Dynamic field addition now happens at the end of doc parsing, in preparation for supporting dots in field names.
The percolator API will be deprecated in favour of a percolator query, which will deliver a number of requested features to the percolator.
The reindex API will support throttling.
Index data folders will be named according to the index UUID, rather than the index name.

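Among the changes in master, the string field is what stands out. From now on we will need to map fields as text and keyword.

Existing mappings are not upgraded automatically, but newly created ones are.

Also, the deprecated queries have now been removed. If you have kept using them, watch out for errors.

Among the ongoing changes, dot support in field names and the percolator query stand out. Moving from an API to a query should make it more convenient and more useful.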


[Elasticsearch] Example Index Settings for Applying Synonyms

Elastic/Elasticsearch 2016. 3. 17. 18:34

Writing this down before I forget it again.

Reference documents)

https://www.elastic.co/guide/en/elasticsearch/guide/current/synonyms.html

https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-synonym-tokenfilter.html

Example)

"index": {
  "analysis": {
    "analyzer": {
      "arirang_custom": {
        "type": "custom",
        "tokenizer": "arirang_tokenizer",
        "filter": ["lowercase", "trim", "arirang_filter"]
      },
      "arirang_custom_searcher": {
        "tokenizer": "arirang_tokenizer",
        "filter": ["lowercase", "trim", "arirang_filter", "meme_synonym"]
      }
    },
    "filter": {
      "meme_synonym": {
        "type": "synonym",
        "synonyms": [
          "henry,헨리,앙리"
        ]
      }
    }
  }
}

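Here are a few points to watch out for.

1. When creating the synonym analyzer, either declare type as custom or do not declare type at all.

2. Create the synonym as a filter and assign it to the analyzer as a filter.

3. Decide whether to apply it at index time or at query time based on the trade-offs and the characteristics of your service.

4. Prefer synonyms_path. (Less a caution than a matter of manageability.)

5. With query-time synonyms, only match-type queries work; if you want to use term-type queries, synonyms must be applied at index time.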
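
Point 5 is easier to see with a toy model. The sketch below is not Elasticsearch code: it uses a hypothetical whitespace analyzer and ASCII stand-ins for the "henry,헨리,앙리" rule, and simply contrasts a match-style lookup (query text is analyzed, so synonyms are expanded) with a term-style lookup (the term is used as-is).

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class SynonymQuerySketch {
    // One synonym group; ASCII stand-ins for the rule in the settings above.
    static final List<String> GROUP = Arrays.asList("henry", "henri", "anri");

    /** Lowercases, splits on whitespace, and optionally expands synonyms. */
    static Set<String> analyze(String text, boolean expandSynonyms) {
        Set<String> terms = new LinkedHashSet<>();
        for (String raw : text.trim().toLowerCase(Locale.ROOT).split("\\s+")) {
            if (raw.isEmpty()) {
                continue;
            }
            terms.add(raw);
            if (expandSynonyms && GROUP.contains(raw)) {
                terms.addAll(GROUP);
            }
        }
        return terms;
    }

    /** match-style: the query text goes through the search analyzer (synonyms on). */
    static boolean match(Set<String> indexedTerms, String queryText) {
        for (String term : analyze(queryText, true)) {
            if (indexedTerms.contains(term)) {
                return true;
            }
        }
        return false;
    }

    /** term-style: the term is looked up as-is, with no analysis. */
    static boolean term(Set<String> indexedTerms, String exactTerm) {
        return indexedTerms.contains(exactTerm);
    }

    public static void main(String[] args) {
        Set<String> plainIndex = analyze("Anri joined", false);   // synonyms at query time only
        Set<String> expandedIndex = analyze("Anri joined", true); // synonyms at index time
        System.out.println(match(plainIndex, "henry"));    // true
        System.out.println(term(plainIndex, "henry"));     // false
        System.out.println(term(expandedIndex, "henry"));  // true
    }
}
```

The term lookup only succeeds once the synonyms were written into the index, which is the trade-off point 3 asks you to weigh.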

So what does not declaring the type in point 1 mean?

If you do not declare it, the analyzer is simply created as custom.

For the skeptics, here is the source code.

[AnalysisModule.java]

String typeName = analyzerSettings.get("type");
Class<? extends AnalyzerProvider> type;
if (typeName == null) {
    if (analyzerSettings.get("tokenizer") != null) {
        // custom analyzer, need to add it
        type = CustomAnalyzerProvider.class;
    } else {
        throw new IllegalArgumentException("Analyzer [" + analyzerName + "] must have a type associated with it");
    }
} else if (typeName.equals("custom")) {
    type = CustomAnalyzerProvider.class;
} else {
    type = analyzersBindings.analyzers.get(typeName);
    if (type == null) {
        throw new IllegalArgumentException("Unknown Analyzer type [" + typeName + "] for [" + analyzerName + "]");
    }
}


[Elasticsearch] Registering an External Arirang Dictionary in Elasticsearch

Elastic/Elasticsearch 2016. 3. 17. 12:49

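Once you adopt the arirang Korean morphological analyzer, you end up updating its dictionary data quite often.

Dictionary data bundled inside the jar requires rebuilding the package, redeploying it, and even restarting the cluster.

I would rather skip all that: update and manage the dictionary data as external files and apply changes immediately, without a restart.

In a previous post I implemented a REST API that reloads the dictionary data.

With that, the core functionality is already in place.

Previous post)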

http://jjeong.tistory.com/1142

So where should the dictionary files be placed and managed in Elasticsearch for this to work?

As the previous post shows, arirang.morph (written by 수명) first reads dictionary files that you create and place under org/apache/lucene/analysis/ko/dic on the classpath.

Previous post)

http://jjeong.tistory.com/1069

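Note, however, that the dictionary files cannot be found unless the directory you created is added to the classpath when Elasticsearch starts.

Elasticsearch classpath setting)

Elasticsearch's guidance is not to modify this file. But the feature cannot be used without the change, so in this case it has to be edited.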

$ vi bin/elasticsearch.in.sh

.....

ES_CLASSPATH="$ES_HOME/lib/elasticsearch-2.2.0.jar:$ES_HOME/lib/*:$ES_HOME/your-configured-path"

.....

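After making this change, restart, update the dictionary data directly, and use the reload API to confirm that the change is applied.

Reference - quick summary)

Loading flow for the properties and dic files in arirang.morph

Step 1)

Load the external korean.properties from the classpath.

The dic files are handled the same way.

Step 2)

If it does not exist, load the korean.properties bundled in the jar.

The dic files are handled the same way.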
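
The two steps amount to "external copy first, bundled copy as the fallback". The sketch below shows that lookup order with plain directories standing in for the external path and the jar contents. It is not arirang's loader; DictionaryLocator, the directory names, and sample.dic are all hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

final class DictionaryLocator {
    /** Returns the file from the first root that contains it; earlier roots win. */
    static Optional<Path> locate(List<Path> roots, String relativeName) {
        for (Path root : roots) {
            Path candidate = root.resolve(relativeName);
            if (Files.isRegularFile(candidate)) {
                return Optional.of(candidate);
            }
        }
        return Optional.empty();
    }

    private static String read(Path file) throws IOException {
        return new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
    }

    /** Builds two throwaway directories and shows which copy is picked before and after an external file appears. */
    static String demo() {
        try {
            Path external = Files.createTempDirectory("external-dic"); // stands in for the external path
            Path bundled = Files.createTempDirectory("bundled-dic");   // stands in for the jar contents
            String name = "sample.dic";                                // hypothetical file name
            List<Path> roots = Arrays.asList(external, bundled);       // external is checked first

            Files.write(bundled.resolve(name), "bundled".getBytes(StandardCharsets.UTF_8));
            String before = read(locate(roots, name).get());

            Files.write(external.resolve(name), "external".getBytes(StandardCharsets.UTF_8));
            String after = read(locate(roots, name).get());
            return before + " -> " + after;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // bundled -> external
    }
}
```

Until an external file exists, the bundled copy is used, which matches Step 2 of the flow.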

If you are wondering how to register dictionary data, see the previous post.

Dictionary data registration example)

http://jjeong.tistory.com/1069


jjeong

71 posts tagged 'lucene'
