[Elasticsearch] Arirang Analyzer + Elasticsearch Analyzer Plugin: A Development Review from a User's Perspective

Elastic/Elasticsearch 2017. 10. 19. 14:25

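This post summarizes, from a user's point of view, how to develop an analyzer plugin.

If you run Elasticsearch in a service, you have probably wondered which analyzer to use for Korean text.

Today I will look at the Lucene Korean Analyzer that I use, and at how to install and use it in Elasticsearch as a plugin.

Before starting, let's review the basic structure and behavior of a Lucene analyzer.

A Lucene analyzer consists of exactly one tokenizer and any number of filters.

There are two kinds of filters: CharFilter and TokenFilter.

A CharFilter normalizes the input string (for example, by removing or replacing unwanted characters) before tokenization, while a TokenFilter processes the tokens produced by the tokenizer.

As a result, analysis proceeds in the following order.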

Input Text →

Character Filter → Filtered Text →

Tokenizer → Tokens →

Token Filter → Filtered Tokens →

Output Tokens
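
To make the order above concrete, here is a minimal toy sketch in plain Java. It does not use any Lucene classes; `ToyAnalysisPipeline` and its three methods are hypothetical names that only mimic the roles of a character filter, a tokenizer, and a token filter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Toy model of the analysis chain. These are NOT Lucene classes.
final class ToyAnalysisPipeline {

    // Character filter: replace bracket characters with spaces before tokenizing.
    static String charFilter(String input) {
        return input.replaceAll("[\\[\\]()]", " ");
    }

    // Tokenizer: split the filtered text on runs of whitespace.
    static List<String> tokenize(String filtered) {
        List<String> tokens = new ArrayList<>();
        for (String t : filtered.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Token filter: lowercase every token.
    static List<String> tokenFilter(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            out.add(t.toLowerCase(Locale.ROOT));
        }
        return out;
    }

    // Input Text -> Character Filter -> Tokenizer -> Token Filter -> Output Tokens
    static List<String> analyze(String input) {
        return tokenFilter(tokenize(charFilter(input)));
    }

    public static void main(String[] args) {
        System.out.println(analyze("[한국] Elasticsearch USER Group"));
    }
}
```

A real Lucene analyzer works on token streams with attributes (term, offsets, position increments) rather than on lists of strings, so treat this only as a picture of the ordering.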

Now to the main topic.

The Lucene Korean Analyzer is currently developed and maintained by SooMyung Lee (이수명) and is published as open source.

The source code is available from the following two repositories.

[svn address]

https://lucenekorean.svn.sourceforge.net/svnroot/lucenekorean

[github address]

https://github.com/korlucene

※ The Lucene Korean Analyzer is now called Arirang.

The Arirang project is divided into two main parts.

arirang analyzer
arirang morph

1. arirang morph

This project contains the core Korean morphological analysis logic and the dictionary data.

If you want to change how Korean text is processed or modify the dictionaries, analyze and modify the code in this project.

2. arirang analyzer

This project extends Lucene's Analyzer so that arirang can be used from Lucene.

The classes that the Lucene analyzer pipeline needs, namely

- KoreanAnalyzer

- KoreanFilter

- KoreanFilterFactory

- KoreanToken

- KoreanTokenizer

- KoreanTokenizerFactory

are implemented as its main classes.

Dictionaries play an important role in Korean morphological analysis, so let's look at them next.

They are included in the arirang.morph project and, as mentioned above, can be updated and modified at any time.

1. Dictionary classpath

org/apache/lucene/analysis/ko/dic

2. Dictionary files

org/apache/lucene/analysis/ko

korean.properties

org/apache/lucene/analysis/ko/dic

abbreviation.dic

cj.dic

compounds.dic

eomi.dic

extension.dic

josa.dic

mapHanja.dic

occurrence.dic

prefix.dic

suffix.dic

syllable.dic

total.dic

uncompounds.dic

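3. Main dictionaries

Although this section is titled "main dictionaries", it is better understood as the dictionaries you can put to use quickly and easily.

total.dic
This is the base dictionary used by the arirang analyzer; use it as is.
If you need to change entries, use the extension.dic file described below instead.

extension.dic
This is called the extension dictionary. When you need to add dictionary data, add it to this file and manage it there.

compounds.dic
The compound noun dictionary. It holds the information used to decompose a word that is made up of several words.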

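4. Structure of total.dic / extension.dic

Each entry is a word followed by ten flag characters. In order, the flag positions mean:

noun (체언) | verb or adjective (용언) | other part of speech | forms a 하(여)다 verb | forms a 되(어)다 verb | noun that can take '내' | NA | NA | NA | irregular conjugation type

Example)

# Assuming 엘사 is a noun and is not a verb, not another part of speech, and not irregular, it is written as follows.

엘사,100000000X

# 노래 is a noun and forms a 하(여)다 verb.

노래,100100000X

# 소리 is a noun that can take '내', as in 소리내다.

소리,100001000X

The irregular conjugation codes are listed below; see the original post for details.

B : ㅂ irregular

H : ㅎ irregular

L : 르 irregular

U : ㄹ irregular

S : ㅅ irregular

D : ㄷ irregular

R : 러 irregular

X : regular

※ Original post : http://cafe.naver.com/korlucene/135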
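
The flag layout described above can be read mechanically. The following is a minimal sketch, assuming every entry has the form `word,flags` with exactly ten flag characters as in the examples; `DictionaryEntry` and its method names are hypothetical helpers and are not part of arirang.

```java
// Hypothetical helper for reading the ten flag characters of a total.dic / extension.dic entry.
final class DictionaryEntry {
    final String word;
    final String flags;

    private DictionaryEntry(String word, String flags) {
        this.word = word;
        this.flags = flags;
    }

    // Parses a line such as "노래,100100000X".
    static DictionaryEntry parse(String line) {
        int comma = line.lastIndexOf(',');
        if (comma < 0) {
            throw new IllegalArgumentException("expected word,flags: " + line);
        }
        String word = line.substring(0, comma).trim();
        String flags = line.substring(comma + 1).trim();
        if (flags.length() != 10) {
            throw new IllegalArgumentException("expected 10 flag characters: " + line);
        }
        return new DictionaryEntry(word, flags);
    }

    boolean isNoun()      { return flags.charAt(0) == '1'; } // 체언
    boolean isVerb()      { return flags.charAt(1) == '1'; } // 용언
    boolean isOther()     { return flags.charAt(2) == '1'; } // other part of speech
    boolean takesHada()   { return flags.charAt(3) == '1'; } // 하(여)다
    boolean takesDoeda()  { return flags.charAt(4) == '1'; } // 되(어)다
    boolean takesNae()    { return flags.charAt(5) == '1'; } // '내'
    char irregularType()  { return flags.charAt(9); }        // B, H, L, U, S, D, R or X

    public static void main(String[] args) {
        for (String line : new String[] {"엘사,100000000X", "노래,100100000X", "소리,100001000X"}) {
            DictionaryEntry e = parse(line);
            System.out.println(e.word + " noun=" + e.isNoun() + " hada=" + e.takesHada()
                    + " nae=" + e.takesNae() + " irregular=" + e.irregularType());
        }
    }
}
```

A check like this can be handy for validating extension.dic lines before deploying them, since a flag string of the wrong length is easy to type by accident.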

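5. Structure of compounds.dic

compound:part1,part2,...,partN:DBXX

Check whether the compound (the word before decomposition) can take the 하(여)다 verb ending (D) and the 되(어)다 verb ending (B).

Example)

객관화:객관,화:1100

The flags are 1100 because both of the following forms exist:

객관화하다

객관화되다

Reference)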

http://krdic.naver.com/search.nhn?query=%EA%B0%9D%EA%B4%80%ED%99%94&kind=all
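
As an illustration of the compounds.dic line format, here is a small sketch that splits a line into the compound, its parts, and the DBXX flags. It assumes exactly three colon-separated fields and four flag characters, as in the example above; `CompoundEntry` is a hypothetical name, not an arirang class.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical parser for one compounds.dic line: compound:part1,part2,...:DBXX
final class CompoundEntry {
    final String compound;
    final List<String> parts;
    final String flags;

    private CompoundEntry(String compound, List<String> parts, String flags) {
        this.compound = compound;
        this.parts = parts;
        this.flags = flags;
    }

    static CompoundEntry parse(String line) {
        String[] fields = line.trim().split(":");
        if (fields.length != 3 || fields[2].length() != 4) {
            throw new IllegalArgumentException("expected compound:parts:DBXX but got: " + line);
        }
        return new CompoundEntry(fields[0], Arrays.asList(fields[1].split(",")), fields[2]);
    }

    // D: the compound can take 하(여)다.
    boolean takesHada() { return flags.charAt(0) == '1'; }

    // B: the compound can take 되(어)다.
    boolean takesDoeda() { return flags.charAt(1) == '1'; }

    public static void main(String[] args) {
        CompoundEntry e = parse("객관화:객관,화:1100");
        System.out.println(e.compound + " -> " + e.parts + " hada=" + e.takesHada() + " doeda=" + e.takesDoeda());
    }
}
```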

Next, let's download the source code, build it, and create an Elasticsearch plugin.

1. Clone the projects

By default this checks out the master branch.

$ git clone https://github.com/korlucene/arirang.morph.git

$ git clone https://github.com/korlucene/arirang-analyzer-6.git

2. Maven build

Both projects are Maven projects, so Maven must be installed on the build machine.
※ Maven installation reference - https://maven.apache.org/

The arirang-analyzer-6 project already declares arirang.morph as a dependency, so unless you have modified arirang.morph you only need to build arirang-analyzer-6.

arirang.morph $ mvn clean package

arirang-analyzer-6 $ mvn clean package

3. Functional test

You can verify the behavior with the test code included in the arirang-analyzer-6 project.
See the TestKoreanAnalyzer1 class under src/test.

☞ The original test code is reproduced below to aid understanding.

/**

* Created by SooMyung(soomyung.lee@gmail.com) on 2014. 7. 30.

*/

public class TestKoreanAnalyzer1 extends TestCase {

public void testKoreanAnalzer() throws Exception {

String[] sources = new String[]{

"고려 때 중랑장(中郞將) 이돈수(李敦守)의 12대손이며",

"이돈수(李敦守)의",

"K·N의 비극",

"金靜子敎授",

"天國의",

"기술천이",

"12대손이며",

"明憲淑敬睿仁正穆弘聖章純貞徽莊昭端禧粹顯懿獻康綏裕寧慈溫恭安孝定王后",

"홍재룡(洪在龍)의",

"정식시호는 명헌숙경예인정목홍성장순정휘장소단희수현의헌강수유령자온공안효정왕후(明憲淑敬睿仁正穆弘聖章純貞徽莊昭端禧粹顯懿獻康綏裕寧慈溫恭安孝定王后)이며 돈령부영사(敦寧府領事) 홍재룡(洪在龍)의 딸이다. 1844년, 헌종의 정비(正妃)인 효현왕후가 승하하자 헌종의 계비로써 중궁에 책봉되었으나 5년 뒤인 1849년에 남편 헌종이 승하하고 철종이 즉위하자 19세의 어린 나이로 대비가 되었다. 1857년 시조모 대왕대비 순원왕후가 승하하자 왕대비가 되었다.",

"노벨상을"

};

KoreanAnalyzer analyzer = new KoreanAnalyzer();

for (String source : sources) {

TokenStream stream = analyzer.tokenStream("dummy", new StringReader(source));

CharTermAttribute termAtt = stream.addAttribute(CharTermAttribute.class);

PositionIncrementAttribute posIncrAtt = stream.addAttribute(PositionIncrementAttribute.class);

PositionLengthAttribute posLenAtt = stream.addAttribute(PositionLengthAttribute.class);

TypeAttribute typeAtt = stream.addAttribute(TypeAttribute.class);

OffsetAttribute offsetAtt = stream.addAttribute(OffsetAttribute.class);

MorphemeAttribute morphAtt = stream.addAttribute(MorphemeAttribute.class);

stream.reset();

while (stream.incrementToken()) {

System.out.println(termAtt.toString() + ":" + posIncrAtt.getPositionIncrement() + "(" + offsetAtt.startOffset() + "," + offsetAtt.endOffset() + ")");

}

stream.close();

}

}

}

Now that arirang has been built and tested, let's see how to create the plugin needed to install it in Elasticsearch.

First, if you have time, I recommend reading the Elasticsearch plugin documentation before continuing.

※ Elasticsearch Plugins and Integrations : https://www.elastic.co/guide/en/elasticsearch/plugins/5.5/index.html

The example that Elastic's official documentation refers to is at the link below; use it as a reference when implementing.

☞ https://github.com/elastic/elasticsearch/tree/master/plugins/jvm-example

※ What I recommend is downloading the Elasticsearch source code and implementing your plugin by following the officially maintained plugin code.

Let's look at the basic project structure of an analysis plugin.

1. Project Directory

src/main
    assemblies
        plugin.xml
    java
        org/elasticsearch
            index/analysis
                ${CUSTOM-ANALYZER-NAME}AnalyzerProvider
                ${CUSTOM-ANALYZER-NAME}TokenFilterFactory
                ${CUSTOM-ANALYZER-NAME}TokenizerFactory
            plugin/analysis/arirang
                Analysis${CUSTOM-ANALYZER-NAME}Plugin
    resources
        plugin-descriptor.properties

2. Files and classes

plugin.xml
Configures packaging with the Maven Assembly Plugin.
plugin-descriptor.properties
Holds the plugin descriptor information required of plugin authors.
elasticsearch reference) https://www.elastic.co/guide/en/elasticsearch/plugins/5.5/plugin-authors.html
${CUSTOM-ANALYZER-NAME}AnalyzerProvider
Provides instances of the custom analyzer.
${CUSTOM-ANALYZER-NAME}TokenFilterFactory
Provides instances of the custom token filter.
${CUSTOM-ANALYZER-NAME}TokenizerFactory
Provides instances of the custom tokenizer.
Analysis${CUSTOM-ANALYZER-NAME}Plugin
Registers the custom analyzer plugin.

Using this structure, let's build the elasticsearch-analysis-arirang plugin.

The plugin will also include a REST handler so that arirang's dynamic dictionary reload feature can be used.

Source code reference)

https://github.com/HowookJeong/elasticsearch-analysis-arirang/tree/5.5.0

Step1)

Create a Maven project.
For pom.xml, refer to the file in the GitHub repository.
https://github.com/HowookJeong/elasticsearch-analysis-arirang/blob/5.5.0/pom.xml

Step2)

Set up the plugin project structure.

Step3)

Create a lib folder under the project root and copy the arirang analyzer jar files into it.
arirang.lucene-analyzer-VERSION.jar
arirang-morph-VERSION.jar

Step4)

Add dependency entries for the local jar files to pom.xml.

<dependency>
    <!-- groupId is not shown in this excerpt; see the pom.xml linked in Step1 -->
    <artifactId>morph</artifactId>
    <version>${morph.version}</version>
    <scope>system</scope>
    <systemPath>${project.basedir}/lib/arirang-morph-${morph.version}.jar</systemPath>
    <optional>false</optional>
</dependency>

<dependency>
    <!-- groupId is not shown in this excerpt; see the pom.xml linked in Step1 -->
    <artifactId>arirang.lucene-analyzer-${lucene.version}</artifactId>
    <version>${morph.version}</version>
    <scope>system</scope>
    <systemPath>${project.basedir}/lib/arirang.lucene-analyzer-${lucene.version}-${morph.version}.jar</systemPath>
    <optional>false</optional>
</dependency>

Step5)

Write the analysis plugin code. The methods below are excerpted from the AnalysisArirangPlugin class.

@Override

public List<RestHandler> getRestHandlers(Settings settings, RestController restController, ClusterSettings clusterSettings,

IndexScopedSettings indexScopedSettings, SettingsFilter settingsFilter, IndexNameExpressionResolver indexNameExpressionResolver,

Supplier<DiscoveryNodes> nodesInCluster) {

return singletonList(new ArirangAnalyzerRestAction(settings, restController));

}

@Override

public Map<String, AnalysisProvider<TokenFilterFactory>> getTokenFilters() {

return singletonMap("arirang_filter", ArirangTokenFilterFactory::new);

}

@Override

public Map<String, AnalysisProvider<TokenizerFactory>> getTokenizers() {

Map<String, AnalysisProvider<TokenizerFactory>> extra = new HashMap<>();

extra.put("arirang_tokenizer", ArirangTokenizerFactory::new);

return extra;

}

@Override

public Map<String, AnalysisProvider<AnalyzerProvider<? extends Analyzer>>> getAnalyzers() {

return singletonMap("arirang_analyzer", ArirangAnalyzerProvider::new);

}

Step6)

Write the analysis code (analyzer provider, token filter factory, and tokenizer factory).

// ArirangAnalyzerProvider

private final KoreanAnalyzer analyzer;

public ArirangAnalyzerProvider(IndexSettings indexSettings, Environment env, String name, Settings settings) throws IOException {

super(indexSettings, name, settings);

analyzer = new KoreanAnalyzer();

}

@Override

public KoreanAnalyzer get() {

return this.analyzer;

}

// ArirangTokenFilterFactory

public ArirangTokenFilterFactory(IndexSettings indexSettings, Environment env, String name, Settings settings) {

super(indexSettings, name, settings);

}

@Override

public TokenStream create(TokenStream tokenStream) {

return new KoreanFilter(tokenStream);

}

// ArirangTokenizerFactory

public ArirangTokenizerFactory(IndexSettings indexSettings, Environment env, String name, Settings settings) {

super(indexSettings, name, settings);

}

@Override

public Tokenizer create() {

return new KoreanTokenizer();

}

Step7)

Write the REST action code.

// ArirangAnalyzerRestAction

@Inject

public ArirangAnalyzerRestAction(Settings settings, RestController controller) {

super(settings);

controller.registerHandler(RestRequest.Method.GET, "/_arirang_dictionary_reload", this);

}

@Override

protected RestChannelConsumer prepareRequest(RestRequest restRequest, NodeClient client) throws IOException {

try {

DictionaryUtil.loadDictionary();

} catch (MorphException me) {

return channel -> channel.sendResponse(new BytesRestResponse(RestStatus.NOT_ACCEPTABLE, "Failed which reload arirang analyzer dictionary!!"));

} finally {

}

return channel -> channel.sendResponse(new BytesRestResponse(RestStatus.OK, "Reloaded arirang analyzer dictionary!!"));

}

// ArirangAnalyzerRestModule

@Override

protected void configure() {

// TODO Auto-generated method stub

bind(ArirangAnalyzerRestAction.class).asEagerSingleton();

}

Step8)

Write the plugin-descriptor.properties file.

classname=org.elasticsearch.plugin.analysis.arirang.AnalysisArirangPlugin

name=analysis-arirang

jvm=true

java.version=1.8

site=false

isolated=true

description=Arirang plugin

version=${project.version}

elasticsearch.version=${elasticsearch.version}

hash=${buildNumber}

timestamp=${timestamp}

Step9)

Write plugin.xml, which defines the packaging.

<file>

<source>lib/arirang.lucene-analyzer-6.5.1-1.1.0.jar</source>

<outputDirectory>elasticsearch</outputDirectory>

</file>

<file>

<source>lib/arirang-morph-1.1.0.jar</source>

<outputDirectory>elasticsearch</outputDirectory>

</file>

<file>

<source>target/elasticsearch-analysis-arirang-5.5.0.jar</source>

<outputDirectory>elasticsearch</outputDirectory>

</file>

<file>

<source>${basedir}/src/main/resources/plugin-descriptor.properties</source>

<outputDirectory>elasticsearch</outputDirectory>

</file>

Step10)

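Build the project.

$ mvn clean package -DskipTests=true

Only part of the code is excerpted here, so please refer to the source code on GitHub.

Also, the order of the steps above is not what matters; what matters is understanding the structure and what needs to be implemented.

Now that the build is complete, let's install the plugin and check that it works.

1. Installation

$ bin/elasticsearch-plugin install --verbose file:///path/elasticsearch-analysis-arirang-5.5.0.zip

2. Functional check

Run

$ bin/elasticsearch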

[2017-08-22T18:56:17,223][INFO ][o.e.n.Node ] [singlenode] initializing ...

[2017-08-22T18:56:17,289][INFO ][o.e.e.NodeEnvironment ] [singlenode] using [1] data paths, mounts [[/ (/dev/disk1)]], net usable_space [489.3gb], net total_space [930.3gb], spins? [unknown], types [hfs]

[2017-08-22T18:56:17,289][INFO ][o.e.e.NodeEnvironment ] [singlenode] heap size [1.9gb], compressed ordinary object pointers [true]

[2017-08-22T18:56:17,309][INFO ][o.e.n.Node ] [singlenode] node name [singlenode], node ID [saCA_25vSxyUwF-RagteLw]

[2017-08-22T18:56:17,309][INFO ][o.e.n.Node ] [singlenode] version[5.5.0], pid[12613], build[260387d/2017-06-30T23:16:05.735Z], OS[Mac OS X/10.12.5/x86_64], JVM[Oracle Corporation/Java HotSpot(TM) 64-Bit Server VM/1.8.0_72/25.72-b15]

[2017-08-22T18:56:17,309][INFO ][o.e.n.Node ] [singlenode] JVM arguments [-Xms2g, -Xmx2g, -XX:+UseConcMarkSweepGC, -XX:CMSInitiatingOccupancyFraction=75, -XX:+UseCMSInitiatingOccupancyOnly, -XX:+DisableExplicitGC, -XX:+AlwaysPreTouch, -Xss1m, -Djava.awt.headless=true, -Dfile.encoding=UTF-8, -Djna.nosys=true, -Djdk.io.permissionsUseCanonicalPath=true, -Dio.netty.noUnsafe=true, -Dio.netty.noKeySetOptimization=true, -Dio.netty.recycler.maxCapacityPerThread=0, -Dlog4j.shutdownHookEnabled=false, -Dlog4j2.disable.jmx=true, -Dlog4j.skipJansi=true, -XX:+HeapDumpOnOutOfMemoryError, -Des.path.home=/Users/jeonghoug/dev/server/elastic/elasticsearch-5.5.0]

[2017-08-22T18:56:18,131][INFO ][o.e.p.PluginsService ] [singlenode] loaded module [aggs-matrix-stats]

[2017-08-22T18:56:18,131][INFO ][o.e.p.PluginsService ] [singlenode] loaded module [ingest-common]

[2017-08-22T18:56:18,131][INFO ][o.e.p.PluginsService ] [singlenode] loaded module [lang-expression]

[2017-08-22T18:56:18,131][INFO ][o.e.p.PluginsService ] [singlenode] loaded module [lang-groovy]

[2017-08-22T18:56:18,131][INFO ][o.e.p.PluginsService ] [singlenode] loaded module [lang-mustache]

[2017-08-22T18:56:18,131][INFO ][o.e.p.PluginsService ] [singlenode] loaded module [lang-painless]

[2017-08-22T18:56:18,131][INFO ][o.e.p.PluginsService ] [singlenode] loaded module [parent-join]

[2017-08-22T18:56:18,131][INFO ][o.e.p.PluginsService ] [singlenode] loaded module [percolator]

[2017-08-22T18:56:18,131][INFO ][o.e.p.PluginsService ] [singlenode] loaded module [reindex]

[2017-08-22T18:56:18,132][INFO ][o.e.p.PluginsService ] [singlenode] loaded module [transport-netty3]

[2017-08-22T18:56:18,132][INFO ][o.e.p.PluginsService ] [singlenode] loaded module [transport-netty4]

[2017-08-22T18:56:18,132][INFO ][o.e.p.PluginsService ] [singlenode] loaded plugin [analysis-arirang]

[2017-08-22T18:56:19,195][INFO ][o.e.d.DiscoveryModule ] [singlenode] using discovery type [zen]

[2017-08-22T18:56:19,686][INFO ][o.e.n.Node ] [singlenode] initialized

[2017-08-22T18:56:19,687][INFO ][o.e.n.Node ] [singlenode] starting ...

[2017-08-22T18:56:24,837][INFO ][o.e.t.TransportService ] [singlenode] publish_address {127.0.0.1:9300}, bound_addresses {[fe80::1]:9300}, {[::1]:9300}, {127.0.0.1:9300}

[2017-08-22T18:56:27,899][INFO ][o.e.c.s.ClusterService ] [singlenode] new_master {singlenode}{saCA_25vSxyUwF-RagteLw}{_fn1si8zTT6bkZK1q6ilxQ}{127.0.0.1}{127.0.0.1:9300}, reason: zen-disco-elected-as-master ([0] nodes joined)

[2017-08-22T18:56:27,928][INFO ][o.e.h.n.Netty4HttpServerTransport] [singlenode] publish_address {127.0.0.1:9200}, bound_addresses {[fe80::1]:9200}, {[::1]:9200}, {127.0.0.1:9200}

[2017-08-22T18:56:27,928][INFO ][o.e.n.Node ] [singlenode] started

Checking the morphological analyzer

http://localhost:9200/_analyze?pretty&analyzer=arirang_analyzer&text=한국 엘라스틱서치 사용자 그룹의 HENRY 입니다.

Morphological analyzer result

{
  "tokens" : [
    { "token" : "한국", "start_offset" : 0, "end_offset" : 2, "type" : "korean", "position" : 0 },
    { "token" : "엘라스틱서치", "start_offset" : 3, "end_offset" : 9, "type" : "korean", "position" : 1 },
    { "token" : "엘라", "start_offset" : 3, "end_offset" : 5, "type" : "korean", "position" : 1 },
    { "token" : "스틱", "start_offset" : 5, "end_offset" : 7, "type" : "korean", "position" : 2 },
    { "token" : "서치", "start_offset" : 7, "end_offset" : 9, "type" : "korean", "position" : 3 },
    { "token" : "사용자", "start_offset" : 10, "end_offset" : 13, "type" : "korean", "position" : 4 },
    { "token" : "그룹", "start_offset" : 14, "end_offset" : 16, "type" : "korean", "position" : 5 },
    { "token" : "henry", "start_offset" : 18, "end_offset" : 23, "type" : "word", "position" : 6 },
    { "token" : "입니다", "start_offset" : 24, "end_offset" : 27, "type" : "korean", "position" : 7 }
  ]
}
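
The request URL above contains spaces and Korean characters. A browser encodes them automatically, but when you call the endpoint from code you have to encode the query string yourself. Below is a minimal sketch using only the JDK; `AnalyzeUrlBuilder` is a hypothetical helper, and the host and analyzer name are simply the values used in this post.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper that builds a percent-encoded _analyze URL.
final class AnalyzeUrlBuilder {

    static String build(String host, String analyzer, String text) {
        try {
            // URLEncoder produces form encoding ('+' for spaces); switch to %20 for use in a URL query.
            String encodedText = URLEncoder.encode(text, "UTF-8").replace("+", "%20");
            String encodedAnalyzer = URLEncoder.encode(analyzer, "UTF-8");
            return host + "/_analyze?pretty&analyzer=" + encodedAnalyzer + "&text=" + encodedText;
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is a charset every Java platform is required to support.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(build("http://localhost:9200", "arirang_analyzer",
                "한국 엘라스틱서치 사용자 그룹의 HENRY 입니다."));
    }
}
```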

Calling the analyzer's RESTful endpoint and its result

Request)

http://localhost:9200/_arirang_dictionary_reload

Response)

Reloaded arirang analyzer dictionary!!

We have now covered the basics of the arirang analyzer and its Elasticsearch plugin.

Finally, let's look at how to modify the arirang dictionary data and apply the changes.

☞ We will keep arirang's default dictionary path and change only the dictionary contents.

1. Classpath setting for the dictionary files

The dictionary files must be on the classpath when Elasticsearch starts; otherwise they are not loaded correctly.
Edit the elasticsearch.in.sh file.

ES_CLASSPATH="$ES_HOME/lib/elasticsearch-5.5.0.jar:$ES_HOME/lib/*:$ES_CONF_PATH/dictionary"

Example) The dictionary paths and files mentioned above must exist.

config/dictionary/org/apache/lucene/analysis/ko

config/dictionary/org/apache/lucene/analysis/ko/dic

ES_CONF_PATH must be the same as the default path.conf setting.
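
If a dictionary does not seem to load, one quick sanity check is whether the file is visible as a classpath resource at all. The sketch below is a standalone approximation under the assumption that the dictionaries are looked up by their classpath-relative path, as the directory layout above suggests. `DictionaryClasspathCheck` is a hypothetical helper, and class loading inside a running Elasticsearch node may differ from this standalone check.

```java
// Hypothetical standalone check: is a dictionary file visible on the classpath?
final class DictionaryClasspathCheck {

    // Classpath-relative directory used for the .dic files in this post.
    static final String DIC_DIR = "org/apache/lucene/analysis/ko/dic/";

    static String pathOf(String fileName) {
        return DIC_DIR + fileName;
    }

    // True when the system class loader can locate the resource.
    static boolean isVisible(String resourcePath) {
        return ClassLoader.getSystemResource(resourcePath) != null;
    }

    public static void main(String[] args) {
        for (String name : new String[] {"total.dic", "extension.dic", "compounds.dic"}) {
            String path = pathOf(name);
            System.out.println(path + " visible=" + isVisible(path));
        }
    }
}
```

Running it with the same classpath entry that was added to ES_CLASSPATH (for example `java -cp config/dictionary:. DictionaryClasspathCheck`) shows whether the directory layout is correct.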

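2. Modify the dictionary data

Edit the dictionary files located in the path from step 1.

3. Reload the dictionaries

Call the /_arirang_dictionary_reload API to apply the changes without restarting Elasticsearch.

If you have followed along this far, you should now be able to make basic use of the arirang analyzer, the elasticsearch-analysis-arirang plugin, and the dictionaries.

Everything described here is open source, so please cite the sources accurately, and your active participation in reporting errors and suggesting improvements is always welcome.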

References)

http://cafe.naver.com/korlucene

https://www.elastic.co/guide/en/elasticsearch/reference/current/index.html

https://www.elastic.co/guide/en/elasticsearch/plugins/current/index.html

License: Attribution, NonCommercial, NoDerivatives

[Lucene] SynonymFilter -> SynonymGraphFilter + FlattenGraphFilter

ITWeb/검색일반 2017. 7. 31. 18:37

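I came across this today while looking into something else, so I'm just sharing it.
https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-synonym-tokenfilter.html

In Lucene 6.6, SynonymFilter is marked @Deprecated.

As the linked page says, the replacement filter is SynonymGraphFilter.

The interesting part is that SynonymGraphFilter is intended for search time, so at index time you still have to use either SynonymFilter or SynonymGraphFilter followed by FlattenGraphFilter.

I haven't analyzed it in depth yet ^^;

It was fun to dig into the Lucene code and test a few things after a long while.

I'm posting this just for reference.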


[Arirang] Handling Morphological Analysis with Dictionaries Only

ITWeb/검색일반 2017. 6. 23. 13:51

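These are a few tips for handling certain morphological analysis cases using only the dictionaries.

In short, it is grunt work.

They do not apply universally and need to be adapted to how you use the analyzer, so please treat them as reference only.

Common)

- When decomposing a compound noun, if one of the decomposed parts is a verb or adjective (용언), do not use the compound noun dictionary; register the word in the extension dictionary instead.

Alternatively, find the decomposed part that is a 용언 and register it as a noun (체언).

- The difference between a noun (체언) and "other part of speech" is that a 체언 is analyzed when it appears on its own, whereas a word marked as other part of speech is not.

Compound noun dictionary)

그리는게:그리는,게:0000

Extension dictionary)

그리,100000000X

그리는게,100000000X

Decomposition)

그리는게

그리는

그리

If you want '그리는' itself to be extracted as a noun, it must be registered as a 체언 in the extension dictionary, and the verb stem '그리' must likewise be treated as a 체언.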

- Handling verbs/adjectives ending in '~요' or '~해요'

좋아해요

Extension dictionary)

좋아,100000000X

좋아해,100000000X

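- Handling verbs/adjectives ending in '~져', '~져서', or '~서'

Words already registered as 용언 must be changed to 체언.

010000000X -> 100000000X

For '~서', the dictionary already contains an entry such as '서,110000000X', so add the words to the compound noun dictionary as well.

When you register a compound noun, its decomposed parts must be registered in the extension dictionary.

Extension dictionary)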

어두워지,100000000X

어두워,100000000X

늘어지,100000000X

늘어져,100000000X

Compound noun dictionary)

어두워서:어두워서,어두워:0000

어두워져:어두워져,어두워:0000

어두워져서:어두워져서,어두워:0000

늘어져서:늘어져서,늘어져:0000

- Handling compound verbs/adjectives followed by '~요'

크고낮아요

말려들어요

Compound noun dictionary)

크고낮아:크고,낮아:0000

말려들어:말려,들어:0000

- Handling verbs/adjectives ending in '~다' or '~데'

크다

작다

큰데

작은데

For a 용언 ending in '~다' to be split out as a token, it must be registered in the extension dictionary.

Extension dictionary)

크다,100000000X

작다,100000000X

큰데,100000000X

작은데,100000000X

- Handling verbs/adjectives ending in '~ㄴ', '~은', or '~는'

짧은

넓은

튀어나온

어울리는

'어울리,010000000X' is already registered as a 용언, so register the full form as a 체언.

잃어가는

Extension dictionary)

짧은,100000000X

넓은,100000000X

튀어나온,100000000X

잃어가는,100000000X

어울리는,100000000X

- Handling 'ㅎ' irregular verbs/adjectives

노랗고

동그랗고

Extension dictionary)

노랗,100000000X

Compound noun dictionary)

노랗고:노랗고,노랗:0000

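- Handling verbs/adjectives ending in '~하' or '~한'

Check whether the word is registered as a 용언 in the extension dictionary.

If it is, change it to a 체언.

Also register the word without '~하' / '~한' as a 체언 with the 하다 verb flag set.

Extension dictionary 1)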

넓적하,010000000X -> 100000000X

넓적,100100000X
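
Several of the tips above come down to the same edit: turning a 용언 entry (010000000X) into a 체언 entry (100000000X). When there are many such entries, the edit can be scripted. This is a minimal sketch assuming the `word,flags` format with ten flag characters; `FlagRewriter` is a hypothetical helper, not part of arirang.

```java
// Hypothetical helper: rewrite a dictionary entry so that it is flagged as 체언 instead of 용언.
final class FlagRewriter {

    static String verbToNoun(String entry) {
        int comma = entry.lastIndexOf(',');
        if (comma < 0 || entry.length() - comma - 1 != 10) {
            throw new IllegalArgumentException("expected word,10 flag characters: " + entry);
        }
        char[] flags = entry.substring(comma + 1).toCharArray();
        flags[0] = '1'; // set the 체언 flag
        flags[1] = '0'; // clear the 용언 flag
        return entry.substring(0, comma + 1) + new String(flags);
    }

    public static void main(String[] args) {
        System.out.println(verbToNoun("넓적하,010000000X"));
        System.out.println(verbToNoun("어울리,010000000X"));
    }
}
```

Only the first two flags are touched, so the 하다/되다 flags and the irregular conjugation code of the original entry are preserved.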


[Elasticsearch] elasticsearch-analysis-arirang 5.0.1 Plugin Development Notes

Elastic/Elasticsearch 2016. 11. 25. 12:31

Before upgrading the Elasticsearch cluster, the Korean morphological analyzer has to be upgraded first.

To build a Korean morphological analyzer plugin, you need a reasonable working knowledge of the following.

- Elasticsearch

- Lucene

- Arirang

The Arirang source and jar files are available from the links below.

- http://cafe.naver.com/korlucene

- https://lucenekorean.svn.sourceforge.net/svnroot/lucenekorean

[Source] Cafe front page (Lucene Korean Analyzer community)

- https://github.com/soomyung

Recently mgkaki seems to have been contributing in addition to SooMyung. :)

Changes in Lucene & Arirang)

- The package structure changed between Lucene 6.1 and 6.2, and some classes changed as well.

- The pairmap bug in arirang has been fixed. (It may have been fixed earlier. ^^;)

- CharacterUtils in Lucene has been refactored.

- The CharacterUtils usage declared in arirang's KoreanTokenizer has to be updated to match.

Remove CharacterUtils.getInstance()

CharacterUtils.codePointAt(...) to Character.codePointAt(...)

- If you download the arirang 6.2 source, these changes are already applied.

- You need to download arirang.morph 1.1.0.
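
For readers who have not used it, this is roughly what scanning a character buffer with the JDK's `Character.codePointAt` looks like. It is only a sketch of the JDK call pattern, not the actual KoreanTokenizer code; `CodePointScan` is a hypothetical name.

```java
// Sketch of iterating a char buffer by code point using only JDK methods.
final class CodePointScan {

    // Counts the code points in buffer[offset, offset + length).
    static int countCodePoints(char[] buffer, int offset, int length) {
        int end = offset + length;
        int count = 0;
        int i = offset;
        while (i < end) {
            // Reads one code point, combining a surrogate pair when both halves lie before 'end'.
            int codePoint = Character.codePointAt(buffer, i, end);
            i += Character.charCount(codePoint); // advance by 1 or 2 chars
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        char[] buffer = "한국\uD83D\uDE00".toCharArray(); // two Hangul syllables plus one emoji
        System.out.println(countCodePoints(buffer, 0, buffer.length));
    }
}
```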

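Changes in the Elasticsearch plugin)

The basic structure of plugin development has changed a lot, so there are many things to modify.

Depending on your point of view it may not seem like much, but I'll leave that judgment to you ^^

- arirang.lucene-analyzer and arirang-morph must be updated.

- AnalysisBinderProcessor, which was previously used for binding, is no longer used.

- Registration is now done through Plugin and AnalysisPlugin.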

public class AnalysisArirangPlugin extends Plugin implements AnalysisPlugin {

@Override

public Map<String, AnalysisProvider<TokenFilterFactory>> getTokenFilters() {

return singletonMap("arirang_filter", ArirangTokenFilterFactory::new);

}

@Override

public Map<String, AnalysisProvider<TokenizerFactory>> getTokenizers() {

Map<String, AnalysisProvider<TokenizerFactory>> extra = new HashMap<>();

extra.put("arirang_tokenizer", ArirangTokenizerFactory::new);

return extra;

}

@Override

public Map<String, AnalysisProvider<AnalyzerProvider<? extends Analyzer>>> getAnalyzers() {

return singletonMap("arirang_analyzer", ArirangAnalyzerProvider::new);

}

}

- The constructor arguments of AnalyzerProvider, TokenFilterFactory, and TokenizerFactory have changed.

IndexSettings indexSettings, Environment env, String name, Settings settings

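- In plugin.xml, used for assembly, outputDirectory has changed to elasticsearch.

- If outputDirectory is not set to elasticsearch, an error occurs.

After these changes you can build and install the plugin.

See the previous post) [Elasticsearch] Lucene Arirang Analyzer Plugin for Elasticsearch 5.0.1

※ While building the plugin, I was a little thrown at first by the changes between Lucene 6.1 and 6.2.

I had assumed the package structure would not change within 6.x, and that assumption turned out to be wrong.

I did expect Elasticsearch 5.x to have changed a lot, since it moves from Lucene 5.x to 6.x.

It took less time than I expected, but as usual there were no documents or materials anywhere to refer to.

Reading the source remains the only reliable way. Looking back, I'm not sure this really counts as a development write-up. ^^;

Source code)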

https://github.com/HowookJeong/elasticsearch-analysis-arirang


[Elasticsearch] Lucene Arirang Analyzer Plugin for Elasticsearch 5.0.1

Elastic/Elasticsearch 2016. 11. 24. 19:02

First, I'm sharing the built plugin zip file.

I'll upload the details of the work to GitHub later.

I've been so busy with projects and operations work lately that I barely found the time for this.

elasticsearch-analysis-arirang-5.0.1.zip

Installation)

$ bin/elasticsearch-plugin install --verbose file:///elasticsearch-analysis-arirang/target/elasticsearch-analysis-arirang-5.0.1.zip

Installation log)

-> Downloading file:///elasticsearch-analysis-arirang-5.0.1.zip

Retrieving zip from file:///elasticsearch-analysis-arirang-5.0.1.zip

[=================================================] 100%

- Plugin information:

Name: analysis-arirang

Description: Arirang plugin

Version: 5.0.1

* Classname: org.elasticsearch.plugin.analysis.arirang.AnalysisArirangPlugin

-> Installed analysis-arirang

Elasticsearch startup log)

$ bin/elasticsearch

[2016-11-24T18:49:09,922][INFO ][o.e.n.Node ] [] initializing ...

[2016-11-24T18:49:10,083][INFO ][o.e.e.NodeEnvironment ] [aDGu2B9] using [1] data paths, mounts [[/ (/dev/disk1)]], net usable_space [733.1gb], net total_space [930.3gb], spins? [unknown], types [hfs]

[2016-11-24T18:49:10,084][INFO ][o.e.e.NodeEnvironment ] [aDGu2B9] heap size [1.9gb], compressed ordinary object pointers [true]

[2016-11-24T18:49:10,085][INFO ][o.e.n.Node ] [aDGu2B9] node name [aDGu2B9] derived from node ID; set [node.name] to override

[2016-11-24T18:49:10,087][INFO ][o.e.n.Node ] [aDGu2B9] version[5.0.1], pid[56878], build[080bb47/2016-11-11T22:08:49.812Z], OS[Mac OS X/10.12.1/x86_64], JVM[Oracle Corporation/Java HotSpot(TM) 64-Bit Server VM/1.8.0_72/25.72-b15]

[2016-11-24T18:49:11,335][INFO ][o.e.p.PluginsService ] [aDGu2B9] loaded module [aggs-matrix-stats]

[2016-11-24T18:49:11,335][INFO ][o.e.p.PluginsService ] [aDGu2B9] loaded module [ingest-common]

[2016-11-24T18:49:11,335][INFO ][o.e.p.PluginsService ] [aDGu2B9] loaded module [lang-expression]

[2016-11-24T18:49:11,335][INFO ][o.e.p.PluginsService ] [aDGu2B9] loaded module [lang-groovy]

[2016-11-24T18:49:11,335][INFO ][o.e.p.PluginsService ] [aDGu2B9] loaded module [lang-mustache]

[2016-11-24T18:49:11,336][INFO ][o.e.p.PluginsService ] [aDGu2B9] loaded module [lang-painless]

[2016-11-24T18:49:11,336][INFO ][o.e.p.PluginsService ] [aDGu2B9] loaded module [percolator]

[2016-11-24T18:49:11,336][INFO ][o.e.p.PluginsService ] [aDGu2B9] loaded module [reindex]

[2016-11-24T18:49:11,336][INFO ][o.e.p.PluginsService ] [aDGu2B9] loaded module [transport-netty3]

[2016-11-24T18:49:11,336][INFO ][o.e.p.PluginsService ] [aDGu2B9] loaded module [transport-netty4]

[2016-11-24T18:49:11,336][INFO ][o.e.p.PluginsService ] [aDGu2B9] loaded plugin [analysis-arirang]

[2016-11-24T18:49:14,151][INFO ][o.e.n.Node ] [aDGu2B9] initialized

[2016-11-24T18:49:14,151][INFO ][o.e.n.Node ] [aDGu2B9] starting ...

[2016-11-24T18:49:14,377][INFO ][o.e.t.TransportService ] [aDGu2B9] publish_address {127.0.0.1:9300}, bound_addresses {[fe80::1]:9300}, {[::1]:9300}, {127.0.0.1:9300}

[2016-11-24T18:49:17,511][INFO ][o.e.c.s.ClusterService ] [aDGu2B9] new_master {aDGu2B9}{aDGu2B9mQ8KkWCe3fnqeMw}{_y9RzyKGSvqYAFcv99HBXg}{127.0.0.1}{127.0.0.1:9300}, reason: zen-disco-elected-as-master ([0] nodes joined)

[2016-11-24T18:49:17,584][INFO ][o.e.g.GatewayService ] [aDGu2B9] recovered [0] indices into cluster_state

[2016-11-24T18:49:17,588][INFO ][o.e.h.HttpServer ] [aDGu2B9] publish_address {127.0.0.1:9200}, bound_addresses {[fe80::1]:9200}, {[::1]:9200}, {127.0.0.1:9200}

[2016-11-24T18:49:17,588][INFO ][o.e.n.Node ] [aDGu2B9] started

Running Korean morphological analysis)

$ curl -X POST -H "Cache-Control: no-cache" -H "Postman-Token: 6d392d83-5816-71ad-556b-5cd6f92af634" -d '{

"analyzer" : "arirang_analyzer",

"text" : "[한국] 엘라스틱서치 사용자 그룹의 HENRY 입니다."

}' "http://localhost:9200/_analyze"

Analysis result)

{
  "tokens": [
    { "token": "[", "start_offset": 0, "end_offset": 1, "type": "symbol", "position": 0 },
    { "token": "한국", "start_offset": 1, "end_offset": 3, "type": "korean", "position": 1 },
    { "token": "]", "start_offset": 3, "end_offset": 4, "type": "symbol", "position": 2 },
    { "token": "엘라스틱서치", "start_offset": 5, "end_offset": 11, "type": "korean", "position": 3 },
    { "token": "엘라", "start_offset": 5, "end_offset": 7, "type": "korean", "position": 3 },
    { "token": "스틱", "start_offset": 7, "end_offset": 9, "type": "korean", "position": 4 },
    { "token": "서치", "start_offset": 9, "end_offset": 11, "type": "korean", "position": 5 },
    { "token": "사용자", "start_offset": 12, "end_offset": 15, "type": "korean", "position": 6 },
    { "token": "그룹", "start_offset": 16, "end_offset": 18, "type": "korean", "position": 7 },
    { "token": "henry", "start_offset": 20, "end_offset": 25, "type": "word", "position": 8 },
    { "token": "입니다", "start_offset": 26, "end_offset": 29, "type": "korean", "position": 9 }
  ]
}
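
In the result above, '엘라스틱서치' and its first decomposed part '엘라' share position 3. Lucene token streams carry position increments rather than absolute positions, and a token stacked on the previous one has an increment of 0. The sketch below shows how absolute positions follow from increments. The increment values in it are an assumption inferred from the positions in this output, not values read from the analyzer, and `PositionCalc` is a hypothetical name.

```java
import java.util.Arrays;

// Sketch: derive absolute token positions from position increments.
final class PositionCalc {

    static int[] toPositions(int[] increments) {
        int[] positions = new int[increments.length];
        int position = -1; // the first token with increment 1 lands on position 0
        for (int i = 0; i < increments.length; i++) {
            position += increments[i];
            positions[i] = position;
        }
        return positions;
    }

    public static void main(String[] args) {
        // Assumed increments for: [ 한국 ] 엘라스틱서치 엘라 스틱 서치
        int[] increments = {1, 1, 1, 1, 0, 1, 1};
        System.out.println(Arrays.toString(toPositions(increments)));
    }
}
```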


[Elasticsearch] This Week in Elasticsearch and Apache Lucene - 2016-06-27

Elastic/Elasticsearch 2016. 6. 28. 09:53

I'm scrapping this because a few items caught my eye.

[Original]

https://www.elastic.co/blog/this-week-in-elasticsearch-and-apache-lucene-2016-06-27

[Key points]

- low-level Java REST client has landed.

It looks like we can use the client that ES provides instead of building one on a separate HTTP client.

- index.store.preload

This seems to replace the warmer feature.

- no longer turns red when creating an index

There were moments when the status briefly turned red, so false alarms should decrease.

- default similarity is now BM25

So the default is moving from TF/IDF to BM25.

- wait for status yellow

Yellow also appears occasionally, so I'll need to review how we handle status going forward.
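
As background for the BM25 item, here is a sketch of the textbook BM25 score for a single term. It is not Lucene's BM25Similarity, which among other things stores document lengths in a lossy encoded form, so real scores will differ somewhat. The values k1 = 1.2 and b = 0.75 are the commonly used defaults, and `Bm25Sketch` is a hypothetical name.

```java
// Textbook BM25 for one term in one document (a sketch, not Lucene's implementation).
final class Bm25Sketch {

    // Inverse document frequency: rarer terms get a higher weight.
    static double idf(long docCount, long docFreq) {
        return Math.log(1.0 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
    }

    // tf: term frequency in the document; docLen / avgDocLen: length normalization inputs.
    static double score(double tf, long docCount, long docFreq,
                        double docLen, double avgDocLen, double k1, double b) {
        double lengthNorm = k1 * (1.0 - b + b * docLen / avgDocLen);
        return idf(docCount, docFreq) * tf * (k1 + 1.0) / (tf + lengthNorm);
    }

    public static void main(String[] args) {
        double k1 = 1.2, b = 0.75;
        for (int tf = 1; tf <= 5; tf++) {
            System.out.println("tf=" + tf + " score=" + score(tf, 1000, 10, 100, 100, k1, b));
        }
    }
}
```

The main practical difference from classic TF/IDF is visible in the loop: each additional occurrence of the term adds less to the score than the previous one, so the contribution of term frequency saturates.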

Elasticsearch Core

Changes in 2.x:

The .scripts index now obeys the number_of_shards setting.
Deprecation logging for `_timestamp` and `_ttl`.
Failed synced flushes were reporting an incorrect number of failures.
The index-exists request shouldn't fail if the index is being recovered.
A valid translog file can be deleted incorrectly after a disk full exception and multiple attempts to recover.

Changes in master:

The low-level Java REST client has landed. It is functionally equivalent to the REST clients available in other languages.
The `index.store.preload` setting can preload the specified Lucene files (eg doc values, norms) into MMAP before a segment comes online. This completes the replacement of warmers.
The cluster health no longer turns red when creating an index, unless there is a problem assigning shards.
The default similarity is now BM25.
The `_timestamp` and `_ttl` fields will not be supported on indices created in 5.x.
The `fields` parameter has been removed in favour of `stored_fields`, `docvalue_fields` and (for `text` fields only)`fielddata_fields`.
Some percolator queries don't need in-memory validation to ensure that they match.
Painless now has capturing lambdas, supports adding static methods like `each` to whitelisted classes, has syntax for initialising arrays, lists and maps,
Nested inner hits no longer return _index, _type, and _id, and parent/child inner hits doesn't return _index.
`string` fields weren't upgraded to `text`/`keyword` if `include_in_all` was specified.
Getting a task with wait_for_completion will return the task result.
Nodes info returns the calculated size of the total indexing buffer.
Analysis factories are now MultiTermAware, which will help to remove the lowercase_expanded_terms from the query string query, and to support keyword analyzers on the `keyword` field.
JNA is now a required dependency.
Guice has been removed from the script service,

Ongoing changes:

Sequence number checkpoints are persisted to disk when a segment is flushed.
Reindex-from-remote now uses the Java REST client.
Ensure that primary handover while indexing does not cause a dead lock.
The index file which lists the snapshots in a repository should be written atomically.
The `discovery-azure` plugin doesn't work with the security manager.
It shouldn't be necessary to wait for status yellow before working with a newly created index.
Add helpers to make JSON easier to render in Mustache.
The SynonymQuery should be used for alternative terms, instead of the Bool query.
More time zone edge case bug fixes.
Changes to shard store fetching are required in order to allow for inline rerouting during node join.
Analysis components should implement AnalysisPlugin instead of calling registerTokenizer, allowing Guice to be removed from Hunspell.

Apache Lucene

5.5.2 RC2 release vote is underway
A tricky randomized explain test failure turns out to be a test bug in a recently added test case
Math.toRadians and Math.toDegrees are now banned, since their implementation changes slightly across java versions, impacting our geo tests
RandomAccessFilterStrategy comes back to life for faster filter intersection in some cases
Multi term queries that match no terms rewrite to MatchNoDocsQuery instead of an empty BooleanQuery , making it much simpler to add a helpful reason to MatchNoDocsQuery
The new Ukrainian lemmatizer uses MorfologikFilter with a custom dictionary for efficient dictionary-based Ukrainian analysis
Lucene's confusing and bushy IndexReader hierarchy strikes again
RAMDirectory now also enforces write-once files, and MockDirectoryWrapper now tries harder to corrupt unsync'd index files on close
GeoPoint gets some code cleanups
Eclipse now also fails on unused imports
Auto-prefix terms have been removed since dimensional points is better
CompressionTools has been removed
ForbiddenAPIs is upgraded to version 2.2
It's important to fsync files after copying them via Lucene's Directory!
A tricky test failure was holding up the 5.5.2 release process
Some minor code improvements to SearchGroup
Can we improve the default behavior of query parsers and multi-term queries?
A test bug in MoreLikeThisTest still remains tricky to fix
MoreLikeThis should not invoke toString on a Field object
ScandinavianFoldingFilterFactory and ScandinavianNormalizationFilterFactory are safe for multi-term queries
In the possibly not-rare case where many documents share the same point value, we can better compress the docIDs
The ancient query norm and coord blocks progress and should be removed
Should we add a light weight Ukrainian stemmer?
Updating doc values and then using delete-by-query with a doc values query doesn't always work, but fixing it is likely not feasible

License: Attribution, NonCommercial, NoDerivatives

[Elasticsearch] This Week in Elasticsearch and Apache Lucene - 2016-04-25

Elastic/Elasticsearch 2016. 4. 26. 15:48

Personally, two items stood out to me in this week's digest:

Thread local leaks when running in web containers have finally been fixed.
CamelCase support has been removed.

Original post:

https://www.elastic.co/blog/this-week-in-elasticsearch-and-apache-lucene-2016-04-25

Elasticsearch Core

Changes in 2.x:

The index name was missing from the search slowlog.
CamelCase is deprecated (and has deprecation logging).
MoreLikeThis now handles aliases correctly.

Changes in master:

The .percolator type has been replaced with the percolator field datatype.
Added a fingerprint token filter and fingerprint analyzer for duplicate detection.
TransportReplicationAction has been significantly refactored in order to make it unit testable.
RPM and Deb packages now set permissions explicitly, instead of relying on umasks.
Indexed scripts and templates are now stored in the cluster state, and are called "stored" scripts/templates.
Parameter names in ingest processors are now more consistent.
IP fields support range queries again.
readNamedWriteable and writeNamedWriteable are now public, and writable.readFrom is gone.
UUID generators moved out of Strings, to avoid spooky action at a distance.
The `action.realtime_get` setting has been removed.
Support for unquoted JSON keys can be allowed via a system property, for bwc purposes.
Cross-type mapping updates were not working for boolean fields.
Empty task IDs are now serialised in 1 byte, so that every task can have a parent ID.
Reindex child tasks weren't being marked as such.
Validation failures have been removed from the cluster health response.
Object fields now inherit their dynamic setting from their parent object or type.
Thread local leaks when running in web containers have finally been fixed.
Added a safeguard to protect against too-large rescore windows.
The elasticsearch-plugin script now prints the download URL of the plugin when in verbose mode, and has friendlier error messages.
The startup script now fails with an error code if the elasticsearch binary is not found or is not executable.
CamelCase support has been removed.
The ICU analyzer now accepts custom rule files.

Ongoing changes:

Dots in field names are now supported, but so far only if the parent fields already exist. Tests are being added to make sure supporting dots fully doesn't break anything.
Persistence of results of long running tasks.
A `minhash` token filter for estimating the Jaccard similarity coefficient between two docs.
Pipeline aggs are only needed on the coordinating node.
Adding searchable/aggregatable tags to fields in the field stats API.
Inner hits will no longer support the top-level syntax as the inline syntax has been improved.
It should be possible to pass include/exclude values to the terms aggs using the same format that was used to render bucket keys.
Deleted index tombstones are close to being merged.


[Elasticsearch] Preparing the Arirang Morphological Analyzer for the Lucene 6.0 Upgrade Ahead of Elastic Stack 5.0

Elastic/Elasticsearch 2016. 4. 26. 15:18

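Some preparation will be needed, so here is a short note for now.

Once Elastic Stack 5.0 is officially released, it will move to a Lucene 6.x base.

The Arirang morphological analyzer therefore has to be upgraded as well.

After bumping the Lucene version, I see a compile error in only one place.

Implementing a single method that is now declared abstract looks sufficient.

Just implement the reflectWith(....) method in MorphemeAttributeImpl.java.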

@Override
public void reflectWith(AttributeReflector reflector) {
    reflector.reflect(MorphemeAttribute.class, "token", koreanToken);
}

I have not verified this code, so use it at your own discretion.
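To make the role of reflectWith a little more concrete, here is a minimal sketch that compiles without Lucene on the classpath. The Attribute and AttributeReflector interfaces below are local stand-ins that only mirror the shape of the Lucene types. The getToken() accessor, the dump() helper, and the class name ReflectWithSketch are hypothetical names added for illustration and are not taken from the Arirang source.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class ReflectWithSketch {
    /** Local stand-in for Lucene's Attribute marker interface (assumption, not the real type). */
    interface Attribute {}

    /** Local stand-in mirroring the shape of Lucene's AttributeReflector. */
    interface AttributeReflector {
        void reflect(Class<? extends Attribute> attClass, String key, Object value);
    }

    /** Hypothetical attribute interface; getToken() is an illustrative accessor. */
    interface MorphemeAttribute extends Attribute {
        Object getToken();
    }

    static final class MorphemeAttributeImpl implements MorphemeAttribute {
        private final Object koreanToken;

        MorphemeAttributeImpl(Object koreanToken) {
            this.koreanToken = koreanToken;
        }

        @Override
        public Object getToken() {
            return koreanToken;
        }

        // Same body as the method in the post: report each piece of state
        // as an (attribute class, key, value) triple.
        public void reflectWith(AttributeReflector reflector) {
            reflector.reflect(MorphemeAttribute.class, "token", koreanToken);
        }
    }

    /** Hypothetical helper: collects whatever the attribute reports into a map. */
    static Map<String, Object> dump(MorphemeAttributeImpl impl) {
        Map<String, Object> out = new LinkedHashMap<>();
        impl.reflectWith((attClass, key, value) ->
                out.put(attClass.getSimpleName() + "#" + key, value));
        return out;
    }
}
```

The reflector receives one callback per reported value, which is how generic tooling can inspect an attribute's state without knowing its concrete class.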


[Elasticsearch] This Week in Elasticsearch and Apache Lucene - 2016-04-11

Elastic/Elasticsearch 2016. 4. 12. 09:59

I kept meaning to read this one and am only now getting to it.

The items that caught my eye:

The `match`, `match_phrase`, and `match_phrase_prefix` queries are now separate queries, not just types of the `match` query.

The task manager response now tells you which tasks can be cancelled, and supports a `_cat/tasks` API.

Elasticsearch will no longer accept unquoted field names in JSON.

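Now that we have removed the percolator API, we should also remove the percolator type and use percolator fields instead.

These queries used to be separate, were then merged, and now appear to be split apart again.

I should test the task cancellation feature.

Field names now need more care when writing JSON. Parsing has become stricter. ^^

- The setting in the code below changed from true to false. (Was this feature causing performance problems or other functional errors?)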

jsonFactory.configure(JsonParser.Feature.ALLOW_UNQUOTED_FIELD_NAMES, true);

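The percolator functionality has moved to a field type. I should verify how this works as well.

Judging by the filed issue, it looks more intuitive and somewhat easier to use.

Most of the changes made in core 2.x will probably carry over to v5.0.0.

As for Lucene, 6.0.0 was under a release vote and has already been released, on April 8. The remaining items are mostly about geo points and location.

Highlights of the Lucene 6.0.0 release: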

Java 8 is the minimum Java version required.
Dimensional points, replacing legacy numeric fields, provides fast and space-efficient support for both single- and multi-dimension range and shape filtering. This includes numeric (int, float, long, double), InetAddress, BigInteger and binary range filtering, as well as geo-spatial shape search over indexed 2D LatLonPoints. See this blog post for details. Dependent classes and modules (e.g., MemoryIndex, Spatial Strategies, Join module) have been refactored to use new point types.
Lucene classification module now works on Lucene Documents using a KNearestNeighborClassifier or SimpleNaiveBayesClassifier.
The spatial module no longer depends on third-party libraries. Previous spatial classes have been moved to a new spatial-extras module.
Spatial4j has been updated to a new 0.6 version hosted by locationtech.
TermsQuery performance boost by a more aggressive default query caching policy.
IndexSearcher's default Similarity is now changed to BM25Similarity.
Easier method of defining custom CharTokenizer instances.

Original link:

https://www.elastic.co/blog/this-week-in-elasticsearch-and-apache-lucene-2016-04-11

Elasticsearch Core

Changes in 2.x:

Extended Stats could return the wrong result when some indices are missing a field.
Adding an object field with the same name as an existing field should fail.
Shadow replicas should be considered as having size zero.
CORS was broken for preflight requests.
Windows users can configure the Windows service name, description, and user.
Network addresses are now consistently displayed as the ip:port, instead of the hostname.

Changes in master:

Network partitions will no longer cause loss of in flight documents, and we have the test to prove it.
The `match`, `match_phrase`, and `match_phrase_prefix` queries are now separate queries, not just types of the `match` query.
The task manager response now tells you which tasks can be cancelled, and supports a `_cat/tasks` API.
Elasticsearch will no longer accept unquoted field names in JSON.
Elasticsearch now uses mmapfs for Lucene directories instead of a hybrid of niofs/mmapfs.
ParseField is now used to parse query names, which comes with deprecation logging for free.
Geo-points support ignore_malformed correctly again.
Moving averages threw an NPE when no window was specified.
MappedFieldType should be responsible for knowing which formatter applies, rather than the agg framework.
The allocation-explain API now includes the configured allocation_delay and remaining_delays times.
Hot threads now fail hard if the JVM doesn't support them.
Queries now have a registry, and queries have gradually been migrated to use it.

Ongoing changes:

Bulk request sizes will be subject to a circuit breaker.
Deleted index tombstones are complicated.
ObjectParser should allow constructor args.
Should we enable http compression by default?
Numeric and date fields in 5.0 should use the new Lucene points API.
Now that we have removed the percolator API, we should also remove the percolator type and use percolator fields instead.
Improvements to how we score the _all field based on per-field boosts.

Apache Lucene

The 6.0.0 release vote has passed and the bits were set free a few hours ago! Thank you Nick Knize for taking on the challenging role of release manager!
Many geo3d improvements this week:
- Polygon queries now accept Polygon... inputs, including random nested test polygons, matching our geo2d implementations and respecting the order of polygon vertices
- Geo3d seems to sometimes incorrectly think a polygon is concave when it's really convex
- Adjacent polygon points can now be coplanar
- The unique GeoPath support, which matches all points within X distance of a specified path (think road trip, looking for sushi nearby), now has a simple factory API as well
- Tests were not adequately testing the new simple factory methods for common shapes
- Geo3d now uses a similar encode/decode quantization approach as LatLonPoint
- After lively discussions, geo3d APIs no longer publicly expose classes and methods that could safely be private. APIs should start life private until proven worthy of being public!
Many geo2d improvements as well:
- LatLonPoint Polygon queries are faster using a cool pixelating grid approach, and we can do the same for GeoPointField
- We must improve debuggability of our geo test failures with nice 3D earth models like this example
- Here's a lively discussion about the pros and cons of having our geo tests quantize data only once
- Quantization issues are tricky, and geo2d queries were quantizing the edges of box queries incorrectly, resulting in false positive hits
- We have improved the geo2d tests to never allow "tolerance" on the returned results
- We have moved common geo encoding APIs to core so they can be shared across implementations
- Better random latitude/longitude generation for tests has exposed a tie-break bug in distance sorting, edge case bugs in box query, test bugs and polygon bugs
- Rectangle and Polygon classes have graduated into Lucene's core, to enable sharing across our numerous geo implementations
- A new encoding for GeoPointField will be consistent with LatLonPoint, and use all 64 available bits to minimize quantization error
- GeoPointField gets an efficient distance sort
- Randomized tests tried to create a too-big GeoPointDistanceQuery
- We will move BaseGeoPointTestCase from the spatial module to test-framework allowing us to remove the dependency of the sandbox module on spatial
- SloppyMath.haversin can now move to GeoUtils
The classification module now computes the f1-measure
A previously commented out test assertion comes half way back to life
Our "getting started with Lucene" docs were a bit buggy, but now fixed thanks to a user asking about it
We've upgraded our randomizedtesting dependency to 2.3.4, so we get better messages when there is a static leak in our tests
Points were missing from the codecs package documentation
The DataSplitter in Lucene's classification module should pay attention to classes when splitting
800+ new top-level-domains have been created since we last fixed StandardTokenizer to detect them, but we may need to wait for a JFlex release
Highlighting fails to find terms inside the child query of a BlockJoinQuery
Lucene doesn't have direct support for boolean subset matching, but a number of possible workarounds may work
Math.toRadians is changing its results slightly between Java 1.8 and 1.9
NRTCachingDirectory.listAll sometimes throws IllegalStateException
A scary random test failure is hopefully caused by bad hardware or buggy JVM
TestCoreParser gets some small improvements
A possibly new JVM bug causes JVM crash when decoding postings
JapaneseTokenizer should do a better job validating custom user-provided dictionaries
Another iteration for codec level encryption; this patch uses a new initialization vector for each data block, and seems not to impact search performance
Our release scripts still struggle with the switch from Subversion to git
Sometimes, BooleanQuery's explain method can lie about its score
Another user falls into the unfortunately common trap of thinking Lucene's stored fields store all information about a field


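[Lucene] About Offsets in TermVector Information

ITWeb/Search (General) 2016. 3. 30. 17:33

Even things I know are getting hazy these days, so I am writing this down again.

While teaching basic Lucene theory in an internal training session, I was explaining the start offset and end offset.

Someone asked why the end offset is 1 greater than the actual offset of the token's last character in the text.

I knew the answer, but I needed a quick explanation to move on, so I said it is probably set that way for the highlight feature. Today I went back through the documentation and the source code.

Quoted from Lucene in Action: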

The start offset is the character position in the original text where the token text begins, and the end offset is the position just after the last character of the token text.

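So the documentation itself states why the end offset is 1 greater than the position of the last character.

To see why it was designed this way, though, we need to look at how offsets are processed internally.

Since this is about highlighting, it is enough to check the classes involved and the logic that builds fragments.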

protected String makeFragment( StringBuilder buffer, int[] index, Field[] values, WeightedFragInfo fragInfo,
    String[] preTags, String[] postTags, Encoder encoder ){
  StringBuilder fragment = new StringBuilder();
  final int s = fragInfo.getStartOffset();
  int[] modifiedStartOffset = { s };
  String src = getFragmentSourceMSO( buffer, index, values, s, fragInfo.getEndOffset(), modifiedStartOffset );
  int srcIndex = 0;
  for( SubInfo subInfo : fragInfo.getSubInfos() ){
    for( Toffs to : subInfo.getTermsOffsets() ){
      fragment
        .append( encoder.encodeText( src.substring( srcIndex, to.getStartOffset() - modifiedStartOffset[0] ) ) )
        .append( getPreTag( preTags, subInfo.getSeqnum() ) )
        .append( encoder.encodeText( src.substring( to.getStartOffset() - modifiedStartOffset[0],
          to.getEndOffset() - modifiedStartOffset[0] ) ) )
        .append( getPostTag( postTags, subInfo.getSeqnum() ) );
      srcIndex = to.getEndOffset() - modifiedStartOffset[0];
    }
  }
  fragment.append( encoder.encodeText( src.substring( srcIndex ) ) );
  return fragment.toString();
}

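The code makes it clear.

Because it relies on String.substring(inclusive begin index, exclusive end index), the end offset has to be 1 greater than the index of the last character.

Seen another way, this is simply a convenient way to carry both the position and the length of the text in the offsets alone.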
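The relationship between offsets and substring can be sketched in plain Java without Lucene. The class OffsetDemo, its Token type, the space-splitting tokenize method, and the highlight helper are all hypothetical names made up for this illustration. They follow the same convention as Lucene offsets (inclusive start, exclusive end) but are not Lucene APIs.

```java
import java.util.ArrayList;
import java.util.List;

final class OffsetDemo {
    /** A token with Lucene-style offsets: start is inclusive, end is exclusive. */
    static final class Token {
        final String text;
        final int startOffset;
        final int endOffset;

        Token(String text, int startOffset, int endOffset) {
            this.text = text;
            this.startOffset = startOffset;
            this.endOffset = endOffset;
        }
    }

    /** Splits on spaces and records each token's offsets into the original text. */
    static List<Token> tokenize(String input) {
        List<Token> tokens = new ArrayList<>();
        int start = -1; // -1 means "not currently inside a token"
        for (int i = 0; i <= input.length(); i++) {
            boolean boundary = i == input.length() || input.charAt(i) == ' ';
            if (!boundary && start < 0) {
                start = i;
            } else if (boundary && start >= 0) {
                // i is the position just after the last character of the token.
                tokens.add(new Token(input.substring(start, i), start, i));
                start = -1;
            }
        }
        return tokens;
    }

    /** Wraps one token in tags using only its offsets, with no +1/-1 adjustments. */
    static String highlight(String input, Token t, String pre, String post) {
        return input.substring(0, t.startOffset)
                + pre + input.substring(t.startOffset, t.endOffset) + post
                + input.substring(t.endOffset);
    }

    public static void main(String[] args) {
        String text = "quick brown fox";
        Token brown = tokenize(text).get(1);
        System.out.println(brown.startOffset + "," + brown.endOffset); // 6,11
        System.out.println(highlight(text, brown, "<b>", "</b>"));     // quick <b>brown</b> fox
    }
}
```

With an exclusive end offset, endOffset - startOffset is the token length, and both offsets can be passed straight to substring.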



jjeong
