'분류 전체보기'에 해당되는 글 1583건

  1. 2021.07.05 [Elasticsearch] Bulk Request by Restful API
  2. 2021.07.02 [Intellij] intellij-java-google-style.xml 적용하기
  3. 2021.05.11 [Elasticsearch] Nori 사전 빌드하기.
  4. 2021.04.30 [Elasticsearch] Arirang Plugin에 테스트로 Jamo Tokenizer API 넣어 봤습니다.
  5. 2021.04.19 [Java] jenv 구성
  6. 2021.04.19 [Python] 비동기 요청 예제
  7. 2021.04.09 [Elasticsearch] esrally 링크.
  8. 2021.04.06 [Elasticsearch] Elastic APM Quick 구성
  9. 2021.04.05 [Elasticsearch] Nori Analyzer 테스트
  10. 2021.04.05 [Elasticsearch] Document Indexing 관련

[Elasticsearch] Bulk Request by Restful API

Elastic/Elasticsearch 2021. 7. 5. 15:41

공홈 문서 보시면 됩니다.

https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-bulk.html

 

워낙에 제가 예전에 작성해 놓은 글이 있어서 그냥 최신 버전으로 remind 합니다.

$ curl -s -H "Content-Type: application/x-ndjson" -XPOST http://localhost:9200/_bulk 
  --data-binary "@ATC.json"

 

ATC.json)

{"index": { "_index": "atc", "_id": "1"}}
{"chosung": "ㄴㅇㅋ", "chosung_eng": "nike", "jamo_kor": "ㄴㅏㅇㅣㅋㅣ", "jamo_eng": "skdlzl", "item_kor": "나이키", "item_eng": "nike"}

 

:

[Intellij] intellij-java-google-style.xml 적용하기

ITWeb/개발일반 2021. 7. 2. 10:10

intellij 버전이 올라 가면서 codestyle 정의가 바뀌었네요.

https://raw.githubusercontent.com/google/styleguide/gh-pages/intellij-java-google-style.xml

이전 글은 import 방법만 참고하시고 새로운 style 을 다운로드 받아서 적용해 보세요.

 


 

각자 필요한 코드 스타일은 아래서 다운로드 받으시면 됩니다.

 

[구글 코드 스타일]

https://github.com/google/styleguide

 

[Intellij 적용하기 - mac]

# 저는 intellij 2016.3 사용중입니다.

# 다운로드 받으신 intellij-java-google-style.xml 파일을 아래 경로로 복사해서 넣습니다.

 

$ cd ~/Library/Preferences/IntelliJIdea2016.3/codestyles

 

[Intellij 에서 import 하기]

Preferences -> Editor -> Code Style -> Manage Button -> Import

 

# 그러나 파일을 먼저 복사해 넣었기 때문에 import 하지 않으셔도 scheme 이 정상적으로 나옵니다.

# Scheme 을 GoogleStyle 로 변경 하시면 끝납니다.

 

:

[Elasticsearch] Nori 사전 빌드하기.

Elastic/Elasticsearch 2021. 5. 11. 20:51

[추가 사항]

https://issues.apache.org/jira/browse/SOLR-12655?focusedCommentId=16604160&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel&fbclid=IwAR3jRIpaCQ497v-qhofkc3DVmNabPab1ErDQhXnOsA0LNoqpHypa5cUSpy0#comment-16604160

 

아래 발생한 오류는 UnknownDictionaryBuilder.java 에서 아래 코드 수정으로 해결 되었습니다.

기본적으로 ivy.xml, build,xml 에 보면 사전 버전 정보가 들어가 있습니다.

이 사전이 변경 되면서 POS tag list 가 달라 졌는데요. 이 영향으로 에러가 발생 하게 됩니다.

private static final String NGRAM_DICTIONARY_ENTRY = "NGRAM,1801,3561,3668,SY,*,*,*,*,*,*,*";

코드를 수정 하지 않으려면 사전 버전을 맞춰서 사용 하시면 됩니다.

 

Elasticsearch User Group 에 #유정인 님이 도움 주셨습니다.


https://github.com/jimczi/nori/blob/master/how-to-custom-dict.asciidoc

https://bitbucket.org/eunjeon/mecab-ko/src/mecab-0.996/
https://bitbucket.org/eunjeon/mecab-ko-dic/src/v2.1.1/

$ git clone https://bitbucket.org/eunjeon/mecab-ko.git
$ git checkout tags/mecab-0.996
$ ./configure
$ make
$ sudo make install
$ mecab -v

https://bitbucket.org/eunjeon/mecab-ko-dic/downloads/

$ wget https://bitbucket.org/eunjeon/mecab-ko-dic/downloads/mecab-ko-dic-2.1.1-20180720.tar.gz
$ tar -xvzf mecab-ko-dic-2.1.1-20180720.tar.gz
$ cd mecab-ko-dic-2.1.1-20180720
$ brew install autoconf automake libtool
$ autoreconf
$ ./configure
$ make
$ sudo make install
$ ./tools/add-userdic.sh

$ tar cvzf custom-mecab-ko-dic.tar.gz mecab-ko-dic-2.1.1-20180720
$ git clone https://github.com/apache/lucene.git
$ git checkout tags/releases/lucene-solr/8.8.1
$ vi lucene/analysis/nori/ivy.xml

~       <!--artifact name="mecab-ko-dic" type=".tar.gz" url="https://bitbucket.org/eunjeon/mecab-ko-dic/downloads/mecab-ko-dic-2.0.3-20170922.tar.gz" /-->
+         <artifact name="mecab-ko-dic" type=".tar.gz" url="file:///Users/mzc02-henryjeong/Temp/fastcampus/analysis-nori/custom-mecab-ko-dic.tar.gz" />

 

$ vi lucene/analysis/nori/build.xml

~   <!--property name="dict.version" value="mecab-ko-dic-2.0.3-20170922" /-->
+   <property name="dict.version" value="mecab-ko-dic-2.1.1-20180720" />

 

$ cd lucene/analysis/nori
// apache ant 설치
$ mkdir -p ~/.ant/lib
$ ant ivy-bootstrap
$ ant regenerate
build-dict:
[delete] Deleting /Users/mzc02-henryjeong/Temp/analysis-nori/lucene/lucene/analysis/nori/src/resources/org/apache/lucene/analysis/ko/dict/TokenInfoDictionary$buffer.dat
[delete] Deleting /Users/mzc02-henryjeong/Temp/analysis-nori/lucene/lucene/analysis/nori/src/resources/org/apache/lucene/analysis/ko/dict/TokenInfoDictionary$fst.dat
[delete] Deleting /Users/mzc02-henryjeong/Temp/analysis-nori/lucene/lucene/analysis/nori/src/resources/org/apache/lucene/analysis/ko/dict/TokenInfoDictionary$posDict.dat
[delete] Deleting /Users/mzc02-henryjeong/Temp/analysis-nori/lucene/lucene/analysis/nori/src/resources/org/apache/lucene/analysis/ko/dict/TokenInfoDictionary$targetMap.dat
[java] Exception in thread "main" java.lang.AssertionError
[java] at org.apache.lucene.analysis.ko.util.BinaryDictionaryWriter.put(BinaryDictionaryWriter.java:112)
[java] at org.apache.lucene.analysis.ko.util.UnknownDictionaryWriter.put(UnknownDictionaryWriter.java:39)
[java] at org.apache.lucene.analysis.ko.util.UnknownDictionaryBuilder.readDictionaryFile(UnknownDictionaryBuilder.java:71)
[java] at org.apache.lucene.analysis.ko.util.UnknownDictionaryBuilder.readDictionaryFile(UnknownDictionaryBuilder.java:47)
[java] at org.apache.lucene.analysis.ko.util.UnknownDictionaryBuilder.build(UnknownDictionaryBuilder.java:41)
[java] at org.apache.lucene.analysis.ko.util.DictionaryBuilder.build(DictionaryBuilder.java:39)
[java] at org.apache.lucene.analysis.ko.util.DictionaryBuilder.main(DictionaryBuilder.java:52)

BUILD FAILED
$ git status .
HEAD detached at releases/lucene-solr/8.8.1
Changes not staged for commit:
(use "git add/rm ..." to update what will be committed)
(use "git restore ..." to discard changes in working directory)
modified: build.xml
modified: ivy.xml
deleted: src/resources/org/apache/lucene/analysis/ko/dict/CharacterDefinition.dat
deleted: src/resources/org/apache/lucene/analysis/ko/dict/ConnectionCosts.dat
modified: src/resources/org/apache/lucene/analysis/ko/dict/TokenInfoDictionary$buffer.dat
modified: src/resources/org/apache/lucene/analysis/ko/dict/TokenInfoDictionary$fst.dat
modified: src/resources/org/apache/lucene/analysis/ko/dict/TokenInfoDictionary$posDict.dat
modified: src/resources/org/apache/lucene/analysis/ko/dict/TokenInfoDictionary$targetMap.dat
deleted: src/resources/org/apache/lucene/analysis/ko/dict/UnknownDictionary$buffer.dat
deleted: src/resources/org/apache/lucene/analysis/ko/dict/UnknownDictionary$posDict.dat
deleted: src/resources/org/apache/lucene/analysis/ko/dict/UnknownDictionary$targetMap.dat
$ ant jar
...중략...
-jar-core:
[jar] Building jar: /Users/mzc02-henryjeong/Temp/analysis-nori/lucene/lucene/build/analysis/nori/lucene-analyzers-nori-8.8.1-SNAPSHOT.jar
...중략...

/Users/mzc02-henryjeong/Works/app/apache-ant-1.10.10

일단 시간이 별로 없어서 이 정도까지만 테스트 하고 오류는 나중에 심각 하게 살펴 보겠습니다.

Arirang 만 잘 해도 되는데 Nori 도 할 줄 알아야 하니까...
근데 사전 관리 방식은 Arirang 이 편하고 좋습니다.

사실 한자 사전 고치려다가 여기까지 왔네요.

:

[Elasticsearch] Arirang Plugin에 테스트로 Jamo Tokenizer API 넣어 봤습니다.

Elastic/Elasticsearch 2021. 4. 30. 09:25

https://github.com/HowookJeong/elasticsearch-analysis-arirang/tree/hanguel-jamo-tokenizer-7.12.0

 

Checkout 받으신 후 빌드 하시고 설치 하시면 됩니다.

 

$ mvn clean install -DskipTests=true
$ bin/elasticsearch-plugin install file:///Users/mzc02-henryjeong/Works/github/howookjeong/elasticsearch-analysis-arirang/target/elasticsearch-analysis-arirang-7.12.0.zip

 

[Request]
curl --location --request POST 'http://localhost:9200/_arirang/jamo?text=엘라스틱서치&token=CHOSUNG'

 

[Method]

GET / POST

 

[Response]
CHOSUNG -> ㅇㄹㅅㅌㅅㅊ
JUNGSUNG -> ㅔㅏㅡㅣㅓㅣ
JONGSUNG -> ㄹㄱ
KORTOENG -> dpffktmxlrtjcl

 

[Parameters]

  • text
    형태소 분석할 문자열
  • token
    분석 유형 지정
    CHOSUNG (초성)
    JUNGSUNG (중성)
    JONGSUNG (종성)
    KORTOENG (한영 변환)

기능 테스트로 넣어 둔거라서 성능적인 검증은 하지 않았습니다.

:

[Java] jenv 구성

ITWeb/개발일반 2021. 4. 19. 20:13

참고문서)
https://brew.sh/index_ko

 

$ /usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"
$ brew install jenv

$ brew tap AdoptOpenJDK/openjdk

 

    # Shell: bash
    echo 'export PATH="$HOME/.jenv/bin:$PATH"' >> ~/.bash_profile
    echo 'eval "$(jenv init -)"' >> ~/.bash_profile

    $ source ~/.bash_profile
    
    # Shell: zsh
    echo 'export PATH="$HOME/.jenv/bin:$PATH"' >> ~/.zshrc
    echo 'eval "$(jenv init -)"' >> ~/.zshrc

    $ source ~/.zshrc

 

jdk version 별 설치 하기)
$ brew install --cask adoptopenjdk8
$ brew install --cask adoptopenjdk11
$ brew install --cask adoptopenjdk15
$ brew install --cask adoptopenjdk16

 

jenv 추가)
$ jenv add /Library/Java/JavaVirtualMachines/adoptopenjdk-8.jdk/Contents/Home
$ jenv add /Library/Java/JavaVirtualMachines/adoptopenjdk-11.jdk/Contents/Home
$ jenv add /Library/Java/JavaVirtualMachines/adoptopenjdk-15.jdk/Contents/Home
$ jenv add /Library/Java/JavaVirtualMachines/adoptopenjdk-16.jdk/Contents/Home

 

jenv 사용)

Usage: jenv <command> [<args>]

Some useful jenv commands are:
   commands    List all available jenv commands
   local       Set or show the local application-specific Java version
   global      Set or show the global Java version
   shell       Set or show the shell-specific Java version
   rehash      Rehash jenv shims (run this after installing executables)
   version     Show the current Java version and its origin
   versions    List all Java versions available to jenv
   which       Display the full path to an executable
   whence      List all Java versions that contain the given executable
   add         Add JDK into jenv. A alias name will be generated by parsing "java -version"

See `jenv help <command>' for information on a specific command.
For full documentation, see: https://github.com/jenv/jenv/blob/master/README.md

 

$ jenv versions

  system
  1.8
* 1.8.0.242 (set by /Users/mzc02-henryjeong/.jenv/version)
  11.0
  11.0.6
  13.0
  13.0.2
  15
  15.0
  15.0.2
  16
  openjdk64-1.8.0.242
  openjdk64-11.0.6
  openjdk64-13.0.2
  openjdk64-15.0.2
  openjdk64-16

JAVA_HOME 변경을 위한 plugin 설치)

$ jenv enable-plugin export

(주의 : eval "$(jenv init -)" 설정이 빠져 있으면 동작 하지 않습니다.)

 

jenv global 설정)

/Users/henryjeong/.jenv/version

 

$ jenv global 11.0
$ jenv global
11.0

 

jenv local 설정)

$ jenv local 16
$ jenv local
16
$ java -version
openjdk version "16" 2021-03-16
OpenJDK Runtime Environment AdoptOpenJDK (build 16+36)
OpenJDK 64-Bit Server VM AdoptOpenJDK (build 16+36, mixed mode, sharing)

 

로컬 설정은 해당 버전이 필요한 Application 에 들어가서 실행 하시면 됩니다.

:

[Python] 비동기 요청 예제

ITWeb/개발일반 2021. 4. 19. 18:47

비동기 처리 하는거 예제가 필요해서 급조 합니다.

 

참고문서)

docs.python.org/ko/3/library/asyncio.html

docs.python-requests.org/en/master/

$ pyenv virtualenv 3.8.6 helloworld
$ pyenv activate helloworld
(helloworld) $ pip install requests asyncio
(helloworld) $ python asyncio.py

(helloworld) $ vi asyncio.py
import requests
import string
import random
import asyncio

async def requestAsyncio(content, cid, aid):
    url = "http://localhost:8080/helloworld"

    payload={}

    headers = {
      'accept': 'application/json, text/plain, */*',
      'accept-encoding': 'gzip, deflate, br',
      'accept-language': 'ko-KR,ko;q=0.9,en-US;q=0.8,en;q=0.7,ja-JP;q=0.6,ja;q=0.5,zh-MO;q=0.4,zh;q=0.3',
      'content-type': 'application/x-www-form-urlencoded',
      'origin': 'http://localhost:8080',
      'referer': 'http://localhost:8080/helloworld',
      'sec-fetch-dest': 'empty',
      'sec-fetch-mode': 'cors',
      'sec-fetch-site': 'same-site',
      'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 11_1_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36'
    }

    response = requests.request("POST", url, headers=headers, data=payload)

    print(response.status_code)
    print(response.text)

if __name__ == "__main__":
    letters = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'

    cid = "helloworld"
    aid = "python"
    content1 = random.choice(letters)
    content2 = random.choice(letters)

    loop = asyncio.get_event_loop()
    loop.run_until_complete(asyncio.gather(requestKhan(content1, cid, aid), requestKhan(content2, cid, aid)))
    loop.close()
    
(helloworld) $ python asyncio.py

(helloworld) $ pyenv deactivate helloworld
$

 

:

[Elasticsearch] esrally 링크.

Elastic/Elasticsearch 2021. 4. 9. 08:31

https://github.com/elastic/rally
https://esrally.readthedocs.io/en/stable/

 

elasticsearch cluster 성능 점검용으로 활용 하면 좋아요.

:

[Elasticsearch] Elastic APM Quick 구성

Elastic/Elasticsearch 2021. 4. 6. 14:39

Elastic 사에서 제공 하는 다양한 도구와 서비스 들이 있습니다.

APM 이라는 아주 좋은 도구도 제공 하는데요.

Quick 하게 필요한 정보만 기록해 봅니다.

 

[Elastic APM Server]

https://www.elastic.co/guide/en/apm/server/current/overview.html
https://www.elastic.co/downloads/apm

 

[Elastic APM Agent]

https://www.elastic.co/guide/en/apm/agent/java/current/intro.html
https://search.maven.org/search?q=g:co.elastic.apm%20AND%20a:elastic-apm-agent

 

<intellij 에서 vm 옵션으로 등록합니다.>
-javaagent:/Users/mzc02-henryjeong/Works/elastic/apm-agent/elastic-apm-agent-1.22.0.jar -Delastic.apm.service_name=poc-service -Delastic.apm.application_packages=com.mzc.poc -Delastic.apm.server_url=http://localhost:8200

 

<Kibana 에서 Index Pattern 등록 하고 Discover 합니다.>

apm-{versin}-onboarding-*
apm-{versin}-span-*
apm-{versin}-error-*
apm-{versin}-transaction-*
apm-{versin}-profile-*
apm-{versin}-metric-*

- alias 로 자동 생성 되어 있음.

 

구성 시 사전 필요한 stack 은)

- Elasticsearch

- Kibana

- Spring Boot Web Application

:

[Elasticsearch] Nori Analyzer 테스트

Elastic/Elasticsearch 2021. 4. 5. 17:23

Nori Analyzer 기본 테스트 입니다.

공홈 참고문서)

www.elastic.co/guide/en/elasticsearch/plugins/current/analysis-nori.html

 

기본 사전)

bitbucket.org/eunjeon/mecab-ko-dic/src/master/

 

POS Tag)

lucene.apache.org/core/8_8_0/analyzers-nori/org/apache/lucene/analysis/ko/POS.Tag.html

 

여기서 주의 할 점은 filter 선언 시 postags 가 아닌 stoptags 로 선언 하셔야 합니다.

제가 실수로 postags 로 작성을 했었네요. (수정해 두었습니다.)

 

_analyze  API 를 이용해서 RESTful API 호출로 테스트 한 내용입니다.

{
    "tokenizer": {
        "type": "nori_tokenizer",
        "decompound_mode": "mixed",
        "discard_punctuation": "true",
        "user_dictionary_rules": ["c++ c+ +", "C샤프", "세종", "세종시 세종 시"]
    },
    "filter": [
        {        
            "type": "nori_part_of_speech",
            "stoptags": [
                "E",
                "IC",
                "J",
                "MAG", "MAJ", "MM",
                "SP", "SSC", "SSO", "SC", "SE",
                "XPN", "XSA", "XSN", "XSV",
                "UNA", "NA", "VSV"
            ]
        },
        {
            "type": "nori_readingform"
        }
    ],
    "text": "世宗市에서 c++ 언어를 가르치는 학원이 있나요?",
    "attributes" : ["posType", "leftPOS", "rightPOS", "morphemes", "reading"],
    "explain": true        
}
더보기

실행한 결과)

{
    "detail": {
        "custom_analyzer": true,
        "charfilters": [],
        "tokenizer": {
            "name": "__anonymous__nori_tokenizer",
            "tokens": [
                {
                    "token": "世宗",
                    "start_offset": 0,
                    "end_offset": 2,
                    "type": "word",
                    "position": 0,
                    "leftPOS": "NNG(General Noun)",
                    "morphemes": null,
                    "posType": "MORPHEME",
                    "reading": "세종",
                    "rightPOS": "NNG(General Noun)"
                },
                {
                    "token": "市",
                    "start_offset": 2,
                    "end_offset": 3,
                    "type": "word",
                    "position": 1,
                    "leftPOS": "NNG(General Noun)",
                    "morphemes": null,
                    "posType": "MORPHEME",
                    "reading": "시",
                    "rightPOS": "NNG(General Noun)"
                },
                {
                    "token": "에서",
                    "start_offset": 3,
                    "end_offset": 5,
                    "type": "word",
                    "position": 2,
                    "leftPOS": "J(Ending Particle)",
                    "morphemes": null,
                    "posType": "MORPHEME",
                    "reading": null,
                    "rightPOS": "J(Ending Particle)"
                },
                {
                    "token": "c++",
                    "start_offset": 6,
                    "end_offset": 9,
                    "type": "word",
                    "position": 3,
                    "positionLength": 2,
                    "leftPOS": "NNG(General Noun)",
                    "morphemes": "c+/NNG(General Noun)++/NNG(General Noun)",
                    "posType": "COMPOUND",
                    "reading": null,
                    "rightPOS": "NNG(General Noun)"
                },
                {
                    "token": "c+",
                    "start_offset": 6,
                    "end_offset": 8,
                    "type": "word",
                    "position": 3,
                    "leftPOS": "NNG(General Noun)",
                    "morphemes": null,
                    "posType": "MORPHEME",
                    "reading": null,
                    "rightPOS": "NNG(General Noun)"
                },
                {
                    "token": "+",
                    "start_offset": 8,
                    "end_offset": 9,
                    "type": "word",
                    "position": 4,
                    "leftPOS": "NNG(General Noun)",
                    "morphemes": null,
                    "posType": "MORPHEME",
                    "reading": null,
                    "rightPOS": "NNG(General Noun)"
                },
                {
                    "token": "언어",
                    "start_offset": 10,
                    "end_offset": 12,
                    "type": "word",
                    "position": 5,
                    "leftPOS": "NNG(General Noun)",
                    "morphemes": null,
                    "posType": "MORPHEME",
                    "reading": null,
                    "rightPOS": "NNG(General Noun)"
                },
                {
                    "token": "를",
                    "start_offset": 12,
                    "end_offset": 13,
                    "type": "word",
                    "position": 6,
                    "leftPOS": "J(Ending Particle)",
                    "morphemes": null,
                    "posType": "MORPHEME",
                    "reading": null,
                    "rightPOS": "J(Ending Particle)"
                },
                {
                    "token": "가르치",
                    "start_offset": 14,
                    "end_offset": 17,
                    "type": "word",
                    "position": 7,
                    "leftPOS": "VV(Verb)",
                    "morphemes": null,
                    "posType": "MORPHEME",
                    "reading": null,
                    "rightPOS": "VV(Verb)"
                },
                {
                    "token": "는",
                    "start_offset": 17,
                    "end_offset": 18,
                    "type": "word",
                    "position": 8,
                    "leftPOS": "E(Verbal endings)",
                    "morphemes": null,
                    "posType": "MORPHEME",
                    "reading": null,
                    "rightPOS": "E(Verbal endings)"
                },
                {
                    "token": "학원",
                    "start_offset": 19,
                    "end_offset": 21,
                    "type": "word",
                    "position": 9,
                    "leftPOS": "NNG(General Noun)",
                    "morphemes": null,
                    "posType": "MORPHEME",
                    "reading": null,
                    "rightPOS": "NNG(General Noun)"
                },
                {
                    "token": "이",
                    "start_offset": 21,
                    "end_offset": 22,
                    "type": "word",
                    "position": 10,
                    "leftPOS": "J(Ending Particle)",
                    "morphemes": null,
                    "posType": "MORPHEME",
                    "reading": null,
                    "rightPOS": "J(Ending Particle)"
                },
                {
                    "token": "있",
                    "start_offset": 23,
                    "end_offset": 24,
                    "type": "word",
                    "position": 11,
                    "leftPOS": "VA(Adjective)",
                    "morphemes": null,
                    "posType": "MORPHEME",
                    "reading": null,
                    "rightPOS": "VA(Adjective)"
                },
                {
                    "token": "나요",
                    "start_offset": 24,
                    "end_offset": 26,
                    "type": "word",
                    "position": 12,
                    "leftPOS": "E(Verbal endings)",
                    "morphemes": null,
                    "posType": "MORPHEME",
                    "reading": null,
                    "rightPOS": "E(Verbal endings)"
                }
            ]
        },
        "tokenfilters": [
            {
                "name": "__anonymous__nori_part_of_speech",
                "tokens": [
                    {
                        "token": "世宗",
                        "start_offset": 0,
                        "end_offset": 2,
                        "type": "word",
                        "position": 0,
                        "leftPOS": "NNG(General Noun)",
                        "morphemes": null,
                        "posType": "MORPHEME",
                        "reading": "세종",
                        "rightPOS": "NNG(General Noun)"
                    },
                    {
                        "token": "市",
                        "start_offset": 2,
                        "end_offset": 3,
                        "type": "word",
                        "position": 1,
                        "leftPOS": "NNG(General Noun)",
                        "morphemes": null,
                        "posType": "MORPHEME",
                        "reading": "시",
                        "rightPOS": "NNG(General Noun)"
                    },
                    {
                        "token": "c++",
                        "start_offset": 6,
                        "end_offset": 9,
                        "type": "word",
                        "position": 3,
                        "positionLength": 2,
                        "leftPOS": "NNG(General Noun)",
                        "morphemes": "c+/NNG(General Noun)++/NNG(General Noun)",
                        "posType": "COMPOUND",
                        "reading": null,
                        "rightPOS": "NNG(General Noun)"
                    },
                    {
                        "token": "c+",
                        "start_offset": 6,
                        "end_offset": 8,
                        "type": "word",
                        "position": 3,
                        "leftPOS": "NNG(General Noun)",
                        "morphemes": null,
                        "posType": "MORPHEME",
                        "reading": null,
                        "rightPOS": "NNG(General Noun)"
                    },
                    {
                        "token": "+",
                        "start_offset": 8,
                        "end_offset": 9,
                        "type": "word",
                        "position": 4,
                        "leftPOS": "NNG(General Noun)",
                        "morphemes": null,
                        "posType": "MORPHEME",
                        "reading": null,
                        "rightPOS": "NNG(General Noun)"
                    },
                    {
                        "token": "언어",
                        "start_offset": 10,
                        "end_offset": 12,
                        "type": "word",
                        "position": 5,
                        "leftPOS": "NNG(General Noun)",
                        "morphemes": null,
                        "posType": "MORPHEME",
                        "reading": null,
                        "rightPOS": "NNG(General Noun)"
                    },
                    {
                        "token": "가르치",
                        "start_offset": 14,
                        "end_offset": 17,
                        "type": "word",
                        "position": 7,
                        "leftPOS": "VV(Verb)",
                        "morphemes": null,
                        "posType": "MORPHEME",
                        "reading": null,
                        "rightPOS": "VV(Verb)"
                    },
                    {
                        "token": "학원",
                        "start_offset": 19,
                        "end_offset": 21,
                        "type": "word",
                        "position": 9,
                        "leftPOS": "NNG(General Noun)",
                        "morphemes": null,
                        "posType": "MORPHEME",
                        "reading": null,
                        "rightPOS": "NNG(General Noun)"
                    },
                    {
                        "token": "있",
                        "start_offset": 23,
                        "end_offset": 24,
                        "type": "word",
                        "position": 11,
                        "leftPOS": "VA(Adjective)",
                        "morphemes": null,
                        "posType": "MORPHEME",
                        "reading": null,
                        "rightPOS": "VA(Adjective)"
                    }
                ]
            },
            {
                "name": "__anonymous__nori_readingform",
                "tokens": [
                    {
                        "token": "세종",
                        "start_offset": 0,
                        "end_offset": 2,
                        "type": "word",
                        "position": 0,
                        "leftPOS": "NNG(General Noun)",
                        "morphemes": null,
                        "posType": "MORPHEME",
                        "reading": "세종",
                        "rightPOS": "NNG(General Noun)"
                    },
                    {
                        "token": "시",
                        "start_offset": 2,
                        "end_offset": 3,
                        "type": "word",
                        "position": 1,
                        "leftPOS": "NNG(General Noun)",
                        "morphemes": null,
                        "posType": "MORPHEME",
                        "reading": "시",
                        "rightPOS": "NNG(General Noun)"
                    },
                    {
                        "token": "c++",
                        "start_offset": 6,
                        "end_offset": 9,
                        "type": "word",
                        "position": 3,
                        "positionLength": 2,
                        "leftPOS": "NNG(General Noun)",
                        "morphemes": "c+/NNG(General Noun)++/NNG(General Noun)",
                        "posType": "COMPOUND",
                        "reading": null,
                        "rightPOS": "NNG(General Noun)"
                    },
                    {
                        "token": "c+",
                        "start_offset": 6,
                        "end_offset": 8,
                        "type": "word",
                        "position": 3,
                        "leftPOS": "NNG(General Noun)",
                        "morphemes": null,
                        "posType": "MORPHEME",
                        "reading": null,
                        "rightPOS": "NNG(General Noun)"
                    },
                    {
                        "token": "+",
                        "start_offset": 8,
                        "end_offset": 9,
                        "type": "word",
                        "position": 4,
                        "leftPOS": "NNG(General Noun)",
                        "morphemes": null,
                        "posType": "MORPHEME",
                        "reading": null,
                        "rightPOS": "NNG(General Noun)"
                    },
                    {
                        "token": "언어",
                        "start_offset": 10,
                        "end_offset": 12,
                        "type": "word",
                        "position": 5,
                        "leftPOS": "NNG(General Noun)",
                        "morphemes": null,
                        "posType": "MORPHEME",
                        "reading": null,
                        "rightPOS": "NNG(General Noun)"
                    },
                    {
                        "token": "가르치",
                        "start_offset": 14,
                        "end_offset": 17,
                        "type": "word",
                        "position": 7,
                        "leftPOS": "VV(Verb)",
                        "morphemes": null,
                        "posType": "MORPHEME",
                        "reading": null,
                        "rightPOS": "VV(Verb)"
                    },
                    {
                        "token": "학원",
                        "start_offset": 19,
                        "end_offset": 21,
                        "type": "word",
                        "position": 9,
                        "leftPOS": "NNG(General Noun)",
                        "morphemes": null,
                        "posType": "MORPHEME",
                        "reading": null,
                        "rightPOS": "NNG(General Noun)"
                    },
                    {
                        "token": "있",
                        "start_offset": 23,
                        "end_offset": 24,
                        "type": "word",
                        "position": 11,
                        "leftPOS": "VA(Adjective)",
                        "morphemes": null,
                        "posType": "MORPHEME",
                        "reading": null,
                        "rightPOS": "VA(Adjective)"
                    }
                ]
            }
        ]
    }
}

synonyms filter 추가)

주의 할 사항은 user_dic.txt 에 정의 되지 않은 단어의 경우 의도한 결과가 나오지 않을 수 있습니다.

{
    "tokenizer": {
        "type": "nori_tokenizer",
        "decompound_mode": "mixed",
        "discard_punctuation": "true",
        "user_dictionary_rules": ["c++ c+", "c샤프", "c샵", "삼성전자", "세종", "세종시 세종 시"]
    },
    "filter": [
        {
            "type": "synonym_graph",            
            "synonyms": [ 
                "삼성전자, 삼전",
                "c샤프, c샵"
            ]
        },
        {        
            "type": "nori_part_of_speech",
            "stoptags": [
                "E",
                "IC",
                "J",
                "MAG", "MAJ", "MM",
                "SP", "SSC", "SSO", "SC", "SE",
                "XPN", "XSA", "XSN", "XSV",
                "UNA", "NA", "VSV"
            ]
        },
        {
            "type": "nori_readingform"
        }        
    ],
    "text": "世宗市에서 c++, c샤프 언어를 가르치는 삼성전자 학원이 있나요?",
    "attributes" : ["posType", "leftPOS", "rightPOS", "morphemes", "reading"],
    "explain": false        
}

nori_userdict.txt)

user_dictionary_rules 를 user_dictionary 로 변경해서 설정을 하게 되면 아래와 같습니다.

  • "user_dictionary": "nori_userdict.txt"
    • 위 파일은 elasticsearch 가 설치된 위치의 config 경로 아래 위치 합니다.
c++ c+
c샤프
c샵
삼성전자
세종
세종시 세종 시

 

:

[Elasticsearch] Document Indexing 관련

Elastic/Elasticsearch 2021. 4. 5. 15:36

Elasticsearch 에서 Indexing 관련해서 봐두면 좋은 Class 입니다.

 

  • InternalEngine
    • Node 레벨에서 선언 되며, Elasticsearch 에서의 대부분의 Operation 에 대한 정의가 되어 있습니다.
  • NodeClient
    • Elasticsearch Cluster 구성 시 Node 에 해당 합니다.
  • IndexShard
    • 물리적인 Index 의 Operation 에 대한 정의가 되어 있습니다.
  • Translog
    • Commit 되지 않은 색인 작업 내역에 대한 Operation 정의가 되어 있습니다.

Flush 에 대한 대략적인 흐름)

    Commit 하면 tranlog 를 indexWriter 가 segments 파일에 write 하고 tranlog 는 flush 되면서 refresh 동기화가 이루어 집니다.
    (Synced flush 의 경우 refresh 가 먼저 수행 됩니다.)

 

: