[Elasticsearch] Nori Analyzer Test

Elastic/Elasticsearch 2021. 4. 5. 17:23

This is a basic test of the Nori analyzer.

Official reference documentation)

www.elastic.co/guide/en/elasticsearch/plugins/current/analysis-nori.html

 

Default dictionary)

bitbucket.org/eunjeon/mecab-ko-dic/src/master/

 

POS Tag)

lucene.apache.org/core/8_8_0/analyzers-nori/org/apache/lucene/analysis/ko/POS.Tag.html

 

Note that when declaring the filter, the parameter must be stoptags, not postags.

I had mistakenly written postags in an earlier version. (It has been corrected.)

 

The following was tested by calling the _analyze API as a RESTful request.

{
    "tokenizer": {
        "type": "nori_tokenizer",
        "decompound_mode": "mixed",
        "discard_punctuation": "true",
        "user_dictionary_rules": ["c++ c+ +", "C샤프", "세종", "세종시 세종 시"]
    },
    "filter": [
        {        
            "type": "nori_part_of_speech",
            "stoptags": [
                "E",
                "IC",
                "J",
                "MAG", "MAJ", "MM",
                "SP", "SSC", "SSO", "SC", "SE",
                "XPN", "XSA", "XSN", "XSV",
                "UNA", "NA", "VSV"
            ]
        },
        {
            "type": "nori_readingform"
        }
    ],
    "text": "世宗市에서 c++ 언어를 가르치는 학원이 있나요?",
    "attributes" : ["posType", "leftPOS", "rightPOS", "morphemes", "reading"],
    "explain": true        
}
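
For readers who would rather script the call than use a REST client, the request above could be sent roughly as follows. This is a minimal sketch using only the Python standard library. The host http://localhost:9200, the helper names build_analyze_body and post_analyze, and the lack of authentication are assumptions and were not part of the original test.

```python
import json
import urllib.request

# Stop tags copied from the request body above.
STOPTAGS = [
    "E", "IC", "J",
    "MAG", "MAJ", "MM",
    "SP", "SSC", "SSO", "SC", "SE",
    "XPN", "XSA", "XSN", "XSV",
    "UNA", "NA", "VSV",
]


def build_analyze_body(text, explain=True):
    """Build the same _analyze request body as the one shown above."""
    return {
        "tokenizer": {
            "type": "nori_tokenizer",
            "decompound_mode": "mixed",
            "discard_punctuation": "true",
            "user_dictionary_rules": ["c++ c+ +", "C샤프", "세종", "세종시 세종 시"],
        },
        "filter": [
            {"type": "nori_part_of_speech", "stoptags": STOPTAGS},
            {"type": "nori_readingform"},
        ],
        "text": text,
        "attributes": ["posType", "leftPOS", "rightPOS", "morphemes", "reading"],
        "explain": explain,
    }


def post_analyze(body, base_url="http://localhost:9200"):
    """POST the body to _analyze and return the parsed JSON response.

    base_url is an assumption: a local, unsecured node with the
    analysis-nori plugin installed.
    """
    request = urllib.request.Request(
        base_url + "/_analyze",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

With a node running, post_analyze(build_analyze_body("世宗市에서 c++ 언어를 가르치는 학원이 있나요?")) would issue the same request as the JSON body above.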

Execution result)

{
    "detail": {
        "custom_analyzer": true,
        "charfilters": [],
        "tokenizer": {
            "name": "__anonymous__nori_tokenizer",
            "tokens": [
                {
                    "token": "世宗",
                    "start_offset": 0,
                    "end_offset": 2,
                    "type": "word",
                    "position": 0,
                    "leftPOS": "NNG(General Noun)",
                    "morphemes": null,
                    "posType": "MORPHEME",
                    "reading": "세종",
                    "rightPOS": "NNG(General Noun)"
                },
                {
                    "token": "市",
                    "start_offset": 2,
                    "end_offset": 3,
                    "type": "word",
                    "position": 1,
                    "leftPOS": "NNG(General Noun)",
                    "morphemes": null,
                    "posType": "MORPHEME",
                    "reading": "시",
                    "rightPOS": "NNG(General Noun)"
                },
                {
                    "token": "에서",
                    "start_offset": 3,
                    "end_offset": 5,
                    "type": "word",
                    "position": 2,
                    "leftPOS": "J(Ending Particle)",
                    "morphemes": null,
                    "posType": "MORPHEME",
                    "reading": null,
                    "rightPOS": "J(Ending Particle)"
                },
                {
                    "token": "c++",
                    "start_offset": 6,
                    "end_offset": 9,
                    "type": "word",
                    "position": 3,
                    "positionLength": 2,
                    "leftPOS": "NNG(General Noun)",
                    "morphemes": "c+/NNG(General Noun)++/NNG(General Noun)",
                    "posType": "COMPOUND",
                    "reading": null,
                    "rightPOS": "NNG(General Noun)"
                },
                {
                    "token": "c+",
                    "start_offset": 6,
                    "end_offset": 8,
                    "type": "word",
                    "position": 3,
                    "leftPOS": "NNG(General Noun)",
                    "morphemes": null,
                    "posType": "MORPHEME",
                    "reading": null,
                    "rightPOS": "NNG(General Noun)"
                },
                {
                    "token": "+",
                    "start_offset": 8,
                    "end_offset": 9,
                    "type": "word",
                    "position": 4,
                    "leftPOS": "NNG(General Noun)",
                    "morphemes": null,
                    "posType": "MORPHEME",
                    "reading": null,
                    "rightPOS": "NNG(General Noun)"
                },
                {
                    "token": "언어",
                    "start_offset": 10,
                    "end_offset": 12,
                    "type": "word",
                    "position": 5,
                    "leftPOS": "NNG(General Noun)",
                    "morphemes": null,
                    "posType": "MORPHEME",
                    "reading": null,
                    "rightPOS": "NNG(General Noun)"
                },
                {
                    "token": "를",
                    "start_offset": 12,
                    "end_offset": 13,
                    "type": "word",
                    "position": 6,
                    "leftPOS": "J(Ending Particle)",
                    "morphemes": null,
                    "posType": "MORPHEME",
                    "reading": null,
                    "rightPOS": "J(Ending Particle)"
                },
                {
                    "token": "가르치",
                    "start_offset": 14,
                    "end_offset": 17,
                    "type": "word",
                    "position": 7,
                    "leftPOS": "VV(Verb)",
                    "morphemes": null,
                    "posType": "MORPHEME",
                    "reading": null,
                    "rightPOS": "VV(Verb)"
                },
                {
                    "token": "는",
                    "start_offset": 17,
                    "end_offset": 18,
                    "type": "word",
                    "position": 8,
                    "leftPOS": "E(Verbal endings)",
                    "morphemes": null,
                    "posType": "MORPHEME",
                    "reading": null,
                    "rightPOS": "E(Verbal endings)"
                },
                {
                    "token": "학원",
                    "start_offset": 19,
                    "end_offset": 21,
                    "type": "word",
                    "position": 9,
                    "leftPOS": "NNG(General Noun)",
                    "morphemes": null,
                    "posType": "MORPHEME",
                    "reading": null,
                    "rightPOS": "NNG(General Noun)"
                },
                {
                    "token": "이",
                    "start_offset": 21,
                    "end_offset": 22,
                    "type": "word",
                    "position": 10,
                    "leftPOS": "J(Ending Particle)",
                    "morphemes": null,
                    "posType": "MORPHEME",
                    "reading": null,
                    "rightPOS": "J(Ending Particle)"
                },
                {
                    "token": "있",
                    "start_offset": 23,
                    "end_offset": 24,
                    "type": "word",
                    "position": 11,
                    "leftPOS": "VA(Adjective)",
                    "morphemes": null,
                    "posType": "MORPHEME",
                    "reading": null,
                    "rightPOS": "VA(Adjective)"
                },
                {
                    "token": "나요",
                    "start_offset": 24,
                    "end_offset": 26,
                    "type": "word",
                    "position": 12,
                    "leftPOS": "E(Verbal endings)",
                    "morphemes": null,
                    "posType": "MORPHEME",
                    "reading": null,
                    "rightPOS": "E(Verbal endings)"
                }
            ]
        },
        "tokenfilters": [
            {
                "name": "__anonymous__nori_part_of_speech",
                "tokens": [
                    {
                        "token": "世宗",
                        "start_offset": 0,
                        "end_offset": 2,
                        "type": "word",
                        "position": 0,
                        "leftPOS": "NNG(General Noun)",
                        "morphemes": null,
                        "posType": "MORPHEME",
                        "reading": "세종",
                        "rightPOS": "NNG(General Noun)"
                    },
                    {
                        "token": "市",
                        "start_offset": 2,
                        "end_offset": 3,
                        "type": "word",
                        "position": 1,
                        "leftPOS": "NNG(General Noun)",
                        "morphemes": null,
                        "posType": "MORPHEME",
                        "reading": "시",
                        "rightPOS": "NNG(General Noun)"
                    },
                    {
                        "token": "c++",
                        "start_offset": 6,
                        "end_offset": 9,
                        "type": "word",
                        "position": 3,
                        "positionLength": 2,
                        "leftPOS": "NNG(General Noun)",
                        "morphemes": "c+/NNG(General Noun)++/NNG(General Noun)",
                        "posType": "COMPOUND",
                        "reading": null,
                        "rightPOS": "NNG(General Noun)"
                    },
                    {
                        "token": "c+",
                        "start_offset": 6,
                        "end_offset": 8,
                        "type": "word",
                        "position": 3,
                        "leftPOS": "NNG(General Noun)",
                        "morphemes": null,
                        "posType": "MORPHEME",
                        "reading": null,
                        "rightPOS": "NNG(General Noun)"
                    },
                    {
                        "token": "+",
                        "start_offset": 8,
                        "end_offset": 9,
                        "type": "word",
                        "position": 4,
                        "leftPOS": "NNG(General Noun)",
                        "morphemes": null,
                        "posType": "MORPHEME",
                        "reading": null,
                        "rightPOS": "NNG(General Noun)"
                    },
                    {
                        "token": "언어",
                        "start_offset": 10,
                        "end_offset": 12,
                        "type": "word",
                        "position": 5,
                        "leftPOS": "NNG(General Noun)",
                        "morphemes": null,
                        "posType": "MORPHEME",
                        "reading": null,
                        "rightPOS": "NNG(General Noun)"
                    },
                    {
                        "token": "가르치",
                        "start_offset": 14,
                        "end_offset": 17,
                        "type": "word",
                        "position": 7,
                        "leftPOS": "VV(Verb)",
                        "morphemes": null,
                        "posType": "MORPHEME",
                        "reading": null,
                        "rightPOS": "VV(Verb)"
                    },
                    {
                        "token": "학원",
                        "start_offset": 19,
                        "end_offset": 21,
                        "type": "word",
                        "position": 9,
                        "leftPOS": "NNG(General Noun)",
                        "morphemes": null,
                        "posType": "MORPHEME",
                        "reading": null,
                        "rightPOS": "NNG(General Noun)"
                    },
                    {
                        "token": "있",
                        "start_offset": 23,
                        "end_offset": 24,
                        "type": "word",
                        "position": 11,
                        "leftPOS": "VA(Adjective)",
                        "morphemes": null,
                        "posType": "MORPHEME",
                        "reading": null,
                        "rightPOS": "VA(Adjective)"
                    }
                ]
            },
            {
                "name": "__anonymous__nori_readingform",
                "tokens": [
                    {
                        "token": "세종",
                        "start_offset": 0,
                        "end_offset": 2,
                        "type": "word",
                        "position": 0,
                        "leftPOS": "NNG(General Noun)",
                        "morphemes": null,
                        "posType": "MORPHEME",
                        "reading": "세종",
                        "rightPOS": "NNG(General Noun)"
                    },
                    {
                        "token": "시",
                        "start_offset": 2,
                        "end_offset": 3,
                        "type": "word",
                        "position": 1,
                        "leftPOS": "NNG(General Noun)",
                        "morphemes": null,
                        "posType": "MORPHEME",
                        "reading": "시",
                        "rightPOS": "NNG(General Noun)"
                    },
                    {
                        "token": "c++",
                        "start_offset": 6,
                        "end_offset": 9,
                        "type": "word",
                        "position": 3,
                        "positionLength": 2,
                        "leftPOS": "NNG(General Noun)",
                        "morphemes": "c+/NNG(General Noun)++/NNG(General Noun)",
                        "posType": "COMPOUND",
                        "reading": null,
                        "rightPOS": "NNG(General Noun)"
                    },
                    {
                        "token": "c+",
                        "start_offset": 6,
                        "end_offset": 8,
                        "type": "word",
                        "position": 3,
                        "leftPOS": "NNG(General Noun)",
                        "morphemes": null,
                        "posType": "MORPHEME",
                        "reading": null,
                        "rightPOS": "NNG(General Noun)"
                    },
                    {
                        "token": "+",
                        "start_offset": 8,
                        "end_offset": 9,
                        "type": "word",
                        "position": 4,
                        "leftPOS": "NNG(General Noun)",
                        "morphemes": null,
                        "posType": "MORPHEME",
                        "reading": null,
                        "rightPOS": "NNG(General Noun)"
                    },
                    {
                        "token": "언어",
                        "start_offset": 10,
                        "end_offset": 12,
                        "type": "word",
                        "position": 5,
                        "leftPOS": "NNG(General Noun)",
                        "morphemes": null,
                        "posType": "MORPHEME",
                        "reading": null,
                        "rightPOS": "NNG(General Noun)"
                    },
                    {
                        "token": "가르치",
                        "start_offset": 14,
                        "end_offset": 17,
                        "type": "word",
                        "position": 7,
                        "leftPOS": "VV(Verb)",
                        "morphemes": null,
                        "posType": "MORPHEME",
                        "reading": null,
                        "rightPOS": "VV(Verb)"
                    },
                    {
                        "token": "학원",
                        "start_offset": 19,
                        "end_offset": 21,
                        "type": "word",
                        "position": 9,
                        "leftPOS": "NNG(General Noun)",
                        "morphemes": null,
                        "posType": "MORPHEME",
                        "reading": null,
                        "rightPOS": "NNG(General Noun)"
                    },
                    {
                        "token": "있",
                        "start_offset": 23,
                        "end_offset": 24,
                        "type": "word",
                        "position": 11,
                        "leftPOS": "VA(Adjective)",
                        "morphemes": null,
                        "posType": "MORPHEME",
                        "reading": null,
                        "rightPOS": "VA(Adjective)"
                    }
                ]
            }
        ]
    }
}
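
To make the three stages in the result easier to follow, the two filters can be imitated in plain Python. This is only a simulation over the tokenizer output listed above. It assumes that nori_part_of_speech drops a token when its leftPOS tag is in stoptags and that nori_readingform replaces a token with its reading when one exists. The function names are hypothetical, and this is not how the plugin itself is implemented.

```python
# (token, leftPOS tag, reading) copied from the tokenizer output above.
TOKENIZER_OUTPUT = [
    ("世宗", "NNG", "세종"),
    ("市", "NNG", "시"),
    ("에서", "J", None),
    ("c++", "NNG", None),
    ("c+", "NNG", None),
    ("+", "NNG", None),
    ("언어", "NNG", None),
    ("를", "J", None),
    ("가르치", "VV", None),
    ("는", "E", None),
    ("학원", "NNG", None),
    ("이", "J", None),
    ("있", "VA", None),
    ("나요", "E", None),
]

# Same stop tags as in the request.
STOPTAGS = {
    "E", "IC", "J",
    "MAG", "MAJ", "MM",
    "SP", "SSC", "SSO", "SC", "SE",
    "XPN", "XSA", "XSN", "XSV",
    "UNA", "NA", "VSV",
}


def drop_stoptags(tokens, stoptags):
    """Imitate nori_part_of_speech: keep tokens whose tag is not a stop tag."""
    return [t for t in tokens if t[1] not in stoptags]


def apply_reading(tokens):
    """Imitate nori_readingform: use the reading as the token text if present."""
    return [(reading or token, tag, reading) for token, tag, reading in tokens]


def simulate(tokens):
    """Run both imitated filters in the order used in the request."""
    filtered = apply_reading(drop_stoptags(tokens, STOPTAGS))
    return [token for token, _, _ in filtered]


print(simulate(TOKENIZER_OUTPUT))
# ['세종', '시', 'c++', 'c+', '+', '언어', '가르치', '학원', '있']
```

The printed list matches the tokens of the final nori_readingform stage in the result above: the particles (J) and endings (E) are gone, and the Hanja tokens 世宗 and 市 have been replaced by their Hangul readings.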

Adding a synonyms filter)

Note that words not defined in the user dictionary (user_dic.txt) may not produce the intended result.

{
    "tokenizer": {
        "type": "nori_tokenizer",
        "decompound_mode": "mixed",
        "discard_punctuation": "true",
        "user_dictionary_rules": ["c++ c+", "c샤프", "c샵", "삼성전자", "세종", "세종시 세종 시"]
    },
    "filter": [
        {
            "type": "synonym_graph",            
            "synonyms": [ 
                "삼성전자, 삼전",
                "c샤프, c샵"
            ]
        },
        {        
            "type": "nori_part_of_speech",
            "stoptags": [
                "E",
                "IC",
                "J",
                "MAG", "MAJ", "MM",
                "SP", "SSC", "SSO", "SC", "SE",
                "XPN", "XSA", "XSN", "XSV",
                "UNA", "NA", "VSV"
            ]
        },
        {
            "type": "nori_readingform"
        }        
    ],
    "text": "世宗市에서 c++, c샤프 언어를 가르치는 삼성전자 학원이 있나요?",
    "attributes" : ["posType", "leftPOS", "rightPOS", "morphemes", "reading"],
    "explain": false        
}
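
The two synonym rules in this request use the comma-separated form, where every term in a group is treated as equivalent. The sketch below only illustrates that grouping on whole terms. It assumes the default expanding behaviour and ignores the fact that the real synonym_graph filter operates on tokens produced by the tokenizer, which is presumably why terms missing from the user dictionary can give unexpected results. parse_synonym_rules and expand are hypothetical helpers, not Elasticsearch APIs.

```python
# Synonym rules copied from the request above.
SYNONYM_RULES = [
    "삼성전자, 삼전",
    "c샤프, c샵",
]


def parse_synonym_rules(rules):
    """Map each term to all terms of its comma-separated group, in rule order."""
    mapping = {}
    for rule in rules:
        group = [term.strip() for term in rule.split(",") if term.strip()]
        for term in group:
            expansions = mapping.setdefault(term, [])
            for other in group:
                if other not in expansions:
                    expansions.append(other)
    return mapping


def expand(term, mapping):
    """Return the equivalent terms for a term, or the term itself if none."""
    return mapping.get(term, [term])


mapping = parse_synonym_rules(SYNONYM_RULES)
print(expand("삼전", mapping))  # ['삼성전자', '삼전']
print(expand("학원", mapping))  # ['학원']
```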

nori_userdict.txt)

If you replace user_dictionary_rules with user_dictionary, the setting and the file contents look like this.

  • "user_dictionary": "nori_userdict.txt"
    • This file is located under the config directory of the Elasticsearch installation.
c++ c+
c샤프
c샵
삼성전자
세종
세종시 세종 시
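
Each entry of user_dictionary_rules corresponds to one line of the file, so the file can be generated from the same list. The following is a small sketch under a few assumptions: UTF-8 encoding, hypothetical helper names, and a config directory that the caller passes in rather than one guessed here.

```python
from pathlib import Path

# Rules copied from the synonym request above; one rule per file line.
RULES = ["c++ c+", "c샤프", "c샵", "삼성전자", "세종", "세종시 세종 시"]


def render_user_dictionary(rules):
    """Render the rules as file contents, one rule per line."""
    return "\n".join(rules) + "\n"


def write_user_dictionary(config_dir, rules, filename="nori_userdict.txt"):
    """Write the dictionary file under the given Elasticsearch config directory."""
    path = Path(config_dir) / filename
    path.write_text(render_user_dictionary(rules), encoding="utf-8")
    return path


def tokenizer_with_dictionary_file(filename="nori_userdict.txt"):
    """Tokenizer definition that points at the file instead of inline rules."""
    return {
        "type": "nori_tokenizer",
        "decompound_mode": "mixed",
        "discard_punctuation": "true",
        "user_dictionary": filename,
    }
```

Keeping the rules in one list and rendering the file from it avoids the inline rules and the file drifting apart while testing.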

 
