Posts from 2016/07

[Elasticsearch] MoreLikeThis API Overview

Elastic/Elasticsearch 2016. 7. 22. 16:21

With the MLT (More Like This) query, a recommendation feature is easy to implement.

So I have loosely paraphrased the documentation for this API here.

Later I plan to share how to implement machine learning or recommendations with elasticsearch + mlt.

Reference)

https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-mlt-query.html

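API description)

More Like This Query

- The MLT query finds documents that are similar to a given set of documents.

- It selects a set of representative terms from the input documents, runs a query with those terms, and returns the results.

- The returned documents are the ones most similar to that set of representative terms.

- The input to an MLT query can be either documents or free text.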

Parameters

Document Input Parameters

- like

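Searches for documents similar to the supplied documents or text.

- unlike

Specifies documents or text whose terms should be excluded from term selection.

- fields

Specifies the fields from which the analyzed text is fetched.

The query is executed against these fields.

- like_text

Additional text to search with, alongside like.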

- ids or docs

@deprecated

Term Selection Parameters

- max_query_terms

The maximum number of query terms selected from the supplied documents or text. (default 25)

- min_term_freq

The minimum term frequency. Terms that occur fewer times than this in the supplied documents or text are ignored. (default 2)

- min_doc_freq

The minimum document frequency. Input terms that match fewer documents than this are ignored. (default 5)

- max_doc_freq

The maximum document frequency. Input terms that match more documents than this are ignored. (default unbounded, 0)

- min_word_length

The minimum term length. Input terms shorter than this are ignored. (default 0)

- max_word_length

The maximum term length. Input terms longer than this are ignored. (default unbounded, 0)

- stop_words

Registers a list of stop words. Terms in this list are ignored.

- analyzer

Specifies the analyzer applied to the supplied documents and text. If it is not specified, the analyzer of the first field in fields is used.
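To make the term selection parameters more concrete, here is a minimal sketch of how min_term_freq, min_word_length, stop_words and max_query_terms narrow down the terms taken from an input text. It is only an illustration under loud assumptions: the class and method names are hypothetical, tokens are split on whitespace instead of going through an analyzer, terms are ranked by plain term frequency (the real query scores terms by tf-idf), and min_doc_freq / max_doc_freq are left out because they need an index.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Simplified illustration only, not the Elasticsearch/Lucene implementation.
final class MltTermSelectionSketch {

  static List<String> selectTerms(String text, int minTermFreq, int minWordLength,
                                  int maxQueryTerms, Set<String> stopWords) {
    // Count how often each whitespace-separated token occurs (term frequency).
    Map<String, Integer> tf = new HashMap<>();
    for (String token : text.toLowerCase().trim().split("\\s+")) {
      if (!token.isEmpty()) {
        tf.merge(token, 1, Integer::sum);
      }
    }

    // Drop terms that are too rare in the input, too short, or stop words.
    List<String> candidates = new ArrayList<>();
    for (Map.Entry<String, Integer> e : tf.entrySet()) {
      String term = e.getKey();
      if (e.getValue() >= minTermFreq
          && term.length() >= minWordLength
          && !stopWords.contains(term)) {
        candidates.add(term);
      }
    }

    // Highest term frequency first; ties are broken alphabetically.
    candidates.sort((a, b) -> {
      int byFreq = Integer.compare(tf.get(b), tf.get(a));
      return byFreq != 0 ? byFreq : a.compareTo(b);
    });

    // Keep at most maxQueryTerms terms.
    int limit = Math.min(maxQueryTerms, candidates.size());
    return new ArrayList<>(candidates.subList(0, limit));
  }
}
```

For example, calling selectTerms with the text "a bb bb ccc ccc ccc d", min_term_freq 2 and min_word_length 2 keeps only "ccc" and "bb", because "a" and "d" occur once.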

Query Formation Parameters

- minimum_should_match

Controls how many of the terms selected from the supplied documents or text must match. (default 30%)

- boost_terms

Sets the boost value applied to the selected terms.

- include

Whether the input documents are included in the search results. (default false)

- boost

Sets the boost value of the whole query. (default 1.0)

Sample QueryDSL)

{
  "query": {
    "more_like_this": {
      "fields": [
        "title"
      ],
      "like": "마스크 수분",
      "min_term_freq": 1,
      "min_doc_freq": 10,
      "min_word_length": 2,
      "include": true
    }
  },
  "from": 0,
  "size": 5,
  "fields": [
    "id",
    "title"
  ]
}

Attribution, NonCommercial, NoDerivs

[Elasticsearch] Converting the Transport Bulk Data Format to the REST Bulk Format

Elastic/Elasticsearch 2016. 7. 22. 10:02

When bulk indexing is implemented in Java, the indexing data format cannot be used as-is for REST bulk indexing.

So I wrote a simple script that converts it.

Reference)

https://www.elastic.co/guide/en/elasticsearch/reference/2.3/docs-bulk.html

Java Bulk Indexing Format)

{ "field1" : "value1" }

Rest Bulk Indexing Format)

{ "index" : { "_index" : "test", "_type" : "type1", "_id" : "1" } }

{ "field1" : "value1" }

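As you can see, the difference is whether the index/type/id meta information is present.

With the Java API, the meta information is set through the API itself. The REST API has no such step, so the information has to be included in the request body as shown above.

Conversion script)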

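#!/bin/bash

# Prepend a bulk action line to every document line of the input file ($1).
header='{ "index" : { "_index" : "INDEX_NAME", "_type" : "TYPE_NAME" } }'

while IFS= read -r line
do
  # printf with quoted variables keeps backslashes and spacing in the JSON intact.
  printf '%s\n' "$header" >> query_result.txt
  printf '%s\n' "$line" >> query_result.txt
done < "$1"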

Run)

$ ./convertJavaToRestFormat.sh query_result.json

Rest Bulk Action)

$ curl -XPOST 'http://localhost:9200/INDEX_NAME/TYPE_NAME/_bulk' --data-binary "@query_result.txt"
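The same conversion can also be sketched in Java, which may be convenient when the documents are already in memory. This is only a rough sketch: the class and method names are hypothetical, and it assumes the index and type names contain no characters that need JSON escaping.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of any Elasticsearch client library.
final class BulkFormatSketch {

  // Returns the REST bulk lines: one action line followed by each document line.
  static List<String> toRestBulk(List<String> docLines, String index, String type) {
    String header = "{ \"index\" : { \"_index\" : \"" + index
        + "\", \"_type\" : \"" + type + "\" } }";
    List<String> out = new ArrayList<>();
    for (String doc : docLines) {
      if (doc.trim().isEmpty()) {
        continue; // skip blank lines so action and source lines stay paired
      }
      out.add(header);
      out.add(doc);
    }
    return out;
  }
}
```

To build the request body, join the returned lines with "\n" and append one more "\n", since the bulk API expects the last line to end with a newline.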


[MySQL] Using the JSON Type

ITWeb/개발일반 2016. 7. 21. 10:12

Handling document data in elasticsearch means working with the JSON format a lot.

So some of the data destined for indexing needs to be stored in MySQL as JSON. MySQL supports JSON as a data type as of 5.7.

Reference)

https://dev.mysql.com/doc/refman/5.7/en/json.html

https://dev.mysql.com/doc/refman/5.7/en/json-function-reference.html

Declaration)

CREATE TABLE XXXX (
  seq BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
  doc JSON NOT NULL,
  PRIMARY KEY (seq)
) ENGINE = InnoDB
  CHARACTER SET utf8
  COLLATE utf8_general_ci;

Insert)

INSERT INTO XXXX(doc) VALUES (?)

Select)

SELECT doc
FROM XXXX

JDBC)

// Insert

PreparedStatement.setObject(1, OBJECT);

// Select

ResultSet.getObject("doc");

- When inserting into the DB through JDBC, check whether the value is a JSON array or a JSON object and bind it accordingly, so that it is stored in the intended form.

Elasticsearch Mapping)

"FIELD_NAME": { "type":"object", "enabled":false }


[Java] Checking Character Encoding

ITWeb/개발일반 2016. 7. 15. 12:49

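When a request is sent as a query string, the browser's encoding settings occasionally cause the string to arrive garbled.

So you need to find out which character set the garbled string was encoded in.

I recorded a basic encoding function to use for that check.

At my age I forget even the easy things, so reviewing and taking notes is all I have left ^^

[Code]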

public static void detectCharset() throws Exception {
  String text = "미미박스";
  String encode = "";
  String [] charsets = {"UTF-8","EUC-KR","ISO-8859-1", "CP1251", "KSC5601"};

  for ( String charset: charsets ) {
    encode = URLEncoder.encode(text, charset);
    LOG.debug("origin["+text+"], "+"encoded["+encode+"], charset["+charset+"]" );
  }
}

[Output]

- origin[미미박스], encoded[%EB%AF%B8%EB%AF%B8%EB%B0%95%EC%8A%A4], charset[UTF-8]

- origin[미미박스], encoded[%B9%CC%B9%CC%B9%DA%BD%BA], charset[EUC-KR]

- origin[미미박스], encoded[%3F%3F%3F%3F], charset[ISO-8859-1]

- origin[미미박스], encoded[%3F%3F%3F%3F], charset[CP1251]

- origin[미미박스], encoded[%B9%CC%B9%CC%B9%DA%BD%BA], charset[KSC5601]
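Going the other way is often what the actual debugging needs: given a percent-encoded value captured from a request, decode it with each candidate charset and see which result is readable. A minimal sketch, assuming the raw encoded value is available; the class and method names are hypothetical.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for eyeballing which charset a value was encoded with.
final class CharsetGuessSketch {

  // Decodes one percent-encoded value with every candidate charset,
  // keeping the candidates in the order they were given.
  static Map<String, String> decodeWithEach(String encoded, String... charsets) {
    Map<String, String> decoded = new LinkedHashMap<>();
    for (String charset : charsets) {
      try {
        decoded.put(charset, URLDecoder.decode(encoded, charset));
      } catch (UnsupportedEncodingException e) {
        throw new IllegalArgumentException("Unsupported charset: " + charset, e);
      }
    }
    return decoded;
  }
}
```

Passing the UTF-8 encoded value from the output above together with the candidates "UTF-8" and "ISO-8859-1" yields the original text only for the UTF-8 entry; the other entry comes back garbled, which identifies the charset.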


[HTML] window.history Spec

ITWeb/개발일반 2016. 7. 13. 17:50

This is not so much the window.history spec as a copy of the History interface definition.

Source)

https://html.spec.whatwg.org/multipage/browsers.html#the-history-interface

Content)

enum ScrollRestoration { "auto", "manual" };

interface History {
  readonly attribute unsigned long length;
  attribute ScrollRestoration scrollRestoration;
  readonly attribute any state;
  void go(optional long delta = 0);
  void back();
  void forward();
  void pushState(any data, DOMString title, optional USVString? url = null);
  void replaceState(any data, DOMString title, optional USVString? url = null);
};

I looked this up because I wanted to change the location.href value without the page reloading.

pushState and replaceState make this possible.

Description)

window . history . length

Returns the number of entries in the joint session history.

window . history . scrollRestoration [ = value ]

Returns the scroll restoration mode of the current entry in the session history.

Can be set, to change the scroll restoration mode of the current entry in the session history.

window . history . state

Returns the current state object.

window . history . go( [ delta ] )

Goes back or forward the specified number of steps in the joint session history.

A zero delta will reload the current page.

If the delta is out of range, does nothing.

window . history . back()

Goes back one step in the joint session history.

If there is no previous page, does nothing.

window . history . forward()

Goes forward one step in the joint session history.

If there is no next page, does nothing.

window . history . pushState(data, title [, url ] )

Pushes the given data onto the session history, with the given title, and, if provided and not null, the given URL.

window . history . replaceState(data, title [, url ] )

Updates the current entry in the session history to have the given data, title, and, if provided and not null, URL.


[Java] Removing Duplicates from a List

ITWeb/개발일반 2016. 7. 11. 18:34

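This is code that turns up readily when you google it.

I'm posting it here as a way to remove duplicate entries from a list.

Thinking about it very simply, you could just run an ordinary sorting algorithm over the list.

Apart from that, it can also be done with Collections, so I wrote that down.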

import java.util.ArrayList;
import java.util.HashSet;

// Keeps the first occurrence of each item, in the original order.
public static ArrayList<String> deDuplicate1(ArrayList<String> list) {
  ArrayList<String> result = new ArrayList<>();
  HashSet<String> set = new HashSet<>();

  for (String item : list) {
    if (!set.contains(item)) {
      result.add(item);
      set.add(item);
    }
  }

  return result;
}

// Shorter, but a HashSet does not guarantee the original order.
public static ArrayList<String> deDuplicate2(ArrayList<String> list) {
  HashSet<String> set = new HashSet<>(list);
  ArrayList<String> result = new ArrayList<>(set);
  return result;
}
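deDuplicate2 builds its result from a HashSet, which makes no ordering guarantee, so the output order can differ from the input. If the order matters, a LinkedHashSet gives the same one-liner while keeping first-insertion order. A minimal sketch; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;

// Hypothetical variant of the methods above.
final class DedupSketch {

  // LinkedHashSet drops duplicates while remembering first-insertion order.
  static List<String> deDuplicateKeepOrder(List<String> list) {
    return new ArrayList<>(new LinkedHashSet<>(list));
  }
}
```

For the input [b, a, b, c, a] this returns [b, a, c], the same result deDuplicate1 produces with its explicit loop.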


[Java] Hash Algorithm Test

ITWeb/개발일반 2016. 7. 8. 11:19

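This is just code that circulates on the internet, copied here for the record.

[Code]

import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class HashGenerator {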

  public static void main (String[] args)
    throws NoSuchAlgorithmException {
    
    String md2 = getHash("test", "md2");
    String md5 = getHash("test", "md5");
    String sha1 = getHash("test", "sha1");
    String sha256 = getHash("test", "sha-256");
    String sha384 = getHash("test", "sha-384");
    String sha512 = getHash("test", "sha-512");

    // "MD2", "MD5", "SHA1", "SHA-256", "SHA-384", "SHA-512"
    System.out.println("MD2     : [" + md2 + "](" + md2.length() + ")");
    System.out.println("MD5     : [" + md5 + "](" + md5.length() + ")");
    System.out.println("SHA1    : [" + sha1 + "](" + sha1.length() + ")");
    System.out.println("SHA-256 : [" + sha256 + "](" + sha256.length() + ")");
    System.out.println("SHA-384 : [" + sha384 + "](" + sha384.length() + ")");
    System.out.println("SHA-512 : [" + sha512 + "](" + sha512.length() + ")");
  }

  public static String getHash(String message, String algorithm)
    throws NoSuchAlgorithmException {
    
    try {
      byte[] buffer = message.getBytes();
      MessageDigest md = MessageDigest.getInstance(algorithm);
      md.update(buffer);
      byte[] digest = md.digest();
      String hex = "";
      
      for(int i = 0 ; i < digest.length ; i++) {
        int b = digest[i] & 0xff;
        if (Integer.toHexString(b).length() == 1) hex = hex + "0";
        hex  = hex + Integer.toHexString(b);
      }
      
      return hex;
    } catch(NoSuchAlgorithmException e) {
      e.printStackTrace();
    }
    
    return null;
  }
}

[Output]

MD2 : [dd34716876364a02d0195e2fb9ae2d1b](32)

MD5 : [098f6bcd4621d373cade4e832627b4f6](32)

SHA1 : [a94a8fe5ccb19ba61c4c0873d391e987982fbbd3](40)

SHA-256 : [9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08](64)

SHA-384 : [768412320f7b0aa5812fce428dc4706b3cae50e02a64caa16a782249bfe8efc4b7ef1ccb126255d196047dfedf17a0a9](96)

SHA-512 : [ee26b0dd4af7e749aa1a8ee3c10ae9923f618980772e473f8819a5d4940e0db27ac185f8a0e1d5f84f88bc887fd67b143732c304cc5fa9ad8e6f57f50028a8ff](128)
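One detail worth noting: getHash calls message.getBytes() without a charset, so the digest of non-ASCII input depends on the platform default encoding. A minimal sketch of a variant that pins the encoding to UTF-8 and builds the hex string with a StringBuilder; the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical variant of getHash with an explicit charset.
final class HashSketch {

  static String hexDigest(String message, String algorithm) {
    try {
      MessageDigest md = MessageDigest.getInstance(algorithm);
      // Encode explicitly as UTF-8 so the result is the same on every platform.
      byte[] digest = md.digest(message.getBytes(StandardCharsets.UTF_8));
      StringBuilder hex = new StringBuilder(digest.length * 2);
      for (byte b : digest) {
        // Mask to an unsigned value and pad to two lowercase hex digits.
        hex.append(String.format("%02x", b & 0xff));
      }
      return hex.toString();
    } catch (NoSuchAlgorithmException e) {
      throw new IllegalArgumentException("Unknown algorithm: " + algorithm, e);
    }
  }
}
```

For the ASCII input "test" this gives the same digests as the output above, for example 098f6bcd4621d373cade4e832627b4f6 for MD5.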


jjeong
