15 posts tagged 'apache'

  1. 2019.11.05 [httpcomponents-client] CloseableHttpClient - Accept Encoding
  2. 2017.02.01 [Apache Mahout] GenericDataModel example code.
  3. 2017.01.24 [Search Recommendation] Basic recommendations using Apache Mahout + Elastic Stack
  4. 2015.05.27 [Apache Tajo] Apache Tajo Desktop + Zeppelin integration
  5. 2015.05.15 [Elasticsearch] Apache Tajo & Elasticsearch Korean README
  6. 2015.05.14 [Elasticsearch] Collaborate Apache Tajo + Available SQL on Elasticsearch
  7. 2015.03.31 Challenge! Apache Tajo Contributor.
  8. 2015.03.27 [Elasticsearch] Using elasticsearch as an external storage for apache tajo
  9. 2014.08.18 apache commons cli maven
  10. 2013.03.19 apache so module guide.

[httpcomponents-client] CloseableHttpClient - Accept Encoding

ITWeb/General Development 2019. 11. 5. 10:53

Doing a lot of RESTful communication, I end up using httpclient quite heavily.

Among the httpclients I use is CloseableHttpClient, and for this client the Accept-Encoding setting is enabled by default.

So, to help my memory, I'm writing it down here once again.

 

Related earlier post)

https://jjeong.tistory.com/1369

 

HttpClientBuilder.java)

public CloseableHttpClient build() {
... (snip) ...
            if (!contentCompressionDisabled) {
                if (contentDecoderMap != null) {
                    final List<String> encodings = new ArrayList<String>(contentDecoderMap.keySet());
                    Collections.sort(encodings);
                    b.add(new RequestAcceptEncoding(encodings));
                } else {
                    b.add(new RequestAcceptEncoding());
                }
            }
            if (!authCachingDisabled) {
                b.add(new RequestAuthCache());
            }
            if (!cookieManagementDisabled) {
                b.add(new ResponseProcessCookies());
            }
            if (!contentCompressionDisabled) {
                if (contentDecoderMap != null) {
                    final RegistryBuilder<InputStreamFactory> b2 = RegistryBuilder.create();
                    for (final Map.Entry<String, InputStreamFactory> entry: contentDecoderMap.entrySet()) {
                        b2.register(entry.getKey(), entry.getValue());
                    }
                    b.add(new ResponseContentEncoding(b2.build()));
                } else {
                    b.add(new ResponseContentEncoding());
                }
            }
... (snip) ...
    }

RequestAcceptEncoding.java)

... (snip) ...
    public RequestAcceptEncoding(final List<String> encodings) {
        if (encodings != null && !encodings.isEmpty()) {
            final StringBuilder buf = new StringBuilder();
            for (int i = 0; i < encodings.size(); i++) {
                if (i > 0) {
                    buf.append(",");
                }
                buf.append(encodings.get(i));
            }
            this.acceptEncoding = buf.toString();
        } else {
            this.acceptEncoding = "gzip,deflate";
        }
    }
... (snip) ...

The point is that the client must ask the server to compress the content before the server will actually send it compressed.

No matter how carefully the server is configured for compressed transfer, if the request doesn't ask for it, the response can only come back as plain text.
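For reference, this negotiation can be toggled on HttpClientBuilder. Below is a minimal sketch I put together (the URL is only a placeholder); disableContentCompression() turns off both the automatic Accept-Encoding header and the transparent response decompression seen in the build() source above.

import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;

public class AcceptEncodingDemo {
    public static void main(String[] args) throws Exception {
        // Default client: sends "Accept-Encoding: gzip,deflate" and decompresses responses.
        CloseableHttpClient compressing = HttpClients.createDefault();

        // Client with the automatic Accept-Encoding / decompression turned off.
        CloseableHttpClient plain = HttpClients.custom()
                .disableContentCompression()
                .build();

        try (CloseableHttpResponse res = plain.execute(new HttpGet("http://example.com"))) {
            System.out.println(res.getStatusLine());
        }

        compressing.close();
        plain.close();
    }
}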

 

Reference)

https://developer.mozilla.org/ko/docs/Web/HTTP/Headers/Accept-Encoding

 

Accept-Encoding

The Accept-Encoding request HTTP header indicates the content encoding, usually a compression algorithm, that the client can understand. Using content negotiation, the server selects one of the proposals, uses it, and informs the client of that choice with the Content-Encoding response header.


 

:

[Apache Mahout] GenericDataModel example code.

ITWeb/General Development 2017. 2. 1. 11:52

Apache Mahout's DataModel implementations are included in the package of the project below.


[Project]

- mahout-mr 


[Package]

- org.apache.mahout.cf.taste.impl.model.*


[Example]

FastByIDMap<PreferenceArray> result = new FastByIDMap<PreferenceArray>();
List<Preference> prefsList = Lists.newArrayList();
// The constructor is GenericPreference(long userID, long itemID, float value),
// so the value needs a float literal.
prefsList.add(new GenericPreference(1645390, 123456, 0.4f));
result.put(1645390, new GenericUserPreferenceArray(prefsList));

return new ExampleRecommender(new GenericDataModel(result));


The code itself is so simple that I'll stop here.
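That said, ExampleRecommender above is my own wrapper class, so here is a minimal self-contained sketch (my own illustration, not Mahout sample code) feeding a GenericDataModel straight into a GenericUserBasedRecommender:

import java.util.List;
import com.google.common.collect.Lists;
import org.apache.mahout.cf.taste.impl.common.FastByIDMap;
import org.apache.mahout.cf.taste.impl.model.GenericDataModel;
import org.apache.mahout.cf.taste.impl.model.GenericPreference;
import org.apache.mahout.cf.taste.impl.model.GenericUserPreferenceArray;
import org.apache.mahout.cf.taste.impl.neighborhood.ThresholdUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.model.Preference;
import org.apache.mahout.cf.taste.model.PreferenceArray;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class GenericDataModelDemo {
    public static void main(String[] args) throws Exception {
        FastByIDMap<PreferenceArray> prefs = new FastByIDMap<PreferenceArray>();

        // User 1 rated items 10 and 11.
        List<Preference> user1 = Lists.newArrayList();
        user1.add(new GenericPreference(1, 10, 1.0f));
        user1.add(new GenericPreference(1, 11, 2.0f));
        prefs.put(1, new GenericUserPreferenceArray(user1));

        // User 2 rated the same items plus item 12.
        List<Preference> user2 = Lists.newArrayList();
        user2.add(new GenericPreference(2, 10, 1.0f));
        user2.add(new GenericPreference(2, 11, 2.0f));
        user2.add(new GenericPreference(2, 12, 5.0f));
        prefs.put(2, new GenericUserPreferenceArray(user2));

        DataModel model = new GenericDataModel(prefs);
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        UserNeighborhood neighborhood = new ThresholdUserNeighborhood(0.1, similarity, model);
        GenericUserBasedRecommender recommender =
                new GenericUserBasedRecommender(model, neighborhood, similarity);

        // User 1 should get item 12 recommended via the similar user 2.
        for (RecommendedItem item : recommender.recommend(1, 3)) {
            System.out.println(item);
        }
    }
}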


:

[Search Recommendation] Basic recommendations using Apache Mahout + Elastic Stack

Elastic/Elasticsearch 2017. 1. 24. 11:47

In this post I'd like to cover generating recommendation data with the Elastic Stack and Apache Mahout.

You can build a recommendation data mart with the Elastic Stack alone, through cohort analysis.

To improve the quality of the recommendation data, though, we'll bring in Apache Mahout.


To keep it approachable for everyone, this post stays at the Hello World! level.


[Elastic Stack]

https://www.elastic.co/products


[Apache mahout]

https://mahout.apache.org/


Both solutions are open source, and the example code shipped with their sources is good enough that anyone can pick them up easily.


Step 1)

Collect logs with Elasticsearch + Logstash + Kibana and generate the raw data for recommendations.


User item click log -> Logstash collect -> Elasticsearch store -> Kibana visualize -> CSV download


From the collected data, the fields we extract are user id + item id + click count.

Below is an example of the Query DSL used in Kibana.

{
  "size": 0,
  "query": {
    "filtered": {
      "query": {
        "query_string": {
          "query": "cp:CLK AND id:[0 TO *]",
          "analyze_wildcard": true
        }
      },
      "filter": {
        "bool": {
          "must": [
            {
              "range": {
                "time": {
                  "gte": 1485010800000,
                  "lte": 1485097199999,
                  "format": "epoch_millis"
                }
              }
            }
          ],
          "must_not": []
        }
      }
    }
  },
  "aggs": {
    "2": {
      "terms": {
        "field": "user_id",
        "size": 30000,
        "order": { "_count": "desc" }
      },
      "aggs": {
        "3": {
          "terms": {
            "field": "item_id",
            "size": 10,
            "order": { "_count": "desc" }
          }
        }
      }
    }
  }
}


Step 2)

The recommender we'll use from Apache Mahout is UserBasedRecommender.

As the sample code shows, the dataset.csv file looks like this:

- Creating a User-Based Recommender in 5 minutes


1,10,1.0
1,11,2.0
1,12,5.0
1,13,5.0

Format: userId,itemId,ratingValue


Step 1 produced user_id, item_id, click_count to match this format; a loading sketch follows below.

We'll now run a UserBasedRecommender on this data.
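For reference, a CSV in this format can be loaded directly with Mahout's FileDataModel; a minimal sketch (the file path is hypothetical):

import java.io.File;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.model.DataModel;

public class DataModelLoader {
    public static DataModel load() throws Exception {
        // Each line is userId,itemId,ratingValue.
        return new FileDataModel(new File("data/user-to-item.csv"));
    }
}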


Step 3)

Good sample code can be found below.

https://github.com/apache/mahout/tree/master/examples/src/main/java/org/apache/mahout


Create a Main class and run it with the data prepared in Step 2.

I implemented my own runner on top of UserBasedRecommender.

Anyone can do this part easily; refer to classes such as BookCrossingRecommender in the examples.


UserBasedRecommenderRunner runner = new UserBasedRecommenderRunner();
Recommender recommender = runner.buildRecommender();

// Top 3 recommended items for user 710039
List<RecommendedItem> recommendations = recommender.recommend(710039, 3);

for (RecommendedItem recommendation : recommendations) {
    LOG.debug("Recommended item : {}", recommendation);
}


[Execution log]

11:39:31.527 [main] INFO  o.a.m.c.t.i.model.file.FileDataModel - Creating FileDataModel for file /git/prototype/data/user-to-item.csv
11:39:31.626 [main] INFO  o.a.m.c.t.i.model.file.FileDataModel - Reading file info...
11:39:31.765 [main] INFO  o.a.m.c.t.i.model.file.FileDataModel - Read lines: 63675
11:39:31.896 [main] INFO  o.a.m.c.t.i.model.GenericDataModel - Processed 10000 users
11:39:31.911 [main] INFO  o.a.m.c.t.i.model.GenericDataModel - Processed 19124 users
11:39:31.949 [main] DEBUG o.a.m.c.t.i.r.GenericUserBasedRecommender - Recommending items for user ID '710039'
11:39:31.965 [main] DEBUG o.a.m.c.t.i.r.GenericUserBasedRecommender - Recommendations are: [RecommendedItem[item:35222, value:4.0], RecommendedItem[item:12260, value:4.0], RecommendedItem[item:12223, value:1.5]]
11:39:31.966 [main] DEBUG o.h.p.mahout.meme.MemeProductRunner - Recommended item : RecommendedItem[item:35222, value:4.0]
11:39:31.966 [main] DEBUG o.h.p.mahout.meme.MemeProductRunner - Recommended item : RecommendedItem[item:12260, value:4.0]
11:39:31.967 [main] DEBUG o.h.p.mahout.meme.MemeProductRunner - Recommended item : RecommendedItem[item:12223, value:1.5]


[Recommender]

similarity = new PearsonCorrelationSimilarity(dataModel);

// Build the neighborhood from the N nearest users
// UserNeighborhood neighborhood = new NearestNUserNeighborhood(2, similarity, dataModel, 0.2);

// Build the neighborhood from every user above a similarity threshold; the last argument is the user sampling rate (0.1 = 10%)
// UserNeighborhood neighborhood = new ThresholdUserNeighborhood(0.1, similarity, dataModel, 0.1);

UserNeighborhood neighborhood = new ThresholdUserNeighborhood(0.2, similarity, dataModel, 1.0);
recommender = new GenericUserBasedRecommender(dataModel, neighborhood, similarity);


- The data set was too small, so ThresholdUserNeighborhood was used.


So that's a very simple way to run CF over search click logs and build recommendation data.

You can also evaluate the recommendation data you've built.

Again, refer to the xxxxxxEvaluator classes in examples and implement one; a sketch follows below.
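As a starting point, here is a minimal evaluation sketch of my own (assuming the same user-to-item.csv as above); it scores the recommender with AverageAbsoluteDifferenceRecommenderEvaluator by training on 70% of each user's preferences and testing on the rest:

import java.io.File;
import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.eval.RecommenderBuilder;
import org.apache.mahout.cf.taste.eval.RecommenderEvaluator;
import org.apache.mahout.cf.taste.impl.eval.AverageAbsoluteDifferenceRecommenderEvaluator;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.ThresholdUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;
import org.apache.mahout.common.RandomUtils;

public class RecommenderEvaluation {
    public static void main(String[] args) throws Exception {
        RandomUtils.useTestSeed(); // fixed seed so evaluation runs are repeatable

        DataModel model = new FileDataModel(new File("data/user-to-item.csv"));

        // Rebuild the same recommender as in the [Recommender] section above.
        RecommenderBuilder builder = new RecommenderBuilder() {
            public Recommender buildRecommender(DataModel dataModel) throws TasteException {
                UserSimilarity similarity = new PearsonCorrelationSimilarity(dataModel);
                UserNeighborhood neighborhood = new ThresholdUserNeighborhood(0.2, similarity, dataModel);
                return new GenericUserBasedRecommender(dataModel, neighborhood, similarity);
            }
        };

        RecommenderEvaluator evaluator = new AverageAbsoluteDifferenceRecommenderEvaluator();
        // Train on 70% of each user's preferences, evaluate on the remaining 30%, over 100% of users.
        double score = evaluator.evaluate(builder, null, model, 0.7, 1.0);
        System.out.println("Average absolute difference: " + score);
    }
}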


:

[Apache Tajo] Apache Tajo Desktop + Zeppelin integration

ITWeb/Apache Tajo 2015. 5. 27. 18:35

This sets up an analysis environment using the Apache Tajo desktop edition and Zeppelin.

All installation and usage guides are covered in detail on each project's homepage.

I'm only collecting the logs of my own run here.


Installing Apache Tajo Desktop


1. Download and installation guide

http://www.gruter.com/blog/getting-started-with-tajo-on-your-desktop/


2. Extract the archive

$ tar -xvzf tajo-0.11.0-desktop-3.0.tar.gz

$ ln -s tajo-0.11.0-desktop-3.0 tajo

$ cd tajo


3. Configure

$ bin/configure.sh

Enter JAVA_HOME [required]

/Library/Java/JavaVirtualMachines/jdk1.7.0_55.jdk/Contents/Home

Would you like advanced configure? [y/N]

y

Enter tajo.rootdir [default: file:///Users/hwjeong/temp/kgloballondon/tajo/data/tajo]


Enter tajo.staging.directory [default: file:///Users/hwjeong/temp/kgloballondon/tajo/data/staging]


Enter tajo.worker.tmpdir.locations [default: /Users/hwjeong/temp/kgloballondon/tajo/data/tempdir]


Enter heap size(MB) for worker [default: 1024]


Done. To start Tajo, run /Users/hwjeong/temp/kgloballondon/tajo/bin/startup.sh

- Set JAVA_HOME to match your environment.
- The advanced configure step isn't required; I chose "y" just to see which options it offers.
- On OS X you can look up JAVA_HOME like this:
$ /usr/libexec/java_home -v 1.7
or
$ /usr/libexec/java_home -v 1.6

4. Start Tajo
$ bin/startup.sh
starting master, logging to /Users/hwjeong/temp/kgloballondon/tajo/bin/../logs/tajo-hwjeong-master-jeong-ui-MBP.out
Tajo master starting....Connection to localhost port 26003 [tcp/*] succeeded!
Tajo master started.

starting worker, logging to /Users/hwjeong/temp/kgloballondon/tajo/bin/../logs/tajo-hwjeong-worker-jeong-ui-MBP.out
Tajo worker started.

Tajo master web UI
http://localhost:26080

5. Load the test data
$ bin/make-test.sh
Databases and tables for test were successfully created.

6. Run Tajo shell commands
$ bin/tsql

default> \c tpc_h10m
You are now connected to database "tpc_h10m" as user "hwjeong".
tpc_h10m> \d
customer
lineitem
nation
orders
part
partsupp
region
supplier
tpc_h10m>

Up to here, everything is the same as the "download and installation guide";
I've merely recorded the log of my own run.

Installing Zeppelin

1. Download and installation guide
https://zeppelin.incubator.apache.org/docs/install/install.html

2. Clone with git
$ git clone https://github.com/apache/incubator-zeppelin.git zeppelin
Cloning into 'zeppelin'...
remote: Counting objects: 21256, done.
remote: Total 21256 (delta 0), reused 0 (delta 0), pack-reused 21256
Receiving objects: 100% (21256/21256), 10.76 MiB | 1.75 MiB/s, done.
Resolving deltas: 100% (8584/8584), done.
Checking connectivity... done.
$ cd zeppelin

3. Build for local mode
$ sudo mvn clean install -DskipTests
- I ran it with sudo to avoid permission errors while installing dependencies.

4. Start Zeppelin
$ bin/zeppelin-daemon.sh start
Zeppelin start                                             [  OK  ]
- To stop it, run stop instead of start.
$ bin/zeppelin-daemon.sh stop
Zeppelin stop                                              [  OK  ]

5. Open the Zeppelin Web UI
http://localhost:8080/

Using Apache Tajo SQL from Zeppelin

1. Create a note


2. Write Tajo queries

## Click Note 2AQG17JRB.

%tajo select * from tpc_h10m.nation;


%tajo
SELECT n.n_name as nation, sum(o.o_totalprice) as order_amount
FROM tpc_h10m.customer c, tpc_h10m.nation n, tpc_h10m.orders o
WHERE c.c_nationkey = n.n_nationkey
and o.o_custkey = c.c_custkey
GROUP BY c.c_nationkey, n.n_name
ORDER BY n.n_name;

- "%tajo" 부분은 zeppelin의 interpreter binding 정보를 참고 하시면 되며, tajo를 지정한 내용입니다.

- tajo의 tsql에서 제공하는 "\명령어"는 지원되지 않기 때문에 사용에 유의 하셔야 합니다.



3. View query results as a graph

Click the graph icon at the bottom of the command shell to see the results.


That covers the setup, built from the documentation provided.

Note also that Apache Tajo and Zeppelin communicate through the JDBC driver.
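For reference, here is a minimal sketch of querying Tajo over JDBC yourself, the same path Zeppelin takes (assuming a default desktop install; 26002 is Tajo's default client service port, so adjust host/port/database to your environment):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class TajoJdbcExample {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.tajo.jdbc.TajoDriver");

        try (Connection conn = DriverManager.getConnection("jdbc:tajo://localhost:26002/tpc_h10m");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("select n_name from nation")) {
            while (rs.next()) {
                System.out.println(rs.getString(1));
            }
        }
    }
}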




:

[Elasticsearch] Apache Tajo & Elasticsearch Korean README

Elastic/Elasticsearch 2015. 5. 15. 11:20

Original link)

https://github.com/gruter/tajo-elasticsearch/blob/master/README.KR.md


Apache Tajo & Elasticsearch

  • Collaborate Apache Tajo + Elasticsearch
  • Implemented as an external storage for Apache Tajo.
  • Installation guide

Software Stack

  • It's hardly a stack, but the basis is Apache Tajo + Elasticsearch.
  • Each open source project carries its own package dependencies, so check them and install what's required before use.
  • Basically: get the source, build it, install it, and use it.

Apache Tajo

  • Hadoop 2.3.0 or higher (up to 2.5.1)
  • Java 1.6 or 1.7
  • Protocol buffer 2.5.0

Elasticsearch

  • Each version has its own JDK dependency.
  • 1.2 and later require JDK 1.7 or higher
  • 1.1 and earlier require JDK 1.6

How it works

  • You create an external table in Apache Tajo and query Elasticsearch through it.
  • Two features are implemented so far:
    • CREATE EXTERNAL TABLE
    • SELECT
  • Tajo stores the meta information; the actual data lives in Elasticsearch.
  • When a SQL query runs, Tajo translates it into Query DSL and executes it against Elasticsearch to fetch the data.
  • The fetched data is then filtered against the WHERE conditions and the result is returned.

Where is this useful?

  • If you've missed JOIN support in Elasticsearch
  • If you need to analyze or query together with data stored in HDFS
  • If you want to store intermediate results over HDFS data in Elasticsearch
  • If you don't know search engines and only know SQL

Is the JDBC driver usable?

  • Use Apache Tajo's JDBC driver.

Caveats

  • QUAL conditions are not pushed down yet, so queries do a full scan; this is not suitable for real-time services.
  • Use it for batch jobs or for management/analysis tools.
  • The Apache Tajo team is working on QUAL push-down; it will be applied here once complete.

Contact

  • If you have requests or improvements, send a mail or file an issue and I'll do my best to apply them.


:

[Elasticsearch] Collaborate Apache Tajo + Available SQL on Elasticsearch

Elastic/Elasticsearch 2015. 5. 14. 18:40

Original link) https://github.com/gruter/tajo-elasticsearch


Apache Tajo & Elasticsearch

  • Collaborate Apache Tajo + Elasticsearch

Apache Tajo User Group

Apache Tajo Mailing List

Elasticsearch User Group

Registerer

Master Branch Build Environment

  • JDK 6 or later
  • Elasticsearch 1.1.2

tajo-es-1.5.2 Branch Build Environment

  • JDK 7 or later
  • Elasticsearch 1.5.2

tajo-es-1.1.2 Branch Build Environment

  • JDK 6 or later
  • Elasticsearch 1.1.2

HADOOP

$ cd ~/server/app/hadoop-2.3.0

Prerequisites

  • Hadoop 2.3.0 or higher (up to 2.5.1)
  • Java 1.6 or 1.7
  • Protocol buffer 2.5.0
  • Go to Link

Source Clone & Build

$ git clone https://github.com/gruter/tajo-elasticsearch.git
$ cd tajo-elasticsearch
$ mvn clean package -DskipTests -Pdist -Dtar
$ cd tajo-dist/target/
$ ls -al tajo-0.*.tar.gz
-rw-r--r--  1 hwjeong  staff  59521544  5 14 13:59 tajo-0.11.0-SNAPSHOT.tar.gz

Apache Tajo Installation

$ cd ~/server/app
$ mkdir tajo
$ cd tajo
$ cp ~/git/tajo-elasticsearch/tajo-dist/target/tajo-0.11.0-SNAPSHOT.tar.gz .
$ tar -xvzf tajo-0.11.0-SNAPSHOT.tar.gz
$ ln -s tajo-0.11.0-SNAPSHOT tajo
$ cd tajo
$ vi conf/tajo-env.sh
export HADOOP_HOME=/Users/hwjeong/server/app/hadoop-2.3.0
export JAVA_HOME=`/usr/libexec/java_home -v 1.7`

Apache Tajo Worker - ssh keygen

$ cd ~/.ssh
$ ssh-keygen -t rsa
$ cat id_rsa.pub > authorized_keys
$ chmod 600 authorized_keys

Apache Tajo Run & Sample Data

$ cd ~/server/app/tajo/tajo
$ bin/start-tajo.sh
$ cat > data.csv
1|abc|1.1|a
2|def|2.3|b
3|ghi|3.4|c
4|jkl|4.5|d
5|mno|5.6|e
^C

Hadoop Run & Make User Directory

$ cd ~/server/app/hadoop-2.3.0
$ sbin/start-all.sh
$ bin/hadoop fs -ls /
$ bin/hadoop fs -mkdir /user/tajo
$ bin/hadoop fs -chown hwjeong /user/tajo
$ bin/hadoop fs -ls /user
drwxr-xr-x   - hwjeong supergroup          0 2015-05-14 14:42 /user/tajo
$ bin/hadoop fs -moveFromLocal ~/server/app/tajo/tajo/data.csv /user/tajo/

Apache Tajo CLI

$ cd ~/server/app/tajo/tajo
$ bin/tsql
default> create external table tajodemotbl (id int, name text, score float, type text) using csv with ('csvfile.delimiter'='|') location 'hdfs://localhost:9000/user/tajo/data.csv';
OK
default> \d tajodemotbl;

table name: default.tajodemotbl
table path: hdfs://localhost:9000/user/tajo/data.csv
store type: csv
number of rows: unknown
volume: 60 B
Options:
'text.delimiter'='|'

schema:
id  INT4
name    TEXT
score   FLOAT4
type    TEXT

default> select * from tajodemotbl where id > 2;
Progress: 0%, response time: 1.557 sec
Progress: 0%, response time: 1.558 sec
Progress: 100%, response time: 1.86 sec
id,  name,  score,  type
-------------------------------
3,  ghi,  3.4,  c
4,  jkl,  4.5,  d
5,  mno,  5.6,  e
(3 rows, 1.86 sec, 48 B selected)
default>

Elasticsearch Installation & Run

$ cd ~/server/app/elasticsearch/elasticsearch-1.1.2
# no configuration
$ bin/elasticsearch -f

Create Index & Document

package org.gruter.elasticsearch.test;

import org.elasticsearch.action.WriteConsistencyLevel;
import org.elasticsearch.action.admin.indices.create.CreateIndexResponse;
import org.elasticsearch.action.count.CountRequest;
import org.elasticsearch.action.count.CountResponse;
import org.elasticsearch.action.index.IndexRequestBuilder;
import org.elasticsearch.action.index.IndexResponse;
import org.elasticsearch.action.support.replication.ReplicationType;
import org.elasticsearch.client.Client;
import org.elasticsearch.common.settings.ImmutableSettings;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.common.xcontent.XContentBuilder;
import org.elasticsearch.common.xcontent.XContentFactory;
import org.elasticsearch.node.Node;
import org.elasticsearch.node.NodeBuilder;
import org.junit.After;
import org.junit.Before;
import org.junit.Test;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import static org.junit.Assert.assertEquals;

public class ElasticsearchCRUDTest {
  private static final Logger log = LoggerFactory.getLogger(ElasticsearchCRUDTest.class);

  private ImmutableSettings.Builder settings;
  private Node node;
  private Client client;
  private int DOC_SIZE = 1000;

  @Before
  public void setup() throws Exception {
    settings = ImmutableSettings.settingsBuilder();
    settings.put("cluster.name", "elasticsearch");

    node = NodeBuilder.nodeBuilder()
        .settings(settings)
        .data(false)
        .local(false)
        .client(true)
        .node();

    client = node.client();
  }

  @Test
  public void testElasticsearchCRUD() throws Exception {
    // delete index
    try {
      client.admin().indices().prepareDelete("tajo_es_index").execute().actionGet();
    } catch (Exception e) {
    } finally {
    }

    // create index
    Settings indexSettings = ImmutableSettings.settingsBuilder()
        .put("number_of_shards","1")
        .put("number_of_replicas", "0")
        .build();

    XContentBuilder builder = XContentFactory.jsonBuilder()
        .startObject()
          .startObject("tajo_es_type")
            .startObject("_all")
              .field("enabled", "false")
            .endObject()
            .startObject("_id")
              .field("path", "field1")
            .endObject()
            .startObject("properties")
              .startObject("field1")
                .field("type", "long").field("store", "no").field("index", "not_analyzed")
              .endObject()
              .startObject("field2")
                .field("type", "string").field("store", "no").field("index", "not_analyzed")
              .endObject()
              .startObject("field3")
                .field("type", "string").field("store", "no").field("index", "analyzed")
              .endObject()
            .endObject()
          .endObject()
        .endObject();

    CreateIndexResponse res = client.admin().indices().prepareCreate("tajo_es_index")
        .setSettings(indexSettings)
        .addMapping("tajo_es_type", builder)
        .execute()
        .actionGet();

    assertEquals(res.isAcknowledged(), true);

    // add document
    IndexRequestBuilder indexRequestBuilder = client.prepareIndex().setIndex("tajo_es_index").setType("tajo_es_type");
    IndexResponse indexResponse;

    for ( int i=0; i<DOC_SIZE; i++ ) {
      builder = XContentFactory.jsonBuilder()
          .startObject()
            .field("field1", i).field("field2", "henry" + i).field("field3", i + ". hello world!! elasticsearch on apache tajo!!")
          .endObject();

      indexResponse = indexRequestBuilder.setSource(builder)
          .setId(String.valueOf(i))
          .setOperationThreaded(false)
          .setConsistencyLevel(WriteConsistencyLevel.QUORUM)
          .setReplicationType(ReplicationType.ASYNC)
          .execute()
          .actionGet();

      assertEquals(indexResponse.isCreated(), true);
    }

    client.admin().indices().prepareRefresh("tajo_es_index").execute().actionGet();

    CountRequest request = new CountRequest();
    request.indices("tajo_es_index").types("tajo_es_type");
    CountResponse response = client.count(request).actionGet();
    long count = response.getCount();

    assertEquals(count, DOC_SIZE);
  }

  @After
  public void tearDown() throws Exception {
    client.close();
    node.close();
  }
}

Check Status

Create External Table for Elasticsearch on Tajo and Test Query

create external table tajo_es_index (
  _type text,
  _score double,
  _id text,
  field1 bigint,
  field2 text,
  field3 text
)
using elasticsearch
with (
  'es.index'='tajo_es_index',
  'es.type'='tajo_es_type'
)

$ cd ~/server/app/tajo/tajo
$ bin/tsql

Try \? for help.
default> create external table tajo_es_index (
>   _type text,
>   _score double,
>   _id text,
>   field1 bigint,
>   field2 text,
>   field3 text
> )
> using elasticsearch
> with (
>   'es.index'='tajo_es_index',
>   'es.type'='tajo_es_type'
> );
OK
default> select count(*) from tajo_es_index;
Progress: 0%, response time: 1.397 sec
Progress: 0%, response time: 1.398 sec
Progress: 0%, response time: 1.802 sec
Progress: 100%, response time: 1.808 sec
?count
-------------------------------
1000
(1 rows, 1.808 sec, 5 B selected)

default> select * from tajo_es_index where field1 > 10 and field1 < 15;
Progress: 100%, response time: 0.583 sec
_type,  _score,  _id,  field1,  field2,  field3
-------------------------------
tajo_es_type,  0.0,  11,  11,  henry11,  11. hello world!! elasticsearch on apache tajo!!
tajo_es_type,  0.0,  12,  12,  henry12,  12. hello world!! elasticsearch on apache tajo!!
tajo_es_type,  0.0,  13,  13,  henry13,  13. hello world!! elasticsearch on apache tajo!!
tajo_es_type,  0.0,  14,  14,  henry14,  14. hello world!! elasticsearch on apache tajo!!
(4 rows, 0.583 sec, 320 B selected)

Elasticsearch "with" Options

  public static final String OPT_CLUSTER = "es.cluster";
  public static final String OPT_NODES = "es.nodes";
  public static final String OPT_INDEX = "es.index";
  public static final String OPT_TYPE = "es.type";
  public static final String OPT_FETCH_SIZE = "es.fetch.size";
  public static final String OPT_PRIMARY_SHARD = "es.primary.shard";
  public static final String OPT_REPLICA_SHARD = "es.replica.shard";
  public static final String OPT_PING_TIMEOUT = "es.ping.timeout";
  public static final String OPT_CONNECT_TIMEOUT = "es.connect.timeout";
  public static final String OPT_THREADPOOL_RECOVERY = "es.threadpool.recovery";
  public static final String OPT_THREADPOOL_BULK = "es.threadpool.bulk";
  public static final String OPT_THREADPOOL_REG = "es.threadpool.reg";
  public static final String OPT_TIME_SCROLL = "es.time.scroll";
  public static final String OPT_TIME_ACTION = "es.time.action";


:

Challenge! Apache Tajo Contributor.

ITWeb/Apache Tajo 2015. 3. 31. 16:20

Prologue.


This post was written to share one way of participating in an open source project: becoming a Contributor by contributing code. Among the many Apache open source projects, we'll look at how to become a Contributor using Apache Tajo, one of the hottest these days. If you've used Git commands and Github, the content should be easy to follow. I hope this post helps the many prospective Contributors who have been hesitating.


Before you start.


Before taking the challenge, it's best to check whether what you want to contribute already exists on the project's roadmap, or whether someone is already working on it.


Check the Apache Tajo Roadmap.

Check the Apache Tajo Issues.


Challenge! Apache Tajo Contributor.


The basic concept follows these steps.


  • Step 0. Fork the Apache Tajo master branch into your personal Github account
  • Step 1. Clone the Apache Tajo master source into a local repository
  • Step 2. Create a development branch plus a branch for merging and pushing the code
  • Step 3. Code on the development branch
  • Step 4. When development is done, merge into the branch created for merging
  • Step 5. Since a patch file may be needed, diff against the master branch to create a patch
  • Step 6. Push the merged branch to your personal Github
  • Step 7. On the Github site, a Pull Request button appears for the pushed branch
  • Step 8. Send the pull request; that's it for now


Getting started.


The very first preparation is creating a personal account on Github.



With the account created, to become a Contributor, first fork the Apache Tajo master branch and clone it from your personal Github repository.


$ git clone https://git-wip-us.apache.org/repos/asf/tajo.git

## or

$ git clone https://github.com/apache/tajo.git


## The git remote information depends on where you clone from.

## Normally you'd clone from your personal Github after forking, but to explain working against two remote repositories below, I cloned from the Apache Tajo git rather than my personal Github.


## Cloning from your personal Github

$ git clone https://github.com/howookjeong/tajo.git


After cloning the source, register the remote repositories to set up how the code will be managed from here on.

  • Register the Apache Tajo repository and your personal Github repository.
  • The Apache Tajo repository is used as the upstream code baseline, and your personal Github repository as the baseline for development code.

## Show the list of remote repositories.

$ git remote


## Since we cloned Apache Tajo's master branch as above, the default remote repository is the Apache repository.

## For easier management, rename it from origin to asf.

$ git remote rename origin asf


## Add your personal Github repository for pull request management.

$ git remote add origin https://github.com/howookjeong/tajo.git


Once you've done all this, you're ready to start the Contributor challenge.


Creating development branches.


We'll create two branches.

  • One for merging code and registering the branch on your personal Github to send pull requests
  • The other purely for development work

It's best to create branches named after the Apache issue.


Some of you may ask why not just merge into the master branch. In my case I don't merge into master because I want to keep it identical to the Apache Tajo master branch.


## Use the command below to check whether the master branch is currently selected.

## If not, switch to the master branch.

$ git branch


## Create the branch to merge into, so the pull request ends up with a single commit log.

$ git branch TAJO-1451


## Create the development branch.

## Use a suffix such as _COMMIT or _DEV on the branch name created above.

$ git branch TAJO-1451_COMMIT


## Switch to the development branch to start development.

$ git checkout TAJO-1451_COMMIT


## The branch creation and checkout above can be done in a single command, shown below.

## Good to know.

## This performs the branch + checkout above as one checkout -b.

$ git checkout -b TAJO-1451_COMMIT


※ Branch creation and management styles differ by developer, so you don't have to follow this exact approach. Doing it your own way is perfectly fine.


개발 코드 커밋(Commit) 하기.


As explained above, switch to the development branch and start developing.


$ git checkout TAJO-1451_COMMIT


The basic environment and build guide for development are as follows.


※ Reference links

http://tajo.apache.org/docs/current/getting_started.html

https://cwiki.apache.org/confluence/display/TAJO/How+to+Contribute+to+Tajo


## Prerequisites

Hadoop 2.3.0 or higher (up to 2.5.1)

Java 1.6 or 1.7

Protocol Buffer 2.5.0


## Build first so the Proto-related classes get generated; they don't exist until you build.

## If you open the project in an IDE beforehand, you'll see errors.

## Run the build command from the root of the clone, typically git/tajo.

$ mvn -DskipTests clean install


## When development is done, package a distribution tarball.

$ mvn clean package -DskipTests -Pdist -Dtar


※ Coding cautions


Once development is complete, commit.


## To check your changes, use the commands below.

$ git status/diff/log


## Commit the implemented code with the command below.

## Newly added files must be git add'ed first for the commit to pick them up.

$ git commit -m "implement elasticsearch storage for tajo"


## If existing files were modified in addition to newly added ones, commit with:

$ git commit -a -m "implement elasticsearch storage for tajo"


Merging the code.


Now that the work is done, merge into the branch you'll push to your personal Github.


## Switch to the push branch created earlier for the merge.

$ git checkout TAJO-1451


## Merge and squash the commit logs into one.

## The --squash option collapses the commit log into a single commit.

$ git merge --squash TAJO-1451_COMMIT


Creating a patch file.


To help Apache's Travis CI run, or so a Committer can review easily, create a patch file and attach it to the Apache Jira issue you filed.


## The current working branch is TAJO-1451.

$ git diff master --no-prefix > TAJO-1451.patch


※ Reference link

https://cwiki.apache.org/confluence/display/TAJO/How+to+Contribute+to+Tajo


Sending a pull request.


Push the merged branch to your personal Github repository for the pull request.


## First, commit the merge result to the local repository.

$ git commit -m "implement elasticsearch storage for tajo"


## Now that everything is in, push the branch to your personal Github repository.

## You can choose which repository to push to; the standard practice is your personal Github repository.

## We registered asf and origin at the start; push to origin, the personal Github repository.

## The push command will prompt for your Github ID/PWD.

$ git push origin TAJO-1451


If you merged, committed, and even pushed to the master branch by mistake, roll it back with the commands below.


## Check your previous commit HEAD with git log, then reset as follows.

$ git reset --hard HEAD~1


## Push the reset master branch again to restore it to its initial state.

$ git push origin +master


Now go to your personal Github page, check that the pushed branch is there, and click the Pull Request button created for that branch.


## Write it in the following format.

Title) APACHE-JIRA-ISSUE-NUMBER: request title.......


## If you write it wrong, you can edit it on Github.

e.g.) TAJO-1451: implement elasticsearch storage for tajo.


Below is an example screen of a completed pull request.



From here, the Committers' review and feedback determine whether it gets merged.


Epilogue.


To wrap up, I hope many prospective Contributors participate more actively in real open source projects and the ecosystem thrives. It was a humble post, but thank you for reading to the end.


About the author: Howook Jeong


As of 2015 I'm in my 15th year of software development. I built community, social search, and search advertisement services at Yahoo! Korea, NHN Technology, and Samsung Electronics, and carried out many projects using Lucene, the open source search library. I currently work at Gruter, a big data company, on projects and development using open source search engines, and I share Elasticsearch knowledge and experience through my blog and communities.


:

[Elasticsearch] Using elasticsearch as an external storage for apache tajo

Elastic/Elasticsearch 2015. 3. 27. 00:26


I've been working on something since last week.

No idea how it will turn out yet, but even if it doesn't land, someone might need it, so I'm sharing.


https://issues.apache.org/jira/plugins/servlet/mobile#issue/TAJO-1451


The gist: I'm contributing support for using elasticsearch as an external storage for apache tajo.


In short, this makes the following possible.

1. Querying elasticsearch indexes with SQL (ANSI SQL is fully supported.)

2. Joins with HDFS or any other storage data tajo supports (inner hits was added in the recent 1.5, but it's no comparison.)

3. Plenty of other uses too. (Use your imagination.)


Even if the final commit never lands, it looks very useful from my testing. ^^


Many of you probably want the ES JDBC driver shown at DEVIEW; until then, I hope this serves as a stand-in.

:

apache commons cli maven

ITWeb/General Development 2014. 8. 18. 14:23

Using the library below to implement elasticsearch CLI features.


<!-- apache commons cli -->
<dependency>
    <groupId>commons-cli</groupId>
    <artifactId>commons-cli</artifactId>
    <version>1.2</version>
</dependency>
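For reference, a minimal parsing sketch with commons-cli 1.2 (the option names are hypothetical, just to illustrate the API):

import org.apache.commons.cli.CommandLine;
import org.apache.commons.cli.CommandLineParser;
import org.apache.commons.cli.HelpFormatter;
import org.apache.commons.cli.Options;
import org.apache.commons.cli.ParseException;
import org.apache.commons.cli.PosixParser;

public class EsCliDemo {
    public static void main(String[] args) {
        Options options = new Options();
        options.addOption("h", "host", true, "elasticsearch host");
        options.addOption("p", "port", true, "elasticsearch transport port");

        CommandLineParser parser = new PosixParser();
        try {
            CommandLine cmd = parser.parse(options, args);
            String host = cmd.getOptionValue("h", "localhost");
            String port = cmd.getOptionValue("p", "9300");
            System.out.println("target: " + host + ":" + port);
        } catch (ParseException e) {
            // On bad input, print an auto-generated usage message.
            new HelpFormatter().printHelp("es-cli", options);
        }
    }
}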


:

apache so module guide.

ITWeb/General Development 2013. 3. 19. 15:14

[Reference URL]

http://www.penguinpowered.org/documentation/apache2_modules.html

http://threebit.net/tutorials/apache2_modules/tut1/tutorial1.html

http://threebit.net/tutorials/apache2_modules/tut2/tutorial2.html







Introduction

There appears to be a lack of documentation written for new developers regarding creation of Apache http modules. The documentation that I have found so far assumes that a new developer will automatically know how to perform certain tasks such as compiling their modules into the httpd binary or compiling their modules as DSO modules. I didn't know these procedures, and I assume that there are other people out there who could also use a quick ramp up into this fascinating world. This page and the associated module source will hopefully give you a jump start.



Document Status

For the moment, please accept this document as a work in progress. I'm only a beginning C programmer, as I'm teaching myself in the evenings. I'm also just getting started with Apache modules. So please bear with me for any grody code and bad examples. I'm putting this together because I couldn't find anything to get myself started. More importantly than bearing with me, please jump in and correct / advise me if I get something wrong here. Also, answers to sections where I've stated I couldn't work out how to achieve something would be much appreciated and will be added to the document.



What we do and don't cover

I'm not going to cover modules for the 1.3.x versions of the Apache httpd server, as there is already an excellent book out there that does this. The book you want is named Writing Apache Modules with Perl and C: Customizing Your Web Server, and the ISBN is 156592567X. For the purposes of this text, we will be working with a very simple module that prints Hello World in a browser window when set as the handler for a location. All examples will assume that we are discussing this module.



Build choices

Before we get into creating a module, we'll cover what to do with the module once you've written it. Firstly, you need to compile your module. You can either compile your module into the httpd binary or as a DSO module. The DSO route is preferable especially during the development phase, because you don't want to have to compile and install a new binary every time you want to test your module. If you want to read more about DSO modules, then look here. As I still haven't worked out how to add a module to the source tree, I will only cover building your module as a DSO here. Under Apache 1.3.x, you would add a command like

--activate-module=src/modules/work/mod_hello.c

This no longer appears to work in 2.0.x. To build a DSO module without going through the rigmarole of rebuilding the httpd binary, you can use the APache eXtenSion tool (apxs). If you already have Apache httpd installed on your machine, man apxs should give you loads of information about the use of this command. For our simple hello world module, the following apxs command should work fine:

apxs -c -i -a mod_hello.c

The apxs command should be run from the directory that contains the source for the hello world module. If all goes well, it should install the module into your modules directory.



The module

Well, if you're still with me at this point (i.e. you haven't been bored to tears), then we'll get to the meaty bit: a sample module. The module that we will be developing and testing here is a simple module that merely implements a content handler. This is a handler that generates or modifies content, by my understanding. If modifying content is your primary objective, then I advise you to look at filters as well. I found 2 articles by Ryan Bloom at http://www.onlamp.com/pub/a/apache/2001/08/23/apache_2.html and http://www.onlamp.com/pub/a/apache/2001/09/13/apache_2.html

These will probably be a good place to start if you're looking at writing a filter as opposed to a general purpose module.

Our source is stored in a file named mod_hello.c which you may have gathered from the apxs command earlier. The reason for this is that the source file name is one resource that apxs can use to work out the module name. The source is as follows with comments interspersed.

/* The following module borrows from mod_example.c but 
* is much simplified. Please see http://www.apache.org/LICENSE.txt 
* for the licence that mod_example.c and this module are licenced 
* under. All commentary on the module is included as 
* standard C style comments below */ 

/* All of these include files can be found in the Apache http source 
* tree in include/ . If you've got the Apache http server already 
* installed, then there will be an include directory either under the 
* directory that was specified with --prefix or somewhere in your 
* standard 
* include paths. All the functions that you can make use of can be 
* found 
* in the include files. */ 

#include "httpd.h" 
#include "http_config.h" 
#include "http_core.h" 
#include "http_log.h" 
#include "http_main.h" 
#include "http_protocol.h" 
#include "http_request.h" 
#include "util_script.h" 
#include "http_connection.h" 

/* This example just takes a pointer to the request record as its only 
* argument */ 
static int hello_handler(request_rec *r) 
{ 

        /* We decline to handle a request if hello-handler is not the value 
         * of r->handler */ 
        if (strcmp(r->handler, "hello-handler")) { 
                return DECLINED; 
        } 

        /* The following line just prints a message to the errorlog */ 
        ap_log_error(APLOG_MARK, APLOG_NOERRNO|APLOG_NOTICE, 0, r->server, 
        "mod_hello: %s", "Before content is output"); 

        /* We set the content type before doing anything else */ 
        ap_set_content_type(r, "text/html"); 

        /* If the request is for a header only, and not a request for 
         * the whole content, then return OK now. We don't have to do 
         * anything else. */ 
        if (r->header_only) { 
                return OK; 
        } 

        /* Now we just print the contents of the document using the 
         * ap_rputs and ap_rprintf functions. More information about 
         * the use of these can be found in http_protocol.h */ 
        ap_rputs("<HTML>\n", r);
	ap_rputs("\t<HEAD>\n", r);
	ap_rputs("\t\t<TITLE>\n\t\t\tHello There\n\t\t</TITLE>\n", r);
	ap_rputs("\t</HEAD>\n\n", r);
	ap_rputs("<BODY BGCOLOR=\"#FFFFFF\>"\n" ,r);
	ap_rputs("<H1>Hello </H1>\n", r);
	ap_rputs("Hello world\n", r);
	ap_rprintf(r, "<br>A sample line generated by ap_rprintf<br>\n");
	ap_rputs("</BODY></HTML>\n" ,r); 

        /* We can either return OK or DECLINED at this point. If we return 
        * OK, then no other modules will attempt to process this request */ 
        return OK; 
} 


/* Each function our module provides to handle a particular hook is 
* specified here. See mod_example.c for more information about this 
* step. Suffice to say that we need to list all of our handlers in 
* here. */ 
static void x_register_hooks(apr_pool_t *p) 
{ 
        ap_hook_handler(hello_handler, NULL, NULL, APR_HOOK_MIDDLE); 
} 


/* Module definition for configuration. We list all callback routines 
* that we provide the static hooks into our module from the other parts 
* of the server. This list tells the server what function will handle a 
* particular type or stage of request. If we don't provide a function 
* to handle a particular stage / callback, use NULL as a placeholder as 
* illustrated below. */ 
module AP_MODULE_DECLARE_DATA hello_module = 
{ 
        STANDARD20_MODULE_STUFF, 
        NULL, /* per-directory config creator */ 
        NULL, /* directory config merger */ 
        NULL, /* server config creator */ 
        NULL, /* server config merger */ 
        NULL, /* command table */ 
        x_register_hooks, /* other request processing hooks */ 
}; 


So I wrote a module - Now what ?

I guess after all of this effort, you'd like to see the results of your module. To receive your quota of gratification, you'll need to edit your Apache config (normally httpd.conf) and restart your webserver.

This module merely prints Hello World to your browser when you make a request that this handler handles.

If your apxs command that we showed earlier for building the module worked as advertised, you should find a line similar to the following in your httpd.conf

LoadModule hello_module modules/mod_hello.so 

modules/ may be libexec/ on many systems. This is the line that tells the server to load this module into memory at startup. This way the http server can provide functionality as and when required without bloating the main codebase.

Now you need to tell a location to use this handler. A simple way to do this is to add something like the following to your httpd.conf (the path is up to you; the handler name must match the one checked in the code):

<Location /hello>
    SetHandler hello-handler
</Location>

Once you've done this, test your server config by running httpd -t and then restart the server. Now point your browser at the location you configured and you should get a page saying Hello World. If you don't, either you did something wrong following these instructions, or (more likely) I did something wrong writing these instructions. If you do run into problems at this stage, please let me know so that I can update this document.



Concussion

This would have been a conclusion, but by the time I had got my head around all of this stuff, I felt a bit concussed. I hope that the above text proves useful to someone out there. It pretty much covers the things that I was not able to understand when I started this.

If you have any questions or suggestions for improvements, then please feel free to e-mail me by clicking this link and filling out the form that you are presented with.




This tutorial guides the reader through the minimal tasks involved in writing a module for Apache 2. The module developed here has almost no functionality; its only impact is the generation of a static message to logs/error_log for each HTTP request.

This tutorial is not intended to showcase Apache's module API. Instead, it guides the reader through the other tasks that must be done to properly develop, compile and install a custom module - namely autoconf and automake.

Further tutorials will build from this one and explore the advanced module API. Drop a message to kevino at threebit.net if you feel the need.

# Throughout the tutorial, look for links to Apache's
# LXR website http://lxr.webperf.org/
# For example, click on AP_MODULE_DECLARE_DATA below.
module AP_MODULE_DECLARE_DATA tut1_module;

Preparation

If you don't actually want to run or test the code in this tutorial, then feel free to skip this step. Otherwise, you'll want to perform the following actions so your work area is prepared for compiling and running the tutorial.

I have assumed in this tutorial that you have an account on a Linux (or Unix) machine and you have installed the GNU build tools (autoconf, automake, etc). If you haven't then you're not going to get very far - consult your OS documentation.

# Prepare the temporary directory
cd $HOME
mkdir threebit-tutorials
cd threebit-tutorials

# Remember the tutorial home directory for later.
export TUTORIAL_HOME=`pwd`

Download via HTTP

cd $TUTORIAL_HOME
wget "http://threebit.net/tutorials/tutorials.tar.gz"
tar zxvf tutorials.tar.gz

Download via Anonymous CVS

cd $TUTORIAL_HOME
CVSROOT=:pserver:anonymous@threebit.net:/usr/local/cvs

# use "anonymous" as the password.
cvs login

cvs co tutorials/apache2_modules
mv tutorials/* .
rm -rf tutorials

Apache

Note: You will get a "404 - Not Found" error if 2.0.43 is no longer the newest version of Apache. Just substitute the current version tag if that is the case.

cd $TUTORIAL_HOME

wget http://www.apache.org/dist/httpd/httpd-2.0.43.tar.gz
tar zxf httpd-2.0.43.tar.gz

cd httpd-2.0.43
./configure --prefix=$TUTORIAL_HOME/apache2 --enable-so
make
make install

Now we will fix the ServerName and Listen configuration directives so that we can run this installation as an unprivileged user.

# store the location of the apache configuration file.
HTTPCONF=$TUTORIAL_HOME/apache2/conf/httpd.conf

# replace the ServerName directive
cat $HTTPCONF | \
  sed 's/#ServerName new.host.name:80/ServerName localhost/' \
  > $HTTPCONF.new
mv $HTTPCONF.new $HTTPCONF

# replace the Listen directive.
cat $HTTPCONF | sed 's/^Listen 80/Listen 21000/' > $HTTPCONF.new
mv $HTTPCONF.new $HTTPCONF

And test the configuration:

$TUTORIAL_HOME/apache2/bin/apachectl configtest
Syntax OK

mod_tut1.c

As stated above, the purpose of this module is to write data to the error log for each HTTP request. We are obviously building a useless module - but by limiting what the module does it becomes easier to explain what everything is doing.

The source code to the module is pretty much self documenting but let us examine each block independently.

/*
 * Include the core server components.
 */
#include "httpd.h"
#include "http_config.h"
Obviously an Apache module will require information about structures, macros and functions from Apache's core. These two header files are all that is required for this module, but real modules will need to include other header files relating to request handling, logging, protocols, etc.
/*
 * Declare and populate the module's data structure.  The
 * name of this structure ('tut1_module') is important - it
 * must match the name of the module.  This structure is the
 * only "glue" between the httpd core and the module.
 */
module AP_MODULE_DECLARE_DATA tut1_module =
{
  // Only one callback function is provided.  Real
  // modules will need to declare callback functions for
  // server/directory configuration, configuration merging
  // and other tasks.
  STANDARD20_MODULE_STUFF,
  NULL,
  NULL,
  NULL,
  NULL,
  NULL,
  mod_tut1_register_hooks,      /* callback for registering hooks */
};
Every module must declare its data structure as shown above. Since this module does not require any configuration, most of the callback locations have been left blank, except for the last one; that one is invoked by the httpd core so that the module can declare other functions that should be invoked to handle various events (like an HTTP request).
/*
 * This function is a callback and it declares what
 * other functions should be called for request
 * processing and configuration requests. This
 * callback function declares the Handlers for
 * other events.
 */
static void mod_tut1_register_hooks (apr_pool_t *p)
{
  // I think this is the call to make to register a
  // handler for method calls (GET PUT et. al.).
  // We will ask to be last so that the comment
  // has a higher tendency to go at the end.
  ap_hook_handler(mod_tut1_method_handler, NULL, NULL, APR_HOOK_LAST);
}
When this function is called by the HTTPD core, it registers a handler that should be invoked for all HTTP requests.
/*
 * This function is registered as a handler for HTTP methods and will
 * therefore be invoked for all GET requests (and others).  Regardless
 * of the request type, this function simply sends a message to
 * STDERR (which httpd redirects to logs/error_log).  A real module
 * would do *alot* more at this point.
 */
static int mod_tut1_method_handler (request_rec *r)
{
  // Send a message to stderr (apache redirects this to the error log)
  fprintf(stderr,"apache2_mod_tut1: A request was made.\n");

  // We need to flush the stream for messages to appear right away.
  // Performing an fflush() in a production system is not good for
  // performance - don't do this for real.
  fflush(stderr);

  // Return DECLINED so that the Apache core will keep looking for
  // other modules to handle this request.  This effectively makes
  // this module completely transparent.
  return DECLINED;
}
This is the function that will be invoked for every HTTP request. This is where the meat of an Apache module should go.

GNU Build Tools

Looking in $TUTORIAL_HOME/apache2_modules/tut1, you will find some familiar files that are included with most GNU applications.

Makefile.am - An input file for automake.
configure.in - An input file to autoconf.
mod_tut1.c - The source code to the tutorial module.
tutorial1.html - This file.

The remaining files can safely be ignored.

AUTHORS - automake will produce warnings if this file is not present.
COPYING - The GPL license. automake will complain if this file is not present.
CVS/ - CVS state directory. Ignore it. If you downloaded the tutorial using the tar ball then it won't even exist.
ChangeLog - Another automake file.
INSTALL - Standard install instructions. In this case, it points the reader to this file.
NEWS - Another automake file.
README - Another automake file.

This tutorial does not aim to be a complete reference for the GNU build tools. See the following references for information.

Aside from the module source code itself, the only files of interest to the reader are configure.in and Makefile.am. To briefly discuss these files without duplicating documentation contained in the above references:

configure.in is an input file to autoconf and is used to configure the module source code and dependencies for each target platform. Running autoconf creates the configure script we are all so familiar with.

Makefile.am is an input file to automake and is used to create a Makefile.in file. The Makefile.in file is then used by configure to create real Makefile's.

If you're confused - have no fear because I still am! You probably don't need to understand everything - just plug away through the tutorial. If you want to understand what's going on, I suggest you read the references cited above.

configure.in

I would be lying to you if I told you that I understand everything in this file. However, it seems to work so I'll tell you what I know. :) See configure.in for the raw file.
AC_INIT
The mandatory autoconf initialization macro.
# Automake initialization
AM_INIT_AUTOMAKE(mod_tut1, 1.0)
This macro is provided by automake and is required when automake is used. The arguments are the package name and version number. I have provided reasonable values for the parameters but still haven't figured out what their impact is.
AC_PROG_CC
AM_PROG_LIBTOOL
These two macros add checks for suitable cc and libtool programs.
AC_DEFUN([APACHE_DIR],[

  AC_ARG_WITH(
    apache,
    [  --with-apache[=DIR]     Apache server directory],
    ,
    [with_apache="no"]
  )

  AC_MSG_CHECKING(for Apache directory)

  if test "$with_apache" = "no"; then
    AC_MSG_ERROR( Specify the apache using --with-apache)
  else
    # make sure that a well known include file exists
    if test -e $with_apache/include/httpd.h; then
      apache_dir=$with_apache
      AC_MSG_RESULT(APACHE found!)
    else
      AC_MSG_ERROR( $with_apache not found. )
    fi
  fi

])
This declares a new autoconf macro named APACHE_DIR. It is used to handle the --with-apache=/usr/local/apache2 argument to configure.
APACHE_DIR
This runs the APACHE_DIR macro that was just defined. When successful, the directory location is stored in apache_dir.
AC_SUBST(apache_dir) 
Not all variables that are set in shell snippets are persisted to the configuration status file (config.status). This call to AC_SUBST persists the value of apache_dir.
AC_OUTPUT(Makefile)
Finally, AC_OUTPUT() saves the results of the configuration and causes a real Makefile to be generated.

Makefile.am

This file is used by automake to generate a Makefile.in file. As stated earlier, Makefile.in is then parsed using an invocation of configure to create an actual Makefile.

Since writing an Apache module is the same as writing, compiling and linking any standard shared library, automake is well suited to the task.

Again, consult the full automake documentation for all the info. See the raw Makefile.am.

lib_LTLIBRARIES = libmodtut1.la
This tells automake that we are creating a shared library named libmodtut1.la.
libmodtut1_la_SOURCES = mod_tut1.c
This tells automake what source files should be compiled as part of the library. In this case there is only one, but there could be several.
INCLUDES = -I@apache_dir@/include
Header files from the apache distribution are required when compiling the module. This directive provides a list of include directories to pass on to gcc. Does apache_dir look familiar? If you said yes, then step to the front of the class - configure will substitute the value that was passed in with --with-apache when the Makefile is written.

aclocal, autoconf, automake

Now that you have some idea of what those files mean we can run the utilities that use them.

aclocal is used to import macros defined by automake so that autoconf can understand what's going on.

cd $TUTORIAL_HOME/apache2_modules/tut1

# import automake m4 macros.
aclocal

# create configure based on configure.in
autoconf

# create Makefile.in based on Makefile.am and configure.in
automake -a

configure

Now we can run configure to prepare the module's Makefile.
# The ubiquitous configure script
./configure --with-apache=$TUTORIAL_HOME/apache2

make

And now we can run make to compile the module. Note: don't run make install. We'll handle the module installation later.
make

apxs

** DO NOT RUN make install ** Ordinarily you would, but the install step for an Apache module is different. Instead, apxs is used to register the module in httpd.conf and move the shared object into the apache lib directory.
$TUTORIAL_HOME/apache2/bin/apxs -i -a -n tut1 libmodtut1.la
apxs also adds the following line to httpd.conf:
LoadModule tut1_module        modules/libmodtut1.so

Run Apache

Now we are ready to run Apache and test the module.
# Change to the apache directory
cd $TUTORIAL_HOME/apache2

# Start Apache
bin/apachectl start

# Use Lynx to hit the web server.
lynx --source http://localhost:21000 | grep success

# Look for the module's message in the error log
cat logs/error_log | grep tut1
apache2_mod_tut1: A request was made.

Success!

The tutorial one module has been successfully compiled and installed into the Apache 2 runtime.

Updates

2003.06.03: Dmitry Muntean was kind enough to send in a question and resolution for a problem he was having.

Dmitry: ... on the aclocal step it said that "macro AM_PROG_LIBTOOL not found in library". After some looking around I've discovered that AM_PROG_LIBTOOL is changed to AC_PROG_LIBTOOL, so I did and aclocal went fine. Then when I launched autoconf it said to me:

configure.in:9: error: possibly undefined macro: AC_PROG_LIBTOOL
    If this token and others are legitimate, please use m4_pattern_allow.
    See the Autoconf documentation.
As far as I see, possibly you used another version of these utilities. I use automake 1.7.2 and autoconf 2.57. What version have you used? Or if this isn't the problem, can you point me to what it is?

Kevin: automake (GNU automake) 1.4-p4

Dmitry: The problem was that libtool.m4 must be included at the end of aclocal.m4, and ltmain.sh must be in that directory. Both files were taken from the latest libtool package (for now it is 1.2.5).


This tutorial guides the reader through the portions of the Apache API that are used by modules to control their configuration. For the moment, we will introduce the handling of a single server-wide configuration directive.

This module builds on Tutorial 1 by allowing the message that is written to the error log to be customized. Again, this is a trivial task, but I hope that it makes for a clear example of how to use the API to perform a task, without confusing the topic by doing something actually useful (that's left to you! haa ha!).

# Throughout the tutorial, look for links to Apache's
# LXR website http://lxr.webperf.org/
# For example, click on AP_MODULE_DECLARE_DATA below.
module AP_MODULE_DECLARE_DATA tut2_module;

Preparation

If you do not plan to compile and run the code presented here, you can skip this step. Otherwise, complete the preparation step from Tutorial One. After doing so the following should be true:

  • The $TUTORIAL_HOME environment variable should be set to $HOME/threebit-tutorials.
  • Apache2 is installed in $TUTORIAL_HOME/apache2.
  • The module source code is in $TUTORIAL_HOME/apache2_modules/tut2.

mod_tut2.c

The source code for this module is contained in one file (source code). The other files included with mod_tut2 were explained during Tutorial 1 and I will not duplicate their explanation here. Those portions of this module's source code that have not been changed since Tutorial 1 will also not be explained here.
#ifndef DEFAULT_MODTUT2_STRING
#define DEFAULT_MODTUT2_STRING "apache2_mod_tut2: A request was made."
#endif
Here we define the default string that will be written to the error log if the module has been loaded but the ModuleTutorialString configuration directive was not detected in httpd.conf.
module AP_MODULE_DECLARE_DATA tut2_module;
The AP_MODULE_DECLARE_DATA macro is used by a module to declare itself to the httpd core. The apache convention for naming the identifier is to use UNIQUE_NAME_module. That said, most people refer to a module by the reverse - hence, I call this module mod_tut2. I haven't been around long enough to know why this developed but somehow it did. In case you don't believe me, Auto Index is called mod_autoindex, but its module identifier is autoindex_module.
typedef struct {
  char *string;
} modtut2_config;
We will need to store the customizable string somewhere - this struct will be used to do so. It is silly to use a struct to hold a single string, but we may as well start off with a struct because it won't be long until our module needs a richer configuration.
static void *create_modtut2_config(apr_pool_t *p, server_rec *s)
{
  // This module's configuration structure.
  modtut2_config *newcfg;

  // Allocate memory from the provided pool.
  newcfg = (modtut2_config *) apr_pcalloc(p, sizeof(modtut2_config));

  // Set the string to a default value.
  newcfg->string = DEFAULT_MODTUT2_STRING;

  // Return the created configuration struct.
  return (void *) newcfg;
}
This function will be called once by the httpd core to create the initial module configuration. This is accomplished by allocating space for the struct from the provided apr_pool_t. A malloc function provided by APR is used so that it is impossible to leak memory during Apache's runtime - in other words, this module does not need to worry about freeing any memory in the future because when the pool is released, the memory allocated to this module's configuration is also automatically released. This pattern is used extensively throughout Apache.
static const command_rec mod_tut2_cmds[] =
{
  AP_INIT_TAKE1(
    "ModuleTutorialString",
    set_modtut2_string,
    NULL,
    RSRC_CONF,
    "ModTut2String (string) The error_log string."
  ),
  {NULL}
};
The httpd core is responsible for reading and parsing the httpd.conf configuration file. By default, Apache knows how to handle the default configuration directives. The array of command_rec structures above is passed to the httpd core by this module to declare a new configuration directive.

AP_INIT_TAKE1 - This macro declares a configuration directive that takes only one argument. The httpd core will take care of guaranteeing that the configuration is valid (a minimum and maximum of one argument) before bothering to call the provided function. This reduces a lot of duplication within each module. There are several options here depending on the purpose of the configuration directive (AP_INIT_NO_ARGS, AP_INIT_RAW_ARGS, AP_INIT_TAKE2, etc etc).

"ModuleTutorialString" - The configuration directive that may now appear in httpd.conf. I haven't looked it up, but I imagine there is a best-practices guide for creating configuration directives.

set_modtut2_string - This is the function that will be called by the httpd core when the configuration directive is detected (assuming it is properly formatted). This function is covered in detail below.

NULL - I don't know what this is for yet. :)

Update (2003.10.26) Maurizio Codogno contributes via email:
In Tutorial 2 you write that you don't yet know the meaning of the fourth field in macro AP_INIT_TAKE1 and similar, which usually is set as NULL. If I read the source correctly, this is used as a structure to send further data to the initializing function - the second parameter, "void *mconfig", when it is called.

RSRC_CONF - This field is used to state where the configuration directive may appear. By using RSRC_CONF we have stated that it can only appear outside of a <Directory> or <Location> scope. I *think* that means it can only be used globally, but you should confirm that.

Usage Message - In case of syntax errors, the httpd core will return this message to the user.

{NULL} - This is just a null placeholder in the array of command_rec structs. It is used to signal the end of new configuration directives.

static const char *set_modtut2_string(cmd_parms *parms,
  void *mconfig, const char *arg)
{
  modtut2_config *s_cfg = ap_get_module_config(
    parms->server->module_config, &tut2_module);

  s_cfg->string = (char *) arg;

  return NULL;
}
This function will be called by the httpd core when the configuration directive we specify later on is encountered in httpd.conf. Notice that it does not malloc space to hold the new configuration. Instead, the ap_get_module_config function is used to obtain it - somehow the httpd core will end up calling create_modtut2_config for us if it hasn't already.

Once the configuration has been obtained, we set the value of the string member to that of the provided argument. We do not need to make a copy of the argument because it is safe to use as is (I read that in the source somewhere, but I have lost the reference to it.)

Finally, we return NULL for success. We could have returned a (char*) containing an error message; httpd will return the string to the user in such a case.

static int mod_tut2_method_handler (request_rec *r)
{
  // Get the module configuration
  modtut2_config *s_cfg = ap_get_module_config(
    r->server->module_config, &tut2_module);

  // Send a message to the log file.
  // [thanks to Min Xu for the security suggestion]
  fprintf(stderr,"%s",s_cfg->string);

  // [deleted - trying to be brief]
}
And finally, the real workhorse. This function is called for each HTTP request. Again, we use ap_get_module_config to obtain the module configuration, though this time we do so by referencing it from the request record. The configured string is written to the error_log stream.
module AP_MODULE_DECLARE_DATA tut2_module =
{
  STANDARD20_MODULE_STUFF,
  NULL,
  NULL,
  create_modtut2_config,
  NULL,
  mod_tut2_cmds,
  mod_tut2_register_hooks,
};
And now we make another call to AP_MODULE_DECLARE_DATA to re-declare the module along with more information. This time around we provide two more details:

create_modtut2_config - Here we tell the httpd core what function should be called when the module configuration data needs to be created/allocated.

mod_tut2_cmds - Here we pass in the list of new configuration directives.

mod_tut2_register_hooks and STANDARD20_MODULE_STUFF are unchanged from the previous module.

Compile, Install, Run

Now it is time to compile and install the module. See the previous tutorial for an explanation of what's going on here.
cd $TUTORIAL_HOME/apache2_modules/tut2
aclocal
autoconf
automake -a
./configure --with-apache=$TUTORIAL_HOME/apache2
make
$TUTORIAL_HOME/apache2/bin/apxs -i -a -n tut2 libmodtut2.la
The module has now been installed. You may want to confirm that only mod_tut2 is enabled if you ran mod_tut1 previously.
# LoadModule tut1_module        modules/libmodtut1.so
LoadModule tut2_module        modules/libmodtut2.so
Also, if you want to customize the string then add the ModuleTutorialString directive to httpd.conf too. The last line in httpd.conf should be okay for this.
ModuleTutorialString "You need to put quotes around multiple words."
Restart Apache so mod_tut2 gets loaded. It's always a good idea to check the configuration first too.
# check the configuration then restart apache
cd $TUTORIAL_HOME/apache2
bin/apachectl configtest
bin/apachectl stop
bin/apachectl start
# make a request to cause a message to be written
lynx --source http://localhost:21000 | grep success
# look for the message in the error log.
tail -100l logs/error_log 



: