17 posts in the 'ITWeb/Hadoop일반' category

  1. 2012.04.27 ZooKeeper, what is it?
  2. 2012.03.07 How should hadoop master/slave, or jobtracker/namenode/datanode, each be installed and configured???
  3. 2012.03.07 Hadoop reference document links...
  4. 2012.03.07 hadoop map reduce working flow.
  5. 2012.03.06 Hadoop MapReduce WordCount: just following along..
  6. 2012.03.06 Just following along with Hadoop and testing it...
  7. 2012.02.29 hadoop-1.0.1 installation and a taste of testing

ZooKeeper, what is it?

ITWeb/Hadoop일반 2012. 4. 27. 17:10

Now I'm going to study ZooKeeper from the open source stack.
I'm pretty omnivorous when it comes to technology, so I just enjoy studying and learning new things.. haha
Then again, in this field a lifetime of studying still wouldn't be enough.

First, let me note down where to get the information needed to understand ZooKeeper.

Calling it "notes" is a stretch.. heh, it's just two links.. but for me those two documents seem to be enough.
Anything more I need I'll pick up from the community and so on.

Starting next week.. off we go~

[Original text]

http://zookeeper.apache.org/doc/trunk/zookeeperOver.html

ZooKeeper: A Distributed Coordination Service for Distributed Applications

ZooKeeper is a distributed, open-source coordination service for distributed applications. It exposes a simple set of primitives that distributed applications can build upon to implement higher level services for synchronization, configuration maintenance, and groups and naming. It is designed to be easy to program to, and uses a data model styled after the familiar directory tree structure of file systems. It runs in Java and has bindings for both Java and C.

Coordination services are notoriously hard to get right. They are especially prone to errors such as race conditions and deadlock. The motivation behind ZooKeeper is to relieve distributed applications the responsibility of implementing coordination services from scratch.




How should hadoop master/slave, or jobtracker/namenode/datanode, each be installed and configured???

ITWeb/Hadoop일반 2012. 3. 7. 14:47
[My own conclusion]
A Master/Slave layout seems to be the right approach.
Splitting the namenode and jobtracker off the Master doesn't seem conceptually right.
The jobtracker's role is to store the job to be run in hdfs and to assign it to, and manage it on, the tasktracker of each slave (i.e. datanode). Even if you separate them, hadoop's structure means that if the namenode dies the jobtracker loses its only window into the meta information, so putting them on the same machine seems like the right approach.

My remaining worry is that, since the jobtracker has to launch the job and fan it out, the load might concentrate on the master.. but I've never actually run bigdata through it, so I'll leave the question open and try it out next time.

[Conclusion]
- Machine 1 : Master (Namenode + JobTracker)
- Machine 2~N : Slave (Datanode + TaskTracker)
- To run a job, call runJob on the Master; the JobTracker then assigns the job to each TaskTracker and runs it.
- The excerpt below is the supporting evidence for my conclusion.
※ Captured from a presentation published by Cloudera.



On very small clusters, the NameNode, JobTracker and Secondary NameNode can all reside on a single machine.
– It is typical to put them on separate machines as the cluster grows beyond 20-30 nodes.


[Additional conclusion]
- According to the presentation, once the cluster grows beyond 20~30 nodes it is typical to split each role onto its own machine.


I get the general idea, but until I actually try it, it isn't really mine...
The ibm articles suggest a master/slave setup is fine..
while the hadoop docs make it sound like you configure namenode, jobtracker and datanode separately..

So.. do I just install hadoop-{$VERSION}.tar.gz on each server and set the configuration to match each machine's role??

Standalone mode just works even if you follow along without understanding anything...
so I really should try a Fully-distributed setup.
But I have no servers.. hmm.. when will I get to try it..

If you know of any good documents or materials, please share.
Until the day no developer has to flail around in the dark, the sharing of information must go on.... ^^;

[Reference site 1]
http://guru1013.egloos.com/2584725
- Looking at this site, it seems you just apply the same settings everywhere.
- Presumably because conf holds the master and slave settings, each node can figure out its own role automatically(?)

[Reference site 2]
http://cloudblog.8kmiles.com/2011/12/05/hadoop-fully-distributed-setup/
- Looking at this site, it seems configuring things on the Master machine is all it takes.
- Hmm.. this one seems right..

[Quoted from reference site 2]

OS & Tools used in this setup:

  • OS: Ubuntu – 11.04
  • JVM: Sun JDK – 1.6.0_26
  • Hadoop: Apache Hadoop – 0.20.2

Note: Identify the machines to setup hadoop in cluster mode. We have used 4 servers (2 Ubuntu & 2 Debian Servers – 1 machine as hadoop master, 3 machines as hadoop slave) in this example setup.

Our Setup:
1 hadoop master => ubuntu-server
3 hadoop-slaves => ubuntu1-xen, debian1-xen, debian2-xen

Follow the points from 1 to 3 explained below to setup hadoop in all the identified machines.

1. Prerequisites

Step-1: Follow the instructions in this link.

Step-2: If the identified machines are in the same network and can be accessed using dns (qualified names) then skip this step else, edit the /etc/hosts file in all the identified machines and update them with the hosts information of all the identified machines. The changes that we did for our setup are shown below…

user1@ubuntu-server:~$ sudo vim /etc/hosts
user1@ubuntu1-xen:~$ sudo vim /etc/hosts
user1@debian1-xen:~$ sudo vim /etc/hosts
user1@debian2-xen:~$ sudo vim /etc/hosts

Sample hosts information that we have used in our setup:

192.168.---.--- ubuntu-server
192.168.---.--- ubuntu1-xen
192.168.---.--- debian1-xen
192.168.---.--- debian2-xen

2. Setup Apache Hadoop

Follow the instructions in this link.

3. Configure Hadoop in Fully Distributed (or Cluster) Mode

Step-1: Edit the config file – /opt/hadoop/conf/masters as shown below.

localhost

Step-2: Edit the config file – /opt/hadoop/conf/slaves as shown below. (use dns qualified name if it exists)

ubuntu1-xen
debian1-xen
debian2-xen

Step-3: Edit the config file – /opt/hadoop/conf/core-site.xml as shown below.


Property: hadoop.tmp.dir
Description: A base directory for hadoop to store dfs and mapreduce data.
Default: /tmp/hadoop-${user.name}
Our Value: /var/opt/hadoop/cluster
How to?:

user1@ubuntu-server:~$ cd /var/opt
user1@ubuntu-server:/var/opt$ sudo mkdir hadoop
user1@ubuntu-server:/var/opt$ cd hadoop
user1@ubuntu-server:/var/opt/hadoop$ sudo mkdir cluster
user1@ubuntu-server:/var/opt/hadoop$ cd ..
user1@ubuntu-server:/var/opt$ sudo chown -R hadoop:hadoop hadoop

Property: fs.default.name
Description: The name of the default file system. A URI whose scheme and authority determine the FileSystem implementation.
Default: file:///
Our Value: hdfs://ubuntu-server:10818/
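
The XML block the tutorial refers to is not reproduced above, but from the property/value pairs listed, conf/core-site.xml would look roughly like this (a sketch only; the host and port are this tutorial's example, so adjust them to your own master):

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/var/opt/hadoop/cluster</value>
    </property>
    <property>
        <name>fs.default.name</name>
        <value>hdfs://ubuntu-server:10818/</value>
    </property>
</configuration>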

Step-4: Edit the config file – /opt/hadoop/conf/hdfs-site.xml as shown below.


Property: dfs.replication
Description: Default block replication.
Default: 3
Our Value: 3
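
Again, the referenced snippet is missing above; a minimal conf/hdfs-site.xml matching the value listed would be something like:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
    <property>
        <name>dfs.replication</name>
        <value>3</value>
    </property>
</configuration>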

Step-5: Edit the config file – /opt/hadoop/conf/mapred-site.xml as shown below.


Property: mapred.job.tracker
Description: The host and port that the MapReduce job tracker runs at. If “local” – (standalone mode), then jobs are run in-process as a single map and reduce task.
Default: local
Our Value: ubuntu-server:10814
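
And a minimal conf/mapred-site.xml for the value above (again, hostname and port are just the tutorial's example):

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
    <property>
        <name>mapred.job.tracker</name>
        <value>ubuntu-server:10814</value>
    </property>
</configuration>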

Step-6: Copy master’s public key (~/.ssh/id_rsa.pub) and append it in ~/.ssh/authorized_keys file in all the identified hadoop slave machines.

# HADOOP MASTER #
user1@ubuntu-server:~$ sudo su - hadoop
hadoop@ubuntu-server:~$ cat ~/.ssh/id_rsa.pub
# copy the master's public key.

Note: Do this in all the identified hadoop slave machines.

# HADOOP SLAVE #
user1@ubuntu1-xen:~$ sudo su - hadoop
hadoop@ubuntu1-xen:~$ vim ~/.ssh/authorized_keys
# paste the copied master's public key and save (:wq) the file.
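
If ssh-copy-id is available on the master, the copy/paste above can also be done with a one-liner per slave (hostnames are the tutorial's example; run it once for each slave):

# HADOOP MASTER (alternative, assuming ssh-copy-id is installed) #
hadoop@ubuntu-server:~$ ssh-copy-id hadoop@ubuntu1-xen
# or append the key over ssh directly:
hadoop@ubuntu-server:~$ cat ~/.ssh/id_rsa.pub | ssh hadoop@ubuntu1-xen 'cat >> ~/.ssh/authorized_keys'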

4. Run Hadoop Cluster

Step-1: Go to the hadoop master machine (in our case, the ubuntu-server machine) and log in as hadoop.

user1@ubuntu-server:~$ sudo su - hadoop
hadoop@ubuntu-server:~$ cd /opt/hadoop
hadoop@ubuntu-server:/opt/hadoop$

Step-2: ssh to all slaves from the master, e.g. as shown below…

hadoop@ubuntu-server:/opt/hadoop$ ssh ubuntu1-xen
hadoop@ubuntu-server:/opt/hadoop$ ssh debian1-xen
hadoop@ubuntu-server:/opt/hadoop$ ssh debian2-xen

Step-3: Format namenode.

hadoop@ubuntu-server:/opt/hadoop$ bin/hadoop namenode -format

Step-4: Start hadoop.

hadoop@ubuntu-server:/opt/hadoop$ bin/start-all.sh

To check if all the hadoop processes are running, use the jps command as shown below…

hadoop@ubuntu-server:/opt/hadoop$ jps

Master should list NameNode, JobTracker, SecondaryNameNode
All Slaves should list DataNode, TaskTracker

FAQ: Where to find the logs? – at /opt/hadoop/logs
FAQ: How to check hadoop is running or not? – use jps command or goto http://ubuntu-server:50070 to get more information on HDFS and goto http://ubuntu-server:50030 to get more information on MapReduce (Job Tracker)

Step-5: Stop hadoop.

hadoop@ubuntu-server:/opt/hadoop$ bin/stop-all.sh

That’s it!

Links:
How to configure hadoop in standalone mode?
How to configure hadoop in pseudo distributed mode?



Hadoop reference document links...

ITWeb/Hadoop일반 2012. 3. 7. 14:31
- Documents from the Apache foundation's Hadoop project.
http://hadoop.apache.org/
http://hadoop.apache.org/common/docs/current/hdfs_design.html


- Documents by Hyeong-Jun Kim, principal engineer at Gruter.
http://www.jaso.co.kr/category/project/lucene_hadoop


- Technical articles posted on developerWorks.
http://www.ibm.com/developerworks/kr/library/l-hadoop-1/
http://www.ibm.com/developerworks/kr/library/l-hadoop-2/


hadoop map reduce working flow.

ITWeb/Hadoop일반 2012. 3. 7. 14:09
[Reference sites]
http://architects.dzone.com/articles/how-hadoop-mapreduce-works
http://en.wikipedia.org/wiki/MapReduce
http://www.jaso.co.kr/265
http://nadayyh.springnote.com/pages/6064899?print=1
http://hadoop.apache.org/common/docs/r0.20.2/cluster_setup.html

I found a document that explains this in an easy-to-understand way.
Whatever you do, the basics matter.
Let's not just move on telling ourselves we sort of got it.
I keep looking things up and learning about my own open questions too.. haha

[Quoted article]

In my previous post, I talked about the methodology of transforming a sequential algorithm into a parallel one. After that, we can implement the parallel algorithm; one of the popular frameworks we can use is the Apache open-source Hadoop Map/Reduce framework.

Functional Programming

 

Multithreading is one of the popular ways of doing parallel programming, but the major complexity of multi-threaded programming is coordinating each thread's access to shared data. We need things like semaphores and locks, and we also have to use them with great care, otherwise deadlocks will result.

If we can eliminate the shared state completely, then the complexity of co-ordination will disappear.

This is the fundamental concept of functional programming. Data is explicitly passed between functions as parameters or return values, which can only be changed by the active function at that moment. Imagine functions are connected to each other via a directed acyclic graph. Since there is no hidden dependency (via shared state), functions in the DAG can run anywhere in parallel as long as one is not an ancestor of the other. In other words, analyzing the parallelism is much easier when there is no hidden dependency from shared state.

User defined Map/Reduce functions

 

Map/reduce is a special form of such a DAG which is applicable in a wide range of use cases. It is organized around a “map” function which transforms a piece of data into some number of key/value pairs. Each of these elements is then sorted by its key and sent to the same node, where a “reduce” function is used to merge the values (of the same key) into a single result.

map(input_record) {
    ...
    emit(k1, v1)
    ...
    emit(k2, v2)
    ...
}

reduce(key, values) {
    aggregate = initialize()
    while (values.has_next) {
        aggregate = merge(values.next)
    }
    collect(key, aggregate)
}

 

The Map/Reduce DAG is organized in this way.


A parallel algorithm is usually structured as multiple rounds of Map/Reduce.


HDFS

 

The distributed file system is designed to handle large files (multi-GB) with sequential read/write operation. Each file is broken into chunks, and stored across multiple data nodes as local OS files.



There is a master “NameNode” to keep track of the overall file directory structure and the placement of chunks. This NameNode is the central control point and may re-distribute replicas as needed. Each DataNode reports all of its chunks to the NameNode at boot-up. Each chunk has a version number which is increased on every update, so the NameNode knows when any of the chunks of a DataNode are stale (e.g. when the DataNode has crashed for some period of time). Those stale chunks will be garbage collected at a later time.

To read a file, the client API calculates the chunk index based on the offset of the file pointer and makes a request to the NameNode. The NameNode replies with the DataNodes that have a copy of that chunk. From this point, the client contacts the DataNode directly without going through the NameNode.

To write a file, the client API first contacts the NameNode, which designates one of the replicas as the primary (by granting it a lease). The NameNode's response says which replica is the primary and which are the secondary replicas. The client then pushes its changes to all DataNodes in any order, but each change is only stored in a buffer at each DataNode. After the changes are buffered at all DataNodes, the client sends a “commit” request to the primary, which determines an update order and pushes this order to all the secondaries. After all secondaries complete the commit, the primary responds to the client with the success. All changes to chunk distribution and metadata are written to an operation log file at the NameNode. This log file maintains an ordered list of operations, which is important for the NameNode to recover its view after a crash. The NameNode also maintains its persistent state by regularly checkpointing to a file. In case of a NameNode crash, a new NameNode will take over after restoring the state from the last checkpoint file and replaying the operation log.
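
To make the client side of this flow concrete, here is a minimal sketch of my own (not from the quoted article) that writes and reads a file through Hadoop's FileSystem API; the path is made up, and the NameNode/DataNode negotiation described above all happens behind fs.create() and fs.open():

import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWrite {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();   // reads core-site.xml, so fs.default.name points at the NameNode
    FileSystem fs = FileSystem.get(conf);       // client handle; metadata calls go to the NameNode

    Path file = new Path("/tmp/hello.txt");     // hypothetical HDFS path

    // Write: the NameNode assigns blocks/replicas, the data itself streams to the DataNodes.
    FSDataOutputStream out = fs.create(file, true);
    out.writeBytes("hello hdfs\n");
    out.close();

    // Read: the NameNode returns block locations, then the client reads from a DataNode directly.
    BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(file)));
    System.out.println(in.readLine());
    in.close();

    fs.close();
  }
}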

 

MapRed

 

The job execution starts when the client program submits a job configuration to the JobTracker, which specifies the map, combine and reduce functions, as well as the input and output paths of the data.


The JobTracker will first determine the number of splits (each split is configurable, ~16-64MB) from the input path, and select some TaskTrackers based on their network proximity to the data sources; the JobTracker then sends the task requests to those selected TaskTrackers.

Each TaskTracker starts the map phase by extracting the input data from its splits. For each record parsed by the “InputFormat”, it invokes the user-provided “map” function, which emits a number of key/value pairs into a memory buffer. A periodic wakeup process sorts the memory buffer into partitions for the different reducer nodes, invoking the “combine” function along the way. The key/value pairs are sorted into one of the R local files (suppose there are R reducer nodes).

When the map task completes (all splits are done), the TaskTracker will notify the JobTracker. When all the TaskTrackers are done, the JobTracker will notify the selected TaskTrackers for the reduce phase.

Each TaskTracker will read the region files remotely. It sorts the key/value pairs and, for each key, invokes the “reduce” function, which collects the key/aggregatedValue into the output file (one per reducer node).

The Map/Reduce framework is resilient to the crash of any component. The JobTracker keeps track of the progress of each phase and periodically pings the TaskTrackers for their health status. When a map-phase TaskTracker crashes, the JobTracker reassigns the map task to a different TaskTracker node, which reruns all of its assigned splits. If a reduce-phase TaskTracker crashes, the JobTracker reruns the reduce on a different TaskTracker.

After both phases complete, the JobTracker unblocks the client program.



[Private Thinking + Reference]
[How should it be laid out?]
 - [Master : Namenode : JobTracker] : Single Namenode Cluster
 - [Slave : Datanode : TaskTracker] : 1-N Datanode Cluster
 - [Client : Run Job] : a server(?) for launching jobs
 - The cluster setup document on the Apache Hadoop site says the following:
Typically you choose one machine in the cluster to act as the NameNode and one machine as to act as the JobTracker, exclusively. The rest of the machines act as both a DataNode and TaskTracker and are referred to as slaves.


[How do you run a job on the JobTracker?]
 - There are obviously many ways to launch it, but judging from the basic examples..
 - it is triggered by a Client's request, or
 - registered with cron or a scheduler and run periodically (a crontab sketch follows below).
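
For the cron/scheduler case, a crontab entry on the client (or master) could look like this; the paths, jar name and schedule are only an example, and the output directory gets a date suffix because Hadoop refuses to overwrite an existing output path:

# hypothetical crontab entry: run the WordCount job every day at 02:00
0 2 * * * /home/henry/app/hadoop/bin/hadoop jar /home/henry/app/hadoop/wordcount.jar org.apache.hadoop.examples.WordCount input output-$(date +\%Y\%m\%d) >> /tmp/wordcount-cron.log 2>&1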

[Then where does the MapReduce program need to live?]
 - At first I thought it could simply sit on the Master (Namenode), but that isn't right; the cluster needs at least three machines.
 - On top of that, I think a separate Client node is also needed.
 - The job is packaged as a jar and distributed anyway, and as the WordCount example shows, the basic run is just
bin/hadoop jar wordcount.jar org.apache.hadoop.examples.WordCount input output
 - so this command only needs to be executed at request time, or by a scheduler.
 - To understand this, see how the JobTracker and TaskTracker behave below.


When the user's main() method runs and calls runJob() of the JobClient class, the JobClient performs the following steps (a submission sketch follows the list):
 1. Builds job.xml from the settings in the JobConf and stores it in HDFS
 2. Packages the user's Job class, or the jar containing it, as job.jar and stores it in HDFS
 3. Calls getSplits() of the InputFormat and stores the returned values in HDFS as job.split
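
As a rough illustration of the client side of that description (a sketch using the old org.apache.hadoop.mapred API; the class name and paths are made up), job submission boils down to building a JobConf and handing it to JobClient.runJob():

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class SubmitJob {
  public static void main(String[] args) throws Exception {
    // Everything set on the JobConf ends up in the job.xml that runJob() stores in HDFS.
    JobConf conf = new JobConf(SubmitJob.class);
    conf.setJobName("submit example");
    // The real mapper/reducer/combiner classes and key/value types would be configured here,
    // just like job.setMapperClass(...) etc. in the WordCount example elsewhere on this blog.
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    // runJob() uploads job.xml, job.jar and job.split to HDFS, submits to the JobTracker,
    // and blocks until the job finishes.
    JobClient.runJob(conf);
  }
}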



Hadoop MapReduce WordCount: just following along..

ITWeb/Hadoop일반 2012. 3. 6. 16:50
[Reference sites]
http://hadoop.apache.org/common/docs/current/mapred_tutorial.html
http://hadoop.apache.org/common/docs/r0.20.0/hdfs_shell.html


[Before starting]
- I started with hadoop-0.21.0, following the tutorial above.
- And immediately hit a problem....
There is no hadoop-0.21.0-core.jar, so compilation kept failing.
- I assumed it was a classpath problem and fiddled with the settings, but it wouldn't work.
- So I unpacked all the hadoop*.jar files and just re-packed them into a single hadoop-0.21.0-core.jar.
- With that hadoop-0.21.0-core.jar on the classpath, the compile succeeded!!
- In hadoop-0.20.0 and earlier, a hadoop*core.jar is simply included in the tar.gz.


[WordCount.java Source Code]
/**
 * Licensed to the Apache Software Foundation (ASF) under one
 * or more contributor license agreements.  See the NOTICE file
 * distributed with this work for additional information
 * regarding copyright ownership.  The ASF licenses this file
 * to you under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in compliance
 * with the License.  You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
package org.apache.hadoop.examples;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class WordCount {

  public static class TokenizerMapper
       extends Mapper<Object, Text, Text, IntWritable>{
   
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();
     
    public void map(Object key, Text value, Context context
                    ) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }
 
  public static class IntSumReducer
       extends Reducer<Text,IntWritable,Text,IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values,
                       Context context
                       ) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    if (otherArgs.length != 2) {
      System.err.println("Usage: wordcount <in> <out>");
      System.exit(2);
    }
    Job job = new Job(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
    FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}


[Compiling]
cd $HADOOP_HOME
mkdir example
ubuntu:~/app/hadoop$ javac -cp ./hadoop-0.21.0-core.jar:./lib/commons-cli-1.2.jar -d example WordCount.java
ubuntu:~/app/hadoop$ jar cvf wordcount.jar -C example/ .


[Testing WordCount]
- First, create the files to test with.
cd $HADOOP_HOME/example
vi file01
Hello World Bye World

vi file02
Hello Hadoop Goodbye Hadoop

ubuntu:~/app/hadoop/example$ ../bin/hadoop fs -mkdir input
ubuntu:~/app/hadoop/example$ ../bin/hadoop fs -put file01 ./input/
ubuntu:~/app/hadoop/example$ ../bin/hadoop fs -put file02 ./input/
ubuntu:~/app/hadoop/example$ ../bin/hadoop jar ../wordcount.jar org.apache.hadoop.examples.WordCount input output
ubuntu:~/app/hadoop/example$ ../bin/hadoop fs -cat output/*
Bye    1
Goodbye    1
Hadoop    2
Hello    2
World    2

- You can see it works correctly.
- The important part here: some of you will be frustrated that there is no hadoop-*-core.jar. The workaround mentioned above is written out below, so take a look.


[Building hadoop-0.21.0-core.jar]
cd $HADOOP_HOME
mkdir hadoop-0.21.0-core
cp *.jar ./hadoop-0.21.0-core/
cd ./hadoop-0.21.0-core
jar xvf hadoop-hdfs-ant-0.21.0.jar          
jar xvf hadoop-mapred-examples-0.21.0.jar
jar xvf hadoop-common-0.21.0.jar       
jar xvf hadoop-hdfs-test-0.21.0-sources.jar 
jar xvf hadoop-mapred-test-0.21.0.jar
jar xvf hadoop-common-test-0.21.0.jar  
jar xvf hadoop-hdfs-test-0.21.0.jar         
jar xvf hadoop-mapred-tools-0.21.0.jar
jar xvf hadoop-hdfs-0.21.0-sources.jar 
jar xvf hadoop-mapred-0.21.0-sources.jar
jar xvf hadoop-hdfs-0.21.0.jar         
jar xvf hadoop-mapred-0.21.0.jar
# Delete every file and folder except the org directory.
cd ..
jar cvf hadoop-0.21.0-core.jar -C hadoop-0.21.0-core/ .
# Now run ls -al and you will see that hadoop-0.21.0-core.jar has been created.
# This is pure manual grunt work, so treat it as a reference only..


Just following along with Hadoop and testing it...

ITWeb/Hadoop일반 2012. 3. 6. 12:01
[Reference documents]
http://apache.mirror.cdnetworks.com//hadoop/common/
http://wiki.apache.org/hadoop/GettingStartedWithHadoop
http://wiki.apache.org/hadoop/HowToConfigure
http://wiki.apache.org/hadoop/QuickStart
http://hadoop.apache.org/common/docs/current/cluster_setup.html
http://hadoop.apache.org/common/docs/current/single_node_setup.html


[Prepare to Start the Hadoop Cluster]
Unpack the downloaded Hadoop distribution. In the distribution, edit the file conf/hadoop-env.sh to define at least JAVA_HOME to be the root of your Java installation.

Try the following command:
$ bin/hadoop
This will display the usage documentation for the hadoop script.


[Standalone Operation]
[hadoop-0.21.0]
cd $HADOOP_HOME
mkdir input
cp conf/*.xml input
bin/hadoop jar hadoop-mapred-examples-0.21.0.jar grep input output 'dfs[a-z.]+'
cat output/*

[hadoop-0.22.0]
ubuntu:~/app/hadoop-0.22.0$ mkdir input
ubuntu:~/app/hadoop-0.22.0$ cp conf/*.xml input
ubuntu:~/app/hadoop-0.22.0$ bin/hadoop jar hadoop-mapred-examples-0.22.0.jar grep input output 'dfs[a-z.]+'
ubuntu:~/app/hadoop-0.22.0$ cat output/*

[hadoop-1.0.1]
ubuntu:~/app/hadoop-1.0.1$ mkdir input
ubuntu:~/app/hadoop-1.0.1$ cp conf/*.xml input
ubuntu:~/app/hadoop-1.0.1$ bin/hadoop jar hadoop-examples-1.0.1.jar grep input output 'dfs[a-z.]+'
ubuntu:~/app/hadoop-1.0.1$ cat output/*
- Try it yourself and you will see: every version behaves the same and produces the same results.


[Pseudo-Distributed Operation]
[hadoop-0.21.0]
{conf/core-site.xml}
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
    <property>
        <name>fs.default.name</name>
        <value>hdfs://localhost:9000</value>
    </property>
</configuration>

{conf/hdfs-site.xml}
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
</configuration>

{conf/mapred-site.xml}
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
    <property>
        <name>mapred.job.tracker</name>
        <value>localhost:9001</value>
    </property>
</configuration>

{Setup passphraseless ssh}
Now check that you can ssh to the localhost without a passphrase:
$ ssh localhost

If you cannot ssh to localhost without a passphrase, execute the following commands:
$ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
- If you get "ssh: connect to host localhost port 22: Connection refused", first check that ssh is actually installed; if it is, check that the port is set correctly in /etc/ssh/sshd_config, restart sshd and try again (for more detail, Google it). A rough example follows below.
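
For reference, on Ubuntu the usual fix looks roughly like this (package and service names assume a stock Ubuntu box):

$ sudo apt-get install openssh-server     # if sshd is not installed at all
$ sudo vi /etc/ssh/sshd_config            # check the Port setting (22 by default)
$ sudo /etc/init.d/ssh restart            # restart sshd, then retry: ssh localhost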

{Execution}
ubuntu:~/app/hadoop-0.21.0$ bin/hadoop namenode -format
ubuntu:~/app/hadoop-0.21.0$ bin/start-all.sh

The hadoop daemon log output is written to the ${HADOOP_LOG_DIR} directory (defaults to ${HADOOP_HOME}/logs).

Browse the web interface for the NameNode and the JobTracker; by default they are available at:

    NameNode - http://localhost:50070/
    JobTracker - http://localhost:50030/

ubuntu:~/app/hadoop-0.21.0$ bin/hadoop fs -put conf input
ubuntu:~/app/hadoop-0.21.0$ bin/hadoop jar hadoop-mapred-examples-0.21.0.jar grep input output 'dfs[a-z.]+'
ubuntu:~/app/hadoop-0.21.0$ bin/hadoop fs -get output output
ubuntu:~/app/hadoop-0.21.0$ cat output/*
or
ubuntu:~/app/hadoop-0.21.0$ bin/hadoop fs -cat output/*


The other versions can be tested in exactly the same way.
Next time I plan to test reading/writing against HDFS.

hadoop-1.0.1 installation and a taste of testing

ITWeb/Hadoop일반 2012. 2. 29. 15:26
[Reference sites]
http://blog.softwaregeeks.org/archives/category/develop/hadoop 

http://www.ibm.com/developerworks/kr/library/l-hadoop-1/
http://www.ibm.com/developerworks/kr/library/l-hadoop-2/
http://hadoop.apache.org/common/docs/r0.20.2/cluster_setup.html



Let's install and test hadoop-1.0.1. (This is a single-machine setup.)
Installation and testing work the same as for the existing 0.20.X versions.

First, download the files you need.
- JDK : http://www.oracle.com/technetwork/java/javase/downloads/jdk-6u31-download-1501634.html
- Hadoop : http://mirror.apache-kr.org//hadoop/common/hadoop-1.0.1/
- Download the files, unpack them, and place them according to the directory layout below.

Basic directory layout
- I just installed it under my personal account.
- Normally you would create a hadoop account and install under that account.
/home
    /henry
        /app
            /jdk
            /hadoop

Environment setup
- These are bash settings.
- Add the following to .bash_profile or .bashrc (vi .bash_profile).
export JAVA_HOME=/home/henry/app/jdk
export HADOOP_HOME=/home/henry/app/hadoop
export PATH=$JAVA_HOME/bin:$HADOOP_HOME/bin:$PATH
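
After adding those lines you can do a quick sanity check like this (assuming the paths above):

source ~/.bashrc    # or open a new shell
echo $JAVA_HOME $HADOOP_HOME
java -version
hadoop version      # should report Hadoop 1.0.1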

Running the sample
- As with 0.20.X, there is an examples jar file:
- hadoop-examples-1.0.1.jar
The commands below are run from $HADOOP_HOME.
mkdir temp
cp hadoop-examples-1.0.1.jar ./temp/
cd temp
jar xvf hadoop-examples-1.0.1.jar
rm -rf META-INF
rm -f hadoop-examples-1.0.1.jar
jar cvf ../hadoop-examples-1.0.1.0.jar .
cd ..
vi input.txt # enter a bunch of words; the example is word count, after all
a
aa
b
bb
a
aaa
b
bb
c
cc
ccc
dd
cc
# This is what I put in.
hadoop jar hadoop-examples-1.0.1.0.jar org.apache.hadoop.examples.WordCount input.txt output
# Execution result
henry@ubuntu:~/app/hadoop$ hadoop jar hadoop-examples-1.0.1.0.jar org.apache.hadoop.examples.WordCount input.txt output
Warning: $HADOOP_HOME is deprecated.

12/02/29 15:16:06 INFO util.NativeCodeLoader: Loaded the native-hadoop library
****file:/home/henry/app/hadoop-1.0.1/input.txt
12/02/29 15:16:06 INFO input.FileInputFormat: Total input paths to process : 1
12/02/29 15:16:07 INFO mapred.JobClient: Running job: job_local_0001
12/02/29 15:16:07 INFO util.ProcessTree: setsid exited with exit code 0
12/02/29 15:16:07 INFO mapred.Task:  Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@ae533a
12/02/29 15:16:07 INFO mapred.MapTask: io.sort.mb = 100
12/02/29 15:16:07 INFO mapred.MapTask: data buffer = 79691776/99614720
12/02/29 15:16:07 INFO mapred.MapTask: record buffer = 262144/327680
12/02/29 15:16:07 INFO mapred.MapTask: Starting flush of map output
12/02/29 15:16:07 INFO mapred.MapTask: Finished spill 0
12/02/29 15:16:07 INFO mapred.Task: Task:attempt_local_0001_m_000000_0 is done. And is in the process of commiting
12/02/29 15:16:08 INFO mapred.JobClient:  map 0% reduce 0%
12/02/29 15:16:10 INFO mapred.LocalJobRunner:
12/02/29 15:16:10 INFO mapred.Task: Task 'attempt_local_0001_m_000000_0' done.
12/02/29 15:16:10 INFO mapred.Task:  Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@6782a9
12/02/29 15:16:10 INFO mapred.LocalJobRunner:
12/02/29 15:16:10 INFO mapred.Merger: Merging 1 sorted segments
12/02/29 15:16:10 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 82 bytes
12/02/29 15:16:10 INFO mapred.LocalJobRunner:
12/02/29 15:16:10 INFO mapred.Task: Task:attempt_local_0001_r_000000_0 is done. And is in the process of commiting
12/02/29 15:16:10 INFO mapred.LocalJobRunner:
12/02/29 15:16:10 INFO mapred.Task: Task attempt_local_0001_r_000000_0 is allowed to commit now
12/02/29 15:16:13 INFO mapred.LocalJobRunner: reduce > reduce
12/02/29 15:16:13 INFO mapred.Task: Task 'attempt_local_0001_r_000000_0' done.
12/02/29 15:16:14 INFO mapred.JobClient:  map 100% reduce 100%
12/02/29 15:16:14 INFO mapred.JobClient: Job complete: job_local_0001
12/02/29 15:16:14 INFO mapred.JobClient: Counters: 20
12/02/29 15:16:14 INFO mapred.JobClient:   File Output Format Counters
12/02/29 15:16:14 INFO mapred.JobClient:     Bytes Written=56
12/02/29 15:16:14 INFO mapred.JobClient:   FileSystemCounters
12/02/29 15:16:14 INFO mapred.JobClient:     FILE_BYTES_READ=288058
12/02/29 15:16:14 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=354898
12/02/29 15:16:14 INFO mapred.JobClient:   File Input Format Counters
12/02/29 15:16:14 INFO mapred.JobClient:     Bytes Read=36
12/02/29 15:16:14 INFO mapred.JobClient:   Map-Reduce Framework
12/02/29 15:16:14 INFO mapred.JobClient:     Map output materialized bytes=86
12/02/29 15:16:14 INFO mapred.JobClient:     Map input records=13
12/02/29 15:16:14 INFO mapred.JobClient:     Reduce shuffle bytes=0
12/02/29 15:16:14 INFO mapred.JobClient:     Spilled Records=18
12/02/29 15:16:14 INFO mapred.JobClient:     Map output bytes=88
12/02/29 15:16:14 INFO mapred.JobClient:     Total committed heap usage (bytes)=324665344
12/02/29 15:16:14 INFO mapred.JobClient:     CPU time spent (ms)=0
12/02/29 15:16:14 INFO mapred.JobClient:     SPLIT_RAW_BYTES=115
12/02/29 15:16:14 INFO mapred.JobClient:     Combine input records=13
12/02/29 15:16:14 INFO mapred.JobClient:     Reduce input records=9
12/02/29 15:16:14 INFO mapred.JobClient:     Reduce input groups=9
12/02/29 15:16:14 INFO mapred.JobClient:     Combine output records=9
12/02/29 15:16:14 INFO mapred.JobClient:     Physical memory (bytes) snapshot=0
12/02/29 15:16:14 INFO mapred.JobClient:     Reduce output records=9
12/02/29 15:16:14 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=0
12/02/29 15:16:14 INFO mapred.JobClient:     Map output records=13


henry@ubuntu:~/app/hadoop$ cat output/*
a    2
aa    1
aaa    1
b    2
bb    2
c    1
cc    2
ccc    1
dd    1


Pretty easy, right.. ^^;
Now shall we take a look at the Hadoop example code?
http://svn.apache.org/viewvc/hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-examples/src/main/java/org/apache/hadoop/examples/

/**
 * Licensed to the Apache Software Foundation (ASF) under one
 * or more contributor license agreements.  See the NOTICE file
 * distributed with this work for additional information
 * regarding copyright ownership.  The ASF licenses this file
 * to you under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in compliance
 * with the License.  You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
package org.apache.hadoop.examples;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class WordCount {

  public static class TokenizerMapper 
       extends Mapper<Object, Text, Text, IntWritable>{
    
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();
      
    public void map(Object key, Text value, Context context
                    ) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }
  
  public static class IntSumReducer 
       extends Reducer<Text,IntWritable,Text,IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, 
                       Context context
                       ) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    if (otherArgs.length != 2) {
      System.err.println("Usage: wordcount <in> <out>");
      System.exit(2);
    }
    Job job = new Job(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
    FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

This is very basic stuff, so let's figure out how to put it to use together.. ^^