How should Hadoop master/slave, or JobTracker/NameNode/DataNode, each be installed and configured?

ITWeb/Hadoop General 2012. 3. 7. 14:47

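[My own conclusion]

A Master/Slave layout seems to be the right approach.
Separating the NameNode and the JobTracker on the master does not seem conceptually right to me.
The JobTracker stores the job to be run in HDFS, then assigns it to the TaskTracker on each slave (that is, each DataNode) and manages it there. Even if the two were separated, Hadoop's architecture means that once the NameNode dies the JobTracker has no channel left for obtaining metadata, so putting both on the same machine looks like the right approach.

My worry, though, is that because the JobTracker is what launches and distributes jobs, a lot of load may pile up on the master. I have never actually run a big data workload, so I will leave this as an open question and test it later.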

[Conclusion]
- Machine 1 : Master (NameNode + JobTracker)
- Machine 2~N : Slave (DataNode + TaskTracker)
- A job is submitted on the Master with runJob, and the JobTracker then assigns it to each TaskTracker for execution.
- The slide excerpt below is the evidence behind my own conclusion.
※ Captured from a presentation published by Cloudera.

On very small clusters, the NameNode, JobTracker and Secondary NameNode can all reside on a single machine
– It is typical to put them on separate machines as the cluster grows beyond 20-30 nodes

[Additional conclusion]
- According to that presentation, once a cluster grows beyond 20-30 nodes it is typical to put each of these daemons on its own machine.
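
The rule of thumb from the slide can be written down as a tiny helper. This is only a sketch: the function name `suggest_layout` is made up, and the cutoff of 20 nodes is an assumption taken from the low end of the slide's 20-30 range, not a hard limit.

```shell
#!/bin/sh
# Hypothetical helper: print a daemon layout for a cluster of N nodes.
# THRESHOLD is an assumption (low end of the slide's "20-30 nodes").
THRESHOLD=20

suggest_layout() {
    nodes=$1
    if [ "$nodes" -le "$THRESHOLD" ]; then
        echo "single master: NameNode + JobTracker + SecondaryNameNode"
    else
        echo "separate machines: NameNode, JobTracker, SecondaryNameNode"
    fi
    # The slave role is the same in both layouts.
    echo "slaves: DataNode + TaskTracker"
}
```

For example, `suggest_layout 4` corresponds to the one-master, three-slave layout I settled on above.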

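I get the general idea, but until I actually try it, it is not really mine...
The IBM documentation suggests a master/slave layout is fine,
while the Hadoop documentation makes it look like NameNode, JobTracker, and DataNode should each be configured separately.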

So.. do I just install the hadoop-{$VERSION}.tar.gz file on each server and configure each one for its role??

Standalone mode works even if you just follow the steps without understanding them,
so I should try Fully-distributed mode for once.
I have no servers, though.. hmm.. I wonder when I will get to it.

If you know of any good documents or material, please share.
Until the day no developer has to flounder through trial and error, the sharing of information must go on.... ^^;

[Reference site 1]

http://guru1013.egloos.com/2584725

- According to this site, it looks like you apply the same settings on every machine.
- The reason being.. conf holds the masters and slaves settings, so perhaps each node figures out its own role automatically?

[Reference site 2]

http://cloudblog.8kmiles.com/2011/12/05/hadoop-fully-distributed-setup/

- According to this site, it looks like configuring things on the Master machine is all it takes.
- Hmm.. this one seems right..

[Reference site 2, reposted]

OS & Tools used in this setup:

OS: Ubuntu – 11.04
JVM: Sun JDK – 1.6.0_26
Hadoop: Apache Hadoop – 0.20.2

Note: Identify the machines on which to set up hadoop in cluster mode. We have used 4 servers (2 Ubuntu & 2 Debian Servers – 1 machine as hadoop master, 3 machines as hadoop slave) in this example setup.

Our Setup:
1 hadoop master => ubuntu-server
3 hadoop-slaves => ubuntu1-xen, debian1-xen, debian2-xen

Follow points 1 to 3 below to set up hadoop on all the identified machines.

1. Prerequisites

Step-1: Follow the instructions in this link.

Step-2: If the identified machines are in the same network and can be reached by DNS (qualified names), skip this step. Otherwise, edit the /etc/hosts file on every identified machine and add the host information for all of the identified machines. The changes that we made for our setup are shown below…

user1@ubuntu-server:~$ sudo vim /etc/hosts

user1@ubuntu1-xen:~$ sudo vim /etc/hosts

user1@debian1-xen:~$ sudo vim /etc/hosts

user1@debian2-xen:~$ sudo vim /etc/hosts

Sample hosts information that we have used in our setup:

192.168.---.--- ubuntu-server
192.168.---.--- ubuntu1-xen
192.168.---.--- debian1-xen
192.168.---.--- debian2-xen
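
Instead of editing each file by hand in vim, the entries could be appended with a small script. This is a minimal sketch, not part of the original article: `add_host_entry` and `HOSTS_FILE` are made-up names, and the 192.0.2.x addresses are documentation placeholders standing in for the masked addresses above. `HOSTS_FILE` defaults to a scratch file so the sketch is safe to run; on a real machine it would point at /etc/hosts and be run as root.

```shell
#!/bin/sh
# HOSTS_FILE is an assumption: a scratch file by default, /etc/hosts in practice.
HOSTS_FILE=${HOSTS_FILE:-$(mktemp)}

# Hypothetical helper: append "IP hostname" unless the hostname is already listed.
add_host_entry() {
    ip=$1
    name=$2
    touch "$HOSTS_FILE"
    # Skip comment lines, then compare every name field against the hostname.
    if awk -v n="$name" '
        /^[ \t]*#/ { next }
        { for (i = 2; i <= NF; i++) if ($i == n) found = 1 }
        END { exit !found }' "$HOSTS_FILE"; then
        return 0
    fi
    printf '%s %s\n' "$ip" "$name" >> "$HOSTS_FILE"
}

# Placeholder addresses (192.0.2.0/24 is reserved for documentation).
add_host_entry 192.0.2.10 ubuntu-server
add_host_entry 192.0.2.11 ubuntu1-xen
add_host_entry 192.0.2.12 debian1-xen
add_host_entry 192.0.2.13 debian2-xen
```

Because the helper checks for an existing entry first, running it again on the same machine should not produce duplicate lines.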

2. Setup Apache Hadoop

Follow the instructions in this link.

3. Configure Hadoop in Fully Distributed (or Cluster) Mode

Step-1: Edit the config file – /opt/hadoop/conf/masters as shown below.

localhost

Step-2: Edit the config file – /opt/hadoop/conf/slaves as shown below. (use dns qualified name if it exists)

ubuntu1-xen
debian1-xen
debian2-xen
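
The two files from Step-1 and Step-2 could also be written from a script, roughly as below. `CONF_DIR` is an assumed variable that defaults to a scratch directory; in this setup it would be /opt/hadoop/conf. As far as I know, in Hadoop 0.20.x conf/masters names the host where the SecondaryNameNode is started, while the NameNode and JobTracker start on whichever machine runs the start scripts.

```shell
#!/bin/sh
# CONF_DIR is an assumption: scratch directory by default,
# /opt/hadoop/conf on the cluster described in this post.
CONF_DIR=${CONF_DIR:-$(mktemp -d)}

# conf/masters: host(s) for the SecondaryNameNode in Hadoop 0.20.x.
printf '%s\n' localhost > "$CONF_DIR/masters"

# conf/slaves: one DataNode/TaskTracker host per line.
printf '%s\n' ubuntu1-xen debian1-xen debian2-xen > "$CONF_DIR/slaves"

echo "wrote $CONF_DIR/masters and $CONF_DIR/slaves"
```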

Step-3: Edit the config file – /opt/hadoop/conf/core-site.xml as shown below.

Property: hadoop.tmp.dir
Description: A base directory for hadoop to store dfs and mapreduce data.
Default: /tmp/hadoop-${user.name}
Our Value: /var/opt/hadoop/cluster
How to?:

user1@ubuntu-server:~$ cd /var/opt
user1@ubuntu-server:/var/opt$ sudo mkdir hadoop
user1@ubuntu-server:/var/opt$ cd hadoop
user1@ubuntu-server:/var/opt/hadoop$ sudo mkdir cluster
user1@ubuntu-server:/var/opt/hadoop$ cd ..
user1@ubuntu-server:/var/opt$ sudo chown -R hadoop:hadoop hadoop

Property: fs.default.name
Description: The name of the default file system. A URI whose scheme and authority determine the FileSystem implementation.
Default: file:///
Our Value: hdfs://ubuntu-server:10818/
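
As a rough sketch, the two properties above would end up in core-site.xml in the usual `<configuration>`/`<property>` layout of Hadoop site files. The values are the ones listed in this step; `CONF_DIR` is an assumed variable that defaults to a scratch directory so the snippet is safe to run anywhere, and on this setup the target would be /opt/hadoop/conf.

```shell
#!/bin/sh
# CONF_DIR is an assumption: scratch directory by default,
# /opt/hadoop/conf on the cluster described in this post.
CONF_DIR=${CONF_DIR:-$(mktemp -d)}

# Quoted 'EOF' keeps the shell from expanding anything inside the XML.
cat > "$CONF_DIR/core-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/var/opt/hadoop/cluster</value>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://ubuntu-server:10818/</value>
  </property>
</configuration>
EOF

echo "wrote $CONF_DIR/core-site.xml"
```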

Step-4: Edit the config file – /opt/hadoop/conf/hdfs-site.xml as shown below.

Property: dfs.replication
Description: Default block replication.
Default: 3
Our Value: 3

Step-5: Edit the config file – /opt/hadoop/conf/mapred-site.xml as shown below.

Property: mapred.job.tracker
Description: The host and port that the MapReduce job tracker runs at. If “local” – (standalone mode), then jobs are run in-process as a single map and reduce task.
Default: local
Our Value: ubuntu-server:10814
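
Since hdfs-site.xml and mapred-site.xml each carry a single property here, both could be produced by one small helper. This is a sketch under assumptions: `write_site_xml` and `CONF_DIR` are made-up names, and the helper does no XML escaping, which is fine for the plain values used in this setup.

```shell
#!/bin/sh
# CONF_DIR is an assumption: scratch directory by default,
# /opt/hadoop/conf on the cluster described in this post.
CONF_DIR=${CONF_DIR:-$(mktemp -d)}

# Hypothetical helper: write a *-site.xml file holding a single property.
# Values are inserted as-is (no XML escaping).
write_site_xml() {
    file=$1
    name=$2
    value=$3
    cat > "$file" <<EOF
<?xml version="1.0"?>
<configuration>
  <property>
    <name>$name</name>
    <value>$value</value>
  </property>
</configuration>
EOF
}

write_site_xml "$CONF_DIR/hdfs-site.xml"   dfs.replication    3
write_site_xml "$CONF_DIR/mapred-site.xml" mapred.job.tracker ubuntu-server:10814
```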

Step-6: Copy the master’s public key (~/.ssh/id_rsa.pub) and append it to the ~/.ssh/authorized_keys file on all the identified hadoop slave machines.

# HADOOP MASTER #
user1@ubuntu-server:~$ sudo su - hadoop
hadoop@ubuntu-server:~$ cat ~/.ssh/id_rsa.pub
# copy the master's public key.

Note: Do this in all the identified hadoop slave machines.

# HADOOP SLAVE #
user1@ubuntu1-xen:~$ sudo su - hadoop
hadoop@ubuntu1-xen:~$ vim ~/.ssh/authorized_keys
# paste the copied master's public key and save (:wq) the file.
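
The copy-and-paste step can be made repeatable with a helper that appends the key only if it is not already present. This is a minimal sketch: `append_key_once` is a made-up name, and how the master's id_rsa.pub reaches the slave (for example with scp) is left open. Where it is installed, OpenSSH's `ssh-copy-id` does a similar job from the master side.

```shell
#!/bin/sh
# Hypothetical helper: append a public key to an authorized_keys file once.
append_key_once() {
    pubkey_file=$1
    auth_file=$2
    # sshd expects restrictive permissions on the directory and the file.
    mkdir -p "$(dirname "$auth_file")"
    chmod 700 "$(dirname "$auth_file")"
    touch "$auth_file"
    chmod 600 "$auth_file"
    # -F fixed string, -x whole line, -q quiet, -f read the pattern from a file.
    if ! grep -qxFf "$pubkey_file" "$auth_file"; then
        cat "$pubkey_file" >> "$auth_file"
    fi
}

# Usage on a slave, as the hadoop user (the /tmp path is hypothetical):
#   append_key_once /tmp/master_id_rsa.pub "$HOME/.ssh/authorized_keys"
```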

4. Run Hadoop Cluster

Step-1: Go to the hadoop master machine (in our case, the ubuntu-server machine) and log in as hadoop.

user1@ubuntu-server:~$ sudo su - hadoop
hadoop@ubuntu-server:~$ cd /opt/hadoop
hadoop@ubuntu-server:/opt/hadoop$

Step-2: SSH into each slave from the master, e.g. as shown below…

hadoop@ubuntu-server:/opt/hadoop$ ssh ubuntu1-xen

hadoop@ubuntu-server:/opt/hadoop$ ssh debian1-xen

hadoop@ubuntu-server:/opt/hadoop$ ssh debian2-xen
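
With more than a few slaves, the same check can be looped over conf/slaves. The sketch below is an assumption-laden helper rather than anything shipped with Hadoop: `for_each_slave` and `SSH_CMD` are made-up names, and `SSH_CMD` exists only so the loop can be dry-run with `SSH_CMD=echo`. The first real connection to each host will presumably still prompt to accept its host key, which seems to be the point of this step.

```shell
#!/bin/sh
# SSH_CMD is an assumption: override with SSH_CMD=echo for a dry run.
SSH_CMD=${SSH_CMD:-ssh}

# Hypothetical helper: run a command on every host listed in a slaves file.
for_each_slave() {
    slaves_file=$1
    shift
    # Strip comments and blank lines, then visit each host in order.
    sed -e 's/#.*$//' -e '/^[[:space:]]*$/d' "$slaves_file" |
    while read -r host; do
        # </dev/null keeps ssh from swallowing the rest of the host list.
        $SSH_CMD "$host" "$@" </dev/null
    done
}

# Usage on the master, as the hadoop user:
#   for_each_slave /opt/hadoop/conf/slaves hostname
```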

Step-3: Format namenode.

hadoop@ubuntu-server:/opt/hadoop$ bin/hadoop namenode -format

Step-4: Start hadoop.

hadoop@ubuntu-server:/opt/hadoop$ bin/start-all.sh

To check if all the hadoop processes are running, use the jps command as shown below…

hadoop@ubuntu-server:/opt/hadoop$ jps

Master should list NameNode, JobTracker, SecondaryNameNode
All Slaves should list DataNode, TaskTracker
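
The expected process lists above can be checked mechanically by piping the `jps` output through a small filter. This is a sketch only: `check_daemons` is a made-up name, and it assumes the usual `jps` output of one `<pid> <MainClass>` pair per line. The class name is compared exactly so that SecondaryNameNode is not mistaken for NameNode.

```shell
#!/bin/sh
# Hypothetical helper: read `jps` output on stdin and report missing daemons.
check_daemons() {
    output=$(cat)
    missing=""
    for daemon in "$@"; do
        # Match the second field (the main class name) exactly.
        if ! printf '%s\n' "$output" |
            awk -v d="$daemon" '$2 == d { ok = 1 } END { exit !ok }'; then
            missing="$missing $daemon"
        fi
    done
    if [ -n "$missing" ]; then
        echo "missing:$missing"
        return 1
    fi
    echo "all running"
}

# Usage:
#   on the master: jps | check_daemons NameNode JobTracker SecondaryNameNode
#   on a slave:    jps | check_daemons DataNode TaskTracker
```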

FAQ: Where to find the logs? – at /opt/hadoop/logs
FAQ: How to check whether hadoop is running? – use the jps command, or go to http://ubuntu-server:50070 for more information on HDFS and http://ubuntu-server:50030 for more information on MapReduce (Job Tracker)

Step-5: Stop hadoop.

hadoop@ubuntu-server:/opt/hadoop$ bin/stop-all.sh

That’s it!

Links:
How to configure hadoop in standalone mode?
How to configure hadoop in pseudo distributed mode?

jjeong