Posts tagged 'hadoop' (Page 2)

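[Hadoop] Hive + Hadoop HA architecture diagram.

ITWeb/Hadoop General 2013. 4. 26. 11:00

I wrote this based on what I understood while installing and configuring VMs yesterday, so it may not qualify as a proper HA setup.

On top of that, I have only configured it in a development environment and never deployed it to production, so treat it with even more caution.

[Components]

- HAProxy : used to make the hive service highly available

- Hive : loads data into hadoop and analyzes it

- MySQL : stores hive's metadata

- Hadoop : distributed file system and MR management

[Architecture diagram]

As I always find when working with open source, nobody anywhere seems to share the information that really matters.

I can understand that, since people earned that experience the hard way, but it feels like the inconvenient truth behind open source's supposedly easy accessibility.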

:

[Hadoop] hadoop version branch history.

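ITWeb/Hadoop General 2013. 4. 24. 15:54

While testing hadoop shell commands I got curious about overwrite behavior, so here is a summary of what I found.

First, let's look at the hadoop versions.

So, which version added the overwrite feature?

https://issues.apache.org/jira/browse/HADOOP-7361

That's right.

It is included in the branches that start from 0.23.x.

Don't waste time like I did wondering why it doesn't work on 1.0.x.

I'm posting this because I can't leave a question unanswered.

On 1.0.x, delete the existing file first and then upload it again.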

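[Hadoop] bin/hadoop fs & dfs are the same.

ITWeb/Hadoop General 2013. 4. 24. 15:31

I haven't been using Hadoop for long, so I have a lot of questions.

That's why I'm working through the basics step by step.

First, a question that came up while testing the wordcount m/r example.

bin/hadoop fs

bin/hadoop dfs

Both of these show up when you read the documentation.

I wondered how the two differ, so I looked it up.

Open the bin/hadoop file and you will find the answer.

The conclusion: the two are the same.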

......
elif [ "$COMMAND" = "fs" ] ; then
  CLASS=org.apache.hadoop.fs.FsShell
  HADOOP_OPTS="$HADOOP_OPTS $HADOOP_CLIENT_OPTS"
elif [ "$COMMAND" = "dfs" ] ; then
  CLASS=org.apache.hadoop.fs.FsShell
  HADOOP_OPTS="$HADOOP_OPTS $HADOOP_CLIENT_OPTS"
......
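
The same idea can be written as a small lookup table. This is only a sketch to make the point visible: the class name CommandDispatchSketch and its resolve() method are made up for illustration, and the real dispatch stays in the bin/hadoop shell script shown above.

```java
import java.util.Map;

// Hypothetical illustration only; the real dispatch lives in the bin/hadoop shell script.
class CommandDispatchSketch {

    // Both subcommands point at the same class, as in the script excerpt.
    static final Map<String, String> COMMAND_TO_CLASS = Map.of(
            "fs", "org.apache.hadoop.fs.FsShell",
            "dfs", "org.apache.hadoop.fs.FsShell");

    // Returns the class name registered for a subcommand.
    static String resolve(String command) {
        String cls = COMMAND_TO_CLASS.get(command);
        if (cls == null) {
            throw new IllegalArgumentException("unknown command: " + command);
        }
        return cls;
    }

    public static void main(String[] args) {
        // Prints "true": fs and dfs resolve to the same class.
        System.out.println(resolve("fs").equals(resolve("dfs")));
    }
}
```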

:

[elasticsearch] Installing the elasticsearch-hadoop plugin.

Elastic/Elasticsearch 2013. 4. 3. 17:13

Main Site : http://www.elasticsearch.org/guide/reference/modules/gateway/hadoop/

Plugin Site : https://github.com/elasticsearch/elasticsearch-hadoop

bin/plugin -install elasticsearch/elasticsearch-hadoop/1.2.0

:

[hadoop] Unable to load realm info from SCDynamicStore

ITWeb/Hadoop General 2013. 4. 3. 16:03

The following error shows up in the log.

Unable to load realm info from SCDynamicStore

I am testing with hadoop-1.0.4.

See the link below for how to resolve this error.

http://stackoverflow.com/questions/7134723/hadoop-on-osx-unable-to-load-realm-info-from-scdynamicstore

It is listed as a known issue.

[Solution]

vi conf/hadoop-env.sh

this line : # export HADOOP_OPTS=-server

to : export HADOOP_OPTS="-Djava.security.krb5.realm= -Djava.security.krb5.kdc="

Then run

bin/start-all.sh

and you can confirm that the error is gone.

The underlying issue is security-related;

specifically, it concerns kerberos authentication.

:

[hadoop] Tip: error when running hadoop (Error: JAVA_HOME is not set.)

ITWeb/Hadoop General 2013. 4. 3. 15:26

After installing Hadoop and running it as described in the basic tutorial, you may see the following error.

Error: JAVA_HOME is not set.

Pretty self-explanatory!

Open conf/hadoop-env.sh under the installation path and you will see the commented-out line below.

Just set it to your actual java home path.

line : # export JAVA_HOME=/usr/lib/j2sdk1.5-sun

to : export JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/HOME

I tested the installation on a Mac, so I changed it as shown above.

:

Hadoop reference document links...

ITWeb/Hadoop General 2012. 3. 7. 14:31

- Hadoop project documents from the Apache Software Foundation.

http://hadoop.apache.org/
http://hadoop.apache.org/common/docs/current/hdfs_design.html

- Documents by Hyung-jun Kim (김형준), principal engineer at Gruter.

http://www.jaso.co.kr/category/project/lucene_hadoop

- Technical articles published on developerWorks.

http://www.ibm.com/developerworks/kr/library/l-hadoop-1/
http://www.ibm.com/developerworks/kr/library/l-hadoop-2/

:

hadoop map reduce workflow.

ITWeb/Hadoop General 2012. 3. 7. 14:09

[Reference sites]

http://architects.dzone.com/articles/how-hadoop-mapreduce-works
http://en.wikipedia.org/wiki/MapReduce
http://www.jaso.co.kr/265
http://nadayyh.springnote.com/pages/6064899?print=1
http://hadoop.apache.org/common/docs/r0.20.2/cluster_setup.html

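I found an article that explains this in an easy-to-understand way.
Fundamentals matter in everything.
Let's not skip over things just because we half understand them.
I keep looking up my own open questions and learning too.

[Quoted article]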

In my previous post, I talked about the methodology for transforming a sequential algorithm into a parallel one. After that, we can implement the parallel algorithm; one popular framework we can use is the open source Apache Hadoop Map/Reduce framework.

Functional Programming

Multithreading is one of the popular ways of doing parallel programming, but the major complexity of multi-threaded programming is coordinating each thread's access to shared data. We need things like semaphores and locks, and must use them with great care; otherwise deadlocks will result.

If we can eliminate the shared state completely, then the complexity of coordination will disappear.

This is the fundamental concept of functional programming. Data is explicitly passed between functions as parameters or return values, which can only be changed by the active function at that moment. Imagine functions connected to each other via a directed acyclic graph. Since there is no hidden dependency (via shared state), functions in the DAG can run anywhere in parallel as long as one is not an ancestor of the other. In other words, analyzing the parallelism is much easier when there is no hidden dependency from shared state.

User defined Map/Reduce functions

Map/reduce is a special form of such a DAG which is applicable in a wide range of use cases. It is organized as a “map” function which transforms a piece of data into some number of key/value pairs. These pairs are then sorted by key so that all pairs with the same key reach the same node, where a “reduce” function is used to merge the values (of the same key) into a single result.

map(input_record) {
  ...
  emit(k1, v1)
  ...
  emit(k2, v2)
  ...
}

reduce (key, values) {
  aggregate = initialize()
  while (values.has_next) {
    aggregate = merge(aggregate, values.next)
  }
  collect(key, aggregate)
}
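
To make the pseudocode concrete, here is a minimal single-process sketch in plain Java. It is not Hadoop code: the class MapReduceSketch and its methods map(), reduce() and run() are names I made up, and a TreeMap stands in for the sort-and-group step that the framework performs between the two phases.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical single-process illustration of map -> group by key -> reduce.
// None of these names come from Hadoop.
class MapReduceSketch {

    // "map": turn one input record into (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String record) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String token : record.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                pairs.add(Map.entry(token, 1));
            }
        }
        return pairs;
    }

    // "reduce": merge all values that share a key into a single result.
    static int reduce(String key, List<Integer> values) {
        int aggregate = 0;
        for (int v : values) {
            aggregate += v;
        }
        return aggregate;
    }

    static Map<String, Integer> run(List<String> records) {
        // Shuffle stand-in: a TreeMap groups values by key and keeps keys sorted.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String record : records) {
            for (Map.Entry<String, Integer> pair : map(record)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                       .add(pair.getValue());
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            result.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // Prints {Bye=1, Goodbye=1, Hadoop=2, Hello=2, World=2}
        System.out.println(run(List.of("Hello World Bye World",
                                       "Hello Hadoop Goodbye Hadoop")));
    }
}
```

Because map() and reduce() share no state, each call could in principle run on a different machine; only the grouping step needs to see pairs from every mapper.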

The Map/Reduce DAG is organized in this way.

A parallel algorithm is usually structured as multiple rounds of Map/Reduce.

HDFS

The distributed file system is designed to handle large files (multi-GB) with sequential read/write operations. Each file is broken into chunks, which are stored across multiple data nodes as local OS files.

There is a master “NameNode” that keeps track of the overall file directory structure and the placement of chunks. This NameNode is the central control point and may redistribute replicas as needed. Each DataNode reports all its chunks to the NameNode at bootup. Each chunk has a version number which is increased on every update. Therefore, the NameNode knows if any of the chunks of a DataNode are stale (e.g. when the DataNode has crashed for some period of time). Those stale chunks will be garbage collected at a later time.

To read a file, the client API calculates the chunk index based on the offset of the file pointer and makes a request to the NameNode. The NameNode replies with the DataNodes that have a copy of that chunk. From this point, the client contacts the DataNode directly without going through the NameNode.
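
The chunk index calculation in the read path is just integer division. A minimal sketch, assuming a fixed chunk size of 64 MB (my assumption for the example; the article does not state the chunk size) and hypothetical names ChunkIndexSketch, chunkIndex() and offsetInChunk():

```java
// Hypothetical helper; the names and the 64 MB chunk size are assumptions.
class ChunkIndexSketch {

    // Index of the chunk that contains the given byte offset.
    static long chunkIndex(long fileOffset, long chunkSizeBytes) {
        if (fileOffset < 0 || chunkSizeBytes <= 0) {
            throw new IllegalArgumentException("offset must be >= 0 and chunk size > 0");
        }
        return fileOffset / chunkSizeBytes;
    }

    // Position of that byte inside its chunk.
    static long offsetInChunk(long fileOffset, long chunkSizeBytes) {
        if (fileOffset < 0 || chunkSizeBytes <= 0) {
            throw new IllegalArgumentException("offset must be >= 0 and chunk size > 0");
        }
        return fileOffset % chunkSizeBytes;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long chunkSize = 64 * mb;   // assumed chunk size
        long offset = 200 * mb;     // file pointer at 200 MB
        // 200 MB falls in chunk 3 (chunks 0..2 cover the first 192 MB), 8 MB into it.
        System.out.println(chunkIndex(offset, chunkSize));         // prints 3
        System.out.println(offsetInChunk(offset, chunkSize) / mb); // prints 8
    }
}
```

The client would send only the file name and chunk index to the NameNode, then use the in-chunk offset when talking to the DataNode.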

To write a file, the client API first contacts the NameNode, which designates one of the replicas as the primary (by granting it a lease). The NameNode's response identifies the primary and the secondary replicas. The client then pushes its changes to all DataNodes in any order, and each DataNode stores the change in a buffer. After the changes are buffered at all DataNodes, the client sends a “commit” request to the primary, which determines an update order and pushes this order to all the secondaries. After all secondaries complete the commit, the primary responds to the client with the success status. All changes to chunk distribution and metadata are written to an operation log file at the NameNode. This log file maintains an ordered list of operations, which is important for the NameNode to recover its view after a crash. The NameNode also maintains its persistent state by regularly checkpointing to a file. If the NameNode crashes, a new NameNode takes over after restoring the state from the last checkpoint file and replaying the operation log.

MapRed

The job execution starts when the client program submits to the JobTracker a job configuration that specifies the map, combine and reduce functions, as well as the input and output paths of the data.

The JobTracker first determines the number of splits (the split size is configurable, ~16-64MB) from the input path and selects some TaskTrackers based on their network proximity to the data sources; the JobTracker then sends the task requests to those selected TaskTrackers.

Each TaskTracker starts the map phase processing by extracting the input data from the splits. For each record parsed by the “InputFormat”, it invokes the user-provided “map” function, which emits a number of key/value pairs into the memory buffer. A periodic wakeup process sorts the memory buffer by destination reducer node, invoking the “combine” function. The key/value pairs are sorted into one of the R local files (suppose there are R reducer nodes).
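
How a key ends up in one of the R files can be sketched with a simple hash partitioner. The class PartitionSketch and its methods are hypothetical names; as far as I know, Hadoop's default HashPartitioner uses essentially the same formula, but treat this as an illustration rather than the framework's code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of routing keys to one of R reducer buckets.
class PartitionSketch {

    // Same key -> same bucket, so one reducer sees every value for that key.
    static int partitionFor(String key, int numReducers) {
        if (numReducers <= 0) {
            throw new IllegalArgumentException("numReducers must be positive");
        }
        // Mask the sign bit so a negative hashCode cannot yield a negative index.
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    // Distributes keys into numReducers buckets (stand-ins for the R local files).
    static List<List<String>> partition(List<String> keys, int numReducers) {
        List<List<String>> buckets = new ArrayList<>();
        for (int i = 0; i < numReducers; i++) {
            buckets.add(new ArrayList<>());
        }
        for (String key : keys) {
            buckets.get(partitionFor(key, numReducers)).add(key);
        }
        return buckets;
    }

    public static void main(String[] args) {
        List<String> keys = List.of("Hello", "World", "Bye", "World", "Hadoop");
        List<List<String>> buckets = partition(keys, 3);
        for (int i = 0; i < buckets.size(); i++) {
            System.out.println("reducer " + i + ": " + buckets.get(i));
        }
    }
}
```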

When the map task completes (all splits are done), the TaskTracker notifies the JobTracker. When all the TaskTrackers are done, the JobTracker notifies the selected TaskTrackers to start the reduce phase.

Each TaskTracker reads the region files remotely. It sorts the key/value pairs and, for each key, invokes the “reduce” function, which collects the key/aggregatedValue into the output file (one per reducer node).

The Map/Reduce framework is resilient to the crash of any component. The JobTracker keeps track of the progress of each phase and periodically pings the TaskTrackers for their health status. When any map phase TaskTracker crashes, the JobTracker reassigns the map task to a different TaskTracker node, which reruns all the assigned splits. If a reduce phase TaskTracker crashes, the JobTracker reruns the reduce at a different TaskTracker.

After both phases complete, the JobTracker unblocks the client program.

[Private Thinking+Reference]

[How is it set up?]
- [Master : Namenode : JobTracker] : Single Namenode Cluster
- [Slave : Datanode : TaskTracker] : 1-N Datanode Cluster
- [Client : Run Job] : a server (?) for launching jobs
- The cluster setup document on the Apache Hadoop site puts it as follows.

Typically you choose one machine in the cluster to act as the NameNode and one machine to act as the JobTracker, exclusively. The rest of the machines act as both a DataNode and TaskTracker and are referred to as slaves.

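[How is the JobTracker run?]
- There are many ways to launch a job, but judging from the basic examples..
- it is launched on a request from a Client, or
- it is registered with Cron or a scheduler and launched periodically

[So where does the MapReduce program need to live?]
- It seems it would be fine on the Master (Namenode).
- The line above is wrong, so disregard it. It looks like a minimum of three machines is needed.
- Still, I think an additional Client Node is needed in the setup.
- After all, the program is packaged as a jar and deployed, and it is launched as the WordCount example shows..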

bin/hadoop jar wordcount.jar org.apache.hadoop.examples.WordCount input output

- It should be enough to run this command at request time or have a scheduler run it.
- To understand this, see the description below of how the JobTracker and TaskTracker work.

When the user's main() method runs and calls runJob() of the JobClient class, the JobClient performs the following steps.
1. Builds job.xml from the information set in jobConf and stores it in HDFS
2. Packages the user's Job class, or the jar file containing it, as job.jar and stores it in HDFS
3. Calls the InputFormat's getSplits() method and uses the returned value to store the job.split file in HDFS

:

Hadoop MapReduce WordCount: just following along..

ITWeb/Hadoop General 2012. 3. 6. 16:50

[Reference sites]

http://hadoop.apache.org/common/docs/current/mapred_tutorial.html
http://hadoop.apache.org/common/docs/r0.20.0/hdfs_shell.html

[Before starting]

- I started with hadoop-0.21.0, following the tutorial above.
- And immediately ran into a problem....

There is no hadoop-0.21.0-core.jar file, so compilation keeps failing.

- I assumed it was a classpath problem and tried all sorts of settings, but nothing worked.
- So I unpacked every hadoop*.jar and simply repacked them into a single hadoop-0.21.0-core.jar.
- With hadoop-0.21.0-core.jar on the classpath, the compile succeeded!
- In hadoop-0.20.0 and earlier, hadoop*core.jar is simply included in the tar.gz file.

[WordCount.java Source Code]

/**
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
*     http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
package org.apache.hadoop.examples;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class WordCount {

public static class TokenizerMapper
       extends Mapper<Object, Text, Text, IntWritable>{

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context
                    ) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
}

public static class IntSumReducer
       extends Reducer<Text,IntWritable,Text,IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values,
                       Context context
                       ) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
}

public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    if (otherArgs.length != 2) {
      System.err.println("Usage: wordcount <in> <out>");
      System.exit(2);
    }
    Job job = new Job(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
    FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}

[Compiling]

cd $HADOOP_HOME
mkdir example
ubuntu:~/app/hadoop$ javac -cp ./hadoop-0.21.0-core.jar:./lib/commons-cli-1.2.jar -d example WordCount.java
ubuntu:~/app/hadoop$ jar cvf wordcount.jar -C example/ .

[Testing WordCount]

- First, create the files to test with.
cd $HADOOP_HOME/example
vi file01

Hello World Bye World

vi file02

Hello Hadoop Goodbye Hadoop

ubuntu:~/app/hadoop/example$ ../bin/hadoop fs -mkdir input
ubuntu:~/app/hadoop/example$ ../bin/hadoop fs -put file01 ./input/
ubuntu:~/app/hadoop/example$ ../bin/hadoop fs -put file02 ./input/
ubuntu:~/app/hadoop/example$ ../bin/hadoop jar ../wordcount.jar org.apache.hadoop.examples.WordCount input output
ubuntu:~/app/hadoop/example$ ../bin/hadoop fs -cat output/*

Bye    1
Goodbye    1
Hadoop    2
Hello    2
World    2

- You can see that it works correctly.
- The important point here: some of you will be frustrated by the missing hadoop-*-core.jar. The approach mentioned above is written out below for reference.

[Building hadoop-0.21.0-core.jar]

cd $HADOOP_HOME
mkdir hadoop-0.21.0-core
cp *.jar ./hadoop-0.21.0-core/
cd ./hadoop-0.21.0-core
jar xvf hadoop-hdfs-ant-0.21.0.jar
jar xvf hadoop-mapred-examples-0.21.0.jar
jar xvf hadoop-common-0.21.0.jar
jar xvf hadoop-hdfs-test-0.21.0-sources.jar
jar xvf hadoop-mapred-test-0.21.0.jar
jar xvf hadoop-common-test-0.21.0.jar
jar xvf hadoop-hdfs-test-0.21.0.jar
jar xvf hadoop-mapred-tools-0.21.0.jar
jar xvf hadoop-hdfs-0.21.0-sources.jar
jar xvf hadoop-mapred-0.21.0-sources.jar
jar xvf hadoop-hdfs-0.21.0.jar
jar xvf hadoop-mapred-0.21.0.jar
# Delete every file and folder except the org folder.
cd ..
jar cvf hadoop-0.21.0-core.jar -C hadoop-0.21.0-core/ .
# Now ls -al should show the newly created hadoop-0.21.0-core.jar.
# This is a pure brute-force method.. treat it as a reference only.

:

Hadoop: just following along and testing...

ITWeb/Hadoop General 2012. 3. 6. 12:01

[Reference documents]

http://apache.mirror.cdnetworks.com//hadoop/common/
http://wiki.apache.org/hadoop/GettingStartedWithHadoop
http://wiki.apache.org/hadoop/HowToConfigure
http://wiki.apache.org/hadoop/QuickStart
http://hadoop.apache.org/common/docs/current/cluster_setup.html
http://hadoop.apache.org/common/docs/current/single_node_setup.html

[Prepare to Start the Hadoop Cluster]

Unpack the downloaded Hadoop distribution. In the distribution, edit the file conf/hadoop-env.sh to define at least JAVA_HOME to be the root of your Java installation.

Try the following command:
$ bin/hadoop
This will display the usage documentation for the hadoop script.

[Standalone Operation]

[hadoop-0.21.0]
cd $HADOOP_HOME
mkdir input
cp conf/*.xml input
bin/hadoop jar hadoop-mapred-examples-0.21.0.jar grep input output 'dfs[a-z.]+'
cat output/*

[hadoop-0.22.0]
ubuntu:~/app/hadoop-0.22.0$ mkdir input
ubuntu:~/app/hadoop-0.22.0$ cp conf/*.xml input
ubuntu:~/app/hadoop-0.22.0$ bin/hadoop jar hadoop-mapred-examples-0.22.0.jar grep input output 'dfs[a-z.]+'
ubuntu:~/app/hadoop-0.22.0$ cat output/*

[hadoop-1.0.1]
ubuntu:~/app/hadoop-1.0.1$ mkdir input
ubuntu:~/app/hadoop-1.0.1$ cp conf/*.xml input
ubuntu:~/app/hadoop-1.0.1$ bin/hadoop jar hadoop-examples-1.0.1.jar grep input output 'dfs[a-z.]+'
ubuntu:~/app/hadoop-1.0.1$ cat output/*

- If you try it yourself you will see that they all behave the same and produce identical results.
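
For anyone wondering what the grep example computes: as I understand it, it finds every string matching the regular expression in the input files and counts how often each one occurs. Here is a rough plain-Java sketch of that idea. The class GrepCountSketch, the method countMatches() and the sample lines are all made up for illustration and are not taken from the Hadoop example's source.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of "count every regex match"; not Hadoop's grep source.
class GrepCountSketch {

    // Counts how many times each matched string occurs across all lines.
    static Map<String, Integer> countMatches(List<String> lines, String regex) {
        Pattern pattern = Pattern.compile(regex);
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            Matcher m = pattern.matcher(line);
            while (m.find()) {
                counts.merge(m.group(), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up lines in the style of the conf/*.xml files.
        List<String> lines = List.of(
                "<name>dfs.replication</name>",
                "<name>dfs.name.dir</name>",
                "<value>1</value>",
                "<name>dfs.replication</name>");
        // Prints {dfs.name.dir=1, dfs.replication=2}
        System.out.println(countMatches(lines, "dfs[a-z.]+"));
    }
}
```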

[Pseudo-Distributed Operation]

[hadoop-0.21.0]
{conf/core-site.xml}

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>



<configuration>
    <property>
        <name>fs.default.name</name>
        <value>hdfs://localhost:9000</value>
    </property>
</configuration>

{conf/hdfs-site.xml}

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>



<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
</configuration>

{conf/mapred-site.xml}

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>



<configuration>
    <property>
        <name>mapred.job.tracker</name>
        <value>localhost:9001</value>
    </property>
</configuration>

{Setup passphraseless ssh}

Now check that you can ssh to the localhost without a passphrase:
$ ssh localhost

If you cannot ssh to localhost without a passphrase, execute the following commands:
$ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys

- If you get "ssh: connect to host localhost port 22: Connection refused", first check that ssh is installed correctly. If it is, check that the port is set properly in /etc/ssh/sshd_config, restart sshd, and try again. (For more detail, a quick search should help.)

{Execution}

ubuntu:~/app/hadoop-0.21.0$ bin/hadoop namenode -format
ubuntu:~/app/hadoop-0.21.0$ bin/start-all.sh

The hadoop daemon log output is written to the ${HADOOP_LOG_DIR} directory (defaults to ${HADOOP_HOME}/logs).

Browse the web interface for the NameNode and the JobTracker; by default they are available at:

NameNode - http://localhost:50070/
JobTracker - http://localhost:50030/

ubuntu:~/app/hadoop-0.21.0$ bin/hadoop fs -put conf input
ubuntu:~/app/hadoop-0.21.0$ bin/hadoop jar hadoop-mapred-examples-0.21.0.jar grep input output 'dfs[a-z.]+'
ubuntu:~/app/hadoop-0.21.0$ bin/hadoop fs -get output output
ubuntu:~/app/hadoop-0.21.0$ cat output/*
or
ubuntu:~/app/hadoop-0.21.0$ bin/hadoop fs -cat output/*

The other versions can be tested in the same way.
Next time I plan to test reading from and writing to HDFS.

:

jjeong

21 posts for 'hadoop'