Just Following Along with the Hadoop MapReduce WordCount Example..
ITWeb/Hadoop General 2012. 3. 6. 16:50
[Reference Sites]
http://hadoop.apache.org/common/docs/current/mapred_tutorial.html
http://hadoop.apache.org/common/docs/r0.20.0/hdfs_shell.html
[Before You Start]
- I started with hadoop-0.21.0, following the tutorial above.
- And immediately hit a problem: there is no hadoop-0.21.0-core.jar, so compilation kept failing with errors.
- At first I assumed it was a classpath problem and fiddled with the settings every which way, but couldn't get it to work (a split-jar classpath sketch follows this list, in case it works for you).
- So I unpacked all of the hadoop*.jar files and simply repackaged the contents into a single hadoop-0.21.0-core.jar.
- With hadoop-0.21.0-core.jar on the classpath, the compile ran. Success!!
- hadoop-0.20.0 and earlier simply ship a hadoop*core.jar inside the tar.gz, so this is only an issue on 0.21.0.
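For reference, a split-jar classpath attempt looks roughly like this; a minimal sketch assuming the stock 0.21.0 layout (the globs are assumptions, adjust to whatever ls actually shows):
cd $HADOOP_HOME
# collect the top-level hadoop jars plus the bundled libs into one classpath
CP=$(ls hadoop-*.jar lib/*.jar | tr '\n' ':')
javac -cp "$CP" -d example WordCount.java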
[WordCount.java Source Code]
/**
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
package org.apache.hadoop.examples;
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
public class WordCount {

  public static class TokenizerMapper
       extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context
                    ) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  public static class IntSumReducer
       extends Reducer<Text,IntWritable,Text,IntWritable> {

    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values,
                       Context context
                       ) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    if (otherArgs.length != 2) {
      System.err.println("Usage: wordcount <in> <out>");
      System.exit(2);
    }
    Job job = new Job(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
    FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
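The flow is easy to trace with the two test files used below: TokenizerMapper splits each line on whitespace and emits (word, 1) for every token, and IntSumReducer (registered as both combiner and reducer) sums those 1s per key.
map(file01: "Hello World Bye World")       -> (Hello,1) (World,1) (Bye,1) (World,1)
map(file02: "Hello Hadoop Goodbye Hadoop") -> (Hello,1) (Hadoop,1) (Goodbye,1) (Hadoop,1)
reduce (after shuffle/sort)                -> Bye 1, Goodbye 1, Hadoop 2, Hello 2, World 2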
[Compiling]
# hadoop-0.21.0-core.jar below is the merged jar from [Building hadoop-0.21.0-core.jar] at the end of this post.
cd $HADOOP_HOME
mkdir example
ubuntu:~/app/hadoop$ javac -cp ./hadoop-0.21.0-core.jar:./lib/commons-cli-1.2.jar -d example WordCount.java
ubuntu:~/app/hadoop$ jar cvf wordcount.jar -C example/ .
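If you want a quick sanity check before running it, list the jar's contents; you should see WordCount.class plus the TokenizerMapper and IntSumReducer inner classes under org/apache/hadoop/examples/:
ubuntu:~/app/hadoop$ jar tf wordcount.jar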
[Testing WordCount]
- First, create two files to test with:
cd $HADOOP_HOME/example
vi file01
Hello World Bye World
vi file02
Hello Hadoop Goodbye Hadoop
- Then copy them into HDFS and run the job:
ubuntu:~/app/hadoop/example$ ../bin/hadoop fs -mkdir input
ubuntu:~/app/hadoop/example$ ../bin/hadoop fs -put file01 ./input/
ubuntu:~/app/hadoop/example$ ../bin/hadoop fs -put file02 ./input/
ubuntu:~/app/hadoop/example$ ../bin/hadoop jar ../wordcount.jar org.apache.hadoop.examples.WordCount input output
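The counts can then be read back out of HDFS (part-r-00000 is the usual name for a single-reducer job; check with fs -ls output if yours differs):
ubuntu:~/app/hadoop/example$ ../bin/hadoop fs -cat output/part-r-00000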
Bye 1
Goodbye 1
Hadoop 2
Hello 2
World 2
- As you can see, the job runs and produces the expected counts.
- One important note: some of you will be annoyed to find that hadoop-*-core.jar simply doesn't exist in 0.21.0. The workaround I mentioned at the top is written out below, so take a look.
[Building hadoop-0.21.0-core.jar]
cd $HADOOP_HOME
mkdir hadoop-0.21.0-core
cp *.jar ./hadoop-0.21.0-core/
cd ./hadoop-0.21.0-core
jar xvf hadoop-hdfs-ant-0.21.0.jar
jar xvf hadoop-mapred-examples-0.21.0.jar
jar xvf hadoop-common-0.21.0.jar
jar xvf hadoop-hdfs-test-0.21.0-sources.jar
jar xvf hadoop-mapred-test-0.21.0.jar
jar xvf hadoop-common-test-0.21.0.jar
jar xvf hadoop-hdfs-test-0.21.0.jar
jar xvf hadoop-mapred-tools-0.21.0.jar
jar xvf hadoop-hdfs-0.21.0-sources.jar
jar xvf hadoop-mapred-0.21.0-sources.jar
jar xvf hadoop-hdfs-0.21.0.jar
jar xvf hadoop-mapred-0.21.0.jar
# Delete every file and folder except the org folder.
cd ..
jar cvf hadoop-0.21.0-core.jar -C hadoop-0.21.0-core/ .
# Run ls -al now and you will see hadoop-0.21.0-core.jar has been created.
# This is a completely brute-force approach, so treat it as reference only..
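Typing all twelve jar xvf commands by hand gets old; a loop does the same extraction (same brute-force idea, just shorter):
cd $HADOOP_HOME/hadoop-0.21.0-core
# unpack every copied jar in place, then drop the jars themselves before repacking
for j in *.jar; do jar xvf "$j"; done
rm -f *.jar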
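To confirm the merged jar actually picked up the classes WordCount.java imports, a quick check:
cd $HADOOP_HOME
jar tf hadoop-0.21.0-core.jar | grep 'mapreduce/Job.class'
jar tf hadoop-0.21.0-core.jar | grep 'util/GenericOptionsParser.class'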