Writing MapReduce in Ruby
First, the Mapper:

```ruby
# mapper.rb
ARGF.each_line do |line|
  line.chomp!
  words = line.split(' ')
  words.each do |word|
    puts "#{word.upcase}\t1"
  end
end
```

Next, the Reducer:

```ruby
# reducer.rb
counter = Hash.new { |h, k| h[k] = 0 }
ARGF.each_line do |line|
  line.chomp!
  word, num = line.split(/\t/)
  counter[word] += num.to_i
end
counter.each do |word, count|
  puts "#{word}\t#{count}"
end
```

Let's test them:
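A side note on the Reducer: because it accumulates everything in a Hash, it accepts input in any order. In a real Hadoop Streaming job, though, the framework sorts mapper output by key before the reduce phase, so a reducer can also be written as a single pass that holds only one key in memory at a time. Here is a minimal sketch of that pattern (my addition, under the hypothetical filename reducer_sorted.rb; it is not part of the original post):

```ruby
# reducer_sorted.rb (hypothetical name) -- assumes input is already sorted
# by key, as it is after Hadoop's shuffle phase. Sums each run of identical
# keys, so memory use is constant regardless of the number of distinct words.
current_word  = nil
current_count = 0

ARGF.each_line do |line|
  word, num = line.chomp.split("\t")
  if word == current_word
    current_count += num.to_i
  else
    # Key changed: flush the finished run before starting the next one.
    puts "#{current_word}\t#{current_count}" if current_word
    current_word  = word
    current_count = num.to_i
  end
end
# Flush the final run (skipped when the input is empty).
puts "#{current_word}\t#{current_count}" if current_word
```

The Hash version is simpler for small local tests; the single-pass version matches how Hadoop Streaming reducers are typically written.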
```
$ cat input/*
a b c
a a b
c c c
$ cat input/* | ruby mapper.rb | ruby reducer.rb
A	3
B	2
C	4
```
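Piping the mapper straight into the reducer works here only because reducer.rb buffers counts in a Hash. To mimic Hadoop's sort-then-reduce flow locally, for example to exercise the single-pass variant sketched above, add sort to the pipeline (my addition, not part of the original transcript):

```
$ cat input/* | ruby mapper.rb | sort | ruby reducer_sorted.rb
```

The output should be the same three tab-separated lines.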
It works!
Running it with Hadoop Streaming
```
$ hadoop jar /usr/local/hadoop/contrib/streaming/hadoop-streaming-1.1.2.jar \
    -D mapred.child.env='PATH=$PATH:/home/hadoop/.rvm/bin' \
    -input input \
    -output output \
    -mapper 'ruby mapper.rb' \
    -reducer 'ruby reducer.rb' \
    -file mapper.rb \
    -file reducer.rb
packageJobJar: [mapper.rb, reducer.rb, /tmp/hadoop-hadoop/hadoop-unjar3983648244770961497/] [] /tmp/streamjob6573145905468089718.jar tmpDir=null
13/05/28 16:18:00 INFO util.NativeCodeLoader: Loaded the native-hadoop library
13/05/28 16:18:00 WARN snappy.LoadSnappy: Snappy native library not loaded
13/05/28 16:18:00 INFO mapred.FileInputFormat: Total input paths to process : 2
13/05/28 16:18:01 INFO streaming.StreamJob: getLocalDirs(): [/tmp/hadoop-hadoop/mapred/local]
13/05/28 16:18:01 INFO streaming.StreamJob: Running job: job_201305271601_0004
13/05/28 16:18:01 INFO streaming.StreamJob: To kill this job, run:
13/05/28 16:18:01 INFO streaming.StreamJob: /usr/local/hadoop/libexec/../bin/hadoop job -Dmapred.job.tracker=ec2-54-244-249-227.us-west-2.compute.amazonaws.com:9011 -kill job_201305271601_0004
13/05/28 16:18:01 INFO streaming.StreamJob: Tracking URL: http://ec2-54-244-249-227.us-west-2.compute.amazonaws.com:50030/jobdetails.jsp?jobid=job_201305271601_0004
13/05/28 16:18:02 INFO streaming.StreamJob: map 0% reduce 0%
13/05/28 16:18:08 INFO streaming.StreamJob: map 67% reduce 0%
13/05/28 16:18:12 INFO streaming.StreamJob: map 100% reduce 0%
13/05/28 16:18:17 INFO streaming.StreamJob: map 100% reduce 33%
13/05/28 16:18:19 INFO streaming.StreamJob: map 100% reduce 100%
13/05/28 16:18:21 INFO streaming.StreamJob: Job complete: job_201305271601_0004
13/05/28 16:18:21 INFO streaming.StreamJob: Output: output
$ hadoop fs -ls output
Found 3 items
-rw-r--r--   3 hadoop supergroup          0 2013-05-28 16:18 /user/hadoop/output/_SUCCESS
drwxr-xr-x   - hadoop supergroup          0 2013-05-28 16:18 /user/hadoop/output/_logs
-rw-r--r--   3 hadoop supergroup         12 2013-05-28 16:18 /user/hadoop/output/part-00000
$ hadoop fs -cat output/part-00000
A	3
B	2
C	4
```
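Two practical notes (my additions, not part of the original run): Hadoop refuses to start a job whose output directory already exists, so it has to be removed before re-running, and when a job uses multiple reducers the per-reducer part-* files can be merged into a single local file with getmerge:

```
$ hadoop fs -rmr output                  # Hadoop 1.x syntax; clears the way for a re-run
$ hadoop fs -getmerge output result.txt  # concatenates part-* into one local file
```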
It works!