Hadoop example notes


The example comes from 51CTO.com.


#!/usr/bin/env python
import re
import sys

for line in sys.stdin:
  val = line.strip()
  # Extract the fields from the fixed-width record: year is the year,
  # temp is the temperature, q is the quality code of the reading
  (year, temp, q) = (val[15:19], val[87:92], val[92:93])
  # Skip missing readings (+9999) and readings with a bad quality code
  if (temp != "+9999" and re.match("[01459]", q)):
    # Output format: year, a tab, then the temperature
    print("%s\t%s" % (year, temp))
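
The slice offsets can be checked without real NCDC data. The following is only a minimal sketch: the record is synthetic (just the three fields the mapper reads are filled in, the rest is padding), and make_record and parse_record are hypothetical helper names that are not part of the original scripts.

```python
import re

def make_record(year, temp, quality):
    """Build a synthetic fixed-width line (hypothetical helper, rest is padding)."""
    line = ["0"] * 93
    line[15:19] = list(year)      # 4-character year
    line[87:92] = list(temp)      # sign plus 4 digits
    line[92:93] = list(quality)   # 1-character quality code
    return "".join(line)

def parse_record(line):
    """Return (year, temp) for a valid reading, or None if it should be skipped."""
    val = line.strip()
    year, temp, q = val[15:19], val[87:92], val[92:93]
    if temp != "+9999" and re.match("[01459]", q):
        return year, temp
    return None

print(parse_record(make_record("1950", "+0022", "1")))   # valid reading
print(parse_record(make_record("1950", "+9999", "1")))   # missing temperature
print(parse_record(make_record("1950", "+0022", "2")))   # rejected quality code
```

Keeping the parsing in a function like this makes the filter easy to test before the script is wired into a pipeline.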

MapReduce framework

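Between the map step and the reduce step, the framework sorts the map output by key, so all lines for one year reach the reducer next to each other.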
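
Why the sort matters can be sketched in a few lines: once the map output is sorted, lines with the same key are adjacent, so they can be grouped in a single pass. The sample lines below are made up for illustration, and group_sorted is a hypothetical helper.

```python
from itertools import groupby

def group_sorted(lines):
    """Sort tab-separated key/value lines and collect the values per key."""
    pairs = sorted(line.split("\t") for line in lines)
    return [(key, [int(v) for _, v in group])
            for key, group in groupby(pairs, key=lambda kv: kv[0])]

# Made-up map output, deliberately out of order
map_output = ["1950\t+0022", "1949\t+0111", "1950\t-0011", "1949\t+0078"]
print(group_sorted(map_output))
```

Note that groupby only merges neighbouring items, which is exactly why unsorted input would produce several separate groups for the same year.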


#!/usr/bin/env python
import sys

# Start with no key seen and a running maximum of 0
(last_key, max_val) = (None, 0)
for line in sys.stdin:
  (key, val) = line.strip().split("\t")

  # A new key is encountered and the last key is not None
  if last_key and last_key != key:
    print("%s\t%s" % (last_key, max_val))  # output the finished year
    (last_key, max_val) = (key, int(val))  # last = current
  else:
    # Same key as before: keep maintaining the maximum temperature
    (last_key, max_val) = (key, max(max_val, int(val)))

# Output the last year
if last_key:
  print("%s\t%s" % (last_key, max_val))

Run as a Linux pipeline:

cat input/ncdc/sample.txt | src/main/ch02/python/max_temperature_map.py | \
sort | src/main/ch02/python/max_temperature_reduce.py
1949 111
1950 22
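
The sort and reduce stages of this pipeline can also be mimicked in one Python process, which is handy for quick checks. This is a sketch under stated assumptions: the mapped lines are made up (a real run would read sample.txt), reduce_max is a hypothetical name, and unlike the reducer script it starts the running maximum at None rather than 0, so a year with only negative readings is still reported correctly.

```python
def reduce_max(sorted_lines):
    """Yield (key, max value) for each run of equal keys in sorted input."""
    last_key, max_val = None, None
    for line in sorted_lines:
        key, val = line.strip().split("\t")
        if last_key is not None and key != last_key:
            yield last_key, max_val   # the previous key is finished
            max_val = None            # start a fresh maximum for the new key
        last_key = key
        max_val = int(val) if max_val is None else max(max_val, int(val))
    if last_key is not None:
        yield last_key, max_val       # emit the final key

# Made-up mapper output, chosen for illustration only
mapped = ["1950\t+0000", "1950\t+0022", "1950\t-0011", "1949\t+0111", "1949\t+0078"]
for year, temp in reduce_max(sorted(mapped)):
    print("%s\t%s" % (year, temp))
```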

Hadoop command

This command uses the Ruby versions of the same scripts. The -mapper string also runs sort and the reduce script on the map side, as a combiner:

hadoop jar $HADOOP_INSTALL/contrib/streaming/hadoop-*-streaming.jar \
  -input input/ncdc/all \
  -output output \
  -mapper "ch02/ruby/max_temperature_map.rb | sort | ch02/ruby/max_temperature_reduce.rb" \
  -reducer src/main/ch02/ruby/max_temperature_reduce.rb \
  -file src/main/ch02/ruby/max_temperature_map.rb \
  -file src/main/ch02/ruby/max_temperature_reduce.rb
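
Piping the map output through sort and the reduce script inside the -mapper string shrinks the data on each node before it is sent to the reducer. This is safe for a maximum because the maximum of per-split maxima equals the maximum over all values. A minimal sketch with made-up splits follows; local_max is a hypothetical helper, not part of the scripts above.

```python
def local_max(pairs):
    """Collapse (year, temp) pairs to one maximum per year."""
    best = {}
    for year, temp in pairs:
        best[year] = max(best.get(year, temp), temp)
    return sorted(best.items())

# Two made-up input splits, as if handled by two separate map tasks
split_a = [("1950", 0), ("1950", 22), ("1949", 78)]
split_b = [("1950", -11), ("1949", 111)]

combined = local_max(local_max(split_a) + local_max(split_b))  # with local pre-reduce
direct = local_max(split_a + split_b)                          # without it
print(combined == direct, combined)
```

The same trick would not work unchanged for something like a mean, where partial results cannot simply be fed back through the same function.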