The example comes from 51CTO.com.
map.py
#!/usr/bin/env python
import re
import sys

for line in sys.stdin:
    val = line.strip()
    # Extract the fixed-width fields: year, temperature, and q (the quality code)
    (year, temp, q) = (val[15:19], val[87:92], val[92:93])
    # Skip missing readings ("+9999") and records with a bad quality code
    if (temp != "+9999" and re.match("[01459]", q)):
        # Emit the year and the temperature, separated by a tab
        print("%s\t%s" % (year, temp))
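To make the slicing easier to check, the same logic can be wrapped in a function and run on a synthetic record. This is only a sketch: `parse_record` and `make_record` are hypothetical helper names, and the padded record is made up for illustration, not real NCDC data.

```python
import re

def parse_record(line):
    """Return (year, temp) for a valid record, or None if it should be skipped."""
    val = line.strip()
    year, temp, q = val[15:19], val[87:92], val[92:93]
    # Same filter as map.py: skip the "+9999" missing marker and bad quality codes
    if temp != "+9999" and re.match("[01459]", q):
        return year, temp
    return None

def make_record(year, temp, q):
    """Build a fake fixed-width line with fields at the offsets map.py reads."""
    chars = ["0"] * 93          # filler characters, not real NCDC content
    chars[15:19] = year         # 4-character year
    chars[87:92] = temp         # 5-character signed temperature
    chars[92:93] = q            # 1-character quality code
    return "".join(chars)

print(parse_record(make_record("1950", "+0022", "1")))  # ('1950', '+0022')
print(parse_record(make_record("1950", "+9999", "1")))  # None
```

Keeping the parsing in a function like this makes it possible to test the offsets without feeding a whole file through stdin.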
MapReduce framework
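Between the map and reduce steps, the framework sorts the map output by key, so all lines for the same year reach the reducer together.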
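The effect of that sort can be imitated locally. The sketch below assumes a handful of made-up map output lines. A plain `sorted()` call stands in for the framework's sort, which is a simplification: Hadoop sorts by key and also partitions keys across reducers.

```python
# Made-up map output, in the order the mapper happened to emit it
map_output = ["1950\t+0022", "1949\t+0111", "1950\t+0000", "1949\t+0078", "1950\t-0011"]

# sorted() stands in for the framework's sort step
shuffled = sorted(map_output)

for line in shuffled:
    print(line)

# After sorting, each year's lines are contiguous, which is what reduce.py relies on
keys = [line.split("\t")[0] for line in shuffled]
print(keys)  # ['1949', '1949', '1950', '1950', '1950']
```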
reduce.py
#!/usr/bin/env python
import sys

# Track the current key and its running maximum; start below any real reading
(last_key, max_val) = (None, float("-inf"))
for line in sys.stdin:
    (key, val) = line.strip().split("\t")
    # A new key has started and there is a previous key to report
    if last_key and last_key != key:
        print("%s\t%s" % (last_key, max_val))  # output the finished key
        (last_key, max_val) = (key, int(val))  # the new key becomes the current one
    else:
        # Same key as before: keep the running maximum temperature
        (last_key, max_val) = (key, max(max_val, int(val)))
# Output the last year
if last_key:
    print("%s\t%s" % (last_key, max_val))
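An equivalent way to write the reducer, shown here only as a sketch, is to let `itertools.groupby` detect the key changes instead of tracking `last_key` by hand. `max_by_key` is a hypothetical name and the sample lines are made up. Like reduce.py, it assumes the input is already sorted by key.

```python
from itertools import groupby

def max_by_key(lines):
    """Yield (key, max value) for tab-separated lines that are sorted by key."""
    pairs = (line.strip().split("\t") for line in lines)
    # groupby starts a new group each time the key (first field) changes
    for key, group in groupby(pairs, key=lambda pair: pair[0]):
        yield key, max(int(val) for _, val in group)

sample = ["1949\t+0078", "1949\t+0111", "1950\t+0000", "1950\t+0022", "1950\t-0011"]
for key, max_val in max_by_key(sample):
    print("%s\t%s" % (key, max_val))
```

On unsorted input `groupby` would emit the same year more than once, which is why the sort step has to come first.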
Run as a Linux pipeline:
cat input/ncdc/sample.txt | src/main/ch02/python/max_temperature_map.py | \
sort | src/main/ch02/python/max_temperature_reduce.py
1949 111
1950 22
Hadoop command
hadoop jar $HADOOP_INSTALL/contrib/streaming/hadoop-*-streaming.jar \
-input input/ncdc/all \
-output output \
-mapper "ch02/ruby/max_temperature_map.rb | sort | ch02/ruby/max_temperature_reduce.rb" \
-reducer src/main/ch02/ruby/max_temperature_reduce.rb \
-file src/main/ch02/ruby/max_temperature_map.rb \
-file src/main/ch02/ruby/max_temperature_reduce.rb
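The `-mapper` pipeline in this command runs the reduce script on each mapper's own sorted output before anything is sent to the reducers. That is safe for this job because the maximum of partial maxima is still the overall maximum. A small sketch of that property, using made-up per-mapper outputs and a hypothetical `max_per_key` helper:

```python
def max_per_key(pairs):
    """Return {key: max value} for an iterable of (key, value) pairs."""
    result = {}
    for key, val in pairs:
        result[key] = max(result.get(key, val), val)
    return result

# Made-up outputs of two map tasks
mapper_a = [("1949", 78), ("1950", 0), ("1950", 22)]
mapper_b = [("1949", 111), ("1950", -11)]

# Without local pre-reduction: every pair reaches the reducer
direct = max_per_key(mapper_a + mapper_b)

# With local pre-reduction: each mapper reduces its own output first
combined_a = max_per_key(mapper_a)
combined_b = max_per_key(mapper_b)
with_combiner = max_per_key(list(combined_a.items()) + list(combined_b.items()))

print(direct == with_combiner)        # True
print(sorted(with_combiner.items()))  # [('1949', 111), ('1950', 22)]
```

The same shortcut would not be valid for a function such as the mean, where reducing partial results a second time changes the answer.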