The inverted index is the data structure that nearly every full-text search engine relies on today. It stores a mapping from a word (or phrase) to the locations where it occurs in a document or a set of documents; in other words, it provides a way to look up documents by their content. Because it works in the opposite direction of finding what content a given document contains, it is called an inverted index.
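Before turning to the MapReduce implementation, the idea can be shown with a minimal in-memory sketch (plain Java; the class and method names here are illustrative, and the sample documents are the same ones used as test data later in this article):

```java
import java.util.*;

public class TinyInvertedIndex {
    // Build word -> (document name -> occurrence count) from name -> content.
    static Map<String, Map<String, Integer>> build(Map<String, String> docs) {
        Map<String, Map<String, Integer>> index = new TreeMap<>();
        for (Map.Entry<String, String> doc : docs.entrySet()) {
            for (String word : doc.getValue().split("\\s+")) {
                index.computeIfAbsent(word, w -> new TreeMap<>())
                     .merge(doc.getKey(), 1, Integer::sum);
            }
        }
        return index;
    }

    public static void main(String[] args) {
        Map<String, String> docs = new LinkedHashMap<>();
        docs.put("doc1", "one fish two fish");
        docs.put("doc2", "red fish blue fish");
        docs.put("doc3", "one red bird");
        // Content-based lookup: which documents contain "fish", and how often?
        System.out.println("fish -> " + build(docs).get("fish")); // fish -> {doc1=2, doc2=2}
    }
}
```

The MapReduce job below computes exactly this mapping, but distributed across nodes instead of in one process.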
Figure 1-1 shows the inverted index algorithm for documents with term-frequency statistics.
Code 1-2 and Code 1-3 define a custom FileInputFormat and a custom RecordReader, which at the map stage pass the document name as the key and each line of the document as the value to the map function.
Code 1-2
package com.hadoop.mapreduce;

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class FileNameInputFormat extends FileInputFormat<Text, Text> {

    @Override
    public RecordReader<Text, Text> createRecordReader(InputSplit split, TaskAttemptContext context)
            throws IOException, InterruptedException {
        FileNameRecordReader frr = new FileNameRecordReader();
        frr.initialize(split, context);
        return frr;
    }
}
Code 1-3
package com.hadoop.mapreduce;

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

public class FileNameRecordReader extends RecordReader<Text, Text> {

    private Text fileName = new Text();
    // Delegate line reading to the standard LineRecordReader.
    private LineRecordReader lrr = new LineRecordReader();

    @Override
    public void initialize(InputSplit split, TaskAttemptContext context)
            throws IOException, InterruptedException {
        lrr.initialize(split, context);
        // The key for every record in this split is the file name.
        fileName.set(((FileSplit) split).getPath().getName());
    }

    @Override
    public boolean nextKeyValue() throws IOException, InterruptedException {
        return lrr.nextKeyValue();
    }

    @Override
    public Text getCurrentKey() throws IOException, InterruptedException {
        return fileName;
    }

    @Override
    public Text getCurrentValue() throws IOException, InterruptedException {
        return lrr.getCurrentValue();
    }

    @Override
    public float getProgress() throws IOException, InterruptedException {
        return lrr.getProgress();
    }

    @Override
    public void close() throws IOException {
        lrr.close();
    }
}
Code 1-4 is the map function. It emits key/value pairs of the form (word#fileName, 1), combining each word with the name of the document it appears in, for the later reduce stage to count. Unimportant stop words such as a, is, the, etc. could also be filtered out in the map stage.
Code 1-4
package com.hadoop.mapreduce;

import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class InvertedIndexMapper extends Mapper<Text, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private Text word = new Text();

    @Override
    protected void map(Text key, Text value, Context context)
            throws java.io.IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            // Emit (word#fileName, 1); the key carries both word and document.
            word.set(itr.nextToken() + "#" + key);
            context.write(word, ONE);
        }
    }
}
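The stop-word filtering mentioned above is not implemented in Code 1-4. A minimal sketch of how the tokenizing loop could skip such words, extracted into plain Java for clarity (the stop-word set and the class name here are illustrative choices, not part of the job above):

```java
import java.util.*;

public class StopWordFilter {
    // Illustrative stop-word set; a real job would load a fuller list.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "is", "the", "of"));

    // Tokenize a line and build word#fileName keys, skipping stop words,
    // mirroring the loop in InvertedIndexMapper.
    static List<String> emitKeys(String fileName, String line) {
        List<String> keys = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            String token = itr.nextToken().toLowerCase();
            if (STOP_WORDS.contains(token)) {
                continue; // drop unimportant words before they reach the combiner
            }
            keys.add(token + "#" + fileName);
        }
        return keys;
    }

    public static void main(String[] args) {
        System.out.println(emitKeys("doc4", "the bird is a fish"));
        // [bird#doc4, fish#doc4]
    }
}
```

Filtering in the map stage is cheapest, because discarded words never reach the combiner, the network shuffle, or the reducers.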
Code 1-5 is the combine stage that runs after the map stage. It aggregates values with the same key locally, reducing network bandwidth consumption during the shuffle.
Code 1-5
package com.hadoop.mapreduce;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class IndexCombiner extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws java.io.IOException, InterruptedException {
        // Sum the counts for this word#fileName key locally on the map node.
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}
Code 1-6 forces keys that share a word but may differ in document to be sent to the same reduce node. For example, after the combine stage there may be key/value pairs such as (fish#doc1,2), (fish#doc2,2), and so on. The default partitioner hashes the whole key, so (fish#doc1,2) and (fish#doc2,2), whose words match but whose documents differ, could be sent to different reduce nodes. Here we subclass HashPartitioner, split the key into its word and document parts, and hash only the word, so that all keys with the same word, whatever the document, are sent to the same reduce node.
Code 1-6
package com.hadoop.mapreduce;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;

public class IndexPartitioner extends HashPartitioner<Text, IntWritable> {

    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        // Hash only the word part of word#fileName, so the same word
        // always lands on the same reduce node.
        String[] arrays = key.toString().split("#");
        return super.getPartition(new Text(arrays[0]), value, numReduceTasks);
    }
}
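The effect can be checked outside Hadoop. HashPartitioner assigns a key to partition (hashCode & Integer.MAX_VALUE) % numReduceTasks, so hashing only the word part guarantees the same partition for the same word. A small sketch (String hash codes stand in for Hadoop's Text.hashCode here, and the reduce-task counts are arbitrary example values):

```java
public class PartitionDemo {
    // Same formula as Hadoop's HashPartitioner; String.hashCode is used
    // for illustration, while Hadoop's Text type has its own hashCode.
    static int partition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    // IndexPartitioner's trick: hash only the word part before '#'.
    static int wordPartition(String key, int numReduceTasks) {
        return partition(key.split("#")[0], numReduceTasks);
    }

    public static void main(String[] args) {
        int reducers = 4; // arbitrary example value
        // Hashing the full key may scatter one word across reducers...
        System.out.println(partition("fish#doc1", reducers)
                + " vs " + partition("fish#doc2", reducers));
        // ...while hashing only the word part always agrees.
        System.out.println(wordPartition("fish#doc1", reducers)
                + " vs " + wordPartition("fish#doc2", reducers));
    }
}
```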
Code 1-7 is the reduce stage, which counts how many times each word appears in each of the documents it occurs in. Three fields are defined: list, lastWord, and currentWord.
For example, suppose a reduce node receives the three key/value pairs (bird#doc3,&lt;1&gt;), (fish#doc1,&lt;2&gt;), (fish#doc2,&lt;2&gt;) (where &lt;…&gt; denotes multiple values). When (bird#doc3,&lt;1&gt;) enters the reduce function, currentWord is bird while lastWord is empty. Since the previous word and the current word differ, notEqualKey is called, but list is empty, so the method returns immediately. Then equalKey runs: lastWord is set to bird, the occurrences of bird in doc3 are summed, and the key together with the count is added to list.
When (fish#doc1,&lt;2&gt;) enters the reduce function, currentWord no longer matches lastWord, so notEqualKey runs. This time list is not empty: the method loops over list, splits out the document name and count buffered for the previous word, writes them to the context, and clears list. Then equalKey runs again, and so on until the last key/value pair on this reduce node.
After the last key/value pair, no further word arrives to trigger the currentWord/lastWord mismatch that makes notEqualKey flush list to the context, so notEqualKey is called one more time in the cleanup stage.
Code 1-7
package com.hadoop.mapreduce;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class InvertedIndexReduce extends Reducer<Text, IntWritable, Text, Text> {

    // Buffers (word#fileName, count) entries for the word currently being read.
    private List<Text> list = new ArrayList<Text>();
    private Text lastWord = new Text("");
    private Text currentWord = new Text("");
    private Text value = new Text();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        String[] arrays = key.toString().split("#");
        currentWord.set(arrays[0]);
        if (!lastWord.equals(currentWord)) {
            notEqualKey(context);
        }
        equalKey(arrays, key, values);
    }

    // The word is unchanged: sum the counts and buffer this document's entry.
    private void equalKey(String[] arrays, Text key, Iterable<IntWritable> values) {
        lastWord.set(arrays[0]);
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        list.add(new Text(key + "," + sum));
    }

    // A new word has arrived: write the previous word's buffered postings.
    private void notEqualKey(Context context) throws IOException, InterruptedException {
        if (list.isEmpty()) {
            return;
        }
        StringBuilder str = new StringBuilder();
        for (Text text : list) {
            String index = text.toString().split("#")[1];
            str.append("(").append(index).append(")");
        }
        value.set(str.toString());
        context.write(lastWord, value);
        list.clear();
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        // Flush the final word, which no mismatch will ever trigger.
        notEqualKey(context);
    }
}
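The lastWord/currentWord bookkeeping can be traced outside Hadoop with a minimal sketch that replays the walkthrough input (plain Java; the class name is illustrative, and the counts are assumed to have already been summed by the combiner):

```java
import java.util.*;

public class ReduceTrace {
    // Replay InvertedIndexReduce's logic over sorted (word#doc, count) pairs:
    // buffer entries while the word stays the same, flush when it changes.
    static Map<String, String> trace(List<String[]> pairs) {
        Map<String, String> out = new LinkedHashMap<>();
        List<String> list = new ArrayList<>();
        String lastWord = "";
        for (String[] pair : pairs) {
            String currentWord = pair[0].split("#")[0];
            if (!lastWord.equals(currentWord) && !list.isEmpty()) {
                out.put(lastWord, flush(list)); // notEqualKey: write buffered docs
            }
            lastWord = currentWord;             // equalKey: buffer this document
            list.add(pair[0] + "," + pair[1]);
        }
        if (!list.isEmpty()) {
            out.put(lastWord, flush(list));     // cleanup: flush the final word
        }
        return out;
    }

    static String flush(List<String> list) {
        StringBuilder str = new StringBuilder();
        for (String text : list) {
            str.append("(").append(text.split("#")[1]).append(")");
        }
        list.clear();
        return str.toString();
    }

    public static void main(String[] args) {
        List<String[]> pairs = Arrays.asList(
                new String[] { "bird#doc3", "1" },
                new String[] { "fish#doc1", "2" },
                new String[] { "fish#doc2", "2" });
        System.out.println(trace(pairs)); // {bird=(doc3,1), fish=(doc1,2)(doc2,2)}
    }
}
```

This buffering only works because MapReduce delivers keys to each reducer in sorted order, so all word#doc keys for one word arrive consecutively.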
Code 1-8 is the driver that configures and submits the job.

Code 1-8
package com.hadoop.mapreduce;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class InvertedIndex {

    public static void main(String[] args)
            throws IOException, ClassNotFoundException, InterruptedException {
        if (args == null || args.length != 2) {
            throw new RuntimeException("Please specify an input path and an output path");
        }
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);
        job.setJobName("InvertedIndex");
        job.setInputFormatClass(FileNameInputFormat.class);
        job.setJarByClass(InvertedIndex.class);
        job.setMapperClass(InvertedIndexMapper.class);
        job.setCombinerClass(IndexCombiner.class);
        job.setReducerClass(InvertedIndexReduce.class);
        job.setPartitionerClass(IndexPartitioner.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
Code 1-9 shows the test data prepared for the job.
Code 1-9
root@lejian:/data# hadoop fs -ls -R /data
-rw-r--r--   1 root supergroup         18 2017-01-19 16:53 /data/doc1
-rw-r--r--   1 root supergroup         19 2017-01-19 16:53 /data/doc2
-rw-r--r--   1 root supergroup         13 2017-01-20 09:27 /data/doc3
root@lejian:/data# hadoop fs -cat /data/doc1
one fish
two fish
root@lejian:/data# hadoop fs -cat /data/doc2
red fish
blue fish
root@lejian:/data# hadoop fs -cat /data/doc3
one red bird
Run Code 1-8; the results are shown in Code 1-10.
Code 1-10
root@lejian:/data# hadoop jar invertedIndex.jar com.hadoop.mapreduce.InvertedIndex /data /output
…………
root@lejian:/data# hadoop fs -ls -R /output
-rw-r--r--   1 root supergroup          0 2017-01-20 09:58 /output/_SUCCESS
-rw-r--r--   1 root supergroup        105 2017-01-20 09:58 /output/part-r-00000
root@lejian:/data# hadoop fs -cat /output/part-r-00000
bird	(doc3,1)
blue	(doc2,1)
fish	(doc1,2)(doc2,2)
one	(doc1,1)(doc3,1)
red	(doc2,1)(doc3,1)
two	(doc1,1)