What does the fault-tolerance mechanism of a MapReduce program look like?

  This post explains in detail how to translate a Hadoop MapReduce program into a Spark application. Those interested may use it as a reference:
The key to getting the most out of Spark is to understand the differences between its RDD API and the original Mapper and Reducer API.
Venerable MapReduce has been Apache Hadoop's work-horse computation paradigm since its inception. It is ideal for the kinds of work for which Hadoop was originally designed: large-scale log processing, and batch-oriented ETL (extract-transform-load) operations.
As Hadoop’s usage has broadened, it has become clear that MapReduce is not the best framework for all computations. Hadoop has made room for alternative architectures by extracting resource management into its own first-class component, YARN. And so, projects like Impala have been able to use new, specialized non-MapReduce architectures to add interactive SQL capability to the platform, for example.
Today, Apache Spark is another such alternative, and is said by many to succeed MapReduce as Hadoop’s general-purpose computation paradigm. But if MapReduce has been so useful, how can it suddenly be replaced? After all, there is still plenty of ETL-like work to be done on Hadoop, even if the platform now has other real-time capabilities as well.
Thankfully, it’s entirely possible to re-implement MapReduce-like computations in Spark. They can be simpler to maintain, and in some cases faster, thanks to Spark’s ability to optimize away spilling to disk. For MapReduce, re-implementation on Spark is a homecoming. Spark, after all, mimics Scala‘s functional programming style and APIs. And the very idea of MapReduce comes from the functional programming language LISP.
Although Spark’s primary abstraction, the RDD (Resilient Distributed Dataset), plainly exposes map() and reduce() operations, these are not the direct analog of Hadoop’s Mapper or Reducer APIs. This is often a stumbling block for developers looking to move Mapper and Reducer classes to Spark equivalents.
Viewed in comparison with classic functional language implementations of map() and reduce() in Scala or Spark, the Mapper and Reducer APIs in Hadoop are actually both more flexible and more complex as a result. These differences may not even be apparent to developers accustomed to MapReduce, but, the following behaviors are specific to Hadoop’s implementation rather than the idea of MapReduce in the abstract:
Mappers and Reducers always use key-value pairs as input and output.
A Reducer reduces values per key only.
A Mapper or Reducer may emit 0, 1 or more key-value pairs for every input.
Mappers and Reducers may emit any arbitrary keys or values, not just subsets or transformations of those in the input.
Mapper and Reducer objects have a lifecycle that spans many map() and reduce() calls. They support a setup() and cleanup() method, which can be used to take actions before or after a batch of records is processed.
This post will briefly demonstrate how to recreate each of these within Spark — and also show that it’s not necessarily desirable to literally translate a Mapper and Reducer!
Key-Value Pairs as Tuples
Let’s say we need to compute the length of each line in a large text input, and report the count of lines by line length. In Hadoop MapReduce, this begins with a Mapper that produces key-value pairs in which the line length is the key, and count of 1 is the value:
public class LineLengthMapper
    extends Mapper<LongWritable, Text, IntWritable, IntWritable> {
  @Override
  protected void map(LongWritable lineNumber, Text line, Context context)
      throws IOException, InterruptedException {
    context.write(new IntWritable(line.getLength()), new IntWritable(1));
  }
}
It’s worth noting that Mappers and Reducers only operate on key-value pairs. So the input to LineLengthMapper, provided by a TextInputFormat, is actually a pair containing the line as value, with position within the file thrown in as a key, for fun. (It’s rarely used, but, something has to be the key.)
The Spark equivalent is:
lines.map(line => (line.length, 1))
In Spark, the input is an RDD of Strings only, not of key-value pairs. Spark’s representation of a key-value pair is a Scala tuple, created with the (a,b) syntax shown above. The result of the map() operation above is an RDD of (Int,Int) tuples. When an RDD contains tuples, it gains more methods, such as reduceByKey(), which will be essential to reproducing MapReduce behavior.
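To make this concrete, below is a minimal sketch of how the lines RDD and the pair RDD fit together, assuming a SparkContext named sc and a hypothetical HDFS input path:
val lines = sc.textFile("hdfs:///data/input.txt")          // RDD[String]: values only, no keys
val lineLengthPairs = lines.map(line => (line.length, 1))  // RDD[(Int, Int)]: key-value pairs as tuples
// Because lineLengthPairs holds tuples, pair-specific methods such as reduceByKey()
// become available on it (via the PairRDDFunctions implicit conversion; older Spark
// versions need import org.apache.spark.SparkContext._ for this).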
Reducer and reduce() versus reduceByKey()
To produce a count of line lengths, it’s necessary to sum the counts per length in a Reducer:
public class LineLengthReducer
    extends Reducer<IntWritable, IntWritable, IntWritable, IntWritable> {
  @Override
  protected void reduce(IntWritable length, Iterable<IntWritable> counts, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable count : counts) {
      sum += count.get();
    }
    context.write(length, new IntWritable(sum));
  }
}
The equivalent of the Mapper and Reducer above together is a one-liner in Spark:
val lengthCounts = lines.map(line => (line.length, 1)).reduceByKey(_ + _)
Spark’s RDD API has a reduce() method, but it will reduce the entire set of key-value pairs to one single value. This is not what Hadoop MapReduce does. Instead, Reducers reduce all values for a key and emit a key along with the reduced value. reduceByKey() is the closer analog. But, that is not even the most direct equivalent in Spark; see groupByKey() below.
It is worth pointing out here that a Reducer’s reduce() method receives a stream of many values, and produces 0, 1 or more results. reduceByKey(), in contrast, accepts a function that turns exactly two values into exactly one — here, a simple addition function that maps two numbers to their sum. This associative function can be used to reduce many values to one for the caller. It is a simpler, narrower API for reducing values by key than what a Reducer exposes.
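As a small illustration of that narrower contract, the sketch below spells out the anonymous _ + _ as a named two-argument function; it assumes the lineLengthPairs RDD from the earlier sketch:
def add(a: Int, b: Int): Int = a + b                   // associative (and commutative) combiner
val lengthCounts = lineLengthPairs.reduceByKey(add)    // one (length, totalCount) pair per key
// By contrast, reduce() on a plain RDD collapses everything to a single value:
val totalLines = lineLengthPairs.map(_._2).reduce(add)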
Mapper and map() versus flatMap()
Now, instead consider counting the occurrences of only words beginning with an uppercase character. For each line of text in the input, a Mapper might emit 0, 1 or many key-value pairs:
public class CountUppercaseMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {
  @Override
  protected void map(LongWritable lineNumber, Text line, Context context)
      throws IOException, InterruptedException {
    for (String word : line.toString().split(" ")) {
      if (Character.isUpperCase(word.charAt(0))) {
        context.write(new Text(word), new IntWritable(1));
      }
    }
  }
}
The equivalent in Spark is:
lines.flatMap(
  _.split(" ").filter(word => Character.isUpperCase(word(0))).map(word => (word, 1))
)
map() will not suffice here, because map() must produce exactly one output per input, but unlike before, one line needs to yield potentially many outputs. Again, the map() function in Spark is simpler and narrower compared to what the Mapper API supports.
The solution in Spark is to first map each line to an array of output values. The array may be empty, or have many values. Merely map()-ing lines to arrays would produce an RDD of arrays as the result, when the result should be the contents of those arrays. The result needs to be “flattened” afterward, and flatMap() does exactly this. Here, the array of words in the line is filtered and converted into tuples inside the function. In a case like this, it’s flatMap() that’s required to emulate such a Mapper, not map().
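The two shapes can be seen side by side in the sketch below; the nonEmpty guard is an assumption added here to avoid calling word(0) on empty strings produced by split(), and is not part of the original example:
// map() keeps the per-line nesting: RDD[Array[(String, Int)]]
val nested = lines.map(
  _.split(" ").filter(word => word.nonEmpty && Character.isUpperCase(word(0))).map(word => (word, 1))
)
// flatMap() flattens it into the desired RDD[(String, Int)]
val flat = lines.flatMap(
  _.split(" ").filter(word => word.nonEmpty && Character.isUpperCase(word(0))).map(word => (word, 1))
)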
groupByKey()
It’s simple to write a Reducer that then adds up the counts for each word, as before. And in Spark, again, reduceByKey() could be used to sum counts per word. But what if for some reason the output has to contain the word in all uppercase, along with a count? In MapReduce, that’s:
public class CountUppercaseReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable count : counts) {
      sum += count.get();
    }
    context.write(new Text(word.toString().toUpperCase()), new IntWritable(sum));
  }
}
But reduceByKey() by itself doesn’t quite work in Spark, since it preserves the original key. To emulate this in Spark, something even more like the Reducer API is needed. Recall that Reducer’s reduce() method receives a key and Iterable of values, and then emits some transformation of those. groupByKey() and a subsequent map() can achieve this:
... .groupByKey().map { case (word, ones) => (word.toUpperCase, ones.sum) }
groupByKey() merely collects all values for a key together, and does not apply a reduce function. From there, any transformation can be applied to the key and Iterable of values. Here, the key is transformed to uppercase, and the values are directly summed.
Be careful! groupByKey() works, but also collects all values for a key into memory. If a key is associated with many values, a worker could run out of memory. Although this is the most direct analog of a Reducer, it’s not necessarily the best choice in all cases. For example, Spark could have simply transformed the keys after a call to reduceByKey:
... .reduceByKey(_ + _).map { case (word, total) => (word.toUpperCase, total) }
It’s better to let Spark manage the reduction rather than ask it to collect all values just for us to manually sum them.
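As a usage sketch, assuming wordPairs stands for the RDD of (word, 1) tuples produced by the flatMap above, and that the result should be written out rather than collected, the preferred pipeline might look like this (the output path is hypothetical):
val upperCounts = wordPairs
  .reduceByKey(_ + _)
  .map { case (word, total) => (word.toUpperCase, total) }
upperCounts.saveAsTextFile("hdfs:///data/uppercase-word-counts")   // hypothetical output location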
setup() and cleanup()
In MapReduce, a Mapper and Reducer can declare a setup() method, called before any input is processed, to perhaps allocate an expensive resource like a database connection, and a cleanup() method to release the resource:
public class SetupCleanupMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private Connection dbConnection;
  @Override
  protected void setup(Context context) {
    dbConnection = ...;
  }
  @Override
  protected void cleanup(Context context) {
    dbConnection.close();
  }
}
The Spark map() and flatMap() methods only operate on one input at a time though, and provide no means to execute code before or after transforming a batch of values. It looks possible to simply put the setup and cleanup code before and after a call to map() in Spark:
val dbConnection = ...
lines.map(... dbConnection.createStatement(...) ...)
dbConnection.close() // Wrong!
However, this fails for several reasons:
It puts the object dbConnection into the map function’s closure, which requires that it be serializable (for example, by implementing java.io.Serializable). An object like a database connection is generally not serializable.
map() is a transformation, rather than an action, and is lazily evaluated. The connection can’t be closed immediately here.
Even so, it would only close the connection on the driver, not necessarily freeing resources allocated by serialized copies.
In fact, neither map() nor flatMap() is the closest counterpart to a Mapper in Spark — it’s the important mapPartitions() method. This method does not map just one value to one other value, but rather maps an Iterator of values to an Iterator of other values. It’s like a “bulk map” method. This means that the mapPartitions() function can allocate resources locally at its start, and release them when done mapping many values.
Adding setup code at the start of such a function is straightforward, but adding cleanup code is harder because it remains difficult to detect when the transformed iterator has been fully evaluated. For example, this does not work:
lines.mapPartitions { valueIterator =>
  val dbConnection = ... // OK
  val transformedIterator = valueIterator.map(... dbConnection ...)
  dbConnection.close() // Still wrong! May not have evaluated iterator
  transformedIterator
}
A more complete formulation (HT Tobias Pfeiffer) is roughly:
lines.mapPartitions { valueIterator =>
  if (valueIterator.isEmpty) {
    Iterator[...]()
  } else {
    val dbConnection = ...
    valueIterator.map { item =>
      val transformedItem = ...
      if (!valueIterator.hasNext) {
        dbConnection.close()
      }
      transformedItem
    }
  }
}
Although decidedly less elegant than previous translations, it can be done.
There is no flatMapPartitions() method. However, the same effect can be achieved by calling mapPartitions(), followed by a call to flatMap(a => a) to flatten.
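For example, a minimal sketch of that pattern, assuming each input line should simply be expanded into its individual words:
val words = lines.mapPartitions { valueIterator =>
  // any per-partition setup could go here
  valueIterator.map(line => line.split(" ").toSeq)   // Iterator[Seq[String]]
}.flatMap(a => a)                                    // flattened back to a plain RDD[String]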
The equivalent of a Reducer with setup() and cleanup() is just a groupByKey() followed by a mapPartitions() call like the one above. Take note of the caveat about using groupByKey() above, though.
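Putting those pieces together, a rough sketch of the uppercase-count Reducer with a per-partition resource might look like the following; wordPairs and createConnection() are assumptions standing in for the real pair RDD and resource setup:
wordPairs.groupByKey().mapPartitions { groupedIterator =>
  if (groupedIterator.isEmpty) {
    Iterator.empty
  } else {
    val dbConnection = createConnection()       // hypothetical setup, run once per partition
    groupedIterator.map { case (word, ones) =>
      val result = (word.toUpperCase, ones.sum)
      if (!groupedIterator.hasNext) {
        dbConnection.close()                    // cleanup after the last key in the partition
      }
      result
    }
  }
}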
But Wait, There’s More
MapReduce developers will point out that there is yet more to the API that hasn’t been mentioned yet:
MapReduce supports a special type of Reducer, called a Combiner, that can reduce shuffled data size from a Mapper.
It also supports custom partitioning via a Partitioner, and custom grouping for purposes of the Reducer via grouping Comparator.
The Context objects give access to a Counter API for accumulating statistics.
A Reducer always sees keys in sorted order within its lifecycle.
MapReduce has its own Writable serialization scheme.
Mappers and Reducers can emit multiple outputs at once.
MapReduce alone has tens of tuning parameters.
There are ways to implement or port these concepts into Spark, using APIs like the Accumulator, methods like groupBy() and the partitioner argument in various of these methods, Java or Kryo serialization, caching, and more. To keep this post brief, the remainder will be left to a follow-up post.
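As one small example of those counterparts, the rough analog of a MapReduce Counter is a Spark accumulator. The sketch below uses the Spark 1.x API (sc.accumulator; later versions offer sc.longAccumulator) and assumes a SparkContext named sc:
val emptyLines = sc.accumulator(0)   // counter-like shared variable for empty lines
val nonEmpty = lines.filter { line =>
  if (line.isEmpty) emptyLines += 1  // incremented on the workers
  line.nonEmpty
}
nonEmpty.count()            // an action forces evaluation; only then is the accumulator populated
println(emptyLines.value)   // read back on the driver
// Note: updates made inside transformations may be re-applied if tasks are retried.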
The concepts in MapReduce haven’t stopped being useful. It just now has a different and potentially more powerful implementation on Hadoop, and in a functional language that better matches its functional roots. Understanding the differences between Spark’s RDD API, and the original Mapper and Reducer APIs, helps developers better understand how all of them truly work and how to use Spark’s counterparts to best advantage.
This article is reproduced from: /blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark/