Hadoop : 一个目录下的数据只由一个map处理-mysql教程-PHP中文网

首页

数据库

mysql教程

Hadoop : 一个目录下的数据只由一个map处理

WBOYWBOYWBOYWBOYWBOYWBOYWBOYWBOYWBOYWBOYWBOYWBOYWB

Jun 07, 2016 pm 04:38 PM

hadoopmap一处理数据需求

有这么个需求：一个目录下的数据只能由一个map来处理。如果多个map处理了同一个目录下的数据会导致数据错乱。刚开始google了下，以为网上都有现成的InputFormat，找到的答案类似我之前写的 mapreduce job让一个文件只由一个map来处理。或者是把目录写在文

有这么个需求：一个目录下的数据只能由一个map来处理。如果多个map处理了同一个目录下的数据会导致数据错乱。

刚开始google了下，以为网上都有现成的InputFormat，找到的答案类似我之前写的 “mapreduce job让一个文件只由一个map来处理“。

或者是把目录写在文件里面，作为输入：

/path/to/directory1
/path/to/directory2
/path/to/directory3

代码里面按行读取：

 @Override
        protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            FileSystem fs = FileSystem.get(context.getConfiguration());
            for (FileStatus status : fs.listStatus(new Path(value.toString()))) {
                // process file
            }
        }

都不能满足需求，还是自己实现一个 OneMapOneDirectoryInputFormat 吧，也很简单：

import java.io.IOException;
import java.util.*;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;
/**
 * 一个map处理一个目录的数据
 */
public abstract class OneMapOneDirectoryInputFormat extends CombineFileInputFormat {
    private static final Log LOG = LogFactory.getLog(OneMapOneDirectoryInputFormat.class);
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false;
    }
    @Override
    public List getSplits(JobContext job) throws IOException {
        // get all the files in input path
        List stats = listStatus(job);
        List splits = new ArrayList();
        if (stats.size() == 0) {
            return splits;
        }
        LOG.info("fileNums=" + stats.size());
        Map> map = new HashMap>();
        for (FileStatus stat : stats) {
            String directory = stat.getPath().getParent().toString();
            if (map.containsKey(directory)) {
                map.get(directory).add(stat);
            } else {
                List fileList = new ArrayList();
                fileList.add(stat);
                map.put(directory, fileList);
            }
        }
        // 设置inputSplit
        long currentLen = 0;
        List pathLst = new ArrayList();
        List offsetLst = new ArrayList();
        List lengthLst = new ArrayList();
        Iterator itr = map.keySet().iterator();
        while (itr.hasNext()) {
            String dir = itr.next();
            List fileList = map.get(dir);
            for (int i = 0; i  path[" + i + "]=" + pathArray[i].toString());
            }
            splits.add(thissplit);
            pathLst.clear();
            offsetLst.clear();
            lengthLst.clear();
            currentLen = 0;
        }
        return splits;
    }
    private long[] getLongArray(List lst) {
        long[] rst = new long[lst.size()];
        for (int i = 0; i 
<p>这个InputFormat的具体使用方法就不说了。其实与“一个Hadoop程序的优化过程 – 根据文件实际大小实现CombineFileInputFormat”中的MultiFileInputFormat比较类似。</p>
    <p class="copyright">
        原文地址：Hadoop : 一个目录下的数据只由一个map处理, 感谢原作者分享。
    </p>

声明

本文内容由网友自发贡献，版权归原作者所有，本站不承担相应法律责任。如您发现有涉嫌抄袭侵权的内容，请联系admin@php.cn

mysql：blob和其他无-SQL存储，有什么区别？May 13, 2025 am 12:14 AM

mysql'sblobissuitableForStoringBinaryDataWithInareLationalDatabase，而alenosqloptionslikemongodb，redis和calablesolutionsoluntionsoluntionsoluntionsolundortionsolunsolunsstructureddata.blobobobsimplobissimplobisslowderperformandperformanceperformancewithlararengelitiate;

mySQL添加用户：语法，选项和安全性最佳实践May 13, 2025 am 12:12 AM

toaddauserinmysql，使用：createUser'username'@'host'Indessify'password'; there'showtodoitsecurely：1）choosethehostcarecarefullytocon trolaccess.2）setResourcelimitswithoptionslikemax_queries_per_hour.3）usestrong，iniquepasswords.4）Enforcessl/tlsconnectionswith

MySQL：如何避免字符串数据类型常见错误？May 13, 2025 am 12:09 AM

toAvoidCommonMistakeswithStringDatatatPesInMysQl，CloseStringTypenuances，chosethirtightType，andManageEngencodingAndCollationsEttingsefectery.1）usecharforfixed lengengters lengengtings，varchar forbariaible lengength，varchariable length，andtext/blobforlabforlargerdata.2 seterters seterters seterters seterters

mySQL：字符串数据类型和枚举？May 13, 2025 am 12:05 AM

mysqloffersechar，varchar，text，and denumforstringdata.usecharforfixed Lengttrings，varcharerforvariable长度，文本forlarger文本，andenumforenforcingDataAntegrityWithaEtofValues。

mysql blob：如何优化斑点请求May 13, 2025 am 12:03 AM

优化MySQLBLOB请求可以通过以下策略：1.减少BLOB查询频率，使用独立请求或延迟加载；2.选择合适的BLOB类型（如TINYBLOB）；3.将BLOB数据分离到单独表中；4.在应用层压缩BLOB数据；5.对BLOB元数据建立索引。这些方法结合实际应用中的监控、缓存和数据分片，可以有效提升性能。

将用户添加到MySQL：完整的教程May 12, 2025 am 12:14 AM

掌握添加MySQL用户的方法对于数据库管理员和开发者至关重要，因为它确保数据库的安全性和访问控制。1)使用CREATEUSER命令创建新用户，2)通过GRANT命令分配权限，3)使用FLUSHPRIVILEGES确保权限生效，4)定期审计和清理用户账户以维护性能和安全。

掌握mySQL字符串数据类型：varchar vs.文本与charMay 12, 2025 am 12:12 AM

chosecharforfixed-lengthdata，varcharforvariable-lengthdata，andtextforlargetextfield.1）chariseffity forconsistent-lengthdatalikecodes.2）varcharsuitsvariable-lengthdatalikenames，ballancingflexibilitibility andperformance.3）

MySQL：字符串数据类型和索引：最佳实践May 12, 2025 am 12:11 AM

在MySQL中处理字符串数据类型和索引的最佳实践包括：1)选择合适的字符串类型，如CHAR用于固定长度，VARCHAR用于可变长度，TEXT用于大文本；2)谨慎索引，避免过度索引，针对常用查询创建索引；3)使用前缀索引和全文索引优化长字符串搜索；4)定期监控和优化索引，保持索引小巧高效。通过这些方法，可以在读取和写入性能之间取得平衡，提升数据库效率。

See all articles