1. Data Import Workflow
Source data usually lives in a relational database such as Oracle or MySQL. If the volume is small (under 30 GB and fewer than 50 million rows), you can use Sqoop to load it directly from Oracle into HBase.
For larger volumes, loading directly will seriously hurt HBase performance, so the recommended path is:
1. Sqoop Oracle -> HDFS;
2. HDFS -> HFile;
3. HFile -> HBase.
2. Sqoop Oracle->HBase
sqoop import --connect jdbc:oracle:thin:@//192.168.0.43:1521/orapop \
  --table PM_BIRTH \
  --hbase-table PM_BIRTH \
  --column-family INFO \
  --hbase-row-key ID \
  --username ZJPX -P \
  -m 20
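If the target table does not already exist in HBase, Sqoop can create it during the import. A variant of the command above with the --hbase-create-table flag added (the flag is not part of the original command and is shown only for illustration):

sqoop import --connect jdbc:oracle:thin:@//192.168.0.43:1521/orapop \
  --table PM_BIRTH \
  --hbase-table PM_BIRTH \
  --hbase-create-table \
  --column-family INFO \
  --hbase-row-key ID \
  --username ZJPX -P \
  -m 20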
3. Sqoop Oracle->HDFS
sqoop import --connect jdbc:oracle:thin:@//192.168.0.43:1521/orapop \
  --null-string '\\N' \
  --null-non-string '\\N' \
  --table PR_TRANSFER \
  --target-dir /tmp/zhaomin/data/PR_TRANSFER \
  --fields-terminated-by '\t' \
  --lines-terminated-by '\n' \
  -m 30 \
  --username PX -P
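Before building HFiles it is worth sanity-checking the export, for example by listing the output directory and looking at a few rows (standard HDFS commands; the part file name shown is a typical Sqoop output name and may differ on your cluster):

hdfs dfs -ls /tmp/zhaomin/data/PR_TRANSFER
hdfs dfs -cat /tmp/zhaomin/data/PR_TRANSFER/part-m-00000 | head -n 5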
4. HDFS->HFile
4.1. Usage
Usage: importtsv -Dimporttsv.columns=a,b,c <tablename> <inputdir>
HFiles of data to prepare for a bulk data load, pass the option:
  -Dimporttsv.bulk.output=/path/for/output
  Note: if you do not use this option, then the target table must already exist in HBase
Other options that may be specified with -D include:
  -Dimporttsv.skip.bad.lines=false - fail if encountering an invalid line
  '-Dimporttsv.separator=|' - eg separate on pipes instead of tabs
  -Dimporttsv.timestamp=currentTimeAsLong - use the specified timestamp for the import
  -Dimporttsv.mapper.class=my.Mapper - A user-defined Mapper to use instead of
    org.apache.hadoop.hbase.mapreduce.TsvImporterMapper
  -Dmapred.job.name=jobName - use the specified mapreduce job name for the import
For performance consider the following options:
  -Dmapred.map.tasks.speculative.execution=false
  -Dmapred.reduce.tasks.speculative.execution=false
4.2. Example
HADOOP_CLASSPATH=`/usr/lib/hbase/bin/hbase classpath` hadoop jar /usr/lib/hbase/hbase-server.jar importtsv \
  -Dimporttsv.bulk.output=/tmp/zhaomin/data/bulkload_resourse_hfile \
  -Dimporttsv.separator='|' \
  -Dimporttsv.columns=HBASE_ROW_KEY,INFO:a,INFO:b \
  bulkload_text /tmp/zhaomin/data/bulkload_resourse_file.txt
4.3. Tips
Separator: if -Dimporttsv.separator is not specified, the default is a tab (\t).
HBASE_ROW_KEY must be present in the column list; in the example above it maps to the first field of each line (see the sample input below).
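To make the column mapping concrete, a couple of lines of the pipe-separated input file for -Dimporttsv.columns=HBASE_ROW_KEY,INFO:a,INFO:b might look like this (the values are made up for illustration):

1001|value_a1|value_b1
1002|value_a2|value_b2

The first field becomes the row key, the second is written to INFO:a, and the third to INFO:b.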
5. HFile->HBase
5.1. Example
HADOOP_CLASSPATH=`/usr/lib/hbase/bin/hbase classpath` hadoop jar /usr/lib/hbase/hbase-server.jar completebulkload \
  /tmp/zhaomin/data/bulkload_resourse_hfile bulkload_text
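Once the load completes, a quick scan from the HBase shell confirms the rows are visible (the table name comes from the command above; the LIMIT value is arbitrary):

echo "scan 'bulkload_text', {LIMIT => 2}" | hbase shell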
5.2. Problem Encountered
ERROR mapreduce.LoadIncrementalHFiles: Trying to load more than 32 hfiles to family d of region with start key
Exception in thread "main" java.io.IOException: Trying to load more than 32 hfiles to one family of one region
    at org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles.doBulkLoad(LoadIncrementalHFiles.java:288)
    at org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles.run(LoadIncrementalHFiles.java:842)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
    at org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles.main(LoadIncrementalHFiles.java:847)
5.3. Solution
Searching the HBase source code for "hfiles to family" leads to:
private boolean checkHFilesCountPerRegionPerFamily(
    final Multimap<ByteBuffer, LoadQueueItem> regionGroups) {
  for (Entry<ByteBuffer, ? extends Collection<LoadQueueItem>> e:
      regionGroups.asMap().entrySet()) {
    final Collection<LoadQueueItem> lqis = e.getValue();
    HashMap<byte[], MutableInt> filesMap = new HashMap<byte[], MutableInt>();
    for (LoadQueueItem lqi: lqis) {
      MutableInt count = filesMap.get(lqi.family);
      if (count == null) {
        count = new MutableInt();
        filesMap.put(lqi.family, count);
      }
      count.increment();
      if (count.intValue() > maxFilesPerRegionPerFamily) {
        LOG.error("Trying to load more than " + maxFilesPerRegionPerFamily
          + " hfiles to family " + Bytes.toStringBinary(lqi.family)
          + " of region with start key "
          + Bytes.toStringBinary(e.getKey()));
        return false;
      }
    }
  }
  return true;
}
Next, look at how maxFilesPerRegionPerFamily is assigned:
maxFilesPerRegionPerFamily = conf.getInt(MAX_FILES_PER_REGION_PER_FAMILY, 32);
As the line above shows, the value is read from the configuration and falls back to a default of 32 when the property is not set; MAX_FILES_PER_REGION_PER_FAMILY is the following property:
public static final String MAX_FILES_PER_REGION_PER_FAMILY
= "hbase.mapreduce.bulkload.max.hfiles.perRegion.perFamily";
The fix is to add this property to hbase-site.xml. In my case there were more than 90 HFiles, so setting the value to 256 resolved the problem.
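For reference, the hbase-site.xml entry looks like this (property name taken from the constant above, value 256 as described):

<property>
  <name>hbase.mapreduce.bulkload.max.hfiles.perRegion.perFamily</name>
  <value>256</value>
</property>

Since completebulkload runs through ToolRunner (visible in the stack trace above), it should also be possible to pass the same setting on the command line as -Dhbase.mapreduce.bulkload.max.hfiles.perRegion.perFamily=256 instead of editing the configuration file, though the hbase-site.xml route is what was used here.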