1. Data Import Workflow
Source data usually lives in a relational database such as Oracle or MySQL. If the volume is small (under 30 GB and fewer than 50 million rows), you can use Sqoop to load it directly from Oracle into HBase.
For larger volumes, loading directly will seriously hurt HBase performance, so the recommended path is:
1. Sqoop Oracle -> HDFS;
2. HDFS -> HFile;
3. HFile -> HBase.
2. Sqoop Oracle->HBase
sqoop import --connect jdbc:oracle:thin:@//192.168.0.43:1521/orapop \
  --table PM_BIRTH \
  --hbase-table PM_BIRTH \
  --column-family INFO \
  --hbase-row-key ID \
  --username ZJPX -P \
  -m 20
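If the target table does not already exist in HBase, Sqoop can create it during the import. A variant of the command above with the --hbase-create-table flag added (the flag is not part of the original command and is shown only for illustration):

sqoop import --connect jdbc:oracle:thin:@//192.168.0.43:1521/orapop \
  --table PM_BIRTH \
  --hbase-table PM_BIRTH \
  --hbase-create-table \
  --column-family INFO \
  --hbase-row-key ID \
  --username ZJPX -P \
  -m 20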
3. Sqoop Oracle->HDFS
sqoop import --connect jdbc:oracle:thin:@//192.168.0.43:1521/orapop \
  --null-string '\\N' \
  --null-non-string '\\N' \
  --table PR_TRANSFER \
  --target-dir /tmp/zhaomin/data/PR_TRANSFER \
  --fields-terminated-by '\t' \
  --lines-terminated-by '\n' \
  -m 30 \
  --username PX -P
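Before building HFiles it is worth sanity-checking the export, for example by listing the output directory and looking at a few rows (standard HDFS commands; the part file name shown is a typical Sqoop output name and may differ on your cluster):

hdfs dfs -ls /tmp/zhaomin/data/PR_TRANSFER
hdfs dfs -cat /tmp/zhaomin/data/PR_TRANSFER/part-m-00000 | head -n 5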
4. HDFS->HFile
4.1. Usage
Usage: importtsv -Dimporttsv.columns=a,b,c <tablename> <inputdir>
HFiles of data to prepare for a bulk data load, pass the option:
  -Dimporttsv.bulk.output=/path/for/output
  Note: if you do not use this option, then the target table must already exist in HBase
Other options that may be specified with -D include:
  -Dimporttsv.skip.bad.lines=false - fail if encountering an invalid line
  '-Dimporttsv.separator=|' - eg separate on pipes instead of tabs
  -Dimporttsv.timestamp=currentTimeAsLong - use the specified timestamp for the import
  -Dimporttsv.mapper.class=my.Mapper - A user-defined Mapper to use instead of
    org.apache.hadoop.hbase.mapreduce.TsvImporterMapper
  -Dmapred.job.name=jobName - use the specified mapreduce job name for the import
For performance consider the following options:
  -Dmapred.map.tasks.speculative.execution=false
  -Dmapred.reduce.tasks.speculative.execution=false
4.2. Example
HADOOP_CLASSPATH=`/usr/lib/hbase/bin/hbase classpath` hadoop jar /usr/lib/hbase/hbase-server.jar importtsv \
  -Dimporttsv.bulk.output=/tmp/zhaomin/data/bulkload_resourse_hfile \
  -Dimporttsv.separator='|' \
  -Dimporttsv.columns=HBASE_ROW_KEY,INFO:a,INFO:b \
  bulkload_text /tmp/zhaomin/data/bulkload_resourse_file.txt
4.3. Tips
Separator: if -Dimporttsv.separator is not specified, the default is a tab (\t).
HBASE_ROW_KEY must be present in the column list; in the example above it maps to the first field of each line (see the sample input below).
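To make the column mapping concrete, a couple of lines of the pipe-separated input file for -Dimporttsv.columns=HBASE_ROW_KEY,INFO:a,INFO:b might look like this (the values are made up for illustration):

1001|value_a1|value_b1
1002|value_a2|value_b2

The first field becomes the row key, the second is written to INFO:a, and the third to INFO:b.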
5. HFile->HBase
5.1. Example
HADOOP_CLASSPATH=`/usr/lib/hbase/bin/hbase classpath` hadoop jar /usr/lib/hbase/hbase-server.jar completebulkload \
  /tmp/zhaomin/data/bulkload_resourse_hfile bulkload_text
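Once the load completes, a quick scan from the HBase shell confirms the rows are visible (the table name comes from the command above; the LIMIT value is arbitrary):

echo "scan 'bulkload_text', {LIMIT => 2}" | hbase shell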
5.2. Problem Encountered
ERROR mapreduce.LoadIncrementalHFiles: Trying to load more than 32 hfiles to family d of region with start key
Exception in thread "main" java.io.IOException: Trying to load more than 32 hfiles to one family of one region
    at org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles.doBulkLoad(LoadIncrementalHFiles.java:288)
    at org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles.run(LoadIncrementalHFiles.java:842)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
    at org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles.main(LoadIncrementalHFiles.java:847)
5.3. Solution
Searching the HBase source code for "hfiles to family" leads to:
private boolean checkHFilesCountPerRegionPerFamily(
    final Multimap<ByteBuffer, LoadQueueItem> regionGroups) {
  for (Entry<ByteBuffer, ? extends Collection<LoadQueueItem>> e:
      regionGroups.asMap().entrySet()) {
    final Collection<LoadQueueItem> lqis = e.getValue();
    HashMap<byte[], MutableInt> filesMap = new HashMap<byte[], MutableInt>();
    for (LoadQueueItem lqi: lqis) {
      MutableInt count = filesMap.get(lqi.family);
      if (count == null) {
        count = new MutableInt();
        filesMap.put(lqi.family, count);
      }
      count.increment();
      if (count.intValue() > maxFilesPerRegionPerFamily) {
        LOG.error("Trying to load more than " + maxFilesPerRegionPerFamily
          + " hfiles to family " + Bytes.toStringBinary(lqi.family)
          + " of region with start key "
          + Bytes.toStringBinary(e.getKey()));
        return false;
      }
    }
  }
  return true;
}
Next, look at how maxFilesPerRegionPerFamily is assigned:
maxFilesPerRegionPerFamily = conf.getInt(MAX_FILES_PER_REGION_PER_FAMILY, 32);
As the line above shows, the value is read from the configuration and falls back to a default of 32 when the property is not set; MAX_FILES_PER_REGION_PER_FAMILY is the following property:
public static final String MAX_FILES_PER_REGION_PER_FAMILY
= "hbase.mapreduce.bulkload.max.hfiles.perRegion.perFamily";
The fix is to add this property to hbase-site.xml. In my case there were more than 90 HFiles, so setting the value to 256 resolved the problem.
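For reference, the hbase-site.xml entry looks like this (property name taken from the constant above, value 256 as described):

<property>
  <name>hbase.mapreduce.bulkload.max.hfiles.perRegion.perFamily</name>
  <value>256</value>
</property>

Since completebulkload runs through ToolRunner (visible in the stack trace above), it should also be possible to pass the same setting on the command line as -Dhbase.mapreduce.bulkload.max.hfiles.perRegion.perFamily=256 instead of editing the configuration file, though the hbase-site.xml route is what was used here.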