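Distributed NoSQL (14): educoder_8-4 HBase Development: Advanced Applications with the Python API

8-4 HBase Development: Advanced Applications with the Python API

Level 1: Creating a Table

Task for this level: use Python code to create a table in HBase.

How to connect to an HBase database with happybase

1. Connection (happybase.Connection)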

import happybase 
happybase.Connection(host='localhost', port=9090, timeout=None, autoconnect=True, table_prefix=None, table_prefix_separator=b'_', compat='0.98', transport='buffered', protocol='binary')

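This returns a connection instance. The parameters are:
host: hostname
port: port
timeout: socket timeout (in milliseconds)
autoconnect: whether the connection is opened immediately
table_prefix: prefix used to construct table names
table_prefix_separator: separator used with table_prefix
compat: compatibility mode
transport: transport mode
protocol: protocol

Example: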

connection = happybase.Connection(host="192.168.0.156",
                                  port=9090, timeout=None, autoconnect=True,
                                  table_prefix=None, table_prefix_separator=b'_',
                                  compat='0.98', transport='buffered', protocol='binary')

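When the connection object is created, it opens a socket connection to HBase automatically by default. To prevent this, set the autoconnect parameter to False:

connection = happybase.Connection('10.1.13.111', autoconnect=False)

Then open the socket connection to HBase manually:

connection.open()

open(): opens the transport; no return value
close(): closes the transport; no return value

connection.close()

Once the connection is established, list the tables that are available:

print(connection.tables())

Because no table has been created yet, the result is []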
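The open/close sequence can be wrapped so that the connection is closed even when a call in between fails. This is only a sketch: list_tables_once is a hypothetical helper name, and it assumes an object that offers the open(), tables() and close() methods described here, such as a happybase.Connection created with autoconnect=False.

```python
def list_tables_once(connection):
    """Open the connection, list the tables, and always close it again.

    `connection` is assumed to behave like happybase.Connection
    (open(), tables(), close()); the helper name is hypothetical.
    """
    connection.open()
    try:
        return connection.tables()
    finally:
        # Runs on success and on error, so the socket is not left open.
        connection.close()
```

With a running Thrift server, a call such as list_tables_once(happybase.Connection('localhost', autoconnect=False)) should return the current table list.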

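2. Creating a table

create_table(name, families): creates a table; no return value

name: table name

families: column families

families = {"cf": dict(), "df": dict()}
connection.create_table(name, families)

If a table prefix was passed when connecting, the actual table name will be "{}_{}".format(table_prefix, name)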
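To make the prefix rule concrete, here is a small pure-Python sketch of the naming formula. full_table_name is a hypothetical helper written for illustration, not happybase's own code, and it uses plain strings for simplicity.

```python
def full_table_name(name, table_prefix=None, table_prefix_separator='_'):
    """Return the table name HBase would see for a given prefix setting."""
    if table_prefix is None:
        # No prefix configured: the name is used unchanged.
        return name
    return "{}{}{}".format(table_prefix, table_prefix_separator, name)


print(full_table_name('my_table'))                      # my_table
print(full_table_name('my_table', table_prefix='dev'))  # dev_my_table
```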

connection.create_table('my_table', {
    'cf1': dict(max_versions=10),
    'cf2': dict(max_versions=1, block_cache_enabled=False),
    'cf3': dict(),  # use defaults
})

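Checking the available tables again with connection.tables() now gives ['my_table'] (under Python 3 the name is shown as bytes, b'my_table'). The new table my_table has three column families: cf1, cf2, and cf3.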
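Calling create_table for a table that already exists fails, so a script that may run more than once can check tables() first. The sketch below assumes a connection object with tables() and create_table() as described in this section; ensure_table is a hypothetical helper name.

```python
def ensure_table(connection, name, families):
    """Create `name` with `families` unless it already exists.

    Returns True if the table was created, False if it was already there.
    """
    wanted = name.encode() if isinstance(name, str) else name
    # tables() is assumed to return bytes names under Python 3; str names
    # are normalised too so the comparison works either way.
    existing = {t.encode() if isinstance(t, str) else t
                for t in connection.tables()}
    if wanted in existing:
        return False
    connection.create_table(name, families)
    return True
```

For example, ensure_table(connection, 'my_table', {'cf1': dict()}) would create the table on the first call and do nothing on later calls.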

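Programming requirements

Now it's your turn. Using what you learned in this level, complete the following tasks on the command line on the right:

1. Run the environment setup script to prepare for working with HBase from Python.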

cd /data/workspace/myshixun/opt/

chmod +x setup-env.sh

./setup-env.sh

This may take a few minutes. Once the script has finished, thrift and happybase are installed.

Next, check whether each service process has started: jps

root@evassh-2932225:~# jps 
1808 ResourceManager 
4785 Jps 
2820 ThriftServer 
1317 NameNode 
2694 HRegionServer 
1447 DataNode 
2506 HQuorumPeer 
1626 SecondaryNameNode 
2570 HMaster

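If the following messages appear during startup:

localhost: zookeeper running as process 2474. Stop it first.
master running as process 2538. Stop it first.
: regionserver running as process 2665. Stop it first.
thrift running as process 2789. Stop it first.

the services have to be stopped first with /app/hbase-2.1.1/bin/hbase-daemon.sh stop thrift, stop-dfs.sh, and stop-hbase.sh, and then restarted.

2. Enter the Python interpreter to create the connection and the table: python3

3. Create a local connection object conn, using the local host and the default port.

4. Create a table named student1 with two column families: info and scores. Note: leave a space after the comma between the column family definitions, otherwise an error occurs.

5. Use conn.tables() to check whether the table was created. If 'BrokenPipeError: [Errno 32] Broken pipe' appears, the connection has been closed automatically. A connection that is idle for more than 60 seconds is closed automatically and has to be re-established.

6. Exit the Python interpreter: exit()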

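Lab steps

This level is honestly hit-or-miss because of problems on the educoder platform. If your method and steps are correct but the check still fails, don't worry too much; just try a few more times.

The default connection timeout is 60 seconds: a connection that is idle for more than 60 seconds is closed automatically, the error 'BrokenPipeError: [Errno 32] Broken pipe' appears, and the connection has to be re-established. So first we change the timeout.

In the /app/hbase-2.1.1/conf directory:

cd /app/hbase-2.1.1/conf
vim hbase-site.xml

Add this block:

<property>
         <name>hbase.thrift.server.socket.read.timeout</name>
         <value>6000000</value>
         <description>unit: milliseconds</description>
</property>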

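After that, you can follow the steps above. The reason I call this level hit-or-miss: as the screenshot showed, the commands were fine but the test still failed. Just try a few more times; if it doesn't pass today, try again tomorrow, or do the later levels first. The steps themselves are as described. It took me n attempts to pass 【🙂】.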
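Besides raising the server-side timeout, client code can also recover from a dropped connection by reconnecting and retrying. The following is a minimal sketch, assuming the error surfaces as BrokenPipeError (as in the message quoted in this post) and that calling close() and then open() on the same connection object is enough to get a fresh socket; call_with_reconnect is a hypothetical helper name. Creating a new happybase.Connection instead would be an equally reasonable choice.

```python
def call_with_reconnect(connection, action):
    """Run action(connection); on a broken pipe, reopen and retry once."""
    try:
        return action(connection)
    except BrokenPipeError:
        # The idle connection was dropped: close our side and reopen it.
        connection.close()
        connection.open()
        # A second failure is not caught and propagates to the caller.
        return action(connection)
```

For example, call_with_reconnect(conn, lambda c: c.tables()) lists the tables and retries once if the first attempt hits a broken pipe.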

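Level 2: Data Operations

Adding data

To add data to a table we need a table object, and we add the data with its table.put() method.

In the previous level's example we created the table my_table with three column families: cf1, cf2, and cf3. Now we write data into it. You can try creating this my_table table yourself first.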

table = connection.table('my_table')  # first, get the table object

All data stored in HBase is raw byte strings.

cloth_data = {'cf1:content': 'jeans', 'cf1:price': '299', 'cf1:rating': '98%'}
hat_data = {'cf1:content': 'cap', 'cf1:price': '88', 'cf1:rating': '99%'}
shoe_data = {'cf1:content': 'jacket', 'cf1:price': '988', 'cf1:rating': '100%'}
author_data = {'cf2:name': 'LiuLin', 'cf2:date': '2017-03-09'}

table.put(row='www.test1.com', data=cloth_data)
table.put(row='www.test2.com', data=hat_data)
table.put(row='www.test3.com', data=shoe_data)
table.put(row='www.test4.com', data=author_data)

put stores only one row per call. If the row key already exists, the call modifies the existing data instead.
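Because HBase stores raw bytes, numeric values have to be turned into strings or bytes before they are written. The helper below is a sketch of one way to do that in plain Python; encode_cells is a hypothetical name, and UTF-8 is an assumed encoding.

```python
def encode_cells(data, encoding='utf-8'):
    """Convert column names and values to bytes, stringifying numbers first."""
    def to_bytes(value):
        if isinstance(value, bytes):
            return value
        return str(value).encode(encoding)

    return {to_bytes(column): to_bytes(value) for column, value in data.items()}


print(encode_cells({'cf1:price': 299, 'cf1:content': 'jeans'}))
# {b'cf1:price': b'299', b'cf1:content': b'jeans'}
```

The result can then be passed as the data argument of table.put().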

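A better way to store data: table.put() sends a command to the HBase Thrift server immediately, which is not very efficient. The table.batch() method is more efficient.

Inserting multiple rows at once with batch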

bat = table.batch()
bat.put('www.test5.com', {'cf1:price': '999', 'cf2:title': 'Hello Python', 'cf2:length': '34', 'cf3:code': 'A43'})
bat.put('www.test6.com', {'cf1:content': 'razor', 'cf1:price': '168', 'cf1:rating': '97%'})
bat.put('www.test7.com', {'cf3:function': 'print'})
bat.send()

Even more convenient is to manage the batch with a context manager, so the data no longer has to be sent manually; bat.send() is not needed.

Managing the batch with with

with table.batch() as bat:
        bat.put('www.test5.com', {'cf1:price': '999', 'cf2:title': 'Hello Python', 'cf2:length': '34', 'cf3:code': 'A43'})
        bat.put('www.test6.com', {'cf1:content': u'剃须刀', 'cf1:price': '168', 'cf1:rating': '97%'})
        bat.put('www.test7.com', {'cf3:function': 'print'})
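For larger loads, table.batch() also accepts a batch_size argument, so the batch is flushed every batch_size mutations rather than only at the end. The sketch below assumes a happybase table object; put_rows and the default of 1000 are illustrative choices, not values from the original post.

```python
def put_rows(table, rows, batch_size=1000):
    """Write (row_key, data) pairs through a single batch.

    With batch_size set, the batch is expected to flush itself every
    `batch_size` mutations instead of holding everything until the end.
    Returns the number of rows queued.
    """
    count = 0
    with table.batch(batch_size=batch_size) as bat:
        for row_key, data in rows:
            bat.put(row_key, data)
            count += 1
    # Leaving the with block sends whatever is still pending.
    return count
```

A call such as put_rows(table, [('www.test8.com', {'cf1:price': '10'})]) would queue and send one row.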

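Deleting data

Deleting data in a batch

with table.batch() as bat:
    bat.delete('www.test1.com')

A batch keeps the data in memory until it is sent. The first way to send it is explicitly, with bat.send(); the second is automatically, when the end of the with context manager is reached.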
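The batch delete() call also takes an optional columns argument, which removes only some cells of a row instead of the whole row. A minimal sketch, assuming a happybase table object; delete_columns is a hypothetical helper name.

```python
def delete_columns(table, row_key, columns):
    """Delete only the given columns of one row, leaving the rest in place."""
    with table.batch() as bat:
        # Without `columns`, the whole row would be deleted.
        bat.delete(row_key, columns=columns)
```

For example, delete_columns(table, 'www.test2.com', ['cf1:rating']) would remove just the rating cell of that row.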

Retrieving data

Scanning a whole table:

for key, value in table.scan():
    print(key, value)

Each iteration yields a row key and a dict of that row's cells.
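A full scan reads every row. table.scan() also accepts arguments such as row_prefix, columns and limit to narrow the result. The sketch below assumes a happybase table object and that the prefix is passed as bytes; scan_prefix is a hypothetical helper name.

```python
def scan_prefix(table, prefix, columns=None, limit=None):
    """Return [(row_key, cells), ...] for row keys that start with `prefix`.

    `prefix` is assumed to be bytes, for example b'www.test'.
    """
    return list(table.scan(row_prefix=prefix, columns=columns, limit=limit))
```

For example, scan_prefix(table, b'www.test', columns=['cf1'], limit=2) would return at most two rows, with only the cf1 cells.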

Retrieving one row:

row = table.row('www.test4.com')
print(row)

This returns the data for that row key directly (as a dict). The result is: {'cf2:name': 'LiuLin', 'cf2:date': '2017-03-09'}

Retrieving multiple rows:

rows = table.rows(['www.test1.com', 'www.test4.com'])
print(rows)

The return value is a list. Each element is a tuple whose first element is the row key and whose second element is that row's data.

To get the result of table.rows() as a dict:

rows_dict = dict(table.rows(['www.test1.com', 'www.test4.com']))
print(rows_dict)

To get the result of table.rows() as an ordered dict (OrderedDict):

from collections import OrderedDict
rows_ordered_dict = OrderedDict(table.rows(['www.test1.com', 'www.test4.com']))
print(rows_ordered_dict)
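Under Python 3 the column names and values come back as bytes, as the b'...' values in the expected output of this level show. For display it can help to decode them. This is a small pure-Python sketch; decode_row is a hypothetical helper, and UTF-8 is an assumed encoding.

```python
def decode_row(row, encoding='utf-8'):
    """Turn a {bytes: bytes} row into a {str: str} dict for display."""
    return {column.decode(encoding): value.decode(encoding)
            for column, value in row.items()}


print(decode_row({b'cf2:name': b'LiuLin', b'cf2:date': b'2017-03-09'}))
# {'cf2:name': 'LiuLin', 'cf2:date': '2017-03-09'}
```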

Now start your task: complete the code according to the requirements in the file on the right.

Expected output:
OrderedDict([(b'info:name', b'John'), (b'scores:Bigdata', b'89'), (b'scores:database', b'88')])
b'95001' OrderedDict([(b'info:name', b'John'), (b'scores:Bigdata', b'89'), (b'scores:database', b'88')])
b'95002' OrderedDict([(b'info:name', b'Rose'), (b'scores:database', b'68')])
b'95003' OrderedDict([(b'info:name', b'Greens'), (b'scores:Bigdata', b'76')])

If the run fails with an error saying the table already exists, check whether each service process is running:

root@evassh-2932225:~# jps
1808 ResourceManager
4785 Jps
2820 ThriftServer
1317 NameNode
2694 HRegionServer
1447 DataNode
2506 HQuorumPeer
1626 SecondaryNameNode
2570 HMaster

If they are running, go to the hbase shell and delete the table manually.

Code

import happybase
from collections import OrderedDict
conn = happybase.Connection('localhost')
databases = conn.tables()
if b"student1" in databases:
   conn.disable_table("student1")
   conn.delete_table("student1")
conn.create_table("student1",{"info":{}, "scores":{}})
#begins1 Insert one row: '95001', "info:name": 'John', 'scores:database': 88
table = conn.table("student1")
table.put("95001", {"info:name":"John", 'scores:database':'88'})


#end1


# Use the batch context manager to insert multiple rows at once
##'95002',"info:name":'Rose','scores:database':68
##'95003',"info:name":'Greens','scores:Bigdata':76
##'95001','scores:Bigdata':89
#begins2
with table.batch() as bat:
   bat.put("95002", {"info:name":'Rose', 'scores:database':'68'})
   bat.put("95003", {"info:name":'Greens', 'scores:Bigdata':'76'})
   bat.put("95001", {'scores:Bigdata':'89'})

#end2


#begins3  Print the data for row key '95001', placed in an ordered dict
row = table.row("95001")
print(OrderedDict(row))


#ends3
# Scan the whole table and print it; hint: put everything other than the row key into an ordered dict
#begins4
for key, value in table.scan():
   print(key, OrderedDict(value))

#end4
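The expected output lists the columns of each row in sorted order. OrderedDict(row) keeps whatever order the row dict already has, so if a fixed order is wanted regardless of how the data arrives, the items can be sorted first. This is a sketch on plain dicts; ordered_row is a hypothetical helper and is not part of the graded solution.

```python
from collections import OrderedDict


def ordered_row(row):
    """Return the row's cells as an OrderedDict sorted by column name."""
    return OrderedDict(sorted(row.items()))


# Sample data in a deliberately unsorted order.
row = {b'scores:database': b'88', b'info:name': b'John', b'scores:Bigdata': b'89'}
print(ordered_row(row))
```

Byte strings sort by byte value, so b'scores:Bigdata' comes before b'scores:database'.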

Victory!