Setting Up Hadoop + Hive with Docker

Published 2023-10-13 22:56:23 by INnoVation-V2

This post largely follows https://zhuanlan.zhihu.com/p/242658224, but that article leaves a few configuration details unexplained; this post fills in those gaps.

0. Environment

Machine specs

Ubuntu 20.04.6 LTS (Focal Fossa)
5.15.0-79-generic x86_64
12th Gen Intel Core i5-12400
32GB Memory

Docker version info

Client: Docker Engine - Community
 Cloud integration: v1.0.35+desktop.5
 Version:           24.0.6
 API version:       1.43
 Go version:        go1.20.7

Server: Docker Desktop 4.24.2 (124339)
 Engine:
  Version:          24.0.6
  API version:      1.43 (minimum version 1.12)

I. Hadoop

1. Pull the image

docker pull registry.cn-hangzhou.aliyuncs.com/hadoop_test/hadoop_base

2. Create an internal Docker network for Hadoop

docker network create --driver=bridge --subnet=172.21.0.0/16  hadoop
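
To confirm the network was created with the expected subnet (an optional check):

docker network inspect hadoop | grep Subnet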

3. Create the cluster

3.1 Create the Master

docker run -it --network hadoop -h Master --name Master -p 9870:9870 -p 8088:8088 -p 10000:10000 registry.cn-hangzhou.aliyuncs.com/hadoop_test/hadoop_base bash

After the command runs you are dropped into a bash shell inside the container. Two things need to be done here: edit the hosts file and add an environment variable.

Note 1: Edit the hosts file

vim /etc/hosts

Delete the last line, then add the following three lines, save, and quit:

172.21.0.4	Master
172.21.0.3	Slave1
172.21.0.2	Slave2

Note 2: Add the environment variable

vim ~/.bashrc

Append the following line at the end:

export PATH=$PATH:/usr/local/hadoop/bin/

Save and quit with :wq.

Finally, reload the environment:

source ~/.bashrc
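
Optionally, confirm that the PATH change took effect; hadoop version should print the Hadoop release bundled in the image:

hadoop version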

Once done, type exit to leave the container shell.

3.2 Create Slave1

docker run -it --network hadoop -h Slave1 --name Slave1 registry.cn-hangzhou.aliyuncs.com/hadoop_test/hadoop_base bash

This again drops you into the container's shell. Repeat Note 1 and Note 2, then exit.

3.3 Create Slave2

docker run -it --network hadoop -h Slave2 --name Slave2 registry.cn-hangzhou.aliyuncs.com/hadoop_test/hadoop_base bash

As in 3.2, repeat Note 1 and Note 2, then exit.

4. Start the containers

docker start Master
docker start Slave1
docker start Slave2
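
To confirm all three containers are running (an optional check):

docker ps --format '{{.Names}}: {{.Status}}'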

5. Start Hadoop

Enter the Master container, format HDFS first, then start Hadoop.

# Enter the Master container
docker exec -it Master bash
# Format HDFS
# If this reports that the hadoop command cannot be found, Note 2 was not applied correctly
hadoop namenode -format

Next, go to the Hadoop sbin directory and start all services:

cd /usr/local/hadoop/sbin
./start-all.sh

You should see the services come up; my output looked like this:

Starting namenodes on [Master]
Master: Warning: Permanently added 'master,172.19.0.4' (ECDSA) to the list of known hosts.
Starting datanodes
Slave1: Warning: Permanently added 'slave1,172.19.0.3' (ECDSA) to the list of known hosts.
Slave2: Warning: Permanently added 'slave2,172.19.0.2' (ECDSA) to the list of known hosts.
Slave1: WARNING: /usr/local/hadoop/logs does not exist. Creating.
Slave2: WARNING: /usr/local/hadoop/logs does not exist. Creating.
Starting secondary namenodes [Master]
Starting resourcemanager
Starting nodemanagers
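
If the image ships the JDK's jps tool (not guaranteed, so treat this as an optional check), you can list the daemons running on each node:

# Inside Master: expect NameNode, SecondaryNameNode and ResourceManager
jps
# From the host (not inside Master), for a worker: expect DataNode and NodeManager
docker exec Slave1 jps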

You can view the cluster web UIs on ports 8088 (YARN) and 9870 (HDFS) of the host machine's IP, for example:

http://192.168.1.17:8088/cluster/nodes

Check the status of the distributed file system:

root@Master:/usr/local/hadoop/sbin# hdfs dfsadmin -report

6. Run the built-in WordCount example

Next, run the bundled example to verify that the cluster works:

cd /usr/local/hadoop
cat LICENSE.txt > file1.txt

# Create an input directory in HDFS
hadoop fs -mkdir /input

# Upload file1.txt to HDFS
hadoop fs -put file1.txt /input

# List the contents of the input directory
hadoop fs -ls /input

# Run wordcount
hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.1.jar wordcount /input /output

# List the contents of /output in HDFS
hadoop fs -ls /output

# Print part-r-00000, which holds the results
hadoop fs -cat /output/part-r-00000
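
Note that MapReduce refuses to start a job whose output directory already exists, so remove /output first if you want to re-run the example:

hadoop fs -rm -r /output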

That completes the Hadoop setup.

II. Hive

1. Download Hive

The version used here is hive-3.1.2, mirrored at https://mirrors.tuna.tsinghua.edu.cn/apache/hive/hive-3.1.2/.
If the link is dead, find another mirror yourself.

# Download Hive
wget https://mirrors.tuna.tsinghua.edu.cn/apache/hive/hive-3.1.2/apache-hive-3.1.2-bin.tar.gz

# Copy the archive into the Master container
docker cp apache-hive-3.1.2-bin.tar.gz Master:/usr/local

# Enter the container
docker exec -it Master bash
cd /usr/local/

# Extract the archive
tar xvf apache-hive-3.1.2-bin.tar.gz

2. Edit the configuration file

cd /usr/local/apache-hive-3.1.2-bin/conf
cp hive-default.xml.template hive-site.xml
vim hive-site.xml

Add the following properties near the top of the file, just inside the <configuration> element:

  <property>
    <name>system:java.io.tmpdir</name>
    <value>/tmp/hive/java</value>
  </property>
  <property>
    <name>system:user.name</name>
    <value>${user.name}</value>
  </property>

3. Set the Hive environment variables

vim /etc/profile

# Append at the end of the file
export HIVE_HOME="/usr/local/apache-hive-3.1.2-bin"
export PATH=$PATH:$HIVE_HOME/bin

# Run source /etc/profile to apply the changes
source /etc/profile
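
A quick sanity check that the variables are set (optional):

echo $HIVE_HOME
ls $HIVE_HOME/bin/hive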

4. Configure MySQL as the metastore database

Pull the image and create the container:

# Pull the image
docker pull mysql:8.0.18

# Create the container
docker run --name mysql_hive -p 4306:3306 --net hadoop --ip 172.21.0.5 -v /home/fury/mysql:/var/lib/mysql -e MYSQL_ROOT_PASSWORD=abc123456 -d mysql:8.0.18

Here /home/fury/mysql:/var/lib/mysql maps the container's /var/lib/mysql directory to /home/fury/mysql on the host; fury is my username, so change this to a directory that exists on your machine.

# Start the container
docker start mysql_hive
# Enter the container
docker exec -it mysql_hive bash
# Log in to MySQL
mysql -uroot -p
# The password was set to abc123456 when the container was created
# Create the hive database
create database hive;
# Adjust the remote connection permissions (switch root to mysql_native_password)
ALTER USER 'root'@'%' IDENTIFIED WITH mysql_native_password BY 'abc123456';
# Quit mysql
quit
# Exit the container shell
exit
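
To double-check that MySQL is up and the hive database exists, the client can also be run non-interactively from the host (optional):

docker exec mysql_hive mysql -uroot -pabc123456 -e 'show databases;'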

Go back into the Master container and continue the configuration:

docker exec -it Master bash
vim /usr/local/apache-hive-3.1.2-bin/conf/hive-site.xml

Edit the configuration file.

Search for each of the following property names and change its value to match:

<property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>root</value>
</property>
<property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>abc123456</value>
</property>
<property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://172.21.0.5:3306/hive?createDatabaseIfNotExist=true&amp;useSSL=false</value>
</property>
<property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
</property>
<property>
    <name>hive.metastore.schema.verification</name>
    <value>false</value>
</property>

Upload the MySQL JDBC driver

Download it inside the Master container's shell:

wget https://repo1.maven.org/maven2/mysql/mysql-connector-java/5.1.49/mysql-connector-java-5.1.49.jar
# Copy it into Hive's lib directory
cp mysql-connector-java-5.1.49.jar /usr/local/apache-hive-3.1.2-bin/lib

Fix conflicting jars

# Only one SLF4J binding may be present between Hadoop and Hive; delete the one bundled with Hive
rm /usr/local/apache-hive-3.1.2-bin/lib/log4j-slf4j-impl-2.10.0.jar

# Both Hadoop and Hive ship guava; delete the lower version and copy the higher one over.
# Here Hive's copy is older, so delete it and copy Hadoop's
rm /usr/local/apache-hive-3.1.2-bin/lib/guava-19.0.jar
cp /usr/local/hadoop/share/hadoop/common/lib/guava-27.0-jre.jar /usr/local/apache-hive-3.1.2-bin/lib/
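
After this, only the 27.0 jar should be left in Hive's lib directory:

ls /usr/local/apache-hive-3.1.2-bin/lib/ | grep guava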

# Remove the invalid character around line 3225 of hive-site.xml
vim /usr/local/apache-hive-3.1.2-bin/conf/hive-site.xml

The offending character on that line is &#8;; delete it, then save:

<description>
      Ensures commands with OVERWRITE (such as INSERT OVERWRITE) acquire Exclusive locks for&#8;transactional tables.  This ensures that inserts (w/o overwrite) running concurrently
      are not hidden by the INSERT OVERWRITE.
</description>

5. Add Hive to the PATH

vim ~/.bashrc

Add the following at the end, then save and quit:

export PATH=$PATH:/usr/local/apache-hive-3.1.2-bin/bin

Run the following to make it take effect:

source ~/.bashrc

6. Initialize the metastore schema

schematool -initSchema -dbType mysql
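
On success, schematool should finish with a "schemaTool completed" message and the hive database in MySQL gets populated with the metastore tables; this can be confirmed from the host (optional):

docker exec mysql_hive mysql -uroot -pabc123456 -e 'use hive; show tables;'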

7. Run an example

# Create a data file
cd /usr/local
vim test.txt

# File contents:
1,jack
2,hel
3,nack

# Launch Hive
hive

hive> create table test(id int, name string) row format delimited fields terminated by ',';
OK
Time taken: 1.453 seconds
hive> load data local inpath '/usr/local/test.txt' into table test;
Loading data to table default.test
OK
Time taken: 0.63 seconds
hive> select * from test;
OK
1       jack
2       hel
3       nack
Time taken: 1.611 seconds, Fetched: 3 row(s)
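
Back in the bash shell (after quitting Hive), the loaded file should now sit under Hive's default warehouse directory in HDFS (assuming the default hive.metastore.warehouse.dir, since it was not changed), which can be checked with:

hadoop fs -ls /user/hive/warehouse/test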

With that, Hive is installed.

III. HiveServer2

1. Adjust Hadoop's proxy-user permissions

Enter the Master container and edit core-site.xml:

docker exec -it Master bash
vim /usr/local/hadoop/etc/hadoop/core-site.xml

Add the following properties inside <configuration>:

<property>
    <name>hadoop.proxyuser.root.hosts</name>
    <value>*</value>
</property>
<property>
    <name>hadoop.proxyuser.root.groups</name>
    <value>*</value>
</property>
Restart HDFS:
root@Master:/usr/local/hadoop/sbin# ./stop-dfs.sh
Stopping namenodes on [Master]
Stopping datanodes
Stopping secondary namenodes [Master]
root@Master:/usr/local/hadoop/sbin# ./start-dfs.sh
Starting namenodes on [Master]
Starting datanodes
Starting secondary namenodes [Master]

2. Start HiveServer2 in the background

root@Master:/usr/local/hadoop/sbin# nohup hiveserver2 >/dev/null 2>/dev/null &
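
HiveServer2 can take a minute or two to come up. If it does not, check the Hive log; with the default log4j settings it is written under the current user's temp directory (an assumption based on Hive's default hive.log.dir), for example:

tail -n 50 /tmp/root/hive.log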

3. Take HDFS out of safe mode

hdfs dfsadmin -safemode leave

4. Verification

Check that port 10000 is listening:

netstat -ntulp | grep 10000

Connect with Beeline:

beeline
!connect jdbc:hive2://localhost:10000/default

When prompted, enter the username and password set earlier: root / abc123456.

Try a query:

0: jdbc:hive2://localhost:10000/default> select * from test;
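
Alternatively, the connection parameters can be passed to Beeline in a single command:

beeline -u jdbc:hive2://localhost:10000/default -n root -p abc123456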

HiveServer2 is now configured.

IV. Connecting to Hive Locally