Hadoop Development Case Study - 526互联

This case study uses the Momo (陌陌) chat dataset to build a visual data analysis.

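Data preparation: there are two TSV files with about 140,000 records in total. Each record has around 20 fields, including the sender, the receiver's account, gender, and GPS coordinates, separated by tab characters. Some records contain null or otherwise messy values and need to be filtered out. Timestamps are stored as year-month-day hour:minute:second and need to be truncated with substr(). GPS coordinates are split with the split function to get the individual longitude and latitude values.
Setup: first use DataGrip to attach a Windows directory, which is where the SQL files are kept, then connect to the Hive data source. When connecting, resolve the driver first (use the Hive 3 driver), then fill in the basic connection settings. Once connected, the data can be queried normally.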
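As a small sketch of the two string operations named above, the query below applies substr() and split() to literal values. The sample timestamp and coordinate string are assumed formats chosen for illustration, not rows taken from the dataset, and the "longitude,latitude" order is also an assumption.

```sql
-- Sample literals are assumptions about the data format, not real rows.
select
    substr('2021-11-01 07:14:48', 1, 10)   as msg_day,     -- characters 1-10: the date part
    substr('2021-11-01 07:14:48', 12, 2)   as msg_hour,    -- characters 12-13: the hour part
    split('123.257181,48.807394', ',')[0]  as sender_lng,  -- first element (assumed longitude)
    split('123.257181,48.807394', ',')[1]  as sender_lat;  -- second element (assumed latitude)
```

Note that substr() in Hive is 1-indexed, while the array returned by split() is 0-indexed.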

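Uploading the data: first create a base table in the database. This table is the combined table that holds all the raw data, with \t as the field delimiter. After creating the table, load the data from the local server into it. "Local" here means a directory on the virtual machine. The commands are:
-- Create the database if it does not exist yet, then create the table
create database if not exists db_msg;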
create table db_msg.tb_msg_source(
  msg_time             string comment "message send time"
  , sender_name        string comment "sender nickname"
  , sender_account     string comment "sender account"
  , sender_sex         string comment "sender gender"
  , sender_ip          string comment "sender IP address"
  , sender_os          string comment "sender operating system"
  , sender_phonetype   string comment "sender phone model"
  , sender_network     string comment "sender network type"
  , sender_gps         string comment "sender GPS location"
  , receiver_name      string comment "receiver nickname"
  , receiver_ip        string comment "receiver IP address"
  , receiver_account   string comment "receiver account"
  , receiver_os        string comment "receiver operating system"
  , receiver_phonetype string comment "receiver phone model"
  , receiver_network   string comment "receiver network type"
  , receiver_gps       string comment "receiver GPS location"
  , receiver_sex       string comment "receiver gender"
  , msg_type           string comment "message type"
  , distance           string comment "distance between the two parties"
  , message            string comment "message content"
)
-- Use the tab character as the field delimiter
row format delimited fields terminated by '\t';

-- Load the data into the table

load data local inpath '/root/hivedata/data1.tsv' into table db_msg.tb_msg_source;

load data local inpath '/root/hivedata/data2.tsv' into table db_msg.tb_msg_source;

-- Query the table to verify that the data files were mapped correctly

select * from db_msg.tb_msg_source limit 10;
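
Beyond eyeballing ten rows, a row count gives a quick check that both files were loaded. This is only a sketch: the text above puts the total at roughly 140,000 records, and the choice of sender_gps as the column to probe for blanks is an assumption for illustration.

```sql
-- Total rows loaded from both TSV files (the text above cites about 140,000).
select count(*) as total_rows
from db_msg.tb_msg_source;

-- Rows with a missing or empty sender_gps (the column choice is an assumption).
select count(*) as blank_gps_rows
from db_msg.tb_msg_source
where sender_gps is null or length(sender_gps) = 0;
```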

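With the preparation done, data cleaning can begin. This may involve filtering out records with empty fields and truncating or splitting certain fields. Once all the data has been cleaned, it is saved into a new table.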
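
A minimal sketch of such a cleaning step, written as a create-table-as-select, might look like the following. The target table name tb_msg_etl, the derived column names, the filter on sender_gps, and the 'yyyy-MM-dd HH:mm:ss' and "longitude,latitude" formats are all assumptions for illustration. The source does not say which fields contain the null values.

```sql
-- Hypothetical cleaned table: tb_msg_etl and the derived columns are illustrative names.
create table db_msg.tb_msg_etl as
select
    *,
    substr(msg_time, 1, 10)     as msg_day,     -- date part, assuming 'yyyy-MM-dd HH:mm:ss'
    substr(msg_time, 12, 2)     as msg_hour,    -- hour part under the same assumption
    split(sender_gps, ',')[0]   as sender_lng,  -- assumed order: longitude,latitude
    split(sender_gps, ',')[1]   as sender_lat
from db_msg.tb_msg_source
-- Drop rows whose GPS field is missing or empty (assumed to be the dirty field).
where sender_gps is not null
  and length(sender_gps) > 0;
```

Writing the result to a separate table keeps the raw source table untouched, so the cleaning rules can be adjusted and rerun without reloading the TSV files.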
After that, the analysis SQL can be written.
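
As an illustration of what such analysis SQL could look like, the queries below aggregate message volume by day and by hour. They assume a cleaned table, hypothetically named db_msg.tb_msg_etl, that has msg_day and msg_hour columns derived from msg_time. Neither the table name nor these metrics come from the source.

```sql
-- Total messages per day (table and column names are hypothetical).
select msg_day, count(*) as total_msg_cnt
from db_msg.tb_msg_etl
group by msg_day;

-- Messages and distinct senders per hour of the day.
select msg_hour,
       count(*)                       as total_msg_cnt,
       count(distinct sender_account) as sender_user_cnt
from db_msg.tb_msg_etl
group by msg_hour;
```

Results like these are small enough to save as result tables and chart in a visualization tool.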
