After the blow of the resource downgrade, I finally faced reality and settled in to deal with the cluster in front of me: seemingly well planned, but in practice unreasonable at every turn.
On the Friday of my first week, just a few days ago, a machine in our intelligent-data department went down because of a disk failure (according to ops, a disk died and the machine could no longer boot). At that point roughly 85T of our 100-odd T of disk space was already written, so storage was genuinely tight. When I noticed the node was down, my first reaction was to worry that HDFS's failure-recovery mechanism would fill up the remaining disks. A colleague did suggest temporarily dropping the replication factor to 2 to dodge the problem, but everyone was busy with other things and by the end of the day nobody had actually done it (in my defense, as a new hire I simply did not have the permissions to handle it myself).
Sure enough, by Saturday night the cluster had gone on strike: HDFS failure recovery had filled every disk past 90%, and once a node crosses the resource threshold it stops offering itself to YARN, which locked up the whole cluster's compute resources and halted all scheduling.
It was past eleven that night when I was called in to fix it. Anyone who has dealt with this knows there are really only two options: add capacity or delete data, and deleting data is both simpler and faster to execute than an expansion. The catch is that every table in the company's Hive warehouse is an external table, so dropping tables inside Hive frees nothing. I ended up clumsily hunting for large, deletable files with one-off commands (for example: hadoop dfs -du /path | awk '{s+=$1}END{print s/1024/1024/1024,"G"}'), which cost a painful amount of time and effort. As a lesson learned, I then wrote this [path information script] so that my colleagues and I can quickly see how HDFS files are distributed and analyze problems from there.
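In hindsight, that ad-hoc hunt can be shortened to a small pipeline. Here is a minimal sketch, assuming the hadoop/hdfs client is on PATH; `/warehouse` is a stand-in path and `to_gb` is a hypothetical helper, not part of the script below. It only reformats `-du` output read from stdin, so it works without touching the cluster:

```shell
# to_gb: turn `hadoop dfs -du <path>` output (bytes ... path) into a
# descending top-N listing in gigabytes. Reading from stdin keeps it
# decoupled from the cluster; pipe the du output into it.
to_gb() {
  sort -nr | head -n "${1:-10}" \
    | awk '{printf "%.2f G\t%s\n", $1/1024/1024/1024, $NF}'
}

# Usage (hypothetical path):
#   hadoop dfs -du /warehouse | to_gb 10
```

Using `$NF` for the path makes it tolerant of both two-column and three-column `-du` output formats.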
#!/bin/bash
# Resolve the script's working directory: $(dirname $0) is the directory containing the running script
workdir=$(cd "$(dirname "$0")"; pwd)
date=$(date +%Y-%m-%d-%H:%M:%S)
init(){
# remove old output first so each run produces fresh results
rm -rf $workdir/hdfs_detail.txt
touch $workdir/hdfs_detail.txt
chmod 777 $workdir/hdfs_detail.txt
rm -rf $workdir/path.txt
touch $workdir/path.txt
chmod 777 $workdir/path.txt
echo "========================================================================================================" >> $workdir/hdfs_detail.txt
echo " ___ _ __ _ __ __ " >> $workdir/hdfs_detail.txt
echo " | | /\\ \"--------| | | / / | |------\" \"--------\" \\ \\ / / " >> $workdir/hdfs_detail.txt
echo " | | / \\ / \"-------\" | | / / | |----\" | | \"----\" | \\ \\ / / " >> $workdir/hdfs_detail.txt
echo " | | / /\\ \\ | | | | / / | | | | | | | | \\ \\ / / " >> $workdir/hdfs_detail.txt
echo " | | / / \\ \\ | | | | / / | |----/ / | | | | \\ ^ / " >> $workdir/hdfs_detail.txt
echo " | | / / \\ \\ | | | || | | |-----\" | | | | | | " >> $workdir/hdfs_detail.txt
echo " | | / /------\\ \\ | | | | \\ \\ | | \\ \\ | | | | | | " >> $workdir/hdfs_detail.txt
echo " __ | | / /--------\\ \\ | | | | \\ \\ | | \\ \\ | | | | | | " >> $workdir/hdfs_detail.txt
echo " \\ \\___/ / / / \\ \\ \\ \"-------\" | | \\ \\ | | \\ \\ | \"----\" | | | TM。" >> $workdir/hdfs_detail.txt
echo " \\_____/ /_/ \\_\\ \"--------| |_| \\_\\ |_| \\_\\ \"--------\" |_| 毛利老弟 " >> $workdir/hdfs_detail.txt
echo "========================================================================================================" >> $workdir/hdfs_detail.txt
echo "-----------------------------------------[HDFS Detail Report]------------------------------------------" >> $workdir/hdfs_detail.txt
echo "[Init Time]:$date" >> $workdir/hdfs_detail.txt
echo "--" >> $workdir/hdfs_detail.txt
echo "--" >> $workdir/hdfs_detail.txt
# note: sed -i '/1111/a\2222' a.txt appends 2222 after every line in a.txt that matches 1111
}
hdfs_collect(){
echo " ----[ Summary ]---- " >> $workdir/hdfs_detail.txt
echo "" >> $workdir/hdfs_detail.txt
echo "| Size | Consumed | Path |" >> $workdir/hdfs_detail.txt
hadoop dfs -ls / | awk '{print $8}' >> $workdir/path.txt
hadoop dfs -du / | awk '{S+=$1}{M+=$2}END{printf "%-12s%-6s%-12s%-6s%-10s\n", S/1024/1024/1024/1024,"(T)",M/1024/1024/1024/1024,"(T)","root"}' >> $workdir/hdfs_detail.txt
hadoop dfs -du / | awk '{printf "%-12s%-6s%-12s%-6s%-10s\n", $1/1024/1024/1024,"(G)",$2/1024/1024/1024,"(G)",$3}' >> $workdir/hdfs_detail.txt
echo "" >> $workdir/hdfs_detail.txt
echo "" >> $workdir/hdfs_detail.txt
}
hdfs_detail(){
echo " ----[ Details ]---- " >> $workdir/hdfs_detail.txt
echo "" >> $workdir/hdfs_detail.txt
# first-level directories
cat $workdir/path.txt | while read -r line
do
# skip empty lines and a few cron bookkeeping directories
if [ -n "$line" ] && [ "$line" != "/auto_cron_flag" ] && [ "$line" != "/auto_cron_logs" ] && [ "$line" != "/auto_cron_script" ]; then
# size of each directory directly under the root
hadoop dfs -du "$line" | awk -v p="$line" '{S+=$1}{M+=$2}END{printf "%-0s%-12s%-6s%-12s%-6s%-10s\n","-- ", S/1024/1024/1024,"(G)",M/1024/1024/1024,"(G)",p}' >> $workdir/hdfs_detail.txt
rm -rf $workdir/path1.txt
touch $workdir/path1.txt
chmod 777 $workdir/path1.txt
hadoop fs -ls "$line" | awk '{print $8}' >> $workdir/path1.txt
# second-level directories
cat $workdir/path1.txt | while read -r line1
do
# skip empty lines
if [ -n "$line1" ]; then
hadoop dfs -du "$line1" | awk -v p="$line1" '{S+=$1}{M+=$2}END{printf "%-0s%-12s%-6s%-12s%-6s%-10s\n"," -- ", S/1024/1024/1024,"(G)",M/1024/1024/1024,"(G)",p}' >> $workdir/hdfs_detail.txt
rm -rf $workdir/path2.txt
touch $workdir/path2.txt
chmod 777 $workdir/path2.txt
hadoop fs -ls "$line1" | awk '{print $8}' >> $workdir/path2.txt
# third-level directories
cat $workdir/path2.txt | while read -r line2
do
# skip empty lines
if [ -n "$line2" ]; then
hadoop dfs -du "$line2" | awk -v p="$line2" '{S+=$1}{M+=$2}END{printf "%-0s%-12s%-6s%-12s%-6s%-10s\n"," -- ", S/1024/1024/1024,"(G)",M/1024/1024/1024,"(G)",p}' >> $workdir/hdfs_detail.txt
rm -rf $workdir/path3.txt
touch $workdir/path3.txt
chmod 777 $workdir/path3.txt
hadoop fs -ls "$line2" | awk '{print $8}' >> $workdir/path3.txt
# fourth-level directories
cat $workdir/path3.txt | while read -r line3
do
# skip empty lines
if [ -n "$line3" ]; then
hadoop dfs -du "$line3" | awk -v p="$line3" '{S+=$1}{M+=$2}END{printf "%-0s%-12s%-6s%-12s%-6s%-10s\n"," -- ", S/1024/1024/1024,"(G)",M/1024/1024/1024,"(G)",p}' >> $workdir/hdfs_detail.txt
fi
done
fi
done
fi
done
echo "" >> $workdir/hdfs_detail.txt
fi
done
rm -rf $workdir/path.txt
rm -rf $workdir/path1.txt
rm -rf $workdir/path2.txt
rm -rf $workdir/path3.txt
}
init
hdfs_collect
hdfs_detail
echo "SUCCESS"
How to run
Copy the source code above into hdfs_detail.sh, then execute:
sh hdfs_detail.sh
Result
Once it finishes (a successful run prints SUCCESS; note that interrupting with Ctrl+C midway also ends up printing SUCCESS), a file named hdfs_detail.txt should appear in the same directory.
The generated output looks like:
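If the stray SUCCESS on Ctrl+C bothers you, a `trap` on SIGINT can abort the run before the final echo is ever reached. A minimal sketch, where `run_all` is a placeholder standing in for the init/hdfs_collect/hdfs_detail calls:

```shell
#!/bin/bash
# Abort on Ctrl+C instead of falling through to the final "SUCCESS".
# 130 is the conventional exit code for death by SIGINT.
trap 'echo "INTERRUPTED" >&2; exit 130' INT

run_all() {
  # placeholder for: init; hdfs_collect; hdfs_detail
  :
}

run_all
echo "SUCCESS"
```

With the trap in place, an interrupted run prints INTERRUPTED to stderr and exits nonzero, so downstream tooling can tell the two outcomes apart.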
========================================================================================================
___ _ __ _ __ __
| | /\ "--------| | | / / | |------" "--------" \ \ / /
| | / \ / "-------" | | / / | |----" | | "----" | \ \ / /
| | / /\ \ | | | | / / | | | | | | | | \ \ / /
| | / / \ \ | | | | / / | |----/ / | | | | \ ^ /
| | / / \ \ | | | || | | |-----" | | | | | |
| | / /------\ \ | | | | \ \ | | \ \ | | | | | |
__ | | / /--------\ \ | | | | \ \ | | \ \ | | | | | |
\ \___/ / / / \ \ \ "-------" | | \ \ | | \ \ | "----" | | | TM。
\_____/ /_/ \_\ "--------| |_| \_\ |_| \_\ "--------" |_| 毛利老弟
========================================================================================================
-----------------------------------------[HDFS Detail Report]------------------------------------------
[Init Time]:2019-11-06-16:10:25
--
--
----[ Summary ]----
| Size | Consumed | Path |
22.5673 (T) 67.8291 (T) root
0 (G) 0 (G) /*******
0.000814012 (G) 0.00244204 (G) /*******
13.9856 (G) 41.9567 (G) /*******
7.47824 (G) 22.4347 (G) /*******
114.452 (G) 343.355 (G) /*******
0 (G) 0 (G) /*******
0 (G) 0 (G) /*******
20357.2 (G) 61137.6 (G) /*******
0.898082 (G) 3.06924 (G) /*******
0 (G) 0 (G) /*******
0.851672 (G) 2.55501 (G) /*******
2614.86 (G) 7907.95 (G) /*******
1.67638e-08 (G) 0.375 (G) /*******
----[ Details ]----
-- 0 (G) 0 (G) /*******
-- 114.452 (G) 343.355 (G) /*******
-- 114.452 (G) 343.355 (G) /*******/Docker
-- 79.268 (G) 237.804 (G) /*******/Docker/serv172_20_23_22
-- 0.0769568 (G) 0.23087 (G) /*******/Docker/serv172_20_23_22/docker_allinone_serv172_20_23_22.tar.gz.20190228201901
-- 26.5116 (G) 79.5347 (G) /*******/Docker/serv172_20_23_22/docker_allinone_serv172_20_23_22.tar.gz.20191104093001
-- 26.3217 (G) 78.965 (G) /*******/Docker/serv172_20_23_22/docker_allinone_serv172_20_23_22.tar.gz.20191105093001
-- 26.3578 (G) 79.0735 (G) /*******/Docker/serv172_20_23_22/docker_allinone_serv172_20_23_22.tar.gz.20191106093001
-- 35.1836 (G) 105.551 (G) /*******/Docker/serv172_20_2_24
-- 11.2747 (G) 33.824 (G) /*******/Docker/serv172_20_2_24/docker_allinone_serv172_20_2_24.tar.gz.20191104093001
-- 11.2812 (G) 33.8437 (G) /*******/Docker/serv172_20_2_24/docker_allinone_serv172_20_2_24.tar.gz.20191105093001
-- 11.7929 (G) 35.3786 (G) /*******/Docker/serv172_20_2_24/docker_allinone_serv172_20_2_24.tar.gz.20191106093001
-- 0.0625 (G) 0.1875 (G) /*******/Docker/serv172_20_2_24/docker_tomcat.tar.gz.20190228201901
-- 0.257463 (G) 0.772389 (G) /*******/Docker/serv172_20_2_24/docker_tomcat.tar.gz.20191104093001
-- 0.257463 (G) 0.772389 (G) /*******/Docker/serv172_20_2_24/docker_tomcat.tar.gz.20191105093001
-- 0.257463 (G) 0.772389 (G) /*******/Docker/serv172_20_2_24/docker_tomcat.tar.gz.20191106093001
-- 0 (G) 0 (G) /*******
-- 0 (G) 0 (G) /*******
-- 0 (G) 0 (G) /*******/cmtest
Notes
The script only drills four levels down from the root (deep enough to see partition-level detail for partitioned tables); I stopped there. If you need more, extend it a few more levels following the same pattern, just keep an eye on performance. If you run into problems along the way, feel free to leave a comment.
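For anyone extending it, the four copy-pasted loop levels can be collapsed into one depth-limited recursive function. A sketch under the same assumptions as the script, except that it uses `hdfs dfs` (the modern spelling of `hadoop dfs`) and `-du -s` to summarize a path in one line; the depth and starting path become parameters instead of nesting levels:

```shell
#!/bin/bash
# walk PATH DEPTH INDENT
# Prints "size (G)  consumed (G)  path" for PATH, then recurses into its
# children until DEPTH reaches 1. `local` keeps each recursion level's
# variables separate.
walk() {
  local path=$1 depth=$2 indent=$3
  hdfs dfs -du -s "$path" 2>/dev/null \
    | awk -v p="$path" -v i="$indent" \
        '{printf "%s%-12.3f(G) %-12.3f(G) %s\n", i, $1/1024/1024/1024, $2/1024/1024/1024, p}'
  [ "$depth" -le 1 ] && return
  hdfs dfs -ls "$path" 2>/dev/null | awk '{print $8}' | while read -r child; do
    [ -n "$child" ] && walk "$child" $((depth - 1)) "  $indent"
  done
}

# Usage: walk / 4 "-- "   # four levels deep from the root, like the script
```

One call per directory still means one round trip to the NameNode per directory, so the performance caveat above applies just as much here; the recursion only removes the copy-paste, not the cost.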