Hadoop (3.3.4) - HDFS Operations

Published 2024-01-07 19:40:43, author: OnePandas

Apache Hadoop 3.3.4 – Overview

01.appendToFile

hadoop fs -appendToFile localfile /user/hadoop/hadoopfile
hadoop fs -appendToFile localfile1 localfile2 /user/hadoop/hadoopfile
hadoop fs -appendToFile localfile hdfs://nn.example.com/hadoop/hadoopfile
hadoop fs -appendToFile - hdfs://nn.example.com/hadoop/hadoopfile   # reads the input from stdin
hdfs dfs -appendToFile /root/tmp/202302/02/1.txt hdfs://192.168.88.161:8020/tmp/test20230202/1.txt

02.cat

  • -ignoreCrc: skip checksum verification.
hadoop fs -cat hdfs://nn1.example.com/file1 hdfs://nn2.example.com/file2
hadoop fs -cat file:///file3 /user/hadoop/file4
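
To read a file even when its checksum does not verify, the -ignoreCrc flag described above can be added; a minimal sketch with an illustrative path:

hadoop fs -cat -ignoreCrc /user/hadoop/file4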

03.checksum

  • -v: display the block size of the file.
hadoop fs -checksum hdfs://nn1.example.com/file1
hadoop fs -checksum file:///etc/hosts
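
To also print the block size reported by -v, something like the following should work (reusing the path from the example above):

hadoop fs -checksum -v hdfs://nn1.example.com/file1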

04.chgrp

Change the group association of files. The user must be the owner of the file, or a super-user. Additional information is in the Permissions Guide.

  • -R: change the group association recursively through the directory structure.
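
A minimal sketch, assuming a group named hadoop exists and /user/hadoop/dir1 is an existing directory:

hadoop fs -chgrp -R hadoop /user/hadoop/dir1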

05.chmod

  • -R: change permissions recursively through the directory structure.
hdfs dfs -chmod -R 777 /tmp/tmp

06.chown

  • -R: change ownership recursively through the directory structure.
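
A minimal sketch, assuming a user and group named hadoop and an existing path (chown must be run by a super-user):

hadoop fs -chown -R hadoop:hadoop /tmp/tmp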

07.copyFromLocal

Upload files from the local filesystem to HDFS; same as -put.
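
A minimal sketch with illustrative paths; -f overwrites the destination if it already exists:

hadoop fs -copyFromLocal localfile /user/hadoop/hadoopfile
hadoop fs -copyFromLocal -f localfile /user/hadoop/hadoopfile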

08.copyToLocal

Download files from HDFS to the local filesystem; same as -get.
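
A minimal sketch with illustrative paths:

hadoop fs -copyToLocal /user/hadoop/hadoopfile ./hadoopfile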

09.count

Counts the number of directories, files, and bytes under the paths that match the specified file pattern, and reports quota and usage. The output columns with -count are: DIR_COUNT, FILE_COUNT, CONTENT_SIZE, PATHNAME.

  • -q: The -q and -u options control which columns appear in the output. -q shows quotas.
  • -u: limits the output to quotas and usage only.
  • -v: display a header line.
  • -x: exclude snapshots from the result calculation. Without -x (the default), the result is always calculated from all INodes, including all snapshots under the given path. -x is ignored if -u or -q is given.
  • -h: show sizes in a human-readable format (B, K, M, G).
  • -e: show the erasure coding policy.
  • -s: show the snapshot count for each directory.
hadoop fs -count hdfs://nn1.example.com/file1 hdfs://nn2.example.com/file2
hadoop fs -count -q hdfs://nn1.example.com/file1
hadoop fs -count -q -h hdfs://nn1.example.com/file1
hadoop fs -count -q -h -v hdfs://nn1.example.com/file1
hadoop fs -count -u hdfs://nn1.example.com/file1
hadoop fs -count -u -h hdfs://nn1.example.com/file1
hadoop fs -count -u -h -v hdfs://nn1.example.com/file1
hadoop fs -count -e hdfs://nn1.example.com/file1
hadoop fs -count -s hdfs://nn1.example.com/file1

10.test

Checks whether a file or directory exists in HDFS; a usage example follows the option list.

Options:

  • -d: return 0 if the path is a directory, otherwise 1.
  • -e: return 0 if the path exists, otherwise 1.
  • -f: return 0 if the path is a file, otherwise 1.
  • -s: return 0 if the file at the path is larger than zero bytes, otherwise 1.
  • -z: return 0 if the file at the path is zero bytes in size, otherwise 1.
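
A minimal sketch with illustrative paths; the result is returned as the shell exit code:

hadoop fs -test -e /user/hadoop/file1 && echo "exists"
hadoop fs -test -d /user/hadoop/dir1 && echo "is a directory"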

11.getmerge

# merge the files under an HDFS directory and download them as a single local file
hdfs dfs -getmerge hdfs://ip:port/tmp/tmp ./value.txt

12.expunge

Empties the trash.
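
The command takes no required arguments:

hadoop fs -expunge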

13.skipTrash

An option to -rm that deletes files immediately instead of moving them to the trash.
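
For example, to delete a directory permanently without going through the trash (path is illustrative):

hadoop fs -rm -r -skipTrash /tmp/tmp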

14.report

View the total HDFS capacity and usage.

hdfs dfsadmin -report

15.distcp

Options:

  • -append: reuse existing data in target files and append new data where possible, rather than overwriting.
  • -async: run DistCp asynchronously instead of blocking until the job completes.
  • -atomic: commit all changes or none.
  • -bandwidth: specify the bandwidth per map, in MB/second.
  • -delete: delete files that exist in the target but not in the source; deleted files go through the HDFS trash.
  • -diff: use a snapshot diff report to identify the differences between source and target.
  • -f: use a file containing the list of paths to copy.
  • -filelimit: (deprecated!) limit the number of files copied to <= n.
  • -filters: a file containing a list of paths to exclude from the copy.
  • -i: ignore failures during the copy.
  • -log: directory on HDFS where DistCp execution logs are saved.
  • -m: limit the number of maps launched; by default one map per file, with at most 20 maps per machine.
  • -mapredSslConf: SSL configuration file, used with hftps://.
  • -numListstatusThreads: number of threads used to build the file listing (at most 40); increase this when the directory structure is complex.
  • -overwrite: unconditionally overwrite target files, even if they already exist.
  • -p: preserve source file status (rbugpcaxt: replication, block size, user, group, permissions, checksum type, ACL, XATTR, timestamps).
  • -sizelimit: (deprecated!) limit the total bytes copied to <= n.
  • -skipcrccheck: skip CRC checks between source and target paths.
  • -strategy: choose the copy strategy; the default, uniformsize, balances the total size copied by each map; dynamic lets faster maps copy more files to improve performance.
  • -tmp: intermediate work path to be used for the atomic commit.
  • -update: overwrite the target file if its name and size differ from the source; skip it if the name and size are the same.
hadoop distcp -i  -p hdfs://192.168.40.100:8020/user/hive/warehouse/iot.db/dwd_pollution_distcp hdfs://192.168.40.200:8020/user/hive/warehouse/iot.db/

hadoop distcp -i -update -delete -p hdfs://192.168.40.100:8020/user/hive/warehouse/iot.db/dwd_pollution_distcp hdfs://192.168.40.200:8020/user/hive/warehouse/iot.db/dwd_pollution_distcp

16.find

Usage: hadoop fs -find <path> ... <expression> ...

Finds all files that match the specified expression and applies selected actions to them. If no path is specified then defaults to the current working directory. If no expression is specified then defaults to -print.

The following primary expressions are recognised:

  • -name pattern
    -iname pattern

    Evaluates as true if the basename of the file matches the pattern using standard file system globbing. If -iname is used then the match is case insensitive.

  • -print
    -print0

    Always evaluates to true. Causes the current pathname to be written to standard output. If the -print0 expression is used then an ASCII NULL character is appended.

The following operators are recognised:

  • expression -a expression

    expression -and expression

    expression expression

    Logical AND operator for joining two expressions. Returns true if both child expressions return true. Implied by the juxtaposition of two expressions and so does not need to be explicitly specified. The second expression will not be applied if the first fails.

Example:

hadoop fs -find / -name test -print

17.ls

Usage: hadoop fs -ls [-C] [-d] [-h] [-q] [-R] [-t] [-S] [-r] [-u] [-e] <args>

Options:

  • -C: Display the paths of files and directories only.
  • -d: Directories are listed as plain files.
  • -h: Format file sizes in a human-readable fashion (eg 64.0m instead of 67108864).
  • -q: Print ? instead of non-printable characters.
  • -R: Recursively list subdirectories encountered.
  • -t: Sort output by modification time (most recent first).
  • -S: Sort output by file size.
  • -r: Reverse the sort order.
  • -u: Use access time rather than modification time for display and sorting.
  • -e: Display the erasure coding policy of files and directories only.

For a file ls returns stat on the file with the following format:

permissions number_of_replicas userid groupid filesize modification_date modification_time filename

For a directory it returns list of its direct children as in Unix. A directory is listed as:

permissions userid groupid modification_date modification_time dirname

Files within a directory are ordered by filename by default.

Example:

hadoop fs -ls /user/hadoop/file1
hadoop fs -ls -e /ecdir

18.mkdir

Usage: hadoop fs -mkdir [-p] <paths>

Takes path uri’s as argument and creates directories.

Options:

  • The -p option behavior is much like Unix mkdir -p, creating parent directories along the path.

Example:

hadoop fs -mkdir /user/hadoop/dir1 /user/hadoop/dir2
hadoop fs -mkdir hdfs://nn1.example.com/user/hadoop/dir hdfs://nn2.example.com/user/hadoop/dir

19.mv

Usage: hadoop fs -mv URI [URI ...] <dest>

Moves files from source to destination. This command allows multiple sources as well in which case the destination needs to be a directory. Moving files across file systems is not permitted.

Example:

hadoop fs -mv /user/hadoop/file1 /user/hadoop/file2
hadoop fs -mv hdfs://nn.example.com/file1 hdfs://nn.example.com/file2 hdfs://nn.example.com/file3 hdfs://nn.example.com/dir1

20.put

Usage: hadoop fs -put [-f] [-p] [-l] [-d] [-t <thread count>] [-q <thread pool queue size>] [ - | <localsrc> ...] <dst>

Copy single src, or multiple srcs from local file system to the destination file system. Also reads input from stdin and writes to destination file system if the source is set to “-”

Copying fails if the file already exists, unless the -f flag is given.

Options:

  • -p : Preserves access and modification times, ownership and the permissions (assuming the permissions can be propagated across filesystems).
  • -f : Overwrites the destination if it already exists.
  • -l : Allow DataNode to lazily persist the file to disk. Forces a replication factor of 1. This flag will result in reduced durability. Use with care.
  • -d : Skip creation of temporary file with the suffix ._COPYING_.
  • -t <thread count> : Number of threads to be used, default is 1. Useful when uploading directories containing more than 1 file.
  • -q <thread pool queue size> : Thread pool queue size to be used, default is 1024. It takes effect only when the thread count is greater than 1.

Examples:

hadoop fs -put localfile /user/hadoop/hadoopfile
hadoop fs -put -f localfile1 localfile2 /user/hadoop/hadoopdir
hadoop fs -put -d localfile hdfs://nn.example.com/hadoop/hadoopfile
hadoop fs -put - hdfs://nn.example.com/hadoop/hadoopfile   # reads the input from stdin
hadoop fs -put -t 5 localdir hdfs://nn.example.com/hadoop/hadoopdir
hadoop fs -put -t 10 -q 2048 localdir1 localdir2 hdfs://nn.example.com/hadoop/hadoopdir

21.rm

Usage: hadoop fs -rm [-f] [-r |-R] [-skipTrash] [-safely] URI [URI ...]

Delete files specified as args.

If trash is enabled, file system instead moves the deleted file to a trash directory (given by FileSystem#getTrashRoot).

Currently, the trash feature is disabled by default. User can enable trash by setting a value greater than zero for parameter fs.trash.interval (in core-site.xml).

See expunge about deletion of files in trash.

Options:

  • The -f option will not display a diagnostic message or modify the exit status to reflect an error if the file does not exist.
  • The -R option deletes the directory and any content under it recursively.
  • The -r option is equivalent to -R.
  • The -skipTrash option will bypass trash, if enabled, and delete the specified file(s) immediately. This can be useful when it is necessary to delete files from an over-quota directory.
  • The -safely option will require safety confirmation before deleting a directory with a total number of files greater than hadoop.shell.delete.limit.num.files (in core-site.xml, default: 100). It can be used with -skipTrash to prevent accidental deletion of large directories. Delay is expected when walking over a large directory recursively to count the number of files to be deleted before the confirmation.

Example:

hadoop fs -rm hdfs://nn.example.com/file /user/hadoop/emptydir

22.rmdir

Usage: hadoop fs -rmdir [--ignore-fail-on-non-empty] URI [URI ...]

Delete a directory.

Options:

  • --ignore-fail-on-non-empty: When using wildcards, do not fail if a directory still contains files.

Example:

hadoop fs -rmdir /user/hadoop/emptydir

23.tail

Usage: hadoop fs -tail [-f] URI

Displays last kilobyte of the file to stdout.

Options:

  • The -f option will output appended data as the file grows, as in Unix.

Example:

hadoop fs -tail pathname

24.touch

Usage: hadoop fs -touch [-a] [-m] [-t TIMESTAMP] [-c] URI [URI ...]

Updates the access and modification times of the file specified by the URI to the current time. If the file does not exist, then a zero length file is created at URI with current time as the timestamp of that URI.

  • Use the -a option to change only the access time.
  • Use the -m option to change only the modification time.
  • Use the -t option to specify a timestamp (in the format yyyyMMdd:HHmmss) instead of the current time.
  • Use the -c option to not create the file if it does not exist.

The timestamp format is as follows:

  • yyyy: four digit year (e.g. 2018)
  • MM: two digit month of the year (e.g. 08 for the month of August)
  • dd: two digit day of the month (e.g. 01 for the first day of the month)
  • HH: two digit hour of the day using 24 hour notation (e.g. 23 stands for 11 pm, 11 stands for 11 am)
  • mm: two digit minutes of the hour
  • ss: two digit seconds of the minute

e.g. 20180809:230000 represents August 9th 2018, 11pm.

Example:

hadoop fs -touch pathname
hadoop fs -touch -m -t 20180809:230000 pathname
hadoop fs -touch -t 20180809:230000 pathname
hadoop fs -touch -a pathname

25.touchz

Usage: hadoop fs -touchz URI [URI ...]

Create a file of zero length. An error is returned if the file exists with non-zero length.

Example:

hadoop fs -touchz pathname

26.help

# view the help for the ls command
hadoop fs -help ls

27.Convert an fsimage file to XML

hdfs oiv -p <processor, e.g. XML> -i <fsimage file> -o <output path for the converted file>
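
A concrete sketch; the fsimage path below is hypothetical and depends on where dfs.namenode.name.dir points:

hdfs oiv -p XML -i /data/hadoop/dfs/name/current/fsimage_0000000000000000025 -o /tmp/fsimage.xml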

28.Convert an edits file to XML

hdfs oev -p <processor, e.g. xml> -i <edits file> -o <output path for the converted file>
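
A concrete sketch; the edits file path below is hypothetical and depends on where dfs.namenode.edits.dir points:

hdfs oev -p xml -i /data/hadoop/dfs/name/current/edits_0000000000000000001-0000000000000000012 -o /tmp/edits.xml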

29.Check which compression codecs are natively supported

hadoop checknative

Working with Object Stores

The Hadoop FileSystem shell works with Object Stores such as Amazon S3, Azure ABFS and Google GCS.

# Create a directory
hadoop fs -mkdir s3a://bucket/datasets/

# Upload a file from the cluster filesystem
hadoop fs -put /datasets/example.orc s3a://bucket/datasets/

# touch a file
hadoop fs -touchz wasb://yourcontainer@youraccount.blob.core.windows.net/touched

Unlike a normal filesystem, renaming files and directories in an object store usually takes time proportional to the size of the objects being manipulated. As many of the filesystem shell operations use renaming as the final stage in operations, skipping that stage can avoid long delays.

In particular, the put and copyFromLocal commands should both have the -d option set for a direct upload.

# Upload a file from the cluster filesystem
hadoop fs -put -d /datasets/example.orc s3a://bucket/datasets/

# Upload a file from under the user's home directory in the local filesystem.
# Note it is the shell expanding the "~", not the hadoop fs command
hadoop fs -copyFromLocal -d -f ~/datasets/devices.orc s3a://bucket/datasets/

# create a file from stdin
# the special "-" source means "use stdin"
echo "hello" | hadoop fs -put -d -f - wasb://yourcontainer@youraccount.blob.core.windows.net/hello.txt

Objects can be downloaded and viewed:

# copy a directory to the local filesystem
hadoop fs -copyToLocal s3a://bucket/datasets/

# copy a file from the object store to the cluster filesystem.
hadoop fs -get wasb://yourcontainer@youraccount.blob.core.windows.net/hello.txt /examples

# print the object
hadoop fs -cat wasb://yourcontainer@youraccount.blob.core.windows.net/hello.txt

# print the object, unzipping it if necessary
hadoop fs -text wasb://yourcontainer@youraccount.blob.core.windows.net/hello.txt

## download log files into a local file
hadoop fs -getmerge wasb://yourcontainer@youraccount.blob.core.windows.net/logs\* log.txt

Commands which list many files tend to be significantly slower than when working with HDFS or other filesystems.

hadoop fs -count s3a://bucket/
hadoop fs -du s3a://bucket/

Other slow commands include find, mv, cp and rm.

Find

This can be very slow on a large store with many directories under the path supplied.

# enumerate all files in the object store's container.
hadoop fs -find s3a://bucket/ -print

# remember to escape the wildcards to stop the shell trying to expand them first
hadoop fs -find s3a://bucket/datasets/ -name \*.txt -print

Rename

The time to rename a file depends on its size.

The time to rename a directory depends on the number and size of all files beneath that directory.

hadoop fs -mv s3a://bucket/datasets s3a://bucket/historical

If the operation is interrupted, the object store will be in an undefined state.

Copy

hadoop fs -cp s3a://bucket/datasets s3a://bucket/historical

The copy operation reads each file and then writes it back to the object store; the time to complete depends on the amount of data to copy, and the bandwidth in both directions between the local computer and the object store.

The further the computer is from the object store, the longer the copy takes.

Deleting objects

The rm command will delete objects and directories full of objects. If the object store is eventually consistent, fs ls commands and other accessors may briefly return the details of the now-deleted objects; this is an artifact of object stores which cannot be avoided.

If the filesystem client is configured to copy files to a trash directory, this will be in the bucket; the rm operation will then take time proportional to the size of the data. Furthermore, the deleted files will continue to incur storage costs.

To avoid this, use the -skipTrash option.

hadoop fs -rm -skipTrash s3a://bucket/dataset

Data moved to the .Trash directory can be purged using the expunge command. As this command only works with the default filesystem, it must be configured to make the default filesystem the target object store.

hadoop fs -expunge -D fs.defaultFS=s3a://bucket/