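The Linux Text-Processing Trio: grep

Published: 2023-08-20 17:23:17  Author: 百里骑

Most of you have probably used grep at one time or another.

But many people have only ever used it for basic string matching and have never tried anything slightly more advanced.

The goal here is not to master obscure options. It is to see what else grep can do, and which features I have been overlooking, so that I can work more efficiently.

Take me, for example. For a long time, the only form I used was: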

grep -Enr 'xxx' file

If text search is a regular part of your work, it is well worth learning more of what grep offers.

Command format

Syntax

grep [OPTION]... PATTERNS [FILE]...

Options

Pattern selection and interpretation:
  -E, --extended-regexp     PATTERNS are extended regular expressions
  -F, --fixed-strings       PATTERNS are strings
  -G, --basic-regexp        PATTERNS are basic regular expressions
  -P, --perl-regexp         PATTERNS are Perl regular expressions
  -e, --regexp=PATTERNS     use PATTERNS for matching
  -f, --file=FILE           take PATTERNS from FILE
  -i, --ignore-case         ignore case distinctions in patterns and data
      --no-ignore-case      do not ignore case distinctions (default)
  -w, --word-regexp         match only whole words
  -x, --line-regexp         match only whole lines
  -z, --null-data           a data line ends in 0 byte, not newline

Miscellaneous:
  -s, --no-messages         suppress error messages
  -v, --invert-match        select non-matching lines
  -V, --version             display version information and exit
      --help                display this help text and exit

Output control:
  -m, --max-count=NUM       stop after NUM selected lines
  -b, --byte-offset         print the byte offset with output lines
  -n, --line-number         print line number with output lines
      --line-buffered       flush output on every line
  -H, --with-filename       print file name with output lines
  -h, --no-filename         suppress the file name prefix on output
      --label=LABEL         use LABEL as the standard input file name prefix
  -o, --only-matching       show only nonempty parts of lines that match
  -q, --quiet, --silent     suppress all normal output
      --binary-files=TYPE   assume that binary files are TYPE;
                            TYPE is 'binary', 'text', or 'without-match'
  -a, --text                equivalent to --binary-files=text
  -I                        equivalent to --binary-files=without-match
  -d, --directories=ACTION  how to handle directories;
                            ACTION is 'read', 'recurse', or 'skip'
  -D, --devices=ACTION      how to handle devices, FIFOs and sockets;
                            ACTION is 'read' or 'skip'
  -r, --recursive           like --directories=recurse
  -R, --dereference-recursive  likewise, but follow all symlinks
      --include=GLOB        search only files that match GLOB (a file pattern)
      --exclude=GLOB        skip files that match GLOB
      --exclude-from=FILE   skip files that match any file pattern from FILE
      --exclude-dir=GLOB    skip directories that match GLOB
  -L, --files-without-match  print only names of FILEs with no selected lines
  -l, --files-with-matches  print only names of FILEs with selected lines
  -c, --count               print only a count of selected lines per FILE
  -T, --initial-tab         make tabs line up (if needed)
  -Z, --null                print 0 byte after FILE name

Context control:
  -B, --before-context=NUM  print NUM lines of leading context
  -A, --after-context=NUM   print NUM lines of trailing context
  -C, --context=NUM         print NUM lines of output context
  -NUM                      same as --context=NUM
      --color[=WHEN],
      --colour[=WHEN]       use markers to highlight the matching strings;
                            WHEN is 'always', 'never', or 'auto'
  -U, --binary              do not strip CR characters at EOL (MSDOS/Windows)

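For the differences between the regular expression modes (-E, -F, -G, -P), see this Zhihu article:

https://zhuanlan.zhihu.com/p/435815082

The modes differ in syntax: -G (the default) and -E differ mainly in which metacharacters need a backslash, -F treats the pattern as a literal string, and -P uses Perl-style syntax. I personally use -E (extended regular expressions).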
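
To make the difference between the modes concrete, here is a small sketch. The two sample lines are made up for illustration, and the behaviour shown for -G is what I would expect from GNU grep, where an unescaped + in a basic regular expression is an ordinary character.

```shell
# Two sample lines: one contains a literal "+", the other does not.
sample=$(printf 'a+b\naab\n')

# -F: the pattern is a plain string, so "+" is just a character.
fixed=$(printf '%s\n' "$sample" | grep -F 'a+b')

# -E: "+" means "one or more of the preceding item".
extended=$(printf '%s\n' "$sample" | grep -E 'a+b')

# -G (the default): an unescaped "+" is a literal character.
basic=$(printf '%s\n' "$sample" | grep -G 'a+b')

echo "fixed:    $fixed"
echo "extended: $extended"
echo "basic:    $basic"
```

With -F and -G the pattern should select the line a+b, while with -E it should select aab instead.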

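Commonly used options

  -E, --extended-regexp     interpret the pattern as an extended regular expression
  -i, --ignore-case         ignore case
  -w, --word-regexp         match whole words only
  -v, --invert-match        invert the match (print the lines that do not match)
  -n, --line-number         prefix each output line with its line number
  -r, --recursive           search recursively (when the target is a directory)
  -c, --count               print only the number of matching lines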
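
As a rough illustration of how -r, -i, -v and -c combine, the sketch below builds a small directory tree in a scratch location. The file names and their contents are hypothetical; -l and --include come from the option list above.

```shell
# Build a small hypothetical directory tree in a scratch location.
workdir=$(mktemp -d)
mkdir -p "$workdir/sub"
printf 'ERROR: disk full\ninfo: all good\n' > "$workdir/app.log"
printf 'Error: timeout\n' > "$workdir/sub/db.log"
printf 'error mentioned in notes\n' > "$workdir/notes.txt"

# -r recurses, -i ignores case, -l lists file names only;
# --include limits the search to files whose names match the glob.
log_files=$(cd "$workdir" && grep -ril --include='*.log' 'error' . | sort)
echo "$log_files"

# -v inverts the match and -c counts: lines in app.log WITHOUT "error".
clean_lines=$(grep -vic 'error' "$workdir/app.log")
echo "$clean_lines"
```

The first command should list only the two .log files and skip notes.txt. The second should print 1, since only the info line in app.log does not mention an error.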

Examples

Create a test file named grep.txt with the following content:

$ cat grep.txt
Today is a good day, a sunny day, a wonderful day, a important day.
I am a boy, a good boy, a lovely boy.
I like reading, sports, and coding.
Enjoy coding.
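
If you want to follow along, one way to create the file is a here-document. This is only a sketch: it writes grep.txt into whatever directory you are currently in.

```shell
# Write the four sample lines to grep.txt in the current directory.
# The quoted 'EOF' keeps the shell from expanding anything in the text.
cat > grep.txt <<'EOF'
Today is a good day, a sunny day, a wonderful day, a important day.
I am a boy, a good boy, a lovely boy.
I like reading, sports, and coding.
Enjoy coding.
EOF

# Sanity check: the file should have four lines.
line_count=$(wc -l < grep.txt)
echo "$line_count"
```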

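Basic search

Find every line that contains the letter 'a'.

-E: extended regular expressions

-n: print line numbers

Note that line 3 matches too, because of the a inside "reading" and "and".

$ grep -En "a" grep.txt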
1:Today is a good day, a sunny day, a wonderful day, a important day.
2:I am a boy, a good boy, a lovely boy.
3:I like reading, sports, and coding.

Now restrict the search to whole-word matches

by adding the -w option.

$ grep -Ewn "a" grep.txt
1:Today is a good day, a sunny day, a wonderful day, a important day.
2:I am a boy, a good boy, a lovely boy.

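Print only the number of matching lines.

(Note that this is the number of lines containing the pattern, not the number of times a occurs. With -c, the -n option has no effect.)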

$ grep -Ewnc "a" grep.txt
2
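
If what you actually want is the number of occurrences rather than the number of lines, one common approach is to combine -o with wc -l. The sketch below recreates the sample file in a scratch directory (the workdir variable is just for illustration) so that it runs on its own.

```shell
# Recreate the sample file in a scratch directory.
workdir=$(mktemp -d)
cat > "$workdir/grep.txt" <<'EOF'
Today is a good day, a sunny day, a wonderful day, a important day.
I am a boy, a good boy, a lovely boy.
I like reading, sports, and coding.
Enjoy coding.
EOF

# -c counts the lines that contain a match.
matching_lines=$(grep -Ewc "a" "$workdir/grep.txt")

# -o prints every match on its own line; wc -l then counts the matches.
occurrences=$(grep -Ewo "a" "$workdir/grep.txt" | wc -l)

echo "lines: $matching_lines"
echo "occurrences: $occurrences"
```

For this file it should report 2 matching lines but 7 occurrences of the word a.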

Advanced search

Matching multiple patterns

That is, matching more than one string or pattern at once.

Match lines that contain either a or boy:

$ grep -Ewn "a|boy" grep.txt
1:Today is a good day, a sunny day, a wonderful day, a important day.
2:I am a boy, a good boy, a lovely boy.

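What about lines that contain both a and boy?

. matches any single character;
* matches zero or more repetitions of the preceding item;

So a.*boy matches an a followed, after any number of characters, by boy. Note that this only finds lines where a comes before boy, and that -w applies to the match as a whole, not to a and boy separately.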

$ grep -Ewn "a.*boy" grep.txt
2:I am a boy, a good boy, a lovely boy.
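
Because a.*boy is order-sensitive, a line where boy comes first would be missed. One possible workaround, sketched here with a made-up three-line sample, is to chain two grep commands so that each word is checked separately.

```shell
# Hypothetical sample: "a" and "boy" appear in both orders.
workdir=$(mktemp -d)
cat > "$workdir/order.txt" <<'EOF'
I am a boy.
The boy has a dog.
Enjoy coding.
EOF

# A single pattern only finds "a" followed later by "boy".
ordered=$(grep -Ewn "a.*boy" "$workdir/order.txt")
echo "$ordered"

# Chain two greps: the first keeps (and numbers) lines with the word "a",
# the second keeps only those that also contain the word "boy".
both=$(grep -wn "a" "$workdir/order.txt" | grep -w "boy")
echo "$both"
```

The single pattern should report only line 1, while the pipeline should report lines 1 and 2.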

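More options

Sometimes I care about the lines just before or after a match. For that, use:

-A n: print n lines after each match;

-B n: print n lines before each match;

-C n: print n lines both before and after each match.

In the output, matching lines are marked with : after the line number, and context lines with -.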

$ grep -A 1 -Ewn "a.*boy" grep.txt
2:I am a boy, a good boy, a lovely boy.
3-I like reading, sports, and coding.
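
The other two context options work the same way. As a quick sketch, again recreating the sample file in a scratch directory so that it runs on its own:

```shell
# Recreate the sample file in a scratch directory.
workdir=$(mktemp -d)
cat > "$workdir/grep.txt" <<'EOF'
Today is a good day, a sunny day, a wonderful day, a important day.
I am a boy, a good boy, a lovely boy.
I like reading, sports, and coding.
Enjoy coding.
EOF

# -B 1: the matching line plus one line before it.
before=$(grep -B 1 -En "sports" "$workdir/grep.txt")
echo "$before"

echo "----"

# -C 1: the matching line plus one line on each side.
around=$(grep -C 1 -En "sports" "$workdir/grep.txt")
echo "$around"
```

With -B 1 you should see lines 2 and 3, and with -C 1 lines 2 through 4, where only line 3 carries the : marker.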

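Make good use of regular expressions

If you are comfortable with regular expressions, you can search for the text you want far more flexibly.

A regular expression reference: https://www.cnblogs.com/fozero/p/7868687.html

For example, searching for digits or letters,

or matching phone numbers and ID card numbers.

Almost any pattern you need can be expressed with a regular expression.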
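
As a small sketch of the idea, the snippet below extracts runs of digits and then mobile-style numbers from some made-up contact lines. The pattern 1[3-9][0-9]{9} is a simplified assumption (11 digits, starting with 1, second digit 3 to 9), not a complete validation rule.

```shell
# Hypothetical contact lines, for illustration only.
contacts=$(printf '%s\n' \
  'Alice 13812345678' \
  'Bob 021-5555' \
  'Carol 19900001111 ext 42')

# Every run of digits; -o prints each match on its own line.
digits=$(printf '%s\n' "$contacts" | grep -Eo '[0-9]+')
echo "$digits"

echo "----"

# Simplified mobile-number pattern; -w rejects matches inside longer words or numbers.
mobiles=$(printf '%s\n' "$contacts" | grep -Eow '1[3-9][0-9]{9}')
echo "$mobiles"
```

The first search should print all five digit runs, while the second should keep only 13812345678 and 19900001111.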