Searching large source trees in an efficient way on Linux

TL;DR

Here is the alias (csh/tcsh syntax):

alias search 'find \!:1 -noleaf -type f -not -path "*/boost/*" -not -path "*/extensions/*" -print0 | xargs -0 -n 100 -P 8 grep -I --color -H -n \!:2*'

 

How do I use it?

Here is how I use it:

search [dir] [term] [grep_options]
e.g.
search ./src/ the\ search\ term
search ./src/ keyTerm -A5 -B5

How does it work?

find

This search alias uses find as follows to locate all files under the provided directory (i.e. first argument) while excluding directories that we don’t care about:

find \!:1 -noleaf -type f -not -path "*/boost/*" -not -path "*/extensions/*" -print0

For csh/tcsh aliases remember this:

!* is all arguments (everything but the command itself)
!:0 is only the first word, the command itself
!:1 is only the first argument
!:2* is all but the first argument
!$ is only the last argument
!:1- is all but the last argument
!! is the whole command line
$0 is the shell
$# is the number of args
$$ is the process id (PID)
$! is the PID of the most recent background command
$? is the return code from the previous command

Thus, the “\!:1” means only the first argument. The bang (!) has to be escaped so that the shell does not expand it when the alias is defined, only when the alias is used.

\!:1
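The “\!:1” and “\!:2*” substitutions only exist in csh/tcsh. For readers on bash or another sh-style shell, here is a rough sketch of the same idea as a function. It is an assumed translation, not part of the original alias: “$1” stands in for “\!:1”, and “$@” after a shift stands in for “\!:2*”. The variable name search_dir is made up for illustration.

```shell
# Hypothetical sh/bash counterpart of the tcsh "search" alias.
search() {
    # First argument is the directory to search (tcsh: \!:1).
    search_dir=$1
    shift
    # Everything left in "$@" is handed to grep (tcsh: \!:2*).
    find "$search_dir" -noleaf -type f \
        -not -path "*/boost/*" -not -path "*/extensions/*" -print0 |
        xargs -0 -n 100 -P 8 grep -I --color -H -n "$@"
}
```

It is called the same way as the alias, for example: search ./src/ keyTerm -A5 -B5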

The “-noleaf” option is used because I am normally working on Windows/NTFS mounts. It tells find not to assume that a directory contains 2 fewer subdirectories than its hard link count, an optimization that is only safe on filesystems that follow the Unix directory-link convention.

-noleaf

We only want to gather files for searching so I use the “-type f”.

-type f

I normally have very large directories which I do not care to search in, so I specify:

-not -path "*/boost/*" -not -path "*/extensions/*"
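One thing to be aware of: “-not -path” filters the results, but find still walks into the excluded directories. If that traversal ever becomes a noticeable cost, the standard “-prune” action skips those subtrees entirely. The sketch below is an alternative, not what the alias uses, and it should select roughly the same files; the function name list_sources is made up for illustration.

```shell
# Hypothetical helper: list files under $1 without descending into
# any directory named boost or extensions.
list_sources() {
    find "$1" -noleaf \
        \( -type d \( -name boost -o -name extensions \) \) -prune \
        -o -type f -print0
}
```

Its output can be piped into the same xargs command as before, since the names are still NUL-terminated.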

Finally, for the find command I pass “-print0”, which terminates each path with a NUL character instead of a newline. This adds support for paths with spaces (or even newlines) in them:

-print0
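To see why this matters, here is a small experiment. It assumes a scratch directory from mktemp whose own path contains no spaces; the variable names are made up for illustration. It counts how many arguments xargs produces for a single file named “my file.txt”, first with newline-separated names and then with NUL-separated names.

```shell
# Scratch directory holding one file whose name contains a space.
demo_dir=$(mktemp -d)
printf 'hello\n' > "$demo_dir/my file.txt"

# Newline-separated: xargs splits on whitespace, so the one path
# is broken into two separate arguments.
split_count=$(find "$demo_dir" -type f -print | xargs -n 1 echo | wc -l)

# NUL-separated: the path reaches the command as a single argument.
whole_count=$(find "$demo_dir" -type f -print0 | xargs -0 -n 1 echo | wc -l)

echo "arguments without -print0: $split_count, with -print0: $whole_count"
```

The first count should come out as 2 and the second as 1, which is exactly the difference between grep being handed two nonexistent paths and one real one.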

xargs

The xargs command controls how many files are passed to each grep invocation and runs those invocations in parallel.

xargs -0 -n 100 -P 8 grep -I --color -H -n \!:2*

The “-0” option is used here to tell xargs that the strings coming in are null terminated (this adds support for files with spaces):

-0

The “-n 100” and the “-P 8” options are where the speed and power of this alias come from. The “-n 100” is telling xargs to pass 100 files from find into grep at a time. The “-P 8” is telling xargs to run 8 grep commands in parallel.

This means that if we have a source tree of 1600 files, then grep will be called 16 times and each call will be passed 100 files. The best part is that 8 of those grep commands run in parallel, each on 100 files, so in the ideal case the command finishes in about the time of two (2) sequential grep invocations, which is very fast even on large source trees:

-n 100 -P 8
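The arithmetic can be checked without a real source tree. In this sketch, seq generates 1600 dummy names and echo stands in for grep, so each output line corresponds to one command xargs would have run. The variable names are made up for illustration.

```shell
# 1600 dummy "file names", 100 per command; echo prints one line per batch.
batches=$(seq 1 1600 | xargs -n 100 echo | wc -l)

# With 8 commands running at once, round up to get the number of "waves".
rounds=$(( (batches + 7) / 8 ))

echo "batches: $batches, rounds of 8: $rounds"
```

This should report 16 batches and 2 rounds. How close a real search gets to that ideal depends on the number of cores and on how fast the disk can feed 8 grep processes.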

grep

The grep command is used to do the actual searching in files.

grep -I --color -H -n \!:2*

The “-I” option ignores binary files:

-I

Colored results make it much easier to see hits:

--color

Because xargs decides how many files each grep receives, a batch can end up with a single file, and grep omits the file name by default when it is given only one file. So we add “-H” to always print the file name:

-H
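The single-file case is easy to reproduce. This sketch assumes a throwaway file from mktemp and searches it with and without “-H”; the variable names are made up for illustration.

```shell
# Throwaway file with a hit on line 2.
hit_file=$(mktemp)
printf 'alpha\nneedle\n' > "$hit_file"

# One file and no -H: grep prints only the line number and the text.
without_h=$(grep -n needle "$hit_file")

# With -H the file name is always prefixed to the hit.
with_h=$(grep -H -n needle "$hit_file")

echo "$without_h"
echo "$with_h"
```

The first line printed has no path in front of the hit, while the second starts with the path of the temporary file.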

The line number is also important, so we add “-n”:

-n

The ability to control grep is handled with an arguments wildcard. Here the “\!:2*” means the second and all subsequent arguments passed into the search alias. Thus the grep search term and all other grep options can be specified after the directory to search:

\!:2*

The final piece is that the xargs command appends the files from the find command to the grep command. It adds 100 files (or fewer, if fewer than 100 remain) to every grep command, and those commands run in parallel with up to 8 running at any given time.

Enjoy your searching.
