Searching large source trees in an efficient way on Linux
Here is the alias (csh/tcsh syntax):
alias search 'find \!:1 -noleaf -type f -not -path "*/boost/*" -not -path "*/extensions/*" -print0 | xargs -0 -n 100 -P 8 grep -I --color -H -n \!:2*'
How do I use it?
Here is how I use it:
search [dir] [term] [grep_options]
e.g.
search ./src/ the\ search\ term
search ./src/ keyTerm -A5 -B5
How does it work?
This search alias uses find as follows to locate all files under the provided directory (i.e. first argument) while excluding directories that we don’t care about:
find \!:1 -noleaf -type f -not -path "*/boost/*" -not -path "*/extensions/*" -print0
For aliases remember this:
!* is all arguments (everything but the command itself)
!:0 is only the first word, the command itself
!:1 is only the first argument
!:2* is all but the first argument
!$ is only the last argument
!:1- is all but the last argument
!! is the entire previous command line
$0 is the shell
$# is the number of args
$$ is the process id (PID)
$! is the PID of the most recent background command
$? is the return code from the previous command
Thus, the “\!:1” means only the first argument, and the bang (!) has to be escaped.
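The “-noleaf” is used because I am normally working on Windows/NTFS mounts. By default find optimizes by assuming that a directory contains 2 fewer subdirectories than its hard link count, so once it has seen that many subdirectories it treats the remaining entries as files. Filesystems that do not follow the Unix directory-link convention break that assumption, and “-noleaf” turns the optimization off.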
We only want to gather files for searching so I use the “-type f”.
I normally have very large directories which I do not care to search in, so I specify:
-not -path "*/boost/*" -not -path "*/extensions/*"
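To see the exclusions in action, here is a small sketch that builds a throwaway tree and runs the same tests against it. The directory and file names are made up for illustration, “-print” replaces “-print0” so the result is easy to read, and “-noleaf” is left out because it does not change which files match.

```shell
# Build a scratch tree; every name below is hypothetical.
tmp=$(mktemp -d)
mkdir -p "$tmp/src" "$tmp/boost/detail" "$tmp/extensions"
touch "$tmp/src/main.cpp" "$tmp/boost/detail/config.hpp" "$tmp/extensions/plugin.cpp"

# Same -type and -path tests as the alias.
found=$(find "$tmp" -type f -not -path "*/boost/*" -not -path "*/extensions/*" -print)
echo "$found"

rm -rf "$tmp"
```

Only the file under src/ should be listed; anything with /boost/ or /extensions/ somewhere in its path is dropped, however deep it sits.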
Finally for the find command I pass “-print0”, which terminates each path with a null character instead of a newline. This adds support for paths with spaces in them.
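The difference is easy to reproduce. The sketch below uses a scratch directory with hypothetical names that contain spaces and compares newline-separated output with null-separated output; it assumes a find and xargs that support “-print0” and “-0”, as the GNU versions do.

```shell
# Scratch tree whose directory and file names contain spaces (names are hypothetical).
tmp=$(mktemp -d)
mkdir -p "$tmp/my docs"
echo "needle" > "$tmp/my docs/notes file.txt"

# Newline-separated: xargs splits the path on the spaces, so grep is handed
# fragments that do not exist. "|| true" keeps the failure from stopping a script.
plain=$(find "$tmp" -type f -print | xargs grep -l needle 2>/dev/null || true)

# Null-separated: the path reaches grep in one piece.
safe=$(find "$tmp" -type f -print0 | xargs -0 grep -l needle)

echo "without -print0: '$plain'"
echo "with -print0:    '$safe'"
rm -rf "$tmp"
```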
The xargs command controls how many files are passed to each grep invocation and runs those invocations in parallel.
xargs -0 -n 100 -P 8 grep -I --color -H -n \!:2*
The “-0” option is used here to tell xargs that the strings coming in are null terminated (this adds support for files with spaces).
The “-n 100” and the “-P 8” options are where the speed and power of this alias come from. The “-n 100” is telling xargs to pass 100 files from find into grep at a time. The “-P 8” is telling xargs to run 8 grep commands in parallel.
This means that if we have a source tree of 1600 files, then grep will be called 16 times and each call will be passed 100 files. The best part is that up to 8 of those grep commands run in parallel, each on 100 files, which is very fast even on large source trees:
-n 100 -P 8
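The batch arithmetic can be checked without any real files. In this sketch seq generates 1600 dummy names and echo stands in for grep; since each echo invocation prints exactly one line, counting lines counts invocations.

```shell
# 1600 dummy "file names", 100 per invocation, up to 8 invocations at once.
# echo is a stand-in for grep; one output line per invocation.
batches=$(seq 1 1600 | xargs -n 100 -P 8 echo | wc -l)
echo "$batches"
```

This should report 16. Changing “-P 8” only changes how many of those invocations overlap in time, not how many there are.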
The grep command is used to do the actual searching in files.
grep -I --color -H -n \!:2*
The “-I” option ignores binary files.
The “--color” option highlights matches, which makes hits much easier to see.
When grep is handed only a single file it does not print the file name where the hit occurred, and the last batch from xargs can be that small, so we add “-H” to always print the file name.
The line number is also important, so we add “-n”.
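A quick way to see what “-H” changes is to run grep on a single file with and without it. The file name and contents here are made up for the demonstration.

```shell
# One scratch file with a match on its second line (hypothetical content).
tmp=$(mktemp -d)
printf 'first line\nneedle here\n' > "$tmp/only.txt"

# Single file, no -H: only the line number and the matching line are printed.
bare=$(grep -n needle "$tmp/only.txt")

# Single file, with -H: the file name is printed in front of the line number.
named=$(grep -H -n needle "$tmp/only.txt")

echo "$bare"
echo "$named"
rm -rf "$tmp"
```

The first result has no file name at all, which is what a one-file batch from xargs would look like without “-H”.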
Control over grep comes from an argument wildcard. Here the “\!:2*” means the second and all subsequent arguments passed into the search alias. Thus the grep search term and all other grep options can be specified after the directory to search.
The final piece is that the xargs command appends the files from the find command to the grep command. It adds 100 files (or fewer if fewer than 100 remain) to every grep command, and those commands run in parallel with up to 8 running at any given time.
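The “\!:1” and “\!:2*” argument selectors only exist in csh/tcsh. In bash or another sh-style shell, where aliases cannot pick out positional arguments, a function is the usual substitute. The following is a rough sketch rather than a tested drop-in replacement: the pipeline mirrors the alias, the usage check is an addition, and it assumes GNU find, xargs and grep.

```shell
# Hypothetical sh/bash counterpart of the tcsh alias.
search() {
    if [ "$#" -lt 2 ]; then
        echo "usage: search dir term [grep_options]" >&2
        return 2
    fi
    local dir=$1   # plays the role of \!:1
    shift          # what is left in "$@" plays the role of \!:2*
    find "$dir" -noleaf -type f -not -path "*/boost/*" -not -path "*/extensions/*" -print0 |
        xargs -0 -n 100 -P 8 grep -I --color -H -n "$@"
}
```

It is called the same way as the alias, for example search ./src/ keyTerm -A5 -B5. Quoting "$@" keeps a search term that contains spaces together as one argument.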
Enjoy your searching.
TheSoftwareProgrammer
I like science and writing software.