A Surprisingly Common Mistake Involving Wildcards & The Find Command


2020-01-21 – By Robert Elder

     Do you notice anything wrong with the following Linux command?

find . -name *.jpg

     If you don’t, you’re not alone.  Right now, as I’m typing this, 4 of the top 4 Google results for “find command linux” contain at least one example of the pattern illustrated by the command above.

     Unfortunately, there is one very subtle problem with this command that can cause you to make very serious errors if you use it.  What’s even worse is that the error isn’t likely to be obvious every time you use it.  For those of you who know about globbing, you already know what the error is and you don’t need to read this article.  For everyone else, let’s review an example to illustrate what the problem is.

     Assume you’re working within a source code repository, and you’d like to do some cleanup to get rid of unused files.  In this case, your goal is to keep only the Python source files that end with a .py extension, and delete all other types of files.  Here are some commands to generate the specific listing of files we’ll consider in this example:

touch readme.txt
mkdir src
mkdir lib
touch src/main.py
touch src/main.pyc
touch lib/other.py
touch lib/other.pyc
touch lib/stuff.py
touch lib/foo.py
touch lib/foo.pyc

     To review a list of all files that are present in the repository, you decide to issue this command at the root of the project:

find .

     Which produces this result:

.
./lib
./lib/foo.pyc
./lib/stuff.py
./lib/foo.py
./lib/other.pyc
./lib/other.py
./readme.txt
./src
./src/main.py
./src/main.pyc

     To find all files with the .py extension, you issue this command:

find . -name *.py

     Which produces this result:

./lib/stuff.py
./lib/foo.py
./lib/other.py
./src/main.py

     Then, because you’re interested in deleting any file that doesn’t end in .py, you issue this command:

find . -not -name *.py

     which shows you the opposite set of files:

.
./lib
./lib/foo.pyc
./lib/other.pyc
./readme.txt
./src
./src/main.pyc

     Now, because you want to actually delete the set of files that we just found, you decide to add the -delete flag and issue the command again:

find . -not -name *.py -delete

     After inspecting the results, you find that this command successfully deleted exactly the set of files that you intended to:

.
./lib
./lib/stuff.py
./lib/foo.py
./lib/other.py
./src
./src/main.py

     Because this process worked so well, you decide to save this command so you can use it again in the future to impress your coworkers.

     Later, you decide that you need to perform the exact same task of deleting any file that doesn’t end in a .py extension from your repository.  However, this time the repository contains the following list of files that is slightly different:

.
./lib
./lib/foo.pyc
./lib/stuff.py
./lib/foo.py
./lib/other.pyc
./lib/other.py
./readme.txt
./src
./src/main.py
./src/main.pyc
./test.py

     So again, you decide to issue the exact same command that we just used with the intention of deleting all files that aren’t Python source files:

find . -not -name *.py -delete

     But to your surprise, you find that you’ve just deleted way more files than you expected to, including most of the ones that end with the .py extension, with only a single file remaining:

.
./test.py

     What Just Happened Here???  Almost all of your files are now gone forever!

     The key problem here is the lack of quoting around the wildcard used to specify the file extension.  If the command were rewritten to enclose the wildcard in single quotes like this:

find . -not -name '*.py' -delete

     we would have gotten the desired result of deleting only files that don’t end with the .py extension.
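
     One way to check this without risking real files is to rebuild the second repository in a scratch directory and list what each form of the command would select before adding -delete.  This is only a sketch under a few assumptions: the scratch directory from mktemp and the variable names are made up for illustration, -type f is added so that directories stay out of the listing, and ! is used as the portable spelling of -not.

```shell
# Rebuild the second repository layout in a throwaway directory.
scratch=$(mktemp -d)
cd "$scratch" || exit 1
mkdir src lib
touch readme.txt test.py
touch src/main.py src/main.pyc
touch lib/other.py lib/other.pyc lib/stuff.py lib/foo.py lib/foo.pyc

# Unquoted: the shell rewrites *.py to test.py before find starts, so this
# really asks for "every file that is not named test.py".
unquoted=$(find . -type f ! -name *.py | sort)

# Quoted: find receives the pattern itself and applies it to every file name.
quoted=$(find . -type f ! -name '*.py' | sort)

printf 'unquoted pattern would select:\n%s\n\n' "$unquoted"
printf 'quoted pattern would select:\n%s\n' "$quoted"
```

     The quoted form should list only readme.txt and the three .pyc files, while the unquoted form lists every file except test.py.  Reviewing a listing like this first, and only then appending -delete to the quoted command, makes the difference visible before anything is removed.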

     Using quotes solves our problem, but why did it ever work in the first place?  After all, we didn’t use any quotes the first time, but we still got exactly the result that we wanted.

     To explain this, let’s make a short C program that does nothing other than print out the parameters that are passed to it:

#include <stdio.h>

/* Print the argument count, then each argument exactly as it was received. */
int main(int argc, char *argv[]) {
	int i;
	printf("argc=%d\n", argc);
	for (i = 0; i < argc; i++) {
		printf("Arg %d is: '%s'\n", i, argv[i]);
	}
	return 0;
}

     You can compile this program with this command:

gcc main.c

     If you run this program with the following argument in a directory with no Python source files:

./a.out *.py

     You can see that the output includes the literal wildcard character:

argc=2
Arg 0 is: './a.out'
Arg 1 is: '*.py'

     However, if you create a file in this directory that ends with the .py extension and run the same command again:

touch foo.py
./a.out *.py

     you’ll get the following result:

argc=2
Arg 0 is: './a.out'
Arg 1 is: 'foo.py'

     As you can see in this example, the shell will first attempt to expand our wildcard pattern to match any files that are present before passing their expanded names to the program we want to run.  This is called ‘globbing’.  Globbing is an operation that is performed by the shell itself, and it happens independently of the actual command we’re running.

     You can type:

man glob

     or

man 7 glob

     for more information.
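
     The same expansion can also be observed without compiling anything.  The sketch below uses a hypothetical shell function, show_args, as a stand-in for the C program (it does not count the program name, so its count is one lower than argc).  It assumes a POSIX-style shell such as bash with default options, where a pattern that matches nothing is passed through unchanged.

```shell
# Hypothetical helper: print the number of arguments, then each argument.
show_args() {
    printf 'count=%d\n' "$#"
    for arg in "$@"; do
        printf "Arg is: '%s'\n" "$arg"
    done
}

# Work in an empty throwaway directory so the results are predictable.
scratch=$(mktemp -d)
cd "$scratch" || exit 1

no_match=$(show_args *.py)      # nothing matches, so the pattern passes through as-is
touch foo.py boo.py
with_match=$(show_args *.py)    # the shell substitutes the sorted matching names
quoted=$(show_args '*.py')      # quoting suppresses the expansion entirely

printf '%s\n\n%s\n\n%s\n' "$no_match" "$with_match" "$quoted"
```

     The first and third calls report a single argument, the literal *.py, while the second reports two arguments, boo.py and foo.py.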

     Therefore, you should be aware that the value of the arguments that get passed to a given program will actually depend on the current contents of your filesystem.  If we create more files that end in .py:

touch boo.py
touch moo.py
mkdir test
touch test/abc.py
touch test/def.py

     and run our command again:

./a.out *.py

     you can see that the shell expands this wildcard to match all files in the current directory that have the .py extension.  This also implies that your program may be passed a different number of arguments depending on how many files match the wildcard:

argc=4
Arg 0 is: './a.out'
Arg 1 is: 'boo.py'
Arg 2 is: 'foo.py'
Arg 3 is: 'moo.py'

     Looking back at the contents of the current directory from the first time we ran the delete command, we can see that there are no files at the root of the project that have the .py extension.  Therefore, the wildcard *.py didn’t match anything and was passed to the ‘find’ program completely unchanged.  In the second example, the repository contains one file in the current directory that matches the .py extension, so the shell replaces the ‘*.py’ with ‘test.py’ before the ‘find’ program even gets a chance to look at what you’ve typed.

     Most importantly, the ‘find’ command uses a different algorithm than shell globbing does when matching wildcard characters.  More specifically, the find command applies the search pattern to the base of the file name, with all leading directories removed.  Shell globbing, in contrast, expands the wildcard within each path component separately.  When no path components are specified, the wildcard will match only files in the current directory.  This is what leads to the confusing case we just saw.  This diagram shows a flow chart that describes the logic that determines whether a program ever gets to see an unquoted wildcard character at all:
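
     The difference between the two matching rules can also be checked directly.  The following sketch, which assumes a throwaway directory from mktemp and illustrative variable names, compares what the shell expands with what find matches when it is handed the quoted pattern.

```shell
scratch=$(mktemp -d)
cd "$scratch" || exit 1
mkdir test
touch foo.py test/abc.py test/def.py

# Shell globbing: * never crosses a "/", so *.py only sees the current directory.
top_level=$(printf '%s\n' *.py)

# One wildcard per path component is needed to reach one level down.
one_down=$(printf '%s\n' */*.py)

# find -name compares the pattern with each base name, at any depth.
any_depth=$(find . -name '*.py' | sort)

printf 'shell *.py:\n%s\n\nshell */*.py:\n%s\n\nfind -name:\n%s\n' \
    "$top_level" "$one_down" "$any_depth"
```

     Here the shell expands *.py to foo.py alone and */*.py to the two files under test, whereas find reports all three files because each base name is tested on its own.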

     An important observation to make is that the core problem we’ve encountered here is actually something you can encounter with any Linux program, and not just the find command.  In fact, if you attempted to use the grep command with a wildcard to filter the output of find, you could encounter the exact same kind of mistake:

touch test.py
mkdir abc
touch abc/foo.py
find . | egrep *.py

     which outputs:

./test.py

     It just so happens that most casual uses for shell commands don’t encounter this problem:

touch test.py
mkdir abc
touch abc/foo.py
find . | egrep "*py"

     which outputs:

./abc/foo.py
./test.py
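
     It is worth keeping in mind that grep interprets its pattern as a regular expression rather than as a shell wildcard, so quoting alone does not make a glob-style pattern mean “ends in .py”.  As a hedged sketch, assuming that is the intended filter, an anchored regular expression in single quotes could look like this (grep -E is used as the equivalent of egrep, and the extra foo.pyc file is added only to show the anchor working):

```shell
scratch=$(mktemp -d)
cd "$scratch" || exit 1
mkdir abc
touch test.py abc/foo.py abc/foo.pyc

# Single quotes keep the shell away from the pattern; \. is a literal dot and
# $ anchors the match to the end of the line, so foo.pyc is not selected.
matches=$(find . | grep -E '\.py$' | sort)
printf '%s\n' "$matches"
```

     This prints ./abc/foo.py and ./test.py, and the result does not change when other files are added to the current directory.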

     Having said this, there are a few other cases where unexpected globbing can cause you problems.  For example, any case where you make use of unquoted variables in shell scripts can be a source of bugs:



echo ${1} | xxd

     This very simple shell script takes whatever text you supply as the first argument and pipes it into the xxd program to produce a hexadecimal dump of the argument.  Most strings that you supply as the first argument will be interpreted literally.  However, if you supply wildcard characters, even if they are quoted, they will be expanded by globbing inside the script since the variable in the script is not enclosed in quotes.

./echo-star-demo.sh Hello
./echo-star-demo.sh '*'

     Here is the output from the above commands (when run in a directory that only contains the script file):

00000000: 4865 6c6c 6f0a                           Hello.
00000000: 6563 686f 2d73 7461 722d 6465 6d6f 2e73  echo-star-demo.s
00000010: 680a                                     h.
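
     The usual remedy is to enclose the variable in double quotes.  The sketch below uses two hypothetical functions as stand-ins for the script body so that both variants can be compared side by side; xxd is left out so that the example depends only on the shell itself.

```shell
# Hypothetical stand-ins for the script body.
dump_unquoted() { echo ${1}; }             # the expanded value is globbed again
dump_quoted()   { printf '%s\n' "${1}"; }  # the value is passed through untouched

# Recreate a directory that contains only the script file.
scratch=$(mktemp -d)
cd "$scratch" || exit 1
touch echo-star-demo.sh

unquoted=$(dump_unquoted '*')
quoted=$(dump_quoted '*')

printf 'unquoted: %s\nquoted:   %s\n' "$unquoted" "$quoted"
```

     The unquoted variant prints echo-star-demo.sh, while the quoted variant prints the literal * that was supplied.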

     Another fairly contrived example would be a situation where you’re using a command like ‘bc’ to compute the results of mathematical expressions.  If you’re working in an empty directory, you could do:

echo 4 * 4 | bc

     and see that the result is ’16’.  However, if the current directory contained a file whose name was also part of a valid mathematical expression:

touch '+2099+4+'

     the result could be changed to anything, since the filename would be substituted as if it were part of the mathematical expression, which is probably not what you want:

echo 4 * 4 | bc

     The output from the above command with the extra file present will be:

2111
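
     To see exactly what text reaches the pipe in each case, the expression can be captured instead of being sent to bc.  This is a small sketch with illustrative variable names; bc itself is left out so that the example runs with the shell alone.

```shell
scratch=$(mktemp -d)
cd "$scratch" || exit 1
touch '+2099+4+'

unquoted=$(echo 4 * 4)      # the shell replaces * with the matching file name
quoted=$(echo '4 * 4')      # the expression reaches the pipe unchanged
shell_math=$((4 * 4))       # arithmetic expansion does not glob either

printf 'unquoted: %s\nquoted:   %s\nshell:    %s\n' "$unquoted" "$quoted" "$shell_math"
```

     The unquoted form produces the text 4 +2099+4+ 4, which is where the 2111 comes from, while the quoted form produces 4 * 4.  Piping the quoted form into bc, as in echo '4 * 4' | bc, should therefore give 16 regardless of what the directory contains.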

     The conclusion is that you need to be aware that the shell will attempt to expand unquoted wildcard characters, so you should always use quotes when specifying a wildcard with the ‘-name’ option of the find command:

find . -name '*.jpg'
