Linux – Finding distinct file types with the find command

An example of using the Unix find command to quickly check out the number of different file extensions in a directory tree.

If you find yourself working on an unfamiliar website or codebase, the Unix find command provides a quick and easy way to get a high-level view of what’s lurking in there.

In most cases a file’s type is identified by its filename extension. To list all the different extensions under the current directory, we can use the following command:

# find . -type f -name "*.*" | sed 's/.*\.//' | sort -u
JPG
css
gif
htm
html
jpg
js
php
png
swf
txt
xap
xml

Here we’re using find to locate all plain files (-type f) under the current directory and its subdirectories whose names match *.* – i.e. there’s a period in the name, so it (probably) has an extension. Next we use sed to remove everything up to and including the final period, which leaves just each file’s extension. Finally we use sort -u to reduce the list to the unique values.
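To see the stages working together, here is a minimal sketch that runs the same pipeline against a throwaway directory. The file names are made up for illustration, and LC_ALL=C is added to sort only so the ordering is the same on every system; neither is part of the original command.

```shell
# Build a small temporary tree (hypothetical file names).
demo_dir=$(mktemp -d)
mkdir -p "$demo_dir/img" "$demo_dir/src"
touch "$demo_dir/index.html" "$demo_dir/img/logo.png" \
      "$demo_dir/img/photo.JPG" "$demo_dir/src/app.js" "$demo_dir/README"

# README has no period in its name, so -name "*.*" skips it.
# LC_ALL=C forces byte-order sorting (upper case before lower case).
extensions=$(cd "$demo_dir" &&
    find . -type f -name "*.*" | sed 's/.*\.//' | LC_ALL=C sort -u)
printf '%s\n' "$extensions"

# Clean up the temporary tree.
rm -rf "$demo_dir"
```

This should list JPG, html, js and png, one per line.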

It’s useful to see that our codebase has both upper-case and lower-case versions of the same extension, but that might not be what we’re looking for. If we want to treat upper case and lower case as equivalent, we can use the tr command to convert everything to lower case first:

# find . -type f -name "*.*" | sed 's/.*\.//' | 
    tr '[:upper:]' '[:lower:]' | sort -u
css
gif
htm
html
jpg
js
php
png
swf
txt
xap
xml
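The case-folding step can be tried on its own. This is a small sketch using a hand-typed list of extensions (hypothetical input standing in for the find and sed output):

```shell
# tr maps every upper-case character to its lower-case equivalent,
# so JPG and jpg collapse into one entry once sort -u runs.
folded=$(printf '%s\n' JPG jpg Png png GIF |
    tr '[:upper:]' '[:lower:]' | LC_ALL=C sort -u)
printf '%s\n' "$folded"
```

The five input lines reduce to gif, jpg and png.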

Finally, it might be useful to see how many of each file type we have. We can do this with uniq -c, which prefixes each distinct line with the number of times it occurs. It only collapses adjacent duplicates though, so we still need to sort first; we drop the -u because uniq now does the de-duplication for us:

# find . -type f -name "*.*" | sed 's/.*\.//' | 
    tr '[:upper:]' '[:lower:]' | sort | uniq -c
  151 css
  587 gif
  166 htm
   86 html
  257 jpg
  439 js
 1585 php
 1332 png
   54 swf
   29 txt
    6 xml
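If the most common types should appear first, one option is to add a second, numeric sort on the counts. The sketch below uses a hand-typed list of extensions (hypothetical input) in place of the find output, and the awk step is there only to normalise the padding that uniq -c adds, which varies between systems.

```shell
# sort groups identical lines, uniq -c counts each group,
# and sort -rn orders the groups by count, largest first.
ranked=$(printf '%s\n' php png css php png php |
    sort | uniq -c | sort -rn | awk '{print $1, $2}')
printf '%s\n' "$ranked"
```

On a real tree the same idea would be to append `| sort -rn` to the end of the counting pipeline above.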

2 responses to “Linux – Finding distinct file types with the find command”

  1. How can you handle files with multiple periods in the file name, and files with spaces? For example:
    foo.bar.tgz
    foo.2016.8.14.tgz

    In both of these cases I am looking for something that only shows .tgz and ignores the other leading periods.

    • I’d expect the commands above to work okay in this scenario, as the .* in the sed expression is greedy and so matches and removes everything up to the last period (at least it works that way on my Mac!). If you’re seeing different behaviour, what sort of environment are you running in?
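A quick way to check this without touching any real files is to feed sample names straight into the sed step. The names below come from the question, plus one made-up name containing a space:

```shell
# .* is greedy, so the match runs up to the LAST period on each line;
# earlier periods and spaces in the name are removed along with it.
last_ext=$(printf '%s\n' './foo.bar.tgz' './foo.2016.8.14.tgz' './my backup.2016.tgz' |
    sed 's/.*\.//' | sort -u)
printf '%s\n' "$last_ext"
```

All three names reduce to the single line tgz. Because the pipeline works line by line, spaces are harmless; only a file name containing a newline would be split incorrectly.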
