Path Traversal

File path traversals are a common vulnerability closely related to injection attacks. Instead of escaping from quotation marks, as we saw in the previous section’s examples, this attack escapes into parent directories to make unexpected access to other parts of the filesystem. For example, to serve a collection of images, an implementation might collect image files in a directory named /server/data/image_store and then process requests for an image named X by fetching image data from the path /server/data/image_store/X, formed from the (untrusted) input name X.

The obvious attack would be requesting the name ../../secret/key, which would return the file /server/secret/key that should have been private. Recall that . (dot) is a special name for the current directory and .. (dot-dot) is the parent directory that allows traversal toward the filesystem root, as shown by this sequence of equivalent pathnames:

/server/data/image_store/../../secret/key
/server/data/../secret/key
/server/secret/key

The best way to secure against this kind of attack is to limit the character set allowed in the input (X in our example). Often, input validation ensuring that the input is an alphanumeric string suffices to completely close the door. This works well because it excludes the troublesome file separator and parent directory forms needed to escape from the intended part of the filesystem.

However, sometimes that approach is too limiting. When it’s necessary to handle arbitrary filenames this simple method is too restrictive, so you have more work to do, and it can get complicated because filesystems are complicated. Furthermore, if your code will run across different platforms, you need to be aware of possible filesystem differences (for example, the *nix path separator is a slash, but on Microsoft Windows it’s a backslash).

Here is a simple example of a function that inspects input strings before using them as subpaths for accessing files in the directory that this Python code resides in (denoted by __file__). The idea is to provide access only to files in a certain directory or its subdirectories—but absolutely not to arbitrary files elsewhere. In the version shown here, the guard function safe_path checks the input for a leading slash (which goes to the filesystem root) or parent directory dot-dot and rejects inputs that contain these. To get this right you should work with paths using standard libraries, such as Python’s os.path suite of functionality, rather than ad hoc string manipulation. But this alone isn’t sufficient to ensure against breaking out of the intended directory:

def safe_path(path):
    """Checks that argument path is a safe file path. If not, returns None.
    If safe, returns the normalized absolute file path.
    """
    if path.startswith('/') or path.startswith('..'):
        return None
    base_dir = os.path.dirname(os.path.abspath(__file__))
    filepath = os.path.normpath(os.path.join(base_dir, path))
    return filepath

The remaining hole in this protection is that the path can name a valid directory, and then go up to the parent directory, and so on to break out. For example, since the current directory this sample code runs in is five levels below the root, the path ./../../../../../etc/passwd (with five dot-dots) resolves to the /etc/passwd file.

We could improve the string-based tests for invalid paths by rejecting any path containing dot-dot, but such an approach can be risky, since it’s hard to be certain that we’ve anticipated all possible tricks and completely blocked them. Instead, there’s a straightforward solution that relies on the os.path library, rather than constructing path strings with your own code:

def safe_path(path):
    """Checks that argument path is a safe file path. If not, returns None.
    If safe, returns the normalized absolute file path.
    """
    base_dir = os.path.dirname(os.path.abspath(__file__))
    filepath = os.path.normpath(os.path.join(base_dir, path))
    if base_dir != os.path.commonpath([base_dir, filepath]):
        return None
    return filepath

This protection you can take to the bank, and here’s why. The base directory is a reliable path, because there is no involvement of untrusted input: it’s fully derived from values completely under the programmer’s control. After joining with the input path string, that path gets normalized, which resolves any dot-dot parent references to produce an absolute path: filepath. Now we can check that the longest common subpath of these is the intended directory to which we want to restrict access.