Get Nested Folders and Files in an S3 Bucket with Just a Few Lines of Code

If you are using Amazon S3 with a Python client, these few lines of code can help you parse an S3 bucket's nested contents in minutes.

Motivation
It's simple: I ran into this problem a few days ago. I searched through StackOverflow but didn't find any satisfactory results. After giving it some time, I figured out how to do the above without retrieving every object (because that's far more time-consuming).

Jumping to the problem/scenario
Suppose you are working with an S3 bucket whose folder structure looks something like this:

└── ParentFolder
    ├── SubFolder1
    │   ├── Folder1
    │   │   ├── File1
    │   │   └── File2
    │   └── Folder2
    │       ├── File1
    │       ├── File2
    │       └── File3
    └── SubFolder2
        ├── Folder1
        │   ├── File1
        │   └── File2
        └── Folder2
            ├── File1
            ├── File2
            └── File3

Now suppose that, for the time being, you only want the paths of the subfolders, so that other functions can then parse those subfolders and operate on the files inside them. The goal is to get just the subfolder paths in a dictionary, where each key is a subfolder name and each value is a list of all the folders inside that subfolder. Something like this:

{
"SubFolder1" : ["ParentFolder/SubFolder1/Folder1/", "ParentFolder/SubFolder1/Folder2/"],
"SubFolder2" : ["ParentFolder/SubFolder2/Folder1/", "ParentFolder/SubFolder2/Folder2/"]
}

Here is a sample function that does exactly that:

from typing import Dict, List

import boto3


def get_all_subfolders_in_bucket(
    bucket_name: str,
    prefix_key: str = "folder/",
    exclude_prefix_keys: List[str] = ["folder/test-folder/"],
) -> Dict[str, List[str]]:
    """Generates all the subfolders for every tenant name inside the bucket

    --- folder
        --- tenant name
            ---- cam_id
            ---- cam_id
        --- tenant name
            ...
    Args:
        bucket_name (str): The name of the bucket
        prefix_key (str, optional): Initial prefix key. Defaults to 'folder/'.
        exclude_prefix_keys (List[str], optional): Any keys to exclude.
            Defaults to ['folder/test-folder/'].

    Returns:
        Dict[str, List[str]]: A dict where each key is a parent_dir
        and each value is a list of its subdirs
    """

    s3_client = boto3.client(
        "s3",
        aws_access_key_id="...",
        aws_secret_access_key="...",
    )

    common_prefix_results: Dict[str, List[str]] = {}
    # With Delimiter="/", S3 groups keys at the next "/" after the prefix,
    # so CommonPrefixes holds the immediate "folders" instead of every object.
    results = s3_client.list_objects(
        Bucket=bucket_name,
        Prefix=prefix_key,
        Delimiter="/",
    )

    # CommonPrefixes is absent when nothing matches, so default to [].
    for result in results.get("CommonPrefixes", []):
        folder = result.get("Prefix")

        if folder in exclude_prefix_keys:
            continue

        common_prefix_results[folder] = []
        # One more listing per subfolder to get the next level down.
        for subfolder in s3_client.list_objects(
            Bucket=bucket_name, Prefix=folder, Delimiter="/"
        ).get("CommonPrefixes", []):
            common_prefix_results[folder].append(subfolder.get("Prefix"))
    return common_prefix_results
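
As a quick sanity check, here is one way you might call it. Note that my-bucket and the prefixes below are placeholders, not values from the original setup:

subfolders = get_all_subfolders_in_bucket(
    bucket_name="my-bucket",  # placeholder bucket name
    prefix_key="ParentFolder/",
    exclude_prefix_keys=[],
)
for parent, children in subfolders.items():
    print(parent, "->", children)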

All we are doing is starting from a parent dir named folder, then using a for loop to go over the first level of subfolders. We skip any subfolder (for example, test-folder) that is mentioned in the exclude_prefix_keys argument. If a subfolder is not in the excluded list, we move on and search for the next level of folders inside it using this:

s3_client.list_objects(Bucket=bucket_name, Prefix=folder, Delimiter="/").get("CommonPrefixes")

The above call essentially returns a list of dictionaries, one per directory, with some key-value pairs. All we want from each is the folder path inside the subfolder, and we store it in the dict we initialised earlier so that the result matches the structure of the JSON above.
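
To make this concrete, here is roughly what a trimmed list_objects response looks like for our example tree. The field names come from boto3's response shape; my-bucket is a placeholder, and fields such as Contents and IsTruncated are omitted:

{
    "Name": "my-bucket",
    "Prefix": "ParentFolder/SubFolder1/",
    "Delimiter": "/",
    "CommonPrefixes": [
        {"Prefix": "ParentFolder/SubFolder1/Folder1/"},
        {"Prefix": "ParentFolder/SubFolder1/Folder2/"}
    ]
}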

As a rule of thumb, you can remember these two points:

  • CommonPrefixes can be thought of as the set of common "folders" (prefixes) directly inside a parent directory

  • Prefix acts as the key to fetch the path of each of those "folders"
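
One caveat worth knowing: list_objects returns at most 1,000 entries per call. If any level of your bucket can hold more prefixes than that, a paginator follows the continuation tokens for you. Here is a minimal sketch using boto3's list_objects_v2 paginator, with my-bucket and the prefix again as placeholders:

import boto3

s3_client = boto3.client("s3")
paginator = s3_client.get_paginator("list_objects_v2")

folders = []
# The paginator transparently follows continuation tokens, so this
# works even when a prefix contains more than 1,000 entries.
for page in paginator.paginate(
    Bucket="my-bucket",  # placeholder bucket name
    Prefix="ParentFolder/",
    Delimiter="/",
):
    for common_prefix in page.get("CommonPrefixes", []):
        folders.append(common_prefix["Prefix"])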

That's it. Now you know how to parse a folder structure inside an Amazon S3 bucket. You can tweak these functions to get similar jobs done too. Make sure to like this blog if you found it helpful and informative. 😁

Cheers 🥳