Using EASI scratch and project buckets

EASI provides a scratch bucket that is available to all users.

  • Scratch means temporary: all files will be deleted after 30 days.
  • Use the scratch bucket to save files between processing runs, or to temporarily share files between projects.

Project buckets are available to selected users as well. A project bucket can exist in another AWS account and be cross-linked to EASI. An EASI admin will assign users to a "project", which enables their access to the bucket. Files in a project bucket are subject to the bucket owner's lifecycle rules, administration, and costs.

Cross-account project buckets may benefit from additional ACL settings. See User Guide/08-cross-account-storage-usage (in your deployment).

Glossary:

  • S3 storage items are called objects. Typically these are files, but an object can be any blob of data.
  • An object's name is its key. A key can be almost any string. Typically a key includes / characters so it resembles a directory path from a regular file system, although S3 itself has no directories (see the sketch below).
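
For example, listing tools emulate directories by grouping keys on a delimiter. A minimal sketch (the bucket name and prefix are placeholders, and it assumes your AWS credentials are configured):

In [ ]:
import boto3

client = boto3.client('s3')

# Placeholder names for illustration. Keys are plain strings; setting
# Delimiter='/' asks S3 to group them as if they were directories.
response = client.list_objects_v2(Bucket='my-bucket', Prefix='my-user-id/', Delimiter='/')
for p in response.get('CommonPrefixes', []):
    print(p['Prefix'])  # each "subdirectory" under the prefix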

There are two AWS interfaces for reading from and writing to a scratch or project bucket. Examples of both are given in this notebook.

  • AWS CLI - command-line program (use in a terminal)
  • boto3 - Python library (use in code)

We show writing first so that you have a test file to use in the reading section.

  • Writing
    • User ID
    • Select a test file
    • Upload a file
  • Reading
    • List objects
    • Read a file directly
    • Copy a file to local

Imports and setup¶

In [4]:
import sys, os
import boto3
from datetime import datetime as dt

# EASI tools
import git
repo = git.Repo('.', search_parent_directories=True).working_tree_dir
if repo not in sys.path: sys.path.append(repo)
from easi_tools import EasiDefaults
In [13]:
client = boto3.client('s3')

easi = EasiDefaults()
bucket = easi.scratch
Successfully found configuration for deployment "chile"
In [14]:
# Optional, for parallel uploads and downloads of large files
# Add a (..., Config=config) parameter to the relevant upload and download functions

# from boto3.s3.transfer import TransferConfig
# config = TransferConfig(
#     multipart_threshold = 1024 * 25,
#     max_concurrency = 10,
#     multipart_chunksize = 1024 * 25,
#     use_threads = True
# )
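#
# For example (a sketch): with the lines above uncommented, pass the config
# to an upload or download call; both upload_file and download_file accept it.
# The key below is hypothetical and the variables are defined in later cells.
# client.upload_file(testfile, bucket, f'{userid}/large-file.bin', Config=config)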

Writing¶

User ID¶

To write to the scratch bucket, the root of the key must be your AWS User ID.

For a project bucket this restriction probably does not apply; any root-key conditions are managed by the bucket owner.
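
If you are unsure whether you can write under a given prefix, a quick probe is to attempt a small write and catch the error. A minimal sketch; the prefix is a placeholder (for the scratch bucket, use your User ID), and client and bucket come from the setup cells:

In [ ]:
from botocore.exceptions import ClientError

# Hypothetical probe: try writing an empty object under a candidate prefix.
# 'some-prefix' is a placeholder root key.
try:
    client.put_object(Bucket=bucket, Key='some-prefix/permission-test.txt', Body=b'')
    print('Write access OK.')
except ClientError as e:
    print('No write access:', e.response['Error']['Code'])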

In [15]:
%%bash

userid=`aws sts get-caller-identity --query 'UserId' | sed 's/["]//g'`
echo $userid
AROAT2ES654NBY3M4X7WD:jhodge
In [16]:
userid = boto3.client('sts').get_caller_identity()['UserId']
print(userid)
AROAT2ES654NBY3M4X7WD:jhodge

Select a test file¶

For use in this notebook.

In [17]:
testfile = '/home/jovyan/test-file.txt'
In [18]:
%%bash -s "$testfile"
 
testfile=$1
touch $testfile
ls -l $testfile
-rw-r--r-- 1 jovyan users 0 May 24 15:28 /home/jovyan/test-file.txt

Upload a file¶

In [19]:
%%bash -s "$bucket" "$userid" "$testfile"

bucket=$1
userid=$2
testfile=$3

aws s3 cp ${testfile} s3://${bucket}/${userid}/
upload: ../../../test-file.txt to s3://easido-prod-user-scratch/AROAT2ES654NBY3M4X7WD:jhodge/test-file.txt
In [20]:
target = testfile.split('/')[-1]
try:
    print(f'upload: {testfile} to s3://{bucket}/{userid}/{target}')
    r = client.upload_file(testfile, bucket, f'{userid}/{target}')
    print('Success.')
except Exception as e:
    print(e)
    print('Failed.')
upload: /home/jovyan/test-file.txt to s3://easido-prod-user-scratch/AROAT2ES654NBY3M4X7WD:jhodge/test-file.txt
Success.

Reading¶

List objects¶

The boto3 list_objects_v2 client method returns at most 1000 keys per call. Two options are shown here.

  1. Basic use of list_objects_v2
  2. Paginated list objects, for potentially >1000 keys
In [21]:
%%bash -s "$bucket" "$userid"

bucket=$1
userid=$2

aws s3 ls s3://${bucket}/${userid}/
2023-05-24 15:28:45          0 test-file.txt
In [22]:
# Basic use of list_objects_v2

response = client.list_objects_v2(Bucket=bucket, Prefix=f'{userid}/')

# from pprint import pprint
# pprint(response)

# List each key with its last modified time stamp
if 'Contents' in response:
    for c in response['Contents']:
        key = c['Key']
        lastmodified = c['LastModified'].strftime('%Y-%m-%d %H:%M:%S')
        size = c['Size']
        print(f'{lastmodified}\t{size} {key}')
2023-05-24 15:28:45	0 AROAT2ES654NBY3M4X7WD:jhodge/test-file.txt
In [23]:
# Paginated list objects, for potentially >1000 keys

paginator = client.get_paginator('list_objects_v2')
page_iterator = paginator.paginate(Bucket=bucket, Prefix=f'{userid}/')

for response in page_iterator:
    if 'Contents' in response:
        for c in response['Contents']:
            key = c['Key']
            lastmodified = c['LastModified'].strftime('%Y-%m-%d %H:%M:%S')
            size = c['Size']
            print(f'{lastmodified}\t{size} {key}')
2023-05-24 15:28:45	0 AROAT2ES654NBY3M4X7WD:jhodge/test-file.txt

Read a file directly¶

Many data reading packages can read a file from an s3://bucket/key path into memory. Examples include:

  • rasterio and rioxarray
  • gdal

For packages that cannot read from an S3 path, first copy the file to your home directory or to a temporary directory (for example, on dask workers), then read it with a normal file path.
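
For example, a minimal sketch with rasterio, assuming the object is a GeoTIFF and your AWS credentials are available; the key here is hypothetical:

In [ ]:
import rasterio

# rasterio can open s3:// paths directly (via GDAL's /vsis3/ handler).
# 'example.tif' is a hypothetical key for illustration.
with rasterio.open(f's3://{bucket}/{userid}/example.tif') as src:
    print(src.profile)  # metadata
    data = src.read(1)  # first band as a numpy array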

Copy a file to local¶

In [24]:
%%bash -s "$bucket" "$userid" "$testfile"

bucket=$1
userid=$2
testfile=$3

source=`basename $testfile`
aws s3 cp s3://${bucket}/${userid}/${source} ${testfile}
ls -l $testfile
download: s3://easido-prod-user-scratch/AROAT2ES654NBY3M4X7WD:jhodge/test-file.txt to ../../../test-file.txt
-rw-r--r-- 1 jovyan users 0 May 24 15:28 /home/jovyan/test-file.txt
In [25]:
source = testfile.split('/')[-1]
try:
    print(f'download: s3://{bucket}/{userid}/{source} to {testfile}')
    r = client.download_file(bucket, f'{userid}/{source}', testfile)
    print('Success.')
except Exception as e:
    print(e)
    print('Failed.')
download: s3://easido-prod-user-scratch/AROAT2ES654NBY3M4X7WD:jhodge/test-file.txt to /home/jovyan/test-file.txt
Success.