S3 urls - get name and path in bucket

Question

I have a variable that holds an AWS S3 URL:

s3://bucket_name/folder1/folder2/file1.json

I want to get bucket_name in one variable and the rest, i.e. /folder1/folder2/file1.json, in another variable. I tried a regular expression and could get the bucket_name as shown below, but I am not sure whether there is a better way.

 import re
 m = re.search(r'(?<=s3://)[^/]+', 's3://bucket_name/folder1/folder2/file1.json')
 print(m.group(0))

How can I get the rest, i.e. folder1/folder2/file1.json?

I checked whether boto3 has a function to extract the bucket name and key from the URL, but could not find one.

+24

python boto3

Lijju mathew Mar 07 '17 at 6:06

4 answers

For those who, like me, tried to use urlparse to extract the key and bucket in order to create an object with boto3, there is one important detail: remove the leading slash from the key.

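 from urlparse import urlparse  # Python 2; in Python 3: from urllib.parse import urlparse
 import boto3
 o = urlparse('s3://bucket_name/folder1/folder2/file1.json')
 bucket = o.netloc
 key = o.path  # still starts with '/'
 client = boto3.client('s3')
 # Strip the leading slash so the key is 'folder1/folder2/file1.json'
 client.put_object(Body='test', Bucket=bucket, Key=key.lstrip('/'))

It took a while to figure this out, because boto3 does not raise any exception when the key keeps its leading slash.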
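The leading slash can be checked without touching S3 at all. A minimal Python 3 sketch, in which the helper name `parse_s3_url` is hypothetical and keys containing `?` are deliberately not handled:

```python
from urllib.parse import urlparse


def parse_s3_url(url):
    """Return (bucket, key) for an s3:// URL. Hypothetical helper."""
    parsed = urlparse(url, allow_fragments=False)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError("not an s3:// URL: %r" % url)
    # urlparse keeps the slash that separates bucket and key, so strip it.
    # Assumption: the key contains no '?'; a query part would be dropped here.
    return parsed.netloc, parsed.path.lstrip("/")


raw_path = urlparse("s3://bucket_name/folder1/folder2/file1.json").path
bucket, key = parse_s3_url("s3://bucket_name/folder1/folder2/file1.json")
print(raw_path)  # /folder1/folder2/file1.json
print(key)       # folder1/folder2/file1.json
```

Checking the scheme up front turns a mistyped URL into an immediate error instead of an object stored under an unexpected key.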

+12

Mikhail Sirotenko Jan 13 '18 at 22:59

A solution that works without urllib or re (it also handles the leading slash):

 def split_s3_path(s3_path):
     path_parts = s3_path.replace("s3://", "").split("/")
     bucket = path_parts.pop(0)
     key = "/".join(path_parts)
     return bucket, key

To run:

 bucket, key = split_s3_path("s3://my-bucket/some_folder/another_folder/my_file.txt")

Returns:

 bucket: my-bucket
 key: some_folder/another_folder/my_file.txt
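The same string-only idea can also be written with `str.partition`, which copes with a URL that names only a bucket. This is a sketch, not part of the answer above: the name `split_s3_path_strict` and the prefix check are additions made for illustration.

```python
def split_s3_path_strict(s3_path):
    """Split 's3://bucket/key' into (bucket, key). Hypothetical variant."""
    prefix = "s3://"
    if not s3_path.startswith(prefix):
        raise ValueError("expected an s3:// path, got %r" % s3_path)
    # partition splits on the first '/' only, so the key keeps its inner slashes.
    bucket, _, key = s3_path[len(prefix):].partition("/")
    return bucket, key


print(split_s3_path_strict("s3://my-bucket/some_folder/another_folder/my_file.txt"))
print(split_s3_path_strict("s3://my-bucket"))  # key is the empty string
```

Slicing off the prefix, rather than calling `replace`, leaves any later occurrence of `s3://` inside the key untouched.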

+7

mikeviescas Jun 14 '18 at 19:02

If you want to do this with regular expressions, you can do the following:

 >>> import re
 >>> uri = 's3://my-bucket/my-folder/my-object.png'
 >>> match = re.match(r's3:\/\/(.+?)\/(.+)', uri)
 >>> match.group(1)
 'my-bucket'
 >>> match.group(2)
 'my-folder/my-object.png'

This has the advantage that you can validate the s3 scheme rather than allowing anything there.
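To act on that scheme check, the no-match case has to be handled, because `re.match` returns `None` for any other scheme. A sketch, assuming bucket names never contain a slash; `S3_URI` and `parse_with_regex` are hypothetical names:

```python
import re

# Bucket: everything up to the first '/'. Key: everything after it.
S3_URI = re.compile(r"s3://([^/]+)/(.+)")


def parse_with_regex(uri):
    """Return (bucket, key), or raise ValueError if uri is not s3://bucket/key."""
    match = S3_URI.fullmatch(uri)
    if match is None:
        raise ValueError("not an s3://bucket/key URI: %r" % uri)
    return match.group(1), match.group(2)


print(parse_with_regex("s3://my-bucket/my-folder/my-object.png"))
```

Because the pattern requires a non-empty key, a bucket-only URI such as `s3://my-bucket` is rejected as well.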

+3

Alec hewitt Nov 06 '17 at 5:36

kichik · Accepted Answer · 2017-03-07T06:10:11+0000

Since this is a regular URL, you can use urlparse to get all parts of the URL.

 >>> from urlparse import urlparse
 >>> o = urlparse('s3://bucket_name/folder1/folder2/file1.json', allow_fragments=False)
 >>> o
 ParseResult(scheme='s3', netloc='bucket_name', path='/folder1/folder2/file1.json', params='', query='', fragment='')
 >>> o.netloc
 'bucket_name'
 >>> o.path
 '/folder1/folder2/file1.json'

You may need to remove the leading slash from the key, as another answer suggests.

 o.path.lstrip('/')

In Python 3, urlparse moved to urllib.parse, so use:

 from urllib.parse import urlparse

Here is a class that takes care of all the details.

 try:
     from urlparse import urlparse  # Python 2
 except ImportError:
     from urllib.parse import urlparse  # Python 3

 class S3Url(object):
     """
     >>> s = S3Url("s3://bucket/hello/world")
     >>> s.bucket
     'bucket'
     >>> s.key
     'hello/world'
     >>> s.url
     's3://bucket/hello/world'
     >>> s = S3Url("s3://bucket/hello/world?qwe1=3#ddd")
     >>> s.bucket
     'bucket'
     >>> s.key
     'hello/world?qwe1=3#ddd'
     >>> s.url
     's3://bucket/hello/world?qwe1=3#ddd'
     >>> s = S3Url("s3://bucket/hello/world#foo?bar=2")
     >>> s.key
     'hello/world#foo?bar=2'
     >>> s.url
     's3://bucket/hello/world#foo?bar=2'
     """

     def __init__(self, url):
         self._parsed = urlparse(url, allow_fragments=False)

     @property
     def bucket(self):
         return self._parsed.netloc

     @property
     def key(self):
         # A '?' in the key is parsed as a query string, so put it back.
         if self._parsed.query:
             return self._parsed.path.lstrip('/') + '?' + self._parsed.query
         else:
             return self._parsed.path.lstrip('/')

     @property
     def url(self):
         return self._parsed.geturl()
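The `allow_fragments=False` argument matters when a key contains `#`, as in the docstring examples. A short Python 3 comparison, meant only as an illustration (the variable names are arbitrary):

```python
from urllib.parse import urlparse

url = "s3://bucket/hello/world#foo"

default = urlparse(url)                              # '#foo' is split off
no_fragments = urlparse(url, allow_fragments=False)  # '#foo' stays in the path

print(default.path, default.fragment)  # /hello/world foo
print(no_fragments.path)               # /hello/world#foo
```

With the default setting the part after `#` lands in `fragment`, so a key built from `path` alone would silently lose it.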
