S3 URLs: get the bucket name and the path (key) within the bucket

I have a variable that holds an AWS S3 URL:

s3://bucket_name/folder1/folder2/file1.json 

I want to get bucket_name into one variable and the rest, i.e. /folder1/folder2/file1.json, into another variable. I tried regular expressions and could get the bucket_name as shown below, but I am not sure whether there is a better way.

    import re

    m = re.search(r'(?<=s3://)[^/]+', 's3://bucket_name/folder1/folder2/file1.json')
    print(m.group(0))

How can I get the rest, i.e. folder1/folder2/file1.json?

I checked if there is a boto3 function to extract the bucket_name and key from the url, but could not find it.

4 answers

Since this is a regular URL, you can use urlparse to get all parts of the URL.

    >>> from urlparse import urlparse
    >>> o = urlparse('s3://bucket_name/folder1/folder2/file1.json', allow_fragments=False)
    >>> o
    ParseResult(scheme='s3', netloc='bucket_name', path='/folder1/folder2/file1.json', params='', query='', fragment='')
    >>> o.netloc
    'bucket_name'
    >>> o.path
    '/folder1/folder2/file1.json'

You may need to remove the leading slash from the key, as the following answer suggests.

 o.path.lstrip('/') 

In Python 3, urlparse moved to urllib.parse so use:

 from urllib.parse import urlparse 
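
Putting the pieces above together for Python 3, a minimal sketch could look like this. The helper name `bucket_and_key` is made up for illustration; only `urlparse` and `str.lstrip` come from the standard library.

```python
from urllib.parse import urlparse


def bucket_and_key(s3_url):
    """Split an s3:// URL into (bucket, key). Hypothetical helper name."""
    parsed = urlparse(s3_url, allow_fragments=False)
    # netloc holds the bucket; path starts with "/" which is not part of the key.
    return parsed.netloc, parsed.path.lstrip('/')


bucket, key = bucket_and_key('s3://bucket_name/folder1/folder2/file1.json')
print(bucket)
print(key)
```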

Here is a class that takes care of all the details.

    try:
        from urlparse import urlparse  # Python 2
    except ImportError:
        from urllib.parse import urlparse  # Python 3


    class S3Url(object):
        """
        >>> s = S3Url("s3://bucket/hello/world")
        >>> s.bucket
        'bucket'
        >>> s.key
        'hello/world'
        >>> s.url
        's3://bucket/hello/world'

        >>> s = S3Url("s3://bucket/hello/world?qwe1=3#ddd")
        >>> s.bucket
        'bucket'
        >>> s.key
        'hello/world?qwe1=3#ddd'
        >>> s.url
        's3://bucket/hello/world?qwe1=3#ddd'

        >>> s = S3Url("s3://bucket/hello/world#foo?bar=2")
        >>> s.key
        'hello/world#foo?bar=2'
        >>> s.url
        's3://bucket/hello/world#foo?bar=2'
        """

        def __init__(self, url):
            self._parsed = urlparse(url, allow_fragments=False)

        @property
        def bucket(self):
            return self._parsed.netloc

        @property
        def key(self):
            # urlparse splits off everything after "?", so re-attach it to the key
            if self._parsed.query:
                return self._parsed.path.lstrip('/') + '?' + self._parsed.query
            else:
                return self._parsed.path.lstrip('/')

        @property
        def url(self):
            return self._parsed.geturl()
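
The reason the class re-attaches the query and passes `allow_fragments=False` is that `urlparse` would otherwise cut the key at the first "?" or "#". The following sketch (Python 3 only, with illustrative variable names) shows the difference:

```python
from urllib.parse import urlparse

# With default settings, everything after "#" is moved into .fragment.
default_path = urlparse('s3://bucket/hello/world#foo').path
kept_path = urlparse('s3://bucket/hello/world#foo', allow_fragments=False).path

# Even with allow_fragments=False, the path still stops at the first "?".
parsed = urlparse('s3://bucket/hello/world?qwe1=3', allow_fragments=False)
path_only = parsed.path.lstrip('/')
# Gluing the query back on recovers the full key.
full_key = path_only + ('?' + parsed.query if parsed.query else '')

print(default_path)
print(kept_path)
print(path_only)
print(full_key)
```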

For those who, like me, tried to use urlparse to extract the bucket and key in order to create an object with boto3, there is one important detail: remove the slash from the beginning of the key.

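    from urlparse import urlparse  # Python 2; on Python 3 use urllib.parse
    import boto3

    o = urlparse('s3://bucket_name/folder1/folder2/file1.json')
    bucket = o.netloc
    key = o.path

    client = boto3.client('s3')
    # Strip the leading slash so the key is 'folder1/...', not '/folder1/...'
    client.put_object(Body='test', Bucket=bucket, Key=key.lstrip('/'))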

It took a while to figure this out, because boto3 does not throw any exceptions.
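
Since nothing fails loudly, one option is to normalise and validate the URL before it ever reaches boto3. The sketch below is only an illustration: `put_object_args` is a hypothetical helper, not part of boto3, and it just builds the keyword arguments without making any AWS call.

```python
from urllib.parse import urlparse


def put_object_args(s3_url, body):
    """Build keyword arguments for put_object (hypothetical helper, not a boto3 API)."""
    parsed = urlparse(s3_url, allow_fragments=False)
    if parsed.scheme != 's3' or not parsed.netloc:
        raise ValueError('not an s3:// URL: %r' % (s3_url,))
    # Drop the leading slash so the key does not start with "/".
    key = parsed.path.lstrip('/')
    if not key:
        raise ValueError('URL has no object key: %r' % (s3_url,))
    return {'Bucket': parsed.netloc, 'Key': key, 'Body': body}


args = put_object_args('s3://bucket_name/folder1/folder2/file1.json', 'test')
print(args)
```

With a real client, the result could then be passed on as `client.put_object(**args)`.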


A solution that works without urllib or re (it also handles the leading slash):

    def split_s3_path(s3_path):
        path_parts = s3_path.replace("s3://", "").split("/")
        bucket = path_parts.pop(0)
        key = "/".join(path_parts)
        return bucket, key

To run:

 bucket, key = split_s3_path("s3://my-bucket/some_folder/another_folder/my_file.txt") 

Returns:

    bucket: my-bucket
    key: some_folder/another_folder/my_file.txt
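
One thing to be aware of is that `replace("s3://", "")` removes every occurrence of that text, not only the prefix. A possible variation, assuming you would rather reject inputs that do not start with `s3://`, uses `str.partition` instead. The name `split_s3_path_strict` is made up for this sketch.

```python
def split_s3_path_strict(s3_path):
    """Variation on split_s3_path that only strips the leading 's3://'."""
    prefix = "s3://"
    if not s3_path.startswith(prefix):
        raise ValueError("expected a path starting with s3://, got %r" % (s3_path,))
    # partition splits at the first "/" only; key is "" for a bucket-only path.
    bucket, _, key = s3_path[len(prefix):].partition("/")
    return bucket, key


bucket, key = split_s3_path_strict("s3://my-bucket/some_folder/another_folder/my_file.txt")
print(bucket)
print(key)
```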

If you want to do this with regular expressions, you can do the following:

    >>> import re
    >>> uri = 's3://my-bucket/my-folder/my-object.png'
    >>> match = re.match(r's3:\/\/(.+?)\/(.+)', uri)
    >>> match.group(1)
    'my-bucket'
    >>> match.group(2)
    'my-folder/my-object.png'

This has the advantage that you can check for the s3 scheme rather than allow anything there.
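
To make that scheme check explicit, the regular expression can be wrapped in a small function that raises when the input does not match. This is only a sketch: the names `S3_URI` and `parse_s3_uri` are invented here, and the pattern uses named groups and `[^/]+` for the bucket rather than the exact expression above.

```python
import re

# Bucket is everything up to the first "/", key is the remainder.
S3_URI = re.compile(r's3://(?P<bucket>[^/]+)/(?P<key>.+)')


def parse_s3_uri(uri):
    """Return (bucket, key), or raise ValueError if uri is not s3://bucket/key."""
    match = S3_URI.fullmatch(uri)
    if match is None:
        raise ValueError('not a valid S3 URI: %r' % (uri,))
    return match.group('bucket'), match.group('key')


print(parse_s3_uri('s3://my-bucket/my-folder/my-object.png'))
```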


Source: https://habr.com/ru/post/1015550/

