I have been trying to find a better command-line tool for duplicating buckets than s3cmd. s3cmd can duplicate buckets without having to download and upload each file. The command I normally run with s3cmd is:
s3cmd cp -r --acl-public s3://bucket1 s3://bucket2
This works, but it is very slow, as it copies each file via the API one at a time. If s3cmd could run in parallel mode, I'd be very happy.
Are there other command-line tools or code that people use to duplicate buckets faster than s3cmd?
Edit: Looks like s3cmd-modification is exactly what I'm looking for. Too bad it does not work. Are there any other options?
The AWS CLI seems to do the job perfectly, with the bonus of being an officially supported tool:
aws s3 sync s3://mybucket s3://backup-mybucket
It supports concurrent transfers by default; see http://docs.aws.amazon.com/cli/latest/topic/s3-config.html#max-concurrent-requests
To transfer a huge number of small files quickly, run the command from an EC2 instance to decrease latency, and increase max_concurrent_requests to reduce the impact of that latency. For example:
aws configure set default.s3.max_concurrent_requests 200
If you don't mind using the AWS console, you can:
- select all of the files/folders in the source bucket
- choose Actions > Copy
- open the destination bucket (creating it first if needed)
- choose Actions > Paste
It's still fairly slow, but you can leave it alone and let it do its thing.
I have tried cloning two buckets using the AWS web console, s3cmd, and the AWS CLI. Although these methods work most of the time, they are painfully slow.
Then I found s3s3mirror - a specialized tool for syncing two S3 buckets. It's multithreaded and a lot faster than the other approaches I have tried. I quickly moved GBs of data from one AWS region to another.
Check it out at https://github.com/cobbzilla/s3s3mirror, or download a Docker container from https://registry.hub.docker.com/u/pmoust/s3s3mirror/
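If I recall the README correctly, usage is a single command once you have cloned the repo and set your AWS credentials; a minimal invocation looks roughly like this (bucket names are placeholders):
./s3s3mirror.sh source-bucket destination-bucket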
I don't know of any other S3 command-line tools, but if nothing comes up here, it might be easiest to write your own.
Pick whatever language and Amazon SDK/toolkit you prefer. Then you just need to list the source bucket's contents and copy each object server-side (in parallel, obviously); see the sketch below.
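As an illustration only, here is a minimal sketch in Python with boto3 (not any particular tool's code; the bucket names and worker count are placeholder assumptions) that paginates through the source bucket and issues server-side copies from a thread pool:

from concurrent.futures import ThreadPoolExecutor

import boto3

SRC, DST = "bucket1", "bucket2"  # placeholder bucket names

s3 = boto3.client("s3")

def copy_key(key):
    # copy_object is a server-side copy: no data passes through this machine.
    # Note it is limited to objects up to 5 GB; larger objects need a multipart copy.
    s3.copy_object(Bucket=DST, Key=key, CopySource={"Bucket": SRC, "Key": key})

paginator = s3.get_paginator("list_objects_v2")
with ThreadPoolExecutor(max_workers=32) as pool:  # 32 workers is an arbitrary choice
    for page in paginator.paginate(Bucket=SRC):
        for obj in page.get("Contents", []):
            pool.submit(copy_key, obj["Key"])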
Looking at the source for s3cmd-modification (and I admit I know nothing about Python), it looks like they have not parallelised the bucket-to-bucket code, but perhaps you could use the standard parallel upload/download code as a starting point.
As this is about the first Google hit on this subject, I'm adding extra information.
'Cyno' made a newer version of s3cmd-modification, which now supports parallel bucket-to-bucket syncing. Exactly what I was waiting for as well.
The pull request is at https://github.com/pcorliss/s3cmd-modification/pull/2, and his version is at https://github.com/pearltrees/s3cmd-modification
For an ad hoc solution, use the AWS CLI to sync between buckets:
aws s3 sync s3://source-bucket s3://destination-bucket
aws s3 sync speed depends on:
- latency for an API call to the S3 endpoint
- the number of API calls made concurrently
To increase sync speed:
- run aws s3 sync from an AWS instance (c3.large on FreeBSD is OK ;-) )
- update ~/.aws/config with:
max_concurrent_requests = 128
max_queue_size = 8096
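For reference, these settings live under the s3 key of a profile section (as documented at the s3-config link above), so a minimal ~/.aws/config would look something like:
[default]
s3 =
  max_concurrent_requests = 128
  max_queue_size = 8096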
With this config and instance type, I was able to sync a bucket (309 GB, 72K files, us-east-1) within 474 seconds.
For a more generic solution, consider AWS Data Pipeline or S3 cross-region replication.
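As a rough sketch of the replication route (bucket names, account ID, and role name below are placeholders; both buckets must have versioning enabled first):
aws s3api put-bucket-versioning --bucket source-bucket --versioning-configuration Status=Enabled
aws s3api put-bucket-versioning --bucket destination-bucket --versioning-configuration Status=Enabled
aws s3api put-bucket-replication --bucket source-bucket --replication-configuration file://replication.json
where replication.json names an IAM role allowed to replicate and the destination bucket:
{
  "Role": "arn:aws:iam::123456789012:role/replication-role",
  "Rules": [
    {
      "Status": "Enabled",
      "Prefix": "",
      "Destination": { "Bucket": "arn:aws:s3:::destination-bucket" }
    }
  ]
}
Note that replication only copies objects written after it is enabled, so it won't duplicate a bucket's existing contents by itself.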