Optimizing Amazon S3 for Large-Scale Operations

With more than 15 years of continuous operation behind it, Amazon S3 is not only one of the oldest object storage services in the public cloud space but also the most widely used. Launched in 2006 by Amazon Web Services (AWS), the service has become an essential part of the connected society, working quietly behind the scenes to store the digital objects that the average person probably uses every day.

In fact, Amazon Simple Storage Service – popularly known as S3 – is so entrenched in our daily routines that almost every company that has a relationship with AWS has very likely used S3.

If you look at some of the notable logos that are part of this family, then you will see internet brands such as Netflix, Tumblr, Pinterest, Reddit and Mojang Studios (Minecraft’s developers) all leveraging the cloud architecture that the service provides. And these companies benefit from an easy-to-manage and cost-effective ecosystem that gives them what they need when they want it.

However, the apparent simplicity of S3 doesn’t tell the whole story.

Increased digitalization in both the private and public sector has seen elements such as static websites, website content, file storage, data lakes and data sinks all added to its capabilities, with companies of all sizes taking advantage of best-in-class scalability, data availability, security and performance. And because it is designed for 99.999999999% (11 nines) of data durability, Amazon S3 can be trusted to store any type of object.

Taking the above into account, there is a consensus that while S3 is easy to use for small scale applications, the growing demands of data analytics and machine learning are ensuring that large scale optimization will be at the forefront of future storage and retrieval decisions. In addition, there will be a need for companies of all sizes to understand where best to achieve an optimum performance with Amazon S3, especially in terms of security, compliance and connectivity.

What follows are some guidelines on how best to optimize Amazon S3 for your data storage needs and, importantly, the questions you should be asking a digital engineer.

Understand and Identify

This might sound simple, but the best way to optimize anything is through measurement. By understanding what the optimized performance can be and applying that to the required number of transactions, you can quickly get a grip on what you want from a service such as S3.

In this case, you will need to understand the network throughput between S3 and other AWS services such as Amazon Elastic Compute Cloud (EC2). This will include the CPU usage and RAM requirements of EC2, which you can measure and monitor by adding Amazon CloudWatch into the mix.

In addition, it is advisable to identify DNS lookup time and include the latency and data transfer speeds between servers and S3 buckets in your measurement strategy.
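As a starting point for that measurement strategy, here is a minimal sketch of the kind of timing helpers you might use. The function names are illustrative assumptions rather than part of any AWS SDK, and the fetch callable stands in for whatever S3 client call you actually make.

```python
import socket
import time
from typing import Callable, Tuple


def time_call(fn: Callable[[], bytes]) -> Tuple[float, int]:
    """Run fn once and return (elapsed_seconds, bytes_returned)."""
    start = time.perf_counter()
    payload = fn()
    return time.perf_counter() - start, len(payload)


def throughput_mb_per_s(num_bytes: int, seconds: float) -> float:
    """Convert a byte count and a duration into MB/s (1 MB = 1,000,000 bytes)."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return num_bytes / 1_000_000 / seconds


def dns_lookup_seconds(host: str, port: int = 443) -> float:
    """Time one DNS resolution of host. Requires network access."""
    start = time.perf_counter()
    socket.getaddrinfo(host, port)
    return time.perf_counter() - start
```

For example, an object of 16,000,000 bytes fetched in 2 seconds works out to throughput_mb_per_s(16_000_000, 2.0), which is 8.0 MB/s. Collecting these numbers per instance type and per Region gives you a baseline to compare against after each optimization.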

Getting Data Faster

When it comes to data analytics, the adage that faster delivery is usually better is certainly in play. As you might expect from any service that follows Amazon’s ecommerce blueprint, S3 has been built to satisfy the demand from end users to fetch objects with the shortest possible lag time.

Thanks to its highly scalable nature, S3 allows companies to achieve the necessary high throughput. The caveat is that you need to have a big enough pipe for the right number of instances, but that is something that can be attained with the use of a tool such as Amazon S3DistCp.

S3DistCp is an extension of Apache DistCp – a distributed copy capability built on top of a MapReduce framework – and is designed to move data to and from S3 faster.

You can learn more about S3DistCp in the Amazon EMR documentation, but what you need to know is that it uses a lot of workers and instances to achieve the required data transfer. In the Hadoop ecosystem, for example, DistCp is used to move data, and this extension has been optimized to work with S3, including the option to move data between HDFS and S3.

Efficient Network Connectivity Choices

Fast transfer of data is almost impossible if network connectivity is not up to the task. That means you need to pay attention to how the network is performing and, importantly, where throughput can be improved to cope with sharp swings in load.

On a global basis, S3 bucket names are unique but each bucket is stored in a region that is selected when you initially create the storage option. To optimize performance, it is essential that you always access the bucket from Amazon EC2 instances that are in the same AWS Region wherever possible.

It is also worth noting that EC2 instance types are a crucial part of this process as well. Some instance types have a higher bandwidth for network connectivity than others, so it is worth checking out the EC2Instances website to compare network bandwidth.

However, if servers are in a major data center but are not part of Amazon EC2, then you might consider using AWS Direct Connect ports to get a significantly higher bandwidth – note that you pay a fee per port. If you decide not to go down this route, then you can use S3 Transfer Acceleration to get data into AWS faster.

For static content – files that do not change in response to user actions – we recommend that you make use of Amazon CloudFront or another CDN with S3 as the origin.

Horizontal Connections

The need to spread requests across numerous connections is a commonly used design pattern when you are thinking about how best to horizontally scale performance. And that becomes even more important when you are building high performance applications.

It is best to think about Amazon S3 as a very large distributed system, as opposed to a single network endpoint – the traditional storage server model.

When we do this, optimized performance can be achieved by issuing multiple concurrent requests to S3. If possible, these requests can be spread over separate connections to maximize the accessible bandwidth at the service itself.
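The idea of issuing multiple concurrent requests can be sketched with a thread pool. This is a minimal illustration, not a tuned implementation: fetch_all and its fetch parameter are hypothetical names, and the worker count of 16 is an arbitrary assumption you would adjust after measuring.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable


def fetch_all(
    keys: Iterable[str],
    fetch: Callable[[str], bytes],
    max_workers: int = 16,
) -> Dict[str, bytes]:
    """Fetch every key concurrently and return {key: body}.

    `fetch` is any function that downloads one object. With boto3 it could
    wrap s3.get_object(Bucket=..., Key=key)["Body"].read(); that wiring is
    an assumption and is left to the caller.
    """
    key_list = list(keys)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so bodies line up with key_list.
        bodies = pool.map(fetch, key_list)
        return dict(zip(key_list, bodies))
```

Because the download function is passed in, the same helper can be exercised with a stub in tests and with a real S3 client in production.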

Byte-Range Fetches

As we know, there will be times when the client will only need a portion of the requested file and not the entire object stored. When this happens, we can set a Range HTTP header in a “GET Object” request, allowing us to fetch a byte-range and conserve bandwidth usage.

In the same way that we used concurrent connections to establish optimized performance through horizontal scaling, we can lean on the same concept here. Amazon S3 allows you to fetch different byte ranges from within the same object, helping to achieve higher aggregate throughput versus a single whole-object request.

Fetching smaller ranges of a large object also allows an application to improve the retry times when these requests are interrupted. For context, the typical sizes for byte-range requests are either 8 MB or 16 MB.
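To make the byte-range idea concrete, the sketch below splits an object into 8 MB ranges and formats the matching Range header values. The helper names are assumptions for illustration, and fetch_range stands in for a real ranged GET (for example, a client call that accepts a Range value).

```python
from typing import Callable, Iterator, Tuple

MB = 1024 * 1024


def byte_ranges(object_size: int, chunk_size: int = 8 * MB) -> Iterator[Tuple[int, int]]:
    """Yield inclusive (start, end) offsets that together cover the object."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, object_size, chunk_size):
        yield start, min(start + chunk_size, object_size) - 1


def range_header(start: int, end: int) -> str:
    """Format an HTTP Range header value; both offsets are inclusive."""
    return f"bytes={start}-{end}"


def fetch_in_ranges(
    object_size: int,
    fetch_range: Callable[[int, int], bytes],
    chunk_size: int = 8 * MB,
) -> bytes:
    """Download an object range by range and reassemble it in order."""
    return b"".join(fetch_range(s, e) for s, e in byte_ranges(object_size, chunk_size))
```

This version fetches the ranges one after another to keep the logic easy to follow. Since each range is independent, the same ranges could be issued over concurrent connections, and a failed range can be retried without re-downloading the rest of the object.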

In addition, if objects are PUT by using a multipart upload, it is always good practice to GET them in the same part sizes (or at least aligned to part boundaries) to optimize performance. GET requests can also directly address individual parts – for example, GET?partNumber=N.

If you need to take a deeper dive into how to retrieve objects from S3, there is a wealth of information in the Amazon S3 documentation.

Retry Requests

When you are initiating a large-scale fetch process or byte-range fetch request, it is prudent to set up a retry option on these requests. In most cases, aggressive timeouts and retries help drive consistent latency.

Considering the scale and reach of Amazon S3, common wisdom dictates that if the first request is slow, then a retry request is likely to take a different path and succeed quickly.
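A retry wrapper along these lines can be sketched as follows. It is a simplified illustration using exponential backoff with jitter, which is one common choice and not something the guidance above prescribes; with_retries and its parameters are hypothetical names. The AWS SDKs also ship their own retry handling, so in practice you may only need to configure that.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(
    operation: Callable[[], T],
    attempts: int = 4,
    base_delay: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call operation until it succeeds or `attempts` tries are used up."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            # Catching Exception keeps the sketch short; real code should
            # retry only on timeouts and other transient errors.
            if attempt == attempts - 1:
                raise
            # Random delay up to base_delay * 2^attempt spreads out retries.
            sleep(random.uniform(0, base_delay * 2 ** attempt))
    raise AssertionError("unreachable")
```

Pair this with a short per-request timeout in your client so that a slow first attempt is abandoned quickly and the retry gets its chance to take a faster path.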

S3 Transfer Acceleration

If the intention of this blog post is to give you guidelines and best practices, then here is one that should be set in stone.

If you need to transfer objects over long distances, always use Amazon S3 Transfer Acceleration.

The reason for this is simple: the feature provides fast, easy, and secure long-distance transfer of files between the client and the allocated S3 bucket. This is down to the fact that it makes use of globally distributed CloudFront edge locations, use of which drastically increases the speed of transfer.

In fact, we advise using the Amazon S3 Transfer Acceleration Speed Comparison tool to compare the results, both before and after using the feature.
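Once Transfer Acceleration has been enabled on a bucket, clients reach it through a dedicated endpoint. The sketch below builds that endpoint name; the function name and the example bucket are assumptions for illustration.

```python
def accelerate_endpoint(bucket: str, dual_stack: bool = False) -> str:
    """Return the Transfer Acceleration endpoint URL for a bucket.

    The bucket must already have Transfer Acceleration enabled, and its
    name must be DNS-compliant without periods.
    """
    if not bucket or "." in bucket:
        raise ValueError("bucket name must be non-empty and contain no periods")
    host = "s3-accelerate.dualstack" if dual_stack else "s3-accelerate"
    return f"https://{bucket}.{host}.amazonaws.com"


# "example-bucket" is a hypothetical bucket name.
print(accelerate_endpoint("example-bucket"))
```

If you use an AWS SDK, you normally do not build this URL by hand. In boto3, for instance, a client created with Config(s3={"use_accelerate_endpoint": True}) routes requests to the accelerate endpoint for you.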

Object Organization and Key Names

Although optimizing performance on Amazon S3 should be the goal, very few people know that its latency is heavily dependent on key names.

That should not come as a surprise, as Amazon has built an entire empire on key words and tagged content. However, in S3, request rates scale per key prefix: AWS currently documents at least 3,500 PUT/COPY/POST/DELETE and 5,500 GET/HEAD requests per second per prefix, so sustained traffic beyond that against keys that share a similar prefix can add a significant amount to the latency.

As we noted above, there is a defined trend towards more large-scale operations in S3 and it may be wise to consider the following:

  • Use naming schemes with more variability at the beginning of the key names to avoid internal “hot spots” within the infrastructure – for example, alphanumeric or hex hash codes in the first 6 to 8 characters
  • Incorporate a tagging mechanism

Either of these will help to increase the speed of the requests, but any naming convention or mechanism must be decided upon upfront. This would naturally include both folder organization and key naming of objects. Additionally, you should avoid nesting too many folders within a bucket, as this can make data crawling that much slower.
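The first naming scheme above, a hex hash at the start of the key, can be sketched like this. The function name and the example key are hypothetical, and SHA-256 is simply one convenient hash; any stable hash that spreads keys evenly would serve the same purpose.

```python
import hashlib


def hashed_key(original_key: str, prefix_len: int = 6) -> str:
    """Prepend a stable hex hash prefix to a key to vary its first characters."""
    if not 1 <= prefix_len <= 64:
        raise ValueError("prefix_len must be between 1 and 64")
    digest = hashlib.sha256(original_key.encode("utf-8")).hexdigest()
    return f"{digest[:prefix_len]}/{original_key}"


# "logs/2021/06/01/app.log" is a made-up key used only for illustration.
print(hashed_key("logs/2021/06/01/app.log"))
```

The trade-off is that hashed prefixes make keys harder to list in a meaningful order, so the original key is kept after the prefix and the mapping stays reproducible from the key alone.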

Security

Even though an optimum performance is always the desired result, achieving this can be for nothing if you have taken your eye off the security aspects of the data storage. As we move ever closer to a fully digitized society, cyber security and its attendant benefits must be taken into consideration when you are securing S3 buckets.

VentureBeat reported that an industry study of AWS S3 had found that 46% of buckets had been misconfigured and could be considered “unsafe.” And while the research was conducted by a cloud security provider and fell neatly into the category of self-serving, there is little doubt that companies need to limit risk wherever they can.

In our view, the following questions must be answered to secure S3 buckets.

Before an object is put into S3, you should ask yourself:

  • Who should be able to modify this data?
  • Who should be able to read this data?
  • How likely is it that these read/write permissions will need to change in the future?

Comparable questions should be asked when you are considering the encryption and compliance aspects of using S3 for data storage, namely:

  • Should the data be encrypted?
  • If data encryption is required, how are the encryption keys going to be managed?

Again, if you want to take a more in-depth look at S3 security, there is extensive coverage in the AWS documentation.
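Once you have answers to the access and encryption questions, they typically translate into bucket-level settings. The sketch below builds two such settings as plain dictionaries, in the shape that boto3's put_public_access_block and put_bucket_encryption calls accept. The helper names are assumptions, and the values shown are one conservative starting point rather than a recommendation for every workload.

```python
from typing import Optional


def public_access_block() -> dict:
    """Settings that block all public ACLs and public bucket policies."""
    return {
        "BlockPublicAcls": True,
        "IgnorePublicAcls": True,
        "BlockPublicPolicy": True,
        "RestrictPublicBuckets": True,
    }


def default_encryption(kms_key_id: Optional[str] = None) -> dict:
    """Default server-side encryption rule for a bucket.

    Without a key ID, S3-managed keys (SSE-S3) are used. With one, the
    caller-supplied KMS key is used instead, which puts key management
    under your control.
    """
    if kms_key_id is None:
        rule = {"SSEAlgorithm": "AES256"}
    else:
        rule = {"SSEAlgorithm": "aws:kms", "KMSMasterKeyID": kms_key_id}
    return {"Rules": [{"ApplyServerSideEncryptionByDefault": rule}]}
```

With boto3 these would be passed as PublicAccessBlockConfiguration and ServerSideEncryptionConfiguration respectively. Who may read or modify the data is then expressed separately through IAM and bucket policies.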

Compliance

Security and compliance go hand in hand. And while some data is completely non-sensitive and can be shared by anyone, other data – health records, personal information, financial details – is not only extremely sensitive but also enticing to a few black-hatted individuals in the digital ecosystem.

With that in mind, you must ask yourself the following as a bare minimum to ensure that you are optimizing for compliance:

  • Is there any data being stored that contains financial, health, credit card, banking or government identities?
  • Does the data need to comply with regulatory requirements such as HIPAA?
  • Are there any region-specific or localized data restrictions that must be taken into consideration?

Obviously, if data is sensitive then every effort must be made to ensure that it is not compromised by being stored in an unsafe S3 bucket. On the plus side, AWS is vocal about how “cloud security is the highest priority” and customers are told that they benefit from a data center and network architecture that have been built with security in mind.

Concluding Thoughts

After around 15 years as one of the leading exponents of safe, simple and secure object storage, Amazon S3 has garnered a reputation for delivering a product that makes web-scale computing that much easier. Much of that is down to the easy access to the interface and the ability to store and retrieve data anytime and from anywhere on the web.

As defining statements go, that is hard to beat but knowing this does not mean that companies should approach their storage solutions with a casual attitude.

Data storage is often the glue that holds everything together, so it becomes clear that picking the right partner to take you on your digital storage journey is a key element. Once that decision is made, then optimizing performance for large-scale operations through Amazon S3 should be an easier path to take.

To find out how Infostretch can help you navigate your digital journey in the cloud, contact us today. Alternatively, if you want to learn more about our partnership with AWS, please fill out the form below.
