Why does CloudFront evict objects from its cache within a few hours?

Question

CloudFront is configured to cache the images from our application. I found that the images were evicted from the cache very quickly. Since the images are generated dynamically on the fly, this puts a heavy load on our server. To investigate the problem, I set up a simple test case.

Origin headers

The image is served from our origin server with correct Last-Modified and Expires headers.

CloudFront cache behavior

Since the site is HTTPS only, I set the Viewer Protocol Policy to HTTPS Only. Forward Headers is set to None and Object Caching to Use Origin Cache Headers.

Initial image request

I requested the image at 11:25:11. This returned the following status and headers:

Code: 200 (OK)
Cached: No
Expires: Thu, 29 Sep 2016 09:24:31 GMT
Last-Modified: Wed, 30 Sep 2015 09:24:31 GMT
X-Cache: Miss from cloudfront

Subsequent request

A reload shortly afterwards (11:25:43) returned the image with:

Code: 304 (Not Modified)
Cached: Yes
Expires: Thu, 29 Sep 2016 09:24:31 GMT
X-Cache: Hit from cloudfront

Request a few hours later

Almost three hours later (at 14:16:11) I went to the same page and the image loaded with:

Code: 200 (OK)
Cached: Yes
Expires: Thu, 29 Sep 2016 09:24:31 GMT
Last-Modified: Wed, 30 Sep 2015 09:24:31 GMT
X-Cache: Miss from cloudfront

Since the image was still cached by the browser, it loaded quickly. But I cannot understand why CloudFront could not return the cached image. As a result, the application had to generate the image again.

I read that CloudFront evicts files from its cache after a few days of inactivity. As shown above, that is not what happened here. How could this be?

caching image amazon-cloudfront

richard 30 sept '15 at 18:44

1 answer

Michael - sqlbot · Answer 1 · 2015-10-01T02:31:05+0000

"I read that CloudFront evicts files from its cache after a few days of inactivity."

Do you have an official source for this?

Here is the official answer:

If an object in an edge location isn't frequently requested, CloudFront might evict the object (remove the object before its expiration date) to make room for objects that have been requested more recently.
http://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/Expiration.html

There is no guaranteed retention time for cached objects, and objects with low demand are more likely to be evicted ... but that is not the only factor you may not have considered. Eviction may not be the issue, or may not be the only issue.

Objects cached by CloudFront are a bit like Schrödinger's cat. It's a loose analogy, but I'm running with it: whether an object is "in the CloudFront cache" at any given moment is not a yes-or-no question.

CloudFront has somewhere around 53 edge locations (where your browser connects and the content is physically stored) in 37 cities. Some major cities have 2 or 3. Each request that hits CloudFront is routed (via DNS) to the theoretically most optimal location; for simplicity, we'll call it the "closest" edge to where you are.

The internal workings of CloudFront are not public information, but the general consensus, based on observations and presumably authoritative sources, is that these edge locations are all independent. They do not share caches.

If, for example, you are in Texas (US) and your request was routed through and cached in Dallas / Fort Worth, TX, and if the odds are equal that any request from you could hit either of the Dallas edge locations, then until you have had two misses on the same object, the odds are about 50/50 that your next request will be a miss. If I request that same object from my location, which I know from experience tends to route through South Bend, IN, then the odds of my first request being a miss are 100%, even though the object is cached in Dallas.
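The odds described above can be sketched with a short model. This is only an illustration: it assumes each request is routed uniformly at random to one of `num_caches` independent caches and that nothing is ever evicted. Both are simplifying assumptions, not documented CloudFront behavior, and the function names are hypothetical.

```python
import random


def miss_probability(num_caches: int, warm_caches: int) -> float:
    """Chance that the next request misses, given uniform routing across
    independent caches of which `warm_caches` already hold the object."""
    return (num_caches - warm_caches) / num_caches


def simulate_hit_ratio(num_caches: int, num_requests: int, seed: int = 0) -> float:
    """Replay `num_requests` requests for one object and return the hit ratio."""
    rng = random.Random(seed)
    warm = set()  # indices of caches that already hold the object
    hits = 0
    for _ in range(num_requests):
        cache = rng.randrange(num_caches)  # assumed uniform routing
        if cache in warm:
            hits += 1
        else:
            warm.add(cache)  # a miss fills that one cache only
    return hits / num_requests


# Two Dallas caches, one already warm: the 50/50 case from the text.
print(miss_probability(2, 1))
# A request arriving at an edge that has never seen the object always misses.
print(miss_probability(1, 0))
```

With no eviction, a single object can miss at most once per cache in this model, so the misses only stand out when an object is requested very few times.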

Thus, an object is not simply "in the cache" or "not in the cache", because there is no single, global cache.

It is also possible that CloudFront's determination of the "closest" edge to your browser will change over time.

CloudFront's mechanism for determining the closest edge appears to be dynamic and adaptive. Changes in the topology of the Internet at large can shift which edge location tends to receive the requests sent from a given IP address, so it is quite possible that, over the course of a few hours, the edge you connect to will change. Maintenance, outages, or other problems affecting a particular edge can also cause requests from a given source IP address to be sent to a different edge than usual, and this may also give you the impression that objects are being evicted, since the new edge's cache will be different from the old one.

Looking at the response headers, it isn't possible to determine which edge location handled each request. However, this information is provided in the CloudFront access logs.

I have a fetch-and-resize image service that handles around 750,000 images per day. It sits behind CloudFront, and my hit/miss ratio is about 50/50. That is certainly not all CloudFront's fault, since my pool of images exceeds 8 million, the viewers are all over the world, and my max-age directive is shorter than yours. It has been quite a while since I last analyzed the logs to determine which and how many "misses" seemed unexpected (when I did, there were definitely some, but their number was not unreasonable). That analysis is easy enough to do, since the logs tell you whether each response was a hit or a miss, as well as identifying the edge location ... so you could analyze them to see whether there is really a pattern here.
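The kind of log analysis mentioned above might look roughly like the following sketch. It assumes CloudFront's standard tab-separated access log format, where a `#Fields:` header line names the columns and the `x-edge-location` and `x-edge-result-type` columns are present. The sample lines are made up and use an abbreviated field list purely for illustration; real log files carry many more columns and are delivered gzip-compressed.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List


def tally_by_edge(lines: Iterable[str]) -> Dict[str, Counter]:
    """Count x-edge-result-type values (Hit, Miss, ...) per x-edge-location."""
    fields: List[str] = []
    tally: Dict[str, Counter] = defaultdict(Counter)
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("#Fields:"):
            # Column names come from the header, so positions are not hardcoded.
            fields = line[len("#Fields:"):].split()
            continue
        if not line or line.startswith("#") or not fields:
            continue  # skip other comments, blanks, and rows before the header
        row = dict(zip(fields, line.split("\t")))
        tally[row["x-edge-location"]][row["x-edge-result-type"]] += 1
    return dict(tally)


# Made-up sample data with an abbreviated, illustrative field list.
SAMPLE_LOG = [
    "#Version: 1.0",
    "#Fields: date time x-edge-location cs-uri-stem x-edge-result-type",
    "2015-10-01\t08:00:01\tDFW3\t/img/a.jpg\tMiss",
    "2015-10-01\t08:00:30\tDFW3\t/img/a.jpg\tHit",
    "2015-10-01\t11:00:05\tDFW50\t/img/a.jpg\tMiss",
]

for edge, counts in sorted(tally_by_edge(SAMPLE_LOG).items()):
    print(edge, dict(counts))
```

If the misses for one object cluster on edge locations that had not served it before, eviction is probably not the explanation.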

My service stores all of its output content in S3, and when a new request arrives, it first sends a quick request to the S3 bucket to see whether there is work that can be avoided. If S3 returns a result, that result is returned to CloudFront instead of doing all the fetching and resizing work again. Mind you, I did not implement that capability because of the number of CloudFront misses ... I designed it in from the very beginning, before I had even tested the service behind CloudFront, because, after all, CloudFront is a cache, and the contents of a cache are pretty much volatile and ephemeral, by definition.
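The check-the-store-first approach described above can be sketched as follows. This is not the answerer's actual code: `get_or_render` and `fake_resize` are hypothetical names, and a plain dict stands in for the S3 bucket so the example stays self-contained. In a real service the `store` would be a thin wrapper around S3 reads and writes.

```python
from typing import Callable, Dict, List, MutableMapping


def get_or_render(
    key: str,
    store: MutableMapping[str, bytes],
    render: Callable[[str], bytes],
) -> bytes:
    """Return the stored output for `key`; render and store it only if absent."""
    cached = store.get(key)
    if cached is not None:
        return cached  # work avoided: reuse the earlier output
    result = render(key)  # the expensive fetch-and-resize step
    store[key] = result   # persist it so later CloudFront misses are cheap
    return result


# Illustrative usage: a dict stands in for the bucket, and the "renderer"
# just records how often it was called.
render_calls: List[str] = []


def fake_resize(key: str) -> bytes:
    render_calls.append(key)
    return ("resized:" + key).encode()


bucket: Dict[str, bytes] = {}
get_or_render("photo-1/200x200", bucket, fake_resize)
get_or_render("photo-1/200x200", bucket, fake_resize)
print(len(render_calls))
```

The point of the design is that a CloudFront miss then costs one storage lookup rather than a full regeneration of the image.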

Update: I stated above that it does not appear possible to identify the edge location forwarding a particular request by examining the request headers from CloudFront ... however, it appears that it is possible, with some degree of accuracy, by examining the source IP address of the incoming request.

For example, a test request sent to one of my origin servers through CloudFront arrives from 54.240.144.13 if I hit my site from home, or from 205.251.252.153 when I hit the site from my office. The two locations are only a few miles apart, but they are on opposite sides of a state boundary and use two different ISPs. A reverse DNS lookup of these addresses shows these hostnames:

 server-54-240-144-13.iad12.r.cloudfront.net.
 server-205-251-252-153.ind6.r.cloudfront.net.

CloudFront edge locations are named after the nearest major airport, plus an arbitrarily chosen number. For iad12 ... "IAD" is the International Air Transport Association (IATA) code for Washington, DC Dulles airport, so this is likely to be one of the edge locations in Ashburn, VA (which has three, presumably with different numeric codes at the end, but I can't confirm that from this data alone). For ind6, "IND" matches the airport at Indianapolis, Indiana, so this strongly suggests that this request comes through the South Bend, IN, edge location. The reliability of this test will depend on how consistently CloudFront maintains its reverse DNS entries. Also, it is not documented how many independent caches there might be at any given edge location; the assumption is that there is only one, but there might be more than one, which would increase the miss ratio for very small numbers of requests while disappearing into the mix for large numbers of requests.
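Pulling the edge code out of such a hostname could be sketched like this. The pattern is inferred only from the two hostnames shown above, so treat it as an assumption rather than a documented format; `parse_edge_hostname` is a hypothetical helper name.

```python
import re
from typing import Optional, Tuple

# Pattern inferred from the two example hostnames; not a documented format.
_EDGE_HOST = re.compile(
    r"^server-(\d+)-(\d+)-(\d+)-(\d+)\.([a-z]{3})(\d+)\.r\.cloudfront\.net\.?$"
)


def parse_edge_hostname(hostname: str) -> Optional[Tuple[str, str, str]]:
    """Return (ip, airport_code, number) for a CloudFront-style reverse DNS
    name, or None if the name does not match the assumed pattern."""
    match = _EDGE_HOST.match(hostname.strip().lower())
    if match is None:
        return None
    ip = ".".join(match.group(1, 2, 3, 4))
    return ip, match.group(5).upper(), match.group(6)


for name in (
    "server-54-240-144-13.iad12.r.cloudfront.net.",
    "server-205-251-252-153.ind6.r.cloudfront.net.",
):
    print(parse_edge_hostname(name))
```

On the origin server, the hostname itself could be obtained from the request's source address with a reverse lookup such as `socket.gethostbyaddr`, subject to the same caveat about how consistently the reverse DNS entries are maintained.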
