CloudFormation template for creating an ECS service is stuck in CREATE_IN_PROGRESS

I am creating an AWS ECS service using CloudFormation.

Everything seems to complete successfully: I see that the instance is attached to the load balancer, the load balancer reports the instance as healthy, and if I hit the load balancer, I successfully reach my running container.

Looking at the ECS console, I see that the service has stabilized and everything looks fine. I also see that the container is stable and is not being terminated and re-created.

However, the CloudFormation stack never completes. It stays in CREATE_IN_PROGRESS for about 30-60 minutes and then rolls back, claiming that the service did not stabilize. Looking at CloudTrail, I can see several RegisterInstancesWithLoadBalancer calls made by ecs-service-scheduler, all with the same parameters, i.e. the same instance ID and load balancer. I am using the standard IAM roles and permissions for ECS, so this should not be a permissions issue.

Has anyone had a similar problem?

+24
8 answers

Your AWS::ECS::Service needs to register the full ARN for the TaskDefinition (source: see the answer from ChrisB@AWS on the AWS forums). The key is to set TaskDefinition using the full ARN, including the revision. If you omit the revision (:123 in the example below), the latest revision is used, but CloudFormation still goes out to lunch with CREATE_IN_PROGRESS for about an hour before failing. Here is one way to do this:

    "MyService": {
        "Type": "AWS::ECS::Service",
        "Properties": {
            "Cluster": { "Ref": "ECSClusterArn" },
            "DesiredCount": 1,
            "LoadBalancers": [
                {
                    "ContainerName": "myContainer",
                    "ContainerPort": "80",
                    "LoadBalancerName": "MyELBName"
                }
            ],
            "Role": { "Ref": "EcsElbServiceRoleArn" },
            "TaskDefinition": {
                "Fn::Join": ["", ["arn:aws:ecs:", { "Ref": "AWS::Region" }, ":", { "Ref": "AWS::AccountId" }, ":task-definition/my-task-definition-name:123"]]
            }
        }
    }

Here is a handy way to grab the latest revision of MyTaskDefinition via the AWS CLI and jq:

    aws ecs list-task-definitions --family-prefix MyTaskDefinition --sort DESC --max-items 1 | jq --raw-output '.taskDefinitionArns[0]'
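
For anyone who would rather do the revision lookup in code than with jq, here is a minimal Python sketch. It assumes you already have the list of ARNs that `aws ecs list-task-definitions` prints under `taskDefinitionArns`; the function names `revision_of` and `latest_task_definition_arn` and the sample ARNs are made up for illustration. Revisions are compared as numbers, because a plain string sort would rank :9 above :123.

```python
from typing import List


def revision_of(arn: str) -> int:
    """Return the numeric revision at the end of a task definition ARN."""
    # A task definition ARN ends in "<family>:<revision>".
    return int(arn.rsplit(":", 1)[1])


def latest_task_definition_arn(arns: List[str]) -> str:
    """Pick the ARN with the highest revision number (hypothetical helper)."""
    if not arns:
        raise ValueError("no task definition ARNs given")
    return max(arns, key=revision_of)


# Sample data only; account ID and family name are placeholders.
example_arns = [
    "arn:aws:ecs:us-west-2:123456789012:task-definition/my-task-definition-name:9",
    "arn:aws:ecs:us-west-2:123456789012:task-definition/my-task-definition-name:10",
    "arn:aws:ecs:us-west-2:123456789012:task-definition/my-task-definition-name:123",
]

print(latest_task_definition_arn(example_arns))
```

The resulting string is the kind of full ARN, revision included, that goes into the TaskDefinition property of the service.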
+18

There is no need to register the full ARN for TaskDefinition, because when you pass the logical ID of this resource to the intrinsic Ref function, Ref returns the Amazon Resource Name (ARN).

In the following example, the Ref function returns the ARN of the MyTaskDefinition task, for example arn:aws:ecs:us-west-2:123456789012:task/1abf0f6d-a411-4033-b8eb-a4eed3ad252a.

{"Ref": "MyTaskDefinition"}

Source: http://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-resource-ecs-taskdefinition.html

+8

I think I had a similar problem. Try looking at the DesiredCount property in your Service template. I think CloudFormation will report the create/update as still in progress until the service reaches DesiredCount running tasks in your cluster.
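
To make the DesiredCount point concrete, here is a rough Python sketch of the kind of check involved. It assumes that "stable" means a single active deployment whose runningCount equals desiredCount, which as far as I know is what `aws ecs wait services-stable` polls for; whether CloudFormation applies exactly the same rule is an assumption. The function name `is_service_stable` and the sample dictionaries are hypothetical; the field names follow the `aws ecs describe-services` output.

```python
def is_service_stable(service: dict) -> bool:
    """Return True if a describe-services entry looks stable.

    Assumption: stable = exactly one deployment and
    runningCount == desiredCount.
    """
    deployments = service.get("deployments", [])
    return (
        len(deployments) == 1
        and service.get("runningCount") == service.get("desiredCount")
    )


# Sample entries shaped like items of the "services" list (made-up values).
still_starting = {
    "serviceName": "myService",
    "desiredCount": 2,
    "runningCount": 1,
    "deployments": [{"status": "PRIMARY"}],
}
settled = {
    "serviceName": "myService",
    "desiredCount": 2,
    "runningCount": 2,
    "deployments": [{"status": "PRIMARY"}],
}

print(is_service_stable(still_starting))
print(is_service_stable(settled))
```

If runningCount never catches up with desiredCount, a check like this never passes, which matches the stack sitting in CREATE_IN_PROGRESS until it times out.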

+6

I found another related scenario that triggers this, and I thought I'd put it here in case anyone else runs into it. If you define a TaskDefinition whose ContainerDefinitions reference an Image that does not actually exist, and then you try to run that TaskDefinition as a Service, you will hit the same hanging problem (or at least something that looks like the same problem).

NOTE: All of the YAML fragments below come from a single CloudFormation template.

So, as an example, I created this Repository:

    MyRepository:
      Type: AWS::ECR::Repository

And then I created this Cluster:

    MyCluster:
      Type: AWS::ECS::Cluster

And this TaskDefinition (abbreviated):

    MyECSTaskDefinition:
      Type: AWS::ECS::TaskDefinition
      Properties:
        # ...
        ContainerDefinitions:
          # ...
          Image: !Join ["", [!Ref "AWS::AccountId", ".dkr.ecr.", !Ref "AWS::Region", ".amazonaws.com/", !Ref MyRepository, ":1"]]
          # ...

Based on the above, I went ahead and created a Service as follows:

    MyECSServiceDefinition:
      Type: AWS::ECS::Service
      Properties:
        Cluster: !Ref MyCluster
        DesiredCount: 2
        PlacementStrategies:
          - Type: spread
            Field: attribute:ecs.availability-zone
        TaskDefinition: !Ref MyECSTaskDefinition

That all seemed reasonable to me, but it turns out there are two problems with this as written and deployed, and together they made it hang.

  • DesiredCount is set to 2, which means CloudFormation will actually try to deploy the service and start it, rather than just defining it. If I had set DesiredCount to 0, it would have been fine.
  • The Image defined in MyECSTaskDefinition does not exist yet. I created the repository as part of this template, but I had not actually pushed anything to it. So when MyECSServiceDefinition tried to spin up the DesiredCount of 2 tasks, it hung because the image was not available in the repository (the repository had only just been created in the same template).

So, for now, the solution is to create the CloudFormation stack with a DesiredCount of 0 for the Service, push the appropriate Image to the repository, and then update the CloudFormation stack to scale up the service. Alternatively, have one template that sets up the core infrastructure such as the repository, push builds to it, and then run a separate template that sets up the Services themselves.

Hope this helps anyone who has this problem!

+6

Anything that prevents the ECS service from reaching its desired count will cause this. One example is missing permissions in the policies attached to the role used by the instances. Check the ECS agent logs on the instances (/var/log/ecs/ecs-agent.log.timestamp).

Another example: the instances do not have enough memory to accommodate the requested desired count. The service events then show something like the following:

"... myService was unable to place a task because no container instance met all of its requirements. The closest matching container-instance 123456789 has insufficient memory available ..."

+2

I had the same problem. I solved it by adjusting the amount of memory allocated to the task definition.

The containers you use must not exceed the available memory on your ECS instance.
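
The memory constraint can be sanity-checked with a bit of arithmetic before deploying. The Python sketch below is a deliberate simplification: it only looks at memory and ignores CPU, ports, and placement strategies, all of which ECS also considers. The function name `max_placeable_tasks` and the numbers are made up for illustration.

```python
from typing import List


def max_placeable_tasks(free_memory_mib: List[int], task_memory_mib: int) -> int:
    """Estimate how many copies of a task fit, looking at memory only.

    free_memory_mib: remaining memory on each container instance, in MiB.
    task_memory_mib: total memory reserved by one task, in MiB.
    """
    if task_memory_mib <= 0:
        raise ValueError("task_memory_mib must be positive")
    # A task cannot be split across instances, so count per instance.
    return sum(free // task_memory_mib for free in free_memory_mib)


desired_count = 2
placeable = max_placeable_tasks([993], task_memory_mib=512)

if placeable < desired_count:
    print(f"only {placeable} of {desired_count} tasks fit; the service cannot stabilize")
else:
    print("enough memory for the desired count")
```

In this made-up case a single instance with 993 MiB free can host only one 512 MiB task, so a desired count of 2 can never be reached until the task reservation shrinks or more capacity is added.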

0

To add another possibility, I ran into this problem once when everything was fine with the template, the desired task count equaled the running task count, and so on. It turned out that one of the underlying EC2 instances was stuck at about 100% CPU (even though EC2 reported it as "healthy"). This prevented CloudFormation from validating that particular instance. I killed the bad EC2 instance, and ECS spun up a genuinely healthy one.

0

To add another data point, I have seen AWS::ECS::Service get permanently stuck in CREATE_IN_PROGRESS unless the Docker image is a) available in the ECR repository and b) passing its health check.

I tried several times to bring up the AWS::ECS::Service with a valid-image-hash-but-failing-health-check container, then fix the image and do various "set the desired count to zero", "set it back", etc. dances, and AFAICT nothing would get it unstuck.

Ultimately, I had to delete the stack and start over with an image that passes the health check right away. Then it works fine.

Super flaky.

0

Source: https://habr.com/ru/post/1232059/

