Designing a Terraform language feature like Terraform Actions

Disclaimer: I am working at HashiCorp (now IBM) as part of the Terraform Core team. The postings on this site are my own and don’t necessarily represent IBM’s positions, strategies or opinions.
Since I am involved in Terraform my opinions can sometimes be (unconsciously) biased. I hope you enjoy the post anyway.

What are Terraform Actions?

Terraform Actions are a new block in the Terraform language that allows you to express non-CRUD operations in your configuration. Please see Introduction to Terraform Actions for a detailed introduction.

As an aside: Martin Atkins has an interesting article on language design in the context of Terraform called Evolving the Terraform Language. I can recommend it to anyone interested in language design.

Fundamental Design Principles

When designing Terraform Actions, we had a few basic design principles in mind.

Specificity

Actions should be very specific and do one task really well. We did not want to write a general-purpose scripting language within Terraform, but create building blocks to automate common non-CRUD tasks. By letting providers define their own actions, we can ensure that actions are tailored to the specific needs of each provider.

Simplicity

Actions should be very easy to use. To achieve that, we made them as simple as possible:

They don’t have state, which makes them easy to reason about: they always do the same thing.
They cannot be read, making it clear that when they are referenced in the configuration, it is always in the context of when they are executed. You can’t use an action in a place where a value is expected.
They are not aware of their caller. This means they run the very same way, no matter if they are called from a resource’s lifecycle or through the -invoke flag. They don’t take any inputs, so there is no way for a caller to pass any arguments. One action declaration will behave the same way no matter where it is called from.

All of this is done to ensure that Actions are easy to understand and read.
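As a rough illustration of these properties, here is a minimal sketch. The example_action and example_resource types and their attributes are hypothetical placeholders, not real provider schemas.

```hcl
# Hypothetical action type; the config block holds everything the action needs.
# There is no state and no argument a caller could pass in.
action "example_action" "notify" {
  config {
    message = "deployment finished"
  }
}

resource "example_resource" "app" {
  attribute = "foo"

  lifecycle {
    action_trigger {
      # The action is only ever referenced in the context of *when* it runs.
      events  = [after_create, after_update]
      actions = [action.example_action.notify]
    }
  }
}

# Not valid: an action cannot be read, so it cannot appear
# where a value is expected.
# output "result" {
#   value = action.example_action.notify
# }
```

The same action declaration behaves identically whether it is triggered by the resource above or invoked on its own.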

CLI Invoke

This one is a bit simpler than the rest: every action can be invoked directly through the CLI. This tenet allows users to retry failed actions in an otherwise successful run. This is particularly useful when debugging and troubleshooting issues or when the provider API in question is not working as expected.
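A sketch of what such a retry could look like, with the command shown as a comment next to a hypothetical action. It assumes the -invoke flag takes the action's address in the form action.TYPE.NAME.

```hcl
# Hypothetical action that failed during an otherwise successful run.
action "example_action" "hello" {
  config {
    attribute = "foo"
  }
}

# Re-run just this action, independent of any resource lifecycle
# (address format is an assumption):
#
#   terraform apply -invoke=action.example_action.hello
```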

Reducing Complexity

One important aspect of bringing a new innovative feature into a very mature language is to reduce the scope of the feature. It needs to still be useful and easy to understand, while allowing for flexibility and extensibility. The more user feedback you get and the more you observe open source adoption of the feature, the better you can design the next iteration.

Here are some design decisions we made to reduce the initial complexity of Actions. None of these decisions are set in stone, and we are open to feedback and suggestions. We will definitely rethink all of them, so don’t count them in or out yet for the long-term future.

Actions are scoped to a single module

The CLI -invoke flag can invoke actions from any module, but actions can only be triggered through lifecycle action_triggers when the action is defined in the same module as the triggering resource. This enforces the boundary between modules and removes the option to produce spaghetti code by relying on actions defined in other modules. It also rules out some nuanced and genuinely helpful uses of the feature. This rigid approach should work well for most use cases, and it is easy enough to expose actions across module boundaries in some way later on.

Actions can not affect the state of the resources

For some conceivable actions, it would make sense to let them affect the state of resources. Supporting this would mean that such actions either cannot be triggered by action_triggers at all, or can only be triggered if the resource they change is also the triggering resource and the change doesn’t contradict that resource’s configuration.

This is a lot of ifs and whens for a limited set of actions. By removing this possibility from the equation, we can simplify the design and make it easier to understand and use.

While changing the resource’s state is helpful in some cases, where e.g. a computed unconfigured attribute is changed by an action, a simple refresh run after the action is done will reconcile the state of the resource with the remote state, so it’s already easy enough to work around right now.

The freedom providers have in their implementations adds a further difficulty. One action might change a computed attribute that another action reads in order to derive the new value of a different computed attribute. This is a classic data race, so actions that can affect the state of resources would need a deterministic order of execution.

Parallelism

Currently, all actions for a resource run sequentially: first in the order of the action_trigger blocks, then in the order of the actions list within each block. The only guarantee we make, though, is that the actions within a single action_trigger block run in the order listed, so in theory we could run separate action_trigger blocks in parallel. Whether that would be preferable to the current approach depends on the use cases and on how complex the conditions are.
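To make the ordering concrete, here is a small sketch with two trigger blocks. All type and action names are hypothetical.

```hcl
action "example_action" "first" {
  config {
    attribute = "1"
  }
}

action "example_action" "second" {
  config {
    attribute = "2"
  }
}

action "example_action" "third" {
  config {
    attribute = "3"
  }
}

resource "example_resource" "a" {
  attribute = "foo"

  lifecycle {
    # Guaranteed: "first" runs before "second" (same actions list).
    action_trigger {
      events = [after_create]
      actions = [
        action.example_action.first,
        action.example_action.second,
      ]
    }

    # Current behavior: "third" runs after the block above has finished.
    # Not guaranteed: this block could in theory run in parallel with it.
    action_trigger {
      events  = [after_create]
      actions = [action.example_action.third]
    }
  }
}
```

If the relative order of two actions matters, placing them in the same actions list is the only arrangement that relies on the guaranteed ordering.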

`before_destroy` / `after_destroy` events

We don’t have action events associated with destroying resources. When destroying a resource, all dependency edges related to that resource are reversed. This is necessary to ensure that dependent resources are destroyed before the resource they depend on. Take this configuration as an example:

resource "aws_vpc" "main" {
  cidr_block = "10.0.0.0/16"
}

resource "aws_subnet" "main" {
  vpc_id     = aws_vpc.main.id
  cidr_block = "10.0.0.0/24"
}

When creating this configuration, we want to create the VPC first and then the subnet. When destroying the configuration, we want to destroy the subnet first and then the VPC, so that each resource can run through its own destruction lifecycle and no resource is destroyed implicitly as part of another resource’s remote object.

This creates some interesting challenges graph-wise, especially when only parts of the graph are reversed, which can lead to cyclic references. Because of this, one needs to design this feature carefully to ensure that the graph is always acyclic, likely by introducing a set of restrictions on before_destroy / after_destroy triggered actions.

This heightened complexity, together with the uncertainty about whether this is useful for the majority of users, is why we decided not to implement this feature at this time.

Natural Limitations

There were some more detailed decisions to be made when encountering the “natural” limitations of Terraform as a language and the domain of imperative actions together.

Not all reference cycles are created equal

When working with Terraform you might have encountered cyclic references like this:

resource "example_resource" "a" {
  attribute = example_resource.b.id
}

resource "example_resource" "b" {
  attribute = example_resource.a.id
}

This yields an error because Terraform cannot determine the order in which to create the resources.

When actions are involved, the order is determined by the event that triggers the action. Actions triggered by before_create and before_update run before the resource is created or updated; actions triggered by after_create and after_update run afterwards. This means we have more leeway with actions causing reference cycles, since we don’t use the references to determine the order of execution.

This configuration would be valid, although you can see references that look cyclic:

action "example_action" "hello" {
  config {
    attribute = example_resource.a.attribute
  }
}

resource "example_resource" "a" {
  attribute = "foo"

  lifecycle {
    action_trigger {
      events  = [after_create]
      actions = [action.example_action.hello]
    }
  }
}

But if you change after_create to before_create, it will result in a cyclic dependency and error.

In the case of before_create, this is intuitively logical as well: How are we supposed to read a value of a resource that hasn’t been created yet?

In the case of before_update, one could argue that it is possible to e.g. read the old value from state or only read the config values. But these are already two possible solutions, and there might be even more, so it is not intuitively clear what the result would be. Because of this, we decided to also disallow these cycles in before_update triggered actions, effectively only allowing them in after_create and after_update triggered actions.

One could also debate that these cyclic references might be confusing since this is the only place we would allow them in Terraform. But this is also the only place where we have somewhat imperative code and where the timing of something is defined through specific attributes in the configuration. We validated this decision quite thoroughly and believe it is the right choice.

Using `self` in conditions

This runs into a similar problem as before_create / before_update triggered actions: it is not intuitively clear what the value of self would be before the resource has been created / updated. With self in conditions, we wanted to leave ourselves open to the possibility of designing a better solution in the future; therefore, we decided to disallow it for now.

If you need a workaround: With the after_create / after_update actions you can reference the triggering resource directly in the condition.
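A minimal sketch of that workaround, using hypothetical types and the condition argument of the trigger block:

```hcl
action "example_action" "hello" {
  config {
    attribute = "static value"
  }
}

resource "example_resource" "a" {
  attribute = "foo"

  lifecycle {
    action_trigger {
      events = [after_create, after_update]
      # Instead of self.attribute, name the triggering resource explicitly.
      condition = example_resource.a.attribute == "foo"
      actions   = [action.example_action.hello]
    }
  }
}
```

Because the events are after_create / after_update, the resource already exists when the condition is evaluated, so there is no ambiguity about which value is read.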

Using `count.index` / `each.key` / `each.value` in conditions of `before_create` / `before_update` triggered actions

Currently, before_create / before_update triggered actions are not allowed to use count.index / each.key / each.value in conditions. The reason is that before_create / before_update triggered actions currently run before the entire resource lifecycle. This means they also run before the for_each / count meta-arguments have been evaluated, so the values of count.index, each.key and each.value are not available yet.

Lifting this restriction is theoretically possible; how we prioritize it depends on the feedback we receive from the community.
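The following sketch shows the shape of configuration this restriction is about. The types are hypothetical, and it assumes that count.index is usable in conditions of after_create triggers, which the restriction on before_create / before_update implies but does not state outright.

```hcl
action "example_action" "hello" {
  config {
    attribute = "static value"
  }
}

resource "example_resource" "worker" {
  count     = 3
  attribute = "worker-${count.index}"

  lifecycle {
    action_trigger {
      # Assumed to be accepted: the trigger runs after the instance exists.
      events    = [after_create]
      condition = count.index == 0
      actions   = [action.example_action.hello]
    }

    # Rejected today: with before_create / before_update, count.index
    # is not available when the condition would be evaluated.
    # action_trigger {
    #   events    = [before_create]
    #   condition = count.index == 0
    #   actions   = [action.example_action.hello]
    # }
  }
}
```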

Actions have no `depends_on` meta-argument

The depends_on meta-argument defines dependencies between resources without requiring references in the configuration; it therefore controls the order in which resources are processed. For actions, depends_on could be used in one of two ways: to validate that the order of actions in each resource’s action_trigger actions list is correct, or to automatically add the dependencies between the actions in that list. Personally, I would prefer the first solution as it is more explicit, but that makes it more of a safeguard, which also means one can make do without it. Since we want all actions to be invokable through the CLI, we would also need to either change the semantics for CLI invocation (not ideal) or allow multiple actions to be invoked at once through the CLI. All in all, this is solvable, but we need to understand better what people would want to use the feature for. Adding it just because it’s possible and present for other block types might not be the best idea.

Please leave a comment

Do you have feedback around actions or do you want to hear about a specific topic around Terraform / Software Development / Language Design / Infrastructure as Code? Please let me know in the comments below.