
Terraform Modules & Stacks: A walk through the runtimes of Terraform

Disclaimer: I am working at HashiCorp (now IBM) as part of the Terraform Core team. The postings on this site are my own and don’t necessarily represent IBM’s positions, strategies or opinions.
Since I am involved in Terraform, my opinions can sometimes be (unconsciously) biased. I hope you enjoy the post anyway.

If you haven't heard about Terraform Stacks yet, please check out "Terraform Stacks: In under 50 words".


Background

Terraform Stacks has its own language (.tfstacks.hcl) for defining your high-level infrastructure. This language contains blocks, e.g. for defining components. Each component is backed by a Terraform root module (the kind of module you run terraform plan in), which the Terraform module runtime executes. Martin Atkins wrote a blog post about the Stacks language; in this post I’d like to compare the Stacks runtime with the module runtime from an implementation point of view. Both take different but similar approaches to the problem Terraform is solving.

Terraform Module Runtime

A great place to start learning the internals of how Terraform works is the Architecture.md file in the Terraform repository. Instead of going through this document, I will give you a brief overview of the Terraform module runtime using the example of a terraform plan operation.

Architecture overview

After parsing the configuration files, installing the providers, and setting up the backend, we come to the part that is interesting for this post: building a dependency graph. If a Terraform resource uses another resource / data source / variable / local value in its configuration, the runtime needs to execute / evaluate the referenced object first.

locals {
  greeting = join(" ", ["Hello", "World!"])
}

resource "null_resource" "default" {
  provisioner "local-exec" {
    // This creates a dependency on the local value,
    // which the runtime evaluates first
    command = "echo '${local.greeting}'"
  }
}

Building the dependency graph

Each operation has its own GraphBuilder; for the plan operation we are examining, we use the PlanGraphBuilder. Each builder has a set of steps that we run through to build and refine the graph for the needs of the operation. This is the first part of the PlanGraphBuilder.Steps method:

// Source: https://github.com/hashicorp/terraform/blob/73b9a681a6ce5d6f71175305992ca812c21bc2a2/internal/terraform/graph_builder_plan.go#L135-L168
	steps := []GraphTransformer{
		// Creates all the resources represented in the config
		&ConfigTransformer{
			Concrete: b.ConcreteResource,
			Config:   b.Config,

			// Resources are not added from the config on destroy.
			skip: b.Operation == walkPlanDestroy,

			importTargets: b.ImportTargets,

			// We only want to generate config during a plan operation.
			generateConfigPathForImportTargets: b.GenerateConfigPath,
		},

		// Add dynamic values
		&RootVariableTransformer{
			Config:       b.Config,
			RawValues:    b.RootVariableValues,
			Planning:     true,
			DestroyApply: false, // always false for planning
		},
		&ModuleVariableTransformer{
			Config:       b.Config,
			Planning:     true,
			DestroyApply: false, // always false for planning
		},
		&variableValidationTransformer{},
		&LocalTransformer{Config: b.Config},
		&OutputTransformer{
			Config:      b.Config,
			RefreshOnly: b.skipPlanChanges || b.preDestroyRefresh,
			Destroying:  b.Operation == walkPlanDestroy,
			Overrides:   b.Overrides,

There are 27 steps in total; going through each of them in thorough detail would make us lose track of our goal of comparing this runtime with the Stacks runtime, so let’s talk through a few of the more interesting ones:

ConfigTransformer

In this transformer we take the parsed HCL Terraform configuration and create a NodeAbstractResource for each resource and data source. NodeAbstractResource is used when the graph is walked; we will take a deeper look at that in a minute. The ConfigTransformer adds each of these NodeAbstractResources to the graph, which at that point has no edges. Adding edges to the graph, and thereby adding the dependency information, is the job of the:

ReferenceTransformer

This transformer runs after all nodes have been added to the graph and is responsible for adding dependency edges. For that it loops through all the nodes in the graph and asks each one for its references, which it then uses to construct the edges. Seems really straightforward, but how does a node know its references? The GraphNodeReferencer interface has a method References() []*addrs.Reference which returns all references for the node. Let’s take a look at the implementation for an abstract node:

// Source: https://github.com/hashicorp/terraform/blob/73b9a681a6ce5d6f71175305992ca812c21bc2a2/internal/terraform/node_resource_abstract.go#L160-L174
		refs, _ := langrefs.ReferencesInExpr(addrs.ParseRef, c.Count)
		result = append(result, refs...)
		refs, _ = langrefs.ReferencesInExpr(addrs.ParseRef, c.ForEach)
		result = append(result, refs...)

		for _, expr := range c.TriggersReplacement {
			refs, _ = langrefs.ReferencesInExpr(addrs.ParseRef, expr)
			result = append(result, refs...)
		}

		// ReferencesInBlock() requires a schema
		if n.Schema != nil {
			refs, _ = langrefs.ReferencesInBlock(addrs.ParseRef, c.Config, n.Schema)
			result = append(result, refs...)
		}

We go through all the meta-arguments like count and for_each and use langrefs.ReferencesInExpr to find all references in each expression. For nodes with a schema (another transformer attaches schemas to nodes that have one, e.g. resources and data sources) we use the langrefs.ReferencesInBlock method, which recursively walks the blocks and attributes within e.g. a resource block to find every reference in every expression.
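
To make the mechanics a bit more tangible, here is a small, self-contained sketch of what such a reference-based transformer boils down to. The node and edge types are made up for illustration; the real transformer operates on dag vertices and addrs types:

package main

import "fmt"

// Minimal stand-ins for graph nodes: each node knows the address it
// provides and the addresses it references (what References() returns).
type node struct {
	addr       string
	references []string
}

// Index every node by its address, then add one edge per reference.
// The resulting edge list is the dependency information the walk needs.
func referenceEdges(nodes []node) [][2]string {
	byAddr := map[string]node{}
	for _, n := range nodes {
		byAddr[n.addr] = n
	}
	var edges [][2]string // [dependent, dependency]
	for _, n := range nodes {
		for _, ref := range n.references {
			if target, ok := byAddr[ref]; ok {
				edges = append(edges, [2]string{n.addr, target.addr})
			}
		}
	}
	return edges
}

func main() {
	nodes := []node{
		{addr: "local.greeting"},
		{addr: "null_resource.default", references: []string{"local.greeting"}},
	}
	fmt.Println(referenceEdges(nodes))
	// [[null_resource.default local.greeting]]
}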

ProviderTransformer

The ProviderTransformer checks in with all nodes to find every provider that needs to be configured. A node can specify its required provider by implementing the GraphNodeProviderConsumer interface. This interface answers provider-related questions like “what is the address of the provider config I require?” (ProvidedBy()), “what is the fully qualified name of the provider I require?” (Provider()), and it lets the transformer set the provider address for the node (SetProvider(addrs.AbsProviderConfig)).

In this case we use ProvidedBy() to collect all requested provider configurations. We then go through the configured providers and call SetProvider() on each node that needs one, recording its corresponding provider for later use.
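
As a rough illustration of that contract (with plain strings standing in for the addrs types the real interface uses), the transformer’s job looks roughly like this:

package main

import "fmt"

// Simplified shape of the contract described above; the real
// GraphNodeProviderConsumer interface uses addrs.* types instead of strings.
type providerConsumer interface {
	ProvidedBy() string      // address of the provider configuration this node needs
	SetProvider(addr string) // called by the transformer once the provider is resolved
}

type resourceNode struct {
	addr, wants, provider string
}

func (r *resourceNode) ProvidedBy() string      { return r.wants }
func (r *resourceNode) SetProvider(addr string) { r.provider = addr }

func main() {
	configured := map[string]bool{`provider["hashicorp/null"]`: true}
	n := &resourceNode{addr: "null_resource.default", wants: `provider["hashicorp/null"]`}
	consumers := []providerConsumer{n}

	// What the ProviderTransformer conceptually does: ask each consumer what
	// it needs and wire it up to a matching configured provider.
	for _, c := range consumers {
		if configured[c.ProvidedBy()] {
			c.SetProvider(c.ProvidedBy())
		}
	}
	fmt.Println(n.provider)
}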

Walking the graph

After constructing the graph we walk it (in dependency order) and call the Execute method of each node in the graph.

The Execute function does the work required for the current operation (validate / plan / apply / destroy / etc) in relation to the node. A NodeApplyableProvider for example will either validate the provider configuration or configure the provider (making it ready for the resource nodes to make calls).

For locals the Execute function simply evaluates the expression of the local value (each local value is an individual node in the graph).
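
At the HCL level, evaluating an expression against already-computed values looks roughly like the sketch below. Terraform’s internal evaluator builds a much richer scope, but this standalone example (reusing the command expression from the earlier config) shows the idea:

package main

import (
	"fmt"

	"github.com/hashicorp/hcl/v2"
	"github.com/hashicorp/hcl/v2/hclsyntax"
	"github.com/zclconf/go-cty/cty"
)

func main() {
	// The already-computed value (here local.greeting) is placed into an
	// EvalContext and the HCL expression is evaluated against it.
	expr, diags := hclsyntax.ParseExpression([]byte(`"echo '${local.greeting}'"`), "main.tf", hcl.InitialPos)
	if diags.HasErrors() {
		panic(diags)
	}

	val, _ := expr.Value(&hcl.EvalContext{
		Variables: map[string]cty.Value{
			"local": cty.ObjectVal(map[string]cty.Value{
				"greeting": cty.StringVal("Hello World!"),
			}),
		},
	})
	fmt.Println(val.AsString()) // echo 'Hello World!'
}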

For resources the logic is a bit more complex as there is more to do; we’ve got to provision infrastructure, or at least plan to do so, right?

To execute a resource instance we first get the provider and refresh the current state. Once we have the current state we can ask the provider to plan the change from the current state to the configured state, and write the resulting change, which is then used to build the plan. This, for example, is the call to the provider:

// Source: https://github.com/hashicorp/terraform/blob/73b9a681a6ce5d6f71175305992ca812c21bc2a2/internal/terraform/node_resource_abstract_instance.go#L950-L960
		resp = provider.PlanResourceChange(providers.PlanResourceChangeRequest{
			TypeName:         n.Addr.Resource.Resource.Type,
			Config:           unmarkedConfigVal,
			PriorState:       unmarkedPriorVal,
			ProposedNewState: proposedNewVal,
			PriorPrivate:     priorPrivate,
			ProviderMeta:     metaConfigVal,
			ClientCapabilities: providers.ClientCapabilities{
				DeferralAllowed: deferralAllowed,
			},
		})

Did you notice how we went from calling it a resource to a resource instance? This is because a resource might have a for_each or a count configured. These nodes go through a process called expansion, triggered by the graph walk calling DynamicExpand on the resource nodes, which adds resource instance nodes to the graph in place of the plain resource nodes that map 1:1 to the config.
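
A purely illustrative sketch of what expansion amounts to, one configured resource turning into several addressable instances:

package main

import "fmt"

// Illustrative only: expand one configured resource into its instances,
// the way DynamicExpand conceptually replaces a config node with
// per-instance nodes when count (or for_each) is set.
type resourceConfig struct {
	addr  string
	count int
}

func expand(cfg resourceConfig) []string {
	instances := make([]string, 0, cfg.count)
	for i := 0; i < cfg.count; i++ {
		instances = append(instances, fmt.Sprintf("%s[%d]", cfg.addr, i))
	}
	return instances
}

func main() {
	fmt.Println(expand(resourceConfig{addr: "null_resource.default", count: 3}))
	// [null_resource.default[0] null_resource.default[1] null_resource.default[2]]
}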

That gives us a high-level overview of how the module runtime works; now let’s take a look at the Stacks runtime!

Stacks Runtime

The most foundational aspect of the Stacks runtime is the promising package. It provides a deadlock-free promise implementation and is in and of itself quite interesting. As a quick summary for the purpose of this blog post, you can think of promises as future values that can be awaited by many tasks. Their value is computed by exactly one task, either a blocking MainTask or an async AsyncTask. We also rely on promising.Once, which collapses multiple calls for the same result into a single promise that all callers can depend on.
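
To illustrate the idea (this is not the actual promising API, which additionally performs deadlock detection across tasks), here is a minimal standard-library sketch of the “computed by exactly one caller, awaited by many” behaviour using sync.OnceValues:

package main

import (
	"fmt"
	"sync"
)

func main() {
	// The value is computed exactly once; every caller that "awaits" it
	// gets the same result (and the same error).
	greeting := sync.OnceValues(func() (string, error) {
		fmt.Println("computing once")
		return "Hello World!", nil
	})

	var wg sync.WaitGroup
	for i := 0; i < 3; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			v, _ := greeting() // all three goroutines share one computation
			fmt.Println(v)
		}()
	}
	wg.Wait()
}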

We will now talk through the lifecycle of a component when running a plan, going from parsing the configuration to planning the component instance (the conventional module runtime is used to plan / apply the root module the component references). The Stacks runtime is controlled by a gRPC API instead of a CLI, so we will skip ahead to the parts doing the actual work and not cover how we got there (it was most likely a gRPC call at the root).

Config parsing

The first thing we need to do when working with the Stacks language is load the configuration. This is done in the stackconfig.LoadConfigDir function, which recursively loads the main stack along with all embedded stacks it discovers. Embedding stacks is helpful when you want to compose your stack out of multiple smaller sub-stacks (similar to modules in the module runtime). It is implemented in the Terraform Core codebase, but not yet supported in HCP Terraform / Terraform Cloud.

After this we parse each file against a very simple schema:

// Source: https://github.com/hashicorp/terraform/blob/73b9a681a6ce5d6f71175305992ca812c21bc2a2/internal/stacks/stackconfig/file.go#L211-L224
var rootConfigSchema = &hcl.BodySchema{
	Attributes: []hcl.AttributeSchema{
		{Name: "language"},
	},
	Blocks: []hcl.BlockHeaderSchema{
		{Type: "stack", LabelNames: []string{"name"}},
		{Type: "component", LabelNames: []string{"name"}},
		{Type: "variable", LabelNames: []string{"name"}},
		{Type: "locals"},
		{Type: "output", LabelNames: []string{"name"}},
		{Type: "provider", LabelNames: []string{"type", "name"}},
		{Type: "required_providers"},
	},
}

This gives us the top-level structure. The approach is very similar to normal Terraform: instead of having one big schema to parse everything against, we use a gradual parsing approach where each block type is responsible for parsing its own configuration. We loop through every block and hand the work off to the respective parsers / decoders.

// Source: https://github.com/hashicorp/terraform/blob/main/internal/stacks/stackconfig/file.go#L85-L93
	}

	for _, block := range content.Blocks {
		switch block.Type {

		case "component":
			decl, moreDiags := decodeComponentBlock(block)
			diags = diags.Append(moreDiags)
			diags = diags.Append(

In the case of a component this is decodeComponentBlock. This function is not only responsible for parsing the hcl.Body against a schema; it also parses and validates the values in the configuration so that the rest of the system can get straight to work.
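
To show the gradual parsing idea end to end, here is a small standalone example using the hcl library directly. The component schema below is heavily simplified (the real one has more attributes and does actual validation), but the two-step Content calls mirror the approach:

package main

import (
	"fmt"

	"github.com/hashicorp/hcl/v2"
	"github.com/hashicorp/hcl/v2/hclsyntax"
)

// A stripped-down version of the gradual parsing idea: the file-level
// schema only identifies the "component" block; its body is then decoded
// against the block's own (here heavily simplified) schema.
var componentSchema = &hcl.BodySchema{
	Attributes: []hcl.AttributeSchema{
		{Name: "source", Required: true},
		{Name: "inputs"},
	},
}

func main() {
	src := `
component "networking" {
  source = "./modules/network"
  inputs = { cidr = "10.0.0.0/16" }
}
`
	file, diags := hclsyntax.ParseConfig([]byte(src), "example.tfstacks.hcl", hcl.InitialPos)
	if diags.HasErrors() {
		panic(diags)
	}

	// Step 1: find the top-level blocks.
	content, _ := file.Body.Content(&hcl.BodySchema{
		Blocks: []hcl.BlockHeaderSchema{{Type: "component", LabelNames: []string{"name"}}},
	})
	// Step 2: hand each block body to its own schema / decoder.
	for _, block := range content.Blocks {
		body, _ := block.Body.Content(componentSchema)
		fmt.Println(block.Labels[0], "has", len(body.Attributes), "attributes")
	}
}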

Static walk

Now that we have everything parsed we are ready to do our plan. Sort of. In theory we could get right to work; we have everything we need in place. This works great if the configuration is correct and no issues arise, but in practice that’s not always the case. Let’s say you wrote a component config with a for_each looping over a 100-item list and you misspelled inputs as inpu. If we expanded (resolved the for_each into 100 different component instances) and tried to run the component, we would get 100 error messages with the same error. Of course we could collapse them into one after the fact, but there is a simpler solution for this case: don’t run into the error once per component instance, but once per component config.

This is what we call the static walk. The goal of the static walk is to find configuration errors as early as possible.

// Source: https://github.com/hashicorp/terraform/blob/73b9a681a6ce5d6f71175305992ca812c21bc2a2/internal/stacks/stackruntime/internal/stackeval/walk_static.go#L55-L64
	for _, obj := range stackConfig.LocalValues(ctx) {
		visit(ctx, walk, obj)
	}

	for _, obj := range stackConfig.Providers(ctx) {
		visit(ctx, walk, obj)
	}

	for _, obj := range stackConfig.Components(ctx) {
		visit(ctx, walk, obj)

We are looping over e.g. stackConfig.Components(ctx), which returns a map of ComponentConfigs. At this point we are not dealing with a graph, but with simple lists of things, specifically the *Config types. They represent just the configuration, and that is also what they validate: they check for errors in the configuration that can be found without expanding the configs into instances. This is where the term static comes from, since it’s “just” static configuration.

// Source: https://github.com/hashicorp/terraform/blob/73b9a681a6ce5d6f71175305992ca812c21bc2a2/internal/stacks/stackruntime/internal/stackeval/component_config.go#L459-L549
func (c *ComponentConfig) checkValid(ctx context.Context, phase EvalPhase) tfdiags.Diagnostics {
	diags, err := c.validate.Do(ctx, func(ctx context.Context) (tfdiags.Diagnostics, error) {
		var diags tfdiags.Diagnostics

		moduleTree, moreDiags := c.CheckModuleTree(ctx)
		diags = diags.Append(moreDiags)
		if moduleTree == nil {
			return diags, nil
		}
		decl := c.Declaration(ctx)

		variableDiags := c.CheckInputVariableValues(ctx, phase)
		diags = diags.Append(variableDiags)
		// We don't actually exit if we found errors with the input variables,
		// we can still validate the actual module tree without them.

		_, providerDiags := c.CheckProviders(ctx, phase)
		diags = diags.Append(providerDiags)
		if providerDiags.HasErrors() {
			// If there's invalid provider configuration, we can't actually go
			// on and validate the module tree. We need the providers and if
			// they're invalid we'll just get crazy and confusing errors
			// later if we try and carry on.
			return diags, nil
		}

		providerSchemas, moreDiags := c.neededProviderSchemas(ctx, phase)
		diags = diags.Append(moreDiags)
		if moreDiags.HasErrors() {
			return diags, nil
		}

		tfCtx, err := terraform.NewContext(&terraform.ContextOpts{
			PreloadedProviderSchemas: providerSchemas,
			Provisioners:             c.main.availableProvisioners(),
		})
		if err != nil {
			// Should not get here because we should always pass a valid
			// ContextOpts above.
			diags = diags.Append(tfdiags.Sourceless(
				tfdiags.Error,
				"Failed to instantiate Terraform modules runtime",
				fmt.Sprintf("Could not load the main Terraform language runtime: %s.\n\nThis is a bug in Terraform; please report it!", err),
			))
			return diags, nil
		}

		providerClients, valid := c.neededProviderClients(ctx, phase)
		if !valid {
			diags = diags.Append(&hcl.Diagnostic{
				Severity: hcl.DiagError,
				Summary:  "Cannot validate component",
				Detail:   fmt.Sprintf("Cannot validate %s because its provider configuration assignments are invalid.", c.Addr()),
				Subject:  decl.DeclRange.ToHCL().Ptr(),
			})
			return diags, nil
		}
		defer func() {
			// Close the unconfigured provider clients that we opened in
			// neededProviderClients.
			for _, client := range providerClients {
				client.Close()
			}
		}()

		// When our given context is cancelled, we want to instruct the
		// modules runtime to stop the running operation. We use this
		// nested context to ensure that we don't leak a goroutine when the
		// parent context isn't cancelled.
		operationCtx, operationCancel := context.WithCancel(ctx)
		defer operationCancel()
		go func() {
			<-operationCtx.Done()
			if ctx.Err() == context.Canceled {
				tfCtx.Stop()
			}
		}()

		diags = diags.Append(tfCtx.Validate(moduleTree, &terraform.ValidateOpts{
			ExternalProviders: providerClients,
		}))
		return diags, nil
	})
	if err != nil {
		// this is crazy, we never return an error from the inner function so
		// this really shouldn't happen.
		panic(fmt.Sprintf("unexpected error from validate.Do: %s", err))
	}

	return diags
}

Let’s go through the interesting parts step by step. The first thing that should catch our attention is the c.validate.Do call that wraps the entire implementation. c.validate is a promising.Once[tfdiags.Diagnostics], i.e. a promise that is only computed once no matter how often checkValid is called (it’s called from both the Validate and the PlanChanges methods).

First we validate the input variables and providers, then we instantiate the providers so that we can pass them into the module runtime. With terraform.NewContext we create a context in which we can execute the module runtime’s validate functionality. This context is the same one that is used when one runs terraform validate (or any other CLI command).

Dynamic walk

Now that we have validated the static configuration, we need to do the actual work of planning. For that we walk the dynamic objects (e.g. providers, components, embedded stacks). For components this means we switch from dealing with ComponentConfigs to Components as we iterate through them. The first thing we do is call the Component.Instances method. This method does the expansion, converting the for_each value from the configuration into a map of ComponentInstances.
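
As a rough illustration (not the actual Component.Instances code), turning an already-evaluated for_each value into a set of instance keys with cty looks like this:

package main

import (
	"fmt"

	"github.com/zclconf/go-cty/cty"
)

// Illustrative only: derive the instance keys from an evaluated for_each
// value, roughly what the expansion step has to produce before each
// ComponentInstance can be planned.
func instanceKeys(forEach cty.Value) []string {
	var keys []string
	for it := forEach.ElementIterator(); it.Next(); {
		k, _ := it.Element()
		keys = append(keys, k.AsString())
	}
	return keys
}

func main() {
	forEach := cty.MapVal(map[string]cty.Value{
		"eu-west-1": cty.StringVal("10.0.0.0/16"),
		"us-east-1": cty.StringVal("10.1.0.0/16"),
	})
	fmt.Println(instanceKeys(forEach))
	// [eu-west-1 us-east-1], one ComponentInstance per key
}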

We then visit each of the instances and call the ComponentInstance.PlanChanges method which will, after creating provider clients and handling deferrals for the component instance, call into the Terraform module runtime to plan the root module associated with the component instance. After this much setup, the actual call into the module runtime is quite simple:

// Source: https://github.com/hashicorp/terraform/blob/99a94908e74dba61f05fba90d44585fd97e6a931/internal/stacks/stackruntime/internal/stackeval/planning.go#L121-L158
	tfCtx, err := terraform.NewContext(&terraform.ContextOpts{
		Hooks: []terraform.Hook{
			&componentInstanceTerraformHook{
				ctx:   ctx,
				seq:   seq,
				hooks: hooksFromContext(ctx),
				addr:  addr,
			},
		},
		Providers:                providerFactories,
		PreloadedProviderSchemas: providerSchemas,
		Provisioners:             main.availableProvisioners(),
	})
	if err != nil {
		// Should not get here because we should always pass a valid
		// ContextOpts above.
		diags = diags.Append(tfdiags.Sourceless(
			tfdiags.Error,
			"Failed to instantiate Terraform modules runtime",
			fmt.Sprintf("Could not load the main Terraform language runtime: %s.\n\nThis is a bug in Terraform; please report it!", err),
		))
		return nil, diags
	}

	// When our given context is cancelled, we want to instruct the
	// modules runtime to stop the running operation. We use this
	// nested context to ensure that we don't leak a goroutine when the
	// parent context isn't cancelled.
	operationCtx, operationCancel := context.WithCancel(ctx)
	defer operationCancel()
	go func() {
		<-operationCtx.Done()
		if ctx.Err() == context.Canceled {
			tfCtx.Stop()
		}
	}()

	plan, moreDiags := tfCtx.Plan(moduleTree, state, opts)

Last but not least, we need to create a StackPlan for the changes. I won’t go into detail about the differences in plan formats between the module runtime and the Stacks runtime; they are a bit different, but the general idea is the same.

As you can see, in Stacks we never create an explicit graph; we simply walk through the blocks in the configuration and progressively build up items to be executed (be it validation, expansion or planning). The resulting call graph is essentially our dependency graph, and the promising library we use makes it easy to build it up without having to worry about race conditions and dependency ordering.
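
A tiny standard-library sketch of that idea (again, not the actual promising API): the component’s plan function simply calls the once-computed values it needs, and that call is the dependency edge.

package main

import (
	"fmt"
	"sync"
)

func main() {
	// The "graph" is never built explicitly: planning the component simply
	// calls (and therefore waits on) the once-computed values it depends on.
	variableValue := sync.OnceValue(func() string {
		return "10.0.0.0/16"
	})
	componentPlan := sync.OnceValue(func() string {
		// Awaiting the variable here is the dependency edge.
		return "plan networking with cidr=" + variableValue()
	})

	fmt.Println(componentPlan())
}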

What are the benefits?

The approach we took in the Stacks runtime is more predictable while developing; it is very close to the code one would normally write in a single-threaded environment. This makes it very straightforward to implement and debug new features. At least for me, there is a clearer path to follow when reading the code for the first time, and the related logic feels more colocated.

Whether it’s a better approach than the graph-based one in the module runtime can only be judged after a few years of iteration. I am sure the module runtime was far simpler to understand in the beginning and became more and more complex over time as the project’s feature set grew and Terraform had to be developed with more and more edge cases and legacy behavior in mind.

