Atomized

Instrumenting short-lived containers with AWS X-Ray


I know, I know. Why run short-lived containers instead of a Lambda function? Answer: Lambda’s disk space and run-time limits.

One of the biggest headaches I’ve experienced when instrumenting short-lived containers, whether with New Relic, DataDog, or AWS X-Ray, is the magic baked into the commonly used SDKs: they rely on timers or indicators like garbage collection cycles to decide when to transmit their data. A container that exits quickly can be gone before any of those triggers fire.

Where we find ourselves today

Today our build process uses a Step Functions state machine to orchestrate its steps. Those steps are a mix of Lambda functions and ECS tasks, chosen based on how long a step runs or how much disk space it might need.

During the phase that builds Docker containers, we run Docker-in-Docker in an ECS task. Once we’re done, we simply shut down the container and send a signal back to the step function to let it know to proceed with the rest of the build process.

The problem we faced while instrumenting our container, even after contacting AWS support, was that the AWS X-Ray SDK for Go failed to ship the trace data to the X-Ray daemon before the container shut down (even after we closed our segments).

Try, try again (maybe?)

We tried a number of different tactics: manually sending the data to the X-Ray daemon running in our account, adding cool-down periods before the container exited, and letting the X-Ray client configure itself using the AWS metadata endpoint.
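For readers wondering what “sending the data manually” to the daemon involves, here is a minimal sketch, not the code we actually ran. It assumes the daemon is listening on its default UDP address, 127.0.0.1:2000, and it uses a hand-written segment document with a hypothetical name and IDs. The daemon expects each datagram to start with a small JSON header line, followed by the segment document.

```go
package main

import (
	"fmt"
	"net"
)

// daemonHeader is the line the X-Ray daemon expects before each segment document.
const daemonHeader = `{"format": "json", "version": 1}`

// framePacket prepends the daemon header and a newline to a JSON segment document.
func framePacket(segmentJSON string) []byte {
	return []byte(daemonHeader + "\n" + segmentJSON)
}

// sendToDaemon writes one framed segment to the daemon over UDP.
// UDP is fire-and-forget: a nil error only means the datagram left this
// process, not that the daemon received or forwarded it.
func sendToDaemon(addr, segmentJSON string) error {
	conn, err := net.Dial("udp", addr)
	if err != nil {
		return fmt.Errorf("dial daemon: %w", err)
	}
	defer conn.Close()

	if _, err := conn.Write(framePacket(segmentJSON)); err != nil {
		return fmt.Errorf("write segment: %w", err)
	}
	return nil
}

func main() {
	// Hypothetical segment: the name, id, and trace_id are illustrative values.
	seg := `{"name":"docker-build","id":"70de5b6f19ff9a0a",` +
		`"trace_id":"1-581cf771-a006649127e371903a2de979",` +
		`"start_time":1478293361.271,"end_time":1478293361.449}`

	// Assumes the daemon's default address; adjust if yours differs.
	if err := sendToDaemon("127.0.0.1:2000", seg); err != nil {
		fmt.Println("send failed:", err)
	}
}
```

The fire-and-forget nature of UDP is worth keeping in mind here: nothing in this path tells the sender whether the daemon forwarded the segment before the container exited.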

After a number of hours (a number that I’m ashamed to admit to) I finally thought – “What if I marshal the segment data structure and use the Go SDK for the actual AWS X-Ray service?”

EUREKA!

One commit later, after instantiating the AWS X-Ray service client, marshaling the segment data, and calling the PutTraceSegments method, we had data coming into AWS X-Ray!
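The original commit isn’t reproduced here, so what follows is only a rough sketch of the shape of the approach. Instead of marshaling the X-Ray SDK’s own segment struct, it hand-builds a minimal segment document with the required fields. The `traceSender` interface, `recordingSender`, and the `docker-build` segment name are hypothetical names introduced for illustration. In a real program, the sender would wrap the AWS SDK’s X-Ray client and pass the JSON strings to PutTraceSegments, whose input takes a list of segment documents.

```go
package main

import (
	"context"
	"crypto/rand"
	"encoding/hex"
	"encoding/json"
	"fmt"
	"time"
)

// segmentDoc holds the required fields of an X-Ray segment document.
type segmentDoc struct {
	Name      string  `json:"name"`
	ID        string  `json:"id"`       // 16 hex digits
	TraceID   string  `json:"trace_id"` // 1-<8 hex epoch seconds>-<24 hex digits>
	StartTime float64 `json:"start_time"`
	EndTime   float64 `json:"end_time"`
}

// traceSender is a hypothetical seam. A real implementation would call the
// X-Ray service's PutTraceSegments API with these JSON documents.
type traceSender interface {
	PutSegments(ctx context.Context, docs []string) error
}

// recordingSender is a stand-in that keeps documents in memory so this
// sketch runs without AWS credentials.
type recordingSender struct{ docs []string }

func (r *recordingSender) PutSegments(_ context.Context, docs []string) error {
	r.docs = append(r.docs, docs...)
	return nil
}

// randomHex returns 2*nBytes random hex digits.
func randomHex(nBytes int) (string, error) {
	b := make([]byte, nBytes)
	if _, err := rand.Read(b); err != nil {
		return "", err
	}
	return hex.EncodeToString(b), nil
}

// newSegment builds a completed segment covering start..end.
func newSegment(name string, start, end time.Time) (segmentDoc, error) {
	id, err := randomHex(8)
	if err != nil {
		return segmentDoc{}, err
	}
	unique, err := randomHex(12)
	if err != nil {
		return segmentDoc{}, err
	}
	return segmentDoc{
		Name:      name,
		ID:        id,
		TraceID:   fmt.Sprintf("1-%08x-%s", start.Unix(), unique),
		StartTime: float64(start.UnixNano()) / 1e9,
		EndTime:   float64(end.UnixNano()) / 1e9,
	}, nil
}

// flush marshals the segment and sends it synchronously, so the caller
// knows the call has returned before the container exits.
func flush(ctx context.Context, s traceSender, seg segmentDoc) error {
	doc, err := json.Marshal(seg)
	if err != nil {
		return fmt.Errorf("marshal segment: %w", err)
	}
	return s.PutSegments(ctx, []string{string(doc)})
}

// run stands in for the container's work followed by an explicit flush.
func run(s traceSender) error {
	start := time.Now()
	// ... the container's real work would happen here ...
	seg, err := newSegment("docker-build", start, time.Now())
	if err != nil {
		return err
	}
	return flush(context.Background(), s, seg)
}

func main() {
	rec := &recordingSender{}
	if err := run(rec); err != nil {
		fmt.Println("flush failed:", err)
		return
	}
	fmt.Println(rec.docs[0])
}
```

The point of the design is that the send is a blocking call made right before exit, so there is no background timer or daemon hop that a short-lived container can outrun.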

(Proof it worked)

Thoughts on the journey

What did we learn?

I’m still trying to figure out how to answer this one, but I think the main lesson is not to forget about the AWS SDK for your language when you’re in a tight spot.

What was the root cause?

Unsure. When AWS support and I parted ways, the support technician was just as confused about why we weren’t seeing transaction data. We looked at everything from VPC configurations to IAM permissions to CloudTrail as a sanity check that we hadn’t missed anything. Everything we looked at came up empty.

Why did we use X-Ray over the other solutions for APM?

X-Ray has some magic sauce baked in that lets us easily wrap AWS clients and see the calls we make to the various AWS services. We found that New Relic failed to provide clear documentation, so that was a huge no-go for us. DataDog, on the other hand, required additional infrastructure, which seemed unnecessary, so that was also a no-go.

Besides that, the cost of X-Ray was a factor. X-Ray has a perpetual free tier, so that’s pretty sweet. Beyond that (as of the date of writing this), it costs $5 for each million traces sent to X-Ray.
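To make that pricing concrete, here is a small back-of-the-envelope sketch. The $5 per million figure is the one quoted above. The free-tier allowance is passed in as a parameter, and the 100,000 traces used in `main` is an assumption for illustration, so check the current pricing page before relying on it.

```go
package main

import "fmt"

// pricePerMillionTraces is the USD figure quoted in this post.
const pricePerMillionTraces = 5.00

// monthlyCost estimates the monthly bill for traces sent to X-Ray.
// freeTraces is the monthly free-tier allowance (an input, not a known constant).
func monthlyCost(traces, freeTraces int64) float64 {
	billable := traces - freeTraces
	if billable < 0 {
		billable = 0
	}
	return float64(billable) / 1_000_000 * pricePerMillionTraces
}

func main() {
	// 100,000 free traces per month is an assumed allowance for illustration.
	fmt.Printf("$%.2f\n", monthlyCost(2_100_000, 100_000)) // prints "$10.00"
}
```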

Final Thought

We find instrumenting containers extremely frustrating at times. Perhaps it’s because it’s a weak spot in my technical abilities. So if you have any insight about how you’d have fixed our problem, shoot us an email at hello@atomizedhq.com.
