Responsive

Integrating Datadog Instrumented Apps in your OpenTelemetry Stack

Integrating Datadog Instrumented Apps in your OpenTelemetry Stack
May 16, 2024
7 min
read
Daniel Dias
Software Engineer
Tracetest

Struggling with integrating legacy APIs with OpenTelemetry? Learn to process data from Datadog-instrumented apps in your OTel stack!

Share on X
Share on LinkedIn
Share on Reddit
Share on HackerNews
Copy URL

Table of Contents

At Tracetest, we encountered an issue while assisting a customer to integrate legacy APIs, which were instrumented with older SDKs, with new APIs using OpenTelemetry (OTel) SDKs. We've documented our findings in this article, outlining how such a problem can arise in an observability stack and how it can be resolved using the OpenTelemetry Collector.

This guide also shows how to establish an observability pipeline that can process data from Datadog-instrumented apps. This allows you to receive and correlate data within OpenTelemetry stacks without having to send it to Datadog.

## The Legacy SDK Issue

Have you ever faced challenges with telemetry data standardization from the Datadog SDKs? Integrating Datadog SDKs with OpenTelemetry SDKs can complicate observability systems due to inconsistent telemetry data sent to Datadog. This inconsistency can make detailed data analysis, such as traces, inaccessible before it's sent to Datadog, thereby complicating the setup of an efficient observability pipeline.

## How Datadog SDKs Work

Datadog is a monitoring and analytics platform that allows developers and IT operators to observe the performance of their applications, servers, databases, and tools, all in one place, being one of the major players in the observability landscape.

To allow developers to integrate with their platform quickly, they provide a set of SDKs that enable applications to send telemetry automatically to Datadog. These SDKs are used for apps that were instrumented before OpenTelemetry started to standardize how to send telemetry with the OTLP protocol. Due to that, systems that use both Datadog SDKs and OTel SDKs can be complex in terms of observability, where their data is only available for analysis at the server level. In this scenario, developers cannot use other OTel solutions, like traces, to analyze data before sending it to Datadog.

## Structure an Observability Stack with Datadog

When you instrument an API with Datadog SDKs, you must send the telemetry to an
agent, and then this agent sends your telemetry to Datadog servers.

![https://res.cloudinary.com/djwdcmwdz/image/upload/v1715868912/Blogposts/datadog-instr-otel/Screenshot_2024-05-16_at_16.14.54_zay5mp.png](https://res.cloudinary.com/djwdcmwdz/image/upload/v1715868912/Blogposts/datadog-instr-otel/Screenshot_2024-05-16_at_16.14.54_zay5mp.png)

However, with the rise of OpenTelemetry as standard, new APIs usually centralize the telemetry into an OpenTelemetry Collector and then send data to Datadog, creating mixed environments where data is processed and correlated only on Datadog servers.

![https://res.cloudinary.com/djwdcmwdz/image/upload/v1715869011/Blogposts/datadog-instr-otel/Screenshot_2024-05-16_at_16.16.38_oxenuh.png](https://res.cloudinary.com/djwdcmwdz/image/upload/v1715869011/Blogposts/datadog-instr-otel/Screenshot_2024-05-16_at_16.16.38_oxenuh.png)

To simplify this structure, we can centralize all the communication in an OpenTelemetry Collector and set up a [datadog receiver](https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/receiver/datadogreceiver) that works like an agent, receiving traces and metrics in Datadog proprietary format:

![https://res.cloudinary.com/djwdcmwdz/image/upload/v1715869122/Blogposts/datadog-instr-otel/Screenshot_2024-05-16_at_16.18.30_jhr2ac.png](https://res.cloudinary.com/djwdcmwdz/image/upload/v1715869122/Blogposts/datadog-instr-otel/Screenshot_2024-05-16_at_16.18.30_jhr2ac.png)

To set up your OpenTelemetry Collector with this receiver, you must use the [OTel Collector Contrib](https://github.com/open-telemetry/opentelemetry-collector-contrib) distribution and add [the `datadog` receiver](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/datadogreceiver/README.md) in the configuration file:

```jsx
receivers:
 otlp:
   protocols:
     grpc:
     http:

 datadog:
   endpoint: 0.0.0.0:8126
   read_timeout: 60s
   
 # ...
 
 service:
 pipelines:
   traces:
     receivers: [otlp, datadog]
     # ...
```

## Sending Datadog Data to Jaeger

Using the latest proposed architecture as a head start, let’s download a demo example that shows how to adapt this architecture to send data to Jaeger, allowing you to run the API locally and make any observability changes needed before publishing to production.

First, open a terminal and download the demo with the following commands:

```bash
git clone git@github.com:kubeshop/tracetest.git
cd ./examples/datadog-propagation

docker compose -f ./docker-compose.step1.yaml up -d
```

This action starts two Ruby on Rails APIs, one instrumented with [ddtrace](https://github.com/datadog/dd-trace-rb) and another with [OpenTelemetry SDK](https://github.com/open-telemetry/opentelemetry-ruby), both connecting to an OpenTelemetry Collector that sends data to Jaeger:

![https://res.cloudinary.com/djwdcmwdz/image/upload/v1715869348/Blogposts/datadog-instr-otel/Screenshot_2024-05-16_at_16.20.43_el3us5.png](https://res.cloudinary.com/djwdcmwdz/image/upload/v1715869348/Blogposts/datadog-instr-otel/Screenshot_2024-05-16_at_16.20.43_el3us5.png)

Next, you can call the OTelSDK-instrumented API, which executes an internal endpoint on the Datadog-instrumented API and returns an output. It is expected that this operation generates just one trace. Try it by executing the command:

```bash
> curl http://localhost:8080/remotehello
# it returns:
# {"hello":"world!","from":"quick-start-api-otel","to":"quick-start-api-datadog"}
```

However, when accessing Jaeger locally (via the link [http://localhost:16686/](http://localhost:16686/)), you see that **two** disjointed traces were generated, one for each API:

![https://res.cloudinary.com/djwdcmwdz/image/upload/v1715869464/Blogposts/datadog-instr-otel/step1-otel-api_cioprm.png](https://res.cloudinary.com/djwdcmwdz/image/upload/v1715869464/Blogposts/datadog-instr-otel/step1-otel-api_cioprm.png)

![https://res.cloudinary.com/djwdcmwdz/image/upload/v1715869497/Blogposts/datadog-instr-otel/step1-datadog-api_shxj8x.png](https://res.cloudinary.com/djwdcmwdz/image/upload/v1715869497/Blogposts/datadog-instr-otel/step1-datadog-api_shxj8x.png)

To correct this problem, you need to change your stack a little bit. Read on to learn how to apply these changes.

### Understand `trace_id` Representations for Datadog and OpenTelemetry

This issue happens due to a difference in the representation of `trace_id`s in Datadog format and OTel SDK format. Datadog considers a `trace_id` as an unsigned Int64, while the OpenTelemetry SDK considers it as a 128-bit hexadecimal representation:

- **Datadog TraceID (int64):** `4289707906147384633`
- **Datadog TraceID (hex):** `3b881a9ce0197d39`
- **OTelSDK TraceID (hex):** `f3c18530c08e00a43b881a9ce0197d39`

Since you have two TraceID representations, you have two Traces in Jaeger.

To propagate their traces through an OpenTelemetry stack, Datadog has an internal representation that can be used to reconstruct a TraceID, which they call, internally in their SDK, as `Upper TraceID`, represented by the attribute `_dd.p.tid` appended to the first span of its trace:

![https://res.cloudinary.com/djwdcmwdz/image/upload/v1715870223/Blogposts/datadog-instr-otel/Untitled_33_cx09ei.png](https://res.cloudinary.com/djwdcmwdz/image/upload/v1715870223/Blogposts/datadog-instr-otel/Untitled_33_cx09ei.png)

Concatenating this `Upper TraceID` with Datadog’s TraceID in hexadecimal representation (called `Lower TraceID`), you have the exact TraceID representation for an OpenTelemetry stack:

![https://res.cloudinary.com/djwdcmwdz/image/upload/v1715870267/Blogposts/datadog-instr-otel/Screenshot_2024-05-16_at_16.37.40_ieqruf.png](https://res.cloudinary.com/djwdcmwdz/image/upload/v1715870267/Blogposts/datadog-instr-otel/Screenshot_2024-05-16_at_16.37.40_ieqruf.png)

Now, to create this representation, we will reconstruct the TraceID at the OpenTelemetry Collector level, using the [transform processor](https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/processor/transformprocessor).

### Reconstructing TraceID for Datadog Spans

The OTel Collector [transform processor](https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/processor/transformprocessor) is a component that transforms span data as it passes through the OpenTelemetry Collector. It can modify attributes of a span such as name, kind, attributes, resources, and instrumentation library, among others.

In this context, the `transform processor` is used to reconstruct the TraceID for Datadog spans to facilitate a unified tracing environment.

In the OTel Collector config, you can configure the `transform processor` as follows:

```bash
processors:
 # ...

 transform:
   trace_statements:
     - context: span
       statements:
         # transformation statements
         - set(cache["upper_trace_id"], attributes["_dd.p.tid"]) where attributes["_dd.p.tid"] != nil
         - set(cache["lower_trace_id"], Substring(trace_id.string, 16, 16)) where cache["upper_trace_id"] != nil
         - set(cache["combined_trace_id"], Concat([cache["upper_trace_id"], cache["lower_trace_id"]],"")) where cache["upper_trace_id"] != nil
         - set(trace_id.string, cache["combined_trace_id"]) where cache["combined_trace_id"] != nil
```

Next, add four transformation statements:

- `set(cache["upper_trace_id"], attributes["_dd.p.tid"]) where attributes["_dd.p.tid"] != nil`, where we will capture the Upper TraceID from the `_dd.p.tid` attribute if it is defined and set it to a temporary cache;
- `set(cache["lower_trace_id"], Substring(trace_id.string, 16, 16)) where cache["upper_trace_id"] != nil`, where we will convert a Datadog TraceID into hexadecimal format (where we will have sixteen zeros plus the Lower TraceID). We will remove the "zeros” segment and grab the second half of it (this is why we have a sub string getting only the last 16 hex digits from the TraceID);
- `set(cache["combined_trace_id"], Concat([cache["upper_trace_id"], cache["lower_trace_id"]],"")) where cache["upper_trace_id"] != nil`,
where we concatenate `upper_trace_id` and `lower_trace_id` into the OpenTelemetry TraceID;
- `set(trace_id.string, cache["combined_trace_id"]) where cache["combined_trace_id"] != nil` and finally we set the `trace_id` for this span with the string concatenation.

To see it working, run your example again, with a different `docker compose` file and then execute the same API call as before:

```bash
> docker compose -f ./docker-compose.step2.yaml up -d

> curl http://localhost:8080/remotehello
# it returns:
# {"hello":"world!","from":"quick-start-api-otel","to":"quick-start-api-datadog"}
```

Looking at the Jaeger UI, you can see that the problem was partially solved. Now you have a trace propagated between the Datadog-instrumented API and the OpenTelemetry-instrumented API.

However, all child spans generated by Datadog are segregated in a different trace, defined only as the Lower TraceID. This happens because `ddtrace` only sends a `_dd.p.tid` attribute for the first span generated internally, which makes your transform statements skip the child  spans.

![https://res.cloudinary.com/djwdcmwdz/image/upload/v1715870335/Blogposts/datadog-instr-otel/step2-otel-api_oywxkr.png](https://res.cloudinary.com/djwdcmwdz/image/upload/v1715870335/Blogposts/datadog-instr-otel/step2-otel-api_oywxkr.png)

![https://res.cloudinary.com/djwdcmwdz/image/upload/v1715870337/Blogposts/datadog-instr-otel/step2-datadog-api_i4yyys.png](https://res.cloudinary.com/djwdcmwdz/image/upload/v1715870337/Blogposts/datadog-instr-otel/step2-datadog-api_i4yyys.png)

### Patch Datadog Trace to Send the Upper TraceID to Every Child Span

Since the OpenTelemetry Collector was [designed to process spans considering distributed systems](https://www.notion.so/docs/collector/scaling/), there’s no way to maintain an internal state to replicate the `_dd.p.tid` attribute for the child spans received by the `datadog` receiver.

You can solve this problem directly in the Datadog-instrumented API by applying a minor patch to `ddtrace` to replicate `_dd.p.tid` to all child spans.

In the [`ddtrace` Ruby version](https://github.com/kubeshop/tracetest/blob/main/examples/datadog-propagation/quick_start_api_datadog/config/initializers/datadog.rb#L20), the trace serialization is modified to send this data below:

```ruby
module Datadog
 module Tracing
   module Transport
     class SerializableTrace
       def to_msgpack(packer = nil)
         if ENV.has_key?('INJECT_UPPER_TRACE_ID')
           return trace.spans.map { |s| SerializableSpan.new(s) }.to_msgpack(packer)
         end

         upper_trace_id = trace.spans.find { |span| span.meta.has_key?('_dd.p.tid') }.meta['_dd.p.tid']
         trace.spans.each do |span|
           span.meta["propagation.upper_trace_id"] = upper_trace_id
         end

         trace.spans.map { |s| SerializableSpan.new(s) }.to_msgpack(packer)
       end
     end
   end
 end
end
```

Since it is a customization on the traces, you can opt to grab the `_dd.p.tid` attribute and inject it in each span as the `propagation.upper_trace_id` attribute. Then you can [change](https://github.com/kubeshop/tracetest/blob/main/examples/datadog-propagation/collector/collector.config.step3.yaml#L18) the `transform` processor in the OpenTelemetry Collector to consider this:

```ruby
 transform:
   trace_statements:
     - context: span
       statements:
         - set(cache["upper_trace_id"], attributes["propagation.upper_trace_id"]) where attributes["propagation.upper_trace_id"] != nil
         - set(cache["lower_trace_id"], Substring(trace_id.string, 16, 16)) where cache["upper_trace_id"] != nil
         - set(cache["combined_trace_id"], Concat([cache["upper_trace_id"], cache["lower_trace_id"]],"")) where cache["upper_trace_id"] != nil
         - set(trace_id.string, cache["combined_trace_id"]) where cache["combined_trace_id"] != nil
```

With these changes done, let’s run a new setup with our example:

```bash
> docker compose -f ./docker-compose.step3.yaml up -d

> curl http://localhost:8080/remotehello
# it returns:
# {"hello":"world!","from":"quick-start-api-otel","to":"quick-start-api-datadog"}
```

Opening the Jaeger UI again, you see that the problem is solved! Now you have a
single trace between both APIs and can evaluate a process as a whole on a
developer machine.

![https://res.cloudinary.com/djwdcmwdz/image/upload/v1715869641/Blogposts/datadog-instr-otel/step3_rdfol3.png](https://res.cloudinary.com/djwdcmwdz/image/upload/v1715869641/Blogposts/datadog-instr-otel/step3_rdfol3.png)

### Testing the Observability Stack

Until now, we have been testing the telemetry manually, which can be time-consuming if both APIs are constantly changing. To evaluate both APIs automatically and guarantee that everything working properly, we can create [trace-based tests](https://docs.tracetest.io/concepts/what-is-trace-based-testing), trigger HTTP calls against the APIs, and validate if our traces are logged as intended.

To do that, we will use Tracetest, which triggers service calls (in our case, HTTP calls like our `curl` calls) and validate the emitted traces to ensure that our observability stack is working as intended.

First, we will create a new account on [Tracetest.io](https://tracetest.io/), by accessing [https://app.tracetest.io](https://app.tracetest.io), and then create a [new environment](https://docs.tracetest.io/concepts/environments). Once we have an [API Key for our agent](https://docs.tracetest.io/configuration/agent), we will start our local stack with a new container with a Tracetest Agent:

```bash
> TRACETEST_API_KEY=your-api-key docker compose -f ./docker-compose.step4.yaml up -d
```

Then, we will [install Tracetest CLI](https://docs.tracetest.io/getting-started/installation) and configure it to access our environment with the following command that will guide you to connect to your `personal-org` and environment:

```bash
> tracetest configure

# This command will print some instructions interactively to help to connect to your env:

# What tracetest server do you want to use? (default: https://app.tracetest.io/)

# What Organization do you want to use?:
#  > personal-org (ttorg_000000000000000)
# What Environment do you want to use?:
#  > OTel (ttenv_000000000000000)
# SUCCESS  Successfully configured Tracetest CLI
```

Now we will configure our agent to connect to our local Jaeger, using the following command:

```bash
> tracetest apply datastore -f ./tracebased-tests/tracing-backend.yaml

# It will send the following output, which means that our environment was correctly configured:

# type: DataStore
# spec:
#  id: current
#  name: Jaeger Tracing Backend
#  type: jaeger
#  default: true
#  createdAt: 2023-10-31T00:30:47.137194Z
#  jaeger:
#    endpoint: jaeger:16685
#    tls:
#      insecure: true
```

Next, write a test that checks the trace generated by a calling the `[http://localhost:8080/remotehello](http://localhost:8080/remotehello)` URL and validate whether it has at least one span for both APIs. To do that, we will create a test file called `./tracebased-tests/test_integration.yaml`  with the following contents:

```yaml
type: Test
spec:
 id: CdhJp_xIR
 name: Test observability integration between OTelSDK-instrumented API to Datadog-instrumented API
 trigger:
   type: http
   httpRequest:
     method: GET
     url: http://api-otel:8080/remotehello
     headers:
     - key: Content-Type
       value: application/json
 specs:
 - selector: "span[tracetest.span.type=\"http\" name=\"HelloController#remote_hello\" http.target=\"/remotehello\" http.method=\"GET\"]"
   name: OpenTelemetryAPI-instrumented API has been called
   assertions:
   - attr:http.status_code = 200
 - selector: span[tracetest.span.type="http" name="rack.request" http.method="GET"]
   name: Datadog-instrumented API has been called
   assertions:
   - attr:http.status_code = 200
```

This test will call our API internally from our Docker Compose network at the endpoint `http://api-otel:8080/remotehello` , as specified in `trigger` section, and will validate if there are two spans on the resulting trace for this operation with two `specs`:

- One that checks if there is an `http` span called `HelloController#remote_hello` with status code `200` , defined on the first `selector` item, validating if the `OpenTelemetryAPI-instrumented API has been called` .
- Another that checks if there is an `http` span called `rack.request` with status code `200` , defined on the first `selector` item, validating if the `Datadog-instrumented API has been called` .

We can run the test with the command `tracetest run test -f ./tracebased-tests/test_integration.yaml` and see everything working:

```bash
> tracetest run test -f ./tracebased-tests/test_integration.yaml  

# It returns the following output

# ✔ RunGroup: #mDjJ2ZPIR (https://app.tracetest.io/organizations/ttorg_000000000/environments/ttenv_000000000/run/mDjJ2ZPIR)
# Summary: 1 passed, 0 failed, 0 pending
#  ✔ Test observability integration between OTelSDK-instrumented API to Datadog-instrumented API (https://app.tracetest.io/organizations/ttorg_1cbdabae7b8fd1c6/environments/ttenv_cfbdc98ade85ac15/test/CdhJp_xIR/run/4/test) - trace id: 64af63b48c2895fc1f91b811ef1d0ca3
#        ✔ OpenAPI-instrumented API has been called
#        ✔ Datadog-instrumented API has been called
```

We can see this result in Tracetest UI to see the entire trace:

![https://res.cloudinary.com/djwdcmwdz/image/upload/v1715870407/Blogposts/datadog-instr-otel/Untitled_34_z5m8j7.png](https://res.cloudinary.com/djwdcmwdz/image/upload/v1715870407/Blogposts/datadog-instr-otel/Untitled_34_z5m8j7.png)

## Final Remarks

We have discussed integrating Datadog-instrumented apps in an OpenTelemetry stack. We saw the problem of different TraceID representations for Datadog and OpenTelemetry, which can lead to disjointed traces. To solve that, we used an OpenTelemetry Collector with a `transform` processor to reconstruct the TraceID for Datadog spans, facilitating a unified tracing environment.

Additionally, we can use trace-based tests with Tracetest to automate the validation of these changes on the observability stack and ensure that the application works as intended.

As a team focused on building an open source tool in the observability space, the opportunity to help the overall OpenTelemetry community is important to us. That’s why we are researching and finding new ways of collecting traces from different tools and frameworks and making them work with the OpenTelemetry ecosystem.

The [example sources](https://github.com/kubeshop/tracetest/tree/main/examples/datadog-propagation) used in this article and [setup instructions](https://github.com/kubeshop/tracetest/blob/main/examples/datadog-propagation/README.md) are available in the Tracetest GitHub repository.

Would you like to learn more about Tracetest and what it brings to the table? Visit the Tracetest [docs](https://docs.tracetest.io/getting-started/installation) and try it out by [signing up today](https://app.tracetest.io)!

Also, please feel free to join our [Slack Community](https://dub.sh/tracetest-community), give [Tracetest a star on GitHub](https://github.com/kubeshop/tracetest), or schedule a [time to chat 1:1](https://calendly.com/ken-kubeshop/45min).