feat: add certificate metrics to agent for NGINXaaS #1731

Open
vivki wants to merge 4 commits into nginx:main from vivki:naas-1315-certificate-receiver

Conversation

vivki commented Jun 11, 2026
Contributor

NGINXAAS-1315: Certificate expiry metric receiver

Motivation

As a platform engineer managing NGINXaaS deployments, I want to be alerted before a certificate expires. This alert should come from the same monitoring stack, following existing metrics patterns, and the metric labels should help identify which cert is the problem: common name, file path, algorithm, serial number.

nginx-agent already indexes every certificate nginx is using as part of config parsing. This change makes that data useful by exporting it as a metric, giving operators a simple threshold alert on nginx.certificate.expiry without any additional tooling.

The receiver is separate from the existing nginx/nginxplus receivers because it covers a distinct concern (TLS hygiene vs. traffic metrics), it can emit a lot of data points on cert-heavy deployments, and it should be easy to enable or disable independently.

Implementation

Adds a certificate OTel receiver that scrapes cert files via crypto/x509 every 15s and emits nginx.certificate.expiry, a gauge of the Unix timestamp at which each cert expires. Handles multiple certs per PEM file (e.g. chain/bundle files).

Attributes: file_path, public_key_algorithm, serial_number, subject.common_name
Resource attribute: instance.id
Gated on: FeatureCertificates
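
Because the gauge carries an absolute Unix timestamp rather than a remaining duration, a threshold alert has to subtract the current time first. A minimal sketch of that arithmetic, assuming the value is in seconds; `secondsUntilExpiry` and `expiresWithin` are hypothetical helper names used only for illustration and are not part of this PR:

```go
package main

import (
	"fmt"
	"time"
)

// secondsUntilExpiry converts the gauge value (assumed to be NotAfter as Unix
// seconds) into time remaining. A negative result means already expired.
func secondsUntilExpiry(expiryUnix, nowUnix int64) int64 {
	return expiryUnix - nowUnix
}

// expiresWithin reports whether the certificate expires within the given
// number of days, which is the shape of a typical threshold alert.
func expiresWithin(expiryUnix, nowUnix, days int64) bool {
	return secondsUntilExpiry(expiryUnix, nowUnix) < days*24*60*60
}

func main() {
	now := time.Now()
	// Sample value: a certificate that expires ten days from now.
	expiry := now.Add(10 * 24 * time.Hour).Unix()

	fmt.Println("alert at 30 days:", expiresWithin(expiry, now.Unix(), 30))
	fmt.Println("alert at 7 days:", expiresWithin(expiry, now.Unix(), 7))
}
```

In a real deployment this comparison would normally live in the alerting backend's query language rather than in Go; the sketch only shows what the metric value means.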

The config holds only cert file paths (not metadata), keeping it consistent with the nginxplusreceiver pattern where the scraper fetches live data. The scraper uses an mtime-based cache: each 15s scrape calls os.Stat per file and only re-parses via crypto/x509 when the file has actually changed. Renewals are reflected on the next scrape without a collector restart. The collector only restarts when the set of watched paths changes.
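
The mtime-based cache described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's actual scraper: `mtimeCache`, `entry`, and `fileModTime` are hypothetical names, the parse step is injected as a function so the caching logic stands on its own, and cached values are simplified to expiry timestamps.

```go
package main

import (
	"fmt"
	"os"
)

// entry is what the cache keeps per file: the mtime seen at the last parse
// and the parsed values (here, NotAfter as Unix seconds, one per cert).
type entry struct {
	modTime  int64
	expiries []int64
}

// mtimeCache is a hypothetical stand-in for the scraper's cache.
type mtimeCache struct {
	entries map[string]entry
	stat    func(path string) (int64, error)   // mtime in Unix nanoseconds
	parse   func(path string) ([]int64, error) // expensive read + parse
}

func newMtimeCache(stat func(string) (int64, error), parse func(string) ([]int64, error)) *mtimeCache {
	return &mtimeCache{entries: map[string]entry{}, stat: stat, parse: parse}
}

// get returns cached values while the mtime is unchanged and re-parses otherwise.
func (c *mtimeCache) get(path string) ([]int64, error) {
	mt, err := c.stat(path)
	if err != nil {
		delete(c.entries, path) // file vanished or unreadable: drop stale data
		return nil, err
	}
	if e, ok := c.entries[path]; ok && e.modTime == mt {
		return e.expiries, nil
	}
	vals, err := c.parse(path)
	if err != nil {
		return nil, err
	}
	c.entries[path] = entry{modTime: mt, expiries: vals}
	return vals, nil
}

// fileModTime is the os.Stat-based stat function used outside of tests.
func fileModTime(path string) (int64, error) {
	info, err := os.Stat(path)
	if err != nil {
		return 0, err
	}
	return info.ModTime().UnixNano(), nil
}

func main() {
	f, err := os.CreateTemp("", "cert-*.pem")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	defer os.Remove(f.Name())
	f.Close()

	parses := 0
	cache := newMtimeCache(fileModTime, func(string) ([]int64, error) {
		parses++
		return []int64{1893456000}, nil // placeholder; a real parse would use crypto/x509
	})
	for i := 0; i < 3; i++ {
		if _, err := cache.get(f.Name()); err != nil {
			fmt.Println("error:", err)
			return
		}
	}
	fmt.Println("parses:", parses)
}
```

Three scrapes of an unchanged file cost three stat calls but only one parse. One caveat of any mtime-based scheme is that a rewrite which leaves the mtime untouched goes unnoticed until the next change.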

Commit Descriptions

Commit    Description
3793b30   Add mdatagen stability annotations to existing receivers (prerequisite, no behaviour change)
0e8a6bf   Define metric schema (metadata.yaml) and config types
dfc308f   Run mdatagen, generated boilerplate only
ce0e4ec   Implement scraper, factory, plugin wiring, tests

Checklist

Before creating a PR, run through this checklist and mark each as complete.

  • I have read the CONTRIBUTING document
  • I have run make install-tools and have attached any dependency changes to this pull request
  • If applicable, I have added tests that prove my fix is effective or that my feature works
  • If applicable, I have checked that any relevant tests pass after adding my changes
  • If applicable, I have updated any relevant documentation (README.md)
  • If applicable, I have tested my cross-platform changes on Ubuntu 22, Redhat 8, SUSE 15 and FreeBSD 13

vivki requested a review from a team as a code owner June 11, 2026 21:04
github-actions bot added the chore and documentation labels Jun 11, 2026
codecov bot commented Jun 11, 2026

Codecov Report

❌ Patch coverage is 71.28378% with 85 lines in your changes missing coverage. Please review.
✅ Project coverage is 84.56%. Comparing base (e10a0d3) to head (a9a7404).
⚠️ Report is 15 commits behind head on main.

Files with missing lines                                 Patch %   Lines
internal/collector/otel_collector_plugin.go              4.87%     39 Missing ⚠️
...atereceiver/internal/metadata/generated_metrics.go    82.85%    17 Missing and 1 partial ⚠️
internal/collector/certificatereceiver/scraper.go        77.77%    9 Missing and 7 partials ⚠️
...catereceiver/internal/metadata/generated_config.go    73.33%    4 Missing and 4 partials ⚠️
internal/collector/certificatereceiver/factory.go        84.00%    2 Missing and 2 partials ⚠️

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1731      +/-   ##
==========================================
- Coverage   84.88%   84.56%   -0.32%     
==========================================
  Files         105      111       +6     
  Lines       13632    13920     +288     
==========================================
+ Hits        11571    11772     +201     
- Misses       1538     1611      +73     
- Partials      523      537      +14     
Files with missing lines Coverage Δ
internal/collector/certificatereceiver/config.go 100.00% <100.00%> (ø)
...tereceiver/internal/metadata/generated_resource.go 100.00% <100.00%> (ø)
...r/cpuscraper/internal/metadata/generated_config.go 75.75% <100.00%> (ø)
...emoryscraper/internal/metadata/generated_config.go 73.33% <100.00%> (ø)
internal/collector/factories.go 100.00% <100.00%> (ø)
internal/config/types.go 86.66% <100.00%> (-0.44%) ⬇️
internal/collector/certificatereceiver/factory.go 84.00% <84.00%> (ø)
...catereceiver/internal/metadata/generated_config.go 73.33% <73.33%> (ø)
internal/collector/certificatereceiver/scraper.go 77.77% <77.77%> (ø)
...atereceiver/internal/metadata/generated_metrics.go 82.85% <82.85%> (ø)
... and 1 more

... and 1 file with indirect coverage changes


Δ = absolute <relative> (impact), ø = not affected, ? = missing data

status:
  class: receiver
  stability:

Contributor Author

Note: the mdatagen schema requires stability to be defined at both the receiver level (status.stability: beta: [metrics]) and the metric level (stability.level: development).

vivki force-pushed the naas-1315-certificate-receiver branch from e563b15 to 4ba54d3 June 12, 2026 19:43
vivki changed the title from "Naas 1315 certificate receiver" to "feat: add certificate metrics to agent for NGINXaaS" Jun 15, 2026
}

for _, path := range c.cfg.CertFilePaths {
	cert, err := parseCertFile(path)

Member

A couple of issues to think about here:

  1. There are potentially a lot of certs, and parsing them every 15s can be non-trivial work.
  2. A path may contain more than one certificate.

Do we have any notification mechanism for when c.cfg changes?

One option: go through all the file paths once to extract every cert, and keep a list of the certs with the data we need for each one (expiration, path, pubkeyalgo, serial, etc.) as well as the file's mtime.

Then on each scrape we just iterate through that list and stat each file to see whether it has changed and needs to be reparsed.

Contributor Author

Made some changes to the scraper to address your feedback:

> 1. there are potentially a lot of certs and parsing them can be non-trivial work to do every 15s.

Added an mtime-based cache. Each scrape calls os.Stat per file; if the mtime is unchanged, we skip the read and parse and use the cached certs.

> 2. A path may contain more than one certificate

parseCertFile now loops pem.Decode until the input is exhausted instead of stopping at the first block. Each cert gets its own data point.
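
For readers unfamiliar with the pattern, the decode loop can be sketched as below. This is not the PR's `parseCertFile`; `parseCerts` and `selfSignedPEM` are hypothetical names, the latter existing only to produce sample input, and skipping non-certificate blocks is an assumption about the desired behaviour.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"encoding/pem"
	"errors"
	"fmt"
	"math/big"
	"time"
)

// parseCerts walks every PEM block in data and returns all certificates found.
func parseCerts(data []byte) ([]*x509.Certificate, error) {
	var certs []*x509.Certificate
	for {
		var block *pem.Block
		block, data = pem.Decode(data) // data becomes the unparsed remainder
		if block == nil {
			break // no further PEM blocks
		}
		if block.Type != "CERTIFICATE" {
			continue // assumption: ignore keys or other blocks in the same file
		}
		cert, err := x509.ParseCertificate(block.Bytes)
		if err != nil {
			return nil, fmt.Errorf("parse certificate: %w", err)
		}
		certs = append(certs, cert)
	}
	if len(certs) == 0 {
		return nil, errors.New("no certificates found")
	}
	return certs, nil
}

// selfSignedPEM builds a throwaway self-signed certificate for the demo.
func selfSignedPEM(cn string) ([]byte, error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, err
	}
	tmpl := &x509.Certificate{
		SerialNumber: big.NewInt(time.Now().UnixNano()),
		Subject:      pkix.Name{CommonName: cn},
		NotBefore:    time.Now().Add(-time.Hour),
		NotAfter:     time.Now().Add(24 * time.Hour),
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		return nil, err
	}
	return pem.EncodeToMemory(&pem.Block{Type: "CERTIFICATE", Bytes: der}), nil
}

func main() {
	leaf, err := selfSignedPEM("leaf.example.test")
	if err != nil {
		panic(err)
	}
	inter, err := selfSignedPEM("intermediate.example.test")
	if err != nil {
		panic(err)
	}
	// Concatenating PEM blocks is how chain/bundle files are laid out.
	certs, err := parseCerts(append(leaf, inter...))
	if err != nil {
		panic(err)
	}
	for _, c := range certs {
		fmt.Println(c.Subject.CommonName, c.NotAfter.Unix())
	}
}
```

Each returned certificate maps to one data point, so a two-cert bundle yields two `nginx.certificate.expiry` values that share a `file_path` but differ in `serial_number`.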

vivki force-pushed the naas-1315-certificate-receiver branch from 4ba54d3 to f7f48ee June 23, 2026 00:29
github-actions bot added the enhancement label Jun 23, 2026
vivki force-pushed the naas-1315-certificate-receiver branch from f7f48ee to 9b7c983 June 23, 2026 00:32
vivki added 3 commits June 22, 2026 17:42

Required by mdatagen for nginxplusreceiver, nginxreceiver, and
containermetricsreceiver metrics. No behaviour change.

Add metadata.yaml defining nginx.certificate.expiry (gauge, Unix timestamp)
with attributes file_path, public_key_algorithm, serial_number, subject.common_name.
Add CertificateReceiver config type with InstanceID and CertFilePaths []string.

Run: cd internal/collector/certificatereceiver && mdatagen metadata.yaml

vivki force-pushed the naas-1315-certificate-receiver branch from 9b7c983 to 54501f1 June 23, 2026 00:43

Scraper reads cert files via crypto/x509 on each 15s scrape and emits
nginx.certificate.expiry (Unix timestamp) per cert — renewals are picked
up immediately without a collector restart. Gated on FeatureCertificates.
Collector restarts only when the set of watched cert file paths changes.
CVanF5 commented Jun 23, 2026

Collaborator

Hey @vivki, thanks for the PR! The metric/alerting use case makes sense. One item from me: Agent already parses every certificate for metadata (including the expiry) and populates CertificateMeta with it, which should be available for you to extract. certFilePathsFromFiles keeps only the file path and hands it back to the certificate receiver, which then needs to scrape the cert files off disk. So it looks to me like we're reading files off disk every 15s for a value that's already available in memory.

I've been trying to think of a reason why reading the cert files off disk every 15s would be better, but the file watcher should already notify Agent when cert files are renewed. So if there's a good reason for scraping the cert expiry off disk, please update the PR with comments explaining why.

vivki force-pushed the naas-1315-certificate-receiver branch from 54501f1 to a9a7404 June 23, 2026 17:58

vivki commented Jun 23, 2026

Contributor Author

@CVanF5 Thanks for the feedback! We opted for the 15s scrape because while CertificateMeta has this data in memory, putting it in the receiver config means the collector must restart every time a cert is renewed (since the Certificate's NotAfter changes). Fetching on scrape also follows the nginxplusreceiver pattern.

I've also pushed a cache optimization that should mitigate load concerns, so we're not reading every file on every scrape. (I might've missed your initial review, sorry about that.) The cache is mtime-based: each scrape calls os.Stat once per file and skips the parse entirely unless the file has actually changed.
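
The restart argument above rests on what the collector compares when deciding whether its config changed. A rough sketch of a path-only comparison follows; `pathSetChanged` is a hypothetical helper, and treating the paths as an unordered, de-duplicated set is an assumption rather than something this thread confirms.

```go
package main

import "fmt"

// toSet collapses a slice of paths into a set, ignoring order and duplicates.
func toSet(paths []string) map[string]struct{} {
	set := make(map[string]struct{}, len(paths))
	for _, p := range paths {
		set[p] = struct{}{}
	}
	return set
}

// pathSetChanged reports whether the watched paths differ as a set.
// Certificate contents (NotAfter, serial, ...) are deliberately not compared,
// so a renewal written to the same path does not count as a change.
func pathSetChanged(oldPaths, newPaths []string) bool {
	oldSet, newSet := toSet(oldPaths), toSet(newPaths)
	if len(oldSet) != len(newSet) {
		return true
	}
	for p := range oldSet {
		if _, ok := newSet[p]; !ok {
			return true
		}
	}
	return false
}

func main() {
	before := []string{"/etc/nginx/a.pem", "/etc/nginx/b.pem"}
	reordered := []string{"/etc/nginx/b.pem", "/etc/nginx/a.pem"}
	added := []string{"/etc/nginx/a.pem", "/etc/nginx/b.pem", "/etc/nginx/c.pem"}

	fmt.Println("reordered:", pathSetChanged(before, reordered))
	fmt.Println("added:", pathSetChanged(before, added))
}
```

If expiry metadata were stored in the config instead, every renewal would alter the compared value and so force a restart, which is the cost the comment describes.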

return nil
}

func parseCertFile(path string) ([]*x509.Certificate, error) {

Collaborator

Is it possible to reuse the existing FileMetaWithCertificate function, so we're parsing the certificate from the file in the same way we do when setting the FileOverview that we pass to the management plane?

Contributor Author

I introduced parseCertFile to address the concern of multi-cert chains. The problem with FileMetaWithCertificate is that it delegates to cert.LoadCertificate, which decodes only the first PEM block, causing the multi-cert issue. I thought it cleaner/less invasive to create a new self-contained func.

oc.config.Collector.Receivers.CertificateReceivers[i+1:]...,
)

return true

Collaborator

Will this return early before handling other instances?

Contributor Author

I think it's supposed to return early once it finds the matching instance; this should match what's going on in updateExistingNginxPlusReceiver and updateExistingNginxOSSReceiver.
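
The question of whether the early return skips other instances can be illustrated with a small sketch. It is not the code under review: `certificateReceiver` mirrors only the two fields named in the PR description, `removeReceiver` is a hypothetical helper, and the early return is only correct under the assumption that each instance ID appears at most once in the slice.

```go
package main

import "fmt"

// certificateReceiver mirrors the two fields named in the PR description;
// everything else about the type is assumed.
type certificateReceiver struct {
	InstanceID    string
	CertFilePaths []string
}

// removeReceiver deletes the entry for instanceID and reports whether the
// slice changed. It returns as soon as a match is found, so any later entry
// with the same ID would be left in place.
func removeReceiver(receivers []certificateReceiver, instanceID string) ([]certificateReceiver, bool) {
	for i, r := range receivers {
		if r.InstanceID == instanceID {
			// append shifts the tail left in the same backing array.
			return append(receivers[:i], receivers[i+1:]...), true
		}
	}
	return receivers, false
}

func main() {
	receivers := []certificateReceiver{
		{InstanceID: "instance-a", CertFilePaths: []string{"/etc/nginx/a.pem"}},
		{InstanceID: "instance-b", CertFilePaths: []string{"/etc/nginx/b.pem"}},
	}

	receivers, changed := removeReceiver(receivers, "instance-a")
	fmt.Println(changed, len(receivers), receivers[0].InstanceID)
}
```

If duplicate IDs were possible, the loop would need to keep going, or filter into a new slice, instead of returning on the first hit.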

type: string

metrics:
  nginx.certificate.expiry:

Collaborator

TODO: agree on name (nginx.ssl.certificate.expiry to match existing ssl metrics or different namespace to avoid potentially conflicting in the future)

Labels

chore, documentation, enhancement

4 participants