feat: add certificate metrics to agent for NGINXaaS #1731
Codecov Report ❌
Coverage diff, main vs. #1731:
| | main | #1731 | +/- |
|---|---|---|---|
| Coverage | 84.88% | 84.56% | -0.32% |
| Files | 105 | 111 | +6 |
| Lines | 13632 | 13920 | +288 |
| Hits | 11571 | 11772 | +201 |
| Misses | 1538 | 1611 | +73 |
| Partials | 523 | 537 | +14 |
... and 1 file with indirect coverage changes.
| status: | ||
| class: receiver | ||
| stability: |
Note: the mdatagen schema requires stability to be defined at both the receiver level (status.stability: beta: [metrics]) and the metric level (stability.level: development)
Force-pushed from e563b15 to 4ba54d3.
| } | ||
|
|
||
| for _, path := range c.cfg.CertFilePaths { | ||
| cert, err := parseCertFile(path) |
A couple of issues to think about here:
- There are potentially a lot of certs, and parsing them is non-trivial work to repeat every 15s.
- A path may contain more than one certificate.
Do we have any notification mechanism for when c.cfg changes?
One option: walk all the file paths once to extract every cert, and keep a list of the certs with the data we need for each one (expiration, path, pubkeyalgo, serial, etc.) along with the file's mtime.
Then on each scrape we just iterate through that list and stat each file to see whether it has changed and needs to be reparsed.
Made some changes to the scraper to address your feedback:
> - there are potentially a lot of certs and parsing them can be non-trivial work to do every 15s.

Added an mtime-based cache. Each scrape does os.Stat per file; if mtime is unchanged we skip the read+parse and use cached certs.
> - A path may contain more than one certificate

parseCertFile now loops pem.Decode until exhausted instead of stopping at the first block. Each cert gets its own data point.
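For readers unfamiliar with the pattern, a minimal sketch of looping pem.Decode until the input is exhausted follows. It is not the PR's parseCertFile: `parseAllCerts` and `selfSignedPEM` are hypothetical helpers, and skipping non-CERTIFICATE blocks is an assumption about the desired behaviour.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"encoding/pem"
	"fmt"
	"math/big"
	"time"
)

// parseAllCerts returns every certificate found in a PEM bundle,
// skipping non-certificate blocks such as private keys.
func parseAllCerts(pemData []byte) ([]*x509.Certificate, error) {
	var certs []*x509.Certificate
	rest := pemData
	for {
		var block *pem.Block
		block, rest = pem.Decode(rest)
		if block == nil {
			break // no further PEM blocks in the input
		}
		if block.Type != "CERTIFICATE" {
			continue
		}
		cert, err := x509.ParseCertificate(block.Bytes)
		if err != nil {
			return nil, fmt.Errorf("parse certificate: %w", err)
		}
		certs = append(certs, cert)
	}
	return certs, nil
}

// selfSignedPEM builds a throwaway self-signed certificate for the demo.
func selfSignedPEM(cn string, serial int64) ([]byte, error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, err
	}
	tmpl := &x509.Certificate{
		SerialNumber: big.NewInt(serial),
		Subject:      pkix.Name{CommonName: cn},
		NotBefore:    time.Now().Add(-time.Hour),
		NotAfter:     time.Now().Add(24 * time.Hour),
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		return nil, err
	}
	return pem.EncodeToMemory(&pem.Block{Type: "CERTIFICATE", Bytes: der}), nil
}

func main() {
	a, err := selfSignedPEM("a.example.com", 1)
	if err != nil {
		panic(err)
	}
	b, err := selfSignedPEM("b.example.com", 2)
	if err != nil {
		panic(err)
	}
	certs, err := parseAllCerts(append(a, b...)) // a two-cert bundle
	if err != nil {
		panic(err)
	}
	for _, c := range certs {
		fmt.Println(c.Subject.CommonName, c.NotAfter.Unix())
	}
}
```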
Force-pushed from 4ba54d3 to f7f48ee.
Force-pushed from f7f48ee to 9b7c983.
Required by mdatagen for nginxplusreceiver, nginxreceiver, and containermetricsreceiver metrics. No behaviour change.
Add metadata.yaml defining nginx.certificate.expiry (gauge, Unix timestamp) with attributes file_path, public_key_algorithm, serial_number, subject.common_name. Add CertificateReceiver config type with InstanceID and CertFilePaths []string.
Run: cd internal/collector/certificatereceiver && mdatagen metadata.yaml
Force-pushed from 9b7c983 to 54501f1.
Scraper reads cert files via crypto/x509 on each 15s scrape and emits nginx.certificate.expiry (Unix timestamp) per cert — renewals are picked up immediately without a collector restart. Gated on FeatureCertificates. Collector restarts only when the set of watched cert file paths changes.
Hey @vivki thanks for the PR! The metric/alerting use case makes sense. One item from me: Agent already parses every certificate for metadata (including the expiry) and populates
I'm also trying to think of a reason reading the cert files off disk every 15s is better, but the file watcher should notify Agent when cert files are renewed. So if there's a good reason for scraping the cert expiry off disk, please update the PR with comments as to why.
Force-pushed from 54501f1 to a9a7404.
@CVanF5 Thanks for the feedback! We opted for the 15s scrape because while CertificateMeta has this data in memory, putting it in the receiver config means the collector must restart every time a cert is renewed (since the Certificate's NotAfter changes). Fetching on scrape also follows the nginxplusreceiver pattern. I've also pushed a cache optimization that should mitigate load concerns, so we're not reading every file during every scrape. (I might've missed your initial review, sorry about that.) The cache is mtime-based; we only os.Stat per file per scrape, and skip the parse entirely unless the file actually changed.
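To illustrate why keeping only paths in the receiver config avoids restarts on renewal, here is a hedged sketch of a restart check that compares the watched paths as sets. `pathSetChanged` is a hypothetical helper, not a function from this PR; a renewal rewrites a file in place, so the path set stays the same and the check reports no change.

```go
package main

import "fmt"

// pathSetChanged reports whether two path lists differ as sets,
// ignoring order and duplicates.
func pathSetChanged(oldPaths, newPaths []string) bool {
	oldSet := make(map[string]struct{}, len(oldPaths))
	for _, p := range oldPaths {
		oldSet[p] = struct{}{}
	}
	newSet := make(map[string]struct{}, len(newPaths))
	for _, p := range newPaths {
		newSet[p] = struct{}{}
	}
	if len(oldSet) != len(newSet) {
		return true
	}
	for p := range newSet {
		if _, ok := oldSet[p]; !ok {
			return true
		}
	}
	return false
}

func main() {
	before := []string{"/etc/nginx/a.pem", "/etc/nginx/b.pem"}

	// A renewal rewrites a.pem in place: same paths, so no restart is needed.
	fmt.Println(pathSetChanged(before, []string{"/etc/nginx/b.pem", "/etc/nginx/a.pem"}))

	// Adding a new cert file changes the set, which would trigger a restart.
	fmt.Println(pathSetChanged(before, []string{"/etc/nginx/a.pem", "/etc/nginx/b.pem", "/etc/nginx/c.pem"}))
}
```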
| return nil | ||
| } | ||
|
|
||
| func parseCertFile(path string) ([]*x509.Certificate, error) { |
Is it possible to reuse the existing FileMetaWithCertificate function, so that we parse the certificate from the file in the same way we do when setting the FileOverview that we pass to the management plane?
I introduced parseCertFile to address the concern of multi-cert chains. The problem with FileMetaWithCertificate is that it delegates to cert.LoadCertificate, which decodes only the first PEM block, causing the multi-cert issue. I thought it cleaner/less invasive to create a new self-contained func.
| oc.config.Collector.Receivers.CertificateReceivers[i+1:]..., | ||
| ) | ||
|
|
||
| return true |
Will this return early before handling other instances?
I think it's supposed to return early once it has handled the matching instance, which should match what's going on in updateExistingNginxPlusReceiver and updateExistingNginxOSSReceiver.
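The return-early pattern under discussion can be sketched as follows. The `certReceiver` type and `removeReceiver` function are hypothetical stand-ins, and the sketch assumes instance IDs are unique, which is what makes stopping at the first match safe. Returning immediately also avoids continuing to range over a slice that was just modified.

```go
package main

import "fmt"

// certReceiver is a hypothetical stand-in for the receiver config type.
type certReceiver struct {
	InstanceID    string
	CertFilePaths []string
}

// removeReceiver deletes the first receiver with the given instance ID and
// reports whether anything was removed. It returns as soon as the match is
// spliced out, on the assumption that instance IDs are unique.
func removeReceiver(receivers []certReceiver, instanceID string) ([]certReceiver, bool) {
	for i, r := range receivers {
		if r.InstanceID == instanceID {
			// Splice out element i; note this reuses the backing array.
			return append(receivers[:i], receivers[i+1:]...), true
		}
	}
	return receivers, false
}

func main() {
	receivers := []certReceiver{
		{InstanceID: "instance-a"},
		{InstanceID: "instance-b"},
		{InstanceID: "instance-c"},
	}
	receivers, removed := removeReceiver(receivers, "instance-b")
	fmt.Println(removed, len(receivers))
	for _, r := range receivers {
		fmt.Println(r.InstanceID)
	}
}
```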
| type: string | ||
|
|
||
| metrics: | ||
| nginx.certificate.expiry: |
TODO: agree on name (nginx.ssl.certificate.expiry to match existing ssl metrics or different namespace to avoid potentially conflicting in the future)
NGINXAAS-1315: Certificate expiry metric receiver
Motivation
As a platform engineer managing NGINXaaS deployments, I want to be alerted before a certificate expires. This alert should come from the same monitoring stack, following existing metrics patterns, and the metric labels should help identify which cert is the problem: common name, file path, algorithm, serial number.
nginx-agent already indexes every certificate nginx is using as part of config parsing. This change makes that data useful by exporting it as a metric, giving operators a simple threshold alert on nginx.certificate.expiry without any additional tooling.
The receiver is separate from the existing nginx/nginxplus receivers because it covers a distinct concern (TLS hygiene vs. traffic metrics), it can emit a lot of data points on cert-heavy deployments, and it should be easy to enable or disable independently.
Implementation
Adds a certificate OTel receiver that scrapes cert files via crypto/x509 every 15s and emits nginx.certificate.expiry, a gauge of the Unix timestamp at which each cert expires. Handles multiple certs per PEM file (e.g. chain/bundle files).
Attributes: file_path, public_key_algorithm, serial_number, subject.common_name
Resource attribute: instance.id | Gated on: FeatureCertificates
The config holds only cert file paths (not metadata), keeping it consistent with the nginxplusreceiver pattern where the scraper fetches live data. The scraper uses an mtime-based cache: each 15s scrape calls os.Stat per file and only re-parses via crypto/x509 when the file has actually changed. Renewals are reflected on the next scrape without a collector restart. The collector only restarts when the set of watched paths changes.
Commit Descriptions
3793b30
0e8a6bf
dfc308f
ce0e4ec
Checklist
Before creating a PR, run through this checklist and mark each as complete.
- CONTRIBUTING document
- make install-tools and have attached any dependency changes to this pull request
- README.md)