OpenSRM
Open Service Reliability Manifest - Define service reliability requirements as code
| Type | ServiceReliabilityManifest or Template |
|---|---|
| File match | service.reliability.yaml, service.reliability.yml, *.reliability.yaml, *.reliability.yml, .opensrm.yaml, .opensrm.yml |
| Schema URL | https://catalog.lintel.tools/schemas/schemastore/opensrm/latest.json |
| Source | https://www.schemastore.org/opensrm.json |
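To illustrate how the file-match patterns listed above select manifest files, here is a minimal Python sketch. The function name `is_opensrm_file` is hypothetical, and the choice to match against the file's base name only (rather than the full path) is an assumption; the tool that applies these patterns may behave differently.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

# Patterns copied from the "File match" entry above.
FILE_MATCH_PATTERNS = [
    "service.reliability.yaml",
    "service.reliability.yml",
    "*.reliability.yaml",
    "*.reliability.yml",
    ".opensrm.yaml",
    ".opensrm.yml",
]


def is_opensrm_file(path: str) -> bool:
    """Return True if the file's base name matches any pattern.

    Assumption: only the base name is compared, case-sensitively.
    """
    name = PurePosixPath(path).name
    return any(fnmatchcase(name, pattern) for pattern in FILE_MATCH_PATTERNS)
```

Under these assumptions, `is_opensrm_file("services/checkout/payments.reliability.yaml")` is true, while a generic `config.yaml` is not matched.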
Validate with Lintel
npx @lintel/lintel check
Open Service Reliability Manifest - Define service reliability requirements as code https://github.com/rsionnach/opensrm
One of: ServiceReliabilityManifest, Template
Definitions
Duration format: number followed by ms, s, m, h, d, or w
Ratio value (0.0 - 1.0)
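The duration and ratio definitions above can be checked with a few lines of Python. This is a sketch, not the schema's own validation logic: the helper names `parse_duration_ms` and `is_ratio` are hypothetical, the number is assumed to be a non-negative integer, and the ratio bounds are assumed to be inclusive.

```python
import re

# Milliseconds per unit, for the units named in the duration format.
_UNIT_MS = {
    "ms": 1,
    "s": 1_000,
    "m": 60_000,
    "h": 3_600_000,
    "d": 86_400_000,
    "w": 604_800_000,
}

# Assumption: the number is a non-negative integer with no whitespace
# between it and the unit.
_DURATION_RE = re.compile(r"(\d+)(ms|s|m|h|d|w)")


def parse_duration_ms(text: str) -> int:
    """Convert a duration such as '500ms' or '5m' to milliseconds."""
    match = _DURATION_RE.fullmatch(text)
    if match is None:
        raise ValueError(f"not a valid duration: {text!r}")
    number, unit = match.groups()
    return int(number) * _UNIT_MS[unit]


def is_ratio(value: object) -> bool:
    """Return True for a number in the range 0.0 to 1.0 (assumed inclusive)."""
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        return False
    return 0.0 <= value <= 1.0
```

For example, `parse_duration_ms("5m")` yields 300000, and a string without a unit, such as `"10"`, raises `ValueError`.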
Service operational pattern https://github.com/rsionnach/opensrm#service-types
Service criticality tier
Latency SLO - measures response time at specified percentiles
Duration format: number followed by ms, s, m, h, d, or w
Throughput SLO - measures request processing rate
Minimum throughput
Throughput unit
Duration format: number followed by ms, s, m, h, d, or w
Processing time SLO - job completion time for worker services
Duration format: number followed by ms, s, m, h, d, or w
Duration SLO - job execution time for batch services
Duration format: number followed by ms, s, m, h, d, or w
Query latency SLO - response time for database services
Duration format: number followed by ms, s, m, h, d, or w
Replication lag SLO for database services
Maximum acceptable replication lag (e.g., 1s, 500ms, 5m)
Duration format: number followed by ms, s, m, h, d, or w
Reversal rate SLO - measures how often humans override AI decisions https://github.com/rsionnach/opensrm#judgment-slos
High-confidence failure SLO - measures rate of confident but wrong decisions https://github.com/rsionnach/opensrm#judgment-slos
Calibration SLO - measures if stated confidence matches actual accuracy (ECE) https://github.com/rsionnach/opensrm#judgment-slos
Feedback latency SLO - measures time until decision quality is known https://github.com/rsionnach/opensrm#judgment-slos
Judgment SLOs for AI gate services - measure decision quality https://github.com/rsionnach/opensrm#judgment-slos
Reversal rate SLO - measures how often humans override AI decisions https://github.com/rsionnach/opensrm#judgment-slos
3 nested properties
High-confidence failure SLO - measures rate of confident but wrong decisions https://github.com/rsionnach/opensrm#judgment-slos
Calibration SLO - measures if stated confidence matches actual accuracy (ECE) https://github.com/rsionnach/opensrm#judgment-slos
Feedback latency SLO - measures time until decision quality is known https://github.com/rsionnach/opensrm#judgment-slos
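To make the judgment SLOs above more concrete, the following Python sketch computes three of the underlying metrics from a list of recorded decisions. The `Decision` record and the function names are hypothetical, the 0.9 threshold for "high confidence" is an assumption, and the calibration metric uses the common equal-width binned form of expected calibration error (ECE); OpenSRM tooling may define these quantities differently.

```python
from dataclasses import dataclass


@dataclass
class Decision:
    confidence: float  # stated confidence, 0.0 - 1.0
    correct: bool      # whether ground truth agreed with the decision
    overridden: bool   # whether a human reversed the decision


def reversal_rate(decisions: list[Decision]) -> float:
    """Fraction of decisions that were overridden by a human."""
    if not decisions:
        raise ValueError("no decisions recorded")
    return sum(d.overridden for d in decisions) / len(decisions)


def high_confidence_failure_rate(
    decisions: list[Decision], threshold: float = 0.9
) -> float:
    """Fraction of high-confidence decisions that turned out wrong.

    Assumption: 'high confidence' means confidence >= threshold.
    """
    confident = [d for d in decisions if d.confidence >= threshold]
    if not confident:
        return 0.0
    return sum(not d.correct for d in confident) / len(confident)


def expected_calibration_error(decisions: list[Decision], bins: int = 10) -> float:
    """Binned ECE: weighted mean of |accuracy - mean confidence| per bin."""
    if not decisions:
        raise ValueError("no decisions recorded")
    buckets: list[list[Decision]] = [[] for _ in range(bins)]
    for d in decisions:
        index = min(int(d.confidence * bins), bins - 1)
        buckets[index].append(d)
    ece = 0.0
    for bucket in buckets:
        if not bucket:
            continue
        accuracy = sum(d.correct for d in bucket) / len(bucket)
        mean_confidence = sum(d.confidence for d in bucket) / len(bucket)
        ece += len(bucket) / len(decisions) * abs(accuracy - mean_confidence)
    return ece
```

A low ECE means stated confidence tracks observed accuracy; a service that reports 0.95 confidence but is right only half the time contributes a large gap in its bin.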
Service Level Objectives
Availability SLO - measures proportion of successful requests
Latency SLO - measures response time at specified percentiles
7 nested properties
Duration format: number followed by ms, s, m, h, d, or w
Error rate SLO - measures proportion of failed requests
Throughput SLO - measures request processing rate
3 nested properties
Minimum throughput
Throughput unit
Duration format: number followed by ms, s, m, h, d, or w
Processing time SLO - job completion time for worker services
6 nested properties
Duration format: number followed by ms, s, m, h, d, or w
Lag SLO - consumer lag for stream services
Success rate SLO - measures proportion of successful operations
Duration SLO - job execution time for batch services
6 nested properties
Duration format: number followed by ms, s, m, h, d, or w
Schedule adherence SLO - measures timeliness of batch job starts
Data freshness SLO - measures staleness of batch output data
Query latency SLO - response time for database services
7 nested properties
Duration format: number followed by ms, s, m, h, d, or w
Replication lag SLO for database services
2 nested properties
Maximum acceptable replication lag (e.g., 1s, 500ms, 5m)
Duration format: number followed by ms, s, m, h, d, or w
Connection availability SLO for database services
References to external OpenSLO files
Custom SLO types
Judgment SLOs for AI gate services - measure decision quality https://github.com/rsionnach/opensrm#judgment-slos
4 nested properties
Reversal rate SLO - measures how often humans override AI decisions https://github.com/rsionnach/opensrm#judgment-slos
3 nested properties
High-confidence failure SLO - measures rate of confident but wrong decisions https://github.com/rsionnach/opensrm#judgment-slos
Calibration SLO - measures if stated confidence matches actual accuracy (ECE) https://github.com/rsionnach/opensrm#judgment-slos
Feedback latency SLO - measures time until decision quality is known https://github.com/rsionnach/opensrm#judgment-slos
Latency guarantees in a contract
Throughput guarantees in a contract
Maximum sustained requests per second
Maximum burst capacity
Reliability guarantees promised to dependents https://github.com/rsionnach/opensrm#contracts
Throughput guarantees in a contract
2 nested properties
Maximum sustained requests per second
Maximum burst capacity
Judgment guarantees in a contract for AI gate services
Upstream dependency and expected guarantees
Dependency identifier (service name)
Type of dependency
Whether failure causes service failure
Expected guarantees from dependency
URL to dependency's SRM manifest
Service ownership information
Team identifier
Slack channel for alerts (e.g., '#team-oncall')
Team email address
On-call rotation or escalation policy name
PagerDuty integration
2 nested properties
PagerDuty service ID
PagerDuty escalation policy ID
URL to operational runbook
URL to service documentation
Whether on-call rotation is required
Telemetry configuration for AI gate services
Event name mappings
3 nested properties
Event name emitted when AI makes a decision
Event name emitted when a decision is overridden
Event name emitted when ground truth is known
Attribute name mappings
3 nested properties
Field name for decision identifier
Field name for confidence score
Field name for decision value
Observability requirements
Metric requirements
3 nested properties
Metric names that must exist
1 nested property
Labels that must be present
Metric naming convention
Dashboard requirements
2 nested properties
Whether dashboards must exist
URLs to service dashboards
Alert requirements
2 nested properties
Whether alerts must be configured
Distributed tracing requirements
2 nested properties
Whether distributed tracing must be enabled
Expected sampling rate (0.0 - 1.0)
Deployment requirements
Environments where service is deployed
Deployment gates
3 nested properties
2 nested properties
2 nested properties
4 nested properties
Duration format: number followed by ms, s, m, h, d, or w
Rollback configuration
2 nested properties
Whether automatic rollback is enabled
2 nested properties
Error rate increase threshold
Latency increase threshold
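One way a deployment tool might act on the rollback configuration above is sketched below in Python. Everything here is an assumption for illustration: the `RollbackConfig` field names are hypothetical, and both thresholds are interpreted as relative increases over a pre-deployment baseline (0.5 meaning 50% higher). The schema may define the units differently.

```python
from dataclasses import dataclass


@dataclass
class RollbackConfig:
    automatic: bool              # whether automatic rollback is enabled
    error_rate_increase: float   # assumed: relative increase, e.g. 0.5 = +50%
    latency_increase: float      # assumed: relative increase, e.g. 0.2 = +20%


def relative_increase(baseline: float, current: float) -> float:
    """Relative change from baseline; infinite if baseline is zero and current is not."""
    if baseline == 0:
        return float("inf") if current > 0 else 0.0
    return (current - baseline) / baseline


def should_roll_back(
    config: RollbackConfig,
    baseline_error_rate: float,
    current_error_rate: float,
    baseline_latency_ms: float,
    current_latency_ms: float,
) -> bool:
    """Return True if rollback is enabled and either threshold is exceeded."""
    if not config.automatic:
        return False
    error_jump = relative_increase(baseline_error_rate, current_error_rate)
    latency_jump = relative_increase(baseline_latency_ms, current_latency_ms)
    return (
        error_jump > config.error_rate_increase
        or latency_jump > config.latency_increase
    )
```

With thresholds of 0.5 and 0.2, an error rate that doubles from 0.01 to 0.02 would trigger a rollback, while a latency rise from 100 ms to 110 ms alone would not.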
Service identity and classification
Unique service identifier (lowercase alphanumeric with hyphens)
Owning team identifier
Service criticality tier
Name of template to inherit from
Human-readable description
Key-value pairs for filtering/grouping
Arbitrary metadata (not used for selection)
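The service identifier rule above (lowercase alphanumeric with hyphens) can be approximated with a regular expression. The Python sketch below uses a hypothetical helper `is_valid_service_name`; the exact pattern is an assumption, in particular the choice to reject leading, trailing, and doubled hyphens, so check the schema source for the authoritative rule.

```python
import re

# Assumed pattern: groups of lowercase letters or digits joined by single hyphens.
_SERVICE_NAME_RE = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")


def is_valid_service_name(name: str) -> bool:
    """Return True if name looks like a valid service identifier."""
    return _SERVICE_NAME_RE.fullmatch(name) is not None
```

Under this assumed pattern, `checkout-api` is accepted, while `Checkout`, `checkout_api`, and `-checkout` are rejected.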
Template identity
Unique template identifier
Human-readable description
Reliability requirements
Service operational pattern https://github.com/rsionnach/opensrm#service-types
Service Level Objectives
16 nested properties
Availability SLO - measures proportion of successful requests
Latency SLO - measures response time at specified percentiles
7 nested properties
Duration format: number followed by ms, s, m, h, d, or w
Error rate SLO - measures proportion of failed requests
Throughput SLO - measures request processing rate
3 nested properties
Minimum throughput
Throughput unit
Duration format: number followed by ms, s, m, h, d, or w
Processing time SLO - job completion time for worker services
6 nested properties
Duration format: number followed by ms, s, m, h, d, or w
Lag SLO - consumer lag for stream services
Success rate SLO - measures proportion of successful operations
Duration SLO - job execution time for batch services
6 nested properties
Duration format: number followed by ms, s, m, h, d, or w
Schedule adherence SLO - measures timeliness of batch job starts
Data freshness SLO - measures staleness of batch output data
Query latency SLO - response time for database services
7 nested properties
Duration format: number followed by ms, s, m, h, d, or w
Replication lag SLO for database services
2 nested properties
Maximum acceptable replication lag (e.g., 1s, 500ms, 5m)
Duration format: number followed by ms, s, m, h, d, or w
Connection availability SLO for database services
References to external OpenSLO files
Custom SLO types
Judgment SLOs for AI gate services - measure decision quality https://github.com/rsionnach/opensrm#judgment-slos
4 nested properties
Reversal rate SLO - measures how often humans override AI decisions https://github.com/rsionnach/opensrm#judgment-slos
High-confidence failure SLO - measures rate of confident but wrong decisions https://github.com/rsionnach/opensrm#judgment-slos
Calibration SLO - measures if stated confidence matches actual accuracy (ECE) https://github.com/rsionnach/opensrm#judgment-slos
Feedback latency SLO - measures time until decision quality is known https://github.com/rsionnach/opensrm#judgment-slos
Reliability guarantees promised to dependents https://github.com/rsionnach/opensrm#contracts
4 nested properties
Throughput guarantees in a contract
2 nested properties
Maximum sustained requests per second
Maximum burst capacity
Judgment guarantees in a contract for AI gate services
Telemetry configuration for AI gate services
2 nested properties
Event name mappings
3 nested properties
Event name emitted when AI makes a decision
Event name emitted when a decision is overridden
Event name emitted when ground truth is known
Attribute name mappings
3 nested properties
Field name for decision identifier
Field name for confidence score
Field name for decision value
Upstream dependencies
Service ownership information
8 nested properties
Team identifier
Slack channel for alerts (e.g., '#team-oncall')
Team email address
On-call rotation or escalation policy name
PagerDuty integration
2 nested properties
PagerDuty service ID
PagerDuty escalation policy ID
URL to operational runbook
URL to service documentation
Whether on-call rotation is required
Observability requirements
4 nested properties
Metric requirements
3 nested properties
Metric names that must exist
Metric naming convention
Dashboard requirements
2 nested properties
Whether dashboards must exist
URLs to service dashboards
Alert requirements
2 nested properties
Whether alerts must be configured
Distributed tracing requirements
2 nested properties
Whether distributed tracing must be enabled
Expected sampling rate (0.0 - 1.0)
Deployment requirements
3 nested properties
Environments where service is deployed
Deployment gates
3 nested properties
Rollback configuration
2 nested properties
Whether automatic rollback is enabled
Partial spec for template inheritance
Service operational pattern https://github.com/rsionnach/opensrm#service-types
Service Level Objectives
16 nested properties
Availability SLO - measures proportion of successful requests
Latency SLO - measures response time at specified percentiles
7 nested properties
Duration format: number followed by ms, s, m, h, d, or w
Error rate SLO - measures proportion of failed requests
Throughput SLO - measures request processing rate
3 nested properties
Minimum throughput
Throughput unit
Duration format: number followed by ms, s, m, h, d, or w
Processing time SLO - job completion time for worker services
6 nested properties
Duration format: number followed by ms, s, m, h, d, or w
Lag SLO - consumer lag for stream services
Success rate SLO - measures proportion of successful operations
Duration SLO - job execution time for batch services
6 nested properties
Duration format: number followed by ms, s, m, h, d, or w
Schedule adherence SLO - measures timeliness of batch job starts
Data freshness SLO - measures staleness of batch output data
Query latency SLO - response time for database services
7 nested properties
Duration format: number followed by ms, s, m, h, d, or w
Replication lag SLO for database services
2 nested properties
Maximum acceptable replication lag (e.g., 1s, 500ms, 5m)
Duration format: number followed by ms, s, m, h, d, or w
Connection availability SLO for database services
References to external OpenSLO files
Custom SLO types
Judgment SLOs for AI gate services - measure decision quality https://github.com/rsionnach/opensrm#judgment-slos
4 nested properties
Reversal rate SLO - measures how often humans override AI decisions https://github.com/rsionnach/opensrm#judgment-slos
High-confidence failure SLO - measures rate of confident but wrong decisions https://github.com/rsionnach/opensrm#judgment-slos
Calibration SLO - measures if stated confidence matches actual accuracy (ECE) https://github.com/rsionnach/opensrm#judgment-slos
Feedback latency SLO - measures time until decision quality is known https://github.com/rsionnach/opensrm#judgment-slos
Service ownership information
8 nested properties
Team identifier
Slack channel for alerts (e.g., '#team-oncall')
Team email address
On-call rotation or escalation policy name
PagerDuty integration
2 nested properties
PagerDuty service ID
PagerDuty escalation policy ID
URL to operational runbook
URL to service documentation
Whether on-call rotation is required
Telemetry configuration for AI gate services
2 nested properties
Event name mappings
3 nested properties
Event name emitted when AI makes a decision
Event name emitted when a decision is overridden
Event name emitted when ground truth is known
Attribute name mappings
3 nested properties
Field name for decision identifier
Field name for confidence score
Field name for decision value
Service Reliability Manifest - defines reliability requirements for a service
Schema version. Must be srm/v1
Document type
Service identity and classification
7 nested properties
Unique service identifier (lowercase alphanumeric with hyphens)
Owning team identifier
Service criticality tier
Name of template to inherit from
Human-readable description
Key-value pairs for filtering/grouping
Arbitrary metadata (not used for selection)
Reliability requirements
8 nested properties
Service operational pattern https://github.com/rsionnach/opensrm#service-types
Service Level Objectives
16 nested properties
Availability SLO - measures proportion of successful requests
Latency SLO - measures response time at specified percentiles
Error rate SLO - measures proportion of failed requests
Throughput SLO - measures request processing rate
Processing time SLO - job completion time for worker services
Lag SLO - consumer lag for stream services
Success rate SLO - measures proportion of successful operations
Duration SLO - job execution time for batch services
Schedule adherence SLO - measures timeliness of batch job starts
Data freshness SLO - measures staleness of batch output data
Query latency SLO - response time for database services
Replication lag SLO for database services
Connection availability SLO for database services
References to external OpenSLO files
Custom SLO types
Judgment SLOs for AI gate services - measure decision quality https://github.com/rsionnach/opensrm#judgment-slos
Reliability guarantees promised to dependents https://github.com/rsionnach/opensrm#contracts
Telemetry configuration for AI gate services
2 nested properties
Event name mappings
Attribute name mappings
Upstream dependencies
Service ownership information
8 nested properties
Team identifier
Slack channel for alerts (e.g., '#team-oncall')
Team email address
On-call rotation or escalation policy name
PagerDuty integration
URL to operational runbook
URL to service documentation
Whether on-call rotation is required
Observability requirements
4 nested properties
Metric requirements
Dashboard requirements
Alert requirements
Distributed tracing requirements
Deployment requirements
3 nested properties
Environments where service is deployed
Deployment gates
Rollback configuration
Reusable template for service reliability manifests
Schema version. Must be srm/v1
Document type
Template identity
2 nested properties
Unique template identifier
Human-readable description
Partial spec for template inheritance
4 nested properties
Service operational pattern https://github.com/rsionnach/opensrm#service-types
Service Level Objectives
16 nested properties
Availability SLO - measures proportion of successful requests
Latency SLO - measures response time at specified percentiles
Error rate SLO - measures proportion of failed requests
Throughput SLO - measures request processing rate
Processing time SLO - job completion time for worker services
Lag SLO - consumer lag for stream services
Success rate SLO - measures proportion of successful operations
Duration SLO - job execution time for batch services
Schedule adherence SLO - measures timeliness of batch job starts
Data freshness SLO - measures staleness of batch output data
Query latency SLO - response time for database services
Replication lag SLO for database services
Connection availability SLO for database services
References to external OpenSLO files
Custom SLO types
Judgment SLOs for AI gate services - measure decision quality https://github.com/rsionnach/opensrm#judgment-slos
Service ownership information
8 nested properties
Team identifier
Slack channel for alerts (e.g., '#team-oncall')
Team email address
On-call rotation or escalation policy name
PagerDuty integration
URL to operational runbook
URL to service documentation
Whether on-call rotation is required
Telemetry configuration for AI gate services
2 nested properties
Event name mappings
Attribute name mappings