Job Details
Revolutionizing protection.
Define what’s next in cybersecurity.
Principal Platform Engineer, Observability (CIPE)
Our Mission
At Palo Alto Networks®, we’re united by a shared mission—to protect our digital way of life. We thrive at the intersection of innovation and impact, solving real-world problems with cutting-edge technology and bold thinking. Here, everyone has a voice, and every idea counts. If you’re ready to do the most meaningful work of your career alongside people who are just as passionate as you are, you’re in the right place.
Who We Are
In order to be the cybersecurity partner of choice, we must trailblaze the path and shape the future of our industry. This is something our employees work at each day and is defined by our values: Disruption, Collaboration, Execution, Integrity, and Inclusion. We weave AI into the fabric of everything we do and use it to augment the impact every individual can have. If you are passionate about solving real-world problems and ideating beside the best and the brightest, we invite you to join us!
We believe collaboration thrives in person. That’s why most of our teams work from the office full time, with flexibility when it’s needed. This model supports real-time problem-solving, stronger relationships, and the kind of precision that drives great outcomes.
Job Summary
Your Career
We are looking for a Principal Software Engineer to architect, build, and evolve our observability platform across infrastructure, applications, and developer workflows. This role is ideal for a hands-on technical leader with deep experience in open source observability technologies and Chronosphere, who is equally fluent in building AI-enabled systems and developer experiences using modern AI coding tools such as Claude and Codex.
You will serve as a technical architect for the observability stack, working across engineering, platform, SRE, and product teams to define standards for metrics, logs, traces, profiling, synthetics, alerting, dashboards, and incident response. You will also lead the integration of AI agents, copilots, and skill-based automation into observability workflows — making telemetry, debugging, and reliability operations equally consumable by humans and AI agents. You should be comfortable operating at both strategic and implementation levels: designing architecture, writing production-grade code, reviewing systems, mentoring engineers, and driving adoption across teams.
Your Impact
Observability Architecture
Design and lead the evolution of a modern observability platform using OpenTelemetry, Prometheus, Jaeger, Alertmanager, and related CNCF ecosystem tools.
Define architecture standards for telemetry collection, processing, storage, querying, visualization, alerting, retention, and governance.
Build scalable systems for metrics, distributed tracing, continuous profiling, log aggregation, synthetic monitoring, service health monitoring, and reliability analytics.
Establish best practices for instrumentation across services, infrastructure, Kubernetes workloads, CI/CD systems, and developer platforms.
Evaluate trade-offs around data cardinality, sampling, storage cost, retention, query performance, multi-tenancy, reliability, and operational complexity.
Make pragmatic recommendations on open source, self-managed, managed-service, and hybrid observability approaches.
Create paved-road observability patterns that help engineering teams instrument, monitor, debug, and operate services with minimal friction.
OpenTelemetry and Instrumentation
Lead adoption and standardization of OpenTelemetry across applications, services, infrastructure, and platform components.
Design and implement telemetry pipelines using OpenTelemetry Collector, exporters, processors, receivers, connectors, and custom extensions where needed.
Define conventions for traces, metrics, logs, spans, attributes, resources, service names, correlation IDs, and semantic conventions.
Build libraries, SDK wrappers, golden paths, and internal tooling to simplify observability instrumentation for engineering teams.
Metrics, Monitoring, and Alerting
Architect metrics systems using Prometheus-compatible formats, PromQL, remote write, federation, scraping strategies, service discovery, recording rules, and long-term storage backends.
Design alerting frameworks that reduce noise, improve signal quality, and align with SLOs, SLIs, error budgets, and incident response practices.
Create reusable alerting patterns for Kubernetes, infrastructure, applications, APIs, databases, queues, event-driven systems, and distributed services.
Define standards for dashboarding, runbooks, escalation policies, alert ownership, and production readiness.
Partner with SRE and engineering teams to mature monitoring practices and improve service reliability.
Kubernetes and Platform Engineering
Build observability capabilities for Kubernetes environments, including cluster monitoring, workload telemetry, service mesh visibility, ingress and egress monitoring, and node-level insights.
Develop and maintain Helm charts, Kubernetes manifests, operators, sidecars, agents, DaemonSets, and deployment automation for observability components.
Work with platform teams to ensure observability systems are reliable, secure, multi-tenant, highly available, and easy to operate.
Define standards for resource usage, scaling, upgrades, failover, backup, disaster recovery, access control, and tenant isolation for observability infrastructure.
Support observability across multi-cluster, multi-region, and hybrid cloud environments where applicable.
AI-Enabled Observability and Developer Experience
Design and build AI-enabled observability workflows that allow both humans and AI agents to investigate incidents, query telemetry, summarize signals, and propose remediations.
Define and publish reusable AI skills, agents, and tools (e.g., Claude skills, Codex tools, MCP servers, structured prompts) that encode observability best practices and make platform capabilities consumable by engineering teams and autonomous agents.
Build paved-road AI integrations for triage, alert summarization, root-cause analysis, log/trace exploration, runbook generation, dashboard authoring, and post-incident review.
Establish standards for grounding AI agents in authoritative telemetry, runbooks, and service catalogs, with strong guardrails around accuracy, safety, cost, and auditability.
Use AI coding tools (Claude, Codex, and equivalents) as a first-class part of the engineering workflow — for code generation, refactoring, instrumentation rollouts, migrations, and platform automation — and define patterns the broader team can adopt.
Partner with platform, SRE, and product teams to evolve observability from human-only dashboards toward agent-assisted, self-serve reliability operations.
Qualifications
Your Experience
7+ years of software engineering, platform engineering, infrastructure engineering, or SRE experience, with significant experience building production-grade distributed systems.
Deep hands-on experience with observability systems, including metrics, logs, traces, profiling, dashboards, synthetics, alerting, and incident workflows.
Strong expertise with OpenTelemetry, including SDKs, Collector pipelines, exporters, processors, receivers, semantic conventions, and instrumentation patterns.
Strong experience with Prometheus-compatible metrics, Alertmanager, scraping, cardinality management, federation, and remote write patterns.
Hands-on experience with distributed tracing systems such as Jaeger or similar technologies.
Experience with continuous profiling technologies.
Strong experience with synthetic monitoring and proactive availability testing, including API checks, browser-based checks, blackbox monitoring, dependency checks, and integration with alerting and SLO workflows.
Strong Kubernetes experience, including workload monitoring, service discovery, operators/controllers, Helm, resource management, cluster observability, and multi-tenant platform patterns.
Strong Python engineering skills, including building internal tools, automation, integrations, services, and instrumentation libraries.
Hands-on experience building real solutions, tools, and developer workflows using modern AI coding agents such as Claude, Codex, or equivalent — including prompt design, skill/tool/MCP authoring, agent orchestration, and integrating LLMs into production engineering systems.
Practical understanding of how to design AI-friendly platforms: structured APIs, machine-readable runbooks, telemetry schemas, and skills/tools that allow both humans and AI agents to operate observability effectively.
Experience designing and operating high-scale, highly available infrastructure systems.
Strong understanding of SLOs, SLIs, error budgets, incident response, on-call practices, production readiness, and reliability engineering principles.
Experience writing clear technical design documents, RFCs, standards, operational runbooks, and architecture recommendations.
Ability to influence teams through technical depth, collaboration, mentorship, and pragmatic decision-making.
Technical Skills
Observability: OpenTelemetry, Prometheus, Chronosphere, PromQL, Alertmanager, Grafana, Jaeger, OpenTelemetry Collector.
Telemetry: Metrics, logs, traces, spans, profiles, exemplars, service maps, SLOs, SLIs, error budgets, correlation IDs, semantic conventions.
Synthetics: Grafana k6, Prometheus Blackbox Exporter, Playwright, Selenium, API monitoring, browser checks, HTTP checks, gRPC checks, DNS/TCP/TLS checks, synthetic user journeys.
Kubernetes: Helm, operators, controllers, CRDs, DaemonSets, sidecars, service discovery, ingress, autoscaling, resource limits, multi-cluster observability.
Programming: Python required; Go, Java, Rust, or Node.js preferred.
AI Engineering: Claude, Codex, and equivalent coding agents; skill/tool/MCP authoring; prompt engineering; agent orchestration; LLM integration patterns; grounding, evaluation, and guardrails for AI-driven workflows.
Infrastructure: Linux, containers, networking, distributed systems, cloud platforms, service mesh, load balancers, APIs, queues, databases.
Automation: CI/CD, GitOps, Terraform, Argo CD, Flux, deployment pipelines, release validation, configuration management.
Reliability: Incident response, alert tuning, runbooks, error budgets, capacity planning, performance optimization, disaster recovery, production readiness.
Success in This Role Looks Like
The organization has a clear, scalable observability architecture with strong standards for telemetry generation, collection, storage, querying, retention, and consumption.
Engineering teams can easily instrument services and get useful metrics, traces, profiles, logs, dashboards, synthetic checks, and alerts without deep observability expertise.
Alerting becomes more actionable, less noisy, and better aligned with service health, SLOs, and customer impact.
Synthetic monitoring proactively detects failures in critical user journeys, APIs, infrastructure endpoints, and third-party dependencies before customers are significantly impacted.
The observability platform is reliable, cost-efficient, secure, multi-tenant, and easy to operate across Kubernetes environments.
Continuous profiling and tracing become part of normal performance, debugging, and reliability workflows.
AI agents and skills are first-class consumers of the observability platform — accelerating triage, investigation, and remediation for both humans and autonomous workflows, with measurable improvements in MTTR and engineer productivity.
The Principal Engineer is recognized as the technical leader who can connect architecture, implementation, operational excellence, developer experience, AI-enabled workflows, and business reliability outcomes across the observability stack.
Compensation Disclosure
The compensation offered for this position will depend on qualifications, experience, and work location. For candidates who receive an offer at the posted level, the starting base salary (for non-sales roles) or base salary + commission target (for sales/commissioned roles) is expected to be the annual range listed below. The offered compensation may also include restricted stock units and a bonus. A description of our employee benefits may be found here.
$147,000.00 - $237,500.00/yr
Our Commitment
We’re trailblazers that dream big, take risks, and challenge cybersecurity’s status quo. It’s simple: we can’t accomplish our mission without diverse teams innovating, together.
We are committed to providing reasonable accommodations for all qualified individuals with a disability. If you require assistance or accommodation due to a disability or special need, please contact us at accommodations@paloaltonetworks.com.
Palo Alto Networks is an equal opportunity employer. We celebrate diversity in our workplace, and all qualified applicants will receive consideration for employment without regard to age, ancestry, color, family or medical care leave, gender identity or expression, genetic information, marital status, medical condition, national origin, physical or mental disability, political affiliation, protected veteran status, race, religion, sex (including pregnancy), sexual orientation, or other legally protected characteristics.
All your information will be kept confidential according to EEO guidelines.
Is role eligible for Immigration Sponsorship? No. Please note that we will not sponsor applicants for work visas for this position.