Application SRE AI & Ops Manager

Summary

Title:	Application SRE AI & Ops Manager
ID:	1029
Location:	N/A
Department:	Managed Services
Salary Range:	N/A

Description

About IRIS

As one of the UK’s largest privately held software companies, IRIS Software Group exists to simplify the lives of businesses, schools, and organisations. Our operational software is the invisible but essential heartbeat of our customers' success—ensuring critical tasks are handled right the first time, every time.

We help businesses run the tough stuff, stay compliant, and focus on productivity and growth.

The Role

We're looking for a hands-on and strategic Application SRE (service reliability) & Ops Manager to join our team. This role is ideal for someone who thrives in high-availability environments, has a passion for operational excellence, and brings both technical breadth and leadership ambition.

As a Service Engineering Manager, you'll play a key role in safeguarding production systems, preventing incidents, resolving issues rapidly, and continuously improving system resilience. You'll also act as the situation manager during high-priority incidents and lead efforts to prevent recurrence.

This is a highly cross-functional role, interfacing with development, engineering, operations, infrastructure, and customer-facing teams. You will interact directly with the CIM (Critical Incident Management) team on major incidents.

The primary goal of this role is to drive service reliability by preventing incidents before they occur. As a leader, you will drive consistent standards and practices across the engineering teams.

Key Responsibilities

Own the stability, performance, and operational integrity of critical applications in production.
Act as situation manager during incidents — drive resolution and post-incident learning.
Proactively identify and address risks, harden infrastructure and applications, and improve overall system reliability.
Proactively drive incident-prevention methods, such as automated detection and recovery.
Drive adoption of Site Reliability Engineering (SRE), application telemetry and continuity principles across teams.
Collaborate with engineering and infrastructure teams to define monitoring, alerting, auto-recovery, capacity and performance metrics.
Leverage tools like Datadog, Azure Monitor, and other APM platforms to optimise observability.
Support application performance management (APM) and troubleshooting.
Help resolve L2 incidents and work closely with L1/L3 support teams.
Guide the implementation of automation and self-healing capabilities.
Apply strong problem-solving, decision-making, and critical-thinking skills under pressure.
Eventually build and lead a team of high-performing service reliability engineers.

Required Skills & Experience

Solid technical background in both application development and infrastructure operations.
Experience with Datadog, Azure, networking fundamentals, and cloud-native technologies.
Hands-on coding experience in .NET, C, plus strong knowledge of SQL and PostgreSQL.
Proven experience in incident response, problem management, and operational resilience.
Familiarity with modern performance engineering and APM practices.
10 years' experience working in or alongside SRE, L2 Ops, or Application Engineering teams.
Excellent communication and collaboration skills – able to work effectively across functions and with customers.
Strong leadership traits – decisive, calm under pressure, and ready to step into a formal leadership role.
Strong critical thinking and decision making, with the aptitude to look beyond resolution to root cause and prevention.
Experience in high transactional volume service operations and resilience.

Desirable

Previous experience managing or mentoring engineering teams.
Exposure to Application Resilience, DevOps/SRE frameworks (e.g. SLAs, SLOs, error budgets).
Certifications in Software Development, Azure or other cloud platforms.
Strength in performance engineering and tuning.
Recognised for determination, persistence, and tenacity in the relentless pursuit of continuous improvement.

Why Join IRIS?

At IRIS, you'll be part of a collaborative and fast-moving environment where operational excellence is critical. You’ll work with passionate engineers, tackle complex challenges, and play a key role in building the foundation for scalable, reliable systems that power thousands of businesses.
