Prometheus Group is a leading global provider of comprehensive and intuitive enterprise asset management (EAM) software solutions that work within ERP systems and span the full work management life cycle for maintenance and operations. Our straightforward functionality, graphical visualization, and simple processes enable customers to increase productivity, ensure safety, reduce costs, and improve reporting. Because we work with the largest companies in the world, Prometheus Group offers excellent opportunities to advance and excel in your career.

Job Summary

The site reliability engineer is responsible for ensuring the availability and performance of the Prometheus Group hosted customer sites. The site reliability engineer is also responsible for managing all of the underlying infrastructure, including Kubernetes cluster upgrades, infrastructure decommissioning, incident management, and root cause analysis and remediation.

Responsibilities

Work as part of a response team to resolve reported issues.
Proactively identify problems and gaps in the deployed applications and infrastructure, and develop measures to prevent disruptions.
Develop and deliver tools that continuously enhance monitoring capabilities.
Assist in the roll-out and deployment of new product features and installations to facilitate our rapid iteration and constant growth.
Identify ways to resolve common issues by developing and deploying automation that replaces common manual interventions.
Work closely with development and DevOps teams to ensure that platforms are designed with operability and observability in mind.
Function well in a fast-paced, rapidly changing environment.

Required Qualifications

Bachelor’s degree in computer science, information technology, software engineering, or a related field.
3+ years of working experience as a software developer, AWS cloud engineer, or AWS infrastructure engineer.
3+ years of hands-on experience managing Kubernetes clusters and Docker containers.
3+ years of hands-on experience managing and troubleshooting Linux servers.
2-3 years of automation experience in Terraform, Python, or Ansible.
2+ years of MS SQL and PostgreSQL database instance management and troubleshooting experience.
Strong critical thinking skills.
Strong troubleshooting experience involving Kubernetes clusters, Docker containers, and Linux.
Demonstrable experience working with remote monitoring and logging tools, including but not limited to Dynatrace, Grafana, and Pingdom.

Preferred Qualifications

Strong interpersonal communication skills (including listening, speaking, and writing) and the ability to work well in a diverse, team-focused environment with other SREs, developers, IT operations, and engineers.
Ability to work well in high-pressure situations.
Knowledge of data structures, relational and non-relational databases, networking, Linux internals, filesystems, web architecture, and related topics.
Certified Kubernetes Administrator (CKA) or a related certification is a plus.

Benefits Overview

Gym Kickback Incentive (up to £25 per month)

Site Reliability Engineer
