Site Reliability Engineer Data & Platform

Job Overview

Location: Mississauga, Ontario
Job Type: Full Time
Job ID: 121413
Date Posted: 1 year ago
Recruiter: Raymond Catherine

Job Description

**Now is an extremely exciting time to join a newly formed group within Citi. The Institutional Clients Group - Engineering and Architecture Practice (EAP) is responsible for defining and building core architecture and technology strategy for the ICG.**

**This position is in the Kafka-as-a-Service team, which sits under Common Platform Engineering (CPE). CPE is a department within the EAP group whose mission is to provide engineering for common platform capabilities in ICG, engineer solutions that codify the firm's data strategy into frameworks & tools, and define 'Common Product' standards that ensure efficient adoption of common components.**

**We are looking for an SRE with a software engineering background who is passionate about running large-scale, multi-tenant distributed data systems for customers that expect a very high level of availability. In this role, you will be responsible for the availability, performance, monitoring, emergency response, and capacity planning of the data systems.**

**If you love the hum of big data systems, enjoy thinking about how to make them run as smoothly as possible, and want a big influence on the architecture and operational design of the systems, then you will fit right in. Your solutions will be leveraged by tens of thousands of developers across Citi supporting applications used by hundreds of thousands of internal and client users.**

**What you'll be doing:**

**Design & build observability solutions for distributed systems**

**Contribute to the continuous automation of toil, and drive & evangelize the four key DORA metrics**

**Establish Service Level Objectives for core services, monitor their Service Level Indicators, and implement error-budget based alerting**

**Help the operations team by building solutions that let them identify and resolve health issues in the data systems as quickly as possible**

**Automate the deployment of infrastructure and applications for data systems such as Kafka**

**Support the rapid growth of the platform by expanding its deployment strategy to OpenShift and public cloud Kubernetes environments (EKS on AWS, GKE on Google Cloud)**

**Design and implement service improvements for performance & security, relentlessly improve reliability and facilitate effective incident response, mitigation & resolution**

**Write and review technical documents, including design, requirements, and process documentation**

**Advocate for a culture of platform automation, with an obsession for an everything-as-code approach**

**What we are looking for:**

**4+ years of experience in Site Reliability Engineering, creating scalable and highly reliable systems**

**Strong fundamentals in distributed systems design and operation with experience building automation to operate large-scale data systems**

**Experience designing & implementing observability solutions for data systems to enable a holistic view of system health**

**Strong understanding of modern site reliability engineering practices and ability to apply them to improve the reliability of systems**

**Experience creating, deploying, and managing the lifecycle of containerised applications on Kubernetes**

**Experience in an agile development environment with a modern programming language such as Python, Golang, Java, Kotlin, Scala, or similar**

**What gives you an edge:**

**Experience working with distributed systems and stream processing solutions; hands-on experience with Apache Kafka is highly desirable**

**Strong grasp of DevSecOps practices and ability to contribute to improving systems reliability, quality, and time-to-market**

**Experience designing and implementing multiple automated deployment pipelines at both the application and infrastructure levels. Ideally, you would have experience with Ansible and Terraform on multiple projects**

**Experience working with the HashiCorp toolset, specifically Vault for secrets management and Consul for service discovery**

**Experience deploying applications and infrastructure into the cloud**
