**Citi is seeking a highly motivated candidate as an L3 support / DevOps engineer for the Global Spread Product Data platform team. The candidate will help to expand observability over the platform stack and be responsible for ensuring stability of critical platform messaging (e.g. Kafka) and storage (e.g. HBase, Elastic) services. The candidate will have a dual responsibility: responding to production incidents that cannot be handled by L2 support, and working on preventative measures such as improved observability, setting up rigorous performance testing in lower environments, or designing and conducting chaos-style exercises. The candidate should have experience in handling high-volume data and distributed systems. The candidate handles complex problems independently and demonstrates analytical thinking. Finally, the candidate is expected to be familiar with reading code and be able to delve into open source products and understand complex issues not covered by any documentation.**
**Key Responsibilities:**
**Works closely with L2 support and application teams to debug and resolve incidents relating to the platform**
**Conducts root cause analysis of thorny issues with the full platform development team and prioritizes the stability book of work**
**Designs and conducts stability exercises in production and lower environments, ranging from single-node recovery up to whole data center failover**
**Helps platform users understand platform capabilities and advises on solution designs that leverage platform services effectively**
**Keeps up with industry best practices around reliability and observability and brings these back to the whole platform team**
**Develops and sets up telemetry and automates solutions for operational challenges to stability (such as resource exhaustion problems, bad node recovery, etc.)**
**Participates actively in platform architecture discussions, with a particular focus on reliability and supportability considerations**
**May occasionally work a non-standard shift including nights and/or weekends**
**Required Skills / Experience:**
**Experience working with a modern observability stack such as Splunk, ELK, Grafana, Prometheus, Promtail, and meeting demanding latency and throughput requirements**
**Experience in scripting in Python, shell or equivalent**
**Minimum 2 years of hands-on experience supporting an enterprise-scale distributed application**
**Minimum 2 years of hands-on experience with messaging technology such as Kafka, JMS/EMS, Solace or similar**
**Experience in OpenShift or Kubernetes is a plus**
**Experience working in a Continuous Integration and Continuous Delivery environment and familiarity with tools such as Jenkins and TeamCity and code quality tools such as SonarQube**
**Experience with streaming technologies like Apache Flink is a plus**
**Experience with RDBMS and SQL/PL/SQL is a plus**
**Excellent written and oral communication skills**
**Citi Canada is an equal opportunity employer. Accordingly, we will make accommodations to respond to the needs of people with disabilities (including, without limitation, physical and mental health disabilities) during the recruitment process and otherwise in accordance with law. Individuals who view themselves as Aboriginals, members of visible minority or racialized communities, and people with disabilities are encouraged to apply.**
Job ID: 121405