Job Description
Cloud Ops/Service Reliability Engineer
Our client is transforming the way audio devices interact with the Cloud, creating new experiences for connecting people to what they love the most.
We are imagining new cloud native experiences, designing new services, building software architecture and infrastructure, and scaling our solutions to serve millions of users.
Join us to invent and build our cloud platform for the 21st century and power the next wave of innovation at our client.
We are seeking a top-notch Cloud Operations/Site Reliability Engineer to join our Cloud Operations team. In this role, you will be responsible for ensuring all our cloud-based services are healthy, monitored, automated and designed to scale.
The ideal candidate will bring rich knowledge of best practices and design patterns, along with a strong software engineering background.
The ideal candidate is also comfortable and experienced with building and supporting high-volume, always-on, mission-critical production applications.
Principal Duties and Responsibilities:
Some of the specific challenges you will tackle:
- The specific focus for the SRE is to continue to expand the cloud platform capabilities to enable the development teams to deploy, configure, manage, maintain and support production applications and the related infrastructure.
- Additional responsibilities will also include:
- Build and maintain the continuous delivery pipeline to fully automate software releases of the highly available, mission-critical cloud services that support our client's connected products
- Work closely with the development teams from early stages of design all the way through identifying and resolving production issues
- Build tools for deployment, monitoring and operations.
- Troubleshoot and resolve issues in our development, test and production environments
- Participate in solution design for new cloud platform features, and in the evaluation and selection of open source technologies and tools
- Work closely with engineering teams to conduct root cause analysis for production incidents, and to evolve infrastructure and tooling
- Work with platform architects on service and system optimizations, helping to identify and remove potential performance bottlenecks
- Configure and administer our API developer portal and API gateway
- Work closely with product and development teams to ensure that the service architecture aligns with high-availability and scalability goals
- Evangelize, design, implement and automate security controls, governance processes, and compliance validation
- Design, manage, and maintain tools to automate operational processes, build service operations dashboards, and communicate service level metrics to all stakeholders
- Create and maintain runbooks and operational procedures to ensure the uptime and reliability of production services
- Meet SLAs and internal targets in an ongoing effort to build and maintain a highly scalable, high-quality cloud platform that supports our client's connected products
- Participate in an on-call rotation for production infrastructure and services support
- Stay up to date on relevant technologies, plug into user groups, and understand trends and opportunities to ensure we are using the best possible techniques and tools
- Design, implement, and maintain monitoring and alerting tools, leveraging existing tools and logging.
- Monitoring will be at all layers in the application stack
- Debug complex problems across the entire cloud services stack
- Create self-service capabilities for developers to monitor and debug microservices deployed in the cloud
- Serve as a primary point of contact responsible for the overall health, performance, and capacity of one or more of our Internet-facing services
- Drive "chaos testing" to understand and improve overall resiliency to failures
- Continually improve processes, automation, documentation, monitoring and security
Qualifications:
- Extensive knowledge of software development and software testing methodologies, along with change and configuration management practices in multiple environments
- Experience developing code in at least one high-level programming language
- Strong understanding and use of APIs in large-scale distributed systems, microservices architecture, cloud computing, etc.
- Strong background in Linux/Unix administration, with scaling and tuning experience
- Experience with application performance analysis and monitoring tools such as AppDynamics, New Relic, and Java profiling tools
- Experience with automated deployment, continuous integration, and release engineering tools such as GoCD, Rundeck, Ansible, and Jenkins
- Passion for developer productivity, with experience automating developer pain points
- Experience with a public cloud provider, ideally Amazon Web Services
- Familiarity with SQL and NoSQL data stores
- Software process automation with popular scripting languages (Python, Node.js, and/or Ruby)
- Knowledge of best practices and IT operations for an always-up, always-available, mission-critical service
- Understanding of Agile and other development processes and methodologies
- Outstanding communication skills; demonstrated ability to explain complex technical issues to both technical and non-technical audiences
Global Technical Talent, a Precocity-owned company and subsidiary of Chenega Corporation, specializes in recruiting and pre-qualifying senior-level IT professionals for clients' immediate long- and short-term contract needs, as well as contract-to-hire and direct-hire positions. We build long-term, mutually beneficial partnerships with all of our clients and consultants. By having a complete understanding of our clients' IT organization, personality, needs, and expectations, we can be relied upon to consistently provide top-tier service and match the perfect candidate to the right job.