
This listing was taken from the employer's website, and HelloWorld does not guarantee that it is up to date.

Job listing has been deactivated.

Site Reliability Engineer

Nordhealth

Remote

18.08.2024.

MySQL, Python, AWS, Ansible, Jenkins, PostgreSQL, Azure, PowerShell, Bash, SCRUM, SaaS, DNS, Cloud, MongoDB, Agile, intermediate

Who are we?

Nordhealth’s mission is to build software that improves the daily lives of healthcare professionals. We build software that empowers veterinary and therapy professionals to provide the best possible care experiences to their patients. Our products are used daily by over 50,000 professionals in clinics and hospitals across 30+ countries. We excel with 20+ years of experience in healthcare and veterinary software.

We understand that talent comes from everywhere and anywhere. The greater our diversity, the better the products we deliver. That’s why we are a remote-first company, headquartered in Helsinki, Finland, with all 400+ employees working either remotely or from collaboration hubs. While our market presence is currently strongest in the Nordics, our customer base is also growing rapidly in our other markets, especially in Europe and North America (more on our website, nordhealth.com).

About the role

We are seeking a dedicated individual for a role that centers around Provet Cloud, our cloud-based veterinary practice management software (https://www.provet.cloud/). Provet Cloud is designed to help veterinary practitioners save time so they can devote more attention to caring for their patients and to make managing a veterinary practice more efficient and simpler. It offers features for appointment scheduling, electronic medical records, inventory management, billing, and communication within the veterinary team.

The purpose of the Senior SRE role in our company is to ensure the scalability, reliability, and high availability of our platform. This includes automating our infrastructure to accommodate higher loads resulting from increased usage and monitoring the cloud hosting costs to keep them at a proper level as our user base expands. Additionally, the SRE plays a crucial role in maintaining the system's reliability, especially with multiple enterprise-class customers relying on our platform. The SRE team's focus on automation, monitoring, and proactive maintenance helps us meet the demands of our expanding user base while ensuring that our services remain consistently available and performant.

This is a unique opportunity to join our team and contribute to enhancing the efficiency and simplicity of veterinary practice management through Provet Cloud!

Your key responsibilities include:

Automate infrastructure to accommodate a growing user base and workload.
Monitor and optimize cloud hosting costs to maintain efficiency.
Ensure the system is highly available and reliable for all our customers, especially enterprise-class customers.
Implement and maintain monitoring systems for performance and reliability.
Troubleshoot and resolve incidents to minimize downtime.
Collaborate with development and operations teams to improve system performance and stability.
Carry out capacity planning to meet future demand.
Implement and maintain disaster recovery and failover procedures.
Continuously evaluate and improve system architecture for scalability and reliability.

What will help you to be successful in this role?

Ideally, you have already gained some experience working in a fast-growing, global SaaS company.

Success factors and key challenges of the role:

Maintaining high availability while simultaneously optimizing costs is crucial for the SRE role. This involves balancing the need for reliability with cost-effectiveness to ensure efficient operations.
Keeping infrastructure maintained and updated with minimal downtime is essential, ideally with no noticeable interruptions for our clients and users. This requires careful planning and execution to minimize disruptions while making necessary changes.
Effective resource planning in a rapidly changing environment is critical to avoid overprovisioning while still meeting increasing demands. This involves staying proactive and adaptable to ensure resources are utilized optimally.
Continuous review and improvement of disaster recovery plans and procedures are necessary to mitigate potential risks effectively. Regular testing and updates are vital to ensure readiness for any unforeseen events.
Quick analysis and mitigation of any issues or incidents is essential, along with a clear plan for permanent resolution. This includes identifying root causes and implementing corrective measures to prevent recurrence.

Critical Knowledge and Experience:

Proficiency in AWS, Azure, or Google Cloud, and infrastructure as code (IaC) tools like Terraform.
Strong scripting abilities using Python, Bash, or PowerShell for infrastructure automation.
Experience with monitoring tools like Prometheus or Grafana for real-time monitoring and alerting.
Knowledge of incident management processes and tools like PagerDuty for effective incident resolution.
Understanding of HA and reliability principles, including failover and disaster recovery strategies.
Familiarity with networking concepts such as TCP/IP, DNS, and VPNs.

Having one or more of these skills will help in succeeding in this role:

Experience with tools like Ansible or Terraform for managing infrastructure configuration.
Understanding of CI/CD pipelines and experience with Jenkins or GitLab CI/CD for automating software delivery.
Awareness of security best practices and experience implementing security controls like IAM and encryption.
Basic knowledge of DBMS and experience with MySQL, PostgreSQL, or MongoDB.
Familiarity with logging frameworks like ELK or Splunk for analyzing log data.
Experience in performance optimization techniques to improve system performance.
Understanding of Agile methodologies and experience with Scrum or Kanban for iterative development.

What’s in it for you?

At Nordhealth, we do things a little bit differently. We value continuous improvement, diverse teams, and autonomy, all of which drive our collaboration. Our global healthcare domain is developing rapidly, and we are seeking colleagues who enjoy working in this type of environment.

In addition, we offer:

The chance to work in a meaningful industry and in a fast-growing, global company on a path to changing digital healthcare
Competitive compensation and benefits
Learning and professional growth opportunities
The tools you need and enjoy using
Frequent company events and talented colleagues from around the world

If you enjoy working in a fast-growing and international environment with the possibility to make an impact, this might be the perfect job for you. Apply now! We'll fill the position as soon as we find the right person.
