PrecisionPulse: A FinOps Success Story at nference
Background
Cloud complexity forces organizations to adapt swiftly, and balancing costs against ROI is a constant struggle. FinOps addresses this by optimizing spend and instilling financial accountability in the dynamic, variable-spend model of the cloud.
According to Gartner, global end-user spending on public cloud services was projected to surge to $544 billion in 2022, a 20.6% increase over the $451 billion recorded in 2021, and to reach roughly $917 billion by 2025. The promise of enhanced productivity and competitive advantages fuels this accelerated shift to the cloud.
About nference
Nference harnesses the surge in biomedical data from electronic medical records (EMRs) to collaborate with medical centers. Our mission is to convert decades of unstructured EMR data into potent solutions, enabling global scientific breakthroughs in personalized diagnostics and treatments. We believe the key to advancing human health lies in developing technology to curate and synthesize the world’s biomedical data for transformative scientific discovery.
Nference’s pain point
At the start of 2023, nference was spending more than $10 million across multiple cloud platforms, with costs growing steadily each month. Governance and cloud cost management practices were suboptimal, resulting in significant waste, inefficiency, and a lack of clarity about what was driving expenditure.
Key Challenges Addressed
The primary objective in the initial phase was to give teams improved visibility, performance evaluation, and allocation capabilities.
Nference functions across various cloud platforms tailored for data partners. Cost data from these diverse cloud sources was gathered in separate CSV sheets. These sheets were subsequently utilized to generate pivot tables, analyze cost data, and produce a monthly report. However, this approach posed scalability challenges and rendered decision-making reliant on a monthly report.
To address this issue, we implemented standardized tags and imported all the sheets into Power BI. This streamlined cost analysis, allowing any team to easily analyze and interpret the data.
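The tagging-and-consolidation step above can be sketched as a simple rollup: once every cost row carries the same standardized tag keys, data from any provider can be aggregated together. The tag names and figures below are illustrative, not nference's actual schema.

```python
from collections import defaultdict

# Illustrative stand-ins for rows parsed from each provider's CSV export;
# the tag keys (team, data_partner) are hypothetical examples of
# standardized tags, not nference's real tagging scheme.
rows = [
    {"provider": "gcp", "team": "ml",    "data_partner": "p1", "cost_usd": 1200.0},
    {"provider": "gcp", "team": "infra", "data_partner": "p1", "cost_usd": 300.0},
    {"provider": "aws", "team": "ml",    "data_partner": "p2", "cost_usd": 450.0},
]

# Standardized tags make a single cross-cloud rollup possible: group every
# row by (provider, team, data_partner) regardless of source cloud.
summary = defaultdict(float)
for row in rows:
    key = (row["provider"], row["team"], row["data_partner"])
    summary[key] += row["cost_usd"]

for key, total in sorted(summary.items()):
    print(key, total)
```

In practice this rollup lives in Power BI rather than a script, but the principle is the same: consistent tags turn per-provider CSVs into one queryable dataset.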
This enabled us to establish a benchmark for the costs incurred by data partners, facilitating the creation of a Bill of Materials for the seamless onboarding of new data partners.
We additionally developed Grafana dashboards to monitor activities and determine the appropriate utilization of resources.
The objective in the second phase was to tackle the low-hanging fruit: easily attainable improvements and opportunities.
After gaining visibility into both cost and performance data, we leveraged this information to optimize resources. This involved right-sizing underutilized resources and removing those that were not in use — a surprisingly common issue.
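The right-sizing logic above reduces to a threshold check over utilization data. Here is a minimal sketch, assuming hypothetical VM names, a 30-day average CPU metric, and thresholds that are illustrative rather than nference's actual policy.

```python
# Hypothetical utilization snapshot: average CPU % over 30 days per VM.
# Names and values are invented for illustration.
utilization = {
    "vm-etl-01": 2.0,     # effectively idle
    "vm-train-02": 18.0,  # oversized for its load
    "vm-api-03": 65.0,    # healthy
}

# Illustrative thresholds: below 5% we consider removal,
# between 5% and 30% we consider a smaller machine type.
IDLE_PCT, RIGHTSIZE_PCT = 5.0, 30.0

to_remove = [vm for vm, cpu in utilization.items() if cpu < IDLE_PCT]
to_rightsize = [vm for vm, cpu in utilization.items()
                if IDLE_PCT <= cpu < RIGHTSIZE_PCT]

print("remove:", to_remove)         # ['vm-etl-01']
print("right-size:", to_rightsize)  # ['vm-train-02']
```

The real decision also weighs memory, disk, and network metrics, but even this crude CPU filter surfaces the "not in use at all" resources that turned out to be surprisingly common.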
We conducted a performance-to-cost comparison of resources, making necessary adjustments. This included common conversions such as transitioning from SSD to HDD, relocating data from high-cost to low-cost buckets, and minimizing the required number of cores.
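The storage-tier conversions above follow from simple unit-cost comparisons. The sketch below uses invented per-GB prices (real provider pricing varies by region and product) to show how the SSD-to-HDD decision is evaluated for a rarely read dataset.

```python
# Illustrative unit prices in USD per GB-month; these are NOT real
# provider prices, just plausible placeholders for the comparison.
PRICES = {"ssd": 0.17, "hdd": 0.04}

def monthly_cost(gb: float, tier: str) -> float:
    """Storage cost per month for a dataset of the given size and tier."""
    return gb * PRICES[tier]

# A hypothetical 10 TB dataset that is rarely read and never needs SSD latency.
gb = 10_000
current = monthly_cost(gb, "ssd")
proposed = monthly_cost(gb, "hdd")
saving_pct = 100 * (1 - proposed / current)
print(f"SSD: ${current:.0f}/mo, HDD: ${proposed:.0f}/mo, saving {saving_pct:.0f}%")
```

The same arithmetic applies to moving objects from high-cost to low-cost bucket classes; the trade-off is higher access latency or retrieval fees, which is why the comparison starts from performance requirements rather than price alone.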
We discovered that a substantial portion of our expenditures was allocated to licenses, including RHEL and NoMachine. To optimize costs, we made a strategic shift by replacing these licenses with open-source alternatives like Rocky OS and Apache Guacamole. This not only resulted in immediate cost savings but also contributed to ongoing and future expense reductions.
In the third phase, our focus was on establishing processes to prevent the recurrence of similar issues in the future.
Our initial priority was to revamp the process of provisioning and de-provisioning resources. Previously, a Jira ticket was the sole requirement for creating any resource, with no scrutiny of its necessity and no timeline for decommissioning it once it had served its purpose.
The updated provisioning process now involves obtaining approvals, benchmarking resource costs to ensure efficiency, and an alert system that prompts the removal of resources based on predefined time criteria.
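The time-based alerting described above can be sketched as a periodic sweep over a resource registry: each resource records when it was provisioned and its approved lifetime, and anything past that window triggers an alert. All names, dates, and TTLs below are hypothetical.

```python
from datetime import date, timedelta

# Hypothetical resource registry; in practice this would come from a
# CMDB, tagging data, or the ticketing system, not a literal list.
today = date(2023, 9, 1)
resources = [
    {"name": "gpu-exp-07", "provisioned": date(2023, 5, 1), "ttl_days": 90},
    {"name": "db-staging", "provisioned": date(2023, 8, 20), "ttl_days": 30},
]

# Flag every resource whose approved lifetime has elapsed.
expired = [r["name"] for r in resources
           if today > r["provisioned"] + timedelta(days=r["ttl_days"])]

for name in expired:
    print(f"ALERT: {name} exceeded its approved lifetime; review for removal")
```

Wiring the alert to the team's chat or ticketing tool closes the loop: a resource cannot quietly outlive its justification without someone being prompted to act.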
We instituted regular sync-ups with teams to gain insights into any challenges they encountered and receive updates on open tickets associated with cost reduction.
The technical implementation yielded significant cost reductions.
We introduced numerous technical changes in the cloud architecture, leading to substantial cost reductions. One noteworthy change was transitioning from a conventional cluster-based setup to a Kubernetes infrastructure built on top of Spot Nodes (typically discounted by 60–90%, depending on availability and cloud provider). This architectural shift resulted in a 60% reduction in costs and a twofold increase in performance.
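It is worth seeing what those two figures mean when combined. The back-of-the-envelope check below, assuming a hypothetical $100k/month baseline, shows that a 60% cost reduction together with doubled performance is a 5x improvement in cost per unit of performance.

```python
# Hypothetical baseline: $100k/month at a normalized performance of 1.0.
baseline_cost, baseline_perf = 100_000, 1.0

# Figures from the migration: 60% cheaper, 2x the performance.
new_cost = baseline_cost * (1 - 0.60)  # 40,000
new_perf = baseline_perf * 2.0         # 2.0

cost_per_perf_before = baseline_cost / baseline_perf  # 100,000
cost_per_perf_after = new_cost / new_perf             # 20,000
improvement = cost_per_perf_before / cost_per_perf_after

print(f"cost per unit of performance improved {improvement:.0f}x")
```

The arithmetic is trivial, but it is the right lens for spot-based architectures: the headline discount understates the gain when the new setup also runs faster.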
Conclusion
Our strategic approach resulted in a cost reduction of roughly 45%. We improved visibility, streamlined data analysis, and optimized resource utilization. Key initiatives included standardized tags, Grafana dashboards, and a shift to open-source solutions. The third phase focused on preventative processes and revamped resource provisioning. Technical changes, particularly transitioning to Kubernetes on Spot Nodes, cut costs by 60% while doubling performance. This comprehensive strategy reflects nference’s commitment to efficiency and strategic cost management in cloud operations.