Unlocking Real-Time Insights: Engineering the Hyperforce Experience for All

Srinivas Ranganathan
Aug 28 - 5 min read

In our Engineering Energizers Q&A series, we explore the paths of engineering leaders who have achieved significant milestones in their respective fields. Today, we spotlight Srinivas Ranganathan, Director of Software Engineering at Salesforce, who leads the control plane platform for the Salesforce Perimeter. The Perimeter acts as the main entry point for enterprise customer traffic into Salesforce’s applications. It includes secure entry points into Hyperforce and trusted edge locations maintained by commercial CDN vendors.

Discover how Srinivas and his team tackle complex technical issues, including the creation of a streaming data platform for perimeter telemetry — empowering Salesforce teams at all levels of data literacy to work effectively and efficiently.

What is your team’s mission?

Our team is tasked with extracting actionable insights from the telemetry data collected at the Salesforce Perimeter.

We are challenged by the continuous, high-volume generation of telemetry data, which arrives as logs, metrics, events, and traces. Each piece of data carries measurements and dimensions that detail the end-user experience. Given the scale of Salesforce operations, where hundreds of thousands of enterprise customers manage tens of millions of domains, customer-specific dimensions like customer names and domains present high-cardinality challenges.
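For a concrete sense of what one piece of perimeter telemetry looks like, here is a hypothetical record; the field names are illustrative, not our production schema:

```python
# Hypothetical perimeter telemetry record. Dimensions identify the traffic
# (customer_id and domain are the high-cardinality fields), while
# measurements quantify it.
telemetry_event = {
    # Dimensions
    "ts": "2024-08-28T12:00:00Z",
    "customer_id": "00D000000000001",       # illustrative org identifier
    "domain": "example.my.salesforce.com",  # illustrative customer domain
    "edge_location": "iad",                 # CDN point of presence
    "status_code": 200,
    # Measurements
    "request_count": 1,
    "bytes_sent": 5432,
    "latency_ms": 87,
}
```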

We’re committed to equipping Salesforce engineering teams with near-real-time, customer-specific insights derived from this perimeter telemetry. These insights are crucial as they inform daily analyses and support significant business decisions across the company.

What was the most significant technical challenge your team faced recently?

Recently, our team tackled a significant technical challenge: the need for a low-latency, user-friendly interface to access and analyze perimeter telemetry at Salesforce. Previously, our teams had to choose between a log search platform, offering detailed granularity with schema-on-read semantics, and general-purpose time series databases that summarized data along limited dimensions. Both options provided necessary data access but lacked efficiency, hindering effective operations.

To resolve this, we developed a near real-time streaming data platform that allows Salesforce teams to conduct low-latency queries for efficient analysis of perimeter telemetry. We utilized Spark Streaming jobs to standardize telemetry data into a protobuf format, which was then ingested into a specially designed time-series hypercube. This hypercube provides a comprehensive view of customer interactions.
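Here is a minimal sketch of that normalization stage, assuming raw perimeter logs arrive as JSON on a Kafka topic; the topic names, schema, and generated protobuf class (`telemetry_pb2`) are illustrative, not the actual Salesforce pipeline:

```python
# Minimal sketch: normalize raw JSON telemetry from Kafka into protobuf bytes
# for downstream ingestion. Topic names, the schema, and the telemetry_pb2
# module (generated by protoc) are assumptions for illustration.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import BinaryType, LongType, StringType, StructType

spark = SparkSession.builder.appName("perimeter-telemetry-etl").getOrCreate()

# Hypothetical schema for one raw perimeter log line.
raw_schema = (
    StructType()
    .add("ts", StringType())
    .add("customer_id", StringType())
    .add("domain", StringType())
    .add("edge_location", StringType())
    .add("status_code", LongType())
    .add("bytes_sent", LongType())
    .add("latency_ms", LongType())
)

raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")
    .option("subscribe", "perimeter-raw-logs")  # hypothetical source topic
    .load()
)

# Parse the Kafka message value into typed columns.
parsed = raw.select(
    F.from_json(F.col("value").cast("string"), raw_schema).alias("e")
).select("e.*")

@F.udf(BinaryType())
def to_protobuf(ts, customer_id, domain, edge_location, status_code,
                bytes_sent, latency_ms):
    # TelemetryEvent is a hypothetical message generated from a .proto file.
    from telemetry_pb2 import TelemetryEvent
    return TelemetryEvent(
        ts=ts, customer_id=customer_id, domain=domain,
        edge_location=edge_location, status_code=status_code,
        bytes_sent=bytes_sent, latency_ms=latency_ms,
    ).SerializeToString()

encoded = parsed.select(to_protobuf(*parsed.columns).alias("value"))

# Publish the standardized protobuf records for hypercube ingestion.
(encoded.writeStream.format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")
    .option("topic", "perimeter-telemetry-proto")  # hypothetical sink topic
    .option("checkpointLocation", "/tmp/checkpoints/perimeter-etl")
    .start())
```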

A look at the system architecture.

Further enhancing our system, we integrated Datadog’s Vector to source telemetry from commercial CDNs and used Imply/Druid to construct a multidimensional hypercube with 18 dimensions and 13 measurements, tailored to represent core Salesforce traffic telemetry. This included key dimensions like customer ID and domain name, and measurements such as request count and bytes sent. The effectiveness of this solution is showcased in the internal dashboards below, which display a 24-hour view of an internal Salesforce domain and highlight the practical benefits of our system through various data slices; a sketch of the corresponding ingestion spec follows the dashboards.

The above dashboard shows a 24-hour summary view of a Salesforce domain.

The above dashboard surfaces perimeter errors and isolates them by location.
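To make the hypercube’s shape concrete, here is a fragment of what a Druid ingestion spec along these lines could look like, sketched as a Python dict; only a few of the 18 dimensions and 13 measurements are shown, and the datasource and field names are illustrative:

```python
# Fragment of a hypothetical Druid ingestion spec for the perimeter
# hypercube. Most dimensions and measurements are elided; names are
# illustrative, not the production schema.
datasource_spec = {
    "dataSchema": {
        "dataSource": "perimeter_hypercube",  # hypothetical datasource name
        "timestampSpec": {"column": "ts", "format": "iso"},
        "dimensionsSpec": {
            "dimensions": [
                "customer_id",    # high-cardinality
                "domain",         # high-cardinality
                "edge_location",
                "status_code",
                # ...remaining dimensions elided
            ]
        },
        "metricsSpec": [
            {"type": "count", "name": "request_count"},
            {"type": "longSum", "name": "bytes_sent",
             "fieldName": "bytes_sent"},
            {"type": "longSum", "name": "latency_ms_sum",
             "fieldName": "latency_ms"},
            # ...remaining measurements elided
        ],
        "granularitySpec": {
            "segmentGranularity": "hour",
            "queryGranularity": "minute",  # roll up to one-minute grain
            "rollup": True,
        },
    }
}
```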

How do you balance the need for rapid deployment with maintaining high standards of trust and security?

Balancing rapid deployment with high standards of trust and security is essential in our operations. We achieve this by utilizing proven big data frameworks, notably Imply, which outperforms open-source Apache Druid for our needs. These frameworks allow Salesforce teams to access per-customer insights within two minutes of telemetry generation at the perimeter, and they have allowed us to significantly speed up feature delivery and drive adoption.

We found the following Imply capabilities to be useful over open-source Apache Druid:

  1. Improved Analytics User Interface: The launch of Pivot 2.0 offers an intuitive, low-latency interface that simplifies data analysis with a point-and-click approach, allowing users to efficiently explore various dimensions.
  2. Clarity Dashboards: These dashboards provide clear insights into Imply usage.
  3. Professional Services and Technical Support: Imply provides extensive support and consultation services.
  4. OAuth Integration: Seamlessly integrated with Salesforce’s enterprise single sign-on (SSO), this feature bolsters security without slowing down access.

Furthermore, we are dedicated to enhancing data literacy among our user base through periodic training sessions.

How do you gather user feedback on your data platform, and how does it influence future development?

Data from Clarity, combined with user interviews and feedback, shapes our development roadmap and feature enhancements. We’ve expanded the scope of our platform by increasing the number of measurements, dimensions, and sources of perimeter telemetry ingested, enhancing our service and broadening our user base. Our platform is crucial for debugging live site incidents like latency issues, errors, and Layer 7 DDoS attacks, allowing tasks that once took 15 minutes to be completed in seconds.
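As an illustration of the second-scale incident queries this enables, the sketch below posts a Druid SQL query that isolates a domain’s errors by edge location over the last hour; the host, datasource, and column names are assumptions carried over from the earlier sketches:

```python
# Hypothetical incident-debugging query: 5xx counts for one domain, broken
# down by edge location over the last hour, via Druid's SQL API.
import requests

# Hypothetical internal Druid endpoint.
DRUID_SQL_URL = "https://druid.internal.example.com/druid/v2/sql"

sql = """
SELECT edge_location,
       SUM(request_count) AS errors
FROM perimeter_hypercube
WHERE "domain" = 'example.my.salesforce.com'
  AND status_code >= 500
  AND __time >= CURRENT_TIMESTAMP - INTERVAL '1' HOUR
GROUP BY edge_location
ORDER BY errors DESC
"""

resp = requests.post(DRUID_SQL_URL, json={"query": sql}, timeout=30)
resp.raise_for_status()
for row in resp.json():
    print(row["edge_location"], row["errors"])
```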

Moreover, we retain telemetry data for up to a year, enabling detailed analysis of usage trends that have uncovered opportunities for substantial cost savings. We’ve also enhanced our data ingestion capabilities to include both perimeter egress telemetry and non-perimeter sources like Salesforce Private Connect, providing a more complete view of customer experiences and increasing the platform’s effectiveness and reach.

How do you manage challenges related to scalability in your data platform?

Our project has prioritized managing scalability due to a significant increase in data volume, particularly after integrating telemetry from commercial CDN vendors, which doubled our data processing load. We designed our platform with horizontal scalability in mind, allowing us to expand our ETL layer (Spark), data pipeline (Kafka), and data sink (Imply/Druid) efficiently.

User engagement has quadrupled in the past six months, with our platform now serving hundreds of diverse users monthly, from engineers to executives. Despite millions of monthly queries, we maintain sub-second query latency and manage costs effectively at less than $0.01 per query.

Daily, our platform handles over 60 TB of telemetry data, aggregating tens of billions of data points. Using Spark, we process this data in a structured manner, with a one-hour tumbling window materialized every minute, leading to a daily increase of 30 GB in the perimeter hypercube. This methodical approach ensures our platform remains robust and efficient amidst growing demands.
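Continuing from the `parsed` stream in the normalization sketch above, the windowed aggregation could look like this: a one-hour tumbling window whose partial results are emitted every minute. The watermark and column names are assumptions.

```python
# Sketch of the aggregation cadence described above: a one-hour tumbling
# window on event time, with updated partial aggregates materialized every
# minute. Reuses the `parsed` DataFrame from the normalization sketch.
from pyspark.sql import functions as F

hourly = (
    parsed
    .withColumn("event_time", F.to_timestamp("ts"))
    .withWatermark("event_time", "10 minutes")  # assumed lateness bound
    .groupBy(
        F.window("event_time", "1 hour"),       # tumbling one-hour window
        "customer_id", "domain", "edge_location",
    )
    .agg(
        F.count("*").alias("request_count"),
        F.sum("bytes_sent").alias("bytes_sent"),
    )
)

(hourly.writeStream
    .outputMode("update")                # emit updated window aggregates
    .trigger(processingTime="1 minute")  # materialize every minute
    .format("console")                   # stand-in for the Druid sink
    .option("checkpointLocation", "/tmp/checkpoints/perimeter-hypercube")
    .start())
```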

Learn More

  • Hungry for more Hyperforce stories? Check out Hyperforce’s template for enhancing developer workflow in this blog.
  • Stay connected — join our Talent Community!
  • Check out our Technology and Product teams to learn how you can get involved.
