Bridging Natural Language and SQL with Generative AI

In our “Engineering Energizers” Q&A series, we shine a spotlight on the innovative engineers at Salesforce. Today, we introduce Nick Gibson from the Cloud Economics and Capacity Management (CECM) team. This dynamic and elite group is revolutionizing how internal teams access infrastructure data. They’ve developed Horizon Agent, a conversational AI assistant integrated with Slack that translates natural language into reliable SQL queries.

Discover how CECM addressed the challenges of unpredictable LLM behavior, prevented regressions from knowledge base updates, and scaled the system to handle thousands of queries, all while providing fast and trusted insights across the company.

What is the team’s mission?

CECM’s focus is on empowering internal Salesforce stakeholders — capacity planners, finance teams, service owners, and product managers — to make intelligent, data-driven decisions about infrastructure cost, usage, and readiness. This involves forecasting spend, facilitating cloud migrations, and preventing outages by ensuring systems are properly provisioned.

The challenge is that Salesforce’s cloud infrastructure data, which spans capacity, cost, usage, and reliability, is massive, complex, and spread across dozens of interrelated tables. That makes it difficult for non-technical users to navigate schemas or write accurate SQL on their own.

To make the data more accessible for users, the team introduced Horizon Agent, a cutting-edge generative AI assistant that translates natural language into efficient, reliable SQL queries. Integrated with Slack, Horizon Agent connects users directly to the Unified Intelligence Platform (UIP), Salesforce’s state-of-the-art internal analytics and data lake offering. With Horizon Agent, users can ask questions like “What will my service cost next year?” and receive accurate answers in seconds, without waiting for engineers to write SQL or create dashboards.

This also supports CECM’s strategy of maintaining a small set of critical dashboards, while enabling customers to get customized insights through AI-generated queries that pull data to answer questions not explicitly covered by those dashboards.

Before Horizon Agent, internal teams faced significant delays and consumed hundreds of engineering hours each month by submitting backlog tickets or escalating requests to engineers. By automating natural-language query generation, Horizon Agent eliminates these bottlenecks, accelerates access to insights, and enables engineers to focus on higher-impact tasks. The tool has significantly boosted decision-making speed and broadened access to critical infrastructure data across the company.

Example Horizon Agent conversation about service cost.

What was the most significant technical challenge faced during the development of Horizon Agent?

The most complex challenge was managing the non-deterministic behavior of large language models (LLMs). Even with access to the correct data, repeated queries often produced varying SQL outputs, which eroded trust and consistency. To ensure a high standard of reliability when replacing manual processes with an automated system, a consensus voting mechanism was introduced.

Each prompt now generates ten candidate SQL queries, which are then evaluated by a scoring pipeline that uses cosine similarity and Levenshtein distance to identify the most consistent, high-confidence responses. After outliers are filtered out, the most reliable SQL query is returned to the user. This approach raised the system’s efficacy score (the percentage of responses rated 4 or 5 out of 5) from 50% at launch to 80%. That improvement in accuracy was pivotal in moving from early access to general availability and in establishing trust among internal teams.
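To make the voting idea concrete, here is a minimal sketch of how consensus selection over candidate queries could work. It is not Horizon Agent's actual pipeline: the function names, the equal weighting of the two metrics, the 0.5 outlier threshold, and the use of bag-of-token counts for cosine similarity (rather than, say, embeddings) are all assumptions made for illustration.

```python
from collections import Counter
from math import sqrt


def levenshtein(a: str, b: str) -> int:
    """Edit distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def edit_similarity(a: str, b: str) -> float:
    """Levenshtein distance rescaled to a 0..1 similarity."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity over token counts (assumption: no embeddings)."""
    va, vb = Counter(a.split()), Counter(b.split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = sqrt(sum(v * v for v in va.values())) * sqrt(sum(v * v for v in vb.values()))
    return dot / norm if norm else 0.0


def normalize(sql: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not count."""
    return " ".join(sql.lower().split())


def pick_consensus(candidates: list[str], outlier_threshold: float = 0.5) -> str:
    """Return the candidate that agrees most with the other candidates."""
    if not candidates:
        raise ValueError("need at least one candidate query")
    normed = [normalize(c) for c in candidates]
    scores = []
    for i, qi in enumerate(normed):
        others = [qj for j, qj in enumerate(normed) if j != i]
        if not others:
            scores.append(1.0)
            continue
        # Hypothetical weighting: average the two metrics equally.
        sims = [(cosine_similarity(qi, qj) + edit_similarity(qi, qj)) / 2 for qj in others]
        scores.append(sum(sims) / len(sims))
    # Drop low-agreement outliers; fall back to all candidates if none survive.
    kept = [i for i, s in enumerate(scores) if s >= outlier_threshold]
    kept = kept or list(range(len(candidates)))
    return candidates[max(kept, key=lambda i: scores[i])]
```

With ten candidates from the same prompt, a query that differs sharply from the rest scores low against its peers and is dropped, while the query closest to the majority is returned.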

Quarterly Horizon Agent efficacy ratings.

What ongoing research and development efforts are aimed at improving Horizon Agent’s capabilities?

The next major feature in development for Horizon Agent is the automatic generation of visualizations based on SQL results. While users currently receive accurate raw data, the lack of visual representation can make interpretation difficult. The system needs to handle a wide range of data types, from time series to point-in-time snapshots, each requiring different charting logic.
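As a rough illustration of why the two data shapes need different charting logic, the sketch below picks a chart type from the shape of a result set. This feature is still in development, so everything here is hypothetical: the `suggest_chart` name, the row format (a list of dictionaries), and the rule itself are assumptions, not the team's design.

```python
from datetime import date, datetime


def suggest_chart(rows: list[dict]) -> str:
    """Guess a chart type from query results (illustrative heuristic only)."""
    if not rows:
        return "table"
    for column in rows[0]:
        values = [row[column] for row in rows]
        is_temporal = all(isinstance(v, (date, datetime)) for v in values)
        # Several distinct timestamps suggest a time series.
        if is_temporal and len(set(values)) > 1:
            return "line"
    # Otherwise treat the result as a point-in-time snapshot.
    return "bar"
```

A result with one row per month would map to a line chart, while a single-day cost breakdown by service would map to a bar chart.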

To tackle this, the team is exploring both open-source tools and Salesforce-native solutions, such as Tableau Next, which are designed to support AI-driven chart generation. The aim is to create a flexible, generative interface that can adapt to the diverse data types and user needs.

The ultimate goal is to transform Horizon Agent into a comprehensive insight delivery platform. By dynamically generating visuals alongside SQL results, the system will make data more accessible and interpretable, reducing the need for post-query analysis, especially for users who are not proficient in SQL or data visualization.

Nick shares why engineers should join Salesforce.

How is Horizon Agent deployed rapidly while maintaining high standards of trust and security?

Horizon Agent is deployed using a dual-track system that separates application infrastructure from data-layer intelligence. The application code, written in Python, is delivered through CI/CD pipelines and undergoes rigorous testing, including static analysis, vulnerability scanning, unit tests, and regression checks. This component handles interactions with Slack, orchestrates LLMs through Einstein Gateway, and communicates with UIP.

The knowledge base, which includes table metadata, documentation, and query guidelines, is managed independently. This allows for quick updates, often going live in as little as 15 minutes, without the need for full deployment cycles. Any errors or misunderstandings related to table structure or naming can be addressed promptly.

This architecture combines the benefits of tightly controlled infrastructure with agile, user-responsive knowledge management. The result is a system that evolves rapidly while maintaining minimal risk to stability and trust.

Nick discusses an emerging AI technology that his team is researching.

What strategies were used to ensure that enhancements in one area of Horizon Agent didn’t compromise others?

Small changes to the Horizon Agent knowledge base sometimes caused unexpected issues elsewhere. This is a common challenge with LLMs, which can form connections between seemingly unrelated concepts. Fixing one answer could inadvertently affect others.

To mitigate this, the team developed a pre-production benchmarking pipeline. This pipeline runs a comprehensive set of natural-language questions, each paired with an expected SQL output, against the current system. The outputs are then evaluated by a separate LLM, acting as a SQL judge, which checks for schema coverage, filtering logic, and column selection.

Only knowledge base updates that pass the benchmark threshold are deployed. This approach ensures continuous improvement while maintaining system consistency and output quality. It also provides a clear, measurable way to verify that changes are effective and do not introduce new issues.
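The gating step can be sketched in a few lines. This is an assumed shape rather than the team's implementation: `BenchmarkCase`, `pass_rate`, `should_deploy`, and the 0.9 default threshold are hypothetical, and the judge is passed in as a plain callable so that an LLM-backed SQL judge (or a simple stand-in for local testing) can be plugged in.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class BenchmarkCase:
    question: str      # natural-language question
    expected_sql: str  # reference SQL for that question


# A judge receives (question, expected_sql, generated_sql) and returns pass/fail.
Judge = Callable[[str, str, str], bool]


def pass_rate(
    cases: Sequence[BenchmarkCase],
    generate_sql: Callable[[str], str],
    judge: Judge,
) -> float:
    """Fraction of benchmark questions whose generated SQL the judge accepts."""
    if not cases:
        return 0.0
    passed = sum(
        1 for case in cases
        if judge(case.question, case.expected_sql, generate_sql(case.question))
    )
    return passed / len(cases)


def should_deploy(
    cases: Sequence[BenchmarkCase],
    generate_sql: Callable[[str], str],
    judge: Judge,
    threshold: float = 0.9,  # hypothetical value; the source does not state one
) -> bool:
    """Allow a knowledge base update only if the benchmark threshold is met."""
    return pass_rate(cases, generate_sql, judge) >= threshold
```

In this sketch, `generate_sql` would run the candidate knowledge base against each question, and the judge would compare the result with the expected SQL on points such as schema coverage, filtering logic, and column selection.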

Horizon Agent live service health metrics.

How are scalability challenges managed within Horizon Agent?

Scalability has been a key focus since Horizon Agent moved to general availability. The system currently supports around 50 monthly active users, with each query generating 10 to 20 LLM calls. Despite this load, response times remain within seconds.

Horizon Agent was designed with performance in mind from the start. The Slack-based interface connects to UIP, which is powered by Salesforce’s internal Trino implementation (provided by the Big Data Infrastructure team). Model inference is managed through the Einstein Gateway, providing access to high-performance LLMs like those from OpenAI’s GPT series.

The system employs aggressive caching, retry and backoff logic, and real-time performance monitoring with Grafana dashboards. Since its launch, Horizon Agent has averaged only one error per day, even while handling thousands of interactions. This reliability is a direct result of prioritizing scalability as a core design principle.
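The source does not describe how caching and retries are implemented, so the following is only a generic sketch of the pattern: exponential backoff with jitter around a query executor, with successful results memoized by SQL text. The names `retry_with_backoff` and `cached_with_retry`, along with every default value, are assumptions.

```python
import random
import time
from functools import lru_cache, wraps


def retry_with_backoff(max_attempts=4, base_delay=0.5, max_delay=8.0, sleep=time.sleep):
    """Retry a function on any exception, doubling the wait each attempt."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return fn(*args, **kwargs)
                except Exception:
                    if attempt == max_attempts:
                        raise  # out of attempts: surface the last error
                    delay = min(max_delay, base_delay * 2 ** (attempt - 1))
                    # Jitter spreads retries out so clients do not retry in lockstep.
                    sleep(delay + random.uniform(0, delay / 2))
        return wrapper
    return decorator


def cached_with_retry(execute, **retry_kwargs):
    """Wrap an executor with retries, then cache successful results by argument."""
    return lru_cache(maxsize=1024)(retry_with_backoff(**retry_kwargs)(execute))
```

Because `lru_cache` stores only returned values, a call that ultimately fails is not cached, and a repeated question that maps to the same SQL text is served without another round trip.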
