Data Structure Guidelines

Integrating Max, a high-performance chat agent, into your analytics workflow hinges on robust data structure management. Max harnesses SQL to build tabular views of underlying data, enabling the system to automatically update and enrich the language model and chat pipeline with semantic information. This real-time adaptability is crucial as it maintains conversational relevance and responsiveness without manual updates.

To optimize queries, ensure the data in the tabular view is:

  • Aggregable using standard SQL aggregate functions, with no overlapping totals across the dataset.
  • Consistent and harmonized, eliminating any discrepancies that could impair data integrity.

For natural language processing to be effective, it's essential that the data:

  • Avoids ambiguous values across dimensions.
  • Maintains consistency to reduce complexity and enhance interpretability.

Supporting time series queries requires a date field formatted as "YYYY-MM-DD", provided either directly or through a SQL expression. This setup enables trending, breakout, and time-based filtering, enhancing the analytical power of Max. The rest of this article walks through these guidelines and outlines how to structure your data to maximize the efficiency and accuracy of Max's chat interface.

 

Ideal Data Structure Guidelines

  • A single tabular dataset (one table), fully machine-readable
    • A single table is the ideal dataset for Max. A star schema input is feasible, but requires real-time SQL or Pandas-based manipulation. A physical table or view can improve performance (a sketch of such a table follows this list).
  • Harmonized dimensions per column
  • Metrics as columns
  • Fully aggregable fact columns (if aggregation is necessary)
    • Hierarchies should be fully aggregable at all levels
    • No period types that overlap
  • A single period granularity that can be aggregated to the desired period levels
    • Must be in a machine-readable format, e.g. "YYYY-MM-DD"
  • Ideally, no dimension value should repeat across dimension columns, as duplicates impair the LLM's ability to interpret dimension values
    • Ex: manufacturer = Coke, Brand = Coke. Prefer: manufacturer = Coca-Cola, Brand = Coke.
  • No special characters or spaces in table or column names

 

Typical Enterprise Data Structure

Enterprise data is often stored in multiple, large star schema databases. This structure is ideal for reducing redundancy, but it can be slower and more costly for dynamic query creation, which must selectively access and join the needed tables.

 

Max Copilots Are Targeted at Solving a Specific Business Problem

Each Max copilot is a bespoke application targeted at solving a specific business problem, and a single table is better suited to this type of solution*. A single denormalized fact table allows for simple query creation and execution, without requiring dynamic query building that depends on the context of a question.

*Note: The single fact table should contain only data relevant to the analysis; excess data only slows query performance.

 

Example Data Structure

The following example uses sample sales data.

If total levels exist (such as "Manufacturer"), then each total should be uniquely identifiable when it is provided at the same level as the subordinate values, and the subordinate values should sum to that total.

 

Survey Data Notes & Sample Data

  • Max will utilize survey data aggregated to the level necessary for the desired end-user analysis
    • Not respondent-level or raw survey data
    • Ex: Brand managers would need respondent demographic data aggregated at minimum to the Brand level
  • The data should include metrics pre-calculated to the preferred level of aggregation. Following the example above, a solution for brand managers would have data pre-aggregated to the "Brand" level
  • A single period granularity in a machine-readable format, e.g. "YYYY-MM-DD"
  • Ideally, no dimension value should repeat across dimension columns, as duplicates impair the LLM's ability to interpret dimension values
    • Ex: manufacturer = Coke, Brand = Coke. Prefer: manufacturer = Coca-Cola, Brand = Coke.
  • No special characters or spaces in table or column names
  • If multiple aggregation levels are necessary, separate views or tables with data at these levels should be provided for query efficiency 
  • If beginning with survey-level data: AnswerRocket (AR) requires aggregation formulae for each metric that will need to be rolled up above the respondent level (ex: sum, max, min, avg, etc.); see the sketch after this list
  • If statistical significance is used for a particular analysis, the statistical significance calculation must be documented, preferably in a SQL reference to the provided tables.
  • Similarly, any other required calculation must be documented in the form of SQL on your tables and shared with AR.
  • Reference values are required to perform data validation for any level in the data and for any calculated metric.

 

Time Periods

  • Ideally, all metrics should share a single common lowest time period granularity.
    • This makes all metrics aggregable (in tandem) from that lowest granularity to any larger time period, as sketched below.

Requirements for Custom Periods, Including Fiscal Year

  • Custom periods can exist anywhere data is analyzed based on a non-Gregorian calendar
    • Ex: A custom fiscal year that starts in July
    • Ex: weekly periods that roll to 13 x 4-week periods or a 4-4-5 ‘monthly’ synthesized calendar
  • If needed, the data requires (see the sketch below):
    • A distinct column for each date unit (e.g. year, month, quarter, fiscal year, etc.)
    • A sequence column for each of the date unit columns

 
