Data Structure Guidelines

Integrating Max, a high-performance chat agent, into your analytics workflow hinges on robust data structure management. Max harnesses SQL to build tabular views of underlying data, enabling the system to automatically update and enrich the language model and chat pipeline with semantic information. This real-time adaptability is crucial as it maintains conversational relevance and responsiveness without manual updates.

To optimize queries, ensure the data in the tabular view is:

  • Aggregatable using standard SQL methods, with no overlapping totals across the dataset.
  • Consistent and harmonized, eliminating any discrepancies that could impair data integrity.

For natural language processing to be effective, it's essential that the data:

  • Avoids ambiguous values across dimensions.
  • Maintains consistency to reduce complexity and enhance interpretability.

Supporting time series queries requires a date field formatted as "yyyy-mm-dd", either directly or through a SQL expression. This setup facilitates trending, breakout, and filtering capabilities based on time, enhancing the analytical power of Max. In this article, we will delve into these guidelines, outlining how to structure your data effectively to maximize the efficiency and accuracy of Max's chat interface.
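As a minimal sketch of the date requirement, the snippet below builds a SQL view over a hypothetical `sales` table whose dates are stored as mm/dd/yyyy text, and rewrites them into the machine-readable yyyy-mm-dd format with a SQL expression. The table and column names are illustrative, not part of any AnswerRocket schema:

```python
import sqlite3

# Hypothetical table and columns -- adjust to your own schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (sold_on TEXT, revenue REAL)")
conn.execute("INSERT INTO sales VALUES ('07/15/2023', 120.0), ('07/16/2023', 80.0)")

# Normalize the mm/dd/yyyy strings into yyyy-mm-dd inside the view itself,
# so the chat agent sees a clean, sortable date column.
conn.execute("""
    CREATE VIEW sales_view AS
    SELECT substr(sold_on, 7, 4) || '-' || substr(sold_on, 1, 2)
           || '-' || substr(sold_on, 4, 2) AS sale_date,
           revenue
    FROM sales
""")
rows = conn.execute(
    "SELECT sale_date, revenue FROM sales_view ORDER BY sale_date"
).fetchall()
print(rows)  # [('2023-07-15', 120.0), ('2023-07-16', 80.0)]
```

Because the view emits ISO-style dates, simple string ordering matches chronological ordering, which is what makes trending and time filtering reliable.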


Ideal Data Structure Guidelines

  • A single tabular dataset (one table), fully machine readable
  • Harmonized dimensions per column
  • Metrics as columns
  • Fully aggregable fact columns
    • Hierarchies should be fully aggregable at all levels
    • No period types that overlap
  • A single period granularity that can be aggregated to the desired period levels
    • Must be in a machine-readable format, i.e. “YYYY-MM-DD”
  • Ideally, no dimension value should repeat across dimension columns; duplicate values impair the LLM’s ability to interpret dimension values
    • Ex: manufacturer = Coke, Brand = Coke. Prefer: manufacturer = Coca-Cola, Brand = Coke.
  • No special characters or spaces in table or column names
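To make the guidelines above concrete, here is a hypothetical single flat table (names and values are illustrative) that follows them: snake_case identifiers with no spaces or special characters, one period granularity in YYYY-MM-DD, distinct values across dimension columns, and fact columns that aggregate cleanly:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# One tabular dataset; metrics as columns; harmonized dimensions.
conn.execute("""
    CREATE TABLE sales_fact (
        period_date   TEXT,    -- machine-readable YYYY-MM-DD
        manufacturer  TEXT,    -- e.g. 'Coca-Cola' (distinct from brand values)
        brand         TEXT,    -- e.g. 'Coke'
        region        TEXT,
        sales_dollars REAL,    -- fully aggregable fact column
        units_sold    INTEGER  -- fully aggregable fact column
    )
""")
conn.execute(
    "INSERT INTO sales_fact VALUES ('2023-07-01','Coca-Cola','Coke','East', 500.0, 40)"
)
conn.execute(
    "INSERT INTO sales_fact VALUES ('2023-08-01','Coca-Cola','Coke','East', 650.0, 52)"
)

# Because the fact columns aggregate cleanly with no overlapping periods,
# standard SQL rollups work at any level (brand, region, month, ...).
total = conn.execute("SELECT SUM(sales_dollars) FROM sales_fact").fetchone()[0]
print(total)  # 1150.0
```

The point of the shape is that any slice the chat agent chooses can be answered with an ordinary GROUP BY, with no special-case logic per dimension.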


Example Data Structure

Using sample sales data.

Example Data Structure.jpg


Survey Data Notes & Sample Data

  • Generally, AnswerRocket expects respondent-level data, not raw survey data. 
  • The respondent-level data should include metrics pre-calculated to that level.
  • View the sample structure below:

Sample Survey Data Structure.jpg


Other Data Requirements

  • AnswerRocket requires aggregation formulae for each metric that will need to be rolled up above respondent level (ex: sum, max, min, avg, etc.).
  • If statistical significance is used for a particular analysis, the statistical significance calculation must be documented, preferably in a SQL reference to the provided tables.
  • Similarly, any other required calculation must be documented and shared with AnswerRocket.
  • Reference values are required to perform data validation for any level in the data and for any calculated metric.
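A sketch of the first requirement, using hypothetical survey tables and metric names: each metric carries a documented aggregation formula (here AVG for a score, SUM for a count), which is applied whenever respondent-level data is rolled up to a higher level such as segment:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical respondent-level table with pre-calculated metrics.
conn.execute("""
    CREATE TABLE survey_respondents (
        respondent_id INTEGER,
        segment       TEXT,
        satisfaction  REAL,    -- documented rollup: AVG
        purchases     INTEGER  -- documented rollup: SUM
    )
""")
conn.executemany(
    "INSERT INTO survey_respondents VALUES (?, ?, ?, ?)",
    [(1, 'A', 4.0, 2), (2, 'A', 5.0, 1), (3, 'B', 3.0, 4)],
)

# The documented aggregation formula per metric, used for any rollup
# above respondent level (here: by segment).
metric_aggs = {"satisfaction": "AVG", "purchases": "SUM"}
select_list = ", ".join(f"{agg}({col}) AS {col}" for col, agg in metric_aggs.items())
rows = conn.execute(
    f"SELECT segment, {select_list} FROM survey_respondents "
    "GROUP BY segment ORDER BY segment"
).fetchall()
print(rows)  # [('A', 4.5, 3), ('B', 3.0, 4)]
```

Recording the formula per metric (rather than assuming SUM everywhere) is what prevents non-additive metrics like satisfaction scores from being rolled up incorrectly.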


Requirements for Custom Periods, Including Fiscal Year

  • Custom periods can exist anywhere data is analyzed on a non-Gregorian calendar
    • Ex: a custom fiscal year that starts in July
    • Ex: weekly periods that roll up to 13 x 4-week periods or a 4-4-5 ‘monthly’ synthesized calendar
  • If needed, the data requires:
    • A distinct column for each date unit (e.g. year, month, quarter, fiscal year, etc.)
    • A sequence column for each of the date unit columns
      • Must use the naming convention: date unit + “_sequence” (e.g. if there is a “year” column then there must be a “year_sequence” column)
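As a small sketch of these two requirements, the hypothetical helper below derives a fiscal-year column for a fiscal year starting in July, plus its companion sequence column named with the “date unit + _sequence” convention (the FY label format is an assumption for illustration):

```python
from datetime import date

def fiscal_columns(d: date) -> dict:
    """Derive custom-period columns for a July-start fiscal year."""
    # July through December belong to the fiscal year that ends the
    # following June, so they are labeled with the next calendar year.
    fiscal_year = d.year + 1 if d.month >= 7 else d.year
    return {
        "fiscal_year": f"FY{fiscal_year}",
        # Monotonic sequence column ("fiscal_year" + "_sequence") so the
        # periods sort chronologically even when the label alone would not.
        "fiscal_year_sequence": fiscal_year,
    }

print(fiscal_columns(date(2023, 7, 1)))   # FY2024, sequence 2024
print(fiscal_columns(date(2023, 6, 30)))  # FY2023, sequence 2023
```

The same pattern extends to the other date units: each custom unit (month, quarter, etc.) gets its own value column plus a matching `*_sequence` column.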

Custom Period Requirements.jpg

