AzureCosmosDB · aayush3011 · Jun 29, 2026 · Jun 22, 2026
diff --git a/.env.template b/.env.template
@@ -60,13 +60,18 @@ AI_FOUNDRY_EMBEDDING_DEPLOYMENT_NAME=text-embedding-3-large
 AI_FOUNDRY_EMBEDDING_DIMENSIONS=1536
 AI_FOUNDRY_EMBEDDING_DATA_TYPE=float32
 AI_FOUNDRY_EMBEDDING_DISTANCE_FUNCTION=cosine
-# Optional. Vector index type for the memories container: diskANN (default),
-# quantizedFlat, or flat. diskANN requires the DiskANN capability on the Cosmos
-# DB account; use quantizedFlat or flat for accounts without it (e.g. the
-# classic Cosmos DB emulator).
-AI_FOUNDRY_EMBEDDING_VECTOR_INDEX_TYPE=diskANN
+# Optional. Vector index type for the memories container: quantizedFlat
+# (default), diskANN, or flat. quantizedFlat works on any Cosmos DB account
+# (including the classic emulator); diskANN requires the DiskANN capability on
+# the Cosmos DB account, so opt into it explicitly when available.
+AI_FOUNDRY_EMBEDDING_VECTOR_INDEX_TYPE=quantizedFlat
 COSMOS_DB_FULL_TEXT_LANGUAGE=en-US
 
+# Embed raw conversation turns on write so they can be vector-searched via
+# search_turns(). The turns container is always provisioned with a
+# vector index, so toggling this never requires recreating the container.
+ENABLE_TURN_EMBEDDINGS=false
+
 AI_FOUNDRY_CHAT_DEPLOYMENT_NAME=<your-model-deployment>
 # Optional. Pin the Azure OpenAI REST API version used by chat and embeddings
 # clients. Leave blank to use the toolkit default ("2024-12-01-preview").
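With `quantizedFlat` now the default, the vector index type resolves in this order: an explicit value, else `AI_FOUNDRY_EMBEDDING_VECTOR_INDEX_TYPE`, else `quantizedFlat`, with unknown values rejected. Below is a minimal sketch of that order, modeled on `_resolve_vector_index_type` in `_utils.py`. The function name, the injectable `env` mapping, and raising `ValueError` in place of the SDK's `ConfigurationError` are assumptions made for illustration.

```python
import os
from collections.abc import Mapping
from typing import Optional

# The three values listed in the .env.template comment.
ALLOWED_VECTOR_INDEX_TYPES = frozenset({"quantizedFlat", "diskANN", "flat"})


def resolve_vector_index_type(
    explicit: Optional[str] = None, env: Optional[Mapping[str, str]] = None
) -> str:
    """Sketch: explicit value, else the env var, else quantizedFlat; unknown values are rejected."""
    source = os.environ if env is None else env
    if explicit is not None:
        raw = explicit
    else:
        # An unset or empty env var falls back to the new default.
        raw = source.get("AI_FOUNDRY_EMBEDDING_VECTOR_INDEX_TYPE") or "quantizedFlat"
    raw = raw.strip()
    if raw not in ALLOWED_VECTOR_INDEX_TYPES:
        # ValueError stands in for the SDK's ConfigurationError in this sketch.
        raise ValueError(f"unsupported vector index type: {raw!r}")
    return raw


print(resolve_vector_index_type(env={}))  # quantizedFlat
print(resolve_vector_index_type(env={"AI_FOUNDRY_EMBEDDING_VECTOR_INDEX_TYPE": "diskANN"}))  # diskANN
print(resolve_vector_index_type("flat", env={"AI_FOUNDRY_EMBEDDING_VECTOR_INDEX_TYPE": "diskANN"}))  # flat
```

Accounts with the DiskANN capability opt in by setting the env var (or passing the value explicitly); everything else works with the default.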

diff --git a/Docs/concepts.md b/Docs/concepts.md
@@ -94,6 +94,8 @@ Memories stored in Cosmos DB include embeddings generated by Microsoft AI Foundr
 
 Facts work especially well for vector search because each fact is stored as a small, self-contained document.
 
+By default, raw conversation turns are *not* embedded; only derived memories (facts, episodic, procedural, summaries) carry vectors. Set `enable_turn_embeddings=True` (env `ENABLE_TURN_EMBEDDINGS`) to also embed turns on write, then call `search_turns()` to vector-search the raw conversation log. The turns container is always provisioned with a `quantizedFlat` vector index, so this flag only toggles embedding generation and can be turned on or off at any time without recreating the container.
+
 ---
 
 ## Processing Pipeline
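The turn write path described in the concepts change can be pictured as a flag that gates only the embedding call, while the container stays the same. The sketch below is hypothetical: `TurnWriter`, `add_turn`, and `fake_embed` are illustrative names, a Python list stands in for the turns container, and the toy embedder replaces the AI Foundry embedding deployment.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class TurnWriter:
    """Hypothetical stand-in for the SDK's turn write path (not the real class)."""

    embed: Callable[[str], list[float]]
    enable_turn_embeddings: bool = False
    stored: list[dict] = field(default_factory=list)  # plays the role of the turns container

    def add_turn(self, user_id: str, thread_id: str, role: str, content: str) -> dict:
        doc = {"user_id": user_id, "thread_id": thread_id, "role": role, "content": content}
        if self.enable_turn_embeddings:
            # Only the embedding call is gated; the container and its vector index are unchanged.
            doc["embedding"] = self.embed(content)
        self.stored.append(doc)
        return doc


def fake_embed(text: str) -> list[float]:
    # Toy embedder for illustration; a real deployment calls the embedding model.
    return [float(len(text)), float(text.count(" "))]


writer = TurnWriter(embed=fake_embed)
writer.add_turn("u1", "t1", "user", "hello there")  # flag off: stored without a vector
writer.enable_turn_embeddings = True                 # flip the flag; same container
writer.add_turn("u1", "t1", "user", "second turn")   # flag on: embedded on write
print([("embedding" in d) for d in writer.stored])  # [False, True]
```

One inference worth noting: because the flag only affects writes, turns stored while it was off carry no `embedding` field, and turning it on later does not make them vector-searchable after the fact.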

diff --git a/Docs/public_api.md b/Docs/public_api.md
@@ -14,7 +14,7 @@
 
 ### Connection
 
-- `__init__(cosmos_endpoint=None, cosmos_credential=None, cosmos_key=None, cosmos_database=None, cosmos_container=None, cosmos_turns_container='memories_turns', cosmos_summaries_container='memories_summaries', cosmos_counter_container=None, cosmos_lease_container=None, cosmos_throughput_mode=None, cosmos_autoscale_max_ru=None, ai_foundry_endpoint=None, ai_foundry_credential=None, ai_foundry_api_key=None, embedding_deployment_name='text-embedding-3-large', embedding_dimensions=None, chat_deployment_name='gpt-4o-mini', use_default_credential=True, processor=None) -> None` — configure local state, model clients, optional Cosmos auto-connect, and optional processing backend. The SDK uses a hard 3-container topology: turns in `memories_turns`, facts/episodic/procedural in `memories`, and summaries in `memories_summaries` (or the names you pass).
+- `__init__(cosmos_endpoint=None, cosmos_credential=None, cosmos_key=None, cosmos_database=None, cosmos_container=None, cosmos_turns_container='memories_turns', cosmos_summaries_container='memories_summaries', cosmos_counter_container=None, cosmos_lease_container=None, cosmos_throughput_mode=None, cosmos_autoscale_max_ru=None, ai_foundry_endpoint=None, ai_foundry_credential=None, ai_foundry_api_key=None, embedding_deployment_name='text-embedding-3-large', embedding_dimensions=None, chat_deployment_name='gpt-4o-mini', use_default_credential=True, enable_turn_embeddings=None, processor=None) -> None` — configure local state, model clients, optional Cosmos auto-connect, and optional processing backend. The SDK uses a hard 3-container topology: turns in `memories_turns`, facts/episodic/procedural in `memories`, and summaries in `memories_summaries` (or the names you pass). `enable_turn_embeddings` defaults to `None`, which defers to the `ENABLE_TURN_EMBEDDINGS` env var (itself `False` when unset); when enabled, raw turns are embedded on write so they can be vector-searched via `search_turns()`. The turns container is always provisioned with a vector index, so toggling this never requires recreating it.
 - `close() -> None` — close Cosmos/model clients and owned credentials.
 - `connect_cosmos(endpoint=None, credential=None, key=None, database=None, container=None, turns_container=None, summaries_container=None) -> None` — connect to existing memory, turns, and summaries containers.
 - `create_memory_store(database=None, container=None, turns_container=None, summaries_container=None, counter_container=None, lease_container=None, endpoint=None, credential=None, key=None, embedding_dimensions=None, embedding_data_type=None, distance_function=None, full_text_language=None, throughput_mode=None, autoscale_max_ru=None) -> None` — create/connect the memory, turns, summaries, counter, and lease containers.
@@ -37,7 +37,8 @@
 
 ### Retrieval
 
-- `search_cosmos(search_terms, memory_id=None, user_id=None, role=None, memory_types=None, thread_id=None, hybrid_search=False, top_k=5, tags_all=None, tags_any=None, exclude_tags=None, include_superseded=False, min_salience=None, min_confidence=None, created_after=None, created_before=None) -> list[dict]` — vector or hybrid search memories.
+- `search_cosmos(search_terms, memory_id=None, user_id=None, role=None, memory_types=None, thread_id=None, hybrid_search=False, top_k=5, tags_all=None, tags_any=None, exclude_tags=None, include_superseded=False, min_salience=None, min_confidence=None, created_after=None, created_before=None) -> list[dict]` — vector or hybrid search derived memories (facts/episodic/procedural).
+- `search_turns(search_terms, user_id, thread_id=None, role=None, hybrid_search=False, top_k=5, tags_all=None, tags_any=None, exclude_tags=None, created_after=None, created_before=None) -> list[dict]` — vector or hybrid search the raw conversation log instead of facts/episodic/procedural (requires turn embeddings; see `enable_turn_embeddings`). `user_id` is required so the search is scoped to one partition instead of scanning every user's turns.
 - `get_procedural_prompt(user_id) -> Optional[str]` — read the active procedural prompt.
 - `get_procedural_history(user_id, limit=10) -> list[dict]` — read procedural prompt history.
 - `get_procedural_memories(user_id, priority=None, category=None, min_salience=None, include_superseded=False) -> list[dict]` — retrieve procedural memory documents.
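`search_turns()` requires `user_id` so the query stays inside one partition. One plausible way to express that with the `azure-cosmos` SDK is to pass the user id as `partition_key` to `ContainerProxy.query_items` alongside a `VectorDistance` ranking. This is a sketch, not the SDK's actual query: it assumes turns are partitioned on `/user_id` and store their vector at `/embedding` next to `thread_id`, `role`, and `content` fields, and `build_turn_vector_query` / `search_turns_sketch` are hypothetical helpers. Hybrid ranking and the tag/date filters are omitted.

```python
from typing import Any, Optional


def build_turn_vector_query(
    query_vector: list[float],
    thread_id: Optional[str] = None,
    role: Optional[str] = None,
    top_k: int = 5,
) -> tuple[str, list[dict[str, Any]]]:
    """Hypothetical vector query over turns; not the SDK's actual query text."""
    params: list[dict[str, Any]] = [
        {"name": "@top_k", "value": top_k},
        {"name": "@embedding", "value": query_vector},
    ]
    filters: list[str] = []
    if thread_id is not None:
        filters.append("c.thread_id = @thread_id")
        params.append({"name": "@thread_id", "value": thread_id})
    if role is not None:
        filters.append("c.role = @role")
        params.append({"name": "@role", "value": role})
    where = " WHERE " + " AND ".join(filters) if filters else ""
    query = (
        "SELECT TOP @top_k c.id, c.content, VectorDistance(c.embedding, @embedding) AS score "
        f"FROM c{where} ORDER BY VectorDistance(c.embedding, @embedding)"
    )
    return query, params


def search_turns_sketch(container: Any, user_id: str, query_vector: list[float], **filters: Any) -> list[dict]:
    """Run the query inside the caller's partition (assumes turns are partitioned on /user_id)."""
    query, params = build_turn_vector_query(query_vector, **filters)
    # partition_key routes the query to one logical partition instead of fanning out.
    items = container.query_items(query=query, parameters=params, partition_key=user_id)
    return list(items)
```

With a real `ContainerProxy`, `query_vector` would come from the same embedding deployment used on write; supplying `partition_key` is what keeps the search from scanning every user's turns.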
@@ -67,7 +68,7 @@ Local-buffer methods remain synchronous in-memory operations; Cosmos, retrieval,
 
 ### Connection
 
-- `__init__(cosmos_endpoint=None, cosmos_credential=None, cosmos_key=None, cosmos_database=None, cosmos_container=None, cosmos_turns_container='memories_turns', cosmos_summaries_container='memories_summaries', cosmos_counter_container=None, cosmos_lease_container=None, cosmos_throughput_mode=None, cosmos_autoscale_max_ru=None, ai_foundry_endpoint=None, ai_foundry_credential=None, ai_foundry_api_key=None, embedding_deployment_name='text-embedding-3-large', embedding_dimensions=None, chat_deployment_name='gpt-4o-mini', use_default_credential=True, processor=None) -> None` — configure async local state, model clients, and optional processing backend. The async SDK uses the same hard 3-container topology as the sync client.
+- `__init__(cosmos_endpoint=None, cosmos_credential=None, cosmos_key=None, cosmos_database=None, cosmos_container=None, cosmos_turns_container='memories_turns', cosmos_summaries_container='memories_summaries', cosmos_counter_container=None, cosmos_lease_container=None, cosmos_throughput_mode=None, cosmos_autoscale_max_ru=None, ai_foundry_endpoint=None, ai_foundry_credential=None, ai_foundry_api_key=None, embedding_deployment_name='text-embedding-3-large', embedding_dimensions=None, chat_deployment_name='gpt-4o-mini', use_default_credential=True, enable_turn_embeddings=None, processor=None) -> None` — configure async local state, model clients, and optional processing backend. The async SDK uses the same hard 3-container topology as the sync client. `enable_turn_embeddings` defaults to `None`, which defers to the `ENABLE_TURN_EMBEDDINGS` env var (itself `False` when unset); when enabled, raw turns are embedded on write so they can be vector-searched via `search_turns()`.
 - `async close() -> None` — close async/sync resources and owned credentials.
 - `async connect_cosmos(endpoint=None, credential=None, key=None, database=None, container=None, turns_container=None, summaries_container=None) -> None` — connect to existing memory, turns, and summaries containers.
 - `async create_memory_store(database=None, container=None, turns_container=None, summaries_container=None, counter_container=None, lease_container=None, endpoint=None, credential=None, key=None, embedding_dimensions=None, embedding_data_type=None, distance_function=None, full_text_language=None, throughput_mode=None, autoscale_max_ru=None) -> None` — create/connect memory, turns, summaries, counter, and lease containers.
@@ -90,7 +91,8 @@ Local-buffer methods remain synchronous in-memory operations; Cosmos, retrieval,
 
 ### Retrieval
 
-- `async search_cosmos(search_terms, memory_id=None, user_id=None, role=None, memory_types=None, thread_id=None, hybrid_search=False, top_k=5, tags_all=None, tags_any=None, exclude_tags=None, include_superseded=False, min_salience=None, min_confidence=None, created_after=None, created_before=None) -> list[dict]` — vector or hybrid search memories.
+- `async search_cosmos(search_terms, memory_id=None, user_id=None, role=None, memory_types=None, thread_id=None, hybrid_search=False, top_k=5, tags_all=None, tags_any=None, exclude_tags=None, include_superseded=False, min_salience=None, min_confidence=None, created_after=None, created_before=None) -> list[dict]` — vector or hybrid search derived memories (facts/episodic/procedural).
+- `async search_turns(search_terms, user_id, thread_id=None, role=None, hybrid_search=False, top_k=5, tags_all=None, tags_any=None, exclude_tags=None, created_after=None, created_before=None) -> list[dict]` — vector or hybrid search the raw conversation log instead of facts/episodic/procedural (requires turn embeddings; see `enable_turn_embeddings`). `user_id` is required so the search is scoped to one partition instead of scanning every user's turns.
 - `async get_procedural_prompt(user_id) -> Optional[str]` — read the active procedural prompt.
 - `async get_procedural_history(user_id, limit=10) -> list[dict]` — read procedural prompt history.
 - `async get_procedural_memories(user_id, priority=None, category=None, min_salience=None, include_superseded=False) -> list[dict]` — retrieve procedural memory documents.
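The async client exposes the same `search_turns()` parameters. If it is backed by `azure.cosmos.aio`, one detail matters when writing similar code by hand: `query_items` there is not awaited but returns an async pager consumed with `async for`. The sketch below shows that pattern with per-user queries issued concurrently; `search_turns_async_sketch` and `FakeAsyncContainer` are hypothetical, the latter used only so the example runs without a Cosmos account.

```python
import asyncio
from typing import Any, AsyncIterator


async def search_turns_async_sketch(
    container: Any, user_id: str, query: str, parameters: list[dict[str, Any]]
) -> list[dict]:
    """Hypothetical helper: partition-scoped query against an aio-style container."""
    # query_items is called without await; it returns an async iterable of items.
    pager = container.query_items(query=query, parameters=parameters, partition_key=user_id)
    return [item async for item in pager]


class FakeAsyncContainer:
    """Stand-in for azure.cosmos.aio.ContainerProxy so the example runs offline."""

    def query_items(self, query: str, *, parameters=None, partition_key=None) -> AsyncIterator[dict]:
        async def pages() -> AsyncIterator[dict]:
            yield {"id": "t-1", "user_id": partition_key}

        return pages()


async def main() -> list[list[dict]]:
    container = FakeAsyncContainer()
    query = "SELECT TOP 5 c.id, c.content FROM c"
    # Each user's search is its own single-partition query; gather runs them concurrently.
    return await asyncio.gather(
        search_turns_async_sketch(container, "u1", query, []),
        search_turns_async_sketch(container, "u2", query, []),
    )


results = asyncio.run(main())
print(results)  # [[{'id': 't-1', 'user_id': 'u1'}], [{'id': 't-1', 'user_id': 'u2'}]]
```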

diff --git a/azure/cosmos/agent_memory/_base/base_client.py b/azure/cosmos/agent_memory/_base/base_client.py
@@ -18,6 +18,7 @@
 )
 from azure.cosmos.agent_memory.exceptions import CosmosNotConnectedError, MemoryNotFoundError, ValidationError
 from azure.cosmos.agent_memory.logging import configure_logging, get_logger
+from azure.cosmos.agent_memory.thresholds import get_enable_turn_embeddings
 
 logger = get_logger(__name__)
 
@@ -46,6 +47,7 @@ def _init_base_config(
         embedding_dimensions: Optional[int],
         chat_deployment_name: str,
         use_default_credential: bool,
+        enable_turn_embeddings: Optional[bool] = None,
         default_credential_module: str = "azure.identity",
     ) -> None:
         """Initialize shared local state, config values, and default credentials."""
@@ -76,6 +78,9 @@ def _init_base_config(
         self._embedding_deployment_name = embedding_deployment_name
         self._embedding_dimensions = _resolve_embedding_dimensions(embedding_dimensions)
         self._chat_deployment_name = chat_deployment_name
+        self._enable_turn_embeddings = (
+            enable_turn_embeddings if enable_turn_embeddings is not None else get_enable_turn_embeddings()
+        )
 
         self._owns_cosmos_credential = False
         self._owns_ai_foundry_credential = False
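`_init_base_config` resolves the flag with `is not None` rather than truthiness, so an explicit `enable_turn_embeddings=False` overrides `ENABLE_TURN_EMBEDDINGS=true` in the environment. A minimal sketch of that precedence follows. `get_enable_turn_embeddings()` itself is not shown in this diff, so the `env_flag` parser and the truthy spellings it accepts are assumptions.

```python
import os
from collections.abc import Mapping
from typing import Optional

# Assumed spellings; the real get_enable_turn_embeddings() may accept a different set.
_TRUTHY = frozenset({"1", "true", "yes", "on"})


def env_flag(name: str, env: Optional[Mapping[str, str]] = None, default: bool = False) -> bool:
    """Hypothetical boolean env parser: unset or blank returns `default`."""
    source = os.environ if env is None else env
    raw = (source.get(name) or "").strip().lower()
    if not raw:
        return default
    return raw in _TRUTHY


def resolve_enable_turn_embeddings(
    explicit: Optional[bool], env: Optional[Mapping[str, str]] = None
) -> bool:
    # `is not None`, not truthiness: an explicit False must not fall through to the env var.
    return explicit if explicit is not None else env_flag("ENABLE_TURN_EMBEDDINGS", env)


print(resolve_enable_turn_embeddings(None, {}))  # False
print(resolve_enable_turn_embeddings(None, {"ENABLE_TURN_EMBEDDINGS": "true"}))  # True
print(resolve_enable_turn_embeddings(False, {"ENABLE_TURN_EMBEDDINGS": "true"}))  # False
```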

diff --git a/azure/cosmos/agent_memory/_utils.py b/azure/cosmos/agent_memory/_utils.py
@@ -146,7 +146,8 @@ def _resolve_embedding_dimensions(val: Optional[int]) -> int:
     """Resolve embedding dimensions from explicit value or ``AI_FOUNDRY_EMBEDDING_DIMENSIONS`` env var.
 
     Defaults to 1536 (the dimension we ship with for ``text-embedding-3-large``
-    truncated to 1536, which is the size DiskANN is tuned for in our containers).
+    truncated to 1536, which is the size the quantizedFlat vector indexes in our
+    containers are tuned for).
 
     Raises :class:`ConfigurationError` if the env var is set but cannot be
     parsed as a positive integer.
@@ -259,13 +260,15 @@ def _resolve_distance_function(val: Optional[str]) -> str:
 def _resolve_vector_index_type(val: Optional[str]) -> str:
     """Resolve vector index type from explicit value or ``AI_FOUNDRY_EMBEDDING_VECTOR_INDEX_TYPE`` env var.
 
-    Defaults to ``diskANN``. Raises :class:`ConfigurationError` for unknown values.
+    Defaults to ``quantizedFlat``. Raises :class:`ConfigurationError` for unknown values.
 
-    ``diskANN`` requires the Cosmos DB account to have the DiskANN vector index
-    capability enabled. Accounts that do not (for example the classic Cosmos DB
-    emulator) can use ``quantizedFlat`` or ``flat`` instead.
+    ``quantizedFlat`` works on any Cosmos DB account (including the classic
+    emulator). ``diskANN`` requires the Cosmos DB account to have the DiskANN
+    vector index capability enabled; opt into it explicitly when available.
     """
-    raw = (val if val is not None else os.environ.get("AI_FOUNDRY_EMBEDDING_VECTOR_INDEX_TYPE") or "diskANN").strip()
+    raw = (
+        val if val is not None else os.environ.get("AI_FOUNDRY_EMBEDDING_VECTOR_INDEX_TYPE") or "quantizedFlat"
+    ).strip()
     if raw not in _ALLOWED_VECTOR_INDEX_TYPES:
         raise ConfigurationError(
             message=(
@@ -434,9 +437,15 @@ def _container_policies(
     embedding_data_type: str,
     distance_function: str,
     full_text_language: str,
-    vector_index_type: str = "diskANN",
+    include_salience_composite: bool = True,
+    vector_index_type: str = "quantizedFlat",
 ) -> tuple[dict, dict, dict]:
-    """Build the vector, indexing, and full-text policies for container creation."""
+    """Build the vector, indexing, and full-text policies for container creation.
+
+    ``include_salience_composite`` adds the ``(salience, created_at, id)``
+    composite index required by procedural synthesis on the MEMORIES container.
+    Turns reuse this builder with it disabled (turns are never synthesized).
+    """
     vector_embedding_policy = {
         "vectorEmbeddings": [
             {
@@ -451,25 +460,27 @@ def _container_policies(
     indexing_policy = {
         "includedPaths": [{"path": "/*"}],
         "excludedPaths": [
-            {"path": "/embedding/*"},
             {"path": "/source_memory_ids/*"},
             {"path": "/supersedes_ids/*"},
+            {"path": '/"_etag"/?'},
         ],
         "vectorIndexes": [{"path": "/embedding", "type": vector_index_type}],
         "fullTextIndexes": [{"path": "/content"}],
+    }
+
+    if include_salience_composite:
         # Procedural synthesis selects TOP N by (salience DESC, created_at ASC, id ASC).
         # Cosmos requires a composite index for multi-property ORDER BY; without it the
         # query returns a non-deterministic 50 of N when many docs share the default
         # salience (0.5), which makes the source-id short-circuit in synthesize_procedural
         # thrash and burn LLM calls on every reconcile.
-        "compositeIndexes": [
+        indexing_policy["compositeIndexes"] = [
             [
                 {"path": "/salience", "order": "descending"},
                 {"path": "/created_at", "order": "ascending"},
                 {"path": "/id", "order": "ascending"},
             ]
-        ],
-    }
+        ]
 
     full_text_policy = {
         "defaultLanguage": full_text_language,
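After this change the same builder serves both containers: memories get the `(salience, created_at, id)` composite index, turns do not, and both get a vector index on `/embedding`. A hypothetical standalone version follows so the shape of the returned policies can be inspected. The indexing policy mirrors the `_container_policies` lines in this diff; the body of the `vectorEmbeddings` entry and the `fullTextPaths` list are elided here, so those use the general Cosmos DB for NoSQL policy shape and are assumptions, as is the function name.

```python
from typing import Any


def build_container_policies_sketch(
    embedding_dimensions: int,
    embedding_data_type: str = "float32",
    distance_function: str = "cosine",
    full_text_language: str = "en-US",
    include_salience_composite: bool = True,
    vector_index_type: str = "quantizedFlat",
) -> tuple[dict[str, Any], dict[str, Any], dict[str, Any]]:
    """Hypothetical stand-in for _container_policies: (vector, indexing, full-text) policies."""
    # Assumed entry shape: the diff elides the body of this dict.
    vector_embedding_policy = {
        "vectorEmbeddings": [
            {
                "path": "/embedding",
                "dataType": embedding_data_type,
                "dimensions": embedding_dimensions,
                "distanceFunction": distance_function,
            }
        ]
    }
    indexing_policy: dict[str, Any] = {
        "includedPaths": [{"path": "/*"}],
        "excludedPaths": [
            {"path": "/source_memory_ids/*"},
            {"path": "/supersedes_ids/*"},
            {"path": '/"_etag"/?'},
        ],
        "vectorIndexes": [{"path": "/embedding", "type": vector_index_type}],
        "fullTextIndexes": [{"path": "/content"}],
    }
    if include_salience_composite:
        # Backs ORDER BY c.salience DESC, c.created_at ASC, c.id ASC for procedural synthesis.
        indexing_policy["compositeIndexes"] = [
            [
                {"path": "/salience", "order": "descending"},
                {"path": "/created_at", "order": "ascending"},
                {"path": "/id", "order": "ascending"},
            ]
        ]
    # Assumed shape: the diff is cut off after "defaultLanguage".
    full_text_policy = {
        "defaultLanguage": full_text_language,
        "fullTextPaths": [{"path": "/content", "language": full_text_language}],
    }
    return vector_embedding_policy, indexing_policy, full_text_policy


# Memories keep the composite index; turns reuse the builder without it.
_, memories_indexing, _ = build_container_policies_sketch(1536)
_, turns_indexing, _ = build_container_policies_sketch(1536, include_salience_composite=False)
print("compositeIndexes" in memories_indexing, "compositeIndexes" in turns_indexing)  # True False
```

Keeping the composite index conditional presumably also spares turn writes the index maintenance for an ordering that only procedural synthesis uses.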