LlamaIndex

Overview

LlamaIndex (formerly GPT Index) is an open-source Python framework specialized in data ingestion, indexing, and retrieval for building knowledge-augmented LLM applications. Originally launched in 2022, the framework focuses on connecting LLMs with private or domain-specific data through Retrieval-Augmented Generation (RAG). LlamaIndex addresses the problem of LLMs lacking access to custom data by providing comprehensive tooling for data loading, transformation, indexing, and intelligent retrieval.

The framework differs from general-purpose orchestration tools (LangChain, CrewAI) by specializing in the data ingestion and retrieval pipeline. While it includes agent capabilities, LlamaIndex’s core strength lies in transforming unstructured data into queryable indexes with sophisticated retrieval strategies. The architecture centers on data connectors, indexes, retrievers, and query engines working together to enable semantic search over private data.

Key technical components covered:

RAG architecture and data pipeline
Data connectors and document loading
Indexing strategies and vector stores
Node parsers and chunking strategies
Query engines and retrieval mechanisms
Response synthesis strategies
Agents and tool integration
Workflows for orchestration
Chat engines and context management
Observability and tracing
Version history and ecosystem evolution

RAG Architecture and Data Pipeline

LlamaIndex implements Retrieval-Augmented Generation through a four-stage pipeline:

Stage 1: Indexing (offline)

Data ingestion through connectors:

from llama_index.core import SimpleDirectoryReader
 
# Load documents from directory
documents = SimpleDirectoryReader(
    input_dir="./data",
    recursive=True,
    required_exts=[".pdf", ".txt", ".docx"]
).load_data()
 
# Or from specific sources
from llama_index.readers.database import DatabaseReader
 
db_reader = DatabaseReader(
    uri="postgresql://localhost/mydb"  # connection strings go in `uri`; `sql_database` expects a SQLDatabase object
)
documents = db_reader.load_data(query="SELECT * FROM articles")

Document transformation into nodes (chunks):

from llama_index.core.node_parser import SentenceSplitter
 
parser = SentenceSplitter(
    chunk_size=512,
    chunk_overlap=50
)
nodes = parser.get_nodes_from_documents(documents)

Embedding generation converts text to vectors:

from llama_index.embeddings.openai import OpenAIEmbedding
 
embed_model = OpenAIEmbedding(model="text-embedding-3-small")
 
for node in nodes:
    node.embedding = embed_model.get_text_embedding(node.text)

Vector storage in database:

from llama_index.core import VectorStoreIndex
from llama_index.vector_stores.chroma import ChromaVectorStore
import chromadb
 
chroma_client = chromadb.PersistentClient(path="./chroma_db")
# get_or_create_collection avoids an error when the collection already exists
chroma_collection = chroma_client.get_or_create_collection("my_collection")
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
 
# Create index with custom vector store
from llama_index.core import StorageContext
 
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex(nodes, storage_context=storage_context)

Stage 2: Retrieval (online)

The user query is converted to an embedding and searched against the vector database:

# Semantic search
query_engine = index.as_query_engine(similarity_top_k=5)
 
# Query executed internally:
# 1. Query text → embedding
# 2. Vector similarity search
# 3. Return top-k most similar nodes
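
The three internal steps can be sketched without any library. This is a minimal illustration, not LlamaIndex's implementation: the two-dimensional vectors are made-up values standing in for real embeddings, and `cosine` and `top_k` are hypothetical helper names.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec, node_vecs, k):
    """Return (node_id, score) pairs for the k most similar nodes."""
    scored = [(node_id, cosine(query_vec, vec)) for node_id, vec in node_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Toy 2-D embeddings (hypothetical values, for illustration only)
node_vecs = {"returns": [1.0, 0.0], "shipping": [0.0, 1.0], "refunds": [0.8, 0.6]}
results = top_k([1.0, 0.0], node_vecs, k=2)
print(results)
```

A production vector store replaces the linear scan with an approximate nearest-neighbour index, but the ranking principle is the same.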

Stage 3: Augmentation

The retrieved documents are combined with the user query to form an enriched prompt:

User Query: "What is the company's return policy?"

Retrieved Context:
- Document 1 (score: 0.92): "Our return policy allows..."
- Document 2 (score: 0.87): "Returns must be made within..."
- Document 3 (score: 0.81): "Refunds are processed..."

Augmented Prompt sent to LLM:
"Given this context: [retrieved documents]
Answer the question: What is the company's return policy?"
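
The augmentation step is essentially string assembly. The sketch below assumes the simple template shown above; `build_prompt` is a hypothetical helper, and LlamaIndex's real prompt templates are more elaborate.

```python
def build_prompt(query, retrieved):
    """Join retrieved (text, score) pairs into a context block and append the question."""
    context = "\n".join(f"- {text}" for text, _score in retrieved)
    return f"Given this context:\n{context}\nAnswer the question: {query}"

prompt = build_prompt(
    "What is the company's return policy?",
    [("Our return policy allows...", 0.92), ("Returns must be made within...", 0.87)],
)
print(prompt)
```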

Stage 4: Generation

The LLM generates a response from the augmented prompt, so the answer is grounded in the provided context rather than in the model's parametric knowledge alone.

Complete RAG example:

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
 
# Index phase (run once)
documents = SimpleDirectoryReader("./company_docs").load_data()
index = VectorStoreIndex.from_documents(documents)
 
# Save index for reuse
index.storage_context.persist(persist_dir="./storage")
 
# Query phase (run many times)
from llama_index.core import StorageContext, load_index_from_storage
 
storage_context = StorageContext.from_defaults(persist_dir="./storage")
index = load_index_from_storage(storage_context)
 
query_engine = index.as_query_engine()
response = query_engine.query("What is the return policy?")
print(response)

Data Connectors and Document Loading

LlamaHub provides 100+ data connectors for diverse sources:

File-based connectors:

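from llama_index.core import SimpleDirectoryReader
from llama_index.readers.file import DocxReader, MarkdownReader, PDFReader

# Load multiple file types; file_extractor maps extensions to reader instances
documents = SimpleDirectoryReader(
    "./data",
    file_extractor={
        ".pdf": PDFReader(),
        ".docx": DocxReader(),
        ".md": MarkdownReader()
    }
).load_data()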

Database connectors:

from llama_index.readers.database import DatabaseReader
 
# PostgreSQL
db_reader = DatabaseReader(
    uri="postgresql://user:pass@localhost/db"  # connection strings go in `uri`
)
docs = db_reader.load_data(query="SELECT content FROM articles")
 
# MongoDB
from llama_index.readers.mongodb import SimpleMongoReader
 
mongo_reader = SimpleMongoReader(
    host="localhost",
    port=27017
)
docs = mongo_reader.load_data(
    db_name="mydb",
    collection_name="documents"
)

Web and API connectors:

# Notion
from llama_index.readers.notion import NotionPageReader
 
notion_reader = NotionPageReader(integration_token="secret_...")
docs = notion_reader.load_data(page_ids=["page-id-1", "page-id-2"])
 
# Google Docs
from llama_index.readers.google import GoogleDocsReader
 
google_reader = GoogleDocsReader()
docs = google_reader.load_data(document_ids=["doc-id-1"])
 
# Slack
from llama_index.readers.slack import SlackReader
 
slack_reader = SlackReader(slack_token="xoxb-...")
docs = slack_reader.load_data(channel_ids=["C123456"])

Document structure:

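from typing import Any, Dict, List, Optional

# Simplified view of the main Document fields (not the full class definition)
class Document:
    text: str                    # Main content
    metadata: Dict[str, Any]     # Source, author, date, etc.
    doc_id: str                  # Unique identifier
    embedding: Optional[List[float]]  # Vector embedding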

Metadata enrichment:

documents = SimpleDirectoryReader("./data").load_data()
 
for doc in documents:
    doc.metadata.update({
        "source": "internal_docs",
        "department": "engineering",
        "last_updated": "2025-01-15"
    })

Metadata enables filtering during retrieval:

from llama_index.core.vector_stores import MetadataFilters, ExactMatchFilter
 
filters = MetadataFilters(filters=[
    ExactMatchFilter(key="department", value="engineering")
])
 
retriever = index.as_retriever(filters=filters)

Node Parsers and Chunking Strategies

Node parsers transform documents into chunks (nodes) optimized for retrieval:

SentenceSplitter (default) splits at sentence boundaries:

from llama_index.core.node_parser import SentenceSplitter
 
parser = SentenceSplitter(
    chunk_size=512,      # Tokens per chunk
    chunk_overlap=50,    # Overlap between chunks
    separator=" "        # Split on spaces when needed
)
 
nodes = parser.get_nodes_from_documents(documents)

SentenceWindowNodeParser includes context windows:

from llama_index.core.node_parser import SentenceWindowNodeParser
 
parser = SentenceWindowNodeParser.from_defaults(
    window_size=3,           # Include 3 sentences before and after
    window_metadata_key="window",
    original_text_metadata_key="original_sentence"
)
 
nodes = parser.get_nodes_from_documents(documents)
 
# Each node contains:
# - Core sentence for embedding
# - Window of surrounding sentences in metadata
# - Enables precise retrieval with broader context

Use case: Retrieve specific sentence but provide LLM with surrounding context for better understanding.
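
To make the idea concrete, here is a library-free sketch of sentence windows. It assumes a naive regex sentence splitter and a hypothetical `sentence_windows` helper; the real parser uses its own tokenizer and stores the window in node metadata.

```python
import re

def sentence_windows(text, window_size=1):
    """Pair each sentence with the sentences within window_size positions of it."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    records = []
    for i, sentence in enumerate(sentences):
        start = max(0, i - window_size)
        end = min(len(sentences), i + window_size + 1)
        records.append({"sentence": sentence, "window": " ".join(sentences[start:end])})
    return records

records = sentence_windows("Returns are free. Ship within 30 days. Refunds take a week.")
for record in records:
    print(record)
```

Only `sentence` would be embedded; `window` is what gets handed to the LLM after retrieval.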

HierarchicalNodeParser creates multi-level hierarchy:

from llama_index.core.node_parser import HierarchicalNodeParser
 
parser = HierarchicalNodeParser.from_defaults(
    chunk_sizes=[2048, 512, 128]  # Three levels
)
 
nodes = parser.get_nodes_from_documents(documents)
 
# Creates structure:
# Level 1: 2048-token chunks (coarse overview)
# Level 2: 512-token chunks (detailed sections)
# Level 3: 128-token chunks (fine-grained details)
 
# Retrieval can start coarse and drill down

Use case: Long documents where hierarchical retrieval (summary → detail) improves relevance.

TokenTextSplitter splits by token count:

from llama_index.core.node_parser import TokenTextSplitter
 
parser = TokenTextSplitter(
    chunk_size=512,
    chunk_overlap=20,
    separator=" "
)
 
nodes = parser.get_nodes_from_documents(documents)
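
The interaction between `chunk_size` and `chunk_overlap` is easiest to see on a tiny input. The following is a sketch of a sliding-window splitter under the assumption that whitespace-separated words stand in for tokens; `split_with_overlap` is a hypothetical name, not a LlamaIndex function.

```python
def split_with_overlap(tokens, chunk_size, chunk_overlap):
    """Slide a window of chunk_size tokens, stepping by chunk_size - chunk_overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end
    return chunks

tokens = "a b c d e f g h i j".split()  # whitespace "tokens" stand in for a real tokenizer
chunks = split_with_overlap(tokens, chunk_size=4, chunk_overlap=1)
print(chunks)
```

Each chunk repeats the last token of the previous one, which is the overlap that keeps a sentence cut at a boundary visible in both neighbours.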

SemanticSplitterNodeParser splits at semantic boundaries:

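from llama_index.core.node_parser import SemanticSplitterNodeParser
from llama_index.embeddings.openai import OpenAIEmbedding

parser = SemanticSplitterNodeParser(
    buffer_size=1,           # Sentences to group
    breakpoint_percentile_threshold=95,  # Semantic shift threshold
    embed_model=OpenAIEmbedding()  # Required: embeddings drive the split decisions
)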
 
nodes = parser.get_nodes_from_documents(documents)
 
# Identifies topic shifts and splits there
# Produces variable-length chunks aligned with meaning
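
The percentile threshold can be illustrated with a small sketch: compute the distance between each pair of adjacent sentence embeddings and start a new chunk wherever the distance exceeds the chosen percentile of all distances. The embeddings here are hand-written 2-D vectors and the helper names are hypothetical; this is an approximation of the idea, not the parser's actual code.

```python
import math

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def percentile(values, pct):
    """Linear-interpolated percentile, pct in [0, 100]."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * pct / 100
    low, high = math.floor(pos), math.ceil(pos)
    return ordered[low] + (ordered[high] - ordered[low]) * (pos - low)

def semantic_groups(sentences, embeddings, threshold_pct=95):
    """Start a new group wherever the adjacent distance exceeds the percentile cutoff."""
    if len(sentences) < 2:
        return [list(sentences)]
    distances = [
        cosine_distance(embeddings[i], embeddings[i + 1])
        for i in range(len(embeddings) - 1)
    ]
    cutoff = percentile(distances, threshold_pct)
    groups, current = [], [sentences[0]]
    for sentence, dist in zip(sentences[1:], distances):
        if dist > cutoff:
            groups.append(current)
            current = []
        current.append(sentence)
    groups.append(current)
    return groups

sentences = ["Returns are free.", "Refunds take a week.", "We ship worldwide.", "Delivery takes 5 days."]
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]  # made-up vectors
groups = semantic_groups(sentences, embeddings)
print(groups)
```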

Chunking strategy trade-offs:

Small chunks (128-256 tokens):

✅ Precise retrieval (exact relevant passage)
✅ Less noise in context
❌ May miss surrounding context
❌ More chunks = more embeddings to store

Large chunks (1024-2048 tokens):

✅ Broader context included
✅ Fewer chunks to manage
❌ Less precise retrieval
❌ May exceed context window when many retrieved

A common starting range is 256-512 tokens; the best value depends on the documents and queries, so validate it against your own retrieval evaluation.
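
The storage side of this trade-off is simple arithmetic. As a rough sketch, assuming a sliding-window splitter and a hypothetical 100,000-token corpus, the number of chunks (and therefore embeddings) is:

```python
import math

def chunk_count(total_tokens, chunk_size, chunk_overlap):
    """Number of sliding-window chunks needed to cover total_tokens."""
    if total_tokens <= chunk_size:
        return 1
    step = chunk_size - chunk_overlap
    return 1 + math.ceil((total_tokens - chunk_size) / step)

small = chunk_count(100_000, chunk_size=256, chunk_overlap=50)
large = chunk_count(100_000, chunk_size=1024, chunk_overlap=50)
print(small, large)
```

Under these assumptions the 256-token setting produces roughly five times as many embeddings to compute and store as the 1024-token setting.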

Query Engines and Retrieval Mechanisms

Query engines orchestrate retrieval and response generation:

Basic vector retrieval:

from llama_index.core import VectorStoreIndex
 
index = VectorStoreIndex.from_documents(documents)
 
query_engine = index.as_query_engine(
    similarity_top_k=3,        # Retrieve top 3 most similar
    response_mode="compact"    # Synthesis strategy
)
 
response = query_engine.query("What is the return policy?")
print(response)

Custom retriever with advanced strategies:

from llama_index.core.retrievers import VectorIndexRetriever
from llama_index.core import QueryBundle
 
retriever = VectorIndexRetriever(
    index=index,
    similarity_top_k=5,
    vector_store_query_mode="default"  # or "mmr", "hybrid"
)
 
query_bundle = QueryBundle(query_str="return policy")
nodes = retriever.retrieve(query_bundle)
 
for node in nodes:
    print(f"Score: {node.score:.3f}")
    print(f"Text: {node.text[:200]}...")

Hybrid retrieval combining vector and keyword search:

from llama_index.core import get_response_synthesizer
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.retrievers import (
    BaseRetriever,
    KeywordTableSimpleRetriever,
    VectorIndexRetriever,
)

class HybridRetriever(BaseRetriever):
    def __init__(self, vector_retriever, keyword_retriever, mode="OR"):
        if mode not in ("AND", "OR"):
            raise ValueError("mode must be 'AND' or 'OR'")
        self.vector_retriever = vector_retriever
        self.keyword_retriever = keyword_retriever
        self.mode = mode
        super().__init__()  # initialize BaseRetriever state
    
    def _retrieve(self, query_bundle):
        vector_nodes = self.vector_retriever.retrieve(query_bundle)
        keyword_nodes = self.keyword_retriever.retrieve(query_bundle)
        
        if self.mode == "AND":
            # Return nodes appearing in both
            vector_ids = {n.node_id for n in vector_nodes}
            return [n for n in keyword_nodes if n.node_id in vector_ids]
        else:  # OR mode
            # Merge and deduplicate
            all_nodes = {n.node_id: n for n in vector_nodes + keyword_nodes}
            return list(all_nodes.values())

# Use hybrid retriever (vector_index and keyword_index are assumed to be a
# VectorStoreIndex and a SimpleKeywordTableIndex built over the same nodes)
vector_ret = VectorIndexRetriever(index=vector_index, similarity_top_k=5)
keyword_ret = KeywordTableSimpleRetriever(index=keyword_index)

hybrid_retriever = HybridRetriever(vector_ret, keyword_ret, mode="OR")

query_engine = RetrieverQueryEngine(
    retriever=hybrid_retriever,
    response_synthesizer=get_response_synthesizer()
)

Advanced retrieval strategies:

MMR (Maximal Marginal Relevance): Balances relevance and diversity:

retriever = VectorIndexRetriever(
    index=index,
    similarity_top_k=10,
    vector_store_query_mode="mmr",
    vector_store_kwargs={
        "mmr_threshold": 0.5  # Balance relevance vs diversity
    }
)
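
How the threshold trades relevance against diversity can be shown with a greedy MMR sketch. It assumes the common formulation score = t * relevance - (1 - t) * redundancy, with made-up 2-D vectors and a hypothetical `mmr_select` helper; individual vector stores may implement the details differently.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def mmr_select(query_vec, candidates, k, mmr_threshold=0.5):
    """Greedy MMR over {node_id: vector}; returns node ids in selection order."""
    selected = []
    remaining = set(candidates)
    while remaining and len(selected) < k:
        def mmr_score(node_id):
            relevance = cosine(query_vec, candidates[node_id])
            # Redundancy: highest similarity to anything already selected
            redundancy = max(
                (cosine(candidates[node_id], candidates[s]) for s in selected),
                default=0.0,
            )
            return mmr_threshold * relevance - (1 - mmr_threshold) * redundancy
        best = max(sorted(remaining), key=mmr_score)  # sorted() keeps ties deterministic
        selected.append(best)
        remaining.remove(best)
    return selected

candidates = {"policy": [1.0, 0.0], "policy_copy": [0.98, 0.2], "shipping": [0.0, 1.0]}
query = [0.8, 0.6]
diverse = mmr_select(query, candidates, k=2, mmr_threshold=0.5)
relevance_only = mmr_select(query, candidates, k=2, mmr_threshold=1.0)
print(diverse, relevance_only)
```

With the threshold at 1.0 the near-duplicate `policy` node is returned second; at 0.5 it is skipped in favour of the less similar but non-redundant `shipping` node.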

Metadata filtering:

from llama_index.core.vector_stores import MetadataFilters, ExactMatchFilter
 
filters = MetadataFilters(filters=[
    ExactMatchFilter(key="department", value="HR"),
    ExactMatchFilter(key="year", value="2024")
])
 
query_engine = index.as_query_engine(
    filters=filters,
    similarity_top_k=5
)
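
Multiple exact-match filters combine with AND semantics by default. The sketch below mimics that behaviour over plain dictionaries (the `exact_match` helper and node layout are hypothetical) and shows one practical pitfall: in this sketch, as in many stores, the string "2024" does not match the integer 2024, so filter values should use the same type as the stored metadata.

```python
def exact_match(nodes, filters):
    """Keep nodes whose metadata equals every filter value (AND semantics)."""
    return [
        node for node in nodes
        if all(node["metadata"].get(key) == value for key, value in filters.items())
    ]

nodes = [
    {"id": "n1", "metadata": {"department": "HR", "year": "2024"}},
    {"id": "n2", "metadata": {"department": "HR", "year": 2024}},  # int, not str
    {"id": "n3", "metadata": {"department": "engineering", "year": "2024"}},
]
hits = exact_match(nodes, {"department": "HR", "year": "2024"})
print([node["id"] for node in hits])
```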

Reranking improves retrieval quality:

from llama_index.postprocessor.cohere_rerank import CohereRerank
 
reranker = CohereRerank(api_key="...", top_n=3)
 
query_engine = index.as_query_engine(
    similarity_top_k=10,          # Initial retrieval
    node_postprocessors=[reranker]  # Rerank to top 3
)

Response Synthesis Strategies

Response synthesizers determine how retrieved chunks are used to generate answers:

Refine strategy iteratively improves answer:

from llama_index.core import get_response_synthesizer
 
synthesizer = get_response_synthesizer(response_mode="refine")
 
# Process:
# 1. Generate initial answer from chunk 1 + query
# 2. Refine answer with chunk 2 + previous answer + query
# 3. Refine again with chunk 3 + refined answer + query
# 4. Continue through all retrieved chunks
 
query_engine = index.as_query_engine(
    response_synthesizer=synthesizer,
    similarity_top_k=5
)

Advantages: Comprehensive answers incorporating all retrieved context. Disadvantages: Multiple LLM calls (one per chunk), higher cost and latency.

Compact strategy consolidates before refinement:

synthesizer = get_response_synthesizer(response_mode="compact")
 
# Process:
# 1. Concatenate chunks to fill context window
# 2. Generate answer from concatenated context
# 3. Refine if context exceeds single call
 
# Fewer LLM calls than pure refine

Tree Summarize builds hierarchical summary:

synthesizer = get_response_synthesizer(response_mode="tree_summarize")
 
# Process (bottom-up tree):
# 1. Pack chunks into as few prompts as fit the context window; summarize each
# 2. Summarize summaries recursively
# 3. Root node = final answer
 
# Efficient for large document sets

Accumulate generates separate answers:

synthesizer = get_response_synthesizer(response_mode="accumulate")
 
# Process:
# 1. Generate answer from chunk 1
# 2. Generate answer from chunk 2
# 3. Generate answer from chunk 3
# 4. Concatenate all answers
 
# Useful for multi-perspective responses

Compact Accumulate optimizes accumulation:

synthesizer = get_response_synthesizer(response_mode="compact_accumulate")
 
# Consolidates chunks first, then accumulates
# Fewer LLM calls than pure accumulate

Strategy selection guide:

Simple queries: Use compact (fast, efficient)
Complex analysis: Use refine (comprehensive)
Summarization: Use tree_summarize (hierarchical)
Multiple perspectives: Use accumulate (separate answers)
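
The cost difference between refine and compact comes down to how many LLM calls each needs. A rough sketch, assuming ten hypothetical 512-token chunks and a 3000-token budget for context per prompt (the `pack` and `llm_calls` helpers are illustrative, not library functions):

```python
def pack(chunk_tokens, budget):
    """Greedily pack consecutive chunks into prompts whose token total fits the budget."""
    groups, current = [], 0
    for size in chunk_tokens:
        if current and current + size > budget:
            groups.append(current)
            current = 0
        current += size
    if current:
        groups.append(current)
    return groups

def llm_calls(chunk_tokens, budget, mode):
    if mode == "refine":
        return len(chunk_tokens)                 # one call per retrieved chunk
    if mode == "compact":
        return len(pack(chunk_tokens, budget))   # one call per packed prompt
    raise ValueError(f"unsupported mode: {mode}")

chunks = [512] * 10  # ten retrieved chunks of 512 tokens (hypothetical)
refine_calls = llm_calls(chunks, budget=3000, mode="refine")
compact_calls = llm_calls(chunks, budget=3000, mode="compact")
print(refine_calls, compact_calls)
```

Under these assumptions compact needs two calls where refine needs ten, which is why compact is the usual choice for simple queries.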

Agents and Tool Integration

LlamaIndex implements agents as LLM-powered systems using tools to accomplish tasks:

Function tool creation:

from llama_index.core.tools import FunctionTool
 
def multiply(a: int, b: int) -> int:
    """Multiply two integers and return the result"""
    return a * b
 
def add(a: int, b: int) -> int:
    """Add two integers and return the result"""
    return a + b
 
multiply_tool = FunctionTool.from_defaults(fn=multiply)
add_tool = FunctionTool.from_defaults(fn=add)

Query engine as tool (RAG as a tool):

from llama_index.core.tools import QueryEngineTool
 
# Create query engine from index
query_engine = index.as_query_engine()
 
# Wrap as tool
query_tool = QueryEngineTool.from_defaults(
    query_engine=query_engine,
    name="company_knowledge_base",
    description="Useful for answering questions about company policies, procedures, and documentation"
)

Agent creation with tools:

from llama_index.llms.openai import OpenAI
from llama_index.core.agent import FunctionCallingAgent
 
llm = OpenAI(model="gpt-4")
 
agent = FunctionCallingAgent.from_tools(
    tools=[multiply_tool, add_tool, query_tool],
    llm=llm,
    verbose=True,
    system_prompt="""You are a helpful assistant that can perform calculations 
    and answer questions about company policies."""
)
 
response = agent.chat("What is (121 * 3) + 42? Also, what's our vacation policy?")
print(response)

Agent execution flow:

Receive user query
Analyze which tools might be helpful
Generate function calls with parameters
Execute tools in sequence or parallel
Incorporate tool results into reasoning
Generate final response
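
The flow above can be sketched as a small dispatch loop. In a real agent the LLM produces the tool calls; here the plan is hard-coded, and `run_agent`, `TOOLS`, and the `"$prev"` placeholder are hypothetical conventions used only for illustration.

```python
def multiply(a: int, b: int) -> int:
    """Multiply two integers and return the result"""
    return a * b

def add(a: int, b: int) -> int:
    """Add two integers and return the result"""
    return a + b

TOOLS = {"multiply": multiply, "add": add}

def run_agent(planned_calls):
    """Execute planned tool calls in order; "$prev" refers to the previous result."""
    result = None
    trace = []
    for name, args in planned_calls:
        resolved = [result if arg == "$prev" else arg for arg in args]
        result = TOOLS[name](*resolved)
        trace.append((name, resolved, result))
    return result, trace

# A real agent would obtain this plan from the LLM; here it is hard-coded.
answer, trace = run_agent([("multiply", [121, 3]), ("add", ["$prev", 42])])
print(answer)
```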

Parallel function calling (OpenAI 1.1.0+):

from llama_index.agent.openai import OpenAIAgent
 
agent = OpenAIAgent.from_tools(
    [add_tool, multiply_tool],
    llm=OpenAI(model="gpt-4"),
    verbose=True
)
 
# Agent can call multiple tools simultaneously
response = agent.chat("Calculate both 5 + 3 and 7 * 9")
# Executes add(5, 3) and multiply(7, 9) in parallel

Multi-document agent with tool per data source:

# Create separate indexes for different document types
policy_index = VectorStoreIndex.from_documents(policy_docs)
handbook_index = VectorStoreIndex.from_documents(handbook_docs)
faq_index = VectorStoreIndex.from_documents(faq_docs)
 
# Create tools from each index
policy_tool = QueryEngineTool.from_defaults(
    query_engine=policy_index.as_query_engine(),
    name="policy_search",
    description="Search company policies and procedures"
)
 
handbook_tool = QueryEngineTool.from_defaults(
    query_engine=handbook_index.as_query_engine(),
    name="handbook_search",
    description="Search employee handbook"
)
 
faq_tool = QueryEngineTool.from_defaults(
    query_engine=faq_index.as_query_engine(),
    name="faq_search",
    description="Search frequently asked questions"
)
 
# Agent intelligently selects appropriate tool
agent = FunctionCallingAgent.from_tools(
    [policy_tool, handbook_tool, faq_tool],
    llm=llm
)

Workflows for Orchestration

Workflows (introduced August 2024) provide event-driven orchestration for complex multi-step processes:

Workflow structure:

from llama_index.core.workflow import (
    Context, Event, StartEvent, StopEvent, Workflow, step,
)

# Custom events carry data between steps
class IngestEvent(Event):
    nodes: list

class EmbedEvent(Event):
    nodes: list

class RAGWorkflow(Workflow):
    # self.parser, self.embed_model and self.index are assumed to be
    # assigned in __init__ (omitted here for brevity)

    @step
    async def ingest(self, ctx: Context, event: StartEvent) -> IngestEvent:
        """Step 1: Ingest and chunk documents"""
        documents = event.documents
        nodes = self.parser.get_nodes_from_documents(documents)
        return IngestEvent(nodes=nodes)
    
    @step
    async def embed(self, ctx: Context, event: IngestEvent) -> EmbedEvent:
        """Step 2: Generate embeddings"""
        for node in event.nodes:
            node.embedding = await self.embed_model.aget_text_embedding(node.text)
        return EmbedEvent(nodes=event.nodes)
    
    @step
    async def store(self, ctx: Context, event: EmbedEvent) -> StopEvent:
        """Step 3: Store in vector database"""
        self.index.insert_nodes(event.nodes)
        return StopEvent(result=f"Indexed {len(event.nodes)} nodes")

# Run workflow (inside an async function or a notebook cell)
workflow = RAGWorkflow()
result = await workflow.run(documents=documents)

Complex workflow with branching:

# TextQueryEvent, ImageQueryEvent and RetrievalEvent are Event subclasses
# (definitions omitted); Context is imported from llama_index.core.workflow
class MultiModalWorkflow(Workflow):
    @step
    async def classify_input(
        self, ctx: Context, event: StartEvent
    ) -> TextQueryEvent | ImageQueryEvent:
        """Determine if input is text or image query"""
        if event.has_image:
            return ImageQueryEvent(image=event.image, query=event.query)
        else:
            return TextQueryEvent(query=event.query)
    
    @step
    async def process_text(self, ctx: Context, event: TextQueryEvent) -> RetrievalEvent:
        """Handle text-only queries"""
        nodes = self.text_retriever.retrieve(event.query)
        return RetrievalEvent(nodes=nodes, query=event.query)
    
    @step
    async def process_image(self, ctx: Context, event: ImageQueryEvent) -> RetrievalEvent:
        """Handle image + text queries"""
        # Use multimodal model for retrieval (image_retriever is a hypothetical
        # retriever that accepts both a query and an image)
        nodes = self.image_retriever.retrieve(event.query, event.image)
        return RetrievalEvent(nodes=nodes, query=event.query)
    
    @step
    async def synthesize(self, ctx: Context, event: RetrievalEvent) -> StopEvent:
        """Generate final response"""
        response = self.synthesizer.synthesize(
            query=event.query,
            nodes=event.nodes
        )
        return StopEvent(result=response)
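
The routing principle behind branching workflows is that each step is selected by the type of event it receives. The sketch below reproduces that idea with plain dataclasses and a handler table; it is a simplified, synchronous stand-in with hypothetical names, not the Workflow engine itself.

```python
from dataclasses import dataclass

@dataclass
class TextQuery:
    query: str

@dataclass
class ImageQuery:
    query: str
    image: str

@dataclass
class Done:
    result: str

def classify(payload):
    """Emit a different event type depending on the input."""
    if payload.get("image"):
        return ImageQuery(payload["query"], payload["image"])
    return TextQuery(payload["query"])

# Each handler consumes one event type and emits the next event
HANDLERS = {
    TextQuery: lambda event: Done(f"text:{event.query}"),
    ImageQuery: lambda event: Done(f"image:{event.query}"),
}

def run(payload):
    event = classify(payload)
    while not isinstance(event, Done):
        event = HANDLERS[type(event)](event)  # dispatch on event type
    return event.result

print(run({"query": "red shoes"}))
print(run({"query": "red shoes", "image": "shoe.png"}))
```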

Async-first architecture enables:

Concurrent step execution
Non-blocking I/O operations
Efficient resource utilization
Scalable production deployments
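
The benefit of concurrent, non-blocking steps can be seen with plain `asyncio`. This sketch uses `asyncio.sleep` as a stand-in for I/O such as an embedding API call; `fake_io_step` and the step names are hypothetical.

```python
import asyncio
import time

async def fake_io_step(name, delay):
    """Simulate a non-blocking I/O call that takes `delay` seconds."""
    await asyncio.sleep(delay)
    return name

async def main():
    start = time.perf_counter()
    # Both steps wait at the same time instead of one after the other
    results = await asyncio.gather(
        fake_io_step("embed", 0.2),
        fake_io_step("fetch_metadata", 0.2),
    )
    return results, time.perf_counter() - start

results, elapsed = asyncio.run(main())
print(results, f"{elapsed:.2f}s")
```

Run sequentially, the two steps would take about 0.4 seconds; run concurrently, the total stays close to the duration of the slowest step.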

State management through context:

@step
async def step_with_state(self, ctx: Context, event: MyEvent) -> NextEvent:
    # Store state
    await ctx.set("key", "value")
    
    # Retrieve state
    value = await ctx.get("key")
    
    # State persists across steps
    return NextEvent(data=value)

Chat Engines and Context Management

Chat engines extend query engines with conversation memory:

Basic chat engine:

chat_engine = index.as_chat_engine(
    chat_mode="context",  # retrieves from the index on each turn; "simple" would skip retrieval
    llm=llm,
    verbose=True
)
 
# Conversation
response1 = chat_engine.chat("What is the return policy?")
print(response1)
 
response2 = chat_engine.chat("How long do I have?")  # Refers to previous context
print(response2)

CondensePlusContextChatEngine condenses conversation history:

from llama_index.core.memory import ChatMemoryBuffer
 
memory = ChatMemoryBuffer.from_defaults(token_limit=3000)

chat_engine = index.as_chat_engine(
    chat_mode="condense_plus_context",
    memory=memory,
    llm=llm,
    # Only {context_str} is filled into this template; chat history is
    # supplied separately from `memory` as prior messages.
    context_prompt="""You are a helpful assistant with access to company documentation.
    Use the context below to answer questions accurately.
    
    Context:
    {context_str}
    
    Answer the user's question based on the context and the conversation so far.""",
    verbose=True
)
 
# Condenses chat history + current question into standalone query
# Retrieves relevant context
# Generates response with both history and retrieved context

Context management strategies:

ChatMemoryBuffer with token limits:

import tiktoken
from llama_index.core.llms import ChatMessage
from llama_index.core.memory import ChatMemoryBuffer

tokenizer = tiktoken.encoding_for_model("gpt-4")  # any object with an encode() method works

memory = ChatMemoryBuffer.from_defaults(
    token_limit=3000,  # Keep last 3K tokens of conversation
    tokenizer_fn=tokenizer.encode  # Custom tokenizer
)

memory.put_messages([
    ChatMessage(role="user", content="Hello!"),
    ChatMessage(role="assistant", content="Hi! How can I help?")
])

# Retrieve conversation history
history = memory.get()
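
The effect of a token limit can be sketched as a sliding window over the most recent messages. This assumes whitespace word counts as a stand-in for real tokenization; `trim_history` is a hypothetical helper, and the actual buffer's trimming rules may differ in detail.

```python
def trim_history(messages, token_limit, count_tokens=lambda text: len(text.split())):
    """Keep the most recent messages whose combined token count fits token_limit."""
    kept, used = [], 0
    for role, content in reversed(messages):
        cost = count_tokens(content)
        if used + cost > token_limit:
            break  # older messages no longer fit
        kept.append((role, content))
        used += cost
    return list(reversed(kept))

history = [
    ("user", "Hello there"),
    ("assistant", "Hi! How can I help?"),
    ("user", "What is the return policy?"),
]
recent = trim_history(history, token_limit=10)
print(recent)
```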

Streaming chat responses:

chat_engine = index.as_chat_engine()
 
streaming_response = chat_engine.stream_chat("Tell me about the product")
 
for token in streaming_response.response_gen:
    print(token, end="", flush=True)

ReAct chat mode for agent-based chat:

chat_engine = index.as_chat_engine(
    chat_mode="react",
    llm=llm,
    verbose=True
)

# The index is wrapped as a query engine tool automatically, and the agent
# decides when to call it. To combine it with other tools, build an agent
# from an explicit tool list as shown in the Agents section.
response = chat_engine.chat("Calculate ROI based on our pricing documentation")

Observability and Tracing

LlamaIndex provides an instrumentation module (v0.10.20+) that replaces the legacy callbacks:

Basic tracing setup:

from llama_index.core.instrumentation import get_dispatcher
from llama_index.core.instrumentation.event_handlers import BaseEventHandler

# Get the root dispatcher
dispatcher = get_dispatcher()

# Basic event handler
class SimpleEventHandler(BaseEventHandler):
    @classmethod
    def class_name(cls) -> str:
        return "SimpleEventHandler"

    def handle(self, event, **kwargs):
        print(f"Event: {event.class_name()}, Time: {event.timestamp}")

dispatcher.add_event_handler(SimpleEventHandler())

LlamaTrace integration (hosted observability platform):

import llama_index.core
import os
 
# Configure LlamaTrace
PHOENIX_API_KEY = "your-api-key"
os.environ["OTEL_EXPORTER_OTLP_HEADERS"] = f"api_key={PHOENIX_API_KEY}"
 
llama_index.core.set_global_handler(
    "arize_phoenix",
    endpoint="https://llamatrace.com/v1/traces"
)
 
# All operations automatically traced
query_engine = index.as_query_engine()
response = query_engine.query("sample query")
 
# View traces at llamatrace.com

OpenTelemetry integration:

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
 
# Configure OTLP exporter
trace.set_tracer_provider(TracerProvider())
otlp_exporter = OTLPSpanExporter(endpoint="http://localhost:4318/v1/traces")
trace.get_tracer_provider().add_span_processor(
    BatchSpanProcessor(otlp_exporter)
)
 
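# LlamaIndex spans reach this provider only once an OpenTelemetry
# instrumentor for LlamaIndex (for example, OpenInference) is installed and attached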

Custom instrumentation:

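import time
from llama_index.core.instrumentation import get_dispatcher
from llama_index.core.instrumentation.span import BaseSpan
from llama_index.core.instrumentation.span_handlers import BaseSpanHandler

class TimedSpan(BaseSpan):
    start_time: float = 0.0

class CustomSpanHandler(BaseSpanHandler[TimedSpan]):
    def new_span(self, id_, bound_args, instance=None, parent_span_id=None, tags=None, **kwargs):
        print(f"Starting: {id_}")
        return TimedSpan(id_=id_, parent_id=parent_span_id, start_time=time.time())
    
    def prepare_to_exit_span(self, id_, bound_args, instance=None, result=None, **kwargs):
        span = self.open_spans[id_]
        print(f"Completed: {id_} in {time.time() - span.start_time:.2f}s")
        return span
    
    def prepare_to_drop_span(self, id_, bound_args, instance=None, err=None, **kwargs):
        return self.open_spans.get(id_)  # span ended with an error

get_dispatcher().add_span_handler(CustomSpanHandler())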

Traced operations include:

Document loading and parsing
Embedding generation
Vector store operations
Retrieval queries
LLM calls with token counts
Response synthesis
Agent tool invocations

Trace visualization shows:

End-to-end latency breakdown
Token usage per operation
Retrieval relevance scores
LLM call parameters and responses
Error traces and debugging info

Version History and Ecosystem Evolution

February 2024: LlamaCloud launch - Enterprise offering for document ingestion, parsing, indexing, and storage. Managed infrastructure for production RAG deployments.

March 2024: LlamaParse independence - Parser for complex documents (PDFs, presentations, forms) became standalone tool. Advanced parsing capabilities for tables, charts, layouts.

May 2024: Property Graph Index - Added graph-based indexing for knowledge graphs and entity relationships. Enables graph traversal queries beyond vector similarity.

June 2024: LlamaDeploy framework - Transforms agents into microservices for production deployment. Facilitates containerization, scaling, and service orchestration.

July 2024: LlamaTrace observability - First-class observability platform built on Arize Phoenix. Hosted tracing, monitoring, and evaluation for LlamaIndex applications.

August 2024: Workflows framework (beta) - Event-driven orchestration system for complex agentic workflows. Async-first architecture with enhanced observability and state management.

September 2024: LlamaParse Premium Mode - Advanced parsing features for complex layouts, multi-column documents, and embedded objects.

December 2024: LlamaReport feature - Transforms document databases into polished reports. Automated report generation from knowledge bases.

June 2025: Workflows 1.0 standalone - Workflows became independent framework for Python and TypeScript. Complete rewrite with improved developer experience.

October 2025: VersionRAG framework - Version-aware RAG handling evolving documents through hierarchical graph structure. Tracks document changes and retrieves appropriate versions.

Ecosystem components:

LlamaIndex (core): Framework and libraries
LlamaHub: 100+ data connectors
LlamaCloud: Managed enterprise platform
LlamaParse: Document parsing service
LlamaTrace: Observability platform
LlamaDeploy: Production deployment tools
LlamaReport: Report generation

Package structure (v0.10+):

llama-index-core: Core functionality
llama-index-llms-*: LLM integrations (openai, anthropic, etc.)
llama-index-embeddings-*: Embedding models
llama-index-vector-stores-*: Vector database integrations
llama-index-readers-*: Data connectors

The modular package structure lets you install only the components you need, which reduces dependencies.
