LlamaIndex
Overview
LlamaIndex (formerly GPT Index) is an open-source Python framework specialized in data ingestion, indexing, and retrieval for building knowledge-augmented LLM applications. Originally launched in 2022, the framework focuses on connecting LLMs with private or domain-specific data through Retrieval-Augmented Generation (RAG). LlamaIndex addresses the problem of LLMs lacking access to custom data by providing comprehensive tooling for data loading, transformation, indexing, and intelligent retrieval.
The framework differs from general-purpose orchestration tools (LangChain, CrewAI) by specializing in the data ingestion and retrieval pipeline. While it includes agent capabilities, LlamaIndex’s core strength lies in transforming unstructured data into queryable indexes with sophisticated retrieval strategies. The architecture centers on data connectors, indexes, retrievers, and query engines working together to enable semantic search over private data.
Key technical components covered:
- RAG architecture and data pipeline
- Data connectors and document loading
- Indexing strategies and vector stores
- Node parsers and chunking strategies
- Query engines and retrieval mechanisms
- Response synthesis strategies
- Agents and tool integration
- Workflows for orchestration
- Chat engines and context management
- Observability and tracing
- Version history and ecosystem evolution
RAG Architecture and Data Pipeline
LlamaIndex implements Retrieval-Augmented Generation as a four-stage pipeline:
Stage 1: Indexing (offline)
Data ingestion through connectors:
from llama_index.core import SimpleDirectoryReader
# Load documents from directory
documents = SimpleDirectoryReader(
input_dir="./data",
recursive=True,
required_exts=[".pdf", ".txt", ".docx"]
).load_data()
# Or from specific sources
from llama_index.readers.database import DatabaseReader
db_reader = DatabaseReader(
    uri="postgresql://localhost/mydb"  # Connection strings go in `uri`; `sql_database` expects a SQLDatabase object
)
documents = db_reader.load_data(query="SELECT * FROM articles")
Document transformation into nodes (chunks):
from llama_index.core.node_parser import SentenceSplitter
parser = SentenceSplitter(
chunk_size=512,
chunk_overlap=50
)
nodes = parser.get_nodes_from_documents(documents)
Embedding generation converts text to vectors:
from llama_index.embeddings.openai import OpenAIEmbedding
embed_model = OpenAIEmbedding(model="text-embedding-3-small")
for node in nodes:
    node.embedding = embed_model.get_text_embedding(node.text)
Vector storage in database:
from llama_index.core import VectorStoreIndex
from llama_index.vector_stores.chroma import ChromaVectorStore
import chromadb
chroma_client = chromadb.PersistentClient(path="./chroma_db")
chroma_collection = chroma_client.get_or_create_collection("my_collection")  # Reuses the collection on reruns instead of failing
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
# Create index with custom vector store
from llama_index.core import StorageContext
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex(nodes, storage_context=storage_context)
Stage 2: Retrieval (online)
The user query is converted to an embedding and searched against the vector database:
# Semantic search
query_engine = index.as_query_engine(similarity_top_k=5)
# Query executed internally:
# 1. Query text → embedding
# 2. Vector similarity search
# 3. Return top-k most similar nodes
Stage 3: Augmentation
Retrieved documents are combined with the user query to form an enriched prompt:
User Query: "What is the company's return policy?"
Retrieved Context:
- Document 1 (score: 0.92): "Our return policy allows..."
- Document 2 (score: 0.87): "Returns must be made within..."
- Document 3 (score: 0.81): "Refunds are processed..."
Augmented Prompt sent to LLM:
"Given this context: [retrieved documents]
Answer the question: What is the company's return policy?"
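The augmentation step above can be sketched in plain Python. This is a minimal illustration, not LlamaIndex's internal prompt template: the function name `build_augmented_prompt` and the (text, score) tuple format are assumptions made for the example.

```python
from typing import List, Tuple


def build_augmented_prompt(query: str, retrieved: List[Tuple[str, float]]) -> str:
    """Order retrieved chunks by score (highest first) and wrap them around the query."""
    ranked = sorted(retrieved, key=lambda pair: pair[1], reverse=True)
    context = "\n".join(f"- {text}" for text, _score in ranked)
    return f"Given this context:\n{context}\nAnswer the question: {query}"


# Hypothetical retrieval results, deliberately passed out of order
prompt = build_augmented_prompt(
    "What is the company's return policy?",
    [
        ("Returns must be made within...", 0.87),
        ("Our return policy allows...", 0.92),
        ("Refunds are processed...", 0.81),
    ],
)
print(prompt)
```

The LLM only ever sees this combined string, which is why retrieval quality bounds answer quality.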
Stage 4: Generation
The LLM generates a response from the augmented prompt, so the answer is grounded in the provided context rather than in parametric knowledge alone.
Complete RAG example:
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
# Index phase (run once)
documents = SimpleDirectoryReader("./company_docs").load_data()
index = VectorStoreIndex.from_documents(documents)
# Save index for reuse
index.storage_context.persist(persist_dir="./storage")
# Query phase (run many times)
from llama_index.core import StorageContext, load_index_from_storage
storage_context = StorageContext.from_defaults(persist_dir="./storage")
index = load_index_from_storage(storage_context)
query_engine = index.as_query_engine()
response = query_engine.query("What is the return policy?")
print(response)
Data Connectors and Document Loading
LlamaHub provides 100+ data connectors for diverse sources:
File-based connectors:
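from llama_index.core import SimpleDirectoryReader
from llama_index.readers.file import DocxReader, MarkdownReader, PDFReader
# Load multiple file types; file_extractor maps extensions to reader instances, not reader names
documents = SimpleDirectoryReader(
    "./data",
    file_extractor={
        ".pdf": PDFReader(),
        ".docx": DocxReader(),
        ".md": MarkdownReader()
    }
).load_data()
Database connectors: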
from llama_index.readers.database import DatabaseReader
# PostgreSQL
db_reader = DatabaseReader(
    uri="postgresql://user:pass@localhost/db"  # Connection string is passed as `uri`
)
docs = db_reader.load_data(query="SELECT content FROM articles")
# MongoDB
from llama_index.readers.mongodb import SimpleMongoReader
mongo_reader = SimpleMongoReader(
host="localhost",
port=27017
)
docs = mongo_reader.load_data(
db_name="mydb",
collection_name="documents"
)
Web and API connectors:
# Notion
from llama_index.readers.notion import NotionPageReader
notion_reader = NotionPageReader(integration_token="secret_...")
docs = notion_reader.load_data(page_ids=["page-id-1", "page-id-2"])
# Google Docs
from llama_index.readers.google import GoogleDocsReader
google_reader = GoogleDocsReader()
docs = google_reader.load_data(document_ids=["doc-id-1"])
# Slack
from llama_index.readers.slack import SlackReader
slack_reader = SlackReader(slack_token="xoxb-...")
docs = slack_reader.load_data(channel_ids=["C123456"])
Document structure:
# Simplified view of the main Document fields
class Document:
    text: str  # Main content
    metadata: Dict[str, Any]  # Source, author, date, etc.
    doc_id: str  # Unique identifier
    embedding: Optional[List[float]]  # Vector embedding
Metadata enrichment:
documents = SimpleDirectoryReader("./data").load_data()
for doc in documents:
    doc.metadata.update({
        "source": "internal_docs",
        "department": "engineering",
        "last_updated": "2025-01-15"
    })
Metadata enables filtering during retrieval:
from llama_index.core.vector_stores import MetadataFilters, ExactMatchFilter
filters = MetadataFilters(filters=[
ExactMatchFilter(key="department", value="engineering")
])
retriever = index.as_retriever(filters=filters)
Node Parsers and Chunking Strategies
Node parsers transform documents into chunks (nodes) optimized for retrieval:
SentenceSplitter (default) splits at sentence boundaries:
from llama_index.core.node_parser import SentenceSplitter
parser = SentenceSplitter(
chunk_size=512, # Tokens per chunk
chunk_overlap=50, # Overlap between chunks
separator=" " # Split on spaces when needed
)
nodes = parser.get_nodes_from_documents(documents)
SentenceWindowNodeParser includes context windows:
from llama_index.core.node_parser import SentenceWindowNodeParser
parser = SentenceWindowNodeParser.from_defaults(
window_size=3, # Include 3 sentences before and after
window_metadata_key="window",
original_text_metadata_key="original_sentence"
)
nodes = parser.get_nodes_from_documents(documents)
# Each node contains:
# - Core sentence for embedding
# - Window of surrounding sentences in metadata
# - Enables precise retrieval with broader context
Use case: Retrieve a specific sentence but provide the LLM with surrounding context for better understanding.
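The sentence-window idea can be illustrated without the library. The sketch below is an assumption-laden approximation: `sentence_windows` is a hypothetical helper, and the regex sentence splitter is much cruder than the tokenizer a real parser would use. It only shows how each sentence is paired with its neighbours under the same metadata keys used above.

```python
import re
from typing import Dict, List


def sentence_windows(text: str, window_size: int = 1) -> List[Dict[str, str]]:
    """Pair each sentence with a window of up to `window_size` sentences on each side."""
    # Naive split on ., ! or ? followed by whitespace (an assumption for this sketch)
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    records = []
    for i, sentence in enumerate(sentences):
        start = max(0, i - window_size)
        end = min(len(sentences), i + window_size + 1)
        records.append({
            "original_sentence": sentence,              # what would be embedded
            "window": " ".join(sentences[start:end]),   # what the LLM would read
        })
    return records


records = sentence_windows(
    "Returns are free. Ship within 30 days. Refunds take a week. Contact support.",
    window_size=1,
)
for record in records:
    print(record["original_sentence"], "|", record["window"])
```

Embedding the single sentence keeps the match precise, while swapping in the window at synthesis time restores the context the sentence alone lacks.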
HierarchicalNodeParser creates multi-level hierarchy:
from llama_index.core.node_parser import HierarchicalNodeParser
parser = HierarchicalNodeParser.from_defaults(
chunk_sizes=[2048, 512, 128] # Three levels
)
nodes = parser.get_nodes_from_documents(documents)
# Creates structure:
# Level 1: 2048-token chunks (coarse overview)
# Level 2: 512-token chunks (detailed sections)
# Level 3: 128-token chunks (fine-grained details)
# Retrieval can start coarse and drill down
Use case: Long documents where hierarchical retrieval (summary → detail) improves relevance.
TokenTextSplitter splits by token count:
from llama_index.core.node_parser import TokenTextSplitter
parser = TokenTextSplitter(
chunk_size=512,
chunk_overlap=20,
separator=" "
)
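nodes = parser.get_nodes_from_documents(documents)
SemanticSplitterNodeParser splits at semantic boundaries:
from llama_index.core.node_parser import SemanticSplitterNodeParser
from llama_index.embeddings.openai import OpenAIEmbedding
parser = SemanticSplitterNodeParser(
    buffer_size=1,  # Sentences to group
    breakpoint_percentile_threshold=95,  # Semantic shift threshold
    embed_model=OpenAIEmbedding()  # Required: sentence embeddings are compared to find topic shifts
)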
nodes = parser.get_nodes_from_documents(documents)
# Identifies topic shifts and splits there
# Produces variable-length chunks aligned with meaning
Chunking strategy trade-offs:
Small chunks (128-256 tokens):
- ✅ Precise retrieval (exact relevant passage)
- ✅ Less noise in context
- ❌ May miss surrounding context
- ❌ More chunks = more embeddings to store
Large chunks (1024-2048 tokens):
- ✅ Broader context included
- ✅ Fewer chunks to manage
- ❌ Less precise retrieval
- ❌ May exceed context window when many retrieved
Typical starting range: 256-512 tokens for most applications; tune it against your own documents and queries.
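To make the trade-off concrete, here is a minimal sketch of fixed-size chunking with overlap. It is not the library's splitter: `chunk_tokens` is a hypothetical helper, and the placeholder strings stand in for real tokenizer output.

```python
from typing import List


def chunk_tokens(tokens: List[str], chunk_size: int, chunk_overlap: int) -> List[List[str]]:
    """Slide a window of `chunk_size` tokens, stepping by chunk_size - chunk_overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the document
    return chunks


tokens = [f"t{i}" for i in range(10)]  # placeholder tokens t0..t9
small = chunk_tokens(tokens, chunk_size=4, chunk_overlap=1)
large = chunk_tokens(tokens, chunk_size=8, chunk_overlap=1)
print(len(small), "small chunks;", len(large), "large chunks")
```

Smaller windows produce more chunks (more embeddings to store) but each one is more focused; the overlap repeats boundary tokens so a passage that straddles two chunks is not lost.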
Query Engines and Retrieval Mechanisms
Query engines orchestrate retrieval and response generation:
Basic vector retrieval:
from llama_index.core import VectorStoreIndex
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine(
similarity_top_k=3, # Retrieve top 3 most similar
response_mode="compact" # Synthesis strategy
)
response = query_engine.query("What is the return policy?")
print(response)
Custom retriever with advanced strategies:
from llama_index.core.retrievers import VectorIndexRetriever
from llama_index.core import QueryBundle
retriever = VectorIndexRetriever(
index=index,
similarity_top_k=5,
vector_store_query_mode="default" # or "mmr", "hybrid"
)
query_bundle = QueryBundle(query_str="return policy")
nodes = retriever.retrieve(query_bundle)
for node in nodes:
    print(f"Score: {node.score:.3f}")
    print(f"Text: {node.text[:200]}...")
Hybrid retrieval combining vector and keyword search:
from llama_index.core import get_response_synthesizer
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.retrievers import BaseRetriever, KeywordTableSimpleRetriever, VectorIndexRetriever
class HybridRetriever(BaseRetriever):
    def __init__(self, vector_retriever, keyword_retriever, mode="OR"):
        self.vector_retriever = vector_retriever
        self.keyword_retriever = keyword_retriever
        self.mode = mode
        super().__init__()  # Initialize BaseRetriever internals
    def _retrieve(self, query_bundle):
        vector_nodes = self.vector_retriever.retrieve(query_bundle)
        keyword_nodes = self.keyword_retriever.retrieve(query_bundle)
        if self.mode == "AND":
            # Return nodes appearing in both
            vector_ids = {n.node_id for n in vector_nodes}
            return [n for n in keyword_nodes if n.node_id in vector_ids]
        else:  # OR mode
            # Merge and deduplicate
            all_nodes = {n.node_id: n for n in vector_nodes + keyword_nodes}
            return list(all_nodes.values())
# Use hybrid retriever
# Assumes vector_index is a VectorStoreIndex and keyword_index is a keyword table index built over the same nodes
vector_ret = VectorIndexRetriever(index=vector_index, similarity_top_k=5)
keyword_ret = KeywordTableSimpleRetriever(index=keyword_index)
hybrid_retriever = HybridRetriever(vector_ret, keyword_ret, mode="OR")
query_engine = RetrieverQueryEngine(
    retriever=hybrid_retriever,
    response_synthesizer=get_response_synthesizer()
)
Advanced retrieval strategies:
MMR (Maximal Marginal Relevance): Balances relevance and diversity:
retriever = VectorIndexRetriever(
index=index,
similarity_top_k=10,
vector_store_query_mode="mmr",
vector_store_kwargs={
"mmr_threshold": 0.5 # Balance relevance vs diversity
}
)
Metadata filtering:
from llama_index.core.vector_stores import MetadataFilters, ExactMatchFilter
filters = MetadataFilters(filters=[
ExactMatchFilter(key="department", value="HR"),
ExactMatchFilter(key="year", value="2024")
])
query_engine = index.as_query_engine(
filters=filters,
similarity_top_k=5
)
Reranking improves retrieval quality:
from llama_index.postprocessor.cohere_rerank import CohereRerank
reranker = CohereRerank(api_key="...", top_n=3)
query_engine = index.as_query_engine(
similarity_top_k=10, # Initial retrieval
node_postprocessors=[reranker] # Rerank to top 3
)
Response Synthesis Strategies
Response synthesizers determine how retrieved chunks are used to generate answers:
The refine strategy iteratively improves the answer:
from llama_index.core import get_response_synthesizer
synthesizer = get_response_synthesizer(response_mode="refine")
# Process:
# 1. Generate initial answer from chunk 1 + query
# 2. Refine answer with chunk 2 + previous answer + query
# 3. Refine again with chunk 3 + refined answer + query
# 4. Continue through all retrieved chunks
query_engine = index.as_query_engine(
response_synthesizer=synthesizer,
similarity_top_k=5
)
Advantages: Comprehensive answers incorporating all retrieved context.
Disadvantages: Multiple LLM calls (one per chunk), higher cost and latency.
The compact strategy consolidates chunks before refinement:
synthesizer = get_response_synthesizer(response_mode="compact")
# Process:
# 1. Concatenate chunks to fill context window
# 2. Generate answer from concatenated context
# 3. Refine if context exceeds single call
# Fewer LLM calls than pure refine
Tree Summarize builds a hierarchical summary:
synthesizer = get_response_synthesizer(response_mode="tree_summarize")
# Process (bottom-up tree):
# 1. Summarize groups of chunks that fit in the context window
# 2. Summarize summaries recursively
# 3. Root node = final answer
# Efficient for large document sets
Accumulate generates separate answers:
synthesizer = get_response_synthesizer(response_mode="accumulate")
# Process:
# 1. Generate answer from chunk 1
# 2. Generate answer from chunk 2
# 3. Generate answer from chunk 3
# 4. Concatenate all answers
# Useful for multi-perspective responses
Compact Accumulate optimizes accumulation:
synthesizer = get_response_synthesizer(response_mode="compact_accumulate")
# Consolidates chunks first, then accumulates
# Fewer LLM calls than pure accumulate
Strategy selection guide:
- Simple queries: Use compact (fast, efficient)
- Complex analysis: Use refine (comprehensive)
- Summarization: Use tree_summarize (hierarchical)
- Multiple perspectives: Use accumulate (separate answers)
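The cost difference between refine and compact comes down to how many LLM calls each makes. The sketch below illustrates this with a stub LLM that only counts calls. Everything here is an assumption for illustration: `CountingLLM`, `pack`, and the character budget (standing in for a token-based context window) are hypothetical and not part of the library.

```python
from typing import Callable, List


class CountingLLM:
    """Stub LLM that records how many times it is called."""

    def __init__(self) -> None:
        self.calls = 0

    def __call__(self, prompt: str) -> str:
        self.calls += 1
        return f"answer after {self.calls} call(s)"


def refine(chunks: List[str], query: str, llm: Callable[[str], str]) -> str:
    """One LLM call per chunk, each call revising the previous answer."""
    answer = ""
    for chunk in chunks:
        answer = llm(f"Question: {query}\nExisting answer: {answer}\nNew context: {chunk}")
    return answer


def pack(chunks: List[str], max_chars: int) -> List[str]:
    """Greedily concatenate chunks until the (character) budget would be exceeded."""
    packed, current = [], ""
    for chunk in chunks:
        candidate = f"{current}\n{chunk}" if current else chunk
        if current and len(candidate) > max_chars:
            packed.append(current)
            current = chunk
        else:
            current = candidate
    if current:
        packed.append(current)
    return packed


def compact(chunks: List[str], query: str, llm: Callable[[str], str], max_chars: int) -> str:
    """Pack chunks first, then refine over the (fewer) packed blocks."""
    return refine(pack(chunks, max_chars), query, llm)


chunks = ["a" * 40, "b" * 40, "c" * 40, "d" * 40]
refine_llm, compact_llm = CountingLLM(), CountingLLM()
refine(chunks, "What is the return policy?", refine_llm)
compact(chunks, "What is the return policy?", compact_llm, max_chars=100)
print(f"refine calls: {refine_llm.calls}, compact calls: {compact_llm.calls}")
```

With four 40-character chunks and a 100-character budget, two chunks fit per call, so compact halves the number of calls while still seeing every chunk.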
Agents and Tool Integration
LlamaIndex implements agents as LLM-powered systems using tools to accomplish tasks:
Function tool creation:
from llama_index.core.tools import FunctionTool
def multiply(a: int, b: int) -> int:
    """Multiply two integers and return the result"""
    return a * b
def add(a: int, b: int) -> int:
    """Add two integers and return the result"""
    return a + b
multiply_tool = FunctionTool.from_defaults(fn=multiply)
add_tool = FunctionTool.from_defaults(fn=add)
Query engine as tool (RAG as a tool):
from llama_index.core.tools import QueryEngineTool
# Create query engine from index
query_engine = index.as_query_engine()
# Wrap as tool
query_tool = QueryEngineTool.from_defaults(
query_engine=query_engine,
name="company_knowledge_base",
description="Useful for answering questions about company policies, procedures, and documentation"
)
Agent creation with tools:
from llama_index.llms.openai import OpenAI
from llama_index.core.agent import FunctionCallingAgent
llm = OpenAI(model="gpt-4")
agent = FunctionCallingAgent.from_tools(
tools=[multiply_tool, add_tool, query_tool],
llm=llm,
verbose=True,
system_prompt="""You are a helpful assistant that can perform calculations
and answer questions about company policies."""
)
response = agent.chat("What is (121 * 3) + 42? Also, what's our vacation policy?")
print(response)
Agent execution flow:
- Receive user query
- Analyze which tools might be helpful
- Generate function calls with parameters
- Execute tools in sequence or parallel
- Incorporate tool results into reasoning
- Generate final response
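The execution flow above can be sketched as a plain loop. This is a simplified illustration, not the library's agent runner: `stub_planner` is a hypothetical stand-in for the LLM with a hard-coded plan for one question, and `run_agent` is an assumed name.

```python
from typing import Any, Callable, Dict, List, Tuple

ToolCall = Tuple[str, Dict[str, Any]]


def multiply(a: int, b: int) -> int:
    """Multiply two integers and return the result"""
    return a * b


def add(a: int, b: int) -> int:
    """Add two integers and return the result"""
    return a + b


TOOLS: Dict[str, Callable[..., Any]] = {"multiply": multiply, "add": add}


def stub_planner(query: str, observations: List[Any]) -> List[ToolCall]:
    """Hypothetical LLM stand-in: a fixed plan for (121 * 3) + 42."""
    if not observations:
        return [("multiply", {"a": 121, "b": 3})]
    if len(observations) == 1:
        return [("add", {"a": observations[0], "b": 42})]
    return []  # no further tool calls: ready to answer


def run_agent(query: str, planner, tools, max_steps: int = 5) -> str:
    observations: List[Any] = []
    for _ in range(max_steps):
        calls = planner(query, observations)   # decide which tools to call
        if not calls:
            break
        for name, kwargs in calls:             # execute tools, collect results
            observations.append(tools[name](**kwargs))
    return f"Final answer: {observations[-1]}" if observations else "No tool was needed."


result = run_agent("What is (121 * 3) + 42?", stub_planner, TOOLS)
print(result)
```

In a real agent the planner is an LLM call that returns structured function calls, and the `max_steps` cap guards against loops that never converge.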
Parallel function calling (requires the openai Python package 1.1.0+):
from llama_index.agent.openai import OpenAIAgent
agent = OpenAIAgent.from_tools(
[add_tool, multiply_tool],
llm=OpenAI(model="gpt-4"),
verbose=True
)
# Agent can call multiple tools simultaneously
response = agent.chat("Calculate both 5 + 3 and 7 * 9")
# Executes add(5, 3) and multiply(7, 9) in parallel
Multi-document agent with tool per data source:
# Create separate indexes for different document types
policy_index = VectorStoreIndex.from_documents(policy_docs)
handbook_index = VectorStoreIndex.from_documents(handbook_docs)
faq_index = VectorStoreIndex.from_documents(faq_docs)
# Create tools from each index
policy_tool = QueryEngineTool.from_defaults(
query_engine=policy_index.as_query_engine(),
name="policy_search",
description="Search company policies and procedures"
)
handbook_tool = QueryEngineTool.from_defaults(
query_engine=handbook_index.as_query_engine(),
name="handbook_search",
description="Search employee handbook"
)
faq_tool = QueryEngineTool.from_defaults(
query_engine=faq_index.as_query_engine(),
name="faq_search",
description="Search frequently asked questions"
)
# Agent intelligently selects appropriate tool
agent = FunctionCallingAgent.from_tools(
[policy_tool, handbook_tool, faq_tool],
llm=llm
)
Workflows for Orchestration
Workflows (introduced August 2024) provide event-driven orchestration for complex multi-step processes:
Workflow structure:
from llama_index.core.workflow import Context, Event, StartEvent, StopEvent, Workflow, step
# Custom events carry data between steps
class IngestEvent(Event):
    nodes: list
class EmbedEvent(Event):
    nodes: list
class RAGWorkflow(Workflow):
    # Assumes self.parser, self.embed_model, and self.index are set up in __init__ (not shown)
    @step
    async def ingest(self, ctx: Context, event: StartEvent) -> IngestEvent:
        """Step 1: Ingest and chunk documents"""
        documents = event.documents
        nodes = self.parser.get_nodes_from_documents(documents)
        return IngestEvent(nodes=nodes)
    @step
    async def embed(self, ctx: Context, event: IngestEvent) -> EmbedEvent:
        """Step 2: Generate embeddings"""
        for node in event.nodes:
            node.embedding = await self.embed_model.aget_text_embedding(node.text)
        return EmbedEvent(nodes=event.nodes)
    @step
    async def store(self, ctx: Context, event: EmbedEvent) -> StopEvent:
        """Step 3: Store in vector database"""
        self.index.insert_nodes(event.nodes)
        return StopEvent(result=f"Indexed {len(event.nodes)} nodes")
# Run workflow (await needs an async context, such as a notebook or an async function)
workflow = RAGWorkflow()
result = await workflow.run(documents=documents)
Complex workflow with branching:
# Assumes ImageQueryEvent, TextQueryEvent, and RetrievalEvent are Event subclasses defined elsewhere
class MultiModalWorkflow(Workflow):
    @step
    async def classify_input(self, ctx: Context, event: StartEvent) -> ImageQueryEvent | TextQueryEvent:
        """Determine if input is text or image query"""
        if event.has_image:
            return ImageQueryEvent(image=event.image, query=event.query)
        else:
            return TextQueryEvent(query=event.query)
    @step
    async def process_text(self, ctx: Context, event: TextQueryEvent) -> RetrievalEvent:
        """Handle text-only queries"""
        nodes = self.text_retriever.retrieve(event.query)
        return RetrievalEvent(nodes=nodes, query=event.query)
    @step
    async def process_image(self, ctx: Context, event: ImageQueryEvent) -> RetrievalEvent:
        """Handle image + text queries"""
        # Use multimodal model for retrieval
        nodes = self.image_retriever.retrieve(event.query, event.image)
        return RetrievalEvent(nodes=nodes, query=event.query)
    @step
    async def synthesize(self, ctx: Context, event: RetrievalEvent) -> StopEvent:
        """Generate final response"""
        response = self.synthesizer.synthesize(
            query=event.query,
            nodes=event.nodes
        )
        return StopEvent(result=response)
Async-first architecture enables:
- Concurrent step execution
- Non-blocking I/O operations
- Efficient resource utilization
- Scalable production deployments
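The benefit of non-blocking I/O can be shown with standard-library asyncio alone. The sketch below is a generic illustration, not workflow code: `fake_io_step` is a hypothetical coroutine whose sleep stands in for a network call such as an embedding or LLM request.

```python
import asyncio
import time
from typing import List, Tuple


async def fake_io_step(name: str, delay: float) -> str:
    """Simulate an I/O-bound step; the sleep stands in for a network round trip."""
    await asyncio.sleep(delay)
    return f"{name} done"


async def run_concurrently() -> Tuple[List[str], float]:
    start = time.perf_counter()
    # gather() schedules all three coroutines at once and preserves argument order
    results = await asyncio.gather(
        fake_io_step("embed", 0.2),
        fake_io_step("retrieve", 0.2),
        fake_io_step("rerank", 0.2),
    )
    return list(results), time.perf_counter() - start


results, elapsed = asyncio.run(run_concurrently())
print(results, f"elapsed: {elapsed:.2f}s")
```

Run sequentially, the three steps would take about 0.6 seconds; awaited together they finish in roughly the time of the slowest one, which is the effect concurrent workflow steps rely on.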
State management through context:
@step
async def step_with_state(self, ctx: Context, event: MyEvent) -> NextEvent:
    # Store state
    await ctx.set("key", "value")
    # Retrieve state
    value = await ctx.get("key")
    # State persists across steps
    return NextEvent(data=value)
Chat Engines and Context Management
Chat engines extend query engines with conversation memory:
Basic chat engine:
chat_engine = index.as_chat_engine(
    chat_mode="context",  # Retrieves from the index on each turn; "simple" chats with the LLM only
    llm=llm,
    verbose=True
)
# Conversation
response1 = chat_engine.chat("What is the return policy?")
print(response1)
response2 = chat_engine.chat("How long do I have?")  # Refers to previous context
print(response2)
CondensePlusContextChatEngine condenses conversation history:
from llama_index.core.memory import ChatMemoryBuffer
memory = ChatMemoryBuffer.from_defaults(token_limit=3000)
chat_engine = index.as_chat_engine(
chat_mode="condense_plus_context",
memory=memory,
llm=llm,
    context_prompt="""You are a helpful assistant with access to company documentation.
Use the context below to answer questions accurately.
Context:
{context_str}
Answer the user's question based on the context and the conversation so far.""",
    verbose=True
)
# Chat history is supplied from memory as prior messages, so the prompt only needs {context_str}
# Condenses chat history + current question into standalone query
# Retrieves relevant context
# Generates response with both history and retrieved context
Context management strategies:
ChatMemoryBuffer with token limits:
import tiktoken
from llama_index.core.llms import ChatMessage
from llama_index.core.memory import ChatMemoryBuffer
tokenizer = tiktoken.get_encoding("cl100k_base")  # Example tokenizer; any callable returning a token list works
memory = ChatMemoryBuffer.from_defaults(
    token_limit=3000,  # Keep last 3K tokens of conversation
    tokenizer_fn=tokenizer.encode  # Custom tokenizer
)
memory.put_messages([
    ChatMessage(role="user", content="Hello!"),
    ChatMessage(role="assistant", content="Hi! How can I help?")
])
# Retrieve conversation history
history = memory.get()
Streaming chat responses:
chat_engine = index.as_chat_engine()
streaming_response = chat_engine.stream_chat("Tell me about the product")
for token in streaming_response.response_gen:
    print(token, end="", flush=True)
ReAct chat mode for agent-based chat:
chat_engine = index.as_chat_engine(
    chat_mode="react",
    llm=llm,
    verbose=True
)
# The index is wrapped as a query engine tool automatically; the agent decides when to call it
# To combine retrieval with other tools (e.g. a calculator), build an agent from an explicit tool list as in the agents section
response = chat_engine.chat("Calculate ROI based on our pricing documentation")
Observability and Tracing
LlamaIndex provides an instrumentation module (v0.10.20+) that replaces the legacy callbacks:
Basic tracing setup:
import llama_index.core.instrumentation as instrument
from llama_index.core.instrumentation.event_handlers import BaseEventHandler
# Get the root dispatcher; handlers attached here receive events from all modules
dispatcher = instrument.get_dispatcher()
# Basic event handler
class SimpleEventHandler(BaseEventHandler):
    @classmethod
    def class_name(cls) -> str:
        return "SimpleEventHandler"
    def handle(self, event, **kwargs):
        print(f"Event: {event.class_name()}, Time: {event.timestamp}")
dispatcher.add_event_handler(SimpleEventHandler())
LlamaTrace integration (hosted observability platform):
import llama_index.core
import os
# Configure LlamaTrace
PHOENIX_API_KEY = "your-api-key"
os.environ["OTEL_EXPORTER_OTLP_HEADERS"] = f"api_key={PHOENIX_API_KEY}"
llama_index.core.set_global_handler(
"arize_phoenix",
endpoint="https://llamatrace.com/v1/traces"
)
# All operations automatically traced
query_engine = index.as_query_engine()
response = query_engine.query("sample query")
# View traces at llamatrace.com
OpenTelemetry integration:
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
# Configure OTLP exporter
trace.set_tracer_provider(TracerProvider())
otlp_exporter = OTLPSpanExporter(endpoint="http://localhost:4318/v1/traces")
trace.get_tracer_provider().add_span_processor(
BatchSpanProcessor(otlp_exporter)
)
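# LlamaIndex spans reach this exporter once an OpenTelemetry instrumentor for LlamaIndex is attached
Custom instrumentation:
from datetime import datetime
from llama_index.core.instrumentation.span import SimpleSpan
from llama_index.core.instrumentation.span_handlers import BaseSpanHandler
class CustomSpanHandler(BaseSpanHandler[SimpleSpan]):
    def new_span(self, id_, bound_args, instance=None, parent_span_id=None, tags=None, **kwargs):
        # Called when a span opens; the returned span is tracked in self.open_spans
        print(f"Starting: {id_}")
        return SimpleSpan(id_=id_, parent_id=parent_span_id)
    def prepare_to_exit_span(self, id_, bound_args, instance=None, result=None, **kwargs):
        # Called when a span completes normally
        span = self.open_spans[id_]
        duration = (datetime.now() - span.start_time).total_seconds()
        print(f"Completed: {id_} in {duration:.2f}s")
        return span
    def prepare_to_drop_span(self, id_, bound_args, instance=None, err=None, **kwargs):
        # Called when a span ends with an error
        return self.open_spans.get(id_)
dispatcher.add_span_handler(CustomSpanHandler())
Traced operations include: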
- Document loading and parsing
- Embedding generation
- Vector store operations
- Retrieval queries
- LLM calls with token counts
- Response synthesis
- Agent tool invocations
Trace visualization shows:
- End-to-end latency breakdown
- Token usage per operation
- Retrieval relevance scores
- LLM call parameters and responses
- Error traces and debugging info
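As a rough picture of what a trace record holds, the sketch below times named operations with a context manager and stores latency, attributes, and any error. It is a library-free illustration under stated assumptions: `span` and `TRACE` are hypothetical names, and the token counts are made-up example values rather than measurements.

```python
import time
from contextlib import contextmanager
from typing import Dict, Iterator, List

TRACE: List[Dict[str, object]] = []


@contextmanager
def span(name: str, **attributes: object) -> Iterator[None]:
    """Record the duration, attributes, and any error of the wrapped operation."""
    start = time.perf_counter()
    error = None
    try:
        yield
    except Exception as exc:
        error = repr(exc)
        raise
    finally:
        TRACE.append({
            "name": name,
            "seconds": time.perf_counter() - start,
            "error": error,
            **attributes,
        })


# Sleeps stand in for a retrieval query and an LLM call
with span("retrieval", top_k=3):
    time.sleep(0.01)
with span("llm_call", prompt_tokens=120, completion_tokens=40):
    time.sleep(0.02)

for record in TRACE:
    print(f"{record['name']}: {record['seconds']:.3f}s")
```

A latency breakdown is then just the list of records grouped by name, and error traces are the records whose `error` field is set.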
Version History and Ecosystem Evolution
February 2024: LlamaCloud launch - Enterprise offering for document ingestion, parsing, indexing, and storage. Managed infrastructure for production RAG deployments.
March 2024: LlamaParse independence - Parser for complex documents (PDFs, presentations, forms) became standalone tool. Advanced parsing capabilities for tables, charts, layouts.
May 2024: Property Graph Index - Added graph-based indexing for knowledge graphs and entity relationships. Enables graph traversal queries beyond vector similarity.
June 2024: LlamaDeploy framework - Transforms agents into microservices for production deployment. Facilitates containerization, scaling, and service orchestration.
July 2024: LlamaTrace observability - First-class observability platform built on Arize Phoenix. Hosted tracing, monitoring, and evaluation for LlamaIndex applications.
August 2024: Workflows framework - Event-driven orchestration system for complex agentic workflows. Async-first architecture with enhanced observability and state management.
September 2024: LlamaParse Premium Mode - Advanced parsing features for complex layouts, multi-column documents, and embedded objects.
December 2024: LlamaReport feature - Transforms document databases into polished reports. Automated report generation from knowledge bases.
June 2025: Workflows 1.0 standalone - Workflows became independent framework for Python and TypeScript. Complete rewrite with improved developer experience.
October 2025: VersionRAG framework - Version-aware RAG handling evolving documents through hierarchical graph structure. Tracks document changes and retrieves appropriate versions.
Ecosystem components:
- LlamaIndex (core): Framework and libraries
- LlamaHub: 100+ data connectors
- LlamaCloud: Managed enterprise platform
- LlamaParse: Document parsing service
- LlamaTrace: Observability platform
- LlamaDeploy: Production deployment tools
- LlamaReport: Report generation
Package structure (v0.10+):
- llama-index-core: Core functionality
- llama-index-llms-*: LLM integrations (openai, anthropic, etc.)
- llama-index-embeddings-*: Embedding models
- llama-index-vector-stores-*: Vector database integrations
- llama-index-readers-*: Data connectors
The modular package structure lets you install only the components you need, which reduces dependencies.