LMDiskANN
LM-DiskANN is an approximate nearest‐neighbor indexing library that builds and maintains a disk‐resident graph structure for high‐dimensional data. It aims to provide a balance between fast search times, memory efficiency, and support for incremental updates (insertion and deletion). Under the hood, LM-DiskANN stores vector embeddings in memory‐mapped files, limiting in‐memory usage while preserving quick random access. It also leverages a graph-based BFS expansion to find candidate neighbors, with user‐configurable parameters like search expansion factor (EF_SEARCH) and maximum neighborhood size (maxdegree).
"LM-diskann: Low memory footprint in disk-native dynamic graph-based ann indexing." Pan, Yu, Jianxin Sun, and Hongfeng Yu, 2023 IEEE International Conference on Big Data (BigData). IEEE, 2023.
It offers the optional user‐key mapping, allowing users to associate string keys with each embedding. This means you can insert vectors with a custom ID (e.g., "image001"), search for nearest neighbors, then retrieve or delete by either the numeric internal ID or the user key. The library includes utility functions to open, close, and clear these user‐key databases, enabling a more flexible integration with real-world applications where data might be identified by strings or external IDs.
Example Usage with Synthetic Data
Below is a short script you can run to see LM-DiskANN in action. We’ll:
- Create an index for
dim=5vectors. - Insert a few vectors—some with user keys, some without.
- Search for a known vector.
- Delete by either numeric ID or user key.
using LMDiskANN
using Random
# 1 make an index with dimension 5
dim = 5
index = create_index("example_index", dim)
# 2 insert a couple of random vectors
vec1 = rand(Float32, 5)
(key1, id1) = ann_insert!(index, vec1)
@show key1, id1 # e.g. might be ("1", 1) if no user key was specified
# insert a vector with a user key
vec2 = rand(Float32, 5)
(user_key, id2) = ann_insert!(index, vec2; key="my_special_vector")
@show user_key, id2 # e.g. ("my_special_vector", 2)
# 3 search for vec1
results = search(index, vec1, topk=3) # returns an array of (maybeKey, ID)
println("Search results for vec1 = ", results)
# 4 retrieve the actual embedding from the index by numeric ID
retrieved_vec1 = get_embedding_from_id(index, id1)
println("retrieved_vec1 == vec1 ? ", all(retrieved_vec1 .== vec1))
# retrieve by user key
retrieved_vec2 = get_embedding_from_key(index, "my_special_vector")
println("retrieved_vec2 == vec2 ? ", all(retrieved_vec2 .== vec2))
# 5 delete an entry
ann_delete!(index, id1) # by numeric ID
ann_delete!(index, "my_special_vector") # by user key
# 6 confirm they're gone (search or retrieval should fail)
println("After deletions, search(vec1) => ", search(index, vec1, topk=3))
try
get_embedding_from_id(index, id1)
catch e
println("Caught an expected error when retrieving a deleted vector: ", e)
endExample real data embeddings from GLoVe and usage
Build a 10 000-word cosine LM-DiskANN index and explore semantic neighbours using embedding vectors from GLoVe.
using Embeddings #downloads pre-trained vectors of word embeddings
using LMDiskANN #this package added via, add https://github.com/mantzaris/LMDiskANN.jl
using Distances, LinearAlgebra, Random, Printf
tab = load_embeddings(GloVe{:en}, 2; max_vocab_size = 10_000)
words = tab.vocab
vectors = tab.embeddings #100 × 10 000 Float32
(dim, N) = size(vectors)
@info "Loaded $N by $dim GloVe vectors"
word2id = Dict(w => i for (i, w) in enumerate(words)) #helper function
vectors ./= sqrt.(sum(abs2, vectors; dims = 1)) #normalisation
idx = create_index("glove_demo", dim; metric = CosineDist()) #<- create the store
for (vec, w) in zip(eachcol(vectors), words) #insert into the lmdiskann the (vector, word)
ann_insert!(idx, vec; key = w) #<- insert into the store
end
@info "Index built with $(idx.num_points) points"
nearest(w; k = 5) = let id = word2id[w]
hits = search(idx, vectors[:, id], topk = k + 1) #<- search the store, first high result would be the word itself
[words[h[2]] for h in hits[2:end]]
end
for w in ("king", "queen", "paris", "cat", "coffee")
@printf(" %-6s -> %s\n", w, join(nearest(w), ", "))
end
# prints eg:
# king -> prince, queen, son, brother, monarch
# queen -> princess, king, elizabeth, royal, lady
# paris -> france, london, brussels, french, rome
# cat -> dog, cats, pet, dogs, mouse
# coffee -> tea, drinks, beer, wine, drink
#classic analogy for vectors
v = vectors[:, word2id["king"]] .- vectors[:, word2id["man"]] .+ vectors[:, word2id["woman"]]
v ./= norm(v)
ignore = Set(["king", "man", "woman"]) #skip seed words
for (_key, id) in search(idx, v, topk = 10)
w = words[id]
w ∉ ignore && (println("king - man + woman -> ", w); break)
end
#prints:
# king - man + woman -> queenContributing
We welcome contributions from the community! To contribute to this project:
- Fork the repository and create your branch from
main. - Write clear, modular code and follow existing code conventions.
- Document your changes clearly in code and, if needed, in the
README.md. - Test your changes thoroughly.
- Submit a Pull Request with a clear description of what you've added or changed.
Or for general ideas please feel free to share them.
Reporting Issues
Found a bug or an issue? Help us improve by reporting it. Please:
- Open a new GitHub Issue.
- Include a clear and descriptive title.
- Provide steps to reproduce the problem, expected vs. actual behavior, and your system/environment details.
- Add relevant logs, screenshots, or error messages to help diagnose the issue.
Functions
LMDiskANN.ann_delete!LMDiskANN.ann_insert!LMDiskANN.clear_all_databases!LMDiskANN.close_databasesLMDiskANN.close_id_mappingLMDiskANN.create_indexLMDiskANN.delete_by_id!LMDiskANN.delete_by_key!LMDiskANN.get_embedding_from_idLMDiskANN.get_embedding_from_keyLMDiskANN.get_id_from_keyLMDiskANN.get_key_from_idLMDiskANN.insert_key!LMDiskANN.list_all_keysLMDiskANN.load_indexLMDiskANN.open_databasesLMDiskANN.search
LMDiskANN.ann_delete! — Methodann_delete!(index::LMDiskANNIndex{T}, node_id::Union{Int,String}) where {T<:AbstractFloat}Delete a vector (and adjacency) from the index. Implements Algorithm 3 from the LM-DiskANN paper.
Arguments
index::LMDiskANNIndex: The index to delete fromnode_id::Int: The ID of the vector to delete
Returns
LMDiskANNIndex: The updated index instance
Example
LMDiskANN.ann_delete!(index, id)LMDiskANN.ann_insert! — Methodann_insert!(index::LMDiskANNIndex{T},
new_vec::AbstractVector{<:AbstractFloat};
key::Union{Nothing,String}=nothing) where {T<:AbstractFloat}Insert a new vector into the index. Updates the adjacency structure. Returns the assigned ID of the newly inserted vector. Implements Algorithm 2 from the LM-DiskANN paper.
Arguments
index::LMDiskANNIndex{T}: The index to insert intonew_vec::Vector{<:AbstractFloat}: The vector to insert
Optional Arguments
key:: String: the key the user wants to associate with the new vector
Returns
(key,Int): The tuple of the key (string) and ID (int) assigned to the inserted vector
Example
(key,id) = LMDiskANN.ann_insert!(index, vector)LMDiskANN.clear_all_databases! — Methodclear_all_databases!(db_forward, db_reverse)Removes all entries from both forward and reverse databases.
Arguments
db_forward: Forward mapping database handledb_reverse: Reverse mapping database handle
LMDiskANN.close_databases — Methodclose_databases(db_forward, db_reverse)Closes both databases.
LMDiskANN.close_id_mapping — Methodclose_id_mapping(index::LMDiskANNIndex{T})Closes the user-key DB if open.
LMDiskANN.create_index — Methodcreate_index(path_prefix::String, dim::Int; T::Type=Float32, maxdegree::Int=DEFAULT_MAX_DEGREE, metric::PreMetric=Euclidean())Creates a brand new LM-DiskANN index on disk with the given dimension, storing to files: path_prefix.vec, path_prefix.adj, path_prefix.meta.
Arguments
path_prefix::String: Prefix for the index files (without extension)dim::Int: Dimensionality of the vectors to be indexed
Optional arguments
= T::Type=Float32: the typer for the embedding vectors
maxdegree::Int=DEFAULT_MAX_DEGREE: Maximum number of neighbors per nodemetric::PreMetric which is default Euclidean but can be a metric from Distances.jl
Returns
LMDiskANNIndex: A new index instance
Example
index = LMDiskANN.create_index("my_index", 128)LMDiskANN.delete_by_id! — Methoddelete_by_id!(db_forward, db_reverse, internal_id::Int)Deletes the entry for internal_id in db_reverse, and also deletes the corresponding entry in db_forward.
LMDiskANN.delete_by_key! — Methoddelete_by_key!(db_forward, db_reverse, user_key::String)Deletes the entry for user_key in db_forward, and also deletes the corresponding entry in db_reverse.
LMDiskANN.get_embedding_from_id — Methodget_embedding_from_id(index::LMDiskANNIndex{T}, id::Int)::Vector{T} where {T<:AbstractFloat}Given a 1-based ID (the same ID you'd see returned by ann_insert! or in search), retrieve the stored embedding vector from index.vecs.
Throws an error if id is out of range or if it's already deleted (i.e., in the freelist).
LMDiskANN.get_embedding_from_key — Methodget_embedding_from_key(index::LMDiskANNIndex{T}, key::String)::Vector{T} where {T<:AbstractFloat}ARgument is a string key (which was stored during ann_insert!(..., key=...)), look up the 1-based ID from the forward DB, then retrieve the embedding.
Throws an error if the key doesn't exist
LMDiskANN.get_id_from_key — Methodget_id_from_key(db_forward, user_key::String) -> Union{Int, Nothing}Fetches the internalid associated with `userkeyfromdb_forward. Returnsnothing` if not found.
LMDiskANN.get_key_from_id — Methodget_key_from_id(db_reverse, internal_id::Int) -> Union{String, Nothing}Fetches the userkey from `dbreversegiven the internal_id. Returnsnothing` if not found.
LMDiskANN.insert_key! — Methodinsert_key!(db_forward, db_reverse, user_key::String, internal_id::Int)Inserts user_key -> internal_id into db_forward, and internal_id -> user_key into db_reverse.
Both must be done to keep them in sync.
LMDiskANN.list_all_keys — Methodlist_all_keys(db) -> Vector{String}Returns a list of all keys in the database as strings.
Arguments
db: LevelDB database handle
Returns
Vector{String}: List of all keys in the database
LMDiskANN.load_index — Methodload_index(path_prefix::String; metric::PreMetric=Euclidean())Loads an existing index, eltype and the original metric are read from the meta-file automatically. You can also specify the metric which is by default Euclidean from Distances.jl
LMDiskANN.open_databases — Methodopen_databases(forward_path::String, reverse_path::String; create_if_missing=true)Opens (or creates) two LevelDB databases:
- dbforward: stringkey -> stringofinternal_id
- dbreverse: stringofinternalid -> string_key
Returns a tuple (dbforward, dbreverse).
LMDiskANN.search — Methodsearch(index::LMDiskANNIndex{T},
query_vec::AbstractVector{<:AbstractFloat};
topk::Int=10, ef=DEFAULT_EF_SEARCH) where {T<:AbstractFloat}Returns top-k approximate nearest neighbors for query_vec.
Arguments
index::LMDiskANNIndex: The index to searchquery_vec::AbstractVector{Float32}: The query vectortopk::Int=10: Number of nearest neighbors to returnef::Int=DEFAULT_EF_SEARCH(size of the candidate list explored during the graph, largerefmeans higher recall but slower queries) andefis automatically promoted to at leasttopk.
Returns
Vector{Tuple{Union{String,Nothing}, Int}}: a list of (key, id) pairs for the top-k nearest neighbours
Example
results = LMDiskANN.search(index, query_vec, topk=5)