LMDiskANN

LM-DiskANN is an approximate nearest‐neighbor indexing library that builds and maintains a disk‐resident graph structure for high‐dimensional data. It aims to provide a balance between fast search times, memory efficiency, and support for incremental updates (insertion and deletion). Under the hood, LM-DiskANN stores vector embeddings in memory‐mapped files, limiting in‐memory usage while preserving quick random access. It also leverages a graph-based BFS expansion to find candidate neighbors, with user‐configurable parameters like search expansion factor (EF_SEARCH) and maximum neighborhood size (maxdegree).

"LM-diskann: Low memory footprint in disk-native dynamic graph-based ann indexing." Pan, Yu, Jianxin Sun, and Hongfeng Yu, 2023 IEEE International Conference on Big Data (BigData). IEEE, 2023.

It offers the optional user‐key mapping, allowing users to associate string keys with each embedding. This means you can insert vectors with a custom ID (e.g., "image001"), search for nearest neighbors, then retrieve or delete by either the numeric internal ID or the user key. The library includes utility functions to open, close, and clear these user‐key databases, enabling a more flexible integration with real-world applications where data might be identified by strings or external IDs.

Example Usage with Synthetic Data

Below is a short script you can run to see LM-DiskANN in action. We’ll:

  1. Create an index for dim=5 vectors.
  2. Insert a few vectors—some with user keys, some without.
  3. Search for a known vector.
  4. Delete by either numeric ID or user key.
using LMDiskANN  
using Random

# 1 make an index with dimension 5
dim = 5
index = create_index("example_index", dim)

# 2 insert a couple of random vectors
vec1 = rand(Float32, 5)
(key1, id1) = ann_insert!(index, vec1)
@show key1, id1   # e.g. might be ("1", 1) if no user key was specified

# insert a vector with a user key
vec2 = rand(Float32, 5)
(user_key, id2) = ann_insert!(index, vec2; key="my_special_vector")
@show user_key, id2  # e.g. ("my_special_vector", 2)

# 3 search for vec1
results = search(index, vec1, topk=3)  # returns an array of (maybeKey, ID)
println("Search results for vec1 = ", results)

# 4 retrieve the actual embedding from the index by numeric ID
retrieved_vec1 = get_embedding_from_id(index, id1)
println("retrieved_vec1 == vec1 ? ", all(retrieved_vec1 .== vec1))

# retrieve by user key
retrieved_vec2 = get_embedding_from_key(index, "my_special_vector")
println("retrieved_vec2 == vec2 ? ", all(retrieved_vec2 .== vec2))

# 5 delete an entry
ann_delete!(index, id1) # by numeric ID
ann_delete!(index, "my_special_vector") # by user key

# 6 confirm they're gone (search or retrieval should fail)
println("After deletions, search(vec1) => ", search(index, vec1, topk=3))
try
    get_embedding_from_id(index, id1)
catch e
    println("Caught an expected error when retrieving a deleted vector: ", e)
end

Example real data embeddings from GLoVe and usage

Build a 10 000-word cosine LM-DiskANN index and explore semantic neighbours using embedding vectors from GLoVe.

using Embeddings #downloads pre-trained vectors of word embeddings
using LMDiskANN #this package added via, add https://github.com/mantzaris/LMDiskANN.jl
using Distances, LinearAlgebra, Random, Printf

tab = load_embeddings(GloVe{:en}, 2; max_vocab_size = 10_000)
words = tab.vocab
vectors = tab.embeddings #100 × 10 000 Float32
(dim, N) = size(vectors)
@info "Loaded $N by $dim GloVe vectors"

word2id = Dict(w => i for (i, w) in enumerate(words)) #helper function
vectors ./= sqrt.(sum(abs2, vectors; dims = 1)) #normalisation

idx = create_index("glove_demo", dim; metric = CosineDist()) #<- create the store

for (vec, w) in zip(eachcol(vectors), words)  #insert into the lmdiskann the (vector, word)
    ann_insert!(idx, vec; key = w) #<- insert into the store
end
@info "Index built with $(idx.num_points) points"


nearest(w; k = 5) = let id = word2id[w]
    hits = search(idx, vectors[:, id], topk = k + 1) #<- search the store, first high result would be the word itself
    [words[h[2]] for h in hits[2:end]]
end

for w in ("king", "queen", "paris", "cat", "coffee")
           @printf("  %-6s -> %s\n", w, join(nearest(w), ", "))
       end
# prints eg:
#  king   -> prince, queen, son, brother, monarch
#  queen  -> princess, king, elizabeth, royal, lady
#  paris  -> france, london, brussels, french, rome
#  cat    -> dog, cats, pet, dogs, mouse
#  coffee -> tea, drinks, beer, wine, drink

#classic analogy for vectors

v = vectors[:, word2id["king"]] .- vectors[:, word2id["man"]] .+ vectors[:, word2id["woman"]]
v ./= norm(v) 


ignore = Set(["king", "man", "woman"]) #skip seed words
for (_key, id) in search(idx, v, topk = 10)
    w = words[id]
    w ∉ ignore && (println("king - man + woman -> ", w); break)
end

#prints:
# king - man + woman -> queen

Contributing

We welcome contributions from the community! To contribute to this project:

  1. Fork the repository and create your branch from main.
  2. Write clear, modular code and follow existing code conventions.
  3. Document your changes clearly in code and, if needed, in the README.md.
  4. Test your changes thoroughly.
  5. Submit a Pull Request with a clear description of what you've added or changed.

Or for general ideas please feel free to share them.


Reporting Issues

Found a bug or an issue? Help us improve by reporting it. Please:

  • Open a new GitHub Issue.
  • Include a clear and descriptive title.
  • Provide steps to reproduce the problem, expected vs. actual behavior, and your system/environment details.
  • Add relevant logs, screenshots, or error messages to help diagnose the issue.

Functions

LMDiskANN.ann_delete!Method
ann_delete!(index::LMDiskANNIndex{T}, node_id::Union{Int,String}) where {T<:AbstractFloat}

Delete a vector (and adjacency) from the index. Implements Algorithm 3 from the LM-DiskANN paper.

Arguments

  • index::LMDiskANNIndex: The index to delete from
  • node_id::Int: The ID of the vector to delete

Returns

  • LMDiskANNIndex: The updated index instance

Example

LMDiskANN.ann_delete!(index, id)
source
LMDiskANN.ann_insert!Method
ann_insert!(index::LMDiskANNIndex{T},
                 new_vec::AbstractVector{<:AbstractFloat};
                 key::Union{Nothing,String}=nothing) where {T<:AbstractFloat}

Insert a new vector into the index. Updates the adjacency structure. Returns the assigned ID of the newly inserted vector. Implements Algorithm 2 from the LM-DiskANN paper.

Arguments

  • index::LMDiskANNIndex{T}: The index to insert into
  • new_vec::Vector{<:AbstractFloat}: The vector to insert

Optional Arguments

  • key:: String: the key the user wants to associate with the new vector

Returns

  • (key,Int): The tuple of the key (string) and ID (int) assigned to the inserted vector

Example

(key,id) = LMDiskANN.ann_insert!(index, vector)
source
LMDiskANN.clear_all_databases!Method
clear_all_databases!(db_forward, db_reverse)

Removes all entries from both forward and reverse databases.

Arguments

  • db_forward: Forward mapping database handle
  • db_reverse: Reverse mapping database handle
source
LMDiskANN.create_indexMethod
create_index(path_prefix::String, dim::Int; T::Type=Float32, maxdegree::Int=DEFAULT_MAX_DEGREE, metric::PreMetric=Euclidean())

Creates a brand new LM-DiskANN index on disk with the given dimension, storing to files: path_prefix.vec, path_prefix.adj, path_prefix.meta.

Arguments

  • path_prefix::String: Prefix for the index files (without extension)
  • dim::Int: Dimensionality of the vectors to be indexed

Optional arguments

= T::Type=Float32: the typer for the embedding vectors

  • maxdegree::Int=DEFAULT_MAX_DEGREE: Maximum number of neighbors per node
  • metric::PreMetric which is default Euclidean but can be a metric from Distances.jl

Returns

  • LMDiskANNIndex: A new index instance

Example

index = LMDiskANN.create_index("my_index", 128)
source
LMDiskANN.delete_by_id!Method
delete_by_id!(db_forward, db_reverse, internal_id::Int)

Deletes the entry for internal_id in db_reverse, and also deletes the corresponding entry in db_forward.

source
LMDiskANN.delete_by_key!Method
delete_by_key!(db_forward, db_reverse, user_key::String)

Deletes the entry for user_key in db_forward, and also deletes the corresponding entry in db_reverse.

source
LMDiskANN.get_embedding_from_idMethod
get_embedding_from_id(index::LMDiskANNIndex{T}, id::Int)::Vector{T} where {T<:AbstractFloat}

Given a 1-based ID (the same ID you'd see returned by ann_insert! or in search), retrieve the stored embedding vector from index.vecs.

Throws an error if id is out of range or if it's already deleted (i.e., in the freelist).

source
LMDiskANN.get_embedding_from_keyMethod
get_embedding_from_key(index::LMDiskANNIndex{T}, key::String)::Vector{T} where {T<:AbstractFloat}

ARgument is a string key (which was stored during ann_insert!(..., key=...)), look up the 1-based ID from the forward DB, then retrieve the embedding.

Throws an error if the key doesn't exist

source
LMDiskANN.get_id_from_keyMethod
get_id_from_key(db_forward, user_key::String) -> Union{Int, Nothing}

Fetches the internalid associated with `userkeyfromdb_forward. Returnsnothing` if not found.

source
LMDiskANN.get_key_from_idMethod
get_key_from_id(db_reverse, internal_id::Int) -> Union{String, Nothing}

Fetches the userkey from `dbreversegiven the internal_id. Returnsnothing` if not found.

source
LMDiskANN.insert_key!Method
insert_key!(db_forward, db_reverse, user_key::String, internal_id::Int)

Inserts user_key -> internal_id into db_forward, and internal_id -> user_key into db_reverse.

Both must be done to keep them in sync.

source
LMDiskANN.list_all_keysMethod
list_all_keys(db) -> Vector{String}

Returns a list of all keys in the database as strings.

Arguments

  • db: LevelDB database handle

Returns

  • Vector{String}: List of all keys in the database
source
LMDiskANN.load_indexMethod
load_index(path_prefix::String; metric::PreMetric=Euclidean())

Loads an existing index, eltype and the original metric are read from the meta-file automatically. You can also specify the metric which is by default Euclidean from Distances.jl

source
LMDiskANN.open_databasesMethod
open_databases(forward_path::String, reverse_path::String; create_if_missing=true)

Opens (or creates) two LevelDB databases:

  • dbforward: stringkey -> stringofinternal_id
  • dbreverse: stringofinternalid -> string_key

Returns a tuple (dbforward, dbreverse).

source
LMDiskANN.searchMethod
search(index::LMDiskANNIndex{T},
            query_vec::AbstractVector{<:AbstractFloat};
            topk::Int=10, ef=DEFAULT_EF_SEARCH) where {T<:AbstractFloat}

Returns top-k approximate nearest neighbors for query_vec.

Arguments

  • index::LMDiskANNIndex: The index to search
  • query_vec::AbstractVector{Float32}: The query vector
  • topk::Int=10: Number of nearest neighbors to return
  • ef::Int=DEFAULT_EF_SEARCH (size of the candidate list explored during the graph, larger ef means higher recall but slower queries) and ef is automatically promoted to at least topk.

Returns

  • Vector{Tuple{Union{String,Nothing}, Int}} : a list of (key, id) pairs for the top-k nearest neighbours

Example

results = LMDiskANN.search(index, query_vec, topk=5)
source