Migrate full-text + KNN search from raw Lucene to Hibernate Search 8.x; add graph-semantic search (!41) · Merge requests · Carsten Hammer / Taxonomy

The project used manual ByteBuffersDirectory + IndexWriter/IndexSearcher for full-text search and a separate raw Lucene KNN index for vector search. Node embeddings captured only name/description and ignored relation structure. This merge request replaces both indexes with Hibernate Search 8.2.0.Final (ORM mapper + Lucene backend) and adds a graph-semantic search endpoint that queries both the node and relation indexes.

Dependencies

Add hibernate-search-mapper-orm:8.2.0.Final + hibernate-search-backend-lucene:8.2.0.Final (HS 8.x targets Hibernate ORM 7.x; HS 7.x targets ORM 6.x — counterintuitive versioning)
Upgrade Lucene 9.11.1 → 9.12.3 to match HS 8.2 transitive requirement

Entity indexing

TaxonomyNode: @Indexed, @FullTextField(analyzer="english"/"german") on name/description, @KeywordField on code/uuid/externalId, @GenericField on taxonomyRoot/level/parentCode, @TypeBinding(NodeEmbeddingBinder) for the vector field
TaxonomyRelation: @Indexed, @FullTextField on description, @KeywordField on relationType, @IndexedEmbedded on source/target nodes, @TypeBinding(RelationEmbeddingBinder) for its vector field

Embedding bridges

NodeEmbeddingBinder and RelationEmbeddingBinder are Hibernate Search TypeBinder implementations that compute DJL/ONNX embeddings at index time via SpringContextHolder. Node enriched text now includes relation summaries:

Business Process Management.
Outgoing: supports Communication Requirements, supports Network Planning.
Incoming: depends_on Infrastructure Services.

Relation enriched text: "{sourceName} {relationType} {targetName}. {description}". Both degrade gracefully when the DJL model is unavailable.
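
A minimal sketch of the two enriched-text formats described above, using only the JDK. The `EnrichedText` class, the `Rel` record, and both method names are hypothetical stand-ins for illustration, not the project's entities or binders. Omitting an empty Outgoing/Incoming section and a blank description are assumptions.

```java
import java.util.List;
import java.util.stream.Collectors;

class EnrichedText {
    // Hypothetical stand-in for one relation edge: its type and the name of the other node.
    record Rel(String type, String otherName) {}

    /** Builds "Name." followed by optional "Outgoing: ..." and "Incoming: ..." lines. */
    static String nodeText(String name, List<Rel> outgoing, List<Rel> incoming) {
        StringBuilder sb = new StringBuilder(name).append('.');
        appendSection(sb, "Outgoing", outgoing);
        appendSection(sb, "Incoming", incoming);
        return sb.toString();
    }

    // Appends "\nLabel: type name, type name." and skips empty sections (assumption).
    private static void appendSection(StringBuilder sb, String label, List<Rel> rels) {
        if (rels.isEmpty()) {
            return;
        }
        String joined = rels.stream()
                .map(r -> r.type() + " " + r.otherName())
                .collect(Collectors.joining(", "));
        sb.append('\n').append(label).append(": ").append(joined).append('.');
    }

    /** Builds "{sourceName} {relationType} {targetName}. {description}". */
    static String relationText(String source, String type, String target, String description) {
        String head = source + " " + type + " " + target + ".";
        // A missing description is dropped rather than rendered as "null" (assumption).
        return (description == null || description.isBlank()) ? head : head + " " + description;
    }
}
```

The resulting string is what would be passed to the embedding model, so the relation structure becomes visible to the KNN index.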

Service refactoring

SearchService: replace MultiFieldQueryParser + manual index with SearchSession f.match() / f.wildcard() queries
LocalEmbeddingService: replace raw KnnFloatVectorField/KnnFloatVectorQuery with f.knn(k).field("embedding").matching(queryVector). scoreNodes() now uses a score projection (f.composite(f.entity(), f.score())) to derive accurate 0–100% cosine percentages instead of a hardcoded approximation
TaxonomyService: remove searchService.buildIndex() and localEmbeddingService.invalidateVectorIndex() — Hibernate Search auto-indexes on JPA persist
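
One possible way to turn a projected score into a percentage, sketched with plain JDK code. It assumes the vector field uses Lucene's COSINE similarity, whose score is normalized as (1 + cos) / 2. The exact mapping used in scoreNodes() is not stated here, so the class name, method names, and the choice to clamp negative similarity to 0% are all assumptions.

```java
class CosineScore {
    /** Inverts Lucene's COSINE normalization, score = (1 + cos) / 2, to recover the raw cosine. */
    static double cosineFromScore(float score) {
        return 2.0 * score - 1.0;
    }

    /** Maps a cosine value to 0..100; negative similarity is reported as 0% (assumption). */
    static int toPercent(double cosine) {
        double clamped = Math.max(0.0, Math.min(1.0, cosine));
        return (int) Math.round(clamped * 100.0);
    }
}
```

With this mapping a score of 0.75 corresponds to a cosine of 0.5 and is reported as 50%.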

Graph-semantic search (`GET /api/search/graph?q=&maxResults=20`)

Queries both TaxonomyNode and TaxonomyRelation KNN indexes, aggregates relation hits by taxonomy root and type, and returns:

{
  "matchedNodes": [...],
  "relationCountByRoot": {"BP": 12, "CO": 5},
  "topRelationTypes": {"SUPPORTS": 7, "DEPENDS_ON": 3},
  "summary": "BP has the most matching relationships (12). Most common relation type: supports (7)"
}
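
The aggregation step behind relationCountByRoot, topRelationTypes, and summary could look roughly like the sketch below. `GraphSearchSummary` and `RelationHit` are hypothetical names. Attributing a relation to the taxonomy root of its source node and the wording of the empty-result message are assumptions, not taken from the implementation.

```java
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;

class GraphSearchSummary {
    // Hypothetical projection of one relation KNN hit.
    record RelationHit(String sourceRoot, String relationType) {}

    /** Counts hits per key, e.g. per taxonomy root or per relation type. */
    static Map<String, Long> countBy(List<RelationHit> hits, Function<RelationHit, String> key) {
        return hits.stream()
                .collect(Collectors.groupingBy(key, TreeMap::new, Collectors.counting()));
    }

    /** Builds the one-line summary from the two count maps. */
    static String summarize(Map<String, Long> byRoot, Map<String, Long> byType) {
        Map.Entry<String, Long> topRoot = top(byRoot);
        Map.Entry<String, Long> topType = top(byType);
        if (topRoot == null || topType == null) {
            return "No matching relationships."; // wording is an assumption
        }
        return topRoot.getKey() + " has the most matching relationships (" + topRoot.getValue()
                + "). Most common relation type: "
                + topType.getKey().toLowerCase(Locale.ROOT) + " (" + topType.getValue() + ")";
    }

    // Returns the entry with the highest count, or null for an empty map.
    private static Map.Entry<String, Long> top(Map<String, Long> counts) {
        return counts.entrySet().stream().max(Map.Entry.comparingByValue()).orElse(null);
    }
}
```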

Configuration

HibernateSearchAnalysisConfigurer registers English/German Lucene analyzers via LuceneAnalysisConfigurer
SpringContextHolder provides ApplicationContext access from HS bridges (non-Spring-managed)
application.properties: backend.type=lucene, directory.type=local-heap (equivalent to prior ByteBuffersDirectory)

Tests

16 new tests: 7 in SemanticSearchTests (HS query behavior, enriched text builders) + 9 in GraphSearchTests (endpoint + service)
Fixed hybridSearchFallsBackToFullTextWhenEmbeddingNotLoaded — was comparing TaxonomyNodeDto by object reference; now compares by code list
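
The reference-comparison pitfall can be reproduced with a small JDK-only sketch. `NodeDto` here is a hypothetical DTO without equals()/hashCode(), assumed to mirror how TaxonomyNodeDto behaved in the failing test. Comparing the extracted code lists sidesteps object identity.

```java
import java.util.List;

class DtoComparison {
    // Hypothetical DTO that inherits Object.equals(), so equality means "same instance".
    static final class NodeDto {
        private final String code;

        NodeDto(String code) {
            this.code = code;
        }

        String getCode() {
            return code;
        }
    }

    /** Extracts the codes in order so two result lists can be compared by value. */
    static List<String> codes(List<NodeDto> nodes) {
        return nodes.stream().map(NodeDto::getCode).toList();
    }
}
```

Two result lists holding distinct DTO instances with the same codes are unequal as lists of DTOs but equal once reduced to their codes.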

Warning

Firewall rules blocked me from connecting to one or more addresses

I tried to connect to the following addresses, but was blocked by firewall rules:

mlrepo.djl.ai
- Triggering command: /usr/lib/jvm/temurin-17-jdk-amd64/bin/java /usr/lib/jvm/temurin-17-jdk-amd64/bin/java -jar /home/REDACTED/work/Taxonomy/Taxonomy/target/surefire/surefirebooter-20260306225441964_3.jar /home/REDACTED/work/Taxonomy/Taxonomy/target/surefire 2026-03-06T22-54-41_497-jvmRun1 surefire-20260306225441964_1tmp surefire_0-20260306225441964_2tmp (dns block)

If you need me to access, download, or install something from one of these locations, you can either:

Configure Actions setup steps to set up my environment, which run before the firewall is enabled
Add the appropriate URLs or hosts to the custom allowlist in this repository's Copilot coding agent settings (admins only)

Original prompt

Context

The Taxonomy project currently uses raw Lucene directly for both full-text search (SearchService) and KNN vector search (LocalEmbeddingService). This involves manual ByteBuffersDirectory management, manual IndexWriter/IndexSearcher lifecycle, and manual document construction. Meanwhile, the sandbox project (sandbox-jgit-storage-hibernate) uses Hibernate Search annotations (@Indexed, @FullTextField, @VectorField, @KeywordField) and its SearchSession API which handles all of this automatically.

Additionally, the current LOCAL_ONNX embeddings only capture individual node names/descriptions but not the graph relationships (TaxonomyRelation). This means the DJL/ONNX embedding model cannot answer graph-semantic questions like "which Business Processes are supported the most?" because the relation structure is invisible to the embedding index.

Requirements

1. Migrate from raw Lucene to Hibernate Search

Add Hibernate Search dependency to pom.xml:

hibernate-search-mapper-orm (the ORM mapper)
hibernate-search-backend-lucene (Lucene backend)

Annotate TaxonomyNode entity with Hibernate Search annotations:

@Indexed on the class
@FullTextField(analyzer = "english") on nameEn and descriptionEn
@FullTextField(analyzer = "german") on nameDe and descriptionDe
@KeywordField on code, uuid, externalId
@VectorField(dimension = 384, similarityFunction = VectorSimilarityFunction.COSINE) for the embedding vector
@GenericField on taxonomyRoot, level, parentCode for filtering

Annotate TaxonomyRelation entity similarly:

@Indexed on the class
@FullTextField for an enrichedText transient field that serializes the relation as natural language (e.g. "Business Process Management supports Communication Requirements")
@VectorField(dimension = 384) for the relation embedding vector
@KeywordField on sourceNode code, targetNode code, relationType for filtering

Configure Hibernate Search in application.properties:

spring.jpa.properties.hibernate.search.backend.type=lucene
spring.jpa.properties.hibernate.search.backend.directory.type=local-heap (in-memory, like current ByteBuffersDirectory)
Configure custom analyzers for English/German via LuceneAnalysisConfigurer (can reuse the logic from TaxonomyAnalysisConfigurer)

Refactor SearchService:

Replace manual ByteBuffersDirectory + IndexWriter + MultiFieldQueryParser with Hibernate Search's SearchSession
Use searchSession.search(TaxonomyNode.class).where(f -> f.match().fields("nameEn", "descriptionEn", "nameDe", "descriptionDe").matching(queryString)) for full-text
Use f.match().field("code").matching(queryString) for keyword matches
Remove the manual buildIndex() call from TaxonomyService.loadTaxonomyFromExcel() — Hibernate Search auto-indexes on persist. Instead, trigger a mass indexer after the initial @PostConstruct data load: searchSession.massIndexer(TaxonomyNode.class, TaxonomyRelation.class).startAndWait()
Remove TaxonomyAnalysisConfigurer (its logic moves into the Hibernate Search LuceneAnalysisConfigurer bean)

Refactor LocalEmbeddingService:

Replace the manual KNN vector directory with Hibernate Search's @VectorField + f.knn() predicates
The buildVectorIndex() method is no longer needed — vectors are stored as part of the Hibernate Search index
Use a @Transient field + @IndexingDependency(derivedFrom = ...) or a custom PropertyBridge/ValueBridge to compute the embedding vector at index time using DJL
scoreNodes() becomes a SearchSession query: f.knn(k).field("embedding").matching(queryVector).filter(f.bool().should(f.match().field("code").matching(code1)).should(...))
semanticSearch() becomes: f.knn(topK).field("embedding").matching(queryVector)
findSimilarNodes() uses the same pattern with a filter to exclude the source node

Refactor HybridSearchService:

Can now combine full-text and KNN in a single Hibernate Search query using f.bool() with f.match() and f.knn() predicates, or keep the RRF approach
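
If the RRF approach is kept, the fusion itself needs nothing from the search backend. Below is a minimal sketch of reciprocal rank fusion over two ranked lists of node codes. The class name is hypothetical, and k = 60 in the usage note is the conventional constant, not necessarily the value used by HybridSearchService.

```java
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class ReciprocalRankFusion {
    /**
     * Fuses ranked lists of codes. Each list contributes 1 / (k + rank) per code,
     * with rank starting at 1; codes are returned by descending fused score.
     */
    static List<String> fuse(List<List<String>> rankings, int k) {
        Map<String, Double> scores = new LinkedHashMap<>();
        for (List<String> ranking : rankings) {
            for (int i = 0; i < ranking.size(); i++) {
                scores.merge(ranking.get(i), 1.0 / (k + i + 1), Double::sum);
            }
        }
        // The sort is stable, so codes with equal scores keep first-seen order.
        return scores.entrySet().stream()
                .sorted(Map.Entry.<String, Double>comparingByValue(Comparator.reverseOrder()))
                .map(Map.Entry::getKey)
                .toList();
    }
}
```

For example, fusing a full-text ranking [A, B, C] with a KNN ranking [B, D, A] at k = 60 puts B first, because it ranks high in both lists, followed by A, D, and C.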

2. Enrich node embeddings with relation data

Modify the node text used for embedding — when computing the embedding vector for a TaxonomyNode, build an enriched text string that includes the node's relations:

Business Process Management.
NATO C3 Taxonomy – Business Processes.
Outgoing: supports Communication Requirements, supports Network Planning.
Incoming: depends_on Infrastructure Services, uses Command Operations.

This should be done in the embedding computation (the PropertyBridge or ValueBridge for the @VectorField), using the node's outgoingRelations and incomingRelations JPA associations.

Index relations as separate documents — each TaxonomyRelation gets its own Hibernate Search document with:

An enriched text field: "{sourceName} {relationType} {targetName}. {description}" -...

This pull request was created from Copilot chat.
