Migrate full-text + KNN search from raw Lucene to Hibernate Search 8.x; add graph-semantic search
The project used manual ByteBuffersDirectory + IndexWriter/IndexSearcher for full-text search and a separate raw Lucene KNN index for vector search. Node embeddings only captured name/description, ignoring relation structure. This replaces both with Hibernate Search 8.2.0.Final (ORM mapper + Lucene backend) and adds a graph-semantic search endpoint that queries both node and relation indexes.
Dependencies
- Add hibernate-search-mapper-orm:8.2.0.Final + hibernate-search-backend-lucene:8.2.0.Final (HS 8.x targets Hibernate ORM 7.x, HS 7.x targets ORM 6.x; the versioning is counterintuitive)
- Upgrade Lucene 9.11.1 → 9.12.3 to match HS 8.2 transitive requirement
Entity indexing
- TaxonomyNode: @Indexed, @FullTextField(analyzer="english"/"german") on name/description, @KeywordField on code/uuid/externalId, @GenericField on taxonomyRoot/level/parentCode, @TypeBinding(NodeEmbeddingBinder) for the vector field
- TaxonomyRelation: @Indexed, @FullTextField on description, @KeywordField on relationType, @IndexedEmbedded on source/target nodes, @TypeBinding(RelationEmbeddingBinder) for its vector field
Embedding bridges
NodeEmbeddingBinder and RelationEmbeddingBinder are Hibernate Search TypeBinder implementations that compute DJL/ONNX embeddings at index time via SpringContextHolder. Node enriched text now includes relation summaries:
Business Process Management.
Outgoing: supports Communication Requirements, supports Network Planning.
Incoming: depends_on Infrastructure Services.
Relation enriched text: "{sourceName} {relationType} {targetName}. {description}". Both degrade gracefully when the DJL model is unavailable.
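The two enriched-text formats above can be sketched with plain string building. This is a minimal illustration, not the PR's actual binder code: the class `EnrichedTextSketch`, the `Edge` record, and both method names are hypothetical, and the handling of empty relation lists and blank descriptions is an assumption.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper; names are illustrative and not taken from the PR.
final class EnrichedTextSketch {

    // Minimal stand-in for one relation as seen from a node.
    record Edge(String relationType, String otherNodeName) {}

    // Builds "<name>." followed by optional "Outgoing: ..." and "Incoming: ..." lines.
    static String nodeText(String name, List<Edge> outgoing, List<Edge> incoming) {
        StringBuilder sb = new StringBuilder(name).append('.');
        appendEdges(sb, "Outgoing", outgoing);
        appendEdges(sb, "Incoming", incoming);
        return sb.toString();
    }

    // Assumption: a direction with no relations is simply omitted.
    private static void appendEdges(StringBuilder sb, String label, List<Edge> edges) {
        if (edges == null || edges.isEmpty()) {
            return;
        }
        String joined = edges.stream()
                .map(e -> e.relationType() + " " + e.otherNodeName())
                .collect(Collectors.joining(", "));
        sb.append('\n').append(label).append(": ").append(joined).append('.');
    }

    // Builds "{sourceName} {relationType} {targetName}. {description}".
    // Assumption: a missing or blank description is left out.
    static String relationText(String sourceName, String relationType,
                               String targetName, String description) {
        String head = sourceName + " " + relationType + " " + targetName + ".";
        return (description == null || description.isBlank()) ? head : head + " " + description;
    }
}
```

The resulting string is what would be passed to the embedding model at index time.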
Service refactoring
- SearchService: replace MultiFieldQueryParser + manual index with SearchSession f.match()/f.wildcard() queries
- LocalEmbeddingService: replace raw KnnFloatVectorField/KnnFloatVectorQuery with f.knn(k).field("embedding").matching(queryVector). scoreNodes() now uses score projection (f.composite(f.entity(), f.score())) to derive accurate 0–100% cosine percentages instead of a hardcoded approximation
- TaxonomyService: remove searchService.buildIndex() and localEmbeddingService.invalidateVectorIndex(), since Hibernate Search auto-indexes on JPA persist
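One way the projected score could be turned into a percentage is sketched below. It assumes the score is the one Lucene reports for cosine similarity, which is normalized as (1 + cosine) / 2 and therefore lies in [0, 1]. The class and method names are hypothetical, and the PR's exact mapping may differ.

```java
// Hypothetical helper; not the PR's actual scoreNodes() code.
final class CosineScoreSketch {

    // Assumption: score = (1 + cosine) / 2, so the raw cosine is recovered by inverting that.
    static double rawCosine(float score) {
        return 2.0 * score - 1.0;
    }

    // Maps a normalized score in [0, 1] to a whole-number percentage,
    // clamping values that drift slightly out of range due to float rounding.
    static int percent(float score) {
        double clamped = Math.max(0.0, Math.min(1.0, score));
        return (int) Math.round(clamped * 100.0);
    }
}
```

Whether the displayed percentage should be based on the normalized score or on the recovered raw cosine is a presentation choice; the sketch exposes both.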
Graph-semantic search (GET /api/search/graph?q=&maxResults=20)
Queries both TaxonomyNode and TaxonomyRelation KNN indexes, aggregates relation hits by taxonomy root and type, and returns:
{
  "matchedNodes": [...],
  "relationCountByRoot": {"BP": 12, "CO": 5},
  "topRelationTypes": {"SUPPORTS": 7, "DEPENDS_ON": 3},
  "summary": "BP has the most matching relationships (12). Most common relation type: supports (7)"
}
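The aggregation behind relationCountByRoot, topRelationTypes, and summary can be sketched as a simple count-and-max over the relation hits. This is an illustration under assumptions: `RelationHit`, the choice to count by the source node's root, the lowercasing of the relation type in the summary, and the empty-result message are all hypothetical and not confirmed by the PR.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

// Hypothetical aggregation helper; not the PR's actual service code.
final class GraphSearchAggregationSketch {

    // Minimal stand-in for one relation returned by the KNN query.
    // Assumption: hits are attributed to the taxonomy root of the source node.
    record RelationHit(String sourceRoot, String relationType) {}

    // Counts hits per key (root or relation type), preserving first-seen order.
    static Map<String, Integer> countBy(List<RelationHit> hits,
                                        Function<RelationHit, String> key) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (RelationHit hit : hits) {
            counts.merge(key.apply(hit), 1, Integer::sum);
        }
        return counts;
    }

    // Builds a summary sentence in the style of the example response.
    static String summary(Map<String, Integer> byRoot, Map<String, Integer> byType) {
        if (byRoot.isEmpty() || byType.isEmpty()) {
            return "No matching relationships."; // assumed wording for the empty case
        }
        Map.Entry<String, Integer> topRoot = top(byRoot);
        Map.Entry<String, Integer> topType = top(byType);
        return topRoot.getKey() + " has the most matching relationships (" + topRoot.getValue()
                + "). Most common relation type: "
                + topType.getKey().toLowerCase(Locale.ROOT) + " (" + topType.getValue() + ")";
    }

    private static Map.Entry<String, Integer> top(Map<String, Integer> counts) {
        return counts.entrySet().stream().max(Map.Entry.comparingByValue()).orElseThrow();
    }
}
```

Calling `countBy(hits, RelationHit::sourceRoot)` and `countBy(hits, RelationHit::relationType)` yields the two maps, which are then passed to `summary`.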
Configuration
- HibernateSearchAnalysisConfigurer registers English/German Lucene analyzers via LuceneAnalysisConfigurer
- SpringContextHolder provides ApplicationContext access from HS bridges (non-Spring-managed)
- application.properties: backend.type=lucene, directory.type=local-heap (equivalent to prior ByteBuffersDirectory)
Tests
- 16 new tests: 7 in SemanticSearchTests (HS query behavior, enriched text builders) + 9 in GraphSearchTests (endpoint + service)
- Fixed hybridSearchFallsBackToFullTextWhenEmbeddingNotLoaded: it was comparing TaxonomyNodeDto by object reference; it now compares by code list
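The test fix can be illustrated with a small sketch. It assumes the DTO does not override equals(), so two result lists holding separately constructed DTOs never compare equal even when they describe the same nodes. `NodeDto` and `codes` are hypothetical stand-ins for TaxonomyNodeDto and whatever helper the test actually uses.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical illustration; not the PR's actual test code.
final class DtoComparisonSketch {

    // Stand-in for TaxonomyNodeDto. Assumption: no equals()/hashCode() override,
    // so equality falls back to object identity.
    static final class NodeDto {
        private final String code;

        NodeDto(String code) {
            this.code = code;
        }

        String getCode() {
            return code;
        }
    }

    // Projects a result list to its node codes, which compare by value.
    static List<String> codes(List<NodeDto> nodes) {
        return nodes.stream().map(NodeDto::getCode).collect(Collectors.toList());
    }
}
```

Comparing `codes(expected)` with `codes(actual)` checks both membership and order without depending on object identity.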
Original prompt
Context
The Taxonomy project currently uses raw Lucene directly for both full-text search (SearchService) and KNN vector search (LocalEmbeddingService). This involves manual ByteBuffersDirectory management, manual IndexWriter/IndexSearcher lifecycle, and manual document construction. Meanwhile, the sandbox project (sandbox-jgit-storage-hibernate) uses Hibernate Search annotations (@Indexed, @FullTextField, @VectorField, @KeywordField) and its SearchSession API which handles all of this automatically.
Additionally, the current LOCAL_ONNX embeddings only capture individual node names/descriptions but not the graph relationships (TaxonomyRelation). This means the DJL/ONNX embedding model cannot answer graph-semantic questions like "which Business Processes are supported the most?" because the relation structure is invisible to the embedding index.
Requirements
1. Migrate from raw Lucene to Hibernate Search
Add Hibernate Search dependency to pom.xml:
- hibernate-search-mapper-orm (the ORM mapper)
- hibernate-search-backend-lucene (Lucene backend)
Annotate TaxonomyNode entity with Hibernate Search annotations:
- @Indexed on the class
- @FullTextField(analyzer = "english") on nameEn and descriptionEn
- @FullTextField(analyzer = "german") on nameDe and descriptionDe
- @KeywordField on code, uuid, externalId
- @VectorField(dimension = 384, similarityFunction = VectorSimilarityFunction.COSINE) for the embedding vector
- @GenericField on taxonomyRoot, level, parentCode for filtering
Annotate TaxonomyRelation entity similarly:
- @Indexed on the class
- @FullTextField for an enrichedText transient field that serializes the relation as natural language (e.g. "Business Process Management supports Communication Requirements")
- @VectorField(dimension = 384) for the relation embedding vector
- @KeywordField on sourceNode code, targetNode code, relationType for filtering
Configure Hibernate Search in application.properties:
- spring.jpa.properties.hibernate.search.backend.type=lucene
- spring.jpa.properties.hibernate.search.backend.directory.type=local-heap (in-memory, like current ByteBuffersDirectory)
- Configure custom analyzers for English/German via LuceneAnalysisConfigurer (can reuse the logic from TaxonomyAnalysisConfigurer)
Refactor SearchService:
- Replace manual ByteBuffersDirectory + IndexWriter + MultiFieldQueryParser with Hibernate Search's SearchSession
- Use searchSession.search(TaxonomyNode.class).where(f -> f.match().fields("nameEn", "descriptionEn", "nameDe", "descriptionDe").matching(queryString)) for full-text
- Use f.match().field("code").matching(queryString) for keyword matches
- Remove the manual buildIndex() call from TaxonomyService.loadTaxonomyFromExcel(); Hibernate Search auto-indexes on persist. Instead, trigger a mass indexer after the initial @PostConstruct data load: searchSession.massIndexer(TaxonomyNode.class, TaxonomyRelation.class).startAndWait()
- Remove TaxonomyAnalysisConfigurer (its logic moves into the Hibernate Search LuceneAnalysisConfigurer bean)
Refactor LocalEmbeddingService:
- Replace the manual KNN vector directory with Hibernate Search's @VectorField + f.knn() predicates
- The buildVectorIndex() method is no longer needed: vectors are stored as part of the Hibernate Search index
- Use a @Transient field + @IndexingDependency(derivedFrom = ...) or a custom PropertyBridge/ValueBridge to compute the embedding vector at index time using DJL
- scoreNodes() becomes a SearchSession query: f.knn(k).field("embedding").matching(queryVector).filter(f.bool().should(f.match().field("code").matching(code1)).should(...))
- semanticSearch() becomes: f.knn(topK).field("embedding").matching(queryVector)
- findSimilarNodes() uses the same pattern with a filter to exclude the source node
Refactor HybridSearchService:
- Can now combine full-text and KNN in a single Hibernate Search query using f.bool() with f.match() and f.knn() predicates, or keep the RRF approach
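For reference, the RRF (reciprocal rank fusion) option can be sketched as follows. Each result's fused score is the sum of 1 / (k + rank) over every ranked list it appears in, with 1-based ranks. The class name, the use of node codes as identifiers, and the smoothing constant (60 is a commonly used value) are assumptions; the project's existing RRF implementation may differ.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of reciprocal rank fusion over lists of node codes.
final class RrfSketch {

    // Fuses several rankings (best first) into one list ordered by descending RRF score.
    // k is the smoothing constant; larger values flatten the influence of top ranks.
    static List<String> fuse(List<List<String>> rankings, int k) {
        Map<String, Double> scores = new LinkedHashMap<>();
        for (List<String> ranking : rankings) {
            for (int i = 0; i < ranking.size(); i++) {
                // i is 0-based, so the 1-based rank is i + 1.
                scores.merge(ranking.get(i), 1.0 / (k + i + 1), Double::sum);
            }
        }
        List<String> fused = new ArrayList<>(scores.keySet());
        // List.sort is stable, so ties keep first-seen order.
        fused.sort(Comparator.comparingDouble((String code) -> scores.get(code)).reversed());
        return fused;
    }
}
```

Usage would be along the lines of `RrfSketch.fuse(List.of(fullTextCodes, knnCodes), 60)`, where both inputs are ranked code lists from the two separate queries.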
2. Enrich node embeddings with relation data
Modify the node text used for embedding — when computing the embedding vector for a TaxonomyNode, build an enriched text string that includes the node's relations:
Business Process Management.
NATO C3 Taxonomy – Business Processes.
Outgoing: supports Communication Requirements, supports Network Planning.
Incoming: depends_on Infrastructure Services, uses Command Operations.
This should be done in the embedding computation (the PropertyBridge or ValueBridge for the @VectorField), using the node's outgoingRelations and incomingRelations JPA associations.
Index relations as separate documents — each TaxonomyRelation gets its own Hibernate Search document with:
- An enriched text field: "{sourceName} {relationType} {targetName}. {description}"
- ...
This pull request was created from Copilot chat.