From 9220482cd351c818cb6aeb88c13f0734024635cb Mon Sep 17 00:00:00 2001
From: Fridolin Somers
Date: Fri, 10 Oct 2014 15:06:45 +0200
Subject: [PATCH] Bug 13064 - Indexing problem with ICU on control characters

The ICU configuration files contain a rule that removes control characters.
This rule is applied before tokenization.
The problem is that the "[:Control:]" regex class includes line feed,
carriage return and tab.
See http://www.regular-expressions.info/posixbrackets.html.
So when several lines are indexed, the last word of a line is joined with the
first word of the next line. Those words are then not searchable.

For example :
First line
Second line
This will become "First lineSecond line", tokenized as "First", "lineSecond"
and "line".

Test plan :
- Use ICU in Zebra configuration
- Choose an indexed field, like 300$a
- Create a new record
- Enter several lines in the chosen field, like :
First line
Second line
- Index this record
=> Without patch the search on "Second" does not return the record
=> With patch the search on "Second" returns the record
- Same tests with tab and carriage return instead of line feed

Signed-off-by: Chris Cormack
Signed-off-by: Kyle M Hall
Signed-off-by: Tomas Cohen Arazi
---
 etc/zebradb/etc/phrases-icu.xml | 3 ++-
 etc/zebradb/etc/words-icu.xml   | 3 ++-
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/etc/zebradb/etc/phrases-icu.xml b/etc/zebradb/etc/phrases-icu.xml
index b27b06a9da..076c5ac67d 100644
--- a/etc/zebradb/etc/phrases-icu.xml
+++ b/etc/zebradb/etc/phrases-icu.xml
@@ -1,5 +1,6 @@
-
+
+
diff --git a/etc/zebradb/etc/words-icu.xml b/etc/zebradb/etc/words-icu.xml
index 38af51fae5..57498cbca4 100644
--- a/etc/zebradb/etc/words-icu.xml
+++ b/etc/zebradb/etc/words-icu.xml
@@ -1,7 +1,8 @@
-
+
+
--
2.39.5
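
Note: the word-joining effect described in the commit message can be reproduced outside Zebra. The sketch below is only an illustration in Python, assuming that "[:Control:]" corresponds to the Unicode general category Cc, which Python exposes through unicodedata.category. The helper names strip_controls and strip_controls_keep_whitespace are hypothetical, and the second helper shows one possible way to avoid the problem, not necessarily the rule this patch installs.

```python
import unicodedata


def strip_controls(text: str) -> str:
    """Drop every character whose Unicode general category is Cc (control).

    This mimics the behaviour described in the commit message: tab, line
    feed and carriage return are all in category Cc, so they vanish too.
    """
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cc")


def strip_controls_keep_whitespace(text: str) -> str:
    """Hypothetical alternative: keep a separator where tab, LF or CR stood.

    Tab, line feed and carriage return become a space; all other control
    characters are still removed.
    """
    out = []
    for ch in text:
        if ch in "\t\n\r":
            out.append(" ")
        elif unicodedata.category(ch) != "Cc":
            out.append(ch)
    return "".join(out)


record = "First line\nSecond line"

# Whitespace splitting stands in for the tokenizer here.
print(strip_controls(record).split())
print(strip_controls_keep_whitespace(record).split())
```

With the first helper the tokens are "First", "lineSecond" and "line", matching the example in the commit message. With the second helper "Second" survives as its own token.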