From 9575a5f4fe84fc0b109062ad084ab32e3cc28a23 Mon Sep 17 00:00:00 2001 From: Joshua Ferraro Date: Wed, 23 Jul 2008 09:49:57 -0500 Subject: [PATCH] UTF-8 to ASCII MAPPINGS -- WARNING: REINDEX REQUIRED You've been warned :-). This patch contains a more complete mapping of UTF-8 to ASCII. The mappings are based on those compiled by Richard Mahoney on the Zebra list: http://lists.indexdata.dk/pipermail/zebralist/2007-August/001707.html Note to documentation team: we need an area in the documentation that discusses how Koha handles searches and indexing for words that contain diacritics, such as E-ACUTE (vs E without an acute). If you can paste this list of mappings from this patch directly into the docs and it preserves the encoding that would be great. NOTE: I don't think this patch addresses issues of combining vs non-combining forms, and may require a refactor to address that. Josh --- etc/zebradb/etc/word-phrase-utf.chr | 291 ++++++++++++++++++++++++---- 1 file changed, 250 insertions(+), 41 deletions(-) diff --git a/etc/zebradb/etc/word-phrase-utf.chr b/etc/zebradb/etc/word-phrase-utf.chr index 9fc53510f5..a6fc582b6e 100644 --- a/etc/zebradb/etc/word-phrase-utf.chr +++ b/etc/zebradb/etc/word-phrase-utf.chr @@ -1,41 +1,250 @@ -# Generic character map. -# -# $Id: word-phrase-utf.chr,v 1.1.2.1 2006/07/03 21:56:45 kados Exp $ - -# Define the basic value-set. *Beware* of changing this without re-indexing -# your databases. -lowercase {0-9}{a-y}üzæäøöå -uppercase {0-9}{A-Y}ÜZÆÄØÖÅ - -# Breaking characters - -space {\001-\040}!"#$%&'\()*+,-./:;<=>?@\[\\]^_`\{|}~ - -# Characters to be considered equivalent for searching purposes. - -# equivalent æä(ae) -# equivalent øö(oe) -# equivalent å(aa) -# equivalent uü - -# Supplemental mappings - -#map (ä) ä -#map (æ) æ -#map (ø) ø -#map (å) å -#map (ö) ö -#map (Ä) Ä -#map (&Aelig;) Æ -#map (Ø) Ø -#map (Å) Å -#map (Ö) Ö - -#map éÉ e -#map á a -#map ó o -#map í i - -#map (Aa) (AA) - -#map (aa) a +# Generic search equivanence character map for Latin languages (English, French, etc.) + +# Define the basic value-set. *Beware* of changing this without re-indexing +# your databases. +encoding utf-8 + +# basic character set +lowercase {0-9}{a-z} +uppercase {0-9}{A-Z} + +# Breaking characters +space {\001-\040}!"#$%&'\()*+,-./:;<=>?@\[\\]^_`\{|}~ + +# Characters to be considered equivalent for searching purposes +equivalent aáàãåâăąȧǎȁȃ +equivalent ӕä(ae) +equivalent ā(aa) +equivalent iíìîịĩĭįǐȉȋ +equivalent ï(ie) +equivalent Ä«(ii) +equivalent uúùûũŭųűǔȕȗ +equivalent ü(ue) +equivalent Å«(uu) +equivalent eéèêẽĕęėěȅȇ +equivalent ëē(ee) +equivalent oóòõôŏǫȯőǒȍȏ +equivalent öø(oe) +equivalent ō(oo) + +# Supplemental mappings +# mapping diacritics to plain ASCII +# A +map ā a +map Ā a +map á a +map Á a +map à a +map À a +map ã a +map à a +map Ã¥ a +map Å a +map â a +map  a +map ä a +map Ä a +map ă a +map Ă a +map ą a +map Ą a +map ȧ a +map Ȧ a +map ǎ a +map Ǎ a +map ȁ a +map Ȁ a +map ȃ a +map Ȃ a +# AE +map ӕ (ae) +map Ӕ (ae) +# I +map Ä« i +map Ī i +map ï i +map Ï i +map î i +map Î i +map í i +map Í i +map ì i +map Ì i +map ị i +map Ị i +map Ä© i +map Ĩ i +map Ä­ i +map Ĭ i +map į i +map Ä® i +map ǐ i +map Ǐ i +map ȉ i +map Ȉ i +map ȋ i +map Ȋ i +# U +map Å« u +map Ū u +map ü u +map Ü u +map ù u +map Ù u +map ú u +map Ú u +map û u +map Û u +map Å© u +map Ũ u +map Å­ u +map Ŭ u +map ų u +map Ų u +map ű u +map Å° u +map ǔ u +map Ǔ u +map ȕ u +map Ȕ u +map ȗ u +map Ȗ u +# E +map ê e +map Ê e +map ë e +map Ë e +map é e +map É e +map è e +map È e +map ē e +map Ē e +map ẽ e +map Ẽ e +map ĕ e +map Ĕ e +map ę e +map Ę e +map ė e +map Ė e +map ě e +map Ě e +map ȅ e +map Ȅ e +map ȇ e +map Ȇ e +# O +map ô o +map Ô o +map ó o +map Ó o +map ò o +map Ò o +map ö o +map Ö o +map õ o +map Õ o +map ø o +map Ø o +map ō o +map Ō o +map ǒ o +map Ǒ o +map ŏ o +map Ŏ o +map Ç« o +map Ǫ o +map ȯ o +map È® o +map ő o +map Ő o +map ȍ o +map Ȍ o +map ȏ o +map Ȏ o +# R +map ṛ r +map Ṛ r +map rÌ¥ r +map RÌ¥ r +map ṝ r +map Ṝ r +map ř r +map Ř r +map ṟ r +# L +map ḷ l +map Ḷ l +map ḹ l +map Ḹ l +map ḻ l +map Ḻ l +# N +map ṅ n +map Ṅ n +map ñ n +map Ñ n +map ǹ n +map Ǹ n +map ń n +map Ń n +map ṉ n +map Ṉ n +map ṇ n +map Ṇ n +# T +map á¹­ t +map Ṭ t +# D +map ḍ d +map Ḍ d +# S +map ś s +map Ś s +map á¹£ s +map á¹¢ s +map ş s +map Ş s +map Å¡ s +map Å  s +map ṡ s +map á¹  s +# M +map ṁ m +map Ṁ m +map ṃ m +map Ṃ m +# H +map ḥ h +map Ḥ h +map ẖ h +# Z +map ź z +map Ź z +map ẓ z +map Ẓ z +map ż z +map Å» z +#C +map ç c +map Ç c +map ć c +map Ć c +map č c +map Č c +# G +map Ç¥ g +map Ǥ g +map Ä¡ g +map Ä  g +map ǧ g +map Ǧ g +map ğ g +map Ğ g +# K +map ḵ k +map Ḵ k +# Y +map ý y +map Ý y -- 2.39.5