This patch adds geosearch to Koha (using Elasticsearch 7). Elasticsearch
search_mappings get new types to store lat/lon, which can be indexed
from MARC 034$s and 034$t. There is a small change to the DB to allow a
new value in the search_field.type ENUM.
The QueryBuilder is extended to allow building advanced
Elasticsearch queries (e.g. geo_distance) that cannot be represented in a
simple string query. The UI for searching (including showing the results
on an OSM/Leaflet map) is implemented in a separate plugin
(https://github.com/HKS3/HKS3GeoSearch).
Test Plan:
* make sure you're running Elasticsearch 7
(e.g. via `curl http://es:9200?pretty | grep number`)
* apply patch
* go to a Framework, check Editor for 034$s and 034$t and save
* go to some books (in the correct framework) and enter some lat and lon into 034$s and 034$t (for example lat=48.216, lon=16.395)
* Run the Elasticsearch indexer, maybe limited to the books you edited (-bn 123 -bn 456):
misc/search_tools/rebuild_elasticsearch.pl -b -v
* You can check if the indexing worked by inspecting the document in Elasticsearch:
* get the biblionumber (e.g. 123)
* curl http://es:9200/koha_kohadev_biblios/_doc/123?pretty | grep -A5 geolocation
* You should get back a JSON fragment containing the lat/lon you stored
* You can query Elasticsearch directly:
* Run the following curl command, but adapt the value for lat/lon and/or the distance (in meters)
* curl -X GET "http://es:9200/koha_kohadev_biblios/_search?pretty" -H 'Content-Type: application/json' -d '{"query": {"bool":{"must":{"match_all":{}},"filter":{"geo_distance":{"distance":100000,"geolocation":{"lat":48.2,"lon":16.4}}}}}}'
* To run the search via Koha, you need to either install and use https://github.com/HKS3/HKS3GeoSearch or create a handcrafted query string:
* handcrafted query string:
* /cgi-bin/koha/opac-search.pl?advsearch=1&idx=geolocation&q=lat:48.25+lng:18.35+distance:100km&do=Search
* HKS3GeoSearch
* install the plugin and enable it
* go to OPAC / Advanced Search
* There is a new input box "Geographic Search" where you can enter lat/long/radius
* On the search result page a map is shown with pins for each found biblioitem
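The direct Elasticsearch check in the test plan can be sketched in Python as follows. This is only an illustration: `parse_geo_query` and `build_geo_distance_query` are hypothetical helper names, not part of Koha's QueryBuilder, and the sketch assumes nothing beyond the `lat:… lng:… distance:…` query string form and the `geolocation` field name used above.

```python
import json
import re


def parse_geo_query(q):
    """Parse 'lat:48.25 lng:18.35 distance:100km' (hypothetical helper)."""
    parts = dict(re.findall(r"(lat|lng|distance):(\S+)", q))
    missing = {"lat", "lng", "distance"} - set(parts)
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    return {
        "lat": float(parts["lat"]),
        "lon": float(parts["lng"]),
        "distance": parts["distance"],
    }


def build_geo_distance_query(lat, lon, distance, field="geolocation"):
    """Build a request body shaped like the curl example in the test plan."""
    return {
        "query": {
            "bool": {
                "must": {"match_all": {}},
                "filter": {
                    "geo_distance": {
                        "distance": distance,
                        field: {"lat": lat, "lon": lon},
                    }
                },
            }
        }
    }


params = parse_geo_query("lat:48.25 lng:18.35 distance:100km")
body = build_geo_distance_query(**params)
print(json.dumps(body, indent=2))
```

The printed JSON can be passed to curl with `-d` in place of the inline body shown in the test plan.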
Sponsored-by: ZAMG - Zentralanstalt für Meterologie und Geodynamik, Austria - https://www.zamg.ac.at/
Sponsored-by: Geosphere - https://www.geosphere.at/
Signed-off-by: Martin Renvoize <martin.renvoize@ptfs-europe.com>
Signed-off-by: Nick Clemens <nick@bywatersolutions.com>
Additional finetuning:
- Fix update and remove fixed fixme
- Update test count as well
- fix last small issues raised in Comment 23
Signed-off-by: Katrin Fischer <katrin.fischer@bsz-bw.de>
In the Elasticsearch field configuration (field_config.yaml), the default type has a field 'ci_raw', which is used for exact searches.
This field is missing for the standard number type 'stdno'.
Test plan:
1) In the staff interface, go to Administration, and search for SearchEngine
2) Make sure that the SearchEngine preference is set to Elasticsearch and save
3) Return to Administration and select "Search engine configuration"
4) Change the type of "Heading-Main" to "Std. Number" and save
5) Rebuild the index (e.g. "koha-elasticsearch --rebuild -d kohadev")
6) Go to the main staff page and select Authorities
7) Search for a heading (e.g. "A Dual-language book")
=> Result is found with or without patch
8) Click on the sliders and select "is exactly" for the operator and search
=> Result is found only with patch
9) Apply the patch
10) Rebuild the index (e.g. "koha-elasticsearch --rebuild -d kohadev")
11) Click on the sliders and select "is exactly" for the operator and search
=> Result is found only with patch
Signed-off-by: Kevin Carnes <kevin.carnes@ub.lu.se>
Signed-off-by: Marcel de Rooy <m.de.rooy@rijksmuseum.nl>
Signed-off-by: Tomas Cohen Arazi <tomascohen@theke.io>
When defining our sort fields we defined all as 'numeric'.
For other strings containing numbers this is likely correct; however,
for callnumbers it is not, e.g. E45 should sort before E7.
This patch adds a new 'callnumber' type, defines it for cn-sort, and
adds to the field mapping a sort without numeric set.
To test:
0 - Be using ES with Koha
1 - On records with single item, add callnumbers:
VA65 E7 R63 1984
VA65 E7 T35 1990
VA65 E45 R67 1985
2 - Add public note 'shrimp' or something to make them easily searchable as a group
3 - Search for 'shrimp', sort by callnumber
4 - Note E45 comes last, it should come first
5 - Apply patch
6 - Reset ES mappings
7 - Reindex ES
8 - Repeat search
9 - Sorting should be correct when set to callnumber
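The difference between the two orderings can be illustrated with a small Python sketch. It is only an approximation of the idea: `numeric_aware_key` is a hypothetical helper that mimics numeric collation by comparing digit runs as integers, and it is not how Elasticsearch implements sorting.

```python
import re


def numeric_aware_key(value):
    """Split into text and digit runs so that digit runs compare as integers."""
    return [int(tok) if tok.isdigit() else tok for tok in re.split(r"(\d+)", value)]


callnumbers = [
    "VA65 E7 R63 1984",
    "VA65 E7 T35 1990",
    "VA65 E45 R67 1985",
]

# Numeric-aware comparison treats 45 as larger than 7, so E45 sorts last.
numeric_order = sorted(callnumbers, key=numeric_aware_key)

# Plain character comparison sees '4' before '7', so E45 sorts first.
string_order = sorted(callnumbers)

print(numeric_order)
print(string_order)
```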
Signed-off-by: David Nind <david@davidnind.com>
Signed-off-by: Michal Urban <michalurban177@gmail.com>
Signed-off-by: Martin Renvoize <martin.renvoize@ptfs-europe.com>
Signed-off-by: Tomas Cohen Arazi <tomascohen@theke.io>
Add a "year" search field type. Fields with this type will only
retain values that look like years, so invalid values such as
whitespace or word characters will not be indexed.
This for instance improves the behaviour when sorting by
"date-of-publication". If all values are indexed, records with
junk data instead of valid years will appear first among the search
results, drowning out more relevant hits. If this field is assigned
the "year" type, these records will instead always appear last,
regardless of sort order.
To test:
1) Have at least two biblios, one with a valid year in 008 (pos 7-10)
and another with an invalid one ("uuuu" for example)
2) Perform a wildcard search (*) and sort results by publication date.
3) The record with invalid year of publication in 008 should appear first
4) Apply patch and run database updates
5) Reindex ElasticSearch
6) Perform the same search as in 2)
7) The record with the invalid year should now appear last
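The intended behaviour can be sketched in Python. This is a sketch under stated assumptions: the names `keep_years` and `sort_by_year` are hypothetical, and treating a "year" as exactly four ASCII digits is an assumption for illustration, not necessarily the rule the patch applies.

```python
import re

# Assumption: a value "looks like a year" if it is exactly four digits.
YEAR_RE = re.compile(r"[0-9]{4}")


def keep_years(values):
    """Drop values such as 'uuuu' or whitespace before indexing."""
    return [v for v in values if YEAR_RE.fullmatch(v)]


def sort_by_year(records, descending=False):
    """Sort by year; records without an indexed year always go last."""
    with_year = [r for r in records if r["year"] is not None]
    without_year = [r for r in records if r["year"] is None]
    with_year.sort(key=lambda r: r["year"], reverse=descending)
    return with_year + without_year


records = []
for title, raw in [("A", "1984"), ("B", "uuuu"), ("C", "2001")]:
    years = keep_years([raw])
    records.append({"title": title, "year": int(years[0]) if years else None})

print([r["title"] for r in sort_by_year(records)])
print([r["title"] for r in sort_by_year(records, descending=True)])
```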
Signed-off-by: Nick Clemens <nick@bywatersolutions.com>
Signed-off-by: Katrin Fischer <katrin.fischer.83@web.de>
Signed-off-by: Jonathan Druart <jonathan.druart@bugs.koha-community.org>
The current code for facets doesn't strip trailing punctuation from facets.
This causes duplicate facets for terms that should be combined.
Sometimes series can have different punctuation depending on the field they are in.
Author initials punctuation should be preserved.
To test:
1 - Do search and pull up some records
2 - Edit some of the records to have authors like:
Date, C.J.
Date, C.j.
Date, C.J .
3 - Edit the records to have some series statements like:
830 $aDate, C.J. ;$v5
830 $aDate, C.J. ; $v5
830 $aDate, C.J.; $v5
4 - Add some 490s to the record with first indicator 1 and series like:
You wouldn't want to--
You wouldn't want to
You wouldn't want to..
5 - Search again and note you have 3 facets each for author and series
6 - Apply patch
7 - Repeat
8 - Now you get 2 facets for author: a period is not removed when it immediately follows an upper case letter, but is removed otherwise
9 - Now you should have a single series facet
10 - Switch search engine to ES (index before applying patch)
11 - Note facets are separate again
12 - Reset mappings and reindex
perl misc/search_tools/rebuild_elasticsearch.pl -v -r
13 - Repeat search, facets combined as above
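The normalisation described above can be sketched in Python. This is an illustrative approximation, not the patch's code: `strip_facet_punctuation` is a hypothetical name, the set of stripped characters is an assumption, and `[A-Z]` stands in for "upper case letter" (ASCII only).

```python
import re

# Strip trailing whitespace and punctuation, but leave a punctuation run
# alone when it immediately follows an upper case letter (author initials).
TRAILING = re.compile(r"\s*(?<![A-Z])[.\-,;]*\s*$")


def strip_facet_punctuation(value):
    return TRAILING.sub("", value)


series = [
    "Date, C.J. ;",
    "Date, C.J. ; ",
    "Date, C.J.;",
    "You wouldn't want to--",
    "You wouldn't want to",
    "You wouldn't want to..",
]

# Distinct facet labels after normalisation.
print(sorted({strip_facet_punctuation(s) for s in series}))
```

With this rule the three `Date, C.J.` series variants collapse into one label, and so do the three `You wouldn't want to` variants.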
Signed-off-by: Sarah Cornell <sbcornell@cityofportsmouth.com>
Signed-off-by: Katrin Fischer <katrin.fischer.83@web.de>
Signed-off-by: Jonathan Druart <jonathan.druart@bugs.koha-community.org>
If we try to put malformed data into an integer field, Elasticsearch
rejects the whole document.
Setting 'ignore_malformed' to true allows Elasticsearch to ignore
malformed data and process the other fields of the document normally.
https://www.elastic.co/guide/en/elasticsearch/reference/7.8/ignore-malformed.html
Test plan:
* Without the patch
1. In search engine configuration, change the type of a text field to
'Number' (for instance 'title')
2. misc/search_tools/rebuild_elasticsearch.pl -d -b
3. See that the index is empty (unless you have titles consisting only
of digits)
* With the patch
1. misc/search_tools/rebuild_elasticsearch.pl -d -b
2. Now records are correctly indexed
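The effect of the setting can be modelled with a toy Python sketch. It does not talk to Elasticsearch: `simulate_indexing` is a hypothetical function that only mirrors the behaviour described above (reject the whole document versus skip the malformed field), and the mapping fragment shows where the option would sit for an integer field.

```python
# Mapping fragment for an integer field with the option enabled.
integer_mapping = {"type": "integer", "ignore_malformed": True}


def simulate_indexing(doc, integer_fields, ignore_malformed):
    """Toy model: return the indexed fields, or None if the document is rejected."""
    indexed = {}
    for field, value in doc.items():
        if field not in integer_fields:
            indexed[field] = value
            continue
        try:
            indexed[field] = int(value)
        except (TypeError, ValueError):
            if not ignore_malformed:
                return None  # whole document rejected
            # malformed value skipped, remaining fields still indexed
    return indexed


doc = {"title": "Shrimp cookery", "author": "Doe, Jane"}

print(simulate_indexing(doc, {"title"}, ignore_malformed=False))
print(simulate_indexing(doc, {"title"}, ignore_malformed=True))
```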
Signed-off-by: Nick Clemens <nick@bywatersolutions.com>
Signed-off-by: Jonathan Druart <jonathan.druart@bugs.koha-community.org>
This is an interface for quick and efficient browsing through records.
It presents a page at /cgi-bin/koha/opac-browse.pl that allows you to
enter the prefix of an author, title, or subject and it'll give you a
list of the options that match that. You can then scroll through these
and select the one you're after. Selecting it provides a list of records
that match that particular search.
To Test:
1 - Apply patches
2 - Update database (updatedatabase on kohadevbox)
3 - Compile the CSS
https://wiki.koha-community.org/wiki/Working_with_SCSS_in_the_OPAC_and_staff_client
yarn build --view=opac on kohadevbox
4 - Enable the new syspref OpacBrowseSearch
5 - Have ES running and some records in it
SearchEngine syspref set to Elasticsearch
6 - Browse to opac home, click 'Browse search' link
7 - Test searching for author, title, and subject
8 - Verify that results are returned in expected order
9 - Experiment with fuzziness
https://www.elastic.co/guide/en/elasticsearch/reference/5.6/common-options.html#fuzziness
Options are: exact (0 edits), fuzzy (1 edit), very fuzzy (2 edits)
10 - Click any result and verify specific titles are correct
11 - Click through title to record and verify it is the correct record
12 - Test that disabling pref removes the link on the opac home
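To make the fuzziness options concrete, the following Python sketch computes a plain Levenshtein edit distance and maps the three option names from step 9 to their edit counts. It is a simplified illustration: Elasticsearch's own matching may count a transposition of adjacent characters as a single edit, which this sketch does not.

```python
# Option names and edit counts as listed in the test plan.
FUZZINESS = {"exact": 0, "fuzzy": 1, "very fuzzy": 2}


def edit_distance(a, b):
    """Levenshtein distance: insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, free if equal
            ))
        prev = cur
    return prev[-1]


def matches(term, candidate, option):
    """True if candidate is within the edit budget of the chosen option."""
    return edit_distance(term, candidate) <= FUZZINESS[option]


print(matches("tolkien", "tolkein", "exact"))
print(matches("tolkien", "tolkein", "very fuzzy"))
```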
Signed-off-by: David Nind <david@davidnind.com>
Signed-off-by: Katrin Fischer <katrin.fischer.83@web.de>
Signed-off-by: Martin Renvoize <martin.renvoize@ptfs-europe.com>
As of bug 20589 we no longer analyze sort fields and so we no longer need to append ".phrase"
to our sort in searches.
Additionally, sort fields based on 'sum' should also use sum in building the value to sort on
To test:
0 - Be using ES
1 - Find the most circulated item in your collection
2 - Search for '*'
3 - Sort by popularity DESC
4 - Note that item is not first
5 - Try to sort by anything but relevancy, it fails
6 - Apply patch
7 - Redo searches and sorts
8 - Things should now work as expected
Signed-off-by: Ere Maijala <ere.maijala@helsinki.fi>
Signed-off-by: Martin Renvoize <martin.renvoize@ptfs-europe.com>
Generate a list of fields for the query_string query "fields" parameter,
with possible boosts, instead of using the "_all" field. Also add a "search"
flag in the search_marc_to_field table so that certain mappings can be
excluded from searches. Add an option to include/exclude fields in the
query_string "fields" parameter depending on searching in OPAC or staff
client. Refactor code to remove all other dependencies on the "_all" field.
How to test:
1) Reindex authorities and biblios.
2) Search biblios and try to verify that this works as expected.
3) Search authorities and try to verify that this works as expected.
4) Go to "Search engine configuration"
5) Change some "Boost", "Staff client", and "OPAC" settings and save.
6) Verify that those settings were saved accordingly.
7) Click the "Biblios" or "Authorities" tab and change one or more
"Searchable" settings
8) Verify that those settings were saved accordingly.
9) Try to verify that these settings have taken effect by performing
some biblios and/or authorities searches.
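How such a field list might be assembled can be sketched in Python. The `field^boost` form is standard query_string syntax; everything else here is an assumption for illustration: `build_query_fields` is a hypothetical helper, and the dictionary keys (`name`, `weight`, `search`, `opac`, `staff_client`) are invented stand-ins for the settings described above, not Koha's schema.

```python
def build_query_fields(mappings, interface):
    """Return query_string 'fields' entries for 'opac' or 'staff_client'."""
    fields = []
    for m in mappings:
        if not m.get("search", True):
            continue  # mapping excluded from searches altogether
        if not m.get(interface, True):
            continue  # field disabled for this interface
        weight = m.get("weight")
        fields.append(f"{m['name']}^{weight}" if weight else m["name"])
    return fields


mappings = [
    {"name": "title", "weight": 2},
    {"name": "author"},
    {"name": "note", "opac": False},
    {"name": "local-number", "search": False},
]

query = {
    "query_string": {
        "query": "shrimp",
        "fields": build_query_fields(mappings, "opac"),
    }
}
print(query)
```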
Sponsored-by: Gothenburg University Library
Signed-off-by: Nick Clemens <nick@bywatersolutions.com>
Signed-off-by: Alex Arnaud <alex.arnaud@biblibre.com>
Signed-off-by: Martin Renvoize <martin.renvoize@ptfs-europe.com>
Adds preference ElasticsearchMARCFormat that controls whether MARC records are stored as ISO2709/MARCXML or array. Array is searchable by field and also indexes all subfields in the _all field for searching.
Test plan:
1. Test that searching and indexing works with the patch without any changes.
2. Switch to array format and index some records.
3. Check e.g. the 008 field of a record and verify that the record can be found with the contents enclosed in quotes.
4. Check that it's possible to search for a specific field/subfield. Search query: marc_data_array.fields.655.subfields.a:Diaries
5. Check that tests still pass, especially t/Koha/SearchEngine/Elasticsearch.t
Signed-off-by: Michal Denar <black23@gmail.com>
Signed-off-by: Nick Clemens <nick@bywatersolutions.com>
Signed-off-by: Martin Renvoize <martin.renvoize@ptfs-europe.com>
Default to base64 encoded binary MARC with MARCXML
fallback if record exceeds maximum size
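The fallback rule can be sketched in Python. This is a sketch under stated assumptions: `encode_record` and the returned key names and format labels are hypothetical, and the limit used is the 99999-byte ceiling that follows from ISO 2709 storing the record length in five digits.

```python
import base64

# ISO 2709 stores the record length in a five-digit leader field.
MAX_ISO2709_LENGTH = 99999


def encode_record(iso2709_bytes, marcxml):
    """Prefer base64-encoded binary MARC; fall back to MARCXML if too large."""
    if len(iso2709_bytes) <= MAX_ISO2709_LENGTH:
        return {
            "format": "base64ISO2709",  # hypothetical label
            "data": base64.b64encode(iso2709_bytes).decode("ascii"),
        }
    return {"format": "MARCXML", "data": marcxml}


small = encode_record(b"00026nam a2200000 a 4500", "<record/>")
large = encode_record(b"x" * 100000, "<record/>")
print(small["format"], large["format"])
```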
Sponsored-by: Gothenburg University Library
Signed-off-by: Ere Maijala <ere.maijala@helsinki.fi>
Signed-off-by: Martin Renvoize <martin.renvoize@ptfs-europe.com>
Signed-off-by: Nick Clemens <nick@bywatersolutions.com>
Implement optimized indexing for Elasticsearch
How to test:
1) Time a full elasticsearch re-index without this patch by running the
rebuild_elastic_search.pl with the -d flag:
`koha-shell <instance_name> -c "time rebuild_elastic_search.pl -d"`.
2) Apply this patch.
3) Time a full re-index again; it should be about twice as fast (for a
couple of thousand biblios; with fewer biblios results may be more
unpredictable).
Sponsored-by: Gothenburg University Library
Signed-off-by: Ere Maijala <ere.maijala@helsinki.fi>
Signed-off-by: Martin Renvoize <martin.renvoize@ptfs-europe.com>
Signed-off-by: Nick Clemens <nick@bywatersolutions.com>
To test:
1 - Do some authority searches in Zebra
2 - Switch to ES and repeat, results will vary and some may fail
3 - Apply patch and dependencies
4 - Reindex ES
5 - Repeat searches, they should succeed and results should be similar to
Zebra
6 - Slight differences are okay, but results should (mostly) meet
expectations
A few notes:
We add a 'normalizer' to ensure we get a single token from the heading
indexes, this makes 'starts with' work as expected
We switch to 'AND' for fields searched from cataloging editor - this
matches Zebra results
We force the '__sort' fields for sorting - if sorting looks wrong try
reducing the heading field to a single subfield - this will need to be
addressed on a future bug (multiple subfields create an array, ES sorts
those randomly)
Signed-off-by: Nicolas Legrand <nicolas.legrand@bulac.fr>
Signed-off-by: Katrin Fischer <katrin.fischer.83@web.de>
Signed-off-by: Nick Clemens <nick@bywatersolutions.com>
Improvements:
1) Index settings moved from code to etc/searchengine/elasticsearch/index_config.yaml. An alternative can be specified in koha-conf.xml.
2) Field settings moved from code to etc/searchengine/elasticsearch/field_config.yaml. An alternative can be specified in koha-conf.xml.
3) mappings.yaml has been moved from admin/searchengine/elasticsearch to etc/searchengine/elasticsearch. An alternative can be specified in koha-conf.xml.
4) Default settings have been improved to remove punctuation from phrases used for sorting etc.
5) State variables are used for storing configuration to avoid parsing it multiple times.
6) The reset operation of mappings administration can now also reset the fields.
7) An stdno field type has been added for standard identifiers.
To test:
1) Run tests in t/Koha/SearchEngine/Elasticsearch.t
2) Clear tables search_fields and search_marc_map
3) Go to admin/searchengine/elasticsearch/mappings.pl?op=reset&i_know_what_i_am_doing=1
4) Verify that admin/searchengine/elasticsearch/mappings.pl displays the mappings properly, including ISBN and other standard number fields.
5) Index some records using the -d parameter with misc/search_tools/rebuild_elastic_search.pl to recreate the index
6) Verify that you can find the records
7) Put <elasticsearch_index_mappings>non_existent</elasticsearch_index_mappings> to koha-conf.xml
8) Verify that admin/searchengine/elasticsearch/mappings.pl?op=reset&i_know_what_i_am_doing=1 fails because it can't find non_existent.
9) Copy etc/searchengine/elasticsearch/mappings.yaml to a new location and make elasticsearch_index_mappings setting in koha-conf.xml point to it.
10) Make a change in the new mappings.yaml.
11) Clear table search_fields (mappings reset doesn't do it yet, see bug 20248)
12) Go to admin/searchengine/elasticsearch/mappings.pl?op=reset&i_know_what_i_am_doing=1
13) Verify that the changes you made are now visible in the mappings UI
Signed-off-by: Kyle M Hall <kyle@bywatersolutions.com>
Bug 20073: Move Elasticsearch yaml files back to admin directory
Signed-off-by: Kyle M Hall <kyle@bywatersolutions.com>
Signed-off-by: Nick Clemens <nick@bywatersolutions.com>