The -munge-config switch has been deprecated for years, and
trying to use it would either not work at all or, if it did "work",
almost certainly damage one's Zebra configuration for Koha.
This patch removes this switch.
To test:
[1] Run rebuild_zebra.pl and verify that no mention is made
of -munge-config.
[2] Run rebuild_zebra.pl to index records in one's test database
and verify that there are no regressions.
Signed-off-by: Galen Charlton <gmc@esilibrary.com>
Signed-off-by: Chris Cormack <chrisc@catalyst.net.nz>
Removing a really dangerous option
Signed-off-by: Katrin Fischer <Katrin.Fischer.83@web.de>
Passes all tests and QA script.
Ran rebuild_zebra.pl with various options and confirmed
that data was reindexed successfully.
No regressions found.
Signed-off-by: Galen Charlton <gmc@esilibrary.com>
This patch follows up on the previous patch by moving the
check for whether authority and/or biblio indexing have been
specified so that -daemon has a chance to set those modes.
Signed-off-by: Galen Charlton <gmc@esilibrary.com>
Based on feedback, make daemon mode imply -z -a -b and abort
on startup if flags incompatible with an incremental update daemon
are used. Update documentation to match.
Signed-off-by: Galen Charlton <gmc@esilibrary.com>
This change adds code that checks the zebraqueue table with a cheap SQL query
and a daemon loop that checks for new entries and processes them incrementally
before sleeping for a controllable number of seconds. The default is 5 seconds,
which provides near-realtime search index updates. This is particularly
desirable for libraries doing active catalogue updating. The query is adjusted
depending on whether -a, -b, or both are specified.
Help text updated. Tested against a live 3.12 system.
Note that this fix will benefit from the fix to lack of locking (bug 11078)
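A minimal sketch of that loop, assuming a DBI handle $dbh and a hypothetical
process_zebraqueue() helper (names are illustrative, not the patch's actual code):

    my $sleep_seconds = 5;    # default polling interval described above
    while (1) {
        # cheap check: count pending zebraqueue entries for the chosen server
        my ($pending) = $dbh->selectrow_array(
            q{SELECT COUNT(*) FROM zebraqueue WHERE done = 0 AND server = ?},
            undef, 'biblioserver'
        );
        process_zebraqueue() if $pending;    # incremental update pass
        sleep $sleep_seconds;
    }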
Signed-off-by: Chris Cormack <chrisc@catalyst.net.nz>
Signed-off-by: Martin Renvoize <martin.renvoize@ptfs-europe.com>
Signed-off-by: Galen Charlton <gmc@esilibrary.com>
http://bugs.koha-community.org/show_bug.cgi?id=8745
Signed-off-by: Katrin Fischer <Katrin.Fischer.83@web.de>
1) Does not run as root.
2) Runs as root when -run-as-root is given.
3) Runs as the normal koha user.
Note: Maybe the message should be clearer about why
running as root is bad and which user the script should
be run as?
Signed-off-by: Jared Camins-Esakov <jcamins@cpbibliography.com>
Added a check to warn users about execution as the root user.
Added a 'run-as-root' switch to allow users to force execution as the root user.
Signed-off-by: Mason James <mtj@kohaaloha.com>
Signed-off-by: Katrin Fischer <Katrin.Fischer.83@web.de>
Signed-off-by: Jared Camins-Esakov <jcamins@cpbibliography.com>
Test plan:
Clear the zebra queue (run rebuild). Update one biblio.
Rebuild zebra (again) with -z. Check zebra log: note 2 exported records.
Now apply patch, and repeat: You will see 1 exported record.
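A rough sketch of the deduplication this implies, assuming the pending queue
rows are already in @entries (purely illustrative):

    my %seen;
    my @to_export = grep { !$seen{ $_->{biblio_auth_number} }++ } @entries;
    # each queued record is now exported once, however often it was queued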
Signed-off-by: Kyle M Hall <kyle@bywatersolutions.com>
Signed-off-by: Katrin Fischer <Katrin.Fischer.83@web.de>
Works as described.
Signed-off-by: Jared Camins-Esakov <jcamins@cpbibliography.com>
When using rebuild_zebra to index all records, skip over
bibliographic or authority records that don't come out
as valid XML. Also, strip extraneous XML declarations when
using --nosanitize.
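A sketch of the kind of guard this adds, assuming XML::LibXML is available and
$marcxml holds one exported record (variable names hypothetical):

    use XML::LibXML;
    eval { XML::LibXML->load_xml( string => $marcxml ) };
    if ($@) {
        warn "skipping record $record_number: invalid XML: $@";
        next;    # assumes we are in the export loop; zebraidx never sees it
    }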
Test plans
----------
Note that both plans assume that DOM indexing is turned on.
Test plan #1
============
[1] Run rebuild_zebra.pl with the -x -nosanitize options. Without
the patch, zebraidx should terminate early and complain
about invalid XML.
[2] With the patch, the rebuild_zebra.pl should work without
error.
Test plan #2
============
[1] Intentionally make a MARCXML record invalid, e.g., by running
the following SQL:
UPDATE biblioitems SET marcxml = CONCAT(marcxml, 'junk')
WHERE biblionumber = 123;
[2] Run rebuild_zebra.pl -b -x -r
[3] Without the patch, only part of the database will be indexed.
[4] With the patch, rebuild_zebra.pl will not export the bad
record and will give an error message saying so, but will
successfully index the rest of the records.
Signed-off-by: Galen Charlton <gmc@esilibrary.com>
Signed-off-by: Larry Baerveldt <larry@bywatersolutions.com>
Signed-off-by: Mason James <mtj@kohaaloha.com>
Signed-off-by: Paul Poulain <paul.poulain@biblibre.com>
Signed-off-by: Jared Camins-Esakov <jcamins@cpbibliography.com>
Removed NoZebra vestiges. This comprises several code blocks that depend on the NoZebra syspref and NZ-related functions/methods.
C4::Biblio->
GetNoZebraIndexes
_DelBiblioNoZebra
_AddBiblioNoZebra
C4::Search->
NZgetRecords
NZanalyse
NZoperatorAND
NZoperatorOR
NZoperatorNOT
NZorder
C4::Installer->
set_indexing_engine
Sponsored-by: Universidad Nacional de Córdoba
Signed-off-by: Julian Maurice <julian.maurice@biblibre.com>
Signed-off-by: Paul Poulain <paul.poulain@biblibre.com>
Signed-off-by: Jared Camins-Esakov <jcamins@cpbibliography.com>
Due to a limitation of Zebra, the register must be cleared *before*
doing shadow indexing if you want to reset the indexes. In light of
that, it does not make sense to do shadow indexing at all when
rebuild_zebra.pl is run with the -r switch. This patch makes -r (reset)
imply -n (no shadow).
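In code terms the change amounts to something like this one-liner ($reset and
$noshadow are illustrative names):

    # the register is wiped first, so shadow indexing would be pointless
    $noshadow = 1 if $reset;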
To test:
1) Run `rebuild_zebra.pl -b -r -v -v -v`
2) Note that the script never runs the merge phase
Without the patch I see log lines referring to the shadow cache (enabling shadow spec=/home/koha/koha-dev/var/lib/zebradb/biblios/shadow:20G).
With the patch I don't see anything in the logs about shadow. I do, however, see lines about merging. I think it could just be a misunderstanding of the logs.
Signed-off-by: wajasu <matted-34813@mypacks.net>
Signed-off-by: Elliott Davis <elliott@bywatersolutions.com>
Signed-off-by: Jared Camins-Esakov <jcamins@cpbibliography.com>
Previously we used the "delete" command in zebraidx, which fails when
you try to delete a record that doesn't exist in the index. By changing
to the "adelete" command, we can reduce the likelihood of a failed
delete causing ghost records. A symptom of this problem is the warning
message occasionally encountered when indexing from the zebraqueue,
"[warn] cannot delete record above (seems new)."
To test:
1) Add a recordDelete action for a record that does not exist to
zebraqueue in MySQL:
INSERT INTO zebraqueue (biblio_auth_number, operation, server) \
VALUES (999999999, 'recordDelete', 'biblioserver');
2) Run `rebuild_zebra.pl -b -z -v [-x]`.
3) Note that you do not get the message "[warn] cannot delete record
above (seems new)".
Signed-off-by: Chris Cormack <chris@bigballofwax.co.nz>
Passed-QA-by: Paul Poulain <paul.poulain@biblibre.com>
Signed-off-by: Jared Camins-Esakov <jcamins@cpbibliography.com>
This patch adds the Koha::Indexer::RecordNormalizer and
Koha::Indexer::MARC::RecordNormalizer::EmbedSeeFromHeadings packages
to enable the inclusion of alternate forms of headings in bibliographic
searches. When the new syspref IncludeSeeFromInSearches is turned on
(default is off) rebuild_zebra.pl will insert see from headings from
authority records into bibliographic records when indexing, so that a
search on an obsolete term will turn up relevant records.
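Roughly what inserting see-from headings means at the MARC level; a sketch with
MARC::Record, assuming the authority's 4xx tracings map onto the bib's 6xx tags
($bib and $authority are hypothetical MARC::Record objects, and the real tag
mapping is more involved):

    use MARC::Field;
    for my $seefrom ( $authority->field('4..') ) {
        ( my $tag = $seefrom->tag ) =~ s/^4/6/;   # e.g. 450 'Cookery' -> extra 650
        my @pairs = map {@$_} $seefrom->subfields; # flatten [code, value] pairs
        $bib->append_fields( MARC::Field->new( $tag, ' ', ' ', @pairs ) );
    }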
To test:
1) Enable IncludeSeeFromInSearches
2) Add a heading that has an alternate form to a record (for example,
"Cooking" has the alternate form "Cookery," if you have authority
records from LC)
3) Index the zebraqueue (or reindex if you haven't indexed your system
yet)
4) Confirm that if you search for "Cookery" you get the record you
just modified
Signed-off-by: Jared Camins-Esakov <jcamins@cpbibliography.com>
Rebased on master 5 August 2012
Signed-off-by: Jared Camins-Esakov <jcamins@cpbibliography.com>
Rebased on master 11 September 2012
Signed-off-by: Katrin Fischer <Katrin.Fischer.83@web.de>
Also checked:
- Verified database update works correctly
- Checked system preference and its description
- Checked staff/opac detail pages with feature on/off
- Checked staff/opac search facets
- Downloaded and tested records in various formats
- Tried different searches for 'see from' entries of authorities
- Ran all unit tests
No problems found.
Complete rewrite of rebuild_zebra_sliced.zsh (renamed to .sh). Main
improvements are:
- both biblio and authority records are handled
- records are exported only once
It also adds an option, --skip-index, to rebuild_zebra.pl that permits
using rebuild_zebra.pl as an 'export only' script.
Description:
Index Koha records by chunks. This is useful when some record causes
errors and stops the indexing process. With this script, if indexing
of one chunk fails, the chunk is split into 2 (or 3) smaller chunks, and
indexing continues on those chunks.
rebuild_zebra.pl is called only once to export records.
Splitting and indexing is handled by this script (using yaz-marcdump and
zebraidx).
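The split-and-retry idea in miniature; a hypothetical sketch in which
index_ok() stands in for a zebraidx run on a chunk (the real script splits
files with yaz-marcdump):

    sub index_slice {
        my @records = @_;
        return if index_ok(@records);    # hypothetical zebraidx wrapper
        if ( @records == 1 ) {
            warn "bad record: $records[0]\n";   # slice of 1: found the culprit
            return;
        }
        my $mid = int( @records / 2 );
        index_slice( @records[ 0 .. $mid - 1 ] );
        index_slice( @records[ $mid .. $#records ] );
    }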
Signed-off-by: Martin Renvoize <martin.renvoize@ptfs-europe.com>
Signed-off-by: Paul Poulain <paul.poulain@biblibre.com>
In rebuild_zebra.pl, if we are in "unimarc" ("marcflavour" syspref), the sub "fix_unimarc_100" is called and checks whether the length of 100$a is equal to 35.
If it is not, the sub inserts the localtime and more, so we lose the data on reindexing.
The standard length is 36.
I have just changed 35 to 36.
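The whole fix in sketch form; in UNIMARC, 100$a is a fixed-length coded string
occupying positions 0-35, i.e. 36 characters:

    my $f100a = $record->subfield( '100', 'a' );
    if ( !defined $f100a || length($f100a) != 36 ) {   # the check wrongly used 35
        # rebuild the coded field (localtime etc.) only for genuinely bad data
    }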
Signed-off-by: Sophie Meynieux <sophie.meynieux@biblibre.com>
Signed-off-by: Paul Poulain <paul.poulain@biblibre.com>
One consequence is that the -x and -a options are no longer
mutually exclusive.
Also, because of the way that the GRS-1 SGML filter works, when
indexing multiple documents you cannot just wrap them in a document
element, while the DOM filter *requires* exactly that. Consequently, two
new config settings in koha-conf.xml are added to indicate the
Zebra filter in use so that the -x option of rebuild_zebra.pl
knows whether to wrap the exported records or not:
- bib_index_mode (defaults to 'grs1' if not specified)
- auth_index_mode (defaults to 'dom')
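A sketch of how the export side can consult the new settings via Koha's
C4::Context config accessor (the wrapping decision shown is illustrative):

    use C4::Context;
    my $bib_index_mode = C4::Context->config('bib_index_mode') || 'grs1';
    # DOM needs one enclosing element around the batch; GRS-1 must not have one
    my $wrap_collection = ( $bib_index_mode eq 'dom' );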
Signed-off-by: Galen Charlton <gmc@esilibrary.com>
Signed-off-by: Jared Camins-Esakov <jcamins@cpbibliography.com>
Signed-off-by: Paul Poulain <paul.poulain@biblibre.com>
This patch reimplements a feature that is on biblibre/master for Koha-community/master.
It adds the following parameters:
* offset = the offset of records. Say 1000 to start rebuilding at the 1000th record of your database
* length = how many records to export. Say 400 to export only 400 records
* where = add a WHERE clause to rebuild only a given itemtype, or anything else you want to filter on
Another improvement resulting from the offset & length limits is the rebuild_zebra_sliced.zsh script
that will be submitted in another patch.
rebuild_zebra_sliced will slice your whole database into small chunks and, if something goes wrong for a given slice, will slice the slice and repeat, until it reaches a slice size of 1, showing which record in your database is broken.
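In SQL-building terms the three parameters amount to something like this sketch
(table and variable names illustrative):

    my $sql = "SELECT biblionumber FROM biblioitems";
    $sql .= " WHERE $where"           if $where;   # e.g. filter on itemtype
    $sql .= " ORDER BY biblionumber";
    $sql .= " LIMIT $offset, $length" if $length;  # e.g. LIMIT 1000, 400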
Signed-off-by: Jared Camins-Esakov <jcamins@cpbibliography.com>
Removed mention of -l option for limiting number of items exported, as requested
by QA manager. This can be re-added in a later patch.
Signed-off-by: Paul Poulain <paul.poulain@biblibre.com>
Use :encoding(UTF-8) rather than :utf8 for stricter
encoding.
Marking output as ':utf8' only flags the data as UTF-8;
using :encoding(UTF-8) also checks that it is valid UTF-8.
See binmode in perlfunc for more details.
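The difference in one place, for a hypothetical output handle:

    open my $fh, '>', $file or die "cannot open $file: $!";
    # old: binmode $fh, ':utf8';       flags the stream as UTF-8, no validation
    binmode $fh, ':encoding(UTF-8)';   # new: also checks the data is valid UTF-8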
In accordance with the robustness principle, input
filehandles have not been changed, as code may make
the undocumented assumption that invalid UTF-8 is present
in the input.
Fixes errors reported by t/00-testcritic.t.
Where feasible, some filehandles have been made lexical rather than
reusing global filehandle vars.
Signed-off-by: Jonathan Druart <jonathan.druart@biblibre.com>
Signed-off-by: Paul Poulain <paul.poulain@biblibre.com>
Currently, the -v option resets Zebra log output to the default system values.
This produces the amount of logging specified by the system defaults, which is
usually too low for debugging.
This change explicitly forces all Zebra log output, which creates much more
chatter, so it only triggers at verbosity level 2.
Test scenario:
1. pick koha site to reindex
2. use -v -v options to rebuild_zebra.pl to see additional output
Signed-off-by: Liz Rea <wizzyrea@gmail.com>
Verified help corrections and loglevel 2 output vs. loglevel 1 output. No issues found.
Signed-off-by: Paul Poulain <paul.poulain@biblibre.com>
Sometimes zebra needs a tmp dir in order to work. This ensures that it
is created both by koha-create-dirs in the packages, and by
rebuild_zebra when it runs.
--
tested ok, signing off
Signed-off-by: Mason James <mtj@kohaaloha.com>
This patch allows items containing extended characters to be handled properly and
sends valid XML records to zebraidx.
Signed-off-by: Julian Maurice <julian.maurice@biblibre.com>
Signed-off-by: Paul Poulain <paul.poulain@biblibre.com>
This patch fixes an issue whereby biblios with many items (often > 500) would index,
but not the biblionumber itself, resulting in search results with a) inaccurate item counts
and b) no biblionumber to use in the link to the details page. This is due to Net::Z3950::ZOOM not providing
a mechanism for specifying different connection attributes; the maximumRecordSize ZOOM connection attribute,
if not specified, defaults to 1MB, which is less than the size of a MARC record with many, many 952 fields. Since
it is unlikely we can fix Net::Z3950::ZOOM in a timely fashion, this patch aims to build a workaround on the Koha end.
This patch changes EmbedItemsInMarcBiblio to use append_fields instead of insert_fields_ordered,
so the 999$c will come before the item records. It's VERY unlikely we will encounter more than 1MB of biblio-level MARC
content, as this would break the ISO-2709 standard by a large factor.
To this end, it also moves the fix_biblio_ids portion of get_corrected_marc_record out of rebuild_zebra.pl,
and makes it a part of GetMarcBiblio (right before EmbedItemsInMarcBiblio, so the 952s still come last). fix_biblio_ids
is kept as a subroutine for the deletion portion of rebuild_zebra.pl, which still uses it.
It also uses the subroutine parameter in GetMarcBiblio to do the EmbedItemsInMarcBiblio action, rather than having
rebuild_zebra.pl perform it on the itemless record returned from GetMarcBiblio. Simpler and cleaner that way.
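The ordering point in miniature, with MARC::Record ($bib and @item_fields are
hypothetical):

    # insert_fields_ordered would file the 952s in tag order among the other
    # fields; append_fields keeps them after everything else, so the 999$c
    # carrying the biblionumber stays within the 1MB the client can read.
    $bib->append_fields(@item_fields);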
To verify the bug:
1. Find a biblio with over 700 items (or enough that the resulting MARCXML is greater than 1MB)
2. Search for this biblio (in a search that would return multiple results, not just this title). You should get the title in
the results list
3. Attempt to click the link to this biblio's details page; the biblionumber should be blank, leading to a 404
To test the solution:
1. Apply the patch
2. Modify the biblio slightly (click the 005, for example) and save,
OR manually add the biblio to the zebraqueue for reindexing
3. After rebuild_zebra.pl -z -b -x runs, use the same search as above. The title should still appear.
4. Click the link and find yourself on the biblio detail page, as desired
Signed-off-by: D Ruth Bavousett <ruth@bywatersolutions.com>
Signed-off-by: Paul Poulain <paul.poulain@biblibre.com>
Signed-off-by: Chris Cormack <chrisc@catalyst.net.nz>
This both adds a bit of a failsafe to get_raw_biblio, and prevents
records that have been deleted from being updated by the same instance
of rebuild_zebra.
Minor amendment to remove duplication of 6433
Signed-off-by: MJ Ray <mjr@phonecoop.coop>
Signed-off-by: Chris Cormack <chrisc@catalyst.net.nz>
Adds a new routine, C4::Biblio::EmbedItemsInMarcBiblio, to
embed the items in the bib record when necessary:
* cataloging/additem.pl
* rebuild_zebra.pl
Signed-off-by: Galen Charlton <gmc@esilibrary.com>
Signed-off-by: Claire Hernandez <claire.hernandez@biblibre.com>
Signed-off-by: Chris Cormack <chrisc@catalyst.net.nz>
This is a squash of four patches by Henri-Damien Laurent
starting work on removing the copy of item record information
in the 9XX field of bibliographic records. The reason
for doing this is primarily to improve performance, in particular,
the expense of having to add/modify the bib record whenever an
item changes. Now, whenever an item changes, the bib record is
put in the queue to be reindexed; when the bib is indexed, the 9XX
fields are inserted into the version of the bib that Zebra indexes.
Since rebuild_zebra.pl runs in a separate process, the processing of the
bib record will not delay (e.g.) circulation.
As part of upgrading to 3.4, the following batch script should be run:
misc/maintenance/remove_items_from_biblioitems.pl --run
This should be followed by a complete reindexing of the bib records, e.g.,
misc/migration_tools/rebuild_zebra.pl -b -r
Signed-off-by: Galen Charlton <gmcharlt@gmail.com>
Signed-off-by: Claire Hernandez <claire.hernandez@biblibre.com>
Signed-off-by: Chris Cormack <chrisc@catalyst.net.nz>
Reimplements support for -r, as well for -reset
Signed-off-by: D Ruth Bavousett <ruth@bywatersolutions.com>
Signed-off-by: Chris Cormack <chrisc@catalyst.net.nz>
If the zebra server directories don't exist, zebra will spit the dummy.
This makes rebuild_zebra.pl smart enough to create them if they're not
there. If that fails, it'll scream loudly so you know zebra isn't
reindexing.
Signed-off-by: Chris Cormack <chrisc@catalyst.net.nz>
This prevents it leaving files lying around in /tmp
Signed-off-by: Chris Cormack <chrisc@catalyst.net.nz>
Signed-off-by: Galen Charlton <gmcharlt@gmail.com>
With this patch, rebuild_zebra can re-index a whole Koha DB
quickly:
rebuild_zebra -r -b -nosanitize
Biblio (and authority) records are dumped directly to a file
from the marcxml field without being transformed into
MARC::Record objects and corrected.
DOCUMENTATION:
rebuild_zebra.pl new parameter:
-nosanitize  export biblio/authority records directly from the DB marcxml
field without sanitizing the records. This speeds up the
dump process but could fail if the DB contains badly
encoded records. Currently works only with -x and -b.
Signed-off-by: Galen Charlton <galen.charlton@liblime.com>
Add the phrase 'if ( $verbose_logging )' to the two print statements
concerning the skipping of biblio or authority records.
I recently had to split biblio and authority index updating in my cron
script (I had some really big records, so I had to add the -x switch, which
should only be used on biblios according to the help). So I noticed
that rebuild_zebra.pl printed messages that it was skipping biblios or
authorities.
This patch conditionalizes those prints based on the verbose
logging switch.
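The change boils down to guards of this form (message text illustrative):

    print "Skipping biblios\n" if $verbose_logging;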
Signed-off-by: Galen Charlton <galen.charlton@liblime.com>
This reduces the output of the script and zebraidx, and creates a -v
command line switch which will increase the logging to their former
states.
Signed-off-by: Galen Charlton <galen.charlton@liblime.com>
Prior to this patch, rebuild_zebra.pl -z was effectively
hanging on to a lock on the zebraqueue table, preventing
other scripts from inserting new entries into the table.
This had the effect of causing circulation operations
to time out.
Refactored by having rebuild_zebra.pl pull the active
queue into memory, then mark entries done by zebraqueue.id.
Consequently, rebuild_zebra.pl should no longer
block adding new entries into zebraqueue.
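A sketch of the refactor, assuming a DBI handle $dbh; the two-phase shape is
the point, not the exact SQL:

    # phase 1: pull the active queue into memory, releasing the table quickly
    my $entries = $dbh->selectall_arrayref(
        q{SELECT id, biblio_auth_number FROM zebraqueue WHERE done = 0},
        { Slice => {} }
    );
    # phase 2: after indexing, mark each processed entry done by its id
    my $mark = $dbh->prepare(q{UPDATE zebraqueue SET done = 1 WHERE id = ?});
    $mark->execute( $_->{id} ) for @$entries;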
Signed-off-by: Joshua Ferraro <jmf@liblime.com>
rebuild_zebra.pl will now mark all zebraqueue entries
of the affected record type(s) done when run in
normal mode to index all records (as opposed to running
it with -z to just process the zebraqueue). This prevents
any running zebraqueue_daemon processes from attempting
to reindex the same records, redundantly.
The new -y switch overrides this new behavior; in other words, if
running rebuild_zebra.pl without -z, you can specify
-y to *not* mark the zebraqueue done.
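The mark-everything-done step described above is essentially (server value
illustrative):

    unless ($keep_zebraqueue) {    # i.e. unless -y was given
        $dbh->do(
            q{UPDATE zebraqueue SET done = 1 WHERE server = ? AND done = 0},
            undef, 'biblioserver'
        );
    }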
Signed-off-by: Joshua Ferraro <jmf@liblime.com>
Accidentally introducing a circular reference in a
MARC::Record object does not lead to goodness, particularly
if you export lots and lots of them.
Signed-off-by: Joshua Ferraro <jmf@liblime.com>
The -z option, when used in conjunction with -a and/or -b,
selects the records to reindex from the zebraqueue table.
Both record updates and record deletes are handled.
-z cannot be used with -s or -r: the updated records
must always be freshly exported, and if the zebraqueue
is to be processed, it's assumed that you don't want
to drop the Zebra index first.
This means that rebuild_zebra.pl -b -a -z can be
used as a cronjob to update the indexes periodically; it
is believed that this will offer much better indexing
performance on some setups as compared to zebraqueue_daemon.pl,
which uses Z39.50 extended services to send record updates
to Zebra.
Signed-off-by: Joshua Ferraro <jmf@liblime.com>
At the moment, using both -a (index authorities) and
-x (export records as MARC XML) is not allowed -
if the Zebra authority database is using the DOM
filter, zebraidx will not be able to process the
exported records correctly.
Signed-off-by: Joshua Ferraro <jmf@liblime.com>
1. Logic to fix up record IDs, UNIMARC 100 field,
and record leader now in separate functions.
2. Removed (incorrect) logic to save corrected record
in database.
Signed-off-by: Joshua Ferraro <jmf@liblime.com>
The UNIMARC stuff has been moved to the marc_defs directory and the
language-specific files are in lang_defs.
Signed-off-by: Chris Cormack <crc@liblime.com>
Signed-off-by: Joshua Ferraro <jmf@liblime.com>
[1] Use File::Temp to create and manage
export directory if -d is not specified.
[2] Added usage message.
[3] Code that attempts to fix up Zebra
configuration files has been changed so that it
is invoked only if the --munge-config option
is supplied; this code will ultimately
either be removed or moved to a separate
script -- the sorts of errors that it
tries to fix should no longer be appearing
in a standard install.
[4] Fixed Win32 portability problem when removing
temporary directory.
Signed-off-by: Chris Cormack <crc@liblime.com>
Signed-off-by: Joshua Ferraro <jmf@liblime.com>
There are only 2 UNIMARC-specific files (.abs and .chr); they have been moved to etc/zebradb.
rebuild_zebra.pl now takes all config files from this location.
The misc/zebra/ directory can be removed (and will be soon).
Signed-off-by: Joshua Ferraro <jmf@liblime.com>
This time it's when a biblio doesn't have a biblionumber, has a 100$a field, and that field is invalid.
One biblio in my 300,000-record DB (and it was biblio 294,359, of course!).
Signed-off-by: Chris Cormack <crc@liblime.com>
Signed-off-by: Joshua Ferraro <jmf@liblime.com>