Commit graph

50 commits

Author SHA1 Message Date
Ian Walls
4e95e94727 Bug 6789: biblios with many items can result in broken search results link
This patch fixes an issue whereby biblios with many items (often > 500) would index,
but not the biblionumber itself, resulting in search results with a) inaccurate item counts
and b) no biblionumber to use in the link to the details page.  This is due to Net::Z3950::ZOOM  not providing
a mechanism for specifying different connection attributes; the maximumRecordSize ZOOM connection attribute,
if not specified, defaults to 1MB, which is less than the size of a MARC record with many, many 952 fields.  Since
it is unlikely we can fix Net::Z3950::ZOOM in a timely fashion, this patch aims to build a workaround on the Koha end.

This patch changes EmbedItemsInMarcBiblio to use append_fields instead of insert_ordered_fields,
so the 999$c will come before the item records.  It's VERY unlikely we will encounter more than 1MB of biblio-level MARC
content, as this would break the ISO-2709 standard by a large factor.

To this end, it also moves the fix_biblio_ids portion of get_corrected_marc_record out of rebuild_zebra.pl,
and makes it a part of GetMarcBiblio (right before EmbedItemsInMarcBiblio, so the 952s still come last).  fix_biblio_ids
is kept as a subroutine for the deletion portion of rebuild_zebra.pl, which still uses it.

It also uses the subroutine parameter in GetMarcBiblio to do the EmbedItemsInMarcBiblio action, rather than having
rebuild_zebra.pl perform it on the itemless record returned from GetMarcBiblio.  Simpler and cleaner that way.

To verify bug issue:
1. Find a biblio with over 700 items (or enough that the resulting MARCXML is greater than 1MB)
2. search for this biblio (in a search that would return multiple results, not just this title).  You should get the title in
the results list
3. attempt to click the link to this biblio's details page; the biblionumber should be blank, leading to a 404

To test solution:
1. Apply patch
2. modify the biblio slightly (click the 005 for example) and save
   OR manually add the biblio to zebraqueue for reindexing
3. after rebuild_zebra.pl -z -b -x runs, use the same search as above. The title should still appear.
4. click the link, and find yourself on the biblio detail page as desired

Signed-off-by: D Ruth Bavousett <ruth@bywatersolutions.com>
Signed-off-by: Paul Poulain <paul.poulain@biblibre.com>
Signed-off-by: Chris Cormack <chrisc@catalyst.net.nz>
2011-10-15 13:47:24 +13:00
Jesse Weaver
048c0dc04e Bug 6492 - Deleted biblios cause rebuild_zebra to fail
This both adds a bit of a failsafe to get_raw_biblio, and prevents
records that have been deleted from being updated by the same instance
of rebuild_zebra.

Minor amendment to remove duplication of 6433

Signed-off-by: MJ Ray <mjr@phonecoop.coop>
Signed-off-by: Chris Cormack <chrisc@catalyst.net.nz>
2011-07-05 11:18:28 +12:00
3b8f1318e0 Bug 6050 Followup, edit a last function call
Signed-off-by: Frédéric Demians <f.demians@tamil.fr>
Signed-off-by: Chris Cormack <chrisc@catalyst.net.nz>
2011-06-14 14:12:05 +12:00
Srdjan Janković
5829cef6d8 bug_6433: exception handling
Signed-off-by: Magnus Enger <magnus@enger.priv.no>
2011-06-10 11:27:25 +12:00
e96315556b bug 5579: new routine to embed items in bib
Adds a new routine, C4::Biblio::EmbedItemsInMarcBiblio, to
embed the items in the bib record when necessary:

* cataloging/additem.pl
* rebuild_zebra.pl

Signed-off-by: Galen Charlton <gmc@esilibrary.com>
Signed-off-by: Claire Hernandez <claire.hernandez@biblibre.com>
Signed-off-by: Chris Cormack <chrisc@catalyst.net.nz>
2011-04-19 22:34:21 +12:00
Henri-Damien LAURENT
3584c4426b Bug 5579: remove items from MARC bib
This is a squash of four patches by Henri-Damien Laurent
starting work on removing the copy of item record information
in the 9XX field of bibliographic records.  The reason
for doing this is primarily to improve performance, in particular,
the expense of having to add/modify the bib record whenever an
item changes.  Now, whenever an item changes, the bib record is
put in the queue to be reindexed; when the bib is indexed, the 9XX
fields are inserted into the version of the bib that Zebra indexes.
Since rebuild_zebra.pl runs in a separate process, the processing of the
bib record will not delay (e.g.) circulation.

As part of upgrading to 3.4, the following batch script should be run:

misc/maintenance/remove_items_from_biblioitems.pl --run

This should be followed by a complete reindexing of the bib records, e.g.,

misc/migration_tools/rebuild_zebra.pl -b -r

Signed-off-by: Galen Charlton <gmcharlt@gmail.com>
Signed-off-by: Claire Hernandez <claire.hernandez@biblibre.com>
Signed-off-by: Chris Cormack <chrisc@catalyst.net.nz>
2011-04-19 22:33:56 +12:00
Ian Walls
8dc56a0d2c Bug 5831: rebuild_zebra.pl doesn't respect -r
Reimplements support for -r, as well for -reset

Signed-off-by: D Ruth Bavousett <ruth@bywatersolutions.com>
Signed-off-by: Chris Cormack <chrisc@catalyst.net.nz>
2011-03-06 08:44:57 +13:00
Robin Sheat
8de1ef7e94 Bug 5228 - make rebuild_zebra handle fixing the zebra dirs
If the zebra server directories don't exist, zebra will spit the dummy.
This makes rebuild_zebra.pl smart enough to create them if they're not
there. If that fails, it'll scream loudly so you know zebra isn't
reindexing.

Signed-off-by: Chris Cormack <chrisc@catalyst.net.nz>
2010-12-13 21:59:49 +13:00
Robin Sheat
57d11aee2c Bug 5077 - ensure rebuild_zebra will run somewhere it can read
This prevents it leaving files lying around in /tmp

Signed-off-by: Chris Cormack <chrisc@catalyst.net.nz>
Signed-off-by: Galen Charlton <gmcharlt@gmail.com>
2010-10-06 08:00:17 -04:00
Donovan Jones
5e0b850d49 Bug 2505 - Add commented use warnings where missing in the misc/ directory 2010-04-21 20:26:44 +12:00
459d732180 Bug 3301 - Speed up rebuild_zebra script
With this patch, rebuild_zebra can re-index a whole Koha DB
quickly:

  rebuild_zebra -r -b -nosanitize

Biblio (authority) records are dump directly in a file
from marcxml field without beeing transformed into
MARC::Record object and corrected.

DOCUMENTATION:

rebuild_zebra.pl new paramater:

-nosanitize  export biblio/authority records directly from DB marcxml
             field without sanitizing records. It speed up
             dump process but could fail if DB contains badly
             encoded records. Works now only with -x and -b

Signed-off-by: Galen Charlton <galen.charlton@liblime.com>
2009-06-29 07:52:46 -05:00
Brian Harrington
25cd35b3a1 bug 2924 fixed rebuild_zebra.pl to work when export is skipped
reindexing now occurs if there are $num_records_exported or if
$skip_export is set

Signed-off-by: Galen Charlton <galen.charlton@liblime.com>
2009-03-04 08:28:22 -06:00
Michael Hafen
086b3ccf9a bug in rebuild_zebra verbose logging - found another print I didn't want to see all the time
Add the phrase 'if ( $verbose_logging )' to the two print statements
concerning the skipping of biblio or authority records.

I recently had to split biblio and authority index updating in my cron
script ( had some really big records so had to add the -x switch which
should only be used on biblios accourding to the help ).  So I noticed
that rebuild_zebra.pl printed messages that it was skipping biblios or
authorities.

This patch is to conditionalize those prints based on the verbose
logging switch.

Signed-off-by: Galen Charlton <galen.charlton@liblime.com>
2008-12-11 09:23:28 -06:00
Michael Hafen
62a590a954 Reduce logging from rebuild_zebra.pl with a command line option
This reduces the output of the script and zebraidx, and creates a -v
command line switch which will increase the logging to their former
states.

Signed-off-by: Galen Charlton <galen.charlton@liblime.com>
2008-10-01 13:05:20 -05:00
Galen Charlton
df1f46f9da bug 2253: improve rebuild_zebra's handling of zebraqueue
Prior to this patch, rebuild_zebra.pl -z was effectively
hanging on to a lock on the zebraqueue table, preventing
other scripts from inserting new entries into the table.
This had the effect of causing circulation operations
to time out.

Refactored by having rebuld_zebra.pl pull the active
queue into memory, then mark entries done by zebraqueue.id.
Consequently, rebuild_zebra.pl should no longer
block adding new entries into zebraqueue.

Signed-off-by: Joshua Ferraro <jmf@liblime.com>
2008-06-19 09:49:06 -05:00
Galen Charlton
3109d5820e rebuild_zebra.pl - add -y option
rebuild_zebra.pl will now mark all zebraqueue entries
of the affected record type(s) done when run in
normal mode to index all records (as opposed to running
it with -z to just process the zebraqueue).  This prevents
any running zebraqueue_daemon processes from attempting
to reindex the same records, redundantly.

The new -y swtich overrides this new behavior; in other words, if
running rebuild_zebra.pl without -z, you can specify
-y to *not* mark zebraqueue done.

Signed-off-by: Joshua Ferraro <jmf@liblime.com>
2008-04-21 11:17:29 -05:00
Galen Charlton
e2c1f11715 fixed memory leak I introduced
Accidentally introducing a circular reference in a
MARC::Record object does not lead to goodness, particularly
if you export lots and lots of them.

Signed-off-by: Joshua Ferraro <jmf@liblime.com>
2008-04-01 06:46:05 -05:00
Galen Charlton
4f001186b6 still more rebuild_zebra refactoring
Merged duplicate code for indexing bibs and
authorities into a single index_records() function.

Signed-off-by: Joshua Ferraro <jmf@liblime.com>
2008-03-25 07:58:03 -05:00
Galen Charlton
a5576b8dfe IMPORTANT: added -z option to rebuild_zebra.pl
The -z option, when used in conjunction with -a and/or -b,
selects the records to reindex from the zebraqueue table.
Both record updates and record deletes are handled.

-z is cannot be used with -s or -r: the updated records
must always be freshly exported, and if zebraqueue
is to be processed, it's assumed that you don't want
to drop the Zebra index first.

This means that rebuild_zebra.pl -b -a -x can be
used as a cronjob to update the indexes periodically; it
is believed that this will offer much better indexing
performance on some setups as compared to zebraqueue_daemon.pl,
which uses Z39.50 extended services to send record updates
to Zebra.

Signed-off-by: Joshua Ferraro <jmf@liblime.com>
2008-03-25 07:58:01 -05:00
Galen Charlton
57d128f727 rebuild_zebra: exit if both -a and -x specified
At moment using both -a (index authorities) and
-x (export records as MARC XML) is not allowed -
if the Zebra authority database is using the DOM
filter, zebraidx will not be able to process the
exported records correctly.

Signed-off-by: Joshua Ferraro <jmf@liblime.com>
2008-03-25 07:57:44 -05:00
Galen Charlton
f0d5da7448 more rebuild_zebra.pl refactoring
1. Logic to fix up record IDs, UNIMARC 100 field,
   and record leader now in separate functions.
2. Removed (incorrect) logic to save corrected record
   in database.

Signed-off-by: Joshua Ferraro <jmf@liblime.com>
2008-03-25 07:57:43 -05:00
Galen Charlton
f98c27a8bc refactor rebuild_zebra: new routine for invoking zebraidx
Created a routine for calling zebraidx, replacing
separate invocations for bibs and authorities.

Signed-off-by: Joshua Ferraro <jmf@liblime.com>
2008-03-25 07:57:42 -05:00
Galen Charlton
ae8a76dacc rebuild_zebra.pl: removed disused $limit option
Signed-off-by: Joshua Ferraro <jmf@liblime.com>
2008-03-25 07:57:41 -05:00
Ryan Higgins
71dd69d5ac add option to export and index xml to rebuild_zebra
Signed-off-by: Chris Cormack <crc@liblime.com>
Signed-off-by: Joshua Ferraro <jmf@liblime.com>
2008-02-15 08:25:46 -06:00
Paul POULAIN
319a32b16e rebuild_zebra : directories updated
the unimarc stuff has been moved to marc_defs directory and the
lang specific is in lang_defs

Signed-off-by: Chris Cormack <crc@liblime.com>
Signed-off-by: Joshua Ferraro <jmf@liblime.com>
2008-01-03 00:55:12 -06:00
Joshua Ferraro
c6ddddad98 adding a new option, -w, which disables shadow indexing for the current batch (faster indexing of large sets where ACID isn't critical)
Signed-off-by: Joshua Ferraro <jmf@liblime.com>
2007-12-30 12:13:27 -06:00
Galen Charlton
4609608ccc allow use of older version of File::Temp
Signed-off-by: Joshua Ferraro <jmf@liblime.com>
2007-12-22 22:58:12 -06:00
Galen Charlton
93beb943c0 bug 1661: rebuild_zebra.pl changes
[1] Use File::Temp to create and manage
    export directory if -d is not specified.
[2] Added usage message.
[3] Code that attempts to fix up Zebra
    configuration files changed so that it
    is invoked only if --munge-config option
    is supplied; this code will ultimately
    either be removed or moved to a separate
    script -- the sorts of errors that it
    tries to fix should no longer be appearing
    in a standard install.
[4] Fixed Win32 portability problem when removing
    temporary directory.

Signed-off-by: Chris Cormack <crc@liblime.com>
Signed-off-by: Joshua Ferraro <jmf@liblime.com>
2007-12-20 19:19:43 -06:00
Paul POULAIN
262a6e2a9a Updating rebuild_zebra.pl : now uses etc config files
There are only 2 UNIMARC specific files (.abs and .chr), they have been moved to etc/zebradb

The rebuild_zebra.pl takes all config file from this location now.
the misc/zebra/ can be removed (and will be soon)

Signed-off-by: Joshua Ferraro <jmf@liblime.com>
2007-11-25 17:07:46 -06:00
Paul POULAIN
f38b7598fc still handling better dirty MARC records
this time it's when a biblio don't have biblionumber, has a 100$a field, and it's invalid.

1 biblio in my 300 000 DB (and it was biblio 294 359, of course !)

Signed-off-by: Chris Cormack <crc@liblime.com>
Signed-off-by: Joshua Ferraro <jmf@liblime.com>
2007-11-20 16:20:50 -06:00
Paul POULAIN
f1bca9ba50 missing biblionumber AND missing unimarc 100 was not properly handled
now, adding both on the fly when needed.
(had 2 biblios like that in a 290 000 DB, but was enought to have M::F::X complaining & diing !)

Signed-off-by: Chris Cormack <crc@liblime.com>
Signed-off-by: Joshua Ferraro <jmf@liblime.com>
2007-11-17 11:25:07 -06:00
Paul POULAIN
ef1ac56857 handling wrong MARC record better
Signed-off-by: Chris Cormack <crc@liblime.com>
Signed-off-by: Joshua Ferraro <jmf@liblime.com>
2007-11-12 17:13:00 -06:00
Paul POULAIN
9149a711fb bugfixes to config files for zebra 2.0.18
those 2 lines are invalid

Signed-off-by: Chris Cormack <crc@liblime.com>
Signed-off-by: Joshua Ferraro <jmf@liblime.com>
2007-11-08 17:50:00 -06:00
Paul POULAIN
b7eb9e1b5c rebuild_zebra now handle correctly improper authorities records
(missing 100 field are automatically added)

Signed-off-by: Chris Cormack <crc@liblime.com>
Signed-off-by: Joshua Ferraro <jmf@liblime.com>
2007-11-07 08:18:24 -06:00
Paul POULAIN
bb5cea8e56 deal with wrong authorities when exporting for zebra
(authorities that don't have a 001 field containing authid)

also comment some code when exporting biblios (NOT tested, hdl,pls confirm this commit)

Signed-off-by: Chris Cormack <crc@liblime.com>
Signed-off-by: Joshua Ferraro <jmf@liblime.com>
2007-11-07 08:18:19 -06:00
Paul POULAIN
89b9e8f8c1 skip empty records (new GetMarcRecord behaviour that returns empty string and not empty MARC::Record)
Signed-off-by: Chris Cormack <crc@liblime.com>
Signed-off-by: Joshua Ferraro <jmf@liblime.com>
2007-10-31 19:41:49 -05:00
Paul POULAIN
49ef1df969 Adding a new option to rebuildzebra : noxml
This option uses the iso2709 version of the MARC record instead of the XML one
(biblioitems.marc vs biblioitems.marcxml)
No change if the parameter is not set.

Signed-off-by: Chris Cormack <crc@liblime.com>
Signed-off-by: Joshua Ferraro <jmf@liblime.com>
2007-10-09 19:07:36 -05:00
Joshua Ferraro
ae34e8f45a changing the name of the zebra password file to passwd
Signed-off-by: Chris Cormack <crc@liblime.com>
2007-10-01 23:14:47 -05:00
tipaul
1399945a75 eval() on getAuthority & getBiblio to avoid a script failure 2007-08-01 09:20:03 +00:00
tipaul
5dd3f0229a bugfixes (various), handling utf-8 without guessencoding (as suggested by joshua, fixing some zebra config files -for french but should be interesting for other languages- 2007-06-06 13:08:35 +00:00
tipaul
0569dccd5f some changes to default zebra config for better searches 2007-05-25 09:34:30 +00:00
tipaul
5ff7fcffa4 Bugfixes & improvements (various and minor) :
- updating templates to have tmpl_process3.pl running without any errors
- adding a drupal-like css for prog templates (with 3 small images)
- fixing some bugs in circulation & other scripts
- updating french translation
- fixing some typos in templates
2007-05-22 09:13:54 +00:00
tipaul
ca201e36af Koha NoZebra :
- support for authorities
- some bugfixes in ordering and "CCL" parsing
- support for authorities <=> biblios walking

Seems I can do what I want now, so I consider its done, except for bugfixes that will be needed i m sure !
2007-05-10 14:45:15 +00:00
tipaul
6b201757c1 some bugfixes for this script that automatically build zebra DB from default config files 2007-04-17 08:50:33 +00:00
tipaul
a481fad4b7 Code cleaning :
== Biblio.pm cleaning (useless) ==
* some sub declaration dropped
* removed modbiblio sub
* removed moditem sub
* removed newitems. It was used only in finishrecieve. Replaced by a Koha2Marc+AddItem, that is better.
* removed MARCkoha2marcItem
* removed MARCdelsubfield declaration
* removed MARCkoha2marcBiblio

== Biblio.pm cleaning (naming conventions) ==
* MARCgettagslib renamed to GetMarcStructure
* MARCgetitems renamed to GetMarcItem
* MARCfind_frameworkcode renamed to GetFrameworkCode
* MARCmarc2koha renamed to TransformMarcToKoha
* MARChtml2marc renamed to TransformHtmlToMarc
* MARChtml2xml renamed to TranformeHtmlToXml
* zebraop renamed to ModZebra

== MARC=OFF ==
* removing MARC=OFF related scripts (in cataloguing directory)
* removed checkitems (function related to MARC=off feature, that is completly broken in head. If someone want to reintroduce it, hard work coming...)
* removed getitemsbybiblioitem (used only by MARC=OFF scripts, that is removed as well)
2007-03-29 13:30:31 +00:00
tipaul
a3999812e6 rel_3_0 moved to HEAD 2007-03-09 14:52:58 +00:00
tipaul
f74823bf1b OK, this time it seems to work. The last blocking problem was... a space in
recordId: (bib1,Identifier-standard) just after the comma. Adam agreed it was a bug, and it should be solved soon. But now we are aware, we can avoid putting the space !

In this commit you have all what is needed to setup a working zebra DB in Unimarc :
* collection.abs is UNIMARC specific and must be rewritten for MARC21, in marc21 directory
* pdf.properties is to be copied unmodified in the marc21 directory (can also be put somewhere else)
* rebuild_zebra.pl is SLOW, but 1 step reindexing tool, using ZOOM
* rebuild_zebra_idx is FAST, but 2 step reindexing tool, and does not use zebra. run it, it will create all biblios XML files in /zebra/biblios directory, then zebraidx update biblios in your zebra directory
* zebra.cfg is the zebra config file ;-)
* test_cql2rpn.pl is a script that will query the database and show the results. Works for me, just change the query at the beginning to get answers you expect.

What has to be done :
* benchmarking : it seems the zebraidx update is faster than lightning (400biblios/sec : 10 000biblios in 25seconds), while ZOOM indexing is slow (something like 25biblios/second) More benchmarking could be done.
* completing collection.abs for UNIMARC. I'll take care of it.
* modifying Biblio.pm to use ZOOM instead of the "zebraidx through exec" running actually. I'll take care of it also.
* modify the search API & tools & screens. I'll let the ball to someone else (chris ?) for this. I agree SearchMarc.pm can be dropped and replaced by something else (maybe a new-and-clean Search.pm package)
2006-02-09 10:59:34 +00:00
tipaul
369ee65d94 new version of rebuild_zebra. Should work with Perl-ZOOM, but DOES NOT WORK for me.
I get  :
ZOOM error 10002 "Encoding failed" from diag-set 'ZOOM'

help expected from indexdata...
2006-01-10 17:03:32 +00:00
tipaul
d5938493d7 synch'ing head and rel_2_2 (from 2.2.5, including npl templates)
Seems not to break too many things, but i'm probably wrong here.
at least, new features/bugfixes from 2.2.5 are here (tested on some features on my head local copy)

- removing useless directories (koha-html and koha-plucene)
2006-01-06 16:39:37 +00:00
tipaul
dba37f38e7 This script can be use to rebuild the zebra DB. It stores all koha MARC records in iso2709, in the bilbios directory. After that, you just have to "zebraidx update biblios"
I tried on a 9900 DB, here are the results :

[paul@bureau migration_tools]$ ./rebuild_zebra.pl -c
9900
9903 MARC record done in 37.9104120731354 seconds

[paul@bureau zebra]$ zebraidx update biblios
<snip>
18:31:24-11/08 zebraidx(20348) [log] Iterations . . . 144575
18:31:24-11/08 zebraidx(20348) [log] Distinct words .  39891
18:31:24-11/08 zebraidx(20348) [log] Updates. . . . .     46
18:31:24-11/08 zebraidx(20348) [log] Deletions. . . .      2
18:31:24-11/08 zebraidx(20348) [log] Insertions . . .  39843
18:31:24-11/08 zebraidx(20348) [log] zebra_register_close p=0x8104cf8
18:31:25-11/08 zebraidx(20348) [log] Records:    9887 i/u/d 9881/6/0
18:31:25-11/08 zebraidx(20348) [log] user/system: 531/145
18:31:25-11/08 zebraidx(20348) [log] zebra_stop
18:31:25-11/08 zebraidx(20348) [log] zebraidx times: 11.33  5.31  1.45
2005-08-11 16:35:54 +00:00