Koha/tools
Julian Maurice 76e980bb1a Bug 29333: Fix encoding of imported UNIMARC authorities
MARC::Record and MARC::File::* modules sometimes use position 09 of the
leader to detect the encoding. A blank character means 'MARC-8' while an
'a' means 'UTF-8'.

In a UNIMARC authority this position is used to store the authority type
(see https://www.transition-bibliographique.fr/wp-content/uploads/2021/02/AIntroLabel-2004.pdf [FR]).
In this case, 'a' means 'Personal Name'.

The result is that the import will succeed for a Personal Name
authority, but it will fail for all other authority types.
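
The mismatch described above can be illustrated with a minimal Python
sketch. It is not MARC::Record or Koha code: make_leader, guess_encoding
and import_heading are hypothetical helpers, the leader contents other
than position 09 are placeholders, and Latin-1 stands in for the MARC-8
code path because Python's standard library has no MARC-8 codec.

```python
# Illustrative sketch only: this is not MARC::Record or Koha code.
# Assumption: Latin-1 stands in for the MARC-8 code path.

def make_leader(entity_type: str) -> str:
    """Build a hypothetical 24-character leader with entity_type at position 09."""
    return "00000cx  " + entity_type + "2200000   45  "

def guess_encoding(leader: str) -> str:
    """Mimic the heuristic: 'a' at position 09 means UTF-8, anything else MARC-8."""
    return "UTF-8" if leader[9] == "a" else "MARC-8"

def import_heading(raw: bytes, leader: str) -> str:
    """Decode bytes that are really UTF-8 according to the guessed encoding."""
    if guess_encoding(leader) == "UTF-8":
        return raw.decode("utf-8")
    # Wrong branch for UTF-8 input: each multi-byte character is split up.
    return raw.decode("latin-1")

balzac = "Balzac, Honoré de".encode("utf-8")
athenes = "Athènes (Grèce)".encode("utf-8")

# 'a' at position 09 (Personal Name) happens to match the UTF-8 marker.
print(import_heading(balzac, make_leader("a")))
# 'c' at position 09 is read as "not UTF-8", so the heading is garbled.
print(import_heading(athenes, make_leader("c")))
```

In this sketch the Personal Name record decodes cleanly only because its
authority type code coincides with the UTF-8 marker; any other value at
position 09 sends valid UTF-8 down the wrong path.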

Steps to reproduce:
0. Be sure to have a Koha UNIMARC instance.
1. Download the MARCXML for "Honoré de Balzac"
   curl -o balzac.marcxml https://www.idref.fr/02670305X.xml
2. Verify that it's encoded in UTF-8
   file balzac.marcxml
   (should output "balzac.marcxml: XML 1.0 document, UTF-8 Unicode
   text")
3. Go to Tools » Stage MARC for import and import balzac.marcxml with
   the following settings:
   Record type: Authority
   Character encoding: UTF-8
   Format: MARCXML
   Do not touch the other settings
4. Once imported, go to the staged MARC management tool and find your
   batch. Click on the authority title "Balzac Honoré de 1799-1850" to
   show the MARC inside a modal window. There should be no encoding
   issue.
5. Write down the imported record id (the number in column '#') and go
   to the MARC authority editor. Replace all URL parameters with
   'breedingid=THE_ID_YOU_WROTE_DOWN'
   The URL should look like this:
   /cgi-bin/koha/authorities/authorities.pl?breedingid=198
   You should see no encoding issues. Do not save the record.
6. Import the batch into the catalog. Verify that the authority record
   has no encoding issue.
7. Now download the MARCXML for "Athènes (Grèce)"
   curl -o athènes.marcxml https://www.idref.fr/027290530.xml
8. Repeat steps 2 to 6 using the athènes.marcxml file. At steps 4 and 5
   you should see encoding issues and that position 09 of the leader was
   rewritten from 'c' to 'a'. Strangely, importing this batch fixes the
   encoding issue, but we still lose the information in position 09 of
   the leader.

This patch uses the MARCXML representation of the record instead of the
ISO2709 representation because, unlike MARC::Record::new_from_usmarc,
MARC::Record::new_from_xml lets us pass the encoding and the format
directly, which prevents data from being double-encoded when position 09
of the leader is anything other than 'a'.
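
The idea behind the fix, passing the encoding explicitly rather than
inferring it from the leader, can be sketched in Python as follows. This
is an assumption-laden illustration, not the patch itself: decode_record
is a hypothetical function, the leader is a placeholder apart from
position 09, and Latin-1 again stands in for the MARC-8 path.

```python
from typing import Optional

def decode_record(raw: bytes, leader: str, encoding: Optional[str] = None) -> str:
    """Decode raw bytes; an explicit encoding takes priority over the leader guess."""
    if encoding is None:
        # Fallback heuristic: trust position 09 of the leader.
        # Assumption: Latin-1 stands in for MARC-8.
        encoding = "utf-8" if leader[9] == "a" else "latin-1"
    return raw.decode(encoding)

# Hypothetical authority leader with 'c' at position 09.
leader = "00000cx  c2200000   45  "
raw = "Athènes (Grèce)".encode("utf-8")

guessed = decode_record(raw, leader)                     # relies on the leader
explicit = decode_record(raw, leader, encoding="utf-8")  # caller states the encoding

print(guessed)
print(explicit)
print(leader[9])  # the leader is never consulted or rewritten in the explicit case
```

When the caller supplies the encoding, position 09 no longer needs to
carry encoding information, so it can keep its authority type value.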

Test plan:
- Follow the "steps to reproduce" above and verify that you have no
  encoding issues.

Signed-off-by: David Nind <david@davidnind.com>
Signed-off-by: Martin Renvoize <martin.renvoize@ptfs-europe.com>
Signed-off-by: Tomas Cohen Arazi <tomascohen@theke.io>
(cherry picked from commit 01d78e1ec7)

Signed-off-by: Lucas Gass <lucas@bywatersolutions.com>
2022-08-23 15:30:11 +00:00
csv-profiles
access_files.pl Bug 17600: Standardize our EXPORT_OK 2021-07-16 08:58:47 +02:00
additional-contents.pl Bug 22659: (follow-up) Add category to redirect 2022-07-29 16:06:44 +00:00
ajax-inventory.pl Bug 17600: Standardize our EXPORT_OK 2021-07-16 08:58:47 +02:00
automatic_item_modification_by_age.pl Bug 22827: Add age dependency on other fields than dateaccessioned 2022-04-08 15:49:16 +02:00
background-job-progress.pl Bug 28785: Adjust check_cookie_auth calls 2021-10-18 11:28:41 +02:00
batch_delete_records.pl Bug 29771: Scalar context for split 2022-03-08 23:03:34 -10:00
batch_extend_due_dates.pl Bug 29380: Correct table name in joins to prevent errors 2021-11-03 15:40:52 +01:00
batch_record_modification.pl Bug 29771: Scalar context for split 2022-03-08 23:03:34 -10:00
batch_records_ajax.pl Bug 22785: Allow option to choose which record match is applied during import 2022-05-03 11:19:50 -10:00
batchMod.pl Bug 30525: Items batch modification broken 2022-04-21 13:41:36 -10:00
cleanborrowers.pl Bug 29843: Use in tools/cleanborrowers.pl 2022-02-10 14:44:23 -10:00
copy-holidays.pl Bug 17600: Standardize our EXPORT_OK 2021-07-16 08:58:47 +02:00
csv-profiles.pl Bug 17600: Standardize our EXPORT_OK 2021-07-16 08:58:47 +02:00
exceptionHolidays.pl Bug 17600: Standardize our EXPORT_OK 2021-07-16 08:58:47 +02:00
export.pl Bug 29844: Fix ->search occurrences 2022-02-09 15:36:23 -10:00
holidays.pl Bug 17600: Standardize our EXPORT_OK 2021-07-16 08:58:47 +02:00
import_borrowers.pl Bug 29005: Add option to send welcome email from patron imports 2022-04-20 09:03:39 -10:00
inventory.pl Bug 29695: (follow-up) Remove C4::Reports::Guided::_get_column_defs 2022-04-12 11:40:16 +02:00
letter.pl Bug 30545: Replace the use of jQueryUI Accordion on the notices page 2022-05-02 11:22:58 -10:00
manage-marc-import.pl Bug 30738: Log warnings for background MARC import 2022-07-12 16:40:33 +00:00
marc_modification_templates.pl Bug 17600: Standardize our EXPORT_OK 2021-07-16 08:58:47 +02:00
modborrowers.pl Bug 29926: Add ability for superlibrarians to batch edit password expiration dates 2022-05-06 10:33:09 -10:00
newHolidays.pl Bug 17600: Standardize our EXPORT_OK 2021-07-16 08:58:47 +02:00
overduerules.pl Bug 29844: Fix ->search occurrences 2022-02-09 15:36:23 -10:00
picture-upload.pl Bug 6815: Capture member photo via webcam 2022-03-24 14:22:10 -10:00
problem-reports.pl Bug 17600: Standardize our EXPORT_OK 2021-07-16 08:58:47 +02:00
quotes-upload.pl Bug 17600: Standardize our EXPORT_OK 2021-07-16 08:58:47 +02:00
quotes.pl Bug 17600: Standardize our EXPORT_OK 2021-07-16 08:58:47 +02:00
scheduler.pl Bug 17600: Standardize our EXPORT_OK 2021-07-16 08:58:47 +02:00
showdiffmarc.pl Bug 29333: Fix encoding of imported UNIMARC authorities 2022-08-23 15:30:11 +00:00
stage-marc-import.pl Bug 30525: Items batch modification broken 2022-04-21 13:41:36 -10:00
stockrotation.pl Bug 29771: Remove trivial cases 2022-03-08 23:03:34 -10:00
tools-home.pl Bug 17600: Standardize our EXPORT_OK 2021-07-16 08:58:47 +02:00
upload-cover-image.pl Bug 30972: Don't replace biblio's local cover images when uploading an image's image 2022-06-24 16:17:41 +00:00
upload-file.pl Bug 28785: Adjust check_cookie_auth calls 2021-10-18 11:28:41 +02:00
upload.pl Bug 17600: Standardize our EXPORT_OK 2021-07-16 08:58:47 +02:00
viewlog.pl Bug 19532: (RM follow-up) More use of system preference 2022-03-14 23:11:12 -10:00