Commit graph

11 commits

Author SHA1 Message Date
Galen Charlton
a244ff6671 bug 2505: turn on warnings in seven modules
C4::XSLT
C4::VirtualShelves
C4::Review
C4::Output
C4::Boolean
C4::Charset
C4::Stats

Signed-off-by: Chris Cormack <chrisc@catalyst.net.nz>
Signed-off-by: Galen Charlton <galen.charlton@liblime.com>
2009-06-07 20:09:16 -05:00
Galen Charlton
da51de184c bug 2926: fix staging import hang
Fixes a hang of the staging import tool when it
attempts to process a MARC21 record that claims
that it's UTF-8 when it is not.  The staging import
will now attempt to fix the character encoding of such
records.

Also added a FIXME to bulkmarcimport.pl, which because
of its use of MARC::Batch will skip over such records -
better than the original hang of the staging import, but
worse than the staging import's new ability to fix such
records.

Signed-off-by: Galen Charlton <galen.charlton@liblime.com>
2009-06-07 13:17:06 -05:00
Nahuel ANGELINETTI
1a4e67c143 (bug #3183) fix the SetMarcUnicodeFlag function
This patch fix the funciton SetMarcUnicodeFlag for UNIMARC support, now the function will fix the length of the field, and set encoding as "50  " instead of "5050".

Signed-off-by: Galen Charlton <galen.charlton@liblime.com>
2009-05-08 08:23:14 -05:00
Joe Atzberger
248e0392e2 Multi-bug fix - SetMarcUnicodeFlag for records coming from Koha
This has bearing on bugs 2905, 2665, 2514 and other "wide character" crashes
related to diacritics and Unicode.  This should help open the door for reliable
input of diacriticals via acquisitions.

MARC21_utf8_flag_fix.pl diagnoses and fixes existing problems with MARC data
affected by the bug.

Adding SetMarcUnicodeFlag to TransformKohaToMarc prevents the bug from corrupting
further data.

Signed-off-by: Galen Charlton <galen.charlton@liblime.com>
2009-04-18 09:14:43 -05:00
0551f48150 Improve C4::Charset::MarcToUTF8Record performance
A script like bulkmarkimport.pl spends most of the time
in C4::Charset::MarcToUTF8Record function, and
specifically in C4::Charset::char_decode5426
just initializing a hash. This patch moves this
hash outside function to avoid its initializing
each time the functon is called.

A test on a specific conversion script shows me
that performances were improved from 23s to 8s.

Signed-off-by: Galen Charlton <galen.charlton@liblime.com>
2008-11-06 15:53:29 -06:00
Galen Charlton
cfea172544 work around issue in MARC::Charset
Because of a bug in MARC::Charset 0.98, if a string to convert from
MARC-8 to UTF-8 has (a) one or more diacritics that (b) are only in character positions
128 to 255 inclusive, the resulting converted string is not in
UTF-8, but the legacy 8-bit encoding (e.g., ISO-8859-1).  As a result,
when such a record is converted to XML using ->as_xml_record(), the resulting
XML can be truncated at the offending character.  An example of such a record
is one that has a price in Briish pounds in the 260$c but no other diacritics.

Signed-off-by: Joshua Ferraro <jmf@liblime.com>
2008-04-01 06:46:04 -05:00
Galen Charlton
b549d7e1f1 added StripNonXmlChars to C4::Charset
Added invocations of StripNonXmlChars to uses
of new_from_xml() that involve records
saved to Koha fields via MARC::Record->as_xml();
for batch jobs that work on MARC XML files
coming from external sources, StripNonXmlChars
should not necessarily be used, as it may
be better to reject a file or record if it
contains that kind of encoding error.

Signed-off-by: Joshua Ferraro <jmf@liblime.com>
2008-02-08 20:22:42 -06:00
Galen Charlton
c86f5df431 charset: fixed bug that prevented ISO-5426 conversion
Signed-off-by: Chris Cormack <crc@liblime.com>
Signed-off-by: Joshua Ferraro <jmf@liblime.com>
2008-02-03 07:24:45 -06:00
Galen Charlton
60a98d258a IMPORTANT - refactor MARC character set handling
* IsStringUTF8ish - determine if scalar contains a string in UTF8
* MarcToUTF8Record - convert MARC blob or MARC::Record to UTF8
* SetMarcUnicodeFlag - set appropriate MARC21 or UNIMARC field to
  indicate that record is in UTF-8.

Design points of this module include:

* No dependencies on other C4 modules, making it easier to add
  more test cases
* All character conversion code in one place
* Single entry point for doing a character conversion on a
  MARC record
* Capture of errors and warnings produced by Text::Iconv
  and MARC::Charset
* Start of support for guessing the source character set of
  a MARC record.

Several functions were moved from other scripts
or modules to C4::Charset:

* C4::Koha->FixEncoding (expanded and renamed
  MarcToUTF8Record)
* C4::Koha->char_decode5426
* fMARC8ToUTF8 from bulkmarcimport.pl (renamed
  _marc_marc8_to_utf8)

Several batch jobs were adjusted to use MarcToUTF8Record instead of
FixEncoding.

Signed-off-by: Chris Cormack <crc@liblime.com>
Signed-off-by: Joshua Ferraro <jmf@liblime.com>
2008-02-03 07:23:56 -06:00
acli
e9858a2910 Moved C4/Charset.pm to C4/Interface/CGI/Output.pm 2003-02-02 07:19:29 +00:00
acli
ea50c2acb6 Preliminary fix of the CGI.pm problem of always assuming that everything is
in ISO-8859-1.

A new C4::Charset module (tentative name) has been created to guess the
charset of a piece of HTML markup. The CGI programs will be modified to use
this module as they are encountered during translation.
2003-01-19 06:15:44 +00:00