Koha-community/Koha - Koha: The world's first free and open source library system

Author	SHA1	Message	Date
Frederic Demians	0551f48150	Improve C4::Charset::MarcToUTF8Record performance A script like bulkmarcimport.pl spends most of its time in the C4::Charset::MarcToUTF8Record function, and specifically in C4::Charset::char_decode5426, just initializing a hash. This patch moves the hash outside the function so that it is not re-initialized each time the function is called. A test with a specific conversion script showed run time improving from 23s to 8s. Signed-off-by: Galen Charlton <galen.charlton@liblime.com>	2008-11-06 15:53:29 -06:00
Galen Charlton	cfea172544	work around issue in MARC::Charset Because of a bug in MARC::Charset 0.98, if a string to convert from MARC-8 to UTF-8 has (a) one or more diacritics that (b) are only in character positions 128 to 255 inclusive, the resulting converted string is not in UTF-8 but in the legacy 8-bit encoding (e.g., ISO-8859-1). As a result, when such a record is converted to XML using ->as_xml_record(), the resulting XML can be truncated at the offending character. An example of such a record is one that has a price in British pounds in the 260$c but no other diacritics. Signed-off-by: Joshua Ferraro <jmf@liblime.com>	2008-04-01 06:46:04 -05:00
Galen Charlton	b549d7e1f1	added StripNonXmlChars to C4::Charset Added invocations of StripNonXmlChars to uses of new_from_xml() that involve records saved to Koha fields via MARC::Record->as_xml(); for batch jobs that work on MARC XML files coming from external sources, StripNonXmlChars should not necessarily be used, as it may be better to reject a file or record if it contains that kind of encoding error. Signed-off-by: Joshua Ferraro <jmf@liblime.com>	2008-02-08 20:22:42 -06:00
Galen Charlton	c86f5df431	charset: fixed bug that prevented ISO-5426 conversion Signed-off-by: Chris Cormack <crc@liblime.com> Signed-off-by: Joshua Ferraro <jmf@liblime.com>	2008-02-03 07:24:45 -06:00
Galen Charlton	60a98d258a	IMPORTANT - refactor MARC character set handling * IsStringUTF8ish - determine if scalar contains a string in UTF8 * MarcToUTF8Record - convert MARC blob or MARC::Record to UTF8 * SetMarcUnicodeFlag - set appropriate MARC21 or UNIMARC field to indicate that record is in UTF-8. Design points of this module include: * No dependencies on other C4 modules, making it easier to add more test cases * All character conversion code in one place * Single entry point for doing a character conversion on a MARC record * Capture of errors and warnings produced by Text::Iconv and MARC::Charset * Start of support for guessing the source character set of a MARC record. Several functions were moved from other scripts or modules to C4::Charset: * C4::Koha->FixEncoding (expanded and renamed MarcToUTF8Record) * C4::Koha->char_decode5426 * fMARC8ToUTF8 from bulkmarcimport.pl (renamed _marc_marc8_to_utf8) Several batch jobs were adjusted to use MarcToUTF8Record instead of FixEncoding. Signed-off-by: Chris Cormack <crc@liblime.com> Signed-off-by: Joshua Ferraro <jmf@liblime.com>	2008-02-03 07:23:56 -06:00
acli	e9858a2910	Moved C4/Charset.pm to C4/Interface/CGI/Output.pm	2003-02-02 07:19:29 +00:00
acli	ea50c2acb6	Preliminary fix of the CGI.pm problem of always assuming that everything is in ISO-8859-1. A new C4::Charset module (tentative name) has been created to guess the charset of a piece of HTML markup. The CGI programs will be modified to use this module as they are encountered during translation.	2003-01-19 06:15:44 +00:00

7 commits