Collation in ICU
Mark Davis
Chief SW Globalization Architect
IBM
Dublin, Ireland 1/26/2022
21st International Unicode Conference

What is ICU?
- Premier Unicode Enablement Library
- Open-source: non-viral license
- Full-Featured, Cross-Platform
- C, C++, Java APIs
- Collation, Charset Conversion, Resources, Boundaries, Calendars, Transforms (case, norm., translit., ...), Format/Parse (dates, times, msgs, nums., curr., ...), Unicode strings/props
- http:/

Collation = Sorting Order
- How hard can it be?
  - A B C
- Complications
  - Languages are complex and varied
  - Unicode is a big set of characters
  - Performance is crucial

Varies By:
- Language
  - Swedish: z
  - German: z
- Usage
  - Dictionary: f of
  - Telephone: of f
- Customizations
  - A a a A
- Versioning
  - Fixes
  - New Gov. Stds
  - New Characters

Strength Levels: L1, L2, L3
1. Base characters: a b
2. Accents: as s at
   ignored if there is an L1 character difference
3. Case: ao Ao a
   ignored if there is an L1 or L2 difference
4. Punctuation: ab a-b aB
   ignored* if there is an L1, L2, or L3 difference
5. Tie-breaker: NFD code point order

Context Sensitivity
- Contractions
  - H Z, but CZ CH
- Expansions
  - OE OF
- Both

Canonical Equivalence
- A + x + . + x + + .u + + . + u + . + u + + .

Oddities
- Normal accents
  - cote cot cte ct
  - first accent difference determines order
- French accents
  - cote cte cot ct
  - last accent difference determines order
- Logical Order Exception (Thai, Lao)
  - sorts like

Merging Database Fields
- F1 = LastName, F2 = FirstName

  Sequential        Weak 1st          Merged
  F1, then F2       F1 (L1), F2       L1, L2, L3
  diSilva, John     diSilva, John     diSilva, John
  diSilva, Fred     dsilva, John      di Silva, John
  di Silva, John    di Silva, John    dsilva, John
  di Silva, Fred    di Silva, Fred    diSilva, Fred
  dsilva, John      diSilva, Fred     di Silva, Fred
  dsilva, Fred      dsilva, Fred      dsilva, Fred

Customizations
- Parameters that change collation behavior
- Choice of language (locale)
- Runtime choices
- Examples to follow

Parametric Customizations
- Strength
  - Base
  - Base+Accent
  - Base+Accent+Case
  - &c.
- Case: A a a A
- Punctuation: di Silva diSilva diSilva di Silva

Punctuation / Spaces (Alternates)

  Base Character    Ignorable
  di silva          Dickens
  di Silva          di silva
  Di silva          disilva
  Di Silva          di Silva
  Dickens           diSilva
  disilva           Di silva
  diSilva           Disilva
  Disilva           Di Silva
  DiSilva           DiSilva

Extended Customizations
- User-defined
  - “&” “ampersand”
- Merging tailorings
  - Iranian + French
- Script Order
  - b ? b ?
- Numbers
  - A-10 A-2 A-2 A-10

Other Uses: String Searching
- Match according to locale conventions:
  - e.g. w = v for Swedish
- Use collation options:
  - ignore case, accent
  - other customizations

Other Uses: Selection Bounds
- Return all records where:
  - Zo name Zorma
- Ignore case / accents
  - Zoe / zoe / Zo / zo /

UCA
- UTS #10: Unicode Collation Algorithm
  - Levels, Expansions, Contractions, Punctuation, Canonical Equivalence, etc.
  - Default ordering: all Unicode code points
  - Provides for tailoring to given languages
- Also see: The Unicode Standard, 5.17: Sorting and Searching
- Aligned with ISO 14651

APIs
- String Compare
- Sort Keys
- String Search
- Selection Boundaries
- Merged sortkeys

Sort Keys
- Transform string into series of bytes which will binary-compare

  a:   06 C3 01 20 01 02 00
  A:   06 C3 01 20 01 08 00
   :   06 C3 01 20 32 01 02 02 00
  ab:  06 C3 06 D7 01 20 20 01 02 02 00
  b:   06 D7 01 20 01 02 00

  (Legend: Level 1, Level 2, Level 3)

String Compare vs. Sort Keys
- Same results in either case
- SC faster for single comparisons
  - average 5 to 10 times!
- SK faster for multiple comparisons
  - index once
  - binary compare many times

String Search
- Naïve Approach
  - key matches in target at iff target.substring(x, y) key
- Boundary Complications
  - Ignorables: “a” matches in “(a)”? at & & & ?
  - Contractions: “c” matches in “churo”?
  - Normalization: “” matches in “a”?

WARNING 1: Basics
- Not aligned with character set or repertoire
  - Latin-1: Swedish and German sorting differ
- Not code point (binary) order
  - Binary: Z a v a
  - Swedish: v w
- Not a property of strings: same Database
  - Swedish user: views/select
  - German user: views/selects

WARNING 2: Operations
- Order not preserved under concatenation / substringing
  - x < y does not imply xz < yz
  - x < y does not imply zx < zy
  - xz < yz does not imply x < y
  - zx < zy does not imply x < y

WARNING 3: Dependence
- Collation is a relation over strings
- Sort keys embody part of that relation
- Thus, comparing sort keys from different tailorings (or parameters) gives undefined results.

WARNING 4: Stability
- Stable Sort
  - Records with equal comparison come out in original order
  - Property of algorithm, not comparison
- Semi-Stable Comparison
  - x y x y
  - Property of comparison, not algorithm
  - Degrades performance
  - Doesn't do what people think (or really want)!

ICU/Java Collation Architecture
- L1-3, contractions, expansions, Locale tailorings
- Fully rule-based specification
- Arbitrary runtime user customizations
  - & ? = question mark
  - & $ = dollar sign
  - & z george

Java
- Sun licensed and includes an early version of ICU collation in Java
- ICU version:
  - Dramatically faster
  - Much reduced memory consumption
  - Halved sort-key length
  - Many additional features

ICU Collation I
- Full UCA compliance
- Full supplementary character support
- Solid performance
- Small Sort-Keys
- Small Memory Footprint

ICU Collation II
- Parametric control
- Tailorable to any language
- Simultaneous Multiple Versions
- Merging Sort Keys
- Selection Bounds

Memory-Mappable, Fast Init
- Old: separate allocations
- New: offsets within mem-map

Delta Tailoring: Minimize Memory Usage
[Diagram. Labels: FR; found; UCA: One Copy; 80K; not found; code; not; synthesize; input; output]

Simultaneous Multiple Versions
- Programs can link against different versions of ICU, simultaneously.
- Preserves exact binary order over time.
[Diagram. Labels: Application; New DB; Old DB; ICU 2.1; ICU 2.0]

Performance
- Checks for identical prefixes first
- Invokes normalization only when needed
- Fast paths for common cases
- Minimizes comparison time
- Minimizes sort key length

Sort Key Compression
- Common weights are 1-byte
  - Primary, secondary, tertiary, quaternary
- Sequences are compressed
- UTF-16 Values for “Märk Davis” (22 bytes)
  004D 00E4 0072 006B 0020 0044 0061 0076 0069 0073 0000
- Sort Key (L3, ignorable punctuation - 19 bytes)
  2F 17 39 2B 1D 17 41 27 3B 01
  77 96 0A 01
  8F 80 8F 07 00

ICU vs. Windows, glibc
- Full UCA!
- String comparison: comparable speed
  - -20% ... +400%
- Sort keys: much shorter
  - 50%
- Warning: speed comparisons are approximate!
  - Depends on data, parameters, features, CPU

More Information
- ICU
  http:/
- Document
  http:/
- Version of these slides
  http:/

Q & A

Fast C or D (FCD)
- Accepts all NFD, most NFC, without normalization

  X                     FCD   NFC   NFD
  A-ring                Y     Y
  Angstrom              Y
  A + ring              Y           Y
  A + grave             Y           Y
  A-ring + grave        Y
  A + cedilla + ring    Y           Y
  A + ring + cedilla
  A-ring + cedilla            Y

Backup Slides
- Not used in the presentation, except in response to questions

Performance: Coding
- Avoided unnecessary function calls.
  - Example: strlen too expensive!
- Avoided use of objects
  - Rewrote core code in C
  - C++ API wraps the C core code.
- Fast-pathed common cases
- Used stack memory buffers (with expansion if necessary)
- Made inner loops as tight as possible

WARNING 5: Math. Relation
- S = Unicode Strings
- Reflexive
  - ∀ a ∈ S: a ≤ a
- Antisymmetric
  - ∀ a, b ∈ S: a ≤ b & b ≤ a ⇒ a = b
- Transitive
  - ∀ a, b, c ∈ S: a ≤ b & b ≤ c ⇒ a ≤ c
- Total
  - ∀ a, b ∈ S: a ≤ b ∨ b ≤ a

Identical Prefixes
- Sorting / Searching Databases
  - Many comparisons to “close” strings
- Check initial prefixes with binary compare
- Drop into collation loop at first difference
- Complication

Initial Prefix Complication
- Need to back up if in “bad” position:

  Type                     Example
  Contraction (Spanish)    ch
  Normalization            a
  Surrogate Pair

Fractional UCA
- Fractional weights for compression
- Gaps for tailoring, future UCA additions
- Only stores differences in tailoring file
- Reduces memory footprint

               UCA                    Frac. UCA
               a b                    a b
  primary      0861 0865 0871 0875    17, 18 60, 18 66, 19
  secondary    20 20 20 20            03 03 03 03
  tertiary     02 02 02 02            03 03 03 03

Exceptional Values
- Normal weight storage
  P P P P P P P P P P P P P P P P S S S S S S S S C C T T T T T T
  Field widths: P 16b, S 8b, C 1 1, T 6b
- Special Weight Storage
  F F F F T T T T d d d d d d d d d d d d d d d d d d d d d d d d
  Field widths: F 4b, T 4b (Tag), d 24 bit data
  - NOT_FOUND, EXPANSION, CONTRACTION, THAI, ...

Minimal Memory
- Flat-file (memory mapped)
  - speeds initialization
  - reduces memory footprint
  - (next slide)
- Delta Tailoring
  - Single copy of UCA (80K)
  - Small delta files per locale
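
Appendix sketch: strength levels and sort keys

The "Strength Levels" and "Sort Keys" slides describe comparing base characters first, then accents, then case, and turning a string into a key that can be compared directly. The Python sketch below is a toy model of that idea only. It is not the UCA, not ICU's API, and not ICU's key byte format. The names `toy_sort_key` and `toy_compare` are hypothetical, and sorting lowercase before uppercase is an assumption chosen to mirror the a / A keys on the "Sort Keys" slide.

```python
import unicodedata


def toy_sort_key(text, strength=3):
    """Build a toy multi-level sort key (NOT the UCA, not ICU's key format).

    Level 1: base characters, level 2: accents, level 3: case.
    `strength` keeps only the first 1, 2 or 3 levels.
    """
    primary, secondary, tertiary = [], [], []
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch):
            # A combining mark is recorded as an accent on the previous base.
            if secondary:
                secondary[-1] += (ord(ch),)
            continue
        primary.append(ch.lower())
        secondary.append(())
        # Assumption: lowercase (0) sorts before uppercase (1).
        tertiary.append(1 if ch.isupper() else 0)
    levels = (tuple(primary), tuple(secondary), tuple(tertiary))
    # Tuples compare level by level, so a lower level only matters
    # when every higher level is equal.
    return levels[:strength]


def toy_compare(a, b, strength=3):
    """Three-way comparison (-1, 0, 1) done by comparing the toy keys."""
    ka, kb = toy_sort_key(a, strength), toy_sort_key(b, strength)
    return (ka > kb) - (ka < kb)


if __name__ == "__main__":
    words = ["b", "A", "á", "a", "ab"]
    print(sorted(words, key=toy_sort_key))      # ['a', 'A', 'á', 'ab', 'b']
    print(toy_compare("a", "A", strength=1))    # 0: case ignored at level 1
```

Computing the key once per record and then comparing keys many times is the "index once, binary compare many times" pattern from the "String Compare vs. Sort Keys" slide. As WARNING 3 notes, keys built with different parameters (here, different `strength` values) should not be compared with each other.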