You can not select more than 25 topics Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.

6.6 KiB

Introduction

This document will as the C++ port matures serve as a log to how different parts of the library work. As of today, there is some general info but mostly CMap specific details.


Font Data Tables

One of the important goals in sfntly is thread safety which is why tables can only be created with their nested Builder class and are immutable after creation.

CMapTable

CMap = character map; it converts code points in a code page to glyph IDs.

The CMapTable is a table of CMaps (CMaps are also tables; one for every encoding supported by the font). Representing an encoding-dependent character map is in one of 14 formats, out of which formats 0 and 4 are the most used; sfntly/C++ will initially only support formats 0, 2, 4 and 12.

CMapFormat0 Byte encoding table

Format 0 is a basic table where a characters glyph ID is looked up in a glyphIdArray256. As it only supports 256 characters it can only encode ASCII and ISO 8859-x (alphabet-based languages).

CMapFormat2 High-byte mapping through table

Chinese, Japanese and Korean (CJK) need special 2 byte encodings for each code point like Shift-JIS.

CMapFormat4 Segment mapping to delta values

This is the preferred format for Unicode Basic Multilingual Plane (BMP) encodings according to the Microsoft spec. Format 4 defines segments (contiguous ranges of characters; variable length). Finding a characters glyph id first means finding the segment it is part of using a binary search (the segments are sorted). A segment has a startCode, an endCode (the minimum and maximum code points in the segment), an idDelta (delta for all code points in the segment) and an idRangeOffset (offset into glyphIdArray or 0).

idDelta and idRangeOffset seem to be the same thing, offsets. In fact, idRangeOffset uses the glyph array to get the index by relying on the fact that the array is immediately after the idRangeOffset table in the font file. So, the segments offset is idRangeOffset[i] but since the idRangeOffset table contains words and not bytes, the value is divided by 2.

glyphIndex = *(&idRangeOffset[i] + idRangeOffset[i] / 2 + (c - startCode[i]))

idDelta[i] is another kind of segment offset used when idRangeOffset[i] = 0, in which case it is added directly to the character code.

glyphIndex = idDelta[i] + c

Class Hierarchy

CMapTable is the main class and the container for all other CMap related classes.

Utility classes

  • CMapTable::CMapId describes a pair of IDs, platform ID and encoding ID that form the CMaps ID. The ID a CMap has is usually a good indicator as to what kind of format the CMap uses (Unicode CMaps are usually either format 4 or format 12).
  • CMapTable::CMapIdComparator
  • CMapTable::CMapIterator iteration through the CMapTable is supported through a Java-style iterator.
  • CMapTable::CMapFilter Java-style filter; CMapIterator supports filtering CMaps. By default, it accepts everything CMap.
  • CMapTable::CMapIdFilter extends CMapFilter; only accepts one type of CMap. Used in conjunction with CMapIterator, this is how the CMap getters are implemented.
  • CMapTable::Builder is the only way to create a CMapTable.

CMaps

  • CMapTable::CMap is the abstract base class that all CMapFormat* derive. It defines basic functions and the abstract CMapTable::CMap::CharacterIterator class to iterate through the characters in the map. The basic implementation just loops through every character between a start and an end. This is overridden so that format specific iteration is performed.
  • CMapFormat0 (mostly done?)
  • CMapFormat2 (needs builders)
  • ... coming soon

[todo: will add images soon; need to upload to svn]


Table Building Pipeline

Building a data table in sfntly is done by the FontDataTable::Builder::build method which defines the general pipeline and leaves the details to each implementing subclass (CMapTable::Builder for example). Note: sub* methods are table specific

ReadableFontDataPtr data = internalReadData()

There are 2 private fields in the FontDataTable::Builder class: rData and wData for ReadableFontData and WritableFontData. This function returns rData if there is any or wData (it is cast to readable font data) if rData is null. They hold the same data!

if (model_changed_)

A font is essentially a binary blob when loaded inside a FontData object. A model is the Java/C++ collection of objects that represent the same data in a manipulable format. If you ask for the model (even if you dont write to it), it will count as changed and the underlying raw data will get updated.

if (!subReadyToSerialize()) return NULL else

  1. size = subDataToSerialize()
  2. WritableDataPtr new_data = container_->getNewData(size)
  3. subSerialize(new_data)
  4. data = new_data

FontDataTablePtr table = subBuildTable(data)

The table is actually built, where subBuildTable is overridden by every class of table but a table header is always added.

Subtable Builders

Subtables are lazily built

When creating the object view of the font and dealing with lots of tables, it would be wasteful to create builders for every subtable there is since most users only do fairly high level manipulation of the font. Instead, only the tables at font level are fully built.

All other subtables have builders that contain valid FontData but the object view is not created by default. For the CMapTable, this means that if you dont go through the getCMapBuilders() method, the CMap builders are not initialized. So, the builder map would seem to be empty when calling its size() method but there are CMaps in the font when calling numCMaps(internalReadFont()).


Character encoders

Sfntly/Java uses a native ICU-based API for encoding characters. Sfntly/C++ uses ICU directly. In unit tests we assume text is encoded in UTF16. Public APIs will use ICU classes like UnicodeString.