You can not select more than 25 topics Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.

165 lines
6.6 KiB

This file contains ambiguous Unicode characters!

This file contains ambiguous Unicode characters that may be confused with others in your current locale. If your use case is intentional and legitimate, you can safely ignore this warning. Use the Escape button to highlight these characters.

Introduction
===============================================================
This document will as the C++ port matures serve as a log to how
different parts of the library work. As of today, there is some general
info but mostly CMap specific details.
------------------------------------------------------------------------
Font Data Tables
===========================================================================
One of the important goals in `sfntly` is thread safety which is why
tables can only be created with their nested `Builder` class and are
immutable after creation.
`CMapTable`
--------------------------------------------------------
*CMap* = character map; it converts *code points* in a *code page* to
*glyph IDs*.
The CMapTable is a table of CMaps (CMaps are also tables; one for every
encoding supported by the font). Representing an encoding-dependent
character map is in one of 14 formats, out of which formats 0 and 4 are
the most used; sfntly/C++ will initially only support formats 0, 2, 4
and 12.
### `CMapFormat0` Byte encoding table
Format 0 is a basic table where a characters glyph ID is looked up in a
glyphIdArray256. As it only supports 256 characters it can only encode
ASCII and ISO 8859-x (alphabet-based languages).
### `CMapFormat2` High-byte mapping through table
Chinese, Japanese and Korean (CJK) need special 2 byte encodings for
each code point like Shift-JIS.
### `CMapFormat4` Segment mapping to delta values
This is the preferred format for Unicode Basic Multilingual Plane (BMP)
encodings according to the Microsoft spec. Format 4 defines segments
(contiguous ranges of characters; variable length). Finding a
characters glyph id first means finding the segment it is part of using
a binary search (the segments are sorted). A segment has a
**`startCode`**, an **`endCode`** (the minimum and maximum code points
in the segment), an **`idDelta`** (delta for all code points in the
segment) and an **`idRangeOffset`** (offset into glyphIdArray or 0).
`idDelta` and `idRangeOffset` seem to be the same thing, offsets. In
fact, `idRangeOffset` uses the glyph array to get the index by relying
on the fact that the array is immediately after the `idRangeOffset`
table in the font file. So, the segments offset is `idRangeOffset[i]`
but since the `idRangeOffset` table contains words and not bytes, the
value is divided by 2.
``` {.prettyprint}
glyphIndex = *(&idRangeOffset[i] + idRangeOffset[i] / 2 + (c - startCode[i]))
```
`idDelta[i]` is another kind of segment offset used when
`idRangeOffset[i] = 0`, in which case it is added directly to the
character code.
``` {.prettyprint}
glyphIndex = idDelta[i] + c
```
### Class Hierarchy
`CMapTable` is the main class and the container for all other CMap
related classes.
#### Utility classes
- `CMapTable::CMapId` describes a pair of IDs, platform ID and
encoding ID that form the CMaps ID. The ID a CMap has is usually a
good indicator as to what kind of format the CMap uses (Unicode
CMaps are usually either format 4 or format 12).
- `CMapTable::CMapIdComparator`
- `CMapTable::CMapIterator` iteration through the CMapTable is
supported through a Java-style iterator.
- `CMapTable::CMapFilter` Java-style filter; CMapIterator supports
filtering CMaps. By default, it accepts everything CMap.
- `CMapTable::CMapIdFilter` extends CMapFilter; only accepts one type
of CMap. Used in conjunction with CMapIterator, this is how the CMap
getters are implemented.
- **`CMapTable::Builder`** is the only way to create a CMapTable.
#### CMaps
- **`CMapTable::CMap`** is the abstract base class that all
`CMapFormat*` derive. It defines basic functions and the abstract
`CMapTable::CMap::CharacterIterator` class to iterate through the
characters in the map. The basic implementation just loops through
every character between a start and an end. This is overridden so
that format specific iteration is performed.
- `CMapFormat0` (mostly done?)
- `CMapFormat2` (needs builders)
- ... coming soon
`[todo: will add images soon; need to upload to svn]`
------------------------------------------------------------------------
# Table Building Pipeline
Building a data table in sfntly is done by the
`FontDataTable::Builder::build` method which defines the general
pipeline and leaves the details to each implementing subclass
(`CMapTable::Builder` for example). Note: **`sub*`** methods are table
specific
**`ReadableFontDataPtr data = internalReadData()`**
> There are 2 private fields in the `FontDataTable::Builder` class:
> `rData` and `wData` for `ReadableFontData` and `WritableFontData`.
> This function returns `rData` if there is any or `wData` (it is cast
> to readable font data) if `rData` is null. *They hold the same data!*
**`if (model_changed_)`**
> A font is essentially a binary blob when loaded inside a `FontData`
> object. A *model* is the Java/C++ collection of objects that represent
> the same data in a manipulable format. If you ask for the model (even
> if you dont write to it), it will count as changed and the underlying
> raw data will get updated.
**`if (!subReadyToSerialize())`**
**`return NULL`**
`else`
1. **`size = subDataToSerialize()`**
2. **`WritableDataPtr new_data = container_->getNewData(size)`**
3. **`subSerialize(new_data)`**
4. **`data = new_data`**
**`FontDataTablePtr table = subBuildTable(data)`**
> The table is actually built, where `subBuildTable` is overridden by
> every class of table but a table header is always added.
Subtable Builders
------------------------------------------------------------------------------
Subtables are lazily built
When creating the object view of the font and dealing with lots of
tables, it would be wasteful to create builders for every subtable there
is since most users only do fairly high level manipulation of the font.
Instead, **only the tables at font level are fully built**.
All other subtables have builders that contain valid FontData but the
object view is not created by default. For the `CMapTable`, this means
that if you dont go through the `getCMapBuilders()` method, the CMap
builders are not initialized. So, the builder map would seem to be empty
when calling its `size()` method but there are CMaps in the font when
calling `numCMaps(internalReadFont())`.
------------------------------------------------------------------------
Character encoders
---------------------------------------------------------------------------------
Sfntly/Java uses a native ICU-based API for encoding characters.
Sfntly/C++ uses ICU directly. In unit tests we assume text is encoded in
UTF16. Public APIs will use ICU classes like `UnicodeString`.