# Chrome OS Update Process [TOC] System updates in more modern operating systems like Chrome OS and Android are called A/B updates, over-the-air ([OTA]) updates, seamless updates, or simply auto updates. In contrast to more primitive system updates (like Windows or macOS) where the system is booted into a special mode to override the system partitions with newer updates and may take several minutes or hours, A/B updates have several advantages including but not limited to: * Updates maintain a workable system that remains on the disk during and after an update. Hence, reducing the likelihood of corrupting a device into a non-usable state. And reducing the need for flashing devices manually or at repair and warranty centers, etc. * Updates can happen while the system is running (normally with minimum overhead) without interrupting the user. The only downside for users is a required reboot (or, in Chrome OS, a sign out which automatically causes a reboot if an update was performed where the reboot duration is about 10 seconds and is no different than a normal reboot). * The user does not need (although they can) to request for an update. The update checks happen periodically in the background. * If the update fails to apply, the user is not affected. The user will continue on the old version of the system and the system will attempt to apply the update again at a later time. * If the update applies correctly but fails to boot, the system will rollback to the old partition and the user can still use the system as usual. * The user does not need to reserve enough space for the update. The system has already reserved enough space in terms of two copies (A and B) of a partition. The system doesn’t even need any cache space on the disk, everything happens seamlessly from network to memory to the inactive partitions. ## Life of an A/B Update In A/B update capable systems, each partition, such as the kernel or root (or other artifacts like [DLC]), has two copies. We call these two copies active (A) and inactive (B). The system is booted into the active partition (depending on which copy has the higher priority at boot time) and when a new update is available, it is written into the inactive partition. After a successful reboot, the previously inactive partition becomes active and the old active partition becomes inactive. But everything starts with generating update payloads in (Google) servers for each new system image. Once the update payloads are generated, they are signed with specific keys and stored in a location known to an update server (Omaha). When the updater client initiates an update (either periodically or user initiated), it first consults different device policies to see if the update check is allowed. For example, device policies can prevent an update check during certain times of a day or they require the update check time to be scattered throughout the day randomly, etc. Once policies allow for the update check, the updater client sends a request to the update server (all this communication happens over HTTPS) and identifies its parameters like its Application ID, hardware ID, version, board, etc. Then if the update server decides to serve an update payload, it will respond with all the parameters needed to perform an update like the URLs to download the payloads, the metadata signatures, the payload size and hash, etc. The updater client continues communicating with the update server after different state changes, like reporting that it started to download the payload or it finished the update, or reports that the update failed with specific error codes, etc. Each payload consists of two main sections: metadata and extra data. The metadata is basically a list of operations that should be performed for an update. The extra data contains the data blobs needed by some or all of these operations. The updater client first downloads the metadata and cryptographically verifies it using the provided signatures from the update server’s response. Once the metadata is verified as valid, the rest of the payload can easily be verified cryptographically (mostly through SHA256 hashes). Next, the updater client marks the inactive partition as unbootable (because it needs to write the new updates into it). At this point the system cannot rollback to the inactive partition anymore. Then, the updater client performs the operations defined in the metadata (in the order they appear in the metadata) and the rest of the payload is gradually downloaded when these operations require their data. Once an operation is finished its data is discarded. This eliminates the need for caching the entire payload before applying it. During this process the updater client periodically checkpoints the last operation performed so in the event of failure or system shutdown, etc. it can continue from the point it missed without redoing all operations from the beginning. During the download, the updater client hashes the downloaded bytes and when the download finishes, it checks the payload signature (located at the end of the payload). If the signature cannot be verified, the update is rejected. After the inactive partition is updated, the entire partition is re-read, hashed and compared to a hash value passed in the metadata to make sure the update was successfully written into the partition. In the next step, the [Postinstall] process (if any) is called. The postinstall reconstructs the dm-verity tree hash of the ROOT partition and writes it at the end of the partition (after the last block of the file system). The postinstall can also perform any board specific or firmware update tasks necessary. If postinstall fails, the entire update is considered failed. Then the updater client goes into a state that identifies the update has completed and the user needs to reboot the system. At this point, until the user reboots (or signs out), the updater client will not do any more system updates even if newer updates are available. However, it does continue to perform periodic update checks so we can have statistics on the number of active devices in the field. After the update proved successful, the inactive partition is marked to have a higher priority (on a boot, a partition with higher priority is booted first). Once the user reboots the system, it will boot into the updated partition and it is marked as active. At this point, after the reboot, The updater client calls into the [`chromeos-setgoodkernel`] program. The program verifies the integrity of the system partitions using the dm-verity and marks the active partition as healthy. At this point the system is basically updated successfully. ## Update Engine Daemon The `update_engine` is a single-threaded daemon process that runs all the times. This process is the heart of the auto updates. It runs with lower priorities in the background and is one of the last processes to start after a system boot. Different clients (like Chrome or other services) can send requests for update checks to the update engine. The details of how requests are passed to the update engine is system dependent, but in Chrome OS it is D-Bus. Look at the [D-Bus interface] for a list of all available methods. There are many resiliency features embedded in the update engine that makes auto updates robust including but not limited to: * If the update engine crashes, it will restart automatically. * During an active update it periodically checkpoints the state of the update and if it fails to continue the update or crashes in the middle, it will continue from the last checkpoint. * It retries failed network communication. * If it fails to apply a delta payload (due to bit changes on the active partition) for a few times, it switches to full payload. The updater clients writes its active preferences in `/var/lib/update_engine/prefs`. These preferences help with tracking changes during the lifetime of the updater client and allows properly continuing the update process after failed attempts or crashes. The core update engine code base in a Chromium OS checkout is located in `src/aosp/system/update_engine` fetching [this repository]. ### Policy Management In Chrome OS, devices are allowed to accept different policies from their managing organizations. Some of these policies affect how/when updates should be performed. For example, an organization may want to scatter the update checks during certain times of the day so as not to interfere with normal business. Within the update engine daemon, [UpdateManager] has the responsibility of loading such policies and making different decisions based on them. For example, some policies may allow the act of checking for updates to happen, while they prevent downloading the update payload. Or some policies don’t allow the update check within certain time frames, etc. Anything that relates to the Chrome OS update policies should be contained within the [update_manager] directory in the source code. ### Rollback vs. Enterprise Rollback Chrome OS defines a concept for Rollback: Whenever a newly updated system does not work as it is intended, under certain circumstances the device can be rolled back to a previously working version. There are two types of rollback supported in Chrome OS: A (legacy, original) rollback and an enterprise rollback (I know, naming is confusing). A normal rollback, which has existed for as long as Chrome OS had auto updater, is performed by switching the currently inactive partition into the active partition and rebooting into it. It is as simple as running a successful postinstall on the inactive partition, and rebooting the device. It is a feature used by Chrome that happens under certain circumstances. Of course rollback can’t happen if the inactive partition has been tampered with or has been nuked by the updater client to install an even newer update. Normally a rollback is followed by a Powerwash which clobbers the stateful partition. Enterprise rollback is a new feature added to allow enterprise users to downgrade the installed image to an older version. It is very similar to a normal system update, except that an older update payload is downloaded and installed. There is no direct API for entering into the enterprise rollback. It is managed by the enterprise device policies only. Developers should be careful when touching any rollback related feature and make sure they know exactly which of these two features they are trying to adapt. ### Interactive vs Non-Interactive vs. Forced Updates Non-interactive updates are updates that are scheduled periodically by the update engine and happen in the background. Interactive updates, on the other hand, happen when a user specifically requests an update check (e.g. by clicking on “Check For Update” button in Chrome OS’s About page). Depending on the update server's policies, interactive updates have higher priority than non-interactive updates (by carrying marker hints). They may decide to not provide an update if they have busy server load, etc. There are other internal differences between these two types of updates too. For example, interactive updates try to install the update faster. Forced updates are similar to interactive updates (initiated by some kind of user action), but they can also be configured to act as non-interactive. Since non-interactive updates happen periodically, a forced-non-interactive update causes a non-interactive update at the moment of the request, not at a later time. We can call a forced non-interactive update with: ```bash update_engine_client --interactive=false --check_for_update ``` ### P2P Updates Many organizations might not have the external bandwidth requirements that system updates need for all their devices. To help with this, Chrome OS can act as a payload server to other client devices in the same network subnet. This is basically a peer-to-peer update system that allows the devices to download the update payloads from other devices in the network. This has to be enabled explicitly in the organization through device policies and specific network configurations to be enabled for P2P updates to work. Regardless of the location of update payloads, all update requests go through update servers in HTTPS. Check out the [P2P update related code] for both the server and the client side. ### Network The updater client has the capability to download the payloads using Ethernet, WiFi, or Cellular networks depending on which one the device is connected to. Downloading over Cellular networks will prompt permission from the user as it can consume a considerable amount of data. ### Logs In Chrome OS the `update_engine` logs are located in `/var/log/update_engine` directory. Whenever `update_engine` starts, it starts a new log file with the current data-time format in the log file’s name (`update_engine.log-DATE-TIME`). Many log files can be seen in `/var/log/update_engine` after a few restarts of the update engine or after the system reboots. The latest active log is symlinked to `/var/log/update_engine.log`. ## Update Payload Generation The update payload generation is the process of converting a set of partitions/files into a format that is both understandable by the updater client (especially if it's a much older version) and is securely verifiable. This process involves breaking the input partitions into smaller components and compressing them in order to help with network bandwidth when downloading the payloads. For each generated payload, there is a corresponding properties file which contains the metadata information of the payload in JSON format. Normally the file is located in the same location as the generated payload and its file name is the same as the payload file name plus `.json` postfix. e.g. `/path/to/payload.bin` and `/path/to/payload.bin.json`. This properties file is necessary in order to do any kind of auto update in [`cros flash`], AU autotests, etc. Similarly the updater server uses this file to dispatch the payload properties to the updater clients. Once update payloads are generated, their original images cannot be changed anymore otherwise the update payloads may not be able to be applied. `delta_generator` is a tool with a wide range of options for generating different types of update payloads. Its code is located in `update_engine/payload_generator`. This directory contains all the source code related to mechanics of generating an update payload. None of the files in this directory should be included or used in any other library/executable other than the `delta_generator` which means this directory does not get compiled into the rest of the update engine tools. However, it is not recommended to use `delta_generator` directly. To manually generate payloads easier, [`cros_generate_update_payloads`] should be used. Most of the higher level policies and tools for generating payloads reside as a library in [`chromite/lib/paygen`]. Whenever calls to the update payload generation API are needed, this library should be used instead. ### Update Payload File Specification Each update payload file has a specific structure defined in the table below: |Field|Size (bytes)|Type|Description| |-----|------------|----|-----------| |Magic Number|4|char[4]|Magic string "CrAU" identifying this is an update payload.| |Major Version|8|uint64|Payload major version number.| |Manifest Size|8|uint64|Manifest size in bytes.| |Manifest Signature Size|4|uint32|Manifest signature blob size in bytes (only in major version 2).| |Manifest|Varies|[DeltaArchiveManifest]|The list of operations to be performed.| |Manifest Signature|Varies|[Signatures]|The signature of the first five fields. There could be multiple signatures if the key has changed.| |Payload Data|Varies|List of raw or compressed data blobs|The list of binary blobs used by operations in the metadata.| |Payload Signature Size|Varies|uint64|The size of the payload signature.| |Payload Signature|Varies|[Signatures]|The signature of the entire payload except the metadata signature. There could be multiple signatures if the key has changed.| ### Delta vs. Full Update Payloads There are two types of payload: Full and Delta. A full payload is generated solely from the target image (the image we want to update to) and has all the data necessary to update the inactive partition. Hence, full payloads can be quite large in size. A delta payload, on the other hand, is a differential update generated by comparing the source image (the active partitions) and the target image and producing the diffs between these two images. It is basically a differential update similar to applications like `diff` or `bsdiff`. Hence, updating the system using the delta payloads requires the system to read parts of the active partition in order to update the inactive partition (or reconstruct the target partition). The delta payloads are significantly smaller than the full payloads. The structure of the payload is equal for both types. Payload generation is quite resource intensive and its tools are implemented with high parallelism. #### Generating Full Payloads A full payload is generated by breaking the partition into 2MiB (configurable) chunks and either compressing them using bzip2 or XZ algorithms or keeping it as raw data depending on which produces smaller data. Full payloads are much larger in comparison to delta payloads hence require longer download time if the network bandwidth is limited. On the other hand, full payloads are a bit faster to apply because the system doesn’t need to read data from the source partition. #### Generating Delta Payloads Delta payloads are generated by looking at both the source and target images data on a file and metadata basis (more precisely, the file system level on each appropriate partition). The reason we can generate delta payloads is that Chrome OS partitions are read only. So with high certainty we can assume the active partitions on the client’s device is bit-by-bit equal to the original partitions generated in the image generation/signing phase. The process for generating a delta payload is roughly as follows: 1. Find all the zero-filled blocks on the target partition and produce `ZERO` operation for them. `ZERO` operation basically discards the associated blocks (depending on the implementation). 2. Find all the blocks that have not changed between the source and target partitions by directly comparing one-to-one source and target blocks and produce `SOURCE_COPY` operation. 3. List all the files (and their associated blocks) in the source and target partitions and remove blocks (and files) which we have already generated operations for in the last two steps. Assign the remaining metadata (inodes, etc) of each partition as a file. 4. If a file is new, generate a `REPLACE`, `REPLACE_XZ`, or `REPLACE_BZ` operation for its data blocks depending on which one generates a smaller data blob. 5. For each other file, compare the source and target blocks and produce a `SOURCE_BSDIFF` or `PUFFDIFF` operation depending on which one generates a smaller data blob. These two operations produce binary diffs between a source and target data blob. (Look at [bsdiff] and [puffin] for details of such binary differential programs!) 6. Sort the operations based on their target partitions’ block offset. 7. Optionally merge same or similar operations next to each other into larger operations for better efficiency and potentially smaller payloads. Full payloads can only contain `REPLACE`, `REPLACE_BZ`, and `REPLACE_XZ` operations. Delta payloads can contain any operations. ### Major and Minor versions The major and minor versions specify the update payload file format and the capability of the updater client to accept certain types of update payloads respectively. These numbers are [hard coded] in the updater client. Major version is basically the update payload file version specified in the [update payload file specification] above (second field). Each updater client supports a range of major versions. Currently, there are only two major versions: 1, and 2. And both Chrome OS and Android are on major version 2 (major version 1 is being deprecated). Whenever there are new additions that cannot be fitted in the [Manifest protobuf], we need to uprev the major version. Upreving major version should be done with utmost care because older clients do not know how to handle the newer versions. Any major version uprev in Chrome OS should be associated with a GoldenEye stepping stone. Minor version defines the capability of the updater client to accept certain operations or perform certain actions. Each updater client supports a range of minor versions. For example, the updater client with minor version 4 (or less) does not know how to handle a `PUFFDIFF` operation. So when generating a delta payload for an image which has an updater client with minor version 4 (or less) we cannot produce PUFFDIFF operation for it. The payload generation process looks at the source image’s minor version to decide the type of operations it supports and only a payload that confirms to those restrictions. Similarly, if there is a bug in a client with a specific minor version, an uprev in the minor version helps with avoiding to generate payloads that cause that bug to manifest. However, upreving minor versions is quite expensive too in terms of maintainability and it can be error prone. So one should practice caution when making such a change. Minor versions are irrelevant in full payloads. Full payloads should always be able to be applied for very old clients. The reason is that the updater clients may not send their current version, so if we had different types of full payloads, we would not have known which version to serve to the client. ### Signed vs Unsigned Payloads Update payloads can be signed (with private/public key pairs) for use in production or be kept unsigned for use in testing. Tools like `delta_generator` help with generating metadata and payload hashes or signing the payloads given private keys. ## update_payload Scripts [update_payload] contains a set of python scripts used mostly to validate payload generation and application. We normally test the update payloads using an actual device (live tests). [`brillo_update_payload`] script can be used to generate and test applying of a payload on a host device machine. These tests can be viewed as dynamic tests without the need for an actual device. Other `update_payload` scripts (like [`check_update_payload`]) can be used to statically check that a payload is in the correct state and its application works correctly. These scripts actually apply the payload statically without running the code in payload_consumer. ## Postinstall [Postinstall] is a process called after the updater client writes the new image artifacts to the inactive partitions. One of postinstall's main responsibilities is to recreate the dm-verity tree hash at the end of the root partition. Among other things, it installs new firmware updates or any board specific processes. Postinstall runs in separate chroot inside the newly installed partition. So it is quite separated from the rest of the active running system. Anything that needs to be done after an update and before the device is rebooted, should be implemented inside the postinstall. ## Building Update Engine You can build `update_engine` the same as other platform applications: ```bash (chroot) $ emerge-${BOARD} update_engine ``` or to build without the source copy: ```bash (chroot) $ cros_workon_make --board=${BOARD} update_engine ``` After a change in the `update_engine` daemon, either build an image and install the image on the device using cros flash, etc. or use `cros deploy` to only install the `update_engine` service on the device: ```bash (chroot) $ cros deploy update_engine ``` You need to restart the `update_engine` daemon in order to see the affected changes: ```bash # SSH into the device. restart update-engine # with a dash not underscore. ``` Other payload generation tools like `delta_generator` are board agnostic and only available in the SDK. So in order to make any changes to the `delta_generator`, you should build the SDK: ```bash # Do it only once to start building the 9999 ebuild from ToT. (chroot) $ cros_workon --host start update_engine (chroot) $ sudo emerge update_engine ``` If you make any changes to the D-Bus interface make sure `system_api`, `update_engine-client`, and `update_engine` packages are marked to build from 9999 ebuild and then build both packages in that order: ```bash (chroot) $ emerge-${BOARD} system_api update_engine-client update_engine ``` If you make any changes to [`update_engine` protobufs] in the `system_api`, build the `system_api` package first. ## Running Unit Tests [Running unit tests similar to other platforms]: ```bash (chroot) $ FEATURES=test emerge- update_engine ``` or ```bash (chroot) $ cros_workon_make --board= --test update_engine ``` or ```bash (chroot) $ cros_run_unit_tests --board ${BOARD} --packages update_engine ``` The above commands run all the unit tests, but `update_engine` package is quite large and it takes a long time to run all the unit tests. To run all unit tests in a test class run: ```bash (chroot) $ FEATURES=test \ P2_TEST_FILTER="*OmahaRequestActionTest.*-*RunAsRoot*" \ emerge-amd64-generic update_engine ``` To run one exact unit test fixture (e.g. `MultiAppUpdateTest`), run: ```bash (chroot) $ FEATURES=test \ P2_TEST_FILTER="*OmahaRequestActionTest.MultiAppUpdateTest-*RunAsRoot*" \ emerge-amd64-generic update_engine ``` To run `update_payload` unit tests enter `update_engine/scripts` directory and run the desired `unittest.p`y files. ## Initiating a Configured Update There are different methods to initiate an update: * Click on the “Check For Update” button in setting’s About page. There is no way to configure this way of update check. * Use the [`update_engine_client`] program. There are a few configurations you can do. * Call `autest` in the crosh. Mainly used by the QA team and is not intended to be used by any other team. * Use [`cros flash`]. It internally uses the update_engine to flash a device with a given image. * Run one of many auto update autotests. * Start a [Dev Server] on your host machine and send a specific HTTP request (look at `cros_au` API in the Dev Server code), that has the information like the IP address of your Chromebook and where the update payloads are located to the Dev Server to start an update on your device (**Warning:** complicated to do, not recommended). `update_engine_client` is a client application that can help initiate an update or get more information about the status of the updater client. It has several options like initiating an interactive vs. non-interactive update, changing channels, getting the current status of update process, doing a rollback, changing the Omaha URL to download the payload (the most important one), etc. `update_engine` daemon reads the `/etc/lsb-release` file on the device to identify different update parameters like the updater server (Omaha) URL, the current channel, etc. However, to override any of these parameters, create the file `/mnt/stateful_partition/etc/lsb-release` with desired customized parameters. For example, this can be used to point to a developer version of the update server and allow the update_engine to schedule a periodic update from that specific server. If you have some changes in the protocol that communicates with Omaha, but you don’t have those changes in the update server, or you have some specific payloads that do not exist on the production update server you can use [Nebraska] to help with doing an update. ## Note to Developers and Maintainers When changing the update engine source code be extra careful about these things: ### Do NOT Break Backward Compatibility At each release cycle we should be able to generate full and delta payloads that can correctly be applied to older devices that run older versions of the update engine client. So for example, removing or not passing arguments in the metadata proto file might break older clients. Or passing operations that are not understood in older clients will break them. Whenever changing anything in the payload generation process, ask yourself this question: Would it work on older clients? If not, do I need to control it with minor versions or any other means. Especially regarding enterprise rollback, a newer updater client should be able to accept an older update payload. Normally this happens using a full payload, but care should be taken in order to not break this compatibility. ### Think About The Future When creating a change in the update engine, think about 5 years from now: * How can the change be implemented that five years from now older clients don’t break? * How is it going to be maintained five years from now? * How can it make it easier for future changes without breaking older clients or incurring heavy maintenance costs? ### Prefer Not To Implement Your Feature In The Updater Client If a feature can be implemented from server side, Do NOT implement it in the client updater. Because the client updater can be fragile at points and small mistakes can have catastrophic consequences. For example, if a bug is introduced in the updater client that causes it to crash right before checking for update and we can't quite catch this bug early in the release process, then the production devices which have already moved to the new buggy system, may no longer receive automatic updates anymore. So, always think if the feature is being implemented can be done form the server side (with potentially minimal changes to the client updater)? Or can the feature be moved to another service with minimal interface to the updater client. Answering these questions will pay off greatly in the future. ### Be Respectful Of Other Code Bases The current update engine code base is used in many projects like Android. We sync the code base among these two projects frequently. Try to not break Android or other systems that share the update engine code. Whenever landing a change, always think about whether Android needs that change: * How will it affect Android? * Can the change be moved to an interface and stubs implementations be implemented so as not to affect Android? * Can Chrome OS or Android specific code be guarded by macros? As a basic measure, if adding/removing/renaming code, make sure to change both `build.gn` and `Android.bp`. Do not bring Chrome OS specific code (for example other libraries that live in `system_api` or `dlcservice`) into the common code of update_engine. Try to separate these concerns using best software engineering practices. ### Merging from Android (or other code bases) Chrome OS tracks the Android code as an [upstream branch]. To merge the Android code to Chrome OS (or vice versa) just do a `git merge` of that branch into Chrome OS, test it using whatever means and upload a merge commit. ```bash repo start merge-aosp git merge --no-ff --strategy=recursive -X patience cros/upstream repo upload --cbr --no-verify . ``` [Postinstall]: #postinstall [update payload file specification]: #update-payload-file-specification [OTA]: https://source.android.com/devices/tech/ota [DLC]: https://chromium.googlesource.com/chromiumos/platform2/+/master/dlcservice [`chromeos-setgoodkernel`]: https://chromium.googlesource.com/chromiumos/platform2/+/master/installer/chromeos-setgoodkernel [D-Bus interface]: /dbus_bindings/org.chromium.UpdateEngineInterface.dbus-xml [this repository]: / [UpdateManager]: /update_manager/update_manager.cc [update_manager]: /update_manager/ [P2P update related code]: https://chromium.googlesource.com/chromiumos/platform2/+/master/p2p/ [`cros_generate_update_payloads`]: https://chromium.googlesource.com/chromiumos/chromite/+/master/scripts/cros_generate_update_payload.py [`chromite/lib/paygen`]: https://chromium.googlesource.com/chromiumos/chromite/+/master/lib/paygen/ [DeltaArchiveManifest]: /update_metadata.proto#302 [Signatures]: /update_metadata.proto#122 [hard coded]: /update_engine.conf [Manifest protobuf]: /update_metadata.proto [update_payload]: /scripts/ [Postinstall]: https://chromium.googlesource.com/chromiumos/platform2/+/master/installer/chromeos-postinst [`update_engine` protobufs]: https://chromium.googlesource.com/chromiumos/platform2/+/master/system_api/dbus/update_engine/ [Running unit tests similar to other platforms]: https://chromium.googlesource.com/chromiumos/docs/+/master/testing/running_unit_tests.md [Nebraska]: https://chromium.googlesource.com/chromiumos/platform/dev-util/+/master/nebraska/ [upstream branch]: https://chromium.googlesource.com/aosp/platform/system/update_engine/+/upstream [`cros flash`]: https://chromium.googlesource.com/chromiumos/docs/+/master/cros_flash.md [bsdiff]: https://android.googlesource.com/platform/external/bsdiff/+/master [puffin]: https://android.googlesource.com/platform/external/puffin/+/master [`update_engine_client`]: /update_engine_client.cc [`brillo_update_payload`]: /scripts/brillo_update_payload [`check_update_payload`]: /scripts/paycheck.py [Dev Server]: https://chromium.googlesource.com/chromiumos/chromite/+/master/docs/devserver.md