High-level Specification for Two-way RSync Implementation

Introduction

The technology in this specification is adopted from http://rsync.samba.org, which explains the rsync technology, and http://rproxy.sourceforge.net/doc/protocol/protocol.html, which explains how to use rsync over HTTP to conserve download bandwidth.

This specification extends the model to allow uploads as well.

Rsync is most importantly a way of defining a "signature" for a file. A signature consists of a weak rolling checksum, to identify where blocks may begin, and a strong checksum over blocks of fixed size, to identify which blocks are different between two files with slightly different signatures. RProxy takes the basic diff/update algorithm used by rsync and allows it to be used over HTTP by browsers, proxies and/or Web servers.
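As a rough illustration of the two checksums described above, the following Python sketch computes a per-block signature. It is a minimal sketch under stated assumptions, not the signature format of rsync, RProxy, or this specification: the function names are hypothetical, the weak checksum follows the commonly described rsync-style pair of 16-bit sums, and MD5 is used only as a stand-in strong checksum because the text does not name one.

```python
import hashlib
from typing import List, Tuple

MOD = 1 << 16  # both running sums are kept modulo 2**16


def weak_checksum(block: bytes) -> Tuple[int, int]:
    """Return the (a, b) sums of an rsync-style weak checksum."""
    a = sum(block) % MOD
    # The first byte is weighted by the block length, the last byte by 1.
    b = sum((len(block) - i) * byte for i, byte in enumerate(block)) % MOD
    return a, b


def roll(a: int, b: int, old: int, new: int, size: int) -> Tuple[int, int]:
    """Slide a window of `size` bytes forward by one byte in constant time."""
    a = (a - old + new) % MOD
    b = (b - size * old + a) % MOD
    return a, b


def signature(data: bytes, block_size: int) -> List[Tuple[int, bytes]]:
    """One (weak, strong) pair per fixed-size block; the last block may be short."""
    sig = []
    for start in range(0, len(data), block_size):
        block = data[start:start + block_size]
        a, b = weak_checksum(block)
        # Assumption: MD5 stands in for whatever strong checksum is chosen.
        sig.append(((b << 16) | a, hashlib.md5(block).digest()))
    return sig
```

The `roll` function is what makes the weak checksum useful for finding block boundaries: the checksum at the next byte offset is derived from the previous one without rereading the whole window.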

This specification borrows from both rsync and RProxy.  It improves on rsync because it’s intended for use with Web pages over HTTP.  It improves on RProxy because it allows upload as well as download.  This enables synchronization of data in either direction.

Xdelta (http://boss.cae.wisc.edu/ftp/hpux/Users/xdelta-1.1.1/xdelta-1.1.1.README.html) is a binary diff format derived from the Rsync work: when two rsync signatures have been compared, an xdelta diff can be generated from only one copy of the file (unlike most diff algorithms, which require two copies of a file to generate a diff). Xdelta is the dominant diff format for this application because it meets the requirements of being a binary diff generator with source code available and compatible with rsync. The RProxy specifications at sourceforge.net use a customized 'gdiff' format; I've asked why but haven’t received a response.
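To make the "only one copy of the file" point concrete, here is a toy Python sketch that builds a delta from the new file plus the block signature of the old file, without access to the old file's content. The copy/literal operation list is an invented illustration and is not the xdelta format; the function names are hypothetical, and zlib.adler32 and MD5 are stand-ins for the weak and strong checksums. For clarity the weak checksum is recomputed at every offset rather than rolled.

```python
import hashlib
import zlib


def block_table(old: bytes, size: int) -> dict:
    """Map (weak, strong) checksum pairs of the old file's blocks to block indexes."""
    table = {}
    for index, start in enumerate(range(0, len(old), size)):
        block = old[start:start + size]
        key = (zlib.adler32(block), hashlib.md5(block).digest())
        table.setdefault(key, index)
    return table


def make_delta(new: bytes, table: dict, size: int) -> list:
    """Describe `new` as copies of old blocks plus literal bytes."""
    weak_seen = {weak for weak, _ in table}
    ops, literal, pos = [], bytearray(), 0
    while pos < len(new):
        window = new[pos:pos + size]
        weak = zlib.adler32(window)
        # Only pay for the strong checksum when the weak one already matches.
        key = (weak, hashlib.md5(window).digest()) if weak in weak_seen else None
        if key in table:
            if literal:
                ops.append(("literal", bytes(literal)))
                literal = bytearray()
            ops.append(("copy", table[key]))
            pos += len(window)
        else:
            literal.append(new[pos])
            pos += 1
    if literal:
        ops.append(("literal", bytes(literal)))
    return ops


def apply_delta(old: bytes, ops: list, size: int) -> bytes:
    """Rebuild the new file from the old file and the delta."""
    out = bytearray()
    for kind, value in ops:
        if kind == "copy":
            out += old[value * size:(value + 1) * size]
        else:
            out += value
    return bytes(out)
```

Note that `make_delta` takes only the new content and the signature table, while `apply_delta` runs on the side that holds the old content.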

Both rsync and early xdelta releases are under GPL, although more recent releases are now under a BSD-style license.

This specification does not cover synchronization of metadata (properties).

Requirements and Scenarios

The original requirements for this specification came from a few scenarios:

·           Users wish to back up parts of their computers to a WebDAV server.

·           Home machines can be synchronized with work machines and with laptops, using the WebDAV server as an intermediary.

Since the users in these scenarios often have slow links, the backup/restore functionality requires a bandwidth-conserving algorithm, and the Rsync model has been chosen.

The algorithm MUST allow for detection of a change on the server’s copy of the file so that the client can avoid overwriting those changes. This is to solve the two-client-machine synch problem: a client could synchronize from home and work, make changes at work and synchronize to the server, then make changes at home to the same file and try to synchronize. The client must be able to warn the user and allow them to cancel the operation before overwriting the original changes.

Only the client can know if 'rsync' is worth the trouble in a particular situation. If the local file to be uploaded (or cached copy of resource to be downloaded) is small, then the client can do without the rsync overhead by sending normal GET or PUT requests.

Design

Download

The RProxy project designed a one-round-trip mechanism, suitable for clients to use when updating a cache of web content for which the authoritative version is stored on the server. The model is also suitable for a client to restore data that has previously been backed up to a server.

The key to finding out whether the data is changed is to compare the Rsync signature, which is generated by breaking a file up into equally-sized pieces and generating two checksums for each block. The size of the Rsync signature can be limited to a fixed size by choosing the size of the blocks. The latest 'hsync' specification limits the Rsync signature to 512 bytes, no matter how large the file is. The signature is base-64-encoded to ensure correct transmission.
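The relationship between block size and signature size can be sketched as follows. This is only an illustration under assumptions that the text does not state: each block is assumed to contribute 20 bytes to the signature (4 for the weak checksum, 16 for the strong one), the 512-byte limit is assumed to apply before base-64 encoding, and both function names are hypothetical.

```python
import base64
import math


def block_size_for_budget(file_size: int, budget: int = 512, entry_size: int = 20) -> int:
    """Pick the smallest block size that keeps the signature within `budget` bytes."""
    max_blocks = budget // entry_size  # how many per-block entries fit in the budget
    return max(1, math.ceil(file_size / max_blocks))


def encode_signature(raw_signature: bytes) -> str:
    """Base-64-encode the raw signature so it can travel in an HTTP header."""
    return base64.b64encode(raw_signature).decode("ascii")
```

With these assumed sizes, a 1,000,000-byte file would be split into 40,000-byte blocks, giving 25 entries. Larger files simply get larger blocks, which keeps the signature small at the cost of coarser change detection.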

These are the steps for the single-round-trip download:

1.        The client calculates the Rsync signature [1] for a locally cached file (or file to be restored), and includes Rsync-Signature: header in a GET request for a resource.

 
  GET /resource.name HTTP/1.1
  Rsync-Signature: --base-64-encoded signature data--
  Accept-encoding: application/rsync-xdelta, *
  If-None-Match: --ETag-- [2]
 
2.        The server compares the client's signature to the server's signature for the named resource.

3.        If the signatures differ, the server calculates the xdelta and responds with:
 
  HTTP/1.1 200 OK
  Content-length: ---
  ETag: --NewETag--
  Content-type: application/rsync-xdelta
 
  --Xdelta data--

If the signatures (and ETag) are the same, the server instead responds with '304 Not Modified'.

4.        The client applies the xdelta it receives, which ensures that the local file now has the same content as the server’s resource. This overwrites any changes that may have been made locally. The client must also store the ETag for future use.
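The client's handling of the server's reply in the steps above might look like the following Python sketch. It is an assumption-laden illustration: `handle_download_response` is a hypothetical name, `apply_delta` is a caller-supplied function standing in for a real xdelta implementation, and the fall-through branch covers a server that ignores the Rsync-Signature header and returns the full body.

```python
def handle_download_response(status, headers, body, cached, apply_delta):
    """Return the (content, etag) pair the client should store after a GET.

    `cached` is the client's current (content, etag) pair.
    `apply_delta(old_content, delta)` is a hypothetical xdelta applier.
    """
    content, etag = cached
    # Header names are case-insensitive in HTTP, so normalize them first.
    lowered = {name.lower(): value for name, value in headers.items()}
    if status == 304:
        return content, etag  # nothing changed on the server
    if status != 200:
        raise RuntimeError("unexpected status %d" % status)
    new_etag = lowered.get("etag", etag)
    if lowered.get("content-type", "").startswith("application/rsync-xdelta"):
        return apply_delta(content, body), new_etag  # patch the cached copy
    return body, new_etag  # full body: replace the cached copy outright
```

Keeping the ETag together with the content makes it harder to store one without the other, which matters because later conditional requests depend on the pair being consistent.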

Either client or server may cache Rsync signatures for reuse, as long as there is some way of ensuring that a signature is updated or deleted whenever the associated file changes.

The client should always include the Rsync-Signature header if it’s cheap enough to calculate, because other servers (and proxies) besides the one hosting the client’s vault may begin to support this header.

If the client is downloading a list of files from the server with this method, it can pipeline the GET requests. Support for pipelining requires support for persistent HTTP connections (see RFC 2068): the client asks for a persistent connection in the first pipelined request. The server responds in the order requested, so that the client can match each response with the appropriate request.

Note that this method cannot be used for synchronization, because any changes made locally (to the client's copy of the file) will be overwritten if the xdelta from the server is applied. Thus, the client should check change-dates on local files and compare against the last known synchronization time, unless the client intends to overwrite local content (e.g. when restoring from backup content).

Uploading Updates or Synchronization

This model is to be used when the client is uploading or backing up content to the server. It can also be used when performing synchronization, when content may have changed in either location and local changes should not be overwritten.

If the operation is a first-time upload or backup, the client has no use for signatures or ETags and uses regular PUT requests to upload the data. [3] This algorithm is only useful when the server already has copies of the files being uploaded but those copies may be out of date, and so have different signatures and ETags.

Before synchronizing, the client must obtain the full list of ETags for all files of which the server has copies. See 'Getting ETags and Signatures' below.

When many files are being uploaded or synchronized, the client may pipeline the requests. The client should obtain a WebDAV write lock for the entire hierarchy being synchronized before beginning the operation, to ensure consistency. The lock should be at least an hour long, and should be renewed well before the timeout if the synchronization lasts longer than an hour. If obtaining a lock is not possible, the client must instead send an If-Match header with an ETag with each PATCH request to ensure that the file on the server has not been changed since the beginning of the synchronization operation.
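A lock request for a whole hierarchy could be assembled roughly as in the Python sketch below. The request body and the Depth and Timeout headers follow the WebDAV LOCK method; the function names, the owner string, and the rule of renewing once three quarters of the timeout has elapsed are illustrative assumptions rather than requirements of this specification.

```python
from xml.sax.saxutils import escape


def build_lock_request(path, owner, timeout_seconds=3600):
    """Return (method, path, headers, body) for an exclusive write lock on a hierarchy."""
    body = (
        '<?xml version="1.0" encoding="utf-8"?>'
        '<D:lockinfo xmlns:D="DAV:">'
        '<D:lockscope><D:exclusive/></D:lockscope>'
        '<D:locktype><D:write/></D:locktype>'
        '<D:owner>' + escape(owner) + '</D:owner>'
        '</D:lockinfo>'
    )
    headers = {
        "Depth": "infinity",                       # lock the entire hierarchy
        "Timeout": "Second-%d" % timeout_seconds,  # at least an hour by default
        "Content-Type": 'text/xml; charset="utf-8"',
    }
    return "LOCK", path, headers, body


def should_renew(acquired_at, now, timeout_seconds=3600, margin=0.25):
    """True once less than `margin` of the lock lifetime remains (assumed policy)."""
    return now - acquired_at >= timeout_seconds * (1 - margin)
```

The server is free to grant a shorter timeout than requested, so a real client would base its renewal schedule on the timeout returned in the LOCK response rather than on the value it asked for.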

Getting ETags and Signatures

The client must use ETags as well as rsync signatures, in order to tell if the server copies of the files have changed.

First, the client should request ETags for every resource in the hierarchy to be synchronized. ETags are small, cheap for the server to provide in terms of computation and bandwidth, and cheap for the client to compare. To get all ETags for a hierarchy, the client should send a PROPFIND request (Depth: infinity) asking for the <DAV:getetag> property. The server responds with a 207 Multi-Status response listing each resource in the hierarchy, which also gives the client a snapshot of the entire namespace as it exists on the server.
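The PROPFIND exchange described above could be handled along these lines. This is a sketch only: the request body asks for the standard DAV: getetag property, while the names `PROPFIND_ETAGS` and `parse_etags` are hypothetical, and a real client would also inspect the status of each propstat element instead of simply skipping empty values.

```python
import xml.etree.ElementTree as ET

# Request body to send with "PROPFIND <collection>" and "Depth: infinity".
PROPFIND_ETAGS = (
    '<?xml version="1.0" encoding="utf-8"?>'
    '<D:propfind xmlns:D="DAV:"><D:prop><D:getetag/></D:prop></D:propfind>'
)


def parse_etags(multistatus_xml: bytes) -> dict:
    """Extract {href: etag} from the body of a 207 Multi-Status response."""
    ns = {"D": "DAV:"}
    root = ET.fromstring(multistatus_xml)
    etags = {}
    for response in root.findall("D:response", ns):
        href = response.findtext("D:href", namespaces=ns)
        etag = response.findtext("D:propstat/D:prop/D:getetag", namespaces=ns)
        # Resources without an ETag (for example some collections) are skipped.
        if href and etag:
            etags[href.strip()] = etag.strip()
    return etags
```

The resulting dictionary doubles as the namespace snapshot: its keys are the resources that currently exist on the server and carry an ETag.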

If the client compares a 'last changed-date' on each file locally to the last known successful synchronization date for that file, then the client should be able to tell what files have changed locally. Typically, in a synchronization/backup operation, there will be large trees of content that haven't changed locally or on the server. The client should now prune these trees off the set of resources to be synchronized.

The client can also compare what resources exist at each location, to see if there are new resources at the server or on the client.

For resources that have a change-date after the last synch or a changed ETag, the client must ask the server for the rsync signatures. The client should do this by sending PROPFIND requests for the <http://www.xythos.com/namespaces/StorageServer#rsync-signature> property. If successful, the server responds to each of these with a 207 Multi-Status response.

Now, for each resource for which the client has the server's signature, the client must calculate its own signature.
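The pruning and comparison logic in this section can be summarized in a short Python sketch. All names and the action labels are hypothetical. The specification does not say how a resource present on only one side should be treated (it may be new on that side or deleted on the other), so the sketch merely reports such resources.

```python
def plan_sync(local_mtimes, last_sync, stored_etags, server_etags):
    """Classify each resource; resources unchanged on both sides are pruned.

    local_mtimes: {path: local last-changed time}
    last_sync:    {path: time of last successful synchronization}
    stored_etags: {path: ETag saved at the last synchronization}
    server_etags: {path: ETag just returned by PROPFIND}
    """
    plan = {}
    for path in set(local_mtimes) | set(server_etags):
        if path not in server_etags:
            plan[path] = "only-local"
        elif path not in local_mtimes:
            plan[path] = "only-server"
        else:
            # A file that was never synchronized counts as locally changed.
            local_changed = local_mtimes[path] > last_sync.get(path, float("-inf"))
            server_changed = server_etags[path] != stored_etags.get(path)
            if local_changed and server_changed:
                plan[path] = "conflict"  # warn the user before overwriting
            elif local_changed:
                plan[path] = "upload"
            elif server_changed:
                plan[path] = "download"
    return plan
```

Only the paths that appear in the returned plan with "upload", "download", or "conflict" need rsync signatures fetched and computed; everything else has been pruned.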

Sending xdeltas

The client can pipeline any number of PATCH requests for different resources. [4] Note the use of the If header to send the lock token obtained for the hierarchy.

 
  PATCH /resource.name HTTP/1.1
  If: --lock-token--
  Content-length: --xxx--
  Content-type: application/rsync-xdelta
 
  -- Xdelta data --
 

The server should respond with 200 OK unless an error occurs. The response MUST also include the new ETag for the client to save.

If WebLogic does not support PATCH, we'll have to tunnel it as with other methods that WebLogic has failed to support.

Finally, the client must UNLOCK if a lock was obtained.

The client should correlate server success responses with PATCH requests to make sure that each file has been successfully synchronized.
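Because pipelined responses arrive in the order the requests were sent, the correlation step can be sketched as a positional match. The function name and the tuple shapes below are assumptions made for illustration.

```python
def correlate(sent_paths, responses):
    """Match pipelined PATCH responses to requests by position.

    sent_paths: request paths in the order they were sent.
    responses:  (status, etag) tuples in the order they arrived; the list may
                be shorter than sent_paths if the connection was interrupted.
    Returns (synced, unconfirmed): {path: new_etag} and a list of paths.
    """
    synced, unconfirmed = {}, []
    for position, path in enumerate(sent_paths):
        if position < len(responses) and 200 <= responses[position][0] < 300:
            synced[path] = responses[position][1]  # save the ETag for next time
        else:
            unconfirmed.append(path)  # failed, or no response received
    return synced, unconfirmed
```

Any path left in the unconfirmed list cannot be assumed to be in either its old or its new state on the server, so its signature has to be fetched again before another PATCH is attempted.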

Uploading xdeltas without retrieving server signature

If the client can store the old rsync signature (the last one synchronized), as well as the old ETag, then the client can perform synchronization of a locally-changed resource in only one round trip. Using its stored signature, the client can calculate the xdelta. This xdelta is only valid if the client’s stored signature is valid, and the client’s stored signature is valid only if the server’s copy of the resource has not changed since the last synchronization. Thus, the client MUST provide an If-Match header with the ETag, to prevent the xdelta from being applied if changes have been made on the server.

 
  PATCH /foo.jpg HTTP/1.1
  If-Match: Etag
  Content-length: xxx
  Content-type: application/rsync-xdelta
 
  -- Xdelta data --
 

This method also does not require LOCK and UNLOCK to be used, since the server compares the ETag in the same transaction in which it applies the Xdelta.
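A one-round-trip upload of this kind might be sketched in Python as follows. The function names and the outcome labels are hypothetical; the sketch relies on the standard HTTP behavior that a failed If-Match precondition produces a 412 Precondition Failed response.

```python
def build_patch_request(path, delta, etag):
    """Return (method, path, headers, body) for a conditional xdelta upload."""
    headers = {
        "If-Match": etag,  # apply only if the server copy is unchanged
        "Content-Type": "application/rsync-xdelta",
        "Content-Length": str(len(delta)),
    }
    return "PATCH", path, headers, delta


def handle_patch_response(status, headers, stored_etag):
    """Return (outcome, etag_to_store) for the response to a conditional PATCH."""
    lowered = {name.lower(): value for name, value in headers.items()}
    if 200 <= status < 300:
        return "synced", lowered.get("etag", stored_etag)
    if status == 412:
        # The server copy changed since the last synchronization: the stored
        # signature, and therefore the xdelta, is no longer valid.
        return "conflict", stored_etag
    return "error", stored_etag
```

On a "conflict" outcome the client would fall back to the longer procedure: fetch the server's current signature, warn the user about the competing change, and only then decide which side wins.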

Restarting Interrupted Synchronization

If synchronization is interrupted and restarted, the client can continue where it left off ONLY IF the client received a success response for each PATCH request.

If some responses were missing, the client MUST collect a new set of signatures for those resources, even if the hierarchy was locked. The client can follow the same process and logic as above to decide what PATCH requests are required, as if this was a completely fresh synch attempt. However, it is likely that more of the content will be in synch, so fewer PATCH requests will be required.

If the client took out a lock that is still valid, the client can continue to use the same lock-token; otherwise the client may have to get a new lock.

Standardization Issues

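This specification uses the following non-standard elements:

·           The Rsync-Signature request header.

·           The application/rsync-xdelta content type.

·           The rsync-signature property in the http://www.xythos.com/namespaces/StorageServer# namespace.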

I don't see significant danger of collision with these names. If we want to attempt to standardize, we ought to pick a better namespace for the rsync-signature property.

Determining Server Support for this specification

Usually a client will only try synchronization against a server that is known to support this specification, so I haven't specified anything for the HTTP OPTIONS response.

The client can send a GET with the Rsync-Signature header to any server, whether it supports HTTP or WebDAV. A server that supports this specification (or the rproxy specification) will respond with an application/rsync-xdelta content type or a successful no-change response. A server that does not support this specification will respond with the full body of the resource if the ETag has changed.

Clients can only send PROPFIND requests to WebDAV servers, and any WebDAV server should support a depth infinity request for ETags. Only a server that supports this specification will return values for the <rsync-signature> property.

If the client sends a PATCH method with the application/rsync-xdelta content type, most servers (HTTP and WebDAV) will reject the request because PATCH is not widely supported. However, if the server happens to support PATCH but doesn't support the application/rsync-xdelta content type, it will fail (correctly) for that reason. Since there is another xdelta format defined (http://www.bath.ac.uk/~ccslrd/delta/), I named this one rsync-xdelta to avoid conflicting.

Property Synchronization

This model doesn't handle property synchronization, because the scenarios outlined do not require property synchronization.

If the client does want to synchronize properties, there is no way to see if properties have changed on the server without querying for them, because ETags and rsync-signatures do not cover properties. The client must do a PROPFIND for every synchronized property, for every resource, and compare to local values, then PROPPATCH (perhaps pipelined) each changed property. Since currently property values are not large and the set of synchronized properties is expected to be small, this should be acceptable.
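The comparison step for properties could be as simple as the following sketch, which assumes the client has already fetched the server's values with PROPFIND and that local values are the ones to push. The function and argument names are hypothetical.

```python
def changed_properties(local, server, synced_names):
    """Return {name: local_value} for synchronized properties that differ.

    local, server: {property name: value} for one resource.
    synced_names:  the properties the client has chosen to synchronize.
    """
    return {
        name: local[name]
        for name in synced_names
        if name in local and local[name] != server.get(name)
    }
```

Each entry in the result corresponds to one property to set in a PROPPATCH request for that resource; an empty result means no request is needed.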

If the client synchronizes properties, it must have a good model for which properties are to be synchronized. E.g. there could be a 'synchronized properties' property with a list (including itself) of the properties that should be synchronized. (This property is itself synchronized so that multiple client machines can be consistent in which properties are synched). The list would be pruned if it turned out some of the properties were not writable. As far as the server is concerned it can ignore this property, unless we decide to add special logic to calculate a signature for this property set.

 

Footnotes

[1] The current RProxy specifications at sourceforge.net do not have the client calculating the signature. Instead, the client gets the signature from the server when it first downloads the file, and saves this signature (just like an ETag) for use when re-loading the file. This mechanism was chosen only because of possible patent problems with client calculation of signature. These patent problems may need to be investigated.

[2] The client includes the ETag in case the server does not support the Rsync-Signature header, because it will save bandwidth by not downloading the file if it hasn't changed at all. ETags are a standard feature of HTTP 1.1.

[3] If the link is weak and the file is large, the client may wish to break up the body of the resource into multiple requests. This is done by sending a PUT request with the first n bytes of the file, followed by a series of PATCH requests. This process risks leaving the file in an inconsistent state, so if it is used, the client must have some way of setting a flag on the server copy that indicates that a piecemeal PUT will be attempted, and only remove this flag when the pieces are all ACKed. Otherwise, another client synching to the same content will retrieve bogus data.

[4] Again, the client can break any single PATCH request into multiple PATCH requests to the same resource, to be applied by the server in the order received, with the same caveat as in [3].