Licence and Legal FAQ/Rebuild Plan

This page documents the rebuild plan as it stood in mid-March 2012. Some significant changes to the plan have occurred since. For up-to-date news on the licence change, follow the OSMF blog and the Rebuild mailing list.

Background

We are working towards a licence change date of 1st April 2012. Interested parties are tracking progress on the Rebuild mailing list, which is also intended for co-ordination of the necessary coding, sysadmin and communication efforts. Participants in this mailing list will be referred to as the "Rebuild Team". The team is open to participation from any interested party.

"Database Rebuild" refers to the steps that need to be taken to transform the contents of our map database (including historical versions of objects) into a form that ensures that all data published can be licensed under ODbL. The criteria by which CC data may be retained during rebuild (known as WTFE) are now quite well-defined apart from some edge cases and a number of tools exist to demonstrate which objects are not ODbL-clean. The test suite is the vehicle through which remaining edge cases will be identified and confronted.

Within the rebuild team, consensus exists around an approach to the rebuild that will modify the database in-place, chiefly through the addition of a new field that suppresses publication to users of any specific historic version of an object that is not compliant with the new licence. This approach was originally proposed by Matt Amos, and he has a body of code written to implement it, including a test harness intended to stress-test the various inclusion and exclusion criteria for map data.
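
By way of illustration only - the final schema change had not been settled when this was written, and the table and column names below are assumptions - the new field might be added with a Rails migration along these lines:

    # Illustrative sketch only: table and column names are assumptions,
    # not the final schema. A NULL redaction_id means the historic
    # version is publishable; a non-NULL value hides it from ordinary
    # API users and links it to a record explaining the redaction.
    class AddRedactionToHistoryTables < ActiveRecord::Migration
      TABLES = [:nodes, :ways, :relations] # the history tables

      def self.up
        TABLES.each do |table|
          add_column table, :redaction_id, :integer, :null => true
        end
      end

      def self.down
        TABLES.each do |table|
          remove_column table, :redaction_id
        end
      end
    end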

This document describes the detailed tasks required to bring about a successful rebuild based on the WTFE criteria and on Matt's in-place methodology. It captures the rebuild team's shared understanding of the tasks that must be carried out, and is intended to drive both constructive criticism and actual offers to do some of the work.

Tasks

Since the goal of this table is to subdivide the functionality of the rebuild tools as far as possible, some of the units named below are actually parts of a single larger body of work.

Title                                 | Technology                 | Leader             | Contribs                                         | Status
Rebuild Objects                       | Ruby on Rails              | Matt Amos          | -                                                | Expected 23 Mar
Test harness for rebuild rules        | Ruby                       | Matt Amos          | Frederik Ramm, Dermot McNally, Richard Fairhurst | Expected 23 Mar
API Support for redaction             | Ruby on Rails              | Matt Amos          | -                                                | Needed 26-29 Mar
Editor tests                          | varied                     | Editor maintainers | -                                                | Needed 30 Mar
Obtain "Suspect Object" list          | WTFE                       | Frederik Ramm?     | -                                                | Needed 25 Mar
Test runs                             | dev server, new API server | Matt Amos          | -                                                | Planned 24-25 Mar
Freeze list of exceptional changesets | From wiki                  | Frederik Ramm      | -                                                | Before 25 Mar
Last CC Planet File                   | Usual way                  | ?                  | -                                                | N/A
Enter read-only API mode              | API server                 | ?                  | -                                                | Planned 27 Mar
Go live (data)                        | API server                 | ?                  | -                                                | Planned 27-30 Mar
Go live (API changes)                 | API server                 | ?                  | -                                                | Planned 27-31 Mar
Enter read-write API mode             | API server                 | ?                  | -                                                | ASAP 27-30 Mar


Help Needed

[Draft] List of tasks that need your participation:

  • Sanity-checking of existing test cases and writing (or describing) more
  • Editor testing against revised API
  • Setup of API environment and test DB for redaction tests
  • Verifying before/after state of chosen objects in test DB for correct application of redaction rules.
  • Using the WTFE tools, checking that data in your area which is expected to be treated as clean under the "exceptional changesets" rule actually shows as clean. Anything still showing as dirty must be flagged to Frederik Ramm and Simon Poole.
  • Installation of the API server environment onto ramoth (it is hoped to use the licence change as an opportunity to migrate onto ramoth, which may also facilitate some of the tests)

Rebuild Objects

This is a Ruby toolset containing object representations of all OSM object types, with methods to migrate them from their CC incarnations to ODbL versions, performing any required edits, deletions and/or redactions of historical versions.
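
As a purely illustrative sketch (the class and method names here are invented, not taken from Matt's code), the idea is roughly the following:

    # Sketch only: wraps the full CC history of one node and works out
    # which historic versions must be redacted. The real WTFE rules are
    # considerably more subtle than this (derived versions, exceptional
    # changesets and so on).
    class RebuildNode
      Version = Struct.new(:number, :uid)

      def initialize(id, versions)
        @id, @versions = id, versions
      end

      # agreeing_uids: ids of users who accepted the Contributor Terms.
      # Returns the actions the rebuild would apply to this object.
      def migrate(agreeing_uids)
        @versions.reject { |v| agreeing_uids.include?(v.uid) }.
                  map { |v| [:redact, :node, @id, v.number] }
      end
    end

    node = RebuildNode.new(1, [RebuildNode::Version.new(1, 100),
                               RebuildNode::Version.new(2, 999)])
    p node.migrate([100])  # => [[:redact, :node, 1, 2]]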

Go-live impact

Requires: Nothing

Required by: Test suite (needs object interface), Redaction of production DB (needs full implementation)

Test harness for rebuild rules

The most critical, and hardest to validate, aspect of the rebuild is the correct application of the rules. A flawed ruleset could allow non-ODbL-clean data to survive, or cause perfectly valid data to be removed. Because of this, prior to the final rebuild of the production database, we wish to develop the rebuild code in a test-driven fashion. The tests of the rebuild rules are therefore broken out into a separate task.

Tests are written in Ruby, but are still quite intelligible to non-Ruby coders. They manipulate the same Ruby Rebuild Objects as the rebuild itself will. Each test defines the edit history of a single OSM object (node, way or relation), calculates the rebuild actions that will be applied by the Rebuild logic and tests whether the resulting actions are those expected.
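
For illustration, a rule test might look roughly like this. It is a sketch only: the helper below is a toy stand-in for the rebuild logic, and the harness's real interface will differ.

    require "test/unit"

    # Toy stand-in for the rebuild logic: a version is tainted once any
    # author in the chain has not agreed, and tainted versions must be
    # redacted. The real WTFE rules are more nuanced.
    def actions_for(author_uids, agreeing_uids)
      actions = []
      tainted = false
      author_uids.each_with_index do |uid, i|
        tainted ||= !agreeing_uids.include?(uid)
        actions << [:redact, i + 1] if tainted
      end
      actions
    end

    class RedactionRuleTest < Test::Unit::TestCase
      def test_decliner_edit_taints_later_versions
        history = [100, 200, 100] # authors of v1, v2, v3
        agreed  = [100]           # only user 100 accepted the terms
        assert_equal [[:redact, 2], [:redact, 3]],
                     actions_for(history, agreed)
      end
    end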

Having a comprehensive suite of tests is currently the single highest priority in the rebuild project. Tests are welcome from all comers - if you cannot provide a test case in Ruby code, please write it as best you can in prose or pseudo-code and post it to the rebuild list. The tests can be executed locally with very few prerequisites - no OSM Rails port installation is required. Please see the code for more details.

Go-live impact

Not used for actual rebuild, but test suite must be stable and complete before we can safely commence actual redaction.

Requires: Nothing

Required by: Redaction in production

API Support for Redaction

A post-rebuild database will contain, at least initially, a mixture of ODbL-clean content and non-clean content marked as "hidden". This will require that any API operations that access historic versions of objects change their behaviour to correctly suppress redacted data.
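
A minimal sketch of the principle, assuming that redacted versions are marked with a redaction_id field (the real schema and Rails code may differ):

    # Sketch: suppressing redacted rows from a history response.
    Version = Struct.new(:number, :tags, :redaction_id)

    # Only rows without a redaction_id may be shown to ordinary users.
    def visible_history(rows)
      rows.reject { |row| row.redaction_id }
    end

    history = [
      Version.new(1, { "name" => "Old Lane" }, 7),   # hidden by redaction 7
      Version.new(2, { "name" => "New Lane" }, nil)  # clean
    ]
    p visible_history(history).map(&:number)  # => [2]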

Go-live impact

High impact. The updated API will support suppression of redacted changes as indicated in the revised schema. As such, the updated code will depend on the necessary DB schema migration. The revised API code can be safely deployed prior to actual redaction and put live in a single step, although the database changes involved may incur some downtime.

Requires: Knowledge of final DB schema changes and representation of redacted objects.

Required by: Redaction of production data (or, if the existing API code will safely ignore redactions, can wait until ODbL declaration)

Editor Tests

Since API changes are to be made, the most important OSM editors should be tested to confirm that nothing breaks, after the API code is deemed stable and before it is deployed to production. Any issues ought to be confined to functionality that interacts with object history, with revert plugins and undelete support particularly at risk.

The API changes are being developed in such a way that no change in editor behaviour should be required. API calls dealing with historical versions will return exactly the same format of data, but with troublesome content obscured and replaced with generic placeholders. Similarly, no new API version will be declared unless a compelling reason to do so can be identified.
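
Purely for illustration - the placeholder representation had not been finalised, and every attribute value below is invented - a redacted node version might come back along these lines:

    <osm version="0.6">
      <!-- v2 was redacted: same structure, placeholder content -->
      <node id="123" version="2" visible="false" changeset="0"
            lat="0" lon="0" timestamp="1970-01-01T00:00:00Z"/>
    </osm>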

Go-live impact

Independent of most of the process, but has to be right once the new API code is live.

Requires: Revised API deployed to a test instance

Required by: Deployment of revised API to production

Obtain "Suspect Object" list

The in-place processing of each OSM object will consume time and resources. However, the vast majority of objects in the database are known to be clean, and the rebuild process will leave a clean object untouched, allowing us an optimisation. Instead of processing every object, knowing that for most there will be nothing to do, we intend to process only those objects that are deemed "suspect" - that is, those having at least one non-agreeing mapper in their history.

It is hoped that the suspect objects list can be derived from existing WTFE logic, though it should take a more conservative view than WTFE. Only objects with agreeing mappers throughout their history should be excluded from the list.
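
The conservative test itself is simple. As a sketch (the input representation is assumed for illustration):

    # Sketch of the conservative "suspect" test: an object is suspect
    # unless every author in its history has agreed.
    def suspect?(author_uids, agreeing_uids)
      author_uids.any? { |uid| !agreeing_uids.include?(uid) }
    end

    agreeing = [100, 101]
    p suspect?([100, 100], agreeing)  # => false: safe to skip entirely
    p suspect?([100, 999], agreeing)  # => true: must be processed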

Go-live impact

Requires: Source dataset from which to extract

Required by: Redaction of production DB

Test runs

Once the test harness is considered comprehensive enough to warrant it, the rebuild code can be deployed to a test instance of the API database, currently most likely to be hosted on the dev server. This can be seeded with a subset of the OSM database in an interesting area. In-place conversion can then be run against some or all of the test database, with the resulting "cleaned" data examined to test that the logic has been applied as expected.

Go-live impact

This is a control gate before the production DB is touched

Requires: Completed test suite

Required by: Redaction of production data

Freeze list of exceptional changesets

An exceptional changeset is one of the following:

  • One that will be considered ODbL-clean although the mapper has not agreed to the licence change (for use in cases where there are grounds for overruling the mapper's normal preference, often with the specific consent of the mapper).
  • One that will not be considered ODbL-clean even though the mapper has agreed to the licence change (for use in cases where it is known that the changeset contains non-ODbL-safe data).

More information

New information: in the specific case of Poland, it seems that we may receive details of ODbL-clean data at object level (sub-changeset), as a consequence of the way imports of UMP data are being relicensed at the granularity of individual UMP contributors. If we are to support this, and the benefit is significant, the exceptional changeset support will need to be extended to cover this case. It may be appropriate to split this into a separate task.
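
One possible shape for such an override list, extended to cover the Polish object-level case, is sketched below (illustrative only - the ids are invented and the real tooling's format is undecided):

    # Illustrative only: the ids and format are invented.
    CHANGESET_EXCEPTIONS = {
      1234 => :force_clean, # overruled with the mapper's consent
      5678 => :force_dirty  # agreeing mapper, but known unsafe data
    }

    # Object-level overrides for the UMP (Poland) case.
    OBJECT_EXCEPTIONS = {
      [:node, 42] => :force_clean
    }

    # An object-level entry takes precedence over its changeset's entry.
    def exceptional_status(changeset_id, type, id)
      OBJECT_EXCEPTIONS[[type, id]] || CHANGESET_EXCEPTIONS[changeset_id]
    end

    p exceptional_status(5678, :node, 42)  # => :force_clean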

Go-live impact

A gate prior to production redaction

Requires: Final decisions by community on exceptional changesets and (for Poland) single objects

Required by: Redaction of production DB

Last CC Planet File

Prior to any automated data removal, with the actual date dependent on the expected running time of the redaction process, the last CC Planet File will be generated. This will be made available for download, possibly shortly after the actual rebuild has taken place.

Go-live impact

None, as daily planets are generated anyway. LWG will declare the latest "useful" planet file to be the last CC planet.

Read-only phase

The chosen in-place modification of the DB allows, in theory, for redaction to take place against a running database. Similarly, it is expected that both the existing and the updated API code will behave gracefully with an updated database, other than the fact that the existing code will be unable to filter non-ODbL-clean data. This allows the flexibility to redact the database before deploying the API updates as long as the data set is not declared to be under ODbL until the API changes are made.

However, for reasons of speed, it is proposed to disable API writes during the redaction process. For the same reason, redaction itself is likely to be performed through a private interface to the database rather than through the API.

The API will be held read-only for the duration of the redaction process. It is expected (though not required) that the updated API code will have been deployed by the time read-write mode is reinstated.

Go Live (data)

Once the tools are complete and deemed to function correctly and stably, they can be deployed to the production API server and the required DB migrations performed.

Once the code is deployed, it is possible to commence redaction on all objects not known to be already clean.

Go Live (tileserver)

Once the database contains ODbL-clean data, we will wish to switch attribution of the tiles we serve (Mapnik layer), which in turn requires a reimport of rendering data and a flush of existing tiles, in addition to a new coastline run. Downstream users of our tiles and others involved in the attribution of Mapnik tiles (OpenLayers devs...) must also be informed.

Risks

The rebuild process will touch objects in the OSM production database, some of them in such a way that data will be removed (in accordance with the tested criteria). This section considers the scope for error and the options to recover from any such errors.

Incorrect data criteria applied

This can happen in one of two directions - the deletion of clean data or the failure to delete problem data. Since the methodology will not destructively edit any existing versions of an object (all changes applying instead to the current version), any object may be reprocessed if such an error is identified - if necessary using improved selection criteria, or perhaps on the basis of a changed decision about the exceptional treatment of a changeset or single object.

This approach does have the weakness that conflicts (similar to normal edit conflicts) could arise if such a flawed redaction is noticed only after a long time. For that reason, vigilance in the early stages is urged, including spot checks during the read-only phase.

Redacting DB proves very slow

This would prolong the read-only phase. No actual data would be damaged, but the impact on mappers would be unfortunate. More on this after the tests yield some benchmarks.

Changesets (or objects) requiring exceptional handling are discovered late

Every effort should be expended to avoid this. It will be possible, though inconvenient, to reprocess objects later discovered not to have received the exceptional handling they should have. In the case of smaller data sets, you can expect to do the resolution work yourself if the administrative burden is not warranted.

For larger data sets, reprocessing may be considered, but this will likely require either additional downtime or the extension of the tools to support live redaction. The comments above about the risk of edit conflicts will also apply.

There are no promises that this remedy will ever be considered, so proceed on the assumption that you have only one chance to get your exceptional handling list right.