HarbourBridge is a stand-alone open source tool for Cloud Spanner evaluation, using data from an existing PostgreSQL or MySQL database. This README provides details of the tool's PostgreSQL capabilities. For general HarbourBridge information see this README.
HarbourBridge can either be used with pg_dump or it can be run directly on a PostgreSQL database (via go's database/sql package).
The following examples assume a harbourbridge
alias has been setup as described
in the Installing HarbourBridge section of the main README.
The tool can used to migrate schema from an existing pg_dump file:
harbourbridge schema -source=postgresql < my_pg_dump_file
You can use any of postgresql
, postgres
, or pg
as the argument to the
-source
flag. They all specify PostgreSQL as the source database.
This will generate a session file with session.json
suffix. This file contains
schema mapping from source to destination. You will need to specify this file
during data migration. You also need to specify a particular Spanner instance and database to use
during data migration.
For example, run
harbourbridge data -session=mydb.session.json -source=pg -target-profile="instance=my-spanner-instance,dbName=my-spanner-database-name" < my_pg_dump_file
You can also run HarbourBridge in a schema-and-data mode, where it will perform both schema and data migration. This is useful for quick evaluation when source database size is small.
harbourbridge schema-and-data -source=pg -target-profile="instance=my-spanner-instance" < my_pg_dump_file
HarbourBridge generates a report file, a schema file, and a bad-data file (if there are bad-data rows). You can control where these files are written by specifying a file prefix. For example,
harbourbridge schema -prefix=mydb. -source=postgres < my_pg_dump_file
will write files mydb.report.txt
, mydb.schema.txt
, and
mydb.dropped.txt
. The prefix can also be a directory. For example,
harbourbridge schema -prefix=~/spanner-eval-mydb/ -source=postgres < my_pg_dump_file
would write the files into the directory ~/spanner-eval-mydb/
. Note
that HarbourBridge will not create directories as it writes these files.
In this case, HarbourBridge connects directly to the PostgreSQL database to
retrieve table schema and data. Set the -source=postgres
and corresponding
source profile connection parameters host
, port
, user
, dbName
and
password
.
For example, to perform schema conversion, run
harbourbridge schema -source=postgres -source-profile="host=<>,port=<>,user=<>,dbName=<>"
Parameters port
and password
are optional. Port (port
) defaults to 5432
for PostgreSQL source. Password can be provided at the password prompt.
Alternatively, you can also set environment variables PGHOST
, PGPORT
,
PGUSER
, PGDATABASE
for direct access. Password can be specified either in
the PGPASSWORD
environment variable or provided at the password prompt.
Note that the various target-profile params described in the previous section are also applicable in direct connect mode.
The HarbourBridge tool maps PostgreSQL types to Spanner types as follows:
PostgreSQL Type | Spanner Type | Notes |
---|---|---|
BOOL |
BOOL |
|
BIGINT |
INT64 |
|
BIGSERIAL |
INT64 |
a |
BYTEA |
BYTES(MAX) |
|
CHAR |
STRING(1) |
CHAR defaults to length 1 |
CHAR(N) |
STRING(N) |
c |
DATE |
DATE |
|
DOUBLE PRECISION |
FLOAT64 |
|
INTEGER |
INT64 |
s |
NUMERIC |
NUMERIC |
p |
REAL |
FLOAT64 |
s |
SERIAL |
INT64 |
a, s |
SMALLINT |
INT64 |
s |
TEXT |
STRING(MAX) |
|
TIMESTAMP |
TIMESTAMP |
t |
TIMESTAMPTZ |
TIMESTAMP |
|
VARCHAR |
STRING(MAX) |
|
VARCHAR(N) |
STRING(N) |
c |
JSON , JSONB |
JSON |
|
ARRAY( pgtype) |
ARRAY( spannertype) |
if scalar type pgtype maps to spannertype |
All other types map to STRING(MAX)
. Some of the mappings in this table
represent potential changes of precision (marked p), dropped autoincrement
functionality (marked a), differences in treatment of timezones (marked t),
differences in treatment of fixed-length character types (marked c), and changes
in storage size (marked s). We discuss these, as well as other limits and notes
on schema conversion, in the following sections.
Spanner's NUMERIC type can store up to 29 digits before the decimal point and up to 9 after the decimal point. PostgreSQL's NUMERIC type can potentially support higher precision that this, so please verify that Spanner's NUMERIC support meets your application needs.
Spanner does not support autoincrementing types, so these both map to INT64
and the autoincrementing functionality is dropped.
PosgreSQL has two timestamp types: TIMESTAMP
and TIMESTAMPTZ
. Both have an 8
byte data representation and provide microsecond resolution, but neither
actually stores a timezone with the data. The key difference between the two
types is how string literals are converted to timestamps and queries return
data. For TIMESTAMP
, all timezone information is dropped, and data is returned
without a timezone. For TIMESTAMPTZ
, string literals are converted to UTC,
using the literal's timezone if it is specified, or the PostgreSQL's timezone
paramater if not. When data is printed stored data (in UTC) is converted to the
timezone from the timezone parameter
Spanner has a single timestamp type. Data is stored as UTC (there is no separate
timezone) Spanner client libraries convert timestamps to UTC before sending them
to Spanner. Data is always returned as UTC. Spanner's timestamp type is
essentially the same as TIMESTAMPTZ
, except that there is no analog of
PostgreSQL's timezone parameter.
In other words, mapping PostgreSQL TIMESTAMPTZ
to TIMESTAMP
is fairly
straightforward, but care should be taken with PostgreSQL TIMESTAMP
data
because Spanner clients will not drop the timezone.
The semantics of fixed-length character types differ between PostgreSQL and
Spanner. The CHAR(n)
type in PostgreSQL is padded with spaces. If a string
value smaller than the limit is stored, spaces will be added to pad it out to
the specified length. If a string longer than the specified length is stored,
and the extra characters are all spaces, then it will be silently
truncated. Moreover, trailing spaces are ignored when comparing two values. In
constrast, Spanner does not give special treatment to spaces, and the specified
length simply represents the maximum length that can be stored. This is close to
the semantics of PostgreSQL's VARCHAR(n)
. However there are some minor
differences. For example, even VARCHAR(n)
has some special treatment of
spaces: strings longer than the specified length are silently truncated if the
extra characters are all spaces.
The tool maps several PostgreSQL types to Spanner types that use more storage.
For example, SMALLINT
is a two-byte integer, but it maps to Spanner's INT64
,
an eight-byte integer. This additional storage could be significant for large
arrays.
Spanner does not support multi-dimensional arrays. So while TEXT[4]
maps to
ARRAY<STRING(MAX)>
and REAL ARRAY
maps to ARRAY<FLOAT64>
, TEXT[][]
maps
to STRING(MAX)
.
Also note that PosgreSQL supports array limits, but the PostgreSQL implementation ignores them. Spanner does not support array size limits, but since they have no effect anyway, the tool just drops them.
Spanner requires primary keys for all tables. PostgreSQL recommends the use of
primary keys for all tables, but does not enforce this. When converting a table
without a primary key, HarbourBridge will create a new primary key of type
INT64. By default, the name of the new column is synth_id
. If there is already
a column with that name, then a variation is used to avoid collisions.
The tool preserves NOT NULL
constraints. Note that Spanner does not require
primary key columns to be NOT NULL
. However, in PostgreSQL, a primary key is a
combination of NOT NULL
and UNIQUE
, and so primary key columns from
PostgreSQL will be mapped to Spanner columns that are both primary keys and NOT NULL
.
The tool maps PostgreSQL foreign key constraints into Spanner foreign key constraints, and
preserves constraint names where possible. Note that Spanner requires foreign key
constraint names to be globally unique (within a database), but in postgres they only
have to be unique for a table, so we add a uniqueness suffix to a name if needed.
Spanner doesn't support ON DELETE
and ON UPDATE
actions, so we drop these.
Spanner does not currently support default values. We drop these PostgreSQL features during conversion.
The tool maps PostgresSQL secondary indexes to Spanner secondary indexes, preserving
constraint names where possible. The tool also maps PostgreSQL UNIQUE
constraints to
Spanner UNIQUE
secondary indexes. Check here
for more details.
PostgreSQL has many other features we haven't discussed, including functions, sequences, procedures, triggers, (non-primary) indexes and views. The tool does not support these and the relevant statements are dropped during schema conversion.
See
Migrating from PostgreSQL to Cloud Spanner
for a general discussion of PostgreSQL to Spanner migration issues.
HarbourBridge follows most of the recommendations in that guide. The main
difference is that we map a few more types to STRING(MAX)
.
As noted earlier when discussing schema conversion of TIMESTAMP, there are some subtle differences in how timestamps are handled in PostgreSQL and Spanner.
During data conversion, PostgreSQL TIMESTAMPTZ
values are converted to UTC and
stored in Spanner. The conversion proceeds as follows. If the value has a
timezone, that timezone is respected during the conversion to UTC. If the value
does not have a timezone, then we look for any set timezone
statements in the
pg_dump output and use the timezone specified. Otherwise, we use the TZ
environment variable as the timezone, and failing that, we use the local system
timezone default (as determined by Go).
In constrast, conversion of PostgreSQL TIMESTAMP
values proceeds by ignoring
any timezone information and just treating the value as UTC and storing it in
Spanner.
Spanner requires that STRING
values be UTF-8 encoded. All Spanner functions
and operators that act on STRING
values operate on Unicode characters rather
than bytes. Since we map many PostgreSQL types (including TEXT
and CHAR
types) to Spanner's STRING
type, HarbourBridge is effectively a UTF-8 based
tool.
Note that the tool itself does not do any encoding/decoding or UTF-8 checks: it passes through data from pg_dump to Spanner. Internally, we use Go's string type, which supports UTF-8.