- What is a Partitioned Table?
- Creating a Partitioned Table
- Loading Data in a Partitioned Table
- Querying a Partitioned Table
Google BigQuery is designed to house some truly monstrous datasets, sometimes hosting tables billions of rows. It is therefore no surprise that Google has implemented the handy capability of partitioned
tables, which allow otherwise daunting datasets to be broken up into smaller, more manageable chunks without losing performance or scalability.
What is a Partitioned Table?
In the past, it was common for database administrators using BigQuery to split large datasets into smaller tables that were divided by date and/or time. For example, each day a table may be generated where the table name is suffixed with the current date (e.g. books_20160724
, books_20160725
, etc). While this method is functional, it creates a lot of headache to not only manage all these split tables, but queries rapidly become quite complex as the potential date ranges increase and crossover.
To resolve this, Google BigQuery introduced the partitioned table
, which is essentially just a normal table, except that it is automatically partitioned for each date.
The _PARTITIONTIME
Psuedo Column
Functionally, a partitioned table
acts just like any other table, except it contains a special _PARTITIONTIME
psuedo column which contains the date-based timestamp representing when a particular record was loaded. This field uses UTC time and is the number of microseconds since the unix epoch.
Creating a Partitioned Table
To create a partitioned table
, you must issue a call to the Tables.insert API method. This can be done either through the API or through the command-line tool.
The critical point here is that you do not need to specify a schema for the partitioned table
at this time, but instead are effectively telling BigQuery to create a special partitioned table
, and then await the insertion of data at a later time which will give it a proper schema.
Here is the API configuration JSON
to create our table (named partition
) via the API:
{
"tableReference": {
"projectId": "bookstore-1382",
"datasetId": "exports",
"tableId": "partition"
},
"timePartitioning": {
"type": "DAY"
}
}
And here’s the same creation using the command-line tool:
bq mk --time_partitioning_type=DAY exports.partition
Loading Data in a Partitioned Table
Loading data into the partitioned table
is no different than loading data into any other table in BigQuery. For the purposes of this example, we’re just using the WebUI and grabbing some data from the [bigquery-public-data:samples.github_timeline]
dataset and setting our Destination Table
to the previously created bookstore-1382:exports.partition
table.
Below is our query:
SELECT
repository_name,
repository_language,
repository_size,
repository_watchers,
created_at
FROM
[bigquery-public-data:samples.github_timeline]
WHERE
created_at IS NOT NULL
ORDER BY
created_at ASC
LIMIT
1000
And the output:
[
{
"repository_name": "node-orm",
"repository_language": "JavaScript",
"repository_size": "208",
"repository_watchers": "101",
"created_at": "2012-03-11 06:36:13"
},
{
"repository_name": "nimbus",
"repository_language": "Objective-C",
"repository_size": "18304",
"repository_watchers": "1259",
"created_at": "2012-03-11 06:36:13"
},
{
"repository_name": "impress.js",
"repository_language": "JavaScript",
"repository_size": "388",
"repository_watchers": "7149",
"created_at": "2012-03-11 06:36:13"
},
...
]
Querying a Partitioned Table
Now that we have loaded some data into our partitioned table
, we can make use of the “automagical” nature when querying our table.
Performing a basic query with no regard for the _PARTITIONTIME
is standard fare, but often we’ll want to retrieve just the data that we loaded into our table for a particular date or daterange. To do this, we simply add a WHERE
clause and compare the _PARTITIONTIME
value to the daterange in question.
For example, here we’re retrieving only records that were loaded today (which, therefore, exist in the partition for today’s date):
SELECT
repository_name,
repository_language,
repository_size,
repository_watchers,
created_at
FROM
exports.partition
WHERE
_PARTITIONTIME == TIMESTAMP(UTC_USEC_TO_DAY(CURRENT_TIMESTAMP()));
The key is the WHERE
clause comparing TIMESTAMP
values to the _PARTITIONTIME
field.
Here we’re getting data from the partition exactly 5
days ago:
SELECT
repository_name,
repository_language,
repository_size,
repository_watchers,
created_at
FROM
exports.partition
WHERE
_PARTITIONTIME == TIMESTAMP(UTC_USEC_TO_DAY(NOW() - 5 * 60 * 60 * 24 * 1000000));
And here we’re getting data from all partitions within the last 3
days:
SELECT
repository_name,
repository_language,
repository_size,
repository_watchers,
created_at,
_PARTITIONTIME pt
FROM
exports.partition
WHERE
_PARTITIONTIME
BETWEEN
TIMESTAMP(UTC_USEC_TO_DAY(NOW() - 3 * 60 * 60 * 24 * 1000000))
AND
TIMESTAMP(UTC_USEC_TO_DAY(CURRENT_TIMESTAMP()));
The options are limitless for querying the partitioned data. Note that if we want to actually SELECT
the _PARTITIONTIME
value (as seen in the above example), since it’s a psuedo column we must assign it to an alias name before output (e.g. SELECT _PARTITIONTIME pt
).