ClickHouse lets you write custom functions in Python using AWS Lambda, which is pretty neat. But the most surprising thing is that you don’t need to deploy anything to ClickHouse itself to make this work; the execution happens entirely within the Lambda environment, and ClickHouse just acts as the orchestrator.
Let’s see this in action. Imagine you have a table of user events, and you want to categorize them based on some complex, evolving business logic that’s easier to manage in Python.
CREATE TABLE user_events (
    event_id UUID,
    user_id UInt64,
    event_type String,
    timestamp DateTime
) ENGINE = MergeTree()
ORDER BY (user_id, timestamp);
INSERT INTO user_events VALUES
    (generateUUIDv4(), 1001, 'login', now()),
    (generateUUIDv4(), 1002, 'purchase', now() - INTERVAL 1 HOUR),
    (generateUUIDv4(), 1001, 'logout', now() - INTERVAL 30 MINUTE),
    (generateUUIDv4(), 1003, 'view_product', now() - INTERVAL 2 HOUR);
Now, let’s create a Lambda function that takes an event_type and returns a sentiment score.
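# lambda_function.py

def lambda_handler(event, context):
    # ClickHouse passes the UDF argument under the 'event_type' key.
    event_type = event['event_type']

    if event_type == 'login':
        sentiment = 1
    elif event_type == 'purchase':
        sentiment = 2
    elif event_type == 'view_product':
        sentiment = 0.5
    else:
        # Any event type without an explicit score counts as negative.
        sentiment = -1

    # Return the result object directly; an API Gateway style
    # statusCode/body wrapper would bury 'sentiment' inside a JSON string.
    return {
        'event_type': event_type,
        'sentiment': sentiment
    }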
You’d deploy this as a standard AWS Lambda function, naming it clickhouse_udf_sentiment.
Then, in ClickHouse, you define this Lambda UDF:
CREATE FUNCTION lambda_sentiment AS 'clickhouse_udf_sentiment'
FROM DATA SOURCE 'arn:aws:lambda:us-east-1:123456789012:function:clickhouse_udf_sentiment'
SETTINGS
    lambda_role_arn = 'arn:aws:iam::123456789012:role/ClickHouseLambdaExecutionRole',
    lambda_payload_format = 'JSON';
The lambda_role_arn setting is crucial: the IAM role it names must be allowed to invoke the Lambda function (lambda:InvokeFunction). Access to any other AWS services the Lambda itself uses is governed by the function's own execution role rather than this one (this simple example needs none). The lambda_payload_format = 'JSON' setting tells ClickHouse to send arguments as a JSON object and to expect a JSON response.
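As a rough illustration of the invoke permission, the sketch below builds a minimal IAM policy document that allows lambda:InvokeFunction on the function ARN from the statement above. The helper name build_invoke_policy is hypothetical and exists only for this example; the document layout follows the standard IAM JSON policy format. How the policy gets attached to the role is left out.

```python
import json

# ARN copied from the CREATE FUNCTION statement above.
FUNCTION_ARN = "arn:aws:lambda:us-east-1:123456789012:function:clickhouse_udf_sentiment"


def build_invoke_policy(function_arn: str) -> dict:
    """Return an IAM policy document allowing invocation of one function.

    build_invoke_policy is a hypothetical helper used for illustration only.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "lambda:InvokeFunction",
                # Scope the permission to this single function.
                "Resource": function_arn,
            }
        ],
    }


if __name__ == "__main__":
    print(json.dumps(build_invoke_policy(FUNCTION_ARN), indent=2))
```

Scoping Resource to the one function ARN, rather than a wildcard, keeps the role from being able to invoke unrelated functions.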
Now you can use it:
SELECT
    event_type,
    lambda_sentiment(event_type) AS sentiment_score
FROM user_events
WHERE event_type IN ('login', 'purchase', 'view_product');
This query will send each event_type from the selected rows to your Lambda function, and the results will be integrated directly into your ClickHouse query output.
The mental model here is that ClickHouse acts as a client to your Lambda function. When you call lambda_sentiment(event_type), ClickHouse constructs a JSON payload like {"event_type": "login"} and sends it via the AWS SDK (configured with the provided role ARN) to the specified Lambda function. The Lambda runs its Python code and returns a response like {"event_type": "login", "sentiment": 1}, which ClickHouse parses and uses as the function’s return value for that row. This is all done in batches for efficiency.
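The round trip described above can be mimicked locally without touching AWS. The sketch below is only a simulation under assumptions: sentiment_handler, invoke_locally, and call_udf are hypothetical names, the handler is a stand-in that returns the plain result object, and the real wire format between ClickHouse and Lambda may differ.

```python
import json
from typing import Callable


def sentiment_handler(event: dict, context: object) -> dict:
    """Local stand-in for the deployed Lambda, using the article's score mapping."""
    scores = {"login": 1, "purchase": 2, "view_product": 0.5}
    event_type = event["event_type"]
    return {"event_type": event_type, "sentiment": scores.get(event_type, -1)}


def invoke_locally(payload: bytes) -> bytes:
    """Pretend to be the Lambda service: JSON bytes in, JSON bytes out."""
    event = json.loads(payload)
    return json.dumps(sentiment_handler(event, None)).encode("utf-8")


def call_udf(event_type: str, invoke: Callable[[bytes], bytes] = invoke_locally) -> float:
    """Mimic one UDF call: serialize the argument, invoke, parse the result."""
    request = json.dumps({"event_type": event_type}).encode("utf-8")
    response = json.loads(invoke(request))
    return response["sentiment"]


if __name__ == "__main__":
    for event_type in ["login", "purchase", "view_product"]:
        print(event_type, call_udf(event_type))
```

Because invoke is just a function from request bytes to response bytes, the local stand-in could be replaced by a wrapper around a real SDK call when testing against a deployed function.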
What most people don’t realize is that the lambda_payload_format setting governs both the input sent to Lambda and the output expected back. When it is set to JSON, ClickHouse expects the Lambda to return a JSON object in which the key matching the UDF’s output column name (or a default, if none is specified) holds the scalar value for that row. For example, if the output column is named sentiment, a response of {"sentiment": 1} parses cleanly. If the key is missing, or the structure is otherwise unexpected, you’ll get parsing errors.
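To make the failure modes concrete, here is a small sketch of the kind of check a caller might apply to a JSON response. It is not ClickHouse's actual parsing code; extract_scalar is a hypothetical helper, and the key name is passed in explicitly rather than derived from a column name.

```python
import json


def extract_scalar(response_body: str, output_key: str):
    """Pull the per-row scalar out of a JSON response, or raise ValueError.

    extract_scalar is a hypothetical helper used for illustration only.
    """
    try:
        parsed = json.loads(response_body)
    except json.JSONDecodeError as exc:
        raise ValueError(f"response is not valid JSON: {exc}") from exc
    # The response must be an object, not an array or a bare value.
    if not isinstance(parsed, dict):
        raise ValueError("response must be a JSON object")
    if output_key not in parsed:
        raise ValueError(f"response has no {output_key!r} key")
    return parsed[output_key]


if __name__ == "__main__":
    print(extract_scalar('{"event_type": "login", "sentiment": 1}', "sentiment"))
```

Running a check like this against sample responses before wiring up the function is a cheap way to catch a mismatched key name early.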
The next step is to explore how to handle more complex data types and arrays as arguments to your Lambda UDFs.