Machine learning functions
evalMLMethod
Prediction using fitted regression models uses the evalMLMethod function. See the link in linearRegression.
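For example, a model can be fitted with an aggregate function's -State combinator and then applied with evalMLMethod (table and column names here are illustrative):

```sql
-- Fit a linear model and store its state.
CREATE TABLE your_model ENGINE = Memory AS
SELECT stochasticLinearRegressionState(0.1, 0.0, 5, 'SGD')(target, param1, param2) AS state
FROM train_data;

-- Predict on new data.
WITH (SELECT state FROM your_model) AS model
SELECT evalMLMethod(model, param1, param2) AS prediction
FROM test_data;
```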
stochasticLinearRegression
The stochasticLinearRegression aggregate function implements the stochastic gradient descent method using a linear model and MSE loss function. It uses evalMLMethod to predict on new data.
stochasticLogisticRegression
The stochasticLogisticRegression aggregate function implements the stochastic gradient descent method for the binary classification problem. It uses evalMLMethod to predict on new data.
naiveBayesClassifier
Classifies input text using a Naive Bayes model with n-grams and Laplace smoothing. The model must be configured in ClickHouse before use.
Syntax
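```sql
naiveBayesClassifier(model_name, input_text)
```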
Arguments
- `model_name` — Name of the pre-configured model. String. The model must be defined in ClickHouse's configuration files (see below).
- `input_text` — Text to classify. String. Input is processed exactly as provided (case/punctuation preserved).
Returned Value
- Predicted class ID as an unsigned integer. UInt32. Class IDs correspond to categories defined during model construction.
Example
Classify text with the `language_detection` model from the configuration example below:
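```sql
SELECT naiveBayesClassifier('language_detection', 'Hello, how are you?') AS result;
```

```text
┌─result─┐
│      0 │
└────────┘
```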
A result of 0 might represent English, while 1 could indicate French; class meanings depend on your training data.
Implementation Details
Algorithm: Uses the Naive Bayes classification algorithm over n-gram probabilities, with Laplace smoothing to handle unseen n-grams.
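In the standard formulation (shown here only as a sketch of how the smoothing factor enters the computation; the exact normalization ClickHouse applies may differ), a document with n-grams $g_1, \dots, g_m$ is assigned the class

$$
\hat{c} = \arg\max_{c} \left[ \log P(c) + \sum_{i=1}^{m} \log P(g_i \mid c) \right],
\qquad
P(g \mid c) = \frac{\mathrm{count}(g, c) + \alpha}{\sum_{g'} \mathrm{count}(g', c) + \alpha V},
$$

where $P(c)$ is the class prior, $\alpha$ is the Laplace smoothing factor, and $V$ is the number of distinct n-grams.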
Key Features
- Supports n-grams of any size
- Three tokenization modes:
  - `byte`: Operates on raw bytes. Each byte is one token.
  - `codepoint`: Operates on Unicode scalar values decoded from UTF-8. Each codepoint is one token.
  - `token`: Splits on runs of Unicode whitespace (regex `\s+`). Tokens are substrings of non-whitespace; punctuation is part of the token if adjacent (e.g., `"you?"` is one token).
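A rough illustration of the three modes in Python (the server's internal tokenization may differ in details such as which characters count as whitespace):

```python
import re

text = "Did you see it? Yes!"

byte_tokens = list(text.encode("utf-8"))      # byte mode: one token per raw byte
codepoint_tokens = list(text)                 # codepoint mode: one token per Unicode codepoint
word_tokens = re.split(r"\s+", text.strip())  # token mode: split on whitespace runs

print(word_tokens)  # ['Did', 'you', 'see', 'it?', 'Yes!'] (punctuation stays attached)
```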
Model Configuration
You can find sample source code for creating a Naive Bayes model for language detection here.
Additionally, sample models and their associated config files are available here.
Here is an example configuration for a naive Bayes model in ClickHouse:
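Since the exact schema is defined by the sample configs linked above, the following is only a sketch built from the parameters in the table below (element names and nesting are illustrative):

```xml
<clickhouse>
    <naive_bayes_models>
        <model>
            <name>language_detection</name>
            <path>/etc/clickhouse-server/config.d/language_detection.bin</path>
            <mode>token</mode>
            <n>2</n>
            <alpha>0.5</alpha>
            <!-- Optional class priors; equal distribution when omitted -->
            <priors>
                <prior><class_id>0</class_id><probability>0.6</probability></prior>
                <prior><class_id>1</class_id><probability>0.4</probability></prior>
            </priors>
        </model>
    </naive_bayes_models>
</clickhouse>
```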
Configuration Parameters
| Parameter | Description | Example | Default |
|---|---|---|---|
| name | Unique model identifier | language_detection | Required |
| path | Full path to model binary | /etc/clickhouse-server/config.d/language_detection.bin | Required |
| mode | Tokenization method: `byte` (byte sequences), `codepoint` (Unicode characters), or `token` (word tokens) | token | Required |
| n | N-gram size; in `token` mode, 1 = single words, 2 = word pairs, 3 = word triplets | 2 | Required |
| alpha | Laplace smoothing factor used during classification to address n-grams that do not appear in the model | 0.5 | 1.0 |
| priors | Class probabilities (% of the documents belonging to a class) | 60% class 0, 40% class 1 | Equal distribution |
Model Training Guide
File Format
In human-readable format, for n=1 and token mode, the model might look like this:
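For illustration, one line per n-gram, listing the class ID, the n-gram, and its count (the exact column layout of a human-readable dump is not fixed):

```text
0 hello 12
0 world 7
0 </s> 5
1 bonjour 9
1 monde 4
1 </s> 3
```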
For n=3 and codepoint mode, it might look like:
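```text
0 the 42
0 hel 17
1 bon 11
1 onj 8
```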
The human-readable format is not used by ClickHouse directly; it must be converted to the binary format described below.
Binary Format Details
Each n-gram is stored as:
- 4-byte `class_id` (UInt32, little-endian)
- 4-byte n-gram byte length (UInt32, little-endian)
- Raw n-gram bytes
- 4-byte `count` (UInt32, little-endian)
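A minimal sketch of a writer for this record layout in Python, assuming records are simply concatenated with no file header (check the sample code linked above for any additional framing):

```python
import struct

def write_record(f, class_id: int, ngram: bytes, count: int) -> None:
    # class_id, n-gram byte length, and count are 4-byte unsigned
    # little-endian integers; the n-gram itself is written as raw bytes.
    f.write(struct.pack("<I", class_id))
    f.write(struct.pack("<I", len(ngram)))
    f.write(ngram)
    f.write(struct.pack("<I", count))

with open("language_detection.bin", "wb") as out:
    write_record(out, 0, "hello world".encode("utf-8"), 12)
    write_record(out, 1, "bonjour monde".encode("utf-8"), 9)
```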
Preprocessing Requirements
Before the model is created from the document corpus, the documents must be preprocessed to extract n-grams according to the specified `mode` and `n`. The following steps outline the preprocessing:
- Add boundary markers at the start and end of each document based on tokenization mode:
  - Byte: `0x01` (start), `0xFF` (end)
  - Codepoint: `U+10FFFE` (start), `U+10FFFF` (end)
  - Token: `<s>` (start), `</s>` (end)

  Note: `(n - 1)` boundary tokens are added at both the beginning and the end of the document.
- Example for `n=3` in `token` mode:
  - Document: `"ClickHouse is fast"`
  - Processed as: `<s> <s> ClickHouse is fast </s> </s>`
  - Generated trigrams:
    - `<s> <s> ClickHouse`
    - `<s> ClickHouse is`
    - `ClickHouse is fast`
    - `is fast </s>`
    - `fast </s> </s>`
To simplify model creation for byte and codepoint modes, it may be convenient to first tokenize the document into tokens (a list of bytes for byte mode, a list of codepoints for codepoint mode), then append `n - 1` start tokens at the beginning and `n - 1` end tokens at the end of the document, and finally generate the n-grams and write them to the serialized file, as sketched below.
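A compact sketch of this pipeline for codepoint mode (token mode is analogous, using whitespace-split tokens with `<s>`/`</s>` markers); names here are illustrative:

```python
from collections import Counter

START, END = "\U0010FFFE", "\U0010FFFF"  # codepoint-mode boundary markers

def codepoint_ngrams(text: str, n: int) -> Counter:
    # Tokenize into codepoints, pad with (n - 1) boundary markers per
    # side, then count every run of n consecutive codepoints.
    tokens = [START] * (n - 1) + list(text) + [END] * (n - 1)
    counts = Counter()
    for i in range(len(tokens) - n + 1):
        counts["".join(tokens[i : i + n])] += 1
    return counts

# Aggregate counts per class across the corpus, then serialize each
# (class_id, n-gram, count) triple with the binary writer shown above.
corpus = {0: ["ClickHouse is fast"], 1: ["ClickHouse est rapide"]}
for class_id, docs in corpus.items():
    totals = Counter()
    for doc in docs:
        totals.update(codepoint_ngrams(doc, 3))
```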