Relational Data - Statements - Select - Join
… … event_1_1 | 12:00 | 42 event_2_1 | 11:59 | 42 … event_2_2 | 12:30 | 42 event_1_2 | 13:00 | 42 event_2_3 | 13:00 | 42 … …
ASOF JOIN
can take the timestamp of a user event from table_1
and find an event in table_2
where the timestamp is closest to the timestamp of the event from table_1
corresponding to the closest match condition. Equal timestamp values are the closest if available. Here, the user_id
column can be used for joining on equality and the ev_time
column can be used for joining on the closest match. In our example, event_1_1
can be joined with event_2_1
and event_1_2
can be joined with event_2_3
, but event_2_2
can’t be joined.
ASOF JOIN
is supported only by hash
and full_sorting_merge
join algorithms. It’s not supported in the Join table engine.
PASTE JOIN Usage
The result of PASTE JOIN
is a table that contains all columns from left subquery followed by all columns from the right subquery. The rows are matched based on their positions in the original tables (the order of rows should be defined). If the subqueries return a different number of rows, extra rows will be cut.
Example:
SELECT *
FROM
(
SELECT number AS a
FROM numbers(2)
) AS t1
PASTE JOIN
(
SELECT number AS a
FROM numbers(2)
ORDER BY a DESC
) AS t2
┌─a─┬─t2.a─┐
│ 0 │ 1 │
│ 1 │ 0 │
└───┴──────┘
Note: In this case result can be nondeterministic if the reading is parallel. Example:
SELECT *
FROM
(
SELECT number AS a
FROM numbers_mt(5)
) AS t1
PASTE JOIN
(
SELECT number AS a
FROM numbers(10)
ORDER BY a DESC
) AS t2
SETTINGS max_block_size = 2;
┌─a─┬─t2.a─┐
│ 2 │ 9 │
│ 3 │ 8 │
└───┴──────┘
┌─a─┬─t2.a─┐
│ 0 │ 7 │
│ 1 │ 6 │
└───┴──────┘
┌─a─┬─t2.a─┐
│ 4 │ 5 │
└───┴──────┘
Distributed JOIN
There are two ways to execute join involving distributed tables:
- When using a normal
JOIN
, the query is sent to remote servers. Subqueries are run on each of them in order to make the right table, and the join is performed with this table. In other words, the right table is formed on each server separately. - When using
GLOBAL ... JOIN
, first the requestor server runs a subquery to calculate the right table. This temporary table is passed to each remote server, and queries are run on them using the temporary data that was transmitted.
Be careful when using GLOBAL
. For more information, see the Distributed subqueries section.
Implicit Type Conversion
INNER JOIN
, LEFT JOIN
, RIGHT JOIN
, and FULL JOIN
queries support the implicit type conversion for “join keys”. However the query can not be executed, if join keys from the left and the right tables cannot be converted to a single type (for example, there is no data type that can hold all values from both UInt64
and Int64
, or String
and Int32
).
Example
Consider the table t_1
:
┌─a─┬─b─┬─toTypeName(a)─┬─toTypeName(b)─┐
│ 1 │ 1 │ UInt16 │ UInt8 │
│ 2 │ 2 │ UInt16 │ UInt8 │
└───┴───┴───────────────┴───────────────┘
and the table t_2
:
┌──a─┬────b─┬─toTypeName(a)─┬─toTypeName(b)───┐
│ -1 │ 1 │ Int16 │ Nullable(Int64) │
│ 1 │ -1 │ Int16 │ Nullable(Int64) │
│ 1 │ 1 │ Int16 │ Nullable(Int64) │
└────┴──────┴───────────────┴─────────────────┘
The query
SELECT a, b, toTypeName(a), toTypeName(b) FROM t_1 FULL JOIN t_2 USING (a, b);
returns the set:
┌──a─┬────b─┬─toTypeName(a)─┬─toTypeName(b)───┐
│ 1 │ 1 │ Int32 │ Nullable(Int64) │
│ 2 │ 2 │ Int32 │ Nullable(Int64) │
│ -1 │ 1 │ Int32 │ Nullable(Int64) │
│ 1 │ -1 │ Int32 │ Nullable(Int64) │
└────┴──────┴───────────────┴─────────────────┘
Usage Recommendations
Processing of Empty or NULL Cells
While joining tables, the empty cells may appear. The setting join_use_nulls define how ClickHouse fills these cells.
If the JOIN
keys are Nullable fields, the rows where at least one of the keys has the value NULL are not joined.
Syntax
The columns specified in USING
must have the same names in both subqueries, and the other columns must be named differently. You can use aliases to change the names of columns in subqueries.
The USING
clause specifies one or more columns to join, which establishes the equality of these columns. The list of columns is set without brackets. More complex join conditions are not supported.
Syntax Limitations
For multiple JOIN
clauses in a single SELECT
query:
- Taking all the columns via
*
is available only if tables are joined, not subqueries. - The
PREWHERE
clause is not available. - The
USING
clause is not available.
For ON
, WHERE
, and GROUP BY
clauses:
- Arbitrary expressions cannot be used in
ON
,WHERE
, andGROUP BY
clauses, but you can define an expression in aSELECT
clause and then use it in these clauses via an alias.
Performance
When running a JOIN
, there is no optimization of the order of execution in relation to other stages of the query. The join (a search in the right table) is run before filtering in WHERE
and before aggregation.
Each time a query is run with the same JOIN
, the subquery is run again because the result is not cached. To avoid this, use the special Join table engine, which is a prepared array for joining that is always in RAM.
In some cases, it is more efficient to use IN instead of JOIN
.
If you need a JOIN
for joining with dimension tables (these are relatively small tables that contain dimension properties, such as names for advertising campaigns), a JOIN
might not be very convenient due to the fact that the right table is re-accessed for every query. For such cases, there is a “dictionaries” feature that you should use instead of JOIN
. For more information, see the Dictionaries section.
Memory Limitations
By default, ClickHouse uses the hash join algorithm. ClickHouse takes the right_table and creates a hash table for it in RAM. If join_algorithm = 'auto'
is enabled, then after some threshold of memory consumption, ClickHouse falls back to merge join algorithm. For JOIN
algorithms description see the join_algorithm setting.
If you need to restrict JOIN
operation memory consumption use the following settings:
- max_rows_in_join — Limits number of rows in the hash table.
- max_bytes_in_join — Limits size of the hash table.
When any of these limits is reached, ClickHouse acts as the join_overflow_mode setting instructs.
Examples
Example:
SELECT
CounterID,
hits,
visits
FROM
(
SELECT
CounterID,
count() AS hits
FROM test.hits
GROUP BY CounterID
) ANY LEFT JOIN
(
SELECT
CounterID,
sum(Sign) AS visits
FROM test.visits
GROUP BY CounterID
) USING CounterID
ORDER BY hits DESC
LIMIT 10
┌─CounterID─┬───hits─┬─visits─┐
│ 1143050 │ 523264 │ 13665 │
│ 731962 │ 475698 │ 102716 │
│ 722545 │ 337212 │ 108187 │
│ 722889 │ 252197 │ 10547 │
│ 2237260 │ 196036 │ 9522 │
│ 23057320 │ 147211 │ 7689 │
│ 722818 │ 90109 │ 17847 │
│ 48221 │ 85379 │ 4652 │
│ 19762435 │ 77807 │ 7026 │
│ 722884 │ 77492 │ 11056 │
└───────────┴────────┴────────┘