Relational Data - Functions - String Functions
are not valid comments in HTML but they are skipped by other rules. 2. CDATA is pasted verbatim. Note: CDATA is XML/XHTML-specific and processed on a "best-effort" basis. 3.
scriptand
styleelements are removed with all their content. Note: it is assumed that closing tag cannot appear inside content. For example, in JS string literal has to be escaped like
“</script>”. Note: comments and CDATA are possible inside
scriptor
style- then closing tags are not searched inside CDATA. Example:
<script><![CDATA[</script> ]]></script>
. But they are still searched inside comments. Sometimes it becomes complicated:
<script>var x = "world- there is no whitespace in HTML, but the function inserts it. Also consider:
Helloworld
,
Hello
world. This behavior is reasonable for data analysis, e.g. to convert HTML to a bag of words. 7. Also note that correct handling of whitespaces requires the support of
and CSS
displayand
white-space` properties.
Syntax
extractTextFromHTML(x)
Arguments
x
— input text. String.
Returned value
- Extracted text. String.
Example
The first example contains several tags and a comment and also shows whitespace processing. The second example shows CDATA
and script
tag processing. In the third example text is extracted from the full HTML response received by the url function.
SELECT extractTextFromHTML(' <p> A text <i>with</i><b>tags</b>. <!-- comments --> </p> ');
SELECT extractTextFromHTML('<![CDATA[The content within <b>CDATA</b>]]> <script>alert("Script");</script>');
SELECT extractTextFromHTML(html) FROM url('http://www.donothingfor2minutes.com/', RawBLOB, 'html String');
Result:
A text with tags . The content within <b>CDATA</b> Do Nothing for 2 Minutes 2:00
ascii
Returns the ASCII code point (as Int32) of the first character of string s
.
If s
is empty, the result is 0. If the first character is not an ASCII character or not part of the Latin-1 supplement range of UTF-16, the result is undefined.
Syntax
ascii(s)
soundex
Returns the Soundex code of a string.
Syntax
soundex(val)
Arguments
val
— Input value. String
Returned value
- The Soundex code of the input value. String
Example
select soundex('aksel');
Result:
┌─soundex('aksel')─┐ │ A240 │ └──────────────────┘
punycodeEncode
Returns the Punycode representation of a string. The string must be UTF8-encoded, otherwise the behavior is undefined.
Syntax
punycodeEncode(val)
Arguments
val
— Input value. String
Returned value
- A Punycode representation of the input value. String
Example
select punycodeEncode('München');
Result:
┌─punycodeEncode('München')─┐ │ Mnchen-3ya │ └───────────────────────────┘
punycodeDecode
Returns the UTF8-encoded plaintext of a Punycode-encoded string. If no valid Punycode-encoded string is given, an exception is thrown.
Syntax
punycodeEncode(val)
Arguments
val
— Punycode-encoded string. String
Returned value
- The plaintext of the input value. String
Example
select punycodeDecode('Mnchen-3ya');
Result:
┌─punycodeDecode('Mnchen-3ya')─┐ │ München │ └──────────────────────────────┘
tryPunycodeDecode
Like punycodeDecode
but returns an empty string if no valid Punycode-encoded string is given.
idnaEncode
Returns the ASCII representation (ToASCII algorithm) of a domain name according to the Internationalized Domain Names in Applications (IDNA) mechanism. The input string must be UTF-encoded and translatable to an ASCII string, otherwise an exception is thrown. Note: No percent decoding or trimming of tabs, spaces or control characters is performed.
Syntax
idnaEncode(val)
Arguments
val
— Input value. String
Returned value
- A ASCII representation according to the IDNA mechanism of the input value. String
Example
select idnaEncode('straße.münchen.de');
Result:
┌─idnaEncode('straße.münchen.de')─────┐ │ xn--strae-oqa.xn--mnchen-3ya.de │ └─────────────────────────────────────┘
tryIdnaEncode
Like idnaEncode
but returns an empty string in case of an error instead of throwing an exception.
idnaDecode
Returns the Unicode (UTF-8) representation (ToUnicode algorithm) of a domain name according to the Internationalized Domain Names in Applications (IDNA) mechanism. In case of an error (e.g. because the input is invalid), the input string is returned. Note that repeated application of idnaEncode()
and idnaDecode()
does not necessarily return the original string due to case normalization.
Syntax
idnaDecode(val)
Arguments
val
— Input value. String
Returned value
- A Unicode (UTF-8) representation according to the IDNA mechanism of the input value. String
Example
select idnaDecode('xn--strae-oqa.xn--mnchen-3ya.de');
Result:
┌─idnaDecode('xn--strae-oqa.xn--mnchen-3ya.de')─┐ │ straße.münchen.de │ └───────────────────────────────────────────────┘
byteHammingDistance
Calculates the hamming distance between two byte strings.
Syntax
byteHammingDistance(string1, string2)
Examples
SELECT byteHammingDistance('karolin', 'kathrin');
Result:
┌─byteHammingDistance('karolin', 'kathrin')─┐
│ 3 │
└───────────────────────────────────────────┘
Alias: mismatches
stringJaccardIndex
Calculates the Jaccard similarity index between two byte strings.
Syntax
stringJaccardIndex(string1, string2)
Examples
SELECT stringJaccardIndex('clickhouse', 'mouse');
Result:
┌─stringJaccardIndex('clickhouse', 'mouse')─┐
│ 0.4 │
└───────────────────────────────────────────┘
stringJaccardIndexUTF8
Like stringJaccardIndex but for UTF8-encoded strings.
editDistance
Calculates the edit distance between two byte strings.
Syntax
editDistance(string1, string2)
Examples
SELECT editDistance('clickhouse', 'mouse');
Result:
┌─editDistance('clickhouse', 'mouse')─┐
│ 6 │
└─────────────────────────────────────┘
Alias: levenshteinDistance
editDistanceUTF8
Calculates the edit distance between two UTF8 strings.
Syntax
editDistanceUTF8(string1, string2)
Examples
SELECT editDistanceUTF8('我是谁', '我是我');
Result:
┌─editDistanceUTF8('我是谁', '我是我')──┐
│ 1 │
└─────────────────────────────────────┘
Alias: levenshteinDistanceUTF8
damerauLevenshteinDistance
Calculates the Damerau-Levenshtein distance between two byte strings.
Syntax
damerauLevenshteinDistance(string1, string2)
Examples
SELECT damerauLevenshteinDistance('clickhouse', 'mouse');
Result:
┌─damerauLevenshteinDistance('clickhouse', 'mouse')─┐
│ 6 │
└───────────────────────────────────────────────────┘
jaroSimilarity
Calculates the Jaro similarity between two byte strings.
Syntax
jaroSimilarity(string1, string2)
Examples
SELECT jaroSimilarity('clickhouse', 'click');
Result:
┌─jaroSimilarity('clickhouse', 'click')─┐
│ 0.8333333333333333 │
└───────────────────────────────────────┘
jaroWinklerSimilarity
Calculates the Jaro-Winkler similarity between two byte strings.
Syntax
jaroWinklerSimilarity(string1, string2)
Examples
SELECT jaroWinklerSimilarity('clickhouse', 'click');
Result:
┌─jaroWinklerSimilarity('clickhouse', 'click')─┐
│ 0.8999999999999999 │
└──────────────────────────────────────────────┘
initcap
Convert the first letter of each word to upper case and the rest to lower case. Words are sequences of alphanumeric characters separated by non-alphanumeric characters.
Because initCap
converts only the first letter of each word to upper case you may observe unexpected behaviour for words containing apostrophes or capital letters. For example:
SELECT initCap('mother''s daughter'), initCap('joe McAdam');
will return
┌─initCap('mother\'s daughter')─┬─initCap('joe McAdam')─┐ │ Mother'S Daughter │ Joe Mcadam │ └───────────────────────────────┴───────────────────────┘
This is a known behaviour, with no plans currently to fix it.
Syntax
initcap(val)
Arguments
val
— Input value. String.
Returned value
val
with the first letter of each word converted to upper case. String.
Example
Query:
SELECT initcap('building for fast');
Result:
┌─initcap('building for fast')─┐
│ Building For Fast │
└──────────────────────────────┘
initcapUTF8
Like initcap, initcapUTF8
converts the first letter of each word to upper case and the rest to lower case. Assumes that the string contains valid UTF-8 encoded text. If this assumption is violated, no exception is thrown and the result is undefined.
This function does not detect the language, e.g. for Turkish the result might not be exactly correct (i/İ vs. i/I). If the length of the UTF-8 byte sequence is different for upper and lower case of a code point, the result may be incorrect for this code point.
Syntax
initcapUTF8(val)
Arguments
val
— Input value. String.
Returned value
val
with the first letter of each word converted to upper case. String.
Example
Query:
SELECT initcapUTF8('не тормозит');
Result:
┌─initcapUTF8('не тормозит')─┐
│ Не Тормозит │
└────────────────────────────┘
firstLine
Returns the first line from a multi-line string.
Syntax
firstLine(val)
Arguments
val
— Input value. String
Returned value
- The first line of the input value or the whole value if there is no line separators. String
Example
select firstLine('foo\nbar\nbaz');
Result:
┌─firstLine('foo\nbar\nbaz')─┐ │ foo │ └────────────────────────────┘