Relational Data - Functions - String Functions

From FojiSoft Docs
Revision as of 18:52, 28 August 2024 by Chris.Hansen (talk | contribs) (Import ClickHouse Docs: Wed Aug 28 2024 14:52:33 GMT-0400 (Eastern Daylight Time))
(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)

are not valid comments in HTML but they are skipped by other rules. 2. CDATA is pasted verbatim. Note: CDATA is XML/XHTML-specific and processed on a "best-effort" basis. 3.scriptandstyleelements are removed with all their content. Note: it is assumed that closing tag cannot appear inside content. For example, in JS string literal has to be escaped like“</script>”. Note: comments and CDATA are possible insidescriptorstyle- then closing tags are not searched inside CDATA. Example:

<script><![CDATA[</script> ]]></script>. But they are still searched inside comments. Sometimes it becomes complicated: <script>var x = "world- there is no whitespace in HTML, but the function inserts it. Also consider:Hello

world

,Hello
world. This behavior is reasonable for data analysis, e.g. to convert HTML to a bag of words. 7. Also note that correct handling of whitespaces requires the support of


and CSSdisplayandwhite-space` properties.

Syntax

extractTextFromHTML(x)

Arguments

Returned value

Example

The first example contains several tags and a comment and also shows whitespace processing. The second example shows CDATA and script tag processing. In the third example text is extracted from the full HTML response received by the url function.

SELECT extractTextFromHTML(' <p> A text <i>with</i><b>tags</b>. <!-- comments --> </p> ');
SELECT extractTextFromHTML('<![CDATA[The content within <b>CDATA</b>]]> <script>alert("Script");</script>');
SELECT extractTextFromHTML(html) FROM url('http://www.donothingfor2minutes.com/', RawBLOB, 'html String');

Result:

A text with tags .
The content within <b>CDATA</b>
Do Nothing for 2 Minutes 2:00 &nbsp;

ascii

Returns the ASCII code point (as Int32) of the first character of string s.

If s is empty, the result is 0. If the first character is not an ASCII character or not part of the Latin-1 supplement range of UTF-16, the result is undefined.

Syntax

ascii(s)

soundex

Returns the Soundex code of a string.

Syntax

soundex(val)

Arguments

Returned value

  • The Soundex code of the input value. String

Example

select soundex('aksel');

Result:

┌─soundex('aksel')─┐
│ A240             │
└──────────────────┘

punycodeEncode

Returns the Punycode representation of a string. The string must be UTF8-encoded, otherwise the behavior is undefined.

Syntax

punycodeEncode(val)

Arguments

Returned value

  • A Punycode representation of the input value. String

Example

select punycodeEncode('München');

Result:

┌─punycodeEncode('München')─┐
│ Mnchen-3ya                │
└───────────────────────────┘

punycodeDecode

Returns the UTF8-encoded plaintext of a Punycode-encoded string. If no valid Punycode-encoded string is given, an exception is thrown.

Syntax

punycodeEncode(val)

Arguments

  • val — Punycode-encoded string. String

Returned value

  • The plaintext of the input value. String

Example

select punycodeDecode('Mnchen-3ya');

Result:

┌─punycodeDecode('Mnchen-3ya')─┐
│ München                      │
└──────────────────────────────┘

tryPunycodeDecode

Like punycodeDecode but returns an empty string if no valid Punycode-encoded string is given.

idnaEncode

Returns the ASCII representation (ToASCII algorithm) of a domain name according to the Internationalized Domain Names in Applications (IDNA) mechanism. The input string must be UTF-encoded and translatable to an ASCII string, otherwise an exception is thrown. Note: No percent decoding or trimming of tabs, spaces or control characters is performed.

Syntax

idnaEncode(val)

Arguments

Returned value

  • A ASCII representation according to the IDNA mechanism of the input value. String

Example

select idnaEncode('straße.münchen.de');

Result:

┌─idnaEncode('straße.münchen.de')─────┐
│ xn--strae-oqa.xn--mnchen-3ya.de     │
└─────────────────────────────────────┘

tryIdnaEncode

Like idnaEncode but returns an empty string in case of an error instead of throwing an exception.

idnaDecode

Returns the Unicode (UTF-8) representation (ToUnicode algorithm) of a domain name according to the Internationalized Domain Names in Applications (IDNA) mechanism. In case of an error (e.g. because the input is invalid), the input string is returned. Note that repeated application of idnaEncode() and idnaDecode() does not necessarily return the original string due to case normalization.

Syntax

idnaDecode(val)

Arguments

Returned value

  • A Unicode (UTF-8) representation according to the IDNA mechanism of the input value. String

Example

select idnaDecode('xn--strae-oqa.xn--mnchen-3ya.de');

Result:

┌─idnaDecode('xn--strae-oqa.xn--mnchen-3ya.de')─┐
│ straße.münchen.de                             │
└───────────────────────────────────────────────┘

byteHammingDistance

Calculates the hamming distance between two byte strings.

Syntax

byteHammingDistance(string1, string2)

Examples

SELECT byteHammingDistance('karolin', 'kathrin');

Result:

┌─byteHammingDistance('karolin', 'kathrin')─┐
│                                         3 │
└───────────────────────────────────────────┘

Alias: mismatches

stringJaccardIndex

Calculates the Jaccard similarity index between two byte strings.

Syntax

stringJaccardIndex(string1, string2)

Examples

SELECT stringJaccardIndex('clickhouse', 'mouse');

Result:

┌─stringJaccardIndex('clickhouse', 'mouse')─┐
│                                       0.4 │
└───────────────────────────────────────────┘

stringJaccardIndexUTF8

Like stringJaccardIndex but for UTF8-encoded strings.

editDistance

Calculates the edit distance between two byte strings.

Syntax

editDistance(string1, string2)

Examples

SELECT editDistance('clickhouse', 'mouse');

Result:

┌─editDistance('clickhouse', 'mouse')─┐
│                                   6 │
└─────────────────────────────────────┘

Alias: levenshteinDistance

editDistanceUTF8

Calculates the edit distance between two UTF8 strings.

Syntax

editDistanceUTF8(string1, string2)

Examples

SELECT editDistanceUTF8('我是谁', '我是我');

Result:

┌─editDistanceUTF8('我是谁', '我是我')──┐
│                                   1 │
└─────────────────────────────────────┘

Alias: levenshteinDistanceUTF8

damerauLevenshteinDistance

Calculates the Damerau-Levenshtein distance between two byte strings.

Syntax

damerauLevenshteinDistance(string1, string2)

Examples

SELECT damerauLevenshteinDistance('clickhouse', 'mouse');

Result:

┌─damerauLevenshteinDistance('clickhouse', 'mouse')─┐
│                                                 6 │
└───────────────────────────────────────────────────┘

jaroSimilarity

Calculates the Jaro similarity between two byte strings.

Syntax

jaroSimilarity(string1, string2)

Examples

SELECT jaroSimilarity('clickhouse', 'click');

Result:

┌─jaroSimilarity('clickhouse', 'click')─┐
│                    0.8333333333333333 │
└───────────────────────────────────────┘

jaroWinklerSimilarity

Calculates the Jaro-Winkler similarity between two byte strings.

Syntax

jaroWinklerSimilarity(string1, string2)

Examples

SELECT jaroWinklerSimilarity('clickhouse', 'click');

Result:

┌─jaroWinklerSimilarity('clickhouse', 'click')─┐
│                           0.8999999999999999 │
└──────────────────────────────────────────────┘

initcap

Convert the first letter of each word to upper case and the rest to lower case. Words are sequences of alphanumeric characters separated by non-alphanumeric characters.

Because initCap converts only the first letter of each word to upper case you may observe unexpected behaviour for words containing apostrophes or capital letters. For example:

SELECT initCap('mother''s daughter'), initCap('joe McAdam');

will return

┌─initCap('mother\'s daughter')─┬─initCap('joe McAdam')─┐
│ Mother'S Daughter             │ Joe Mcadam            │
└───────────────────────────────┴───────────────────────┘

This is a known behaviour, with no plans currently to fix it.


Syntax

initcap(val)

Arguments

Returned value

  • val with the first letter of each word converted to upper case. String.

Example

Query:

SELECT initcap('building for fast');

Result:

┌─initcap('building for fast')─┐
│ Building For Fast            │
└──────────────────────────────┘

initcapUTF8

Like initcap, initcapUTF8 converts the first letter of each word to upper case and the rest to lower case. Assumes that the string contains valid UTF-8 encoded text. If this assumption is violated, no exception is thrown and the result is undefined.

This function does not detect the language, e.g. for Turkish the result might not be exactly correct (i/İ vs. i/I). If the length of the UTF-8 byte sequence is different for upper and lower case of a code point, the result may be incorrect for this code point.


Syntax

initcapUTF8(val)

Arguments

Returned value

  • val with the first letter of each word converted to upper case. String.

Example

Query:

SELECT initcapUTF8('не тормозит');

Result:

┌─initcapUTF8('не тормозит')─┐
│ Не Тормозит                │
└────────────────────────────┘

firstLine

Returns the first line from a multi-line string.

Syntax

firstLine(val)

Arguments

Returned value

  • The first line of the input value or the whole value if there is no line separators. String

Example

select firstLine('foo\nbar\nbaz');

Result:

┌─firstLine('foo\nbar\nbaz')─┐
│ foo                        │
└────────────────────────────┘