From b56b6fde9888137320ed061a750f19b3e005d1e2 Mon Sep 17 00:00:00 2001 From: Colin Leach Date: Sun, 31 Aug 2025 14:32:19 -0700 Subject: [PATCH 1/8] Add Chars concept [DRAFT] --- concepts/chars/.meta/config.json | 7 ++ concepts/chars/about.md | 127 +++++++++++++++++++++++++++++++ concepts/chars/introduction.md | 4 + concepts/chars/links.json | 10 +++ config.json | 5 ++ 5 files changed, 153 insertions(+) create mode 100644 concepts/chars/.meta/config.json create mode 100644 concepts/chars/about.md create mode 100644 concepts/chars/introduction.md create mode 100644 concepts/chars/links.json diff --git a/concepts/chars/.meta/config.json b/concepts/chars/.meta/config.json new file mode 100644 index 00000000..15a1b762 --- /dev/null +++ b/concepts/chars/.meta/config.json @@ -0,0 +1,7 @@ +{ + "authors": [ + "colinleach" + ], + "contributors": [], + "blurb": "Kotlin was designed from the beginning to use a large subset of Unicode characters, to represent a wide range of writing systems." +} diff --git a/concepts/chars/about.md b/concepts/chars/about.md new file mode 100644 index 00000000..c808a34e --- /dev/null +++ b/concepts/chars/about.md @@ -0,0 +1,127 @@ +# About chars + +This is potentially a _big_ subject! +It is possible to write a long book about it, and several people have done so (search Amazon for "unicode book" to see some examples). + +## A very brief history + +Handling characters in computers was much simpler in earlier decades, when programmers assumed that English was the only important language. +So: 26 letters, upper and lower case, 10 digits, several punctuation marks, plus a code (0x07) to ring a bell, and it all fitted into 7 bits: the [ASCII][wiki-ascii] character set. + +Naturally, people started asking what about à, ä and Ł, then other people started asking about ऄ, ஹ and ญ, and young people wanted emojis 😱. +What to do? 
+ +To cut a long story short, many smart and patient people had to serve on committees for years, working out the details of the [Unicode][web-unicode] character set, and of encodings such as [UTF-8][wiki-utf-8], and lots of software needed a _very_ complicated rewrite. +Also, lots of new bugs were introduced. + +To prevent _everything_ breaking, the Unicode/UTF-8 design ensures that the first 128 codes (0 to 127) are identical to ASCII _(even the bell)_. + +## Characters in Kotlin + +Languages designed after about 2005 have the huge advantage that a reasonably stable Unicode standard already existed. + +Kotlin (first released in 2011) was able to assume that users would use a variety of (human) languages, and would need Unicode to express them. + +[Characters][ref-char] in Kotlin are 16-bit by default: enough to express most written alphabets, but not the entire range of emojis. +Note that this is not the full Unicode/UTF-8 standard, which uses up to six bytes (48 bits) per character. + +Character literals are written in single-quotes, and are distinct from strings written in double quotes. +This is probably obvious to people from the C/C++ world, but potentially confusing to Python and JavaScript programmers. + +```kotlin +val a = 'a' +a::class.qualifiedName // => kotlin.Char +a.code // => 97 + +val jha = 'झ' // Devanagari alphabet +jha.code // => 2333 + +val heart = '❤' // heart emoji +heart.code // => 10084 + +Char.MAX_VALUE.code // => 65535 (64k, the largest code point allowed) +``` + +Converting between `Char` and `Int` is simple: + +```kotlin +a.code // => 97 (a.toInt() is deprecated) +Char(97) // => 'a' +``` + +The compiler allows _some_ forms of integer arithmetic on `Char`s: + +```kotlin +'a' + 5 // => 'f' +'c' - 'a' // => 2 +'c' + 'a' // => error! 
+ +'f' + ('A' - 'a') // => 'F' (same as 'f'.uppercaseChar()) + +'f'.dec() // => 'e' (decrement) +'f'.inc() // => 'g' (increment) +``` + +## Some functions for `Char` + +As always, there are far too [many functions][ref-char-lib] to discuss here, so this is just a selection. + +- For appropriate alphabets, change case with [`uppercase()`][ref-uppercase] and [`lowercase()`][ref-lowercase]. +- Test case with [`isUpperCase()`][ref-isuppercase] and [`isLowerCase()`][ref-islowercase]. +- Test character type with: + - [`isLetter()`][ref-isletter], covers many alphabets + - [`isDigit()`][ref-isdigit], decimal digits (0..9, plus digits from other scripts) + - `isLetterOrDigit()`, combines the previous two + - [`isWhitespace()`][ref-iswhitespace], any whitespace character + +```kotlin +'झ'.isLetter() // => true +'A'.isLowerCase() // => false +'4'.isDigit() // => true +'\t'.isWhitespace() // => true (tab character) +``` +To check if a character is present in a `String`, or a `Char` list or array, we have `in`, equivalent to the [`contains()`][ref-contains] function. + +```kotlin +val clist = "kotlin".toList() // => [k, o, t, l, i, n] +'t' in clist // => true +'t' in "kotlin" // => true +``` + +Also, [regular expressions][ref-regex] (which will be the subject of a later Concept) allow powerful search and manipulation. + +## Char List and String interconversions + +For String to Char List, we can use `toList()`. + +For Char List to String, there is the [`joinToString()`][ref-jointostring] function, which takes a separator (often the empty string) as argument. + +```kotlin +val kt = "kotlin".toList() // => [k, o, t, l, i, n] +kt.joinToString("") // => "kotlin" +kt.joinToString("_") // => "k_o_t_l_i_n" +``` + +Note that `joinToString()` operates on a List or Array. +To _cast_ a single `Char` to a 1-character string, use `toString()`. 
+ +```kotlin +'a'.toString() // => "a" +``` + + +[ref-char]: https://kotlinlang.org/docs/characters.html +[ref-char-lib]: https://kotlinlang.org/api/core/kotlin-stdlib/kotlin/-char/ +[wiki-ascii]: https://en.wikipedia.org/wiki/ASCII +[web-unicode]: https://home.unicode.org/ +[wiki-utf-8]: https://en.wikipedia.org/wiki/UTF-8 +[ref-uppercase]: https://kotlinlang.org/api/core/kotlin-stdlib/kotlin.text/uppercase.html +[ref-lowercase]: https://kotlinlang.org/api/core/kotlin-stdlib/kotlin.text/lowercase.html +[ref-isuppercase]: https://kotlinlang.org/api/core/kotlin-stdlib/kotlin.text/is-upper-case.html +[ref-islowercase]: https://kotlinlang.org/api/core/kotlin-stdlib/kotlin.text/is-lower-case.html +[ref-isletter]: https://kotlinlang.org/api/core/kotlin-stdlib/kotlin.text/is-letter.html +[ref-isdigit]: https://kotlinlang.org/api/core/kotlin-stdlib/kotlin.text/is-digit.html +[ref-iswhitespace]: https://kotlinlang.org/api/core/kotlin-stdlib/kotlin.text/is-whitespace.html +[ref-jointostring]: https://kotlinlang.org/api/core/kotlin-stdlib/kotlin.collections/join-to-string.html +[ref-contains]: https://kotlinlang.org/api/core/kotlin-stdlib/kotlin.text/contains.html +[ref-regex]: https://kotlinlang.org/api/core/kotlin-stdlib/kotlin.text/-regex/ diff --git a/concepts/chars/introduction.md b/concepts/chars/introduction.md new file mode 100644 index 00000000..a9b49168 --- /dev/null +++ b/concepts/chars/introduction.md @@ -0,0 +1,4 @@ +# Introduction + +TODO + diff --git a/concepts/chars/links.json b/concepts/chars/links.json new file mode 100644 index 00000000..0ab2c3ed --- /dev/null +++ b/concepts/chars/links.json @@ -0,0 +1,10 @@ +[ + { + "url": "https://kotlinlang.org/docs/characters.html", + "description": "Kotlin introduction to characters." 
+ }, + { + "url": "https://kotlinlang.org/api/core/kotlin-stdlib/kotlin/-char/", + "description": "Char type and functions" + } +] diff --git a/config.json b/config.json index ba7a0c1d..18736ae5 100644 --- a/config.json +++ b/config.json @@ -1387,6 +1387,11 @@ "uuid": "168827c0-4867-449a-ad22-611c87314c48", "slug": "conditionals", "name": "Conditionals" + }, + { + "uuid": "660e1acd-3072-4bbc-ae44-bbc204da5eab", + "slug": "chars", + "name": "Chars" } ], "key_features": [ From 9e5545d355bc1bf09d700ec1ba0bc4d78ec06c2c Mon Sep 17 00:00:00 2001 From: Colin Leach Date: Mon, 1 Sep 2025 11:37:33 -0700 Subject: [PATCH 2/8] extend Unicode discussion --- concepts/chars/about.md | 32 ++++++--- concepts/chars/introduction.md | 123 ++++++++++++++++++++++++++++++++- 2 files changed, 145 insertions(+), 10 deletions(-) diff --git a/concepts/chars/about.md b/concepts/chars/about.md index c808a34e..ba7b5573 100644 --- a/concepts/chars/about.md +++ b/concepts/chars/about.md @@ -22,8 +22,22 @@ Languages designed after about 2005 have the huge advantage that a reasonably st Kotlin (first released in 2011) was able to assume that users would use a variety of (human) languages, and would need Unicode to express them. -[Characters][ref-char] in Kotlin are 16-bit by default: enough to express most written alphabets, but not the entire range of emojis. -Note that this is not the full Unicode/UTF-8 standard, which uses up to six bytes (48 bits) per character. +~~~~exercism/advanced +[Characters][ref-char] in Kotlin are 16-bit UTF-16 code units, the same as a JVM `char`. +This is enough to express most written alphabets, but not the entire range of emojis. + +The full Unicode standard defines [`codepoints`][wiki-codepoint] up to U+10FFFF (21 bits), and one visible character (called a [`grapheme`][wiki-grapheme]) may combine several codepoints. + +Kotlin `Strings` support this full standard by using two `Char`s (a surrogate pair) for each codepoint above U+FFFF. +For example, 😱 would be `\uD83D` and `\uDE31`. 
+ +Unfortunately, Java has no built-in grapheme support, and for compatibility neither does Kotlin. + +[wiki-codepoint]: https://en.wikipedia.org/wiki/Code_point +[wiki-grapheme]: https://en.wikipedia.org/wiki/Grapheme +[ref-char]: https://kotlinlang.org/docs/characters.html +~~~~ + Character literals are written in single-quotes, and are distinct from strings written in double quotes. This is probably obvious to people from the C/C++ world, but potentially confusing to Python and JavaScript programmers. @@ -80,13 +94,6 @@ As always, there are far too [many functions][ref-char-lib] to discuss here, so '4'.isDigit() // => true '\t'.isWhitespace() // => true (tab character) ``` -To check if a character is present in a `String`, or a `Char` list or array, we have `in`, equivalent to the [`contains()`][ref-contains] function. - -```kotlin -val clist = "kotlin".toList() // => [k, o, t, l, i, n] -'t' in clist // => true -'t' in "kotlin" // => true -``` Also, [regular expressions][ref-regex] (which will be the subject of a later Concept) allow powerful search and manipulation. @@ -109,6 +116,13 @@ To _cast_ a single `Char` to a 1-character string, use `toString()`. 'a'.toString() // => "a" ``` +To check if a character is present in a `String`, or a `Char` list or array, we have `in`, which maps to the [`contains()`][ref-contains] function. + +```kotlin +val clist = "kotlin".toList() // => [k, o, t, l, i, n] +'t' in clist // => true +'t' in "kotlin" // => true +``` [ref-char]: https://kotlinlang.org/docs/characters.html [ref-char-lib]: https://kotlinlang.org/api/core/kotlin-stdlib/kotlin/-char/ diff --git a/concepts/chars/introduction.md b/concepts/chars/introduction.md index a9b49168..8707d5e1 100644 --- a/concepts/chars/introduction.md +++ b/concepts/chars/introduction.md @@ -1,4 +1,125 @@ # Introduction -TODO +This is potentially a _big_ subject! 
+It is possible to write a long book about it, and several people have done so (search Amazon for "unicode book" to see some examples). + +## A very brief history + +Handling characters in computers was much simpler in earlier decades, when programmers assumed that English was the only important language. +So: 26 letters, upper and lower case, 10 digits, several punctuation marks, plus a code (0x07) to ring a bell, and it all fitted into 7 bits: the [ASCII][wiki-ascii] character set. + +Naturally, people started asking what about à, ä and Ł, then other people started asking about ऄ, ஹ and ญ, and young people wanted emojis 😱. +What to do? + +To cut a long story short, many smart and patient people had to serve on committees for years, working out the details of the [Unicode][web-unicode] character set, and of encodings such as [UTF-8][wiki-utf-8], and lots of software needed a _very_ complicated rewrite. +Also, lots of new bugs were introduced. + +To prevent _everything_ breaking, the Unicode/UTF-8 design ensures that the first 128 codes (0 to 127) are identical to ASCII _(even the bell)_. + +## Characters in Kotlin + +Languages designed after about 2005 have the huge advantage that a reasonably stable Unicode standard already existed. + +Kotlin (first released in 2011) was able to assume that users would use a variety of (human) languages, and would need Unicode to express them. + +Character literals are written in single-quotes, and are distinct from strings written in double quotes. +This is probably obvious to people from the C/C++ world, but potentially confusing to Python and JavaScript programmers. 
+ +```kotlin +val a = 'a' +a::class.qualifiedName // => kotlin.Char +a.code // => 97 + +val jha = 'झ' // Devanagari alphabet +jha.code // => 2333 + +val heart = '❤' // heart emoji +heart.code // => 10084 + +Char.MAX_VALUE.code // => 65535 (the largest value a single Char can hold) +``` + +Converting between `Char` and `Int` is straightforward: + +```kotlin +a.code // => 97 (a.toInt() is deprecated) +Char(97) // => 'a' +``` + +The compiler allows _some_ forms of integer arithmetic on `Char`s: + +```kotlin +'a' + 5 // => 'f' +'c' - 'a' // => 2 +'c' + 'a' // => error! + +'f' + ('A' - 'a') // => 'F' (same as 'f'.uppercaseChar()) + +'f'.dec() // => 'e' (decrement) +'f'.inc() // => 'g' (increment) +``` + +## Some functions for `Char` + +As always, there are far too [many functions][ref-char-lib] to discuss here, so this is just a selection. + +- For appropriate alphabets, change case with [`uppercase()`][ref-uppercase] and [`lowercase()`][ref-lowercase]. +- Test case with [`isUpperCase()`][ref-isuppercase] and [`isLowerCase()`][ref-islowercase]. +- Test character type with: + - [`isLetter()`][ref-isletter], covers many alphabets + - [`isDigit()`][ref-isdigit], decimal digits (0..9, plus digits from other scripts) + - `isLetterOrDigit()`, combines the previous two + - [`isWhitespace()`][ref-iswhitespace], any whitespace character + +```kotlin +'झ'.isLetter() // => true +'A'.isLowerCase() // => false +'4'.isDigit() // => true +'\t'.isWhitespace() // => true (tab character) +``` + +## Char List and String interconversions + +To convert from a `String` to a `List` of `Char`s, we can use `toList()`. + +To convert a `List` of `Char`s to a `String`, there is the [`joinToString()`][ref-jointostring] function, which takes a separator (often the empty string) as argument. + +```kotlin +val kt = "kotlin".toList() // => [k, o, t, l, i, n] +kt.joinToString("") // => "kotlin" +kt.joinToString("_") // => "k_o_t_l_i_n" +``` + +Note that `joinToString()` operates on a List or Array. +To _convert_ a single `Char` to a 1-character string, use `toString()`. 
+ +```kotlin +'a'.toString() // => "a" +``` + +To check if a character is present in a `String`, or a `Char` list or array, we have `in`, which maps to the [`contains()`][ref-contains] function. + +```kotlin +val clist = "kotlin".toList() // => [k, o, t, l, i, n] +'t' in clist // => true +'t' in "kotlin" // => true +``` + + +[ref-char]: https://kotlinlang.org/docs/characters.html +[ref-char-lib]: https://kotlinlang.org/api/core/kotlin-stdlib/kotlin/-char/ +[wiki-ascii]: https://en.wikipedia.org/wiki/ASCII +[web-unicode]: https://home.unicode.org/ +[wiki-utf-8]: https://en.wikipedia.org/wiki/UTF-8 +[ref-uppercase]: https://kotlinlang.org/api/core/kotlin-stdlib/kotlin.text/uppercase.html +[ref-lowercase]: https://kotlinlang.org/api/core/kotlin-stdlib/kotlin.text/lowercase.html +[ref-isuppercase]: https://kotlinlang.org/api/core/kotlin-stdlib/kotlin.text/is-upper-case.html +[ref-islowercase]: https://kotlinlang.org/api/core/kotlin-stdlib/kotlin.text/is-lower-case.html +[ref-isletter]: https://kotlinlang.org/api/core/kotlin-stdlib/kotlin.text/is-letter.html +[ref-isdigit]: https://kotlinlang.org/api/core/kotlin-stdlib/kotlin.text/is-digit.html +[ref-iswhitespace]: https://kotlinlang.org/api/core/kotlin-stdlib/kotlin.text/is-whitespace.html +[ref-jointostring]: https://kotlinlang.org/api/core/kotlin-stdlib/kotlin.collections/join-to-string.html +[ref-contains]: https://kotlinlang.org/api/core/kotlin-stdlib/kotlin.text/contains.html +[ref-regex]: https://kotlinlang.org/api/core/kotlin-stdlib/kotlin.text/-regex/ + From 1c13ef7aa49aafea8ae698b21202f23606808059 Mon Sep 17 00:00:00 2001 From: Colin Leach Date: Mon, 1 Sep 2025 20:11:11 -0700 Subject: [PATCH 3/8] Update concepts/chars/about.md Co-authored-by: Derk-Jan Karrenbeld --- concepts/chars/about.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/concepts/chars/about.md b/concepts/chars/about.md index ba7b5573..ff03a2aa 100644 --- a/concepts/chars/about.md +++ b/concepts/chars/about.md @@ -66,7 +66,7 
@@ Char(97) // => 'a' The compiler allows _some_ forms of integer arithmetic on `Char`s: ```kotlin -'a' + 5 // => 'f' +'a' + 5 // => 'f' 'c' - 'a' // => 2 'c' + 'a' // => error! From 5c48dbb7f5ca2a3d33e3428138881c892e7180d9 Mon Sep 17 00:00:00 2001 From: Colin Leach Date: Mon, 1 Sep 2025 20:11:24 -0700 Subject: [PATCH 4/8] Update concepts/chars/about.md Co-authored-by: Derk-Jan Karrenbeld --- concepts/chars/about.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/concepts/chars/about.md b/concepts/chars/about.md index ff03a2aa..493c0f3b 100644 --- a/concepts/chars/about.md +++ b/concepts/chars/about.md @@ -59,7 +59,7 @@ Char.MAX_VALUE.code // => 65535 (64k, the largest code point allowed) Converting between `Char` and `Int` is simple: ```kotlin -a.code // => 97 (a.toInt() is deprecated) +a.code // => 97 Char(97) // => 'a' ``` From c2a71013be859a0167aa573e50b60030d6adb637 Mon Sep 17 00:00:00 2001 From: Colin Leach Date: Mon, 1 Sep 2025 20:11:31 -0700 Subject: [PATCH 5/8] Update concepts/chars/about.md Co-authored-by: Derk-Jan Karrenbeld --- concepts/chars/about.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/concepts/chars/about.md b/concepts/chars/about.md index 493c0f3b..a86e195c 100644 --- a/concepts/chars/about.md +++ b/concepts/chars/about.md @@ -56,7 +56,7 @@ heart.code // => 10084 Char.MAX_VALUE.code // => 65535 (64k, the largest code point allowed) ``` -Converting between `Char` and `Int` is simple: +Converting between `Char` and `Int` is straightforward: ```kotlin a.code // => 97 From 99ce70504310bc4cc9e3d8b140ba037c9a0035b5 Mon Sep 17 00:00:00 2001 From: Colin Leach Date: Mon, 1 Sep 2025 20:12:04 -0700 Subject: [PATCH 6/8] Update concepts/chars/about.md Co-authored-by: Derk-Jan Karrenbeld --- concepts/chars/about.md | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/concepts/chars/about.md b/concepts/chars/about.md index a86e195c..ac022a0a 100644 --- a/concepts/chars/about.md +++ 
b/concepts/chars/about.md @@ -53,7 +53,9 @@ jha.code // => 2333 val heart = '❤' // heart emoji heart.code // => 10084 -Char.MAX_VALUE.code // => 65535 (64k, the largest code point allowed) +Char.MAX_VALUE.code // => 65535 (64k, the largest code point allowed) + +val not_char = 'abc' // => Too many characters in a character literal. ``` Converting between `Char` and `Int` is straightforward: From 8d92ecbc509f72d0c044c4679b0bcf6c31252361 Mon Sep 17 00:00:00 2001 From: Colin Leach Date: Mon, 1 Sep 2025 20:12:41 -0700 Subject: [PATCH 7/8] Update concepts/chars/about.md Co-authored-by: Derk-Jan Karrenbeld --- concepts/chars/about.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/concepts/chars/about.md b/concepts/chars/about.md index ac022a0a..ed8ddc6d 100644 --- a/concepts/chars/about.md +++ b/concepts/chars/about.md @@ -101,7 +101,7 @@ Also, [regular expressions][ref-regex] (which will be the subject of a later Con ## Char List and String interconversions -For String to Char List, we can use `toList()`. +To convert from a `String` to a `List` of `Char`s, we can use `toList()`. For Char List to String, there is the [`joinToString()`][ref-jointostring] function, which takes a separator (often the empty string) as argument. From 423cf3364f6c9a9ecef635cc35bab892afca0dd9 Mon Sep 17 00:00:00 2001 From: Colin Leach Date: Mon, 1 Sep 2025 20:12:59 -0700 Subject: [PATCH 8/8] Update concepts/chars/about.md Co-authored-by: Derk-Jan Karrenbeld --- concepts/chars/about.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/concepts/chars/about.md b/concepts/chars/about.md index ed8ddc6d..ae6fdc78 100644 --- a/concepts/chars/about.md +++ b/concepts/chars/about.md @@ -103,7 +103,7 @@ Also, [regular expressions][ref-regex] (which will be the subject of a later Con To convert from a `String` to a `List` of `Char`s, we can use `toList()`. 
-For Char List to String, there is the [`joinToString()`][ref-jointostring] function, which takes a separator (often the empty string) as argument. +To convert a `List` of `Char`s to a `String`, there is the [`joinToString()`][ref-jointostring] function, which takes a separator (often the empty string) as argument. ```kotlin val kt = "kotlin".toList() // => [k, o, t, l, i, n]