Add Chars concept #715
base: main
Changes from 2 commits
b56b6fd
9e5545d
1c13ef7
5c48dbb
c2a7101
99ce705
8d92ecb
423cf33
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,7 @@ | ||
{ | ||
"authors": [ | ||
"colinleach" | ||
], | ||
"contributors": [], | ||
"blurb": "Kotlin was designed from the beginning to use a large subset of Unicode characters, to represent a wide range of writing systems." | ||
} |
Original file line number | Diff line number | Diff line change | ||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
@@ -0,0 +1,141 @@ | ||||||||||||||||||||||
# About chars | ||||||||||||||||||||||
|
||||||||||||||||||||||
This is potentially a _big_ subject! | ||||||||||||||||||||||
It is possible to write a long book about it, and several people have done so (search Amazon for "unicode book" to see some examples). | ||||||||||||||||||||||
|
||||||||||||||||||||||
## A very brief history | ||||||||||||||||||||||
|
||||||||||||||||||||||
Handling characters in computers was much simpler in earlier decades, when programmers assumed that English was the only important language. | ||||||||||||||||||||||
So: 26 letters, upper and lower case, 10 digits, several punctuation marks, plus a code (0x07) to ring a bell, and it all fitted into 7 bits: the [ASCII][wiki-ascii] character set. | ||||||||||||||||||||||
|
||||||||||||||||||||||
Naturally, people started asking what about à, ä and Ł, then other people started asking about ऄ, ஹ and ญ, and young people wanted emojis 😱. | ||||||||||||||||||||||
What to do? | ||||||||||||||||||||||
|
||||||||||||||||||||||
To cut a long story short, many smart and patient people had to serve on committees for years, working out the details of the [Unicode][web-unicode] character set, and of encodings such as [UTF-8][wiki-utf-8], and lots of software needed a _very_ complicated rewrite. | ||||||||||||||||||||||
Also, lots of new bugs were introduced. | ||||||||||||||||||||||
|
||||||||||||||||||||||
To prevent _everything_ breaking, the Unicode/UTF-8 design ensures that the first 128 codes (0 to 127) are identical to ASCII _(even the bell)_. | ||||||||||||||||||||||
|
||||||||||||||||||||||
## Characters in Kotlin | ||||||||||||||||||||||
|
||||||||||||||||||||||
Languages designed after about 2005 have the huge advantage that a reasonably stable Unicode standard already existed. | ||||||||||||||||||||||
|
||||||||||||||||||||||
Kotlin (first released in 2011) was able to assume that users would use a variety of (human) languages, and would need Unicode to express them. | ||||||||||||||||||||||
|
||||||||||||||||||||||
~~~~exercism/advanced | ||||||||||||||||||||||
[Characters][ref-char] in Kotlin are 16-bit UTF-16 code units, the same as a JVM `char`. | ||||||||||||||||||||||
Each one can hold any [code point][wiki-codepoint] up to `U+FFFF`, which is enough to express most written alphabets, but not the entire range of emojis. | ||||||||||||||||||||||
|
||||||||||||||||||||||
The full Unicode standard defines code points up to `U+10FFFF` (21 bits), and a single user-perceived character (called a [grapheme][wiki-grapheme]) may be built from several code points. | ||||||||||||||||||||||
|
||||||||||||||||||||||
Kotlin `String`s support this full standard by using two `Char`s (a surrogate pair) for each code point above `U+FFFF`. | ||||||||||||||||||||||
For example, 😱 (`U+1F631`) would be `\uD83D` and `\uDE31`. | ||||||||||||||||||||||
|
||||||||||||||||||||||
Unfortunately, Java has no built-in grapheme support, and for compatibility neither does Kotlin. | ||||||||||||||||||||||
|
||||||||||||||||||||||
[wiki-codepoint]: https://en.wikipedia.org/wiki/Code_point | ||||||||||||||||||||||
[wiki-grapheme]: https://en.wikipedia.org/wiki/Grapheme | ||||||||||||||||||||||
[ref-char]: https://kotlinlang.org/docs/characters.html | ||||||||||||||||||||||
~~~~ | ||||||||||||||||||||||
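The surrogate-pair behaviour described in the advanced note can be checked with a short sketch. It assumes Kotlin on the JVM, because `codePointAt()` and `codePointCount()` come from the JVM `String`; the helper name `describe` is hypothetical and used only for illustration.

```kotlin
// Sketch only: codePointAt and codePointCount are JVM-specific String members.
// `describe` is a made-up helper name, not part of the concept text.
fun describe(s: String): String {
    val units = s.length                        // number of 16-bit Char values
    val points = s.codePointCount(0, s.length)  // number of Unicode code points
    return "$units Char(s), $points code point(s)"
}

fun main() {
    val scream = "\uD83D\uDE31"                  // 😱 written as its surrogate pair
    println(describe(scream))                    // 2 Char(s), 1 code point(s)
    println(scream[0].isHighSurrogate())         // true
    println(scream[1].isLowSurrogate())          // true
    println(scream.codePointAt(0).toString(16))  // 1f631
    println(describe("a"))                       // 1 Char(s), 1 code point(s)
}
```

The practical consequence is that `length` and indexing count `Char` values, not code points or graphemes.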
|
||||||||||||||||||||||
|
||||||||||||||||||||||
Character literals are written in single-quotes, and are distinct from strings written in double quotes. | ||||||||||||||||||||||
This is probably obvious to people from the C/C++ world, but potentially confusing to Python and JavaScript programmers. | ||||||||||||||||||||||
|
||||||||||||||||||||||
```kotlin | ||||||||||||||||||||||
val a = 'a' | ||||||||||||||||||||||
a::class.qualifiedName // => kotlin.Char | ||||||||||||||||||||||
a.code // => 97 | ||||||||||||||||||||||
|
||||||||||||||||||||||
val jha = 'झ' // Devanagari alphabet | ||||||||||||||||||||||
jha.code // => 2333 | ||||||||||||||||||||||
|
||||||||||||||||||||||
val heart = '❤' // heart emoji | ||||||||||||||||||||||
heart.code // => 10084 | ||||||||||||||||||||||
|
||||||||||||||||||||||
Char.MAX_VALUE.code // => 65535 (0xFFFF, the largest value a single Char can hold) | ||||||||||||||||||||||
``` | ||||||||||||||||||||||
|
||||||||||||||||||||||
Converting between `Char` and `Int` is simple: | ||||||||||||||||||||||
|
||||||||||||||||||||||
|
||||||||||||||||||||||
```kotlin | ||||||||||||||||||||||
a.code // => 97 (a.toInt() is deprecated) | ||||||||||||||||||||||
|
||||||||||||||||||||||
Char(97) // => 'a' | ||||||||||||||||||||||
``` | ||||||||||||||||||||||
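One point that may be worth a sketch: the `code` of a digit character is not its numeric value. The example below assumes Kotlin 1.5 or later, where `digitToInt()` and `digitToChar()` are available; neither function is mentioned in the text above, and the helper name `digitsOf` is hypothetical.

```kotlin
// Hypothetical helper: collect the numeric values of the digits in a string.
fun digitsOf(s: String): List<Int> =
    s.filter { it.isDigit() }.map { it.digitToInt() }

fun main() {
    println('7'.code)           // 55, the character code of '7'
    println('7'.digitToInt())   // 7, the numeric value of the digit
    println('7' - '0')          // 7, the same result via Char subtraction
    println(7.digitToChar())    // 7, printed as a Char
    println(digitsOf("a1b22"))  // [1, 2, 2]
}
```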
|
||||||||||||||||||||||
The compiler allows _some_ forms of integer arithmetic on `Char`s: | ||||||||||||||||||||||
|
||||||||||||||||||||||
```kotlin | ||||||||||||||||||||||
'a' + 5 // => 'f' | ||||||||||||||||||||||
'c' - 'a' // => 2 | ||||||||||||||||||||||
'c' + 'a' // => error! | ||||||||||||||||||||||
|
||||||||||||||||||||||
'f' + ('A' - 'a') // => 'F' (same as 'f'.uppercaseChar()) | ||||||||||||||||||||||
|
||||||||||||||||||||||
'f'.dec() // => 'e' (decrement) | ||||||||||||||||||||||
'f'.inc() // => 'g' (increment) | ||||||||||||||||||||||
|
||||||||||||||||||||||
``` | ||||||||||||||||||||||
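As a rough illustration of what this arithmetic is useful for, here is a sketch of a letter rotation (a Caesar shift). It is limited to lowercase ASCII letters by assumption, and the function name `rotate` is hypothetical. `Int.mod()` needs Kotlin 1.5 or later.

```kotlin
// Hypothetical helper: rotate a lowercase ASCII letter by n places,
// leaving every other character unchanged.
fun rotate(c: Char, n: Int): Char {
    if (c !in 'a'..'z') return c
    val offset = (c - 'a' + n).mod(26)  // Char - Char gives an Int; mod keeps it in 0..25
    return 'a' + offset                 // Char + Int gives a Char
}

fun main() {
    println(rotate('y', 3))                                  // b
    println(rotate('a', -1))                                 // z
    println(rotate('!', 5))                                  // !
    println("kotlin".map { rotate(it, 1) }.joinToString("")) // lpumjo
}
```

`mod()` is used instead of `%` so that negative shifts still wrap around to a value in 0..25.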
|
||||||||||||||||||||||
## Some functions for `Char` | ||||||||||||||||||||||
|
||||||||||||||||||||||
As always, there are far too [many functions][ref-char-lib] to discuss here, so this is just a selection. | ||||||||||||||||||||||
|
||||||||||||||||||||||
- For appropriate alphabets, change case with [`uppercase()`][ref-uppercase] and [`lowercase()`][ref-lowercase]. | ||||||||||||||||||||||
- Test case with [`isUpperCase()`][ref-isuppercase] and [`isLowerCase()`][ref-islowercase]. | ||||||||||||||||||||||
- Test character type with: | ||||||||||||||||||||||
- [`isLetter()`][ref-isletter], covers many alphabets | ||||||||||||||||||||||
- [`isDigit()`][ref-isdigit], decimal digits in any script (Unicode category `Nd`), not only 0..9 | ||||||||||||||||||||||
- `isLetterOrDigit()`, combines the previous two | ||||||||||||||||||||||
- [`isWhitespace()`][ref-iswhitespace], any whitespace character | ||||||||||||||||||||||
Review comment on lines +87 to +91: I think this is the correct markdown format. For about.md only, I think we can explain how it's implemented by specifying the Unicode categories (that's how these functions work!) |
||||||||||||||||||||||
|
||||||||||||||||||||||
```kotlin | ||||||||||||||||||||||
'झ'.isLetter() // => true | ||||||||||||||||||||||
'A'.isLowerCase() // => false | ||||||||||||||||||||||
'4'.isDigit() // => true | ||||||||||||||||||||||
'\t'.isWhitespace() // => true (tab character) | ||||||||||||||||||||||
``` | ||||||||||||||||||||||
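These tests are defined in terms of Unicode general categories, which can be inspected through the `category` property. The sketch below is illustrative only: the helper name `classify` is hypothetical, and escape sequences are used for the non-ASCII characters so the example does not depend on source-file encoding.

```kotlin
// Hypothetical helper for illustration; not part of the concept text.
fun classify(c: Char): String = when {
    c.isLetter() -> "letter"
    c.isDigit() -> "digit"
    c.isWhitespace() -> "whitespace"
    else -> "other"
}

fun main() {
    println('A'.category)        // UPPERCASE_LETTER
    println('4'.category)        // DECIMAL_DIGIT_NUMBER
    println('\u096A'.category)   // DECIMAL_DIGIT_NUMBER (Devanagari digit four)
    println(classify('\u096A'))  // digit
    println(classify('\u091D'))  // letter (the Devanagari letter jha)
    println(classify('\t'))      // whitespace
    println(classify('!'))       // other
}
```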
|
||||||||||||||||||||||
Also, [regular expressions][ref-regex] (which will be the subject of a later Concept) allow powerful search and manipulation. | ||||||||||||||||||||||
|
||||||||||||||||||||||
## Char List and String interconversions | ||||||||||||||||||||||
|
||||||||||||||||||||||
For String to Char List, we can use `toList()`. | ||||||||||||||||||||||
|
||||||||||||||||||||||
|
||||||||||||||||||||||
For Char List to String, there is the [`joinToString()`][ref-jointostring] function, which takes a separator (often the empty string) as argument. | ||||||||||||||||||||||
|
||||||||||||||||||||||
|
||||||||||||||||||||||
```kotlin | ||||||||||||||||||||||
val kt = "kotlin".toList() // => [k, o, t, l, i, n] | ||||||||||||||||||||||
kt.joinToString("") // => "kotlin" | ||||||||||||||||||||||
kt.joinToString("_") // => "k_o_t_l_i_n" | ||||||||||||||||||||||
``` | ||||||||||||||||||||||
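A round trip through a `Char` list is handy when the characters need to be filtered or reordered in between. The following is a minimal sketch; the helper name `lettersOnly` is hypothetical.

```kotlin
// Hypothetical helper: keep only the letters of a string.
fun lettersOnly(s: String): String =
    s.toList().filter { it.isLetter() }.joinToString("")

fun main() {
    println(lettersOnly("k0t-l1n"))                        // ktln
    println("kotlin".toList().reversed().joinToString("")) // niltok
    println(listOf('a', 'b').joinToString())               // a, b
}
```

The last line shows why the separator is usually passed explicitly: when it is omitted, `joinToString()` defaults to `", "`.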
|
||||||||||||||||||||||
Note that `joinToString()` operates on a List or Array. | ||||||||||||||||||||||
Review comment: What was the reason for this sentence? |
||||||||||||||||||||||
To _convert_ a single `Char` to a 1-character string, use `toString()`. | ||||||||||||||||||||||
|
||||||||||||||||||||||
```kotlin | ||||||||||||||||||||||
'a'.toString() // => "a" | ||||||||||||||||||||||
``` | ||||||||||||||||||||||
|
||||||||||||||||||||||
To check if a character is present in a `String`, or a `Char` list or array, we have `in`, which maps to the [`contains()`][ref-contains] function. | ||||||||||||||||||||||
|
||||||||||||||||||||||
```kotlin | ||||||||||||||||||||||
val clist = "kotlin".toList() // => [k, o, t, l, i, n] | ||||||||||||||||||||||
't' in clist // => true | ||||||||||||||||||||||
't' in "kotlin" // => true | ||||||||||||||||||||||
``` | ||||||||||||||||||||||
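Membership tests combine naturally with the other `Char` functions. As a small sketch (the helper name `countVowels` is hypothetical, and "vowel" is assumed to mean the five ASCII vowels):

```kotlin
// Hypothetical helper: count ASCII vowels, ignoring case.
fun countVowels(s: String): Int =
    s.count { it.lowercaseChar() in "aeiou" }

fun main() {
    println(countVowels("Kotlin"))  // 2
    println('t' !in "kotlin")       // false
    println('z' in 'a'..'f')        // false, `in` also works on Char ranges
}
```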
|
||||||||||||||||||||||
[ref-char]: https://kotlinlang.org/docs/characters.html | ||||||||||||||||||||||
[ref-char-lib]: https://kotlinlang.org/api/core/kotlin-stdlib/kotlin/-char/ | ||||||||||||||||||||||
[wiki-ascii]: https://en.wikipedia.org/wiki/ASCII | ||||||||||||||||||||||
[web-unicode]: https://home.unicode.org/ | ||||||||||||||||||||||
[wiki-utf-8]: https://en.wikipedia.org/wiki/UTF-8 | ||||||||||||||||||||||
[ref-uppercase]: https://kotlinlang.org/api/core/kotlin-stdlib/kotlin.text/uppercase.html | ||||||||||||||||||||||
[ref-lowercase]: https://kotlinlang.org/api/core/kotlin-stdlib/kotlin.text/lowercase.html | ||||||||||||||||||||||
[ref-isuppercase]: https://kotlinlang.org/api/core/kotlin-stdlib/kotlin.text/is-upper-case.html | ||||||||||||||||||||||
[ref-islowercase]: https://kotlinlang.org/api/core/kotlin-stdlib/kotlin.text/is-lower-case.html | ||||||||||||||||||||||
[ref-isletter]: https://kotlinlang.org/api/core/kotlin-stdlib/kotlin.text/is-letter.html | ||||||||||||||||||||||
[ref-isdigit]: https://kotlinlang.org/api/core/kotlin-stdlib/kotlin.text/is-digit.html | ||||||||||||||||||||||
[ref-iswhitespace]: https://kotlinlang.org/api/core/kotlin-stdlib/kotlin.text/is-whitespace.html | ||||||||||||||||||||||
[ref-jointostring]: https://kotlinlang.org/api/core/kotlin-stdlib/kotlin.collections/join-to-string.html | ||||||||||||||||||||||
[ref-contains]: https://kotlinlang.org/api/core/kotlin-stdlib/kotlin.text/contains.html | ||||||||||||||||||||||
[ref-regex]: https://kotlinlang.org/api/core/kotlin-stdlib/kotlin.text/-regex/ |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,125 @@ | ||
# Introduction | ||
|
||
This is potentially a _big_ subject! | ||
It is possible to write a long book about it, and several people have done so (search Amazon for "unicode book" to see some examples). | ||
|
||
## A very brief history | ||
|
||
Handling characters in computers was much simpler in earlier decades, when programmers assumed that English was the only important language. | ||
So: 26 letters, upper and lower case, 10 digits, several punctuation marks, plus a code (0x07) to ring a bell, and it all fitted into 7 bits: the [ASCII][wiki-ascii] character set. | ||
|
||
Naturally, people started asking what about à, ä and Ł, then other people started asking about ऄ, ஹ and ญ, and young people wanted emojis 😱. | ||
What to do? | ||
|
||
To cut a long story short, many smart and patient people had to serve on committees for years, working out the details of the [Unicode][web-unicode] character set, and of encodings such as [UTF-8][wiki-utf-8], and lots of software needed a _very_ complicated rewrite. | ||
Also, lots of new bugs were introduced. | ||
|
||
To prevent _everything_ breaking, the Unicode/UTF-8 design ensures that the first 128 codes (0 to 127) are identical to ASCII _(even the bell)_. | ||
|
||
## Characters in Kotlin | ||
|
||
Languages designed after about 2005 have the huge advantage that a reasonably stable Unicode standard already existed. | ||
|
||
Kotlin (first released in 2011) was able to assume that users would use a variety of (human) languages, and would need Unicode to express them. | ||
|
||
Character literals are written in single-quotes, and are distinct from strings written in double quotes. | ||
This is probably obvious to people from the C/C++ world, but potentially confusing to Python and JavaScript programmers. | ||
|
||
```kotlin | ||
val a = 'a' | ||
a::class.qualifiedName // => kotlin.Char | ||
a.code // => 97 | ||
|
||
val jha = 'झ' // Devanagari alphabet | ||
jha.code // => 2333 | ||
|
||
val heart = '❤' // heart emoji | ||
heart.code // => 10084 | ||
|
||
Char.MAX_VALUE.code // => 65535 (0xFFFF, the largest value a single Char can hold) | ||
``` | ||
|
||
Converting between `Char` and `Int` is simple: | ||
|
||
```kotlin | ||
a.code // => 97 (a.toInt() is deprecated) | ||
Char(97) // => 'a' | ||
``` | ||
|
||
The compiler allows _some_ forms of integer arithmetic on `Char`s: | ||
|
||
```kotlin | ||
'a' + 5 // => 'f' | ||
'c' - 'a' // => 2 | ||
'c' + 'a' // => error! | ||
|
||
'f' + ('A' - 'a') // => 'F' (same as 'f'.uppercaseChar()) | ||
|
||
'f'.dec() // => 'e' (decrement) | ||
'f'.inc() // => 'g' (increment) | ||
``` | ||
|
||
## Some functions for `Char` | ||
|
||
As always, there are far too [many functions][ref-char-lib] to discuss here, so this is just a selection. | ||
|
||
- For appropriate alphabets, change case with [`uppercase()`][ref-uppercase] and [`lowercase()`][ref-lowercase]. | ||
- Test case with [`isUpperCase()`][ref-isuppercase] and [`isLowerCase()`][ref-islowercase]. | ||
- Test character type with: | ||
- [`isLetter()`][ref-isletter], covers many alphabets | ||
- [`isDigit()`][ref-isdigit], decimal digits in any script (Unicode category `Nd`), not only 0..9 | ||
- `isLetterOrDigit()`, combines the previous two | ||
- [`isWhitespace()`][ref-iswhitespace], any whitespace character | ||
|
||
```kotlin | ||
'झ'.isLetter() // => true | ||
'A'.isLowerCase() // => false | ||
'4'.isDigit() // => true | ||
'\t'.isWhitespace() // => true (tab character) | ||
``` | ||
|
||
## Char List and String interconversions | ||
|
||
For String to Char List, we can use `toList()`. | ||
|
||
For Char List to String, there is the [`joinToString()`][ref-jointostring] function, which takes a separator (often the empty string) as argument. | ||
|
||
```kotlin | ||
val kt = "kotlin".toList() // => [k, o, t, l, i, n] | ||
kt.joinToString("") // => "kotlin" | ||
kt.joinToString("_") // => "k_o_t_l_i_n" | ||
``` | ||
|
||
Note that `joinToString()` operates on a List or Array. | ||
To _convert_ a single `Char` to a 1-character string, use `toString()`. | ||
|
||
```kotlin | ||
'a'.toString() // => "a" | ||
``` | ||
|
||
To check if a character is present in a `String`, or a `Char` list or array, we have `in`, which maps to the [`contains()`][ref-contains] function. | ||
|
||
```kotlin | ||
val clist = "kotlin".toList() // => [k, o, t, l, i, n] | ||
't' in clist // => true | ||
't' in "kotlin" // => true | ||
``` | ||
|
||
|
||
[ref-char]: https://kotlinlang.org/docs/characters.html | ||
[ref-char-lib]: https://kotlinlang.org/api/core/kotlin-stdlib/kotlin/-char/ | ||
[wiki-ascii]: https://en.wikipedia.org/wiki/ASCII | ||
[web-unicode]: https://home.unicode.org/ | ||
[wiki-utf-8]: https://en.wikipedia.org/wiki/UTF-8 | ||
[ref-uppercase]: https://kotlinlang.org/api/core/kotlin-stdlib/kotlin.text/uppercase.html | ||
[ref-lowercase]: https://kotlinlang.org/api/core/kotlin-stdlib/kotlin.text/lowercase.html | ||
[ref-isuppercase]: https://kotlinlang.org/api/core/kotlin-stdlib/kotlin.text/is-upper-case.html | ||
[ref-islowercase]: https://kotlinlang.org/api/core/kotlin-stdlib/kotlin.text/is-lower-case.html | ||
[ref-isletter]: https://kotlinlang.org/api/core/kotlin-stdlib/kotlin.text/is-letter.html | ||
[ref-isdigit]: https://kotlinlang.org/api/core/kotlin-stdlib/kotlin.text/is-digit.html | ||
[ref-iswhitespace]: https://kotlinlang.org/api/core/kotlin-stdlib/kotlin.text/is-whitespace.html | ||
[ref-jointostring]: https://kotlinlang.org/api/core/kotlin-stdlib/kotlin.collections/join-to-string.html | ||
[ref-contains]: https://kotlinlang.org/api/core/kotlin-stdlib/kotlin.text/contains.html | ||
[ref-regex]: https://kotlinlang.org/api/core/kotlin-stdlib/kotlin.text/-regex/ | ||
|
||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,10 @@ | ||
[ | ||
{ | ||
"url": "https://kotlinlang.org/docs/characters.html", | ||
"description": "Kotlin introduction to characters." | ||
}, | ||
{ | ||
"url": "https://kotlinlang.org/api/core/kotlin-stdlib/kotlin/-char/", | ||
"description": "Char type and functions" | ||
} | ||
] |