a blog title

sorted(), Lexicographic Order and Unicode Code Points

Python’s sorted() function sorts strings in lexicographic order. Sometimes, beginners to take this to mean a case-insensitive alphabetic order but this is a mistake! When using Python’s sorted(), you may have noticed that all of the capital letters A - Z come before lowercase letters a - z.

That is, if you create a list:

strings = ["Zap!", "aardvark", "Apples"]

And you feed strings into the sorted() function, you may be surprised at what you see:

print(sorted(strings))
# ['Apples', 'Zap!', 'aardvark']

“Zap!” comes before “aardvark”. Why? What is going on here?

Under the hood, the characters that compose strings are represented by integers (Unicode code point values. Recall that your computer understands numeric values; it has no intuitive understanding of alphabetic characters. Specifically, everything will map to binary 1s and 0s -- a transistor is either on or off.

Each character corresponds to an integer value and it is this value that the strings are being sorted on by default. You can use Python’s build-in ord() function to determine what the code point value is for a a Unicode character string.

ord('A'), ord('Z'), ord('a'), ord('z')
# (65, 90, 97, 122)

This clarifies the sorting above. If we look at Unicode code points, ‘A’ < ‘Z’ < ‘a’ < ‘z’! This is why we see the sorting we did above.

If you don’t want this lexicographic order and want to sort the data in a more classic alphabetical order for your use case, you’ll need to pass the appropriate function to the key parameter of sorted().

Commonly, people will use the string methods like .upper() and .lower() to transform the case of the strings before you sort. This can be totally adequate, especially if you’re confident your data only contains ASCII characters. I would recommend, however, to consider getting into the habit of using the more robust .casefold() method instead when you want to perform caseless comparisons (especially for equality comparisons and especially if you are working with strings that may contain characters that are not English language/standard ASCII character set.) You can find the casefold method’s documentation here.

Let’s just use an anonymous function as our argument to key since it’s a pretty simple transformation:

     strings = ["Zap!", "aardvark", "Apples"]
sorted(strings, key = lambda s: s.casefold())
#['aardvark', 'Apples', 'Zap!'] 

Ah, that may be moreso what you were expecting to see when you sorted your strings. Notice that our strings list has NOT been mutated. sorted() will return a totally new list that you can opt to store into a variable. If you want to mutate the list in-place, you can call the .sort() method: strings.sort(key = lambda s: s.casefold())

One other interesting point (at least, I thought it was interesting) was looking at how strings are sorted in case of a “tie.” Lets take a look at how the sort order changes if we add “Aardvark” with a capital A to our list in different locations relative to the lower case “aardvark”.

Placing "aardvark" before "Aardvark":

strings_lower_first = ["Zap!", "aardvark", "Apples", "Aardvark"]
sorted(strings_lower_first, key = lambda s: s.casefold())
# ['aardvark', 'Aardvark', 'Apples', 'Zap!']  

versus placing "Aardvark" before "aardvark":

strings_upper_first = ["Zap!", "Aardvark", "aardvark", "Apples"]
sorted(strings_upper_first, key = lambda s: s.casefold())
# ['Aardvark', 'aardvark', 'Apples', 'Zap!'] 

When the two strings are equal, Python will preserve the relative order of the elements present in our list. In the first example, lower-case aardvark came first because it appears first in the list of strings_lower_first and in the second example, Aardvark came first in our output of sorted() because it appears first in the list strings_upper_first.

This is because Python’s sorted() function is guaranteed to be "stable." From the docs: “A sort is stable if it guarantees not to change the relative order of elements that compare equal — this is helpful for sorting in multiple passes (for example, sort by department, then by salary grade).”

Python apparently uses an algorithm called Timsort for sorting. Maybe I'll write up my own post on Timsort once I learn more about it but I think I've done what I came to do for the scope of this humble post.

Cheers.

#python