## Opening
What is “the same”?
This seemingly simple question is full of traps in computer systems. In April 2025, Linus Torvalds sent a scathing email to the Linux kernel mailing list regarding the case-folding feature of bcachefs[^1]. His core argument can be summarized in one sentence: filenames should just be a string of bytes, and the filesystem should not attempt to “understand” them.
This is not merely a matter of technical preference. When we begin to assign semantics to filenames—such as letting the system understand that ‘A’ and ‘a’ are the same, or that ‘é’ and ‘e+´’ are the same—we open a Pandora’s box.
Let’s further understand the essence of this problem.
## The Essence of Filenames: Identifiers or Names?
In a library, every book has two ways of being identified: the title and the call number. The title has semantics; it tells you what the book is about. The call number is opaque; it is simply a unique identifier used to locate the book on the shelf.
The designers of Unix chose the “call number” model.
In Unix design philosophy, a filename is an opaque sequence of bytes. The kernel’s responsibility is simple: map this byte sequence to an inode (index node). As stated in O’Reilly’s “Understanding the Linux Kernel”: “A Unix file is an information container structured as a sequence of bytes; the kernel does not interpret the contents of a file.”[^2] This “non-interpretation” philosophy applies equally to filenames.
What is the underlying reason for this design choice?
Simplicity brings predictability. When filenames are just byte sequences, the system’s behavior is completely deterministic: two filenames are identical if and only if their byte sequences are exactly the same. There is no ambiguity, no special cases, and no cultural dependency. The VFS (Virtual File System) layer resolves pathnames to inodes via the dentry cache, but this process is pure byte matching[^3].
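The byte-matching rule is trivial to state in code. A sketch (the helper name is ours):

```python
# Under the "opaque bytes" model, two names are the same
# if and only if their byte sequences are identical.
def same_name(a: bytes, b: bytes) -> bool:
    return a == b

assert same_name(b"README", b"README")
assert not same_name(b"README", b"readme")            # case differs at the byte level
assert not same_name("caf\u00e9".encode(),            # NFC bytes: 63 61 66 C3 A9
                     "cafe\u0301".encode())           # NFD bytes: 63 61 66 65 CC 81
```

Note that the last assertion already foreshadows the Unicode problem: two strings that render identically are, as bytes, simply different names.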
But macOS chose a different path.
## Equivalence Relations: When Sameness Becomes Complex
In mathematics, an equivalence relation is a relation that satisfies three properties: reflexivity (a ~ a), symmetry (if a ~ b, then b ~ a), and transitivity (if a ~ b and b ~ c, then a ~ c). When we say something is case-insensitive, we are actually defining an equivalence relation: A ~ a, B ~ b, and so on.
This seems simple, doesn’t it?
The problem is: who defines this equivalence relation? Different systems may have different definitions.
In English, the uppercase of ‘i’ is ‘I’; this seems obvious. But Turkish has four different ‘i’s:
| Form | Dotted | Dotless |
|---|---|---|
| Uppercase | İ (U+0130) | I (U+0049) |
| Lowercase | i (U+0069) | ı (U+0131) |
In Turkish, the uppercase of ‘i’ is ‘İ’ (uppercase I with a dot), and the lowercase of ‘I’ is ‘ı’ (lowercase i without a dot)[^4]. This means that if a security check uses English rules to convert FILE to lowercase and gets file, while the filesystem uses Turkish rules and gets fıle, these two strings will not match—even though they “should” be the same filename.
> [!WARNING]
> This is not a theoretical issue. Jeff Atwood documented a real-world case on the Coding Horror blog: when an application runs under a Turkish locale, converting the string INTEGER to lowercase results in ınteger instead of integer, causing the program logic to fail completely[^5].
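The mismatch is easy to reproduce. Python’s `str.lower()` is locale-independent, so this sketch applies the Turkish mapping by hand rather than through a locale-aware library:

```python
# Python's str.lower() uses locale-independent rules that follow
# the English convention: 'I' -> 'i'.
assert "FILE".lower() == "file"

# Turkish maps uppercase 'I' to dotless 'ı' (U+0131). Simulating that
# mapping by hand shows the mismatch a Turkish-locale lowering causes:
turkish = "FILE".replace("I", "\u0131").lower()   # -> "fıle"
assert turkish == "f\u0131le"
assert turkish != "file"   # a byte-level security check no longer matches
```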
This reveals a profound problem: case conversion is not a universal, culture-independent operation. When we implement case-insensitivity at the filesystem level, we must choose a specific set of rules—and that choice may be inconsistent with the rules used by user-space programs.
## Unicode Normalization: A Deeper Rabbit Hole
If case-insensitivity is complex enough, Unicode normalization pushes the problem into another dimension.
Let’s start with a simple question: Is ‘é’ one character or two?
In Unicode, the answer is “both.” ‘é’ can be represented by a single code point U+00E9 (LATIN SMALL LETTER E WITH ACUTE), or by two code points U+0065 (LATIN SMALL LETTER E) + U+0301 (COMBINING ACUTE ACCENT). These two representations are visually identical but completely different at the byte level:
- Precomposed form (NFC): bytes `C3 A9` → é
- Decomposed form (NFD): bytes `65 CC 81` → e + ́ → é

The Unicode standard defines four normalization forms: NFC, NFD, NFKC, and NFKD[^6]. NFC tends to use precomposed characters, while NFD tends to decompose characters into base characters plus combining marks.
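Python’s `unicodedata` module makes both forms, and their byte-level difference, directly visible:

```python
import unicodedata

nfc = "\u00e9"     # 'é' as a single precomposed code point
nfd = "e\u0301"    # 'e' followed by COMBINING ACUTE ACCENT

assert nfc != nfd                          # different code point sequences
assert nfc.encode() == b"\xc3\xa9"         # NFC bytes: C3 A9
assert nfd.encode() == b"e\xcc\x81"        # NFD bytes: 65 CC 81

# Normalization converts between the two representations:
assert unicodedata.normalize("NFD", nfc) == nfd
assert unicodedata.normalize("NFC", nfd) == nfc
```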
HFS+ chose NFD. According to Apple’s technical documentation, HFS+ uses a normalization form “very close to Unicode Normalization Form D”[^7]. This means when you create a file named café, the system automatically decomposes é into the two code points e + ´.
This design decision has its logic: by forcing normalization, HFS+ ensures that the same “character” has only one representation, thereby preventing users from creating two files that “look the same but are actually different.”
But here is a critical issue: HFS+'s normalization rules are based on Unicode 3.2, and these rules cannot be updated as the Unicode standard evolves because “such evolution would invalidate existing HFS+ volumes”[^8].
> [!IMPORTANT]
> This is a normalization implementation stuck in 1998, yet it serves users in 2025. The Unicode standard has evolved from 3.2 to 17.0 (released September 9, 2025), adding tens of thousands of characters, but HFS+'s normalization rules are forever frozen in the past.
In 2017, Apple introduced APFS to replace HFS+. APFS made a significant change: it no longer forces Unicode normalization but is instead “normalization-preserving but normalization-insensitive”[^9]. This means APFS preserves the original byte sequence you input but still considers normalization equivalence when comparing filenames.
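A toy model of a “normalization-preserving but normalization-insensitive” comparison might look like this (a sketch, not Apple’s implementation; the choice of NFD as the canonical form here is illustrative):

```python
import unicodedata

def apfs_like_equal(a: str, b: str) -> bool:
    # Store names byte-for-byte as given ("preserving"), but compare
    # them under a canonical normalization form ("insensitive").
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

assert apfs_like_equal("caf\u00e9", "cafe\u0301")   # NFC name vs NFD name: collide
assert not apfs_like_equal("cafe", "caf\u00e9")     # genuinely different names
```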
This change solved some problems but introduced new ones. When migrating from HFS+ to APFS, filenames that were originally normalized maintain their NFD form, while newly created files might use the NFC form. In certain edge cases, this can lead to “visually identical but actually different” filenames coexisting in the same directory.
## The Essence of Security Vulnerabilities: Inconsistency in Equivalence Relations
Now we can understand the essence of these security vulnerabilities.
When a security checking program and the filesystem use different equivalence relations, a “gap” is created. An attacker can construct a filename that appears safe to the security checker but is equivalent to a dangerous filename in the eyes of the filesystem.
Torvalds accurately described this problem in his email:
> Security issues like “user space checked that the filename didn’t match some security-sensitive pattern”. And then the shit-for-brains filesystem ends up matching that pattern anyway…
Let’s look at a concrete example.
In March 2021, the Git project disclosed a critical vulnerability, CVE-2021-21300[^10]. This vulnerability specifically affected Windows and macOS users using case-insensitive filesystems.
The vulnerability hinges on Git’s lstat cache mechanism. When Git checks out files, it maintains a cache to reduce system calls. An attacker can construct a malicious repository containing two files, A and a. On a case-sensitive filesystem these are two different files; on a case-insensitive filesystem they collide.
The key to the attack lies in the inconsistency between Git’s internal logic (based on case-sensitive assumptions) and the filesystem’s behavior (case-insensitive). By exploiting this inconsistency, an attacker could cause Git to execute arbitrary code during the checkout process.
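Whether a given directory folds case can be probed at runtime. A sketch (the helper name is ours):

```python
import os
import tempfile

def fs_is_case_insensitive(base=None) -> bool:
    """Probe whether the filesystem under `base` folds case."""
    with tempfile.TemporaryDirectory(dir=base) as d:
        probe = os.path.join(d, "CaseProbe")
        open(probe, "w").close()
        # If the differently-cased name resolves to the same entry,
        # names collide on this filesystem.
        return os.path.exists(os.path.join(d, "caseprobe"))
```

On a default APFS or NTFS volume this returns True; on ext4 it returns False. Cross-platform software that writes files whose names differ only in case should run a probe like this rather than assume either answer.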
The security issues introduced by Unicode normalization are even more subtle. According to the Black Hat USA 2019 research paper “Host/Split: Exploitable Antipatterns in Unicode Normalization,” exploitable vulnerabilities arise when security decisions are made based on Unicode strings while subsequent processing uses a different normalization form[^11].
Consider a scenario: security software checks if a filename matches the sensitive path /etc/passwd.
An attacker creates a file with a name containing invisible characters or Unicode variants. The security software checks the string, finds it is not equal to /etc/passwd, and allows it.
However, when the filesystem processes it at the low level, it might normalize these Unicode variants into a form equivalent to /etc/passwd, thereby bypassing the security check.
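A sketch of the gap, using NFKC and a fullwidth letter for illustration (whether, and how, a real filesystem normalizes varies; the function names are ours):

```python
import unicodedata

SENSITIVE = "/etc/passwd"

def naive_check(path: str) -> bool:
    # The "security software" view: a plain string comparison.
    return path != SENSITIVE

# Fullwidth 'p' (U+FF50) is a different code point at the string level
# but NFKC-normalizes to ASCII 'p'.
evil = "/etc/\uff50asswd"

assert naive_check(evil)                                   # the check passes
assert unicodedata.normalize("NFKC", evil) == SENSITIVE    # the lower layer collides
```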
CERT/CC documented a similar issue in VU#999008: compilers allow Unicode control characters and homoglyphs to appear in source code, which could be used to hide malicious code during code reviews[^12].
## TOCTOU: Equivalence Issues in the Time Dimension
There is another class of even more subtle vulnerabilities: TOCTOU (Time-of-Check to Time-of-Use).
The essence of a TOCTOU vulnerability is that a time window exists between the check and the use, during which an attacker can change the system state, rendering the check result invalid[^13].
In the context of filesystems, this problem is closely related to the semantic interpretation of filenames. Let’s think about the process of file access:
1. A program requests access to a file using a filename.
2. The kernel resolves the filename to an inode.
3. The kernel checks permissions.
4. The kernel returns a file descriptor.
The problem is: between step 1 and step 2, the mapping from filename to inode might change. An attacker could, within this window, redirect the filename to another file.
> [!NOTE]
> There is a key technical detail here: while the mapping from filename to inode is volatile, the mapping from inode to file descriptor is stable[^14]. Once you obtain a file descriptor, it points directly to the inode and no longer depends on the filename. This is why CERT SEI recommends “opening critical files only once and then performing all required operations via the file descriptor rather than the filename.”
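That recommendation can be followed in any language with access to `open` and `fstat`. A Python sketch (the function name is ours):

```python
import os
import stat

def open_checked(path: str) -> int:
    """Open once, then validate via the descriptor, never the name."""
    # O_NOFOLLOW refuses a symlink at the final path component,
    # closing one common TOCTOU redirection trick.
    fd = os.open(path, os.O_RDONLY | os.O_NOFOLLOW)
    try:
        info = os.fstat(fd)          # bound to the inode, immune to renames
        if not stat.S_ISREG(info.st_mode):
            raise OSError(f"{path} is not a regular file")
    except Exception:
        os.close(fd)
        raise
    return fd   # all further operations use fd, not the path string
```

After this call, renaming or replacing the path has no effect on the opened file: the descriptor keeps referencing the original inode.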
macOS’s case-insensitivity and Unicode normalization make TOCTOU issues more complex. When a security check uses one filename representation while the actual file operation uses another equivalent but different representation, the TOCTOU window expands.
The USENIX FAST’23 paper “Unsafe at Any Copy: Name Collisions from Mixing Case Sensitivity” systematically studied this issue[^15]. The research found differences in case-folding rules and normalization techniques across different filesystems. For example, temp_200K (where K is the Kelvin Sign, U+212A) and temp_200k are considered the same on NTFS and APFS but different on ZFS.
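Python’s `str.casefold()` implements the Unicode full case folding under which these two names collide:

```python
# U+212A KELVIN SIGN case-folds to 'k' under Unicode's rules, so a
# case-folding filesystem treats these names as one file:
assert "temp_200\u212a".casefold() == "temp_200k".casefold()

# At the byte level, the names remain distinct:
assert "temp_200\u212a".encode() != "temp_200k".encode()
```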
This inconsistency is a breeding ground for security vulnerabilities.
## Defense in Depth: When Filenames Are Untrusted
Facing the reality that filenames are untrusted, Apple chose a Defense in Depth strategy.
The core idea of this strategy is: since we cannot make filenames trustworthy, we should not rely on filenames to establish trust. Instead, we build independent security barriers at multiple levels, each using a different foundation of trust.
Let’s look at how Apple implements this strategy.
### Merkle Trees and Signed System Volumes
macOS Big Sur (11.0) introduced the Signed System Volume (SSV) mechanism[^16]. The core idea of SSV is to use cryptographic hashes to verify system integrity rather than relying on filenames.
The technical implementation of SSV is based on Merkle trees. A Merkle tree is an elegant data structure that allows us to verify the integrity of a dataset of any size using a fixed-size “root hash”[^17].
A Merkle tree works as follows:
1. Divide the data into several blocks and calculate the hash value of each block (the leaf nodes).
2. Pair adjacent hash values and calculate their combined hash (the internal nodes).
3. Repeat step 2 until only one hash value remains (the root node).
This structure has a key property: any modification to a data block will cause all hash values along the path from that leaf node to the root node to change. Therefore, as long as the root hash is trusted, we can verify the integrity of the entire dataset.
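The three steps above can be sketched in a few lines of Python (SHA-256 as in SSV; the rule of carrying the last hash up when a level has an odd count is a common convention, not necessarily Apple’s):

```python
import hashlib

def merkle_root(blocks: list[bytes]) -> bytes:
    """Compute a Merkle root over a list of data blocks (a minimal sketch)."""
    level = [hashlib.sha256(b).digest() for b in blocks]      # step 1: leaves
    while len(level) > 1:
        if len(level) % 2:                                    # odd count:
            level.append(level[-1])                           #   carry the last hash up
        level = [hashlib.sha256(level[i] + level[i + 1]).digest()
                 for i in range(0, len(level), 2)]            # step 2: pair and hash
    return level[0]                                           # step 3: the root

# Flipping one block changes the root, which is the property SSV relies on:
a = merkle_root([b"block0", b"block1", b"block2", b"block3"])
b = merkle_root([b"block0", b"block1", b"blockX", b"block3"])
assert a != b
```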
> [!TIP]
> The verification efficiency of a Merkle tree is logarithmic. To verify the integrity of a specific data block, one only needs to check the hash values along the path from that leaf node to the root node; for n data blocks, this requires only on the order of log₂(n) hash calculations. This allows SSV to quickly verify system integrity at boot time without significantly increasing startup time.
In the SSV implementation, every file on the system volume has a SHA-256 hash value stored in the filesystem metadata. The hash value of the root node is called the “seal,” and it covers every single byte on the SSV.
This seal is verified by the bootloader every time the Mac starts. If verification fails, the boot process is aborted, and the user is prompted to reinstall the operating system[^18].
What does this mean? No matter what case confusion or Unicode variants an attacker uses—even if they obtain root privileges—as long as they attempt to modify any content on the system volume, the hash values will not match, and the system will refuse to boot. The semantic interpretation of filenames becomes irrelevant here because the integrity of the entire volume is guaranteed by cryptographic hashes, not filenames.
### Metadata Tagging and SIP
Before SSV, macOS already had SIP (System Integrity Protection), also known as “rootless”[^19]. The core idea of SIP is to use metadata tags to protect critical files rather than relying on filenames.
SIP introduced a new restricted file flag. Files marked as restricted cannot be modified, even when running as the root user[^20].
The key is that SIP checks are based on the inode’s metadata tags, not the filename. When any process attempts to write to a protected directory, the kernel checks if that inode is marked as “SIP protected.” Even if an attacker tricks upper-level checks using Unicode variants or case confusion, when the request reaches the kernel, the kernel looks at the inode’s restricted flag and denies the operation.
This is an important design principle: shifting the foundation of trust from the filename to the inode’s metadata. Filenames can be obfuscated, but inode metadata is managed directly by the kernel and is unaffected by the semantic interpretation of filenames.
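The restricted flag lives in the inode’s BSD flags and can be read with an ordinary stat call. A macOS-oriented sketch (the `SF_RESTRICTED` value is taken from macOS’s `<sys/stat.h>`; the function name is ours):

```python
import os

SF_RESTRICTED = 0x00080000   # from <sys/stat.h> on macOS

def is_sip_restricted(path: str) -> bool:
    """Report whether the inode behind `path` carries the SIP restricted flag."""
    st = os.stat(path)
    # st_flags only exists on BSD-derived systems such as macOS;
    # elsewhere we fall back to 0 (not restricted).
    flags = getattr(st, "st_flags", 0)
    return bool(flags & SF_RESTRICTED)
```

On macOS, paths such as /usr/bin report True; note that the answer comes from the inode’s metadata, so no amount of renaming or Unicode trickery with the path string changes it.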
### File Descriptors: Bypassing Filenames
In its developer documentation, Apple promotes the use of NSURL objects and file descriptors rather than raw file path strings[^21].
There are profound security considerations behind this design choice.
Under the sandbox mechanism, when a user authorizes an app to access a file, the system hands the app a token rather than a path. The app presents this token to the kernel when requesting file access, and the kernel locates the file via its inode.
This design avoids complex path resolution, case matching, and Unicode normalization issues. More importantly, it fundamentally eliminates the possibility of TOCTOU vulnerabilities—because the file descriptor points directly to the inode rather than referencing it indirectly through a filename.
### TCC: Access Control Based on Process Identity
TCC (Transparency, Consent, and Control) is the framework macOS uses to manage application access to sensitive data[^22]. The core of TCC is a SQLite database stored at /Library/Application Support/com.apple.TCC/TCC.db.
A key security feature of TCC is that it intercepts access based on process identity (rather than filenames). When an attacker attempts to read a user’s private directory, TCC checks the identity and permissions of the requesting process rather than simply checking the file path string.
The TCC database itself is protected by SIP and cannot be modified directly[^23]. To interfere with these databases, an attacker must disable SIP or gain access to a trusted system process.
## Reflections on Design Philosophy
Returning to Linus Torvalds’ criticism, we can see that this is not just a technical issue but a design philosophy issue.
Unix design philosophy emphasizes simplicity and orthogonality. Filenames are byte sequences; the kernel does not interpret their meaning. The advantage of this design is predictability and security—no hidden semantic conversions, no unexpected equivalence relations.
macOS chose a different path, attempting to provide a friendlier user experience. Case-insensitivity means users don’t have to worry about the difference between Document.txt and document.txt. Unicode normalization means users don’t have to understand the difference between NFD and NFC.
But this friendliness comes at a cost. When a filesystem begins to understand filenames, it assumes the responsibility of defining “sameness.” And the definition of sameness is complex, culture-dependent, and constantly evolving.
The deeper problem is that when we use different definitions of sameness at different levels of the system, security vulnerabilities arise. A security checking program might use one equivalence relation, the filesystem uses another, and attackers exploit this inconsistency to bypass security checks.
Apple’s Defense in Depth strategy is a pragmatic compromise. Since historical decisions cannot be changed (case-insensitivity is already the default behavior of macOS), independent security barriers are established at higher and lower levels:
- SSV protects system integrity via cryptographic hashes.
- SIP protects critical files via metadata tags.
- File Descriptors bypass filenames to use inodes directly.
- TCC protects user data via process identity verification.
The common characteristic of these mechanisms is that they do not trust filenames. They use cryptographic hashes, metadata tags, process identities, and inodes to establish trust, rather than relying on strings that can be obfuscated.
## Conclusion
Torvalds’ criticism reminds us of a fundamental principle: security systems should not rely on complex semantic interpretations.
A simple model, even if imperfect, is often more reliable and predictable than a complex one. Unix’s “filenames are byte sequences” is exactly such a simple model: it gives up the ability to “understand” filenames but gains predictability and security in return.
The history of macOS demonstrates the cost of complexity. From HFS+'s Unicode normalization to APFS’s normalization-insensitivity, from Git vulnerabilities to TOCTOU attacks, the semantic interpretation of filenames has been a breeding ground for security issues.
However, Apple’s response strategy also demonstrates engineering wisdom: when we cannot eliminate complexity, we can limit its impact through defense in depth. SSV, SIP, File Descriptors, TCC—none of these mechanisms rely on the semantic interpretation of filenames; they establish independent foundations of trust at lower levels.
For developers, the lessons from this story are clear:
- Never assume filenames are unique or immutable.
- Use file descriptors instead of path strings for file operations.
- Consider the impact of case and Unicode normalization when performing security checks.
- When developing cross-platform, test in both case-sensitive and case-insensitive environments.
As Torvalds said, a filename should just be a string of bytes. When we start giving them magical meanings, we open Pandora’s box.
## References

[^1]: Phoronix. “Linus Torvalds Expresses His Hatred For Case-Insensitive File-Systems.” 2025. https://www.phoronix.com/news/Linus-Torvalds-Anti-Case-Fold
[^2]: Bovet, D. P., & Cesati, M. “Understanding the Linux Kernel, Second Edition.” O’Reilly Media, 2002.
[^3]: Linux Kernel Documentation. “Overview of the Linux Virtual File System.” https://docs.kernel.org/filesystems/vfs.html
[^4]: I18n Guy. “Internationalization for Turkish: Dotted and Dotless Letter I.” http://www.i18nguy.com/unicode/turkish-i18n.html
[^5]: Atwood, J. “What’s Wrong With Turkey?” Coding Horror, 2008. https://blog.codinghorror.com/whats-wrong-with-turkey/
[^6]: Unicode Consortium. “UAX #15: Unicode Normalization Forms.” https://unicode.org/reports/tr15/
[^7]: Apple Developer. “Technical Q&A QA1235: Converting to Precomposed Unicode.” https://developer.apple.com/library/archive/qa/qa1235/_index.html
[^8]: Wikipedia. “HFS Plus.” https://en.wikipedia.org/wiki/HFS_Plus
[^9]: Eclectic Light. “Explainer: Unicode, normalization and APFS.” 2021. https://eclecticlight.co/2021/05/08/explainer-unicode-normalization-and-apfs/
[^10]: InfoQ. “Analyzing Git Clone Vulnerability.” 2021. https://www.infoq.com/news/2021/03/git-clone-vulnerability/
[^11]: Birch, J. “Host/Split: Exploitable Antipatterns in Unicode Normalization.” Black Hat USA 2019. https://i.blackhat.com/USA-19/Thursday/us-19-Birch-HostSplit-Exploitable-Antipatterns-In-Unicode-Normalization-wp.pdf
[^12]: CERT/CC. “VU#999008 - Compilers permit Unicode control and homoglyph characters.” 2021. https://www.kb.cert.org/vuls/id/999008
[^13]: MITRE. “CWE-367: Time-of-check Time-of-use (TOCTOU) Race Condition.” https://cwe.mitre.org/data/definitions/367.html
[^14]: CERT SEI. “FIO45-C. Avoid TOCTOU race conditions while accessing files.” https://wiki.sei.cmu.edu/confluence/display/c/FIO45-C.+Avoid+TOCTOU+race+conditions+while+accessing+files
[^15]: Basu, A., et al. “Unsafe at Any Copy: Name Collisions from Mixing Case Sensitivity.” USENIX FAST ’23. https://www.usenix.org/system/files/fast23-basu.pdf
[^16]: Apple Support. “Signed system volume security.” Apple Platform Security Guide. https://support.apple.com/guide/security/signed-system-volume-security-secd698747c9/web
[^17]: Wikipedia. “Merkle tree.” https://en.wikipedia.org/wiki/Merkle_tree
[^18]: Jamf Blog. “What’s New in macOS Big Sur Security.” 2020. https://www.jamf.com/blog/whats-new-in-macos-big-sur-security/
[^19]: Wikipedia. “System Integrity Protection.” https://en.wikipedia.org/wiki/System_Integrity_Protection
[^20]: Apple Support. “System Integrity Protection.” Apple Platform Security Guide. https://support.apple.com/guide/security/system-integrity-protection-secb7ea06b49/web
[^21]: Apple Support. “Controlling app access to files in macOS.” Apple Platform Security Guide. https://support.apple.com/guide/security/controlling-app-access-to-files-secddd1d86a6/web
[^22]: Rainforest QA. “A deep dive into macOS TCC.db.” 2021. https://www.rainforestqa.com/blog/macos-tcc-db-deep-dive
[^23]: Huntress. “Full Transparency: Controlling Apple’s TCC.” 2024. https://www.huntress.com/blog/full-transparency-controlling-apples-tcc