UTF-8 / tohutō capability audit

Purpose

This checklist is for publishers, distributors, and retailers across Australia and New Zealand to audit whether their systems can correctly handle UTF-8 text encoding - the standard that allows all diacritics, including macrons (tohutō), to be stored, transmitted, and searched without loss.

We’ve used tohutō here as this issue is so critical for Aotearoa, and Māori words are a visible example of where accuracy is essential. The same principles apply to all diacritics - whether it’s Māori, García, Brontë, or façade.

Why this matters

Tohutō (macrons) and other diacritics are part of the correct spelling of many names, places, and words. They change meaning and pronunciation, affect search results, and demonstrate respect for language and culture.
If they’re lost at any point in your supply chain, you risk:

Loss of trust from authors, readers, and communities
Incorrect author, title, or place names in your catalogue and marketing
Customers not finding the book they’re searching for
Search engines failing to match your pages with relevant queries, or ranking your website lower due to frequency of misspellings.

What is UTF-8?

UTF-8 is the most widely used text encoding standard on the internet. It can represent every character in the Unicode standard, which covers virtually every written language – including tohutō – in a consistent way.

Think of encoding as the “alphabet” a computer uses internally. The same letter in one encoding may be represented by different numbers in another. If two systems disagree on the encoding, or if one removes characters it can’t represent, your text changes.

Example:

Correct form: Māori
After corruption: M?ori or M�ori (system guessed wrong about the encoding)
After stripping: Maori (diacritic removed completely, often without warning)
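
The three forms above can be reproduced in a few lines of Python. This is a minimal sketch for illustration only, not a description of how any particular system behaves; the variable names are made up for the example.

```python
import unicodedata

correct = "Māori"

# Corruption (wrong guess): UTF-8 bytes are decoded as Latin-1.
# "ā" is two bytes in UTF-8 (0xC4 0x81), so it becomes two wrong characters.
utf8_bytes = correct.encode("utf-8")
misread = utf8_bytes.decode("latin-1")

# Corruption (placeholder): ASCII has no "ā", so a "?" is substituted.
placeholder = correct.encode("ascii", errors="replace").decode("ascii")

# Stripping: split "ā" into "a" + combining macron, then drop the macron.
decomposed = unicodedata.normalize("NFKD", correct)
stripped = decomposed.encode("ascii", errors="ignore").decode("ascii")

print(repr(misread))   # 'MÄ\x81ori'
print(placeholder)     # M?ori
print(stripped)        # Maori
```

Note that the stripped result looks like perfectly normal text, which is why it is so easy to miss.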

Stripping vs corruption - two failure modes

Corruption – Characters replaced with placeholders, symbols, or question marks. Happens when:

Text is stored in UTF-8 but read as a different encoding (or vice versa)
The declared encoding doesn’t match the actual data

Stripping – Diacritics are silently removed, leaving only base letters. Happens when:

The system or script “normalises” text to plain ASCII (e.g. ā → a)
The database column can’t store extended characters
Export/import processes drop “unsupported” characters

Stripping can be the more common problem because it’s:

Sometimes deliberate as part of system configuration (someone believed removing diacritics improves compatibility)
More “silent” – it doesn’t produce obvious broken symbols, so it’s easy to miss
Sometimes masked by downstream manual corrections

Why stripping happens – common causes

1. Deliberate ASCII normalisation

Applied in code to “simplify” text for search or older systems.
If the simplified version replaces the original in storage, the correct form is lost forever.

2. Database limitations

Database or column uses ASCII or Latin1 encoding.
Unsupported characters are dropped or converted to the closest ASCII letter.
Can happen at any database layer: server default, database default, table default, or individual column definition.
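
To see why a Latin1 column can appear to "work" for some diacritics and still lose tohutō, the sketch below checks which characters an encoding can represent. It uses Python's own "latin-1" and "ascii" codecs as stand-ins; the function name is hypothetical, and what a real database does with an unsupported character (error, drop, or substitute) depends on the product and its settings.

```python
import unicodedata

def unsupported_characters(text: str, encoding: str) -> list[str]:
    """Return the characters in `text` that `encoding` cannot represent."""
    # Use the precomposed form so "ā" is checked as a single character.
    text = unicodedata.normalize("NFC", text)
    missing = []
    for ch in text:
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            missing.append(ch)
    return missing

for word in ["Māori", "García", "Brontë", "façade"]:
    print(word,
          "| latin-1 cannot hold:", unsupported_characters(word, "latin-1"),
          "| ascii cannot hold:", unsupported_characters(word, "ascii"))
```

In this sketch García, Brontë, and façade all fit in Latin-1, but the ā in Māori does not. A test that only uses European accents can therefore pass while macrons are still being lost.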

3. Export/import defaults

Excel saving CSV in ANSI format by default (on Windows)
APIs sending responses in UTF-8 but the client interpreting them as ASCII (or vice versa)
Scripts writing files without specifying encoding, defaulting to the system encoding (often a legacy code page such as Windows-1252, or ASCII on older systems)
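
For scripts, the fix is usually to name the encoding every time a file is written or read rather than relying on the machine's default. A minimal Python sketch, assuming a made-up sample title and a temporary file:

```python
import locale
import tempfile
from pathlib import Path

# The encoding this machine would use if none were specified.
print("Platform default:", locale.getpreferredencoding(False))

title = "Te Reo Māori"  # sample text for the example only

with tempfile.TemporaryDirectory() as folder:
    path = Path(folder) / "titles.txt"
    # Name the encoding explicitly on both the write and the read.
    path.write_text(title, encoding="utf-8")
    round_trip = path.read_text(encoding="utf-8")

print(round_trip == title)  # True
```

If the platform default printed on the first line is not UTF-8, any script on that machine that omits the encoding is a candidate failure point.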

4. Middleware / transformation tools

ETL processes, ONIX converters, or distributor feed scripts that “clean” or “sanitise” text before output
These often strip diacritics alongside other “non-standard” characters

5. Legacy code or libraries

Older programming libraries assume ASCII or a limited character set unless explicitly told to use UTF-8
Can silently strip characters if they can’t be represented

6. Manual intervention

Staff removing diacritics “for search” or “compatibility”
People unable to type diacritics on their keyboards, replacing them with plain letters
Downstream teams correcting for the web but not updating the source system - hiding the real problem

Step-by-step audit

1. Map your data journey

List every system that touches your metadata, such as:

Title management system
ONIX export/import tools
Distributor feeds and APIs
POS system
Website or e-commerce platform
Search index (internal and external)
Reporting/business intelligence tools
Middleware or data transformation scripts
Include any third-party suppliers, in-house tools, or manual processes. You may have multiple chains to map.

2. Check data entry practices

Do all staff know how to type tohutō on their devices?
Are diacritics entered consistently at the point of creation?
Is there any documented process to remove diacritics for certain outputs?
Are there steps in the chain where diacritics might be deliberately or automatically removed?
Are diacritics being added manually as “fixes” in downstream systems (e.g. web CMS), hiding upstream problems?

3. Confirm with providers & check storage format

For each system, ask:

Full UTF-8 support - Can the system store, search, import, export, and display UTF-8 characters without alteration?
Current configuration - Is my specific setup using UTF-8 at every stage? (import parsers, API endpoints, database storage, exports, reports)
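Database encoding - Is UTF-8 set at every level the database supports? In MySQL/MariaDB, use utf8mb4 (not the older 3-byte utf8) and check server, database, table, and column; in PostgreSQL, check that the database encoding is UTF8.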
Export settings - Do file exports default to UTF-8? (CSV, XML, ONIX, TXT, JSON)
Known gaps - Are there modules or legacy processes that still use older encodings?

4. Run an end-to-end encoding test

Create sample data with tohutō and other diacritics in titles, author names, and descriptions.
Enter into the first system in your chain.
Export from the last system in the chain.
Compare character-for-character:
- Stripping: Māori → Maori → check for ASCII-only steps, DB encoding, exports, or sanitisation scripts.
- Corruption: Māori → M?ori or � → check for mismatched encoding declarations, non-UTF-8 readers, or API response misinterpretation.
Work backwards until you find the first failure point.
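
The character-for-character comparison can be automated. The sketch below is one possible approach, with hypothetical function names: it labels each round-trip result as intact, stripped, or corrupted. It normalises both strings first, because "ā" can legitimately arrive either as one character or as "a" plus a combining macron.

```python
import unicodedata

def fold(text: str) -> str:
    """Remove combining marks, e.g. 'Māori' -> 'Maori' (for comparison only)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def classify(original: str, received: str) -> str:
    """Label a round-trip result as 'intact', 'stripped', or 'corrupted'."""
    same = unicodedata.normalize("NFC", original) == unicodedata.normalize("NFC", received)
    if same:
        return "intact"
    # Base letters match but the marks differ: most likely stripping.
    if fold(original) == fold(received):
        return "stripped"
    # Anything else (placeholders, wrong symbols) is treated as corruption here.
    return "corrupted"

for received in ["Māori", "Maori", "M?ori", "M\ufffdori"]:
    print(repr(received), "->", classify("Māori", received))
```

Running this against the export from each system in turn, rather than only the last one, shows where the first failure occurs.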

5. File exports and Windows defaults

When exporting to CSV/XML/TXT, explicitly choose UTF-8 in the “Save As” or export settings.
Windows default setting - In Windows 10/11, you can enable “Beta: Use Unicode UTF-8 for worldwide language support” (Control Panel → Region → Administrative → Change system locale)
- ⚠ Test carefully - older programs may break if they assume the old encoding.
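
Where CSV files are produced by a script rather than through “Save As”, the encoding can be set in code. The sketch below uses Python's "utf-8-sig" codec, which writes a byte order mark (BOM) at the start of the file; Excel generally uses that marker to recognise a CSV as UTF-8 when the file is opened directly. The rows and file name are made up for the example, and some other consumers may prefer plain "utf-8" without the marker, so check what your recipients expect.

```python
import csv
import tempfile
from pathlib import Path

rows = [["title", "author"], ["Tohutō sample", "García"]]  # made-up sample data

with tempfile.TemporaryDirectory() as folder:
    path = Path(folder) / "export.csv"

    # Write UTF-8 with a leading BOM.
    with path.open("w", encoding="utf-8-sig", newline="") as handle:
        csv.writer(handle).writerows(rows)

    raw = path.read_bytes()

    # Read it back, letting the codec remove the BOM again.
    with path.open("r", encoding="utf-8-sig", newline="") as handle:
        read_back = list(csv.reader(handle))

print(raw[:3])            # b'\xef\xbb\xbf'
print(read_back == rows)  # True
```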

Outcomes

Downstream resilience

In theory: You can use any encoding, as long as you declare it correctly and the receiving system respects that declaration.
In practice: Many downstream systems don’t check and simply assume UTF-8.
Best practice: If everyone in the chain uses UTF-8 by default, you avoid:
- Assumption mismatches
- Incorrect interpretation of correctly declared data
- Subtle, hard-to-trace breakage
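
A receiving system can also check incoming data instead of assuming. The sketch below, with a hypothetical function name, tests whether a block of bytes is valid UTF-8 before it is accepted.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if `data` decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")  # strict by default: no guessing, no substitution
    except UnicodeDecodeError:
        return False
    return True

utf8_feed = "Māori".encode("utf-8")
latin1_feed = "García".encode("latin-1")  # a feed sent in the wrong encoding
stripped_feed = b"Maori"                  # diacritic already removed upstream

print(is_valid_utf8(utf8_feed))      # True
print(is_valid_utf8(latin1_feed))    # False
print(is_valid_utf8(stripped_feed))  # True
```

The last line is the important caveat: plain ASCII is also valid UTF-8, so a check like this catches mismatched encodings but cannot detect text that was stripped before it arrived.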

Search normalisation

Search normalisation means the search treats Māori and Maori as equivalent queries, while still storing and displaying the correct diacritic form.

In each system:

Search with diacritics and without.
Confirm results match in both cases.
Confirm stored/displayed records retain diacritics.

If diacritics have been removed in storage, normalisation isn’t the fix – it’s a symptom.

Quick diagnosis

You see “Maori” instead of “Māori” → stripping → look for ASCII-only steps, database limits, sanitisation scripts, or manual edits.
You see “M?ori” or “M�ori” → corruption → look for mismatched encodings, missing UTF-8 declarations, or default readers not set to UTF-8.
Final tip: UTF-8 is like a common language. For your message to arrive intact – from first keystroke to final customer search – everyone in the supply chain needs to speak it, store it, and pass it on without “simplifying” it.
