Purpose
This checklist is for publishers, distributors, and retailers across Australia and New Zealand to audit whether their systems can correctly handle UTF-8 text encoding - the standard that allows all diacritics, including macrons (tohutō), to be stored, transmitted, and searched without loss.
We’ve used tohutō here as this issue is so critical for Aotearoa, and Māori words are a visible example of where accuracy is essential. The same principles apply to all diacritics - whether it’s Māori, García, Brontë, or façade.
Why this matters
Tohutō (macrons) and other diacritics are part of the correct spelling of many names, places, and words. They change meaning and pronunciation and affect search results, and using them correctly demonstrates respect for language and culture.
If they’re lost at any point in your supply chain, you risk:
- Loss of trust from authors, readers, and communities
- Incorrect author, title, or place names in your catalogue and marketing
- Customers not finding the book they’re searching for
- Search engines failing to match your pages with relevant queries, or ranking your website lower because of frequent misspellings
What is UTF-8?
UTF-8 is the most widely used text encoding standard on the internet. It can represent every character in the Unicode standard, which covers virtually all written languages – including tohutō – in a consistent way.
Think of encoding as the “alphabet” a computer uses internally. The same letter in one encoding may be represented by different numbers in another. If two systems disagree on the encoding, or if one removes characters it can’t represent, your text changes.
Example:
- Correct form: Māori
- After corruption: M?ori or M�ori (system guessed wrong about the encoding)
- After stripping: Maori (diacritic removed completely, often without warning)
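The three forms above can be reproduced in a few lines of Python. This is a minimal sketch: the choice of Latin-1 as the "wrong guess" and ASCII as the limited target is an illustrative assumption, and real systems may involve other encodings.

```python
import unicodedata

correct = "Māori"

# UTF-8 stores ā as two bytes (0xC4 0x81); the plain letters take one byte each.
utf8_bytes = correct.encode("utf-8")

# Corruption: UTF-8 bytes read back as Latin-1 (a wrong guess about the encoding).
misread = utf8_bytes.decode("latin-1")

# Corruption: text forced into ASCII, with "?" standing in for unsupported characters.
placeholder = correct.encode("ascii", errors="replace").decode("ascii")

# Corruption: Latin-1 bytes read as UTF-8, with the invalid byte replaced by U+FFFD (�).
replacement = "García".encode("latin-1").decode("utf-8", errors="replace")

# Stripping: split ā into "a" + combining macron, then discard everything non-ASCII.
stripped = (
    unicodedata.normalize("NFKD", correct)
    .encode("ascii", errors="ignore")
    .decode("ascii")
)

print(utf8_bytes)       # raw bytes as stored
print(repr(misread))    # garbled characters in place of ā
print(placeholder)      # question mark in place of ā
print(replacement)      # � in place of í
print(stripped)         # macron silently gone
```

Note that the stripped result is the only one that looks like a normal word, which is why it tends to go unnoticed.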
Stripping vs corruption - two failure modes
Corruption – Characters replaced with placeholders, symbols, or question marks. Happens when:
- Text is stored in UTF-8 but read as a different encoding (or vice versa)
- The declared encoding doesn’t match the actual data
Stripping – Diacritics are silently removed, leaving only base letters. Happens when:
- The system or script “normalises” text to plain ASCII (e.g. ā → a)
- The database column can’t store extended characters
- Export/import processes drop “unsupported” characters
Stripping can be the more common problem because it’s:
- Sometimes deliberate as part of system configuration (someone believed removing diacritics improves compatibility)
- More “silent” – it doesn’t produce obvious broken symbols, so it’s easy to miss
- Sometimes masked by downstream manual corrections
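The difference in visibility can be illustrated with a rough scan over stored text. The sketch below is a heuristic, not a complete detector: the function name `looks_corrupted` is hypothetical, and the Latin-1 round trip only catches one common kind of misreading. It flags the usual signs of corruption, but a stripped value such as "Maori" passes untouched, because nothing in the text itself shows that a macron was ever there.

```python
def looks_corrupted(text: str) -> bool:
    """Heuristic check for visible signs of encoding corruption."""
    # U+FFFD is the Unicode replacement character, displayed as �.
    if "\ufffd" in text:
        return True
    # A question mark sandwiched between two letters (M?ori) is suspicious.
    for i in range(1, len(text) - 1):
        if text[i] == "?" and text[i - 1].isalpha() and text[i + 1].isalpha():
            return True
    # If the text can be re-encoded as Latin-1 and then decoded as UTF-8,
    # and the result differs, it was probably UTF-8 misread as Latin-1.
    try:
        repaired = text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return False
    return repaired != text


for value in ["Māori", "M?ori", "M\ufffdori", "MÄ\x81ori", "Maori", "García"]:
    print(repr(value), looks_corrupted(value))
```

Finding stripped values needs a trusted reference to compare against, such as the original title record or the author's preferred spelling.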
Why stripping happens – common causes
1. Deliberate ASCII normalisation
- Applied in code to “simplify” text for search or older systems.
- If the simplified version replaces the original in storage, the correct form is lost forever.
2. Database limitations
- Database or column uses ASCII or Latin1 encoding. Latin1 can store é, ë, and ç but not ā, so macrons can fail even where other diacritics survive.
- Unsupported characters are dropped or converted to the closest ASCII letter.
- Can happen at any database layer: server default, database default, table default, or individual column definition.
3. Export/import defaults
- Excel saving CSV in ANSI format by default (on Windows)
- APIs sending responses in UTF-8 but the client interpreting them as ASCII (or vice versa)
- Scripts writing files without specifying encoding, defaulting to system encoding (often a legacy Windows code page, or ASCII on older systems)
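For scripts, the fix for the last point is usually a single argument. The sketch below uses Python as an example and assumes a hypothetical file name, `titles.txt`. In many Python versions, `open()` without `encoding=` falls back to a locale-dependent default, so the same script can behave differently on different machines.

```python
import os
import tempfile

title = "Te Reo Māori"

# Hypothetical output file in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "titles.txt")

# Explicit encoding: the result no longer depends on the machine's locale.
with open(path, "w", encoding="utf-8") as f:
    f.write(title)

# Read it back with the same explicit encoding.
with open(path, "r", encoding="utf-8") as f:
    round_tripped = f.read()

# What a strict ASCII-only writer would do with the same text.
try:
    title.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

print(round_tripped == title)  # the text survived the round trip
print(ascii_ok)                # whether ASCII could represent the title
```

Stating the encoding on both the write and the read is the design choice that matters here; either side guessing is enough to cause a mismatch.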
4. Middleware / transformation tools
- ETL processes, ONIX converters, or distributor feed scripts that “clean” or “sanitise” text before output
- These tools often strip diacritics alongside other “non-standard” characters
5. Legacy code or libraries
- Older programming libraries assume ASCII or a limited character set unless explicitly told to use UTF-8
- Can silently strip characters if they can’t be represented
6. Manual intervention
- Staff removing diacritics “for search” or “compatibility”
- People unable to type diacritics on their keyboards, replacing them with plain letters
- Downstream teams correcting for the web but not updating the source system - hiding the real
Step-by-step audit
1. Map your data journey
List every system that touches your metadata, such as:
- Title management system
- ONIX export/import tools
- Distributor feeds and APIs
- POS system
- Website or e-commerce platform
- Search index (internal and external)
- Reporting/business intelligence tools
- Middleware or data transformation scripts
Include any third-party suppliers, in-house tools, or manual processes. You may have multiple chains to map.
2. Check data entry practices
- Do all staff know how to type tohutō on their devices?
- Are diacritics entered consistently at the point of creation?
- Is there any documented process to remove diacritics for certain outputs?
- Are there steps in the chain where diacritics might be deliberately or automatically removed?
- Are there manual “fixes” in downstream systems (e.g. web CMS) where diacritics are added that hide upstream problems?
3. Confirm with providers & check storage format
For each system, ask:
- Full UTF-8 support - Can the system store, search, import, export, and display UTF-8 characters without alteration?
- Current configuration - Is my specific setup using UTF-8 at every stage? (import parsers, API endpoints, database storage, exports, reports)
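- Database encoding - Is UTF-8 set at every level the database supports: server, database, table, column? (In MySQL, use utf8mb4 rather than the older three-byte utf8; in PostgreSQL, the encoding is UTF8 and is set per database.)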
- Export settings - Do file exports default to UTF-8? (CSV, XML, ONIX, TXT, JSON)
- Known gaps - Are there modules or legacy processes that still use older encodings?
4. Run an end-to-end encoding test
- Create sample data with tohutō and other diacritics in titles, author names, and descriptions.
- Enter into the first system in your chain.
- Export from the last system in the chain.
- Compare character-for-character:
- If you see stripping (Māori → Maori), check for ASCII-only steps, DB encoding, exports, or sanitisation scripts.
- If you see corruption (Māori → M?ori or M�ori), check for mismatched encoding declarations, non-UTF-8 readers, or API response misinterpretation.
- Work backwards until you find the first failure point.
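The comparison step can be automated. The following Python sketch assumes you hold the text you entered and the text you got back as two strings; the function names `fold` and `diagnose` are hypothetical. Both sides are normalised to NFC first, so that a precomposed ā and an "a" followed by a combining macron count as the same character rather than a false failure.

```python
import unicodedata


def fold(text: str) -> str:
    """Return text with diacritics removed, for comparison only."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def diagnose(expected: str, actual: str) -> str:
    """Classify the difference between what went in and what came out."""
    exp = unicodedata.normalize("NFC", expected)
    act = unicodedata.normalize("NFC", actual)
    if exp == act:
        return "intact"
    # The output equals the input with its diacritics removed.
    if act == fold(exp):
        return "stripping"
    return "corruption or other change"


print(diagnose("Māori", "Māori"))
print(diagnose("Māori", "Maori"))
print(diagnose("Māori", "M?ori"))
```

Running the same check on the output of each system in the chain, rather than only the last one, points directly at the first failure.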
5. File exports and Windows defaults
- When exporting to CSV/XML/TXT, explicitly choose UTF-8 in the “Save As” or export settings.
- Windows default setting - In Windows 10/11, you can enable the “Beta: Use Unicode UTF-8 for worldwide language support” option (Control Panel → Region → Administrative → Change system locale)
- ⚠ Test carefully - older programs may break if they assume the old encoding.
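Where exports are produced by a script rather than a “Save As” dialog, the encoding can be set in code. A minimal Python sketch follows, assuming a hypothetical file name `export.csv` and sample values taken from this checklist. It writes UTF-8 with a byte order mark (Python's `utf-8-sig` codec), which Excel generally treats as a signal that a CSV file is UTF-8. Some downstream parsers do not expect a byte order mark, so check with the recipient before adopting it for feeds.

```python
import csv
import os
import tempfile

# Sample rows using the words from this checklist.
rows = [["sample"], ["Māori"], ["García"], ["Brontë"], ["façade"]]

# Hypothetical export file in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "export.csv")

# "utf-8-sig" writes UTF-8 with a byte order mark at the start of the file.
# newline="" is what the csv module expects when writing.
with open(path, "w", encoding="utf-8-sig", newline="") as f:
    csv.writer(f).writerows(rows)

# Inspect the raw bytes to confirm the byte order mark is present.
with open(path, "rb") as f:
    raw = f.read()

# Reading with the same codec removes the byte order mark again.
with open(path, "r", encoding="utf-8-sig", newline="") as f:
    read_back = list(csv.reader(f))

print(raw[:3])            # the three BOM bytes
print(read_back == rows)  # whether every value survived
```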
Outcomes
Downstream resilience
- In theory: You can use any encoding, as long as you declare it correctly and the receiving system respects that declaration.
- In practice: Many downstream systems don’t check and simply assume UTF-8.
- Best practice: If everyone in the chain uses UTF-8 by default, you avoid:
- Assumption mismatches
- Incorrect interpretation of correctly declared data
- Subtle, hard-to-trace breakage
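On the receiving side, the same idea can be sketched as "respect the declaration, default to UTF-8, and fail loudly". The Python example below assumes the sender declares its encoding in a Content-Type header value; the function name `decode_payload` is hypothetical. It uses strict decoding so that a mismatch raises an error instead of quietly producing placeholder characters.

```python
from email.message import Message


def decode_payload(payload: bytes, content_type: str) -> str:
    """Decode bytes using the charset declared in a Content-Type value."""
    msg = Message()
    msg["Content-Type"] = content_type
    # Use the declared charset, or fall back to UTF-8 if none is declared.
    charset = msg.get_content_charset(failobj="utf-8")
    # Strict decoding raises an error rather than substituting characters.
    return payload.decode(charset, errors="strict")


# Declared correctly: both arrive intact.
print(decode_payload("Māori".encode("utf-8"), "text/plain; charset=utf-8"))
print(decode_payload("García".encode("latin-1"), "text/plain; charset=ISO-8859-1"))

# Declared incorrectly: the mismatch is reported instead of hidden.
try:
    decode_payload("García".encode("latin-1"), "text/plain; charset=utf-8")
except UnicodeDecodeError as exc:
    print("Declared encoding does not match the data:", exc.reason)
```

An error at this point is inconvenient, but it surfaces the problem at the boundary where it occurred rather than several systems later.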
Search normalisation
Search normalisation means the search treats Māori and Maori as equivalent queries, while still storing and displaying the correct diacritic form.
In each system:
- Search with diacritics and without.
- Confirm results match in both cases.
- Confirm stored/displayed records retain diacritics.
If diacritics have been removed in storage, normalisation isn’t the fix – it’s a symptom.
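One way to get both behaviours is to keep the original text as the stored and displayed value and derive a separate, folded key used only for matching. The Python sketch below assumes a tiny in-memory catalogue; `search_key` and `search` are hypothetical names, and a real search index would do the equivalent through its own analyser settings.

```python
import unicodedata


def search_key(text: str) -> str:
    """Derive a folded key for matching. Never store this in place of the original."""
    decomposed = unicodedata.normalize("NFKD", text)
    without_marks = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return without_marks.casefold()


# The catalogue keeps the correct forms.
catalogue = ["Māori", "García", "Brontë", "façade"]

# The index pairs each folded key with its untouched original.
index = [(search_key(entry), entry) for entry in catalogue]


def search(query: str) -> list[str]:
    """Match on folded keys, return the original forms."""
    key = search_key(query)
    return [original for folded, original in index if key in folded]


print(search("maori"))
print(search("Māori"))
print(search("Bronte"))
```

Both spellings of the query find the record, and the result that comes back still carries its diacritics.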
Quick diagnosis
- You see “Maori” instead of “Māori” → stripping → look for ASCII-only steps, database limits, sanitisation scripts, or manual edits.
- You see “M?ori” or “Ma�ri” → corruption → look for mismatched encodings, missing UTF-8 declarations, or default readers not set to UTF-8.
Final tip: UTF-8 is like a common language. For your message to arrive intact – from first keystroke to final customer search – everyone in the supply chain needs to speak it, store it, and pass it on without “simplifying” it.