Content design skill and quality evaluation system — Goodnotes
Live — released May 2026
Prompt engineering, reference architecture, and structured output design
Portfolio note: More details available to share in interviews
Content design doesn't scale 1:1. I built a Claude skill that embeds documented content decisions into a system any designer or PM can run — so the team gets grounded copy reviews without waiting for me. It connects to Figma via MCP, reads screens directly, and returns structured output against documented standards. This is the 0-to-1 initiative: nothing like it existed before I built it.
Content quality at a 25M+ MAU app means hundreds of strings across an AI product surface — error states, onboarding flows, empty states, AI-generated suggestions. Reviewing all of it manually wasn't scalable. Neither was ad-hoc A/B testing where experiment copy was written without a shared rubric.
What I built
A Claude skill with two modes, shipped to Goodnotes' internal AI skills marketplace
Evaluation mode
- Takes a string or set of strings as input
- Scores copy against a defined set of content principles (clarity, tone, length, consistency with terminology standards)
- Returns structured feedback with a score, specific failure flags, and suggested rewrites
- Used by PMs and designers to self-review, reducing back-and-forth
Generative mode for experiments
- Takes a brief (surface, context, user goal, constraints) as input
- Produces multiple A/B experiment copy variants with rationale
- Output is formatted for direct use in experiment tooling
- Reduces time from brief to testable variants from days to minutes
Design decisions worth noting
- Modes are separate by design: mixing evaluation and generation in one prompt produces worse output in both directions
- The evaluation rubric is itself a content design artifact: it had to be written precisely enough that the model applies it consistently, but not so rigidly that it penalises good judgment calls
- Prompt architecture uses structured XML input/output to make results parseable and auditable
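As a rough illustration of why structured XML output is parseable and auditable, here is a minimal Python sketch. The tag and attribute names (`review`, `string`, `flag`, `rewrite`) and the sample string are hypothetical, not the skill's actual schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical review output; the real schema is not documented here.
SAMPLE_OUTPUT = """
<review>
  <string id="error.sync_failed" score="2">
    <flag principle="clarity">Does not say what the user can do next</flag>
    <flag principle="terminology">Uses a deprecated term</flag>
    <rewrite>Couldn't sync. Check your connection and try again.</rewrite>
  </string>
</review>
"""


def parse_review(xml_text):
    """Turn the XML review into plain dicts that can be logged or diffed."""
    root = ET.fromstring(xml_text.strip())
    results = []
    for node in root.findall("string"):
        results.append({
            "id": node.get("id"),
            "score": int(node.get("score")),
            "flags": [(f.get("principle"), f.text) for f in node.findall("flag")],
            "rewrite": node.findtext("rewrite"),
        })
    return results


reviews = parse_review(SAMPLE_OUTPUT)
```

Because each flag names the principle it failed, a reviewer can trace a score back to a documented standard rather than to the model's free-text opinion.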
How I thought about it
Taxonomies for agentic ingestion
Seven living markdown reference files (terminology, tone, patterns, decision log, IA principles, tropes, changelog), structured so the skill parses and cites them, not just so humans read them.
Content engineered
Built the skill in Claude Code, designed the prompt architecture, structured the reference files for LLM retrieval, and iterated on the output format based on what the model was getting wrong. Deployed to GitHub.
Measure AI output quality
Tracking output quality through a public log: helpfulness rating per review, issues flagged, iteration history. Currently: 1 of 1 reviews found output helpful. Skill issues flagged: 0.
Went 0-to-1 in ambiguous space
No brief, no precedent. I scoped it, built it, stress-tested it, and shipped it, then documented what needed changing after real use.
What I learned building it
- Plan the system first, but be open to iterating based on project decisions. For example, the information density rule — max 1 thing per heading, max 2 per description — came from a live project.
- Be open to updating relational connections to other workflows. For example, when
- Replaced "always read everything" with a reference-mapping table — the skill loads only what's relevant. Reduced model hallucination and context bloat significantly. This is prompt engineering in practice.
- The terminology file was generating false positives: the B2C deprecation of "member" didn't apply to B2B workspace members. The bug was in the source doc's precision, not the model.
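The reference-mapping idea above can be sketched in a few lines of Python. This is an illustration only: the task names and file names are assumptions loosely based on the seven reference topics, not the skill's actual table.

```python
# Hypothetical mapping from review task to the reference files it needs.
# Task names and file names are illustrative assumptions.
REFERENCE_MAP = {
    "error_state": ["terminology.md", "tone.md", "patterns.md"],
    "onboarding": ["terminology.md", "tone.md", "ia-principles.md"],
    "feature_naming": ["terminology.md", "decision-log.md"],
}
DEFAULT_REFERENCES = ["terminology.md", "tone.md"]


def references_for(task):
    """Return only the files relevant to a task instead of loading everything."""
    return REFERENCE_MAP.get(task, DEFAULT_REFERENCES)


files_to_load = references_for("feature_naming")
```

Keeping the mapping as data means a new workflow only needs a new row, and unrelated files never enter the model's context.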
Terminology: a glossary structured for human and agentic use + evaluation loop — Goodnotes
Live — owned and maintained
The foundation for content governance, with an evaluation loop that measures and tracks consistency of language across live sources
Portfolio note: More details available to share in interviews
A terminology document that only humans can read, and that relies on manual updates, is a bottleneck at scale. I built this to serve two audiences simultaneously: the team who writes product copy, and the AI skills that enforce it. Every entry has an internal/external mapping, an explicit deprecation flag, and usage notes specific enough for an LLM to act on without hallucinating context.
Glossary: What makes it infrastructure, not documentation
- Used across several surfaces: it is cited by the AI content design skill, referenced by humans, used as the baseline for a biweekly check against sources of truth, and pulled into the product feature naming brief.
- Internal engineering codes explicitly separated from external names: every entry has both, with a "never use in product copy" flag or a deprecation flag.
- Context triggers, plus a deprecated-terms table with explicit replacements, not just a list of what to avoid.
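To make the entry structure concrete, here is a minimal Python sketch of a glossary entry whose deprecation applies only in certain contexts. The field names, the internal code, and the replacement text are placeholders invented for illustration; only the "member" B2C/B2B distinction comes from the real glossary.

```python
from dataclasses import dataclass
import re


@dataclass(frozen=True)
class TermEntry:
    external_name: str                      # name as it appears in product copy
    internal_code: str                      # engineering code, never shown to users
    deprecated_in: frozenset = frozenset()  # contexts where the term is deprecated
    replacement: str = ""                   # explicit replacement, if documented


# Placeholder entry: the internal code and replacement are invented.
ENTRIES = [
    TermEntry("member", "ENG_MEMBER", frozenset({"b2c"}), "<documented replacement>"),
]


def flag_deprecated(copy_text, context, entries=ENTRIES):
    """Flag a term only in the contexts where its deprecation applies."""
    flags = []
    for entry in entries:
        pattern = rf"\b{re.escape(entry.external_name)}s?\b"
        used = re.search(pattern, copy_text, re.IGNORECASE)
        if used and context in entry.deprecated_in:
            flags.append((entry.external_name, entry.replacement))
    return flags


b2c_flags = flag_deprecated("Invite a member", "b2c")
b2b_flags = flag_deprecated("Workspace members", "b2b")
```

Scoping the deprecation to a context is what keeps a B2C rule from producing false positives on B2B copy.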
How the quality evaluation loop works
Every two weeks, an AI-assisted audit cross-references the Terminology doc against org-wide sources of truth, surfaces inconsistencies by severity, and tracks which blockers are still unresolved. This provides a consistent view on blockers over time, rather than just a one-time snapshot.
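A minimal Python sketch of the tracking part of this loop, assuming each audit produces a set of (term, source, severity) findings. The terms, sources, and severity labels below are invented for illustration; the real audit is AI-assisted and its data model is not shown here.

```python
from collections import defaultdict

# Hypothetical findings from two consecutive biweekly audits.
previous_audit = {
    ("Term A", "help-centre", "blocker"),
    ("Term B", "release-notes", "minor"),
}
current_audit = {
    ("Term A", "help-centre", "blocker"),
    ("Term C", "app-store", "major"),
}


def by_severity(findings):
    """Group findings so the most severe inconsistencies can be read first."""
    grouped = defaultdict(list)
    for term, source, severity in sorted(findings):
        grouped[severity].append((term, source))
    return dict(grouped)


def unresolved_blockers(previous, current):
    """Blockers present in both audits, i.e. still open after two weeks."""
    return sorted(f for f in previous & current if f[2] == "blocker")


report = by_severity(current_audit)
still_open = unresolved_blockers(previous_audit, current_audit)
```

Comparing consecutive runs is what turns a one-time snapshot into a view of which blockers persist over time.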
Foundational AI agent exploration work — Klook
Foundational content design and UX research work. Sep 2024 - Jun 2025
Portfolio note: Working on conversational AI wasn’t just a one-off project, but iterations on content and conversational design over various features. I’ve documented some of that process here, with more to share in interviews.
Building the foundations
While working on Klook’s first B2C AI features, I conducted desktop research, user interviews and usability tests for each one, to sharpen and clarify these UX questions:
- What are users’ mental models at each step of a trip planning and booking process?
- How do users perceive AI at each step of the booking process for different products, e.g. reading hotel reviews vs. comparing tours?
- What tone and persona should Klook’s conversational AI adopt? How might it cohere with Klook’s existing brand voice?
- What user problems might actually be solved by AI?
Early explorations: AI reviews summary feature
As the lead UX writer on Klook’s first consumer AI feature, I conducted user interviews with an internal panel to find out how users perceived an AI reviews summary for hotels, and mental models surrounding AI more generally. This was intended to shape design direction and product decisions.
I designed the prototype with content, and had help from the designer to refine the interactions. Click into the first image to watch the prototype — simple, but effective for usability testing.
Insights: I found that while users consider a reviews summary for hotels extremely helpful, its usefulness depends on where they are in the booking process.
- In the first phase of travellers’ booking process, most are in what I called an “objective phase”. They focus on facts like location, beds, amenities and rooms to narrow down hotels that fit their needs. In this phase, travellers were more likely to value the help that AI could provide in helping them “sift through” and “narrow down” their choices.
- However, to come to a final decision, users go through a 2nd phase, which I called the “subjective phase”. At this time, users would be relying on reading “authentic” reviews from other travellers. This prioritization of authenticity led them to be skeptical of the completeness of AI review summaries.
Challenge: Making a case for product naming — pushing back against product managers and designers, who wanted to emphasize that the feature used “AI”, as they perceived that this would impress users.
Applying UX insights to content design
I found from research that when users are in the “subjective phase”, an AI summary becomes less helpful and in fact obstructive. I shared this with product and design, which made for a clearer content rationale than any of our gut feelings.
Based on the insights above, and because the reviews summary would sit within the reviews section, squarely in the “subjective phase” of users’ decision-making, I went with a softer content approach. I focused content concepts and copy on reassurance and on highlighting the benefit of the feature.
Two content design solutions I implemented
Before (version that was used for usability testing) After
Content contributions included
1. Decided on the name “Reviews summary” for the feature, and wrote a subtitle that highlights the authenticity of “real reviews”
Rationale: During usability testing, users were a little surprised to see an “AI summary” module. They expressed a lack of trust and a feeling that the summary might only consist of positive reviews, which “based on real guest reviews” didn’t help assuage.
Hypothesis: Users are likely to feel safer and more reassured by the module if it speaks to how they can benefit from it and their concerns about authenticity, rather than trying to flex Klook’s use of AI.
Pushback from designers and product managers: They believed that highlighting “AI” would impress users and make them more likely to use the product. While user testing shifted their mindsets, they still felt that it would be ideal to launch with a mention of “AI” somewhere.
Solution:
Move the mention of “AI” into the subtitle, and lead instead with a product name that highlights the benefits to users.
Similarly, in the subtitle, instead of making the bold, unsubstantiated claim of “real guest reviews”, I used “real reviews”, to speak directly to the concerns about authenticity that users faced in the subjective phase.
I supported this solution with a competitive analysis of Trip.com, Expedia and Tripadvisor, which made it more compelling.
2. Simplified the information on the screen
Rationale: During usability testing, users were confused about what the “1/5” and “10 reviews” in the “Overall impression” module referred to, which clouded their understanding.
Hypothesis: There were too many numbers on the screen, each carrying a different meaning: the rating (e.g. 4.5/5), the number of reviews (e.g. 274 reviews), “1/5” and “10 reviews”. The first two were fine, as they fit users’ mental models of hotel reviews. The latter two caused confusion because users weren’t used to seeing them, and without significant affordance they were hard to grasp. Moreover, this information is not significant enough to users’ decision-making to motivate them to dig into what those numbers mean.
Solution:
Simplify the information on the screen and remove extraneous numbers so they don’t obscure the key information on the page. I presented the findings, my hypothesis and the content solution to product and design. They were convinced by the findings, and updated the design and product logic.
Before After
Content contributions included
1. Wrote a helpful and concise blurb to reassure users about how the reviews summary is generated.
Rationale: During usability testing, users expressed a lack of trust and a feeling that the summary might only consist of positive reviews, which “based on real guest reviews” didn’t help assuage.
Hypothesis: Users would be more reassured with some facts about how AI is used, and what goes into the reviews summary.
Solution:
Simplify the blurb, focus on delivering the key information that users need to know.
Don’t overly humanize this section (e.g. “hi, i am”) — save the conversational elements for a conversational interface. This blurb should focus on delivering facts, rather than getting friendly with the user. Product and design were willing to try this, and agreed that conversational elements be saved for a conversational interface (more to come in the next projects).
Shaping agent behavior and persona — Klook
Asking questions like: How should an agent speak? What do we think about anthropomorphic responses? What user problem does the agent solve? | Nov 2024 - Jun 2025
Iteration 1: Tours and activities comparison tool
Sep-Nov 2024
What is it? The MVP of this feature would appear on the search results page for day tours. Users would be able to select 2 tours to compare. This feature didn’t have a text field, and users would have to use the prompts provided.
Content design challenges: Explaining to users what this feature was and what it could do for them. This was difficult to fit into the entrypoint, which was a small circle with space for maybe 2 letters. Both product and design were unwilling to budge on the size of the entrypoint. While I tried to explain that the limited UI space would negatively impact users’ understanding of the entrypoint, product and design were insistent. My manager, a product director, was also insistent that this entrypoint would work just fine.
So, in the first iteration of the content, I did my best with the space available, and figured this was something I could prove with user testing later on. To come up with “VS”, I tested this content concept on non-product and localisation colleagues.
User testing in Australia: We conducted usability and feasibility tests, and found that while the feature was useful and fairly easy to grasp, it lacked a proper introduction. Even within the context of the test, users found it frustrating that it took them several screens to grasp what the feature was and what it could do for them.
In summary, we found that the lack of affordances and guidance to users made it difficult for them to grasp and use the feature effectively. I emphasized
Outcomes and content design takeaways:
- New features needed more introduction and guidance - a vague, oblique entrypoint with low affordances wasn’t suitable to introduce a new feature.
- A friendly, approachable persona was appropriate and users welcomed it. They liked the use of emojis and the friendly, conversational language, which helped reinforce that this was a feature meant to help them take the load off comparing tours and narrowing down their choices.
This feature was deprecated shortly after for further iterations, so I didn’t follow up on the feature after. Content design lessons were applied to Iteration 2: “Travel buddy”, which I worked on subsequently.
User flow for the MVP of the feature
Iteration 2: “Travel buddy”
Apr-May 2025
What is it? In this iteration of the AI feature, I worked with product and design on the entrypoints to the MVP of an AI travel assistant feature, which provides recommendations based on users’ booked trips.
Applying content design lessons from earlier AI features:
1. Entrypoint needs clearer affordances, taking into consideration where users might encounter it. For instance, on a page showing users’ trips, I worked with product and design to add a pop-up to the entrypoint, explaining what the AI could do for users.
2. Keeping affordances high to guide users through their early interactions with the feature. Drawing on lessons from the AI comparison project, I focused on building a conversational, friendly and reassuring persona, e.g. in its introductory preamble and in its thinking process.
3. Designing an effective conversational counterpart: Apart from using “my” and “your” to lend a conversational tone, my content decisions were driven by building the “travel buddy” as a helpful counterpart that can carry a conversation. This was a key difference from the earlier AI comparison feature, which didn’t have a conversational element (users couldn’t type and participate in the conversation).
4. Pushing back against feature names that didn’t align with the style. Several members of senior management wanted to call the feature either a generic, lacklustre “AI assistant”, or something too edgy like “Travel homie”. I worked together with the product manager to lay out why “Travel buddy” would be the best choice, based on recognisability in our markets (“homie” leaned more American, for instance), tone and voice, style, and scalability. This avoided jury by seniority, and highlighted that content decisions were not made arbitrarily.
Outcomes and lessons
Designing for AI is an iterative and ongoing process. It needs more affordances and guidance from content design, along with active content testing and usability testing to refine design (including content) decisions.
Multi-merchant information flows — Klook
Making train and bus policies make sense | Aug-Sep 2024
Role and contributions
Role: Led content information architecture and flows for Klook’s trains and buses, which are complex multi-merchant verticals.
Cross-functional collaborators: Product, Tech, Business planning teams
Contributions:
- Creating scalable content structures for different train / bus products (e.g. Shinkansen, Eurostar, Taiwan High Speed Rail).
- Building and ensuring consistency in meaning across the booking experience.
- Documenting changes for future use. We continued to use the shared doc for cross-functional collaboration in future iterations.
- Fixing legacy issues and content debt.
Content design process
I worked closely with design, product and commercial stakeholders to improve the content structure so we would be able to use the same template across all products.
Instead of just rewriting / copyediting the titles, I took a big-picture view of screens, information flow, and the specific requirements of each product.
On a shared copy doc, I consolidated the commonalities and differences for each product, in order to determine how a common content structure or template could best support this.
I then implemented these commonalities in the content structure of the page.
At the same time, through user research conducted in Australia, UK, EU and US markets, we found that what users pay attention to when deciding which ticket to buy depends less on the details in the ticket guide than on the words that highlight the differences between ticket types.
Scaling it to other pages
This was the first time that product, design and the business planning teams had worked with a content designer on information architecture and content design. After it worked well, we adopted the same approach for other screens, namely luggage and seat/cabin guides:
Outcomes
Since launch of the page structure in ~Sep 2024, more information and products have been added to the pages. The template has held and proven to be a useful, constant content structure for all mobility products.