I Built a Flashcard App That Verifies Its Own AI

8 June 2026

Because a wrong flashcard doesn't just fail. It teaches you the wrong thing, for weeks.

Here's the payoff up front. I built a language-flashcard app where every card is written by one AI and then checked by a second, independent AI whose only job is to catch the first one being wrong. It generates cards on demand for any language, level, and topic, runs real spaced repetition, and exports to Anki for your phone. It's free and open source.

The interesting part isn't the generating. AI has been able to write flashcards for years. The interesting part is the trust.

The problem nobody talks about

AI is great at generating language flashcards. Pick a language, a level, a topic, and it'll happily produce fifty sentences with translations and grammar notes in seconds.

AI is also confidently wrong. Especially on the fine stuff: gender agreement, verb forms, cases, the things that are easy to get subtly wrong and sound fine if you don't know better. And the lower-resource the language, the worse it gets.

A wrong flashcard is the worst kind of bug. A normal bug fails loudly. A wrong flashcard succeeds at the wrong thing. It looks perfectly fine, you accept it, and then spaced repetition does exactly what it's designed to do: it drills that error into your long-term memory over weeks. You're not failing to learn. You're actively learning something false, on purpose, with a good algorithm.

So generation is the easy 20%. Trust is the hard 80%. The whole project is one idea: the gap between "generated" and "verified".

How a card actually gets made

Every card runs a pipeline. Six steps.

1. Generate. You pick the target language, your known language, a CEFR level from A1 to C2, one or more topics, and a count. The model writes each card: a natural sentence in the target language at that level, a translation, and a short grammar explanation. The translation and the explanation are in your KNOWN language, not the target. Because a grammar note you can't read yet is useless.

2. Dedup. Before anything costs money, each sentence gets normalized into a key. Lowercase, strip punctuation and whitespace. Keep diacritics, on purpose, so the German "schon" and "schön" stay two different words instead of collapsing into one. Then it dedupes against the whole existing deck and within the new batch. No paying to verify a card you already have.
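Here's a minimal sketch of what that normalization key could look like. It is not the app's actual code: the function names are made up for illustration, and it assumes "strip whitespace" means removing all of it rather than collapsing runs. The one thing it is careful about is the point above: it never strips accents.

```python
import unicodedata


def dedup_key(sentence: str) -> str:
    """Normalize a sentence into a comparison key, keeping diacritics."""
    # NFC so an accented letter has a single representation however it was typed.
    text = unicodedata.normalize("NFC", sentence).lower()
    kept = []
    for ch in text:
        if ch.isspace():
            continue  # drop spaces, tabs, newlines
        if unicodedata.category(ch).startswith(("P", "Z")):
            continue  # drop Unicode punctuation and separator characters
        kept.append(ch)  # letters stay as they are, umlauts and accents included
    return "".join(kept)


def drop_duplicates(candidates, existing_keys):
    """Keep candidates whose key is neither in the deck nor earlier in the batch."""
    seen = set(existing_keys)
    fresh = []
    for sentence in candidates:
        key = dedup_key(sentence)
        if key in seen:
            continue
        seen.add(key)
        fresh.append(sentence)
    return fresh
```

Because both checks share one `seen` set, a sentence that repeats inside the new batch is dropped the same way as one that is already in the deck.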

3. The judge. This is the whole point of the app. A second, independent model reviews every card that survived dedup. It checks three things. Is the target sentence correct, natural, and right for the level? Is the translation accurate? Is the grammar explanation factually true AND does it actually describe the grammar in THIS sentence, not grammar in general? For a fill-in-the-blank card it also checks that the blanked word is the right thing to test.

The verdict comes back as strict JSON. And it fails closed. If a card can't be verified, it doesn't ship. Silence is a rejection, not a pass.
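Failing closed is easy to say and easy to get wrong, so here is a small sketch of the idea. The field names in `REQUIRED_CHECKS` are hypothetical (the post doesn't give the judge's real schema); the shape is what matters. Anything that isn't a well-formed verdict with every check explicitly true counts as a rejection.

```python
import json

# Hypothetical field names, one per check the judge performs.
REQUIRED_CHECKS = ("sentence_ok", "translation_ok", "explanation_ok")


def is_approved(raw_verdict) -> bool:
    """Return True only for a well-formed verdict where every check passed."""
    try:
        verdict = json.loads(raw_verdict)
    except (TypeError, ValueError):
        # None, an empty string, or prose instead of JSON: treat as a rejection.
        return False
    if not isinstance(verdict, dict):
        return False
    # `is True` so a missing key, the string "true", or 1 never counts as a pass.
    return all(verdict.get(name) is True for name in REQUIRED_CHECKS)
```

The default answer is "no". The only path to "yes" is the judge saying so, in the expected format, for every check.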

4. Auto-fix. Here's where it gets efficient. When the judge rejects a card, I don't throw it away. I ask the model to CORRECT it, then send the corrected card back to the judge. Kept only if it now passes. Cheaper than regenerating from scratch, and the yield is higher.

Two real examples from the build:

Spanish: "Yo tengo mucho hambre." Wrong. "Hambre" is feminine, so it's "mucha hambre." Caught, fixed, re-judged, saved.
A fill-in-the-blank card meant to test the preterite was testing the infinitive "comer" instead. Corrected to "comimos." Caught, fixed, re-verified, saved.

Both of those would have looked completely fine to a learner. Both would have taught the wrong thing. Neither shipped.
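The judge-then-fix loop from step 4 can be sketched in a few lines. This assumes a single correction attempt per card and takes the judge and the corrector as plain callables; both are stand-ins for model calls, and the names are illustrative rather than the app's own.

```python
from typing import Callable, Optional

Card = dict  # e.g. {"sentence": ..., "translation": ..., "explanation": ...}


def verify_with_one_fix(
    card: Card,
    judge: Callable[[Card], bool],
    correct: Callable[[Card], Optional[Card]],
) -> Optional[Card]:
    """Return a card the judge approved, or None if nothing passed."""
    if judge(card):
        return card
    fixed = correct(card)  # ask for a corrected version of the rejected card
    # The correction is not trusted on its own: it goes back through the judge.
    if fixed is not None and judge(fixed):
        return fixed
    return None  # fail closed: neither the original nor the fix passed
```

The important line is the second `judge` call. A "corrected" card that skips re-verification is just a second unverified card.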

5. Audio. Every surviving sentence gets voiced with text-to-speech. The filename is a deterministic hash of the sentence itself, so the same text always reuses the same audio file.
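A sketch of the hash-named audio cache, under a few assumptions the post doesn't state: SHA-256 as the hash, a 16-character prefix as the name, `.mp3` as the extension, and a `synthesize` callable standing in for whatever TTS service is used.

```python
import hashlib
import unicodedata
from pathlib import Path


def audio_filename(sentence: str, ext: str = "mp3") -> str:
    """Derive a filename from the sentence text alone."""
    normalized = unicodedata.normalize("NFC", sentence.strip())
    digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
    return f"{digest[:16]}.{ext}"


def ensure_audio(sentence: str, audio_dir: Path, synthesize) -> Path:
    """Return the audio path, calling the TTS stand-in only on a cache miss."""
    path = audio_dir / audio_filename(sentence)
    if not path.exists():
        path.write_bytes(synthesize(sentence))
    return path
```

Since the name depends only on the text, the same sentence in two decks points at one file, and a second request for it never reaches the TTS call.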

6. Top-up. If dedup and rejects left you short of the count you asked for, it regenerates just the shortfall, dedup-aware, for a few rounds. You ask for 20, you get 20, not 14.
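The top-up loop, sketched with everything model-related passed in as callables. The round limit of 3 is an assumption ("a few rounds" is all the post says), and `generate`, `verify`, and `key` are placeholder names for the generation, judge-and-fix, and dedup-key steps.

```python
def fill_to_count(target, generate, verify, key, existing_keys, max_rounds=3):
    """Collect up to `target` verified, non-duplicate cards in bounded rounds."""
    seen = set(existing_keys)
    accepted = []
    for _ in range(max_rounds):
        shortfall = target - len(accepted)
        if shortfall <= 0:
            break
        # Ask only for what is still missing, not the full count again.
        for card in generate(shortfall):
            k = key(card)
            if k in seen:
                continue  # duplicate of the deck or of this batch: skip before verifying
            seen.add(k)
            checked = verify(card)  # returns the approved card, or None
            if checked is not None:
                accepted.append(checked)
    return accepted[:target]
```

With a bounded number of rounds the function can still come back short if the judge keeps rejecting, so the caller should check the length rather than assume it.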

A security detail worth a minute

Generation can be triggered from the browser. That's convenient and it's also exactly the kind of thing that gets people pwned if you're not careful, because you've now got a model writing things in response to input.

So the headless model that writes the card text runs with a RESTRICTED toolset. No shell. No network. It can produce text and that's it.

Everything that spends money or touches the network (the judge, the TTS) runs server-side, in plain deterministic Python. Never by the model. So the one part that could in theory be steered by weird input can only ever write text, and that text gets independently verified anyway before it goes anywhere.

The term for this is capability reduction. The plain version: don't give the generative model the power to do damage. Give it the power to produce text, and then check the text.

The study engine

I didn't want a toy "flip the card" loop. I wanted real spaced repetition.

So I implemented FSRS from scratch. FSRS is the modern scheduler that Anki itself adopted. It models each card with two hidden values, stability (roughly how long the memory lasts) and difficulty (how hard the card is for you), and predicts recall with a power-law forgetting curve. The next review interval is the time it takes for your predicted recall to drop back down to a target retention you set. Higher target means shorter intervals, more reviews, stronger recall. Your call.

Pure Python standard library. No dependencies. About 150 lines. I started on SM-2, the classic SuperMemo algorithm, then swapped to FSRS and migrated the cards I already had.
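To make the interval rule concrete, here is the forgetting curve and its inverse in the form published for FSRS-4.5. Which FSRS version the app implements is an assumption on my part, and this sketch covers only the curve, not the stability and difficulty updates that make up the rest of the scheduler.

```python
DECAY = -0.5
FACTOR = 19 / 81  # chosen so predicted recall is exactly 0.9 when elapsed == stability


def retrievability(elapsed_days: float, stability: float) -> float:
    """Predicted probability of recall after `elapsed_days`."""
    return (1 + FACTOR * elapsed_days / stability) ** DECAY


def next_interval(stability: float, target_retention: float) -> float:
    """Days until predicted recall falls to `target_retention`."""
    # Solve retrievability(t, stability) == target_retention for t.
    return stability / FACTOR * (target_retention ** (1 / DECAY) - 1)
```

At a target of 0.9 the interval equals the stability itself. Raise the target to 0.95 and the interval shrinks to a bit under half of that, which is the "shorter intervals, more reviews" trade in numbers.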

Four card types. Normal (see the sentence, recall the meaning). Cloze, the fill-in-the-blank one (the model wraps the tested word so the pipeline can blank it and the judge can confirm it's the right word to test). Reverse (see the meaning, produce the sentence). Audio-first (hear it, recall it).

Independent per-mode scheduling. This is the part I'm most happy with. Each mode is its own review stream with its OWN FSRS schedule. Recognizing a word when you READ it is a genuinely different skill from being able to SAY it, so forward and reverse get scheduled separately. They're not the same memory.

And a mode's schedule only gets created the first time you actually study it. So you OPT IN to reverse and audio by choosing them. They don't dump hundreds of new cards on you uninvited. That sounds small. It's the difference between a tool you keep using and one that buries you on day three.
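One way to model this is a per-card dictionary of mode states that starts empty. The sketch below is illustrative only: the class and function names are invented, and the stability, difficulty, and interval values are passed in rather than computed, since the FSRS update itself is out of scope here.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta


@dataclass
class ModeState:
    stability: float
    difficulty: float
    due: date


@dataclass
class Card:
    sentence: str
    modes: dict = field(default_factory=dict)  # mode name -> ModeState


def due_items(cards, selected_modes, today):
    """List (card, mode) pairs to study now, only for modes the user picked."""
    items = []
    for card in cards:
        for mode in selected_modes:
            state = card.modes.get(mode)
            # No state yet means the card is new in this mode; listing it creates nothing.
            if state is None or state.due <= today:
                items.append((card, mode))
    return items


def record_review(card, mode, today, interval_days, stability, difficulty):
    """Create or update this mode's schedule; the first review is what creates it."""
    card.modes[mode] = ModeState(stability, difficulty, today + timedelta(days=interval_days))
```

A mode you never select never appears in `selected_modes`, so it never gets a state and never shows up as a backlog.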

You build your own mix. Checkboxes for which modes to study, an "all available" button, and a target-retention knob. Sessions interleave due items across the modes you picked, each one graded against its own schedule. Quit mid-session and resume where you left off, with anything newly due folded in.

The mobile answer: Anki export

The app is a desktop-local web app. But language learning happens on your phone, on the bus, in a queue, in the five minutes before a meeting.

I could have built and hosted a mobile app. Then I'd own the security and the ops and the bills for it. No thanks.

Instead I export to Anki. Anki is the best spaced-repetition app that exists, it's free on desktop and Android (the iOS app is paid), it's on every platform, and it already runs FSRS. Why would I rebuild that?

An .apkg file turns out to be just a zip of a SQLite database plus media files. Both sqlite3 and zipfile are in the Python standard library. So I build a real, valid Anki package by hand, no third-party library. People normally reach for genanki for this. I didn't need it.
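The packaging half of that is short enough to sketch. This assumes the older, widely importable package layout: the database stored as `collection.anki2`, media stored under numeric names, and a JSON file called `media` mapping those numbers back to real filenames. Building the SQLite database with Anki's schema is the larger job and is not shown; the function name and parameters here are illustrative.

```python
import json
import zipfile
from pathlib import Path


def package_apkg(collection_db: Path, media_files: list, out_path: Path) -> None:
    """Zip an already-built collection database and its media into a package."""
    media_map = {}
    with zipfile.ZipFile(out_path, "w", zipfile.ZIP_DEFLATED) as apkg:
        apkg.write(collection_db, arcname="collection.anki2")
        for index, media in enumerate(media_files):
            # Media is stored under "0", "1", ... and mapped back to its filename.
            apkg.write(media, arcname=str(index))
            media_map[str(index)] = media.name
        apkg.writestr("media", json.dumps(media_map))
```

Everything used here is in the standard library, which is the point of the paragraph above.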

Normal cards become a "Basic and reversed" note, so you get the forward and reverse cards. Fill-in-the-blank becomes a proper Cloze note with {{c1::answer}}. The audio rides along as [sound:...]. You can export one deck, a whole language-plus-level, or everything at once. Cards export as new so Anki schedules them fresh, and the note IDs are stable, so re-exporting UPDATES your cards instead of duplicating them.
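Three small helpers show the field formats involved. The `[[word]]` marker is a made-up stand-in for however the model wraps the tested word, and hashing the card's dedup key is just one plausible way to get an ID that stays the same across exports; the post doesn't say how the real IDs are derived.

```python
import hashlib
import re

MARKER = re.compile(r"\[\[(.+?)\]\]")  # hypothetical marker around the tested word


def to_anki_cloze(marked_sentence: str) -> str:
    """Turn the first marked word into Anki's {{c1::...}} cloze syntax."""
    return MARKER.sub(lambda m: "{{c1::" + m.group(1) + "}}", marked_sentence, count=1)


def sound_tag(filename: str) -> str:
    """Anki plays a media file referenced as [sound:filename]."""
    return f"[sound:{filename}]"


def stable_note_id(dedup_key: str) -> int:
    """Derive the same integer ID from the same card text on every export."""
    digest = hashlib.sha256(dedup_key.encode("utf-8")).digest()
    # Keep 63 bits so the value fits a signed 64-bit SQLite INTEGER.
    return int.from_bytes(digest[:8], "big") >> 1
```

For example, `to_anki_cloze("Ayer [[comimos]] paella.")` gives `Ayer {{c1::comimos}} paella.`, which is what a Cloze note's text field expects.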

Why it's free and open source, not a SaaS

Two reasons. One strategic, one personal.

Strategic: I'm not going to babysit a vibe-coded app that holds user data and could get someone pwned, for a bit of monthly revenue I'd have to earn back with a lot of marketing. Ship the code instead. Self-hosted, bring your own API key. No central database to breach. No auth to get wrong. No inference bill I eat. The risk surface basically disappears because there's no "I" hosting your data in the middle.

Personal: I can push this code to GitHub for free. Choosing to paywall it would be manufacturing scarcity that doesn't exist. Selling teaching, someone's actual hours and labor, is fine. Gatekeeping a tool that costs nothing to copy is a different thing, and I just don't want to. That's a choice, not a sermon.

So: free, open, self-hostable.

What it is NOT

Let me be straight about the limits, because overselling this would be against the whole point.

It's not an Anki killer. Anki is mature, it's mobile, and it has a giant shared-deck ecosystem I'm not competing with. The edge here is one specific thing: verified, on-demand generation. Cards for exactly the topic and level you want, that don't exist as a premade deck, checked by a second model before they reach you.

The judge is a quality GATE, not a guarantee. It raises the floor a lot. It will still miss subtle things, especially in lower-resource languages. I'd rather say that than pretend otherwise.

It's desktop-local today. Anki export is the mobile story for now.

And the clearest use case isn't a flashy launch. It's a teacher building custom, grammar-checked practice for a specific student or a specific lesson. A real job, done faster. That's the kind of small, concrete win I care about more than a big release.

The idea that actually matters

Forget flashcards for a second. Here's the pattern.

If you're going to let AI generate something people rely on, put a second AI, or a deterministic check, whose only job is to catch the first one being wrong. "Generated" and "verified" are two different words. Build the gap between them on purpose.

That's most of what good AI engineering is right now. Not a smarter prompt. A check that runs after the prompt.

The app is free and open source under the AGPL. Poke at it, fork it, adapt it for your own languages: it's on GitHub.

I build in public. Subscribe to the newsletter to follow what ships next.