The bread-and-butter LLM use case. A user types a question, your app sends it to a model with some context, the model streams back an answer. Wire it into a Discord bot, a Slackbot, a customer support widget, a help overlay inside your app, or a standalone chat product.
Recommended model
qai-flash for low-stakes chat; qai-pro for customer-facing.
Cost ballpark
~$0.20-$2 per 1,000 conversations on flash; ~$2-$10 on pro.
Architecture
User message → your app (build messages array with system prompt + history + new user msg) → Qai /v1/chat/completions with stream: true → pipe SSE chunks back to user → persist completed message
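// Assumption: `qai` is the OpenAI Node SDK pointed at Qai's OpenAI-compatible endpoint,
// and `res` is a Node HTTP response object.
import OpenAI from 'openai';
const qai = new OpenAI({ baseURL: 'https://llm.quickcasa.ai/v1', apiKey: process.env.QAI_KEY });

const stream = await qai.chat.completions.create({
  model: 'qai-pro',
  messages: [
    { role: 'system', content: 'You are a helpful support agent for Acme.' },
    ...conversationHistory,
    { role: 'user', content: userMessage },
  ],
  stream: true,
});
// Forward each text delta to the client as it arrives.
for await (const chunk of stream) {
  const delta = chunk.choices[0]?.delta?.content;
  if (delta) {
    res.write(delta); // stream to client
  }
}
res.end(); // close the response once the stream is exhausted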
Gotchas
- Conversation history grows linearly, and every request resends all of it, so token cost per request climbs with each turn. Cap it at the last 10-20 turns or summarise older context to avoid runaway token bills.
- Streaming SSE through proxies / load balancers requires disabling buffering. Set X-Accel-Buffering: no and similar headers.
- Cache the system prompt as a constant; do not regenerate it per request.
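The gotchas above can be sketched in a few lines. This is a minimal illustration, not Qai's own code: the helper names `trimHistory` and `buildMessages` are hypothetical, a "turn" is assumed to be one user message plus one assistant reply, and the header set is the usual one for streamed responses behind nginx-style proxies.

```javascript
// System prompt defined once as a constant, not rebuilt per request.
const SYSTEM_PROMPT = 'You are a helpful support agent for Acme.';

// Headers commonly sent with a streamed response so proxies do not buffer it.
// X-Accel-Buffering is honoured by nginx; other proxies may need their own setting.
const STREAM_HEADERS = {
  'Content-Type': 'text/event-stream',
  'Cache-Control': 'no-cache',
  'X-Accel-Buffering': 'no',
};

// Keep only the most recent `maxTurns` turns. Assumes history alternates
// user / assistant, so one turn is two entries in the array.
function trimHistory(history, maxTurns = 10) {
  const maxMessages = maxTurns * 2;
  return history.length > maxMessages ? history.slice(-maxMessages) : history.slice();
}

// Assemble the messages array: system prompt + capped history + new user message.
function buildMessages(history, userMessage, maxTurns = 10) {
  return [
    { role: 'system', content: SYSTEM_PROMPT },
    ...trimHistory(history, maxTurns),
    { role: 'user', content: userMessage },
  ];
}
```

Summarising older context instead of dropping it would replace the `slice` call with a summary message; that variant is not shown here.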
Blog posts, email replies, social captions, product descriptions, ad variants, internal docs. The "stare at the blank page" problem solved with a prompt and a model. Common pattern: user fills in a short brief, you generate 3 variants, they pick one and edit.
Recommended model
qai-pro as the default; qai-max when output goes to a customer unedited.
Cost ballpark
~$0.005 per generated blog paragraph on pro.
Architecture
User fills brief form → build prompt with brand voice + brief + format instruction → Qai /v1/chat/completions with n: 3 or three parallel calls → display variants → user picks & edits → save
Gotchas
- The model defaults to a "press release" voice - add explicit style instructions like "no exclamation marks, no buzzwords, no em dashes" to fight it.
- Run output through the Humanize Text utility to strip AI typography tells before showing it to users.
- Save drafts before edit - users often want the original back after they have over-edited.
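One way to assemble the request described above is sketched below. The `brief` fields (`format`, `text`) and the helper names are hypothetical, and the response shape is assumed to follow the OpenAI chat format that Qai's endpoint is described as compatible with.

```javascript
// Explicit style rules to counter the default "press release" voice.
const STYLE_RULES = ['no exclamation marks', 'no buzzwords', 'no em dashes'];

// Build a chat completions request body that asks for several variants at once.
function buildVariantRequest(brief, brandVoice, variants = 3) {
  const system = [
    `Brand voice: ${brandVoice}`,
    `Style rules: ${STYLE_RULES.join(', ')}.`,
    'Return only the finished copy, with no preamble.',
  ].join('\n');
  return {
    model: 'qai-pro',
    n: variants,
    messages: [
      { role: 'system', content: system },
      { role: 'user', content: `Format: ${brief.format}\nBrief: ${brief.text}` },
    ],
  };
}

// Pull the text of each variant out of an OpenAI-format response.
function extractVariants(response) {
  return response.choices.map((choice) => choice.message.content.trim());
}
```

If `n` turns out not to be supported for a given model, the same body with `n` removed can be sent three times in parallel, as the architecture line suggests.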
Retrieval-augmented generation: take a knowledge base of documents, find the chunks most relevant to a user's question, stuff them into the prompt, ask the model to answer using only those chunks. The standard pattern for "AI that knows about my company / product / data."
Recommended model
qai-pro for most cases; qai-think if answers require multi-step inference across chunks.
Cost ballpark
~$0.01-$0.05 per question depending on chunk size and tier.
Architecture (interim, pending Qai embed model)
Ingest docs → chunk → embed via any provider → store in vector DB → user asks question → embed query → retrieve top-K chunks → Qai /v1/chat/completions with chunks as system context → cite the chunk in the answer
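The retrieval and prompt-building steps of this pipeline can be sketched without committing to a particular vector DB. The example below ranks chunks in memory by cosine similarity, which is an assumption (a real vector store does this for you), and the chunk fields `filename`, `page`, `text`, and `embedding` are hypothetical names.

```javascript
// Cosine similarity between two equal-length numeric vectors.
function cosineSimilarity(a, b) {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Return the k chunks most similar to the query embedding, best first.
function topK(queryEmbedding, chunks, k = 3) {
  return chunks
    .map((chunk) => ({ ...chunk, score: cosineSimilarity(queryEmbedding, chunk.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}

// Put the retrieved chunks into the system message, numbered so the model can cite them.
function buildRagMessages(question, retrieved) {
  const context = retrieved
    .map((c, i) => `[${i + 1}] (${c.filename}, p. ${c.page})\n${c.text}`)
    .join('\n\n');
  const system =
    'Answer ONLY from the provided context. Cite sources by their [number]. ' +
    'If the context does not contain the answer, say so.\n\nContext:\n' + context;
  return [
    { role: 'system', content: system },
    { role: 'user', content: question },
  ];
}
```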
Gotchas
- Chunk size matters more than chunk count. Start with ~500-token chunks with ~100-token overlap.
- Always instruct the model "answer ONLY from the provided context" or you will get hallucinations leaking through.
- Cite chunks back to the user (filename + page) so they can verify. This single change delivers most of the trust gain RAG promises.
- Qai's own embed model is on the roadmap - you can switch over with a one-line change when it ships.
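The chunking advice above (~500-token chunks with ~100-token overlap) can be sketched as a sliding window. This example approximates tokens with whitespace-separated words, which is an assumption; a real tokenizer will give different counts. The name `chunkText` is hypothetical.

```javascript
// Split text into overlapping chunks. Sizes are counted in words here,
// as a rough stand-in for tokens.
function chunkText(text, chunkSize = 500, overlap = 100) {
  if (overlap >= chunkSize) {
    throw new Error('overlap must be smaller than chunkSize');
  }
  const words = text.split(/\s+/).filter(Boolean);
  const step = chunkSize - overlap; // how far the window advances each time
  const chunks = [];
  for (let start = 0; start < words.length; start += step) {
    chunks.push(words.slice(start, start + chunkSize).join(' '));
    // Stop once the window has reached the end of the text.
    if (start + chunkSize >= words.length) break;
  }
  return chunks;
}
```

With the defaults, each chunk repeats the last 100 words of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk.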
Avatar generators, product mockups, AI art tools, social-media-card factories, marketing visual pipelines. With Qai-hosted media URLs, your app does not even need its own storage bucket - you call the API and embed the returned URL.
Recommended model
qai-imagine-turbo for batches and previews; qai-imagine-quality for hero images.
Cost ballpark
$0.04 per turbo image; $0.08 per quality image.
Architecture
User prompt or template → Qai /v1/images/generations with hostMedia: true → receive permanent CDN URL → embed directly in your app / email / social post
curl -X POST https://llm.quickcasa.ai/v1/images/generations \
  -H "Authorization: Bearer $QAI_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qai-imagine-turbo",
    "prompt": "a friendly robot mascot holding a coffee mug, flat illustration style",
    "hostMedia": true
  }'
# Response includes: { "data": [{ "url": "https://llm.quickcasa.ai/media/{id}" }] }
# That URL is permanent. Embed it anywhere.
Gotchas
- Without hostMedia: true, the returned URL is temporary and auto-expires. Always set it for production.
- Image models do not yet handle consistent character identity across multiple generations - if you need "same person across N images", that is on the roadmap.
- For batch generation, hit turbo with n: 8 in one call instead of 8 separate calls.
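For apps that call the endpoint from JavaScript rather than curl, a batch request might be built as follows. This is a sketch: the helper names are hypothetical, and the response shape is taken from the curl example's comment.

```javascript
const QAI_BASE_URL = 'https://llm.quickcasa.ai/v1';

// Build the URL and fetch() options for one image generation call.
// hostMedia is always true so the returned URLs do not expire.
function buildImageRequest(prompt, { model = 'qai-imagine-turbo', n = 1, apiKey } = {}) {
  return {
    url: `${QAI_BASE_URL}/images/generations`,
    init: {
      method: 'POST',
      headers: {
        Authorization: `Bearer ${apiKey}`,
        'Content-Type': 'application/json',
      },
      body: JSON.stringify({ model, prompt, n, hostMedia: true }),
    },
  };
}

// Collect the hosted URLs from a response shaped like { data: [{ url }, ...] }.
function extractImageUrls(response) {
  return response.data.map((item) => item.url);
}

// Usage (not executed here):
//   const { url, init } = buildImageRequest('a friendly robot mascot', { n: 8, apiKey: process.env.QAI_KEY });
//   const urls = extractImageUrls(await (await fetch(url, init)).json());
```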
Marketing clips, product demo b-roll, social hooks, intro animations, ad creative. Six seconds at a time, hosted on the Qai CDN, ready to embed. Stitch a few together for a longer piece.
Recommended model
qai-motion
Cost ballpark
~$1.08 per 6-second clip.
Architecture (async with polling)
POST /v1/videos/generations with prompt → get jobId back → poll /v1/videos/generations/{jobId} every ~10s → status flips to "completed" → download from result.data[0].url
Gotchas
- Video generation is async and slow (~90s end-to-end). Do not block a user-facing request on it; use a webhook-style pattern in your own app.
- Always set hostMedia: true in production - same as images, the URL otherwise expires.
- Stitching clips: keep prompts visually consistent (style, palette, subject) or seams will be obvious.
- The model does not yet do reliable speech / voiceover. Generate the visuals only, layer audio separately.
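The polling flow in the architecture line can be sketched as a small loop that runs in a background worker rather than in a user-facing request. Several details here are assumptions: `status` and `result` are taken to be top-level fields of the status response, the `'failed'` status value is hypothetical, and `getStatus` stands in for whatever function your app uses to GET /v1/videos/generations/{jobId}.

```javascript
// Poll a video job until it completes, fails, or times out.
// `getStatus(jobId)` must return a promise for the parsed status response.
async function pollVideoJob(getStatus, jobId, options = {}) {
  const {
    intervalMs = 10000, // ~10s between polls, as suggested above
    timeoutMs = 300000, // give up after 5 minutes (arbitrary choice)
    sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms)),
  } = options;
  const deadline = Date.now() + timeoutMs;
  for (;;) {
    const job = await getStatus(jobId);
    if (job.status === 'completed') return job.result.data[0].url;
    if (job.status === 'failed') throw new Error(`video job ${jobId} failed`); // assumed status value
    if (Date.now() + intervalMs > deadline) throw new Error(`video job ${jobId} timed out`);
    await sleep(intervalMs);
  }
}
```

Injecting `getStatus` and `sleep` keeps the loop testable without network access or real delays.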
Anywhere you need a model to think before it speaks. Scheduling assistants, financial decisioning, multi-constraint planners, code review bots, debate referees. qai-think spends extra compute working through a problem instead of guessing the first plausible answer.
Recommended model
qai-think
Cost ballpark
~$0.01-$0.05 per query, varies with reasoning depth.
Architecture
User problem statement → Qai /v1/chat/completions with model: "qai-think" → show "thinking..." UX (response can take seconds) → render the answer with reasoning trace
Gotchas
- Reasoning models are not great at small-talk - use qai-pro or qai-flash for casual chat and route only hard questions to qai-think.
- The reasoning trace is verbose. If your UI does not show it, instruct the model to "respond with only the final answer."
- Latency is variable. Set a UI affordance for the wait (animated thinking indicator) so users do not assume it broke.
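Routing only hard questions to qai-think needs some way to tell them apart. The sketch below uses a deliberately crude heuristic (message length plus a few keyword patterns); the patterns, the 400-character threshold, and the `pickModel` name are all hypothetical, and a small classifier call would be a reasonable replacement.

```javascript
// Keyword patterns that hint at a multi-step problem (illustrative only).
const HARD_QUESTION_HINTS = [
  /\bschedul/i,
  /\bplan\b/i,
  /\bconstraint/i,
  /\bdebug/i,
  /\bcompare\b/i,
  /\btrade-?off/i,
];

// Choose a model tier for one incoming message.
function pickModel(message, { customerFacing = false } = {}) {
  const looksHard =
    message.length > 400 || HARD_QUESTION_HINTS.some((pattern) => pattern.test(message));
  if (looksHard) return 'qai-think';
  return customerFacing ? 'qai-pro' : 'qai-flash';
}
```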
n8n, Make, Zapier, custom cron jobs, GitHub Actions, internal data pipelines - anywhere a workflow needs "an LLM call" as one of its steps. Qai's OpenAI-compatible endpoint slots into every automation platform that already supports OpenAI.
Recommended model
Depends on the step. qai-flash for classify/tag, qai-pro for generate/transform.
Cost ballpark
~$1-$10 per 10,000 automation runs.
Architecture (n8n example)
Trigger (webhook / cron / change) → data prep nodes → OpenAI node configured with base URL https://llm.quickcasa.ai/v1 → downstream nodes (send to Slack, write to DB, etc.)
Gotchas
- Create a dedicated API key per workflow with its own daily budget cap. If one workflow goes haywire, only its key gets locked, not your whole account.
- Most automation platforms support custom base URLs but bury the setting - look for "OpenAI custom endpoint" or "compatible API" options.
- Log the Qai response in your workflow so you can debug bad runs later. Most platforms have a "store output" toggle.
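For a custom cron job or pipeline step, a classify/tag call on qai-flash might look like the sketch below. The helper names and the `'unknown'` fallback are hypothetical; the request and response shapes assume the OpenAI chat format. Validating the reply against the allowed labels keeps a stray sentence from the model out of downstream nodes.

```javascript
// Build a chat completions request body that asks for exactly one label.
function buildClassifyRequest(text, labels) {
  return {
    model: 'qai-flash',
    messages: [
      {
        role: 'system',
        content: `Classify the message into exactly one of: ${labels.join(', ')}. Reply with the label only.`,
      },
      { role: 'user', content: text },
    ],
  };
}

// Map the model's reply back to one of the allowed labels, or a fallback.
function parseLabel(response, labels, fallback = 'unknown') {
  const raw = (response.choices[0]?.message?.content ?? '').trim().toLowerCase();
  const match = labels.find((label) => label.toLowerCase() === raw);
  return match ?? fallback;
}
```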
Point Cursor, Continue, Aider, Cline, or any custom-base-URL coding tool at Qai and get a paid coding assistant for your team. Mix tiers across operations - cheap for autocomplete and quick refactors, qai-max for big architectural changes.
Recommended model
qai-pro for everyday; qai-max for refactors; qai-think for debugging tricky bugs.
Cost ballpark
~$5-$30 per developer per month at typical use.
Architecture (Cursor / Continue example)
IDE extension → user invokes a command (chat, edit, refactor) → extension sends OpenAI-format request to https://llm.quickcasa.ai/v1 → streamed response renders inline in editor
Gotchas
- Set a per-key daily budget. Coding assistants can burn tokens quickly during a "vibe coding" session.
- Most IDE extensions assume OpenAI model names by default - configure them to use qai-pro / qai-max etc. in extension settings.
- For autocomplete-style features, qai-flash is usually fast enough. Use qai-pro only for chat and code generation.
- Watch out for inline "agent" extensions that fire many calls per minute - they can outpace your daily budget faster than you expect.