AI-Generated UI Tests: What to Review Before You Merge Them

AI-generated UI tests are useful for speed, but speed is not the same thing as quality. A generated test can look polished, pass once locally, and still be a poor fit for your codebase. The real question is not whether an AI can produce a test, but whether a human can review it quickly enough to trust it in the main branch.

That is where many teams get tripped up. They treat generated test code like a finished artifact instead of a draft. The result is familiar to anyone who has maintained frontend automation for a while: brittle selectors, unclear assertions, weird waits, and tests that encode assumptions nobody noticed until CI turned red.

If you are using test automation to support release confidence, AI should reduce the time it takes to create useful coverage, not create a new class of maintenance debt. For teams working in modern frontend stacks, the right posture is simple: generate aggressively, review ruthlessly, and merge only after the test behaves like a human wrote it with the product and failure modes in mind.

What AI-generated UI tests are good at, and where they go wrong

AI-generated tests are strongest when the workflow is repetitive and the UI is fairly standard. Login, sign up, add to cart, submit a form, check a confirmation page: those are the kinds of flows where generated steps can save a lot of typing.

They tend to go wrong in the places that matter most for long-term maintainability:

they pick selectors that are technically valid but unstable,
they assert on text that is incidental instead of meaningful,
they wait for the wrong thing, or do not wait at all,
they assume a page state that is true only on a clean local environment,
they mix product behavior with implementation details.

That last point is important. A generated test can be functionally correct and still encode the wrong contract. For example, it may assert that a specific toast appears because that is what happened during generation, even though the real product requirement is simply that the user receives a success state and the data is saved.

A good test is not a transcript of the UI; it is a description of behavior that should continue to matter after the DOM changes.

The review checklist I would use before merging

If you are evaluating AI-generated UI tests, do not ask, “Does it run?” Ask, “Will it still be useful after the next frontend change?”

Here is the checklist I would use before merging generated test code into the main branch.

1. Are the selectors stable, or just convenient?

Selectors are usually the first place generated tests become fragile. An AI model may happily reach for nth-child, a long CSS path, or a text selector that happens to work on the current page state. That is not enough.

Prefer selectors with a clear relationship to user intent and product meaning:

data-testid or similar stable attributes when your team uses them consistently,
roles and accessible names when they are reliable,
semantic structure that reflects actual user interaction,
component-specific hooks that do not depend on visual layout.

Reject generated tests that depend on implementation noise, such as:

deeply nested CSS paths,
index-based selection,
dynamically generated class names,
text selectors that target copy known to change frequently.

A quick heuristic helps here: if the selector would become invalid after a harmless refactor, it is probably too brittle.
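
Part of that heuristic can be automated as a review aid. The sketch below is a hypothetical helper, `findSelectorSmells`, that flags three of the patterns listed above. The regular expressions and the three-step depth threshold are assumptions chosen for illustration, and text selectors that target volatile copy still need a human eye.

```typescript
// Hypothetical review helper: flags selector patterns the checklist calls brittle.
// The patterns and the depth threshold are illustrative assumptions, not a standard.
type SelectorSmell = 'index-based' | 'deep-css-path' | 'generated-class';

function findSelectorSmells(selector: string): SelectorSmell[] {
  const smells: SelectorSmell[] = [];

  // Index-based selection such as :nth-child(3) or :nth-of-type(2).
  if (/:nth-(child|of-type|last-child)\(/.test(selector)) {
    smells.push('index-based');
  }

  // Three or more child/descendant steps suggests coupling to DOM layout.
  const steps = selector.split(/\s*>\s*|\s+/).filter((part) => part.length > 0);
  if (steps.length >= 3) {
    smells.push('deep-css-path');
  }

  // Class names ending in a hash-like suffix that contains a digit,
  // e.g. .Button_root__a1B2c or .css-1x2y3z.
  if (/\.[\w-]*(?:__|-)(?=[A-Za-z0-9]*\d)[A-Za-z0-9]{5,}\b/.test(selector)) {
    smells.push('generated-class');
  }

  return smells;
}
```

For `div.main > div:nth-child(3) > button` this reports `index-based` and `deep-css-path`, while `[data-testid="save-changes"]` comes back clean. An empty result does not make a selector good; it only means none of these specific patterns matched.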

Example of a selector you should question:

typescript

await page.locator('div.main > div:nth-child(3) > button').click()

Example of something much easier to defend:

typescript

await page.getByRole('button', { name: 'Save changes' }).click()

The second form is not perfect, but it tells you what the user is actually doing.

2. Do the assertions prove behavior, or just presence?

Generated tests often assert that something exists, because existence is easy to automate. But existence is a weak signal. A modal appearing, a toast rendering, or a title changing does not necessarily mean the workflow succeeded.

Good assertions should answer a product question:

Did the form submit successfully?
Did the user land on the correct destination?
Did the data persist?
Did the app show the right permission or validation state?
Did the action produce the intended side effect?

If you see assertions like “button is visible” or “text contains X” with no stronger purpose, ask whether they are actually protecting against regression. In many cases the answer is no.

A better pattern is to assert on outcome, not just appearance. For example, after checkout, the test should confirm order confirmation details, not only that a success message briefly appeared.

3. Is the waiting strategy intentional?

Waits are where flaky AI tests often hide. Generated code may use arbitrary sleep calls, overly broad waits for network idle, or a wait that matches the wrong condition entirely.

Be suspicious of:

waitForTimeout or fixed sleeps,
blind reliance on networkidle for apps with background requests,
waits that are unrelated to the next action,
repeated waits that exist because the test author was compensating for unstable selectors.
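
A rough scan of the generated source can surface the first two items before a human reads the test. This is a minimal sketch: `findWaitSmells` and its pattern list are hypothetical, and a match is only a prompt to look closer, not proof of a bad wait. The last two items on the list still require judgment about what the test is trying to do.

```typescript
// Hypothetical scanner: reports lines of a test file that match suspicious wait patterns.
interface WaitFinding {
  line: number; // 1-based line number in the scanned source
  reason: string;
}

// Illustrative patterns only; extend or trim them to match your framework.
const WAIT_PATTERNS: Array<{ pattern: RegExp; reason: string }> = [
  { pattern: /\bwaitForTimeout\s*\(/, reason: 'fixed sleep (waitForTimeout)' },
  { pattern: /\bcy\.wait\s*\(\s*\d/, reason: 'fixed sleep (cy.wait with a number)' },
  { pattern: /\bsetTimeout\s*\(/, reason: 'manual timer' },
  { pattern: /networkidle/, reason: 'networkidle wait; check for background requests' },
];

function findWaitSmells(source: string): WaitFinding[] {
  const findings: WaitFinding[] = [];
  source.split('\n').forEach((text, index) => {
    for (const { pattern, reason } of WAIT_PATTERNS) {
      if (pattern.test(text)) {
        findings.push({ line: index + 1, reason });
      }
    }
  });
  return findings;
}
```

Running this over a generated file as a pre-review step gives the reviewer a short list of lines to question first.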

A maintainable test waits for the thing that proves the UI is ready for the next step. That might be:

a visible heading after navigation,
a button becoming enabled after validation,
a specific API-driven piece of content appearing,
a spinner disappearing when the state is ready.

In Playwright, that often looks like this:

typescript

await page.getByRole('button', { name: 'Submit' }).click()
await expect(page.getByText('Thanks, your form has been submitted')).toBeVisible()

Notice what is missing: an arbitrary sleep. The test waits for evidence of completion, not for time to pass.

4. Does the test tell a future human why it exists?

This is one of the easiest review filters to apply and one of the most frequently ignored.

If a generated test is just a chain of actions with no explanation of purpose, future maintainers will not know whether it covers a critical business rule or a random UI path. That matters when a test starts failing in CI and someone has to decide whether to fix the app, fix the test, or delete the test.

Good tests have a readable flow:

setup,
user action,
expected outcome,
important edge case or guardrail.

The code does not need to be verbose, but it should be obvious what risk the test covers. If the generated name is vague, rename it before merging. If the steps do not map to a user story, rewrite them.

5. Are there hidden assumptions about data, state, or environment?

AI-generated tests frequently assume too much.

They may assume:

a user already exists,
a feature flag is enabled,
the environment has seeded data,
the locale is English,
the viewport matches the one used during generation,
the account has permission to do the action,
the app is in a clean database state.

That is fine if those assumptions are made explicit and controlled. It is a problem if they are accidental.

A good review asks:

What data must exist before the test starts?
Is that data created inside the test, or injected by fixture?
What state must be reset after the test finishes?
Which environment variables affect the result?
Are locale and timezone relevant to the assertion?

The more assumptions a generated test makes, the more likely it is to fail in CI, on a staging replica, or when another team changes shared test data.
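
One way to make assumptions explicit is to declare them next to the test and check them before it runs. The sketch below assumes hypothetical `TestAssumptions` and `RuntimeContext` shapes. How you populate the runtime side (environment variables, a flag service, browser context settings) depends on your stack and is not shown here.

```typescript
// Hypothetical declaration of what a test assumes about its environment.
interface TestAssumptions {
  requiredEnvVars: string[];
  requiredFeatureFlags: string[];
  locale: string;
  timezone: string;
}

// Hypothetical snapshot of the environment the test is about to run in.
interface RuntimeContext {
  env: Record<string, string | undefined>;
  enabledFeatureFlags: string[];
  locale: string;
  timezone: string;
}

// Returns one message per assumption that the runtime does not satisfy.
function findUnmetAssumptions(assumed: TestAssumptions, actual: RuntimeContext): string[] {
  const problems: string[] = [];

  for (const name of assumed.requiredEnvVars) {
    if (!actual.env[name]) {
      problems.push(`missing environment variable: ${name}`);
    }
  }
  for (const flag of assumed.requiredFeatureFlags) {
    if (actual.enabledFeatureFlags.indexOf(flag) === -1) {
      problems.push(`feature flag not enabled: ${flag}`);
    }
  }
  if (assumed.locale !== actual.locale) {
    problems.push(`locale is ${actual.locale}, expected ${assumed.locale}`);
  }
  if (assumed.timezone !== actual.timezone) {
    problems.push(`timezone is ${actual.timezone}, expected ${assumed.timezone}`);
  }

  return problems;
}
```

Failing fast with a message like "feature flag not enabled: new-checkout" is easier to triage than a selector timeout three steps into the flow.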

The difference between useful generated code and brittle generated code

A lot of the conversation around generated tests gets stuck on the code style, but style is secondary. What matters is whether the test is aligned to the product and easy to maintain.

Useful generated code usually has these traits:

stable selectors,
one clear user journey,
assertions that reflect business outcomes,
minimal duplication,
explicit setup and teardown,
clear naming,
no unnecessary framework tricks.

Brittle generated code usually has the opposite traits:

it is overfit to the current DOM,
it chains many actions without checkpoints,
it asserts every visible string,
it depends on incidental timing,
it copies whatever happened during generation,
it includes logic that no one wants to maintain.

One practical rule is to compare the test to a well-written manual test case. If the generated version is harder to explain than the manual flow, it probably needs editing.

A simple generated test review workflow

The safest way to use AI in test automation is to make it the first draft, not the final authority.

Here is a lightweight workflow that works well for QA leaders, frontend engineers, and SDETs:

Generate the test.
Run it once in the intended CI-like environment.
Inspect selectors, assertions, and waits.
Remove incidental steps.
Replace fragile checks with product-level checks.
Rename the test so its purpose is obvious.
Re-run it against a changed UI state if possible.
Merge only if the failure signal is meaningful.

That last step matters because a good test should fail for the right reasons. If the UI copy changes from “Save” to “Save changes” and the whole test breaks, that is probably too much coupling. If the same test fails because the payment flow no longer shows a confirmation screen, that is a useful signal.

How to spot flaky AI tests before they spread

Flaky tests often look harmless at first. They pass on the author’s machine, maybe even in the first few CI runs, and then start producing noise.

When reviewing generated tests, watch for these warning signs:

time-based waits with no clear reason,
selectors that break on simple DOM reshuffles,
assertions that depend on animation timing,
tests that click too quickly through async transitions,
repeated retries that mask underlying instability,
tests that assume a perfect network or an empty cache.

If you already have a flaky suite, AI can make it worse by generating more of the same patterns faster. That is why the review process matters more than the generation process itself.

In Cypress, for example, a test that seems to need arbitrary sleeps usually points to a real synchronization problem. Solve it by waiting on the outcome rather than hiding it behind a delay:

javascript

cy.contains('Submit').click()
cy.contains('Submission complete').should('be.visible')

The value is not the specific API; it is the discipline of tying the wait to the outcome.

Generated test code quality is not only about code style

When people talk about generated code quality, they often mean whether the code is formatted cleanly or uses idiomatic framework APIs. That matters, but only after the deeper questions are answered.

I would rank generated test code quality in this order:

correctness of the scenario,
stability of the selectors,
strength of the assertions,
appropriateness of the waits,
clarity of the structure,
consistency with the team’s testing conventions.

If the first four are weak, nice formatting will not save you.

This is especially true in frontend systems where the UI is moving fast. A polished but brittle test can be more dangerous than an ugly one, because it creates false confidence. Teams see a green checkmark and assume coverage exists where it does not.

Review questions worth asking in code review

If your team is merging generated tests through pull requests, these questions are worth putting in the review template:

What user behavior does this test protect?
Which selector would fail first if the UI changed, and is that acceptable?
Does the test assert outcome, or only that the DOM rendered?
Are all waits tied to readiness, not time?
What state does the test depend on?
Is this test independent from other tests?
Will a non-author understand the failure message?
If this test starts failing next month, will we know whether to fix the test or the app?

That last question is often the most useful one. Good automation reduces ambiguity. Bad automation just moves ambiguity into the CI pipeline.

When to reject an AI-generated test outright

Not every generated test deserves to be rescued.

I would reject a test if it:

is mostly composed of brittle selectors,
encodes too many incidental UI details,
lacks a meaningful assertion,
relies on arbitrary delays,
depends on opaque shared state,
duplicates an existing test without adding coverage,
would take more time to fix than to rewrite.

Rewriting from scratch is not failure. In mature test suites, it is often the cheaper option.

There is a temptation to accept generated output because it feels productive. But productivity is not just how quickly a test appears; it is how little future maintenance it creates.

Where AI can help, if you keep the human in the loop

The strongest use of AI in test automation is assistive, not autonomous. Let it help with first drafts, scenario expansion, boilerplate, and even cross-framework translation. But keep human review in charge of the parts that decide whether a test belongs in the suite.

In practice, that means AI can be useful for:

scaffolding a new test from a plain-English scenario,
suggesting coverage for adjacent edge cases,
converting repetitive flows into a base template,
surfacing locators or assertions you might have missed,
accelerating migration between frameworks.

It should not be the final authority on selectors, assertions, or business significance.

That is also why guided platforms tend to work better than fully hands-off generation. A tool like Endtest’s AI Test Creation Agent leans toward an editable, platform-native workflow, which makes review much easier than trying to salvage opaque generated output after the fact. If your team wants a more structured path, the dedicated AI test creation page is worth a look.

A practical bottom line

The best way to think about AI-generated UI tests is this: generation is cheap, trust is expensive.

Before you merge, review the test for stable selectors, meaningful assertions, intentional waits, explicit assumptions, and long-term maintainability. If it looks like a brittle script that happened to pass today, keep it out of the main branch. If it reads like a clear expression of user behavior and product intent, then it is probably ready.

That standard may feel stricter than what some AI workflows encourage, but it is the right standard for production test suites. You do not want more tests, you want better signals.

For teams that want more guidance around agentic AI test authoring and self-healing locators, Endtest is one possible alternative to the AI-only, write-it-and-pray approach. The important part, whichever tool you choose, is that a human still reviews the generated test code before it becomes part of your test suite.