Zhipu Flagship GLM-5 Hands-on Test: Comparison with Opus 4.6 and GPT-5.3-Codex

1. Introduction

I just saw that Zhipu's new-generation flagship model GLM-5 has been officially released.

They are really pushing hard: they insisted on launching before the long holiday, less than two months after the previous version, GLM-4.7, was released...

GLM-4.x is highly rated both domestically and internationally, and is widely recognized as a top-tier model in the programming field. People are naturally very curious about what improvements this new major version will bring.

Full disclosure: their team invited me to join the internal beta last week, and I have been using the model for several days.

Coincidentally, also last week, two foreign flagship models released new versions at the same time: Anthropic released Claude Opus 4.6, and OpenAI released GPT-5.3-Codex.

All three new models focus on programming, so I could not resist running a comparative test to see how they differ; I suspect many people are curious about this as well.

Below are the generation results of these three AI models on real-world programming tasks.

2. Introduction to GLM-5

The official release notes introduce GLM-5 as follows: as an open-source model, GLM-5 is positioned to fully match top-tier closed-source models, with special enhancements in two areas.

(1) Complex systems engineering

GLM-5 is not only good at generating front-end web pages, but also excels at handling back-end tasks, system refactoring, and in-depth debugging, abandoning the paradigm of "valuing front-end aesthetics over low-level logic".

It has an extremely strong self-reflection and error correction mechanism. When compilation fails or a runtime error occurs, it can independently analyze logs, locate the root cause, and iteratively repair the problem until the system runs successfully.
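To make the described loop concrete, here is a hedged sketch of what a build-and-fix cycle of this kind generally looks like. This is an illustrative shape only, not Zhipu's actual implementation; runBuild and askModelForFix are hypothetical functions standing in for "compile/run the project" and "feed the error log back to the model".

```javascript
// Illustrative self-correcting build loop (NOT Zhipu's actual code).
// runBuild(code)        -> { ok, log }  : hypothetical build runner
// askModelForFix(code, log) -> code     : hypothetical model call
async function buildWithRetries(code, runBuild, askModelForFix, maxTries = 5) {
  for (let i = 0; i < maxTries; i++) {
    const { ok, log } = await runBuild(code);
    if (ok) return code;                     // build succeeded, stop iterating
    code = await askModelForFix(code, log);  // feed the error log back in
  }
  throw new Error('build still failing after retries');
}
```

The point is simply that the error log, not the human, drives each iteration; the loop exits as soon as the build goes green.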

(2) Long-horizon Agent

It is capable of running long-horizon tasks, i.e., multi-stage, long-step complex tasks. It can independently split requirements, run continuously and automatically for up to several hours, while maintaining context coherence and goal consistency.

(3) Summary

The tasks GLM-5 can complete have already gone far beyond generating front-end UIs: it can generate large, complex system-level projects, such as operating system kernels, browser kernels, the V8 engine, and more.

Its slogan is: "Now that large models have entered the era of agents and large-scale tasks, GLM-5 is the open-source option you can actually use."

3. Testing Methodology

The test questions I chose are the same ones that Alejandro AO, a developer advocate at HuggingFace, used to test Opus 4.6 and GPT-5.3.

He recorded a video showing the performance of the two models.

I used the same questions to test GLM-5, and compared the results against his.

There are four questions in total, covering both front-end and back-end tasks. I have put the original prompts and starter code into a repository on GitHub.

4. Web Design Test

The first test assesses web design and refactoring capabilities.

The original page is very rudimentary.

It simply categorizes information and stacks it together. The prompt asks the AI to redesign this webpage so that it is beautiful and user-friendly, with a mature, reliable, professional feel.

As mentioned earlier, the prompts and original files are all on GitHub, so I won't repost them here. You can run the test yourself, or test other models with it.

Below are the results generated by GLM-5.

This result is genuinely beautiful and professional. All information is well organized, it has smooth animation effects, it works perfectly for mobile browsing (see below), and it can almost be deployed directly to production.

I have published this page; you can click here to view it.

Below are the generation results of Opus 4.6, captured as screenshots from the video.

Below are the generation results of GPT-5.3.

All three designs are usable, but GPT-5.3 has a flaw: it did not implement a sticky header, so the header disappears when you scroll down. It is also less visually appealing than the other two.

Therefore, in this test, GLM-5 and Opus 4.6 performed better. Which of the two is superior comes down to aesthetic preference; personally, I prefer GLM-5's design style.

5. 3D Sandbox Test

The second test examines the 3D animation generation capability of AI models.

The requirement was to generate an educational 3D web sandbox that animates the orbital motion of celestial bodies in the solar system, allows adjusting parameters such as mass, position, and speed, and supports manually adding new celestial bodies.
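The physics core of a sandbox like this is a small N-body integrator. Here is a minimal sketch, assuming plain JavaScript and semi-implicit Euler integration; the constant G, the units, and all field names are illustrative, not taken from any model's actual output.

```javascript
// Minimal N-body gravity step (semi-implicit Euler).
// G and all units are illustrative, not physical SI values.
const G = 1.0;

function stepBodies(bodies, dt) {
  // Reset and accumulate pairwise gravitational accelerations.
  for (const b of bodies) { b.ax = 0; b.ay = 0; }
  for (let i = 0; i < bodies.length; i++) {
    for (let j = i + 1; j < bodies.length; j++) {
      const a = bodies[i], b = bodies[j];
      const dx = b.x - a.x, dy = b.y - a.y;
      const r2 = dx * dx + dy * dy + 1e-9;  // softening avoids divide-by-zero
      const f = G / (r2 * Math.sqrt(r2));   // G / r^3
      a.ax += f * b.mass * dx; a.ay += f * b.mass * dy;
      b.ax -= f * a.mass * dx; b.ay -= f * a.mass * dy;
    }
  }
  // Update velocities first, then positions (keeps orbits stable).
  for (const b of bodies) {
    b.vx += b.ax * dt; b.vy += b.ay * dt;
    b.x  += b.vx * dt; b.y  += b.vy * dt;
  }
}
```

In this scheme, the control panel's "add celestial body" button would amount to pushing one more { x, y, vx, vy, mass } object onto the array, which is why all three models can support it so easily.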

Below are the results generated by GLM-5.

The right side of the page is the animation area, which by default shows three asteroids orbiting the central star. You can drag with the mouse to rotate the view a full 360 degrees, and zoom in and out.

The left side of the page is the control panel, which is very well designed.

The upper half adjusts animation and celestial parameters, while the lower half is used to add new celestial bodies or delete existing ones.

For comparison, here are the generation results of Opus 4.6.

Generation results of GPT-5.3.

All three generation results meet the requirements and can run smoothly. However, GLM-5's animation is missing gravitational grid lines, while GPT-5.3's grid lines are too cluttered, so Opus 4.6 has the better animation effect.

In terms of the control panel, both GLM-5 and Opus 4.6 are well designed, while GPT-5.3's is a bit too simplistic.

Overall, I think the best performer in this round is Opus 4.6, followed by GLM-5, and finally GPT-5.3-Codex.

6. Web Game

The third test is to generate a web version of the game Angry Birds.

GLM-5's result is decent: it looks very similar to the original and is playable, but the gameplay is shallow and the bounce physics are unconvincing.
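A convincing bounce usually comes down to a restitution coefficient applied when the projectile hits the ground. Here is a minimal sketch, assuming screen-style coordinates (y grows downward) and illustrative constants; none of this is taken from any model's actual output.

```javascript
// Bounce off the ground with a restitution coefficient.
// GRAVITY and RESTITUTION values are illustrative.
const GRAVITY = 9.8;
const RESTITUTION = 0.6;   // fraction of vertical speed kept per bounce

function stepProjectile(p, dt) {
  p.vy += GRAVITY * dt;    // y grows downward, screen-style
  p.x += p.vx * dt;
  p.y += p.vy * dt;
  if (p.y > p.ground && p.vy > 0) {   // crossed the ground while falling
    p.y = p.ground;
    p.vy = -p.vy * RESTITUTION;       // reflect and damp the vertical speed
    p.vx *= 0.9;                      // a little ground friction
  }
  return p;
}
```

Tuning RESTITUTION (and the friction factor) is largely what separates a "dead" bounce from one that feels like the original game.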

Opus 4.6 has very high fidelity to the original, and the gameplay experience is also very close to the original.

GPT-5.3's generation result is underwhelming: the bird cannot be launched at all, so the game is unplayable.

This round is very clear: Opus 4.6 is the best, followed by GLM-5.

7. Porting Laravel to Next.js

The final test is to port a web application built with PHP's Laravel framework to the Next.js framework for JavaScript.

GLM-5 handled this almost without a hitch: it quickly converted the PHP code to JavaScript and laid out the converted code structure.

After the conversion, it also thoughtfully installed the dependencies automatically, completed the build, and told me: just connect the external API and run npm run dev, and it will work out of the box.
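For readers unfamiliar with this kind of port, the core of the work is translating each Laravel controller action into a Next.js route handler. Here is a hypothetical sketch of the shape of that translation; the names (searchCities, fetchJson) and the endpoint path are illustrative, not taken from the test repository.

```javascript
// Hypothetical sketch of a Laravel -> Next.js translation.
// A Laravel controller action like
//   public function search(Request $req) { ... }
// becomes a plain function that a Next.js App Router handler
// (app/api/search/route.js) delegates to.
async function searchCities(query, fetchJson) {
  if (!query || query.trim().length < 2) {
    return { results: [] };            // mirrors the controller's validation guard
  }
  // fetchJson is injected so the logic stays testable without a network.
  const data = await fetchJson(`/geo/search?q=${encodeURIComponent(query)}`);
  return {
    results: data.map((c) => ({ name: c.name, country: c.country })),
  };
}

// In app/api/search/route.js this would be wrapped roughly as:
// export async function GET(request) {
//   const q = new URL(request.url).searchParams.get('q');
//   return Response.json(await searchCities(q, realFetchJson));
// }
```

The route logic, validation, and response shaping carry over almost one to one, which is why all three models handled the port correctly; the differences showed up in speed and in how much manual cleanup was needed.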

I followed its instructions: it ran smoothly with no errors, and I could access the application at localhost:3000.

This is a city weather lookup application. Since we did not require changing the styling, it looks exactly the same as the original PHP version.

You can search for cities in the input box in the top right corner.

Select the city you want from the search results.

Click through to the city's detail page, which has information including weather, sunrise and sunset times, air quality, and an interactive map.

Opus 4.6 and GPT-5.3 also generated the same correct result. Since the page and functionality are identical, I won't show screenshots here.

It is worth mentioning that both GLM-5 and GPT-5.3 completed the conversion in around 5 minutes. Opus 4.6 seemed to encounter some issues and took a full 20 minutes.

Looking only at the results this round, all three models performed well. But GLM-5 had a shorter generation time, encountered no errors at all, and had a good user experience throughout the process, so I would vote for it.

8. Summary

After these tests, GLM-5's programming performance is impressive and holds its own against the latest foreign flagship models. It even outperforms them in some areas, and even when it lags behind, it is usually only a matter of details, not a fundamental difference in quality.

I hear that it used China's domestic "10,000-GPU cluster" for both training and inference. It is easy to imagine that if it had access to more GPUs and more computing power, its performance would be even better, enough to compete head-to-head with the world's top-tier large model companies.

In addition, the two key areas it specially enhanced this time—"complex systems" and "long-horizon tasks"—bring tangible improvements.

The system logic and back-end code it generates are reliable; errors are rare both during generation and at runtime. When something is missing, it is usually just a function that the AI can add later, not an architectural problem. I also gave it a personal task that ran for a full two hours, and it completed successfully without losing context or going off track.

I would like to end with a quote from the official release:

In 2026, large coding models are advancing from "being able to write code" to "being able to build systems". GLM-5 can be called the "system architect" model of the open-source world. It has shifted its focus from "front-end aesthetics" to "agentic depth and systems-engineering capability", and it is a domestic open-source alternative to Opus 4.6 and GPT-5.3.

(End)

Document Information
  • Copyright Statement: Free reprint - Non-commercial - No derivatives - Attribution (Creative Commons 3.0 License)
  • Publication Date: February 12, 2026

This is a discussion topic separated from the original post at http://www.ruanyifeng.com/blog/2026/02/glm-5.html