Hacker News Show HN: How I topped the HuggingFace open LLM leaderboard on two gaming GPUs

Topped the HuggingFace Open LLM Leaderboard on two gaming GPUs: duplicating a specific block of middle layers improved performance across all benchmarks and took #1.

458
Traction Score
120
Discussions
Mar 13, 2026
Launch Date
Product Positioning & Context

AI Executive Synthesis
This submission presents a novel, empirical finding in LLM architecture optimization: duplicating specific 'circuit-sized blocks' of layers significantly enhances performance. Topping the HuggingFace leaderboard with this method on consumer-grade GPUs demonstrates a cost-effective path to competitive LLM performance, and the notion of 'discrete functional circuits' hints at deeper insights into LLM internal mechanisms.

Market implications: This research directly impacts the efficiency and accessibility of high-performance LLMs. For B2B SaaS providers building on or fine-tuning LLMs, the method offers a potential pathway to improved model efficacy without extensive retraining or prohibitive hardware investment. It signals a trend toward architectural hacks and empirical discoveries driving LLM advancement, rather than solely scaling model size, and could democratize access to top-tier LLM performance for smaller teams with limited compute resources.
I found that duplicating a specific block of 7 middle layers in Qwen2-72B, without modifying any weights, improved performance across all Open LLM Leaderboard benchmarks and took #1. As of 2026, the top 4 models on that leaderboard are still descendants.

The weird finding: single-layer duplication does nothing. Too few layers, nothing. Too many, it gets worse. Only circuit-sized blocks of ~7 layers work. This suggests pretraining carves out discrete functional circuits in the layer stack that only work when preserved whole.

The whole thing was developed on 2x RTX 4090s in my basement. I'm now running current models (GLM-4.7, Qwen3.5, MiniMax M2.5) on a dual GH200 rig (see my other post). Code and new models coming soon.

Happy to answer questions.
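A minimal sketch of the trick described above, assuming the model's decoder layers can be treated as an ordered list (as in typical transformer implementations). The toy residual "layers" here are numpy stand-ins for real transformer blocks, and the block boundaries (`start=3`, `length=7`) are illustrative, not the author's actual split:

```python
import numpy as np

def duplicate_block(layers, start, length):
    """Return a deeper stack with layers[start:start+length] repeated in place.
    The duplicated entries are the same objects, so no weights are modified."""
    block = layers[start:start + length]
    return layers[:start + length] + block + layers[start + length:]

# Toy stand-in for a transformer: 10 residual layers acting on 4-dim states.
rng = np.random.default_rng(0)
layers = [rng.standard_normal((4, 4)) * 0.1 for _ in range(10)]

def forward(x, stack):
    for w in stack:
        x = x + np.tanh(x @ w)  # residual update, loosely transformer-shaped
    return x

expanded = duplicate_block(layers, start=3, length=7)  # illustrative block choice
assert len(expanded) == 17
# The copies share memory with the originals: no new or changed weights.
assert all(expanded[3 + i] is expanded[10 + i] for i in range(7))
y = forward(np.ones(4), expanded)
assert y.shape == (4,)
```

The key property this illustrates is that the expansion is purely structural: depth grows, but the parameter set is untouched, which matches the post's "without modifying any weights" claim.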
Tags: HuggingFace Open LLM Leaderboard, gaming GPUs, Qwen2-72B, single-layer duplication, circuit-sized blocks, pretraining, discrete functional circuits, layer stack

Community Voice & Feedback

vjsrinivas • Mar 13, 2026
Great work and love the detailed breakdown. This is kind of tangential, but it reminded me of this work: https://arxiv.org/pdf/2310.12973 (Frozen Transformers in Language Models are Effective Visual Encoder Layers). The paper puts out an interesting hypothesis that these LLM-derived transformer layers have the ability to "refine" any set of learned tokens, even in different modalities. I wonder if what you're seeing here is related?
BrownSol • Mar 13, 2026
By far one of the most interesting blogs I’ve read in a long while. I’m curious if you could combine this with Karpathy’s auto research to find the best combination of layer duplication. The callout to model merging in 2024 was funny… around that time I became friendly with RomboDawg on HF, who had the best merged coding models around, and I created a couple of Frankenstein models myself.

I say this naively, as I’m not that familiar with how transformers work under the hood, but I wonder if you could combine the two approaches in a coherent way. Frankenmerges were often done naively, just smooshing things together, but knowing how the layers work under the hood, I wonder if there’s a more intelligent way to combine merging and layer duplication to create even better performers.
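For concreteness, the passthrough-style frankenmerge this comment alludes to can be sketched as splicing verbatim layer ranges from two parent stacks. The recipe below is purely hypothetical, not a recommended split:

```python
def frankenmerge(stack_a, stack_b, recipe):
    """Passthrough-style merge: build a new layer stack from verbatim slices
    of two parent stacks. 'recipe' is a list of (source, start, stop) tuples."""
    stacks = {"a": stack_a, "b": stack_b}
    merged = []
    for source, start, stop in recipe:
        merged.extend(stacks[source][start:stop])
    return merged

# Hypothetical Goliath-style interleave of two 32-layer parents; the post's
# block duplication could then be applied to the merged stack as a second step.
a = [("a", i) for i in range(32)]
b = [("b", i) for i in range(32)]
merged = frankenmerge(a, b, [("a", 0, 16), ("b", 8, 24), ("a", 16, 32)])
assert len(merged) == 48
assert merged[16] == ("b", 8)  # the splice point jumps into parent b
```

A "more intelligent" combination, in the comment's sense, would amount to choosing the recipe so slice boundaries fall between the circuit-sized blocks the post identifies, rather than cutting through them.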
momojo • Mar 10, 2026
I'm surprised the point/comment ratio is this skewed. There's so much meat in the post to chew on. I like your writing. This was one of those blogs where I can tell you spent a massive amount of time on the technical side, but simplified it to layman's terms. I hope you keep putting out stuff :).

I have a couple of questions:

1. I think this quote should be raising *many more* eyebrows.

> The astounding thing about Goliath wasn’t that it was a huge leap in performance, it was that the damn thing functioned at all. To this day, I still don’t understand why this didn’t raise more eyebrows.

You put a cat's brain into a dog's head and it's still breathing! It didn't flatline immediately! Is this yesterday's news? This seems like the biggest takeaway. Why isn't everyone attempting LLM surgery at this moment? Have you noticed any increased discourse in this area?

2. You mentioned you spent the beginning of your career looking at brains in biotech. How did you end up in a basement full of GPUs, working not in biotech, but still kind of looking at brains?

Again, great post!
imranq • Mar 10, 2026
Amazing write up, and I wish more people showed the process of discovery, which is often even more interesting than the result itself.

Still, the result is really interesting: being able to stack abstract reasoning and get better performance, and the heat maps showing the probe results.

The academic literature seems to be catching up:

- *[SOLAR / DUS (Kim et al., 2023)](https://arxiv.org/abs/2312.15166)* — duplicated transformer layers to build a 10.7B model that outperformed 30B parameter baselines.
- *[The Curse of Depth (2025)](https://arxiv.org/abs/2502.05795)* — explains why this works: Pre-LN causes deep transformer layers to converge toward identity functions, meaning middle layers are where real computation happens, and duplicating them concentrates that capacity.
- *[Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach (Geiping et al., NeurIPS 2025)](https://arxiv.org/abs/2502.05171)* — takes the idea to its logical conclusion: a model trained with a single recurrent block repeated at inference time, scaling reasoning depth without adding parameters.
mysteria • Mar 10, 2026
> The astounding thing about Goliath wasn’t that it was a huge leap in performance, it was that the damn thing functioned at all. To this day, I still don’t understand why this didn’t raise more eyebrows.

This wasn't something I really dug into in great detail, but I remember my surprise back then at how all those merged models and those "expanded" models like Goliath still generated coherent output. IMO those were more community models made by small creators for entertainment rather than work, and only really of interest to the local LLM groups on Reddit, 4chan, and Discord. People might briefly discuss it on the board and say "that's cool", but papers aren't being written and it's less likely for academics or corpo researchers to notice it.

That being said, I wonder if it's possible to combine the layers of completely different models, like say a Llama and a Qwen, and still get it to work.

> Even with math probes, I hit unexpected problems. LLMs fail arithmetic in weird ways. They don’t get the answer wrong so much as get it almost right but forget to write the last digit, as if it got bored mid-number. Or they transpose two digits in the middle. Or they output the correct number with a trailing character that breaks the parser.

Would using grammar parsing help here by forcing the LLM to only output the expected tokens (i.e. numbers)? Or maybe on the scoring side you could look at the actual probabilities per token to see how far off the correct digit is.
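Both ideas in that last question are easy to sketch. Assuming access to the raw logits at the digit position (the token ids and vocabulary below are made up), grammar-style constraining amounts to masking the logits, and soft scoring amounts to reading the softmax probability of the correct token:

```python
import numpy as np

def constrained_argmax(logits, allowed_ids):
    """Greedy pick restricted to an allowed vocabulary subset, e.g. digit
    tokens. Mimics grammar-constrained decoding by masking everything else."""
    masked = np.full_like(logits, -np.inf)
    masked[allowed_ids] = logits[allowed_ids]
    return int(np.argmax(masked))

def token_prob(logits, token_id):
    """Softmax probability of one token -- useful for scoring how close the
    model was to the correct digit even when the argmax is wrong."""
    z = logits - logits.max()  # subtract max for numerical stability
    p = np.exp(z) / np.exp(z).sum()
    return float(p[token_id])

logits = np.array([2.0, 5.0, 1.0, 4.9, 0.0])  # toy vocab of 5 tokens
digit_ids = [2, 3, 4]                          # hypothetical ids of digit tokens
assert constrained_argmax(logits, digit_ids) == 3  # best *digit*, not the global argmax (1)
```

The toy case shows why constraining helps with parser-breaking output: token 1 (a non-digit) narrowly beats token 3, so unconstrained greedy decoding emits garbage, while the masked pick recovers the intended digit.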
iamjackg • Mar 10, 2026
I find the concept of LLM "brain surgery" fascinating, precisely because of how opaque the network is. One of the first things I did back when llama.cpp first got vision model support was hack the code to zero out (or otherwise modify) random numbers in the image embedding generated by the projector, and then ask the LLM to describe the image. It was absolutely fascinating.

It would go from a normal description of the item in the picture to suddenly seeing people clapping in the background that were not there, or making up some other stuff. I kinda stopped after a while, but I should pick that back up and do a more coherent experiment to see if I can find any correlation between vector dimensions and "meaning."
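The perturbation described is simple to reproduce in principle. A sketch, assuming the projector's output is just an array of image-token embeddings (the shapes and ablated dimensions here are arbitrary):

```python
import numpy as np

def ablate_dims(embeddings, dims):
    """Zero chosen dimensions of every image-token embedding before it is
    handed to the LLM -- the kind of perturbation the comment describes."""
    out = embeddings.copy()
    out[:, dims] = 0.0
    return out

rng = np.random.default_rng(1)
emb = rng.standard_normal((16, 64))   # toy projector output: 16 image tokens, 64 dims
probe = ablate_dims(emb, dims=[3, 17, 42])
assert np.all(probe[:, [3, 17, 42]] == 0.0)
assert np.any(emb[:, 3] != 0.0)       # original embeddings left untouched
```

A more systematic version of the experiment would sweep `dims` one dimension at a time and diff the resulting descriptions, which is essentially the correlation search the commenter says they should pick back up.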
Havoc • Mar 10, 2026
Crazy writeup. The author is right about the base64 part. It does seem weird that the model can decode and understand it at the same time. And I guess what makes it weird is that we just sort of accept this for, say, English and German (i.e. normal use), but when it's framed as base64 it suddenly stops feeling intuitive.
hmokiguess • Mar 10, 2026
I really enjoyed reading this. I feel like generalists intuitively experience this exact thing so much throughout their lives, because they must have this neuroanatomy you describe. There’s a certain geometry to knowledge that makes this orthogonal movement possible, and it is really fascinating to me. Thank you for publishing this, you made my day!
Balinares • Mar 10, 2026
The idea that there may be a cognitive lingua franca hiding in the layers is fascinating and gives me hope for a neat idea: pluggable knowledge banks.

MoE notwithstanding, a model trained on the whole Internet and a few hundred thousand stolen books carries way more knowledge than is actually needed for any given workflow. It would be great if we could ship slimmed-down models into which we'd plug the knowledge banks useful for today's work, and only those.

It would also mean that you could keep a model's knowledge fresh without retraining the whole of it.
rapatel0 • Mar 10, 2026
I think you may have cracked latent space reasoning. I've had a hunch that something like this would work, but couldn't figure out how the training would backpropagate. But you've shown that you just need to duplicate existing layers.

Have you tried a simple inline loop over the duplicated layers? Would be interesting to see the performance. Also, it would be interesting to compare with a MoE model, to see if these layers are acting like different agreeing "experts" or if there is reasoning happening in the latent space.
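The "inline loop" variant can be sketched abstractly: treat the model as a list of layer functions and repeat a middle slice at inference time, in the spirit of the recurrent-depth approach cited elsewhere in the thread. Everything here (the slice, the loop count, the trivial trace "layers") is illustrative:

```python
def forward_recurrent(x, layers, loop_range, n_loops, step):
    """Run a layer stack, looping the slice loop_range n_loops times at
    inference -- extra depth from a shared block, without new parameters."""
    start, stop = loop_range
    for layer in layers[:start]:          # prefix, run once
        x = step(x, layer)
    for _ in range(n_loops):              # repeated passes through the block
        for layer in layers[start:stop]:
            x = step(x, layer)
    for layer in layers[stop:]:           # suffix, run once
        x = step(x, layer)
    return x

# Trace which "layers" fire: the b-c block runs twice, nothing else changes.
trace = forward_recurrent([], list("abcde"), loop_range=(1, 3), n_loops=2,
                          step=lambda x, layer: x + [layer])
assert trace == list("abcbcde")
```

With `n_loops=1` this reduces to the plain forward pass, so the loop count becomes a pure inference-time knob, which is what makes the comparison the commenter asks for cheap to run.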

Discovery Source

Hacker News

Aggregated via automated community intelligence tracking.

Tech Stack Dependencies

No direct open-source NPM package mentions detected in the product documentation.

Media Tractions & Mentions

No mainstream media stories specifically mentioning this product name have been intercepted yet.

Deep Research & Science

No direct peer-reviewed scientific literature matched with this product's architecture.