Research Presentation · 2026

Finding the Cheapest Path
to 95%

Can a hierarchical expert network beat a flat classifier on resource efficiency while keeping accuracy high enough to matter?

95.63% Compact HEN Accuracy
3.01M Parameters
0.217 GFLOPs
9.3× Smaller than Flat CNN
Thesis

The core question

The compact Joint HEN with MobileNetV3-Large reached the same accuracy band as the classic 3-level HEN, but with a radically smaller parameter budget. Flat ConvNeXt-Tiny remains the peak accuracy reference at 96.89%, but at nearly 10× the parameter count.

🏆

Compact Joint HEN

Shared MobileNetV3-Large backbone with hierarchical probability heads. Best engineering answer for tight budgets.

95.63% · 3.01M params
📐

Flat ConvNeXt-Tiny

Strong and simple single-backbone baseline. Best raw accuracy, but no built-in hierarchy or modular update path.

96.89% · 27.84M params
📦

Classic 3-Level HEN

Proved the hierarchy concept works—but duplicated full CNN backbones for every router and expert. Impractical storage cost.

95.63% · 145.31M params

Why 95%

Why the target is 95%, not perfection

Before comparing architecture efficiency, the evaluation needs a practical target. Some validation images mix multiple strong subjects: a bird with fruit, food beside a dominant object, or produce in a market scene. These can be visually reasonable for a human, yet still be counted wrong by the label protocol.

Some errors reflect task ambiguity as much as model weakness. Beyond 95%, the model is increasingly competing with label noise, mixed-object scenes, and annotation edge cases.

That is why this report treats 95% as the useful engineering line: high enough to prove the architecture works, but not so high that the comparison is dominated by unclear labels.


Hierarchy Map

The task is already a 3-9-27 tree

The HEN architecture mirrors the dataset structure: first decide the broad domain, then the mid-level family, then the exact leaf class. This is the reason a three-level model is meaningful rather than arbitrary.

3 coarse groups
9 mid-level groups
27 leaf classes

Level 1 · Coarse

Animal · living subjects
Vehicle · transport objects
Food · edible objects

Level 2 · Mid

Animal
bird · canine · feline
Vehicle
aircraft · road vehicle · watercraft
Food
fruit · vegetable · prepared food

Level 3 · Leaf

Animal leaves
goldfinch · robin · bald eagle · golden retriever · german shepherd · doberman · persian cat · siamese cat · egyptian cat
Vehicle leaves
airliner · airship · warplane · ambulance · fire engine · sports car · canoe · container ship · speedboat
Food leaves
lemon · banana · pomegranate · broccoli · cauliflower · cucumber · cheeseburger · hotdog · pizza

This tree is the core modularity target: adding a new leaf should ideally touch its local branch while leaving sibling branches intact.
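The 3-9-27 tree and its modularity property can be sketched directly from the hierarchy map above. This is an illustrative data-structure sketch, not code from the report; the class names come from the Level 3 list, and the appended `"apple"` leaf is a hypothetical example of a local branch update.

```python
# The 3-9-27 label tree from the hierarchy map (coarse -> mid -> leaves).
TREE = {
    "animal": {
        "bird": ["goldfinch", "robin", "bald eagle"],
        "canine": ["golden retriever", "german shepherd", "doberman"],
        "feline": ["persian cat", "siamese cat", "egyptian cat"],
    },
    "vehicle": {
        "aircraft": ["airliner", "airship", "warplane"],
        "road vehicle": ["ambulance", "fire engine", "sports car"],
        "watercraft": ["canoe", "container ship", "speedboat"],
    },
    "food": {
        "fruit": ["lemon", "banana", "pomegranate"],
        "vegetable": ["broccoli", "cauliflower", "cucumber"],
        "prepared food": ["cheeseburger", "hotdog", "pizza"],
    },
}

def path_to_leaf(tree, leaf):
    """Return the (coarse, mid, leaf) routing path for a leaf class."""
    for coarse, mids in tree.items():
        for mid, leaves in mids.items():
            if leaf in leaves:
                return (coarse, mid, leaf)
    raise KeyError(leaf)

# The modularity target in one line: adding a hypothetical new leaf
# touches only its local branch, leaving sibling branches intact.
TREE["food"]["fruit"].append("apple")
```

The lookup makes the routing semantics concrete: classifying "pizza" means deciding food, then prepared food, then the leaf.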


Architecture

Architecture evolution

The most important shift was not adding more heads—it was learning where the model should share representation and where it should specialize.

🔲

Flat CNN

One full feature extractor feeds one 27-way classifier. Strong and simple, but no built-in hierarchy or modular update path.

96.89% peak
🌲

Classic 3-Level HEN

Multiple independent routers and leaf experts, each with its own full CNN backbone. Modular but storage-heavy.

Hierarchy ✓ · Economical ✗
🔍

Coarse-to-Fine HEN

Light router + stronger downstream experts. Top-level routing exceeded 99%, but leaf-level accuracy plateaued.

Intuitive design

Compact Joint HEN

One shared backbone, multiple hierarchical probability heads trained jointly. Hierarchy + efficiency in one shot.

95.63% · Best overall
🔀

Common-Delta Branches

Separated common and difference features per branch group. Improved hard clusters like aircraft and prepared food locally.

Targeted improvement

*Estimated or backbone-only compute for exploratory branches; final verified comparisons are Flat ConvNeXt-Tiny, Classic 3-Level HEN, and Compact Joint HEN.
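The Compact Joint HEN card above describes one shared backbone feeding hierarchical probability heads trained jointly. A minimal numerical sketch of how such heads can compose a leaf probability as P(coarse) × P(mid | coarse) × P(leaf | mid); the logits, head sizes, and composition rule here are illustrative assumptions, not the report's implementation.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

# Illustrative logits from three heads over one shared feature vector.
coarse = softmax([2.0, 0.1, -1.0])           # P(coarse): animal / vehicle / food
mid_given_coarse = softmax([1.5, 0.2, 0.0])  # P(mid | coarse=animal): bird / canine / feline
leaf_given_mid = softmax([0.3, 2.2, -0.5])   # P(leaf | mid=bird): goldfinch / robin / bald eagle

# Joint leaf probability along one path of the 3-9-27 tree.
p_robin = coarse[0] * mid_given_coarse[0] * leaf_given_mid[1]
```

Because every level is supervised on the same shared features, the leaf score stays a product of calibrated conditionals rather than the output of a committed hard route.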


Journey

The path we took

The final result came from a sequence of very practical disappointments. Each one removed a tempting explanation and made the next design sharper.

1

Classic HEN worked, but was too heavy

The original 3-level HEN proved the hierarchy made sense. The cost was brutal: full backbones stored for every router and expert. Modular, but not economical.

95.63%
2

Shared-backbone modular HEN saved parameters

Sharing the feature extractor solved storage, but branch heads became too small relative to the backbone. Modularity preserved, but leaf experts weren't expert enough.

94.67%
3

Coarse-to-fine exposed the bottleneck

A light router plus stronger downstream experts matched natural intuition. Top routing reached 99%+, but the leaf layer could not recover enough accuracy.

92.37%
4

Food mid-level errors resisted obvious fixes

Higher resolution, wider mid heads, conservative crops, and attention crops all failed to move the food branch. The remaining errors appeared partly semantic and data-driven.

95.78% food mid
5

Compact Joint HEN crossed the line

A shared MobileNetV3-Large with hierarchical probability heads kept the hierarchy, removed most duplicate compute, and crossed the 95% target.

95.63% ✓

Difficulties & Fixes

What failed, and why that matters

The failure map explains why the final architecture is shaped the way it is.

⚠️

Hard routing was brittle

Argmax routing meant an early mistake was unrecoverable. The joint model softened this by supervising all levels from the same shared feature representation.

💾

Classic modularity was expensive

Independent routers and leaf experts duplicated whole CNN backbones. Compact HEN moved most capacity into one shared feature extractor.

🧠

Leaf experts needed real signal

Tiny branch heads saved parameters but underfit hard sibling classes. MobileNetV3-Large provided enough feature quality without returning to huge storage cost.

🍔

Food was not fixed by resolution

Food mid-level accuracy did not improve with 192px inputs, 512-dim hidden layers, or conservative crops. More pixels were not the missing ingredient.

🔲

Attention crop was too weak

Rule-based heatmap crops found salient regions, but not necessarily the labeled object. Ambiguity was often about which object should count.

✂️

Common-delta helped locally

Common vs. difference features improved some hard branches like aircraft and prepared food, but did not beat the simpler compact joint model overall.
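The "hard routing was brittle" failure above can be made concrete with a toy example: argmax routing commits to one branch, so a slightly wrong router zeroes out the true leaf, while soft composition keeps probability mass on every path. All numbers here are invented for illustration.

```python
import math

def softmax(logits):
    m = max(logits)
    e = [math.exp(x - m) for x in logits]
    s = sum(e)
    return [v / s for v in e]

# Router is slightly wrong: the true coarse class is branch 1,
# but the router narrowly prefers branch 0.
router = softmax([1.1, 1.0, -2.0])

# Per-branch leaf distributions (two leaves per branch for brevity).
branch_leaves = [
    softmax([0.1, 0.0]),   # branch 0: weak preference
    softmax([3.0, -1.0]),  # branch 1: confident in its first leaf
    softmax([0.0, 0.0]),   # branch 2: uniform
]

# Hard routing: argmax commits to branch 0, so every leaf in
# branch 1 (including the true one) receives zero probability.
hard_branch = router.index(max(router))

# Soft composition: each leaf keeps mass P(branch) * P(leaf | branch),
# so the confident expert in branch 1 can still win overall.
joint = [router[b] * p for b in range(3) for p in branch_leaves[b]]
best_leaf = joint.index(max(joint))  # global index; leaves 2-3 belong to branch 1
```

In this toy case the hard route picks the wrong branch while the soft joint distribution still ranks the branch-1 leaf first, which is the recovery behavior the joint model relies on.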


Resources

Resource story

At the 95% threshold, compact HEN wins on verified resource efficiency. Flat ConvNeXt-Tiny still wins on peak accuracy.

Parameters (M)

Common-Delta HEN
1.74M
Compact Joint HEN
3.01M
Flat ConvNeXt-Tiny
27.84M
Coarse-to-Fine HEN
36.23M
Classic 3-Level HEN
145.31M

Compute (GFLOPs)

Common-Delta HEN
0.057*
Compact Joint HEN
0.217
Coarse-to-Fine HEN
~0.64*
Flat ConvNeXt-Tiny
4.456
Classic 3-Level HEN
5.442

*Common-Delta is backbone-dominated. Coarse-to-Fine compute is an active-path estimate from its 128px router plus one 128px ResNet18-style expert path; it did not reach the 95% target.
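The headline efficiency ratios follow directly from the verified rows in the two tables above; a quick arithmetic check (model keys are shorthand labels, not identifiers from the report):

```python
# Verified budgets from the parameter and compute tables above.
params_m = {
    "compact_joint_hen": 3.01,
    "flat_convnext_tiny": 27.84,
    "classic_3level_hen": 145.31,
}
gflops = {
    "compact_joint_hen": 0.217,
    "flat_convnext_tiny": 4.456,
    "classic_3level_hen": 5.442,
}

# Compact Joint HEN vs. the flat peak-accuracy baseline.
param_ratio = params_m["flat_convnext_tiny"] / params_m["compact_joint_hen"]
flop_ratio = gflops["flat_convnext_tiny"] / gflops["compact_joint_hen"]

# Compact Joint HEN vs. the classic modular design it replaces.
storage_saving = params_m["classic_3level_hen"] / params_m["compact_joint_hen"]
```

The parameter ratio lands just above 9×, the compute ratio above 20×, and the storage saving over the classic 3-level HEN is roughly 48× at the same 95.63% accuracy.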


Conclusion

Final take

The hierarchy is worth keeping when it reduces resource cost without sacrificing the accuracy band that matters.

Peak Accuracy Reference

Flat CNN remains the strongest baseline for maximum accuracy at 96.89%—simple, powerful, and easy to train.

Engineering Answer ✓

Compact Joint HEN is the stronger choice under tight parameter and compute budgets, hitting 95.63% with 3.01M params and 0.217 GFLOPs.
