Can a hierarchical expert network beat a flat classifier on resource efficiency while keeping accuracy high enough to matter?
The compact Joint HEN with MobileNetV3-Large reached the same accuracy band as the classic 3-level HEN, but with a radically smaller parameter budget. Flat ConvNeXt-Tiny remains the peak accuracy reference at 96.89%, but at nearly 10× the parameter count.
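The "nearly 10×" figure follows directly from the quoted parameter budgets. A quick check, using only numbers stated in this report:

```python
# Parameter budgets (millions of params) from the report's verified comparisons.
params_m = {
    "Compact Joint HEN": 3.01,
    "Flat ConvNeXt-Tiny": 27.84,
    "Classic 3-Level HEN": 145.31,
}

# Flat ConvNeXt-Tiny vs. Compact Joint HEN: ~9.25x, i.e. "nearly 10x".
flat_over_compact = params_m["Flat ConvNeXt-Tiny"] / params_m["Compact Joint HEN"]

# Classic 3-Level HEN vs. Compact Joint HEN: ~48x, the duplicated-backbone cost.
classic_over_compact = params_m["Classic 3-Level HEN"] / params_m["Compact Joint HEN"]
```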
Compact Joint HEN · 95.63% · 3.01M params
Shared MobileNetV3-Large backbone with hierarchical probability heads. Best engineering answer for tight budgets.

Flat ConvNeXt-Tiny · 96.89% · 27.84M params
Strong and simple single-backbone baseline. Best raw accuracy, but no built-in hierarchy or modular update path.

Classic 3-Level HEN · 95.63% · 145.31M params
Proved the hierarchy concept works, but duplicated full CNN backbones for every router and expert. Impractical storage cost.

Before comparing architecture efficiency, the evaluation needs a practical target. Some validation images mix multiple strong subjects: a bird with fruit, food beside a dominant object, or produce in a market scene. These can be visually reasonable for a human, yet still be counted wrong by the label protocol.
That is why this report treats 95% as the useful engineering line: high enough to prove the architecture works, but not so high that the comparison is dominated by unclear labels.
The HEN architecture mirrors the dataset structure: first decide the broad domain, then the mid-level family, then the exact leaf class. This is the reason a three-level model is meaningful rather than arbitrary.
This tree is the core modularity target: adding a new leaf should ideally touch its local branch while leaving sibling branches intact.
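The three-level structure can be made concrete with a toy tree. Note the class names below are illustrative placeholders, not the dataset's actual 27 leaves:

```python
# Hypothetical 3-level label tree: domain -> family -> leaf.
# The real dataset has 27 leaf classes; these names are placeholders.
TREE = {
    "animal": {"bird":     ["sparrow", "parrot"],
               "mammal":   ["fox", "deer"]},
    "food":   {"fruit":    ["apple", "banana"],
               "prepared": ["pizza", "soup"]},
}

def leaf_paths(tree):
    """Enumerate every root-to-leaf path as (domain, family, leaf)."""
    return [(top, mid, leaf)
            for top, mids in tree.items()
            for mid, leaves in mids.items()
            for leaf in leaves]

# The modularity target in action: adding a new leaf touches only
# its local branch; sibling branches are untouched.
TREE["food"]["fruit"].append("mango")
```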
The most important shift was not adding more heads—it was learning where the model should share representation and where it should specialize.
Flat CNN · 96.89% peak
One full feature extractor feeds one 27-way classifier. Strong and simple, but no built-in hierarchy or modular update path.

Classic 3-Level HEN · Hierarchy ✓ · Economical ✗
Multiple independent routers and leaf experts, each with its own full CNN backbone. Modular but storage-heavy.

Coarse-to-Fine · Intuitive design
Light router + stronger downstream experts. Top routing hit 99%+, but leaf layer accuracy hit a ceiling.

Compact Joint HEN · 95.63% · Best overall
One shared backbone, multiple hierarchical probability heads trained jointly. Hierarchy + efficiency in one shot.

Common-Delta · Targeted improvement
Separated common and difference features per branch group. Improved hard clusters like aircraft and prepared food locally.

*Estimated or backbone-only compute for exploratory branches; final verified comparisons are Flat ConvNeXt-Tiny, Classic 3-Level HEN, and Compact Joint HEN.
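One way to read "hierarchical probability heads" is that each level gets its own softmax over the same shared features, and a full root-to-leaf path is scored by the product of its per-level probabilities. The composition rule below is an assumption for illustration, not a detail the report specifies:

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy logits for three heads; in the real model every head would read the
# same shared MobileNetV3-Large features. Values are placeholders.
top_probs  = softmax([2.0, 0.5])        # e.g. 2 top-level domains
mid_probs  = softmax([1.0, 1.0, 0.0])   # e.g. 3 mid-level families
leaf_probs = softmax([0.2, 0.1, 3.0])   # e.g. 3 leaf classes

def path_score(t, m, l):
    """Probability of one root-to-leaf path under independent per-level heads."""
    return top_probs[t] * mid_probs[m] * leaf_probs[l]
```

Because each head is a proper distribution, the scores over all paths still sum to one, which keeps the levels consistent without hard routing.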
The final result came from a sequence of very practical disappointments. Each one removed a tempting explanation and made the next design sharper.
Classic 3-Level HEN · 95.63%
The original 3-level HEN proved the hierarchy made sense. The cost was brutal: full backbones stored for every router and expert. Modular, but not economical.

Shared backbone, small heads · 94.67%
Sharing the feature extractor solved storage, but branch heads became too small relative to the backbone. Modularity preserved, but leaf experts weren't expert enough.

Coarse-to-Fine · 92.37%
A light router plus stronger downstream experts matched natural intuition. Top routing reached 99%+, but the leaf layer could not recover enough accuracy.

Food-branch tuning · 95.78% food mid
Higher resolution, wider mid heads, conservative crops, and attention crops all failed to move the food branch. The remaining errors appeared partly semantic and data-driven.

Compact Joint HEN · 95.63% ✓
A shared MobileNetV3-Large with hierarchical probability heads kept the hierarchy, removed most duplicate compute, and crossed the 95% target.

The failure map explains why the final architecture is shaped the way it is.
Argmax routing meant an early mistake was unrecoverable. The joint model softened this by supervising all levels from the same shared feature representation.
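The contrast can be sketched as two decoding rules over the same per-level heads (a simplified sketch, not the report's training code): hard routing commits to an argmax at each level, so a wrong early pick is final, while joint supervision sums per-level losses so every head receives gradient even when an upstream level is wrong.

```python
import math

def cross_entropy(probs, target):
    """Negative log-likelihood of the target class."""
    return -math.log(probs[target])

def joint_loss(level_probs, level_targets):
    """Supervise all levels at once from shared features: an early-level
    mistake still leaves the deeper heads trainable."""
    return sum(cross_entropy(p, t) for p, t in zip(level_probs, level_targets))

def hard_route(level_probs):
    """Argmax routing: commit to the best class at each level.
    A wrong early pick cannot be recovered downstream."""
    return [max(range(len(p)), key=p.__getitem__) for p in level_probs]

# Toy per-level distributions for one sample (top, mid, leaf).
probs = [[0.5, 0.5], [0.25, 0.75], [0.1, 0.9]]
```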
Independent routers and leaf experts duplicated whole CNN backbones. Compact HEN moved most capacity into one shared feature extractor.
Tiny branch heads saved parameters but underfit hard sibling classes. MobileNetV3-Large provided enough feature quality without returning to huge storage cost.
Food mid-level accuracy did not improve from 192px input, 512 hidden dims, or conservative crops. More pixels were not the missing ingredient.
Rule-based heatmap crops found salient regions, but not necessarily the labeled object. Ambiguity was often about which object should count.
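A minimal version of such a rule-based crop (an illustrative sketch; the report does not specify its exact crop rule): slide a window over an activation heatmap and keep the hottest one. The failure mode above follows directly, since the hottest region need not contain the labeled object.

```python
def peak_window(heatmap, size):
    """Return the (row, col) top-left corner of the size x size window
    with the highest total activation - a rule-based 'salient crop'."""
    h, w = len(heatmap), len(heatmap[0])
    best, best_rc = float("-inf"), (0, 0)
    for r in range(h - size + 1):
        for c in range(w - size + 1):
            s = sum(heatmap[rr][cc]
                    for rr in range(r, r + size)
                    for cc in range(c, c + size))
            if s > best:
                best, best_rc = s, (r, c)
    return best_rc
```

If the labeled object sits in a cooler region than a co-occurring subject, the crop lands on the wrong object, which matches the ambiguity the report describes.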
Common vs. difference features improved some hard branches like aircraft and prepared food, but did not beat the simpler compact joint model overall.
At the 95% threshold, compact HEN wins on verified resource efficiency. Flat ConvNeXt-Tiny still wins on peak accuracy.
*Common-Delta is backbone-dominated. Coarse-to-Fine compute is an active-path estimate from its 128px router plus one 128px ResNet18-style expert path; it did not reach the 95% target.
The hierarchy is worth keeping when it reduces resource cost without sacrificing the accuracy band that matters.
Flat CNN remains the strongest baseline for maximum accuracy at 96.89%—simple, powerful, and easy to train.
Compact Joint HEN is the stronger choice under tight parameter and compute budgets, hitting 95.63% with 3.01M params and 0.217 GFLOPs.