Can a hierarchical expert network beat a flat classifier on resource efficiency while keeping accuracy high enough to matter?
The compact Joint HEN with MobileNetV3-Large reached the same accuracy band as the classic 3-level HEN, but with a radically smaller parameter budget. Flat ConvNeXt-Tiny remains the peak accuracy reference at 96.89%, but at nearly 10× the parameter count.
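The "nearly 10×" figure follows directly from the quoted parameter budgets. A quick check, using only numbers stated in this report:

```python
# Parameter budgets (millions of params) from the report's verified comparisons.
params_m = {
    "Compact Joint HEN": 3.01,
    "Flat ConvNeXt-Tiny": 27.84,
    "Classic 3-Level HEN": 145.31,
}

# Flat ConvNeXt-Tiny vs. Compact Joint HEN: ~9.25x, i.e. "nearly 10x".
flat_over_compact = params_m["Flat ConvNeXt-Tiny"] / params_m["Compact Joint HEN"]

# Classic 3-Level HEN vs. Compact Joint HEN: ~48x, the duplicated-backbone cost.
classic_over_compact = params_m["Classic 3-Level HEN"] / params_m["Compact Joint HEN"]
```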
Compact Joint HEN · 95.63% · 3.01M params
Shared MobileNetV3-Large backbone with hierarchical probability heads. Best engineering answer for tight budgets.

Flat ConvNeXt-Tiny · 96.89% · 27.84M params
Strong and simple single-backbone baseline. Best raw accuracy, but no built-in hierarchy or modular update path.

Classic 3-Level HEN · 95.63% · 145.31M params
Proved the hierarchy concept works, but duplicated full CNN backbones for every router and expert. Impractical storage cost.

Before comparing architecture efficiency, the evaluation needs a practical target. Some validation images mix multiple strong subjects: a bird with fruit, food beside a dominant object, or produce in a market scene. These can be visually reasonable for a human, yet still be counted wrong by the label protocol.
That is why this report treats 95% as the useful engineering line: high enough to prove the architecture works, but not so high that the comparison is dominated by unclear labels.
The HEN architecture mirrors the dataset structure: first decide the broad domain, then the mid-level family, then the exact leaf class. This is the reason a three-level model is meaningful rather than arbitrary.
This tree is the core modularity target: adding a new leaf should ideally touch its local branch while leaving sibling branches intact.
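The three-level structure can be made concrete with a toy tree. Note the class names below are illustrative placeholders, not the dataset's actual 27 leaves:

```python
# Hypothetical 3-level label tree: domain -> family -> leaf.
# The real dataset has 27 leaf classes; these names are placeholders.
TREE = {
    "animal": {"bird":     ["sparrow", "parrot"],
               "mammal":   ["fox", "deer"]},
    "food":   {"fruit":    ["apple", "banana"],
               "prepared": ["pizza", "soup"]},
}

def leaf_paths(tree):
    """Enumerate every root-to-leaf path as (domain, family, leaf)."""
    return [(top, mid, leaf)
            for top, mids in tree.items()
            for mid, leaves in mids.items()
            for leaf in leaves]

# The modularity target in action: adding a new leaf touches only
# its local branch; sibling branches are untouched.
TREE["food"]["fruit"].append("mango")
```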
The most important shift was not adding more heads—it was learning where the model should share representation and where it should specialize.
Flat CNN · 96.89% peak
One full feature extractor feeds one 27-way classifier. Strong and simple, but no built-in hierarchy or modular update path.

Classic 3-Level HEN · Hierarchy ✓ · Economical ✗
Multiple independent routers and leaf experts, each with its own full CNN backbone. Modular but storage-heavy.

Coarse-to-Fine · Intuitive design
Light router + stronger downstream experts. Top routing hit 99%+, but leaf layer accuracy hit a ceiling.

Compact Joint HEN · 95.63% · Best overall
One shared backbone, multiple hierarchical probability heads trained jointly. Hierarchy + efficiency in one shot.

Common-Delta · Targeted improvement
Separated common and difference features per branch group. Improved hard clusters like aircraft and prepared food locally.

*Estimated or backbone-only compute for exploratory branches; final verified comparisons are Flat ConvNeXt-Tiny, Classic 3-Level HEN, and Compact Joint HEN.
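One way to read "hierarchical probability heads" is that each level gets its own softmax over the same shared features, and a full root-to-leaf path is scored by the product of its per-level probabilities. The composition rule below is an assumption for illustration, not a detail the report specifies:

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy logits for three heads; in the real model every head would read the
# same shared MobileNetV3-Large features. Values are placeholders.
top_probs  = softmax([2.0, 0.5])        # e.g. 2 top-level domains
mid_probs  = softmax([1.0, 1.0, 0.0])   # e.g. 3 mid-level families
leaf_probs = softmax([0.2, 0.1, 3.0])   # e.g. 3 leaf classes

def path_score(t, m, l):
    """Probability of one root-to-leaf path under independent per-level heads."""
    return top_probs[t] * mid_probs[m] * leaf_probs[l]
```

Because each head is a proper distribution, the scores over all paths still sum to one, which keeps the levels consistent without hard routing.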
The final result came from a sequence of very practical disappointments. Each one removed a tempting explanation and made the next design sharper.
Classic 3-Level HEN · 95.63%
The original 3-level HEN proved the hierarchy made sense. The cost was brutal: full backbones stored for every router and expert. Modular, but not economical.

Shared backbone, small heads · 94.67%
Sharing the feature extractor solved storage, but branch heads became too small relative to the backbone. Modularity preserved, but leaf experts weren't expert enough.

Coarse-to-Fine · 92.37%
A light router plus stronger downstream experts matched natural intuition. Top routing reached 99%+, but the leaf layer could not recover enough accuracy.

Food-branch tuning · 95.78% food mid
Higher resolution, wider mid heads, conservative crops, and attention crops all failed to move the food branch. The remaining errors appeared partly semantic and data-driven.

Compact Joint HEN · 95.63% ✓
A shared MobileNetV3-Large with hierarchical probability heads kept the hierarchy, removed most duplicate compute, and crossed the 95% target.

The failure map explains why the final architecture is shaped the way it is.
Argmax routing meant an early mistake was unrecoverable. The joint model softened this by supervising all levels from the same shared feature representation.
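The contrast can be sketched as two decoding rules over the same per-level heads (a simplified sketch, not the report's training code): hard routing commits to an argmax at each level, so a wrong early pick is final, while joint supervision sums per-level losses so every head receives gradient even when an upstream level is wrong.

```python
import math

def cross_entropy(probs, target):
    """Negative log-likelihood of the target class."""
    return -math.log(probs[target])

def joint_loss(level_probs, level_targets):
    """Supervise all levels at once from shared features: an early-level
    mistake still leaves the deeper heads trainable."""
    return sum(cross_entropy(p, t) for p, t in zip(level_probs, level_targets))

def hard_route(level_probs):
    """Argmax routing: commit to the best class at each level.
    A wrong early pick cannot be recovered downstream."""
    return [max(range(len(p)), key=p.__getitem__) for p in level_probs]

# Toy per-level distributions for one sample (top, mid, leaf).
probs = [[0.5, 0.5], [0.25, 0.75], [0.1, 0.9]]
```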
Independent routers and leaf experts duplicated whole CNN backbones. Compact HEN moved most capacity into one shared feature extractor.
Tiny branch heads saved parameters but underfit hard sibling classes. MobileNetV3-Large provided enough feature quality without returning to huge storage cost.
Food mid-level accuracy did not improve from 192px input, 512 hidden dims, or conservative crops. More pixels were not the missing ingredient.
Rule-based heatmap crops found salient regions, but not necessarily the labeled object. Ambiguity was often about which object should count.
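A minimal version of such a rule-based crop (an illustrative sketch; the report does not specify its exact crop rule): slide a window over an activation heatmap and keep the hottest one. The failure mode above follows directly, since the hottest region need not contain the labeled object.

```python
def peak_window(heatmap, size):
    """Return the (row, col) top-left corner of the size x size window
    with the highest total activation - a rule-based 'salient crop'."""
    h, w = len(heatmap), len(heatmap[0])
    best, best_rc = float("-inf"), (0, 0)
    for r in range(h - size + 1):
        for c in range(w - size + 1):
            s = sum(heatmap[rr][cc]
                    for rr in range(r, r + size)
                    for cc in range(c, c + size))
            if s > best:
                best, best_rc = s, (r, c)
    return best_rc
```

If the labeled object sits in a cooler region than a co-occurring subject, the crop lands on the wrong object, which matches the ambiguity the report describes.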
Common vs. difference features improved some hard branches like aircraft and prepared food, but did not beat the simpler compact joint model overall.
At the 95% threshold, compact HEN wins on verified resource efficiency. Flat ConvNeXt-Tiny still wins on peak accuracy.
*Common-Delta is backbone-dominated. Coarse-to-Fine compute is an active-path estimate from its 128px router plus one 128px ResNet18-style expert path; it did not reach the 95% target.
The hierarchy is worth keeping when it reduces resource cost without sacrificing the accuracy band that matters.
Flat CNN remains the strongest baseline for maximum accuracy at 96.89%—simple, powerful, and easy to train.
Compact Joint HEN is the stronger choice under tight parameter and compute budgets, hitting 95.63% with 3.01M params and 0.217 GFLOPs.