This report summarizes the current project pause point and compares the strongest flat CNN baselines against the strongest HEN variants on the 27-class hierarchical benchmark. It also explains why 95% accuracy became the practical target, how much compute and parameter budget each side consumes, and why some remaining failures are better interpreted as semantic ambiguity than as pure model weakness.

Headline results:

- Best flat accuracy: ConvNeXt-Tiny, 96.89%, 27.84M parameters, 4.456 GFLOPs
- Best compact HEN: Joint HEN + MobileNetV3-Large, 95.63%, 3.01M parameters, about 0.217 GFLOPs

The cheapest verified 95%+ solution is now HEN, not flat CNN. Compared with the smallest verified flat 95%+ model, the best compact HEN uses roughly 9x fewer parameters and roughly 20x less compute.
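The size of those margins can be reproduced with simple arithmetic from the figures quoted in this report (the variable names below are illustrative):

```python
# Budgets quoted in this report for the two 95%+ reference models.
flat_params_m, flat_gflops = 27.84, 4.456   # ConvNeXt-Tiny
hen_params_m, hen_gflops = 3.01, 0.217      # Joint HEN + MobileNetV3-Large

param_ratio = flat_params_m / hen_params_m
compute_ratio = flat_gflops / hen_gflops

print(f"parameter reduction: {param_ratio:.1f}x")  # ~9.2x
print(f"compute reduction:   {compute_ratio:.1f}x")  # ~20.5x
```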
Two conclusions are now stable. First, flat CNN remains the best route for absolute peak accuracy: ConvNeXt-Tiny reaches 96.89% and is still the highest result in the project. Second, if the goal is not "maximum possible accuracy at any cost" but rather "cross 95% with the smallest practical resource budget," HEN now has the better answer: Joint HEN with a MobileNetV3-Large backbone reaches 95.63% with only 3.01M parameters.
- ConvNeXt-Tiny remains the strongest overall model at 96.89%.
- Joint HEN + MobileNetV3-Large reaches 95.63% at only 3.01M parameters.
- Hierarchical routing and branch-level specialization remain HEN's natural strengths.
The 95% target is not arbitrary. It emerged because the validation set contains a visible amount of semantic ambiguity and context-heavy labeling. Some images are nominally labeled by the target object category, but the dominant visual subject is something else, or the scene mixes multiple strong concepts. In such a setting, chasing the last few percent can quickly become a fight against labeling semantics rather than a clean measurement of representation quality.
The most important comparison is not "best flat versus best HEN at any cost," but rather "what is the smallest verified model on each side that reaches at least 95% accuracy?" That comparison is now quite favorable to HEN.
| Family | Model | Accuracy | Parameters | Compute | Interpretation |
|---|---|---|---|---|---|
| Flat | ConvNeXt-Tiny | 96.89% | 27.84M | 4.456 GFLOPs | Smallest verified flat model that clearly exceeds 95%. |
| HEN | Joint HEN + MobileNetV3-Large | 95.63% | 3.01M | 0.217 GFLOPs | Smallest verified HEN model that exceeds 95%. |
| HEN classic | Classic 3-Level HEN | 95.63% | 145.31M total (33.53M active path) | about 5.442 GFLOPs active | Proof that hierarchical experts work, but much too expensive as a final deployment answer. |
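The gap between the classic HEN's total and active-path parameters comes from hard routing: all experts are stored, but only one path runs per image. A minimal numpy sketch of that routing idea, with toy sizes and random weights that are purely illustrative (this is not the project's actual code):

```python
import numpy as np

rng = np.random.default_rng(0)
feat_dim, n_groups, n_subgroups, n_leaf_classes = 64, 3, 3, 3  # toy sizes

# Stored parameters: every router and every leaf expert exists on disk.
top_router = rng.standard_normal((feat_dim, n_groups))
mid_routers = rng.standard_normal((n_groups, feat_dim, n_subgroups))
leaf_experts = rng.standard_normal((n_groups, n_subgroups, feat_dim, n_leaf_classes))

def classify(feature):
    """Route one feature vector down a single active path."""
    g = int(np.argmax(feature @ top_router))      # hard top-level choice
    s = int(np.argmax(feature @ mid_routers[g]))  # hard mid-level choice
    logits = feature @ leaf_experts[g, s]         # only ONE leaf expert runs
    return g, s, int(np.argmax(logits))

feature = rng.standard_normal(feat_dim)
group, subgroup, leaf = classify(feature)

# The active path uses 1 top router + 1 mid router + 1 leaf expert,
# while storage holds all 3 mid routers and all 9 leaf experts.
stored = top_router.size + mid_routers.size + leaf_experts.size
active = top_router.size + mid_routers[0].size + leaf_experts[0, 0].size
print(f"active/stored parameter fraction: {active / stored:.2f}")
```

Even in this toy, the active fraction is well below 1, mirroring the 33.53M-active versus 145.31M-stored split in the table above.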
This table shows the main reference points that shaped the current conclusion: the best flat models, the best classic HEN, the best compact HEN, and a representative selective hard-branch refinement experiment.
| Family | Model | Architecture Idea | Accuracy | Parameters | Compute | Comment |
|---|---|---|---|---|---|---|
| Flat | ConvNeXt-Tiny | Single-path CNN, single classifier head | 96.89% | 27.84M | 4.456 GFLOPs | Best overall accuracy. |
| Flat | ResNet18 | Single-path CNN, strong baseline recipe | 94.96% | 11.19M | 1.814 GFLOPs | Very close to 95%, but still below the line. |
| Flat | MobileNetV3-Small | Ultra-light flat classifier | 94.52% | 1.55M | 0.057 GFLOPs | Shows that very small flat models struggle to cross 95%. |
| HEN classic | Classic 3-Level HEN | Top router + mid router + dedicated leaf experts | 95.63% | 145.31M total | about 5.442 GFLOPs active | Strong accuracy, very high storage cost. |
| HEN compact | Joint HEN + MobileNetV3-Large | Shared backbone + hierarchical probability heads | 95.63% | 3.01M | 0.217 GFLOPs | Current best efficiency/accuracy tradeoff. |
| HEN research line | Selective Common-Delta HEN | Parent-specific refinement + hard-branch difference heads | 93.78% | 1.74M total (0.79M branch-trained) | Backbone dominated | Helpful on some hard branches, not yet better overall. |
The project moved from a straightforward flat classifier to an expensive modular hierarchy, and then to a compact shared-backbone HEN. The sketches below make the tradeoff visible: where the compute sits, how routing happens, and why the final compact HEN became the practical 95% solution.
- **Flat CNN.** One dense path: the whole image goes through one backbone and one global classifier.
- **Classic 3-Level HEN.** True hierarchical routing: many stored experts exist, but one active path is chosen at inference time.
- **Joint HEN + MobileNetV3-Large.** The winning practical design: one shared backbone, then lightweight hierarchical heads attached to the same feature tensor.
- **Selective Common-Delta HEN.** An exploratory extension: keep the compact backbone, but add parent-specific refinement and common-delta logic only on the hardest branches.
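The shared-backbone design can be sketched in a few lines of numpy. This is a minimal illustration with toy sizes and a two-level hierarchy (3 coarse groups x 9 leaves = 27 classes); the layout, names, and shapes are assumptions, not the project's exact head structure:

```python
import numpy as np

def softmax(x, axis=-1):
    z = np.exp(x - x.max(axis=axis, keepdims=True))
    return z / z.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(1)
feat_dim = 64
n_coarse, leaves_per_coarse = 3, 9  # toy 3 x 9 = 27-class layout (assumed)

# One shared backbone feature; the heads below are the only extra parameters.
coarse_head = rng.standard_normal((feat_dim, n_coarse))
leaf_heads = rng.standard_normal((n_coarse, feat_dim, leaves_per_coarse))

def leaf_probs(feature):
    """p(leaf) = p(coarse) * p(leaf | coarse), all from one shared feature."""
    p_coarse = softmax(feature @ coarse_head)                                     # (3,)
    p_leaf_given = softmax(np.einsum('d,cdk->ck', feature, leaf_heads), axis=-1)  # (3, 9)
    return (p_coarse[:, None] * p_leaf_given).reshape(-1)                         # (27,)

feature = rng.standard_normal(feat_dim)
p = leaf_probs(feature)
print(p.shape)  # (27,) -- a valid distribution over all leaf classes
```

Because every head reads the same feature tensor, the hierarchy adds almost no compute on top of the backbone, which is why the compact HEN's budget stays near the MobileNetV3-Large baseline.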
The four examples below were exported from the project's own misclassification review folder and are embedded directly into this report. They show why accuracy does not simply scale forever with more training: some "errors" are actually boundary cases where multiple strong semantic interpretations are present in the same frame.
- Ground truth is lemon (fruit), but the frame is visually dominated by a white bird perched on a tree. This is a good example of why coarse semantic ambiguity creates a practical accuracy ceiling.
- Ground truth is pizza, but a squirrel holding the slice dominates the scene. A classifier must decide whether the task is "what object is present" or "what is the main visual subject".
- The hotdog is surrounded by lettuce and garnish. This does not look like a clean studio object crop, so some mistakes reflect a reasonable boundary interpretation rather than a simple failure to recognize the class.
- This image contains many cauliflower heads arranged in a produce-market context. The label is still valid, but the scene weakens the object-centric assumption that many classifiers implicitly rely on.
If the project needs a clean headline comparison, the recommendation is straightforward:

- ConvNeXt-Tiny: use this to represent the flat family's peak accuracy ceiling.
- Joint HEN + MobileNetV3-Large: use this to represent the HEN family's best efficiency/accuracy tradeoff.
- Selective Common-Delta HEN: keep this as the next experimental branch, not as the current final answer.