Flat CNN vs HEN Comparative Report

This report summarizes the current project pause point and compares the strongest flat CNN baselines against the strongest HEN variants on the 27-class hierarchical benchmark. It also explains why 95% accuracy became the practical target, how much compute and parameter budget each side consumes, and why some remaining failures are better interpreted as semantic ambiguity than as pure model weakness.

Date: 2026-05-01
Dataset: 27 balanced classes arranged as 3 x 3 x 3
Validation set: 1350 images
Main question: what is the most resource-efficient way to reach 95%?

Best Flat Accuracy

96.89%

ConvNeXt-Tiny, 27.84M parameters, 4.456 GFLOPs

Best HEN Accuracy at Low Cost

95.63%

Joint HEN + MobileNetV3-Large, 3.01M parameters, about 0.217 GFLOPs

95% Threshold Winner

HEN

The cheapest verified 95%+ solution is now HEN, not flat CNN.

Resource Advantage

9.3x / 20.5x

Compared with the smallest verified flat 95%+ model, the best compact HEN uses fewer parameters and less compute by large margins.

Executive Summary

Two conclusions are now stable. First, flat CNN remains the best route for absolute peak accuracy: ConvNeXt-Tiny reaches 96.89% and is still the highest result in the project. Second, if the goal is not "maximum possible accuracy at any cost" but rather "cross 95% with the smallest practical resource budget," HEN now has the better answer: Joint HEN with a MobileNetV3-Large backbone reaches 95.63% with only 3.01M parameters.

If the priority is maximum accuracy
Flat

ConvNeXt-Tiny remains the strongest overall model at 96.89%.

If the priority is 95%+ at minimum cost
HEN

Joint HEN + MobileNetV3-Large reaches 95.63% at only 3.01M parameters.

If the priority is modularity and interpretability
HEN

Hierarchical routing and branch-level specialization remain HEN's natural strengths.

The project started with a very expensive but successful classic 3-level HEN. The important new result is that the same 95.63% accuracy band is now reachable with a compact shared-backbone HEN, which changes the practical comparison completely.

Why 95% Became the Practical Target

The 95% target is not arbitrary. It emerged because the validation set contains a visible amount of semantic ambiguity and context-heavy labeling. Some images are nominally labeled by the target object category, but the dominant visual subject is something else, or the scene mixes multiple strong concepts. In such a setting, chasing the last few percent can quickly become a fight against labeling semantics rather than a clean measurement of representation quality.

Why 95% Is Meaningful

  • It is high enough to prove that the hierarchy and leaf experts work on real images, not just on easy object crops.
  • It leaves room for unavoidable ambiguity: bird plus fruit, animal plus prepared food, produce displayed in a market scene, and similar mixed-subject images.
  • It is a better engineering target than "maximize at all costs" because it lets us compare parameter and compute efficiency directly.

What the Dataset Is Asking

  • Sometimes the label is object-centric, but the image is scene-centric.
  • Sometimes the image contains the correct class, but it is not the most visually dominant thing.
  • This means a model can be "reasonable" and still be counted wrong, which naturally compresses the realistic top end.

In short: 95% is the level at which performance is already strong enough that remaining errors are often about ambiguity, context, and task definition, not just weak feature extraction.

Resource Comparison at the 95% Threshold

The most important comparison is not "best flat versus best HEN at any cost," but rather "what is the smallest verified model on each side that reaches at least 95% accuracy?" That comparison is now quite favorable to HEN.

Family | Model | Accuracy | Parameters | Compute | Interpretation
Flat | ConvNeXt-Tiny | 96.89% | 27.84M | 4.456 GFLOPs | Smallest verified flat model that clearly exceeds 95%.
HEN | Joint HEN + MobileNetV3-Large | 95.63% | 3.01M | 0.217 GFLOPs | Smallest verified HEN model that exceeds 95%.
HEN classic | Classic 3-Level HEN | 95.63% | 145.31M total (33.53M active path) | about 5.442 GFLOPs active | Proof that hierarchical experts work, but much too expensive as a final deployment answer.

At the 95% threshold, the current best compact HEN uses about 9.3x fewer parameters and about 20.5x less compute than the smallest verified flat 95%+ model, while landing in the same accuracy regime.
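As a quick sanity check, the headline ratios follow directly from the rounded figures in the table above. This is plain arithmetic on the report's own numbers, not a new measurement:

```python
# Ratios at the 95% threshold, computed from the report's own table.
flat_params_m, flat_gflops = 27.84, 4.456   # ConvNeXt-Tiny (smallest verified flat 95%+)
hen_params_m, hen_gflops = 3.01, 0.217      # Joint HEN + MobileNetV3-Large

param_ratio = flat_params_m / hen_params_m  # parameter advantage of the compact HEN
compute_ratio = flat_gflops / hen_gflops    # compute advantage of the compact HEN

print(f"{param_ratio:.2f}x fewer parameters, {compute_ratio:.2f}x less compute")
```

With these rounded inputs the ratios come out near 9.25x and 20.5x, which round to the 9.3x / 20.5x headline figures.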

Representative Models Across the Search

This table shows the main reference points that shaped the current conclusion: the best flat models, the best classic HEN, the best compact HEN, and a representative selective hard-branch refinement experiment.

Family | Model | Architecture Idea | Accuracy | Parameters | Compute | Comment
Flat | ConvNeXt-Tiny | Single-path CNN, single classifier head | 96.89% | 27.84M | 4.456 GFLOPs | Best overall accuracy.
Flat | ResNet18 | Single-path CNN, strong baseline recipe | 94.96% | 11.19M | 1.814 GFLOPs | Very close to 95%, but still below the line.
Flat | MobileNetV3-Small | Ultra-light flat classifier | 94.52% | 1.55M | 0.057 GFLOPs | Shows that very small flat models struggle to cross 95%.
HEN classic | Classic 3-Level HEN | Top router + mid router + dedicated leaf experts | 95.63% | 145.31M total | about 5.442 GFLOPs active | Strong accuracy, very high storage cost.
HEN compact | Joint HEN + MobileNetV3-Large | Shared backbone + hierarchical probability heads | 95.63% | 3.01M | 0.217 GFLOPs | Current best efficiency/accuracy tradeoff.
HEN research line | Selective Common-Delta HEN | Parent-specific refinement + hard-branch difference heads | 93.78% | 1.74M total (0.79M branch-trained) | backbone-dominated | Helpful on some hard branches, not yet better overall.

Architecture Summary

The project moved from a straightforward flat classifier to an expensive modular hierarchy, and then to a compact shared-backbone HEN. The sketches below make the tradeoff visible: where the compute sits, how routing happens, and why the final compact HEN became the practical 95% solution.

Flat CNN

One dense path: the whole image goes through one backbone and one global classifier.

Input (224 x 224 image) -> Backbone (ResNet / MobileNet / ConvNeXt; all capacity sits in one stack) -> Head (27-way logits)
Simple path, highest peak accuracy.
  • Simple, stable, and still the strongest choice when absolute accuracy is the main objective.
  • No native hierarchy: all 27 classes compete in the same representation and the same final head.

Classic 3-Level HEN

True hierarchical routing: many stored experts exist, but one active path is chosen at inference time.

Input image -> Top 3-way router -> animal mid / food mid / vehicle mid -> 3 leaf experts under each mid router
Dark path = one active inference route; pale branches = stored experts. Many stored backbones, one active path.
  • Closest to the original HEN idea: explicit routing followed by branch-specific specialists at each lower level.
  • Reached 95.63%, but storage and maintenance cost exploded because routers and experts each owned a full backbone.
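The one-active-path behavior can be sketched in a few lines. Everything below is an illustrative mock (the score functions and index layout are hypothetical, not the project's trained networks); it only shows how the top decision selects a mid router, which in turn selects a leaf expert:

```python
# Classic 3-level HEN routing sketch (hypothetical mocks, not project code).
# In the real system every router/expert is a full CNN; here each one is a
# stub that maps an image to a list of class scores.

def argmax(scores):
    return max(range(len(scores)), key=scores.__getitem__)

def top_router(img):          # animal / food / vehicle
    return [0.1, 0.7, 0.2]

mid_routers = {1: lambda img: [0.2, 0.6, 0.2]}        # sub-groups under parent 1
leaf_experts = {(1, 1): lambda img: [0.1, 0.8, 0.1]}  # leaves in that sub-group

def classify(img):
    top = argmax(top_router(img))                 # level-1 routing decision
    mid = argmax(mid_routers[top](img))           # level-2 routing decision
    leaf = argmax(leaf_experts[(top, mid)](img))  # leaf decision
    return top, mid, leaf                         # the single active path

print(classify("any image"))  # -> (1, 1, 1)
```

Only one mid router and one leaf expert run per image, which is why the active-path compute stays far below what the 145.31M stored parameters would suggest.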

Compact Joint HEN

The winning practical design: one shared backbone, then lightweight hierarchical heads attached to the same feature tensor.

Input image -> Shared Backbone (single forward pass; feature map reused by every level) -> Level-1 head (animal / vehicle / food) -> Level-2 heads (3 parents -> 9 branches) -> Leaf heads (9 groups -> 27 leaves)
95.63% at 3.01M parameters, one shared representation.
  • This is the compact HEN that matched the best classic 3-level HEN accuracy band while collapsing the parameter budget.
  • The key idea is not hard modular routing with many full CNNs, but shared representation plus hierarchical supervision.
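The "shared representation plus hierarchical supervision" idea can be made concrete with a toy probability composition. All logits below are hypothetical placeholders; in the real model they come from lightweight heads reading the same shared feature map:

```python
import math

def softmax(logits):
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical per-level head outputs, all derived from one shared forward pass.
p_parent = softmax([2.0, 0.1, -1.0])                        # 3 parents
p_branch = {p: softmax([1.0, 0.5, 0.0]) for p in range(3)}  # 3 branches per parent
p_leaf = {(p, b): softmax([0.3, 0.2, 0.1])                  # 3 leaves per branch
          for p in range(3) for b in range(3)}

# Compose the full 27-way distribution:
# P(leaf) = P(parent) * P(branch | parent) * P(leaf | branch)
dist = [p_parent[p] * p_branch[p][b] * p_leaf[(p, b)][l]
        for p in range(3) for b in range(3) for l in range(3)]

print(len(dist), round(sum(dist), 6))  # 27 leaves, probabilities sum to 1
```

Because every head reads the same feature tensor, the whole hierarchy costs one backbone pass plus a few tiny heads, which is where the 3.01M-parameter budget comes from.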

Selective Hard-Branch Refinement

An exploratory extension: keep the compact backbone, but add parent-specific refinement and common-delta logic only on the hardest branches.

Input -> Shared Backbone (compact feature bank) -> animal / vehicle / food refiners -> standard heads, plus common + delta heads for feline / aircraft and common + delta for prepared_food
Targeted extra capacity only for hard branches.
  • This line was useful diagnostically: it showed that the hardest siblings benefit from parent-aware refinement and residual difference features.
  • It improved some local branches, but it did not yet beat the simpler joint HEN as a full-system solution.
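One way to picture the common-plus-delta heads is the sketch below (branch names, numbers, and helper names are hypothetical; the project's actual heads are small learned layers): a shared head scores the siblings of a branch, and a residual "delta" head is added only for branches flagged as hard.

```python
# Selective common-plus-delta sketch (hypothetical, not project code).
HARD_BRANCHES = {"feline", "aircraft", "prepared_food"}  # branches given extra capacity

def branch_logits(branch, features, common_head, delta_heads):
    logits = common_head(features)             # shared scoring for all siblings
    if branch in HARD_BRANCHES:                # refine only where siblings are confusable
        delta = delta_heads[branch](features)  # small residual difference head
        logits = [c + d for c, d in zip(logits, delta)]
    return logits

# Mocked heads: feature vector -> per-sibling scores.
common = lambda f: [0.5, 0.4, 0.1]
deltas = {"feline": lambda f: [-0.2, 0.3, 0.0]}

print(branch_logits("feline", None, common, deltas))  # common + delta
print(branch_logits("canine", None, common, deltas))  # easy branch: common head only
```

Only the hard branches pay for delta heads, which is how the branch-trained increment stays at 0.79M parameters.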

Why Some Remaining Errors Are Hard

The four examples below were exported from the project's own misclassification review folder and are embedded directly into this report. They show why accuracy does not simply scale forever with more training: some "errors" are actually boundary cases where multiple strong semantic interpretations are present in the same frame.

Fruit label, bird-dominant image

Ground truth is lemon (fruit), but the frame is visually dominated by a white bird perched on a tree. This is a good example of why coarse semantic ambiguity creates a practical accuracy ceiling.

Prepared food label, animal-dominant image

Ground truth is pizza, but a squirrel holding the slice dominates the scene. A classifier must decide whether the task is "what object is present" or "what is the main visual subject".

Prepared food vs vegetable boundary

The hotdog is surrounded by lettuce and garnish. This does not look like a clean studio object crop, so some mistakes reflect a reasonable boundary interpretation rather than a simple failure to recognize the class.

Vegetable in a market-display context

This image contains many cauliflower heads arranged in a produce-market context. The label is still valid, but the scene weakens the object-centric assumption that many classifiers implicitly rely on.

These examples explain why 95% is a sensible stopping point for the comparison. Beyond this level, a growing fraction of validation mistakes come from ambiguous scene composition and label semantics, not from obviously poor recognition behavior.

Current Recommendation

If the project needs a clean headline comparison, the recommendation is straightforward.

Best flat reference
ConvNeXt-Tiny

Use this to represent the flat family's peak accuracy ceiling.

Best HEN reference
Joint HEN + MobileNetV3-Large

Use this to represent the HEN family's best efficiency/accuracy tradeoff.

Research direction
Selective hard-branch refinement

Keep this as the next experimental branch, not as the current final answer.