This report summarizes the current project pause point and compares the strongest flat CNN baselines against the strongest HEN variants on the 27-class hierarchical benchmark. It also explains why 95% accuracy became the practical target, how much compute and parameter budget each side consumes, and why some remaining failures are better interpreted as semantic ambiguity than as pure model weakness.

Headline results:

- Best flat accuracy: ConvNeXt-Tiny, 96.89%, 27.84M parameters, 4.456 GFLOPs
- Best compact HEN: Joint HEN + MobileNetV3-Large, 95.63%, 3.01M parameters, about 0.217 GFLOPs

The cheapest verified 95%+ solution is now HEN, not flat CNN. Compared with the smallest verified flat 95%+ model, the best compact HEN uses roughly 9x fewer parameters and roughly 20x less compute.
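The size of those margins can be reproduced with simple arithmetic from the figures quoted in this report (the variable names below are illustrative):

```python
# Budgets quoted in this report for the two 95%+ reference models.
flat_params_m, flat_gflops = 27.84, 4.456   # ConvNeXt-Tiny
hen_params_m, hen_gflops = 3.01, 0.217      # Joint HEN + MobileNetV3-Large

param_ratio = flat_params_m / hen_params_m
compute_ratio = flat_gflops / hen_gflops

print(f"parameter reduction: {param_ratio:.1f}x")  # ~9.2x
print(f"compute reduction:   {compute_ratio:.1f}x")  # ~20.5x
```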
Two conclusions are now stable. First, flat CNN remains the best route for absolute peak accuracy: ConvNeXt-Tiny reaches 96.89% and is still the highest result in the project. Second, if the goal is not "maximum possible accuracy at any cost" but rather "cross 95% with the smallest practical resource budget," HEN now has the better answer: Joint HEN with a MobileNetV3-Large backbone reaches 95.63% with only 3.01M parameters.
- ConvNeXt-Tiny remains the strongest overall model at 96.89%.
- Joint HEN + MobileNetV3-Large reaches 95.63% at only 3.01M parameters.
- Hierarchical routing and branch-level specialization remain HEN's natural strengths.
The 95% target is not arbitrary. It emerged because the validation set contains a visible amount of semantic ambiguity and context-heavy labeling. Some images are nominally labeled by the target object category, but the dominant visual subject is something else, or the scene mixes multiple strong concepts. In such a setting, chasing the last few percent can quickly become a fight against labeling semantics rather than a clean measurement of representation quality.
The most important comparison is not "best flat versus best HEN at any cost," but rather "what is the smallest verified model on each side that reaches at least 95% accuracy?" That comparison is now quite favorable to HEN.
| Family | Model | Accuracy | Parameters | Compute | Interpretation |
|---|---|---|---|---|---|
| Flat | ConvNeXt-Tiny | 96.89% | 27.84M | 4.456 GFLOPs | Smallest verified flat model that clearly exceeds 95%. |
| HEN | Joint HEN + MobileNetV3-Large | 95.63% | 3.01M | 0.217 GFLOPs | Smallest verified HEN model that exceeds 95%. |
| HEN classic | Classic 3-Level HEN | 95.63% | 145.31M total (33.53M active path) | about 5.442 GFLOPs active | Proof that hierarchical experts work, but much too expensive as a final deployment answer. |
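The gap between the classic HEN's total and active-path parameters comes from hard routing: all experts are stored, but only one path runs per image. A minimal numpy sketch of that routing idea, with toy sizes and random weights that are purely illustrative (this is not the project's actual code):

```python
import numpy as np

rng = np.random.default_rng(0)
feat_dim, n_groups, n_subgroups, n_leaf_classes = 64, 3, 3, 3  # toy sizes

# Stored parameters: every router and every leaf expert exists on disk.
top_router = rng.standard_normal((feat_dim, n_groups))
mid_routers = rng.standard_normal((n_groups, feat_dim, n_subgroups))
leaf_experts = rng.standard_normal((n_groups, n_subgroups, feat_dim, n_leaf_classes))

def classify(feature):
    """Route one feature vector down a single active path."""
    g = int(np.argmax(feature @ top_router))      # hard top-level choice
    s = int(np.argmax(feature @ mid_routers[g]))  # hard mid-level choice
    logits = feature @ leaf_experts[g, s]         # only ONE leaf expert runs
    return g, s, int(np.argmax(logits))

feature = rng.standard_normal(feat_dim)
group, subgroup, leaf = classify(feature)

# The active path uses 1 top router + 1 mid router + 1 leaf expert,
# while storage holds all 3 mid routers and all 9 leaf experts.
stored = top_router.size + mid_routers.size + leaf_experts.size
active = top_router.size + mid_routers[0].size + leaf_experts[0, 0].size
print(f"active/stored parameter fraction: {active / stored:.2f}")
```

Even in this toy, the active fraction is well below 1, mirroring the 33.53M-active versus 145.31M-stored split in the table above.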
This table shows the main reference points that shaped the current conclusion: the best flat models, the best classic HEN, the best compact HEN, and a representative selective hard-branch refinement experiment.
| Family | Model | Architecture Idea | Accuracy | Parameters | Compute | Comment |
|---|---|---|---|---|---|---|
| Flat | ConvNeXt-Tiny | Single-path CNN, single classifier head | 96.89% | 27.84M | 4.456 GFLOPs | Best overall accuracy. |
| Flat | ResNet18 | Single-path CNN, strong baseline recipe | 94.96% | 11.19M | 1.814 GFLOPs | Very close to 95%, but still below the line. |
| Flat | MobileNetV3-Small | Ultra-light flat classifier | 94.52% | 1.55M | 0.057 GFLOPs | Shows that very small flat models struggle to cross 95%. |
| HEN classic | Classic 3-Level HEN | Top router + mid router + dedicated leaf experts | 95.63% | 145.31M total | about 5.442 GFLOPs active | Strong accuracy, very high storage cost. |
| HEN compact | Joint HEN + MobileNetV3-Large | Shared backbone + hierarchical probability heads | 95.63% | 3.01M | 0.217 GFLOPs | Current best efficiency/accuracy tradeoff. |
| HEN research line | Selective Common-Delta HEN | Parent-specific refinement + hard-branch difference heads | 93.78% | 1.74M total (0.79M branch-trained) | Backbone dominated | Helpful on some hard branches, not yet better overall. |
The project moved from a straightforward flat classifier to an expensive modular hierarchy, and then to a compact shared-backbone HEN. The sketches below make the tradeoff visible: where the compute sits, how routing happens, and why the final compact HEN became the practical 95% solution.
- **Flat CNN.** One dense path: the whole image goes through one backbone and one global classifier.
- **Classic 3-Level HEN.** True hierarchical routing: many stored experts exist, but one active path is chosen at inference time.
- **Joint HEN + MobileNetV3-Large.** The winning practical design: one shared backbone, then lightweight hierarchical heads attached to the same feature tensor.
- **Selective Common-Delta HEN.** An exploratory extension: keep the compact backbone, but add parent-specific refinement and common-delta logic only on the hardest branches.
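The shared-backbone design can be sketched in a few lines of numpy. This is a minimal illustration with toy sizes and a two-level hierarchy (3 coarse groups x 9 leaves = 27 classes); the layout, names, and shapes are assumptions, not the project's exact head structure:

```python
import numpy as np

def softmax(x, axis=-1):
    z = np.exp(x - x.max(axis=axis, keepdims=True))
    return z / z.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(1)
feat_dim = 64
n_coarse, leaves_per_coarse = 3, 9  # toy 3 x 9 = 27-class layout (assumed)

# One shared backbone feature; the heads below are the only extra parameters.
coarse_head = rng.standard_normal((feat_dim, n_coarse))
leaf_heads = rng.standard_normal((n_coarse, feat_dim, leaves_per_coarse))

def leaf_probs(feature):
    """p(leaf) = p(coarse) * p(leaf | coarse), all from one shared feature."""
    p_coarse = softmax(feature @ coarse_head)                                     # (3,)
    p_leaf_given = softmax(np.einsum('d,cdk->ck', feature, leaf_heads), axis=-1)  # (3, 9)
    return (p_coarse[:, None] * p_leaf_given).reshape(-1)                         # (27,)

feature = rng.standard_normal(feat_dim)
p = leaf_probs(feature)
print(p.shape)  # (27,) -- a valid distribution over all leaf classes
```

Because every head reads the same feature tensor, the hierarchy adds almost no compute on top of the backbone, which is why the compact HEN's budget stays near the MobileNetV3-Large baseline.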
The four examples below were exported from the project's own misclassification review folder and are embedded directly into this report. They show why accuracy does not simply scale forever with more training: some "errors" are actually boundary cases where multiple strong semantic interpretations are present in the same frame.
- Ground truth is lemon (fruit), but the frame is visually dominated by a white bird perched on a tree. This is a good example of why coarse semantic ambiguity creates a practical accuracy ceiling.
- Ground truth is pizza, but a squirrel holding the slice dominates the scene. A classifier must decide whether the task is "what object is present" or "what is the main visual subject".
- The hotdog is surrounded by lettuce and garnish. This does not look like a clean studio object crop, so some mistakes reflect a reasonable boundary interpretation rather than a simple failure to recognize the class.
- This image contains many cauliflower heads arranged in a produce-market context. The label is still valid, but the scene weakens the object-centric assumption that many classifiers implicitly rely on.
If the project needs a clean headline comparison, the recommendation is straightforward:

- ConvNeXt-Tiny: use this to represent the flat family's peak accuracy ceiling.
- Joint HEN + MobileNetV3-Large: use this to represent the HEN family's best efficiency/accuracy tradeoff.
- Selective Common-Delta HEN: keep this as the next experimental branch, not as the current final answer.