Notes
This blog post:
- Cover: generated with Google Nano Banana 2; no rights reserved.
- Project source code: open-sourced on GitHub; visit PPOCRv5-Android to access it.
Disclaimer:
The author (Fleey) is not a professional in the AI field; this project is driven purely by personal interest. Please forgive any omissions or errors in the text, and feel free to provide corrections!
Introduction
In 2024, Google rebranded TensorFlow Lite as LiteRT. This was not just a branding exercise but marked a paradigm shift in on-device AI from “mobile-first” to “edge-first” 1. In this context, OCR (Optical Character Recognition), as one of the most practical on-device AI applications, is undergoing a silent revolution.
Baidu’s PaddleOCR team released PP-OCRv5 in 2025, a unified OCR model supporting multiple languages including Simplified Chinese, Traditional Chinese, English, and Japanese 2. Its mobile version is only about 70MB, yet it can recognize 18,383 characters within a single model. Behind this number lies the collaborative work of two deep neural networks: detection and recognition.
But the problem is: PP-OCRv5 is trained on the PaddlePaddle framework, while the most mature inference engine on Android devices is LiteRT. How do we bridge this gap?
Let’s start with model conversion and gradually unveil the engineering behind on-device OCR.
flowchart TB subgraph E2E["End-to-End OCR Pipeline"] direction TB
subgraph Input["Input"] IMG[Original Image<br/>Any Size] end
subgraph Detection["Text Detection - DBNet"] DET_PRE[Preprocessing<br/>Resize 640x640<br/>ImageNet Normalize] DET_INF[DBNet Inference<br/>~45ms GPU] DET_POST[Post-processing<br/>Binarization - Contours - Rotated Rect] end
subgraph Recognition["Text Recognition - SVTRv2"] REC_CROP[Perspective Transform Crop<br/>48xW Adaptive Width] REC_INF[SVTRv2 Inference<br/>~15ms/line GPU] REC_CTC[CTC Decoding<br/>Merge Duplicates + Remove Blanks] end
subgraph Output["Output"] RES[OCR Results<br/>Text + Confidence + Position] end end
IMG --> DET_PRE --> DET_INF --> DET_POST DET_POST -->|N Text Boxes| REC_CROP REC_CROP --> REC_INF --> REC_CTC --> RES

Model Conversion: The Long Journey from PaddlePaddle to TFLite
Fragmentation in deep learning frameworks is a major pain point in the industry. PyTorch, TensorFlow, PaddlePaddle, ONNX—each framework has its own model format and operator implementations. While ONNX (Open Neural Network Exchange) attempts to be a universal intermediate representation, reality is often harsher than the ideal.
The model conversion path for PP-OCRv5 is as follows:
flowchart LR subgraph PaddlePaddle["PaddlePaddle Framework"] PM[inference.json<br/>inference.pdiparams] end
subgraph ONNX["ONNX Intermediate"] OM[model.onnx<br/>opset 14] end
subgraph Optimization["Graph Optimization"] GS[onnx-graphsurgeon<br/>Operator Decomposition] end
subgraph TFLite["LiteRT Format"] TM[model.tflite<br/>FP16 Quantized] end
PM -->|paddle2onnx| OM OM -->|HardSigmoid Decomposition<br/>Resize Mode Modification| GS GS -->|onnx2tf| TM

This path seems simple but hides several nuances.
Pitfall 1: Operator Compatibility in paddle2onnx
paddle2onnx is the official model conversion tool provided by PaddlePaddle. Theoretically, it can convert PaddlePaddle models to ONNX format. However, PP-OCRv5 uses some special operators whose mappings in ONNX are not one-to-one.
paddle2onnx --model_dir PP-OCRv5_mobile_det \
  --model_filename inference.json \
  --params_filename inference.pdiparams \
  --save_file ocr_det_v5.onnx \
  --opset_version 14

A key detail here: the PP-OCRv5 model filename is inference.json rather than the traditional inference.pdmodel. This is a change in the model format of newer PaddlePaddle versions that many developers overlook 3.
Pitfall 2: HardSigmoid and GPU Compatibility
The converted ONNX model contains the HardSigmoid operator. Mathematically, this operator is defined as:
$$\text{HardSigmoid}(x) = \max\bigl(0,\ \min(1,\ \alpha x + \beta)\bigr)$$

where $\alpha = 0.2$ and $\beta = 0.5$.
The problem is that LiteRT’s GPU Delegate does not support HardSigmoid. When a model contains unsupported operators, the GPU Delegate falls back to the CPU for that entire subgraph, leading to significant performance loss.
The solution is to decompose HardSigmoid into basic operators. Using the onnx-graphsurgeon library, we can perform “surgery” at the computational graph level:
import onnx_graphsurgeon as gsimport numpy as np
def decompose_hardsigmoid(graph: gs.Graph) -> gs.Graph: """ Decompose HardSigmoid into GPU-friendly basic operators HardSigmoid(x) = max(0, min(1, alpha*x + beta)) Decomposed into: Mul -> Add -> Clip """ for node in graph.nodes: if node.op == "HardSigmoid": # Get HardSigmoid parameters alpha = node.attrs.get("alpha", 0.2) beta = node.attrs.get("beta", 0.5)
input_tensor = node.inputs[0] output_tensor = node.outputs[0]
# Create constant tensors alpha_const = gs.Constant( name=f"{node.name}_alpha", values=np.array([alpha], dtype=np.float32) ) beta_const = gs.Constant( name=f"{node.name}_beta", values=np.array([beta], dtype=np.float32) )
# Create intermediate variables mul_out = gs.Variable(name=f"{node.name}_mul_out") add_out = gs.Variable(name=f"{node.name}_add_out")
# Build decomposed subgraph: x -> Mul(alpha) -> Add(beta) -> Clip(0,1) mul_node = gs.Node( op="Mul", inputs=[input_tensor, alpha_const], outputs=[mul_out] ) add_node = gs.Node( op="Add", inputs=[mul_out, beta_const], outputs=[add_out] ) clip_node = gs.Node( op="Clip", inputs=[add_out], outputs=[output_tensor], attrs={"min": 0.0, "max": 1.0} )
# Replace original node graph.nodes.remove(node) graph.nodes.extend([mul_node, add_node, clip_node])
graph.cleanup().toposort() return graph

The key to this decomposition is that Mul, Add, and Clip are all operators fully supported by the LiteRT GPU Delegate. After decomposition, the entire subgraph can be executed continuously on the GPU, avoiding the overhead of CPU-GPU data transfers.
TIP
Why not modify the model training code directly? Because the gradient calculation for HardSigmoid during training differs from Clip. Decomposition should only occur during the inference stage to maintain numerical stability during training.
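At inference time, however, the decomposition is exact rather than approximate, since by definition:

$$\text{HardSigmoid}(x) = \max\bigl(0,\ \min(1,\ \alpha x + \beta)\bigr) = \text{Clip}(\alpha x + \beta,\ 0,\ 1)$$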
Pitfall 3: Coordinate Transformation Mode of the Resize Operator
The ONNX Resize operator has a coordinate_transformation_mode attribute, which determines how output coordinates are mapped to input coordinates. PP-OCRv5 uses the half_pixel mode, but LiteRT GPU Delegate has limited support for this mode.
Changing it to asymmetric mode provides better GPU compatibility:
for node in graph.nodes: if node.op == "Resize": node.attrs["coordinate_transformation_mode"] = "asymmetric"

WARNING
This modification may cause minor numerical differences. In practical testing, the impact of this difference on OCR accuracy is negligible, but it may require careful evaluation in other tasks.
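For reference, the two modes map an output coordinate back to a source coordinate differently (following the ONNX Resize specification, with $s$ defined as input size divided by output size):

$$x_{src} = \begin{cases} (x_{dst} + 0.5)\,s - 0.5 & \text{half\_pixel} \\ x_{dst}\,s & \text{asymmetric} \end{cases}$$

The two mappings differ only by a sub-pixel offset, which is why the impact on OCR accuracy stays small.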
Final Step: onnx2tf and FP16 Quantization
onnx2tf is a tool to convert ONNX models to TFLite format. FP16 (half-precision floating point) quantization is a common choice for mobile deployment. It halves the model size with acceptable accuracy loss and leverages the FP16 compute units of mobile GPUs.
onnx2tf -i ocr_det_v5_fixed.onnx -o converted_det \
  -b 1 -ois x:1,3,640,640 -n

The -ois parameter here specifies a static input shape. Static shapes are crucial for GPU acceleration; dynamic shapes would require recompiling the GPU program for every inference, severely impacting performance.
Text Detection: Differentiable Binarization in DBNet
The detection module of PP-OCRv5 is based on DBNet (Differentiable Binarization Network) 4. Traditional text detection methods use a fixed threshold for binarization, whereas DBNet’s innovation lies in letting the network learn the optimal threshold for each pixel.
flowchart TB subgraph DBNet["DBNet Architecture"] direction TB IMG[Input Image<br/>H x W x 3] BB[Backbone<br/>MobileNetV3] FPN[FPN Feature Pyramid<br/>Multi-scale Fusion]
subgraph Heads["Dual Branch Output"] PH[Probability Map Branch<br/>P: H x W x 1] TH[Threshold Map Branch<br/>T: H x W x 1] end
DB["Differentiable Binarization<br/>B = sigmoid k * P-T"] end
IMG --> BB --> FPN FPN --> PH FPN --> TH PH --> DB TH --> DB

Standard Binarization vs. Differentiable Binarization
Standard binarization is a step function:

$$B_{i,j} = \begin{cases} 1 & \text{if } P_{i,j} \ge t \\ 0 & \text{otherwise} \end{cases}$$

This function is non-differentiable and cannot be trained end-to-end via backpropagation. DBNet proposes an approximate function:

$$\hat{B}_{i,j} = \frac{1}{1 + e^{-k\,(P_{i,j} - T_{i,j})}}$$

where $P$ is the probability map, $T$ is the threshold map (learned by the network), and $k$ is the amplification factor (set to 50 during training).
TIP
This formula is essentially a Sigmoid function, but with $k\,(P_{i,j} - T_{i,j})$ as the input. When $k$ is large enough, its behavior approaches a step function while remaining differentiable.
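A quick numerical check shows how sharp the transition is with $k = 50$: for a pixel whose probability sits just 0.1 above or below its learned threshold,

$$\frac{1}{1 + e^{-50 \times 0.1}} \approx 0.993, \qquad \frac{1}{1 + e^{-50 \times (-0.1)}} \approx 0.007$$

so the output is already pushed almost all the way to 1 or 0.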
Engineering Implementation of the Post-processing Pipeline
In the PPOCRv5-Android project, the post-processing pipeline is implemented in postprocess.cpp. The core process includes:
flowchart LR subgraph Input["Model Output"] PM[Probability Map P<br/>640 x 640] end
subgraph Binary["Binarization"] BT[Threshold Filtering<br/>threshold=0.1] BM[Binary Map<br/>640 x 640] end
subgraph Contour["Contour Detection"] DS[4x Downsampling<br/>160 x 160] CC[Connected Component Analysis<br/>BFS Traversal] BD[Boundary Point Extraction] end
subgraph Geometry["Geometric Calculation"] CH[Convex Hull Calculation<br/>Graham Scan] RR[Rotating Calipers<br/>Minimum Area Rectangle] UC[Unclip Expansion<br/>ratio=1.5] end
subgraph Output["Output"] TB[RotatedRect<br/>center, size, angle] end
PM --> BT --> BM BM --> DS --> CC --> BD BD --> CH --> RR --> UC --> TB

In the actual code, the TextDetector::Impl::Detect method demonstrates the complete detection process:
std::vector<RotatedRect> Detect(const uint8_t *image_data, int width, int height, int stride, float *detection_time_ms) { // 1. Calculate scale ratios scale_x_ = static_cast<float>(width) / kDetInputSize; scale_y_ = static_cast<float>(height) / kDetInputSize;
// 2. Bilinear interpolation resize to 640x640 image_utils::ResizeBilinear(image_data, width, height, stride, resized_buffer_.data(), kDetInputSize, kDetInputSize);
// 3. ImageNet Normalization PrepareFloatInput();
// 4. Inference auto run_result = compiled_model_->Run(input_buffers_, output_buffers_);
// 5. Binarization BinarizeOutput(prob_map, total_pixels);
// 6. Contour Detection auto contours = postprocess::FindContours(binary_map_.data(), kDetInputSize, kDetInputSize);
// 7. Minimum Area Rectangle + Unclip for (const auto &contour : contours) { RotatedRect rect = postprocess::MinAreaRect(contour); UnclipBox(rect, kUnclipRatio); // Map coordinates back to original image rect.center_x *= scale_x_; rect.center_y *= scale_y_; // ... }}

The key to this process is the "Minimum Area Rotated Rectangle." Unlike axis-aligned bounding boxes, rotated rectangles can tightly fit text at any angle, which is crucial for tilted text in natural scenes.
Unclip: The Text Box Expansion Algorithm
The text regions output by DBNet are usually slightly smaller than the actual text because the network learns the “core region” of the text. To obtain the complete text boundaries, an expansion (Unclip) operation must be performed on the detected polygons.
The mathematical principle of Unclip is based on the inverse operation of the Vatti polygon clipping algorithm. Given a detected polygon, the expansion distance $d$ satisfies:

$$d = \frac{A \times r}{L}$$

where $A$ is the polygon area, $L$ is the perimeter, and $r$ is the expansion ratio (usually set to 1.5).
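As a worked example, a detected box of 100 x 20 pixels with $r = 1.5$ expands by

$$d = \frac{100 \times 20 \times 1.5}{2 \times (100 + 20)} = \frac{3000}{240} = 12.5\ \text{px}$$

on each side.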
In postprocess.cpp, the UnclipBox function implements this logic:
void UnclipBox(RotatedRect &box, float unclip_ratio) { // Calculate expansion distance float area = box.width * box.height; float perimeter = 2.0f * (box.width + box.height);
if (perimeter < 1e-6f) return; // Prevent division by zero
// d = A * r / L float distance = area * unclip_ratio / perimeter;
// Expand outwards: increase width and height by 2d each
box.width += 2.0f * distance; box.height += 2.0f * distance;}

This simplified version assumes the text box is a rectangle. For more complex polygons, a full Clipper library implementation for polygon offsetting would be required:
// Full polygon Unclip (using Clipper library)ClipperLib::Path polygon;for (const auto& pt : contour) { polygon.push_back(ClipperLib::IntPoint( static_cast<int>(pt.x * 1000), // Scale up to maintain precision static_cast<int>(pt.y * 1000) ));}
ClipperLib::ClipperOffset offset;offset.AddPath(polygon, ClipperLib::jtRound, ClipperLib::etClosedPolygon);
ClipperLib::Paths solution;offset.Execute(solution, distance * 1000); // Expand

NOTE
PPOCRv5-Android chooses simplified rectangular expansion over full polygon offsetting because:
- Most text boxes are nearly rectangular.
- The full Clipper library would significantly increase binary size.
- The simplified version offers better performance.
Text Recognition: SVTRv2 and CTC Decoding
If detection is “finding where the text is,” then recognition is “reading what the text says.” The recognition module of PP-OCRv5 is based on SVTRv2 (Scene Text Recognition with Visual Transformer v2) 5.
Architectural Innovations in SVTRv2
SVTRv2 introduces three key improvements over its predecessor SVTR:
flowchart TB subgraph SVTRv2["SVTRv2 Architecture"] direction TB
subgraph Encoder["Visual Encoder"] PE[Patch Embedding<br/>4x4 Conv]
subgraph Mixing["Mixing Attention Block x12"] LA[Local Attention<br/>7x7 Window] GA[Global Attention<br/>Global Receptive Field] FFN[Feed Forward<br/>MLP] end end
subgraph Decoder["CTC Decoder"] FC[Fully Connected Layer<br/>D -> 18384] SM[Softmax] CTC[CTC Decode] end end
PE --> LA --> GA --> FFN FFN --> FC --> SM --> CTC

- Mixing Attention Mechanism: Alternates between local attention (capturing stroke details) and global attention (understanding character structure). Local attention uses a 7x7 sliding window, reducing the attention cost from quadratic in the sequence length to roughly linear, since each token only attends to a fixed-size neighborhood.
- Multi-scale Feature Fusion: Unlike the single resolution of ViT, SVTRv2 uses different feature map resolutions at different depths, similar to a CNN's pyramid structure.
- Semantic Guidance Module: A lightweight semantic branch is added at the end of the encoder to help the model understand semantic relationships between characters rather than just visual features.
These improvements allow SVTRv2 to achieve accuracy comparable to attention-based methods while maintaining the simplicity of CTC decoding 6.
Why CTC instead of Attention?
There are two mainstream paradigms for text recognition:
- CTC (Connectionist Temporal Classification): Treats recognition as a sequence labeling problem where output is aligned with input.
- Attention-based Decoder: Uses an attention mechanism to generate output character by character.
Attention methods usually offer higher accuracy, but CTC methods are simpler and faster. SVTRv2’s contribution is that by improving the visual encoder, it allows CTC methods to reach or even exceed the accuracy of attention methods 6.
The core of CTC decoding is “merging duplicates” and “removing blanks”:
flowchart LR subgraph Input["Model Output"] L["Logits<br/>[T, 18384]"] end
subgraph Argmax["Argmax NEON"] A1["t=0: blank"] A2["t=1: H"] A3["t=2: H"] A4["t=3: blank"] A5["t=4: e"] A6["t=5: l"] A7["t=6: l"] A8["t=7: l"] A9["t=8: o"] end
subgraph Merge["Merge Duplicates"] M["blank, H, blank, e, l, o"] end
subgraph Remove["Remove Blanks"] R["H, e, l, o"] end
subgraph Output["Output"] O["Helo - Error"] end
L --> A1 & A2 & A3 & A4 & A5 & A6 & A7 & A8 & A9 A1 & A2 & A3 & A4 & A5 & A6 & A7 & A8 & A9 --> Merge --> Remove --> Output

Wait, there's a problem here. If the original text is "Hello," the two 'l's are incorrectly merged. The CTC solution is to insert a blank token between repeated characters.
Correct Encoding: [blank, H, e, l, blank, l, o]
Decoding Result: "Hello"

NEON-Optimized CTC Decoding
CTC decoding in PPOCRv5-Android uses NEON-optimized Argmax. In text_recognizer.cpp:
inline void ArgmaxNeon8(const float *__restrict__ data, int size, int &max_idx, float &max_val) { if (size < 16) { // Scalar fallback max_idx = 0; max_val = data[0]; for (int i = 1; i < size; ++i) { if (data[i] > max_val) { max_val = data[i]; max_idx = i; } } return; }
// NEON vectorization: process 4 floats at a time float32x4_t v_max = vld1q_f32(data); int32x4_t v_idx = {0, 1, 2, 3}; int32x4_t v_max_idx = v_idx; const int32x4_t v_four = vdupq_n_s32(4);
int i = 4; for (; i + 4 <= size; i += 4) { float32x4_t v_curr = vld1q_f32(data + i); v_idx = vaddq_s32(v_idx, v_four);
// Vectorized comparison and conditional selection uint32x4_t cmp = vcgtq_f32(v_curr, v_max); v_max = vbslq_f32(cmp, v_curr, v_max); // Select larger value v_max_idx = vbslq_s32(cmp, v_idx, v_max_idx); // Select corresponding index }
// Horizontal reduction: find the maximum among the 4 candidates float max_vals[4]; int32_t max_idxs[4]; vst1q_f32(max_vals, v_max); vst1q_s32(max_idxs, v_max_idx); // ... final comparison}

For an Argmax with 18,384 categories, NEON optimization can provide approximately a 3x speedup.
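The elided "final comparison" is just a scalar pass over the four lane candidates plus the loop tail; a sketch of what it might look like (not necessarily the project's exact code):

```cpp
// Horizontal reduction: pick the best of the 4 per-lane candidates.
max_val = max_vals[0];
max_idx = max_idxs[0];
for (int lane = 1; lane < 4; ++lane) {
  if (max_vals[lane] > max_val) {
    max_val = max_vals[lane];
    max_idx = max_idxs[lane];
  }
}

// Handle the tail elements that did not fill a complete 4-wide block.
for (; i < size; ++i) {
  if (data[i] > max_val) {
    max_val = data[i];
    max_idx = i;
  }
}
```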
Mathematical Principles of CTC Loss and Decoding
The core idea of CTC is that given an input sequence $x$ and all possible alignment paths $\pi$, the probability of the target sequence $y$ is calculated as:

$$P(y \mid x) = \sum_{\pi \in \mathcal{B}^{-1}(y)} P(\pi \mid x)$$

where $\mathcal{B}$ is a "many-to-one mapping function" that maps a path $\pi$ to the output sequence (by merging duplicates and removing blanks).
During inference, we use Greedy Decoding instead of full Beam Search:
std::string CTCGreedyDecode(const float* logits, int time_steps, int num_classes, const std::vector<std::string>& dictionary) { std::string result; int prev_idx = -1; // Used for merging duplicates
for (int t = 0; t < time_steps; ++t) { // Find the category with the maximum probability for the current time step int max_idx = 0; float max_val = logits[t * num_classes];
for (int c = 1; c < num_classes; ++c) { if (logits[t * num_classes + c] > max_val) { max_val = logits[t * num_classes + c]; max_idx = c; } }
// CTC decoding rules: // 1. Skip blank token (index 0) // 2. Merge consecutive duplicate characters if (max_idx != 0 && max_idx != prev_idx) { result += dictionary[max_idx - 1]; // -1 because blank occupies index 0 }
prev_idx = max_idx; }
return result;}

The time complexity of greedy decoding is $O(T \times C)$, where $T$ is the number of time steps and $C$ is the number of categories. For PP-OCRv5, $T \approx 80$ and $C = 18384$, requiring about 1.5 million comparisons per decoding. This is why NEON optimization is so important.
TIP
Beam Search can improve decoding accuracy, but its computational cost is roughly $B$ times that of greedy decoding (where $B$ is the beam width). On mobile devices, greedy decoding is usually the better choice.
Character Dictionary: The Challenge of 18,383 Characters
PP-OCRv5 supports 18,383 characters, including:
- Common Simplified Chinese characters
- Common Traditional Chinese characters
- English letters and numbers
- Japanese Hiragana and Katakana
- Common punctuation and special characters
This dictionary is stored in the keys_v5.txt file, one character per line. During CTC decoding, the model output logits have a shape of [1, T, 18384], where T is the number of time steps, and 18384 = 18383 characters + 1 blank token.
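Loading this dictionary is straightforward. A minimal sketch (the project's actual loader may differ):

```cpp
#include <fstream>
#include <string>
#include <vector>

// Reads keys_v5.txt (one UTF-8 character per line) into a lookup table.
// Model class 0 is the CTC blank, so dictionary[i] corresponds to class i + 1.
std::vector<std::string> LoadDictionary(const std::string &keys_path) {
  std::vector<std::string> dictionary;
  std::ifstream file(keys_path);
  std::string line;
  while (std::getline(file, line)) {
    // Strip a trailing '\r' in case the file uses CRLF line endings.
    if (!line.empty() && line.back() == '\r') line.pop_back();
    dictionary.push_back(line);
  }
  return dictionary;  // Expected: 18,383 entries for PP-OCRv5.
}
```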
LiteRT C++ API: Modern Interfaces After the 2024 Refactor
PPOCRv5-Android uses the LiteRT C++ API refactored in 2024, which provides a more modern interface design. Compared to the traditional TFLite C API, the new API offers better type safety and resource management capabilities.
Comparison of Old and New APIs
The LiteRT 2024 refactor brought significant API changes:
| Feature | Old API (TFLite) | New API (LiteRT) |
|---|---|---|
| Namespace | tflite:: | litert:: |
| Error Handling | Returns TfLiteStatus enum | Returns Expected<T> type |
| Memory Management | Manual management | RAII automatic management |
| Delegate Config | Scattered APIs | Unified Options class |
| Tensor Access | Pointers + manual casting | Type-safe TensorBuffer |
The core advantage of the new API is type safety and automatic resource management. Taking error handling as an example:
// Old API: manual check required for every return value
TfLiteStatus status = TfLiteInterpreterAllocateTensors(interpreter);if (status != kTfLiteOk) { // Error handling}

// New API: uses Expected type, supports method chaining
auto model_result = litert::CompiledModel::Create(env, model_path, options);if (!model_result) { LOGE(TAG, "Error: %s", model_result.Error().Message().c_str()); return false;}auto model = std::move(*model_result); // Automatic lifecycle management

Environment and Model Initialization
In text_detector.cpp, the initialization process is as follows:
bool Initialize(const std::string &model_path, AcceleratorType accelerator_type) { // 1. Create LiteRT environment auto env_result = litert::Environment::Create({}); if (!env_result) { LOGE(TAG, "Failed to create LiteRT environment: %s", env_result.Error().Message().c_str()); return false; } env_ = std::move(*env_result);
// 2. Configure hardware accelerator auto options_result = litert::Options::Create(); auto hw_accelerator = ToLiteRtAccelerator(accelerator_type); options.SetHardwareAccelerators(hw_accelerator);
// 3. Compile model auto model_result = litert::CompiledModel::Create(*env_, model_path, options); if (!model_result) { LOGW(TAG, "Failed to create CompiledModel with accelerator %d: %s", static_cast<int>(accelerator_type), model_result.Error().Message().c_str()); return false; } compiled_model_ = std::move(*model_result);
// 4. Resize input tensor shape std::vector<int> input_dims = {1, kDetInputSize, kDetInputSize, 3}; compiled_model_->ResizeInputTensor(0, absl::MakeConstSpan(input_dims));
// 5. Create Managed Buffers CreateBuffersWithCApi();
return true;}

Managed Tensor Buffer: The Key to Zero-Copy Inference
LiteRT’s Managed Tensor Buffer is key to achieving high-performance inference. It allows the GPU Delegate to access the buffer directly, eliminating CPU-GPU data transfers:
bool CreateBuffersWithCApi() { LiteRtCompiledModel c_model = compiled_model_->Get(); LiteRtEnvironment c_env = env_->Get();
// Get input buffer requirements LiteRtTensorBufferRequirements input_requirements = nullptr; LiteRtGetCompiledModelInputBufferRequirements( c_model, /*signature_index=*/0, /*input_index=*/0, &input_requirements);
// Get tensor type information auto input_type = compiled_model_->GetInputTensorType(0, 0); LiteRtRankedTensorType tensor_type = static_cast<LiteRtRankedTensorType>(*input_type);
// Create managed buffer LiteRtTensorBuffer input_buffer = nullptr; LiteRtCreateManagedTensorBufferFromRequirements( c_env, &tensor_type, input_requirements, &input_buffer);
// Wrap as C++ object for automatic lifecycle management input_buffers_.push_back( litert::TensorBuffer::WrapCObject(input_buffer, litert::OwnHandle::kYes)); return true;}

The advantages of this design are:
- Zero-copy inference: The GPU Delegate can access the buffer directly without CPU-GPU data transfer.
- Automatic memory management: OwnHandle::kYes ensures the buffer is automatically released when the C++ object is destroyed.
- Type safety: Tensor type matching is checked at compile time.
GPU Acceleration: Choosing OpenCL and the Trade-offs
LiteRT provides several hardware acceleration options:
flowchart TB subgraph Delegates["LiteRT Delegate Ecosystem"] direction TB GPU_CL[GPU Delegate<br/>OpenCL Backend] GPU_GL[GPU Delegate<br/>OpenGL ES Backend] NNAPI[NNAPI Delegate<br/>Android HAL] XNN[XNNPACK Delegate<br/>CPU Optimized] end
subgraph Hardware["Hardware Mapping"] direction TB ADRENO[Adreno GPU<br/>Qualcomm] MALI[Mali GPU<br/>ARM] NPU[NPU/DSP<br/>Vendor Specific] CPU[ARM CPU<br/>NEON] end
GPU_CL --> ADRENO GPU_CL --> MALI GPU_GL --> ADRENO GPU_GL --> MALI NNAPI --> NPU XNN --> CPU

| Accelerator | Backend | Pros | Cons |
|---|---|---|---|
| GPU | OpenCL | Wide support, good performance | Not a standard Android component |
| GPU | OpenGL ES | Standard Android component | Performance inferior to OpenCL |
| NPU | NNAPI | Highest performance | Poor device compatibility |
| CPU | XNNPACK | Widest compatibility | Lowest performance |
PPOCRv5-Android chooses OpenCL as the primary acceleration backend. Google released the OpenCL backend for TFLite in 2020, which achieved about a 2x speedup on Adreno GPUs compared to the OpenGL ES backend 7.
The advantages of OpenCL come from several aspects:
- Design intent: OpenCL was designed for general-purpose computing from the start, whereas OpenGL is a graphics rendering API that only later added support for compute shaders.
- Constant memory: OpenCL’s constant memory is highly efficient for accessing neural network weights.
- FP16 support: OpenCL natively supports half-precision floating point, whereas OpenGL support came later.
However, OpenCL has a fatal flaw: it is not a standard Android component. OpenCL implementations vary in quality across vendors, and some devices do not support it at all.
OpenCL vs. OpenGL ES: Deep Performance Comparison
To understand OpenCL’s advantage, we need to dive into GPU architecture. Taking the Qualcomm Adreno 640 as an example:
flowchart TB subgraph Adreno["Adreno 640 Architecture"] direction TB
subgraph SP["Shader Processors x2"] ALU1[ALU Array<br/>256 FP32 / 512 FP16] ALU2[ALU Array<br/>256 FP32 / 512 FP16] end
subgraph Memory["Memory Hierarchy"] L1[L1 Cache<br/>16KB per SP] L2[L2 Cache<br/>1MB Shared] GMEM[Global Memory<br/>LPDDR4X] end
subgraph Special["Special Units"] TMU[Texture Unit<br/>Bilinear Interpolation] CONST[Constant Cache<br/>Weight Acceleration] end end
ALU1 --> L1 ALU2 --> L1 L1 --> L2 --> GMEM TMU --> L1 CONST --> ALU1 & ALU2

OpenCL's performance advantage stems from:
| Feature | OpenCL | OpenGL ES Compute |
|---|---|---|
| Constant Memory | Native support, hardware accelerated | Emulated via UBO |
| Workgroup Size | Flexibly configured | Limited by shader model |
| Memory Barriers | Fine-grained control | Coarse-grained |
| FP16 Compute | cl_khr_fp16 extension | Requires mediump precision |
| Debugging Tools | Snapdragon Profiler | Limited support |
In convolution operations, weights are typically constant. OpenCL can place weights in constant memory, benefiting from hardware-level broadcast optimizations. OpenGL ES, on the other hand, needs to pass weights as Uniform Buffer Objects (UBOs), increasing memory access overhead.
NOTE
Since Android 7.0, Google has restricted apps from directly loading OpenCL libraries. However, LiteRT’s GPU Delegate bypasses this restriction by dynamically loading the system’s OpenCL implementation via dlopen. This is why the GPU Delegate needs to detect OpenCL availability at runtime.
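A simplified availability probe in the same spirit might look like this (a sketch; the library names are commonly seen ones, and the GPU Delegate's real detection is more thorough):

```cpp
#include <dlfcn.h>

// Returns true if an OpenCL driver can be dlopen'ed and exposes the core API.
bool HasUsableOpenCl() {
  const char *candidates[] = {"libOpenCL.so", "libOpenCL.so.1", "libGLES_mali.so"};
  for (const char *name : candidates) {
    void *handle = dlopen(name, RTLD_NOW | RTLD_LOCAL);
    if (handle == nullptr) continue;
    // clGetPlatformIDs is part of the core OpenCL API; if it resolves,
    // the driver is at least loadable on this device.
    const bool ok = dlsym(handle, "clGetPlatformIDs") != nullptr;
    dlclose(handle);
    if (ok) return true;
  }
  return false;
}
```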
Graceful Fallback Strategy
PPOCRv5-Android implements a graceful fallback strategy:
constexpr AcceleratorType kFallbackChain[] = { AcceleratorType::kGpu, // Preferred: GPU AcceleratorType::kCpu, // Fallback: CPU};
std::unique_ptr<OcrEngine> OcrEngine::Create( const std::string &det_model_path, const std::string &rec_model_path, const std::string &keys_path, AcceleratorType accelerator_type) {
auto engine = std::unique_ptr<OcrEngine>(new OcrEngine()); int start_index = GetFallbackStartIndex(accelerator_type);
for (int i = start_index; i < kFallbackChainSize; ++i) { AcceleratorType current = kFallbackChain[i];
auto detector = TextDetector::Create(det_model_path, current); if (!detector) continue;
auto recognizer = TextRecognizer::Create(rec_model_path, keys_path, current); if (!recognizer) continue;
engine->detector_ = std::move(detector); engine->recognizer_ = std::move(recognizer); engine->active_accelerator_ = current;
engine->WarmUp(); return engine; } return nullptr;}

This strategy ensures the app can run on any device, albeit with varying performance.
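A typical call site simply requests the GPU and lets the chain decide what is actually used (paths below are placeholders, and the namespace qualification is assumed from the JNI code):

```cpp
// Request GPU; OcrEngine::Create silently falls back to CPU if needed.
auto engine = ppocrv5::OcrEngine::Create(
    "/data/local/tmp/ocr_det_v5.tflite",   // placeholder paths
    "/data/local/tmp/ocr_rec_v5.tflite",
    "/data/local/tmp/keys_v5.txt",
    ppocrv5::AcceleratorType::kGpu);

if (!engine) {
  // Every accelerator in the fallback chain failed, e.g. missing model files.
  LOGE(TAG, "OCR engine creation failed on all accelerators");
}
```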
Native Layer: C++ and NEON Optimization
Why use C++ instead of Kotlin?
The answer is simple: performance. Image preprocessing involves massive pixel-level operations, and the overhead of these operations on the JVM is unacceptable. More importantly, C++ can directly use ARM NEON SIMD instructions to achieve vectorized computation.
NEON: ARM’s SIMD Instruction Set
NEON is the SIMD (Single Instruction, Multiple Data) extension for ARM processors. It allows a single instruction to process multiple data elements simultaneously.
flowchart LR subgraph NEON["128-bit NEON Register"] direction TB F4["4x float32"] I8["8x int16"] B16["16x int8"] end
subgraph Operations["Vectorized Operations"] direction TB LD["vld1q_f32<br/>Load 4 floats"] SUB["vsubq_f32<br/>4-way parallel subtraction"] MUL["vmulq_f32<br/>4-way parallel multiplication"] ST["vst1q_f32<br/>Store 4 floats"] end
subgraph Speedup["Performance Boost"] S1["Scalar: 4 instructions"] S2["NEON: 1 instruction"] S3["Theoretical Speedup: 4x"] end
F4 --> LD LD --> SUB --> MUL --> ST ST --> S3

PPOCRv5-Android uses NEON optimization in several critical paths. Taking binarization as an example (text_detector.cpp):
void BinarizeOutput(const float *prob_map, int total_pixels) {#if defined(__ARM_NEON) || defined(__ARM_NEON__) const float32x4_t v_threshold = vdupq_n_f32(kBinaryThreshold); const uint8x16_t v_255 = vdupq_n_u8(255); const uint8x16_t v_0 = vdupq_n_u8(0);
int i = 0; for (; i + 16 <= total_pixels; i += 16) { // Process 16 pixels at a time float32x4_t f0 = vld1q_f32(prob_map + i); float32x4_t f1 = vld1q_f32(prob_map + i + 4); float32x4_t f2 = vld1q_f32(prob_map + i + 8); float32x4_t f3 = vld1q_f32(prob_map + i + 12);
// Vectorized comparison uint32x4_t cmp0 = vcgtq_f32(f0, v_threshold); uint32x4_t cmp1 = vcgtq_f32(f1, v_threshold); uint32x4_t cmp2 = vcgtq_f32(f2, v_threshold); uint32x4_t cmp3 = vcgtq_f32(f3, v_threshold);
// Narrow down to uint8 uint16x4_t n0 = vmovn_u32(cmp0); uint16x4_t n1 = vmovn_u32(cmp1); uint16x8_t n01 = vcombine_u16(n0, n1); // ... merge and store } // Scalar fallback for remaining pixels for (; i < total_pixels; ++i) { binary_map_[i] = (prob_map[i] > kBinaryThreshold) ? 255 : 0; }#else // Pure scalar implementation for (int i = 0; i < total_pixels; ++i) { binary_map_[i] = (prob_map[i] > kBinaryThreshold) ? 255 : 0; }#endif}

Key optimization points in this code:
- Batch loading: vld1q_f32 loads 4 floats at once, reducing memory access frequency.
- Vectorized comparison: vcgtq_f32 compares 4 values simultaneously to generate a mask.
- Type narrowing: vmovn_u32 compresses 32-bit results into 16-bit, and eventually to 8-bit.
Compared to a scalar implementation, NEON optimization can provide a 3-4x speedup 8.
NEON Implementation of ImageNet Normalization
Image normalization is a crucial step in preprocessing. ImageNet standardization uses the following formula:
$$x' = \frac{x / 255 - \mu}{\sigma}$$

where $\mu = (0.485,\ 0.456,\ 0.406)$ and $\sigma = (0.229,\ 0.224,\ 0.225)$ (RGB channels).
In image_utils.cpp, the NEON-optimized normalization is implemented as follows:
void NormalizeImageNet(const uint8_t* src, int width, int height, int stride, float* dst) { // ImageNet normalization parameters constexpr float kMeanR = 0.485f, kMeanG = 0.456f, kMeanB = 0.406f; constexpr float kStdR = 0.229f, kStdG = 0.224f, kStdB = 0.225f; constexpr float kInvStdR = 1.0f / kStdR; constexpr float kInvStdG = 1.0f / kStdG; constexpr float kInvStdB = 1.0f / kStdB; constexpr float kScale = 1.0f / 255.0f;
#if defined(__ARM_NEON) || defined(__ARM_NEON__) // Precompute: (1/255) / std = 1 / (255 * std) const float32x4_t v_scale_r = vdupq_n_f32(kScale * kInvStdR); const float32x4_t v_scale_g = vdupq_n_f32(kScale * kInvStdG); const float32x4_t v_scale_b = vdupq_n_f32(kScale * kInvStdB);
// Precompute: -mean / std const float32x4_t v_bias_r = vdupq_n_f32(-kMeanR * kInvStdR); const float32x4_t v_bias_g = vdupq_n_f32(-kMeanG * kInvStdG); const float32x4_t v_bias_b = vdupq_n_f32(-kMeanB * kInvStdB);
for (int y = 0; y < height; ++y) { const uint8_t* row = src + y * stride; float* dst_row = dst + y * width * 3;
int x = 0; for (; x + 4 <= width; x += 4) { // Load 4 RGBA pixels (16 bytes) uint8x16_t rgba = vld1q_u8(row + x * 4);
// De-interleave: RGBARGBARGBARGBA -> RRRR, GGGG, BBBB, AAAA uint8x16x4_t channels = vld4q_u8(row + x * 4);
// uint8 -> uint16 -> uint32 -> float32 uint16x8_t r16 = vmovl_u8(vget_low_u8(channels.val[0])); uint16x8_t g16 = vmovl_u8(vget_low_u8(channels.val[1])); uint16x8_t b16 = vmovl_u8(vget_low_u8(channels.val[2]));
float32x4_t r_f = vcvtq_f32_u32(vmovl_u16(vget_low_u16(r16))); float32x4_t g_f = vcvtq_f32_u32(vmovl_u16(vget_low_u16(g16))); float32x4_t b_f = vcvtq_f32_u32(vmovl_u16(vget_low_u16(b16)));
// Normalize: (x / 255 - mean) / std = x * (1/255/std) + (-mean/std) r_f = vmlaq_f32(v_bias_r, r_f, v_scale_r); // fused multiply-add g_f = vmlaq_f32(v_bias_g, g_f, v_scale_g); b_f = vmlaq_f32(v_bias_b, b_f, v_scale_b);
// Interleaved store: RRRR, GGGG, BBBB -> RGBRGBRGBRGB float32x4x3_t rgb = {r_f, g_f, b_f}; vst3q_f32(dst_row + x * 3, rgb); }
// Scalar processing for remaining pixels for (; x < width; ++x) { const uint8_t* px = row + x * 4; float* dst_px = dst_row + x * 3; dst_px[0] = (px[0] * kScale - kMeanR) * kInvStdR; dst_px[1] = (px[1] * kScale - kMeanG) * kInvStdG; dst_px[2] = (px[2] * kScale - kMeanB) * kInvStdB; } }#else // Scalar implementation (omitted)#endif}

Key optimization techniques in this code:

- Precomputing constants: Transforming (x - mean) / std into x * scale + bias to reduce runtime division.
- Fused Multiply-Add: vmlaq_f32 performs multiplication and addition in a single instruction.
- De-interleaved loading: vld4q_u8 automatically separates RGBA into four channels.
- Interleaved storing: vst3q_f32 writes RGB channels back to memory in an interleaved manner.
Zero OpenCV Dependency
Many OCR projects rely on OpenCV for image preprocessing. While OpenCV is powerful, it brings a massive binary footprint; the OpenCV library on Android usually exceeds 10MB.
PPOCRv5-Android chooses a “Zero OpenCV Dependency” route. All image preprocessing operations are implemented in pure C++ in image_utils.cpp:
- Bilinear interpolation resize: Hand-written implementation with NEON support.
- Normalization: ImageNet standardization and recognition standardization.
- Perspective Transform: Cropping text regions at any angle from the original image.
NEON Implementation of Bilinear Interpolation
Bilinear interpolation is the core algorithm for image scaling. Given a source image coordinate $(x, y)$, bilinear interpolation calculates the target pixel value:

$$f(x, y) = (1-\alpha)(1-\beta)\,f_{00} + \alpha(1-\beta)\,f_{10} + (1-\alpha)\beta\,f_{01} + \alpha\beta\,f_{11}$$

where $\alpha = x - \lfloor x \rfloor$, $\beta = y - \lfloor y \rfloor$, and $f_{00}, f_{10}, f_{01}, f_{11}$ are the values of the four neighboring pixels.
void ResizeBilinear(const uint8_t* src, int src_w, int src_h, int src_stride, uint8_t* dst, int dst_w, int dst_h) { const float scale_x = static_cast<float>(src_w) / dst_w; const float scale_y = static_cast<float>(src_h) / dst_h;
for (int dy = 0; dy < dst_h; ++dy) { const float sy = (dy + 0.5f) * scale_y - 0.5f; const int y0 = std::max(0, static_cast<int>(std::floor(sy))); const int y1 = std::min(src_h - 1, y0 + 1); const float beta = sy - y0; const float inv_beta = 1.0f - beta;
const uint8_t* row0 = src + y0 * src_stride; const uint8_t* row1 = src + y1 * src_stride; uint8_t* dst_row = dst + dy * dst_w * 4;
#if defined(__ARM_NEON) || defined(__ARM_NEON__) // NEON: Process 4 target pixels at a time const float32x4_t v_beta = vdupq_n_f32(beta); const float32x4_t v_inv_beta = vdupq_n_f32(inv_beta);
int dx = 0; for (; dx + 4 <= dst_w; dx += 4) { // Calculate 4 source coordinates float sx[4]; for (int i = 0; i < 4; ++i) { sx[i] = ((dx + i) + 0.5f) * scale_x - 0.5f; }
// Load alpha weights float alpha[4], inv_alpha[4]; int x0[4], x1[4]; for (int i = 0; i < 4; ++i) { x0[i] = std::max(0, static_cast<int>(std::floor(sx[i]))); x1[i] = std::min(src_w - 1, x0[i] + 1); alpha[i] = sx[i] - x0[i]; inv_alpha[i] = 1.0f - alpha[i]; }
// Perform bilinear interpolation for each channel for (int c = 0; c < 4; ++c) { // RGBA float32x4_t f00, f10, f01, f11;
// Gather neighboring values for 4 pixels f00 = vsetq_lane_f32(row0[x0[0] * 4 + c], f00, 0); f00 = vsetq_lane_f32(row0[x0[1] * 4 + c], f00, 1); f00 = vsetq_lane_f32(row0[x0[2] * 4 + c], f00, 2); f00 = vsetq_lane_f32(row0[x0[3] * 4 + c], f00, 3); // ... f10, f01, f11 similar
// Bilinear interpolation formula float32x4_t v_alpha = vld1q_f32(alpha); float32x4_t v_inv_alpha = vld1q_f32(inv_alpha);
float32x4_t top = vmlaq_f32( vmulq_f32(f00, v_inv_alpha), f10, v_alpha ); float32x4_t bottom = vmlaq_f32( vmulq_f32(f01, v_inv_alpha), f11, v_alpha ); float32x4_t result = vmlaq_f32( vmulq_f32(top, v_inv_beta), bottom, v_beta );
// Convert back to uint8 and store uint32x4_t result_u32 = vcvtq_u32_f32(result); // ... store } }#endif // Scalar processing for remaining pixels (omitted) }}

TIP
NEON optimization for bilinear interpolation is complex because the addresses of the four neighboring pixels are non-contiguous. A more efficient method is to use separable bilinear interpolation: interpolate horizontally first, then vertically. This better utilizes cache locality.
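A single-channel sketch of that separable approach, as an alternative to the fused version above (not the project's current implementation): interpolate every row horizontally into a float intermediate, then interpolate vertically between the pre-resized rows.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Separable bilinear resize for one 8-bit channel.
void ResizeBilinearSeparable(const uint8_t *src, int src_w, int src_h,
                             uint8_t *dst, int dst_w, int dst_h) {
  const float scale_x = static_cast<float>(src_w) / dst_w;
  const float scale_y = static_cast<float>(src_h) / dst_h;
  std::vector<float> tmp(static_cast<size_t>(dst_w) * src_h);  // rows resized in x

  // Pass 1: horizontal interpolation, row by row (contiguous reads, cache friendly).
  for (int y = 0; y < src_h; ++y) {
    const uint8_t *row = src + y * src_w;
    for (int x = 0; x < dst_w; ++x) {
      const float fx = (x + 0.5f) * scale_x - 0.5f;
      const int x0 = std::clamp(static_cast<int>(std::floor(fx)), 0, src_w - 1);
      const int x1 = std::min(x0 + 1, src_w - 1);
      const float a = std::clamp(fx - x0, 0.0f, 1.0f);
      tmp[static_cast<size_t>(y) * dst_w + x] = (1.0f - a) * row[x0] + a * row[x1];
    }
  }

  // Pass 2: vertical interpolation between the already-resized rows.
  for (int y = 0; y < dst_h; ++y) {
    const float fy = (y + 0.5f) * scale_y - 0.5f;
    const int y0 = std::clamp(static_cast<int>(std::floor(fy)), 0, src_h - 1);
    const int y1 = std::min(y0 + 1, src_h - 1);
    const float b = std::clamp(fy - y0, 0.0f, 1.0f);
    for (int x = 0; x < dst_w; ++x) {
      const float v = (1.0f - b) * tmp[static_cast<size_t>(y0) * dst_w + x] +
                      b * tmp[static_cast<size_t>(y1) * dst_w + x];
      dst[static_cast<size_t>(y) * dst_w + x] =
          static_cast<uint8_t>(std::clamp(v + 0.5f, 0.0f, 255.0f));
    }
  }
}
```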
The cost of this choice is more development work, but the benefits are significant:
- APK size reduced by about 10MB.
- Full control over preprocessing logic, facilitating optimization.
- Avoidance of OpenCV version compatibility issues.
Perspective Transform: From Rotated Rectangles to Standard Text Lines
Text recognition models expect horizontal text line images as input. However, detected text boxes can be rotated rectangles at any angle. Perspective transform is responsible for “straightening” these rotated rectangular regions.
In text_recognizer.cpp, the CropAndRotate method implements this functionality:
void CropAndRotate(const uint8_t *__restrict__ image_data, int width, int height, int stride, const RotatedRect &box, int &target_width) { // Calculate the four corner points of the rotated rectangle const float cos_angle = std::cos(box.angle * M_PI / 180.0f); const float sin_angle = std::sin(box.angle * M_PI / 180.0f); const float half_w = box.width / 2.0f; const float half_h = box.height / 2.0f;
float corners[8]; // (x, y) coordinates for 4 corners corners[0] = box.center_x + (-half_w * cos_angle - (-half_h) * sin_angle); corners[1] = box.center_y + (-half_w * sin_angle + (-half_h) * cos_angle); // ... calculate other corners
// Adaptive target width: maintain aspect ratio const float aspect_ratio = src_width / std::max(src_height, 1.0f); target_width = static_cast<int>(kRecInputHeight * aspect_ratio); target_width = std::clamp(target_width, 1, kRecInputWidth); // 48x[1, 320]
// Affine transform matrix const float a00 = (x1 - x0) * inv_dst_w; const float a01 = (x3 - x0) * inv_dst_h; const float a10 = (y1 - y0) * inv_dst_w; const float a11 = (y3 - y0) * inv_dst_h;
// Bilinear sampling + normalization (NEON optimized) for (int dy = 0; dy < kRecInputHeight; ++dy) { for (int dx = 0; dx < target_width; ++dx) { float sx = base_sx + a00 * dx; float sy = base_sy + a10 * dx; BilinearSampleNeon(image_data, stride, sx, sy, dst_row + dx * 3); } }}

Key optimizations in this implementation:
- Adaptive width: Dynamically adjusts output width based on the text box aspect ratio, avoiding excessive stretching or compression.
- Affine transform approximation: For text boxes that are approximately parallelograms, affine transform is used instead of perspective transform to reduce computation.
- NEON Bilinear Sampling: Sampling and normalization are completed in a single pass, reducing memory access.
JNI: The Bridge Between Kotlin and C++
JNI (Java Native Interface) is the bridge for communication between Kotlin/Java and C++. However, JNI calls have overhead, and frequent cross-language calls can severely impact performance.
The design principle of PPOCRv5-Android is to minimize the number of JNI calls. The entire OCR process requires only one JNI call:
sequenceDiagram participant K as Kotlin Layer participant J as JNI Bridge participant N as Native Layer participant G as GPU
K->>J: process(bitmap) J->>N: Pass RGBA pointer
Note over N,G: Native layer completes all work
N->>N: Image Preprocessing NEON N->>G: Text Detection Inference G-->>N: Probability Map N->>N: Post-processing Contour Detection
loop Each Text Box N->>N: Perspective Transform Crop N->>G: Text Recognition Inference G-->>N: Logits N->>N: CTC Decoding end
N-->>J: OCR Results J-->>K: List OcrResult

In ppocrv5_jni.cpp, the core nativeProcess function demonstrates this design:
JNIEXPORT jobjectArray JNICALLJava_me_fleey_ppocrv5_ocr_OcrEngine_nativeProcess( JNIEnv *env, jobject thiz, jlong handle, jobject bitmap) {
auto *engine = reinterpret_cast<ppocrv5::OcrEngine *>(handle);
// Lock Bitmap pixels void *pixels = nullptr; AndroidBitmap_lockPixels(env, bitmap, &pixels);
// Complete all OCR work in a single JNI call auto results = engine->Process( static_cast<const uint8_t *>(pixels), static_cast<int>(bitmap_info.width), static_cast<int>(bitmap_info.height), static_cast<int>(bitmap_info.stride));
AndroidBitmap_unlockPixels(env, bitmap);
// Construct Java object array to return // ...}

This design avoids the overhead of passing data back and forth between detection and recognition.
Architecture Design: Modularity and Testability
The architecture of PPOCRv5-Android follows the “Separation of Concerns” principle:
flowchart TB subgraph UI["Jetpack Compose UI Layer"] direction LR CP[CameraPreview] GP[GalleryPicker] RO[ResultOverlay] end
subgraph VM["ViewModel Layer"] OVM[OCRViewModel<br/>State Management] end
subgraph Native["Native Layer - C++"] OE[OcrEngine<br/>Orchestration]
subgraph Detection["Text Detection"] TD[TextDetector] DB[DBNet FP16] end
subgraph Recognition["Text Recognition"] TR[TextRecognizer] SVTR[SVTRv2 + CTC] end
subgraph Preprocessing["Image Processing"] IP[ImagePreprocessor<br/>NEON Optimized] PP[PostProcessor<br/>Contour Detection] end
subgraph Runtime["LiteRT Runtime"] GPU[GPU Delegate<br/>OpenCL] CPU[CPU Fallback<br/>XNNPACK] end end
CP --> OVM GP --> OVM OVM --> RO OVM <-->|JNI| OE OE --> TD OE --> TR TD --> DB TR --> SVTR TD --> IP TR --> IP DB --> PP DB --> GPU SVTR --> GPU GPU -.->|Fallback| CPU

The benefits of this layered architecture are:
- UI Layer: Pure Kotlin/Compose, focusing on user interaction.
- ViewModel Layer: Manages state and business logic.
- Native Layer: High-performance computing, completely decoupled from the UI.
Each layer can be tested independently. The Native layer can be unit-tested with Google Test, and the ViewModel layer can be tested with JUnit + MockK.
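For example, the geometric helpers are easy to cover with Google Test. A sketch (assuming RotatedRect and UnclipBox are exposed through a header; the header name here is hypothetical):

```cpp
#include <gtest/gtest.h>

#include "postprocess.h"  // hypothetical header exposing RotatedRect and UnclipBox

// d = A * r / L: a 100x20 box with unclip_ratio 1.5 should grow by 12.5 px per side.
TEST(UnclipBoxTest, ExpandsByExpectedDistance) {
  RotatedRect box{};
  box.center_x = 50.0f;
  box.center_y = 10.0f;
  box.width = 100.0f;
  box.height = 20.0f;
  box.angle = 0.0f;

  UnclipBox(box, /*unclip_ratio=*/1.5f);

  EXPECT_FLOAT_EQ(box.width, 125.0f);    // 100 + 2 * 12.5
  EXPECT_FLOAT_EQ(box.height, 45.0f);    // 20 + 2 * 12.5
  EXPECT_FLOAT_EQ(box.center_x, 50.0f);  // the center must not move
  EXPECT_FLOAT_EQ(box.center_y, 10.0f);
}
```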
Kotlin Layer Encapsulation
In OcrEngine.kt, the Kotlin layer provides a clean API:
class OcrEngine private constructor( private var nativeHandle: Long,) : Closeable {
companion object { init { System.loadLibrary("ppocrv5_jni") }
fun create( context: Context, acceleratorType: AcceleratorType = AcceleratorType.GPU, ): Result<OcrEngine> = runCatching { initializeCache(context)
val detModelPath = copyAssetToCache(context, "$MODELS_DIR/$DET_MODEL_FILE") val recModelPath = copyAssetToCache(context, "$MODELS_DIR/$REC_MODEL_FILE") val keysPath = copyAssetToCache(context, "$MODELS_DIR/$KEYS_FILE")
val handle = OcrEngine(0).nativeCreate( detModelPath, recModelPath, keysPath, acceleratorType.value, )
if (handle == 0L) { throw OcrException("Failed to create native OCR engine") }
OcrEngine(handle) } }
fun process(bitmap: Bitmap): List<OcrResult> { check(nativeHandle != 0L) { "OcrEngine has been closed" } return nativeProcess(nativeHandle, bitmap)?.toList() ?: emptyList() }
override fun close() { if (nativeHandle != 0L) { nativeDestroy(nativeHandle) nativeHandle = 0 } }}

Advantages of this design:

- Uses the Result type to handle initialization errors.
- Implements the Closeable interface, supporting use blocks for automatic resource release.
- Model files are automatically copied from assets to the cache directory.
Cold Start Optimization
The first inference (cold start) is usually much slower than subsequent inferences (warm start). This is because:
- The GPU Delegate needs to compile OpenCL programs.
- Model weights need to be transferred from CPU memory to GPU memory.
- Various caches need to be warmed up.
PPOCRv5-Android mitigates cold start issues through a Warm-up mechanism:
void OcrEngine::WarmUp() { LOGD(TAG, "Starting warm-up (%d iterations)...", kWarmupIterations);
// Create a small test image std::vector<uint8_t> dummy_image(kWarmupImageSize * kWarmupImageSize * 4, 128); for (int i = 0; i < kWarmupImageSize * kWarmupImageSize; ++i) { dummy_image[i * 4 + 0] = static_cast<uint8_t>((i * 7) % 256); dummy_image[i * 4 + 1] = static_cast<uint8_t>((i * 11) % 256); dummy_image[i * 4 + 2] = static_cast<uint8_t>((i * 13) % 256); dummy_image[i * 4 + 3] = 255; }
// Perform a few inferences to warm up for (int iter = 0; iter < kWarmupIterations; ++iter) { float detection_time_ms = 0.0f; detector_->Detect(dummy_image.data(), kWarmupImageSize, kWarmupImageSize, kWarmupImageSize * 4, &detection_time_ms); }
LOGD(TAG, "Warm-up completed (accelerator: %s)", AcceleratorName(active_accelerator_));}Memory Alignment Optimization
In TextDetector::Impl, all pre-allocated buffers use 64-byte alignment:
// Pre-allocated buffers with cache-line alignment
alignas(64) std::vector<uint8_t> resized_buffer_;alignas(64) std::vector<float> normalized_buffer_;alignas(64) std::vector<uint8_t> binary_map_;alignas(64) std::vector<float> prob_map_;

64-byte alignment corresponds to the cache line size of modern ARM processors. Aligned memory access avoids cache line splits and improves memory access efficiency.
Memory Pooling and Object Reuse
Frequent memory allocation and deallocation are performance killers. PPOCRv5-Android uses a pre-allocation strategy, allocating all required memory at once during initialization:
class TextDetector::Impl { // Pre-allocated buffers, lifecycle tied to Impl alignas(64) std::vector<uint8_t> resized_buffer_; // 640 * 640 * 4 = 1.6MB alignas(64) std::vector<float> normalized_buffer_; // 640 * 640 * 3 * 4 = 4.9MB alignas(64) std::vector<uint8_t> binary_map_; // 640 * 640 = 0.4MB alignas(64) std::vector<float> prob_map_; // 640 * 640 * 4 = 1.6MB
bool Initialize(...) { // Allocate once to avoid runtime malloc resized_buffer_.resize(kDetInputSize * kDetInputSize * 4); normalized_buffer_.resize(kDetInputSize * kDetInputSize * 3); binary_map_.resize(kDetInputSize * kDetInputSize); prob_map_.resize(kDetInputSize * kDetInputSize); return true; }};

Benefits of this design:

- Avoids memory fragmentation: All large memory blocks are allocated at startup, preventing fragmentation during runtime.
- Reduces system calls: malloc can trigger system calls; pre-allocation avoids this overhead.
- Cache-friendly: Consecutively allocated memory is more likely to be physically contiguous, improving cache hit rates.
Branch Prediction Optimization
Modern CPUs use branch prediction to improve pipeline efficiency. Incorrect branch prediction leads to pipeline flushes, costing 10-20 clock cycles.
On hot paths, we use __builtin_expect to hint the compiler:
// Most pixels will not exceed the threshold
if (__builtin_expect(prob_map[i] > kBinaryThreshold, 0)) { binary_map_[i] = 255;} else { binary_map_[i] = 0;}

__builtin_expect(expr, val) tells the compiler that the value of expr is very likely to be val. The compiler adjusts the code layout accordingly, placing "unlikely" branches away from the main path.
Loop Unrolling and Software Pipelining
For compute-intensive loops, manual unrolling can reduce loop overhead and expose more instruction-level parallelism:
// Non-unrolled versionfor (int i = 0; i < n; ++i) { dst[i] = src[i] * scale + bias;}
// 4x unrolled versionint i = 0;for (; i + 4 <= n; i += 4) { dst[i + 0] = src[i + 0] * scale + bias; dst[i + 1] = src[i + 1] * scale + bias; dst[i + 2] = src[i + 2] * scale + bias; dst[i + 3] = src[i + 3] * scale + bias;}for (; i < n; ++i) { dst[i] = src[i] * scale + bias;}After unrolling, the CPU can execute multiple independent multiply-add instructions simultaneously, fully utilizing the multiple execution units of superscalar architectures.
Prefetch Optimization
In the inner loop of the perspective transform, use __builtin_prefetch to load data for the next line in advance:
for (int dy = 0; dy < kRecInputHeight; ++dy) { // Prefetch next line data if (dy + 1 < kRecInputHeight) { const float next_sy = y0 + a11 * (dy + 1); const int next_y = static_cast<int>(next_sy); if (next_y >= 0 && next_y < height) { __builtin_prefetch(image_data + next_y * stride, 0, 1); } } // ... process current line}

This optimization can hide memory latency; while processing the current line, the data for the next line is already in the L1 cache.
Engineering Details of Post-processing
Connected Component Analysis and Contour Detection
In postprocess.cpp, the FindContours function implements efficient connected component analysis:
std::vector<std::vector<Point>> FindContours(const uint8_t *binary_map, int width, int height) { // 1. 4x downsampling to reduce computation int ds_width = (width + kDownsampleFactor - 1) / kDownsampleFactor; int ds_height = (height + kDownsampleFactor - 1) / kDownsampleFactor;
std::vector<uint8_t> ds_map(ds_width * ds_height); downsample_binary_map(binary_map, width, height, ds_map.data(), ds_width, ds_height, kDownsampleFactor);
// 2. BFS traversal of connected components std::vector<int> labels(ds_width * ds_height, 0); int current_label = 0;
for (int y = 0; y < ds_height; ++y) { for (int x = 0; x < ds_width; ++x) { if (pixel_at(ds_map.data(), x, y, ds_width) > 0 && labels[y * ds_width + x] == 0) { current_label++; std::vector<Point> boundary; std::queue<std::pair<int, int>> queue; queue.push({x, y});
while (!queue.empty()) { auto [cx, cy] = queue.front(); queue.pop();
// Detect boundary pixels if (is_boundary_pixel(ds_map.data(), cx, cy, ds_width, ds_height)) { boundary.push_back({ static_cast<float>(cx * kDownsampleFactor + kDownsampleFactor / 2), static_cast<float>(cy * kDownsampleFactor + kDownsampleFactor / 2) }); }
// 4-neighbor expansion for (int d = 0; d < 4; ++d) { int nx = cx + kNeighborDx4[d]; int ny = cy + kNeighborDy4[d]; // ... } }
if (boundary.size() >= 4) { contours.push_back(std::move(boundary)); } } } } return contours;}

Key optimization points:
- 4x Downsampling: Downsampling the 640x640 binary map to 160x160 reduces computation by 16 times.
- Boundary Detection: Only boundary pixels are kept, rather than the entire connected component.
- Maximum Contour Limit: kMaxContours = 100 to prevent performance issues in extreme cases.
Convex Hull and Rotating Calipers Algorithms
Calculating the minimum area rotated rectangle involves two steps: first calculating the convex hull, then using the rotating calipers algorithm to find the minimum area bounding rectangle.
Graham Scan Convex Hull Algorithm
Graham Scan is a classic algorithm for calculating the convex hull with a time complexity of $O(n \log n)$:
std::vector<Point> ConvexHull(std::vector<Point> points) { if (points.size() < 3) return points;
// 1. Find the bottom-most point (min y, then min x) auto pivot = std::min_element(points.begin(), points.end(), [](const Point& a, const Point& b) { return a.y < b.y || (a.y == b.y && a.x < b.x); }); std::swap(points[0], *pivot); Point p0 = points[0];
// 2. Sort by polar angle std::sort(points.begin() + 1, points.end(), [&p0](const Point& a, const Point& b) { float cross = CrossProduct(p0, a, b); if (std::abs(cross) < 1e-6f) { // When collinear, the closer point comes first return DistanceSquared(p0, a) < DistanceSquared(p0, b); } return cross > 0; // Counter-clockwise direction });
// 3. Build the convex hull std::vector<Point> hull; for (const auto& p : points) { // Remove points that cause a clockwise turn while (hull.size() > 1 && CrossProduct(hull[hull.size()-2], hull[hull.size()-1], p) <= 0) { hull.pop_back(); } hull.push_back(p); }
return hull;}
// Cross product: determine turn direction
float CrossProduct(const Point& o, const Point& a, const Point& b) { return (a.x - o.x) * (b.y - o.y) - (a.y - o.y) * (b.x - o.x);}

Rotating Calipers Algorithm
The Rotating Calipers algorithm iterates through each edge of the convex hull and calculates the area of the bounding rectangle based on that edge:
RotatedRect MinAreaRect(const std::vector<Point>& hull) { if (hull.size() < 3) return {};
float min_area = std::numeric_limits<float>::max(); RotatedRect best_rect;
int n = hull.size(); int right = 1, top = 1, left = 1; // Three "caliper" positions
for (int i = 0; i < n; ++i) { int j = (i + 1) % n;
// Direction vector of the current edge float edge_x = hull[j].x - hull[i].x; float edge_y = hull[j].y - hull[i].y; float edge_len = std::sqrt(edge_x * edge_x + edge_y * edge_y);
// Unit vector float ux = edge_x / edge_len; float uy = edge_y / edge_len;
// Perpendicular direction float vx = -uy; float vy = ux;
// Find the rightmost point (max projection along edge direction) while (Dot(hull[(right + 1) % n], ux, uy) > Dot(hull[right], ux, uy)) { right = (right + 1) % n; }
// Find the topmost point (max projection along perpendicular direction) while (Dot(hull[(top + 1) % n], vx, vy) > Dot(hull[top], vx, vy)) { top = (top + 1) % n; }
// Find the leftmost point while (Dot(hull[(left + 1) % n], ux, uy) < Dot(hull[left], ux, uy)) { left = (left + 1) % n; }
// Calculate rectangle dimensions float width = Dot(hull[right], ux, uy) - Dot(hull[left], ux, uy); float height = Dot(hull[top], vx, vy) - Dot(hull[i], vx, vy); float area = width * height;
if (area < min_area) { min_area = area; // Update optimal rectangle parameters best_rect.width = width; best_rect.height = height; best_rect.angle = std::atan2(uy, ux) * 180.0f / M_PI; // Calculate center point... } }
return best_rect;}

The key insight of rotating calipers is that as the base edge rotates, the three "calipers" (rightmost, topmost, leftmost points) only move monotonically forward. Thus, the total time complexity is $O(n)$ rather than $O(n^2)$.
Minimum Area Rotated Rectangle
The MinAreaRect function uses the rotating calipers algorithm to calculate the minimum area rotated rectangle:
RotatedRect MinAreaRect(const std::vector<Point> &contour) { // 1. Subsampling to reduce point count std::vector<Point> points = subsample_points(contour, kMaxBoundaryPoints);
// 2. Fast path: use AABB for text boxes with high aspect ratios float aspect = std::max(aabb_width, aabb_height) / std::max(1.0f, std::min(aabb_width, aabb_height)); if (aspect > 2.0f && points.size() > 50) { // Return axis-aligned bounding box directly RotatedRect rect; rect.center_x = (min_x + max_x) / 2.0f; rect.center_y = (min_y + max_y) / 2.0f; rect.width = aabb_width; rect.height = aabb_height; rect.angle = 0.0f; return rect; }
// 3. Convex hull calculation std::vector<Point> hull = convex_hull(std::vector<Point>(points));
// 4. Rotating calipers: iterate through each edge of the convex hull float min_area = std::numeric_limits<float>::max(); RotatedRect best_rect;
for (size_t i = 0; i < hull.size(); ++i) { // Calculate bounding rectangle based on the current edge float edge_x = hull[j].x - hull[i].x; float edge_y = hull[j].y - hull[i].y;
// Project all points onto the edge direction and perpendicular direction project_points_onto_axis(hull, axis1_x, axis1_y, min1, max1); project_points_onto_axis(hull, axis2_x, axis2_y, min2, max2);
float area = (max1 - min1) * (max2 - min2); if (area < min_area) { min_area = area; // Update optimal rectangle } }
return best_rect;}

The time complexity of this algorithm is $O(n \log n)$ (convex hull calculation) plus $O(n)$ (rotating calipers), where $n$ is the number of boundary points. By subsampling to limit $n$ to within 200, real-time performance is ensured.
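The subsample_points helper is not shown in the excerpt; a straightforward stride-based version might look like this (a sketch under that assumption):

```cpp
#include <cstddef>
#include <vector>

// Keep at most max_points points by walking the contour at a fixed stride.
std::vector<Point> subsample_points(const std::vector<Point> &contour,
                                    size_t max_points) {
  if (contour.size() <= max_points) return contour;

  std::vector<Point> sampled;
  sampled.reserve(max_points);
  const float step = static_cast<float>(contour.size()) / max_points;
  for (size_t i = 0; i < max_points; ++i) {
    sampled.push_back(contour[static_cast<size_t>(i * step)]);
  }
  return sampled;  // the order of the boundary points is preserved
}
```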
Real-time Camera OCR: CameraX and Frame Analysis
The challenge of real-time OCR is how to process each frame as quickly as possible while maintaining a smooth preview.
flowchart TB subgraph Camera["CameraX Pipeline"] direction TB CP[CameraProvider] PV[Preview UseCase<br/>30 FPS] IA[ImageAnalysis UseCase<br/>STRATEGY_KEEP_ONLY_LATEST] end
subgraph Analysis["Frame Analysis Pipeline"] direction TB IP[ImageProxy<br/>YUV_420_888] BM[Bitmap Conversion<br/>RGBA_8888] JNI[JNI Call<br/>Single Cross-language] end
subgraph Native["Native OCR"] direction TB DET[TextDetector<br/>~45ms GPU] REC[TextRecognizer<br/>~15ms/line] RES[OCR Results] end
subgraph UI["UI Update"] direction TB VM[ViewModel<br/>StateFlow] OV[ResultOverlay<br/>Canvas Drawing] end
CP --> PV CP --> IA IA --> IP --> BM --> JNI JNI --> DET --> REC --> RES RES --> VM --> OV

CameraX ImageAnalysis
CameraX is the Android Jetpack camera library, providing the ImageAnalysis use case, which allows us to perform real-time analysis on camera frames:
val imageAnalysis = ImageAnalysis.Builder() .setTargetResolution(Size(1280, 720)) .setBackpressureStrategy(ImageAnalysis.STRATEGY_KEEP_ONLY_LATEST) .build()
imageAnalysis.setAnalyzer(executor) { imageProxy -> val bitmap = imageProxy.toBitmap() val result = ocrEngine.process(bitmap) // Update UI imageProxy.close()}

The key configuration is STRATEGY_KEEP_ONLY_LATEST: when the analyzer's processing speed cannot keep up with the camera's frame rate, old frames are discarded, keeping only the latest one. This ensures the timeliness of OCR results.
Trade-off Between Frame Rate and Latency
On GPU-accelerated devices (my current Snapdragon 870 seems to have issues, consistently failing to offload most computations to the GPU), PPOCRv5-Android can theoretically achieve high processing speeds. However, this doesn’t mean we should process every frame.
Consider a scenario where a user points the camera at a block of text; the text content won’t change in a short period. Performing full OCR on every frame would waste significant computational resources.
An optimization strategy is “change detection”: triggering OCR only when the scene changes significantly. This can be achieved by comparing histograms or feature points of consecutive frames.
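A minimal sketch of this idea in Kotlin, assuming the analyzer hands us a small grayscale thumbnail of each frame; the class, method names, and threshold below are illustrative and not part of PPOCRv5-Android:

```kotlin
import kotlin.math.abs

// Hypothetical helper: trigger OCR only when consecutive frames differ enough,
// judged by a coarse 32-bin luminance histogram of a downscaled thumbnail.
class FrameChangeDetector(private val threshold: Double = 0.15) {
    private var lastHistogram: IntArray? = null

    /** @param gray a downscaled grayscale thumbnail, pixel values in 0..255 */
    fun shouldRunOcr(gray: IntArray): Boolean {
        val hist = IntArray(32)
        for (v in gray) hist[v shr 3]++   // map 0..255 to 32 bins

        val prev = lastHistogram
        lastHistogram = hist
        if (prev == null) return true     // first frame: always run OCR

        // Normalized L1 distance between the two histograms
        var diff = 0
        for (i in hist.indices) diff += abs(hist[i] - prev[i])
        return diff.toDouble() / gray.size > threshold
    }
}
```

Inside the ImageAnalysis analyzer, the full detection + recognition pass would then only run when shouldRunOcr() returns true; otherwise the previous results can simply be redrawn.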
Future Outlook: NPU and Quantization
The future of on-device AI lies in NPUs (Neural Processing Units). Compared to GPUs, NPUs are specifically designed for neural network inference and offer higher energy efficiency.
However, the challenge with NPUs is fragmentation. Each chip vendor has its own NPU architecture and SDK:
- Qualcomm: Hexagon DSP + AI Engine
- MediaTek: APU
- Samsung: Exynos NPU
- Google: Tensor TPU
Android’s NNAPI (Neural Networks API) attempts to provide a unified abstraction layer, but actual results vary. Many NPU features are not exposed through NNAPI, forcing developers to use vendor-specific SDKs.
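PPOCRv5-Android drives LiteRT through the native C API, but for illustration, this is roughly how the standard Java/Kotlin bindings opt into NNAPI; the helper function below is an assumption, only the org.tensorflow.lite classes are real:

```kotlin
import org.tensorflow.lite.Interpreter
import org.tensorflow.lite.nnapi.NnApiDelegate
import java.io.File

// Illustrative sketch: ask the runtime to dispatch through NNAPI, and fall back
// to plain CPU execution if the delegate cannot be created or applied.
fun createNnapiInterpreter(modelFile: File): Interpreter =
    try {
        Interpreter(modelFile, Interpreter.Options().addDelegate(NnApiDelegate()))
    } catch (e: Exception) {
        // Whether NNAPI actually reaches the NPU depends on the vendor driver.
        Interpreter(modelFile, Interpreter.Options())
    }
```

Even when the delegate is accepted, operators it does not support silently fall back to the CPU, which is exactly the “actual results vary” problem described above.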
INT8 Quantization: An Unfinished Battle
FP16 quantization is a conservative choice that loses almost no accuracy. But for extreme performance, INT8 quantization is the next step.
INT8 quantization compresses weights and activations from 32-bit floating point to 8-bit integers, which theoretically provides:
- 4x model compression.
- 2-4x inference speedup (depending on hardware).
- Over 10x speedup on Qualcomm Hexagon DSPs.
This temptation was too great, so I began a long journey into INT8 quantization.
First Attempt: Synthetic Data Calibration
INT8 quantization requires a calibration dataset to determine quantization parameters (Scale and Zero Point). Initially, I took a shortcut and used randomly generated “text-like” images:
```python
# Wrong approach: using random noise for calibration
img = np.ones((h, w, 3), dtype=np.float32) * 0.9
for _ in range(num_lines):
    gray_val = np.random.uniform(0.05, 0.3)
    img[y:y+line_h, x:x+line_w] = gray_val
```
The result was disastrous. The model output was all zeros:

```
Raw FLOAT32 output range: min=0.0000, max=0.0000
Prob map stats: min=0.0000, max=0.0000, mean=0.000000
```
The quantization tool calculated incorrect parameters from the random noise, causing the activation values of real images to be truncated.
Second Attempt: Real Image Calibration
I switched to real OCR dataset images: ICDAR2015, TextOCR, and PaddleOCR official samples. I also implemented Letterbox preprocessing to ensure the image distribution during calibration matched that during inference:
```python
def letterbox_image(image, target_size):
    """Resize maintaining aspect ratio, pad remaining parts with gray"""
    ih, iw = image.shape[:2]
    h, w = target_size
    scale = min(w / iw, h / ih)
    # ... center paste
```
The model no longer output all zeros, but the recognition results were still gibberish.
Third Attempt: Fixing Type Handling on the C++ Side
I discovered that the C++ code had issues handling INT8 inputs. The INT8 model expects raw pixel values (0-255), but I was still performing ImageNet normalization (subtracting mean, dividing by variance).
```cpp
if (input_is_int8_) {
    // INT8 model: feed raw pixels directly; normalization is fused into the first layer.
    // XOR with 0x80 flips the sign bit, mapping uint8 [0, 255] to int8 [-128, 127].
    dst[i * 3 + 0] = static_cast<int8_t>(src[i * 4 + 0] ^ 0x80);
} else {
    // FP32 model: manual normalization required
    // (pixel - mean) / std
}
```
I also implemented logic to dynamically read the quantization parameters instead of hardcoding them:
```cpp
bool GetQuantizationParams(LiteRtTensor tensor, float* scale, int32_t* zero_point) {
    LiteRtQuantization quant;
    LiteRtGetTensorQuantization(tensor, &quant);
    // ...
}
```
Final Result: Compromise
After days of debugging, the INT8 model still failed to work correctly. The issues likely stemmed from:
- onnx2tf’s quantization implementation: PP-OCRv5 uses some special operator combinations that onnx2tf might not have handled correctly during quantization.
- DBNet’s output characteristics: DBNet outputs a probability map with values between 0 and 1; INT8 quantization is particularly sensitive to such small ranges.
- Error accumulation in multi-stage models: Detection and recognition models are cascaded, so quantization errors accumulate and amplify.
Let’s analyze the second point further. DBNet’s output passes through a Sigmoid activation, compressing the range to [0, 1]. INT8 quantization maps a float value x to an integer q via q = clamp(round(x / scale) + zero_point, -128, 127). For values in the [0, 1] range, if the scale is set incorrectly, the quantized values might occupy only a small fraction of the INT8 range [-128, 127], leading to severe precision loss.
```python
# Assume scale = 0.00784 (1/127), zero_point = 0
# Input 0.5   -> round(0.5 / 0.00784) + 0 = 64
# Input 0.1   -> round(0.1 / 0.00784) + 0 = 13
# Input 0.01  -> round(0.01 / 0.00784) + 0 = 1
# Input 0.001 -> round(0.001 / 0.00784) + 0 = 0  # Precision lost!
```
The binarization threshold for DBNet is usually set to 0.1-0.3, meaning the bulk of the meaningful probability values (0.1-0.3) can only be represented by about 26 integer levels (13-38) after quantization, resulting in insufficient resolution.
WARNING
INT8 quantization for PP-OCRv5 is a known difficult problem. If you are attempting this, it’s recommended to first ensure the FP32 model works correctly before troubleshooting quantization issues. Alternatively, consider using the official Paddle Lite framework from PaddlePaddle, which has better support for PaddleOCR.
Quantization-Aware Training: The Correct Solution
If INT8 quantization is mandatory, the correct approach is Quantization-Aware Training (QAT) rather than Post-Training Quantization (PTQ).
QAT simulates quantization errors during the training process, allowing the model to learn to adapt to low-precision representations:
```python
# PyTorch QAT Example
import torch.quantization as quant

model = DBNet()
model.qconfig = quant.get_default_qat_qconfig('fbgemm')
model_prepared = quant.prepare_qat(model)

# Normal training, but with fake quantization nodes inserted in forward passes
for epoch in range(num_epochs):
    for images, labels in dataloader:
        optimizer.zero_grad()
        outputs = model_prepared(images)  # Includes quantization simulation
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

# Convert to a real quantized model
model_quantized = quant.convert(model_prepared)
```
Unfortunately, the official PP-OCRv5 does not provide QAT-trained models. This means that obtaining a high-quality INT8 model would require QAT training from scratch, which is beyond the scope of this project.
Ultimately, I chose to compromise: using FP16 quantization + GPU acceleration instead of INT8 + DSP.
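In the Java/Kotlin bindings, that FP16 + GPU combination corresponds roughly to the sketch below; the project itself configures the GPU accelerator from native code, so treat this purely as an assumed illustration:

```kotlin
import org.tensorflow.lite.Interpreter
import org.tensorflow.lite.gpu.CompatibilityList
import org.tensorflow.lite.gpu.GpuDelegate
import java.io.File

// Illustrative sketch: prefer the GPU delegate when the device supports it,
// otherwise fall back to multi-threaded CPU inference.
fun createOcrInterpreter(modelFile: File): Interpreter {
    val options = Interpreter.Options()
    val compatList = CompatibilityList()
    if (compatList.isDelegateSupportedOnThisDevice) {
        // An FP16-quantized model runs in half precision on most mobile GPUs.
        options.addDelegate(GpuDelegate(compatList.bestOptionsForThisDevice))
    } else {
        options.setNumThreads(4)
    }
    return Interpreter(modelFile, options)
}
```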
The costs of this decision are:
- Model size is twice that of INT8.
- Cannot leverage the ultra-low power consumption of the Hexagon DSP.
- Inference speed is 2-3x slower than the theoretical optimum.
But the benefits are:
- Model accuracy is almost identical to FP32.
- Development cycle is significantly shortened.
- Code complexity is reduced.
The essence of engineering is trade-offs. Sometimes, “good enough” is more important than “theoretically optimal.”
Conclusion
From PaddlePaddle to TFLite, from DBNet to SVTRv2, and from OpenCL to NEON, the engineering practice of on-device OCR involves knowledge across multiple fields: deep learning, compilers, GPU programming, and mobile development.
The core lesson of this project is that on-device AI is not just about “putting a model on a phone.” It requires:
- Deeply understanding the model architecture to convert it correctly.
- Familiarity with hardware characteristics to fully utilize accelerators.
- Mastery of system programming to implement high-performance native code.
- Focus on user experience to find the balance between performance and power consumption.
PPOCRv5-Android is an open-source project that demonstrates how to deploy modern OCR models into actual mobile applications. I hope this article provides some reference for developers with similar needs.
As Google stated at the launch of LiteRT: “Maximum performance, simplified.” 9 The goal of on-device AI is not complexity, but making complexity simple.
Afterword
To be honest, I have been away from the Android field (both professionally and as a hobby) for at least two years. This is the first time I’ve publicly released a relatively mature library on my GitHub secondary account (I’ve handed over my primary account to colleagues to show my determination to move on).
Over the years, my work focus hasn’t actually been in the Android field. I can’t disclose the specifics, but I’ll have the chance to elaborate in the future. In short, it might be difficult for me to make further contributions to Android.
The release of this project was driven by my personal interest—I’m building an early-stage tool based on Android on-device capabilities, and OCR is just a small part of its underlying layer. The full source code will be opened soon (likely very soon), though I can’t reveal more for now.
Anyway, thank you for reading this far, and I look forward to you giving my repository a Star. Thank you!
References

1. Google AI Edge. “LiteRT: Maximum performance, simplified.” 2024. https://developers.googleblog.com/litert-maximum-performance-simplified/
2. PaddleOCR Team. “PaddleOCR 3.0 Technical Report.” arXiv:2507.05595, 2025. https://arxiv.org/abs/2507.05595
3. GitHub Discussion. “Problem while deploying the newest official PP-OCRv5.” PaddleOCR #16100, 2025. https://github.com/PaddlePaddle/PaddleOCR/discussions/16100
4. Liao, M., et al. “Real-time Scene Text Detection with Differentiable Binarization.” Proceedings of the AAAI Conference on Artificial Intelligence, 2020. https://arxiv.org/abs/1911.08947
5. Du, Y., et al. “SVTR: Scene Text Recognition with a Single Visual Model.” IJCAI, 2022. https://arxiv.org/abs/2205.00159
6. Du, Y., et al. “SVTRv2: CTC Beats Encoder-Decoder Models in Scene Text Recognition.” ICCV, 2025. https://arxiv.org/abs/2411.15858
7. TensorFlow Blog. “Even Faster Mobile GPU Inference with OpenCL.” 2020. https://blog.tensorflow.org/2020/08/faster-mobile-gpu-inference-with-opencl.html
8. ARM Developer. “Neon Intrinsics on Android.” ARM Documentation, 2024. https://developer.arm.com/documentation/101964/latest/
9. Google AI Edge. “LiteRT Documentation.” 2024. https://ai.google.dev/edge/litert