Unlimited OCR Works

by Baidu Inc.

Audio version created with Paper2Audio.

Welcome the Era of One-shot Long-horizon Parsing

Abstract

Recently, end-to-end O.C.R models, exemplified by DeepSeek O.C.R, have once again thrust O.C.R into the spotlight. A widely held view is that employing a large language model (L.L.M) as the decoder allows the model to leverage the prior distribution of language, leading to improved O.C.R performance. However, the downside is equally evident: as the output sequence lengthens, the accumulated K.V cache drives up memory consumption and progressively slows down generation. This stands in stark contrast to humans, who exhibit no such decline in efficiency during long-horizon copying tasks. In this technical report, we propose Unlimited O.C.R, a model designed to emulate human parsing working memory. Taking DeepSeek O.C.R as the baseline, we replace all attention layers in the decoder with our proposed Reference Sliding Window Attention (R-S.W.A), which reduces attention computation costs while maintaining a constant K.V cache throughout the entire decoding process. By combining the high compression rate of DeepSeek O.C.R's encoder with our constant K.V cache design, Unlimited O.C.R can transcribe dozens of pages of documents in a single forward pass under a standard maximum length of 32 K. More importantly, R-S.W.A is a general-purpose parsing attention mechanism: beyond O.C.R, it is equally applicable to tasks such as A.S.R and translation. Code and model weights are publicly available on GitHub.

1. Introduction

Humans are remarkably adept at seemingly straightforward long-horizon tasks: transcribing hundreds of book pages, translating hours-long audio recordings, and the like. Yet these are precisely the tasks where current models fall short. Take O.C.R as an example: no existing model can parse even tens of pages in a single forward pass.

Instead, they resort to page-by-page processing in a for-loop fashion, resetting memory at every step. This divergence is far from superficial, and it cannot be reduced to a mere lack of sufficient context. When humans perform such tasks, they maintain a continuous cognitive state in which distant outputs fade softly from memory, while nearby context is used to track progress.

The for-loop paradigm, by contrast, erases memory entirely at each page, fragmenting a coherent long-horizon process into isolated short tasks managed by an external scheduler. It works to some extent, but it remains an engineering workaround, not a step toward general-purpose (A.G.I-level) intelligence.

Consider the act of transcribing a document. As we copy each character, we do not scan the entire text already written; we simply glance at the immediately surrounding context to stay oriented. This everyday behavior points to an attention pattern fundamentally different from those in current models. It is not standard full attention—the full history is never fully consulted.

Nor does it resemble linear attention, since visual/reference tokens undergo no recurrent state updates; such updates would progressively blur the visual features and degrade recognition accuracy. To align more closely with this natural attention flow, and to explore how multimodal large language models (M.L.L.M's) can handle simple long-horizon parsing tasks, we propose Unlimited O.C.R. Our main contributions are as follows:

- We introduce Reference Sliding Window Attention (R-S.W.A), illustrated in Figure 1. For each token, R-S.W.A attends to all reference tokens—visual tokens and the prompt—while limiting output attention to the preceding n tokens (n defaults to 128). In this way, each token perceives the full image and autonomously tracks O.C.R progress through state transitions within the causal sliding window. This design keeps the K.V cache constant during inference, alleviating memory pressure and reducing the computational cost.

Figure 1 summary: This figure is a schematic diagram comparing two attention mechanisms. The illustration contrasts Vanilla Attention with Reference Sliding Window Attention, showing how tokens interact with reference tokens and working memory during the generation process. In Vanilla Attention, the attention scope expands as more tokens are generated, leading to an increasing cache size. In contrast, Reference Sliding Window Attention maintains a fixed-size window for working memory while ensuring that every generated token consistently attends to all reference tokens. This design allows the system to maintain a constant cache size throughout the decoding process and prevents the loss of detail in reference tokens, which avoids the blurring effect seen in standard sliding window approaches.

- Building on R-S.W.A, we propose Unlimited O.C.R. Using DeepSeek O.C.R as our baseline, we retain its DeepEncoder with its high image compression rate and replace every attention layer in the L.L.M decoder with R-S.W.A. This enables Unlimited O.C.R to parse dozens of document pages in a single forward pass. R-S.W.A also improves general O.C.R accuracy: Unlimited O.C.R achieves 93% on the OmniDocBench v1.5 benchmark, outperforming the DeepSeek O.C.R baseline by 6%.

- We conduct a preliminary validation of M.L.L.M architectures with linear-complexity attention on O.C.R tasks, particularly in long-horizon scenarios. Rather than brute-force scaling up the training context, we identify an elegant approach that achieves long-horizon O.C.R. Looking ahead, we see promise in extending R-S.W.A to A.S.R, translation, and other reference-based tasks that demand long-horizon dependency modeling.

In summary, we present R-S.W.A, which substantially reduces both the computational cost of attention and the memory footprint of long-horizon inference. Building on R-S.W.A, Unlimited O.C.R not only enables one-shot parsing of an entire book, but also surpasses the DeepSeek O.C.R baseline by a large margin on popular document parsing benchmarks. Furthermore, we believe R-S.W.A holds promise well beyond O.C.R.

2. Related Works

2.1. Pipeline-based Framework

Traditional O.C.R models, particularly those designed for document parsing, typically adopt a pipeline architecture: a detection model first identifies different types of document elements, followed by multiple recognition operators that further parse the content within those blocks. These components are often bridged by a variety of heuristic strategies, such as cropping and rectification. In recent years, with the powerful decoding capabilities of large language models (L.L.M's), the pipeline-based O.C.R paradigm has continued to evolve. The most straightforward adaptation retains the detection model while consolidating the multiple recognition models into a single unified model, a pragmatic hybrid that combines mature traditional detection algorithms with the advanced decoder of an L.L.M. Another pipeline variant invokes the L.L.M twice, replacing even the detection model with the same L.L.M, so that the entire O.C.R workflow becomes: L.L.M detection, then a cropping strategy, then L.L.M recognition. Thanks to the inherent flexibility in how O.C.R tasks can be decomposed, pipeline architectures remain widely adopted to this day.

2.2. End-to-end Model

With the advancement of vision-language models (V.L.M's), end-to-end O.C.R models, especially dense O.C.R models, are on the rise. This approach fully leverages the powerful decoding capabilities of L.L.M's by merging text detection and recognition into a single unified function, allowing the entire content of a page to be parsed in a single forward pass. Compared with the pipeline approach, the end-to-end approach places higher demands on model capacity and poses greater training challenges. This, in turn, makes research on end-to-end O.C.R models all the more compelling: innovations in architectural design and iterative improvements in training methodologies can more directly inspire, or even advance, the development of general-purpose V.L.M's.

2.2.1. High-compression Encoder

In end-to-end models, the encoder is an indispensable module that extracts and compresses image information. To a certain extent, the encoder determines the upper bound of the model. Take generation efficiency as an example: if the encoder's token compression ratio is insufficient and it emits too many vision tokens, the excessively long prefix slows down decoding. The same holds true for effective decoding length. DeepEncoder achieves 16× token compression under low activation values by cascading a window-attention ViT with a global-attention ViT, making it an ideal choice for multi-page long-horizon O.C.R.

2.2.2. High-efficiency Decoder

What most directly affects inference cost is the decoder, specifically the number of activated L.L.M parameters and the K.V cache size. Regarding the former, current end-to-end O.C.R models typically have fewer than 3 B parameters, and DeepSeek O.C.R uses an MoE architecture that activates only 500 M parameters during inference. As for the K.V cache, in all current models it grows continuously with the decoding context, which limits both generation speed and length. This is exactly the key issue that Unlimited O.C.R aims to address.

3. Methodology

3.1. Long-horizon Parsing

Humans excel at long-horizon parsing tasks: continuously transcribing an entire book, translating hundreds of pages in one sitting, or transcribing hours of audio without interruption. This continuous parsing capability appears closely linked to working memory. As illustrated in Figure 2, when a person copies a book by hand, their attention typically centers on three points: the original source book, a small portion of what has just been written (usually only a few characters), and the next character about to be written. Rather than retaining a complete memory of everything already transcribed, they engage in a form of soft forgetting.

Figure 2 summary: This figure is a conceptual diagram illustrating a proposed model architecture. The left side uses an analogy of a person copying from a book to represent the input and output process, highlighting cognitive functions like working memory and selective focus. The right side details the technical framework, showing a pipeline where an input is processed by a DeepEncoder, passed through a series of layers, and then handled by a Mixture-of-Experts Large Language Model decoder. Below the framework, the KV cache is depicted as a queue containing visual and prompt tokens alongside generated tokens. The diagram demonstrates a sliding window mechanism where old tokens are evicted as new ones are added. This architecture ensures that memory usage and computational requirements remain constant regardless of the generation length, enabling the model to handle unlimited OCR tasks efficiently.

This may be the key to sustaining long-horizon parsing under low cognitive load. Inspired by this observation, we present Unlimited O.C.R.

3.2. Architecture

As shown in Figure 2, Unlimited O.C.R adopts DeepSeek O.C.R as its baseline. Specifically, it comprises the DeepEncoder paired with a Mixture-of-Experts (MoE) decoder that has 3 B total and 500 M activated parameters. The DeepEncoder stands out for its exceptional visual token compression capability, which can dramatically reduce the K.V cache footprint during the prefill stage while preserving robust optical text feature extraction. Departing from the original DeepSeek O.C.R, we replace the vanilla Multi-Head Attention (M.H.A) with our proposed R-S.W.A. With the newly proposed attention, long-horizon parsing can be achieved by augmenting the original reference K.V cache of length m with a fixed-capacity output K.V buffer of width n. We delve into the technical details in the following sections.
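To make the cache layout concrete, here is a minimal sketch of a reference segment paired with a fixed-capacity output buffer. The class name `ReferenceSlidingCache`, its methods, and the toy sizes (256 reference entries, a window of 128) are illustrative assumptions, not the released implementation; a real cache would hold per-layer key and value tensors rather than placeholder entries.

```python
from collections import deque


class ReferenceSlidingCache:
    """Toy KV cache: a fixed reference segment plus a bounded output window."""

    def __init__(self, reference_entries, window):
        # Reference (visual + prompt) entries are written once at prefill.
        self.reference = list(reference_entries)
        # deque(maxlen=window) drops the oldest output entry automatically.
        self.outputs = deque(maxlen=window)

    def append(self, entry):
        """Store the entry for a newly generated token."""
        self.outputs.append(entry)

    def visible(self):
        """Entries the next token may attend to: all references + recent outputs."""
        return self.reference + list(self.outputs)

    def __len__(self):
        return len(self.reference) + len(self.outputs)


# Hypothetical sizes: 256 reference entries (m) and a window of 128 (n).
cache = ReferenceSlidingCache(reference_entries=range(256), window=128)
sizes = []
for step in range(1000):
    cache.append(("out", step))
    sizes.append(len(cache))

print(sizes[0], max(sizes))  # → 257 384
```

After the window fills, the cache length stops growing at m + n regardless of how many tokens are generated, which is the property the architecture relies on.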

3.3. DeepEncoder

DeepEncoder was originally introduced in DeepSeek O.C.R. It cascades SAM-ViT with CLIP-ViT and applies 16× token compression at the bridge, so that the first half relies entirely on window attention to process the original image tokens, while global attention is reserved exclusively for the compressed tokens. This design keeps the activation values low when encoding high-resolution images, thereby conserving GPU memory. DeepEncoder natively supports five resolution modes; we retain two of them: the "Base" mode (1024×1024) for multi-page input, and the "Gundam" mode (dynamic resolution) for single-page input. Specifically, DeepEncoder can compress a 1024×1024 P.D.F image to just 256 tokens. This high compression ratio is critically important for Unlimited O.C.R, because visual tokens do not undergo state transitions alongside the output: they are encoded once and remain static throughout the entire long-horizon parsing process.
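The 256-token figure can be checked with a short calculation. The sketch below assumes a patch size of 16 pixels for the window-attention stage, which the text above does not state; the 16× compression factor and the 1024×1024 input come from the text. The function name `visual_tokens` is hypothetical.

```python
def visual_tokens(height, width, patch_size=16, compression=16):
    """Tokens left after patchifying an image and compressing at the bridge.

    patch_size=16 is an assumption; compression=16 is the ratio stated above.
    """
    patches = (height // patch_size) * (width // patch_size)
    return patches // compression


tokens_per_page = visual_tokens(1024, 1024)
print(tokens_per_page)  # → 256
```

Under these assumptions, a 1024×1024 page yields 64 × 64 = 4096 patch tokens, which the bridge reduces to 256.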

3.4. Reference Sliding Window Attention

Despite the satisfactory compression of visual tokens that DeepEncoder achieves on the input side, the real bottleneck for one-shot parsing of an entire book lies in the decoding stage. Assume a compression ratio of 1:10 between visual and text tokens, that is, one visual token decodes to around ten text tokens. In that case, 10 K visual tokens (equivalent to roughly 20 to 30 pages at 1024×1024 resolution) demand an output length of 100k+ tokens for full decoding.

This has long been a formidable challenge for vanilla L.L.M-driven O.C.R models, due to the massive K.V cache storage and attention computation that sequences beyond 128k tokens entail. To address this, we propose Reference Sliding Window Attention (R-S.W.A).

3.4.1. Attention computation

In essence, R-S.W.A constrains attention within a two-segment window of size m + n , as illustrated in Figure 2. Here, m denotes the window for prefix tokens, which includes both visual tokens and the prompt. During a single inference pass, m remains fixed; it depends only on the number of book pages or the resolution size of the document being decoded, and does not vary with decoding length. The window n for the decode region is also fixed in size and slides in a causal manner. Specifically, the formulation is as follows:

Math summary: This computation defines the set of visible tokens by combining a fixed prefix segment with a causal sliding window. The process takes the current token position as input and outputs a union of all prefix indices and a range of recent indices determined by the window size.

where P denotes the prefix segment of length L_m, which is globally visible to all subsequent tokens, and D_n(t) denotes the causal sliding window of width n over the decode region. The attention weight from token t to position j in N(t) is then computed as

Math summary: This computation calculates a normalized attention weight using a softmax function. It divides the exponentiated scaled dot product of a query and a specific key by the sum of exponentiated scaled dot products for all keys in the accessible set.

where q_t, k_j, and v_j are the query, key, and value vectors, respectively, and d_k is the dimension of the key vector. The output representation is obtained by aggregating values over the same accessible set:

Math summary: This computation calculates a weighted sum to produce an output representation. It multiplies each value vector by a corresponding scaling factor and sums these results across a specific set of neighboring inputs.
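Written out symbolically, the three computations summarized in this subsection can be reconstructed as follows. This is a reconstruction from the surrounding definitions rather than a verbatim copy of the original equations: the symbols \alpha_{t,j} and o_t, the 1-based indexing, and the choice to include the current position t in the window are assumptions.

```latex
\begin{align}
\mathcal{N}(t) &= \mathcal{P} \cup \mathcal{D}_n(t), \qquad
\mathcal{P} = \{1, \dots, L_m\}, \qquad
\mathcal{D}_n(t) = \{\, j : \max(L_m + 1,\; t - n + 1) \le j \le t \,\} \\
\alpha_{t,j} &= \frac{\exp\!\left(q_t^{\top} k_j / \sqrt{d_k}\right)}
                    {\sum_{i \in \mathcal{N}(t)} \exp\!\left(q_t^{\top} k_i / \sqrt{d_k}\right)},
                    \qquad j \in \mathcal{N}(t) \\
o_t &= \sum_{j \in \mathcal{N}(t)} \alpha_{t,j}\, v_j
\end{align}
```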

This formulation makes explicit that each decoding token can attend to all prefix tokens as persistent global context, while only attending locally within a bounded causal window over previously generated tokens. As a result, the model preserves access to the full prefix information while reducing the attention cost over the growing decode sequence.
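The visibility rule can be sketched as a boolean mask applied to ordinary scaled dot-product attention. The NumPy example below is single-head and unbatched, and it is an illustrative assumption rather than the kernel used in the report. It uses 0-based positions, counts the current token as part of the window, and applies causal masking inside the prefix as well; the function names and toy sizes are hypothetical.

```python
import numpy as np


def rswa_mask(prefix_len, total_len, window):
    """mask[t, j] is True when position t may attend to position j."""
    t = np.arange(total_len)[:, None]
    j = np.arange(total_len)[None, :]
    causal = j <= t              # never look at future positions
    in_prefix = j < prefix_len   # reference tokens stay visible forever
    in_window = j > t - window   # the most recent `window` positions
    return causal & (in_prefix | in_window)


def masked_attention(q, k, v, mask):
    """Scaled dot-product attention restricted to the positions in `mask`."""
    scores = q @ k.T / np.sqrt(k.shape[-1])
    scores = np.where(mask, scores, -np.inf)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v


rng = np.random.default_rng(0)
prefix_len, total_len, window, dim = 4, 12, 3, 8
mask = rswa_mask(prefix_len, total_len, window)
q, k, v = (rng.standard_normal((total_len, dim)) for _ in range(3))
out = masked_attention(q, k, v, mask)

print(np.flatnonzero(mask[10]).tolist())  # → [0, 1, 2, 3, 8, 9, 10]
```

Position 10 sees the four prefix positions plus the three most recent positions, so each row attends to at most prefix_len + window keys no matter how long the sequence grows.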

3.4.2. K.V Cache Management

The DeepSeek O.C.R baseline employs standard Multi-Head Attention (M.H.A), the most classical form of attention, which offers strong expressiveness but imposes enormous K.V cache pressure. After generating T tokens, its K.V cache size is calculated as follows:

Math summary: This expression calculates the total size of the key value cache. It sums the fixed size of the prefix cache with the number of generated tokens to determine the final output.

In contrast, under R-S.W.A, the model always retains the full prefix cache of size L_m, but for the generated continuation it only needs to keep the most recent n tokens. Therefore, after generating a total of T tokens, the required K.V cache size is

Math summary: This expression calculates the total memory required for the key value cache. It adds a fixed prefix cache size to the smaller of either the total tokens generated or a maximum window size.

This shows that, unlike standard M.H.A whose cache size increases unboundedly with T, the decode-side cache of R-S.W.A is upper-bounded by a constant window size. To quantify the reduction, we define the cache ratio

Math summary: This expression calculates the cache ratio by dividing the memory requirements of the sliding window attention mechanism by those of the multi head attention mechanism. The process takes the sum of a constant base length and the smaller of either the window size or the current time step, then divides that result by the sum of the base length and the current time step.

If the generated length is sufficiently long such that T is much greater than n, then

Math summary: This formula calculates a ratio to determine a specific density value. It divides the sum of the memory limit and the prefix length by the sum of the memory limit and the total decode length.

which decreases as T grows. In particular, when the decode length dominates both the prefix length and the window size, we have

Math summary: This computation calculates the ratio of the combined prefix length and window size to the total decode length. As the decode length increases, the resulting output value approaches zero.

Therefore, for long-sequence decoding, R-S.W.A reduces the K.V cache requirement from linear growth in T to a bounded quantity L_m + n, yielding a substantial memory saving compared with standard M.H.A. Accordingly, R-S.W.A serves as the cornerstone for enabling near-unlimited parsing under limited resources.
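For reference, the cache-size expressions described in this subsection can be reconstructed in symbols as below. The names C_MHA, C_RSWA, and \rho are assumed notation, and sizes are counted in cached token positions. The numeric line plugs in the figures used earlier in this section (about 10 K prefix tokens, a window of 128, and 100k generated tokens) purely as an illustration.

```latex
\begin{align}
C_{\mathrm{MHA}}(T)  &= L_m + T \\
C_{\mathrm{RSWA}}(T) &= L_m + \min(T, n) \\
\rho(T) &= \frac{C_{\mathrm{RSWA}}(T)}{C_{\mathrm{MHA}}(T)}
         = \frac{L_m + \min(T, n)}{L_m + T} \\
\rho(T) &= \frac{L_m + n}{L_m + T} \quad \text{for } T \ge n, \qquad
\rho(T) \approx \frac{L_m + n}{T} \to 0 \quad \text{for } T \gg L_m,\ n \\
\rho(100{,}000) &= \frac{10{,}000 + 128}{10{,}000 + 100{,}000} \approx 0.092
  \quad \text{with } L_m = 10{,}000,\ n = 128
\end{align}
```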

3.4.3. Kernel study

As shown in Figure 3, we plot the per-call duration of the Flash Attention v3 kernel for both the DeepSeek O.C.R baseline and Unlimited O.C.R Works (denoted as U.O.W in the figure). The standard M.H.A kernel in DeepSeek O.C.R incurs growing latency with each successive decoding step, whereas in Unlimited O.C.R the duration remains constant, a direct benefit of adopting R-S.W.A across all layers of the L.L.M decoder. The spikes in the DeepSeek O.C.R curve occur when the K.V cache length crosses an alignment boundary, causing an abrupt drop in data transfer efficiency; this issue does not arise with R-S.W.A. The same pattern holds for GPU memory usage during inference: in the original DeepSeek O.C.R it scales linearly with decode length, while in Unlimited O.C.R it stays fixed. This joint stability in both computational cost and memory footprint is precisely what makes long-horizon parsing possible.

Figure 3 summary: This figure is a line chart. It illustrates the per-call duration of two different attention kernels, Ds-Attn and UoW-Attn, across a series of decode steps. The chart displays both raw and smoothed data for each kernel to show the trend of latency as the decoding length increases. The data indicates that while the UoW-Attn kernel maintains a consistent and low latency regardless of the decode step, the Ds-Attn kernel experiences a steady increase in latency. Furthermore, the Ds-Attn kernel shows periodic sharp jumps in duration, suggesting a non-linear growth in latency as the sequence length expands. Consequently, UoW-Attn is significantly more efficient and scalable than Ds-Attn in terms of per-call latency during the decoding process.

4. Experimental Settings

4.1. Data Engine

We construct approximately 2 million document O.C.R data samples to train Unlimited O.C.R, with a 9:1 ratio of single-page to multi-page data. For the single-page P.D.F data, we use Paddle O.C.R for annotation, concatenating the coordinates and content of each block to construct end-to-end detection and parsing ground truth. The coordinates of each element are normalized to the range of 0 to 1000. All multi-page data are synthesized by concatenating single-page data.
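As a rough illustration of the coordinate normalization described above, the sketch below maps a pixel-space box to the 0 to 1000 range. The (x0, y0, x1, y1) box layout, the rounding rule, and the function name are assumptions; the text only states the target range.

```python
def normalize_box(box, page_width, page_height, scale=1000):
    """Map a pixel-space (x0, y0, x1, y1) box to integers in [0, scale]."""
    x0, y0, x1, y1 = box
    return (
        round(x0 / page_width * scale),
        round(y0 / page_height * scale),
        round(x1 / page_width * scale),
        round(y1 / page_height * scale),
    )


# A block covering the right half of the second quarter of a 612x792 page.
print(normalize_box((306, 198, 612, 396), 612, 792))  # → (500, 250, 1000, 500)
```

Normalizing this way makes the targets independent of the page's pixel size, so pages rendered at different resolutions share one coordinate vocabulary.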

We randomly generate around 200k multi-page samples, each consisting of 2 to 50 pages, with <page> used as a separator between pages. All data are packed into sequences of 32 K tokens.
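The multi-page synthesis step might look like the following sketch. The record layout (dictionaries with hypothetical `image` and `text` keys), the uniform sampling of the page count, and the function name are assumptions; only the 2 to 50 page range and the `<page>` separator come from the text.

```python
import random

PAGE_SEP = "<page>"


def synthesize_multipage(single_pages, rng, min_pages=2, max_pages=50):
    """Build one multi-page sample by concatenating random single-page samples."""
    count = rng.randint(min_pages, min(max_pages, len(single_pages)))
    chosen = rng.sample(single_pages, count)
    images = [page["image"] for page in chosen]
    # The separator goes between pages, not after the last one.
    target = PAGE_SEP.join(page["text"] for page in chosen)
    return images, target


# Hypothetical pool of already annotated single-page samples.
pool = [{"image": f"page_{i}.png", "text": f"content of page {i}"} for i in range(60)]
images, target = synthesize_multipage(pool, random.Random(0))
```

Each synthesized target contains one fewer separator than it has pages, so a model can recover page boundaries from the output alone.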

4.2. Implementation Details

Starting from the DeepSeek O.C.R checkpoint, we continue training Unlimited O.C.R for 4,000 steps with a global batch size of 256 and a maximum sequence length of 32 K on 8×16 A800 GPUs, using random packing for all data. During training, we freeze the DeepEncoder and only train the L.L.M parameters, as the DeepEncoder is already sufficiently optimized in DeepSeek O.C.R. We use the AdamW optimizer and a cosine annealing scheduler with an initial learning rate of 1e-4. To support 32 K training, we adopt DeepEP, with expert parallelism (E.P) set to 4. The entire training pipeline is built on the Megatron-L.M framework. For inference, we implement K.V cache management for R-S.W.A in the Transformers library, along with corresponding support and optimizations in the SGLang inference engine. Both inference frameworks can run Unlimited O.C.R at constant T.P.S (tokens/s) and constant GPU memory.
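For readers unfamiliar with cosine annealing, the schedule can be sketched as below. The initial rate of 1e-4 and the 4,000-step horizon come from the text; the absence of warmup and the final rate of zero are assumptions, since the report does not state them.

```python
import math


def cosine_lr(step, total_steps=4000, base_lr=1e-4, min_lr=0.0):
    """Cosine-annealed learning rate; min_lr=0 and no warmup are assumptions."""
    progress = min(step, total_steps) / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


schedule = [cosine_lr(step) for step in (0, 2000, 4000)]
```

Under these assumptions the rate starts at 1e-4, passes through half of that at the midpoint, and decays to roughly zero by the last step.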

5. Evaluation

5.1. Benchmark and Metrics

We select OmniDocBench as the main benchmark for evaluating foundational document O.C.R capabilities, and test Unlimited O.C.R on both the v1.5 and v1.6 versions. OmniDocBench v1.6 includes 296 more test images than v1.5 and represents the latest benchmark, while v1.5 provides official metrics for more established models, including our baseline DeepSeek O.C.R, which facilitates performance comparisons. For long-horizon O.C.R evaluation, we construct an in-house test set: we select a number of novels, documents, and papers and divide them by page count to assess the multi-page performance of Unlimited O.C.R. Specifically, we select books of 2, 5, 10, 20, and 40+ pages for testing, with no fewer than ten books in each category.

OmniDocBench is designed to evaluate document parsing capabilities across multiple dimensions, including text recognition, formula recognition, table structure extraction, and reading order prediction. It adopts task-specific metrics for a well-rounded evaluation: (1) Text Edit Distance (Edit), which measures character-level accuracy for text recognition; (2) Formula C.D.M (C.D.M), which evaluates the quality of mathematical formula recognition; (3) Table TEDS (TEDS) and Table TEDS-S (TEDS-S), which assess table extraction accuracy with and without content recognition, respectively; and (4) Reading Order Edit Distance (Edit), which quantifies the correctness of predicted reading sequences. The overall score is then computed as a weighted average across text, formula, and table recognition tasks. For the in-house benchmark, we report both Distinct-n and Edit Distance. Distinct-n is the ratio of the number of unique n-grams to the total number of n-grams in the generated text, so a lower value signals more repetition.
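The two in-house metrics can be sketched as follows. Distinct-n follows the definition given above. The tokenization (whitespace-split words in the example) and the normalization of edit distance by the longer string's length are assumptions, since the text does not specify them.

```python
def distinct_n(tokens, n):
    """Ratio of unique n-grams to total n-grams (0.0 if there are none)."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    return len(set(ngrams)) / len(ngrams)


def edit_distance(a, b):
    """Levenshtein distance computed with a rolling dynamic-programming row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def normalized_edit_distance(pred, ref):
    """Edit distance divided by the longer length (normalization is assumed)."""
    longest = max(len(pred), len(ref))
    return edit_distance(pred, ref) / longest if longest else 0.0


print(distinct_n("a b a b a b".split(), 2))  # → 0.4
print(edit_distance("kitten", "sitting"))    # → 3
```

A looping output such as "a b a b a b" has only two distinct bigrams out of five, which is why a low Distinct-n flags the repetition failures that long generations are prone to.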

5.2. Main Results

As shown in Table 1, by continuing training from DeepSeek O.C.R on merely 2 M P.D.F-document samples, Unlimited O.C.R achieves state-of-the-art performance among end-to-end models. This demonstrates the effectiveness of R-S.W.A on parsing tasks. We see two likely reasons. First, compared with the standard attention in DeepSeek O.C.R, R-S.W.A may allow the model to focus more on dense O.C.R tasks, whereas full attention could lead to divergence as the output length increases. Second, the state transition across intra-page content under R-S.W.A proves both workable and robust.

Table 1 summary: The table compares the performance of various end-to-end VLM-based architectures on the OmniDocBench v1.5 and v1.6 benchmarks across multiple metrics, including overall accuracy, text edit distance, formula recognition, table structure, and read-order. In both benchmark versions, Unlimited-OCR consistently outperforms other models, achieving the highest overall scores and the lowest error rates for text and read-order. Specifically, in v1.5, Unlimited-OCR shows substantial improvements over the baseline DeepSeek-OCR and other competitive models across all categories. In v1.6, it maintains a leading position against current state-of-the-art models, demonstrating superior capabilities in formula and table recognition.

Specifically, on OmniDocBench v1.5, compared with DeepSeek O.C.R, the text edit distance drops by 0.035 and the table TEDS improves by 5.96%. This suggests that historical information is causally and continuously carried forward through the sliding window, enabling the model to locate its O.C.R progress even though it sees only a few output tokens. On the OmniDocBench v1.6 benchmark, Unlimited O.C.R again achieves end-to-end state-of-the-art results (93.92% on the overall metric), further indicating that for single-page P.D.F-level document O.C.R tasks, replacing all standard attention with R-S.W.A of width 128 is both effective and lossless.

Moreover, Unlimited O.C.R inherits all the benefits of DeepSeek O.C.R, such as the MoE architecture with only 0.5 B activated parameters, resulting in very high inference efficiency. On OmniDocBench, Unlimited O.C.R achieves 5580 T.P.S (tokens/s at a concurrency of 512) compared to DeepSeek O.C.R's 4951 T.P.S under the "Base" DeepEncoder mode, a 12.7% speed increase. The average document length in OmniDocBench is relatively short; the longer the output, the more pronounced the advantage of Unlimited O.C.R becomes.

5.3. Subcategory Study

OmniDocBench (v1.5) provides 9 types of documents, and a subcategory comparison is crucial for a more systematic and comprehensive analysis of R-S.W.A. As shown in Table 2, compared to DeepSeek O.C.R, Unlimited O.C.R shows clear and consistent gains across every metric, demonstrating that our decoder-side optimization, that is, R-S.W.A, delivers a genuine "free lunch": improvements without compromises. Compared to DeepSeek O.C.R 2, Unlimited O.C.R also holds a clear advantage, achieving better text edit distance and better reading order scores on seven of the nine document types. For documents with complex layouts such as P.P.T slides, newspapers, magazines, and notes, Unlimited O.C.R shows no disadvantage either, further demonstrating that replacing all standard attention in the L.L.M decoder with R-S.W.A is sound for parsing tasks.

Table 2 summary: The table compares the edit distances of Unlimited OCR and the DeepSeek-OCR series across various document types, where lower values indicate superior performance. Overall, Unlimited OCR generally achieves lower edit distances than the DeepSeek-OCR models across most categories for both text and reading order, demonstrating better accuracy and structural recognition. While DeepSeek-OCR shows competitive results in a few specific instances, Unlimited OCR maintains a consistent advantage across the majority of the tested document formats.

5.4. Long-horizon Parsing

Long-horizon parsing is one of the novel capabilities of Unlimited O.C.R. Two main obstacles have hindered previous models from achieving this: first, excessively long output sequences can easily exceed the maximum token limit; second, output latency grows with sequence length, causing the O.C.R of documents spanning dozens of pages to become progressively slower. Unlimited O.C.R, equipped with R-S.W.A, can prefill tens to hundreds of document pages in a single pass and parse continuously from the first page to the last. Throughout this process, the K.V cache remains fixed, so output latency stays constant—making long-horizon parsing feasible.

As shown in Table 3, our model delivers satisfactory performance in multi-page one-shot O.C.R scenarios, maintaining strong results even when 20 pages are input at once. At 40+ pages, the edit distance remains below 0.11, with a Distinct-35 of 97%. We examine the cases with repeated errors and find that most occur where small text in the P.D.F is difficult to discern. This is primarily due to the use of DeepEncoder's "Base" mode (1024×1024 resolution) under multi-page conditions, rather than R-S.W.A losing direction during long-horizon parsing.

Table 3 summary: The results for long-horizon OCR indicate that as the number of pages increases, the distinct-n metrics generally remain high, though they show a slight decline for the longest documents. Simultaneously, the edit distance tends to increase with the page count, suggesting a gradual decrease in accuracy as the document length grows.

6. Efficiency Analysis

As presented in Table 4, we compare the output tokens per second (T.P.S) of Unlimited O.C.R and DeepSeek O.C.R under ideal concurrency conditions. The prefill length is fixed at 10, with all other settings held identical. The results show that at 256 tokens, the inference speeds of the two models are virtually the same. As the output length grows, however, the T.P.S of DeepSeek O.C.R steadily declines, and at 6,000 tokens, it lags behind Unlimited O.C.R—which incorporates R-S.W.A—by 35%. These findings further validate the effectiveness of R-S.W.A and underscore that consistent generation speed is a critical requirement for long-horizon O.C.R tasks.

Table 4 summary: A comparison of theoretical inference performance ceilings shows that Unlimited OCR maintains a stable and higher throughput across various output lengths, whereas DeepSeek OCR experiences a steady decline in performance as the output length increases.

7. Limitation and Future Work

Our model cannot achieve truly unlimited parsing under a finite context length (e.g., 32 K), as it is also constrained by the prefill length. Although DeepEncoder already achieves a high compression rate for image tokens, the prefill still becomes very long as the number of pages accumulates. In the short term, we will train models with longer context lengths, such as 128 K, to support the prefill of more pages. In the long term, we plan to build a prefill pool and enable the model to learn to automatically fetch prefill K.V chunks, thereby simulating the effect of a human flipping through pages, so as to achieve truly unlimited O.C.R. In addition, we will transfer R-S.W.A to other reference-based tasks such as A.S.R and translation.

8. Conclusion

In this technical report, we propose the Unlimited O.C.R model and present the R-S.W.A algorithm to support its capability for long-horizon parsing. We verify that when all standard attention in the decoder of an end-to-end model is replaced with causal reference-based S.W.A, the model's performance on parsing tasks remains lossless. This indicates that the model learns to continuously pass useful information from historical outputs into the window, and this soft form of forgetting is consistent with how humans behave when transcribing a book. We believe that R-S.W.A will be applied to more tasks in the future, so that attention computation and memory footprint are no longer the bottleneck for long-horizon parsing.

9. Author List

Baidu Inc.
