Research Narrative

This section is organized as reading logs rather than as project cards. The emphasis is on why a reading path started, what changed in my understanding, and what open questions remained after each cluster.

DAS Research Log: From Data Availability Sampling to Coded Distributed Arrays

This part is a summary I put together from older notes and materials I used while writing the paper.

I wrote it for two reasons:

to leave behind a reference that others can use when they want a clearer overview of DAS;
and to preserve part of the path that eventually led me to the CDA paper.

This research log moves through four steps:

Proto-Danksharding: Explain why Ethereum moved from calldata to blob transactions for temporary rollup data and why data availability is needed.
Data Availability Sampling: Discuss what Data Availability Sampling (DAS) is and how the data is encoded.
Network for DAS: Explore the currently proposed network-layer solutions for DAS.
Coded Distributed Array: Explain how the idea of CDA took shape (draft version).

Proto-Danksharding: From CALLDATA to Blob Transaction

Before `Proto-Danksharding`

Before Proto-Danksharding, rollups such as Optimism, Arbitrum, zkSync, and other L2s were already using Ethereum as their settlement layer. They executed transactions offchain, but they still needed to publish enough data back to Ethereum so that the L2 could be checked from outside.

For optimistic rollups, this published data is what makes fraud proofs possible. For zk rollups, this published data is what lets outsiders reconstruct the underlying state transition data instead of trusting the operator blindly.

More generally, the reason rollups publish data to Ethereum is simple: without that data, nobody outside the operator can independently reconstruct the state transition and check whether the L2 is behaving correctly.

At that time, the practical answer was simple: this data was posted through calldata.

So the pattern looked like this:

the L2 executes transactions offchain;
the operator, called the batch poster, periodically sends batch data and commitments to Ethereum;
that L1 data is what later lets outsiders challenge an invalid optimistic transition or verify a validity-based claim.

What is `CALLDATA`?

If you have not seen CALLDATA before, the easiest way to think about it is:

it is the input bytes attached to a smart-contract call,
in Solidity, it is where function arguments live when an external call comes in.

For example:

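// SPDX-License-Identifier: MIT
pragma solidity ^0.8.24;

contract Inbox {
    event DataPosted(bytes data);

    // `calldata` means batchData is read directly from the call's input bytes,
    // without first being copied into memory.
    function postBatch(bytes calldata batchData) external {
        emit DataPosted(batchData);
    }
}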

Here the batchData argument lives in CALLDATA. At the EVM level, the transaction is calling postBatch(...) and those bytes are part of the call input.

The problem with `CALLDATA`

The problem is that CALLDATA is a bad place for this kind of data. When rollup data is posted through CALLDATA:

it is attached to normal transactions,
it goes through Ethereum's ordinary block pipeline,
and it stays in Ethereum history permanently.

That is a bad fit for rollup batch data. This data is mainly there so that others can verify the L2 state transition, generate a fraud proof, or check the validity of a proof. It does not need to be treated like permanent application state on Ethereum itself.

So the mismatch was not only about permanence. It was also about scale.

Rollup batch data carried through CALLDATA has to compete with ordinary execution inside the same block gas budget. That means if Ethereum wants to carry more rollup data through calldata, it does not get a separate data lane for that growth. It is effectively pushing more pressure into the normal block path.
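To make the "same gas budget" point concrete, here is a rough sketch of what calldata costs. The per-byte prices are the ones from EIP-2028 (4 gas per zero byte, 16 gas per nonzero byte). The 30M block gas limit and the 100 kB batch are assumptions for illustration only, and the base transaction cost and execution gas are ignored.

```python
# Calldata pricing from EIP-2028. The block gas limit below is an assumption;
# it is a network parameter that changes over time.
GAS_PER_ZERO_BYTE = 4
GAS_PER_NONZERO_BYTE = 16
ASSUMED_BLOCK_GAS_LIMIT = 30_000_000


def calldata_gas(data: bytes) -> int:
    """Gas charged for the calldata bytes alone (no base tx cost, no execution)."""
    zero_bytes = data.count(0)
    nonzero_bytes = len(data) - zero_bytes
    return zero_bytes * GAS_PER_ZERO_BYTE + nonzero_bytes * GAS_PER_NONZERO_BYTE


def max_nonzero_calldata_bytes(gas_limit: int = ASSUMED_BLOCK_GAS_LIMIT) -> int:
    """Upper bound on nonzero calldata bytes if the block carried nothing else."""
    return gas_limit // GAS_PER_NONZERO_BYTE


# Hypothetical 100 kB batch made only of nonzero bytes.
batch = bytes([1] * 100_000)
print("gas for the batch:", calldata_gas(batch))
print("max nonzero calldata bytes per block:", max_nonzero_calldata_bytes())
```

Under these assumptions, even a block filled with nothing but calldata stays below 2 MB, and every byte of it is gas that ordinary transactions can no longer use.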

So the problem with CALLDATA was really two problems at once:

the data was being stored on the wrong path for too long;
and the only obvious way to scale that path further was to make the ordinary block carry more and more data.

That is the bottleneck Proto-Danksharding was trying to break.

`Proto-Danksharding` and blob transactions

Proto-Danksharding (EIP-4844) is Ethereum's first real answer to that mismatch.

After EIP-4844, rollups get a new data lane:

instead of publishing their batch data through permanent calldata,
they can publish it through blobs.

The important difference is that blobs are temporary. They are kept for a bounded window, not forever. In EIP-4844, the minimum retention window for blob sidecars is 4096 epochs, which is roughly 18 days.

After Proto-Danksharding, Ethereum no longer treats rollup data and ordinary execution data as the same thing.

Before going into that a bit more, it helps to clarify two terms.

In Ethereum, the beacon block is the block produced on the consensus layer. Inside it, there is an execution payload, which is the block passed to the execution layer to process transactions and update state. This is the block we see on Etherscan :D. It contains the normal transactions, so this is where CALLDATA lives.

Blobs are different. They are stored outside this beacon block as temporary data. Inside the beacon block, there is a field called blob_kzg_commitments, which is used to check that a given blob really belongs to that block.

Proto-Danksharding does solve the first bottleneck:

rollup data no longer has to live forever on the ordinary execution path.

But the design is still not scalable enough. At launch, EIP-4844 gave Ethereum a target of 3 blobs and a maximum of 6 blobs per block, at 128 KiB each. That is about 0.375 MB of blob data per block as the target and about 0.75 MB as the hard limit.
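The blob numbers in this section can be re-derived with a few lines of arithmetic. The sketch below assumes the launch parameters of EIP-4844 (4096 field elements of 32 bytes per blob, 3 target and 6 maximum blobs per block) and the usual 32 slots per epoch and 12 seconds per slot. Later upgrades can change the blob counts, so treat the constants as a snapshot.

```python
# Launch-time parameters (assumed snapshot, later forks may change the blob counts).
FIELD_ELEMENTS_PER_BLOB = 4096
BYTES_PER_FIELD_ELEMENT = 32
TARGET_BLOBS_PER_BLOCK = 3
MAX_BLOBS_PER_BLOCK = 6

SLOTS_PER_EPOCH = 32
SECONDS_PER_SLOT = 12
MIN_RETENTION_EPOCHS = 4096

BLOB_BYTES = FIELD_ELEMENTS_PER_BLOB * BYTES_PER_FIELD_ELEMENT  # 128 KiB


def to_mib(num_bytes: int) -> float:
    """Convert bytes to MiB (2**20 bytes)."""
    return num_bytes / (1024 * 1024)


target_mib = to_mib(TARGET_BLOBS_PER_BLOCK * BLOB_BYTES)
max_mib = to_mib(MAX_BLOBS_PER_BLOCK * BLOB_BYTES)
retention_days = MIN_RETENTION_EPOCHS * SLOTS_PER_EPOCH * SECONDS_PER_SLOT / 86_400

print(f"blob size: {BLOB_BYTES} bytes")
print(f"target per block: {target_mib} MiB, max per block: {max_mib} MiB")
print(f"retention window: {retention_days:.1f} days")
```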

That already improves on the old CALLDATA path, but it still does not solve the main scaling problem. Every consensus node is still expected to download all blob data, so this is still a bottleneck.

But Ethereum wants much more blob capacity than this. What happens if the roadmap moves toward something like 128 MB of blob data?

So the next question becomes:

how can Ethereum keep the same availability guarantee without forcing every node to download everything?

In the next section, we will look at that problem directly.

Data Availability Sampling

Before talking about specific protocols, it helps to start from the most naive idea.

Suppose we simply split a block into many pieces and distribute those pieces across the network. At first glance, that sounds like enough: each node only stores a small part, so the storage burden is reduced.

However, the problem appears immediately once even a single part goes missing. It is easy to imagine a small cluster of Byzantine nodes deliberately refusing to store or serve one part of the block. In that case, we cannot reconstruct the original block from the remaining parts. So plain splitting is not enough.

We need a way to distribute the data such that even if some pieces are missing, the original block can still be recovered.

Erasure coding

This is why DAS designs start with erasure coding.

At a high level, erasure coding takes:

k original parts of data,
and encodes them into n coded parts,

in such a way that any k out of those n coded parts are enough to reconstruct the original data.

I won't go into how it is implemented in detail, but the intuition is:

Treat the k original parts as k points that define a polynomial of degree less than k, and evaluate that polynomial at n points to get the n coded parts.
Any k of those n points are enough to recover the same polynomial by interpolation.
Evaluating the recovered polynomial at the original indices gives back the k original parts.

That is the key improvement over simple splitting. The network no longer depends on every specific part staying online.
Because the coded parts carry redundancy, it only needs enough of them to remain available.
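A minimal sketch of the "any k out of n" property, using polynomial interpolation over a small prime field. This is a toy Reed-Solomon-style code for intuition only: the field size 65537, the function names, and the sample data are all assumptions of this sketch, and production DAS designs use much larger fields together with KZG commitments.

```python
P = 65537  # small prime field, chosen only for illustration


def _interpolate_at(points, x):
    """Evaluate at x the unique polynomial of degree < len(points) through `points` (mod P)."""
    total = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = num * (x - xj) % P
                den = den * (xi - xj) % P
        # pow(den, -1, P) is the modular inverse (Python 3.8+).
        total = (total + yi * num * pow(den, -1, P)) % P
    return total


def encode(data, n):
    """Extend k data symbols to n coded symbols; the first k coded symbols equal the data."""
    points = list(enumerate(data))
    return [_interpolate_at(points, x) for x in range(n)]


def recover(available, k):
    """Rebuild the k original symbols from any k (index, value) pairs."""
    if len(available) < k:
        raise ValueError("not enough coded parts to recover the data")
    points = available[:k]
    return [_interpolate_at(points, x) for x in range(k)]


data = [10, 20, 30, 40]          # k = 4 original symbols (each must be < P)
coded = encode(data, 8)          # n = 8 coded symbols
survivors = [(i, coded[i]) for i in (1, 4, 6, 7)]  # any 4 of the 8
print(recover(survivors, 4) == data)
```

Which four coded symbols survive does not matter. Any other choice of four indices recovers the same data.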

This changes the data-availability question completely. We are no longer asking:

is every original part still there?

We are asking:

is there still a large enough set of coded parts to recover the block?

That is the door that opens the way to sampling.


1D sampling

Once erasure coding is in place, the next idea becomes possible: a validator no longer needs to download the full encoded block just to test availability.

Instead, it can sample a small subset of coded data and use that to gain confidence that the whole block is still recoverable.

The first concrete form of this idea is 1D sampling.

In the 1D setting, the data is erasure-coded in only one direction, usually pictured as a horizontal extension of the original block into a longer coded strip.

Suppose the original data is encoded from k parts into 2k coded parts. Since the block can still be reconstructed from any k available coded parts, an adversary must withhold more than half of the coded data to make the block unrecoverable.


If a validator makes s independent random sampling requests by columns (s is the number of samples, not the number of coded parts from before), then the probability that all of them miss the unavailable region is at most:

(1/2)^s

So the probability of detecting data withholding is at least:

1 - (1/2)^s

For example, with s = 20, the probability of detection is already:

1 - (1/2)^20 ≈ 0.999999

This is already much better than full download:

a node does not need the whole block,
yet it can still test whether enough coded data is available.
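The bound above is easy to play with numerically. The sketch below assumes independent uniform samples and an adversary that withholds a fixed fraction of the coded data (one half in the 1D case discussed here). The function names are made up for this example.

```python
import math


def detection_probability(num_samples: int, withheld_fraction: float = 0.5) -> float:
    """Lower bound on detecting withholding with independent uniform samples."""
    return 1.0 - (1.0 - withheld_fraction) ** num_samples


def samples_needed(target_probability: float, withheld_fraction: float = 0.5) -> int:
    """Smallest number of samples whose detection bound reaches the target."""
    miss = 1.0 - target_probability
    return math.ceil(math.log(miss) / math.log(1.0 - withheld_fraction))


for num_samples in (5, 10, 20, 30):
    print(num_samples, detection_probability(num_samples))
print("samples for 0.999999:", samples_needed(0.999999))
```

The useful part is the shape of the curve: each extra sample halves the chance of missing a withheld block, so a few dozen samples already give very strong confidence.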

But as an engineering design, this is still only a temporary solution.

Why 1D is not the end

The first limitation of 1D coding is its recovery structure.

With 1D erasure coding, recovery is still global rather than local. Even if only a few cells go missing, the system does not recover those cells locally. Instead, it has to fall back to reconstructing the whole encoded block.

That is expensive, both in bandwidth and in computation. So 1D sampling is a useful first step for DAS, but it is not yet a good long-term structure for distributed recovery.

2D sampling: from columns to cells

The main reason to move from 1D to 2D is recovery: with 2D erasure coding, recovery becomes much more local and structured.

Instead of encoding the data in only one direction, we encode it in two:

first across rows,
then across columns.

So the encoded block becomes a matrix rather than a single long strip.

This changes the recovery structure in an important way.

A missing cell can now be recovered in two ways:

through its row,
or through its column.

So if one recovery direction becomes weak, the other may still be enough. That is the real advantage of 2D coding: it gives the system two repair paths instead of one.

So compared with 1D sampling, 2D sampling gives a much more local recovery structure. The system no longer has to treat every small loss as a whole-block reconstruction problem.
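The two repair paths can be illustrated with a small availability simulation. The sketch assumes a 2k × 2k extended matrix in which any row or column becomes fully recoverable once at least k of its cells are available, and it repeats row and column repair until nothing changes. It tracks only which cells are available, not the data itself, and the function name and masks are hypothetical.

```python
def repair_2d(available, k):
    """Iteratively repair a 2k x 2k availability mask through rows and columns.

    `available[r][c]` is True when cell (r, c) is present. A row or column
    with at least k present cells is assumed to be fully recoverable.
    """
    size = 2 * k
    grid = [row[:] for row in available]
    progress = True
    while progress:
        progress = False
        for r in range(size):
            if k <= sum(grid[r]) < size:
                grid[r] = [True] * size
                progress = True
        for c in range(size):
            if k <= sum(grid[r][c] for r in range(size)) < size:
                for r in range(size):
                    grid[r][c] = True
                progress = True
    return grid


T, F = True, False
# Sparse but repairable: row 0 is fixed first, then columns 0 and 1, then the rest.
sparse = [[T, T, F, F], [T, F, F, F], [F, T, F, F], [F, F, F, F]]
# Withholding a (k+1) x (k+1) square leaves every affected row and column below k.
blocked = [[T, T, T, T], [T, F, F, F], [T, F, F, F], [T, F, F, F]]

print(all(all(row) for row in repair_2d(sparse, 2)))
print(sum(not cell for row in repair_2d(blocked, 2) for cell in row))
```

The second mask shows the flip side. The adversary no longer needs to withhold half of everything, only a square that keeps every affected row and column under the recovery threshold.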

However, 2D coding does not make the whole problem easy. It mainly fixes the recovery structure. The next bottleneck moves to the network layer.

Once sampling happens at the level of individual cells, the system now needs to answer harder questions:

which nodes store which coded data?
how does a node find the peer that holds the cell it wants to sample?
how do we make sure every cell is still recoverable in adversarial settings?

This is where the design becomes much harder. A validator that wants to sample one cell must somehow find and contact the right peer among a large set of nodes. And if the publisher withholds data, agreement now depends on a stronger condition: each missing cell must still be recoverable through at least one honest direction, either from its row or from its column.

This is the point where protocol design turns into network design. The harder question now is how to store, route, sample, and reconstruct those coded cells across a real adversarial network.

Ethereum's concrete proposals for that next step, such as PeerDAS, belong more naturally to the network-layer story.

Network Layer for DAS

In the previous section, we went through the main DAS designs. But 1D and 2D sampling both answered only one part of the problem:

how the data should be encoded.

They did not yet answer the next question:

once the data has been encoded, how should it actually be stored, disseminated, and retrieved across the network?

That is the purpose of this section. Here I want to focus on the network-layer directions that try to answer that question.

PeerDAS

In the current Fulu upgrade, 1D PeerDAS is already the design being applied to Ethereum.

Its basic structure is column-based data assignment:

the encoded data is grouped into columns,
and each node is assigned to store some of those columns.

The exact assignment rules are more detailed than I want to go into here, but they can be checked directly in the consensus-specs.
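To give a feel for what "each node is assigned some columns" can mean, here is a hypothetical deterministic assignment derived from a node identifier. This is not the function from the consensus-specs. The hashing scheme, the column count of 128, and the custody count of 4 are illustrative assumptions only.

```python
import hashlib

NUM_COLUMNS = 128   # illustrative value, not taken from this document
CUSTODY_COUNT = 4   # illustrative value, not taken from this document


def custody_columns(node_id: bytes, count: int = CUSTODY_COUNT,
                    num_columns: int = NUM_COLUMNS) -> list:
    """Hypothetical rule: hash (node_id, counter) until `count` distinct columns are found."""
    if count > num_columns:
        raise ValueError("cannot custody more columns than exist")
    columns = []
    counter = 0
    while len(columns) < count:
        digest = hashlib.sha256(node_id + counter.to_bytes(8, "little")).digest()
        column = int.from_bytes(digest[:8], "little") % num_columns
        if column not in columns:
            columns.append(column)
        counter += 1
    return sorted(columns)


print(custody_columns(b"node-1"))
print(custody_columns(b"node-2"))
```

The property that matters is that the assignment is a pure function of the node identifier, so any peer can compute who should hold a given column without asking a coordinator.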

Sampling is then done through DHT retrieval. A node that wants to sample does not download the whole block. Instead, it tries to find the peer responsible for the column it wants and retrieves that sampled data from the network.

This is already a meaningful step away from full download, but the bottleneck now moves to the network layer.

DHT retrieval is multi-hop,
GossipSub dissemination is also multi-hop.

That is where the weakness appears. A DHT works well under benign assumptions, but once Byzantine nodes are present it becomes much more vulnerable. An adversary can try to attack the hash-space neighborhood itself, for example through eclipse-style behavior, and make the needed sampling data hard to retrieve.

At that point, the problem is no longer just "some sample is missing". The network may fail to retrieve the sampled piece at all, which means the availability check itself becomes unstable. In practice, that means no new block can be safely accepted.

For more background on DHT and GossipSub in Ethereum's consensus network, see an older blog post of mine:

Ethereum consensus layer: from peer discovery with discv5 to message propagation with GossipSub

SubnetDAS

Besides 1D PeerDAS, there have also been discussions around SubnetDAS on Ethereum Research.

The idea there is still in the 1D-sampling family, but the sampling path is moved more directly onto GossipSub itself through subnets, instead of treating retrieval as a separate DHT-heavy problem.

However, the main issue with SubnetDAS is not sampling itself, but reconstruction. Its convergence property depends on many subnets remaining reliable under adversarial conditions, and a disruption in even one critical subnet may be enough to stall full reconstruction.

FullDAS

Another direction is FullDAS, which moves into the 2D sampling setting. Here the sampling unit is no longer a full column, but an individual cell inside a two-dimensional coded matrix.

Compared with 1D PeerDAS, this gives a much cleaner recovery structure. But it also makes the network-layer problem harder, because now the system must support sampling and retrieval at finer granularity.

This line of work is also still far from the final answer for Ethereum's long-term roadmap. For example, work such as FullDAS is still framed around settings like 32 MB, while the longer-term target people talk about is closer to 128 MB blob data per block.

So the current picture is still unfinished. There are already proposals to optimize sampling by improving parts of the DHT path or by moving more logic into GossipSub, but these designs are still not fully settled for Ethereum's current implementation path.

PANDAS

PANDAS is another direction for making DAS practical within Ethereum's consensus time bounds. Before going into a bit of detail, recall the timing: block time in Ethereum is 12s, with 4s for block propagation.

The main idea is to push the first dissemination step onto the block builder. Instead of relying mainly on multi-hop propagation from the start, the builder directly seeds the coded data to the nodes that are supposed to custody it. After that, nodes do two things in parallel:

Consolidation: fetch the missing cells they are supposed to store;
Sampling: fetch the random cells they want to sample for DAS.

So the flow is roughly:

the builder constructs the block and the encoded blob data;
the builder directly sends seed cells to responsible nodes;
nodes consolidate the rest of their assigned data from peers;
nodes also perform DAS sampling in parallel;
if this finishes within roughly 4s, validators can safely attest to the block.


This is the main appeal of PANDAS. It is designed around Ethereum's actual production constraint: DAS is not useful if dissemination and sampling cannot finish within the attestation deadline.

Its main advantage is therefore practical performance. The paper evaluates PANDAS specifically against the 4s window, even with 128 MB blocks, and shows that direct exchanges scale much better than relying only on GossipSub or DHT-style multi-hop retrieval.

But the tradeoff is also clear. PANDAS pushes a lot of responsibility onto the builder:

the builder becomes the first major distributor of blob data;
the builder needs a timely and broad view of the nodes it should send data to;
and the builder must have enough network capacity to seed that data quickly.
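A back-of-the-envelope calculation shows why the builder's uplink matters. The sketch below is not taken from the PANDAS evaluation. The expansion factor, the number of copies, and the seeding window are hypothetical knobs, and protocol overhead is ignored.

```python
def required_uplink_mbit_per_s(blob_data_mb: float, seed_window_s: float,
                               expansion_factor: float = 1.0, copies: int = 1) -> float:
    """Uplink needed to push the data out once within the seeding window.

    expansion_factor: hypothetical growth from erasure coding (1.0 = raw data).
    copies: hypothetical number of times each piece is sent by the builder.
    """
    total_megabits = blob_data_mb * expansion_factor * copies * 8
    return total_megabits / seed_window_s


# Raw 128 MB sent once, using the whole 4 s window.
print(required_uplink_mbit_per_s(128, 4.0))
# Same data with a hypothetical 4x coding expansion and only 1 s for seeding.
print(required_uplink_mbit_per_s(128, 1.0, expansion_factor=4.0))
```

Even the optimistic case needs a sustained uplink that ordinary home validators do not have, and that gap is the centralization pressure described here.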

So while PANDAS improves performance, it also makes the system more builder-centric and centralized. In practice, the builder has to act as a supernode: it needs to know who should receive what, and it needs enough bandwidth and coordination to deliver those packets in time. If the block builder cannot make it in time, validators cannot safely attest, and no block can be accepted for that slot.

Robust Distributed Arrays

RDA is a cleaner and more formal network-layer direction for DAS. Compared with earlier proposals, the idea is simpler, and the security analysis is much tighter.

The basic idea is to organize the network itself as a matrix, usually called a node matrix, with dimensions k1 × k2. Each node is randomly assigned to one row and one column of this matrix.

In the paper, there are two types of nodes:

Validator nodes
Bootstrap nodes

Bootstrap nodes mainly help with network membership, such as join and leave events. They participate in all rows, so they effectively know the whole network. Regular validator nodes are lighter: each one only needs to know the nodes in its own row and its own column.

The storage rule is also simple. Nodes in the same node-matrix column store the same portion of the data. So if the block is divided into 512 columns, and the node matrix has k2 columns, then each node is responsible for roughly 512 / k2 data columns.

This gives RDA a very clean retrieval path. If a node wants to sample some data, it only needs to determine the destination column for that data and query nodes in that column. So retrieval is essentially one-hop, which is much faster and much less fragile than DHT-style multi-hop routing.
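The node matrix and the one-hop lookup can be sketched as follows. The placement by hashing and the mapping `data_column % k2` are assumptions made for this illustration and may differ from the exact rules in the RDA paper. The idea that a node reaches the destination column through the peers it already knows in its own row is my reading of the row/column knowledge described above.

```python
import hashlib


def node_cell(node_id: bytes, k1: int, k2: int):
    """Hypothetical placement: derive a (row, column) cell from a hash of the node id."""
    digest = hashlib.sha256(node_id).digest()
    row = int.from_bytes(digest[:8], "big") % k1
    col = int.from_bytes(digest[8:16], "big") % k2
    return row, col


def destination_column(data_column: int, k2: int) -> int:
    """Hypothetical mapping from a data column to the node-matrix column that stores it."""
    return data_column % k2


def peers_to_query(my_row: int, data_column: int, placement: dict, k2: int) -> list:
    """Peers in my own row that sit in the destination column (a one-hop lookup)."""
    target = destination_column(data_column, k2)
    return [node for node, (row, col) in placement.items()
            if row == my_row and col == target]


K1, K2, DATA_COLUMNS = 4, 8, 512
node_ids = [f"node-{i}".encode() for i in range(2000)]
placement = {node_id: node_cell(node_id, K1, K2) for node_id in node_ids}

per_node_columns = sum(1 for c in range(DATA_COLUMNS) if destination_column(c, K2) == 3)
peers = peers_to_query(my_row=0, data_column=42, placement=placement, k2=K2)
print("data columns per node:", per_node_columns)
print("one-hop candidates for data column 42:", len(peers))
```

No routing table beyond the node's own row is needed to answer a sample request, which is where the one-hop property comes from.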

This is the main appeal of RDA. The sampling path is simple, the retrieval path is simple, and the security model is stated much more explicitly.

Another important point is the trust assumption. RDA does not rely on an honest majority. Instead, its robustness depends on having a sufficiently large absolute number of honest nodes online for sufficiently long periods of time.

The paper also gives a concrete example of this tradeoff.

It shows that with 5000 honest network participants, where each node stores only 1% of the data and is connected to 10% of the other peers, the system can still provably ensure that 90% of the data remains available at all points in time.

The cost is replication. To get this robustness and one-hop retrieval, RDA duplicates data much more aggressively. All nodes in the same column store the same assigned data, so storage and communication overhead are significantly higher.

So compared with PeerDAS-style designs, RDA spends more on replication in exchange for a simpler retrieval path and a much stronger formal guarantee.

For visualization, you can check this slide:

Coded Distributed Array

This section is only a draft of the idea. I only want to explain how the direction of CDA started to form.

The starting point for me was that RDA is already very clean from the security side, but it still pays a very large data-duplication cost.

If the network has 5000 nodes and the node matrix has 100 columns, then the data-duplication factor is already:

5000 / 100 = 50

That is a lot of replication.
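The same arithmetic, written out so the parameters can be varied. The 128 MB block size used for the total is a hypothetical input, and the helper names are made up for this sketch.

```python
def replication_factor(num_nodes: int, node_columns: int) -> float:
    """Average number of nodes that store the same portion of the data."""
    return num_nodes / node_columns


def per_node_share(node_columns: int) -> float:
    """Fraction of the block each node stores when data is split evenly across node columns."""
    return 1 / node_columns


def total_stored_mb(block_mb: float, num_nodes: int, node_columns: int) -> float:
    """Total data stored across the whole network for one block."""
    return block_mb * replication_factor(num_nodes, node_columns)


print(replication_factor(5000, 100))    # the factor from the text
print(per_node_share(100))              # each node stores 1% of the data
print(total_stored_mb(128, 5000, 100))  # hypothetical 128 MB block
```

Each node stays light at 1% of the block, but the network as a whole stores the block 50 times over.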

At the same time, the paper on DHT limitations in DAS suggests something else that is important:

DHT is already quite good for fast sampling,
but it is much worse for data dissemination if we try to use it as the main broadcast substrate.

That led me to a different direction. Instead of trying to keep RDA exactly as it is, I started thinking about whether it would be better to accept a slightly longer sampling path, as long as we could reduce the duplication inside each column and improve propagation time overall.

The intuition was roughly this:

if sampling does not need one specific piece, but can tolerate reconstruction from random coded pieces, then the system becomes less sensitive to Byzantine behavior on exact lookup paths;
and if column dissemination is the expensive part, then maybe network coding can reduce the amount of duplication needed there.

That is what led us to think about combining RLNC with the RDA direction.
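For readers who have not seen RLNC (random linear network coding) before, here is a toy version over a prime field. Each coded packet is a random linear combination of the original chunks, and any k packets with linearly independent coefficients are enough to decode. The field, the chunk layout, and the function names are assumptions of this sketch, not the CDA design.

```python
import random

P = 2**31 - 1  # prime field, chosen only for illustration


def encode_packet(chunks, rng):
    """Return (coefficients, payload): one random linear combination of the chunks mod P."""
    coeffs = [rng.randrange(P) for _ in chunks]
    length = len(chunks[0])
    payload = [sum(c * chunk[i] for c, chunk in zip(coeffs, chunks)) % P
               for i in range(length)]
    return coeffs, payload


def decode(packets, k):
    """Gauss-Jordan elimination mod P. Returns the k chunks, or None if rank < k."""
    rows = [list(coeffs) + list(payload) for coeffs, payload in packets]
    pivot_row = 0
    for col in range(k):
        pivot = next((r for r in range(pivot_row, len(rows)) if rows[r][col] != 0), None)
        if pivot is None:
            return None  # not enough independent packets yet
        rows[pivot_row], rows[pivot] = rows[pivot], rows[pivot_row]
        inv = pow(rows[pivot_row][col], -1, P)
        rows[pivot_row] = [v * inv % P for v in rows[pivot_row]]
        for r in range(len(rows)):
            if r != pivot_row and rows[r][col] != 0:
                factor = rows[r][col]
                rows[r] = [(a - factor * b) % P for a, b in zip(rows[r], rows[pivot_row])]
        pivot_row += 1
    return [rows[i][k:] for i in range(k)]


chunks = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]   # k = 3 original chunks
rng = random.Random(7)
packets = [encode_packet(chunks, rng) for _ in range(5)]

print(decode(packets[:2], 3))                # too few packets
print(decode(packets[1:4], 3) == chunks)     # any 3 independent packets suffice
```

This is the property the intuition above leans on: a receiver does not care which packets arrive, only that enough independent ones do.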

The hope is not to preserve every property of RDA exactly. The hope is to find a better tradeoff:

allow a few more hops during sampling,
reduce the duplication cost during propagation,
and still keep the total sampling plus propagation time below the roughly 4s budget inside block propagation.

That is the point where the idea of Coded Distributed Arrays started to take shape for me.

For more detail, the slides are still the best reference for now.