12 changes: 12 additions & 0 deletions docs/glossary.md
@@ -223,6 +223,18 @@ implementations including filesystem and S3.
**Contiguous Data Store** - Storage backend for complete transaction data.
Manages both data files and verification metadata.

<a id="content-digest"></a> **Content Digest (ar-io-digest)** - The SHA-256
hash of a piece of contiguous data, base64url-encoded. It is emitted on data
responses as the `X-AR-IO-Digest` header and is the key under which the
[contiguous data store](#contiguous-data-store) addresses bytes on disk
(`data/<h0:2>/<h2:4>/<hash>`). Because the same value identifies content
across the cache, the index, and the response header, it doubles as a stable
content address. The `GET /ar-io/digest/{digest}` endpoint serves bytes
directly by this value; such responses are inherently self-verifying (the
bytes provably hash to the requested digest) and immutable, but local-cache
only — there is no on-demand fetch by content hash, since Arweave addresses
data by [item ID](#item-id), not by content hash.
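
As a rough illustration of the entry above, the sketch below computes a digest with Node's built-in `crypto` module and derives a store path from it. The helper names `contentDigest` and `digestPath` are hypothetical, and reading `<h0:2>` and `<h2:4>` as the first and second pairs of digest characters is an assumption, not something the store implementation confirms here.

```typescript
import { createHash } from 'node:crypto';

// Base64url-encoded SHA-256 of the given bytes. A 32-byte hash always
// encodes to 43 unpadded base64url characters.
export function contentDigest(bytes: Uint8Array): string {
  return createHash('sha256').update(bytes).digest('base64url');
}

// Hypothetical helper: build a relative path shaped like
// data/<h0:2>/<h2:4>/<hash>, ASSUMING the two directory levels are
// characters 0-1 and 2-3 of the digest.
export function digestPath(digest: string): string {
  return ['data', digest.slice(0, 2), digest.slice(2, 4), digest].join('/');
}
```

Because the path depends only on the bytes, two items with identical content but different IDs would map to the same file under this layout.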

Comment on lines +226 to +237


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Fix broken link fragment.

The anchor reference at line 229 uses #contiguous-data-store, but the actual anchor at line 99 is <a id="contiguous-data">. Update the link to #contiguous-data to match the existing anchor.

🔗 Proposed fix
 <a id="content-digest"></a> **Content Digest (ar-io-digest)** - The SHA-256
 hash of a piece of contiguous data, base64url-encoded. It is emitted on data
 responses as the `X-AR-IO-Digest` header and is the key under which the
-[contiguous data store](#contiguous-data-store) addresses bytes on disk
+[contiguous data store](#contiguous-data) addresses bytes on disk
 (`data/<h0:2>/<h2:4>/<hash>`). Because the same value identifies content
🧰 Tools
🪛 markdownlint-cli2 (0.22.1)

[warning] 229-229: Link fragments should be valid

(MD051, link-fragments)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/glossary.md` around lines 226 - 237, Update the broken anchor in the
"Content Digest (ar-io-digest)" entry: change the link target from
`#contiguous-data-store` to the existing anchor `#contiguous-data` so the
reference to the contiguous data section resolves correctly (look for the
"Content Digest (ar-io-digest)" paragraph and its
`[contiguous data store](#contiguous-data-store)` link).

## Data Verification

**Data Verification** - The process of cryptographically verifying data
109 changes: 109 additions & 0 deletions docs/openapi.yaml
@@ -1745,6 +1745,115 @@ paths:
        '416':
          description: Range not satisfiable for HEAD request

  '/ar-io/digest/{digest}':
    get:
      tags: [Data]
      summary: Get data by its content digest (SHA-256)
      description: |
        Retrieve contiguous data addressed by its SHA-256 content digest, the
        same base64url value emitted in the `X-AR-IO-Digest` response header on
        every data response (and used as the gateway's on-disk cache key).

        This is a content-addressed endpoint: the bytes returned provably hash
        to the requested digest, so the response is always self-verifying
        (`X-AR-IO-Verified: true`) and immutable (`Cache-Control: …, immutable`).

        Local-cache only: the gateway can only serve a digest it has already
        materialized (via prior retrieval or bundle unbundling). There is no
        on-demand fetch by content hash, because Arweave and peers address data
        by transaction/data-item id, not by content hash, so a digest the node
        has never stored returns 404.

        For header parity with `/raw/{txId}`, a representative id that resolves
        to this digest is used to populate the full id-scoped header set
        (`X-AR-IO-Data-Id`, tags, owner, signature, root offsets), which are
        then covered by the HTTPSIG signature when signing is enabled.
      parameters:
        - name: digest
          in: path
          required: true
          schema:
            $ref: '#/components/schemas/Base64Url43'
          description: base64url-encoded SHA-256 content digest (X-AR-IO-Digest)
        - name: Range
          in: header
          required: false
          schema:
            $ref: '#/components/schemas/ByteRange'
          description: Byte range(s) to retrieve
      responses:
        '200':
          description: |
            Successful response. Emits the same header set as `/raw/{txId}`
            (see that endpoint for the full list), with `X-AR-IO-Verified`
            always `true` and an immutable `Cache-Control`.
          headers:
            Content-Type:
              schema:
                type: string
                example: application/octet-stream
            Content-Length:
              schema:
                type: string
                example: 1024
            Cache-Control:
              schema:
                type: string
                example: public, max-age=2592000, immutable
            X-AR-IO-Digest:
              schema:
                type: string
                example: '4ROTs2lTPAKbr8Y41WrjHu-2q-7S-m-yTuO7fAUzZI4'
            ETag:
              schema:
                type: string
                example: '4ROTs2lTPAKbr8Y41WrjHu-2q-7S-m-yTuO7fAUzZI4'
            Content-Digest:
              schema:
                type: string
                description: RFC 9530 compliant digest header with SHA-256
                example: 'sha-256=:4ROTs2lTPAKbr8Y41WrjHu+2q+7S+m+yTuO7fAUzZI4=:'
            X-AR-IO-Verified:
              schema:
                $ref: '#/components/schemas/VerificationStatus'
                example: true
            X-AR-IO-Data-Id:
              schema:
                type: string
                description: A representative id that resolves to this digest
        '206':
          description: Partial content for range requests
        '400':
          description: Malformed digest (not a canonical 43-char base64url SHA-256)
        '404':
          description: No content for this digest in the local cache
        '416':
          description: Range not satisfiable
        '451':
          description: Content blocked by this node's content policy
    head:
      tags: [Data]
      summary: Get headers for data by its content digest
      description: |
        Existence check / header retrieval for content addressed by its SHA-256
        digest. Returns the same headers as the GET response with no body.
      parameters:
        - name: digest
          in: path
          required: true
          schema:
            $ref: '#/components/schemas/Base64Url43'
          description: base64url-encoded SHA-256 content digest (X-AR-IO-Digest)
      responses:
        '200':
          description: Successful response (headers only)
        '400':
          description: Malformed digest
        '404':
          description: No content for this digest in the local cache
        '451':
          description: Content blocked by this node's content policy

  # Network and Node Status
  '/info':
    get:
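
The spec above shows the same hash in two encodings: unpadded base64url in `X-AR-IO-Digest` and `ETag`, and padded standard base64 inside the RFC 9530 `Content-Digest` value. The sketch below converts one into the other using the example values from the spec. The function name `toContentDigestHeader` and the length check are illustrative assumptions, not the gateway's own code.

```typescript
import { Buffer } from 'node:buffer';

// 43 unpadded base64url characters, the shape of an encoded SHA-256.
const BASE64URL_43 = /^[A-Za-z0-9_-]{43}$/;

// Hypothetical helper: re-encode a base64url digest as an RFC 9530
// Content-Digest value (standard base64, padded, wrapped in colons).
export function toContentDigestHeader(digest: string): string {
  if (!BASE64URL_43.test(digest)) {
    throw new Error('digest must be 43 base64url characters');
  }
  const standard = Buffer.from(digest, 'base64url').toString('base64');
  return `sha-256=:${standard}:`;
}
```

A client could use the reverse mapping to check that the `Content-Digest` and `X-AR-IO-Digest` headers on a response agree before trusting either.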
5 changes: 5 additions & 0 deletions src/constants.ts
@@ -88,6 +88,11 @@ export const verificationPriorities = {
export const DATA_PATH_REGEX =
/^\/?([a-zA-Z0-9-_]{43})\/?$|^\/?([a-zA-Z0-9-_]{43})\/(.*)$/i;
export const RAW_DATA_PATH_REGEX = /^\/raw\/([a-zA-Z0-9-_]{43})\/?$/i;
// Content-addressed data: base64url SHA-256 digest (43 chars), the value
// emitted as X-AR-IO-Digest. Distinct prefix from /raw/:txid because a
// digest is indistinguishable from a 43-char txid by shape alone.
export const DIGEST_DATA_PATH_REGEX =
/^\/ar-io\/digest\/([a-zA-Z0-9-_]{43})\/?$/i;
export const FARCASTER_FRAME_DATA_PATH_REGEX =
/^\/local\/farcaster\/frame\/([a-zA-Z0-9-_]{43})\/?$/i;
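
To make the comment about the distinct prefix concrete, here is a small sketch of how the new pattern could be used to pull a digest out of a request path. The regex is copied from the diff; `extractDigest` is a hypothetical helper and not part of the change.

```typescript
// Copied from the diff above so the example is self-contained.
const DIGEST_DATA_PATH_REGEX =
  /^\/ar-io\/digest\/([a-zA-Z0-9-_]{43})\/?$/i;

// Hypothetical helper: return the digest segment of a matching path,
// or undefined when the path is not a digest route.
export function extractDigest(path: string): string | undefined {
  return DIGEST_DATA_PATH_REGEX.exec(path)?.[1];
}
```

Since a 43-character digest and a 43-character transaction ID have the same shape, only the `/ar-io/digest/` prefix tells the router which lookup to perform.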

62 changes: 62 additions & 0 deletions src/data/read-through-data-cache.test.ts
@@ -110,6 +110,22 @@ describe('ReadThroughDataCache', function () {
return undefined;
},

getDataAttributesByHash: async (hash: string) => {
if (hash === 'knownHash') {
return {
hash: 'knownHash',
size: 100,
contentType: 'text/plain',
id: 'knownId',
};
}
// Indexed in contiguous_data but the blob is missing from the store.
if (hash === 'indexedButNoBlob') {
return { hash: 'indexedButNoBlob', size: 50 };
}
return undefined;
},

// eslint-disable-next-line no-empty-pattern
saveDataContentAttributes: async ({}: {
id: string;
@@ -169,6 +185,52 @@ describe('ReadThroughDataCache', function () {
mock.restoreAll();
});

describe('getDataByHash', () => {
it('streams indexed content addressed by hash, marked self-verifying', async () => {
const result = await readThroughDataCache.getDataByHash('knownHash');

assert.equal(result.hash, 'knownHash');
assert.equal(result.size, 100);
assert.equal(result.totalSize, 100);
assert.equal(result.sourceContentType, 'text/plain');
// Content-addressed reads are self-verifying and always local-cache.
assert.equal(result.verified, true);
assert.equal(result.trusted, true);
assert.equal(result.cached, true);
// The single internal lookup also surfaces the representative id.
assert.equal(result.representativeId, 'knownId');

const chunks: Buffer[] = [];
for await (const chunk of result.stream) {
chunks.push(Buffer.from(chunk));
}
assert.equal(Buffer.concat(chunks).toString(), 'simulated data');
});

it('honors a byte region', async () => {
const result = await readThroughDataCache.getDataByHash('knownHash', {
offset: 0,
size: 4,
});
assert.equal(result.size, 4);
assert.equal(result.totalSize, 100);
});
Comment on lines +210 to +217


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Assert that the region is passed to the store call.

This test currently validates return metadata only; it can still pass if region is accidentally dropped before dataStore.get().

🧪 Suggested assertion hardening
   it('honors a byte region', async () => {
+    let receivedRegion: { offset: number; size: number } | undefined;
+    mock.method(
+      mockContiguousDataStore,
+      'get',
+      async (hash: string, region?: { offset: number; size: number }) => {
+        if (hash === 'knownHash') {
+          receivedRegion = region;
+          const stream = new Readable();
+          stream.push('simulated data');
+          stream.push(null);
+          return stream;
+        }
+        return undefined;
+      },
+    );
+
     const result = await readThroughDataCache.getDataByHash('knownHash', {
       offset: 0,
       size: 4,
     });
     assert.equal(result.size, 4);
     assert.equal(result.totalSize, 100);
+    assert.deepEqual(receivedRegion, { offset: 0, size: 4 });
   });
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/data/read-through-data-cache.test.ts` around lines 208 - 215, The test
for readThroughDataCache.getDataByHash currently only asserts returned metadata;
update the test to also assert that the underlying dataStore.get was called with
the expected region object (offset and size) so the region is not dropped.
Locate the call to readThroughDataCache.getDataByHash in the test and add an
assertion against the mocked/spied dataStore.get (or equivalent spy used in this
spec) verifying its arguments include a region matching { offset: 0, size: 4 }
(or that the second/appropriate parameter contains that region), ensuring the
store call receives the region.


it('rejects when the hash is not indexed', async () => {
await assert.rejects(
readThroughDataCache.getDataByHash('unknownHash'),
/No content indexed/,
);
});

it('rejects when indexed but the blob is missing from the store', async () => {
await assert.rejects(
readThroughDataCache.getDataByHash('indexedButNoBlob'),
/No cached data/,
);
});
});

describe('getCachedData', () => {
it('should return data from cache when available', async () => {
let calledWithArgument: string;
59 changes: 59 additions & 0 deletions src/data/read-through-data-cache.ts
@@ -20,6 +20,7 @@ import * as metrics from '../metrics.js';
import { KvJsonStore } from '../store/kv-attributes-store.js';
import { startChildSpan } from '../tracing.js';
import {
ByHashData,
ContiguousData,
ContiguousDataAttributesStore,
ContiguousDataIndex,
@@ -435,6 +436,64 @@ export class ReadThroughDataCache implements ContiguousDataSource {
return undefined;
}

/**
* Serve contiguous data addressed directly by its content hash (the
* value emitted as X-AR-IO-Digest and used as the on-disk cache key).
*
* Unlike {@link getData}, there is no id, no manifest/ArNS resolution, and
* no upstream fall-through: Arweave and peers address by transaction id,
* not by content hash, so a hash we have never materialized cannot be
* fetched on demand. The endpoint therefore serves only content already
* present in the local content store. Because the store is keyed by the
* SHA-256 of the bytes, a successful read is self-verifying — the bytes
* provably hash to the requested digest — so the result is reported as
* verified, trusted, and cached.
*
* @throws if no content is indexed for the hash, or the indexed blob is
* missing from the store (evicted/pruned between index and read).
*/
async getDataByHash(
hash: string,
region?: {
offset: number;
size: number;
},
): Promise<ByHashData> {
const attributes =
await this.contiguousDataIndex.getDataAttributesByHash(hash);
if (attributes === undefined) {
throw new Error(`No content indexed for hash: ${hash}`);
}

const cacheStream = await this.dataStore.get(hash, region);
if (cacheStream === undefined) {
throw new Error(`No cached data found for hash: ${hash}`);
}

const requestType = region !== undefined ? 'range' : 'full';
metrics.getDataStreamSuccessesTotal.inc({
class: this.constructor.name,
source: 'cache',
request_type: requestType,
});

const totalSize = attributes.size;
return {
hash,
stream: cacheStream,
size: region?.size ?? totalSize,
totalSize,
sourceContentType: attributes.contentType,
// Content-addressed: the bytes provably hash to the requested digest.
verified: true,
trusted: true,
cached: true,
// A representative id resolving to this hash (the same single lookup
// above), so callers need not re-query to emit id-scoped headers.
representativeId: attributes.id,
};
Comment on lines +473 to +494


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Defer success metrics until the stream actually completes.

getDataByHash() increments getDataStreamSuccessesTotal before bytes are consumed. If the stream errors later, this path will still report success.

📈 Suggested fix
   const requestType = region !== undefined ? 'range' : 'full';
-  metrics.getDataStreamSuccessesTotal.inc({
-    class: this.constructor.name,
-    source: 'cache',
-    request_type: requestType,
-  });
+  cacheStream.once('error', () => {
+    metrics.getDataStreamErrorsTotal.inc({
+      class: this.constructor.name,
+      source: 'cache',
+      request_type: requestType,
+    });
+  });
+  cacheStream.once('end', () => {
+    metrics.getDataStreamSuccessesTotal.inc({
+      class: this.constructor.name,
+      source: 'cache',
+      request_type: requestType,
+    });
+  });
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/data/read-through-data-cache.ts` around lines 472 - 490, The success
metric is being incremented too early; remove the immediate call to
metrics.getDataStreamSuccessesTotal.inc(...) and instead attach a one-time
listener to cacheStream (use cacheStream.once('end', ...)) to call
metrics.getDataStreamSuccessesTotal.inc with the same labels (class:
this.constructor.name, source: 'cache', request_type: requestType) when the
stream actually finishes; also attach a one-time 'error' listener to cacheStream
to avoid incrementing on failure (or increment a failure metric if available).
Ensure you handle the case the stream has already ended (check
cacheStream.readableEnded or equivalent) and call the metric immediately in that
case. Use the existing symbols cacheStream, requestType,
metrics.getDataStreamSuccessesTotal, attributes, region, and hash to locate and
implement the change.
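
The prompt above also asks for the already-ended case to be handled, which the suggested diff does not show. One possible shape for that logic is sketched below as a standalone helper. The name `recordStreamOutcome`, the `EndableStream` interface, and the callback parameters are all hypothetical; in the real method the callbacks would presumably wrap the `metrics` counters.

```typescript
// Minimal structural type: the subset of a Node.js Readable used here.
interface EndableStream {
  readableEnded: boolean;
  once(event: string, listener: (...args: unknown[]) => void): unknown;
}

// Hypothetical helper: invoke exactly one callback, depending on whether
// the stream ends cleanly or emits an error first.
export function recordStreamOutcome(
  stream: EndableStream,
  onSuccess: () => void,
  onError: () => void,
): void {
  // The stream finished before we attached listeners: count it now.
  if (stream.readableEnded) {
    onSuccess();
    return;
  }
  let settled = false;
  const settle = (record: () => void) => () => {
    if (!settled) {
      settled = true;
      record();
    }
  };
  stream.once('end', settle(onSuccess));
  stream.once('error', settle(onError));
}
```

The `settled` guard keeps the two counters mutually exclusive for a given stream, so a late `end` after an `error` is not counted as a success.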

}

async getData({
id,
requestAttributes,
24 changes: 24 additions & 0 deletions src/database/sql/data/content-attributes.sql
@@ -155,6 +155,30 @@ FROM (
)
LIMIT 1

-- selectDataAttributesByHash
-- Reverse lookup: resolve content metadata directly from the content hash
-- (the value emitted as X-AR-IO-Digest and used as the on-disk cache key).
-- contiguous_data is keyed by hash (primary-key point lookup); the LEFT JOIN
-- additionally surfaces one representative id that resolves to this hash
-- (via the contiguous_data_hash index) so the content-addressed endpoint can
-- emit id-scoped response headers. Many ids may share a hash (byte-identical
-- content under different signed envelopes); the ORDER BY makes the choice
-- deterministic and prefers the strongest provenance — a verified id over a
-- trusted one over an arbitrary one — so the representative is stable across
-- requests rather than dependent on index iteration order.
SELECT
cd.hash,
cd.data_size,
cd.original_source_content_type,
cdi.id AS id
FROM contiguous_data cd
LEFT JOIN contiguous_data_ids cdi ON cdi.contiguous_data_hash = cd.hash
WHERE cd.hash = :hash
ORDER BY cdi.verified DESC NULLS LAST,
cdi.trusted DESC NULLS LAST,
cdi.id ASC
LIMIT 1
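
The ordering in the query above can be restated in application code, which may help when reasoning about which id ends up as the representative. The sketch below mirrors `verified DESC NULLS LAST, trusted DESC NULLS LAST, id ASC` over in-memory rows. The `IdRow` shape and the function names are hypothetical, and it assumes ids compare as plain ASCII strings.

```typescript
// Hypothetical row shape for the id-side columns used by the ORDER BY.
interface IdRow {
  id: string;
  verified: boolean | null;
  trusted: boolean | null;
}

// DESC NULLS LAST for a nullable flag: true first, then false, then null.
function flagRank(flag: boolean | null): number {
  if (flag === true) return 0;
  if (flag === false) return 1;
  return 2;
}

// Pick the row the query's ORDER BY ... LIMIT 1 would return, or
// undefined when no id rows exist for the hash.
export function pickRepresentative(rows: IdRow[]): IdRow | undefined {
  return [...rows].sort(
    (a, b) =>
      flagRank(a.verified) - flagRank(b.verified) ||
      flagRank(a.trusted) - flagRank(b.trusted) ||
      (a.id < b.id ? -1 : a.id > b.id ? 1 : 0),
  )[0];
}
```

The final tie-break on `id` is what makes the choice repeatable when several ids share the same verification state.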

-- selectDataParent
SELECT
cdip.parent_id,