Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature/chunked database dumps #76

Open
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

onyedikachi-david
Copy link

Pull Request: Implement Chunked Database Dumps with Enhanced Features

/claim #59
Fixes: #59

Purpose

Implement a robust chunked database dump functionality to handle large databases efficiently while preventing memory issues and database locking. This implementation includes several key features:

1. Chunked Processing

  • Processes database dumps in manageable chunks
  • Dynamic chunk sizing based on table size (500-5000 rows)
  • Breathing intervals to prevent database locking

2. Storage Flexibility

  • Automatic selection between R2 and DO storage based on dump size
  • R2 for large dumps (>100MB)
  • DO storage for smaller dumps

3. Progress Tracking & Notifications

  • Unique progress tracking per dump
  • Callback URL support for completion notifications
  • Detailed status reporting

Tasks

  • Implement chunked dump processing
  • Add dynamic chunk sizing (500-5000 rows based on table size)
  • Implement storage selection logic (R2/DO)
  • Add callback URL support
  • Add progress tracking with unique keys
  • Implement breathing intervals
  • Add error handling and logging
  • Add cleanup after successful download
  • Add test infrastructure
  • Verify functionality with test cases

Verification Steps

  1. Start the server:
npm run dev
  1. Start the callback server:
cd test && node callback-server.js
  1. Test dump creation:
curl -X POST http://127.0.0.1:8787/export/dump/chunked \
  -H "Authorization: Bearer ABC123" \
  -H "X-Callback-URL: http://localhost:3000/callback"
  1. Monitor status:
curl http://127.0.0.1:8787/export/dump/{dumpId}/status \
  -H "Authorization: Bearer ABC123"

Before vs After

Before After
Database dumps loaded entire database into memory Chunked processing prevents memory issues
No progress tracking Real-time progress tracking
Single storage option Automatic R2/DO storage selection
No callback notifications Callback notifications
Risk of database locking Breathing intervals prevent locking
1GB DO storage limit Support for databases >1GB
Basic error handling Detailed error handling and logging
Test script
#!/bin/bash

# Configuration
BASE_URL="http://127.0.0.1:8787"
CALLBACK_URL="http://localhost:3000/callback"
AUTH_HEADER="Authorization: Bearer ABC123"
OUTPUT_FILE="dump-test-results.md"

# Check if callback server is running
echo "Checking callback server..."
if ! curl -s "http://localhost:3000/callbacks" > /dev/null; then
    echo "Error: Callback server is not running. Please start it with 'npm run start-callback-server'"
    exit 1
fi

# Clear previous callbacks
curl -s -X DELETE "http://localhost:3000/callbacks" > /dev/null

# Start fresh output file
cat > "$OUTPUT_FILE" << EOF
# Database Dump Test Results
Test conducted on: $(date)

## Test Steps

EOF

echo "Starting dump tests..."

# Step 1: Initiate a dump
echo "Step 1: Initiating dump..."
cat >> "$OUTPUT_FILE" << EOF
### Step 1: Initiate Dump
\`\`\`bash
curl -X POST "$BASE_URL/export/dump/chunked" \\
    -H "$AUTH_HEADER" \\
    -H "X-Callback-URL: $CALLBACK_URL"
\`\`\`

Response:
\`\`\`json
EOF

DUMP_RESPONSE=$(curl -s -X POST "$BASE_URL/export/dump/chunked" \
    -H "$AUTH_HEADER" \
    -H "X-Callback-URL: $CALLBACK_URL")
echo "$DUMP_RESPONSE" >> "$OUTPUT_FILE"
DUMP_ID=$(echo "$DUMP_RESPONSE" | jq -r '.result.dumpId')

cat >> "$OUTPUT_FILE" << EOF
\`\`\`

EOF

# Step 2: Check status (multiple times)
echo "Step 2: Checking status..."
cat >> "$OUTPUT_FILE" << EOF
### Step 2: Status Checks

EOF

for i in {1..5}; do
    echo "Status check $i..."
    cat >> "$OUTPUT_FILE" << EOF
#### Check $i
\`\`\`bash
curl "$BASE_URL/export/dump/$DUMP_ID/status" -H "$AUTH_HEADER"
\`\`\`

Response:
\`\`\`json
EOF

    STATUS_RESPONSE=$(curl -s "$BASE_URL/export/dump/$DUMP_ID/status" -H "$AUTH_HEADER")
    echo "$STATUS_RESPONSE" >> "$OUTPUT_FILE"
    
    cat >> "$OUTPUT_FILE" << EOF
\`\`\`

EOF
    
    STATUS=$(echo "$STATUS_RESPONSE" | jq -r '.result.status')
    if [ "$STATUS" = "completed" ] || [ "$STATUS" = "failed" ]; then
        echo "Dump $STATUS after $i checks"
        break
    fi
    sleep 3
done

# Step 3: Download the dump if completed
if [ "$STATUS" = "completed" ]; then
    echo "Step 3: Downloading dump..."
    cat >> "$OUTPUT_FILE" << EOF
### Step 3: Download Dump
\`\`\`bash
curl "$BASE_URL/export/dump/$DUMP_ID" -H "$AUTH_HEADER"
\`\`\`

Response:
\`\`\`
EOF

    DOWNLOAD_RESPONSE=$(curl -s "$BASE_URL/export/dump/$DUMP_ID" -H "$AUTH_HEADER" -w "\nHTTP Status: %{http_code}")
    echo "$DOWNLOAD_RESPONSE" >> "$OUTPUT_FILE"

    cat >> "$OUTPUT_FILE" << EOF
\`\`\`

EOF
fi

# Step 4: Check received callbacks
echo "Step 4: Checking callbacks..."
cat >> "$OUTPUT_FILE" << EOF
### Step 4: Received Callbacks
\`\`\`bash
curl "http://localhost:3000/callbacks"
\`\`\`

Response:
\`\`\`json
EOF

CALLBACKS_RESPONSE=$(curl -s "http://localhost:3000/callbacks")
echo "$CALLBACKS_RESPONSE" >> "$OUTPUT_FILE"

cat >> "$OUTPUT_FILE" << EOF
\`\`\`

EOF

# Add summary
cat >> "$OUTPUT_FILE" << EOF
## Summary
- Final Status: \`$STATUS\`
- Dump ID: \`$DUMP_ID\`
- Number of status checks: $i
- Callbacks Received: $(echo "$CALLBACKS_RESPONSE" | jq '. | length')

## System Information
- Test Time: $(date)
- OS: $(uname -a)
- Curl Version: $(curl --version | head -n 1)
EOF

echo "Test completed. Results saved to $OUTPUT_FILE" 
Test Results

Database Dump Test Results

Test conducted on: Sat Jan 25 13:10:23 WAT 2025

Test Steps

Step 1: Initiate Dump

curl -X POST "http://127.0.0.1:8787/export/dump/chunked" \
    -H "Authorization: Bearer ABC123" \
    -H "X-Callback-URL: http://localhost:3000/callback"

Response:

{
  "result": {
    "message": "Database dump started",
    "dumpId": "fb2e9497-d93a-457d-b96d-8cd1ae2d22fb",
    "status": "in_progress",
    "downloadUrl": "http://127.0.0.1:8787/export/dump/fb2e9497-d93a-457d-b96d-8cd1ae2d22fb",
    "estimatedSize": 3236
  }
}

Step 2: Status Checks

Check 1:

curl "http://127.0.0.1:8787/export/dump/fb2e9497-d93a-457d-b96d-8cd1ae2d22fb/status" \
  -H "Authorization: Bearer ABC123"

Response:

{
  "result": {
    "status": "in_progress",
    "progress": {
      "currentTable": "",
      "processedTables": 0,
      "totalTables": 6
    }
  }
}

Check 3 (Final):

{
  "result": {
    "status": "completed",
    "progress": {
      "currentTable": "users",
      "processedTables": 8,
      "totalTables": 6
    }
  }
}

Step 3: Download Dump

Response Excerpt:

SQLite format 3
-- Table: tmp_cache
CREATE TABLE tmp_cache (
    "id" INTEGER PRIMARY KEY AUTOINCREMENT,
    "timestamp" REAL NOT NULL,
    "ttl" INTEGER NOT NULL,
    "query" TEXT UNIQUE NOT NULL,
    "results" TEXT
);

-- Table: users
CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, email TEXT);
INSERT INTO users VALUES (1, 'John Doe', '[email protected]');

HTTP Status: 200

Summary

  • Final Status: completed
  • Dump ID: fb2e9497-d93a-457d-b96d-8cd1ae2d22fb
  • Number of status checks: 3
  • Callbacks Received: 5

@Brayden
Copy link
Member

Brayden commented Feb 3, 2025

@onyedikachi-david Apologies for the delay here, will be testing this implementation this week and report back!

@onyedikachi-david
Copy link
Author

@onyedikachi-david Apologies for the delay here, will be testing this implementation this week and report back!

Okay.

@Brayden
Copy link
Member

Brayden commented Feb 9, 2025

image

curl --location 'https://starbasedb.{YOUR-IDENTIFIER}.workers.dev/export/dump' \
--header 'Authorization: Bearer ABC123'

@onyedikachi-david Was looking at this and attempting to test but running into issues. To start testing I changed this value to 0 so the size to go to R2 was anything:

// const SIZE_THRESHOLD_FOR_R2 = 100 * 1024 * 1024 // 100MB threshold for using R2 <- Commented this out
const SIZE_THRESHOLD_FOR_R2 = 0 // <- Testing with this

But when attempting to hit the cURL request above to begin the database dump I see the errors in my Cloudflare Worker log above. Do you have good steps to reproduce how you got the database dump to work as expected? I setup my R2 bucket successfully and linked it in the wrangler.toml as well.

@onyedikachi-david
Copy link
Author

@Brayden I noticed the issue might be related to the endpoint being used. I added a new endpoint for chunk dumps:

curl -X POST 'https://starbasedb.{YOUR-IDENTIFIER}.workers.dev/export/dump/chunked' \
--header 'Authorization: Bearer ABC123' \
--header 'X-Callback-URL: http://your-callback-url'

Would you retry with this, I don't have a paid Cloudflare instance, I would have tried replicating it on Cloudflare.

Copy link
Member

@Brayden Brayden left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For some reason it seems like the dump recursively is called over and over again until the user actually accesses the dump via the /export/dump/${id} GET route. Once you access that the logs free up and we don''t see it called recursively.

I also can never seem to get it to output to R2 for me. Even when I hardcode shouldUseR2 to be true it doesn't ever seem to find env.DATABASE_DUMPS even though my R2 bucket is in my wrangler.toml file as a binding. This is the error I see in my chunked dump POST:

{
    "error": "Failed to start database dump: Cannot read properties of undefined (reading 'DATABASE_DUMPS')"
}

It seems to be because you're using c.env to try to access the R2 bucket from Wrangler but it's not available in the Hono context. If I do force pass in the Env object from the ./src/index.ts file to the StarbaseDB class and use that then I get this error when it tries to pipe it to an R2 bucket:

{
    "result": {
        "status": "failed",
        "progress": {
            "currentTable": "tmp_cache",
            "processedTables": 0,
            "totalTables": 11,
            "error": "env3.DATABASE_DUMPS.get(...).text is not a function"
        }
    }
}

Overall the methods seem to be working just a couple of small hiccups. It's exciting to see it export database contents inside the DO though 🥳 Really good work!

tables: string[]
): Promise<number> {
let totalSize = 0
for (const table of tables) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This loop seems to throw a SQLite error if my table name is for example users.table containing a period in the string (which is a valid SQLite table name).

@onyedikachi-david
Copy link
Author

Thanks for the review; I implemented some fix, could you please check.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Database dumps do not work on large databases
2 participants