Uploading large files to a system like S3 requires techniques that work around single-request size limits (S3 caps a single PUT at 5 GB) and that keep the transfer reliable and efficient.

Multipart upload is the standard method to handle large files efficiently in S3-like systems. Here’s how it works:

  1. Initiation:
    • The client sends a request to initiate a multipart upload.
    • S3 generates a unique upload ID to track the parts of this specific upload.
  2. Chunking:
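    • The file is split into smaller chunks (parts). In S3, every part except the last must be at least 5 MB, no part may exceed 5 GB, and one upload can hold at most 10,000 parts.
    • Each chunk is uploaded independently using the upload ID and a part number (an integer from 1 to 10,000 that fixes its position in the final object).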
  3. Upload Parts:
    • Each chunk is uploaded as a separate request using PUT. S3 responds with an ETag for the part, which the client must keep for the completion step.
    • Example request:
      PUT /mybucket/myobject?partNumber=1&uploadId=xyz HTTP/1.1
      
    • Chunks can be uploaded in parallel for improved speed.
  4. Completion:
    • Once all parts are uploaded, the client sends a complete request that lists every part number together with the ETag S3 returned for it.
    • S3 assembles the chunks, in part-number order, into a single object.
  5. Failure Recovery:
    • If a single chunk fails, only that chunk is retried, not the entire file.
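
The five steps above can be sketched in Python roughly as follows. This is a minimal illustration, not a definitive implementation: it assumes `client` is a boto3-style S3 client (for example `boto3.client("s3")`), and the helper names `split_into_parts`, `upload_part_with_retry`, and `multipart_upload`, the 8 MiB part size, and the retry count are hypothetical choices made for the example. For brevity it takes the payload as in-memory bytes; a real uploader for very large files would more likely read byte ranges from disk.

```python
from concurrent.futures import ThreadPoolExecutor

# Assumed part size; S3 requires at least 5 MB for every part except the last.
PART_SIZE = 8 * 1024 * 1024


def split_into_parts(data, part_size=PART_SIZE):
    """Return (part_number, chunk) pairs; part numbers start at 1."""
    return [
        (index + 1, data[offset:offset + part_size])
        for index, offset in enumerate(range(0, len(data), part_size))
    ]


def upload_part_with_retry(client, bucket, key, upload_id, part_number, chunk, attempts=3):
    """Upload one part, retrying only this part if the request fails."""
    for attempt in range(1, attempts + 1):
        try:
            response = client.upload_part(
                Bucket=bucket, Key=key, UploadId=upload_id,
                PartNumber=part_number, Body=chunk,
            )
            # The ETag is needed later to complete the upload.
            return {"PartNumber": part_number, "ETag": response["ETag"]}
        except Exception:  # a real client would catch narrower errors
            if attempt == attempts:
                raise


def multipart_upload(client, bucket, key, data, part_size=PART_SIZE, workers=4):
    """Initiate, upload parts in parallel, then complete (or abort on failure)."""
    upload_id = client.create_multipart_upload(Bucket=bucket, Key=key)["UploadId"]
    try:
        with ThreadPoolExecutor(max_workers=workers) as pool:
            futures = [
                pool.submit(upload_part_with_retry, client, bucket, key,
                            upload_id, number, chunk)
                for number, chunk in split_into_parts(data, part_size)
            ]
            # Futures were submitted in part order, so results are ascending.
            parts = [future.result() for future in futures]
        return client.complete_multipart_upload(
            Bucket=bucket, Key=key, UploadId=upload_id,
            MultipartUpload={"Parts": parts},
        )
    except Exception:
        # Abort so the already-stored parts do not linger in the bucket.
        client.abort_multipart_upload(Bucket=bucket, Key=key, UploadId=upload_id)
        raise
```

Passing the client in as a parameter keeps the sketch independent of credentials and makes it easy to exercise with a stub object before pointing it at a real bucket.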

Pre-Signed URLs for Chunked Uploads

For security and efficiency:

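  • Pre-signed URLs allow clients to upload each chunk directly to S3 without holding AWS credentials and without routing file data through the backend.
  • The backend initiates the multipart upload and then generates one pre-signed URL per chunk. Each URL is bound to the upload ID and a part number, and it expires after a set time.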
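
As a rough sketch of the backend side, the function below produces one URL per part. It assumes a boto3-style S3 client, whose `generate_presigned_url` method accepts `"upload_part"` as the client method; the function name `presign_part_urls` and the one-hour default expiry are hypothetical choices for illustration.

```python
def presign_part_urls(client, bucket, key, upload_id, part_count, expires_in=3600):
    """Return {part_number: url} for parts 1..part_count of one multipart upload."""
    return {
        part_number: client.generate_presigned_url(
            ClientMethod="upload_part",
            Params={
                "Bucket": bucket,
                "Key": key,
                "UploadId": upload_id,      # ties the URL to this upload
                "PartNumber": part_number,  # ties the URL to this part
            },
            ExpiresIn=expires_in,  # lifetime of the URL in seconds
        )
        for part_number in range(1, part_count + 1)
    }
```

The uploading client would then send each chunk with an HTTP PUT to its URL and keep the ETag response header, which the backend needs in order to complete the upload.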

Other Considerations for Large File Uploads

  1. Parallelism:
    • Multipart uploads leverage parallelism to speed up transfers, especially on high-bandwidth networks.
  2. Integrity Checks:
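    • Each chunk can be validated with a checksum (e.g., MD5): the client sends the checksum with the part, and S3 rejects the part if the received data does not match.
    • On completion, S3 checks the submitted part numbers and ETags against the parts it stored. The ETag of a multipart object is generally not an MD5 of the whole file, so verifying the entire object end to end requires a separate checksum.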
  3. Resume Support:
    • An interrupted upload can be resumed using its unique upload ID: the client asks S3 which parts it already holds and uploads only the missing ones.
  4. Maximum File Size:
    • S3 supports objects up to 5 TB with multipart upload.
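
The integrity and resume points can be combined in a small sketch that decides which parts still need to be sent after an interruption. It assumes a boto3-style S3 client with a paginated `list_parts` call, and it also assumes that each part's ETag is the hex MD5 of the part's bytes, which typically holds for uploads without SSE-KMS encryption but should be checked for your setup. The helper names `list_uploaded_parts` and `parts_to_resend` are hypothetical.

```python
import hashlib


def list_uploaded_parts(client, bucket, key, upload_id):
    """Return {part_number: etag} for parts S3 already holds, following pagination."""
    uploaded = {}
    kwargs = {"Bucket": bucket, "Key": key, "UploadId": upload_id}
    while True:
        page = client.list_parts(**kwargs)
        for part in page.get("Parts", []):
            # ETags come back wrapped in double quotes; strip them for comparison.
            uploaded[part["PartNumber"]] = part["ETag"].strip('"')
        if not page.get("IsTruncated"):
            return uploaded
        kwargs["PartNumberMarker"] = page["NextPartNumberMarker"]


def parts_to_resend(chunks, uploaded):
    """Return the part numbers that are missing remotely or whose MD5 differs.

    chunks maps part_number -> bytes of the local chunk.
    Assumption: a part's ETag equals the hex MD5 of its bytes.
    """
    return sorted(
        number
        for number, chunk in chunks.items()
        if uploaded.get(number) != hashlib.md5(chunk).hexdigest()
    )
```

Only the part numbers returned by `parts_to_resend` need to be uploaded again before sending the complete request.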