
Conversation

Ben-El
Contributor

@Ben-El Ben-El commented Sep 4, 2025

Closes #9387

This PR strengthens the deployment workflow to prevent overwriting released Spark client jars
on the treeverse-clients-us-east bucket.

Changes

  • Added a safe publish step (`s3PutIfAbsent`) that uses `If-None-Match: *`, ensuring S3 rejects an upload if the object already exists.
  • Updated `publish-scala` to run `s3PutIfAbsent`.
  • Removed the slow recursive `aws s3 cp --acl public-read ...` step.
  • Updated the README (`clients/spark/README.md`).

Public readability is currently granted by passing the `"--acl", "public-read"` flag to the `s3PutIfAbsent` command; in the near future it will be enforced at the bucket level rather than via ACLs. A minimal sketch of the guarded publish appears below.
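For context, here is a minimal sketch of what such a guarded publish task could look like in `build.sbt`. It is an illustration under assumptions, not the PR's exact code: the key layout and task wiring are placeholders, and it assumes sbt-assembly plus a recent AWS CLI that supports conditional writes (`--if-none-match`).

```scala
import scala.sys.process._

lazy val s3PutIfAbsent = taskKey[Unit](
  "Upload the assembly jar to S3 only if the key does not already exist"
)

s3PutIfAbsent := {
  val jar    = assembly.value                                   // jar built by sbt-assembly
  val bucket = "treeverse-clients-us-east"
  val key    = s"${name.value}/${version.value}/${jar.getName}" // placeholder key layout
  val cmd = Seq(
    "aws", "s3api", "put-object",
    "--bucket", bucket,
    "--key", key,
    "--body", jar.getAbsolutePath,
    "--region", "us-east-1",
    "--acl", "public-read", // temporary, until readability is enforced at the bucket level
    "--if-none-match", "*"  // S3 rejects the PUT with 412 Precondition Failed if the key exists
  )
  if (cmd.! != 0)
    sys.error(s"upload rejected; s3://$bucket/$key may already exist")
}
```

Because `--if-none-match "*"` makes the existence check part of the PUT itself, there is no window in which a concurrent publish could overwrite an already-released jar.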

Testing

As elaborated in the issue discussion.

Verified:

  • New jars upload successfully.
  • Re-publishing an existing version failed as expected.
  • Jars remain publicly accessible (thanks to the `"--acl", "public-read"` flag in the `s3PutIfAbsent` command; see the verification sketch below).
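A quick sketch of that verification as a standalone script; the URL is a placeholder, not the real artifact path:

```scala
import scala.sys.process._

object VerifyPublish extends App {
  // Placeholder: substitute the real bucket/key of the published jar.
  val url = "https://treeverse-clients-us-east.s3.amazonaws.com/<name>/<version>/<jar>"

  // Unauthenticated HEAD request: a 200 status means the jar is publicly readable.
  val status = Seq("curl", "-s", "-o", "/dev/null", "-w", "%{http_code}", "-I", url).!!.trim
  assert(status == "200", s"expected 200 for $url, got $status")
}
```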

Follow-ups:

@Ben-El Ben-El marked this pull request as draft September 4, 2025 19:28
@Ben-El Ben-El added the exclude-changelog and minor-change labels Sep 4, 2025
@Ben-El Ben-El temporarily deployed to Treeverse signing September 7, 2025 11:44 — with GitHub Actions
@Ben-El Ben-El changed the title Update AWS S3 configurations and add debug step in publish workflow Add S3 guard to prevent publishing existing JARs Sep 7, 2025
@Ben-El Ben-El temporarily deployed to Treeverse signing September 7, 2025 13:06 — with GitHub Actions
@Ben-El Ben-El temporarily deployed to Treeverse signing September 7, 2025 14:28 — with GitHub Actions
@Ben-El Ben-El temporarily deployed to Treeverse signing September 7, 2025 14:33 — with GitHub Actions
@Ben-El Ben-El temporarily deployed to Treeverse signing September 7, 2025 14:43 — with GitHub Actions
Contributor

@nopcoder nopcoder left a comment


If we run the AWS S3 CLI to perform the upload, we need to update the requirements in the env / README.
Also, sbt as a build tool should be able to run a command; we can use it without adding code to compile just to run a command.

@Ben-El Ben-El requested a review from nopcoder September 15, 2025 17:36
val region = "us-east-1"

val cmd = Seq(
  "aws", "s3api", "put-object",
Contributor


Search for a non-`s3api` command; `aws s3` is preferred.

Contributor Author

@Ben-El Ben-El Sep 16, 2025


@nopcoder

Using `aws s3 cp` instead of `s3api put-object` is possible with some overhead, but there are trade-offs:

`aws s3 cp` can perform multipart uploads; however, it does not support the `If-None-Match: "*"` condition. That means you cannot guarantee an atomic "create-if-absent" operation with `aws s3 cp`.

If we switched to `aws s3 cp`, the only workaround would be something like:

  • Checking first: `aws s3api head-object --bucket "$bucket" --key "$key" ...`
  • Then uploading: `aws s3 cp "$jar" "s3://$bucket/$key" ...`

But this is not atomic (see the sketch below).

For our use case, since the assembly jar is always < 5GB, I think `s3api put-object` fits nicely (it supports single-object uploads of up to 5GB).

If one day we publish artifacts larger than 5GB, we would need multipart upload (and would lose the atomic "no-overwrite" guarantee).
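To make the race concrete, here is an illustrative check-then-put in the same `sys.process` style (all names are placeholders, not the PR's code): between the `head-object` check and the `cp`, another publisher can create the key, and the copy then overwrites it.

```scala
import scala.sys.process._

object CheckThenPut extends App {
  val bucket = "treeverse-clients-us-east"             // placeholder
  val key    = "client/0.16.0/client-assembly.jar"     // placeholder
  val jar    = "target/scala-2.12/client-assembly.jar" // placeholder

  // head-object exits 0 when the key exists, non-zero otherwise.
  val exists = Seq("aws", "s3api", "head-object", "--bucket", bucket, "--key", key).! == 0
  if (!exists) {
    // RACE WINDOW: a concurrent publisher can create the key right here;
    // the cp below would then overwrite it. `--if-none-match "*"` closes this
    // gap by folding the existence check into the PUT itself.
    Seq("aws", "s3", "cp", jar, s"s3://$bucket/$key").!
  }
}
```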

@Ben-El Ben-El requested a review from nopcoder September 16, 2025 10:08
@Ben-El Ben-El requested a review from nopcoder September 16, 2025 11:43
@Ben-El Ben-El requested a review from nopcoder September 16, 2025 11:59
Contributor

@nopcoder nopcoder left a comment


lgtm

Contributor

@Isan-Rivkin Isan-Rivkin left a comment


Please release an experimental version from the branch after merge for a final sanity check?
Verify it's accessible from S3 with no auth or AWS CLI, just curl, and that the jar works!

@Ben-El Ben-El merged commit 7e32167 into master Sep 16, 2025
40 checks passed
@Ben-El Ben-El deleted the test-public-bucket branch September 16, 2025 13:54
@Ben-El
Contributor Author

Ben-El commented Sep 16, 2025

Please release an experimental version from the branch after merge for a final sanity check? Verify it's accessible from S3 with no auth or AWS CLI, just curl, and that the jar works!

Works fine!

nopcoder pushed a commit that referenced this pull request Sep 18, 2025
* Update AWS S3 configurations and add debug step in publish workflow

* Revert project version to 0.16.0-demo-9 in Spark client build configuration

* Add S3 overwrite guard to Scala publish process in Makefile

* Remove AWS debug step from publish workflow

* Update S3 bucket name in Makefile

* Refine S3 overwrite guard logic and update handling of bucket and key mapping in Makefile

* Simplify S3 overwrite guard logic in Makefile and streamline error handling

* Refactor S3 overwrite guard logic in Makefile to improve readability and robustness

* Enhance S3 overwrite guard to use HTTP status check via curl for improved accuracy and error handling in Makefile

* Improve S3 overwrite guard in Makefile with enhanced bucket extraction, additional debug logs, and stricter error handling

* Remove redundant comment from Makefile's S3 bucket extraction logic

* Enhance S3 overwrite guard in Makefile with improved host and key derivation logic, fallback handling, and additional debug logs

* Simplify S3 overwrite guard logic in Makefile by streamlining host and key derivation processes and improving error handling.

* Streamline S3 overwrite guard logic in Makefile by removing redundant debug logs and simplifying host, bucket, and key derivation processes.

* Improve S3 overwrite guard in Makefile by refining HTTP status checks and enhancing error messaging

* Bump Spark client version to 0.16.0-demo-10 in build.sbt

* Simplify S3 overwrite guard logic in Makefile by refining host and bucket derivation steps.

* Enhance S3 overwrite guard in Makefile by refining key derivation logic with name, version, and jar details

* Refine S3 overwrite guard in Makefile by enhancing bucket and key extraction logic with stricter validations and improved error handling

* Refine S3 overwrite guard in Makefile by simplifying bucket and key derivation logic using mappings and removing redundant validations.

* Refine S3 overwrite guard in Makefile by adding sbt flags for no colors and disabling supershell.

* Simplify S3 overwrite guard in Makefile by removing redundant sbt flags for no colors and supershell.

* Simplify S3 overwrite guard in Makefile by refining host, bucket, and key derivation logic and consolidating HTTP status checks.

* Simplify S3 overwrite guard in Makefile by inlining bucket derivation into URL creation.

* Bump Spark client version to 0.16.0-demo-11 in build.sbt

* Revert Spark client version to 0.16.0-demo-10 in build.sbt and simplify S3 URL derivation in Makefile by removing redundant bucket resolution.

* Bump Spark client version to 0.16.0-demo-12 in build.sbt

* Update S3 bucket and host details; rename Spark client and bump version to 0.16.0-demo-1.

* Update S3 bucket and host references to use benel-public-test.

* Replace S3 upload logic with atomic safe upload using `If-None-Match`, bump Spark client version to 0.16.0-demo-14, and add build-time dependencies for AWS SDK.

* Add placeholder for safe S3 upload logic using `If-None-Match`.

* Remove commented placeholder for safe S3 upload logic in build.sbt.

* Refactor safe S3 upload logic to use `RequestOverrideConfiguration` for `If-None-Match`.

* Remove redundant `contentType` specification from S3 upload request in build.sbt.

* Simplify `safeS3Upload` task definition and fix incorrect import in build.sbt.

* Move AWS SDK dependencies to plugins.sbt for build-time use and add `aws-core` dependency.

* Replace `safeS3Upload` with `s3PutIfAbsent` using AWS CLI for atomic S3 uploads.

* No changes made; reformat file to ensure proper newline at EOF.

* Bump Spark client version to 0.16.0-demo-15.

* Refactor `s3PutIfAbsent` to improve error handling and include region specification in S3 upload logic.

* Remove redundant assembly trigger from `s3PutIfAbsent` definition in build.sbt.

* Refactor `s3PutIfAbsent` to simplify variable naming and improve error handling.

* Bump Spark client version to 0.16.0-demo-16.

* Enhance `s3PutIfAbsent` with dynamic S3 host, improved URL handling, and success logging.

* Refactor `s3PutIfAbsent` to extract region as a variable for better maintainability.

* Add `sys.process` import to `s3PutIfAbsent` task in build.sbt

* Simplify `s3PutIfAbsent` by removing unused `s3Upload` settings and refactoring bucket URL handling.

* Refactor `s3PutIfAbsent` error handling by removing redundant success logging.

* Update `s3PutIfAbsent` URL to use S3-standard format for bucket URLs.

* Bump Spark client version to 0.16.0-demo-17.

* Add success logging for `s3PutIfAbsent` uploads in build.sbt.

* Bump Spark client version to 0.16.0-demo-19.

* Bump Spark client version to 0.16.0-demo-20 and enable public-read ACL for S3 uploads.

* commit

* commit

* Improve `s3PutIfAbsent` error handling: surface AWS CLI errors, refine artifact existence checks, and update bucket target.

* Refine `s3PutIfAbsent`: remove unused variable, adjust comments, and standardize S3 URL formatting in logs.

* Update Spark client README: add publishing requirements and flow details for maintainers

* Refactor Spark client S3 upload: replace `s3PutIfAbsent` with `s3Upload`, update related Makefile and build logic, and remove outdated README publishing details.

* Update `s3Upload` task description for clarity in Spark client build logic

* Make `publishBucket` configurable in Spark client build and improve `s3Upload` error handling.

* Update `publish-scala` description in Makefile and refine S3 upload logic in Spark client build

* Remove unused `sbts3` plugin from Spark client build configuration.

* Make `S3_PUBLISH_BUCKET` configurable in workflow and pass it to `publish-scala` task via Makefile.

* Improve error handling in `s3Upload`: enhance S3 failure message with bucket and key details.

* Make `PUBLISH_BUCKET_FLAG` optional in Makefile and improve S3 upload logging in Spark client

* Refine `s3Upload` in Spark client build: simplify logging and remove redundant error handling.

* Remove conditional `PUBLISH_BUCKET_FLAG` logic from Makefile and improve `publishBucket` handling in Spark client build

Successfully merging this pull request may close these issues.

Publish Spark metadata client: might fail and cause client to be unaccessible