
Conversation

mrnicegyu11 (Member)

What do these changes do?

  • Higher CPU limit for director-v0
  • Higher CPU limit for the registry, and no node-label placement constraints for it (sketched below)
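
In compose terms, the change has roughly this shape (an illustrative sketch only; the exact service definitions, values, and constraints are in the diffs discussed below):

```yaml
services:
  registry:                  # director-v0 gets an analogous CPU-limit bump
    deploy:
      resources:
        limits:
          memory: 1G
          cpus: '6'          # raised from '2'; see the review discussion below
      placement:
        constraints: []      # node-label placement constraints dropped here
```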

Related issue/s

Related PR/s

Checklist

  • I tested and it works

mrnicegyu11 and others added 30 commits October 15, 2024 16:18
Merge remote-tracking branch 'upstream/main'
[Kubernetes] Introduce on-prem persistent Storage (Longhorn) 🎉 (ITISFoundation#979)

* Introduce longhorn chart

* Further longhorn configuration

* Longhorn: further settings configuration

* Fix longhorn configuration bugs

Extra: introduce longhorn pv values for portainer

* Add comment for Longhorn deletion

* Further longhorn configuration

* Add README.md for Longhorn with FAQ

* Update Longhorn readme

* Update readme

* Further LH configuration

* Update LH's Readme

* Update Longhorn Readme

* Improve LH's Readme

* LH: Reduce reserved default disk space to 5%

Since we use a dedicated disk for LH, we can go ahead with 5%

* Use values to set Longhorn storage class

* Update LH's Readme

* LH Readme: add requirements reference

* PR Review: bring back portainer s3 pv

* LH: decrease portainer volume size
Merge remote-tracking branch 'upstream/main'
mrnicegyu11 marked this pull request as ready for review August 19, 2025 09:53
```diff
 resources:
   limits:
     memory: 1G
-    cpus: '2'
+    cpus: '6'
```
Collaborator

Why this number? What is the reasoning behind it?
Do we take the number of CPUs available on the machines into account?

Member Author

The number of available CPUs is not taken into account, and AFAIK this is correct; it does not matter (as much). [Talk to me for clarification ;)]

The number is an estimate based on Prometheus observations on osparc-master, e.g. the PromQL query `sum(clamp_max(rate(container_cpu_cfs_throttled_seconds_total{image=~"(registry:.*)|(traefik:.*)|(.*itisfoundation.*)|(.*director:.*)"}[2m]) > 0.2,3)) by (container_label_com_docker_swarm_service_name)`.
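
For readability, here is the same query broken out; the annotations are my reading of it, not the author's:

```promql
sum(
  clamp_max(
    rate(container_cpu_cfs_throttled_seconds_total{
      image=~"(registry:.*)|(traefik:.*)|(.*itisfoundation.*)|(.*director:.*)"
    }[2m])   # per-container CPU throttling rate (throttled CPU-seconds per second)
    > 0.2,   # keep only containers throttled above 0.2 CPU-seconds/s
    3        # cap each series at 3 so outliers don't dominate the sum
  )
) by (container_label_com_docker_swarm_service_name)
```

Roughly: how much CPU each swarm service is being denied by its CFS limit, summed per service.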

Member

Nevertheless, @mrnicegyu11, if you set a limit above what is actually available, the container never starts.

Member Author

OK, then I vote for removing the limit altogether. I can see that on master it was being throttled by more than 7 (additional) CPUs at a time, so 8 CPUs would be the right number. But that is too high to put into the limits section if what @sanderegg says is correct. @YuryHrytsuk @sanderegg

Member Author

Optional: Test CPU Limit
Optional: Benchmark with and without registry CPU limit

sanderegg (Member) left a comment

Thanks. For the registry: that will not go on the worker nodes where we run services in prod, right?

YuryHrytsuk (Collaborator) commented Aug 19, 2025

Is there an issue where we track all these efforts so we can have a summary?

mrnicegyu11 (Member Author)

> Thanks. For the registry: that will not go on the worker nodes where we run services in prod, right?

@sanderegg good point, they would right now. Can you provide me with a label that autoscaled machines have (by which they can be identified), so I can exclude at least those? Thx

sanderegg (Member) commented Aug 19, 2025

> > Thanks. For the registry: that will not go on the worker nodes where we run services in prod, right?
>
> @sanderegg good point, they would right now. Can you provide me with a label that autoscaled machines have (by which they can be identified), so I can exclude at least those? Thx

@mrnicegyu11 These labels are defined in osparc-config, so I guess it should at least not go anywhere where there are dynamic sidecars.

mrnicegyu11 (Member Author) commented Aug 19, 2025

Please note that this PR is currently blocking the master CD (CD is disabled until this PR is merged).
Update: unblocked.

mrnicegyu11 (Member Author)

> > > Thanks. For the registry: that will not go on the worker nodes where we run services in prod, right?
> >
> > @sanderegg good point, they would right now. Can you provide me with a label that autoscaled machines have (by which they can be identified), so I can exclude at least those? Thx
>
> @mrnicegyu11 These labels are defined in osparc-config, so I guess it should at least not go anywhere where there are dynamic sidecars.

I have added constraints along those lines.

Comment on lines 118 to 119:

```yaml
- node.labels.gpu!=true
- node.labels.dynamicsidecar!=true
```
YuryHrytsuk (Collaborator) commented Aug 20, 2025

Why remove the OPS constraint node.labels.ops==true and introduce negative constraints instead? Is there anything special about this service that prevents it from following the general convention (the ops label in this case)?
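
For reference, the convention being referred to looks like this (a minimal sketch, assuming the usual swarm compose placement syntax in this repo; the diff below shows it being removed):

```yaml
deploy:
  placement:
    constraints:
      - node.labels.ops==true   # positive selection: run only on ops-labeled nodes
```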

```diff
@@ -115,11 +115,12 @@ services:
       parallelism: 1
     placement:
       constraints:
-        - node.labels.ops==true
```
mrnicegyu11 (Member Author) commented Aug 20, 2025

Chore (follow-up, on all deployments):

  • use a .master. docker compose file or j2 instead of negative labels
  • or a spread-constraint (see the sketch below)
  • or: consider the manager machine
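
As an illustration of the spread-constraint option (swarm compose syntax; the label key here is illustrative, not from this PR):

```yaml
deploy:
  placement:
    preferences:
      - spread: node.labels.ops   # spread tasks evenly across nodes, grouped by this label's value
```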

Member Author

Optional: test / pre-commit (minor)

mrnicegyu11 and others added 9 commits August 20, 2025 09:24
…on#1182)

* wip

* Add csi-s3 and have portainer use it

* Change request @YuryHrytsuk: 1GB max portainer volume size

* Arch Linux Certificates Customization

* Fix pgsql exporter failure

* [Kubernetes] Introduce on-prem persistent Storage (Longhorn) 🎉  (ITISFoundation#979)

* Experimental: Try to add tracing to simcore-traefik on master

* Fixes ITISFoundation/osparc-simcore#7363

* Arch Linux Certificates Customization - 2

* Upgrade registry, add tracing

* revert accidental commit

---------

Co-authored-by: Dustin Kaiser <[email protected]>
Co-authored-by: YH <[email protected]>