Conversation

EricHallahan
Contributor

The requirements files specified in ./requirements have historically been strictly pinned to prevent the CI Docker images from changing without our prior knowledge. However, this places a burden on users who would like to run GPT-NeoX on a host without containerization, and many non-critical packages are unnecessarily strictly specified as well. This motivates the following changes in this PR:

  • Loosen the dependency version requirements for non-critical packages
  • Remove the unnecessary einops and mpi4py packages
  • Make the installation of wandb optional, with its dependencies specified in a new requirements file ./requirements/requirements-wandb.txt
  • Fix an oversight that made the Megatron GPT-2 tokenizer inaccessible
  • Clean up the deepy.py DeepSpeed launcher script

StellaAthena previously approved these changes Nov 7, 2021

@StellaAthena
Member

This looks good to me. @EricHallahan’s explanation of how he decided which imports go inside main() seems like a reasonable rule of thumb, and I like the changes to the requirements system in general.

I haven’t run this myself, as I’m under the impression that you’ve been testing it extensively.

@sdtblck
Contributor

sdtblck commented Nov 7, 2021

Aside from the above ^ lgtm 🚀

StellaAthena previously approved these changes Nov 7, 2021

@StellaAthena
Member

@EricHallahan is there something you’re waiting on to merge this?

@EricHallahan
Contributor Author

EricHallahan commented Nov 8, 2021

I am working on verifying that the behavior with regard to missing Weights & Biases dependencies is what I intended/makes sense, and I also need to add corresponding documentation. I expect to have this ready to merge sometime early tomorrow.

Eliminate the usage of `shortuuid`
- Fix CITATION.cff
- Update README.md to reflect changes to wandb installation
- Remove `shortuuid` from requirements-wandb.txt

Enable wandb by default for the EleutherAI cluster.
@EricHallahan removed the request for review from joshlk November 9, 2021 17:28
@@ -22,14 +22,9 @@ def _get_cuda_bare_metal_version(cuda_dir):
srcpath = Path(__file__).parent.absolute()
cc_flag = []
_, bare_metal_major, _ = _get_cuda_bare_metal_version(cpp_extension.CUDA_HOME)
if int(bare_metal_major) >= 11:
Contributor

can you explain what this change is doing?

Contributor Author

This code forces building sm_70 (Volta) and sm_80 (Ampere) and no other architectures, which means the built extensions will fail to execute on older architectures like Kepler, Maxwell, or Pascal. By default, CUDAExtension builds for the locally detected hardware, so it is safe to remove this code. Should it be needed, compiling for non-local hardware is as easy as setting TORCH_CUDA_ARCH_LIST in the environment prior to running the script.
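
For illustration, here is a minimal sketch of a build script that relies on that default behavior (the file and extension names are hypothetical, not the actual fused_kernels.py contents):

    # Hypothetical minimal build script. With no explicit -gencode flags,
    # CUDAExtension compiles for the architectures of the locally visible GPUs.
    from setuptools import setup
    from torch.utils import cpp_extension

    setup(
        name="example_fused_kernels",
        ext_modules=[
            cpp_extension.CUDAExtension(
                name="example_fused_kernels",
                sources=["example_kernels.cpp", "example_kernels_cuda.cu"],
            )
        ],
        cmdclass={"build_ext": cpp_extension.BuildExtension},
    )

To cross-compile for hardware that is not present locally, one would set e.g. TORCH_CUDA_ARCH_LIST="7.0 8.0" in the environment before running the script.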

Contributor

Ok, sounds good. Did you test that the kernels built with this change

  1. still work the same on the A100s, and
  2. work on older architectures?

Re: the TORCH_CUDA_ARCH_LIST thing, should we add a comment saying so?

Contributor Author

I have tested it many times and have not seen any issues, and yes, we should absolutely add documentation about what to do if you need to force the arch.

try:
    import wandb
except ModuleNotFoundError:
    pass
Contributor

Won't wandb not being present cause some errors later in training?

Contributor Author

If wandb is not installed, the failed import sets use_wandb to False. All subsequent Weights & Biases code relies on use_wandb being True, and therefore nothing imported from wandb is ever executed.
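
A rough sketch of the pattern being described (the variable and function names here are illustrative, not the exact GPT-NeoX code):

    # Illustrative guard pattern: degrade gracefully when wandb is absent.
    try:
        import wandb

        WANDB_AVAILABLE = True
    except ModuleNotFoundError:
        WANDB_AVAILABLE = False


    def init_wandb(neox_args):
        # If the user requested wandb but it is not installed, disable it.
        if neox_args.use_wandb and not WANDB_AVAILABLE:
            print("wandb is not installed; disabling wandb logging.")
            neox_args.use_wandb = False
        if neox_args.use_wandb:
            wandb.init(project=neox_args.wandb_project)


    def log_metrics(neox_args, metrics, step):
        # Every wandb call is gated on use_wandb, so the module is never
        # touched when the import failed.
        if neox_args.use_wandb:
            wandb.log(metrics, step=step)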

Contributor

@sdtblck left a comment

I'm not sure of the motivation for the change to fused_kernels.py.
Also, I really don't like the requirements being this granular. I don't see the need for separate requirements files for wandb / tensorboard.

@StellaAthena
Member

I'm not sure of the motivation for the change to fused_kernels.py. Also, I really don't like the requirements being this granular. I don't see the need for separate requirements files for wandb / tensorboard.

If someone doesn't have WandB available and doesn't wish to use it, how would you prefer they proceed?

@sdtblck
Contributor

sdtblck commented Nov 9, 2021

I'm not sure of the motivation for the change to fused_kernels.py. Also, I really don't like the requirements being this granular. I don't see the need for separate requirements files for wandb / tensorboard.

If someone doesn't have WandB available and doesn't wish to use it, how would you prefer they proceed?

The reason sparse attention and onebitadam were separated out is that they're optional dependencies which are also a bit of a pain to install (cupy-cuda requires you to specify the CUDA version, and triton used to break often), so removing them from requirements.txt reduced complexity for most users.

As far as I'm aware, there are no such problems with the installation of wandb. You can just pip install it. Including it in requirements.txt does nothing more than take up a few kB more space on the user's device.

Including it in a separate file may mean someone has to take a few minutes to figure out why their logging isn't working, only to go back and realise that wandb isn't actually in requirements.txt and needs to be installed separately. I can't see a counter-scenario where it would actually save time or decrease complexity.

I don't know why / how requirements/requirements-tensorboard.txt ever became a separate file. @sweinbach any ideas?

@EricHallahan
Contributor Author

The only reason I separated it out is that it mirrors how TensorBoard was handled. If we don't think that makes sense I'm happy to change it.

A counterargument to instructing users to "just install wandb/tensorboard" is versioning, which is exactly the problem that requirements files are designed to solve. If you tell the user to install from the requirements file, you can at least narrow down the environment to what is in the file rather than having to guess which version the package manager chose.
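
For example, a hypothetical ./requirements/requirements-wandb.txt could consist of a single loosely pinned line (the version bound shown is illustrative):

    wandb>=0.10.0

A user who wants logging would then run pip install -r requirements/requirements-wandb.txt and get a known-compatible version range rather than whatever the resolver happens to pick.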

@sweinbach
Contributor

Maybe too much for this PR, but related: any reason not to use pip-compile to create the requirements files and actually pin dependencies?

@EricHallahan
Contributor Author

Maybe too much for this PR, but related: any reason not to use pip-compile to create the requirements files and actually pin dependencies?

I had originally planned this PR with a larger scope which included more granular dependency management (such as only installing the dependencies for evaluation if the user specifies they are interested in evaluation) and a setuptools script to manage that system (which would have also registered deepy.py as a console script for convenience). However, it became a point of contention whether this was worth the potential confusion and increased complexity such a system could create. I ultimately ended up stashing that line of work so that we could integrate the important changes, but if a more advanced dependency management workflow is desired it would not be hard to continue it.
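
A rough sketch of what that stashed approach could look like (the package name, extras, and entry point below are hypothetical, not the stashed code itself):

    # Hypothetical setup.py sketch for granular optional dependencies.
    from setuptools import setup, find_packages

    setup(
        name="gpt-neox",
        packages=find_packages(),
        install_requires=[
            # Core, loosely pinned dependencies would go here.
        ],
        extras_require={
            # Optional feature sets, installed as e.g. `pip install .[wandb]`.
            "wandb": ["wandb"],
            "tensorboard": ["tensorboard"],
            "eval": [],  # evaluation-only dependencies would go here
        },
        entry_points={
            # Register the DeepSpeed launcher as a `deepy` console command
            # (assumes a main() entry point in deepy.py).
            "console_scripts": ["deepy = deepy:main"],
        },
    )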

@StellaAthena
Member

Closing as it’s unfixably far behind and better done from scratch
