
Releases: Lightning-AI/pytorch-lightning

Lightning v2.2

07 Feb 22:06
bc56630

Lightning AI is excited to announce the release of Lightning 2.2 ⚡

Did you know? The Lightning philosophy extends beyond a boilerplate-free deep learning framework: We've been hard at work bringing you Lightning Studio. Code together, prototype, train, deploy, host AI web apps. All from your browser, with zero setup.

While our previous release was packed with many big new features, this time around we're rolling out mainly improvements based on feedback from the community. And of course, as the name implies, this release fully supports the latest PyTorch 2.2 🎉

Highlights

Monitoring Throughput

Lightning now has built-in utilities to measure throughput metrics such as batches/sec, samples/sec and Model FLOP Utilization (MFU) (#18848).

Trainer:

For the Trainer, this comes in the form of a ThroughputMonitor callback. To track samples/sec, you need to provide a function that tells the monitor how to extract the batch size from your input. If you also want to track MFU, you can provide a sample forward pass, and the ThroughputMonitor will automatically estimate the utilization based on the hardware you are running on:

import torch
import lightning as L
from lightning.pytorch.callbacks import ThroughputMonitor
from lightning.fabric.utilities.throughput import measure_flops


class MyModel(L.LightningModule):
    def setup(self, stage):
        with torch.device("meta"):
            model = MyModel()

        def sample_forward():
            batch = torch.randn(..., device="meta")
            return model(batch)

        self.flops_per_batch = measure_flops(model, sample_forward, loss_fn=torch.Tensor.sum)


throughput = ThroughputMonitor(
    batch_size_fn=lambda batch: batch.size(0),
    # optional, if your samples have a length (like number of tokens)
    length_fn=lambda batch: batch.size(1),
)
trainer = L.Trainer(log_every_n_steps=10, callbacks=throughput, logger=...)
model = MyModel()
trainer.fit(model)

The results get automatically sent to the logger if one is configured on the Trainer.
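For instance, with a CSVLogger attached, the throughput metrics are written alongside your other logged values (a minimal sketch; the log directory is arbitrary):

import lightning as L
from lightning.pytorch.callbacks import ThroughputMonitor
from lightning.pytorch.loggers import CSVLogger

# Any configured logger receives the throughput metrics
throughput = ThroughputMonitor(batch_size_fn=lambda batch: batch.size(0))
trainer = L.Trainer(
    log_every_n_steps=10,
    callbacks=[throughput],
    logger=CSVLogger("logs"),  # hypothetical output directory
)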

Fabric:

For Fabric, the ThroughputMonitor is a simple utility object on which you call .update() and .compute_and_log() during the training loop:

from time import time

import torch
import lightning as L
from lightning.fabric.utilities import ThroughputMonitor


fabric = L.Fabric(logger=...)
throughput = ThroughputMonitor(fabric)

t0 = time()
for batch_idx, batch in enumerate(train_dataloader):
    do_work()
    torch.cuda.synchronize()  # required or else time() won't be correct
    throughput.update(
        time=(time() - t0), 
        batches=batch_idx, 
        samples=(batch_idx * batch_size)
    )
    if batch_idx % 10 == 0:
        throughput.compute_and_log(step=batch_idx)

Check out our TinyLlama LLM pretraining script for a full example using Fabric's ThroughputMonitor.

The throughput utilities can report:

  • batches per second (per process and across processes)
  • samples per second (per process and across processes)
  • items per second (e.g. tokens) (per process and across processes)
  • flops per second (per process and across processes)
  • model flops utilization (MFU) (per process)
  • total time, total samples, total batches, and total items (per process)

Improved Handling of Evaluation Mode

When you train a model with validation enabled, the Trainer automatically calls .eval() when transitioning to the validation loop, and .train() when validation ends. Until now, this had the unfortunate side effect that any submodules in your LightningModule that were in evaluation mode were reset to train mode. In Lightning 2.2, the Trainer now captures the mode of every submodule before switching to validation, and restores those modes when validation ends (#18951). This improvement helps users avoid silent correctness bugs and removes boilerplate code for managing frozen layers.

import lightning as L


class LitModel(L.LightningModule):
    def __init__(self):
        super().__init__()
        self.trainable_module = ...
        
        # This will now stay in eval mode
        self.frozen_module = ...
        self.frozen_module.eval()
        
    def training_step(self, batch):
        # Previously, all modules were in train mode
        # Now: modules stay in the mode they were set up with
        assert self.trainable_module.training
        assert not self.frozen_module.training
        ...
        
    def validation_step(self, batch):
        # All modules are in eval mode
        ...
    
    
model = LitModel()
trainer = L.Trainer()
trainer.fit(model)

If you have overridden any of the LightningModule.on_{validation,test,predict}_model_{eval,train} hooks, they will still get called and execute your custom logic, but they are no longer required if you added them to preserve the eval mode of frozen modules.
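For reference, a typical pre-2.2 workaround looked roughly like this (a minimal sketch using the frozen_module attribute from the example above); such an override can now simply be deleted:

import lightning as L


class LitModel(L.LightningModule):
    ...

    # Pre-2.2: after validation the Trainer put all submodules back into
    # train mode, so this hook manually re-froze the frozen part.
    # With Lightning 2.2 this override is no longer needed.
    def on_validation_model_train(self):
        self.train()
        self.frozen_module.eval()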

Important

In some libraries, for example HuggingFace, models are created in evaluation mode by default (e.g. HFModel.from_pretrained(...)). Starting with 2.2, you will have to call .train() on these models if you intend to train them.
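For example, with a Hugging Face transformers model (a minimal sketch; the model name is arbitrary):

import lightning as L
from transformers import AutoModel


class LitFinetuner(L.LightningModule):
    def __init__(self):
        super().__init__()
        # `from_pretrained` returns the model in eval mode
        self.backbone = AutoModel.from_pretrained("bert-base-uncased")
        # Since 2.2 the Trainer no longer flips this back to train mode,
        # so switch explicitly if you intend to finetune the backbone
        self.backbone.train()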

Converting FSDP Checkpoints

In the previous release, we introduced distributed checkpointing with FSDP to speed up saving and loading checkpoints for big models. These checkpoints use a special format: a folder in which each GPU saves its shard as a separate file. While such checkpoints can easily be loaded back with the Lightning Trainer or Fabric, they aren't easy to load or process externally. In Lightning 2.2, we introduced a CLI utility that lets you consolidate the checkpoint folder into a single file that can be loaded with plain PyTorch, for example via torch.load() (#19213).

Once you have saved a distributed checkpoint, you can convert it like so:

# For Trainer checkpoints:
python -m lightning.pytorch.utilities.consolidate_checkpoint path/to/my/checkpoint


# For Fabric checkpoints:
python -m lightning.fabric.utilities.consolidate_checkpoint path/to/my/checkpoint
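The resulting single file can then be loaded with plain PyTorch (a sketch; the output path shown is hypothetical):

import torch

# Load the consolidated single-file checkpoint without Lightning
checkpoint = torch.load("path/to/my/checkpoint.consolidated", map_location="cpu")
print(checkpoint.keys())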

Read more about distributed checkpointing in our documentation: Trainer, Fabric.

Improvements to Compiling DDP/FSDP in Fabric

PyTorch 2.0+ introduced torch.compile, a powerful tool to speed up your models without changing the code.
We have now added a comprehensive guide on how to use torch.compile correctly, with tips and tricks to help you troubleshoot common issues. On top of that, Fabric.setup() will now reapply torch.compile on top of DDP/FSDP if you enable these strategies (#19280).

import torch
import lightning as L

# Select a distributed strategy (DDP, FSDP, ...)
fabric = L.Fabric(strategy="ddp", devices=8)

# Compile your model before `.setup()`
model = ...  # your LightningModule / nn.Module
model = torch.compile(model)

# Now automatically handles compiling also over DDP/FSDP
model = fabric.setup(model)

# You can opt-out if it is causing trouble
model = fabric.setup(model, _reapply_compile=False)

You might see fewer graph breaks, but there won't be any significant speed-ups with this. We introduced this mainly to make Fabric ready for future improvements in PyTorch for optimizing distributed operations.

Saving and Loading DataLoader State

If you use a dataloader/iterable that implements the .state_dict() and .load_state_dict() interface, the Trainer will now automatically save and load its state as part of the checkpoint (#19361).

import lightning as L


class MyDataLoa...
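
The snippet above is cut off in the release listing; a minimal sketch of such a stateful iterable (hypothetical class and attribute names) could look like this:

class MyStatefulLoader:
    """Hypothetical iterable exposing the state_dict interface the Trainer looks for."""

    def __init__(self, dataset):
        self.dataset = dataset
        self.index = 0

    def __iter__(self):
        while self.index < len(self.dataset):
            yield self.dataset[self.index]
            self.index += 1

    def state_dict(self):
        # Saved automatically as part of the Trainer checkpoint
        return {"index": self.index}

    def load_state_dict(self, state_dict):
        # Restored automatically when resuming from a checkpoint
        self.index = state_dict["index"]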

Lightning 2.2 Release Candidate

01 Feb 15:41
6296a4f
Pre-release

This is a preview release for Lightning 2.2.0.

Minor patch release v2.1.4

01 Feb 15:10

Fabric

Fixed

  • Fixed an issue preventing Fabric from running on CPU when the system's CUDA driver is outdated or broken (#19234)
  • Fixed typo in kwarg in SpikeDetection (#19282)

PyTorch

Fixed

  • Fixed Trainer not expanding the default_root_dir if it has the ~ (home) prefix (#19179)
  • Fixed warning for Dataloader if num_workers=1 and CPU count is 1 (#19224)
  • Fixed WandbLogger.watch() method annotation to accept None for the log parameter (#19237)
  • Fixed an issue preventing the Trainer from running on CPU when the system's CUDA driver is outdated or broken (#19234)
  • Fixed an issue with the ModelCheckpoint callback not saving relative symlinks with ModelCheckpoint(save_last="link") (#19303)
  • Fixed issue where the _restricted_classmethod_impl would incorrectly raise a TypeError on inspection rather than on call (#19332)
  • Fixed exporting __version__ in __init__ (#19221)

Full Changelog: 2.1.3...2.1.4

Contributors

@andyland @asingh9530 @awaelchli @Borda @daturkel @dipta007 @lauritsf @mjbommar @shenmishajing @tchaton

If we forgot someone due to not matching commit email with GitHub account, let us know :]

Minor patch release v2.1.3

21 Dec 15:14

App

Changed

  • Lightning App: Use the batch get endpoint (#19180)
  • Drop starsessions from App's requirements (#18470)
  • Optimize the loading time for chunks to become available (#19109)

Data

Added

  • Add fault tolerance to the StreamingDataset (#19052)
  • Add numpy support for the StreamingDataset (#19050)
  • Add fault tolerance for the StreamingDataset (#19049)
  • Add direct s3 support to the StreamingDataset (#19044)
  • Add disk usage check before downloading files (#19041)

Changed

  • Cleanup chunks right away if the dataset doesn't fit within the cache in StreamingDataset (#19168)
  • Improve the StreamingDataset deletion strategy (#19118)
  • Improve StreamingDataset Speed (#19114)
  • Remove time in the Data Processor progress bar (#19108)
  • Optimize the loading time for chunks to become available (#19109)
  • Resolve path for StreamingDataset (#19094)
  • Make input dir in DataProcessor required (#18910)
  • Remove the LightningDataset relying on un-maintained torchdata (#19019)

Fixed

  • Resolve checkpointing for the Streaming Dataset (#19123)
  • Resolve Item Loader bugs (#19017)

Fabric

Fixed

  • Avoid moving the model to device if move_to_device=False is passed (#19152)
  • Fixed broadcast at initialization in MPIEnvironment (#19074)

PyTorch

Changed

  • LightningCLI no longer allows setting a normal class instance as default. A lazy_instance can be used instead (#18822)

Fixed

  • Fixed checks for local file protocol due to fsspec changes in 2023.10.0 (#19023)
  • Fixed automatic detection of 'last.ckpt' files to respect the extension when filtering (#17072)
  • Fixed an issue where setting CHECKPOINT_JOIN_CHAR or CHECKPOINT_EQUALS_CHAR would only work on the ModelCheckpoint class but not on an instance (#19054)
  • Fixed ModelCheckpoint not expanding the dirpath if it has the ~ (home) prefix (#19058)
  • Fixed handling checkpoint dirpath suffix in NeptuneLogger (#18863)
  • Fixed an edge case where ModelCheckpoint would alternate between versioned and unversioned filename (#19064)
  • Fixed broadcast at initialization in MPIEnvironment (#19074)
  • Fixed the tensor conversion in self.log to respect the default dtype (#19046)

Full Changelog: 2.1.2...2.1.3

Contributors

@AleksanderWWW, @awaelchli, @Borda, @carmocca, @dependabot[bot], @mauvilsa, @MF-FOOM, @tchaton, @yassersouri

If we forgot someone due to not matching commit email with GitHub account, let us know :]

Minor patch release v2.1.2

15 Nov 20:56

App

Changed

  • Forced plugin server to use localhost (#18976)
  • Enabled bundling additional files into app source (#18980)
  • Limited rate of requests to http queue (#18981)

Fabric

Fixed

  • Fixed precision default from environment (#18928)

PyTorch

Fixed

  • Fixed an issue causing permission errors on Windows when attempting to create a symlink for the "last" checkpoint (#18942)
  • Fixed an issue where Metric instances from torchmetrics wouldn't get moved to the device when using FSDP (#18954)
  • Fixed an issue preventing the user from calling Trainer.save_checkpoint() on an FSDP model when Trainer.test/validate/predict() ran after Trainer.fit() (#18992)

Contributors

@awaelchli, @carmocca, @ethanwharris, @tchaton

If we forgot someone due to not matching commit email with GitHub account, let us know :]

Full Changelog: 2.1.1...2.1.2

Minor patch release v2.1.1

06 Nov 18:19

App

Fixed

  • Fix failing lightning cli entry point (#18821)

Fabric

Changed

  • Calling a method other than forward that invokes submodules is now an error when the model is wrapped (e.g., with DDP) (#18819)

Fixed

  • Fixed false-positive warnings about method calls on the Fabric-wrapped module (#18819)
  • Refined the FSDP saving logic and error messaging when the path exists (#18884)
  • Fixed layer conversion under Fabric.init_module() context manager when using the BitsandbytesPrecision plugin (#18914)

PyTorch

Fixed

  • Fixed an issue when replacing an existing last.ckpt file with a symlink (#18793)
  • Fixed an issue where the BatchSizeFinder steps_per_trial parameter ended up defining how many validation batches to run during the entire training (#18394)
  • Fixed an issue saving the last.ckpt file when using ModelCheckpoint on a remote filesystem, and no logger is used (#18867)
  • Refined the FSDP saving logic and error messaging when the path exists (#18884)
  • Fixed an issue parsing the version from folders that don't include a version number in TensorBoardLogger and CSVLogger (#18897)

Contributors

@awaelchli, @Borda, @BoringDonut, @carmocca, @hiaoxui, @ioangatop, @nohalon, @rasbt, @tchaton

If we forgot someone due to not matching commit email with GitHub account, let us know :]

Full Changelog: 2.1.0...2.1.1

Lightning 2.1: Train Bigger, Better, Faster

12 Oct 13:10
6f6c07d

Lightning AI is excited to announce the release of Lightning 2.1 ⚡ It's the culmination of work from 79 contributors who have worked on features, bug-fixes, and documentation for a total of more than 750 commits since v2.0.

The theme of 2.1 is "bigger, better, faster": bigger, because training large multi-billion-parameter models has gotten even more efficient thanks to FSDP, efficient initialization, and sharded checkpointing improvements; better, because it's easier than ever to scale models without making substantial code changes or installing third-party packages; and faster, because it leverages the latest hardware features to speed up training in low-bit precision thanks to new precision plugins like bitsandbytes and transformer engine.
And of course, as the name implies, this release fully leverages the latest features in PyTorch 2.1 🎉

Highlights

Improvements To Large-Scale Training With FSDP

The FSDP strategy for training large billion-parameter models gets substantial improvements and new features in Lightning 2.1, both in Trainer and Fabric (in case you didn't know, Fabric is the latest addition to the Lightning family of tools to scale models without the boilerplate code).
FSDP is now more user-friendly to configure, has memory management and speed improvements, and we have a brand new end-to-end user guide with best practices (Trainer, Fabric).

Efficient Saving and Loading of Large Checkpoints

When training large billion-parameter models with FSDP, saving and resuming training, or even just loading model parameters for finetuning, can be challenging, as users are often plagued by out-of-memory errors and speed bottlenecks.

In 2.1, we made several improvements. Starting with saving checkpoints, we added support for distributed/sharded checkpoints, enabled through the state_dict_type setting in the strategy (#18364, #18358):

Trainer:

import lightning as L
from lightning.pytorch.strategies import FSDPStrategy

# Default used by the strategy
strategy = FSDPStrategy(state_dict_type="full")

# Enable saving distributed checkpoints
strategy = FSDPStrategy(state_dict_type="sharded")

trainer = L.Trainer(strategy=strategy, ...)

Fabric:

import lightning as L
from lightning.fabric.strategies import FSDPStrategy

# Saving distributed checkpoints is the default
strategy = FSDPStrategy(state_dict_type="sharded")

# Save consolidated (single file) checkpoints
strategy = FSDPStrategy(state_dict_type="full")

fabric = L.Fabric(strategy=strategy, ...)

Distributed checkpoints are the fastest and most memory-efficient way to save the state of very large models.
The distributed checkpoint format also makes it efficient to load these checkpoints back for resuming training in parallel, and it significantly reduces the impact on CPU memory usage. We've also introduced lazy loading for non-distributed checkpoints (#18150, #18379), which greatly reduces the impact on CPU memory when loading a consolidated (single-file) checkpoint (e.g. for finetuning). Learn more about these features in our FSDP guides (Trainer, Fabric).
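
Putting it together, saving and resuming with distributed checkpoints in Fabric looks roughly like this (a minimal sketch with placeholder model and optimizer):

import lightning as L
from lightning.fabric.strategies import FSDPStrategy

fabric = L.Fabric(strategy=FSDPStrategy(state_dict_type="sharded"), devices=2)
fabric.launch()

model, optimizer = ..., ...  # set up via fabric.setup() / fabric.setup_optimizers()
state = {"model": model, "optimizer": optimizer, "step": 0}

# Saves a checkpoint *folder* with one shard file per process
fabric.save("checkpoints/step-000", state)

# Loads the shards back in-place on each process when resuming
fabric.load("checkpoints/step-000", state)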

Fast and Memory-Optimized Initialization

A major challenge that users face when working with large models such as LLMs is dealing with the extreme memory requirements. Even something as simple as instantiating a model becomes non-trivial if the model is so large it won't fit on a single GPU or even a single machine. In Lightning 2.1, we are introducing empty-weights initialization through the Fabric.init_module() (#17462, #17627) and Trainer.init_module()/LightningModule.configure_model() (#18004, #18385) methods:

Trainer:

import lightning as L

class MyModel(L.LightningModule):
    def __init__(self):
        super().__init__()
        # Delay initialization of model to `configure_model()`

    def configure_model(self):
        # Model initialized in correct precision and weights on meta-device
        self.model = ...

    ...

model = MyModel()
trainer = L.Trainer(strategy="fsdp", ...)
trainer.fit(model)

Fabric:

import lightning as L

fabric = L.Fabric(strategy="fsdp", ...)

# Model initialized in correct precision and weights on meta-device
with fabric.init_module(empty_init=True):
    model = ...
    

# You can also initialize buffers and tensors directly on device and dtype
with fabric.init_tensor():
    model.mask.create()
    model.kv_cache.create()
    x = torch.randn(4, 128)

# Materialization and sharding of model happens inside here
model = fabric.setup(model)

Read more about this new feature and its other benefits in our docs (Trainer, Fabric).

User-Friendly Configuration

We made it super easy to configure the sharding- and activation-checkpointing policy when you want to auto-wrap particular layers of your model for advanced control (#18045, #18084).

  import lightning as L
  from lightning.pytorch.strategies import FSDPStrategy
- from torch.distributed.fsdp.wrap import ModuleWrapPolicy

- strategy = FSDPStrategy(auto_wrap_policy=ModuleWrapPolicy({MyTransformerBlock}))
+ strategy = FSDPStrategy(auto_wrap_policy={MyTransformerBlock})
  trainer = L.Trainer(strategy=strategy, ...)

Furthermore, the sharding strategy can now be conveniently set with a string value (#18087):

  import lightning as L
  from lightning.pytorch.strategies import FSDPStrategy
- from torch.distributed.fsdp.fully_sharded_data_parallel import ShardingStrategy

- strategy = FSDPStrategy(sharding_strategy=ShardingStrategy.SHARD_GRAD_OP)
+ strategy = FSDPStrategy(sharding_strategy="SHARD_GRAD_OP")
  trainer = L.Trainer(strategy=strategy, ...)

You no longer need to remember the long PyTorch imports! Fabric also supports all of the improvements shown above.
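
For example, the equivalent configuration in Fabric might look like this (a minimal sketch; MyTransformerBlock refers to the same layer class as in the diffs above):

import lightning as L
from lightning.fabric.strategies import FSDPStrategy

strategy = FSDPStrategy(
    auto_wrap_policy={MyTransformerBlock},
    sharding_strategy="SHARD_GRAD_OP",
)
fabric = L.Fabric(strategy=strategy, devices=8)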

True Half-Precision

Lightning now supports true half-precision for training and inference with all built-in strategies (#18193, #18217, #18213, #18219). With this setting, the memory required to store the model weights is only half of what is normally needed when running with float32. In addition, you get the same speed benefits as mixed-precision training (precision="16-mixed"):

import lightning as L

# default
trainer = L.Trainer(precision="32-true")

# train with model weights in `torch.float16`
trainer = L.Trainer(precision="16-true")

# train with model weights in `torch.bfloat16`
# (if hardware supports it)
trainer = L.Trainer(precision="bf16-true")

The same settings are also available in Fabric! We recommend trying bfloat16 training (precision="bf16-true") as it is often more numerically stable than regular 16-bit precision (`precisi...


Feature teaser

10 Oct 08:15
4ad8bdb
Pre-release

🐰

Hotfix for Conda package

28 Sep 18:47
2.0.9.post0

releasing 2.0.9.post0

Weekly patch release

14 Sep 19:22

App

Fixed

  • Replace LightningClient with import from lightning_cloud (#18544)

Fabric

Fixed

  • Fixed an issue causing the _FabricOptimizer.state to remain outdated after loading with load_state_dict (#18488)

PyTorch

Fixed

  • Fixed an issue that wouldn't prevent the user from setting the log_model parameter in WandbLogger via the LightningCLI (#18458)
  • Fixed the display of v_num in the progress bar when running with Trainer(fast_dev_run=True) (#18491)
  • Fixed UnboundLocalError when running with python -O (#18496)
  • Fixed visual glitch with the TQDM progress bar leaving the validation bar incomplete before switching back to the training display (#18503)
  • Fixed false positive warning about logging interval when running with Trainer(fast_dev_run=True) (#18550)

Contributors

@awaelchli, @Borda, @justusschock, @SebastianGer

If we forgot someone due to not matching commit email with GitHub account, let us know :]