PyTorch is a Python-first deep learning framework open-sourced by the Torch7 team, providing, among other things, powerful GPU-accelerated Tensor computation (similar to numpy).
Here comes the next major release of PyTorch, just in time for ICML. Install it today from our website http://pytorch.org
Package documentation for this release is available at http://pytorch.org/docs/0.2.0/
We're introducing long-awaited features such as Broadcasting, Advanced Indexing, Higher-order gradients and finally: Distributed PyTorch.
Due to introducing Broadcasting, the code behavior for certain broadcastable situations is different from behavior in 0.1.12. This might lead to silent bugs in your existing code. We've provided easy ways of identifying this ambiguous code in the Important Breakages and Workarounds section.
Table of contents:
- Tensor Broadcasting (numpy-style)
- Advanced Indexing for Tensors and Variables
- Higher-order gradients
- Distributed PyTorch (multi-node training, etc.)
- Neural Network layers and features: SpatialTransformers, WeightNorm, EmbeddingBag, etc.
- New in torch and autograd: matmul, inverse, etc.
- Easier debugging, better error messages
- Bug Fixes
- Important Breakages and Workarounds
Tensor Broadcasting (numpy-style)
In short, if a PyTorch operation supports broadcasting, then its Tensor arguments can be automatically expanded to be of equal sizes (without making copies of the data).
PyTorch Broadcasting semantics closely follow numpy-style broadcasting; if you are familiar with numpy broadcasting, things should just work as expected.
General Semantics
Two tensors are “broadcastable” if the following rules hold:
- Each tensor has at least one dimension.
- When iterating over the dimension sizes, starting at the trailing dimension, the dimension sizes must either be equal, one of them must be 1, or one of them must not exist.
For Example:
>>> x=torch.FloatTensor(5,7,3)
>>> y=torch.FloatTensor(5,7,3)
# same shapes are always broadcastable (i.e. the above rules always hold)
# can line up trailing dimensions
>>> x=torch.FloatTensor(5,3,4,1)
>>> y=torch.FloatTensor( 3,1,1)
# x and y are broadcastable.
# 1st trailing dimension: both have size 1
# 2nd trailing dimension: y has size 1
# 3rd trailing dimension: x size == y size
# 4th trailing dimension: y dimension doesn't exist
# but:
>>> x=torch.FloatTensor(5,2,4,1)
>>> y=torch.FloatTensor( 3,1,1)
# x and y are not broadcastable, because in the 3rd trailing dimension 2 != 3
If two tensors x, y are "broadcastable", the resulting tensor size is calculated as follows:
- If the number of dimensions of x and y are not equal, prepend 1 to the dimensions of the tensor with fewer dimensions to make them equal length.
- Then, for each dimension size, the resulting dimension size is the max of the sizes of x and y along that dimension.
For Example:
# can line up trailing dimensions to make reading easier
>>> x=torch.FloatTensor(5,1,4,1)
>>> y=torch.FloatTensor( 3,1,1)
>>> (x+y).size()
torch.Size([5, 3, 4, 1])
# error case
>>> x=torch.FloatTensor(5,2,4,1)
>>> y=torch.FloatTensor( 3,1,1)
>>> (x+y).size()
RuntimeError: The size of tensor a (2) must match the size of tensor b (3) at non-singleton dimension 1
More details can be found on the PyTorch documentation site. Also, each torch function lists its broadcasting semantics in the documentation.
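The two rules above can be checked with a small framework-free sketch. This is pure Python and illustrative only; real broadcasting happens inside each torch operation:

```python
def broadcast_shape(a, b):
    """Compute the broadcast result shape of two shapes, or raise ValueError.

    Follows the rules above: align trailing dimensions, treat missing
    dimensions as size 1, and require each aligned pair to be equal or
    contain a 1. The result size per dimension is the max of the pair.
    """
    result = []
    # iterate from the trailing dimension backwards
    for i in range(1, max(len(a), len(b)) + 1):
        da = a[-i] if i <= len(a) else 1  # missing dims act like size 1
        db = b[-i] if i <= len(b) else 1
        if da != db and da != 1 and db != 1:
            raise ValueError(f"shapes {a} and {b} are not broadcastable")
        result.append(max(da, db))
    return tuple(reversed(result))

print(broadcast_shape((5, 1, 4, 1), (3, 1, 1)))  # (5, 3, 4, 1)
```

This reproduces both examples above: `(5,1,4,1)` with `(3,1,1)` broadcasts to `(5,3,4,1)`, while `(5,2,4,1)` with `(3,1,1)` raises, matching the RuntimeError shown later.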
Advanced Indexing for Tensors and Variables
PyTorch now supports a subset of NumPy-style advanced indexing. This allows users to select arbitrary indices at each dimension of the Tensor, including non-adjacent indices and duplicate indices, using the same []-style operation. This allows for a more flexible indexing strategy without needing calls to PyTorch's Index[Select, Add, ...] functions.
Let's look at some examples:
x = torch.Tensor(5, 5, 5)
Pure Integer Array Indexing - specify arbitrary indices at each dimension
x[[1, 2], [3, 2], [1, 0]]
--> yields a 2-element Tensor (x[1][3][1], x[2][2][0])
also supports broadcasting, duplicates
x[[2, 3, 2], [0], [1]]
--> yields a 3-element Tensor (x[2][0][1], x[3][0][1], x[2][0][1])
arbitrary indexer shapes allowed
x[[[1, 0], [0, 1]], [0], [1]].shape
--> yields a 2x2 Tensor [[x[1][0][1], x[0][0][1]],
[x[0][0][1], x[1][0][1]]]
can use colon, ellipsis
x[[0, 3], :, :]
x[[0, 3], ...]
--> both yield a 2x5x5 Tensor [x[0], x[3]]
also use Tensors to index!
y = torch.LongTensor([0, 2, 4])
x[y, :, :]
--> yields a 3x5x5 Tensor [x[0], x[2], x[4]]
selection with fewer than ndim indexers; note the use of the comma
x[[1, 3], ]
--> yields a 2x5x5 Tensor [x[1], x[3]]
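Since these semantics follow NumPy's advanced indexing, the shapes claimed above can be verified directly against NumPy (a sketch; shapes only, since x is uninitialized in the torch examples):

```python
import numpy as np

x = np.arange(5 * 5 * 5).reshape(5, 5, 5)

# pure integer array indexing: picks (x[1,3,1], x[2,2,0])
assert x[[1, 2], [3, 2], [1, 0]].shape == (2,)

# broadcasting of indexers, duplicates allowed
assert x[[2, 3, 2], [0], [1]].shape == (3,)

# arbitrary indexer shapes
assert x[[[1, 0], [0, 1]], [0], [1]].shape == (2, 2)

# colon and ellipsis
assert x[[0, 3], :, :].shape == (2, 5, 5)
assert x[[0, 3], ...].shape == (2, 5, 5)

# fewer indexers than ndim
assert x[[1, 3],].shape == (2, 5, 5)
```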
Higher order gradients
Now you can evaluate higher order differentials in PyTorch. For example, you can compute Hessian-Vector products, penalize the norm of the gradients of your model, implement Unrolled GANs and Improved WGANs, etc.
In the 0.2 release, we've enabled the ability to compute higher order gradients for all of the torch.XXX functions and the most popular nn layers. The rest will be covered in the next release.
Here's a short example that penalizes the norm of the weight gradients of a Resnet-18 model, so that the volume of weights is slow-changing.
import torch
import torch.optim
from torchvision.models import resnet18
from torch.autograd import Variable

model = resnet18().cuda()
# any optimizer works here; SGD is used for illustration
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# dummy inputs for the example
input = Variable(torch.randn(2, 3, 224, 224).cuda(), requires_grad=True)
target = Variable(torch.zeros(2).long().cuda())

# as usual
output = model(input)
loss = torch.nn.functional.nll_loss(output, target)

# torch.autograd.grad does not accumulate the gradients into the .grad attributes.
# It instead returns the gradients as Variable tuples.
grad_params = torch.autograd.grad(loss, model.parameters(), create_graph=True)

# now compute the 2-norm of the grad_params
grad_norm = 0
for grad in grad_params:
    grad_norm += grad.pow(2).sum()
grad_norm = grad_norm.sqrt()

# take the gradients wrt grad_norm. backward() will accumulate
# the gradients into the .grad attributes
grad_norm.backward()

# do an optimization step
optimizer.step()
We see two new concepts here:
- torch.autograd.grad is a function that takes in [outputs, list of inputs (for which you want gradients)], and returns the gradients wrt. these inputs as a tuple, rather than accumulating the gradients into the .grad attributes. This is useful if you want to further operate on the gradients.
- You can operate on the gradients, and call backward() on them.
The list of nn layers that support higher order gradients is:
- AvgPool*d, BatchNorm*d, Conv*d, MaxPool1d, MaxPool2d, Linear, Bilinear
- pad, ConstantPad2d, ZeroPad2d, LPPool2d, PixelShuffle
- ReLU6, LeakyReLU, PReLU, Tanh, Tanhshrink, Threshold, Sigmoid, HardTanh, ELU, Softsign, SeLU
- L1Loss, NLLLoss, PoissonNLLLoss, LogSoftmax, Softmax2d
The rest will be enabled in the next release.
To enable higher order gradients, we've introduced a new style of writing autograd.Function (the current/old style of writing functions is fully backward compatible). You can read more about the new style of functions here.
Most of you don't write your own autograd.Functions; they are low-level primitives that introduce new operations to the autograd engine, where you specify the forward and backward calls.
Distributed PyTorch
We introduce the torch.distributed package that allows you to exchange Tensors among multiple machines. Using this package, you can scale your network training over multiple machines and larger mini-batches. For example, you are given the primitives to implement Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour.
The distributed package follows an MPI-style programming model. This means that there are functions provided to you such as send, recv, and all_reduce that will exchange Tensors among nodes (machines).
For the machines to first identify each other and assign unique numbers (ranks) to each other, we provide simple initialization methods:
- shared file system (requires that all processes can access a single file system)
- IP multicast (requires that all processes are in the same network)
- environment variable (requires you to manually assign ranks and know an address of a node reachable from all processes)
Our package documentation contains more details on initialization and available backends, but here's an example of initializing using a multicast address:
import torch.distributed as dist
dist.init_process_group(backend='tcp',
                        init_method='tcp://[ff15:1e18:5d4c:4cf0:d02d:b659:53ba:b0a7]:23456',
                        world_size=4)
print('Hello from process {} (out of {})!'.format(
    dist.get_rank(), dist.get_world_size()))
This would print Hello from process 2 (out of 4)
on the 3rd machine.
World size is the number of processes that will participate in the job. Each will be assigned a rank, which is a number between 0 and world_size - 1, unique within this job. It will serve as a process identifier and will be used instead of an address to, for example, specify to which process a tensor should be sent.
Here's a snippet that shows how simple point-to-point communication can be performed:
# All processes (receiving ones too!) need to have tensors of appropriate
# size preallocated.
x = torch.Tensor(10)
if dist.get_rank() == 0:
    x.normal_()
    # Send x to process with rank 1
    dist.send(x, dst=1)
else:  # rank == 1
    # Receive data from process with rank 0 and save result in x
    dist.recv(x, src=0)
Asynchronous p2p functions (isend, irecv) are available too.
However, some communication patterns appear so often that more efficient collective calls have been developed. They typically engage the whole process group and are much faster than naive algorithms using send/recv. One example is all_reduce:
x = torch.Tensor([dist.get_rank()])
# Add tensors from all processes such that they all receive the result.
# x is an input and output to this operation.
dist.all_reduce(x)
The distributed package is fairly low-level, so it allows you to implement more advanced algorithms and tailor the code to very specific purposes, but data-parallel training is such a common one that we have created high-level helpers for it.
Hence, we've introduced DistributedDataParallel, which is meant to be a nearly drop-in replacement for nn.DataParallel.
Here's a code snippet demonstrating changes necessary to add it to existing training code:
# Wrap model in DistributedDataParallel (CUDA only for the moment)
model = torch.nn.parallel.DistributedDataParallel(model.cuda())
# Use a DistributedSampler to restrict each process to a distinct subset
# of the dataset.
train_dataset = ...
train_sampler = torch.utils.data.distributed.DistributedSampler(train_dataset)
train_loader = torch.utils.data.DataLoader(
    train_dataset, batch_size=args.batch_size, num_workers=args.workers,
    pin_memory=True, sampler=train_sampler)
for epoch in range(args.num_epochs):
    # Use the .set_epoch() method to reshuffle the dataset partition at every epoch
    train_sampler.set_epoch(epoch)
    # training loop
    ...
You can see a fuller ImageNet training example here.
New nn layers: SpatialTransformers, WeightNorm, EmbeddingBag, etc.
New features
- forward_pre_hook is introduced to execute user-specified closures right before a forward function is called.
- Convenient access to non-leaf gradients: Currently, to access and inspect gradients of intermediate values, we have to use hooks. This is not convenient for doing simple inspections. Hence, we introduce retain_grad. It is best explained via an example:
input = Variable(torch.rand(1, 3), requires_grad=True)
h1 = input * 3
out = (h1 * h1).sum()
h1.retain_grad()
out.backward()
print(h1.grad)
# without calling retain_grad(), h1.grad is None
- DataParallel now supports dicts as inputs
New Layers
- Spatial Transformer Networks via F.grid_sample and F.affine_grid
- nn.SeLU and nn.AlphaDropout are introduced, from the paper: Self-Normalizing Neural Networks
- nn.GLU (Gated Linear Unit) is introduced, from the paper: Convolutional Sequence to Sequence Learning
- Weight Normalization is now implemented via torch.utils.weight_norm.
- You can now ignore specific target indices while computing cross_entropy_loss and nll_loss using the ignore_index argument. This is a cheap and useful way of implementing masking, where you can have a mask index that is ignored in computing the loss.
- F.normalize implements dimension-wise renormalization
- F.upsample and nn.Upsample consolidate multiple Upsampling layers into one function. It implements 2d and 3d bilinear/trilinear/nearest upsampling.
- nn.EmbeddingBag: When building bag-of-words models, doing an Embedding followed by Sum or Mean is common. For variable-length sequences, computing bags of embeddings involves masking. We provide a single nn.EmbeddingBag which is much more efficient and faster for computing bags of embeddings, especially for variable-length sequences.
- Numerically stable Binary Cross-Entropy loss via bce_with_logits
- A negative log-likelihood loss with Poisson distribution of the target, via PoissonNLLLoss
- cosine_similarity: Returns cosine similarity between x1 and x2, computed along dim.
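For reference, cosine similarity is just the dot product of the two vectors divided by the product of their norms. A plain-Python sketch of the math (not the torch API; the eps floor mirrors the usual guard against division by zero):

```python
import math

def cosine_similarity(x1, x2, eps=1e-8):
    # dot(x1, x2) / max(||x1|| * ||x2||, eps)
    dot = sum(a * b for a, b in zip(x1, x2))
    norm1 = math.sqrt(sum(a * a for a in x1))
    norm2 = math.sqrt(sum(b * b for b in x2))
    return dot / max(norm1 * norm2, eps)

print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0 (orthogonal)
print(cosine_similarity([1.0, 1.0], [2.0, 2.0]))  # 1.0 (same direction)
```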
Training utilities
Learning Rate Schedulers: torch.optim.lr_scheduler provides several dumb and smart methods to adjust the current learning rate. They are quite convenient while experimenting, giving a proxy for what you as the user would likely want to do.
There are various strategies provided, which can be used in the appropriate situation; more can be read in the package docs:
- ReduceLROnPlateau, LambdaLR, StepLR, MultiStepLR, ExponentialLR
- ConcatDataset is a convenient dataset meta-class that can merge and concatenate two individual datasets.
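As an illustration of what the step-based schedulers above compute, StepLR-style decay multiplies the base rate by gamma every step_size epochs. A pure-Python sketch, illustrative only (parameter names chosen to mirror the scheduler's documented arguments):

```python
def step_lr(base_lr, epoch, step_size, gamma=0.1):
    # StepLR-style decay: lr = base_lr * gamma ** (epoch // step_size)
    return base_lr * gamma ** (epoch // step_size)

for epoch in [0, 29, 30, 60]:
    print(epoch, step_lr(0.1, epoch, step_size=30))
```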
New in torch and autograd
- All reduce functions such as sum and mean now default to squeezing the reduced dimension. For example, torch.sum(torch.randn(10, 20)) returns a 1D Tensor.
- x.shape, similar to numpy. A convenience property that is equivalent to x.size()
- torch.matmul, similar to np.matmul
- bitwise and, or, xor, lshift, rshift
- autograd support for inverse, gesv, cumprod, atan2
- unbiased var and std now available via a keyword argument option
- torch.scatter_add: like torch.scatter, except when duplicate indices are encountered, the values are summed.
- torch.median behaves similarly to torch.sum when no arguments are given, i.e. it reduces all the dimensions and returns a single median value of the flattened Tensor.
- masked_copy_ has been renamed to masked_scatter_ (with deprecation on masked_copy_)
- torch.manual_seed now seeds all CUDA devices as well
- You can now specify the random number generator object via keyword arguments: torch.rand(1000, generator=gen)
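The scatter_add semantics mentioned above can be sketched in plain Python for the 1-D case (illustrative only, not the torch signature):

```python
def scatter_add_1d(out, index, src):
    # out[index[i]] += src[i]; duplicate indices accumulate into the same slot
    for i, idx in enumerate(index):
        out[idx] += src[i]
    return out

print(scatter_add_1d([0, 0, 0], [0, 1, 0], [1, 2, 3]))  # [4, 2, 0]
```

Index 0 appears twice, so its contributions (1 and 3) are summed rather than overwritten, which is exactly how scatter_add differs from scatter.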
Bug-fixes and small improvements
- Now we emit an error when a Variable is converted to a bool. For example:
b = Variable(torch.zeros(1))
if b[0]: # errors now
- Fix correctness bugs in qr decomposition on CUDA.
- Support for IBM PowerPC64 platform
- Check that the CuDNN version at compile-time is the same version at run-time.
- Improve error message in CUDA forked subprocess
- Faster transposed-copy on CPU
- Improve error messages in InstanceNorm
- Add more argument checking for various routines, especially BatchNorm and Convolution routines.
- Better error messages around shape reporting across the CPU backend.
- Support more than 8 GPUs per machine (work-around a CUDA p2p restriction)
- Improve error message when accessing attributes that don't exist
- t() of Variable consistent with Tensor
- prevent divide-by-zero when dropout p=1
- fix sharing of CUDA tensors on non-current devices
- when BN epsilon < allowed CuDNN value, fallback to THNN
- Fix thread-trashing when using different number of threads for MKL and OMP
- improve memory usage when using CuDNN RNN
- Fix ZeroPad2d backwards with negative padding
- add dummy tensor.data property, to provide interpretable error message to users
- Fix in-place division for Python3
- Raise error when call from_numpy on 0-dim array
- Empty Tensors don't error out when shared across multiprocessing
- fix baddbmm for expanded tensors
- Let parallel_apply accept arbitrary inputs
- keyword arguments in Tensor and Variable are now consistent
- fix torch.inverse when Magma is not available
- Add logical not operator for ByteTensor
- add device asserts in scatter/gather kernels
Important Breakages and Workarounds
As you've read, we've introduced two important changes that are not backward compatible:
- Numpy-style Broadcasting
- Reduction functions such as sum(1) now default to keepdim=False
We provide different levels of Python warnings that you can enable to alert you if you are using deprecated behavior or if the behavior of your code has changed.
tl;dr
Here is a code snippet that you can add to the top of your scripts.
Adding this code will generate warnings highlighting incompatible code.
Fix your code to no longer generate warnings.
# insert this to the top of your scripts (usually main.py)
import sys, warnings, traceback, torch
def warn_with_traceback(message, category, filename, lineno, file=None, line=None):
    sys.stderr.write(warnings.formatwarning(message, category, filename, lineno, line))
    traceback.print_stack(sys._getframe(2))

warnings.showwarning = warn_with_traceback
warnings.simplefilter('always', UserWarning)

torch.utils.backcompat.broadcast_warning.enabled = True
torch.utils.backcompat.keepdim_warning.enabled = True
Once all warnings disappear, you can remove the code snippet.
More elaborately
Now, let us see the three incompatible changes with examples.
Using the (now deprecated) 1-dimensional view for pointwise functions
Prior versions of PyTorch allowed certain pointwise functions to execute on tensors with different shapes, as long as the number of elements in each tensor was equal. The pointwise operation would then be carried out by viewing each tensor as 1-dimensional. PyTorch now supports broadcasting. The “1-dimensional” pointwise behavior is considered deprecated and will generate a Python warning in cases where tensors are not broadcastable, but have the same number of elements.
For example:
>>> torch.add(torch.ones(4), torch.ones(2,2))
__main__:1: UserWarning: self and other not broadcastable, but have the same
number of elements. Falling back to deprecated pointwise behavior.
2
2
2
2
[torch.FloatTensor of size 4]
Broadcasting in code where it didn't happen before
The introduction of broadcasting can cause backwards incompatible changes in the case where two tensors do not have the same shape,
but are broadcastable and have the same number of elements.
For example:
>>> torch.add(torch.ones(4,1), torch.randn(4))
would previously produce a Tensor with size torch.Size([4,1]), but now produces a Tensor with size torch.Size([4,4]).
In order to help identify cases in your code where backwards incompatibilities introduced by broadcasting may exist, you may set torch.utils.backcompat.broadcast_warning.enabled to True, which will generate a python warning in such cases.
For Example:
>>> torch.utils.backcompat.broadcast_warning.enabled=True
>>> torch.add(torch.ones(4,1), torch.ones(4))
__main__:1: UserWarning: self and other do not have the same shape, but are broadcastable, and have the same number of elements.
Note that this setting can trigger warnings for valid uses of broadcasting (including in library code), so you probably want to turn this warning off after migrating your code.
KeepDim=False for Reduction Functions
To get a warning when using a dimensional reduction function with the default keepdim argument, set torch.utils.backcompat.keepdim_warning.enabled to True. For example:
>>> torch.sum(torch.ones(2,3), 1)
__main__:1: UserWarning: backwards compatibility: call to "sum" uses default value for keepdim which has changed default to False. Consider passing as kwarg.
3
3
[torch.FloatTensor of size 2]
As with torch.utils.backcompat.broadcast_warning.enabled
, this warning can trigger from valid code, so you most likely want to disable this warning after migrating your code.
Note also that using keepdim=False
can cause your existing code to "just work" with broadcasting. For example:
# behavior with (old) keepdim=True, causes accidental broadcast
>>> torch.add(torch.ones(4), torch.ones(4,4).sum(dim=1, keepdim=True))
5 5 5 5
5 5 5 5
5 5 5 5
5 5 5 5
[torch.FloatTensor of size 4x4]
# new behavior with keepdim=False is equivalent to non-broadcasted result
>>> torch.add(torch.ones(4), torch.ones(4,4).sum(dim=1, keepdim=False))
5
5
5
5
[torch.FloatTensor of size 4]
API Changes
- torch.range is deprecated in favor of torch.arange, which is consistent with numpy and python range.
- On sparse Tensors, contiguous is renamed to coalesce, and coalesce is now made out-of-place. (A reminder that the Sparse API is still experimental and evolving, so we don't provide backward-compatibility.)
New Features
New layers and functions
- torch.topk is now supported for all CUDA types, not just torch.cuda.FloatTensor.
- Added a three-way ranking loss: nn.TripletMarginLoss
- Added per-instance normalization layers: nn.InstanceNorm1d, nn.InstanceNorm2d, nn.InstanceNorm3d. Each channel is treated as an instance to normalize, and mean-subtraction and std-division is done. This is useful when dealing with larger images and smaller mini-batches where BatchNorm-like effects are desired.
- nn.ZeroPad2d and nn.ConstantPad2d are added.
- nn.Bilinear is added, which computes Y = X1 * W * X2 + b
Negative dimension support for all functions
Every function that takes a dimension argument also allows negative dimensions.
A negative dimension will index the tensor from the last dimension.
For example:
x = torch.randn(10, 20, 30)
y = torch.mean(x, dim = -1)
Here, since x has 3 dimensions and dim = -1, the last dimension, i.e. dim=2, is picked for taking the mean.
The functions with dimension arguments are:
narrow, transpose, size, cat, chunk, gather, index_select, split, squeeze,
stack, unbind, unsqueeze, cumprod, cumsum, mean, median, mode, norm, prod, std,
sum, var, kthvalue, max, min, sort, topk, renorm,
index_add, index_copy, index_fill, scatter, select, unfold
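Negative dimensions are resolved by adding the tensor's number of dimensions, as in this sketch:

```python
def resolve_dim(dim, ndim):
    # map a possibly-negative dim to the 0-based dimension index
    return dim + ndim if dim < 0 else dim

print(resolve_dim(-1, 3))  # 2: the last of three dimensions
print(resolve_dim(1, 3))   # 1: non-negative dims are unchanged
```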
CUDA support for Sparse Tensors, faster CPU sparse
Now a part of the torch.sparse API is also supported for torch.cuda.sparse.*Tensor.
Functions that are supported on CUDA are:
sparse_mask, to_dense, coalesce, transpose, spaddmm
spcadd, mul, div, cadd, csub, cmul
nn.Embedding now supports sparse even on CUDA (with the sparse=True flag), leveraging these sparse functions.
A new hybrid matrix-multiply hspmm operation multiplies a sparse matrix with a dense matrix and returns a matrix in the form of a hybrid tensor (i.e. 1 sparse dimension, 1 dense dimension).
Several of the CPU sparse functions have more efficient implementations.
In a quickly hacked up Embedding classifier training script by @martinraison we see CUDA sparse performing as well as CUDA dense:
https://gist.github.com/martinraison/1e7c18c6f6eda87f1cb4995b0e6a22a5
Times are in seconds per batch:
| | CPU | CUDA |
|---|---|---|
| Dense | 10 | 0.86 |
| Sparse | 0.15 | 0.13 |
named_parameters to filter out specific parameter types
Let's say that you want to add weight decay to all parameters of your model except for the biases. How do you get only the biases of your model?
We introduce nn.Module.named_parameters for this.
It joins named_children and named_modules in helping you filter specific attributes of models.
Example of filtering out the biases of a model and giving them a weight_decay of 0:
import torch
import torch.nn as nn
import torch.optim as optim
m = nn.Sequential(
    nn.Linear(10, 20),
    nn.ReLU(),
    nn.Linear(20, 20),
    nn.ReLU(),
)

weights, biases = [], []
for name, p in m.named_parameters():
    if 'bias' in name:
        biases += [p]
    else:
        weights += [p]

optim.SGD([
    {'params': weights},
    {'params': biases, 'weight_decay': 0}
], lr=1e-2, momentum=0.9, weight_decay=1e-5)
Performance Improvements
- cumsum and cumprod have been made significantly faster on the GPU via using some thrust primitives where appropriate.
- LSTMCell and GRUCell are now significantly faster on the GPU via a fused kernel
- The default algorithm for CuDNN has been changed to PRECOMP_GEMM, which is a much faster algorithm that takes a tiny bit of workspace. Previously, it used to be IMPLICIT_GEMM, which took zero workspace but was significantly slower.
- 5% to 10% improvement in data loader by collating batches directly into shared memory.
- SVD is now computed on the GPU via divide-and-conquer (sgesdd) which gives a 2x to 5x speedup.
- The commonly used function expand has been moved to C, to have better performance in smaller models.
Bug Fixes
- Added contiguous checks on weight and bias for a large range of THNN functions
- make the range of random_ correct when both lower and upper bound are specified
- parallel_apply now can take arguments that are unhashable
- Reshape grad correctly in the Dot function (inputs don't have to be 1D vectors...)
- Added Variable.type_as
- Unify argument names of norm and renorm to have p=norm_type, dim=dim
- btrisolve works on CPU doubles
- ipython autocomplete for torch.nn.Module fixed via implementing __dir__
- device_ids can now be None again in F.data_parallel and will use all available GPUs
- workaround cudnn bugs in BatchNorm (<5.1.10) and Dilation (6.0.20)
- Padding bugfix in Conv1d CPU
- remainder and cremainder are fixed for integer types
- fix memory leak in btrisolve and getri
- If nn.Module's source can't be retrieved because of any exception, handle serialization to be non-fatal
- collate_fn now retains the type of the numpy array
- is_tensor and is_storage are now fixed for old-style Python classes
- torch.cat now supports keyword arguments
- CUDA collectives supported coalescing, but the inputs were all assumed to be of the same Tensor type. This is fixed.
- Fix a deadlock bug in autograd because of an underlying glibc bug in specific linux distros (ArchLinux in particular)
- abs is now fixed for char and short cuda types
- fix torch.diag autograd when giving a dimension argument
- fix grouped convolution on CPU when bias=False
- expose dilated convolutions for ConvTranspose*d
- Fix a bug in HingeEmbeddingLoss where margin can now be specified via kwargs
Improved error messages
- Fix errors and messages when no CUDA devices are available.
Performance improvements, new layers, ship models to other frameworks (via ONNX), CUDA9, CuDNNv7, lots of bug fixes
soumith released this on Dec 5, 2017
Table of contents:
- Breaking changes: removed reinforce()
- New features
- API changes
- Performance Improvements
- Framework Interoperability
- Usability Improvements
- Bug fixes
Breaking changes
Stochastic functions, i.e. Variable.reinforce(), were removed because of their limited functionality and broad performance implications. The motivation for stochastic functions was to avoid book-keeping of sampled values. In practice, users were still book-keeping in their code for various reasons. We constructed an alternative, equally effective API, but did not have a reasonable deprecation path to the new API. Hence this removal is a breaking change.
We introduce the torch.distributions package to replace Stochastic functions.
Your previous code typically looked like this:
This is the new equivalent code:
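The pattern the new package enables: sample an action from a distribution object, then build a surrogate loss from -log_prob(action) * reward and call backward() on it. Below is a pure-Python mock of a categorical distribution, for illustration only; the real class is torch.distributions.Categorical, and the probabilities and reward here are hypothetical:

```python
import math
import random

class MockCategorical:
    """Illustrative stand-in for torch.distributions.Categorical."""
    def __init__(self, probs):
        self.probs = probs

    def sample(self):
        # inverse-CDF sampling over the category probabilities
        r, cumulative = random.random(), 0.0
        for action, p in enumerate(self.probs):
            cumulative += p
            if r < cumulative:
                return action
        return len(self.probs) - 1

    def log_prob(self, action):
        return math.log(self.probs[action])

m = MockCategorical([0.1, 0.6, 0.3])   # hypothetical policy output
action = m.sample()
reward = 1.0                           # hypothetical reward from the environment
loss = -m.log_prob(action) * reward    # REINFORCE surrogate loss
```

In PyTorch the same surrogate loss is a Variable, so calling backward() on it propagates the policy gradient without any manual book-keeping of sampled values.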
New features
Unreduced losses
Now, some loss functions can compute per-sample losses in a mini-batch:
- Pass reduce=False to return individual losses for each sample in the mini-batch: loss = nn.CrossEntropyLoss(..., reduce=False)
- Supported losses: MSELoss, NLLLoss, NLLLoss2d, KLDivLoss, CrossEntropyLoss, SmoothL1Loss, L1Loss
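In terms of plain arithmetic, reduce=False simply skips the final averaging step. A pure-Python illustration with an L1-style loss (not the torch API; values are made up):

```python
preds = [2.0, 0.0, 1.0]
targets = [1.0, 1.0, 1.0]

# reduce=False analogue: one loss value per sample
per_sample = [abs(p - t) for p, t in zip(preds, targets)]

# default behavior analogue: average over the mini-batch
mean_loss = sum(per_sample) / len(per_sample)

print(per_sample)  # [1.0, 1.0, 0.0]
print(mean_loss)
```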
An in-built Profiler in the autograd engine
We built a low-level profiler to help you identify bottlenecks in your models. The profiler works for both CPU and CUDA models.
For CUDA models, you have to run your python program with a special nvprof prefix. Then, you can load trace_name.prof in PyTorch and print a summary profile report. Read additional documentation here.
Higher order gradients
Added higher-order gradients support for the following layers
Optimizers
New layers and nn functionality
- Upsampling supports nearest and linear modes.
- grid_sample now supports padding_mode="border". grid_sample expects a grid in the range of [-1, 1], and if the values are out of these bounds, padding with the value 0.0 is applied by default. However, in a lot of cases, using the border value (i.e. the nearest valid value) helps improve accuracy of the overall model.
- Introducing nn.utils.parameters_to_vector and nn.utils.vector_to_parameters. parameters_to_vector takes net.parameters() and returns a 1D vector that contains all the parameters; vector_to_parameters takes a vector of flattened parameters and copies the values over to a network's parameters.
- You can now leave certain input dimensions of AdaptivePool*d unspecified and infer them at runtime.
New Tensor functions and features
- Introduced torch.erf and torch.erfinv that compute the error function and the inverse error function of each element in the Tensor.
- Adds Tensor.put_ and torch.take, similar to numpy.take and numpy.put. The output has the same shape as the indices. Differences from the numpy equivalents: numpy.take has an optional axis argument, which behaves like index_select; this axis argument is not yet present. numpy.put repeats the values if necessary to make them as long as indices; this behavior is not yet replicated.
- Adds zeros and zeros_like for sparse Tensors.
- int(torch.Tensor([5])) works now.
Other additions
- torch.cuda.get_device_name and torch.cuda.get_device_capability do what the names say.
- If you set torch.backends.cudnn.deterministic = True, then the CuDNN convolutions use deterministic algorithms.
- torch.cuda_get_rng_state_all and torch.cuda_set_rng_state_all are introduced to let you save / load the state of the random number generator over all GPUs at once.
- torch.cuda.emptyCache() frees the cached memory blocks in PyTorch's caching allocator. This is useful when having long-running ipython notebooks while sharing the GPU with other processes.
API changes
- softmax and log_softmax now take a dim argument that specifies the dimension in which slices are taken for the softmax operation. dim allows negative dimensions as well (dim = -1 will be the last dimension).
- torch.potrf (Cholesky decomposition) is now differentiable and defined on Variable.
- Remove all instances of device_id and replace it with device, to make things consistent.
- torch.autograd.grad now allows you to specify inputs that are unused in the autograd graph if you use allow_unused=True. This gets useful when using torch.autograd.grad in large graphs with lists of inputs / outputs.
- pad_packed_sequence now allows a padding_value argument that can be used instead of zero-padding.
- Dataset now has a + operator (which uses ConcatDataset). You can do something like MNIST(...) + FashionMNIST(...), for example, and you will get a concatenated dataset containing samples from both.
- torch.distributed.recv allows Tensors to be received from any sender (hence, src is optional). recv returns the rank of the sender.
- Adds zero_() to Variable.
- Variable.shape returns the size of the Tensor (now made consistent with Tensor).
- torch.version.cuda specifies the CUDA version that PyTorch was compiled with.
- Add a missing function random_ for CUDA.
- torch.load and torch.save can now take a pathlib.Path object, which is a standard Python3 typed filepath object.
- When you load a state_dict into another model (for example to fine-tune a pre-trained network), load_state_dict was strict on matching the key names of the parameters. Now we provide a strict=False option to load_state_dict where it only loads in parameters where the keys match, and ignores the other parameter keys.
- Added nn.functional.embedding_bag that is equivalent to nn.EmbeddingBag.
Performance Improvements
- The overhead of torch functions on Variables was around 10 microseconds. This has been brought down to ~1.5 microseconds by moving most of the core autograd formulas into C++ using our ATen library. This speeds up models that are very small, such as small LSTMs and other common models seen in NLP.
- For a sparse embedding of size 100k x 128 and a batch size of 1024, it is 33x faster.
- Faster convolutions when groups == nInputPlane (depthwise convolution). Speedups range from 5x to 1000x for tested layer sizes. See the benchmark table for more details as well as this table.
- Reduced optim.SGD's memory usage for sparse gradients (for ex. nn.Embedding(..., sparse=True)), reducing the usage on a user-provided test script by 10x.
- torch.nn.utils.weight_norm over the right-most dimensions is faster.
- torch.norm is sped up by ~1.5x.
- pack_padded_sequence is faster.
- Added a single-argument version of torch.arange. For example: torch.arange(10)
Framework Interoperability
DLPack Interoperability
DLPack Tensors are cross-framework Tensor formats. We now have `torch.utils.to_dlpack(x)` and `torch.utils.from_dlpack(x)` to convert between DLPack and torch Tensor formats. The conversion has zero memory copy and hence is very efficient.

Model exporter to ONNX

ONNX is a common model interchange format that can be executed in Caffe2, CoreML, CNTK, MXNet, and Tensorflow at the moment. PyTorch models that are ConvNet-like and RNN-like (static graphs) can now be shipped to the ONNX format.

There is a new module torch.onnx (http://pytorch.org/docs/0.3.0/onnx.html) which provides the API for exporting ONNX models.
The operations supported in this release are:
Usability Improvements
Breaking changes
- A `warning` is printed to the user.
- `load_state_dict`
Bug fixes
torch

- `torch.manual_seed` no longer eagerly initializes CUDA (instead, the calls are queued and run when CUDA is initialized).

Tensor

- Advanced indexing: if `x` is 2D, `x[[0, 3],]` was needed to trigger advanced indexing. The trailing comma is no longer needed, and you can do `x[[0, 3]]`.
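A quick illustration of the new indexing behavior:

```python
import torch

x = torch.arange(10).view(5, 2)
# select rows 0 and 3; previously the trailing comma (x[[0, 3],])
# was required to trigger advanced indexing
rows = x[[0, 3]]
```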
- `x.sort(descending=True)` used to incorrectly fail for Tensors. Fixed a bug in the argument checking logic to allow this.
- Tensor constructors given NumPy arrays of a mismatched type, for example `torch.DoubleTensor(np.array([0,1,2], dtype=np.float32))` or `torch.cuda.FloatTensor(np.random.rand(10,2).astype(np.float32))`, will now work by making a copy.
- `ones_like` and `zeros_like` now create Tensors on the same device as the original Tensor.
- `torch.multinomial` on the CPU would reshape the input `prob_dist` in-place. Fixed this to make sure the `prob_dist` input's shape is unchanged after the call to `multinomial`.
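For instance, a CPU-only sketch of the mismatched-dtype constructor behavior:

```python
import numpy as np
import torch

a = np.array([0, 1, 2], dtype=np.float32)
t = torch.DoubleTensor(a)  # mismatched dtype: works by making a copy
a[0] = 7  # mutating the array does not affect the Tensor, since it was copied
```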
- `expand` and `expand_as` allow expanding an empty Tensor to another empty Tensor.
- When `[..., None, ...]` was given (i.e. newaxis placement in indexing was specified), PyTorch had different behavior from NumPy. This has been made consistent with NumPy in all cases.
- `numpy()` and `torch.from_numpy`
- `torch.scatter`
- Fixed `torch.tril` and `torch.triu` on the GPU for storage-offset Tensors (they would return an incorrect result).
- `torch.topk`
- Fixed `random_` on the CPU (which previously had a max value of 2^32) for DoubleTensor and LongTensor.
- Fixed `ZeroDivisionError: float division by zero` when printing certain Tensors.
- `torch.gels` when `m > n` had a truncation bug on the CPU and returned incorrect results. Fixed.
- `contiguous`
- `any` and `all` now work on empty Tensors on the CPU (previously errored out).
- Fixed `symeig` on CUDA for large matrices. The bug was that not enough space was being allocated for the workspace, causing some undefined behavior.
- Improved the numerical stability of `torch.var` and `torch.std` by using Welford's algorithm.
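A small demonstration of why Welford's algorithm matters here; the constants are arbitrary:

```python
import torch

# samples with a huge mean: the naive E[x^2] - E[x]^2 formula suffers
# catastrophic cancellation, while Welford's algorithm stays accurate
x = torch.full((10000,), 1e8, dtype=torch.float64)
x = x + torch.randn(10000, dtype=torch.float64)

v = torch.var(x)  # close to 1.0 despite the 1e8 offset
```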
- The Random Number Generator returned `uniform` samples with inconsistent bounds (an inconsistency in the CPU implementation, and running into a cublas bug). Now, all `uniform` sampled numbers will return within the bounds `[0, 1)`, across all types and devices.
- Fixed `torch.svd` to not segfault on large CUDA Tensors (fixed an overflow error in the magma bindings).
- Allow empty index Tensors for `index_select` (instead of erroring out).
- Previously, when `eigenvector=False`, `symeig` returned some unknown value for the eigenvectors. Now we zero them out.

sparse
- Fixed a bug with `.type()` not converting the indices tensor.

autograd
- Fixed `type()` around non-default GPU input.
- When `torch.norm` returned `0.0`, the gradient was `NaN`. We now use the subgradient at `0.0`, so the gradient is `0.0`.
- `torch.prod`'s backward was failing on the GPU due to a type error; fixed.

optim
- `torch.optim.lr_scheduler` is now imported by default.

nn
- If `register_buffer("foo", ...)` is called and `self.foo` already exists, then instead of silently failing, it now raises a `KeyError`.
- Fixed loading of older checkpoints of RNN/LSTM that were missing `_data_ptrs` attributes.
- `nn.Embedding` had a hard error when using the `max_norm` option. This is fixed now.
- When using the `max_norm` option, the passed-in indices were written upon (by the underlying implementation). To fix this, we now pass a clone of the indices to the renorm kernel.
- `F.affine_grid` can now take non-contiguous inputs.
- If BatchNorm has only 1 value per channel in total, raise an error in training mode.
- Previously, `-inf` was returned. Now this correctly returns `0.0`.
- Fixed `poisson_nll_loss` when `log_input=False` by adding a small epsilon.

distributed and multi-gpu
- Allow kwargs-only inputs to `DataParallel`. This used to fail: `n = nn.DataParallel(Net()); out = n(input=i)`
- Fixed parameters with `requires_grad=False` in `DistributedDataParallel`.
- Fixed `DistributedDataParallel` for models with no `buffers` (previously raised an incoherent error).
- Fixed `__get_state__` to be functional in `DistributedDataParallel` (was returning nothing).

Others
- `model.zoo.load_url` now first attempts to use the `requests` library if available, and then falls back to `urllib`.
- Fixed an error when `default_collate` is passed a collection of `numpy.str_`.