Add Blockwise Op #757

Conversation
```python
def transform(var: "TensorVariable", client_node: Optional[Apply]) -> Variable:
    """Walk a graph and expand single gradient "block"s into their block-wise equivalents."""
```
Hi @brandonwillard,
Can you explain what the `transform` function is and how it is used in computing the `L_op`?
Just like its `Elemwise` counterpart, `transform` is supposed to use a "template" gradient graph for each input to construct broadcasted gradient graphs in which all the relevant `Op`s are `Elemwise`/`Blockwise` `Op`s applied to the original inputs.
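As a rough analogy (numpy, not this PR's API), numpy's generalized ufuncs express the same `Elemwise`-vs-`Blockwise` distinction: a gufunc signature names the core dimensions, and everything to their left is mapped over block-wise.

```python
import numpy as np

# Rough numpy analogy, not this PR's API: Elemwise maps a scalar
# function over every element, while Blockwise maps a "core" function
# (here a 2D dot) over the leading "batch" dimensions of its inputs.
batched_dot = np.vectorize(np.dot, signature="(m,n),(n,p)->(m,p)")

x = np.random.rand(4, 2, 3)  # four (2, 3) blocks
y = np.random.rand(4, 3, 5)  # four (3, 5) blocks

out = batched_dot(x, y)
assert out.shape == (4, 2, 5)
assert np.allclose(out, np.matmul(x, y))  # matmul broadcasts the same way
```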
Let's take a look at what's happening in `Blockwise.L_op` in the first `test_Blockwise_grad` test.
First, the graph for which we want the L-op/gradient:
```python
aesara.dprint(outputs)
# Blockwise{op=<tests.tensor.test_blockwise.DotBW object at 0x7f5e8236fd90>, signature=((('m', 'n'), ('n', 'p')), (('m', 'p'),))} [id A] <TensorType(float64, (None, None, None))>
#  |input 0 [id B] <TensorType(float64, (None, None, None))>
#  |input 1 [id C] <TensorType(float64, (None, None, None))>
```
It's a `Blockwise` dot product node with two 3D inputs named `input 0` and `input 1`.
A "template" graph of the gradient is produced for each input and stored in core_inp_grads
. Each element of core_inp_grads
corresponds to the generic form of a single-block's gradient wrt. each input.
```python
aesara.dprint(core_inp_grads, print_type=True)
# dot [id A]
#  |<TensorType(float64, (None, None))> [id B]
#  |InplaceDimShuffle{1,0} [id C]
#  | |<TensorType(float64, (None, None))> [id D]
# dot [id E]
#  |InplaceDimShuffle{1,0} [id F]
#  | |<TensorType(float64, (None, None))> [id G]
#  |<TensorType(float64, (None, None))> [id B]
```
We can see that the gradient of a `dot` in a single block is just another `dot`, and that the original inputs aren't present; instead, some stand-in variables are used, and they're 2D (i.e. `TensorType`s with `(None, None)` static shapes).
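To make the templates concrete: for a single block `z = x @ y` with output gradient `g` (presumably the shared stand-in `[id B]` above), the two graphs compute `dx = g @ y.T` and `dy = x.T @ g`. A quick numpy spot check, assuming a `sum()` loss:

```python
import numpy as np

# Single-block check of the templates above: for z = x @ y with output
# gradient g (the shared stand-in [id B]), dx = g @ y.T and dy = x.T @ g.
rng = np.random.default_rng(0)
x = rng.random((2, 3))
y = rng.random((3, 5))
g = np.ones((2, 5))  # output gradient of loss = (x @ y).sum()

dx = g @ y.T   # first template: dot(g, DimShuffle{1,0}(y))
dy = x.T @ g   # second template: dot(DimShuffle{1,0}(x), g)

# finite-difference spot check of dx[0, 0]
eps = 1e-6
x_eps = x.copy()
x_eps[0, 0] += eps
fd = ((x_eps @ y).sum() - (x @ y).sum()) / eps
assert np.isclose(dx[0, 0], fd, atol=1e-4)
```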
In other words, we've used the core dimensions specified by the `Blockwise` and its `Op` to remove the broadcasted dimensions (i.e. the ones that determine each block) and produce the generic form of a single "block"'s L-op from an existing `Op.grad`/`Op.L_op` implementation.
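For illustration, here's how the signature splits an input's dimensions into broadcasted (block) and core ones; the helper is hypothetical, not part of this PR:

```python
# Hypothetical helper, not this PR's code: the Blockwise signature
# ((('m', 'n'), ('n', 'p')), (('m', 'p'),)) assigns two core dimensions
# to each input; any leading dimensions beyond those are the broadcasted
# "block" dimensions that the core gradient templates leave out.
signature = ((("m", "n"), ("n", "p")), (("m", "p"),))
input_sigs, output_sigs = signature

def n_block_dims(ndim, core_sig):
    """Number of leading block (batch) dimensions for one input."""
    return ndim - len(core_sig)

# the test's 3D inputs each have 3 - 2 = 1 leading block dimension,
# which is why the templates operate on 2D (None, None) stand-ins
assert [n_block_dims(3, s) for s in input_sigs] == [1, 1]
```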
Now, we can't simply replace those stand-in inputs with `input 0` and/or `input 1`, because the `dot`s in the gradient graphs don't work block-wise and, as a result, can't accept the original inputs. Also, the `InplaceDimShuffle` applied to one of the inputs in each graph wouldn't work on an input with an extra third dimension.
The idea is that we need to convert the templates' `dot`s into `Blockwise(dot)`s and do something about the `InplaceDimShuffle`s. My guess is that the first input's gradient graph would end up looking like the following after applying `transform`:
```python
# Blockwise{op=<tests.tensor.test_blockwise.DotBW object at 0x7f5e8236fd90>, signature=((('m', 'n'), ('n', 'p')), (('m', 'p'),))} [id A]
#  |input 0 [id B]
#  |InplaceDimShuffle{1,0,2} [id C]
#  | |input 1 [id D]
```
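A toy, runnable sketch of that rewrite, with a made-up graph representation (nested tuples standing in for `Apply` nodes) and made-up names, just to show the shape of the walk:

```python
# Toy sketch only; the names and graph representation are mine, not the
# PR's API. Templates are nested tuples, leaves are stand-in names, and
# the walk lifts each core op into a block-wise counterpart while
# substituting the original (batched) inputs for the 2D stand-ins.
def transform_sketch(template, replacements):
    if isinstance(template, str):
        return replacements[template]            # swap in the original input
    op, *args = template
    args = tuple(transform_sketch(a, replacements) for a in args)
    if op == "dot":
        return ("Blockwise(dot)", *args)         # lift the core op
    if op == "dimshuffle(1,0)":
        # remap the core transpose over one leading block dimension
        return ("dimshuffle(0,2,1)", *args)
    raise NotImplementedError(f"no signature for {op}")

template = ("dot", "g", ("dimshuffle(1,0)", "y"))  # first template above
print(transform_sketch(template, {"g": "output grad", "y": "input 1"}))
# ('Blockwise(dot)', 'output grad', ('dimshuffle(0,2,1)', 'input 1'))
```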
The `DimShuffle`d dimensions will probably require a little bit of calculation involving `Blockwise.signature` (i.e. to transpose the correct, core dimensions), but most other `Op`s should be amenable to `Blockwise`, at least after we formalize and attach the relevant signature information to our `Op`s. `DimShuffle` is perhaps a special case in which we don't want to create a `Blockwise` `Op`, mostly because there's no point in literally applying a `DimShuffle` block-wise when a new, equivalent `DimShuffle` can be produced that accomplishes the same thing more succinctly.

Any `Op`s that can't be converted to a `Blockwise` form (e.g. because they don't provide signature information in some way or another) should result in a no-gradient error.
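One plausible way to do that signature-based order calculation, assuming the block dimensions lead and stay in place (the helper name is mine, not the PR's):

```python
# Sketch of the order remapping, assuming the block (batch) dimensions
# lead: keep them fixed and shift the core DimShuffle order right by the
# number of block dimensions.
def batched_dimshuffle_order(core_order, n_block_dims):
    return tuple(range(n_block_dims)) + tuple(d + n_block_dims for d in core_order)

# the templates' core transpose (1, 0) on 2D blocks, applied to the 3D
# inputs with one leading block dimension
assert batched_dimshuffle_order((1, 0), 1) == (0, 2, 1)
```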
Closing in favor of #1215.
This PR implements #695.
It's currently just an outline.