Confusion about computation in paper 'Simple Recurrent Units for Highly Parallelizable Recurrence' ? #157
Comments
Hi @liziru, thanks for posting the great questions.

(1) Difference between the two versions. In our first arXiv version, we didn't include the element-wise hidden-to-hidden connection (

(2) Parallelization. There are two different kinds of parallelization here. First, the matrix multiplications in (1)-(4) can be grouped into a single one and therefore parallelized across both time steps and hidden dimensions. Second, you are correct that the point-wise computation cannot be parallelized across time steps. But it can be parallelized across the hidden dimensions, since each dimension is independent. In other words, to compute c[t][i] you first need to compute c[t-1][i], but you can compute c[*][i] and c[*][j] in parallel for i != j. Note that the second kind of parallelization is not available in LSTM / GRU, since in these RNNs the hidden-to-hidden connections are fully connected.

Hope this helps.
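The two kinds of parallelization in (2) can be illustrated with a small sketch. This is not the library's implementation: the function names (`project`, `scan_one_dim`, `scan_all_dims`), the parameter names (`Wf`, `Wc`, `v`, `b`), and the exact gate form (a sigmoid over the input projection plus an element-wise term in the previous state) are assumptions made for illustration. It only checks that running each hidden dimension as its own independent job gives the same states as stepping the whole state vector through time.

```python
import math
import random
from concurrent.futures import ThreadPoolExecutor


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def project(X, W):
    """Step 1: input projections for every time step at once.

    X is T x d_in, W is d_out x d_in. No row of the result depends on
    another row, so this is one matrix product over all time steps.
    """
    return [[sum(w * x for w, x in zip(row, x_t)) for row in W] for x_t in X]


def scan_one_dim(uf, uc, v, b, c0=0.0):
    """Step 2: the recurrence for ONE hidden dimension i.

    uf[t] and uc[t] are the precomputed projections for this dimension.
    c[t][i] needs c[t-1][i], so the loop over t is sequential, but it
    never reads any other dimension j.
    """
    c, out = c0, []
    for uf_t, uc_t in zip(uf, uc):
        f = sigmoid(uf_t + v * c + b)   # assumed gate: element-wise in c
        c = f * c + (1.0 - f) * uc_t    # state update
        out.append(c)
    return out


def scan_all_dims(Uf, Uc, v, b):
    """Reference: step through time, updating the whole state vector."""
    n = len(v)
    c, out = [0.0] * n, []
    for uf_t, uc_t in zip(Uf, Uc):
        f = [sigmoid(uf_t[i] + v[i] * c[i] + b[i]) for i in range(n)]
        c = [f[i] * c[i] + (1.0 - f[i]) * uc_t[i] for i in range(n)]
        out.append(c)
    return out


# Hypothetical toy sizes and random parameters.
random.seed(0)
T, d_in, d_out = 6, 4, 3
X = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(T)]
Wf = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
Wc = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
v = [random.gauss(0, 1) for _ in range(d_out)]
b = [0.0] * d_out

Uf, Uc = project(X, Wf), project(X, Wc)


def run_dim(i):
    # Slice out column i of the projections and run its private recurrence.
    return scan_one_dim([r[i] for r in Uf], [r[i] for r in Uc], v[i], b[i])


# Each hidden dimension is an independent job, so they can run concurrently.
with ThreadPoolExecutor() as pool:
    per_dim = list(pool.map(run_dim, range(d_out)))  # per_dim[i][t]

reference = scan_all_dims(Uf, Uc, v, b)              # reference[t][i]
assert all(
    math.isclose(per_dim[i][t], reference[t][i])
    for t in range(T)
    for i in range(d_out)
)
```

The thread pool here only demonstrates that the per-dimension jobs share no state; it is not a performance claim. With a full hidden-to-hidden matrix in place of the element-wise `v`, `scan_one_dim` would need every other dimension's previous state, and this split would no longer be possible.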
Thanks for your reply. @taolei87
@liziru It is version 2 by default. But you can test v1 by passing
Great! Thank you very much.
Thanks for your work. @taoleicn @taolei87
In the paper 'Simple Recurrent Units for Highly Parallelizable Recurrence', I found the following computation:
![image](https://user-images.githubusercontent.com/34911790/105475334-382c8600-5cda-11eb-9c4b-e7e0f499a197.png)
I don't understand how the computation becomes parallelizable when it uses a point-wise multiplication over the dimensions of the state vectors instead.
Because it still depends on c_{t-1}, I think this SRU cannot be parallelized.
Besides, I also found the computation in 'Training RNNs as Fast as CNNs', which I think can be parallelized.
![image](https://user-images.githubusercontent.com/34911790/105475991-0962df80-5cdb-11eb-991e-91d5a9cc282a.png)
Also, what are the differences between these two papers? I have asked this question on Zhihu as well.
Looking forward to your reply.