
RFC: TFRT Kernel Fallback #266

Merged on Nov 13, 2020 (30 commits)

Conversation

@annarev (Contributor) commented Jul 13, 2020

Feedback open until 7/28

| Status        | Proposed                                                           |
| :------------ | :----------------------------------------------------------------- |
| **RFC #**     | 266                                                                |
| **Author(s)** | Anna Revinskaya (annarev@google.com), Jeremy Lau (lauj@google.com) |
| **Sponsor**   | Jeremy Lau (lauj@google.com)                                       |
| **Updated**   | 2020-09-09                                                         |

Objective

This proposal focuses on getting a majority of "well-behaved" TensorFlow ops running efficiently on
mobile devices by removing the need to execute them via the TensorFlow eager runtime, instead
calling kernels directly from the new TFRT TensorFlow runtime.

Note that there is a parallel effort to call existing kernels by delegating to the TensorFlow eager runtime instead. That approach is called Runtime Fallback, and the corresponding RFC will be published soon. The goals of the two fallback mechanisms are as follows:

  • Runtime Fallback aims to reuse all current TensorFlow kernels in TFRT.
  • Kernel Fallback (focus of this document) aims to get a large number of existing kernels working in TFRT while reducing binary size to support mobile devices.

@fdxmw left a comment

Minor comments, thanks for putting this RFC together!

@bhack (Contributor) commented Jul 15, 2020

Can you introduce why we are talking about fallback? Are the non-fallback kernels codegen? What is the native kernel concept in TFRT?
Is there an impact on custom C++ ops in SIG repositories? Will there be a way to run them on mobile?

@annarev (Contributor, Author) commented Jul 15, 2020

@bhack, thank you for the comments! Note that I am still keeping this as a draft to gather comments from coworkers/coauthors. I will hopefully announce it for review very soon.

Can you introduce why we are talking about fallback? Are the non-fallback kernels codegen? What is the native kernel concept in TFRT?

"Native kernels" are kernels implemented in the new runtime (as opposed to existing TensorFlow runtime). See: https://github.com/tensorflow/runtime/blob/62695daf59a51774709fac081165f36611dfe67c/documents/tfrt_host_runtime_design.md#low-level-abstractions-asyncvalue-kernel-threadpool-allocator . @fdxmw let me know if there is anything I am missing.

Since TFRT is a new runtime, it only supports a small subset of kernels. At the same time, the existing TensorFlow runtime supports many kernels (probably around 1000). We want to define a way for TFRT to call these existing TensorFlow kernels, and we call that "fallback".

Is there an impact on custom C++ ops in SIG repositories?

Any impact is opt-in. We are thinking of having two types of "fallback": Runtime Fallback and Kernel Fallback. Runtime Fallback should support everything out of the box, including custom C++ ops. Kernel Fallback aims to support a subset of kernels with a subset of functionality. Custom ops can choose to support Kernel Fallback by extending OpKernelBase instead of OpKernel and by taking OpKernelConstructionInterface/OpKernelContextInterface instead of OpKernelConstruction/OpKernelContext as inputs (see the sketch below). Depending on the kernel, this might or might not be feasible.
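To make that concrete, here is a rough sketch of what an opted-in kernel could look like. The OpKernelBase/OpKernelConstructionInterface/OpKernelContextInterface names come from this thread; the method names and signatures below are hypothetical, not the finalized API.

```cpp
// Hypothetical sketch only: method names and signatures are illustrative.
#include "tensorflow/core/framework/op_kernel.h"  // assumed home of the new interfaces

class MyAddOp : public tensorflow::OpKernelBase {
 public:
  explicit MyAddOp(tensorflow::OpKernelConstructionInterface* ctx)
      : OpKernelBase(ctx) {}

  // Written against the context *interface*, so the same kernel body
  // can run under the existing runtime and under TFRT Kernel Fallback.
  void Compute_(tensorflow::OpKernelContextInterface* ctx) override {
    // ... fetch inputs, allocate outputs, and write results through
    // the interface methods ...
  }
};
```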

Will there be a way to run them on mobile?

The main concern is binary size. It might be that Runtime Fallback can run on mobile as well, which would automatically support all kernels. However, Runtime Fallback has larger binary size demands, which might or might not fit mobile constraints. Kernel Fallback, the focus of this doc, aims to support a subset of kernels while minimizing binary size, so it would be a better option for models that only use that subset.

@annarev (Contributor, Author) commented Jul 17, 2020

@bhack thank you for the questions, see replies below:

@annarev So is this just a more or less long-lived temporary solution while waiting for native kernels to populate TFRT?

Correct.

Will these fallback kernels be maintained in the TensorFlow repo?

Yes, we just update existing kernel code instead of adding new kernels. We will add separate registration macros, but these will also live in the TensorFlow repo.
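As an illustration of "separate registration macros": a converted kernel might keep its existing registration and add a fallback one next to it. REGISTER_KERNEL_BUILDER is the real TF macro; REGISTER_FALLBACK_KERNEL below is a made-up name for this sketch.

```cpp
// Existing registration for the full TensorFlow runtime (real macro):
REGISTER_KERNEL_BUILDER(Name("MyAdd").Device(DEVICE_CPU), MyAddOp);

// Additional registration for Kernel Fallback, living in the same TF
// source file. The macro name here is hypothetical.
REGISTER_FALLBACK_KERNEL(Name("MyAdd"), MyAddOp);
```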

@annarev (Contributor, Author) commented Aug 1, 2020

Design review notes:

  • tfrt_fallback as preferred dialect prefix
  • String interning less important now, will consider later
  • Why called fallback? Because we fall back from TFRT
  • Gradients op comment is not super relevant, can be removed
    • Can kernel fallback be used for training?
      • Kernels are available
  • Size measurement
    • Clarified: 200k per kernel per dtype
    • Should compare TFLite select (Flex) delegate with Kernel Fallback
      • Is the main saving from the framework or kernel side?
    • Ideal size number: 200k framework, 20k for kernels
  • Kernel fallback
    • Short term, usable for TFLite Select and also TFMRT, to get ~400 kernels while we develop proper native kernels
  • What's required per kernel?
    • Add base classes, change TF kernels to take interfaces, add registrations
    • Unexpected kernel conversion work, because some kernels use special base classes
  • TFRT attribute names are sorted alphabetically
    • Long-standing issue; there are some solutions, but we have not committed to one yet
  • Kernel registration not related to native kernel
  • Subset of kernels: how many?
    • Ideally ~400 based on estimate
    • Will get a better idea of whether we can convert all 400 kernels
  • How to fallback to full TF for kernels that use C API?
    • Depends on which platform you are using, select at compile time
  • Do we expect external users to use Kernel Fallback to fall back to their custom kernels?
    • We release two binaries, one with full TF, and the other with kernel fallback
    • One kernel implementation can work with both
  • Want to use TFRT kernel fallback for both TFLite Select and also TFMRT
    • TFMRT won't use the same API as TFLite
  • Should be able to package both runtime fallback and kernel fallback together
    • If we do that, we won't get binary size benefits
    • In the future, may consider kernel fallback to be used together with runtime fallback for non-mobile
  • Correct, this is a temporary solution until we have native kernels
    • Yes, will maintain them in TF repo
  • ENG cost (not scalability) for the kernel fallback:
    • Anna is responsible for adding those new kernels
    • Will provide an ENG estimate in Q4
  • Whether we want to use templates or inheritance? (see the sketch after these notes)
    • We need to convert a lot of kernels; how do we convert them?
    • On latency, inheritance is slower, due to vtable lookup for small kernels
    • basic_ops_benchmark calls methods that are used by all TF kernels
      • A regression would mean a lot more GCU used in TF
    • Will cause a regression on existing TF => this inheritance approach doesn't seem feasible
      • Could #ifdef the stuff we don't want to use; we already tried, but it is really hard to make it work
      • Have a simple base class with the common methods? But might as well use template
    • Only new TF kernels that want conversion require templates
    • If we go with templates, then would it be easy to just add native kernels?
      • You will have to add a few lines of code to every kernel
      • But likely can use ClangMR to update kernels in batch
      • If it's just a script that can update all kernels, then it seems fine
      • If it requires a lot more manual work, then maybe we need to prioritize on the set of kernels
    • What's the current scope of kernels?
      • There are two for now
      • Will estimate the ENG cost in Q3. Will implement in Q4.
    • Should come up with a list
    • TF has a lot of kernels. Need to depend on Runtime Fallback and/or Kernel Fallback
      • Kernels outside the TF repo do not use special TensorFlow headers, but are more likely to use C++ code that we don't handle
  • What's the next steps?
    • We are moving forward with kernel fallback, there are some details that need to be figured out
    • Anna will add more numbers, change inheritance to template, in a week
    • Need to know how many kernels can be converted:
      • 20 not so much
      • 400 is great!
      • What's the code size overhead? estimated at ~3% based on one kernel
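Since the template-vs-inheritance trade-off recurs throughout these notes, here is a minimal, self-contained sketch of the two dispatch styles under discussion, using toy types rather than the real TF classes:

```cpp
#include <cstdio>

// Option A: inheritance. The kernel codes against an abstract context,
// so one compiled kernel body serves any runtime, but each context call
// goes through a vtable (the small-kernel latency cost noted above).
struct ContextInterface {
  virtual ~ContextInterface() = default;
  virtual float input(int i) const = 0;
  virtual void set_output(float value) = 0;
};

void AddKernelVirtual(ContextInterface* ctx) {
  ctx->set_output(ctx->input(0) + ctx->input(1));
}

// Option B: templates. The kernel is stamped out per concrete context
// type: calls resolve at compile time (no vtable lookup), but binary
// size grows with every instantiation.
template <typename ContextT>
void AddKernelTemplated(ContextT* ctx) {
  ctx->set_output(ctx->input(0) + ctx->input(1));
}

// A toy context usable with either style.
struct ToyContext : ContextInterface {
  float in[2] = {1.0f, 2.0f};
  float out = 0.0f;
  float input(int i) const override { return in[i]; }
  void set_output(float value) override { out = value; }
};

int main() {
  ToyContext a, b;
  AddKernelVirtual(&a);    // dynamic dispatch through the vtable
  AddKernelTemplated(&b);  // static dispatch, resolved at compile time
  std::printf("%.1f %.1f\n", a.out, b.out);  // prints: 3.0 3.0
  return 0;
}
```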

@ematejska (Contributor) commented:
Will mark as accepted once the next steps above are completed by Anna.

@annarev (Contributor, Author) commented Aug 4, 2020

Just a heads up that I am still working on running more benchmarks to make the final template/inheritance decision and update the doc.
I will add a comment here once this doc is ready to merge.

@bhack (Contributor) commented Aug 4, 2020

@annarev How are the kernels sampled for these template/inheritance benchmarks? By "popularity"? (I suppose you have a small set.)

@annarev (Contributor, Author) commented Aug 4, 2020

@annarev How are the kernels sampled for these template/inheritance benchmarks? By "popularity"? (I suppose you have a small set.)

We try to look at kernels that are as small as possible (e.g., scalar ops) so that framework overhead is emphasized.

@bhack (Contributor) commented Aug 4, 2020

Are you also analyzing compile time? Do we need to introduce precompiled headers if we go ahead with inheritance? You probably don't care too much about compile time, but it is already quite heavy for contributors to compile TensorFlow sources without access to the "internal" cache (tensorflow/build#5).

@annarev (Contributor, Author) commented Aug 4, 2020

Are you also analyzing compile time? Do we need to introduce precompiled headers if we go ahead with inheritance? You probably don't care too much about compile time, but it is already quite heavy for contributors to compile TensorFlow sources without access to the "internal" cache (tensorflow/build#5).

We definitely don't want to increase compile time significantly, given that it already takes a long time. I would actually expect a larger compile-time increase from templates, since all templates would need to be expanded. Do you expect inheritance to have a larger build-time penalty?
I tried measuring build time, but it is not easy: building a single op is so dominated by everything else TensorFlow already has to build that I couldn't measure the overhead of this specific change.

@bhack (Contributor) commented Aug 4, 2020

Yes, I don't know all the internals, but I suppose it is a bit hard to isolate the compile time of a good sample of kernels. In general, of course, ISOCPP itself notes the increased compile time of templates, but we still need to check our empirical case, just so we are not made aware of it too late (once we have accumulated a significant number of registered kernels).

@annarev (Contributor, Author) commented Sep 11, 2020

Yes, I don't know all the internals, but I suppose it is a bit hard to isolate the compile time of a good sample of kernels. In general, of course, ISOCPP itself notes the increased compile time of templates, but we still need to check our empirical case, just so we are not made aware of it too late (once we have accumulated a significant number of registered kernels).

Quick update. I ran the build 3 times at head and 3 times with OpKernelContext and OpKernelConstruction updated to inherit from interfaces (our preferred approach right now is inheritance, so I didn't test templates).
Time measurements at head (seconds):
base = np.array([6776.767, 6595.033, 6926.461])
Time measurements with inheritance (seconds):
inheritance = np.array([7109.257, 6722.884, 6660.029])
The p-value (calculated with stats.ttest_ind(inheritance, base)) is quite high at 0.72, so most likely there is no build-time regression.

@annarev (Contributor, Author) commented Sep 14, 2020

@ematejska This RFC is now ready to accept and merge.

@ematejska ematejska added RFC: Accepted RFC Design Document: Accepted by Review and removed RFC: Proposed RFC Design Document labels Nov 13, 2020
@ematejska ematejska merged commit 9c4102d into tensorflow:master Nov 13, 2020