Given that you can restrict $f$ and $g$ to any form (convex, monotonic, etc.) what can be said about $\log(f(g(x)))$ (if anything)?
For context:
I am looking to consider replacing weight updates in neural network backpropagation with $\log$ weight updates as a way to deal with vanishing gradients in long chains of partial derivatives. The form for a neural network looks like:
$f(W_2\,g(W_1x)) = \hat{y}$
With $f$ and $g$ arbitrary non-linear activation functions. During backpropagation you compute $\Delta W_i = \frac{\partial L}{\partial W_i}$, which ends up being a long chain of partial derivatives, e.g. $\Delta W_i = \frac{\partial L}{\partial h}\frac{\partial h}{\partial a}\frac{\partial a}{\partial W_i}$. Taking $\log{\Delta W_i}$ lets you add the logs of those partial derivatives instead of multiplying the partials themselves (assuming positive factors, or tracking the signs separately), but you are left with $\log{\Delta W_i}$ instead of $\Delta W_i$.
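As a purely numerical illustration of why log space is tempting here (the array `partials` below is a hypothetical stand-in for the per-layer factors, not real gradients):

```python
import numpy as np

# Hypothetical chain of per-layer partial derivatives from a deep network;
# each factor is small, so the raw product vanishes numerically.
rng = np.random.default_rng(0)
partials = rng.uniform(0.01, 0.1, size=300)  # stand-in values, not real gradients

# Naive product: underflows to 0.0 for long chains.
naive = np.prod(partials)

# Log-space accumulation: add log-magnitudes, track the sign separately.
sign = np.prod(np.sign(partials))
log_magnitude = np.sum(np.log(np.abs(partials)))

print(naive)                          # 0.0 (underflow)
print(sign, log_magnitude)            # sign and log|ΔW_i| stay representable
print(sign * np.exp(log_magnitude))   # exponentiating to recover ΔW_i underflows again
```

The last line is exactly the sticking point described above: $\log{\Delta W_i}$ is well behaved, but converting back via $e^{\log{\Delta W_i}}$ reintroduces the vanishing value.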
I think the question ultimately is whether it is possible to constrain the forward model in such a way (perhaps limiting its expressiveness) that we could use $\log{\Delta W_i}$ to update the weights directly, without needing to take $e^{\log{\Delta W_i}}$. One of my first thoughts was to take $\log{\hat{y}}$ and see what happens, but I realized I didn't know much about what I could do with $\log(f(g(x)))$.
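As a small sanity check on what "constraining the forward model" could buy, here is one illustrative special case (an assumption for illustration, not a claim about the general setup): if the outer non-linearity is an exponential, $f(u) = e^{u}$, then the log of the composition collapses to the inner function,
$$\log f(g(x)) = g(x),$$
while in general the chain rule only gives
$$\frac{d}{dx}\,\log f(g(x)) = \frac{f'(g(x))}{f(g(x))}\,g'(x),$$
so taking the log of the output rescales the outer factor but does not by itself remove the product structure of the gradient chain.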
I'm thinking there might be something like Jensen's Inequality but for composite functions, so that we could minimize our loss function $L$ through an upper or lower bound.
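For reference, the standard composition facts point in that direction (these are textbook convexity results, stated under explicit assumptions rather than as something that automatically holds for a neural network): if $g$ is convex and $f$ is convex and nondecreasing, then $f \circ g$ is convex; and if $f$ is log-convex and nondecreasing while $g$ is convex, then $\log(f(g(x)))$ is convex, so Jensen's inequality gives the sandwich
$$\log f\!\big(g(\mathbb{E}[x])\big) \;\le\; \mathbb{E}\big[\log f(g(x))\big] \;\le\; \log \mathbb{E}\big[f(g(x))\big],$$
where the right-hand inequality is just Jensen applied to the concave $\log$. Bounds of this shape are what would let you minimize $L$ via an upper or lower bound.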