This provides a decent speed up in pretty much everything that touches pair loadstores because in most cases they are just regular non-quantizing
float loadstores that happen.
Instead of having an "INS" instruction after every single instruction to duplicate the bottom 64bits in to the top 64bits of the register,
create a new FPR register cache type to track when a register's lower 64bits is supposed to be duplicated in to the high 64bits.
Not necessarily actually having the lower bits duplicated in the host side register. This removes inefficient INS instructions from sequential single
float instructions.
In particular a very heavy single heavy block in Animal Crossing went from 712 instructions down to 520 instructions(~37% less instructions!)
If we are going to be using lfd, then chances are it is going to be used in double heavy areas of code.
If we only need to load the lower register, then we should also not worry about having to insert in to the low 64bits of the guest register.
So add a new flag to the backpatching to handle lfd to directly to the destination register.
This gives ~3% performance improvement to Povray.