Although information theory is omnipresent in statistical inference and learning, most introductions concentrate on the coding or statistical-physics interpretation of information. The goal of these notes is to demonstrate the power of information theory in machine learning and statistics. We hope that this shared perspective will help bring the two fields closer together.
These notes are highly experimental. We ask the reader to forgive the cold, terse tone of the exposition. Comments on confusing passages or mistakes are greatly appreciated.