John Aslanides's weblog: Self-indulgent rambling on topics in science, mathematics, music, architecture, sport, etc.
http://aslanides.github.io/
Wed, 28 Jun 2017 11:32:23 +0000
<h1>Bay Area II: CFAR Workshop</h1>
<p><em>This is the second of two posts relating my Bay Area visit of August-September 2016. You can read the first part <a href="http://aslanides.io/travel/2016/12/28/bay-area/">here</a>.</em></p>
<p>This is a <em>very</em> belated (6+ months!) write-up about my experience at the Summer 2016 Workshop for AI Researchers run by the <a href="https://rationality.org">Center for Applied Rationality (CFAR)</a>. To add insult to injury, I’m not even going to talk about the rationality techniques we studied! As a short summary, though: the workshop content, the participants, and above all the CFAR people were all awesome, and I got a huge amount out of attending. The atmosphere throughout was electric and immensely fun. I’ll be writing more about quantified self & rationality practice over the coming months, as I get this blog up to speed.</p>
<figure class="image">
<a href="/assets/bay/ellen_kenna_house.jpg" target="_blank"><img class="photo" src="/assets/bay/ellen_kenna_house.jpg" alt="Ellen Kenna House" /></a>
<figcaption>Ellen Kenna House</figcaption>
</figure>
<p>The workshop was held from August 30 to September 5 at <a href="https://localwiki.org/oakland/Ellen_Kenna_House">Ellen Kenna House</a>, a beautiful Victorian mansion that sits on top of a hill in Oakland, California. There were roughly 30 participants and 10 CFAR staff. The participants were AI researchers hailing from a variety of labs and institutes, including <a href="http://bair.berkeley.edu/">BAIR</a>, <a href="http://ai.stanford.edu/">SAIL</a>, <a href="http://intelligence.org">MIRI</a>, <a href="https://www.fhi.ox.ac.uk/">FHI</a>, <a href="http://idsia.ch/">IDSIA</a>, <a href="https://research.google.com/teams/brain/">Google Brain</a>, and <a href="https://mila.umontreal.ca/">MILA</a>.</p>
<center> <iframe width="560" height="315" src="https://www.youtube.com/embed/NPQCra8FEew" frameborder="0" allowfullscreen=""></iframe>
<br /><i>I had 'Bino's album </i>Because the Internet<i> on loop the whole week. Apt or corny?</i>
</center>
<p>In this post I’m going to formulate two fun prediction/betting games that I learnt about at the workshop: Calibration Market and Bid/At.</p>
<h2 id="calibration-market">Calibration Market</h2>
<p>This is related to the <a href="http://acritch.com/credence-game/">game of the same name</a> by <a href="http://acritch.com">Andrew Critch</a>. I’ll formulate it somewhat formally here.</p>
<p>A <em>calibration market</em> \(M\) is a tuple \(\left(P,\tau,B\right)\), where \(P\) is a proposition that is evaluated at some time \(\tau\), and \(B\) is a finite sequence of <em>bets</em> \(b_0, b_1, b_2, \dots\), where each bet \(b_i\in(0,1)\) is interpreted as a player’s subjective credence that \(P\) evaluates to \(1\) at time \(\tau\). For <a href="https://en.wikipedia.org/wiki/Cromwell's_rule">obvious</a> <a href="http://i0.kym-cdn.com/photos/images/facebook/000/008/729/Division_of_Zero_by_Sephro_Hoyland.jpg">reasons</a>, values of 0 and 1 are not permitted. The market is initialised with a prior \(b_0\), usually bet by the <em>house</em> or <em>market maker</em>, who may in turn participate in the market subsequently. Once \(P\) is evaluated, the market closes. Each bet (except the first) is scored by its log ratio with the previous bet:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
S[i] &= 100\log_2\left(\frac{b_i^P(1-b_i)^{1-P}}{b_{i-1}^P(1-b_{i-1})^{1-P}}\right) \\
&= 100\left[P\log_2\left(\frac{b_i}{b_{i-1}}\right) + (1-P)\log_2\left(\frac{1-b_i}{1-b_{i-1}}\right)\right]
\end{align} %]]></script>
<p>Informally, you gain points if you move the market towards the truth, and lose points if you move it away from it. Each player \(p_j\)’s total is the sum of the scores of all the bets they made in the market.</p>
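<p>A minimal sketch of this scoring in Python (the function names are my own, not anything from the workshop):</p>

```python
import math

def bet_score(prev: float, cur: float, outcome: int) -> float:
    """Score a single bet against the previous market price.

    `outcome` is 1 if the proposition turned out true, 0 otherwise.
    The score is positive iff the bet moved the market towards the truth.
    """
    if outcome == 1:
        return 100 * math.log2(cur / prev)
    return 100 * math.log2((1 - cur) / (1 - prev))

def market_scores(bets: list, outcome: int) -> list:
    """Score every bet after the prior b_0."""
    return [bet_score(prev, cur, outcome)
            for prev, cur in zip(bets, bets[1:])]

# Example: prior of 0.5; one player moves the market to 0.8, another
# pulls it back to 0.6; the proposition turns out to be true. The first
# bet scores positively, the second negatively. Note the scores
# telescope: the market's total score is 100 * log2(b_last / b_0),
# independent of the intermediate bets.
scores = market_scores([0.5, 0.8, 0.6], outcome=1)
```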
<p>Some other rules:</p>
<ul>
<li>You may not bet after yourself, and</li>
<li>You may not alternate bets with someone else.*</li>
</ul>
<p>* This second rule is deliberately left flexible and informal: how it is interpreted, and how strictly it is applied is down to taste.</p>
<p>The equation above is an example of a <a href="https://en.wikipedia.org/wiki/Scoring_rule">proper scoring rule</a>: to maximize one’s expected score, one should always bet one’s true belief – this was pointed out to me by <a href="https://intelligence.org/team/">Matthew Graves</a>. Let your true belief be \(a\), and your bet be \(b\). Let the previous bet be \(c\); we will see that the value of \(c\) turns out not to matter. Consider your \(a\)-expected score as a function of \(b\):</p>
<script type="math/tex; mode=display">\mathbb{E}[S] = a\log_2\left(\frac{b}{c}\right) + (1-a)\log_2\left(\frac{1-b}{1-c}\right).</script>
<p>We can maximize the expected value simply by setting the derivative to zero and solving for \(b\), since it is trivial to show that \(\mathbb{E}[S]\) is concave. The derivative is given by</p>
<script type="math/tex; mode=display">\partial_b\mathbb{E}[S] = \frac{a-b}{b(1-b)},</script>
<p>which naturally implies that one’s expected score is maximized if one bets one’s true beliefs:</p>
<script type="math/tex; mode=display">\arg\max_{b}\mathbb{E}[S] = a.</script>
<p>The animation below illustrates this, showing how the optimal bet changes as we sweep our belief over the interval \([0,1]\):</p>
<center><img src="/assets/animation.gif" /></center>
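<p>This result is easy to check numerically: sweep candidate bets over a fine grid and confirm that the expected score peaks at the true belief, whatever the previous price was (a quick sketch; the names and parameter values are mine):</p>

```python
import math

def expected_score(belief: float, bet: float, prev: float) -> float:
    """Expected score of betting `bet`, under subjective belief `belief`,
    when the previous market price is `prev` (in units of 100 points)."""
    return (belief * math.log2(bet / prev)
            + (1 - belief) * math.log2((1 - bet) / (1 - prev)))

# Sweep candidate bets on a grid: the maximizer coincides with the true
# belief, and does not depend on the previous price.
belief, prev = 0.7, 0.4
grid = [i / 1000 for i in range(1, 1000)]
best = max(grid, key=lambda b: expected_score(belief, b, prev))
```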
<p>Of course, one of the central notions at CFAR is recursive: becoming more rational requires metacognition, i.e. reasoning about reasoning. For this reason, when we played the game, many of the propositions were self-referential in nature. This makes for much more interesting markets, and goes to the whole meta theme. Propositions were typically of the form of “a randomly selected participant will willingly do/say X when prompted”, or “the outcome of this market will be X”, rather than propositions of the form “it will/won’t rain tomorrow”. There were numerous calibration markets run throughout the week, and they added to the generally stimulating and fun environment.</p>
<figure class="image">
<a href="/assets/bay/markets.jpg" target="_blank"><img class="photo" src="/assets/bay/markets.jpg" alt="A couple of markets." /></a>
<figcaption>A couple of markets.</figcaption>
</figure>
<h2 id="bidat">Bid/At</h2>
<p>I’m told that this game is played a lot at <a href="http://janestreet.com">Jane Street</a>. The game is simple: You make a market out of some future/unknown outcome \(Q\) that is quantifiable. For simplicity, let \(Q\) be a positive-valued random variable. The market consists of two or more players buying and selling contracts relating to the outcome \(Q\):</p>
<p>A <strong>contract</strong> between two players \(p_1\) and \(p_2\) is executed as follows: \(p_1\) pays \(p_2\) \(x>0\) credits, and receives in exchange a promise stipulating that after the value of \(Q\) is known, \(p_2\) will pay \(p_1\) \(Q\) credits. Information flows in the market by players sending price signals to each other: a player may <strong>bid</strong> a value \(x\) to indicate they are willing to buy a contract for \(x\) credits, or conversely declare they are <strong>at</strong> a value \(y\), which indicates they are willing to sell a contract for \(y\) credits.</p>
<p>In general one would bid \(x\) if one believes that \(\mathbb{E}[Q] > x\), and accept bids at \(x\) if one believes that \(\mathbb{E}[Q] < x\). Like the calibration game above, players can make inferences about the beliefs of other players (based on bid/at prices) and update their own beliefs; presumably the market converges to a fixed price under certain conditions, though I haven’t thought about it enough to write down what those conditions must be.</p>
<p>The main interesting game-theoretic difference between Bid/At and Calibration is that in Bid/At, you are incentivized to hide your subjective belief \(\mathbb{E}[Q]\), so as to potentially reap the biggest margins when buying/selling contracts. In any case, it’s a game that rewards forming accurate predictive models about the world, with an added element of risk management.</p>
<p>If played in earnest, the players would keep records of all contracts, and when \(Q\) is realized, the contracts are paid out at some (pre-determined) credits-to-dollars conversion rate.</p>
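<p>The bookkeeping and settlement described above can be sketched as follows (a toy implementation under my own naming; real play would of course be more elaborate):</p>

```python
from dataclasses import dataclass

@dataclass
class Contract:
    buyer: str    # paid `price` up front; receives Q at settlement
    seller: str   # received `price` up front; pays Q at settlement
    price: float

def settle(contracts: list, q: float, rate: float = 1.0) -> dict:
    """Net payout per player (in dollars) once Q is realized, at a
    pre-agreed credits-to-dollars conversion `rate`."""
    payouts: dict = {}
    for c in contracts:
        payouts[c.buyer] = payouts.get(c.buyer, 0.0) + (q - c.price) * rate
        payouts[c.seller] = payouts.get(c.seller, 0.0) + (c.price - q) * rate
    return payouts

# Alice believes E[Q] > 12 and buys from Bob, who is 'at' 12.
# Q is realized at 15: Alice nets +3 and Bob nets -3, so the
# market is zero-sum by construction.
book = [Contract(buyer="Alice", seller="Bob", price=12.0)]
payouts = settle(book, q=15.0)
```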
<h2 id="bonus-group-decision-algorithm">Bonus: Group Decision Algorithm</h2>
<p><a href="http://thirdfoundation.github.io">Duncan</a> told us about this algorithm in the context of coming to a decision about which restaurant to go to as a group of people. The algorithm is simple:</p>
<ul>
<li>Anyone can suggest one or more new options for consideration at any time.</li>
<li>If only one option remains under consideration without veto, and no more options are suggested, then this option is chosen and that’s the final decision.</li>
<li>If more than one option is under consideration sans veto, the final decision goes to a vote, using the usual procedure for redistributing preferences in the case of ties, breaking deadlocked ties at random.</li>
<li>During the candidate selection process, anyone can veto an option at any time; this option is then immediately and permanently removed from consideration. <strong>The person that made the veto must then generate three new suggestions</strong>.</li>
</ul>
<p>The algorithm is designed to completely obviate the situation in which someone suggests some option which gets immediately shot down/vetoed without any promising alternatives presented in its place. Once a first option is suggested, the algorithm is guaranteed to produce a decision that satisfies people’s preferences (assuming that people actually veto options they aren’t happy with), provided the set of feasible options isn’t exhausted; in the case of restaurants this certainly won’t happen, since everyone’s gotta eat eventually :). The multiplicative factor of three seems well-tuned to yield good results and fast convergence: it’s just enough to disincentivize flippant vetoes, but small enough to not make vetoing too onerous (e.g. most people can easily suggest three new restaurants), so that everyone has a good chance of having their preferences satisfied.</p>
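<p>The rules above can be sketched in a few lines of Python (my own toy rendering; the vote/tie-break step is stood in for by a random choice, and the restaurant names are invented):</p>

```python
import random

class GroupDecision:
    """Sketch of the suggest/veto procedure for group decisions."""

    def __init__(self) -> None:
        self.options: set = set()
        self.vetoed: set = set()

    def suggest(self, option: str) -> None:
        if option not in self.vetoed:   # vetoes are permanent
            self.options.add(option)

    def veto(self, option: str, replacements: list) -> None:
        # A veto permanently removes the option, and obliges the vetoer
        # to put three new suggestions on the table.
        assert len(replacements) == 3, "a veto costs three new suggestions"
        self.options.discard(option)
        self.vetoed.add(option)
        for r in replacements:
            self.suggest(r)

    def decide(self) -> str:
        assert self.options, "no options left to decide between"
        if len(self.options) == 1:
            return next(iter(self.options))
        # More than one surviving option: stand-in for the vote.
        return random.choice(sorted(self.options))

d = GroupDecision()
d.suggest("thai")
d.veto("thai", ["pizza", "sushi", "tacos"])
d.suggest("thai")  # re-suggesting a vetoed option has no effect
```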
<hr />
<p>There’s more interesting analysis to be done here (if it hasn’t been done already). Once I have the time to learn more about Bayesian markets, I’m sure there’ll be some interesting microeconomic/game-theoretic analysis of Bid/At. I’d also quite like to implement both Bid/At and Calibration as fun small-scale asynchronous multiplayer web app-based games. That would be an interesting engineering project, and would be a pretty fun tool for the rationality community to practice on, to supplement things like <a href="https://predictionbook.com/">PredictionBook</a>. To be continued!</p>
Sun, 30 Apr 2017 10:00:00 +0000
http://aslanides.github.io/travel/2017/04/30/cfar-games/
<h1>Simplicity is Complicated; Constraints bring Freedom</h1>
<p>I’ve recently been getting back into programming with <a href="https://golang.org/">Go</a>, which I haven’t used since early 2016, when I was helping out on the backend at <a href="https://karma.wiki">Karma Wiki</a>. Over the past couple of days, I’ve been <a href="https://github.com/aslanides/aixigo">porting</a> parts of <a href="https://github.com/aslanides/aixijs">AIXIjs</a> into Go, with a view to making it a highly performant and scalable reference implementation by drawing on <a href="http://divan.github.io/posts/go_concurrency_visualize/">Go’s excellent concurrency</a> features. In a future post, I’ll explain what I’ve learned about performance tuning and parallelism in Go.</p>
<p>Needless to say, I’m very impressed with the design of the language and the toolset surrounding it, and intend to build more projects with it. As part of my Go revival, I’ve watched a few talks by <a href="https://en.wikipedia.org/wiki/Rob_Pike">Rob Pike</a>, one of the co-creators of the language. These of course include the modern classic <a href="https://www.youtube.com/watch?v=cN_DpYBzKso">“Concurrency is not Parallelism”</a>. However, one talk in particular resonated with me: the presentation at dotGo 2015, titled <a href="https://www.youtube.com/watch?v=rFejpH_tAHM">“Simplicity is Complicated”</a>.</p>
<p>In “Simplicity is Complicated”, among other things, Rob talks at length about the decisions that went into designing Go’s notoriously <a href="https://golang.org/doc/faq">small feature set</a>. He points out that in many other languages (e.g. C++, Java, etc.), there are many ways to do things. In Go, this is largely not the case: the language – from code formatting to its idiosyncratic approach to OO to dependency management – is very <em>constrained</em>. There is usually only one way to do something, and it is usually constrained to be quite simple, robust and performant. Between the compiler’s static analysis and the toolchain (<code class="highlighter-rouge">go vet</code>, <code class="highlighter-rouge">golint</code> and <code class="highlighter-rouge">go fmt</code>), you don’t get much choice in the matter. This is a good thing.</p>
<p>For some programmers, of course, these constraints make using the language a pain in the arse [citation needed]. Rob’s point is that in other, heavily feature-laden languages, programmer <em>productivity</em> is significantly impacted by wasting time thinking about how to express an idea. “Should I use feature X or feature Y? What are the trade-offs?”, and so on. In Go, by introducing constraints on how things are expressed, Pike argues that the programmer is freed to spend their time thinking about what matters: the design and architecture of the software they are writing.</p>
<p>This is of course not a new notion in programming, and is not even a new notion in languages in general. Computer languages are tools for expressing certain kinds of ideas. <em>Human</em> languages are themselves tools for expressing ideas. Of course, these ideas are somewhat less formal than those typically expressed in computer languages, and their purpose and usage is quite different, but the language analogy (naturally) holds. Arguably, <strong>English is the C++ of human languages</strong>: it is hugely flexible and expressive, comprises an enormous vocabulary, is composed out of numerous other languages, is widely (ab)used, has many traps and gotchas, and is generally quite difficult to master.</p>
<p>Most <em>native</em> English speakers struggle to wield the language well. It is very common for students (even at the undergraduate or – occasionally – graduate level) to have considerable difficulty expressing their ideas concisely and coherently. The language offers so much flexibility that it’s common to see people agonize over the best words to use, rather than spending their time thinking about how to distill their thoughts and ideas into language.</p>
<p>Here I hope that I have made the analogy with programming clear; these are both <em>creative</em> endeavours in which we use the tools (i.e. languages) at hand to realize our ideas. From my personal experience, I have struggled with this issue in both domains. Historically, I have had enormous difficulty putting my thoughts down on paper in such a way that I was pleased with the results. Until around 2012, essay and report writing largely represented exercises in frustrated word-shuffling and hair-pulling to me. Even in my recent <a href="https://github.com/aslanides/aixijs">AIXIjs</a> project, I estimate that probably about 30% of my <a href="https://github.com/aslanides/aixijs/commits/master">commits</a> were spent either in changing superficial features of the program, or in vacillating between various (clumsy) design choices.</p>
<p>I say <em>until 2012</em>, because that was the year in which I did <a href="https://physics.anu.edu.au/education/honours/honours_structure.php">Honours</a> in physics, working under <a href="http://people.physics.anu.edu.au/~cms130/">C. M. Savage</a>; the Honours program was immensely challenging and stimulating, and for me, it was a year of considerable intellectual maturation. As the thesis deadline approached, I grew apprehensive of the task of writing a <a href="http://aslanides.io/docs/honours_thesis.pdf">100+ page document</a>. The inimitable <a href="https://researchers.anu.edu.au/researchers/close-jd">J. D. Close</a> recommended a booklet (almost a <em>pamphlet</em>, really) to me: <em><a href="https://en.wikipedia.org/wiki/The_Elements_of_Style">The Elements of Style</a></em>, by Strunk and White [<a href="http://www.jlakes.org/ch/web/The-elements-of-style.pdf">pdf</a>].</p>
<p><em>The Elements of Style</em> (TEOS), first published by William Strunk Jr. in 1918 and revised with E. B. White in 1959, is one of my favourite non-fiction books of all time. It has improved my writing (such as it is!) immensely. I can’t recommend it highly enough. At this point it should come as no surprise to the reader that TEOS is highly opinionated, prescriptive, and <em>constraining</em>. It begins by authoritatively enumerating several <em>hard</em> grammatical and syntactic rules that one should expect to be common knowledge for all English speakers. One such basic rule that is <em>frequently</em> abused by native English speakers is</p>
<blockquote>
<p><strong>#5 - Do not join independent clauses with a comma.</strong></p>
<p>If two or more clauses grammatically complete and not joined by a conjunction are to form a single compound sentence, the proper mark of punctuation is a semicolon.</p>
</blockquote>
<p>(Many people often write sentences of the form <em>“It is nearly half past five, we cannot reach town before dark”</em>; this is incorrect, and one should instead write <em>“It is nearly half past five; we cannot reach town before dark”</em>. If one wishes to use a comma, then a conjunction is necessary: <em>“It is nearly half past five, <strong>and/so</strong> we cannot reach town before dark.”</em>)</p>
<p>The book then follows with a concise and comprehensive treatise on the <em>design and structure</em> of pieces of English writing (the analogy continues!). I have found that adhering to their constraints has freed up my writing – instead of agonizing over form, style, and structure, I am (generally – although not always) able to abstract most of these considerations away and just concentrate on what matters: expressing my thoughts and ideas. Just like many C++ programmers employ style guides to constrain themselves to using certain idioms or subsets of the language, TEOS guides us in how to write concise and readable English.</p>
<p>In drawing this connection between writing and programming, I wanted to explore this general and powerful concept: that constraining the output space in certain ways can greatly enhance one’s creativity, productivity, and expressiveness. I believe that this concept is well understood (and frequently used to great advantage) in certain disciplines; visual art and music are obvious examples. There’s more to think about here, and I’m sure that others have explored these ideas before. Perhaps I’ll follow up with another blog post on this topic once I’ve thought/read about it a bit more.</p>
Sun, 30 Apr 2017 02:37:00 +0000
http://aslanides.github.io/general/2017/04/30/simplicity-constraints/
<h1>A response to 'The AI Cargo Cult'</h1>
<p>Kevin Kelly’s April 26 Backchannel piece <a href="https://backchannel.com/the-myth-of-a-superhuman-ai-59282b686c62">The AI Cargo Cult: The Myth of Superhuman AI</a> presents a long and, I believe, somewhat confused argument against the possibility of machine superintelligence. Most of his points are not new, and are already well refuted/anticipated by (for example) Bostrom’s <a href="https://en.wikipedia.org/wiki/Superintelligence:_Paths,_Dangers,_Strategies"><em>Superintelligence</em></a>. Numerous commenters on <a href="https://news.ycombinator.com/item?id=14205042">Hacker News</a> have already pulled apart some of his arguments. In fact, as I write this, the article has just fallen off the first page of HN, so it seems that perhaps not too many people are taking it seriously. I’m sure many in the AI safety community would dismiss Kelly as not worth arguing with, given how uninformed he seems to be on the topic. His essay certainly presents an easy target for rebuttal, and since I’m still trying to breathe some life into this blog, I’m going to go ahead and (briefly) do that.</p>
<hr />
<p>Kelly claims that there are five wrong assumptions made by those who believe that machine superintelligence is possible. His claim is wrong about at least four of them: these assumptions need not hold for superhuman intelligence to exist, and almost no one in the AI community is making these assumptions – they are largely <a href="https://en.wikipedia.org/wiki/Straw_man">strawmen</a> of his (and others’) construction. He lists these putative assumptions at the beginning of the article, before putting forward his own counterarguments.</p>
<p>Below, I take issue both with his characterisation of the assumptions made by AI researchers, and with his counterarguments. Of course, a lot has already been written on these topics, and I am not an expert. This post is mainly just an exercise for me in quickly writing up my thoughts in response to something (I only read his essay at 8 am this morning), and subsequently putting it online, instead of sitting on a half-finished draft (as is my wont). I have a lot more to write about these topics, and that’s for another blog post; this one will be kept short. Here it goes:</p>
<h2 id="1--2">1. & 2.</h2>
<p><em>Kelly’s imagined AI researcher</em>: <strong>Artificial intelligence is already getting smarter than us, at an exponential rate.</strong></p>
<p><em>Kelly’s counterargument</em>: <strong>Intelligence is not a single dimension, so “smarter than humans” is a meaningless concept.</strong></p>
<p>No one in the field is claiming that this is true, at least not in the sense that Kelly means. Certainly, Moore’s law continues to run exponentially for the time being, and the rate at which we are producing data appears to be increasing exponentially. The rate of progress in applying AI to narrow problem domains (for example vision, translation, and playing simple games) is arguably exponential, certainly in recent years. But almost everyone in the AI community acknowledges that there are fundamental difficult scientific questions that remain to be answered before we can achieve superintelligent machines, or AGI.</p>
<p>As a counterargument, Kelly raises the (reasonable, relative to some of his other claims) point that representing intelligence as a single-dimensional property is overly simplistic, and argues that intelligence is a rich, multidimensional object instead:</p>
<blockquote>
<p>Instead of a single decibel line, a more accurate model for intelligence is to chart its possibility space. […] Intelligence is a combinatorial continuum. Multiple nodes, each node a continuum, create complexes of high diversity in high dimensions. Some intelligences may be very complex, with many sub-nodes of thinking. Others may be simpler but more extreme, off in a corner of the space.</p>
</blockquote>
<p>Most AI researchers would not disagree with this, at least insofar as it applies to the ‘folk’ definition of intelligence. Kelly goes on to use this as the basis for the claim that there is no ordering on intelligence, and that therefore at best, with AI, we can create an intelligence <em>different</em> to our own, but not <em>better</em>, since this comparison is meaningless:</p>
<p><em>Kelly’s imagined AI researcher</em>: <strong>We’ll make AIs into a general purpose intelligence, like our own.</strong></p>
<p><em>Kelly’s counterargument</em>: <strong>Humans do not have general purpose minds, and neither will AIs.</strong></p>
<p>Even if we accept the (reasonable) multidimensional claim, this second claim is clearly false. Let us indulge Kelly’s formulation by representing intelligences as vectors lying in the positive orthant of <script type="math/tex">\mathbb{R}^N</script> for some (presumably large and unknown) <script type="math/tex">N</script> – that is, the set <script type="math/tex">V = \{x \in \mathbb{R}^N : x_i \geq 0\ \forall\ i \in \{1,\dots,N\}\}</script>. Here I am clearly assuming that <script type="math/tex">N</script> is finite, though I expect that one can easily extend these arguments to infinite-dimensional spaces.</p>
<p>Now, let a ‘typical’ human intelligence be a vector <script type="math/tex">v_{\text{Human}} \in V</script>. Kelly’s thesis is that human intelligence is not ‘general-purpose’, as we clearly have low or zero intelligence along certain dimensions, and that therefore comparisons are impossible in this space. This is nonsense. There are any number of ways I can compare two vectors. Even if I want to be completely agnostic as to the relative value of different dimensions of intelligence (which Kelly seems to be), I can still say unambiguously that some AI is <em>strictly</em> more intelligent than humans if the elements of <script type="math/tex">v_{\text{AI}}</script> are no smaller than the corresponding elements of <script type="math/tex">v_{\text{Human}}</script>, and strictly greater in at least one dimension.</p>
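<p>To make the dominance claim concrete, here is a minimal sketch (the three-dimensional vectors are invented purely for illustration):</p>

```python
def strictly_dominates(ai, human) -> bool:
    """True iff `ai` is at least as intelligent in every dimension and
    strictly more intelligent in at least one (Pareto dominance)."""
    assert len(ai) == len(human)
    return (all(a >= h for a, h in zip(ai, human))
            and any(a > h for a, h in zip(ai, human)))

# Hypothetical 'intelligence' vectors: the AI matches the human in two
# dimensions and exceeds it in one, so it strictly dominates, without
# us having to assign relative value to the dimensions.
human = [1.0, 2.0, 3.0]
ai = [1.0, 5.0, 3.0]
```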
<p>I anticipate Kelly would argue that this is missing his point: that my vector-space analogy is overly simplistic, and that it ignores his view that intelligences are much more complex objects. It is difficult to understand his position here, though, as he seems somewhat confused about whether to use the vector analogy or the fractal/snowflake one. His vocabulary, and the images that he uses, seem to suggest both:</p>
<center><img src="/assets/kelly/complex.png" width="50%" /><img src="/assets/kelly/vector.jpeg" width="50%" /></center>
<p>In any case, this argument is largely about definitions of intelligence. Kelly chooses to define intelligence in a way that makes every animal unique and incomparable, and so tries to render the question of ‘superhuman’ intelligence moot. But most AI researchers would take the pragmatic view, that intelligence only matters insofar as what it enables you to <em>do</em>: can you build rockets to Mars? Can you invent new math? Can you conquer the planet/galaxy/observable universe? This issue relates to point 5, below.</p>
<h3 id="3">3.</h3>
<p><em>Kelly’s imagined AI researcher</em>: <strong>We can make human intelligence in silicon.</strong></p>
<p><em>Kelly’s counterargument</em>: <strong>Emulation of human thinking in other media will be constrained by cost.</strong></p>
<p>Kelly claims that the only way to faithfully simulate human cognition <em>in real time</em> is to do it using human tissue: essentially with brains:</p>
<blockquote>
<p>… [T]he only way to get a very human-like thought process is to run the computation on very human-like wet tissue. That also means that very big, complex artificial intelligences run on dry silicon will produce big, complex, unhuman-like minds. If it would be possible to build artificial wet brains using human-like grown neurons, my prediction is that their thought will be more similar to ours. The benefits of such a wet brain are proportional to how similar we make the substrate. The costs of creating wetware is huge and the closer that tissue is to human brain tissue, the more cost-efficient it is to just make a human. After all, making a human is something we can do in nine months.</p>
</blockquote>
<p>This seems not particularly well-argued or relevant. Certainly, if we emulate brains in silicon, they won’t behave <em>exactly like human brains</em>. Surely that’s the whole point. As far as objections to AGI go, I don’t see how this is part of a strong case. He clearly concedes that brains are essentially wet computers, so this should be the end of the matter. The issue of fidelity is not very interesting to most AI researchers; insert bird/plane analogy here.</p>
<h3 id="4">4.</h3>
<p><em>Kelly’s imagined idiotic AI researcher</em>: <strong>Intelligence can be expanded without limit.</strong></p>
<p><em>Kelly’s counterargument</em>: <strong>Derp.</strong></p>
<p>Show me a <strong>single</strong> AI researcher that actually believes this. This is the laziest strawman I’ve ever seen. It is particularly grating to see him bring up limits in physics, as though not a single AI researcher has studied physics before. Here’s Kelly:</p>
<blockquote>
<p>It stands to reason that reason itself is finite, and not infinite. So the question is, where is the limit of intelligence? We tend to believe that the limit is way beyond us, way “above” us, as we are “above” an ant. Setting aside the recurring problem of a single dimension, what evidence do we have that the limit is not us? Why can’t we be at the maximum? Or maybe the limits are only a short distance away from us? Why do we believe that intelligence is something that can continue to expand forever?</p>
</blockquote>
<p>Hey dude, the middle ages called, and they want their <a href="https://en.wikipedia.org/wiki/Anthropocentrism">anthropocentrism</a> back. What are the odds that human intelligence is <em>the most intelligent it is possible to be, at all, ever</em>? This is thoroughly debunked in so many places (hint: you should actually read Bostrom’s book before you shit on it).</p>
<p>In all seriousness: throughout the article it is pretty obvious that Kelly is not well-informed on the topic of AI, but here he exposes a profound ignorance of history. Arguably one of the central narratives of science has been the ejection of humans from the center of the universe.</p>
<p>To be clear: no, the limits of intelligence are clearly not infinite. Yes, they are almost certainly <em>significantly</em> higher than human-level. I refer the reader to the excellent <a href="https://arxiv.org/abs/1703.10987">On the Impossibility of Supersized Machines</a>.</p>
<h3 id="5">5.</h3>
<p><em>Kelly’s imagined AI researcher</em>: <strong>Once we have exploding superintelligence it can solve most of our problems.</strong></p>
<p><em>Kelly’s counterargument</em>: <strong>Intelligences are only one factor in progress.</strong></p>
<p>Kelly takes issue with the claim that having more intelligence makes you more able to solve problems. He brands the claim ‘thinkism’:</p>
<blockquote>
<p>Many proponents of an explosion of intelligence expect it will produce an explosion of progress. I call this mythical belief “thinkism.” It’s the fallacy that future levels of progress are only hindered by a lack of thinking power, or intelligence. (I might also note that the belief that thinking is the magic super ingredient to a cure-all is held by a lot of guys who like to think.)</p>
</blockquote>
<p>Kelly goes on:</p>
<blockquote>
<p>Let’s take curing cancer or prolonging longevity. These are problems that thinking alone cannot solve. No amount of thinkism will discover how the cell ages, or how telomeres fall off. No intelligence, no matter how super duper, can figure out how the human body works simply by reading all the known scientific literature in the world today and then contemplating it. No super AI can simply think about all the current and past nuclear fission experiments and then come up with working nuclear fusion in a day. A lot more than just thinking is needed to move between not knowing how things work and knowing how they work. There are tons of experiments in the real world, each of which yields tons and tons of contradictory data, requiring further experiments that will be required to form the correct working hypothesis. Thinking about the potential data will not yield the correct data.</p>
</blockquote>
<p>Here, I think, is where we get to the core of the matter. Kelly’s notion of intelligence is not <em>operational</em>: he neglects the fact that superintelligences will have <em>agency</em>. Here’s an idea: a superintelligence can run experiments of its own, or at the very least suggest experiments for humans to run. Being smarter lets you <em>do more stuff</em>. Chimpanzees could not have designed the Large Hadron Collider. It is precisely the superhuman intelligence that we hope to create that <em>will</em> solve our problems. In the Legg/Hutter <a href="http://www.vetta.org/documents/Machine_Super_Intelligence.pdf">definition</a>, intelligence essentially boils down to the <em>capacity to solve problems</em>. If we use this definition, then <em>by construction</em>, a machine superintelligence will be able to solve problems that humans cannot.</p>
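<p>For the curious, the Legg/Hutter measure makes this precise: it scores an agent <script type="math/tex">\pi</script> by its expected performance across all computable environments, weighted towards simpler ones. Roughly (my paraphrase, with <script type="math/tex">E</script> the class of environments, <script type="math/tex">K(\mu)</script> the Kolmogorov complexity of environment <script type="math/tex">\mu</script>, and <script type="math/tex">V^{\pi}_{\mu}</script> the expected total reward of <script type="math/tex">\pi</script> in <script type="math/tex">\mu</script>):</p>
<script type="math/tex; mode=display">\Upsilon(\pi) := \sum_{\mu \in E} 2^{-K(\mu)} V^{\pi}_{\mu}.</script>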
<p>That’s the whole point of building AGI in the first place.</p>
Wed, 26 Apr 2017 23:50:00 +0000
http://aslanides.github.io/artificial_intelligence/2017/04/27/cargo-cult-rebuttal/
http://aslanides.github.io/artificial_intelligence/2017/04/27/cargo-cult-rebuttal/artificial_intelligenceBay Area I: San Francisco, Berkeley, & Silicon Valley<p>This is the first of two blog posts documenting my visit to the San Francisco Bay area from August 22 to September 5, 2016. I was lucky enough to be invited by the <a href="http://rationality.org" target="_blank">Center for Applied Rationality</a> (CFAR) for an expenses-paid workshop for AI researchers, which ran from August 30 to September 5. I arrived a week early to do some sightseeing around the Bay area before the workshop. This post documents some of what I did, and shows some of the better photos that I took during my first week in CA. The second post will discuss what I learned at the CFAR workshop, during the second week.</p>
<p><em>Disclaimer: I put this post together in December 2016, based on the notes and photos that I took in August. All of the photos were taken on my <a href="https://en.wikipedia.org/wiki/Nexus_5" target="_blank">Nexus 5</a>; many of them came out quite underexposed and so I’ve done my best to fix them up with some post-processing in Google Photos. I apologise in advance for the filter abuse and for blatantly tryharding at photography with a shitty smartphone camera and zero skill</em> ¯\_(ツ)_/¯. For those interested, the complete album can be found <a href="https://goo.gl/photos/XUiZ4Gr7M7BQTyUG6">here</a>.</p>
<figure class="image">
<a href="/assets/bay/trip.png" target="_blank"><img class="photo" src="/assets/bay/trip.png" alt="My Google Maps Timeline. The red dots roughly correspond to locations I visited." /></a>
<figcaption>My Google Maps Timeline. The red dots roughly correspond to locations I visited.</figcaption>
</figure>
<h1 id="aug-23-san-francisco">Aug 23: San Francisco</h1>
<p>For the duration of my visit, Bay Area Rapid Transit (<a href="https://youtu.be/_hqxEOSjqRM?t=27s" target="_blank">BART</a>) was my ride of choice around the Bay. Like much of the public infrastructure, it is pretty ancient, and dates back to the early 1970s. Unlike Australian trains (I think; after all, I’m just a simple Canberra boy), push bikes are allowed on board, with their own designated areas. Dogs are also a relatively common sight.</p>
<figure class="image">
<a href="/assets/bay/bart.jpg" target="_blank"><img class="photo" src="/assets/bay/bart.jpg" alt="BART trains look like they were inspired by the Apollo program from the 1960s." /></a>
<figcaption>BART trains look like they were inspired by the Apollo program from the 1960s.</figcaption>
</figure>
<p>Here it comes: the obligatory photo of the Golden Gate Bridge. I really underestimated the scale of the place; it took like three hours to walk from downtown SF to the bridge. From this lookout, I managed to catch a small container ship passing into the Bay, which you can just make out in the right of the shot.</p>
<figure class="image">
<a href="/assets/bay/goldengate.jpg" target="_blank"><img class="photo" src="/assets/bay/goldengate.jpg" alt="Golden Gate is best Gate." /></a>
<figcaption>Golden Gate is best Gate.</figcaption>
</figure>
<p>The <a href="https://en.wikipedia.org/wiki/Palace_of_Fine_Arts">Palace of Fine Arts</a> is a restored monument constructed for the 1915 Panama-Pacific International Exposition. It housed the art exhibits, and since then it has been maintained and rebuilt in stone as a monument in its own right. I thought this was a magnificent statement of the American spirit: build a large monument, not for the church, or for a king, but simply because they can.</p>
<figure class="image">
<a href="/assets/bay/palace.jpg" target="_blank"><img class="photo" src="/assets/bay/palace.jpg" alt="The rotunda, viewed from between columns in the surrounding colonnade." /></a>
<figcaption>The rotunda, viewed from between columns in the surrounding colonnade.</figcaption>
</figure>
<p>I took this next photo as a piece of social commentary. Here we are in the U.N. Plaza, part of the <a href="https://en.wikipedia.org/wiki/Civic_Center,_San_Francisco">Civic Center</a>. The Plaza commemorates the signing of the <a href="https://en.wikipedia.org/wiki/United_Nations_Charter">United Nations Charter</a> in 1945. At the end of the plaza is the city hall. Framing the photo are the U.N. and U.S. flags. This is surely an impressive landmark, of significance to both American and world history. This is juxtaposed with the line of homeless people queueing for food stamps, in the left of the shot. Behind me (out of shot) are dozens more homeless men and women, distributed through the square and around the entrance to the BART station. The smell of weed is not far off.</p>
<figure class="image">
<a href="/assets/bay/unplaza.jpg" target="_blank"><img class="photo" src="/assets/bay/unplaza.jpg" alt="Extreme poverty meets pomp and circumstance: U.N. Plaza edition." /></a>
<figcaption>Extreme poverty meets pomp and circumstance: U.N. Plaza edition.</figcaption>
</figure>
<p>Walking around downtown SF was a pretty eye-opening experience. At the time, it frequently made me uncomfortable. The number of poor, homeless, and desperate people in the streets was really confronting. During an afternoon strolling around, I probably saw every different make and model of expensive sports car, no doubt driven by one of Silicon Valley’s <a href="https://www.jointventure.org/images/stories/pdf/index2016.pdf" target="_blank">76,000 (m|b)illionaires</a>. At any given time there would be at least a dozen homeless men and women within sight. Around the financial district, and near the offices of the likes of Uber and Twitter, it’s trivially easy to spot who belongs to which tribe: there’s the healthy 20 or 30-something walking with purpose, sipping a coffee, wearing a suit or a tech company T-shirt; the rest are largely unhealthy, looking over 50 regardless of age: poor, mentally ill, and disenfranchised. At many points, I was left with the impression of visiting a paradoxically industrious and progressive <a href="http://usuncut.com/class-war/america-third-world-country/" target="_blank">third world country</a>.</p>
<p>For most of the week, I stayed in the Green Tortoise, a groovy youth hostel on Broadway, in the middle of San Francisco’s red light district. Many of the clubs around here were open and bouncing 24 hours a day. I met people of various nationalities during my stay; Australians were definitely over-represented. The price was roughly $70 per night for bunk-bed, dorm-style rooms. Pretty decent accommodation; would recommend.</p>
<figure class="image">
<a href="/assets/bay/tortoise.jpg" target="_blank"><img class="photo" src="/assets/bay/tortoise.jpg" alt="The Green Tortoise." /></a>
<figcaption>The Green Tortoise.</figcaption>
</figure>
<p>This next photo was taken near the hostel, looking down Kearny street towards downtown. For me, it captures several distinctive features of San Francisco:</p>
<ul>
<li>the low cloud, which is ubiquitous even in late summer,</li>
<li>the Victorian-era buildings, with two excellent specimens visible at the intersection with Columbus Ave,</li>
<li>the rampant capitalism, represented by the <a href="https://en.wikipedia.org/wiki/Transamerica_Pyramid" target="_blank">Transamerica Pyramid</a>, in the background to the left,</li>
<li>the let-it-all-hang-out hedonism of the strip clubs and adult video store,</li>
<li>the iconic steep sloping streets,</li>
<li>and the juxtaposition of private wealth with crumbling public infrastructure: the 30-year-old bus next to late-model cars (including a Prius and a SmartCar – Californians are quite an environmentally conscious lot).</li>
</ul>
<figure class="image">
<a href="/assets/bay/pyramid.jpg" target="_blank"><img class="photo" src="/assets/bay/pyramid.jpg" alt="" /></a>
<figcaption></figcaption>
</figure>
<h1 id="aug-24-uc-berkeley--miri">Aug 24: UC Berkeley & MIRI</h1>
<p>The <a href="https://en.wikipedia.org/wiki/University_of_California,_Berkeley">University of California (UC), Berkeley</a> is probably the most prestigious public university in the world, and is well renowned for its <a href="https://eecs.berkeley.edu/">computer science department</a>, among others. It turned out by chance that the day of my visit was the first day of class, so the campus was vibrant and thrumming with thousands of undergrads. The whole place had a great atmosphere: the scale, the beautiful buildings, the laid-back-but-high-achieving feel of the place, with all this academic history and prestige against the backdrop of sunny California. The campus makes ANU’s look <em>very</em> second-rate in comparison.</p>
<figure class="image">
<a href="/assets/bay/berkeley.jpg" target="_blank"><img class="photo" src="/assets/bay/berkeley.jpg" alt="Doe Library, UC Berkeley." /></a>
<figcaption>Doe Library, UC Berkeley.</figcaption>
</figure>
<p>My sarcastic Australian sensibilities initially suggested that the sign in the following photo was a deliberate gag, but I’m convinced that it’s not. This is evidence of the earnest and unapologetic prestige and ambition of UC Berkeley. I’m sure it never even occurred to the people who commissioned the sign that it could be interpreted as a joke! Americans don’t suffer from tall poppy syndrome as we do ;).</p>
<figure class="image">
<a href="/assets/bay/nobel.jpg" target="_blank"><img class="photo" src="/assets/bay/nobel.jpg" alt="Four Nobel Laureate-only spots in a row!" /></a>
<figcaption>Four Nobel Laureate-only spots in a row!</figcaption>
</figure>
<p>Through an introduction from my colleague <a href="http://www.tomeveritt.se/">Tom Everitt</a>, I booked a short afternoon meeting with <a href="http://jessic.at/">Jessica Taylor</a> and <a href="https://www.linkedin.com/in/patricklavictoire">Patrick LaVictoire</a> at the <a href="http://intelligence.org">Machine Intelligence Research Institute</a> in Downtown Berkeley. <a href="https://intelligence.org/2016/12/28/ai-alignment-why-its-hard-and-where-to-start/">AI Safety</a> is an area that is currently blowing up as a sub-field of artificial intelligence research. My <a href="/about">group</a> has produced a number of safety researchers; both Tom and <a href="https://jan.leike.name">Jan</a> (my MSc thesis supervisor) are heavily involved in safety research, as is <a href="http://danielfilan.com/">Daniel</a>, who graduated in 2015.</p>
<p>I ended up hanging out in the MIRI/CFAR shared working space after the last day of the workshop, on the 5th of September. On the right of the shot is a pair of bowls on the table containing complimentary MealSquares (as advertised on <a href="http://slatestarcodex.com/">Slate Star Codex</a>!).</p>
<figure class="image">
<a href="/assets/bay/miri.jpg" target="_blank"><img class="photo" src="/assets/bay/miri.jpg" alt="MIRI/CFAR shared working space." /></a>
<figcaption>MIRI/CFAR shared working space.</figcaption>
</figure>
<h1 id="aug-25-26-stanford--palo-alto">Aug 25-26: Stanford & Palo Alto</h1>
<p>Stanford University was founded by Leland Stanford, a railroad tycoon and industrialist, and opened in 1891. For decades, Stanford has been integral to the entrepreneurial technology culture in Silicon Valley. Oh, and it’s the most beautiful university campus I’ve ever seen. If Berkeley made ANU’s campus look second-rate, Stanford makes it look like a third world primary school. That tycoon money!</p>
<figure class="image">
<a href="/assets/bay/stanford.jpg" target="_blank"><img class="photo" src="/assets/bay/stanford.jpg" alt="That's one hell of a front lawn." /></a>
<figcaption>That's one hell of a front lawn.</figcaption>
</figure>
<figure class="image">
<a href="/assets/bay/hoover.jpg" target="_blank"><img class="photo" src="/assets/bay/hoover.jpg" alt="Foreground: Cecil H. Green Library. Background: Hoover Tower." /></a>
<figcaption>Foreground: Cecil H. Green Library. Background: Hoover Tower.</figcaption>
</figure>
<figure class="image">
<a href="/assets/bay/track.jpg" target="_blank"><img class="photo" src="/assets/bay/track.jpg" alt="Angell track. The Stanford Stadium lights can be seen in the background." /></a>
<figcaption>Angell track. The Stanford Stadium lights can be seen in the background.</figcaption>
</figure>
<p>Silicon Valley is enormous, suburban, and wealthy. It essentially extends all the way from San Mateo in the north-west to San Jose, roughly 40 kilometres to the south-east. I spent a few hours exploring the suburban streets on my way downtown; California really turned on the good weather.</p>
<figure class="image">
<a href="/assets/bay/palo_alto.jpg" target="_blank"><img class="photo" src="/assets/bay/palo_alto.jpg" alt="Palo Alto sprawls like this for miles." /></a>
<figcaption>Palo Alto sprawls like this for miles.</figcaption>
</figure>
<p>Fans of Lin-Manuel Miranda’s masterpiece <a href="https://en.wikipedia.org/wiki/Hamilton_(musical)"><em>Hamilton</em></a> will appreciate this one:</p>
<figure class="image">
<a href="/assets/bay/hamilton.jpg" target="_blank"><img class="photo" src="/assets/bay/hamilton.jpg" alt="I'm not sure who checks for parking permits between 2 and 5 A.M., though." /></a>
<figcaption>I'm not sure who checks for parking permits between 2 and 5 A.M., though.</figcaption>
</figure>
<p>The <a href="https://www.google.com.au/maps/place/Stanford+Shopping+Center/@37.4531555,-122.1960036,13z/data=!4m5!3m4!1s0x808fbb3471827639:0x75895b0f0e878d4!8m2!3d37.443126!4d-122.171574" target="_blank">Stanford Shopping Center</a> is located on the corner of Stanford’s (enormous) campus, between the hospital and Downtown Palo Alto. There are also two fully decked-out Apple stores within a stone’s throw of each other: one here, and one in Downtown. In this shot, I caught the afternoon sun peeking over the top of the Tesla showroom.</p>
<figure class="image">
<a href="/assets/bay/tesla.jpg" target="_blank"><img class="photo" src="/assets/bay/tesla.jpg" alt="The staff are super chill about randos getting in the display models and faffing around." /></a>
<figcaption>The staff are super chill about randos getting in the display models and faffing around.</figcaption>
</figure>
<p>I stayed a few nights at a hacker house on N. California Ave. It was a simple three-bedroom place that had been set up to sleep 12-14, and decked out with startup paraphernalia, complete with motivational posters and biographies of all the cult heroes. During my stay, a bunch of people were living there on a semi-permanent basis, ranging in age from 18 to 50, though most were 18-25.</p>
<p>These people were insanely ambitious: there was a 21-year-old college drop-out starting a luxury smartwatch company, and an 18-year-old <em>first-day</em> college drop-out trying to start an <em>autonomous vehicle</em> startup. Stuff hardly any Australian adults would dream of doing, let alone teenagers.</p>
<figure class="image">
<a href="/assets/bay/elon.jpg" target="_blank"><img class="photo" src="/assets/bay/elon.jpg" alt="All hail the king." /></a>
<figcaption>All hail the king.</figcaption>
</figure>
<p>As I mentioned earlier, Californians are quite environmentally conscious – more so, I think, than the average Australian. In leafy Silicon Valley, the Tesla Model S is the ubiquitous badge of the elite. I probably saw over a hundred in the space of a couple of days; the novelty wore off quickly. A family owning both the Model S and the Model X was sufficiently rare that it was worth a photo:</p>
<figure class="image">
<a href="/assets/bay/family.jpg" target="_blank"><img class="photo" src="/assets/bay/family.jpg" alt="The Palo Alto double special." /></a>
<figcaption>The Palo Alto double special.</figcaption>
</figure>
<p>I also had my first taste of <a href="http://www.sushirrito.com/food">Sushirrito</a> in DT Palo Alto. These guys really nailed the Mexican-Japanese fusion. This shit is going to be huge in Australia in like 3 years. I’d take good odds.</p>
<figure class="image">
<a href="/assets/bay/sushirritto.jpg" target="_blank"><img class="photo" src="/assets/bay/sushirritto.jpg" alt="Sushirrito!" /></a>
<figcaption>Sushirrito!</figcaption>
</figure>
<h1 id="aug-27-san-francisco">Aug 27: San Francisco</h1>
<p>On the Saturday, I returned to SF and spent the day hanging out with two old friends from <a href="http://cgs.act.edu.au">CGS</a>: Richard Dear, who does data science at <a href="https://medium.com/airbnb-engineering">AirBnB</a>, and Ross Dyson, who is now (as of December 2016) a frontend engineer at <a href="https://cloudflare.com/">CloudFlare</a>.</p>
<figure class="image">
<a href="/assets/bay/sf_night.jpg" target="_blank"><img class="photo" src="/assets/bay/sf_night.jpg" alt="Looking towards the Bay Bridge on the way back to the hostel." /></a>
<figcaption>Looking towards the Bay Bridge on the way back to the hostel.</figcaption>
</figure>
<h1 id="aug-28-muir-woods--sausalito">Aug 28: Muir Woods & Sausalito</h1>
<p>On the Sunday I went on a bus tour to <a href="https://en.wikipedia.org/wiki/Muir_Woods_National_Monument">Muir Woods National Monument</a>, just a few kilometres north of the Golden Gate Bridge. It is a beautiful old-growth redwood forest. Despite the crowds, I found it a very reflective and peaceful experience. The natural scale and beauty of the place were really awe-inspiring.</p>
<figure class="image">
<a href="/assets/bay/muirwoods1.jpg" target="_blank"><img class="photo" src="/assets/bay/muirwoods1.jpg" alt="Dense redwood and Douglas fir forest like this stretches for miles through a gorgeous valley." /></a>
<figcaption>Dense redwood and Douglas fir forest like this stretches for miles through a gorgeous valley.</figcaption>
</figure>
<figure class="image">
<a href="/assets/bay/muirwoods2.jpg" target="_blank"><img class="photo" src="/assets/bay/muirwoods2.jpg" alt="I couldn't find the Ewok village, though." /></a>
<figcaption>I couldn't find the Ewok village, though.</figcaption>
</figure>
<p>I caught the ferry back across the Bay, which gave me a great view of Alcatraz Island and the former Federal Penitentiary, which I didn’t have time to check out. In the evening, I met up with <a href="http://jaan.io/">Jaan Altosaar</a> for dinner (after being generously introduced by <a href="https://twitter.com/rafaelcosman">Rafael Cosman</a>, who was in turn introduced to me by <a href="http://www.tomeveritt.se/">Tom</a>.)</p>
<figure class="image">
<a href="/assets/bay/alcatraz.jpg" target="_blank"><img class="photo" src="/assets/bay/alcatraz.jpg" alt="" /></a>
<figcaption></figcaption>
</figure>
<h1 id="aug-29--san-francisco-musem-of-modern-art">Aug 29: <img src="/assets/bay/sfmoma.png" width="80" /> (San Francisco Museum of Modern Art)</h1>
<figure class="image">
<a href="/assets/bay/moma.jpg" target="_blank"><img class="photo" src="/assets/bay/moma.jpg" alt="San Francisco Museum of Modern Art (SFMOMA)." /></a>
<figcaption>San Francisco Museum of Modern Art (SFMOMA).</figcaption>
</figure>
<p>SFMOMA has seven floors, each divided into numerous sections that treat the different schools of 20th and 21st century art. I reckon it would take at least 6 hours to do it all justice. I tried my best to visit every room, but fatigue set in after around 3-3.5 hours. I’m really glad I went, and highly recommend it. For a visual art philistine like me, this was both an educational and invigorating experience.</p>
<p>In most Australian museums and galleries, photography is explicitly prohibited; here, however, people were blazing away with smartphones and SLRs without inhibition. Unfortunately, I only ended up taking a handful of photos, mostly because my smartphone camera performs quite poorly in low light conditions.</p>
<p>This work – <a href="https://www.sfmoma.org/artwork/73.38"><em>The Verónica</em>, by Jay DeFeo</a> – made a big impression on me, and I stared at it for quite a few minutes. The photo doesn’t really capture its scale; it towers over you like a monolith, at 3.3 metres in height. The paint is so thickly layered it almost resembles a sculpture – the shapes seem to almost erupt out of the canvas. The gradations of color and the suggestion of movement also reminded me somewhat of a segment from <a href="https://www.youtube.com/watch?v=z4MQ7GzE6HY">Disney’s Fantasia (1940)</a>, which is a film I adored in my childhood.</p>
<figure class="image">
<a href="/assets/bay/veronica.jpg" target="_blank"><img class="photo" src="/assets/bay/veronica.jpg" alt="" /></a>
<figcaption></figcaption>
</figure>
<p>Readers that are familiar with Cormen et al.’s excellent <a href="https://mitpress.mit.edu/books/introduction-algorithms">Introduction to Algorithms</a> textbook will recognize one of Alexander Calder’s <em>Mobiles</em> below. It’s a great choice for the cover of a book about algorithms and data structures; an iconic and beautiful work of art in its own right, and also an instance (in the abstract) of a <a href="https://en.wikipedia.org/wiki/Tree_(data_structure)">tree</a>, one of the most fundamental and important data structures in computer science.</p>
<figure class="image">
<a href="/assets/bay/mobiles.jpg" target="_blank"><img class="photo" src="/assets/bay/mobiles.jpg" alt="There was another, larger mobile hanging in the lobby." /></a>
<figcaption>There was another, larger mobile hanging in the lobby.</figcaption>
</figure>
<p>In the evening, I had dinner with <a href="http://danielfilan.com/">Daniel Filan</a> in Berkeley, before getting ready for the CFAR workshop. And that’s the highlights of my first week in the Bay! In the following post, I’ll document what I learnt from the workshop itself. Stay tuned.</p>
<h1 id="meta">Meta</h1>
<p>In the interests of becoming a better blogger, I’ll enumerate a few observations and things to work on, that occurred to me while writing up this post:</p>
<ol>
<li>This post took <em>way</em> too long to make. In my defence, I was rushing to get my MSc thesis finished from September-November 2016. Nevertheless, I need to be pushing out blog posts much more frequently if I want to ever join the ranks of the <a href="http://slatestarcodex.com">prolific rationalist blogosphere</a>. The key, I think, is to lower my inhibitions about writing stuff online; this will come with practice, I expect.</li>
<li>In retrospect, it occurs to me that I should be less coy about taking selfies with people: I actually met a bunch of awesome people, but I don’t have photos with which to remember those meetings, which is a bit of a shame. I’ll endeavour to be less shy about taking photos of/with people in future.</li>
<li>This post turned out a bit boring for my taste – i.e., a bit light on the earth-shattering insights and a bit heavy on ‘Today I did X, Y, and Z. Cue photos’. Of course, this relates to point #1: I’m probably holding my writing to unreasonable standards here. And writing up so long after the event gives it a bit of an oblique perspective; it naturally lacks some of the emotional charge and excitement that it might have had, had I written it up immediately.</li>
</ol>
Wed, 28 Dec 2016 10:37:00 +0000
http://aslanides.github.io/travel/2016/12/28/bay-area/
http://aslanides.github.io/travel/2016/12/28/bay-area/travelMarginalization with Einstein<p>I discovered this useful trick while I was recently working on an assignment question for Christfried Webers’ <a href="https://sml.forge.nicta.com.au/isml15/">excellent Introduction to Statistical Machine Learning course</a>. The idea is to simplify implementations of the <a href="https://en.wikipedia.org/wiki/Belief_propagation">belief propagation</a> algorithm on acyclic factor graphs, using NumPy’s Einstein summation API.</p>
<p>Changelog:</p>
<ul>
<li>2016-11-28: <em>Add references.</em></li>
</ul>
<h1 id="einstein-notation">Einstein notation</h1>
<p>The <a href="https://en.wikipedia.org/wiki/Einstein_notation">Einstein summation notation</a> is a really convenient way to represent operations on multidimensional matrices. It’s primarily used when manipulating tensors in differential geometry and in relativistic field theories in physics, but we’ll be using it to do operations on discrete joint distributions, which are basically big normalized matrices. The distinction between a tensor and a matrix is that the <a href="https://en.wikipedia.org/wiki/Tensor">tensor</a> has to behave in a certain way under coordinate transformations; for our purposes, this (physically motivated) constraint is lifted.</p>
<ul>
<li>
<p>All expressions are in terms of the elements of multidimensional objects. If an object is <script type="math/tex">K</script>-dimensional, it must have <script type="math/tex">K</script> distinct indices. For example, if <script type="math/tex">\boldsymbol{A}</script> is an <script type="math/tex">N \times N</script> matrix of real numbers, it is a <script type="math/tex">2</script>-dimensional object (which we will call rank-2 from now on), and we write it down as <script type="math/tex">A^{ij}</script>.</p>
</li>
<li>
<p>If an index occurs both as a subscript and as a superscript in an expression, then it is summed over.</p>
</li>
</ul>
<p>Following these rules, we can look at some simple examples. The inner product of two vectors <script type="math/tex">\boldsymbol{u},\boldsymbol{v}\in\mathbb{R}^N</script> is given by the shorthand</p>
<script type="math/tex; mode=display">u^iv_i=\sum_{i=1}^{N}u_iv_i.</script>
<p>This kind of operation is called a <em>contraction</em>, since the result has lower rank than its inputs (in the case of the inner product, these are 0 and 1, respectively). Notice that we can construct objects of higher rank out of lower rank ones quite easily. The outer product</p>
<script type="math/tex; mode=display">A^{i}_{j}=u^iv_j</script>
<p>is an <script type="math/tex">N\times N</script> matrix whose <script type="math/tex">(i,j)^{\text{th}}</script> element is given by the product <script type="math/tex">u_iv_j</script>. Note that the dual vectors <script type="math/tex">v^{i}</script> and <script type="math/tex">v_{i}</script> are in general related by the <a href="https://en.wikipedia.org/wiki/Metric_tensor">metric</a> tensor: <script type="math/tex">v^{i} = g^{ij}v_j</script>. In our case, the metric is the identity, so they are element-wise equal, and the only distinction is that one is a column vector and the other is a row vector. Hence, in this context, the matrices <script type="math/tex">A_{ij}</script> and <script type="math/tex">A^{i}_{j}</script> are equal, since to get from one to the other we left-multiply by the identity.</p>
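<p>Both of these operations translate directly into NumPy’s <code class="highlighter-rouge">einsum</code> notation, which we’ll use in earnest later in this post. A quick sketch (the array size here is arbitrary):</p>

```python
import numpy as np

u = np.random.rand(5)
v = np.random.rand(5)

# Inner product u^i v_i: contract over the shared index i (rank-0 result).
inner = np.einsum('i,i->', u, v)

# Outer product A^i_j = u^i v_j: no contraction, so the result is rank 2.
outer = np.einsum('i,j->ij', u, v)
```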
<p>Matrix-vector multiplication looks like</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\left[Ax\right]_{jk} &= \sum_{i}A_{ijk} x_i\\
&= A_{ijk} x^i.
\end{aligned} %]]></script>
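<p>In <code class="highlighter-rouge">numpy.einsum</code> terms, this contraction over the first dimension looks like the following (the sizes here are arbitrary):</p>

```python
import numpy as np

L, M, N = 2, 3, 4
A = np.random.rand(L, M, N)  # a rank-3 object
x = np.random.rand(L)

# [Ax]_{jk} = A_{ijk} x^i: sum over the first axis of A.
Ax = np.einsum('ijk,i->jk', A, x)
```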
<p>Here’s where the power of the notation starts to become apparent: note that in the above example, <script type="math/tex">\boldsymbol{A}\in\mathbb{R}^{L\times M\times N}</script> is a rank-3 matrix. By summing over the <script type="math/tex">i</script> index, we are contracting <script type="math/tex">\boldsymbol{A}</script> with <script type="math/tex">x</script> along its first dimension. To illustrate this further, consider the multiplication of two matrices <script type="math/tex">A\in\mathbb{R}^{M\times N}</script> and <script type="math/tex">B \in\mathbb{R}^{P\times N}</script>; clearly the product <script type="math/tex">AB^T</script> is well-defined but <script type="math/tex">AB</script> is not. With the Einstein notation, the indices tell us explicitly which dimensions to sum over. Hence</p>
<script type="math/tex; mode=display">A_{ji}B^{i}_{k}</script>
<p>is well defined, since it sums over the <script type="math/tex">\mathbb{R}^N</script> ‘slot’, whereas</p>
<script type="math/tex; mode=display">A_{ij}B^{i}_{k}</script>
<p>is not well-defined. Note that flipping the order of two indices amounts to taking the transpose. As we will see, this feature is really helpful when it comes to marginalizing a multidimensional distribution.</p>
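<p><code class="highlighter-rouge">numpy.einsum</code> enforces exactly this bookkeeping: a shared label must have the same size in every operand. A sketch, with <script type="math/tex">M=4,\ N=3,\ P=5</script>:</p>

```python
import numpy as np

M, N, P = 4, 3, 5
A = np.random.rand(M, N)
B = np.random.rand(P, N)

# A_{ji} B^i_k: the shared label i sits on the R^N slot of both
# operands, so the contraction is well-defined; it computes A @ B.T.
good = np.einsum('ji,ki->jk', A, B)

# A_{ij} B^i_k: here i would need size M in one operand and size P
# in the other, so NumPy rejects the expression.
try:
    np.einsum('ij,ik->jk', A, B)
    mismatch_caught = False
except ValueError:
    mismatch_caught = True
```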
<p>Some final examples: we can compute the trace of a matrix by contracting with the Kronecker delta:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\text{tr}\{A\} &=\delta^{i}_{j} A_{i}^{j}\\
&=A_{i}^{i}.
\end{aligned} %]]></script>
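<p>In <code class="highlighter-rouge">einsum</code>, a repeated label on a single operand performs the same self-contraction:</p>

```python
import numpy as np

A = np.random.rand(4, 4)

# A_i^i: repeating the index contracts A with itself, giving the trace.
tr = np.einsum('ii->', A)
```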
<p>Given a vector <script type="math/tex">\boldsymbol{f}</script> whose entries are functions of some variables <script type="math/tex">\boldsymbol{x}</script>, we can write down the Jacobian simply as</p>
<script type="math/tex; mode=display">J^i_j = \partial_j\ f^i,</script>
<p>where we identify <script type="math/tex">\partial_j \equiv \frac{\partial}{\partial x_j}</script>. Bonus: if you’re a statistician or a computer scientist, you now have all the tools you need to parse <a href="https://en.wikipedia.org/wiki/Standard_Model">quantum field-theoretic Lagrangians</a>:</p>
<script type="math/tex; mode=display">\mathcal{L}_{\text{EW}} =
\sum_\psi\bar\psi\gamma^\mu
\left(\imath\partial_\mu-g^\prime{1\over2}Y_\mathrm{W}B_\mu-g{1\over2}\vec\tau_\mathrm{L}\vec W_\mu\right)\psi.</script>
<p>Ok, back to the task at hand. I hope I’ve convinced you that the set of permissible operations under these rules (more formally known as the <a href="https://en.wikipedia.org/wiki/Ricci_calculus">Ricci calculus</a>) generalizes matrix algebra. Let’s see how we can use these rules to make marginalization cleaner when doing belief propagation.</p>
<h1 id="discrete-distributions-and-marginalization-in-numpy">Discrete distributions and marginalization in numpy</h1>
<p>Before we get to belief propagation, let’s talk about the standard set-up: we have a discrete multi-dimensional probability mass function <script type="math/tex">p</script> over a bunch of random variables <script type="math/tex">X_1,\dots,X_K</script>, where each of the <script type="math/tex">X_k</script> has its own finite sample space <script type="math/tex">\Omega_k</script>, and in general <script type="math/tex">\Omega_i \neq \Omega_j</script> if <script type="math/tex">i\neq j</script>. For example, we could have <script type="math/tex">\Omega_1=\{0,1,2,3\},\ \Omega_2=\{0,\dots,255\},\ \Omega_3=\{\text{blue},\text{green},\text{red}\}</script>, etc.</p>
<p>The two most basic operations on <script type="math/tex">p</script> are conditioning and marginalization. To marginalize, we wish to compute, for example,</p>
<script type="math/tex; mode=display">p(x_1)=\sum_{x_2}\cdots\sum_{x_K}p(x_1,\dots,x_K)</script>
<p>In <a href="http://www.numpy.org">NumPy</a>, we represent the probability mass function <script type="math/tex">p</script> as an <script type="math/tex">|\Omega_1| \times |\Omega_2| \times \dots \times |\Omega_K|</script> matrix <script type="math/tex">P</script> of real numbers in the interval <script type="math/tex">[0,1]</script>. To compute the sums above ‘by brute force’, we would sum over all the dimensions except the first. Similar to Matlab, this operation is most efficient in NumPy if it is vectorized. The best way to accomplish a simple sum like this is with <code class="highlighter-rouge">numpy.sum(...)</code>, specifying the dimensions to sum over. We can also do this with <code class="highlighter-rouge">numpy.einsum(...)</code>:</p>
<div class="language-python highlighter-rouge"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">numpy</span> <span class="kn">as</span> <span class="nn">np</span>
<span class="n">A</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">rand</span><span class="p">(</span><span class="mi">3</span><span class="p">,</span><span class="mi">5</span><span class="p">,</span><span class="mi">4</span><span class="p">,</span><span class="mi">3</span><span class="p">,</span><span class="mi">9</span><span class="p">,</span><span class="mi">7</span><span class="p">,</span><span class="mi">2</span><span class="p">)</span> <span class="c"># make a random 7-dimensional array</span>
<span class="n">A</span> <span class="o">/=</span> <span class="n">np</span><span class="o">.</span><span class="nb">sum</span><span class="p">(</span><span class="n">A</span><span class="p">)</span> <span class="c"># normalize</span>
<span class="n">np</span><span class="o">.</span><span class="n">einsum</span><span class="p">(</span><span class="s">'abcdefg->d'</span><span class="p">,</span><span class="n">A</span><span class="p">)</span> <span class="c"># marginalize :)</span>
</code></pre>
</div>
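To check that the two routes agree, here is a small sketch (the array shape and target axis are arbitrary choices of mine):

```python
import numpy as np

A = np.random.rand(3, 5, 4, 3, 9, 7, 2)  # random 7-dimensional array
A /= A.sum()                              # normalize to a valid pmf

# Marginalize onto the fourth dimension ('d') in two equivalent ways.
p_sum = A.sum(axis=(0, 1, 2, 4, 5, 6))
p_einsum = np.einsum('abcdefg->d', A)

assert p_sum.shape == (3,)
assert np.allclose(p_sum, p_einsum)
assert np.isclose(p_sum.sum(), 1.0)  # a marginal of a pmf still sums to 1
```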
<p>Note that a repeated index does not imply a summation when both occurrences are superscripts or both are subscripts. This lets us easily define element-wise products. For example:</p>
<script type="math/tex; mode=display">A_{ijk} = B_{ij} C_{ik}.</script>
<div class="language-python highlighter-rouge"><pre class="highlight"><code><span class="n">B</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">rand</span><span class="p">(</span><span class="mi">3</span><span class="p">,</span><span class="mi">2</span><span class="p">)</span>
<span class="n">C</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">rand</span><span class="p">(</span><span class="mi">3</span><span class="p">,</span><span class="mi">5</span><span class="p">)</span>
<span class="n">A</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">einsum</span><span class="p">(</span><span class="s">'ij,ik->ijk'</span><span class="p">,</span><span class="n">B</span><span class="p">,</span><span class="n">C</span><span class="p">)</span>
<span class="c"># A has shape (3,2,5)</span>
</code></pre>
</div>
<p>As long as the shapes match up, we can combine two rank-2 tensors into a rank-3 tensor without contracting, i.e. without summation. This is key when we want to compute the product of messages in belief propagation. More tips and tricks can be found at the <a href="http://docs.scipy.org/doc/numpy/reference/generated/numpy.einsum.html">einsum documentation page</a>.</p>
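In belief propagation we want both operations at once: multiply incoming messages into a factor element-wise, then contract out every index but the target. `einsum` fuses this into one call. A small sketch (the shapes and names here are my own invention):

```python
import numpy as np

f = np.random.rand(4, 2, 3)  # a factor over three variables
m1 = np.random.rand(2)       # message over the second variable
m2 = np.random.rand(3)       # message over the third variable

# Element-wise multiply, then sum out indices j and k, in one call:
mu = np.einsum('ijk,j,k->i', f, m1, m2)

# Brute-force equivalent via broadcasting:
mu_check = (f * m1[None, :, None] * m2[None, None, :]).sum(axis=(1, 2))
assert np.allclose(mu, mu_check)
```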
<h1 id="belief-propagation">Belief propagation</h1>
<p>Following Bishop [1], we have a graphical model of discrete variables <script type="math/tex">X=\left(x_1,\dots,x_N\right)</script> that induces some factorization of the joint distribution <script type="math/tex">p(X)=\prod_{s}f_s\left(X_s\right),</script> where the factors <script type="math/tex">f_s</script> are functions of the variable subsets <script type="math/tex">X_s \subset X</script>. When it comes to marginalization:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned} p(x) &= \sum_{X\backslash x} p(X)\\ &= \sum_{X\backslash x} \prod_{s\in\text{ne}(x)} F_s\left(x,X_s\right)\\ &= \prod_{s\in\text{ne}(x)}\sum_{X_s}F_s\left(x,X_s\right)\\ &:= \prod_{s\in\text{ne}(x)} \mu_{f_s \to x}(x), \end{aligned} %]]></script>
<p>where <script type="math/tex">F_s(x,X_s)</script> is the product of factors in the subtree annexed by <script type="math/tex">x</script>, and we interpret the subtree marginals <script type="math/tex">\mu_{f_s\to x}(x)</script> as “messages”, which satisfy the mutually recursive relations:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned} \mu_{f_s\to x}(x) &= \sum_{x_1}\cdots\sum_{x_M}f_s\left(x,x_1,\dots,x_M\right)\prod_{m\in\text{ne}\left(f_s\right)\backslash x}\mu_{x_m \to f_s}\left(x_m\right)&&(1)\\ \mu_{x_m\to f_s}\left(x_m\right) &= \prod_{l\in\text{ne}\left(x_m\right)\backslash\left(f_s\right)}\mu_{f_l\to x_m}\left(x_m\right),&&(2) \end{aligned} %]]></script>
<p>where for all leaf factors and nodes we set <script type="math/tex">\mu_{f_l \to x} = f_l(x)</script> and <script type="math/tex">\mu_{x_l \to f} = 1</script>, respectively. Given a joint distribution <script type="math/tex">p</script> corresponding to a graphical model, we can efficiently marginalize by evaluating the messages in equations (1) and (2). The subject of this post is the implementation and evaluation of these messages. Since we are dealing with discrete variables, each factor <script type="math/tex">f_s\left(x_1,\dots,x_M\right)</script> is represented by a <script type="math/tex">K_1 \times \dots \times K_M</script> array, where <script type="math/tex">K_m</script> is the size of the domain of <script type="math/tex">x_m</script>.</p>
<h1 id="putting-it-together">Putting it together</h1>
<p>Note that each of the messages <script type="math/tex">\mu(x_m)</script> is a marginal distribution with respect to <script type="math/tex">x_m</script>, and so is a vector of size <script type="math/tex">K_m</script>. The multiplication in equations <script type="math/tex">(1)</script> and <script type="math/tex">(2)</script> then corresponds to a bunch of elementwise matrix-vector products. This makes them amenable to the Einstein treatment. Rewriting equation <script type="math/tex">(1)</script> with the Einstein notation, we have</p>
<script type="math/tex; mode=display">\left[\mu_{f \to x}\right]_{j} = f_{j,i_1,\dots,i_M}\left[\mu_{x_1 \to f}\right]^{i_1}\dots\left[\mu_{x_M \to f}\right]^{i_M}.</script>
<p>Similarly for equation <script type="math/tex">(2)</script>:</p>
<script type="math/tex; mode=display">\left[\mu_{x \to f}\right]_{i} = \left[\mu_{f_1 \to x}\right]_{i} \dots \left[\mu_{f_L \to x}\right]_{i}.</script>
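A minimal NumPy sketch of both message types (function names are mine, and the target variable is assumed to occupy the factor's first axis; the linked demo is the full version):

```python
import numpy as np

def factor_to_variable(factor, incoming):
    """Eq. (1): contract a factor array against the incoming
    variable-to-factor messages on all neighbours but the target.

    factor:   array of shape (K, K_1, ..., K_M), target variable first.
    incoming: list of M message vectors; incoming[m] has length K_{m+1}.
    """
    letters = 'abcdefghij'[:factor.ndim]
    subscripts = letters + ',' + ','.join(letters[1:]) + '->' + letters[0]
    return np.einsum(subscripts, factor, *incoming)

def variable_to_factor(incoming):
    """Eq. (2): element-wise product of the incoming factor-to-variable
    messages from all neighbours but the target factor."""
    return np.prod(incoming, axis=0)

# Tiny example: f(x, x1, x2) with |X|=2, |X1|=3, |X2|=4.
f = np.random.rand(2, 3, 4)
msgs = [np.ones(3), np.ones(4)]  # leaf messages are all-ones
mu = factor_to_variable(f, msgs)
assert np.allclose(mu, f.sum(axis=(1, 2)))  # reduces to plain marginalization
```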
<p>A short demo on a simple Bayesian network can be found <a href="https://github.com/aslanides/dag-inference">here</a>.</p>
<hr />
<h1 id="references">References</h1>
<p>[1] C. M. Bishop. <em>Pattern Recognition and Machine Learning</em>. Springer, 2006</p>
<p><em>Posted Sun, 28 Feb 2016 under <a href="http://aslanides.github.io/machine_learning/2016/02/28/marginalization-einstein/">machine_learning</a>.</em></p>
<hr />
<h1 id="linear-regression--hello-world">Linear regression &amp; Hello World!</h1>
<p>Hello, world! Starting this blog has been long overdue. I thought I’d ease myself into it with something easy and uncontroversial, so here’s my take on linear regression. We start with the standard (frequentist) maximum-likelihood approach and show that Gaussian noise induces the familiar least-squares result; we then show the correspondence between isotropic Gaussian priors and <script type="math/tex">l_2</script> regularizers, before playing with some Bayesian updating in <em>Mathematica</em>. Note that these results are all well-known and discussed in detail in Bishop [1]; my motivation for reproducing them here is that I hadn’t encountered a concise and complete online write-up that I found satisfying.</p>
<p>Changelog:</p>
<ul>
<li>2016-03-30: <em>Edits for readability, and added a section on Bayesian regression.</em></li>
<li>2016-11-28: <em>Some more minor edits, and moved the code to its own <a href="http://github.com/aslanides/bayes-regression">GitHub repo</a>.</em></li>
</ul>
<h1 id="setup">Setup</h1>
<p>In the standard regression task we are given labelled data</p>
<script type="math/tex; mode=display">\mathcal{D} \stackrel{\cdot}{=} \left\{\left(x_{i},y_{i}\right)\right\}_{i=1}^{N},</script>
<p>with each of the <script type="math/tex">y_i\in\mathbb{R}</script>, and <script type="math/tex">x_{i}\in\mathcal{X}</script> drawn i.i.d. from some unknown stochastic process <script type="math/tex">\mu(x,y)</script>. Our objective is to learn a linear model <script type="math/tex">f:\mathcal{X}\to\mathbb{R}</script> of the form</p>
<script type="math/tex; mode=display">f(x;w,\phi)=w^T\phi(x),</script>
<p>where <script type="math/tex">w\in\mathbb{R}^D</script> are our model weights (parameters), and the basis functions
<script type="math/tex">\phi : \mathcal{X} \to \mathbb{R}^D</script>
are known and fixed. Now, let’s assume the data contains additive zero-mean Gaussian noise, so that</p>
<script type="math/tex; mode=display">y(x)=f(x)+\epsilon \qquad \text{with }\epsilon\sim\mathcal{N}(0,\beta^{-1}),</script>
<p>and so the probability density of the targets conditioned on the inputs is</p>
<script type="math/tex; mode=display">p(y|x,w) = \mathcal{N}\left(w^T\phi(x),\beta^{-1}\right).</script>
<p>The Gaussian noise assumption is motivated by the central limit theorem, and it helps in the Bayesian setting, since the Gaussian is its own conjugate prior. Note here that we make the conditional dependence on <script type="math/tex">x</script> and <script type="math/tex">w</script> explicit, and that we assume that the precision <script type="math/tex">\beta</script> is known, and that the basis functions <script type="math/tex">\phi</script> are well-chosen in some sense.</p>
<h1 id="maximum-likelihood-solution">Maximum Likelihood Solution</h1>
<p>Given that our dataset is drawn i.i.d., the likelihood is given by</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
p(\mathcal{D}|w) &= \prod_{i=1}^{N}p(y_i|x_i,w)p(x_i)\\
&= \prod_{i=1}^{N}p(y_i|x_i)\prod_{j=1}^{N}p(x_j)\\
& \propto \prod_{i=1}^{N}p(y_i|x_i)\\
&= \prod_{i=1}^{N}\mathcal{N}\left(w^T\phi(x_i),\beta^{-1}\right).\\
\end{aligned} %]]></script>
<p>Our objective now is to learn the weights <script type="math/tex">w</script> that maximize this likelihood, taking the variance of the noise <script type="math/tex">\beta^{-1}</script> to be fixed. Notice that since the <script type="math/tex">p(x_{j})</script> don’t depend on the parameters <script type="math/tex">w</script>, we can neglect them in our objective function. The maximum likelihood (ML) solution for <script type="math/tex">w</script> is</p>
<script type="math/tex; mode=display">\hat{w}_{\text{ML}} = \arg\max_{w}p(\mathcal{D}|w).</script>
<p>Now, <script type="math/tex">\log(\cdot)</script> is a monotonically increasing function, so</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\hat{w}_{\text{ML}}&=\arg\max_{w}\log p(\mathcal{D}|w)\\
&=\arg\max_{w}\log\prod_{i=1}^{N}\mathcal{N}\left(w^T\phi(x_i),\beta^{-1}\right)\\
&=\arg\max_{w}\sum_{i=1}^{N}\log\mathcal{N}\left(w^T\phi(x_i),\beta^{-1}\right)\\
&=\arg\max_{w}\sum_{i=1}^{N}\log\left(\frac{1}{\sqrt{2\pi\beta^{-1}}}\exp\left\{-\frac{\left(y_i-w^T\phi(x_i)\right)^2}{2\beta^{-1}}\right\}\right)\\
&=\arg\min_{w}\sum_{i=1}^{N}\left(y_i-w^T\phi(x_i)\right)^2.\qquad (1)\\
\end{aligned} %]]></script>
<p>This shows the correspondence between maximizing the likelihood under Gaussian noise and minimizing the squared error. Minimizing the sum of squared errors also has a nice geometric interpretation: it corresponds to finding the point in the subspace spanned by the basis functions that minimizes the Euclidean distance to the vector of targets.</p>
<p>This is the standard frequentist result. The objective is a convex quadratic program, and has the well-known closed-form solution</p>
<script type="math/tex; mode=display">\hat{w}_{\text{ML}}=\left(\Phi^T\Phi\right)^{-1}\Phi^Ty,</script>
<p>where <script type="math/tex">y</script> is the vector of targets, and we have defined the design matrix <script type="math/tex">\Phi\in\mathbb{R}^{N\times D}</script> so that its <script type="math/tex">i^{\text{th}}</script> row is given by <script type="math/tex">\phi(x_i)</script>.</p>
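As a quick sanity check on this formula, here is my own sketch (the synthetic data and polynomial basis are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y = 2x - 1 plus Gaussian noise with precision beta = 25.
N, beta = 100, 25.0
x = rng.uniform(-1, 1, size=N)
y = 2 * x - 1 + rng.normal(0, beta ** -0.5, size=N)

# Design matrix for the polynomial basis phi(x) = (1, x, x^2).
Phi = np.stack([np.ones(N), x, x ** 2], axis=1)

# w_ML = (Phi^T Phi)^{-1} Phi^T y, computed stably via least squares.
w_ml = np.linalg.lstsq(Phi, y, rcond=None)[0]
```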
<p>Maximum likelihood is notorious for overfitting, since by definition it finds the model for which our particular dataset is most probable; without tweaks, such a model will not generalize well to unseen data. The tweak that is typically introduced in the frequentist framework is an <script type="math/tex">l_2</script> regularizer on the weights; this will tend to bias us towards simpler models with small coefficients. Note that there are numerous other ways in which we can penalize complex models; for example, we can use the MDL principle, or try to make <script type="math/tex">w</script> sparse by using an <script type="math/tex">l_1</script> regularizer instead.</p>
<h1 id="map-inference">MAP Inference</h1>
<p>Interestingly, a certain class of prior over <script type="math/tex">w</script> induces an <script type="math/tex">l_2</script> regularizer when we calculate the posterior. This gives a stronger intuition for introducing this term into the objective function. Consider the isotropic Gaussian prior with hyperparameter <script type="math/tex">\alpha>0</script></p>
<script type="math/tex; mode=display">p(w|\alpha) = \mathcal{N}\left(0,\alpha^{-1}I_{D}\right),</script>
<p>where <script type="math/tex">I_D</script> is the <script type="math/tex">D\times D</script> identity matrix (recall <script type="math/tex">w\in\mathbb{R}^D</script>). Because a Gaussian is its own conjugate prior, we know that the posterior is a Gaussian. We can now easily compute the weights that maximize this posterior, conditioned on seeing some data <script type="math/tex">\mathcal{D}</script>. This is the maximum <em>a posteriori</em> (MAP) point estimate of <script type="math/tex">w</script>. Using the prior above, we can easily see that</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\hat{w}_{\text{MAP}}&=\arg\max_{w}p(w|\mathcal{D})\\
&=\arg\max_{w}\frac{p(\mathcal{D}|w)p(w)}{p(\mathcal{D})}\\
&=\arg\max_{w}p(\mathcal{D}|w)p(w)\\
&=\arg\max_{w}\log\left(p(\mathcal{D}|w)p(w)\right)\\
&=\arg\max_{w}\left(\log p(\mathcal{D}|w) + \log p(w)\right)\\
&=\arg\min_{w}\left(\sum_{i=1}^{N}\left(y_i-w^T\phi(x_i)\right)^2+\lambda\|w\|^2\right),\\
\end{aligned} %]]></script>
<p>where the constant <script type="math/tex">\lambda(\alpha,\beta)</script> is dependent only on the hyperparameters <script type="math/tex">\alpha</script> and <script type="math/tex">\beta</script>. The second term in our objective is now clearly the <script type="math/tex">l_2</script> regularizer we’ve been looking for. This objective is also convex, and has a simple closed-form solution analogous to the simple ML case.</p>
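A sketch of the MAP solution in NumPy (my own code, not from the post; working through the log-posterior gives <script type="math/tex">\lambda = \alpha/\beta</script>, and the hyperparameter values below are arbitrary):

```python
import numpy as np

def map_weights(Phi, y, alpha, beta):
    """MAP / ridge solution: w = (Phi^T Phi + (alpha/beta) I)^{-1} Phi^T y."""
    D = Phi.shape[1]
    lam = alpha / beta
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(D), Phi.T @ y)

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, size=50)
y = 2 * x - 1 + rng.normal(0, 0.2, size=50)
Phi = np.stack([np.ones_like(x), x], axis=1)

w_map = map_weights(Phi, y, alpha=1.0, beta=25.0)
# As alpha -> 0 the regularizer vanishes and we recover the ML solution.
w_ml = np.linalg.lstsq(Phi, y, rcond=None)[0]
```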
<h1 id="full-bayes">Full Bayes</h1>
<p>With MAP inference we start with a relatively ignorant prior over our model parameters, learn from a big batch of data, and end up with a point estimate. This clearly is not useful if we want to learn online with a sequential learning scheme, in which we update our belief <script type="math/tex">p(w)</script> with each new datum we receive. For each new data point <script type="math/tex">d=(x,y)\in\mathcal{D}</script> that we receive, we compute the posterior from Bayes rule, using as a prior the current state of our belief:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
p(w|d)&=\frac{p(d|w)p(w)}{p(d)}\\
&=\frac{p(d|w)p(w)}{\int_{\mathcal{W}}p(d|w)p(w)\text{d}w},
\end{aligned} %]]></script>
<p>where in our case <script type="math/tex">\mathcal{W}=\mathbb{R}^D</script>. Using Gaussian priors and likelihoods, we ensure that our posterior is also Gaussian, and so we can write down the posterior directly:</p>
<script type="math/tex; mode=display">p(w|\mathcal{D})=\mathcal{N}(w|m_N,S_N)</script>
<p>where, as it turns out (Equations (3.53) and (3.54) in Bishop),</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
m_N &= \beta S_N \Phi^T y\\
S_N^{-1} &= \alpha I + \beta \Phi^T\Phi.\\
\end{aligned} %]]></script>
<p>We now have the ingredients we need to implement a Bayesian updating scheme. The <em>Mathematica</em> notebook can be found <a href="http://github.com/aslanides/bayes-regression">here</a>.</p>
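For those without <em>Mathematica</em>, a minimal NumPy translation of the same update (the synthetic data, basis, and hyperparameters are arbitrary choices of mine):

```python
import numpy as np

def posterior(Phi, y, alpha, beta):
    """Posterior N(m_N, S_N) over the weights, Bishop eqs. (3.53)-(3.54)."""
    D = Phi.shape[1]
    S_N = np.linalg.inv(alpha * np.eye(D) + beta * Phi.T @ Phi)
    m_N = beta * S_N @ Phi.T @ y
    return m_N, S_N

# Synthetic 1-D data with a linear basis phi(x) = (1, x).
rng = np.random.default_rng(2)
alpha, beta = 2.0, 25.0
x = rng.uniform(-1, 1, size=20)
y = 0.3 + 0.5 * x + rng.normal(0, beta ** -0.5, size=20)
Phi = np.stack([np.ones_like(x), x], axis=1)

m_N, S_N = posterior(Phi, y, alpha, beta)
# With more data, m_N concentrates around the true weights (0.3, 0.5).
```

A sequential version simply feeds rows in one at a time, since the update only accumulates `Phi.T @ Phi` and `Phi.T @ y`.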
<table>
<tr>
<td>
<img src="/figures/plt1.png" width="300" />
</td>
<td>
<img src="/figures/plt2.png" width="300" />
</td>
<td>
$$\ \dots\ $$
</td>
<td>
<img src="/figures/plt20.png" width="300" />
</td>
</tr>
<tr>
<td>
<center>(a)</center>
</td>
<td>
<center>(b)</center>
</td>
<td>
</td>
<td>
<center>(c)</center>
</td>
</tr>
</table>
<p>Figure: Contour plots of the distribution <script type="math/tex">p(w)</script>. (a) Isotropic prior. (b) Posterior after updating on one data point. (c) Posterior after updating on 20 data points. The white <script type="math/tex">X</script> represents the ground truth.</p>
<h1 id="predictive-distribution">Predictive distribution</h1>
<p>We can also compute the predictive distribution, which represents our uncertainty in the value of <script type="math/tex">y</script> given an input <script type="math/tex">x</script> and our experience <script type="math/tex">\mathcal{D}</script>. To obtain it, we marginalize out the parameters <script type="math/tex">w</script> using the posterior <script type="math/tex">p(w|\mathcal{D})</script>:</p>
<script type="math/tex; mode=display">p(y|x,\mathcal{D}) = \int_\mathcal{W}p(y|x,w)\,p(w|\mathcal{D})\,\text{d}w.</script>
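For the Gaussian model above this integral is analytic: the predictive is again Gaussian, with mean <script type="math/tex">m_N^T\phi(x)</script> and variance <script type="math/tex">\beta^{-1}+\phi(x)^T S_N \phi(x)</script> (Bishop eqs. (3.58) and (3.59)). A sketch, with my own setup code for the posterior:

```python
import numpy as np

def predictive(phi_x, m_N, S_N, beta):
    """Predictive mean and variance at a single feature vector phi(x)."""
    mean = m_N @ phi_x
    var = 1.0 / beta + phi_x @ S_N @ phi_x
    return mean, var

# Posterior for a linear basis phi(x) = (1, x).
rng = np.random.default_rng(3)
alpha, beta = 2.0, 25.0
x = rng.uniform(-1, 1, size=30)
y = 0.3 + 0.5 * x + rng.normal(0, beta ** -0.5, size=30)
Phi = np.stack([np.ones_like(x), x], axis=1)
S_N = np.linalg.inv(alpha * np.eye(2) + beta * Phi.T @ Phi)
m_N = beta * S_N @ Phi.T @ y

mean, var = predictive(np.array([1.0, 0.0]), m_N, S_N, beta)
# The predictive variance never falls below the noise floor 1/beta.
assert var > 1.0 / beta
```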
<p>Of course, the central issue with the Bayesian scheme (neglecting the computational/analytic difficulties arising from maintaining and updating a large distribution) is choosing the prior in a sensible and principled way. Note that the prior we used in the polynomial fitting example above was essentially chosen for convenience; though it is simple and thus reasonable according to Ockham’s principle, it is still chosen arbitrarily. Enter Solomonoff’s universal prior, which I’ll discuss more in a later post. <script type="math/tex">\square</script></p>
<hr />
<h1 id="references">References</h1>
<p>[1] C. M. Bishop. <em>Pattern Recognition and Machine Learning</em>. Springer, 2006</p>
<p><em>Posted Wed, 24 Feb 2016 under <a href="http://aslanides.github.io/machine_learning/2016/02/24/linear-regression/">machine_learning</a>.</em></p>