by Y Combinator, 7/21/2017
Doug Eck is a research scientist at Google and he’s working on Magenta, a project making music and art through machine learning.
If you want to learn more, you can check out magenta.tensorflow.org.
Craig Cannon [00:00] Hey, this is Craig Cannon, and you’re listening to Y Combinator’s podcast. Today’s episode is with Doug Eck. Doug’s a research scientist at Google, and he’s working on Magenta, which is a project making music and art through machine learning. Their goal is to basically create open-source tools and models that help creative people be even more creative. So if you want to learn more about Magenta, or get started using it, you can check out magenta.tensorflow.org. Alright, here we go. I wanted to start with the quote that you ended your I/O talk with because I feel like that might be helpful for some folks. It’s a Brian Eno quote and I will have the slightly longer version if that’s okay.
Doug Eck [00:34] Yeah. Good, yeah.
Craig Cannon [00:36] So yeah, it goes like this. “Whatever you now find weird, ugly, uncomfortable, and nasty about a new medium will surely become its signature. CD distortion, the jitteriness of digital video, the crap sound of 8-bit, all of these will be cherished and emulated as soon as they can be avoided. It’s the sound of failure. So much modern art is the sound of things going out of control. Out of a medium, pushing to its limits and breaking apart.” That’s how you ended your I/O talk.
Doug Eck [01:02] Correct.
Craig Cannon [01:03] What it kind of opened up for me was when you’re thinking about creating Magenta and all the projects therein as new mediums, how are you thinking about what’s going to be broken and what’s going to be created?
Doug Eck [01:19] The reason that I put that quote there I think is to be honest with the division between engineering, and research, and artistry, and to not think that what I’m doing is being a machine learning artist. But we’re trying to build interesting ways to make new kinds of art. I think it occurred to me, I read that quote, and I thought that’s it. No matter how hard Eastman or whomever invented the film camera. I’m sorry if that’s the wrong person. They clearly weren’t thinking of breakage, or they were trying to avoid certain kinds of breakage. I mean, you know guitar amplifiers aren’t supposed to distort. I thought well, what if we do that with machine learning? The first thing that you’re going to do if someone comes to you and says, “Here’s this really smart model that you can make art with,” what are you going to do? You’re going to try to show the world that it’s a stupid model, right? But maybe it’s smart enough that it’s kind of hard to make it stupid, so you get to have a lot of fun making it stupid.
Craig Cannon [02:17] I was playing with Quick, Draw! this morning with my girlfriend and what she was trying to do was make the most accurate picture that the computer wouldn’t recognize. Like, immediately out of the gate. She works in art and like yeah, doesn’t want to believe.
Doug Eck [02:30] That’s right. That’s a good intuition.
Craig Cannon [02:34] Yeah. Maybe the best way to start is then talk about what are you working on right now? What are you guys making?
Doug Eck [02:41] Right now we’re working on, god let me think. That’s a good question. We have this project called NSynth, which is trying to get deep learning models to generate new sounds. We’re working on a number of ways to make that better. I think one way to think about it is we have this latent space. To make that a little bit less of a buzzword, we have a kind of compressed space, a space that doesn’t have the ability to memorize the original audio, but it’s set up in such a way that we can try to regenerate some of that audio. In regenerating it, we don’t get back exactly what we started with. But hopefully we get something close. That space is set up so that we can move around in that space, and come into new points in that space, and actually listen to what’s there. Right now it’s quite slow to listen so to speak. We’re not able to do things in real time. We also would love to be at kind of a meta level, building models that can generate those embeddings, having trained on other data, so that you’re able to move around in that space in different ways. So we’re moving, we’re continuing to work with sound generation for music. We also are spending quite a bit of time on rethinking the music sequence generation work that we’re doing. We put out some models that were, by any reasonable account, primitive. I mean kind of very simple recurrent neural networks that generate MIDI from MIDI, that maybe use attention, that maybe have a little bit smarter ways to sample when doing inference, when generating. Now we’re actually taking seriously, wait a minute, what if we really look at large data sets of performed music? What if we actually start to care about expressive timing and dynamics, care deeply about polyphony, and really care about like not putting out kind of what you would consider a simple reference model, but actually what we think is super good? I think those are the things we’re focusing on. 
I think we’re trying to actually make things, really pull up quality, and make things that are better, more usable for people.
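As a rough illustration of "moving around" in a compressed space, the sketch below blends two embedding vectors and collects the points in between. It is a minimal sketch, not NSynth's code: the four-dimensional vectors, the names `z_trombone` and `z_flute`, and the use of plain linear interpolation are all assumptions made for the example. In a real system each point would be passed to a decoder to produce audio.

```python
import numpy as np

def interpolate(z_a, z_b, t):
    """Linear blend between two latent vectors; t=0 gives z_a, t=1 gives z_b."""
    z_a = np.asarray(z_a, dtype=float)
    z_b = np.asarray(z_b, dtype=float)
    return (1.0 - t) * z_a + t * z_b

# Hypothetical 4-dimensional embeddings standing in for two encoded sounds.
z_trombone = np.array([1.0, 0.0, 2.0, -1.0])
z_flute = np.array([0.0, 1.0, 0.0, 1.0])

# Five evenly spaced points along the path from one sound to the other.
path = [interpolate(z_trombone, z_flute, t) for t in np.linspace(0.0, 1.0, 5)]
```

The middle entry of `path` is the halfway point between the two embeddings, which is the kind of "new point in that space" described above.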
Craig Cannon [04:46] With all that supervised learning, are you going to create a web app that people will evaluate how good the music is? Because I heard a couple interviews with you before where that was the issue, right? Like, how do you know what’s good?
Doug Eck [04:58] Yeah. I’m pausing because that’s the big question I think in my mind is how do we evaluate these models? At least for Magenta, I haven’t felt like the quality of what we’ve been generating has been good enough to bother so to speak. You find it, you cherry pick, you find some good things. You’re like, “Okay, this model trains, and it’s interesting,” and now we kind of understand the API, the input/output, of what we’re trying to do. I would love, yeah, I don’t know how to solve this. Conceptually what we do, here’s what we do, right? We build a mobile app and we make it go viral. That’s what we do, right? Then once it’s viral, we just keep feeding all of this great art and music in. I used to do music recommendation. We just build a collaborative filter, which is a kind of way to recommend items to people based upon what they like. We’d start giving people what they like, and we pay attention to what they like, and we make the models better. So all we need to do is make that app go viral.
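For readers unfamiliar with the term, here is a minimal sketch of one common flavor of collaborative filter, item-based with cosine similarity. The `likes` matrix is made up for illustration, and nothing here is claimed to match any system Doug worked on.

```python
import numpy as np

# Rows are users, columns are items; 1 means the user liked the item.
# This tiny matrix is made up for illustration.
likes = np.array([
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
], dtype=float)

def item_similarity(matrix):
    """Cosine similarity between item columns."""
    norms = np.linalg.norm(matrix, axis=0)
    norms[norms == 0] = 1.0  # avoid division by zero for items nobody liked
    normalized = matrix / norms
    return normalized.T @ normalized

def recommend(matrix, user, k=1):
    """Rank items the user has not liked by similarity to items they did like."""
    scores = item_similarity(matrix) @ matrix[user]
    scores[matrix[user] > 0] = -np.inf  # hide items already liked
    return [int(i) for i in np.argsort(-scores)[:k]]
```

Calling `recommend(likes, 0)` suggests item 2 for user 0, because another user who liked the same items as user 0 also liked item 2.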
Craig Cannon [06:01] One simple thing.
Doug Eck [06:02] In fact, maybe someone in the Y Combinator world can help us do that, right?
Craig Cannon [06:07] Yeah, it’s like sliding between cow and trombone, what’s the best sound?
Doug Eck [06:10] Exactly, right. Maybe that particular web app is not the right answer. No, I mean I’m saying that as a joke, but I think, look at it this way. If we can find a way, or the community in general can find a way for machine generated media to be sort of out there for a large group of interested users to play with, I think we can learn from that signal and I think we can learn to improve. If we do, we’ll make quite a nice contribution to machine learning. We will learn to improve, based upon human feedback, to generate something of interest. So that’s a great goal. Today in this room, I wish I could tell you we had like a secret plan. Like, “Oh, he’s figured it out. The app’s going to launch tomorrow.” It’s really hard work.
Craig Cannon [06:54] Bleep, cut.
Doug Eck [06:56] Yeah, yeah exactly. Sorry.
Craig Cannon [06:59] Interesting, okay because I was wondering what kind of data you were getting back from artists? Do people just use all of your projects, all of the repos, to create things of their own interest, or are they pushing back valuable data to you?
Doug Eck [07:11] We’re getting some valuable data back and I think what we’re getting back, some of the signals that we’re getting back are giving us such an obvious direction for improvement. Like, why would I want to run a Python command to generate 1,000 MIDI files? That’s not what we do. You get that kind of feedback and you’re like okay, we wanted this command line version because we need it to be able to test some things. But if musicians are really going to use the music part of what we’re doing, we have to provide them with more fluid and more useful tools. There I think we’re still sitting with so many obvious hard problems to solve like integration with something like Ableton or really solid real-time I/O and things like that that we know what to work on. But I think we’ll get to the point pretty quickly where we’ll have something that solves the obvious problems, plugs in reasonably well to your workflow, and you can start to generate some things, and you can play with sound. Then we need to be much more careful about the questions we ask and how good we are at listening to how people use what we’re doing.
Craig Cannon [08:12] What are artists using it for at this point?
Doug Eck [08:15] Right now, most of what we’ve done so far had to do with music. If we look for a second away from music and look at Sketch-RNN, which is a model that learned to draw, we’ve actually seen quite a bit of, so first at a higher level, Sketch-RNN is a recurrent neural network trained on sketches to make sketches. The sketches came from a game that Google released called Quick, Draw! where people had 20 seconds to draw something to try to win at Pictionary with a computer, a classifier counterpart. We trained a model that can generate new cats, or dogs, or whatever. There’s some really cool classes in there. A cruise ship, that’s nice.
Craig Cannon [08:58] The one that always threw me was camouflage. It calls out camouflage all the time. I’m never, yeah.
Doug Eck [09:03] As if by definition, you can’t draw it, right?
Craig Cannon [09:05] Yeah. Nothing, pause for 20 seconds.
Doug Eck [09:09] Yeah, I actually won a Pictionary round with the word white and I just pointed at the paper. I’m like, “No way,” and she said, “White.” I’m like, “Yes.” You got to be kidding. Anyway, just kind of like a corollary to camouflage. We’ve seen artists start to sample from the model. We’ve seen artists using the model as a distance measure to look for weird examples because the model has an idea of what’s probable in the space. We’ve also seen artists just playing around with the raw data, and so there’s been a nice explosion there. I’m not expecting that artists really do a huge amount with this Quick, Draw! data because as cool as it is, these things were drawn in 20 seconds, right? There’s kind of a limit to how much we can do with them. On the music side, we’ve had a number of people playing with NSynth with just like dumps of samples from NSynth, so basically like a rudimentary synthesizer. There I’ve been surprised at the kind of… I would expect that if you’re really good at this, so like you’re Aphex Twin, or how about this? You want to be Aphex Twin, right? That you look at this and go, “Yeah, whatever. There are 50 other tools that I have that I can use.” But those are the people that we’ve found have been the most interested. Because I think we are generating some sounds that are new. So first, you can test. Someone pointed out on Hacker News you can take a few oscillators, and a noise generator, and make something new. But I think these are new in a way when you start sampling the space between a trombone and a flute or something like that. But these are new in a way that capture some very nice harmonic properties, capture some of the essence of some of the Brian Eno quote, are kind of broken, and glitchy, and edgy in a way. But that glitchiness is not the same as you would get from like digital clipping. The glitchiness sounds really harmonic. 
For example, Jesse on our team, Jesse Engel, he built some Ableton plugin where you’re listening to these notes, but you’re able to erase the beginnings of the notes. You erase the onsets, which is usually where most of the information is. Most of the information in a piano note is kind of that first percussive onset. But it’s the onsets that the model is doing such a great job of reproducing because it gradually moves away from, in times the temporal embedding and the noise kind of adds up as we move through the embedding in time. So it’s the tails of these notes that start to get ringy, and like they’ll detune, and you’ll hear these rushes of noise come in, or there’ll be this little weird… at the end. We’ve found that musicians who’ve actually played with sound a lot find these particular sounds immensely fascinating. I think they’re the kinds of sounds that sound interesting in a way that’s hard to describe unless you’ve played with them. I think they’re interesting because the model has been forced to capture some of the important sources of variance in real music audio. Even when it fails to reproduce all of them when it fills in with confusion so to speak, even that confusion is somehow driven by musical sound. Which you see by the corollary if you look at something like Deep Dream and you see what models are doing when they’re showing you what they’ve learned. It may not be what you expect from the world, but there’s something interesting about them, right? Anyway, that’s a long answer. But the short version of the answer is we’ve found that working with very talented musicians has been really fruitful. Our challenge is now to be good enough at what we do, and make it easy enough, and make it clean enough that even someone who’s not an Aphex Twin, and I’m not saying we worked with Aphex Twin. We didn’t work with Aphex Twin. But like that kind of artist.
Craig Cannon [12:53] Someone kind of
Doug Eck [12:54] Yeah, that we can also be saying, “Hey, this is really genuinely musically engaging for a much much larger audience.”
Craig Cannon [13:01] That’s surprising. So it’s not necessarily generating melodies for people so much as it is generating interesting sounds? That’s what’s brought them in?
Doug Eck [13:09] That’s what’s brought them in. Though the parallel has existed for the sequence generation stuff. What I noticed, even with A.I. Duet, which is this web based, like it’s a simple RNN. It’s like I can lay claim it’s technology that was published in 2002. It’s really a very simple…
Craig Cannon [13:27] It’s really fun though.
Doug Eck [13:28] Really simple. This model, if your viewers haven’t seen it, you play a little melody and then the model thinks for a minute. The AI genius, which is an LSTM network, comes back and plays something back to you, right? If you play Für Elise, you know? Right? And you wait, you’re expecting maybe that it’ll continue the tune. It’s not going to, right? It’s going to go right? So this idea of expecting the model to carry these long arcs of melody along is not really understanding the model. What we saw was, especially jazz musicians, but musicians who listen, the game they play is to follow the model. I would see guys, or people, women too, to sit down and go like ♫ Dum dum dum dum and just wait, and it’s almost like pinging the model with an impulse response. Like, “What’s this thing going to do?” Then instead of trying to drive it, it comes back and goes ♫ Dum dum dum dum Right?
Craig Cannon [14:25] Yeah.
Doug Eck [14:26] Then the musician says, “Oh, I see. Let’s go up to the fifth.” Then you get this really, it’s almost like follow the leader, but you’re following the model. Then it’s super fun. It’s basically a challenge for the musician to try to understand how to play along with something that’s so primitive. But if you don’t have the musical, so basically it’s the musician bringing all the skill to the table, right? Even with the primitive sequence generation stuff, it’s still been interesting to see that it’s the musicians with a lot of musical talent and particularly the ability to improvise and listen that have managed to actually get what I would consider interesting results out of that.
Craig Cannon [15:01] Yeah, so it’s become more of like a call and response game than a tool?
Doug Eck [15:06] Yeah, I think so. That’s partially because the model’s pretty primitive. I think that if we can get the data pipelines in order so that we know what we’re training on and we can actually do some more modern sequence learning, having like generative adversarial feedback and things like that, we can do much better. Even we have some stuff that we haven’t released yet that I think is better. But yeah, as we make it better, it’ll be more of a, “This model’s going to give me some more ideas from what I’ve done.” Right now it’s more of a, “This model’s kind of weird but I’m kind of trying to understand what it’s doing.” Both are fun modes by the way. They’re both cool modes, right?
Craig Cannon [15:46] Yeah, I mean I haven’t tried, like I’m definitely not a pianist. I mean I’ve played guitar before. I tried to get a song going, but I had trouble with it.
Doug Eck [15:55] We’re sorry.
Craig Cannon [15:56] It’s okay. I think it’s mostly my fault to be honest. I love the YouTube video.
Doug Eck [16:01] Blame the user, right?
Craig Cannon [16:01] Yeah, yeah. The video where that guy played a song with it. That was amazing.
Doug Eck [16:07] Yeah, that was cool.
Craig Cannon [16:08] It was very cool. Have you seen a lot of that stuff as well?
Doug Eck [16:10] Yeah, we’ve seen. We saw like well, we haven’t pushed the sequence generation stuff much because we really wanted to focus on timbre. But when we have released things and tried to show people where they are, yeah we’ve gotten. If you look on them, there’s a Magenta mailing list that’s just like it’s linked, g.co/magenta, and if you look around, there’s a discussion list. Which is as flaming and spammy as some discussion lists, but a little bit less so. It’s pretty, you know. Every couple weeks, someone will put up some stuff they composed with Magenta and usually they’re more effective if they’ve layered their own stuff on top of it or they’ve taken time offline rather than in performance to generate. But some stuff’s actually quite good. It’s fun. It’s a start.
Craig Cannon [16:52] Yeah. I think it’s great. You compared it to the work you did in 2002. Where has LSTM gone since then? You talk about like you ended up doing this project. I saw in your talk that because you kind of like failed at it a while ago.
Doug Eck [17:08] Failure is good. Yeah, so there was a point in time, I was at a lab called IDSIA, the Dalle Molle Institute for Artificial Intelligence, and I was working for Jürgen Schmidhuber, who’s one of the coauthors. He was the advisor to Sepp Hochreiter who did LSTM. There was a point in time where there were three of us in a room in Manno, Switzerland, which is a suburb of Lugano, Switzerland, who are the only people in the world using LSTM. It was myself, Felix Gers, and Alex Graves. Among the three of us, by far Alex Graves has done the most with LSTM, so he continued after he finished his PhD. He continued doggedly to try to understand how recurrent neural networks worked, how to train them, and how to make them useful for sequence learning I think more than anybody else in the world including Sepp, the person who created LSTM. Alex just stuck with it and finally started to get wins in speech and language. I, more or less, put down LSTM as I started working with audio stuff and other more like cognitively different music stuff at University of Montreal. But it worked finally, right? You know there’s like this thing in music, a 20 year overnight success, right?
Craig Cannon [18:23] Yeah.
Doug Eck [18:24] This worked because he stuck with it. Now of course it’s become like the touchstone for recurrent models in time series analysis. Some version of it forms the core of what we’re doing with translation. These models have changed, right? They’ve evolved over time. But basically, recurrent neural networks as a family of models is around because of that effort of like, it’s interesting, right? There really were three of us.
Craig Cannon [18:51] That’s so wild.
Doug Eck [18:51] Felix went on with his life and I went on with my life. Alex stuck with it, this really one person caring for it. But you may get letters from people saying, “Hey wait, you forgot about me. You forgot about me.” This is a little bit reductionist. Obviously there were more, but it felt that way at the time, right?
Craig Cannon [19:05] Right. What was the breakthrough then that like got people interested?
Doug Eck [19:12] I think it was the same breakthrough that got people interested in deep neural networks and convolutional neural networks. It’s that these models don’t work that well with small training sets and small models, and then…
Craig Cannon [19:27] So you like mentioned that with like the stuff
Doug Eck [19:28] They’re data absorptive, meaning that they can absorb lots of data if they have it. Neural networks as a class are really good with high dimensional data. So as machines got faster, and memory got bigger, they started to work. We were working with really small machines, and working with LSTM networks that maybe had like 50 to 100 hidden units, and then a couple of gates to control them, and trying things that had to do with the dynamics of how these things can count, and how they can follow time series. You try to scale that to speech or you try to scale that to, you know, speech recognition was one of the first things. This stuff’s really hard to do. So I think a lot of this is just due to having faster machines and more memory. It’s kind of weird, right?
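To make "hidden units and a couple of gates" concrete, here is a minimal sketch of one time step of an LSTM cell in a common textbook formulation with input, forget, and output gates. It is not the code used at IDSIA or in Magenta; the weight layout, the input size of 8, and the random data are assumptions chosen for the example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, b):
    """One time step of an LSTM cell.

    W has shape (4 * hidden, inputs + hidden); b has shape (4 * hidden,).
    """
    hidden = h_prev.shape[0]
    z = W @ np.concatenate([x, h_prev]) + b
    i = sigmoid(z[0:hidden])                # input gate
    f = sigmoid(z[hidden:2 * hidden])       # forget gate
    o = sigmoid(z[2 * hidden:3 * hidden])   # output gate
    g = np.tanh(z[3 * hidden:])             # candidate cell update
    c = f * c_prev + i * g                  # new cell state
    h = o * np.tanh(c)                      # new hidden state
    return h, c

rng = np.random.default_rng(0)
inputs, hidden = 8, 50  # 50 hidden units, the low end of the range mentioned
W = rng.normal(scale=0.1, size=(4 * hidden, inputs + hidden))
b = np.zeros(4 * hidden)
h, c = np.zeros(hidden), np.zeros(hidden)
for x in rng.normal(size=(10, inputs)):  # run over a random 10-step sequence
    h, c = lstm_step(x, h, c, W, b)
```

A network of this size has only a few thousand weights, which gives a sense of how small the early experiments were compared with the speech and translation models that came later.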
Craig Cannon [20:11] It’s surprising that that would be it.
Doug Eck [20:13] Yeah, I think it surprises everybody a little bit. Now the running joke, like having coffee here at Brain is sort of like what other technology from the ’80s should we rescue?
Craig Cannon [20:22] Waves
Doug Eck [20:22] Exactly right.
Craig Cannon [20:25] AI’s back.
Craig Cannon [20:27] How far have you pushed LSTM? Obviously there’s some amount of text generation that people are trying out. Have you let it create an entire song?
Doug Eck [20:38] No we haven’t because we haven’t got the conditional part of it right yet. I think LSTM in its most vanilla form, I think everybody’s pretty convinced that it’s not going to handle really long time scale hierarchical patterning. I’d love it if someone comes along and says, “No, you don’t need anything but vanilla LSTM to do this.” But I think what makes music interesting over even after five seconds or 10 seconds is this idea that you’re getting repetition, you’re getting harmonic shifts like chord changes. There’s a there there, right? One way to talk about that there there is that you have some lower level pattern generation going on, but there’s some conditioning happening. Oh, now continue generating, but the conditions shifted. We just shifted chords for example. So I think if we start talking about conditional models, if we talk about models that are explicitly hierarchical. If we talk about models that we can sample from in different ways, we can start to get somewhere. But I think only a recurrent neural network is… It would be reductionist to say that it’s the whole answer. It’s in fact true, it’s not the whole answer.
Craig Cannon [21:48] I was thinking about how you were, was it the TensorFlow or the I/O talk where you were talking about Bach?
Doug Eck [21:54] Oh, probably we did stuff that was like, “More Bach than Bach.” Yeah, we nailed it.
Craig Cannon [21:59] Yeah, that’s it, like you start making things that are more palatable as like, “I’ll make the best Picasso painting for you,” but it’s not necessarily a Picasso painting because it’s not necessarily saying anything.
Doug Eck [22:12] Precisely. I think by analogy, so first in case it’s not clear. I don’t believe that we made something that was better than Bach. But when we put these tunes out for untrained listeners to listen to, they sometimes voted them as sounding more Bach-y. Imagine what these models are learning, right? They’re learning the principal axes of variance. They’re learning what’s most important. They have to because they have a limited memory. They’re compressed. If you sample from Sketch-RNN with very low temperature, meaning without a lot of noise in the system, you actually get what like if you want to squint your eyes and break philosophy is like the Platonic cat. You get the cat that looks more like a cat anyone would draw, sort of the average cat. I think that’s what we’re getting from these time series models as well. They’re giving you something that’s more a caricature than a sample.
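Sampling temperature can be shown in a few lines. The sketch below divides a model's raw scores by a temperature before converting them to probabilities, which is the usual way the idea is implemented. The `logits` values are made up, and this is not Sketch-RNN's actual sampling code, which draws pen strokes from a more elaborate output distribution.

```python
import numpy as np

def temperature_softmax(logits, temperature):
    """Turn raw scores into probabilities, sharpened or flattened by temperature."""
    scaled = np.asarray(logits, dtype=float) / temperature
    scaled -= scaled.max()  # subtract the max for numerical stability
    weights = np.exp(scaled)
    return weights / weights.sum()

def sample(logits, temperature, rng):
    """Draw one index from the temperature-adjusted distribution."""
    probs = temperature_softmax(logits, temperature)
    return int(rng.choice(len(probs), p=probs))

logits = [2.0, 1.0, 0.1]  # made-up scores for three possible next choices
cold = temperature_softmax(logits, 0.1)  # nearly all mass on the top choice
warm = temperature_softmax(logits, 2.0)  # mass spread more evenly
```

At low temperature almost every sample is the single most probable choice, which is why the output drifts toward the "average cat". Raising the temperature lets less likely choices through and produces more varied, less typical drawings.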
Craig Cannon [23:09] Then in the creation of art, what are you predicting is going to happen as Magenta progresses?
Doug Eck [23:18] Can I make predictions that are on the timeframe of like 28 to 40 years?
Craig Cannon [23:21] Yeah, sure. Why not?
Doug Eck [23:22] When no one will ever test.
Craig Cannon [23:23] In 1,000 years.
Doug Eck [23:24] In 1,000 years, Magenta is going to be the only, no. Joking aside, I do believe that the machine learning and AI will continue, like will become part of the toolkit for communicating and for expression, including art. I think that in the same way, I think that it’s healthy for me to admit that those of us who are doing this engineering won’t, almost by definition, know where it’s going to go. We can’t and we shouldn’t know where it’s going to go. I think our job is to build AI smart tools. At the same time, I want to point out some people find that answer boring, like it’s hedging. But I do think there are directions. I can imagine a direction that we could go on that’d be really cool. For example, thinking of literature, right? I think plot is really interesting in stories and that you can imagine that we have a particular way as humans, like the kind of cognitive constraints that we have of limitations in how we would draw plots out as an author. You’re not going to do it one pass, left to right, like in a recurrent neural network. It’s going to be like sketching out the plot and do we kill this character off? But I can imagine that generative models might be able to generate plots that are really really difficult for people to generate, but still make sense to us as readers, right?
Craig Cannon [24:53] Oh man. Okay, yeah.
Doug Eck [24:54] Think of it if you flip it around, like I think jokes are hard because it’s really hard to generate the surprising turns. You go in one direction and you land over here, but it still makes sense. I can imagine that the right kind of language model might be able to generate jokes that are super super funny to us and that actually might have a flavor to them of being like, “Yeah, I know. This joke must’ve been machine generated because it fits in so many different ways,” right?
Craig Cannon [25:23] Yeah.
Doug Eck [25:23] Right? It fits in so many different ways. In math, like in high dimensional space, but it’s super funny to us. I don’t know how to do that. But I can totally imagine that we would be in a world where we get that.
Craig Cannon [25:34] I thought about it in the complete opposite way, but that makes sense. I was thinking about it, training it to create pulp fiction. That would be so simple in my mind. Just create these airport novels. It can just bang out the plots.
Doug Eck [25:46] I mean that’s probably where we’ll start. I would love it if we could write, so everybody understands that’s listening or watching. We can’t generate a coherent paragraph. I don’t mean we, Magenta, I mean we, humanity.
Craig Cannon [26:01] He’s like, “I can’t write at all.”
Doug Eck [26:04] Yeah, it’s really hard. It all hits its structure at some level, like nested structure whether it’s music, or I think there’s like art, plays with geometry, or color, or something else. It’s meaning. It’s nested structure somewhere.
Craig Cannon [26:20] Has the art world or I guess any kind of artist, any kind of creator, have people pushed back in the way that they’re scared? I imagine when photography came out, everyone was pushing back saying, “This might end painting,” because it’s about photography captures the essence. But then it ended up changing because people realized that painting wasn’t just about capturing something, capturing an exact moment.
Doug Eck [26:44] Certainly the generative art world, and we’ve seen lots of that. Another researcher in London, someone posted on his Facebook something like, or he posted to us a tweet that was like, “What you’re doing is bad for humanity.” Like, really? He’s making new folk songs. He’s generating folk songs with an LSTM, he’s Bob Sturm. It’s probably not bad for humanity. So yeah, of course. But what I love about that is it’s okay if a bunch of people don’t like it. In fact, if it’s interesting, what art does everybody like?
Craig Cannon [27:17] Zero.
Doug Eck [27:18] Right or it’s really boring, right?
Craig Cannon [27:20] Right.
Doug Eck [27:22] You have this idea that if you want to really engage with people, you’re probably going to find an audience. That audience is going to be some slice. Frankly, it’s probably going to be some slice coming up from the next generation of people that have experienced technology, that are taking some things for granted that are still novel to someone like myself, right? But it’s okay if a bunch of people don’t like it.
Craig Cannon [27:47] Yeah, well when we were talking before, I was surprised that you hadn’t gotten more pushback. It seems to be like most people in our world are just like, “All right, kay. Whatever.” It’s like, “Do your thing.” It’s opening up new territory rather than it is challenging.
Doug Eck [28:00] I think that I’ve gotten pushback in terms of questions. I think we have, and I think this is a community in Google, and outside of Google, and outside of Magenta, I think people are really clear that what’s interesting about a project like this is that it be a tool, not a replacement. I think if we presented this as, “Push this button and you’ll get your finished music,” it would be a very different story. But that’s boring.
Craig Cannon [28:29] It’s funny you mentioned Hacker News because I was talking with one of the moderators.
Doug Eck [28:35] We love you, Hacker News. Be nice to us.
Craig Cannon [28:37] No, they’re great. Yeah, no it’s just impersonal. It’s so easy to critique people. But I was talking with Scott, one of the moderators, and he was wondering if you guys were concerned with the actual cathartic feeling of creating music, or if that’s just something you don’t even consider right now?
Doug Eck [28:55] I mean as people, yeah of course.
Craig Cannon [28:56] You have to.
Doug Eck [28:57] Yeah and I think there’s a couple of levels there. I think you lose that if what you’re just doing is pushing a button. I think this is everywhere. The drum machine is such a great thing to fall back on. It is just not fun to just push the button and make the drum machine do its canned patterns. I think that was the goal. The reading that I’ve done is like this’ll make it really easy, right? But what makes the drum machine interesting is people working with it, writing their own loops or their own patterns, changing it, working against it, working with it. I think this project loses its interest if we don’t have people getting that cathartic release, which believe me, I understand what you mean. That’s thing one. The other thing I would mention is if there’s anything that we’re not getting that I wish we were getting more of is people coding creatively. We talk about creative coding in this handwavy sense. But I would love to have the right kind of mix of models in Magenta and in open source linking to other projects that you as a coder could come in and actually say, “I’m going to code for the evening and add some things. I’m going to maybe retrain, maybe I’m going to hack data, and I’m going to get the effect that I want,” and that part of what you’re doing is being an artist by coding. I think we haven’t hit that yet in Magenta. I’d love to get feedback from whomever, like in terms of ways to get there. The point is, there’s a certain catharsis for those of us that train the model. You get the model to train, and it worked.
Craig Cannon [30:29] Just psyched.
Doug Eck [30:31] It’s funny, you’ll be bored if you just push the button, but it feels good for me to push that button ’cause I’m the one that made that button work. So there’s that, right?
Craig Cannon [30:38] Yeah.
Doug Eck [30:38] That’s a creative act in its own right.
Craig Cannon [30:40] Have people been creatively breaking the code? Like, “Oh, it would be funny if it did this or interesting if it did that.”
Doug Eck [30:48] A few. Though I think our code is so easy, like most open source projects need to be rewritten a couple times. I think we’ve gone through, we’re on our second rewrite, is that if the code is brittle enough that it’s easy to break uncreatively, then it’s hard to also break it creatively. Listen, I’m being pretty critical. I’m really proud of the quality of the Magenta open source effort. I actually think we have well tested, well thought out code. I think it’s just a really hard problem to do coding for art and music, and that if you get it wrong a little bit, it’s just wrong enough that you have to fix it. So we still have a lot of work to do.
Craig Cannon [31:25] Then where does that creative coder world go? I’ve seen a lot of people that are concerned with even just preserving, I think Rhizome is doing a preserving digital art project. What direction do you think that’s going to go in?
Doug Eck [31:39] Presumably a number of cool directions in parallel. The one that interests me personally the most is reinforcement learning and this idea that models trained… So there’s a long story or a short story. Which one do you want?
Craig Cannon [31:54] Long.
Doug Eck [31:54] Sure, yeah okay. We know..
Craig Cannon [31:55] Well, how long?
Doug Eck [31:57] It’s not that bad. Generative models 101, you start generating from a model trained just to be able to regenerate the data it’s trained on. You tend to get output that’s blurry, right? Or is just kind of wandery. That’s because all the model learns to do is sit somewhere on the big, imagine the distribution as a mountain range and it just sits on the high mountaintop.
Craig Cannon [32:20] Kind of plays it safe.
Doug Eck [32:21] Yep, it kind of plays it safe. All t-shirts are gray if you’re colorizing ’cause that’s safe, you’re not going to get punished. One revolution that came along thanks to Ian Goodfellow is this idea of a generative adversarial network. It’s a different cost for the model to minimize where the model is actually trying to create counterfeits and it’s forced to not just play it safe, right? I don’t know how, if this is too technical.
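The "gray t-shirt" remark can be illustrated with a toy calculation. This is a minimal sketch, not anything from Magenta: it assumes, purely for illustration, that the training shirts are either pure red or pure blue and that a single guessed color is scored by mean squared error. The data and function names are hypothetical.

```python
# Toy illustration: under squared error, the safest single guess is the average.
# Hypothetical data: half the shirts are pure red, half pure blue (RGB in 0..1).
shirts = [(1.0, 0.0, 0.0)] * 50 + [(0.0, 0.0, 1.0)] * 50


def mean_squared_error(guess, data):
    """Average squared distance between one guessed color and every shirt."""
    total = 0.0
    for color in data:
        total += sum((g - c) ** 2 for g, c in zip(guess, color))
    return total / len(data)


def average_color(data):
    """Channel-wise mean, which minimizes mean squared error."""
    n = len(data)
    return tuple(sum(color[i] for color in data) / n for i in range(3))


safe_guess = average_color(shirts)
print("safe guess:", safe_guess)
print("error of safe guess:", mean_squared_error(safe_guess, shirts))
print("error of guessing red:", mean_squared_error((1.0, 0.0, 0.0), shirts))
```

The averaged color has a lower error than committing to red, even though no shirt in the data actually has that muddy in-between color. That is the sense in which a model can "play it safe" and still produce output nobody wants.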
Craig Cannon [32:47] It’s very interesting to me. Yeah, this was part of the talk, right? Where you cut out the square in the painting?
Doug Eck [32:51] Exactly, yeah.
Craig Cannon [32:51] Yeah, I saw that part.
Doug Eck [32:53] Another way to do this is to use reinforcement learning. It’s slower to train than GANs because all you have is a single number, a scalar reward, instead of this whole gradient flowing. But it also is more flexible. Okay, so my story here is that GANs are a part of a larger family of models that are, at some level, critical. Everybody needs a critic and they’re pushing back. They’re pushing you off of your, pushing you out of your safe spot, whatever that safe spot is, and that’s helping you be able to do a better job of generating. We have a particular idea that you can use reinforcement learning to provide a reward for following a certain set of rules or a certain set of heuristics. This is normally like, if you mention rules at a machine learning dinner party, everybody looks at you funny, right?
Craig Cannon [33:37] Like you’re stepping backward, right?
Doug Eck [33:38] Yeah, you’re not supposed to use rules. Come on, we don’t use rules. But instead of building the rules into the model, like the AI is not rules. The machine learning is not rules. It’s that the rules are out there in the world and you get rewarded for following them. We had, I thought, some very nice generated samples of music that were pretty boring with the LSTM network. But then the LSTM network trained additionally using a kind of reinforcement learning called deep Q-learning to follow some of these rules, the generation got way different and way better. It specifically got catchier. What were the rules? The rules were like rules of composition for counterpoint from the 1800s. They were super simple. Now, we don’t care about those rules. But there’s a really nice creative coding aspect, which is, think of it this way. I have a ton of data. I have a model that’s trained. I have a generative model, whatever it may be. It may be one trained to draw. It may be one trained for music. That model has kind of tried to disentangle all the sources of variance that are sitting in this data and so it’s smartly generating, it can generate new things. But now think, as long as I can write a bit of code that takes a sample from the model and evaluates it, providing a scalar reward, anything I stuff in that evaluator, then I can get the generator to try to do a better job of generating stuff that makes that evaluator happy. It doesn’t have to be 18th century rules of counterpoint, right?
Craig Cannon [35:07] No, yeah.
Doug Eck [35:09] You could imagine taking something like Sketch-RNN and adding a reinforcement learning model that says, “I really hate straight lines.” Suddenly, the model’s going to try to learn to draw cats, but without straight lines. The data’s telling it to draw cats. Sometimes the cats have triangular ears with straight lines. But the model’s going to get rewarded for trying to draw those cats that it can without drawing straight lines. Straight lines was just one constraint that I picked off the top of my head. It has to be a constraint that you can measure in the output of the model. But musically speaking, if I could come up with an evaluator that described what I meant in my mind by shimmery, really fast changing small local changes, I should be able to get a kind of music that sounds shimmery by adding that reward to an existing model. Furthermore, the model still retains the nice realness that it gets from being trained on data. I’m not trying to come up with a rule to generate shimmery. I’m trying to come up with a rule that rewards a model for generating something that’s shimmery.
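An evaluator like "I really hate straight lines" only needs to be measurable on the model's output. This is a minimal sketch of one possible measure, assuming a drawing is a list of strokes and each stroke is a list of (x, y) points. That representation and both function names are illustrative assumptions, not Sketch-RNN's actual format or API.

```python
import math


def straightness(points):
    """Ratio of end-to-end distance to path length: 1.0 means perfectly straight."""
    path_length = sum(math.dist(a, b) for a, b in zip(points, points[1:]))
    if path_length == 0.0:
        return 0.0
    return math.dist(points[0], points[-1]) / path_length


def reward(strokes):
    """Scalar reward for a drawing: higher when its strokes are less straight."""
    if not strokes:
        return 0.0
    return -sum(straightness(s) for s in strokes) / len(strokes)


straight_ear = [(0.0, 0.0), (1.0, 2.0), (2.0, 4.0)]
curvy_ear = [(0.0, 0.0), (2.0, 1.0), (2.0, 4.0)]
print(reward([straight_ear]), reward([curvy_ear]))
```

The evaluator says nothing about cats. It only scores whatever strokes it is handed, so the "draw a cat" part still has to come from the model trained on data.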
Craig Cannon [36:11] Something shimmery, whatever that is.
Doug Eck [36:12] Yeah, it’s very different, right?
Craig Cannon [36:13] Yeah.
Doug Eck [36:14] I think that’s one really interesting direction to go in, is like opening up the ability, if you can generate a scalar reward, and drop it in this box over here, and we’ll take a model that’s already trained on data and we’ll tilt it to do what you want it to do.
Craig Cannon [36:24] That underlies a fear that people have, right? Which is what happens when you can create the best pop song and what do people do? Do you have thoughts on A, is that possible, and B, what would the world look like if that world comes to be?
Doug Eck [36:42] I had an algorithm for doing this, for the best pop song for me, which is when we used to sell used CDs, it was usually like a two to one. So every time if you have 1,000 CDs, you trade them in, and you have 500 that you like better. Then you just keep going. You finally get that one.
Craig Cannon [36:58] Until that best one.
Doug Eck [36:59] Yeah, exactly. Hill climbing in that space. Yeah, I think that… I’m not sure. A part of me wants to say people love the rawness and the variety of things that aren’t predictable pop. But let’s face it, people love pop music, or even like there’s a kind of pop music that you’ll catch on the radio sometimes that isn’t like, most of your listeners are probably in the same camp, or viewers. There’s pop that we love, like I love the poppiest of Frank Ocean’s music. I could listen to it forever. But then there’s just like the gutter of pop and so maybe we’ll…
Craig Cannon [37:41] You can’t even distinguish who the artist is, but they play at the big festivals I guess.
Doug Eck [37:45] I guess that unasks the perfect pop. I mean pop is such a broad thing. But yeah, I think I can imagine that with machine learning and AI at the table, we will… Here’s another way to look at it. Some things that used to be hard will be easy and so we’ll offload all of that. If people are happy just listening to the stuff that’s now easy, then yeah, it’s a problem solved and we’ll be able to generate lots of it. But then what people tend to do is go look for something else hard. It’s like the drum machine argument. You solved the metronomic beat problem. Then what you actually find is that artists who are really good at this, they play off of it and they’re allowed, like when they sing, to do many more rhythmical things than they could do before because now they have this scaffolding they didn’t have to work with before.
Craig Cannon [38:33] They just constantly break it, right?
Doug Eck [38:34] Yeah.
Craig Cannon [38:34] As soon as you distorted the electric guitar.
Doug Eck [38:36] But I hope that’s an honest answer to your question. I mean your question was a different flavor. It’s like, “Hey, are we really moving towards a world where we’re going to generate the perfect pop song?” Yeah, I don’t know. I don’t think so.
Craig Cannon [38:48] I don’t feel like that’s going to happen. But maybe it happens so quickly and then as soon as we realize like, “Okay, this is how we’re going to break it. This is how we’re going to retrain ourselves.” It can learn so fast, that it’s like, “Okay, now I can do that too.”
Doug Eck [39:02] Yeah, that’s nice.
Craig Cannon [39:03] Then what I was wondering is there like, in the next handful of years, is there a holy grail that you’re working toward for Magenta? Like, “Okay, now we’ve hit it. This is the benchmark that we’re going for.”
Doug Eck [39:22] There are a couple of things I’d love to do. I think creating long form pieces, whether they’re music or art, I think is something we want to do. This hints at this idea of not just having these things that make sense at 20 seconds of music time, but actually say something more. That direction is really interesting because I think that not only, so let’s face it. That would be at least more interesting if you pushed the button and listened to it. But also, this leads to tools where composers can offload very interesting things. Some people, I’m one of these people. I’m really obsessed with expressive timing. I’m really obsessed with musical texture.
Craig Cannon [40:04] Okay, I don’t know what that is.
Doug Eck [40:06] Oh no, I just mean let’s say you’re playing piano…
Craig Cannon [40:09] Oh, I saw that in the Grey art space, or the Grace talk. You were contrasting the piano played by a computer.
Doug Eck [40:14] Hey, yeah. You did your homework. Yeah, exactly.
Craig Cannon [40:16] I just watched a bunch of YouTube videos.
Doug Eck [40:19] If you listen to someone play a waltz, it’ll have a little lilt to it. Or like some of my favorite musicians, like Thelonious Monk if you’re familiar. If you’re not familiar with Thelonious Monk, homework, go listen to him. He played piano with a very very specific style that almost sounded lurching sometimes. He really cared about time in a way. If the way that you’re thinking about music and composition is really really caring about local stuff, it’d be very very interesting if you had a model that would handle for you some of the decisions that you would make for longer time scale things, like when do chord changes happen, right? Usually it’s the other way around. You have these AI machine learning models that can handle local texture, but you have to decide that. Yeah, my point is if we get to models that can handle longer structure and nested structure, we’ll have a lot more ways in which we can decide what we want to work on versus what we have the machine help us with, right?
Craig Cannon [41:12] Right. Has it affected your creative work or do you still do creative composition?
Doug Eck [41:19] Yeah. I’m working here at Google. This is like a coalmine of work to do this project, Magenta like every day. No, joking aside.
Craig Cannon [41:29] Yeah, as we’re here.
Doug Eck [41:29] Yeah, plus two kids. Plus two kids.
Craig Cannon [41:31] That’s true.
Doug Eck [41:33] Basically I’ve been using music as more of a catharsis relaxation thing. I don’t feel like personally I’ve done anything recently that I would consider creative at a level that I want to share with someone else. It’s been more like jamming with friends or just throwaway compositions, jamming like, “Here’s 10 chords that sound good. Let’s jam over it for the evening,” and then don’t even remember it the next day. And really trying hard to understand this creative coding thing, like that’s the most I’ve worked on. A lot of it’s just like I’ll start and then I’ll get distracted. But yeah, that’s the level of my creative output I’m afraid.
Craig Cannon [42:04] Well, the creative coding thing, it’s seemingly, I don’t know. So many people are looking for it in every venue and it’s so difficult to find people. They’re like one-offs now.
Doug Eck [42:13] Yeah, I think that’s right. It’s so hard to have the right, I think maybe we need the GarageBand of this. We need to have something that’s so well put together that it makes it easy for a whole generation of people to jump in and try this even if they haven’t had four or five years of Python experience or something like that.
Craig Cannon [42:31] I didn’t know if that’s what you were alluding to when you were saying that command-line, obviously not the way to do it where it dumps MIDI files. But now it’s an API, right?
Doug Eck [42:41] Yeah.
Craig Cannon [42:41] What is the next step that’s very obvious?
Doug Eck [42:45] Try to make it more usable and more expressive. Expressivity’s hard in an API, right? It’s like so hard to get it right and I think it’s almost always multiple passes. We’ve got I think the API, the core API that allows us to move music around in real time in MIDI and actually have a meaningful conversation between an AI model and multiple musicians is there. There’s just a bunch more thinking that needs to happen to get it right.
Craig Cannon Cool, so if someone wants to become a creative coder, or wants to learn more about you guys, what would you advise them to check out?
Doug Eck I would say the call to action for us is to visit our site. The shortest URL is g.co/Magenta. It’s also magenta.tensorflow.org.
Craig Cannon [43:27] We can link it all up.
Doug Eck [43:28] G.co and have a look at we have some open issues. We have a bunch of code that you can install on your computer, and hope that you can make work, and maybe you will be able to. We want feedback. We have a pretty active and we certainly follow our discussion list closely, are game for philosophical discussions, and are game for technical discussions. Beyond that, we’re just keeping rolling. We’re just going to try to keep doing research and keep trying to build this community.
Craig Cannon [43:58] Okay, great. Thanks, man.
Doug Eck [43:58] Sure. No, it was fun.
Craig Cannon [44:00] Alright, and thanks for listening. So if you want to get started using Magenta, you can check out magenta.tensorflow.org. And if you want to watch the video, which we filmed in one of Google’s very swanky libraries, you can check out blog.ycombinator.com. Okay, see ya next time.