Knowledge Institute Podcasts
The Global Startup Ecosystem: Automating Understanding with John Bohannon
John Bohannon, Director of Science at Primer AI, discusses his time as the “Indiana Jones of Journalism” and explains how he helps build machines to read and write – automating intelligent understanding of large datasets.
Hosted by Jeff Kavanaugh, VP and Head of the Infosys Knowledge Institute.
“I brought my skills to bear from science journalism. I put myself in the customer's shoes, and I had a wonderful team to work with to try and create new algorithms to process text in order to help people find the information they need.”
“There is this expression that software is eating the world. Well, machine learning is eating software.”
- John Bohannon
- In 2017, natural language processing tools just were not there yet. Everything was built from scratch, mainly with rule-based heuristics. Engineers had to be the machine and figure out the step-by-step process for an algorithm to make sense of text. There was no fancy machine learning: the machine looked for patterns that the engineer had identified for it.
- Now, across the data science space, people are switching out all of the old heuristic-based natural language processing approaches. They are replacing them with machine learning. There is this expression that software is eating the world. Well, machine learning is eating software.
- Primer developed a tool called saliency, which gives people a peek into the machine learning black box. Given a huge document, the model highlights the small piece of information that was most useful in its classification decision. This is especially helpful for engineers conducting root-cause analysis of errors the model makes, which they can then remedy.
- If you are solving a problem at the global scale, it will have large volume and diversity of information. The context in which these problems exist is less predictable and defined. That presents challenges that you do not worry much about when you have small-scale problems. You cannot define the range of inputs the algorithm will see or the range of contexts.
- Knowledge bases should be self-updating: systems that listen to the world, across all the information streams that matter, and do the passive job of keeping track of everything a person learns and cares about. To this day, people still have the tedious job of entering data into a system, bit by bit, and tidying it. It soaks up the life force of people who should be free to synthesize and create.
Jeff introduces John
How did you go from getting a Ph.D. in molecular biology from Oxford to becoming an investigative data journalist embedded with NATO in Afghanistan?
What do you do at Primer?
What's your thought process behind focusing on machine learning, natural language processing, et cetera?
John compares machine model learning to the black box concept.
What's the difference about solving problems at [a global] scale, versus something that's a little smaller, almost at a toy level?
What are some of the other kinds of challenges that you're looking to solve at Primer that you see now and you see around the corner?
Looking at the corporate world today, are there any specific challenges that you think should be solved, and maybe that's around the corner, beyond the tools that you have today, that maybe businesses in general should be solving, or you're excited about helping them solve, maybe in the next year or two?
Jeff Kavanaugh: Welcome to a special edition of the Knowledge Institute Podcast, where we discuss the global startup ecosystem with experts, deconstruct main ideas, and share their insights. I'm Jeff Kavanaugh, Head of the Infosys Knowledge Institute. Today we're here with John Bohannon, Director of Science at Primer. John, thanks so much for joining us.
John Bohannon: Oh, thanks for having me.
Jeff Kavanaugh: John, let's start off talking about your history for a minute. How did you go from getting a Ph.D. in molecular biology from Oxford to becoming an investigative data journalist embedded with NATO in Afghanistan?
John Bohannon: I did a Ph.D. because I thought it would be fun. I didn't have a really specific plan. I really enjoyed it, but then I wasn't good enough with my hands. I think with molecular biology, if you're going to work in the lab, you actually have to be quite dexterous. And the thing that drove me crazy was, I would squirt a droplet of liquid just into the wrong tube or wrong amount, and you would lose, sometimes, months of work. If you're not good enough with your hands, it can drive you crazy.
John Bohannon: I heard that Science, the journal, had possibilities for interns of some kind, and so I sent an e-mail or maybe even a paper letter to the editor of Science. I later found out it circulated as a joke at Science. Science is one of the most prestigious journals in the world. You know, Science and Nature.
John Bohannon: And some time later, I got an e-mail from the editor of the news department at Science, and he said, "Well, you have a very interesting background." I told him I was finishing a molecular biology Ph.D., and I hadn't written much except for plays. I was very active in the theater scene in England. And he said, "Well, that's interesting. Maybe you should do a news internship."
John Bohannon: And so then he passed me off to Rich Stone, who had just started the Cambridge UK office of Science, and he said, "Well, have you ever written any journalism?" "No." He said, "Okay, well, here's a scientific paper. Write about it." I was like, "Okay. Can you give me some examples of what that looks like?" And he did, and I just sort of copied the style, and he said, "All right, kid, good enough."
John Bohannon: And the next thing I knew, I was on the border of North Korea, trying to track down a guy from Germany who had set up a factory to convert dead human bodies into artwork. You may have heard of this big show that's traveling around called Body Worlds. So my mission was to try and find out where he was getting his bodies, and most importantly, was he sourcing them from Chinese state prisons and mental institutions? The answer was yes, and so I kind of broke that story, and there was just no looking back. I loved it. The idea that you could use your scientific know-how to just go off into the world and investigate things in the real world and then write stories about it, I just loved it. So I did that for more than a decade.
Jeff Kavanaugh: “The Indiana Jones of Journalism.” All right. Primer, as I understand it, is an AI company. What do you do at Primer?
John Bohannon: Over the past, let's say, seven years, I've switched entirely to data science. Even when I was a journalist, I was just entirely doing coding and data science. So when I joined Primer, the first thing I wanted to focus on was how to make sense of scientific papers from the point of view of a machine. So in this case, people interested in science are the customer. It could be actual scientists, or policy makers, or someone at a drug company who's trying to get their head around the science of something to build a product.
John Bohannon: You have this fundamental problem of too much information. And so I just brought my skills to bear from science journalism. I put myself in the customer's shoes, and I had a wonderful team to work with to try and create new algorithms to process that text in order to help you find stuff.
John Bohannon: Here's a concrete example. One of the first things that I coded myself was a jargon translator. So the way it works is, you feed in text, like a scientific paper, and what it does is, it goes through and it tries to find technical terminology that is being abbreviated. So ML is a great example of that. AI is an example of that. AI, if you keep on using that term, you're going to lose some people, because they might not know that it stands for artificial intelligence. Infosys, I'm not even sure what that stands for. Someone knows. Information systems, maybe?
John Bohannon: So that's jargon. Jargon is special technical language that people are using in text. And so the first thing I did was to make something that finds all that stuff, figures out the abbreviated and the long form, and just automatically generates a glossary for you of the jargon. It's kind of a dumb pet trick with data science. Not hard, but, man, does it make a difference, because you can now go into a whole new field, something you know nothing about, and when you hit a term that just is totally mystifying, you've got this automatic glossary that will tell you what it stands for, at least.
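The rule-based approach John describes can be sketched in a few lines. This is a hypothetical reconstruction, not Primer's actual jargon translator: it only handles the common "long form (ABBR)" pattern, where the initials of the words before the parenthesis spell out the abbreviation.

```python
import re

def build_glossary(text):
    """Scan text for 'long form (ABBR)' patterns and return a glossary.

    A rule-based sketch: for every parenthesized run of capital letters,
    check whether the initials of the preceding words spell it out.
    """
    glossary = {}
    for match in re.finditer(r"\(([A-Z]{2,})\)", text):
        abbr = match.group(1)
        # Look at the words immediately before the parenthesis.
        preceding = text[:match.start()].split()
        candidate = preceding[-len(abbr):]
        if len(candidate) == len(abbr) and all(
            word[0].upper() == letter for word, letter in zip(candidate, abbr)
        ):
            glossary[abbr] = " ".join(candidate)
    return glossary

text = ("We apply natural language processing (NLP) and "
        "machine learning (ML) to large corpora.")
print(build_glossary(text))
# {'NLP': 'natural language processing', 'ML': 'machine learning'}
```

As John says, this is all hand-written rules: every new abbreviation pattern (plural forms, hyphenated terms, reordered initials) means another corner case for the engineer to handle.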
Jeff Kavanaugh: And what's your thought process behind focusing on machine learning, natural language processing, et cetera?
John Bohannon: So when I started at Primer circa 2017, NLP, which is natural language processing, using computers to make sense of text, it was really hard, and I mean really hard. The tools just were not there yet. You had to build everything from scratch, and it was a lot of rule-based heuristics. You'd have to be the machine yourself and figure out the exact step-by-step process that some algorithm could do to make sense of that text. So in the case of that algorithm that did the automatic jargon glossary generation, it's nothing but rules. There's no fancy machine learning in there. It's going and looking for patterns, and I as the engineer had to figure out what those patterns are, one by one, find all those corner cases. It's a lot of work.
John Bohannon: Starting in 2018, just a year and a half after I started, the whole situation changed for NLP practitioners when a new tool arrived: language models. Suddenly, and I really mean overnight, stuff started working. You could basically solve problems that you would have had to spend weeks and weeks making some really bespoke, little handcrafted solution for. Now you could just throw data at the problem, and nine times out of 10, it would just work.
John Bohannon: And I tell you, NLP just got fun. My job just turned from a good mixture of fun and hard work to just fun, just playground fun. That's where machine learning really shines. And we're in the middle of a real revolution here. Now, if I were to solve that jargon problem again with these new tools, I would take a totally different approach. I would just go and have some human experts label data for us, just basically capture what they know, which is essentially capturing what they want, and then I would teach a machine, using a machine learning model, to just do that task. And the more data you give it, the better and better it gets.
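The workflow John describes, experts label examples and a model learns the task from them, can be sketched with a toy classifier. This is illustrative only: a real system would fine-tune a pretrained language model rather than the simple Naive Bayes shown here, and all the example data is made up.

```python
import math
from collections import Counter, defaultdict

class NaiveBayesText:
    """A minimal bag-of-words Naive Bayes text classifier.

    The workflow is the point: experts label examples, and the model
    learns word/label statistics from them instead of hand-coded rules.
    """

    def fit(self, texts, labels):
        self.label_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        self.vocab = set()
        for text, label in zip(texts, labels):
            for word in text.lower().split():
                self.word_counts[label][word] += 1
                self.vocab.add(word)
        return self

    def predict(self, text):
        def log_score(label):
            total = sum(self.word_counts[label].values())
            score = math.log(self.label_counts[label])
            for word in text.lower().split():
                # Laplace smoothing so unseen words don't zero out a label.
                score += math.log(
                    (self.word_counts[label][word] + 1) / (total + len(self.vocab))
                )
            return score
        return max(self.label_counts, key=log_score)

# Expert-labeled examples stand in for real training data.
texts = ["refund not received", "app crashes on login",
         "charged twice this month", "screen freezes after update"]
labels = ["billing", "technical", "billing", "technical"]
model = NaiveBayesText().fit(texts, labels)
print(model.predict("I was charged twice"))  # billing
```

The key property John highlights holds even in this toy: adding more labeled examples improves the model without anyone writing new rules.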
John Bohannon: There's lots of caveats there. It doesn't always work as well as you hope. Sometimes it's not the right tool for the job. But more and more here at Primer and, I think, across the industry, we're ripping out all of those old heuristic-based approaches, those hard-coded, bespoke little NLP solutions, ripping them out by the roots, and replacing them with machine learning. There's this expression that software is eating the world. Well, machine learning is eating software.
Jeff Kavanaugh: There's a concept about black box/white box that, for people outside of the testing world or the coding world, if you can't see inside the box, you are completely relying on trust for anybody to use it. Because I used to be in the supply chain world and a lot of those optimization tools, late 90s, early 2000s, were solving great problems, integer programming and closed loop optimization and all that where materials and orders and capacity and sequencing were finally put together. No cloud at that point, all on premise, but nobody would use them in their first wave.
Jeff Kavanaugh: And there were a couple of companies, like i2 Technologies, as you know... If we do a very good way of showing what's going on and making it clear and giving a little peek inside the box, then the senior executive will say, "Okay, I trust it enough." And then they just took off. So it wasn't the person that had the best algorithm. It was the one that generated the most trust about a pretty mediocre algorithm at the time that just took off.
John Bohannon: So here's a direct parallel of that in my world. You're right that a deep learning, machine learning model is a black box, ultimately. You just can't hope to know why are the neurons hooked up this way. We're never going to know. That's just not how you can understand them. But what you can do is, you can shine some light into that black box. And one of the tools we've made recently to do that is something called saliency. And so our black boxes, they take input as text. So you feed in pages and pages of text, and it's going to do something for you. For example, it might classify them.
John Bohannon: Let's say you have a whole bunch of documents flowing in. I'll make this up. These are complaints from customers. Some system, you've got complaints from customers coming in, and you need to triage. You need to put these guys in the right buckets so that they can be dealt with appropriately. Well, you can put a classifier right there at that gate. The documents are coming in, and its job is to say which bucket each one belongs in. So that's classification. If that's all you've got, that's a classic black box, so you just don't know why it put that doc into that bucket. There's no way to know.
John Bohannon: And so we've made this thing that you can use to have the model explain itself. So a doc comes in. It says, "This belongs in Bucket A." You can say, "What was it about this doc that made you decide it belonged in Bucket A?" And what it does, and it literally looks like this, it's as if it took a highlighter pen, went to that doc, and said, "These words, and in fact, this sentence is the most important bit for me making that decision."
John Bohannon: It just works. It's kind of amazing: when you get human experts to look at what the model highlighted, sure enough, they'll tell you, "Yeah, that's what I would have highlighted." It doesn't always work that way, but when it does, that really helps. It's showing you what was most salient when it read the doc.
John Bohannon: What you've done is, you've essentially taken a whole big doc, which takes a long time for a human to read, and then you've focused in on a small part of the doc which was most useful for its classification effort. And so you've essentially summarized the decision-making process.
John Bohannon: By the way, it's useful not only for building trust with a customer who wants to use this thing, but it's just a great reliability tool for the engineers who build them. To give you an example, if I train a model and I use saliency, I look at some decisions the model made that I disagree with, and I'm like, "Why is it making this error?" Well, now I can say, "Show me what it is in this doc that you're paying attention to."
John Bohannon: And often, if it's a really dumb mistake, it's because, "Oh, I am picking out these words, and these words just look a lot alike." And often, they'll be ambiguous words, or maybe it'll be, "Oh, you know what I'm realizing? I've skewed the training data. I've basically taught the model to cheat." Whenever it sees this word, it just assumes, "Oh, that belongs in that bucket." And it's because my data isn't balanced, it's not diverse enough. And so these little tools are actually very useful for the engineer as well as the customer. What goes around, comes around.
Jeff Kavanaugh: So we've talked about bias inside and ways to look at it and to minimize it. Natural language processing is difficult. You said it's gotten better, at least with the horsepower behind it. You sound like you operate at a global scale. National security, global corporation. What's the difference about solving problems at that scale, versus something that's a little smaller, almost at a toy level?
John Bohannon: If we're talking about a global scale, that means not just volume, not just, "Oh, there's more of it," but it's more diverse. The context in which the thing is going to be used is less predictable, less well defined. The diversity of, in our case, text, you should assume is going to be very high. You just don't know. That presents some challenges that you don't have to worry about as much when you have a small-scale problem, where you really can define all of the range of inputs it will see and the range of contexts and use cases.
John Bohannon: Here's a very tangible, practical example of that. We have a model called named entity recognition. So it does the job of finding all the people, places, organizations, and other named entity things in a piece of text. So you could feed in a contract or a news article or a bunch of e-mails, and it's just going to go through and find for you all of the named entities in that text. And there's a ton of downstream useful work you can do with that. You've got to create that structure first, though. Make a big lookup table of all the people, places, things.
John Bohannon: If you have a truly global customer who's going to be building solutions on top of that model, it's a real problem if that model was only trained on Western names. If it largely only ever saw, during training time, Western people, Western locations, Western organizations, you can bet it is going to perform more poorly when that model and the systems built on it are deployed off in totally different contexts that involve foreign names, non-Western locations, and so forth.
John Bohannon: And so what we've done at Primer is, we've tested how well our named-entity recognition model performs when you throw it into a foreign land. And the way we did that was, we made a huge data set of non-Western and Western names, first and last, of people, and we played a substitution game. So we had all this gold-label data where we know what the true answers are, and we basically substituted the names of people in our first experiment with one of about a dozen other languages. So we had a big grab bag of Finnish names, we had a big grab bag of Korean names, and so forth.
John Bohannon: And it's a statistical test of whether, when you replace the names systematically and randomly with other cultures, does the system perform better or worse in a way that can be explained just by the origin language? And sure enough, some models, as soon as you swap those names out with anything but other English names, the performance just starts to nosedive. It can't recognize the name, or it misclassifies it: "Oh, that's not a person, that's an organization," or whatever. Then the next step, of course, is to mitigate. So now, what you can do, of course, is, you can take training data and play the swap-a-roo game. So you just make sure that the model really does get exposed to truly diverse names.
John Bohannon: It doesn't stop there. You also need to increase the diversity of the text that it was trained on in the first place. You need to go off and find foreign newspapers and unusually formatted data, and boost that diversity so that you're not making something that can only run on rails. You really need something that can go off-road. But that's what it looks like day to day. When you're dealing with a truly global set of customers, you have to double down on reliability.
Jeff Kavanaugh: What are some of the other kinds of challenges that you're looking to solve at Primer that you see now and you see around the corner?
John Bohannon: I'd say the thing that's just totally preoccupying me right now is, we've launched this exciting, new, kind of scary, hard-to-imagine product called Automate. The basic idea behind it is, we've been building all these machine learning models behind the scenes to power the products that we sell to these big organizations. And a light bulb went off in our heads: "Hey, we could also just sell those models and make it possible for people to fine-tune them for their own problems." It's incredibly challenging, because we have to reverse engineer ourselves as data scientists so that a customer who's not a data scientist at all can walk right in, get to work, and actually solve a problem: train a machine learning model end to end and deploy it to solve some problem that they have.
Jeff Kavanaugh: Looking at the corporate world today, are there any specific challenges that you think should be solved, and maybe that's around the corner, beyond the tools that you have today, that maybe businesses in general should be solving, or you're excited about helping them solve, maybe in the next year or two?
John Bohannon: An example of something that I would love to see us solve in the next couple of years is just keeping track of everything you already know in a knowledge base so that you don't have to basically keep updating this knowledge base. It should be a self-updating knowledge base. So you have some system that's listening to you, listening to your world, whatever those information streams are that are passing through, and it just does this passive, boring job of keeping track of everything you're learning about the things you care about in your world. So we call that final system a knowledge base or a knowledge graph. It's really just a big database that's keeping, in a nice, structured way, everything you know about your world that you care about.
John Bohannon: And to this day, people have to do this terrible task that is often called WikiGnoming. It's that incredibly boring, tedious thing of going and entering data into a system, bit by bit, correcting it, tidying it, feeding it. It's like having a pet, and it just soaks up the life force of all these wonderful people who should be freed from that task so they can be synthetic and creative. That's what humans are good at.
John Bohannon: And we started to build such a system. We call it Quicksilver. We built the first version more than two years ago, and you can share with readers a nice link to a Wired article that was written about it. For our first prototype application, we wanted to make, literally, a self-writing Wikipedia that would go off and find all of the women of science who were missing from Wikipedia, who had done work just as important and notable as the men of science who already have Wikipedia pages, and it would just write a draft of that person's bio and put it in a queue of work for human volunteers. And so that's what we built, and that's a sign of what I hope that we're going to build for the rest of the world in the coming years.
John Bohannon: So whether you're in the government or a big company, you have some world you care about, whatever it is, and you have all of this information about the entities in that world, whatever that is. And you have to keep it somewhere, and you have to somehow get the world's information into your world so that you can access it and manipulate it. You can be alerted when something crucial is changing. That's the name of the game: self-updating, self-writing knowledge bases. So that's what I want to build.
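The self-updating knowledge base John describes can be sketched as a simple entity store with change alerts. This is just the shape of the idea, not how Primer's Quicksilver actually works: incoming facts are merged into structured records, and watchers are notified when something they care about changes.

```python
from collections import defaultdict

class KnowledgeBase:
    """A toy self-updating knowledge base: each incoming (entity,
    attribute, value) fact is merged into structured records, and
    watchers are alerted when something they care about changes.
    """
    def __init__(self):
        self.records = defaultdict(dict)
        self.watchers = defaultdict(list)

    def watch(self, entity, callback):
        self.watchers[entity].append(callback)

    def ingest(self, entity, attribute, value):
        old = self.records[entity].get(attribute)
        if old != value:
            self.records[entity][attribute] = value
            for cb in self.watchers[entity]:
                cb(entity, attribute, old, value)

kb = KnowledgeBase()
kb.watch("Acme Corp", lambda e, a, old, new: print(f"{e}: {a} {old} -> {new}"))
kb.ingest("Acme Corp", "CEO", "J. Doe")     # Acme Corp: CEO None -> J. Doe
kb.ingest("Acme Corp", "CEO", "J. Doe")     # no change, no alert
kb.ingest("Acme Corp", "CEO", "A. Nguyen")  # Acme Corp: CEO J. Doe -> A. Nguyen
```

The hard part John is pointing at is upstream of this sketch: reliably extracting those (entity, attribute, value) facts from unstructured text streams so no human has to type them in.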
Jeff Kavanaugh: That sounds great. And everyone, you'll be able to find details on our show notes and transcripts at infosys.com/IKI. Thank you so much for your time and a very interesting discussion, and maybe we'll have you back again sometime.
John Bohannon: Oh, it'd be a pleasure.
Jeff Kavanaugh: Everyone, you've been listening to the Knowledge Institute, where we talk with experts on business trends, deconstruct main ideas, and share their insights. Thanks to our producer, Catherine Burdette, Christine Calhoun, and the entire Knowledge Institute team. Until next time, keep learning, keep sharing.
About John Bohannon
Director of Science
John leads applied research at Primer. He worked for a decade as an award-winning investigative data journalist based in Europe and embedded with NATO forces in Afghanistan. He has a PhD in molecular biology from Oxford University.
Mentioned in the podcast
- “Announcing Primer Automate: A No-Code Platform to Build and Train Your Own NLP Models” – Primer – April 6, 2021
- “Low Code, No Code: Transforming Digital Platform Development” – Infosys Knowledge Institute – November 2020
- “A New State of the Art for Named Entity Recognition” – Primer – November 8, 2019
- “Machine-Generated Knowledge Bases” – Primer – August 3, 2018
- “Using Artificial Intelligence to Fix Wikipedia's Gender Problem” – WIRED – August 3, 2018
- Primer AI website