AI, NLP, & Data Science. What Do We Use, and Why Should You Care?

February 08, 2022

10 minute read

Data science is a playing field growing larger and larger by the day and becoming ever more populated with new players. We feel that it is essential to practice what we preach and present some transparency around how our platform works—and how, from our humble beginnings, we built our company upon a foundation of data science and data-driven innovation.

However, we needed to bypass the marketing department and go to a source explaining how our platform functions and how it was developed. So I sat down with our CTO, Christian Lawaetz, to learn more about how our Valuer Platform uses artificial intelligence, natural language processing, and data science.

So Christian, let’s start by helping to define what characterizes artificial intelligence?

Christian: "Artificial Intelligence represents a broad area of digital brains (computers). It has many definitions and applications. While we don’t have computers with a consciousness walking among us, we are already exposed to AI-driven products and solutions. Most people consider any system that is “complicated” enough to perform intelligent actions that mimic human behavior to be AI, including everything from self-driving cars to online advertising.

However, when we drill down into any AI in production (being used) around us, we find that it is built from a lot of smaller “moving” pieces. At its foundation, it is nothing more than simple input and output mechanisms or information processing models. An AI, to simplify it, is built up of a lot of algorithms and other methods of taking in data, doing something to it, and outputting something else.

This is the beauty but also sometimes the pitfall of AI. It is nothing more than a complicated system built up of simple pieces, which makes it tangible rather than magic. Like every other information processing system, it is bound by reality and is only as intelligent as the data you feed it."

How do we apply them to our platform?

Christian: "At Valuer, we use AI as an essential piece of our system/infrastructure to objectively process large amounts of data and identify relevant results for the question being asked. These questions can be anything from “which sector does this company belong in” and “what is the most similar company to this one” to “are any companies solving this problem in a similar way.”

The beauty of the application of AI in our system is that we can automatically set up a flow of data that is processed according to some kind of instructions. It’s the fundamental value that many other applications of AI get and is a more efficient way of processing data. There are a couple of main AI applications on our platform. We use Neural Networks for predicting missing data points. We use Clustering to make sense of large amounts of data. Finally, we use NLP (natural language processing) to provide answers based on text."

How does this translate into customer value?

Christian: "Our goal of applying AI in our platform is to simplify accessing a large amount of data and make it more convenient to interact with. We have all experienced the frustration of getting a dump of data in an Excel document, or even worse, as a PDF or text document that is impossible to navigate. Imagine that frustration multiplied by a factor of 1000+, and that’s how it would feel to access our database “naked” without any search interface. Of course, a lot of these frustrations could be mitigated by simpler search algorithms or lookup systems, but we believe (and have concluded) that there are more convenient ways of working with data for the user.

You don’t need all the data in one big messy file. You want to get a couple of the “best” results based on your question, and it would be great if you didn’t have to know the answer to the question you are trying to solve in order to find it. These are the problems our application of AI helps to mitigate/remove. In essence, our Neural Networks help with identifying and “fixing” missing data, our clustering makes sense of all the data, and the NLP allows you to ask an open question and get a helpful answer without knowing exactly what you are trying to find."

What is Natural Language Processing, why is it so important to our platform, and how does it enhance the customer experience?

Christian: "NLP is the branch of AI that works to process natural language (how we humans communicate). A widespread interaction with NLP is when you google for answers, in which you don’t have to think about keywords and data structures, you just throw a question to the algorithm, and it deals with it for you.

NLP is clever because it has been trained on a considerably large amount of text data generated by humans. I don’t think we should get into the details of this, but it can be considered such that the AI has seen so many structures of sentences using similar words as your question that it can “understand” and link your question to a helpful answer.

The NLP cuts out the middleman and searches for you by finding structures and keywords in your natural sentence and goes to look those answers up. NLP can be used for many other things, but searching is the primary use in the Valuer platform. We allow users to access our startup database directly from a text field to write any question they want, which the NLP AI then uses to find the most relevant results.

An important note here, which sometimes causes frustration/confusion, is that you will never get a perfect result, and you should also be prepared to get some less valuable results. You won’t get a “binary” response since the AI is not using a checklist to look up exact matches based on fixed keywords.

“So, can you describe what you are interested in today?”

Instead, you will get a gradient response based on how similar the results are to your question. This is shown in the Valuer platform as our “match score,” which depicts how close the startup description is to your original question. This is where I personally (and many of our users who get familiar with our search flow) find the benefits or enhancement of the customer experience.

If you use the process correctly (and yes, I am aware that it is not always going to be easy), you’ll get previously unknown and hopefully valuable results that you would not have access to if you used a traditional search process. In other words, you can ask an open question that provides you with the most relevant answers, even though they are not necessarily from the same industry or sector you initially thought to look in."
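To make the difference between a “binary” and a “gradient” response concrete, here is a minimal Python sketch. It is an illustration only, not the code running on the Valuer platform: the startup descriptions are invented, and simple word counts with cosine similarity stand in for the far richer language model described in this interview.

```python
import math
from collections import Counter


def word_counts(text):
    """Turn a text into a bag of lowercase word counts."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Graded similarity between two word-count bags, from 0.0 to 1.0."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def checklist_match(query, description):
    """Binary search: True only if every query word appears in the description."""
    return set(query.lower().split()) <= set(description.lower().split())


# Hypothetical startup descriptions, invented for this example.
descriptions = {
    "panel_co": "solar panels for rural homes",
    "battery_co": "battery storage for solar energy",
    "bike_co": "online marketplace for used bikes",
}
query = "solar energy storage systems"

# Binary answer: does the description tick every box?
binary_hits = {
    name: checklist_match(query, text) for name, text in descriptions.items()
}

# Gradient answer: how similar is each description to the question?
scores = {
    name: cosine_similarity(word_counts(query), word_counts(text))
    for name, text in descriptions.items()
}
ranked = sorted(scores, key=scores.get, reverse=True)

for name in ranked:
    print(name, round(scores[name], 2), "exact match:", binary_hits[name])
```

In this toy data no description passes the exact-keyword checklist, so a binary search would return nothing. The graded score still ranks the battery company first, which is the kind of "similar but not identical" result the match score is meant to surface.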

What is the difference between unsupervised and supervised learning?

Christian: "In AI, we generally refer to two significant ways of creating a system: supervised and unsupervised learning. There are also other approaches, such as reinforcement learning, self-learning, and deep learning, but for this conversation we can treat them as combinations or more complicated versions. All AI is trained by exposing it to data sets (training data), which is a large part of what makes the systems complex. Training on extensive data pools gradually tunes/adjusts how the system processes new data (non-training data) and provides results.

Supervised learning– is the most “familiar” to us humans. It is how we teach each other new things, and it is also an efficient way of improving an “intelligent” system such as an AI. The learning process is done by showing examples of results and how to find them, basically giving examples of correct answers for the task.

This is an easy way of training a system, but it requires another intelligence to have processed and labeled the example/training data beforehand, which can be time-consuming. For example, this could be having an AI perform classifications, such as telling what is in a picture that you show it.

To train the AI, you need to provide it with pictures that have already been processed by an intelligence (humans) and classified correctly so that the AI can learn how to accurately identify what's in the picture (side note: in many image recognition algorithms, we also apply association, an unsupervised method, to avoid having to “tell” the AI what to look for, which just underlines that this stuff is not black and white).

After training on sufficient data, you can provide it with unknown data and hopefully see it correctly identify what is in pictures that no one has seen or processed before (this is part of the value we discussed earlier, being able to process previously unseen data).
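As a rough illustration of that train-then-predict loop, the Python sketch below learns from a handful of labeled examples and then classifies an example it has never seen. Everything in it is an assumption made for the illustration: the two numeric features stand in for a picture, the labels are invented, and a nearest-centroid classifier is used because it is short, not because any system discussed here works this way.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two feature lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def train_centroids(examples):
    """Learn one average feature vector (centroid) per human-provided label."""
    sums, counts = {}, {}
    for features, label in examples:
        if label not in sums:
            sums[label] = [0.0] * len(features)
            counts[label] = 0
        sums[label] = [s + f for s, f in zip(sums[label], features)]
        counts[label] += 1
    return {
        label: [s / counts[label] for s in total]
        for label, total in sums.items()
    }


def predict(centroids, features):
    """Return the label whose centroid is closest to the new example."""
    return min(
        centroids,
        key=lambda label: squared_distance(centroids[label], features),
    )


# Hypothetical training data: (roundness, yellowness) labeled by a human.
training_examples = [
    ([0.9, 0.2], "orange"),
    ([0.8, 0.3], "orange"),
    ([0.2, 0.9], "banana"),
    ([0.3, 0.8], "banana"),
]

model = train_centroids(training_examples)

# An example that was not part of the training data.
print(predict(model, [0.85, 0.25]))
```

The labeling step is the expensive part: every training row above needed a human to supply the correct answer before the system could learn anything.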

Unsupervised learning– is a little more “away” from our human perspective. It works more like asking the AI to find structure in unstructured data and have it group the data in a way that enables the system to do the same for new data it’s exposed to. Two significant areas here are clustering and association, which are two practically different ways of applying unsupervised learning, but both work with data that does not have a classification/label already applied to it.

Unsupervised learning is having the AI sort data into buckets so that similar items are placed together, which is a way to avoid the human element of labeling data for the AI to work from. Unsupervised learning can also help in identifying patterns that we humans did not see or expect in the data since it is “objectively” processed and divided based on all the parameters and patterns that the system identifies.

This is a good process to avoid having to classify the training data. First, you just share the data with the AI, and it learns from the patterns that it identifies itself. An example of unsupervised learning could be a demographic analysis, where the system processes a large pool of data to identify groups and segmentation relevant for marketing or sales applications."
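The demographic example can be sketched with k-means, one common clustering algorithm. This is a minimal illustration under stated assumptions: the customer data is invented, the number of groups is fixed at two, and the starting centroids are simply the first two customers so that the result is reproducible.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=10):
    """Group unlabeled points into k clusters.

    Starts from the first k points as centroids (a simplification chosen
    for reproducibility), then alternates between assigning each point to
    its nearest centroid and moving each centroid to the mean of its members.
    """
    centroids = [list(point) for point in points[:k]]
    assignments = [0] * len(points)
    for _ in range(iterations):
        assignments = [
            min(range(k), key=lambda c: squared_distance(point, centroids[c]))
            for point in points
        ]
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return assignments, centroids


# Hypothetical customers as (age, store visits per month). No labels given.
customers = [(22, 9), (64, 2), (25, 8), (61, 1), (19, 10), (58, 3)]

assignments, centroids = kmeans(customers, k=2)
for customer, group in zip(customers, assignments):
    print(customer, "-> group", group)
```

Nobody told the algorithm that there is a "young, frequent visitor" segment and an "older, occasional visitor" segment. The grouping comes only from the patterns in the numbers.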

Do we use both? And what is the customer benefit?

Christian: "The short answer is yes. The longer answer is that it depends on how you ask the question. We have multiple systems built on pre-trained AI, each of which was created from vast pools of data in a big and complicated process that included supervised and unsupervised learning (and everything in between).

However, we also apply AI to process both labeled and unlabeled data, which requires supervised and unsupervised learning respectively. The customer benefit is that we have a toolbox of AIs that can process any kind of data relevant to the startups you want to find, which makes the interaction with our database a lot more efficient and manageable. Supervised learning helps in cases where we can label the data, and unsupervised learning is excellent when we want to find structure in unstructured data or prepare it for searching."

What is a black box AI?

Christian: "A Black Box system is simply something where you feed it input data, and it provides you with a result/output, but you have no way of knowing/understanding how it came up with that result. This is also sometimes referred to as an oracle, which can answer your questions but won’t tell you how the answer was generated.

Black Boxes are ubiquitous in AI since their complexity makes it very difficult to reverse engineer an answer. It is even more difficult for humans to understand the process that leads to that specific answer. Some AI models are notoriously difficult to figure out (so they are Black Boxes), such as a Neural Network, which acts similarly to neurons in the human brain. Other AIs are easy to figure out. Examples include those built on regression, which is more natural for humans to understand and much easier to reverse engineer."
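A small sketch shows why regression counts as easy to reverse engineer. The Python below fits a straight line with ordinary least squares; the data points are invented for the example. The whole "model" ends up being two numbers that a person can read and check by hand.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one input: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical observations, invented for this example.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]

slope, intercept = fit_line(xs, ys)

# The full explanation of any prediction is "slope * x + intercept".
prediction = slope * 10 + intercept
print(f"y = {slope:.1f} * x + {intercept:.1f}; prediction for x=10: {prediction:.1f}")
```

A Neural Network trained on the same data could make similar predictions, but its reasoning would be spread across many weights with no single readable rule.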

Why are/can they be a bad thing?

Christian: "A Black Box algorithm is not necessarily a bad thing. It’s often a by-product of a complex system, and that complexity is also a benefit when the system needs to factor in a considerable amount of data and variables. The neural network behind a self-driving car’s ability to process its environment and act on what is happening is impossible for us to understand (we can’t explain why the car made a given decision just from a particular input).

It can still do its job well, though, and we are able to conclude that based on the tests we carry out. In other use cases, however, such as determining a credit score, we as humans want and need to know the reasoning behind why an AI decided on a specific answer.

Here we can’t use Black Box AIs because they won’t live up to our requirements of understanding. Luckily, we can use many different AIs, some of which can solve problems, such as credit scoring, while having transparency of why a particular outcome was picked (or which can be adjusted to provide at least simplified insights into the process)."

Do we use them?

Christian: "Yes, we use Black Box AIs in some situations. However, we are very aware of the implications and limitations of any type of AI we consider implementing in our systems. In the case of our process to identify and map missing data, we use a Neural Network, which is a black box system. Still, we determined that since this is done as preprocessing before any customer gets to interact with the data, it does not matter how it found an “answer,” only that it is correct.

On the other side of the spectrum, when it comes to our users’ search process and feedback mechanism, we do our best to avoid Black Box systems since this would impact the user experience. The way we structure our database and carry out the search allows us to provide a “distance” to each result, which we convert into a match percentage (our match score). This is one of the ways we try to “un-black-box” the interaction with the AIs on the Valuer platform."
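The interview does not say how a distance becomes a match percentage, so the Python sketch below uses an assumed formula purely to show the idea: a distance of zero maps to 100%, and the score falls smoothly as the distance grows. The function name and the formula are hypothetical.

```python
def match_score(distance):
    """Convert a non-negative distance into a percentage.

    Assumed formula for illustration only: 100 / (1 + distance).
    A distance of 0 gives 100%, and larger distances give lower scores.
    """
    if distance < 0:
        raise ValueError("distance must be non-negative")
    return 100.0 / (1.0 + distance)


# Hypothetical distances between a query and three results.
distances = [0.0, 0.25, 1.0, 3.0]
percentages = [match_score(d) for d in distances]

for d, pct in zip(distances, percentages):
    print(f"distance {d:.2f} -> match {pct:.0f}%")
```

Whatever the real conversion looks like, the design point is the same: showing a number that explains how close a result is gives the user some insight into why it was returned.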

Can you share anything about Valuer’s algorithms and what/why we use them?

Christian: "Neural networks– There are always a lot of unstructured and missing data that gets picked up through our data collection and processes (data mining etc.). Here we feed the data through several Neural Networks to predict the missing values, such as country and sector.

This enables us to pick up data that is not necessarily complete, or that lacks some of the values in our database, and still use it in our systems for users to access. It also acts as an automatic check/insurance that we have structured data for the later processes so that we don’t feed “bad” data into the later applications and AI and get bad results.
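As a heavily simplified illustration of predicting a missing value, the Python sketch below trains a single artificial neuron (the smallest building block of a neural network) on complete records and uses it to fill in a missing sector. The records, the two keyword features, and the sector names are all invented for the example, and a production system would use much larger networks and richer inputs.

```python
import math


def sigmoid(z):
    """Squash any number into the range 0..1."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_probability(weights, bias, row):
    """Probability that the row belongs to class 1."""
    return sigmoid(sum(w * x for w, x in zip(weights, row)) + bias)


def train_neuron(rows, labels, epochs=200, learning_rate=0.5):
    """Fit one sigmoid neuron with stochastic gradient descent."""
    weights = [0.0] * len(rows[0])
    bias = 0.0
    for _ in range(epochs):
        for row, label in zip(rows, labels):
            error = predict_probability(weights, bias, row) - label
            weights = [w - learning_rate * error * x for w, x in zip(weights, row)]
            bias -= learning_rate * error
    return weights, bias


# Hypothetical complete records.
# Features: [description mentions "patient", description mentions "payment"]
# Label: 1 = sector "health", 0 = sector "fintech"
complete_rows = [[1, 0], [1, 0], [0, 1], [0, 1]]
complete_labels = [1, 1, 0, 0]

weights, bias = train_neuron(complete_rows, complete_labels)

# A hypothetical record that arrived without a sector.
record = {"name": "ExampleCo", "features": [1, 0], "sector": None}
if record["sector"] is None:
    probability = predict_probability(weights, bias, record["features"])
    record["sector"] = "health" if probability >= 0.5 else "fintech"

print(record)
```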

Clustering– We also use a couple of clustering algorithms, which are unsupervised learning algorithms that allow us to work with data that is not necessarily labeled in the way we want or that is way too complicated to be assigned “only” a single label. Our clustering work includes t-SNE (a technique for projecting high-dimensional data into fewer dimensions) and KNN (k-nearest neighbors, which finds the entries closest to a given point). It allows us to process our database of companies and position every entry based on many different criteria, including very complex data such as free-form text.

This clustering prepares our datasets for the search algorithms that our users use to access the data. It is yet another piece in the puzzle of reducing the complexity of accessing relevant results while skipping over everything that is not practical or relevant.

NLP– Finally, the NLP we use is a version of BERT, a model built by Google and pre-trained on 3.3 billion words through self-supervised learning, which we have adapted and adjusted to our application of searching for startups. In principle, it is an AI that has been trained on a lot of text data so that it can receive previously unknown input/data (free-form text from our users) and understand what the user is looking for.

The search query (question) is processed and projected into the clustered startup database, where it is treated as an ideal startup. This allows us to automatically find the closest (and most similar) startups and send them back as results. We see a considerable part of the value being generated here, since finding results that are “only” similar and not necessarily the same is where you find unknown solutions to open questions."
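The "query as an ideal startup" idea can be sketched as a nearest-neighbor lookup. In the Python below, the two-dimensional positions are invented stand-ins for what a language model would produce from real descriptions, and the startup names are hypothetical. Only the lookup step is shown, not the language model itself.

```python
import math


def nearest_startups(query_position, startup_positions, k=2):
    """Return the k startups closest to the query as (name, distance) pairs."""
    distances = [
        (name, math.dist(query_position, position))
        for name, position in startup_positions.items()
    ]
    return sorted(distances, key=lambda pair: pair[1])[:k]


# Hypothetical positions in a shared 2D space (real embeddings have
# hundreds of dimensions).
startup_positions = {
    "SolarGrid": (0.9, 0.1),
    "BatteryBox": (0.8, 0.3),
    "BikeSwap": (0.1, 0.9),
}

# The user's question, projected into the same space as if it were a startup.
query_position = (0.9, 0.2)

results = nearest_startups(query_position, startup_positions, k=2)
for name, distance in results:
    print(f"{name}: distance {distance:.3f}")
```

Because the lookup is based on closeness rather than exact equality, it always returns the nearest neighbors available, even when no startup matches the question perfectly.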

Wrap-up

In essence, the data science sandbox is a complex ecosystem that can be difficult to fully understand even for experts in the field. With this interview, we hope that it can lend a helping hand in understanding AI and NLP as well as shed some light on what we are doing and how we apply data science in our platform.

We’ve built our company on this foundation of AI, letting the data drive us forward. Our platform makes this process easier. Through utilizing AI to cut out the busy work, you effectively get reliable results faster, making any team increasingly efficient. And in our case, this is what we want to convey, that the way we apply data science and AI is versatile and can be an extremely useful tool for any business or investor that wants to let data fuel their decisions.