Making the most of the world of big data

Stephen Brobst, chief technology officer of Teradata

Stephen Brobst, Teradata’s chief technology officer, speaks to TRENDS about the intricate world of data warehousing and why it is essential to prevent data lakes from becoming data swamps.

Data lakes are an emerging concept in the GCC market. But sometimes, it takes a long time to develop a data lake and there are concerns that, by the time one is ready, a lot of information becomes outdated. What is your take on this issue and how are you tackling this problem at Teradata?

For me, that is more of a methodology problem. Ideally, it should not take a long time to set up a data lake. With Teradata Velocity for deployment, it does not take long. We can set up data lakes very quickly and we have already done so in the past.

However, there is a Gartner paper that says that, through 2018, 90 percent of all data lake deployments will be useless. So, you are spending a lot of money and, still, you will not get anything useful. The major reason behind this is the broken measure of success of the data lake.

If you have a bad measure of success, you will obviously get bad results. Also, most people measure the success of a data lake by how big it is – it is a good measure of cost, but not a very good measure of value.

How do bad measures of success impact businesses in the long run? What could be the solution to this?

If we have the wrong measure of success, it will lead to dysfunctional behavior, and that could be a disaster for the overall business ecosystem and lead to some really bad outcomes. And this points to the second problem: a lack of governance. In fact, I don’t like the term "data lake": a lake is a random act of nature, but a data lake should be man-made and always consumable.

We want to constructively use the data; we don’t want to just put it there in the lake. At the same time, we should be cautious as data lakes very quickly become data swamps. And methodology and governance are the two areas where most data lakes fail. Therefore, they demand keen attention.

We are living in the era of the Fourth Industrial Revolution, where technology has disrupted most industries. Amid this disruption, why do you think it is important for you to talk about artificial intelligence (AI)?

Let’s take the example of China. There is a small number of companies in China that have been quite successful and aggressive in deploying advanced learning technology. But, at the same time, there is a lot more unused potential. This is not for the ones who have already done it, but for the ones who have not done it yet. So the need of the hour is to identify this unutilized potential and devise suitable mechanisms and processes to make the most of it and ensure long-term success.

In China, there are big players, such as Alibaba, who are very advanced in this area, but our interest is in the traditional industries, such as financial services and telecommunications, where we see huge opportunities. So, what we are trying to do is to encourage more traditional enterprises to not be so slow and to adopt new technologies.

How are you looking to encourage traditional industries such as banks and telecom? Are they responsive enough for a new change?

I think there is a lot for them to explore and exploit in terms of their capabilities: providing better value to their customers, optimizing recommendations, detecting fraud, enhancing networks and more. And if the right idea is pitched to them, they are certainly ready to go for it.

In China, there are some companies that are very advanced, but they are a very small slice of the market. The gap between the advanced and the less advanced is much bigger than in the US market, where traditional sectors such as banks and telecom are already using these technologies.

The scope of AI is massive. Sometimes, it’s really difficult for businesses to choose the right options among those it offers. Currently, which segments of AI are you exploring for eventual use in your products?

The Teradata analytic platform delivers a lot of advanced capabilities that fall under the domain of AI. It is a very broad topic and we are focused specifically on machine learning and deep learning.

We don’t do robotics, we don’t do knowledge management. So there are a whole bunch of things in AI that we don’t do, but machine learning and deep learning are what we deliver. For this, we have intelligent software, which we refer to as Asterdata.

The Teradata analytic platform also uses open-source technology, so we can combine the commercial technology we developed with open-source technology. And then we provide a data learning platform, because deep learning means nothing without data. Our core competence is in managing very large quantities of data while ensuring very high performance and cutting-edge scalability.

Teradata is actively pursuing the application of deep learning; it is one of your fortes. If we talk globally, what challenges do you face in the area of deep learning?

The main limitation of deep learning is that it’s a black box. For example, when making certain decisions, if you just give a blanket statement without explaining anything, things will not work. So a clear explanation of any model or process is very important. For machine learning, which is linear mathematics, that is fairly easy to do, but for deep learning, which is nonlinear mathematics, it is very hard to explain to an average human.

So this is an active area of research with many questions that have not been solved yet. A lot of research is going on in Chinese and American universities to find solutions, but it is still in the very early stages.

Can you elaborate on the areas where applying deep learning is feasible, and on the cases where it is not suitable?

Understanding deep learning is very important to reap rich dividends in the future. For example, if you are using deep learning technology for, let us say, fraud detection, then as long as you can prove that your fraud detection is better, it is fine. But if you are doing something like credit risk scoring, at least in the US, it is highly regulated and it’s not very easy. So if you give one person a mortgage but not another, that [the reason behind it] is what the regulator wants to know.

This is because, 50 years ago in the US, this was a big problem, as there was a lot of discrimination on the basis of race and gender. As a society, we find that unacceptable now, but that was happening. So we started regulation and now your credit risk scoring is audited by a third party. It made things transparent and nobody will be able to say that a model score is derived from prejudicial data, because now you cannot collect data on whether someone is black or white, or male or female. But if you take a postal code and 90 percent of the people who live in that area are African Americans, you are again doing the same thing. So auditors look for such biases but, with deep learning, it is like a black box; it is hard for them to look into.

So things that are regulated cannot actually use deep learning. However, we need to solve this problem in the coming months. For machine learning, it is fine, but for deep learning, it is not.

You need enormous data to ensure sustainable results in businesses. What do you think of the opportunities in India, which is undoubtedly generating huge amounts of data, thanks to its large population, and is one of the fastest-growing economies?

In a [mostly] cash economy, where there is no digitization of transactions, we don’t add value. We always need data to add value, so the demonetization process that India started last year in November is very good for our business, because it means all of the data will now be recorded somehow and, as soon as you record it digitally, we can analyze the data and help people make better decisions. Now, the level of detail at which data gets recorded is of utmost importance.

Demonetization is actually the first step in the right direction and now we have to go big with digitalization, but I really want the detail below the financial transactions. There is another level of it that has to happen and, once we start getting that detail, life will be easier. Actually, India is at the beginning of this revolution. The digital divide in India is very big, but demonetization is going to bridge it, because the number of unbanked people in India is huge. So demonetization is going to bring those people into the mainstream and give them access to loan services and other things that will improve their quality of life.

So yes, India is a good market for us, but I don’t care much about sales. What I care more about is whether we [Teradata] are improving the quality of life or not. This is true for both India and China.