Artificial Intelligence (AI) In The Data Asset 1.0 Era

Image: iStock/Just Super

AI takes off again in the data asset 1.0 era

During the past decade or so, marked as the data asset 1.0 era [what is the data asset 1.0 era? (to be posted soon)], artificial intelligence (AI) has fundamentally impacted many industries, from changing how business decisions are made to creating countless new revenue sources. However, AI is not new. The term was coined in 1956, and AI has since gone through several ups and downs [1]. Needless to say, a major driver for the successful commercialization of AI this time is big data. Unprecedented amounts of data have been collected from our daily lives, with and without our awareness. During this period, data, especially personal data, has been treated as companies' "most valuable" asset. Personal data is any information that relates to an identified or identifiable living individual [2], such as contact information, a passport photo, or shopping history. The more personal data is collected, the better AI "understands" people and their needs, and the more revenue is generated. As a result, companies have been fearlessly and eagerly acquiring as much personal data as possible during the data asset 1.0 era, fueling AI to take off again.

High-quantity but low-quality personal data

The quantity of personal data available for building AI models has been large over the past several years; however, its quality is relatively low. The low quality results from the approaches through which the data are collected, as well as from the structure of the collected data.

The first data collection approach is through companies' own business channels, such as amazon.com for Amazon and the Pixel phone for Google. However, very few companies have enough data collection channels to understand their customers well enough to offer accurate, personalized products and services at the right time. A limited number of channels usually means limited information, or more accurately, a high noise-to-signal ratio. For example, knowing only a person's income and gender, no model can tell whether they like sushi. In fact, collecting personal data from new channels (to expand the feature space) is a continuous and strategic effort for most AI companies, which leads to the second approach: buying personal data from third parties.
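To make the "expanding the feature space" point concrete, here is a toy sketch (the channels, columns, and join key are made up for illustration) of how records from a second channel add features that income and gender alone could never provide:

```python
import pandas as pd

# Hypothetical data from a company's own channel: only two weak features.
own_channel = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "income": [55000, 92000, 43000],
    "gender": ["F", "M", "F"],
})

# Hypothetical data acquired from a second channel (e.g. food-delivery history).
second_channel = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "sushi_orders_last_year": [12, 0, 3],
    "avg_order_value": [38.5, 0.0, 21.0],
})

# Joining on a shared identifier expands the feature space from two to four
# features, giving a downstream model a far stronger signal (past sushi orders)
# than income and gender alone.
features = own_channel.merge(second_channel, on="customer_id", how="left")
print(features)
```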

In the data asset 1.0 era, buying and selling personal data among companies has been a common business practice [3]. However, personal data acquired through this approach usually goes through a desensitization process, which removes key identifiable information, such as names and contact information, from the original data. The quality of the personal data is consequently reduced.
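As a rough sketch of what such a desensitization step can look like (the fields and the hashing choice below are illustrative assumptions, not any particular vendor's process), direct identifiers are dropped or hashed before the data changes hands, and those removed fields are exactly where the quality drops:

```python
import hashlib
import pandas as pd

# Hypothetical raw records as they might exist before sale.
raw = pd.DataFrame({
    "name": ["Alice Chen", "Bob Diaz"],
    "email": ["alice@example.com", "bob@example.com"],
    "zip_code": ["94105", "10001"],
    "monthly_spend": [120.0, 75.5],
})

def desensitize(df: pd.DataFrame) -> pd.DataFrame:
    """Drop direct identifiers and replace email with a one-way hash."""
    out = df.copy()
    out["user_hash"] = out["email"].apply(
        lambda e: hashlib.sha256(e.encode()).hexdigest()[:12]
    )
    return out.drop(columns=["name", "email"])

# The buyer can still link records via user_hash, but the removed fields
# (and anything derivable from them) are gone, lowering data quality.
print(desensitize(raw))
```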

In addition to limited channels and the information lost through desensitization, a significant amount of the data collected is unstructured and cannot be used directly. Because the quantity of unstructured data is huge, for the majority of companies simply understanding what these unstructured data contain is almost impossible, let alone extracting value from them.

Large models, high data security risk, low interpretability

With data of high volume but low quality, the development of AI has followed the general path of increasing model size over the past decade or so [4]. Most breakthroughs have come from the deep neural network (DNN) model family, whose models are large by design and have achieved human-level or better performance on many hard tasks, such as facial recognition [5] and, very recently, question answering [6]. For example, ResNet-152, a popular model in computer vision, has more than 60 million parameters! A major factor in the success of DNNs is that, unlike most other machine learning models, their performance does not rely on feature engineering by AI scientists. DNNs "learn" the most suitable features from the data themselves when enough training data are provided, making them very handy for complicated problems whose features are infeasible for humans to construct properly.
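To put that number in perspective, here is a minimal sketch (assuming PyTorch and torchvision are installed) that instantiates ResNet-152 and counts its parameters, which indeed comes out above 60 million:

```python
import torch
from torchvision import models

# Instantiate ResNet-152 with randomly initialized weights; no download needed.
model = models.resnet152()

# Count trainable parameters (roughly 60 million).
n_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"ResNet-152 parameters: {n_params:,}")
```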

However, the quantity of data needed for training is enormous, leading to two major problems. 1. Security holes in the data infrastructure. To collect, manipulate, and compute over ever-growing data, the capacity of the underlying data infrastructure needs to expand accordingly. Upgrading existing data infrastructure, or integrating with new infrastructure, can create security holes, leading to serious data security risks. Moreover, since most model training is centralized and requires aggregating personal data from different channels, the impact of any security problem is amplified. 2. Low interpretability. Large models are often criticized for their low interpretability, as each decision is the result of an intractable amount of computation. Yet there are cases where interpretability becomes very important [7], especially when the answer alone does not solve the whole problem. For example, if a model predicts that a customer will churn, an explanation of why this would happen is more important than the prediction itself.
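To illustrate what such an explanation might look like in practice, here is a small sketch on synthetic churn data (the features and data are made up, and permutation importance is just one of many interpretability techniques, not the only way to do this):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2000

# Hypothetical features: monthly_spend, support_tickets, tenure_months.
X = np.column_stack([
    rng.normal(50, 15, n),    # monthly_spend
    rng.poisson(2, n),        # support_tickets
    rng.integers(1, 60, n),   # tenure_months
])

# Synthetic churn signal: more tickets and shorter tenure raise churn odds.
logits = 0.8 * X[:, 1] - 0.05 * X[:, 2] - 1.0
y = (rng.random(n) < 1 / (1 + np.exp(-logits))).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# The prediction alone ("this customer will churn") is less useful than knowing
# which features drive it; permutation importance gives a coarse global answer.
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
for name, score in zip(["monthly_spend", "support_tickets", "tenure_months"],
                       result.importances_mean):
    print(f"{name}: {score:.3f}")
```

In this synthetic setup, support tickets and tenure come out as the dominant drivers, which is the kind of answer a retention team can actually act on, rather than a bare churn probability.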

What will happen to AI next?

In the data asset 1.0 era, AI has made great progress thanks to unprecedented amounts of data and computing power. However, a major change is coming to the entire data industry with the release of a series of data privacy regulations around the globe. What will happen to AI in the next 10 years? Please check: The future of Artificial Intelligence (AI) in the data asset 2.0 era (to be posted soon).

Hongyuan Yuan

Head of AI @ Helios Data Inc.

2018-10

1. History of artificial intelligence. Available from: https://en.wikipedia.org/wiki/History_of_artificial_intelligence.

2. What is personal data? Available from: https://ec.europa.eu/info/law/law-topic/data-protection/reform/what-personal-data_en.

3. How companies collect, combine, analyze, use and trade personal data on billions. Available from: https://dataethics.eu/en/companies-collect-combine-analyze-use-trade-personal-data-billions/.

4. CNN Architectures-LeNet, AlexNet, VGG, GoogLeNet and ResNet. Available from: https://medium.com/@RaghavPrabhu/cnn-architectures-lenet-alexnet-vgg-googlenet-and-resnet-7c81c017b848.

5. Why Facebook is beating the FBI at facial recognition. Available from: https://www.theverge.com/2014/7/7/5878069/why-facebook-is-beating-the-fbi-at-facial-recognition.

6. Google open-sources BERT, a state-of-the-art pretraining technique for natural language processing. Available from: https://venturebeat.com/2018/11/02/google-open-sources-bert-a-state-of-the-art-training-technique-for-natural-language-processing/.

7. Interpreting machine learning models. Available from: https://towardsdatascience.com/interpretability-in-machine-learning-70c30694a05f.
