Artificial intelligence (AI), which includes machine learning, deep learning, reinforcement learning, and analytics algorithms, is a powerful tool, but it can't simply be installed and expected to start delivering value immediately. The algorithms must be trained on curated data sets before they can be applied to actual production data to deliver the value that businesses seek.
Examples of AI training requirements include labeling photos to enable AI systems to identify people or objects, adding context to text passages for text recognition, and mapping variations in voice recordings for speech recognition systems. While tools and platforms are evolving to make such training more accessible, data remains the other critical side of the equation.
Training algorithms involves creating a model that employs labeled or contextual examples of available or historical data from which machines can learn. Whereas humans learn through intuition and experience, machines can learn only through data patterns. This training process needs to occur both when an algorithm is first tested and deployed and throughout its lifetime. Training draws not only on incoming data, but also on humans specifying the meaning behind the labels or categories that go into the model.
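To make this concrete, here is a minimal sketch of supervised training. Python and scikit-learn are my assumed toolset (the article does not prescribe one), and the data, feature names, and labels are hypothetical:

```python
# Minimal supervised-training sketch: the model learns purely from
# labeled examples. All data below is hypothetical.
from sklearn.linear_model import LogisticRegression

# Each row is one labeled example the machine can learn patterns from.
X_train = [[55.0, 1], [120.0, 0], [40.0, 1], [95.0, 0]]  # e.g., [price, on_sale]
y_train = [1, 0, 1, 0]  # human-supplied labels: 1 = bought, 0 = passed

model = LogisticRegression()
model.fit(X_train, y_train)  # "learning" happens here, from the data patterns

# Once trained, the model can be applied to new, unseen data.
print(model.predict([[60.0, 1]]))
```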
“Training an algorithm, put simply, is the process of taking data collected and using it to generate an estimate or expected result,” said Jennifer Shin, founder of 8 Path Solutions and data science lecturer at New York University’s Stern School of Business. “The training set can be thought of as the information we have available at the time we make a decision. For instance, let’s say you are interested in buying a new pair of shoes and you have decided that your selection will be based on the last three purchases. In this case, your past purchases would be your training set.”
The accuracy of algorithms depends on whether training sets are complete, Shin cautions. “In reality, there is no way to know for sure whether your data set has all the factors that influenced your past purchases, such as price and maintenance,” she said. “And if these variables are not included in your training set, this will result in a less accurate algorithm.”
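A toy demonstration of Shin's point, built on synthetic data of my own construction (nothing here comes from the article): when a variable that truly drives the outcome is left out of the training set, measured accuracy drops.

```python
# Synthetic illustration: the true buying decision depends on BOTH price
# and maintenance cost, but one training set omits maintenance entirely.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
price = rng.uniform(20, 200, 1000)
maintenance = rng.uniform(0, 10, 1000)
bought = (price + 10 * maintenance < 150).astype(int)  # depends on both

X_full = np.column_stack([price, maintenance])  # complete training set
X_partial = price.reshape(-1, 1)                # maintenance never recorded

for name, X in [("price + maintenance", X_full), ("price only", X_partial)]:
    acc = cross_val_score(LogisticRegression(max_iter=5000), X, bought, cv=5).mean()
    print(f"{name}: cross-validated accuracy {acc:.2f}")
```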
Here are some guidelines to establish a robust algorithm training process:
1. The more good data, the better
Not only must data sets be as comprehensive as possible, but the data also needs to be clean and useful. “Models built on a few thousand rows are generally not robust enough to be successful for large-scale business practices,” wrote Robert Munro and Qazaleh Mirsharif in “The Essential Guide to Training Data.” Tellingly, Sam Ransbotham and MIT-BCG researchers found that one of the most common misconceptions about AI is that “sophisticated AI algorithms alone can provide valuable business solutions without sufficient data.”
Historical data is essential for making accurate predictions: AI cannot predict a trend or a failure without past data containing instances of the pattern being sought. “No amount of algorithmic sophistication will overcome a lack of data,” the MIT-BCG researchers stated. Data quality is another challenge, they note: “Sophisticated algorithms can sometimes overcome limited data if its quality is high, but bad data is simply paralyzing.”
2. Create a test set
“Overfitting” a training set is another challenge to delivering accurate results. “Overfitting happens when an algorithm is created to fit a particular training set so well that it fails to be generalizable,” Shin said. “In other words, the algorithm produces great results for the training set, but terrible results when new data is used in place of the training set.”
The best way to guard against overfitting is to create a test set to validate the algorithm, Shin advised. “A test set is simply a part of the training set that is set aside and not used to train the algorithm. Once the algorithm is complete, it is run using the test set in place of the training set and the results are evaluated to check for potential issues, such as overfitting.” This process can be automated, Shin added.
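As one hedged illustration of the hold-out idea Shin describes (scikit-learn again being my assumed toolset), a large gap between training and test accuracy is the classic overfitting signal:

```python
# Hold out a test set, train on the remainder, then compare the two scores.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Set aside 25% of the examples; the model never sees them while training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

# An unpruned decision tree tends to memorize its training set.
model = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

print("training accuracy:", model.score(X_train, y_train))  # near 1.0
print("test accuracy:    ", model.score(X_test, y_test))    # noticeably lower
```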
3. Determine how much training data is needed
The appropriate amount of data required depends on the types of applications being supported. Eighty to 90 percent accuracy from a sentiment analysis algorithm that provides insights on social media content, for example, is likely to be acceptable, Munro and Mirsharif illustrated. A cancer detection model or a self-driving car algorithm is an entirely different story, however.
“A car that’s 85 or 90% safe is actually remarkably unsafe and should never see the road. A cancer detection model that could miss important indicators is literally a matter of life or death,” they wrote. “More complicated use cases generally require more data than less complex ones... The amount of training data you’ll need is contingent on the complexity of your ontology and how necessary high levels of accuracy are.”
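One practical way to estimate whether you have enough data (my suggestion; the authors describe the trade-off but not a specific procedure) is to plot a learning curve and watch where held-out accuracy plateaus:

```python
# Learning-curve sketch: train on growing fractions of the data and see
# whether held-out accuracy is still improving at the largest size.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = load_digits(return_X_y=True)

train_sizes, _, test_scores = learning_curve(
    LogisticRegression(max_iter=5000), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5)

# Still climbing at the right-hand end? More data will probably help.
# Flat? More data may buy little, and effort is better spent elsewhere.
for n, score in zip(train_sizes, test_scores.mean(axis=1)):
    print(f"{n:5d} training examples -> cv accuracy {score:.3f}")
```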
4. Address data access issues early
To broaden training horizons, larger data sets outside the original data set may be accessed in a process called transfer learning, Munro and Mirsharif stated: “It can be a great way to create smart models when your training data set is a bit smaller than you’d like.” However, data ownership is often a question mark for enterprises, leading to unwelcome surprises when they attempt to incorporate larger training data sets.
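To make the transfer-learning idea concrete, here is a hedged sketch using PyTorch and torchvision (my choice of framework; the article names none): reuse a network pretrained on a large image data set and retrain only its final layer on your smaller one.

```python
# Transfer-learning sketch: freeze a pretrained backbone, swap the head.
import torch
import torch.nn as nn
from torchvision import models

# Start from a ResNet-18 pretrained on ImageNet's large labeled data set.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Freeze the pretrained layers so their learned features stay intact.
for param in model.parameters():
    param.requires_grad = False

# Replace the classification head with one sized for your own labels;
# num_classes is a placeholder for your task.
num_classes = 5
model.fc = nn.Linear(model.fc.in_features, num_classes)

# Only the new head's parameters are handed to the optimizer for training.
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
```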
“Companies sometimes erroneously believe that they already have access to the data they need to exploit AI,” the MIT-BCG authors observed. “Data ownership is a vexing problem for managers across all industries. Some data is proprietary, and the organizations that own it may have little incentive to make it available to others. Other data is fragmented across data sources, requiring consolidation and agreements with multiple other organizations in order to get more complete information for training AI systems. In other cases, ownership of important data may be uncertain or contested. Getting business value from AI may be theoretically possible but pragmatically difficult.”
5. Keep humans in the loop
Active learning practices—in which humans provide additional judgments—will help achieve more accuracy from algorithm training. “When your model is a bit overconfident about a certain class, using human judgments to correct it can be a big help,” Munro and Mirsharif stated, adding that “human-in-the-loop machine learning should never mean ‘just label some training data.’ You can also test and tune your algorithms with human judgments.” For example, humans may intervene to correct machine errors in recognizing images.
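A minimal sketch of one common human-in-the-loop pattern, uncertainty sampling (illustrative only; Munro and Mirsharif do not prescribe this exact method): rank unlabeled examples by the model's confidence and route the least confident ones to human reviewers.

```python
# Uncertainty sampling: find the pool examples the model is least sure
# about; those are the ones worth routing to human reviewers first.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression

X, y = load_digits(return_X_y=True)
X_labeled, y_labeled = X[:200], y[:200]  # small human-labeled seed set
X_pool = X[200:]                         # unlabeled pool awaiting review

model = LogisticRegression(max_iter=5000).fit(X_labeled, y_labeled)

# Uncertainty = 1 minus the probability of the model's top prediction.
probs = model.predict_proba(X_pool)
uncertainty = 1 - probs.max(axis=1)

# Select the ten least-confident examples for human labeling.
ask_humans = np.argsort(uncertainty)[-10:]
print("pool indices to send for human labeling:", ask_humans)
```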
Maintaining accuracy in the results of AI algorithms is an ongoing effort that requires validation of training sets and some level of human intervention. Change in today’s economy occurs swiftly and constantly—both within the walls of organizations and without. As organizations increasingly rely on artificial intelligence and machine learning for insights and actions from data, it’s important that these systems are as up to date and accurate as possible.
About the Author
You may know us for our processors. But we do so much more. Intel invents at the boundaries of technology to make amazing experiences possible for business and society, and for every person on Earth.
Harnessing the capability of the cloud, the ubiquity of the Internet of Things, the latest advances in memory and programmable solutions, and the promise of always-on 5G connectivity and artificial intelligence, Intel is disrupting industries and solving global challenges. Leading on policy, diversity, inclusion, education and sustainability, we create value for our stockholders, customers, and society.