Key concepts of AIML 1 - Feature and Target Variable

Abstract: This post aims to explain the fundamental basis of Machine Learning algorithms. It is aimed at professionals and business users who want to appreciate the reasoning behind Machine Learning - what it is, why it is expected to work on old data, and how it can even produce fresh new data using Gen AI.

The picture above tells the entire story. The left-hand side is the target variable, aka y, and the right-hand side holds the feature variables x1, x2, etc. One variable is the outcome of another set of variables. We are used to this from basic mathematics: the area of a rectangle is the product of its width and height. Many other, more complicated equations are applied in our daily lives. An equation like the Fourier transform is embedded in our lives through mobile communication.
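
To make the target/feature idea concrete, here is a minimal sketch of the rectangle example in Python. The function name `rectangle_area` and the sample values are my own, chosen only for illustration.

```python
def rectangle_area(width: float, height: float) -> float:
    """Target y (area) computed from two features: x1 (width) and x2 (height)."""
    return width * height


# Features (x1, x2) go in, the target (y) comes out.
area = rectangle_area(width=3.0, height=4.0)
print(area)
```

Here we already know the equation. In machine learning we start from the other end: we have many (width, height, area) style records and ask the machine to find the rule.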

We can draw a graph of such an equation when it depends on one, two, or at most three dimensions or features. Some try to represent a fourth or higher dimension using color, pattern, etc. But for all practical purposes, an equation can be visualized only when it depends on at most three dimensions or features.

An equation can tell a story of cause and effect. If x is the cause, y is the effect. When x takes some value, y takes the corresponding value. From there, we identify a pattern. The pattern thus found may have a familiar name like straight line, circle, parabola, or one of many other curves.

But a pattern need not be bound by cause every time. When we observe a pattern, the variables y and x may not be tied by a cause-effect relationship alone. Sometimes it may run the other way. In the pre-ML age we did not explore such patterns much, because we could not, either intellectually or computationally.

If we look closely, an event has multiple symptoms or side effects, yet we may never be able to logically deduce a cause and effect among them. Think of an extreme example - Astrology!

It is argued that when a planet is in a certain position in a person's chart, something good or bad is about to happen. Now, is it a cause-and-effect relationship? Astrologers say, "I don't know, but that is what is usually observed from analysis of past data".

OK. Let us take a more serious example from classic ML education material. We have a dataset from one or more restaurants where the tip is recorded along with many other data points. It is in an Excel file with columns like:


Students are assigned the task of finding the correlation of tip with the other columns. Here, tip is the target variable and the other columns are features or dimensions. Can you ever imagine a cause-effect relationship among any of these features?
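
The student exercise can be sketched in a few lines of plain Python. The rows below are made up by me purely for illustration (they are not the real dataset), and I use only three of the column names from this post: order_id, total_bill and tip. The `pearson` helper is a hand-written Pearson correlation, one common way to measure how strongly two columns move together.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Made-up values for illustration only; NOT the real restaurant data.
order_id = [1, 2, 3, 4, 5, 6]
total_bill = [40.0, 10.0, 30.0, 15.0, 25.0, 20.0]
tip = [6.0, 1.5, 4.5, 2.0, 4.0, 3.0]

# tip is the target; every other column is a candidate feature.
features = {"order_id": order_id, "total_bill": total_bill}
for name, values in features.items():
    print(name, round(pearson(values, tip), 3))
```

With so few rows the numbers are noisy, but the idea holds: a column like total_bill tracks the tip closely, while an identifier like order_id tracks it far less. Note that the number says nothing about which one causes which.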

Welcome to machine learning. In the AIML space, we don't care about cause and effect. The target variable itself may be the cause or the effect of one or more feature variables. All that matters is: does a pattern emerge from data analysis?
 


So here, y means tip, x1 is order_id, x2 is day, and so on up to x7, which is total_bill. This means that if it is possible to get a pattern or equation, it will involve seven feature variables. Can we visualize it? Forget it. OK, but can we at least know what the equation will look like? It is possible, but why? Why take that pain? The machine knows it and will use it when required. That's the essence of machine learning.
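
For the curious, here is a rough sketch of what "the machine knows it" can mean. I assume the simplest possible case, a linear equation, and let plain gradient descent discover its coefficients from data. The data is synthetic, generated from a rule I picked myself (y = 2*x1 + 3*x2 + 1), and the function name `fit_linear` is my own. Real tools use far more refined methods, but the spirit is the same.

```python
def fit_linear(rows, targets, lr=0.1, steps=5000):
    """Fit y = bias + w1*x1 + ... + wk*xk by gradient descent on half the mean squared error."""
    n = len(rows)
    n_features = len(rows[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(steps):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(rows, targets):
            # Prediction error for this row under the current equation.
            err = bias + sum(w * xi for w, xi in zip(weights, x)) - y
            grad_b += err
            for j, xi in enumerate(x):
                grad_w[j] += err * xi
        # Move every coefficient a small step against its average gradient.
        bias -= lr * grad_b / n
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
    return weights, bias


# Synthetic data from a known rule (an assumption for this demo): y = 2*x1 + 3*x2 + 1
grid = [0.0, 0.25, 0.5, 0.75, 1.0]
rows = [(x1, x2) for x1 in grid for x2 in grid]
targets = [2 * x1 + 3 * x2 + 1 for x1, x2 in rows]

weights, bias = fit_linear(rows, targets)
print([round(w, 3) for w in weights], round(bias, 3))
```

The code never sees the rule itself, only the rows and targets, yet it ends up with coefficients very close to the ones used to generate the data. With seven features instead of two, nothing changes except the length of each row.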

Take a use case. Get the data related to it. Don't bother about whether the data points are related or not. Feed them all together into the machine and apply AIML techniques to them. You will get an answer for the business use case.

Of course, it is never so easy that it can be wrapped up like a YouTube short! That's why there are professionals like us. When someone from our community of AIML professionals is involved, they take the pain. Not all the features may be required. Some of them may even create unnecessary noise in the data. A side effect may get counted as an independent event influencing the target, and there are many issues like this. The technical discipline is called Feature Engineering, and there are several other measures, like class balancing and PCA, which are required. Also, a very important aspect - ethics. Statistics empowers us to identify which data can be trusted. But it is up to the architect or solution designer to combine domain knowledge, ethics, and statistical learning when applying it.
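
As one small taste of these measures, here is a sketch of class balancing by random oversampling: rows of the rarer class are duplicated at random until every class has as many rows as the largest one. This is only one of several balancing techniques, and the function name `oversample` and the tiny dataset are made up by me for illustration.

```python
import random
from collections import Counter


def oversample(rows, labels, seed=0):
    """Random oversampling: duplicate rows of smaller classes until all match the largest class."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    target_size = max(len(group) for group in by_class.values())
    out_rows, out_labels = [], []
    for label, group in by_class.items():
        # Draw extra copies (with replacement) only for under-represented classes.
        extra = [rng.choice(group) for _ in range(target_size - len(group))]
        for row in group + extra:
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels


# Made-up data: four rows of class 0 and a single row of class 1.
rows = [[1.0], [1.2], [0.9], [1.1], [5.0]]
labels = [0, 0, 0, 0, 1]

balanced_rows, balanced_labels = oversample(rows, labels)
print(Counter(balanced_labels))
```

In practice this kind of balancing is applied to the training data only, so that the model is still evaluated on data that looks like the real world.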

So, there will always be good AI and bad AI. No one can know for sure what the best model is for a complex business solution. There are ways to choose the best fit for a problem, with a conscious trade-off between false positives/negatives and failed-to-predict cases. Those will be business decisions, or derived from the business use case owner's vision.
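
The false positive vs false negative trade-off can be seen with a tiny sketch. I assume a model that outputs a score between 0 and 1, and a threshold above which we call the case "positive". The scores and true labels below are hypothetical, made up only to show how moving the threshold shifts the errors from one kind to the other.

```python
def confusion_counts(scores, actual, threshold):
    """Count true/false positives and negatives when predicting positive for score >= threshold."""
    tp = fp = tn = fn = 0
    for score, truth in zip(scores, actual):
        predicted = score >= threshold
        if predicted and truth:
            tp += 1
        elif predicted and not truth:
            fp += 1
        elif not predicted and truth:
            fn += 1
        else:
            tn += 1
    return {"tp": tp, "fp": fp, "tn": tn, "fn": fn}


# Hypothetical model scores and true labels, made up for illustration.
scores = [0.95, 0.80, 0.65, 0.55, 0.40, 0.30, 0.20, 0.10]
actual = [True, True, False, True, False, True, False, False]

for threshold in (0.3, 0.5, 0.7):
    print(threshold, confusion_counts(scores, actual, threshold))
```

A low threshold catches every real positive but raises false alarms; a high threshold raises no false alarms but misses real cases. Which side to favor is exactly the business decision mentioned above.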

That's it. Overall, in this article, I tried to keep it short and simple, providing a logical snapshot of what goes on behind the scenes in Machine Learning and AI, to give business users better context and possibly the confidence to experiment and drive AI innovation in their respective businesses.


     
