Mean, Mode, Variance and Standard Deviation
In math notation, when we see $N$ in a formula it refers to the population, and when we see $n$ it refers to the sample.
Mode
It is the most frequent value in a series. The Python implementation includes one…
Note: $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$ is the formula for unbiased sample variance, since we are dividing by $n - 1$.
Note: Finding the standard deviation by taking the square root of the sample variance reintroduces bias.
import math

valores = [1, 2, 3, 8, 4, 9.8, 6.5, 4, 3, 8, 5, 9, 3.3, 0, 4, 7, 9]
valores_clean_float = [float(x) for x in valores]

# Mean: sum of the values divided by how many there are
media = sum(valores_clean_float) / len(valores_clean_float)

# Mode: the most frequent value (on a tie, the first value reaching the top count wins)
moda = max(valores_clean_float, key=valores_clean_float.count)

# Median: middle value of the sorted list, or the average of the two middle values
valores_clean_float_sorted = sorted(valores_clean_float)
list_size = len(valores_clean_float_sorted)
if list_size % 2 == 0:
    mediana = (valores_clean_float_sorted[list_size // 2 - 1] + valores_clean_float_sorted[list_size // 2]) / 2
else:
    mediana = valores_clean_float_sorted[list_size // 2]

# Population variance: divides by N, not n - 1 (no intermediate rounding, which would distort the result)
squared_distance_from_mean = [(x - media) ** 2 for x in valores_clean_float_sorted]
variance = sum(squared_distance_from_mean) / list_size
standard_deviation = math.sqrt(variance)

print('Mean: ', media, '- Mode: ', moda, '- Median: ', mediana, '- Variance: ', variance, '- Standard Deviation: ', standard_deviation)
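As a cross-check, the same statistics can be computed with Python's standard `statistics` module. This is only a sketch reusing the values from the example above; the variable names `pop_variance`, `sample_variance`, `pop_std` and `sample_std` are introduced here for illustration. It also makes the population versus sample distinction explicit: the `p`-prefixed functions divide by $N$, the others by $n - 1$.

```python
import statistics

# Same values as in the hand-written example
valores = [1, 2, 3, 8, 4, 9.8, 6.5, 4, 3, 8, 5, 9, 3.3, 0, 4, 7, 9]

media = statistics.mean(valores)
moda = statistics.mode(valores)
mediana = statistics.median(valores)

# Population versions divide by N
pop_variance = statistics.pvariance(valores)
pop_std = statistics.pstdev(valores)

# Sample versions divide by n - 1
sample_variance = statistics.variance(valores)
sample_std = statistics.stdev(valores)

print("mean:", media, "mode:", moda, "median:", mediana)
print("population variance:", pop_variance, "population std:", pop_std)
print("sample variance:", sample_variance, "sample std:", sample_std)
```

The sample variance is always slightly larger than the population variance for the same data, because the same sum of squared distances is divided by a smaller number.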
Simple Linear Regression
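$y = B_0 + B_1 x$ or $y = mx + b$
$y$ is the value being predicted
$B_0$ or $b$ is called the intercept, a coefficient obtained through the equation $B_0 = \bar{y} - B_1 \bar{x}$
$B_1$ or $m$ is called the slope, a coefficient obtained through $B_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$, where $x_i$ is the current value of $x$, $\bar{x}$ is the mean of the $x$ values, $y_i$ is the current value of $y$, and $\bar{y}$ is the mean of the $y$ values.
A shortcut formula is $B_1 = \text{corr}(x, y) \cdot \frac{\text{stdev}(y)}{\text{stdev}(x)}$, where $\text{corr}(x, y)$ is the correlation of $x$ and $y$ (aka Pearson’s correlation coefficient), a measure of how related two variables are, in the range of -1 to 1.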
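The closed-form coefficients can be sketched in a few lines of Python. This assumes the usual least-squares definitions (slope as the ratio of the summed cross-deviations to the summed squared deviations of $x$, intercept from the two means) and reuses the small `x`/`y` dataset from this section's code. The helper `mean` and the names `sxy`, `sxx`, `syy` are introduced here for illustration only.

```python
import math

x = [1, 2, 4, 3, 5]
y = [1, 3, 3, 2, 5]

def mean(values):
    return sum(values) / len(values)

x_mean, y_mean = mean(x), mean(y)

# Summed cross-deviations and squared deviations from the means
sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
sxx = sum((xi - x_mean) ** 2 for xi in x)
syy = sum((yi - y_mean) ** 2 for yi in y)

# Slope and intercept from the least-squares formulas
b1 = sxy / sxx
b0 = y_mean - b1 * x_mean

# Shortcut: slope = corr(x, y) * stdev(y) / stdev(x)
n = len(x)
stdev_x = math.sqrt(sxx / (n - 1))
stdev_y = math.sqrt(syy / (n - 1))
corr = sxy / math.sqrt(sxx * syy)  # Pearson's correlation coefficient
b1_shortcut = corr * stdev_y / stdev_x

print("b0 = {:.4f}, b1 = {:.4f}, shortcut b1 = {:.4f}".format(b0, b1, b1_shortcut))
```

Both routes give the same slope, since the shortcut is an algebraic rearrangement of the same sums.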
x = [1, 2, 4, 3, 5]
y = [1, 3, 3, 2, 5]
m = 0.0  # slope
b = 0.0  # intercept
grau_aprendizado = 0.01  # learning rate

# Update the coefficients once per training example, for 4 passes over the data
for epoch in range(4):
    for i in range(len(x)):
        previsao = m * x[i] + b  # prediction: y = mx + b
        erro = previsao - y[i]  # error = p(i) - y(i)
        m = m - grau_aprendizado * erro * x[i]
        b = b - grau_aprendizado * erro * 1.0
        print("m {} b {}".format(m, b))
Logistic Regression
$y = \frac{e^{B_0 + B_1 x}}{1 + e^{B_0 + B_1 x}}$
$y$ is the predicted output
$B_0$ is the bias or intercept
$B_1$ is the coefficient for the single input value ($x$)
Each column in your input data has an associated B coefficient (a constant real value) that must be learned from your training data. The actual representation of the model that you would store in memory or in a file is the set of coefficients in the equation (the beta values, or B’s).
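Once the coefficients are known, making a prediction is a single function evaluation. The sketch below assumes a model with one input column; the function name `predict` and the coefficient values are hypothetical placeholders, not values learned from any dataset in this post.

```python
import math

def predict(x, b0, b1):
    """Return the predicted output (a value between 0 and 1) for input x."""
    z = b0 + b1 * x
    # 1 / (1 + e^-z) is algebraically the same as e^z / (1 + e^z)
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients, chosen only to illustrate the call
b0, b1 = -1.0, 0.5

probability = predict(2.0, b0, b1)
label = 1 if probability >= 0.5 else 0
print(probability, label)
```

Storing the model then amounts to storing `b0` and `b1`; the 0.5 cutoff used to turn the output into a class label is a common convention, assumed here.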
Linear Discriminant Analysis – LDA
Steps
4. Making predictions – Just plug the values found above into the representation model
Discriminant for X = 4.667797637 and Y = 0: 12.3293558
Discriminant for X = 4.667797637 and Y = 1: -130.3349038
We can see that the discriminant value for Y = 0 (12.3293558) is larger than the discriminant value for Y = 1 (-130.3349038), so the model predicts Y = 0, which we know is correct for this dataset.
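The prediction step can be sketched as follows, assuming the standard single-input LDA discriminant $D_k(x) = x \cdot \frac{\mu_k}{\sigma^2} - \frac{\mu_k^2}{2\sigma^2} + \ln(\pi_k)$, with class mean $\mu_k$, shared variance $\sigma^2$ and class prior $\pi_k$. The means, variance and priors below are hypothetical placeholders, not the values estimated in the earlier steps, so the scores will not match the numbers quoted above.

```python
import math

def discriminant(x, class_mean, variance, prior):
    """Single-input LDA discriminant for one class."""
    return (x * (class_mean / variance)
            - (class_mean ** 2) / (2 * variance)
            + math.log(prior))

# Hypothetical parameters, for illustration only
class_params = {
    0: {"mean": 5.0, "prior": 0.5},
    1: {"mean": 20.0, "prior": 0.5},
}
shared_variance = 1.0

x = 4.667797637
scores = {
    label: discriminant(x, p["mean"], shared_variance, p["prior"])
    for label, p in class_params.items()
}

# The predicted class is the one with the largest discriminant value
prediction = max(scores, key=scores.get)
print(scores, prediction)
```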
CART – Classification And Regression Trees
Steps
Sample Dataset
X1            X2            Y
2.771244718   1.784783929   0
1.728571309   1.169761413   0
3.678319846   2.81281357    0
3.961043357   2.61995032    0
2.999208922   2.209014212   0
7.497545867   3.162953546   1
9.00220326    3.339047188   1
7.444542326   0.476683375   1
10.12493903   3.234550982   1
6.642287351   3.319983761   1
1. Find the best Split Point Candidate for a feature by iterating through the dataset
IF X1 < 2.7712 THEN LEFT
IF X1 >= 2.7712 THEN RIGHT
X1            Y   Group
2.771244718   0   RIGHT
1.728571309   0   LEFT
3.678319846   0   RIGHT
3.961043357   0   RIGHT
2.999208922   0   RIGHT
7.497545867   1   RIGHT
9.00220326    1   RIGHT
7.444542326   1   RIGHT
10.12493903   1   RIGHT
6.642287351   1   RIGHT
1.2 Calculate the proportions for each side related to each class
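LEFT: class 0 = 1 / 1 = 1.0, class 1 = 0 / 1 = 0.0
RIGHT: class 0 = 4 / 9 ≈ 0.444, class 1 = 5 / 9 ≈ 0.556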
1.3 Calculate the Gini for this candidate
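A minimal sketch of this calculation for the candidate X1 = 2.7712 follows. It assumes the size-weighted form of the Gini index, where each side contributes $(1 - \sum_k p_k^2)$ scaled by its share of the rows; other write-ups sum the per-side values without weighting, so the weighting is an assumption. The function names `split` and `gini_index` are introduced here for illustration.

```python
# (X1, Y) pairs from the sample dataset
rows = [
    (2.771244718, 0), (1.728571309, 0), (3.678319846, 0),
    (3.961043357, 0), (2.999208922, 0), (7.497545867, 1),
    (9.00220326, 1), (7.444542326, 1), (10.12493903, 1),
    (6.642287351, 1),
]

def split(rows, threshold):
    """Send rows with X1 < threshold LEFT and the rest RIGHT."""
    left = [r for r in rows if r[0] < threshold]
    right = [r for r in rows if r[0] >= threshold]
    return left, right

def gini_index(groups, classes):
    """Size-weighted Gini index over the sides of a split."""
    total = sum(len(group) for group in groups)
    gini = 0.0
    for group in groups:
        if not group:
            continue  # an empty side contributes nothing
        score = 0.0
        for c in classes:
            proportion = sum(1 for _, label in group if label == c) / len(group)
            score += proportion ** 2
        gini += (1.0 - score) * (len(group) / total)
    return gini

left, right = split(rows, 2.771244718)
gini = gini_index([left, right], [0, 1])
print(len(left), len(right), gini)
```

A value of 0 would mean both sides are pure; this candidate leaves a mixed RIGHT side, so its Gini is well above 0.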
1.4 Continue iterating over the dataset until you find the lowest Gini. In this case the lowest Gini index is found at the split X1 = 6.6422
IF X1 < 6.6422 THEN LEFT
IF X1 >= 6.6422 THEN RIGHT
X1            Y   Group
2.771244718   0   LEFT
1.728571309   0   LEFT
3.678319846   0   LEFT
3.961043357   0   LEFT
2.999208922   0   LEFT
7.497545867   1   RIGHT
9.00220326    1   RIGHT
7.444542326   1   RIGHT
10.12493903   1   RIGHT
6.642287351   1   RIGHT
LEFT: class 0 = 5 / 5 = 1.0, class 1 = 0 / 5 = 0.0
RIGHT: class 0 = 0 / 5 = 0.0, class 1 = 5 / 5 = 1.0
This split results in a Gini index of 0, meaning both child nodes are pure, because the classes are perfectly separated. The LEFT child node will classify instances as class 0 and the RIGHT as class 1.
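The whole search can be sketched as a loop that tries every value of every feature as a split point and keeps the one with the lowest Gini. This is an illustrative sketch rather than a full CART implementation: it finds a single split, assumes the size-weighted Gini index, and the names `gini_index` and `best_split` are introduced here.

```python
# Sample dataset from this section: (X1, X2, Y)
dataset = [
    (2.771244718, 1.784783929, 0), (1.728571309, 1.169761413, 0),
    (3.678319846, 2.81281357, 0), (3.961043357, 2.61995032, 0),
    (2.999208922, 2.209014212, 0), (7.497545867, 3.162953546, 1),
    (9.00220326, 3.339047188, 1), (7.444542326, 0.476683375, 1),
    (10.12493903, 3.234550982, 1), (6.642287351, 3.319983761, 1),
]

def gini_index(groups, classes):
    """Size-weighted Gini index of a split (the weighting is an assumption)."""
    total = sum(len(group) for group in groups)
    gini = 0.0
    for group in groups:
        if not group:
            continue  # an empty side contributes nothing
        labels = [row[-1] for row in group]
        purity = sum((labels.count(c) / len(group)) ** 2 for c in classes)
        gini += (1.0 - purity) * len(group) / total
    return gini

def best_split(rows):
    """Try every value of every feature and keep the lowest-Gini split."""
    classes = sorted({row[-1] for row in rows})
    best = None  # (feature index, threshold, gini)
    for feature in range(len(rows[0]) - 1):
        for candidate in rows:
            threshold = candidate[feature]
            left = [r for r in rows if r[feature] < threshold]
            right = [r for r in rows if r[feature] >= threshold]
            gini = gini_index([left, right], classes)
            if best is None or gini < best[2]:
                best = (feature, threshold, gini)
    return best

feature, threshold, gini = best_split(dataset)
print("IF X{} < {:.4f} THEN LEFT ELSE RIGHT (Gini = {:.4f})".format(feature + 1, threshold, gini))
```

On this dataset no split on X2 separates the classes, because the row with X2 = 0.476683375 belongs to class 1, so the search settles on X1.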