Information Gain in Decision Trees

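Table 8.1 gives a training set D of class-labeled tuples.

The class label attribute buys_computer has two distinct values: {yes, no}.

Let class C1 $\rightarrow$ yes and class C2 $\rightarrow$ no.

Given: C1 contains 9 tuples and C2 contains 5 tuples.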
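The two class counts are enough to compute the expected information (entropy) of D, which is where the 0.94 used later in Gain(age) comes from. A minimal Python sketch follows; the helper name `info` is an assumption chosen for illustration and does not come from the original.

```python
import math

def info(counts):
    """Expected information (entropy, in bits) of a list of class counts."""
    total = sum(counts)
    # Classes with zero tuples are skipped, since 0 * log2(0) is taken as 0.
    return sum(-(c / total) * math.log2(c / total) for c in counts if c > 0)

# C1 (yes) contains 9 tuples and C2 (no) contains 5 tuples.
info_d = info([9, 5])
print(round(info_d, 3))  # 0.94
```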

Compute the expected information requirement for the attribute age:

age: {'youth', 'middle_aged', 'senior'}

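| buys_computer | youth | middle_aged | senior |
|---------------|-------|-------------|--------|
| yes           | 2     | 4           | 3      |
| no            | 3     | 0           | 2      |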

For the youth partition:

$|D_j| = 2 + 3 = 5$

$Info(D_j) = -\frac{2}{5}\log_2\frac{2}{5} - \frac{3}{5}\log_2\frac{3}{5} = 0.971$

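Weighting each partition by its share of the 14 tuples gives the expected information still needed after partitioning on age:

$Info_{age}(D) = \frac{5}{14} Info(D_{youth}) + \frac{4}{14} Info(D_{middle\_aged}) + \frac{5}{14} Info(D_{senior}) = 0.694$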

The information gain from partitioning on age is therefore:

$Gain(age) = Info(D) - Info_{age}(D) = 0.940 - 0.694 = 0.246$
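The calculation for age can be reproduced from the yes/no counts per age value. This is a sketch under stated assumptions: the names `info`, `age_counts`, `info_age` and `gain_age` are illustrative and not part of the original, and the counts are the ones listed for age above.

```python
import math

def info(counts):
    """Expected information (entropy, in bits) of a list of class counts."""
    total = sum(counts)
    return sum(-(c / total) * math.log2(c / total) for c in counts if c > 0)

# (yes, no) counts for each value of age.
age_counts = {"youth": (2, 3), "middle_aged": (4, 0), "senior": (3, 2)}

n = sum(sum(c) for c in age_counts.values())                    # 14 tuples in D
class_totals = [sum(col) for col in zip(*age_counts.values())]  # [9, 5]

info_d = info(class_totals)
# Weight each partition's entropy by its share of the tuples.
info_age = sum(sum(c) / n * info(c) for c in age_counts.values())
gain_age = info_d - info_age

print(f"Info(D)     = {info_d:.3f}")
print(f"Info_age(D) = {info_age:.3f}")
print(f"Gain(age)   = {gain_age:.3f}")
```

At full precision the gain is about 0.2467, which prints as 0.247; the 0.246 quoted in the text comes from subtracting the already rounded values 0.940 and 0.694.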

Similarly,

$Gain(income) = 0.029$, $Gain(student) = 0.151$,

$Gain(credit\_rating) = 0.048$

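Because age has the highest information gain of all the attributes, it is selected as the splitting attribute.

Because every tuple in the age $\rightarrow$ middle_aged partition belongs to the same class, a leaf is created at the end of that branch and labeled yes.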
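These two decisions, choosing the attribute with the largest gain and turning a pure partition into a leaf, can be sketched in a few lines of Python. The gains and counts are the values quoted in this post; the variable names and the "split further" placeholder are assumptions for illustration only.

```python
# Information gain reported for each candidate attribute.
gains = {"age": 0.246, "income": 0.029, "student": 0.151, "credit_rating": 0.048}
split_attr = max(gains, key=gains.get)

# (yes, no) counts in each partition produced by splitting on age.
age_counts = {"youth": (2, 3), "middle_aged": (4, 0), "senior": (3, 2)}
labels = ("yes", "no")

branches = {}
for value, counts in age_counts.items():
    present = [label for label, c in zip(labels, counts) if c > 0]
    # A partition containing a single class becomes a leaf with that label;
    # a mixed partition has to be split again on a remaining attribute.
    branches[value] = present[0] if len(present) == 1 else "split further"

print(split_attr)
print(branches)
```

Only the middle_aged branch ends in a leaf at this level; the youth and senior partitions still mix both classes.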

The final decision tree is as follows:

[Figure: the final decision tree]

References:

  1. https://blog.csdn.net/Time_Memory_cici/article/details/132915003
  2. https://blog.csdn.net/m0_50989510/article/details/122395804
  3. https://blog.csdn.net/weixin_44606139/article/details/127049701