**Building Decision Trees in Python**

### Implementing the ID3 Heuristic

The ID3 heuristic uses the concept of entropy to measure the information gained by choosing a particular attribute as the next node in the decision tree. Here's the entropy function:

```
import math

def entropy(data, target_attr):
    """
    Calculates the entropy of the given data set for the target attribute.
    """
    val_freq = {}
    data_entropy = 0.0

    # Calculate the frequency of each of the values in the target attr
    for record in data:
        if record[target_attr] in val_freq:
            val_freq[record[target_attr]] += 1.0
        else:
            val_freq[record[target_attr]] = 1.0

    # Calculate the entropy of the data for the target attribute
    for freq in val_freq.values():
        data_entropy += (-freq/len(data)) * math.log(freq/len(data), 2)

    return data_entropy
```

Just like the `create_decision_tree` function, the first thing the `entropy` function does is create the variables it uses throughout the algorithm. The first is a dictionary object called `val_freq` to hold all the values found in the data set passed into this function and the frequency at which each value appears in the data set. The other variable is `data_entropy`, which holds the ongoing calculation of the data's entropy value.

The next section of code adds each of the values in the data set to the `val_freq` dictionary and calculates the corresponding frequency for each value. It does so by looping through each of the records in the data set and checking the `val_freq` dictionary object to see if the current value already resides within it. If it does, the code increments the frequency for the current value; otherwise, it adds the current value to the dictionary object and initializes its frequency to 1. The final portion of the code is responsible for actually calculating the entropy measurement (using the equation in Figure 1) with the frequencies stored in the `val_freq` dictionary object.
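To make the entropy calculation concrete, here is a minimal sketch that applies it to a few toy data sets. The version of `entropy` below is a compact restatement built on `collections.Counter`, not the article's own listing, and the attribute name `play` and its values are made up purely for illustration.

```python
import math
from collections import Counter


def entropy(data, target_attr):
    """Entropy (in bits) of the target attribute's values across the records."""
    counts = Counter(record[target_attr] for record in data)
    total = float(len(data))
    return sum(-(n / total) * math.log(n / total, 2) for n in counts.values())


# Hypothetical toy records: each record is a dictionary of attribute values.
even_split = [{"play": "yes"}, {"play": "yes"}, {"play": "no"}, {"play": "no"}]
uneven_split = [{"play": "yes"}, {"play": "yes"}, {"play": "yes"}, {"play": "no"}]
pure = [{"play": "yes"}, {"play": "yes"}, {"play": "yes"}, {"play": "yes"}]

print(entropy(even_split, "play"))    # 1.0: maximum disorder for two classes
print(entropy(uneven_split, "play"))  # roughly 0.811: some disorder
print(entropy(pure, "play"))          # zero: every record agrees
```

The more evenly the target values are mixed, the higher the entropy; a data set in which every record has the same target value has no disorder at all.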

That was easy, wasn't it? That's only the first half of the ID3 heuristic. Now that you know how to calculate the amount of disorder in a set of data, you need to take those calculations and use them to find the amount of information gain you will get by using an attribute in the decision tree. The information gain function is very similar to the entropy function. Here's the code that calculates this measurement:

```
def gain(data, attr, target_attr):
    """
    Calculates the information gain (reduction in entropy) that would
    result by splitting the data on the chosen attribute (attr).
    """
    val_freq = {}
    subset_entropy = 0.0

    # Calculate the frequency of each of the values in the chosen attribute
    for record in data:
        if record[attr] in val_freq:
            val_freq[record[attr]] += 1.0
        else:
            val_freq[record[attr]] = 1.0

    # Calculate the sum of the entropy for each subset of records weighted
    # by their probability of occurring in the training set.
    for val in val_freq.keys():
        val_prob = val_freq[val] / sum(val_freq.values())
        data_subset = [record for record in data if record[attr] == val]
        subset_entropy += val_prob * entropy(data_subset, target_attr)

    # Subtract the entropy of the chosen attribute from the entropy of the
    # whole data set with respect to the target attribute (and return it)
    return (entropy(data, target_attr) - subset_entropy)
```

Once again, the code starts by calculating the frequency of each value of the chosen attribute, `attr`, in the data set. Following this, it calculates the weighted entropy of the subsets produced by using `attr` to partition the records. Subtracting that from the original entropy of the current data set gives the gain in information (or reduction in disorder, if you prefer to think in those terms) that you get by choosing that attribute as the next node in the decision tree.
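A small worked example may help show what the gain measurement rewards. The sketch below uses compact restatements of `entropy` and `gain` (assumed equivalents, not the article's listings) on a made-up four-record data set in which the hypothetical attribute `outlook` predicts `play` perfectly and `windy` tells you nothing.

```python
import math
from collections import Counter


def entropy(data, target_attr):
    """Entropy (in bits) of the target attribute's values across the records."""
    counts = Counter(record[target_attr] for record in data)
    total = float(len(data))
    return sum(-(n / total) * math.log(n / total, 2) for n in counts.values())


def gain(data, attr, target_attr):
    """Reduction in entropy obtained by splitting the records on attr."""
    subset_entropy = 0.0
    for val in set(record[attr] for record in data):
        subset = [record for record in data if record[attr] == val]
        # Weight each subset's entropy by the share of records it holds.
        subset_entropy += (len(subset) / float(len(data))) * entropy(subset, target_attr)
    return entropy(data, target_attr) - subset_entropy


# Hypothetical toy data: attribute names and values are illustrative only.
records = [
    {"outlook": "sunny", "windy": "yes", "play": "no"},
    {"outlook": "sunny", "windy": "no", "play": "no"},
    {"outlook": "rainy", "windy": "yes", "play": "yes"},
    {"outlook": "rainy", "windy": "no", "play": "yes"},
]

outlook_gain = gain(records, "outlook", "play")  # both subsets are pure
windy_gain = gain(records, "windy", "play")      # both subsets stay evenly mixed
print(outlook_gain, windy_gain)
```

Splitting on `outlook` removes all of the disorder in `play`, so its gain equals the full entropy of the data set, while splitting on `windy` leaves each subset as mixed as before and gains nothing.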

That is essentially all there is to it. You still need some code that cycles through each attribute and calculates its information gain measure and chooses the best one, but that part of the code should be somewhat obvious; it's just a matter of repeatedly calling the gain function on each attribute and keeping track of the attribute with the best score. That said, I leave it as a challenge for you to look over the rest of the helper functions in the accompanying source code and figure out each one.
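As a rough illustration of that selection loop, here is one way it might be written. The function name `choose_best_attribute` and the toy records are hypothetical and may not match the accompanying source code; `entropy` and `gain` are compact restatements included only so the sketch runs on its own.

```python
import math
from collections import Counter


def entropy(data, target_attr):
    """Entropy (in bits) of the target attribute's values across the records."""
    counts = Counter(record[target_attr] for record in data)
    total = float(len(data))
    return sum(-(n / total) * math.log(n / total, 2) for n in counts.values())


def gain(data, attr, target_attr):
    """Reduction in entropy obtained by splitting the records on attr."""
    subset_entropy = 0.0
    for val in set(record[attr] for record in data):
        subset = [record for record in data if record[attr] == val]
        subset_entropy += (len(subset) / float(len(data))) * entropy(subset, target_attr)
    return entropy(data, target_attr) - subset_entropy


def choose_best_attribute(data, attributes, target_attr):
    """Hypothetical helper: return the attribute with the highest gain."""
    best_attr, best_gain = None, -1.0
    for attr in attributes:
        if attr == target_attr:
            continue  # never split on the attribute being predicted
        attr_gain = gain(data, attr, target_attr)
        if attr_gain > best_gain:
            best_attr, best_gain = attr, attr_gain
    return best_attr


# Hypothetical toy data: "outlook" separates the "play" values, "windy" does not.
records = [
    {"outlook": "sunny", "windy": "yes", "play": "no"},
    {"outlook": "sunny", "windy": "no", "play": "no"},
    {"outlook": "rainy", "windy": "yes", "play": "yes"},
    {"outlook": "rainy", "windy": "no", "play": "yes"},
]

best = choose_best_attribute(records, ["outlook", "windy", "play"], "play")
print(best)
```

In this sketch, ties go to whichever attribute appears first in the list; a fuller implementation might choose a different tie-breaking rule.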