决策树算法的编程实现｜递归的用法

决策树是一种常用的机器学习算法，广泛应用于分类和回归任务中。递归在决策树的实现中起着至关重要的作用，通过递归调用，可以有效地构建和使用决策树。本文将详细介绍决策树实现算法中的递归，重点讲解基线条件与递归调用的实现。

递归

递归的含义

递归是指函数在其定义中调用自身。递归函数通常包含两个部分：基线条件和递归调用。基线条件用于终止递归，防止无限递归；递归调用则处理更小的子问题，逐步逼近基线条件。

基线条件的重要性

基线条件在递归算法中起着至关重要的作用：

防止无限递归：基线条件确保递归在某个点上停止，从而防止无限递归。
提供直接解：基线条件通常对应于问题的最小子问题或最简单的情况，这些情况可以直接得到解。
在决策树的构建和预测过程中，基线条件确保递归在适当的条件下停止，从而构建出完整的决策树并进行正确的预测。

递归与迭代的区别与联系

迭代和递归是两种常见的编程技术，用于重复执行一段代码。虽然它们在功能上可以实现相同的效果，但在实现方式和应用场景上有一些显著的区别。以下是迭代和递归的主要区别：

迭代：通过循环结构（for 或 while 循环）重复执行代码，通常占用较少的内存，适用于循环次数已知的情况。
递归：通过函数调用自身重复执行代码，适用于可以分解为相同子问题的情况，但可能占用较多的内存。

示例代码

迭代示例：计算阶乘

int factorial_iterative(int n) {
    int result = 1;
    for (int i = 1; i <= n; ++i) {
        result *= i;
    }
    return result;
}

递归示例：计算阶乘

int factorial_recursive(int n) {
    if (n == 0) {
        return 1;
    }
    return n * factorial_recursive(n - 1);
}

决策树

决策树的基本概念

决策树是一种树状结构，其中每个节点表示一个特征，每个分支表示特征的一个可能取值，每个叶节点表示一个类别或回归值。决策树的构建过程包括选择最佳特征进行分裂，并递归地构建子树；预测过程则通过递归遍历树结构，根据特征值进行分类或回归。

决策树的类型

Hunt 算法于 20 世纪 60 年代提出，是决策树构建的一种基本方法。原始的 Hunt 算法没有指定具体的选择标准，这为后续算法的改进留下了空间，例如：

ID3：该算法由Ross Quinlan于1986 年提出，全称为”迭代二叉树 3 代” (“Iterative Dichotomiser 3”)。该算法利用信息熵与信息增益作为评估候选拆分的指标。
C4.5：该算法是 ID3 的后期扩展，同样由 Quinlan 提出。它使用信息增益或增益率来评估决策树中的切分点。
CART：全称是”分类和回归”，由 Leo Breiman提出。该算法通常利用”基尼不纯度”来确定拆分点。基尼不纯度衡量随机选择的属性被错误分类的频率。使用该评估方法时，基尼不纯度越小越理想。

决策树构建中的递归实现

步骤

构建决策树的基本步骤包括：

检查基线条件：如果所有样本属于同一类别，或者没有可用特征，或者样本集为空，则创建叶节点并返回。
选择最佳分裂特征和阈值。
根据选择的特征和阈值将样本分成左右两个子集。
递归地构建左子树和右子树。

代码

std::unique_ptr<TreeNode> buildDecisionTree(std::vector<Sample> samples, std::vector<bool> usedFeatures) {
    auto node = std::make_unique<TreeNode>();

    // 基线条件：所有样本属于同一类别
    if (std::all_of(samples.begin(), samples.end(), [&](const Sample& s) { return s.label == samples[0].label; })) {
        node->isLeaf = true;
        node->label = samples[0].label;
        return node;
    }

    // 基线条件：没有可用特征或样本为空
    if (std::all_of(usedFeatures.begin(), usedFeatures.end(), [](bool b) { return b; }) || samples.empty()) {
        node->isLeaf = true;
        std::map<std::string, int> labelCount;
        for (const auto& sample : samples) {
            labelCount[sample.label]++;
        }
        node->label = std::max_element(labelCount.begin(), labelCount.end(),
            [](const std::pair<std::string, int>& a, const std::pair<std::string, int>& b) {
                return a.second < b.second;
            })->first;
        return node;
    }

    // 选择最佳分裂特征和阈值
    int bestFeature;
    double bestThreshold;
    std::tie(bestFeature, bestThreshold) = selectBestFeatureAndThreshold(samples, usedFeatures);
    node->featureIndex = bestFeature;
    node->threshold = bestThreshold;
    node->isLeaf = false;

    // 分裂样本
    std::vector<Sample> leftSamples, rightSamples;
    for (const auto& sample : samples) {
        if (sample.features[bestFeature] <= bestThreshold) {
            leftSamples.push_back(sample);
        } else {
            rightSamples.push_back(sample);
        }
    }

    // 递归构建子树
    usedFeatures[bestFeature] = true;
    node->left = buildDecisionTree(leftSamples, usedFeatures);
    node->right = buildDecisionTree(rightSamples, usedFeatures);

    return node;
}

决策树预测中的递归实现

步骤

预测过程的基本步骤包括：

检查基线条件：如果当前节点是叶节点，则返回该节点的标签。
递归调用：根据样本的特征值选择左子树或右子树进行递归调用。

代码

std::string predict(const TreeNode* node, const Sample& sample) const {
    if (node->isLeaf) {
        return node->label;
    }

    if (sample.features[node->featureIndex] <= node->threshold) {
        return predict(node->left.get(), sample);
    } else {
        return predict(node->right.get(), sample);
    }
}

完整代码示例

下面的代码是C++实现的C4.5 决策树算法，可以处理多分类问题。代码中使用了信息增益比来选择最佳分裂特征，并通过递归的方法构建决策树。此代码较为简单，存在一些局限性，包括：缺乏剪枝功能、仅处理连续特征、没有处理缺失值、没有考虑特征的重要性、没有实现模型的持久化等。

#include <memory>
#include <vector>
#include <map>
#include <string>
#include <algorithm>
#include <stdexcept>
#include <iostream>

struct Sample {
    std::vector<double> features; // 假设特征是数值型的
    std::string label;
};

class C45DecisionTree {
public:
    struct TreeNode {
        bool isLeaf;
        std::string label;
        int featureIndex;
        double threshold;
        std::unique_ptr<TreeNode> left;
        std::unique_ptr<TreeNode> right;
    };

    C45DecisionTree() : root(nullptr) {}

    void train(const std::vector<Sample>& samples) {
        if (samples.empty()) {
            throw std::runtime_error("训练样本为空");
        }
        std::vector<bool> usedFeatures(samples[0].features.size(), false);
        root = buildDecisionTree(samples, usedFeatures);
    }

    std::string predict(const Sample& sample) const {
        if (!root) {
            throw std::runtime_error("决策树尚未训练");
        }
        return predict(root.get(), sample);
    }

private:
    std::unique_ptr<TreeNode> root;

    std::unique_ptr<TreeNode> buildDecisionTree(std::vector<Sample> samples, std::vector<bool> usedFeatures) {
        auto node = std::make_unique<TreeNode>();

        // 如果所有样本属于同一类别,创建叶节点
        if (std::all_of(samples.begin(), samples.end(), [&](const Sample& s) { return s.label == samples[0].label; })) {
            node->isLeaf = true;
            node->label = samples[0].label;
            return node;
        }

        // 如果没有可用特征或样本为空,创建叶节点并返回多数类
        if (std::all_of(usedFeatures.begin(), usedFeatures.end(), [](bool b) { return b; }) || samples.empty()) {
            node->isLeaf = true;
            std::map<std::string, int> labelCount;
            for (const auto& sample : samples) {
                labelCount[sample.label]++;
            }
            node->label = std::max_element(labelCount.begin(), labelCount.end(),
                [](const std::pair<std::string, int>& a, const std::pair<std::string, int>& b) {
                    return a.second < b.second;
                })->first;
            return node;
        }

        // 选择最佳分裂特征和阈值
        int bestFeature;
        double bestThreshold;
        std::tie(bestFeature, bestThreshold) = selectBestFeatureAndThreshold(samples, usedFeatures);
        node->featureIndex = bestFeature;
        node->threshold = bestThreshold;
        node->isLeaf = false;

        // 根据选择的特征和阈值分裂样本
        std::vector<Sample> leftSamples, rightSamples;
        for (const auto& sample : samples) {
            if (sample.features[bestFeature] <= bestThreshold) {
                leftSamples.push_back(sample);
            } else {
                rightSamples.push_back(sample);
            }
        }

        // 递归构建子树
        usedFeatures[bestFeature] = true;
        node->left = buildDecisionTree(leftSamples, usedFeatures);
        node->right = buildDecisionTree(rightSamples, usedFeatures);

        return node;
    }

    std::pair<int, double> selectBestFeatureAndThreshold(const std::vector<Sample>& samples, const std::vector<bool>& usedFeatures) const {
        int bestFeature = -1;
        double bestThreshold = 0.0;
        double bestGainRatio = -1;

        for (int i = 0; i < samples[0].features.size(); i++) {
            if (!usedFeatures[i]) {
                std::vector<double> thresholds;
                for (const auto& sample : samples) {
                    thresholds.push_back(sample.features[i]);
                }
                std::sort(thresholds.begin(), thresholds.end());
                for (int j = 1; j < thresholds.size(); j++) {
                    double threshold = (thresholds[j - 1] + thresholds[j]) / 2;
                    double gainRatio = calculateGainRatio(samples, i, threshold);
                    if (gainRatio > bestGainRatio) {
                        bestGainRatio = gainRatio;
                        bestFeature = i;
                        bestThreshold = threshold;
                    }
                }
            }
        }

        return {bestFeature, bestThreshold};
    }

    double calculateGainRatio(const std::vector<Sample>& samples, int featureIndex, double threshold) const {
        double entropyBefore = calculateEntropy(samples);
        std::vector<Sample> leftSamples, rightSamples;
        for (const auto& sample : samples) {
            if (sample.features[featureIndex] <= threshold) {
                leftSamples.push_back(sample);
            } else {
                rightSamples.push_back(sample);
            }
        }

        double entropyAfter = (leftSamples.size() * calculateEntropy(leftSamples) + rightSamples.size() * calculateEntropy(rightSamples)) / samples.size();
        double informationGain = entropyBefore - entropyAfter;

        double splitInfo = 0.0;
        double pLeft = static_cast<double>(leftSamples.size()) / samples.size();
        double pRight = static_cast<double>(rightSamples.size()) / samples.size();
        if (pLeft > 0) splitInfo -= pLeft * std::log2(pLeft);
        if (pRight > 0) splitInfo -= pRight * std::log2(pRight);

        return (splitInfo != 0) ? informationGain / splitInfo : 0;
    }

    double calculateEntropy(const std::vector<Sample>& samples) const {
        std::map<std::string, int> labelCount;
        for (const auto& sample : samples) {
            labelCount[sample.label]++;
        }

        double entropy = 0.0;
        int totalSamples = samples.size();
        for (const auto& pair : labelCount) {
            double probability = static_cast<double>(pair.second) / totalSamples;
            entropy -= probability * std::log2(probability);
        }
        return entropy;
    }

    std::string predict(const TreeNode* node, const Sample& sample) const {
        if (node->isLeaf) {
            return node->label;
        }

        if (sample.features[node->featureIndex] <= node->threshold) {
            return predict(node->left.get(), sample);
        } else {
            return predict(node->right.get(), sample);
        }
    }
};

浏览量: 718

水波粼粼

决策树算法的编程实现｜递归的用法

递归

递归的含义

基线条件的重要性

递归与迭代的区别与联系

决策树

决策树的基本概念

决策树的类型

决策树构建中的递归实现

步骤

代码

决策树预测中的递归实现

步骤

代码

完整代码示例

关注公众号

递归

递归的含义

基线条件的重要性

递归与迭代的区别与联系

决策树

决策树的基本概念

决策树的类型

决策树构建中的递归实现

步骤

代码

决策树预测中的递归实现

步骤

代码

完整代码示例

相关文章

决策树如何有效处理分类变量 | 理论、实践与建议

scikit-learn的CART树——寻找分裂点的两种模式｜使用技巧

scikit-learn中的树算法 不能直接使用分类变量

关注公众号

scikit-learn中的树算法不能直接使用分类变量