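When the data set is large, we usually train a machine learning model in minibatches. When the data set is small, however, we can load all of it into memory and pass the whole set to the trainer on every iteration. This post implements that kind of training.
We reuse the previous implementation, so the previous source code is the starting point. Only the data loading needs a new method. The Iris data is stored in a text file that looks like this: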
sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,setosa(1 0 0)
7.0,3.2,4.7,1.4,versicolor(0 1 0)
7.6,3.0,6.6,2.1,virginica(0 0 1)
...
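The output column is encoded with the 1-of-N (one-hot) rule we saw earlier; the values in parentheses show the encoding of each species. Note that the loader below parses the label as numClasses numeric, comma-separated columns that follow the feature columns. The method reads the whole file, parses it, and creates two float arrays:
float[] feature and float[] label
As you can see, both arrays are one-dimensional: all values are laid out one sample after another in a single array, because that is what CNTK requires. Since the data is flat, we also have to supply its dimensions so that CNTK can tell where each sample's values begin and end. The following code loads the Iris data into two one-dimensional arrays and returns them as a tuple.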
C#
// requires: using System.Collections.Generic; using System.Globalization; using System.IO;
static (float[], float[]) loadIrisDataset(string filePath, int featureDim, int numClasses)
{
    var rows = File.ReadAllLines(filePath);
    var features = new List<float>();
    var label = new List<float>();

    // start at 1 to skip the header row
    for (int i = 1; i < rows.Length; i++)
    {
        var row = rows[i].Split(',');

        // the first featureDim columns hold the features
        var input = new float[featureDim];
        for (int j = 0; j < featureDim; j++)
        {
            input[j] = float.Parse(row[j], CultureInfo.InvariantCulture);
        }

        // the next numClasses columns hold the one-hot encoded label
        var output = new float[numClasses];
        for (int k = 0; k < numClasses; k++)
        {
            int oIndex = featureDim + k;
            output[k] = float.Parse(row[oIndex], CultureInfo.InvariantCulture);
        }

        // append to the flat arrays, sample after sample
        features.AddRange(input);
        label.AddRange(output);
    }

    return (features.ToArray(), label.ToArray());
}
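To make the flat layout concrete, here is a minimal sketch of how one sample can be recovered from such a one-dimensional array when its dimension is known. GetSample is a hypothetical helper introduced only for illustration; it is not part of CNTK or of the code above. The feature values are the first two sample rows from the listing, and the label values assume the one-hot encoding shown there. The sketch uses C# top-level statements (C# 9 or later) so it can run as a single file.

```csharp
using System;

// Hypothetical helper: copies sample `index` out of a flat, sample-after-sample buffer.
static float[] GetSample(float[] flat, int dim, int index)
{
    if (flat.Length % dim != 0)
        throw new ArgumentException("Buffer length must be a multiple of dim.");
    var sample = new float[dim];
    // sample `index` starts at offset index * dim
    Array.Copy(flat, index * dim, sample, 0, dim);
    return sample;
}

// Two samples: 4 features each, followed by 3 one-hot label values each.
float[] features = { 5.1f, 3.5f, 1.4f, 0.2f, 7.0f, 3.2f, 4.7f, 1.4f };
float[] labels = { 1f, 0f, 0f, 0f, 1f, 0f };

int featureDim = 4, numClasses = 3;
int sampleCount = features.Length / featureDim;

float[] second = GetSample(features, featureDim, 1);
float[] secondLabel = GetSample(labels, numClasses, 1);
// position of the 1 in the one-hot vector is the class index
int classIndex = Array.IndexOf(secondLabel, 1f);

Console.WriteLine($"samples: {sampleCount}");
Console.WriteLine($"sample 1: {string.Join(", ", second)}");
Console.WriteLine($"class index of sample 1: {classIndex}");
```

The same offset rule (index * dim) is what a consumer of the flat array relies on once it knows the shape, which is why the dimension has to be passed along with the data.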
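Once the data is loaded, only a few code changes are needed to train on the whole batch instead of using minibatchSource. First, we define a few variables that describe the structure of the NN model. Then we call loadIrisDataset and create xValues and yValues from the returned arrays, and we create the feature and label input variables. Next, we create a dictionary that maps the feature and label variables to their data values; this dictionary is passed to the trainer later.
The rest of the code is the same as in the previous version: it creates the NN model, the loss and evaluation functions, and the learning rate. Then we run a loop of 800 iterations. When the last iteration is done, the program prints the training summary and terminates.
All of this is implemented in the following code: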
C#
// requires: using System; using System.Collections.Generic; using CNTK;
public static void TrainIriswithBatch(DeviceDescriptor device)
{
    // data file path
    var iris_data_file = "Data/iris_with_hot_vector.csv";

    // Network definition
    int inputDim = 4;
    int numOutputClasses = 3;
    int numHiddenLayers = 1;
    int hidenLayerDim = 6;
    int sampleSize = 130;

    // load data into memory
    var dataSet = loadIrisDataset(iris_data_file, inputDim, numOutputClasses);

    // wrap the flat arrays in batch values; the shape tells CNTK the size of one sample
    var xValues = Value.CreateBatch(new NDShape(1, inputDim), dataSet.Item1, device);
    var yValues = Value.CreateBatch(new NDShape(1, numOutputClasses), dataSet.Item2, device);

    // build a NN model
    // define input and output variables
    var feature = Variable.InputVariable(new NDShape(1, inputDim), DataType.Float);
    var label = Variable.InputVariable(new NDShape(1, numOutputClasses), DataType.Float);

    // Combine variables and data into a Dictionary for the training
    var dic = new Dictionary<Variable, Value>();
    dic.Add(feature, xValues);
    dic.Add(label, yValues);

    // Build simple Feed Forward Neural Network model (createFFNN comes from the previous implementation)
    var ffnn_model = createFFNN(feature, numHiddenLayers, hidenLayerDim, numOutputClasses, Activation.Tanh, "IrisNNModel", device);

    // Loss and error functions definition
    var trainingLoss = CNTKLib.CrossEntropyWithSoftmax(new Variable(ffnn_model), label, "lossFunction");
    var classError = CNTKLib.ClassificationError(new Variable(ffnn_model), label, "classificationError");

    // set learning rate for the network
    var learningRatePerSample = new TrainingParameterScheduleDouble(0.001125, 1);

    // define learners for the NN model
    var ll = Learner.SGDLearner(ffnn_model.Parameters(), learningRatePerSample);

    // define trainer based on ffnn_model, loss and error functions, and SGD learner
    var trainer = Trainer.CreateTrainer(ffnn_model, trainingLoss, classError, new Learner[] { ll });

    // Preparation for the iterative learning process
    // 800 epochs/iterations; batch size equals the sample size since the data set is small
    int epochs = 800;
    int i = 0;
    while (epochs > 0)
    {
        // every iteration trains on the whole data set
        trainer.TrainMinibatch(dic, device);

        // print progress (printTrainingProgress comes from the previous implementation)
        printTrainingProgress(trainer, i++, 50);

        // count down so the loop stops after 800 iterations
        epochs--;
    }

    // Summary of training
    double acc = Math.Round((1.0 - trainer.PreviousMinibatchEvaluationAverage()) * 100, 2);
    Console.WriteLine("------TRAINING SUMMARY--------");
    Console.WriteLine($"The model trained with the accuracy {acc}%");
}
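The loop-and-report pattern in the training method can be looked at in isolation. The following is a minimal sketch that runs without CNTK: RunTraining is a hypothetical driver, and the step delegate is a stand-in for one call to trainer.TrainMinibatch followed by reading the evaluation average. The constant error value is made up purely to exercise the loop and is not a training result. It again uses C# top-level statements (C# 9 or later).

```csharp
using System;
using System.Collections.Generic;

// Hypothetical driver: calls `step` exactly `iterations` times and
// logs a progress line every `reportEvery` iterations.
// `step` returns the classification error of that iteration, in [0, 1].
static double RunTraining(int iterations, int reportEvery, Func<int, double> step, Action<string> log)
{
    double lastError = 1.0;
    for (int i = 0; i < iterations; i++)
    {
        lastError = step(i);
        if (i % reportEvery == 0)
            log($"iteration {i}: error = {lastError}");
    }
    // Same conversion as in the training summary: error rate to accuracy in percent.
    return Math.Round((1.0 - lastError) * 100, 2);
}

var messages = new List<string>();
// Stand-in step with a constant, made-up error; a real step would train the model.
double acc = RunTraining(800, 50, iteration => 0.25, messages.Add);

Console.WriteLine($"progress lines: {messages.Count}");
Console.WriteLine($"accuracy: {acc}%");
```

Using a for loop with an explicit bound keeps the iteration count and the stopping condition in one place, so the loop cannot run forever if a decrement is forgotten.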