In computer vision work, once data preparation is done and the model has been defined, we can start iteratively training it. By tuning different hyperparameters (which takes both theory and plenty of experience) we try to optimize the two most common metrics, loss and accuracy. After training finishes, the whole run is usually assessed with a loss curve and an accuracy curve.
Before training, we split the data into a training set and a validation set: the model is trained on the training set and evaluated on the validation set. Once the best parameters have been found, we run the model on the test set exactly once, save the resulting predictions to a CSV file, and submit it to Kaggle to see the score and guide later improvements.
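For reference, here is a minimal sketch of writing predictions into a submission CSV with the id/label columns used by the Dogs vs. Cats Redux competition; the test_ids and predictions arrays are hypothetical stand-ins for the real inference output:

import csv

# Hypothetical inference results: one "probability of dog" per test image
test_ids = [1, 2, 3]               # image ids parsed from the test file names
predictions = [0.92, 0.13, 0.57]   # model outputs, one per id

with open('submission.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['id', 'label'])  # header row expected by the competition
    for img_id, prob in zip(test_ids, predictions):
        writer.writerow([img_id, prob])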
There are three common ways to split a dataset (a sketch of the first method follows the list):
- simple hold-out validation;
- K-fold cross-validation;
- iterated K-fold validation with shuffling.
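As an illustration of simple hold-out validation, a minimal sketch using numpy; file_list and label_list are assumed to come from the data-preparation step:

import numpy as np

def holdout_split(file_list, label_list, val_ratio=0.2, seed=0):
    """Shuffle the samples once, then split them into train/validation parts."""
    files = np.asarray(file_list)
    labels = np.asarray(label_list)
    rng = np.random.RandomState(seed)
    idx = rng.permutation(len(files))     # shuffled sample indices
    n_val = int(len(files) * val_ratio)   # size of the validation set
    val_idx, train_idx = idx[:n_val], idx[n_val:]
    return files[train_idx], labels[train_idx], files[val_idx], labels[val_idx]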
With these training methods and caveats in mind, we can start writing the TensorFlow program that trains the model iteratively and saves the final model. This first requires learning TensorFlow model persistence, i.e. how to save and restore a model.
TensorFlow Model Persistence
This part covers how to write a TensorFlow program that persists a trained model and then restores it from the persisted files. TensorFlow provides the tf.train.Saver class for saving and restoring a neural network model.
Saving a Model
The following program shows how to save a model:
import tensorflow as tf

# Path where the model will be saved
model_path = 'C:/Users/Administrator/logs/model.ckpt'

# Declare two variables and compute their sum
v1 = tf.Variable(tf.constant(1.0, shape=[1]), name="v1")
v2 = tf.Variable(tf.constant(3.0, shape=[1]), name="v2")
result = v1 + v2

# Declare a tf.train.Saver for saving the model
saver = tf.train.Saver()

with tf.Session() as sess:
    # Initialize all variables
    sess.run(tf.global_variables_initializer())
    # Save the model to the specified file
    saver.save(sess, model_path)
After running this program, four files appear under the save path. This is because TensorFlow stores the structure of the computation graph and the values of its variables separately:
- model.ckpt.meta stores the structure of the computation graph
- model.ckpt.data-00000-of-00001 stores the value of every variable in the graph
- checkpoint records the list of model files in the directory, which makes it easy to locate the model when restoring
- model.ckpt.index is not something we use directly here; it indexes where each variable's value lives in the .data file
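As an aside, the checkpoint file is what lets tf.train.latest_checkpoint locate the newest saved model automatically. A minimal sketch, using the example directory from above:

import tensorflow as tf

# Look up the newest checkpoint recorded in the 'checkpoint' file
ckpt_path = tf.train.latest_checkpoint('C:/Users/Administrator/logs/')
if ckpt_path is not None:
    print(ckpt_path)  # e.g. C:/Users/Administrator/logs/model.ckpt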
Loading a Model
There are two common ways to load a model:
- define every operation of the TensorFlow computation graph again in the loading program;
- load the persisted graph directly, without redefining its operations.
Example code for the first method:
import tensorflow as tf

# Path where the model was saved
model_path = 'C:/Users/Administrator/logs/model.ckpt'

# Declare the variables and define the graph structure exactly as in the saving code
v1 = tf.Variable(tf.constant(1.0, shape=[1]), name="v1")
v2 = tf.Variable(tf.constant(3.0, shape=[1]), name="v2")
result = v1 + v2

saver = tf.train.Saver()

with tf.Session() as sess:
    # Load the saved model and compute the sum using the restored variable values
    saver.restore(sess, model_path)
    print(sess.run(result))
Example code for the second method:
import tensorflow as tf

# Path where the model was saved
model_path = 'C:/Users/Administrator/logs/model.ckpt'

# Load the persisted graph structure from the .meta file
saver = tf.train.import_meta_graph('C:/Users/Administrator/logs/model.ckpt.meta')

with tf.Session() as sess:
    saver.restore(sess, model_path)
    # Fetch the output tensor of the addition op by its name
    print(sess.run(tf.get_default_graph().get_tensor_by_name("add:0")))
Both methods print the same result:
INFO:tensorflow:Restoring parameters from C:/Users/Administrator/logs/model.ckpt
[ 4.]
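A note on the name "add:0" used in the second method: it refers to output 0 of the op that TensorFlow automatically named add when evaluating v1 + v2. If you would rather not rely on the automatic name, you can name the result explicitly when building the graph. A minimal sketch mirroring the save example above:

import tensorflow as tf

v1 = tf.Variable(tf.constant(1.0, shape=[1]), name="v1")
v2 = tf.Variable(tf.constant(3.0, shape=[1]), name="v2")
# Give the sum an explicit, stable name instead of relying on "add:0"
result = tf.identity(v1 + v2, name='result')

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # The tensor can now be fetched as "result:0"
    print(sess.run(tf.get_default_graph().get_tensor_by_name("result:0")))  # [ 4.]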
Implementing Iterative Training
The program code is as follows:
import tensorflow as tf
import matplotlib.pyplot as plt
import numpy as np
import os
import time

# Import the model definition and data preparation modules
import model
import input_data

# --------------------------- Network hyperparameters ---------------------------
N_CLASSES = 2           # number of output classes
IMG_W = 227             # image width
IMG_H = 227             # image height
IMG_C = 3               # image channels
BATCH_SIZE = 10         # training batch size
MAX_STEP = 20000        # maximum number of training steps
CAPACITY = 2000         # capacity of the input queue
LEARNING_RATE = 0.0001  # learning rate

# Local paths for the training set and for saving models/logs
train_dir = "F:/Software/Python_Project/Classification-cat-dog/train/"
logs_train_dir = "F:/Software/Python_Project/Classification-cat-dog/logs/"

# Corresponding paths on the cloud server
# train_dir = '/data/Dogs-Cats-Redux-Kernels-Edition/train/'
# logs_train_dir = '/data/Dogs-Cats-Redux-Kernels-Edition/logs/'

# --------------------------- Training function ---------------------------
def run_training():
    # Get the list of training file names and the matching label list
    file_list, label_list = input_data.get_files(train_dir)
    # Produce one batch of images and labels at a time
    train_batch, train_label_batch = input_data.get_batch(file_list, label_list,
                                                          IMG_W, IMG_H,
                                                          BATCH_SIZE, CAPACITY)

    regularizer = tf.contrib.layers.l2_regularizer(0.0001)
    # Network output for the training batch
    train_logits = model.inference(train_batch, True, BATCH_SIZE, regularizer, N_CLASSES)
    train_loss = model.losses(train_logits, train_label_batch)     # batch loss
    train_op = model.trainning(train_loss, LEARNING_RATE)          # update the weights
    train_acc = model.evaluation(train_logits, train_label_batch)  # batch accuracy

    # Input/output placeholders; the labels are not one-hot encoded.
    # Note: the network above consumes train_batch directly, so these
    # placeholders are not part of the training graph (see the pitfall below).
    x_train = tf.placeholder(tf.float32, shape=[BATCH_SIZE, IMG_W, IMG_H, IMG_C], name='x_')
    y_train_ = tf.placeholder(tf.int32, shape=[BATCH_SIZE, ], name='y_')

    # # Alternative: build the network on the placeholders instead.
    # # logits is a batch_size x 2 array
    # logits = model.inference(x_train, True, BATCH_SIZE, regularizer, N_CLASSES)
    # # (small trick) multiply logits by 1 and give the result a name, so the
    # # output tensor can be fetched by name when the model is loaded later
    # b = tf.constant(value=1, dtype=tf.float32)
    # logits_eval = tf.multiply(logits, b, name='logits_eval')
    # # Cross entropy as the loss measuring the gap between predictions and labels
    # cross_entropy = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits, labels=y_train_)
    # # Average cross entropy over all samples in the batch
    # loss = tf.reduce_mean(cross_entropy, name='loss')
    # # Optimize the loss with tf.train.AdamOptimizer
    # train_op = tf.train.AdamOptimizer(learning_rate=0.001).minimize(loss)
    # # Accuracy of the model on one batch
    # correct_prediction = tf.equal(tf.cast(tf.argmax(logits, 1), tf.int32), y_train_)
    # acc = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

    with tf.Session() as sess:
        tra_loss = []  # list of per-step loss values
        saver = tf.train.Saver()                     # TensorFlow persistence class
        sess.run(tf.global_variables_initializer())  # initialize all variables
        coord = tf.train.Coordinator()               # multi-thread coordinator
        threads = tf.train.start_queue_runners(sess=sess, coord=coord)  # start the queue runners
        summary_op = tf.summary.merge_all()          # merge all summaries
        # Write the training summaries to logs_train_dir
        train_writer = tf.summary.FileWriter(logs_train_dir, sess.graph)

        try:
            for step in np.arange(MAX_STEP):
                if coord.should_stop():
                    break
                image_batch, label_batch = sess.run([train_batch, train_label_batch])
                # Run one training step and fetch the loss and accuracy values
                _, loss, acc = sess.run([train_op, train_loss, train_acc],
                                        feed_dict={x_train: image_batch, y_train_: label_batch})
                tra_loss.append(loss)
                if step % 100 == 0:
                    # Print the loss and accuracy every 100 steps. Use the fetched
                    # values loss/acc here, not the tensors train_loss/train_acc,
                    # or %.2f raises "TypeError: must be real number, not Tensor".
                    print('Step %d, train loss = %.2f, train accuracy = %.2f%%'
                          % (step, loss, acc * 100.0))
                    summary_str = sess.run(summary_op)
                    train_writer.add_summary(summary_str, step)
                if step % 2000 == 0 or (step + 1) == MAX_STEP:
                    # Save the model
                    checkpoint_path = os.path.join(logs_train_dir, 'model.ckpt')
                    saver.save(sess, checkpoint_path, global_step=step)
                    print('Model saved')
            print('Training finished!')
        except tf.errors.OutOfRangeError:
            print('Done training -- epoch limit reached.')
        finally:
            # Stop all threads
            coord.request_stop()
            coord.join(threads)

    # Plot the loss curve
    plt.plot(tra_loss)
    plt.xlabel('Iter')
    plt.ylabel('loss')
    plt.title('lr=%f, ti=%d, bs=%d' % (LEARNING_RATE, MAX_STEP, BATCH_SIZE))
    plt.tight_layout()
    plt.savefig('cat_and_dog_alexnet.jpg', dpi=200)

# ------------------------------- Entry point ---------------------------------
if __name__ == "__main__":
    run_training()
Output
Limited by my laptop's performance, the model had not finished training when I wrote this post, so only part of the output is shown here; I'll fill in the final output and the analysis of the loss and accuracy curves tomorrow.
There are 12500 cats
There are 12500 dogs
Step 0, train loss = 113810.02, train accuracy = 50%
Step 100, train loss = 20647.10, train accuracy = 40%
Step 200, train loss = 16054.08, train accuracy = 50%
Step 300, train loss = 7717.75, train accuracy = 50%
Step 400, train loss = 5881.07, train accuracy = 50%
Step 500, train loss = 2879.47, train accuracy = 70%
Step 600, train loss = 338.30, train accuracy = 70%
Step 700, train loss = 1178.86, train accuracy = 50%
Step 800, train loss = 287.65, train accuracy = 50%
Step 900, train loss = 245.80, train accuracy = 50%
Step 1000, train loss = 20.37, train accuracy = 50%
Step 1100, train loss = 49.53, train accuracy = 60%
Step 1200, train loss = 11.61, train accuracy = 60%
Step 1300, train loss = 1.78, train accuracy = 70%
Step 1400, train loss = 10.86, train accuracy = 30%
Step 1500, train loss = 2.33, train accuracy = 30%
Step 1600, train loss = 26.34, train accuracy = 40%
Step 1700, train loss = 43.71, train accuracy = 50%
Step 1800, train loss = 14.57, train accuracy = 60%
Step 1900, train loss = 23.90, train accuracy = 30%
Step 2000, train loss = 1.50, train accuracy = 50%
Step 2100, train loss = 3.84, train accuracy = 50%
Step 2200, train loss = 1.06, train accuracy = 60%
Step 2300, train loss = 1.90, train accuracy = 50%
Step 2400, train loss = 8.90, train accuracy = 50%
Step 2500, train loss = 4.88, train accuracy = 40%
Step 2600, train loss = 1.83, train accuracy = 70%
Step 2700, train loss = 3.73, train accuracy = 40%
Step 2800, train loss = 40.79, train accuracy = 40%
Step 2900, train loss = 57.23, train accuracy = 40%
Step 3000, train loss = 1.04, train accuracy = 80%
Step 3100, train loss = 1.16, train accuracy = 50%
Step 3200, train loss = 2.04, train accuracy = 50%
Step 3300, train loss = 49.13, train accuracy = 50%
Step 3400, train loss = 1.67, train accuracy = 70%
Step 3500, train loss = 2.48, train accuracy = 40%
Step 3600, train loss = 2.01, train accuracy = 50%
Step 3700, train loss = 2.04, train accuracy = 60%
Step 3800, train loss = 0.62, train accuracy = 60%
Step 3900, train loss = 3.46, train accuracy = 40%
Step 4000, train loss = 1.15, train accuracy = 50%
Step 4100, train loss = 2.64, train accuracy = 40%
Step 4200, train loss = 1.08, train accuracy = 60%
Step 4300, train loss = 5.22, train accuracy = 60%
Step 4400, train loss = 7.35, train accuracy = 50%
Step 4500, train loss = 0.60, train accuracy = 90%
Step 4600, train loss = 1.60, train accuracy = 80%
Step 4700, train loss = 1.02, train accuracy = 50%
Step 4800, train loss = 1.46, train accuracy = 60%
Step 4900, train loss = 1.33, train accuracy = 40%
Notes on Using the Input File Queue
As for feeding training data into the network, I have used two approaches before: shuffling and batching the data directly with numpy and feeding it through placeholders, and feeding Tensor data through TensorFlow's input file queue (tf.train.shuffle_batch). Both approaches work.
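For comparison, here is a minimal sketch of the placeholder approach mentioned above, with numpy doing the shuffling and batching; the names are illustrative, not the real project code:

import numpy as np
import tensorflow as tf

# Placeholders that the numpy batches are fed into at run time
x = tf.placeholder(tf.float32, shape=[None, 227, 227, 3], name='x')
y_ = tf.placeholder(tf.int32, shape=[None], name='y_')
# ... build the network on x, the loss on y_, and a train op ...

def iterate_batches(images, labels, batch_size):
    """Shuffle once per epoch, then yield mini-batches (drops the last partial batch)."""
    idx = np.random.permutation(len(images))
    for start in range(0, len(images) - batch_size + 1, batch_size):
        batch = idx[start:start + batch_size]
        yield images[batch], labels[batch]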
However, over the past two days I found a nasty pitfall in TensorFlow: when you use the file queue for input processing, you must feed the tensors returned by tf.train.batch directly into the network. Feeding them through placeholders instead raises the following error:
TypeError: must be real number, not Tensor
or possibly this one:
InvalidArgumentError: You must feed a value for placeholder tensor 'x_' with dtype float and shape [10,227,227,3]
[[Node: x_ = Placeholder[dtype=DT_FLOAT, shape=[10,227,227,3], _device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]
As for the reason, I don't know yet and haven't dug into it, but it took me two days of debugging to find, and I haven't seen anyone mention this problem before. In case the description above is unclear, here is the code (only the key part):
Working code:
# Get the list of training file names and the matching label list
file_list, label_list = input_data.get_files(train_dir)
# Produce one batch of images and labels
train_batch, train_label_batch = input_data.get_batch(file_list, label_list,
                                                      IMG_W, IMG_H,
                                                      BATCH_SIZE, CAPACITY)

regularizer = tf.contrib.layers.l2_regularizer(0.0001)
# Build the network on the queue output directly
train_logits = model.inference(train_batch, True, BATCH_SIZE, regularizer, N_CLASSES)
train_loss = model.losses(train_logits, train_label_batch)     # batch loss
train_op = model.trainning(train_loss, LEARNING_RATE)          # update the weights
train_acc = model.evaluation(train_logits, train_label_batch)  # batch accuracy

# Placeholders; the labels are not one-hot encoded
x_train = tf.placeholder(tf.float32, shape=[BATCH_SIZE, IMG_W, IMG_H, IMG_C], name='x_')
y_train_ = tf.placeholder(tf.int32, shape=[BATCH_SIZE, ], name='y_')
Broken code:
# Get the list of training file names and the matching label list
file_list, label_list = input_data.get_files(train_dir)
# Produce one batch of images and labels
train_batch, train_label_batch = input_data.get_batch(file_list, label_list,
                                                      IMG_W, IMG_H,
                                                      BATCH_SIZE, CAPACITY)

# Placeholders; the labels are not one-hot encoded
x_train = tf.placeholder(tf.float32, shape=[BATCH_SIZE, IMG_W, IMG_H, IMG_C], name='x_')
y_train_ = tf.placeholder(tf.int32, shape=[BATCH_SIZE, ], name='y_')

regularizer = tf.contrib.layers.l2_regularizer(0.0001)
# Build the network on the placeholders instead of the queue output
train_logits = model.inference(x_train, True, BATCH_SIZE, regularizer, N_CLASSES)
train_loss = model.losses(train_logits, y_train_)      # batch loss
train_op = model.trainning(train_loss, LEARNING_RATE)  # update the weights
train_acc = model.evaluation(train_logits, y_train_)   # batch accuracy
Finally, although I have taken away some lessons from this, I hope someone will dig into the real cause and mechanism; that would lead to a much deeper understanding of TensorFlow.
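One plausible explanation, which I have not fully verified: both errors can come from running placeholder-dependent ops without a feed_dict, rather than from the queue itself. The TypeError matches what happens when a Tensor object is passed to a %.2f format instead of the fetched value, and the InvalidArgumentError matches sess.run(summary_op) being called without feeding x_, since in the "broken" version the merged summaries depend on the placeholders. If that is the cause, a sketch of the fix in the training loop would be:

# Feed the same batch when evaluating the merged summaries,
# because in the placeholder version they depend on x_ and y_
summary_str = sess.run(summary_op,
                       feed_dict={x_train: image_batch, y_train_: label_batch})
train_writer.add_summary(summary_str, step)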
References
- 《TensorFlow实战谷歌深度学习框架第二版》
- 《深度学习卷积神经网络从入门到精通》
- 《TensorFlow深度学习应用实践》