DQN的多种改进(1)

1.N-step DQN

N-step DQN的核心是将bellman方程展开,即 Q(st,at)=rt+γrt+1+γ2maxaQ(st+2,a)Q(s_t,a_t) = r_t + \gamma r_{t+1} + \gamma^2 max_{a'}Q(s_{t+2},a')
显然,这个式子可以进一步的拓展。 但要注意的是,这里假设了ata_t是趋近于最优动作,因此才能省略max

书中提到,该方法的优点在于可以加速Q网络的收敛。 原因在于, 由于一开始的随机数据,使得真正准确的Q值其实只存在于最后一个状态。 因为只有最后一个状态的Q值等于reward是准确的,其余的都掺杂有不准确的target_Q网络的预测值。 而准确的Q值会在第一次迭代后影响到倒数第二层, 继而在下一次迭代后影响到倒数第三层。。 而如果使用N-step DQN, 可以使得准确的Q值在第一次迭代时就影响到倒数前N层,因此起到了加速收敛的作用。但是N值不能取得太大,因为每一步的a并不是最优动作,N值太大时会使得Q的计算严重出错,因为省略了max。且由于DQN off-policy的性质,a的值很可能来源于old policy, 从而影响性能。

2.Double DQN (DDQN)

一句话概括DDQN的改变就是下面这个式子:

Q(st,at)=rt+γmaxatarget_Q(s,argmaxaQ(s,a))Q(s_t,a_t) = r_t + \gamma max_{a'}target\_Q(s', argmax_{a'}Q(s',a'))

比较一下DQN的式子:
Q(st,at)=rt+γmaxaAtarget_Q(s,a)Q(s_t,a_t) = r_t + \gamma max_{a'\in A}target\_Q(s', a')
可以发现,区别在于 DDQN通过Q网络来选取a’, 而不是完全使用target_Q。 结果显示,这样可以防止Q网络的对value的过度预测,加快收敛。

3. Noise Network

之前的 ϵ\epsilon-greedy 的探索方式其实并不好,而一种改进的策略就是在Network中加入noise。(可以认为该noise强迫网络进行探索,而由于noise也会加入后向传播的优化,所以也会逐渐收敛)。

  1. 第一种方式: 对所有全连接层的每个权重,都加上一个高斯分布的噪声项进行干扰
  2. 限制高斯变量只在一个有限的随机矩阵中取。

结论: 大大加速了收敛。

发布了47 篇原创文章 · 获赞 116 · 访问量 73万+
展开阅读全文

在网上找到一个DQN的神经网络代码。可以运行,但是没有读取模型的部分

09-22

代码可以运行,但是没用读取模型的代码,我在网上找了一段时间,还是没有找到教程。自己写的读写代码不能正常工作 这是原代码 ``` import pygame import random from pygame.locals import * import numpy as np from collections import deque import tensorflow as tf import cv2 BLACK = (0 ,0 ,0 ) WHITE = (255,255,255) SCREEN_SIZE = [320,400] BAR_SIZE = [50, 5] BALL_SIZE = [15, 15] # 神经网络的输出 MOVE_STAY = [1, 0, 0,0] MOVE_LEFT = [0, 1, 0,0] MOVE_RIGHT = [0, 0, 1,0] MOVE_RIGHT1=[0,0,0,1] class Game(object): def __init__(self): pygame.init() self.clock = pygame.time.Clock() self.screen = pygame.display.set_mode(SCREEN_SIZE) pygame.display.set_caption('Simple Game') self.ball_pos_x = SCREEN_SIZE[0]//2 - BALL_SIZE[0]/2 self.ball_pos_y = SCREEN_SIZE[1]//2 - BALL_SIZE[1]/2 self.ball_dir_x = -1 # -1 = left 1 = right self.ball_dir_y = -1 # -1 = up 1 = down self.ball_pos = pygame.Rect(self.ball_pos_x, self.ball_pos_y, BALL_SIZE[0], BALL_SIZE[1]) self.bar_pos_x = SCREEN_SIZE[0]//2-BAR_SIZE[0]//2 self.bar_pos = pygame.Rect(self.bar_pos_x, SCREEN_SIZE[1]-BAR_SIZE[1], BAR_SIZE[0], BAR_SIZE[1]) # action是MOVE_STAY、MOVE_LEFT、MOVE_RIGHT # ai控制棒子左右移动;返回游戏界面像素数和对应的奖励。(像素->奖励->强化棒子往奖励高的方向移动) def step(self, action): if action == MOVE_LEFT: self.bar_pos_x = self.bar_pos_x - 2 elif action == MOVE_RIGHT: self.bar_pos_x = self.bar_pos_x + 2 elif action == MOVE_RIGHT1: self.bar_pos_x = self.bar_pos_x + 1 else: pass if self.bar_pos_x < 0: self.bar_pos_x = 0 if self.bar_pos_x > SCREEN_SIZE[0] - BAR_SIZE[0]: self.bar_pos_x = SCREEN_SIZE[0] - BAR_SIZE[0] self.screen.fill(BLACK) self.bar_pos.left = self.bar_pos_x pygame.draw.rect(self.screen, WHITE, self.bar_pos) self.ball_pos.left += self.ball_dir_x * 2 self.ball_pos.bottom += self.ball_dir_y * 3 pygame.draw.rect(self.screen, WHITE, self.ball_pos) if self.ball_pos.top <= 0 or self.ball_pos.bottom >= (SCREEN_SIZE[1] - BAR_SIZE[1]+1): self.ball_dir_y = self.ball_dir_y * -1 if self.ball_pos.left <= 0 or self.ball_pos.right >= (SCREEN_SIZE[0]): self.ball_dir_x = self.ball_dir_x * -1 reward = 0 if self.bar_pos.top <= self.ball_pos.bottom and (self.bar_pos.left < self.ball_pos.right and self.bar_pos.right > self.ball_pos.left): reward = 1 # 击中奖励 elif self.bar_pos.top <= self.ball_pos.bottom and (self.bar_pos.left > self.ball_pos.right or self.bar_pos.right < self.ball_pos.left): reward = -1 # 没击中惩罚 # 获得游戏界面像素 screen_image = pygame.surfarray.array3d(pygame.display.get_surface()) #np.save(r'C:\Users\Administrator\Desktop\game\model\112454.npy',screen_image) pygame.display.update() # 返回游戏界面像素和对应的奖励 return reward, screen_image # learning_rate LEARNING_RATE = 0.99 # 更新梯度 INITIAL_EPSILON = 1.0 FINAL_EPSILON = 0.05 # 测试观测次数 EXPLORE = 500000 OBSERVE = 50000 # 存储过往经验大小 REPLAY_MEMORY = 500000 BATCH = 100 output = 4 # 输出层神经元数。代表3种操作-MOVE_STAY:[1, 0, 0] MOVE_LEFT:[0, 1, 0] MOVE_RIGHT:[0, 0, 1] input_image = tf.placeholder("float", [None, 80, 100, 4]) # 游戏像素 action = tf.placeholder("float", [None, output]) # 操作 # 定义CNN-卷积神经网络 参考:http://blog.topspeedsnail.com/archives/10451 def convolutional_neural_network(input_image): weights = {'w_conv1':tf.Variable(tf.zeros([8, 8, 4, 32])), 'w_conv2':tf.Variable(tf.zeros([4, 4, 32, 64])), 'w_conv3':tf.Variable(tf.zeros([3, 3, 64, 64])), 'w_fc4':tf.Variable(tf.zeros([3456, 784])), 'w_out':tf.Variable(tf.zeros([784, output]))} biases = {'b_conv1':tf.Variable(tf.zeros([32])), 'b_conv2':tf.Variable(tf.zeros([64])), 'b_conv3':tf.Variable(tf.zeros([64])), 'b_fc4':tf.Variable(tf.zeros([784])), 'b_out':tf.Variable(tf.zeros([output]))} conv1 = tf.nn.relu(tf.nn.conv2d(input_image, weights['w_conv1'], strides = [1, 4, 4, 1], padding = "VALID") + biases['b_conv1']) conv2 = tf.nn.relu(tf.nn.conv2d(conv1, weights['w_conv2'], strides = [1, 2, 2, 1], padding = "VALID") + biases['b_conv2']) conv3 = tf.nn.relu(tf.nn.conv2d(conv2, weights['w_conv3'], strides = [1, 1, 1, 1], padding = "VALID") + biases['b_conv3']) conv3_flat = tf.reshape(conv3, [-1, 3456]) fc4 = tf.nn.relu(tf.matmul(conv3_flat, weights['w_fc4']) + biases['b_fc4']) output_layer = tf.matmul(fc4, weights['w_out']) + biases['b_out'] return output_layer # 深度强化学习入门: https://www.nervanasys.com/demystifying-deep-reinforcement-learning/ # 训练神经网络 def train_neural_network(input_image): predict_action = convolutional_neural_network(input_image) argmax = tf.placeholder("float", [None, output]) gt = tf.placeholder("float", [None]) action = tf.reduce_sum(tf.multiply(predict_action, argmax), reduction_indices = 1) cost = tf.reduce_mean(tf.square(action - gt)) optimizer = tf.train.AdamOptimizer(1e-6).minimize(cost) game = Game() D = deque() _, image = game.step(MOVE_STAY) # 转换为灰度值 image = cv2.cvtColor(cv2.resize(image, (100, 80)), cv2.COLOR_BGR2GRAY) # 转换为二值 ret, image = cv2.threshold(image, 1, 255, cv2.THRESH_BINARY) input_image_data = np.stack((image, image, image, image), axis = 2) with tf.Session() as sess: sess.run(tf.initialize_all_variables()) saver = tf.train.Saver() n = 0 epsilon = INITIAL_EPSILON while True: action_t = predict_action.eval(feed_dict = {input_image : [input_image_data]})[0] argmax_t = np.zeros([output], dtype=np.int) if(random.random() <= INITIAL_EPSILON): maxIndex = random.randrange(output) else: maxIndex = np.argmax(action_t) argmax_t[maxIndex] = 1 if epsilon > FINAL_EPSILON: epsilon -= (INITIAL_EPSILON - FINAL_EPSILON) / EXPLORE for event in pygame.event.get(): #macOS需要事件循环,否则白屏 if event.type == QUIT: pygame.quit() sys.exit() reward, image = game.step(list(argmax_t)) image = cv2.cvtColor(cv2.resize(image, (100, 80)), cv2.COLOR_BGR2GRAY) ret, image = cv2.threshold(image, 1, 255, cv2.THRESH_BINARY) image = np.reshape(image, (80, 100, 1)) input_image_data1 = np.append(image, input_image_data[:, :, 0:3], axis = 2) D.append((input_image_data, argmax_t, reward, input_image_data1)) if len(D) > REPLAY_MEMORY: D.popleft() if n > OBSERVE: minibatch = random.sample(D, BATCH) input_image_data_batch = [d[0] for d in minibatch] argmax_batch = [d[1] for d in minibatch] reward_batch = [d[2] for d in minibatch] input_image_data1_batch = [d[3] for d in minibatch] gt_batch = [] out_batch = predict_action.eval(feed_dict = {input_image : input_image_data1_batch}) for i in range(0, len(minibatch)): gt_batch.append(reward_batch[i] + LEARNING_RATE * np.max(out_batch[i])) optimizer.run(feed_dict = {gt : gt_batch, argmax : argmax_batch, input_image : input_image_data_batch}) input_image_data = input_image_data1 n = n+1 if n % 100 == 0: saver.save(sess, 'D:/lolAI/model/game', global_step = n) # 保存模型 print(n, "epsilon:", epsilon, " " ,"action:", maxIndex, " " ,"reward:", reward) train_neural_network(input_image) ``` 这是我根据教程写的读取模型并且运行的代码 ``` import tensorflow as tf tf.reset_default_graph() with tf.Session() as sess: new_saver = tf.train.import_meta_graph('D:/lolAI/model/game-400.meta') new_saver.restore(sess, tf.train.latest_checkpoint('D:/lolAI/model')) print(sess.run(tf.initialize_all_variables())) ``` 代码我还没有看的很明白,希望大佬给点意见 问答

没有更多推荐了,返回首页

©️2019 CSDN 皮肤主题: 大白 设计师: CSDN官方博客

分享到微信朋友圈

×

扫一扫,手机浏览