tensorflow 中dataset.padded_batch函数的个人理解过程

Octavia ·

更新时间:2024-09-20

· 775 次阅读

今天继续啃Tensorflow实战Google深度学习框架这本书，在250P的Seq2Seq模型代码理解时候有点困难，其中padded_batch(batch_size,padded_shapes)这个函数为最，本次仅为记录刨根问底的过程，也是整理一下类似函数的理解过程。

1直接查看英文解释，并且配合W3school的中文解释，锻炼英文阅读理解能力，尤其是专业的英文单词。

直接在pycharm上查看代码自带的英文注释


"""Combines consecutive elements of this dataset into padded batches.
Like `Dataset.dense_to_sparse_batch()`, this method combines
multiple consecutive elements of this dataset, which might have
different shapes, into a single element. The tensors in the
resulting element have an additional outer dimension, and are
padded to the respective shape in `padded_shapes`.
Args:
  batch_size: A `tf.int64` scalar `tf.Tensor`, representing the number of
    consecutive elements of this dataset to combine in a single batch.
  padded_shapes: A nested structure of `tf.TensorShape` or
    `tf.int64` vector tensor-like objects representing the shape
    to which the respective component of each input element should
    be padded prior to batching. Any unknown dimensions
    (e.g. `tf.Dimension(None)` in a `tf.TensorShape` or `-1` in a
    tensor-like object) will be padded to the maximum size of that
    dimension in each batch.
  padding_values: (Optional.) A nested structure of scalar-shaped
    `tf.Tensor`, representing the padding values to use for the
    respective components.  Defaults are `0` for numeric types and
    the empty string for string types.
Returns:
  A `Dataset`.
"""
结合W3school的中文解释，https://www.w3cschool.cn/tensorflow_python/tensorflow_python-pqdr2cqn.html
将此数据集的连续元素合并为填充的批处理.
像 Dataset.dense_to_sparse_batch() 一样, 此方法将此数据集的多个连续元素 (可能具有不同的形状) 合并到单个元素中.结果元素中的张量有一个额外的外部维度, 并填充到 padded_shapes 中的相应形状.
ARGS：
batch_size：一个 tf.int64 标量 tf.Tensor,表示此数据集的连续元素在单个批处理中合并的数量.
	padded_shapes：tf.TensorShape 的嵌套结构或 tf. int64 向量张量样对象,表示每个输入元素的各自组件在批处理之前应填充的形状.任何未知的维度 (例如 tf.Dimension(None) 在一个 TensorShape 或-1 在一个类似张量的对象中) 将被填充到每个批次中该维度的最大维度.
padding_values：(可选)一个标量形状的嵌套结构 tf.Tensor,表示要用于各个组件的填充值.对于数字类型和字符串类型的空字符串,默认值为 0.
返回：
一个数据集
具体应用实例，我参考了这位博主的博文https://blog.csdn.net/z2539329562/article/details/89791783，经过删减并添加了自己的注释。
第一个实例：


import tensorflow as tf
import numpy as np
tf.reset_default_graph()
x = [[1, 0, 0],
     [2, 3, 0],
     [4, 5, 6],
     [7, 8, 0],
     [9, 0, 0],
     [0, 1, 0]]
# tf.TensorShape([])     表示长度为单个数字
# tf.TensorShape([None]) 表示长度未知的向量
padded_shapes = (
    tf.TensorShape([None])
)
dataset = tf.data.Dataset.from_tensor_slices(x)
iterator_before = dataset.make_one_shot_iterator()
dataset_padded = dataset.padded_batch(2, padded_shapes=padded_shapes)
#dataset_padded = dataset.padded_batch(2, padded_shapes=[None]) 也是可以的，原因看下面注释1
iterator_later = dataset_padded.make_one_shot_iterator()#iterator_later是经过padded_batch处理的数据集迭代器
sess = tf.Session()
try:
    while True:
        print(sess.run(iterator_before.get_next()))
except tf.errors.OutOfRangeError:
    print("end")
try:
    while True:
        print(sess.run(iterator_later.get_next()))
except tf.errors.OutOfRangeError:
    print("end")


注释1：参考的网址：http://www.tensorfly.cn/tfdoc/resources/dims_types.html

运行结果
[1 0 0]

[2 3 0]

[4 5 6]

[7 8 0]

[9 0 0]

[0 1 0]

end

[[1 0 0]

 [2 3 0]]

[[4 5 6]

 [7 8 0]]

[[9 0 0]

 [0 1 0]]

end
数据集dataset 保存着原来二维的数组X，dataset 中的每一个元素是一个1*3的数组，也就是X的每一个行，iterator_before 只是顺序输出dataset的每一个元素。而dataset_padded中的三个元素分别是
[[1 0 0]

 [2 3 0]]
[[4 5 6]

 [7 8 0]]
[[9 0 0]

 [0 1 0]]
可见padded_batch(2, padded_shapes=padded_shapes)将原来dataset的元素[1 0 0] 和元素[2 3 0]   从新组合成为了一个新的元素
[[1 0 0]

 [2 3 0]]    因为batch_size是2，所以是使用原来dataset的两个元素组合成一个新的元素（一个列表）
那padded_shapes什么意思那？别急，我们看下个例子。
第二个实例：

import tensorflow as tf
import numpy as np
# tf.TensorShape([])     表示长度为单个数字
# tf.TensorShape([None]) 表示长度未知的向量
padded_shapes = (
    tf.TensorShape([None]),
    #tf.TensorShape([])
)
dataset = tf.data.Dataset.range(10)#生成一个数据集里面的元素分别是 0 1 2 3 4 5 6 7 8  9
dataset = dataset.map(lambda x: tf.fill([tf.cast(x, tf.int32)], x))
iterator = dataset.make_one_shot_iterator()
sess = tf.Session()
try:
    i = 0
    while True:
        print("elem",i,":", sess.run(iterator.get_next()))
        i+=1
except tf.errors.OutOfRangeError:
    print("end")

输出是：
elem 0 : []

elem 1 : [1]

elem 2 : [2 2]

elem 3 : [3 3 3]

elem 4 : [4 4 4 4]

elem 5 : [5 5 5 5 5]

elem 6 : [6 6 6 6 6 6]

elem 7 : [7 7 7 7 7 7 7]

elem 8 : [8 8 8 8 8 8 8 8]

elem 9 : [9 9 9 9 9 9 9 9 9]

end
{
这里也写一下注释：为什么会生成这样的dataset，猜也能猜的到是这一句代码：
dataset = dataset.map(lambda x: tf.fill([tf.cast(x, tf.int32)], x))
这行代码的作用，就是对dataset的每一个元素，比如0或者1或者2，进行一个运算，这个运算就是tf.fill([tf.cast(x, tf.int32)], x)，而tf.fill的作用，我们直接看pycharm 源代码的英文注释就能看懂，

r"""Creates a tensor filled with a scalar value.
This operation creates a tensor of shape `dims` and fills it with `value`.
For example:
```
# Output tensor has shape [2, 3].
fill([2, 3], 9) ==> [[9, 9, 9]
                     [9, 9, 9]]
```
也就是0 这个元素替换为了fill([0], 0)   刚好就是第一个输出[]  。1这个元素替换为了fill([1], 1) ，也就是第二个输出[1]
}
好了，回归正题，在第二个实例中，这里我们就得到了一个dataset里面的元素是看成长度一样的list，正好我们用这个数据集试试填充，之前关于padded_bach函数中第二个参数padded_shapes参数的说明 “任何未知的维度 (例如 tf.Dimension(None) 在一个 TensorShape 或-1 在一个类似张量的对象中) 将被填充到每个批次中该维度的最大维度.”

import tensorflow as tf
import numpy as np
# tf.TensorShape([])     表示长度为单个数字
# tf.TensorShape([None]) 表示长度未知的向量
padded_shapes = (
    tf.TensorShape([None])
    #tf.TensorShape([])
)
dataset =tf.data.Dataset.range(10)
dataset = dataset.map(lambda x: tf.fill([tf.cast(x, tf.int32)], x))
dataset = dataset.padded_batch(2, padded_shapes=padded_shapes)# 多了这一行代码
iterator = dataset.make_one_shot_iterator()
sess = tf.Session()
try:
    i=1
    while True:
        print("elem",i,":", sess.run(iterator.get_next()))
        i+=1;
except tf.errors.OutOfRangeError:
    print("end")
输出如下：
elem 1 : [[0]

              [1]]

elem 2 : [[2 2 0]

              [3 3 3]]

elem 3 : [[4 4 4 4 0]

              [5 5 5 5 5]]

elem 4 : [[6 6 6 6 6 6 0]

              [7 7 7 7 7 7 7]]

elem 5 : [[8 8 8 8 8 8 8 8 0]

              [9 9 9 9 9 9 9 9 9]]

end
可以看到，原来的两个元素 [] [1] 合并成一个元素了，并且[]这个元素填充了一个0，以确保和[1]这个元素一样长。
下面再看一个例子，如何使用

padded_shapes = (
    tf.TensorShape([None]),#表示长读未知的向量
    tf.TensorShape([])#表示为单个数字
)
第三个实例：

import tensorflow as tf
import numpy as np
# tf.TensorShape([])     表示长度为单个数字
# tf.TensorShape([None]) 表示长度未知的向量
padded_shapes = (
    tf.TensorShape([None]),
    tf.TensorShape([])
)
dataset =tf.data.Dataset.range(10)
dataset = dataset.map(lambda x: tf.fill([tf.cast(x, tf.int32)], x))
dataset = dataset.map(lambda x: (x, tf.reduce_sum(x)))
iterator = dataset.make_one_shot_iterator()
sess = tf.Session()
try:
    i=1
    while True:
        print("elem",i,":", sess.run(iterator.get_next()))
        i+=1;
except tf.errors.OutOfRangeError:
    print("end")
输出为：
elem 1 : (array([], dtype=int64), 0)

elem 2 : (array([1], dtype=int64), 1)

elem 3 : (array([2, 2], dtype=int64), 4)

elem 4 : (array([3, 3, 3], dtype=int64), 9)

elem 5 : (array([4, 4, 4, 4], dtype=int64), 16)

elem 6 : (array([5, 5, 5, 5, 5], dtype=int64), 25)

elem 7 : (array([6, 6, 6, 6, 6, 6], dtype=int64), 36)

elem 8 : (array([7, 7, 7, 7, 7, 7, 7], dtype=int64), 49)

elem 9 : (array([8, 8, 8, 8, 8, 8, 8, 8], dtype=int64), 64)

elem 10 : (array([9, 9, 9, 9, 9, 9, 9, 9, 9], dtype=int64), 81)

end
当代码添加  加粗的哪一行代码后

dataset =tf.data.Dataset.range(10)
dataset = dataset.map(lambda x: tf.fill([tf.cast(x, tf.int32)], x))
dataset = dataset.map(lambda x: (x, tf.reduce_sum(x)))
dataset = dataset.padded_batch(2, padded_shapes=padded_shapes)#我就是加粗的那一行 
iterator = dataset.make_one_shot_iterator()
输出
elem 1 : (array([[0],

       [1]], dtype=int64), array([0, 1], dtype=int64))

elem 2 : (array([[2, 2, 0],

       [3, 3, 3]], dtype=int64), array([4, 9], dtype=int64))

elem 3 : (array([[4, 4, 4, 4, 0],

       [5, 5, 5, 5, 5]], dtype=int64), array([16, 25], dtype=int64))

elem 4 : (array([[6, 6, 6, 6, 6, 6, 0],

       [7, 7, 7, 7, 7, 7, 7]], dtype=int64), array([36, 49], dtype=int64))

elem 5 : (array([[8, 8, 8, 8, 8, 8, 8, 8, 0],

       [9, 9, 9, 9, 9, 9, 9, 9, 9]], dtype=int64), array([64, 81], dtype=int64))

end
dataset的每一个元素是 类似包含两个元素的元组  (array([[0],[1]], dtype=int64), array([0, 1], dtype=int64))，array([[0],[1]]是两个元素[] [1]经过填充的再合并成一个array数组，而 array([0, 1]并不需要填充，两个元素 0 1直接合并成一个array数组。
大概总结就是这样，应该还是有不少的纰漏，毕竟才疏学浅，基础也不太好，有些东西不知道怎么形容才好。。。。


作者：qq_32691667
                    
 
                

                            Dataset
                            batch
                            tensorflow


           
    
    

            
                
                    
                
            
            
                
    
        
            需要 登录 后方可回复, 如果你还没有账号请 注册新账号
        
    
                
            
                
                    
                        相关文章

    
        
    
    
        
            探索PowerShell(一) 初识 PowerShell
        
        
            Maleah
            2021-05-23
        
    
    
        828
    


    
        
            CSS使用盒模型实例讲解
        
        
            Bunny
            2020-08-29
        
    
    
        825
    


    
        
            python实现Linux异步epoll代码
        
        
            Bella
            2020-07-24
        
    
    
        638
    


    
        
            TensorFlow如何指定GPU训练模型
        
        
            Maleah
            2022-11-09
        
    
    
        872
    


    
        
            TensorFlow中关于tf.app.flags命令行参数解析模块
        
        
            Fern
            2022-11-09
        
    
    
        1394
    


    
        
            Tensorflow高性能数据优化增强工具Pipeline使用详解
        
        
            Bliss
            2022-11-09
        
    
    
        36
    


    
        
    
    
        
            利用OpenCV+Tensorflow实现的手势识别
        
        
            Dulcea
            2022-11-12
        
    
    
        987
    


    
        
            Tensorflow2.1MNIST图像分类实现思路分析
        
        
            Malina
            2022-11-20
        
    
    
        1598
    


    
        
            Tensorflow2.1实现文本中情感分类实现解析
        
        
            Rose
            2022-11-20
        
    
    
        1182
    


    
        
    
    
        
            Tensorflow2.1完成对MPG回归预测详解
        
        
            Querida
            2022-11-20
        
    
    
        696
    


    
        
    
    
        
            Pytorch如何加载自己的数据集(使用DataLoader读取Dataset)
        
        
            Fawn
            2022-12-15
        
    
    
        205
    


    
        
            Tensorflow2.4从头训练Word Embedding实现文本分类
        
        
            Serafina
            2023-01-06
        
    
    
        1212
    


    
        
            Tensorflow2.4搭建单层和多层Bi-LSTM模型
        
        
            Kathy
            2023-01-06
        
    
    
        65
    


    
        
    
    
        
            深度学习Tensorflow 2.4 完成迁移学习和模型微调
        
        
            Tani
            2023-01-06
        
    
    
        1085
    


    
        
            Python基于TensorFlow接口实现深度学习神经网络回归
        
        
            Lark
            2023-02-18
        
    
    
        364
    


    
        
            tensorflow1.x和tensorflow2.x中的tensor转换为字符串的实现
        
        
            Olivia
            2023-02-25
        
    
    
        1309
    


    
        
            Echarts通过dataset数据集实现创建单轴散点图
        
        
            Phylicia
            2023-02-26
        
    
    
        205
    


    
        
            tensorflow基于Anaconda环境搭建的方法步骤
        
        
            Oria
            2023-02-28
        
    
    
        278
    


    
        
    
    
        
            Anaconda中安装Tensorflow的过程
        
        
            Psyche
            2023-03-31
        
    
    
        506
    


    
        
            使用Python、TensorFlow和Keras来进行垃圾分类的操作方法
        
        
            Laila
            2023-05-12
        
    
    
        349


        
    
        
            我要提问
        
    
    
        
        
    
        致谢
        
            帮助他人，成就自己。
            人生最大成功就是伸出热情而温暖的双手，尽自己所能去帮助身边的每一个人，只要无私的奉献，就会收获到美好的生活。
            1024问感谢每一位朋友的帮助和支持。
            软件开发网提供编程的基础软件技术培训教程,软件开发编程实例讲解Go,Node,HTML,CSS,Javascript,Python,Java,Ruby,C,PHP,MySQL等软件开发编程语言以及数据开发的基础知识，也提供大量的软件开发在线实例、从入门到精通就在1024问。
        
    
    
        
            
    育儿网
    微养生
    全球行
    美食街
    育儿
    菜谱大全
    海南旅游
    女性
    养狗百科
    星座