Fixing page numbers and adding bookmarks in the book "Operation and Modeling of the MOS Transistor 4th.pdf"

Tip

A good book: "Operation and Modeling of the MOS Transistor", 4th edition

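(Many years later) I finally felt like picking up my old technical trade again 😄. Locked down at home during the Shanghai COVID outbreak, I started rereading basic device physics, and while reading a paper I happened to spot this book in the reference list. I went looking for it right away. There seemed to be no physical copy available in China, and after a lot of searching I finally found a scanned copy of the fourth edition on a European website. The book itself is excellent, but the PDF numbers its pages sequentially, which does not match the page numbers printed in the book, and it has no bookmarks or table of contents. That makes it really inconvenient to read on an electronic device, so I wanted to see what I could do about it.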

Working environment

Environment: WSL + Ubuntu 20.04 (I still prefer working in Linux).
Software: Python 3 + PyPDF4; pdfimages for extracting images from the PDF; tesseract for OCR.

Basic PyPDF4 operations

Open the original PDF file and append its pages to the output PDF

from PyPDF4 import pdf as PDF

fin = "Operation-and-Modeling-of-the-MOS-Transistor-4th.pdf"
fon = "Operation-and-Modeling-of-the-MOS-Transistor-4th-BM.pdf"

pdfin = PDF.PdfFileReader(fin)

pdfou = PDF.PdfFileWriter()
pdfou.appendPagesFromReader(pdfin)

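Setting page labels

The bulk of the code below comes from the web, mainly because I know very little about the PDF format itself and the PyPDF4 documentation offers no further explanation. The main things I learned from the code are:

What gets added to a PDF must be objects such as NameObject, NumberObject, and ArrayObject, not plain strings or numbers;
PyPDF4 does not provide very detailed or explicit methods for manipulating a PDF, so sometimes you have to access internal variables such as _root_object directly;
If you are not familiar with the PDF structure, you may have no idea what to use to reach your goal. Bing needs no VPN here, and I found its search results considerably better than Baidu's.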

def pdf_pagelabels_roman():
  number_type = PDF.DictionaryObject()
  number_type.update({PDF.NameObject("/S"):PDF.NameObject("/r")})
  return number_type

def pdf_pagelabels_decimal():
  number_type = PDF.DictionaryObject()
  number_type.update({PDF.NameObject("/S"):PDF.NameObject("/D")})
  return number_type

def pdf_pagelabels_decimal_with_offset(offset):
  number_type = pdf_pagelabels_decimal()
  number_type.update({PDF.NameObject("/St"):PDF.NumberObject(offset)})
  return number_type

nums_array = PDF.ArrayObject()
# Each entry consists of an index followed by a page label...
nums_array.append(PDF.NumberObject(0))  # Page 0:
nums_array.append(pdf_pagelabels_roman()) # Roman numerals

# Each entry consists of an index followed by a page label...
nums_array.append(PDF.NumberObject(24)) # Page 24 - 
nums_array.append(pdf_pagelabels_decimal_with_offset(1)) # Decimal numbers, with Offset

page_numbers = PDF.DictionaryObject()
page_numbers.update({PDF.NameObject("/Nums"):nums_array})

page_labels = PDF.DictionaryObject()
page_labels.update({PDF.NameObject("/PageLabels"): page_numbers})

root_obj = pdfou._root_object
root_obj.update(page_labels)
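
To sanity-check the two ranges before writing the file, the labels a viewer would display can be computed by hand. This is a minimal sketch, not part of the original script: the helper names `to_roman` and `page_label` are made up for illustration, and the `RANGES` table simply mirrors the `/Nums` array above (lowercase roman numerals from index 0, decimal numbers starting at 1 from index 24).

```python
# Hypothetical helpers (not PyPDF4 API): mimic how a viewer turns a
# zero-based page index into the label defined by the /Nums ranges above.
RANGES = [
    (0, "r", 1),   # from index 0: lowercase roman numerals, starting at i
    (24, "D", 1),  # from index 24: decimal numbers, starting at 1 (/St 1)
]

def to_roman(n):
    """Convert a positive integer to lowercase roman numerals."""
    table = [(1000, "m"), (900, "cm"), (500, "d"), (400, "cd"),
             (100, "c"), (90, "xc"), (50, "l"), (40, "xl"),
             (10, "x"), (9, "ix"), (5, "v"), (4, "iv"), (1, "i")]
    out = []
    for value, symbol in table:
        count, n = divmod(n, value)
        out.append(symbol * count)
    return "".join(out)

def page_label(index, ranges=RANGES):
    """Return the label shown for a zero-based page index."""
    # Pick the last range whose start index is not beyond this page.
    start, style, first = max(r for r in ranges if r[0] <= index)
    number = first + (index - start)
    return to_roman(number) if style == "r" else str(number)

print(page_label(0), page_label(23), page_label(24), page_label(736))
# prints: i xxiv 1 713
```

If the label printed for a given index does not match the number printed on that scanned page, the start index of the second range (24 here) is the value to adjust.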

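Adding bookmarks

I originally wanted to use PyPDF4's page image extraction to pull out the book's Contents pages, but it cannot decode the JBIG2 format that scanning software commonly uses and raises an error. Searching online, I found the pdfimages tool, which on Ubuntu is part of poppler-utils. Note that -png sets the output image format to PNG; without a format option, pdfimages writes PBM/PPM files.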

sudo apt install poppler-utils
pdfimages -f first_page_num -l last_page_num -png PDF_File IMAGE_DIR/
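Python also has some OCR libraries, but after a quick search I chose tesseract, a tool whose development was long sponsored by Google. It can also be installed directly on Ubuntu with sudo apt install tesseract-ocr tesseract-ocr-all, where tesseract-ocr-all pulls in the trained data for all languages; you can also install the trained data for specific languages only. I ran it on each exported image in turn and appended the OCR results to a text file for later use.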

tesseract pic00_file - --psm 4 | tee -a out.txt
......
tesseract picxx_file - --psm 4 | tee -a out.txt
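
Typing one command per image gets tedious, and the same loop can be driven from Python. This is a rough sketch, assuming the exported images are `.png` files in a single directory and that `tesseract` is on the `PATH`; the function names `ocr_commands` and `run_ocr` and the directory name in the usage comment are hypothetical.

```python
import subprocess
from pathlib import Path

def ocr_commands(image_dir):
    """Build one tesseract command per PNG, in sorted file-name order."""
    return [["tesseract", str(png), "-", "--psm", "4"]
            for png in sorted(Path(image_dir).glob("*.png"))]

def run_ocr(image_dir, out_path="out.txt"):
    """Run each command and append its stdout to out_path (like tee -a)."""
    with open(out_path, "a", encoding="utf-8") as out:
        for cmd in ocr_commands(image_dir):
            result = subprocess.run(cmd, capture_output=True,
                                    text=True, check=True)
            out.write(result.stdout)

# Example call (hypothetical directory name):
# run_ocr("IMAGE_DIR")
```

Sorting by file name keeps the OCR text in page order, since pdfimages numbers its output files sequentially.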

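I then edited the OCR output by hand in a text editor, fixing errors and building the structure below so that Python can walk the whole table of contents and write it into the PDF. The fixing and editing really tests your text-editor skills; even as a Vimer it took me quite a while. The number that opens each group (-1 and 23, on lines 2 and 9 of the listing) is the offset for that group of page numbers: adding it to the last number of an innermost tuple gives the actual zero-based page index in the PDF. The first element of an innermost tuple names the bookmark's parent node, the second is the bookmark's own key, and the third is the title shown in the bookmark.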

pdf_toc = (
  ( -1, (
    ("0", "-1", "About the Author", 6),
    ("0", "-2", "Contents", 9),
    ("0", "-3", "PREFACE", 17),
    )
  ),

  ( 23, (
  
    ("0", "1",   "CHAPTER 1 Semiconductors, Junctions, and MOSFET Overview", 1),
    ("1", "1.1", "1.1 Introduction", 1),
    # ......
    ("0", "IND", "Index", 713),
   )
  )
)
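
As a quick check of the offset arithmetic, the nested tuple can be flattened into absolute, zero-based PDF page indices. This is only an illustrative sketch: `sample_toc` is a shortened stand-in for `pdf_toc` with the elided entries left out, and `resolve_toc` is a hypothetical helper name.

```python
# Shortened stand-in for pdf_toc (elided entries omitted).
sample_toc = (
    (-1, (
        ("0", "-1", "About the Author", 6),
        ("0", "-2", "Contents", 9),
    )),
    (23, (
        ("0", "1", "CHAPTER 1 Semiconductors, Junctions, and MOSFET Overview", 1),
        ("1", "1.1", "1.1 Introduction", 1),
        ("0", "IND", "Index", 713),
    )),
)

def resolve_toc(toc):
    """Yield (parent_key, key, title, zero_based_index) for every entry."""
    for offset, marks in toc:
        for parent, key, title, num in marks:
            yield parent, key, title, num + offset

for parent, key, title, index in resolve_toc(sample_toc):
    print(f"{index:4d}  parent={parent:<3} {title}")
```

Chapter 1 resolves to index 24, which is the same index at which the decimal page labels start, so the two offsets agree.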

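The code below walks the tuple above and uses addBookmark to add the bookmarks to the PDF one by one. Note that a parent string of "0" means the parent is None, i.e. the entry is a top-level node.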

tocs = {}
for offset, marks in pdf_toc:
  for (toc_p, toc_s, title, num) in marks:
    if toc_p == "0" :
      bm = pdfou.addBookmark(title, num+offset, bold=True)
      tocs[toc_s] = bm
    else:
      bf = "." not in toc_s  # bold only for entries without a section number
      if toc_p in tocs:
        bm = pdfou.addBookmark(title, num+offset, parent=tocs[toc_p], bold=bf)
        tocs[toc_s] = bm
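
One thing to watch: the loop silently drops any entry whose parent key has not been registered yet, for example because of a typo introduced while hand-editing the OCR output. A small pre-flight check along these lines can catch that before the PDF is touched. `find_orphans` is a hypothetical helper, and the sample data is invented for illustration.

```python
def find_orphans(toc):
    """Return (key, parent) pairs whose parent is not defined earlier."""
    seen = set()
    orphans = []
    for _offset, marks in toc:
        for parent, key, _title, _num in marks:
            if parent != "0" and parent not in seen:
                orphans.append((key, parent))
            else:
                # Only entries that would actually be added can act as parents.
                seen.add(key)
    return orphans

# Invented sample: "2.1" points at chapter "2", which is never defined.
sample = (
    (23, (
        ("0", "1", "CHAPTER 1", 1),
        ("1", "1.1", "1.1 Introduction", 1),
        ("2", "2.1", "2.1 Orphaned section", 50),
    )),
)
print(find_orphans(sample))
# prints: [('2.1', '2')]
```

An empty list means every bookmark in the table will find its parent.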

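Adding metadata

The problem here is the same as with the page labels: since I did not know which data make up this information, I could only find out by searching the web.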

import datetime
now_time = datetime.datetime.now()
time_str = now_time.strftime('%Y-%m-%d %A %X')  # %F is a glibc extension; spell it out for portability

infoDict = pdfou._info.getObject()
infoDict.update({
  PDF.NameObject('/Title'): PDF.createStringObject(u'Operation and Modeling of the MOS Transistor 4th'),
  PDF.NameObject('/Author'): PDF.createStringObject(u'Yannis Tsividis@Columbia University, Colin McAndrew@Freescale Semiconductor'),
  PDF.NameObject('/Creator'): PDF.createStringObject(u'A Script Based on PyPDF4'),
  PDF.NameObject('/Note'): PDF.createStringObject(u'By EIJ on '+time_str),
})
##  PDF.NameObject('/Subject'): PDF.createStringObject(u'subject'),
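
The snippets so far only build the writer in memory; `fon` is defined but never written to. Here is a minimal sketch of the final step, assuming the usual `PdfFileWriter.write(stream)` method. `save_pdf` is a hypothetical wrapper, and it accepts any object with a `write(stream)` method so it can be tried without a real PDF.

```python
def save_pdf(writer, path):
    """Write the assembled document to path in binary mode."""
    with open(path, "wb") as stream:
        writer.write(stream)
    return path

# With the objects defined earlier this would be:
# save_pdf(pdfou, fon)
```

Writing to a separate output file, as the `fin`/`fon` pair suggests, leaves the original scan untouched if something goes wrong.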

Link to the final result

The adjusted PDF file can be downloaded from Baidu Netdisk: Operation and Modeling of the MOS Transistor 4th, extraction code: ndp5
