PaddleOCR: Extracting Table Content from Images or PDFs

## Project Overview
[PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR/tree/release/2.6)

## Installation
The environment used here is Python 3.10.11.
```
pip install paddleocr
pip install paddlepaddle
```
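After installation, and before running anything, copy the `ppstructure` directory from the installed `paddleocr` package into the current working directory, as shown here: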

![](/media/202308/2023-08-04_235132&45fxU7OmiEcy61rNHngG.png)
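The copy step can also be scripted instead of done by hand. The following is a minimal sketch, not part of PaddleOCR itself: `find_package_dir` and `copy_subdir` are hypothetical helper names, and it assumes the installed `paddleocr` package contains a `ppstructure` subdirectory, as the screenshot suggests.

```python
import importlib.util
import shutil
from pathlib import Path


def find_package_dir(package_name):
    """Return the directory of an installed package, or None if it is not found."""
    spec = importlib.util.find_spec(package_name)
    if spec is None or not spec.submodule_search_locations:
        return None
    return Path(list(spec.submodule_search_locations)[0])


def copy_subdir(package_dir, subdir, dest_root="."):
    """Copy package_dir/subdir to dest_root/subdir and return the destination path."""
    src = Path(package_dir) / subdir
    if not src.is_dir():
        raise FileNotFoundError(f"{src} does not exist")
    dest = Path(dest_root) / subdir
    # dirs_exist_ok=True (Python 3.8+) lets the copy be re-run safely.
    shutil.copytree(src, dest, dirs_exist_ok=True)
    return dest
```

With `paddleocr` installed, calling `copy_subdir(find_package_dir("paddleocr"), "ppstructure")` from the project directory should reproduce the manual copy.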

## Hands-On
Layout analysis + table recognition:
```
import os
import cv2
from paddleocr import PPStructure,draw_structure_result,save_structure_res

table_engine = PPStructure(show_log=True)  # downloads the models on first run
# To use the image orientation classification model, install the paddleclas package, then use:
# table_engine = PPStructure(show_log=True, image_orientation=True)

save_folder = './output'
img_path = '11.png'
img = cv2.imread(img_path)
result = table_engine(img)
save_structure_res(result, save_folder,os.path.basename(img_path).split('.')[0])
# At this point the extraction itself is already complete

for line in result:
    line.pop('img')
    print(line)

from PIL import Image

font_path = 'doc/fonts/simfang.ttf'  # font file provided in the PaddleOCR repository
image = Image.open(img_path).convert('RGB')
im_show = draw_structure_result(image, result,font_path=font_path)
im_show = Image.fromarray(im_show)
im_show.save('result.jpg')
```
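The `result` list can also be consumed directly in Python rather than only through the saved files. The sketch below assumes, based on the PP-Structure documentation, that each table region is a dict whose `type` is `table` and whose `res` is a dict holding the table as an HTML string under the key `html`. `TableRowParser`, `table_rows`, and `sample_result` are hypothetical names and data used for illustration only.

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the text of each <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def table_rows(result):
    """Return one list of rows per table region found in a PPStructure-style result."""
    tables = []
    for region in result:
        if str(region.get("type", "")).lower() != "table":
            continue
        res = region.get("res")
        # Assumption: table regions carry their HTML under res["html"].
        html = res.get("html") if isinstance(res, dict) else None
        if not html:
            continue
        parser = TableRowParser()
        parser.feed(html)
        parser.close()
        tables.append(parser.rows)
    return tables


# Hand-written stand-in for a real result, used only to show the shape of the data.
sample_result = [
    {"type": "text", "bbox": [0, 0, 10, 10], "res": []},
    {"type": "table", "bbox": [0, 20, 100, 80],
     "res": {"html": "<html><body><table><tr><td>Name</td><td>Qty</td></tr>"
                     "<tr><td>Apple</td><td>3</td></tr></table></body></html>"}},
]
print(table_rows(sample_result))
```

In the script above, `table_rows(result)` could be called right after `table_engine(img)`; the nested lists can then be written to CSV or loaded into a DataFrame.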

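Once the `save_structure_res` call in the script has finished, the extraction is complete. The original image and the results are shown here:
> Original image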

![](/media/202308/2023-08-04_235328&SR07ICHMvnTUWFwXAtYx.png)

> Recognition results

![](/media/202308/2023-08-04_235359&F49ldaviMs0UDo3Rp5jZ.png)

![](/media/202308/2023-08-04_235445&YFd1cyeOW8vm9DXAa63i.png)

![](/media/202308/2023-08-04_235503&wCjm7aeWu9LQk2XNyK06.png)
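
To check what was written without opening a file manager, the output folder can be summarized from Python. This is a small sketch using only the standard library. It assumes that `save_structure_res` places its files in a subfolder of `save_folder` named after the image (here `./output/11`), which is what its arguments suggest; `group_outputs` is a hypothetical helper name.

```python
from collections import defaultdict
from pathlib import Path


def group_outputs(save_folder, img_name):
    """Group the file names in save_folder/img_name by file extension."""
    out_dir = Path(save_folder) / img_name
    grouped = defaultdict(list)
    if not out_dir.is_dir():
        return {}
    for path in sorted(out_dir.iterdir()):
        if path.is_file():
            grouped[path.suffix.lower()].append(path.name)
    return dict(grouped)
```

For the run above, `group_outputs('./output', '11')` would return a dict keyed by extension, which makes it easy to pick out the spreadsheet files produced for recognized tables.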