I have two PowerPoint files (File1.pptx and File2.pptx) that both contain the following two lines:

XX NOV 2021, Time: xx:xx – xx:xx hrs (90mins)
FY21/22 / FY22/23

I want to make the following replacements:

a) NOV 2021 with NOV 2022.

b) FY21/22 / FY22/23 with FY21/22 or FY22/23.

The problem is that my replacement works in File1.pptx but not in File2.pptx.

When I printed the run text, I saw that the same line is split into runs differently in the two files.

def replace_text(replacements:dict,shapes:list):
    for shape in shapes:
        for match, replacement in replacements.items():
            if shape.has_text_frame:
                if (shape.text.find(match)) != -1:
                    text_frame = shape.text_frame
                    for paragraph in text_frame.paragraphs:
                        for run in paragraph.runs:
                            cur_text = run.text
                            print(cur_text)
                            print("---")
                            new_text = cur_text.replace(str(match), str(replacement))
                            run.text = new_text

In File1.pptx, cur_text looks like this for the first keyword, so the replacement works (the run contains the keyword I am looking for):

[Image: printed run text from File1.pptx, with the keyword inside a single run]

In File2.pptx, cur_text looks like this for the first keyword, so the replacement does not work (no single run's text contains my search term):

[Image: printed run text from File2.pptx, with the keyword split across several runs]

The same issue happens for 2nd keyword as well which is FY21/22 / FY22/23.

The problem is that part of the keyword can sit in the previous or the next run relative to the current one, with no apparent pattern. So we need to compare the search term against the text of the neighbouring runs combined with the current run. Then a match (like NOV 2021) can be found and replaced.

This happens for only about 10% of my search terms, but it is worrying: if that share grows, we may have to do a lot of manual work. How can we avoid this and code it correctly?

How can we find the term we are looking for when it is spread across multiple runs (as CTRL+F does) and replace it with the desired text?

Any help, please?

Recommended answer

As described in python-pptx's documentation at https://python-pptx.readthedocs.io/en/latest/api/text.html:

  1. a text frame is made up of paragraphs,
  2. a paragraph is made up of runs and specifies a font configuration that is used as the default for its runs, and
  3. a run specifies part of the paragraph's text with a certain font configuration, possibly different from the paragraph's default.

All three have a property called text:

  1. The text frame's text contains the text of all its paragraphs, joined with a line feed between paragraphs.
  2. The paragraph's text contains the text of all its runs concatenated into one string, with a vertical tab character (\v) wherever there is a so-called soft break (a soft break is like a line feed but does not terminate the paragraph).
  3. The run's text contains text that is to be rendered with a certain font configuration (font family, font size, italic/bold/underlined, color, etc.). It is the lowest level of font configuration for any text.

Now if you specify a line of text in a text-frame in a PowerPoint presentation, this text-frame will very likely only have one paragraph and that paragraph will have just one run.

Let's say that line says: Hello there! How are you? What is your name? and is all normal (neither italic nor bold) and in size 10.

Now if you go ahead in PowerPoint and make the questions How are you? What is your name? stand out by making them italic, you will end up with 2 runs in our paragraph:

  1. Hello there! with the default font configuration from the paragraph
  2. How are you? What is your name? with the font configuration specifying the additional italic attribute.

Now imagine we want How are you? to stand out even more by making it bold and italic. We end up with 3 runs:

  1. Hello there! with the default font configuration from the paragraph.
  2. How are you? with the font configuration specifying the BOLD and ITALIC attribute
  3. What is your name? with the font configuration specifying the ITALIC attribute.

One step further, making the are in How are you? bigger. We get 5 runs:

  1. Hello there! with the default font configuration from the paragraph.
  2. How with the font configuration specifying the BOLD and ITALIC attribute
  3. are with the font configuration specifying the BOLD and ITALIC attribute and font size 16
  4. you? with the font configuration specifying the BOLD and ITALIC attribute
  5. What is your name? with the font configuration specifying the ITALIC attribute.

So if you try to replace the How are you? with I'm fine! with the code from your question, you won't succeed, because the text How are you? is actually distributed across 3 runs.

You can go one level higher and look at the paragraph's text, which still says Hello there! How are you? What is your name? since it is the concatenation of all its runs' texts.

But if you go ahead and do the replacement on the paragraph's text, it will erase all runs and create one new run with the text Hello there! I'm fine! What is your name?, deleting all the formatting that we put on What is your name? in the process.

Therefore, changing text in a paragraph without affecting the formatting of the other text in the paragraph is pretty involved. And even if the text you are looking for has the same formatting throughout, that is no guarantee that it sits within one run. If you make the are in our example above smaller again, the 5 runs will very likely remain; runs 2 to 4 will simply have the same font configuration.
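To make the "distributed across runs" point concrete, here is a small sketch that works on plain strings standing in for the run texts (no presentation file involved). The helper name runs_spanned is hypothetical and not part of python-pptx; it only reports which runs the first occurrence of a search term touches.

```python
def runs_spanned(run_texts, match):
    """Locate the first occurrence of `match` in the concatenated run texts.

    Returns (first_run_index, last_run_index, offset_in_first_run),
    or None if the concatenated text does not contain `match`.
    """
    start = "".join(run_texts).find(match)
    if start < 0:
        return None
    end = start + len(match)          # exclusive end of the match
    first = offset = None
    run_start = 0                     # position of the current run in the full text
    for i, text in enumerate(run_texts):
        run_end = run_start + len(text)
        if first is None and start < run_end:
            first, offset = i, start - run_start   # the match begins in this run
        if first is not None and end <= run_end:
            return first, i, offset                # the match ends in this run
        run_start = run_end
    return None


# The five runs from the example above, as plain strings.
example_runs = ['Hello there! ', 'How ', 'are', ' you?', ' What is your name?']
print(runs_spanned(example_runs, 'How are you?'))   # → (1, 3, 0)
```

With real runs you would pass [run.text for run in paragraph.runs]; the result shows that no single run contains the search term, which is why a per-run replace finds nothing.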

Here is the code to produce a test presentation with a text box containing the exact paragraph runs as given in my example above:

from pptx import Presentation
from pptx.dml.color import RGBColor
from pptx.util import Pt

# create presentation with 1 slide ------
prs = Presentation()
slide = prs.slides.add_slide(prs.slide_layouts[5])
textbox_shape = slide.shapes.add_textbox(Pt(200),Pt(200),Pt(30),Pt(240))
text_frame = textbox_shape.text_frame
p = text_frame.paragraphs[0]
font = p.font
font.name = 'Arial'
font.size = Pt(10)
font.bold = False
font.italic = False
font.color.rgb = RGBColor(0,0,0)

run = p.add_run()
run.text = 'Hello there! '

run = p.add_run()
run.text = 'How '
font = run.font
font.italic = True
font.bold = True

run = p.add_run()
run.text = 'are'
font = run.font
font.italic = True
font.bold = True
font.size = Pt(16)

run = p.add_run()
run.text = ' you?'
font = run.font
font.italic = True
font.bold = True

run = p.add_run()
run.text = ' What is your name?'
run.font.italic = True

prs.save('text-01.pptx')

And this is what it looks like if you open it in PowerPoint:

[Image: the created presentation slide]

Now if you use this code on it:

from pptx import Presentation

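def replace_text(replacements, shapes):
    for shape in shapes:
        if shape.has_text_frame:
            text_frame = shape.text_frame
            for (match, replacement) in replacements.items():
                # an empty match would be "found" everywhere, so skip it
                if match and text_frame.text.find(match)>=0:
                    for paragraph in text_frame.paragraphs:
                        # Search the concatenated run texts rather than paragraph.text:
                        # paragraph.text also holds a \v for every soft break, which
                        # belongs to no run and would shift the position.
                        pos = ''.join(run.text for run in paragraph.runs).find(match)
                        while pos>=0:
                            replace_runs_text(paragraph.runs, pos, len(match), replacement)
                            # Continue behind the replacement, so that a replacement
                            # containing the match itself cannot loop forever.
                            pos = ''.join(run.text for run in paragraph.runs).find(match, pos+len(replacement))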

def replace_runs_text(runs, pos, match_len, replacement):
    cnt = len(runs)
    i = 0
    while i<cnt:
        olen = len(runs[i].text)
        if pos<olen:
            # we found the run, where the match starts!
            to_replace = replacement
            repl_len = len(to_replace)

            while i<cnt:
                run = runs[i]
                otext = run.text
                olen = len(otext)
                if pos+match_len < olen:
                    # our match ends before the end of the text of this run therefore
                    # we put the rest of our replacement string here and we are done!
                    run.text = otext[0:pos]+to_replace+otext[pos+match_len:]
                    return
                if pos+match_len == olen:
                    # our match ends together with the text of this run therefore
                    # we put the rest of our replacement string here and we are done!
                    run.text = otext[0:pos]+to_replace
                    return
                # we still haven't found all of our original match string
                # so we process what we have here and go on to the next run
                part_match_len = olen-pos
                ntext = otext[0:pos]
                if repl_len <= part_match_len:
                    # we now found at least as many characters for our match string
                    # as we have replacement characters for it. Thus we use up the
                    # the rest of our replacement string here and will replace the
                    # remainder of the match with an empty string (which happens
                    # to happen in this exact same spot for the next run ;-))
                    ntext += to_replace
                    repl_len = 0
                    to_replace = ''
                else:
                    # we have got some more match characters but still more
                    # replacement characters than match characters found 
                    ntext += to_replace[0:part_match_len]
                    to_replace = to_replace[part_match_len:]
                    repl_len -= part_match_len
                run.text = ntext            # save the new text to the run
                match_len -= part_match_len # this is what is left to match
                pos = 0                     # in the next run, we start at pos 0 with our match
                i += 1                      # and off to the next run
        else:
            pos -= olen # the relative position of our match in the next run's text
            i += 1      # and off to the next run
            

# open the test presentation created above
prs = Presentation('text-01.pptx')

# what is to be replaced
replacements = { 'How are you?': "I'm fine!" }

# loop through all slides and replace text in all their shapes
for slide in prs.slides:
    replace_text(replacements, slide.shapes)

# save changed presentation
prs.save('text-02.pptx')

The resulting presentation looks like this:

[Image: the changed presentation]

As you can see, it mapped the replacement string exactly onto the existing font configurations. Thus, if your match and its replacement have the same length, the replacement string will retain the exact format of the match.
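The mapping can be checked without opening a presentation. The sketch below models the runs as plain strings and distributes the replacement over them in the same spirit; replace_across is a hypothetical, simplified stand-in written for illustration, not the function used above and not a python-pptx API.

```python
def replace_across(run_texts, match, replacement):
    """Replace the first occurrence of `match` in the concatenated run texts.

    Returns a new list with the same number of entries, so each piece of the
    replacement stays in the run (and thus the formatting) it overlaps.
    """
    start = "".join(run_texts).find(match)
    if start < 0:
        return list(run_texts)
    end = start + len(match)
    result, run_start, todo = [], 0, replacement
    for text in run_texts:
        run_end = run_start + len(text)
        lo, hi = max(start, run_start), min(end, run_end)   # overlap with this run
        if lo >= hi:
            result.append(text)                  # run is not touched by the match
        else:
            # The run in which the match ends takes whatever replacement is left;
            # earlier runs take as many characters as they lose.
            piece = todo if hi == end else todo[:hi - lo]
            todo = todo[len(piece):]
            result.append(text[:lo - run_start] + piece + text[hi - run_start:])
        run_start = run_end
    return result


example_runs = ['Hello there! ', 'How ', 'are', ' you?', ' What is your name?']
print(replace_across(example_runs, 'How are you?', "I'm fine!"))
# → ['Hello there! ', "I'm ", 'fin', 'e!', ' What is your name?']
```

The output shows why the formatting only lines up neatly for equal lengths: here the three characters fin land in the run that used to hold the enlarged are.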

One important side note: if the text frame has auto-size or fit-frame switched on, even all that work won't save you from messing up the formatting when the text needs more or less space after the replacement.
