Search and Replace Text in Word Documents Using the .NET Framework

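Searching and replacing text in a Word document is a common task in .NET applications. This article outlines several ways to do it and then shows how to search and replace text in a Word document using only the .NET Framework (no third-party code). Following the implementation details requires a basic knowledge of WordprocessingML.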

Details

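If Word Automation is an option (it requires MS Word to be installed), find and replace can be implemented with the API that Word Interop provides. Another approach is to read the whole main part of the DOCX file (document.xml) as a string and run find and replace on that string. This simple approach may be sufficient, but it breaks down when the searched text is not the value of a single XML element. For example, consider a DOCX file that contains the words "Hello" and "World".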

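The main part of the document may hold the two words in separate elements, for example in two different Run elements:

    Hello
    World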

Another possibility is that both words sit in the same Run element but in separate child elements:

    Hello
    World

So the text to search for in a Word document can span multiple elements, and the search has to take that into account.
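The problem can be reproduced in a few lines. The following is a minimal sketch, assuming a hypothetical paragraph in which "Hello World" is split across two runs (the markup, the bold property, and the class name SplitTextDemo are illustrative assumptions, not taken from a real document). It compares a search over the raw markup with a search over the joined text element values.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class SplitTextDemo
{
    private static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Hypothetical paragraph markup: "Hello World" split across two runs.
    public const string ParagraphXml =
        "<w:p xmlns:w=\"http://schemas.openxmlformats.org/wordprocessingml/2006/main\">" +
        "<w:r><w:rPr><w:b/></w:rPr><w:t xml:space=\"preserve\">Hello </w:t></w:r>" +
        "<w:r><w:t>World</w:t></w:r>" +
        "</w:p>";

    // Naive approach: search the raw markup string.
    public static bool FoundInRawXml(string find)
    {
        return ParagraphXml.Contains(find);
    }

    // Element-aware approach: join the values of all w:t elements, then search.
    public static bool FoundInJoinedText(string find)
    {
        XElement paragraph = XElement.Parse(ParagraphXml);
        string text = string.Concat(paragraph.Descendants(W + "t").Select(t => t.Value));
        return text.Contains(find);
    }
}
```

Searching for "Hello World" fails on the raw markup because tags sit between the two words, while the search over the joined text succeeds.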

Implementation

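The Word document is opened and represented by a FlatDocument object. This object reads the document parts (such as the body, headers, footers, comments, and so on) and stores them as a collection of XDocument objects. FlatDocument also creates a set of FlatTextRange objects, which represent the searchable portions of the document's text content (one FlatTextRange can represent a paragraph, a hyperlink, and so on). Each FlatTextRange contains FlatText objects that hold indexed text content (FlatText.StartIndex and FlatText.EndIndex give the position of the FlatText's text within the text of its FlatTextRange).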
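The indexing idea can be illustrated with a small sketch. The types IndexedText and IndexedTextRange below are hypothetical stand-ins for FlatText and FlatTextRange, whose real members are not shown in this article; in particular, treating EndIndex as inclusive is an assumption.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for FlatText: a piece of text plus its position in the range.
public sealed class IndexedText
{
    public IndexedText(string text, int startIndex)
    {
        Text = text;
        StartIndex = startIndex;
    }

    public string Text { get; private set; }
    public int StartIndex { get; private set; }

    // Assumption: EndIndex is inclusive (the index of the last character).
    public int EndIndex { get { return StartIndex + Text.Length - 1; } }
}

// Hypothetical stand-in for FlatTextRange: consecutive pieces under one parent.
public sealed class IndexedTextRange
{
    private readonly List<IndexedText> texts = new List<IndexedText>();
    private int length;

    public IList<IndexedText> Texts { get { return texts; } }

    // Appends a piece of text; it starts where the previous piece ended.
    public void Add(string text)
    {
        texts.Add(new IndexedText(text, length));
        length += text.Length;
    }
}
```

Adding "Hello " and then "World" gives the first piece the positions 0 to 5 and the second piece the positions 6 to 10, so a match found in the joined text can be mapped back to the pieces it overlaps.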

Opening the Word document:


    public sealed class FlatDocument : IDisposable
    {
        public FlatDocument(string path) : this(File.Open(path, FileMode.Open, FileAccess.ReadWrite)) { }
        public FlatDocument(Stream stream)
        {
            this.documents = XDocumentCollection.Open(stream);
            this.ranges = new List<FlatTextRange>();
            this.CreateFlatTextRanges();
        }
        // ...
    }

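Iterate over the Run elements of the supported document parts (body, headers, footers, comments, endnotes, and footnotes, which are loaded as XDocument objects) and create the FlatTextRange and FlatText instances: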


    public sealed class FlatDocument : IDisposable
    {
        private void CreateFlatTextRanges()
        {
            foreach (XDocument document in this.documents)
            {
                FlatTextRange currentRange = null;
                foreach (XElement run in document.Descendants(FlatConstants.RunElementName))
                {
                    if (!run.HasElements) continue;

                    FlatText flatText = FlattenRunElement(run);
                    if (flatText == null) continue;

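                    // If the current Run does not have the same parent (paragraph, hyperlink, etc.) as the
                    // current range, create a new FlatTextRange; otherwise reuse the current one.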
                    if (currentRange == null || currentRange.Parent != run.Parent)
                        currentRange = this.CreateFlatTextRange(run.Parent);
                    currentRange.AddFlatText(flatText);
                }
            }
        }
        // ...
    }

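Flatten the Run element: a single Run element is split into multiple consecutive Run elements, each with a single content child element (optionally preceded by a RunProperties child element). A FlatText object is then created from the flattened Run element: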


    public sealed class FlatDocument : IDisposable
    {
        private static FlatText FlattenRunElement(XElement run)
        {
            XElement[] childs = run.Elements().ToArray();
            XElement runProperties = childs[0].Name == FlatConstants.RunPropertiesElementName ? childs[0] : null;
            int childCount = childs.Length;
            int flatChildCount = 1 + (runProperties != null ? 1 : 0);

            // Break the current Run into multiple Run elements, each with one child element,
            // or two child elements if it has a RunProperties element as its first child.
            while (childCount > flatChildCount)
            {
                XElement child = childs[childCount - 1];
                run.AddAfterSelf(new XElement(FlatConstants.RunElementName, runProperties != null ? new XElement(runProperties) : null, new XElement(child)));
                child.Remove();
                --childCount;
            }

            XElement remainingChild = childs[childCount - 1];
            return remainingChild.Name == FlatConstants.TextElementName ? new FlatText(remainingChild) : null;
        }
        // ...
    }
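To see the effect of flattening on concrete markup, here is a small self-contained sketch. It is not the article's FlattenRunElement; RunSplitter and SplitRuns are hypothetical names, and the sample paragraph used with it is an assumption. It splits every run of a paragraph so that each resulting run keeps a copy of the run properties and exactly one content child.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class RunSplitter
{
    public static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Splits each w:r so that it holds at most one content child (plus w:rPr, if any).
    public static XElement SplitRuns(string paragraphXml)
    {
        XElement paragraph = XElement.Parse(paragraphXml);
        foreach (XElement run in paragraph.Elements(W + "r").ToList())
        {
            XElement props = run.Element(W + "rPr");
            List<XElement> content = run.Elements().Where(e => e != props).ToList();

            // Walk backwards so that each new run lands right after the original
            // run and the document order of the content is preserved.
            for (int i = content.Count - 1; i >= 1; i--)
            {
                content[i].Remove();
                run.AddAfterSelf(new XElement(W + "r",
                    props != null ? new XElement(props) : null, // copy of the formatting
                    content[i]));
            }
        }
        return paragraph;
    }
}
```

For a run that contains run properties, the text "Hello", a tab element, and the text "World", the result is three consecutive runs, each with its own copy of the run properties.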

Perform find and replace on the FlatTextRange instances:


    public sealed class FlatDocument : IDisposable
    {
        public void FindAndReplace(string find, string replace)
        {
            this.FindAndReplace(find, replace, StringComparison.CurrentCulture);
        }
        public void FindAndReplace(string find, string replace, StringComparison comparisonType)
        {
            this.ranges.ForEach(range => range.FindAndReplace(find, replace, comparisonType));
        }
        // ...
    }
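FlatTextRange.FindAndReplace itself is not listed in this article. The following is a minimal sketch of how a match that spans several text elements could be replaced; RangeReplacer is a hypothetical helper, not the article's implementation. It assumes the range is given as an ordered list of text elements and that a match has the same length as the search string, which holds for ordinal comparisons.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class RangeReplacer
{
    // Replaces every occurrence of 'find' in the joined values of 'texts'.
    // Returns the number of replacements made.
    public static int Replace(IList<XElement> texts, string find, string replace,
                              StringComparison comparison)
    {
        if (string.IsNullOrEmpty(find)) return 0;
        replace = replace ?? string.Empty;
        int count = 0;
        int searchFrom = 0;
        while (true)
        {
            string joined = string.Concat(texts.Select(t => t.Value));
            int start = joined.IndexOf(find, searchFrom, comparison);
            if (start < 0) return count;
            int end = start + find.Length; // exclusive end of the match

            int offset = 0;
            bool replacementWritten = false;
            foreach (XElement text in texts)
            {
                string value = text.Value;
                int segStart = offset;
                int segEnd = offset + value.Length;
                offset = segEnd;
                if (segEnd <= start || segStart >= end) continue; // no overlap with the match

                // Cut the matched part out of this element; the replacement goes
                // into the first overlapping element only.
                int from = Math.Max(start, segStart) - segStart;
                int to = Math.Min(end, segEnd) - segStart;
                string insert = replacementWritten ? string.Empty : replace;
                text.Value = value.Substring(0, from) + insert + value.Substring(to);
                // Keep leading and trailing spaces significant.
                text.SetAttributeValue(XNamespace.Xml + "space", "preserve");
                replacementWritten = true;
            }
            searchFrom = start + replace.Length; // continue after the inserted text
            count++;
        }
    }
}
```

Because the replacement is written into the first overlapping text element, it picks up the formatting of the run where the match begins.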

Finally, FlatDocument.Dispose saves the XDocument parts and closes the Word document.

Usage

The following sample code shows how to use FlatDocument:


    class Program
    {
        static void Main(string[] args)
        {
            // Open the Word file.
            using (var flatDocument = new FlatDocument("Sample.docx"))
            {
                // Search and replace the document's text content.
                flatDocument.FindAndReplace("Hello World", "New Value 1");
                flatDocument.FindAndReplace("Foo Bar", "New Value 2");
                // ...
                // Save the Word file (done by Dispose when the using block ends).
            }
        }
    }

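An alternative to the above algorithm is to split a single Run element into multiple consecutive Run elements, each with one child element (as above), except that here each child element contains only a single character. For the text "Hello", the resulting Run elements hold the following text values:

    H
    e
    l
    l
    o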

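These elements are then iterated to look for a sequence of matching characters. The details and an implementation of this approach can be found in the following article: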

Search and Replace Text in an Open XML WordprocessingML Document
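The single-character approach can be sketched roughly as follows. This is an illustration only, not the code from the referenced article or from Open XML PowerTools; SingleCharRuns and its methods are hypothetical names, and the sketch assumes runs that have already been flattened to at most one text element each.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class SingleCharRuns
{
    public static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Replaces each run holding multi-character text with one run per character.
    public static void Split(XElement paragraph)
    {
        foreach (XElement run in paragraph.Elements(W + "r").ToList())
        {
            XElement text = run.Element(W + "t");
            if (text == null || text.Value.Length <= 1) continue;
            XElement props = run.Element(W + "rPr");
            List<XElement> charRuns = text.Value.Select(c => new XElement(W + "r",
                props != null ? new XElement(props) : null,
                new XElement(W + "t",
                    new XAttribute(XNamespace.Xml + "space", "preserve"),
                    c.ToString()))).ToList();
            run.ReplaceWith(charRuns);
        }
    }

    // Finds the first sequence of single-character runs that spells 'find',
    // writes 'replace' into the first of them, and removes the rest.
    public static bool ReplaceFirst(XElement paragraph, string find, string replace)
    {
        if (string.IsNullOrEmpty(find)) return false;
        List<XElement> runs = paragraph.Elements(W + "r").ToList();
        for (int i = 0; i + find.Length <= runs.Count; i++)
        {
            int matched = 0;
            while (matched < find.Length &&
                   TextOf(runs[i + matched]) == find[matched].ToString())
                matched++;
            if (matched < find.Length) continue;

            runs[i].Element(W + "t").Value = replace;
            for (int j = 1; j < find.Length; j++) runs[i + j].Remove();
            return true;
        }
        return false;
    }

    private static string TextOf(XElement run)
    {
        XElement text = run.Element(W + "t");
        return text != null ? text.Value : null;
    }
}
```

After the replacement, adjacent runs with identical formatting could be merged again to keep the markup compact; that step is left out here.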

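This approach is in fact used in Open XML PowerTools (the TextReplacer class). The problem with both algorithms, however, is that they do not work for content that spans multiple paragraphs. In that case, the content of the entire Word document has to be flattened for the search to succeed.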

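GemBox.Document is a .NET component for working with Word files. Through its ContentRange class, it exposes the document's content model hierarchy as flat content, which makes it possible to search for content that spans multiple paragraphs. For details, see the following article: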

Find and Replace in Word with C# or VB.NET

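With this approach, practically any content can be found and replaced with any desired content (including tables, pictures, paragraphs, HTML-formatted text, RTF-formatted text, and so on).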

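Currently, the replacement text takes the formatting found at the beginning of the matched text. A FindAndReplace overload that accepts the desired formatting could be added, for example FlatDocument.FindAndReplace(string find, string replace, TextFormat format). When a format is supplied, a new RunProperties element has to be created from it.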
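As an illustration of that last step, the sketch below builds a RunProperties element from a format object. TextFormat is not defined anywhere in this article, so the shape used here (Bold, Italic, Color) is purely an assumption, as is the RunPropertiesBuilder helper.

```csharp
using System.Xml.Linq;

// Hypothetical format description; the article only names the type.
public sealed class TextFormat
{
    public bool Bold { get; set; }
    public bool Italic { get; set; }
    public string Color { get; set; } // hex RGB such as "FF0000", or null for no color
}

public static class RunPropertiesBuilder
{
    public static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Creates a w:rPr element that reflects the given format.
    public static XElement Create(TextFormat format)
    {
        var runProperties = new XElement(W + "rPr");
        if (format.Bold) runProperties.Add(new XElement(W + "b"));
        if (format.Italic) runProperties.Add(new XElement(W + "i"));
        if (!string.IsNullOrEmpty(format.Color))
            runProperties.Add(new XElement(W + "color", new XAttribute(W + "val", format.Color)));
        return runProperties;
    }
}
```

The resulting element would replace the RunProperties copy in the run that receives the replacement text.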

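Currently, special characters in the search and replacement text (tabs, line breaks, non-breaking hyphens, and so on) are not taken into account. To handle them, FlatText would need to be aware of the different element types that FlatText.textElement can be (such as the tab, line break, and non-breaking hyphen elements) and return the appropriate FlatText.Text value for each.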
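One way to picture this is a lookup from the element type to the text it stands for. The sketch below is an assumption about how such a mapping might look, not part of the article's FlatText class; SpecialCharacterText is a hypothetical helper, and the choice of characters (for instance U+2011 for a non-breaking hyphen) is illustrative.

```csharp
using System.Xml.Linq;

public static class SpecialCharacterText
{
    public static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Returns the text a run content element stands for, or null if the
    // element type is not handled by this sketch.
    public static string GetText(XElement element)
    {
        if (element.Name == W + "t") return element.Value;           // ordinary text
        if (element.Name == W + "tab") return "\t";                  // tab character
        if (element.Name == W + "br") return "\n";                   // break, treated as a line break here
        if (element.Name == W + "noBreakHyphen") return "\u2011";    // non-breaking hyphen
        return null;
    }
}
```

With such a mapping, a search string containing "\t" could match a tab element that sits between two text elements. A complete version would also have to distinguish page and column breaks from plain line breaks.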
