首页 文章

是否存在Boyer-Moore字符串搜索和快速搜索和替换功能以及Delphi 2010 String(UnicodeString)的快速字符串计数?

提问于
浏览
18

我需要三个快速大字符串函数:快速搜索,快速搜索和替换,以及字符串中子字符串的快速计数 .

我已经在C和Python中遇到了Boyer-Moore字符串搜索,但是用于实现快速搜索和替换的唯一Delphi Boyer-Moore算法是由Peter Morris(前身为DroopyEyes软件)及其网站的FastStrings的一部分 . 和电子邮件不再有效 .

我已经移植了FastStrings forward以便在Delphi 2009/2010中为AnsiStrings工作,其中一个字节等于一个AnsiChar,但是它们也可以在Delphi 2010中使用String(UnicodeString),这看起来并不重要 .

使用这个Boyer-Moore算法,应该可以轻松地进行不区分大小写的搜索,以及不区分大小写的搜索和替换,没有任何临时字符串(使用StrUpper等),并且不调用比Boyer更慢的Pos()需要在重复搜索同一文本时进行摩尔搜索 .

(编辑:我有一个部分解决方案,作为这个问题的答案,它几乎100%完成,它甚至有一个快速的字符串替换功能 . 我相信它必须有bug,特别认为,因为它假装是Unicode由于未实现的Unicode承诺,必须存在故障 . )

(编辑2:有趣和意外的结果;堆栈上的unicode代码点表的大堆栈大小 - 下面的代码中的SkipTable严重阻碍了你在unicode字符串boyer中可以做的双赢优化的数量 - 摩尔字符串搜索 . 感谢Florent Ouchet指出我应该立即注意到的内容 . )

2 回答

  • 1

    这个答案现在已经完成并且适用于区分大小写的模式,但不适用于不区分大小写的模式,并且可能还有其他错误,因为它没有经过单元测试,并且可能会进一步优化,例如我重复了本地函数__SameChar而不是使用本来更快的比较函数回调,实际上,允许用户传递所有这些的比较函数对于想要提供一些额外逻辑的Unicode用户来说非常好(对于某些语言,等效的Unicode字形集合) .

    基于Dorin Dominica的代码,我构建了以下内容 .

    { _FindStringBoyer:
      Boyer-Moore search algorith using regular String instead of AnsiSTring, and no ASM.
      Credited to Dorin Duminica.
    }
    function _FindStringBoyer(const sString, sPattern: string;
      const bCaseSensitive: Boolean = True; const fromPos: Integer = 1): Integer;
    
        function __SameChar(StringIndex, PatternIndex: Integer): Boolean;
        begin
          if bCaseSensitive then
            Result := (sString[StringIndex] = sPattern[PatternIndex])
          else
            Result := (CompareText(sString[StringIndex], sPattern[PatternIndex]) = 0);
        end; // function __SameChar(StringIndex, PatternIndex: Integer): Boolean;
    
    var
      SkipTable: array [Char] of Integer;
      LengthPattern: Integer;
      LengthString: Integer;
      Index: Integer;
      kIndex: Integer;
      LastMarker: Integer;
      Large: Integer;
      chPattern: Char;
    begin
      if fromPos < 1 then
        raise Exception.CreateFmt('Invalid search start position: %d.', [fromPos]);
      LengthPattern := Length(sPattern);
      LengthString := Length(sString);
      for chPattern := Low(Char) to High(Char) do
        SkipTable[chPattern] := LengthPattern;
      for Index := 1 to LengthPattern -1 do
        SkipTable[sPattern[Index]] := LengthPattern - Index;
      Large := LengthPattern + LengthString + 1;
      LastMarker := SkipTable[sPattern[LengthPattern]];
      SkipTable[sPattern[LengthPattern]] := Large;
      Index := fromPos + LengthPattern -1;
      Result := 0;
      while Index <= LengthString do begin
        repeat
          Index := Index + SkipTable[sString[Index]];
        until Index > LengthString;
        if Index <= Large then
          Break
        else
          Index := Index - Large;
        kIndex := 1;
        while (kIndex < LengthPattern) and __SameChar(Index - kIndex, LengthPattern - kIndex) do
          Inc(kIndex);
        if kIndex = LengthPattern then begin
          // Found, return.
          Result := Index - kIndex + 1;
          Index := Index + LengthPattern;
          exit;
        end else begin
          if __SameChar(Index, LengthPattern) then
            Index := Index + LastMarker
          else
            Index := Index + SkipTable[sString[Index]];
        end; // if kIndex = LengthPattern then begin
      end; // while Index <= LengthString do begin
    end;
    
    { Written by Warren, using the above code as a starter, we calculate the SkipTable once, and then count the number of instances of
      a substring inside the main string, at a much faster rate than we
      could have done otherwise.  Another thing that would be great is
      to have a function that returns an array of find-locations,
      which would be way faster to do than repeatedly calling Pos.
    }
    function _StringCountBoyer(const aSourceString, aFindString : String; Const CaseSensitive : Boolean = TRUE) : Integer;
    var
      foundPos:Integer;
      fromPos:Integer;
      Limit:Integer;
      guard:Integer;
      SkipTable: array [Char] of Integer;
      LengthPattern: Integer;
      LengthString: Integer;
      Index: Integer;
      kIndex: Integer;
      LastMarker: Integer;
      Large: Integer;
      chPattern: Char;
        function __SameChar(StringIndex, PatternIndex: Integer): Boolean;
        begin
          if CaseSensitive then
            Result := (aSourceString[StringIndex] = aFindString[PatternIndex])
          else
            Result := (CompareText(aSourceString[StringIndex], aFindString[PatternIndex]) = 0);
        end; // function __SameChar(StringIndex, PatternIndex: Integer): Boolean;
    
    begin
      result := 0;
      foundPos := 1;
      fromPos := 1;
      Limit := Length(aSourceString);
      guard := Length(aFindString);
      Index := 0;
      LengthPattern := Length(aFindString);
      LengthString := Length(aSourceString);
      for chPattern := Low(Char) to High(Char) do
        SkipTable[chPattern] := LengthPattern;
      for Index := 1 to LengthPattern -1 do
        SkipTable[aFindString[Index]] := LengthPattern - Index;
      Large := LengthPattern + LengthString + 1;
      LastMarker := SkipTable[aFindString[LengthPattern]];
      SkipTable[aFindString[LengthPattern]] := Large;
      while (foundPos>=1) and (fromPos < Limit) and (Index<Limit) do begin
    
        Index := fromPos + LengthPattern -1;
        if Index>Limit then
            break;
        kIndex := 0;
        while Index <= LengthString do begin
          repeat
            Index := Index + SkipTable[aSourceString[Index]];
          until Index > LengthString;
          if Index <= Large then
            Break
          else
            Index := Index - Large;
          kIndex := 1;
          while (kIndex < LengthPattern) and __SameChar(Index - kIndex, LengthPattern - kIndex) do
            Inc(kIndex);
          if kIndex = LengthPattern then begin
            // Found, return.
            //Result := Index - kIndex + 1;
            Index := Index + LengthPattern;
            fromPos := Index;
            Inc(Result);
            break;
          end else begin
            if __SameChar(Index, LengthPattern) then
              Index := Index + LastMarker
            else
              Index := Index + SkipTable[aSourceString[Index]];
          end; // if kIndex = LengthPattern then begin
        end; // while Index <= LengthString do begin
    
      end;
    end;
    

    这真是一个很好的算法,因为:

    • 以这种方式计算字符串Y中子串X的实例的速度更快,非常好 .

    • 仅仅替换Pos(),_FindStringBoyer()比FastCode项目人员为Delphi贡献的纯asm版本更快,目前用于Pos,如果你需要不区分大小写,你可以想象当我们没有那么大的时候,性能提升了 . 但是,高效算法仍然是美的东西 . )

    好吧,我用Boyer-Moore风格写了一个String Replace:

    function _StringReplaceBoyer(const aSourceString, aFindString,aReplaceString : String; Flags: TReplaceFlags) : String;
    var
      errors:Integer;
      fromPos:Integer;
      Limit:Integer;
      guard:Integer;
      SkipTable: array [Char] of Integer;
      LengthPattern: Integer;
      LengthString: Integer;
      Index: Integer;
      kIndex: Integer;
      LastMarker: Integer;
      Large: Integer;
      chPattern: Char;
      CaseSensitive:Boolean;
      foundAt:Integer;
      lastFoundAt:Integer;
      copyStartsAt:Integer;
      copyLen:Integer;
        function __SameChar(StringIndex, PatternIndex: Integer): Boolean;
        begin
          if CaseSensitive then
            Result := (aSourceString[StringIndex] = aFindString[PatternIndex])
          else
            Result := (CompareText(aSourceString[StringIndex], aFindString[PatternIndex]) = 0);
        end; // function __SameChar(StringIndex, PatternIndex: Integer): Boolean;
    
    begin
      result := '';
      lastFoundAt := 0;
      fromPos := 1;
      errors := 0;
      CaseSensitive := rfIgnoreCase in Flags;
      Limit := Length(aSourceString);
      guard := Length(aFindString);
      Index := 0;
      LengthPattern := Length(aFindString);
      LengthString := Length(aSourceString);
      for chPattern := Low(Char) to High(Char) do
        SkipTable[chPattern] := LengthPattern;
      for Index := 1 to LengthPattern -1 do
        SkipTable[aFindString[Index]] := LengthPattern - Index;
      Large := LengthPattern + LengthString + 1;
      LastMarker := SkipTable[aFindString[LengthPattern]];
      SkipTable[aFindString[LengthPattern]] := Large;
      while (fromPos>=1) and (fromPos <= Limit) and (Index<=Limit) do begin
    
        Index := fromPos + LengthPattern -1;
        if Index>Limit then
            break;
        kIndex := 0;
        foundAt := 0;
        while Index <= LengthString do begin
          repeat
            Index := Index + SkipTable[aSourceString[Index]];
          until Index > LengthString;
          if Index <= Large then
            Break
          else
            Index := Index - Large;
          kIndex := 1;
          while (kIndex < LengthPattern) and __SameChar(Index - kIndex, LengthPattern - kIndex) do
            Inc(kIndex);
          if kIndex = LengthPattern then begin
    
    
            foundAt := Index - kIndex + 1;
            Index := Index + LengthPattern;
            //fromPos := Index;
            fromPos := (foundAt+LengthPattern);
            if lastFoundAt=0 then begin
                    copyStartsAt := 1;
                    copyLen := foundAt-copyStartsAt;
            end else begin
                    copyStartsAt := lastFoundAt+LengthPattern;
                    copyLen := foundAt-copyStartsAt;
            end;
    
            if (copyLen<=0)or(copyStartsAt<=0) then begin
                    Inc(errors);
            end;
    
            Result := Result + Copy(aSourceString, copyStartsAt, copyLen ) + aReplaceString;
            lastFoundAt := foundAt;
            if not (rfReplaceAll in Flags) then
                     fromPos := 0; // break out of outer while loop too!
            break;
          end else begin
            if __SameChar(Index, LengthPattern) then
              Index := Index + LastMarker
            else
              Index := Index + SkipTable[aSourceString[Index]];
          end; // if kIndex = LengthPattern then begin
        end; // while Index <= LengthString do begin
      end;
      if (lastFoundAt=0) then
      begin
         // nothing was found, just return whole original string
          Result := aSourceString;
      end
      else
      if (lastFoundAt+LengthPattern < Limit) then begin
         // the part that didn't require any replacing, because nothing more was found,
         // or rfReplaceAll flag was not specified, is copied at the
         // end as the final step.
        copyStartsAt := lastFoundAt+LengthPattern;
        copyLen := Limit; { this number can be larger than needed to be, and it is harmless }
        Result := Result + Copy(aSourceString, copyStartsAt, copyLen );
      end;
    
    end;
    

    好的,问题:这个堆栈的足迹:

    var
      skiptable : array [Char] of Integer;  // 65536*4 bytes stack usage on Unicode delphi
    

    再见CPU地狱,你好堆栈地狱 . 如果我使用动态数组,那么我必须在运行时调整它的大小 . 所以这个东西基本上很快,因为你的计算机上的虚拟内存系统不会在堆栈上以256K闪烁,但这并不总是最佳的代码片段 . 尽管如此,我的电脑并没有像这样大堆栈的东西眨眼 . 它不会成为Delphi标准库的默认值,也不会在未来赢得任何快速代码挑战 . 我认为重复搜索是上述代码应该写成类的情况,并且skiptable应该是该类中的数据字段 . 然后你可以构建一次boyer-moore表,并且随着时间的推移,如果字符串是不变的,则重复使用该对象进行快速查找 .

  • 11

    由于我只是在寻找相同的内容:Jedi JCL在jclUnicode.pas中使用Boyer-Moore获得了一个知识点识别搜索引擎 . 我不知道它有多好或有多快 .

相关问题