C#中利用Markup Service实现HTML解析为DOM Tree

一个轻量级Parsing 实现。这个代码不会从网上下载任何资料,也不会执行任何脚本,纯属Parsing。
Parsing是通过MSHTML的Markup Service实现的。要正确使用这个代码,需要添加MSHTML引用。
由于.net中没有定义IPersistStreamInt接口,就必须自己实现,接口定义:
| 以下内容为程序代码:

[ComVisible(true), ComImport(), Guid("7FD52380-4E07-101B-AE2D-08002B2EC713 " ) , InterfaceTypeAttribute(ComInterfaceType.InterfaceIsIUnknown)]
public interface IPersistStreamInit
{
void GetClassID([In, Out] ref Guid pClassID);
[return: MarshalAs(UnmanagedType.I4)] [PreserveSig]
int IsDirty();
void Load([In, MarshalAs(UnmanagedType.Interface)] UCOMIStream pstm);
void Save([In, MarshalAs(UnmanagedType.Interface)] UCOMIStream pstm,
[In, MarshalAs(UnmanagedType.I4)] int fClearDirty);
void GetSizeMax([Out, MarshalAs(UnmanagedType.LPArray)] long pcbSize);
void InitNew();
}


| 以下内容为程序代码:

unsafe IHTMLDocument2 Parse(string s)
{
IHTMLDocument2 pDocument=new HTMLDocumentClass();
if(pDocument!=null)
{
IPersistStreamInit pPersist=pDocument as IPersistStreamInit ;
pPersist.InitNew();
pPersist=null;
IMarkupServices ms=pDocument as IMarkupServices ;
if(ms!=null)
{
IMarkupContainer pMC=null;
IMarkupPointer pStart,pEnd;
ms.CreateMarkupPointer(out pStart);
ms.CreateMarkupPointer(out pEnd);
StringBuilder sb=new StringBuilder(s);
IntPtr pSource=Marshal.StringToHGlobalUni(s);
ms.ParseString(ref (ushort)pSource.ToPointer(),0,out pMC,pStart,pEnd);
if(pMC!=null)
{
Marshal.Release(pSource);
return pMC as IHTMLDocument2;
}
Marshal.Release(pSource);
}
}
return null;
}


写代码的时候出了一点问题,IMarkupService::ParseString第一个参数是ref ushort,显然要传入HTML代码,这个ushort必须是第一个WideChar了,所以这里通过使用不安全代码来绕过编译器警告。

Published At
Categories with Web编程
Tagged with
comments powered by Disqus