Given a URL, the code below extracts the links from that web page. It is straightforward and relies mainly on regular expressions. See the example:
http://search.csdn.net/Expert/topic/2131/2131209.xml?temp=.4868585
The code for GetUrl.aspx is as follows:
<%@ Page Language="vb" CodeBehind="GetUrl.aspx.vb" AutoEventWireup="false" Inherits="aspxWeb.GetUrl" %>
<html>
<head>
    <meta content="text/html; charset=utf-8" http-equiv="content-type"/>
</head>
<body>
    <form id="Form1" method="post" runat="server">
        <p>
            <asp:label id="Label1" runat="server"></asp:label>
            <asp:textbox id="urlTextBox" runat="server" width="336px">http://lucky_elove.www1.dotnetplayground.com/</asp:textbox>
            <asp:button id="scrapeButton" onclick="scrapeButton_Click" runat="server"></asp:button>
        </p>
        <hr size="1" width="100%"/>
        <p>
            <asp:label id="TipResult" runat="server"></asp:label>
            <asp:textbox height="400" id="resultLabel" runat="server" textmode="MultiLine" width="100%"></asp:textbox>
        </p>
    </form>
</body>
</html>
The code-behind file GetUrl.aspx.vb is as follows:
Imports System.IO
Imports System.Net
Imports System.Text
Imports System.Text.RegularExpressions
Imports System
Public Class GetUrl
    Inherits System.Web.UI.Page
    Protected WithEvents Label1 As System.Web.UI.WebControls.Label
    Protected WithEvents urlTextBox As System.Web.UI.WebControls.TextBox
    Protected WithEvents scrapeButton As System.Web.UI.WebControls.Button
    Protected WithEvents TipResult As System.Web.UI.WebControls.Label
    Protected WithEvents resultLabel As System.Web.UI.WebControls.TextBox
#Region " Web Form Designer Generated Code "
    'This call is required by the Web Form Designer.
    <System.Diagnostics.DebuggerStepThrough()> Private Sub InitializeComponent()

    End Sub

    Private Sub Page_Init(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles MyBase.Init
        'CODEGEN: This method call is required by the Web Form Designer.
        'Do not modify it using the code editor.
        InitializeComponent()
    End Sub

#End Region

    Private Sub Page_Load(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles MyBase.Load
        'Put user code to initialize the page here
        Label1.Text = "Enter a URL:"
        scrapeButton.Text = "Extract href links"
    End Sub
    Private report As New StringBuilder()
    Private webPage As String
    Private countOfMatches As Int32

    Public Sub scrapeButton_Click(ByVal sender As System.Object, ByVal e As System.EventArgs)
        webPage = GrabUrl()
        Dim myDelegate As New MatchEvaluator(AddressOf MatchHandler)

        'Match anchors whose href is neither an absolute http:// URL nor a mailto: link.
        'The closing quote after the named group keeps the lazy quantifier from capturing only one character.
        Dim linksExpression As New Regex( _
            "\<a.+?href=['""](?!http\:\/\/)(?!mailto\:)(?<foundAnchor>[^'"">]+?)['""][^>]*?\>", _
            RegexOptions.Multiline Or RegexOptions.IgnoreCase Or RegexOptions.IgnorePatternWhitespace)

        'MatchHandler is called once per match and returns the replacement text.
        Dim newWebPage As String = linksExpression.Replace(webPage, myDelegate)

        TipResult.Text = "<h2>Href links extracted from " & urlTextBox.Text & "</h2>" & _
            "<b>Found and cleaned up " & countOfMatches.ToString() & " links</b><br/><br/>" & _
            report.ToString().Replace(Environment.NewLine, "<br/>")
        TipResult.Text &= "<h2>Cleaned-up page</h2><script>window.document.title='Extract links from a web page'</script>"
        resultLabel.Text = newWebPage
    End Sub

    Public Function MatchHandler(ByVal m As Match) As String
        Dim link As String = m.Groups("foundAnchor").Value
        'Search backwards from the match for line starts to work out its row and column.
        Dim rToL As New Regex("^", RegexOptions.Multiline Or RegexOptions.RightToLeft)
        Dim col, row As Int32
        Dim lineBegin As Int32 = rToL.Match(webPage, m.Index).Index

        row = rToL.Matches(webPage, m.Index).Count
        col = m.Index - lineBegin

        report.AppendFormat( _
            "Link <b>{0}</b>, fixed at row: {1}, col: {2}{3}", _
            Server.HtmlEncode(m.Groups(0).Value), _
            row, _
            col, _
            Environment.NewLine _
        )
        'Strip a leading "/" from the link.
        Dim newLink As String
        If link.StartsWith("/") Then
            newLink = link.Substring(1)
        Else
            newLink = link
        End If

        countOfMatches += 1
        Return m.Groups(0).Value.Replace(link, newLink)
    End Function

    Private Function GrabUrl() As String
        'Download the page at the URL entered in the text box.
        Dim wc As New WebClient()
        Dim s As Stream = wc.OpenRead(urlTextBox.Text)
        Dim sr As StreamReader = New StreamReader(s, System.Text.Encoding.Default)
        GrabUrl = sr.ReadToEnd
        s.Close()
        wc.Dispose()
    End Function

End Class
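
The replace-with-callback idea used by scrapeButton_Click and MatchHandler can be sketched in a few lines of Python. This is only an illustration under assumptions, not the .NET code above: the function name strip_leading_slash is hypothetical, the pattern is an adaptation of the one in the listing, and the row and column are counted from newline characters rather than with a right-to-left regex.

```python
import re

# Adapted pattern (an assumption, not the original .NET regex): anchors whose
# href is neither an absolute http:// URL nor a mailto: link.
LINK_RE = re.compile(
    r"""<a.+?href=['"](?!http://)(?!mailto:)(?P<foundAnchor>[^'">]+?)['"][^>]*?>""",
    re.IGNORECASE,
)


def strip_leading_slash(html):
    """Return (rewritten_html, report); report holds (link, row, col) tuples."""
    report = []

    def handler(m):
        link = m.group("foundAnchor")
        # Row is 1-based, column is 0-based, both derived from newline positions.
        row = html.count("\n", 0, m.start()) + 1
        col = m.start() - (html.rfind("\n", 0, m.start()) + 1)
        report.append((link, row, col))
        new_link = link[1:] if link.startswith("/") else link
        # Replace only the first occurrence so other attributes stay untouched.
        return m.group(0).replace(link, new_link, 1)

    return LINK_RE.sub(handler, html), report


page = '<p>\n  <a href="/a/b.htm">x</a> <a href="http://e.com/">y</a>\n</p>'
rewritten, found = strip_leading_slash(page)
print(rewritten)
print(found)
```

The callback receives each match and returns its replacement, which is the same role MatchEvaluator plays in the VB.NET listing.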
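
The following regular expression validates an email address: ^[_.0-9a-z-]+@([0-9a-z][0-9a-z-]+\.)+[a-z]{2,3}$
In this regular expression, "+" means the preceding element occurs one or more times in a row; "^" means the following element must appear at the start of the string, and "$" means the preceding element must appear at the end; "\." is a literal ".", where "\" is the escape character; "{2,3}" means the preceding element may occur 2 to 3 times in a row; "()" groups its contents so that they are treated as a single unit; "[_.0-9a-z-]" matches any one character among "_", ".", "-", the letters a to z, and the digits 0 to 9.
The regular expression can therefore be read as follows:
"what follows must be at the start (^)", "the character must be one of "_", ".", "-", a letter from a to z, or a digit from 0 to 9 ([_.0-9a-z-])", "that character occurs at least once (+)", "@", "a string that starts with a letter from a to z or a digit from 0 to 9, followed by at least one character that is "-", a letter from a to z, or a digit from 0 to 9, and that ends with "." (([0-9a-z][0-9a-z-]+\.))", "that group occurs at least once (+)", "a letter from a to z occurs 2 to 3 times and ends the string ([a-z]{2,3}$)"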
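
To see what the pattern accepts and rejects, here is a small Python sketch. The helper name looks_like_email and the sample addresses are made up for illustration; the pattern itself is the one quoted above.

```python
import re

# The email pattern from the text; note it is lowercase-only and limits
# the final label to 2 or 3 letters.
EMAIL_RE = re.compile(r"^[_.0-9a-z-]+@([0-9a-z][0-9a-z-]+\.)+[a-z]{2,3}$")


def looks_like_email(text):
    """Return True when text matches the pattern from start to end."""
    return EMAIL_RE.match(text) is not None


for sample in ["john_doe@mail.example.com", "a@b.com",
               "user@example.info", "User@example.com"]:
    print(sample, looks_like_email(sample))
```

Only the first sample matches: each domain label needs at least two characters, the last label is capped at three letters, and uppercase letters are outside the character classes.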
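
(?:...) means match but do not capture. Without it, extra groups are captured, which wastes resources.
?<1> names the group, meaning the capture can be referenced as $1.
To get more out of regular expressions, see the following link; hopefully it is useful:
http://expert.csdn.net/Expert/TopicView1.asp?id=1410423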
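
The effect of capturing versus non-capturing groups is easy to check in Python. This is a sketch under one assumption worth stating: Python's re module writes named groups as (?P<name>...) and backreferences as \1, whereas the .NET syntax discussed above is (?<1>...) and $1. The sample text is made up.

```python
import re

text = 'href="a.htm"'

# Two capturing groups: the attribute name is captured as well.
capturing = re.search(r'(href|src)="([^"]+)"', text)
# (?:...) groups the alternation without capturing it.
non_capturing = re.search(r'(?:href|src)="([^"]+)"', text)
# A named group can be read back by name instead of by position.
named = re.search(r'(?:href|src)="(?P<url>[^"]+)"', text)
# Numbered groups can be referenced in the replacement text.
rewritten = re.sub(r'(href|src)="([^"]+)"', r'\1="/base/\2"', text)

print(capturing.groups())      # two groups
print(non_capturing.groups())  # one group
print(named.group("url"))
print(rewritten)
```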
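
At a glance, the only difference from href\s*=\s*("[^"]"|\S+)
is that that pattern also matches the quotes when they are present, and it likewise provides $1.
The pattern above captures only the content inside the quotes; it is a little more complex but more practical.

http://search.csdn.net/Expert/topic/1450/1450366.xml?temp=.8075525
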
http://search.csdn.net/Expert/topic/1895/1895427.xml?temp=.2321894
For example:
The current page is http://www.cccc.com/aa/bb/cc/dd/ee.htm and it contains the following code:

(1) href="../../../df/gov.htm"
should become: href="http://www.cccc.com/aa/df/gov.htm"

(2) href="../../special_index.htm"
should become: href="http://www.cccc.com/aa/bb/special_index.htm"

(3) href="/index.htm" class='white'>频道主页
should become: href="http://www.cccc.com/index.htm" class='white'>频道主页

(4) <img src="/myexe/wind_chromeless_2.1.js"/>
should become: <img src="http://www.cccc.com/myexe/wind_chromeless_2.1.js"/>

(5) background-image: url(images/index/sameA.jpg) [inside CSS code]
should become: background-image: url(http://www.cccc.com/aa/bb/cc/dd/images/index/sameA.jpg)

(6) url="mailto:[email protected]" must not be converted.

The real headache is that web pages are not always well-formed. Some attribute values are quoted (with single or double quotes), some are not quoted at all, and some even nest single and double quotes. The conversion still has to leave everything intact, otherwise the page breaks.
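
For comparison, the expected results in cases (1) to (6) match what a standard URL resolver produces. The Python sketch below uses urllib.parse.urljoin from the standard library; the helper name absolutize and the mailto address are assumptions made for the example, and this is not the approach taken in the thread.

```python
from urllib.parse import urljoin

BASE = "http://www.cccc.com/aa/bb/cc/dd/ee.htm"


def absolutize(link, base=BASE):
    """Resolve link against base; links with their own scheme are returned as-is."""
    return urljoin(base, link.strip())


for link in ["../../../df/gov.htm",
             "../../special_index.htm",
             "/index.htm",
             "images/index/sameA.jpg",
             "mailto:someone@example.com"]:  # hypothetical address
    print(link, "->", absolutize(link))
```

This only covers resolving a link once it has been extracted; finding the attribute values in badly quoted HTML is the harder part the question is about.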

1. First determine the root of http://www.cccc.com/aa/bb/cc/dd/ee.htm: http://www.cccc.com

2. Determine the directory of http://www.cccc.com/aa/bb/cc/dd/ee.htm: http://www.cccc.com/aa/bb/cc/dd

3. Prefix every path like href="/ or src="/ with http://www.cccc.com

4. Prefix every path like href="???.htm or src="???.jpg with http://www.cccc.com/aa/bb/cc/dd/

5. For something like href="../../../df/gov.htm", count the ../ segments. There are 3, which means going up 3 levels from http://www.cccc.com/aa/bb/cc/dd. Reverse the string http://www.cccc.com/aa/bb/cc/dd, find the position of the 3rd /, take the characters from that position onward, and reverse them again to get http://www.cccc.com/aa/. Then take everything after the 3rd / in href="../../../df/gov.htm" and append it to http://www.cccc.com/aa/.

6. Handle ../ and ../../ the same way as in step 5.

7. For url(???/??.jpg), simply insert http://www.cccc.com/aa/bb/cc/dd/ after url(.
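
The steps above can be sketched by hand in Python as follows. This is a minimal illustration under assumptions: the function name resolve is hypothetical, only http:// page URLs are considered, and the path is split on "/" instead of using the reverse-the-string trick described in the steps.

```python
def resolve(page_url, link):
    """Turn a relative link found on page_url into an absolute URL."""
    # Site root, e.g. http://www.cccc.com
    scheme_end = page_url.index("://") + 3
    slash = page_url.find("/", scheme_end)
    root = page_url if slash == -1 else page_url[:slash]
    # Directory of the page, e.g. /aa/bb/cc/dd (no trailing slash)
    path = "" if slash == -1 else page_url[slash:]
    directory = path.rsplit("/", 1)[0]

    # Links that must not be converted.
    if link.startswith(("http://", "mailto:", "javascript:", "#")):
        return link
    # Root-relative links only need the site root in front.
    if link.startswith("/"):
        return root + link

    # Count the leading ../ segments and strip them off.
    ups = 0
    while link.startswith("../"):
        link = link[3:]
        ups += 1
    parts = directory.split("/")        # ['', 'aa', 'bb', 'cc', 'dd']
    keep = max(1, len(parts) - ups)     # never climb above the site root
    return root + "/".join(parts[:keep]) + "/" + link


page = "http://www.cccc.com/aa/bb/cc/dd/ee.htm"
print(resolve(page, "../../../df/gov.htm"))
print(resolve(page, "images/index/sameA.jpg"))
```

Clamping at the site root when there are too many ../ segments is a design choice of this sketch; a page with such links is malformed either way.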

Using a regular expression:
// Requires: using System.Text.RegularExpressions;
// The attribute names are grouped so the alternation applies only to href/src.
string pattern = @"(?:href|src)\s*=\s*[""'](?<url>[^""']+)[""']";
Regex r = new Regex(pattern, RegexOptions.Compiled | RegexOptions.IgnoreCase);
for (Match m = r.Match(YourHtmlPageString); m.Success; m = m.NextMatch())
{
    string url = m.Result("${url}");
    // Process this URL
}
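
The same loop can be tried quickly in Python. This is a sketch, with quoted_urls as a hypothetical helper name and a made-up HTML string; like the C# pattern, it only sees quoted attribute values.

```python
import re

# href or src, optional whitespace around "=", then a quoted value.
ATTR_RE = re.compile(r"""(?:href|src)\s*=\s*["'](?P<url>[^"']+)["']""",
                     re.IGNORECASE)


def quoted_urls(html):
    """Return every quoted href/src value, in document order."""
    return [m.group("url") for m in ATTR_RE.finditer(html)]


sample = '<a HREF = "a.htm">x</a><img src=\'/b.jpg\'><a href=c.htm>'
print(quoted_urls(sample))
```

The unquoted href=c.htm in the sample is skipped, which is exactly the gap the question complains about.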

Solved.
<%@ Page Language="VB" debug="true" %>
<%@ Import Namespace="System.Net" %>
<%@ Import Namespace="System.IO" %>
<script language="VB" runat="server">
Sub Page_load(sender as Object,E as EventArgs)
    If IsPostBack=False Then
        dim strUrl as string
        strUrl=Request.QueryString("Url")
        if strUrl="" then
            strUrl=trim(Request.Params("Url"))
        end if
        strUrl=strUrl.TrimEnd("/")
        ' response.write(strUrl & "<br>")
        if strUrl<>Nothing And strUrl.StartsWith("http://") then
            Dim wc As New System.Net.WebClient()
            Dim html As String = Encoding.default.GetString(wc.DownloadData(strUrl))
            ' Response.Write(html)
            Dim strRegEx as String

            strRegEx="\b(href|src|url|background)=((""|')?\s*([^\>\s]*?)\2?(\s)|([^>]*?>))"
            html=RegExLinks(strRegEx,html,strUrl)

            ' strRegEx="\b(href|src|background)=(""|')?\s*([^\>\s]*?)\2?(\s)"
            ' html=RegExLinks(strRegEx,html,strUrl)
            ' strRegEx="\b(href|src|background)\s*=\s*(""|')?\s*([^>\s]*?)\2?\/?>"
            ' html=RegExLinks(strRegEx,html,strUrl)
            Response.write(html)

        end if
    End If
End Sub

Function RegExLinks(ByVal strRegEx as string,ByVal html as string,ByVal strUrl as string)
    dim arrLink() as String
    dim firstquot,lastquot as string
    dim strOldFullLink,strOldLink,strNewFullLink,strNewLink as String
    dim strLink as String
    dim strSpace as String
    dim objRegEx as RegEx
    Dim objMatch as Match
    Dim objMatchCollection as MatchCollection
    objRegEx=New RegEx(strRegEx,RegexOptions.IgnoreCase or RegexOptions.Multiline)
    objMatchCollection=objRegEx.Matches(html)
    For Each objMatch in objMatchCollection
        strLink=objMatch.value
        Erase arrLink
        arrLink=strLink.split("=")
        'When the link looks like http://www.domain.com/news.asp?date=200306&keyword=news&page=2, UBound>=2;
        'in that case no trailing space is appended, otherwise the result is wrong
        if UBound(arrLink)<2 then
            strSpace=" "
        else
            strSpace=""
        end if
        if arrLink(1).StartsWith("""") then
            'Value wrapped in double quotes
            strOldFullLink=arrLink(1)
            if arrLink(1).LastIndexOf("""")>1 then
                if arrLink(1).EndsWith(">") then
                    arrLink(1)=arrLink(1).TrimEnd(">")
                    lastquot=""">"
                else
                    lastquot=""""
                end if
            end if
            strOldLink=arrLink(1).replace("""","")
            firstquot=""""
            strNewLink=DoLinks(strUrl,strOldLink)
            strNewFullLink=firstquot & trim(strNewLink) & trim(lastquot)
            ' response.write("Before: double quotes " & strOldFullLink & "<br>")
            ' response.write("After: double quotes <font color='red'>" & strNewFullLink & "</font><br>")
        elseif arrLink(1).StartsWith("'") then
            'Value wrapped in single quotes
            strOldFullLink=arrLink(1)
            if arrLink(1).LastIndexOf("'")>1 then
                if arrLink(1).EndsWith(">") then
                    arrLink(1)=arrLink(1).TrimEnd(">")
                    lastquot="'>"
                else
                    lastquot="'"
                end if
            end if
            strOldLink=arrLink(1).replace("'","")
            firstquot="'"
            strNewLink=DoLinks(strUrl,strOldLink)
            strNewFullLink=firstquot & trim(strNewLink) & trim(lastquot)
            ' response.write("Before: single quotes " & strOldFullLink & "<br>")
            ' response.write("After: single quotes <font color='red'>" & strNewFullLink & "</font><br>")
        else
            'Unquoted value
            strOldFullLink=arrLink(0) & "=" & arrLink(1)
            ' strOldFullLink=arrLink(1)
            strOldLink=arrLink(1)
            strNewLink=DoLinks(strUrl,strOldLink)
            strNewFullLink=arrLink(0) & "=" & trim(strNewLink)
            ' strNewFullLink=trim(strNewLink)
            ' response.write("Before: no quotes " & strOldFullLink & "<br>")
            ' response.write("After: no quotes <font color='red'>" & strNewFullLink & "</font><br>")
        end if
        html=html.Replace(strOldFullLink,trim(strNewFullLink) & strSpace)

        firstquot=nothing
        lastquot=nothing
        strOldFullLink=nothing
        strNewFullLink=nothing
    Next
    RegExLinks=html
End Function

Function DoLinks(byVal strUrl as string,byVal strTempLink as string)
    dim objRegExSite as RegEx
    objRegExSite=New RegEx("http://[^/]+",RegexOptions.IgnoreCase)
    dim strSite as string
    strSite=trim(objRegExSite.Match(strUrl).value.ToString)
    dim strLinkF as String
    dim strUrlF as String
    strUrlF=strUrl.Replace(strSite,"")
    dim arrDir() as String
    dim iDirLen as integer
    if strUrlF.indexOf("/")>=0 then
        arrDir=strUrlF.split("/")
        iDirLen=arrDir.length
        strUrlF=strUrlF.Replace(arrDir(iDirLen-1),"")
    end if

    dim k,j as Integer
    dim objMatchColF as MatchCollection
    dim objRegExF as RegEx
    if strTempLink.ToLower.StartsWith("javascript:") or strTempLink.ToLower.StartsWith("mailto:") or strTempLink.ToLower.StartsWith("#") or _
        strTempLink.ToLower.StartsWith("http://") or strTempLink.ToLower.StartsWith("www.") then
        strLinkF=strTempLink
    elseif strTempLink.StartsWith("../") then
        'VB strings do not treat "\" as an escape character, so a single backslash is enough to escape each dot.
        objRegExF=New RegEx("\.\./")
        objMatchColF=objRegExF.Matches(strTempLink)
        j=objMatchColF.Count
        'If the number of "../" segments plus one exceeds the directory depth of the page URL,
        'the page itself is wrong, so point to the root-level link.
        if isArray(arrDir) then
            if Ubound(arrDir)<j+1 then
                j=Ubound(arrDir)-1
            end if
            for k=j-1 to 0 step -1
                strUrlF=trim(strUrlF.Remove(strUrlF.LastIndexOf(arrDir(iDirLen-2-k)),len(arrDir(iDirLen-2-k))+1))
            next
        end if
        dim strEnd as String
        strEnd=trim(strTempLink.Replace("../",""))
        strLinkF=strSite.subString(0,len(strSite)) & strUrlF & strEnd
    elseif strTempLink.StartsWith("./") then
        ' http://www.southcn.com/news/china
        ' ./todaycn/200306260529.htm
        strLinkF=strUrl & strTempLink.Replace("./","/")
    elseif strTempLink.StartsWith("/") then
        strLinkF=strSite & strTempLink
    else
        if strUrlF="" then
            strUrlF="/"
        end if
        strLinkF=strSite & strUrlF & strTempLink
    end if
    DoLinks=strLinkF
End Function

</script>
<html>
<body>
</body>
</html>

go to
http://www.regexlib.com/search.aspx

enter "links" in the keyword textbox and click on the Search button

or try

173using System.Text.RegularExpressions;
174
175string str = "............";
176
177Regex re = new Regex(@"<a[^>]+href=\s*(?:'(?<href>[^']+)'|""(?<href>[^""]+)""|(?<href>[^>\s]+))\s*[^>]*>", RegexOptions.IgnoreCase | RegexOptions.Singleline);
178
179MatchCollection mc = re.Matches(str);
180Console.WriteLine(mc.Count);
181foreach (Match m in mc)
182Console.WriteLine(m.Groups["href"].Value);
183}</href></href></href></a[^></html></url></a.+?href=['""](?!http\:\></system.diagnostics.debuggerstepthrough()>