My web scraper with asynchronous web requests and visual proxy availability detection.
On this page I describe a small part of one of my large desktop applications. The application as a whole is designed for web scraping, but it has one small, interesting part, which is described below.
Most sites limit the number of requests per IP address and block scraping that comes from a single address. A web scraper therefore needs some way to change its IP address. One workable method is to change the address manually whenever the current one is blocked. There are several ways to do this: one is to use AdvOR (first screenshot below), another good choice is a VPN (second screenshot below).
My application takes another route. Many sites publish proxy lists (for example https://free-proxy-list.net/), and this fragment of my program is a visual checker that shows how a site behaves through one proxy or another from such a list.
The proxy address and port are entered in this field of my application; after that, press the "check proxy" button on the right.
You can select a proxy to check from the list already stored in the application, or add a new one at any time.
A separate window with a browser then pops up. If the proxy does not work in your environment, you can press "delete proxy from list".
Otherwise the proxy works well and you can add it to your current list.
My visual checker (which is really a web browser) has some other features, such as adding a new site to the list of sites used for checking proxy availability.
After you refresh the current proxy list, you can set in the application how many proxies to use for web scraping. This is the common pattern for small add-ons in my web scraper; this one is a visual checker that tests a large external proxy list against my current environment.
Below I describe some code fragments of this add-on. The last three controls on the first form of the application (contained in Toolstrip4) are named ToolStripLabel6, ProxyIP and CheckProxyButton.
The code that handles the events of these three controls is shown in the screenshot below.
The application as a whole is very large. Lines 369 to 396 run the ProxyChecker class, passing the proxy's IP:PORT as a parameter, and refresh the combo box list (if the proxy is added to the ProxyTabs table).
...
372: Dim ProxyChecker1 As New ProxyChecker("Http://" & OnlyIpPort & "/")
373: AddHandler ProxyChecker1.RefreshProxyList, AddressOf ProxyIP_Refresh
374: ProxyChecker1.Go()
...
391: ProxyIP.Items.Add(One.URL.ToLower.Replace("http://", "").Replace("/", "") & " (" & One.CrDate.ToString("dd.MM.yyyy HH:mm:ss", System.Globalization.CultureInfo.InvariantCulture) & ")")
...
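To make line 391 easier to follow, here is a minimal console sketch of how a stored proxy URL and its creation date turn into one combo box entry. The helper name FormatProxyEntry, the proxy address and the timestamp are made up for illustration; only the string operations are taken from the fragment.

```vbnet
Imports System
Imports System.Globalization

Module ProxyLabelSketch
    ' Hypothetical helper: mirrors the string built on line 391.
    Function FormatProxyEntry(ByVal url As String, ByVal created As Date) As String
        ' Strip the scheme and the slashes so that only IP:PORT remains.
        Dim ipPort As String = url.ToLower().Replace("http://", "").Replace("/", "")
        Return ipPort & " (" & created.ToString("dd.MM.yyyy HH:mm:ss", CultureInfo.InvariantCulture) & ")"
    End Function

    Sub Main()
        ' Made-up proxy address and timestamp, for illustration only.
        Console.WriteLine(FormatProxyEntry("Http://10.0.0.1:8080/", New Date(2020, 1, 2, 3, 4, 5)))
        ' prints: 10.0.0.1:8080 (02.01.2020 03:04:05)
    End Sub
End Module
```

The same Replace chain is used on line 22 of ProxyChecker to build the window title, so the IP:PORT text looks the same in both places.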
The main-form code (lines 372 to 391) is only the environment that runs the ProxyChecker class. The code of the class itself is shown below.
This is the most important class in this fragment of my application, because it contains all the handlers for the events of the popup form VisualSiteCheckerForm, which contains the web browser.
1: Public Class ProxyChecker
2: Inherits Wcf_Client
3:
4: Public Event RefreshProxyList()
5:
6: Public Property Checked As Boolean
7: Public Property Full_ProxyURL As String
8: Public Property ResponseEncode As Wcf_Client.PostRequestEncode
9: Public Property VisualSiteCheckerForm As Global.Freelancer.VisualSiteChecker1
10:
11: Public Sub New(ByVal IpAddr As String)
12: Full_ProxyURL = IpAddr
13: ResponseEncode = Wcf_Client.PostRequestEncode.UTF8
14: End Sub
15:
16: Public Sub Go()
17: VisualSiteCheckerForm = New Global.Freelancer.VisualSiteChecker1
18: VisualSiteCheckerForm.IsPageCorrectCallBack = AddressOf IsPageOK
19: AddHandler VisualSiteCheckerForm.GetHTML, AddressOf ReadHTMLSync
20: AddHandler VisualSiteCheckerForm.GetHTMLAsync, AddressOf ReadHTMLASync
21: VisualSiteCheckerForm.DeleteProxy = AddressOf DelProxy
22: VisualSiteCheckerForm.Title = Full_ProxyURL.ToLower.Replace("http://", "").Replace("/", "")
23: VisualSiteCheckerForm.Show()
24: End Sub
25:
26: Public Sub DelProxy()
27: Dim db1 = New ParserDBDataContext
28: Dim CurProxy = (From X In db1.ProxyTabs Select X Where X.URL = Full_ProxyURL).ToList
29: If CurProxy.Count > 0 Then
30: db1.ProxyTabs.DeleteOnSubmit(CurProxy(0))
31: db1.SubmitChanges()
32: RaiseEvent RefreshProxyList()
33: End If
34: End Sub
35:
36: Public Sub IsPageOK(ByVal yesno As Boolean)
37: Checked = yesno
38: If yesno Then
39: Dim db1 As New ParserDBDataContext
40:
41: db1.ProxyTabs.InsertOrUpdateTable(Function(e) e.URL = Full_ProxyURL,
42: New ProxyTab With {.CrDate = Now, .URL = Full_ProxyURL},
43: Sub(e) e.CrDate = Now)
44: LoadForm.ProxyIP_Refresh()
45: End If
46: End Sub
47:
48: Public Sub ReadHTMLSync(ByVal URL As String, ByRef HTML As String)
49: HTML = GetRequestStrAsync(URL, ResponseEncode, Full_ProxyURL)
50: End Sub
51:
52:
53: Private WithEvents backgroundWorker1 As System.ComponentModel.BackgroundWorker
54:
55: 'start on the main thread
56: Public Sub ReadHTMLASync(ByVal URL As String)
57: backgroundWorker1 = New System.ComponentModel.BackgroundWorker
58: backgroundWorker1.RunWorkerAsync(URL)
59: End Sub
60:
61: 'this part runs on another thread
62: Private Sub BackgroundWorker1_DoWork(ByVal sender As System.Object, ByVal e As System.ComponentModel.DoWorkEventArgs) Handles backgroundWorker1.DoWork
63: HTML1 = GetRequestStrAsync(e.Argument, ResponseEncode, Full_ProxyURL)
64: End Sub
65:
66: Dim HTML1 As String
67:
68: 'finish back on the main thread
69: Private Sub BackgroundWorker1_RunWorkerCompleted(ByVal sender As System.Object, ByVal e As System.ComponentModel.RunWorkerCompletedEventArgs) Handles backgroundWorker1.RunWorkerCompleted
70: VisualSiteCheckerForm.ShowAsyncHtmlResult.Invoke(HTML1)
71: End Sub
72:
73: End Class
74:
But to understand how the ProxyChecker class works, we first need to look at the form VisualSiteCheckerForm with the web browser. The form contains several controls; their names can be seen in the screenshot below.
The code of the form VisualSiteCheckerForm is shown below. It uses TheArtOfDev.HtmlRenderer (https://github.com/ArthurHub/HTML-Renderer) to display HTML, and the TestURLs table to store the URLs used for checking proxies.
Public Class VisualSiteChecker1
    Public Event GetHTML(ByVal URL As String, ByRef HTML As String)
    Public Event GetHTMLAsync(ByVal URL As String)

    Delegate Function GetRequestStrDelegate(ByVal RequestEncoding As Wcf_Client.PostRequestEncode, ByVal URL As String, ByVal ResponseEncoding As Wcf_Client.PostRequestEncode, ByVal Full_ProxyURL As String) As String
    Delegate Sub IsCorrect(ByVal yes_no As Boolean)
    Delegate Sub DelProxy()
    Delegate Sub ShowHtmlResult(ByVal Html As String)

    Public Property Title As String
    Public Property HTML As String = ""
    Public Property IsPageCorrectCallBack As IsCorrect
    Public Property DeleteProxy As DelProxy
    Public Property ShowAsyncHtmlResult As ShowHtmlResult
    Property HtmlPanel As TheArtOfDev.HtmlRenderer.WinForms.HtmlPanel
    Dim db1 As ParserDBDataContext

    Private Sub VisualSiteChecker_Load(ByVal sender As Object, ByVal e As System.EventArgs) Handles Me.Load
        Me.Text &= " " & Title
        HtmlPanel = New TheArtOfDev.HtmlRenderer.WinForms.HtmlPanel
        HtmlPanel.Dock = DockStyle.Fill
        ToolStripContainer1.ContentPanel.Controls.Add(HtmlPanel)
        ShowAsyncHtmlResult = AddressOf ShowHtmlHandler
        NavigateURL_refresh()
    End Sub

    Private Sub GoButton_Click(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles GoButton.Click
        If NavigateURL.Text <> "" Then
            HtmlPanel.Refresh()
            HtmlPanel.Text = ""
            Try
                RaiseEvent GetHTML(NavigateURL.Text, HTML)
                LenHtml.Text = Len(HTML).ToString & " chars"
                HtmlPanel.Text = HTML
            Catch ex As Exception
                HtmlPanel.Text = ex.Message
                IsPageCorrectCallBack.Invoke(False)
            End Try
        End If
    End Sub

    Private Sub GoAsyncButton_Click(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles GoAsyncButton.Click
        If NavigateURL.Text <> "" Then
            HtmlPanel.Refresh()
            HtmlPanel.Text = ""
            Try
                RaiseEvent GetHTMLAsync(NavigateURL.Text)
            Catch ex As Exception
                HtmlPanel.Text = ex.Message
                IsPageCorrectCallBack.Invoke(False)
            End Try
        End If
    End Sub

    Sub ShowHtmlHandler(ByVal Html As String)
        HtmlPanel.Text = Html
        HtmlPanel.Refresh()
    End Sub

    Private Sub OkButton_Click(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles OkButton.Click
        IsPageCorrectCallBack.Invoke(True)
        Me.Close()
    End Sub

    Private Sub DelButton_Click(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles DelButton.Click
        DeleteProxy.Invoke()
        Me.Close()
    End Sub

    Sub NavigateURL_refresh()
        NavigateURL.Items.Clear()
        db1.GetContext(True)
        Dim X = (From Z In db1.TestURLs Select Z Order By Z.i).ToList
        For Each One As TestURL In X
            NavigateURL.Items.Add(One.URL)
        Next
    End Sub

    Private Sub DeleteURL_Click(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles DeleteURL.Click
        Dim X = (From Z In db1.TestURLs Select Z Where Z.URL = NavigateURL.Text).ToList
        If X.Count > 0 Then
            db1.TestURLs.DeleteOnSubmit(X(0))
            db1.SubmitChanges()
        End If
        NavigateURL_refresh()
    End Sub

    Private Sub AddUrl_Click(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles AddUrl.Click
        db1.TestURLs.InsertOnSubmit(New TestURL With {.URL = NavigateURL.Text})
        db1.SubmitChanges()
        NavigateURL_refresh()
    End Sub
End Class
This code also invokes four external delegates and raises two events. The synchronous GetHTML event makes little practical sense, because the whole application (including the main form) freezes and stays blocked for a long time while waiting for the proxy server to answer. It is really only a test method; the only event that is practical in this application is GetHTMLAsync. The diagram below lists the external connections of the form VisualSiteCheckerForm: four delegates and two events (plus one input field, the form title).
Now let us return to the ProxyChecker class above. It contains only the handlers that process events from the form VisualSiteCheckerForm. But this is not a simple direct connection! All the code of ProxyChecker runs on the same thread as the form VisualSiteCheckerForm, except lines 62-63, which run on another thread.
To understand the application better, we need to look deeper into the ProxyChecker class. It inherits from my large old class WCF_CLIENT (a web service client), which I wrote before 2010. I am not publishing all of that code now; we will look only at the main fragments of Wcf_Client that relate to this application.
So, this is the code of that class that relates to this application. In my experience this is about the most difficult asynchronous code there is, but it works!
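The thread hand-off described here can be tried outside the application with a minimal console sketch of the BackgroundWorker pattern. All names, the URL and the fake page are invented for illustration, and Thread.Sleep stands in for the slow proxy request. Note one difference from the real form: a console program has no UI synchronization context, so RunWorkerCompleted is raised on a pool thread here, while in a Windows Forms application it comes back on the UI thread. This sketch also passes the result through e.Result instead of a shared field such as HTML1; that is a variation, not the code of ProxyChecker.

```vbnet
Imports System
Imports System.ComponentModel
Imports System.Threading

Module BackgroundWorkerSketch
    ' Lets Main wait for the worker; needed only because this is a console demo.
    Private ReadOnly Finished As New ManualResetEvent(False)

    Sub Main()
        Dim worker As New BackgroundWorker()
        AddHandler worker.DoWork, AddressOf OnDoWork
        AddHandler worker.RunWorkerCompleted, AddressOf OnCompleted

        Console.WriteLine("Caller thread: " & Thread.CurrentThread.ManagedThreadId.ToString())
        worker.RunWorkerAsync("http://example.com/")   ' returns immediately
        Finished.WaitOne()
    End Sub

    ' Runs on a thread-pool thread, like lines 62-63 of ProxyChecker.
    Private Sub OnDoWork(ByVal sender As Object, ByVal e As DoWorkEventArgs)
        Console.WriteLine("DoWork thread: " & Thread.CurrentThread.ManagedThreadId.ToString())
        Thread.Sleep(200)                              ' stands in for the slow request
        e.Result = "<html>fake page for " & CStr(e.Argument) & "</html>"
    End Sub

    ' Receives the result, or the exception thrown inside DoWork.
    Private Sub OnCompleted(ByVal sender As Object, ByVal e As RunWorkerCompletedEventArgs)
        If e.Error IsNot Nothing Then
            Console.WriteLine("Request failed: " & e.Error.Message)
        Else
            Console.WriteLine("Result: " & CStr(e.Result))
        End If
        Finished.Set()
    End Sub
End Module
```

Checking e.Error in the completion handler is worth the extra lines: an exception thrown inside DoWork is captured by the worker and delivered there instead of crashing the background thread.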
Public allDone As Threading.ManualResetEvent
Dim BUFFER_SIZE As Integer = 1000000

Public Overridable Function GetRequestStrAsync(ByVal URL As String, Optional ByVal ResponseEncoding As PostRequestEncode = PostRequestEncode.ASCII, Optional ByVal Full_ProxyURL As String = "") As String

    allDone = New Threading.ManualResetEvent(False)

    ' Create may throw System.NotSupportedException: The URI prefix is not recognized.
    Dim Request As Net.HttpWebRequest = CType(Net.WebRequest.Create(URL), Net.HttpWebRequest)
    Request.UserAgent = UserAgent
    Request.Method = "GET"
    If Full_ProxyURL <> "" Then
        ' Route the request through the proxy that is being checked.
        Dim MyProxy As New Net.WebProxy
        MyProxy.Address = New Uri(Full_ProxyURL)
        Request.Proxy = MyProxy
    End If
    Dim RS As RequestState = New RequestState(BUFFER_SIZE, ResponseEncoding)
    ' Put the request into the state so it can be passed around.
    RS.Request = Request

    ' Issue the async request.
    Dim r As IAsyncResult = CType(Request.BeginGetResponse(
        New AsyncCallback(AddressOf RespCallback), RS), IAsyncResult)

    ' Block the calling thread until one of the callbacks sets the
    ' ManualResetEvent (on completion or on error).
    allDone.WaitOne()

    Return RS.ErrorMessage & RS.StringBuilder.ToString
End Function

Sub RespCallback(ByVal ar As IAsyncResult)
    ' Get the RequestState object from the async result.
    Dim rs As RequestState = CType(ar.AsyncState, RequestState)
    Try
        ' Get the HttpWebRequest from RequestState.
        Dim req As Net.HttpWebRequest = rs.Request

        ' Call EndGetResponse, which returns the HttpWebResponse object
        ' that came from the request issued above.
        Dim resp As Net.HttpWebResponse = CType(req.EndGetResponse(ar), Net.HttpWebResponse)

        ' Start reading data from the response stream.
        ' Error observed at this stage: The remote server returned an error: (407) Proxy Authentication Required.
        Dim ResponseStream As IO.Stream = resp.GetResponseStream()

        ' Store the response stream in RequestState to read
        ' the stream asynchronously.
        rs.ResponseStream = ResponseStream

        ' Pass rs.BufferRead to BeginRead. Read data into rs.BufferRead.
        Dim iarRead As IAsyncResult =
            ResponseStream.BeginRead(rs.BufferRead, 0, BUFFER_SIZE,
                New AsyncCallback(AddressOf ReadCallBack), rs)
    Catch ex As Exception
        ' Report the error and release the waiting caller.
        rs.ErrorMessage = ex.Message
        allDone.Set()
    End Try

End Sub

Sub ReadCallBack(ByVal asyncResult As IAsyncResult)
    ' Get the RequestState object from the AsyncResult.
    Dim rs As RequestState = CType(asyncResult.AsyncState, RequestState)

    ' Retrieve the ResponseStream that was set in RespCallback.
    Dim responseStream As IO.Stream = rs.ResponseStream

    ' Find out how many bytes were read into rs.BufferRead.
    Dim read As Integer
    Try
        read = responseStream.EndRead(asyncResult)
    Catch ex As Exception
        ' Release the waiting caller; without this, allDone.WaitOne()
        ' in GetRequestStrAsync would block forever after a failed read.
        rs.ErrorMessage = ex.Message
        allDone.Set()
        Return
    End Try

    If read > 0 Then
        ' Prepare a Char array buffer for converting to Unicode.
        Dim charBuffer(rs.BufferRead.Count) As Char

        ' Convert byte stream to Char array and then String.
        ' len contains the number of characters converted to Unicode.
        Dim len As Integer = _
            rs.StreamDecode.GetChars(rs.BufferRead, 0, read, charBuffer, 0)
        Dim str As String = New String(charBuffer, 0, len)

        ' Append the recently read data to the StringBuilder
        ' object contained in RequestState.
        rs.StringBuilder.Append(str)

        ' Continue reading data until responseStream.EndRead
        ' returns 0 (end of stream).
        Dim ar As IAsyncResult = _
            responseStream.BeginRead(rs.BufferRead, 0, BUFFER_SIZE, _
                New AsyncCallback(AddressOf ReadCallBack), rs)
    Else

        ' Close down the response stream.
        responseStream.Close()

        ' Set the ManualResetEvent so the waiting caller can continue.
        allDone.Set()
    End If

    Return
End Sub
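The blocking step in GetRequestStrAsync is allDone.WaitOne(), which waits without any time limit, so a proxy that never answers holds the calling thread indefinitely. One possible hardening, shown here only as a sketch and not as part of Wcf_Client, is the WaitOne overload that takes a timeout and returns a Boolean. The delays below are arbitrary, and a thread-pool work item stands in for the callback chain.

```vbnet
Imports System
Imports System.Threading

Module WaitTimeoutSketch
    Sub Main()
        Dim done As New ManualResetEvent(False)

        ' Simulates a callback chain that signals completion after 100 ms.
        ThreadPool.QueueUserWorkItem(Sub(state)
                                         Thread.Sleep(100)
                                         done.Set()
                                     End Sub)

        ' WaitOne(timeout) returns True if the event was set, False on timeout.
        If done.WaitOne(TimeSpan.FromSeconds(2)) Then
            Console.WriteLine("completed")
        Else
            Console.WriteLine("timed out")
        End If

        ' An event that is never set: the wait gives up instead of hanging.
        Dim never As New ManualResetEvent(False)
        Console.WriteLine(never.WaitOne(TimeSpan.FromMilliseconds(200)))   ' prints False
    End Sub
End Module
```

In the real request code, a False result would be the place to call Abort on the HttpWebRequest and report the proxy as unresponsive; how long to wait is a tuning choice that depends on the proxies being tested.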
Finally, I show the last small fragment of this application, ProxyReader (which also inherits from Wcf_Client). As you may remember, the good proxy IP:PORT pairs are collected in the ProxyTabs table. This small class (in reality it is large: it contains authentication and a lot of other behavior) works with ProxyTabs and sends each subsequent request through another good proxy.
Public Class ProxyReader
    Inherits Wcf_Client

    Property db1 As ParserDBDataContext
    Property LastProxyCount As Integer
    Property CurrentIndex As Integer
    Property LastProxy As System.Collections.Generic.List(Of ProxyTab)
    Property ReadingErrorCount As Integer

    Public Sub New(ByVal _LastProxyCount As Integer)
        LastProxyCount = _LastProxyCount
        db1 = New ParserDBDataContext
        ' Take the most recently confirmed proxies from the ProxyTabs table.
        LastProxy = (From X In db1.ProxyTabs Select X Order By X.CrDate Descending Take _LastProxyCount).ToList
    End Sub

...
    Function GetRequestStrThruProxy(ByVal URL As String, Optional ByVal ResponseEncoding As PostRequestEncode = PostRequestEncode.UTF8) As String
        Dim HTML As String
        Try
StartRead:
            HTML = GetRequestStrAsync(URL, ResponseEncoding, LastProxy(CurrentIndex).URL)
            Return HTML
        Catch ex As Exception
            ' On failure, switch to the next proxy and retry
            ' until every proxy in the list has been tried.
            ReadingErrorCount += 1
            If ReadingErrorCount < LastProxy.Count Then
                GetNextProxy()
                GoTo StartRead
            Else
                Return Nothing
            End If
        End Try

    End Function

    Sub GetNextProxy()
        ' Stop at the last valid index (Count - 1) and wrap to 0;
        ' comparing with Count would let CurrentIndex run past the list.
        If CurrentIndex < LastProxy.Count - 1 Then
            CurrentIndex += 1
        Else
            CurrentIndex = 0
        End If
    End Sub

End Class
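The rotation idea behind GetNextProxy can be written compactly with the Mod operator, which keeps the index between 0 and Count - 1 by construction. The sketch below is a stand-alone illustration: the function name NextIndex and the proxy addresses are made up, and a plain list of strings replaces the ProxyTab rows.

```vbnet
Imports System
Imports System.Collections.Generic

Module ProxyRotationSketch
    ' Hypothetical helper: next index in a list of the given size, wrapping to 0.
    Function NextIndex(ByVal current As Integer, ByVal count As Integer) As Integer
        If count <= 0 Then Throw New ArgumentException("The proxy list is empty.")
        Return (current + 1) Mod count
    End Function

    Sub Main()
        ' Made-up proxy addresses, for illustration only.
        Dim proxies As New List(Of String) From {
            "http://10.0.0.1:8080/", "http://10.0.0.2:3128/", "http://10.0.0.3:80/"}

        Dim index As Integer = 0
        For attempt As Integer = 1 To 5
            Console.WriteLine(attempt.ToString() & ": " & proxies(index))
            index = NextIndex(index, proxies.Count)
        Next
        ' Attempts 4 and 5 print the first and second proxy again.
    End Sub
End Module
```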
That's it! On this page you have seen fragments of the source code of my real application.