Multithreading Parsers with Parallel, CsQuery, Newtonsoft.Json, OfficeOpenXml and IAsyncResult/AsyncCallback.
- 1. TSL 1.2
- 2. More then two endpoint
- 3. Background thread and sockets read thread
- 4. EPPlus extension function, relative column address and check if columns is exist.
- 5. Timer
- 6. Resizable progressbar.
- 7. Backgroundworker.
- 8. Multithreading by Parallel.
- 9. Control of Parallel count threading
- 10. Shared ThreadSafe Value.
- 11. Full code of started multithreading in DoWork.
- 12. Update WinForms from another thread.
- 13. Parse Json by Newtonsoft.Json package.
- 14. Send Cookie to request.
- 15. Parse HTML by CsQuery with jQuery Selector.
- 16. Parse data in JavaScript by Regular expression or string operations.
- 17. GetRequestStrAsync - require HTML by AsyncCallback.
- 18. Use Profiler to check code performance.
I write parsers many years, first parser I have described in my site is dated by 2009 WebDownloader_UltraLite - ваш личный поисковик по рунету с особыми возможностями поиска.. Time on time I described various part of parsers, for example component to use proxy servers - My web scrapper with asynchronous web request and visual proxy availability detection. In my parsers I use various HTML parse engine, for example Parse HTML by HtmlAgilityPack (Xpath selector) and CsQuery (jQuery selector) or various multithreading engine, for example - Building TreeView by Reactive Extensions NET (Recursive observe directory, Iterator function with Yield, Windows native thread).
In this page I will make a full description of my regular downloader and parser.
1. TSL 1.2
Windows 10 by default has not opportunity to strong SSL-connect, it require special parameters in app.config.
1: <?xml version="1.0" encoding="utf-8" ?>
2: <configuration>
3: <runtime>
4: <AppContextSwitchOverrides value="Switch.System.Net.DontEnableSchUseStrongCrypto=false"/>
...
9: </configuration>
2. More then two endpoint
p>Windows 10 by default has not opportunity more than two output sockets, before you start program "Application.Run(StartForm_Instance)" you need to set "Net.ServicePointManager.DefaultConnectionLimit = 1000". This is string number 36 in code below.3. Background thread and sockets read thread.
Usually threading of parsers look as that - windows form thread, background thread and socket read thread.
If exception is appear in not windows forms thread, program finish silently. Therefore each multithreading program (or program with backgroundworker) need a starter like this. I describe this future many times in my site Trap unhandled exception in windows application.
1: Imports CefSharp
2: Imports CefSharp.WinForms
3: Imports System
4: Imports System.IO
5: Imports System.Runtime.InteropServices
6: Imports System.Windows.Forms
7:
8: Public Module Starter
9:
10: <DllImport("user32.dll")>
11: Private Function ShowWindow(ByVal hWnd As IntPtr, ByVal nCmdShow As Integer) As Boolean
12: End Function
13:
14: <DllImport("Kernel32")>
15: Private Function GetConsoleWindow() As IntPtr
16: End Function
17:
18: Const SW_HIDE As Integer = 0
19: Const SW_SHOW As Integer = 5
20:
21: Public StartForm_Instance As StartForm
22:
23:
24: <STAThread>
25: Public Sub Main()
26:
27:
28: Dim hwnd As IntPtr
29: hwnd = GetConsoleWindow()
30: ShowWindow(hwnd, SW_HIDE)
31:
32: AddHandler Application.ThreadException, New Threading.ThreadExceptionEventHandler(AddressOf Form1_UIThreadException)
33: Application.SetUnhandledExceptionMode(UnhandledExceptionMode.CatchException)
34: AddHandler AppDomain.CurrentDomain.UnhandledException, New UnhandledExceptionEventHandler(AddressOf CurrentDomain_UnhandledException)
35:
36: System.Net.ServicePointManager.DefaultConnectionLimit = 1000
37:
38: StartForm_Instance = New StartForm
39:
40: Application.Run(StartForm_Instance)
41: End Sub
42:
43: Private Sub Form1_UIThreadException(ByVal sender As Object, ByVal t As Threading.ThreadExceptionEventArgs)
44: Dim result As DialogResult = DialogResult.Cancel
45:
46: Try
47: result = ShowThreadExceptionDialog("Fatal Error", t.Exception)
48: If result = DialogResult.Abort Then Application.[Exit]()
49: Catch
50: Try
51: MessageBox.Show("Fatal Windows Forms Error", "UIThreadException", MessageBoxButtons.AbortRetryIgnore, MessageBoxIcon.[Stop])
52: Finally
53: Application.[Exit]()
54: End Try
55: End Try
56: End Sub
57:
58: Private Sub CurrentDomain_UnhandledException(ByVal sender As Object, ByVal e As UnhandledExceptionEventArgs)
59: Dim errorMsg As String = "Fatal Non-UI Error:" & vbLf & vbLf
60: Dim ex As Exception = CType(e.ExceptionObject, Exception)
61: Try
62: Dim myLog As EventLog = New EventLog()
63: myLog.Source = "ThreadException"
64: myLog.WriteEntry(errorMsg & ex.Message & vbLf & vbLf & "Stack Trace:" & vbLf + ex.StackTrace)
65: Catch exc As Exception
66: Try
67: MessageBox.Show("Could not write the error to the event log. Reason: " & exc.Message, "UnhandledException", MessageBoxButtons.OK, MessageBoxIcon.[Stop])
68: Finally
69: Application.[Exit]()
70: End Try
71: End Try
72: End Sub
73:
74: Public Function ShowThreadExceptionDialog(ByVal title As String, ByVal e As Exception) As DialogResult
75: Dim errorMsg As String = "An application error occurred. Please contact the adminstrator " & "with the following information:" & vbLf & vbLf
76: errorMsg = errorMsg & e.Message & vbLf & vbLf & "Stack Trace:" & vbLf + e.StackTrace
77: Return MessageBox.Show(errorMsg, title, MessageBoxButtons.AbortRetryIgnore, MessageBoxIcon.[Stop])
78: End Function
79:
80: End Module
4. EPPlus extension function, relative column address and check if columns is exist.
Take a look to parser code details. Often it start with job, listed in Excel file. There is a standard part of parsers (more detail about EPPlus please see in page Processing Office Open XML (xlsx) files by EPPlus
1: Imports System.IO
2: Imports OfficeOpenXml
3: Imports Newtonsoft.Json
4: Imports Newtonsoft.Json.Linq
5: Imports CsQuery
6: Imports System.ComponentModel
7: Imports System.Timers
8:
9: Public Class StartForm
10:
11:
12: Private Sub StartForm_Load(sender As Object, e As EventArgs) Handles Me.Load
13: Text &= " (" & My.Application.Info.Version.ToString & ")"
14: End Sub
15:
16: Dim ExcelPackage As ExcelPackage
17: Dim FI As FileInfo
18: Dim WorkBook As ExcelWorkbook
19: Dim WorkSheet As ExcelWorksheet
20: Dim ColCount As Integer
21: Dim RowCount As Integer
22: Dim CurRow As Integer
23:
24: Public ColumnS(21) As Integer 'real column number
25:
26: Public Enum ColumnType
27: ITEM = 1
28: STORE = 2
29: Model = 3
30: StoreName = 4
31: URL = 5
32: PRODUCT_TITLE = 6
33: RETAIL = 7
34: INVENTORY_AT_STORE = 8
35: TS_MODEL = 9
36: TS_URL = 10
37: TS_STORENAME = 11
38: TS_PRODUCT_TITLE = 12
39: TS_RETAIL = 14
40: TS_INVENTORY = 13
41: TS_LEADTIME = 15
42: TS_URL1 = 16
43: TS_Json1 = 17
44: TS_URL2 = 18
45: End Enum
46:
47:
48: Private Sub Button1_Click(sender As Object, e As EventArgs) Handles ButtonSelect.Click
49: Dim X As New OpenFileDialog
50: X.RestoreDirectory = True
51: X.Title = "Browse input Excel File"
52: X.Filter = "All files (*.*)|*.*|(*.xls) files |*.xls|(*.xlsx) files |*.xlsx|(*.xlsm) files |*.xlsm"
53: X.FilterIndex = 1
54: X.CheckFileExists = True
55: X.CheckPathExists = True
56: X.Multiselect = False
57: If X.ShowDialog = DialogResult.OK Then
58: ExcelLocation.Text = X.FileName
59: FI = New FileInfo(ExcelLocation.Text)
60: ExcelPackage = New ExcelPackage(FI)
61: WorkBook = ExcelPackage.Workbook
62: WorkSheet = WorkBook.Worksheets.First()
63: ColCount = WorkSheet.Dimension.End.Column
64: RowCount = WorkSheet.Dimension.End.Row
65: If CheckIsColumnExist() Then
66: ExcelRowCount.Text = "All columns exists," & RowCount & " rows found"
67: StartButton.Enabled = True
68: ThreadNumericUpDown.Enabled = True
69: StartDelayNumericUpDown.Enabled = True
70: RepeateDelayNumericUpDown.Enabled = True
71: Label1_.Enabled = True
72: Label2_.Enabled = True
73: Label3_.Enabled = True
74: Label4_.Enabled = True
75: Label3.Enabled = True
76: End If
77: ToolStripProgressBar1.Maximum = (RowCount + 1) * 3
78: ToolStripProgressBar1.Style = ProgressBarStyle.Blocks
79: End If
80: End Sub
....
285: Private Sub SaveButton_Click(sender As Object, e As EventArgs) Handles SaveButton.Click
286: ExcelPackage.SaveAs(FI)
287: End Sub
288:
....
352: Function CheckIsColumnExist() As Boolean
353: For Each One As ColumnType In System.Enum.GetValues(GetType(ColumnType))
354: Dim ColNumber As Integer = WorkSheet.GetColumnByName(ColumnName(CInt(One) - 1), AddressOf BadColumnStructure)
355: If ColNumber = -1 Then
356: Return False
357: Else
358: ColumnS(CInt(One)) = ColNumber
359: End If
360: Next
361: Return True
362: End Function
363:
364: Sub BadColumnStructure()
365: Dim Str1 As New Text.StringBuilder
366: ColumnName.ToArray.ForEach(Of String)(ColumnName.ToArray, Sub(Z) Str1.AppendLine(Z))
367: MsgBox("Right column structure is: " & vbCrLf & Str1.ToString)
368: End Sub
369:
....
391: Dim ColumnName() As String = {"ITEM #",
392: "STORE #",
393: "Model #",
394: "Store name",
395: "URL",
396: "PRODUCT TITLE",
397: "RETAIL",
398: "INVENTORY AT STORE",
399: "TS MODEL",
400: "TS URL",
401: "TS STORENAME",
402: "TS PRODUCT TITLE",
403: "TS RETAIL",
404: "TS INVENTORY",
405: "TS LEADTIME",
406: "TS_URL1",
407: "TS_Json1",
408: "TS_URL2"}
409:
410: End Class
1: Imports System.Runtime.CompilerServices
2: Imports OfficeOpenXml
3:
4: Module EpPlusExtensionMethods
5: <Extension()>
6: Function GetColumnByName(ByVal ws As ExcelWorksheet, ByVal columnName As String, ErrMessAction As Action) As Integer
7: If ws Is Nothing Then Throw New ArgumentNullException(NameOf(ws))
8: Try
9: Return ws.Cells("1:1").First(Function(c) c.Value.ToString().ToLower.Trim = columnName.ToLower.Trim).Start.Column
10: Catch ex As Exception
11: ErrMessAction.Invoke
12: Return -1
13: End Try
14:
15: End Function
16: End Module
5. Timer
Next component, usually present in parser is timer. Win forms timer usually not working, therefore I use system.timer.
313: Public Sub UpdateTimer(Value As Integer)
314: Me.InvokeOnUiThreadIfRequired(Sub() TimeLabel.Text += Timer1.Interval / 1000)
315: Me.InvokeOnUiThreadIfRequired(Sub() TimeLabel.Refresh())
316: End Sub
...
339: Function GetDuration(StartTime As Long) As String
340: Return String.Format("{0:N2} seconds", New TimeSpan(Now.Ticks - StartTime).TotalSeconds)
341: End Function
...
374: Dim WithEvents Timer1 As Timers.Timer
375: Sub StartTimer()
376: Timer1 = New Timer(1000)
377: Timer1.Enabled = True
378: Timer1.Start()
379: End Sub
380:
381: Sub StopTimer()
382: Timer1.Enabled = False
383: Timer1.Stop()
384: End Sub
385:
386: Private Sub Timer1_Elapsed(sender As Object, e As ElapsedEventArgs) Handles Timer1.Elapsed
387: UpdateTimer(Timer1.Interval)
388: End Sub
389:
390:
6. Resizable progressbar.
Each parser usually have ProgressBar.
77: ToolStripProgressBar1.Maximum = (RowCount + 1) * 3
78: ToolStripProgressBar1.Style = ProgressBarStyle.Blocks
...
298: Public Sub UpdateProgress()
299: Me.InvokeOnUiThreadIfRequired(Sub()
300: If ToolStripProgressBar1.Maximum > ToolStripProgressBar1.Value Then
301: ToolStripProgressBar1.Value += 1
302: Else
303: ToolStripProgressBar1.Maximum += 1
304: End If
305: End Sub)
306: Me.InvokeOnUiThreadIfRequired(Sub() StatusStrip1.Refresh())
307: End Sub
308: Public Sub IncreaseProgress()
309: Me.InvokeOnUiThreadIfRequired(Sub() ToolStripProgressBar1.Maximum += 1)
310: Me.InvokeOnUiThreadIfRequired(Sub() StatusStrip1.Refresh())
311: End Sub
...
370: Private Sub StatusStrip1_Resize(sender As Object, e As EventArgs) Handles StatusStrip1.Resize
371: ToolStripProgressBar1.Size = New Size(StatusStrip1.Width - 20, 16)
372: End Sub
...
7. Backgroundworker.
Usually any continuous operation in Windows desktop program doing in background worker.
85: Dim WithEvents BGW1 As BackgroundWorker
86: Dim BGW1_Prm As New Object
87:
88: Private Sub StartButton_Click(sender As Object, e As EventArgs) Handles StartButton.Click
89: ButtonSelect.Enabled = False
90: StartButton.Enabled = False
91: SaveButton.Enabled = True
92: Label2.Enabled = True
93: BGW1 = New BackgroundWorker
94: BGW1_Prm = New With {.StartDelay = StartDelayNumericUpDown.Value, .Thread = ThreadNumericUpDown.Value, .RepeateDelay = RepeateDelayNumericUpDown.Value}
95: BGW1.RunWorkerAsync(BGW1_Prm)
96: End Sub
....
100: Private Sub BGW1_DoWork(sender As Object, e As DoWorkEventArgs) Handles BGW1.DoWork
....
105: .... call AddressOf ProcessOneRow)
106: End Sub
....
289: Private Sub BGW1_RunWorkerCompleted(sender As Object, e As RunWorkerCompletedEventArgs) Handles BGW1.RunWorkerCompleted
290: ExcelPackage.SaveAs(FI)
291: StopTimer()
292: AutoClosingMessageBox.Show("Done.")
293: ButtonSelect.Enabled = True
294: StartButton.Enabled = True
295: ToolStripProgressBar1.Value = 0
296: End Sub
8. Multithreading by Parallel.
105: Dim W = Parallel.For(2, RowCount + 1, MaxThread, AddressOf ProcessOneRow)
In the screen below you can see process of thread degradation. I start from 50 thread, during processing request/response to network in the middle of execution program half of process was terminated, and before the finish still alive just only one thread.
And this is memory increase process and mymory utilization process by Global Catalog.
9. Control of Parallel count threading.
100: Private Sub BGW1_DoWork(sender As Object, e As DoWorkEventArgs) Handles BGW1.DoWork
101: Dim MaxThread As New ParallelOptions()
102: MaxThread.MaxDegreeOfParallelism = e.Argument.Thread
....
105: Dim W = Parallel.For(2, RowCount + 1, MaxThread, AddressOf ProcessOneRow)
111: End Sub
112:
....
115: Sub ProcessOneRow(CurRow As Integer, ThreadState As ParallelLoopState)
....
10. Shared ThreadSafe Value.
1: Public Class SharedThreadSafeValue(Of T)
2: Private _Value As T
3: Private readerWriterLock As Threading.ReaderWriterLockSlim = New Threading.ReaderWriterLockSlim()
4:
5: Public Sub New(ByVal Val As T)
6: _Value = Val
7: End Sub
8:
9: Public Property Value As T
10: Get
11: readerWriterLock.EnterReadLock()
12:
13: Try
14: Return _Value
15: Finally
16: readerWriterLock.ExitReadLock()
17: End Try
18: End Get
19: Set(ByVal Val As T)
20: readerWriterLock.EnterWriteLock()
21:
22: Try
23:
24: If Not _Value.Equals(Val) Then
25: _Value = Val
26: End If
27:
28: Finally
29: readerWriterLock.ExitWriteLock()
30: End Try
31: End Set
32: End Property
33: End Class
11. Full code of started multithreading in DoWork.
98: Dim StartDelay As SharedThreadSafeValue(Of Integer)
99: Dim RepeateDelay As SharedThreadSafeValue(Of Integer)
100: Private Sub BGW1_DoWork(sender As Object, e As DoWorkEventArgs) Handles BGW1.DoWork
101: Dim MaxThread As New ParallelOptions()
102: MaxThread.MaxDegreeOfParallelism = e.Argument.Thread
103: StartDelay = New SharedThreadSafeValue(Of Integer)(e.Argument.StartDelay)
104: RepeateDelay = New SharedThreadSafeValue(Of Integer)(e.Argument.RepeateDelay)
105: Dim W = Parallel.For(2, RowCount + 1, MaxThread, AddressOf ProcessOneRow)
106: End Sub
107:
12. Update WinForms from another thread.
In this case I use patten described in page Оновлення StatusLabel з потоку BackGroundWorker - приклад застосування Action, Delegate, Invoke, AddressOf, Extension, Expression.. Alternative solution I describe in page Mutlithreading InfoMessagBox by ConcurrentQueue, Interlocked and Timer.
1: Imports System.Runtime.CompilerServices
2:
3: Module StatusLabelExtension
4: <Extension()>
5: Sub InvokeOnUiThreadIfRequired(ByVal control As Control, ByVal action As Action)
6: If control.InvokeRequired Then
7: control.BeginInvoke(action)
8: Else
9: action.Invoke()
10: End If
11: End Sub
12: End Module
314: Public Sub UpdateTimer(Value As Integer)
315: Me.InvokeOnUiThreadIfRequired(Sub() TimeLabel.Text += Timer1.Interval / 1000)
316: Me.InvokeOnUiThreadIfRequired(Sub() TimeLabel.Refresh())
317: End Sub
318:
319: Public Sub UpdateLabel(RowNumber As String)
320: Me.InvokeOnUiThreadIfRequired(Sub() RowCountLabel.Text = RowNumber)
321: Me.InvokeOnUiThreadIfRequired(Sub() RowCountLabel.Refresh())
322: End Sub
323:
324: Public Sub InfoMessage1(Msg As String)
325: Me.InvokeOnUiThreadIfRequired(Sub() lErr1.Text = Msg)
326: Me.InvokeOnUiThreadIfRequired(Sub() lErr1.Refresh())
327: End Sub
328:
329: Public Sub InfoMessage2(Msg As String)
330: Me.InvokeOnUiThreadIfRequired(Sub() lErr2.Text = Msg)
331: Me.InvokeOnUiThreadIfRequired(Sub() lErr2.Refresh())
332: End Sub
333:
334: Public Sub InfoMessage3(Msg As String)
335: Me.InvokeOnUiThreadIfRequired(Sub() lErr3.Text = Msg)
336: Me.InvokeOnUiThreadIfRequired(Sub() lErr3.Refresh())
337: End Sub
13. Parse Json by Newtonsoft.Json package.
In my site a have a couple of description aboun JSON : Parse Yotube response by Newtonsoft.Json and How to parse JSON by Newtonsoft.Json (on example of FireFox Tab Session Manager and decrypted JwtSecurityToken). This is actual JSON in this program.
1: {
2: "products": [
3: {
4: "desc": "10 In. All Surface Wash Brush",
5: "url": "/pd/CRAFTSMAN-10-In-All-Surface-Wash-Brush/1000622473",
6: "brand": "CRAFTSMAN",
7: "itemNumber": "805817",
8: "vendorNumber": "302",
9: "modelNumber": "CMXMLBA7310A",
10: "imgName": "049206143470",
11: "productId": "1000622473",
12: "rating": 4.7797,
13: "reviewCount": 59,
14: "imgUrl": "http://images.lowes.com/product/converted/049206/049206143470sm.jpg",
15: "name": "CRAFTSMAN 10 In. All Surface Wash Brush"
16: },
17: {
18: "desc": "Ez Siding and House Wash Pressure Washer",
19: "url": "/pd/Mold-Armor-Ez-Siding-and-House-Wash-Pressure-Washer/1000731526",
20: "brand": "Mold Armor",
21: "itemNumber": "1150577",
22: "vendorNumber": "78",
23: "modelNumber": "FG581",
24: "imgName": "075919005811",
25: "productId": "1000731526",
26: "rating": 4.5455,
27: "reviewCount": 99,
28: "imgUrl": "http://images.lowes.com/product/converted/075919/075919005811sm.jpg",
29: "name": "Mold Armor Ez Siding and House Wash Pressure Washer"
30: },
31: {
32: "desc": "Deck Wash Kit",
33: "url": "/pd/Arnold-Deck-Wash-Kit/50119657",
34: "brand": "Arnold",
35: "itemNumber": "543172",
36: "vendorNumber": "442",
37: "modelNumber": "490-900-M061",
38: "imgName": "037049954339",
39: "productId": "50119657",
40: "rating": 4.4792,
41: "reviewCount": 48,
42: "imgUrl": "http://images.lowes.com/product/converted/037049/037049954339sm.jpg",
43: "name": "Arnold Deck Wash Kit"
44: }
45: ]
46: }
This is begining of this parser, some first strings of code.
And this is fragment of real finally code with using JSON parse - look at string 92,93,97.
1: Imports Newtonsoft.Json
2: Imports Newtonsoft.Json.Linq
3: Imports CsQuery
4: Imports System.Text.RegularExpressions
5:
6: Partial Public Class StartForm
7:
8: Public Shared Function BindIPEndPoint1(ByVal servicePoint As Net.ServicePoint, ByVal remoteEndPoint As Net.IPEndPoint, ByVal retryCount As Integer) As Net.IPEndPoint
9: If retryCount = 0 Then
10: ServerIPAddr = New SharedThreadSafeValue(Of String)(remoteEndPoint.Address.ToString)
11: StartForm_Instance.ShowServerIP(remoteEndPoint.Address.ToString)
12: End If
13: Return remoteEndPoint
14: End Function
15:
16: Dim StartDelay As SharedThreadSafeValue(Of Integer)
17: Dim RepeateDelay As SharedThreadSafeValue(Of Integer)
18: Dim ResponseTimeout As SharedThreadSafeValue(Of Integer)
19: Dim ResultCount As SharedThreadSafeValue(Of Integer)
20: Shared ServerIPAddr As SharedThreadSafeValue(Of String)
21:
22: Sub ProcessOneRow(CurRow As Integer, ThreadState As ParallelLoopState)
23: Try
24: Dim RepeateRequest1 As Integer = 0, RepeateRequest2 As Integer = 0, RepeateRequest3 As Integer = 0
25:
26: If String.IsNullOrEmpty(WorkSheet.Cells(CurRow, ColumnS(ColumnType.ITEM)).Value) Or String.IsNullOrEmpty(WorkSheet.Cells(CurRow, ColumnS(ColumnType.STORE)).Value) Then
27: Exit Sub
28: End If
29:
30: ShowAliveThread(Process.GetCurrentProcess().Threads.Count)
31: InfoMessage("Start thread=" & System.Threading.Thread.CurrentThread.ManagedThreadId & ", row=" & CurRow)
32: Debug.Print("Start thread=" & System.Threading.Thread.CurrentThread.ManagedThreadId & ", row=" & CurRow)
33: Threading.Thread.Sleep(1000 * StartDelay.Value)
34: Dim Loader As New Wcf_Client()
35:
36: Dim RowStartTime As Long = Now.Ticks
37: WorkSheet.Cells(CurRow, ColumnS(ColumnType.TS_MODEL)).Value = Now()
38: WorkSheet.Cells(CurRow, ColumnS(ColumnType.TS_PRODUCT_TITLE)).Value = ""
39: WorkSheet.Cells(CurRow, ColumnS(ColumnType.TS_RETAIL)).Value = ""
40: WorkSheet.Cells(CurRow, ColumnS(ColumnType.TS_INVENTORY)).Value = ""
41: WorkSheet.Cells(CurRow, ColumnS(ColumnType.TS_STORENAME)).Value = ""
42: WorkSheet.Cells(CurRow, ColumnS(ColumnType.TS_URL)).Value = ""
43: WorkSheet.Cells(CurRow, ColumnS(ColumnType.TS_LEADTIME)).Value = "Working"
44:
45:
46: Request1:
47: Dim URL1 As String = "https://www.lowes.com/LowesSearchServices/resources/typeahead/v1_1?searchTerm=" &
48: WorkSheet.Cells(CurRow, ColumnS(ColumnType.ITEM)).Value &
49: "&maxArticles=20&maxBrands=0&maxDepts=0&maxProducts=20&store=" &
50: WorkSheet.Cells(CurRow, ColumnS(ColumnType.STORE)).Value &
51: "&isRemoteLocation=0"
52: WorkSheet.Cells(CurRow, ColumnS(ColumnType.TS_URL1)).Value = URL1
53:
54: Dim Cookie1 As Specialized.NameValueCollection
55:
56: Dim Response1 As ResponseResult = Loader.GetRequestStrAsync(URL1,, CurRow, RepeateRequest1, ResponseTimeout.Value)
57: Dim JsonHtml As String = Response1.HTML
58:
59: Dim mySP As Net.ServicePoint = Net.ServicePointManager.FindServicePoint(New Uri(URL1))
60: mySP.BindIPEndPointDelegate = New Net.BindIPEndPoint(AddressOf BindIPEndPoint1)
61:
62: WorkSheet.Cells(CurRow, ColumnS(ColumnType.TS_Json1)).Value = JsonHtml
63:
64: If JsonHtml = "Unable to connect to the remote server" Or JsonHtml = "The remote server returned an error: (500) Internal Server Error." Or JsonHtml = "" Or JsonHtml = "{}" Then
65:
66: Dim ResponseState1 As RequestState = Response1.AsyncResult.AsyncState
67: Dim ResponseHeaders1 As Specialized.NameValueCollection = ResponseState1.ResponseHeader()
68: Dim RespSetCookie1 As String = ResponseHeaders1("Set-Cookie")
69: If Not String.IsNullOrEmpty(RespSetCookie1) Then
70: Cookie1 = New Specialized.NameValueCollection
71: Cookie1.Add("Cookie", RespSetCookie1)
72: End If
73: Dim ErrMsg1 As String = "", StatusCode1 As String = ""
74: If Not String.IsNullOrEmpty(ResponseState1.ErrorMessage) Then ErrMsg1 = ResponseState1.ErrorMessage
75: If Not String.IsNullOrEmpty(ResponseState1.ResponseStatusCode) Then StatusCode1 = "HttpStatus=" & ResponseState1.ResponseStatusCode
76:
77: WorkSheet.Cells(CurRow, ColumnS(ColumnType.TS_LEADTIME)).Value = "Unable to connect to the remote server (1), " & "Thread=" & Threading.Thread.CurrentThread.ManagedThreadId & ", Row=" & CurRow & " " & StatusCode1 & " " & ErrMsg1
78: InfoMessage("Unable to connect to the remote server (1), Thread=" & Threading.Thread.CurrentThread.ManagedThreadId & ", Row=" & CurRow & ", Restarted (" & RepeateRequest1 & "). " & " " & StatusCode1 & " " & ErrMsg1)
79: Debug.Print("Unable to connect to the remote server (1), Thread=" & Threading.Thread.CurrentThread.ManagedThreadId & ", Row=" & CurRow & ", Restarted (" & RepeateRequest1 & "). " & " " & StatusCode1 & " " & ErrMsg1)
80: 'IncreaseProgress()
81: Loader = Nothing
82: Threading.Thread.Sleep(1000 * RepeateDelay.Value)
83: Loader = New Wcf_Client()
84: RepeateRequest1 += 1
85: If RepeateRequest1 < 10 Then GoTo Request1 Else Exit Sub
86: End If
87:
88: WorkSheet.Cells(CurRow, ColumnS(ColumnType.TS_URL)).Value = GetDuration(RowStartTime)
89: UpdateProgress()
90:
91: Dim Json As JObject = JObject.Parse(JsonHtml)
92: For Each One In Json("products")
93: If One("itemNumber").Value(Of String).Trim = WorkSheet.Cells(CurRow, ColumnS(ColumnType.ITEM)).Value.ToString.Trim Then
94: Request2:
95: Dim URL2 As String = "https://www.lowes.com" & One("url").Value(Of String)
96:
97: WorkSheet.Cells(CurRow, ColumnS(ColumnType.Model)).Value = One("modelNumber").Value(Of String)
98: WorkSheet.Cells(CurRow, ColumnS(ColumnType.URL)).Value = URL2
99:
100: Dim Cookie2 As String = ""
101:
102: Dim Headers As New Specialized.NameValueCollection
103: Headers.Add("Cookie", "sn=" & WorkSheet.Cells(CurRow, ColumnS(ColumnType.STORE)).Value & Cookie2)
104: Dim Response2 As ResponseResult = Loader.GetRequestStrAsync(URL2, Headers, CurRow, RepeateRequest2, ResponseTimeout.Value)
105: Dim HTML As String = Response2.HTML
106:
107: If HTML = "Unable to connect to the remote server" Or HTML = "The remote server returned an error: (500) Internal Server Error." Or HTML = "" Then
108:
109: Dim ResponseState2 As RequestState = Response2.AsyncResult.AsyncState
110: Dim ResponseHeaders2 As Specialized.NameValueCollection = ResponseState2.ResponseHeader()
111: Dim RespSetCookie2 As String = ResponseHeaders2("Set-Cookie")
112: If Not String.IsNullOrEmpty(RespSetCookie2) Then
113: Cookie2 = ";" & RespSetCookie2
114: End If
115: Dim ErrMsg2 As String = "", StatusCode2 As String = ""
116: If Not String.IsNullOrEmpty(ResponseState2.ErrorMessage) Then ErrMsg2 = ResponseState2.ErrorMessage
117: If Not String.IsNullOrEmpty(ResponseState2.ResponseStatusCode) Then StatusCode2 = "HttpStatus=" & ResponseState2.ResponseStatusCode
118:
119: WorkSheet.Cells(CurRow, ColumnS(ColumnType.TS_LEADTIME)).Value = "Unable to connect to the remote server (2), " & "Thread=" & Threading.Thread.CurrentThread.ManagedThreadId & ", Row=" & CurRow & " " & StatusCode2 & " " & ErrMsg2
120: InfoMessage("Unable to connect to the remote server (2), Thread=" & Threading.Thread.CurrentThread.ManagedThreadId & ", Row=" & CurRow & ", Restarted (" & RepeateRequest2 & "). " & " " & StatusCode2 & " " & ErrMsg2)
121: Debug.Print("Unable to connect to the remote server (2), Thread=" & Threading.Thread.CurrentThread.ManagedThreadId & ", Row=" & CurRow & ", Restarted (" & RepeateRequest2 & "). " & " " & StatusCode2 & " " & ErrMsg2)
122: 'IncreaseProgress()
123: Loader = Nothing
124: Threading.Thread.Sleep(1000 * RepeateDelay.Value)
125: Loader = New Wcf_Client()
126: RepeateRequest2 += 1
127: If RepeateRequest2 < 10 Then GoTo Request2 Else Exit Sub
128: End If
129:
130: WorkSheet.Cells(CurRow, ColumnS(ColumnType.TS_STORENAME)).Value = GetDuration(RowStartTime)
131: UpdateProgress()
132:
133: Dim HtmlDom As CQ = HTML
134:
135: Dim EleItem = HtmlDom("input[name='productId']")
136: If EleItem.Count > 0 Then
137: WorkSheet.Cells(CurRow, ColumnS(ColumnType.PRODUCT_TITLE)).Value = HtmlDom("input[name='productId']").Attr("data-productname")
138: WorkSheet.Cells(CurRow, ColumnS(ColumnType.RETAIL)).Value = HtmlDom("input[name='productId']").Attr("data-productprice")
139: End If
140:
141: WorkSheet.Cells(CurRow, ColumnS(ColumnType.TS_PRODUCT_TITLE)).Value = GetDuration(RowStartTime)
142:
143: Dim StPos As Integer = InStr(HTML, """pickup"":")
144: If StPos > 0 Then
145: StPos = StPos + Len("""pickup"":")
146: Dim EPos As Integer = InStr(StPos, HTML, "},")
147: If EPos > 0 Then
148: Dim FStr As String = Mid(HTML, StPos, EPos + Len("}}}];") - StPos - 4)
149: Dim Json3 As JObject = JObject.Parse(FStr)
150:
151: WorkSheet.Cells(CurRow, ColumnS(ColumnType.INVENTORY_AT_STORE)).Value = Json3("availabileQuantity").Value(Of Integer)
152:
153: WorkSheet.Cells(CurRow, ColumnS(ColumnType.TS_RETAIL)).Value = GetDuration(RowStartTime)
154: End If
155: End If
156:
157:
158: Dim ItemStore = HtmlDom(".pd-item-actions.grid-100.grid-parent")
159:
160: If ItemStore.Count > 0 Then
161: Dim Strong As CsQuery.CQ = ItemStore.Find("strong")
162: If Strong.Count = 2 Then
163: WorkSheet.Cells(CurRow, ColumnS(ColumnType.StoreName)).Value = ItemStore.Find("strong")(1).InnerText
164: Else
165: GoTo Request3
166: End If
167: Else
168: Request3:
....
14. Send Cookie to request.
180: Dim Headers As New Specialized.NameValueCollection
181: Headers.Add("Cookie", "sn=" & WorkSheet.Cells(CurRow, ColumnS(ColumnType.STORE)).Value)
182: Dim HTML As String = Loader.GetRequestStrAsync("https://www.lowes.com" & One("url").Value(Of String), Headers, CurRow)
15. Parse HTML by CsQuery with jQuery Selector.
200: Dim HtmlDom As CQ = HTML
201:
202: Dim EleItem = HtmlDom("input[name='productId']")
203: If EleItem.Count > 0 Then
204: WorkSheet.Cells(CurRow, ColumnS(ColumnType.PRODUCT_TITLE)).Value = HtmlDom("input[name='productId']").Attr("data-productname")
205: WorkSheet.Cells(CurRow, ColumnS(ColumnType.RETAIL)).Value = HtmlDom("input[name='productId']").Attr("data-productprice")
206: End If
207:
16. Parse data in JavaScript by Regular expression or string operations.
Most interesting data usually contains in Javascript. This data can be parses by Regular expression and string operation. In this program I use string operations.In this case I use string operations.
17. GetRequestStrAsync - require HTML by AsyncCallback.
This is small part of code fragment of my class WCF_Client I described in 2010 - WCF_CLIENT - клиент Web-сервиса. How it working?
First component is class to pass thread number of parallel process to asynchronous task of reading HTML. This is parameter CustomData in string 391.
1: Imports System.Net
2:
3: Public Class RequestState
4:
5: Public Property StringBuilder As System.Text.StringBuilder
6: Public Property BufferRead As Byte()
7: Public Property Request As Net.HttpWebRequest
8: Public Property ResponseStream As IO.Stream
9: Public Property StreamDecode As System.Text.Decoder
10: Public Property ErrorMessage As String
11: Public Property ParentThreadID As Integer
12: Public Property CallbackThreadID As Integer
13: Public Property CustomData As String
14: Public Property ErrorRepeateCount As Integer
15: Public Property ResponseHeader As Specialized.NameValueCollection
16: Public Property ResponseStatusCode As String
17: Public Property ResponseContentLength As Integer
18: Public Property ResponseContentType As String
19: Public Property ResponseContentEncoding As String
20: Public Property ResponseCharacterSet As String
21:
22: Public Sub New(ByVal BufSizeInBytes As Integer, ByVal Encoder As Wcf_Client.PostRequestEncode)
23: Request = Nothing
24: ResponseStream = Nothing
25: StringBuilder = New System.Text.StringBuilder("")
26: If Encoder = Wcf_Client.PostRequestEncode.ASCII Then
27: StreamDecode = System.Text.ASCIIEncoding.GetEncoding("windows-1251").GetDecoder
28: ElseIf Encoder = Wcf_Client.PostRequestEncode.UTF16 Then
29: StreamDecode = System.Text.UnicodeEncoding.Unicode.GetDecoder
30: ElseIf Encoder = Wcf_Client.PostRequestEncode.UTF8 Then
31: StreamDecode = System.Text.UTF8Encoding.UTF8.GetDecoder
32: ElseIf Encoder = Wcf_Client.PostRequestEncode.UTF7 Then
33: StreamDecode = System.Text.UTF7Encoding.UTF7.GetDecoder
34: ElseIf Encoder = Wcf_Client.PostRequestEncode.UTF32 Then
35: StreamDecode = System.Text.UTF32Encoding.UTF32.GetDecoder
36: End If
37: Dim buf1(BufSizeInBytes) As Byte
38: BufferRead = buf1
39: End Sub
40: End Class
41:
42: Public Class ResponseResult
43: Property HTML As String
44: Property ErrorMessage As String
45: Property AsyncResult As IAsyncResult
46: Property IsAllDoneExit As Boolean
47: End Class
48:
49: Public Class Wcf_Client
50:
51: Public Enum PostRequestEncode
52: ASCII = 1
53: UTF7 = 2
54: UTF8 = 3
55: UTF16 = 4
56: UTF32 = 5
57: End Enum
58:
59: Public Enum PostRedirectEncode
60: ASCII_Windows1251 = 1
61: UTF8 = 2
62: UTF16 = 3
63: ISO_8859_5 = 4
64: End Enum
65:
66: Public Property UserAgent As String = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36"
67: Public AllDone As Threading.ManualResetEvent
68: Dim BUFFER_SIZE As Integer = 10000
69:
70:
71: Public Overridable Function GetRequestStrAsync(ByVal URL As String, Optional Headers As Specialized.NameValueCollection = Nothing, Optional CustomData As String = Nothing, Optional ErrorRepeateCount As Integer = 0, Optional TimeOutSec As Integer = 60, Optional exitContext As Boolean = False, Optional ByVal ResponseEncoding As PostRequestEncode = PostRequestEncode.ASCII, Optional ByVal Full_ProxyURL As String = "") As ResponseResult
72: Dim RS As RequestState = New RequestState(BUFFER_SIZE, ResponseEncoding)
73:
74: Try
75: AllDone = New Threading.ManualResetEvent(False)
76:
77: '========== System.NotSupportedException The URI prefix is not recognized.
78: Dim Request As Net.HttpWebRequest = Net.HttpWebRequest.Create(URL)
79:
80: If Headers IsNot Nothing Then
81: Request.Headers.Add(Headers)
82: End If
83:
84: Request.UserAgent = UserAgent
85: Request.Method = "GET"
86: If Full_ProxyURL <> "" Then
87: Dim MyProxy As New Net.WebProxy
88: MyProxy.Address = New Uri(Full_ProxyURL)
89: Request.Proxy = MyProxy
90: End If
91:
92: ' Put the request into the state so it can be passed around.
93: RS.Request = Request
94: RS.ParentThreadID = System.Threading.Thread.CurrentThread.ManagedThreadId
95: RS.CustomData = CustomData
96: RS.ErrorRepeateCount = ErrorRepeateCount
97: RS.ResponseHeader = New Specialized.NameValueCollection
98:
99: 'Issue the async request.
100: Dim AsyncResult As IAsyncResult = CType(Request.BeginGetResponse(
101: New AsyncCallback(AddressOf RespCallback), RS), IAsyncResult)
102:
103: ' Wait until the ManualResetEvent is set so that the application does not exit until after the callback is called.
104: Dim IsExit As Boolean = AllDone.WaitOne((New TimeSpan(0, 0, 0, TimeOutSec, 0)), exitContext)
105:
106: Return New ResponseResult With {.HTML = RS.StringBuilder.ToString, .ErrorMessage = RS.ErrorMessage, .AsyncResult = AsyncResult, .IsAllDoneExit = IsExit}
107:
108: Catch ex As Exception
109: StartForm_Instance.InfoMessage(ex.Message)
110: Return New ResponseResult With {.HTML = RS.StringBuilder.ToString, .ErrorMessage = ex.Message}
111: End Try
112:
113: End Function
114:
115: Sub RespCallback(ByVal ar As IAsyncResult)
116: ' Get the RequestState object from the async result
117: Dim RS As RequestState = CType(ar.AsyncState, RequestState)
118:
119: Debug.Print("Callback in thread =" & RS.ParentThreadID)
120:
121: Try
122: ' Get the HttpWebRequest from RequestState.
123: Dim Req As Net.HttpWebRequest = RS.Request
124:
125: ' Call EndGetResponse, which returns the HttpWebResponse object that came from the request issued above.
126: Dim Resp As Net.HttpWebResponse = CType(Req.EndGetResponse(ar), Net.HttpWebResponse)
127:
128: If Resp.ContentEncoding = "" Then
129: RS.StreamDecode = Text.ASCIIEncoding.GetEncoding("windows-1251").GetDecoder
130: End If
131:
132: RS.ResponseContentLength = Resp.ContentLength
133: RS.ResponseContentType = Resp.ContentType
134: RS.ResponseContentEncoding = Resp.ContentEncoding
135: RS.ResponseCharacterSet = Resp.CharacterSet
136: RS.ResponseStatusCode = Resp.StatusCode
137: RS.CallbackThreadID = Threading.Thread.CurrentThread.ManagedThreadId
138: For Each OneKey As String In Resp.Headers.AllKeys
139: RS.ResponseHeader.Add(OneKey, Resp.GetResponseHeader(OneKey))
140: Next
141:
142: ' Start reading data from the respons stream.
143: '============= The remote server returned an error: (407) Proxy Authentication Required. ==========
144: Dim ResponseStream As IO.Stream = Resp.GetResponseStream()
145:
146: ' Store the reponse stream in RequestState to read the stream asynchronously.
147: RS.ResponseStream = ResponseStream
148:
149: ' Pass rs.BufferRead to BeginRead. Read data into rs.BufferRead.
150: Dim iarRead As IAsyncResult =
151: ResponseStream.BeginRead(RS.BufferRead, 0, BUFFER_SIZE,
152: New AsyncCallback(AddressOf ReadCallBack), RS)
153:
154: Catch ex As Exception
155: StartForm_Instance.InfoMessage(ex.Message)
156: RS.ErrorMessage = ex.Message
157: AllDone.Set()
158: End Try
159:
160: End Sub
161:
162: Sub ReadCallBack(ByVal asyncResult As IAsyncResult)
163:
164: StartForm_Instance.InfoMessage("Thread=" & CType(asyncResult.AsyncState, RequestState).ParentThreadID & ", row=" & CType(asyncResult.AsyncState, RequestState).CustomData & " - data received.")
165:
166: ' Get the RequestState object from the AsyncResult.
167: Dim rs As RequestState = CType(asyncResult.AsyncState, RequestState)
168:
169: ' Retrieve the ResponseStream that was set in RespCallback.
170: Dim responseStream As IO.Stream = rs.ResponseStream
171:
172: ' Read rs.BufferRead to verify that it contains data.
173: Dim read As Integer = responseStream.EndRead(asyncResult)
174: If read > 0 Then
175: ' Prepare a Char array buffer for converting to Unicode.
176: Dim charBuffer(rs.BufferRead.Count) As Char
177:
178: ' Convert byte stream to Char array and then String. len contains the number of characters converted to Unicode.
179: Dim len As Integer =
180: rs.StreamDecode.GetChars(rs.BufferRead, 0, read, charBuffer, 0)
181: Dim str As String = New String(charBuffer, 0, len)
182:
183: ' Append the recently read data to the RequestData stringbuilder object contained in RequestState.
184: rs.StringBuilder.Append(str)
185:
186: ' Continue reading data until responseStream.EndRead returns –1.
187: Dim ar As IAsyncResult =
188: responseStream.BeginRead(rs.BufferRead, 0, BUFFER_SIZE,
189: New AsyncCallback(AddressOf ReadCallBack), rs)
190: Else
191:
192: ' Close down the response stream.
193: responseStream.Close()
194:
195: ' Set the ManualResetEvent so the main thread can exit.
196: AllDone.Set()
197: End If
198:
199: Return
200: End Sub
201:
202: End Class
18. Use Profiler to check code performance.
If you use high performance parser, more than 100 thread, this is a very important step of code refactoring.
|