(NET) NET (2018)

Multithreading Parsers with Parallel, CsQuery, Newtonsoft.Json, OfficeOpenXml and IAsyncResult/AsyncCallback.





I write parsers many years, first parser I have described in my site is dated by 2009 WebDownloader_UltraLite - ваш личный поисковик по рунету с особыми возможностями поиска.. Time on time I described various part of parsers, for example component to use proxy servers - My web scrapper with asynchronous web request and visual proxy availability detection. In my parsers I use various HTML parse engine, for example Parse HTML by HtmlAgilityPack (Xpath selector) and CsQuery (jQuery selector) or various multithreading engine, for example - Building TreeView by Reactive Extensions NET (Recursive observe directory, Iterator function with Yield, Windows native thread).

In this page I will make a full description of my regular downloader and parser.

1. TSL 1.2

Windows 10 by default has not opportunity to strong SSL-connect, it require special parameters in app.config.


   1:  <?xml version="1.0" encoding="utf-8" ?>
   2:  <configuration>
   3:     <runtime>
   4:        <AppContextSwitchOverrides value="Switch.System.Net.DontEnableSchUseStrongCrypto=false"/>
  ...  
   9:  </configuration>

2. More then two endpoint

p>Windows 10 by default has not opportunity more than two output sockets, before you start program "Application.Run(StartForm_Instance)" you need to set "Net.ServicePointManager.DefaultConnectionLimit = 1000". This is string number 36 in code below.


3. Background thread and sockets read thread.


Usually threading of parsers look as that - windows form thread, background thread and socket read thread.



If exception is appear in not windows forms thread, program finish silently. Therefore each multithreading program (or program with backgroundworker) need a starter like this. I describe this future many times in my site Trap unhandled exception in windows application.


   1:  Imports CefSharp
   2:  Imports CefSharp.WinForms
   3:  Imports System
   4:  Imports System.IO
   5:  Imports System.Runtime.InteropServices
   6:  Imports System.Windows.Forms
   7:   
   8:  Public Module Starter
   9:   
  10:      <DllImport("user32.dll")>
  11:      Private Function ShowWindow(ByVal hWnd As IntPtr, ByVal nCmdShow As Integer) As Boolean
  12:      End Function
  13:   
  14:      <DllImport("Kernel32")>
  15:      Private Function GetConsoleWindow() As IntPtr
  16:      End Function
  17:   
  18:      Const SW_HIDE As Integer = 0
  19:      Const SW_SHOW As Integer = 5
  20:   
  21:      Public StartForm_Instance As StartForm
  22:   
  23:   
  24:      <STAThread>
  25:      Public Sub Main()
  26:   
  27:   
  28:          Dim hwnd As IntPtr
  29:          hwnd = GetConsoleWindow()
  30:          ShowWindow(hwnd, SW_HIDE)
  31:   
  32:          AddHandler Application.ThreadException, New Threading.ThreadExceptionEventHandler(AddressOf Form1_UIThreadException)
  33:          Application.SetUnhandledExceptionMode(UnhandledExceptionMode.CatchException)
  34:          AddHandler AppDomain.CurrentDomain.UnhandledException, New UnhandledExceptionEventHandler(AddressOf CurrentDomain_UnhandledException)
  35:   
  36:          System.Net.ServicePointManager.DefaultConnectionLimit = 1000
  37:   
  38:          StartForm_Instance = New StartForm
  39:   
  40:          Application.Run(StartForm_Instance)
  41:      End Sub
  42:   
  43:      Private Sub Form1_UIThreadException(ByVal sender As Object, ByVal t As Threading.ThreadExceptionEventArgs)
  44:          Dim result As DialogResult = DialogResult.Cancel
  45:   
  46:          Try
  47:              result = ShowThreadExceptionDialog("Fatal Error", t.Exception)
  48:              If result = DialogResult.Abort Then Application.[Exit]()
  49:          Catch
  50:              Try
  51:                  MessageBox.Show("Fatal Windows Forms Error", "UIThreadException", MessageBoxButtons.AbortRetryIgnore, MessageBoxIcon.[Stop])
  52:              Finally
  53:                  Application.[Exit]()
  54:              End Try
  55:          End Try
  56:      End Sub
  57:   
  58:      Private Sub CurrentDomain_UnhandledException(ByVal sender As Object, ByVal e As UnhandledExceptionEventArgs)
  59:          Dim errorMsg As String = "Fatal Non-UI Error:" & vbLf & vbLf
  60:          Dim ex As Exception = CType(e.ExceptionObject, Exception)
  61:          Try
  62:              Dim myLog As EventLog = New EventLog()
  63:              myLog.Source = "ThreadException"
  64:              myLog.WriteEntry(errorMsg & ex.Message & vbLf & vbLf & "Stack Trace:" & vbLf + ex.StackTrace)
  65:          Catch exc As Exception
  66:              Try
  67:                  MessageBox.Show("Could not write the error to the event log. Reason: " & exc.Message, "UnhandledException", MessageBoxButtons.OK, MessageBoxIcon.[Stop])
  68:              Finally
  69:                  Application.[Exit]()
  70:              End Try
  71:          End Try
  72:      End Sub
  73:   
  74:      Public Function ShowThreadExceptionDialog(ByVal title As String, ByVal e As Exception) As DialogResult
  75:          Dim errorMsg As String = "An application error occurred. Please contact the adminstrator " & "with the following information:" & vbLf & vbLf
  76:          errorMsg = errorMsg & e.Message & vbLf & vbLf & "Stack Trace:" & vbLf + e.StackTrace
  77:          Return MessageBox.Show(errorMsg, title, MessageBoxButtons.AbortRetryIgnore, MessageBoxIcon.[Stop])
  78:      End Function
  79:   
  80:  End Module

4. EPPlus extension function, relative column address and check if columns is exist.

Take a look to parser code details. Often it start with job, listed in Excel file. There is a standard part of parsers (more detail about EPPlus please see in page Processing Office Open XML (xlsx) files by EPPlus


   1:  Imports System.IO
   2:  Imports OfficeOpenXml
   3:  Imports Newtonsoft.Json
   4:  Imports Newtonsoft.Json.Linq
   5:  Imports CsQuery
   6:  Imports System.ComponentModel
   7:  Imports System.Timers
   8:   
   9:  Public Class StartForm
  10:   
  11:   
  12:      Private Sub StartForm_Load(sender As Object, e As EventArgs) Handles Me.Load
  13:          Text &= " (" & My.Application.Info.Version.ToString & ")"
  14:      End Sub
  15:   
  16:      Dim ExcelPackage As ExcelPackage
  17:      Dim FI As FileInfo
  18:      Dim WorkBook As ExcelWorkbook
  19:      Dim WorkSheet As ExcelWorksheet
  20:      Dim ColCount As Integer
  21:      Dim RowCount As Integer
  22:      Dim CurRow As Integer
  23:   
  24:      Public ColumnS(21) As Integer 'real column number
  25:   
  26:      Public Enum ColumnType
  27:          ITEM = 1
  28:          STORE = 2
  29:          Model = 3
  30:          StoreName = 4
  31:          URL = 5
  32:          PRODUCT_TITLE = 6
  33:          RETAIL = 7
  34:          INVENTORY_AT_STORE = 8
  35:          TS_MODEL = 9
  36:          TS_URL = 10
  37:          TS_STORENAME = 11
  38:          TS_PRODUCT_TITLE = 12
  39:          TS_RETAIL = 14
  40:          TS_INVENTORY = 13
  41:          TS_LEADTIME = 15
  42:          TS_URL1 = 16
  43:          TS_Json1 = 17
  44:          TS_URL2 = 18
  45:      End Enum
  46:   
  47:   
  48:      Private Sub Button1_Click(sender As Object, e As EventArgs) Handles ButtonSelect.Click
  49:          Dim X As New OpenFileDialog
  50:          X.RestoreDirectory = True
  51:          X.Title = "Browse input Excel File"
  52:          X.Filter = "All files (*.*)|*.*|(*.xls) files |*.xls|(*.xlsx) files |*.xlsx|(*.xlsm) files |*.xlsm"
  53:          X.FilterIndex = 1
  54:          X.CheckFileExists = True
  55:          X.CheckPathExists = True
  56:          X.Multiselect = False
  57:          If X.ShowDialog = DialogResult.OK Then
  58:              ExcelLocation.Text = X.FileName
  59:              FI = New FileInfo(ExcelLocation.Text)
  60:              ExcelPackage = New ExcelPackage(FI)
  61:              WorkBook = ExcelPackage.Workbook
  62:              WorkSheet = WorkBook.Worksheets.First()
  63:              ColCount = WorkSheet.Dimension.End.Column
  64:              RowCount = WorkSheet.Dimension.End.Row
  65:              If CheckIsColumnExist() Then
  66:                  ExcelRowCount.Text = "All columns exists," & RowCount & " rows found"
  67:                  StartButton.Enabled = True
  68:                  ThreadNumericUpDown.Enabled = True
  69:                  StartDelayNumericUpDown.Enabled = True
  70:                  RepeateDelayNumericUpDown.Enabled = True
  71:                  Label1_.Enabled = True
  72:                  Label2_.Enabled = True
  73:                  Label3_.Enabled = True
  74:                  Label4_.Enabled = True
  75:                  Label3.Enabled = True
  76:              End If
  77:              ToolStripProgressBar1.Maximum = (RowCount + 1) * 3
  78:              ToolStripProgressBar1.Style = ProgressBarStyle.Blocks
  79:          End If
  80:      End Sub
 ....   
 285:      Private Sub SaveButton_Click(sender As Object, e As EventArgs) Handles SaveButton.Click
 286:          ExcelPackage.SaveAs(FI)
 287:      End Sub
 288:   
 ....   
 352:      Function CheckIsColumnExist() As Boolean
 353:          For Each One As ColumnType In System.Enum.GetValues(GetType(ColumnType))
 354:              Dim ColNumber As Integer = WorkSheet.GetColumnByName(ColumnName(CInt(One) - 1), AddressOf BadColumnStructure)
 355:              If ColNumber = -1 Then
 356:                  Return False
 357:              Else
 358:                  ColumnS(CInt(One)) = ColNumber
 359:              End If
 360:          Next
 361:          Return True
 362:      End Function
 363:   
 364:      Sub BadColumnStructure()
 365:          Dim Str1 As New Text.StringBuilder
 366:          ColumnName.ToArray.ForEach(Of String)(ColumnName.ToArray, Sub(Z) Str1.AppendLine(Z))
 367:          MsgBox("Right column structure is: " & vbCrLf & Str1.ToString)
 368:      End Sub
 369:   
 ....   
 391:      Dim ColumnName() As String = {"ITEM #",
 392:              "STORE #",
 393:              "Model #",
 394:              "Store name",
 395:               "URL",
 396:              "PRODUCT TITLE",
 397:              "RETAIL",
 398:              "INVENTORY AT STORE",
 399:              "TS MODEL",
 400:              "TS URL",
 401:              "TS STORENAME",
 402:              "TS PRODUCT TITLE",
 403:              "TS RETAIL",
 404:              "TS INVENTORY",
 405:              "TS LEADTIME",
 406:              "TS_URL1",
 407:              "TS_Json1",
 408:              "TS_URL2"}
 409:   
 410:  End Class

   1:  Imports System.Runtime.CompilerServices
   2:  Imports OfficeOpenXml
   3:   
   4:  Module EpPlusExtensionMethods
   5:      <Extension()>
   6:      Function GetColumnByName(ByVal ws As ExcelWorksheet, ByVal columnName As String, ErrMessAction As Action) As Integer
   7:          If ws Is Nothing Then Throw New ArgumentNullException(NameOf(ws))
   8:          Try
   9:              Return ws.Cells("1:1").First(Function(c) c.Value.ToString().ToLower.Trim = columnName.ToLower.Trim).Start.Column
  10:          Catch ex As Exception
  11:              ErrMessAction.Invoke
  12:              Return -1
  13:          End Try
  14:   
  15:      End Function
  16:  End Module

5. Timer

Next component, usually present in parser is timer. Win forms timer usually not working, therefore I use system.timer.


 313:      Public Sub UpdateTimer(Value As Integer)
 314:          Me.InvokeOnUiThreadIfRequired(Sub() TimeLabel.Text += Timer1.Interval / 1000)
 315:          Me.InvokeOnUiThreadIfRequired(Sub() TimeLabel.Refresh())
 316:      End Sub
 ...   
 339:      Function GetDuration(StartTime As Long) As String
 340:          Return String.Format("{0:N2} seconds", New TimeSpan(Now.Ticks - StartTime).TotalSeconds)
 341:      End Function
 ...   
 374:      Dim WithEvents Timer1 As Timers.Timer
 375:      Sub StartTimer()
 376:          Timer1 = New Timer(1000)
 377:          Timer1.Enabled = True
 378:          Timer1.Start()
 379:      End Sub
 380:   
 381:      Sub StopTimer()
 382:          Timer1.Enabled = False
 383:          Timer1.Stop()
 384:      End Sub
 385:   
 386:      Private Sub Timer1_Elapsed(sender As Object, e As ElapsedEventArgs) Handles Timer1.Elapsed
 387:          UpdateTimer(Timer1.Interval)
 388:      End Sub
 389:   
 390:   

6. Resizable progressbar.

Each parser usually have ProgressBar.



  77:              ToolStripProgressBar1.Maximum = (RowCount + 1) * 3
  78:              ToolStripProgressBar1.Style = ProgressBarStyle.Blocks
  ...   
 298:      Public Sub UpdateProgress()
 299:          Me.InvokeOnUiThreadIfRequired(Sub()
 300:                                            If ToolStripProgressBar1.Maximum > ToolStripProgressBar1.Value Then
 301:                                                ToolStripProgressBar1.Value += 1
 302:                                            Else
 303:                                                ToolStripProgressBar1.Maximum += 1
 304:                                            End If
 305:                                        End Sub)
 306:          Me.InvokeOnUiThreadIfRequired(Sub() StatusStrip1.Refresh())
 307:      End Sub
 308:      Public Sub IncreaseProgress()
 309:          Me.InvokeOnUiThreadIfRequired(Sub() ToolStripProgressBar1.Maximum += 1)
 310:          Me.InvokeOnUiThreadIfRequired(Sub() StatusStrip1.Refresh())
 311:      End Sub
  ...   
 370:      Private Sub StatusStrip1_Resize(sender As Object, e As EventArgs) Handles StatusStrip1.Resize
 371:          ToolStripProgressBar1.Size = New Size(StatusStrip1.Width - 20, 16)
 372:      End Sub
  ...   

7. Backgroundworker.


Usually any continuous operation in Windows desktop program doing in background worker.


  85:      Dim WithEvents BGW1 As BackgroundWorker
  86:      Dim BGW1_Prm As New Object
  87:   
  88:      Private Sub StartButton_Click(sender As Object, e As EventArgs) Handles StartButton.Click
  89:          ButtonSelect.Enabled = False
  90:          StartButton.Enabled = False
  91:          SaveButton.Enabled = True
  92:          Label2.Enabled = True
  93:          BGW1 = New BackgroundWorker
  94:          BGW1_Prm = New With {.StartDelay = StartDelayNumericUpDown.Value, .Thread = ThreadNumericUpDown.Value, .RepeateDelay = RepeateDelayNumericUpDown.Value}
  95:          BGW1.RunWorkerAsync(BGW1_Prm)
  96:      End Sub
 ....   
 100:      Private Sub BGW1_DoWork(sender As Object, e As DoWorkEventArgs) Handles BGW1.DoWork
 ....   
 105:          .... call AddressOf ProcessOneRow)
 106:      End Sub
 ....   
 289:      Private Sub BGW1_RunWorkerCompleted(sender As Object, e As RunWorkerCompletedEventArgs) Handles BGW1.RunWorkerCompleted
 290:          ExcelPackage.SaveAs(FI)
 291:          StopTimer()
 292:          AutoClosingMessageBox.Show("Done.")
 293:          ButtonSelect.Enabled = True
 294:          StartButton.Enabled = True
 295:          ToolStripProgressBar1.Value = 0
 296:      End Sub

8. Multithreading by Parallel.


 105:          Dim W = Parallel.For(2, RowCount + 1, MaxThread, AddressOf ProcessOneRow)

In the screen below you can see process of thread degradation. I start from 50 thread, during processing request/response to network in the middle of execution program half of process was terminated, and before the finish still alive just only one thread.



And this is memory increase process and mymory utilization process by Global Catalog.



9. Control of Parallel count threading.

Integer)
 100:      Private Sub BGW1_DoWork(sender As Object, e As DoWorkEventArgs) Handles BGW1.DoWork
 101:          Dim MaxThread As New ParallelOptions()
 102:          MaxThread.MaxDegreeOfParallelism = e.Argument.Thread
 ....   
 105:          Dim W = Parallel.For(2, RowCount + 1, MaxThread, AddressOf ProcessOneRow)
 111:      End Sub
 112:   
 ....   
 115:      Sub ProcessOneRow(CurRow As Integer, ThreadState As ParallelLoopState)
 ....   

10. Shared ThreadSafe Value.


   1:  Public Class SharedThreadSafeValue(Of T)
   2:      Private _Value As T
   3:      Private readerWriterLock As Threading.ReaderWriterLockSlim = New Threading.ReaderWriterLockSlim()
   4:   
   5:      Public Sub New(ByVal Val As T)
   6:          _Value = Val
   7:      End Sub
   8:   
   9:      Public Property Value As T
  10:          Get
  11:              readerWriterLock.EnterReadLock()
  12:   
  13:              Try
  14:                  Return _Value
  15:              Finally
  16:                  readerWriterLock.ExitReadLock()
  17:              End Try
  18:          End Get
  19:          Set(ByVal Val As T)
  20:              readerWriterLock.EnterWriteLock()
  21:   
  22:              Try
  23:   
  24:                  If Not _Value.Equals(Val) Then
  25:                      _Value = Val
  26:                  End If
  27:   
  28:              Finally
  29:                  readerWriterLock.ExitWriteLock()
  30:              End Try
  31:          End Set
  32:      End Property
  33:  End Class

11. Full code of started multithreading in DoWork.


  98:      Dim StartDelay As SharedThreadSafeValue(Of Integer)
  99:      Dim RepeateDelay As SharedThreadSafeValue(Of Integer)
 100:      Private Sub BGW1_DoWork(sender As Object, e As DoWorkEventArgs) Handles BGW1.DoWork
 101:          Dim MaxThread As New ParallelOptions()
 102:          MaxThread.MaxDegreeOfParallelism = e.Argument.Thread
 103:          StartDelay = New SharedThreadSafeValue(Of Integer)(e.Argument.StartDelay)
 104:          RepeateDelay = New SharedThreadSafeValue(Of Integer)(e.Argument.RepeateDelay)
 105:          Dim W = Parallel.For(2, RowCount + 1, MaxThread, AddressOf ProcessOneRow)
 106:      End Sub
 107:   

12. Update WinForms from another thread.

In this case I use patten described in page Оновлення StatusLabel з потоку BackGroundWorker - приклад застосування Action, Delegate, Invoke, AddressOf, Extension, Expression.. Alternative solution I describe in page Mutlithreading InfoMessagBox by ConcurrentQueue, Interlocked and Timer.


   1:  Imports System.Runtime.CompilerServices
   2:   
   3:  Module StatusLabelExtension
   4:      <Extension()>
   5:      Sub InvokeOnUiThreadIfRequired(ByVal control As Control, ByVal action As Action)
   6:          If control.InvokeRequired Then
   7:              control.BeginInvoke(action)
   8:          Else
   9:              action.Invoke()
  10:          End If
  11:      End Sub
  12:  End Module

 314:      Public Sub UpdateTimer(Value As Integer)
 315:          Me.InvokeOnUiThreadIfRequired(Sub() TimeLabel.Text += Timer1.Interval / 1000)
 316:          Me.InvokeOnUiThreadIfRequired(Sub() TimeLabel.Refresh())
 317:      End Sub
 318:   
 319:      Public Sub UpdateLabel(RowNumber As String)
 320:          Me.InvokeOnUiThreadIfRequired(Sub() RowCountLabel.Text = RowNumber)
 321:          Me.InvokeOnUiThreadIfRequired(Sub() RowCountLabel.Refresh())
 322:      End Sub
 323:   
 324:      Public Sub InfoMessage1(Msg As String)
 325:          Me.InvokeOnUiThreadIfRequired(Sub() lErr1.Text = Msg)
 326:          Me.InvokeOnUiThreadIfRequired(Sub() lErr1.Refresh())
 327:      End Sub
 328:   
 329:      Public Sub InfoMessage2(Msg As String)
 330:          Me.InvokeOnUiThreadIfRequired(Sub() lErr2.Text = Msg)
 331:          Me.InvokeOnUiThreadIfRequired(Sub() lErr2.Refresh())
 332:      End Sub
 333:   
 334:      Public Sub InfoMessage3(Msg As String)
 335:          Me.InvokeOnUiThreadIfRequired(Sub() lErr3.Text = Msg)
 336:          Me.InvokeOnUiThreadIfRequired(Sub() lErr3.Refresh())
 337:      End Sub

13. Parse Json by Newtonsoft.Json package.

This is JSON in this program.


   1:  {
   2:     "products": [
   3:        {
   4:           "desc": "10 In. All Surface Wash Brush",
   5:           "url": "/pd/CRAFTSMAN-10-In-All-Surface-Wash-Brush/1000622473",
   6:           "brand": "CRAFTSMAN",
   7:           "itemNumber": "805817",
   8:           "vendorNumber": "302",
   9:           "modelNumber": "CMXMLBA7310A",
  10:           "imgName": "049206143470",
  11:           "productId": "1000622473",
  12:           "rating": 4.7797,
  13:           "reviewCount": 59,
  14:           "imgUrl": "http://images.lowes.com/product/converted/049206/049206143470sm.jpg",
  15:           "name": "CRAFTSMAN 10 In. All Surface Wash Brush"
  16:        },
  17:        {
  18:           "desc": "Ez Siding and House Wash Pressure Washer",
  19:           "url": "/pd/Mold-Armor-Ez-Siding-and-House-Wash-Pressure-Washer/1000731526",
  20:           "brand": "Mold Armor",
  21:           "itemNumber": "1150577",
  22:           "vendorNumber": "78",
  23:           "modelNumber": "FG581",
  24:           "imgName": "075919005811",
  25:           "productId": "1000731526",
  26:           "rating": 4.5455,
  27:           "reviewCount": 99,
  28:           "imgUrl": "http://images.lowes.com/product/converted/075919/075919005811sm.jpg",
  29:           "name": "Mold Armor Ez Siding and House Wash Pressure Washer"
  30:        },
  31:        {
  32:           "desc": "Deck Wash Kit",
  33:           "url": "/pd/Arnold-Deck-Wash-Kit/50119657",
  34:           "brand": "Arnold",
  35:           "itemNumber": "543172",
  36:           "vendorNumber": "442",
  37:           "modelNumber": "490-900-M061",
  38:           "imgName": "037049954339",
  39:           "productId": "50119657",
  40:           "rating": 4.4792,
  41:           "reviewCount": 48,
  42:           "imgUrl": "http://images.lowes.com/product/converted/037049/037049954339sm.jpg",
  43:           "name": "Arnold Deck Wash Kit"
  44:        }
  45:     ]
  46:  }


This is begining of this parser, some first strings of code.



And this is fragment of real finally code with using JSON parse - look at string 92,93,97.


   1:  Imports Newtonsoft.Json
   2:  Imports Newtonsoft.Json.Linq
   3:  Imports CsQuery
   4:  Imports System.Text.RegularExpressions
   5:   
   6:  Partial Public Class StartForm
   7:   
   8:      Public Shared Function BindIPEndPoint1(ByVal servicePoint As Net.ServicePoint, ByVal remoteEndPoint As Net.IPEndPoint, ByVal retryCount As Integer) As Net.IPEndPoint
   9:          If retryCount = 0 Then
  10:              ServerIPAddr = New SharedThreadSafeValue(Of String)(remoteEndPoint.Address.ToString)
  11:              StartForm_Instance.ShowServerIP(remoteEndPoint.Address.ToString)
  12:          End If
  13:          Return remoteEndPoint
  14:      End Function
  15:   
  16:      Dim StartDelay As SharedThreadSafeValue(Of Integer)
  17:      Dim RepeateDelay As SharedThreadSafeValue(Of Integer)
  18:      Dim ResponseTimeout As SharedThreadSafeValue(Of Integer)
  19:      Dim ResultCount As SharedThreadSafeValue(Of Integer)
  20:      Shared ServerIPAddr As SharedThreadSafeValue(Of String)
  21:   
  22:      Sub ProcessOneRow(CurRow As Integer, ThreadState As ParallelLoopState)
  23:          Try
  24:              Dim RepeateRequest1 As Integer = 0, RepeateRequest2 As Integer = 0, RepeateRequest3 As Integer = 0
  25:   
  26:              If String.IsNullOrEmpty(WorkSheet.Cells(CurRow, ColumnS(ColumnType.ITEM)).Value) Or String.IsNullOrEmpty(WorkSheet.Cells(CurRow, ColumnS(ColumnType.STORE)).Value) Then
  27:                  Exit Sub
  28:              End If
  29:   
  30:              ShowAliveThread(Process.GetCurrentProcess().Threads.Count)
  31:              InfoMessage("Start thread=" & System.Threading.Thread.CurrentThread.ManagedThreadId & ", row=" & CurRow)
  32:              Debug.Print("Start thread=" & System.Threading.Thread.CurrentThread.ManagedThreadId & ", row=" & CurRow)
  33:              Threading.Thread.Sleep(1000 * StartDelay.Value)
  34:              Dim Loader As New Wcf_Client()
  35:   
  36:              Dim RowStartTime As Long = Now.Ticks
  37:              WorkSheet.Cells(CurRow, ColumnS(ColumnType.TS_MODEL)).Value = Now()
  38:              WorkSheet.Cells(CurRow, ColumnS(ColumnType.TS_PRODUCT_TITLE)).Value = ""
  39:              WorkSheet.Cells(CurRow, ColumnS(ColumnType.TS_RETAIL)).Value = ""
  40:              WorkSheet.Cells(CurRow, ColumnS(ColumnType.TS_INVENTORY)).Value = ""
  41:              WorkSheet.Cells(CurRow, ColumnS(ColumnType.TS_STORENAME)).Value = ""
  42:              WorkSheet.Cells(CurRow, ColumnS(ColumnType.TS_URL)).Value = ""
  43:              WorkSheet.Cells(CurRow, ColumnS(ColumnType.TS_LEADTIME)).Value = "Working"
  44:   
  45:   
  46:  Request1:
  47:              Dim URL1 As String = "https://www.lowes.com/LowesSearchServices/resources/typeahead/v1_1?searchTerm=" &
  48:              WorkSheet.Cells(CurRow, ColumnS(ColumnType.ITEM)).Value &
  49:              "&maxArticles=20&maxBrands=0&maxDepts=0&maxProducts=20&store=" &
  50:              WorkSheet.Cells(CurRow, ColumnS(ColumnType.STORE)).Value &
  51:              "&isRemoteLocation=0"
  52:              WorkSheet.Cells(CurRow, ColumnS(ColumnType.TS_URL1)).Value = URL1
  53:   
  54:              Dim Cookie1 As Specialized.NameValueCollection
  55:   
  56:              Dim Response1 As ResponseResult = Loader.GetRequestStrAsync(URL1,, CurRow, RepeateRequest1, ResponseTimeout.Value)
  57:              Dim JsonHtml As String = Response1.HTML
  58:   
  59:              Dim mySP As Net.ServicePoint = Net.ServicePointManager.FindServicePoint(New Uri(URL1))
  60:              mySP.BindIPEndPointDelegate = New Net.BindIPEndPoint(AddressOf BindIPEndPoint1)
  61:   
  62:              WorkSheet.Cells(CurRow, ColumnS(ColumnType.TS_Json1)).Value = JsonHtml
  63:   
  64:              If JsonHtml = "Unable to connect to the remote server" Or JsonHtml = "The remote server returned an error: (500) Internal Server Error." Or JsonHtml = "" Or JsonHtml = "{}" Then
  65:   
  66:                  Dim ResponseState1 As RequestState = Response1.AsyncResult.AsyncState
  67:                  Dim ResponseHeaders1 As Specialized.NameValueCollection = ResponseState1.ResponseHeader()
  68:                  Dim RespSetCookie1 As String = ResponseHeaders1("Set-Cookie")
  69:                  If Not String.IsNullOrEmpty(RespSetCookie1) Then
  70:                      Cookie1 = New Specialized.NameValueCollection
  71:                      Cookie1.Add("Cookie", RespSetCookie1)
  72:                  End If
  73:                  Dim ErrMsg1 As String = "", StatusCode1 As String = ""
  74:                  If Not String.IsNullOrEmpty(ResponseState1.ErrorMessage) Then ErrMsg1 = ResponseState1.ErrorMessage
  75:                  If Not String.IsNullOrEmpty(ResponseState1.ResponseStatusCode) Then StatusCode1 = "HttpStatus=" & ResponseState1.ResponseStatusCode
  76:   
  77:                  WorkSheet.Cells(CurRow, ColumnS(ColumnType.TS_LEADTIME)).Value = "Unable to connect to the remote server (1), " & "Thread=" & Threading.Thread.CurrentThread.ManagedThreadId & ", Row=" & CurRow & " " & StatusCode1 & " " & ErrMsg1
  78:                  InfoMessage("Unable to connect to the remote server (1), Thread=" & Threading.Thread.CurrentThread.ManagedThreadId & ", Row=" & CurRow & ", Restarted (" & RepeateRequest1 & "). " & " " & StatusCode1 & " " & ErrMsg1)
  79:                  Debug.Print("Unable to connect to the remote server (1), Thread=" & Threading.Thread.CurrentThread.ManagedThreadId & ", Row=" & CurRow & ", Restarted (" & RepeateRequest1 & "). " & " " & StatusCode1 & " " & ErrMsg1)
  80:                  'IncreaseProgress()
  81:                  Loader = Nothing
  82:                  Threading.Thread.Sleep(1000 * RepeateDelay.Value)
  83:                  Loader = New Wcf_Client()
  84:                  RepeateRequest1 += 1
  85:                  If RepeateRequest1 < 10 Then GoTo Request1 Else Exit Sub
  86:              End If
  87:   
  88:              WorkSheet.Cells(CurRow, ColumnS(ColumnType.TS_URL)).Value = GetDuration(RowStartTime)
  89:              UpdateProgress()
  90:   
  91:              Dim Json As JObject = JObject.Parse(JsonHtml)
  92:              For Each One In Json("products")
  93:                  If One("itemNumber").Value(Of String).Trim = WorkSheet.Cells(CurRow, ColumnS(ColumnType.ITEM)).Value.ToString.Trim Then
  94:  Request2:
  95:                      Dim URL2 As String = "https://www.lowes.com" & One("url").Value(Of String)
  96:   
  97:                      WorkSheet.Cells(CurRow, ColumnS(ColumnType.Model)).Value = One("modelNumber").Value(Of String)
  98:                      WorkSheet.Cells(CurRow, ColumnS(ColumnType.URL)).Value = URL2
  99:   
 100:                      Dim Cookie2 As String = ""
 101:   
 102:                      Dim Headers As New Specialized.NameValueCollection
 103:                      Headers.Add("Cookie", "sn=" & WorkSheet.Cells(CurRow, ColumnS(ColumnType.STORE)).Value & Cookie2)
 104:                      Dim Response2 As ResponseResult = Loader.GetRequestStrAsync(URL2, Headers, CurRow, RepeateRequest2, ResponseTimeout.Value)
 105:                      Dim HTML As String = Response2.HTML
 106:   
 107:                      If HTML = "Unable to connect to the remote server" Or HTML = "The remote server returned an error: (500) Internal Server Error." Or HTML = "" Then
 108:   
 109:                          Dim ResponseState2 As RequestState = Response2.AsyncResult.AsyncState
 110:                          Dim ResponseHeaders2 As Specialized.NameValueCollection = ResponseState2.ResponseHeader()
 111:                          Dim RespSetCookie2 As String = ResponseHeaders2("Set-Cookie")
 112:                          If Not String.IsNullOrEmpty(RespSetCookie2) Then
 113:                              Cookie2 = ";" & RespSetCookie2
 114:                          End If
 115:                          Dim ErrMsg2 As String = "", StatusCode2 As String = ""
 116:                          If Not String.IsNullOrEmpty(ResponseState2.ErrorMessage) Then ErrMsg2 = ResponseState2.ErrorMessage
 117:                          If Not String.IsNullOrEmpty(ResponseState2.ResponseStatusCode) Then StatusCode2 = "HttpStatus=" & ResponseState2.ResponseStatusCode
 118:   
 119:                          WorkSheet.Cells(CurRow, ColumnS(ColumnType.TS_LEADTIME)).Value = "Unable to connect to the remote server (2), " & "Thread=" & Threading.Thread.CurrentThread.ManagedThreadId & ", Row=" & CurRow & " " & StatusCode2 & " " & ErrMsg2
 120:                          InfoMessage("Unable to connect to the remote server (2), Thread=" & Threading.Thread.CurrentThread.ManagedThreadId & ", Row=" & CurRow & ", Restarted (" & RepeateRequest2 & "). " & " " & StatusCode2 & " " & ErrMsg2)
 121:                          Debug.Print("Unable to connect to the remote server (2), Thread=" & Threading.Thread.CurrentThread.ManagedThreadId & ", Row=" & CurRow & ", Restarted (" & RepeateRequest2 & "). " & " " & StatusCode2 & " " & ErrMsg2)
 122:                          'IncreaseProgress()
 123:                          Loader = Nothing
 124:                          Threading.Thread.Sleep(1000 * RepeateDelay.Value)
 125:                          Loader = New Wcf_Client()
 126:                          RepeateRequest2 += 1
 127:                          If RepeateRequest2 < 10 Then GoTo Request2 Else Exit Sub
 128:                      End If
 129:   
 130:                      WorkSheet.Cells(CurRow, ColumnS(ColumnType.TS_STORENAME)).Value = GetDuration(RowStartTime)
 131:                      UpdateProgress()
 132:   
 133:                      Dim HtmlDom As CQ = HTML
 134:   
 135:                      Dim EleItem = HtmlDom("input[name='productId']")
 136:                      If EleItem.Count > 0 Then
 137:                          WorkSheet.Cells(CurRow, ColumnS(ColumnType.PRODUCT_TITLE)).Value = HtmlDom("input[name='productId']").Attr("data-productname")
 138:                          WorkSheet.Cells(CurRow, ColumnS(ColumnType.RETAIL)).Value = HtmlDom("input[name='productId']").Attr("data-productprice")
 139:                      End If
 140:   
 141:                      WorkSheet.Cells(CurRow, ColumnS(ColumnType.TS_PRODUCT_TITLE)).Value = GetDuration(RowStartTime)
 142:   
 143:                      Dim StPos As Integer = InStr(HTML, """pickup"":")
 144:                      If StPos > 0 Then
 145:                          StPos = StPos + Len("""pickup"":")
 146:                          Dim EPos As Integer = InStr(StPos, HTML, "},")
 147:                          If EPos > 0 Then
 148:                              Dim FStr As String = Mid(HTML, StPos, EPos + Len("}}}];") - StPos - 4)
 149:                              Dim Json3 As JObject = JObject.Parse(FStr)
 150:   
 151:                              WorkSheet.Cells(CurRow, ColumnS(ColumnType.INVENTORY_AT_STORE)).Value = Json3("availabileQuantity").Value(Of Integer)
 152:   
 153:                              WorkSheet.Cells(CurRow, ColumnS(ColumnType.TS_RETAIL)).Value = GetDuration(RowStartTime)
 154:                          End If
 155:                      End If
 156:   
 157:   
 158:                      Dim ItemStore = HtmlDom(".pd-item-actions.grid-100.grid-parent")
 159:   
 160:                      If ItemStore.Count > 0 Then
 161:                          Dim Strong As CsQuery.CQ = ItemStore.Find("strong")
 162:                          If Strong.Count = 2 Then
 163:                              WorkSheet.Cells(CurRow, ColumnS(ColumnType.StoreName)).Value = ItemStore.Find("strong")(1).InnerText
 164:                          Else
 165:                              GoTo Request3
 166:                          End If
 167:                      Else
 168:  Request3:
 ....   

14. Send Cookie to request.


 180:                      Dim Headers As New Specialized.NameValueCollection
 181:                      Headers.Add("Cookie", "sn=" & WorkSheet.Cells(CurRow, ColumnS(ColumnType.STORE)).Value)
 182:                      Dim HTML As String = Loader.GetRequestStrAsync("https://www.lowes.com" & One("url").Value(Of String), Headers, CurRow)

15. Parse HTML by CsQuery with jQuery Selector.

 200:                      Dim HtmlDom As CQ = HTML
 201:   
 202:                      Dim EleItem = HtmlDom("input[name='productId']")
 203:                      If EleItem.Count > 0 Then
 204:                          WorkSheet.Cells(CurRow, ColumnS(ColumnType.PRODUCT_TITLE)).Value = HtmlDom("input[name='productId']").Attr("data-productname")
 205:                          WorkSheet.Cells(CurRow, ColumnS(ColumnType.RETAIL)).Value = HtmlDom("input[name='productId']").Attr("data-productprice")
 206:                      End If
 207:   

16. Parse data in JavaScript by Regular expression or string operations.

Most interesting data usually contains in Javascript. This data can be parses by Regular expression and string operation. In this program I use string operations.In this case I use string operations.



17. GetRequestStrAsync - require HTML by AsyncCallback.

This is small part of code fragment of my class WCF_Client I described in 2010 - WCF_CLIENT - клиент Web-сервиса. How it working?

First component is class to pass thread number of parallel process to asynchronous task of reading HTML. This is parameter CustomData in string 391.


   1:  Imports System.Net
   2:   
   3:  Public Class RequestState
   4:   
   5:      Public Property StringBuilder As System.Text.StringBuilder
   6:      Public Property BufferRead As Byte()
   7:      Public Property Request As Net.HttpWebRequest
   8:      Public Property ResponseStream As IO.Stream
   9:      Public Property StreamDecode As System.Text.Decoder
  10:      Public Property ErrorMessage As String
  11:      Public Property ParentThreadID As Integer
  12:      Public Property CallbackThreadID As Integer
  13:      Public Property CustomData As String
  14:      Public Property ErrorRepeateCount As Integer
  15:      Public Property ResponseHeader As Specialized.NameValueCollection
  16:      Public Property ResponseStatusCode As String
  17:      Public Property ResponseContentLength As Integer
  18:      Public Property ResponseContentType As String
  19:      Public Property ResponseContentEncoding As String
  20:      Public Property ResponseCharacterSet As String
  21:   
  22:      Public Sub New(ByVal BufSizeInBytes As Integer, ByVal Encoder As Wcf_Client.PostRequestEncode)
  23:          Request = Nothing
  24:          ResponseStream = Nothing
  25:          StringBuilder = New System.Text.StringBuilder("")
  26:          If Encoder = Wcf_Client.PostRequestEncode.ASCII Then
  27:              StreamDecode = System.Text.ASCIIEncoding.GetEncoding("windows-1251").GetDecoder
  28:          ElseIf Encoder = Wcf_Client.PostRequestEncode.UTF16 Then
  29:              StreamDecode = System.Text.UnicodeEncoding.Unicode.GetDecoder
  30:          ElseIf Encoder = Wcf_Client.PostRequestEncode.UTF8 Then
  31:              StreamDecode = System.Text.UTF8Encoding.UTF8.GetDecoder
  32:          ElseIf Encoder = Wcf_Client.PostRequestEncode.UTF7 Then
  33:              StreamDecode = System.Text.UTF7Encoding.UTF7.GetDecoder
  34:          ElseIf Encoder = Wcf_Client.PostRequestEncode.UTF32 Then
  35:              StreamDecode = System.Text.UTF32Encoding.UTF32.GetDecoder
  36:          End If
  37:          Dim buf1(BufSizeInBytes) As Byte
  38:          BufferRead = buf1
  39:      End Sub
  40:  End Class
  41:   
  42:  Public Class ResponseResult
  43:      Property HTML As String
  44:      Property ErrorMessage As String
  45:      Property AsyncResult As IAsyncResult
  46:      Property IsAllDoneExit As Boolean
  47:  End Class
  48:   
  49:  Public Class Wcf_Client
  50:   
  51:      Public Enum PostRequestEncode
  52:          ASCII = 1
  53:          UTF7 = 2
  54:          UTF8 = 3
  55:          UTF16 = 4
  56:          UTF32 = 5
  57:      End Enum
  58:   
  59:      Public Enum PostRedirectEncode
  60:          ASCII_Windows1251 = 1
  61:          UTF8 = 2
  62:          UTF16 = 3
  63:          ISO_8859_5 = 4
  64:      End Enum
  65:   
  66:      Public Property UserAgent As String = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36"
  67:      Public AllDone As Threading.ManualResetEvent
  68:      Dim BUFFER_SIZE As Integer = 10000
  69:   
  70:   
  71:      Public Overridable Function GetRequestStrAsync(ByVal URL As String, Optional Headers As Specialized.NameValueCollection = Nothing, Optional CustomData As String = Nothing, Optional ErrorRepeateCount As Integer = 0, Optional TimeOutSec As Integer = 60, Optional exitContext As Boolean = False, Optional ByVal ResponseEncoding As PostRequestEncode = PostRequestEncode.ASCII, Optional ByVal Full_ProxyURL As String = "") As ResponseResult
  72:          Dim RS As RequestState = New RequestState(BUFFER_SIZE, ResponseEncoding)
  73:   
  74:          Try
  75:              AllDone = New Threading.ManualResetEvent(False)
  76:   
  77:              '========== System.NotSupportedException The URI prefix is not recognized.
  78:              Dim Request As Net.HttpWebRequest = Net.HttpWebRequest.Create(URL)
  79:   
  80:              If Headers IsNot Nothing Then
  81:                  Request.Headers.Add(Headers)
  82:              End If
  83:   
  84:              Request.UserAgent = UserAgent
  85:              Request.Method = "GET"
  86:              If Full_ProxyURL <> "" Then
  87:                  Dim MyProxy As New Net.WebProxy
  88:                  MyProxy.Address = New Uri(Full_ProxyURL)
  89:                  Request.Proxy = MyProxy
  90:              End If
  91:   
  92:              ' Put the request into the state so it can be passed around.  
  93:              RS.Request = Request
  94:              RS.ParentThreadID = System.Threading.Thread.CurrentThread.ManagedThreadId
  95:              RS.CustomData = CustomData
  96:              RS.ErrorRepeateCount = ErrorRepeateCount
  97:              RS.ResponseHeader = New Specialized.NameValueCollection
  98:   
  99:              'Issue the async request.  
 100:              Dim AsyncResult As IAsyncResult = CType(Request.BeginGetResponse(
 101:                      New AsyncCallback(AddressOf RespCallback), RS), IAsyncResult)
 102:   
 103:              ' Wait until the ManualResetEvent is set so that the application does not exit until after the callback is called.  
 104:              Dim IsExit As Boolean = AllDone.WaitOne((New TimeSpan(0, 0, 0, TimeOutSec, 0)), exitContext)
 105:   
 106:              Return New ResponseResult With {.HTML = RS.StringBuilder.ToString, .ErrorMessage = RS.ErrorMessage, .AsyncResult = AsyncResult, .IsAllDoneExit = IsExit}
 107:   
 108:          Catch ex As Exception
 109:              StartForm_Instance.InfoMessage(ex.Message)
 110:              Return New ResponseResult With {.HTML = RS.StringBuilder.ToString, .ErrorMessage = ex.Message}
 111:          End Try
 112:   
 113:      End Function
 114:   
 115:      Sub RespCallback(ByVal ar As IAsyncResult)
 116:          ' Get the RequestState object from the async result
 117:          Dim RS As RequestState = CType(ar.AsyncState, RequestState)
 118:   
 119:          Debug.Print("Callback in thread =" & RS.ParentThreadID)
 120:   
 121:          Try
 122:              ' Get the HttpWebRequest from RequestState.  
 123:              Dim Req As Net.HttpWebRequest = RS.Request
 124:   
 125:              ' Call EndGetResponse, which returns the HttpWebResponse object that came from the request issued above.  
 126:              Dim Resp As Net.HttpWebResponse = CType(Req.EndGetResponse(ar), Net.HttpWebResponse)
 127:   
 128:              If Resp.ContentEncoding = "" Then
 129:                  RS.StreamDecode = Text.ASCIIEncoding.GetEncoding("windows-1251").GetDecoder
 130:              End If
 131:   
 132:              RS.ResponseContentLength = Resp.ContentLength
 133:              RS.ResponseContentType = Resp.ContentType
 134:              RS.ResponseContentEncoding = Resp.ContentEncoding
 135:              RS.ResponseCharacterSet = Resp.CharacterSet
 136:              RS.ResponseStatusCode = Resp.StatusCode
 137:              RS.CallbackThreadID = Threading.Thread.CurrentThread.ManagedThreadId
 138:              For Each OneKey As String In Resp.Headers.AllKeys
 139:                  RS.ResponseHeader.Add(OneKey, Resp.GetResponseHeader(OneKey))
 140:              Next
 141:   
 142:              ' Start reading data from the respons stream. 
 143:              '============= The remote server returned an error: (407) Proxy Authentication Required. ==========
 144:              Dim ResponseStream As IO.Stream = Resp.GetResponseStream()
 145:   
 146:              ' Store the reponse stream in RequestState to read the stream asynchronously.  
 147:              RS.ResponseStream = ResponseStream
 148:   
 149:              ' Pass rs.BufferRead to BeginRead. Read data into rs.BufferRead.  
 150:              Dim iarRead As IAsyncResult =
 151:                 ResponseStream.BeginRead(RS.BufferRead, 0, BUFFER_SIZE,
 152:                 New AsyncCallback(AddressOf ReadCallBack), RS)
 153:   
 154:          Catch ex As Exception
 155:              StartForm_Instance.InfoMessage(ex.Message)
 156:              RS.ErrorMessage = ex.Message
 157:              AllDone.Set()
 158:          End Try
 159:   
 160:      End Sub
 161:   
 162:      Sub ReadCallBack(ByVal asyncResult As IAsyncResult)
 163:   
 164:          StartForm_Instance.InfoMessage("Thread=" & CType(asyncResult.AsyncState, RequestState).ParentThreadID & ", row=" & CType(asyncResult.AsyncState, RequestState).CustomData & " - data received.")
 165:   
 166:          ' Get the RequestState object from the AsyncResult.  
 167:          Dim rs As RequestState = CType(asyncResult.AsyncState, RequestState)
 168:   
 169:          ' Retrieve the ResponseStream that was set in RespCallback.  
 170:          Dim responseStream As IO.Stream = rs.ResponseStream
 171:   
 172:          ' Read rs.BufferRead to verify that it contains data.  
 173:          Dim read As Integer = responseStream.EndRead(asyncResult)
 174:          If read > 0 Then
 175:              ' Prepare a Char array buffer for converting to Unicode.  
 176:              Dim charBuffer(rs.BufferRead.Count) As Char
 177:   
 178:              ' Convert byte stream to Char array and then String. len contains the number of characters converted to Unicode.  
 179:              Dim len As Integer =
 180:                rs.StreamDecode.GetChars(rs.BufferRead, 0, read, charBuffer, 0)
 181:              Dim str As String = New String(charBuffer, 0, len)
 182:   
 183:              ' Append the recently read data to the RequestData stringbuilder  object contained in RequestState.  
 184:              rs.StringBuilder.Append(str)
 185:   
 186:              ' Continue reading data until responseStream.EndRead returns –1.  
 187:              Dim ar As IAsyncResult =
 188:                 responseStream.BeginRead(rs.BufferRead, 0, BUFFER_SIZE,
 189:                 New AsyncCallback(AddressOf ReadCallBack), rs)
 190:          Else
 191:   
 192:              ' Close down the response stream.  
 193:              responseStream.Close()
 194:   
 195:              ' Set the ManualResetEvent so the main thread can exit.  
 196:              AllDone.Set()
 197:          End If
 198:   
 199:          Return
 200:      End Sub
 201:   
 202:  End Class

18. Use Profiler to check code performance.

If you use high performance parser, more than 100 thread, this is a very important step of code refactoring.





Comments ( )
<00>  <01>  <02>  <03>  <04>  <05>  <06>  <07>  <08>  <09>  <10>  <11>  <12>  <13>  <14>  <15>  <16>  <17>  <18>  <19
Link to this page: http://www.vb-net.com/MulthreadingParser/Index.htm
<SITEMAP>  <MVC>  <ASP>  <NET>  <DATA>  <KIOSK>  <FLEX>  <SQL>  <NOTES>  <LINUX>  <MONO>  <FREEWARE>  <DOCS>  <ENG>  <MAIL ME>  <ABOUT ME>  < THANKS ME>