(MVC) MVC (2018)

Parse HTML with HtmlAgilityPack (XPath selector) and CsQuery (jQuery selector).

HTML usually has a regular structure: a header, a footer, and a repeated block of content. For example, below is one such block from the site https://www.freelancer.com/. Some parts of the block are optional, for example the element with the class JobSearchCard-primary-heading-status.


<div class="JobSearchCard-item-inner" data-project-card="true">
  <div class="JobSearchCard-primary">
     <div class="JobSearchCard-primary-heading">
        <a href="/projects/software-architecture/appian-developer-needed/" class="JobSearchCard-primary-heading-link"
        data-qtsb-section="page-job-search-new" data-qtsb-subsection="card-job" data-qtsb-label="link-project-title" data-heading-link="true">
        Appian Developer needed
        </a>
        <span class="JobSearchCard-primary-heading-days">6 days left</span>
        <div class="JobSearchCard-primary-heading-status Tooltip--top" data-tooltip="This user has verified their Payment method">
           <span class="Icon is-success">
              <svg class="Icon-image" xmlns="http://www.w3.org/2000/svg" width="24" height="24" viewBox="0 0 24 24">
                 <path fill="none" d="M0 0h24v24H0z"/>
                 <g>
                    <path d="M20 2c0-1.104-.896-2-2-2H2C.897 0 0 .896 0 2v2h20V2zM19 14c.34 0 .668.036.99.09.002-.03.01-.06.01-.09V6H0v8c0 1.102.897 2 2 2h12.537c1.1-1.225 2.69-2 4.463-2zM8 13H3v-2h5v2zm2-3H3V8h7v2zm3-2h4v2h-4V8zM22.293 16.293L18 20.587l-2.293-2.294-1.414 1.413L18 23.416l5.707-5.71"/>
                 </g>
              </svg>
           </span>
           VERIFIED
        </div>
     </div>
     <p class="JobSearchCard-primary-description">
        Looking for a candidate who can work at EDT.
        Need an Appian developer to work with our team to help complete a project. The project goal is to digitize and automate the creation of our clients&#039; project management reports so that project status is available, standardized, and intuitive for decision making and planning purposes.
        Our ideal Appian Developer would have experience with: 
        * Hands ...
     </p>
     <div class="JobSearchCard-primary-tags" data-qtsb-section="page-job-search-new" data-qtsb-subsection="card-job" data-qtsb-label="link-skill">
        <a class="JobSearchCard-primary-tagsLink" href="/jobs/dot-net/">.NET</a>
        <a class="JobSearchCard-primary-tagsLink" href="/jobs/asp-net/">ASP.NET</a>
        <a class="JobSearchCard-primary-tagsLink" href="/jobs/c-sharp-programming/">C# Programming</a>
        <a class="JobSearchCard-primary-tagsLink" href="/jobs/microsoft-sql-server/">Microsoft SQL Server</a>
        <a class="JobSearchCard-primary-tagsLink" href="/jobs/software-architecture/">Software Architecture</a>
     </div>
     <div class="JobSearchCard-primary-hidden">
        <div class="JobSearchCard-primary-price">
           $21 / hr
           <span class="JobSearchCard-primary-avgBid">(Avg Bid)</span>
        </div>
     </div>
  </div>
  <div class="JobSearchCard-secondary">
     <div class="JobSearchCard-secondary-price">
        $21 / hr
        <span class="JobSearchCard-secondary-avgBid">
        Avg Bid                                    </span>
     </div>
     <div class="JobSearchCard-secondary-entry">10 bids</div>
     <div class="JobSearchCard-ctas ">
        <a href="/projects/software-architecture/appian-developer-needed/"
           class="JobSearchCard-ctas-btn btn btn-mini btn-success"
           data-qtsb-section="page-job-search-new"
           data-qtsb-subsection="card-cta-button"
           data-qtsb-label="bid-cta">
        Bid now </a>
     </div>
  </div>
</div>
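Because some parts of the card are optional, the parser has to tolerate a missing node. The following is a minimal sketch of that check, assuming the HtmlAgilityPack NuGet package is installed; the module name OptionalNodeSketch and the function ReadStatus are hypothetical names used only for illustration.

```vbnet
Imports System
Imports HtmlAgilityPack

Module OptionalNodeSketch
    ' Hypothetical helper: returns the trimmed text of the optional status badge,
    ' or Nothing when the card does not contain one.
    Function ReadStatus(card As HtmlNode) As String
        ' SelectSingleNode returns Nothing when the XPath matches no node.
        Dim status = card.SelectSingleNode(".//div[contains(@class,'JobSearchCard-primary-heading-status')]")
        If status Is Nothing Then Return Nothing
        Return HtmlEntity.DeEntitize(status.InnerText).Trim()
    End Function

    Sub Main()
        Dim doc As New HtmlDocument()
        ' Two simplified cards: the first has the status badge, the second does not.
        doc.LoadHtml("<div class='JobSearchCard-item-inner'><div class='JobSearchCard-primary-heading-status Tooltip--top'> VERIFIED </div></div>" &
                     "<div class='JobSearchCard-item-inner'><span class='JobSearchCard-primary-heading-days'>6 days left</span></div>")
        For Each card As HtmlNode In doc.DocumentNode.SelectNodes("//div[@class='JobSearchCard-item-inner']")
            Console.WriteLine(If(ReadStatus(card), "(no status)"))
        Next
    End Sub
End Module
```

The same "check for Nothing before reading" pattern applies to every optional element of the card.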

I parse each task into my own universal structure:



After that my software automatically creates a reply to the tasks that match my skills, with the help of the form below (I describe it on a separate page about DataGridView), and then the parsed task is uploaded automatically to my new project http://www.programmer.expert/Project/Search



For better understanding, here is my DB structure:


Imports System.ComponentModel.DataAnnotations
Namespace Model
    Public Class AllProject
        <Key>
        Property I As Integer

        <Required(AllowEmptyStrings:=False)>
        Property CrDate As DateTime

        <Required(AllowEmptyStrings:=False)>
        Property ProjectType As Integer

        <Required(AllowEmptyStrings:=False)>
        Property ToMySkill As Integer

        Property Checked As Integer?

        <StringLength(9)>
        Property ID As String

        <StringLength(1000)>
        Property Title As String

        Property TXT As String

        Property BidCount As Integer?

        <StringLength(2000)>
        Property CategoryList As String

        <StringLength(255)>
        Property TimeType As String

        <StringLength(255)>
        Property RestTime As String

        <StringLength(50)>
        Property AvgBid As String

        <StringLength(255)>
        Property URL As String

        <StringLength(255)>
        Property BudgetBound As String

        Property Summ As Integer?

        Property HourLeft As Integer?

        <StringLength(2000)>
        Property Category As String

        <StringLength(250)>
        Property Country As String

        <StringLength(250)>
        Property FlagURL As String

        <StringLength(4000)>
        Property Temp As String
    End Class
End Namespace

Below is the real code of my parser, which uses XPath selectors.



Module ReadAndParseFreelancer
    'read one category page; returns the number of job cards found on it
    Function ParseFreelancerPage(ByVal HTML As String, ByVal SkilNum As Integer, ByVal URLSuffix As String, RecNumber As Integer) As Integer

        Dim HAP As New HtmlAgilityPack.HtmlDocument
        Dim JobCount As Integer
        Try
            HAP.LoadHtml(HTML)
            Dim Jobs = HAP.DocumentNode.SelectNodes("//*[@class='JobSearchCard-item-inner']") 'select all job cards
            If Jobs Is Nothing Then
                Return 0 'SelectNodes returns Nothing when no node matches
            End If
            JobCount = Jobs.Count
            Dim db1 As New ParserDBDataContext

            For Num As Integer = 0 To JobCount - 1

                Dim One = Jobs(Num)
                Try
                    '. = select relative to the current node
                    'contains() also matches longer class names, e.g. ...-heading-link and ...-heading-days
                    Dim JobSearchCard_primary_heading = One.SelectNodes(".//*[contains(@class,'JobSearchCard-primary-heading')]")
                    Dim JobSearchCard_primary_heading_days = One.SelectNodes(".//*[contains(@class,'JobSearchCard-primary-heading-days')]")
                    Dim JobSearchCard_primary_description = One.SelectNodes(".//*[contains(@class,'JobSearchCard-primary-description')]")
                    Dim JobSearchCard_primary_price = One.SelectNodes(".//*[contains(@class,'JobSearchCard-primary-price')]")
                    Dim JobSearchCard_secondary_price = One.SelectNodes(".//*[contains(@class,'JobSearchCard-secondary-price')]")
                    Dim JobSearchCard_secondary_entry = One.SelectNodes(".//*[contains(@class,'JobSearchCard-secondary-entry')]")
                    Dim JobSearchCard_primary_tagsLink = One.SelectNodes(".//*[contains(@class,'JobSearchCard-primary-tags')]")

                    Dim NewRecord As New AllProject
                    NewRecord.ToMySkill = SkilNum
                    NewRecord.ProjectType = 1
                    NewRecord.CrDate = Now
                    NewRecord.[FlagURL] = URLSuffix

                    'title and URL come from the <a> element among the heading nodes
                    If JobSearchCard_primary_heading IsNot Nothing Then
                        If JobSearchCard_primary_heading.Count > 0 Then
                            For Each SubNode In JobSearchCard_primary_heading
                                If SubNode.Name = "a" Then
                                    NewRecord.[URL] = URLSuffix & SubNode.Attributes("href").Value
                                    NewRecord.[Title] = SubNode.InnerHtml.Trim()
                                End If
                            Next
                        End If
                    End If

                    'remaining time: "6 days left" or "1 day left" -> hours
                    If JobSearchCard_primary_heading_days IsNot Nothing Then
                        If JobSearchCard_primary_heading_days.Count > 0 Then
                            Dim IntDays As Integer
                            Integer.TryParse(JobSearchCard_primary_heading_days(0).InnerText.Replace(" days left", "").Replace(" day left", "").Trim(), IntDays)
                            NewRecord.[HourLeft] = IntDays * 24
                        End If
                    End If

                    If JobSearchCard_primary_description IsNot Nothing Then
                        If JobSearchCard_primary_description.Count > 0 Then
                            NewRecord.[TXT] = JobSearchCard_primary_description(0).InnerText.Trim
                        End If
                    End If

                    'price: the first child is the text node, e.g. "$21 / hr"; it is stored in both AvgBid and TimeType
                    If JobSearchCard_secondary_price IsNot Nothing Then
                        If JobSearchCard_secondary_price.Count > 0 Then
                            NewRecord.[AvgBid] = JobSearchCard_secondary_price(0).FirstChild.InnerHtml.Trim()
                            NewRecord.[TimeType] = JobSearchCard_secondary_price(0).FirstChild.InnerHtml.Trim()
                            If NewRecord.[AvgBid].Contains("/ hr") Then
                                NewRecord.[Summ] = 0 'hourly rate, no fixed sum
                            Else
                                Dim Intprice As Integer
                                Integer.TryParse(NewRecord.[AvgBid].Replace("$", ""), Intprice)
                                NewRecord.[Summ] = Intprice
                            End If
                        End If
                    End If

                    'bid count: "10 bids" or "1 bid" (the text has a single space before the word)
                    If JobSearchCard_secondary_entry IsNot Nothing Then
                        If JobSearchCard_secondary_entry.Count > 0 Then
                            Dim IntBid As Integer
                            Integer.TryParse(JobSearchCard_secondary_entry(0).InnerText.Replace(" bids", "").Replace(" bid", "").Trim(), IntBid)
                            NewRecord.[BidCount] = IntBid
                        End If
                    End If

                    'tags: node 0 is the container <div> (first in document order), its children are the tag links
                    If JobSearchCard_primary_tagsLink IsNot Nothing Then
                        If JobSearchCard_primary_tagsLink.Count > 0 Then
                            Dim Category As String = ""
                            Dim CategoryNum As String = ""
                            Dim Flag1 As Boolean = False
                            For Each Two In JobSearchCard_primary_tagsLink(0).ChildNodes
                                'whitespace-only text nodes between the links become empty strings and are skipped
                                Dim Cat As String = Two.InnerHtml.Replace(vbCr, "").Replace(vbLf, "").Replace(" ", "")
                                If Not String.IsNullOrEmpty(Cat) Then
                                    Flag1 = True
                                    Category &= Cat & ","
                                    Dim Categories = (From Z In db1.FreelancerCategories Select Z Where Z.Name.ToLower.Trim = Two.InnerHtml.ToLower.Trim).ToList()
                                    If Categories.Count > 0 Then
                                        CategoryNum &= Categories(0).i & ","
                                    End If
                                End If
                            Next
                            If Flag1 Then
                                'TrimEnd is safe even when no category was found in the DB and CategoryNum is empty
                                NewRecord.[Category] = Category.TrimEnd(","c)
                                NewRecord.[CategoryList] = CategoryNum.TrimEnd(","c)
                            End If
                        End If
                    End If

                    db1.AllProjects.InsertOnSubmit(NewRecord)
                    db1.SubmitChanges()

                Catch ex As Exception
                    MsgBox("Freelancer Row # " & RecNumber & vbCrLf & "Project # " & Num & vbCrLf & vbCrLf & ex.Message)
                End Try
            Next
        Catch ex As Exception
            MsgBox("Freelancer Row # " & RecNumber & vbCrLf & ex.Message)
        End Try
        Return JobCount
    End Function
End Module
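The parser above extracts numbers from strings such as "6 days left" and "10 bids" by removing the surrounding words, which depends on the exact wording and spacing of the page. As an alternative, here is a small sketch that takes the first run of digits with a regular expression. It uses only the standard library; the module CardTextSketch and the function FirstInteger are hypothetical names, not part of my parser.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module CardTextSketch
    ' Hypothetical helper: returns the first run of digits in the text as an Integer,
    ' or Nothing when the text contains no digits (or the number does not fit an Integer).
    Function FirstInteger(source As String) As Integer?
        Dim m = Regex.Match(If(source, ""), "[0-9]+")
        Dim value As Integer
        If m.Success AndAlso Integer.TryParse(m.Value, value) Then Return value
        Return Nothing
    End Function

    Sub Main()
        Console.WriteLine(FirstInteger("6 days left"))   ' prints 6
        Console.WriteLine(FirstInteger("1 day left"))    ' prints 1
        Console.WriteLine(FirstInteger("10 bids"))       ' prints 10
        Console.WriteLine(FirstInteger("Bid now").HasValue) ' prints False
    End Sub
End Module
```

A Nothing result lets the caller distinguish "no number on the card" from a real zero, which the TryParse-into-a-plain-Integer approach cannot do.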

An alternative way to parse HTML is to use jQuery selectors. If the page already includes jQuery and the HTML is loaded in a browser control (for example, as in CefSharp.Winforms.ChromiumWebBrowser minimal example on VB.NET (with cookies collector and script executor)), you can use jQuery selectors directly, for example as in Multithreading Parsers with Parallel, CsQuery, Newtonsoft.Json, OfficeOpenXml and IAsyncResult/AsyncCallback, or as in the screenshot.



But if you receive the HTML as a string from a WebResponse, you also need to run "Install-Package CsQuery", just as the previous approach needed "Install-Package HtmlAgilityPack".
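For comparison with the XPath version, here is a minimal sketch of the same kind of selection with CsQuery, assuming the CsQuery NuGet package is installed and the HTML is already available as a string. The module CsQuerySketch, the function ReadTitles, and the shortened sample HTML are hypothetical and only illustrate the jQuery-style calls.

```vbnet
Imports System
Imports System.Collections.Generic
Imports CsQuery

Module CsQuerySketch
    ' Hypothetical helper: returns "title -> href" for every job card in the HTML string.
    Function ReadTitles(html As String) As List(Of String)
        Dim result As New List(Of String)
        Dim dom As CQ = CQ.Create(html)
        Dim cards As CQ = dom.Select("div.JobSearchCard-item-inner")
        For i As Integer = 0 To cards.Length - 1
            ' Eq(i) wraps the i-th match and Find searches inside it, as in jQuery.
            Dim link As CQ = cards.Eq(i).Find("a.JobSearchCard-primary-heading-link")
            If link.Length > 0 Then
                result.Add(link.Text().Trim() & " -> " & link.Attr("href"))
            End If
        Next
        Return result
    End Function

    Sub Main()
        ' Simplified card, not the full markup of the site.
        Dim html = "<div class='JobSearchCard-item-inner'>" &
                   "<a class='JobSearchCard-primary-heading-link' href='/projects/demo/'> Appian Developer needed </a>" &
                   "<div class='JobSearchCard-secondary-entry'>10 bids</div></div>"
        For Each item In ReadTitles(html)
            Console.WriteLine(item)
        Next
    End Sub
End Module
```

Class selectors such as a.JobSearchCard-primary-heading-link match whole class names, so there is no need for the contains(@class, ...) workaround used in the XPath version.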


A third way to parse HTML is possible only if you have XHTML. In that case you can use the built-in Microsoft classes XDocument/XElement; for a template, see for example the page VS2017 Plugins (Resolve Unused References & XPath Tools).
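As an illustration of the XDocument approach, here is a minimal sketch that uses only the standard library. It assumes the input is well-formed XHTML in the standard XHTML namespace (XDocument.Parse throws on ordinary, non-well-formed HTML); the module XhtmlSketch, the function ReadEntries, and the sample markup are hypothetical.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Linq
Imports System.Xml.Linq

Module XhtmlSketch
    ' Hypothetical helper: returns the text of every <div class="JobSearchCard-secondary-entry">.
    Function ReadEntries(xhtml As String) As List(Of String)
        Dim doc As XDocument = XDocument.Parse(xhtml)
        ' XHTML elements live in this namespace, so element names must be qualified with it.
        Dim ns As XNamespace = "http://www.w3.org/1999/xhtml"
        Dim query = From d In doc.Descendants(ns.GetName("div"))
                    Where d.Attribute("class")?.Value = "JobSearchCard-secondary-entry"
                    Select d.Value.Trim()
        Return query.ToList()
    End Function

    Sub Main()
        Dim xhtml = "<html xmlns='http://www.w3.org/1999/xhtml'><body>" &
                    "<div class='JobSearchCard-secondary-entry'>10 bids</div>" &
                    "<div class='other'>skip me</div></body></html>"
        For Each entry In ReadEntries(xhtml)
            Console.WriteLine(entry)
        Next
    End Sub
End Module
```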


