Parse HTML with HtmlAgilityPack (XPath selector) and CsQuery (jQuery selector).
HTML pages usually have a regular structure: a header, a footer, and a repeatable block of content. For example, below is one such content block from the site https://www.freelancer.com/. Some parts of this block are optional, for example the element with class JobSearchCard-primary-heading-status.
<div class="JobSearchCard-item-inner" data-project-card="true">
    <div class="JobSearchCard-primary">
        <div class="JobSearchCard-primary-heading">
            <a href="/projects/software-architecture/appian-developer-needed/" class="JobSearchCard-primary-heading-link"
               data-qtsb-section="page-job-search-new" data-qtsb-subsection="card-job" data-qtsb-label="link-project-title" data-heading-link="true">
                Appian Developer needed
            </a>
            <span class="JobSearchCard-primary-heading-days">6 days left</span>
            <div class="JobSearchCard-primary-heading-status Tooltip--top" data-tooltip="This user has verified their Payment method">
                <span class="Icon is-success">
                    <svg class="Icon-image" xmlns="http://www.w3.org/2000/svg" width="24" height="24" viewBox="0 0 24 24">
                        <path fill="none" d="M0 0h24v24H0z"/>
                        <g>
                            <path d="M20 2c0-1.104-.896-2-2-2H2C.897 0 0 .896 0 2v2h20V2zM19 14c.34 0 .668.036.99.09.002-.03.01-.06.01-.09V6H0v8c0 1.102.897 2 2 2h12.537c1.1-1.225 2.69-2 4.463-2zM8 13H3v-2h5v2zm2-3H3V8h7v2zm3-2h4v2h-4V8zM22.293 16.293L18 20.587l-2.293-2.294-1.414 1.413L18 23.416l5.707-5.71"/>
                        </g>
                    </svg>
                </span>
                VERIFIED
            </div>
        </div>
        <p class="JobSearchCard-primary-description">
            Looking for a candidate who can work at EDT.
            Need an Appian developer to work with our team to help complete a project. The project goal is to digitize and automate the creation of our clients' project management reports so that project status is available, standardized, and intuitive for decision making and planning purposes.
            Our ideal Appian Developer would have experience with:
            * Hands ...
        </p>
        <div class="JobSearchCard-primary-tags" data-qtsb-section="page-job-search-new" data-qtsb-subsection="card-job" data-qtsb-label="link-skill">
            <a class="JobSearchCard-primary-tagsLink" href="/jobs/dot-net/">.NET</a>
            <a class="JobSearchCard-primary-tagsLink" href="/jobs/asp-net/">ASP.NET</a>
            <a class="JobSearchCard-primary-tagsLink" href="/jobs/c-sharp-programming/">C# Programming</a>
            <a class="JobSearchCard-primary-tagsLink" href="/jobs/microsoft-sql-server/">Microsoft SQL Server</a>
            <a class="JobSearchCard-primary-tagsLink" href="/jobs/software-architecture/">Software Architecture</a>
        </div>
        <div class="JobSearchCard-primary-hidden">
            <div class="JobSearchCard-primary-price">
                $21 / hr
                <span class="JobSearchCard-primary-avgBid">(Avg Bid)</span>
            </div>
        </div>
    </div>
    <div class="JobSearchCard-secondary">
        <div class="JobSearchCard-secondary-price">
            $21 / hr
            <span class="JobSearchCard-secondary-avgBid">
                Avg Bid </span>
        </div>
        <div class="JobSearchCard-secondary-entry">10 bids</div>
        <div class="JobSearchCard-ctas ">
            <a href="/projects/software-architecture/appian-developer-needed/"
               class="JobSearchCard-ctas-btn btn btn-mini btn-success"
               data-qtsb-section="page-job-search-new"
               data-qtsb-subsection="card-cta-button"
               data-qtsb-label="bid-cta">
                Bid now </a>
        </div>
    </div>
</div>

I parse each such task into my own universal structure:
After that, my software automatically creates a reply to the tasks that match my skills, with the help of the form below (I describe it on the page "???????????? ???????? ?????? ???????????? ?? DataGridView"), and the parsed task is then uploaded automatically to my new project http://www.programmer.expert/Project/Search
For better understanding, here is my DB structure:
Imports System.ComponentModel.DataAnnotations
Namespace Model
    Public Class AllProject
        <Key>
        Property I As Integer

        <Required(AllowEmptyStrings:=False)>
        Property CrDate As DateTime

        <Required(AllowEmptyStrings:=False)>
        Property ProjectType As Integer

        <Required(AllowEmptyStrings:=False)>
        Property ToMySkill As Integer

        Property Checked As Integer?

        <StringLength(9)>
        Property ID As String

        <StringLength(1000)>
        Property Title As String

        Property TXT As String

        Property BidCount As Integer?

        <StringLength(2000)>
        Property CategoryList As String

        <StringLength(255)>
        Property TimeType As String

        <StringLength(255)>
        Property RestTime As String

        <StringLength(50)>
        Property AvgBid As String

        <StringLength(255)>
        Property URL As String

        <StringLength(255)>
        Property BudgetBound As String

        Property Summ As Integer?

        Property HourLeft As Integer?

        <StringLength(2000)>
        Property Category As String

        <StringLength(250)>
        Property Country As String

        <StringLength(250)>
        Property FlagURL As String

        <StringLength(4000)>
        Property Temp As String
    End Class
End Namespace
And below you can see the actual code of my parser, which uses XPath selectors.
Module ReadAndParseFreelancer
    'Read one category page and store every job card found on it.
    'Returns the number of job cards found on the page.
    Function ParseFreelancerPage(ByVal HTML As String, ByVal SkilNum As Integer, ByVal URLSuffix As String, RecNumber As Integer) As Integer

        Dim HAP As New HtmlAgilityPack.HtmlDocument
        Dim JobCount As Integer
        Try
            HAP.LoadHtml(HTML)
            Dim Jobs = HAP.DocumentNode.SelectNodes("//*[@class='JobSearchCard-item-inner']") 'select all job cards
            If Jobs Is Nothing Then
                Return 0 'SelectNodes returns Nothing when there is no match
            End If
            JobCount = Jobs.Count
            Dim db1 As New ParserDBDataContext

            For Num As Integer = 0 To JobCount - 1

                Dim One = Jobs(Num)
                Try
                    'The leading "." makes the query relative to the current node.
                    'contains() is a substring test, so '...-primary-heading' also matches '...-primary-heading-link' and the other heading children.
                    Dim JobSearchCard_primary_heading = One.SelectNodes(".//*[contains(@class,'JobSearchCard-primary-heading')]")
                    Dim JobSearchCard_primary_heading_days = One.SelectNodes(".//*[contains(@class,'JobSearchCard-primary-heading-days')]")
                    Dim JobSearchCard_primary_description = One.SelectNodes(".//*[contains(@class,'JobSearchCard-primary-description')]")
                    Dim JobSearchCard_primary_price = One.SelectNodes(".//*[contains(@class,'JobSearchCard-primary-price')]") 'not used below
                    Dim JobSearchCard_secondary_price = One.SelectNodes(".//*[contains(@class,'JobSearchCard-secondary-price')]")
                    Dim JobSearchCard_secondary_entry = One.SelectNodes(".//*[contains(@class,'JobSearchCard-secondary-entry')]")
                    Dim JobSearchCard_primary_tagsLink = One.SelectNodes(".//*[contains(@class,'JobSearchCard-primary-tags')]")

                    Dim NewRecord As New AllProject
                    NewRecord.ToMySkill = SkilNum
                    NewRecord.ProjectType = 1
                    NewRecord.CrDate = Now
                    NewRecord.[FlagURL] = URLSuffix

                    'Title and URL come from the <a> inside the heading
                    If JobSearchCard_primary_heading IsNot Nothing AndAlso JobSearchCard_primary_heading.Count > 0 Then
                        For Each SubNode In JobSearchCard_primary_heading
                            If SubNode.Name = "a" Then
                                NewRecord.[URL] = URLSuffix & SubNode.GetAttributeValue("href", "")
                                NewRecord.[Title] = SubNode.InnerHtml.Trim()
                            End If
                        Next
                    End If

                    'Remaining time, e.g. "6 days left", converted to hours
                    If JobSearchCard_primary_heading_days IsNot Nothing AndAlso JobSearchCard_primary_heading_days.Count > 0 Then
                        Dim IntDays As Integer
                        Integer.TryParse(JobSearchCard_primary_heading_days(0).InnerText.Replace(" days left", ""), IntDays)
                        NewRecord.[HourLeft] = IntDays * 24
                    End If

                    If JobSearchCard_primary_description IsNot Nothing AndAlso JobSearchCard_primary_description.Count > 0 Then
                        NewRecord.[TXT] = JobSearchCard_primary_description(0).InnerText.Trim
                    End If

                    'Price: the first child is the text node in front of the "Avg Bid" span, e.g. "$21 / hr"
                    If JobSearchCard_secondary_price IsNot Nothing AndAlso JobSearchCard_secondary_price.Count > 0 Then
                        NewRecord.[AvgBid] = JobSearchCard_secondary_price(0).FirstChild.InnerHtml.Trim()
                        NewRecord.[TimeType] = JobSearchCard_secondary_price(0).FirstChild.InnerHtml.Trim()
                        If NewRecord.[AvgBid].Contains("/ hr") Then
                            NewRecord.[Summ] = 0 'hourly rate, no fixed sum
                        Else
                            Dim Intprice As Integer
                            Integer.TryParse(NewRecord.[AvgBid].Replace("$", ""), Intprice)
                            NewRecord.[Summ] = Intprice
                        End If
                    End If

                    'Bid count, e.g. "10 bids"
                    If JobSearchCard_secondary_entry IsNot Nothing AndAlso JobSearchCard_secondary_entry.Count > 0 Then
                        Dim IntBid As Integer
                        Integer.TryParse(JobSearchCard_secondary_entry(0).InnerText.Replace(" bids", ""), IntBid)
                        NewRecord.[BidCount] = IntBid
                    End If

                    'Skill tags: node (0) is the tags container, its children are the <a> links and whitespace text nodes
                    If JobSearchCard_primary_tagsLink IsNot Nothing AndAlso JobSearchCard_primary_tagsLink.Count > 0 Then
                        Dim Category As String = ""
                        Dim CategoryNum As String = ""
                        Dim Flag1 As Boolean = False
                        For Each Two In JobSearchCard_primary_tagsLink(0).ChildNodes
                            Dim Cat As String = Two.InnerHtml.Replace(vbLf, "").Replace(" ", "")
                            If Not String.IsNullOrEmpty(Cat) Then
                                Flag1 = True
                                Category &= Cat & ","
                                Dim Categories = (From Z In db1.FreelancerCategories Select Z Where Z.Name.ToLower.Trim = Two.InnerHtml.ToLower.Trim).ToList()
                                If Categories.Count > 0 Then
                                    CategoryNum &= Categories(0).i & ","
                                End If
                            End If
                        Next
                        If Flag1 Then
                            'TrimEnd also copes with an empty CategoryNum (no tag found in the DB),
                            'where Left(CategoryNum, Len(CategoryNum) - 1) would throw
                            NewRecord.[Category] = Category.TrimEnd(","c)
                            NewRecord.[CategoryList] = CategoryNum.TrimEnd(","c)
                        End If
                    End If

                    db1.AllProjects.InsertOnSubmit(NewRecord)
                    db1.SubmitChanges()

                Catch ex As Exception
                    MsgBox("Freelancer Row # " & RecNumber & vbCrLf & "Project # " & Num & vbCrLf & vbCrLf & ex.Message)
                End Try
            Next
        Catch ex As Exception
            MsgBox("Freelancer Row # " & RecNumber & vbCrLf & ex.Message)
        End Try
        Return JobCount
    End Function
End Module
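A note on the selectors in the parser above: contains(@class, ...) is a substring test, so 'JobSearchCard-primary-heading' also matches the -heading-link, -heading-days and -heading-status elements. The parser relies on this to reach the <a> element. If you need one exact class instead, the usual XPath 1.0 idiom is to pad the attribute with spaces. The sketch below is only an illustration: the module name ClassTokenDemo, the helper HasClass and the shortened sample HTML are my assumptions, not part of the original parser.

```vb
Imports System
Imports HtmlAgilityPack

Module ClassTokenDemo
    ' Hypothetical helper: builds an XPath 1.0 test that is true only when className
    ' is one of the whitespace-separated tokens of @class, not merely a substring of it.
    Function HasClass(className As String) As String
        Return "contains(concat(' ', normalize-space(@class), ' '), ' " & className & " ')"
    End Function

    Sub Main()
        ' Shortened sample modelled on the card above (an assumption, not the live page).
        Dim html As String =
            "<div class='JobSearchCard-primary-heading'>" &
            "<a class='JobSearchCard-primary-heading-link' href='/projects/x/'>Title</a>" &
            "<span class='JobSearchCard-primary-heading-days'>6 days left</span>" &
            "<div class='JobSearchCard-primary-heading-status Tooltip--top'>VERIFIED</div>" &
            "</div>"

        Dim doc As New HtmlDocument()
        doc.LoadHtml(html)

        ' Substring match: the heading div and every child whose class starts the same way.
        Dim loose = doc.DocumentNode.SelectNodes("//*[contains(@class,'JobSearchCard-primary-heading')]")
        ' Token match: only elements that carry exactly this class.
        Dim exact = doc.DocumentNode.SelectNodes("//*[" & HasClass("JobSearchCard-primary-heading") & "]")
        ' Token match still works when the element has several classes.
        Dim status = doc.DocumentNode.SelectSingleNode("//*[" & HasClass("JobSearchCard-primary-heading-status") & "]")

        Console.WriteLine("substring matches: " & loose.Count)
        Console.WriteLine("token matches: " & exact.Count)
        Console.WriteLine("status text: " & status.InnerText)
    End Sub
End Module
```

The exact test in the original code, @class='JobSearchCard-item-inner', is stricter still: it stops matching as soon as the site adds a second class to the element, while the token test keeps working.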
An alternative way to parse HTML is to use jQuery selectors. If the page already includes jQuery and you have the HTML loaded in a browser (for example, as described in "CefSharp.Winforms.ChromiumWebBrowser minimal example on VB.NET (with cookies collector and script executor)"), you can use jQuery selectors directly, for example as in "Multithreading Parsers with Parallel, CsQuery, Newtonsoft.Json, OfficeOpenXml and IAsyncResult/AsyncCallback", or in the way shown on the screen.
But if you only have the HTML as a string from a WebResponse, you also need to install CsQuery ("Install-Package CsQuery"), just as the previous approach needs "Install-Package HtmlAgilityPack".
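As a minimal sketch of the CsQuery approach, the code below reads the same kind of job card with jQuery-style selectors. The module name CsQueryDemo, the function ReadCards and the shortened sample HTML are my assumptions for illustration; this is not the parser from the pages mentioned above.

```vb
Imports System
Imports System.Collections.Generic
Imports CsQuery

Module CsQueryDemo
    ' Hypothetical helper: returns one "title | href | bids | tags" line per job card in html.
    Function ReadCards(html As String) As List(Of String)
        Dim result As New List(Of String)
        Dim dom As CQ = CQ.Create(html)
        Dim cards As CQ = dom.Select("div.JobSearchCard-item-inner")

        For i As Integer = 0 To cards.Length - 1
            ' Eq(i) narrows the selection to one card, so Find searches only inside it.
            Dim card As CQ = cards.Eq(i)
            Dim link As CQ = card.Find("a.JobSearchCard-primary-heading-link")
            Dim bids As String = card.Find(".JobSearchCard-secondary-entry").Text().Trim()

            Dim tagNames As New List(Of String)
            Dim tags As CQ = card.Find("a.JobSearchCard-primary-tagsLink")
            For j As Integer = 0 To tags.Length - 1
                tagNames.Add(tags.Eq(j).Text().Trim())
            Next

            result.Add(link.Text().Trim() & " | " & link.Attr("href") & " | " & bids & " | " & String.Join(",", tagNames))
        Next
        Return result
    End Function

    Sub Main()
        ' Shortened sample modelled on the card above (an assumption, not the live page).
        Dim html As String =
            "<div class='JobSearchCard-item-inner'>" &
            "<a class='JobSearchCard-primary-heading-link' href='/projects/x/'> Appian Developer needed </a>" &
            "<div class='JobSearchCard-secondary-entry'>10 bids</div>" &
            "<a class='JobSearchCard-primary-tagsLink' href='/jobs/dot-net/'>.NET</a>" &
            "<a class='JobSearchCard-primary-tagsLink' href='/jobs/asp-net/'>ASP.NET</a>" &
            "</div>"

        For Each row In ReadCards(html)
            Console.WriteLine(row)
        Next
    End Sub
End Module
```

Compared with the XPath version, a class selector such as .JobSearchCard-secondary-entry matches a whole class token, and Find gives back an empty selection rather than Nothing when an optional element is missing, so the nested Nothing/Count checks are not needed here.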
A third way to parse HTML is possible only if you have XHTML. In that case you can use the built-in Microsoft classes XDocument/XElement; for an example template, see the page "VS2017 Plugins (Resolve Unused References & XPath Tools)".
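A minimal sketch of this third way, using only System.Xml.Linq. The module name XhtmlDemo, the function ReadTags and the sample document are my assumptions. The sample declares the standard XHTML namespace; if your document has no xmlns attribute, drop the ns prefix from the element names.

```vb
Imports System
Imports System.Collections.Generic
Imports System.Linq
Imports System.Xml.Linq

Module XhtmlDemo
    ' Hypothetical helper: returns "name -> href" for every skill tag link in an XHTML string.
    Function ReadTags(xhtml As String) As List(Of String)
        ' Parse throws XmlException if the input is not well-formed XML.
        Dim doc As XDocument = XDocument.Parse(xhtml)
        Dim ns As XNamespace = "http://www.w3.org/1999/xhtml"

        Dim query = From a In doc.Descendants(ns + "a")
                    Where a.Attribute("class")?.Value = "JobSearchCard-primary-tagsLink"
                    Select a.Value & " -> " & a.Attribute("href")?.Value
        Return query.ToList()
    End Function

    Sub Main()
        ' Well-formed sample (an assumption, not the live page).
        Dim xhtml As String =
            "<html xmlns='http://www.w3.org/1999/xhtml'><body>" &
            "<div class='JobSearchCard-primary-tags'>" &
            "<a class='JobSearchCard-primary-tagsLink' href='/jobs/dot-net/'>.NET</a>" &
            "<a class='JobSearchCard-primary-tagsLink' href='/jobs/asp-net/'>ASP.NET</a>" &
            "</div></body></html>"

        For Each row In ReadTags(xhtml)
            Console.WriteLine(row)
        Next
    End Sub
End Module
```

This needs no extra package, but XDocument.Parse rejects unclosed tags and unquoted attributes, so for ordinary HTML the two libraries above remain the safer choice.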