
A quick glance at my previous posts will show my fondness for insightful quotes. I admire those who have the wonderful ability to recite them from memory.
Unfortunately, I have never possessed such a dazzling memory. Instead, I look for ways to make quotes more accessible, in the hope of sounding more insightful myself.
The example I am about to introduce elaborates on one of my favorite open source C# libraries – the HTML Agility Pack. This is a .NET code library that allows you to parse HTML files straight "out of the web." The parser is very tolerant of "real world" malformed HTML, and the object model is very similar to what System.Xml proposes, but for HTML documents.
I used the HTML Agility Pack and SQLite to build a lightweight database that I could utilize in my mobile application. SQLite is used constantly in mobile applications today, so it is a good storage technology to be informed about.
The iOS application in this post was built using PhoneGap, an open source JavaScript framework, together with jQuery Mobile. PhoneGap enables you to build iOS and Android applications with JavaScript and a UIWebView control.
In a later post, I may discuss creating jQuery Mobile controls for mobile applications. In this post, however, I will stay within the scope of the HTML Agility Pack library.
If you want to build a SQLite table similar to the one presented in this post from a .NET application, it is important to download the SQLite drivers for .NET and the SQLite Browser.
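As a minimal sketch of that setup (assuming the System.Data.SQLite provider and a simple three-column schema matching the TQuote class used later in this post — the table and column names here are my assumptions, not taken from the original project), creating the table from .NET might look like this:

```csharp
using System.Data.SQLite; // from the System.Data.SQLite NuGet package

class QuoteDb
{
    public static void CreateTable(string dbPath)
    {
        SQLiteConnection.CreateFile(dbPath); // creates an empty database file
        using (var conn = new SQLiteConnection("Data Source=" + dbPath))
        {
            conn.Open();
            // Hypothetical schema: one column per TQuote property
            string sql = "CREATE TABLE IF NOT EXISTS Quotes (" +
                         "Letter TEXT, Author TEXT, Message TEXT)";
            using (var cmd = new SQLiteCommand(sql, conn))
                cmd.ExecuteNonQuery();
        }
    }
}
```

The resulting file can then be opened directly in the SQLite Browser to verify the schema.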
Building Data Through Web Scraping and HTML Agility Pack Example
Earlier web scraping techniques involved the HttpWebRequest class, followed by a StreamReader to capture the source code. Manipulating that raw source code, however, required a certain degree of complexity: to obtain the desired information, a sequence of substring operations had to be carried out.
The HTML Agility Pack library has simplified this process significantly. If this is the first time you are reading about the HTML Agility Pack, you should know that one of its most important elements is the HtmlNode class.
In layman's terms, every HTML tag is a selectable object under the HTML Agility Pack. Through the HtmlNode class we can access various properties of these tag elements, such as child nodes/tags, the parent node/tag, attributes, and many others of great importance.
Let's take a brief look at the example below to understand a little bit about the anatomy.
1: <div id="Container" class="author block">
2: <span id="authorName"><b> Paulo Coelho</b></span>
3: <span id="Quote"> "The Superclass tries to promote its values. Ordinary people complain of divine injustice, they envy power, and it pains them to see others having fun." </span>
4: <a id="booklink" href="http://www.amazon.com/The-Winner-Stands-Alone-Novel/dp/B004KAB3QG">Book</a>
5: </div>
Simple enough: if we were to select the div tag <div id="Container"> above through the HTML Agility Pack, we would have access to its attributes under the keys "id" and "class", and to its child nodes <span id="authorName">, <span id="Quote"> and <a id="booklink">.
Conversely, if we were to select the span tag <span id="authorName">, the HtmlNode would give us access to its parent node <div id="Container"> and to its one child node, the <b> tag. The text within the tag is accessible through the InnerText property.
Thus, we can select various tags and their attributes as long as we understand the anatomy of the HTML document.
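To make those relationships concrete, here is a minimal sketch (parsing the snippet above from a string via HtmlDocument.LoadHtml, rather than from the web) of how they are exposed through the HtmlNode class:

```csharp
using System;
using HtmlAgilityPack;

var html = @"<div id=""Container"" class=""author block"">
  <span id=""authorName""><b> Paulo Coelho</b></span>
  <span id=""Quote""> quote text here </span>
  <a id=""booklink"" href=""http://www.amazon.com/The-Winner-Stands-Alone-Novel/dp/B004KAB3QG"">Book</a>
</div>";

var doc = new HtmlDocument();
doc.LoadHtml(html);

// Select the div by its id attribute and read another attribute off it
HtmlNode div = doc.DocumentNode.SelectSingleNode("//div[@id='Container']");
Console.WriteLine(div.GetAttributeValue("class", ""));   // prints: author block

// Select a child span and walk back up to its parent
HtmlNode author = doc.DocumentNode.SelectSingleNode("//span[@id='authorName']");
Console.WriteLine(author.ParentNode.Id);                 // prints: Container
Console.WriteLine(author.InnerText.Trim());              // prints: Paulo Coelho
```

The same navigation works identically on documents loaded from a URL with HtmlWeb, as shown in the next section.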
Nodes or Tags Selector
Download the latest code of the HTML Agility Pack (HAP) and include the reference in your project. The first step in using HAP is to load the HTML document through the HtmlWeb class. Subsequently, there are a few techniques for selecting tags based on their attributes or their parent node.
1: using HtmlAgilityPack;
2: ...
3: HtmlDocument doc = new HtmlWeb().Load(myurl);
4: // Selecting every div tag in the document
5: List<HtmlNode> divs = doc.DocumentNode.SelectNodes("//div").ToList();
6: // Selecting a div tag with a specific id attribute (XPath is case sensitive)
7: List<HtmlNode> containerDivs = doc.DocumentNode.SelectNodes("//div[@id='Container']").ToList();
Mobile Application Data Example
The website BrainyQuote is a good source of quotes from various authors and topics. In the following example, we will scrape all the authors found under the favorite quotes section at http://www.brainyquote.com/quotes/favorites.html.
Prior to scraping effectively, it is important to understand the anatomy of the page:

In the above anatomy, we can observe that the URL link to the author page is found in an a tag, and that the link contains "/quotes/authors". We will use this pattern to identify that a link is indeed an author-type link. This link is responsible for directing us to the author page.
1: string url = "http://www.brainyquote.com/quotes/favorites.html";
2: HtmlDocument doc = new HtmlWeb().Load(url);
3: List<string> authorlinks = (from AuthLin in doc.DocumentNode.SelectNodes("//a")
4: where AuthLin.OuterHtml.IndexOf("/quotes/authors") != -1
5: // the href attribute holds the URL of the author page
6: select ParentURL + AuthLin.Attributes["href"].Value).ToList();

Next, we will use Parallel.ForEach for faster execution. We will load each author URL into a new HtmlDocument, so we can proceed to scrape the quotes located on the author page. Here is a brief look at the anatomy of this page.

Notice that the quote information is included under a div with the class name "boxyPaddingBig". This div contains at least two child span nodes – one that contains the quote text and another that contains the author information.
1: // Parallel.ForEach typically runs faster than a regular foreach here,
// since each iteration performs an independent web request
2: Parallel.ForEach(authorlinks, authorlink => {
3: HtmlDocument docAuth = new HtmlWeb().Load(authorlink); // loads the author page
4: List<List<HtmlNode>> quotes = docAuth.DocumentNode.SelectNodes("//div").Where(k => k.Attributes["class"] != null && k.Attributes["class"].Value == "boxyPaddingBig")
5: .Select(t => t.ChildNodes.Where(h => h.Name == "span") // select the child nodes, but only where the tag is a span
6: .ToList()).ToList();
7:
8: quotes.ForEach(p => {
9: TQuote newItem = new TQuote { message = p[0].InnerText, author = p[1].InnerText, Letter = p[1].InnerText.Substring(0, 1) };
10: Quotes.Add(newItem); // note: adding to a shared List<T> from parallel threads is not thread-safe
11: });
12: });
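One caveat worth noting: List<T> is not thread-safe, so calling Quotes.Add from inside Parallel.ForEach can corrupt the shared list under contention. A minimal sketch of a safer alternative, using ConcurrentBag<T> from System.Collections.Concurrent (the scraping body is elided; this is not the original project's code):

```csharp
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

var safeQuotes = new ConcurrentBag<TQuote>(); // thread-safe, unordered collection

Parallel.ForEach(authorlinks, authorlink =>
{
    // ... load the page and build each newItem exactly as above ...
    // safeQuotes.Add(newItem); // Add is safe to call from multiple threads
});

// Snapshot into an ordinary list once the parallel work is done
List<TQuote> results = safeQuotes.ToList();
```

ConcurrentBag does not preserve insertion order, which is fine here since the quotes are later grouped by author and letter anyway.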
The TQuote class is created to store the list of quotes.
1: public class TQuote
2: {
3: public string Letter { get; set; }
4: public string author { get; set; }
5: public string message { get; set; }
6:
7: }
So with only the few lines above, we are able to obtain the 25 quotes on the author's first page.

While the task seems accomplished, there is one more caveat. Each author page contains a navigation bar at the top and at the bottom of the page whenever the quotes do not fit on the first page. Could we obtain the link to each page, so we can make certain that we capture all the quotes?

1: List<HtmlNode> pages;
2: try
3: {
4: // below scrapes the page navigation in order to see how many pages there are for the author
5: pages = docAuth.DocumentNode.SelectNodes("//div").Where(k => k.Attributes["class"] != null && k.Attributes["class"].Value.IndexOf("pagination") != -1)
6: .FirstOrDefault() // just pick the first one, since there are two navigation bars
7: .ChildNodes["ul"].ChildNodes.Where(n => n.Name == "li" && n.Attributes.Count == 0 && n.InnerText != "Next") // excludes the "Next" link (since that link is repeated)
8: .Select(m => m.ChildNodes[0]).ToList<HtmlNode>();
9: }
10: catch
11: {
12: // no more pages with quotes
13: pages = null;
14: }
Once we obtain the links to all the extra pages with quotes, we are ready to scrape the quotes in a similar fashion as above. Below is the complete code.
1: public static List<TQuote> GetFavoriteAuthors()
2: {
3: List<TQuote> Quotes = new List<TQuote>();
4: string url = "http://www.brainyquote.com/quotes/favorites.html";
5: HtmlDocument doc = new HtmlWeb().Load(url);
6: List<string> authorlinks = (from AuthLin in doc.DocumentNode.SelectNodes("//a")
7: where AuthLin.OuterHtml.IndexOf("/quotes/authors") != -1
8: select ParentURL + AuthLin.Attributes["href"].Value).ToList();
9: Parallel.ForEach(authorlinks, authorlink => {
10: HtmlDocument docAuth = new HtmlWeb().Load(authorlink);
11: List<HtmlNode> pages;
12: try
13: {
14: // below scrapes the page navigation in order to see how many pages there are for the author
15: pages = docAuth.DocumentNode.SelectNodes("//div").Where(k => k.Attributes["class"] != null && k.Attributes["class"].Value.IndexOf("pagination") != -1)
16: .FirstOrDefault() // just pick the first one, since there are two navigation bars
17: .ChildNodes["ul"].ChildNodes.Where(n => n.Name == "li" && n.Attributes.Count == 0 && n.InnerText != "Next") // excludes the "Next" link (since that link is repeated)
18: .Select(m => m.ChildNodes[0]).ToList<HtmlNode>();
19: }
20: catch
21: {
22: // no more pages with quotes
23: pages = null;
24: }
25: #region Get Quotes
26: List<List<HtmlNode>> quotes = docAuth.DocumentNode.SelectNodes("//div").Where(k => k.Attributes["class"] != null && k.Attributes["class"].Value == "boxyPaddingBig")
27: .Select(t => t.ChildNodes.Where(h => h.Name == "span") // select the child nodes, but only where the tag is a span
28: .Select(k => k).ToList<HtmlNode>()).ToList();
29:
30: quotes.ForEach(p => {
31: TQuote newItem = new TQuote { message = p[0].InnerText, author = p[1].InnerText, Letter = p[1].InnerText.Substring(0, 1) };
32: Quotes.Add(newItem); // note: a shared List<T> is not thread-safe inside Parallel.ForEach
33: });
34: if(pages != null)
35: pages.ForEach(p => {
36: if (p.Attributes["href"] != null)
37: {
38: HtmlDocument AuthorPage = new HtmlWeb().Load(ParentURL + p.Attributes["href"].Value);
39: List<List<HtmlNode>> PageQuotes = AuthorPage.DocumentNode.SelectNodes("//div").Where(k => k.Attributes["class"] != null && k.Attributes["class"].Value == "boxyPaddingBig").Select(t => t.ChildNodes.Where(h => h.Name == "span").Select(k => k).ToList<HtmlNode>()).ToList();
40: PageQuotes.ForEach(Q =>
41: {
42: TQuote newItem = new TQuote { message = Q[0].InnerText, author = Q[1].InnerText, Letter = Q[1].InnerText.Substring(0, 1) };
43: Quotes.Add(newItem);
44:
45: });
46: Console.WriteLine(p.Attributes["href"].Value);
47: } });
48: #endregion
49: });
50:
51: return Quotes;
52:
53: }
54: private static string ParentURL
55: {
56: get
57: {
58: return "http://www.brainyquote.com";
59: }
60: }
I had the program write to the console all the subpages with quotes from each author.

By storing the data extracted in the above process in a SQLite table, we were able to generate a table of 26,263 quotes. Below is a view of the table.
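As a sketch of that storage step (again assuming System.Data.SQLite and the hypothetical Quotes table schema from earlier — those names are my assumptions, not from the original project), inserting the scraped list could look like:

```csharp
using System.Collections.Generic;
using System.Data.SQLite;

static void SaveQuotes(string dbPath, List<TQuote> quotes)
{
    using (var conn = new SQLiteConnection("Data Source=" + dbPath))
    {
        conn.Open();
        // Wrapping the inserts in one transaction makes the bulk load fast
        using (var tx = conn.BeginTransaction())
        {
            string sql = "INSERT INTO Quotes (Letter, Author, Message) " +
                         "VALUES (@l, @a, @m)";
            using (var cmd = new SQLiteCommand(sql, conn))
            {
                foreach (var q in quotes)
                {
                    cmd.Parameters.Clear();
                    cmd.Parameters.AddWithValue("@l", q.Letter);
                    cmd.Parameters.AddWithValue("@a", q.author);
                    cmd.Parameters.AddWithValue("@m", q.message);
                    cmd.ExecuteNonQuery();
                }
            }
            tx.Commit();
        }
    }
}
```

Parameterized commands also sidestep any quoting problems from apostrophes inside the scraped text.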

The 20 Minute Mobile App
Easy enough: with the data generated above, I was able to build a mobile application in about 20 minutes with PhoneGap and jQuery Mobile. Beyond the process above, I also scraped quote content by topic, and then classified the content by author name as well as by topic.



Share your Quotes
In this day and age, what good is an app if you are not able to share its content on social media? PhoneGap has a great ChildBrowser plugin that is ideal for issuing authorization requests to the Facebook API.


