jSoupLink


A wrapper for jSoup that makes it easy to retrieve parts of HTML documents using CSS selectors.


These examples also appear in the jSoupLink announcement post on Mathematica.SE.

jSoupLink provides three functions. ParseHTML interprets a complete HTML document obtained directly from a website or a local file, ParseHTMLFragment interprets a fragment of HTML obtained in the same way, and ParseString interprets HTML held in a string.

jSoupLink`ParseHTML[
website address or local file path,
CSS selector,
data elements to extract
]

jSoupLink`ParseHTMLFragment[
website address or local file path,
CSS selector,
data elements to extract
]

jSoupLink`ParseString[
HTML in a string,
CSS selector,
data elements to extract
]
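
ParseString is the only one of the three that the examples below do not use. As a minimal sketch, assuming jSoupLink is already loaded and that ParseString takes its selector and attribute arguments in the same way as ParseHTML (which the signatures above suggest), a small HTML string could be queried as follows. The HTML snippet and the variable name html are made up for illustration.

```mathematica
(* Hypothetical HTML snippet, not taken from any real page *)
html = "<ul><li><a href=\"/first\">First</a></li><li><a href=\"/second\">Second</a></li></ul>";

(* Ask for the link text and the href attribute of every <a> inside an <li> *)
jSoupLink`ParseString[
   html, (* HTML in a string *)
   "li a", (* CSS selector *)
   {"text", "href"} (* Data elements to extract *)
   ]
```

This is convenient for trying out a selector on a short piece of markup before pointing ParseHTML at a live page.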

Select images from Wikipedia
urls = jSoupLink`ParseHTML[
   "http://en.wikipedia.org/wiki/Sweden", (* URL *)
   "table.infobox img", (* CSS selector *)
   "src" (* Attribute to retrieve *)
   ];
Partition[Import /@ urls, 2] // Grid

Example images from Wikipedia


Select headlines (both text and URL) from NYT
headlines = Rest@jSoupLink`ParseHTML[
    "http://www.nytimes.com/pages/politics/index.html",
    "h2 a, h3 a",
    {"text", "href"}
    ];
Take[headlines, 5] // TableForm

NYT headlines


Build a database with information about Swedish municipalities, using data on Wikipedia
headers = jSoupLink`ParseHTML[
   "http://en.wikipedia.org/wiki/List_of_municipalities_of_Sweden",
   "table.wikitable.sortable th",
   "text"
   ];
headers = StringReplace[#, "(" ~~ __ ~~ ")" -> ""] & /@ headers; (* Remove units *)
headers = StringReplace[#, WordBoundary ~~ x_ :> ToUpperCase[x]] & /@ headers; (* Capitalize *)
headers = StringReplace[#, " " -> ""] & /@ headers; (* Remove spaces *)

municipalities = jSoupLink`ParseHTML[
   "http://en.wikipedia.org/wiki/List_of_municipalities_of_Sweden",
   "table.wikitable.sortable td",
   "text"
   ];
municipalities = Partition[municipalities, 9];

ds = Dataset@Composition[
     Map[AssociationThread],
     Map[(headers -> #) &]
     ][municipalities];
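
The cleanup and reshaping steps above can be tried without any network access. The following is a sketch on made-up data: toyHeaders, toyCells, toyRows and toyDs are hypothetical names, and the values are placeholders rather than real municipality figures. It derives the row width from the number of headers instead of hard-coding it, which is a design choice of this sketch and not something the example above does.

```mathematica
(* Hypothetical stand-ins for the scraped header and cell text *)
toyHeaders = {"municipality", "population (2013)", "land area (km2)"};
toyCells = {"A-town", "100", "10", "B-town", "200", "20"};

(* Same cleanup as above: remove units, capitalize words, remove spaces *)
toyHeaders = StringReplace[#, "(" ~~ __ ~~ ")" -> ""] & /@ toyHeaders;
toyHeaders = StringReplace[#, WordBoundary ~~ x_ :> ToUpperCase[x]] & /@ toyHeaders;
toyHeaders = StringReplace[#, " " -> ""] & /@ toyHeaders;
(* toyHeaders is now {"Municipality", "Population", "LandArea"} *)

(* Cut the flat cell list into rows, then key each row by the headers *)
toyRows = Partition[toyCells, Length[toyHeaders]];
toyDs = Dataset[AssociationThread[toyHeaders, #] & /@ toyRows]
```

AssociationThread[toyHeaders, #] & builds the same association per row as the Composition of the two Map operators used above; it is only written more directly here.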

To select all municipalities in ds that belong to Västra Götaland County, evaluate:

ds[Select[#County == "Västra Götaland County" &], "Municipality"] // Normal

{"Ale Municipality", "Alingsås Municipality", "Bengtsfors Municipality", "Bollebygd Municipality", ...
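
The same query style extends to other operations on a Dataset. As a small sketch on made-up rows (toy and its values are hypothetical and only mimic the "Municipality" and "County" columns used above), the following filters by county and then counts rows per county:

```mathematica
(* Hypothetical rows with the same two column names as in the query above *)
toy = Dataset[{
    <|"Municipality" -> "A", "County" -> "X County"|>,
    <|"Municipality" -> "B", "County" -> "X County"|>,
    <|"Municipality" -> "C", "County" -> "Y County"|>
    }];

(* Filter rows by county, then keep one column *)
toy[Select[#County == "X County" &], "Municipality"] // Normal
(* {"A", "B"} *)

(* Group rows by county and count the rows in each group *)
toy[GroupBy["County"], Length] // Normal
(* <|"X County" -> 2, "Y County" -> 1|> *)
```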


Select HTML and then parse it for more information
questions = jSoupLink`ParseHTML[
   "http://mathematica.stackexchange.com",
   ".question-summary",
   "html" (* Retrieve HTML rather than text *)
   ];

(* Parse one question's HTML again to pull out title, votes, answers and tags *)
extractInformation[qhtml_] := {
   jSoupLink`ParseHTMLFragment[qhtml, "h3", "text"],
   Grid[{
     {"Votes", First@jSoupLink`ParseHTMLFragment[qhtml, ".votes", "text"]},
     {"Answers", First@jSoupLink`ParseHTMLFragment[qhtml, ".status", "text"]},
     {"Tags", First@jSoupLink`ParseHTMLFragment[qhtml, ".tags", "text"]}
     }, Alignment -> Left]
   }

(* One collapsible entry per question: title as the header, details inside *)
MapThread[OpenerView[{First[#], #2}] &, extractInformation /@ questions // Transpose] // Column

Screenshot of the formatting of questions