Easily scrape a website into a data frame in R.

Let’s take a look at the MBA students of Harvard Business School (class profile 2016). They come from unique backgrounds and different parts of the world. However, they share common interests and ambitions that are valuable to society.

In particular, I am interested in which universities, high schools, and other institutions they come from. Yes, the website tells us that 35% are internationals (I am one too), but that number says almost nothing. For example, do those 35% come from a rich or a rather poor environment? HBS would argue that it is “a mix”, but education says a lot and is always a better signal to analyse.
Although we will never know that, we can at least examine in which countries and at which universities they studied.

HBS gives a nice list of schools, see below.


library(rvest)
# read_html() replaces html(), which was deprecated in later rvest releases
hbs <- read_html("http://www.hbs.edu/mba/admissions/class-profile/Pages/undergraduate-institutions.aspx")

Nothing is plotted yet at this point; the website is only downloaded and parsed.

schools <- hbs %>%
  html_node(".ol-outset .unstyled") %>%
  html_text()
schools

Using CSS selectors, we extract the information we actually need, that is, an unordered list of schools. Then we get rid of the tags with the `html_text()` function, which extracts the text from the HTML. You can see the output in `schools`.
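To see what the selector step returns, here is a minimal sketch on a small inline HTML fragment instead of the live page. The fragment and the school names in it are made up for illustration; only the class names `.ol-outset` and `.unstyled` come from the code above, and the real page may be structured differently. It contrasts `html_node()`, which gives one string for the whole list, with `html_nodes()` on the `li` items, which gives one string per school.

```r
library(rvest)

# Hypothetical fragment standing in for the HBS page (school names are invented)
page <- read_html('<div class="ol-outset"><ul class="unstyled"><li>Alpha University</li><li>Beta College</li></ul></div>')

# html_node() keeps only the first match, so html_text() returns a single string
one_string <- page %>%
  html_node(".ol-outset .unstyled") %>%
  html_text()

# html_nodes() on the list items returns one element per school
items <- page %>%
  html_nodes(".ol-outset .unstyled li") %>%
  html_text()

length(one_string)
items
```

If the list items on the real page can be selected this way, the character vector `items` could go straight into `data.frame()`, but that depends on the page markup.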


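# write the text to a file so that readLines() can split it on line endings
cat(schools, file = "ex.data", sep = "\n")
# "UTF-8" is a valid value for encoding; "utf8towcs" is not
rew <- readLines("ex.data", encoding = "UTF-8")
df <- data.frame(rew, stringsAsFactors = FALSE)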

Now we need to store it in a `data.frame`. The problem I had was that the separator was `\r\n`. So I first write the output to a file using a normal `\n`, and then `readLines` lets me convert it to a df. Inspired by him.
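As an alternative to the file round trip, the string could probably be split in memory. This is only a sketch, not the approach used above: the sample string is invented, and it assumes the scraped text is a single string with `\r\n` (or `\n`) between the schools. The column name `school` is also just an illustrative choice.

```r
# Hypothetical stand-in for the scraped text, with Windows line endings
schools <- "Alpha University\r\nBeta College\r\n\r\nGamma Institute\r\n"

# Split on either \r\n or \n
lines <- strsplit(schools, "\r?\n")[[1]]

# Trim surrounding whitespace and drop empty lines
lines <- trimws(lines)
lines <- lines[nzchar(lines)]

# One row per school
df <- data.frame(school = lines, stringsAsFactors = FALSE)
df
```

This avoids creating `ex.data` on disk, while the file-based version above has the advantage of having been run against the real page.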
