Wikipedia Data Scraping with R: rvest in Action

Scraping the list of people on banknotes for exploratory data analysis using rvest functions

Korkrid Kyle Akepanidtaworn

--

Introduction

Wikipedia is a free online encyclopedia, created and edited by volunteers around the world and hosted by the Wikimedia Foundation, with more than 5 million articles in English. Today, I will work through a data exercise in Wikipedia data scraping using rvest, “a new package that makes it easy to scrape (or harvest) data from html web pages, inspired by libraries like beautiful soup. It is designed to work with magrittr so that you can express complex operations as elegant pipelines composed of simple, easily understood pieces” (Wickham, 2014). Before you proceed, it is important to have a basic understanding of HTML and XML document structure. I recommend the HTML tutorial at w3schools, which offers a good, simplified resource for learning, testing, and practice.
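To give a flavor of what such a pipeline looks like, here is a minimal sketch of the kind of scrape this exercise builds toward: reading the Wikipedia article on people depicted on banknotes and pulling its tables into data frames. This assumes the rvest package (which re-exports the magrittr pipe) is installed; the article title in the URL and the table index are illustrative and may need adjusting to the page's current layout.

```r
library(rvest)

# Read the raw HTML of the Wikipedia article (URL assumed for illustration)
url  <- "https://en.wikipedia.org/wiki/List_of_people_on_banknotes"
page <- read_html(url)

# Select all wiki-style tables and convert each to a data frame;
# fill = TRUE pads rows that span multiple columns
tables <- page %>%
  html_nodes("table.wikitable") %>%
  html_table(fill = TRUE)

# Inspect the first table (index may vary as the page is edited)
head(tables[[1]])
```

The pipeline reads top to bottom: fetch the page once with `read_html()`, narrow down to the nodes of interest with a CSS selector, then hand the parsed nodes to `html_table()` — exactly the "simple, easily understood pieces" composition the package description promises.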

Image Source: Chandransh Srivastava, via Quora, https://www.quora.com/With-knowledge-in-HTML-CSS-and-a-little-JavaScript-what-kind-of-projects-should-I-start-with-to-strengthen-my-skills-in-web-development

--
