It started out as a joke, but Jenny Bryan recently posted a `vlookup` implementation in `R`. Here is the original post as seen on Twitter:

> Sometimes you just need to party like it’s VLOOKUP time 😁🖇 … seriously, sometimes a join doesn’t fit the bill pic.twitter.com/jz8StfQdNg
>
> — Jenny Bryan (@JennyBryan) April 3, 2018

The argument for creating this kind of lookup implementation in `R` was to facilitate a join that isn’t covered by the standard joins. The tweet started out as a joke function that would surely never see the light of day in production code, but one thing stood out about the little function: it was written in `base` R and made use of R’s list data objects. What makes this intriguing, you ask? Well, combining `lists` and `base` R usually has a nice consequence - speed.
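
The workhorse behind that speed is base R’s `match()`, which returns integer positions (and `NA` for misses) without any data frame machinery. A tiny illustration of my own, not from the tweet:

```
# match() returns the integer positions of its first argument in its
# second argument, and NA where a value is not found
match(c("b", "d", "a"), c("a", "b", "c"))
## [1]  2 NA  1
```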

So that is what I set out to test: how much more efficient would this `vlookup` be when compared to a `dplyr` or `data.table` implementation? We start off by recreating the vlookup function and using the tweet’s example to show its use case:

```
library(tidyverse)

# Jenny Bryan's vlookup: find the positions of `this` in df[[key]]
# and return the matching entries of df[[value]]
vlookup <- function(this, df, key, value) {
  m <- match(this, df[[key]])
  df[[value]][m]
}

vlookup_base <- function() {
  c("Luke Skywalker", "Jabba Desilijic Tiure", "Yoda") %>%
    vlookup(starwars, "name", "mass") %>%
    sum()
}
vlookup_base()
```

```
## [1] 1452
```
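
It is worth pausing on why a join doesn’t always fit the bill: `vlookup` keeps the order of the lookup vector, repeats values for repeated keys, and returns `NA` for misses. A quick demonstration of my own (the name `"Mickey Mouse"` is, of course, not in `starwars`):

```
vlookup(c("Yoda", "Yoda", "Mickey Mouse"), starwars, "name", "mass")
## [1] 17 17 NA
```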

Next we build the same query, but using `dplyr` notation, to give us the same results (a quick check shows the sums are the same):

```
vlookup_dplyr <- function() {
  starwars %>%
    filter(name %in% c("Luke Skywalker", "Jabba Desilijic Tiure", "Yoda")) %>%
    pull(mass) %>%
    sum()
}
vlookup_dplyr()
```

```
## [1] 1452
```
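
One caveat I should flag: the `filter()` version returns values in the table’s row order and collapses repeated lookup values, so it only agrees with `vlookup` here because everything gets reduced by `sum()`. My own aside to show the ordering difference:

```
# filter() yields masses in the order the rows appear in starwars,
# not in the order of the lookup vector
starwars %>%
  filter(name %in% c("Yoda", "Luke Skywalker")) %>%
  pull(mass)
## [1] 77 17
```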

Lastly we build the query using `data.table`. This implementation comes from the comment section of the tweet. We first have to coerce the `starwars` data into a data.table object for this to work. We also lose the `%>%` workflow, which can be quite a pain. In this case it doesn’t matter, because we are all about speed!

```
library(data.table)

vlookup_dt <- function() {
  name_vec <- c("Luke Skywalker", "Jabba Desilijic Tiure", "Yoda")
  # coercion happens on every call - we revisit this cost later
  df <- data.table(dplyr::starwars)
  # join df to name_vec on 'name' and sum the matching masses
  df[.(name_vec), on = 'name', sum(mass)]
}
```
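
A quick sanity check of my own before benchmarking - calling it should give us the same 1452 as the other two implementations:

```
vlookup_dt()
## [1] 1452
```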

To test the speed of all of these functions, we will use the `microbenchmark` library. It’s one of my favourite libraries in `R`, thanks to its ease of use and quick API for comparing functions.

```
library(microbenchmark)

microbenchmark(
  base = vlookup_base(),
  dplyr = vlookup_dplyr(),
  data_table = vlookup_dt(),
  times = 1000L
)
```

```
## Unit: microseconds
##        expr      min        lq      mean   median        uq       max neval cld
##        base   93.539  116.1030  143.8219  129.642  147.6920  4797.951  1000 a
##       dplyr  780.309  889.0265 1017.7738  943.795 1035.8980  8516.107  1000  b
##  data_table 1030.564 1166.3595 1454.0140 1252.103 1441.6410 50518.175  1000   c
```

Here we see the enormous speed gain we get from using `vlookup_base` in comparison with the other two frameworks. In all fairness, I feel that I might be handicapping `data.table` a bit, due to the coercion of the `starwars` dataset on every call. So let’s see what happens when I do the coercion outside the function.

```
# coerce once, outside the benchmarked function
df <- data.table(dplyr::starwars)

vlookup_dt <- function() {
  name_vec <- c("Luke Skywalker", "Jabba Desilijic Tiure", "Yoda")
  df[.(name_vec), on = 'name', sum(mass)]
}

res_mic <- microbenchmark(
  base = vlookup_base(),
  dplyr = vlookup_dplyr(),
  data_table = vlookup_dt(),
  times = 1000L
)
res_mic
```

```
## Unit: microseconds
##        expr     min      lq     mean  median       uq      max neval cld
##        base  95.590 118.975 137.5330 128.410 140.1030  603.077  1000 a
##       dplyr 784.410 873.026 962.0909 911.180 972.7185 3840.002  1000   c
##  data_table 579.693 648.206 740.3640 685.129 746.0520 4667.900  1000  b
```

We definitely gained some speed by not having to coerce the data.frame over and over, but the overall speed comparison is still nothing like the old-fashioned `vlookup_base`.
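
If I wanted to give `data.table` its best shot, the natural next step would be to set a key on the table so the join can use binary search. This is a sketch of my own that I haven’t benchmarked here, so treat the speed question as open:

```
# assumption on my part: with a key set on 'name', the join can
# binary-search instead of building an ad hoc index on each call
setkey(df, name)

vlookup_dt_keyed <- function() {
  name_vec <- c("Luke Skywalker", "Jabba Desilijic Tiure", "Yoda")
  df[.(name_vec), sum(mass)]  # 'on' is no longer needed once keyed
}
```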