This lab serves as a short vignette to writing functions in R. It does not cover everything that is necessary to become an experienced R programmer, but it will provide you with the fundamentals to understand the logic behind function creation and implementation. There are two objectives for this lab. First, you will be introduced to function writing. Second, you will be introduced to loops.
The previous labs have shown that many important tasks in R can be simplified by calling a function! Often times, there will be a library package with built in functions that can take care of your needs, but this doesn’t mean you don’t need to know about functions. #alternatively, there are two reasons why you may want to know how to write your own functions…First, there may not be an established function for what you need to do. Second, writing functions that are useful to you can help you to save time or focus computational resources over many projects. Lets write a function that allows us to easily call multiple library packages without needing to call each package one at a time. Let’s call it LoadLibraries()…
LoadLibraries
LoadLibraries()
When you type this code you will get errors that indicate that object and function are not found. This happens because the function that you were trying to call doesn’t actually exist in your object memory. This could be the result of not being included in the base R package, not having a key library package loaded, or it just simply doesn’t exist! Since the LoadLibraries function doesn’t exist, lets go ahead and code it…
LoadLibraries = function(){
library(ISLR)
library(MASS)
print("The libraries have been loaded")
}
LoadLibraries()
## Warning: package 'ISLR' was built under R version 3.2.3
## [1] "The libraries have been loaded"
Two things to note here. First, when you created the LoadLibraries function, you can see that it is now displayed on the right hand global environment window in R Studio under the functions header. Second, look at the code that you used to create LoadLibraries. You used a function to create a function, namely the function() function (say that sentence five times fast). Notice the brackets that come after the function() code. All code that is associated with this function’s job must be housed within the brackets. The LoadLibraries() function that we created essentially allowed you to turn three lines of code into one line of code…magic. The first two lines are simple library calls, while the third line prints out a confirmation message.
The above example, while cool is very basic and of limited use outside of our own personal needs, but it represents the strength of functions as a tool to aid you as a programmer and scientist. Let’s try something that is a little more complex…
math_function <- function(x){
print(sprintf("Here are some transformations for %d", x))
print("very cool!")
test <- sqrt(x); test2 <- 1 + log(x)
l_test <- list(square.root=test,natural.log=test2)
return(l_test)
}
math_function(3)
## [1] "Here are some transformations for 3"
## [1] "very cool!"
## $square.root
## [1] 1.732051
##
## $natural.log
## [1] 2.098612
math_function(20)
## [1] "Here are some transformations for 20"
## [1] "very cool!"
## $square.root
## [1] 4.472136
##
## $natural.log
## [1] 3.995732
math_function(75)
## [1] "Here are some transformations for 75"
## [1] "very cool!"
## $square.root
## [1] 8.660254
##
## $natural.log
## [1] 5.317488
math_function(100)
## [1] "Here are some transformations for 100"
## [1] "very cool!"
## $square.root
## [1] 10
##
## $natural.log
## [1] 5.60517
What does the math_function() do? Under the hood, there are five lines of code that are collapsed into one function. The first two lines print out useful data for the user, while the bottom three lines perform the requested mathematical transformations of x. Notice that we put an x in the function() function. x is our argument, something that you have been working with during the last few labs. x in this case is just the number we wish to be transformed by a square root and a natural log.
This function is still pretty basic, but you are probably confused by some of the code. Let’s break it down to its base components. The first two lines are simple print() functions that print strings, but you probably notice that the first line is unlike anything you have seen before. Housed within the print() code is another function called sprintf(). I won’t go too deep into it here, but sprintf() allows you to take variable values and display them in a printed string by calling the specific variable after the string. This type of strategy is useful for when you have functions that change values frequently, or have outside human input. This allows the function to adapt to any value given to the function. The third line of code should be familiar with you, but notice that the semi-colon is used as a way to have two different lines of code on the same line without interference. The fourth line of code is probably new to you, but it is basic nonetheless. It essentially takes the objects created by test and test2 and places them in a single list. Finally, the final line uses the return() function to report the results of the list. Using the return() function allows us to explicitly end the function and return the results, so take great care. You should always use the return() function at the end of a function’s code. If you don’t, you will prematurely cut off any expressions used after the return.
How do you like functions so far? Functions can seem daunting at first, but they are really only limited by your own creativity. Let’s try notching things up a bit…
centrality.function <- function(x, npar = FALSE, print = TRUE){
if (!npar) {
center <- mean(x); spread <- sd(x)
}
else {
center <- median(x); spread <- mad(x)
}
if (print & !npar) {
cat("Mean = ", center, "\n", "Standard deviation =", spread, "\n")
}
else if (print & npar) {
cat("Median=", center, "\n", "Mean absolute deviation =", spread, "\n")
}
}
The centrality.function() function allows us to quickly calculate and display measures of central tendency for a vector of x. Here we have created three arguments. The first one is x, which takes a numeric vector as its value. The second argument, npar, tells the function whether you want the evaluation to be non parametric or not. When we specify FALSE above, it tells the function that this is the default argument value when no user decision is made. Finally, the third argument called print tells the function whether or not to actually print the results of the function operation, with the default choice being TRUE. Before breaking down the code behind centrality.function(), lets first use it in practice.
set.seed(4590)
data <- rnorm(1000)
data2 <- rpois(1000, 4)
centrality.function(data)
## Mean = -0.03400282
## Standard deviation = 0.9923801
centrality.function(data2)
## Mean = 3.954
## Standard deviation = 2.008462
centrality.function(data, npar=TRUE)
## Median= 0.006192939
## Mean absolute deviation = 0.9202793
centrality.function(data, npar=TRUE, print=FALSE)
What makes the function specifications different in outcome? By using logic based control structures, you are able to specify by under what conditions code is evaluated by the function. These logic conditions are specified by using if/else branching structures. The first line of code specifies an if statement that says: “if npar is coded as FALSE, then create object”center" with mean of x and object “spread”" with sd of x. " The second line of code specifies an else statement, which tells the program what steps to take if npar is coded as anything other than FALSE. In this case, if npar is not FALSE (but TRUE), then the function creates the “center” object with the median of x and creates the “spread” object with the mean absolute deviation of x. This same logic holds for the final two lines of code that determine the exact nature of the printed resulted. In this case, the cat() function serves as a similar printing strategy to the sprintf() function used in the previous example.
By using logic based control structures, it is possible to branch estimation and reporting paths that are dependent on user input. In this case, we have created a more dynamic function for measuring central tendency as it can take into account traditional and non-parametric input data.
Loops are easy to use and extremely powerful, but many beginner programmers get hung out on the logic and implementation of loops. Loops are a continuation of programming control structures (if and else) that were touched on in the final function example above. The most commonly used loop is called a for loop. A for loop iterates a conditional argument over the values of an object in R. Let’s try using a for loop to print out numbers 1 through 15…
for(i in 1:15) {
print(i)
}
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6
## [1] 7
## [1] 8
## [1] 9
## [1] 10
## [1] 11
## [1] 12
## [1] 13
## [1] 14
## [1] 15
What did we do here? Our output shows fifteen print results for fifteen numbers. The code is very simple, in that we took the well-known print() function and made R quickly apply it to a numeric list of 1 through 15. The exact syntax may be confusing at first, but it is actually very simple. The for function has special syntax that allows us to loop through any vector. In this case, i represents any value from the object 1:15. Just like a function, we use the {} brackets to house the exact operations we want looped through the object of interest (1:15).
The syntax of i in 1:15 is entirely arbitrary and is dependent on the user. We could very well specify the loop a different way and get the same results…
x <- 1:15
for(n in x) {
print(n)
}
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6
## [1] 7
## [1] 8
## [1] 9
## [1] 10
## [1] 11
## [1] 12
## [1] 13
## [1] 14
## [1] 15
Here we changed i to n, and placed the 1:15 numeric list into a vector x. The results are exactly the same even though the exact code has changed. This example is incredibly simple, yet it is powerful. Let’s try a few more for loop examples to hammer the concept home…
x <- c("a", "b", "c", "d")
for(i in x){
print(i)
}
## [1] "a"
## [1] "b"
## [1] "c"
## [1] "d"
x <- matrix(1:6, 2, 3)
x
## [,1] [,2] [,3]
## [1,] 1 3 5
## [2,] 2 4 6
for(i in seq_len(nrow(x))) {
for(j in seq_len(ncol(x))){
print(x[i, j])
}
}
## [1] 1
## [1] 3
## [1] 5
## [1] 2
## [1] 4
## [1] 6
The first example simply replaces the vector with character strings to be printed, while the second uses a nested loop structure (loop inside a loop) to print out the values of a matrix with two rows and three columns. Let’s try a more statistical application using a for loop by trying to examine the Law of Large Numbers Theorem.
set.seed(1234)
store <- matrix(NA, 1000, 1)
for(i in 1:1000){
a <- rnorm(i)
store[i] <- mean(a)
}
plot(store, type="h", ylab="Sample Mean",
xlab="Number of observations")
abline(h=0, col = 'red', lwd=2)
The graph above plots the result of our for loop simulation of a standard normal distribution \(N(\mu=0,\sigma^{2}=1)\). If we are interested in finding evidence for the Law of Large Numbers Theorem for the standard normal distribution, we would expect to see the \(\mu\) of each successive draw to become closer to zero. To code this, we first created a matrix called store, that has no observations, 1000 rows, and one column. We then used the for loop to populate this matrix with values from increasingly bigger standard normal simulations that range from 1 random draw at the beginning to 1000 random draws at the end. The final two lines of code plot the result of the for loop by using the now populated store vector. To illustrate the convergence around zero, we place a red line at zero. The results are exactly what we should expect from a simulation study! As the number of observations from the random normal draw increases, so does the convergence around zero.
If you find yourself using R for extensive estimator creation, it is likely that you will only ever need for loops, but other types of loops exist as well. While loops are like for loops, but they depend on a conditional argument to be present. Let’s illustrate this in practice before going too deep into what this actually means…
count <- 0
while(count < 10){
print(count)
count <- count + 1
}
## [1] 0
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6
## [1] 7
## [1] 8
## [1] 9
What did we do here that was different from the for loop? The difference is the logical condition (count < 10) that is housed in the while() function. The first line of code makes an object called count that has the value 0. The while loop specifies that while the value of object “count” is less than 10, the program is to loop through the final two lines of code, a print function and a +1 addition to the value inside the “count” object. The while loop tells R to continue looping through these final two lines of code until the logical condition is met, in this case the value of “count” being less than 10. Thus the output stops at 9, as that is the final value that is less than 10.
Let’s try a more complex while loop..
z <- 5
set.seed(786)
while(z >= 3 && z <= 10){
coin <- rbinom(1, 1, 0.5)
if (coin == 1){
z <- z + 1
}
else {
z <- z - 1
}
}
print(z)
## [1] 2
This while loop may seem a little daunting, but it does quit a simple task under the hood. Let’s break it down line by line. The first line of code creates a vector object “z” that takes 5 as its value. The second line sets up a random seed, so that the results can be replicated across sessions. The third line sets up the while loop and logical condition. The logical condition here is that while the value of “z” is in between 3 and 10 (z greater than or equal to 3 and less than or equal to 10) it should loop through the specified code. The code specified within the while loop does three things. The first line creates an object called coin that takes its value (1 or 0) from a random binomial distribution draw that specifies one observation, one trial, and a 50% chance of success \(B(n = 1, p = 0.5\)). The part of the while loop code specifies an if/else control condition. If the coin flips to 1, than the code is to add 1 to z, if the coin is something other than 1, then subtract 1 from z. The while loop continues to execute these three lines of code until the value of object “z” is less than 3 or greater than 10, and in our case the result was z becoming 2.
One last note about while loops. Be careful when specifying the logical condition to be met, as it is possible to accidentally create infinite loops! Having an infinite while loop execute on your computer will probably result in your computer becoming sluggish, freezing up, or if old enough, possibly locking up entirely.
As you have learned, functions and control sequences can be daunting but powerful. There is still more to learn beyond the scope of this lab, but what you have learned today will provide you wish the basis to work with R programming in a robust way. When you are thinking about strategies for creating your own functions, it can be useful to examine the code that builds the common functions you use on a daily basis in R. For an example, lets examine the code underlying the linear regression function lm()…
lm
## function (formula, data, subset, weights, na.action, method = "qr",
## model = TRUE, x = FALSE, y = FALSE, qr = TRUE, singular.ok = TRUE,
## contrasts = NULL, offset, ...)
## {
## ret.x <- x
## ret.y <- y
## cl <- match.call()
## mf <- match.call(expand.dots = FALSE)
## m <- match(c("formula", "data", "subset", "weights", "na.action",
## "offset"), names(mf), 0L)
## mf <- mf[c(1L, m)]
## mf$drop.unused.levels <- TRUE
## mf[[1L]] <- quote(stats::model.frame)
## mf <- eval(mf, parent.frame())
## if (method == "model.frame")
## return(mf)
## else if (method != "qr")
## warning(gettextf("method = '%s' is not supported. Using 'qr'",
## method), domain = NA)
## mt <- attr(mf, "terms")
## y <- model.response(mf, "numeric")
## w <- as.vector(model.weights(mf))
## if (!is.null(w) && !is.numeric(w))
## stop("'weights' must be a numeric vector")
## offset <- as.vector(model.offset(mf))
## if (!is.null(offset)) {
## if (length(offset) != NROW(y))
## stop(gettextf("number of offsets is %d, should equal %d (number of observations)",
## length(offset), NROW(y)), domain = NA)
## }
## if (is.empty.model(mt)) {
## x <- NULL
## z <- list(coefficients = if (is.matrix(y)) matrix(, 0,
## 3) else numeric(), residuals = y, fitted.values = 0 *
## y, weights = w, rank = 0L, df.residual = if (!is.null(w)) sum(w !=
## 0) else if (is.matrix(y)) nrow(y) else length(y))
## if (!is.null(offset)) {
## z$fitted.values <- offset
## z$residuals <- y - offset
## }
## }
## else {
## x <- model.matrix(mt, mf, contrasts)
## z <- if (is.null(w))
## lm.fit(x, y, offset = offset, singular.ok = singular.ok,
## ...)
## else lm.wfit(x, y, w, offset = offset, singular.ok = singular.ok,
## ...)
## }
## class(z) <- c(if (is.matrix(y)) "mlm", "lm")
## z$na.action <- attr(mf, "na.action")
## z$offset <- offset
## z$contrasts <- attr(x, "contrasts")
## z$xlevels <- .getXlevels(mt, mf)
## z$call <- cl
## z$terms <- mt
## if (model)
## z$model <- mf
## if (ret.x)
## z$x <- x
## if (ret.y)
## z$y <- y
## if (!qr)
## z$qr <- NULL
## z
## }
## <bytecode: 0x00000000133a0be8>
## <environment: namespace:stats>
Pretty daunting, but some of the code concepts you observe in the lm function code should be a bit more familiar now. Whenever you want to examine the inner workings of a function that you use, simply type the name of the function without () brackets into the command line.