Getting Started

Introduction to Statistics and R

1. Overview

This page is designed to give you a basic understanding of statistical concepts and of using R before the core module content begins in week 10. The content is self-guided, so you can work through it at your own pace, with links to extra resources where they are useful.

If you are a complete beginner, start with the general introduction page, which covers the approach for the module and gives you the tools and materials you need to get started. If you are already an RStudio user and are comfortable managing and storing R script files, you may want to jump straight into the Bootcamp modules.

We make a few assumptions about this material:

  • You follow along with the workshop pages consecutively

  • You complete each one before moving on to the next

  • You attempt all of the problems at the end of each page

Crucial Reminder! Throughout these early materials you will be introduced to new terminology and processes which may at first appear daunting. Do not fear! As with any software, and with programming in general, R has its peculiarities, which you will soon get used to. Ultimately the aim of this resource, and of the EDA module as a whole, is to get you comfortable with just ‘having a go’ and knowing where to look when you get stuck.

2. Why Use R?

The motivation for using R is that it is designed to help people with no programming experience to perform sophisticated statistical analysis with minimum effort. R has grown in popularity recently and is used extensively by universities, companies, and researchers everywhere. Because of this, there is a very large community of users and a demand in business and academia for skills using R.

R is free and open source. R is easy to learn and works the same for folks with fast and slow computers, no matter what kind of operating system or computer they like to use, and it is easy to use via the web on any device.

3. Installing R

You have three options for following along with these materials as they are intended.

Option 1 Download and install R from CRAN and then download and install RStudio desktop. Install R first, then RStudio. If you have older versions installed, it is probably a good idea to go ahead and install the latest version of each. If you have a PC or laptop you regularly use, this option is probably best and will work for almost all hardware and operating systems.

Help for Windows

Help for Macs

Help for Linux

Option 2 If you can’t install R or do not wish to, or if you prefer to work in “the cloud”, you may wish to start a free account at RStudio Cloud and follow along that way.

Option 3 If you want another browser based option and you have an active Google account, you may wish to set up with Google Colaboratory and follow along that way. Please follow this guide for using ‘Colab’ with R.
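Whichever option you choose, a quick way to confirm that R is working is to run a couple of simple commands. The lines below are a minimal sketch rather than a required step. They use only base R and should behave the same way in RStudio desktop, RStudio Cloud, or Colab.

```r
# Print the version of R that is currently running
R.version.string

# A simple calculation to confirm that commands are being evaluated
1 + 1
```

If both commands print a result, R is installed and responding.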

4. RStudio: Overview & Set-Up

RStudio desktop is an environment to write R code, perform statistical analysis, organize big or small projects with multiple files, and view and organize outputs. There are many features of RStudio, but we are only going to point out a few.

4.1 The Script Window

The script window is located in the upper left of the RStudio window by default. You may need to open a script or start a new one: File > New File > R Script (hotkey Ctrl+Shift+N, or Cmd+Shift+N on Macs).

The script window is where you are likely to spend most of your time building scripts and executing commands you write. You can have many scripts open at the same time (in “tabs”), and you can have different kinds of scripts, e.g., for different parts of a project or even for different programming languages.

4.2 The Console Window

The Console window is in the lower left by default. Notice there are several other tabs visible, but we will only mention the Console for now. The Console is the place where text outputs are printed (e.g. the results of statistical tests), and it is also where R prints Warning and Error messages.
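To see the kinds of text the Console shows, you could try the short sketch below. The values are illustrative only and the object name result is hypothetical. Each line produces a different kind of Console output: an ordinary result, a message, and a warning.

```r
# Text output: the result is printed in the Console
sqrt(16)

# A message from R; the script carries on as normal
message("This is a message")

# A warning: R still returns a value (NaN) but flags a possible problem
result <- sqrt(-1)
result
```

Note that a warning does not stop your code from running, whereas an error does.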

4.3 The Global Environment

The Global Environment is in the Environment tab in the upper right of RStudio by default. This pane is useful for displaying the data objects that you have loaded and that are available to use.
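As a small illustration, the sketch below creates two objects, lists what is in the Global Environment, and then removes one of them. The object names my_numbers and my_label are hypothetical and chosen only for this example. If you are working in RStudio, the Environment tab should update as each line runs.

```r
# Create two objects; each should appear in the Environment tab
my_numbers <- c(2, 5, 3)
my_label <- "example"

# List the names of the objects in the Global Environment
ls()

# Remove one object; it should disappear from the Environment tab
rm(my_label)
```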

4.4 The Plots Window

The Plots window is a tab in the lower right by default. This is the place where graphics output is displayed and where plots can be named, resized, copied and saved. There are some other important tabs here as well, which you can also explore. When a new plot is produced, the Plots tab will become active.
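To see the Plots tab in action, you could run the sketch below. It uses only base R graphics, and the values in my_values are made up for illustration.

```r
# Example values (illustrative only)
my_values <- c(2, 5, 3, 6, 3, 4, 7)

# Draw a basic plot of the values against their position in the vector
plot(my_values)

# Draw a histogram of the same values; this becomes the plot on display
hist(my_values)
```

In RStudio, the arrows at the top of the Plots tab let you move back to the earlier plot.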

5. Working in R

5.1 Script Setup

An R script is a plain text file where the file name ends in “dot R” (.R) by default.

An R script serves several purposes:

First, it documents your analysis allowing it to be reproduced exactly by yourself or by others.

Second, it is the interface between your commands and R software.

A goal is that your scripts should contain only important R commands and information, in an organized and logical way that has meaning for other people, maybe for people you have never spoken to. A typical way to achieve this is to organize every script according to the same plan.

Your R script should be a file good enough to show to a person in the future (like a supervisor, or even your future self). This is someone who can help you, but also someone to whom you may not be able to explain the contents in person. The script should be documented and complete. Think of this future person as a friend you respect.

Although there are many ways to achieve this, for the purposes of the Bootcamp we strongly encourage you to organize your scripts like this:

  • Header

  • Contents

  • One separate section for each item of contents

5.3 Contents

You may want to include a contents section near the top to provide a ‘road map’ for your script & analysis. For example:

# A typical script Contents section

## CONTENTS ####
## 00 Setup
## 01 Graphs
## 02 Analysis
## 03 Etc

5.4 Section ‘Chunks’

A ‘Code chunk’ break is just a notation method used to aid the readability of the script and to provide a section for each item in your table of contents. A code chunk is just a section of code set off from other sections.

Below is the beginning of a typical code chunk in an R script.

  • The first line of a code chunk must start with at least one hash sign “#”

  • It should have a title that describes the contents of the code chunk

  • It must end with at least four hash signs “####”

  • Consecutively numbered titles can make things very tidy

For example:

## 01 This here is the first line of MY CODE CHUNK ####
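Putting the pieces together, a whole script organized as Header, Contents, and one chunk per item might look like the sketch below. This is one possible layout, not a required template. The header fields (Who, What, Last edited) are assumptions made for illustration, and the chunk names are hypothetical.

```r
## HEADER ####
## Who: <your name>            (hypothetical field)
## What: <short description>   (hypothetical field)
## Last edited: <yyyy-mm-dd>   (hypothetical field)

## CONTENTS ####
## 00 Setup
## 01 Summary

## 00 Setup ####
# Make a small vector of example values
my_variable <- c(2, 5, 3, 6, 3, 4, 7)

## 01 Summary ####
# Calculate the mean of the example values
mean(my_variable)
```

Because each title line ends in four hash signs, RStudio treats it as a section that you can jump to or fold away.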

5.5 Comments

Comments are messages that explain code in your script, and they should be used throughout every script. You can think of comments like the methods section of a scientific paper - there should be enough detail to exactly replicate and understand the script, but it should also be concise.

Comment lines begin with the # character and are not treated as “code” by R.

# Make a vector of numbers <--- a comment
my_variable <- c(2,5,3,6,3,4,7)

# Calculate the mean of my_variable <--- another comment
mean(my_variable)
[1] 4.285714

5.6 Running Code

To run your code, or ‘submit commands’, in your R script you can use a few different methods:

  • Run the whole line of code your cursor rests on (without selecting anything): Ctrl+Enter (Cmd+Return on Macs)

  • Run code you have selected with your cursor: Ctrl+Enter (Cmd+Return on Macs)

  • Use the “Run” button along the top of the Script window

  • Run code from the menu: Code > Run Selected Line(s)
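If you would like something to practise on, the short sketch below can be pasted into a script. The object names heights and tallest, and the values, are hypothetical and used only for illustration.

```r
# Place the cursor on the line below and press Ctrl+Enter to run it on its own
heights <- c(1.62, 1.75, 1.80)

# Select both lines below, then press Ctrl+Enter to run them together
tallest <- max(heights)
tallest
```

After running all three lines, the Console should show the largest of the three values.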

6. Exercises

Add a code chunk title to your CONTENTS section and to your script. Make sure to write brief comments for your code. Add the following code to your chunk, run it, and examine the output:

# Create a new variable
my_variable <- c(6.5, 1.35, 3.5)

# Calculate the mean of my_variable
mean(my_variable)
[1] 3.783333

# Calculate the standard deviation of my_variable
sd(my_variable)
[1] 2.586665

Don’t worry about understanding the code for now. We are just working on getting used to RStudio and submitting commands.