Notice we used the assignment operator again, but this time we assigned the input data to the name fishdat.
read.csv() is a function that reads a csv file and stores it in R's local memory. Functions are simply algorithms that R has stored, which perform a task for you, in this case reading a csv file. Functions always begin with a name followed by (), which contains the arguments. In the case of read.csv(), the arguments are the name of the data file ('dat.csv'), header = TRUE and stringsAsFactors = FALSE. The header argument says the first row contains variable names (so R doesn't treat them as measurements). The stringsAsFactors argument tells R to keep strings (i.e. words) as strings, rather than converting them to factors (grouped categories). You will see why we do this later. Also notice we use = to specify the function's arguments. As I said above, it is best to use <- for assigning variable names and to save = for function arguments.
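To recap, the full call looks like this (assuming your file is named dat.csv and sits in your working directory):
fishdat <- read.csv('dat.csv', header = TRUE, stringsAsFactors = FALSE)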
If you need more help with any function in R, type its name prefixed with a question mark, for instance ?read.csv.
Now let's look at fishdat. Either type fishdat into the console, or in RStudio, go to the Environment window and double click on 'fishdat' under 'Data'. You will see a table of data like this (I have appended a truncated version here).
A note about Excel files
In short, don't use .xlsx or .xls files for saving data. The problem with .xls and .xlsx files is that they store extra information alongside the data, which makes files larger than necessary, and Excel's formats can also unwittingly alter your data!
This kind of thing can really mess you up. A stable way to save your data is as a .csv, which stands for comma separated values. These are simply values separated by commas, with rows defined by line breaks. If you select 'Save As' in Excel, you can choose .csv as one of the options. Try opening the .csv file I have given you using a text editor and you will see it is just words, numbers and commas. Obviously, using the csv format means you need to avoid commas in your variable names and values, because every comma will be interpreted as a break between cells in the table.
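For instance, the first few rows of a csv might look something like this (the values here are made up, just to illustrate the format):
site,transect,fish_abund
A,1,5
A,2,7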
As an example of the problems with Excel files, type some numbers into an Excel spreadsheet, then click 'Format' -> 'Cells' and change the category to 'Time'. Now save it as a .csv and open it with a text editor. Your numbers have changed from what you typed in. Because Excel automatically detects number types, if you enter something that looks like dates it might decide to change them all to dates, and in doing so alter your data.
Accessing variables in the dataframe
The rules we learned above for accessing values of a variable also apply to dataframes, except now we have both rows and columns. Rows and columns are indexed using a comma as a separator. Try:
fishdat[1, 3]       # the value in row 1, column 3
fishdat[, 2]        # all rows of column 2
fishdat[1, ]        # all columns of row 1
fishdat[, c(1, 3)]  # all rows of columns 1 and 3
We can also access variables by their names:
fishdat$site             # one column, accessed by name
fishdat$fish_abund[1:5]  # the first five values of fish_abund
fishdat[, 'transect']    # a named column, using bracket notation
Basic data summaries
We can get a summary of the data frame by typing:
head(fishdat)      # the first six rows
head(fishdat, 10)  # the first ten rows
Useful if the dataframe is very large!
We can also do some statistical summaries. These are handy to check the data has been read in correctly, and even to start some analysis. For instance:
nrow(fishdat)             # number of rows
mean(fishdat$fish_abund)  # mean fish abundance
sd(fishdat$fish_abund)    # standard deviation of fish abundance
table(fishdat$site)       # counts of observations at each site
table(fishdat$transect)   # counts of observations on each transect
Did you notice a mistake in fishdat?
Yes, R distinguishes between upper and lower case, so it is treating sites b and B as different sites. They are meant to be the same. We will find out how to correct this typo in a moment. But first, we need to locate its position.
We can locate the indices of certain values using logical comparisons, for instance:
fishdat$site == 'A'
This returns a vector of TRUE and FALSE values, one for each row of the dataframe. TRUE occurs where the site is A and FALSE where the site isn't A. The double equals is used to ask a question (i.e. is the site A?). If you prefer row indices to TRUE/FALSE values, you can wrap the logical comparison in the which() function:
which(fishdat$site == 'A')
So the first five rows are site A. There are other sorts of logical questions, for instance:
which(fishdat$site != 'A')       # rows where the site is not A
which(fishdat$dN15_SG > 11.99)   # rows where dN15_SG is greater than 11.99
which(fishdat$dN15_SG >= 11.99)  # rows where dN15_SG is greater than or equal to 11.99
Correcting mistakes in data frames
Earlier we noticed a typo where one transect at site ‘B’ was mistakenly typed as site ‘b’. We should identify which ‘b’ is wrong and correct it.
ierror <- which(fishdat$site == 'b')  # find the row with the typo
fishdat$site[ierror]                  # check we have the right value
fishdat$site[ierror] <- 'B'           # overwrite it with the correct site code
We could also have used the toupper() function to change the case of 'b' (or in fact the case of every value in site). You might be thinking, 'I could have easily changed that error in a spreadsheet'. Well, you could have for a dataframe with 20 observations, but what if the data had 1 million observations?
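For instance, a one-line sketch of the toupper() approach would be:
fishdat$site <- toupper(fishdat$site)  # capitalise every site code in one go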
Saving our corrected data frame
Now we have corrected our dataframe, we may want to save it for later. We can do this with write.csv(), which is basically the opposite of read.csv().
write.csv(fishdat, 'dat_corrected.csv', row.names = FALSE)
Which saves a new csv file to our current working directory (have a look in your folder and you should see it there). If you want to know what your current working directory is, type getwd().
Basic plots
Let's do a few graphical summaries. Try these:
hist(fishdat$fish_abund)
plot(fishdat$fish_abund, fishdat$dN15_SG)
Which plots a histogram and an xy plot of fish abundances against the nitrogen isotope ratios for seagrass at each transect. Looks like there is a relationship. We can improve our plot slightly by using some extra arguments to plot()
plot(fishdat$dN15_SG, fishdat$fish_abund, xlab = 'N Isotope ratio', ylab ='Fish abundance', pch = 15)
The pch argument changes the symbol type. In fact, a whole host of graphical parameters can be set with the par() function, which affects all subsequent plots (try ?par to see them all). For instance:
par(las = 1)  # las = 1 rotates the axis labels to be horizontal
plot(fishdat$dN15_SG, fishdat$fish_abund, xlab = 'N Isotope ratio', ylab ='Fish abundance', pch = 15)
par(mfrow = c(1, 2))  # split the plotting window into 1 row and 2 columns
hist(fishdat$fish_abund, xlab ='Fish abundance', main ='Histogram')
plot(fishdat$dN15_SG, fishdat$fish_abund, xlab = 'N Isotope ratio', ylab ='Fish abundance', pch = 15, col ='steelblue')
Loading packages
Many users of R also develop add-on packages, which they then share with the world. Today we are going to use some packages that make data wrangling easy. First you need to install them on your computer (you will need a web connection to do this). I will run through package installs on the projector. Today we need the tidyr (for tidying dataframes) and dplyr ('data pliers') packages.
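In case you miss the demonstration, installing looks like this (it only needs doing once per computer):
install.packages(c('tidyr', 'dplyr'))
Once you have installed them, you need to load them for each new R session, like this: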
library(tidyr)
library(dplyr)
The red message just means some functions from R's base packages have been masked by functions of the same name in dplyr; nothing for us to worry about here.
All the operations we will perform with tidyr and dplyr can be done using R's base packages. However, we are using these packages because they make data wrangling easy, intuitive and aesthetic. Remember I said that a good script can make you a popular scientist? Well, dplyr and tidyr are good examples of that. They do some tasks we could already do in R, but make them easier and faster.
Filtering a dataframe using dplyr
We can filter a dataframe with dplyr to, say, just get the rows for one site:
fish_A <- filter(fishdat, site == 'A')
fish_A
Which says, take the fishdat dataframe and filter it for rows that have site == 'A'. One way this could be useful would be to help us make a plot where the sites are different colours, for instance:
plot(fishdat$dN15_SG, fishdat$fish_abund, xlab = 'N Isotope ratio', ylab = 'Fish abundance', pch = 15, col = 'grey')
points(fish_A$dN15_SG, fish_A$fish_abund, pch = 15, col = 'red')
Here we have plotted all the points (an easy way to get the axes to scale correctly) and then used points() to add points for site A, but changed their colours. Can you work out how to add the other three sites as different colours?
Summaries using dplyr
Summarising data is straightforward with dplyr. What follows is somewhat similar to making a pivot table in Excel, but without all the clicking. First we need to specify what we are summarising over, using the group_by() function:
grps <- group_by(fishdat, site)
grps
grps is similar to the original data frame, but an extra property has been added: 'Groups: site'. Now we have specified a grouping variable, we can summarise() by sites, for instance the mean values of the nitrogen isotopes and fish abundances:
summarise(grps, mean_dN = mean(dN15_SG))
summarise(grps, mean_dN = mean(dN15_SG), mean_abund = mean(fish_abund))
Here is what our data summary looks like
## Source: local data frame [4 x 3]
##
## site mean_dN mean_abund
## 1 A 9.666 6.4
## 2 B 15.260 18.6
## 3 C 10.878 4.2
## 4 D 14.752 19.6
You could summarise using all sorts of other functions, like sd() or min(). Try it for yourself.
Let's store our new data summary, then use its variables in some plots:
datsum <- summarise(grps, mean_dN = mean(dN15_SG), mean_abund = mean(fish_abund))
plot(fishdat$dN15_SG, fishdat$fish_abund, xlab = 'N Isotope ratio', ylab ='Fish abundance', pch = 15)
points(datsum$mean_dN, datsum$mean_abund, pch = 16, col ='red')
Aesthetics in multi-line coding
Notice that we just did a series of data-wrangling steps: first we identified groups, then we created some summaries. We could write this code all on one line, like this:
summarise(group_by(fishdat, site), mean_dN = mean(dN15_SG))
Which will return the same summary as above. But notice we have nested a function inside a function! Not very aesthetic, and difficult for others to follow. One solution would be to write the code over multiple lines, as we did above:
grps <- group_by(fishdat, site)
summarise(grps, mean_dN = mean(dN15_SG))
One clever thing the creators of dplyr did was to add data 'pipes' to improve the aesthetics of multi-function commands. Pipes look like this: %>%, and work like this:
fishdat %>% group_by(site) %>% summarise(mean_dN = mean(dN15_SG))
Pipes take the output of the first function and apply it as the first argument of the second function and so on. Notice that we skip specifying the dataframe in group_by() and summarise() because the data was specified at the start of the pipe. The pipe is like saying, take fishdat pipe it into group_by(), then pipe the result into summarise(). Now our code reads like a sentence ‘take fishdat, do this, then this, then this’. Rather than before which was more like ‘apply this function to the output of another function that was applied to fishdat’.
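Pipes also make it easy to spread a chain of commands over several lines, one step per line, which many R users find the most readable style of all:
fishdat %>%
  group_by(site) %>%
  summarise(mean_dN = mean(dN15_SG))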
Joining dataframes
Our fish abundances and seagrass isotopes aren't the end point of our data analysis. We also want to know how they relate to some variables we measured at each site. To see these, load:
sitedat <- read.csv('Sitesdat.csv', header = TRUE, stringsAsFactors = FALSE)
sitedat
dplyr has a range of functions for joining dataframes; all have join in their names. Here we use inner_join(). To find out about the others, look at the help file (?inner_join):
datnew <- inner_join(fishdat, sitedat, by ='site')
datnew
We have joined fishdat and sitedat by their common variable, site. Notice the join automatically replicates distance and depth across all the transects, keeping our dataframe 'tidy'. One analysis we could do is plot the seagrass isotope ratios by distance (which represents distance from a sewage plant):
plot(datnew$distance, datnew$dN15_SG, xlab ='Distance from sewage', ylab ='Nitrogen isotope ratio of seagrass')
Looks like the pollution source is enriching the nitrogen in the seagrass.
As a challenge, can you think of a way of plotting the mean values of dN15_SG at each distance?
Tidying messy data
We have yet more data. Our collaborator has helped us out and taken three fish of different sizes from each site (small, medium and large) and run isotope analyses on those. This is great, because it means we can see if fish isotopes follow the same pattern as the seagrass.
Let’s look at the data
fishiso <- read.csv('FishIsodat.csv', header = TRUE, stringsAsFactors = FALSE)
fishiso
It seems our collaborator was not familiar with the 'tidy data' format. Notice that the values of the isotope ratios at each site are stored as separate columns. We will need to tidy this data if we are to proceed. Luckily we have the tidyr package handy, and one function, gather(), will be particularly useful here. (Check out the package vignettes for other useful functions to tidy data.)
(isotidy <- gather(fishiso, site, dN_fish, A:D))  # the outer brackets print the result
## size site dN_fish
## 1 small A 10.04
## 2 medium A 11.21
## 3 large A 10.77
## 4 small B 16.10
## 5 medium B NA
## 6 large B 16.50
## 7 small C 12.23
## 8 medium C 13.18
## 9 large C 13.20
## 10 small D 14.93
## 11 medium D 14.44
## 12 large D 13.98
The tidy data looks much better. What gather() has done is take the fishiso dataframe and create two new variables, site and dN_fish, from the names and values of columns A to D.
Notice the NA for the medium fish at site B. Unfortunately, we couldn't catch a fish of that size at site B. We encoded the missing value as NA when we entered the data; NA is the correct way to indicate a missing value to R.
Summaries with missing values
Let’s create a summary of our new fish isotope ratio data, using our new piping skills.
(fishiso_mn <- isotidy %>% group_by(size) %>% summarise(mean_dN = mean(dN_fish)))
## Source: local data frame [3 x 2]
##
## size mean_dN
## 1 large 13.6125
## 2 medium NA
## 3 small 13.3250
Which says: take isotidy, group it by size, then make a summary of the mean values of dN_fish. Well, we get the means, but not for the size category with the missing value. Why? R's default behaviour when taking the mean of a vector containing a missing value is to return NA. R does this to make sure we are aware there are missing samples (and therefore that our design is unbalanced). We can change this default behaviour, however, and ask R to ignore the NA by adding the argument na.rm = TRUE.
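You can see this behaviour with a tiny example vector:
mean(c(1, 2, NA))                # returns NA
mean(c(1, 2, NA), na.rm = TRUE)  # returns 1.5
Applying the same fix to our summary: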
(fishiso_mn <- isotidy %>% group_by(size) %>% summarise(mean_dN = mean(dN_fish, na.rm = TRUE)))
## Source: local data frame [3 x 2]
##
## size mean_dN
## 1 large 13.61250
## 2 medium 12.94333
## 3 small 13.32500
We have our mean for the medium fish, but be aware it was calculated from only three sites. Let's also do some means by site (across sizes):
sites_iso_mn <- isotidy %>% group_by(site) %>% summarise(mean_dNfish = mean(dN_fish, na.rm = TRUE))
Bringing it all together
Now let's bring all our data frames together and see if we can plot the nitrogen isotope ratios for both fish and seagrass on the same plot, as a function of distance from the sewage plant. First, join the mean values for fish isotope ratios to the site data:
sites_means <- inner_join(sitedat, sites_iso_mn, by = 'site')
Then join our new sites dataframe with the summary of fishdat.
(sites_means2 <- inner_join(sites_means, datsum, by = 'site'))
## site distance depth mean_dNfish mean_dN mean_abund
## 1 A 5 10 10.67333 9.666 6.4
## 2 B 12 11 16.30000 15.260 18.6
## 3 C 6 4 12.87000 10.878 4.2
## 4 D 11 3 14.45000 14.752 19.6
Note, you could try doing this all in one line, using pipes.
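For instance, something like this (an untested sketch, but it should match the stepwise version above):
sites_means2 <- sitedat %>%
  inner_join(sites_iso_mn, by = 'site') %>%
  inner_join(datsum, by = 'site')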
Now let's put it all together in a plot:
par(las = 1)
plot(sites_means2$distance, sites_means2$mean_dN, xlab = 'Distance from sewage plant', ylab = 'N isotope ratio', pch = 16, ylim = c(5, 20), col = 'springgreen3', cex = 1.5, cex.lab = 1.3)
points(sites_means2$distance, sites_means2$mean_dNfish, pch = 16, col = 'steelblue1', cex = 1.5)
legend('topleft', legend = c('Seagrass', 'Fish'), pch = 16, col =c('springgreen3','steelblue1'), cex = 1.5)
To explain some new arguments above: ylim sets the y-axis limits, cex sets the scaling of the points and cex.lab resizes the axis labels. The legend() function lets us add a legend to the plot. Note, if you want to try different colours, type colors() to see a full list. The RColorBrewer package is also extremely handy for picking colours. Finally, you should also plot the correlation of mean fish versus mean seagrass isotope ratios. I will leave that to you to work out.
Advanced tasks
If you have cruised through the first part of the course, have a go at this advanced challenge. I have given you some extra dataframes to practise your data-wrangling skills on. These are much larger, so you will need to use R to find the errors in them. The dataframes are: Fish_wghts.csv, which records individual weights and sexes for fish caught in 100 different trawls; and Trawl_sites.csv, which records each trawl's distance from shore and its depth. We want to know how the mean weight of individual fish varies with distance from shore, how the total population biomass varies with distance from shore, and whether the different sexes have different weights. Beware: there are two errors in the Fish_wghts file that you will need to correct!
Conclusion
I hope I have convinced you of how useful R can be for data wrangling. R might be hard at first, but the more you use it, the easier it will become. If you can use R, you will find data management much more straightforward and repeatable. Aim to write aesthetic scripts and you can avoid many headaches when you realise there are errors in your database: with a neat script you can trace back to the problem and correct it.
Data wrangling is a skill that takes time to learn, but I believe it is worthwhile. Being good at data wrangling will make you a sought-after collaborator and a better scientist.