Metadata should include location and time information, the shorthand names of the variables matched to descriptions that include their units, and a description of the levels of any variables that are factors.
‘Tidy’ data begins with a single row of variable names and follows with one row for each observation. For instance, if you measured multiple individuals across different times, then each row will be a measurement from one individual at one time. You will also have time and ID variables to indicate the time and individual.
The tidyr package can help if you need to tidy data.
There is one exception to this commandment: extremely large datasets (e.g. multiple gigabytes) will use less memory if stored in table or wide format.
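As a minimal sketch of what tidying looks like, here is a hypothetical wide dataset (fish lengths measured at two times) converted to tidy format with tidyr's pivot_longer. The data and column names are invented for illustration, and this assumes the tidyr package is installed.

```r
library(tidyr)

# Wide format: one column per time point (hypothetical data)
wide <- data.frame(
  ID = c("fish1", "fish2"),
  t1 = c(10.2, 11.5),
  t2 = c(12.1, 13.0)
)

# Tidy (long) format: one row per individual per time,
# with ID and time variables identifying each measurement
tidy <- pivot_longer(wide, cols = c(t1, t2),
                     names_to = "time", values_to = "length")
tidy
```

Each of the four measurements now sits on its own row, which is the shape most R modelling and plotting functions expect.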
You can use letters, numbers, _ and . in variable names. Avoid any other symbols. For instance, instead of calling a species Plectropomus leopardus, use a short name without spaces (e.g. P.leopardus). Also, don't start variable names with numbers.
For instance, if using CamelCase, stick to it; or instead separate words with _. Be consistent with capitalisation too: R will see Plectropomus as different to plectropomus. A further tip is to avoid using multiple names for the same identity, like plectropmus sp. and Plectropomus sp.
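A quick sketch of why this matters: R is case-sensitive, so inconsistent capitalisation in a data column creates spurious categories (the species entries below are hypothetical).

```r
# Inconsistent capitalisation of the same species name
species <- c("Plectropomus", "plectropomus", "Plectropomus")

table(species)           # two categories where there should be one

# One fix: convert to a consistent case before analysis
table(tolower(species))  # now a single category
```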
R will automatically detect the type of data. If all values of a variable are numbers, it will treat them as continuous (decimal or integer) numbers. If you mix numbers and words, R will treat the whole column as text (a character variable, or a factor in older versions of R).
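A small sketch of this type detection, assuming R >= 4.0 (where read.csv no longer converts text columns to factors by default). The csv content here is hypothetical.

```r
csv_text <- "site,depth,habitat
A,1.5,reef
B,2.0,sand
C,3,reef"

# read.csv can read from a string via the text argument
dat <- read.csv(text = csv_text)

sapply(dat, class)
# depth contains only numbers, so it is read as numeric;
# habitat mixes letters, so it is read as text
# (use stringsAsFactors = TRUE or factor() if you want a factor)
```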
Someday you may want to join data together, for instance, data collected at the same sites across different studies. Keep the site names consistent.
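To illustrate, here is a hedged sketch of joining two hypothetical studies by a shared site column using base R's merge (all data invented).

```r
fish  <- data.frame(site = c("ReefA", "ReefB"), fish_count  = c(12, 7))
coral <- data.frame(site = c("ReefA", "ReefB"), coral_cover = c(0.4, 0.6))

# Join the two studies on their common site column
combined <- merge(fish, coral, by = "site")
combined
```

If the site names don't match exactly ("ReefA" in one study, "reef A" in the other), merge() silently drops those rows, which is why consistent naming matters.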
There is a big difference between data you didn't collect and data you tried to collect but went missing. So be explicit about missing data: enter the rows where you tried to collect data and put NA in the missing cells. It is better to use NA than to leave the cell blank, because R interprets NA as missing data.
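A minimal sketch of how R handles explicit NAs (the temperature values are hypothetical):

```r
temps <- c(22.1, NA, 23.4)    # NA marks the day sampling failed

mean(temps)                   # returns NA: R refuses to guess
mean(temps, na.rm = TRUE)     # drops the missing value, then averages
is.na(temps)                  # locates the gaps
```

Because the missing day is recorded explicitly, you must decide how to treat it, rather than R silently averaging over a shorter series.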
Data wrangling is the process of checking data for errors and formatting data so it can be used in analysis. If you do all the wrangling in R and keep the scripts, the process is fully repeatable. Then if someone wants to see how you changed the original data, to check for errors for instance, all the changes are recorded in a script.
If you want to know more about data wrangling in R, check out my latest courses.
Also, avoid .xls and .xlsx formats for saving data. Instead, save your data as .csv (comma-separated values). It is inconvenient to read Excel formats into R, and Excel can also change the 'type' of numbers in strange ways that alter your data. For instance, data that look like dates (10/2) may get transformed into a different date format (10-February). The .xlsx format also uses more memory than .csv. So it is better to keep your data as .csv and wrangle it in R.
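A quick sketch of the round trip (the data frame and file name are hypothetical):

```r
dat <- data.frame(site = c("A", "B"), count = c(3, 5))

# row.names = FALSE avoids writing an extra, unnamed index column
write.csv(dat, "my_data.csv", row.names = FALSE)

# Reading it back preserves the values, with no Excel-style reformatting
dat2 <- read.csv("my_data.csv")
dat2
```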
For instance, if you have collected data over different dates and want to keep them in separate spreadsheets, give them names with the dates recorded in a consistent format: data_3-Oct-2015.csv and so on. That way you can rapidly read the files into R and parse the file names into actual dates.
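As a sketch of parsing such names, assuming file names follow the pattern above (the second file name here is invented to match the pattern):

```r
files <- c("data_2-Oct-2015.csv", "data_3-Oct-2015.csv")

# Strip the prefix and extension, leaving just the date part
date_strings <- sub("^data_", "", sub("\\.csv$", "", files))

# Parse with a format string matching day-month-year.
# Note: %b (month abbreviation) assumes an English locale.
dates <- as.Date(date_strings, format = "%d-%b-%Y")
dates
```

With real files you would build the files vector with list.files(pattern = "\\.csv$") instead of typing the names.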
R is an open-access resource that many people have invested much time in developing. To keep good karma, make your data and code open-access too (once you have written your publication of course, and if ethics allow). Much scientific data is collected with public money, so it is quite reasonable that the public can access it!
*Footnote: I changed point 7 on 18-Mar-2016 to be explicit that you should use NA for missing data. Thanks to Andrew Beckerman for suggesting this. I appreciate feedback on my blogs, so feel free to get in touch if you have comments of your own.*