- Why spanners? Since 'throwing a spanner into the works' has bad connotations, let us begin with the most popular, normal, conventional (if blunt) tool.
Histograms / bargraphs
Unless you are trying to show that data do not 'significantly' differ from 'normal' (e.g. using Lilliefors test), most people find the best way to explore data is some sort of graph. Yet, whilst there are many ways to graph frequency distributions, very few are in common use. Journalists (for reasons of their own) usually prefer pie-graphs, whereas scientists and high-school students conventionally use histograms (or bar-graphs). Curiously, while statisticians condemn pie-graphs as misleading if not wholly inappropriate, they seldom criticise histograms - at any rate histograms appear in virtually every introductory statistics text, and many advanced ones.
- When asked to examine a distribution most people assume they are merely being asked to look at a histogram (which seldom stirs much enthusiasm) either before or after performing a statistical analysis. One justification (noted elsewhere) is publishers are reluctant to 'waste' page space upon qualitative and basic exploration. A more practical reason is that histograms work well when applied to very large sets of normal values, but are not a good way to examine small sets of values, or especially non-normal data. This is partly because, whilst grouping values into class-intervals smooths their distribution to some extent, that smoothing is wholly arbitrary. When applied to values which are highly skewed, highly polymodal, or highly discrete the outcome wholly depends upon your choice of breakpoints (even if you are unaware of making that choice).
If you assume R's default settings are liable to be the most reasonable in most circumstances, plotting a histogram is almost childishly simple. But, when inspecting a histogram, do remember that genuinely normal values are smoothly distributed.
The following code instructs R to randomly select a large sample of (n=1000000) values from a standard normal population and put ('assign') those values in a variable called 'y', then plot a histogram thereof.
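A minimal sketch of such code, assuming nothing beyond base R's rnorm and hist functions with their default settings, might be:

```r
# Randomly select a large sample from a standard normal population
# (mean 0, standard deviation 1) and assign it to y
y <- rnorm(1000000)

# Plot a histogram, letting R choose the class intervals
hist(y)
```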
- Note:
- because our intention is not to provide a software library, but to illustrate principles and promote thought, we only provide the most minimal R-code here.
- In the interests of clarity, we annotated our graphs using a simple image editor (MS PCpaint).
- For those new to R, text to the right of a hashmark is for your information, not R's.
- R purists may be horrified that we often assign values to variables using = rather than <-
Histograms perform tolerably well when 'sensibly' applied to very large samples of 'normal' data, but very poorly when obtained from small samples and/or particularly non-normal data.
- There are two obvious reasons for that:
- The choice of class intervals is almost always arbitrary, hence prone to artefacts and bias.
- Collapsing data to class-intervals, equally arbitrarily, discards fine-structure information.
Of course, if a sample distribution's fine-structure is solely due to simple random variation, smoothing this out can give a more realistic picture of the population that sample represents. But that is only true if the smoothing function is appropriate! Which is one reason why histograms can be astonishingly misleading when their breakpoints are poorly (or unluckily) chosen.
Let us therefore consider some other ways of graphically displaying how values are distributed which do not require class-intervals.
Surprisingly, the rank-based nonparametric viewpoint has much to offer in exploring distributions - even if you merely want to see whether re-scaling (transforming) your data has made its errors roughly normal.
- Unfortunately, for historical reasons, it is commonly assumed that is the only reason to examine a distribution.
Some of the most useful procedures were devised for sampling distributions, or as extensions of confidence intervals, but are seldom applied to the actual data.
Univariate scatterplots
For a variety of reasons, univariate scatterplots (rugplots) are the simplest way to compare how sets of values are distributed - yet they are surprisingly rarely used, even in elementary stats texts. By convention rugplots are plotted along one or more of a graph's axes, often the x-axis, hence its name (imagine a rug viewed edge-on) - but this need not be so.
Therefore, whilst R's rug function allows you to add a rugplot to an existing plot, the following code takes a sample of n observations from a defined population (Y), and plots them as a simple (vertical) rugplot.
- Note that this graph represents (n=) 100 values - yet only 10 are visible on the plot. This is because y can only take the value of one of the ten discrete values given above. Tied values are not distinguished.
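As an illustration only, the sketch below assumes a hypothetical population Y of ten discrete values (1 to 10) and draws n = 100 observations with replacement; the population in the original example may well have differed.

```r
Y <- 1:10                           # hypothetical population of ten discrete values
n <- 100                            # sample size
y <- sample(Y, n, replace = TRUE)   # random sample, so ties are inevitable

# Plot every observation at the same x-position, giving a vertical rugplot
x <- rep(1, n)
plot(x, y, xaxt = "n", xlab = "", ylab = "y")
```

Because tied values are plotted on top of one another, at most ten distinct points can appear.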
The following code requests R to take three random samples of different 'theoretical' populations, then to plot them as 'rugplots' up the middle of a graph. This time we have used dashes to minimize point overlap.
- Since we believe computers should be machines which save us work, we often disregard convention and use rather than
- Similarly, given R provides 'random' theoretical functions, we thought might be clearer than , and have used rather than
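A hedged sketch of the three-sample rugplot follows. The populations chosen (normal, uniform and exponential), the sample size, and the names y1, y2 and y3 are our assumptions for illustration.

```r
n  <- 50
y1 <- rnorm(n)           # sample of a normal population
y2 <- runif(n, -2, 2)    # sample of a uniform population
y3 <- rexp(n) - 1        # sample of a skewed (shifted exponential) population

# One column of dashes per sample, drawn up the middle of the graph
x <- rep(1:3, each = n)
plot(x, c(y1, y2, y3), pch = "-", xlim = c(0, 4),
     xaxt = "n", xlab = "", ylab = "value")
axis(1, at = 1:3, labels = c("y1", "y2", "y3"))
```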
Conventional dotplots
Dotplots, traditionally drawn with graphpaper and pen, used to be a popular way to display distributions of small, heavily tied, sets of values.
The R code below assigns some values to a variable (y), then plots a conventional dotplot, with duplicate values arranged evenly above and below.
The conventional way to go about this task was (to instruct some unfortunate technician) to plot values initially as a rugplot, adding tied values alternately above and below their fellows. Provided each value has an odd number of tied values the graph should be symmetrical about the x-axis, otherwise the result was arbitrarily asymmetric - and for large sets of values, a tedious, untidy, and unsatisfactory affair.
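The alternating rule described above can be sketched as follows. The data values and the helper names k and offset are hypothetical, and this is one possible implementation rather than a standard one.

```r
y <- c(1, 2, 2, 3, 3, 3, 4, 4, 5, 3, 2, 3)   # hypothetical, heavily tied values

# k numbers each observation within its group of ties: 1, 2, 3, ...
k <- ave(y, y, FUN = seq_along)

# Place ties alternately above and below the axis: 0, +1, -1, +2, -2, ...
offset <- ifelse(k %% 2 == 0, k / 2, -(k - 1) / 2)

plot(y, offset, pch = 16, yaxt = "n", ylab = "",
     ylim = c(-max(k), max(k)) / 2)
```

Values with an odd number of ties (here 2 and 3) come out symmetrical; the value 4, which occurs twice, does not.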
Conventional dotplots display tied values one above the other. They are also known as univariate scatterplots, dot histograms or histogram-type dotplots, or (along with jittered dotplots) as density plots. Another form of dot histogram displays tied values by plotting the frequency of ties of each value of y, on the value of y, as described below.
Jittered dotplots
One advantage of those very simple (univariate) plots is that even an untrained eye can readily interpret the differing densities of values - until, that is, the points overlap. Therefore, whilst rugplots have an attractive simplicity for very small sets of values, they do not cope well with high densities of similar values, or with 'tied', 'discrete' data.
Years before computers were available, a popular way around those constraints was to plot the values as dotplots using ordinary graphpaper and, if there were duplicate or very-similar values, to add them (more-or-less evenly) either side of those already plotted. Given which, the wider the dotplot, the denser the values were around that value. Statisticians have criticised this method, partly because the rules for adding points were not standard, but especially because it has a habit of introducing unwanted, arbitrary, and sometimes misleading patterning.
Now computers are ubiquitous, and we have good pseudo-random-number algorithms, an increasingly popular way to separate similar values, whilst not introducing bias, is to add uniformly-distributed random variation orthogonally (at-right angles, or 90 degrees) to the observed values. Since adding random variation is known as 'jittering', these are commonly known as jittered dotplots, or jittered scatterplots.
The R code for displaying a single sample as a jittered dotplot is gloriously simple. The following code displays the sample obtained above
- Note the jittering variation (provided by the runif function) is uniformly, not normally distributed - in other words the jittering value is equally likely to have any value from zero to one (so uniform populations are usually parameterized by a minimum and maximum, rather than a mean and standard deviation).
- Notice also that, entirely for our own convenience, we plotted the y-variable horizontally rather than vertically.
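A minimal sketch of a jittered dotplot, assuming a small hypothetical sample y and using runif for the jitter:

```r
y <- c(1, 2, 2, 3, 3, 3, 4, 4, 5, 3, 2, 3)   # hypothetical sample

# Uniform random jitter, between 0 and 1, at right angles to the values
j <- runif(length(y))

# y is plotted horizontally; the vertical axis carries no information
plot(y, j, yaxt = "n", ylab = "", ylim = c(-1, 2))
```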
Smoothing distributions
Applying repeated random variation to the observed values themselves, then averaging the result, smooths their distribution. But, if the result is not to be misleading, this smoothing requires you to select a suitably distributed error distribution, e.g. normal.
The following code instructs R to apply Gaussian (normal) smoothing to the values in variable y3, and plot their mean probability density. We have added the original values as a rugplot.
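One way to do this in base R is the density function, whose default kernel is Gaussian. In the sketch below the sample y3 is regenerated from an assumed exponential population, and the bandwidth is left at R's default.

```r
y3 <- rexp(50) - 1                     # hypothetical sample

# Gaussian kernel density estimate of the values
d <- density(y3, kernel = "gaussian")

plot(d, main = "")                     # smoothed probability density
rug(y3)                                # original values as a rugplot
```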
The computational advantage of using a theoretical and mathematically-defined function is you do not have to repeatedly jitter all your sample values, then examine how the jittered values are distributed.
The advantage to users is smoothed plots resemble theoretical plots and histograms in text-books.
Since the form and degree of smoothing is unavoidably arbitrary, every smoothing function risks introducing bias and artefacts - even if that smoothing function is a simple running mean, or class-intervals (as used in histograms). Jittered scatterplots do not introduce bias on average, but when jittering is applied to an individual sample the human eye smooths the distribution to a random hence uneven degree.
The following code instructs R to produce jittered scatterplots of the 3 samples above
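A short sketch using base R's stripchart function, which jitters each sample for you. The three samples are regenerated here from assumed populations.

```r
n  <- 50
y1 <- rnorm(n)           # hypothetical samples of three populations
y2 <- runif(n, -2, 2)
y3 <- rexp(n) - 1

# One jittered column per sample
stripchart(list(y1 = y1, y2 = y2, y3 = y3),
           method = "jitter", vertical = TRUE, pch = 1)
```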
One important limitation of rugplots, jittered dotplots and their ilk, is they tend to obscure any fine structure within a sample distribution, such as tied values, or patterns within very similar values. Ironically, whilst many nonparametric statistics collapse data to ranks, rank-based methods avoid the problems inherent to class-intervals, and can retain all the fine structure for examination.
Simple rank scatterplots
Arguably the simplest rank-based graphical technique is a scatterplot of rank on value.
For instance, the following code instructs R to randomly select (n=) 30 values from a defined population distribution, and show the result as a scatterplot of rank on value.
Remember, with all the plots on this page, you are unlikely to get precisely (or sometimes even approximately) the same result as us, because the values are selected at random!
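A minimal sketch of a rank scatterplot, assuming a hypothetical discrete population Y of the integers 1 to 20:

```r
Y <- 1:20                                  # hypothetical population
n <- 30
y <- sort(sample(Y, n, replace = TRUE))    # sample, sorted into ascending order

# Once the values are sorted, each value's rank is simply its position,
# so tied values receive consecutive (distinct) ranks
r <- 1:n

plot(y, r, xlab = "value", ylab = "rank")
```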
You can get the same result using these instructions:
- In the interest of readability, we decided not to reduce to
Notice that, because R's rank function assumes you want the mean (average, or expected) rank of tied values, the following code would lose some of that valuable information - unless the data lacks ties (so every value is unique) - which often happens in small samples, even of highly-discrete populations.
Notice this plot tends to obscure how ties within the data are distributed.
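The effect of mean ranks on ties can be seen with a tiny made-up example, in which the value 3 occurs three times:

```r
y <- c(3, 1, 3, 2, 3)                        # hypothetical values

r_mean  <- rank(y)                           # default: ties share their mean rank
r_first <- rank(y, ties.method = "first")    # ties ranked in order of appearance

print(r_mean)     # 4 1 4 2 4
print(r_first)    # 3 1 4 2 5

# With mean ranks the three tied values collapse onto a single point
plot(y, r_mean, xlab = "value", ylab = "mean rank")
```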
Cumulative rank scatterplots
If you wish to compare several samples containing unequal numbers of values it helps to standardize the ranks - most simply by converting to relative rank - as in this example:
- When relative rank is calculated in that way (p = r/n), for any given value, p is the proportion of values in y whose ranks are less than or equal to that value - hence ranking is a cumulative function (re-mapping).
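As a sketch, relative rank (p = r/n) lets two samples of unequal size share one graph. The sample sizes and populations below are assumed for illustration.

```r
y1 <- sort(rnorm(20))              # 20 values, sorted
y2 <- sort(rnorm(50, mean = 1))    # 50 values, sorted

# Relative rank: rank divided by sample size
p1 <- seq_along(y1) / length(y1)
p2 <- seq_along(y2) / length(y2)

# Plot both as step functions on a common scale
plot(y1, p1, type = "s", xlim = range(y1, y2), ylim = c(0, 1),
     xlab = "value", ylab = "relative rank, p")
lines(y2, p2, type = "s", lty = 2)
```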
These plots are also known as empirical cumulative distribution functions (ECDF), and to emphasize the fact they are unavoidably discrete, they are often plotted as stepplots. Plotting them as lineplots smooths the distribution to the eye, and makes them easier to compare, but implicitly assumes intermediate values could realistically be observed.
If you want to use R's ECDF function, you can plot the results using
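For instance, a minimal sketch with base R's ecdf function, which returns a step function that plot can draw directly (the sample is made up):

```r
y  <- rnorm(30)    # hypothetical sample
Fn <- ecdf(y)      # empirical cumulative distribution function of y

plot(Fn, main = "")    # drawn as a step plot
```

Fn can also be called like any other function, so Fn(0) gives the proportion of values less than or equal to zero.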
Theoretical statisticians might also point out that an ECDF provides a maximum-likelihood estimate (MLE) of the population's cumulative distribution function (CDF) - and note that many MLE's are biased. In more everyday terms, these plots are cumulative distributions. Unfortunately, owing to the way statistics are taught in schools, the histogram holds powerful sway, and most people find cumulative distributions comparatively hard to interpret.
P-value plots
One reason cumulative distributions are unpopular is that people find it hard to judge their location, dispersion, or skew. A simple way to address these issues is to convert values of p above 0.5 to 1 minus p - in other words to reflect the upper tail downwards. However, since 1/n <= r/n <= 1 is inherently asymmetric, p = (r-0.5)/n is a less biased measure of relative rank.
The following code takes 3 samples in the same way as immediately above, then presents them as p-value lineplots - to aid comparison, a vertical line shows each sample median.
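A hedged sketch of such a plot follows. The helper fold and the three normal samples are our own illustrative choices, not the original listing.

```r
# Hypothetical helper: sort the values, compute p = (r - 0.5)/n,
# then reflect values of p above 0.5 downwards
fold <- function(y) {
  y <- sort(y)
  p <- (seq_along(y) - 0.5) / length(y)
  list(y = y, p = ifelse(p > 0.5, 1 - p, p))
}

samples <- list(rnorm(30), rnorm(60, mean = 1), rnorm(90, mean = 2, sd = 2))
f <- lapply(samples, fold)

# Empty plot with room for all three samples, then one line per sample
plot(range(unlist(samples)), c(0, 0.5), type = "n",
     xlab = "value", ylab = "p")
for (i in seq_along(f)) {
  lines(f[[i]]$y, f[[i]]$p, col = i)
  abline(v = median(samples[[i]]), col = i, lty = 3)   # sample median
}
```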
One disadvantage of p-value plots is, since they are seldom used, they confuse the uninitiated - including otherwise-sensible statisticians.
- The p-values in these examples employ sample quantiles (not theoretical quantiles) so must not be confused with P-values of test statistics, or P-value plots of nested confidence intervals.
- Since no-one refers to 'empirical QQ plots', talking about 'empirical p-value plots' seems unlikely to improve matters.
Frequency of ties
One advantage of rank scatterplots is that, being cumulative, they are less affected by fine structure than rank-frequency plots - the larger your sample size, the less variable is its cumulative distribution. Hence the peak of each p-value plot (the median is where p=0.5) is a more reliable measure of location than a histogram's mode.
The following code instructs R to plot the relative frequency of each value of y1, calculated from its rank. Bars indicate the frequency each value is tied + 1.
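The sketch below is a simplified alternative: it counts how often each value occurs with base R's table function, rather than deriving the frequencies from ranks. The sample y1 is assumed to be normal values rounded to one decimal place so that ties occur.

```r
y1 <- round(rnorm(100), 1)    # hypothetical rounded sample, hence ties

f <- table(y1)                # number of times each distinct value occurs

# One vertical bar per distinct value, with height equal to its frequency
plot(as.numeric(names(f)), as.vector(f), type = "h",
     xlab = "value", ylab = "frequency")
```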
Of course unless they are subject to rounding, because a normal population contains an infinite number of different values, the probability of selecting two identical normal values by chance approaches zero. In which case the frequency distribution (for example see above) is polymodal, therefore every mode has the same height (f = 1), and the result is equivalent to a univariate plot or rugplot. Rounded large samples produce histogram-like results - but, if the rounding is uneven, such plots are misleading. Remember, given sufficiently many class-intervals, a histogram will also have up to n modes, unless values are tied - in which case the result is equivalent to a bargraph.
At the opposite extreme, most people assume straight lines must be relatively easy to appraise. Which may explain why quantile-quantile plots (QQ plots) are a relatively popular way to compare two distributions.
Q-Q plots
The following instructions are a simple and transparent way to compare two samples of equal size:
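For two samples of equal size this amounts to plotting one set of sorted values against the other, as in this minimal sketch (sample size and populations assumed):

```r
n  <- 100
y1 <- rnorm(n)    # two hypothetical samples of the same population
y2 <- rnorm(n)

# Pair the i-th smallest value of y1 with the i-th smallest value of y2
plot(sort(y1), sort(y2), xlab = "quantiles of y1", ylab = "quantiles of y2")
abline(0, 1, lty = 3)    # the line of equality, y = x
```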
A few moments' thought reveals that if both samples are of the same population we would expect, on average, a QQ plot's paired values to be identical, so its points would lie along the line of equality. Whereas, if the values are selected at random, this will seldom occur - unless the population is extremely small indeed. If you compare two large samples of infinite normal populations, you commonly find values are very similar around the plot's center, but differ at its extremes.
If your samples are of unequal size, R's qqplot function can use interpolated values from the larger sample. So if y1 has 3000 values and y2 has 3 values, qqplot only produces 3 points.
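That behaviour is easy to check, since qqplot invisibly returns the points it plots. A minimal sketch with made-up samples:

```r
y1 <- rnorm(3000)    # large hypothetical sample
y2 <- rnorm(3)       # very small hypothetical sample

q <- qqplot(y1, y2)  # plots, and returns, the paired quantiles

length(q$x)          # number of points plotted
```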
Normal quantile plots
Because two-sample QQ plots are comparatively rare, most people assume QQ plots are only used to see if a set of values deviates from their expected ('theoretical') normal values. This type of plot is more correctly termed a normal quantile plot, for example as follows:
- Notice that in real data the normal population's mean and standard deviation are seldom known, unless they are standardized (e.g. using ), these quantiles will be linearly related, but unequal.
- Again, because the theoretical values are normal population quantiles, a relative rank of P=r/n would bias those theoretical values. So, to reduce that bias, we use (r-0.5)/n
- For instance in a sample of n values the highest possible rank (rmax) equals n, therefore rmax/n = 1. However, if y is randomly selected from a normal distribution, we are unlikely to observe the highest value of y is infinite - unless n is also infinite.
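A minimal sketch of a hand-made normal quantile plot, using p = (r-0.5)/n and base R's qnorm function. The sample itself is made up.

```r
y <- sort(rnorm(50, mean = 10, sd = 2))   # hypothetical sample, sorted
n <- length(y)

p <- (1:n - 0.5) / n    # relative rank, avoiding p = 1
q <- qnorm(p)           # theoretical standard normal quantiles

plot(q, y, xlab = "theoretical quantile", ylab = "observed value")

# With p = r/n the highest rank would give an infinite quantile
qnorm(n / n)
```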
Two further alternatives
If you prefer to use R's normal quantile plot function, it is called qqnorm
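In base R the built-in normal quantile plot is produced by qqnorm, and qqline adds a reference line through the quartiles. A minimal sketch with a made-up sample:

```r
y <- rnorm(50, mean = 10, sd = 2)   # hypothetical sample

qqnorm(y)    # observed values on theoretical normal quantiles
qqline(y)    # reference line through the first and third quartiles
```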
The following code applies R's normal quantile function to the expected values of 5 normal observations, which we estimate from (R=) 50000 random samples (of n=15 values) from a normal population (otherwise known as ranked normal deviates, or rankits).
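A sketch of how rankits might be estimated by simulation, assuming n = 15 values per sample and R = 50000 samples; the names sims and rankits are ours.

```r
n <- 15       # values per sample (assumed)
R <- 50000    # number of random samples

# Each column is one sorted sample of n standard normal values
sims <- replicate(R, sort(rnorm(n)))

# The mean of each row estimates the expected value of that rank (its rankit)
rankits <- rowMeans(sims)

qqnorm(rankits)    # rankits on R's theoretical normal quantiles
```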
Lest you assume theoretical quantiles estimated via simulation (such as rankits) have no advantage over theoretical quantiles obtained from an inverse probability function, let us compare them a little more carefully.
The following code asks R to plot the difference between the (estimated) expected values on their theoretical quantiles (in this case obtained from R's normal quantile plot function). Plotting the deviations from expected against their observed values is much more sensitive than a simple QQ plot - so can reveal systematic differences in two otherwise similar distributions.
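A hedged sketch of such a deviation plot follows. To keep it quick we assume n = 15 and only R = 5000 simulated samples; the theoretical quantiles are taken from qqnorm with plotting switched off.

```r
n <- 15
R <- 5000    # fewer simulated samples than in the text, for speed

# Estimate rankits: the mean of each order statistic over R sorted samples
rankits <- rowMeans(replicate(R, sort(rnorm(n))))

# Theoretical quantiles, as used by R's normal quantile plot
q <- qqnorm(rankits, plot.it = FALSE)$x

# Deviation of each rankit from its theoretical quantile
d <- rankits - q
plot(q, d, type = "b", xlab = "theoretical quantile", ylab = "difference")
abline(h = 0, lty = 3)
```

The differences are close to zero at the median, but of opposite sign in the two tails.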
- Notice that the median value is unbiased.
- In this example Y contained standard rankits, because their values are similar to theoretical standard normal quantiles.
- For a real sample, because the population parameters are unknown, you would only expect no difference if the data were standardized prior to plotting, for example using:
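One assumed way to standardize a sample is to subtract its mean and divide by its standard deviation, as in this sketch. The sample is made up, and the original listing may have used a different expression.

```r
y <- rnorm(40, mean = 10, sd = 2)   # hypothetical 'real' sample

# Standardize: the result has mean 0 and standard deviation 1
z <- (y - mean(y)) / sd(y)

# Deviations of the sorted standardized values from their normal quantiles
n <- length(z)
q <- qnorm((1:n - 0.5) / n)
plot(q, sort(z) - q, xlab = "theoretical quantile", ylab = "difference")
abline(h = 0, lty = 3)
```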
See also
More examples of R code for displaying frequency distributions: Drawing a histogram, a frequency polygon, a stem and leaf plot, jittered dot plot, rank scatterplots, frequency of each value, empirical cumulative distribution function (ECDF), P-value plot, multiple P-value plots, smoothed distribution function.
Saving Plots in R
Since R runs on so many different operating systems, and supports so many different graphics formats, it's not surprising that there are a variety of ways of saving your plots, depending on what operating system you are using, what you plan to do with the graph, and whether you're connecting locally or remotely.
The first step in deciding how to save plots is to decide on the output format that you want to use. The following table lists some of the available formats, along with guidance as to when they may be useful.
Format | Driver | Notes |
JPG | jpeg | Can be used anywhere, but doesn't resize |
PNG | png | Can be used anywhere, but doesn't resize |
WMF | win.metafile | Windows only; best choice with Word; easily resizable |
PDF | pdf | Best choice with pdflatex; easily resizable |
Postscript | postscript | Best choice with latex and Open Office; easily resizable |
A General Method
First, here's a general method that will work on any computer with R, regardless of operating system or the way that you are connecting.
- Choose the format that you want to use. In this example, I'll save a plot as a JPG file, so I'll use the jpeg driver.
- The only argument that the device drivers need is the name of the file that you will use to save your graph. Remember that your plot will be stored relative to the current directory. You can find the current directory by typing getwd() at the R prompt.
- You may want to make adjustments to the size of the plot before saving it. Consult the help file for your selected driver to learn how.
- Now enter your plotting commands as you normally would. You will not actually see the plot - the commands are being saved to a file instead.
- When you're done with your plotting commands, enter the dev.off() command. This is very important - without it you'll get a partial plot or nothing at all.
So if I wanted to save a jpg file called 'rplot.jpg' containing a plot of x and y, I would type the following commands:
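The steps above can be sketched as follows; the data in x and y are made up purely so that the example runs on its own.

```r
x <- 1:10     # hypothetical data
y <- x^2

jpeg("rplot.jpg")   # open the jpeg driver, naming the output file
plot(x, y)          # the plot is sent to the file, not the screen
dev.off()           # close the device so the file is written
```

The file appears in the current directory, which getwd() reports.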
Another Approach
If you follow the process in the previous section, you'll first have to make a plot to the screen, then re-enter the commands to save your plot to a file. R also provides the dev.copy command, to copy the contents of the graph window to a file without having to re-enter the commands. For most plots, things will be fine, but sometimes translating what was on the screen into a different format doesn't look as nice as it should.
To use this approach, first produce your graph in the usual way. When you're happy with the way it looks, call dev.copy, passing it the driver you want to use, the file name to store it in, and any other arguments appropriate to the driver.
For example, to create a png file called myplot.png from a graph that is displayed by R, type
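A sketch of this approach is given below, assuming an interactive session in which the plot is already displayed in a graph window; x and y are made-up data.

```r
x <- 1:10     # hypothetical data
y <- x^2

plot(x, y)                     # draw the plot on the current device

dev.copy(png, "myplot.png")    # copy the displayed plot to a png device
dev.off()                      # close the png device, writing the file
```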
Remember that when you save plots this way, the plot isn't actually written to the file until you call dev.off.
Local Sessions with Windows or OS X
If you're actually sitting in front of a Windows or Mac computer (i.e. not using ssh to connect), the graphical user interface makes it easy to save files. Under Windows, right click inside the graph window, and choose either 'Save as metafile ...' or 'Save as postscript ...' If using Word, make sure to save as a metafile.
On a Mac, click on the graphics window to make sure it's the active one, then go to File -> Save in the menubar, and choose a location to save the file. It will be saved as a pdf file, which you can double click to open in Preview, and then use the File -> Save As menu choice to convert to another format.