Alternative Format — Lesson 1: Scatter Plots and Lines or Curves of Best Fit

Let's Start Thinking

Scattered Showers

When the forecast calls for scattered showers, a meteorologist is predicting dispersed or irregular rainfall that may be widely spaced out with respect to time.

A red umbrella in the rain

Raindrops making scattered ripples

The word scatter is often associated with the idea of having no particular arrangement, or of being dispersed or even disorganized. When seeds are planted by hand, people often scatter the seeds in their garden. They do not place seeds individually at equal increments.

A farmer's hand scattering cilantro seeds

Scattered

However, even though seeds may be scattered when they are planted, they are not typically thrown at random through an entire garden. They are actually scattered within very organized lines so that each row within a garden grows a specific food. Perhaps carrots grow in one row, corn in another, beans in another, and so on.

A farmer's hand scattering wheat seeds in rows

Vegetables growing in rows

In mathematics, we often use scatter plots to help us analyze data. Sometimes data may seem to be random, but when we plot the information on the Cartesian plane we can see a pattern or relationship. 

In this lesson, we will look at scatter plots, determine if two variables show a connection, and use lines or curves to model the data collected.


Lesson Goals

  • Construct scatter plots and interpret the meaning of points on scatter plots.
  • Identify trends in data and describe the correlation.
  • Draw curves or lines of best fit and determine the equation of a line of best fit.
  • Interpolate and extrapolate information using the line of best fit.

Creating Scatter Plots and Interpreting Points


Scatter Plots

Recall

A scatter plot is a graph consisting of points which are formed using the values of two variables.

Every point on a scatter plot is an ordered pair. In the ordered pair, the first value is the independent variable, and the second value is the dependent variable.

Recall

Dependent means relies on or is determined by.

The independent variable is placed on the horizontal axis of the scatter plot while the dependent variable is placed on the vertical axis.

When identifying which variable is independent and which variable is dependent:

  • The independent variable is the variable that may cause change.
  • The dependent variable is the variable that might be affected by the changes in the independent variable.

Scatter Plot 1: People's Reaction Times vs. Age

For example, consider the table of values and the scatter plot shown here:

Age Reaction Time (ms)
\(43\) \(328\)
\(31\) \(224\)
\(61\) \(460\)
\(23\) \(184\)
\(57\) \(411\)
\(38\) \(271\)
\(48\) \(353\)
The data in the table is shown in a scatter plot.
  • Age is the independent variable. It represents the change in time. The scatter plot above shows ages between \(0\) and \(70\) years.
  • Reaction time is the dependent variable. A person's reaction time may be affected by their age. The scatter plot above shows reaction times between \(0\) and \(500\) milliseconds.
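
As a minimal sketch of how the table becomes points on a graph, the following Python snippet pairs each age with its reaction time and finds the smallest and largest value of each variable. The variable names are illustrative only and are not part of the lesson.

```python
# Data from the table: age (years) and reaction time (ms).
ages = [43, 31, 61, 23, 57, 38, 48]
reaction_times = [328, 224, 460, 184, 411, 271, 353]

# Each point is an ordered pair (independent value, dependent value).
points = list(zip(ages, reaction_times))

# The smallest and largest values help when choosing axis ranges.
age_range = (min(ages), max(ages))
time_range = (min(reaction_times), max(reaction_times))

print(points[0])    # (43, 328)
print(age_range)    # (23, 61)
print(time_range)   # (184, 460)
```

Every age lies between \(0\) and \(70\) and every reaction time lies between \(0\) and \(500\), so axes with those ranges show all of the points.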

Scatter plots:

  • are used to graph two data sets to see if there is a relationship or connection between two variables. 
  • have data sets that arise from variable quantities.

Some additional examples of scatter plots are shown here:

Scatter Plot 2: Overall Math Mark vs. Number of Days Absent from Class

Days Absent Overall Math Mark (%)
14 98
14 91
9 90
9 97
15 73
7 55
8 87
39 30
19 76
8 76
20 74
4 79
18 69
13 72
4 89
8 64
10 56
17 68
8 54
5 52
11 62
9 67
9 71
30 64
45 61
3 65
30 61
9 63
15 91
13 70
12 53
18 77
5 38
11 84
9 85
6 89
5 94
6 88
21 88
8 91
9 72
19 68
15 71
6 75

A scatter plot shows the relation between number of days absent and math mark. Each point plotted represents one student.

  • Each point on the graph represents an individual student in the class.
  • The point tells us how many days the student was absent from class and their overall math mark.
  • Independent variable: Number of days absent, ranging from \(0\) to \(50\) days.
  • Dependent variable: Overall math mark, in percent, ranging from \(0\) to \(100\) percent.

Scatter Plot 3: Goals Allowed vs. Number of Games Played by Hockey Goalies in a Particular Season 

Games Played Goals Allowed
33 68
38 86
31 64
20 41
34 76
31 70
23 54
43 99
33 77
47 116
29 66
43 110
52 134
35 92
29 76
50 127
34 82
25 60
46 129
36 96
40 107
22 55
23 66
34 88
35 91
42 113
43 118
46 130
46 123
30 83
27 76
42 123
22 61
38 103
46 135
43 123
31 88
27 78
34 90
31 93
21 55
30 83
21 61
23 71
25 73
41 129
31 95
23 70
36 121
28 99

A scatter plot shows the relation between number of games played by a goalie, and number of goals allowed. Each point plotted represents one player.

  • Each point on the graph represents an individual goalie.
  • The point tells us the number of games the goalie played and the number of goals that they allowed.
  • Independent variable: Number of games played, ranging from \(0\) to \(60\).
  • Dependent variable: Goals allowed, ranging from \(0\) to \(140\).

Scatter Plot 4: Percentage of the World's Population That Has Been Vaccinated for Measles vs. Year

Year Vaccination Rate
1980 17
1981 20
1982 21
1983 37
1984 42
1985 48
1986 47
1987 54
1988 63
1989 68
1990 73
1991 69
1992 69
1993 70
1994 72
1995 73
1996 73
1997 71
1998 71
1999 71
2000 72
2001 73
2002 72
2003 74
2004 76
2005 78
2006 79
2007 80
2008 81
2009 84
2010 85
2011 85
2012 84
2013 84
2014 84
2015 85

A scatter plot shows the relation between the year and the percent of the world vaccinated for measles. Each point plotted represents one year's data. The domain is (1980,2020)

  • Each point on the graph provides a year and a percentage of the world's population that was vaccinated against measles.
  • Independent variable: Year, ranging from \(1980\) to \(2020\).
  • Dependent variable: Percentage of the world vaccinated for measles, in percent, ranging from \(0\) to \(100\).

Notice the break in the \(x\)-axis (Year) on Scatter Plot 4. This indicates that there is a jump from \(0\) to \(1980\) and then the year increases by equal increments. This break allows the graph to be more readable. If we included all of the years from \(0\) to \(1980\) where there is no data available, the data we do have would be very close together and difficult to analyze as shown here:

A scatter plot shows the relation between the year and the percent of the world vaccinated for measles. Each point plotted represents one year's data. The domain is (0,2000+)

Researchers often collect data sets and use scatter plots to quickly communicate their findings to others.

References:

  • Source: Measles Vaccinations - World Health Organization (2019, March 25). Immunization, Vaccines and Biologicals: Data, statistics and graphics. Retrieved from https://www.who.int/immunization/monitoring_surveillance/en/ and licensed under CC-BY 4.0.

Check Your Understanding 1

Question

Data was collected to study the relationship between price and the number of chocolate bars sold. Create a scatter plot of the data by answering the following:

  1. Determine the label for the horizontal axis.
  2. Determine the label for the vertical axis.
  3. Given the following table of values, create a scatter plot of the data.
    Price (dollars) # Sold
    \(1\) \(16\)
    \(3\) \(12\)
    \(5\) \(8\)
    \(7\) \(8\)
    \(8\) \(4\)
    \(9\) \(4\)

Answer

  1. The label for the horizontal axis is Price (dollars).
  2. The label for the vertical axis is # Sold.
  3. A plot of the values from the table.

Feedback

Since the price may cause the number of chocolate bars to change, price is the independent variable which is placed on the horizontal axis.

Interactive Version

Creating a Scatter Plot


Using Scatter Plots With Technology

In the following parts, we will look at using technology to generate a scatter plot and use the graph to answer several questions.


Technology

  • Although we can create scatter plots by hand, often data sets involve large quantities of data, and it is more efficient and more accurate to use one of the many technological options available to us.
  • Most options involve using a spreadsheet and a graphing tool.

A spreadsheet is an interactive table that is used to organize and store data. Typically the columns are labelled alphabetically and the rows are labelled numerically. A single piece of data can be referred to using the column and row identification.

  A B C
1      
2      
3      
4      
5   Cell B5  
6      
7      
8      
9      
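
To illustrate how a column letter and a row number identify one cell, here is a small Python sketch. The function name `cell_to_indices` is hypothetical, and the handling of columns past Z (AA, AB, and so on) is an assumption based on how spreadsheets commonly label columns; the lesson itself only uses columns A to C.

```python
def cell_to_indices(cell):
    """Convert a reference such as 'B5' into (column, row) numbers, both starting at 1."""
    letters = "".join(ch for ch in cell if ch.isalpha()).upper()
    digits = "".join(ch for ch in cell if ch.isdigit())
    column = 0
    for ch in letters:
        # Treat the letters as a base-26 number: A=1, ..., Z=26, AA=27, ...
        column = column * 26 + (ord(ch) - ord("A") + 1)
    return column, int(digits)

print(cell_to_indices("B5"))   # (2, 5)
```

Cell B5 is therefore the entry in the second column and the fifth row.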

In the following example, we will use technology to help us create a scatter plot, and we will use the plot to help us interpret the meaning of points included on the graph. 

Example 1

Data is given about city population and the number of operating restaurants.

Population (100 000s)  Restaurants (Qty)
\(2.17\) \(643\)
\(0.15\) \(285\)
\(1.41\) \(358\)
\(1.53\) \(1200\)
\(8.34\) \(2198\)
\(3.06\) \(715\)
\(3.84\) \(844\)
\(1.08\) \(233\)
\(1.83\) \(378\)
\(7.22\) \(1396\)
\(2.33\) \(450\)
\(1.94\) \(367\)
\(5.37\) \(1006\)
\(1.95\) \(359\)
\(1.02\) \(187\)
\(1.30\) \(226\)
\(3.29\) \(540\)
\(1.62\) \(257\)
\(1.20\) \(165\)
\(5.94\) \(437\)
  1. Create a scatter plot using technology.
  2. Use the scatter plot to locate the point that represents the city with a population of \(537~000\). Read the corresponding number of restaurants for the city and verify this number using the original table of values.
  3. Use the scatter plot to locate the point that represents the city with \(844\) restaurants. Read the corresponding population for the city and verify this number using the original table of values. 
  4. What point(s) on the graph represent(s) a city population that has more than the expected number of restaurants? 
  5. What point on the graph represents a city population that has less than the expected number of restaurants?

Solution — Part A

Notice, our data set includes numbers to two decimal places. Using technology will allow us to create an accurate scatter plot.

First, we can copy our data into a spreadsheet.

  A B C
1 2.17 643  
2 0.15 285  
3 1.41 358  
4 1.53 1200  
5 8.34 2198  
6 3.06 715  
7 3.84 844  
\(\Large{\vdots}\) \(\Large{\vdots}\) \(\Large{\vdots}\)  

Using a graphing tool, we can use technology to generate the scatter plot. For our data, we will have population along the \(x\)-axis as the independent variable (Column A) and the number of restaurants along the \(y\)-axis as the dependent variable (Column B).

A scatter plot shows the relation between city population and number of operational restaurants. Each plot point represents a city.

When using technology to generate scatter plots, remember to label each axis as you would when graphing by hand. It is particularly important in this example because the population is measured in hundred thousands. This means a value of \(3\) on the \(x\)-axis represents \(300~000\) people.

You may also need to adjust the scales on the axes to clearly show the data. The scales on the plot were set to increments of \(1\) on the \(x\)-axis, between \(0\) and \(10\), and increments of \(200\) on the \(y\)-axis, between \(0\) and \(2200\). The graph was also set to show only the positive \(x\)- and \(y\)-axes.

Solution — Part B

Recall part b): Use the scatter plot to locate the point that represents the city with a population of \(537~000\). Read the corresponding number of restaurants for the city and verify this number using the original table of values.

A scatter plot shows the relation between city population and number of operational restaurants. Each plot point represents a city.

Looking at the graph, we know a population of \(537~000\) will be close to the middle of \(5\) and \(6\) along the \(x\)-axis. Moving up, we can locate the point, and read across the \(y\)-axis and find that the city with a population of \(537~000\) has approximately \(1000\) restaurants.

A scatter plot shows the relation between city population and number of operational restaurants. Each plot point represents a city. The city with co-ordinates (5.37, 1006) is highlighted.

If we look at the original table of values, we can see that the city actually has \(1006\) restaurants.

Population
(100 000s)
Number of
Restaurants
\(5.37\) \(1006\)

It is important to note that if we are not given a table of values and we are using just a scatter plot to determine values, we are often estimating because the points are not typically located on grid lines that are easy to read.
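
When the table of values is available, the verification step can be sketched in Python. The snippet below converts a population in people to the table's units (hundred thousands) and looks up the matching city. Only a few rows of the table are included, and the function name is illustrative.

```python
import math

# (population in 100 000s, number of restaurants) for a few cities from the table.
cities = [(2.17, 643), (5.37, 1006), (3.84, 844), (5.94, 437)]

def restaurants_for_population(people):
    """Return the restaurant count for the city with the given population, or None."""
    x = people / 100_000  # convert people to the table's units
    for population, restaurants in cities:
        if math.isclose(population, x):
            return restaurants
    return None

print(restaurants_for_population(537_000))  # 1006
```

The exact value \(1006\) agrees with the estimate of approximately \(1000\) read from the graph.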

Solution — Part C

Recall part c): Use the scatter plot to locate the point that represents the city with \(844\) restaurants. Read the corresponding population for the city and verify this number using the original table of values.

A scatter plot shows the relation between city population and number of operational restaurants. Each plot point represents a city.

Looking at the graph, we go to the \(y\)-axis this time to help us find a point that is a bit above the \(800\) mark. Reading the corresponding \(x\)-value on the horizontal axis, we find that the matching population is approximately \(380~000\). 

A scatter plot shows the relation between city population and number of operational restaurants. Each plot point represents a city. The city with co-ordinates (3.84, 844) is highlighted.

That is, the city with \(844\) restaurants has a population of approximately \(380~000\).

If we look at the table of values, we can see that the actual population for this city is \(384~000\).

Population
(100 000s)
Number of
Restaurants
\(3.84\) \(844\)

Solution — Part D

Recall part d): What point(s) on the graph represent(s) a city population that has more than the expected number of restaurants?

A scatter plot shows the relation between city population and number of operational restaurants. Each plot point represents a city.

We are looking for a point that shows a city with a higher number of operating restaurants than the points surrounding it. One point that satisfies this condition is highlighted here and labelled as point \(A\). This point is significantly above the points immediately around it. The point shows a population of approximately \(150~000\) with approximately \(1200\) restaurants.

Point A is highlighted.

A second point that may also satisfy this condition is highlighted here and labelled as point \(B\). The point does appear to be higher than expected for a population of approximately \(840~000\) people. However, because this is the last point in our set of data, it does not stand out as much as point \(A\).

Point B is highlighted.

Solution — Part E

Recall part e): What point on the graph represents a city population that has less than the expected number of restaurants?

A scatter plot shows the relation between city population and number of operational restaurants. Each plot point represents a city.

Now we are looking for a point that shows a city with a lower number of operating restaurants than the points surrounding it. We can highlight this point on our graph as shown here.

The point near (6, 400) is highlighted.

This point is significantly below the points immediately around it. The point shows a population of approximately \(590~000\), with approximately \(425\) restaurants.
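
One way to make "more than expected" and "less than expected" concrete is to compare each city with a straight-line model. The sketch below is not the method used in this example, which relies on reading the graph by eye. It assumes a least-squares line as the "expected" value and ranks the cities by how far they sit above or below that line. The names `fit_line` and `gap` are illustrative.

```python
# (population in 100 000s, restaurants) for all 20 cities in Example 1.
cities = [
    (2.17, 643), (0.15, 285), (1.41, 358), (1.53, 1200), (8.34, 2198),
    (3.06, 715), (3.84, 844), (1.08, 233), (1.83, 378), (7.22, 1396),
    (2.33, 450), (1.94, 367), (5.37, 1006), (1.95, 359), (1.02, 187),
    (1.30, 226), (3.29, 540), (1.62, 257), (1.20, 165), (5.94, 437),
]

def fit_line(points):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

slope, intercept = fit_line(cities)

def gap(point):
    """Vertical distance from the line to the point (positive means above the line)."""
    x, y = point
    return y - (slope * x + intercept)

# Sort from furthest below the line to furthest above it.
ranked = sorted(cities, key=gap)
print("Furthest below the line:", ranked[0])
print("Furthest above the line:", ranked[-1], "then", ranked[-2])
```

Under this assumption, the city furthest below the line is \((5.94, 437)\), and the two cities furthest above it are \((1.53, 1200)\) and \((8.34, 2198)\), which matches the points identified visually in Parts D and E.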


Check Your Understanding 2

Question — Version 1

Which point represents a student with a higher test mark than expected?

Homework Mark Test Mark
\(35\) \(51\)
\(5\) \(44\)
\(50\) \(93\)
\(15\) \(63\)
\(34\) \(80\)
\(16\) \(62\)
\(50\) \(95\)
\(7\) \(75\)
\(38\) \(88\)

A plot of the data from the table.

Answer — Version 1

The point is \((7,75)\).

Feedback — Version 1

The point is \((7,75)\).

The point (7, 75) is highlighted.

Question — Version 2

Which point represents a student with a homework mark of \(5\)?

Homework Mark Test Mark
\(35\) \(51\)
\(5\) \(44\)
\(50\) \(93\)
\(15\) \(63\)
\(34\) \(80\)
\(16\) \(62\)
\(50\) \(95\)
\(7\) \(75\)
\(38\) \(88\)

A plot of the data from the table.

Answer — Version 2

The point is \((5,44)\).

Feedback — Version 2

The point is \((5,44)\).

The point (5, 44) is highlighted.

Question — Version 3

Which point represents a student with a test mark of \(80\)?

Homework Mark Test Mark
\(35\) \(51\)
\(5\) \(44\)
\(50\) \(93\)
\(15\) \(63\)
\(34\) \(80\)
\(16\) \(62\)
\(50\) \(95\)
\(7\) \(75\)
\(38\) \(88\)

A plot of the data from the table.

Answer — Version 3

The point is \((34,80)\).

Feedback — Version 3

The point is \((34,80)\).

The point (34, 80) is highlighted.

Interactive Version

Interpreting a Scatter Plot


Lines and Curves of Best Fit


Relationships in Data

A trend describes the behaviour of the data and is used to determine if there is a relationship between the variables.

A line of best fit can sometimes be used to represent the relationship between the variables on a scatter plot. The line of best fit follows the trend of the data.

For example, the scatter plot of goals allowed vs. number of games played by goalies, from the previous section, is reproduced below. A possible line of best fit has been drawn on the graph. This line of best fit represents a trend in the data.

A scatter plot shows the relation between number of games played by a goalie, and number of goals allowed. Each point plotted represents one player. A possible line of best fit is drawn.


Correlation

Correlation is a measure of the relationship or connection between two or more variables.

When we are working with a line of best fit, we can describe the correlation as positive or negative.

  • A linear correlation is positive when the dependent variable increases as the independent variable increases. A line of best fit would have a positive slope.
  • A linear correlation is negative when the dependent variable decreases as the independent variable increases. A line of best fit would have a negative slope.
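
The direction of a linear correlation can also be checked numerically. The following Python sketch is one possible approach and is not taken from the lesson: it sums the products of each point's deviations from the mean of \(x\) and the mean of \(y\). This sum has the same sign as the slope of a least-squares line through the points. The function name is illustrative, and the two data sets are the reaction time and chocolate bar tables from earlier in the lesson.

```python
def correlation_direction(points):
    """Classify the overall trend of (x, y) points as positive, negative, or none."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    # Sum of products of deviations: it has the same sign as the
    # slope of a least-squares line through the points.
    total = sum((x - mean_x) * (y - mean_y) for x, y in points)
    if total > 0:
        return "positive"
    if total < 0:
        return "negative"
    return "none"

# (age, reaction time in ms) and (price in dollars, number sold).
reaction = [(43, 328), (31, 224), (61, 460), (23, 184), (57, 411), (38, 271), (48, 353)]
chocolate = [(1, 16), (3, 12), (5, 8), (7, 8), (8, 4), (9, 4)]

print(correlation_direction(reaction))   # positive
print(correlation_direction(chocolate))  # negative
```

This check gives only the direction of the trend. It says nothing about how strong the correlation is, so a graph is still needed to judge whether a linear trend is actually evident.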

Positive Correlation

We can further describe a correlation as strong or weak.

Let's first consider positive linear correlations.

Perfect Positive
Linear Correlation

A scatter plot shows points that lie along the same positively-sloped line.

  • All of the points form a straight line

Strong Positive
Linear Correlation

A scatter plot shows points that approximate a positively-sloped line. The points are slightly scattered on both sides of the implicit line.

  • The points do not form a straight line
  • A linear trend is evident

Weak Positive
Linear Correlation

A scatter plot shows points that approximate a positively-sloped line. The points are moderately scattered on both sides of the implicit line.

  • The points do not form a straight line. That is, the points are spread out.
  • A linear trend exists, but it is less evident

Negative Correlation

Next, we will look at negative linear correlations.

Perfect Negative Linear Correlation

A scatter plot shows points that lie along the same negatively-sloped line.

  • All of the points form a straight line

Strong Negative Linear Correlation

A scatter plot shows points that approximate a negatively-sloped line. The points are slightly scattered on both sides of the implicit line.

  • The points do not form a straight line. That is, the points are not perfectly aligned.
  • A linear trend is evident

Weak Negative Linear Correlation

A scatter plot shows points that approximate a negatively-sloped line. The points are moderately scattered on both sides of the implicit line.

  • The points do not form a straight line. That is, the points are spread out.
  • A linear trend exists, but it is less evident

No Correlation

  • It is very common for two variables to show no correlation between them.
  • In the graph shown here, the points seem to be scattered randomly across the graph, and no trend is evident. These variables do not show a correlation.

Points appear to be randomly scattered across the graph

Correlation Does Not Mean Causation

Causation is the relationship between cause and effect. It implies that one event actually causes the other.

  • It is important to note that just because two variables show a correlation, it does not prove there is causation, or that the change in one variable causes the change in the other.
    • For example, if data showed a correlation between the annual per capita consumption of pizza and the number of math degrees awarded per year, we should not conclude that the more pizza the population eats, the more successful they will be in mathematics. 

  • Often there are other factors that have an impact and are the reason for the correlation.
    • In the pizza versus number of degrees example we just mentioned, maybe the price of cheese went down and pizza became cheaper to make, or maybe a significant number of pizza shops started opening up. One or both of these might be the cause of an increase in pizza consumption, and perhaps it was just a coincidence that a correlation existed with the number of math degrees increasing at the same time.

Check Your Understanding 3

Question

Does the data in the following scatter plots suggest a positive correlation, a negative correlation, or no correlation?

  1. A series of data that tends to go from Quadrant 2 to Quadrant 4.
  2. A series of data that appears to be randomly spread between all quadrants.
  3. A series of data that tends to go from Quadrant 3 to Quadrant 1.

Answer

  1. Negative correlation
  2. No correlation
  3. Positive correlation

Feedback

  1. The dependent values are decreasing as the independent values increase, therefore the data suggests a negative correlation.
  2. The data does not show a trend, therefore the data suggests no correlation.
  3. The dependent values are increasing as the independent values increase, therefore the data suggests a positive correlation.

Lines and Curves of Best Fit

Sometimes, when we are analyzing scatter plots, we can draw a line or curve of best fit to show the relationship between the independent and dependent variables.


Explore This 1

Description

What is the relationship between the height of a square-based pyramid and its volume?

Observe the volume as the height increases. Note the square-based pyramid has a fixed \(3\times 3\) base.

Height \(=2.5\)

The values for heights between 0 and 2.5 are shown. The values appear to lie on a straight line. A square-based pyramid with height 2.5.

Height \(=5\)

The values for heights between 0 and 5 are shown. The values appear to lie on a straight line. A square-based pyramid with height 5.

Height \(=7.5\)

The values for heights between 0 and 7.5 are shown. The values appear to lie on a straight line. A square-based pyramid with height 7.5.

Height \(=10\)

The values for heights between 0 and 10 are shown. The values appear to lie on a straight line. A square-based pyramid with height 10.

Interactive Version

Relationship Between Height and Volume of a Square-Based Pyramid


Explore This 1 Summary

In the Explore This, the height of a square-based pyramid was changed and the base lengths remained fixed with a length of \(3\) units.

The resulting scatter plot of the volume (dependent variable) with respect to the height of the pyramid (independent variable) is shown here.

Height Volume
0 0
0.5 1.5
1 3
1.5 4.5
2 6
2.5 7.5
3 9
3.5 10.5
4 12
4.5 13.5
5 15
5.5 16.5
6 18
6.5 19.5
7 21
7.5 22.5
8 24
8.5 25.5
9 27
9.5 28.5
10 30
10.5 31.5

The points representing the (height, volume) pairs collected in the previous Explore This activity are plotted.

Notice that the volume and height data show a relationship where every point on our graph lies on the same line. We can say that the relationship between the volume and height is linear. As the height of the pyramid increases, the volume of the pyramid also increases at a constant rate. We can represent the relationship with the line that passes through our data points.

The points representing the (height, volume) pairs collected in the previous Explore This activity are plotted. A line approximating the points is drawn.
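
The values in the table are consistent with the standard formula for the volume of a pyramid. The lesson does not state this formula, so it is assumed here. With a fixed \(3\times 3\) base and a height of \(h\) units:

\[
V = \tfrac{1}{3}\,(\text{area of base})\,(\text{height}) = \tfrac{1}{3}(3)(3)(h) = 3h
\]

Each increase of \(1\) unit in height adds \(3\) cubic units of volume, which is the constant rate of change seen in the table. For example, a height of \(2.5\) gives a volume of \(3(2.5)=7.5\).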


Explore This 2

Description

What is the relationship between the base length of a square-based pyramid and its volume?

Observe the volume as the base length increases. Note the square-based pyramid has a fixed height of \(6\).

Base Length \(= 2\)

The point (2, 8) is highlighted. Base length values between 0 and 2 are plotted as well, but it is not yet clear whether they form a line or a curve. The associated pyramid with base length of 2.

Base Length \(= 5\)

The point (5, 50) is highlighted. Base length values between 0 and 5 seem to form a curve. The associated pyramid with base length of 5.

Base Length \(= 7.5\)

The point (7.5, 112.5) is plotted. Base length values between 0 and 7.5 seem to form a curve. The associated pyramid with base length of 7.5.

Base Length \(= 10\)

The point (10, 200) is plotted. Base length values between 0 and 10 seem to form a curve. The associated pyramid with base length of 10.

Interactive Version

Relationship Between Base Length and Volume of a Square-Based Pyramid


Explore This 2 Summary

In the Explore This, the height of a square-based pyramid was fixed at \(6\) units and the base lengths were changed.

The resulting scatter plot of the volume (dependent variable) with respect to the base length of the pyramid (independent variable) is shown here.

Base Length Volume
0 0.00
0.5 0.50
1 2.00
1.5 4.50
2 8.00
2.5 12.50
3 18.00
3.5 24.50
4 32.00
4.5 40.50
5 50.00
5.5 60.50
6 72.00
6.5 84.50
7 98.00
7.5 112.50
8 128.00
8.5 144.50
9 162.00
9.5 180.50
10 200.00

The points representing the (base length, volume) pairs collected in the previous Explore This activity are plotted.

Notice that the volume and base length data result in a non-linear relationship. The points in this relationship all lie on the same curve. As the base length of the pyramid increases, the volume also increases, but not at a constant rate. We can represent the relationship with the curve that passes through our data points.

The points representing the (base length, volume) pairs collected in the previous Explore This activity are plotted. A curve approximating the points is drawn.
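
The changing rate can be checked with a short Python sketch. It assumes the standard formula for the volume of a square-based pyramid (one third of the base area times the height), which the lesson does not state explicitly, and it uses only the first few base lengths from the table. The function name is illustrative.

```python
def pyramid_volume(base_length, height):
    """Volume of a square-based pyramid: one third of base area times height."""
    return base_length ** 2 * height / 3

base_lengths = [0, 0.5, 1, 1.5, 2, 2.5, 3]
volumes = [pyramid_volume(b, 6) for b in base_lengths]

# Change in volume between consecutive base lengths.
changes = [later - earlier for earlier, later in zip(volumes, volumes[1:])]

print(volumes)  # [0.0, 0.5, 2.0, 4.5, 8.0, 12.5, 18.0]
print(changes)  # [0.5, 1.5, 2.5, 3.5, 4.5, 5.5]
```

The base length increases by the same amount each time, but the change in volume keeps growing. This is why the points lie on a curve rather than on a line.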

A line of best fit or curve of best fit represents the relationship between the two variables on a scatter plot.

Other Curves of Best Fit

In the previous two exercises, we generated scatter plots based on functions with highly predictable outcomes. Often, you will be presented with data where the trend isn’t so easily characterized by a familiar relation or pattern type. Sometimes the data on a scatter plot can appear to be more random.

Consider the air temperature data for a two-year period collected at a climate station in Waterloo, Ontario. The dependent variable, the average monthly temperature in degrees Celsius, is on the vertical axis. The independent variable, the number of months passed, is on the horizontal axis.

Month Average Temperature
1 -3.14
2 -1.25
3 -1.36
4 8.45
5 11.43
6 18.24
7 20.02
8 18.44
9 16.75
10 11.81
11 2.15
12 -5.71
13 -7.01
14 -3.10
15 -1.73
16 1.65
17 16.72
18 18.49
19 21.52
20 21.22
21 17.44
22 7.91
23 0.11
24 -1.34

A scatter plot shows the average monthly temperature for 24 months. Each point represents one month of aggregate data.

A line would not be representative of the relationship shown on the graph as a line would pass through very few points. A curve would be more suitable, but may not be as obvious as the previous curve we saw. If we draw a curve to model the data it does not have to pass through every point on our scatter plot. The curve we want is one that is most representative of the relationship between average temperature and month as shown here.

A scatter plot shows the average monthly temperature for 24 months. Each point represents one month of aggregate data. The points are approximated with a curve that oscillates smoothly between lower and higher values.

We draw a curve that oscillates smoothly between the lower and higher extremes of the dependent variable, or temperature. The rate of change varies along the curve.

Remember, when we draw a line or curve on a scatter plot to represent the relationship between the two variables, we are drawing a line or curve of best fit.

  • A line or curve of best fit may pass through all, some, or none of the points on a scatter plot.
  • A line or curve of best fit will always follow the trend of the data.

Recall

A trend describes the behaviour of the data and is used to determine if there is a relationship between the variables.


Check Your Understanding 4

Question

For each scatter plot, determine whether the data can be represented by a line of best fit, a curve of best fit, or neither.

  1. Points appear to be randomly scattered around the Cartesian plane.
  2. For a variety of points, as x decreases, points trend towards negative infinity and as x increases, points trend towards 0.
  3. Points are mostly clustered together between quadrants 2 and 4.

Answer

  1. Neither
  2. Curve of best fit
  3. Line of best fit

Feedback

  1. The points do not appear to suggest any trend. Neither a line nor a curve of best fit is representative of the data shown in the scatter plot.

    No trend line is drawn.

  2. The points suggest a trend that follows a curve. A curve of best fit is representative of the data shown in the scatter plot.

    A curve is drawn such that for large negative x values, the curve has large negative y values and as x values increase the curve approaches the x-axis.

  3. The points suggest a linear trend. A line of best fit is representative of the data shown in the scatter plot.

    A line with negative slope is drawn from quadrant 2 through quadrant 4.


Drawing a Line of Best Fit

Recall

A line of best fit will always follow the trend of the data.

One way to identify if there is a trend to the data is to draw an elliptical enclosure around the data.

If the ellipse we draw is more round, or close to being circular, this is an indication that there is no trend and we cannot draw a line of best fit to represent the relationship between the two variables. We can say there is no correlation between the variables.

The more elongated and narrow the ellipse, or the closer it is to being flat, the stronger the relationship, or correlation. If this is the case, we can use a ruler and draw a line of best fit to represent the relationship.

Scatter Plot 1: Jersey Number vs. Player Weight

Let's consider a scatter plot of basketball player jersey numbers (the dependent variable) with respect to their weight (the independent variable).

Weight (lbs) Jersey Number
220 13
253 23
250 23
195 30
242 35
190 30
240 35
200 0
270 0
210 1
193 11
260 12
210 4
220 10
250 21
240 6
207 3
175 23
232 23
184 15
220 13
250 23
190 3
248 32
215 45

A scatter plot shows the relationship between player jersey numbers and weight

If we enclose the data in the smallest possible ellipse that contains all of the data points, we get:

A scatter plot shows the relationship between player jersey numbers and weight. An ellipse encloses all of the points.

Our ellipse is close to being circular, which shows that there is no trend or relationship between a player's jersey number and their weight. Therefore, we will not draw a line of best fit on this scatter plot.

Scatter Plot 2: Hot Drink Sales vs. Temperature

Let's consider a scatter plot of the sale of hot drinks with respect to air temperature.

Temperature (°C) Hot Drink Sales ($)
-20 1215
-12 1003
-5 1008
0 980
4 995
2 1019
12 765
9 872
16 689
20 671
-21 904
26 486
32 389
-18 963
-28 1562
17 803
10 954
14 502

The data in the provided table is shown in a scatter plot.

If we enclose the data in the smallest possible ellipse that contains all of the data points, we get:

The data in the provided table is shown in a scatter plot with an ellipse drawn around the points.

Our ellipse on this graph is narrower and we can see that there appears to be a trend or relationship between the two variables. The data shows a strong negative correlation.
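
The shape of the ellipse can also be described with a number. The sketch below is not part of the lesson's method: it assumes the Pearson correlation coefficient \(r\) as the measure, where values near \(-1\) or \(1\) go with a narrow ellipse and values near \(0\) go with a round one. The function name `pearson_r` is only illustrative.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Scatter Plot 1: player weight (lbs) and jersey number
weights = [220, 253, 250, 195, 242, 190, 240, 200, 270, 210, 193, 260, 210,
           220, 250, 240, 207, 175, 232, 184, 220, 250, 190, 248, 215]
jerseys = [13, 23, 23, 30, 35, 30, 35, 0, 0, 1, 11, 12, 4,
           10, 21, 6, 3, 23, 23, 15, 13, 23, 3, 32, 45]

# Scatter Plot 2: temperature (°C) and hot drink sales ($)
temps = [-20, -12, -5, 0, 4, 2, 12, 9, 16, 20, -21, 26, 32, -18, -28, 17, 10, 14]
sales = [1215, 1003, 1008, 980, 995, 1019, 765, 872, 689, 671, 904, 486, 389,
         963, 1562, 803, 954, 502]

r_jersey = pearson_r(weights, jerseys)
r_drinks = pearson_r(temps, sales)
print(f"jersey vs. weight:   r = {r_jersey:.2f}")
print(f"sales vs. temperature: r = {r_drinks:.2f}")
```

If the ellipses are a good guide, we would expect the first value to be close to \(0\) and the second to be clearly negative.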

Although the key idea when drawing a line of best fit is to follow the trend of the data, there are some additional recommendations to keep in mind:

  • The line of best fit should pass through as many points as possible, and through at least two points where it can. (Note: "Passing through" the point means that the line of best fit visually touches the point on the graph. Mathematically, this means that the distance between the point and the line is a very small value, ideally zero.)
  • The number of remaining points should be split close to equal on either side of the line.
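
The two recommendations above can be checked for any candidate line by counting points. The sketch below is one way to do that, under some assumptions: the candidate line is a hypothetical one through \((0, 980)\) and \((16, 689)\), not the line drawn in the lesson, and "near" is taken to mean within \(\$25\) vertically, a tolerance chosen only for illustration.

```python
def split_counts(points, slope, intercept, tolerance):
    """Count points above, near, and below the line y = slope*x + intercept."""
    above = near = below = 0
    for x, y in points:
        residual = y - (slope * x + intercept)  # vertical distance to the line
        if abs(residual) <= tolerance:
            near += 1
        elif residual > 0:
            above += 1
        else:
            below += 1
    return above, near, below

# Scatter Plot 2: (temperature in °C, hot drink sales in $)
hot_drinks = [(-20, 1215), (-12, 1003), (-5, 1008), (0, 980), (4, 995),
              (2, 1019), (12, 765), (9, 872), (16, 689), (20, 671),
              (-21, 904), (26, 486), (32, 389), (-18, 963), (-28, 1562),
              (17, 803), (10, 954), (14, 502)]

# Hypothetical candidate line through (0, 980) and (16, 689)
slope = (689 - 980) / (16 - 0)
intercept = 980

above, near, below = split_counts(hot_drinks, slope, intercept, tolerance=25)
print(f"above: {above}, on or near: {near}, below: {below}")
```

A different candidate line or a different tolerance will change the counts, which is one reason two people rarely draw the same line by hand.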

A line of best fit is drawn on the graph with a ruler

Our line of best fit passes through four points and has an equal number of points above and below it. (Recall from above: "Passing through" the point means that the line of best fit visually touches the point on the graph. Mathematically, this means that the distance between the point and the line is a very small value, ideally zero.)

The data in the provided table is shown in a scatter plot with a line of best fit.

Points Above Line
Temperature (°C) Hot Drink Sales ($)
-28 1562
2 1019
4 995
9 872
10 954
17 803
20 671
Points On or Near Line
Temperature (°C) Hot Drink Sales ($)
-5 1008
0 980
12 765
16 689
Points Below Line
Temperature (°C) Hot Drink Sales ($)
-21 904
-20 1215
-18 963
-12 1003
14 502
26 486
32 389

When lines of best fit are drawn by hand, it is important to note that there are many slightly different lines that can be drawn to represent the data. It is unlikely that any two individuals would draw exactly the same line of best fit.

For Scatter Plot 2, any hand-drawn line of best fit should still reflect a negative correlation in order to correctly follow the trend of the data.

Scatter Plot 3: Money Earned vs. Hours Worked

Let's consider a scatter plot of money earned with respect to hours worked. 

Hours Worked Money Earned ($)
3 27
7 49
8 112
10 150
6 60
8 140
2 50
1 15
7 87.5
12 132
2.5 25
4.5 54
7.5 101.25
11.5 120
11 154
6.5 91
1.5 22.5
3.5 45
5 80
7 105

The data provided in the table is shown in a scatter plot.

If we enclose the data in the smallest possible ellipse that contains all of the data points, we get:

The data in the provided table is shown in a scatter plot with an ellipse drawn around the points.

We can see that there appears to be a trend or relationship between the two variables. We can then draw a line of best fit.

A line of best fit is drawn on the graph with a ruler

We can see the data shows a positive correlation and therefore our line of best fit has a positive slope.

Points Above Line
Hours Worked Money Earned ($)
2 50
5 80
7 105
8 112
8 140
10 150
11 154
Points On or Near Line
Hours Worked Money Earned ($)
1 15
1.5 22.5
3.5 45
6.5 91
7.5 101.25
Points Below Line
Hours Worked Money Earned ($)
2.5 25
3 27
4.5 54
6 60
7 49
7 87.5
11.5 120
12 132

Let's revisit the scatter plot of city population and number of restaurants.

Example 2

  1. Given the scatter plot shown, draw a line of best fit.
  2. Determine the equation of the line of best fit.
  3. Use technology to determine a line of best fit.

A scatter plot showing the population in hundreds of thousands and the number of restaurants in operation. See adjacent alternative format for data.

Population (100 000s)  Restaurants (Qty)
\(2.17\) \(643\)
\(0.15\) \(285\)
\(1.41\) \(358\)
\(1.53\) \(1200\)
\(8.34\) \(2198\)
\(3.06\) \(715\)
\(3.84\) \(844\)
\(1.08\) \(233\)
\(1.83\) \(378\)
\(7.22\) \(1396\)
\(2.33\) \(450\)
\(1.94\) \(367\)
\(5.37\) \(1006\)
\(1.95\) \(359\)
\(1.02\) \(187\)
\(1.30\) \(226\)
\(3.29\) \(540\)
\(1.62\) \(257\)
\(1.20\) \(165\)
\(5.94\) \(437\)

Solution — Part A

As mentioned previously, we can create an ellipse around our data to identify the trend. Before we do this though, let's recall that we previously identified a point that represented a city with a lower than expected number of operating restaurants for its population, and two additional points that represented cities with a higher than expected number of operating restaurants for their populations.

  • On our scatter plot we have three points that are distant from the majority.

A scatter plot shows the data provided.

An outlier is a point that is distant from other points. These points should not be given the same amount of consideration when identifying a trend.

Therefore, when we draw our ellipse to identify the trend, we will leave these three points outside.

After drawing our ellipse, we can then use it to draw a line of best fit.

A scatter plot shows the data provided.

Remember to use a ruler and follow the trend of the data, try to pass through as many points as possible, and split the remaining points as equally as possible on either side of the line.

A scatter plot shows the data provided.

Our completed graph is shown:

A scatter plot shows the data provided.

Solution — Part B

Recall part b): Determine the equation of the line of best fit.

Now that we have drawn our line of best fit we can determine the equation of our line. To determine the equation, we will need two points that lie on our line of best fit.

Recall

We need two points to determine the equation of a line.

These do not have to be points that are part of the original scatter plot. In fact, our line of best fit touches several points, but doesn't actually have two points from the scatter plot that lie directly on it. 

A scatter plot shows the data provided. The friendly points (0,0) and (4,800) are indicated.

Remember to choose friendly or easy to read points. For our line of best fit, we will use the points \((0, 0)\) and \((4, 800)\).

First, we will calculate the slope of the line:

\(\begin{aligned} m &=\frac{\Delta y}{\Delta x} \\ &=\frac{800-0}{4-0} \\ &=\frac{800}{4} \\ &=200 \end{aligned}\)

Our line of best fit is telling us that there are about \(200\) operating restaurants for every \(100~000\) people in a city's population.

Next, we can identify the \(y\)-intercept on our graph as \(0\).

Having both the slope and the \(y\)-intercept, we can write the equation of our line of best fit as

\(y=200x\)

where \(x\) represents the city population in \(100~000\)s and \(y\) represents the number of operating restaurants.
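
The slope and intercept calculation above can be sketched in a few lines of code. The helper name `line_through` is an assumption made for this illustration; the two points are the friendly points chosen above.

```python
def line_through(p1, p2):
    """Return (slope, y_intercept) of the line through two points."""
    (x1, y1), (x2, y2) = p1, p2
    if x1 == x2:
        raise ValueError("a vertical line has no slope")
    slope = (y2 - y1) / (x2 - x1)   # m = (change in y) / (change in x)
    intercept = y1 - slope * x1     # rearranged from y1 = m*x1 + b
    return slope, intercept

slope, intercept = line_through((0, 0), (4, 800))
print(f"y = {slope:g}x + {intercept:g}")
```

The same helper works for any two points read from a hand-drawn line, including points that are not part of the original data.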

Solution — Part C

Recall part c): Use technology to determine a line of best fit.

A scatter plot shows the data provided.

When we draw a line of best fit by hand, recall that five people could draw five different lines to represent the data. It is very likely that one person given the same set of data twice would draw two slightly different lines of best fit.

A scatter plot shows the data provided, with two alternative lines of best fit drawn over it.

Researchers who analyze scatter plots do not draw lines of best fit by hand. They use technology to generate a line or curve of best fit for their data.

Determining a model (i.e. a line or curve of best fit) for data is called regression analysis.

Many software options have a regression analysis tool to generate lines and curves of best fit. 

For example, one tool might analyze our data and determine a line of best fit that is represented by the equation:

\(y=183.56x+92.82\)

where \(x\) represents the city population in \(100~000\)s and \(y\) represents the number of operating restaurants.

A scatter plot shows the data provided. The line y=183.56x + 92.82 is provided as a line of best fit.

This line of best fit has a higher vertical intercept and a slope that is less steep than the line of best fit we drew by hand.

A scatter plot shows the data provided. The lines y=183.56x + 92.82 and y=200x are provided as two potential lines of best fit.
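
The lesson does not say which method its software tool uses. The sketch below assumes ordinary least squares, a common regression method, applied to all \(20\) data points (outliers included). The function name `least_squares_line` is illustrative only.

```python
def least_squares_line(points):
    """Ordinary least-squares line: returns (slope, intercept)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x  # the line passes through the mean point
    return slope, intercept

# (population in 100 000s, number of operating restaurants)
restaurants = [(2.17, 643), (0.15, 285), (1.41, 358), (1.53, 1200),
               (8.34, 2198), (3.06, 715), (3.84, 844), (1.08, 233),
               (1.83, 378), (7.22, 1396), (2.33, 450), (1.94, 367),
               (5.37, 1006), (1.95, 359), (1.02, 187), (1.30, 226),
               (3.29, 540), (1.62, 257), (1.20, 165), (5.94, 437)]

slope, intercept = least_squares_line(restaurants)
print(f"y = {slope:.2f}x + {intercept:.2f}")
```

Under these assumptions the result should come out close to the equation quoted above. A tool that uses a different method, or that leaves out the outliers, may give a different line.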


Next, we will determine information using a model by continuing with our previous example. 

Example 3

Use the line of best fit generated using technology, and given by the equation \(y=183.56x+92.82\), to calculate

  1. how many restaurants would be expected in a city with a population of \(425~000\) and
  2. the population of a city with \(2500\) restaurants.

Solution — Part A

When we use a line of best fit to calculate missing information within a data set it is called interpolation.

A scatter plot of the restaurant data, with our calculated regression line.

Population (100 000s)  Restaurants (Qty)
\(2.17\) \(643\)
\(0.15\) \(285\)
\(1.41\) \(358\)
\(1.53\) \(1200\)
\(8.34\) \(2198\)
\(3.06\) \(715\)
\(3.84\) \(844\)
\(1.08\) \(233\)
\(1.83\) \(378\)
\(7.22\) \(1396\)
\(2.33\) \(450\)
\(1.94\) \(367\)
\(5.37\) \(1006\)
\(1.95\) \(359\)
\(1.02\) \(187\)
\(1.30\) \(226\)
\(3.29\) \(540\)
\(1.62\) \(257\)
\(1.20\) \(165\)
\(5.94\) \(437\)

For our graph, if we are looking at population values between \(15~000\) and \(834~000\), we are working within our data set.

A scatter plot of the restaurant data, with our calculated regression line. The x-interval containing our data set is highlighted.

And if we are looking at restaurant values between \(165\) and \(2198\), we are working within our data set.

A scatter plot of the restaurant data, with our calculated regression line. The y-interval containing our data set is highlighted.

This question asks us about a population of \(425~000\), which is within our data set.

A scatter plot of the restaurant data, with our calculated regression line. The line x=4.25 is drawn on the graph, inside of the x-interval representing our data set.

Now let's continue with answering the question using the line of best fit. 

Remember that the population is represented in hundred thousands. Therefore, we will substitute \(x=4.25\) and solve for \(y\):

\(\begin{align*} y&=183.56 x+92.82 \\[5px] y&=183.56(4.25)+92.82 \\[5px] y&=872.95 \end{align*}\)

Completing the calculations, we determine that a city with a population of \(425~000\) would be expected to have approximately \(873\) restaurants. Remember, even though the model gives us a decimal value, we need to round to a whole number, as we cannot have part of a restaurant.
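
The substitution above, together with the check that the population lies within our data set, can be sketched as follows. The function and constant names are assumptions made for this illustration; the range limits come from the smallest and largest populations in the table.

```python
SLOPE, INTERCEPT = 183.56, 92.82   # technology-generated line of best fit
X_MIN, X_MAX = 0.15, 8.34          # smallest and largest populations (100 000s)

def predict_restaurants(population):
    """Estimate the restaurant count and say whether x is inside the data set."""
    x = population / 100_000       # the model expects population in 100 000s
    y = SLOPE * x + INTERCEPT
    kind = "interpolation" if X_MIN <= x <= X_MAX else "extrapolation"
    return round(y), kind          # a restaurant count must be a whole number

estimate, kind = predict_restaurants(425_000)
print(estimate, kind)
```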

Solution — Part B

Recall part b): Use the line of best fit generated using technology and given by the equation \(y=183.56x+92.82\) to calculate the population of a city with \(2500\) restaurants.

A scatter plot of the restaurant data, with our calculated regression line.

When we use a line of best fit to calculate information outside of a data set, or to make predictions, it is called extrapolation.

For our graph, if we are looking at population values less than \(15~000\) or greater than \(834~000\), we are working outside our data set.

A scatter plot of the restaurant data, with our calculated regression line. The x-intervals lower than the lowest population point and higher than the highest population point are highlighted.

And if we are looking at restaurant values less than \(165\) or greater than \(2198\), we are working outside our data set.

A scatter plot of the restaurant data, with our calculated regression line. The y-intervals lower than the lowest restaurant count and higher than the highest restaurant count are highlighted.

This question asks about \(2500\) restaurants, which is outside of our data set.

Now to answer this question, we will substitute \(y=2500\) and solve for \(x\):

\(\begin{align*} y &=183.56 x+92.82 \\[5px] 2500 &=183.56 x+92.82 \\[5px] 2500-92.82 &=183.56 x \\[5px] 2407.18 &=183.56 x \\[5px] 13.11385923 & \approx x \end{align*}\)

Solving, we calculate \(x\) to be approximately \(13.11386\). Remember, \(x\) is in \(100~000\)s, so we need to multiply our \(x\) value by \(100~000\). This means that, according to our model, a city with a population of approximately \(1~311~386\) would have \(2500\) restaurants.

Again, notice that we round to a whole number as we cannot have a fraction of a person as part of our population.
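
Solving the model for \(x\) can be sketched in code as well. As before, the names are assumptions made for this illustration; the range limits are the smallest and largest restaurant counts in the table.

```python
SLOPE, INTERCEPT = 183.56, 92.82   # technology-generated line of best fit
Y_MIN, Y_MAX = 165, 2198           # smallest and largest restaurant counts

def population_for(restaurants):
    """Solve y = SLOPE*x + INTERCEPT for x, then convert x to a population."""
    x = (restaurants - INTERCEPT) / SLOPE
    kind = "interpolation" if Y_MIN <= restaurants <= Y_MAX else "extrapolation"
    return round(x * 100_000), kind   # x is measured in 100 000s of people

population, kind = population_for(2500)
print(population, kind)
```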


Check Your Understanding 5

Question

For each of the following scatter plots, draw an approximate line of best fit.

  1. The following points are plotted: \((-5, 18)\), \((-3, 16)\), \((-2, 12)\), \((-2, 9)\), \((-2, 4)\), \((0, -2)\), \((2, -8)\), \((3, -8)\), \((3, 11)\), and \((4, -15)\).

  2. The following points are plotted: \((-5, -14)\), \((-5, -10)\), \((-2, -6)\), \((-2, -10)\), \((0, -5)\), \((1, 0)\), \((2, 0)\), \((3, 2)\), \((3, 5)\), and \((4, 4)\).

  3. The following points are plotted: \((-5, -14)\), \((-3, -16)\), \((-2, -15)\), \((-2, 5)\), \((-1, -8)\), \((0, 0)\), \((1, 14)\), \((2, 6)\), \((3, 7)\), and \((4, 16)\). 

Answer

  1. A line with negative slope which passes approximately through points (0, 1), and (0.25, 0).
  2. A line with positive slope which passes approximately through points (0, negative 3.5), and (1.5, 0).
  3. A line with positive slope which passes approximately through points (0, 0.25), and (negative 3, negative 10).


Making Sense of a Model

Let's consider the winning times (in seconds) for the \(100\) m sprint for both men and women in an international competition that is held every four years.  

Year Men Women
\(1900\) \(11\) N/A
\(1904\) \(11\) N/A
\(1908\) \(10.8\) N/A
\(1912\) \(10.8\) N/A
\(1920\) \(10.8\) N/A
\(1924\) \(10.6\) N/A
\(1928\) \(10.8\) \(12.2\)
\(1932\) \(10.3\) \(11.9\)
\(1936\) \(10.3\) \(11.5\)
\(1948\) \(10.3\) \(11.9\)
\(1952\) \(10.4\) \(11.5\)
\(1956\) \(10.5\) \(11.5\)
\(1960\) \(10.2\) \(11.0\)
\(1964\) \(10.0\) \(11.4\)
\(1968\) \(9.9\) \(11.0\)
\(1972\) \(10.14\) \(11.07\)
\(1976\) \(10.06\) \(11.08\)
\(1980\) \(10.25\) \(11.06\)
\(1984\) \(9.99\) \(10.97\)
\(1988\) \(9.92\) \(10.54\)
\(1992\) \(9.96\) \(10.82\)
\(1996\) \(9.84\) \(10.94\)
\(2000\) \(9.87\) \(11.12\)
\(2004\) \(9.85\) \(10.93\)
\(2008\) \(9.69\) \(10.78\)
\(2012\) \(9.63\) \(10.75\)
\(2016\) \(9.81\) \(10.71\)

We will use technology to graph a scatter plot of the data.

A scatter plot of the data provided in the table.

Our scatter plots both show that as the year increases, the winning times are decreasing. We can use technology to determine a line of best fit for each set of data.

A scatter plot of the data provided in the table. Calculated regression lines are added to the graph.

  • The line of best fit generated using technology for the men's winning times is represented by the equation: \(y=-0.0109x+31.559\) and
  • The line of best fit generated using technology for the women's winning times is represented by the equation: \(y=-0.0142x+39.188\) 
    where \(x\) represents the year and \(y\) represents the winning time in seconds.

We can use the lines of best fit to interpolate and extrapolate information from our graph.

Interpolating

We are missing the winning times for the years \(1940\) and \(1944\). Using the lines of best fit we can calculate what the winning times might have been according to our model.

Men: substitute \(x=1940\)

\(\begin{align*} y&=-0.0109x+31.559\\ y&=-0.0109(1940)+31.559\\ y&=10.413 \end{align*}\)

Therefore, the men's \(100\) m winning time in \(1940\) according to our model is approximately \(10.41\) s.

Men: substitute \(x=1944\)

\(\begin{align*} y&=-0.0109x+31.559\\ y&=-0.0109(1944)+31.559\\ y&=10.3694 \end{align*}\)

Therefore, the men's \(100\) m winning time in \(1944\) according to our model is approximately \(10.37\) s.

Women: substitute \(x=1940\)

\(\begin{align*} y&=-0.0142x+39.188\\ y&=-0.0142(1940)+39.188\\ y&=11.64 \end{align*}\)

Therefore, the women's \(100\) m winning time in \(1940\) according to our model is approximately \(11.64\) s.

Women: substitute \(x=1944\)

\(\begin{align*} y&=-0.0142x+39.188\\ y&=-0.0142(1944)+39.188\\ y&=11.5832 \end{align*}\)

Therefore, the women's \(100\) m winning time in \(1944\) according to our model is approximately \(11.58\) s.

The calculated winning times appear to fit well within our data sets.

Extrapolating

Next, let's use our models to predict the winning times for the competition to be held in the year \(2048\).

Men: substitute \(x=2048\)

\(\begin{align*} y&=-0.0109x+31.559\\ y&=-0.0109(2048)+31.559\\ y&=9.2358 \end{align*}\)

Therefore, the men's \(100\) m winning time in \(2048\) according to our model will be approximately \(9.24\) s.

Women: substitute \(x=2048\)

\(\begin{align*} y&=-0.0142x+39.188\\ y&=-0.0142(2048)+39.188\\ y&=10.1064 \end{align*}\)

Therefore, the women's \(100\) m winning time in \(2048\) according to our model will be approximately \(10.11\) s.

These predictions don't seem impossible and could prove to be fairly accurate; only time will tell.

Reliability

What happens to our predictions as we move farther away from the collected data?

Let's look at our graph again, with the horizontal axis extended to the year \(2600\).

The scatter plot is reproduced, and the regression lines are extended to reach x-values (Years) up to 2600.

The models show that the winning times continue to decrease as the years go on. In fact, they show that by the \(2312\) competition, the women sprinters will be faster than the men sprinters.

However, as we keep moving farther away from our collected data set, our predictions seem to be less probable, and eventually impossible. By the year \(2550\) both men and women sprinters are predicted to run the \(100\) m race in less than \(4\) seconds. By the year \(2700\) the model predicts that the winning women's sprint will be less than \(1\) second.

Although models help us fill in gaps in information and make predictions of future values, it is important to also consider the reality of the information that a model provides. The closer a prediction is to the range of the collected data, the more reliable it will be.
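
We can explore how the two models behave far from the data with a short sketch. It assumes the competition continues to be held every four years after \(2016\), and the function names are illustrative only.

```python
def men_time(year):
    """Men's model: winning time in seconds for a given year."""
    return -0.0109 * year + 31.559

def women_time(year):
    """Women's model: winning time in seconds for a given year."""
    return -0.0142 * year + 39.188

def first_crossover(start_year=2020, step=4, limit=3000):
    """First competition year in which the women's predicted time is lower."""
    for year in range(start_year, limit, step):
        if women_time(year) < men_time(year):
            return year
    return None

crossover = first_crossover()
print("women predicted faster starting in:", crossover)

# Predictions become less believable as we move away from 1900-2016
for year in (1940, 2048, 2550, 2700):
    print(year, round(men_time(year), 2), round(women_time(year), 2))
```

Printing the predictions side by side makes it easy to see where the models stop describing anything a real sprinter could do.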


Wrap-Up


Lesson Summary

In this lesson, we:

  • Reviewed what a scatter plot is and learned how to construct scatter plots using technology.
  • Identified trends in data from a scatter plot and described the correlation between two variables.
  • Drew curves or lines of best fit and determined the equation of a line of best fit.
  • Interpolated and extrapolated information using a line of best fit.

Take It With You

Ten students were asked how many hours they spent on a screen the day before a test. The data collected is shown in the table and the corresponding graph is also shown here. A line of best fit has been drawn on the graph to model the relationship between the number of hours on a screen and test mark earned.

Number of Hours on a Screen Test Mark (%)
\(2\) \(85\)
\(4\) \(60\)
\(0.5\) \(85\)
\(3\) \(70\)
\(1.5\) \(74\)
\(2\) \(89\)
\(0\) \(84\)
\(3\) \(73\)
\(1\) \(95\)
\(2.5\) \(76\)

The data in the table at left is plotted on a scatter graph, with its line of best fit.

An additional ten students were surveyed and the data is shown here. If this data is plotted on the same graph as the data from the first ten students, is the line shown in the graph still the line of best fit?

Number of Hours on a Screen Test Mark (%)
\(3.5\) \(75\)
\(3\) \(81\)
\(1\) \(78\)
\(5.5\) \(62\)
\(1\) \(89\)
\(3\) \(81\)
\(0\) \(90\)
\(2.5\) \(77\)
\(1\) \(89\)
\(4\) \(55\)