julia_coursera/Week4_Working with data.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Working with data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## In this lecture"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- [Introduction](#Introduction)\n",
    "- [Distributions](#Distributions)\n",
    "- [Normal distribution](#Normal-distribution)\n",
    "- [Other distributions](#Other-distributions)\n",
    "- [DataFrames](#DataFrames)\n",
    "    - [Combining dataframes](#Combining-dataframes)\n",
    "    - [Grouping](#Grouping)\n",
    "    - [Sorting](#Sorting)\n",
    "    - [Unique rows only](#Unique-rows-only)\n",
    "    - [Deleting rows](#Deleting-rows)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Introduction"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The ability to use data is fundamental to most modern computer coding taks.  In this lecture, we will have a brief introduction to the way in which the Julia language incorporates data through the use of the `Distributions.jl` and `DataFrames.jl` packages."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "[Back to the top](#In-this-lecture)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Distributions"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Data point values for a distribution usually follow a pattern.  Such patterns are called distributions.  Distributions are either discrete or continuous.  The `Distribution.jl` package contains most of the common data distributions."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We will also use the `Random.jl` package to seed the pseudo-random number generator so that we can reproduce the random values that we are going to use in the lecture."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "using Distributions\n",
    "using Random"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "[Back to the top](#In-this-lecture)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### The normal distribution"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The normal distribution is the famous bell-shaped curve that we are familiar with.  Values around the mean occur most frequently and as values get progressively further away from the mean, they occur less frequently."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(0.0, 1.0)"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#Seed the pseudo-random number generator\n",
    "Random.seed!(1234)\n",
    "#Saving the standard normal distribution as an object\n",
    "n = Distributions.Normal()  # This function is from the Distributions package\n",
    "#Parameter values of the standard normal distribution\n",
    "params(n)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Using the `params()` function, we note a mean on $0$ and a standard deviation of $1$, also called the _standard normal distribution_."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `fieldnames()` function provides the actual parameters of the given distribution.  In the case of the normal distribution, it will be the average and the standard deviation, namely $\\mu$ and $\\sigma$."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(:μ, :σ)"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Returning the parameters of the normal distribution\n",
    "fieldnames(Normal)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we create a variable called `var1` and use the `rand()` function to create select $10$ random values from the standard normal distribution."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "#Seed the pseudo-random number generator\n",
    "Random.seed!(1234)\n",
    "#Select 10 elements at random from n\n",
    "var1 = rand(n, 10);"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can calculate the average and standard deviation of our randomly selected values."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.18909179133831322"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Average\n",
    "mean(var1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.9879593623730926"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Standard deviation\n",
    "std(var1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `pdf()` calculates the probability density function value of a given distribution up until a specified point (from $- \\infty$)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.38138781546052414"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#Probability density function value at x = 0.3\n",
    "pdf(Normal(), 0.3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `cdf()` functions calculates the cummulative distribution function value of a given distribution up until a specified point (from $- \\infty$)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.5987063256829237"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#Cumulative distribution function as x = 0.25\n",
    "cdf(Normal(), 0.25)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The values for the average and standard deviation can be specified."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Creating 100 data point values from a normal distribution\n",
    "# with a mean of 100 and a standard deviation of 10\n",
    "Random.seed!(1234)\n",
    "var2 = rand(Normal(100, 10), 100);"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "98.52365657772843"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Calculating the mean of var2\n",
    "mean(var2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "9.580963685859091"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#Calculating the standard deviation of var2\n",
    "std(var2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The parameters of a set of values for a specified distribution can be returned."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Normal{Float64}(μ=98.52365657772843, σ=9.532938502804532)"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#Using fit() to calculate the parameters of a distribution\n",
    "fit(Normal, var2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `quantiles()` function provides us with values for the specific percentiles (provided as fractions).  Below we calculate the $2.5$% and $97.5$% percentile values of the standard normal distribution."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "-1.9599639845400592"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Quantiles\n",
    "quantile(Normal(), 0.025)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "1.9599639845400576"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "quantile(Normal(), 0.975)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "[Back to the top](#In-this-lecture)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Other distributions"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There are many distributions in the `Distribution().jl` package. In the code below, a few of these are showcased by way of setting parameters, selecting random values, and fitting those value back to the distribution or returning the parameter field names."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Beta{Float64}(α=1.236721159927394, β=1.1368118923305863)"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Beta distribution\n",
    "b = Beta(1, 1)\n",
    "params(b)\n",
    "Random.seed!(1234)\n",
    "var3 = rand(b, 100);\n",
    "fit(Beta, var3)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(:ν,)"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# χ2 distribution\n",
    "c = Chisq(1)\n",
    "Random.seed!(1234)\n",
    "var4 = rand(c, 100)\n",
    "fieldnames(Chisq) # Degrees of freedom"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "[Back to the top](#In-this-lecture)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Dataframes"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `Dataframes.jl` package allows for creation of a flat data structure (rows and columns).  Columns are variables and rows are subjects (examples)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [],
   "source": [
    "using DataFrames"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Below, we create an empty dataframe object that we call `df`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [],
   "source": [
    "#Create and empty DataFrame\n",
    "df = DataFrame();"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Column headers representing statistical variable names are entered in square brackets as symbols, i.e. preceeded with a colon.  We will attach the `var2` set of values as data point entries for this statistical variables."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Add a column with data point values (rows)\n",
    "df[:Var2] = var2;"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can specify to print the first $5$ rows to the screen with the `first()` function,"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<table class=\"data-frame\"><thead><tr><th></th><th>Var2</th></tr><tr><th></th><th>Float64</th></tr></thead><tbody><p>5 rows × 1 columns</p><tr><th>1</th><td>108.673</td></tr><tr><th>2</th><td>90.9826</td></tr><tr><th>3</th><td>95.0552</td></tr><tr><th>4</th><td>90.9709</td></tr><tr><th>5</th><td>108.644</td></tr></tbody></table>"
      ],
      "text/latex": [
       "\\begin{tabular}{r|c}\n",
       "\t& Var2\\\\\n",
       "\t\\hline\n",
       "\t& Float64\\\\\n",
       "\t\\hline\n",
       "\t1 & 108.673 \\\\\n",
       "\t2 & 90.9826 \\\\\n",
       "\t3 & 95.0552 \\\\\n",
       "\t4 & 90.9709 \\\\\n",
       "\t5 & 108.644 \\\\\n",
       "\\end{tabular}\n"
      ],
      "text/plain": [
       "5×1 DataFrame\n",
       "│ Row │ Var2    │\n",
       "│     │ \u001b[90mFloat64\u001b[39m │\n",
       "├─────┼─────────┤\n",
       "│ 1   │ 108.673 │\n",
       "│ 2   │ 90.9826 │\n",
       "│ 3   │ 95.0552 │\n",
       "│ 4   │ 90.9709 │\n",
       "│ 5   │ 108.644 │"
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#View first five rows\n",
    "first(df, 5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Below, we create another statistical variable with some data point values that we already have in the waiting."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Add another column\n",
    "df[:Var3] = var3;"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `last()` functions shows the last specified rows."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<table class=\"data-frame\"><thead><tr><th></th><th>Var2</th><th>Var3</th></tr><tr><th></th><th>Float64</th><th>Float64</th></tr></thead><tbody><p>3 rows × 2 columns</p><tr><th>1</th><td>95.5675</td><td>0.831916</td></tr><tr><th>2</th><td>83.3677</td><td>0.221771</td></tr><tr><th>3</th><td>94.7877</td><td>0.655592</td></tr></tbody></table>"
      ],
      "text/latex": [
       "\\begin{tabular}{r|cc}\n",
       "\t& Var2 & Var3\\\\\n",
       "\t\\hline\n",
       "\t& Float64 & Float64\\\\\n",
       "\t\\hline\n",
       "\t1 & 95.5675 & 0.831916 \\\\\n",
       "\t2 & 83.3677 & 0.221771 \\\\\n",
       "\t3 & 94.7877 & 0.655592 \\\\\n",
       "\\end{tabular}\n"
      ],
      "text/plain": [
       "3×2 DataFrame\n",
       "│ Row │ Var2    │ Var3     │\n",
       "│     │ \u001b[90mFloat64\u001b[39m │ \u001b[90mFloat64\u001b[39m  │\n",
       "├─────┼─────────┼──────────┤\n",
       "│ 1   │ 95.5675 │ 0.831916 │\n",
       "│ 2   │ 83.3677 │ 0.221771 │\n",
       "│ 3   │ 94.7877 │ 0.655592 │"
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# View last three rows\n",
    "last(df, 3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `size()` function returns a tuple with the number of rows and columns returned,"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(100, 2)"
      ]
     },
     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Dimensions of a DataFrame\n",
    "size(df)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `describe()` functions attemps tp provide summary statistics of the variables>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<table class=\"data-frame\"><thead><tr><th></th><th>variable</th><th>mean</th><th>min</th><th>median</th><th>max</th><th>nunique</th><th>nmissing</th><th>eltype</th></tr><tr><th></th><th>Symbol</th><th>Float64</th><th>Float64</th><th>Float64</th><th>Float64</th><th>Nothing</th><th>Nothing</th><th>DataType</th></tr></thead><tbody><p>2 rows × 8 columns</p><tr><th>1</th><td>Var2</td><td>98.5237</td><td>67.8864</td><td>98.1718</td><td>124.175</td><td></td><td></td><td>Float64</td></tr><tr><th>2</th><td>Var3</td><td>0.521047</td><td>0.00145384</td><td>0.522808</td><td>0.971161</td><td></td><td></td><td>Float64</td></tr></tbody></table>"
      ],
      "text/latex": [
       "\\begin{tabular}{r|cccccccc}\n",
       "\t& variable & mean & min & median & max & nunique & nmissing & eltype\\\\\n",
       "\t\\hline\n",
       "\t& Symbol & Float64 & Float64 & Float64 & Float64 & Nothing & Nothing & DataType\\\\\n",
       "\t\\hline\n",
       "\t1 & Var2 & 98.5237 & 67.8864 & 98.1718 & 124.175 &  &  & Float64 \\\\\n",
       "\t2 & Var3 & 0.521047 & 0.00145384 & 0.522808 & 0.971161 &  &  & Float64 \\\\\n",
       "\\end{tabular}\n"
      ],
      "text/plain": [
       "2×8 DataFrame. Omitted printing of 2 columns\n",
       "│ Row │ variable │ mean     │ min        │ median   │ max      │ nunique │\n",
       "│     │ \u001b[90mSymbol\u001b[39m   │ \u001b[90mFloat64\u001b[39m  │ \u001b[90mFloat64\u001b[39m    │ \u001b[90mFloat64\u001b[39m  │ \u001b[90mFloat64\u001b[39m  │ \u001b[90mNothing\u001b[39m │\n",
       "├─────┼──────────┼──────────┼────────────┼──────────┼──────────┼─────────┤\n",
       "│ 1   │ Var2     │ 98.5237  │ 67.8864    │ 98.1718  │ 124.175  │         │\n",
       "│ 2   │ Var3     │ 0.521047 │ 0.00145384 │ 0.522808 │ 0.971161 │         │"
      ]
     },
     "execution_count": 24,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Summarize the content\n",
    "describe(df)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The data type for each variable can be returned."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "2-element Array{DataType,1}:\n",
       " Float64\n",
       " Float64"
      ]
     },
     "execution_count": 25,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Data type only\n",
    "eltypes(df)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Below we create a new instance of a dataframe object called `df2`.  It contains four statistical variables.  Note the use of symbol notation in creating the names of these variables."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 6 Create a bigger DataFrame\n",
    "df2 = DataFrame()\n",
    "df2[:A] = 1:10\n",
    "df2[:B] = [\"I\", \"II\", \"II\", \"I\", \"II\",\"I\", \"II\", \"II\", \"I\", \"II\"]\n",
    "Random.seed!(1234)\n",
    "df2[:C] = rand(Normal(), 10)\n",
    "df2[:D] = rand(Chisq(1), 10);"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "By using indexing (in square brackets), we can refer to row and column values (i.e. _row, column_).  Below is an example of seleting data point values for rows one through three, showing all the columns.  The colon symbol serves as shortcut syntax for this selection."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<table class=\"data-frame\"><thead><tr><th></th><th>A</th><th>B</th><th>C</th><th>D</th></tr><tr><th></th><th>Int64</th><th>String</th><th>Float64</th><th>Float64</th></tr></thead><tbody><p>3 rows × 4 columns</p><tr><th>1</th><td>1</td><td>I</td><td>0.867347</td><td>0.0123688</td></tr><tr><th>2</th><td>2</td><td>II</td><td>-0.901744</td><td>0.213586</td></tr><tr><th>3</th><td>3</td><td>II</td><td>-0.494479</td><td>0.00899443</td></tr></tbody></table>"
      ],
      "text/latex": [
       "\\begin{tabular}{r|cccc}\n",
       "\t& A & B & C & D\\\\\n",
       "\t\\hline\n",
       "\t& Int64 & String & Float64 & Float64\\\\\n",
       "\t\\hline\n",
       "\t1 & 1 & I & 0.867347 & 0.0123688 \\\\\n",
       "\t2 & 2 & II & -0.901744 & 0.213586 \\\\\n",
       "\t3 & 3 & II & -0.494479 & 0.00899443 \\\\\n",
       "\\end{tabular}\n"
      ],
      "text/plain": [
       "3×4 DataFrame\n",
       "│ Row │ A     │ B      │ C         │ D          │\n",
       "│     │ \u001b[90mInt64\u001b[39m │ \u001b[90mString\u001b[39m │ \u001b[90mFloat64\u001b[39m   │ \u001b[90mFloat64\u001b[39m    │\n",
       "├─────┼───────┼────────┼───────────┼────────────┤\n",
       "│ 1   │ 1     │ I      │ 0.867347  │ 0.0123688  │\n",
       "│ 2   │ 2     │ II     │ -0.901744 │ 0.213586   │\n",
       "│ 3   │ 3     │ II     │ -0.494479 │ 0.00899443 │"
      ]
     },
     "execution_count": 27,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# First three rows with all the colums\n",
    "df2[1:3, :]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If only specified columns, that is to say, not the range of one, two, and three as we did above, but rather only colums one and three, we create a list to indicate this."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<table class=\"data-frame\"><thead><tr><th></th><th>A</th><th>C</th></tr><tr><th></th><th>Int64</th><th>Float64</th></tr></thead><tbody><p>10 rows × 2 columns</p><tr><th>1</th><td>1</td><td>0.867347</td></tr><tr><th>2</th><td>2</td><td>-0.901744</td></tr><tr><th>3</th><td>3</td><td>-0.494479</td></tr><tr><th>4</th><td>4</td><td>-0.902914</td></tr><tr><th>5</th><td>5</td><td>0.864401</td></tr><tr><th>6</th><td>6</td><td>2.21188</td></tr><tr><th>7</th><td>7</td><td>0.532813</td></tr><tr><th>8</th><td>8</td><td>-0.271735</td></tr><tr><th>9</th><td>9</td><td>0.502334</td></tr><tr><th>10</th><td>10</td><td>-0.516984</td></tr></tbody></table>"
      ],
      "text/latex": [
       "\\begin{tabular}{r|cc}\n",
       "\t& A & C\\\\\n",
       "\t\\hline\n",
       "\t& Int64 & Float64\\\\\n",
       "\t\\hline\n",
       "\t1 & 1 & 0.867347 \\\\\n",
       "\t2 & 2 & -0.901744 \\\\\n",
       "\t3 & 3 & -0.494479 \\\\\n",
       "\t4 & 4 & -0.902914 \\\\\n",
       "\t5 & 5 & 0.864401 \\\\\n",
       "\t6 & 6 & 2.21188 \\\\\n",
       "\t7 & 7 & 0.532813 \\\\\n",
       "\t8 & 8 & -0.271735 \\\\\n",
       "\t9 & 9 & 0.502334 \\\\\n",
       "\t10 & 10 & -0.516984 \\\\\n",
       "\\end{tabular}\n"
      ],
      "text/plain": [
       "10×2 DataFrame\n",
       "│ Row │ A     │ C         │\n",
       "│     │ \u001b[90mInt64\u001b[39m │ \u001b[90mFloat64\u001b[39m   │\n",
       "├─────┼───────┼───────────┤\n",
       "│ 1   │ 1     │ 0.867347  │\n",
       "│ 2   │ 2     │ -0.901744 │\n",
       "│ 3   │ 3     │ -0.494479 │\n",
       "│ 4   │ 4     │ -0.902914 │\n",
       "│ 5   │ 5     │ 0.864401  │\n",
       "│ 6   │ 6     │ 2.21188   │\n",
       "│ 7   │ 7     │ 0.532813  │\n",
       "│ 8   │ 8     │ -0.271735 │\n",
       "│ 9   │ 9     │ 0.502334  │\n",
       "│ 10  │ 10    │ -0.516984 │"
      ]
     },
     "execution_count": 28,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# All rows columns 1 and 3\n",
    "df2[:, [1, 3]]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Instead of indicating the column numbers, we can also reference the actual column names (statistical variable names), using symbol notation, i.e. `:A`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<table class=\"data-frame\"><thead><tr><th></th><th>A</th><th>C</th></tr><tr><th></th><th>Int64</th><th>Float64</th></tr></thead><tbody><p>10 rows × 2 columns</p><tr><th>1</th><td>1</td><td>0.867347</td></tr><tr><th>2</th><td>2</td><td>-0.901744</td></tr><tr><th>3</th><td>3</td><td>-0.494479</td></tr><tr><th>4</th><td>4</td><td>-0.902914</td></tr><tr><th>5</th><td>5</td><td>0.864401</td></tr><tr><th>6</th><td>6</td><td>2.21188</td></tr><tr><th>7</th><td>7</td><td>0.532813</td></tr><tr><th>8</th><td>8</td><td>-0.271735</td></tr><tr><th>9</th><td>9</td><td>0.502334</td></tr><tr><th>10</th><td>10</td><td>-0.516984</td></tr></tbody></table>"
      ],
      "text/latex": [
       "\\begin{tabular}{r|cc}\n",
       "\t& A & C\\\\\n",
       "\t\\hline\n",
       "\t& Int64 & Float64\\\\\n",
       "\t\\hline\n",
       "\t1 & 1 & 0.867347 \\\\\n",
       "\t2 & 2 & -0.901744 \\\\\n",
       "\t3 & 3 & -0.494479 \\\\\n",
       "\t4 & 4 & -0.902914 \\\\\n",
       "\t5 & 5 & 0.864401 \\\\\n",
       "\t6 & 6 & 2.21188 \\\\\n",
       "\t7 & 7 & 0.532813 \\\\\n",
       "\t8 & 8 & -0.271735 \\\\\n",
       "\t9 & 9 & 0.502334 \\\\\n",
       "\t10 & 10 & -0.516984 \\\\\n",
       "\\end{tabular}\n"
      ],
      "text/plain": [
       "10×2 DataFrame\n",
       "│ Row │ A     │ C         │\n",
       "│     │ \u001b[90mInt64\u001b[39m │ \u001b[90mFloat64\u001b[39m   │\n",
       "├─────┼───────┼───────────┤\n",
       "│ 1   │ 1     │ 0.867347  │\n",
       "│ 2   │ 2     │ -0.901744 │\n",
       "│ 3   │ 3     │ -0.494479 │\n",
       "│ 4   │ 4     │ -0.902914 │\n",
       "│ 5   │ 5     │ 0.864401  │\n",
       "│ 6   │ 6     │ 2.21188   │\n",
       "│ 7   │ 7     │ 0.532813  │\n",
       "│ 8   │ 8     │ -0.271735 │\n",
       "│ 9   │ 9     │ 0.502334  │\n",
       "│ 10  │ 10    │ -0.516984 │"
      ]
     },
     "execution_count": 29,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Different notation\n",
    "df2[:, [:A, :C]]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `CSV.jl` package's `read()` function can import a comma separated values data file."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Make sure to install the package in the REPL first\n",
    "using CSV"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The file is saved in the same directory / folder as this notebook file."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Import csv file (in same directory / folder)\n",
    "data1 = CSV.read(\"CCS.csv\");"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Using the `type()` function, we note that we now have an instance of a dataframe object."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "DataFrame"
      ]
     },
     "execution_count": 32,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "typeof(data1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's view the first five rows of data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<table class=\"data-frame\"><thead><tr><th></th><th>PatientID</th><th>Cat1</th><th>Cat2</th><th>Var1</th><th>Var2</th><th>Var3</th></tr><tr><th></th><th>Int64⍰</th><th>String⍰</th><th>String⍰</th><th>Float64⍰</th><th>Float64⍰</th><th>Float64⍰</th></tr></thead><tbody><p>5 rows × 6 columns</p><tr><th>1</th><td>1</td><td>A</td><td>C</td><td>38.2568</td><td>5.93913</td><td>35.0579</td></tr><tr><th>2</th><td>2</td><td>A</td><td>C</td><td>17.8317</td><td>5.34754</td><td>21.131</td></tr><tr><th>3</th><td>8</td><td>A</td><td>B</td><td>16.0218</td><td>6.60709</td><td>60.9436</td></tr><tr><th>4</th><td>9</td><td>A</td><td>C</td><td>45.1158</td><td>6.00733</td><td>21.8797</td></tr><tr><th>5</th><td>16</td><td>A</td><td>C</td><td>20.448</td><td>8.54819</td><td>20.6623</td></tr></tbody></table>"
      ],
      "text/latex": [
       "\\begin{tabular}{r|cccccc}\n",
       "\t& PatientID & Cat1 & Cat2 & Var1 & Var2 & Var3\\\\\n",
       "\t\\hline\n",
       "\t& Int64⍰ & String⍰ & String⍰ & Float64⍰ & Float64⍰ & Float64⍰\\\\\n",
       "\t\\hline\n",
       "\t1 & 1 & A & C & 38.2568 & 5.93913 & 35.0579 \\\\\n",
       "\t2 & 2 & A & C & 17.8317 & 5.34754 & 21.131 \\\\\n",
       "\t3 & 8 & A & B & 16.0218 & 6.60709 & 60.9436 \\\\\n",
       "\t4 & 9 & A & C & 45.1158 & 6.00733 & 21.8797 \\\\\n",
       "\t5 & 16 & A & C & 20.448 & 8.54819 & 20.6623 \\\\\n",
       "\\end{tabular}\n"
      ],
      "text/plain": [
       "5×6 DataFrame\n",
       "│ Row │ PatientID │ Cat1    │ Cat2    │ Var1     │ Var2     │ Var3     │\n",
       "│     │ \u001b[90mInt64⍰\u001b[39m    │ \u001b[90mString⍰\u001b[39m │ \u001b[90mString⍰\u001b[39m │ \u001b[90mFloat64⍰\u001b[39m │ \u001b[90mFloat64⍰\u001b[39m │ \u001b[90mFloat64⍰\u001b[39m │\n",
       "├─────┼───────────┼─────────┼─────────┼──────────┼──────────┼──────────┤\n",
       "│ 1   │ 1         │ A       │ C       │ 38.2568  │ 5.93913  │ 35.0579  │\n",
       "│ 2   │ 2         │ A       │ C       │ 17.8317  │ 5.34754  │ 21.131   │\n",
       "│ 3   │ 8         │ A       │ B       │ 16.0218  │ 6.60709  │ 60.9436  │\n",
       "│ 4   │ 9         │ A       │ C       │ 45.1158  │ 6.00733  │ 21.8797  │\n",
       "│ 5   │ 16        │ A       │ C       │ 20.448   │ 8.54819  │ 20.6623  │"
      ]
     },
     "execution_count": 33,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "first(data1, 5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `describe()` function will attempt to summarize all the variables.  In the case of categorical variables, an alphabetical arrangement for minimum and maximum values will be stated."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<table class=\"data-frame\"><thead><tr><th></th><th>variable</th><th>mean</th><th>min</th><th>median</th><th>max</th><th>nunique</th><th>nmissing</th><th>eltype</th></tr><tr><th></th><th>Symbol</th><th>Union…</th><th>Any</th><th>Union…</th><th>Any</th><th>Union…</th><th>Int64</th><th>DataType</th></tr></thead><tbody><p>6 rows × 8 columns</p><tr><th>1</th><td>PatientID</td><td>60.5</td><td>1</td><td>60.5</td><td>120</td><td></td><td>0</td><td>Int64</td></tr><tr><th>2</th><td>Cat1</td><td></td><td>A</td><td></td><td>B</td><td>2</td><td>0</td><td>String</td></tr><tr><th>3</th><td>Cat2</td><td></td><td>B</td><td></td><td>X</td><td>6</td><td>0</td><td>String</td></tr><tr><th>4</th><td>Var1</td><td>27.9679</td><td>15.2356</td><td>22.6801</td><td>84.2378</td><td></td><td>0</td><td>Float64</td></tr><tr><th>5</th><td>Var2</td><td>5.92121</td><td>3.01173</td><td>5.64241</td><td>15.5826</td><td></td><td>0</td><td>Float64</td></tr><tr><th>6</th><td>Var3</td><td>51.95</td><td>20.3153</td><td>44.3042</td><td>147.397</td><td></td><td>0</td><td>Float64</td></tr></tbody></table>"
      ],
      "text/latex": [
       "\\begin{tabular}{r|cccccccc}\n",
       "\t& variable & mean & min & median & max & nunique & nmissing & eltype\\\\\n",
       "\t\\hline\n",
       "\t& Symbol & Union… & Any & Union… & Any & Union… & Int64 & DataType\\\\\n",
       "\t\\hline\n",
       "\t1 & PatientID & 60.5 & 1 & 60.5 & 120 &  & 0 & Int64 \\\\\n",
       "\t2 & Cat1 &  & A &  & B & 2 & 0 & String \\\\\n",
       "\t3 & Cat2 &  & B &  & X & 6 & 0 & String \\\\\n",
       "\t4 & Var1 & 27.9679 & 15.2356 & 22.6801 & 84.2378 &  & 0 & Float64 \\\\\n",
       "\t5 & Var2 & 5.92121 & 3.01173 & 5.64241 & 15.5826 &  & 0 & Float64 \\\\\n",
       "\t6 & Var3 & 51.95 & 20.3153 & 44.3042 & 147.397 &  & 0 & Float64 \\\\\n",
       "\\end{tabular}\n"
      ],
      "text/plain": [
       "6×8 DataFrame. Omitted printing of 1 columns\n",
       "│ Row │ variable  │ mean    │ min     │ median  │ max     │ nunique │ nmissing │\n",
       "│     │ \u001b[90mSymbol\u001b[39m    │ \u001b[90mUnion…\u001b[39m  │ \u001b[90mAny\u001b[39m     │ \u001b[90mUnion…\u001b[39m  │ \u001b[90mAny\u001b[39m     │ \u001b[90mUnion…\u001b[39m  │ \u001b[90mInt64\u001b[39m    │\n",
       "├─────┼───────────┼─────────┼─────────┼─────────┼─────────┼─────────┼──────────┤\n",
       "│ 1   │ PatientID │ 60.5    │ 1       │ 60.5    │ 120     │         │ 0        │\n",
       "│ 2   │ Cat1      │         │ A       │         │ B       │ 2       │ 0        │\n",
       "│ 3   │ Cat2      │         │ B       │         │ X       │ 6       │ 0        │\n",
       "│ 4   │ Var1      │ 27.9679 │ 15.2356 │ 22.6801 │ 84.2378 │         │ 0        │\n",
       "│ 5   │ Var2      │ 5.92121 │ 3.01173 │ 5.64241 │ 15.5826 │         │ 0        │\n",
       "│ 6   │ Var3      │ 51.95   │ 20.3153 │ 44.3042 │ 147.397 │         │ 0        │"
      ]
     },
     "execution_count": 34,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "describe(data1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "[Back to the top](#In-this-lecture)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Combining dataframes"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Combining dataframes on a common variable is a very useful operation.  Below we create two dataframe instances.  Note that both have a `Number` variable."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Creating DataFrames\n",
    "subjects = DataFrame(Number = [100, 101, 102, 103], Stage = [\"I\", \"III\", \"II\", \"I\"])\n",
    "treatment  = DataFrame(Number = [103, 102, 101, 100], Treatment = [\"A\", \"B\", \"A\", \"B\"]);"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `join()` function takes the dataframe objects that require joining as arguments.  The `on =` argument (in symbol form), specifies the variable on which to join.  In this default mode, only values for the stated variable that appear in both dataframes will be included."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<table class=\"data-frame\"><thead><tr><th></th><th>Number</th><th>Stage</th><th>Treatment</th></tr><tr><th></th><th>Int64</th><th>String</th><th>String</th></tr></thead><tbody><p>4 rows × 3 columns</p><tr><th>1</th><td>100</td><td>I</td><td>B</td></tr><tr><th>2</th><td>101</td><td>III</td><td>A</td></tr><tr><th>3</th><td>102</td><td>II</td><td>B</td></tr><tr><th>4</th><td>103</td><td>I</td><td>A</td></tr></tbody></table>"
      ],
      "text/latex": [
       "\\begin{tabular}{r|ccc}\n",
       "\t& Number & Stage & Treatment\\\\\n",
       "\t\\hline\n",
       "\t& Int64 & String & String\\\\\n",
       "\t\\hline\n",
       "\t1 & 100 & I & B \\\\\n",
       "\t2 & 101 & III & A \\\\\n",
       "\t3 & 102 & II & B \\\\\n",
       "\t4 & 103 & I & A \\\\\n",
       "\\end{tabular}\n"
      ],
      "text/plain": [
       "4×3 DataFrame\n",
       "│ Row │ Number │ Stage  │ Treatment │\n",
       "│     │ \u001b[90mInt64\u001b[39m  │ \u001b[90mString\u001b[39m │ \u001b[90mString\u001b[39m    │\n",
       "├─────┼────────┼────────┼───────────┤\n",
       "│ 1   │ 100    │ I      │ B         │\n",
       "│ 2   │ 101    │ III    │ A         │\n",
       "│ 3   │ 102    │ II     │ B         │\n",
       "│ 4   │ 103    │ I      │ A         │"
      ]
     },
     "execution_count": 36,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#Joining\n",
    "df3 = join(subjects, treatment, on = :Number);\n",
    "df3"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Adding a longer list of subjects\n",
    "subjects = DataFrame(Number = [100, 101, 102, 103, 104, 105], Stage = [\"I\", \"III\", \"II\", \"I\", \"II\", \"II\"]);"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `kind =` argument allows for more control.  An inner join is the default (same as above)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<table class=\"data-frame\"><thead><tr><th></th><th>Number</th><th>Stage</th><th>Treatment</th></tr><tr><th></th><th>Int64</th><th>String</th><th>String</th></tr></thead><tbody><p>4 rows × 3 columns</p><tr><th>1</th><td>100</td><td>I</td><td>B</td></tr><tr><th>2</th><td>101</td><td>III</td><td>A</td></tr><tr><th>3</th><td>102</td><td>II</td><td>B</td></tr><tr><th>4</th><td>103</td><td>I</td><td>A</td></tr></tbody></table>"
      ],
      "text/latex": [
       "\\begin{tabular}{r|ccc}\n",
       "\t& Number & Stage & Treatment\\\\\n",
       "\t\\hline\n",
       "\t& Int64 & String & String\\\\\n",
       "\t\\hline\n",
       "\t1 & 100 & I & B \\\\\n",
       "\t2 & 101 & III & A \\\\\n",
       "\t3 & 102 & II & B \\\\\n",
       "\t4 & 103 & I & A \\\\\n",
       "\\end{tabular}\n"
      ],
      "text/plain": [
       "4×3 DataFrame\n",
       "│ Row │ Number │ Stage  │ Treatment │\n",
       "│     │ \u001b[90mInt64\u001b[39m  │ \u001b[90mString\u001b[39m │ \u001b[90mString\u001b[39m    │\n",
       "├─────┼────────┼────────┼───────────┤\n",
       "│ 1   │ 100    │ I      │ B         │\n",
       "│ 2   │ 101    │ III    │ A         │\n",
       "│ 3   │ 102    │ II     │ B         │\n",
       "│ 4   │ 103    │ I      │ A         │"
      ]
     },
     "execution_count": 38,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Inner join\n",
    " df4 = join(subjects, treatment, on = :Number, kind = :inner);\n",
    " df4"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "An outer join will join both dataframes and add `missing` data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<table class=\"data-frame\"><thead><tr><th></th><th>Number</th><th>Stage</th><th>Treatment</th></tr><tr><th></th><th>Int64⍰</th><th>String⍰</th><th>String⍰</th></tr></thead><tbody><p>6 rows × 3 columns</p><tr><th>1</th><td>100</td><td>I</td><td>B</td></tr><tr><th>2</th><td>101</td><td>III</td><td>A</td></tr><tr><th>3</th><td>102</td><td>II</td><td>B</td></tr><tr><th>4</th><td>103</td><td>I</td><td>A</td></tr><tr><th>5</th><td>104</td><td>II</td><td>missing</td></tr><tr><th>6</th><td>105</td><td>II</td><td>missing</td></tr></tbody></table>"
      ],
      "text/latex": [
       "\\begin{tabular}{r|ccc}\n",
       "\t& Number & Stage & Treatment\\\\\n",
       "\t\\hline\n",
       "\t& Int64⍰ & String⍰ & String⍰\\\\\n",
       "\t\\hline\n",
       "\t1 & 100 & I & B \\\\\n",
       "\t2 & 101 & III & A \\\\\n",
       "\t3 & 102 & II & B \\\\\n",
       "\t4 & 103 & I & A \\\\\n",
       "\t5 & 104 & II &  \\\\\n",
       "\t6 & 105 & II &  \\\\\n",
       "\\end{tabular}\n"
      ],
      "text/plain": [
       "6×3 DataFrame\n",
       "│ Row │ Number │ Stage   │ Treatment │\n",
       "│     │ \u001b[90mInt64⍰\u001b[39m │ \u001b[90mString⍰\u001b[39m │ \u001b[90mString⍰\u001b[39m   │\n",
       "├─────┼────────┼─────────┼───────────┤\n",
       "│ 1   │ 100    │ I       │ B         │\n",
       "│ 2   │ 101    │ III     │ A         │\n",
       "│ 3   │ 102    │ II      │ B         │\n",
       "│ 4   │ 103    │ I       │ A         │\n",
       "│ 5   │ 104    │ II      │ \u001b[90mmissing\u001b[39m   │\n",
       "│ 6   │ 105    │ II      │ \u001b[90mmissing\u001b[39m   │"
      ]
     },
     "execution_count": 39,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Outer joing: empty fields filled with missing\n",
    "df5  = join(subjects, treatment, on = :Number, kind = :outer);\n",
    "df5"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "[Back to the top](#In-this-lecture)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Grouping"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A dataframe can be _spliced_ by grouping rows according to values in a variable."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<table class=\"data-frame\"><thead><tr><th></th><th>Group</th><th>Variable1</th><th>Variable2</th></tr><tr><th></th><th>String</th><th>Float64</th><th>Float64</th></tr></thead><tbody><p>3 rows × 3 columns</p><tr><th>1</th><td>B</td><td>0.447358</td><td>0.137658</td></tr><tr><th>2</th><td>B</td><td>-0.396211</td><td>0.60808</td></tr><tr><th>3</th><td>B</td><td>0.366773</td><td>0.255054</td></tr></tbody></table>"
      ],
      "text/latex": [
       "\\begin{tabular}{r|ccc}\n",
       "\t& Group & Variable1 & Variable2\\\\\n",
       "\t\\hline\n",
       "\t& String & Float64 & Float64\\\\\n",
       "\t\\hline\n",
       "\t1 & B & 0.447358 & 0.137658 \\\\\n",
       "\t2 & B & -0.396211 & 0.60808 \\\\\n",
       "\t3 & B & 0.366773 & 0.255054 \\\\\n",
       "\\end{tabular}\n"
      ],
      "text/plain": [
       "3×3 DataFrame\n",
       "│ Row │ Group  │ Variable1 │ Variable2 │\n",
       "│     │ \u001b[90mString\u001b[39m │ \u001b[90mFloat64\u001b[39m   │ \u001b[90mFloat64\u001b[39m   │\n",
       "├─────┼────────┼───────────┼───────────┤\n",
       "│ 1   │ B      │ 0.447358  │ 0.137658  │\n",
       "│ 2   │ B      │ -0.396211 │ 0.60808   │\n",
       "│ 3   │ B      │ 0.366773  │ 0.255054  │"
      ]
     },
     "execution_count": 40,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Creating a new DataFrame\n",
    "df6 = DataFrame(Group = rand([\"A\", \"B\", \"C\"], 15), Variable1 = randn(15), Variable2 = rand(15));\n",
    "first(df6, 3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `by()` function takes a dataframe object as first argument.  This is followed by a column (variable) on which to group by.  Below we use the `size` argument to indicate the number of rows and columns for the number of each unique values that are found in the specified variable."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<table class=\"data-frame\"><thead><tr><th></th><th>Group</th><th>x1</th></tr><tr><th></th><th>String</th><th>Tuple…</th></tr></thead><tbody><p>3 rows × 2 columns</p><tr><th>1</th><td>B</td><td>(9, 3)</td></tr><tr><th>2</th><td>A</td><td>(4, 3)</td></tr><tr><th>3</th><td>C</td><td>(2, 3)</td></tr></tbody></table>"
      ],
      "text/latex": [
       "\\begin{tabular}{r|cc}\n",
       "\t& Group & x1\\\\\n",
       "\t\\hline\n",
       "\t& String & Tuple…\\\\\n",
       "\t\\hline\n",
       "\t1 & B & (9, 3) \\\\\n",
       "\t2 & A & (4, 3) \\\\\n",
       "\t3 & C & (2, 3) \\\\\n",
       "\\end{tabular}\n"
      ],
      "text/plain": [
       "3×2 DataFrame\n",
       "│ Row │ Group  │ x1     │\n",
       "│     │ \u001b[90mString\u001b[39m │ \u001b[90mTuple…\u001b[39m │\n",
       "├─────┼────────┼────────┤\n",
       "│ 1   │ B      │ (9, 3) │\n",
       "│ 2   │ A      │ (4, 3) │\n",
       "│ 3   │ C      │ (2, 3) │"
      ]
     },
     "execution_count": 41,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#Grouping using by()\n",
    "by(df6, :Group, size)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Since the dataframe has three columns, we note that as the second value in the `count` tuple returned above.  The first value shows the number of instances of the unique values found for the specified variable."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Below we create a dataframe instance that shows only the count of the unique values."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<table class=\"data-frame\"><thead><tr><th></th><th>Group</th><th>Count</th></tr><tr><th></th><th>String</th><th>Int64</th></tr></thead><tbody><p>3 rows × 2 columns</p><tr><th>1</th><td>B</td><td>9</td></tr><tr><th>2</th><td>A</td><td>4</td></tr><tr><th>3</th><td>C</td><td>2</td></tr></tbody></table>"
      ],
      "text/latex": [
       "\\begin{tabular}{r|cc}\n",
       "\t& Group & Count\\\\\n",
       "\t\\hline\n",
       "\t& String & Int64\\\\\n",
       "\t\\hline\n",
       "\t1 & B & 9 \\\\\n",
       "\t2 & A & 4 \\\\\n",
       "\t3 & C & 2 \\\\\n",
       "\\end{tabular}\n"
      ],
      "text/plain": [
       "3×2 DataFrame\n",
       "│ Row │ Group  │ Count │\n",
       "│     │ \u001b[90mString\u001b[39m │ \u001b[90mInt64\u001b[39m │\n",
       "├─────┼────────┼───────┤\n",
       "│ 1   │ B      │ 9     │\n",
       "│ 2   │ A      │ 4     │\n",
       "│ 3   │ C      │ 2     │"
      ]
     },
     "execution_count": 42,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Count unique data point values in :Group column\n",
    "by(df6, :Group, dfc -> DataFrame(Count = size(dfc, 1)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `aggregate()` function also groups a dataframe by unique values for a specified column, but then provides the ability to list statistical tests required."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "3×5 DataFrame\n",
      "│ Row │ Group  │ Variable1_mean │ Variable2_mean │ Variable1_std │ Variable2_std │\n",
      "│     │ \u001b[90mString\u001b[39m │ \u001b[90mFloat64\u001b[39m        │ \u001b[90mFloat64\u001b[39m        │ \u001b[90mFloat64\u001b[39m       │ \u001b[90mFloat64\u001b[39m       │\n",
      "├─────┼────────┼────────────────┼────────────────┼───────────────┼───────────────┤\n",
      "│ 1   │ B      │ 0.127675       │ 0.446397       │ 0.973237      │ 0.268476      │\n",
      "│ 2   │ A      │ -0.33429       │ 0.339451       │ 1.04503       │ 0.352194      │\n",
      "│ 3   │ C      │ -0.902111      │ 0.373207       │ 1.51729       │ 0.368007      │"
     ]
    }
   ],
   "source": [
    "# Aggregate for descriptive statistics\n",
    "print(aggregate(df6, :Group, [mean, std]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `groupby()` function actually creates sub-dataframes based on the unique values found in the specified variable."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 44,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<p><b>GroupedDataFrame with 3 groups based on key: Group</b></p><p><i>First Group (9 rows): Group = \"B\"</i></p><table class=\"data-frame\"><thead><tr><th></th><th>Group</th><th>Variable1</th><th>Variable2</th></tr><tr><th></th><th>String</th><th>Float64</th><th>Float64</th></tr></thead><tbody><tr><th>1</th><td>B</td><td>0.447358</td><td>0.137658</td></tr><tr><th>2</th><td>B</td><td>-0.396211</td><td>0.60808</td></tr><tr><th>3</th><td>B</td><td>0.366773</td><td>0.255054</td></tr><tr><th>4</th><td>B</td><td>0.621673</td><td>0.498734</td></tr><tr><th>5</th><td>B</td><td>2.06353</td><td>0.52509</td></tr><tr><th>6</th><td>B</td><td>-1.41453</td><td>0.265511</td></tr><tr><th>7</th><td>B</td><td>0.134475</td><td>0.110096</td></tr><tr><th>8</th><td>B</td><td>-0.750421</td><td>0.834362</td></tr><tr><th>9</th><td>B</td><td>0.076418</td><td>0.78299</td></tr></tbody></table><p>&vellip;</p><p><i>Last Group (2 rows): Group = \"C\"</i></p><table class=\"data-frame\"><thead><tr><th></th><th>Group</th><th>Variable1</th><th>Variable2</th></tr><tr><th></th><th>String</th><th>Float64</th><th>Float64</th></tr></thead><tbody><tr><th>1</th><td>C</td><td>0.170778</td><td>0.633427</td></tr><tr><th>2</th><td>C</td><td>-1.975</td><td>0.112987</td></tr></tbody></table>"
      ],
      "text/latex": [
       "GroupedDataFrame with 3 groups based on key: Group\n",
       "\n",
       "First Group (9 rows): Group = \"B\"\n",
       "\n",
       "\\begin{tabular}{r|ccc}\n",
       "\t& Group & Variable1 & Variable2\\\\\n",
       "\t\\hline\n",
       "\t& String & Float64 & Float64\\\\\n",
       "\t\\hline\n",
       "\t1 & B & 0.447358 & 0.137658 \\\\\n",
       "\t2 & B & -0.396211 & 0.60808 \\\\\n",
       "\t3 & B & 0.366773 & 0.255054 \\\\\n",
       "\t4 & B & 0.621673 & 0.498734 \\\\\n",
       "\t5 & B & 2.06353 & 0.52509 \\\\\n",
       "\t6 & B & -1.41453 & 0.265511 \\\\\n",
       "\t7 & B & 0.134475 & 0.110096 \\\\\n",
       "\t8 & B & -0.750421 & 0.834362 \\\\\n",
       "\t9 & B & 0.076418 & 0.78299 \\\\\n",
       "\\end{tabular}\n",
       "\n",
       "$\\dots$\n",
       "\n",
       "Last Group (2 rows): Group = \"C\"\n",
       "\n",
       "\\begin{tabular}{r|ccc}\n",
       "\t& Group & Variable1 & Variable2\\\\\n",
       "\t\\hline\n",
       "\t& String & Float64 & Float64\\\\\n",
       "\t\\hline\n",
       "\t1 & C & 0.170778 & 0.633427 \\\\\n",
       "\t2 & C & -1.975 & 0.112987 \\\\\n",
       "\\end{tabular}\n"
      ],
      "text/plain": [
       "GroupedDataFrame with 3 groups based on key: Group\n",
       "First Group (9 rows): Group = \"B\"\n",
       "│ Row │ Group  │ Variable1 │ Variable2 │\n",
       "│     │ \u001b[90mString\u001b[39m │ \u001b[90mFloat64\u001b[39m   │ \u001b[90mFloat64\u001b[39m   │\n",
       "├─────┼────────┼───────────┼───────────┤\n",
       "│ 1   │ B      │ 0.447358  │ 0.137658  │\n",
       "│ 2   │ B      │ -0.396211 │ 0.60808   │\n",
       "│ 3   │ B      │ 0.366773  │ 0.255054  │\n",
       "│ 4   │ B      │ 0.621673  │ 0.498734  │\n",
       "│ 5   │ B      │ 2.06353   │ 0.52509   │\n",
       "│ 6   │ B      │ -1.41453  │ 0.265511  │\n",
       "│ 7   │ B      │ 0.134475  │ 0.110096  │\n",
       "│ 8   │ B      │ -0.750421 │ 0.834362  │\n",
       "│ 9   │ B      │ 0.076418  │ 0.78299   │\n",
       "⋮\n",
       "Last Group (2 rows): Group = \"C\"\n",
       "│ Row │ Group  │ Variable1 │ Variable2 │\n",
       "│     │ \u001b[90mString\u001b[39m │ \u001b[90mFloat64\u001b[39m   │ \u001b[90mFloat64\u001b[39m   │\n",
       "├─────┼────────┼───────────┼───────────┤\n",
       "│ 1   │ C      │ 0.170778  │ 0.633427  │\n",
       "│ 2   │ C      │ -1.975    │ 0.112987  │"
      ]
     },
     "execution_count": 44,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#Group\n",
    "groupby(df6, :Group)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "By calling the `length()` function, we note that there are indeed three sub-dataframes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 45,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "3"
      ]
     },
     "execution_count": 45,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "length(groupby(df6, :Group))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Using indexing, we can select any of the three sub-dataframes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 46,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<table class=\"data-frame\"><thead><tr><th></th><th>Group</th><th>Variable1</th><th>Variable2</th></tr><tr><th></th><th>String</th><th>Float64</th><th>Float64</th></tr></thead><tbody><p>4 rows × 3 columns</p><tr><th>1</th><td>A</td><td>0.182588</td><td>0.0940369</td></tr><tr><th>2</th><td>A</td><td>-1.58492</td><td>0.337865</td></tr><tr><th>3</th><td>A</td><td>0.799335</td><td>0.838042</td></tr><tr><th>4</th><td>A</td><td>-0.734161</td><td>0.0878598</td></tr></tbody></table>"
      ],
      "text/latex": [
       "\\begin{tabular}{r|ccc}\n",
       "\t& Group & Variable1 & Variable2\\\\\n",
       "\t\\hline\n",
       "\t& String & Float64 & Float64\\\\\n",
       "\t\\hline\n",
       "\t1 & A & 0.182588 & 0.0940369 \\\\\n",
       "\t2 & A & -1.58492 & 0.337865 \\\\\n",
       "\t3 & A & 0.799335 & 0.838042 \\\\\n",
       "\t4 & A & -0.734161 & 0.0878598 \\\\\n",
       "\\end{tabular}\n"
      ],
      "text/plain": [
       "4×3 SubDataFrame\n",
       "│ Row │ Group  │ Variable1 │ Variable2 │\n",
       "│     │ \u001b[90mString\u001b[39m │ \u001b[90mFloat64\u001b[39m   │ \u001b[90mFloat64\u001b[39m   │\n",
       "├─────┼────────┼───────────┼───────────┤\n",
       "│ 1   │ A      │ 0.182588  │ 0.0940369 │\n",
       "│ 2   │ A      │ -1.58492  │ 0.337865  │\n",
       "│ 3   │ A      │ 0.799335  │ 0.838042  │\n",
       "│ 4   │ A      │ -0.734161 │ 0.0878598 │"
      ]
     },
     "execution_count": 46,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "groupby(df6, :Group)[2]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "[Back to the top](#In-this-lecture)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Sorting"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Sorting using the `sort!()` function (permanent bang version used here), does what is says on the box.  A list can be provided to sort by more than one variable."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 47,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<table class=\"data-frame\"><thead><tr><th></th><th>Group</th><th>Variable1</th><th>Variable2</th></tr><tr><th></th><th>String</th><th>Float64</th><th>Float64</th></tr></thead><tbody><p>7 rows × 3 columns</p><tr><th>1</th><td>A</td><td>-1.58492</td><td>0.337865</td></tr><tr><th>2</th><td>A</td><td>-0.734161</td><td>0.0878598</td></tr><tr><th>3</th><td>A</td><td>0.182588</td><td>0.0940369</td></tr><tr><th>4</th><td>A</td><td>0.799335</td><td>0.838042</td></tr><tr><th>5</th><td>B</td><td>-1.41453</td><td>0.265511</td></tr><tr><th>6</th><td>B</td><td>-0.750421</td><td>0.834362</td></tr><tr><th>7</th><td>B</td><td>-0.396211</td><td>0.60808</td></tr></tbody></table>"
      ],
      "text/latex": [
       "\\begin{tabular}{r|ccc}\n",
       "\t& Group & Variable1 & Variable2\\\\\n",
       "\t\\hline\n",
       "\t& String & Float64 & Float64\\\\\n",
       "\t\\hline\n",
       "\t1 & A & -1.58492 & 0.337865 \\\\\n",
       "\t2 & A & -0.734161 & 0.0878598 \\\\\n",
       "\t3 & A & 0.182588 & 0.0940369 \\\\\n",
       "\t4 & A & 0.799335 & 0.838042 \\\\\n",
       "\t5 & B & -1.41453 & 0.265511 \\\\\n",
       "\t6 & B & -0.750421 & 0.834362 \\\\\n",
       "\t7 & B & -0.396211 & 0.60808 \\\\\n",
       "\\end{tabular}\n"
      ],
      "text/plain": [
       "7×3 DataFrame\n",
       "│ Row │ Group  │ Variable1 │ Variable2 │\n",
       "│     │ \u001b[90mString\u001b[39m │ \u001b[90mFloat64\u001b[39m   │ \u001b[90mFloat64\u001b[39m   │\n",
       "├─────┼────────┼───────────┼───────────┤\n",
       "│ 1   │ A      │ -1.58492  │ 0.337865  │\n",
       "│ 2   │ A      │ -0.734161 │ 0.0878598 │\n",
       "│ 3   │ A      │ 0.182588  │ 0.0940369 │\n",
       "│ 4   │ A      │ 0.799335  │ 0.838042  │\n",
       "│ 5   │ B      │ -1.41453  │ 0.265511  │\n",
       "│ 6   │ B      │ -0.750421 │ 0.834362  │\n",
       "│ 7   │ B      │ -0.396211 │ 0.60808   │"
      ]
     },
     "execution_count": 47,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df6S = sort!(df6, [:Group, :Variable1]);\n",
    "first(df6S, 7)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "[Back to the top](#In-this-lecture)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Unique rows only"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Below we create a dataframe with two identical rows."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 48,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<table class=\"data-frame\"><thead><tr><th></th><th>A</th><th>B</th><th>C</th></tr><tr><th></th><th>Int64</th><th>Int64</th><th>String</th></tr></thead><tbody><p>6 rows × 3 columns</p><tr><th>1</th><td>1</td><td>11</td><td>A</td></tr><tr><th>2</th><td>2</td><td>12</td><td>B</td></tr><tr><th>3</th><td>2</td><td>12</td><td>B</td></tr><tr><th>4</th><td>3</td><td>13</td><td>C</td></tr><tr><th>5</th><td>4</td><td>14</td><td>D</td></tr><tr><th>6</th><td>5</td><td>15</td><td>E</td></tr></tbody></table>"
      ],
      "text/latex": [
       "\\begin{tabular}{r|ccc}\n",
       "\t& A & B & C\\\\\n",
       "\t\\hline\n",
       "\t& Int64 & Int64 & String\\\\\n",
       "\t\\hline\n",
       "\t1 & 1 & 11 & A \\\\\n",
       "\t2 & 2 & 12 & B \\\\\n",
       "\t3 & 2 & 12 & B \\\\\n",
       "\t4 & 3 & 13 & C \\\\\n",
       "\t5 & 4 & 14 & D \\\\\n",
       "\t6 & 5 & 15 & E \\\\\n",
       "\\end{tabular}\n"
      ],
      "text/plain": [
       "6×3 DataFrame\n",
       "│ Row │ A     │ B     │ C      │\n",
       "│     │ \u001b[90mInt64\u001b[39m │ \u001b[90mInt64\u001b[39m │ \u001b[90mString\u001b[39m │\n",
       "├─────┼───────┼───────┼────────┤\n",
       "│ 1   │ 1     │ 11    │ A      │\n",
       "│ 2   │ 2     │ 12    │ B      │\n",
       "│ 3   │ 2     │ 12    │ B      │\n",
       "│ 4   │ 3     │ 13    │ C      │\n",
       "│ 5   │ 4     │ 14    │ D      │\n",
       "│ 6   │ 5     │ 15    │ E      │"
      ]
     },
     "execution_count": 48,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Creating a DataFrame with an obvious duplicate row\n",
    "df7 = DataFrame(A = [1, 2, 2, 3, 4, 5],  B = [11, 12, 12, 13, 14, 15], C = [\"A\", \"B\", \"B\", \"C\", \"D\", \"E\"]);\n",
    "df7"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `unique()` function will, as the name implies, delete the duplicate row."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 49,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<table class=\"data-frame\"><thead><tr><th></th><th>A</th><th>B</th><th>C</th></tr><tr><th></th><th>Int64</th><th>Int64</th><th>String</th></tr></thead><tbody><p>5 rows × 3 columns</p><tr><th>1</th><td>1</td><td>11</td><td>A</td></tr><tr><th>2</th><td>2</td><td>12</td><td>B</td></tr><tr><th>3</th><td>3</td><td>13</td><td>C</td></tr><tr><th>4</th><td>4</td><td>14</td><td>D</td></tr><tr><th>5</th><td>5</td><td>15</td><td>E</td></tr></tbody></table>"
      ],
      "text/latex": [
       "\\begin{tabular}{r|ccc}\n",
       "\t& A & B & C\\\\\n",
       "\t\\hline\n",
       "\t& Int64 & Int64 & String\\\\\n",
       "\t\\hline\n",
       "\t1 & 1 & 11 & A \\\\\n",
       "\t2 & 2 & 12 & B \\\\\n",
       "\t3 & 3 & 13 & C \\\\\n",
       "\t4 & 4 & 14 & D \\\\\n",
       "\t5 & 5 & 15 & E \\\\\n",
       "\\end{tabular}\n"
      ],
      "text/plain": [
       "5×3 DataFrame\n",
       "│ Row │ A     │ B     │ C      │\n",
       "│     │ \u001b[90mInt64\u001b[39m │ \u001b[90mInt64\u001b[39m │ \u001b[90mString\u001b[39m │\n",
       "├─────┼───────┼───────┼────────┤\n",
       "│ 1   │ 1     │ 11    │ A      │\n",
       "│ 2   │ 2     │ 12    │ B      │\n",
       "│ 3   │ 3     │ 13    │ C      │\n",
       "│ 4   │ 4     │ 14    │ D      │\n",
       "│ 5   │ 5     │ 15    │ E      │"
      ]
     },
     "execution_count": 49,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Only unique rows\n",
    "unique(df7)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 50,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<table class=\"data-frame\"><thead><tr><th></th><th>A</th><th>B</th><th>C</th></tr><tr><th></th><th>Int64</th><th>Int64</th><th>String</th></tr></thead><tbody><p>6 rows × 3 columns</p><tr><th>1</th><td>1</td><td>11</td><td>A</td></tr><tr><th>2</th><td>2</td><td>12</td><td>B</td></tr><tr><th>3</th><td>2</td><td>12</td><td>B</td></tr><tr><th>4</th><td>3</td><td>13</td><td>C</td></tr><tr><th>5</th><td>4</td><td>14</td><td>D</td></tr><tr><th>6</th><td>5</td><td>15</td><td>E</td></tr></tbody></table>"
      ],
      "text/latex": [
       "\\begin{tabular}{r|ccc}\n",
       "\t& A & B & C\\\\\n",
       "\t\\hline\n",
       "\t& Int64 & Int64 & String\\\\\n",
       "\t\\hline\n",
       "\t1 & 1 & 11 & A \\\\\n",
       "\t2 & 2 & 12 & B \\\\\n",
       "\t3 & 2 & 12 & B \\\\\n",
       "\t4 & 3 & 13 & C \\\\\n",
       "\t5 & 4 & 14 & D \\\\\n",
       "\t6 & 5 & 15 & E \\\\\n",
       "\\end{tabular}\n"
      ],
      "text/plain": [
       "6×3 DataFrame\n",
       "│ Row │ A     │ B     │ C      │\n",
       "│     │ \u001b[90mInt64\u001b[39m │ \u001b[90mInt64\u001b[39m │ \u001b[90mString\u001b[39m │\n",
       "├─────┼───────┼───────┼────────┤\n",
       "│ 1   │ 1     │ 11    │ A      │\n",
       "│ 2   │ 2     │ 12    │ B      │\n",
       "│ 3   │ 2     │ 12    │ B      │\n",
       "│ 4   │ 3     │ 13    │ C      │\n",
       "│ 5   │ 4     │ 14    │ D      │\n",
       "│ 6   │ 5     │ 15    │ E      │"
      ]
     },
     "execution_count": 50,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df7"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As always, the bang will make the change permament."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 51,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<table class=\"data-frame\"><thead><tr><th></th><th>A</th><th>B</th><th>C</th></tr><tr><th></th><th>Int64</th><th>Int64</th><th>String</th></tr></thead><tbody><p>5 rows × 3 columns</p><tr><th>1</th><td>1</td><td>11</td><td>A</td></tr><tr><th>2</th><td>2</td><td>12</td><td>B</td></tr><tr><th>3</th><td>3</td><td>13</td><td>C</td></tr><tr><th>4</th><td>4</td><td>14</td><td>D</td></tr><tr><th>5</th><td>5</td><td>15</td><td>E</td></tr></tbody></table>"
      ],
      "text/latex": [
       "\\begin{tabular}{r|ccc}\n",
       "\t& A & B & C\\\\\n",
       "\t\\hline\n",
       "\t& Int64 & Int64 & String\\\\\n",
       "\t\\hline\n",
       "\t1 & 1 & 11 & A \\\\\n",
       "\t2 & 2 & 12 & B \\\\\n",
       "\t3 & 3 & 13 & C \\\\\n",
       "\t4 & 4 & 14 & D \\\\\n",
       "\t5 & 5 & 15 & E \\\\\n",
       "\\end{tabular}\n"
      ],
      "text/plain": [
       "5×3 DataFrame\n",
       "│ Row │ A     │ B     │ C      │\n",
       "│     │ \u001b[90mInt64\u001b[39m │ \u001b[90mInt64\u001b[39m │ \u001b[90mString\u001b[39m │\n",
       "├─────┼───────┼───────┼────────┤\n",
       "│ 1   │ 1     │ 11    │ A      │\n",
       "│ 2   │ 2     │ 12    │ B      │\n",
       "│ 3   │ 3     │ 13    │ C      │\n",
       "│ 4   │ 4     │ 14    │ D      │\n",
       "│ 5   │ 5     │ 15    │ E      │"
      ]
     },
     "execution_count": 51,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Permanant change\n",
    "unique!(df7)\n",
    "df7"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 52,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<table class=\"data-frame\"><thead><tr><th></th><th>A</th><th>B</th><th>C</th></tr><tr><th></th><th>Int64</th><th>Int64</th><th>String</th></tr></thead><tbody><p>5 rows × 3 columns</p><tr><th>1</th><td>1</td><td>11</td><td>A</td></tr><tr><th>2</th><td>2</td><td>12</td><td>B</td></tr><tr><th>3</th><td>3</td><td>13</td><td>C</td></tr><tr><th>4</th><td>4</td><td>14</td><td>D</td></tr><tr><th>5</th><td>5</td><td>15</td><td>E</td></tr></tbody></table>"
      ],
      "text/latex": [
       "\\begin{tabular}{r|ccc}\n",
       "\t& A & B & C\\\\\n",
       "\t\\hline\n",
       "\t& Int64 & Int64 & String\\\\\n",
       "\t\\hline\n",
       "\t1 & 1 & 11 & A \\\\\n",
       "\t2 & 2 & 12 & B \\\\\n",
       "\t3 & 3 & 13 & C \\\\\n",
       "\t4 & 4 & 14 & D \\\\\n",
       "\t5 & 5 & 15 & E \\\\\n",
       "\\end{tabular}\n"
      ],
      "text/plain": [
       "5×3 DataFrame\n",
       "│ Row │ A     │ B     │ C      │\n",
       "│     │ \u001b[90mInt64\u001b[39m │ \u001b[90mInt64\u001b[39m │ \u001b[90mString\u001b[39m │\n",
       "├─────┼───────┼───────┼────────┤\n",
       "│ 1   │ 1     │ 11    │ A      │\n",
       "│ 2   │ 2     │ 12    │ B      │\n",
       "│ 3   │ 3     │ 13    │ C      │\n",
       "│ 4   │ 4     │ 14    │ D      │\n",
       "│ 5   │ 5     │ 15    │ E      │"
      ]
     },
     "execution_count": 52,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#Permanant change\n",
    "unique!(df7)\n",
    "df7"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "[Back to the top](#In-this-lecture)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Deleting rows"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `deleterows!()` function (permanent bang version used here), deletes specified rows."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 53,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<table class=\"data-frame\"><thead><tr><th></th><th>A</th><th>B</th><th>C</th></tr><tr><th></th><th>Int64</th><th>Int64</th><th>String</th></tr></thead><tbody><p>3 rows × 3 columns</p><tr><th>1</th><td>2</td><td>12</td><td>B</td></tr><tr><th>2</th><td>3</td><td>13</td><td>C</td></tr><tr><th>3</th><td>4</td><td>14</td><td>D</td></tr></tbody></table>"
      ],
      "text/latex": [
       "\\begin{tabular}{r|ccc}\n",
       "\t& A & B & C\\\\\n",
       "\t\\hline\n",
       "\t& Int64 & Int64 & String\\\\\n",
       "\t\\hline\n",
       "\t1 & 2 & 12 & B \\\\\n",
       "\t2 & 3 & 13 & C \\\\\n",
       "\t3 & 4 & 14 & D \\\\\n",
       "\\end{tabular}\n"
      ],
      "text/plain": [
       "3×3 DataFrame\n",
       "│ Row │ A     │ B     │ C      │\n",
       "│     │ \u001b[90mInt64\u001b[39m │ \u001b[90mInt64\u001b[39m │ \u001b[90mString\u001b[39m │\n",
       "├─────┼───────┼───────┼────────┤\n",
       "│ 1   │ 2     │ 12    │ B      │\n",
       "│ 2   │ 3     │ 13    │ C      │\n",
       "│ 3   │ 4     │ 14    │ D      │"
      ]
     },
     "execution_count": 53,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Permanently\n",
    "deleterows!(df7, [1, 5])\n",
    "df7"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "[Back to the top](#In-this-lecture)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Julia 1.0.3",
   "language": "julia",
   "name": "julia-1.0"
  },
  "language_info": {
   "file_extension": ".jl",
   "mimetype": "application/julia",
   "name": "julia",
   "version": "1.0.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}