{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Working with data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## In this lecture" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- [Introduction](#Introduction)\n", "- [Distributions](#Distributions)\n", "- [Normal distribution](#Normal-distribution)\n", "- [Other distributions](#Other-distributions)\n", "- [DataFrames](#DataFrames)\n", " - [Combining dataframes](#Combining-dataframes)\n", " - [Grouping](#Grouping)\n", " - [Sorting](#Sorting)\n", " - [Unique rows only](#Unique-rows-only)\n", " - [Deleting rows](#Deleting-rows)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Introduction" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The ability to use data is fundamental to most modern computer coding taks. In this lecture, we will have a brief introduction to the way in which the Julia language incorporates data through the use of the `Distributions.jl` and `DataFrames.jl` packages." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[Back to the top](#In-this-lecture)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Distributions" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Data point values for a distribution usually follow a pattern. Such patterns are called distributions. Distributions are either discrete or continuous. The `Distribution.jl` package contains most of the common data distributions." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will also use the `Random.jl` package to seed the pseudo-random number generator so that we can reproduce the random values that we are going to use in the lecture." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "using Distributions\n", "using Random" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[Back to the top](#In-this-lecture)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### The normal distribution" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The normal distribution is the famous bell-shaped curve that we are familiar with. Values around the mean occur most frequently and as values get progressively further away from the mean, they occur less frequently." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(0.0, 1.0)" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Seed the pseudo-random number generator\n", "Random.seed!(1234)\n", "#Saving the standard normal distribution as an object\n", "n = Distributions.Normal() # This function is from the Distributions package\n", "#Parameter values of the standard normal distribution\n", "params(n)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Using the `params()` function, we note a mean on $0$ and a standard deviation of $1$, also called the _standard normal distribution_." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `fieldnames()` function provides the actual parameters of the given distribution. In the case of the normal distribution, it will be the average and the standard deviation, namely $\\mu$ and $\\sigma$." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(:μ, :σ)" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Returning the parameters of the normal distribution\n", "fieldnames(Normal)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we create a variable called `var1` and use the `rand()` function to create select $10$ random values from the standard normal distribution." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "#Seed the pseudo-random number generator\n", "Random.seed!(1234)\n", "#Select 10 elements at random from n\n", "var1 = rand(n, 10);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can calculate the average and standard deviation of our randomly selected values." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.18909179133831322" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Average\n", "mean(var1)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.9879593623730926" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Standard deviation\n", "std(var1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `pdf()` calculates the probability density function value of a given distribution up until a specified point (from $- \\infty$)." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.38138781546052414" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Probability density function value at x = 0.3\n", "pdf(Normal(), 0.3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `cdf()` functions calculates the cummulative distribution function value of a given distribution up until a specified point (from $- \\infty$)." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.5987063256829237" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Cumulative distribution function as x = 0.25\n", "cdf(Normal(), 0.25)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The values for the average and standard deviation can be specified." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "# Creating 100 data point values from a normal distribution\n", "# with a mean of 100 and a standard deviation of 10\n", "Random.seed!(1234)\n", "var2 = rand(Normal(100, 10), 100);" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "98.52365657772843" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Calculating the mean of var2\n", "mean(var2)" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "9.580963685859091" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Calculating the standard deviation of var2\n", "std(var2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The parameters of a set of values for a specified distribution can be returned." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Normal{Float64}(μ=98.52365657772843, σ=9.532938502804532)" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Using fit() to calculate the parameters of a distribution\n", "fit(Normal, var2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `quantiles()` function provides us with values for the specific percentiles (provided as fractions). Below we calculate the $2.5$% and $97.5$% percentile values of the standard normal distribution." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "-1.9599639845400592" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Quantiles\n", "quantile(Normal(), 0.025)" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1.9599639845400576" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "quantile(Normal(), 0.975)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[Back to the top](#In-this-lecture)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Other distributions" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are many distributions in the `Distribution().jl` package. In the code below, a few of these are showcased by way of setting parameters, selecting random values, and fitting those value back to the distribution or returning the parameter field names." ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Beta{Float64}(α=1.236721159927394, β=1.1368118923305863)" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Beta distribution\n", "b = Beta(1, 1)\n", "params(b)\n", "Random.seed!(1234)\n", "var3 = rand(b, 100);\n", "fit(Beta, var3)" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(:ν,)" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# χ2 distribution\n", "c = Chisq(1)\n", "Random.seed!(1234)\n", "var4 = rand(c, 100)\n", "fieldnames(Chisq) # Degrees of freedom" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[Back to the top](#In-this-lecture)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Dataframes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `Dataframes.jl` package allows for creation of a flat data structure (rows and columns). Columns are variables and rows are subjects (examples)." ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "using DataFrames" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Below, we create an empty dataframe object that we call `df`." ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "#Create and empty DataFrame\n", "df = DataFrame();" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Column headers representing statistical variable names are entered in square brackets as symbols, i.e. preceeded with a colon. We will attach the `var2` set of values as data point entries for this statistical variables." ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "# Add a column with data point values (rows)\n", "df[:Var2] = var2;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can specify to print the first $5$ rows to the screen with the `first()` function," ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/html": [ "

5 rows × 1 columns

Var2
Float64
1108.673
290.9826
395.0552
490.9709
5108.644
" ], "text/latex": [ "\\begin{tabular}{r|c}\n", "\t& Var2\\\\\n", "\t\\hline\n", "\t& Float64\\\\\n", "\t\\hline\n", "\t1 & 108.673 \\\\\n", "\t2 & 90.9826 \\\\\n", "\t3 & 95.0552 \\\\\n", "\t4 & 90.9709 \\\\\n", "\t5 & 108.644 \\\\\n", "\\end{tabular}\n" ], "text/plain": [ "5×1 DataFrame\n", "│ Row │ Var2 │\n", "│ │ \u001b[90mFloat64\u001b[39m │\n", "├─────┼─────────┤\n", "│ 1 │ 108.673 │\n", "│ 2 │ 90.9826 │\n", "│ 3 │ 95.0552 │\n", "│ 4 │ 90.9709 │\n", "│ 5 │ 108.644 │" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#View first five rows\n", "first(df, 5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Below, we create another statistical variable with some data point values that we already have in the waiting." ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [], "source": [ "# Add another column\n", "df[:Var3] = var3;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `last()` functions shows the last specified rows." ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/html": [ "

3 rows × 2 columns

Var2Var3
Float64Float64
195.56750.831916
283.36770.221771
394.78770.655592
" ], "text/latex": [ "\\begin{tabular}{r|cc}\n", "\t& Var2 & Var3\\\\\n", "\t\\hline\n", "\t& Float64 & Float64\\\\\n", "\t\\hline\n", "\t1 & 95.5675 & 0.831916 \\\\\n", "\t2 & 83.3677 & 0.221771 \\\\\n", "\t3 & 94.7877 & 0.655592 \\\\\n", "\\end{tabular}\n" ], "text/plain": [ "3×2 DataFrame\n", "│ Row │ Var2 │ Var3 │\n", "│ │ \u001b[90mFloat64\u001b[39m │ \u001b[90mFloat64\u001b[39m │\n", "├─────┼─────────┼──────────┤\n", "│ 1 │ 95.5675 │ 0.831916 │\n", "│ 2 │ 83.3677 │ 0.221771 │\n", "│ 3 │ 94.7877 │ 0.655592 │" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# View last three rows\n", "last(df, 3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `size()` function returns a tuple with the number of rows and columns returned," ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(100, 2)" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Dimensions of a DataFrame\n", "size(df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `describe()` functions attemps tp provide summary statistics of the variables>" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/html": [ "

2 rows × 8 columns

variablemeanminmedianmaxnuniquenmissingeltype
SymbolFloat64Float64Float64Float64NothingNothingDataType
1Var298.523767.886498.1718124.175Float64
2Var30.5210470.001453840.5228080.971161Float64
" ], "text/latex": [ "\\begin{tabular}{r|cccccccc}\n", "\t& variable & mean & min & median & max & nunique & nmissing & eltype\\\\\n", "\t\\hline\n", "\t& Symbol & Float64 & Float64 & Float64 & Float64 & Nothing & Nothing & DataType\\\\\n", "\t\\hline\n", "\t1 & Var2 & 98.5237 & 67.8864 & 98.1718 & 124.175 & & & Float64 \\\\\n", "\t2 & Var3 & 0.521047 & 0.00145384 & 0.522808 & 0.971161 & & & Float64 \\\\\n", "\\end{tabular}\n" ], "text/plain": [ "2×8 DataFrame. Omitted printing of 2 columns\n", "│ Row │ variable │ mean │ min │ median │ max │ nunique │\n", "│ │ \u001b[90mSymbol\u001b[39m │ \u001b[90mFloat64\u001b[39m │ \u001b[90mFloat64\u001b[39m │ \u001b[90mFloat64\u001b[39m │ \u001b[90mFloat64\u001b[39m │ \u001b[90mNothing\u001b[39m │\n", "├─────┼──────────┼──────────┼────────────┼──────────┼──────────┼─────────┤\n", "│ 1 │ Var2 │ 98.5237 │ 67.8864 │ 98.1718 │ 124.175 │ │\n", "│ 2 │ Var3 │ 0.521047 │ 0.00145384 │ 0.522808 │ 0.971161 │ │" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Summarize the content\n", "describe(df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The data type for each variable can be returned." ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "2-element Array{DataType,1}:\n", " Float64\n", " Float64" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Data type only\n", "eltypes(df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Below we create a new instance of a dataframe object called `df2`. It contains four statistical variables. Note the use of symbol notation in creating the names of these variables." ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [], "source": [ "# 6 Create a bigger DataFrame\n", "df2 = DataFrame()\n", "df2[:A] = 1:10\n", "df2[:B] = [\"I\", \"II\", \"II\", \"I\", \"II\",\"I\", \"II\", \"II\", \"I\", \"II\"]\n", "Random.seed!(1234)\n", "df2[:C] = rand(Normal(), 10)\n", "df2[:D] = rand(Chisq(1), 10);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "By using indexing (in square brackets), we can refer to row and column values (i.e. _row, column_). Below is an example of seleting data point values for rows one through three, showing all the columns. The colon symbol serves as shortcut syntax for this selection." ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/html": [ "

3 rows × 4 columns

ABCD
Int64StringFloat64Float64
11I0.8673470.0123688
22II-0.9017440.213586
33II-0.4944790.00899443
" ], "text/latex": [ "\\begin{tabular}{r|cccc}\n", "\t& A & B & C & D\\\\\n", "\t\\hline\n", "\t& Int64 & String & Float64 & Float64\\\\\n", "\t\\hline\n", "\t1 & 1 & I & 0.867347 & 0.0123688 \\\\\n", "\t2 & 2 & II & -0.901744 & 0.213586 \\\\\n", "\t3 & 3 & II & -0.494479 & 0.00899443 \\\\\n", "\\end{tabular}\n" ], "text/plain": [ "3×4 DataFrame\n", "│ Row │ A │ B │ C │ D │\n", "│ │ \u001b[90mInt64\u001b[39m │ \u001b[90mString\u001b[39m │ \u001b[90mFloat64\u001b[39m │ \u001b[90mFloat64\u001b[39m │\n", "├─────┼───────┼────────┼───────────┼────────────┤\n", "│ 1 │ 1 │ I │ 0.867347 │ 0.0123688 │\n", "│ 2 │ 2 │ II │ -0.901744 │ 0.213586 │\n", "│ 3 │ 3 │ II │ -0.494479 │ 0.00899443 │" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# First three rows with all the colums\n", "df2[1:3, :]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If only specified columns, that is to say, not the range of one, two, and three as we did above, but rather only colums one and three, we create a list to indicate this." ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/html": [ "

10 rows × 2 columns

AC
Int64Float64
110.867347
22-0.901744
33-0.494479
44-0.902914
550.864401
662.21188
770.532813
88-0.271735
990.502334
1010-0.516984
" ], "text/latex": [ "\\begin{tabular}{r|cc}\n", "\t& A & C\\\\\n", "\t\\hline\n", "\t& Int64 & Float64\\\\\n", "\t\\hline\n", "\t1 & 1 & 0.867347 \\\\\n", "\t2 & 2 & -0.901744 \\\\\n", "\t3 & 3 & -0.494479 \\\\\n", "\t4 & 4 & -0.902914 \\\\\n", "\t5 & 5 & 0.864401 \\\\\n", "\t6 & 6 & 2.21188 \\\\\n", "\t7 & 7 & 0.532813 \\\\\n", "\t8 & 8 & -0.271735 \\\\\n", "\t9 & 9 & 0.502334 \\\\\n", "\t10 & 10 & -0.516984 \\\\\n", "\\end{tabular}\n" ], "text/plain": [ "10×2 DataFrame\n", "│ Row │ A │ C │\n", "│ │ \u001b[90mInt64\u001b[39m │ \u001b[90mFloat64\u001b[39m │\n", "├─────┼───────┼───────────┤\n", "│ 1 │ 1 │ 0.867347 │\n", "│ 2 │ 2 │ -0.901744 │\n", "│ 3 │ 3 │ -0.494479 │\n", "│ 4 │ 4 │ -0.902914 │\n", "│ 5 │ 5 │ 0.864401 │\n", "│ 6 │ 6 │ 2.21188 │\n", "│ 7 │ 7 │ 0.532813 │\n", "│ 8 │ 8 │ -0.271735 │\n", "│ 9 │ 9 │ 0.502334 │\n", "│ 10 │ 10 │ -0.516984 │" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# All rows columns 1 and 3\n", "df2[:, [1, 3]]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Instead of indicating the column numbers, we can also reference the actual column names (statistical variable names), using symbol notation, i.e. `:A`." ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "text/html": [ "

10 rows × 2 columns

AC
Int64Float64
110.867347
22-0.901744
33-0.494479
44-0.902914
550.864401
662.21188
770.532813
88-0.271735
990.502334
1010-0.516984
" ], "text/latex": [ "\\begin{tabular}{r|cc}\n", "\t& A & C\\\\\n", "\t\\hline\n", "\t& Int64 & Float64\\\\\n", "\t\\hline\n", "\t1 & 1 & 0.867347 \\\\\n", "\t2 & 2 & -0.901744 \\\\\n", "\t3 & 3 & -0.494479 \\\\\n", "\t4 & 4 & -0.902914 \\\\\n", "\t5 & 5 & 0.864401 \\\\\n", "\t6 & 6 & 2.21188 \\\\\n", "\t7 & 7 & 0.532813 \\\\\n", "\t8 & 8 & -0.271735 \\\\\n", "\t9 & 9 & 0.502334 \\\\\n", "\t10 & 10 & -0.516984 \\\\\n", "\\end{tabular}\n" ], "text/plain": [ "10×2 DataFrame\n", "│ Row │ A │ C │\n", "│ │ \u001b[90mInt64\u001b[39m │ \u001b[90mFloat64\u001b[39m │\n", "├─────┼───────┼───────────┤\n", "│ 1 │ 1 │ 0.867347 │\n", "│ 2 │ 2 │ -0.901744 │\n", "│ 3 │ 3 │ -0.494479 │\n", "│ 4 │ 4 │ -0.902914 │\n", "│ 5 │ 5 │ 0.864401 │\n", "│ 6 │ 6 │ 2.21188 │\n", "│ 7 │ 7 │ 0.532813 │\n", "│ 8 │ 8 │ -0.271735 │\n", "│ 9 │ 9 │ 0.502334 │\n", "│ 10 │ 10 │ -0.516984 │" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Different notation\n", "df2[:, [:A, :C]]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `CSV.jl` package's `read()` function can import a comma separated values data file." ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [], "source": [ "# Make sure to install the package in the REPL first\n", "using CSV" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The file is saved in the same directory / folder as this notebook file." ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [], "source": [ "# Import csv file (in same directory / folder)\n", "data1 = CSV.read(\"CCS.csv\");" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Using the `type()` function, we note that we now have an instance of a dataframe object." ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "DataFrame" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "typeof(data1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's view the first five rows of data." ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "data": { "text/html": [ "

5 rows × 6 columns

PatientIDCat1Cat2Var1Var2Var3
Int64⍰String⍰String⍰Float64⍰Float64⍰Float64⍰
11AC38.25685.9391335.0579
22AC17.83175.3475421.131
38AB16.02186.6070960.9436
49AC45.11586.0073321.8797
516AC20.4488.5481920.6623
" ], "text/latex": [ "\\begin{tabular}{r|cccccc}\n", "\t& PatientID & Cat1 & Cat2 & Var1 & Var2 & Var3\\\\\n", "\t\\hline\n", "\t& Int64⍰ & String⍰ & String⍰ & Float64⍰ & Float64⍰ & Float64⍰\\\\\n", "\t\\hline\n", "\t1 & 1 & A & C & 38.2568 & 5.93913 & 35.0579 \\\\\n", "\t2 & 2 & A & C & 17.8317 & 5.34754 & 21.131 \\\\\n", "\t3 & 8 & A & B & 16.0218 & 6.60709 & 60.9436 \\\\\n", "\t4 & 9 & A & C & 45.1158 & 6.00733 & 21.8797 \\\\\n", "\t5 & 16 & A & C & 20.448 & 8.54819 & 20.6623 \\\\\n", "\\end{tabular}\n" ], "text/plain": [ "5×6 DataFrame\n", "│ Row │ PatientID │ Cat1 │ Cat2 │ Var1 │ Var2 │ Var3 │\n", "│ │ \u001b[90mInt64⍰\u001b[39m │ \u001b[90mString⍰\u001b[39m │ \u001b[90mString⍰\u001b[39m │ \u001b[90mFloat64⍰\u001b[39m │ \u001b[90mFloat64⍰\u001b[39m │ \u001b[90mFloat64⍰\u001b[39m │\n", "├─────┼───────────┼─────────┼─────────┼──────────┼──────────┼──────────┤\n", "│ 1 │ 1 │ A │ C │ 38.2568 │ 5.93913 │ 35.0579 │\n", "│ 2 │ 2 │ A │ C │ 17.8317 │ 5.34754 │ 21.131 │\n", "│ 3 │ 8 │ A │ B │ 16.0218 │ 6.60709 │ 60.9436 │\n", "│ 4 │ 9 │ A │ C │ 45.1158 │ 6.00733 │ 21.8797 │\n", "│ 5 │ 16 │ A │ C │ 20.448 │ 8.54819 │ 20.6623 │" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "first(data1, 5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `describe()` function will attempt to summarize all the variables. In the case of categorical variables, an alphabetical arrangement for minimum and maximum values will be stated." ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "data": { "text/html": [ "

6 rows × 8 columns

variablemeanminmedianmaxnuniquenmissingeltype
SymbolUnion…AnyUnion…AnyUnion…Int64DataType
1PatientID60.5160.51200Int64
2Cat1AB20String
3Cat2BX60String
4Var127.967915.235622.680184.23780Float64
5Var25.921213.011735.6424115.58260Float64
6Var351.9520.315344.3042147.3970Float64
" ], "text/latex": [ "\\begin{tabular}{r|cccccccc}\n", "\t& variable & mean & min & median & max & nunique & nmissing & eltype\\\\\n", "\t\\hline\n", "\t& Symbol & Union… & Any & Union… & Any & Union… & Int64 & DataType\\\\\n", "\t\\hline\n", "\t1 & PatientID & 60.5 & 1 & 60.5 & 120 & & 0 & Int64 \\\\\n", "\t2 & Cat1 & & A & & B & 2 & 0 & String \\\\\n", "\t3 & Cat2 & & B & & X & 6 & 0 & String \\\\\n", "\t4 & Var1 & 27.9679 & 15.2356 & 22.6801 & 84.2378 & & 0 & Float64 \\\\\n", "\t5 & Var2 & 5.92121 & 3.01173 & 5.64241 & 15.5826 & & 0 & Float64 \\\\\n", "\t6 & Var3 & 51.95 & 20.3153 & 44.3042 & 147.397 & & 0 & Float64 \\\\\n", "\\end{tabular}\n" ], "text/plain": [ "6×8 DataFrame. Omitted printing of 1 columns\n", "│ Row │ variable │ mean │ min │ median │ max │ nunique │ nmissing │\n", "│ │ \u001b[90mSymbol\u001b[39m │ \u001b[90mUnion…\u001b[39m │ \u001b[90mAny\u001b[39m │ \u001b[90mUnion…\u001b[39m │ \u001b[90mAny\u001b[39m │ \u001b[90mUnion…\u001b[39m │ \u001b[90mInt64\u001b[39m │\n", "├─────┼───────────┼─────────┼─────────┼─────────┼─────────┼─────────┼──────────┤\n", "│ 1 │ PatientID │ 60.5 │ 1 │ 60.5 │ 120 │ │ 0 │\n", "│ 2 │ Cat1 │ │ A │ │ B │ 2 │ 0 │\n", "│ 3 │ Cat2 │ │ B │ │ X │ 6 │ 0 │\n", "│ 4 │ Var1 │ 27.9679 │ 15.2356 │ 22.6801 │ 84.2378 │ │ 0 │\n", "│ 5 │ Var2 │ 5.92121 │ 3.01173 │ 5.64241 │ 15.5826 │ │ 0 │\n", "│ 6 │ Var3 │ 51.95 │ 20.3153 │ 44.3042 │ 147.397 │ │ 0 │" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "describe(data1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[Back to the top](#In-this-lecture)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Combining dataframes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Combining dataframes on a common variable is a very useful operation. Below we create two dataframe instances. Note that both have a `Number` variable." ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [], "source": [ "# Creating DataFrames\n", "subjects = DataFrame(Number = [100, 101, 102, 103], Stage = [\"I\", \"III\", \"II\", \"I\"])\n", "treatment = DataFrame(Number = [103, 102, 101, 100], Treatment = [\"A\", \"B\", \"A\", \"B\"]);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `join()` function takes the dataframe objects that require joining as arguments. The `on =` argument (in symbol form), specifies the variable on which to join. In this default mode, only values for the stated variable that appear in both dataframes will be included." ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "data": { "text/html": [ "

4 rows × 3 columns

NumberStageTreatment
Int64StringString
1100IB
2101IIIA
3102IIB
4103IA
" ], "text/latex": [ "\\begin{tabular}{r|ccc}\n", "\t& Number & Stage & Treatment\\\\\n", "\t\\hline\n", "\t& Int64 & String & String\\\\\n", "\t\\hline\n", "\t1 & 100 & I & B \\\\\n", "\t2 & 101 & III & A \\\\\n", "\t3 & 102 & II & B \\\\\n", "\t4 & 103 & I & A \\\\\n", "\\end{tabular}\n" ], "text/plain": [ "4×3 DataFrame\n", "│ Row │ Number │ Stage │ Treatment │\n", "│ │ \u001b[90mInt64\u001b[39m │ \u001b[90mString\u001b[39m │ \u001b[90mString\u001b[39m │\n", "├─────┼────────┼────────┼───────────┤\n", "│ 1 │ 100 │ I │ B │\n", "│ 2 │ 101 │ III │ A │\n", "│ 3 │ 102 │ II │ B │\n", "│ 4 │ 103 │ I │ A │" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Joining\n", "df3 = join(subjects, treatment, on = :Number);\n", "df3" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [], "source": [ "# Adding a longer list of subjects\n", "subjects = DataFrame(Number = [100, 101, 102, 103, 104, 105], Stage = [\"I\", \"III\", \"II\", \"I\", \"II\", \"II\"]);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `kind =` argument allows for more control. An inner join is the default (same as above)." ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "data": { "text/html": [ "

4 rows × 3 columns

NumberStageTreatment
Int64StringString
1100IB
2101IIIA
3102IIB
4103IA
" ], "text/latex": [ "\\begin{tabular}{r|ccc}\n", "\t& Number & Stage & Treatment\\\\\n", "\t\\hline\n", "\t& Int64 & String & String\\\\\n", "\t\\hline\n", "\t1 & 100 & I & B \\\\\n", "\t2 & 101 & III & A \\\\\n", "\t3 & 102 & II & B \\\\\n", "\t4 & 103 & I & A \\\\\n", "\\end{tabular}\n" ], "text/plain": [ "4×3 DataFrame\n", "│ Row │ Number │ Stage │ Treatment │\n", "│ │ \u001b[90mInt64\u001b[39m │ \u001b[90mString\u001b[39m │ \u001b[90mString\u001b[39m │\n", "├─────┼────────┼────────┼───────────┤\n", "│ 1 │ 100 │ I │ B │\n", "│ 2 │ 101 │ III │ A │\n", "│ 3 │ 102 │ II │ B │\n", "│ 4 │ 103 │ I │ A │" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Inner join\n", " df4 = join(subjects, treatment, on = :Number, kind = :inner);\n", " df4" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "An outer join will join both dataframes and add `missing` data." ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "data": { "text/html": [ "

6 rows × 3 columns

NumberStageTreatment
Int64⍰String⍰String⍰
1100IB
2101IIIA
3102IIB
4103IA
5104IImissing
6105IImissing
" ], "text/latex": [ "\\begin{tabular}{r|ccc}\n", "\t& Number & Stage & Treatment\\\\\n", "\t\\hline\n", "\t& Int64⍰ & String⍰ & String⍰\\\\\n", "\t\\hline\n", "\t1 & 100 & I & B \\\\\n", "\t2 & 101 & III & A \\\\\n", "\t3 & 102 & II & B \\\\\n", "\t4 & 103 & I & A \\\\\n", "\t5 & 104 & II & \\\\\n", "\t6 & 105 & II & \\\\\n", "\\end{tabular}\n" ], "text/plain": [ "6×3 DataFrame\n", "│ Row │ Number │ Stage │ Treatment │\n", "│ │ \u001b[90mInt64⍰\u001b[39m │ \u001b[90mString⍰\u001b[39m │ \u001b[90mString⍰\u001b[39m │\n", "├─────┼────────┼─────────┼───────────┤\n", "│ 1 │ 100 │ I │ B │\n", "│ 2 │ 101 │ III │ A │\n", "│ 3 │ 102 │ II │ B │\n", "│ 4 │ 103 │ I │ A │\n", "│ 5 │ 104 │ II │ \u001b[90mmissing\u001b[39m │\n", "│ 6 │ 105 │ II │ \u001b[90mmissing\u001b[39m │" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Outer joing: empty fields filled with missing\n", "df5 = join(subjects, treatment, on = :Number, kind = :outer);\n", "df5" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[Back to the top](#In-this-lecture)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Grouping" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A dataframe can be _spliced_ by grouping rows according to values in a variable." ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [ { "data": { "text/html": [ "

3 rows × 3 columns

GroupVariable1Variable2
StringFloat64Float64
1B0.4473580.137658
2B-0.3962110.60808
3B0.3667730.255054
" ], "text/latex": [ "\\begin{tabular}{r|ccc}\n", "\t& Group & Variable1 & Variable2\\\\\n", "\t\\hline\n", "\t& String & Float64 & Float64\\\\\n", "\t\\hline\n", "\t1 & B & 0.447358 & 0.137658 \\\\\n", "\t2 & B & -0.396211 & 0.60808 \\\\\n", "\t3 & B & 0.366773 & 0.255054 \\\\\n", "\\end{tabular}\n" ], "text/plain": [ "3×3 DataFrame\n", "│ Row │ Group │ Variable1 │ Variable2 │\n", "│ │ \u001b[90mString\u001b[39m │ \u001b[90mFloat64\u001b[39m │ \u001b[90mFloat64\u001b[39m │\n", "├─────┼────────┼───────────┼───────────┤\n", "│ 1 │ B │ 0.447358 │ 0.137658 │\n", "│ 2 │ B │ -0.396211 │ 0.60808 │\n", "│ 3 │ B │ 0.366773 │ 0.255054 │" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Creating a new DataFrame\n", "df6 = DataFrame(Group = rand([\"A\", \"B\", \"C\"], 15), Variable1 = randn(15), Variable2 = rand(15));\n", "first(df6, 3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `by()` function takes a dataframe object as first argument. This is followed by a column (variable) on which to group by. Below we use the `size` argument to indicate the number of rows and columns for the number of each unique values that are found in the specified variable." ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "data": { "text/html": [ "

3 rows × 2 columns

Groupx1
StringTuple…
1B(9, 3)
2A(4, 3)
3C(2, 3)
" ], "text/latex": [ "\\begin{tabular}{r|cc}\n", "\t& Group & x1\\\\\n", "\t\\hline\n", "\t& String & Tuple…\\\\\n", "\t\\hline\n", "\t1 & B & (9, 3) \\\\\n", "\t2 & A & (4, 3) \\\\\n", "\t3 & C & (2, 3) \\\\\n", "\\end{tabular}\n" ], "text/plain": [ "3×2 DataFrame\n", "│ Row │ Group │ x1 │\n", "│ │ \u001b[90mString\u001b[39m │ \u001b[90mTuple…\u001b[39m │\n", "├─────┼────────┼────────┤\n", "│ 1 │ B │ (9, 3) │\n", "│ 2 │ A │ (4, 3) │\n", "│ 3 │ C │ (2, 3) │" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Grouping using by()\n", "by(df6, :Group, size)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Since the dataframe has three columns, we note that as the second value in the `count` tuple returned above. The first value shows the number of instances of the unique values found for the specified variable." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Below we create a dataframe instance that shows only the count of the unique values." ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [ { "data": { "text/html": [ "

3 rows × 2 columns

GroupCount
StringInt64
1B9
2A4
3C2
" ], "text/latex": [ "\\begin{tabular}{r|cc}\n", "\t& Group & Count\\\\\n", "\t\\hline\n", "\t& String & Int64\\\\\n", "\t\\hline\n", "\t1 & B & 9 \\\\\n", "\t2 & A & 4 \\\\\n", "\t3 & C & 2 \\\\\n", "\\end{tabular}\n" ], "text/plain": [ "3×2 DataFrame\n", "│ Row │ Group │ Count │\n", "│ │ \u001b[90mString\u001b[39m │ \u001b[90mInt64\u001b[39m │\n", "├─────┼────────┼───────┤\n", "│ 1 │ B │ 9 │\n", "│ 2 │ A │ 4 │\n", "│ 3 │ C │ 2 │" ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Count unique data point values in :Group column\n", "by(df6, :Group, dfc -> DataFrame(Count = size(dfc, 1)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `aggregate()` function also groups a dataframe by unique values for a specified column, but then provides the ability to list statistical tests required." ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "3×5 DataFrame\n", "│ Row │ Group │ Variable1_mean │ Variable2_mean │ Variable1_std │ Variable2_std │\n", "│ │ \u001b[90mString\u001b[39m │ \u001b[90mFloat64\u001b[39m │ \u001b[90mFloat64\u001b[39m │ \u001b[90mFloat64\u001b[39m │ \u001b[90mFloat64\u001b[39m │\n", "├─────┼────────┼────────────────┼────────────────┼───────────────┼───────────────┤\n", "│ 1 │ B │ 0.127675 │ 0.446397 │ 0.973237 │ 0.268476 │\n", "│ 2 │ A │ -0.33429 │ 0.339451 │ 1.04503 │ 0.352194 │\n", "│ 3 │ C │ -0.902111 │ 0.373207 │ 1.51729 │ 0.368007 │" ] } ], "source": [ "# Aggregate for descriptive statistics\n", "print(aggregate(df6, :Group, [mean, std]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `groupby()` function actually creates sub-dataframes based on the unique values found in the specified variable." ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [ { "data": { "text/html": [ "

GroupedDataFrame with 3 groups based on key: Group

First Group (9 rows): Group = \"B\"

GroupVariable1Variable2
StringFloat64Float64
1B0.4473580.137658
2B-0.3962110.60808
3B0.3667730.255054
4B0.6216730.498734
5B2.063530.52509
6B-1.414530.265511
7B0.1344750.110096
8B-0.7504210.834362
9B0.0764180.78299

Last Group (2 rows): Group = \"C\"

GroupVariable1Variable2
StringFloat64Float64
1C0.1707780.633427
2C-1.9750.112987
" ], "text/latex": [ "GroupedDataFrame with 3 groups based on key: Group\n", "\n", "First Group (9 rows): Group = \"B\"\n", "\n", "\\begin{tabular}{r|ccc}\n", "\t& Group & Variable1 & Variable2\\\\\n", "\t\\hline\n", "\t& String & Float64 & Float64\\\\\n", "\t\\hline\n", "\t1 & B & 0.447358 & 0.137658 \\\\\n", "\t2 & B & -0.396211 & 0.60808 \\\\\n", "\t3 & B & 0.366773 & 0.255054 \\\\\n", "\t4 & B & 0.621673 & 0.498734 \\\\\n", "\t5 & B & 2.06353 & 0.52509 \\\\\n", "\t6 & B & -1.41453 & 0.265511 \\\\\n", "\t7 & B & 0.134475 & 0.110096 \\\\\n", "\t8 & B & -0.750421 & 0.834362 \\\\\n", "\t9 & B & 0.076418 & 0.78299 \\\\\n", "\\end{tabular}\n", "\n", "$\\dots$\n", "\n", "Last Group (2 rows): Group = \"C\"\n", "\n", "\\begin{tabular}{r|ccc}\n", "\t& Group & Variable1 & Variable2\\\\\n", "\t\\hline\n", "\t& String & Float64 & Float64\\\\\n", "\t\\hline\n", "\t1 & C & 0.170778 & 0.633427 \\\\\n", "\t2 & C & -1.975 & 0.112987 \\\\\n", "\\end{tabular}\n" ], "text/plain": [ "GroupedDataFrame with 3 groups based on key: Group\n", "First Group (9 rows): Group = \"B\"\n", "│ Row │ Group │ Variable1 │ Variable2 │\n", "│ │ \u001b[90mString\u001b[39m │ \u001b[90mFloat64\u001b[39m │ \u001b[90mFloat64\u001b[39m │\n", "├─────┼────────┼───────────┼───────────┤\n", "│ 1 │ B │ 0.447358 │ 0.137658 │\n", "│ 2 │ B │ -0.396211 │ 0.60808 │\n", "│ 3 │ B │ 0.366773 │ 0.255054 │\n", "│ 4 │ B │ 0.621673 │ 0.498734 │\n", "│ 5 │ B │ 2.06353 │ 0.52509 │\n", "│ 6 │ B │ -1.41453 │ 0.265511 │\n", "│ 7 │ B │ 0.134475 │ 0.110096 │\n", "│ 8 │ B │ -0.750421 │ 0.834362 │\n", "│ 9 │ B │ 0.076418 │ 0.78299 │\n", "⋮\n", "Last Group (2 rows): Group = \"C\"\n", "│ Row │ Group │ Variable1 │ Variable2 │\n", "│ │ \u001b[90mString\u001b[39m │ \u001b[90mFloat64\u001b[39m │ \u001b[90mFloat64\u001b[39m │\n", "├─────┼────────┼───────────┼───────────┤\n", "│ 1 │ C │ 0.170778 │ 0.633427 │\n", "│ 2 │ C │ -1.975 │ 0.112987 │" ] }, "execution_count": 44, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Group\n", "groupby(df6, :Group)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "By calling the `length()` function, we note that there are indeed three sub-dataframes." ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "3" ] }, "execution_count": 45, "metadata": {}, "output_type": "execute_result" } ], "source": [ "length(groupby(df6, :Group))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Using indexing, we can select any of the three sub-dataframes." ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [ { "data": { "text/html": [ "

4 rows × 3 columns

GroupVariable1Variable2
StringFloat64Float64
1A0.1825880.0940369
2A-1.584920.337865
3A0.7993350.838042
4A-0.7341610.0878598
" ], "text/latex": [ "\\begin{tabular}{r|ccc}\n", "\t& Group & Variable1 & Variable2\\\\\n", "\t\\hline\n", "\t& String & Float64 & Float64\\\\\n", "\t\\hline\n", "\t1 & A & 0.182588 & 0.0940369 \\\\\n", "\t2 & A & -1.58492 & 0.337865 \\\\\n", "\t3 & A & 0.799335 & 0.838042 \\\\\n", "\t4 & A & -0.734161 & 0.0878598 \\\\\n", "\\end{tabular}\n" ], "text/plain": [ "4×3 SubDataFrame\n", "│ Row │ Group │ Variable1 │ Variable2 │\n", "│ │ \u001b[90mString\u001b[39m │ \u001b[90mFloat64\u001b[39m │ \u001b[90mFloat64\u001b[39m │\n", "├─────┼────────┼───────────┼───────────┤\n", "│ 1 │ A │ 0.182588 │ 0.0940369 │\n", "│ 2 │ A │ -1.58492 │ 0.337865 │\n", "│ 3 │ A │ 0.799335 │ 0.838042 │\n", "│ 4 │ A │ -0.734161 │ 0.0878598 │" ] }, "execution_count": 46, "metadata": {}, "output_type": "execute_result" } ], "source": [ "groupby(df6, :Group)[2]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[Back to the top](#In-this-lecture)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Sorting" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Sorting using the `sort!()` function (permanent bang version used here), does what is says on the box. A list can be provided to sort by more than one variable." ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [ { "data": { "text/html": [ "

7 rows × 3 columns

GroupVariable1Variable2
StringFloat64Float64
1A-1.584920.337865
2A-0.7341610.0878598
3A0.1825880.0940369
4A0.7993350.838042
5B-1.414530.265511
6B-0.7504210.834362
7B-0.3962110.60808
" ], "text/latex": [ "\\begin{tabular}{r|ccc}\n", "\t& Group & Variable1 & Variable2\\\\\n", "\t\\hline\n", "\t& String & Float64 & Float64\\\\\n", "\t\\hline\n", "\t1 & A & -1.58492 & 0.337865 \\\\\n", "\t2 & A & -0.734161 & 0.0878598 \\\\\n", "\t3 & A & 0.182588 & 0.0940369 \\\\\n", "\t4 & A & 0.799335 & 0.838042 \\\\\n", "\t5 & B & -1.41453 & 0.265511 \\\\\n", "\t6 & B & -0.750421 & 0.834362 \\\\\n", "\t7 & B & -0.396211 & 0.60808 \\\\\n", "\\end{tabular}\n" ], "text/plain": [ "7×3 DataFrame\n", "│ Row │ Group │ Variable1 │ Variable2 │\n", "│ │ \u001b[90mString\u001b[39m │ \u001b[90mFloat64\u001b[39m │ \u001b[90mFloat64\u001b[39m │\n", "├─────┼────────┼───────────┼───────────┤\n", "│ 1 │ A │ -1.58492 │ 0.337865 │\n", "│ 2 │ A │ -0.734161 │ 0.0878598 │\n", "│ 3 │ A │ 0.182588 │ 0.0940369 │\n", "│ 4 │ A │ 0.799335 │ 0.838042 │\n", "│ 5 │ B │ -1.41453 │ 0.265511 │\n", "│ 6 │ B │ -0.750421 │ 0.834362 │\n", "│ 7 │ B │ -0.396211 │ 0.60808 │" ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df6S = sort!(df6, [:Group, :Variable1]);\n", "first(df6S, 7)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[Back to the top](#In-this-lecture)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Unique rows only" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Below we create a dataframe with two identical rows." ] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [ { "data": { "text/html": [ "

6 rows × 3 columns

ABC
Int64Int64String
1111A
2212B
3212B
4313C
5414D
6515E
" ], "text/latex": [ "\\begin{tabular}{r|ccc}\n", "\t& A & B & C\\\\\n", "\t\\hline\n", "\t& Int64 & Int64 & String\\\\\n", "\t\\hline\n", "\t1 & 1 & 11 & A \\\\\n", "\t2 & 2 & 12 & B \\\\\n", "\t3 & 2 & 12 & B \\\\\n", "\t4 & 3 & 13 & C \\\\\n", "\t5 & 4 & 14 & D \\\\\n", "\t6 & 5 & 15 & E \\\\\n", "\\end{tabular}\n" ], "text/plain": [ "6×3 DataFrame\n", "│ Row │ A │ B │ C │\n", "│ │ \u001b[90mInt64\u001b[39m │ \u001b[90mInt64\u001b[39m │ \u001b[90mString\u001b[39m │\n", "├─────┼───────┼───────┼────────┤\n", "│ 1 │ 1 │ 11 │ A │\n", "│ 2 │ 2 │ 12 │ B │\n", "│ 3 │ 2 │ 12 │ B │\n", "│ 4 │ 3 │ 13 │ C │\n", "│ 5 │ 4 │ 14 │ D │\n", "│ 6 │ 5 │ 15 │ E │" ] }, "execution_count": 48, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Creating a DataFrame with an obvious duplicate row\n", "df7 = DataFrame(A = [1, 2, 2, 3, 4, 5], B = [11, 12, 12, 13, 14, 15], C = [\"A\", \"B\", \"B\", \"C\", \"D\", \"E\"]);\n", "df7" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `unique()` function will, as the name implies, delete the duplicate row." ] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [ { "data": { "text/html": [ "

5 rows × 3 columns

ABC
Int64Int64String
1111A
2212B
3313C
4414D
5515E
" ], "text/latex": [ "\\begin{tabular}{r|ccc}\n", "\t& A & B & C\\\\\n", "\t\\hline\n", "\t& Int64 & Int64 & String\\\\\n", "\t\\hline\n", "\t1 & 1 & 11 & A \\\\\n", "\t2 & 2 & 12 & B \\\\\n", "\t3 & 3 & 13 & C \\\\\n", "\t4 & 4 & 14 & D \\\\\n", "\t5 & 5 & 15 & E \\\\\n", "\\end{tabular}\n" ], "text/plain": [ "5×3 DataFrame\n", "│ Row │ A │ B │ C │\n", "│ │ \u001b[90mInt64\u001b[39m │ \u001b[90mInt64\u001b[39m │ \u001b[90mString\u001b[39m │\n", "├─────┼───────┼───────┼────────┤\n", "│ 1 │ 1 │ 11 │ A │\n", "│ 2 │ 2 │ 12 │ B │\n", "│ 3 │ 3 │ 13 │ C │\n", "│ 4 │ 4 │ 14 │ D │\n", "│ 5 │ 5 │ 15 │ E │" ] }, "execution_count": 49, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Only unique rows\n", "unique(df7)" ] }, { "cell_type": "code", "execution_count": 50, "metadata": {}, "outputs": [ { "data": { "text/html": [ "

6 rows × 3 columns

ABC
Int64Int64String
1111A
2212B
3212B
4313C
5414D
6515E
" ], "text/latex": [ "\\begin{tabular}{r|ccc}\n", "\t& A & B & C\\\\\n", "\t\\hline\n", "\t& Int64 & Int64 & String\\\\\n", "\t\\hline\n", "\t1 & 1 & 11 & A \\\\\n", "\t2 & 2 & 12 & B \\\\\n", "\t3 & 2 & 12 & B \\\\\n", "\t4 & 3 & 13 & C \\\\\n", "\t5 & 4 & 14 & D \\\\\n", "\t6 & 5 & 15 & E \\\\\n", "\\end{tabular}\n" ], "text/plain": [ "6×3 DataFrame\n", "│ Row │ A │ B │ C │\n", "│ │ \u001b[90mInt64\u001b[39m │ \u001b[90mInt64\u001b[39m │ \u001b[90mString\u001b[39m │\n", "├─────┼───────┼───────┼────────┤\n", "│ 1 │ 1 │ 11 │ A │\n", "│ 2 │ 2 │ 12 │ B │\n", "│ 3 │ 2 │ 12 │ B │\n", "│ 4 │ 3 │ 13 │ C │\n", "│ 5 │ 4 │ 14 │ D │\n", "│ 6 │ 5 │ 15 │ E │" ] }, "execution_count": 50, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df7" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As always, the bang will make the change permament." ] }, { "cell_type": "code", "execution_count": 51, "metadata": {}, "outputs": [ { "data": { "text/html": [ "

5 rows × 3 columns

ABC
Int64Int64String
1111A
2212B
3313C
4414D
5515E
" ], "text/latex": [ "\\begin{tabular}{r|ccc}\n", "\t& A & B & C\\\\\n", "\t\\hline\n", "\t& Int64 & Int64 & String\\\\\n", "\t\\hline\n", "\t1 & 1 & 11 & A \\\\\n", "\t2 & 2 & 12 & B \\\\\n", "\t3 & 3 & 13 & C \\\\\n", "\t4 & 4 & 14 & D \\\\\n", "\t5 & 5 & 15 & E \\\\\n", "\\end{tabular}\n" ], "text/plain": [ "5×3 DataFrame\n", "│ Row │ A │ B │ C │\n", "│ │ \u001b[90mInt64\u001b[39m │ \u001b[90mInt64\u001b[39m │ \u001b[90mString\u001b[39m │\n", "├─────┼───────┼───────┼────────┤\n", "│ 1 │ 1 │ 11 │ A │\n", "│ 2 │ 2 │ 12 │ B │\n", "│ 3 │ 3 │ 13 │ C │\n", "│ 4 │ 4 │ 14 │ D │\n", "│ 5 │ 5 │ 15 │ E │" ] }, "execution_count": 51, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Permanant change\n", "unique!(df7)\n", "df7" ] }, { "cell_type": "code", "execution_count": 52, "metadata": {}, "outputs": [ { "data": { "text/html": [ "

5 rows × 3 columns

ABC
Int64Int64String
1111A
2212B
3313C
4414D
5515E
" ], "text/latex": [ "\\begin{tabular}{r|ccc}\n", "\t& A & B & C\\\\\n", "\t\\hline\n", "\t& Int64 & Int64 & String\\\\\n", "\t\\hline\n", "\t1 & 1 & 11 & A \\\\\n", "\t2 & 2 & 12 & B \\\\\n", "\t3 & 3 & 13 & C \\\\\n", "\t4 & 4 & 14 & D \\\\\n", "\t5 & 5 & 15 & E \\\\\n", "\\end{tabular}\n" ], "text/plain": [ "5×3 DataFrame\n", "│ Row │ A │ B │ C │\n", "│ │ \u001b[90mInt64\u001b[39m │ \u001b[90mInt64\u001b[39m │ \u001b[90mString\u001b[39m │\n", "├─────┼───────┼───────┼────────┤\n", "│ 1 │ 1 │ 11 │ A │\n", "│ 2 │ 2 │ 12 │ B │\n", "│ 3 │ 3 │ 13 │ C │\n", "│ 4 │ 4 │ 14 │ D │\n", "│ 5 │ 5 │ 15 │ E │" ] }, "execution_count": 52, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Permanant change\n", "unique!(df7)\n", "df7" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[Back to the top](#In-this-lecture)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Deleting rows" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `deleterows!()` function (permanent bang version used here), deletes specified rows." ] }, { "cell_type": "code", "execution_count": 53, "metadata": {}, "outputs": [ { "data": { "text/html": [ "

3 rows × 3 columns

ABC
Int64Int64String
1212B
2313C
3414D
" ], "text/latex": [ "\\begin{tabular}{r|ccc}\n", "\t& A & B & C\\\\\n", "\t\\hline\n", "\t& Int64 & Int64 & String\\\\\n", "\t\\hline\n", "\t1 & 2 & 12 & B \\\\\n", "\t2 & 3 & 13 & C \\\\\n", "\t3 & 4 & 14 & D \\\\\n", "\\end{tabular}\n" ], "text/plain": [ "3×3 DataFrame\n", "│ Row │ A │ B │ C │\n", "│ │ \u001b[90mInt64\u001b[39m │ \u001b[90mInt64\u001b[39m │ \u001b[90mString\u001b[39m │\n", "├─────┼───────┼───────┼────────┤\n", "│ 1 │ 2 │ 12 │ B │\n", "│ 2 │ 3 │ 13 │ C │\n", "│ 3 │ 4 │ 14 │ D │" ] }, "execution_count": 53, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Permanently\n", "deleterows!(df7, [1, 5])\n", "df7" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[Back to the top](#In-this-lecture)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Julia 1.0.3", "language": "julia", "name": "julia-1.0" }, "language_info": { "file_extension": ".jl", "mimetype": "application/julia", "name": "julia", "version": "1.0.3" } }, "nbformat": 4, "nbformat_minor": 2 }