2128 lines
70 KiB
Plaintext
2128 lines
70 KiB
Plaintext
{
|
||
"cells": [
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"## Working with data"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"## In this lecture"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"- [Introduction](#Introduction)\n",
|
||
"- [Distributions](#Distributions)\n",
|
||
"- [Normal distribution](#Normal-distribution)\n",
|
||
"- [Other distributions](#Other-distributions)\n",
|
||
"- [DataFrames](#DataFrames)\n",
|
||
" - [Combining dataframes](#Combining-dataframes)\n",
|
||
" - [Grouping](#Grouping)\n",
|
||
" - [Sorting](#Sorting)\n",
|
||
" - [Unique rows only](#Unique-rows-only)\n",
|
||
" - [Deleting rows](#Deleting-rows)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"## Introduction"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"The ability to use data is fundamental to most modern computer coding taks. In this lecture, we will have a brief introduction to the way in which the Julia language incorporates data through the use of the `Distributions.jl` and `DataFrames.jl` packages."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"[Back to the top](#In-this-lecture)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"## Distributions"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"Data point values for a distribution usually follow a pattern. Such patterns are called distributions. Distributions are either discrete or continuous. The `Distribution.jl` package contains most of the common data distributions."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"We will also use the `Random.jl` package to seed the pseudo-random number generator so that we can reproduce the random values that we are going to use in the lecture."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 1,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"using Distributions\n",
|
||
"using Random"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"[Back to the top](#In-this-lecture)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"### The normal distribution"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"The normal distribution is the famous bell-shaped curve that we are familiar with. Values around the mean occur most frequently and as values get progressively further away from the mean, they occur less frequently."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 2,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"(0.0, 1.0)"
|
||
]
|
||
},
|
||
"execution_count": 2,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"#Seed the pseudo-random number generator\n",
|
||
"Random.seed!(1234)\n",
|
||
"#Saving the standard normal distribution as an object\n",
|
||
"n = Distributions.Normal() # This function is from the Distributions package\n",
|
||
"#Parameter values of the standard normal distribution\n",
|
||
"params(n)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"Using the `params()` function, we note a mean on $0$ and a standard deviation of $1$, also called the _standard normal distribution_."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"The `fieldnames()` function provides the actual parameters of the given distribution. In the case of the normal distribution, it will be the average and the standard deviation, namely $\\mu$ and $\\sigma$."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 3,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"(:μ, :σ)"
|
||
]
|
||
},
|
||
"execution_count": 3,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"# Returning the parameters of the normal distribution\n",
|
||
"fieldnames(Normal)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"Now we create a variable called `var1` and use the `rand()` function to create select $10$ random values from the standard normal distribution."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 4,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"#Seed the pseudo-random number generator\n",
|
||
"Random.seed!(1234)\n",
|
||
"#Select 10 elements at random from n\n",
|
||
"var1 = rand(n, 10);"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"We can calculate the average and standard deviation of our randomly selected values."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 5,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"0.18909179133831322"
|
||
]
|
||
},
|
||
"execution_count": 5,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"# Average\n",
|
||
"mean(var1)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 6,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"0.9879593623730926"
|
||
]
|
||
},
|
||
"execution_count": 6,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"# Standard deviation\n",
|
||
"std(var1)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"The `pdf()` calculates the probability density function value of a given distribution up until a specified point (from $- \\infty$)."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 7,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"0.38138781546052414"
|
||
]
|
||
},
|
||
"execution_count": 7,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"#Probability density function value at x = 0.3\n",
|
||
"pdf(Normal(), 0.3)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"The `cdf()` functions calculates the cummulative distribution function value of a given distribution up until a specified point (from $- \\infty$)."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 8,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"0.5987063256829237"
|
||
]
|
||
},
|
||
"execution_count": 8,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"#Cumulative distribution function as x = 0.25\n",
|
||
"cdf(Normal(), 0.25)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"The values for the average and standard deviation can be specified."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 9,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"# Creating 100 data point values from a normal distribution\n",
|
||
"# with a mean of 100 and a standard deviation of 10\n",
|
||
"Random.seed!(1234)\n",
|
||
"var2 = rand(Normal(100, 10), 100);"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 10,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"98.52365657772843"
|
||
]
|
||
},
|
||
"execution_count": 10,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"# Calculating the mean of var2\n",
|
||
"mean(var2)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 11,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"9.580963685859091"
|
||
]
|
||
},
|
||
"execution_count": 11,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"#Calculating the standard deviation of var2\n",
|
||
"std(var2)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"The parameters of a set of values for a specified distribution can be returned."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 12,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"Normal{Float64}(μ=98.52365657772843, σ=9.532938502804532)"
|
||
]
|
||
},
|
||
"execution_count": 12,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"#Using fit() to calculate the parameters of a distribution\n",
|
||
"fit(Normal, var2)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"The `quantiles()` function provides us with values for the specific percentiles (provided as fractions). Below we calculate the $2.5$% and $97.5$% percentile values of the standard normal distribution."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 13,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"-1.9599639845400592"
|
||
]
|
||
},
|
||
"execution_count": 13,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"# Quantiles\n",
|
||
"quantile(Normal(), 0.025)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 14,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"1.9599639845400576"
|
||
]
|
||
},
|
||
"execution_count": 14,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"quantile(Normal(), 0.975)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"[Back to the top](#In-this-lecture)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"### Other distributions"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"There are many distributions in the `Distribution().jl` package. In the code below, a few of these are showcased by way of setting parameters, selecting random values, and fitting those value back to the distribution or returning the parameter field names."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 15,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"Beta{Float64}(α=1.236721159927394, β=1.1368118923305863)"
|
||
]
|
||
},
|
||
"execution_count": 15,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"# Beta distribution\n",
|
||
"b = Beta(1, 1)\n",
|
||
"params(b)\n",
|
||
"Random.seed!(1234)\n",
|
||
"var3 = rand(b, 100);\n",
|
||
"fit(Beta, var3)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 16,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"(:ν,)"
|
||
]
|
||
},
|
||
"execution_count": 16,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"# χ2 distribution\n",
|
||
"c = Chisq(1)\n",
|
||
"Random.seed!(1234)\n",
|
||
"var4 = rand(c, 100)\n",
|
||
"fieldnames(Chisq) # Degrees of freedom"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"[Back to the top](#In-this-lecture)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"## Dataframes"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"The `Dataframes.jl` package allows for creation of a flat data structure (rows and columns). Columns are variables and rows are subjects (examples)."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 17,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"using DataFrames"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"Below, we create an empty dataframe object that we call `df`."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 18,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"#Create and empty DataFrame\n",
|
||
"df = DataFrame();"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"Column headers representing statistical variable names are entered in square brackets as symbols, i.e. preceeded with a colon. We will attach the `var2` set of values as data point entries for this statistical variables."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 19,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"# Add a column with data point values (rows)\n",
|
||
"df[:Var2] = var2;"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"We can specify to print the first $5$ rows to the screen with the `first()` function,"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 20,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/html": [
|
||
"<table class=\"data-frame\"><thead><tr><th></th><th>Var2</th></tr><tr><th></th><th>Float64</th></tr></thead><tbody><p>5 rows × 1 columns</p><tr><th>1</th><td>108.673</td></tr><tr><th>2</th><td>90.9826</td></tr><tr><th>3</th><td>95.0552</td></tr><tr><th>4</th><td>90.9709</td></tr><tr><th>5</th><td>108.644</td></tr></tbody></table>"
|
||
],
|
||
"text/latex": [
|
||
"\\begin{tabular}{r|c}\n",
|
||
"\t& Var2\\\\\n",
|
||
"\t\\hline\n",
|
||
"\t& Float64\\\\\n",
|
||
"\t\\hline\n",
|
||
"\t1 & 108.673 \\\\\n",
|
||
"\t2 & 90.9826 \\\\\n",
|
||
"\t3 & 95.0552 \\\\\n",
|
||
"\t4 & 90.9709 \\\\\n",
|
||
"\t5 & 108.644 \\\\\n",
|
||
"\\end{tabular}\n"
|
||
],
|
||
"text/plain": [
|
||
"5×1 DataFrame\n",
|
||
"│ Row │ Var2 │\n",
|
||
"│ │ \u001b[90mFloat64\u001b[39m │\n",
|
||
"├─────┼─────────┤\n",
|
||
"│ 1 │ 108.673 │\n",
|
||
"│ 2 │ 90.9826 │\n",
|
||
"│ 3 │ 95.0552 │\n",
|
||
"│ 4 │ 90.9709 │\n",
|
||
"│ 5 │ 108.644 │"
|
||
]
|
||
},
|
||
"execution_count": 20,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"#View first five rows\n",
|
||
"first(df, 5)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"Below, we create another statistical variable with some data point values that we already have in the waiting."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 21,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"# Add another column\n",
|
||
"df[:Var3] = var3;"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"The `last()` functions shows the last specified rows."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 22,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/html": [
|
||
"<table class=\"data-frame\"><thead><tr><th></th><th>Var2</th><th>Var3</th></tr><tr><th></th><th>Float64</th><th>Float64</th></tr></thead><tbody><p>3 rows × 2 columns</p><tr><th>1</th><td>95.5675</td><td>0.831916</td></tr><tr><th>2</th><td>83.3677</td><td>0.221771</td></tr><tr><th>3</th><td>94.7877</td><td>0.655592</td></tr></tbody></table>"
|
||
],
|
||
"text/latex": [
|
||
"\\begin{tabular}{r|cc}\n",
|
||
"\t& Var2 & Var3\\\\\n",
|
||
"\t\\hline\n",
|
||
"\t& Float64 & Float64\\\\\n",
|
||
"\t\\hline\n",
|
||
"\t1 & 95.5675 & 0.831916 \\\\\n",
|
||
"\t2 & 83.3677 & 0.221771 \\\\\n",
|
||
"\t3 & 94.7877 & 0.655592 \\\\\n",
|
||
"\\end{tabular}\n"
|
||
],
|
||
"text/plain": [
|
||
"3×2 DataFrame\n",
|
||
"│ Row │ Var2 │ Var3 │\n",
|
||
"│ │ \u001b[90mFloat64\u001b[39m │ \u001b[90mFloat64\u001b[39m │\n",
|
||
"├─────┼─────────┼──────────┤\n",
|
||
"│ 1 │ 95.5675 │ 0.831916 │\n",
|
||
"│ 2 │ 83.3677 │ 0.221771 │\n",
|
||
"│ 3 │ 94.7877 │ 0.655592 │"
|
||
]
|
||
},
|
||
"execution_count": 22,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"# View last three rows\n",
|
||
"last(df, 3)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"The `size()` function returns a tuple with the number of rows and columns returned,"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 23,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"(100, 2)"
|
||
]
|
||
},
|
||
"execution_count": 23,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"# Dimensions of a DataFrame\n",
|
||
"size(df)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"The `describe()` functions attemps tp provide summary statistics of the variables>"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 24,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/html": [
|
||
"<table class=\"data-frame\"><thead><tr><th></th><th>variable</th><th>mean</th><th>min</th><th>median</th><th>max</th><th>nunique</th><th>nmissing</th><th>eltype</th></tr><tr><th></th><th>Symbol</th><th>Float64</th><th>Float64</th><th>Float64</th><th>Float64</th><th>Nothing</th><th>Nothing</th><th>DataType</th></tr></thead><tbody><p>2 rows × 8 columns</p><tr><th>1</th><td>Var2</td><td>98.5237</td><td>67.8864</td><td>98.1718</td><td>124.175</td><td></td><td></td><td>Float64</td></tr><tr><th>2</th><td>Var3</td><td>0.521047</td><td>0.00145384</td><td>0.522808</td><td>0.971161</td><td></td><td></td><td>Float64</td></tr></tbody></table>"
|
||
],
|
||
"text/latex": [
|
||
"\\begin{tabular}{r|cccccccc}\n",
|
||
"\t& variable & mean & min & median & max & nunique & nmissing & eltype\\\\\n",
|
||
"\t\\hline\n",
|
||
"\t& Symbol & Float64 & Float64 & Float64 & Float64 & Nothing & Nothing & DataType\\\\\n",
|
||
"\t\\hline\n",
|
||
"\t1 & Var2 & 98.5237 & 67.8864 & 98.1718 & 124.175 & & & Float64 \\\\\n",
|
||
"\t2 & Var3 & 0.521047 & 0.00145384 & 0.522808 & 0.971161 & & & Float64 \\\\\n",
|
||
"\\end{tabular}\n"
|
||
],
|
||
"text/plain": [
|
||
"2×8 DataFrame. Omitted printing of 2 columns\n",
|
||
"│ Row │ variable │ mean │ min │ median │ max │ nunique │\n",
|
||
"│ │ \u001b[90mSymbol\u001b[39m │ \u001b[90mFloat64\u001b[39m │ \u001b[90mFloat64\u001b[39m │ \u001b[90mFloat64\u001b[39m │ \u001b[90mFloat64\u001b[39m │ \u001b[90mNothing\u001b[39m │\n",
|
||
"├─────┼──────────┼──────────┼────────────┼──────────┼──────────┼─────────┤\n",
|
||
"│ 1 │ Var2 │ 98.5237 │ 67.8864 │ 98.1718 │ 124.175 │ │\n",
|
||
"│ 2 │ Var3 │ 0.521047 │ 0.00145384 │ 0.522808 │ 0.971161 │ │"
|
||
]
|
||
},
|
||
"execution_count": 24,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"# Summarize the content\n",
|
||
"describe(df)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"The data type for each variable can be returned."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 25,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"2-element Array{DataType,1}:\n",
|
||
" Float64\n",
|
||
" Float64"
|
||
]
|
||
},
|
||
"execution_count": 25,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"# Data type only\n",
|
||
"eltypes(df)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"Below we create a new instance of a dataframe object called `df2`. It contains four statistical variables. Note the use of symbol notation in creating the names of these variables."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 26,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"# 6 Create a bigger DataFrame\n",
|
||
"df2 = DataFrame()\n",
|
||
"df2[:A] = 1:10\n",
|
||
"df2[:B] = [\"I\", \"II\", \"II\", \"I\", \"II\",\"I\", \"II\", \"II\", \"I\", \"II\"]\n",
|
||
"Random.seed!(1234)\n",
|
||
"df2[:C] = rand(Normal(), 10)\n",
|
||
"df2[:D] = rand(Chisq(1), 10);"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"By using indexing (in square brackets), we can refer to row and column values (i.e. _row, column_). Below is an example of seleting data point values for rows one through three, showing all the columns. The colon symbol serves as shortcut syntax for this selection."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 27,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/html": [
|
||
"<table class=\"data-frame\"><thead><tr><th></th><th>A</th><th>B</th><th>C</th><th>D</th></tr><tr><th></th><th>Int64</th><th>String</th><th>Float64</th><th>Float64</th></tr></thead><tbody><p>3 rows × 4 columns</p><tr><th>1</th><td>1</td><td>I</td><td>0.867347</td><td>0.0123688</td></tr><tr><th>2</th><td>2</td><td>II</td><td>-0.901744</td><td>0.213586</td></tr><tr><th>3</th><td>3</td><td>II</td><td>-0.494479</td><td>0.00899443</td></tr></tbody></table>"
|
||
],
|
||
"text/latex": [
|
||
"\\begin{tabular}{r|cccc}\n",
|
||
"\t& A & B & C & D\\\\\n",
|
||
"\t\\hline\n",
|
||
"\t& Int64 & String & Float64 & Float64\\\\\n",
|
||
"\t\\hline\n",
|
||
"\t1 & 1 & I & 0.867347 & 0.0123688 \\\\\n",
|
||
"\t2 & 2 & II & -0.901744 & 0.213586 \\\\\n",
|
||
"\t3 & 3 & II & -0.494479 & 0.00899443 \\\\\n",
|
||
"\\end{tabular}\n"
|
||
],
|
||
"text/plain": [
|
||
"3×4 DataFrame\n",
|
||
"│ Row │ A │ B │ C │ D │\n",
|
||
"│ │ \u001b[90mInt64\u001b[39m │ \u001b[90mString\u001b[39m │ \u001b[90mFloat64\u001b[39m │ \u001b[90mFloat64\u001b[39m │\n",
|
||
"├─────┼───────┼────────┼───────────┼────────────┤\n",
|
||
"│ 1 │ 1 │ I │ 0.867347 │ 0.0123688 │\n",
|
||
"│ 2 │ 2 │ II │ -0.901744 │ 0.213586 │\n",
|
||
"│ 3 │ 3 │ II │ -0.494479 │ 0.00899443 │"
|
||
]
|
||
},
|
||
"execution_count": 27,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"# First three rows with all the colums\n",
|
||
"df2[1:3, :]"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"If only specified columns, that is to say, not the range of one, two, and three as we did above, but rather only colums one and three, we create a list to indicate this."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 28,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/html": [
|
||
"<table class=\"data-frame\"><thead><tr><th></th><th>A</th><th>C</th></tr><tr><th></th><th>Int64</th><th>Float64</th></tr></thead><tbody><p>10 rows × 2 columns</p><tr><th>1</th><td>1</td><td>0.867347</td></tr><tr><th>2</th><td>2</td><td>-0.901744</td></tr><tr><th>3</th><td>3</td><td>-0.494479</td></tr><tr><th>4</th><td>4</td><td>-0.902914</td></tr><tr><th>5</th><td>5</td><td>0.864401</td></tr><tr><th>6</th><td>6</td><td>2.21188</td></tr><tr><th>7</th><td>7</td><td>0.532813</td></tr><tr><th>8</th><td>8</td><td>-0.271735</td></tr><tr><th>9</th><td>9</td><td>0.502334</td></tr><tr><th>10</th><td>10</td><td>-0.516984</td></tr></tbody></table>"
|
||
],
|
||
"text/latex": [
|
||
"\\begin{tabular}{r|cc}\n",
|
||
"\t& A & C\\\\\n",
|
||
"\t\\hline\n",
|
||
"\t& Int64 & Float64\\\\\n",
|
||
"\t\\hline\n",
|
||
"\t1 & 1 & 0.867347 \\\\\n",
|
||
"\t2 & 2 & -0.901744 \\\\\n",
|
||
"\t3 & 3 & -0.494479 \\\\\n",
|
||
"\t4 & 4 & -0.902914 \\\\\n",
|
||
"\t5 & 5 & 0.864401 \\\\\n",
|
||
"\t6 & 6 & 2.21188 \\\\\n",
|
||
"\t7 & 7 & 0.532813 \\\\\n",
|
||
"\t8 & 8 & -0.271735 \\\\\n",
|
||
"\t9 & 9 & 0.502334 \\\\\n",
|
||
"\t10 & 10 & -0.516984 \\\\\n",
|
||
"\\end{tabular}\n"
|
||
],
|
||
"text/plain": [
|
||
"10×2 DataFrame\n",
|
||
"│ Row │ A │ C │\n",
|
||
"│ │ \u001b[90mInt64\u001b[39m │ \u001b[90mFloat64\u001b[39m │\n",
|
||
"├─────┼───────┼───────────┤\n",
|
||
"│ 1 │ 1 │ 0.867347 │\n",
|
||
"│ 2 │ 2 │ -0.901744 │\n",
|
||
"│ 3 │ 3 │ -0.494479 │\n",
|
||
"│ 4 │ 4 │ -0.902914 │\n",
|
||
"│ 5 │ 5 │ 0.864401 │\n",
|
||
"│ 6 │ 6 │ 2.21188 │\n",
|
||
"│ 7 │ 7 │ 0.532813 │\n",
|
||
"│ 8 │ 8 │ -0.271735 │\n",
|
||
"│ 9 │ 9 │ 0.502334 │\n",
|
||
"│ 10 │ 10 │ -0.516984 │"
|
||
]
|
||
},
|
||
"execution_count": 28,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"# All rows columns 1 and 3\n",
|
||
"df2[:, [1, 3]]"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"Instead of indicating the column numbers, we can also reference the actual column names (statistical variable names), using symbol notation, i.e. `:A`."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 29,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/html": [
|
||
"<table class=\"data-frame\"><thead><tr><th></th><th>A</th><th>C</th></tr><tr><th></th><th>Int64</th><th>Float64</th></tr></thead><tbody><p>10 rows × 2 columns</p><tr><th>1</th><td>1</td><td>0.867347</td></tr><tr><th>2</th><td>2</td><td>-0.901744</td></tr><tr><th>3</th><td>3</td><td>-0.494479</td></tr><tr><th>4</th><td>4</td><td>-0.902914</td></tr><tr><th>5</th><td>5</td><td>0.864401</td></tr><tr><th>6</th><td>6</td><td>2.21188</td></tr><tr><th>7</th><td>7</td><td>0.532813</td></tr><tr><th>8</th><td>8</td><td>-0.271735</td></tr><tr><th>9</th><td>9</td><td>0.502334</td></tr><tr><th>10</th><td>10</td><td>-0.516984</td></tr></tbody></table>"
|
||
],
|
||
"text/latex": [
|
||
"\\begin{tabular}{r|cc}\n",
|
||
"\t& A & C\\\\\n",
|
||
"\t\\hline\n",
|
||
"\t& Int64 & Float64\\\\\n",
|
||
"\t\\hline\n",
|
||
"\t1 & 1 & 0.867347 \\\\\n",
|
||
"\t2 & 2 & -0.901744 \\\\\n",
|
||
"\t3 & 3 & -0.494479 \\\\\n",
|
||
"\t4 & 4 & -0.902914 \\\\\n",
|
||
"\t5 & 5 & 0.864401 \\\\\n",
|
||
"\t6 & 6 & 2.21188 \\\\\n",
|
||
"\t7 & 7 & 0.532813 \\\\\n",
|
||
"\t8 & 8 & -0.271735 \\\\\n",
|
||
"\t9 & 9 & 0.502334 \\\\\n",
|
||
"\t10 & 10 & -0.516984 \\\\\n",
|
||
"\\end{tabular}\n"
|
||
],
|
||
"text/plain": [
|
||
"10×2 DataFrame\n",
|
||
"│ Row │ A │ C │\n",
|
||
"│ │ \u001b[90mInt64\u001b[39m │ \u001b[90mFloat64\u001b[39m │\n",
|
||
"├─────┼───────┼───────────┤\n",
|
||
"│ 1 │ 1 │ 0.867347 │\n",
|
||
"│ 2 │ 2 │ -0.901744 │\n",
|
||
"│ 3 │ 3 │ -0.494479 │\n",
|
||
"│ 4 │ 4 │ -0.902914 │\n",
|
||
"│ 5 │ 5 │ 0.864401 │\n",
|
||
"│ 6 │ 6 │ 2.21188 │\n",
|
||
"│ 7 │ 7 │ 0.532813 │\n",
|
||
"│ 8 │ 8 │ -0.271735 │\n",
|
||
"│ 9 │ 9 │ 0.502334 │\n",
|
||
"│ 10 │ 10 │ -0.516984 │"
|
||
]
|
||
},
|
||
"execution_count": 29,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"# Different notation\n",
|
||
"df2[:, [:A, :C]]"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"The `CSV.jl` package's `read()` function can import a comma separated values data file."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 30,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"# Make sure to install the package in the REPL first\n",
|
||
"using CSV"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"The file is saved in the same directory / folder as this notebook file."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 31,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"# Import csv file (in same directory / folder)\n",
|
||
"data1 = CSV.read(\"CCS.csv\");"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"Using the `type()` function, we note that we now have an instance of a dataframe object."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 32,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"DataFrame"
|
||
]
|
||
},
|
||
"execution_count": 32,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"typeof(data1)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"Let's view the first five rows of data."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 33,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/html": [
|
||
"<table class=\"data-frame\"><thead><tr><th></th><th>PatientID</th><th>Cat1</th><th>Cat2</th><th>Var1</th><th>Var2</th><th>Var3</th></tr><tr><th></th><th>Int64⍰</th><th>String⍰</th><th>String⍰</th><th>Float64⍰</th><th>Float64⍰</th><th>Float64⍰</th></tr></thead><tbody><p>5 rows × 6 columns</p><tr><th>1</th><td>1</td><td>A</td><td>C</td><td>38.2568</td><td>5.93913</td><td>35.0579</td></tr><tr><th>2</th><td>2</td><td>A</td><td>C</td><td>17.8317</td><td>5.34754</td><td>21.131</td></tr><tr><th>3</th><td>8</td><td>A</td><td>B</td><td>16.0218</td><td>6.60709</td><td>60.9436</td></tr><tr><th>4</th><td>9</td><td>A</td><td>C</td><td>45.1158</td><td>6.00733</td><td>21.8797</td></tr><tr><th>5</th><td>16</td><td>A</td><td>C</td><td>20.448</td><td>8.54819</td><td>20.6623</td></tr></tbody></table>"
|
||
],
|
||
"text/latex": [
|
||
"\\begin{tabular}{r|cccccc}\n",
|
||
"\t& PatientID & Cat1 & Cat2 & Var1 & Var2 & Var3\\\\\n",
|
||
"\t\\hline\n",
|
||
"\t& Int64⍰ & String⍰ & String⍰ & Float64⍰ & Float64⍰ & Float64⍰\\\\\n",
|
||
"\t\\hline\n",
|
||
"\t1 & 1 & A & C & 38.2568 & 5.93913 & 35.0579 \\\\\n",
|
||
"\t2 & 2 & A & C & 17.8317 & 5.34754 & 21.131 \\\\\n",
|
||
"\t3 & 8 & A & B & 16.0218 & 6.60709 & 60.9436 \\\\\n",
|
||
"\t4 & 9 & A & C & 45.1158 & 6.00733 & 21.8797 \\\\\n",
|
||
"\t5 & 16 & A & C & 20.448 & 8.54819 & 20.6623 \\\\\n",
|
||
"\\end{tabular}\n"
|
||
],
|
||
"text/plain": [
|
||
"5×6 DataFrame\n",
|
||
"│ Row │ PatientID │ Cat1 │ Cat2 │ Var1 │ Var2 │ Var3 │\n",
|
||
"│ │ \u001b[90mInt64⍰\u001b[39m │ \u001b[90mString⍰\u001b[39m │ \u001b[90mString⍰\u001b[39m │ \u001b[90mFloat64⍰\u001b[39m │ \u001b[90mFloat64⍰\u001b[39m │ \u001b[90mFloat64⍰\u001b[39m │\n",
|
||
"├─────┼───────────┼─────────┼─────────┼──────────┼──────────┼──────────┤\n",
|
||
"│ 1 │ 1 │ A │ C │ 38.2568 │ 5.93913 │ 35.0579 │\n",
|
||
"│ 2 │ 2 │ A │ C │ 17.8317 │ 5.34754 │ 21.131 │\n",
|
||
"│ 3 │ 8 │ A │ B │ 16.0218 │ 6.60709 │ 60.9436 │\n",
|
||
"│ 4 │ 9 │ A │ C │ 45.1158 │ 6.00733 │ 21.8797 │\n",
|
||
"│ 5 │ 16 │ A │ C │ 20.448 │ 8.54819 │ 20.6623 │"
|
||
]
|
||
},
|
||
"execution_count": 33,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"first(data1, 5)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"The `describe()` function will attempt to summarize all the variables. In the case of categorical variables, an alphabetical arrangement for minimum and maximum values will be stated."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 34,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/html": [
|
||
"<table class=\"data-frame\"><thead><tr><th></th><th>variable</th><th>mean</th><th>min</th><th>median</th><th>max</th><th>nunique</th><th>nmissing</th><th>eltype</th></tr><tr><th></th><th>Symbol</th><th>Union…</th><th>Any</th><th>Union…</th><th>Any</th><th>Union…</th><th>Int64</th><th>DataType</th></tr></thead><tbody><p>6 rows × 8 columns</p><tr><th>1</th><td>PatientID</td><td>60.5</td><td>1</td><td>60.5</td><td>120</td><td></td><td>0</td><td>Int64</td></tr><tr><th>2</th><td>Cat1</td><td></td><td>A</td><td></td><td>B</td><td>2</td><td>0</td><td>String</td></tr><tr><th>3</th><td>Cat2</td><td></td><td>B</td><td></td><td>X</td><td>6</td><td>0</td><td>String</td></tr><tr><th>4</th><td>Var1</td><td>27.9679</td><td>15.2356</td><td>22.6801</td><td>84.2378</td><td></td><td>0</td><td>Float64</td></tr><tr><th>5</th><td>Var2</td><td>5.92121</td><td>3.01173</td><td>5.64241</td><td>15.5826</td><td></td><td>0</td><td>Float64</td></tr><tr><th>6</th><td>Var3</td><td>51.95</td><td>20.3153</td><td>44.3042</td><td>147.397</td><td></td><td>0</td><td>Float64</td></tr></tbody></table>"
|
||
],
|
||
"text/latex": [
|
||
"\\begin{tabular}{r|cccccccc}\n",
|
||
"\t& variable & mean & min & median & max & nunique & nmissing & eltype\\\\\n",
|
||
"\t\\hline\n",
|
||
"\t& Symbol & Union… & Any & Union… & Any & Union… & Int64 & DataType\\\\\n",
|
||
"\t\\hline\n",
|
||
"\t1 & PatientID & 60.5 & 1 & 60.5 & 120 & & 0 & Int64 \\\\\n",
|
||
"\t2 & Cat1 & & A & & B & 2 & 0 & String \\\\\n",
|
||
"\t3 & Cat2 & & B & & X & 6 & 0 & String \\\\\n",
|
||
"\t4 & Var1 & 27.9679 & 15.2356 & 22.6801 & 84.2378 & & 0 & Float64 \\\\\n",
|
||
"\t5 & Var2 & 5.92121 & 3.01173 & 5.64241 & 15.5826 & & 0 & Float64 \\\\\n",
|
||
"\t6 & Var3 & 51.95 & 20.3153 & 44.3042 & 147.397 & & 0 & Float64 \\\\\n",
|
||
"\\end{tabular}\n"
|
||
],
|
||
"text/plain": [
|
||
"6×8 DataFrame. Omitted printing of 1 columns\n",
|
||
"│ Row │ variable │ mean │ min │ median │ max │ nunique │ nmissing │\n",
|
||
"│ │ \u001b[90mSymbol\u001b[39m │ \u001b[90mUnion…\u001b[39m │ \u001b[90mAny\u001b[39m │ \u001b[90mUnion…\u001b[39m │ \u001b[90mAny\u001b[39m │ \u001b[90mUnion…\u001b[39m │ \u001b[90mInt64\u001b[39m │\n",
|
||
"├─────┼───────────┼─────────┼─────────┼─────────┼─────────┼─────────┼──────────┤\n",
|
||
"│ 1 │ PatientID │ 60.5 │ 1 │ 60.5 │ 120 │ │ 0 │\n",
|
||
"│ 2 │ Cat1 │ │ A │ │ B │ 2 │ 0 │\n",
|
||
"│ 3 │ Cat2 │ │ B │ │ X │ 6 │ 0 │\n",
|
||
"│ 4 │ Var1 │ 27.9679 │ 15.2356 │ 22.6801 │ 84.2378 │ │ 0 │\n",
|
||
"│ 5 │ Var2 │ 5.92121 │ 3.01173 │ 5.64241 │ 15.5826 │ │ 0 │\n",
|
||
"│ 6 │ Var3 │ 51.95 │ 20.3153 │ 44.3042 │ 147.397 │ │ 0 │"
|
||
]
|
||
},
|
||
"execution_count": 34,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"describe(data1)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"[Back to the top](#In-this-lecture)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"### Combining dataframes"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"Combining dataframes on a common variable is a very useful operation. Below we create two dataframe instances. Note that both have a `Number` variable."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 35,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"# Creating DataFrames\n",
|
||
"subjects = DataFrame(Number = [100, 101, 102, 103], Stage = [\"I\", \"III\", \"II\", \"I\"])\n",
|
||
"treatment = DataFrame(Number = [103, 102, 101, 100], Treatment = [\"A\", \"B\", \"A\", \"B\"]);"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"The `join()` function takes the dataframe objects that require joining as arguments. The `on =` argument (in symbol form), specifies the variable on which to join. In this default mode, only values for the stated variable that appear in both dataframes will be included."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 36,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/html": [
|
||
"<table class=\"data-frame\"><thead><tr><th></th><th>Number</th><th>Stage</th><th>Treatment</th></tr><tr><th></th><th>Int64</th><th>String</th><th>String</th></tr></thead><tbody><p>4 rows × 3 columns</p><tr><th>1</th><td>100</td><td>I</td><td>B</td></tr><tr><th>2</th><td>101</td><td>III</td><td>A</td></tr><tr><th>3</th><td>102</td><td>II</td><td>B</td></tr><tr><th>4</th><td>103</td><td>I</td><td>A</td></tr></tbody></table>"
|
||
],
|
||
"text/latex": [
|
||
"\\begin{tabular}{r|ccc}\n",
|
||
"\t& Number & Stage & Treatment\\\\\n",
|
||
"\t\\hline\n",
|
||
"\t& Int64 & String & String\\\\\n",
|
||
"\t\\hline\n",
|
||
"\t1 & 100 & I & B \\\\\n",
|
||
"\t2 & 101 & III & A \\\\\n",
|
||
"\t3 & 102 & II & B \\\\\n",
|
||
"\t4 & 103 & I & A \\\\\n",
|
||
"\\end{tabular}\n"
|
||
],
|
||
"text/plain": [
|
||
"4×3 DataFrame\n",
|
||
"│ Row │ Number │ Stage │ Treatment │\n",
|
||
"│ │ \u001b[90mInt64\u001b[39m │ \u001b[90mString\u001b[39m │ \u001b[90mString\u001b[39m │\n",
|
||
"├─────┼────────┼────────┼───────────┤\n",
|
||
"│ 1 │ 100 │ I │ B │\n",
|
||
"│ 2 │ 101 │ III │ A │\n",
|
||
"│ 3 │ 102 │ II │ B │\n",
|
||
"│ 4 │ 103 │ I │ A │"
|
||
]
|
||
},
|
||
"execution_count": 36,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"#Joining\n",
|
||
"df3 = join(subjects, treatment, on = :Number);\n",
|
||
"df3"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 37,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"# Adding a longer list of subjects\n",
|
||
"subjects = DataFrame(Number = [100, 101, 102, 103, 104, 105], Stage = [\"I\", \"III\", \"II\", \"I\", \"II\", \"II\"]);"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"The `kind =` argument allows for more control. An inner join is the default (same as above)."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 38,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/html": [
|
||
"<table class=\"data-frame\"><thead><tr><th></th><th>Number</th><th>Stage</th><th>Treatment</th></tr><tr><th></th><th>Int64</th><th>String</th><th>String</th></tr></thead><tbody><p>4 rows × 3 columns</p><tr><th>1</th><td>100</td><td>I</td><td>B</td></tr><tr><th>2</th><td>101</td><td>III</td><td>A</td></tr><tr><th>3</th><td>102</td><td>II</td><td>B</td></tr><tr><th>4</th><td>103</td><td>I</td><td>A</td></tr></tbody></table>"
|
||
],
|
||
"text/latex": [
|
||
"\\begin{tabular}{r|ccc}\n",
|
||
"\t& Number & Stage & Treatment\\\\\n",
|
||
"\t\\hline\n",
|
||
"\t& Int64 & String & String\\\\\n",
|
||
"\t\\hline\n",
|
||
"\t1 & 100 & I & B \\\\\n",
|
||
"\t2 & 101 & III & A \\\\\n",
|
||
"\t3 & 102 & II & B \\\\\n",
|
||
"\t4 & 103 & I & A \\\\\n",
|
||
"\\end{tabular}\n"
|
||
],
|
||
"text/plain": [
|
||
"4×3 DataFrame\n",
|
||
"│ Row │ Number │ Stage │ Treatment │\n",
|
||
"│ │ \u001b[90mInt64\u001b[39m │ \u001b[90mString\u001b[39m │ \u001b[90mString\u001b[39m │\n",
|
||
"├─────┼────────┼────────┼───────────┤\n",
|
||
"│ 1 │ 100 │ I │ B │\n",
|
||
"│ 2 │ 101 │ III │ A │\n",
|
||
"│ 3 │ 102 │ II │ B │\n",
|
||
"│ 4 │ 103 │ I │ A │"
|
||
]
|
||
},
|
||
"execution_count": 38,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"# Inner join\n",
|
||
" df4 = join(subjects, treatment, on = :Number, kind = :inner);\n",
|
||
" df4"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"An outer join will join both dataframes and add `missing` data."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 39,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/html": [
|
||
"<table class=\"data-frame\"><thead><tr><th></th><th>Number</th><th>Stage</th><th>Treatment</th></tr><tr><th></th><th>Int64⍰</th><th>String⍰</th><th>String⍰</th></tr></thead><tbody><p>6 rows × 3 columns</p><tr><th>1</th><td>100</td><td>I</td><td>B</td></tr><tr><th>2</th><td>101</td><td>III</td><td>A</td></tr><tr><th>3</th><td>102</td><td>II</td><td>B</td></tr><tr><th>4</th><td>103</td><td>I</td><td>A</td></tr><tr><th>5</th><td>104</td><td>II</td><td>missing</td></tr><tr><th>6</th><td>105</td><td>II</td><td>missing</td></tr></tbody></table>"
|
||
],
|
||
"text/latex": [
|
||
"\\begin{tabular}{r|ccc}\n",
|
||
"\t& Number & Stage & Treatment\\\\\n",
|
||
"\t\\hline\n",
|
||
"\t& Int64⍰ & String⍰ & String⍰\\\\\n",
|
||
"\t\\hline\n",
|
||
"\t1 & 100 & I & B \\\\\n",
|
||
"\t2 & 101 & III & A \\\\\n",
|
||
"\t3 & 102 & II & B \\\\\n",
|
||
"\t4 & 103 & I & A \\\\\n",
|
||
"\t5 & 104 & II & \\\\\n",
|
||
"\t6 & 105 & II & \\\\\n",
|
||
"\\end{tabular}\n"
|
||
],
|
||
"text/plain": [
|
||
"6×3 DataFrame\n",
|
||
"│ Row │ Number │ Stage │ Treatment │\n",
|
||
"│ │ \u001b[90mInt64⍰\u001b[39m │ \u001b[90mString⍰\u001b[39m │ \u001b[90mString⍰\u001b[39m │\n",
|
||
"├─────┼────────┼─────────┼───────────┤\n",
|
||
"│ 1 │ 100 │ I │ B │\n",
|
||
"│ 2 │ 101 │ III │ A │\n",
|
||
"│ 3 │ 102 │ II │ B │\n",
|
||
"│ 4 │ 103 │ I │ A │\n",
|
||
"│ 5 │ 104 │ II │ \u001b[90mmissing\u001b[39m │\n",
|
||
"│ 6 │ 105 │ II │ \u001b[90mmissing\u001b[39m │"
|
||
]
|
||
},
|
||
"execution_count": 39,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"# Outer joing: empty fields filled with missing\n",
|
||
"df5 = join(subjects, treatment, on = :Number, kind = :outer);\n",
|
||
"df5"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"[Back to the top](#In-this-lecture)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"### Grouping"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"A dataframe can be _spliced_ by grouping rows according to values in a variable."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 40,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/html": [
|
||
"<table class=\"data-frame\"><thead><tr><th></th><th>Group</th><th>Variable1</th><th>Variable2</th></tr><tr><th></th><th>String</th><th>Float64</th><th>Float64</th></tr></thead><tbody><p>3 rows × 3 columns</p><tr><th>1</th><td>B</td><td>0.447358</td><td>0.137658</td></tr><tr><th>2</th><td>B</td><td>-0.396211</td><td>0.60808</td></tr><tr><th>3</th><td>B</td><td>0.366773</td><td>0.255054</td></tr></tbody></table>"
|
||
],
|
||
"text/latex": [
|
||
"\\begin{tabular}{r|ccc}\n",
|
||
"\t& Group & Variable1 & Variable2\\\\\n",
|
||
"\t\\hline\n",
|
||
"\t& String & Float64 & Float64\\\\\n",
|
||
"\t\\hline\n",
|
||
"\t1 & B & 0.447358 & 0.137658 \\\\\n",
|
||
"\t2 & B & -0.396211 & 0.60808 \\\\\n",
|
||
"\t3 & B & 0.366773 & 0.255054 \\\\\n",
|
||
"\\end{tabular}\n"
|
||
],
|
||
"text/plain": [
|
||
"3×3 DataFrame\n",
|
||
"│ Row │ Group │ Variable1 │ Variable2 │\n",
|
||
"│ │ \u001b[90mString\u001b[39m │ \u001b[90mFloat64\u001b[39m │ \u001b[90mFloat64\u001b[39m │\n",
|
||
"├─────┼────────┼───────────┼───────────┤\n",
|
||
"│ 1 │ B │ 0.447358 │ 0.137658 │\n",
|
||
"│ 2 │ B │ -0.396211 │ 0.60808 │\n",
|
||
"│ 3 │ B │ 0.366773 │ 0.255054 │"
|
||
]
|
||
},
|
||
"execution_count": 40,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"# Creating a new DataFrame\n",
|
||
"df6 = DataFrame(Group = rand([\"A\", \"B\", \"C\"], 15), Variable1 = randn(15), Variable2 = rand(15));\n",
|
||
"first(df6, 3)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"The `by()` function takes a dataframe object as first argument. This is followed by a column (variable) on which to group by. Below we use the `size` argument to indicate the number of rows and columns for the number of each unique values that are found in the specified variable."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 41,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/html": [
|
||
"<table class=\"data-frame\"><thead><tr><th></th><th>Group</th><th>x1</th></tr><tr><th></th><th>String</th><th>Tuple…</th></tr></thead><tbody><p>3 rows × 2 columns</p><tr><th>1</th><td>B</td><td>(9, 3)</td></tr><tr><th>2</th><td>A</td><td>(4, 3)</td></tr><tr><th>3</th><td>C</td><td>(2, 3)</td></tr></tbody></table>"
|
||
],
|
||
"text/latex": [
|
||
"\\begin{tabular}{r|cc}\n",
|
||
"\t& Group & x1\\\\\n",
|
||
"\t\\hline\n",
|
||
"\t& String & Tuple…\\\\\n",
|
||
"\t\\hline\n",
|
||
"\t1 & B & (9, 3) \\\\\n",
|
||
"\t2 & A & (4, 3) \\\\\n",
|
||
"\t3 & C & (2, 3) \\\\\n",
|
||
"\\end{tabular}\n"
|
||
],
|
||
"text/plain": [
|
||
"3×2 DataFrame\n",
|
||
"│ Row │ Group │ x1 │\n",
|
||
"│ │ \u001b[90mString\u001b[39m │ \u001b[90mTuple…\u001b[39m │\n",
|
||
"├─────┼────────┼────────┤\n",
|
||
"│ 1 │ B │ (9, 3) │\n",
|
||
"│ 2 │ A │ (4, 3) │\n",
|
||
"│ 3 │ C │ (2, 3) │"
|
||
]
|
||
},
|
||
"execution_count": 41,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"#Grouping using by()\n",
|
||
"by(df6, :Group, size)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"Since the dataframe has three columns, we note that as the second value in the `count` tuple returned above. The first value shows the number of instances of the unique values found for the specified variable."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"Below we create a dataframe instance that shows only the count of the unique values."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 42,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/html": [
|
||
"<table class=\"data-frame\"><thead><tr><th></th><th>Group</th><th>Count</th></tr><tr><th></th><th>String</th><th>Int64</th></tr></thead><tbody><p>3 rows × 2 columns</p><tr><th>1</th><td>B</td><td>9</td></tr><tr><th>2</th><td>A</td><td>4</td></tr><tr><th>3</th><td>C</td><td>2</td></tr></tbody></table>"
|
||
],
|
||
"text/latex": [
|
||
"\\begin{tabular}{r|cc}\n",
|
||
"\t& Group & Count\\\\\n",
|
||
"\t\\hline\n",
|
||
"\t& String & Int64\\\\\n",
|
||
"\t\\hline\n",
|
||
"\t1 & B & 9 \\\\\n",
|
||
"\t2 & A & 4 \\\\\n",
|
||
"\t3 & C & 2 \\\\\n",
|
||
"\\end{tabular}\n"
|
||
],
|
||
"text/plain": [
|
||
"3×2 DataFrame\n",
|
||
"│ Row │ Group │ Count │\n",
|
||
"│ │ \u001b[90mString\u001b[39m │ \u001b[90mInt64\u001b[39m │\n",
|
||
"├─────┼────────┼───────┤\n",
|
||
"│ 1 │ B │ 9 │\n",
|
||
"│ 2 │ A │ 4 │\n",
|
||
"│ 3 │ C │ 2 │"
|
||
]
|
||
},
|
||
"execution_count": 42,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"# Count unique data point values in :Group column\n",
|
||
"by(df6, :Group, dfc -> DataFrame(Count = size(dfc, 1)))"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"The `aggregate()` function also groups a dataframe by unique values for a specified column, but then provides the ability to list statistical tests required."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 43,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"3×5 DataFrame\n",
|
||
"│ Row │ Group │ Variable1_mean │ Variable2_mean │ Variable1_std │ Variable2_std │\n",
|
||
"│ │ \u001b[90mString\u001b[39m │ \u001b[90mFloat64\u001b[39m │ \u001b[90mFloat64\u001b[39m │ \u001b[90mFloat64\u001b[39m │ \u001b[90mFloat64\u001b[39m │\n",
|
||
"├─────┼────────┼────────────────┼────────────────┼───────────────┼───────────────┤\n",
|
||
"│ 1 │ B │ 0.127675 │ 0.446397 │ 0.973237 │ 0.268476 │\n",
|
||
"│ 2 │ A │ -0.33429 │ 0.339451 │ 1.04503 │ 0.352194 │\n",
|
||
"│ 3 │ C │ -0.902111 │ 0.373207 │ 1.51729 │ 0.368007 │"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"# Aggregate for descriptive statistics\n",
|
||
"print(aggregate(df6, :Group, [mean, std]))"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"The `groupby()` function actually creates sub-dataframes based on the unique values found in the specified variable."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 44,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/html": [
|
||
"<p><b>GroupedDataFrame with 3 groups based on key: Group</b></p><p><i>First Group (9 rows): Group = \"B\"</i></p><table class=\"data-frame\"><thead><tr><th></th><th>Group</th><th>Variable1</th><th>Variable2</th></tr><tr><th></th><th>String</th><th>Float64</th><th>Float64</th></tr></thead><tbody><tr><th>1</th><td>B</td><td>0.447358</td><td>0.137658</td></tr><tr><th>2</th><td>B</td><td>-0.396211</td><td>0.60808</td></tr><tr><th>3</th><td>B</td><td>0.366773</td><td>0.255054</td></tr><tr><th>4</th><td>B</td><td>0.621673</td><td>0.498734</td></tr><tr><th>5</th><td>B</td><td>2.06353</td><td>0.52509</td></tr><tr><th>6</th><td>B</td><td>-1.41453</td><td>0.265511</td></tr><tr><th>7</th><td>B</td><td>0.134475</td><td>0.110096</td></tr><tr><th>8</th><td>B</td><td>-0.750421</td><td>0.834362</td></tr><tr><th>9</th><td>B</td><td>0.076418</td><td>0.78299</td></tr></tbody></table><p>⋮</p><p><i>Last Group (2 rows): Group = \"C\"</i></p><table class=\"data-frame\"><thead><tr><th></th><th>Group</th><th>Variable1</th><th>Variable2</th></tr><tr><th></th><th>String</th><th>Float64</th><th>Float64</th></tr></thead><tbody><tr><th>1</th><td>C</td><td>0.170778</td><td>0.633427</td></tr><tr><th>2</th><td>C</td><td>-1.975</td><td>0.112987</td></tr></tbody></table>"
|
||
],
|
||
"text/latex": [
|
||
"GroupedDataFrame with 3 groups based on key: Group\n",
|
||
"\n",
|
||
"First Group (9 rows): Group = \"B\"\n",
|
||
"\n",
|
||
"\\begin{tabular}{r|ccc}\n",
|
||
"\t& Group & Variable1 & Variable2\\\\\n",
|
||
"\t\\hline\n",
|
||
"\t& String & Float64 & Float64\\\\\n",
|
||
"\t\\hline\n",
|
||
"\t1 & B & 0.447358 & 0.137658 \\\\\n",
|
||
"\t2 & B & -0.396211 & 0.60808 \\\\\n",
|
||
"\t3 & B & 0.366773 & 0.255054 \\\\\n",
|
||
"\t4 & B & 0.621673 & 0.498734 \\\\\n",
|
||
"\t5 & B & 2.06353 & 0.52509 \\\\\n",
|
||
"\t6 & B & -1.41453 & 0.265511 \\\\\n",
|
||
"\t7 & B & 0.134475 & 0.110096 \\\\\n",
|
||
"\t8 & B & -0.750421 & 0.834362 \\\\\n",
|
||
"\t9 & B & 0.076418 & 0.78299 \\\\\n",
|
||
"\\end{tabular}\n",
|
||
"\n",
|
||
"$\\dots$\n",
|
||
"\n",
|
||
"Last Group (2 rows): Group = \"C\"\n",
|
||
"\n",
|
||
"\\begin{tabular}{r|ccc}\n",
|
||
"\t& Group & Variable1 & Variable2\\\\\n",
|
||
"\t\\hline\n",
|
||
"\t& String & Float64 & Float64\\\\\n",
|
||
"\t\\hline\n",
|
||
"\t1 & C & 0.170778 & 0.633427 \\\\\n",
|
||
"\t2 & C & -1.975 & 0.112987 \\\\\n",
|
||
"\\end{tabular}\n"
|
||
],
|
||
"text/plain": [
|
||
"GroupedDataFrame with 3 groups based on key: Group\n",
|
||
"First Group (9 rows): Group = \"B\"\n",
|
||
"│ Row │ Group │ Variable1 │ Variable2 │\n",
|
||
"│ │ \u001b[90mString\u001b[39m │ \u001b[90mFloat64\u001b[39m │ \u001b[90mFloat64\u001b[39m │\n",
|
||
"├─────┼────────┼───────────┼───────────┤\n",
|
||
"│ 1 │ B │ 0.447358 │ 0.137658 │\n",
|
||
"│ 2 │ B │ -0.396211 │ 0.60808 │\n",
|
||
"│ 3 │ B │ 0.366773 │ 0.255054 │\n",
|
||
"│ 4 │ B │ 0.621673 │ 0.498734 │\n",
|
||
"│ 5 │ B │ 2.06353 │ 0.52509 │\n",
|
||
"│ 6 │ B │ -1.41453 │ 0.265511 │\n",
|
||
"│ 7 │ B │ 0.134475 │ 0.110096 │\n",
|
||
"│ 8 │ B │ -0.750421 │ 0.834362 │\n",
|
||
"│ 9 │ B │ 0.076418 │ 0.78299 │\n",
|
||
"⋮\n",
|
||
"Last Group (2 rows): Group = \"C\"\n",
|
||
"│ Row │ Group │ Variable1 │ Variable2 │\n",
|
||
"│ │ \u001b[90mString\u001b[39m │ \u001b[90mFloat64\u001b[39m │ \u001b[90mFloat64\u001b[39m │\n",
|
||
"├─────┼────────┼───────────┼───────────┤\n",
|
||
"│ 1 │ C │ 0.170778 │ 0.633427 │\n",
|
||
"│ 2 │ C │ -1.975 │ 0.112987 │"
|
||
]
|
||
},
|
||
"execution_count": 44,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"#Group\n",
|
||
"groupby(df6, :Group)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"By calling the `length()` function, we note that there are indeed three sub-dataframes."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 45,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"3"
|
||
]
|
||
},
|
||
"execution_count": 45,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"length(groupby(df6, :Group))"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"Using indexing, we can select any of the three sub-dataframes."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 46,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/html": [
|
||
"<table class=\"data-frame\"><thead><tr><th></th><th>Group</th><th>Variable1</th><th>Variable2</th></tr><tr><th></th><th>String</th><th>Float64</th><th>Float64</th></tr></thead><tbody><p>4 rows × 3 columns</p><tr><th>1</th><td>A</td><td>0.182588</td><td>0.0940369</td></tr><tr><th>2</th><td>A</td><td>-1.58492</td><td>0.337865</td></tr><tr><th>3</th><td>A</td><td>0.799335</td><td>0.838042</td></tr><tr><th>4</th><td>A</td><td>-0.734161</td><td>0.0878598</td></tr></tbody></table>"
|
||
],
|
||
"text/latex": [
|
||
"\\begin{tabular}{r|ccc}\n",
|
||
"\t& Group & Variable1 & Variable2\\\\\n",
|
||
"\t\\hline\n",
|
||
"\t& String & Float64 & Float64\\\\\n",
|
||
"\t\\hline\n",
|
||
"\t1 & A & 0.182588 & 0.0940369 \\\\\n",
|
||
"\t2 & A & -1.58492 & 0.337865 \\\\\n",
|
||
"\t3 & A & 0.799335 & 0.838042 \\\\\n",
|
||
"\t4 & A & -0.734161 & 0.0878598 \\\\\n",
|
||
"\\end{tabular}\n"
|
||
],
|
||
"text/plain": [
|
||
"4×3 SubDataFrame\n",
|
||
"│ Row │ Group │ Variable1 │ Variable2 │\n",
|
||
"│ │ \u001b[90mString\u001b[39m │ \u001b[90mFloat64\u001b[39m │ \u001b[90mFloat64\u001b[39m │\n",
|
||
"├─────┼────────┼───────────┼───────────┤\n",
|
||
"│ 1 │ A │ 0.182588 │ 0.0940369 │\n",
|
||
"│ 2 │ A │ -1.58492 │ 0.337865 │\n",
|
||
"│ 3 │ A │ 0.799335 │ 0.838042 │\n",
|
||
"│ 4 │ A │ -0.734161 │ 0.0878598 │"
|
||
]
|
||
},
|
||
"execution_count": 46,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"groupby(df6, :Group)[2]"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"[Back to the top](#In-this-lecture)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"### Sorting"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"Sorting using the `sort!()` function (permanent bang version used here), does what is says on the box. A list can be provided to sort by more than one variable."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 47,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/html": [
|
||
"<table class=\"data-frame\"><thead><tr><th></th><th>Group</th><th>Variable1</th><th>Variable2</th></tr><tr><th></th><th>String</th><th>Float64</th><th>Float64</th></tr></thead><tbody><p>7 rows × 3 columns</p><tr><th>1</th><td>A</td><td>-1.58492</td><td>0.337865</td></tr><tr><th>2</th><td>A</td><td>-0.734161</td><td>0.0878598</td></tr><tr><th>3</th><td>A</td><td>0.182588</td><td>0.0940369</td></tr><tr><th>4</th><td>A</td><td>0.799335</td><td>0.838042</td></tr><tr><th>5</th><td>B</td><td>-1.41453</td><td>0.265511</td></tr><tr><th>6</th><td>B</td><td>-0.750421</td><td>0.834362</td></tr><tr><th>7</th><td>B</td><td>-0.396211</td><td>0.60808</td></tr></tbody></table>"
|
||
],
|
||
"text/latex": [
|
||
"\\begin{tabular}{r|ccc}\n",
|
||
"\t& Group & Variable1 & Variable2\\\\\n",
|
||
"\t\\hline\n",
|
||
"\t& String & Float64 & Float64\\\\\n",
|
||
"\t\\hline\n",
|
||
"\t1 & A & -1.58492 & 0.337865 \\\\\n",
|
||
"\t2 & A & -0.734161 & 0.0878598 \\\\\n",
|
||
"\t3 & A & 0.182588 & 0.0940369 \\\\\n",
|
||
"\t4 & A & 0.799335 & 0.838042 \\\\\n",
|
||
"\t5 & B & -1.41453 & 0.265511 \\\\\n",
|
||
"\t6 & B & -0.750421 & 0.834362 \\\\\n",
|
||
"\t7 & B & -0.396211 & 0.60808 \\\\\n",
|
||
"\\end{tabular}\n"
|
||
],
|
||
"text/plain": [
|
||
"7×3 DataFrame\n",
|
||
"│ Row │ Group │ Variable1 │ Variable2 │\n",
|
||
"│ │ \u001b[90mString\u001b[39m │ \u001b[90mFloat64\u001b[39m │ \u001b[90mFloat64\u001b[39m │\n",
|
||
"├─────┼────────┼───────────┼───────────┤\n",
|
||
"│ 1 │ A │ -1.58492 │ 0.337865 │\n",
|
||
"│ 2 │ A │ -0.734161 │ 0.0878598 │\n",
|
||
"│ 3 │ A │ 0.182588 │ 0.0940369 │\n",
|
||
"│ 4 │ A │ 0.799335 │ 0.838042 │\n",
|
||
"│ 5 │ B │ -1.41453 │ 0.265511 │\n",
|
||
"│ 6 │ B │ -0.750421 │ 0.834362 │\n",
|
||
"│ 7 │ B │ -0.396211 │ 0.60808 │"
|
||
]
|
||
},
|
||
"execution_count": 47,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"df6S = sort!(df6, [:Group, :Variable1]);\n",
|
||
"first(df6S, 7)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"[Back to the top](#In-this-lecture)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"### Unique rows only"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"Below we create a dataframe with two identical rows."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 48,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/html": [
|
||
"<table class=\"data-frame\"><thead><tr><th></th><th>A</th><th>B</th><th>C</th></tr><tr><th></th><th>Int64</th><th>Int64</th><th>String</th></tr></thead><tbody><p>6 rows × 3 columns</p><tr><th>1</th><td>1</td><td>11</td><td>A</td></tr><tr><th>2</th><td>2</td><td>12</td><td>B</td></tr><tr><th>3</th><td>2</td><td>12</td><td>B</td></tr><tr><th>4</th><td>3</td><td>13</td><td>C</td></tr><tr><th>5</th><td>4</td><td>14</td><td>D</td></tr><tr><th>6</th><td>5</td><td>15</td><td>E</td></tr></tbody></table>"
|
||
],
|
||
"text/latex": [
|
||
"\\begin{tabular}{r|ccc}\n",
|
||
"\t& A & B & C\\\\\n",
|
||
"\t\\hline\n",
|
||
"\t& Int64 & Int64 & String\\\\\n",
|
||
"\t\\hline\n",
|
||
"\t1 & 1 & 11 & A \\\\\n",
|
||
"\t2 & 2 & 12 & B \\\\\n",
|
||
"\t3 & 2 & 12 & B \\\\\n",
|
||
"\t4 & 3 & 13 & C \\\\\n",
|
||
"\t5 & 4 & 14 & D \\\\\n",
|
||
"\t6 & 5 & 15 & E \\\\\n",
|
||
"\\end{tabular}\n"
|
||
],
|
||
"text/plain": [
|
||
"6×3 DataFrame\n",
|
||
"│ Row │ A │ B │ C │\n",
|
||
"│ │ \u001b[90mInt64\u001b[39m │ \u001b[90mInt64\u001b[39m │ \u001b[90mString\u001b[39m │\n",
|
||
"├─────┼───────┼───────┼────────┤\n",
|
||
"│ 1 │ 1 │ 11 │ A │\n",
|
||
"│ 2 │ 2 │ 12 │ B │\n",
|
||
"│ 3 │ 2 │ 12 │ B │\n",
|
||
"│ 4 │ 3 │ 13 │ C │\n",
|
||
"│ 5 │ 4 │ 14 │ D │\n",
|
||
"│ 6 │ 5 │ 15 │ E │"
|
||
]
|
||
},
|
||
"execution_count": 48,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"# Creating a DataFrame with an obvious duplicate row\n",
|
||
"df7 = DataFrame(A = [1, 2, 2, 3, 4, 5], B = [11, 12, 12, 13, 14, 15], C = [\"A\", \"B\", \"B\", \"C\", \"D\", \"E\"]);\n",
|
||
"df7"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"The `unique()` function will, as the name implies, delete the duplicate row."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 49,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/html": [
|
||
"<table class=\"data-frame\"><thead><tr><th></th><th>A</th><th>B</th><th>C</th></tr><tr><th></th><th>Int64</th><th>Int64</th><th>String</th></tr></thead><tbody><p>5 rows × 3 columns</p><tr><th>1</th><td>1</td><td>11</td><td>A</td></tr><tr><th>2</th><td>2</td><td>12</td><td>B</td></tr><tr><th>3</th><td>3</td><td>13</td><td>C</td></tr><tr><th>4</th><td>4</td><td>14</td><td>D</td></tr><tr><th>5</th><td>5</td><td>15</td><td>E</td></tr></tbody></table>"
|
||
],
|
||
"text/latex": [
|
||
"\\begin{tabular}{r|ccc}\n",
|
||
"\t& A & B & C\\\\\n",
|
||
"\t\\hline\n",
|
||
"\t& Int64 & Int64 & String\\\\\n",
|
||
"\t\\hline\n",
|
||
"\t1 & 1 & 11 & A \\\\\n",
|
||
"\t2 & 2 & 12 & B \\\\\n",
|
||
"\t3 & 3 & 13 & C \\\\\n",
|
||
"\t4 & 4 & 14 & D \\\\\n",
|
||
"\t5 & 5 & 15 & E \\\\\n",
|
||
"\\end{tabular}\n"
|
||
],
|
||
"text/plain": [
|
||
"5×3 DataFrame\n",
|
||
"│ Row │ A │ B │ C │\n",
|
||
"│ │ \u001b[90mInt64\u001b[39m │ \u001b[90mInt64\u001b[39m │ \u001b[90mString\u001b[39m │\n",
|
||
"├─────┼───────┼───────┼────────┤\n",
|
||
"│ 1 │ 1 │ 11 │ A │\n",
|
||
"│ 2 │ 2 │ 12 │ B │\n",
|
||
"│ 3 │ 3 │ 13 │ C │\n",
|
||
"│ 4 │ 4 │ 14 │ D │\n",
|
||
"│ 5 │ 5 │ 15 │ E │"
|
||
]
|
||
},
|
||
"execution_count": 49,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"# Only unique rows\n",
|
||
"unique(df7)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 50,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/html": [
|
||
"<table class=\"data-frame\"><thead><tr><th></th><th>A</th><th>B</th><th>C</th></tr><tr><th></th><th>Int64</th><th>Int64</th><th>String</th></tr></thead><tbody><p>6 rows × 3 columns</p><tr><th>1</th><td>1</td><td>11</td><td>A</td></tr><tr><th>2</th><td>2</td><td>12</td><td>B</td></tr><tr><th>3</th><td>2</td><td>12</td><td>B</td></tr><tr><th>4</th><td>3</td><td>13</td><td>C</td></tr><tr><th>5</th><td>4</td><td>14</td><td>D</td></tr><tr><th>6</th><td>5</td><td>15</td><td>E</td></tr></tbody></table>"
|
||
],
|
||
"text/latex": [
|
||
"\\begin{tabular}{r|ccc}\n",
|
||
"\t& A & B & C\\\\\n",
|
||
"\t\\hline\n",
|
||
"\t& Int64 & Int64 & String\\\\\n",
|
||
"\t\\hline\n",
|
||
"\t1 & 1 & 11 & A \\\\\n",
|
||
"\t2 & 2 & 12 & B \\\\\n",
|
||
"\t3 & 2 & 12 & B \\\\\n",
|
||
"\t4 & 3 & 13 & C \\\\\n",
|
||
"\t5 & 4 & 14 & D \\\\\n",
|
||
"\t6 & 5 & 15 & E \\\\\n",
|
||
"\\end{tabular}\n"
|
||
],
|
||
"text/plain": [
|
||
"6×3 DataFrame\n",
|
||
"│ Row │ A │ B │ C │\n",
|
||
"│ │ \u001b[90mInt64\u001b[39m │ \u001b[90mInt64\u001b[39m │ \u001b[90mString\u001b[39m │\n",
|
||
"├─────┼───────┼───────┼────────┤\n",
|
||
"│ 1 │ 1 │ 11 │ A │\n",
|
||
"│ 2 │ 2 │ 12 │ B │\n",
|
||
"│ 3 │ 2 │ 12 │ B │\n",
|
||
"│ 4 │ 3 │ 13 │ C │\n",
|
||
"│ 5 │ 4 │ 14 │ D │\n",
|
||
"│ 6 │ 5 │ 15 │ E │"
|
||
]
|
||
},
|
||
"execution_count": 50,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"df7"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"As always, the bang will make the change permament."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 51,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/html": [
|
||
"<table class=\"data-frame\"><thead><tr><th></th><th>A</th><th>B</th><th>C</th></tr><tr><th></th><th>Int64</th><th>Int64</th><th>String</th></tr></thead><tbody><p>5 rows × 3 columns</p><tr><th>1</th><td>1</td><td>11</td><td>A</td></tr><tr><th>2</th><td>2</td><td>12</td><td>B</td></tr><tr><th>3</th><td>3</td><td>13</td><td>C</td></tr><tr><th>4</th><td>4</td><td>14</td><td>D</td></tr><tr><th>5</th><td>5</td><td>15</td><td>E</td></tr></tbody></table>"
|
||
],
|
||
"text/latex": [
|
||
"\\begin{tabular}{r|ccc}\n",
|
||
"\t& A & B & C\\\\\n",
|
||
"\t\\hline\n",
|
||
"\t& Int64 & Int64 & String\\\\\n",
|
||
"\t\\hline\n",
|
||
"\t1 & 1 & 11 & A \\\\\n",
|
||
"\t2 & 2 & 12 & B \\\\\n",
|
||
"\t3 & 3 & 13 & C \\\\\n",
|
||
"\t4 & 4 & 14 & D \\\\\n",
|
||
"\t5 & 5 & 15 & E \\\\\n",
|
||
"\\end{tabular}\n"
|
||
],
|
||
"text/plain": [
|
||
"5×3 DataFrame\n",
|
||
"│ Row │ A │ B │ C │\n",
|
||
"│ │ \u001b[90mInt64\u001b[39m │ \u001b[90mInt64\u001b[39m │ \u001b[90mString\u001b[39m │\n",
|
||
"├─────┼───────┼───────┼────────┤\n",
|
||
"│ 1 │ 1 │ 11 │ A │\n",
|
||
"│ 2 │ 2 │ 12 │ B │\n",
|
||
"│ 3 │ 3 │ 13 │ C │\n",
|
||
"│ 4 │ 4 │ 14 │ D │\n",
|
||
"│ 5 │ 5 │ 15 │ E │"
|
||
]
|
||
},
|
||
"execution_count": 51,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"# Permanant change\n",
|
||
"unique!(df7)\n",
|
||
"df7"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 52,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/html": [
|
||
"<table class=\"data-frame\"><thead><tr><th></th><th>A</th><th>B</th><th>C</th></tr><tr><th></th><th>Int64</th><th>Int64</th><th>String</th></tr></thead><tbody><p>5 rows × 3 columns</p><tr><th>1</th><td>1</td><td>11</td><td>A</td></tr><tr><th>2</th><td>2</td><td>12</td><td>B</td></tr><tr><th>3</th><td>3</td><td>13</td><td>C</td></tr><tr><th>4</th><td>4</td><td>14</td><td>D</td></tr><tr><th>5</th><td>5</td><td>15</td><td>E</td></tr></tbody></table>"
|
||
],
|
||
"text/latex": [
|
||
"\\begin{tabular}{r|ccc}\n",
|
||
"\t& A & B & C\\\\\n",
|
||
"\t\\hline\n",
|
||
"\t& Int64 & Int64 & String\\\\\n",
|
||
"\t\\hline\n",
|
||
"\t1 & 1 & 11 & A \\\\\n",
|
||
"\t2 & 2 & 12 & B \\\\\n",
|
||
"\t3 & 3 & 13 & C \\\\\n",
|
||
"\t4 & 4 & 14 & D \\\\\n",
|
||
"\t5 & 5 & 15 & E \\\\\n",
|
||
"\\end{tabular}\n"
|
||
],
|
||
"text/plain": [
|
||
"5×3 DataFrame\n",
|
||
"│ Row │ A │ B │ C │\n",
|
||
"│ │ \u001b[90mInt64\u001b[39m │ \u001b[90mInt64\u001b[39m │ \u001b[90mString\u001b[39m │\n",
|
||
"├─────┼───────┼───────┼────────┤\n",
|
||
"│ 1 │ 1 │ 11 │ A │\n",
|
||
"│ 2 │ 2 │ 12 │ B │\n",
|
||
"│ 3 │ 3 │ 13 │ C │\n",
|
||
"│ 4 │ 4 │ 14 │ D │\n",
|
||
"│ 5 │ 5 │ 15 │ E │"
|
||
]
|
||
},
|
||
"execution_count": 52,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"#Permanant change\n",
|
||
"unique!(df7)\n",
|
||
"df7"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"[Back to the top](#In-this-lecture)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"### Deleting rows"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"The `deleterows!()` function (permanent bang version used here), deletes specified rows."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 53,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/html": [
|
||
"<table class=\"data-frame\"><thead><tr><th></th><th>A</th><th>B</th><th>C</th></tr><tr><th></th><th>Int64</th><th>Int64</th><th>String</th></tr></thead><tbody><p>3 rows × 3 columns</p><tr><th>1</th><td>2</td><td>12</td><td>B</td></tr><tr><th>2</th><td>3</td><td>13</td><td>C</td></tr><tr><th>3</th><td>4</td><td>14</td><td>D</td></tr></tbody></table>"
|
||
],
|
||
"text/latex": [
|
||
"\\begin{tabular}{r|ccc}\n",
|
||
"\t& A & B & C\\\\\n",
|
||
"\t\\hline\n",
|
||
"\t& Int64 & Int64 & String\\\\\n",
|
||
"\t\\hline\n",
|
||
"\t1 & 2 & 12 & B \\\\\n",
|
||
"\t2 & 3 & 13 & C \\\\\n",
|
||
"\t3 & 4 & 14 & D \\\\\n",
|
||
"\\end{tabular}\n"
|
||
],
|
||
"text/plain": [
|
||
"3×3 DataFrame\n",
|
||
"│ Row │ A │ B │ C │\n",
|
||
"│ │ \u001b[90mInt64\u001b[39m │ \u001b[90mInt64\u001b[39m │ \u001b[90mString\u001b[39m │\n",
|
||
"├─────┼───────┼───────┼────────┤\n",
|
||
"│ 1 │ 2 │ 12 │ B │\n",
|
||
"│ 2 │ 3 │ 13 │ C │\n",
|
||
"│ 3 │ 4 │ 14 │ D │"
|
||
]
|
||
},
|
||
"execution_count": 53,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"# Permanently\n",
|
||
"deleterows!(df7, [1, 5])\n",
|
||
"df7"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"[Back to the top](#In-this-lecture)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": []
|
||
}
|
||
],
|
||
"metadata": {
|
||
"kernelspec": {
|
||
"display_name": "Julia 1.0.3",
|
||
"language": "julia",
|
||
"name": "julia-1.0"
|
||
},
|
||
"language_info": {
|
||
"file_extension": ".jl",
|
||
"mimetype": "application/julia",
|
||
"name": "julia",
|
||
"version": "1.0.3"
|
||
}
|
||
},
|
||
"nbformat": 4,
|
||
"nbformat_minor": 2
|
||
}
|