julia_coursera/Week4_Working with data.ipynb

2128 lines
70 KiB
Plaintext
Raw Permalink Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Working with data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## In this lecture"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- [Introduction](#Introduction)\n",
"- [Distributions](#Distributions)\n",
"- [Normal distribution](#Normal-distribution)\n",
"- [Other distributions](#Other-distributions)\n",
"- [DataFrames](#DataFrames)\n",
" - [Combining dataframes](#Combining-dataframes)\n",
" - [Grouping](#Grouping)\n",
" - [Sorting](#Sorting)\n",
" - [Unique rows only](#Unique-rows-only)\n",
" - [Deleting rows](#Deleting-rows)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Introduction"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The ability to use data is fundamental to most modern computer coding taks. In this lecture, we will have a brief introduction to the way in which the Julia language incorporates data through the use of the `Distributions.jl` and `DataFrames.jl` packages."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"[Back to the top](#In-this-lecture)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Distributions"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Data point values for a distribution usually follow a pattern. Such patterns are called distributions. Distributions are either discrete or continuous. The `Distribution.jl` package contains most of the common data distributions."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We will also use the `Random.jl` package to seed the pseudo-random number generator so that we can reproduce the random values that we are going to use in the lecture."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"using Distributions\n",
"using Random"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"[Back to the top](#In-this-lecture)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### The normal distribution"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The normal distribution is the famous bell-shaped curve that we are familiar with. Values around the mean occur most frequently and as values get progressively further away from the mean, they occur less frequently."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(0.0, 1.0)"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"#Seed the pseudo-random number generator\n",
"Random.seed!(1234)\n",
"#Saving the standard normal distribution as an object\n",
"n = Distributions.Normal() # This function is from the Distributions package\n",
"#Parameter values of the standard normal distribution\n",
"params(n)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Using the `params()` function, we note a mean on $0$ and a standard deviation of $1$, also called the _standard normal distribution_."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `fieldnames()` function provides the actual parameters of the given distribution. In the case of the normal distribution, it will be the average and the standard deviation, namely $\\mu$ and $\\sigma$."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(:μ, :σ)"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Returning the parameters of the normal distribution\n",
"fieldnames(Normal)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we create a variable called `var1` and use the `rand()` function to create select $10$ random values from the standard normal distribution."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"#Seed the pseudo-random number generator\n",
"Random.seed!(1234)\n",
"#Select 10 elements at random from n\n",
"var1 = rand(n, 10);"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can calculate the average and standard deviation of our randomly selected values."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.18909179133831322"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Average\n",
"mean(var1)"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.9879593623730926"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Standard deviation\n",
"std(var1)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `pdf()` calculates the probability density function value of a given distribution up until a specified point (from $- \\infty$)."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.38138781546052414"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"#Probability density function value at x = 0.3\n",
"pdf(Normal(), 0.3)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `cdf()` functions calculates the cummulative distribution function value of a given distribution up until a specified point (from $- \\infty$)."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.5987063256829237"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"#Cumulative distribution function as x = 0.25\n",
"cdf(Normal(), 0.25)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The values for the average and standard deviation can be specified."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"# Creating 100 data point values from a normal distribution\n",
"# with a mean of 100 and a standard deviation of 10\n",
"Random.seed!(1234)\n",
"var2 = rand(Normal(100, 10), 100);"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"98.52365657772843"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Calculating the mean of var2\n",
"mean(var2)"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"9.580963685859091"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"#Calculating the standard deviation of var2\n",
"std(var2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The parameters of a set of values for a specified distribution can be returned."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Normal{Float64}(μ=98.52365657772843, σ=9.532938502804532)"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"#Using fit() to calculate the parameters of a distribution\n",
"fit(Normal, var2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `quantiles()` function provides us with values for the specific percentiles (provided as fractions). Below we calculate the $2.5$% and $97.5$% percentile values of the standard normal distribution."
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"-1.9599639845400592"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Quantiles\n",
"quantile(Normal(), 0.025)"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"1.9599639845400576"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"quantile(Normal(), 0.975)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"[Back to the top](#In-this-lecture)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Other distributions"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"There are many distributions in the `Distribution().jl` package. In the code below, a few of these are showcased by way of setting parameters, selecting random values, and fitting those value back to the distribution or returning the parameter field names."
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Beta{Float64}(α=1.236721159927394, β=1.1368118923305863)"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Beta distribution\n",
"b = Beta(1, 1)\n",
"params(b)\n",
"Random.seed!(1234)\n",
"var3 = rand(b, 100);\n",
"fit(Beta, var3)"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(:ν,)"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# χ2 distribution\n",
"c = Chisq(1)\n",
"Random.seed!(1234)\n",
"var4 = rand(c, 100)\n",
"fieldnames(Chisq) # Degrees of freedom"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"[Back to the top](#In-this-lecture)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Dataframes"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `Dataframes.jl` package allows for creation of a flat data structure (rows and columns). Columns are variables and rows are subjects (examples)."
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [],
"source": [
"using DataFrames"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Below, we create an empty dataframe object that we call `df`."
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [],
"source": [
"#Create and empty DataFrame\n",
"df = DataFrame();"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Column headers representing statistical variable names are entered in square brackets as symbols, i.e. preceeded with a colon. We will attach the `var2` set of values as data point entries for this statistical variables."
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [],
"source": [
"# Add a column with data point values (rows)\n",
"df[:Var2] = var2;"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can specify to print the first $5$ rows to the screen with the `first()` function,"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<table class=\"data-frame\"><thead><tr><th></th><th>Var2</th></tr><tr><th></th><th>Float64</th></tr></thead><tbody><p>5 rows × 1 columns</p><tr><th>1</th><td>108.673</td></tr><tr><th>2</th><td>90.9826</td></tr><tr><th>3</th><td>95.0552</td></tr><tr><th>4</th><td>90.9709</td></tr><tr><th>5</th><td>108.644</td></tr></tbody></table>"
],
"text/latex": [
"\\begin{tabular}{r|c}\n",
"\t& Var2\\\\\n",
"\t\\hline\n",
"\t& Float64\\\\\n",
"\t\\hline\n",
"\t1 & 108.673 \\\\\n",
"\t2 & 90.9826 \\\\\n",
"\t3 & 95.0552 \\\\\n",
"\t4 & 90.9709 \\\\\n",
"\t5 & 108.644 \\\\\n",
"\\end{tabular}\n"
],
"text/plain": [
"5×1 DataFrame\n",
"│ Row │ Var2 │\n",
"│ │ \u001b[90mFloat64\u001b[39m │\n",
"├─────┼─────────┤\n",
"│ 1 │ 108.673 │\n",
"│ 2 │ 90.9826 │\n",
"│ 3 │ 95.0552 │\n",
"│ 4 │ 90.9709 │\n",
"│ 5 │ 108.644 │"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"#View first five rows\n",
"first(df, 5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Below, we create another statistical variable with some data point values that we already have in the waiting."
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [],
"source": [
"# Add another column\n",
"df[:Var3] = var3;"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `last()` functions shows the last specified rows."
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<table class=\"data-frame\"><thead><tr><th></th><th>Var2</th><th>Var3</th></tr><tr><th></th><th>Float64</th><th>Float64</th></tr></thead><tbody><p>3 rows × 2 columns</p><tr><th>1</th><td>95.5675</td><td>0.831916</td></tr><tr><th>2</th><td>83.3677</td><td>0.221771</td></tr><tr><th>3</th><td>94.7877</td><td>0.655592</td></tr></tbody></table>"
],
"text/latex": [
"\\begin{tabular}{r|cc}\n",
"\t& Var2 & Var3\\\\\n",
"\t\\hline\n",
"\t& Float64 & Float64\\\\\n",
"\t\\hline\n",
"\t1 & 95.5675 & 0.831916 \\\\\n",
"\t2 & 83.3677 & 0.221771 \\\\\n",
"\t3 & 94.7877 & 0.655592 \\\\\n",
"\\end{tabular}\n"
],
"text/plain": [
"3×2 DataFrame\n",
"│ Row │ Var2 │ Var3 │\n",
"│ │ \u001b[90mFloat64\u001b[39m │ \u001b[90mFloat64\u001b[39m │\n",
"├─────┼─────────┼──────────┤\n",
"│ 1 │ 95.5675 │ 0.831916 │\n",
"│ 2 │ 83.3677 │ 0.221771 │\n",
"│ 3 │ 94.7877 │ 0.655592 │"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# View last three rows\n",
"last(df, 3)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `size()` function returns a tuple with the number of rows and columns returned,"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(100, 2)"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Dimensions of a DataFrame\n",
"size(df)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `describe()` functions attemps tp provide summary statistics of the variables>"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<table class=\"data-frame\"><thead><tr><th></th><th>variable</th><th>mean</th><th>min</th><th>median</th><th>max</th><th>nunique</th><th>nmissing</th><th>eltype</th></tr><tr><th></th><th>Symbol</th><th>Float64</th><th>Float64</th><th>Float64</th><th>Float64</th><th>Nothing</th><th>Nothing</th><th>DataType</th></tr></thead><tbody><p>2 rows × 8 columns</p><tr><th>1</th><td>Var2</td><td>98.5237</td><td>67.8864</td><td>98.1718</td><td>124.175</td><td></td><td></td><td>Float64</td></tr><tr><th>2</th><td>Var3</td><td>0.521047</td><td>0.00145384</td><td>0.522808</td><td>0.971161</td><td></td><td></td><td>Float64</td></tr></tbody></table>"
],
"text/latex": [
"\\begin{tabular}{r|cccccccc}\n",
"\t& variable & mean & min & median & max & nunique & nmissing & eltype\\\\\n",
"\t\\hline\n",
"\t& Symbol & Float64 & Float64 & Float64 & Float64 & Nothing & Nothing & DataType\\\\\n",
"\t\\hline\n",
"\t1 & Var2 & 98.5237 & 67.8864 & 98.1718 & 124.175 & & & Float64 \\\\\n",
"\t2 & Var3 & 0.521047 & 0.00145384 & 0.522808 & 0.971161 & & & Float64 \\\\\n",
"\\end{tabular}\n"
],
"text/plain": [
"2×8 DataFrame. Omitted printing of 2 columns\n",
"│ Row │ variable │ mean │ min │ median │ max │ nunique │\n",
"│ │ \u001b[90mSymbol\u001b[39m │ \u001b[90mFloat64\u001b[39m │ \u001b[90mFloat64\u001b[39m │ \u001b[90mFloat64\u001b[39m │ \u001b[90mFloat64\u001b[39m │ \u001b[90mNothing\u001b[39m │\n",
"├─────┼──────────┼──────────┼────────────┼──────────┼──────────┼─────────┤\n",
"│ 1 │ Var2 │ 98.5237 │ 67.8864 │ 98.1718 │ 124.175 │ │\n",
"│ 2 │ Var3 │ 0.521047 │ 0.00145384 │ 0.522808 │ 0.971161 │ │"
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Summarize the content\n",
"describe(df)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The data type for each variable can be returned."
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"2-element Array{DataType,1}:\n",
" Float64\n",
" Float64"
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Data type only\n",
"eltypes(df)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Below we create a new instance of a dataframe object called `df2`. It contains four statistical variables. Note the use of symbol notation in creating the names of these variables."
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [],
"source": [
"# 6 Create a bigger DataFrame\n",
"df2 = DataFrame()\n",
"df2[:A] = 1:10\n",
"df2[:B] = [\"I\", \"II\", \"II\", \"I\", \"II\",\"I\", \"II\", \"II\", \"I\", \"II\"]\n",
"Random.seed!(1234)\n",
"df2[:C] = rand(Normal(), 10)\n",
"df2[:D] = rand(Chisq(1), 10);"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"By using indexing (in square brackets), we can refer to row and column values (i.e. _row, column_). Below is an example of seleting data point values for rows one through three, showing all the columns. The colon symbol serves as shortcut syntax for this selection."
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<table class=\"data-frame\"><thead><tr><th></th><th>A</th><th>B</th><th>C</th><th>D</th></tr><tr><th></th><th>Int64</th><th>String</th><th>Float64</th><th>Float64</th></tr></thead><tbody><p>3 rows × 4 columns</p><tr><th>1</th><td>1</td><td>I</td><td>0.867347</td><td>0.0123688</td></tr><tr><th>2</th><td>2</td><td>II</td><td>-0.901744</td><td>0.213586</td></tr><tr><th>3</th><td>3</td><td>II</td><td>-0.494479</td><td>0.00899443</td></tr></tbody></table>"
],
"text/latex": [
"\\begin{tabular}{r|cccc}\n",
"\t& A & B & C & D\\\\\n",
"\t\\hline\n",
"\t& Int64 & String & Float64 & Float64\\\\\n",
"\t\\hline\n",
"\t1 & 1 & I & 0.867347 & 0.0123688 \\\\\n",
"\t2 & 2 & II & -0.901744 & 0.213586 \\\\\n",
"\t3 & 3 & II & -0.494479 & 0.00899443 \\\\\n",
"\\end{tabular}\n"
],
"text/plain": [
"3×4 DataFrame\n",
"│ Row │ A │ B │ C │ D │\n",
"│ │ \u001b[90mInt64\u001b[39m │ \u001b[90mString\u001b[39m │ \u001b[90mFloat64\u001b[39m │ \u001b[90mFloat64\u001b[39m │\n",
"├─────┼───────┼────────┼───────────┼────────────┤\n",
"│ 1 │ 1 │ I │ 0.867347 │ 0.0123688 │\n",
"│ 2 │ 2 │ II │ -0.901744 │ 0.213586 │\n",
"│ 3 │ 3 │ II │ -0.494479 │ 0.00899443 │"
]
},
"execution_count": 27,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# First three rows with all the colums\n",
"df2[1:3, :]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If only specified columns, that is to say, not the range of one, two, and three as we did above, but rather only colums one and three, we create a list to indicate this."
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<table class=\"data-frame\"><thead><tr><th></th><th>A</th><th>C</th></tr><tr><th></th><th>Int64</th><th>Float64</th></tr></thead><tbody><p>10 rows × 2 columns</p><tr><th>1</th><td>1</td><td>0.867347</td></tr><tr><th>2</th><td>2</td><td>-0.901744</td></tr><tr><th>3</th><td>3</td><td>-0.494479</td></tr><tr><th>4</th><td>4</td><td>-0.902914</td></tr><tr><th>5</th><td>5</td><td>0.864401</td></tr><tr><th>6</th><td>6</td><td>2.21188</td></tr><tr><th>7</th><td>7</td><td>0.532813</td></tr><tr><th>8</th><td>8</td><td>-0.271735</td></tr><tr><th>9</th><td>9</td><td>0.502334</td></tr><tr><th>10</th><td>10</td><td>-0.516984</td></tr></tbody></table>"
],
"text/latex": [
"\\begin{tabular}{r|cc}\n",
"\t& A & C\\\\\n",
"\t\\hline\n",
"\t& Int64 & Float64\\\\\n",
"\t\\hline\n",
"\t1 & 1 & 0.867347 \\\\\n",
"\t2 & 2 & -0.901744 \\\\\n",
"\t3 & 3 & -0.494479 \\\\\n",
"\t4 & 4 & -0.902914 \\\\\n",
"\t5 & 5 & 0.864401 \\\\\n",
"\t6 & 6 & 2.21188 \\\\\n",
"\t7 & 7 & 0.532813 \\\\\n",
"\t8 & 8 & -0.271735 \\\\\n",
"\t9 & 9 & 0.502334 \\\\\n",
"\t10 & 10 & -0.516984 \\\\\n",
"\\end{tabular}\n"
],
"text/plain": [
"10×2 DataFrame\n",
"│ Row │ A │ C │\n",
"│ │ \u001b[90mInt64\u001b[39m │ \u001b[90mFloat64\u001b[39m │\n",
"├─────┼───────┼───────────┤\n",
"│ 1 │ 1 │ 0.867347 │\n",
"│ 2 │ 2 │ -0.901744 │\n",
"│ 3 │ 3 │ -0.494479 │\n",
"│ 4 │ 4 │ -0.902914 │\n",
"│ 5 │ 5 │ 0.864401 │\n",
"│ 6 │ 6 │ 2.21188 │\n",
"│ 7 │ 7 │ 0.532813 │\n",
"│ 8 │ 8 │ -0.271735 │\n",
"│ 9 │ 9 │ 0.502334 │\n",
"│ 10 │ 10 │ -0.516984 │"
]
},
"execution_count": 28,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# All rows columns 1 and 3\n",
"df2[:, [1, 3]]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Instead of indicating the column numbers, we can also reference the actual column names (statistical variable names), using symbol notation, i.e. `:A`."
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<table class=\"data-frame\"><thead><tr><th></th><th>A</th><th>C</th></tr><tr><th></th><th>Int64</th><th>Float64</th></tr></thead><tbody><p>10 rows × 2 columns</p><tr><th>1</th><td>1</td><td>0.867347</td></tr><tr><th>2</th><td>2</td><td>-0.901744</td></tr><tr><th>3</th><td>3</td><td>-0.494479</td></tr><tr><th>4</th><td>4</td><td>-0.902914</td></tr><tr><th>5</th><td>5</td><td>0.864401</td></tr><tr><th>6</th><td>6</td><td>2.21188</td></tr><tr><th>7</th><td>7</td><td>0.532813</td></tr><tr><th>8</th><td>8</td><td>-0.271735</td></tr><tr><th>9</th><td>9</td><td>0.502334</td></tr><tr><th>10</th><td>10</td><td>-0.516984</td></tr></tbody></table>"
],
"text/latex": [
"\\begin{tabular}{r|cc}\n",
"\t& A & C\\\\\n",
"\t\\hline\n",
"\t& Int64 & Float64\\\\\n",
"\t\\hline\n",
"\t1 & 1 & 0.867347 \\\\\n",
"\t2 & 2 & -0.901744 \\\\\n",
"\t3 & 3 & -0.494479 \\\\\n",
"\t4 & 4 & -0.902914 \\\\\n",
"\t5 & 5 & 0.864401 \\\\\n",
"\t6 & 6 & 2.21188 \\\\\n",
"\t7 & 7 & 0.532813 \\\\\n",
"\t8 & 8 & -0.271735 \\\\\n",
"\t9 & 9 & 0.502334 \\\\\n",
"\t10 & 10 & -0.516984 \\\\\n",
"\\end{tabular}\n"
],
"text/plain": [
"10×2 DataFrame\n",
"│ Row │ A │ C │\n",
"│ │ \u001b[90mInt64\u001b[39m │ \u001b[90mFloat64\u001b[39m │\n",
"├─────┼───────┼───────────┤\n",
"│ 1 │ 1 │ 0.867347 │\n",
"│ 2 │ 2 │ -0.901744 │\n",
"│ 3 │ 3 │ -0.494479 │\n",
"│ 4 │ 4 │ -0.902914 │\n",
"│ 5 │ 5 │ 0.864401 │\n",
"│ 6 │ 6 │ 2.21188 │\n",
"│ 7 │ 7 │ 0.532813 │\n",
"│ 8 │ 8 │ -0.271735 │\n",
"│ 9 │ 9 │ 0.502334 │\n",
"│ 10 │ 10 │ -0.516984 │"
]
},
"execution_count": 29,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Different notation\n",
"df2[:, [:A, :C]]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `CSV.jl` package's `read()` function can import a comma separated values data file."
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [],
"source": [
"# Make sure to install the package in the REPL first\n",
"using CSV"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The file is saved in the same directory / folder as this notebook file."
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [],
"source": [
"# Import csv file (in same directory / folder)\n",
"data1 = CSV.read(\"CCS.csv\");"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Using the `type()` function, we note that we now have an instance of a dataframe object."
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"DataFrame"
]
},
"execution_count": 32,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"typeof(data1)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's view the first five rows of data."
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<table class=\"data-frame\"><thead><tr><th></th><th>PatientID</th><th>Cat1</th><th>Cat2</th><th>Var1</th><th>Var2</th><th>Var3</th></tr><tr><th></th><th>Int64⍰</th><th>String⍰</th><th>String⍰</th><th>Float64⍰</th><th>Float64⍰</th><th>Float64⍰</th></tr></thead><tbody><p>5 rows × 6 columns</p><tr><th>1</th><td>1</td><td>A</td><td>C</td><td>38.2568</td><td>5.93913</td><td>35.0579</td></tr><tr><th>2</th><td>2</td><td>A</td><td>C</td><td>17.8317</td><td>5.34754</td><td>21.131</td></tr><tr><th>3</th><td>8</td><td>A</td><td>B</td><td>16.0218</td><td>6.60709</td><td>60.9436</td></tr><tr><th>4</th><td>9</td><td>A</td><td>C</td><td>45.1158</td><td>6.00733</td><td>21.8797</td></tr><tr><th>5</th><td>16</td><td>A</td><td>C</td><td>20.448</td><td>8.54819</td><td>20.6623</td></tr></tbody></table>"
],
"text/latex": [
"\\begin{tabular}{r|cccccc}\n",
"\t& PatientID & Cat1 & Cat2 & Var1 & Var2 & Var3\\\\\n",
"\t\\hline\n",
"\t& Int64⍰ & String⍰ & String⍰ & Float64⍰ & Float64⍰ & Float64⍰\\\\\n",
"\t\\hline\n",
"\t1 & 1 & A & C & 38.2568 & 5.93913 & 35.0579 \\\\\n",
"\t2 & 2 & A & C & 17.8317 & 5.34754 & 21.131 \\\\\n",
"\t3 & 8 & A & B & 16.0218 & 6.60709 & 60.9436 \\\\\n",
"\t4 & 9 & A & C & 45.1158 & 6.00733 & 21.8797 \\\\\n",
"\t5 & 16 & A & C & 20.448 & 8.54819 & 20.6623 \\\\\n",
"\\end{tabular}\n"
],
"text/plain": [
"5×6 DataFrame\n",
"│ Row │ PatientID │ Cat1 │ Cat2 │ Var1 │ Var2 │ Var3 │\n",
"│ │ \u001b[90mInt64⍰\u001b[39m │ \u001b[90mString⍰\u001b[39m │ \u001b[90mString⍰\u001b[39m │ \u001b[90mFloat64⍰\u001b[39m │ \u001b[90mFloat64⍰\u001b[39m │ \u001b[90mFloat64⍰\u001b[39m │\n",
"├─────┼───────────┼─────────┼─────────┼──────────┼──────────┼──────────┤\n",
"│ 1 │ 1 │ A │ C │ 38.2568 │ 5.93913 │ 35.0579 │\n",
"│ 2 │ 2 │ A │ C │ 17.8317 │ 5.34754 │ 21.131 │\n",
"│ 3 │ 8 │ A │ B │ 16.0218 │ 6.60709 │ 60.9436 │\n",
"│ 4 │ 9 │ A │ C │ 45.1158 │ 6.00733 │ 21.8797 │\n",
"│ 5 │ 16 │ A │ C │ 20.448 │ 8.54819 │ 20.6623 │"
]
},
"execution_count": 33,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"first(data1, 5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `describe()` function will attempt to summarize all the variables. In the case of categorical variables, an alphabetical arrangement for minimum and maximum values will be stated."
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<table class=\"data-frame\"><thead><tr><th></th><th>variable</th><th>mean</th><th>min</th><th>median</th><th>max</th><th>nunique</th><th>nmissing</th><th>eltype</th></tr><tr><th></th><th>Symbol</th><th>Union…</th><th>Any</th><th>Union…</th><th>Any</th><th>Union…</th><th>Int64</th><th>DataType</th></tr></thead><tbody><p>6 rows × 8 columns</p><tr><th>1</th><td>PatientID</td><td>60.5</td><td>1</td><td>60.5</td><td>120</td><td></td><td>0</td><td>Int64</td></tr><tr><th>2</th><td>Cat1</td><td></td><td>A</td><td></td><td>B</td><td>2</td><td>0</td><td>String</td></tr><tr><th>3</th><td>Cat2</td><td></td><td>B</td><td></td><td>X</td><td>6</td><td>0</td><td>String</td></tr><tr><th>4</th><td>Var1</td><td>27.9679</td><td>15.2356</td><td>22.6801</td><td>84.2378</td><td></td><td>0</td><td>Float64</td></tr><tr><th>5</th><td>Var2</td><td>5.92121</td><td>3.01173</td><td>5.64241</td><td>15.5826</td><td></td><td>0</td><td>Float64</td></tr><tr><th>6</th><td>Var3</td><td>51.95</td><td>20.3153</td><td>44.3042</td><td>147.397</td><td></td><td>0</td><td>Float64</td></tr></tbody></table>"
],
"text/latex": [
"\\begin{tabular}{r|cccccccc}\n",
"\t& variable & mean & min & median & max & nunique & nmissing & eltype\\\\\n",
"\t\\hline\n",
"\t& Symbol & Union… & Any & Union… & Any & Union… & Int64 & DataType\\\\\n",
"\t\\hline\n",
"\t1 & PatientID & 60.5 & 1 & 60.5 & 120 & & 0 & Int64 \\\\\n",
"\t2 & Cat1 & & A & & B & 2 & 0 & String \\\\\n",
"\t3 & Cat2 & & B & & X & 6 & 0 & String \\\\\n",
"\t4 & Var1 & 27.9679 & 15.2356 & 22.6801 & 84.2378 & & 0 & Float64 \\\\\n",
"\t5 & Var2 & 5.92121 & 3.01173 & 5.64241 & 15.5826 & & 0 & Float64 \\\\\n",
"\t6 & Var3 & 51.95 & 20.3153 & 44.3042 & 147.397 & & 0 & Float64 \\\\\n",
"\\end{tabular}\n"
],
"text/plain": [
"6×8 DataFrame. Omitted printing of 1 columns\n",
"│ Row │ variable │ mean │ min │ median │ max │ nunique │ nmissing │\n",
"│ │ \u001b[90mSymbol\u001b[39m │ \u001b[90mUnion…\u001b[39m │ \u001b[90mAny\u001b[39m │ \u001b[90mUnion…\u001b[39m │ \u001b[90mAny\u001b[39m │ \u001b[90mUnion…\u001b[39m │ \u001b[90mInt64\u001b[39m │\n",
"├─────┼───────────┼─────────┼─────────┼─────────┼─────────┼─────────┼──────────┤\n",
"│ 1 │ PatientID │ 60.5 │ 1 │ 60.5 │ 120 │ │ 0 │\n",
"│ 2 │ Cat1 │ │ A │ │ B │ 2 │ 0 │\n",
"│ 3 │ Cat2 │ │ B │ │ X │ 6 │ 0 │\n",
"│ 4 │ Var1 │ 27.9679 │ 15.2356 │ 22.6801 │ 84.2378 │ │ 0 │\n",
"│ 5 │ Var2 │ 5.92121 │ 3.01173 │ 5.64241 │ 15.5826 │ │ 0 │\n",
"│ 6 │ Var3 │ 51.95 │ 20.3153 │ 44.3042 │ 147.397 │ │ 0 │"
]
},
"execution_count": 34,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"describe(data1)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"[Back to the top](#In-this-lecture)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Combining dataframes"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Combining dataframes on a common variable is a very useful operation. Below we create two dataframe instances. Note that both have a `Number` variable."
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {},
"outputs": [],
"source": [
"# Creating DataFrames\n",
"subjects = DataFrame(Number = [100, 101, 102, 103], Stage = [\"I\", \"III\", \"II\", \"I\"])\n",
"treatment = DataFrame(Number = [103, 102, 101, 100], Treatment = [\"A\", \"B\", \"A\", \"B\"]);"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `join()` function takes the dataframe objects that require joining as arguments. The `on =` argument (in symbol form), specifies the variable on which to join. In this default mode, only values for the stated variable that appear in both dataframes will be included."
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<table class=\"data-frame\"><thead><tr><th></th><th>Number</th><th>Stage</th><th>Treatment</th></tr><tr><th></th><th>Int64</th><th>String</th><th>String</th></tr></thead><tbody><p>4 rows × 3 columns</p><tr><th>1</th><td>100</td><td>I</td><td>B</td></tr><tr><th>2</th><td>101</td><td>III</td><td>A</td></tr><tr><th>3</th><td>102</td><td>II</td><td>B</td></tr><tr><th>4</th><td>103</td><td>I</td><td>A</td></tr></tbody></table>"
],
"text/latex": [
"\\begin{tabular}{r|ccc}\n",
"\t& Number & Stage & Treatment\\\\\n",
"\t\\hline\n",
"\t& Int64 & String & String\\\\\n",
"\t\\hline\n",
"\t1 & 100 & I & B \\\\\n",
"\t2 & 101 & III & A \\\\\n",
"\t3 & 102 & II & B \\\\\n",
"\t4 & 103 & I & A \\\\\n",
"\\end{tabular}\n"
],
"text/plain": [
"4×3 DataFrame\n",
"│ Row │ Number │ Stage │ Treatment │\n",
"│ │ \u001b[90mInt64\u001b[39m │ \u001b[90mString\u001b[39m │ \u001b[90mString\u001b[39m │\n",
"├─────┼────────┼────────┼───────────┤\n",
"│ 1 │ 100 │ I │ B │\n",
"│ 2 │ 101 │ III │ A │\n",
"│ 3 │ 102 │ II │ B │\n",
"│ 4 │ 103 │ I │ A │"
]
},
"execution_count": 36,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"#Joining\n",
"df3 = join(subjects, treatment, on = :Number);\n",
"df3"
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {},
"outputs": [],
"source": [
"# Adding a longer list of subjects\n",
"subjects = DataFrame(Number = [100, 101, 102, 103, 104, 105], Stage = [\"I\", \"III\", \"II\", \"I\", \"II\", \"II\"]);"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `kind =` argument allows for more control. An inner join is the default (same as above)."
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<table class=\"data-frame\"><thead><tr><th></th><th>Number</th><th>Stage</th><th>Treatment</th></tr><tr><th></th><th>Int64</th><th>String</th><th>String</th></tr></thead><tbody><p>4 rows × 3 columns</p><tr><th>1</th><td>100</td><td>I</td><td>B</td></tr><tr><th>2</th><td>101</td><td>III</td><td>A</td></tr><tr><th>3</th><td>102</td><td>II</td><td>B</td></tr><tr><th>4</th><td>103</td><td>I</td><td>A</td></tr></tbody></table>"
],
"text/latex": [
"\\begin{tabular}{r|ccc}\n",
"\t& Number & Stage & Treatment\\\\\n",
"\t\\hline\n",
"\t& Int64 & String & String\\\\\n",
"\t\\hline\n",
"\t1 & 100 & I & B \\\\\n",
"\t2 & 101 & III & A \\\\\n",
"\t3 & 102 & II & B \\\\\n",
"\t4 & 103 & I & A \\\\\n",
"\\end{tabular}\n"
],
"text/plain": [
"4×3 DataFrame\n",
"│ Row │ Number │ Stage │ Treatment │\n",
"│ │ \u001b[90mInt64\u001b[39m │ \u001b[90mString\u001b[39m │ \u001b[90mString\u001b[39m │\n",
"├─────┼────────┼────────┼───────────┤\n",
"│ 1 │ 100 │ I │ B │\n",
"│ 2 │ 101 │ III │ A │\n",
"│ 3 │ 102 │ II │ B │\n",
"│ 4 │ 103 │ I │ A │"
]
},
"execution_count": 38,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Inner join\n",
" df4 = join(subjects, treatment, on = :Number, kind = :inner);\n",
" df4"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"An outer join will join both dataframes and add `missing` data."
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<table class=\"data-frame\"><thead><tr><th></th><th>Number</th><th>Stage</th><th>Treatment</th></tr><tr><th></th><th>Int64⍰</th><th>String⍰</th><th>String⍰</th></tr></thead><tbody><p>6 rows × 3 columns</p><tr><th>1</th><td>100</td><td>I</td><td>B</td></tr><tr><th>2</th><td>101</td><td>III</td><td>A</td></tr><tr><th>3</th><td>102</td><td>II</td><td>B</td></tr><tr><th>4</th><td>103</td><td>I</td><td>A</td></tr><tr><th>5</th><td>104</td><td>II</td><td>missing</td></tr><tr><th>6</th><td>105</td><td>II</td><td>missing</td></tr></tbody></table>"
],
"text/latex": [
"\\begin{tabular}{r|ccc}\n",
"\t& Number & Stage & Treatment\\\\\n",
"\t\\hline\n",
"\t& Int64⍰ & String⍰ & String⍰\\\\\n",
"\t\\hline\n",
"\t1 & 100 & I & B \\\\\n",
"\t2 & 101 & III & A \\\\\n",
"\t3 & 102 & II & B \\\\\n",
"\t4 & 103 & I & A \\\\\n",
"\t5 & 104 & II & \\\\\n",
"\t6 & 105 & II & \\\\\n",
"\\end{tabular}\n"
],
"text/plain": [
"6×3 DataFrame\n",
"│ Row │ Number │ Stage │ Treatment │\n",
"│ │ \u001b[90mInt64⍰\u001b[39m │ \u001b[90mString⍰\u001b[39m │ \u001b[90mString⍰\u001b[39m │\n",
"├─────┼────────┼─────────┼───────────┤\n",
"│ 1 │ 100 │ I │ B │\n",
"│ 2 │ 101 │ III │ A │\n",
"│ 3 │ 102 │ II │ B │\n",
"│ 4 │ 103 │ I │ A │\n",
"│ 5 │ 104 │ II │ \u001b[90mmissing\u001b[39m │\n",
"│ 6 │ 105 │ II │ \u001b[90mmissing\u001b[39m │"
]
},
"execution_count": 39,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Outer joing: empty fields filled with missing\n",
"df5 = join(subjects, treatment, on = :Number, kind = :outer);\n",
"df5"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"[Back to the top](#In-this-lecture)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Grouping"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A dataframe can be _spliced_ by grouping rows according to values in a variable."
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<table class=\"data-frame\"><thead><tr><th></th><th>Group</th><th>Variable1</th><th>Variable2</th></tr><tr><th></th><th>String</th><th>Float64</th><th>Float64</th></tr></thead><tbody><p>3 rows × 3 columns</p><tr><th>1</th><td>B</td><td>0.447358</td><td>0.137658</td></tr><tr><th>2</th><td>B</td><td>-0.396211</td><td>0.60808</td></tr><tr><th>3</th><td>B</td><td>0.366773</td><td>0.255054</td></tr></tbody></table>"
],
"text/latex": [
"\\begin{tabular}{r|ccc}\n",
"\t& Group & Variable1 & Variable2\\\\\n",
"\t\\hline\n",
"\t& String & Float64 & Float64\\\\\n",
"\t\\hline\n",
"\t1 & B & 0.447358 & 0.137658 \\\\\n",
"\t2 & B & -0.396211 & 0.60808 \\\\\n",
"\t3 & B & 0.366773 & 0.255054 \\\\\n",
"\\end{tabular}\n"
],
"text/plain": [
"3×3 DataFrame\n",
"│ Row │ Group │ Variable1 │ Variable2 │\n",
"│ │ \u001b[90mString\u001b[39m │ \u001b[90mFloat64\u001b[39m │ \u001b[90mFloat64\u001b[39m │\n",
"├─────┼────────┼───────────┼───────────┤\n",
"│ 1 │ B │ 0.447358 │ 0.137658 │\n",
"│ 2 │ B │ -0.396211 │ 0.60808 │\n",
"│ 3 │ B │ 0.366773 │ 0.255054 │"
]
},
"execution_count": 40,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Creating a new DataFrame\n",
"df6 = DataFrame(Group = rand([\"A\", \"B\", \"C\"], 15), Variable1 = randn(15), Variable2 = rand(15));\n",
"first(df6, 3)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `by()` function takes a dataframe object as first argument. This is followed by a column (variable) on which to group by. Below we use the `size` argument to indicate the number of rows and columns for the number of each unique values that are found in the specified variable."
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<table class=\"data-frame\"><thead><tr><th></th><th>Group</th><th>x1</th></tr><tr><th></th><th>String</th><th>Tuple…</th></tr></thead><tbody><p>3 rows × 2 columns</p><tr><th>1</th><td>B</td><td>(9, 3)</td></tr><tr><th>2</th><td>A</td><td>(4, 3)</td></tr><tr><th>3</th><td>C</td><td>(2, 3)</td></tr></tbody></table>"
],
"text/latex": [
"\\begin{tabular}{r|cc}\n",
"\t& Group & x1\\\\\n",
"\t\\hline\n",
"\t& String & Tuple…\\\\\n",
"\t\\hline\n",
"\t1 & B & (9, 3) \\\\\n",
"\t2 & A & (4, 3) \\\\\n",
"\t3 & C & (2, 3) \\\\\n",
"\\end{tabular}\n"
],
"text/plain": [
"3×2 DataFrame\n",
"│ Row │ Group │ x1 │\n",
"│ │ \u001b[90mString\u001b[39m │ \u001b[90mTuple…\u001b[39m │\n",
"├─────┼────────┼────────┤\n",
"│ 1 │ B │ (9, 3) │\n",
"│ 2 │ A │ (4, 3) │\n",
"│ 3 │ C │ (2, 3) │"
]
},
"execution_count": 41,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"#Grouping using by()\n",
"by(df6, :Group, size)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Since the dataframe has three columns, we note that as the second value in the `count` tuple returned above. The first value shows the number of instances of the unique values found for the specified variable."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Below we create a dataframe instance that shows only the count of the unique values."
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<table class=\"data-frame\"><thead><tr><th></th><th>Group</th><th>Count</th></tr><tr><th></th><th>String</th><th>Int64</th></tr></thead><tbody><p>3 rows × 2 columns</p><tr><th>1</th><td>B</td><td>9</td></tr><tr><th>2</th><td>A</td><td>4</td></tr><tr><th>3</th><td>C</td><td>2</td></tr></tbody></table>"
],
"text/latex": [
"\\begin{tabular}{r|cc}\n",
"\t& Group & Count\\\\\n",
"\t\\hline\n",
"\t& String & Int64\\\\\n",
"\t\\hline\n",
"\t1 & B & 9 \\\\\n",
"\t2 & A & 4 \\\\\n",
"\t3 & C & 2 \\\\\n",
"\\end{tabular}\n"
],
"text/plain": [
"3×2 DataFrame\n",
"│ Row │ Group │ Count │\n",
"│ │ \u001b[90mString\u001b[39m │ \u001b[90mInt64\u001b[39m │\n",
"├─────┼────────┼───────┤\n",
"│ 1 │ B │ 9 │\n",
"│ 2 │ A │ 4 │\n",
"│ 3 │ C │ 2 │"
]
},
"execution_count": 42,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Count unique data point values in :Group column\n",
"by(df6, :Group, dfc -> DataFrame(Count = size(dfc, 1)))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `aggregate()` function also groups a dataframe by unique values for a specified column, but then provides the ability to list statistical tests required."
]
},
{
"cell_type": "code",
"execution_count": 43,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"3×5 DataFrame\n",
"│ Row │ Group │ Variable1_mean │ Variable2_mean │ Variable1_std │ Variable2_std │\n",
"│ │ \u001b[90mString\u001b[39m │ \u001b[90mFloat64\u001b[39m │ \u001b[90mFloat64\u001b[39m │ \u001b[90mFloat64\u001b[39m │ \u001b[90mFloat64\u001b[39m │\n",
"├─────┼────────┼────────────────┼────────────────┼───────────────┼───────────────┤\n",
"│ 1 │ B │ 0.127675 │ 0.446397 │ 0.973237 │ 0.268476 │\n",
"│ 2 │ A │ -0.33429 │ 0.339451 │ 1.04503 │ 0.352194 │\n",
"│ 3 │ C │ -0.902111 │ 0.373207 │ 1.51729 │ 0.368007 │"
]
}
],
"source": [
"# Aggregate for descriptive statistics\n",
"print(aggregate(df6, :Group, [mean, std]))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `groupby()` function actually creates sub-dataframes based on the unique values found in the specified variable."
]
},
{
"cell_type": "code",
"execution_count": 44,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<p><b>GroupedDataFrame with 3 groups based on key: Group</b></p><p><i>First Group (9 rows): Group = \"B\"</i></p><table class=\"data-frame\"><thead><tr><th></th><th>Group</th><th>Variable1</th><th>Variable2</th></tr><tr><th></th><th>String</th><th>Float64</th><th>Float64</th></tr></thead><tbody><tr><th>1</th><td>B</td><td>0.447358</td><td>0.137658</td></tr><tr><th>2</th><td>B</td><td>-0.396211</td><td>0.60808</td></tr><tr><th>3</th><td>B</td><td>0.366773</td><td>0.255054</td></tr><tr><th>4</th><td>B</td><td>0.621673</td><td>0.498734</td></tr><tr><th>5</th><td>B</td><td>2.06353</td><td>0.52509</td></tr><tr><th>6</th><td>B</td><td>-1.41453</td><td>0.265511</td></tr><tr><th>7</th><td>B</td><td>0.134475</td><td>0.110096</td></tr><tr><th>8</th><td>B</td><td>-0.750421</td><td>0.834362</td></tr><tr><th>9</th><td>B</td><td>0.076418</td><td>0.78299</td></tr></tbody></table><p>&vellip;</p><p><i>Last Group (2 rows): Group = \"C\"</i></p><table class=\"data-frame\"><thead><tr><th></th><th>Group</th><th>Variable1</th><th>Variable2</th></tr><tr><th></th><th>String</th><th>Float64</th><th>Float64</th></tr></thead><tbody><tr><th>1</th><td>C</td><td>0.170778</td><td>0.633427</td></tr><tr><th>2</th><td>C</td><td>-1.975</td><td>0.112987</td></tr></tbody></table>"
],
"text/latex": [
"GroupedDataFrame with 3 groups based on key: Group\n",
"\n",
"First Group (9 rows): Group = \"B\"\n",
"\n",
"\\begin{tabular}{r|ccc}\n",
"\t& Group & Variable1 & Variable2\\\\\n",
"\t\\hline\n",
"\t& String & Float64 & Float64\\\\\n",
"\t\\hline\n",
"\t1 & B & 0.447358 & 0.137658 \\\\\n",
"\t2 & B & -0.396211 & 0.60808 \\\\\n",
"\t3 & B & 0.366773 & 0.255054 \\\\\n",
"\t4 & B & 0.621673 & 0.498734 \\\\\n",
"\t5 & B & 2.06353 & 0.52509 \\\\\n",
"\t6 & B & -1.41453 & 0.265511 \\\\\n",
"\t7 & B & 0.134475 & 0.110096 \\\\\n",
"\t8 & B & -0.750421 & 0.834362 \\\\\n",
"\t9 & B & 0.076418 & 0.78299 \\\\\n",
"\\end{tabular}\n",
"\n",
"$\\dots$\n",
"\n",
"Last Group (2 rows): Group = \"C\"\n",
"\n",
"\\begin{tabular}{r|ccc}\n",
"\t& Group & Variable1 & Variable2\\\\\n",
"\t\\hline\n",
"\t& String & Float64 & Float64\\\\\n",
"\t\\hline\n",
"\t1 & C & 0.170778 & 0.633427 \\\\\n",
"\t2 & C & -1.975 & 0.112987 \\\\\n",
"\\end{tabular}\n"
],
"text/plain": [
"GroupedDataFrame with 3 groups based on key: Group\n",
"First Group (9 rows): Group = \"B\"\n",
"│ Row │ Group │ Variable1 │ Variable2 │\n",
"│ │ \u001b[90mString\u001b[39m │ \u001b[90mFloat64\u001b[39m │ \u001b[90mFloat64\u001b[39m │\n",
"├─────┼────────┼───────────┼───────────┤\n",
"│ 1 │ B │ 0.447358 │ 0.137658 │\n",
"│ 2 │ B │ -0.396211 │ 0.60808 │\n",
"│ 3 │ B │ 0.366773 │ 0.255054 │\n",
"│ 4 │ B │ 0.621673 │ 0.498734 │\n",
"│ 5 │ B │ 2.06353 │ 0.52509 │\n",
"│ 6 │ B │ -1.41453 │ 0.265511 │\n",
"│ 7 │ B │ 0.134475 │ 0.110096 │\n",
"│ 8 │ B │ -0.750421 │ 0.834362 │\n",
"│ 9 │ B │ 0.076418 │ 0.78299 │\n",
"⋮\n",
"Last Group (2 rows): Group = \"C\"\n",
"│ Row │ Group │ Variable1 │ Variable2 │\n",
"│ │ \u001b[90mString\u001b[39m │ \u001b[90mFloat64\u001b[39m │ \u001b[90mFloat64\u001b[39m │\n",
"├─────┼────────┼───────────┼───────────┤\n",
"│ 1 │ C │ 0.170778 │ 0.633427 │\n",
"│ 2 │ C │ -1.975 │ 0.112987 │"
]
},
"execution_count": 44,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"#Group\n",
"groupby(df6, :Group)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"By calling the `length()` function, we note that there are indeed three sub-dataframes."
]
},
{
"cell_type": "code",
"execution_count": 45,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"3"
]
},
"execution_count": 45,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"length(groupby(df6, :Group))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Using indexing, we can select any of the three sub-dataframes."
]
},
{
"cell_type": "code",
"execution_count": 46,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<table class=\"data-frame\"><thead><tr><th></th><th>Group</th><th>Variable1</th><th>Variable2</th></tr><tr><th></th><th>String</th><th>Float64</th><th>Float64</th></tr></thead><tbody><p>4 rows × 3 columns</p><tr><th>1</th><td>A</td><td>0.182588</td><td>0.0940369</td></tr><tr><th>2</th><td>A</td><td>-1.58492</td><td>0.337865</td></tr><tr><th>3</th><td>A</td><td>0.799335</td><td>0.838042</td></tr><tr><th>4</th><td>A</td><td>-0.734161</td><td>0.0878598</td></tr></tbody></table>"
],
"text/latex": [
"\\begin{tabular}{r|ccc}\n",
"\t& Group & Variable1 & Variable2\\\\\n",
"\t\\hline\n",
"\t& String & Float64 & Float64\\\\\n",
"\t\\hline\n",
"\t1 & A & 0.182588 & 0.0940369 \\\\\n",
"\t2 & A & -1.58492 & 0.337865 \\\\\n",
"\t3 & A & 0.799335 & 0.838042 \\\\\n",
"\t4 & A & -0.734161 & 0.0878598 \\\\\n",
"\\end{tabular}\n"
],
"text/plain": [
"4×3 SubDataFrame\n",
"│ Row │ Group │ Variable1 │ Variable2 │\n",
"│ │ \u001b[90mString\u001b[39m │ \u001b[90mFloat64\u001b[39m │ \u001b[90mFloat64\u001b[39m │\n",
"├─────┼────────┼───────────┼───────────┤\n",
"│ 1 │ A │ 0.182588 │ 0.0940369 │\n",
"│ 2 │ A │ -1.58492 │ 0.337865 │\n",
"│ 3 │ A │ 0.799335 │ 0.838042 │\n",
"│ 4 │ A │ -0.734161 │ 0.0878598 │"
]
},
"execution_count": 46,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"groupby(df6, :Group)[2]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"[Back to the top](#In-this-lecture)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Sorting"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Sorting using the `sort!()` function (permanent bang version used here), does what is says on the box. A list can be provided to sort by more than one variable."
]
},
{
"cell_type": "code",
"execution_count": 47,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<table class=\"data-frame\"><thead><tr><th></th><th>Group</th><th>Variable1</th><th>Variable2</th></tr><tr><th></th><th>String</th><th>Float64</th><th>Float64</th></tr></thead><tbody><p>7 rows × 3 columns</p><tr><th>1</th><td>A</td><td>-1.58492</td><td>0.337865</td></tr><tr><th>2</th><td>A</td><td>-0.734161</td><td>0.0878598</td></tr><tr><th>3</th><td>A</td><td>0.182588</td><td>0.0940369</td></tr><tr><th>4</th><td>A</td><td>0.799335</td><td>0.838042</td></tr><tr><th>5</th><td>B</td><td>-1.41453</td><td>0.265511</td></tr><tr><th>6</th><td>B</td><td>-0.750421</td><td>0.834362</td></tr><tr><th>7</th><td>B</td><td>-0.396211</td><td>0.60808</td></tr></tbody></table>"
],
"text/latex": [
"\\begin{tabular}{r|ccc}\n",
"\t& Group & Variable1 & Variable2\\\\\n",
"\t\\hline\n",
"\t& String & Float64 & Float64\\\\\n",
"\t\\hline\n",
"\t1 & A & -1.58492 & 0.337865 \\\\\n",
"\t2 & A & -0.734161 & 0.0878598 \\\\\n",
"\t3 & A & 0.182588 & 0.0940369 \\\\\n",
"\t4 & A & 0.799335 & 0.838042 \\\\\n",
"\t5 & B & -1.41453 & 0.265511 \\\\\n",
"\t6 & B & -0.750421 & 0.834362 \\\\\n",
"\t7 & B & -0.396211 & 0.60808 \\\\\n",
"\\end{tabular}\n"
],
"text/plain": [
"7×3 DataFrame\n",
"│ Row │ Group │ Variable1 │ Variable2 │\n",
"│ │ \u001b[90mString\u001b[39m │ \u001b[90mFloat64\u001b[39m │ \u001b[90mFloat64\u001b[39m │\n",
"├─────┼────────┼───────────┼───────────┤\n",
"│ 1 │ A │ -1.58492 │ 0.337865 │\n",
"│ 2 │ A │ -0.734161 │ 0.0878598 │\n",
"│ 3 │ A │ 0.182588 │ 0.0940369 │\n",
"│ 4 │ A │ 0.799335 │ 0.838042 │\n",
"│ 5 │ B │ -1.41453 │ 0.265511 │\n",
"│ 6 │ B │ -0.750421 │ 0.834362 │\n",
"│ 7 │ B │ -0.396211 │ 0.60808 │"
]
},
"execution_count": 47,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df6S = sort!(df6, [:Group, :Variable1]);\n",
"first(df6S, 7)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"[Back to the top](#In-this-lecture)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Unique rows only"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Below we create a dataframe with two identical rows."
]
},
{
"cell_type": "code",
"execution_count": 48,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<table class=\"data-frame\"><thead><tr><th></th><th>A</th><th>B</th><th>C</th></tr><tr><th></th><th>Int64</th><th>Int64</th><th>String</th></tr></thead><tbody><p>6 rows × 3 columns</p><tr><th>1</th><td>1</td><td>11</td><td>A</td></tr><tr><th>2</th><td>2</td><td>12</td><td>B</td></tr><tr><th>3</th><td>2</td><td>12</td><td>B</td></tr><tr><th>4</th><td>3</td><td>13</td><td>C</td></tr><tr><th>5</th><td>4</td><td>14</td><td>D</td></tr><tr><th>6</th><td>5</td><td>15</td><td>E</td></tr></tbody></table>"
],
"text/latex": [
"\\begin{tabular}{r|ccc}\n",
"\t& A & B & C\\\\\n",
"\t\\hline\n",
"\t& Int64 & Int64 & String\\\\\n",
"\t\\hline\n",
"\t1 & 1 & 11 & A \\\\\n",
"\t2 & 2 & 12 & B \\\\\n",
"\t3 & 2 & 12 & B \\\\\n",
"\t4 & 3 & 13 & C \\\\\n",
"\t5 & 4 & 14 & D \\\\\n",
"\t6 & 5 & 15 & E \\\\\n",
"\\end{tabular}\n"
],
"text/plain": [
"6×3 DataFrame\n",
"│ Row │ A │ B │ C │\n",
"│ │ \u001b[90mInt64\u001b[39m │ \u001b[90mInt64\u001b[39m │ \u001b[90mString\u001b[39m │\n",
"├─────┼───────┼───────┼────────┤\n",
"│ 1 │ 1 │ 11 │ A │\n",
"│ 2 │ 2 │ 12 │ B │\n",
"│ 3 │ 2 │ 12 │ B │\n",
"│ 4 │ 3 │ 13 │ C │\n",
"│ 5 │ 4 │ 14 │ D │\n",
"│ 6 │ 5 │ 15 │ E │"
]
},
"execution_count": 48,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Creating a DataFrame with an obvious duplicate row\n",
"df7 = DataFrame(A = [1, 2, 2, 3, 4, 5], B = [11, 12, 12, 13, 14, 15], C = [\"A\", \"B\", \"B\", \"C\", \"D\", \"E\"]);\n",
"df7"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `unique()` function will, as the name implies, delete the duplicate row."
]
},
{
"cell_type": "code",
"execution_count": 49,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<table class=\"data-frame\"><thead><tr><th></th><th>A</th><th>B</th><th>C</th></tr><tr><th></th><th>Int64</th><th>Int64</th><th>String</th></tr></thead><tbody><p>5 rows × 3 columns</p><tr><th>1</th><td>1</td><td>11</td><td>A</td></tr><tr><th>2</th><td>2</td><td>12</td><td>B</td></tr><tr><th>3</th><td>3</td><td>13</td><td>C</td></tr><tr><th>4</th><td>4</td><td>14</td><td>D</td></tr><tr><th>5</th><td>5</td><td>15</td><td>E</td></tr></tbody></table>"
],
"text/latex": [
"\\begin{tabular}{r|ccc}\n",
"\t& A & B & C\\\\\n",
"\t\\hline\n",
"\t& Int64 & Int64 & String\\\\\n",
"\t\\hline\n",
"\t1 & 1 & 11 & A \\\\\n",
"\t2 & 2 & 12 & B \\\\\n",
"\t3 & 3 & 13 & C \\\\\n",
"\t4 & 4 & 14 & D \\\\\n",
"\t5 & 5 & 15 & E \\\\\n",
"\\end{tabular}\n"
],
"text/plain": [
"5×3 DataFrame\n",
"│ Row │ A │ B │ C │\n",
"│ │ \u001b[90mInt64\u001b[39m │ \u001b[90mInt64\u001b[39m │ \u001b[90mString\u001b[39m │\n",
"├─────┼───────┼───────┼────────┤\n",
"│ 1 │ 1 │ 11 │ A │\n",
"│ 2 │ 2 │ 12 │ B │\n",
"│ 3 │ 3 │ 13 │ C │\n",
"│ 4 │ 4 │ 14 │ D │\n",
"│ 5 │ 5 │ 15 │ E │"
]
},
"execution_count": 49,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Only unique rows\n",
"unique(df7)"
]
},
{
"cell_type": "code",
"execution_count": 50,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<table class=\"data-frame\"><thead><tr><th></th><th>A</th><th>B</th><th>C</th></tr><tr><th></th><th>Int64</th><th>Int64</th><th>String</th></tr></thead><tbody><p>6 rows × 3 columns</p><tr><th>1</th><td>1</td><td>11</td><td>A</td></tr><tr><th>2</th><td>2</td><td>12</td><td>B</td></tr><tr><th>3</th><td>2</td><td>12</td><td>B</td></tr><tr><th>4</th><td>3</td><td>13</td><td>C</td></tr><tr><th>5</th><td>4</td><td>14</td><td>D</td></tr><tr><th>6</th><td>5</td><td>15</td><td>E</td></tr></tbody></table>"
],
"text/latex": [
"\\begin{tabular}{r|ccc}\n",
"\t& A & B & C\\\\\n",
"\t\\hline\n",
"\t& Int64 & Int64 & String\\\\\n",
"\t\\hline\n",
"\t1 & 1 & 11 & A \\\\\n",
"\t2 & 2 & 12 & B \\\\\n",
"\t3 & 2 & 12 & B \\\\\n",
"\t4 & 3 & 13 & C \\\\\n",
"\t5 & 4 & 14 & D \\\\\n",
"\t6 & 5 & 15 & E \\\\\n",
"\\end{tabular}\n"
],
"text/plain": [
"6×3 DataFrame\n",
"│ Row │ A │ B │ C │\n",
"│ │ \u001b[90mInt64\u001b[39m │ \u001b[90mInt64\u001b[39m │ \u001b[90mString\u001b[39m │\n",
"├─────┼───────┼───────┼────────┤\n",
"│ 1 │ 1 │ 11 │ A │\n",
"│ 2 │ 2 │ 12 │ B │\n",
"│ 3 │ 2 │ 12 │ B │\n",
"│ 4 │ 3 │ 13 │ C │\n",
"│ 5 │ 4 │ 14 │ D │\n",
"│ 6 │ 5 │ 15 │ E │"
]
},
"execution_count": 50,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df7"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As always, the bang will make the change permament."
]
},
{
"cell_type": "code",
"execution_count": 51,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<table class=\"data-frame\"><thead><tr><th></th><th>A</th><th>B</th><th>C</th></tr><tr><th></th><th>Int64</th><th>Int64</th><th>String</th></tr></thead><tbody><p>5 rows × 3 columns</p><tr><th>1</th><td>1</td><td>11</td><td>A</td></tr><tr><th>2</th><td>2</td><td>12</td><td>B</td></tr><tr><th>3</th><td>3</td><td>13</td><td>C</td></tr><tr><th>4</th><td>4</td><td>14</td><td>D</td></tr><tr><th>5</th><td>5</td><td>15</td><td>E</td></tr></tbody></table>"
],
"text/latex": [
"\\begin{tabular}{r|ccc}\n",
"\t& A & B & C\\\\\n",
"\t\\hline\n",
"\t& Int64 & Int64 & String\\\\\n",
"\t\\hline\n",
"\t1 & 1 & 11 & A \\\\\n",
"\t2 & 2 & 12 & B \\\\\n",
"\t3 & 3 & 13 & C \\\\\n",
"\t4 & 4 & 14 & D \\\\\n",
"\t5 & 5 & 15 & E \\\\\n",
"\\end{tabular}\n"
],
"text/plain": [
"5×3 DataFrame\n",
"│ Row │ A │ B │ C │\n",
"│ │ \u001b[90mInt64\u001b[39m │ \u001b[90mInt64\u001b[39m │ \u001b[90mString\u001b[39m │\n",
"├─────┼───────┼───────┼────────┤\n",
"│ 1 │ 1 │ 11 │ A │\n",
"│ 2 │ 2 │ 12 │ B │\n",
"│ 3 │ 3 │ 13 │ C │\n",
"│ 4 │ 4 │ 14 │ D │\n",
"│ 5 │ 5 │ 15 │ E │"
]
},
"execution_count": 51,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Permanant change\n",
"unique!(df7)\n",
"df7"
]
},
{
"cell_type": "code",
"execution_count": 52,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<table class=\"data-frame\"><thead><tr><th></th><th>A</th><th>B</th><th>C</th></tr><tr><th></th><th>Int64</th><th>Int64</th><th>String</th></tr></thead><tbody><p>5 rows × 3 columns</p><tr><th>1</th><td>1</td><td>11</td><td>A</td></tr><tr><th>2</th><td>2</td><td>12</td><td>B</td></tr><tr><th>3</th><td>3</td><td>13</td><td>C</td></tr><tr><th>4</th><td>4</td><td>14</td><td>D</td></tr><tr><th>5</th><td>5</td><td>15</td><td>E</td></tr></tbody></table>"
],
"text/latex": [
"\\begin{tabular}{r|ccc}\n",
"\t& A & B & C\\\\\n",
"\t\\hline\n",
"\t& Int64 & Int64 & String\\\\\n",
"\t\\hline\n",
"\t1 & 1 & 11 & A \\\\\n",
"\t2 & 2 & 12 & B \\\\\n",
"\t3 & 3 & 13 & C \\\\\n",
"\t4 & 4 & 14 & D \\\\\n",
"\t5 & 5 & 15 & E \\\\\n",
"\\end{tabular}\n"
],
"text/plain": [
"5×3 DataFrame\n",
"│ Row │ A │ B │ C │\n",
"│ │ \u001b[90mInt64\u001b[39m │ \u001b[90mInt64\u001b[39m │ \u001b[90mString\u001b[39m │\n",
"├─────┼───────┼───────┼────────┤\n",
"│ 1 │ 1 │ 11 │ A │\n",
"│ 2 │ 2 │ 12 │ B │\n",
"│ 3 │ 3 │ 13 │ C │\n",
"│ 4 │ 4 │ 14 │ D │\n",
"│ 5 │ 5 │ 15 │ E │"
]
},
"execution_count": 52,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"#Permanant change\n",
"unique!(df7)\n",
"df7"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"[Back to the top](#In-this-lecture)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Deleting rows"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `deleterows!()` function (permanent bang version used here), deletes specified rows."
]
},
{
"cell_type": "code",
"execution_count": 53,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<table class=\"data-frame\"><thead><tr><th></th><th>A</th><th>B</th><th>C</th></tr><tr><th></th><th>Int64</th><th>Int64</th><th>String</th></tr></thead><tbody><p>3 rows × 3 columns</p><tr><th>1</th><td>2</td><td>12</td><td>B</td></tr><tr><th>2</th><td>3</td><td>13</td><td>C</td></tr><tr><th>3</th><td>4</td><td>14</td><td>D</td></tr></tbody></table>"
],
"text/latex": [
"\\begin{tabular}{r|ccc}\n",
"\t& A & B & C\\\\\n",
"\t\\hline\n",
"\t& Int64 & Int64 & String\\\\\n",
"\t\\hline\n",
"\t1 & 2 & 12 & B \\\\\n",
"\t2 & 3 & 13 & C \\\\\n",
"\t3 & 4 & 14 & D \\\\\n",
"\\end{tabular}\n"
],
"text/plain": [
"3×3 DataFrame\n",
"│ Row │ A │ B │ C │\n",
"│ │ \u001b[90mInt64\u001b[39m │ \u001b[90mInt64\u001b[39m │ \u001b[90mString\u001b[39m │\n",
"├─────┼───────┼───────┼────────┤\n",
"│ 1 │ 2 │ 12 │ B │\n",
"│ 2 │ 3 │ 13 │ C │\n",
"│ 3 │ 4 │ 14 │ D │"
]
},
"execution_count": 53,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Permanently\n",
"deleterows!(df7, [1, 5])\n",
"df7"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"[Back to the top](#In-this-lecture)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Julia 1.0.3",
"language": "julia",
"name": "julia-1.0"
},
"language_info": {
"file_extension": ".jl",
"mimetype": "application/julia",
"name": "julia",
"version": "1.0.3"
}
},
"nbformat": 4,
"nbformat_minor": 2
}