## Working with data

## In this lecture

- [Introduction](#Introduction)
- [Distributions](#Distributions)
    - [The normal distribution](#The-normal-distribution)
    - [Other distributions](#Other-distributions)
- [DataFrames](#DataFrames)
    - [Combining dataframes](#Combining-dataframes)
    - [Grouping](#Grouping)
    - [Sorting](#Sorting)
    - [Unique rows only](#Unique-rows-only)
    - [Deleting rows](#Deleting-rows)

## Introduction

The ability to work with data is fundamental to most modern programming tasks.  In this lecture, we take a brief look at how the Julia language handles data through the `Distributions.jl` and `DataFrames.jl` packages.

[Back to the top](#In-this-lecture)

## Distributions

The values of a variable usually follow a pattern.  Such patterns are called distributions.  Distributions are either discrete or continuous.  The `Distributions.jl` package contains most of the common distributions.

We will also use the `Random.jl` package to seed the pseudo-random number generator so that we can reproduce the random values that we are going to use in the lecture.

In [1]:
using Distributions
using Random

[Back to the top](#In-this-lecture)

### The normal distribution

The normal distribution is the famous bell-shaped curve that we are familiar with.  Values around the mean occur most frequently and as values get progressively further away from the mean, they occur less frequently.

In [2]:
#Seed the pseudo-random number generator
Random.seed!(1234)
#Saving the standard normal distribution as an object
n = Distributions.Normal()  # This function is from the Distributions package
#Parameter values of the standard normal distribution
params(n)

(0.0, 1.0)

Using the `params()` function, we note a mean of $0$ and a standard deviation of $1$.  A normal distribution with these parameter values is called the _standard normal distribution_.

The `fieldnames()` function returns the names of the parameters of a given distribution type.  In the case of the normal distribution, these are the mean and the standard deviation, namely $\mu$ and $\sigma$.

In [3]:
# Returning the parameters of the normal distribution
fieldnames(Normal)

(:μ, :σ)

Now we create a variable called `var1` and use the `rand()` function to select $10$ random values from the standard normal distribution.

In [4]:
#Seed the pseudo-random number generator
Random.seed!(1234)
#Select 10 elements at random from n
var1 = rand(n, 10);

We can calculate the average and standard deviation of our randomly selected values.

In [5]:
# Average
mean(var1)

0.18909179133831322

In [6]:
# Standard deviation
std(var1)

0.9879593623730926

The `pdf()` function calculates the value of the probability density function of a given distribution at a specified point.

In [7]:
#Probability density function value at x = 0.3
pdf(Normal(), 0.3)

0.38138781546052414
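
As a quick check on what `pdf()` returns, the sketch below writes the normal density formula out by hand and compares it with the package result.  The helper name `normal_density` is made up for this illustration and is not part of `Distributions.jl`.

```julia
using Distributions

# Density of a normal distribution written out by hand (illustrative helper)
normal_density(x, μ, σ) = exp(-((x - μ) / σ)^2 / 2) / (σ * sqrt(2π))

n = Normal()                 # standard normal: μ = 0, σ = 1
x = 0.3

by_hand = normal_density(x, 0.0, 1.0)
from_package = pdf(n, x)

println(by_hand ≈ from_package)    # the two values agree
println(pdf(n, 0.0) > pdf(n, x))   # the density is highest at the mean
```

Note that a density value is not itself a probability; probabilities for continuous distributions come from areas under the curve.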

The `cdf()` function calculates the value of the cumulative distribution function of a given distribution at a specified point, that is, the probability of a value from $- \infty$ up to that point.

In [8]:
#Cumulative distribution function at x = 0.25
cdf(Normal(), 0.25)

0.5987063256829237
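
Because `cdf()` accumulates probability from $- \infty$, differences between two of its values give the probability of an interval.  The following is a minimal sketch of that idea; the variable names are illustrative only.

```julia
using Distributions

n = Normal()

p_below = cdf(n, 0.25)                   # P(X ≤ 0.25)
p_above = ccdf(n, 0.25)                  # P(X > 0.25), the complement
p_between = cdf(n, 1.0) - cdf(n, -1.0)   # P(-1 < X ≤ 1)

println(p_below + p_above ≈ 1.0)
println(round(p_between, digits = 4))    # roughly 68% of values lie within one σ of the mean
```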

The values for the average and standard deviation can be specified.

In [9]:
# Creating 100 data point values from a normal distribution
# with a mean of 100 and a standard deviation of 10
Random.seed!(1234)
var2 = rand(Normal(100, 10), 100);

In [10]:
# Calculating the mean of var2
mean(var2)

98.52365657772843

In [11]:
#Calculating the standard deviation of var2
std(var2)

9.580963685859091

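The `fit()` function estimates the parameters of a specified distribution from a set of values.  For the normal distribution this is a maximum likelihood estimate, which divides by $n$ rather than $n - 1$ when calculating $\sigma$, so the fitted $\sigma$ below is slightly smaller than the value returned by `std()` above.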

In [12]:
#Using fit() to calculate the parameters of a distribution
fit(Normal, var2)

Normal{Float64}(μ=98.52365657772843, σ=9.532938502804532)
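
The relationship between the fitted $\sigma$ and `std()` can be checked directly.  The sketch below assumes the `corrected` keyword of `std()` from the `Statistics` standard library; the exact random values depend on the Julia version, so only the relationships are printed.

```julia
using Distributions, Random, Statistics

Random.seed!(1234)
x = rand(Normal(100, 10), 100)

fitted = fit(Normal, x)                 # parameter estimates from the data
mu_hat, sigma_hat = params(fitted)

println(mu_hat ≈ mean(x))                        # fitted μ is the sample mean
println(sigma_hat ≈ std(x; corrected = false))   # fitted σ divides by n
println(sigma_hat < std(x))                      # std() divides by n - 1 by default
```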

The `quantile()` function returns the value at a specified percentile (provided as a fraction).  Below we calculate the $2.5$th and $97.5$th percentile values of the standard normal distribution.

In [13]:
# Quantiles
quantile(Normal(), 0.025)

-1.9599639845400592

In [14]:
quantile(Normal(), 0.975)

1.9599639845400576
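
The two quantiles above bound the central $95$% of the standard normal distribution, and `quantile()` is the inverse of `cdf()`.  A short sketch that confirms both statements:

```julia
using Distributions

n = Normal()
lower = quantile(n, 0.025)
upper = quantile(n, 0.975)

println(cdf(n, upper) ≈ 0.975)                  # quantile() undoes cdf()
println(cdf(n, upper) - cdf(n, lower) ≈ 0.95)   # central 95% of the distribution
println(lower ≈ -upper)                         # symmetric around the mean of 0
```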

[Back to the top](#In-this-lecture)

### Other distributions

There are many distributions in the `Distributions.jl` package. In the code below, a few of these are showcased by way of setting parameters, selecting random values, and fitting those values back to the distribution or returning the parameter field names.

In [15]:
# Beta distribution
b = Beta(1, 1)
params(b)
Random.seed!(1234)
var3 = rand(b, 100);
fit(Beta, var3)

Beta{Float64}(α=1.236721159927394, β=1.1368118923305863)

In [16]:
# χ2 distribution
c = Chisq(1)
Random.seed!(1234)
var4 = rand(c, 100)
fieldnames(Chisq) # Degrees of freedom

(:ν,)
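
The distributions shown so far are continuous.  As an illustration of a discrete distribution, the sketch below uses `Binomial`, which is not used elsewhere in this lecture; the choice of $10$ trials and a success probability of $0.5$ is arbitrary.

```julia
using Distributions

d = Binomial(10, 0.5)    # 10 trials, success probability 0.5

println(mean(d))         # n * p
println(var(d))          # n * p * (1 - p)
println(pdf(d, 5))       # probability of exactly 5 successes
println(cdf(d, 5))       # probability of at most 5 successes
```

For a discrete distribution, `pdf()` returns an actual probability (the probability mass) rather than a density.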

[Back to the top](#In-this-lecture)

## DataFrames

The `DataFrames.jl` package allows for the creation of a flat data structure (rows and columns).  Columns are variables and rows are subjects (examples).

In [17]:
using DataFrames

Below, we create an empty dataframe object that we call `df`.

In [18]:
#Create an empty DataFrame
df = DataFrame();

Column headers representing statistical variable names are entered in square brackets as symbols, i.e. preceded by a colon.  We will attach the `var2` set of values as data point entries for this statistical variable.

In [19]:
# Add a column with data point values (rows)
df[:Var2] = var2;

We can print the first $5$ rows to the screen with the `first()` function.

In [20]:
#View first five rows
first(df, 5)

| Row | Var2 |
| --- | --- |
| | Float64 |
| 1 | 108.673 |
| 2 | 90.9826 |
| 3 | 95.0552 |
| 4 | 90.9709 |
| 5 | 108.644 |
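
The `df[:Var2] = ...` form used above belongs to the older `DataFrames.jl` release this notebook was written with.  As far as I am aware, releases from 1.0 onwards no longer accept it, and a column is added with property syntax or with `!` as the row selector.  The sketch below assumes such a newer release; the names and numbers are illustrative only.

```julia
using DataFrames

vals = [1.5, 2.5, 3.5]           # illustrative numbers only

df_new = DataFrame()
df_new.Var2 = vals               # property syntax
df_new[!, :Var2b] = vals .* 2    # indexing syntax; `!` refers to the whole column

println(names(df_new))
println(size(df_new))
```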


Below, we create another statistical variable from data point values that we generated earlier.

In [21]:
# Add another column
df[:Var3] = var3;

The `last()` function shows the specified number of rows from the end of the dataframe.

In [22]:
# View last three rows
last(df, 3)

| Row | Var2 | Var3 |
| --- | --- | --- |
| | Float64 | Float64 |
| 1 | 95.5675 | 0.831916 |
| 2 | 83.3677 | 0.221771 |
| 3 | 94.7877 | 0.655592 |


The `size()` function returns a tuple with the number of rows and columns.

In [23]:
# Dimensions of a DataFrame
size(df)

(100, 2)

The `describe()` function attempts to provide summary statistics of the variables.

In [24]:
# Summarize the content
describe(df)

| Row | variable | mean | min | median | max | nunique | nmissing | eltype |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| | Symbol | Float64 | Float64 | Float64 | Float64 | Nothing | Nothing | DataType |
| 1 | Var2 | 98.5237 | 67.8864 | 98.1718 | 124.175 | | | Float64 |
| 2 | Var3 | 0.521047 | 0.00145384 | 0.522808 | 0.971161 | | | Float64 |


The data type for each variable can be returned.

In [25]:
# Data type only
eltypes(df)

2-element Array{DataType,1}:
 Float64
 Float64
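
To my knowledge, `eltypes()` is not available in recent `DataFrames.jl` releases.  Assuming a 1.x release, the same information can be obtained by broadcasting `eltype` over the columns, as in this sketch (the dataframe `df_types` is made up for the example):

```julia
using DataFrames

df_types = DataFrame(A = 1:3, B = ["x", "y", "z"], C = [0.1, 0.2, 0.3])

println(names(df_types))               # column names
println(eltype.(eachcol(df_types)))    # element type of each column
```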

Below we create a new instance of a dataframe object called `df2`.  It contains four statistical variables.  Note the use of symbol notation in creating the names of these variables.

In [26]:
# Create a bigger DataFrame
df2 = DataFrame()
df2[:A] = 1:10
df2[:B] = ["I", "II", "II", "I", "II","I", "II", "II", "I", "II"]
Random.seed!(1234)
df2[:C] = rand(Normal(), 10)
df2[:D] = rand(Chisq(1), 10);

By using indexing (in square brackets), we can refer to row and column values (i.e. _row, column_).  Below is an example of selecting data point values for rows one through three, showing all the columns.  The colon symbol serves as shortcut syntax for selecting all the columns.

In [27]:
# First three rows with all the columns
df2[1:3, :]

| Row | A | B | C | D |
| --- | --- | --- | --- | --- |
| | Int64 | String | Float64 | Float64 |
| 1 | 1 | I | 0.867347 | 0.0123688 |
| 2 | 2 | II | -0.901744 | 0.213586 |
| 3 | 3 | II | -0.494479 | 0.00899443 |


To select only specific columns rather than a range, for example columns one and three, we pass their indices in an array.

In [28]:
# All rows, columns 1 and 3
df2[:, [1, 3]]

| Row | A | C |
| --- | --- | --- |
| | Int64 | Float64 |
| 1 | 1 | 0.867347 |
| 2 | 2 | -0.901744 |
| 3 | 3 | -0.494479 |
| 4 | 4 | -0.902914 |
| 5 | 5 | 0.864401 |
| 6 | 6 | 2.21188 |
| 7 | 7 | 0.532813 |
| 8 | 8 | -0.271735 |
| 9 | 9 | 0.502334 |
| 10 | 10 | -0.516984 |


Instead of indicating the column numbers, we can also reference the actual column names (statistical variable names), using symbol notation, i.e. `:A`.

In [29]:
# Different notation
df2[:, [:A, :C]]

| Row | A | C |
| --- | --- | --- |
| | Int64 | Float64 |
| 1 | 1 | 0.867347 |
| 2 | 2 | -0.901744 |
| 3 | 3 | -0.494479 |
| 4 | 4 | -0.902914 |
| 5 | 5 | 0.864401 |
| 6 | 6 | 2.21188 |
| 7 | 7 | 0.532813 |
| 8 | 8 | -0.271735 |
| 9 | 9 | 0.502334 |
| 10 | 10 | -0.516984 |
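
Row selection does not have to use positions.  As a hedged sketch, assuming a `DataFrames.jl` release that supports property access and the `!` row selector, rows can also be chosen with a Boolean condition on a column.  The dataframe `df_demo` is made up for this example.

```julia
using DataFrames

df_demo = DataFrame(A = 1:5, B = ["I", "II", "II", "I", "II"])

stage_one = df_demo[df_demo.B .== "I", :]   # rows where B equals "I", all columns
col_a = df_demo[!, :A]                      # a single column as a vector

println(stage_one.A)
println(length(col_a))
```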


The `CSV.jl` package's `read()` function can import a comma separated values data file.

In [30]:
# Make sure to install the package in the REPL first
using CSV

The file is saved in the same directory / folder as this notebook file.

In [31]:
# Import csv file (in same directory / folder)
data1 = CSV.read("CCS.csv");

Using the `typeof()` function, we note that we now have an instance of a dataframe object.

In [32]:
typeof(data1)

DataFrame
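
In recent `CSV.jl` releases, as far as I know, `CSV.read()` expects the type of table to build as a second argument.  The sketch below assumes such a release and writes a tiny, made-up file to a temporary folder so that it does not depend on `CCS.csv`.

```julia
using CSV, DataFrames

# Write a tiny, made-up file so that the example is self-contained
path = joinpath(mktempdir(), "example.csv")
write(path, "PatientID,Cat1,Var1\n1,A,38.2\n2,B,17.8\n")

data_demo = CSV.read(path, DataFrame)   # the second argument names the table type to build

println(typeof(data_demo))
println(size(data_demo))
```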

Let's view the first five rows of data.

In [33]:
first(data1, 5)

| Row | PatientID | Cat1 | Cat2 | Var1 | Var2 | Var3 |
| --- | --- | --- | --- | --- | --- | --- |
| | Int64⍰ | String⍰ | String⍰ | Float64⍰ | Float64⍰ | Float64⍰ |
| 1 | 1 | A | C | 38.2568 | 5.93913 | 35.0579 |
| 2 | 2 | A | C | 17.8317 | 5.34754 | 21.131 |
| 3 | 8 | A | B | 16.0218 | 6.60709 | 60.9436 |
| 4 | 9 | A | C | 45.1158 | 6.00733 | 21.8797 |
| 5 | 16 | A | C | 20.448 | 8.54819 | 20.6623 |


The `describe()` function will attempt to summarize all the variables.  In the case of categorical variables, the minimum and maximum values are the first and last values in alphabetical order.

In [34]:
describe(data1)

| Row | variable | mean | min | median | max | nunique | nmissing | eltype |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| | Symbol | Union… | Any | Union… | Any | Union… | Int64 | DataType |
| 1 | PatientID | 60.5 | 1 | 60.5 | 120 | | 0 | Int64 |
| 2 | Cat1 | | A | | B | 2.0 | 0 | String |
| 3 | Cat2 | | B | | X | 6.0 | 0 | String |
| 4 | Var1 | 27.9679 | 15.2356 | 22.6801 | 84.2378 | | 0 | Float64 |
| 5 | Var2 | 5.92121 | 3.01173 | 5.64241 | 15.5826 | | 0 | Float64 |
| 6 | Var3 | 51.95 | 20.3153 | 44.3042 | 147.397 | | 0 | Float64 |


[Back to the top](#In-this-lecture)

### Combining dataframes

Combining dataframes on a common variable is a very useful operation.  Below we create two dataframe instances.  Note that both have a `Number` variable.

In [35]:
# Creating DataFrames
subjects = DataFrame(Number = [100, 101, 102, 103], Stage = ["I", "III", "II", "I"])
treatment  = DataFrame(Number = [103, 102, 101, 100], Treatment = ["A", "B", "A", "B"]);

The `join()` function takes the dataframe objects to be joined as arguments.  The `on =` argument (in symbol form) specifies the variable on which to join.  In this default mode, only values for the stated variable that appear in both dataframes will be included.

In [36]:
#Joining
df3 = join(subjects, treatment, on = :Number);
df3

| Row | Number | Stage | Treatment |
| --- | --- | --- | --- |
| | Int64 | String | String |
| 1 | 100 | I | B |
| 2 | 101 | III | A |
| 3 | 102 | II | B |
| 4 | 103 | I | A |


In [37]:
# Adding a longer list of subjects
subjects = DataFrame(Number = [100, 101, 102, 103, 104, 105], Stage = ["I", "III", "II", "I", "II", "II"]);

The `kind =` argument allows for more control.  An inner join is the default (same as above).

In [38]:
# Inner join
df4 = join(subjects, treatment, on = :Number, kind = :inner);
df4

| Row | Number | Stage | Treatment |
| --- | --- | --- | --- |
| | Int64 | String | String |
| 1 | 100 | I | B |
| 2 | 101 | III | A |
| 3 | 102 | II | B |
| 4 | 103 | I | A |


An outer join keeps the rows from both dataframes and fills the fields that have no match with `missing`.

In [39]:
# Outer join: empty fields filled with missing
df5  = join(subjects, treatment, on = :Number, kind = :outer);
df5

| Row | Number | Stage | Treatment |
| --- | --- | --- | --- |
| | Int64⍰ | String⍰ | String⍰ |
| 1 | 100 | I | B |
| 2 | 101 | III | A |
| 3 | 102 | II | B |
| 4 | 103 | I | A |
| 5 | 104 | II | missing |
| 6 | 105 | II | missing |
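
The `join()` function with a `kind =` argument reflects the `DataFrames.jl` release used for this notebook.  To my knowledge, 1.x releases provide a separate function for each kind of join instead.  A sketch under that assumption, with dataframe names chosen for this example:

```julia
using DataFrames

subjects_demo = DataFrame(Number = [100, 101, 102, 103, 104, 105],
                          Stage = ["I", "III", "II", "I", "II", "II"])
treatment_demo = DataFrame(Number = [103, 102, 101, 100],
                           Treatment = ["A", "B", "A", "B"])

inner = innerjoin(subjects_demo, treatment_demo, on = :Number)
outer = outerjoin(subjects_demo, treatment_demo, on = :Number)
left = leftjoin(subjects_demo, treatment_demo, on = :Number)

println(size(inner))                          # only subjects present in both tables
println(size(outer))                          # every subject, gaps filled with missing
println(count(ismissing, outer.Treatment))    # number of subjects without a treatment
```

A left join keeps every row of the first dataframe, which for these two tables gives the same rows as the outer join.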


[Back to the top](#In-this-lecture)

### Grouping

A dataframe can be _split_ by grouping rows according to the values in a variable.

In [40]:
# Creating a new DataFrame
df6 = DataFrame(Group = rand(["A", "B", "C"], 15), Variable1 = randn(15), Variable2 = rand(15));
first(df6, 3)

| Row | Group | Variable1 | Variable2 |
| --- | --- | --- | --- |
| | String | Float64 | Float64 |
| 1 | B | 0.447358 | 0.137658 |
| 2 | B | -0.396211 | 0.60808 |
| 3 | B | 0.366773 | 0.255054 |


The `by()` function takes a dataframe object as its first argument.  This is followed by the column (variable) to group by.  Below we pass the `size` function as the third argument, which returns the number of rows and columns for each unique value found in the specified variable.

In [41]:
#Grouping using by()
by(df6, :Group, size)

| Row | Group | x1 |
| --- | --- | --- |
| | String | Tuple… |
| 1 | B | (9, 3) |
| 2 | A | (4, 3) |
| 3 | C | (2, 3) |


Since the dataframe has three columns, we note that as the second value in each tuple returned above.  The first value shows the number of rows for each unique value found in the specified variable.

Below we create a dataframe instance that shows only the count of the unique values.

In [42]:
# Count unique data point values in :Group column
by(df6, :Group, dfc -> DataFrame(Count = size(dfc, 1)))

| Row | Group | Count |
| --- | --- | --- |
| | String | Int64 |
| 1 | B | 9 |
| 2 | A | 4 |
| 3 | C | 2 |


The `aggregate()` function also groups a dataframe by the unique values in a specified column, and then applies each of the listed summary functions to the remaining columns.

In [43]:
# Aggregate for descriptive statistics
print(aggregate(df6, :Group, [mean, std]))

3×5 DataFrame
│ Row │ Group  │ Variable1_mean │ Variable2_mean │ Variable1_std │ Variable2_std │
│     │ String │ Float64        │ Float64        │ Float64       │ Float64       │
├─────┼────────┼────────────────┼────────────────┼───────────────┼───────────────┤
│ 1   │ B      │ 0.127675       │ 0.446397       │ 0.973237      │ 0.268476      │
│ 2   │ A      │ -0.33429       │ 0.339451       │ 1.04503       │ 0.352194      │
│ 3   │ C      │ -0.902111      │ 0.373207       │ 1.51729       │ 0.368007      │
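
The `by()` and `aggregate()` functions belong to the older `DataFrames.jl` release used here.  As far as I am aware, 1.x releases express both operations with `combine()` applied to the result of `groupby()`.  The sketch below assumes such a release and uses a small, made-up dataframe.

```julia
using DataFrames, Statistics

df_demo = DataFrame(Group = ["A", "B", "A", "B", "B"],
                    Variable1 = [1.0, 2.0, 3.0, 4.0, 6.0])

gdf = groupby(df_demo, :Group)

# Number of rows per group, similar to the by() example
counts = combine(gdf, nrow => :Count)

# Mean and standard deviation per group, similar to the aggregate() example
stats = combine(gdf, :Variable1 => mean => :Variable1_mean,
                     :Variable1 => std => :Variable1_std)

println(counts)
println(stats)
```

Each `source => function => target` pair names the column to read, the function to apply, and the name of the resulting column.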

The `groupby()` function actually creates sub-dataframes based on the unique values found in the specified variable.  The output below displays only the first and last of these sub-dataframes.

In [44]:
#Group
groupby(df6, :Group)

| Row | Group | Variable1 | Variable2 |
| --- | --- | --- | --- |
| | String | Float64 | Float64 |
| 1 | B | 0.447358 | 0.137658 |
| 2 | B | -0.396211 | 0.60808 |
| 3 | B | 0.366773 | 0.255054 |
| 4 | B | 0.621673 | 0.498734 |
| 5 | B | 2.06353 | 0.52509 |
| 6 | B | -1.41453 | 0.265511 |
| 7 | B | 0.134475 | 0.110096 |
| 8 | B | -0.750421 | 0.834362 |
| 9 | B | 0.076418 | 0.78299 |

| Row | Group | Variable1 | Variable2 |
| --- | --- | --- | --- |
| | String | Float64 | Float64 |
| 1 | C | 0.170778 | 0.633427 |
| 2 | C | -1.975 | 0.112987 |


By calling the `length()` function, we note that there are indeed three sub-dataframes.

In [45]:
length(groupby(df6, :Group))

3

Using indexing, we can select any of the three sub-dataframes.

In [46]:
groupby(df6, :Group)[2]

| Row | Group | Variable1 | Variable2 |
| --- | --- | --- | --- |
| | String | Float64 | Float64 |
| 1 | A | 0.182588 | 0.0940369 |
| 2 | A | -1.58492 | 0.337865 |
| 3 | A | 0.799335 | 0.838042 |
| 4 | A | -0.734161 | 0.0878598 |
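
Besides indexing, the grouped object can be looped over, with each iteration yielding one sub-dataframe.  A minimal sketch, assuming a `DataFrames.jl` release that provides `nrow()`; the dataframe and the dictionary `group_sizes` are illustrative.

```julia
using DataFrames

df_demo = DataFrame(Group = ["B", "A", "B", "C", "B"],
                    Variable1 = [0.4, 0.2, -0.4, 0.1, 0.6])
gdf = groupby(df_demo, :Group)

# Record the number of rows in each sub-dataframe
group_sizes = Dict{String, Int}()
for sub in gdf
    group_sizes[first(sub.Group)] = nrow(sub)
end

println(length(gdf))
println(group_sizes)
```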


[Back to the top](#In-this-lecture)

### Sorting

Sorting using the `sort!()` function (permanent bang version used here) does what it says on the box.  An array of column names can be provided to sort by more than one variable.

In [47]:
df6S = sort!(df6, [:Group, :Variable1]);
first(df6S, 7)

| Row | Group | Variable1 | Variable2 |
| --- | --- | --- | --- |
| | String | Float64 | Float64 |
| 1 | A | -1.58492 | 0.337865 |
| 2 | A | -0.734161 | 0.0878598 |
| 3 | A | 0.182588 | 0.0940369 |
| 4 | A | 0.799335 | 0.838042 |
| 5 | B | -1.41453 | 0.265511 |
| 6 | B | -0.750421 | 0.834362 |
| 7 | B | -0.396211 | 0.60808 |
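
Because `sort!()` changes the dataframe in place, `df6S` and `df6` above refer to the same sorted object.  The version without the bang returns a sorted copy and leaves the original untouched, as in this sketch with a made-up dataframe:

```julia
using DataFrames

df_demo = DataFrame(Group = ["B", "A", "B", "A"],
                    Variable1 = [0.4, -1.5, -0.7, 0.8])

sorted_copy = sort(df_demo, [:Group, :Variable1])    # returns a new dataframe
descending = sort(df_demo, :Variable1, rev = true)   # largest value first

println(df_demo.Group)           # original order is untouched
println(sorted_copy.Variable1)
println(descending.Variable1)
```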


[Back to the top](#In-this-lecture)

### Unique rows only

Below we create a dataframe with two identical rows.

In [48]:
# Creating a DataFrame with an obvious duplicate row
df7 = DataFrame(A = [1, 2, 2, 3, 4, 5],  B = [11, 12, 12, 13, 14, 15], C = ["A", "B", "B", "C", "D", "E"]);
df7

| Row | A | B | C |
| --- | --- | --- | --- |
| | Int64 | Int64 | String |
| 1 | 1 | 11 | A |
| 2 | 2 | 12 | B |
| 3 | 2 | 12 | B |
| 4 | 3 | 13 | C |
| 5 | 4 | 14 | D |
| 6 | 5 | 15 | E |


The `unique()` function returns a copy of the dataframe without the duplicate row.  The original dataframe is left unchanged, as shown by displaying `df7` again.

In [49]:
# Only unique rows
unique(df7)

| Row | A | B | C |
| --- | --- | --- | --- |
| | Int64 | Int64 | String |
| 1 | 1 | 11 | A |
| 2 | 2 | 12 | B |
| 3 | 3 | 13 | C |
| 4 | 4 | 14 | D |
| 5 | 5 | 15 | E |


In [50]:
df7

| Row | A | B | C |
| --- | --- | --- | --- |
| | Int64 | Int64 | String |
| 1 | 1 | 11 | A |
| 2 | 2 | 12 | B |
| 3 | 2 | 12 | B |
| 4 | 3 | 13 | C |
| 5 | 4 | 14 | D |
| 6 | 5 | 15 | E |


As always, the bang version makes the change permanent.

In [51]:
# Permanent change
unique!(df7)
df7

| Row | A | B | C |
| --- | --- | --- | --- |
| | Int64 | Int64 | String |
| 1 | 1 | 11 | A |
| 2 | 2 | 12 | B |
| 3 | 3 | 13 | C |
| 4 | 4 | 14 | D |
| 5 | 5 | 15 | E |
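
Two related tools are worth a brief sketch: `nonunique()` flags the rows that repeat an earlier row, and `unique()` accepts a column so that duplicates are judged on that column only.  The dataframe below is made up for the example.

```julia
using DataFrames

df_demo = DataFrame(A = [1, 2, 2, 3], B = [11, 12, 12, 13], C = ["A", "B", "B", "C"])

println(nonunique(df_demo))           # true marks a row that repeats an earlier row
println(nrow(unique(df_demo)))        # copy without the duplicate row
println(nrow(unique(df_demo, :A)))    # uniqueness judged on column A only
println(nrow(df_demo))                # the original still has all its rows
```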


[Back to the top](#In-this-lecture)

### Deleting rows

The `deleterows!()` function (permanent bang version used here) deletes the specified rows.

In [53]:
# Permanently
deleterows!(df7, [1, 5])
df7

| Row | A | B | C |
| --- | --- | --- | --- |
| | Int64 | Int64 | String |
| 1 | 2 | 12 | B |
| 2 | 3 | 13 | C |
| 3 | 4 | 14 | D |
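
The name `deleterows!()` comes from the older `DataFrames.jl` release used in this notebook.  To my knowledge, recent 1.x releases use `deleteat!()` for the same purpose, and `filter()` offers a way to drop rows by condition without changing the original.  A sketch under those assumptions:

```julia
using DataFrames

df_demo = DataFrame(A = [1, 2, 3, 4, 5], B = [11, 12, 13, 14, 15])

deleteat!(df_demo, [1, 5])                  # remove rows 1 and 5 in place
println(df_demo.A)

kept = filter(:A => a -> a > 2, df_demo)    # new dataframe with rows where A > 2
println(kept.A)
println(nrow(df_demo))                      # filter() left df_demo as it was
```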


[Back to the top](#In-this-lecture)