Binning Data: quantile, cut, ntile, and logical indexing

Adam Ginensky

January 29, 2020

Introduction

  1. Binning data: histograms of data with too many distinct values take a long time to compute.
  2. Quantile regressions are another example.

What is binning?

  1. To which bin does a given data point belong?
  2. Can we assign all values in a given bin one standard value?
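
As a minimal sketch of these two questions in base R: the values in `vals` and the cut points in `breaks` below are made up purely for illustration, and using the bin midpoint as the "standard value" is an assumption, not the only reasonable choice.

```r
# Hypothetical data and cut points, chosen only for illustration
vals   <- c(3, 17, 42, 58, 99)
breaks <- c(0, 25, 50, 75, 100)

# Question 1: which bin does each point belong to?
# findInterval() returns, for each value, the index of the interval it falls in.
bin.id <- findInterval(vals, breaks, rightmost.closed = TRUE)

# Question 2: give every value in a bin one standard value (here, the midpoint)
mids   <- (head(breaks, -1) + tail(breaks, -1)) / 2
binned <- mids[bin.id]

print(bin.id)
print(binned)
```

Here `rightmost.closed = TRUE` keeps a value equal to the largest cut point inside the last bin instead of starting a new one.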

The quantile() function

# the integers 1 to 100
x = c(1:100)
quantile(x) # Shows the cut points to divide the data into quartiles
##     0%    25%    50%    75%   100% 
##   1.00  25.75  50.50  75.25 100.00
quantile(x,probs = seq(0,1,.1)) # compute the deciles
##    0%   10%   20%   30%   40%   50%   60%   70%   80%   90%  100% 
##   1.0  10.9  20.8  30.7  40.6  50.5  60.4  70.3  80.2  90.1 100.0
quantile(x,probs = seq(0,1,.25)) # computes the quartiles.
##     0%    25%    50%    75%   100% 
##   1.00  25.75  50.50  75.25 100.00

The output of quantile() is a named numeric vector of cut points. Each name is the percentage label of its cut point (for example "50%").
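
Because the result is a named vector, a cut point can be looked up by its label as well as by position. The sketch below uses only base R; the variable names `q`, `median.value`, and `cuts` are illustrative.

```r
x <- 1:100
q <- quantile(x)               # named numeric vector of five cut points

print(names(q))                # the percentage labels
print(q["50%"])                # one cut point, looked up by label (name kept)

median.value <- q[["50%"]]     # [[ ]] returns the bare number without the name
cuts <- unname(q)              # plain numeric vector, e.g. for use as breaks
print(median.value)
```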

x = c(1:1000)
quantile(x) # displays the division
##      0%     25%     50%     75%    100% 
##    1.00  250.75  500.50  750.25 1000.00
y = quantile(x,probs = seq(0,1,1/1000)) # divide the data into 1000 bins
y[1] # minimum value
## 0.0% 
##    1
y[1000] # start of last bin
##   99.9% 
## 999.001
y[1001] # maximum value 
## 100.0% 
##   1000
y[501] # the median 
## 50.0% 
## 500.5
head(names(y))
## [1] "0.0%" "0.1%" "0.2%" "0.3%" "0.4%" "0.5%"

What can we do with this?

library(dplyr) # we need to load the library
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
v1 = matrix(rnorm(1000000),1000000) # matrix with 1M rows and 1 column

see1= ntile(v1,100)
see2 = ntile(v1,1000)

head(see1) # first 6 bins
## [1] 67 62 63 32 91 51
summary(see1) # as expected
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00   25.75   50.50   50.50   75.25  100.00
head(see2)
## [1] 663 619 630 315 909 505
summary(see2)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.0   250.8   500.5   500.5   750.2  1000.0
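
The title also mentions cut(). As a rough base-R counterpart to ntile(), one can pass quantile cut points to cut() as breaks. This is a sketch under assumptions: the seed, the smaller vector `v`, and the choice of 10 bins are illustrative, and with tied values the bin sizes may differ from what ntile() produces.

```r
set.seed(1)                       # arbitrary seed, for reproducibility only
v <- rnorm(1000)                  # much smaller than v1, so it runs instantly

brk <- quantile(v, probs = seq(0, 1, 0.1))   # 11 decile cut points
bin <- cut(v, breaks = brk,
           include.lowest = TRUE, # keep min(v) in the first bin
           labels = FALSE)        # return integer bin codes, not a factor

print(range(bin))                 # smallest and largest bin code
print(table(bin))                 # number of observations per bin
```

Without `include.lowest = TRUE` the minimum of `v` falls outside the first interval and its bin becomes NA.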

And finally: logical indexing

pctl = quantile(v1, probs = seq(0,1,.01)) # 101 cut points of the data being binned
pctl.low = pctl[1:100] # the lower value of all the bins
pctl.high = pctl[2:101] # the upper values
pctl.mean = .5*(pctl.low + pctl.high) # midpoint of each bin
pct.bin = pctl.mean[ntile(v1,100)] # replace each value of v1 by its bin's midpoint
  1. pct.bin is a vector with length(v1) elements.
  2. For each row j, ntile() gives the tile that v1[j] falls in, and the j-th element of pct.bin is the entry of pctl.mean for that tile.
  3. In more detail: if v1[200] is in the fourth tile, then the 200th element of pct.bin is pctl.mean[4].
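
The indexing step can be seen in isolation on a toy case. In this sketch `bin.value` and `tile` are hypothetical stand-ins for pctl.mean and the output of ntile(); the numbers are made up.

```r
bin.value <- c(10, 20, 30, 40)        # hypothetical: one standard value per bin
tile <- c(4L, 1L, 3L, 2L, 3L, 1L)     # hypothetical: bin number of each observation

# Indexing the lookup vector by the tile numbers returns one entry per observation
binned <- bin.value[tile]

print(binned)
print(length(binned) == length(tile))
```

Each element of `tile` selects the matching entry of `bin.value`, so the result has one value per observation even though the lookup vector has only one value per bin.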

Thanks for your attention + Conclusion