Package 'jrrosell'

Title: Personal R package for Jordi Rosell
Description: Useful functions for personal usage.
Authors: Jordi Rosell [aut, cre]
Maintainer: Jordi Rosell <[email protected]>
License: CC0
Version: 0.0.0.9011
Built: 2024-10-28 18:23:20 UTC
Source: https://github.com/jrosell/jrrosell

Help Index


Add hash for each row

Description

Sorts the column names, hashes every row, and adds the hash as a new column.

Usage

add_row_hash(df, primary_keys)

Arguments

df

a data.frame

primary_keys

the column names of the primary key

Examples

df <- data.frame(
  id = c(1, 2, 3),
  name = c("AAAAA", "BBBB", "CCC")
)
add_row_hash(df, id)

Data type utilities

Description

Get the bit representation of a double number

Usage

as.bitstring(x)

Arguments

x

A numeric vector.

Details

Get the bit representation of a double number. numToBits() returns the bits in reverse order, with the least significant bit (LSB) first and the most significant bit (MSB) last. Applying rev() restores the usual convention of MSB first and LSB last.
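
The rev() step described above can be sketched in base R. This is a minimal illustration, not the package's source code: the helper name bitstring_sketch is hypothetical, and it assumes R 4.1 or later, where numToBits() is available.

```r
# Hypothetical helper for illustration only; not the package's implementation.
bitstring_sketch <- function(x) {
  vapply(as.double(x), function(v) {
    # numToBits() returns 64 raw values (00 or 01) for one double.
    # rev() flips them so the string reads MSB first, as the text describes.
    bits <- rev(as.integer(numToBits(v)))
    paste(bits, collapse = "")
  }, character(1))
}

a <- bitstring_sketch(0.1 + 0.2)
b <- bitstring_sketch(0.3)
nchar(a)  # 64, one character per bit
a == b    # FALSE: the two doubles differ in their bits
```

Comparing the two strings shows why 0.1 + 0.2 == 0.3 evaluates to FALSE.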

Source

https://youtu.be/J4DnzjIFj8w

Examples

0.1 + 0.2 == 0.3
as.bitstring(0.1 + 0.2)
as.bitstring(0.3)

Multiple aside functions with base R pipe

Description

Multiple aside functions with base R pipe

Usage

aside(x, ...)

Arguments

x

An object

...

functions to run aside

Examples

n_try <- 1
rnorm(200) |>
  matrix(ncol = 2) |>
  aside(
    print("Matrix prepared"),
    print(n_try)
  ) |>
  colSums()

Calculate split proportion

Description

From a data frame, it returns the minimal split proportion for validation.

Usage

calc_split_prop(df)

Arguments

df

A data frame

Details

The calc_split_prop function returns the optimal split proportion for your validation set according to the number of rows.

Source

https://stats.stackexchange.com/a/305063/7387

Examples

calc_split_prop(data.frame(row = 1:891))

Calculate split size

Description

For binary classification problems, it returns the minimal assessment/validation set size for the desired std_err.

Usage

calc_split_size(std_err = 0.001)

Arguments

std_err

The desired std_err numeric (default 0.001)

Details

The calc_split_size function returns the minimal validation size for the expected probabilities and the desired error.
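
One common way to arrive at such a size uses the standard error of an estimated proportion, sqrt(p * (1 - p) / n), solved for n. The sketch below assumes the worst case p = 0.5; the helper name split_size_sketch and the p argument are hypothetical, and the package may compute the size differently.

```r
# Hypothetical helper; assumes SE = sqrt(p * (1 - p) / n) with worst case p = 0.5.
split_size_sketch <- function(std_err = 0.001, p = 0.5) {
  n <- p * (1 - p) / std_err^2
  # Round away floating point noise before taking the ceiling.
  ceiling(round(n, 6))
}

split_size_sketch(std_err = 0.02)  # 625
split_size_sketch()                # 250000
```

Halving the desired standard error multiplies the required size by four, which is why small values of std_err demand large validation sets.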

Source

https://stats.stackexchange.com/a/304996/7387

Examples

calc_split_size()
calc_split_size(std_err = 0.02)

Create a vector of characters from a string

Description

Create a vector of characters from a string

Usage

chars(x, ...)

Arguments

x

a character vector of length 1 (a single string).

...

unused

Details

chars expects a single string as input. To create a list of these, consider lapply(strings, chars).
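
The single-string contract and the lapply() pattern can be sketched in base R. The helper chars_sketch is hypothetical and only approximates the behavior described here; it is not the package's implementation.

```r
# Hypothetical stand-in for chars(); splits one string into single characters.
chars_sketch <- function(x) {
  stopifnot(is.character(x), length(x) == 1L)
  strsplit(x, "")[[1]]
}

chars_sketch("hola")

# For several strings, apply the helper to each one to get a list.
strings <- c("hola", "adeu")
lapply(strings, chars_sketch)
```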

Value

a vector of characters

See Also

https://github.com/jonocarroll/charcuterie

Examples

chars("hola")

Check if the latest GitHub version is installed

Description

Check if the latest version on the main GitHub branch is installed.

Usage

check_installed_gihub(repo)

Arguments

repo

a github repo/package. Ex: check_installed_gihub("tidyverse/dplyr")

Examples

if (FALSE) {
  check_installed_gihub("jrosell/jrrosell")
}

Count the number of duplicated rows

Description

Count the number of duplicated rows

Usage

count_duplicated_rows(df)

Arguments

df

a data.frame

Examples

count_duplicated_rows(data.frame(a = c(1, 2, 3), b = c(3, 4, 5)))
count_duplicated_rows(data.frame(a = c(1, 2, 3), b = c(1, 4, 5)))
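
Neither data frame above contains a repeated row. As a minimal base R sketch of the idea, assuming that only the extra copies of a row are counted (the package may count differently):

```r
# Rows 1 and 2 are identical, so there is one duplicated row.
df <- data.frame(a = c(1, 1, 2), b = c(3, 3, 4))

# duplicated() flags every row that repeats an earlier row.
n_dup <- sum(duplicated(df))
n_dup  # 1
```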

Count a variable or variables sorted

Description

It returns the ordered counts of the variable in the data.frame.

Usage

count_sorted(df, ...)

Arguments

df

a data.frame

...

the variables to use and other arguments to count

Examples

data.frame(a = c("x", "y", "x"), b = c("z", "z", "n")) |>
  count_sorted(a)

Detect cores that could be used

Description

Select the number of cores to use, bounded by max and min, from the available cores.

Usage

detect_cores(max = 10, min = 2)

Arguments

max

An integer with the max desired cores (default 10)

min

An integer with the min desired cores (default 2)

Details

The detect_cores function uses the parallelly package. It returns the desired max cores if that many are available, and it fails if fewer than min cores are available.
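
The selection rule can be sketched as follows. This is an illustration under stated assumptions: detect_cores_sketch and its available argument are hypothetical, and it uses parallel::detectCores() as a stand-in for the parallelly package.

```r
# Hypothetical helper; `available` is exposed only to make the rule easy to test.
detect_cores_sketch <- function(max = 10, min = 2,
                                available = parallel::detectCores()) {
  if (is.na(available) || available < min) {
    stop("Fewer than `min` cores are available.")
  }
  # Never return more cores than requested or than the machine offers.
  base::min(max, available)
}

detect_cores_sketch(max = 5, min = 1, available = 8)   # 5
detect_cores_sketch(max = 10, min = 2, available = 4)  # 4
```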

Examples

cores <- detect_cores(max = 5, min = 1)
print(cores)
if (FALSE) {
  library(jrrosell)
  library(future)
  plan(multisession, workers = detect_cores(max = 10, min = 2))
  plan(sequential)
}

Fit a workflow with specific parameters

Description

Fit a workflow with specific parameters

Usage

fit_results(wf, resamples, param_info = NULL, grid = 10, fn = "tune_grid", ...)

Arguments

wf

workflow

resamples

rset

param_info

for tune_* functions

grid

for tune_* functions

fn

the name of the function to run when tuning

...

Optional engine arguments

Examples

library(tidymodels)
library(xgboost)
library(modeldata)
data(cells)
split <- cells |>
  mutate(across(where(is.character), as.factor)) |>
  sample_n(500) |>
  initial_split(strata = case)
train <- training(split)
resamples <- vfold_cv(train, v = 2, strata = case)
wf_spec <- train |>
  recipe(case ~ .) |>
  step_integer(all_nominal_predictors()) |>
  workflow(boost_tree(mode = "classification"))
res_spec <- wf_spec |> fit_results(resamples)
res_spec |> collect_metrics()

Glimpse multiple datasets

Description

Glimpse multiple datasets

Usage

glimpses(...)

Arguments

...

Multiple data.frame

Examples

df1 <- data.frame(a = c(1, 2))
df2 <- data.frame(b = c(3, 4))
glimpses(df1, df2)

Do the last fit and get the metrics

Description

Do the last fit and get the metrics

Usage

last_fit_metrics(res, split, metric)

Arguments

res

Tune results

split

The initial split object

metric

What metric to use to select the best workflow

Examples

library(tidymodels)
library(xgboost)
library(modeldata)
data(cells)
split <- cells |>
  mutate(across(where(is.character), as.factor)) |>
  sample_n(500) |>
  initial_split(strata = class)
train <- training(split)
folds <- vfold_cv(train, v = 2, strata = class)
wf <- train |>
  recipe(case ~ .) |>
  step_integer(all_nominal_predictors()) |>
  workflow_boost_tree()
res <- wf |>
  tune::tune_grid(
    resamples = folds,
    grid = 2,
    metrics = metric_set(roc_auc),
    control = tune::control_grid(save_workflow = TRUE, verbose = FALSE)
  )
res |> collect_metrics()
res |> last_fit_metrics(split, "roc_auc")
best <- res |> fit_best()
best |>
  augment(testing(split)) |>
  roc_auc(case, .pred_Test) |>
  pull(.estimate)

Name unnamed chunks in .Rmd or .qmd files. Use with caution.

Description

Name unnamed chunks in .Rmd or .qmd files. Use with caution.

Usage

name_unnamed_chunks(file_path)

Arguments

file_path

the file name


Center and scale double vectors

Description

Center and scale double vectors

Usage

normalize_vec(...)

Arguments

...

a double vector or multiple double vectors

Examples

normalize_vec(1, 2, 3)
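
Centering and scaling usually means subtracting the mean and dividing by the standard deviation. The sketch below assumes the sample standard deviation from stats::sd(); the helper normalize_sketch is hypothetical and handles a single vector only.

```r
# Hypothetical helper; assumes centering by the mean and scaling by the sample sd.
normalize_sketch <- function(x) {
  (x - mean(x)) / stats::sd(x)
}

normalize_sketch(c(1, 2, 3))  # -1 0 1
```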

Make a sound and send an email when a process finished

Description

The notify_finished function makes a sound using beepr::beep, composes an email, sends it, and returns the result of the blastula::smtp_send call.

Usage

notify_finished(name, body = "", ..., sound = 1, tictoc_result = NULL)

Arguments

name

The process name (Required)

body

The contents of the email (Default "")

...

Additional arguments to pass to the template function. If you're using the default template, you can use font_family to control the base font, and content_width to control the width of the main content; see blastula_template(). By default, the content_width is set to 1000px. Using widths less than 600px is generally not advised but, if necessary, be sure to test such HTML emails with a wide range of email clients before sending to the intended recipients. The Outlook mail client (Windows, Desktop) does not respect content_width.

sound

The sound for beepr::beep call (Default 1)

tictoc_result

the result from tictoc::toc (Default NULL)

Details

The following environment variables should be set:

  • MY_SMTP_USER from

  • MY_SMTP_RECIPIENT to

  • MY_SMTP_PASSWORD service password (for gmail you can use https://myaccount.google.com/apppasswords)

  • MY_SMTP_PROVIDER blastula provider (gmail if not set)
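
A minimal sketch of setting these variables for the current R session. All values are placeholders; for real use, the same names can be stored in ~/.Renviron so that the password stays out of scripts.

```r
# Placeholder values for illustration only.
Sys.setenv(
  MY_SMTP_USER = "sender@example.com",
  MY_SMTP_RECIPIENT = "recipient@example.com",
  MY_SMTP_PASSWORD = "app-password-placeholder"
)

# Mirrors the documented fallback: gmail is used when the provider is not set.
provider <- Sys.getenv("MY_SMTP_PROVIDER", unset = "gmail")
```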

Examples

if (exists("not_run")) {
  tictoc::tic()
  Sys.sleep(1)
  jrrosell::notify_finished("job", "Well done", sound = "fanfare", tictoc_result = tictoc::toc())
}

Github name of the package

Description

Get the name of the package from the DESCRIPTION file of the master branch in the github repo

Usage

package_github_name(x, file_lines = NULL)

Arguments

x

a single repo/package to check Ex: package_github_name("tidyverse/dplyr")

file_lines

(default = NULL, internal)

Examples

if (FALSE) {
  package_github_name("jrosell/jrrosell")
}

Github version of the package

Description

Get the version from the DESCRIPTION file of the master branch in the github repo

Usage

package_github_version(x, file_lines = NULL)

Arguments

x

a single repo/package to check Ex: package_github_version("tidyverse/dplyr")

file_lines

(default = NULL, internal)

Examples

if (FALSE) {
  package_github_version("jrosell/jrrosell")
}
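
Reading a field from DESCRIPTION lines can be sketched offline with base R. The helper description_field_sketch is hypothetical; it shows one way to parse lines such as those passed through file_lines, not the package's own code.

```r
# Hypothetical helper; parses DESCRIPTION-style lines with read.dcf().
description_field_sketch <- function(file_lines, field) {
  con <- textConnection(file_lines)
  on.exit(close(con))
  dcf <- read.dcf(con, fields = field)
  unname(dcf[1, field])
}

lines <- c("Package: jrrosell", "Version: 0.0.0.9011")
description_field_sketch(lines, "Version")
description_field_sketch(lines, "Package")
```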

Plot bars for non double columns

Description

Plot bars for non double columns

Usage

plot_bars(df, ..., top_values = 50)

Arguments

df

a data.frame

...

optional parameters to geom_histogram

top_values

the number of most common values to show (default 50)

Examples

plot_bars(data.frame(a = c("x", "y"), b = c("z", "z")))

Plot histograms for double columns

Description

Plot histograms for double columns

Usage

plot_histograms(df, ...)

Arguments

df

a data.frame

...

optional parameters to geom_histogram

Examples

plot_histograms(data.frame(a = c(1, 2), b = c(1, 3)))

Plot missing values

Description

Plot missing values

Usage

plot_missing(df)

Arguments

df

a data.frame

Examples

plot_missing(data.frame(a = c(1, NA), b = c(NA, 4)))

Plot a variable

Description

It returns a bar or a histogram of the variable

Usage

plot_variable(df, variable, ..., type = "numeric")

Arguments

df

a data.frame

variable

the variable to use.

...

params passed to geom_*

type

numeric (default) or nominal.

Examples

data.frame(a = c("x", "y", "y"), b = c("z", "z", "x")) |> plot_variable(a)

Prep, juice and glimpse a recipe or workflow

Description

Prep, juice and glimpse a recipe or workflow

Usage

prep_juice(object)

Arguments

object

A recipe or a workflow object with a recipe

Source

https://recipes.tidymodels.org/reference/update.step.html

Examples

recipes::recipe(spray ~ ., data = InsectSprays) |>
  prep_juice()
recipes::recipe(spray ~ ., data = InsectSprays) |>
  workflows::workflow(parsnip::linear_reg()) |>
  prep_juice()

Prep, juice and get cols from a recipe or workflow

Description

Prep, juice and get cols from a recipe or workflow

Usage

prep_juice_ncol(object)

Arguments

object

A recipe or a workflow object with a recipe

Examples

recipes::recipe(spray ~ ., data = InsectSprays) |>
  prep_juice_ncol()

Read character columns with clean names

Description

It's useful for reading the most common types of flat file data, comma separated values and tab separated values.

Usage

read_chr(file, delim = ",", locale = NULL, ...)

Arguments

file

Either a path to a file, a connection, or literal data (either a single string or a raw vector).

delim

Single character used to separate fields within a record.

locale

The locale controls defaults that vary from place to place. The default locale is US-centric (like R), but you can use locale() to create your own locale that controls things like the default time zone, encoding, decimal mark, big mark, and day/month names.

...

Other parameters to readr::read_delim.

Details

The read_chr function works like readr::read_delim, except that the returned columns are all character and have clean names. It requires the readr and janitor packages to be installed.

Examples

es <- readr::locale("es", tz = "Europe/Madrid", decimal_mark = ",", grouping_mark = ".")
read_chr(readr::readr_example("mtcars.csv"), delim = ",", locale = es)

Read the html text of an url

Description

It's useful for getting the text for webpages in a single character vector.

Usage

read_url(url, sleep = 0)

Arguments

url

Full url including http or https protocol and the page path.

sleep

Seconds to sleep after the request is done and before returning the result.

Details

The read_url function uses rvest::read_html and purrr::possibly, and it is fault tolerant.

Examples

if (FALSE) read_url("https://www.google.cat/", sleep = 1)

Read a sheet from an xlsx file into a tibble

Description

It's useful for reading a single sheet from an Excel/OpenOffice file.

Usage

read_xlsx(xlsxFile, ..., sheet = 1, startRow = 1)

Arguments

xlsxFile

The name of the file.

...

Other parameters to the openxlsx::read.xlsx function

sheet

The name or index of the sheet (default 1)

startRow

The number of the starting reading row (default 1)

Details

The read_xlsx function is a wrapper for openxlsx::read.xlsx.

Examples

l <- list("IRIS" = iris, "MTCATS" = mtcars, matrix(runif(1000), ncol = 5))
tmp_file <- tempfile(fileext = ".xlsx")
write_xlsx(l, tmp_file, colWidths = c(NA, "auto", "auto"))
read_xlsx(tmp_file)
file.remove(tmp_file)

Sanitize title with dashes

Description

It generates URL slugs as WordPress does.

Usage

sanitize_title_with_dashes(title)

Arguments

title

the title

Examples

sanitize_title_with_dashes("Hello world")
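
A rough base R sketch of slug generation. The helper slug_sketch is hypothetical and covers only the simplest ASCII case; WordPress applies further rules (for example for accented characters) that are not reproduced here.

```r
# Hypothetical helper; a simplified ASCII-only approximation of slug generation.
slug_sketch <- function(title) {
  slug <- tolower(title)
  # Collapse every run of characters other than a-z and 0-9 into one dash.
  slug <- gsub("[^a-z0-9]+", "-", slug)
  # Trim dashes left at the start or the end.
  gsub("^-+|-+$", "", slug)
}

slug_sketch("Hello world")  # "hello-world"
```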

Tune a recipe using glmnet and lightgbm and stacks

Description

Tune a recipe using glmnet and lightgbm and stacks

Usage

score_recipe(rec, resamples, grids = list(10, 10), metric = "accuracy")

Arguments

rec

recipe

resamples

rset

grids

for glmnet and lightgbm tuning

metric

to be compared


spain_ccaas

Description

spain_ccaas

Usage

spain_ccaas

Format

spain_ccaas

A sf object with 19 rows and 4 columns:

OBJECTID
codigo
nombre
geometry

Source

https://github.com/koldLight/curso-r-dataviz/blob/master/dat/spain_ccaas.geojson

Examples

library(sf)
data(spain_ccaas)
head(spain_ccaas)

spain_provinces

Description

spain_provinces

Usage

spain_provinces

Format

spain_provinces

A sf object with 60 rows and 4 columns:

OBJECTID
codigo
nombre
geometry

Source

https://github.com/koldLight/curso-r-dataviz/blob/master/dat/spain_provinces.geojson

Examples

library(sf)
data(spain_provinces)
head(spain_provinces)

Sum the missing values from a data.frame

Description

Sum the missing values from a data.frame

Usage

sum_missing(...)

Arguments

...

one or multiple data.frame

Examples

sum_missing(data.frame(a = c(1, 2), b = c(3, 4)))
sum_missing(data.frame(a = c(1, NA), b = c(3, 4)))
sum_missing(data.frame(a = c(1, NA), b = c(NA, 4)))
sum_missing(data.frame(a = c(NA, NA), b = c(NA, NA)))
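
The counting itself can be sketched in base R with is.na(). The helper sum_missing_sketch is hypothetical, and returning one count per data frame is an assumption about the output shape.

```r
# Hypothetical helper; returns one NA count per data frame passed in.
sum_missing_sketch <- function(...) {
  vapply(list(...), function(df) sum(is.na(df)), integer(1))
}

sum_missing_sketch(
  data.frame(a = c(1, 2), b = c(3, 4)),
  data.frame(a = c(1, NA), b = c(NA, 4))
)  # 0 2
```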

Select constant columns from a data.frame

Description

Select constant columns from a data.frame

Usage

summarize_n_distinct(df)

Arguments

df

a data.frame

Examples

summarize_n_distinct(data.frame(a = c(1, 2), b = c(2, 3)))
summarize_n_distinct(data.frame(a = c(1, 1), b = c(2, 3)))

Tee pipe that returns the original value instead of the result

Description

Pipe a value forward into a function or call expression and return the original value instead of the result. This is useful when an expression is used for its side effect, such as plotting or printing.

Usage

tee(x, expr)

Arguments

x

An object

expr

An expression

Details

The tee pipe works like |>, except the return value is x itself, and not the result of expr call.
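
The return-the-input behavior can be sketched with a plain function. The helper tee_sketch is hypothetical and accepts only a function, which is a simplification of the function-or-call-expression interface described above.

```r
# Hypothetical helper; runs f for its side effect and hands x back unchanged.
tee_sketch <- function(x, f) {
  f(x)
  x
}

seen <- NULL
result <- c(1, 2, 3) |>
  tee_sketch(\(v) seen <<- sum(v)) |>  # side effect: record the sum
  rev()                                # receives the original vector

seen    # 6
result  # 3 2 1
```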

Thanks

I want to give credit to Michael Milton and Matthew Kay for the idea and the code.

Source

https://mastodon.social/@[email protected]/109555362766969210

Examples

library(ggplot2)
rnorm(200) |>
  matrix(ncol = 2) |>
  as.data.frame() |>
  tee(\(x) {
    ggplot(x, aes(V1, V2)) +
      geom_point()
  }) |>
  colSums()

Sets a minimal theme using the Roboto font family

Description

It requires roboto fonts installed in your O.S. and run z

Usage

theme_roboto(
  base_size = 11,
  strip_text_size = 12,
  strip_text_margin = 5,
  subtitle_size = 13,
  subtitle_margin = 10,
  plot_title_size = 16,
  plot_title_margin = 10,
  ...
)

Arguments

base_size

= 11

strip_text_size

= 12

strip_text_margin

= 5

subtitle_size

= 13

subtitle_margin

= 10

plot_title_size

= 16

plot_title_margin

= 10

...

Other parameters passed to theme_set

Examples

library(jrrosell)
library(ggplot2)
theme_set(theme_roboto())
ggplot(iris, aes(Species)) +
  geom_bar()

Sets a dark blue colored dark minimal theme using the Roboto font family

Description

Sets a dark blue colored dark minimal theme using the Roboto font family

Usage

theme_set_roboto_darkblue(...)

Arguments

...

Other parameters passed to theme_set

Examples

library(jrrosell)
library(ggplot2)
theme_set_roboto_darkblue()
ggplot(iris, aes(Species)) +
  geom_bar()

Update recipe step values by id

Description

Update the values of a specific recipe step located by id

Usage

update_step(object, target_id, ...)

Arguments

object

A recipe or a workflow object with a recipe

target_id

The id name of the step

...

The arguments to update the step.

Examples

recipes::recipe(spray ~ ., data = InsectSprays) |>
  recipes::step_ns(count, deg_free = hardhat::tune(), id = "ns") |>
  update_step("ns", deg_free = 1)

Create an xgboost tunable workflow for regression and classification

Description

Create an xgboost tunable workflow for regression and classification

Usage

workflow_boost_tree(rec, engine = "xgboost", counts = TRUE, ...)

Arguments

rec

preprocessing recipe to build the workflow

engine

xgboost, lightgbm (xgboost by default)

counts

Optional logical argument: whether mtry uses counts or not

...

optional engine arguments

Examples

library(tidymodels)
library(xgboost)
library(modeldata)
library(future)
data(cells)
split <- cells |>
  mutate(across(where(is.character), as.factor)) |>
  sample_n(500) |>
  initial_split(strata = class)
train <- training(split)
folds <- vfold_cv(train, v = 2, strata = class)
wf <- train |>
  recipe(case ~ .) |>
  step_integer(all_nominal_predictors()) |>
  workflow_boost_tree()
doFuture::registerDoFuture()
plan(sequential)
res <- wf |>
  tune::tune_grid(
    folds,
    grid = 2,
    metrics = metric_set(roc_auc),
    control = tune::control_grid(save_workflow = TRUE, verbose = FALSE)
  )
res |> collect_metrics()
res |> last_fit_metrics(split, "roc_auc")
best <- res |> fit_best()
best |>
  augment(testing(split)) |>
  roc_auc(case, .pred_Test) |>
  pull(.estimate)

Create a tunable glmnet workflow for regression and classification

Description

Create a tunable glmnet workflow for regression and classification

Usage

workflow_elasticnet(rec, engine = "glmnet", ...)

Arguments

rec

preprocessing recipe to build the workflow

engine

glmnet, spark, brulee (glmnet by default)

...

Optional engine arguments

Examples

library(tidymodels)
library(glmnet)
library(modeldata)
library(future)
data(cells)
split <- cells |>
  mutate(across(where(is.character), as.factor)) |>
  sample_n(500) |>
  initial_split(strata = class)
train <- training(split)
folds <- vfold_cv(train, v = 2, strata = class)
wf <- train |>
  recipe(case ~ .) |>
  step_integer(all_nominal_predictors()) |>
  workflow_elasticnet()
doFuture::registerDoFuture()
plan(sequential)
res <- wf |>
  tune::tune_grid(
    folds,
    grid = 2,
    metrics = metric_set(roc_auc),
    control = tune::control_grid(save_workflow = TRUE, verbose = FALSE)
  )
res |> collect_metrics()
res |> last_fit_metrics(split, "roc_auc")
best <- res |> fit_best()
best |>
  augment(testing(split)) |>
  roc_auc(case, .pred_Test) |>
  pull(.estimate)

Write a list of tibbles to a xlsx file

Description

It's useful for saving multiple datasets to multiple sheets of a single Excel/OpenOffice/LibreOffice file.

Usage

write_xlsx(data, distfile, ...)

Arguments

data

A named list of tibbles

distfile

The name of the destination file.

...

Other parameters to the openxlsx::write.xlsx function

Details

The write_xlsx function is a wrapper for openxlsx::write.xlsx.

Examples

l <- list("IRIS" = iris, "MTCATS" = mtcars, matrix(runif(1000), ncol = 5))
tmp_file <- tempfile(fileext = ".xlsx")
write_xlsx(l, tmp_file, colWidths = c(NA, "auto", "auto"))
file.remove(tmp_file)