Samplers
utilities.samplers.memory_efficient_sampling
Sampling a random subset from a csv file without loading the full dataset into memory.
Authors:
Vanessa Graber (graber@ice.csic.es)
Celsa Pardo Araujo (pardo@ice.csic.es)
choose_rows(number_of_rows_to_select, total_number_of_rows, previously_chosen_rows=None)
Choose a subset of random indices from all the indices of a dataframe.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| number_of_rows_to_select | int | Number of rows to randomly select from the full dataset, not counting the header rows. | required |
| total_number_of_rows | int | Number of rows in the full dataset, not counting the header rows. | required |
| previously_chosen_rows | list | Rows already chosen for a previous subset. | None |
Returns:
| Type | Description |
|---|---|
| list | A sorted list of the randomly chosen indices. |
Source code in utilities/samplers/memory_efficient_sampling.py
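As a rough illustration of the behaviour described above (not the code in that file), the sketch below draws a sorted list of random row indices. It assumes that indices in `previously_chosen_rows` must not be drawn again, which the description suggests but does not state. The name `choose_rows_sketch` and the `seed` argument are hypothetical additions.

```python
import random


def choose_rows_sketch(number_of_rows_to_select, total_number_of_rows,
                       previously_chosen_rows=None, seed=None):
    """Return a sorted list of random row indices (header rows not counted)."""
    # Assumption: indices picked for an earlier subset are excluded from the draw.
    excluded = set(previously_chosen_rows or [])
    candidates = [i for i in range(total_number_of_rows) if i not in excluded]
    if number_of_rows_to_select > len(candidates):
        raise ValueError("Not enough unchosen rows left to draw from.")
    rng = random.Random(seed)  # `seed` is a hypothetical extra for reproducibility
    return sorted(rng.sample(candidates, number_of_rows_to_select))


# Two successive subsets of a 20-row dataset that share no rows.
first = choose_rows_sketch(5, 20, seed=0)
second = choose_rows_sketch(5, 20, previously_chosen_rows=first, seed=1)
```

Sorting the indices means a later pass over the file can pick the rows up in a single forward scan.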
select(file_path, size_subset, size_full_dataset, previously_chosen_rows=None)
Select a random subset from a dataset without loading the full file into memory.
This implementation only works when the full dataset has two header rows, as in our final_pop_dyn.csv; it will not work otherwise.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| file_path | Path | Path to the full dataset. | required |
| size_subset | int | Number of rows in the desired random subset, not counting the header rows. | required |
| size_full_dataset | int | Number of rows in the full dataset, not counting the header rows. | required |
| previously_chosen_rows | list | Rows already chosen for a previous subset. | None |
Returns:
| Type | Description |
|---|---|
| DataFrame | DataFrame of the random subset. |
Source code in utilities/samplers/memory_efficient_sampling.py
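The idea of reading only the chosen rows can be sketched with the standard library alone. This is an illustration under stated assumptions, not the code in that file: `select_rows_sketch` is a hypothetical name, it returns plain lists rather than a DataFrame, and the column names in the sample data are made up. The two header rows are always kept, and data-row indices start at 0 after them.

```python
import csv
import io


def select_rows_sketch(lines, chosen_rows, n_header_rows=2):
    """Stream CSV records and keep only the header rows plus the chosen data rows."""
    wanted = set(chosen_rows)
    header, data = [], []
    for record_number, row in enumerate(csv.reader(lines)):
        if record_number < n_header_rows:
            header.append(row)  # the header rows are always kept
        elif record_number - n_header_rows in wanted:
            data.append(row)  # data-row index does not count the header rows
    return header, data


# Tiny in-memory stand-in for a file with two header rows (hypothetical columns).
text = "id,distance\n-,kpc\n" + "".join(f"{i},{i * 0.5}\n" for i in range(10))
header, subset = select_rows_sketch(io.StringIO(text), chosen_rows=[1, 4, 7])
```

Because the reader is consumed one record at a time, memory use scales with the size of the subset rather than the full file. With pandas, `read_csv` also accepts a callable for `skiprows`, which is one way to obtain a DataFrame directly; whether `select` does this is not stated here.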
utilities.samplers.pop_sampler
Population sampler script.
For every population simulated with initialize_evolve_population.py using different initial parameters, this script randomly samples a population with a reduced number of stars.
It expects that a set of populations has already been generated, either with that script directly or with the helper and its particular directory tree.
The user can choose the total number of stars to randomly select from the original simulated population, can weight the selection according to the stars' distances from the Sun, and can provide a distance cut-off to select only stars that are nearer to the Sun.
The resampled population is saved in a .pkl.gz file alongside the .json file containing the related labels.
To display the help message for running the code:
python pop_sampler.py --help
This lists all the relevant arguments that can be used.
Authors:
Michele Ronchi (ronchi@ice.csic.es)
calculate_selection_weights(d)
Calculate the weights to assign to every star for selection. Weights are evaluated as a function of distance from the Sun, since the nearest stars are easier and more likely to be detected.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| d | ndarray | Array of distances from the Sun [kpc]. | required |
Returns:
| Type | Description |
|---|---|
| ndarray | Array of selection weights. |
Source code in utilities/samplers/pop_sampler.py
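The functional form of the weights is not given here, so the sketch below simply assumes an inverse-square dependence on distance to show how such weights could feed a weighted draw with a distance cut-off. It is not the code in that file. The function name, the inverse-square form, the cut-off value and the sample sizes are all hypothetical.

```python
import numpy as np


def selection_weights_sketch(d):
    """Hypothetical weights: inverse-square in distance, normalized to sum to 1."""
    w = 1.0 / np.square(d)
    return w / w.sum()


rng = np.random.default_rng(0)
d = rng.uniform(0.1, 10.0, size=1000)  # made-up distances from the Sun [kpc]

# Optional distance cut-off: keep only stars closer than d_max (hypothetical value).
d_max = 5.0
near = np.flatnonzero(d < d_max)

# Weighted draw without replacement among the stars that pass the cut.
weights = selection_weights_sketch(d[near])
chosen = rng.choice(near, size=100, replace=False, p=weights)
```

Any weight function that decreases with distance would favour nearby stars in the same way; only the normalization to unit sum is required by `Generator.choice`.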
data_sampler(args)
This function reads the folder of simulated population files (usually generated by the simulation helper) and creates simulated population files with a reduced number of stars by randomly sampling the original evolved population file.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| args | Namespace | An argparse.Namespace object containing the following attributes: | required |
Source code in utilities/samplers/pop_sampler.py
utilities.samplers.random_sampler
Calculating the cumulative distribution function for a given probability density function using the trapezoidal rule and drawing random values from the cumulative distribution and probability density function.
Authors:
Vanessa Graber (graber@ice.csic.es)
Michele Ronchi (ronchi@ice.csic.es)
cdf_calculator(x, pdf)
Calculating the cumulative distribution function for any given probability density function evaluated at the points x using the trapezoidal rule.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| x | ndarray | Discrete set of values at which the pdf is evaluated. | required |
| pdf | Callable | Probability density function. | required |
Returns:
| Type | Description |
|---|---|
| ndarray | Normalized cumulative distribution function. |
Source code in utilities/samplers/random_sampler.py
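A minimal sketch of the trapezoidal-rule calculation described above, written independently of the code in that file. It assumes the returned array has the same length as `x`, starts at 0 at `x[0]` and is divided by its final value so that it ends at 1. The name `cdf_sketch` is hypothetical.

```python
import numpy as np


def cdf_sketch(x, pdf):
    """Cumulative trapezoidal integral of pdf(x), normalized to end at 1."""
    y = pdf(x)
    # Area of each trapezoid between neighbouring grid points.
    areas = 0.5 * (y[1:] + y[:-1]) * np.diff(x)
    # Assumption: the cdf is 0 at the first grid point.
    cdf = np.concatenate(([0.0], np.cumsum(areas)))
    return cdf / cdf[-1]


x = np.linspace(-5.0, 5.0, 1001)
cdf = cdf_sketch(x, lambda t: np.exp(-0.5 * t**2))  # unnormalized Gaussian
```

Because of the final division, the pdf does not need to be normalized beforehand.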
random_from_cdf(x, cdf, num_draw)
Drawing random values from a given normalized cumulative distribution function.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| x | ndarray | Discrete set of values at which the cdf is evaluated. | required |
| cdf | ndarray | Normalized cumulative distribution function. | required |
| num_draw | int | Number of values to draw. | required |
Returns:
| Type | Description |
|---|---|
| ndarray | Random values drawn from the cdf. |
Source code in utilities/samplers/random_sampler.py
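One common way to draw from a tabulated cdf is inverse transform sampling, sketched below. Whether the code in that file uses exactly this approach is an assumption. The name `draw_from_cdf_sketch` and the `rng` argument are hypothetical, and the sketch requires `cdf` to be increasing.

```python
import numpy as np


def draw_from_cdf_sketch(x, cdf, num_draw, rng=None):
    """Inverse transform sampling: map uniform draws through the inverse cdf."""
    # `rng` is a hypothetical extra argument for reproducibility.
    rng = np.random.default_rng() if rng is None else rng
    u = rng.uniform(0.0, 1.0, size=num_draw)
    # Interpolating x as a function of cdf inverts the tabulated cdf.
    return np.interp(u, cdf, x)


x = np.linspace(0.0, 1.0, 101)
cdf = x**2  # cdf of the pdf p(x) = 2x on [0, 1]
samples = draw_from_cdf_sketch(x, cdf, 10_000, rng=np.random.default_rng(0))
```

The linear interpolation lets the samples take values between grid points, so the accuracy depends on how finely `x` resolves the cdf.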
random_from_pdf(x, pdf, num_draw)
Drawing random values from a given probability density function.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| x | ndarray | Discrete set of values at which the pdf is evaluated. | required |
| pdf | Callable | Probability density function. | required |
| num_draw | int | Number of values to draw. | required |
Returns:
| Type | Description |
|---|---|
| ndarray | Random values drawn from the pdf. |
Source code in utilities/samplers/random_sampler.py
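Drawing from a pdf can be sketched by chaining the two steps documented on this page: build a normalized cdf with the trapezoidal rule, then invert it with uniform random numbers. This is an illustration that assumes this composition, not the code in that file. The name `draw_from_pdf_sketch` and the `rng` argument are hypothetical, and the pdf is assumed to be strictly positive on the grid.

```python
import numpy as np


def draw_from_pdf_sketch(x, pdf, num_draw, rng=None):
    """Build a trapezoidal cdf from pdf(x), then sample it by inversion."""
    # `rng` is a hypothetical extra argument for reproducibility.
    rng = np.random.default_rng() if rng is None else rng
    y = pdf(x)
    # Step 1: normalized cumulative distribution via the trapezoidal rule.
    areas = 0.5 * (y[1:] + y[:-1]) * np.diff(x)
    cdf = np.concatenate(([0.0], np.cumsum(areas)))
    cdf /= cdf[-1]
    # Step 2: inverse transform sampling on the tabulated cdf.
    return np.interp(rng.uniform(size=num_draw), cdf, x)


x = np.linspace(0.0, 10.0, 2001)
samples = draw_from_pdf_sketch(
    x, lambda t: np.exp(-t), 20_000, rng=np.random.default_rng(1)
)
```

For this exponential pdf truncated to [0, 10], the sample mean should land close to 1.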
random_from_pdf_2d(x1, x2, pdf_2d, num_draw)
Drawing random values from a given 2D probability density function. x1 is the variable running along the rows (axis=0), x2 is the variable running along the columns (axis=1) of the 2D array defining the pdf.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| x1 | ndarray | Discrete set of values for coordinate x1 at which the pdf is evaluated. | required |
| x2 | ndarray | Discrete set of values for coordinate x2 at which the pdf is evaluated. | required |
| pdf_2d | ndarray | 2D probability density function. | required |
| num_draw | int | Number of values to draw. | required |
Returns:
| Type | Description |
|---|---|
| Tuple[ndarray, ndarray] | Random points of coordinates (x1, x2) drawn from the pdf. |
Source code in utilities/samplers/random_sampler.py
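The method used in that file is not described here, so the sketch below swaps in a simple alternative: pick grid cells with probability proportional to the pdf value and return the corresponding grid coordinates. It assumes uniformly spaced grids, so all cells have the same area, and it only returns points that lie on the grid. The name `draw_from_pdf_2d_sketch` and the `rng` argument are hypothetical. The axis convention follows the description above: rows correspond to `x1`, columns to `x2`.

```python
import numpy as np


def draw_from_pdf_2d_sketch(x1, x2, pdf_2d, num_draw, rng=None):
    """Draw grid points (x1[i], x2[j]) with probability proportional to pdf_2d[i, j]."""
    # `rng` is a hypothetical extra argument for reproducibility.
    rng = np.random.default_rng() if rng is None else rng
    # Flatten row by row and normalize so the cell probabilities sum to 1.
    p = pdf_2d.ravel() / pdf_2d.sum()
    flat = rng.choice(pdf_2d.size, size=num_draw, p=p)
    # i indexes x1 (rows, axis=0), j indexes x2 (columns, axis=1).
    i, j = np.unravel_index(flat, pdf_2d.shape)
    return x1[i], x2[j]


x1 = np.linspace(-3.0, 3.0, 61)
x2 = np.linspace(0.0, 1.0, 21)
# pdf_2d[i, j]: Gaussian in x1 multiplied by a linear ramp in x2.
pdf_2d = np.exp(-0.5 * x1[:, None] ** 2) * x2[None, :]
s1, s2 = draw_from_pdf_2d_sketch(
    x1, x2, pdf_2d, 5_000, rng=np.random.default_rng(2)
)
```

An alternative that yields off-grid values is to sample `x1` from the marginal distribution and then `x2` from the conditional distribution at the drawn `x1`, at the cost of one cdf inversion per coordinate.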