Data Set API

Summary

The DataSet API, added in v1.5.3, makes it easy to work with potentially large data sets, perform complex pre-processing tasks, and feed these data sets into TensorFlow models.

Data Set

Basics

A DataSet[X] instance is a light wrapper over an Iterable[X] object; the user still has access to the underlying collection, via the data member.

Tip

The dtfdata object gives the user easy access to the DataSet API.

```scala
import _root_.io.github.mandar2812.dynaml.probability._
import _root_.io.github.mandar2812.dynaml.pipes._
import io.github.mandar2812.dynaml.tensorflow._

val random_numbers = GaussianRV(0.0, 1.0) :* GaussianRV(1.0, 2.0)

//Create a data set.
val dataset1 = dtfdata.dataset(random_numbers.iid(10000).draw)

//Access underlying data
dataset1.data
```

Transformations

DynaML data sets support several transformations in the map-reduce style.

Map

Transform each element of type X into some other element of type Y (Y can possibly be the same as X).

```scala
import _root_.io.github.mandar2812.dynaml.probability._
import _root_.io.github.mandar2812.dynaml.pipes._
import io.github.mandar2812.dynaml.tensorflow._

val random_numbers = GaussianRV(0.0, 1.0)

//A data set of random gaussian numbers.
val random_gaussian_dataset = dtfdata.dataset(
  random_numbers.iid(10000).draw
)

//Transform the data set by applying a scala function
val random_chisq_dataset = random_gaussian_dataset.map((x: Double) => x*x)

val exp_tr = DataPipe[Double, Double](math.exp _)

//Can pass a DataPipe instead of a function
val random_log_gaussian_dataset = random_gaussian_dataset.map(exp_tr)
```

Flat Map

Process each element by applying a function which maps it to an Iterable; the resulting nested collection is then flattened by one level.

Schematically, this process is

Iterable[X] -> Iterable[Iterable[Y]] -> Iterable[Y]

```scala
import _root_.io.github.mandar2812.dynaml.probability._
import _root_.io.github.mandar2812.dynaml.pipes._
import scala.util.Random
import io.github.mandar2812.dynaml.tensorflow._

val random_gaussian_dataset = dtfdata.dataset(
  GaussianRV(0.0, 1.0).iid(10000).draw
)

//Transform the data set by applying a scala function
val gaussian_mixture = random_gaussian_dataset.flatMap(
  (x: Double) => GaussianRV(0.0, x*x).iid(10).draw
)
```

Filter

Collect only the elements which satisfy some predicate, i.e. a function which returns true for the elements to be kept and false for those which should be discarded.

```scala
import _root_.io.github.mandar2812.dynaml.probability._
import _root_.io.github.mandar2812.dynaml.pipes._
import scala.util.Random
import io.github.mandar2812.dynaml.tensorflow._

val gaussian_dataset = dtfdata.dataset(
  GaussianRV(0.0, 1.0).iid(10000).draw
)

val onlyPositive = DataPipe[Double, Boolean](_ > 0.0)
val truncated_gaussian = gaussian_dataset.filter(onlyPositive)

val zeroOrGreater = (x: Double) => x >= 0.0

//filterNot works in the opposite manner to filter
val neg_truncated_gaussian = gaussian_dataset.filterNot(zeroOrGreater)
```

Scan & Friends

Sometimes, we need to perform operations on a data set which are sequential in nature. In this situation, the scanLeft() and scanRight() methods are useful.

Let's simulate a random walk: starting from a number $x_0$, we add independent gaussian increments.

\begin{align*} x_t &= x_{t-1} + \epsilon \\ \epsilon &\sim \mathcal{N}(0, 1) \end{align*}

```scala
import _root_.io.github.mandar2812.dynaml.probability._
import _root_.io.github.mandar2812.dynaml.pipes._
import scala.util.Random
import io.github.mandar2812.dynaml.tensorflow._

val gaussian_increments = dtfdata.dataset(
  GaussianRV(0.0, 1.0).iid(10000).draw
)

val increment = DataPipe2[Double, Double, Double]((x, i) => x + i)

//Start the random walk from zero, and keep adding increments.
val random_walk = gaussian_increments.scanLeft(0.0)(increment)
```

The scanRight() method works just like scanLeft(), except it begins from the last element of the collection.
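To make the difference in direction concrete, here is a plain Scala sketch on a standard List (the seed and accumulator semantics shown are those of the Scala standard library, which the scans above follow in spirit):

```scala
val increments = List(1.0, 2.0, 3.0)

//scanLeft folds from the first element onwards,
//prepending the seed: 0.0, 0+1, 1+2, 3+3
val walkLeft = increments.scanLeft(0.0)((acc, x) => acc + x)
// walkLeft == List(0.0, 1.0, 3.0, 6.0)

//scanRight folds from the last element backwards,
//appending the seed: 1+5, 2+3, 3+0, 0.0
val walkRight = increments.scanRight(0.0)((x, acc) => x + acc)
// walkRight == List(6.0, 5.0, 3.0, 0.0)
```

Note that the result of a scan has one more element than the input, since the seed itself is included.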

Reduce & Reduce Left

The reduce() and reduceLeft() methods help in computing summary values from the entire data collection.

```scala
import _root_.io.github.mandar2812.dynaml.probability._
import _root_.io.github.mandar2812.dynaml.pipes._
import scala.util.Random
import io.github.mandar2812.dynaml.tensorflow._

val gaussian_increments = dtfdata.dataset(
  GaussianRV(0.0, 1.0).iid(10000).draw
)

val increment = DataPipe2[Double, Double, Double]((x, i) => x + i)

val random_walk = gaussian_increments.scanLeft(0.0)(increment)

val average = random_walk.reduce(
  DataPipe2[Double, Double, Double]((x, y) => x + y)
)/10000.0
```
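The distinction between the two matters when the combining operation is not associative: reduceLeft combines strictly left-to-right, while reduce makes no ordering guarantee, so the operation passed to it should be associative. A plain Scala sketch on standard collections (not the DynaML DataSet API) illustrates this:

```scala
val xs = List(10.0, 2.0, 3.0)

//reduceLeft combines strictly left to right: ((10.0 - 2.0) - 3.0)
val leftResult = xs.reduceLeft((a, b) => a - b)
// leftResult == 5.0

//reduce makes no ordering guarantee, so prefer
//associative operations such as addition:
val total = xs.reduce(_ + _)
// total == 15.0
```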

Other Transformations

Sometimes a transformation cannot be applied to each element individually; rather, it requires the entire data collection. The transform() method handles this case.

```scala
import _root_.io.github.mandar2812.dynaml.probability._
import _root_.io.github.mandar2812.dynaml.pipes._
import scala.util.Random
import io.github.mandar2812.dynaml.tensorflow._

val gaussian_data = dtfdata.dataset(
  GaussianRV(0.0, 1.0).iid(10000).draw
)

//Resample with replacement; convert to an indexed
//sequence first, since Iterable has no apply() method.
val resample = DataPipe[Iterable[Double], Iterable[Double]](
  coll => {
    val seq = coll.toIndexedSeq
    (0 until 10000).map(_ => seq(Random.nextInt(10000)))
  }
)

val resampled_data = gaussian_data.transform(resample)
```

Note

Conversion to TF-Scala Dataset class

The TensorFlow Scala API also has a Dataset class; from a DynaML DataSet instance, it is possible to obtain a TensorFlow Dataset.

```scala
import _root_.io.github.mandar2812.dynaml.probability._
import _root_.io.github.mandar2812.dynaml.pipes._
import io.github.mandar2812.dynaml.tensorflow._
import org.platanios.tensorflow.api._
import org.platanios.tensorflow.api.types._

val random_numbers = GaussianRV(0.0, 1.0)

//Create a data set.
val dataset1 = dtfdata.dataset(random_numbers.iid(10000).draw)

//Convert to a TensorFlow data set
dataset1.build[Tensor, Output, DataType.Aux[Double], DataType, Shape](
  Left(DataPipe[Double, Tensor](x => dtf.tensor_f64(1)(x))),
  FLOAT64,
  Shape(1)
)
```

Tuple Data & Supervised Data

The classes ZipDataSet[X, Y] and SupervisedDataSet[X, Y] both represent data collections which consist of (X, Y) tuples. They can be created in a number of ways.

Zip Data

The zip() method can be used to create data sets consisting of tuples.

```scala
import _root_.io.github.mandar2812.dynaml.probability._
import _root_.io.github.mandar2812.dynaml.pipes._
import scala.util.Random
import _root_.breeze.stats.distributions._
import io.github.mandar2812.dynaml.tensorflow._

val gaussian_data = dtfdata.dataset(
  GaussianRV(0.0, 1.0).iid(10000).draw
)

val log_normal_data = gaussian_data.map((x: Double) => math.exp(x))

val poisson_data = dtfdata.dataset(
  RandomVariable(Poisson(2.5)).iid(10000).draw
)

val tuple_data1 = poisson_data.zip(gaussian_data)
val tuple_data2 = poisson_data.zip(log_normal_data)

//Join on the keys, in this case the
//Poisson distributed integers
tuple_data1.join(tuple_data2)
```
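In spirit, a key-based join pairs up, for every key, the values found in both collections, keeping only keys present in both. A minimal plain-Scala sketch of that idea (a hypothetical helper for illustration, not DynaML's actual implementation):

```scala
//Hypothetical sketch of a key-based inner join on (key, value) tuples.
def join[K, V, W](
  left: Iterable[(K, V)],
  right: Iterable[(K, W)]): Iterable[(K, (V, W))] = {
  //Index the right-hand collection by key.
  val rightMap = right.groupBy(_._1)
  for {
    (k, v) <- left
    (_, w) <- rightMap.getOrElse(k, Iterable.empty)
  } yield (k, (v, w))
}

val a = List((1, "x"), (2, "y"))
val b = List((1, 0.5), (3, 1.5))

//Only key 1 appears in both collections.
join(a, b)  // List((1, ("x", 0.5)))
```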

Supervised Data

For supervised learning operations, we can use the SupervisedDataSet class, which can be instantiated in the following ways.

```scala
import _root_.io.github.mandar2812.dynaml.probability._
import _root_.io.github.mandar2812.dynaml.pipes._
import scala.util.Random
import _root_.breeze.stats.distributions._
import io.github.mandar2812.dynaml.tensorflow._

val gaussian_data = dtfdata.dataset(
  GaussianRV(0.0, 1.0).iid(10000).draw
)

val sup_data1 = gaussian_data.to_supervised(
  DataPipe[Double, (Double, Double)](x => (x, GaussianRV(0.0, x*x).draw))
)

val targets = gaussian_data.map((x: Double) => math.exp(x))

val sup_data2 = dtfdata.supervised_dataset(gaussian_data, targets)
```