zoo.pipeline.nnframes package

Submodules

zoo.pipeline.nnframes.nn_classifier module

class zoo.pipeline.nnframes.nn_classifier.HasBatchSize[source]

Bases: pyspark.ml.param.Params

Mixin for param batchSize: batch size.

batchSize = Param(parent='undefined', name='batchSize', doc='batchSize (>= 0).')

Param for batch size.

getBatchSize()[source]

Gets the value of batchSize or its default value.

setBatchSize(val)[source]

Sets the value of batchSize.

class zoo.pipeline.nnframes.nn_classifier.HasOptimMethod[source]

Bases: object

getOptimMethod()[source]

Gets the optimization method.

setOptimMethod(val)[source]

Sets the optimization method, e.g. SGD, Adam or LBFGS from bigdl.optim.optimizer. Default: SGD().

class zoo.pipeline.nnframes.nn_classifier.HasSamplePreprocessing[source]

Bases: object

Mixin for param samplePreprocessing

getSamplePreprocessing()[source]

samplePreprocessing = None

setSamplePreprocessing(val)[source]

Sets samplePreprocessing.

class zoo.pipeline.nnframes.nn_classifier.HasThreshold[source]

Bases: pyspark.ml.param.Params

Mixin for param Threshold in binary classification.

The threshold applies to the raw output of the model. If the output is greater than threshold, then predict 1, else 0. A high threshold encourages the model to predict 0 more often; a low threshold encourages the model to predict 1 more often.

Note: this param differs from the one in Spark ProbabilisticClassifier, which is compared against the estimated probability.

Default is 0.5.
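The thresholding rule described above can be sketched in plain Python. This is only an illustration of the comparison, not the library's implementation; the helper name predict_binary and the sample values are hypothetical.

```python
def predict_binary(raw_output, threshold=0.5):
    """Return 1.0 if the raw model output is greater than the threshold, else 0.0."""
    return 1.0 if raw_output > threshold else 0.0


# Hypothetical raw model outputs for four records.
raw_outputs = [0.2, 0.5, 0.7, 0.95]

# Default threshold (0.5): only outputs strictly greater than 0.5 become 1.0.
default_preds = [predict_binary(x) for x in raw_outputs]

# A higher threshold makes the rule predict 0 more often.
strict_preds = [predict_binary(x, threshold=0.9) for x in raw_outputs]
```

With the same raw outputs, raising the threshold can only turn predictions of 1 into 0, never the reverse.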

getThreshold()[source]

Gets the value of threshold or its default value.

setThreshold(val)[source]

Sets the value of threshold.

class zoo.pipeline.nnframes.nn_classifier.NNClassifier(model, criterion, feature_preprocessing=None, jvalue=None, bigdl_type='float')[source]

Bases: zoo.pipeline.nnframes.nn_classifier.NNEstimator

NNClassifier is a specialized NNEstimator that simplifies the data format for classification tasks. It only supports label column of DoubleType, and the fitted NNClassifierModel will have the prediction column of DoubleType.

setSamplePreprocessing(val)[source]

Sets the value of sample_preprocessing.

Parameters:
  • val – a Preprocessing[(Feature, Option(Label)), Sample]

class zoo.pipeline.nnframes.nn_classifier.NNClassifierModel(model, feature_preprocessing=None, jvalue=None, bigdl_type='float')[source]

Bases: zoo.pipeline.nnframes.nn_classifier.NNModel, zoo.pipeline.nnframes.nn_classifier.HasThreshold

NNClassifierModel is a specialized NNModel for classification tasks. The prediction column will have the datatype of Double.

static load(path)[source]

class zoo.pipeline.nnframes.nn_classifier.NNEstimator(model, criterion, feature_preprocessing=None, label_preprocessing=None, jvalue=None, bigdl_type='float')[source]

Bases: pyspark.ml.wrapper.JavaEstimator, pyspark.ml.param.shared.HasFeaturesCol, pyspark.ml.param.shared.HasLabelCol, pyspark.ml.param.shared.HasPredictionCol, zoo.pipeline.nnframes.nn_classifier.HasBatchSize, zoo.pipeline.nnframes.nn_classifier.HasOptimMethod, zoo.pipeline.nnframes.nn_classifier.HasSamplePreprocessing, bigdl.util.common.JavaValue

NNEstimator extends org.apache.spark.ml.Estimator and supports training a BigDL model with Spark DataFrame data. It can be integrated into a standard Spark ML Pipeline, so that it can be used together with Spark MLlib.

NNEstimator supports different feature and label data types through operations defined in Preprocessing. Pre-defined Preprocessing implementations for popular data types such as Array or Vector are provided in package zoo.feature, and users can also develop a customized Preprocessing that extends feature.common.Preprocessing. During fit, NNEstimator extracts feature and label data from the input DataFrame and uses the Preprocessing to prepare data for the model. Using the Preprocessing allows NNEstimator to cache only the raw data, which decreases memory consumption during feature conversion and training.
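The idea of keeping only raw data and converting it on demand can be illustrated with a minimal plain-Python sketch. It does not use the zoo or BigDL APIs; seq_to_sample and iter_batches are hypothetical helpers, and the dictionary merely stands in for a training sample.

```python
def seq_to_sample(row):
    """Hypothetical preprocessing: convert a raw (features, label) row into a sample."""
    features, label = row
    return {"feature": [float(x) for x in features], "label": float(label)}


def iter_batches(raw_rows, preprocessing, batch_size):
    """Yield batches of samples, converting raw rows only when a batch is requested."""
    for start in range(0, len(raw_rows), batch_size):
        yield [preprocessing(row) for row in raw_rows[start:start + batch_size]]


# Only the raw rows are held in memory; samples are built batch by batch.
raw_rows = [([1, 2], 0), ([3, 4], 1), ([5, 6], 1)]
batches = list(iter_batches(raw_rows, seq_to_sample, batch_size=2))
```

Because the conversion runs inside the generator, converted samples exist only for the batch currently being consumed.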

More concrete examples are available in package com.intel.analytics.zoo.examples.nnframes.

clearGradientClipping()[source]

Clears the clipping params, so that clipping will not be applied. To take effect, it must be called before fit.

getCheckpoint()[source]

Returns: a tuple containing (checkpointPath, checkpointTrigger, checkpointOverwrite)

getDataCacheLevel()[source]

getEndWhen()[source]

Gets the value of endWhen or its default value.

getLearningRate()[source]

Gets the value of learningRate or its default value.

getLearningRateDecay()[source]

Gets the value of learningRateDecay or its default value.

getMaxEpoch()[source]

Gets the value of maxEpoch or its default value.

getTrainSummary()[source]

Gets the train summary.

getValidation()[source]

Gets the validation configuration. If a validation config has been set, getValidation returns a list of [ValidationTrigger, validation data, Array[ValidationMethod[T]], batchsize].

getValidationSummary()[source]

Gets the validation summary.

isCachingSample()[source]

Gets the value of cachingSample or its default value.

setCachingSample(val)[source]

Sets whether to cache the Samples after preprocessing. Default: True.

setCheckpoint(path, trigger, isOverWrite=True)[source]

Sets checkpoints during training. Not enabled by default.

Parameters:
  • path – the directory to save the model
  • trigger – how often to save the checkpoint
  • isOverWrite – whether to overwrite existing snapshots in path. Default is True

Returns: self

setConstantGradientClipping(min, max)[source]

Set constant gradient clipping during the training process. In order to take effect, it needs to be called before fit.

Parameters:
  • min – the minimum value to clip by. Float.
  • max – the maximum value to clip by. Float.
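As a rough illustration of what constant clipping means, the plain-Python sketch below clamps each gradient value to the range [min, max]. It is not the BigDL implementation; clip_constant and the sample gradient values are hypothetical.

```python
def clip_constant(gradients, min_value, max_value):
    """Clamp every gradient value into the closed interval [min_value, max_value]."""
    if min_value > max_value:
        raise ValueError("min_value must not exceed max_value")
    return [max(min_value, min(max_value, g)) for g in gradients]


# Hypothetical gradient values; anything outside [-1.0, 1.0] is clamped.
clipped = clip_constant([-3.0, -0.5, 0.0, 0.8, 2.5], -1.0, 1.0)
```

Values already inside the range pass through unchanged, so only outliers are affected.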

setDataCacheLevel(level, numSlice=None)[source]

Parameters:
  • level – string, “DRAM”, “PMEM” or “DISK_AND_DRAM”. If it is DRAM, the dataset is cached in dynamic random-access memory. If it is PMEM, the dataset is cached in Intel Optane DC Persistent Memory. If it is DISK_AND_DRAM, the dataset is cached on disk, and only 1/numSlice of the data is held in memory during training. After going through that 1/numSlice, the current cache is released and another slice is loaded into memory.

setEndWhen(trigger)[source]

Sets when to stop the training, passed in as a Trigger, e.g. maxIterations(100).

setGradientClippingByL2Norm(clip_norm)[source]

Clip gradient to a maximum L2-Norm during the training process. In order to take effect, it needs to be called before fit.

Parameters:
  • clip_norm – gradient L2-Norm threshold. Float.
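The following plain-Python sketch shows one common way to clip by L2 norm: if the norm of the gradient exceeds clip_norm, the whole gradient is rescaled so that its norm equals clip_norm. It is an illustration under that assumption, not the BigDL implementation; clip_by_l2_norm and the sample values are hypothetical.

```python
import math


def clip_by_l2_norm(gradients, clip_norm):
    """Rescale the gradient so that its L2 norm does not exceed clip_norm."""
    norm = math.sqrt(sum(g * g for g in gradients))
    if norm == 0.0 or norm <= clip_norm:
        # Already within the limit: return an unchanged copy.
        return list(gradients)
    scale = clip_norm / norm
    return [g * scale for g in gradients]


# Hypothetical gradient with L2 norm 5.0, clipped to a maximum norm of 1.0.
clipped = clip_by_l2_norm([3.0, 4.0], 1.0)
clipped_norm = math.sqrt(sum(g * g for g in clipped))
```

Unlike constant clipping, this keeps the direction of the gradient and changes only its length.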

setLearningRate(val)[source]

Sets the value of learningRate.

Note: Deprecated in 0.4.0. Please set the learning rate with optimMethod directly.

setLearningRateDecay(val)[source]

Sets the value of learningRateDecay.

Note: Deprecated in 0.4.0. Please set the learning rate decay with optimMethod directly.

setMaxEpoch(val)[source]

Sets the value of maxEpoch.

setSamplePreprocessing(val)[source]

Sets the value of sample_preprocessing.

Parameters:
  • val – a Preprocessing[(Feature, Option(Label)), Sample]

setTrainSummary(val)[source]

Statistics (LearningRate, Loss, Throughput, Parameters) collected during training for the training data, which can be used for visualization via Tensorboard. Use setTrainSummary to enable train logger. Then the log will be saved to logDir/appName/train as specified by the parameters of TrainSummary. Default: Not enabled

Parameters:
  • val – a TrainSummary object

setValidation(trigger, val_df, val_method, batch_size)[source]

Sets up validation evaluation during training.

Parameters:
  • trigger – validation interval
  • val_df – validation dataset
  • val_method – the ValidationMethod to use, e.g. “Top1Accuracy”, “Top5Accuracy”, “Loss”
  • batch_size – validation batch size

setValidationSummary(val)[source]

Statistics (LearningRate, Loss, Throughput, Parameters) collected during training for the validation data if validation data is set, which can be used for visualization via Tensorboard. Use setValidationSummary to enable validation logger. Then the log will be saved to logDir/appName/ as specified by the parameters of validationSummary. Default: None

class zoo.pipeline.nnframes.nn_classifier.NNModel(model, feature_preprocessing=None, jvalue=None, bigdl_type='float')[source]

Bases: pyspark.ml.wrapper.JavaTransformer, pyspark.ml.param.shared.HasFeaturesCol, pyspark.ml.param.shared.HasPredictionCol, zoo.pipeline.nnframes.nn_classifier.HasBatchSize, zoo.pipeline.nnframes.nn_classifier.HasSamplePreprocessing, bigdl.util.common.JavaValue

NNModel extends Spark ML Transformer and supports BigDL model with Spark DataFrame.

NNModel supports different feature data type through Preprocessing. Some common Preprocessing have been defined in com.intel.analytics.zoo.feature.

After transform, the prediction column contains the output of the model as Array[T], where T (Double or Float) is decided by the model type.

static load(path)[source]

save(path)[source]

class zoo.pipeline.nnframes.nn_classifier.XGBClassifierModel(jvalue)[source]

Bases: object

XGBClassifierModel is a trained XGBoost classification model. The prediction column will have the prediction results.

static loadModel(path, numClasses)[source]

Loads a pretrained XGBoostClassificationModel.

Parameters:
  • path – pretrained model path
  • numClasses – number of classes for classification

setFeaturesCol(features)[source]

setPredictionCol(prediction)[source]

transform(dataset)[source]

zoo.pipeline.nnframes.nn_image_reader module

class zoo.pipeline.nnframes.nn_image_reader.NNImageReader[source]

Bases: object

Primary DataFrame-based image loading interface, defining API to read images from files to DataFrame.

static readImages(path, sc=None, minPartitions=1, resizeH=-1, resizeW=-1, image_codec=-1, bigdl_type='float')[source]

Reads a directory of images into a DataFrame from a local or remote source.

Parameters:
  • path – directory of the input data files. The path can be comma-separated paths as the list of inputs. Wildcard paths are supported similarly to sc.binaryFiles(path).
  • minPartitions – a suggested value for the minimal number of splits of the input data.
  • resizeH – height after resize. Default is -1, which does not resize the image.
  • resizeW – width after resize. Default is -1, which does not resize the image.
  • image_codec – specifies the color type of a loaded image, same as in OpenCV.imread. Default is Imgcodecs.CV_LOAD_IMAGE_UNCHANGED (-1).
      >0 Return a 3-channel color image. Note: in the current implementation the alpha channel, if any, is stripped from the output image. Use a negative value if you need the alpha channel.
      =0 Return a grayscale image.
      <0 Return the loaded image as is (with alpha channel if any).

Returns: DataFrame with a single column “image”. Each record in the column represents one image record: Row (uri, height, width, channels, CvType, bytes).

zoo.pipeline.nnframes.nn_image_schema module

zoo.pipeline.nnframes.nn_image_schema.with_origin_column(dataset, imageColumn='image', originColumn='origin', bigdl_type='float')[source]

Module contents