zoo.pipeline.nnframes package¶
Submodules¶
zoo.pipeline.nnframes.nn_classifier module¶
-
class
zoo.pipeline.nnframes.nn_classifier.HasBatchSize[source]¶ Bases:
pyspark.ml.param.Params
Mixin for param batchSize: batch size.
-
batchSize= Param(parent='undefined', name='batchSize', doc='batchSize (>= 0).')¶ param for batch size.
-
-
class
zoo.pipeline.nnframes.nn_classifier.HasSamplePreprocessing[source]¶ Bases:
object
Mixin for param samplePreprocessing.
-
samplePreprocessing= None¶
-
-
class
zoo.pipeline.nnframes.nn_classifier.HasThreshold[source]¶ Bases:
pyspark.ml.param.Params
Mixin for param threshold in binary classification.
The threshold applies to the raw output of the model. If the output is greater than threshold, then predict 1, else 0. A high threshold encourages the model to predict 0 more often; a low threshold encourages the model to predict 1 more often.
Note: this param differs from the one in Spark's ProbabilisticClassifier, which is compared against the estimated probability.
Default is 0.5.
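The decision rule described above can be sketched in plain Python. This is only an illustration of the comparison, not the library's implementation; the function name `predict_binary` is hypothetical and does not appear in zoo.

```python
def predict_binary(raw_output, threshold=0.5):
    """Return 1.0 if the raw model output is greater than the threshold, else 0.0.

    Hypothetical helper: it mirrors the rule stated for HasThreshold,
    where the comparison is made against the raw model output.
    """
    return 1.0 if raw_output > threshold else 0.0


raw_outputs = [0.2, 0.5, 0.7]

# With the default threshold, only outputs strictly above 0.5 map to 1.0.
default_preds = [predict_binary(x) for x in raw_outputs]

# A higher threshold pushes more predictions towards 0.0.
strict_preds = [predict_binary(x, threshold=0.8) for x in raw_outputs]
```

Note that an output exactly equal to the threshold maps to 0 under this reading, since the text says "greater than".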
-
class
zoo.pipeline.nnframes.nn_classifier.NNClassifier(model, criterion, feature_preprocessing=None, jvalue=None, bigdl_type='float')[source]¶ Bases:
zoo.pipeline.nnframes.nn_classifier.NNEstimator
NNClassifier is a specialized NNEstimator that simplifies the data format for classification tasks. It only supports a label column of DoubleType, and the fitted NNClassifierModel produces a prediction column of DoubleType.
-
class
zoo.pipeline.nnframes.nn_classifier.NNClassifierModel(model, feature_preprocessing=None, jvalue=None, bigdl_type='float')[source]¶ Bases:
zoo.pipeline.nnframes.nn_classifier.NNModel, zoo.pipeline.nnframes.nn_classifier.HasThreshold
NNClassifierModel is a specialized NNModel for classification tasks. The prediction column has the datatype Double.
-
class
zoo.pipeline.nnframes.nn_classifier.NNEstimator(model, criterion, feature_preprocessing=None, label_preprocessing=None, jvalue=None, bigdl_type='float')[source]¶ Bases:
pyspark.ml.wrapper.JavaEstimator, pyspark.ml.param.shared.HasFeaturesCol, pyspark.ml.param.shared.HasLabelCol, pyspark.ml.param.shared.HasPredictionCol, zoo.pipeline.nnframes.nn_classifier.HasBatchSize, zoo.pipeline.nnframes.nn_classifier.HasOptimMethod, zoo.pipeline.nnframes.nn_classifier.HasSamplePreprocessing, bigdl.util.common.JavaValue
NNEstimator extends org.apache.spark.ml.Estimator and supports training a BigDL model with Spark DataFrame data. It can be integrated into a standard Spark ML Pipeline, so it can be combined with Spark MLlib stages.
NNEstimator supports different feature and label data types through operations defined in Preprocessing. Pre-defined Preprocessing for popular data types such as Array or Vector is provided in package zoo.feature, and users can also develop a customized Preprocessing that extends feature.common.Preprocessing. During fit, NNEstimator extracts feature and label data from the input DataFrame and uses the Preprocessing to prepare data for the model. Using the Preprocessing allows NNEstimator to cache only the raw data, which decreases memory consumption during feature conversion and training.
More concrete examples are available in package com.intel.analytics.zoo.examples.nnframes.
-
clearGradientClipping()[source]¶ Clear the clipping params so that no gradient clipping is applied. To take effect, it must be called before fit.
-
getCheckpoint()[source]¶ Returns: a tuple containing (checkpointPath, checkpointTrigger, checkpointOverwrite)
-
getValidation()[source]¶ Gets the validation configuration. If the validation config has been set, getValidation returns a list of [ValidationTrigger, validation data, Array[ValidationMethod[T]], batch size].
-
setCheckpoint(path, trigger, isOverWrite=True)[source]¶ Set checkpoints during training. Not enabled by default.
Parameters: - path – the directory to save the model
- trigger – how often to save the checkpoint
- isOverWrite – whether to overwrite existing snapshots in path. Default is True
Returns: self
-
setConstantGradientClipping(min, max)[source]¶ Set constant gradient clipping during the training process. To take effect, it must be called before fit.
Parameters: - min – the minimum value to clip by. Float.
- max – the maximum value to clip by. Float.
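As a rough illustration of what constant clipping means, the following sketch clamps every gradient value into the range [min, max]. It is a plain-Python stand-in, assuming element-wise clamping; the helper name `clip_constant` is hypothetical, and the actual clipping is performed by the underlying BigDL optimizer.

```python
def clip_constant(gradients, min_value, max_value):
    """Clamp each gradient value into the closed range [min_value, max_value].

    Hypothetical helper used only to illustrate constant gradient clipping.
    """
    if min_value > max_value:
        raise ValueError("min_value must not exceed max_value")
    return [max(min_value, min(max_value, g)) for g in gradients]


# Values outside [-1.0, 1.0] are pulled back to the nearest bound;
# values already inside the range are left unchanged.
clipped = clip_constant([-3.0, -0.5, 0.0, 2.5], -1.0, 1.0)
```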
-
setDataCacheLevel(level, numSlice=None)[source]¶ Parameters: level – string, “DRAM”, “PMEM” or “DISK_AND_DRAM”. If it is DRAM, the dataset is cached in dynamic random-access memory. If it is PMEM, the dataset is cached in Intel Optane DC Persistent Memory. If it is DISK_AND_DRAM, the dataset is cached on disk, and only 1/numSlice of the data is held in memory during training. After going through that 1/numSlice, the current cache is released and another slice is loaded into memory.
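The DISK_AND_DRAM behaviour can be pictured as iterating over the dataset one slice at a time. The sketch below is a conceptual model only, with an in-memory list standing in for the on-disk dataset; `iter_slices` is a hypothetical name, and the real caching happens inside the JVM-side BigDL/Zoo code.

```python
def iter_slices(records, num_slice):
    """Yield the records in at most num_slice consecutive chunks.

    Only one chunk is handed out at a time, mimicking the idea of holding
    roughly 1/num_slice of the data in memory before moving on.
    """
    if num_slice < 1:
        raise ValueError("num_slice must be at least 1")
    # Ceiling division; max(1, ...) keeps the step valid for empty input.
    slice_size = max(1, -(-len(records) // num_slice))
    for start in range(0, len(records), slice_size):
        yield records[start:start + slice_size]


# Ten records split into four slices: the previous slice can be released
# before the next one is loaded.
slices = list(iter_slices(list(range(10)), 4))
```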
-
setEndWhen(trigger)[source]¶ When to stop the training, passed in a Trigger. E.g. maxIterations(100)
-
setGradientClippingByL2Norm(clip_norm)[source]¶ Clip gradients to a maximum L2-norm during the training process. To take effect, it must be called before fit.
Parameters: clip_norm – gradient L2-norm threshold. Float.
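L2-norm clipping is commonly implemented by rescaling the gradient vector whenever its norm exceeds the threshold. The sketch below shows that common formulation on a flat list of floats. It is an assumption about the general technique, not a transcription of the BigDL implementation, and `clip_by_l2_norm` is a hypothetical name.

```python
import math


def clip_by_l2_norm(gradients, clip_norm):
    """Rescale gradients so that their L2-norm does not exceed clip_norm.

    Gradients whose norm is already within the threshold are returned as a copy.
    """
    norm = math.sqrt(sum(g * g for g in gradients))
    if norm <= clip_norm or norm == 0.0:
        return list(gradients)
    scale = clip_norm / norm
    return [g * scale for g in gradients]


# The vector [3.0, 4.0] has norm 5.0, so it is scaled down to norm 1.0.
clipped = clip_by_l2_norm([3.0, 4.0], 1.0)

# A vector already below the threshold is left as is.
unchanged = clip_by_l2_norm([0.3, 0.4], 1.0)
```

Unlike constant clipping, this keeps the direction of the gradient and only shrinks its magnitude.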
-
setLearningRate(val)[source]¶ Sets the value of learningRate.
Note: Deprecated in 0.4.0. Please set the learning rate with optimMethod directly.
-
setLearningRateDecay(val)[source]¶ Sets the value of learningRateDecay.
Note: Deprecated in 0.4.0. Please set the learning rate decay with optimMethod directly.
-
setSamplePreprocessing(val)[source]¶ Sets the value of sample_preprocessing.
Parameters: val – a Preprocessing[(Feature, Option(Label)), Sample]
-
setTrainSummary(val)[source]¶ Statistics (LearningRate, Loss, Throughput, Parameters) collected during training for the training data, which can be used for visualization via TensorBoard. Use setTrainSummary to enable the train logger. The log is then saved to logDir/appName/train, as specified by the parameters of TrainSummary. Default: not enabled.
Parameters: val – a TrainSummary object
-
setValidation(trigger, val_df, val_method, batch_size)[source]¶ Set up validation during training.
Parameters: - trigger – validation interval
- val_df – validation dataset
- val_method – the ValidationMethod to use, e.g. “Top1Accuracy”, “Top5Accuracy”, “Loss”
- batch_size – validation batch size
-
setValidationSummary(val)[source]¶ Statistics (LearningRate, Loss, Throughput, Parameters) collected during training for the validation data if validation data is set, which can be used for visualization via Tensorboard. Use setValidationSummary to enable validation logger. Then the log will be saved to logDir/appName/ as specified by the parameters of validationSummary. Default: None
-
-
class
zoo.pipeline.nnframes.nn_classifier.NNModel(model, feature_preprocessing=None, jvalue=None, bigdl_type='float')[source]¶ Bases:
pyspark.ml.wrapper.JavaTransformer, pyspark.ml.param.shared.HasFeaturesCol, pyspark.ml.param.shared.HasPredictionCol, zoo.pipeline.nnframes.nn_classifier.HasBatchSize, zoo.pipeline.nnframes.nn_classifier.HasSamplePreprocessing, bigdl.util.common.JavaValue
NNModel extends Spark ML Transformer and supports applying a BigDL model to a Spark DataFrame.
NNModel supports different feature data types through Preprocessing. Some common Preprocessing implementations are defined in com.intel.analytics.zoo.feature.
After transform, the prediction column contains the output of the model as Array[T], where T (Double or Float) is decided by the model type.
-
class
zoo.pipeline.nnframes.nn_classifier.XGBClassifierModel(jvalue)[source]¶ Bases:
object
XGBClassifierModel is a trained XGBoost classification model. The prediction column will have the prediction results.
zoo.pipeline.nnframes.nn_image_reader module¶
-
class
zoo.pipeline.nnframes.nn_image_reader.NNImageReader[source]¶ Bases:
object
Primary DataFrame-based image loading interface, defining the API to read images from files into a DataFrame.
-
static
readImages(path, sc=None, minPartitions=1, resizeH=-1, resizeW=-1, image_codec=-1, bigdl_type='float')[source]¶ Read a directory of images into a DataFrame from a local or remote source.
Parameters: - path – directory of the input data files. The path can be a comma-separated list of input paths. Wildcard paths are supported, similarly to sc.binaryFiles(path).
- minPartitions – a suggested value for the minimal number of splits of the input data.
- resizeH – height after resize. Default is -1, which does not resize the image.
- resizeW – width after resize. Default is -1, which does not resize the image.
- image_codec – the color type of a loaded image, same as in OpenCV.imread. Default is Imgcodecs.CV_LOAD_IMAGE_UNCHANGED (-1). >0: return a 3-channel color image (note: in the current implementation the alpha channel, if any, is stripped from the output image; use a negative value if you need the alpha channel). =0: return a grayscale image. <0: return the loaded image as is (with alpha channel if any).
Returns: DataFrame with a single column “image”. Each record in the column represents one image record: Row (uri, height, width, channels, CvType, bytes).
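The three image_codec cases can be summarised as a small lookup on the sign of the flag. This sketch only restates the rule from the description above; `channels_after_load` is a hypothetical helper, and it does not call OpenCV or the reader itself.

```python
def channels_after_load(image_codec, source_channels):
    """Return the number of channels expected after loading an image.

    image_codec > 0  -> 3-channel color image (any alpha channel is stripped)
    image_codec == 0 -> grayscale image (1 channel)
    image_codec < 0  -> image loaded as is, keeping its original channels
    """
    if image_codec > 0:
        return 3
    if image_codec == 0:
        return 1
    return source_channels


# A 4-channel (e.g. RGBA) source keeps its alpha channel only with a negative flag.
as_color = channels_after_load(1, 4)
as_gray = channels_after_load(0, 4)
unchanged = channels_after_load(-1, 4)
```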
-
static