Feature API Documentation¶

imagePreprocessing¶

class zoo.feature.image.imagePreprocessing.ImageAspectScale(min_size, scale_multiple_of=1, max_size=1000, resize_mode=1, use_scale_factor=True, min_scale=-1.0, bigdl_type='float')[source]¶

Bases: zoo.feature.image.imagePreprocessing.ImagePreprocessing

Resize the image while keeping the aspect ratio, scaling according to the short edge.

Parameters:

min_size – scale size, apply to short edge
scale_multiple_of – make the scaled size multiple of some value
max_size – max size after scale
resize_mode – if resize_mode = -1, a mode is randomly selected from (Imgproc.INTER_LINEAR, Imgproc.INTER_CUBIC, Imgproc.INTER_AREA, Imgproc.INTER_NEAREST, Imgproc.INTER_LANCZOS4)
use_scale_factor – if True, the scale factors fx and fy are used, with fx = fy = 0
min_scale – controls the minimum scale-up for the image
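
To illustrate how min_size, max_size and scale_multiple_of interact, the following plain-Python sketch computes a target size from an input size. It is an approximation written for this page, not the library's implementation: the helper name aspect_scale_size and the rounding rules are assumptions, and min_scale is ignored.

```python
import math


def aspect_scale_size(height, width, min_size, scale_multiple_of=1, max_size=1000):
    """Return a (height, width) target size that keeps the aspect ratio.

    Hypothetical helper: the rounding behaviour is an assumption and may
    differ from what ImageAspectScale does internally.
    """
    short_edge = min(height, width)
    long_edge = max(height, width)

    # Scale so that the short edge reaches min_size.
    scale = min_size / short_edge

    # Reduce the scale if the long edge would exceed max_size.
    if round(scale * long_edge) > max_size:
        scale = max_size / long_edge

    # Round each side down to a multiple of scale_multiple_of.
    new_h = math.floor(height * scale / scale_multiple_of) * scale_multiple_of
    new_w = math.floor(width * scale / scale_multiple_of) * scale_multiple_of
    return new_h, new_w


print(aspect_scale_size(480, 640, min_size=600))                        # (600, 800)
print(aspect_scale_size(480, 640, min_size=600, scale_multiple_of=32))  # (576, 800)
print(aspect_scale_size(500, 2000, min_size=600))                       # (250, 1000)
```

In the last call the long edge would grow to 2400 pixels, so the scale is capped by max_size instead of being driven by min_size.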

class zoo.feature.image.imagePreprocessing.ImageBrightness(delta_low, delta_high, bigdl_type='float')[source]¶

Bases: zoo.feature.image.imagePreprocessing.ImagePreprocessing

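Adjust the image brightness.

Parameters:

delta_low – brightness parameter: low bound
delta_high – brightness parameter: high bound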

class zoo.feature.image.imagePreprocessing.ImageBytesToMat(byte_key='bytes', image_codec=-1, bigdl_type='float')[source]¶

Bases: zoo.feature.image.imagePreprocessing.ImagePreprocessing

Transform a byte array (the original image file in bytes) to an OpenCVMat.

Parameters:

byte_key – key that maps to the byte array
image_codec – specifies the color type of a loaded image, same as in OpenCV.imread. The default is Imgcodecs.CV_LOAD_IMAGE_UNCHANGED

class zoo.feature.image.imagePreprocessing.ImageCenterCrop(crop_width, crop_height, is_clip=True, bigdl_type='float')[source]¶

Bases: zoo.feature.image.imagePreprocessing.ImagePreprocessing

Crop a crop_width x crop_height patch from the center of the image. The patch size should be less than the image size.

Parameters:

crop_width – width after crop
crop_height – height after crop
is_clip – whether to clip the cropping box to the image boundary
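
The geometry of a center crop can be sketched in plain Python as follows. This is an illustration only, not the library's code: center_crop_box and crop are hypothetical helpers, and the image is modelled as a list of pixel rows rather than an OpenCVMat.

```python
def center_crop_box(image_width, image_height, crop_width, crop_height, is_clip=True):
    """Return (x1, y1, x2, y2) of a centered crop box in pixel coordinates."""
    x1 = (image_width - crop_width) // 2
    y1 = (image_height - crop_height) // 2
    x2 = x1 + crop_width
    y2 = y1 + crop_height
    if is_clip:
        # Clip the box so that it never leaves the image.
        x1, y1 = max(x1, 0), max(y1, 0)
        x2, y2 = min(x2, image_width), min(y2, image_height)
    return x1, y1, x2, y2


def crop(pixels, box):
    """Crop a row-major list of pixel rows to the given box."""
    x1, y1, x2, y2 = box
    return [row[x1:x2] for row in pixels[y1:y2]]


# A 4-row x 6-column toy image whose pixel value encodes (row, column).
image = [[10 * r + c for c in range(6)] for r in range(4)]
box = center_crop_box(image_width=6, image_height=4, crop_width=2, crop_height=2)
patch = crop(image, box)
print(box)    # (2, 1, 4, 3)
print(patch)  # [[12, 13], [22, 23]]
```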

class zoo.feature.image.imagePreprocessing.ImageChannelNormalize(mean_r, mean_g, mean_b, std_r=1.0, std_g=1.0, std_b=1.0, bigdl_type='float')[source]¶

Bases: zoo.feature.image.imagePreprocessing.ImagePreprocessing

Normalize the image per channel.

Parameters:

mean_r – mean value in R channel
mean_g – mean value in G channel
mean_b – mean value in B channel
std_r – std value in R channel
std_g – std value in G channel
std_b – std value in B channel
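
A minimal sketch of per-channel normalization, assuming the usual convention (value - mean) / std for each channel. The helper channel_normalize is hypothetical and works on a single (r, g, b) pixel; the formula is an assumption, since this page does not state how the mean and std values are applied.

```python
def channel_normalize(pixel, means, stds=(1.0, 1.0, 1.0)):
    """Normalize one (r, g, b) pixel as (value - mean) / std per channel."""
    return tuple((v - m) / s for v, m, s in zip(pixel, means, stds))


# With the default std of 1.0, only the mean is subtracted.
print(channel_normalize((124.0, 117.0, 100.0), means=(123.0, 117.0, 104.0)))
# (1.0, 0.0, -4.0)

# A std of 2.0 additionally halves each centered value.
print(channel_normalize((124.0, 117.0, 100.0), means=(123.0, 117.0, 104.0),
                        stds=(2.0, 2.0, 2.0)))
# (0.5, 0.0, -2.0)
```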

class zoo.feature.image.imagePreprocessing.ImageChannelOrder(bigdl_type='float')[source]¶

Bases: zoo.feature.image.imagePreprocessing.ImagePreprocessing

Randomly change the channel order of an image.

class zoo.feature.image.imagePreprocessing.ImageColorJitter(brightness_prob=0.5, brightness_delta=32.0, contrast_prob=0.5, contrast_lower=0.5, contrast_upper=1.5, hue_prob=0.5, hue_delta=18.0, saturation_prob=0.5, saturation_lower=0.5, saturation_upper=1.5, random_order_prob=0.0, shuffle=False, bigdl_type='float')[source]¶

Bases: zoo.feature.image.imagePreprocessing.ImagePreprocessing

Randomly adjust brightness, contrast, hue, and saturation.

Parameters:

brightness_prob – probability to adjust brightness
brightness_delta – brightness parameter
contrast_prob – probability to adjust contrast
contrast_lower – contrast lower parameter
contrast_upper – contrast upper parameter
hue_prob – probability to adjust hue
hue_delta – hue parameter
saturation_prob – probability to adjust saturation
saturation_lower – saturation lower parameter
saturation_upper – saturation upper parameter
random_order_prob – probability of applying the operations in random order
shuffle – whether to shuffle the transformers

class zoo.feature.image.imagePreprocessing.ImageExpand(means_r=123, means_g=117, means_b=104, min_expand_ratio=1.0, max_expand_ratio=4.0, bigdl_type='float')[source]¶

Bases: zoo.feature.image.imagePreprocessing.ImagePreprocessing

Expand the image and fill the blank part with means_r, means_g, and means_b.

Parameters:

means_r – means in R channel
means_g – means in G channel
means_b – means in B channel
min_expand_ratio – min expand ratio
max_expand_ratio – max expand ratio

class zoo.feature.image.imagePreprocessing.ImageFeatureToSample(bigdl_type='float')[source]¶

Bases: zoo.feature.common.Preprocessing

A transformer that gets a Sample from an ImageFeature.

class zoo.feature.image.imagePreprocessing.ImageFeatureToTensor(bigdl_type='float')[source]¶

Bases: zoo.feature.common.Preprocessing

A transformer that converts an ImageFeature to a Tensor.

class zoo.feature.image.imagePreprocessing.ImageFiller(start_x, start_y, end_x, end_y, value=255, bigdl_type='float')[source]¶

Bases: zoo.feature.image.imagePreprocessing.ImagePreprocessing

Fill part of the image with a certain pixel value.

Parameters:

start_x – start x ratio
start_y – start y ratio
end_x – end x ratio
end_y – end y ratio
value – filling value

class zoo.feature.image.imagePreprocessing.ImageFixedCrop(x1, y1, x2, y2, normalized=True, is_clip=True, bigdl_type='float')[source]¶

Bases: zoo.feature.image.imagePreprocessing.ImagePreprocessing

Crop a fixed area of the image.

Parameters:

x1 – start in width
y1 – start in height
x2 – end in width
y2 – end in height
normalized – whether args are normalized, i.e. in range [0, 1]
is_clip – whether to clip the roi to image boundaries
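
The effect of normalized and is_clip can be sketched as follows. The helper fixed_crop_box is hypothetical and only computes the pixel box; truncating to int is an assumption about rounding.

```python
def fixed_crop_box(image_width, image_height, x1, y1, x2, y2,
                   normalized=True, is_clip=True):
    """Return the crop box (x1, y1, x2, y2) in integer pixel coordinates."""
    if normalized:
        # Normalized coordinates are fractions of the image size.
        x1, x2 = x1 * image_width, x2 * image_width
        y1, y2 = y1 * image_height, y2 * image_height
    if is_clip:
        # Keep the region of interest inside the image boundaries.
        x1, x2 = max(x1, 0), min(x2, image_width)
        y1, y2 = max(y1, 0), min(y2, image_height)
    return int(x1), int(y1), int(x2), int(y2)


# Normalized coordinates on a 200 x 100 image.
print(fixed_crop_box(200, 100, 0.25, 0.5, 0.75, 1.0))
# (50, 50, 150, 100)

# Pixel coordinates that overshoot the image are clipped.
print(fixed_crop_box(200, 100, -10, 20, 250, 80, normalized=False))
# (0, 20, 200, 80)
```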

class zoo.feature.image.imagePreprocessing.ImageHFlip(bigdl_type='float')[source]¶

Bases: zoo.feature.image.imagePreprocessing.ImagePreprocessing

Flip the image horizontally

class zoo.feature.image.imagePreprocessing.ImageHue(delta_low, delta_high, bigdl_type='float')[source]¶

Bases: zoo.feature.image.imagePreprocessing.ImagePreprocessing

Adjust the image hue.

Parameters:

delta_low – hue parameter: low bound
delta_high – hue parameter: high bound

class zoo.feature.image.imagePreprocessing.ImageMatToTensor(to_RGB=False, tensor_key='imageTensor', share_buffer=True, format='NCHW', bigdl_type='float')[source]¶

Bases: zoo.feature.image.imagePreprocessing.ImagePreprocessing

Transform an OpenCVMat to a Tensor.

Parameters:

to_RGB – BGR to RGB (default is BGR)
tensor_key – key to store transformed tensor
format – DataFormat.NCHW or DataFormat.NHWC

class zoo.feature.image.imagePreprocessing.ImageMirror(bigdl_type='float')[source]¶

Bases: zoo.feature.image.imagePreprocessing.ImagePreprocessing

Flip the image horizontally and vertically

class zoo.feature.image.imagePreprocessing.ImagePixelBytesToMat(byte_key='bytes', bigdl_type='float')[source]¶

Bases: zoo.feature.image.imagePreprocessing.ImagePreprocessing

Transform a byte array (pixels in bytes) to an OpenCVMat.

Parameters:	byte_key – key that maps byte array

class zoo.feature.image.imagePreprocessing.ImagePixelNormalize(means, bigdl_type='float')[source]¶

Bases: zoo.feature.image.imagePreprocessing.ImagePreprocessing

Pixel level normalizer, data(i) = data(i) - mean(i)

Parameters:	means – pixel level mean, following H * W * C order

class zoo.feature.image.imagePreprocessing.ImagePreprocessing(bigdl_type='float', *args)[source]¶

Bases: zoo.feature.common.Preprocessing

ImagePreprocessing is a transformer that transforms an ImageFeature.

class zoo.feature.image.imagePreprocessing.ImageRandomAspectScale(scales, scale_multiple_of=1, max_size=1000, bigdl_type='float')[source]¶

Bases: zoo.feature.image.imagePreprocessing.ImagePreprocessing

Resize the image by randomly choosing a scale.

Parameters:

scales – array of scale options to choose from at random
scale_multiple_of – resize images so that their width and height are multiples of this value
max_size – max pixel size of the longest side of a scaled input image

class zoo.feature.image.imagePreprocessing.ImageRandomCrop(crop_width, crop_height, is_clip=True, bigdl_type='float')[source]¶

Bases: zoo.feature.image.imagePreprocessing.ImagePreprocessing

Randomly crop a crop_width x crop_height patch from an image. The patch size should be less than the image size.

Parameters:

crop_width – width after crop
crop_height – height after crop
is_clip – whether to clip the roi to image boundaries

class zoo.feature.image.imagePreprocessing.ImageRandomPreprocessing(preprocessing, prob, bigdl_type='float')[source]¶

Bases: zoo.feature.common.Preprocessing

Randomly apply the preprocessing to some of the input ImageFeatures, with probability specified. E.g. if prob = 0.5, the preprocessing will apply to half of the input ImageFeatures.

Parameters:

preprocessing – preprocessing to apply.
prob – probability to apply the preprocessing action.
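
The idea of applying a step with a given probability can be sketched in plain Python. The helper random_apply is hypothetical and operates on a list of arbitrary items rather than ImageFeatures. Each item is decided independently, so "half" holds only on average.

```python
import random


def random_apply(preprocessing, prob, features, seed=None):
    """Apply `preprocessing` to each feature independently with probability `prob`."""
    rng = random.Random(seed)
    return [preprocessing(f) if rng.random() < prob else f for f in features]


# prob = 1.0 transforms every item, prob = 0.0 leaves every item untouched.
print(random_apply(str.upper, 1.0, ["a", "b", "c"]))  # ['A', 'B', 'C']
print(random_apply(str.upper, 0.0, ["a", "b", "c"]))  # ['a', 'b', 'c']

# With prob = 0.5, roughly half of a large input is transformed.
flags = random_apply(lambda _: 1, 0.5, [0] * 1000, seed=0)
print(sum(flags))
```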

class zoo.feature.image.imagePreprocessing.ImageResize(resize_h, resize_w, resize_mode=1, use_scale_factor=True, bigdl_type='float')[source]¶

Bases: zoo.feature.image.imagePreprocessing.ImagePreprocessing

Resize the image.

Parameters:

resize_h – height after resize
resize_w – width after resize
resize_mode – if resize_mode = -1, a mode is randomly selected from (Imgproc.INTER_LINEAR, Imgproc.INTER_CUBIC, Imgproc.INTER_AREA, Imgproc.INTER_NEAREST, Imgproc.INTER_LANCZOS4)
use_scale_factor – if True, the scale factors fx and fy are used, with fx = fy = 0

Note that the results of the following two calls are different:

Imgproc.resize(mat, mat, new Size(resizeWH, resizeWH), 0, 0, Imgproc.INTER_LINEAR)
Imgproc.resize(mat, mat, new Size(resizeWH, resizeWH))

class zoo.feature.image.imagePreprocessing.ImageSaturation(delta_low, delta_high, bigdl_type='float')[source]¶

Bases: zoo.feature.image.imagePreprocessing.ImagePreprocessing

Adjust the image saturation.

Parameters:

delta_low – saturation parameter: low bound
delta_high – saturation parameter: high bound

class zoo.feature.image.imagePreprocessing.ImageSetToSample(input_keys=['imageTensor'], target_keys=['label'], sample_key='sample', bigdl_type='float')[source]¶

Bases: zoo.feature.image.imagePreprocessing.ImagePreprocessing

Transform an image frame to Samples.

Parameters:

input_keys – keys that map to inputs (each input should be a tensor)
target_keys – keys that map to targets (each target should be a tensor)
sample_key – key to store the sample

class zoo.feature.image.imagePreprocessing.PerImageNormalize(min, max, norm_type=32, bigdl_type='float')[source]¶

Bases: zoo.feature.image.imagePreprocessing.ImagePreprocessing

Normalizes the norm or value range per image, similar to opencv::normalize: https://docs.opencv.org/ref/master/d2/de8/group__core__array.html#ga87eef7ee3970f86906d69a92cbf064bd

PerImageNormalize normalizes, scales, and shifts the input features. Various normalization methods are supported, e.g. NORM_INF, NORM_L1, NORM_L2 or NORM_MINMAX. Please note that this is a per-image normalization.

Parameters:

min – lower range boundary in case of range normalization, or the norm value to normalize to
max – upper range boundary in case of range normalization. It is not used for norm normalization.
norm_type – normalization type, see opencv:NormTypes (https://docs.opencv.org/ref/master/d2/de8/group__core__array.html#gad12cefbcb5291cf958a85b4b67b6149f). Default is Core.NORM_MINMAX.
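
For the default range normalization (NORM_MINMAX), the values of one image are rescaled linearly so that the smallest value maps to min and the largest to max. The sketch below shows that arithmetic on a flat list of values. The helper minmax_normalize is hypothetical, and its handling of a constant image is a choice made for this example.

```python
def minmax_normalize(values, new_min, new_max):
    """Linearly rescale `values` so they span [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant input: map everything to the lower bound (a choice for this sketch).
        return [float(new_min) for _ in values]
    scale = (new_max - new_min) / (hi - lo)
    return [new_min + (v - lo) * scale for v in values]


pixels = [50, 100, 150]
print(minmax_normalize(pixels, 0.0, 1.0))
```

Because the minimum and maximum are taken from the image itself, two images with different brightness ranges end up on the same scale.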

class zoo.feature.image.imagePreprocessing.RowToImageFeature(bigdl_type='float')[source]¶

Bases: zoo.feature.common.Preprocessing

a Transformer that converts a Spark Row to a BigDL ImageFeature.

imageset¶

class zoo.feature.image.imageset.DistributedImageSet(image_rdd=None, label_rdd=None, jvalue=None, bigdl_type='float')[source]¶

Bases: zoo.feature.image.imageset.ImageSet

DistributedImageSet wraps an RDD of ImageFeature

get_image(key='floats', to_chw=True)[source]¶: get image rdd from ImageSet

get_label()[source]¶: get label rdd from ImageSet

get_predict(key='predict')[source]¶: get prediction rdd from ImageSet

class zoo.feature.image.imageset.ImageSet(jvalue, bigdl_type='float')[source]¶

Bases: bigdl.util.common.JavaValue

ImageSet wraps a set of ImageFeature

classmethod from_image_frame(image_frame, bigdl_type='float')[source]¶

classmethod from_rdds(image_rdd, label_rdd=None, bigdl_type='float')[source]¶

Create an ImageSet from RDDs of ndarray.

Parameters:

image_rdd – an RDD of ndarray; each ndarray should have 3 dimensions, or 4 for 3D images
label_rdd – an RDD of ndarray

Returns:	a DistributedImageSet

get_image(key='floats', to_chw=True)[source]¶: get image from ImageSet

get_label()[source]¶: get label from ImageSet

get_predict(key='predict')[source]¶: get prediction from ImageSet

is_distributed()[source]¶: whether this is a DistributedImageSet

is_local()[source]¶: whether this is a LocalImageSet

label_map¶

Returns:	the labelMap of this ImageSet, or None if the ImageSet does not have a labelMap

classmethod read(path, sc=None, min_partitions=1, resize_height=-1, resize_width=-1, image_codec=-1, with_label=False, one_based_label=True, bigdl_type='float')[source]¶

Read images as an ImageSet.

Parameters:

path – path to read images from. If sc is defined, path can be local or on HDFS, and wildcard characters are supported. If with_label is True, path should be a directory with two levels: the first level contains class folders and the second level contains images. All images belonging to the same class should be put into the same class folder, so each image is labeled by the folder it belongs to.
sc – SparkContext
min_partitions – a suggested value for the minimal number of splits of the input data
resize_height – height after resize; the default is -1, which does not resize the image
resize_width – width after resize; the default is -1, which does not resize the image
image_codec – specifies the color type of a loaded image, same as in OpenCV.imread. The default is Imgcodecs.CV_LOAD_IMAGE_UNCHANGED (-1)
with_label – whether to treat folders in the path as image classification labels and read the labels into the ImageSet
one_based_label – whether to use one-based labels

Returns:

ImageSet

to_image_frame(bigdl_type='float')[source]¶

transform(transformer)[source]¶: transform the ImageSet

class zoo.feature.image.imageset.LocalImageSet(image_list=None, label_list=None, jvalue=None, bigdl_type='float')[source]¶

Bases: zoo.feature.image.imageset.ImageSet

LocalImageSet wraps a list of ImageFeature

get_image(key='floats', to_chw=True)[source]¶: get image list from ImageSet

get_label()[source]¶: get label list from ImageSet

get_predict(key='predict')[source]¶: get prediction list from ImageSet

transformation¶

class zoo.feature.image3d.transformation.AffineTransform3D(affine_mat, translation=array([0., 0., 0.]), clamp_mode='clamp', pad_val=0.0, bigdl_type='float')[source]¶

Bases: zoo.feature.image3d.transformation.ImagePreprocessing3D

Affine transformer implements affine transformation on a given tensor. To avoid defects in resampling, the mapping is from destination to source: dst(z,y,x) = src(f(z),f(y),f(x)), where f: dst -> src.

Parameters:

affine_mat – numpy array of shape 3x3. Defines the affine transformation from dst to src.
translation – numpy array with 3 elements. Default value is np.zeros(3). Defines the translation along each axis.
clamp_mode – str, default value is "clamp". Defines how to handle interpolation off the input image.
pad_val – float, default is 0.0. Defines the padding value when clamp_mode="padding". Setting this value when clamp_mode="clamp" will cause an error.

class zoo.feature.image3d.transformation.CenterCrop3D(crop_depth, crop_height, crop_width, bigdl_type='float')[source]¶

Bases: zoo.feature.image3d.transformation.ImagePreprocessing3D

Center crop a crop_depth x crop_height x crop_width patch from an image. The patch size should be less than the image size.

Parameters:

crop_depth – depth after crop
crop_height – height after crop
crop_width – width after crop

class zoo.feature.image3d.transformation.Crop3D(start, patch_size, bigdl_type='float')[source]¶

Bases: zoo.feature.image3d.transformation.ImagePreprocessing3D

Crop a patch of size patch_size from a 3D image, starting at start. The patch size should be less than the image size.

Parameters:

start – start point list [depth, height, width] for cropping
patch_size – patch size list [depth, height, width]

class zoo.feature.image3d.transformation.ImagePreprocessing3D(bigdl_type='float', *args)[source]¶

Bases: zoo.feature.image.imagePreprocessing.ImagePreprocessing

ImagePreprocessing3D is a transformer that transforms an ImageFeature for 3D images.

class zoo.feature.image3d.transformation.RandomCrop3D(crop_depth, crop_height, crop_width, bigdl_type='float')[source]¶

Bases: zoo.feature.image3d.transformation.ImagePreprocessing3D

Randomly crop a crop_depth x crop_height x crop_width patch from an image. The patch size should be less than the image size.

Parameters:

crop_depth – depth after crop
crop_height – height after crop
crop_width – width after crop

class zoo.feature.image3d.transformation.Rotate3D(rotation_angles, bigdl_type='float')[source]¶

Bases: zoo.feature.image3d.transformation.ImagePreprocessing3D

Rotate a 3D image by the specified angles.

Parameters:	rotation_angles – the angles for rotation: yaw (a counterclockwise rotation angle about the z-axis), pitch (a counterclockwise rotation angle about the y-axis), and roll (a counterclockwise rotation angle about the x-axis).
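
For reference, yaw, pitch and roll are commonly combined into a single rotation matrix R = Rz(yaw) · Ry(pitch) · Rx(roll). The sketch below builds that matrix in plain Python. The composition order, the use of radians, and the helper names are assumptions. It rotates (x, y, z) points and does not resample an image volume.

```python
import math


def matmul(a, b):
    """Multiply two 3x3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def rotation_matrix(yaw, pitch, roll):
    """Return Rz(yaw) * Ry(pitch) * Rx(roll); all angles are in radians."""
    cz, sz = math.cos(yaw), math.sin(yaw)
    cy, sy = math.cos(pitch), math.sin(pitch)
    cx, sx = math.cos(roll), math.sin(roll)
    rz = [[cz, -sz, 0.0], [sz, cz, 0.0], [0.0, 0.0, 1.0]]  # about the z-axis
    ry = [[cy, 0.0, sy], [0.0, 1.0, 0.0], [-sy, 0.0, cy]]  # about the y-axis
    rx = [[1.0, 0.0, 0.0], [0.0, cx, -sx], [0.0, sx, cx]]  # about the x-axis
    return matmul(matmul(rz, ry), rx)


def rotate(matrix, point):
    """Apply a 3x3 matrix to an (x, y, z) point."""
    return tuple(sum(matrix[i][k] * point[k] for k in range(3)) for i in range(3))


# A quarter-turn yaw moves the x-axis onto the y-axis (up to rounding error).
rotated = rotate(rotation_matrix(math.pi / 2, 0.0, 0.0), (1.0, 0.0, 0.0))
print([round(v, 6) for v in rotated])
```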

text_feature¶

class zoo.feature.text.text_feature.TextFeature(text=None, label=None, uri=None, jvalue=None, bigdl_type='float')[source]¶

Bases: bigdl.util.common.JavaValue

Each TextFeature keeps information of a single text record. It can include various status (if any) of a text, e.g. original text content, uri, category label, tokens, index representation of tokens, BigDL Sample representation, prediction result and so on.

get_label()[source]¶

Get the label of the TextFeature. If no label is stored, -1 will be returned.

Returns:	Int

get_sample()[source]¶

Get the Sample representation of the TextFeature. If the TextFeature hasn’t been transformed to Sample, None will be returned.

Returns:	BigDL Sample

get_text()[source]¶

Get the text content of the TextFeature.

Returns:	String

get_tokens()[source]¶

Get the tokens of the TextFeature. If text hasn’t been segmented, None will be returned.

Returns:	List of String

get_uri()[source]¶

Get the identifier of the TextFeature. If no id is stored, None will be returned.

Returns:	String

has_label()[source]¶

Whether the TextFeature contains label.

Returns:	Boolean

keys()[source]¶

Get the keys that the TextFeature contains.

Returns:	List of String

set_label(label)[source]¶

Set the label for the TextFeature.

Parameters:	label – Int
Returns:	The TextFeature with label.

text_set¶

class zoo.feature.text.text_set.DistributedTextSet(texts=None, labels=None, jvalue=None, bigdl_type='float')[source]¶

Bases: zoo.feature.text.text_set.TextSet

DistributedTextSet is comprised of RDDs.

class zoo.feature.text.text_set.LocalTextSet(texts=None, labels=None, jvalue=None, bigdl_type='float')[source]¶

Bases: zoo.feature.text.text_set.TextSet

LocalTextSet is comprised of lists.

class zoo.feature.text.text_set.TextSet(jvalue, bigdl_type='float', *args)[source]¶

Bases: bigdl.util.common.JavaValue

TextSet wraps a set of texts with status.

classmethod from_relation_lists(relations, corpus1, corpus2, bigdl_type='float')[source]¶

Used to generate a TextSet for ranking.

This method does the following:

1. For each id1 in relations, find the list of id2 with corresponding label that comes together with id1. In other words, group relations by id1.
2. Join with corpus to transform each id to indexedTokens. Note: Make sure that the corpus has been transformed by SequenceShaper and WordIndexer.
3. For each list, generate a TextFeature having Sample with:
   - feature of shape (list_length, text1_length + text2_length).
   - label of shape (list_length, 1).

Parameters:

relations – List or RDD of Relation.
corpus1 – TextSet that contains all id1 in relations. For each TextFeature in corpus1, text must have been transformed to indexedTokens of the same length.
corpus2 – TextSet that contains all id2 in relations. For each TextFeature in corpus2, text must have been transformed to indexedTokens of the same length.

Note that if relations is a list, then corpus1 and corpus2 must both be LocalTextSet. If relations is RDD, then corpus1 and corpus2 must both be DistributedTextSet.

Returns:	TextSet.
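
The grouping and shapes described for from_relation_lists can be sketched with plain Python containers. Here a relation is modelled as an (id1, id2, label) tuple and each corpus as a dict from id to indexed tokens. These stand-ins and the helper name relation_lists are assumptions for illustration, not the library's data structures.

```python
from collections import defaultdict


def relation_lists(relations, corpus1, corpus2):
    """Group relations by id1 and build one (features, labels) pair per id1."""
    grouped = defaultdict(list)
    for id1, id2, label in relations:
        grouped[id1].append((id2, label))

    samples = {}
    for id1, candidates in grouped.items():
        # Each row concatenates the tokens of text1 and one candidate text2.
        features = [corpus1[id1] + corpus2[id2] for id2, _ in candidates]
        labels = [[label] for _, label in candidates]
        samples[id1] = (features, labels)
    return samples


relations = [("q1", "a1", 1), ("q1", "a2", 0), ("q2", "a1", 0)]
corpus1 = {"q1": [1, 2], "q2": [3, 4]}            # text1_length = 2
corpus2 = {"a1": [5, 6, 7], "a2": [8, 9, 10]}     # text2_length = 3

features, labels = relation_lists(relations, corpus1, corpus2)["q1"]
print(features)  # [[1, 2, 5, 6, 7], [1, 2, 8, 9, 10]] -> shape (2, 5)
print(labels)    # [[1], [0]]                          -> shape (2, 1)
```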

classmethod from_relation_pairs(relations, corpus1, corpus2, bigdl_type='float')[source]¶

Used to generate a TextSet for pairwise training.

This method does the following:

1. Generate all RelationPairs: (id1, id2Positive, id2Negative) from Relations.
2. Join RelationPairs with corpus to transform id to indexedTokens. Note: Make sure that the corpus has been transformed by SequenceShaper and WordIndexer.
3. For each pair, generate a TextFeature having Sample with:
   - feature of shape (2, text1Length + text2Length).
   - label of value [1 0] as the positive relation is placed before the negative one.

Parameters:

relations – List or RDD of Relation.
corpus1 – TextSet that contains all id1 in relations. For each TextFeature in corpus1, text must have been transformed to indexedTokens of the same length.
corpus2 – TextSet that contains all id2 in relations. For each TextFeature in corpus2, text must have been transformed to indexedTokens of the same length.

Note that if relations is a list, then corpus1 and corpus2 must both be LocalTextSet. If relations is RDD, then corpus1 and corpus2 must both be DistributedTextSet.

Returns:	TextSet.
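
A plain-Python sketch of the pairing step for from_relation_pairs is shown below. Relations are modelled as (id1, id2, label) tuples, and a label greater than 0 is treated as positive. Both conventions, as well as the helper names, are assumptions made for this example.

```python
from collections import defaultdict


def relation_pairs(relations):
    """Return all (id1, id2_positive, id2_negative) triples."""
    positives, negatives = defaultdict(list), defaultdict(list)
    for id1, id2, label in relations:
        (positives if label > 0 else negatives)[id1].append(id2)
    return [(id1, pos, neg)
            for id1 in positives
            for pos in positives[id1]
            for neg in negatives.get(id1, [])]


def pair_sample(pair, corpus1, corpus2):
    """Build a (feature, label) pair with the positive row placed first."""
    id1, pos, neg = pair
    feature = [corpus1[id1] + corpus2[pos], corpus1[id1] + corpus2[neg]]
    return feature, [1, 0]


relations = [("q1", "a1", 1), ("q1", "a2", 0), ("q1", "a3", 0), ("q2", "a1", 0)]
corpus1 = {"q1": [1, 2], "q2": [3, 4]}
corpus2 = {"a1": [5, 6], "a2": [7, 8], "a3": [9, 10]}

pairs = relation_pairs(relations)
print(pairs)  # [('q1', 'a1', 'a2'), ('q1', 'a1', 'a3')]
print(pair_sample(pairs[0], corpus1, corpus2))
# ([[1, 2, 5, 6], [1, 2, 7, 8]], [1, 0])
```

In this sketch q2 yields no pair because it has no positive relation.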

generate_sample()[source]¶

Generate BigDL Sample. Need to word2idx first. See TextFeatureToSample for more details.

Returns:	TextSet with Samples.

generate_word_index_map(remove_topN=0, max_words_num=-1, min_freq=1, existing_map=None)[source]¶

Generate word_index map based on sorted word frequencies in descending order. Return the result dictionary, which can also be retrieved by ‘get_word_index()’. Make sure you call this after tokenize. Otherwise you will get an error. See word2idx for more details.

Returns:	Dictionary {word: id}
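
The following sketch shows one way a frequency-sorted word index could be built. It is not the library's implementation. The interpretation of remove_topN (skip the N most frequent words), max_words_num (cap on vocabulary size, -1 for no cap), min_freq (minimum count) and existing_map (indices to preserve) is assumed from the parameter names, and ties are broken alphabetically for determinism.

```python
from collections import Counter


def build_word_index(token_lists, remove_topN=0, max_words_num=-1,
                     min_freq=1, existing_map=None):
    """Return {word: id}, with ids starting from 1 in descending frequency order."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    # Descending frequency; alphabetical tie-break is an assumption.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    ranked = [(w, c) for w, c in ranked if c >= min_freq]
    ranked = ranked[remove_topN:]
    if max_words_num > 0:
        ranked = ranked[:max_words_num]

    word_index = dict(existing_map) if existing_map else {}
    next_id = max(word_index.values(), default=0) + 1
    for word, _ in ranked:
        if word not in word_index:
            word_index[word] = next_id
            next_id += 1
    return word_index


tokens = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "end"]]
word_index = build_word_index(tokens)
print(word_index)  # {'the': 1, 'sat': 2, 'cat': 3, 'dog': 4, 'end': 5}

# Index 0 stands for unknown words when mapping tokens to indices.
print([word_index.get(t, 0) for t in ["the", "bird", "sat"]])  # [1, 0, 2]
```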

get_labels()[source]¶

Get the labels of a TextSet (if any). If a text doesn’t have a label, its corresponding position will be -1.

Returns:	List of int for LocalTextSet. RDD of int for DistributedTextSet.

get_predicts()[source]¶

Get the prediction results (if any) combined with uris (if any) of a TextSet. If a text doesn’t have a uri, its corresponding uri will be None. If a text hasn’t been predicted by a model, its corresponding prediction will be None.

Returns:	List of (uri, prediction as a list of numpy array) for LocalTextSet. RDD of (uri, prediction as a list of numpy array) for DistributedTextSet.

get_samples()[source]¶

Get the BigDL Sample representations of a TextSet (if any). If a text hasn’t been transformed to Sample, its corresponding position will be None.

Returns:	List of Sample for LocalTextSet. RDD of Sample for DistributedTextSet.

get_texts()[source]¶

Get the text contents of a TextSet.

Returns:	List of String for LocalTextSet. RDD of String for DistributedTextSet.

get_uris()[source]¶

Get the identifiers of a TextSet. If a text doesn’t have a uri, its corresponding position will be None.

Returns:	List of String for LocalTextSet. RDD of String for DistributedTextSet.

get_word_index()[source]¶

Get the word_index dictionary of the TextSet. If the TextSet hasn’t been transformed from word to index, None will be returned.

Returns:	Dictionary {word: id}

is_distributed()[source]¶

Whether it is a DistributedTextSet.

Returns:	Boolean

is_local()[source]¶

Whether it is a LocalTextSet.

Returns:	Boolean

load_word_index(path)[source]¶

Load the word_index map which was saved after the training, so that this TextSet can directly use this word_index during inference. Each separate line should be “word id”.

Note that after calling load_word_index, you do not need to specify any argument when calling word2idx in the preprocessing pipeline as now you are using exactly the loaded word_index for transformation.

For LocalTextSet, load txt from a local file system. For DistributedTextSet, load txt from a local or distributed file system (such as HDFS).

Returns:	TextSet with the loaded word_index.

normalize()[source]¶

Do normalization on tokens. Need to tokenize first. See Normalizer for more details.

Returns:	TextSet after normalization.

random_split(weights)[source]¶

Randomly split into list of TextSet with provided weights. Only available for DistributedTextSet for now.

Parameters:	weights – List of float indicating the split portions.

classmethod read(path, sc=None, min_partitions=1, bigdl_type='float')[source]¶

Read text files with labels from a directory. The folder structure is expected to be the following:

path
  |dir1 - text1, text2, …
  |dir2 - text1, text2, …
  |dir3 - text1, text2, …

Under the target path, there ought to be N subdirectories (dir1 to dirN). Each subdirectory represents a category and contains all texts that belong to that category. Each category will be given a label according to its position among all subdirectories sorted in ascending order. Each text will be given the label of the subdirectory where it is located. Labels start from 0.

Parameters:

path – The folder path to texts. Local and distributed file systems (such as HDFS) are supported. If you want to read from a distributed file system, sc needs to be specified.
sc – An instance of SparkContext. If specified, texts will be read as a DistributedTextSet. Default is None and in this case texts will be read as a LocalTextSet.
min_partitions – Int. A suggestion value of the minimal partition number for input texts. Only need to specify this when sc is not None. Default is 1.

Returns:

TextSet.
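
The labeling rule for this directory layout can be reproduced with the standard library alone. The sketch below builds a small temporary directory tree and assigns labels by sorted subdirectory name. The helper read_labeled_texts is hypothetical and returns plain (text, label) tuples instead of a TextSet.

```python
import os
import tempfile


def read_labeled_texts(path):
    """Return (text, label) tuples; labels follow sorted subdirectory order, from 0."""
    categories = sorted(
        d for d in os.listdir(path) if os.path.isdir(os.path.join(path, d))
    )
    records = []
    for label, category in enumerate(categories):
        folder = os.path.join(path, category)
        for name in sorted(os.listdir(folder)):
            with open(os.path.join(folder, name), encoding="utf-8") as f:
                records.append((f.read(), label))
    return records


with tempfile.TemporaryDirectory() as root:
    for category, text in [("sports", "goal"), ("news", "election")]:
        os.makedirs(os.path.join(root, category))
        with open(os.path.join(root, category, "doc.txt"), "w", encoding="utf-8") as f:
            f.write(text)
    records = read_labeled_texts(root)

print(records)  # [('election', 0), ('goal', 1)]
```

Here "news" sorts before "sports", so texts under news receive label 0.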

classmethod read_csv(path, sc=None, min_partitions=1, bigdl_type='float')[source]¶

Read texts with id from csv file. Each record is supposed to contain the following two fields in order: id(string) and text(string). Note that the csv file should be without header.

Parameters:

path – The path to the csv file. Local and distributed file systems (such as HDFS) are supported. If you want to read from a distributed file system, sc needs to be specified.
sc – An instance of SparkContext. If specified, texts will be read as a DistributedTextSet. Default is None and in this case texts will be read as a LocalTextSet.
min_partitions – Int. A suggestion value of the minimal partition number for input texts. Only need to specify this when sc is not None. Default is 1.

Returns:

TextSet.

classmethod read_parquet(path, sc, bigdl_type='float')[source]¶

Read texts with id from parquet file. Schema should be the following: “id”(string) and “text”(string).

Parameters:

path – The path to the parquet file.
sc – An instance of SparkContext.
Returns:	DistributedTextSet.

save_word_index(path)[source]¶

Save the word_index dictionary to text file, which can be used for future inference. Each separate line will be “word id”.

For LocalTextSet, save txt to a local file system. For DistributedTextSet, save txt to a local or distributed file system (such as HDFS).

Parameters:	path – The path to the text file.
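The “word id” line format can be sketched in plain Python as follows. This only illustrates the file layout on a local file system; the helper names write_word_index and read_word_index are hypothetical and are not the library's methods.

```python
def write_word_index(word_index, path):
    """Write one "word id" pair per line."""
    with open(path, "w", encoding="utf-8") as f:
        for word, idx in word_index.items():
            f.write(f"{word} {idx}\n")


def read_word_index(path):
    """Read "word id" lines back into a dict of word -> int id."""
    word_index = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue  # ignore blank lines
            # Split on the last space so that the id is always the final field.
            word, idx = line.rstrip("\n").rsplit(" ", 1)
            word_index[word] = int(idx)
    return word_index
```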

set_word_index(vocab)[source]¶

Assign a word_index dictionary for this TextSet to use during word2idx. If you load the word_index from a saved file, it is recommended to use load_word_index directly.

Returns:	TextSet with the word_index set.

shape_sequence(len, trunc_mode='pre', pad_element=0)[source]¶

Shape the sequence of indices to a fixed length. Need to word2idx first. See SequenceShaper for more details.

Returns:	TextSet after sequence shaping.

to_distributed(sc=None, partition_num=4)[source]¶

Convert to a DistributedTextSet.

Need to specify SparkContext to convert a LocalTextSet to a DistributedTextSet. In this case, you may also want to specify partition_num, the default of which is 4.

Returns:	DistributedTextSet

to_local()[source]¶

Convert to a LocalTextSet.

Returns:	LocalTextSet

tokenize()[source]¶

Do tokenization on original text. See Tokenizer for more details.

Returns:	TextSet after tokenization.

transform(transformer)[source]¶

word2idx(remove_topN=0, max_words_num=-1, min_freq=1, existing_map=None)[source]¶

Map word tokens to indices. Important: Take care that this method behaves a bit differently for training and inference.

Training:

During training, you need to generate a new word_index dictionary according to the texts you are dealing with. This method therefore first generates the dictionary and then converts words to indices based on the generated dictionary.

You can specify the following arguments to constrain how the dictionary is generated. In the resulting dictionary, indices start from 1 and are assigned in descending order of word occurrence frequency. Here we adopt the convention that index 0 is reserved for unknown words. After word2idx, you can get the generated word_index dictionary by calling get_word_index. You can also call save_word_index to save this word_index dictionary for later use.

Parameters:

remove_topN – Non-negative int. Remove the topN words with highest frequencies in the case where those are treated as stopwords. Default is 0, namely remove nothing.
max_words_num – Int. The maximum number of words to be taken into consideration. Default is -1, namely all words will be considered. Otherwise, it should be a positive int.
min_freq – Positive int. Only those words with frequency >= min_freq will be taken into consideration. Default is 1, namely all words that occur will be considered.
existing_map – Existing dictionary of word_index if any. Default is None and in this case a new dictionary with index starting from 1 will be generated. If not None, then the generated dictionary will preserve the word_index in existing_map and assign subsequent indices to new words.

Inference:

During inference, you are supposed to use exactly the same word_index dictionary as in the training stage instead of generating a new one. You therefore do not need to specify any of the above arguments. You need to call load_word_index or set_word_index beforehand to load the dictionary.

Need to tokenize first. See WordIndexer for more details.

Returns:	TextSet after word2idx.
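The dictionary generation described for the training case can be sketched in plain Python. This is an illustration under stated assumptions, not the library's implementation: the helper name build_word_index is hypothetical, ties between equally frequent words are broken alphabetically, and remove_topN is applied before min_freq and max_words_num.

```python
from collections import Counter


def build_word_index(tokenized_texts, remove_topN=0, max_words_num=-1,
                     min_freq=1, existing_map=None):
    """Build a word -> index dict; indices start from 1, most frequent first."""
    counts = Counter(tok for tokens in tokenized_texts for tok in tokens)
    # Descending frequency; alphabetical tie-break is an assumption.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    # Drop the remove_topN most frequent words, then enforce min_freq.
    ranked = [(w, c) for w, c in ranked[remove_topN:] if c >= min_freq]
    if max_words_num > 0:
        ranked = ranked[:max_words_num]
    # Preserve any existing entries and continue numbering after them.
    word_index = dict(existing_map) if existing_map else {}
    next_index = max(word_index.values(), default=0) + 1
    for word, _ in ranked:
        if word not in word_index:
            word_index[word] = next_index
            next_index += 1
    return word_index


texts = [["the", "cat", "the", "dog"], ["the", "cat"]]
```

Index 0 is never assigned here, in line with the convention that it is reserved for unknown words.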

transformer¶

class zoo.feature.text.transformer.Normalizer(bigdl_type='float')[source]¶

Bases: zoo.feature.text.transformer.TextTransformer

Removes all dirty characters (those outside the English alphabet) from tokens and converts words to lower case. Need to tokenize first. Original tokens will be replaced by normalized tokens.

>>> normalizer = Normalizer()
creating: createNormalizer
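The effect of normalization can be approximated in plain Python with a regular expression. This is only a sketch: the helper name normalize_tokens is hypothetical, and keeping tokens that become empty is an assumption rather than documented behavior.

```python
import re

# Matches every character that is not an English letter.
_NON_ENGLISH_LETTER = re.compile(r"[^a-zA-Z]")


def normalize_tokens(tokens):
    """Strip non-English-alphabet characters and lower-case each token."""
    return [_NON_ENGLISH_LETTER.sub("", tok).lower() for tok in tokens]


normalized = normalize_tokens(["Hello,", "World!", "it's", "42"])
```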

class zoo.feature.text.transformer.SequenceShaper(len, trunc_mode='pre', pad_element=0, bigdl_type='float')[source]¶

Bases: zoo.feature.text.transformer.TextTransformer

Shape the sequence of indices to a fixed length. If the original sequence is longer than the target length, it will be truncated from the beginning or the end. If the original sequence is shorter than the target length, it will be padded to the end. Need to word2idx first. The original indices sequence will be replaced by the shaped sequence.

Parameters:

len – Positive int. The target length.
trunc_mode – String. Truncation mode, either ‘pre’ or ‘post’. Default is ‘pre’. If ‘pre’, the sequence will be truncated from the beginning. If ‘post’, the sequence will be truncated from the end.
pad_element – Int. The element used to pad the sequence if the original length is smaller than the target length. Default is 0, following the convention that index 0 is reserved for unknown words.

>>> sequence_shaper = SequenceShaper(len=6, trunc_mode="post", pad_element=10000)
creating: createSequenceShaper
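The truncation and padding rules can be written out as a short pure-Python sketch. The function name shape_indices is hypothetical; it mirrors the documented behavior (truncate from the beginning or the end, pad at the end) but is not the library's code.

```python
def shape_indices(indices, length, trunc_mode="pre", pad_element=0):
    """Return a copy of `indices` with exactly `length` elements."""
    if length <= 0:
        raise ValueError("length must be a positive int")
    if trunc_mode not in ("pre", "post"):
        raise ValueError("trunc_mode must be 'pre' or 'post'")
    indices = list(indices)
    if len(indices) > length:
        # 'pre' removes elements from the beginning, 'post' from the end.
        return indices[-length:] if trunc_mode == "pre" else indices[:length]
    # Shorter (or equal) sequences are padded at the end.
    return indices + [pad_element] * (length - len(indices))
```

For example, shaping [1, 2, 3, 4, 5] to length 3 keeps [3, 4, 5] in ‘pre’ mode and [1, 2, 3] in ‘post’ mode.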

class zoo.feature.text.transformer.TextFeatureToSample(bigdl_type='float')[source]¶

Bases: zoo.feature.text.transformer.TextTransformer

Transform indexedTokens and label (if any) of a TextFeature to a BigDL Sample. Need to word2idx first.

>>> to_sample = TextFeatureToSample()
creating: createTextFeatureToSample

class zoo.feature.text.transformer.TextTransformer(bigdl_type='float', *args)[source]¶

Bases: zoo.feature.common.Preprocessing

Base class of Transformers that transform TextFeature.

transform(text_feature)[source]¶

Transform a TextFeature.

class zoo.feature.text.transformer.Tokenizer(bigdl_type='float')[source]¶

Bases: zoo.feature.text.transformer.TextTransformer

Transform text to array of string tokens.

>>> tokenizer = Tokenizer()
creating: createTokenizer

class zoo.feature.text.transformer.WordIndexer(map, bigdl_type='float')[source]¶

Bases: zoo.feature.text.transformer.TextTransformer

Given a wordIndex map, transform tokens to corresponding indices. Words that are not in the map will be dropped. Need to tokenize first.

Parameters:	map – Dict with word (string) as its key and index (int) as its value.

>>> word_indexer = WordIndexer(map={"it": 1, "me": 2})
creating: createWordIndexer
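The mapping step can be pictured with the following plain-Python sketch, which reuses the map from the example above. The helper name index_tokens is hypothetical, and dropping unmapped words follows the description rather than the library source.

```python
def index_tokens(tokens, word_index):
    """Map tokens to indices, skipping tokens that are not in word_index."""
    return [word_index[tok] for tok in tokens if tok in word_index]


indexed = index_tokens(["it", "likes", "me"], {"it": 1, "me": 2})
```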

common¶

class zoo.feature.common.ArrayToTensor(size, bigdl_type='float')[source]¶

Bases: zoo.feature.common.Preprocessing

A Transformer that converts an Array[_] to a Tensor.

Parameters:	size – dimensions of target Tensor.

class zoo.feature.common.BigDLAdapter(bigdl_transformer, bigdl_type='float')[source]¶

Bases: zoo.feature.common.Preprocessing

class zoo.feature.common.ChainedPreprocessing(transformers, bigdl_type='float')[source]¶

Bases: zoo.feature.common.Preprocessing

Chains Preprocessing steps together. The output type of each Preprocessing should be the same as the input type of the Preprocessing that follows it.
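The idea of chaining can be shown with ordinary Python callables, where each step consumes the output of the previous one. This is a conceptual sketch only; the helper name chain is hypothetical and plain functions stand in for Preprocessing objects.

```python
from functools import reduce


def chain(*steps):
    """Compose steps left to right: the output of one feeds the next."""
    def chained(value):
        return reduce(lambda acc, step: step(acc), steps, value)
    return chained


def lower_all(tokens):
    """Lower-case every token in a list of strings."""
    return [tok.lower() for tok in tokens]


# str.split outputs a list of strings, which is what lower_all expects.
pipeline = chain(str.split, lower_all)
result = pipeline("Hello World")
```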

class zoo.feature.common.FeatureLabelPreprocessing(feature_transformer, label_transformer, bigdl_type='float')[source]¶

Bases: zoo.feature.common.Preprocessing

Construct a Transformer that converts a (Feature, Label) tuple to a Sample. The returned Transformer is robust for the case label = null, in which the Sample is derived from Feature only.

Parameters:

feature_transformer – transformer for feature, transform F to Tensor[T]
label_transformer – transformer for label, transform L to Tensor[T]

class zoo.feature.common.FeatureSet(jvalue=None, bigdl_type='float')[source]¶

Bases: bigdl.dataset.dataset.DataSet

A set of data which is used in the model optimization process. The FeatureSet can be accessed in a random data sample sequence. In the training process, the data sequence is a looped endless sequence, while in the validation process it is a sequence of limited length. Different from BigDL’s DataSet, this FeatureSet can be cached to Intel Optane DC Persistent Memory if you set memory_type to PMEM when creating the FeatureSet.

classmethod image_frame(image_frame, memory_type='DRAM', sequential_order=False, shuffle=True, bigdl_type='float')[source]¶

Create FeatureSet from ImageFrame.

Parameters:

image_frame – ImageFrame
memory_type – string, DRAM, PMEM or an Int number. If it’s DRAM, the dataset will be cached in dynamic random-access memory. If it’s PMEM, the dataset will be cached in Intel Optane DC Persistent Memory. If it’s an Int number n, the dataset will be cached on disk and only 1/n of the data will be held in memory during training; after going through that 1/n, the current cache is released and another 1/n is loaded into memory.
sequential_order – whether to iterate the elements in the feature set in sequential order for training.
shuffle – whether to shuffle the elements in each partition before each epoch when training.
bigdl_type – numeric type

Returns:

A feature set
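The Int-number caching mode can be pictured with a toy generator that keeps only one slice of roughly 1/n of the data at a time. This is purely illustrative: the name iterate_in_slices is hypothetical, an in-memory list stands in for the on-disk dataset, and it says nothing about how the library actually manages its cache.

```python
def iterate_in_slices(data, n):
    """Yield (slice_index, item), holding about 1/n of `data` at a time."""
    # Ceiling division so that at most n slices cover all of the data.
    slice_size = max(1, -(-len(data) // n))
    for slice_index, start in enumerate(range(0, len(data), slice_size)):
        # "Load" the current slice; the previous one is no longer referenced.
        cache = data[start:start + slice_size]
        for item in cache:
            yield slice_index, item
```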

classmethod image_set(imageset, memory_type='DRAM', sequential_order=False, shuffle=True, bigdl_type='float')[source]¶

Create FeatureSet from ImageSet.

Parameters:

imageset – ImageSet
memory_type – string, DRAM, PMEM or an Int number. If it’s DRAM, the dataset will be cached in dynamic random-access memory. If it’s PMEM, the dataset will be cached in Intel Optane DC Persistent Memory. If it’s an Int number n, the dataset will be cached on disk and only 1/n of the data will be held in memory during training; after going through that 1/n, the current cache is released and another 1/n is loaded into memory.
sequential_order – whether to iterate the elements in the feature set in sequential order for training.
shuffle – whether to shuffle the elements in each partition before each epoch when training.
bigdl_type – numeric type

Returns:

A feature set

classmethod pytorch_dataloader(dataloader, features='_data[0]', labels='_data[1]', bigdl_type='float')[source]¶

Create FeatureSet from a PyTorch dataloader.

Parameters:

dataloader – a PyTorch dataloader, or a function that returns a PyTorch dataloader.
features – features in _data, where _data is obtained from the dataloader.
labels – labels in _data, where _data is obtained from the dataloader.
bigdl_type – numeric type

Returns:

A feature set

classmethod rdd(rdd, memory_type='DRAM', sequential_order=False, shuffle=True, bigdl_type='float')[source]¶

Create FeatureSet from RDD.

Parameters:

rdd – An RDD
memory_type – string, DRAM, PMEM or an Int number. If it’s DRAM, the dataset will be cached in dynamic random-access memory. If it’s PMEM, the dataset will be cached in Intel Optane DC Persistent Memory. If it’s an Int number n, the dataset will be cached on disk and only 1/n of the data will be held in memory during training; after going through that 1/n, the current cache is released and another 1/n is loaded into memory.
sequential_order – whether to iterate the elements in the feature set in sequential order when training.
shuffle – whether to shuffle the elements in each partition before each epoch when training.
bigdl_type – numeric type

Returns:

A feature set

classmethod sample_rdd(rdd, memory_type='DRAM', sequential_order=False, shuffle=True, bigdl_type='float')[source]¶

Create FeatureSet from RDD[Sample].

Parameters:

rdd – An RDD[Sample]
memory_type – string, DRAM, PMEM or an Int number. If it’s DRAM, the dataset will be cached in dynamic random-access memory. If it’s PMEM, the dataset will be cached in Intel Optane DC Persistent Memory. If it’s an Int number n, the dataset will be cached on disk and only 1/n of the data will be held in memory during training; after going through that 1/n, the current cache is released and another 1/n is loaded into memory.
sequential_order – whether to iterate the elements in the feature set in sequential order when training.
shuffle – whether to shuffle the elements in each partition before each epoch when training.
bigdl_type – numeric type

Returns:

A feature set

classmethod tf_dataset(func, total_size, bigdl_type='float')[source]¶

Parameters:

func – a function that returns a TensorFlow dataset
total_size – total size of this dataset
bigdl_type – numeric type

Returns:

A feature set

to_dataset()[source]¶

Convert to a BigDL-compatible DataSet.

transform(transformer)[source]¶

Helper function to transform the data type in the data set.

Parameters:	transformer – the transformers to transform this feature set.
Returns:	A feature set

class zoo.feature.common.FeatureToTupleAdapter(sample_transformer, bigdl_type='float')[source]¶

Bases: zoo.feature.common.Preprocessing

class zoo.feature.common.MLlibVectorToTensor(size, bigdl_type='float')[source]¶

Bases: zoo.feature.common.Preprocessing

A Transformer that converts MLlib Vector to a Tensor.

Note: Deprecated in 0.4.0. NNEstimator will automatically extract Vectors now.

Parameters:	size – dimensions of target Tensor.

class zoo.feature.common.Preprocessing(bigdl_type='float', *args)[source]¶

Bases: bigdl.util.common.JavaValue

Preprocessing defines the data transformation performed during feature preprocessing. This is the Python wrapper for the Scala Preprocessing.

class zoo.feature.common.Relation(id1, id2, label, bigdl_type='float')[source]¶

Bases: object

It represents the relationship between two items.

to_tuple()[source]¶

class zoo.feature.common.Relations[source]¶

Bases: object

static read(path, sc=None, min_partitions=1, bigdl_type='float')[source]¶

Read relations from csv or txt file. Each record is supposed to contain the following three fields in order: id1(string), id2(string) and label(int).

For csv file, it should be without header. For txt file, each line should contain one record with fields separated by comma.

Parameters:

path – The path to the relations file, which can be either a local or a distributed file system (such as HDFS) path.
sc – An instance of SparkContext. If specified, return RDD of Relation. Default is None and in this case return list of Relation.
min_partitions – Int. A suggestion value of the minimal partition number for input texts. Only need to specify this when sc is not None. Default is 1.
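The record layout (id1, id2 and label separated by commas, one record per line, no header) can be illustrated with a small pure-Python parser. The namedtuple below is a stand-in for the Relation class and parse_relations is a hypothetical helper, not the library's reader.

```python
from collections import namedtuple

# Stand-in for zoo.feature.common.Relation, for illustration only.
Relation = namedtuple("Relation", ["id1", "id2", "label"])


def parse_relations(lines):
    """Parse lines of the form id1,id2,label into Relation records."""
    relations = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        id1, id2, label = line.split(",")
        # ids stay strings; the label is converted to int.
        relations.append(Relation(id1, id2, int(label)))
    return relations


relations = parse_relations(["Q1,A1,1", "Q1,A2,0\n"])
```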

static read_parquet(path, sc, bigdl_type='float')[source]¶

Read relations from parquet file. Schema should be the following: “id1”(string), “id2”(string) and “label”(int).

Parameters:

path – The path to the parquet file.
sc – An instance of SparkContext.

Returns:

RDD of Relation.

class zoo.feature.common.SampleToMiniBatch(batch_size, bigdl_type='float')[source]¶

Bases: zoo.feature.common.Preprocessing

A Transformer that converts Samples to MiniBatches of the given batch_size.

class zoo.feature.common.ScalarToTensor(bigdl_type='float')[source]¶

Bases: zoo.feature.common.Preprocessing

a Preprocessing that converts a number to a Tensor.

class zoo.feature.common.SeqToMultipleTensors(size=[], bigdl_type='float')[source]¶

Bases: zoo.feature.common.Preprocessing

A Transformer that converts an Array[_] or Seq[_] or ML Vector to several tensors.

Parameters:	size – list of int list, dimensions of target Tensors, e.g. [[2],[4]]
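One way to read the size argument is that a flat sequence is cut into consecutive pieces, one per target shape. The sketch below follows that reading, which is an assumption about the behavior rather than something stated here; split_by_sizes is a hypothetical helper that returns flat values paired with their target shapes.

```python
from math import prod


def split_by_sizes(values, sizes):
    """Split a flat sequence into consecutive chunks, one per target shape."""
    chunks, start = [], 0
    for shape in sizes:
        count = prod(shape)  # number of elements needed for this shape
        chunks.append((list(shape), list(values[start:start + count])))
        start += count
    if start != len(values):
        raise ValueError("sizes do not match the number of input values")
    return chunks


chunks = split_by_sizes([1, 2, 3, 4, 5, 6], [[2], [4]])
```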

class zoo.feature.common.SeqToTensor(size=[], bigdl_type='float')[source]¶

Bases: zoo.feature.common.Preprocessing

A Transformer that converts an Array[_] or Seq[_] to a Tensor.

Parameters:	size – dimensions of target Tensor.

class zoo.feature.common.TensorToSample(bigdl_type='float')[source]¶

Bases: zoo.feature.common.Preprocessing

a Transformer that converts Tensor to Sample.

class zoo.feature.common.ToTuple(bigdl_type='float')[source]¶

Bases: zoo.feature.common.Preprocessing

a Transformer that converts Feature to (Feature, None).