Feature API Documentation¶
imagePreprocessing¶
-
class
zoo.feature.image.imagePreprocessing.ImageAspectScale(min_size, scale_multiple_of=1, max_size=1000, resize_mode=1, use_scale_factor=True, min_scale=-1.0, bigdl_type='float')[source]¶ Bases:
zoo.feature.image.imagePreprocessing.ImagePreprocessing
Resize the image while keeping the aspect ratio. The scale is determined by the short edge.
Parameters: - min_size – scale size, apply to short edge
- scale_multiple_of – make the scaled size multiple of some value
- max_size – max size after scale
- resize_mode – if resize_mode = -1, randomly select a mode from (Imgproc.INTER_LINEAR, Imgproc.INTER_CUBIC, Imgproc.INTER_AREA, Imgproc.INTER_NEAREST, Imgproc.INTER_LANCZOS4)
- use_scale_factor – if True, scale factors fx and fy are used, with fx = fy = 0
- min_scale – controls the minimum scale-up factor for the image
Returns: a DistributedImageSet
>>> import numpy as np
>>> from bigdl.util.common import Sample, callBigDlFunc
>>> from numpy.testing import assert_allclose
>>> np.random.seed(123)
>>> sample = Sample.from_ndarray(np.random.random((2,3)), np.random.random((2,3)))
>>> sample_back = callBigDlFunc("float", "testSample", sample)
>>> assert_allclose(sample.features[0].to_ndarray(), sample_back.features[0].to_ndarray())
>>> assert_allclose(sample.label.to_ndarray(), sample_back.label.to_ndarray())
>>> expected_feature_storage = np.array(([[0.69646919, 0.28613934, 0.22685145], [0.55131477, 0.71946895, 0.42310646]]))
>>> expected_feature_shape = np.array([2, 3])
>>> expected_label_storage = np.array(([[0.98076421, 0.68482971, 0.48093191], [0.39211753, 0.343178, 0.72904968]]))
>>> expected_label_shape = np.array([2, 3])
>>> assert_allclose(sample.features[0].storage, expected_feature_storage, rtol=1e-6, atol=1e-6)
>>> assert_allclose(sample.features[0].shape, expected_feature_shape)
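The size computation behind ImageAspectScale can be sketched in plain Python. This is a minimal illustration, assuming the short edge is scaled to min_size, the scale is capped so the long edge stays within max_size, and both sides are rounded down to a multiple of scale_multiple_of. The helper name aspect_scale_size is hypothetical and is not part of the zoo API, and the exact rounding used by the library may differ.

```python
import math


def aspect_scale_size(width, height, min_size, max_size=1000, scale_multiple_of=1):
    """Return (new_width, new_height) for an aspect-preserving resize.

    Hypothetical helper for illustration only, not the library implementation.
    """
    short_edge = min(width, height)
    long_edge = max(width, height)
    # Scale so that the short edge becomes min_size.
    scale = min_size / short_edge
    # Cap the scale so that the long edge does not exceed max_size.
    if round(scale * long_edge) > max_size:
        scale = max_size / long_edge
    # Round each side down to a multiple of scale_multiple_of (assumed rounding).
    new_w = math.floor(width * scale / scale_multiple_of) * scale_multiple_of
    new_h = math.floor(height * scale / scale_multiple_of) * scale_multiple_of
    return int(new_w), int(new_h)


print(aspect_scale_size(640, 480, min_size=600))
print(aspect_scale_size(2000, 500, min_size=600))
```

In the second call the long edge would become 2400 pixels at the short-edge scale, so the cap on max_size takes over and determines the final scale.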
-
class
zoo.feature.image.imagePreprocessing.ImageBrightness(delta_low, delta_high, bigdl_type='float')[source]¶ Bases:
zoo.feature.image.imagePreprocessing.ImagePreprocessing
Adjust the image brightness.
Launches a set of actors which connect via distributed PyTorch and coordinate gradient updates to train the provided model. If Ray is not initialized, TorchTrainer will automatically initialize a local Ray cluster for you. Be sure to run ray.init(address="auto") to leverage multi-node training.
class MyTrainingOperator(TrainingOperator):
    def setup(self, config):
        model = nn.Linear(1, 1)
        optimizer = torch.optim.SGD(
            model.parameters(), lr=config.get("lr", 1e-4))
        loss = torch.nn.MSELoss()
        batch_size = config["batch_size"]
        train_data, val_data = LinearDataset(2, 5), LinearDataset(2, 5)
        train_loader = DataLoader(train_data, batch_size=batch_size)
        val_loader = DataLoader(val_data, batch_size=batch_size)
        self.model, self.optimizer = self.register(
            models=model, optimizers=optimizer, criterion=loss)
        self.register_data(
            train_loader=train_loader, validation_loader=val_loader)

trainer = TorchTrainer(
    training_operator_cls=MyTrainingOperator,
    config={"batch_size": 32},
    use_gpu=True
)
for i in range(4):
    trainer.train()
Parameters: - training_operator_cls (type) – Custom training operator class that subclasses the TrainingOperator class. This class will be copied onto all remote workers and used to specify training components and custom training and validation operations.
- initialization_hook (function) – A function to call on all training workers when they are first initialized. This could be useful to set environment variables for all the worker processes.
- config (dict) – Custom configuration value to be passed to all operator constructors.
-
is_local()[source]¶ whether this is a LocalImageSet
-
class
zoo.feature.image.imagePreprocessing.ImageBytesToMat(byte_key='bytes', image_codec=-1, bigdl_type='float')[source]¶ Bases:
zoo.feature.image.imagePreprocessing.ImagePreprocessing
Transform a byte array (the original image file in bytes) to an OpenCVMat.
Parameters: - byte_key – key that maps byte array
- image_codec – specifying the color type of a loaded image, same as in OpenCV.imread. By default is Imgcodecs.CV_LOAD_IMAGE_UNCHANGED
-
class
zoo.feature.image.imagePreprocessing.ImageCenterCrop(crop_width, crop_height, is_clip=True, bigdl_type='float')[source]¶ Bases:
zoo.feature.image.imagePreprocessing.ImagePreprocessing
Crop a crop_width x crop_height patch from the center of the image. The patch size should be less than the image size.
Parameters: - crop_width – width after crop
- crop_height – height after crop
- is_clip – clip cropping box boundary
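The geometry of a center crop can be sketched as follows. This is only an illustration under stated assumptions: the helper center_crop_box is hypothetical, coordinates are integer pixels, and is_clip is assumed to clamp the box to the image boundaries.

```python
def center_crop_box(image_width, image_height, crop_width, crop_height, is_clip=True):
    """Return the (x1, y1, x2, y2) box of a centered crop.

    Hypothetical helper for illustration only, not the library implementation.
    """
    # Place the patch so that equal margins remain on both sides.
    x1 = (image_width - crop_width) // 2
    y1 = (image_height - crop_height) // 2
    x2, y2 = x1 + crop_width, y1 + crop_height
    if is_clip:
        # Clamp the box to the image boundaries.
        x1, y1 = max(0, x1), max(0, y1)
        x2, y2 = min(image_width, x2), min(image_height, y2)
    return x1, y1, x2, y2


print(center_crop_box(10, 8, 4, 2))
```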
-
class
zoo.feature.image.imagePreprocessing.ImageChannelNormalize(mean_r, mean_g, mean_b, std_r=1.0, std_g=1.0, std_b=1.0, bigdl_type='float')[source]¶ Bases:
zoo.feature.image.imagePreprocessing.ImagePreprocessing
Normalize each image channel.
Parameters: - mean_r – mean value in R channel
- mean_g – mean value in G channel
- mean_b – mean value in B channel
- std_r – std value in R channel
- std_g – std value in G channel
- std_b – std value in B channel
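Channel normalization is commonly computed as (value - mean) / std per channel. The sketch below assumes that formula and a pixel given in (R, G, B) order; channel_normalize is a hypothetical helper, not the library's code.

```python
def channel_normalize(pixel, means, stds=(1.0, 1.0, 1.0)):
    """Normalize one (R, G, B) pixel channel by channel.

    Hypothetical helper; assumes the usual (value - mean) / std formula.
    """
    return tuple((value - mean) / std for value, mean, std in zip(pixel, means, stds))


# With the default std of 1.0 only the mean is subtracted.
print(channel_normalize((133.0, 117.0, 104.0), means=(123.0, 117.0, 104.0)))
```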
-
class
zoo.feature.image.imagePreprocessing.ImageChannelOrder(bigdl_type='float')[source]¶ Bases:
zoo.feature.image.imagePreprocessing.ImagePreprocessing
Randomly change the channel order of an image.
-
class
zoo.feature.image.imagePreprocessing.ImageColorJitter(brightness_prob=0.5, brightness_delta=32.0, contrast_prob=0.5, contrast_lower=0.5, contrast_upper=1.5, hue_prob=0.5, hue_delta=18.0, saturation_prob=0.5, saturation_lower=0.5, saturation_upper=1.5, random_order_prob=0.0, shuffle=False, bigdl_type='float')[source]¶ Bases:
zoo.feature.image.imagePreprocessing.ImagePreprocessing
Randomly adjust brightness, contrast, hue, and saturation.
Parameters: - brightness_prob – probability to adjust brightness
- brightness_delta – brightness parameter
- contrast_prob – probability to adjust contrast
- contrast_lower – contrast lower parameter
- contrast_upper – contrast upper parameter
- hue_prob – probability to adjust hue
- hue_delta – hue parameter
- saturation_prob – probability to adjust saturation
- saturation_lower – saturation lower parameter
- saturation_upper – saturation upper parameter
- random_order_prob – probability of applying the different operations in random order
- shuffle – shuffle the transformers
-
class
zoo.feature.image.imagePreprocessing.ImageExpand(means_r=123, means_g=117, means_b=104, min_expand_ratio=1.0, max_expand_ratio=4.0, bigdl_type='float')[source]¶ Bases:
zoo.feature.image.imagePreprocessing.ImagePreprocessing
Expand the image, filling the blank part with means_r, means_g, means_b.
Parameters: - means_r – means in R channel
- means_g – means in G channel
- means_b – means in B channel
- min_expand_ratio – min expand ratio
- max_expand_ratio – max expand ratio
-
class
zoo.feature.image.imagePreprocessing.ImageFeatureToSample(bigdl_type='float')[source]¶ Bases:
zoo.feature.common.Preprocessing
A transformer that gets a Sample from an ImageFeature.
-
class
zoo.feature.image.imagePreprocessing.ImageFeatureToTensor(bigdl_type='float')[source]¶ Bases:
zoo.feature.common.Preprocessing
A transformer that converts an ImageFeature to a Tensor.
-
class
zoo.feature.image.imagePreprocessing.ImageFiller(start_x, start_y, end_x, end_y, value=255, bigdl_type='float')[source]¶ Bases:
zoo.feature.image.imagePreprocessing.ImagePreprocessing
Fill part of the image with a given pixel value.
Parameters: - start_x – start x ratio
- start_y – start y ratio
- end_x – end x ratio
- end_y – end y ratio
- value – filling value
-
class
zoo.feature.image.imagePreprocessing.ImageFixedCrop(x1, y1, x2, y2, normalized=True, is_clip=True, bigdl_type='float')[source]¶ Bases:
zoo.feature.image.imagePreprocessing.ImagePreprocessing
Crop a fixed area of the image.
Parameters: - x1 – start in width
- y1 – start in height
- x2 – end in width
- y2 – end in height
- normalized – whether args are normalized, i.e. in range [0, 1]
- is_clip – whether to clip the roi to image boundaries
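How normalized and is_clip interact can be sketched with a small helper. This assumes normalized coordinates are fractions of the image width and height and that clipping clamps the region of interest to the image. fixed_crop_box is a hypothetical name used only for illustration.

```python
def fixed_crop_box(image_width, image_height, x1, y1, x2, y2,
                   normalized=True, is_clip=True):
    """Convert crop arguments to a pixel box (x1, y1, x2, y2).

    Hypothetical helper for illustration only, not the library implementation.
    """
    if normalized:
        # Arguments in [0, 1] are fractions of the image size.
        x1, x2 = x1 * image_width, x2 * image_width
        y1, y2 = y1 * image_height, y2 * image_height
    if is_clip:
        # Clamp the region of interest to the image boundaries.
        x1, x2 = max(0, x1), min(image_width, x2)
        y1, y2 = max(0, y1), min(image_height, y2)
    return int(round(x1)), int(round(y1)), int(round(x2)), int(round(y2))


print(fixed_crop_box(100, 40, 0.25, 0.5, 0.75, 1.5))
```

Here the lower edge (1.5 times the height) falls outside the image, so with is_clip=True it is pulled back to the image height.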
-
class
zoo.feature.image.imagePreprocessing.ImageHFlip(bigdl_type='float')[source]¶ Bases:
zoo.feature.image.imagePreprocessing.ImagePreprocessing
Flip the image horizontally.
-
class
zoo.feature.image.imagePreprocessing.ImageHue(delta_low, delta_high, bigdl_type='float')[source]¶ Bases:
zoo.feature.image.imagePreprocessing.ImagePreprocessing
Adjust the image hue.
Parameters: - delta_low – hue parameter: low bound
- delta_high – hue parameter: high bound
-
class
zoo.feature.image.imagePreprocessing.ImageMatToTensor(to_RGB=False, tensor_key='imageTensor', share_buffer=True, format='NCHW', bigdl_type='float')[source]¶ Bases:
zoo.feature.image.imagePreprocessing.ImagePreprocessing
MatToTensor
Parameters: - to_RGB – BGR to RGB (default is BGR)
- tensor_key – key to store transformed tensor
- format – DataFormat.NCHW or DataFormat.NHWC
-
class
zoo.feature.image.imagePreprocessing.ImageMirror(bigdl_type='float')[source]¶ Bases:
zoo.feature.image.imagePreprocessing.ImagePreprocessing
Flip the image horizontally and vertically.
-
class
zoo.feature.image.imagePreprocessing.ImagePixelBytesToMat(byte_key='bytes', bigdl_type='float')[source]¶ Bases:
zoo.feature.image.imagePreprocessing.ImagePreprocessing
Transform a byte array (pixels in bytes) to an OpenCVMat.
Parameters: byte_key – key that maps byte array
-
class
zoo.feature.image.imagePreprocessing.ImagePixelNormalize(means, bigdl_type='float')[source]¶ Bases:
zoo.feature.image.imagePreprocessing.ImagePreprocessing
Pixel-level normalizer: data(i) = data(i) - mean(i)
Parameters: means – pixel level mean, following H * W * C order
-
class
zoo.feature.image.imagePreprocessing.ImagePreprocessing(bigdl_type='float', *args)[source]¶ Bases:
zoo.feature.common.Preprocessing
ImagePreprocessing is a transformer that transforms an ImageFeature.
-
class
zoo.feature.image.imagePreprocessing.ImageRandomAspectScale(scales, scale_multiple_of=1, max_size=1000, bigdl_type='float')[source]¶ Bases:
zoo.feature.image.imagePreprocessing.ImagePreprocessing
Resize the image by randomly choosing a scale.
Parameters: - scales – array of scale options to choose from at random
- scale_multiple_of – resize test images so that their width and height are multiples of this value
- max_size – max pixel size of the longest side of a scaled input image
-
class
zoo.feature.image.imagePreprocessing.ImageRandomCrop(crop_width, crop_height, is_clip=True, bigdl_type='float')[source]¶ Bases:
zoo.feature.image.imagePreprocessing.ImagePreprocessing
Randomly crop a crop_width x crop_height patch from an image. The patch size should be less than the image size.
Parameters: - crop_width – width after crop
- crop_height – height after crop
- is_clip – whether to clip the ROI to image boundaries
-
class
zoo.feature.image.imagePreprocessing.ImageRandomPreprocessing(preprocessing, prob, bigdl_type='float')[source]¶ Bases:
zoo.feature.common.Preprocessing
Randomly apply the preprocessing to some of the input ImageFeatures, with the specified probability. E.g. if prob = 0.5, the preprocessing will be applied to about half of the input ImageFeatures.
Parameters: - preprocessing – preprocessing to apply.
- prob – probability to apply the preprocessing action.
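The idea of applying a transform with a given probability can be sketched independently of the library. The function random_apply and its rng argument are hypothetical, and features are represented here by plain Python values rather than ImageFeatures.

```python
import random


def random_apply(transform, prob, features, rng=None):
    """Apply `transform` to each feature independently with probability `prob`.

    Hypothetical helper for illustration only, not the library implementation.
    """
    rng = rng or random.Random()
    # rng.random() is uniform in [0, 1), so prob=1.0 always applies
    # the transform and prob=0.0 never does.
    return [transform(f) if rng.random() < prob else f for f in features]


doubled = random_apply(lambda x: x * 2, 1.0, [1, 2, 3])
untouched = random_apply(lambda x: x * 2, 0.0, [1, 2, 3])
print(doubled, untouched)
```

With prob = 0.5 each feature is transformed independently, so roughly (not exactly) half of a large input is affected.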
-
class
zoo.feature.image.imagePreprocessing.ImageResize(resize_h, resize_w, resize_mode=1, use_scale_factor=True, bigdl_type='float')[source]¶ Bases:
zoo.feature.image.imagePreprocessing.ImagePreprocessing
Resize the image.
Parameters: - resize_h – height after resize
- resize_w – width after resize
- resize_mode – if resize_mode = -1, randomly select a mode from (Imgproc.INTER_LINEAR, Imgproc.INTER_CUBIC, Imgproc.INTER_AREA, Imgproc.INTER_NEAREST, Imgproc.INTER_LANCZOS4)
- use_scale_factor – if True, scale factors fx and fy are used, with fx = fy = 0. Note that the results of the following two calls differ:
Imgproc.resize(mat, mat, new Size(resizeWH, resizeWH), 0, 0, Imgproc.INTER_LINEAR)
Imgproc.resize(mat, mat, new Size(resizeWH, resizeWH))
-
class
zoo.feature.image.imagePreprocessing.ImageSaturation(delta_low, delta_high, bigdl_type='float')[source]¶ Bases:
zoo.feature.image.imagePreprocessing.ImagePreprocessing
Adjust the image saturation.
Parameters: - delta_low – saturation parameter: low bound
- delta_high – saturation parameter: high bound
-
class
zoo.feature.image.imagePreprocessing.ImageSetToSample(input_keys=['imageTensor'], target_keys=['label'], sample_key='sample', bigdl_type='float')[source]¶ Bases:
zoo.feature.image.imagePreprocessing.ImagePreprocessing
Transform an ImageFrame to samples.
Parameters: - input_keys – keys that maps inputs (each input should be a tensor)
- target_keys – keys that maps targets (each target should be a tensor)
- sample_key – key to store sample
-
class
zoo.feature.image.imagePreprocessing.PerImageNormalize(min, max, norm_type=32, bigdl_type='float')[source]¶ Bases:
zoo.feature.image.imagePreprocessing.ImagePreprocessing
Normalizes the norm or value range per image, similar to opencv::normalize (https://docs.opencv.org/ref/master/d2/de8/group__core__array.html#ga87eef7ee3970f86906d69a92cbf064bd). ImageNormalize scales and shifts the input features. Various normalization methods are supported, e.g. NORM_INF, NORM_L1, NORM_L2 or NORM_MINMAX. Please note that this is a per-image normalization.
Parameters: - min – lower range boundary in case of range normalization, or the norm value to normalize to
- max – upper range boundary in case of range normalization. It is not used for norm normalization.
- norm_type – normalization type, see opencv:NormTypes (https://docs.opencv.org/ref/master/d2/de8/group__core__array.html#gad12cefbcb5291cf958a85b4b67b6149f). Default is Core.NORM_MINMAX.
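For the default NORM_MINMAX case, range normalization linearly maps the smallest value of one image to min and the largest to max. The sketch below shows that arithmetic on a flat list of pixel values. minmax_normalize is a hypothetical helper, and the handling of a constant image is an assumption made here for illustration.

```python
def minmax_normalize(values, new_min, new_max):
    """Linearly rescale `values` so they span [new_min, new_max].

    Hypothetical helper illustrating range normalization for a single image.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant image: map everything to new_min (assumed behaviour).
        return [float(new_min)] * len(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]


print(minmax_normalize([0, 5, 10], 0.0, 1.0))
```

Because the minimum and maximum are taken from the image itself, two images with different brightness ranges end up on the same output range.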
-
class
zoo.feature.image.imagePreprocessing.RowToImageFeature(bigdl_type='float')[source]¶ Bases:
zoo.feature.common.Preprocessing
A transformer that converts a Spark Row to a BigDL ImageFeature.
imageset¶
-
class
zoo.feature.image.imageset.DistributedImageSet(image_rdd=None, label_rdd=None, jvalue=None, bigdl_type='float')[source]¶ Bases:
zoo.feature.image.imageset.ImageSet
DistributedImageSet wraps an RDD of ImageFeature.
-
class
zoo.feature.image.imageset.ImageSet(jvalue, bigdl_type='float')[source]¶ Bases:
bigdl.util.common.JavaValue
ImageSet wraps a set of ImageFeature.
-
classmethod
from_rdds(image_rdd, label_rdd=None, bigdl_type='float')[source]¶ Create an ImageSet from RDDs of ndarray.
Parameters: - image_rdd – an RDD of ndarray, each ndarray should have 3 or 4 dimensions (3D images)
- label_rdd – a rdd of ndarray
Returns: a DistributedImageSet
-
label_map¶
Returns: the labelMap of this ImageSet, None if the ImageSet does not have a labelMap
-
classmethod
read(path, sc=None, min_partitions=1, resize_height=-1, resize_width=-1, image_codec=-1, with_label=False, one_based_label=True, bigdl_type='float')[source]¶ Read images as Image Set
Parameters: - path – path to read images from. If sc is defined, path can be local or HDFS. Wildcard characters are supported. If with_label is set to True, path should be a directory with two levels: the first level contains class folders and the second contains images. All images belonging to the same class should be put into the same class folder, so each image in the path is labeled by the folder it belongs to.
- sc – SparkContext
- min_partitions – A suggestion value of the minimal splitting number for input data.
- resize_height – height after resize, by default is -1 which will not resize the image
- resize_width – width after resize, by default is -1 which will not resize the image
- image_codec – specifying the color type of a loaded image, same as in OpenCV.imread. By default is Imgcodecs.CV_LOAD_IMAGE_UNCHANGED (-1)
- with_label – whether to treat folders in the path as image classification labels and read the labels into ImageSet.
- one_based_label – whether to use one based label
Returns: ImageSet
-
classmethod
-
class
zoo.feature.image.imageset.LocalImageSet(image_list=None, label_list=None, jvalue=None, bigdl_type='float')[source]¶ Bases:
zoo.feature.image.imageset.ImageSet
LocalImageSet wraps a list of ImageFeature.
-
zoo.feature.image.imageset.is_local(self)[source]¶ Whether this is a LocalImageSet.
transformation¶
-
class
zoo.feature.image3d.transformation.AffineTransform3D(affine_mat, translation=array([0., 0., 0.]), clamp_mode='clamp', pad_val=0.0, bigdl_type='float')[source]¶ Bases:
zoo.feature.image3d.transformation.ImagePreprocessing3D
Affine transformer implements an affine transformation on a given tensor. To avoid defects in resampling, the mapping is from destination to source: dst(z,y,x) = src(f(z),f(y),f(x)) where f: dst -> src.
Parameters: - affine_mat – numpy array of shape 3x3. Defines the affine transformation from dst to src.
- translation – numpy array of 3 dimensions. Defines the translation along each axis. Default value is np.zeros(3).
- clamp_mode – str, default value is "clamp". Defines how to handle interpolation off the input image.
- pad_val – float, default is 0.0. Defines the padding value when clamp_mode="padding". Setting this value when clamp_mode="clamp" will cause an error.
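The destination-to-source mapping can be illustrated with a small coordinate helper. This sketch assumes f applies the 3x3 matrix to a destination coordinate and then adds the translation. Whether the library measures coordinates from the volume centre or a corner is not stated here, so map_dst_to_src is a hypothetical illustration of the idea only.

```python
def map_dst_to_src(dst_coord, affine_mat, translation=(0.0, 0.0, 0.0)):
    """Map a destination coordinate to the source coordinate to sample from.

    Hypothetical helper; assumes src = affine_mat . dst + translation.
    """
    return tuple(
        sum(a * c for a, c in zip(row, dst_coord)) + t
        for row, t in zip(affine_mat, translation)
    )


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
# Each destination voxel reads from the source voxel shifted by the translation.
print(map_dst_to_src((4, 5, 6), identity, translation=(1.0, 2.0, 3.0)))
```

Iterating over destination voxels and looking up their source position guarantees every output voxel receives a value, which is why resampling is usually written in this direction.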
-
class
zoo.feature.image3d.transformation.CenterCrop3D(crop_depth, crop_height, crop_width, bigdl_type='float')[source]¶ Bases:
zoo.feature.image3d.transformation.ImagePreprocessing3D
Center crop a crop_depth x crop_height x crop_width patch from an image. The patch size should be less than the image size.
Parameters: - crop_depth – depth after crop
- crop_height – height after crop
- crop_width – width after crop
-
class
zoo.feature.image3d.transformation.Crop3D(start, patch_size, bigdl_type='float')[source]¶ Bases:
zoo.feature.image3d.transformation.ImagePreprocessing3D
Crop a patch of the given size from a 3D image, starting at 'start'. The patch size should be less than the image size.
Parameters: - start – start point list [depth, height, width] for cropping
- patch_size – patch size list [depth, height, width]
-
class
zoo.feature.image3d.transformation.ImagePreprocessing3D(bigdl_type='float', *args)[source]¶ Bases:
zoo.feature.image.imagePreprocessing.ImagePreprocessing
ImagePreprocessing3D is a transformer that transforms an ImageFeature for 3D images.
-
class
zoo.feature.image3d.transformation.RandomCrop3D(crop_depth, crop_height, crop_width, bigdl_type='float')[source]¶ Bases:
zoo.feature.image3d.transformation.ImagePreprocessing3D
Randomly crop a crop_depth x crop_height x crop_width patch from an image. The patch size should be less than the image size.
Parameters: - crop_depth – depth after crop
- crop_height – height after crop
- crop_width – width after crop
-
class
zoo.feature.image3d.transformation.Rotate3D(rotation_angles, bigdl_type='float')[source]¶ Bases:
zoo.feature.image3d.transformation.ImagePreprocessing3D
Rotate a 3D image by the specified angles.
Parameters: - rotation_angles – the angles for rotation: yaw (a counterclockwise rotation angle about the z-axis), pitch (a counterclockwise rotation angle about the y-axis), and roll (a counterclockwise rotation angle about the x-axis).
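Yaw, pitch and roll angles are often combined into a single rotation matrix. The sketch below assumes the common composition R = Rz(yaw) · Ry(pitch) · Rx(roll) acting on (x, y, z) column vectors with angles in radians. The composition order and axis convention used by Rotate3D itself are not stated in this reference, so treat the function rotation_matrix as a hypothetical illustration.

```python
import math


def matmul3(a, b):
    """Multiply two 3x3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def rotation_matrix(yaw, pitch, roll):
    """Build R = Rz(yaw) * Ry(pitch) * Rx(roll); angles in radians (assumed order)."""
    cz, sz = math.cos(yaw), math.sin(yaw)
    cy, sy = math.cos(pitch), math.sin(pitch)
    cx, sx = math.cos(roll), math.sin(roll)
    rz = [[cz, -sz, 0.0], [sz, cz, 0.0], [0.0, 0.0, 1.0]]    # about the z-axis
    ry = [[cy, 0.0, sy], [0.0, 1.0, 0.0], [-sy, 0.0, cy]]    # about the y-axis
    rx = [[1.0, 0.0, 0.0], [0.0, cx, -sx], [0.0, sx, cx]]    # about the x-axis
    return matmul3(rz, matmul3(ry, rx))


# A quarter-turn yaw sends the x-axis onto the y-axis.
quarter_yaw = rotation_matrix(math.pi / 2, 0.0, 0.0)
print([round(row[0], 6) for row in quarter_yaw])
```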
text_feature¶
-
class
zoo.feature.text.text_feature.TextFeature(text=None, label=None, uri=None, jvalue=None, bigdl_type='float')[source]¶ Bases:
bigdl.util.common.JavaValue
Each TextFeature keeps information of a single text record. It can include various status (if any) of a text, e.g. original text content, uri, category label, tokens, index representation of tokens, BigDL Sample representation, prediction result and so on.
-
get_label()[source]¶ Get the label of the TextFeature. If no label is stored, -1 will be returned.
Returns: Int
-
get_sample()[source]¶ Get the Sample representation of the TextFeature. If the TextFeature hasn’t been transformed to Sample, None will be returned.
Returns: BigDL Sample
-
get_tokens()[source]¶ Get the tokens of the TextFeature. If text hasn’t been segmented, None will be returned.
Returns: List of String
-
text_set¶
-
class
zoo.feature.text.text_set.DistributedTextSet(texts=None, labels=None, jvalue=None, bigdl_type='float')[source]¶ Bases:
zoo.feature.text.text_set.TextSet
DistributedTextSet is comprised of RDDs.
-
class
zoo.feature.text.text_set.LocalTextSet(texts=None, labels=None, jvalue=None, bigdl_type='float')[source]¶ Bases:
zoo.feature.text.text_set.TextSet
LocalTextSet is comprised of lists.
-
class
zoo.feature.text.text_set.TextSet(jvalue, bigdl_type='float', *args)[source]¶ Bases:
bigdl.util.common.JavaValue
TextSet wraps a set of texts with status.
-
classmethod
from_relation_lists(relations, corpus1, corpus2, bigdl_type='float')[source]¶ Used to generate a TextSet for ranking.
This method does the following:
1. For each id1 in relations, find the list of id2 with corresponding label that comes together with id1. In other words, group relations by id1.
2. Join with corpus to transform each id to indexedTokens. Note: Make sure that the corpus has been transformed by SequenceShaper and WordIndexer.
3. For each list, generate a TextFeature having Sample with:
- feature of shape (list_length, text1_length + text2_length).
- label of shape (list_length, 1).
Parameters: - relations – List or RDD of Relation.
- corpus1 – TextSet that contains all id1 in relations. For each TextFeature in corpus1, text must have been transformed to indexedTokens of the same length.
- corpus2 – TextSet that contains all id2 in relations. For each TextFeature in corpus2, text must have been transformed to indexedTokens of the same length.
Note that if relations is a list, then corpus1 and corpus2 must both be LocalTextSet. If relations is RDD, then corpus1 and corpus2 must both be DistributedTextSet.
Returns: TextSet.
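Step 1 of from_relation_lists (grouping relations by id1) can be sketched in plain Python. Relations are represented here as (id1, id2, label) tuples as a stand-in for Relation objects, and group_relations_by_id1 is a hypothetical helper rather than part of the API.

```python
def group_relations_by_id1(relations):
    """Group (id1, id2, label) tuples by id1, keeping first-seen order.

    Hypothetical helper illustrating the grouping step only.
    """
    grouped = {}
    for id1, id2, label in relations:
        grouped.setdefault(id1, []).append((id2, label))
    return grouped


relations = [("q1", "a1", 1), ("q1", "a2", 0), ("q2", "a3", 1)]
print(group_relations_by_id1(relations))
```

Each resulting list corresponds to one TextFeature for ranking, with list_length equal to the number of candidates grouped under that id1.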
-
classmethod
from_relation_pairs(relations, corpus1, corpus2, bigdl_type='float')[source]¶ Used to generate a TextSet for pairwise training.
This method does the following:
1. Generate all RelationPairs: (id1, id2Positive, id2Negative) from Relations.
2. Join RelationPairs with corpus to transform id to indexedTokens. Note: Make sure that the corpus has been transformed by SequenceShaper and WordIndexer.
3. For each pair, generate a TextFeature having Sample with:
- feature of shape (2, text1Length + text2Length).
- label of value [1 0] as the positive relation is placed before the negative one.
Parameters: - relations – List or RDD of Relation.
- corpus1 – TextSet that contains all id1 in relations. For each TextFeature in corpus1, text must have been transformed to indexedTokens of the same length.
- corpus2 – TextSet that contains all id2 in relations. For each TextFeature in corpus2, text must have been transformed to indexedTokens of the same length.
Note that if relations is a list, then corpus1 and corpus2 must both be LocalTextSet. If relations is RDD, then corpus1 and corpus2 must both be DistributedTextSet.
Returns: TextSet.
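Step 1 of from_relation_pairs (generating the triples) can be sketched as below. This assumes a label greater than 0 marks a positive relation and any other label a negative one. Relations are plain (id1, id2, label) tuples and generate_relation_pairs is a hypothetical helper.

```python
def generate_relation_pairs(relations):
    """Build (id1, id2_positive, id2_negative) triples from relations.

    Hypothetical helper; assumes label > 0 means positive, otherwise negative.
    """
    positives, negatives = {}, {}
    for id1, id2, label in relations:
        target = positives if label > 0 else negatives
        target.setdefault(id1, []).append(id2)
    pairs = []
    for id1, positive_ids in positives.items():
        for pos in positive_ids:
            # Pair every positive with every negative sharing the same id1.
            for neg in negatives.get(id1, []):
                pairs.append((id1, pos, neg))
    return pairs


relations = [("q1", "a1", 1), ("q1", "a2", 0), ("q1", "a3", 0), ("q2", "a4", 1)]
print(generate_relation_pairs(relations))
```

An id1 with no negative (or no positive) relation produces no pair, as with "q2" in this example.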
-
generate_sample()[source]¶ Generate BigDL Sample. Need to word2idx first. See TextFeatureToSample for more details.
Returns: TextSet with Samples.
-
generate_word_index_map(remove_topN=0, max_words_num=-1, min_freq=1, existing_map=None)[source]¶ Generate word_index map based on sorted word frequencies in descending order. Return the result dictionary, which can also be retrieved by ‘get_word_index()’. Make sure you call this after tokenize. Otherwise you will get an error. See word2idx for more details.
Returns: Dictionary {word: id}
-
get_labels()[source]¶ Get the labels of a TextSet (if any). If a text doesn’t have a label, its corresponding position will be -1.
Returns: List of int for LocalTextSet. RDD of int for DistributedTextSet.
-
get_predicts()[source]¶ Get the prediction results (if any) combined with uris (if any) of a TextSet. If a text doesn’t have a uri, its corresponding uri will be None. If a text hasn’t been predicted by a model, its corresponding prediction will be None.
Returns: List of (uri, prediction as a list of numpy array) for LocalTextSet. RDD of (uri, prediction as a list of numpy array) for DistributedTextSet.
-
get_samples()[source]¶ Get the BigDL Sample representations of a TextSet (if any). If a text hasn’t been transformed to Sample, its corresponding position will be None.
Returns: List of Sample for LocalTextSet. RDD of Sample for DistributedTextSet.
-
get_texts()[source]¶ Get the text contents of a TextSet.
Returns: List of String for LocalTextSet. RDD of String for DistributedTextSet.
-
get_uris()[source]¶ Get the identifiers of a TextSet. If a text doesn’t have a uri, its corresponding position will be None.
Returns: List of String for LocalTextSet. RDD of String for DistributedTextSet.
-
get_word_index()[source]¶ Get the word_index dictionary of the TextSet. If the TextSet hasn’t been transformed from word to index, None will be returned.
Returns: Dictionary {word: id}
-
load_word_index(path)[source]¶ Load the word_index map which was saved after the training, so that this TextSet can directly use this word_index during inference. Each separate line should be “word id”.
Note that after calling load_word_index, you do not need to specify any argument when calling word2idx in the preprocessing pipeline as now you are using exactly the loaded word_index for transformation.
For LocalTextSet, load txt from a local file system. For DistributedTextSet, load txt from a local or distributed file system (such as HDFS).
Returns: TextSet with the loaded word_index.
-
normalize()[source]¶ Do normalization on tokens. Need to tokenize first. See Normalizer for more details.
Returns: TextSet after normalization.
-
random_split(weights)[source]¶ Randomly split into list of TextSet with provided weights. Only available for DistributedTextSet for now.
Parameters: weights – List of float indicating the split portions.
-
classmethod
read(path, sc=None, min_partitions=1, bigdl_type='float')[source]¶ Read text files with labels from a directory. The folder structure is expected to be the following: path
Under the target path, there ought to be N subdirectories (dir1 to dirN). Each subdirectory represents a category and contains all texts that belong to that category. Each category will be given a label according to its position in ascending sorted order among all subdirectories. Each text will be given the label of the subdirectory where it is located. Labels start from 0.
Parameters: - path – The folder path to texts. Local or distributed file system (such as HDFS) are supported. If you want to read from a distributed file system, sc needs to be specified.
- sc – An instance of SparkContext. If specified, texts will be read as a DistributedTextSet. Default is None and in this case texts will be read as a LocalTextSet.
- min_partitions – Int. A suggestion value of the minimal partition number for input texts. Only need to specify this when sc is not None. Default is 1.
Returns: TextSet.
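The labeling rule for read (subdirectories sorted in ascending order, labels starting from 0) can be reproduced with the standard library alone. label_texts_by_folder is a hypothetical helper that only mimics the rule on a local file system. It does not use Spark or return a TextSet.

```python
import os
import tempfile


def label_texts_by_folder(path):
    """Return [(text, label)], labeling by sorted subdirectory position.

    Hypothetical helper that mimics the documented labeling rule locally.
    """
    categories = sorted(
        d for d in os.listdir(path) if os.path.isdir(os.path.join(path, d))
    )
    labeled = []
    for label, category in enumerate(categories):  # labels start from 0
        category_dir = os.path.join(path, category)
        for name in sorted(os.listdir(category_dir)):
            with open(os.path.join(category_dir, name), encoding="utf-8") as f:
                labeled.append((f.read(), label))
    return labeled


# Build a tiny two-category corpus in a temporary directory.
with tempfile.TemporaryDirectory() as root:
    for category, text in [("sports", "goal"), ("business", "stocks")]:
        os.makedirs(os.path.join(root, category))
        with open(os.path.join(root, category, "doc.txt"), "w", encoding="utf-8") as f:
            f.write(text)
    labeled = label_texts_by_folder(root)
print(labeled)
```

"business" sorts before "sports", so it receives label 0 even though it was created second.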
-
classmethod
read_csv(path, sc=None, min_partitions=1, bigdl_type='float')[source]¶ Read texts with id from csv file. Each record is supposed to contain the following two fields in order: id(string) and text(string). Note that the csv file should be without header.
Parameters: - path – The path to the csv file. Local or distributed file system (such as HDFS) are supported. If you want to read from a distributed file system, sc needs to be specified.
- sc – An instance of SparkContext. If specified, texts will be read as a DistributedTextSet. Default is None and in this case texts will be read as a LocalTextSet.
- min_partitions – Int. A suggestion value of the minimal partition number for input texts. Only need to specify this when sc is not None. Default is 1.
Returns: TextSet.
-
classmethod
read_parquet(path, sc, bigdl_type='float')[source]¶ Read texts with id from parquet file. Schema should be the following: “id”(string) and “text”(string).
Parameters: - path – The path to the parquet file.
- sc – An instance of SparkContext.
Returns: DistributedTextSet.
-
save_word_index(path)[source]¶ Save the word_index dictionary to text file, which can be used for future inference. Each separate line will be “word id”.
For LocalTextSet, save txt to a local file system. For DistributedTextSet, save txt to a local or distributed file system (such as HDFS).
Parameters: path – The path to the text file.
-
set_word_index(vocab)[source]¶ Assign a word_index dictionary for this TextSet to use during word2idx. If you load the word_index from the saved file, you are recommended to use load_word_index directly.
Returns: TextSet with the word_index set.
-
shape_sequence(len, trunc_mode='pre', pad_element=0)[source]¶ Shape the sequence of indices to a fixed length. Need to word2idx first. See SequenceShaper for more details.
Returns: TextSet after sequence shaping.
-
to_distributed(sc=None, partition_num=4)[source]¶ Convert to a DistributedTextSet.
Need to specify SparkContext to convert a LocalTextSet to a DistributedTextSet. In this case, you may also want to specify partition_num, the default of which is 4.
Returns: DistributedTextSet
-
tokenize()[source]¶ Do tokenization on original text. See Tokenizer for more details.
Returns: TextSet after tokenization.
-
word2idx(remove_topN=0, max_words_num=-1, min_freq=1, existing_map=None)[source]¶ Map word tokens to indices. Important: Take care that this method behaves a bit differently for training and inference.
Training: During training, you need to generate a new word_index dictionary according to the texts you are dealing with. Thus this method will first do the dictionary generation and then convert words to indices based on the generated dictionary.
You can specify the following arguments which pose some constraints when generating the dictionary. In the result dictionary, index will start from 1 and corresponds to the occurrence frequency of each word sorted in descending order. Here we adopt the convention that index 0 will be reserved for unknown words. After word2idx, you can get the generated word_index dictionary by calling ‘get_word_index’. Also, you can call save_word_index to save this word_index dictionary to be used in future training.
Parameters: - remove_topN – Non-negative int. Remove the topN words with highest frequencies in the case where those are treated as stopwords. Default is 0, namely remove nothing.
- max_words_num – Int. The maximum number of words to be taken into consideration. Default is -1, namely all words will be considered. Otherwise, it should be a positive int.
- min_freq – Positive int. Only those words with frequency >= min_freq will be taken into consideration. Default is 1, namely all words that occur will be considered.
- existing_map – Existing dictionary of word_index if any. Default is None and in this case a new dictionary with index starting from 1 will be generated. If not None, then the generated dictionary will preserve the word_index in existing_map and assign subsequent indices to new words.
—————————————Inference——————————————– During the inference, you are supposed to use exactly the same word_index dictionary as in the training stage instead of generating a new one. Thus please be aware that you do not need to specify any of the above arguments. You need to call load_word_index or set_word_index beforehand for dictionary loading.
Need to tokenize first. See WordIndexer for more details.
Returns: TextSet after word2idx.
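The dictionary generation described for training can be sketched in plain Python. This is only an illustration of the documented rules, not the library's implementation: the function name build_word_index, the alphabetical tie-breaking, and the order in which the constraints are applied (min_freq, then remove_topN, then max_words_num) are assumptions.

```python
from collections import Counter


def build_word_index(tokenized_texts, remove_topN=0, max_words_num=-1,
                     min_freq=1, existing_map=None):
    """Hypothetical helper mirroring the documented word2idx arguments."""
    counts = Counter(tok for text in tokenized_texts for tok in text)
    # Rank by descending frequency; ties are broken alphabetically (assumption).
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    # Keep only words that occur at least min_freq times.
    ranked = [(word, freq) for word, freq in ranked if freq >= min_freq]
    # Drop the most frequent words, e.g. when they are treated as stopwords.
    ranked = ranked[remove_topN:]
    # -1 means "no limit"; otherwise keep at most max_words_num words.
    if max_words_num > 0:
        ranked = ranked[:max_words_num]
    # Preserve existing indices and assign subsequent indices to new words.
    word_index = dict(existing_map) if existing_map else {}
    next_index = max(word_index.values(), default=0) + 1  # index 0 is reserved
    for word, _ in ranked:
        if word not in word_index:
            word_index[word] = next_index
            next_index += 1
    return word_index


texts = [["the", "cat", "sat"], ["the", "cat"], ["the", "dog"]]
print(build_word_index(texts, min_freq=2))
print(build_word_index(texts, min_freq=2, existing_map={"dog": 1}))
```

Because existing indices are preserved, a dictionary saved from an earlier run can be extended without reshuffling the indices a trained model already depends on.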
transformer¶
-
class
zoo.feature.text.transformer.Normalizer(bigdl_type='float')[source]¶ Bases:
zoo.feature.text.transformer.TextTransformer
Removes all dirty characters (anything outside the English alphabet) from tokens and converts words to lower case. Need to tokenize first. Original tokens will be replaced by normalized tokens.
>>> normalizer = Normalizer()
creating: createNormalizer
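As a rough illustration of the normalization rule, the following plain-Python sketch strips every character outside a-z and A-Z and lowercases what remains. The function name normalize_tokens is hypothetical, and dropping tokens that become empty is an assumption rather than documented behavior.

```python
import re


def normalize_tokens(tokens):
    """Hypothetical stand-in for what Normalizer does to a token list."""
    # Remove non-alphabetic characters, then convert to lower case.
    cleaned = [re.sub(r"[^a-zA-Z]", "", token).lower() for token in tokens]
    # Assumption: tokens that end up empty are discarded.
    return [token for token in cleaned if token]


print(normalize_tokens(["Hello,", "World!", "123", "it's"]))
```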
-
class
zoo.feature.text.transformer.SequenceShaper(len, trunc_mode='pre', pad_element=0, bigdl_type='float')[source]¶ Bases:
zoo.feature.text.transformer.TextTransformer
Shape the sequence of indices to a fixed length. If the original sequence is longer than the target length, it will be truncated from the beginning or the end. If the original sequence is shorter than the target length, it will be padded at the end. Need to word2idx first. The original indices sequence will be replaced by the shaped sequence.
Parameters: - len – Positive int. The target length.
- trunc_mode – String. Truncation mode, either ‘pre’ or ‘post’. Default is ‘pre’. If ‘pre’, the sequence will be truncated from the beginning. If ‘post’, the sequence will be truncated from the end.
- pad_element – Int. The element to be padded to the sequence if the original length is smaller than the target length. Default is 0 with the convention that we reserve index 0 for unknown words.
>>> sequence_shaper = SequenceShaper(len=6, trunc_mode="post", pad_element=10000)
creating: createSequenceShaper
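The truncation and padding rules can be sketched in a few lines of plain Python. This is an illustration of the documented behavior under stated assumptions, not the library code; the helper name shape_indices is hypothetical, and the target length parameter is called length here to avoid shadowing Python's built-in len.

```python
def shape_indices(indices, length, trunc_mode="pre", pad_element=0):
    """Hypothetical helper: force a list of indices to exactly `length` items."""
    if trunc_mode not in ("pre", "post"):
        raise ValueError("trunc_mode must be 'pre' or 'post'")
    indices = list(indices)
    if len(indices) > length:
        # 'pre' truncates from the beginning (keeps the tail);
        # 'post' truncates from the end (keeps the head).
        return indices[-length:] if trunc_mode == "pre" else indices[:length]
    # Shorter sequences are padded at the end.
    return indices + [pad_element] * (length - len(indices))


print(shape_indices([1, 2, 3, 4, 5], 3))                      # keeps the tail
print(shape_indices([1, 2, 3, 4, 5], 3, trunc_mode="post"))   # keeps the head
print(shape_indices([1, 2], 4, pad_element=10000))            # pads at the end
```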
-
class
zoo.feature.text.transformer.TextFeatureToSample(bigdl_type='float')[source]¶ Bases:
zoo.feature.text.transformer.TextTransformer
Transform indexedTokens and label (if any) of a TextFeature to a BigDL Sample. Need to word2idx first.
>>> to_sample = TextFeatureToSample()
creating: createTextFeatureToSample
-
class
zoo.feature.text.transformer.TextTransformer(bigdl_type='float', *args)[source]¶ Bases:
zoo.feature.common.Preprocessing
Base class of Transformers that transform TextFeature.
-
class
zoo.feature.text.transformer.Tokenizer(bigdl_type='float')[source]¶ Bases:
zoo.feature.text.transformer.TextTransformer
Transform text to array of string tokens.
>>> tokenizer = Tokenizer()
creating: createTokenizer
-
class
zoo.feature.text.transformer.WordIndexer(map, bigdl_type='float')[source]¶ Bases:
zoo.feature.text.transformer.TextTransformer
Given a wordIndex map, transform tokens to corresponding indices. Words not in the map will be dropped. Need to tokenize first.
Parameters: - map – Dict with word (string) as its key and index (int) as its value.
>>> word_indexer = WordIndexer(map={"it": 1, "me": 2})
creating: createWordIndexer
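A minimal plain-Python sketch of the lookup, reading the description as "tokens missing from the map are left out of the result". The function name index_tokens is hypothetical and the code is not taken from the library.

```python
def index_tokens(tokens, word_index):
    """Hypothetical helper: map tokens to indices, skipping unknown words."""
    return [word_index[token] for token in tokens if token in word_index]


# "likes" is not in the map, so it does not appear in the output.
print(index_tokens(["it", "likes", "me"], {"it": 1, "me": 2}))
```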
common¶
-
class
zoo.feature.common.ArrayToTensor(size, bigdl_type='float')[source]¶ Bases:
zoo.feature.common.Preprocessing
A Transformer that converts an Array[_] to a Tensor.
Parameters: - size – dimensions of target Tensor.
-
class
zoo.feature.common.ChainedPreprocessing(transformers, bigdl_type='float')[source]¶ Bases:
zoo.feature.common.Preprocessing
Chains two Preprocessing together. The output type of the first Preprocessing should be the same as the input type of the second Preprocessing.
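Conceptually, chaining is function composition: each step consumes the previous step's output, which is why adjacent types must match. The sketch below uses ordinary Python callables as stand-ins for Preprocessing objects; the chain helper is hypothetical and is not the library's implementation.

```python
from functools import reduce


def chain(transformers):
    """Hypothetical helper: compose callables left to right."""
    def chained(value):
        # Feed the output of each step into the next one.
        return reduce(lambda acc, step: step(acc), transformers, value)
    return chained


# str -> list of str -> list of int -> int: each output type matches
# the input type of the step that follows it.
pipeline = chain([str.split, lambda tokens: [int(t) for t in tokens], sum])
print(pipeline("1 2 3"))
```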
-
class
zoo.feature.common.FeatureLabelPreprocessing(feature_transformer, label_transformer, bigdl_type='float')[source]¶ Bases:
zoo.feature.common.Preprocessing
Constructs a Transformer that converts a (Feature, Label) tuple to a Sample. The returned Transformer is robust for the case label = null, in which the Sample is derived from Feature only.
Parameters: - feature_transformer – transformer for feature, transform F to Tensor[T]
- label_transformer – transformer for label, transform L to Tensor[T]
-
class
zoo.feature.common.FeatureSet(jvalue=None, bigdl_type='float')[source]¶ Bases:
bigdl.dataset.dataset.DataSet
A set of data which is used in the model optimization process. The FeatureSet can be accessed in a random data sample sequence. In the training process, the data sequence is a looped endless sequence, while in the validation process it is a sequence of limited length. Unlike BigDL’s DataSet, this FeatureSet can be cached to Intel Optane DC Persistent Memory if you set memory_type to PMEM when creating the FeatureSet.
-
classmethod
image_frame(image_frame, memory_type='DRAM', sequential_order=False, shuffle=True, bigdl_type='float')[source]¶ Create FeatureSet from ImageFrame.
Parameters: - image_frame – ImageFrame
- memory_type – string, DRAM, PMEM or an Int number. If it’s DRAM, will cache dataset into dynamic random-access memory. If it’s PMEM, will cache dataset into Intel Optane DC Persistent Memory. If it’s an Int number n, will cache dataset into disk, and only hold 1/n of the data in memory during the training. After going through the 1/n, we will release the current cache, and load another 1/n into memory.
- sequential_order – whether to iterate the elements in the feature set in sequential order for training.
- shuffle – whether to shuffle the elements in each partition before each epoch when training
- bigdl_type – numeric type
Returns: A feature set
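The "hold 1/n of the data in memory" behavior for an Int memory_type can be pictured with the following plain-Python sketch. It only illustrates the idea of walking a dataset slice by slice; the helper names, the ceiling-division slice size, and the use of an in-memory list as a stand-in for the on-disk cache are all assumptions, not the library's mechanism.

```python
def slice_bounds(total, n):
    """Hypothetical helper: (start, end) pairs splitting `total` records into at most n slices."""
    if total == 0:
        return []
    slice_size = -(-total // n)  # ceiling division
    return [(start, min(start + slice_size, total))
            for start in range(0, total, slice_size)]


def iterate_with_partial_cache(records, n):
    """Yield every record while keeping only one slice 'cached' at a time."""
    for start, end in slice_bounds(len(records), n):
        cache = records[start:end]  # load the next 1/n; the previous slice is released
        for record in cache:
            yield record


print(slice_bounds(10, 4))
print(list(iterate_with_partial_cache(list(range(10)), 4)))
```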
-
classmethod
image_set(imageset, memory_type='DRAM', sequential_order=False, shuffle=True, bigdl_type='float')[source]¶ Create FeatureSet from ImageSet.
Parameters: - imageset – ImageSet
- memory_type – string, DRAM, PMEM or an Int number. If it’s DRAM, will cache dataset into dynamic random-access memory. If it’s PMEM, will cache dataset into Intel Optane DC Persistent Memory. If it’s an Int number n, will cache dataset into disk, and only hold 1/n of the data in memory during the training. After going through the 1/n, we will release the current cache, and load another 1/n into memory.
- sequential_order – whether to iterate the elements in the feature set in sequential order for training.
- shuffle – whether to shuffle the elements in each partition before each epoch when training
- bigdl_type – numeric type
Returns: A feature set
-
classmethod
pytorch_dataloader(dataloader, features='_data[0]', labels='_data[1]', bigdl_type='float')[source]¶ Create FeatureSet from a PyTorch dataloader.
Parameters: - dataloader – a PyTorch dataloader, or a function that returns a PyTorch dataloader.
- features – features in _data, where _data is obtained from the dataloader.
- labels – labels in _data, where _data is obtained from the dataloader.
- bigdl_type – numeric type
Returns: A feature set
-
classmethod
rdd(rdd, memory_type='DRAM', sequential_order=False, shuffle=True, bigdl_type='float')[source]¶ Create FeatureSet from RDD.
Parameters: - rdd – An RDD
- memory_type – string, DRAM, PMEM or an Int number. If it’s DRAM, will cache dataset into dynamic random-access memory. If it’s PMEM, will cache dataset into Intel Optane DC Persistent Memory. If it’s an Int number n, will cache dataset into disk, and only hold 1/n of the data in memory during the training. After going through the 1/n, we will release the current cache, and load another 1/n into memory.
- sequential_order – whether to iterate the elements in the feature set in sequential order when training.
- shuffle – whether to shuffle the elements in each partition before each epoch when training
- bigdl_type – numeric type
Returns: A feature set
-
classmethod
sample_rdd(rdd, memory_type='DRAM', sequential_order=False, shuffle=True, bigdl_type='float')[source]¶ Create FeatureSet from RDD[Sample].
Parameters: - rdd – An RDD[Sample]
- memory_type – string, DRAM, PMEM or an Int number. If it’s DRAM, will cache dataset into dynamic random-access memory. If it’s PMEM, will cache dataset into Intel Optane DC Persistent Memory. If it’s an Int number n, will cache dataset into disk, and only hold 1/n of the data in memory during the training. After going through the 1/n, we will release the current cache, and load another 1/n into memory.
- sequential_order – whether to iterate the elements in the feature set in sequential order when training.
- shuffle – whether to shuffle the elements in each partition before each epoch when training
- bigdl_type – numeric type
Returns: A feature set
-
class
zoo.feature.common.MLlibVectorToTensor(size, bigdl_type='float')[source]¶ Bases:
zoo.feature.common.Preprocessing
A Transformer that converts MLlib Vector to a Tensor.
Note: Deprecated in 0.4.0. NNEstimator will automatically extract Vectors now.
Parameters: - size – dimensions of target Tensor.
-
class
zoo.feature.common.Preprocessing(bigdl_type='float', *args)[source]¶ Bases:
bigdl.util.common.JavaValue
Preprocessing defines data transform action during feature preprocessing. Python wrapper for the Scala Preprocessing.
-
class
zoo.feature.common.Relation(id1, id2, label, bigdl_type='float')[source]¶ Bases:
object
It represents the relationship between two items.
-
class
zoo.feature.common.Relations[source]¶ Bases:
object
-
static
read(path, sc=None, min_partitions=1, bigdl_type='float')[source]¶ Read relations from csv or txt file. Each record is supposed to contain the following three fields in order: id1(string), id2(string) and label(int).
For csv file, it should be without header. For txt file, each line should contain one record with fields separated by comma.
Parameters: - path – The path to the relations file, which can be either a local or a distributed file system (such as HDFS) path.
- sc – An instance of SparkContext. If specified, return RDD of Relation. Default is None and in this case return list of Relation.
- min_partitions – Int. A suggestion value of the minimal partition number for input texts. Only need to specify this when sc is not None. Default is 1.
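The expected record layout can be illustrated with a small parser in plain Python. This sketch only mirrors the documented file format (three comma-separated fields, no header); RelationRecord and parse_relations are hypothetical names, and the real Relations.read additionally handles file paths and SparkContext, which are omitted here.

```python
import csv
from collections import namedtuple

# Hypothetical stand-in for zoo.feature.common.Relation.
RelationRecord = namedtuple("RelationRecord", ["id1", "id2", "label"])


def parse_relations(lines):
    """Parse 'id1,id2,label' records; id1 and id2 stay strings, label becomes int."""
    relations = []
    for row in csv.reader(lines):
        if not row:  # skip blank lines
            continue
        id1, id2, label = row
        relations.append(RelationRecord(id1, id2, int(label)))
    return relations


for relation in parse_relations(["Q1,A1,1", "Q1,A2,0"]):
    print(relation)
```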
-
class
zoo.feature.common.SampleToMiniBatch(batch_size, bigdl_type='float')[source]¶ Bases:
zoo.feature.common.Preprocessing
A Transformer that converts Samples to MiniBatches of the given batch_size.
-
class
zoo.feature.common.ScalarToTensor(bigdl_type='float')[source]¶ Bases:
zoo.feature.common.Preprocessing
A Preprocessing that converts a number to a Tensor.
-
class
zoo.feature.common.SeqToMultipleTensors(size=[], bigdl_type='float')[source]¶ Bases:
zoo.feature.common.Preprocessing
A Transformer that converts an Array[_] or Seq[_] or ML Vector to several tensors.
Parameters: - size – list of int list, dimensions of target Tensors, e.g. [[2],[4]]
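One way to picture the size argument is as a list of target shapes that carve a flat sequence into consecutive chunks. The sketch below assumes consecutive slicing, with each chunk holding the product of its shape's dimensions; that is an assumption about the behavior, and split_by_sizes is a hypothetical helper, not the library's code.

```python
from math import prod


def split_by_sizes(values, sizes):
    """Hypothetical helper: pair each target shape with its consecutive chunk of values."""
    chunks, offset = [], 0
    for shape in sizes:
        count = prod(shape)  # number of elements a tensor of this shape holds
        chunks.append((shape, values[offset:offset + count]))
        offset += count
    if offset != len(values):
        raise ValueError("sizes do not account for every input element")
    return chunks


# With size=[[2], [4]], six values become one 2-element and one 4-element chunk.
print(split_by_sizes([1, 2, 3, 4, 5, 6], [[2], [4]]))
```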
-
class
zoo.feature.common.SeqToTensor(size=[], bigdl_type='float')[source]¶ Bases:
zoo.feature.common.Preprocessing
A Transformer that converts an Array[_] or Seq[_] to a Tensor.
Parameters: - size – dimensions of target Tensor.
-
class
zoo.feature.common.TensorToSample(bigdl_type='float')[source]¶ Bases:
zoo.feature.common.Preprocessing
A Transformer that converts Tensor to Sample.
-
class
zoo.feature.common.ToTuple(bigdl_type='float')[source]¶ Bases:
zoo.feature.common.Preprocessing
A Transformer that converts Feature to (Feature, None).