 _Note:_ If a screenshot is too small to read with your current browser settings, right-click it and open the image in a new tab.

# Data Preparation
![step_data_preparation.png](ATFZjNxmSZfB)
  **1. Cleaning: Failure codes are given meaning and batch durations are calculated.** 
 _Interpret failure code:_  
First, a new column “status” is added, in which the failure codes from the “failures” column are converted to readable strings to make later analysis easier: 1 is changed to “failure” and 0 is changed to “no failure”.
![step_data_preparation_1_replace.png](q2CL6twKewLj)
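
For readers who want to reproduce this step outside the visual recipe, the replacement could be sketched in pandas roughly as follows. The sample rows and the `batch_id` column are made up for illustration; only the “failures” and “status” column names come from the description above.

```python
import pandas as pd

# Hypothetical sample data; only the "failures" column name is taken from the text.
batches = pd.DataFrame({
    "batch_id": ["b1", "b2", "b3"],
    "failures": [0, 1, 0],
})

# Map the numeric failure codes to readable labels in a new "status" column.
batches["status"] = batches["failures"].map({1: "failure", 0: "no failure"})

print(batches)
```

Any code other than 0 or 1 would become NaN with this mapping, which makes unexpected values easy to spot.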

 _Feature engineering - batch duration calculation:_ 
Next, a new column is added containing the duration of each batch (in minutes), calculated from the start and end times.
![step_data_preparation_1_duration.png](RxvbQBbhj97p)
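
As a rough equivalent of this calculation, a pandas sketch is shown here. The column names `start_time`, `end_time`, and `duration_min` are assumptions; the actual names in the project may differ.

```python
import pandas as pd

# Hypothetical batches; the column names are assumed for illustration.
batches = pd.DataFrame({
    "batch_id": ["b1", "b2"],
    "start_time": pd.to_datetime(["2021-01-01 08:00", "2021-01-01 09:30"]),
    "end_time": pd.to_datetime(["2021-01-01 08:45", "2021-01-01 11:00"]),
})

# Subtracting two datetime columns gives a timedelta; convert it to minutes.
elapsed = batches["end_time"] - batches["start_time"]
batches["duration_min"] = elapsed.dt.total_seconds() / 60

print(batches[["batch_id", "duration_min"]])
```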

 **2. Join: The two datasets are joined based on the timestamps and equipment_id.** 
The join is done using the Join recipe. As a result, the sensor readings are allocated to a machine and to a batch period (each batch has a start and end time).
![step_data_preparation_2_join.png](4eqn4Zpxb7Rq)
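
The logic of this join can be approximated in pandas as sketched here. This is not the Join recipe's actual configuration: the column names (`timestamp`, `start_time`, `end_time`, `sensor_value`), the sample rows, and the choice of a left join that keeps unmatched readings are all assumptions made for illustration.

```python
import pandas as pd

# Hypothetical sensor readings and batches; all column names besides
# equipment_id and batch_id are assumed.
sensors = pd.DataFrame({
    "equipment_id": ["m1", "m1", "m1", "m2"],
    "timestamp": pd.to_datetime([
        "2021-01-01 08:10", "2021-01-01 08:30",
        "2021-01-01 12:00", "2021-01-01 09:45",
    ]),
    "sensor_value": [1.0, 3.0, 9.0, 5.0],
})
batches = pd.DataFrame({
    "batch_id": ["b1", "b2"],
    "equipment_id": ["m1", "m2"],
    "start_time": pd.to_datetime(["2021-01-01 08:00", "2021-01-01 09:30"]),
    "end_time": pd.to_datetime(["2021-01-01 08:45", "2021-01-01 11:00"]),
})

# Pair every reading with every batch that ran on the same machine.
readings = sensors.reset_index()  # keeps the original row number in "index"
candidates = readings.merge(batches, on="equipment_id")

# Keep only the pairs where the reading falls inside the batch period.
in_window = candidates["timestamp"].between(
    candidates["start_time"], candidates["end_time"]
)
matched = candidates.loc[in_window, ["index", "batch_id", "start_time", "end_time"]]

# Left join back so that readings outside any batch are kept with a missing batch_id.
joined = readings.merge(matched, on="index", how="left").drop(columns="index")

print(joined)
```

In this sample, the third reading (12:00 on machine m1) falls outside every batch period, so it ends up with no batch_id.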

 **3. Group by: Dataset is grouped by batch, to enable better analysis later.** 
The dataset produced by the join step above contains multiple readings from a single sensor within each batch. To compare batches more easily during later analysis, it is useful to have just the average value of each sensor per batch, and perhaps also the standard deviation of each sensor's readings and the number of measurements. In addition, if a prediction model is to be created at a later point, such irregularly shaped data is difficult to work with, because the number of sensor readings varies from batch to batch. The Group recipe is therefore used to aggregate sensor readings by batch: batch_id is used as the group key, and the average, standard deviation, and count of each sensor are aggregated, along with the other information about the batches.
![step_data_preparation_3_group_aggregations.png](ztsRRE0JpUZx)
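
A comparable aggregation could look like this in pandas. It is a minimal sketch with a single hypothetical sensor column (`sensor_value`) and made-up rows; the output column names are also assumptions. Note that pandas computes the sample standard deviation by default, which may or may not match the recipe's setting.

```python
import pandas as pd

# Hypothetical joined data: several readings per batch.
joined = pd.DataFrame({
    "batch_id": ["b1", "b1", "b2", "b2", "b2"],
    "status": ["no failure", "no failure", "failure", "failure", "failure"],
    "sensor_value": [1.0, 3.0, 4.0, 5.0, 6.0],
})

# One row per batch: average, standard deviation, and count of the readings,
# plus the batch-level "status", which is constant within a batch.
per_batch = (
    joined.groupby("batch_id")
    .agg(
        sensor_avg=("sensor_value", "mean"),
        sensor_std=("sensor_value", "std"),
        sensor_count=("sensor_value", "count"),
        status=("status", "first"),
    )
    .reset_index()
)

print(per_batch)
```

Every batch now has the same set of columns regardless of how many readings it contained, which is the regular shape a model needs.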

Since some rows in the input dataset batch_and_sensor_joined do not have a batch_id defined, the aggregated rows for this NA group are dropped using a post-filter in the Group recipe.
![step_data_preparation_3_group_post-filter.png](pJTFh2qbHU6P)
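
The effect of the post-filter can be illustrated with a small pandas sketch. The data and the `sensor_value` column are hypothetical, and `dropna=False` is used only to mimic a grouping that keeps the missing key as its own group before filtering.

```python
import pandas as pd

# Hypothetical joined data in which one reading has no batch_id.
joined = pd.DataFrame({
    "batch_id": ["b1", "b1", None, "b2"],
    "sensor_value": [1.0, 3.0, 9.0, 5.0],
})

# Group while keeping the missing key as a separate group.
grouped = (
    joined.groupby("batch_id", dropna=False)
    .agg(sensor_avg=("sensor_value", "mean"))
    .reset_index()
)

# Post-filter: drop the aggregated row whose batch_id is missing.
per_batch = grouped[grouped["batch_id"].notna()].reset_index(drop=True)

print(per_batch)
```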