The art of cooking good data: 3 basic steps to get quality data

       Written by Claudio Vivante

In this article we look at a few simple rules, easy to state but not so obvious in practice, that will give you good data ready to feed your DATA ANALYTICS application.

For those of you who eat fish: you have probably ordered sea bream or sea bass at a restaurant and watched the waiter remove the bones and the skin before serving it to you. Removing the bones and serving the "useful" part of the food is my favorite image when I have to explain the concept of good data.

The diner is the user of the data, the data analyst or, more often, the analysis software. As I mentioned in my previous post (read article here), data management (collection, post-processing and storage) should be designed to simplify the analysis. Data analysis is the most critical and expensive part of the entire information management lifecycle (except in the small minority of cases where data come from particularly complex physical sensors and processes), so every euro spent on improving the quality of the data fed to the analysis process is a euro well spent: it will save you at least two. So today we will look at some simple steps that allow you to collect good data, ready to feed your DATA ANALYTICS application.


1. Choose unique names for data sources

The first thing to do is to choose unique names for the data sources: for example, if we need to measure the temperature of a machine, it is better to add a prefix that reminds us that the thermocouples belong to that particular machine (e.g. M1/temp1, M1/temp2). Sooner or later there will be a second machine whose temperatures we also need to measure, and it is better not to rename signals once the work is in progress.

It is always desirable that each measured signal has an editable name, which can change over time, and a globally unique identifier (GUID) that the analysis programs use to access the data and do the math: this simplifies the work of programmers and reduces mistakes. The GUID can be defined in various ways; I recommend choosing a format along the lines of the EPC SGTIN tag encoding adopted by RFID, where a dot character "." separates the parts of the name (e.g. company.site.machine.signal).


Obviously, another separator character between the parts that make up the GUID is fine (some platforms use "/", others ":"); the important thing is never to use the separator character inside the parts of the ID itself.
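The rule above can be sketched in a few lines of Python. The helper and the plant/machine/signal names are illustrative assumptions, not part of any standard; the point is simply to reject parts that contain the separator, so the GUID can always be split back unambiguously.

```python
SEPARATOR = "."

def make_guid(*parts: str) -> str:
    """Join name parts into a globally unique identifier.

    Rejects empty parts and parts that contain the separator itself,
    so the resulting GUID can always be parsed back into its parts.
    """
    for part in parts:
        if not part:
            raise ValueError("empty GUID part")
        if SEPARATOR in part:
            raise ValueError(f"part {part!r} contains the separator {SEPARATOR!r}")
    return SEPARATOR.join(parts)

# Hypothetical company/site/machine/signal names:
guid = make_guid("acme", "plant1", "M1", "temp1")
print(guid)  # acme.plant1.M1.temp1
```

The same check works unchanged if your platform prefers "/" or ":" as the separator.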


2. Standardize signal units before analyzing data

The second thing to consider is the unit in which the collected quantities are expressed. Data increasingly come from collection points of different brands and, with the advent of the Industrial Internet of Things (IIoT), smart sensors connect directly to data collection applications via web services or other protocols. With so much PLUG & PLAY, it is easy to lose control of what is being measured. Is the installed temperature sensor publishing in Celsius or Fahrenheit by default? Is the packaging machine's cycle time expressed in seconds or tenths of a second? And so on.

The ideal is to design a post-processing stage dedicated to unit standardization, through which signals pass before reaching the analysis system, and which is able to flag any suspect value before the automatic data analysis procedures fire. For example, if the temperature measured in the meeting room reads 77 °C, it is better to check the sensor's factory settings (is it actually publishing in Fahrenheit?) before you call the fire department!
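A minimal sketch of such a stage, assuming a hypothetical reading format and an illustrative "plausible indoor range": convert every incoming value to one canonical unit, then sanity-check it before it reaches the analytics.

```python
def to_celsius(value: float, unit: str) -> float:
    """Normalize a temperature reading to the canonical unit (Celsius)."""
    if unit == "C":
        return value
    if unit == "F":
        return (value - 32.0) * 5.0 / 9.0
    raise ValueError(f"unknown temperature unit: {unit!r}")

def is_plausible(celsius: float, low: float = -10.0, high: float = 50.0) -> bool:
    """Flag values outside an expected indoor range (bounds are assumptions)."""
    return low <= celsius <= high

# A sensor left on its Fahrenheit factory default:
reading = {"source": "meeting_room/temp1", "value": 77.0, "unit": "F"}
celsius = to_celsius(reading["value"], reading["unit"])
print(celsius)               # 25.0 -- a normal meeting room, not a fire
print(is_plausible(celsius)) # True
print(is_plausible(77.0))    # False: 77 interpreted as Celsius would be flagged
```

Rejecting unknown units outright, rather than passing them through, is what keeps the "what is this sensor actually measuring?" question from silently reaching the analysis stage.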


3. Use UTC for time representation

The third thing to remember is to use UTC for the representation of time, especially for 24/7 monitoring. Many people think UTC only makes sense for aggregating data from regions of the world in different time zones; I believe UTC is also a good solution to the problem of "ghost data" that occurs twice per year, when clocks switch between standard time and daylight saving time.

In spring, data arrive until 01:59 AM and then resume at 03:00 AM, because computers have moved the clock one hour ahead. In autumn, data from 02:00 AM to 02:59 AM are instead doubled, because computers have put the clock back an hour! The second case is the more insidious: a doubled production capacity is measured and all the KPIs produce implausible values!
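The autumn ambiguity can be seen directly with Python's standard `zoneinfo` module. In this sketch (using the Europe/Rome zone and the 2023 fall-back date as an example) the same wall-clock time occurs twice, but the UTC timestamps stay distinct:

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

rome = ZoneInfo("Europe/Rome")

# 02:30 local time on fall-back night happens twice; `fold` selects
# the first (still daylight saving, UTC+2) or second (standard, UTC+1)
# occurrence of the repeated hour.
first = datetime(2023, 10, 29, 2, 30, fold=0, tzinfo=rome)
second = datetime(2023, 10, 29, 2, 30, fold=1, tzinfo=rome)

print(first.strftime("%H:%M"), second.strftime("%H:%M"))
# 02:30 02:30  -- identical wall-clock times: local timestamps collide

print(first.astimezone(timezone.utc).isoformat())
# 2023-10-29T00:30:00+00:00
print(second.astimezone(timezone.utc).isoformat())
# 2023-10-29T01:30:00+00:00  -- in UTC the two readings stay distinct
```

Storing the UTC value and converting to local time only for display removes the doubled-hour problem entirely.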

In future posts we will cover other simple things you can do to get quality data, save time and money in debugging your analysis applications and, above all, make your measuring system more scalable and robust. Stay tuned!


For further information, you will find useful tips at the following links: 

EPC Codification:
Standard UTC:


