In data analytics there are plenty of shiny buzzwords, products, and concepts. Before you decide on anything, be clear about your actual and future needs. Your analytics infrastructure should enable you to analyze data, but different architectures support different aspects well, and you have to make trade-offs. There is no free lunch.
Here are some explanations that should help you orient yourself, figure out what you need, and compare different approaches.
Where the data is located matters. You can leave the data on the source system and read it for each analysis, or you can copy it to a dedicated analytics system.
Pros of leaving the data on the source are:
- You save storage space.
- You always work with the most recent data.
Pros of copy are:
- You may get fast analyses, since you can optimize the storage for analytics.
- Operations on the source stay fast, since your analyses don’t read data the source system needs at the same time.
- You get consistent results, since you control the updates.
- You can implement historization and avoid losing information.
Usually the data is copied, but if the volume is big enough or you lack the resources, you may want to leave it on the source.
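The copy approach can be sketched in a few lines. This is a minimal illustration, not a production pipeline; the databases, the `orders` table, and all column names are made up for the example:

```python
import sqlite3

# Hypothetical source system: an operational database with an orders table.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
source.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 9.99), (2, 25.00)])

# Analytics system: a separate database you control and can optimize.
analytics = sqlite3.connect(":memory:")
analytics.execute(
    "CREATE TABLE orders_snapshot (id INTEGER, amount REAL, loaded_at TEXT)"
)

# "Copy" approach: read once from the source, then write a timestamped
# snapshot. Analyses run against the snapshot, not the source system.
rows = source.execute("SELECT id, amount FROM orders").fetchall()
analytics.executemany(
    "INSERT INTO orders_snapshot VALUES (?, ?, datetime('now'))", rows
)

print(analytics.execute("SELECT COUNT(*) FROM orders_snapshot").fetchone()[0])  # 2
```

The `loaded_at` column is one way to keep a history of loads, which the source system alone could not give you.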
How, or whether, your data is structured has a significant impact on your analytics infrastructure. There is structured, semi-structured, and unstructured data. Relational databases, for example, hold structured data. Semi-structured data carries information on how to separate values and identify its structure. Examples of tabular semi-structured data are simple Excel sheets or CSV files; examples of hierarchical semi-structured data are XML or JSON, which are often used by web services and APIs. Examples of unstructured data are images, PDFs, or plain text.
Analytics is all about reducing information so it becomes consumable by humans, or by processes that humans create and understand. Reduction requires structured data. Semi-structured data can be transformed into structured data. Unstructured data may be transformable into structured data, but not always, not easily, and not error-free. Avoid Excel, PDF, or plain text as data sources whenever possible.
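Turning hierarchical semi-structured data into structured rows can be as simple as flattening it. A minimal sketch, assuming a made-up JSON payload of the kind an API might return:

```python
import json

# Hypothetical API response: hierarchical, semi-structured JSON.
payload = json.loads("""
{"customer": {"id": 42, "name": "ACME"},
 "orders": [{"id": 1, "amount": 9.99}, {"id": 2, "amount": 25.0}]}
""")

# Flatten into structured, tabular rows suitable for analytics:
# one row per order, with the customer repeated on each row.
rows = [
    {"customer_id": payload["customer"]["id"],
     "order_id": order["id"],
     "amount": order["amount"]}
    for order in payload["orders"]
]

print(rows[0])  # {'customer_id': 42, 'order_id': 1, 'amount': 9.99}
```

With unstructured data, such as a PDF, there is no equivalent of `payload["orders"]` to iterate over; that is exactly why it makes a poor data source.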
Data from different sources may be hard to combine when there is no common identifier or the formats differ. Again there are two approaches: "schema on read" and "schema on write". "Schema on read" means you leave the data in its raw form and transform it when you analyze it. "Schema on write" means you transform the data into a common format when you write it. You may change the format and the data types, normalize the data, and deduplicate it.
Pros of "Schema on read" are:
- You may save effort on integrating new data.
- The raw data may contain more information than the integrated data.
Pros of "Schema on write" are:
- Analyses on integrated data take less effort, both in development and in computation.
- Investment in quality pays off more, because the transformed data is reused.
- Preaggregated data speeds up analyses.
It’s hard to say what data volume counts as big, but either way, volume should have a big influence on your individual solution. In analytics systems, performance is often bought with redundancy: results that are used several times, or that are needed with reduced latency, are precalculated and stored. So one piece of information is stored many times in different ways. That’s efficient for performance but not for storage, which obviously becomes a problem with a lot of data.
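Precalculation can be sketched as follows; the sales records and the daily grouping are made up for the example:

```python
from collections import defaultdict

# Hypothetical raw events: one row per sale, potentially high volume.
sales = [
    {"day": "2023-01-05", "amount": 9.99},
    {"day": "2023-01-05", "amount": 25.00},
    {"day": "2023-01-06", "amount": 4.50},
]

# Precalculate a daily total once and store it ...
daily_totals = defaultdict(float)
for sale in sales:
    daily_totals[sale["day"]] += sale["amount"]

# ... so that repeated analyses read the small aggregate
# instead of scanning all raw events every time.
print(round(daily_totals["2023-01-05"], 2))  # 34.99
```

The raw events and the stored aggregate now hold the same information twice: fast to query, but redundant in storage.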
As a rule of thumb, automatically generated data, e.g. from sensors or logging functions, may come in high volumes; manually generated data, like orders in your ERP system or master data, usually does not.
Data velocity describes how much time elapses from data generation to analysis. High velocity may come with other restrictions or increased effort.
Most common are scenarios with daily updates. For regulatory supervision, it may be enough to update your data once a quarter.
An often misused term in this context is real-time. If a system is real-time capable, it guarantees a result within a specified time. That matters, for example, in embedded systems in automotive or industrial environments. In a business context, real-time is used to mean best-effort latency, and it is only necessary in special cases. Imagine a manager making decisions based on reports that are updated and changed every 5 minutes: that carries the risk of reacting to random events instead of pursuing a strategy. It may be different in a cloud-based application scenario, where you want to scale the system up or down based on usage, or in an e-commerce scenario, where a process adjusts prices based on recent sales.
You can see: if someone tries to sell you something without listening to your requirements, there is a good chance you end up with something that does not deliver what you need, or that could have been accomplished with less effort.