As more companies discover the importance of data science and advanced analytics for their bottom line, a clash of cultures has begun. How can these quickly growing fields become part of a company’s ecosystem, especially for established companies that have been around for a decade or longer?
Data scientists and IT professionals have vastly different needs when it comes to infrastructure. Here, I'll lay out some of those requirements and discuss how to reconcile them so that both sides can evolve together.
Department Perspectives
When starting up data science programs within companies, the biggest issues often arise not from the technology itself, but from simple miscommunication. Interdepartmental misconceptions can result in a lot of grudge-holding between fledgling data science teams and IT departments.
To combat this, we’ll examine both perspectives and take each of their needs into account. We'll start by defining what an IT professional requires to maintain a successful workflow, and then we'll look at what a data scientist needs for maximum efficiency. Finally, we'll find the common ground and use it to implement a healthy infrastructure in which both can flourish.
IT Needs
Let’s start by taking a look at a typical data infrastructure for IT and Software Development.
Regarding data, there are three essential prerequisites that any IT department will focus on:
- data that is secure
- data that is efficient
- data that is consistent
Because of this, much of IT utilizes table-based schemas, and often uses SQL (Structured Query Language) or one of its variants.
This setup means that data is split across a large number of purpose-specific tables. Each table is kept separate from the others, with foreign keys connecting them. Because of this structure, queries can be executed quickly, efficiently, and with security in mind. This is important for software development, where data needs to remain intact and reliable.
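As a rough illustration, here's a minimal sketch of that kind of normalized layout, using SQLite from Python. The `customers` and `orders` tables are invented purely for this example and aren't meant to represent any particular production schema.

```python
import sqlite3

# In-memory database purely for illustration; table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL
    );
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL,
        total       REAL NOT NULL,
        -- The foreign key ties each order back to exactly one customer.
        FOREIGN KEY (customer_id) REFERENCES customers (customer_id)
    );
""")

# A typical, well-defined query: join the two tables on the foreign key.
rows = conn.execute("""
    SELECT c.name, SUM(o.total)
    FROM orders o
    JOIN customers c ON c.customer_id = o.customer_id
    GROUP BY c.name
""").fetchall()
```

Every field has a declared type, every relationship is explicit, and the database engine can optimize around that structure.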
With this structure, the required hardware is often minimal when compared to the needs of data science. The stored data is well defined, and it evolves at a predictable pace. Little of the data repeats, and the querying process reduces the amount of processing resources required.
Let’s see how data science differs.
Data Science Needs
On the other end, data science has a different set of needs. Data scientists need freedom of movement with their data—and flexibility to modify their data quickly. They need to be able to move data in non-standard ways and process large amounts at a time.
These needs are hard to implement using highly structured databases. Data science requires a different infrastructure, relying instead upon unstructured data and table-less schemas.
When referring to unstructured data, we’re talking about data with no intrinsic definition. It’s nebulous until given form by a data scientist. For most development, each field needs to be of a defined type—such as an integer or a string. For data science, however, it’s about supporting data points that are ill defined.
Table-less schemas add more versatility to this quasi-chaotic setup, allowing all the information to live in one place. It’s especially useful for data scientists who regularly need to merge data in creative and unstructured ways. Popular choices include NoSQL variants or structures that allow several dimensions, such as OLAP cubes.
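To make the contrast concrete, here's a loose sketch of what document-style, schema-less records might look like. The field names are invented and the example isn't tied to any particular NoSQL product.

```python
# Document-style records: no fixed schema, and fields can vary per record.
events = [
    {"user": "a17", "source": "web",   "page": "/pricing", "duration_s": 42},
    {"user": "b03", "source": "pos",   "items": ["sku-221", "sku-118"], "total": 57.10},
    {"user": "a17", "source": "email", "campaign": "spring-sale"},
]

# The data scientist decides what a field "means" at analysis time.
web_sessions = [e for e in events if e.get("source") == "web"]
```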
As a result, the hardware required for data science is often more substantial. It will need to hold the entirety of the data used, as well as subsets of that data (though this is often spread out among multiple structures or services). The hardware can also require considerable processing resources as large amounts of data are moved and aggregated.
Distilling Needs Into Action
With these two sets of needs in mind, we can now see how miscommunication can occur. Let’s take these perspectives and use them to define what changes we’re looking for and how. What problems need to be solved when bringing data science into a traditional IT setting?
Ease of Data Manipulation
In a traditional IT setting, any given business’s databases likely follow a rigid structure, with tables divided to fit specific needs, an appropriate schema to define each piece of data, and foreign keys to tie it all together. This makes for an efficient system of querying data. The exploratory nature of some data science methods can push this to its limits.
When a common task might require joining a dozen or more tables, the benefits of table-based structures become less apparent. A popular way to handle this is to implement a secondary NoSQL or multi-dimensional database, kept fresh with regular ETL (Extract, Transform, Load) jobs. This adds the cost of additional hardware or cloud service usage, but minimizes most other drawbacks.
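As one possible shape for that pipeline, here's a pared-down sketch of a daily ETL pass, assuming SQLite as the production source and MongoDB (via pymongo) as the secondary analytics store. The database, table, and field names are placeholders; adjust for whichever stack you actually run.

```python
import sqlite3
from pymongo import MongoClient  # assumes MongoDB as the secondary store

def refresh_analytics_copy():
    """One ETL pass: pull recent rows from the relational source,
    reshape them into documents, and load them into the analytics copy."""
    # Extract: hypothetical production database and table.
    src = sqlite3.connect("production.db")
    rows = src.execute(
        "SELECT order_id, customer_id, total, created_at FROM orders "
        "WHERE created_at >= date('now', '-1 day')"
    ).fetchall()

    # Transform: flatten each row into a schema-less document.
    docs = [
        {"order_id": r[0], "customer_id": r[1], "total": r[2], "created_at": r[3]}
        for r in rows
    ]

    # Load: additive insert into the secondary analytics database.
    if docs:
        MongoClient("mongodb://localhost:27017")["analytics"]["orders"].insert_many(docs)
```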
Keep in mind that in some cases, adding a separate database for data science can be more affordable than piling everything onto the existing one (especially when complicated licensing issues come into play).
Ease of Data Scaling
This specific problem covers two of the mismatches mentioned above:
- regular increases in data from procedures
- a need for unstructured data types
In traditional IT, the size of your database is well defined, either staying the same size or growing at a modest pace. When using a database for data science, that growth can be exponential. It is common to add gigabytes of data each day (or more). With the sheer size of this kind of data, a business will need to incorporate a plan for scaling internal architecture or use an appropriate cloud solution.
As for unstructured data, it can take up a lot of resources in terms of storage and processing power, depending on your specific uses. Because of this, it's often inefficient to keep it all in a database that might be used for other purposes. The solution is similar to scaling in general. We’ll either need a plan for scaling our internal architecture to meet these needs or we'll have to find an appropriate cloud solution.
Resource Usage
The last major difference we’ll talk about is the use of resources. For IT, the usage of resources is typically efficient, well defined, and consistent. If a database powers an eCommerce site, there are known constraints. An IT professional will know roughly how many users there will be over a given period of time, so they can plan their hardware provisioning based on how much information is needed for each user.
With traditional IT infrastructure, there won’t be any problems encountered if a project uses only a few hundred rows from a handful of tables. But a project that requires every row from two dozen tables can quickly become a problem. In data science, the needs in terms of processing and storage change from project to project—and that kind of unpredictability can be difficult to support.
In traditional IT, resources may be shared with other parties, such as a live production site or an internal dev team. The risk here is that running a large-scale data science project could lock those other users out for a period of time. Another risk is that the servers hosting a database may not be able to handle the sheer amount of processing required. Pulling 200,000 rows from 15 tables and asking for aggregation on top becomes a problem. Queries of this magnitude can be extremely taxing on a server that might normally handle a thousand or so simultaneous users.
The ideal solution comes down to cloud processing, which addresses two key factors. First, it moves heavy queries away from any business-critical databases. Second, it provides resources that can scale to fit each project.
So What’s the Final List of Requirements for Both?
Now that we’ve talked about the needs in depth, let’s sum them up. An IT and data science department will need the following for long-term success:
- a separate database to reduce the impact on other stakeholders
- a scaling storage solution to accommodate changes in data
- a scaling processing solution to accommodate varying project types
- an unstructured database to provide efficient retrieval and storage of highly varying data
Building a Case for Data Science
Let’s break everything down into specifications so we can put together a mutually beneficial solution, then look at how to define the specific resources an organization needs.
Researching Specifications
From the IT side, there are three main definitions needed to create the necessary infrastructure. These are:
- the amount of data
- to what extent it needs processing
- how the data will get to the storage solution
Here’s how you can determine each.
Data Storage Needs
It all starts with the initial data size needed and the estimated ongoing data additions.
For your initial data needs, take the defined size of your current database. Now subtract any columns or tables that you won't need in your data science projects. Take this number and add in the data size of any new sources that you’ll be introducing. New sources might include Google Analytics data or information from your Point of Sale system. This total will be the data storage we’ll be looking to attain upfront.
While the initial storage needs are useful upfront, you’ll still have to consider ongoing data needs—as you’ll likely be adding more information to the database over time. To find this information out, you can calculate your daily added data from your currently available data. Take a look at the amount of information that has been added to your database in the last 30 days, and then divide that by 30. Then repeat that for each information source that you’ll be using, and add them together.
While this isn’t precise, there’s an old development mantra that you should double your estimate, and we’re going to use that here. Why? We want to account for unpredictable changes that might affect your data storage needs, such as company growth, per-project spikes, or simply general uncertainty.
With that number now defined, multiply it by 365. This is now your projected data growth for one year, which, when added to your initial amount, will determine how much storage you should look at obtaining.
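Here's that arithmetic worked through in Python. Every figure below is hypothetical, so substitute your own measurements.

```python
# Hypothetical figures; substitute your own measurements.
current_db_gb  = 500      # size of the existing database
excluded_gb    = 120      # columns/tables not needed for data science
new_sources_gb = 80       # e.g. Google Analytics export, POS history

initial_storage_gb = current_db_gb - excluded_gb + new_sources_gb   # 460 GB

# Daily growth: data added over the last 30 days, divided by 30,
# summed across every source you'll be using.
daily_growth_gb = (45 / 30) + (15 / 30)          # 2.0 GB/day

# Double the estimate to absorb unpredictable growth, then project a year.
yearly_growth_gb = daily_growth_gb * 2 * 365     # 1,460 GB

total_storage_gb = initial_storage_gb + yearly_growth_gb   # 1,920 GB
print(f"Plan for roughly {total_storage_gb:.0f} GB of storage")
```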
Processing Resource Needs
Unlike data storage needs, processing needs are a lot more difficult to calculate exactly. The main goal here is to decide whether you want to put the heavy lifting on queries or on a local machine (or cloud instance). Keep in mind here that when I talk about a local machine, I don’t mean just the computer you normally use—you’ll likely need some kind of optimized workstation for the more intensive calculations.
To make this choice, it helps to think about the biggest data science project that you might run within the next year. Can your data solution handle a query of that size without becoming inaccessible to all other stakeholders? If it can, then you’re good to go with no additional resources needed. If not, then you’ll need to plan on getting an appropriately sized workstation or scaling cloud instances.
ETL (Extract, Transform, Load) Processes
After deciding where to store and process your data, the next decision is how. Creating an ETL process will keep your data science database orderly and updated and prevent it from using unnecessary resources from elsewhere.
Here’s what you should have in your ETL documentation (a rough sketch of how to capture it follows the list):
- any backup procedures that should take place
- where data will be coming from and where it will be going
- the exact dimensions that should be moved
- how often the transfer should occur
- whether the transfer needs to be complete (rewrite the whole database) or can be additive (only move over new things)
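One lightweight way to capture those decisions is a small config that the team can review alongside the ETL code itself. Every value below is a placeholder for illustration.

```python
# A reviewable summary of the ETL decisions; all values are placeholders.
etl_spec = {
    "backup": "snapshot the analytics database before each full load",
    "source": "production relational database (orders, customers tables)",
    "destination": "analytics document store",
    "dimensions": ["order_id", "customer_id", "total", "created_at"],
    "schedule": "daily at 02:00 UTC",
    "mode": "additive",   # or "complete" to rewrite the whole database
}
```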
Preparing a Solution
With all the data points in hand, it’s time to pick out a solution. This part will take a bit of research and will rely heavily on your specific needs, since on the surface the major providers have a lot of similarities.
Three of the biggest cloud solutions—Amazon Web Services (AWS), Google Cloud Platform (GCP), and Microsoft Azure—offer some of the best prices and features. All three have relatively similar costs, though AWS is notably more difficult to calculate costs for (due to its a la carte pricing structure).
Beyond price, each offers scalable data storage and the ability to add processing instances, though each calls its ‘instances’ by a different name. When researching which to use for your own infrastructure, take into account which types of projects you’ll be utilizing the most, as that can shift the value of each one’s pricing and feature set.
However, many companies simply select whichever aligns with their existing technology stack.
You may also want to set up your own infrastructure in-house, although this is significantly more complex and not for the faint of heart.
Extra Tips for Smooth Implementation
With all of your ducks in a row, you can start implementation! To help out, here are some hard-earned tips on making your project easier—from pitch to execution.
Test Your ETL Process
When you first put together your ETL process, don’t test the entire thing all at once! This can put some serious strain on your resources and increase your cloud costs drastically if there’s a mistake, or if you have to attempt the process several times.
Instead, it’s a good idea to run your process using just the first 100 rows or so of your origin tables at first. Then run the full transfer once you know it will work.
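A minimal sketch of that idea, assuming your extract step is driven by a SQL query; the table name is a placeholder.

```python
# Dry run: limit the extract to the first 100 rows of each origin table
# before committing to a full transfer.
TEST_MODE = True
limit_clause = "LIMIT 100" if TEST_MODE else ""

query = f"SELECT * FROM orders {limit_clause}"
# Run the rest of the ETL on this small slice, inspect the output,
# then flip TEST_MODE off and rerun for the full load.
```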
Test Your Queries Too
The same goes for any large query you run on a cloud instance. Making a mistake that pulls in millions of pieces of data is much harder on a system than one that only pulls in a few—especially when you’re paying per GB.
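The same trick applies here: sanity-check an expensive aggregation against a narrow slice of data first. Note that a LIMIT clause alone often won't reduce the amount of data scanned by an aggregation, but a tight filter (a single day, for example) will. The query below is purely illustrative.

```python
# Preview an expensive aggregation on one day of data before running it
# across the full history; table and column names are placeholders.
preview_sql = """
    SELECT customer_id, SUM(total) AS spend
    FROM orders
    WHERE created_at >= date('now', '-1 day')   -- small slice for the test
    GROUP BY customer_id
"""
```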
Create a Warehousing Backup Strategy
Most cloud operators offer this as a feature, so you may not have to worry about it. Your team should still discuss whether they would like to create their own regular backups of the data, though, or if it might be more effective to reconstruct the data should the need arise.
Security and Privacy Concerns
When moving customer data to the cloud, make sure that everyone involved is aware of your company’s data governance policies in order to prevent problems down the road. This can also help you save some money on the amount of data being stored in the cloud.
Dimension Naming During ETL
When performing your ETL from a table-based database to an unstructured one, be careful with naming. If names are transferred over wholesale, you’ll likely end up with fields from different tables sharing the same name. An easy way to overcome this at first is to name each new dimension in the unstructured database as {oldtablename}_{columnname} and then rename it from there.
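A tiny helper along those lines might look like this (purely illustrative):

```python
# Prefix each incoming column with its source table so that, say,
# orders.id and customers.id don't collide in the unstructured store.
def dimension_name(table: str, column: str) -> str:
    return f"{table}_{column}"

dimension_name("orders", "id")      # "orders_id"
dimension_name("customers", "id")   # "customers_id"
```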
Get Your Motor Running!
Now you can plan the basics of your analytics and data science infrastructure. With many of the key questions and answers defined, the process of implementation and getting managerial buy-in should go much more smoothly.
Having difficulty coming up with answers for your own company? Did I gloss over something important? Let me know in the comments!
by Kyle Speaker via Envato Tuts+ Code