Data Scientists – Converting Big Business to Open Source


By Sarah Tarraf, Director, Analytics, Gongos, Inc.

The stereotypical image of a data scientist is that of the harried analyst, hacking away late into the night in an unrestrained startup environment, tapping into massive amounts of data to uncover interesting relationships that explain once-unexplainable phenomena. What we tend to ignore is the more common scenario—a data scientist working in an established corporate environment with a limited technology budget and an incomplete infrastructure. Those of us in these shoes realize that we’re not that different from the hackneyed version of our profession. Like our ‘startup’ counterparts, we too are hungry to tap into the ever-expanding libraries of open source tools and technologies to move organizations toward the future.

However, there exists a paradox in most organizations today. Corporations have an increasing appetite for technology-driven breakthroughs, yet the majority of business analysis is neatly packaged within the confines and constraints of boxed software.

And as data scientists continue to grow into their role from either an analytics or a data function, they must convince their organizations to move away from big-budget, big-boxed software and embrace the open source movement.

In fact, we’ve blazed this trail ourselves.

Knowing that advancements are fueled by open source, data scientists need to put on their executive hats when it comes to articulating the inherent weaknesses of boxed software and make a solid business case for organizations to wholeheartedly adopt these new tools. And, spoiler alert, the biggest case is not what you’re thinking (that it’s free).

Let’s look at the three most compelling reasons:

Boxed software inhibits the true potential of enterprise data.

We’re awash in data. And as analysts we feel the urgency to do something about it. However, most organizations’ software solutions restrict the analysis we can do and the data we can use. When it comes to analysis, the real-time development “community” of open source ensures the ability to experiment with the most up-to-date techniques. When it comes to data, we can take advantage of cloud computing solutions that are scalable and that guarantee access to the additional infrastructure needed for working with large data. As a consequence, leveraging open source to explore enterprise data erases the red tape associated with acquiring new software and infrastructure, lessens the need to measure ROI, and empowers data scientists to experiment with the best ways to extract value across any business challenge.

Case in point: Data Dan was seeking to help a new client identify clusters of consumers who react similarly to digital marketing touchpoints. Dan’s organization’s background was in SPSS, so he accessed the traditional library of distance-based clustering algorithms. After trying a K-means, a two-step, and a hierarchical clustering, Dan was at a loss. The solutions he was finding were interesting, but provided no tangible benefit for his client. He’d been reading about a few new probabilistic approaches to clustering and was eager to try one out with his data. Luckily, although Dan’s employer invested in SPSS licenses for their analysts, they also encouraged experimentation with the open source analytic programs R and Python. Dan was able to install the R packages FlexMix and mclust and run his data through those to see if the solutions uncovered were different (or better!). In the end, these solutions identified a segment of consumers overlooked by the other approaches.
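To make the difference between distance-based and probabilistic clustering concrete, here is a minimal sketch in plain Python. It is not the FlexMix or mclust API, and it is not Dan's analysis: the function names and the engagement scores are made up for illustration. It fits a two-component Gaussian mixture with the EM algorithm and reports, for each consumer, the probability of belonging to each segment rather than a single hard label.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a 1-D normal distribution at x."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def fit_two_gaussians(data, iterations=100):
    """Fit a two-component 1-D Gaussian mixture with EM (illustrative helper).

    Returns (means, variances, weights, memberships), where memberships[i]
    holds the probability that data[i] belongs to each component, as computed
    in the final E-step.
    """
    n = len(data)
    overall_mean = sum(data) / n
    overall_var = sum((x - overall_mean) ** 2 for x in data) / n
    # Assumed starting point: one component at each extreme of the data.
    means = [min(data), max(data)]
    variances = [overall_var, overall_var]
    weights = [0.5, 0.5]
    memberships = []
    for _ in range(iterations):
        # E-step: soft-assign every point to every component.
        memberships = []
        for x in data:
            dens = [w * gaussian_pdf(x, m, v)
                    for w, m, v in zip(weights, means, variances)]
            total = max(sum(dens), 1e-300)  # guard against division by zero
            memberships.append([d / total for d in dens])
        # M-step: re-estimate each component from its weighted points.
        for k in range(2):
            nk = sum(r[k] for r in memberships)
            means[k] = sum(r[k] * x for r, x in zip(memberships, data)) / nk
            spread = sum(r[k] * (x - means[k]) ** 2
                         for r, x in zip(memberships, data)) / nk
            variances[k] = max(spread, 1e-6)  # floor keeps a component from collapsing
            weights[k] = nk / n
    return means, variances, weights, memberships

# Made-up engagement scores for ten hypothetical consumers.
scores = [1.0, 1.2, 0.8, 1.1, 0.9, 5.0, 5.5, 4.5, 6.0, 4.0]
means, variances, weights, memberships = fit_two_gaussians(scores)
for x, probs in zip(scores, memberships):
    print(f"score={x:.1f}  P(segment A)={probs[0]:.3f}  P(segment B)={probs[1]:.3f}")
```

Because each segment gets its own variance and each consumer gets a membership probability, a mixture model can surface groups that differ in spread as well as in location, which a purely distance-based method may fold into a neighboring cluster.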

Boxed software impedes innovation.

Besides the accessibility and usability of tools, the growing group of ‘co-creators’ within the open source community broadens the field of collaborators beyond the walls of corporate workspaces and innovation think tanks. This has yielded, and will continue to yield, fertile ground for innovation. The organic nature of this “cloud-based laboratory” breeds an active, productive, and engaged community of developers and users producing tools that can be adapted to the needs of virtually any industry. In and of itself, the open source community is a catalyst for applied solutions that cross-pollinate from industry to industry. And organizations that miss this opportunity for incremental innovation will be left behind. If you need an indication of the power of this boundless reciprocity, offer your boss a digital tour of the active message boards, blogs, and videos.

Case in point: Analyst Allison was at an impasse. She reached out to her trusted marketing sciences network for guidance on building a predictive model to impute a set of missing data. After receiving input that not only felt stale but wouldn’t do the job, Ally put the question to her open source message board and connected with a biostatistician who had recently grappled with a conceptually different but analytically similar problem. This biostatistician provided Ally with both the resources and the code to create a solution, as well as a fresh take on her challenge.

Boxed software is unappetizing to new talent.

True data scientists are among the most sought-after hires organizations are clamoring to make. Not only do they have the rare combination of skills to make sense of messy, voluminous, unstructured data; they are equipped with the strong communication abilities and business-savvy techniques needed to draw out the most meaningful nuggets of information. Transcending this rare skill set, however, is a more important attribute: their passion. This often means spending their free time acquiring new skills, availing themselves of open courseware, and downloading publicly available data sets to put these skills to use. An organization that does not provide this same open and collaborative environment is anathema to the very individuals it is trying to recruit.

Case in point: Scientist Sally heads a data analytics team for a decision intelligence company. With roots in traditional quantitative research, this company’s analysts were heavily trained in SPSS and SAS. Sensing a substantive shift in the industry, Sally knew she needed to both grow existing—and attract new kinds of—talent to make better sense of the expansive and diverse types of data clients have in their grasp. Initially, she was able to tap into open source to train her tenured talent, which in turn fostered a learning environment that has become one of her biggest recruiting tools to draw in new talent.

One can argue that the value of the open source community conflicts with the worldview of corporate decision-makers: “if it’s going to provide value, of course we have to pay for it.” Even so, this movement is well aligned with the notion of the new “shared economy.”

As today’s data scientists continue their own personal training and advancement within the open source community, they will constantly be met with opportunities to drive big business forward. The collective learning environment perpetuated by the “generosity” of this community empowers data scientists—and decision makers—with new levels of confidence. This momentum will not only allow for the experimentation necessary for organizations to remain competitive but will continue to evolve business as we know it today.
