THE INDUSTRY EXPERT: James Kobielus — Defining big data and the data scientist

By Masha Tseveen.

Welcome to the first installment of The Industry Expert Interview Series!

The Industry Expert Series features some of the most well-known experts at IBM talking about Cloud, Big Data & Analytics, their experiences in their respective fields, and their advice for people pursuing a career in technology.

Our first interview is with James G. Kobielus, Senior Program Director, Product Marketing, Big Data Analytics. James is an industry veteran and serves as IBM’s big data evangelist. He spearheads IBM’s thought leadership activities in Big Data, Hadoop, enterprise data warehousing, advanced analytics, business intelligence, data management, and next best action technologies. He works with IBM’s product management and marketing teams in Big Data. He has spoken at such leading industry events as IBM Information on Demand, Hadoop Summit, Strata, and the Forrester Business Process Forum. He has published several business technology books and is a popular provider of original commentary on blogs and many social media channels.

1) What is Big Data? How would you define it for people who are not familiar with it?

Big Data is all about deriving deep business value from advanced analytics and trustworthy data at any scale of operations. Advanced analytics is all about statistical analysis – data mining, predictive modeling, natural language processing, and so forth – to find patterns in the data that might not be obvious at first glance. One common example of these “non-obvious” patterns is the specific set of demographic factors that determine whether one group of customers buys your products while others don’t respond at all to your marketing pitches. Trustworthy data is essential if your statistical models aren’t going to lead you astray. And the notion of “any scale of operations” is also fundamental. A great many business insights can be obtained on relatively small data sets, but many insights require large amounts of data from diverse sources in order for the patterns – in other words, statistical correlations – to become apparent. A common example of the latter is the search for “social-media influencer” customers who, though they may be few in number in your target market, can drive a lot of business your way if they tweet positive things about your products. Such influentials are in “microsegments” that could be overlooked entirely if you only look at small samples from the target population, but that might be readily discovered if you survey the entire population and keep the resultant big data in a high-capacity cloud database.
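To make the small-sample point concrete, here is a minimal Python sketch. Every number in it – the population size, the share of influentials, the two conversion rates – is invented purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical population: 1,000,000 customers, of whom 0.5% are
# social-media "influentials" who convert at 30%, versus 2% for
# everyone else. Every number here is invented for illustration.
N = 1_000_000
influential = rng.random(N) < 0.005
converted = rng.random(N) < np.where(influential, 0.30, 0.02)

def report(label, idx):
    """Count influentials in a sample and estimate their conversion rate."""
    k = int(influential[idx].sum())
    if k == 0:
        print(f"{label}: no influentials observed")
    else:
        rate = converted[idx][influential[idx]].mean()
        print(f"{label}: {k} influentials, converting at {rate:.0%}")

# A small random sample catches only a handful of influentials, so the
# microsegment's behavior can't be estimated reliably (or at all)...
report("500-customer sample", rng.choice(N, size=500, replace=False))

# ...whereas the full population makes the pattern unmistakable.
report("Full population", np.arange(N))
```

In a 500-customer sample the microsegment contributes only a handful of observations, so its elevated conversion rate is easy to miss entirely; over the full population, the same pattern is unmistakable.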

2) Many people are concluding that the answer to working with large amounts of data is Hadoop. What is your viewpoint on this?

In spite of what you may have heard, Hadoop is not the sum total of big data.

Another big data “H”—hybrid—is becoming dominant, and Hadoop is an important (but not all-encompassing) component of it. In the larger evolutionary perspective, big data is evolving into a hybridized paradigm under which Hadoop, massively parallel processing (MPP) enterprise data warehouses (EDW), in-memory columnar, stream computing, NoSQL, document databases, and other approaches support extreme analytics in the cloud.

Hybrid architectures address the heterogeneous reality of big data environments and respond to the need to incorporate both established and new analytic database approaches into a common architecture. The fundamental principle of hybrid architectures is that each constituent big data platform is fit-for-purpose to the role for which it’s best suited. These big data deployment roles may include any or all of the following: data acquisition, collection, transformation, movement, cleansing, staging, sandboxing, modeling, governance, access, delivery, interactive exploration, and archiving.

In any role, a fit-for-purpose big data platform often supports specific data sources, workloads, applications, and users. Hybrid is the future of big data because users increasingly realize that no single type of analytic platform is always best for all requirements. Also, platform churn—plus the heterogeneity it usually produces—will make hybrid architectures more common in big data deployments.

Hybrid deployments are already widespread in many real-world big data deployments. The most typical are the three-tier—also called “hub-and-spoke”—architectures. These environments may have, for example, Hadoop (e.g., IBM InfoSphere BigInsights) in the data acquisition, collection, staging, preprocessing, and transformation layer; relational-based MPP EDWs (e.g., IBM PureData System for Analytics) in the hub/governance layer; and in-memory databases (e.g., IBM Cognos TM1) in the access and interaction layer.

The complexity of hybrid architectures depends on the range of sources, workloads, and applications you’re trying to support. In the back-end staging tier, you might need different preprocessing clusters for each of the disparate sources: structured, semi-structured, and unstructured. In the hub tier, you may need disparate clusters configured with different underlying data platforms—RDBMS, stream computing, HDFS, HBase, Cassandra, NoSQL, and so on—and corresponding metadata, governance, and in-database execution components. And in the front-end access tier, you might require various combinations of in-memory, columnar, OLAP, dimensionless, and other database technologies to deliver the requisite performance on diverse analytic applications, ranging from operational BI to advanced analytics and complex event processing.
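As a loose illustration of that fit-for-purpose principle, here is a hypothetical Python sketch of hub-and-spoke routing. The tiers, platforms, and role assignments are invented examples, not a prescribed reference architecture:

```python
# A loose sketch of the fit-for-purpose principle in a three-tier
# ("hub-and-spoke") architecture. The tiers, platforms, and role
# assignments below are hypothetical examples, not a prescribed
# reference architecture.
from dataclasses import dataclass

@dataclass(frozen=True)
class Platform:
    name: str
    tier: str               # "staging", "hub", or "access"
    roles: frozenset        # workload roles this platform is fit for

PLATFORMS = [
    Platform("Hadoop cluster", "staging",
             frozenset({"acquisition", "collection", "transformation", "staging"})),
    Platform("MPP EDW", "hub",
             frozenset({"governance", "modeling", "archiving"})),
    Platform("In-memory database", "access",
             frozenset({"interactive exploration", "delivery", "access"})),
]

def route(role: str) -> Platform:
    """Send each workload role to the platform declared fit for it."""
    for platform in PLATFORMS:
        if role in platform.roles:
            return platform
    raise ValueError(f"no fit-for-purpose platform for role: {role!r}")

for role in ("acquisition", "governance", "interactive exploration"):
    platform = route(role)
    print(f"{role} -> {platform.name} ({platform.tier} tier)")
```

The point the sketch encodes is simply that no single platform serves every role; each workload lands on the tier best suited to it.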

3) Gartner finds that by 2015, the demand for data and analytics resources will reach 4.4 million jobs globally, but only one-third of those jobs will be filled. The emerging role of data scientist is meant to fill that skills gap. What does it take to be a data scientist? If you were hiring a person for this job, what would you look for?

We can argue till we’re blue in the face on the issue of whether a true data scientist must have academic credentials. But no one doubts that credentials mean little if you can’t actually do the work.

You can call yourself a data scientist in good conscience only if you can master the methodology. Yes, there’s a significant – some might say “scary” – learning curve awaiting anybody who seriously wants to enter this field. Many people let their fear of math keep them from getting that degree, cracking open the books, glancing at the journals, or paying close attention when data scientists are speaking.

Data scientists must truly walk the walk through a thicket of statistical algorithms and techniques. It’s not enough to have a passing familiarity with regression modeling, for example, because that’s not the only statistical approach in the data scientist’s kitbag and, besides, there are several ways to regress variables, none of which is perfectly suited to every modeling scenario. Choosing the right modeling approach is often a creative exercise that demands expert human judgment.
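Here is a small sketch of that point, using scikit-learn on synthetic, invented data: two nearly collinear predictors, a scenario where ordinary least squares and ridge regression can behave quite differently, and where neither is automatically the right call:

```python
# A sketch of why "regressing variables" is not one single technique.
# Synthetic, invented data: two nearly collinear predictors, a setting
# where plain OLS coefficients grow unstable and ridge regression's
# penalty can help; in other scenarios the opposite call is right.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)   # almost a duplicate of x1
X = np.column_stack([x1, x2])
y = 3.0 * x1 + rng.normal(scale=1.0, size=n)

for name, model in [("OLS", LinearRegression()),
                    ("Ridge", Ridge(alpha=1.0))]:
    cv_r2 = cross_val_score(model, X, y, cv=5).mean()
    coefs = model.fit(X, y).coef_.round(2)
    print(f"{name:5s} mean CV R^2 = {cv_r2:.3f}, coefficients = {coefs}")
```

The two models may score similarly here, but the OLS coefficients split arbitrarily between the collinear predictors while ridge shrinks them toward a stable compromise; which behavior you want depends on the modeling scenario.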

You don’t need to have a Ph.D. in statistics to be a data scientist. What you do need are curiosity, intellectual agility, statistical fluency, research stamina, scientific rigor and a skeptical nature. You must also be articulate, because no one will accept the validity of the patterns you surface if you can’t explain clearly how you built your model, what variables and data you used, or what the results truly mean in the context of either a business problem or a scientific endeavor.

What I’ve sketched are the aptitudes and skills I would look for if I were hiring for the job of data scientist. An autodidact could conceivably fit these criteria to a T. However, enterprise big data initiatives thrive if all data scientists have been trained and certified on a common curriculum with the following foundation:

  • Paradigms and practices: Every data scientist should acquire a grounding in the core concepts of data science, analytics and data management. They should gain a common understanding of the data science lifecycle, as well as the typical roles and responsibilities of data scientists in every phase. They should be instructed on the various roles of data scientists and how they work in teams and in conjunction with business domain experts and stakeholders. And they should learn a standard approach for establishing, managing and operationalizing data science projects in the business.
  • Algorithms and modeling: Every data scientist should obtain a core understanding of linear algebra, basic statistics, linear and logistic regression, data mining, predictive modeling, cluster analysis, association rules, market basket analysis, decision trees, time-series analysis, forecasting, machine learning, Bayesian and Monte Carlo statistics, matrix operations, sampling, text analytics, summarization, classification, principal components analysis, experimental design, unsupervised learning, and constrained optimization. (A minimal worked example of one of these techniques follows this list.)
  • Tools and platforms: Every data scientist should master a core group of modeling, development and visualization tools used on your data science projects, as well as the platforms used for storage, execution, integration and governance of big data in your organization. Depending on your environment, and the extent to which data scientists work with both structured and unstructured data, this may involve some combination of data warehousing, Hadoop, stream computing, NoSQL and other platforms. It will probably also entail providing instruction in MapReduce, R and other new open-source development languages, in addition to SPSS, SAS and any other established tools.
  • Applications and outcomes: Every data scientist should learn the chief business applications of data science in your organization, as well as how to work best with subject-domain experts. In many companies, data science focuses on marketing, customer service, next best offer, and other customer-centric applications. Often, these applications require that data scientists understand how to leverage customer data acquired from structured survey tools, sentiment analysis software, social media monitoring tools and other sources. It is also essential that every data scientist gain an understanding of the key business outcomes – such as maximizing customer lifetime value – that should focus their modeling initiatives.
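As promised above, here is a minimal worked example of one curriculum item – cluster analysis – using scikit-learn. The customer features and group parameters are synthetic, invented purely for illustration:

```python
# A minimal worked example of one curriculum item above (cluster
# analysis) using scikit-learn. The customer features and group
# parameters are synthetic, invented purely for illustration.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(7)

# Hypothetical customers described by annual spend (dollars) and
# store visits per month, drawn from three loosely separated groups.
groups = [((200, 2), 60), ((900, 8), 60), ((450, 15), 60)]
X = np.vstack([rng.normal(loc=center, scale=(40, 1.5), size=(n, 2))
               for center, n in groups])

# Standardize first: k-means is distance-based, so unscaled dollars
# would otherwise swamp the visit counts.
X_std = StandardScaler().fit_transform(X)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_std)

for k in range(3):
    members = X[labels == k]
    print(f"cluster {k}: {len(members)} customers, "
          f"mean spend ${members[:, 0].mean():.0f}, "
          f"{members[:, 1].mean():.1f} visits/mo")
```

Even a toy exercise like this forces the judgment calls the curriculum should instill: how many clusters, which features, and whether to rescale them before measuring distance.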

Classroom instruction is important, but a curriculum that is 100% devoted to reading books, taking tests and sitting through lectures is insufficient. Hands-on laboratory work is paramount for a truly well-rounded data scientist. Enterprises need to make sure that their data scientists acquire certifications and degrees that reflect hands-on experience developing statistical models that use real data and address substantive business issues.

4) Recently IBM announced the establishment of the IBM Watson Group to focus on big data and cognitive computing. What is your take on this?

That was an extremely important announcement, both for IBM and for our customers, in the development of the Smarter Planet. In addition to forming the IBM Watson Group, we’ve recently announced a wide range of other moves, both in solutions and services, that signify the strategic importance of cognitive computing in our solution portfolio.

One of the most strategic moves was the announcement of IBM Watson Foundations in January 2014. Leveraging IBM’s full Big Data and analytics solution portfolio, IBM Watson Foundations provides a comprehensive, integrated set of capabilities that enable clients to find and capitalize on actionable insights. It provides IBM clients with tools and capabilities to tap into all relevant data – regardless of source or type – and run analytics to gain fresh insights in real time, securely across any aspect of an enterprise, including revenue generation, marketing, finance, risk and operations.

Watson Foundations includes business analytics with predictive and decision management capabilities, information management with in-memory and stream computing, enterprise content management, as well as information integration and governance. Because these capabilities are packaged into modular offerings, organizations of any size can address immediate needs for decision support, gain sustainable value from initial investments, and grow from there.

Watson Foundations comprises all of IBM’s offerings in these solution categories:

  • IBM Decision Management
  • IBM Content Analytics
  • IBM Planning & Forecasting
  • IBM Discovery & Exploration
  • IBM BI & Predictive Analytics
  • IBM Content Management
  • IBM Hadoop System
  • IBM Stream Computing
  • IBM Data Management & Data Warehouse
  • IBM Information Integration & Governance

Cognitive computing — which relies on big data, machine learning, cloud computing, the Internet of Things, and other key infrastructures — will become even more pervasive across our deepening solution portfolio in coming years.

5) Let’s talk more about your professional journey. How did your career journey begin? What took you to where you are today?

I had absolutely no idea in my college years, or even early in my career, that I would be where I am today. Life’s an adventure.

I have degrees in economics (B.A., University of Michigan, Ann Arbor, 1980) and journalism (M.A., University of Wisconsin, Madison, 1984). My original plan was to become a business writer for a newspaper, magazine, or what have you. If I’d actually gotten any job offers in that direction, that’s what I might have done with my life. In the back of my mind, I wanted to specialize in covering one or another industry as a deep “beat.”

After grad school, in 1985, the first decent job offer that had some relevance to my ambitions was as a business writer at a telecommunications trade association in Washington, DC. It was the year after the AT&T divestiture. The US telecommunications industry was being deregulated, the economy was booming, and high-tech startups were springing up all over. It was the perfect time to get in on the ground floor of a dynamic industry that, 30 years later, is still booming. Just as important, someone was willing to take a chance on an unproven J-school grad who didn’t have much technical background but was a quick study who expressed himself clearly.

That’s where it started, and before long I began a strange, zigzagging career that brought me, in various capacities, to consulting firms, tech vendors, and analyst groups. Also, from the late ’80s to the late ’00s, I maintained an active sideline as a freelance business/tech writer, most notably as a longtime columnist for Network World. I also published two business/tech books during this period. It was those “sidelines,” plus my many years as an IT industry analyst/consultant (Burton Group, Current Analysis, Forrester Research), that built my visibility, reputation, and knowledge base as a versatile “guru” on a wide range of IT topics.

By the time IBM hired me in April 2012, I was already well-known throughout the emerging Big Data industry. One of my core roles at IBM is as a thought leader and subject-matter expert on all things connected with Big Data and analytics. I produce and post fresh thought leadership material every single work day. I’m both an outward-facing spokesperson for IBM on these topics and an internal resource working with many IBM teams – in products and services – on a wide range of initiatives. What I’m doing for IBM is not dissimilar to what I did as an industry analyst.

6) What are the most memorable moments of your career, things that make you smile whenever you think about them?

That’s hard to pluck out of a long career. Perhaps it was when my books were published (though they didn’t sell well) – I had proved to myself that I could write book-length material that was reasonably good. Perhaps it was when I was quoted in the Wall Street Journal late in my analyst years, concerning an acquisition in the Hadoop market. As with the books, that was great fodder for bragging. Or perhaps it was when IBM brought me on board as its Big Data Evangelist. Finally, I was with a company that everybody knew, and in a high-visibility position. Of course, that move upped the pressure on me to be even sharper at what I do. I like to think I’ve risen to the challenge.

7) What three pieces of advice would you give to people who would like to follow a career path similar to yours?

First, don’t fool yourself into thinking that you’re “not technical” simply because you have no schooling or credentials of a technical nature. Second, don’t overlearn any one technical topic – instead, you’ll need to master a wide range of disciplines to be effective and agile over your career. And third, keep on challenging yourself outside your comfort zone. Stay hungry for new things to do. Don’t fall into a mental rut.

8) Anything else you would like to share?

I do yoga. It keeps me calm, strong, and agile. I need that to do this day after day without fail.