Mark Myers

Wrangling Big Data with Enterprise Search


In an earlier blog post titled, “Use the Four V’s to Better Understand the Big Data Ecosystem,” I discussed the concepts of volume, velocity, variety and variability that represent the measurable dimensions of big data. I then reviewed some research on how the various tools that make up the big data “ecosystem” address these dimensions. Further vetting of these ideas has helped to fuel discussions about the role of enterprise search in addressing big data with customers, partners, analysts and a number of big data practitioners I met at the recent Strata conference in Santa Clara, California. One of the key takeaways of this research is the real-time element that search can add to a big data deployment—more on that later. As promised in my initial post, I developed the topic of search and the value it can bring to big data into a Vivisimo White Paper titled, Optimizing Big Data.

So this blog post is number two in a series on big data. My plan is to progress from general principles to specifics:

  • Part 1 defined the nature and dimensions of “big data,” as well as the relative strengths of the available tools currently used to address big data, with special emphasis on the role that search can fulfill.
  • In Part 2 (which you are reading today), I’m getting more specific and will identify some scenarios for deploying enterprise search as part of the big data ecosystem.
  • In Part 3 and beyond, I will discuss uses of enterprise search to generate business value from big data in applications such as national security, legal discovery, social network analysis, customer experience management and revenue assurance—what we at Vivisimo call “big data optimization.”

If I may take a moment to define “enterprise search,” I am referring to a comprehensive indexing platform or service with the capability to index content from a variety of different sources and to provide a single point of access for search and discovery. This is a minimalist definition, because a full-featured enterprise search platform like Vivisimo’s Velocity Platform offers many more capabilities, from deep information discovery and collaboration features that are very obvious to end users, to the not-so-obvious but critical back-end capabilities such as entity extraction, security, scalability and fault tolerance. Velocity can also serve as a platform for search-based applications in which the central role of search is not immediately apparent to the end user, but defines the unique capabilities and business value of the application.

Four Scenarios
These scenarios lay the foundation for generating business value from big data. Put another way, they define the architecture of potential search-based applications that leverage big data. A few themes run across all of these scenarios:

  • Truly robust enterprise search adds a real-time element to the predominantly batch-oriented world of big data processing
  • The ability to access multiple different data sources (the “variability” dimension discussed in my previous post) greatly expands the scope of possibilities for exploiting big data
  • Search is accessible and usable to end users, whereas the typical hands-on big data user is a data scientist

So without further ado, I’ll walk through our four scenarios for enterprise search deployed as part of the big data ecosystem.

1)  Indexing and Fusion of Big Data
In this scenario, the search platform indexes content that is resident in a big data repository or “holding area” such as the Hadoop Distributed File System (HDFS). As discussed in my earlier post, such information is typically under the control of data scientists and not easily accessible to end users. Furthermore, the connections and relationships to information in other enterprise systems are not always apparent in the isolation of a big data laboratory. This is where enterprise search can step in, by enabling a single search across both the big data repository and other organizational information. This fusion of enterprise application data with data that has been placed in the big data lab can provide a unique view and insights that would not otherwise be apparent to either the analyst or the everyday user.
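To make the fusion idea concrete, here is a minimal sketch in Python of one index that holds documents from two sources and answers a single query across both. It is a toy illustration and not how Velocity or any particular product works. The class name FusedIndex, the source labels "hdfs" and "crm", and the sample documents are all hypothetical.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class FusedIndex:
    """Toy inverted index that remembers which source each document came from."""

    def __init__(self):
        self.postings = defaultdict(set)  # token -> set of document ids
        self.docs = {}                    # document id -> (source, text)

    def add(self, doc_id, source, text):
        """Index one document and record its originating system."""
        self.docs[doc_id] = (source, text)
        for token in tokenize(text):
            self.postings[token].add(doc_id)

    def search(self, query):
        """Return sorted (doc_id, source) pairs for documents containing every query token."""
        tokens = tokenize(query)
        if not tokens:
            return []
        hits = set.intersection(*(self.postings.get(t, set()) for t in tokens))
        return sorted((doc_id, self.docs[doc_id][0]) for doc_id in hits)


# Hypothetical content: one record from the big data repository, two from a CRM system.
index = FusedIndex()
index.add("hdfs-001", "hdfs", "clickstream log: customer 4411 abandoned cart")
index.add("crm-087", "crm", "Customer 4411 opened a billing complaint")
index.add("crm-102", "crm", "Customer 5120 renewed contract")

results = index.search("customer 4411")
print(results)
```

The single query returns hits from both the repository and the CRM records, which is the connection that stays hidden while the two data sets live in separate systems.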

2)  Indexing and Search of Big Data Analytics
Most of the hard work done in the typical big data lab is designed to drive analytics. A single project could produce an extremely large number of results “packages” that are stored in individual files, aggregated into a single large file, or stored as database records. Future navigation and recall of these results, either individually or in related sets, can be problematic. If they are viewed once and allowed to languish on a file server somewhere, future value may be lost. What if we were to index these analytic products and make them accessible to a broader range of users, over time? Results from analytics can also of course be merged with results from other supported data sources, providing a critical fusion function for deeper insight into the business or mission context of the analysis.

3)  Access and Loading of Content from Diverse Data Sources
The typical big data deployment needs to be “fed” data from wherever it is generated or collected. This step usually involves the creation of custom data adapters. A robust enterprise search platform such as Vivisimo’s Velocity has the ability to collect data from a wide range of external systems, transform it into a format that is useful for merging with other data, and pass it along for processing in a big data project. This process may bypass the normal indexing step, in which case the search platform is providing something similar to an “extract, transform and load” function.
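The collect, transform and pass-along flow can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the adapter functions, the field names (id, note, ticket, text) and the sample exports are hypothetical, and an in-memory buffer stands in for whatever staging area the big data project actually reads from.

```python
import csv
import io
import json


def from_csv(text, source):
    """Adapter for a CSV export: yield one normalized record per row."""
    for row in csv.DictReader(io.StringIO(text)):
        yield {"source": source, "id": row["id"], "body": row["note"].strip()}


def from_json_lines(text, source):
    """Adapter for a JSON-lines export: yield one normalized record per line."""
    for line in text.splitlines():
        if line.strip():
            obj = json.loads(line)
            yield {"source": source, "id": str(obj["ticket"]), "body": obj["text"].strip()}


def load(records, sink):
    """Write normalized records to the sink as JSON lines."""
    for record in records:
        sink.write(json.dumps(record, sort_keys=True) + "\n")


# Hypothetical exports from two external systems with different formats.
csv_export = "id,note\n17, Late delivery reported \n18,Refund issued\n"
json_export = '{"ticket": 901, "text": "Password reset requested"}\n'

sink = io.StringIO()  # stand-in for the staging area of the big data project
load(from_csv(csv_export, "orders"), sink)
load(from_json_lines(json_export, "helpdesk"), sink)

staged = [json.loads(line) for line in sink.getvalue().splitlines()]
print(staged)
```

Each adapter hides the quirks of its source and emits the same record shape, so downstream processing only has to understand one format.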

4)  Bulk Processing and Conversion of Extremely Large Data Sets
An enterprise search system can use the distributed batch processing capabilities of a big data framework such as Hadoop and MapReduce to perform bulk processing tasks such as entity extraction and document conversion against extremely large data sets. In this use case, the native analysis, conversion and metadata extraction processes of the search platform are either deployed within MapReduce or replaced with equivalent functions. The search system could then ingest the output of these processes and pass it along to the indexing stage of the pipeline. This is an attractive option for truly massive data sets, and for organizations that have already invested in big data processing infrastructure to leverage commodity hardware for massively parallel processing.
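As a rough illustration of entity extraction expressed in map and reduce steps, the Python sketch below runs the whole flow in a single process. It does not use Hadoop; the shuffle function merely imitates the grouping that a MapReduce framework performs between the two phases. The regular expression, the function names and the sample documents are assumptions made for the example, and e-mail addresses stand in for whatever entities a real extractor would find.

```python
import re
from collections import defaultdict

# Deliberately simple pattern for e-mail addresses; real extractors are far richer.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def map_extract(doc_id, text):
    """Map step: emit an (entity, doc_id) pair for each e-mail address in the text."""
    for match in EMAIL.findall(text):
        yield match.lower(), doc_id


def shuffle(pairs):
    """Group values by key, as the framework would do between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_postings(entity, doc_ids):
    """Reduce step: collapse the document ids for one entity into a sorted posting list."""
    return entity, sorted(set(doc_ids))


# Hypothetical documents awaiting bulk processing.
documents = {
    "d1": "Contact alice@example.com about the audit.",
    "d2": "Forwarded by Alice@Example.com to bob@example.org",
    "d3": "No addresses here.",
}

pairs = (pair for doc_id, text in documents.items() for pair in map_extract(doc_id, text))
entity_index = dict(reduce_postings(entity, ids) for entity, ids in shuffle(pairs).items())
print(entity_index)
```

The resulting mapping from entity to documents is the kind of output a search system could ingest and hand to its indexing stage.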

Next Step
In a future post, I’ll explore the business side of big data optimization, identifying applications and business solutions that can be deployed using these four scenarios, plus any that are introduced in future discussions.

Are you using search as part of your big data project, or do you plan to do so? What is the deployment architecture? How does search integrate with your other big data infrastructure? What business problems do you propose to solve?


Republished with author's permission from original post by Mark Myers.

Mark Myers

As Sr. Director of Product Management at Vivisimo, Mark is responsible for understanding clients' business challenges and helping to align the company's product and service offerings to ensure that organizations achieve maximum return on investment from Vivisimo's solutions.
