Sunday, 29 March 2020

Cloud-Based Quantum Computing Shows Increasing Enterprise Interest

Over the past year, quantum computing has started moving from the realm of computer science research into something enterprises might actually be able to use in the workplace. The emergence of quantum computing in the cloud has piqued the curiosity of tech buyers, who are now looking at practical applications for this kind of computing.

Quantum Computing Adoption

Tech buyers are not the only ones looking at it, either. Recent research from IDC suggests that behind this interest lies the prospect that quantum computing will give organizations improved AI capabilities, accelerated business intelligence, and increased productivity and efficiency.
The research, entitled Quantum Computing Adoption Trends: 2020 Survey Findings (behind paywall), says that while cloud-based quantum computing is a young market and funds allocated to quantum computing initiatives are limited (0-2% of IT budgets), end users are optimistic that early investment will result in a competitive advantage.
A number of industries are particularly interested in it. Manufacturing, financial services and security firms are all experimenting with potential use cases, developing advanced prototypes, and are further along in their implementations. However, complex technology, skillset limitations, a lack of available resources, and cost are deterring some organizations from investing in quantum computing technology.
The rise of quantum computing has not gone unnoticed by investors, who have been sniffing around cutting-edge companies developing new ways to give enterprises access to quantum computing. Take Berkeley, Calif.-based Rigetti Computing. Founded in 2013 by Chad Rigetti, it has made its quantum computers available over the cloud since 2017 and recently raised $71 million in a funding round to support the company’s future development.
Through its Quantum Cloud Services (QCS) platform, its machines can be integrated into any public, private or hybrid cloud. In a recent interview Michael Brett, senior vice president of applications at Rigetti, pointed out that over the past two years the number of companies providing access to quantum computers has gone from two (IBM and Rigetti) to many more.
However, in recent months this has been transformed by the entry of big cloud vendors. Rigetti, for example, has a deal that makes its quantum compute time available through AWS. Microsoft Azure offers a similar service, and other cloud players are expected to follow. This not only opens up a more efficient route to this kind of computing, it also changes the market dynamics: “AWS is both an aggregator and a distributor of quantum compute, as well as a provider of software tools to help developers,” Brett told DigiFin.

Quantum Developments

SRI International (SRI) is an American nonprofit scientific research institute headquartered in Menlo Park, California. It is one of the supporters of the Quantum Economic Development Consortium (QED-C), which is also backed by the National Institute of Standards and Technology (NIST) in the Department of Commerce and by more than 70 U.S. companies from across the supply chain.
Just before Christmas, QED-C held a workshop to identify cryogenic technology advances that would enable a ten-fold improvement in quantum information science and technology (QIST) in the next decade. “Cryogenics is slowing progress toward the establishment of a quantum industry, which will have both economic and national security implications, and which will create benefits to mankind that we can only dream of today,” Luke Mauritsen, founder and CEO of Montana Instruments, said of the workshop.
As well as outlining the path for the future development of quantum computing, the workshop identified some of the problems that need to be solved before quantum computing can move forward. Among them are reductions in size, weight and power requirements that would enable commercial applications that are not feasible today.
On top of that, for many at research universities and in small companies, the cost of the systems required for certain applications, which can exceed $1 million, is a barrier to entry. Finally, the dwindling pipeline of suitably educated workers is a growing problem for companies in the field.

Quantum Potential

That said, Itamar Sivan, CEO of Israel-based Quantum Machines, which also closed a recent funding round, this time for $17.5 million, argues that cloud-based quantum computing is a field that holds incredible potential. The advantages seem obvious: quantum computing promises immense computational power, far beyond anything classical computing will ever be able to provide. Much like the growth of cloud computing for AI and big-data applications, the anticipated adoption of quantum cloud computing by large corporations will surely be immense. “So, in theory, as IDC has stated, quantum cloud investment can provide advantages across all industries, and we highly recommend corporations to begin allocating resources to the field,” he said.
Such investments would be gradual, keeping pace with the field, so that companies are not left behind when quantum reaches its tipping point or, even better, can gain a first-mover advantage as soon as it does. Whether that happens in two years or five, companies that have invested wisely in the field might benefit from an unprecedented advantage.

Is Quantum Computing Necessary?

But is quantum computing necessary? Jay Valentine of Austin-based Cloud-Sliver believes not. Instead, he believes that edge computing will drastically undercut quantum. Quantum computing as it is currently practiced, he said, is a hardware solution that runs very specific apps superfast. It is massively expensive, and the skills needed to make it work are scarce. Edge computing (as opposed to computing at the edge) has created an entirely different distributed tech stack that enables any app to be broken up into an effectively unlimited number of pieces and run simultaneously on tiny hardware devices such as a Raspberry Pi or an embedded device. “These today actually exceed the speed of any quantum computer with published results. And it is delivered with traditional hardware (Raspberry Pi), inexpensively, today, and is in production at several sites,” he said.
He said his firm is already taking apps that consume an $87 million data center for 90 hours to come to a result and delivering the same result in 20 seconds on a single Raspberry Pi. This kind of edge tech, he argued, will eliminate most of the need for quantum computers.

Thursday, 26 March 2020

Australian government readies COVID-19 information app

The federal government is looking to publish an app to help distribute information on the COVID-19 outbreak.

The COVID-19 Gov't Mobile Platform opportunity was posted on the Digital Transformation Agency's (DTA) Digital Marketplace, which is touted by the agency as being a "simple, open platform that brings government buyers and digital sellers together".

"This request seeks to engage an experienced seller to partner with the DTA to continue to develop, support, and host a government mobile platform to allow the dissemination of COVID-19 virus information, related restrictions, and other supporting advice and directions," the overview says.

The opportunity was posted on Wednesday and will remain open until 6 pm AEDT on Thursday.

So far, only one vendor has been invited to participate.

The length of the contract is six months, with the option for a further six-month extension.

The essential criterion listed is experience in developing mobile platforms, while the desirable criterion is the ability to support and host mobile platforms during a pandemic.

The app was initially announced when the federal government first unveiled its COVID-19 support package. It falls under the new national communications campaign, which will see AU$30 million spent on providing people with "practical advice on how they can play their part in containing the virus and staying healthy", such as through the app.

The Australian government on Wednesday began its text message campaign, sending advice to nearly 36 million mobile numbers on how to protect the health of individuals and the broader community.

"As the spread of the coronavirus increases, it's vital every Australian understands the practical action they must take to look after themselves and help us protect those most at risk," a statement from Minister for Health Greg Hunt, Australian Chief Medical Officer Professor Brendan Murphy, and Minister for Communications, Cyber Safety and the Arts Paul Fletcher said.

The government said it would continue to use text messages as one of its communication methods.

This follows Canberra suffering backlash earlier this week for its inability to provide appropriate tech capability to handle the number of Australians seeking government assistance.

In the wake of COVID-19, social distancing measures and business closures have left many without a job. In response, the Australian government announced, over the past fortnight, a handful of measures to support the newly unemployed as more than 1 million Australians could be forced onto welfare.

However, on Monday, thousands were unable to access the government's myGov online portal to sign up for income assistance.

"We are deeply sorry about this," Morrison said on Tuesday night.

"We've gone from 6,000 to 50,000 to 150,000 all in the space of, a matter of a day. And tonight, they're working to boost it again. I would say to Australians, yes, we are terribly sorry, but at the same time, we are asking Australians, even in these most difficult of circumstances, to be patient. Everyone is doing their best."

OPEN-SOURCING TRACING TECH IN SINGAPORE
Earlier this week, Singapore announced it will open-source its COVID-19 contact-tracing app, TraceTogether.

"GovTech Singapore is now working around the clock to finalise our protocol reference documents and reference implementation, so that others may deploy their own flavours of TraceTogether -- each implementing the BlueTrace protocol," Minister-in-charge of the Smart Nation Programme Office initiative Vivian Balakrishnan said in a Facebook post.

"We believe that making our code available to the world will enhance trust and collaboration in dealing with a global threat that does not respect boundaries, political systems or economies."

TraceTogether is built on the BlueTrace protocol, designed by the Government Digital Services team at Government Technology Agency of Singapore.

Participating devices exchange proximity information whenever an app detects another device with the TraceTogether app installed.

Balakrishnan said TraceTogether has been installed by more than 620,000 people.

The app uses Bluetooth Received Signal Strength Indicator (RSSI) readings between devices over time to approximate the proximity and duration of an encounter between two users, the TraceTogether website explains.

"This proximity and duration information is stored on one's phone for 21 days on a rolling basis -- anything beyond that would be deleted. No location data is collected," it adds.

"If a person unfortunately falls ill with COVID-19, the Ministry of Health (MOH) would work with them to map out their activity for past 14 days, for contact tracing. And if the person has the TraceTogether app installed, they can grant MOH to access their TraceTogether Bluetooth proximity data -- making it quicker for MOH to contact people who had close contact with the infected individual, to provide timely guidance and care."

COVID-19 slams tech outfits and startups in India

With COVID-19 cutting a devastating swath through the world, what everyone wants to know about India is how bad the situation really is. In a country with a large but mostly poor population of 1.3 billion and a per capita income of around just $2,000, a virus such as this can spread like wildfire and cause devastation.

So far, India has seen 612 cases and twelve deaths, but these are questionable numbers considering the lack of testing kits, testers, and the country's massive population. China shut Wuhan down almost instantly and still suffered. India, like China, could also be deeply affected, especially if it has under-reported its figures. Realising this, the Indian government has done the smart thing by implementing a 21-day lockdown -- a de-facto "house arrest" -- along with an international and domestic flight ban and a stoppage of the railway service.

The Indian government, with help from tech firm Haptik, has also launched a WhatsApp chatbot called MyGov Corona Helpdesk where people can text with questions about the virus. In turn, they can receive instant responses, including information about where they could receive assistance. 

By implementing these initiatives, the government hopes that much of the immediate threat from the virus will have blown over by the end of this period.

As the body count continues to rise, it may seem trivial to talk about the state of the tech sector, as I am going to try to do, but the reality is that tech employs a wide swath of people, including Uber and Ola drivers and delivery personnel. With potentially nothing in their bank accounts to fall back on, their livelihoods are directly at stake as businesses close because of COVID-19.

Here's a short account of how various parts of the tech sector have fared with the ongoing crisis.

ONLINE GROCERIES
A few days ago, I wrote an article about how edtech firms in India are experiencing an upswing with schools being shut down during the exam period. Similarly, the revenues of internet-enabled grocery outfits such as Grofers and Bigbasket have gone through the roof, doubling according to Quartz, as Indians go through a surge of panic buying. Average basket values have also gone up by as much as 20%.

Experts say this stickiness will continue as new users who would not otherwise have tried online grocery shopping discover its convenience. However, online shopping has taken a big hit in general, with Amazon halting shipments of non-essential products.

STARTUPS
For many tech companies that are still in their infancy, this pandemic will probably come as a bitter blow. February already registered a 19-month low for investments, and although this was primarily due to fewer big-ticket acquisitions, it is no doubt a harbinger of things to come.

Sequoia recently sent out an email to its companies to "question every assumption" about their business and to start thinking about how to cut spending and jobs. Firms that have focused mainly on customer acquisition rather than generating profits will also find that the game has dramatically changed. According to Livemint, fundraising has already ground to a halt, leaving burn rate as the only thing many companies can still control.

One promising startup sector that has already been hard hit is logistics, which had been booming until now. Shipsy, a Gurugram-based company that uses a digital platform to connect exporters and importers, has already seen its business plummet by 25%, and with global shipping and transport grinding to a halt, this looks like the tip of the iceberg.

ENTERTAINMENT
Just as it did in Europe a few days ago, Netflix announced that it would throttle its traffic on Indian telecom networks by 25% in order to alleviate the congestion caused by cooped-up people binge-watching its content. YouTube and Amazon have made similar announcements.

MANUFACTURING
One of the biggest potential impacts on India's employment and economy could be the shuttering of manufacturing plants and assembly lines in the tech sector. For example, Xiaomi has eight factories that churn out smartphones, smart TVs, and power banks. One can only imagine the ripple effect on revenues, handset supply, and jobs if it were forced to close the doors on all of its factories.

So far, Xiaomi, Lenovo-Motorola, and Lava have already been impacted, with some of their smartphone factories being forced to shut down following diktats by state governments.

ONLINE TRAVEL
Needless to say, no one is even thinking of travelling, so you can imagine the implosion currently taking place for online-enabled homestays, hotels, and travel websites such as the rapidly growing Oyo. Leading travel aggregator Yatra.com said that 35% of its travel and hotel bookings to international destinations have already been cancelled.

Tuesday, 25 February 2020

Hybrid-cloud management requires new tools, skills

Hybrid cloud environments can deliver an array of benefits, but in many enterprises, they're becoming increasingly complex and difficult to manage. To cope, adopters typically turn to some type of management software. What soon becomes apparent, however, is that hybrid cloud management tools can be as complex and confounding as the environments they're designed to support.

A hybrid cloud typically includes a mix of computing, storage and other services. The environment is formed by a combination of on-premises infrastructure resources, private cloud services, and one or more public cloud offerings, such as Amazon Web Services (AWS) or Microsoft Azure, as well as orchestration among the various platforms.

Any organization contemplating a hybrid cloud deployment should begin building a transition framework at the earliest possible stage. "The biggest decision is what data and which applications should be on-premises due to the sensitivity of data, and what goes into the cloud," says Umesh Padval, a partner at venture capital firm Thomvest Ventures.

Numerous other issues also need to be sorted out at the start, including the ultimate destination of lower priority, yet still critical, data and applications. Will they be kept on premises forever or migrated at some point into the cloud? With applications and data scattered, security is another major concern. Operational factors and costs also need to be addressed at the very beginning. "Your email application may run great in your data center, but may operate differently in the cloud," Padval notes.

Hybrid cloud tools immature yet evolving
A complex hybrid cloud requires constant oversight as well as a way to intuitively and effectively manage an array of operations, including network performance, workload management, security and cost control. Not surprisingly, given the large number of management tasks needed to run an efficient and reliable hybrid cloud environment, adopters can select from a rapidly growing array of management tools.

"There’s a dizzying array of options from vendors, and it can be difficult to sort through them all," says R. Leigh Henning, principal network architect for data center operator Markley Group. "Vendors don’t always do the best job at making their differentiators clear, and a lot of time and effort is wasted as a result of this confusion. Companies are getting bogged down in an opaque field of choices."

The current hybrid cloud management market is both immature and evolving, declares Paul Miller, vice president of hybrid cloud at Hewlett Packard Enterprise. Vendors are still getting a handle on the types of management tools their customers need. "Offerings are limited and may not be supported across all public, on-premises and edges," Miller adds.

Perhaps the biggest challenge to hybrid cloud management is that the technology adds new, complex and frequently discordant layers to operations management. "Many solutions have compatibility restrictions on the components they can manage, locking your management platform into a vendor or group of vendors, which may or may not align with your current or future system architecture," warns George Burns III, senior consultant of cloud operations for IT professional services firm SPR.

A lack of standardized APIs, which in turn results in a shortage of standardized management tools, presents another adoption challenge. "The lack of standardized tools increases operational complexity through the creation of multiple incongruent tools; this leads to vendor lock-in and, in some cases, gross inefficiencies in terms of resource utilization," explains Vipin Jain, CTO of Pensando, a software-defined services platform developer. "To make it worse, these kinds of problems are typically 'solved' by adding another layer of software, which further increases complexity, reduces debuggability, and results in suboptimal use of features and resources."

Meanwhile, using standardized open-source tools can be an effective starting point to safeguard against compatibility issues. "Cloud Native Computing Foundation (CNCF) tools, such as Kubernetes and Prometheus, are good examples," Jain says. "Open-source tools from HashiCorp, such as Vault, Vagrant, Packer, and Terraform, [provide] a good normalization layer for multi-cloud and hybrid cloud deployments, but they are by no means sufficient," he notes. Ideally, the leading public cloud vendors would all agree on a standardized set of APIs that the rest of the industry could then follow. "Standardization can be a moving target, but it's critical from an efficiency and customer satisfaction perspective," Jain says.

Developers writing API configurations, as well as developers using API configurations, form a symbiotic relationship that should be mutually maintained, Burns advises. "Hardware vendors need to be open about changes and enhancements coming to their products and how that will affect their APIs," he explains. "Equally, management platform developers need to be mindful of changes to hardware platform APIs, [and] regularly participate in testing releases and provide adequate feedback to the vendor about results and functionality."

Prioritize management requirements; expect gaps
Even when everything works right, there are often gaps remaining between intended and actual management functionality. "In an ideal world, developers would have the perfect lab environments that would allow them to successfully test each product implementation, allowing functionality to be seamless across upgrades," Burns observes. "Unfortunately, we can’t expect everything to function perfectly and cannot forgo [on-site] testing."

When selecting a hybrid cloud management platform, it's important to not only be aware of its documented limitations, but also to know that nothing is certain until it's tested in its user's own hybrid cloud environment, Burns advises. "Gaps will exist, but it's ultimately your responsibility to fully identify and verify those gaps in your own environment," he says.

Further muddling the situation is the fact that many management tool packages are designed to supply multiple functions, which can make product selection difficult and confusing. "To simplify, customers need to consider which features are most important to them based on their use cases and can show a quick return on investment, mapping to their specific cloud journey," Miller explains.

Real-world experience with hybrid cloud management
Despite management challenges, most hybrid cloud adopters find a way to get their environment to function effectively, reliably and securely.

Gavin Burris, senior project leader, research computing, at the Wharton School of the University of Pennsylvania, appreciates the flexibility a hybrid cloud provides. "We have a small cluster ... that's generally available to all the faculty and PhD students," he notes. The school's hybrid environment supports a fair share prioritization scheme, which ensures that all users have access to the resources they need to support their work. "When they need more, they're able to request their own dedicated job queue that's run in the cloud," he says.

Burris, who uses Univa management products, says that having a management tool that allows fast and easy changes is perfect for individuals who like to maintain firm control over their hybrid environment. "I like to do things with scripting and automation, so to be able to go in and write my own rules and policies and build my own cluster with these management tools is really what I’m looking for," he explains.

James McGibney, senior director of cybersecurity and compliance at Rosendin Electric, an electrical contractor headquartered in San Jose, Calif., relies on a hybrid cloud to support a variety of essential operations. "Approximately two years ago we embarked on our journey from an on-premises disaster recovery, quality assurance and production environment to a cloud migration encompassing hundreds of terabytes of data," he says. McGibney relies on a management console provided by AWS and VMware. The tool meets his current needs, but like many hybrid cloud administrators, he's keeping a close eye on industry developments. "We're currently investigating [other] options, just to see what’s out there," he says. Yet he doesn't expect to make any changes in the short term. "We're happy with the tools currently provided by AWS and VMware."

Sharpen network skills for hybrid cloud
Selecting a hybrid cloud management platform is not as simple as purchasing software and spinning up some VMs to run it. "During implementation, ensure that you have selected the proper product owners and engineers, and then determine what, if any, additional education or credentials they will need to effectively deploy and maintain the platform," Burns suggests. "Fully define your architecture, ensure buy-in from your staff, work with them to identify education gaps and create a solid operational plan for going forward."

Most hybrid cloud management tasks focus on configuration and access control operations, which tend to be both complex and challenging to implement. "At the same time, the beauty of the cloud is its ability to automate," says Mike Lamberg, vice president and CISO at ION Group and its Openlink unit, which provides risk management, operations and finance software. Yet deploying a high level of automation also requires new skills and developers who can expertly handle the demands of virtual software-defined infrastructures as well as traditional environments. "We can’t assume that because teams can build applications in physical data centers that these skills will translate as they move to the cloud; new skills are required for success," Lamberg notes.

Hybrid cloud management requires a new team mindset. "IT networking staff literally need to unlearn what they know about physical networks and connectivity and recognize that the moving of packets and data is now handled by a forwarding software configuration, not by physical routers or switches," Lamberg says. "You can’t take what you did in building and supporting physical data centers and just apply it to the cloud—it simply doesn’t work."

In the big picture, transitioning to a hybrid cloud environment can solve many problems, yet it can also create some new obstacles if not properly implemented and managed. "Don't rush into any decision without considering all the points of impact that you can identify," Burns advises. "Make sure that you understand the breadth of a hybrid infrastructure and how it will be used to address business needs."

Wednesday, 12 February 2020

How to query and extract data from SaaS applications

Behind every SaaS application are databases storing business information about employees, suppliers, customers, and other partners. SaaS applications support workflows such as CRM for sales and marketing, cloud ERPs for financials, workforce management for human resource functions, and other enterprise and departmental services. Today, many businesses use a wide range of SaaS applications—from mainstream products such as Salesforce, Slack, Workday, and Atlassian, to many smaller SaaS tools.

SaaS applications shouldn’t operate in silos, and most organizations need to integrate capabilities across them and with other enterprise applications managed in private or public clouds.

If a workflow across multiple applications requires application integration, then development teams can leverage a SaaS platform’s APIs to trigger events from one platform to another. Enterprise integration platforms such as Boomi, SnapLogic, or MuleSoft are options when many applications and services need integration. If lighter-weight integrations that follow the If This, Then That pattern are required, then an IFTTT platform may provide sufficient integration. Development teams should also explore low-code platforms such as Appian, OutSystems, and PowWow if they are developing new applications that connect to multiple SaaS and enterprise workflows.

Leveraging SaaS data for different business needs

What if you need to integrate the data from a SaaS platform with other data sources? There are a few reasons why data integration across SaaS tools may be required:

  • Business analysts want to develop reports and dashboards using this data.
  • Data science teams want the data for machine learning experiments.
  • Business teams want to centralize the data to support workflows and other types of applications. For example, marketing teams often use customer data platforms or master data platforms to centralize data on customers, products, and other business entities.
  • IT teams want to extract the data for backups or to enable transitions to other platforms.
  • Legal teams sometimes need to perform legal discovery on the underlying data.
  • Data stewards often want to cleanse, transform, or enrich the underlying data.

Sure, you can leverage the SaaS platforms’ APIs to extract data, but this may require a significant development effort to learn the APIs, understand the SaaS platform’s data model, create data stores for any new data, write the code to load the data, and develop the logic for any transformations. In addition, IT teams have to define cloud or data center infrastructure to host this application or service. Lastly, ongoing support is required for any data integrations designed to run on a schedule or on demand. Developing the integration from scratch may be expensive for development teams and IT organizations with other, more strategic priorities.
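
To make that development effort concrete, here is a minimal Python sketch of the do-it-yourself approach: page through a hypothetical SaaS REST API and load the records into a local SQLite table. The endpoint URL, field names, and bearer-token authentication are illustrative assumptions rather than any particular vendor's API, and a real integration would also need error handling, scheduling, and incremental loads.

import sqlite3
import requests

API_URL = "https://api.example-saas.com/v1/contacts"   # hypothetical endpoint
API_TOKEN = "your-api-token"                           # placeholder credential

def fetch_contacts():
    """Page through the (assumed) paginated API and yield raw records."""
    page = 1
    while True:
        resp = requests.get(
            API_URL,
            headers={"Authorization": f"Bearer {API_TOKEN}"},
            params={"page": page, "per_page": 100},
            timeout=30,
        )
        resp.raise_for_status()
        records = resp.json().get("data", [])   # assumed response envelope
        if not records:
            break
        yield from records
        page += 1

def load(records):
    """Create a simple target table and upsert the extracted rows."""
    conn = sqlite3.connect("saas_extract.db")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS contacts (id TEXT PRIMARY KEY, name TEXT, email TEXT)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO contacts (id, name, email) VALUES (?, ?, ?)",
        [(r.get("id"), r.get("name"), r.get("email")) for r in records],
    )
    conn.commit()
    conn.close()

if __name__ == "__main__":
    load(list(fetch_contacts()))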

Another methodology is to consider data integration, data streaming, ETL (extraction, transformation, and loading), or other data prep platforms. Using a data integration platform may be the optimal method when working with large volumes of data that frequently change, since these platforms enable flexible extraction and transformation. However, they also require upfront development for the integration before end-users get access and utilize the information.


Lighter-weight means of querying and managing SaaS data may be desirable. Sometimes these are useful for rapid experimentation, discovery, and prototyping. Other times these approaches can be used for operational or production needs, especially when data volumes are low and query throughput isn’t significant. Here are three options.

1. BI platforms that directly query SaaS applications

If your primary requirement is reporting, then many self-service BI and data visualization platforms have direct connectors to the more popular SaaS applications.
  • Tableau can connect to platforms such as Intuit Quickbooks, Google Analytics, LinkedIn Sales Navigator, ServiceNow, Eloqua, Marketo, and Salesforce.
  • Microsoft Power BI also integrates with online services such as Adobe Analytics, Facebook, GitHub, MailChimp, Stripe, Quick Base, and Zendesk.
  • Domo claims to have more than a thousand connectors, including platforms such as HubSpot, Jira, Instagram, Qualtrics, Shopify, SurveyMonkey, Twitter, and Workday.
At a minimum, these integrations provide an easy way to query and discover the underlying SaaS data sources. At best, the out-of-the-box integration is sufficient for end-users to create the required data blending, reports, and dashboards.

There are some considerations.
  • These platforms enable joins and data blends when columns have matching keys. They become harder to use if significant data transformation is required before integrating the data source or blending it with other data sources.
  • Review whether SaaS data integrations are performed with real-time queries, or whether the data is extracted or cached.
  • Performance may be a factor if the SaaS application contains large data volumes, if there are complex joins with many other data sources, or if dashboards will be utilized concurrently by many users.
2. Platforms that emulate ODBC, JDBC, OData, or other drivers

If the business needs to go beyond reporting and dashboarding, and a lightweight integration approach is still desirable, then some commercial tools convert SaaS APIs into standard database drivers such as ODBC, JDBC, or OData. Two options for drivers to common SaaS platforms are Progress DataDirect and CData Driver Technologies.

The driver method may be most useful to data science teams who want to perform ad hoc queries into SaaS databases before pulling the data into their analysis. It’s also a good option for application developers who require real-time querying of SaaS application data.

Development and data science teams should investigate the performance of this integration, especially if high query volumes, large data sets, or low latency is required. Also, many SaaS applications throttle or charge customers based on API usage, so this may be a factor if higher query or data volumes are needed.
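
As a sketch of the driver approach, the snippet below queries a SaaS application through an ODBC driver using Python's pyodbc. The DSN name and the Account table and columns are assumptions; the objects actually exposed depend on the driver vendor's configuration and the SaaS platform's data model.

import pyodbc

# Connect through a driver configured as a system DSN (name is hypothetical).
conn = pyodbc.connect("DSN=SalesforceODBC", autocommit=True)
cursor = conn.cursor()

# The driver exposes SaaS objects as SQL tables, so standard SQL applies.
cursor.execute(
    "SELECT Id, Name, AnnualRevenue FROM Account WHERE AnnualRevenue > ?",
    1_000_000,
)
for row in cursor.fetchall():
    print(row.Id, row.Name, row.AnnualRevenue)

cursor.close()
conn.close()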

3. Lightweight ETL platforms that sync SaaS data to cloud databases

One final idea is to instrument a data integration out of the SaaS application into a cloud database that your organization sets up and manages. This strategy adds some operational complexity and costs, and it may not be ideal if real-time querying of the SaaS application data is required. But it does have several advantages:
  • It provides more control over the database platform and data architecture that business users, data scientists (including citizen data scientists), and application developers utilize. The platform and architecture should meet the volume, performance, and latency requirements.
  • Storing the data independent of the SaaS database provides greater flexibility to transform, join, cleanse, cube, or aggregate data as required by downstream users and applications.
  • If data security, data privacy, or other data governance controls for querying this data are different from the access and entitlement controls available in the SaaS applications, then hosting the data in a separate database may be required.
  • Hosting the data independent of the SaaS platform may be more cost-effective for higher data and query volume needs.
Although you could instrument this integration with data integration or data prep platforms, there are SaaS data integration platforms with out-of-the-box connectors to many SaaS applications. Stitch, a Talend company, is a plug-and-play solution if your objective is to stream data from SaaS applications to cloud databases. You can select what data to replicate and the replication frequency, but it does not provide any tools for transforming or filtering the data. Skyvia offers a similar product, and both have free tiers to let development teams try out integrations. Alooma, part of Google Cloud, focuses on moving data into big data platforms such as Google BigQuery, Amazon Redshift, and Snowflake, and provides some data transformation capabilities.

If your organization is utilizing many SaaS platforms, then a one-size-fits-all strategy may not work. Each integration path supports different SaaS integrations, and the type of integration must align with anticipated business needs. Reviewing the tools and considering multiple options is a best practice, especially when data integration needs vary.

Tuesday, 21 January 2020

Deep learning vs. machine learning: Understand the differences

Machine learning and deep learning are both forms of artificial intelligence. You can also say, correctly, that deep learning is a specific kind of machine learning. Both machine learning and deep learning start with training and test data and a model and go through an optimisation process to find the weights that make the model best fit the data. Both can handle numeric (regression) and non-numeric (classification) problems, although there are several application areas, such as object recognition and language translation, where deep learning models tend to produce better fits than machine learning models.

Machine learning explained
Machine learning algorithms are often divided into supervised (the training data are tagged with the answers) and unsupervised (any labels that may exist are not shown to the training algorithm). Supervised machine learning problems are further divided into classification (predicting non-numeric answers, such as the probability of a missed mortgage payment) and regression (predicting numeric answers, such as the number of widgets that will sell next month in your Manhattan store).

Unsupervised learning is further divided into clustering (finding groups of similar objects, such as running shoes, walking shoes, and dress shoes), association (finding common sequences of objects, such as coffee and cream), and dimensionality reduction (projection, feature selection, and feature extraction).

Classification algorithms
A classification problem is a supervised learning problem that asks for a choice between two or more classes, usually providing probabilities for each class. Leaving out neural networks and deep learning, which require a much higher level of computing resources, the most common algorithms are Naive Bayes, Decision Tree, Logistic Regression, K-Nearest Neighbors, and Support Vector Machine (SVM). You can also use ensemble methods (combinations of models), such as Random Forest, other Bagging methods, and boosting methods such as AdaBoost and XGBoost.
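
As a quick illustration, the following scikit-learn sketch fits several of the classifiers named above on a built-in dataset and compares test accuracy; the default hyperparameters are only a starting point, not a recommendation.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = {
    "Naive Bayes": GaussianNB(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Logistic Regression": LogisticRegression(max_iter=5000),
    "Random Forest": RandomForestClassifier(random_state=0),
}
for name, model in models.items():
    model.fit(X_train, y_train)                      # learn from the training split
    print(f"{name}: test accuracy = {model.score(X_test, y_test):.3f}")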

Regression algorithms
A regression problem is a supervised learning problem that asks the model to predict a number. The simplest and fastest algorithm is linear (least squares) regression, but you shouldn’t stop there, because it often gives you a mediocre result. Other common machine learning regression algorithms (short of neural networks) include Naive Bayes, Decision Tree, K-Nearest Neighbors, LVQ (Learning Vector Quantization), LARS Lasso, Elastic Net, Random Forest, AdaBoost, and XGBoost. You’ll notice that there is some overlap between machine learning algorithms for regression and classification.
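
A similar sketch for regression, again with scikit-learn, fits linear (least squares) regression alongside two of the tree-based regressors mentioned above and reports R-squared on held-out data.

from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for name, model in [
    ("Linear (least squares)", LinearRegression()),
    ("Decision Tree", DecisionTreeRegressor(random_state=0)),
    ("Random Forest", RandomForestRegressor(random_state=0)),
]:
    model.fit(X_train, y_train)
    print(f"{name}: R^2 on test data = {model.score(X_test, y_test):.3f}")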

Clustering algorithms
A clustering problem is an unsupervised learning problem that asks the model to find groups of similar data points. The most popular algorithm is K-Means Clustering; others include Mean-Shift Clustering, DBSCAN (Density-Based Spatial Clustering of Applications with Noise), GMM (Gaussian Mixture Models), and HAC (Hierarchical Agglomerative Clustering).
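
The short sketch below runs K-Means and DBSCAN on a synthetic dataset; the cluster count and the eps radius are illustrative values that you would normally tune for real data.

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans, DBSCAN

# Synthetic data with four well-separated groups.
X, _ = make_blobs(n_samples=500, centers=4, cluster_std=0.8, random_state=0)

kmeans_labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
dbscan_labels = DBSCAN(eps=0.7, min_samples=5).fit_predict(X)

print("K-Means cluster labels:", sorted(set(kmeans_labels)))
print("DBSCAN cluster labels (-1 is noise):", sorted(set(dbscan_labels)))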

Dimensionality reduction algorithms
Dimensionality reduction is an unsupervised learning problem that asks the model to drop or combine variables that have little or no effect on the result. This is often used in combination with classification or regression. Dimensionality reduction algorithms include removing variables with many missing values, removing variables with low variance, Decision Tree, Random Forest, removing or combining variables with high correlation, Backward Feature Elimination, Forward Feature Selection, Factor Analysis, and PCA (Principal Component Analysis).
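
Here is a short PCA sketch that projects a 30-feature dataset down to the components explaining roughly 95% of the variance; scaling first matters because PCA is sensitive to feature scale.

from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X, _ = load_breast_cancer(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)   # PCA is scale-sensitive

pca = PCA(n_components=0.95)                   # keep 95% of the variance
X_reduced = pca.fit_transform(X_scaled)
print(f"Reduced from {X.shape[1]} features to {X_reduced.shape[1]} components")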

Optimization methods
Training and evaluation turn supervised learning algorithms into models by optimizing their parameter weights to find the set of values that best matches the ground truth of your data. The algorithms often rely on variants of steepest descent for their optimizers, for example stochastic gradient descent, which is essentially steepest descent performed multiple times from randomized starting points.

Common refinements on stochastic gradient descent add factors that correct the direction of the gradient based on momentum, or adjust the learning rate based on progress from one pass through the data (called an epoch) to the next.
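
The toy numpy sketch below spells out the stochastic gradient descent update with an optional momentum term, fitting a one-variable linear model to noisy data. The learning rate and momentum values are arbitrary choices for illustration.

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 200)
y = 2 * x + 1 + rng.normal(0, 0.1, 200)    # noisy samples of y = 2x + 1

w, b = 0.0, 0.0
vw, vb = 0.0, 0.0                 # momentum "velocity" terms
lr, momentum = 0.02, 0.9          # with momentum = 0.0 this is plain SGD

for epoch in range(20):
    for i in rng.permutation(len(x)):        # visit samples in random order
        pred = w * x[i] + b
        grad_w = 2 * (pred - y[i]) * x[i]    # gradient of the squared error
        grad_b = 2 * (pred - y[i])
        vw = momentum * vw - lr * grad_w     # momentum update
        vb = momentum * vb - lr * grad_b
        w += vw
        b += vb

print(f"learned w={w:.2f}, b={b:.2f} (target: 2.00, 1.00)")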

Data cleaning for machine learning
There is no such thing as clean data in the wild. To be useful for machine learning, data must be aggressively filtered. For example, you’ll want to:

  • Look at the data and exclude any columns that have a lot of missing data.
  • Look at the data again and pick the columns you want to use (feature selection) for your prediction. This is something you may want to vary when you iterate.
  • Exclude any rows that still have missing data in the remaining columns.
  • Correct obvious typos and merge equivalent answers. For example, U.S., US, USA, and America should be merged into a single category.
  • Exclude rows that have data that is out of range. For example, if you’re analyzing taxi trips within New York City, you’ll want to filter out rows with pickup or drop-off latitudes and longitudes that are outside the bounding box of the metropolitan area.

There is a lot more you can do, but it will depend on the data collected. This can be tedious, but if you set up a data cleaning step in your machine learning pipeline you can modify and repeat it at will.
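
A pandas sketch of those cleaning steps might look like the following; the file name, column names, thresholds, and bounding box are all made-up examples for a hypothetical taxi-trip dataset.

import pandas as pd

df = pd.read_csv("trips.csv")                       # hypothetical input file

# Drop columns that are mostly missing, then rows still missing values.
df = df.loc[:, df.isna().mean() < 0.4]
df = df.dropna()

# Merge equivalent spellings into one category.
df["country"] = df["country"].replace({"U.S.": "USA", "US": "USA", "America": "USA"})

# Exclude rows outside a rough New York City bounding box.
df = df[df["pickup_lat"].between(40.5, 41.0) & df["pickup_lon"].between(-74.3, -73.6)]

df.to_csv("trips_clean.csv", index=False)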

Data encoding and normalization for machine learning
To use categorical data for machine classification, you need to encode the text labels into another form. There are two common encodings.

One is label encoding, which means that each text label value is replaced with a number. The other is one-hot encoding, which means that each text label value is turned into a column with a binary value (1 or 0). Most machine learning frameworks have functions that do the conversion for you. In general, one-hot encoding is preferred, as label encoding can sometimes confuse the machine learning algorithm into thinking that the encoded column is supposed to be an ordered list.
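
For example, label encoding and one-hot encoding of a small text column look like this with pandas and scikit-learn.

import pandas as pd
from sklearn.preprocessing import LabelEncoder

colors = pd.Series(["red", "green", "blue", "green"], name="color")

label_encoded = LabelEncoder().fit_transform(colors)          # one integer per category
one_hot = pd.get_dummies(colors, prefix="color").astype(int)  # one 0/1 column per category

print(label_encoded)
print(one_hot)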

To use numeric data for machine regression, you usually need to normalize the data. Otherwise, the numbers with larger ranges might tend to dominate the Euclidian distance between feature vectors, their effects could be magnified at the expense of the other fields, and the steepest descent optimization might have difficulty converging. There are a number of ways to normalize and standardize data for machine learning, including min-max normalization, mean normalization, standardization, and scaling to unit length. This process is often called feature scaling.
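
A brief example of two common scalings with scikit-learn, applied to a tiny matrix whose second column has a much larger range than the first.

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0, 2000.0], [2.0, 3000.0], [3.0, 10000.0]])

print(MinMaxScaler().fit_transform(X))    # each column rescaled to [0, 1]
print(StandardScaler().fit_transform(X))  # each column to zero mean, unit variance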

Feature engineering for machine learning
A feature is an individual measurable property or characteristic of a phenomenon being observed. The concept of a “feature” is related to that of an explanatory variable, which is used in statistical techniques such as linear regression. Feature vectors combine all the features for a single row into a numerical vector.

Part of the art of choosing features is to pick a minimum set of independent variables that explain the problem. If two variables are highly correlated, either they need to be combined into a single feature, or one should be dropped. Sometimes people perform principal component analysis to convert correlated variables into a set of linearly uncorrelated variables.

Some of the transformations that people use to construct new features or reduce the dimensionality of feature vectors are simple. For example, subtract Year of Birth from Year of Death and you construct Age at Death, which is a prime independent variable for lifetime and mortality analysis. In other cases, feature construction may not be so obvious.
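
The Age at Death example takes one line in pandas, and a quick correlation matrix is an easy way to spot highly correlated variables that are candidates for dropping or combining; the data here is invented for illustration.

import pandas as pd

df = pd.DataFrame({
    "year_of_birth": [1900, 1925, 1950],
    "year_of_death": [1970, 1990, 2020],
})

# Construct the new feature from two existing columns.
df["age_at_death"] = df["year_of_death"] - df["year_of_birth"]
print(df.corr())    # highly correlated pairs stand out in the matrix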

Splitting data for machine learning
The usual practice for supervised machine learning is to split the data set into subsets for training, validation, and test. One way of working is to assign 80% of the data to the training data set, and 10% each to the validation and test data sets. (The exact split is a matter of preference.) The bulk of the training is done against the training data set, and prediction is done against the validation data set at the end of every epoch.

The errors in the validation data set can be used to identify stopping criteria, or to drive hyperparameter tuning. Most importantly, the errors in the validation data set can help you find out whether the model has overfit the training data.

Prediction against the test data set is typically done on the final model. If the test data set was never used for training, it is sometimes called the holdout data set.

There are several other schemes for splitting the data. One common technique, cross-validation, involves repeatedly splitting the full data set into a training data set and a validation data set. At the end of each epoch, the data is shuffled and split again.
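
Here is a sketch of both approaches with scikit-learn: an 80/10/10 split built from two calls to train_test_split, and five-fold cross-validation on the training portion. The dataset and model are just placeholders.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)

# Carve off 20%, then split that half-and-half into validation and test sets.
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.2, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

# Cross-validation as an alternative: five train/validation splits of the training data.
scores = cross_val_score(LogisticRegression(max_iter=5000), X_train, y_train, cv=5)
print("Cross-validation accuracy per fold:", scores.round(3))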

Machine learning libraries
In Python, Spark MLlib and Scikit-learn are excellent choices for machine learning libraries. In R, some machine learning package options are caret, randomForest, e1071, and kernlab. In Java, good choices include Java-ML, RapidMiner, and Weka.

Deep learning explained
Deep learning is a form of machine learning in which the model being trained has more than one hidden layer between the input and the output. In most discussions, deep learning means using deep neural networks. There are, however, a few algorithms that implement deep learning using other kinds of hidden layers besides neural networks.

The ideas for “artificial” neural networks go back to the 1940s. The essential concept is that a network of artificial neurons built out of interconnected threshold switches can learn to recognize patterns in the same way that an animal brain and nervous system (including the retina) does.

Backprop
The learning occurs basically by strengthening the connection between two neurons when both are active at the same time during training. In modern neural network software this is most commonly a matter of increasing the weight values for the connections between neurons using a rule called back propagation of error, backprop, or BP.

Neurons in artificial neural networks
How are the neurons modeled? Each has a propagation function that transforms the outputs of the connected neurons, often with a weighted sum. The output of the propagation function passes to an activation function, which fires when its input exceeds a threshold value.
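A single artificial neuron can be sketched in a few lines of Python; the weights, bias, and threshold activation below are illustrative assumptions, not any particular library’s implementation:

```python
import numpy as np

def step(z, threshold=0.0):
    # Threshold activation: the neuron "fires" (outputs 1) when its input exceeds the threshold.
    return 1.0 if z > threshold else 0.0

def neuron(inputs, weights, bias):
    # Propagation function: a weighted sum of the outputs of the connected neurons.
    z = np.dot(weights, inputs) + bias
    # The result passes to the activation function.
    return step(z)

print(neuron(np.array([0.5, -1.0, 2.0]), np.array([0.1, 0.4, 0.3]), bias=0.05))
```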

Activation functions in neural networks
In the 1940s and ’50s artificial neurons used a step activation function and were called perceptrons. Modern neural networks are sometimes still described as using perceptrons, but they actually have smooth activation functions, such as the logistic or sigmoid function, the hyperbolic tangent, or the Rectified Linear Unit (ReLU). ReLU is usually the best choice for fast convergence, although it has an issue of neurons “dying” during training if the learning rate is set too high.
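The smooth activation functions mentioned above are simple to write out; this sketch uses plain NumPy and is not tied to any particular framework:

```python
import numpy as np

def sigmoid(z):
    # Logistic function: squashes any input into the (0, 1) range.
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    # Hyperbolic tangent: squashes any input into the (-1, 1) range.
    return np.tanh(z)

def relu(z):
    # Rectified Linear Unit: zero for negative inputs, identity for positive inputs.
    return np.maximum(0.0, z)

z = np.array([-2.0, 0.0, 2.0])
print(sigmoid(z), tanh(z), relu(z))
```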

The output of the activation function can pass to an output function for additional shaping. Often, however, the output function is the identity function, meaning that the output of the activation function is passed to the downstream connected neurons.

Neural network topologies
Now that we know about the neurons, we need to learn about the common neural network topologies. In a feed-forward network, the neurons are organized into distinct layers: one input layer, n hidden processing layers, and one output layer. The outputs from each layer go only to the next layer.

In a feed-forward network with shortcut connections, some connections can jump over one or more intermediate layers. In recurrent neural networks, neurons can influence themselves, either directly or indirectly through the next layer.
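A plain feed-forward topology is easy to express in Keras; the input shape and layer sizes below are arbitrary assumptions for the sketch:

```python
import tensorflow as tf

# One input layer, two hidden layers, one output layer; each layer feeds only the next.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(20,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.summary()
```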

Training neural networks
Supervised learning of a neural network is done just like any other machine learning: You present the network with groups of training data, compare the network output with the desired output, generate an error vector, and apply corrections to the network based on the error vector. Corrections are typically applied after each batch of training data; a complete pass through the full training data set is called an epoch.
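Continuing the hypothetical Keras model sketched above, compiling and fitting it runs the batch-by-batch corrections and reports validation error after each epoch; the synthetic data exists only to make the sketch runnable:

```python
import numpy as np

# Purely synthetic binary-classification data.
X_train = np.random.rand(800, 20)
y_train = (X_train.sum(axis=1) > 10).astype(int)
X_val = np.random.rand(100, 20)
y_val = (X_val.sum(axis=1) > 10).astype(int)

model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# batch_size controls how much data is run before each correction;
# epochs counts complete passes through the training set.
history = model.fit(X_train, y_train, epochs=20, batch_size=32,
                    validation_data=(X_val, y_val))
```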

Optimizers for neural networks
Optimizers for neural networks typically use some form of gradient descent algorithm to drive the back propagation, often with a mechanism to help avoid becoming stuck in local minima, such as optimizing randomly selected mini-batches (Stochastic Gradient Descent) and applying momentum corrections to the gradient. Some optimization algorithms also adapt the learning rates of the model parameters by looking at the gradient history (AdaGrad, RMSProp, and Adam).
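In Keras, these optimizers are available off the shelf; the learning rates shown are typical starting points rather than recommendations:

```python
import tensorflow as tf

sgd      = tf.keras.optimizers.SGD(learning_rate=0.01)                 # plain stochastic gradient descent
momentum = tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9)   # SGD with momentum corrections
adagrad  = tf.keras.optimizers.Adagrad(learning_rate=0.01)             # per-parameter adaptive learning rates
rmsprop  = tf.keras.optimizers.RMSprop(learning_rate=0.001)
adam     = tf.keras.optimizers.Adam(learning_rate=0.001)
```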

As with all machine learning, you need to check the predictions of the neural network against a separate validation data set. Without doing that you risk creating neural networks that only memorize their inputs instead of learning to be generalized predictors.

Deep learning algorithms
A deep neural network for a real problem might have upwards of 10 hidden layers. Its topology might be simple, or quite complex.

The more layers in the network, the more characteristics it can recognize. Unfortunately, the more layers in the network, the longer it will take to calculate, and the harder it will be to train.

Convolutional neural networks (CNN) are often used for machine vision. Convolutional neural networks typically use convolutional, pooling, ReLU, fully connected, and loss layers to simulate a visual cortex. The convolutional layer basically takes the integrals of many small overlapping regions. The pooling layer performs a form of non-linear down-sampling. ReLU layers apply the non-saturating activation function f(x) = max(0,x). In a fully connected layer, the neurons have connections to all activations in the previous layer. A loss layer computes how the network training penalizes the deviation between the predicted and true labels, using a Softmax or cross-entropy loss function for classification, or a Euclidean loss function for regression.
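A minimal Keras sketch of that layer stack, assuming 28x28 grayscale images and 10 output classes, might look like this:

```python
import tensorflow as tf

cnn = tf.keras.Sequential([
    tf.keras.Input(shape=(28, 28, 1)),
    tf.keras.layers.Conv2D(32, kernel_size=3, activation="relu"),   # convolution + ReLU
    tf.keras.layers.MaxPooling2D(pool_size=2),                      # non-linear down-sampling
    tf.keras.layers.Conv2D(64, kernel_size=3, activation="relu"),
    tf.keras.layers.MaxPooling2D(pool_size=2),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation="relu"),                  # fully connected layer
    tf.keras.layers.Dense(10, activation="softmax"),                # 10-class output
])
# Cross-entropy loss penalizes the deviation between predicted and true labels.
cnn.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
```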

Recurrent neural networks (RNN) are often used for natural language processing (NLP) and other sequence processing, as are Long Short-Term Memory (LSTM) networks and attention-based neural networks. In feed-forward neural networks, information flows from the input, through the hidden layers, to the output. This limits the network to dealing with a single state at a time.

In recurrent neural networks, the information cycles through a loop, which allows the network to remember recent previous outputs. This allows for the analysis of sequences and time series. RNNs have two common issues: exploding gradients (easily fixed by clamping the gradients) and vanishing gradients (not so easy to fix).

In LSTMs, the network is capable of forgetting (gating) previous information as well as remembering it, in both cases by altering weights. This effectively gives an LSTM both long-term and short-term memory, and solves the vanishing gradient problem. LSTMs can deal with sequences of hundreds of past inputs.
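A small Keras LSTM for sequence classification shows the shape of such a network; the vocabulary size, embedding width, and binary output here are illustrative assumptions:

```python
import tensorflow as tf

rnn = tf.keras.Sequential([
    tf.keras.Input(shape=(None,)),                               # variable-length sequences of token IDs
    tf.keras.layers.Embedding(input_dim=10000, output_dim=64),   # map token IDs to dense vectors
    tf.keras.layers.LSTM(64),                                    # carries state across the sequence
    tf.keras.layers.Dense(1, activation="sigmoid"),              # e.g. positive/negative sentiment
])
rnn.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```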

Attention modules are generalized gates that apply weights to a vector of inputs. A hierarchical neural attention encoder uses multiple layers of attention modules to deal with tens of thousands of past inputs.

Random Decision Forests (RDF), which are not neural networks, are useful for a range of classification and regression problems. Instead of layers of neurons, an RDF is built from many decision trees, and it outputs a statistical average (the mode for classification or the mean for regression) of the predictions of the individual trees. The randomized aspects of RDFs are the use of bootstrap aggregation (a.k.a. bagging) to train each tree on a different sample of the data, and the use of random subsets of the features for the trees.
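With scikit-learn, a random forest that bags samples and subsets features per tree takes only a few lines; the Iris data set and hyperparameters are placeholders:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Each tree sees a bootstrap sample of the rows and a random subset of the features;
# the forest reports the mode of the individual trees' predictions.
rf = RandomForestClassifier(n_estimators=200, max_features="sqrt", random_state=0)
rf.fit(X_train, y_train)
print(rf.score(X_test, y_test))
```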

XGBoost (eXtreme Gradient Boosting), also not a deep neural network, is a scalable, end-to-end tree boosting system that has produced state-of-the-art results on many machine learning challenges. Bagging and boosting are often mentioned in the same breath; the difference is that instead of generating an ensemble of randomized trees, gradient tree boosting starts with a single decision or regression tree, optimizes it, and then builds the next tree from the residuals of the first tree.
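A hedged sketch with the xgboost Python package, using a scikit-learn data set purely for illustration:

```python
import xgboost as xgb
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Gradient boosting: each new tree is fit to the residual errors of the ensemble so far.
model = xgb.XGBClassifier(n_estimators=300, learning_rate=0.1, max_depth=4)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))
```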

Some of the best Python deep learning frameworks are TensorFlow, Keras, PyTorch, and MXNet. Deeplearning4j is one of the best Java deep learning frameworks. ONNX is an open format for exchanging deep learning models, and ONNX Runtime and TensorRT are runtimes for deploying them.

Deep learning vs. machine learning
In general, classical (non-deep) machine learning algorithms train and predict much faster than deep learning algorithms; one or more CPUs will often be sufficient to train a classical model. Deep learning models often need hardware accelerators such as GPUs, TPUs, or FPGAs for training, and also for deployment at scale; without them, the models would take months to train.

For many problems, some classical machine learning algorithms will produce a “good enough” model. For other problems, classical machine learning algorithms have not worked terribly well in the past.

One area that is usually attacked with deep learning is natural language processing, which encompasses language translation, automatic summarization, co-reference resolution, discourse analysis, morphological segmentation, named entity recognition, natural language generation, natural language understanding, part-of-speech tagging, sentiment analysis, and speech recognition.

Another prime area for deep learning is image classification, which includes image classification with localization, object detection, object segmentation, image style transfer, image colorization, image reconstruction, image super-resolution, and image synthesis.

In addition, deep learning has been used successfully to predict how molecules will interact in order to help pharmaceutical companies design new drugs, to search for subatomic particles, and to automatically parse microscope images used to construct a three-dimensional map of the human brain.