Behavioural Analytics

The world around us is changing: we are generating more and more data, and more and more of society runs on that data. We live in a data-rich society, and we are already familiar with some of its effects.

Machine learning algorithms applied to these new big data sources give us the ability to see and measure the pulse of society, which is exciting for many people. We are entering a highly instrumented world in which these data sources let us detect problems very early and address them.

Behavioural Analytics uses machine learning and big data sources to understand and predict behaviour, and it is a powerful approach to understanding humans. The key insight is that people’s behaviour is not so much a function of what goes on inside their heads as of what the people around them do. We learn mostly from other humans; that is what we call culture. When we see people doing something that looks like a good idea, we copy it: it is human nature. The idea did not come from inside your head, it came from copying other people, and the consequence is that ideas flow from person to person to person. Social learning, copying other people and talking to other people, causes ideas to flow around society or within a company.

Once we realize this, we have a different approach to understanding what people want and what they are likely to do. For instance, if we wanted to predict which apps somebody will use, we could look at demographics: how old they are, how much they earn, and so on. But we can also look at what their friends do, and if their friends all use a particular app on their smartphones, it is a safe bet that they will too. In fact, comparing the demographic approach based on individual features with the social-physics approach based on social features, we often find that paying attention to the social context predicts what people will do with up to five times the accuracy of individual features alone. That is a real revolution in analysis. Behavioural Analytics focuses on using big data to understand and predict human behaviour from the social context as well as from traditional demographic measures.
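As a rough illustration of that comparison, here is a minimal sketch in Python (scikit-learn assumed, entirely synthetic data, invented feature names): a model trained on demographic features alone versus one that also sees a social feature, the share of a person’s contacts who already use the app.

```python
# Minimal sketch: demographic features vs. demographic + social features.
# All data below is synthetic; the feature names are hypothetical.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 2000

# Synthetic individual (demographic) features.
age = rng.normal(size=n)
income = rng.normal(size=n)

# Synthetic social feature: fraction of a person's contacts already using the app.
friends_share = rng.uniform(0, 1, size=n)

# Ground truth in this toy world: adoption driven mostly by the social context.
p = 1 / (1 + np.exp(-(0.3 * age + 0.2 * income + 3.0 * friends_share - 1.8)))
adopted = rng.binomial(1, p)

demographic = np.column_stack([age, income])
social = np.column_stack([age, income, friends_share])

for name, X in [("demographic only", demographic), ("with social context", social)]:
    auc = cross_val_score(LogisticRegression(), X, adopted,
                          cv=5, scoring="roc_auc").mean()
    print(f"{name:20s} AUC = {auc:.3f}")
```

On data generated this way the social feature dominates by construction; the “five times” figure quoted above comes from the social-physics studies, not from this toy example.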

 

Nextbit continues the Microsoft partnership with Azure and Business Intelligence competencies

Microsoft has made significant improvements in its infrastructure for supporting Big Data Analytics and Machine Learning. For this reason, Nextbit has continued to renew its partnership and certification with Microsoft on Data Intelligence and Azure Cloud Computing since 2013.

Nextbit holds a Silver Business Intelligence Competency in the Microsoft Partner Network. Over these years of collaboration, Microsoft has recognized our commitment to creating and delivering innovative customer solutions and services based on Cloud Computing and Advanced Analytics.

 

Nextbit is member of Cloudera Connect Partner Program

Nextbit strongly believes in the value of Big Data and Open Source and has been a member of the Cloudera Connect Partner Program since 2013.

Cloudera (www.cloudera.com) offers a powerful and integrated Big Data platform comprising software, support, training, professional services, and indemnity. This platform, which has open source Apache Hadoop software at its core, allows customers to store, process, and analyze far more data, of more types and formats, enabling them to “ask bigger questions”.

The Cloudera Connect Partner Program is designed to champion partner advancement and solution development for the Cloudera Enterprise ecosystem. Cloudera and its partners build a broader, stronger presence by combining the partners’ product and services expertise with CDH, Cloudera’s 100% open source, enterprise-ready distribution of Apache Hadoop.

Nextbit participates in the SAS Forum Italy with the presentation “Effective Sales Forecasting – Amgen Dompé case study”

SAS Forum Italy is an important conference to learn about innovative ideas in the field of Business Intelligence and Business Analytics. This year the conference will be held on April 17 at MiCo – Milano Congressi.

Nextbit, a silver partner of SAS Institute, is a sponsor of the conference and participates actively with the presentation “Effective Sales Forecasting – Amgen Dompé case study”.

The presentation will be given by Nextbit’s Administrator, Federico Pagani.

For more information visit http://www.sasforumitalia.it/ and come and see us at the Nextbit stand at SAS Forum 2013.

Survival Analysis applied in the banking field

Survival Analysis is a branch of statistics that models the time until an event occurs.

Its main application is in demography, especially in the analysis of human mortality. Survival models are built mainly to produce two functions: the Survival Function S(t), which gives the probability that the event happens after time t (i.e. that the subject “survives” beyond t), and the Hazard Curve h(t), which describes the rate at which the event occurs at time t among the subjects still at risk.

In the banking field, one possible application is the description of credit risk. In concrete terms, the “death” event occurs when a contract has n outstanding installments (in other words, when its current delinquency equals n). The event is absorbing: once a contract reaches this bad state, it leaves the portfolio population for good. The other way to leave the population is censoring: the contract ends early, or it never reaches n outstanding installments during the period of analysis.
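As a minimal sketch of that definition (pandas assumed, hypothetical column names and an invented delinquency threshold), here is how monthly contract data could be turned into the (duration, event) pairs survival analysis needs, with censoring for contracts that never reach the threshold:

```python
# Sketch only: hypothetical columns and threshold, tiny invented history.
import pandas as pd

N_THRESHOLD = 3  # delinquency level that defines "death" (assumption)

# One row per contract per month, with the observed current delinquency.
history = pd.DataFrame({
    "contract_id":   [1, 1, 1, 2, 2, 2, 2],
    "month_on_book": [1, 2, 3, 1, 2, 3, 4],
    "delinquency":   [0, 1, 3, 0, 0, 1, 0],
})

def to_survival_record(g):
    dead = g[g["delinquency"] >= N_THRESHOLD]
    if len(dead):
        # Event observed: duration is the first month the threshold is reached.
        return pd.Series({"duration": dead["month_on_book"].min(), "event": 1})
    # Censored: observed up to the last available month without "dying".
    return pd.Series({"duration": g["month_on_book"].max(), "event": 0})

records = history.groupby("contract_id").apply(to_survival_record)
print(records)
```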

The notion of the Hazard Curve may recall another useful banking indicator, the absorbing vintage curve, because the underlying idea is the same: the ratio between the number of “dead” contracts at time t and the total number of contracts.

Their trends, however, are completely different, for two reasons: the vintage numerator is non-decreasing (“dead” contracts keep being counted, unlike in the Hazard Curve), and its denominator is constant (the starting portfolio, whereas in the Hazard Curve the at-risk population shrinks over time).
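A toy calculation (invented numbers, censoring ignored for simplicity) makes the difference concrete: the discrete hazard divides each month’s deaths by the contracts still alive at that month, while the vintage ratio divides the cumulative deaths by the fixed starting portfolio.

```python
# Toy illustration of hazard vs. vintage ratio; all numbers are invented.
deaths_per_month = [5, 8, 6, 4]   # contracts hitting the delinquency threshold
starting_portfolio = 100

alive = starting_portfolio
cumulative_deaths = 0
for t, d in enumerate(deaths_per_month, start=1):
    hazard = d / alive                                   # shrinking denominator
    cumulative_deaths += d
    vintage = cumulative_deaths / starting_portfolio     # fixed denominator
    alive -= d
    print(f"month {t}: hazard = {hazard:.3f}, vintage = {vintage:.3f}")
```

With the same monthly deaths, the hazard can rise or fall as the at-risk population shrinks, while the vintage curve can only grow.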

Through SAS PROC LIFETEST it is possible to carry out several interesting analyses:

- Compare the developments of S(t) and h(t) for a fixed cohort of contracts, according to significant variables; if several variables are specified, SAS computes every combination of their categories.

S(t) and h(t) of a bank’s car loans, split by the new/used car variable

- Compare the developments of S(t) and h(t) for a fixed cohort of contracts, depending on the level of current delinquency.

- Compare the developments of S(t) and h(t) for a fixed level of current delinquency, depending on the cohort (temporal analysis).

The SAS procedure automatically produces graphs and statistics: this study can help understand temporal trends and how these variables influence the credit risk of the portfolio.
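Outside SAS, the same kind of stratified comparison can be sketched with the Python lifelines library (our assumption here, not part of the original workflow), shown below on synthetic car-loan data split by the new/used variable:

```python
# Sketch with synthetic data: Kaplan-Meier curves of S(t) stratified by car type.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from lifelines import KaplanMeierFitter

rng = np.random.default_rng(1)
n = 500
loans = pd.DataFrame({"car_condition": rng.choice(["new", "used"], size=n)})
# Toy assumption: used-car loans tend to reach the bad state sooner.
scale = np.where(loans["car_condition"] == "used", 20.0, 35.0)
loans["duration"] = rng.exponential(scale)
loans["event"] = rng.binomial(1, 0.7, size=n)   # ~30% censored at random

ax = plt.gca()
for car_type, group in loans.groupby("car_condition"):
    kmf = KaplanMeierFitter(label=car_type)
    kmf.fit(group["duration"], event_observed=group["event"])
    kmf.plot_survival_function(ax=ax)

ax.set_xlabel("months on book")
ax.set_ylabel("S(t)")
plt.show()
```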

Furthermore, Survival Analysis lends itself to model development, in two ways:

- models that estimate the Survival Function S(t) and the Hazard Curve h(t) as functions of other significant variables as well; here it is necessary to assume a specific distribution up front in order to estimate the parameters (Gamma, LogNormal, LogLogistic);

- models that forecast the development of the Survival Function S(t) as a function of time t only; in this approach some transformations of S(t) are used, because they are approximately linear.

In the first case we obtain a model as a function of n+1 variables (time t plus n significant variables), while in the second it depends only on time (through a method similar to linear regression).
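A sketch of the first route (lifelines assumed, toy data; a Weibull fitter is shown here alongside the LogNormal and LogLogistic ones in place of the Gamma family): fit several candidate distributions and compare them, for example by AIC.

```python
# Sketch: fit parametric survival distributions to toy data and compare by AIC.
import numpy as np
from lifelines import LogNormalFitter, LogLogisticFitter, WeibullFitter

rng = np.random.default_rng(2)
durations = rng.lognormal(mean=3.0, sigma=0.6, size=400)  # toy survival times
events = rng.binomial(1, 0.8, size=400)                   # some observations censored

for fitter in (LogNormalFitter(), LogLogisticFitter(), WeibullFitter()):
    fitter.fit(durations, event_observed=events)
    print(f"{fitter.__class__.__name__:20s} AIC = {fitter.AIC_:.1f}")
```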

Summarizing, Survival Analysis applied in the banking field can be useful for several reasons:

- identification of the most significant variables of a phenomenon
- understanding the influence of these variables on current delinquency
- modelling the development of the Survival Function over time
- forecasting the Survival Function for the coming months
- forecasting the Survival Function as a function of the significant variables

Link Analysis

The increasing use of blogs and forums for comments and product reviews has driven companies to analyze them carefully. Users increasingly query the Web and are influenced by the opinions and suggestions of other users.

Moreover, companies should focus not only on the users’ network but also on the relationships between websites and on the importance of each forum within the “blogosphere”. In fact, just as for web users, blogs can be classified in the following way:

- “opinion leader”: blogs that influence the network

- “follower”: blogs influenced by the surrounding network

- well connected within the network

- withdrawn: practically isolated from the rest of the Web

- renowned on Social Networks or with a good Google PageRank

- unknown on the Web

Vilfredo Pareto’s 80/20 rule suggests that by analyzing the 20% most important sources it is possible to cover 80% of the phenomenon.

Link analysis applied to the blogosphere builds on this idea: it studies the connections among the websites of a specific area, analyzing their visibility on search engines, their social shares, and the links to and from other websites.

The process unfolds in four phases.

1) First, it is necessary to surf the Net, exploring blogs to build a wide sample. The example reported below is based on 41 blogs with comments and reviews about motorcycles.

2) Every single blog is then analyzed to produce some useful indicators about the site:

- Google PageRank

- Number of links to the site

- Number of domains that link to the site

- Number of shares/likes on Social Network

- List of all links that point to the site

3) All of the above information helps to gauge the importance of each website within the Net. An N x N matrix (41 x 41 in this example) is then created that describes the links between websites, indicating the number of hyperlinks from every blog to each of the others. These links are then rearranged into an M x 2 matrix, listing all M connections among the nodes of the network (see the sketch after this list).

4) Subsequently, using applications designed for link analysis, it is possible to reconstruct the blogosphere and customize the view for our own aims.
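Here is a minimal sketch of step 3 (numpy assumed, a 4-blog toy matrix instead of the full 41 x 41 one): turning the link-count matrix into the M x 2 edge list that link-analysis tools typically read.

```python
# Sketch: adjacency matrix (link counts) -> M x 2 edge list; toy 4-blog example.
import numpy as np

# adjacency[i, j] = number of hyperlinks from blog i to blog j
adjacency = np.array([
    [0, 2, 0, 1],
    [1, 0, 0, 0],
    [0, 3, 0, 0],
    [0, 0, 1, 0],
])

sources, targets = np.nonzero(adjacency)
edge_list = np.column_stack([sources, targets])   # M x 2 list of connections
weights = adjacency[sources, targets]             # optional: link count per edge

print(edge_list)
print(weights)
```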

The Motorcycle blogosphere we analyzed had these features:

- directed graph

- node size depending on the number of links

- node colour depending on the number of shares/likes on Social Networks

- arrow width depending on the number of links between blogs

Below is the graph of the blogosphere regarding motorcycles.

[Figure: the motorcycle blogosphere graph]

Obviously it is possible to enlarge the image and analyze every node of the network, focusing on the arrows, the node’s features and its position in the network.
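For readers who want to reproduce a graph of this kind, here is a minimal rendering sketch (networkx and matplotlib assumed, invented blogs and numbers): a directed graph with node size driven by the number of links, node colour by the shares/likes, and arrow width by the number of links between blogs.

```python
# Sketch: render a small directed blogosphere; all names and numbers are invented.
import networkx as nx
import matplotlib.pyplot as plt

edges = [("blogA", "blogB", 3), ("blogB", "blogC", 1),
         ("blogC", "blogA", 2), ("blogA", "blogC", 1)]
shares = {"blogA": 120, "blogB": 15, "blogC": 60}   # likes/shares per blog

G = nx.DiGraph()
G.add_weighted_edges_from(edges)

pos = nx.spring_layout(G, seed=42)
sizes = [300 + 200 * G.degree(n) for n in G.nodes()]      # size ~ number of links
colours = [shares[n] for n in G.nodes()]                  # colour ~ shares/likes
widths = [G[u][v]["weight"] for u, v in G.edges()]        # width ~ links between blogs

nx.draw_networkx(G, pos, node_size=sizes, node_color=colours,
                 width=widths, cmap=plt.cm.viridis, arrows=True)
plt.axis("off")
plt.show()
```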

Moreover, link analysis produces several interesting and useful indicators for understanding the network’s connections and dynamics (a sketch computing them follows the list below):

- in-degree: number of incoming arrows

- out-degree: number of outgoing arrows

- betweenness centrality: how often a node bridges otherwise separate parts of the network (bridge capacity)

- closeness centrality: how easily a node reaches the other vertices

- eigenvector centrality: indicator of centrality in the network

- PageRank: score of the node in the blogosphere

- clustering coefficient: indicator of the presence of connections among neighbouring nodes

- reciprocated vertex pair ratio: share of connected vertex pairs whose links are reciprocated in both directions
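The sketch below shows how these indicators can be computed (networkx assumed; tools such as NodeXL or Gephi expose the same measures) on a small invented directed graph:

```python
# Sketch: common link-analysis indicators on a toy directed graph.
import networkx as nx

G = nx.DiGraph([("blogA", "blogB"), ("blogB", "blogA"),
                ("blogB", "blogC"), ("blogA", "blogC")])

print(dict(G.in_degree()))                           # in-degree
print(dict(G.out_degree()))                          # out-degree
print(nx.betweenness_centrality(G))                  # betweenness centrality
print(nx.closeness_centrality(G))                    # closeness centrality
print(nx.eigenvector_centrality(G, max_iter=1000))   # eigenvector centrality
print(nx.pagerank(G))                                # PageRank
print(nx.clustering(G.to_undirected()))              # clustering coefficient
print(nx.reciprocity(G))                             # share of reciprocated links
```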

Link analysis can therefore be seen as an ex-ante study of blogs and websites, aimed at understanding blog dynamics even better.

This analysis can be very useful because it makes it possible to focus attention only on the relevant websites (the “opinion leaders”) and to ignore unknown blogs.

Alternatively, another solution is to weight the posts of each blog according to its importance in the network.
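A tiny sketch of that alternative (invented numbers): weighting each blog’s post volume by its PageRank from the previous step, instead of using a raw count.

```python
# Sketch: importance-weighted share of voice; the scores and counts are invented.
pagerank = {"blogA": 0.45, "blogB": 0.35, "blogC": 0.20}   # e.g. from nx.pagerank(G)
posts_per_blog = {"blogA": 12, "blogB": 40, "blogC": 150}  # e.g. mentions of a product

weighted_volume = sum(pagerank[b] * posts_per_blog[b] for b in posts_per_blog)
print(weighted_volume)   # weighted volume instead of a raw post count
```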

Wikidata, a worldwide database

Only a few lines describe this brand-new project on the Wikipedia website:

“Wikidata is a proposed project to provide a collaboratively edited database to support Wikipedia. The project is being started by Wikimedia Deutschland and is intended to provide a common source of certain data types, for example, birth dates, a class of validated data, which can be used in all other articles on Wikipedia. It will be the first new project of the Wikimedia Foundation since 2006.”

Wikimedia Deutschland, the German chapter of the Wikimedia movement, is now developing a collaboratively edited database of the world’s knowledge that can be read and edited by humans and machines alike.

Developing a machine-readable database doesn’t just help push the web forward, it also helps Wikipedia itself: Wikidata will support more than 280 languages with one common source of structured data that can be used in all articles of the free encyclopedia. In fact, the idea is for the data to live in the “info box” (on the right side of some Wikipedia pages): Wikidata will drive the info boxes wherever they appear. Of course, some details may still change because the project is still in progress.

The data will be published under a free Creative Commons license, so it can be used for different applications, for example in e-government or to connect data in the sciences.

The Wikidata project will be developed in three phases. The first one (finished in August 2012) centralizes the links among the different language versions of Wikipedia, creating a common source of structured data that can be used in all articles; in this way all the data is recorded and maintained in one place. In the second phase (December 2012) editors will be able to add and use data in Wikidata. The last phase will allow the automatic creation of lists and charts based on the data in Wikidata, to be shown directly in Wikipedia pages. Wikimedia Deutschland will look after Wikidata until March 2013, after which the Wikimedia Foundation will take ownership of the project.

A second purpose of the Wikidata project is to enable users to ask different types of questions (e.g. who is the youngest prime minister in Europe?). Today the only way to answer such questions is to compile a list manually; Wikidata will be able to create these lists automatically. It is no coincidence that the leader of the eight-developer team is Dr. Denny Vrandečić, from the Karlsruhe Institute of Technology: together with Dr. Markus Krötzsch, he is co-founder of the Semantic MediaWiki project, which has pursued the goals of Wikidata for the last few years.

There is one more sentence in the Wikipedia definition of Wikidata:

“The creation of the project was funded by donations from the Allen Institute for Artificial Intelligence, the Gordon and Betty Moore Foundation, and Google, Inc., totaling €1.3 million.”

50% of the total investment was donated by the Allen Institute for Artificial Intelligence, the organization established by Microsoft co-founder Paul Allen in 2010. This organization supports long-range research activities that have the potential to accelerate progress in artificial intelligence (such as web semantics). One quarter of the funding comes from the Gordon and Betty Moore Foundation, through its Science Program. The last quarter has been provided by Google, Inc.: “Google’s mission is to make the world’s information universally accessible and useful,” said Chris DiBona, Director of Open Source at Google. “We’re therefore pleased to participate in the Wikidata project, which we hope will make significant amounts of structured data available to all.”

Google certainly has a big interest in this project: thanks to a centralized semantic database it will be able to provide direct answers to common queries. As it moves further into semantic search, Google could provide answers itself, so that people would spend more time on Google than on detailed websites.

Open Data, Open Information

In the last three years there has been an impressive growth of the Open Data movement. Not only public administrations but also private companies have decided to release important data without any copyright restriction, and even some scientific institutions have shared free datasets.

The number of public datasets has increased sharply in the last few years, covering many different types of data: ZIP codes, public expense reports, crime data, health service data, transport timetables, and so on. This has been made possible by a community of developers, researchers, and journalists who are trying to convince governments and local administrations to release their data.

In 2009 the US and UK governments opened their websites data.gov and data.gov.uk (strongly advocated by Web inventor Tim Berners-Lee); more recently other governments, such as Italy and France, have opened their own open data websites: data.gov.it and data.gov.fr.

Open Data is a deep innovation for several reasons. The most important is the increase in institutional transparency: free online access to public information ensures greater transparency of public administrations’ actions. Another significant aspect is the possibility of developing new businesses based on innovative services: transforming open data into easily accessible information is an open business opportunity.

There are already several useful and successful services and apps for everybody. For example, “Spotlight on spend” tracks public spending, “Fix my street” helps citizens report local street problems, “Spot Crime” displays crimes near an address, and “School-o-Scope” makes official government data about schools easy to explore.

The first Italian initiative for ideas and applications based on open data is Apps4Italy: it was created to foster this new business and met with considerable success.

Italy is beginning to release some open data. The regions of Piemonte and Emilia Romagna and the city of Firenze (also thanks to Wikitalia) are the most active local administrations, but not every local administration has joined the open data movement yet.

It is, however, very important to be able to compare data from different places, both nationally and internationally, to get the full benefit.

David Eavans, an advisor on open data for public administrations and companies, argues that only the comparison of different cities, administrations or countries can deliver a real benefit: if a single city opens its data, it is not enough to change the whole system, but if 100 cities join the project, there will be a real change.

Moreover, Eavans states that if the open data community wants to grow in the future, the involvement of big organizations (such as Google, Microsoft, SAS and the Red Cross) in the project is necessary.

Nextbit participates in the SAS Forum Italy 2012 with the presentation “The value of data extracted from Social Media”

SAS Forum Italy is an important conference to learn about innovative ideas in the field of Business Intelligence and Business Analytics. This year the conference will be held on Tuesday April 17 at MiCo – Milano Congressi.
Nextbit, a silver partner of SAS Institute, is a sponsor of the conference and participates actively with the presentation “The value of data extracted from Social Media”.
The presentation will be given by Nextbit’s Administrator, Federico Pagani, and by Andrea Cerri, Digital Communication Group Manager at BTicino.
The presentation will highlight two very important aspects of social media analysis: the first is the quantity and characteristics of the data and information analyzed; the second is that, unlike in traditional marketing, in the world of Social Media it is the customer who controls the conversation. Companies therefore need to be actively involved in the analysis of these data.
The presentation will discuss the main techniques for analyzing social media and the types of information generated.