At TDWI Boston, iBasis (a leading global VoIP provider) presented their Striim solution, covering how they:

* Reduced “alarm noise” by alerting only on atypical situations
* Leveraged filtered alerts to instantly trigger actions and workflows
* Fed processed stream data back into traditional systems to identify outliers, patterns, and affected KPIs


Transcript (edited for clarity):

So let me start by simply describing what iBasis is. iBasis is about a twenty-year-old company that started in the garage of one of the founders to build a voice-over-IP network across the globe. The goal was to enable our customers, major telcos and smaller telcos, to interconnect and communicate with each other more efficiently and more cost-effectively. Think about a very simple use case: you pick up your phone at home and try to call your friend in France. Your carrier, let’s say Verizon, will try to send the call to a company like iBasis. There are quite a few of those, and iBasis, having our algorithms to do least-cost routing, will help transfer the call efficiently, with good quality and low cost, to the other side in France.

So obviously any operator, be it AT&T or a smaller carrier, needs to spend a lot of effort to interconnect with all the different telecom carriers around the globe — there are about 2,000 telco carriers, so it’s a lot of work. A company like iBasis is in the business of helping carriers achieve more scale and lower cost, delivering value in the middle. We are a high-volume, low-margin business, so in our case cost, efficiency, and speed are key. If we are not efficient at identifying issues, we may lose a lot of money. Obviously, over the years our business grew beyond being a predominantly voice business. We have multiple product offerings: we look at fraud, we can identify fraud patterns, and we help our customers block those calls.

We do analytics around quality and margin distribution across our carriers. We are also in the business of mobile interconnection: think about the case where you are traveling internationally and your phone needs to stay connected to the network while roaming in another country. We provide a lot of those services as well. We are also in the business of SMS transfer. Basically, everything that is internationally bound is our business. A little bit of statistics: as I said, we are high volume, processing about 2.5 billion transactions a day — think SMS messages, voice calls, signaling, and so on. We keep all of that, about 200 terabytes, in an IBM Netezza data warehouse. Our systems process several thousand queries coming from multiple systems and sources, serving our internal users and our customers, and our goal is to keep the average query run time under five seconds. That means the majority of queries finish in significantly less time, one to two seconds, but there are significantly longer queries that may run for five to ten minutes; altogether it averages out to five seconds.

This is the use case from when we started our journey to see how we could deliver faster analysis to our business and to our customers. Looking at a couple of carriers, we saw that this was very well positioned for us to achieve fast delivery of the data with significantly lower time to build, as you will see throughout the presentation. Let me explain what GRX and LTE are. LTE is probably more familiar to you — it’s the 4G network. GRX is the GPRS Roaming Exchange that carries roaming traffic between operators. When you roam in another country — say you land at an airport in France — you won’t actually be prompted; your phone will try to connect to a local carrier, and that carrier will try to connect you with your home operator. If all is good and you have a roaming agreement, you will be authorized to use the network and roam in that country.

A lot of work is done by iBasis to integrate between all the carriers around the world. Today we have coverage in about 642 networks. So a company just needs to connect to iBasis, and the rest will be handled by us, delivering that smooth user experience in another country. Our goal is to ensure very consistent network performance in the other country and deliver a high-speed mobile data experience, which is very important for us. So what is the use case for monitoring the network in real time? We need to monitor our network 24/7, and we want to identify issues in our network, both performance- and cost-related, as soon as possible. As I said, it’s a high-volume, low-margin business; we need to be very efficient to deliver that. We could simply look at, as Steve pointed out before, every single event and immediately escalate it for analysis.

But that would waste a lot of our network team’s time, because they would get a lot of noise. So what we try to do is smarter: we look at the patterns in the traffic, compare those patterns with what the expected performance should be, and only if it’s a continuing problem that is not self-recoverable — as most of them are — do we escalate it and present it to network operational control for further analysis. To take a simple example: if I get six issues within the last 30 minutes, then it’s time to alarm. I don’t want to alarm on something that maybe just disappeared after the first event. So the power of Striim is exactly that it addresses not just the analysis of the data as it comes in, but also looks at it in a time window.
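The “N issues within a window, then alarm” rule described above can be sketched in plain Python. (iBasis expresses this in Striim TQL; this deque-based version is only an illustration, and the parameter names and defaults are made up.)

```python
from collections import deque

def make_alarm_checker(threshold=6, window_secs=30 * 60):
    """Return a function that records event timestamps and reports
    whether `threshold` events occurred within the last `window_secs`."""
    events = deque()

    def record(ts):
        events.append(ts)
        # Drop events that have slid out of the time window.
        while events and ts - events[0] > window_secs:
            events.popleft()
        return len(events) >= threshold

    return record
```

A single early event that “just disappears” never reaches the threshold, so no alarm fires — only a sustained cluster of issues inside the window does.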

This is called a sliding window or a jumping window, used to identify the pattern, and only when everything that should happen has happened do we alarm. It really works beautifully in this direction. Together with that, we also need to make the data available right there, so our network team can immediately look at the alarm together with the reasons for it. What are the trends? How has the data actually changed? What are the signals that caused it? We wanted to be really efficient. As any of the developers here has probably experienced, today’s businesses require fast changes. Agile is a buzzword, but in reality everyone wants changes done very fast, especially in a new product like this one. We know that whatever we build initially will change significantly in the next, let’s say, half year to year.

So we need a solution that is easy to adapt. How would you have done it before? The majority of you are probably familiar with the old school of doing ETL. You would get the data and process it through your systems. You would probably need to land it somewhere for further analysis, so you store it in a database. Then, because we compare against a baseline, we need to pull aggregated-level data from history — say, the last 30 days — identify the trends, and store them for comparison later on. Then comes the most significant part of the program: create a process that takes, on one side, the data representing current performance, takes the threshold baselines we defined in step two, compares and correlates them, and alarms when certain events happen. Of course, we also need to store this data so it’s accessible to other systems for monitoring.
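Step two above — pulling aggregated history to form a baseline — might look like this in Python. (This is a sketch, not iBasis’s actual code; averaging per hour-of-day is an assumed baselining scheme.)

```python
from collections import defaultdict

def hourly_baseline(history):
    """history: iterable of (hour_of_day, value) pairs from, say, the
    last 30 days. Returns {hour_of_day: mean value} so current traffic
    can be compared against the typical level for that hour."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for hour, value in history:
        sums[hour] += value
        counts[hour] += 1
    return {h: sums[h] / counts[h] for h in sums}
```

The resulting map is the “threshold baseline” the comparison process reads in the next step.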

The last step is to build a dashboard that accesses the data generated in step three and lets you drill in for more information, along with trend analysis. Not a big deal — we all do it — but there is a simpler way. All you want to do is acquire a new data set: simply point to the file location and let the data stream. Once the data is streaming, you keep it in memory, readily available for processing later on. The second step is to generate your baseline based on the previous period of time and also put it in what is considered a pipe. So you have two pipes running concurrently: one with the current data, one with the threshold data. All you do now, in step three, is compare them over a window.

So you’re asking: over the last hour of processing my current data, and given my baseline, what are the deviations from the norm? You identify those deviations — in my case, let’s say six deviations from the norm in the last hour. The tool identifies the deviations over that period of time, and once the rules match, based on our algorithm, we create an alarm. Very simple, very logical in a normal human sense, and it’s easily done in the tool. So the key concepts here are streams — the data is flowing in — and windows — the data is captured over a certain period of time. Once you have those two key pieces, you can do a lot with the data, and the beauty of it is that you can do it in something like SQL. Another nice part of the functionality is that you can build dashboards, reports, and other data analysis right on the same data as it is streaming.
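The comparison of the two pipes — current windowed data against the cached baseline — boils down to a deviation check plus a count. A rough Python sketch of that logic (the 20% tolerance is an assumption for illustration; only the six-deviations threshold comes from the talk):

```python
ALARM_THRESHOLD = 6   # six deviations in the window => alarm (per the talk)
TOLERANCE = 0.2       # assumed: flag values more than 20% below baseline

def count_deviations(window_values, baseline, tolerance=TOLERANCE):
    """Count values in the current window that fall more than
    `tolerance` (as a fraction) below the baseline."""
    return sum(1 for v in window_values if v < baseline * (1 - tolerance))

def should_alarm(window_values, baseline):
    """Alarm once the deviation count reaches the threshold."""
    return count_deviations(window_values, baseline) >= ALARM_THRESHOLD
```

In Striim this comparison runs continuously as the window slides; here it is shown as a one-shot check over a batch of window values.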

So with your pipe, you can analyze the data as it comes available; I’ll show you in the next slide how it’s done. Now, why did we decide to embark on this solution? We have a pretty advanced approach to analytics: we have a Netezza appliance and other tools that help us analyze data fairly efficiently, so we can get data on our network from about an hour, hour and a half before. But that’s a lot of time. We were fine with this situation for six or seven years, but in the last couple of years we quickly realized that insight into what’s current — what happened in the last five minutes, the last hour — is very important, because in that last hour we can lose a lot of money. So we updated our roadmap and built infrastructure that basically splits our analytics into two: online is the last two hours; offline is everything before that.

Of course there is a lot of value in offline, but it covers things you can no longer affect. What you really can affect is the data from just now, the last five or ten minutes, because that is still in operation and you can make changes. So yes, maybe you lose something in those five minutes, but you can act immediately in the next five and significantly boost margin, quality, and quality perception. And for the business it’s very simple: if a customer sees you are not operating well, they will switch to whoever can win the business. So as much as it may be considered minor, this first piece is very important for us. We looked for technologies that are fairly advanced and convenient to use, while also being simple and effective. The approach we took: we decided to put something in place that processes the data as soon as it becomes available, and for operational needs drives decision making or notifications to our network operations team. Then we simply take that data forward, move it to the data warehouse, do further enrichment and processing, and it becomes available through our normal process. So the position of Striim is right at the top.

How do we see the flow of this? We acquire the data from the network and immediately load it into the Striim production cluster. The Striim cluster helps us identify the alarms. Once an alarm has been identified, we send it to network operational control. At the same time, the network specialists can look at the data as it becomes available — you will see examples later on — so they can start analysis right away, then go to more detailed data and make changes.

Then, once we address the issue — because that is the top priority for us — we can push the data further through the enrichment process to the data warehouse, and it goes through the normal process to build reports, do BI and analytics, and so on.

So the key initial capabilities we built on are, I believe, these seven. The first point is that your data is loaded directly into Striim, so there is no overhead of storage time and processing time; the data becomes immediately available, and you will see in the next slide how easy it is to get a new data set loaded. The data can be kept in memory, or it can be loaded into Elasticsearch or a database, either at the first stage or at the last stage. It’s a really good CEP (complex event processing) alternative. We looked at a number of CEP alternatives nowadays; they mostly require very complex development in Java or other technologies, and today my team is not a Java team. In business intelligence we have very good skills — people familiar with Business Objects, Oracle Data Integrator, Tableau, and others — but there is no Java skill set.

We didn’t want to go down that path. We wanted a solution that is very intuitive and very simple for us to use, leveraging the knowledge and business practices we have today. That’s why it became really interesting for us: to have data intake, processing of the data, and analytics on top of it all in one place. It’s a really compelling advantage, because everything is done at once — you save the time and effort of building three different components. The loading part, the processing part, and the presentation part are all done at once. Like I mentioned before, you can do the development in the user interface, or you can do it in the TQL language — you will see it in a few slides. It’s like SQL, so for people with experience in SQL…

…it’s a no-brainer. You can really build applications in literally hours or days, depending on complexity, and the beauty of it is that you can expand them easily. The solution I’m talking about here is about 400 lines of code. It’s very important, of course, to be able to extract the data to external systems. Striim provides a RESTful API, which allows us to extract the data to other systems and subscribe to the services. Striim also provides clustering for scaling up: it’s a cluster, so you can add two servers or expand it further. So it’s pretty robust and nice. And here is something not many tools have today: the ability to simply put an agent on your source of the data, whatever it may be, and seamlessly start pulling the data into the Striim cluster.

I’ll show you a quick example of the tool. Okay, I can zoom in. What you see here are the results in the Striim dashboard. At the top you see a bunch of alarms that have been generated in the last couple of hours. What you see clearly is that once I select one of those alarms, you have a baseline in green, and from this point on we identify an issue — something is not performing well. In this case, a number of successful transactions are failing. Here is the expected baseline; here is where we are. After those six events the alarm is triggered and added to the screen, and trend analysis is available right underneath. So it’s very easy and efficient for the network operations team to investigate the alarm: they can click right there and go to other systems that have more detail, then find an operational solution to the issue.

From the moment we identify the problem to the moment of resolution may take five minutes, because these are high-priority alarms, people are alerted, and action can be taken very easily and efficiently. Another example within the same tool: we look not just at success ratios but also at volume degradation. If we suddenly see that we don’t have enough traffic compared to what we expect over a period of time — the line drops below expectation — we’ll also alarm. So within the same process we have multiple ways to alarm and trigger on certain events.
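The volume-degradation check is even simpler than the success-ratio one: compare observed volume with the expectation for that interval. A hypothetical Python version (the 50% floor is an assumption, not a value from the talk):

```python
def volume_degraded(actual, expected, floor=0.5):
    """Flag volume degradation: actual traffic fell below `floor`
    (as a fraction) of the expected volume for the interval."""
    if expected <= 0:
        return False  # no expectation to compare against
    return actual / expected < floor
```

Run per interval against the baseline, this gives the “line dropped below expectation” trigger alongside the success-ratio alarms.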

Now to the code — maybe the more complex part, but what’s convenient here is really its simplicity. You can see that the first part is just defining the types: a complex type with multiple KPIs tied to a certain mobile-network-to-mobile-network connection. We then define the stream; a stream is basically a pipe that data flows into. Right after that, you can see we simply connect to the file location — a folder with a bunch of files, matched with a wildcard — define certain attributes of the files, and start streaming the data into the stream we just defined. Very simple. Think of it as data available in a table, but the table is constantly changing because the data keeps changing as time goes on.
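In Python terms, the type-plus-file-source pattern described here might look like the following sketch. (The field names and CSV layout are invented for illustration; Striim’s actual TQL source declarations differ.)

```python
import csv
import glob
from dataclasses import dataclass
from typing import Iterator

@dataclass
class KpiEvent:
    """Hypothetical record type, analogous to a TQL CREATE TYPE."""
    src_network: str
    dst_network: str
    attempts: int
    successes: int

def stream_from_files(pattern: str) -> Iterator[KpiEvent]:
    """File source: read every CSV matching the wildcard and emit
    typed events, like pointing a Striim source at a folder."""
    for path in sorted(glob.glob(pattern)):
        with open(path, newline="") as f:
            for row in csv.reader(f):
                yield KpiEvent(row[0], row[1], int(row[2]), int(row[3]))
```

The generator plays the role of the stream: downstream code consumes events as they arrive instead of querying a table.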

Another example: we can pull data into a cache. On one side we stream the log data; on the other side, we need to pull data from our tables so we can do some aggregates and create the baseline. In this case we also create a type for the data — a bunch of KPIs for certain dimensions — and for this data we define a connection to the Oracle database and get the KPIs that we need. So what we have created is two pipes. One pipe gets the data from the network — the current performance statistics. The second pipe, the cache, holds the data it should be measured against. Any deviation of the current performance from the baseline will be identified as a violation and will eventually trigger an alarm. Comments? Questions?
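The cache holding baseline KPIs keyed by carrier pair could be sketched like this (using SQLite in place of Oracle; the table and column names are made up):

```python
import sqlite3

def load_baseline_cache(conn):
    """Build an in-memory cache keyed by (src, dst) with the baseline
    KPI pulled from a table — standing in for Striim's CACHE backed
    by the Oracle database."""
    cache = {}
    for src, dst, kpi in conn.execute(
        "SELECT src_network, dst_network, baseline_kpi FROM kpi_baseline"
    ):
        cache[(src, dst)] = kpi
    return cache
```

Each event from the current-performance pipe then does a dictionary lookup by its carrier pair to find the value it should be measured against.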

All right. The next one is a bit more complex; I’ll try to simply skim through it.

Everything again starts with types — as you probably know, nothing is simple in terms of data management. So you define a certain type; in this case it’s again a data stream from a certain carrier to a certain carrier, with some KPIs. You aggregate the data on an hourly interval, so this is your window of an hour: the data coming in is stored for an hour and is analyzed and compared with the baseline; then the one-hour window moves on, and if there were no interesting events, the data is just ignored. In this case you do a simple insert statement from one stream to another, and you do some enrichment. For example, you define a success ratio: how many successful transactions compared to the total number of transactions? Fairly simple. This application by itself is not rocket science — there are no advanced statistics — but it’s a nice and simple way to automate what a person, or some other more complex program, would normally do. We identify the number of events in a certain moving window, and as soon as a certain number of events have happened, we alarm on that.
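The hourly jumping window with a success-ratio enrichment can be mimicked in Python like so (a simplification of what the TQL does; the event shape and bucket size are assumptions):

```python
from collections import defaultdict

def hourly_success_ratio(events):
    """events: iterable of (timestamp_secs, ok_flag). Groups events
    into one-hour jumping (tumbling) windows and returns
    {hour_bucket: success_ratio} for each window."""
    ok = defaultdict(int)
    total = defaultdict(int)
    for ts, success in events:
        bucket = int(ts // 3600)  # which one-hour window the event falls in
        total[bucket] += 1
        if success:
            ok[bucket] += 1
    return {b: ok[b] / total[b] for b in total}
```

Each window’s ratio is then compared against the baseline; windows with nothing interesting are simply dropped as the window moves on.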

Further down we also do some additional analysis and aggregation, trying to identify which of the events we just saw are interesting for us. An interesting event is kind of an outlier — something out of the norm — and out of the norm means violating the thresholds we set. If an event violates our thresholds, we put it into a separate table for further processing. So it’s a step-by-step analysis of the data as it comes through. In the last step, as we identify those alarms, we put them into a separate stream. This stream, like any other stream, is available for reporting. First we analyze the stream, and as soon as six alarms (in this case) happen within a one-hour interval, we raise an alarm — the alarm is just a few lines of commands to generate and send to our operational systems. The same data can now be accessed directly from memory in the dashboard. This is a very nice and easy way to approach the data: you don’t need to store it, and you don’t need to move it around — the data is generated, immediately available, and refreshed on the screen. Think about a user experience where you have a big screen and suddenly a new line drops in: you’ve been alerted, and you can now take action and solve it.

Just a few slides — actually one — about the UI part. There is, of course, a way to do this through the visual interface, where you can build those nice, complex or simple workflows and flows. That’s one way to do it. I personally have a lot of technical skills and prefer to write TQL, but it’s up to the person working with it — you can do it one way or the other. This one was done in TQL, and it is still transformed by Striim into nicely presentable views; if you follow them, you can get a sense of what the application is actually doing. Okay.