
Internet privacy and security beyond 2016                           



We feel that we need ‘security’ in our networks, especially on the internet. But what do we mean by ‘security’? Do we have clarity about what we want? Several concepts live in the security arena but are not equivalent: reliability vs. trust vs. safety, identity vs. anonymity vs. privacy.

By increasing the scale, reach and speed of our networks we are redefining what we used to call ‘communication’. Let us explore the paradoxes that the internet is introducing in human communication. Let us take a look at what we can and cannot expect today from the internet, and let us look beyond 2016 at the possible evolution of human communication through the internet.

Modern day trade-offs

In today’s news you can easily find ‘Famous Government’ vs. ‘Famous Company’ in a clear case of privacy vs. security. You can recall a recent breach that exposed the data of customers of a business in which privacy was key, a case of privacy vs. morality. In recent years we have also seen information leaks suggesting that governments perform massive-scale interception of citizens’ communications, so each leak is a case of the government’s right to keep secrets vs. the citizen’s right to keep secrets.

The Room

When thinking about communication in The Internet I propose to apply a simple paradigm: imagine that all the actors in communication are people in the same closed room and they have no other device than their voice and the ability to walk around. Anyone can speak aloud and publicly for all the room or N people can join together and ‘try’ to talk in short range volume. I propose this simplified model to analyse all possible cases of privacy, anonymity, and trust in today’s internet. Due to the current capabilities of internet HW and SW and due to its wide reach there are many similarities in terms of privacy and trust between The Room and The Internet.

Reliability and Privacy

We want the network to deliver the message to the intended recipient and no one else, but worldwide networks like the internet cannot guarantee that a single message will ever reach the intended recipient. This statement has a meaning beyond technological prowess. High quality fibre links have virtually no losses, and connection-oriented protocols like TCP are supposed to guarantee that every message makes it through the network, but routers, fibre links and TCP connections can be spied on and attacked by anyone with physical access to them (and on the internet, anyone can spy).

VPNs and shared secrets

VPNs (Virtual Private Networks) are subnetworks created ‘on top of’ or ‘through’ the internet. They are a kind of private tunnel through a public space. Access to the VPN is restricted to N endpoints that share some common property. VPN technology adds reliability by making it more difficult to break or spy on the conversation. VPNs fight impersonation, tampering, injection or deletion of messages. VPNs rely on encryption (ciphering messages), but encryption is not completely safe under attack. There are many methods to cipher data today. All of these methods rely on pieces of information or ‘keys’ that must be known only by the two entities that communicate. The message plaintext is combined with the key using some transformation that is difficult to invert without knowing the key. It is difficult but not impossible. Inverting the transformation must be straightforward if you know the ciphered message and the key. The most obvious cipher is based on a symmetric key: the same key is used to cipher and to decipher. Given key + plaintext the direct transformation is easy to do and it renders a ciphered message. Given the ciphered message and the key the inverse transformation is easy to do and it renders the plaintext. The transformation selected must be very difficult to invert when you have the ciphered message but not the key. Symmetric key cryptography requires that both sender and receiver have the key, so there is a ‘key distribution problem’.

Statistical attack

As long as the same key is applied to a growing amount of information (many messages) the channel between sender and receiver becomes more and more vulnerable to statistical attack. In everyday life keys are short pieces of information compared to the amount of information that these keys encrypt. As Claude Shannon demonstrated, the only possible escape from statistical attack is to use a truly random key that is at least as long as the message it ciphers and that is never reused. Shannon’s demonstration applies to the encryption technique known as ‘one time pad’. Sender and receiver have a shared secret (key) as long as the message. Once the key is used to cipher a message, the key is discarded (hence ‘one time’). To send a new message the sender must use a new key.
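The one time pad described above can be sketched in a few lines of Python. This is a minimal illustration, assuming the pad has already been shared by both ends; the function name `otp_xor` and the sample message are hypothetical, and XOR is used here as the usual textbook way of combining message and pad.

```python
import secrets


def otp_xor(data: bytes, pad: bytes) -> bytes:
    """XOR every data byte with one pad byte (the same call encrypts and decrypts)."""
    if len(pad) < len(data):
        # A pad shorter than the message would force key reuse.
        raise ValueError("pad is shorter than the message")
    return bytes(d ^ k for d, k in zip(data, pad))


message = b"meet me at noon"
pad = secrets.token_bytes(len(message))  # fresh random key, as long as the message
ciphertext = otp_xor(message, pad)       # sender side
recovered = otp_xor(ciphertext, pad)     # receiver side, using the same pad
# After this exchange both ends must discard `pad` and never use it again.
```

The sketch makes the cost visible: every byte sent consumes one byte of shared secret, which is why the size of the secret reservoir matters so much in the rest of this article.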

Everything is a VPN

Beyond TCP we could use any imaginable protocol to make our channel through the internet more resistant to losses and less vulnerable to attackers and spies, but from a logical point of view any imaginable ‘secure’ channel built through the internet is completely equivalent to a VPN and always relies on some kind of encryption, so it shares the vulnerability of known VPNs to statistical attack.

A VPN is only as robust as its encryption technique. Establishment of a VPN link is based on the existence of secrets shared by both ends. State of the art VPNs use short keys that are static. How do we share these secrets? If we use the internet we can be spied on and the keys can be discovered.

Public keys

A proposed solution to key distribution is public key cryptography. This is the solution adopted in certificates and state of the art VPNs. I want to share a secret (key) with many people. I divide the key in two parts. I distribute part 1 widely (public) and keep part 2 secret (private). Anyone having part 1 can use it to cipher a message and send it to me. I can use part 2 to decipher what was ciphered using part 1, but no one having only part 1 can decipher it. If I want to reply to a message I need to know part 1 of the receiver’s key, the receiver’s ‘public’ key, and the receiver will use part 2, the ‘private’ key, to decipher. This is not really ‘sharing a secret’, as public keys are no secret (everyone knows them) and private keys are never shared. The relation between public key and private key is what mimics sharing secrets. It mimics sharing because it exports some information about part 2, the private key, without exporting the whole key. The methods used to divide a key into public + private parts are difficult to invert when you only have the public key and the message but do not have the private key. However, inversion is not impossible; it is only computationally difficult.
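To make the ‘part 1 / part 2’ idea concrete, here is a toy sketch using textbook RSA. RSA is my choice of illustration, not something the text above names, and the tiny primes are an assumption made purely for readability: real keys use primes hundreds of digits long plus padding schemes that are omitted here.

```python
# Toy numbers only: do not use anything like this for real traffic.
p, q = 61, 53
n = p * q                    # 3233, the modulus, included in both parts
phi = (p - 1) * (q - 1)      # 3120, must stay secret
e = 17                       # public exponent ("part 1"), coprime with phi
d = pow(e, -1, phi)          # private exponent ("part 2"); needs Python 3.8+


def encrypt(m: int) -> int:
    """Anyone holding the public part (n, e) can do this."""
    return pow(m, e, n)


def decrypt(c: int) -> int:
    """Only the holder of the private exponent d can do this."""
    return pow(c, d, n)


message = 65                 # a message encoded as a number smaller than n
ciphertext = encrypt(message)
recovered = decrypt(ciphertext)
```

The ‘computationally difficult, not impossible’ point shows up directly: anyone who factors `n` back into `p` and `q` can recompute `d`. With n = 3233 that takes an instant; with realistic key sizes it is believed to be impractical on today's computers.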

Out Of Band secret sharing

An alternative approach to public keys is a key distribution method based on ‘out of band secrets’. ‘Out of band’ means that we need to share a secret (the key) with the other end by means of some channel other than the internet. Two people in The Room can communicate aloud in front of everyone else with perfect secrecy as long as they have shared enough secrets outside The Room.

Grid Cards

As you can verify, people who need privacy over internet channels have put in place VPN-like mechanisms that rely on out of band (OOB) secrets. Banks provide ‘pads’ (grid cards) to their customers: cards with an indexed sequence of characters. With each new transaction the bank provides an index and requests the customer to provide the indexed character. This mechanism uses the OOB secret to authenticate the user before establishing the channel. A grid card cannot hold many different keys, so the reservoir of keys is too small to implement an OTP (one time pad).
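The grid card exchange can be sketched as a small challenge-response loop. The names `make_grid_card` and `Bank` are hypothetical, and the sketch assumes a stricter policy than real cards usually follow: each index is challenged only once, which makes it obvious how quickly such a small reservoir runs out.

```python
import secrets
import string

ALPHABET = string.ascii_uppercase + string.digits


def make_grid_card(size: int = 64) -> list[str]:
    """Create the card that is handed to the customer at the bank's desk (OOB)."""
    return [secrets.choice(ALPHABET) for _ in range(size)]


class Bank:
    """Server side: keeps a copy of the card and never reuses an index."""

    def __init__(self, card: list[str]) -> None:
        self._card = card
        self._used: set[int] = set()

    def challenge(self) -> int:
        unused = [i for i in range(len(self._card)) if i not in self._used]
        if not unused:
            raise RuntimeError("card exhausted: issue a new card out of band")
        index = secrets.choice(unused)
        self._used.add(index)
        return index

    def verify(self, index: int, answer: str) -> bool:
        # Constant-time comparison of the expected character and the answer.
        return secrets.compare_digest(self._card[index], answer)


card = make_grid_card()
bank = Bank(card)
index = bank.challenge()            # bank sends the index over the internet
accepted = bank.verify(index, card[index])  # customer reads the card and answers
```

With 64 positions and one character per transaction the card is spent after 64 logins, which is the practical reason real deployments reuse positions and therefore fall short of a one time pad.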


Some companies provide e-Tokens. Every e-Token is a portable device with a reasonably accurate local clock that is synced once to a central server at ‘device-to-user-linking-time’. Every e-Token generates a short pseudorandom sequence (PIN) based on the current time and a seed. Every e-Token uses the same PRNG algorithm to generate the PIN. This mechanism ensures that we can ‘share’ a secret (PIN) OOB (not using the internet) between all tokens and the server. The server ‘knows’ how to generate the PIN based on the current time slot and it has the same seed, so it can check if a PIN is good at any given time. When a user needs to start a secure VPN to the server the user can add the PIN to his identity to qualify as a member of the server + e-Tokens closed group (a kind of ‘family’ of users). This authentication mechanism is called 2-factor authentication (password + PIN) or multi-factor authentication.

This mechanism works as long as the PRNG algorithm remains unknown and the timestamp of the e-Token cannot be recreated by an attacker. The PIN is only valid and unique inside the current time slot; usually the server allows slots to be 10 s to 30 s long. Quartz IC clocks in e-Tokens have considerable drift and cannot be reset by the user, so if there is no resync at the server side for that user account (and there usually isn’t), after some time the PIN authentication will fail. To overcome this limitation a better (more expensive) quartz clock can be used, or the server may try to adjust to the drift of each specific user by maintaining a drift number per user and adjusting it with each new PIN authentication request. As you can see, it suffices to reveal the PRNG method and seed to compromise the whole network, as it is not really difficult to recreate a valid timestamp to feed the PRNG inside a 30 s slot.
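A time-slot PIN of this kind can be sketched as follows. This is not any vendor's algorithm: as an assumption, HMAC-SHA256 keyed with the seed stands in for the unnamed PRNG, and the names `pin_for_slot`, `token_pin` and `server_check` are hypothetical. The `drift_slots` argument models the per-user drift number mentioned above.

```python
import hashlib
import hmac
import struct

SLOT_SECONDS = 30  # assumed slot length, within the 10 s to 30 s range above


def pin_for_slot(seed: bytes, slot: int, digits: int = 6) -> str:
    """Derive a short decimal PIN from the seed and a time-slot number."""
    mac = hmac.new(seed, struct.pack(">Q", slot), hashlib.sha256).digest()
    value = int.from_bytes(mac[:4], "big") % (10 ** digits)
    return f"{value:0{digits}d}"


def token_pin(seed: bytes, token_clock: float) -> str:
    """What the e-Token displays, based on its own (drifting) clock."""
    return pin_for_slot(seed, int(token_clock // SLOT_SECONDS))


def server_check(seed: bytes, pin: str, server_clock: float,
                 drift_slots: int = 0, window: int = 1):
    """Return the slot offset at which the PIN matches, or None if it does not.

    The returned offset can be added to the stored per-user drift number.
    """
    base = int(server_clock // SLOT_SECONDS) + drift_slots
    for offset in sorted(range(-window, window + 1), key=abs):
        if hmac.compare_digest(pin_for_slot(seed, base + offset), pin):
            return offset
    return None
```

Everything the server needs to validate a PIN is the seed, the method and a clock. That is exactly why revealing the method and seed compromises the whole scheme: an attacker with both can call `pin_for_slot` for any slot.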

Connected e-Token

A refinement of the e-Token is the ‘connected e-Token’. This is a (more expensive) portable device with a clock, a PRNG, memory, a CPU with crypto functions and a communication link. The physical format may be a smart card or it can even be an app living in a smart phone. The connection to the server solves the drift problem, and that is all the merit of the device. Crypto functions are used to implement ciphered protocols that handle the synchronization. These crypto functions will normally use a symmetric cipher applied to the PIN extracted from the PRNG. As you can see, the connected device does not protect the ‘family’ (the set of users that share the server) against any attack that reveals the PRNG method. An interesting property of some connected e-Tokens is that they can be used to generate PINs in sequence, one per time slot, and provide them over a USB link to a host, which will use them to cipher a sequence of transactions (which is faster than entering the PINs by hand). The connected e-Token adds a weakness not present in the e-Token: synchronization takes place in-band, so it can be attacked statistically. Now there are two ways to attack the connected e-Token: 1) discover the PRNG method, 2) spy on the synchronization messages. By means of 2) an attacker can solve 1).

Secure transaction vs secure channel

As you can see, bank grid cards and e-Tokens just protect the start of a session. They protect a single transaction. The rest of the session in the VPN is protected by a static key. No matter how securely this key is stored, the key is short compared to the message. Connected e-Tokens may protect a small number of transactions per minute. Token-to-server latency limits the minimum size of the time slot in which a PIN is unique, so forget about apps that exchange more than 2 to 6 short messages per minute. On the internet, physical access to the links cannot be avoided. This means that all the messages can be captured and analysed statistically. The current usage of bank pads and e-Tokens provides just an illusion of privacy to users. The best we can say about grid cards and e-Tokens is that the less they are used the more secure they are against statistical attacks. But hey, the most secure transaction is the one that never happened, so did we need to buy an OOB device to re-discover that? Definitely these devices will not work for people who want to ‘talk a lot’ privately through the internet.

Identity and perfect Trust

We want to ensure that our message reaches the intended recipient and no other, but at the same time we know that there are many entities on the internet with the capacity and the motivation to detect and intercept our messages. (Remember The Room.) Again, the only perfect method to determine identity based on messages received over the internet is ‘shared secrets’. We need to ask the other end for some information that only this other end can know. As we have discussed above, OOB secret sharing is the only method that can guarantee perfect secrecy. Authentication (determination of identity) today can be done with perfect reliability as long as we have an OOB channel available (for instance we can walk to our bank’s desk to get a grid card or we can walk to our IT help desk to get an e-Token). Authentication is easily protected by a small shared secret because it is a ‘small and infrequent transaction’. It carries little information and we do not do it 10^6 times a second, so it may be enough to replenish our reservoir of shared secrets once a month, or once a year. The problem that comes with current implementations of perfect authentication via OOB shared secrets is that this method is ‘only’ used to secure ‘the start’ of a connection (a VPN or a TLS session), and it is never implemented as an OTP, because keys are reused. Grid cards reuse keys because the card does not hold many keys. e-Tokens have a much wider key space, so they reuse less, but knowing the seed and the method you could reproduce the actual key at any moment, so the ‘real’ key is just the seed, and that seed is reused in every transaction. To simplify, let us assume that we used an OOB secret reasonably well to protect the start of the conversation, so we ‘started’ talking to the right person. After the start an attacker may eventually break our VPN by statistical attack, and then he can read, eliminate or inject messages. The right solution would be to apply OOB authentication to every message.
Clearly the grid card, the e-Token and the connected e-Token do not work for this purpose. Can you imagine hand-entering a 6 digit PIN for every 6 chars that you transmit? Can you imagine chatting with the other end at a pace of 6 chars every 30 s? It does not look very attractive.

Can we have perfect Trust? Trust usually means we can be assured that the message is not modified and that the identity of our partner is known. We cannot protect any internet channel of a decent bitrate using the OOB secret sharing technology available today. So no, in general, we cannot have perfect Trust. For a reduced number of transactions or for a low bitrate we can use one time pads. Two partners can agree to meet physically (OOB) once a year and share, let’s say, 1 TB, 1 PB, or whatever amount of secret data they require on a physical device (HD/flash memory), and then they will consume the secret over the next year with perfect secrecy. OK, that works. But as noted it is a tedious technique and it has not been implemented mainstream.


Anonymity in communications may have two meanings: 1) I communicate to N receivers, none of the N can know my identity, and interceptors cannot know my identity; 2) I communicate to N receivers, all of them know my identity, and interceptors cannot know my identity. Given that at any time during my communication I can explicitly reveal my identity, the important difference between 1) and 2) is that 1) presents a mechanism in which a receiver must accept a message from an unidentified sender (as in telephony), while in 2) there cannot exist unidentified senders but there are identity hiding techniques aimed at interceptors. The internet today is in case 2). It is not possible to hide the origin of a message. It can be tracked, always. There are mechanisms to obfuscate the identity of the sender (the Tor network), but these methods only make the task difficult, and this difficulty can be overcome using a decent amount of computational power.

Do we really want anonymity? Anti-tampering

In the phone world there is no real anonymity, as any call can be tracked by the carriers if there is motivation (a court order, for example). But outside those extreme cases, it is possible and really annoying to receive calls from unidentified callers. Many people have developed the habit of not taking unidentified calls, which is a very reasonable choice. On the internet it is not really possible to hide the sender address. Yes, there are people with the capability to do ‘spoofing’: tampering with the lower level headers and faking the sender address in a message. This spoofing technique looks scary at first sight, but then you must remember that the address of the sender, like any other flag, header or bit of information in a message, is unprotected on the internet and can be tampered with. That tampering capability means that the message can be modified, faked or even destroyed, but it does not mean that the message can be understood by the attacker. As long as the attacker does not understand the message semantics, it is easy for the two ends that communicate to devise mechanisms that alert them to tampering: checksums, timestamps, sequenced messages, identity challenges, and many others. These mechanisms will use OOB information so they cannot be attacked statistically. So, no, we do not want or need anonymity, and we are not afraid of message tampering as long as we have enough OOB secrets and we know how to build an anti-tampering protocol based on them.
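An anti-tampering protocol of the kind listed above (keyed checksum, timestamp, sequence number) can be sketched as follows. The packet layout, the names `seal` and `open_sealed`, and the use of HMAC-SHA256 as the checksum are my assumptions. To stay close to the OOB idea, each message consumes a fresh 32-byte slice of the shared OOB secret as its key, so no key is ever reused.

```python
import hashlib
import hmac
import struct

KEY_BYTES = 32   # slice of OOB secret consumed per message
HEADER = ">QQ"   # sequence number and timestamp, 8 bytes each


def key_for(oob_secret: bytes, seq: int) -> bytes:
    """Pick the slice of the OOB secret reserved for message number `seq`."""
    key = oob_secret[seq * KEY_BYTES:(seq + 1) * KEY_BYTES]
    if len(key) < KEY_BYTES:
        raise ValueError("OOB secret exhausted: replenish it out of band")
    return key


def seal(oob_secret: bytes, seq: int, timestamp: int, payload: bytes) -> bytes:
    header = struct.pack(HEADER, seq, timestamp)
    tag = hmac.new(key_for(oob_secret, seq), header + payload, hashlib.sha256).digest()
    return header + payload + tag


def open_sealed(oob_secret: bytes, expected_seq: int, now: int,
                packet: bytes, max_age: int = 60) -> bytes:
    """Return the payload, or raise ValueError if anything looks tampered with."""
    if len(packet) < 16 + 32:
        raise ValueError("packet too short")
    header, payload, tag = packet[:16], packet[16:-32], packet[-32:]
    seq, timestamp = struct.unpack(HEADER, header)
    if seq != expected_seq:
        raise ValueError("unexpected sequence number (replay, loss or injection)")
    expected = hmac.new(key_for(oob_secret, seq), header + payload, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        raise ValueError("checksum mismatch: message was modified")
    if abs(now - timestamp) > max_age:
        raise ValueError("stale timestamp")
    return payload
```

Note that this sketch only detects tampering; it does not hide the payload. Hiding it would additionally consume pad bytes, as in the one time pad technique.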

Current levels of Trust

It is interesting to note that the internet that we have in 2016 does not have what we demand from it in terms of security. As we have briefly reviewed, everyone needs a VPN to connect to every other person or service. This is not happening yet. Even in the case of a hypothetical VPN boom tomorrow morning, every commercial VPN is vulnerable to statistical attack, so we would just be reducing the set of attackers that can do harm to those with patience and big computers: governments? big corporations? organized crime? Can we really implement VPNs based on OTPs that in turn rely on OOB secrets? Well, we can do it on a one-by-one basis: if we meet someone in the real world and we have periodic access to this person in the real world, we can replenish our OOB secrets and conduct perfectly secret VPN traffic. But as you can easily see, we would not like to do that for every relationship that we have today through the internet, with everyone and with every service that we use. And by the way, current commercial VPNs do not implement proper OTP.

Devaluation of privacy, identity, responsibility and trust

So no, in internet we don’t trust. We can’t. Those with private info, important transactions, or a lot of responsibility know how to use OTP based on OOB secrets. Those who don’t, maybe you, are probably not aware of the solution or of the perils of not having a solution. The result is that people do not expose through the internet those bits of information that are really valuable to them, unless they have no other option. If you suspect that your bank’s grid card is not secure enough for your operations you have very little option beyond doing every transaction personally at your bank’s desk. To buy a book via the internet you are not going to worry. If you are the target of an online fraud you will take it as a risk of modern life. If someone impersonates you on LinkedIn or Facebook, things may get more serious. You may end up in court. Even in that case, what can you do? Are you going to ask LinkedIn or Facebook to implement OTPs? I don’t think so. How could they do it? Will they have a LinkedIn or Facebook desk in every small village of the world to share OOB secrets with billions of users? We are seeing increased usage of VPNs, especially for remote workers. We are also seeing increased usage of multi-factor authentication, naturally for VPNs and remote workers, but it is also becoming common for wide spectrum services like Facebook, Gmail, LinkedIn, and others. Trust is ‘forced’. We trust online retail platforms because we want to shop online. We cannot live without that. But we will not shop in the first online portal that we bump into. Prestige in the online world is more important than ever. Companies that have been in the market longer and have a track record of no or very few data leaks or compromised transactions will be the choice.

What to expect in the near future

Internet evolution is accelerating. Many companies that are in business today will not be in business in 5 years. Many companies that do not exist while I write this will become dominant in 5 to 10 years from now. In terms of security we cannot expect a breakthrough in such a short time. We may see some sophistication reaching the common internet user. We can expect to have free personal VPN services with state of the art security, which is not really 100% secure, but it is what ‘free’ will buy you in the short term. VPNs for businesses will grow in security. The best of them will opt for higher levels of encryption, maybe even OTP/OOB. Services that have a wide range of users will target multi-factor security for authentication and transactions. They will soon surpass the current level of security that we can find in banks.

Banks, they need to evolve.

Banks really do not seem to be taking the challenge very seriously. The technology that they use is far too old and insecure to be dealing with customers’ money. As the non-banking world evolves and provides electronic payment, we can assume that banks will improve their technology to attract customers. One of the first moves must be to provide VPNs to all their customers, together with better OOB: more complex secrets given to consumers at their desks. Grid cards are no good for protecting frequent transactions. As micropayments for online services become much more popular (they are already popular now), and thus much more frequent, grid cards need to be replaced by an OOB method with a much wider key space. I do not think e-Tokens are a good replacement. Much better would be a gigantic grid card implemented as an indexed ROM created per user. Let’s say 32-64 GByte of random bytes burned on an unalterable IC given to every customer. Add a display and a keyboard to enter the index and you are done. This kind of IC can be created today and is affordable to banks. The eGridCard must not have connectivity. Any connectivity will make it weaker, as the keys could be spied on over USB, wifi or any other kind of link.

Social and Retail

Multi-factor authentication will take over. Social networks do not have a great return on each individual member (a few cents/year due to ads), so they are unlikely to invest in HW-based OOB+OTP, but I can see cheaper multi-factor schemes coming: adding phone numbers to your identity (the phone network is a good separate OOB channel). I also see premium services coming from social networks. Paid premium services make it possible to provide OOB+OTP HW, as described for the case of banks. Online retail sites and online premium social networks can offer true privacy to their members via eGridCards, at least to protect the start of a session. To protect long messages we will need a better way to share a huge secret.

Professional big secret sharing

Corporations wanting to seriously protect a channel, not a transaction, will push the state of the art of VPNs. Combining VPN methods for reliable channels (sequencing, timestamping, identity challenges, checksums, multiple handshakes, and others) with OOB+OTP will make corporations much safer. This effort will require new HW and new SW. As opposed to protecting a single transaction, protecting a channel requires a continuous feed of secrets to the transmission device. This feed cannot be delegated to a human (as with an e-Token or grid card), but neither can we rely on an ‘open interface’ such as USB, Ethernet, radio or any other existing link. The solution that comes to mind is that the secret-holding HW must be coupled to the internet-connected device only briefly, while the channel is active, and the coupling must be a one-way interface that does not work from the internet side. This kind of HW interface is not available today (at least it is not mainstream), but there is no difficulty in building it.

Size of secrets

We can speculate that any ‘decent’ communication today is very likely to move from KBytes per minute to MBytes per minute. Non-media-intensive talks will be in the lower range, 1 kbps to 100 kbps, while state of the art media-mixed talks may be 100 kbps to 500 kbps, and some rich-media talks will reach the Mbps range (1 Mbps to 5 Mbps). This speculation applies to very general communication taking place in social media, micro transactions in banking and retail (small photographs included), in mobile terminals and desktop computers. In other more professional environments like VoIP and videoconferencing we may move up the Mbps scale. If we want to protect a channel of 1 Mbps that is active 8 h/day, 300 days/year, we need 10^6 bit/s × 28,800 s/day × 300 days = 8.64×10^12 bits (8.64 Tbits = 1.08 TBytes) of secret per year. It will be easy to build OOB shared secrets worth 1 TByte/year. A cheap HD will do.
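The sizing above is simple enough to check with a few lines of Python, which also makes it easy to try other bitrates. The function name `pad_size_per_year` is hypothetical; the only assumption is the one already made in the text, namely that a one time pad consumes one bit of secret per bit transmitted.

```python
def pad_size_per_year(bitrate_bps: int, hours_per_day: int, days_per_year: int):
    """Return (bits, bytes) of one time pad consumed by a channel in a year."""
    seconds = hours_per_day * 3600 * days_per_year
    bits = bitrate_bps * seconds
    return bits, bits // 8


# The 1 Mbps, 8 h/day, 300 days/year channel discussed above.
bits, nbytes = pad_size_per_year(1_000_000, 8, 300)
print(f"{bits:.3e} bits = {nbytes / 1e12:.2f} TBytes per year")
```

Scaling is linear, so a 5 Mbps rich-media channel with the same duty cycle would need five times as much secret, about 5.4 TBytes per year.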

Internet fabric

The internet is made of routers and links. We have said that every link and every router is accessible to eavesdroppers today, which is true, and you had better act as if you believe that statement. The internet is multitenant (many entities own the routers and the links), so we could reasonably guess that some portion of the internet could be hardened against eavesdroppers while remaining standards compliant in its connection to the rest of the internet. Yes, this can be done by replacing every router in that area with routing machines that cipher every single packet that goes through using OOB+OTP secrets. Ciphering works end to end within the area in which the secret is shared. As this area cannot be the whole internet, we can think of a new kind of router that admits periodic replacement of a HW storage module containing OOB secrets. All routers in the area will receive the module, let’s say, once a week or once a month. Modules will be written centrally at the network owner’s premises. Traffic that comes into that ‘spot’ of the internet will be ciphered via OOB+OTP, so only routers in that ‘spot’ will understand the low level packets. Egressing traffic will be just ‘normal’ traffic, as low level packets will be deciphered at the border. The ‘spot’ will be transparent to the rest of the internet, but now traffic cannot be spied on in that spot. This is a business advantage. If a customer’s traffic originates in that spot and terminates in that spot it will be much more secure, and the customer does not need to do anything special. This claim may attract final customers to a specific ISP, Telco or network service provider. This could be called an STN (Secure Transaction Network) by analogy with a CDN, which is a closed infrastructure. Today SDN stands for Software Defined Network. Interestingly, SW defined networking will make it much easier to build custom routers and thus STNs.
Imagine how easy it will be to build a ‘device’ out of a fast packet forwarding engine (HW based) plus SDN modules for OOB+OTP written in house to cipher every packet and support our proprietary storage module. I would move from my current ISP to another ISP that can swear (and demonstrate) that my traffic will ONLY go through this kind of router in my country. At least I could reach my bank and a good number of friends and family in a secure spot.


It is very unlikely that we will see a new standard appear that includes ciphering in the base internet protocols to transform all routers into secure routers. Even if we see that standard appear in the next few years (5 years), it will be based on classical cryptography, which is vulnerable to statistical attack. This is due to the impossibility of specifying OOB mechanisms in a standard, and to the fact that very few global coverage networks exist that are not accessible from the internet (OOB). The two most practical networks that can be used for OOB are: people carrying secrets in their pockets, and the phone network (voice, not data). The second network is much less reliable as an OOB channel than the first one. Even if an agreement is reached on an OOB method (impossible in my view), adoption through a significant part of the internet will take over 10 years, which will render the effort useless.


You have to do your part. If you want to have an increased level of privacy you cannot count on current privacy protection from internet links and/or routers, internet protocols, bank grid cards, e-Tokens, or VPNs. You cannot count on this situation improving to a practical level over the next 5 to 10 years. You can implement some sort of OOB + OTP today on your own. Just look for the pieces of technology out there to implement your secret sharing at the level that you require.

Modern Exploitation of Content Businesses


3D schema. Exploitation of Content. by Adolfo M. Rosas.

1 Introduction

Today some say ‘content is king’ … some others say ‘network is king’, or ‘user is king’, or ‘data is king’ or … whatever thing is king. There are so many ‘kings’ today in the markets and industry.  But you have to concede that content services flourish today, as of mid-2014, in the network. We have more online video than ever before, more shared media: photos, videos, messages (sometimes with embedded media), downloadable apps, streaming games…and more video streaming services than anyone could have imagined just a few years ago.

If we admit that the internet economy is going up, and at the same time we admit that content services are an important part of that economy, it would be reasonable to assume that people have discovered how to exploit content services. How to build services that are not only convenient for consumers: easy to use, appealing, affordable, modern, engaging ….but also convenient for the ones behind the service, convenient for exploitation: easy to run, understandable, reliable, profitable.

To a certain extent it is true that some people learnt to build those services and made them as good as is conceivable, but only to a certain extent. The fact is you can name a few (fewer than 5) content services that surpass all others, you can name even fewer leading search engines, you can name very few leading streaming services, you can name essentially a couple of leading movie services, very few first-line music services, virtually only one leading e-book service… The very existence of this clear category of ‘leading content services’ lets us know that not all the people building services knew how to make them as good as possible.

I want to devote this article to the art and science of making content services ‘exploitable’ in modern terms.


2 Service ideation

Many managers I’ve known just cannot imagine the ‘whole coherent idea’ of a new content service. They cannot create a mental representation of a coherent set of actions, definitions and processes that involves all actors (producer, aggregator, sellers, end-users,…), all infrastructure (production stages, repositories, distribution infrastructure, catalogues, front ends, payment systems,…), and all processes (content ingestion, metadata generation, catalogue and frontend provisioning, transaction processing, billing, analytics,…).

Content business managers do ‘not always’ understand how entangled the ‘service idea’ or ‘service model’ is with technology. Some of these managers will let ‘technical people’ alone build the manager’s ‘current view’ of the service. Usually in these cases that ‘view’ is a short list of requirements that describes some concrete aspects of what the user should see, what should happen when the user does this and that, and what options ‘the system’ should provide to end users and administrators (doesn’t it make you smile to realize how many people refer to almost any part of the service as ‘the system…’?).

But please do not get me wrong on this; I would not say that ‘technical people’ can do it better. In fact most ‘technical people’ I’ve known would have an even shorter, less coherent view of the whole service. The problem is that many managers have a tendency to imagine the service only as a money-maker, forgetting completely about how it works, and many technicians have a tendency to imagine only isolated parts of the service at work, and they can’t and won’t imagine how the service makes money.

What happens usually?

Any implementation built on a partial and incoherent view of the service will be flawed and impossible to evolve, and this shows as soon as we start asking ourselves very basic questions that test the coherence of the service model. If a technician raises one of these questions, many times he will need the manager to give an answer, to think of choices he never imagined. If the manager is the one to discover something he does not like in the behavior of the service, many times he will need advice from the technician, making the technician think of options he never considered. It is only a matter of time (a very short time) before other managers, or even the same manager, need to make very simple variations to the original idea: instead of single items sell bundles, instead of fixed prices do promotions, instead of billing per total traffic at the end of the month bill per percentile 95/5 min of instant available bitrate, instead of ….

In many cases the first implementation of a content service must be thrown away and started all over again just after the first round of ‘improvement ideas’. It turns out that those ‘clever requirements’ that we had at the start were mutually conflicting, short-sighted, and did not allow for even the most basic flexibility.

So, what to do to avoid recreating all the service from scratch every month?

The approach that I would suggest is to do a lot of planning before actually building anything.

Write your ideas down. Draw flow diagrams on paper. Draw mockups on paper. Put them through ‘logic tests’. Ask yourself questions in the style ‘what if…?’, or ‘how can the user do…?’, or ‘would it be possible later to enhance/add/change…?’ Show your work to others. Rethink. Rewrite. Retest….

Spend a few days, or better a few weeks, doing this and I can assure you that your requirements list will grow, your idea about the complexity of the service will change, your enthusiasm will increase substantially, your understanding of possible problems will be much better, and the time and effort to build and maintain your service will be noticeably reduced.

It is important that in the ideation process different people work together: marketing & pricing experts, engineering & operations experts, end user experience experts, billing, analytics, support… I would also say: please do not use computers for work in this phase. PowerPoint is great… but your hands with a pencil are far better and much faster. And in a room with no computers there is no possibility for someone to go and check email and thus be out of the group for a while. (Smartphones are computers too and should be prohibited in this phase.) Recall that: ‘if you cannot imagine it you cannot draw it. If you cannot draw it with a pencil you cannot draw it in PowerPoint’. If your boxes are not perfectly rectangular and your circles are not perfectly round you can rest assured that at some point later a good computer will fix that with no pain.

If you were not forced to imagine the internals of the service at work you would not find the problems you have to solve and you would never find the solutions. You must realize that everyone that has a voice to claim requirements over the service MUST imagine the service at the SAME time that others do it and SHARE immediately his ideas and concerns with the entire group.


3 Building coherency: a semantic model

Oops… I couldn’t avoid the word ‘semantic’. I’ve immediately lost half of my readers. I’m sorry. For the brave remaining readers: yes, it is an academic way of saying that the whole service must make sense and not contradict itself either in ‘concept’ or in ‘implementation’.

I’ve noticed that some people, including some of my colleagues, start to mentally wander when we speak about modelling. But it is important to create models of services.  Services do not grow on trees.  A service is not a natural thing, it is a contraption of the human mind and it must be polished, rounded and perfected until it can be communicated to other minds.

A ‘model’ is a simplified version of what the service will be once built. Our model can start as an entity that only lives on paper. A few connected boxes with inputs and outputs and processes inside may be a perfectly good model. Of course the model starts to get complicated as we imagine more and more functions and capabilities that will be in the service. At some point our model will need to live off paper, just because paper cannot support the complexity of the model… but recall that the model is always a simplification of the real service, so if you think your model is complicated, what about your not-yet-existent service? Do you really want to start spending money on something you do not really understand?

Have you noticed that you need to establish a common language (also known as ‘nomenclature’) to communicate ‘about’ the service inside the service team? Let me show you an example: what happens when you tell another member of the team: “…then the user will be able to undo (some operation)…” But who is ‘the user’? For the designer of the consumer interface it is unequivocally ‘the end user consuming the service’, for the OSS/support/operations guys it is unequivocally ‘the operator of the support interfaces’, and for other people ‘the user’ may even be another kind of human. You probably think I’m exaggerating… you think this is not a problem as every expert has a good understanding of ‘the user’ in his own context. But what happens if the end-user-at-home must be mixed in the same sentence with the user-of-the-admin-interfaces? Let me tell you what happens: each one dealing with that sentence will express it in the way that is most convenient to him. The sentence will be expressed in terms of the mind that is in charge of that document/chapter/paragraph… Names will be applied to distinguish between roles, but these names belong to the experience of the one who is writing the text and they do not always fit the ideas of other people in the team. These names have never been agreed a priori. And worse, in other sentences that also mix the two concepts a different guy can name the same roles completely differently. This is one of the reasons why some technical documents about a service are impossible to read for anyone who has common sense but lacks 1500 hours of meetings with all the internal stakeholders in the service.

I cannot be sure that I have convinced you of the need to establish a ‘common language’ about the service, but if you have had to coordinate a documentation department or a big program management office you know what I’m saying. In fact the problem goes much further than ‘names’. The problem extends to ‘concepts’.

I would never have thought at the start of my career that people could develop such very different ideas of what is ‘billing’, what is ‘support’, what is ‘reporting’, what is ‘consuming a content’, what is ‘consumer behavior’, and a few hundred other concepts… But over time I learnt. These, and essentially all concepts that we deal with in our human language, do not have the same representation in each other’s minds. That does not usually create problems in our daily life, but when we design something very detailed, if we want others to understand it at the first attempt we need to agree on the language and concepts that we deal with. We need to start with a model of the service that is semantically correct, and never use terms foreign to this model to communicate about the service. The paradigm of correctness in semantic models is called an ‘ontology’ (oops, bye-bye to another half of my readers). Ontologies are very hard to ‘close’ as coherency must be complete, but we may do reasonably well with a simpler semantic model that contains ‘primitive concepts’ (definitions of objects and actions that will not be questioned), semantic relations between primitive concepts, and ‘derivative concepts’, which are objects and actions defined in terms of the primitive concepts.


4 Content Service primitive concepts

The word ‘service’ taken as a ‘semantic field’ inside human experience can extend to touch many concepts. If we shrink our area of interest just to Telecommunication services the possible interpretations of service are far fewer, but if we go further and define our domain as ‘Telecommunication Content Services’, then the number of possible concepts that we will touch and the language appropriate to deal with them become much more ‘manageable’.

‘Exploitation model’ for a content service: simple semantic model that contains all definitions of entities and actions relevant to describe ‘exploitation’ of a content service.

We define ‘exploitation’ as: all actions that pursue the creation and maintenance of value/utility for consumers and all accompanying actions that ensure that service stakeholders obtain ‘something’ in return.

‘Exploitation system’ for a content service: information system built to implement the exploitation model of a content service.

The following could be a very simple Exploitation Model for Content Services:

Content or Content Product: a piece of information valuable to humans that thus can be sold.

Content Service: a continued action that provides value through Content Products. A service can be sold as a product, but the service is action(s) while Content Product is information. A concrete service is defined by specifying the actions allowed to the consumer (see Capabilities below).

Consumer or Customer: Human that receives value from a product or service. (P.S.: a subtle remark: Consumer gets utility and Customer pays for it. Sometimes they are not the same person.)

Service Capabilities: actions available through a service.

Contract: statement of commitment that links consumer identity + product/service + pricing + SLA

Bundle (of products/services): group of products/services gathered through some criteria.  The criteria used to bundle may vary: joint packaging, joint charging, joint promotion … but these criteria define the bundle as much as its components and thus criteria + components must be stated explicitly.

– (one possible) List of Content Products that aims to define a CDN-based Service:

Streaming VoD (Movies…): content pre-stored and consumed on demand as a stream.

Streaming Live (channels, events…): content consumed live, as the event is captured, as a stream.

Non-streaming Big Object (SW, docs…): pre-stored huge content consumed on demand as a block.

Non-streaming Small Object (web objects…): pre-stored tiny content consumed on demand as a block.
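The primitive concepts listed above can be written down as a handful of data types, which is one cheap way to test whether the model is coherent before building anything. The following is only a minimal sketch in Python: every class and field name is an illustrative assumption of mine, not part of any existing exploitation system.

```python
from dataclasses import dataclass
from enum import Enum


class DeliveryMode(Enum):
    """The four Content Product kinds of the CDN-based example."""
    STREAMING_VOD = "streaming_vod"
    STREAMING_LIVE = "streaming_live"
    BIG_OBJECT = "big_object"
    SMALL_OBJECT = "small_object"


@dataclass(frozen=True)
class ContentProduct:
    product_id: str
    mode: DeliveryMode


@dataclass(frozen=True)
class ContentService:
    service_id: str
    products: tuple          # ContentProduct items offered through the service
    capabilities: frozenset  # names of the actions allowed to the consumer


@dataclass(frozen=True)
class Bundle:
    criteria: str     # e.g. "joint charging"; must be stated explicitly
    components: tuple  # ids of the bundled products/services

    def __post_init__(self):
        # The model says criteria + components define the bundle together.
        if not self.criteria or not self.components:
            raise ValueError("a bundle needs both criteria and components")


@dataclass(frozen=True)
class Contract:
    consumer_id: str
    item_id: str   # id of the contracted product or service
    pricing: str
    sla: str
```

Keeping the records immutable (frozen) is a design choice of the sketch: a changed price or SLA then becomes a new Contract rather than a silent edit of an old one.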


The following could be a possible minimum description of an Exploitation system for the above model:

Portals: (Human Interaction tools)

-Product owner (product definition) portal: meta-portal that permits defining products + capabilities.

-Seller portal: meta-portal that permits defining ‘soft’ product properties: name, pricing…

-Customer portal: handles customer action: consumption and feedback.

-value added portals & tools: customer analytics, trends, help, guides…

Mediation systems

-report gathering & analytics: log gathering & processing, analytics processing

-billing & clearing: ‘money analytics’

Operation systems

-customer provisioning: customer identity management

-service provisioning: contract management + resource allocation

-event (incidence) ticketing: link customer + product + ticket + alarms + SLA

-SLA monitoring: link product properties + analytics + contract -> alarms -> billing

Inventory systems

-commercial (product) inventory: database of products + capabilities (properties)

-resources (technical) inventory: database of infrastructure items (OSS)

-customer inventory: (protected) database of identity records


5 Human interactions: Portals

This is usually the first (and sadly sometimes apparently the only) part of the service that a company works on.

It is not bad to devote time to design great interaction with your consumers and operators, but you must go beyond nice interfaces. In fact a streamlined interface that does just what is needed and nothing else (extra functionality is not a gift, it is just distracting) is not so easy to build.

As I mentioned before, it is important to have ‘perfected’ the idea of the service. Once that idea is clear and once you have written down your objectives (what ‘powers’ (capabilities) you want to put in the hands of your consumers, and what it would take to implement those capabilities in the unavoidable explosion of support systems, operation portals, processes,…) it is time to go and draw a few mockups of the interfaces.

It is not easy to build simple interfaces. The simpler the look & feel, the cleverer you have to be to handle the interaction correctly. It is especially difficult to get rid of ‘bad habits’ and ‘legacies’ in interface design. You must consider that today the interface that you design will be ‘interpreted’ by devices that have new and appealing possibilities: HD displays with video even in small mobile devices, multi-touch displays, high quality sound … and best of all: always-on connection and social interaction with other ‘places/sites/services’. This interaction must be ‘facilitated’ by making a natural ‘shift’ from your interface to any other site/interface that you share your consumer identity with at the click (or swipe) of one widget.

5.1 The customer portal

You need interaction with your consumers/customers. They need a place to contract with you (in case you support ‘online contract generation’ which by the way is the trend), and a place to consume your product that as far as this article is concerned is ‘content’, that is: information intended for humans.

This portal is the place to let consumers browse products, get info about them, get them… and this is also the place to let them know transparently everything that links them to you: their purchases (contracts), their wishes, their issues/complaints, their payment info, and their approved social-links to other sites…

What makes you a provider of preference to a consumer is: trust, ease of use and clarity.

Through the customer portal the Consumer registers his identity with the service (maybe totally automated or aided by Operations), purchases products, receives billing information and places complaints. We’ll see later that there are a number of entities related to these actions: Customer Inventory, Product Inventory, Mediation, Event Ticketing, Operations,…

5.2 The seller (internal) portal

You may run a small company and in that case you run your business directly through a relatively small portal in which you do everything: advertising, contracting, delivery, feedback, ticketing…

Or you may be a huge company with commercial branches in tens of countries, with local pricing in different currencies, portals in many languages and a localized portfolio in every region.

In both cases it is useful to keep a clear distinction between ‘hard’ and ‘soft’ properties of your product.

‘Hard properties’ of your (content) product/service are those properties that make your product valuable to customers: the content itself (movie, channel, episodes, events…), the ease of use (view in a click, purchase in a click…), the quality (high bandwidth, quality encoding, and flexibility of formats…), the responsiveness of your service (good personalized attention, quick response times, knowledgeable staff…), etc.

‘Soft properties’ of your (content) product/service are those properties that you have added to make exploitation possible but that are not critical to your consumers: the ‘names’ that you use to sell (names of your options, packages, bundles, promotions, IDs, SKUs,…), the prices and price models (per GByte, per movie, per Mbps, per event, per bundle, per channel, per promotion…), the ads you use to promote (ads targeting by population segment, by language, by region,…), the social links and commercial alliances you build, the themes and colors, the time-windows of pricing,…

The best way to materialize the distinction between ‘hard’ and ‘soft’ properties of a product/service is to keep two distinct portals (and all their associated backend) for ‘product owner’ and for ‘product seller’.

In the ‘product owner’ portal you will manage hard properties.

In the ‘product seller’ portal you will manage soft properties.

The customer portals are built ON the seller portals. That means that you at least have as many customer portals as ‘sellers’ in your organization. If you have branches in 30 countries and each of them has autonomy to localize the portfolio, pricing, ads, names, etc… each branch is a seller. Each branch needs an internal portal to build its entire commercial portfolio adding all the soft properties to a very basic set of common hard properties taken from the product owner internal portal (see below). Each branch (seller) will build one or more customer portals depending upon their internal seller portal.

You can even be a ‘relatively small’ company that licenses products/tools to resellers. In that case you provide an internal reseller portal to your licensees so they can sell your hard product as their own by changing names, prices, ads, links, etc…
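The split between hard and soft properties described in this section can be sketched as two record types joined by SKU. This is a hedged illustration only: the names HardProduct, SellerListing and build_customer_catalogue are hypothetical, and a real seller portal would carry many more soft properties (ads, themes, pricing windows…).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HardProduct:
    """Defined once, in the product owner portal."""
    sku: str
    formats: tuple
    billing_unit: str  # e.g. "per item", "per GByte"


@dataclass(frozen=True)
class SellerListing:
    """Defined per seller (branch or reseller), in its seller portal."""
    sku: str           # must point at an owner-defined product
    display_name: str
    price: float
    currency: str


def build_customer_catalogue(owner_products, listings):
    """Join one seller's soft properties onto the owner's hard properties.

    A listing whose SKU the owner has not published is rejected, so a
    seller cannot sell what the product owner did not define.
    """
    hard_by_sku = {p.sku: p for p in owner_products}
    catalogue = []
    for listing in listings:
        hard = hard_by_sku.get(listing.sku)
        if hard is None:
            raise KeyError(f"SKU {listing.sku!r} is not in the owner product list")
        catalogue.append({
            "name": listing.display_name,
            "price": listing.price,
            "currency": listing.currency,
            "formats": hard.formats,
            "billing_unit": hard.billing_unit,
        })
    return catalogue
```

Each seller calls the same function with its own listings, so thirty branches yield thirty catalogues over one shared set of hard products.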

5.3 The product owner (internal) portal

This is the sancta sanctorum of the product definition. This is the place where you define the internals of your product. This is the place where you empower your consumers by ‘creating’ fantastic actions over fantastic content pieces and putting all these actions and portfolio in the hands of your consumers.

It is VERY difficult to find a tool that could be flexible enough to take ANY piece of content and link it to ANY possible action that makes commercial sense. In fact it is impossible.

For this reason the ‘owner portal’ spends more time in the realm of developers than in the realm of administrators. (This MUST NOT be the case for the other portals: seller, customer… or you would be in serious trouble.)

What I mean is: it is impossible to design the ‘tool of tools’ that can graphically modify the actions that you make available to your consumers in every imaginable way. The new ideas that you come up with will surely require some new code and unfortunately this code will be at the heart of your exploitation systems. For this reason it is better to cleanly separate your sellers’ internal backend from the mother company backend and your customer portals from the sellers’ internal portals.

But do not despair; it is possible to implement a very serious backend for content exploitation, tremendously powerful, and a flexible tool that manages the most common hard properties of a content product/service.

The common ‘product owner portal’ must implement the following concepts and links:

-the complete product list: items not listed here are not available to sellers

-the complete capabilities list: actions and options not listed here are not available to sellers

-general ‘hard’ restrictions: internal SKUs, internal options (formats, quality steps, viewing options…), billing options (per item, per info unit, per bandwidth…), SLA (availability, BER, re-buffering ratio, re-buffering events/minute…)

Every Content Product must go through a cycle: ingestion – delivery – mediation (accounting & billing).

The links between consumer id, contract, product id, options for consumption, options for billing, options for SLA, etc. must be implemented in several information systems: databases, registries, logs, CRMs, CDRs, LDAPs…

Of these three segments in a service life-cycle (ingestion, delivery, mediation), most academic articles on content services focus on ‘delivery’, as this is the aspect of the service that creates the hardest problems and thus it is fertile soil for innovation (CDNs are a great example). This article focuses on all the rest of the effort, everything that is not pure service delivery. One important goal of this article is to demonstrate that creating a great online service and later figuring out how to exploit it is a bad idea.

5.4 Value added portals & tools

These tools and values added are as I’ve said…‘added’. No one needs them to find, get, pay and enjoy a Content Product. But who ‘needs’ a Movie anyway? I mean ‘needing’ is not a good word to describe the reasons that drive Content consumption.

Many times it happens that there are actions that no one ‘needs’ but the value they add to a basic service attracts consumers so strongly that the service becomes popular and the new actions become a convenience that every competing service must implement. Examples are: search engines, comparison engines, wish lists, social appreciation lists, ranks, comments … People are buying ‘objects’ that no one needs to keep himself alive; there is nothing obvious per se in the value of a Content Product. We must estimate its value before purchase, judging by other people’s appreciation and comments.

All the ‘tools’ that we can imagine that may help people understand our offer (portfolio) and navigate through it are positive for our business and should be appropriately placed in the customer portals or gathered together in a tools tab, not distracting consumers but helping them know and consume Content.

Some modern tools that help monetize Content: product search, product comparison, social wish list, price evolution over time, price comparison, related products, buyers ranks, social (our group) ranks, open comments, long term trend analysis…


6 Mediation Systems

These are key systems for exploitation but the name is really ugly. What do we mean by ‘Mediation’?

We need a middle-man, an intermediary, a mediator, when we do not have all the capabilities required to do something or when it is convenient to delegate some task to others that will do it better or at least equally well but cheaper than us… or simply when we prefer to focus on other tasks.

In a commercial exploitation system, ‘Mediation’ usually means ‘doing everything that is needed to apply and enforce contracts’. Sounds easy, yes? OK, it isn’t.

Enforcing contracts is ‘Mediation’ for us because we choose not to identify with all the ‘boring’ actions needed to apply the contract,… we prefer to identify ourselves with the actions that deliver the service, and that is human. Delivery is much more visible. Delivery usually drives much more engagement in consumers.

Mediation usually involves accounting and processing of data items to prepare billing and data analyses.

Mediation in Content Services includes:

log gathering & processing

billing & clearing (and sometimes payment)

analytics processing and report rendering

In the CDN business, log gathering & processing is a cornerstone. Many CDNs in fact offer edge logs as a byproduct and some sell them. In some legal frameworks, especially in Europe, CDN service providers are even forced to keep edge logs for 12 months, available to any authority that may demand them for audit.

CDNs are huge infrastructures, usually with thousands of edge machines in tens or hundreds of PoPs distributed over tens of countries. Almost 100% of CDNs bill their customers by the total number of bytes delivered over a month (GBytes/month). Only a few CDNs bill customers by the 95th percentile of delivery speed, measured in 5 min slots over a month (Mbps/month). In any case it is necessary to measure traffic at the delivery points (the edge). But the edge is huge in a CDN, so lots of log files will need to be moved to a central element for some processing. This processing involves separating the CDRs (Customer Data Records) that belong to different customers, different Content Products, different regions, etc. If a CDN implements percentile 95/5 billing, the downloads have to be split into 5 min slots, the average Mbps calculated per slot and customer, the slots of the whole month ranked, and the 95th percentile taken per customer.
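The percentile 95/5 calculation can be sketched in a few lines. This is a minimal sketch under stated assumptions: the edge logs are supposed to be already parsed into (customer, timestamp, bytes) tuples, the function name p95_mbps is mine, and the percentile uses the nearest-rank rule (drop the busiest 5% of slots, bill the next one); real CDNs may round or interpolate differently.

```python
import math
from collections import defaultdict

SLOT_SECONDS = 300  # 5-minute slots


def p95_mbps(records, period_start, period_seconds):
    """Return {customer_id: 95th-percentile Mbps} for one billing period.

    records: iterable of (customer_id, unix_timestamp, bytes_delivered),
    a hypothetical, already-parsed form of the edge logs.
    Slots with no traffic count as 0 Mbps.
    """
    n_slots = math.ceil(period_seconds / SLOT_SECONDS)
    bytes_per_slot = defaultdict(lambda: [0] * n_slots)
    for customer_id, ts, nbytes in records:
        slot = int((ts - period_start) // SLOT_SECONDS)
        if 0 <= slot < n_slots:
            bytes_per_slot[customer_id][slot] += nbytes

    # Nearest-rank position of the 95th percentile, in integer arithmetic.
    rank = (95 * n_slots + 99) // 100
    result = {}
    for customer_id, slots in bytes_per_slot.items():
        # Average speed in each slot, sorted from slowest to fastest.
        mbps = sorted(b * 8 / SLOT_SECONDS / 1e6 for b in slots)
        result[customer_id] = mbps[rank - 1]
    return result
```

Note that every byte of the period has to be seen before the ranking can be done, which is one reason this billing model is heavier to operate than plain GBytes/month.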

Usually other interesting calculations are worth doing over edge logs.

We now live in the era of ‘Big Data’, which is a new buzzword for an activity that has long been present in some businesses (like CDNs) and long been absent in others (like many online services): behavior recording (journaling) and offline analysis (trend spotting and data correlation).

Analytics and billing should be related in a CDN. As time goes by more and more data analyses become popular with CDN customers. We started with very basic billing information (traffic/month) and that is still valid, but many other analyses have become feasible these days due to increased processing power and to new and interesting propositions about data correlation. Online content businesses appeared around the world at a moment when other online services already existed and there were established billing information systems. These billing systems for online services were mostly of two kinds: continuous service accounting for deferred billing (CDR based, common in Telephony) and discrete event billing (common in online shops).

Discrete event billing is easy to understand: one SKU is purchased –one SKU is billed. No more time spent.

CDRs (Customer Data Records) are tiny pieces of information that must be collected over a period (usually monthly) to help reconstruct the ‘service usage history’. Each CDR is as far as possible an independent piece of information intended to be processed later by a ‘billing machine’. When creating the CDR we must not rely on any context information being available later, and thus the CDR must contain everything needed to convert it into money: customer ID, service ID, units consumed, time-stamp, and other ‘creative specific billing data’. The fact is that there is always some context needed at processing time, so no CDR system is perfect, but the whole idea of keeping CDRs is to reduce the dependence on the context that exists at the time of CDR creation. In this way we will be able to post-process, adding information that was not available at consumption time (in case this information ever appears).
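As an illustration of what ‘everything needed to convert it into money’ can look like, here is a minimal sketch of a self-contained CDR and an end-of-period consolidation. The field unit_price is my own assumption, standing in for the ‘creative specific billing data’; real CDR formats are far richer.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class CDR:
    customer_id: str
    service_id: str
    units: float       # e.g. GBytes, minutes, items
    unit_price: float  # assumed field: carried in the record so no tariff lookup is needed later
    timestamp: int     # Unix seconds


def consolidate(cdrs, period_start, period_end):
    """Turn the CDRs of one billing period into one amount per customer."""
    bill = defaultdict(float)
    for cdr in cdrs:
        if period_start <= cdr.timestamp < period_end:
            bill[cdr.customer_id] += cdr.units * cdr.unit_price
    return dict(bill)
```

The sketch makes the cost visible: consolidate has to walk every single record of the period, and it can only give a final answer once the period is over.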

Consolidation of CDRs is a cumbersome process that allows great flexibility in Telco billing but it does not come for free. In fact this ‘flexibility’ has created one of the biggest problems in data back ends: processing of CDRs usually cannot start until the billing period has ended (explanation below) and at that moment the CDRs in the billing system can number thousands of millions of records. Huge datacenters have been built for billing. They are expensive, they are slow, they are complex, they are unreliable (no matter how enthusiastic the vendor is and how small he claims the amount of ‘impossible to charge for’ records to be). Why is this? Creativity in business models for ‘Telecom utilities’ has been enormous in recent times, especially since the advent of mobile communications. A subscriber is usually charged at the end of a month, and in the meantime he can contract, cancel, modify and consume a variety of communication products, receive promotions and discounts, obtain loyalty points, redeem points… All these actions that affect the monthly bill must be recorded, enriched with context, time-stamped, stored… and a consolidation process must be run at the end of the billing period to transform the CDRs into a bill per customer. Telcos willingly bear this high complexity today. They seem to have a preference for creating a plethora of different promotions, personal plans, personal discounts, special discounts, special surcharges, different pricing time windows… It seems that currently this complexity is good for Telco business, but the other side of it is that you need CDR billing.

Now you should be asking yourself this: business-wise, is a Content Service more like a shop or more like a mobile Telco service? Will we do better with discrete-event billing or with CDR billing? That may be a tricky question. In my own humble opinion any Content Service is better thought of as a shop, and a CDN is no exception. CDNs create an interesting paradox: the Customer (the one who looks for the service and eventually gets it and pays for it) usually is not the same human that ‘consumes’ the service. The typical CDN customer is a company that has some important message to deliver through the internet. There can be millions of ‘consumers’ demanding that message. There can be thousands of millions of consumption actions in a billing period, performed by millions of different humans. This fact distracts many people from other more important facts:

-the Service Capabilities are completely determined and agreed before starting to serve

-the SLA is completely established and agreed before starting to serve

-pricing is completely clear and agreed before starting to serve

-no matter how many millions of termination points there are, it is perfectly possible to trace them all to the CDN service and bill all the actions to the proper customer

-a Telco service is strongly asymmetric: the customer is many orders of magnitude less ‘powerful’ than the service provider; a CDN is not. For a CDN many customers may be in fact bigger financially than the service provider, so there is space for initial negotiation, and there is NO space for wild contract changes in the middle of the billing period just because the service provider gets creative about tariffs or whatever.

So I would say that CDR billing for a CDN only complicates things. Logs of edge activity are the ultimate source for service audit and billing, but there is no point in separating individual transactions, time-stamping each one, adding all the context that makes a transaction independent from all others, and storing all those millions of records.

A CDN deserves something that sits midway between event-billing and CDR-billing. I like to call it ‘report-based-billing’. Some distributed processing (distributed along the edge and regions of the world) may allow us to separate ‘reports’ about the bytes downloaded from the edge and accountable to each of our customers. These reports are not CDRs. Nor are they ‘unique events’ to be billed. These reports are partial bills for some time-window and for some customer. We may do the processing daily, hourly or even more finely than that. We will end up having the daily (for instance) bill for each customer in each region. This daily bill can be accumulated over the month easily, so we will have the bill up to day X in month Y with little added effort over the daily processing. These reports ‘easily’ support corrections due to failures in service that have an effect on billing (compensations to customers, free traffic, promotions…) and also support surgical amendments of the daily report consolidation in case (for instance) some edge log was unrecoverable at the time of daily processing but was recovered later.

By implementing this ‘continuous consumption accounting and continuous report consolidation’ it is possible to bill for a CDN (or any content business) immediately after the billing period ends (usually a month), but most importantly there is no need to process thousands of millions of CDRs to produce our bills, nor to have a huge datacenter for this specific purpose.
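What I call ‘report-based-billing’ can be sketched as follows. This is a hedged illustration, not a description of any real CDN back end: the pre-parsed log format, the names daily_report and MonthToDateBill, and the flat price per GByte are all assumptions made for the example.

```python
from collections import defaultdict


def daily_report(edge_log_entries):
    """Reduce one day of edge log entries to bytes per (customer, region).

    edge_log_entries: iterable of (customer_id, region, bytes_delivered),
    a hypothetical pre-parsed log format.
    """
    report = defaultdict(int)
    for customer_id, region, nbytes in edge_log_entries:
        report[(customer_id, region)] += nbytes
    return dict(report)


class MonthToDateBill:
    """Accumulates daily reports; a day can be re-submitted to amend it."""

    def __init__(self, price_per_gbyte):
        self.price_per_gbyte = price_per_gbyte
        self.reports = {}  # day -> daily report

    def submit(self, day, report):
        # Replacing the stored report is the 'surgical amendment': a
        # late-recovered edge log simply yields a new report for that day.
        self.reports[day] = report

    def amount_due(self, customer_id):
        total_bytes = sum(
            nbytes
            for report in self.reports.values()
            for (cust, _region), nbytes in report.items()
            if cust == customer_id
        )
        return total_bytes / 1e9 * self.price_per_gbyte
```

The month-to-date bill only ever touches a handful of small reports per day, never the individual transactions, which is the whole point of the approach.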


7 Operation Systems

This concept of ‘operation’ leads us to an interesting discussion. In the Telco world operation is always present. No system or service can work with ‘zero operation’. This concept of operation goes beyond ‘maintenance’. Operation means ‘keeping the service up’. This task varies greatly from one service to another. One may think that the better the service was imagined the less operation it needs… and that is not a bad idea. It is true. But in the real world ‘zero operation’ is not yet possible.

Put simply, the services we create have so many actions inside, affecting so many machines and lines of code, that we cannot really believe they can work without keeping an eye on them. Taking care of that is ‘monitoring’. And, by the way, we never really discovered how to accomplish some tasks automatically (customer contact, contracting, support calls, replacement of SW versions, etc.) and that is ‘support’. These human concepts of ‘monitoring’ and ‘support’ have been named in the Telco world OSS (Operation Support Systems) and BSS (Business Support Systems), but in real life there is high overlap between them. How could you possibly think of a task that means operation of a service without being a support to the business? Have you ever seen any business that has operations that do not carry costs? Do you have operations that do not produce business? (If you answered ‘yes’ to either of the two questions you had better review your business…).

The most important (in my view) OSS/BSS in Content Services are:

customer provisioning: customer identity management

service provisioning:  contract management + resource allocation

event (incidence) ticketing: link customer + product + ticket + alarms + SLA

SLA monitoring: link product properties + analytics + contract -> alarms -> billing

7.1 Customer/Consumer provisioning

This kind of system, an information system that acquires and handles human identity records, has evolved enormously in recent years. ‘Managing Identity’ is an incredibly powerful action for a business and it carries great responsibility that will be enforced by law. However, only very recently have we begun to see some part of the power of Identity Management in real life.

In a series of internal research articles that I wrote seven years ago I was promoting the idea of a ‘partially shared identity’. At that moment the idea was certainly new as some syndication of ‘whole identities’ was entering the industry and some more or less ‘promising standards’ were in the works.  We built a demonstration of three commercial platforms that were loosely-coupled by sharing fragments of the whole identity of the end user.

Today I’m happy to see that the once ‘promising’ standards which were overly complex have been forgotten but the leading commercial platforms and the leading identity management platforms (social networks) now allow cross-authentication by invoking APIs inspired by the idea of ‘set of identity records and set of permissions’. The platform that requires access to your identity data will let you know what ‘items’ it is requesting from your authenticator before you allow the request to go on.  This is a practical implementation of ‘partial identity’.
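The ‘set of identity records and set of permissions’ idea can be sketched in a few lines. This is only an illustration under my own assumptions: the function name and the item names are hypothetical and do not correspond to the API of any particular platform.

```python
def share_partial_identity(identity, requested, approved):
    """Return only the identity items that were both requested and approved.

    identity:  dict with the whole identity held by the authenticator
    requested: items the requesting platform asks for
    approved:  items the end user agreed to share
    """
    granted = set(requested) & set(approved)
    return {item: identity[item] for item in sorted(granted) if item in identity}
```

The requesting platform never sees the whole identity: anything the user did not approve, or that the authenticator does not hold, is simply absent from the result.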

But let’s focus on the simplest purpose of ‘Customer Provisioning’: we need to acquire a hook to some human so we can recognize her when she is back, we can give service to some ‘address’, we can take her feedback and we can send her bills and charge her account for the price of our service.

As I’ve said the most intelligent approach to knowing our users today is …going directly to our would-be-customer and saying: … ‘Do you have a social network in which you are well known? Good, please let me know which one. By the way I have bridges to the biggest three. You can choose the one you prefer to authenticate with me and I will not bother you a single minute entering your data.’

Usually social networks do not hold information about payment methods (VISA, PayPal, etc…) so fortunately for the peace of mind of our customer/consumer that part of the personal data cannot be shared. But taking the more general concept of a ‘platform’ in which a consumer has a personal account, one can imagine a business relationship with another platform in which the consumer would occasionally like to make a purchase but does not want to rely on that second platform to handle his payment. If the consumer gives permission, the charge could be sent to the first platform that is already trusted by the consumer. The first platform will handle the consumer’s money and the new (or second) platform will just be a provider of goods to the first platform, sending these goods (in our case Content Products) directly to the address of the consumer. In this way the consumer obtains the good effects of sharing his payment data without actually sharing them.

I have to say that I’m also happy to see this concept implemented today in Amazon Marketplace. In the case of virtual goods (Content) it could be even easier to implement (or more complicated; it depends on the nature of the content and the kind of delivery that is appropriate).

7.2 Service Provisioning

This is hard stuff. As I mentioned at the beginning of this article ‘…today we are not talking about delivery…’ But in fact delivery is the most attractive part of content businesses from a technological perspective. It is also the biggest source of pain for the content business. It is where you can fail, where you can be wrong, have the wrong strategy, have the wrong infrastructure, the wrong scale… and you see… It is a hard problem to solve, but this is the reason it is so exciting. CDNs are exciting. Service Provisioning is directly related to how you plan and execute your content delivery.

Provisioning more service is a daily problem in CDNs. It may be due to a new customer arriving or because existing customers demand ‘more service’.  It cannot be taken lightly. Customers/Consumers can be everywhere through your footprint, even worldwide, but you do not have PoPs everywhere and your PoPs do not have infinite capacity. Service provisioning must be the result of thorough thinking and data analysis about your current and projected demand.

As I commented in a different article, a CDN takes requests from essentially anywhere and then has to compute ‘request routing’ to decide per request which is the best resource to serve the request. Resources are not ‘anywhere’. There is a ‘footprint’ for a CDN. There are many strategies to do this computation, and there are many high level strategies to geographically distribute resources. Recently the edge of CDNs has started to become less distributed. Or it would be better to say that the original trend of ‘sprawling the edge’ through the world has been greatly slowed down. CDNs nowadays enhance the capacity of their edges but they have almost stopped branching the edge finely. There is a reason for this behavior: the most consumed content in CDNs is VoD (per byte) and pre-recorded content delivery is not very sensitive to edge ramification. With appropriate buffering a few-PoPs-edge can do very well with VoD. On the contrary, live events and low latency events depend very much on proper branching of the edge.

When the probability of dropping requests in our request routing due to the misalignment of our demand and our resources capacity/position gets above a certain threshold we will need to increase our service.
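A crude way to turn this rule into a check is sketched below. It assumes (my assumption, not a general method) that the probability of dropping a request in a region can be approximated by the fraction of peak demand that exceeds the capacity of that region; a real CDN would derive it from request routing logs. All names are hypothetical.

```python
def drop_fraction(demand, capacity):
    """Fraction of demand that cannot be served with the given capacity."""
    if demand <= 0:
        return 0.0
    return max(0, demand - capacity) / demand


def regions_needing_capacity(demand_by_region, capacity_by_region, threshold):
    """Regions whose estimated drop fraction is above the threshold."""
    return sorted(
        region
        for region, demand in demand_by_region.items()
        # A region with demand but no entry in the capacity map has zero capacity.
        if drop_fraction(demand, capacity_by_region.get(region, 0)) > threshold
    )
```

Running this over projected demand instead of current demand gives an early list of places where service must be increased.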

In a CDN there is usually dynamic allocation of resources to requests. There is no static allocation of resources to some requests, for example to some customer. But there are always exceptions. In a sophisticated CDN it is possible to compute the request routing function with reservation of resources for some customers. This technique of course makes global request routing much more complicated but introduces new business models and new possibilities in SLAs that are worth considering.

If your CDN applies capacity reservation then a new customer with a reservation will have an immediate impact on service provisioning.
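To make the reservation idea concrete, here is a minimal sketch of request routing with reserved capacity. It assumes a very simplified model of my own: each PoP has a number of request slots, some of them reserved per customer, and the rest form a shared pool. The `Pop` class and the `route` function are hypothetical; real request routing weighs many more criteria.

```python
from dataclasses import dataclass, field


@dataclass
class Pop:
    name: str
    capacity: int                                 # total request slots
    reserved: dict = field(default_factory=dict)  # customer -> reserved slots
    in_use: dict = field(default_factory=dict)    # customer -> slots in use

    def can_serve(self, customer):
        used = self.in_use.get(customer, 0)
        if used < self.reserved.get(customer, 0):
            return True  # still inside the customer's own reservation
        # Otherwise the request competes for the shared (unreserved) pool.
        shared_capacity = self.capacity - sum(self.reserved.values())
        shared_used = sum(
            max(0, n - self.reserved.get(c, 0)) for c, n in self.in_use.items()
        )
        return shared_used < shared_capacity


def route(customer, pops_by_preference):
    """Serve from the first PoP that can take the request; None means dropped."""
    for pop in pops_by_preference:
        if pop.can_serve(customer):
            pop.in_use[customer] = pop.in_use.get(customer, 0) + 1
            return pop.name
    return None
```

In this model a new reservation shrinks the shared pool of the PoP at once, which is why a customer with a reservation has an immediate effect on provisioning.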

Other impacts on service provisioning emanate from the very nature of some CDN services. For example, when a CDN starts caching a domain of a new customer it is usually necessary to inform the caches of the name of this domain so they (the caches) change their policy to active caching. This action should be triggered by a proper service provisioning system.

7.4 Event ticketing

In any online service it is important to keep track of complaints and possible service failures. I would say that this is not a very special part of a Content service. (Please understand me right: being a Content Service does not make this special over other Services.) Essentially it is a workflow system that will let you account for events and link them to: Customer Identity + Operations work orders. Easy as it is to implement a simple workflow, it is worth the time to use alarms and time stamps to implement a ‘prompt communication policy’. Once you have received notice of a potential problem the clock starts ticking and you must ensure that all stakeholders receive updates of your action in due time. The ticketing system does exactly that. It creates ‘tickets’ and manages their lifecycle. A ticket is a piece of information that accounts for a potential problem. As more details are added to the ticket all stakeholders benefit from the existence of the ticket: the customer gets responses and corrective actions, operations get information to address a problem, the whole system gets repaired, other users avoid running into problems, your data backend and analytical accounting get info about your time to solve problems, the number of problems and the cost of repairing.
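The ‘clock starts ticking’ part can be sketched as follows. This is a toy model under my own assumptions: the `Ticket` fields and the `overdue` helper are hypothetical and not taken from any ticketing product.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class Ticket:
    ticket_id: int
    customer: str
    opened_at: datetime
    updates: list = field(default_factory=list)  # (timestamp, note) pairs
    closed: bool = False

    def add_update(self, when, note):
        self.updates.append((when, note))

    def last_contact(self):
        """Time of the last update sent, or the opening time if there is none."""
        return self.updates[-1][0] if self.updates else self.opened_at


def overdue(tickets, now, max_silence):
    """Open tickets whose stakeholders have had no update for too long."""
    return [
        t.ticket_id
        for t in tickets
        if not t.closed and now - t.last_contact() > max_silence
    ]
```

An alarm that runs `overdue` every few minutes is enough to enforce a simple communication policy on top of any workflow.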

All in all the ticketing system is your opportunity to implement transparency and a ‘communication bus’ that works for emergencies and gives the right priority to many different events and incidences.

7.5 SLA Monitoring

This is an information system that I rarely see ‘out-of-the-box’. You need to build your own most of the time. Many vendors of video equipment and/or OVPs sell ‘probes’ that you can insert at a variety of points in your video distribution chain. These probes can give you a plethora of measures or ‘quality insights’ about your service. Many other vendors will provide you with network probes, traffic analysis, etc… It is advisable to have a solid background in performance analysis before trying to use the vendors’ suggestion of a set of SLOs (Service Level Objectives) to build an SLA (Service Level Agreement) for a customer. It happens many times that the understanding that we get from the written SLA is not the same that the customer gets. And it happens even more frequently that the measures that the probes give us DO NOT implement what we have announced in our SLA. It is key to clear any doubt about what is measured and how, exactly, it is measured. (For more in depth information you may want to read my previous article: CDN Performance Management.)

The SLA is our commitment to our customer to guarantee certain characteristics of content traffic. Today no one will be selling a Content Service on the ‘soft promise’ that the Service will scale seamlessly with demand, the traffic shaping will be correct, the delay low, the losses nonexistent, the encoding quality superb,… All these ‘fuzzy statements about service quality’ are simply not accepted. The ‘reach’ of the service in terms of what that service can really do cannot be an aspiration. It must be documented in an SLA. This SLA will state clearly what we can expect from the service using quantitative measures.

There are very serious differences between ‘cheap’ CDNs / content services and ‘serious’ ‘high quality’ services. Even when the finished product may occasionally look the same (video on demand, channels, events…) there is a whole world of complexity in preparing the service in advance to support any eventuality. A quality service provider may easily spend 5X to 10X more than a cheap provider preparing for variations in load and preparing for all kinds of performance threats. Of course taking care of performance in advance is expensive. It involves lots of analysis of your systems, constant re-design and improvement, buying capacity in excess of demand, buying redundancy, hiring emergency teams, buying monitoring systems… How can a business survive this investment? This investment is an opportunity for positive publicity and for a business model based on quality and SLAs. If you are highly confident in your performance you can sign a very aggressive SLA, promising high quality marks and accepting penalties for occasional infringement.

There used to be a huge difference in the delivery options available to a Content Service in the early days of CDNs (15 years ago). At that moment it was:

Option 1: Plain carrier connectivity service: no content oriented SLA. Use it at your own risk. Only percentiles of dropped packets and average Mbps available were eligible to describe quality. Nothing was said about integrity of individual transactions.

Option 2: CDN. A ‘shy’ SLA, promising a certain uptime of the service, certain bounded average transaction-latency, a certain set of content-related quality KPIs: buffering ratio, time to show, a certain level of cache-hit ratio…

At that moment option 2 was much more valuable than option 1 (no surprise…), and for that reason prices could be 10X raw Carrier traffic prices for CDN customers.

Today after years of CDN business, after continued improvement in Carrier services, but also after a serious escalation in demand of Content and a serious escalation in typical Content bitrate…SLAs have to be different and CDN prices vs traffic prices have to be in a different ratio. Anyway this is a matter for a much longer article.

What happens today is that SLAs are now a much less impressive sales tool. Almost all CDNs show very similar SLAs. I’ve been able to notice a trend that is very interesting. Some CDNs are getting increasingly ‘bold’, promising to achieve certain SLOs that are close-to-impossible to grant.  This is probably an effect of the way most customers check SLAs: they check them only in case of serious failure.  Or even disastrous failure. There is not a culture of reviewing the quality of the traffic when there are no complaints from end users.   Companies that commercialize SLA-based services have noticed this and they may be in some cases relaxing their vigilance on SLAs, moving resources to other more profitable activities and reacting only in the rare case of a disastrous infringement of the SLA. In that case they just refund the customer and go on with their activity. But at the same time they keep on selling service on SLA promises.

My own personal view about managing SLAs is not aligned with this ‘react only in case of problem’ style. It is true that the underlying carrier services are today more reliable than 15 years ago, but as I’ve said Content Technology keeps pushing the envelope so it would be better to redefine the quality standards.  We should not assume that IP-broadcasting of a worldwide event ‘must’ carry a variable delay of 30s to 60s. We should not assume that the end user will have to live with a high buffering ratio for 4K content. We should not assume that the end user must optimize his player for whatever transport my content service uses.

It is a good selling point to provide SLA monitoring reports for all services contracted by the customer on a monthly basis. These reports will show how closely we have monitored the SLA, and what margin we have had across the month for every SLO in the SLA. Of course these reports also help our internal engineering in analyzing the infrastructure. Good management will create a cycle of continuous improvement that will give us a bigger margin in our SLOs and/or the ability to support more aggressive SLOs.
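The ‘margin for every SLO’ in such a report can be computed very simply. The sketch below assumes, for illustration only, that every SLO is either a lower bound (‘min’) or an upper bound (‘max’) on one measured value per month; the SLO names and the report layout are hypothetical.

```python
def slo_margins(slos, measured):
    """Build one report row per SLO.

    slos:     name -> (kind, target), where kind is 'min' or 'max'
    measured: name -> value observed over the reporting period
    """
    report = {}
    for name, (kind, target) in slos.items():
        value = measured[name]
        # A positive margin means the SLO was met with room to spare.
        margin = value - target if kind == "min" else target - value
        report[name] = {
            "target": target,
            "measured": value,
            "margin": margin,
            "met": margin >= 0,
        }
    return report
```

Tracking how the margin of each SLO moves from month to month is what tells us whether a more aggressive SLO can be offered.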

SLAs and their included SLOs are great opportunities for service differentiation. If my service can have seriously low latency, or no buffering for 4K, let us demonstrate it month by month with reports that we send for free to all customers.

So having SLA reports for all customers all the time is a good idea. These reports can usually be drawn from our Performance Management Systems and through mediation can be personalized to each Customer.


8 Inventory Systems

These are of course core components of our exploitation. As commented above we must keep track of at least: tech resources, customers, products.

I like to start with the hardcore components of a good delivery: tech resources

8.1 Technical Inventory

This technical inventory is a concept that comes very close to the classical OSS inventory of machines. I say close and not identical because a content service should go beyond connectivity in the analysis of the technical resources.

The technical inventory must contain a list of all machines in the service (mostly delivery machines in a content service) with all their key characteristics: capacity, location, status… These are long term informative items. Real-time load is not represented in the inventory. An alarm (malfunction) may or may not be represented in the inventory. It may be there to signal that a machine is out of service: status out.

Having a well-structured tech inventory helps a lot when implementing automated processes for increasing the delivery capacity. In a CDN it is also especially important to regularly compute the resource map and the demand map. In fact the request routing function is a mapping of the demand onto the resources. Ideally this mapping would be computed instantly and the calculation repeated continuously.

The technical inventory is not required to represent the current instantaneous load of every machine. That is the responsibility of the request routing function. But the request routing is greatly supported by a comprehensive, well-structured technical inventory in which a ‘logical item’ (like for instance a cache) can be linked to a hardware part description (inventory).

Having this rich data HW inventory allows us to implement an automated capacity forecasting process. In case a new big customer wants to receive service we may quickly calculate a projection of demand and determine (through the inventory) which is the best place to increase capacity.

It is also very useful to link the inventory to the Event ticketing system. In case a machine is involved in a service malfunction that machine can be quickly identified, marked as out of service, and retired from our delivery and request routing functions. At the same time our OSS will be triggered for a repair on site, a replacement… or simply we may mark the datacenter as eligible for end of the month visit.
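A minimal sketch of such an inventory and of its link to ticketing could look like this. The field names, the status values and the `TechInventory` class are assumptions of mine for illustration, not a standard OSS schema.

```python
from dataclasses import dataclass


@dataclass
class Machine:
    machine_id: str
    role: str            # logical item, e.g. "cache" or "streamer"
    location: str        # PoP or datacenter
    capacity_gbps: float
    status: str = "in_service"


class TechInventory:
    def __init__(self, machines):
        self._machines = {m.machine_id: m for m in machines}

    def mark_out_of_service(self, machine_id):
        """Called when a ticket links a malfunction to a machine."""
        self._machines[machine_id].status = "out"

    def routable(self):
        """Machines that request routing is allowed to use."""
        return sorted(
            m.machine_id for m in self._machines.values()
            if m.status == "in_service"
        )

    def capacity_by_location(self):
        """Long term capacity per location, counting only machines in service."""
        totals = {}
        for m in self._machines.values():
            if m.status == "in_service":
                totals[m.location] = totals.get(m.location, 0) + m.capacity_gbps
        return totals
```

Note that the inventory holds only long term items (capacity, location, status); instantaneous load stays with the request routing function.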

The tech inventory must also be linked to our cost computation process, which also takes data from our mediation systems and our purchases department. We want to know the lifetime of each machine that we operate and what the impact of each machine is on our costs. This impact has CAPEX and OPEX components. Having these links between analytic systems allows us to implement a long term profitability analysis of our business.

8.2 Product Inventory AKA Commercial portfolio

As we saw when talking about the service ideation there is a portfolio of different products. In the case of Content Products this portfolio maps to a list of titles and a wealth of ‘actions’ or capabilities that our customers buy the right to execute. We may package titles with actions in the most creative way that anyone could imagine: Channels, Pre-recorded content on demand, events… any combination of the mentioned with an idea of quality through ‘viewing profiles’ (bitrate, frame size, audio quality, frame rate, codec, color depth,…), monthly subscription, pay per view, hour bonus, plain tariff, premium plain tariff,… whatever. But how do we map all these ‘products’ to our other systems: technical inventory, customer inventory, mediation, analytics, portals, SLA monitoring, event ticketing…?

The best solution is to build links from the Product Inventory to all the systems in a way that makes sense. And that ‘way’ is different for each of the exploitation system components that we have described.

For instance, when designing a VoD product we should map it to the Technical Inventory to be sure that the list of codecs + transports is supported by our streamers. If we have a heterogeneous population of streamers in which some support the new Product and some do not… we need to link that knowledge to customer provisioning so we do not sell a footprint for that product that we cannot serve. If that same VoD product will be billed through a monthly plain tariff with a cap in traffic and with a closed list of titles but we allow premium titles to be streamed for an extra fee… we must include informational tips in the Product inventory so the link to the Mediation can properly build the monthly bill for this customer. If we want to apply different pricing for different places in the world we need to include those tips in the Product inventory and use them to link to Mediation and to link to Customer provisioning and to link to Portals.
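The codec + transport check described above can be sketched as a small function that returns the footprint we can actually sell. The dictionary layout for products and streamers is a hypothetical one chosen for this example.

```python
def sellable_footprint(product, streamers):
    """Regions where some streamer supports all codecs and transports needed.

    product:   {"codecs": [...], "transports": [...]}
    streamers: list of {"region": ..., "codecs": [...], "transports": [...]}
    """
    needed_codecs = set(product["codecs"])
    needed_transports = set(product["transports"])
    regions = set()
    for s in streamers:
        # Subset test: the streamer must cover every requirement of the product.
        if (needed_codecs <= set(s["codecs"])
                and needed_transports <= set(s["transports"])):
            regions.add(s["region"])
    return sorted(regions)
```

Customer provisioning can then refuse to sell the product outside the returned regions.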

Of course the most obvious link of the Product inventory is to the Product Owner Portal. The Product Owner Portal is the technical tool that is used to design the product capabilities (actions) and as I’ve said it is a system that lives at the core of the Exploitation system, in a dungeon where only developers and a few product owners can touch it. As it goes through frequent updates to provide new and exciting capabilities the same happens to the Product Inventory. This inventory evolves with the Product Owner Portal, to reflect and account for every new capability and to store the tips that are used to link via many processes to the rest of exploitation system components.

8.3 Customer Inventory

As we have mentioned before, having information about our customers is today an asset that has turned out to be more powerful than ever before. In fact there is a serious fight among commercial platforms to hold the personal data records of customers. For this reason ‘sharing’ part of the customer identity is the new trend.

Anyway, let’s assume that we are the primary source of data about one particular customer. In that case we need to account for enough information to legally approach our customer: full Name, full address, fiscal ID, payment data.  On top of that we may pile up whatever data we dare to ask our customer about himself.

And on top of ‘what our customer knows we know about him’… we will add a ton of ‘insights’ that we can get about our customer just watching his public activity.  Important: watching a public activity means taking notes on actions that we are supposed to notice as service providers… It is not and will never be spying on other activities of our customer of course!   There are many insights that online businesses do not exploit, or at least exploiting them was cumbersome and not very fashionable until recently.  The ‘Big Data’ age is changing that.  Profiling customers is a hard task that involves lots of interesting algorithms to correlate data sources, but the good part is that we already have the data sources: purchases, timestamps of purchases, traffic, clicks on media players, ‘behavior’ at large. And the other good thing about collecting all these insights is that it is lawful and it is a ‘win-win’ action that benefits equally the service and the Customer.

The Customer Inventory is of course linked to Portals, to Mediation, to event ticketing, to some Analytics and to SLA monitoring.


9 Conclusions

We have seen that Exploitation Systems are a set of ‘systems’ that rival in complexity the core service systems, usually called ‘service delivery infrastructure’. But Services must be exploitable…easy for the service provider.

We have seen that we cannot buy Exploitation Systems off the shelf. OK, we can. But is it good to go with an all-purpose exploitation suite with 100+ modules that are designed to behave equally when selling cars, houses, apples… movies? My guess is that Content Businesses have some specifics that set them apart from other Services, and there are even specifics that separate one Content Service from another. If we buy a famous exploitation suite for online businesses we MUST have a clear design in mind to customize it.

We have seen that some formality when thinking at the design stage helps later. I suggest creating first a Service Exploitation Model and implementing a Service Exploitation System after it.

We have decomposed the typical pipeline for exploitation of Content Services in major subsystems: Portals, Mediation, Operation, Inventories.

We have reviewed the major subsystems of the exploitation and analyzed the good properties that each subsystem should have for Content Services and also have discussed trends in design of these subsystems.

While reviewing the desired properties of subsystems we have noticed the links and processes that we need to create between them.  We have noticed the huge possibilities that we get from linking subsystems that in other Service views (alien to Content) are kept separated. These links are key to the coordinated behavior of the Exploitation and they must be instrumented by adding information that makes the subsystems cooperate.

As a final remark I would like to emphasize how important it is to apply innovation, analysis and continuous improvement methods to the Exploitation of Content Services. I know it looks fancier to deal with the infrastructure for the delivery of video but there are a lot of interesting and even scientific problems to solve in Exploitation.

Best Regards.                                                                                       Adolfo M. Rosas

CDNs and Net Neutrality

(Download this article as PDF:  CDNs and Net Neutrality)



1. Introduction

In recent weeks many articles have appeared arguing in favour of or (very few) against net neutrality. This ‘net neutrality’ topic has been present in the legal, technological and business worlds for years but it is gaining momentum and now every word spoken by the FCC or by any of the big players in these issues unleashes a storm of responses.

In this hyper-sensitive climate it may seem that anyone could have and show an opinion.

I’ve seen people that do not work in Internet, do not work for Internet and do not work by means of Internet and who do not have a background in technology, legal issues or social issues shout their opinion in many media. Defending net neutrality is popular today. It sounds like defending human rights.

The natural result of this over-excitement is that a lot of nonsense is being published.

How can we harness such a difficult topic and bring it back to reason?

In this article I will try to pose the ‘net neutrality’ discussion in the right terms, or at least in reasonable terms, connected to the roles that the Internet has reached in technology, society, economics and specific businesses such as content businesses and CDNs, which are especially affected by this discussion.


2. What do they mean by Net Neutrality?

The whole discussion starts with the very definition of Net Neutrality as there are many to choose from. The simpler the better: Net Neutrality is the policy by which we ensure that no net-user is ‘treated differently’ or ‘treated worse’ than any other net-user for the sole reason of having a different identity.

I have selected a definition that on purpose avoids business terms and technology terms. This is a good starting point to dig into the meaning of ‘net neutrality’ and inquire into the massive response that it is raising.

What does ‘policy’ mean in the ‘net neutrality’ definition? A policy is a consolidated behavior. It is a promise of future behavior. It is a continued and consistent practice.

What is the ‘net’ in the ‘net neutrality’ definition? It is the Internet. This means every segment of network that talks IP to/from public IP addresses.

Who is the ‘net-user’ in the ‘net neutrality’ definition? He is anyone linked to a public IP address. He is anyone that can send and receive IP packets through a public IP address.

What is ‘treating someone differently’ or ‘treating someone worse than others’ in the ‘net neutrality’ definition? As we are talking about ‘network-level behaviors’, treating means dealing with network-level objects: packets, addresses, traffic… So we can translate that ambiguous ‘treating worse’ to: handling packets, addresses and traffic from someone differently/worse than we handle packets, addresses and traffic from anyone else just because we know who the traffic originator is.

How can we ‘deal worse’ with packets/traffic? There are some network-level-actions that affect traffic adversely and thus can be interpreted as ‘bad treatment’: delaying/dropping packets in router queues.

Why would anyone delay/drop packets from anyone else in a router queue? There is no reason to harm traffic flow in a router just for the sake of doing it or to bother someone. It is plain nonsense. But every minute of every day routers delay packets and drop packets…why? The reason is that routers are limited and they can only deal with X packets/s. If they receive more, they are forced to ignore (drop) some. This fact should be no major drama as reliable transport protocols (TCP) deal with lost packets, and all end-to-end applications should deal with continuity of communication no matter which transports they use. The only effect that we can acknowledge from packet drops is that they create a ‘resistance’ to getting packets through the Internet that increases with network occupation, but this ‘resistance’ is statistically distributed across all net-users that have communications going through each router. No one sees a problem in that. It is the same condition as a crowded highway: it is crowded for the rich and for the poor…no discrimination. Classic routing algorithms do not discriminate individual IP addresses. They maximize network utilization.
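A toy model makes the point about non-discriminating drops visible. It assumes a single FIFO queue with a fixed length and a fixed service rate per time tick (a ‘tail drop’ queue); real routers are far more elaborate, and the function and parameter names are my own.

```python
from collections import deque


def tail_drop(arrivals, queue_limit, service_rate):
    """Simulate a FIFO router queue that drops packets when it is full.

    arrivals:     list of ticks, each tick a list of sender ids (one per packet)
    queue_limit:  maximum number of packets waiting in the queue
    service_rate: packets forwarded per tick
    Returns (delivered, dropped) as lists of sender ids.
    """
    queue = deque()
    delivered, dropped = [], []
    for tick in arrivals:
        for sender in tick:
            # The decision looks only at queue occupancy, never at the sender.
            if len(queue) < queue_limit:
                queue.append(sender)
            else:
                dropped.append(sender)
        for _ in range(min(service_rate, len(queue))):
            delivered.append(queue.popleft())
    return delivered, dropped
```

Who loses a packet depends only on how full the queue was at the moment of arrival, which is the highway effect described above.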


3. Does (recent) technology threaten net neutrality?

Although congestion problems have been present in the Internet from the beginning of this now famous network, some people have recently developed a tendency to think that router owners will decide to drop some specific net-user’s packets all the time. Just to harm this user. But why complicate the operator’s life so much? It is easier to let routers work in best effort mode, for instance with a policy of first-come, first-served. This is in fact the way most internet routers have behaved for years. Just consider that ‘routing IP packets’ is a network layer activity that involves only checking origin-IP addr, destination IP-addr and IP priority bits (optional). This is very scarce information that deals just with the network level, without identifying applications or users, and even for those ‘quick routing choices’ the routing algorithms are complex enough and the Internet is big enough to have kept routers for years under line-rate speed (usually well under line-rate speed), even the most expensive and most modern routers. ‘Priority bits’ were rarely used by the operators until recently, as creating a policy for priorities used to degrade router performance badly. Only very recently has technology started to get over that barrier. Read on to know if we can go much further.

As technology evolves it is now possible (in the last few years, maybe 5) to apply complex policies to routing engines so they can ‘separate’ traffic into ‘classes’ according to multiple criteria that go beyond the classic layer 3 info: (origin, destination, priority). With the advent of line-rate DPI (Deep Packet Inspection) some routing and prioritization choices can be made based on upper layer info: the protocol on top of IP: FTP, HTTP, MAIL… (this information belongs to layers 4-7), the SW application originating the packets (this info belongs to layer 7, and has been used to throttle P2P for instance), the content transported (layer 7 info, has been used to block P2P downloads)…

So it is only now (maybe in the last 5 years) that it is commercially possible to buy routers with something close to line-rate DPI and program them to create first class Internet passengers and second class Internet passengers according to many weird criteria. It is possible, yes, no doubt… but is it happening? Does it make sense? Let’s see.
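The difference between the classic layer 3 view and the DPI view can be illustrated with a toy classifier. The packet is modelled as a plain dictionary with hypothetical fields; in particular the ‘app’ field stands for whatever a real DPI engine would infer from the payload, which is far harder than reading a field.

```python
def classify_layer3(packet):
    """Classic routing view: only the IP priority bits can drive the class."""
    return "priority" if packet.get("priority", 0) > 0 else "best_effort"


def classify_dpi(packet, throttled_apps):
    """DPI view: the application seen in upper layers can drive the class."""
    if packet.get("app") in throttled_apps:
        return "second_class"
    return classify_layer3(packet)
```

With the layer 3 view, a P2P packet without priority bits is just best effort traffic like any other; with the DPI view, the same packet can be pushed to a lower class purely because of the application that produced it.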


4. What is an ISP and what is ‘Internet Service’?

Internet Service Providers, taken one by one, do not own a significant portion of the Internet; nobody does. So how can anyone offer you ‘Internet Service’? Do you think that all of them (the ISPs of the world) have a commercial alliance so any of them can sell you the whole service on behalf of all of them? No.

Then, what is Internet Service?

We could define IS as being close to a ‘Carrier Service’, that is: a point-to-point messaging service. This basic service takes a ‘packet’ from one entry point (a public IP address) and delivers this packet to another point (a public IP addr). That is all. Well, it is in fact usually much less than that. This ‘point-to-point service’ is not exactly what ISPs sell to us. If both points lie in the ISP network then yes, the ISP contract means end-to-end service inside that ISP’s ‘portion of internet’ (nothing very attractive for anyone to pay for), but what if the ‘destination point’ is outside the ISP? Does our ISP promise to deliver our IP packet to any public IP? Nope. Here lies the difference with Carrier Services. Internet Service cannot be sold as an ‘end-to-end’ service. It is impossible to provide that service. It is physically impossible to reach the required number of bilateral agreements to have a reasonable confidence that an IP packet can be delivered to any valid IP address. Internet Service is an ‘access service’. This means that you are promised that all reasonable effort will be made to put your packet ‘on the way’ to the destination IP, but you are never promised delivery. What does ‘all reasonable effort’ mean? This is the subject of much controversy, but usually national fair trade laws force the ISP to behave ‘correctly’ with locally originated traffic and ‘deliver’ this traffic in good condition to any Internet exchange or any peering point with any other company. That is all. Is this good for you? Will this ensure your IP packet is delivered, or better, ‘delivered in good time’? Nope. As internet exchanges and peering points may distract us from the focus of our discussion, let’s save these two concepts for a couple of paragraphs later. (See 6).

(NOTE: we will see later that Internet Service, not being end to end, is not currently under the legal denomination of ‘Common Carrier’, and that is extremely important for this discussion.)

The Internet service is essentially different from ‘Carrier services’ that you may be used to.

It is important to review classic Carrier services in search of resemblances & differences to Internet Service.


5. Classic Carrier Services

The paradigm of Carrier Services is ‘snail mail’, the traditional postal service. No company owns the whole of the postal resources across the world, but there is a tacit agreement between all of them to ‘terminate each other’s services’. Each postal company (usually owned by a government) deals internally with in-house messages, originated in its territory and addressed to its territory. When a message is addressed ‘out of territory’ the company at origin requests a fee from the sender that is usually proportional to the difficulty (distance) of reaching the destination. At the destination there is another postal company. This worldwide postal transport is always a business of two companies. The biggest cost is moving the letter from country to country and there is no controversy: the postal company in the originating country takes the burden and cost of moving the letter to the destination territory. And this effort is pre-paid by the sender. For the company at the destination, doing the final step of distribution takes no more than dealing with a local message. Of course the letter must be correctly addressed. These two companies usually do not exchange money. They consider all other postal companies to be ‘peers’, as roughly the same effort (cost) is involved in sending letters ‘out’ of territory (a few letters but a high cost per letter) as in distributing foreign letters ‘in’ (many more letters but at a low local cost per letter). Roughly, each company will spend the same sending out as it might request from all the other companies for delivering their messages. So it is just more ‘polite’ not to take payment for dealing with foreign origins and, in return, not to pay to send messages abroad. Notice also that the local postal company does not need to ‘prepare’ anything special to receive letters from outside the country. The same big central warehouse that is used to collect all local letters is used to receive letters from abroad.
This has worked well for postal companies for hundreds of years and it still works. Of course if a country falls in the rare state that no local people send letters out and at the same time local inhabitants receive tons of letters from other countries, the local postal company would have big losses as they have cost but no income. Anyway these situations are scarce if at all possible and usually postal companies have been subsidized or owned by the local government so losses have been taken as a ‘fact of life’.
Important facts to remember about this service: the sender pays per message. The originator company bears the whole cost of moving messages to destination postal company. Each destination may have a different price. Each destination may have a different delivery time. Letters above a certain size and weight will have extra cost proportional to actual size and weight and also to distance to destination. Local companies do not create additional infrastructure to receive foreign messages. There are no termination costs.

Another, more modern ‘Carrier Service’ is wired telephony. As in the postal service, no company owns the whole telephony network. As in the postal service, there exist local companies that take incoming calls from their territory and deliver the calls to their territory. When a call is addressed out of territory the caller must take a special action: adding a prefix that identifies the destination territory. In the destination country a local company has an explicit (not tacit) agreement with many other companies out there (not all) to terminate the incoming call. As in the postal service, the termination business always involves exactly two companies and the highest cost is transporting the call to the ‘doors’ of the destination company. As in the postal service, the caller (sender) pays for the extra cost. An important difference is that the caller usually pays at the end of the month for all the calls and not before. Again these telephony companies consider themselves ‘peers’ but with some important differences: in this service it is required to build and pay for a physical connection from company to company. In the postal service the originator company was free to hire trains, planes, trucks, ships or whatever means to carry letters to the door of the local post company. The volume of letters may vary substantially and this does not mean big trouble for anyone except the originator, which must pay for all the means of transport. The receiving infrastructure is exactly the same as for local letters and it is not changed in size or in function by foreign workload. In telephony the local company must allow the entrance of calls at a specific exchange station. Telephony works over switched circuits and this means that the originating company must extend its circuits to other companies in other countries and connect on purpose through a switch to the other company’s circuits. This now has a cost (which is not minor, by the way).
More important: this cost depends on the forecasted capacity that this exchange must have: the estimated amount of simultaneous calls that may come from the foreign company. Now the infrastructure for local calls cannot be simply ‘shared’ with foreign calls. We need to add new switches that cannot be used by local workload. Notice that every new company out there wanting access to my company circuits will require additional capacity from my switches. No telephony company will carry alone the cost of interconnection to other companies in other countries. Now ‘balance’ is important. If a telephony company A sends X simultaneous calls to company B and company B sends Y simultaneous calls to company A now it is very important to compare X to Y. In case they are similar: X~Y, ‘business politeness’ leads to no money being exchanged. In case A sends much more than B: X>>Y, B will charge for the extra cost of injecting A’s calls in B’s circuits. Remember that callers pay A, but B terminates calls and receives nothing for doing that.
Important facts to remember about this service: caller pays per call (or per traffic). The originator company bears the cost of extending circuits to the ‘door’ (switch) of the destination company. Each destination may have a different price for calls. Cost will be proportional to call duration and distance to destination. Local companies MUST create (and pay for) specific infrastructure (switches) and MUST reserve capacity PER EACH foreign company. This infrastructure MUST be planned in advance to avoid congestion. Cost of infrastructure is proportional to expected traffic (expected incoming simultaneous calls). There are termination costs in case of unbalanced traffic between companies.
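The X versus Y comparison described above can be written down as a tiny rule. The following is only a sketch of the reasoning, not a description of how any real telephony company settles accounts: the function name, the 20% tolerance that stands for ‘X~Y’ and the fee per excess call are all hypothetical values chosen for illustration.

```python
def settlement(calls_a_to_b, calls_b_to_a, tolerance=0.2, fee_per_excess_call=1.0):
    """Decide who pays whom for one interconnection between companies A and B.

    calls_a_to_b, calls_b_to_a: simultaneous calls sent in each direction (X and Y).
    tolerance: hypothetical threshold under which traffic counts as balanced (X~Y).
    fee_per_excess_call: hypothetical termination fee for each call of imbalance.
    Returns (payer, amount); payer is None when no money is exchanged.
    """
    larger = max(calls_a_to_b, calls_b_to_a)
    if larger == 0:
        return (None, 0.0)
    imbalance = abs(calls_a_to_b - calls_b_to_a)
    if imbalance / larger <= tolerance:
        # Balanced traffic: 'business politeness', nobody pays.
        return (None, 0.0)
    # Unbalanced traffic: the company injecting more calls pays for the excess.
    payer = "A" if calls_a_to_b > calls_b_to_a else "B"
    return (payer, imbalance * fee_per_excess_call)


if __name__ == "__main__":
    print(settlement(100, 95))    # balanced case
    print(settlement(1000, 100))  # A sends much more than B
```

The only point of the sketch is that the decision depends on the comparison between the two directions, not on the absolute volume of either one.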


6. Internet Service compared to Carrier Services

The Internet Service is sometimes viewed as similar to telephony. After all, in many cases telephony companies have picked up the responsibility (and the benefits) of providing Internet Service. But Internet Service is an access service, not an end-to-end service. How is this service built and run? An ISP builds a segment of IP network. If there are public IP addresses inside and they are ‘reachable’ from other public IP addresses, this segment is now part of internet. For the ISP it is no big deal to move packets to and from its own machines, its own IP addresses. The small ISP just applies ‘classic routing’ inside its segment. (Applying classic routing means: all routers in this small network share a common view of this small network and they run well-known algorithms that can determine the best path or route crossing this network from machine 1 to machine 2, possibly jumping through several routers inside the network. These routers have a distributed implementation of a shortest path algorithm based on selecting the next hop in a regularly re-computed routing table. As the required capacity of the routers depends on the number of IP addresses managed and the number of routers inside this ‘small’ network, there is a limit in cost and performance to the size of a network that can apply classical routing.)
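To make ‘classic routing’ a bit more tangible, here is a minimal sketch of how one router could compute its next-hop table with a shortest path algorithm. Dijkstra's algorithm is used here as one well-known option; real routers run distributed routing protocols, so this is only an illustration. The router names, link costs and the function `next_hop_table` are invented for the example.

```python
import heapq


def next_hop_table(links, source):
    """Compute a routing table for `source` inside one small network.

    links: dict mapping each router to a dict {neighbour: link cost}.
    Returns a dict {destination: (next hop, total cost)}.
    """
    table = {}
    done = set()
    # Each queue entry is (cost so far, router reached, first hop used from source).
    queue = [(0, source, None)]
    while queue:
        cost, router, first_hop = heapq.heappop(queue)
        if router in done:
            continue  # a cheaper path to this router was already settled
        done.add(router)
        if router != source:
            table[router] = (first_hop, cost)
        for neighbour, link_cost in links[router].items():
            if neighbour not in done:
                # Leaving the source, the first hop is the neighbour itself.
                hop = neighbour if router == source else first_hop
                heapq.heappush(queue, (cost + link_cost, neighbour, hop))
    return table


# Hypothetical four-router network with symmetric link costs.
LINKS = {
    "R1": {"R2": 1, "R3": 4},
    "R2": {"R1": 1, "R3": 1, "R4": 5},
    "R3": {"R1": 4, "R2": 1, "R4": 1},
    "R4": {"R2": 5, "R3": 1},
}

if __name__ == "__main__":
    for destination, (hop, cost) in sorted(next_hop_table(LINKS, "R1").items()):
        print(destination, "via", hop, "cost", cost)
```

Note that the whole map of links must be known to run this computation, which is why the approach stops scaling once the network grows beyond a certain size.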

What is interesting is what happens when destination IP is out of the ‘small network’. The new segment of internet does not have a clue about how to deliver to destination IP. That destination IP may be at the other end of the world and may belong to an ISP we have never heard of and of course we do not have business with them. The ISP does not feel bad about this. It is confident that ‘It is connected to internet’. How? The ISP is connected to a bigger ISP through a private transit connection and the smaller ISP pays for transit (pays for all the traffic going to the bigger ISP), or it is connected to a similar ISP through a peering connection, or it is connected to many other ISPs at an internet exchange. Usually peering happens between companies (ISPs) that are balanced in traffic, so following the same reasoning that was applied to telephony they do not pay each other. Internet exchanges are places in which physical connection to others is facilitated but nothing is assumed about traffic balance. The Internet Exchange is ‘the place’ but the actual traffic exchange must be governed by agreements 1-to-1 and can be limited to keep it balanced as ‘free peering’ (no charge) or on the contrary it may be measured and paid for as a ‘paid peering’.

We have said that smaller ISPs pay for transit. What is ‘transit’? Small ISPs route packets inside their small network, but to connect to internet they must direct all outgoing traffic through a bigger ISP’s router. This bigger ISP will see all the IP addresses from the small ISP as its own addresses and apply classical routing to and from its own machines. The bigger ISP supports the whole cost of the transit router. For an ISP to accept traffic in transit from smaller ISPs the transit routers must be dimensioned according to expected traffic. This big ISP may not be very big and so it may in turn direct traffic through transit to a bigger ISP… At the end of the chain, the biggest, worldwide ISPs are called ‘tier 1’. These are owners of huge networks; each of them is connected to all the other tier 1’s. They see IP addresses of other tier 1’s through ‘peering points’ in which they put powerful routers. The cost of the peering infrastructure is supported by both ISPs connecting there. They do not pay for traffic but they invest regularly in maintenance and capacity increases. It is of key importance to both peers to account for the traffic going through the peering point in both directions. They must keep it balanced. If an imbalance occurs it is either corrected or the ISP that injects substantially more traffic will have to pay for the extra effort it is causing on the other side.

We have not yet demonstrated that the IP packet coming from the small ISP can find its way to the destination IP. Let’s say that the destination IP belongs to a small ISP that is 20 ‘hops’ (jumps) away from origin. In the middle there can be ten or more middle size ISPs that pay for transit to bigger ISPs and there may be 3 tier 1 ISPs that peer with each other. The IP packet will be moved onto a higher rank network ‘blindly’ on its way up for a single reason: all routers on the way notice that they do not know where the destination IP lies, so their only possibility is to send the packet through the door (port) marked as ‘way out into internet’. At some point on this way up the IP packet will reach a tier 1 network that talks BGP to other tier 1’s. Some portion of the destination IP will match a big IP pool that all BGP speaking machines handle as a single AS (Autonomous System). Many tier 1’s have one or more AS registered. Many ISPs that are not tier 1’s also talk BGP and have registered one or more AS. What is important is that routers that talk BGP have a way of telling when an IP address is the responsibility of some AS. Let’s say that in this case the first moment at which our destination IP is matched to a ‘responsible destination network’ happens at a tier 1 router talking BGP. This router knows one or more ways (routes) from itself to the destination AS so it simply sends the packet to the next hop (router) that best suits its needs. The next router does the same and in this way our IP packet traverses the tier 1 networks. At one of the tier 1’s the destination IP address will be matched to a local sub-network; this means our packet can now be routed through classical routing algorithms. This classical routing will make our packet go down from bigger ISPs through transit routers to smaller ISPs until it reaches the destination machine.
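The difference between a small ISP router that only knows the ‘way out into internet’ and a BGP speaking router that can match a destination to a responsible AS can be sketched with a longest-prefix lookup. This is a simplified illustration, not how BGP itself works: the two tables are invented, the prefixes come from address ranges reserved for documentation, and the AS numbers are placeholders. I also assume for the example that the tier 1 table carries no default route.

```python
import ipaddress


def lookup(table, destination):
    """Return the target of the most specific prefix that contains `destination`.

    table: list of (ip_network, target) pairs.
    Returns None when no prefix matches at all.
    """
    addr = ipaddress.ip_address(destination)
    matches = [(net, target) for net, target in table if addr in net]
    if not matches:
        return None
    # The longest prefix (largest prefixlen) is the most specific match.
    _, target = max(matches, key=lambda match: match[0].prefixlen)
    return target


# Hypothetical small ISP: it knows its own segment, everything else goes 'up'.
SMALL_ISP = [
    (ipaddress.ip_network("192.0.2.0/24"), "local delivery"),
    (ipaddress.ip_network("0.0.0.0/0"), "transit uplink"),
]

# Hypothetical tier 1 router: prefixes mapped to the AS responsible for them.
TIER1 = [
    (ipaddress.ip_network("198.51.100.0/24"), "AS 64500"),
    (ipaddress.ip_network("203.0.113.0/24"), "AS 64501"),
    (ipaddress.ip_network("203.0.113.128/25"), "AS 64502"),
]

if __name__ == "__main__":
    destination = "203.0.113.200"
    print("small ISP sends it to:", lookup(SMALL_ISP, destination))
    print("tier 1 sends it to:", lookup(TIER1, destination))
```

The small ISP forwards the packet ‘blindly’ through its default route, and only the router holding the larger table can tell which network is responsible for the address.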

What has happened in comparison to our good old carrier services? Now no one ‘in the middle’ knows what happened to the packet. They only know they treated this packet ‘fairly’. Essentially transit ISPs just worry about the statistics of dropped packets in their transit routers and make sure that the number of dropped packets is kept inside a reasonable margin. For instance 1 drop per 10^5 packets is not a drama for any transit router. But notice that a sudden change in a remote part of the world may increase this router’s losses to 1 drop per 10^3 packets and there is little that the router owner can do. In fact all he can do is rely on global compensation mechanisms implemented at the highest routing level (BGP) that are supposed to slowly balance the load. But in the meantime lost packets are lost and they must be retransmitted if possible. In any case the application that is managing communication will suffer in one way or another. It is now impossible to plan for capacity end to end, as routes traverse many companies and these companies are left with the only resource of measuring their bilateral agreements and reacting to what happens. The transit companies cannot know when traffic is going to increase, as it may very well be that the originator of the traffic does not have contracts with any of them, so this originator is not going to inform all the companies in the middle of its capacity forecasts. It is especially difficult to cope with the dynamic nature of routing: at one moment this traffic causes effort to a certain set of companies and the next moment it causes effort to a different set of companies. For this reason Internet Service is NOT a Carrier Service, it does NOT carry a message ‘end to end’. The IP protocol works end to end, but it is in practice impossible to establish the responsibilities and trade duties of the companies involved in the journey of a single packet.
It is impossible to tell a customer of an ISP who (which companies) took part in carrying his packets to destination. It is impossible to tell him the average quality of the services that his packets used to reach destination. Worse than all of this, it is impossible for all the companies in the middle to prepare to give good service to all the traffic that may go through their routers at any time.
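A quick calculation shows how per-router drop rates add up along a path. The sketch below assumes that drops at each hop are independent of each other, which is a simplification, and reuses the figures mentioned in the text (20 hops, 1 drop per 10^5 packets, one router degrading to 1 drop per 10^3 packets). The function name is hypothetical.

```python
def delivery_probability(drop_rates):
    """Probability that one packet survives every hop on its path.

    drop_rates: per-hop drop probabilities, assumed independent of each other.
    """
    probability = 1.0
    for rate in drop_rates:
        probability *= 1.0 - rate
    return probability


if __name__ == "__main__":
    hops = 20
    normal_day = [1e-5] * hops
    # Same path, but a single router in the middle degrades to 1 drop per 10^3.
    bad_day = [1e-5] * (hops - 1) + [1e-3]

    for label, path in (("normal", normal_day), ("one hop degraded", bad_day)):
        lost_per_million = (1.0 - delivery_probability(path)) * 1e6
        print(f"{label}: about {lost_per_million:.0f} packets lost per million")
```

Under these assumptions a single degraded router in the middle multiplies the end-to-end loss several times, and neither the sender nor the other companies on the path can tell which hop is responsible.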

So in this terrible environment the companies carrying packets fall back to ‘statistical good behavior’.

For this reason ISPs cannot charge their users for ‘transactions’, as they are not responsible for terminating a transaction, nor are they able to set up a commercial agreement with one, or two, or one hundred companies that could make them assume the responsibility of guaranteeing the transaction. So, as they do not charge per transaction, they need a different business model to take traffic from net users. They have decided that the best model is to charge per traffic, considering traffic in a wide statistical sense. In the past, and still in mobile data today, they charge per total amount of information sent over a period of time: GBytes/month. It is more common to charge for ‘capacity available’ at access, not for ‘capacity used’: Mbps up/down. This is a dirty trick and a fallacy in business, as you may easily see: you are receiving an access service, and your ISP wants to charge you $40 a month for a 5/50 Mbps link no matter if you use it or not. But does this mean you can feed 5 Mbps to any destination IP in the Internet? Or do they guarantee that you can receive 50 Mbps from any combination of other IPs? Of course not. How could it be? Your ISP can at best move your 5 Mbps up to the next router in the hierarchy. But as you will see no ISP in the world will make a contract with you promising to do that. They will say they do not know how much of those 5 Mbps can go out up to internet.

I think it would be fair to force an ISP to measure incoming and outgoing throughput as an aggregate at the network edge. This means measuring all contributions in from: accesses + transit + peering, then measuring all contributions out to: accesses + transit + peering. Of course it is impossible to tell how many customers need to go ‘out’ at any moment so outgoing throughput may be sometimes a small fraction of incoming throughput. This ratio will probably vary wildly over time. The only number that should be forced as a quality measure onto the ISP is: the aggregate number of bytes taken from users at the edges of the network (from accesses + incoming transit + incoming peering) must equal the aggregate number of bytes delivered to network edges (out to accesses + out to transit + out to peering). If an ISP claims to provide ‘Internet Service’ it should be prepared to handle any situation, any possible distribution of incoming bytes to outgoing bytes.
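The quality measure proposed above is simple bookkeeping over counters at the network edge. Here is a minimal sketch of that bookkeeping; the three edge categories come from the text, while the function name, the byte counts and the idea of reporting a single ratio are assumptions made for the example.

```python
EDGES = ("accesses", "transit", "peering")


def delivered_ratio(bytes_in, bytes_out):
    """Aggregate bytes delivered to the network edges divided by bytes taken in.

    bytes_in, bytes_out: dicts with one byte counter per edge category.
    A ratio of 1.0 means every byte taken from an edge reached some edge.
    """
    total_in = sum(bytes_in[edge] for edge in EDGES)
    total_out = sum(bytes_out[edge] for edge in EDGES)
    if total_in == 0:
        return 1.0  # nothing was taken in, so nothing could be lost
    return total_out / total_in


if __name__ == "__main__":
    # Hypothetical counters for one measurement period, in bytes.
    taken_in = {"accesses": 600, "transit": 300, "peering": 100}
    delivered = {"accesses": 500, "transit": 350, "peering": 100}
    print(f"delivered ratio: {delivered_ratio(taken_in, delivered):.2f}")
```

The measure deliberately ignores how the bytes are distributed among accesses, transit and peering, since that distribution is exactly what the ISP cannot know in advance.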

Notice that it is cheaper if all bytes from local accesses go to other local accesses in the same ISP; in this case transit and peering are unused. Much more dramatic is the case in which all accesses suddenly need to send packets out through a transit router. This will not work in any ISP in the world. Packets will be lost massively at the transit routers. The usual business is to estimate the amount of outgoing traffic as a fraction of the local access traffic and then make transit contracts that guarantee ONLY that fraction of throughput out. These contracts are reviewed yearly, sometimes monthly, but there is not much more flexibility. A sudden surge of traffic still can and often does cause bad experiences for users who happen to be trying to push their packets through internet at the same time. This ‘resistance’ to go through is experienced in different ways by different applications: email will not suffer much, Voice over IP and videoconference will be almost impossible, and viewing movies can be affected seriously depending on the buffering model and average bitrate.
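The effect of contracting transit for only a fraction of the access capacity can be put into numbers. The sketch below is a deliberately crude model: it assumes that every byte above the contracted throughput is simply dropped, and the figures (10 Gbps of aggregated accesses, 10% contracted as transit) are invented for the example, as is the function name.

```python
def transit_loss_fraction(access_capacity_mbps, contracted_fraction, outbound_fraction):
    """Share of outbound traffic that cannot fit through the transit contract.

    access_capacity_mbps: aggregated upstream capacity sold to local accesses.
    contracted_fraction: part of that capacity the ISP contracted as transit.
    outbound_fraction: part of that capacity that users try to send out right now.
    Assumes everything above the contracted throughput is dropped.
    """
    contracted = access_capacity_mbps * contracted_fraction
    demand = access_capacity_mbps * outbound_fraction
    if demand <= contracted:
        return 0.0
    return (demand - contracted) / demand


if __name__ == "__main__":
    capacity = 10_000  # hypothetical: 10 Gbps of aggregated upstream accesses
    for outbound in (0.05, 0.2, 1.0):
        loss = transit_loss_fraction(capacity, 0.1, outbound)
        print(f"{outbound:.0%} of accesses going out: {loss:.0%} of that traffic lost")
```

In this model the loss stays at zero for as long as demand remains under the contracted fraction and then grows quickly, which matches the ‘works fine until the surge’ behavior described above.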

You could hardly convince an ISP to over dimension transit contracts out. What happens if this ISP sees the transit unused for 12 months while still having to pay a tremendous amount of money for those expensive transit contracts? Naturally the ISP will soon reduce the contracts to the point in which the transit carries just the traffic the ISP is paying for. Unfortunately the interconnection machinery cannot be made flexible; it cannot be made to increase capacity as needed. For many reasons this is impossible. As you can see a cascading effect will be created if a few ISPs start over dimensioning their peering/transit machinery while keeping their contracts low… In case they need to flush a sudden surge of traffic the next ISP in the chain not having over dimensioned machinery will not be able to cope with the surge. Also notice that paying for a router that can handle 2X or 5X the traffic it actually has is very bad business and no one will do it.

Important facts to remember about this service: sender does NOT pay per message. He pays per ‘max sending capacity installed at his premises’. The originator company does NOT bear the cost of extending circuits to the ‘door’ (switch) of the destination company. The originator company extends circuits JUST TO ANY other carrier in the middle. Interconnection machinery costs are supported by bigger carriers, not by smaller ones. Interconnection cost between equal sized carriers is shared between them. Each destination will have exactly the SAME price. Cost can NOT be proportional to distance to destination. Service is NOT an end to end transaction. Local companies MUST pay for using specific infrastructure (transit) to go out of territory; and they MUST build specific infrastructure and reserve capacity for EACH transit company that carries traffic into its territory. Both the outgoing contract and the inbound infrastructure MUST be planned in advance to avoid congestion. Cost of infrastructure is proportional to expected traffic (expected aggregated rate of packets/s). It is impossible to forecast the traffic of an incoming connection from a transit company as this company is a dynamic middle-man to an unknown number of traffic sources. There are transit costs in case of unbalanced traffic between companies; the small ISP pays the big ISP. Equal size ISPs do not charge each other for traffic they just ‘peer’. There are infrastructure costs in every interconnection. Both ‘peers’ and transit providers spend lots of money buying and maintaining interconnection machinery.


7. Mapping ‘net neutrality’ to the crude reality of Internet access service

We have seen that Internet Service is an Access Service, not a Carrier Service. This fact is reflected in some legislation, particularly in the Anglo-Saxon world, under the concept of ‘Common Carrier Services’. These services include but are not limited to postal service and wired telephony. These are services so important that they have usually been recognized as ‘public interest services’, and thus governments have interfered with the market rules to make sure that these services were widely available, non-abusive, reliable, affordable, responsive… beyond the pure market dynamics.

So when we say ‘do not treat IP packets differently in case they belong to different net users’, how does this statement map to the Internet Service that we have described in the previous paragraph?

How can an ISP comply with the above definition of ‘net neutrality’?

ISPs deal with packets; they have to route packets inside their network and also to the next middle-man in Internet. They get money from accesses (for availability) and charge/pay for transit contracts (for available capacity and/or transferred traffic). Can they reconcile their business model with net neutrality? Yes, I do not see the problem. It is fairly simple. The flaw is in the business model. It is a very weak proposition to sell ‘a reasonable effort to put your packets on their way to destination’. I’m sure people only buy this service because no other alternative is available. It is easy to drop packets at every interface when there is congestion, possibly frustrating some users out there, and at the same time keep my promise of treating all net users equally (badly), and at the same time maintain my business model based on ‘reasonable effort’. Who decides what a reasonable effort is? Currently entities like the FCC have a hard time trying to regulate Internet Service. As they cannot treat it as ‘Common Carrier’ it would not be fair to force ISPs to have a strict policy on transit contracts. How could they create this policy? Let’s say that the FCC forces every ISP to contract transit for a total amount of 10% of its upstream aggregated accesses… Is this enough? Who knows… It is impossible to tell. Would this guarantee the absence of congestion? No. Internet traffic varies wildly. You see, the actions of the regulator would be very unfair for all ISPs and in the end they would not solve congestion.


8. CDNs… businesses beyond plain Internet Access

Now we have seen that plain Internet Service can be kept neutral due to the fact that ‘access business model’ is a weak commercial proposition essentially easy to accomplish to-the-letter while frustrating users.

Are there other Internet Services that cannot be reconciled with net neutrality? Do these other services (if any exist) distort the plain Internet Service? Is any ISP in the world abusing new technology to violate net neutrality, despite being easy to maintain strictly the neutrality claim? I will try to address all these interesting questions.

CDNs (Content Delivery Networks) are businesses that go beyond Internet Service. I’m sure you do not find it strange that companies with an important message to communicate are NOT happy with a service that promises ‘a reasonable effort to put your packets on their way to destination’. I would not be happy either. If I had the need and the money to pay for it I would buy something better.

CDNs are end to end services. CDNs are true carrier services. Very interestingly CDNs are not regulated as Common Carrier (not yet at least), but in my humble opinion they are as good carriers as any other end to end service. They sell transactions. They also sell traffic, but not bulk, best effort delivery, they impose SLAs to delivery. CDNs work to build special routing through known paths through internet, so they avoid the middle-man loss of responsibility. Once you know all the actors in the middle you can distribute responsibility and cost and you can make the service work with any quality parameter that you would like to set end to end.

Of course sending bytes through a CDN implies a loss of flexibility. Now routing cannot be the all-purpose dynamic routing of internet. You have to place machines of your own in the middle of the way from one end to another. You have to do many special agreements with network owners for collocation; you have to hire private links, install your own powerful routers, and install your own intermediate storages. All these actions cost an incredible amount of money. Who is going to pay for this? Of course the sender will pay.

Does CDN service violate net neutrality? No. Why? CDNs treat packets very differently from plain Internet Service. But who is the owner of these packets? Is it you at home viewing a movie? Nope. The packets you receive are courtesy of someone that paid a CDN to host them. You request an item (a movie) by sending packets as part of your Internet Service. In this Internet Service your packets can be lost with the same probability as those of any other user sending email, viewing a non-CDNized movie, chatting, or whatever. But when you receive a response that is ‘sponsored’ by someone through a CDN, special care is taken, and not by your Internet Service Provider, no, do not fool yourself: it is through the resources of this ‘someone’ and this CDN that special actions happen to the packets that reach you. It is not ‘your’ service anymore. It is this ‘someone’s’ service that is better. But the benefit is all FOR YOU.

We can now compare CDN service to our good old Carrier Services. You can imagine that you use regular Royal Mail/US Mail/any national mail… to request an important document from an entity (maybe even from a Government). Your request is nothing ‘special’ in size, urgency, quality or confidentiality, so regular mail service is just OK to carry it. You are using an entirely neutral service. The response to you is a unique and valuable object/document, so the responder pays a Courier service to deliver it urgently and securely to you. Does this violate the neutrality of the postal service? No, absolutely not. When you receive this high quality service you are not diminishing or subtracting resources from the regular postal service. You do not even pay anything for the extra quality; it is the sender who ‘makes you a present’ by enhancing the speed, security and reliability of delivery. The extra effort is made by an independent third party and this party receives extra payment, which is completely fair. No one violates postal service neutrality by providing Courier Services.

Have you ever wondered if the existence of Courier companies could be violating ‘postal service neutrality’? Are the DHLs and UPSs of this world ‘bad people’ because they make money offering better service than National Mail services? Of course they are not. At the same time you would like Courier prices to be lower if possible but that depends only on the differential quality vs National Mail and the price of National Mail.


9. Regulation

Have you ever wondered why so many people claim ‘for a regulation’ over many things? They want regulation over Internet behavior, over telecommunications prices, over highways, over their capacity and their pricing… We are all the time asking ‘someone’ to come and regulate things. No one seems to have a problem with that. Don’t you think there should be limits to regulation? These claims are childish.

Regulation has had a good effect on ‘public interest services’. As we have said there is a fair number of these services in our world: water distribution, postal service, energy, telephony, first aid and emergency health services (not in all countries), education (not in all countries)… The regulator places itself above the market and disrupts the ‘pure’ market behavior. Of course, to do this, the role of regulator can only be taken by someone with a higher authority than money can buy. Only Governments can do it and they usually do it. There are enormous differences in regulation coming from different cultures and political arrangements.

But even regulation cannot work without legal limits. In the Anglo-Saxon legal tradition the figure of ‘Common Carrier’ defines the limits of public interest Carrier service to be a candidate to be regulated by the Government. At least it tries to set the conditions in which a service can be considered to be ‘Carrying messages for anyone without discrimination’ and thus can be considered ‘public interest’ and be regulated to ensure that everyone can have equal access to such a service. It comes from the postal world by the way.

Another reason for an authority to intervene in a service is the ‘natural right’ that emanates from the ownership of resources. Big, common, public infrastructures like the ones needed for water transportation, energy, litter, telecommunications, roads and highways, postal service… need to ‘take’ terrain that belongs to the community and restrict it to a new usage. This capture of resources is done to serve the community, but some authority that represents the community (a government) must take control of this action so that in the end the community does not receive more damage than benefit.

Internet does not consume physical space (at least nothing that would bother any community). Installations of cabling may cross national borders, like trucks working for the postal service do, but there is no need to make ‘border checks’ on information bits, as they are all born equal and harmless in the physical world. There are no national borders for telephony cabling. Companies do not pay fees to governments to cross the borders with cabling. So you start to see that there is no ‘natural right’ to regulate telecommunication emanating from community resources. The only reason to allow for regulation comes from the ‘public utility’ of being connected to internet.

No one doubts today that there is a value in having access to internet. It is an economic, social, political, personal value. So internet access has become like water, energy, health, education. But at the same time notice that these important ‘public interest matters’ are not equally regulated all across the planet. Why would you expect that internet access will be?


10. DPI, transparent caching and communication secrecy

I have mentioned DPI as a new technology that allows breaking network neutrality (in case some router owner is very interested in breaking it).

There is bigger controversy about DPI than just allowing for unfair traffic handling. Notice that if DPI allows someone to harm your traffic by imposing a ‘higher resistance’ to cross the Internet, compared to the ‘average resistance’ that all users suffer… prior to causing this damage the one that applied DPI must have had access to information in the upper level protocols (above layer 3). This is comparable in the world of Carrier services to ‘looking inside the envelope’. This violates communication secrecy. It is a major sin.

In life there are envelopes for a reason. On the outside you place information for the handler; on the inside you place information for the receiver. You do not want the handler to eavesdrop inside your envelopes. Regulation of the postal service not only helped to ensure reasonable prices and coverage of the whole territory, so that anyone has the ‘right’ to send and receive letters. Postal regulation also set up an authority (usually the Government) that protects the secrecy of communication, and it is this authority that prosecutes infringers. And this is very serious in most parts of the world.
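The envelope idea can be sketched with a toy IPv4 packet in Python. This is only an illustration under stated assumptions: the addresses, the payload and the function names `handler_view` and `dpi_view` are made up for the example, and the toy packet omits the TCP header that a real packet would carry before the payload. The point is simply that forwarding needs the first 20 bytes (the outside of the envelope), while DPI has to read what comes after them.

```python
import socket
import struct


def handler_view(packet: bytes) -> dict:
    """What a layer-3 handler needs: only IPv4 header fields (outside of the envelope)."""
    version_ihl, _tos, total_len, _ident, _frag, ttl, proto, _csum, src, dst = struct.unpack(
        "!BBHHHBBH4s4s", packet[:20]
    )
    return {
        "header_bytes": (version_ihl & 0x0F) * 4,  # IHL is counted in 32-bit words
        "total_length": total_len,
        "ttl": ttl,
        "protocol": proto,
        "src": socket.inet_ntoa(src),
        "dst": socket.inet_ntoa(dst),
    }


def dpi_view(packet: bytes) -> bytes:
    """What DPI additionally reads: everything above layer 3 (inside of the envelope)."""
    header_bytes = (packet[0] & 0x0F) * 4
    return packet[header_bytes:]


# Toy packet (hypothetical addresses and payload; no TCP header, for brevity).
payload = b"GET /movie.mp4 HTTP/1.1\r\n"
header = struct.pack(
    "!BBHHHBBH4s4s",
    0x45, 0, 20 + len(payload), 0, 0, 64, 6, 0,
    socket.inet_aton("192.0.2.1"), socket.inet_aton("198.51.100.7"),
)
packet = header + payload

outside = handler_view(packet)  # enough to deliver the packet
inside = dpi_view(packet)       # needed only to inspect the content
```

A handler that limits itself to `handler_view` can still deliver every packet; anything that calls the equivalent of `dpi_view` is, in the terms used above, looking inside the envelope.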

Wired telephony inherited the protection of the postal service, so telephone calls cannot be intercepted without legal authorization. Both services, postal and telephony, have evolved into the Internet. Has the Internet inherited the protection of carrier services? Oh, it is difficult to tell. My first reaction would be to answer: no. Not yet. You will need to review the question country by country.

Not being a ‘Common Carrier’, things get messy. There are some rules out there that seem to protect ‘all communications’. Out of the Anglo-Saxon world, in Europe, many countries have rules that protect secrecy in communication and that seems to cover Internet messaging. But these countries find difficulties in distributing responsibility to essentially an unknown number of message-handlers in-between sender and receiver.

One good example is ‘caching services’. CDNs have been considered caching services in many regulations.

Did you know that for effective caching it is necessary to eavesdrop inside your messages? Did you know that early caching services started to do it without telling anyone and without permission of sender or receiver? For this very reason many early caching services were found as violators of secrecy and closed.

As caching turned out to be ‘useful’ for ‘common messaging’, that is, good for the sender and the user in many circumstances, law-makers were ‘compelled’ to modify the secrecy protection by allowing exceptions. The ‘caching exception’ has been written into ‘Common Carrier’ laws all around as a list of circumstances that limit the purpose and ways in which information ‘inside the envelope’ can be accessed by the handler.

Of course this is just a ‘patch’ to the law. Smart people can now eavesdrop into your messages claiming they adhere to the ‘caching exception’ to secrecy. Like any patch, this is a dirty and misaligned thing in the face of a very solid basic right such as communication secrecy.

How to overcome ‘secrecy protection’ to offer CDN service? Easy: ask the sender for permission to eavesdrop. The sender is not the one that receives the traffic (a movie for example) but the one who hires a CDN to serve the movie, so the services contract includes a clause allowing technical middle-man packet inspection for caching purposes that complies with the ‘caching exception’ rules. The movie viewer cannot complain. The movie owner does not want to complain; he is happy about the caching.

What about transparent caching? If I do not hire a CDN… can anyone in the middle inspect my messages claiming a ‘caching exception’? Of course not, but sometimes they do. Some ISPs install transparent caches. They inspect traffic from ANY source in search of clues to cache repeated content. They do not ask anyone for permission to do that. Prior to ‘caching exceptions’ they could be considered liable for secrecy violation. Today you would need to take the laws of the country in which the DPI/cache is physically placed and carefully study the method and purpose of transparent caching. In many circumstances you will have a legal case against the one who is doing DPI/transparent caching.

Did you know that to avoid legal prosecution it is very probable that you have been made to sign a clause in your ISP contract allowing this ISP to perform DPI/transparent caching? Of course this clause does not say ‘…we hereby are granted permission to eavesdrop…’ No, the clause will more or less say ‘…we are granted permission to manipulate traffic through technical means including caching under the caching exceptions to telecommunication laws…’.

The fact is that asking for permission is the best way to eavesdrop. There is a well-known company that gives you free email, but in exchange you allow them to classify your messages by inspecting everything inside them.

Another fact is that if someone does not have a contract with you he cannot ask for your permission nor receive it to look into your traffic. That is, if someone different from my ISP places a cache in the middle of my traffic (for example he caches a web page of my own, or intercepts traffic from any server at my home), or anyone does DPI on packets going out of my home, not being my ISP it is impossible that he asked me for permission, and thus I may not agree with him eavesdropping into my messages.

It is important to notice that this is happening and you can do very little to stop it. You could figure out that an ISP in the middle is doing transparent caching, find the country in which the DPI/cache is placed, find the specific law of that country, (try to) find the specific method and purpose the ISP applies and if you find yourself with enough money and strength take them to court. Honestly you do not have much hope of success.


11. Conclusion

We have seen that net neutrality is about dealing with traffic in equal conditions independently of the identity of traffic owner.

We have seen that, as of recently, technology makes it possible to break neutrality. But the violator still needs a reason.

We have seen that Internet service is not a Carrier Service; it is not end-to-end, it is an Access Service.

We have seen that from the legal perspective, Internet Service is not a ‘Common Carrier’ service.

We have seen that regulators, like the FCC, cannot simply force ISPs to increase any capacity in any interconnection. We have seen it will not address congestion problems.

We have seen that neutrality is violated every day by transparent caching and DPI. We have seen that a ‘patch’ has been applied to law to allow violating secrecy to a certain extent.

It seems clear that even supposing that DPI/transparent caching is lawful (which in many cases is objectionable), once DPI has been performed the router owner could do other things that go beyond violating secrets. He can use the information to change the natural balance of traffic. He can prioritize at will.

This prioritization can be a net neutrality violation.

As Net Neutrality is not a law, not even a ‘law principle’, but just a policy, that is, a recommendation, no one can take an ISP to court over ‘creative traffic engineering’ once it has been proved that the DPI performed by this ISP was lawful (under the caching exceptions or allowed by the ISP-user contract).

It is still possible to take to court ISPs and service providers that have not asked you for permission to unbalance your traffic and that cannot allege a lawful caching exception.

Applying these conclusions to some recent cases of ‘famous movie Distribution Company’ vs ‘famous ISP’ you can see that the regulator, or the courts will have a very difficult day (or year) in dealing with their responsibility to take control of the behavior of the ISP or the behavior of the distribution company.

The most probable development is that these complaints will be judged under trade laws, not under communication laws. The courts will not feel competent to apply ‘recommendations’ such as ‘net neutrality’ but they will be happy to look for contract infringements.

What is uncertain is if they will find any contract infringement. In my own view it is very likely they won’t.

We can conclude that ‘Net Neutrality’ is an aspiration; it is not backed by law.

Net neutrality is a complex issue that requires society, companies, law and courts to mature and change.

Today Net neutrality is badly understood. I have had the sad experience to read articles, even from reputed writers and journalists that usually have a clear understanding of the topics they deal with, that completely missed the point.

They miss the point because they let themselves be carried away by rage and by a misleading comparison to ‘basic human rights’. They feel it is important to do something to guarantee neutrality… and they fail to realize that the network is essentially neutral and that someone who does not get his traffic through cannot claim the network is not neutral.

At the same time there are neutrality violators (DPI/transparent caching) but our society has created laws to support them. It is important to realize that these violations are serious and laws must be changed.

I hope that this long reflection about all factors involved in Net Neutrality may have been interesting to all of you.

Have a nice and neutral day.


CDN Performance Management: bridging the gap between business and technology

(Download this article as PDF:Performance Management in content services v3 )


1. Introduction:

In the CDN industry and ecosystem there are many pieces, technologies, companies and roles. One that is of key importance is ‘Performance Management’.

After all, CDNs exist as businesses for only one reason: the ‘standard’ worldwide network (Internet) does not perform well enough in content distribution.  All of us dream of a day in which the Internet will handle seamlessly any type of content without interruption, degradation or excessive delay.  OK, probably pure CDNs do not dream of that day, as their businesses will change dramatically. But they will adapt.

So CDNs are all about improving performance of the underlying network, and thus it would make sense that anyone running a CDN takes Performance Management very seriously.

You will be surprised to know that this is not always the case. It is amazing how many misinterpretations of ‘performance’ are out there, especially in the CDN ecosystem. It is very common to find companies offering ‘performance data’ that in fact do not reveal anything about the performance of a CDN. Others offer data that are a collection of ‘insights about performance’ but cannot connect these insights with any actions that the CDN designers and admins could possibly take, so you can only get anxious about any problems that you discover… Some others completely miss the point by looking at performance figures not related to content distribution… Most of the time the problem is that some people do not know the right definition of ‘performance’ for a given system, or which information about performance they should collect in their system, or how to handle that information to their advantage.


2. Do people know what ‘Performance’ is?   

You are probably thinking now: ‘oh man that is obvious. I’ve known that for years. Please do not offend me…’ Well, not so long ago I did an experiment. I asked a new technology support team that was supposed to operate a new CDN: what data would you offer me about the performance of my CDN? After a moderately long time they responded with a template from a well-known industry tool, a commercial SW package from a big technology vendor. It was a standard template based on instrumentation agents for standard OSes (Windows, Linux…). The template offered, for ALL machines in the CDN, the following information: %CPU occupation, %RAM occupation, %HD occupation. That was all. Every machine was reported in the same way: streamers, web caches, backend servers, routers…

I went back to the team and said: “… hey guys, I’ve got a problem, probably I did not make myself clear. I want to know as much as possible about the performance of my CDN. What can you offer me?” They started to look suspiciously at me (I started to think suspiciously of them…). They repeated their answer. It was then clear to me that I was in front of a ‘standard IT/OSS operations team’. They had received the blueprints and manuals of the new CDN (by the way, not a commercial one, so many new things were inside, quite non-standard for any datacenter except for a CDN datacenter and thus outside standard OSS experience) and they had addressed them, in good faith, as if the CDN were a collection of computing machines in need of processor, RAM, HD, network and power. No less. No more.

It took tremendous effort in terms of time, money and human resources to close the existing gap about ‘performance management’ in that CDN. But in doing that effort many rewards were received: many SW bugs were found that limited the expected performance, some HW pieces were found working out of specs, some architectural principles had been violated in field deployments, some parts had been badly deployed…

It turned out that despite the existence of many Quality Assurance teams, many test environments and many controls in manufacturing, SW development and deployment, there was no coordination between teams and no one in management was concerned enough about the end-to-end performance of the whole system.


3. Performance: an everyday concept.

Today ‘performance’ has turned into a buzzword of the technical world but it is good to remember its everyday meaning. To perform is to ‘act’, to do something. Your ‘performance’ can either be your action itself or, if taken as a reference to continued action, the outstanding properties of your continued action. In brief, your performance is an analysis of ‘how you do what you do’: how accurate, how costly, how fast, how reliable, how regular, how efficient…

From the logical-philosophical point of view a ‘complete’ description of the performance of even a moderately complex system could be close to impossible. Some properties of your performance will be relevant for some analysts of your action and some other properties will be relevant for others. We will concentrate on a practical view of performance that helps monetize a technological business, in this case a CDN.


4. Performance Management

Performance Management is all about knowing what is important to you from among your ‘behaviors’, retrieving all possible information about those relevant behaviors, and connecting it to your action planning so you can: 1- avoid situations that you do not want and 2- improve your behaviors through feedback.

You have to think of what is important to know about your CDN as a business… and know even more clearly what is not important, so you do not waste your time and money on it.

Businesses are used to handling KPIs: Key Performance Indicators. They work hard to find the key properties of the business behavior, then look at them (collect series of values) and react according to what they see. Typical KPIs in businesses may be: cost, revenue and derivatives: benefit, EBITDA, efficiencies… Unfortunately CDN machines and SW do not come out of the box with an interface to query them about the whole CDN costs, or the revenue, benefits, etc… Even the smartest off-the-shelf SW suite designed for performance analysis can retrieve just some part of the information, and it does not have any clue about the influence of that part on the behavior of your very specific business/CDN, and thus it does not have the KPIs that you want to build. It is you, the business owner, the CDN owner, the CDN designer at large (you design the strategy of the CDN business as well as the architecture), who needs to build KPIs from simpler pieces of information picked up from here and there. Of course there are tools, including SW suites, that correctly used can help you a lot.

KPIs that make sense are directly related to the business. It does not make sense to monitor behaviours just for the sake of ‘knowledge’. You must have a clear purpose. Think of the purpose of your CDN. You as a CDN designer/owner are selling an IMPROVED delivery service over the internet, so your customers expect your CDN to behave BETTER than not having a CDN at all (of course). Your customers are concerned with: time, accuracy, reliability and confidentiality of the delivery. You must also be concerned about all these properties of your CDN and you must gather the best possible information about how you are doing in these departments.

You want to have KPIs defined about: latency (first byte time, first frame time, DNS resolution time, re-buffering time, re-buffering ratio, re-buffering events per period,…), either average or instant; usable throughput, or ‘goodput’, measured in many places; traffic degradation (bytes lost, frames lost, bytes corrupted, artifacts shown,…) measured in many places… and probably many other KPIs: KPIs that allow you to know the REAL value of your CDN for your customers (this is good to help you in pricing your service), KPIs that help you propose SLOs and SLAs or accept the SLAs your customers want to have, KPIs that let you anticipate upcoming facts far enough in advance (for instance reaching top capacity), KPIs that help you learn about your business dynamics and trends so you can modify your products and/or create new ones, KPIs that let you know your operational efficiency ratio (how many dollars it costs you to generate any new dollar), KPIs that let you discover that you should do some things differently, KPIs that can be shown to your stakeholders to make them confident and proud of your CDN, KPIs that let you compare yourself to your competitors…
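As a minimal sketch of how a few of these KPIs could be built ‘from simpler pieces of information’, assume a hypothetical per-session record collected by some player-side agent. The `Session` type and its field names are illustrative assumptions, not part of any real player or CDN API, and the exact definitions (for example what counts as ‘useful’ bytes) are choices you would have to make for your own business.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Session:
    """Hypothetical per-session record (field names are assumptions)."""
    first_byte_ms: float        # request sent -> first byte received
    play_seconds: float         # time spent actually playing
    stall_seconds: List[float]  # duration of each re-buffering event
    bytes_useful: int           # bytes that were actually rendered


def session_seconds(s: Session) -> float:
    """Total viewing time: playing plus stalled."""
    return s.play_seconds + sum(s.stall_seconds)


def rebuffering_ratio(s: Session) -> float:
    """Fraction of the viewing time spent stalled."""
    total = session_seconds(s)
    return sum(s.stall_seconds) / total if total > 0 else 0.0


def rebuffering_events_per_minute(s: Session) -> float:
    """Number of re-buffering events per minute of viewing time."""
    minutes = session_seconds(s) / 60.0
    return len(s.stall_seconds) / minutes if minutes > 0 else 0.0


def goodput_bps(s: Session) -> float:
    """Usable throughput: useful bits delivered per second of viewing time."""
    total = session_seconds(s)
    return 8 * s.bytes_useful / total if total > 0 else 0.0


# Made-up example session: 114 s playing, two stalls of 2 s and 4 s.
example = Session(first_byte_ms=4.0, play_seconds=114.0,
                  stall_seconds=[2.0, 4.0], bytes_useful=30_000_000)
```

For the made-up session above, the viewing time is 120 s, so the re-buffering ratio is 6/120 = 0.05, there is one re-buffering event per minute and the goodput is 2 Mbit/s.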


5. What performance of a CDN is not.

Performance of a complex information system is NOT about ‘just collecting’ tons of data about everything. Analyzing your CDN infrastructure as if it were the CERN’s Large Hadron Collider does not make sense. For sure you can get tons of data about the behavior of even the simplest computer system. And it would be easy, you just buy a ‘performance analysis package’, check all monitoring options and ‘voilà’ you will get more data than you could cope with in ten lifetimes. The lesson is: think first of what you want to discover, and then look for related data. Don’t let package vendors tell you what is important for your business.

Any serious performance management approach is a purpose-specific task, and it starts from the highest level of your business concepts and goes all the way down to the level of detail that you can afford. You should stop at the point at which you are incurring disproportionate costs to uncover your behavior… If the task of analyzing your behavior were free it would have virtually no limits. You could benefit in so many ways from knowing intricate data about your behavior that the performance management task could easily be more complex than running your own business. (Big Data fans painfully start to notice this…)

Of course many of the KPIs that are legitimate in a CDN (some examples were given in the previous section), like time-measures: first byte time, first frame time, etc… are in some way related to computing variables. These ‘time-measures’ depend on the available RAM, the available cycles on a processor core, the available HD throughput, the available network throughput… of some or all the machines in the CDN at the same time. The dependence of business-valuable KPIs (examples above), which usually are end-to-end system-wide measures, on computing variables measured in one or more machines is so complex that it is completely useless to try to establish that relationship analytically. And thus it is also completely useless to present these directly-measured computing properties as KPIs.

To say it in lay words: any graph of your caches’ %CPU occupation (for instance) will give you a very poor indication of what is going on in your CDN business. Of course knowing that is better than nothing. If you look at such a graph produced, let’s say, every 10 seconds you get a huge amount of ‘data’ but you only get very little ‘information’. In this very specific example, you for sure realize that it is not the same thing to be running at 1% CPU as to be running at 99% CPU. At 1% you may think you are not doing much business, though it could be perfectly normal and respectable, and at 99% you may be perfectly OK or maybe you are down due to overload; it depends on your CDN design and many other things. The key point is that %CPU occupation (to name just one variable) is NOT a KPI for you; it carries very little information. It is also information that is very difficult to act upon. This is also the case with RAM occupation, HD occupation, link occupation, etc… All these observations carry very little value by themselves. To be useful as KPIs their contribution to at least one business KPI should be established clearly… and in that case what is valuable is the interpretation of the resulting KPI. It is a waste of resources to retrieve, process and represent the wrong data.


6. Characterization versus Modelling

As I have proposed it is useless to try to derive the true analytical relationship between observable values of computing properties and ‘useful business KPIs’. This statement is too general to not be immediately challenged by bold engineers and statisticians coming from the technical field. OK. Let me then say that at least I personally find that to derive analytical relationships between these observable properties and ‘end-to-end system-wide properties’ is impractical. These functions most of the times do not exist. Sometimes you can only create a multidimensional ‘mapping’ (new word for ‘function’ popular today among mathematicians) made of disjoint patches in all the independent variables domains.  Essentially you have to pick a finite number N of independent variables (observable computing properties), and try to ‘map’ an N-vector of values to the value of a system-wide property ‘S’ (can be another vector or a scalar), so there is a mapping from vector domain [N] to scalar domain S or vector domain [S] .

To start with, you cannot be sure to have chosen all the relevant independent variables in your system, and you cannot be sure they are really mutually independent. To continue, there are many complex physical relationships that link these variables, and many times you just have models provided by manufacturers and third parties. Usually these models are limited and, worse than that, you do not have enough documentation on their theoretical applicability limitations.

Building this analytical mapping is a ‘Modelling’ process in which we substitute our system (our CDN) by a ‘Model’ with N inputs and M outputs and we derive the analytical relationship between input values and output values. Modelling seems at least ‘impractical’ if not impossible. So, which is the alternative?

The alternative is ‘Characterization’. It is an old, very practical tool used in engineering. If your system shows an exceedingly complex relationship between inputs and outputs you can at least try to figure out a set of your most relevant inputs, put your system in isolation from any other influence and observe output values evolve while you make your system traverse through all possible input values. You draw the curves stimulus-response. Stimulus is, in general, an N-vector. Response can be also an M-vector, though it is more practical to choose a 1-dimension response domain (for example mapping re-buffering ratio as a function of number of streams and bitrate of streams). You may end up with no possibility of any graphical representation in case N>2 or in case M>1. These cases unfortunately cannot be skipped sometimes and then we will have to live with planar projections or parallel cuts of the ‘mathematical multidimensional objects’ that we produce.
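The traversal of the input space can be sketched in a few lines of Python. This is only a sketch under stated assumptions: `characterize` is a hypothetical helper name, and `fake_lab_measurement` is a made-up stand-in for a real lab rig (it is NOT a model of any real cache; its 10 Gbps capacity and 75% saturation point are invented so the example can run).

```python
import itertools
import random
from typing import Callable, Dict, List, Sequence, Tuple

Stimulus = Tuple[float, ...]


def characterize(
    measure: Callable[..., float],
    input_ranges: Sequence[Sequence[float]],
    trials: int = 5,
) -> Dict[Stimulus, List[float]]:
    """Traverse every combination of input values; record one response per trial."""
    curves: Dict[Stimulus, List[float]] = {}
    for stimulus in itertools.product(*input_ranges):
        curves[stimulus] = [measure(*stimulus) for _ in range(trials)]
    return curves


_rng = random.Random(0)  # seeded so the example is repeatable


def fake_lab_measurement(streams: float, bitrate_mbps: float) -> float:
    """Made-up re-buffering ratio: stalls appear only near an invented capacity."""
    load = streams * bitrate_mbps / 10_000.0  # assumed 10 Gbps of capacity
    base = max(0.0, load - 0.75)              # assumed saturation at 75% load
    return base + _rng.uniform(0.0, 0.01)     # trial-to-trial variability


# N=2 stimulus (number of streams, bitrate in Mbps), 1-dimension response.
curves = characterize(
    fake_lab_measurement,
    input_ranges=[(500, 1000, 2000), (2.0, 5.0)],
    trials=3,
)
```

With 3 stream counts and 2 bitrates the sweep visits 6 input vectors, each with 3 recorded responses. In a real lab `measure` would drive the isolated entity and read the output from instrumentation instead of computing it.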


7. Behavior curves: curves stimulus-response

These ‘curves’ stimulus-response are non-analytical representations of experiments. To derive confidence in the results, the experiments must be carefully controlled, the curves must be repeated and all trials must show a ‘similar’ behaviour. As we have said, it is impossible to account for all influences in the experiment and thus it is impossible to isolate the system ‘freezing the world around’ except N input variables… The result of this limitation is that even the most careful lab team will never obtain two identical response curves for the same set of input values (stimuli). It is necessary to repeat the experiment a reasonable number of times (the number depends on the statistical variability that you observe through trials) and instead of a curve you can draw a ‘region’ in which it is most probable to find the curve. You can imagine how difficult this turns out to be even for a low and ‘easy’ value of N. For N=1 each ‘curve’ is a set of coplanar points, so it can be drawn in the plane, and the set of all curves from different trials concentrates to shape up a region on the plane. Doing several trials turns the output values ‘line’ (a 2D set that approaches a winding trail) into a ‘strip’ that encloses all individual curves. In case N=2 you obtain a 3D volume that is the ‘envelope’ for many very similar but not identical 3D surfaces. Of course in the process you may discover that the ‘point sets’ that approximate 2D curves (or 3D surfaces) in repeated trials of the same experiment do not come ‘close’ to one another. In that case you cannot keep the data set; you need to go back to your experiment, improve the isolation between influences and start all over again to collect curves. What you are seeing is that the dominant effect in the output comes from some variable that you have not controlled or not succeeded in freezing.
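For the N=1 case, turning repeated trials into a ‘strip’ can be sketched as follows. The envelope used here (lowest and highest response per stimulus value) is just one simple choice of region, the function names are hypothetical, and the trial values are made up for illustration.

```python
from typing import Dict, List, Tuple

Strip = Dict[float, Tuple[float, float]]


def behavior_strip(trials: List[Dict[float, float]]) -> Strip:
    """Per stimulus value, the (lowest, highest) response seen across all trials."""
    strip: Strip = {}
    for x in sorted(trials[0]):
        responses = [trial[x] for trial in trials]
        strip[x] = (min(responses), max(responses))
    return strip


def trials_are_close(strip: Strip, max_height: float) -> bool:
    """False when the strip is somewhere taller than max_height.

    A strip that is too tall suggests an uncontrolled variable dominates the
    output, so the data set should be discarded and the isolation improved.
    """
    return all(high - low <= max_height for low, high in strip.values())


# Made-up trials: concurrent streams -> time to first byte (ms).
trials = [
    {500: 0.8, 1000: 4.0, 2000: 9.5},
    {500: 1.2, 1000: 4.9, 2000: 8.6},
    {500: 1.0, 1000: 4.4, 2000: 9.0},
]
strip = behavior_strip(trials)
usable = trials_are_close(strip, max_height=2.0)
```

Here the three trials stay within a 2 ms band at every stimulus value, so the data set would be kept; with a stricter tolerance such as 0.5 ms the same trials would be rejected.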

When repeating an experiment it is very helpful to draw all the different curves produced by the successive trials in the same plane (if N allows for that), accumulating them but using different ‘colors’, or in parallel planes, one beside another, adding an extra dimension (if N allows for that). The planar representation has the advantage that it can be observed purposely forgetting which response corresponds to which trial (forgetting about color), so for each value of the independent variable (stimulus) there is now a set of dependent values (responses). This in fact creates a collection of histograms displayed in sequence in front of you. You can graphically observe statistical properties like accumulation points. If you need to create a single collection of ‘pairs’ stimulus-response, to represent the outcome of all the trials as a single experiment, the best idea is to choose the accumulation point of the responses at each stimulus value.

(P.S.: some people I’ve known do not notice how important it is to repeat every experiment enough times and look at the accumulation. In case they repeat the experiment blindly a low number of times, say 2 or 3, they are tempted to pick the average response value at each stimulus value. This is a serious mistake. Accumulation points may be very different from an arithmetic average of a few values. They usually show ‘gaps’  in regions in which for some reason you never find the response to your stimulus. These regions can tell you a lot of things in the course of any further investigation. It is much easier and more accurate to detect an accumulation visually than simply taking an average. At the same time if after M trials you do not see clearly any accumulation you need more trials. If you just take averages you are at risk of stopping too soon your trials. Averages will hide the gaps that can be so informative of behaviour. Averages in certain cases may not give you any clue about the statistical accumulation of results. Averages in many cases destroy information instead of providing you with information.)
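The difference between an accumulation point and an average can be shown with a small Python sketch. The approach here (centre of the fullest histogram bin) is only one simple stand-in for ‘detecting an accumulation’, the function name and bin width are assumptions, and the response values are made up.

```python
from collections import Counter
from statistics import mean
from typing import List


def accumulation_point(responses: List[float], bin_width: float) -> float:
    """Centre of the most populated histogram bin."""
    bins = Counter(int(r // bin_width) for r in responses)
    fullest_bin, _count = bins.most_common(1)[0]
    return (fullest_bin + 0.5) * bin_width


# Made-up responses (ms) collected at ONE stimulus value over ten trials.
# Most trials land near 3 ms, a few near 10 ms, and none in between.
responses = [3.1, 3.3, 2.9, 3.2, 3.0, 9.8, 10.1, 3.4, 9.9, 3.1]

accumulation = accumulation_point(responses, bin_width=1.0)  # 3.5
average = mean(responses)                                    # about 5.18
```

The average (about 5.18 ms) falls inside the gap between 4 ms and 9 ms where no response was ever observed, while the accumulation point (3.5 ms, the centre of the 3 to 4 ms bin) sits where most trials actually landed. The second, smaller cluster near 10 ms is exactly the kind of information an average hides.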


8. Performance Management in three steps:  Characterization -> Measurement -> Action

Over the years I’ve used this method for managing performance of complex systems. It is pure common sense.

Step 1: Characterization: (in the labs, before production):

Take each part of your system that can be reasonably identified as an ‘independent entity’ that is worth considering in isolation from the rest, select your N inputs and M outputs, put the entity in the lab and provide total isolation apart from these N inputs, then prepare experiments that let you traverse the full space defined by these N inputs and carefully annotate the values of all the M outputs for every input combination (an input vector). This step is the ‘characterization’ of your entity. In a CDN a typical decomposition of the whole system into a list of entities may be: video cache, web cache, acquisition server, request router, DNS, live splitter, backend server of application X, backend server of application Y…, aggregation switch, virtual machine X, cluster of virtual machines, virtualization HW enclosure, edge router,… whatever you think matters for your very specific design.

It is important to note that the characterization process gives us documentation about the range in which our entity can work. Of course this range is multidimensional, as is the input space.

It is also important to note that in the characterization docs we describe not only the range (i.e.: a couple of values per independent input variable), but we get a curve that tells us the exact behavior that we can expect for each output when the input moves through that range. It is perfectly possible and reasonable that after looking at the curve we take actions to ensure that the system is never out of some sub-range that we consider optimal. (See compensation mechanisms in ‘step 3’ later in this section.)

At the end of the characterization process you have a set of documents that describe the behaviour of one entity (cache, acquisition server, content router,…) when placed in ‘controlled input conditions’: known request arrival rate, known bitrate of videos, known title popularity distribution, known instantaneous number of streams,… This behavior consists of measurements of ‘output values’: time to serve first byte, time to serve first frame, re-buffering ratio, number of buffering events per minute, instantaneous throughput, average throughput per minute,… If you have done your characterization ‘professionally’, then any two instances of the characterized entity (any two caches, any two acquisition servers, etc…) will have a behavior that falls in the regions delimited by the characterization curves. You need to randomly pick some instances of the entity from the manufacturing plant or from production, take them to the lab and apply a quick subset of your characterization experiments just to see if there is excessive variation from the curves. If there is… bad luck, you have to refine your characterization experiments and recreate your documentation.

Step 2: Measurement: (in the field, continuously):

After you know how your ‘entity’ reacts to controlled stimuli, you must provide some way to measure the output of this entity while it is at work in the field, in real production. Of course collecting the output must be done without disturbing the behavior of the entity, or at least with the minimum reasonable disturbance.

The collection process must be more robust and refined than the collection performed for characterization so it can be run continuously and unmanned. Note that this is not a lab. You must be prepared to collect data and process them as close as possible to real time, with no human access to the entities, and keep the process running for long periods of time. One of your concerns is to avoid running out of space for collected data which implies forwarding semi-processed data to other systems for representation, feedback to the CDN compensation mechanisms and long term storage for Business Intelligence trend-spotting and/or audit.
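One way to keep an unattended collector from running out of space is to bound what it stores locally and forward semi-processed summaries. The sketch below assumes a hypothetical `BoundedCollector` class and a `forward` callback standing in for whatever downstream system receives the data; the summary forwarded here (count, min, max) is only an example of ‘semi-processed’ data.

```python
from collections import deque
from typing import Callable, Deque, Dict


class BoundedCollector:
    """Keeps at most `window` raw samples; forwards a summary when the window fills."""

    def __init__(self, window: int, forward: Callable[[Dict[str, float]], None]):
        self._window = window
        self._samples: Deque[float] = deque(maxlen=window)
        self._forward = forward

    def add(self, value_ms: float) -> None:
        self._samples.append(value_ms)
        if len(self._samples) == self._window:
            # Semi-processed data leaves the entity; raw samples are dropped.
            self._forward({
                "count": float(len(self._samples)),
                "min": min(self._samples),
                "max": max(self._samples),
            })
            self._samples.clear()


# Example: summaries are appended to a list instead of being sent anywhere.
forwarded = []
collector = BoundedCollector(window=3, forward=forwarded.append)
for first_byte_ms in [3.0, 5.0, 3.0, 4.0, 5.0]:
    collector.add(first_byte_ms)
```

After five samples with a window of three, exactly one summary has been forwarded and only the last two raw samples are still held locally, so local storage never grows beyond the window regardless of how long the collector runs.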

At this stage the response values are much more meaningful than before Characterization. Now you can offer a quasi-real-time plot of each value while showing the proper scale in the background. This ‘scale’ is the characterization curve. The most informative layout shows at the same time the whole range of the curve (for any given output variable) and, plotted OVER that curve, the point that the system is outputting at this instant in time.

Let me give an example. Let’s say that you did your homework and instrumented a video cache by building a SW agent that collects a true business KPI: ‘time to deliver first byte’. You are in production, measuring, and you obtain values: 3ms, 5ms, 3ms, 4ms, 5ms… If you had not gone through Characterization you could only have questions: is this good? Is this bad? How good is this? Remember that you are managing a true system-wide KPI, a business KPI that is often used as an SLO in contracts that set a given SLA. Because you have gone through Characterization, you can compare the current instantaneous value of the output to the behavior range that you discovered in the lab. The best way to do it is to display the current value over the behavior curve, using the curve as a background.

For example, let’s say that you have a planar curve: ‘number of concurrent streams – time to first byte’, so you represent first byte delay as a ‘function’ of the number of concurrent streams. The current value you get is (1000 streams, 5ms). Maybe your curve ranges from <1ms for numbers of streams under 500 all the way up to 10ms for 2000 concurrent streams. Your curve is not a dot trail: it is a ‘strip’ 2ms high, centered at 1ms for 500 users and at 9ms for 2000 users, and it resembles a straight band with some slight deformation. This is much more valuable information. Now you know that you are experiencing 5ms latency while having 1000 users, and you know that it is NORMAL, because the real-time ‘dot’ lies on the behavior curve (within the ‘normality strip’). If you were getting 1ms for 1000 users, this point would lie outside the curve (a pack of curves that forms a strip or band 2ms high): something strange is going on (unexpectedly ‘good latency’ in this case). You must worry. If you were getting 200ms for 1000 users it would also be strange (unexpectedly ‘seriously bad latency’ in this case). You must worry. Notice that you must worry in BOTH cases because the behavior you see is not normal. Very rarely will you receive a ‘gift’ of extra performance for free. Chances are that something else has broken.

Apart from getting all this information at a glance, you can also see that you are running with 1000 users, well under the dangerous region of 2000 users that marks the end of the range. If the current value were (1995 streams, 9ms) the value would be NORMAL, because it lies on the curve, but you would have to worry, because you would be at the brink of overload… and if Nature is repeatable you would be about to get unacceptable delay, just as you measured previously in the lab. Not bad. With just a look at a real-time plotted dot on top of a pre-computed background curve you know a lot of things, all of them relevant to your business, and all of them pointing to your next action if one is needed.
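The reading of the ‘dot over the strip’ can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the band table, the 95% warning threshold and the names `BAND`, `band_at` and `classify` are all hypothetical. The end points follow the example (strip 2ms high, centered at 1ms for 500 streams and at 9ms for 2000 streams); the intermediate rows are made up so that the example points are classified as described in the text. Real values can only come from your own lab.

```python
from bisect import bisect_right

# Hypothetical characterization band for 'time to first byte' (ms) versus
# concurrent streams. Each row: (streams, lower edge, upper edge).
# Illustrative values only; real ones come from the characterization lab.
BAND = [
    (500, 0.0, 2.0),
    (1000, 4.0, 6.0),
    (1500, 6.5, 8.5),
    (2000, 8.0, 10.0),
]
MAX_STREAMS = 2000   # end of the characterized range
NEAR_LIMIT = 0.95    # assumed warning threshold (fraction of MAX_STREAMS)


def band_at(streams):
    """Linearly interpolate the (low, high) edges of the strip at `streams`."""
    xs = [row[0] for row in BAND]
    if streams <= xs[0]:
        return BAND[0][1], BAND[0][2]
    if streams >= xs[-1]:
        return BAND[-1][1], BAND[-1][2]
    i = bisect_right(xs, streams)
    (x0, lo0, hi0), (x1, lo1, hi1) = BAND[i - 1], BAND[i]
    t = (streams - x0) / (x1 - x0)
    return lo0 + t * (lo1 - lo0), hi0 + t * (hi1 - hi0)


def classify(streams, ttfb_ms):
    """Compare one real-time measurement against the pre-computed strip."""
    low, high = band_at(streams)
    if ttfb_ms < low:
        return "abnormal: unexpectedly good"
    if ttfb_ms > high:
        return "abnormal: unexpectedly bad"
    if streams >= NEAR_LIMIT * MAX_STREAMS:
        return "normal, but close to overload"
    return "normal"
```

With this table, `classify(1000, 5)` is normal, `classify(1000, 1)` and `classify(1000, 200)` are both abnormal, and `classify(1995, 9)` is normal but close to overload.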

Step 3:  Re-action: (in the field, when needed):

In the above example about interpretation of the real-time measures you have realized the power of having a pre-computed behavior curve as background.  But what happens if you see that you are about to surpass any well-known limit? (In our previous example the limit was the max number of allowed concurrent users). Do you have time to react? I would say: no. You should plan for these cases in advance and build mechanisms inside your entities and inside your whole CDN to force your system to go through a known path even in case of unusual input. You cannot change the outside world. You cannot stop your users demanding more and more… but you do not have to die when it happens.  You can plan ahead and decide at the design desk what your entities will do when they are forced outside of their limits by the uncontrollable outside world.

In our example above a cache has a known limitation of 2000 concurrent streams. (PS: This is of course an oversimplification: if you use an input vector of <number of streams, bitrate of streams, req arrival rate> you will probably notice that these three variables combine to limit the acceptable functioning range.) You know that output will be unacceptable if there are 2001 users in your system, so what is a reasonable behavior when user number 2001 tries to get in? This choice of course depends on the design of your CDN. Maybe you have a nice and complex request routing and you can redirect the user to another cache; maybe all caches are full; or maybe they are not, but the cost of populating a new cache with content it does not have is too high… It doesn’t matter which. At some point you have to make a hard choice, in this case dropping requests. Of course this is just an example. There are many other situations in which real-time measurements and previous characterization info combine to feed ‘compensation mechanisms’ that keep the CDN well behaved. Here you can see another true power of the characterization process: if you do not allow any component (entity) of your CDN to work out of range, by making the proper (even hard) choices at the entity level, you can control the ‘acceptable behavior path’ of your entire CDN even in case of very serious massive problems throughout your CDN (for instance, a DDoS attack).
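A per-entity compensation mechanism of this kind could look like the sketch below. It is a deliberately naive illustration: the function `admit` and the shape of `alternative_caches` are hypothetical, and a real CDN would take this decision inside its own request routing.

```python
MAX_STREAMS = 2000  # limit taken from the example in the text


def admit(current_streams, alternative_caches):
    """Decide what to do with one new request arriving at a cache.

    alternative_caches: list of (name, free_slots, has_content) tuples,
    a hypothetical view of the other caches supplied by request routing.
    Returns ("serve", None), ("redirect", cache_name) or ("drop", None).
    """
    if current_streams < MAX_STREAMS:
        return ("serve", None)
    for name, free_slots, has_content in alternative_caches:
        # Redirect only where the content is already cached, to avoid the
        # cost of filling another cache with content it does not have.
        if free_slots > 0 and has_content:
            return ("redirect", name)
    # Every alternative is full or too costly: take the hard choice.
    return ("drop", None)
```

The point of the sketch is that the decision is taken at the design desk: the cache never accepts stream number 2001, whatever the outside world does.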

‘Compensation Mechanisms’ can be designed per entity and also per system. The best possible design will probably have both types. There are system-wide properties that may NOT be the sum of the entities’ properties. A good example is ‘goodput’. You may have N caches that work well in a complex range of situations, each offering a variable ‘goodput’, but your CDN may not be able to take the sum of all the individual maximum ‘goodput’ values. In this case you have to notice that the sum of traffic is going to cause problems and react to a high summed value in some way, probably dropping a few requests in some cleverly chosen caches. This kind of compensation mechanism that acts system-wide is the hardest to apply, as it requires gathering information from the entire system, reasoning over the consolidated data, (potentially) acting on many parts of the system… and everything must be done in a very quick cycle. For widely distributed CDNs that span the whole planet, simply completing the collect-consolidate-decide-propagate cycle is a challenge. The cycle may take minutes in a global network with thousands of machines over tens of thousands of km of cabling. (P.S: the best figure I’ve seen for a planet-wide CDN cycle is 40s to centrally compute a request routing map, and I find it hard to believe. It just looks too good to be true for such a huge network with today’s technology, but it is definitely possible if enough resources are put to the task and the implementation is smart.)
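As an illustration of a system-wide mechanism, the sketch below consolidates per-cache goodput and decides how much each cache should shed when the sum exceeds what the CDN as a whole can carry. The proportional policy is an assumption chosen for simplicity (the ‘cleverly chosen caches’ of a real design would need something smarter), and `plan_shedding` and its parameters are hypothetical names.

```python
def plan_shedding(goodput_bps, system_cap_bps):
    """Return the goodput (bits/s) that each cache should shed.

    goodput_bps: dict cache name -> current goodput, as collected centrally.
    system_cap_bps: the most goodput the whole CDN can carry, which may be
    lower than the sum of the individual cache maximums.
    Policy (assumption): shed proportionally to each cache's current goodput.
    """
    total = sum(goodput_bps.values())
    excess = total - system_cap_bps
    if excess <= 0:
        # The summed traffic fits: nothing to compensate.
        return {name: 0.0 for name in goodput_bps}
    return {name: excess * g / total for name, g in goodput_bps.items()}
```

Note that the hard part described in the text is not this arithmetic but running the collect-consolidate-decide-propagate cycle fast enough for the result to still be valid when it arrives.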


9. Practical CDN list of entities with their most relevant ‘input variables’ and ‘observable outputs’.

I will suggest here a practical (minimum) decomposition of a ‘generic CDN’ into relevant entities and I will propose a list of stimuli/responses to watch for each entity. This is a decomposition based on my own experience designing CDNs and analyzing performance. There are many other possibilities, but I think it is worth looking at the entities I’ve selected and the criteria used to look at their performance, as this reveals a lot about the method to link the CDN design to the business objectives. You will see, for example, that whenever possible I select a response space that contains SLOs, so that it is very easy to interpret the response as real business KPIs.

List of Entities (a minimum proposal):

1. – Streaming Cache: a cache used to serve streaming objects, very big objects delivered using several protocols: MPEG-DASH, HLS, HDS, SS, progressive download… any streaming protocol

Stimuli space:   it is made of consumer requests and CDN commands. The latter are completely controlled by the CDN and represent very few transactions, so we concentrate on consumer requests or simply ‘requests’.

Properties of the stimulus: (input variables):

-Request arrival rate: number of requests arriving per time unit (units: req/s)

-Concurrent sessions/connections/streams:  total number of outgoing streams (units: natural number, non-dimensional)

-Demanded throughput: aggregate throughput demanded by all concurrent sessions (units: bits/s)

-Distribution of requests over content space: the statistical character of the distribution of requests over the ‘content space’ (the list of titles) is also known as ‘popularity distribution’ and must be known A-priori.

-Distribution of size and bitrate over content space: the statistical character of the distribution of size and bitrate over the list of titles affects performance and must be known A-priori.

Response space:    the response of a Streaming cache is always a stream.

Properties of the response: (output observables):

-Response time:  time from request to the ‘first signal of service’. This property is usually measured by the consumer but that is not appropriate to characterize a server (cache) so we most commonly will use ‘first byte time’ which is the time from ‘request arrival’ to ‘first byte of response going out’.

-Quality of streams:  ‘quality’ is not the same as ‘performance’, although there is usually a relationship between these two concepts. ‘Play quality’ must be measured at the player. When characterizing the streamer we are interested in keeping ‘play quality’ intact (whatever it is at the player) while streamer performance moves through its allowed range. There is a complex relationship between traffic shaping at the streamer output and play quality. Let’s assume that the network in between is ‘perfect’: it does not add delay, losses or jitter. In that case a lag in server output may cause re-buffering at the player. If we define a ‘player model’ that buffers N s worth of packets, then by inspecting the traffic shaping at the streamer output we can determine how the streamer performance will impact play quality. Usual observable outputs are: a) number of buffer under-run events per time unit (units: events/minute); b) re-buffer ratio: time spent refilling / total stream play time (units: rational number, non-dimensional). (P.S: Players today are very complex. This simple model does not account for streamer-player situations in which the players can speed up/throttle the streamer through signaling. Adaptive HTTP streaming and classic streaming protocols deal with these situations, effectively decoupling streamer performance from play quality. Increased player buffers also decouple streamer performance from play quality. Anyway, it is worth testing the streamer performance (traffic shaping) as a function of load (requests), as this is the ‘base’ for an understanding of play quality at the player. Once the streamer is characterized we can add on top the behavior of the player (buffering) and of the streaming protocols (signaling for speedup/throttling).)
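The simple ‘player model’ can be sketched as a small simulation that takes the streamer output (seconds of media delivered per tick) and derives both observables. Everything here is an assumption made for illustration: the function name `play_quality`, the 1 s tick, the rule that play resumes only when the buffer is back at N seconds, the choice not to count the initial start-up buffering as re-buffering, and the use of time actually spent playing as ‘total stream play time’.

```python
def play_quality(delivered_media_s, buffer_target_s=2.0, tick_s=1.0):
    """Simulate a simple player fed by the streamer over a 'perfect' network.

    delivered_media_s: seconds of media delivered by the streamer in each tick.
    The player starts once it holds buffer_target_s seconds of media, stalls
    when it cannot play a full tick, and resumes when the buffer is refilled
    to buffer_target_s.
    Returns (under_run_events, re_buffer_ratio).
    """
    buffer_s = 0.0
    playing = False
    started = False
    under_runs = 0
    refill_time = 0.0
    play_time = 0.0
    for delivered in delivered_media_s:
        buffer_s += delivered
        if not playing and buffer_s >= buffer_target_s:
            playing = True
            started = True
        if playing:
            if buffer_s >= tick_s:
                buffer_s -= tick_s
                play_time += tick_s
            else:
                # Buffer under-run: the player stalls during this tick.
                playing = False
                under_runs += 1
                refill_time += tick_s
        elif started:
            # Still refilling after an under-run (start-up is not counted).
            refill_time += tick_s
    ratio = refill_time / play_time if play_time > 0 else 0.0
    return under_runs, ratio
```

For instance, a streamer that delivers in real time for four seconds, lags for three and then recovers, `play_quality([1, 1, 1, 1, 0, 0, 0, 1, 1, 1])`, produces one under-run event and a re-buffer ratio of 0.5. Dividing the event count by the observed duration gives events/minute.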

-Aggregate throughput: total throughput measured going out of cache (units: bits/s)


Curves stimulus-response:

SET 1:  (A ‘must have’ for any decent CDN.)

Arrival rate (X) –response time(Y)

(NOTE: apply requests following a uniform distribution over a space of content that fits completely in the machine’s DISK but NOT IN RAM. All items in content space must be EQUAL in size. Suggestion: use files over 20 MBytes and under 1 GByte. Proposed value: 200 MBytes. Proposed total number of items: 20000. For simplicity, encode all files at the same bitrate. Suggestion: 1 Mbps.)
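A quick check of what the proposed values imply, taking 1 MByte as 10^6 bytes (an assumption, the note does not say which convention is meant). The helper `fits_test_conditions` is a hypothetical name.

```python
ITEM_SIZE_BYTES = 200 * 10**6   # proposed item size: 200 MBytes
ITEM_COUNT = 20_000             # proposed number of items
BITRATE_BPS = 1 * 10**6         # suggested encoding bitrate: 1 Mbps

# Total size of the content space: 4 * 10**12 bytes, that is 4 TBytes.
content_space_bytes = ITEM_SIZE_BYTES * ITEM_COUNT

# Play duration of one item at the suggested bitrate: 1600 s (about 27 min).
item_duration_s = ITEM_SIZE_BYTES * 8 / BITRATE_BPS


def fits_test_conditions(disk_bytes, ram_bytes):
    """True if the content space fits on disk but does not fit in RAM."""
    return ram_bytes < content_space_bytes <= disk_bytes
```

So the machine under test needs at least 4 TBytes of disk for content, and any realistic amount of RAM satisfies the ‘NOT IN RAM’ condition.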

Arrival rate (X) – re-buffer ratio (Y)

(NOTE: this curve should be plotted overlaid to the previous curve, sharing the X axis: arrival rate).

We want to trace the above mentioned pair of curves for every ‘pure’ service that the streamer is capable of:  VoD HDS, VoD HLS, VoD SS, Progressive D., Live HDS, Live HLS, Live SS …whatever…

We also want to trace the above mentioned pair of curves for ‘non pure’ services. The ‘non pure’ services are made of ‘service mixes’. Enough service mixes must be considered. A service mix can be defined as: X1% of requests come from service 1, X2% of requests come from service 2 … XN% of requests come from service N, with X1+X2+…+XN = 100. Pure services have a mix in which only one Xi is non-zero.
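The definition of a service mix translates directly into a small data structure. The sketch below is illustrative; the function names are hypothetical and the service labels are just examples taken from the list of pure services.

```python
def validate_mix(mix):
    """mix maps service name -> percentage of requests (X1 ... XN)."""
    if any(share < 0 for share in mix.values()):
        raise ValueError("shares must be non-negative")
    if abs(sum(mix.values()) - 100) > 1e-9:
        raise ValueError("shares must add up to 100")
    return mix


def is_pure(mix):
    """A pure service is a mix in which only one share is non-zero."""
    return sum(1 for share in mix.values() if share > 0) == 1


def requests_per_service(mix, total_requests):
    """Split a test load among services according to the mix."""
    validate_mix(mix)
    return {service: total_requests * share / 100
            for service, share in mix.items()}
```

For example, `requests_per_service({"VoD HLS": 70, "Live HLS": 30}, 1000)` assigns 700 requests to VoD HLS and 300 to Live HLS.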

SET 2: (A ‘must have’)

Connections open(X)-response time for a single new incoming request (Y)

(NOTE:  when X users are in the system and a new request arrives it receives Y response time)

Connections open(X) – re-buffer ratio (Y)

(NOTE: this curve should be plotted overlaid to the previous curve, sharing the X axis: connections).

We want to trace the above mentioned pair of curves for every ‘pure’ service and for some service mixes. (See above details in SET 1).

SET 3:  (Recommended)

Surface:  (input X,Y, response Z): arrival rate(X)-Demanded throughput(Y)-response time(Z)

for all pure services and for some selected mixes of service.

(NOTE: apply requests so ‘demanded throughput’ varies in 10 steps that cover from 10% nominal NIC capacity to 100% nom NIC capacity increasing 10% with each step. This process produces 10 plane curves arrival rate(X)-response time (Z) which are ‘cuts’ to the surface A.R.(X)-D.T.(Y)-R.T.(Z))

Surface:  arrival rate (X) –Demanded throughput(Y)- re-buffer ratio (Z)

(NOTE: we want the slice curves of this surface to be plotted overlaid to the previous slice curves, sharing the X-Y planes: arrival rate-demanded throughput).


SET 4: (optional)

Surface: (input X,Y, response Z): connections open(X)-Demanded throughput(Y)-response time(Z)
for all pure services and for some selected mixes of service.  (See above NOTE).

Surface:  connections open(X) – Demanded Throughput(Y)-re-buffer ratio (Z)

(NOTE: we want the slice curves of this surface to be plotted overlaid to the previous slice curves, sharing the X-Y planes: connections open-demanded throughput).

SERIES OF SETS 5: (optional)

Repeat SETS 1, 2, 3, 4 using a non-uniform distribution of requests over content space. As the non-uniform distributions are parametric (Zipf skew factors, variance in normal distributions, etc.) a family of curves and a family of surfaces will result. To get useful data use at least 3 different skew factors. This series of experiments will easily explode your data and is time consuming: if you go for it you will obtain a parametric family of surfaces that will cost you much effort. The surfaces will be useful only if your streaming cache is very sensitive to request distribution. Unfortunately this is the case for most caches in the world. (Notice that uniform distribution of requests over title space is the worst case while Zipf is the best. Notice that Normal distribution of size over title space is close to the real world.)
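A request generator for these experiments could be sketched as follows. It assumes the usual Zipf form in which the title of rank r gets a weight proportional to 1/r^skew; the function names, the seed and the ‘top 10%’ measure are hypothetical choices made for illustration.

```python
import random


def zipf_weights(n_items, skew):
    """Normalized popularity of each title, most popular first."""
    raw = [1.0 / (rank ** skew) for rank in range(1, n_items + 1)]
    total = sum(raw)
    return [w / total for w in raw]


def generate_requests(n_items, skew, n_requests, seed=0):
    """Draw title indexes (0 = most popular) following the Zipf weights."""
    rng = random.Random(seed)
    weights = zipf_weights(n_items, skew)
    return rng.choices(range(n_items), weights=weights, k=n_requests)


def top_share(n_items, skew, top_fraction=0.1):
    """Fraction of requests that hit the most popular titles."""
    weights = zipf_weights(n_items, skew)
    top_n = max(1, int(n_items * top_fraction))
    return sum(weights[:top_n])
```

A skew of 0 reproduces the uniform case, where the top 10% of titles receive just 10% of the requests. As the skew grows, more requests concentrate on fewer titles, which is why a cache finds the uniform case the hardest.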


2. – Small object cache (web cache): a cache used to serve small objects: small web files

Stimuli Space: the stimulus to the cache is always a consumer request.

Properties of the stimulus: (input variables)

-Request arrival rate: the number of requests arriving per time unit. (Units: req/s).

-Distribution of requests over content space:  object popularity distribution.

-Distribution of Object size over content space:  object size distribution.

-Distribution of Object duration over content space: distribution of object duration (how many objects are replaced by others and how often) impacts greatly on cache performance

Response space: the response of a web cache is always a web object: small file.

Properties of the response: (output observables):

Response time: (see above definition of response time for Streaming cache)

Aggregate throughput: (see above definition of throughput for Streaming cache)

Curves stimulus-response:

Arrival rate (X) –response time(Y)

Trace a family of curves varying:

Distribution of requests over content space: select some small object (40 KB). Create 10^9 renamed copies: ~40 TB worth of objects. Use several Zipf distributions with different skew to generate requests. For each distribution plot the complete curve by sweeping the arrival rate (req/s). Plot no fewer than 5 curves and label them with the skew parameter.

Distribution of size over content space: select some small object (40 KB). Create 10^9 renamed copies. In the copy process modify the object size following a normal distribution. Do this several times changing parameters in the normal distribution: mean and standard deviation. Use as mean: 4 KB, 40KB, 400 KB, 4MB. Use as variance: 0.1, 1, 5. You obtain 12 distributions. Plot the curve: A.rate(x)-R.time(Y) for each combination of the 5 previous Zipf distributions of requests and the 12 Normal distributions of size. You will obtain a total of 60 curves.

Distribution of duration: Select 3 Zipf distributions with different skew factors. Apply these distributions to object duration (to the total amount of 10^9 objects). Obtain 60×3= 180 curves.
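The curve counts above are simply the sizes of the parameter grids. A short enumeration confirms them and can double as the skeleton of a test plan. The means and variances are the ones proposed in the text; the skew values are hypothetical placeholders, since the text only fixes how many of them to use.

```python
from itertools import product

# Hypothetical skew values: the text asks for at least 5 request skews
# and 3 duration skews, without fixing the values.
ZIPF_REQUEST_SKEWS = [0.6, 0.8, 1.0, 1.2, 1.4]
DURATION_SKEWS = [0.6, 1.0, 1.4]

# Values proposed in the text for the Normal distributions of size.
SIZE_MEANS_KB = [4, 40, 400, 4000]
SIZE_VARIANCES = [0.1, 1, 5]

# 4 means x 3 variances = 12 size distributions.
size_distributions = list(product(SIZE_MEANS_KB, SIZE_VARIANCES))

# 5 request distributions x 12 size distributions = 60 curves.
request_x_size = list(product(ZIPF_REQUEST_SKEWS, size_distributions))

# 60 curves x 3 duration distributions = 180 curves.
full_plan = list(product(ZIPF_REQUEST_SKEWS, size_distributions, DURATION_SKEWS))
```

Each element of `full_plan` is one experiment: (request skew, (size mean in KB, size variance), duration skew).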

Surface: Arrival rate(X)-Demanded throughput(Y)-response time (Z)

(See Streaming cache SET 3).

Apply different distributions of requests (over object space), of size (over object space) and of duration (over object space) to generate a family of families of surfaces (120 surfaces).


3. – Live Acquisition server:  stream oriented (splitter) for live signals.

Stimuli Space:  the stimulus is the number of incoming streams and the demand of ‘splits’: outgoing streams

Properties of stimulus:

Total number of input + output streams: see streaming cache: connections.

Stream bitrate: See streaming cache: demanded throughput.

Distribution of stream bitrate over stream space (over all connections)

Response space:  outgoing streams or ‘splits’.

Properties of response: ability to sustain bitrate and quality of all demanded streams

Total input + output throughput:  sum of bitrates of all streams: in + out.

Output quality: see streaming cache stream quality.

Curves stimulus-response:

Connections(X)-throughput in + out(Y)


Surface: Connections(X)-Demanded Throughput(Y)-Quality (Z)

(NOTE: See streaming cache SET 4-B. Repeat surface for several Normal distributions of connections of varying BITRATE. Distribute the BITRATE following the Normal distribution.)


4. – File Acquisition server:  file oriented. It is the entry point for files (regardless of whether these files will later be distributed as streams). It usually also serves as an ‘internal origin’ for the CDN caches.

Stimuli Space:  requests to read & write from/to files.

Properties of stimulus:

Writes rate: (Units: writes per second)

Reads rate: (Units: reads per second)

Concurrent users: total connections (writing + reading)

Demanded due size: concurrent reads * size of demanded objects.

Distribution of size over content space

Response space: reads & writes over files.

Properties of response

Response time (write): first byte: See streaming cache.

Response time (read): first byte: See streaming cache.

Write throughput: for a single connection & aggregate (all connections)

Read throughput: for a single connection & aggregate (all connections)

Curves stimulus-response:

Connections(X)-Read Throughput(Y)

Connections(X)-Write Throughput(Y)

Demanded due size(X)-Read Throughput(Y)

Connections(X)-Demanded due size(Y)-Read Throughput (Z)


5. – Request Router: very dependent on CDN design; it may be part of the DNS implementation

Stimuli Space: consumer requests in need of routing. The router will respond causing redirection.

Properties of stimulus:

Request rate

Request distribution (over content space)

Request distribution (over consumer-IP space)

Response space:  responses to routing queries.

Properties of response:

Response time

Curves stimulus-response:

Request arrival rate(X)-Response time(Y)

(NOTE: Repeat the above curve for various values of ‘popularity’ of URLS: distribute requests over URLs (content space) using a Zipf distribution.)


6. – Hierarchical DNS:  I assume your CDN will use some form of modified DNS. In any case you will need at least to have a regular DNS to interface to the outside world and its performance always matters.

Stimuli Space:  DNS requests.

Stimulus properties:

Request rate

Response space: DNS responses.

Response properties:

Response time

Curves stimulus-response:

Request rate(X)-Response time(Y)



10.  Conclusions about CDN Performance Management : Concepts & Tasks

We have seen that there are misconceptions about the goal and the method to analyze performance.

We have seen that having the right knowledge about CDN performance is a powerful business tool that has a direct impact on service monetization, service maintenance, problem solving, and proper design and implementation, and as a bonus it helps in trend spotting and in the evolution of our business.

We have stated that analytical modelling of a CDN is impractical.

We have proposed Characterization as the right tool to help in performance analysis.

We have introduced behavior curves.

We have proposed a method in three steps: Characterization (a priori in labs) -> Measurement (continuous, in the field) -> Action (when required in the field, through compensation mechanisms).

We have suggested a simple decomposition of a modern CDN suitable for applying Performance Management.

We have suggested a reasonable set of inputs/outputs and behavior curves for each entity in the CDN.

All of this focuses on performance management, and all these concepts usually lead to carrying out some tasks in our CDN business. Here is my suggestion for a healthy CDN life-cycle (list of tasks):

-Characterization Lab: for every entity in the CDN the lab provides the behavior curves: stimulus/response. The Lab must be available to re-characterize any production item when the specs are changed (e.g.: a new processor or motherboard is used in a cache, a new NIC is used, RAM technology improves, HD technology improves, etc.)

-Soft Real-time Dashboard: a central tool that allows looking at every KPI defined for the CDN in soft real-time. It involves instrumenting all the entities (running collecting agents per entity), pre-computing behavior curves and then graphically displaying real-time measured values of KPIs aggregated or from individual entities over behavior curves. Today the dashboard is typically a graphic web tool.

-Deployment & sizing guide: a high level outcome of the characterization is a ‘rule-of-thumb’ to dimension the max capacity that is usable from current CDN parts. This ‘rule-of-thumb’ is an easy and quick chart that can be used to match a CDN deployment (number and type of caches, links, ports, infrastructure…) to known input (distribution of demand, types of content, bitrates, concurrency…). This is an invaluable document for pre-sales when you meet with your prospective partners/customers and they ask for on-net services that would require an ad-hoc deployment. Having your quick chart at hand will allow you to provide immediately a crude estimation of cost and deployment time in direct comparison to demand. If you just have regular ‘pure CDN’ customers your sizing guide will help you in talking to your engineering and deployment to discuss costs of a capacity increase or renewal of your CDN edge.

-Compensation & safety mechanisms: must exist per entity and per CDN. These mechanisms can be really sophisticated but as the very minimum they must ensure that the CDN still works while it is under pressure: too many requests coming, requests coming at too fast rate… These mechanisms should include ‘levers’ to adjust global behavior, redundancy mechanisms, failover mechanisms, etc…. Many of the most interesting CDN patents and many of the smartest contributions of CDNs to the state of the art in distributed systems are ‘global compensation mechanisms’.


As an ending note I’d say that performance analysis is an endless world that joins together the deepest technology knowledge with the most modern business management techniques. Today the Big Data approach to business is starting to scratch the surface of instrumentation in many businesses. Technological businesses like CDNs have the advantage of being born technological, so being instrumented is inherent to them. Remember that it is always better to keep in mind the true meaning of performance and not be confused by the buzzwords of the technical environment.


(download  as PDF : CDN interconnection business and service )



What does ‘connecting two (or more) CDNs’ mean?

There could be many possible ‘CDN-to-CDN connections’. Let us have from the beginning a wide definition of ‘interconnection of CDNs’:

<<Agreement between two businesses (CDNs) by which some responsibilities and activities of one party (CDN) are delegated to another party or parties and some compensation is agreed and received for this exchange>>

Technical means may be needed to implement the exchange of responsibilities/activities and to handle compensation.

Why connect two (or more) CDNs?

Two distinct CDNs are two separate businesses, two separate networks. The reason that stands out is ‘connect to increase both businesses’. If we cannot find business advantages for both parties in the connection, the effort does not make sense.

What business advantages are enabled by interconnection?

A CDN gets money from bulk-delivered traffic and/or per transaction. A CDN receives requests over a ‘URL portfolio’ and gives either bulk traffic or transactions in response. A CDN is designed to gather requests from anywhere in the world BUT to deliver traffic only to some footprint (part of the world). The only way to increase your business results is to increase income or cut costs (obvious). Increasing income can be done by increasing delivery/transactions, by raising traffic prices or by raising transaction value (offering VAS). Cutting costs is another tale entirely, and it can be achieved in many technically intricate ways.




1) Increase delivery

More requests coming in are an opportunity for more business (more transactions and/or more traffic). The only action available to CDN admins that want to increase the number of incoming requests is to increase the URL portfolio: ‘CDN A’ will see more requests if ‘CDN B’ delegates a set of URLs to CDN A (even if it is done temporarily).

(NOTE: End user demand is unpredictable. The total sum of requests over the current URL portfolio may increase without actually increasing the size of the portfolio <number of URLs>. Just imagine that each URL in our current portfolio receives some more requests: that increase happens as the result of an action of the end users, not of the CDN.)

But, why would CDN B delegate part of its precious portfolio to CDN A?

Resources are limited. More processing/delivery resources allow for more business. A CDN can never know nor be in control of how many requests come in, so it is possible that some requests coming into CDN B cannot be attended. In that case CDN B could ‘overflow’ some requests to CDN A thus retaining an otherwise lost profit.

1.1)   Temporary delegation of portfolio. (impulsive overflow).

Maybe CDN B just does not have enough machines (limited processing and limited delivery) and ‘success took them by surprise’. In that case this is a pure temporary overflow and may be handed to another CDN A that operates in the same footprint. Over time CDN B will adjust capacity and will stop overflowing to CDN A, as it is usually more profitable to use your own capacity and retain your whole portfolio of URLs and Clients. The handover in and out must be fast. It is important to be able to trigger overflow based upon some real world variables that are inspected in real time. The trigger action is real time, but all the agreements needed for this to happen are negotiated in advance, and the limits and conditions are set in advance.
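The split between ‘negotiated in advance’ and ‘triggered in real time’ can be sketched like this. The agreement fields, the choice of occupation as the inspected variable and all names are assumptions made for illustration; a real agreement would cover many more conditions.

```python
from dataclasses import dataclass


@dataclass
class OverflowAgreement:
    """Limits and conditions negotiated in advance between CDN B and CDN A."""
    trigger_occupation: float  # B overflows above this fraction of its capacity
    max_overflow_rps: float    # most requests/s that A accepts from B


def overflow_rate(agreement, capacity_rps, incoming_rps):
    """Requests/s that CDN B hands to CDN A right now.

    capacity_rps and incoming_rps are the real world variables inspected
    in real time; everything else comes from the agreement.
    """
    threshold_rps = agreement.trigger_occupation * capacity_rps
    excess = incoming_rps - threshold_rps
    if excess <= 0:
        return 0.0
    # Never hand over more than the partner agreed to take.
    return min(excess, agreement.max_overflow_rps)
```

With `OverflowAgreement(0.9, 500)` and a capacity of 1000 req/s, an incoming load of 800 req/s triggers nothing, 1200 req/s overflows about 300 req/s, and 2000 req/s overflows the agreed maximum of 500 req/s.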

1.2)  Long term delegation of portfolio. (delegation or ‘traffic termination’).

Maybe CDN B is not deployed in some footprint, so it is suboptimal (not profitable) for CDN B to deliver traffic/transactions to that footprint. In this case CDN B needs a long-term delegation of the delivery of some URLs to CDN A for a specific footprint. This is a case of ‘keeping growth without increasing cost’.


2) Adjust prices

2.1) Mutual Overflow in the same footprint (Hidden balancing)

Competing CDNs that operate in the same footprint usually force a trend of diminishing prices, trying to capture clients. Two-way Interconnection at low transit prices may have the effect of presenting to the client a stable price in the area (footprint). Any CDN in the area may benefit from a new client of another CDN if a Mutual Overflow is agreed at a reasonable price. This sounds much more reasonable than trying to compete on quality in the same footprint, as the biggest contributor to quality is the performance of the underlying carrier, and that is something that all CDNs in the area can equally get. Mutual overflow means that under some conditions evaluated in real time a URL normally served by CDN A will be served by CDN B in the same footprint, and the other way round. This mutual overflow can be thought of as ‘Hidden Balancing’ between CDNs, as it is a mechanism transparent to the CDN clients.

2.2) Balanced CDNs in the same footprint (Explicit balancing)

Two or more CDNs may go to market together through a CDN balancer. In fact many balancers are now in place built by CDN clients or by third parties. The business that comes out of the balancer works on the fact that in a big area (multi country) and through a long time (up to a year) the performance of any CDN is going to vary unexpectedly due to these factors:

-unknown expected behavior of demand over own URL portfolio

-unknown evolution of own client portfolio

-unknown instant performance of underlying transport networks (carriers)

A balancer will take a request and will route it to the best suited CDN in real time. As opposed to ‘Mutual Overflow’ (‘Hidden Balancing’), this can be called ‘Explicit Balancing’, as the mechanism is now visible to CDN clients. The reasoning for ‘best CDN’ will be complex, based on the real-time status of the involved CDNs, but also based on ‘fairness for business’ of all involved CDNs in case the balancer is controlled by all of these CDNs. (If the balancer is the property of a third party, fairness for all involved CDNs is not guaranteed.)
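To make the two ingredients of the decision concrete (real-time status and fairness), here is a deliberately simple sketch. It is not how any particular balancer works: the fairness rule (route to the available CDN that is furthest below its agreed share of traffic) and the names `pick_cdn`, `status`, `agreed_share` and `served` are all assumptions.

```python
def pick_cdn(status, agreed_share, served):
    """Choose the CDN for the next request.

    status: dict CDN name -> True if it can take the request right now
            (a stand-in for the real-time status of each CDN).
    agreed_share: dict CDN name -> fraction of traffic agreed (sums to 1).
    served: dict CDN name -> requests routed to that CDN so far.
    Returns the chosen CDN name, or None if no CDN is available.
    """
    candidates = [name for name, available in status.items() if available]
    if not candidates:
        return None
    total = sum(served.values()) or 1
    # Fairness deficit: agreed share minus actual share. Largest deficit wins.
    return max(candidates, key=lambda name: agreed_share[name] - served[name] / total)
```

With two CDNs that agreed on a 50/50 split and a history of 70 requests to A and 30 to B, the next request goes to B while B is available, and to A if B is not.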

Many CDN clients feel better when they know that their portfolio has more than one CDN ready for delivery. In that case it is the best option for the CDNs to work on their mutual balancing. In case a third party balances them the results will not be so good, and some money will go to the third party for joining the CDNs. It is better to identify CDNs that may complement our CDN and work together in a balancer that could be sold together and directly to clients.


3) Balance costs vs. income: increase occupation

Planning cost in a CDN is anything but straightforward. Agreements with Carriers cost money (CDNs have to pay for traffic and sometimes pay for dedicated links/ports or housing). Cost of traffic is directly linked to income, but the costs of ports, links and housing are fixed costs not related to the amount of activity or the success of the service. Machinery for delivery (the edge) costs money, but maintenance of machinery (installation and operation) may be the most costly part of a CDN.

In case a CDN is not able to maintain a high occupation, the fixed costs will make the business not worth the effort, thus it is a good idea to offer capacity to other CDNs either as overflow, or through a balancer or as traffic termination. The real-time measurement of available capacity in our CDN may be an input to the balancer/overflow/delegation.





High-level CDN-Interconn Service goals:

  - Temporary Delegation of some part of the portfolio (Impulsive Overflow)
  - Long Term Delegation of some part of the portfolio (Delegation or Traffic Termination)
  - Mutual Overflow (Hidden Balancing)
  - Explicit Balancing


Requirements to implement Long Term Delegation one-way (receiving) in CDN ‘X’:

  1. ‘X’ must have the ability to receive a delegation of some URL set from another CDN, intended ONLY for a given footprint.
  2. Metadata must be given to the receiving CDN (‘X’) identifying the account in ‘X’ that is the owner of the URL set prior to handling any transaction on behalf of this account. (Delegation Metadata).
  3. Metadata must be given to the receiving CDN (‘X’) containing any configuration data needed to perform the transactions on behalf of the donating CDN. Examples of transactions are: geo-targeting, IP blacklisting, geo-blocking, security token verification, etc. (These metadata may apply to the whole account, in which case they are part of the Delegation Metadata, or they may apply to a URL or a URL set, in which case we call them ‘URL-service Metadata’.)
  4. Creation of the client account in ‘X’ (to handle delegated transactions) could be done ‘on the fly’ on receiving the URLs + metadata (client auto-provisioning) or could be done in advance by an admin process in ‘X’ (client pre-provisioning).
  5. Long Term Delegation must be ‘actionable’ immediately (for instance at the request of ‘X’), and it must also be possible to ‘schedule’ an activation/termination planned ahead by the administrators of both CDNs.
  6. Long Term Delegation must become effective ‘fast’, may have a defined termination date, and must stop ‘fast’, either immediately (by admin action) or at the scheduled date. Here, in the context of ‘X’, ‘fast’ means as fast as is convenient (SLA) for the donating CDN. (Usually this ‘fast’ will be in the range of minutes.)
  7. Delegation must carry with it a feedback channel so that the donating CDN (the one that ‘gives URLs’) regularly receives details of the deliveries/transactions performed by the receiving CDN (the one that ‘takes URLs’). At the very minimum this feedback must contain the full technical records generated at the edge of the receiving CDN (this is commonplace in the CDN business).
  8. It is desirable that the receiving CDN (‘X’) builds ‘Analytics’ specific to delegated traffic, thus offering information about the extra business that comes to it through compensation. In the absence of specific arrangements, the Analytics and Mediation (Billing) services in ‘X’ will create graphs and reports of the delegated traffic as for any other account, so delegated traffic is not distinguishable ‘per se’. For this reason it is desirable to mark delegation accounts so that traffic due to delegations can be analyzed separately.
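
The requirements above suggest a delegation record held by ‘X’. The following is a rough sketch, not a definitive data model: the class and field names are hypothetical, URLs are matched by simple prefix, and the footprint is assumed to be a set of region codes.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional


@dataclass
class Delegation:
    donor_cdn: str
    account_id: str          # account in 'X' that owns the URL set (Delegation Metadata)
    url_prefixes: list[str]  # the delegated URL set, expressed here as prefixes
    footprint: set[str]      # regions for which the delegation applies
    account_config: dict = field(default_factory=dict)  # account-wide settings
    url_config: dict = field(default_factory=dict)      # 'URL-service Metadata' per prefix
    starts_at: Optional[datetime] = None  # None means active immediately
    ends_at: Optional[datetime] = None    # None means no scheduled termination
    revoked: bool = False                 # set by an admin to stop at once

    def is_active(self, now: datetime) -> bool:
        """True if the delegation is in force at the given moment."""
        if self.revoked:
            return False
        if self.starts_at is not None and now < self.starts_at:
            return False
        return self.ends_at is None or now < self.ends_at

    def handles(self, url: str, region: str, now: datetime) -> bool:
        """True if 'X' should serve this URL for this region on behalf of the donor."""
        return (
            self.is_active(now)
            and region in self.footprint
            and any(url.startswith(prefix) for prefix in self.url_prefixes)
        )
```

Keeping `starts_at`, `ends_at` and `revoked` in the same record covers both the scheduled and the immediate activation/termination cases; whether the account is auto-provisioned or pre-provisioned is left outside this sketch.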


Requirements to implement unrestricted (two-way) Long Term Delegation:

  1. Feedback data could transparently update any Analytics & Mediation (billing) service that the donating CDN may have. Records of deliveries/transactions that have been delegated must appear mixed and added with all the regular records of the donating CDN.
  2. Records of deliveries/transactions that have been delegated could be separated from all the regular records of the donating CDN (in a view different from the mixed view), as an additional action that tries to give more information to the business. This information serves to plan capacity increases.
  3. A CDN admin could have the ability to select a subset of the CDN’s own URL portfolio and delegate it to another CDN ONLY for a given footprint. (This is the implementation of delegation ‘the other way round’: not receiving but exporting URLs.)
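
The mixed view and the separate view of delegated records can be sketched as follows. This is an illustration under assumptions: records are plain dictionaries with a `bytes` field, and the `served_by` tag and both function names are hypothetical.

```python
def merge_feedback(own_records, feedback_records, receiving_cdn):
    """Return the mixed view: own records plus delegated ones, each tagged by origin."""
    mixed = [dict(record, served_by="self") for record in own_records]
    mixed += [dict(record, served_by=receiving_cdn) for record in feedback_records]
    return mixed


def bytes_by_server(mixed_records):
    """Separate view: delivered bytes split by the CDN that actually served them."""
    totals = {}
    for record in mixed_records:
        server = record["served_by"]
        totals[server] = totals.get(server, 0) + record["bytes"]
    return totals
```

Tagging each record once at merge time lets the same data feed both views: billing can ignore the tag, while capacity planning can group by it.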


Requirements to implement Temporary Delegation:

  1. Temporary delegation (impulsive overflow) must be transparent to clients.
  2. Temporary delegation (impulsive overflow) must be transparent to the operations systems of the donating CDN. Consumption records must be accounted for as ‘normal’ deliveries/transactions.
  3. Temporary delegation (impulsive overflow) must be ‘triggered’ by rules based on real time inspection of several variables.
  4. Variables that trigger temporary delegation (impulsive overflow) must be defined by the CDN business management.
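
A minimal sketch of such trigger rules, assuming that each rule compares one measured variable with a threshold and that any single rule firing is enough to start the overflow. The variable names and thresholds in the usage note are hypothetical; in practice the business management would define them.

```python
import operator
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class TriggerRule:
    variable: str                            # name of the measured variable
    compare: Callable[[float, float], bool]  # e.g. operator.gt or operator.lt
    threshold: float


def should_overflow(rules, measurements):
    """Return True when any rule fires; a variable with no measurement never fires."""
    return any(
        rule.variable in measurements
        and rule.compare(measurements[rule.variable], rule.threshold)
        for rule in rules
    )
```

For instance, `[TriggerRule("edge_load", operator.gt, 0.9), TriggerRule("free_egress_gbps", operator.lt, 5.0)]` would start delegating when the edge is more than 90% loaded or when less than 5 Gbps of egress remains.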


Requirements to implement Mutual overflow:

  1. A CDN admin must have the ability to select a subset of the CDN’s own URL portfolio and add it to a mutual balancing set with another CDN ONLY for a given footprint. All URLs in the mutual balancing set must be served to end users by both CDNs. The technical devices and rules used to balance must be set up by both CDNs.
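
One possible balancing rule for a mutual balancing set is a deterministic split by hash, sketched below. This is an assumption, not a prescribed mechanism: the CDN names, the `partner_share` parameter and the choice of hashing client and URL together are all hypothetical.

```python
import hashlib


def route(url, region, client_id, balancing_set, footprint,
          own="cdn-a", partner="cdn-b", partner_share=0.5):
    """Pick the CDN that serves a request; only balanced URLs inside the footprint are split."""
    if url not in balancing_set or region not in footprint:
        return own
    # Hash client and URL so that the same client keeps hitting the same CDN.
    digest = hashlib.sha256(f"{client_id}|{url}".encode()).digest()
    bucket = int.from_bytes(digest[:4], "big") / 2**32  # uniform value in [0, 1)
    return partner if bucket < partner_share else own
```

Because the rule is a pure function of its inputs, both CDNs can run the same code and reach the same decision without talking to each other, which keeps the balancing hidden from clients.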

Requirements to implement Explicit Balancing:

  1. A CDN must have a tool for clients: an explicit balancer. The balancer must act as the highest level request router for all involved CDNs, affecting subsets of each CDN portfolio, and applying to a specific footprint for each URL in the sum of portfolios.
  2. The explicit balancer must have public (known to anyone including clients) balancing rules, public input variables and public policies.
  3. The explicit balancer must account for every transaction/delivery and offer Analytics that allow analyzing the behavior of the balancer, fine-tuning the balancing rules and giving feedback for the pricing models of the balanced CDN product.
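
These three requirements can be sketched together as a small class. It is an illustration only: the `ExplicitBalancer` name, the representation of a public rule as a (name, predicate, CDN) triple and the request format are assumptions.

```python
from collections import Counter


class ExplicitBalancer:
    def __init__(self, rules, default_cdn):
        self.rules = rules  # public, ordered list of (rule name, predicate, CDN name)
        self.default_cdn = default_cdn
        self.log = []       # one entry per routed request: (url, rule name, CDN name)

    def route(self, request):
        """Apply the first matching public rule and record which rule decided."""
        for name, predicate, cdn in self.rules:
            if predicate(request):
                break
        else:
            name, cdn = "default", self.default_cdn
        self.log.append((request["url"], name, cdn))
        return cdn

    def share_per_cdn(self):
        """Fraction of routed requests that went to each CDN."""
        counts = Counter(cdn for _, _, cdn in self.log)
        total = sum(counts.values())
        return {cdn: n / total for cdn, n in counts.items()}
```

Logging the rule name next to each decision is what makes the balancer auditable: clients and CDNs can check the observed split against the published rules and use it when discussing pricing.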

Oculus Rift real value


Three weeks ago I was sitting next to a colleague and I couldn’t help saying aloud “If I really had money I would invest in Oculus now”. That was two weeks before Facebook announced the acquisition of Oculus Rift.

What is interesting here is the coincidence. By no means have I seen the future, and I’m not going to play the typical strategist/consultant role by saying “It is obvious why they did it”. (I have an issue with many consultants and strategists… it bothers me how well they predict the PAST.)

The coincidence is interesting. I was not looking for information about Oculus Rift or VR. I had known Oculus for a reasonable amount of time, almost two years, and liked it of course. It was interesting as a company. They had a solid approach. They were gaining momentum… but in my mind they were not yet ready for prime time. What I was doing was my more-or-less-annual tracking of industry geniuses. I was reviewing the status of John Carmack. The last time I did so may have been 8 months earlier, so I was shocked to see he had left id Software 4 months ago. I quickly read two or three articles about his move to Oculus Rift. Then I reviewed the status of Oculus Rift. I was trying to understand why it was suddenly so appealing to Carmack. I read everything on display on the Oculus website. I read a few articles about their recent demos and talks. I deliberately avoided the chitchat of speculative strategists. I focused on original material: interviews and Oculus PR.

What attracted my attention most was that they had their full staff on display. They had a short CV for each team member, and I read through ALL of them. As an experienced engineering lead, I saw a clear message in that team: they were busy building things. They were into REAL hard engineering. They were not wasting time or money on ‘decorated positions’, you know, nothing like “managing director of company transformation”, or obscure positions like “communication manager”. It was (and is) an expert engineering team focused on very specific HW and very specific SW.

Of course, having some insight into the technologies they work with helps. If you know a little about visualization technology you can detect a breakthrough even if you cannot anticipate one; at least you can spot when someone else has really hit the right thing. In my view these people have put together enough talent to build something new that really works, and they have the capacity to make it affordable. In brief, I’m ready to believe that they can sell VR devices to the masses in a short time. They have got many important things right, and the latest one is hiring Carmack as CTO. Carmack himself is well known as a disruptive, creative thinker. He regularly comes up with new things. Yes, new. And he dares to try many of them, and has the ability to implement and test them. He can program video engines and build and run an aerospace company.

All in all Carmack + Oculus was the right person in the right ‘company’ (word play intended).

So my humble opinion is that Facebook acquired Oculus just because someone at FB realised that Oculus had suddenly become of great value… They didn’t need a clear synergy with their own business: social networks. No, they just had the money and they were clever enough to see the value. In fact they may have feared that other technologists were also able to see the potential for great success.

They bought Oculus either to ‘save it’ from ending up in bad hands, in the hands of a corporation that does not understand the potential of what Oculus is on the brink of achieving, or to ‘save FB themselves’, in case Oculus ended up in the hands of a competitor that might figure out how to leverage its talent to boost social networking.

So it was too tempting. FB could not pass up the opportunity. They had the money, they were convinced of the value, sheer value, and they realized that others could acquire Oculus.

I’m more or less convinced that the Oculus people did not expect immediate interest in an acquisition from the big ones. Most probably they were relying on a first success by selling their device to the masses at the beginning of 2015, maybe associated with a megahit in gaming and at the proper price. Then a big success in sales over a few months might have put them at the forefront of companies ready to be acquired.

But this happened too soon. They were spotted unexpectedly, and I guess FB probably spotted them for the very same reason I did: because of Carmack’s move. It is too surprising. This man does not usually get into unclear ventures. He makes things succeed. As he is so widely recognized in the industry, his move was bound to attract attention.

I guess that other big technologists were ready to offer Oculus a deal. My bets: Google, Microsoft, Sony… It may even have happened. We do not know if FB was the first bidder or the only bidder.

The value of Oculus before Carmack’s move was difficult to evaluate. Were they really close to manufacturing their device? Did they have an agreement with manufacturing companies to mass-produce it? Did they have a closed contract with any megahit game provider (EA, Activision, Disney…)? In November 2013 I would have said that, to acquire Oculus (in case they were ready to sell), any big company would spend no more than $150M, based on the expectation that they would finish their gadget on time and sell a decent number of units to core gamers. The biggest interest could have come from Sony and Microsoft. They could have approached Oculus before it succeeded, just to stop it on the road and acquire its assets in order to adapt its technology to their own gaming consoles. That would have been sad, as the value in Oculus would have been wasted on compatibility with the well-established roadmaps of these big companies.

Google is a different company. Their own Google Glass has some overlap with the potential utility of the Oculus VR kit, but the two do not really target the same purposes. Google must surely have been watching the progress of Oculus, and they have probably been tempted to acquire portions of the technology, probably optics and tracking patents. But as I’ve said, the glasses and the VR helmet are not the same thing; they serve really different purposes. The glasses will give you a bit of augmented reality by adding clever ‘touches’ of information on top of your standard view (which is of amazing value if the glasses are connected to the internet, something that we can only start to imagine). On the other hand, the VR helmet is meant to remove you from reality and throw you into a completely different world. This is also amazing; it is what you look for when you go to the theatre, to the cinema, when you watch a movie on your TV… so many millions of humans would appreciate REAL immersion at an affordable price.

It is clear to me that the Oculus people were not ready to sell the company. Only someone with really deep pockets, and maybe with some promises for the future that include promoting the same goals that Oculus initially had, could have done the magic of acquiring Oculus. The price is much higher than the expected value of yet another gadget company. The price must have been agreed as a way to protect Oculus from being hunted by others in the short term and as a way to demonstrate FB’s intention to keep it.

What can we expect from Oculus now?

We can just guess. My own bet is: exactly what they were focusing on before the acquisition. I can’t believe FB would dare to destroy the real value of Oculus. In addition to that value, someone must now be frantically thinking about the best way to CONNECT the VR helmet to the internet.

People are connected to the internet nowadays while they are on the go, through their mobiles. This is a disruption. If you had asked any technologist 20 years ago about the direction in which personal computing would go and human-to-human communication would evolve, you would surely have got answers about better computers at home, better videoconferencing at home, holographic devices at home… but no one was expecting a personal data link to the rest of humanity that goes with you wherever you go. That has been disruptive.

The VR helmet is designed to extract you from reality, so it does not make sense to make it mobile. But it can become the standard gadget for VR cinema; it can be installed in planes, trains and ships for leisure, effectively replacing millions of small screens. The VR helmet will be present wherever you do not want to care about your surroundings for a while.

In the same way that our TV sets have become connected, and the services this connection yields are greatly appreciated, I envision a near future in which, at home, you can opt to simply ‘stand by’ a TV programme without really focusing on it, maybe while you talk to others, have a drink or even read the newspaper… and the other option is to put on a helmet, a VR helmet that will give you the same ‘content’ in a very different way. You will immerse yourself in the experience; you will be detached from your reality and be part of an experience, connected or not. But the key factor here is that your TV could never extract you from your reality in the way a helmet does.

This is of course a risky statement, as is any statement about the future. I cannot yet see the future, but if I eventually become able to, I’ll let you know.