
Managing topology info in worldwide CDN




A worldwide content network is not easy to build and run. Even though the basics of CDNs are well established today and many different CDNs are out there serving bytes from many thousands of happy publishers to millions of happy end users, very few commercial CDNs can claim to be ‘worldwide CDNs’. I can only think of one, maybe two.

There are more private-purpose content networks with worldwide reach than pure-play or even commercial CDNs with worldwide reach. Google, Netflix, Microsoft, Yahoo… all of them run worldwide networks with private infrastructure for content delivery. Some Telcos (Comcast, AT&T, KPN, Telefonica…) have built their own internal/commercial CDNs. Some of these Telco CDNs have wide coverage, but very few, if any, can claim to be global. If we classify a CDN as ‘global’ when it works as a single content network capable of moving any content to any PoP, capable of consolidated reporting about the whole network, running content routing for the whole network and having PoPs in at least three continents, then only one of the above-mentioned Telco CDNs qualifies, while all the private CDNs mentioned qualify.

The special category of networks that I’m talking about has problems of its own. Size matters, and running content routing for a network with PoPs in three continents poses some interesting problems about content instantiation, request routing, requester geo-location and the influence of internet and CDN topology.

Does the CDN owner have control of packet routing through its network? The answer is ‘NO’ in most cases. Most overlay owners are NOT network owners. The network owners are the Telcos. Creating an overlay on top of the internet is not the same thing as creating an internet backbone. The raw network offers a single service to the content network: point-to-point connection. It never promises to route packets through any specific path or to keep latency bounded, and of course you can forget about regular traffic shaping. No. Most overlays can only ensure end-to-end communication, period. If we think twice about this statement, then, what is the value of the overlay? The internet is an IP network; end-to-end packet communication is guaranteed if you know the addresses of both endpoints, so… wouldn’t you expect overlays to give you good things beyond ensuring end-to-end connection? Yes, every one of us should expect more than connectivity from a content network.

In this article I will explore technology and contributions from the CDN space to worldwide-reach network management. How do we handle worldwide content routing? How do we ensure good Quality of Experience? What are the challenges? What are the solutions? Are current CDNs using the best technology? Are current CDNs active innovators? Do they offer significant value on top of the internet?

It is especially important to realize that the Internet is a patchwork of interwoven networks. The owners of the different patches are Telcos and they have plenty of information about their property. Are Telcos leveraging topology info? Are the other stakeholders able to benefit from the Telcos’ topology knowledge?



So, how can a content overlay add value to networking? (… compared to the raw Internet).

There are several ways:

-providing content discovery, a content directory (list of references) and a mapping to content instances, faster than internet DNS: affecting response latency. This ‘responsiveness’ is key in content services today.

-providing storage that is closer, bigger, more reliable than publisher site: affecting quality of experience (QoE) for consumer’s delivery and scaling up the number of consumers without hampering that QoE.

-providing security against Denial of Service, content tampering, content theft: affecting QoE for publishers and consumers, thus supporting content business, making it possible, making it profitable.

-changing packet routing to optimize content routing: affecting average end to end latency in backbones and affecting backbone and link reliability. This is something that only router owners can do, so not every overlay is in the position to add this value. It is complex. It is risky. It may affect other services. But it may be extremely cost effective if done properly.

Most CDNs, no matter if they were built for private exploitation or open commercialization, just rely on the internet naming system (DNS) as a means of content discovery. This leads to fundamental shortcomings in performance, as DNS is a collaborative distributed system in which intermediate information is ‘cached’ for various TTLs (Time To Live) in various places, so end-to-end there are several players affecting latency and they do not work in coordination toward a common goal. They do not treat content differently from any other named object. And, most important, they do not share the same policies. All in all we should be grateful that DNS works at all; it is an impressive feat that we can retrieve any object from the internet in a ‘reasonable time’. But if you design your own content network, no matter whether you put it on top of the internet or you build a huge private network, please start thinking of a different, better way to discover content.


Alternate approaches to content discovery.

This lesson has been understood by P2P networks. They probably didn’t do the discovery for performance reasons. Most probably they didn’t want to use open URIs or URLs, as their activities were bordering on illegality. So they created private names, new URIs, and created directory services. These directory services work as fuzzy naming systems; you query the directory with ‘keys’ and the directory responds with a list of URIs. In P2P networks these URIs are massively replicated throughout the peers, so anyone that has stored a fragment of a title referred to by a certain URI is a candidate to provide bytes from that content to others. P2P networks have their own algorithms to seek peers that have instances of a given URI. Please notice that in this ‘P2P world’ a single content is represented by a URI, but that URI, that ‘name’, does not represent a single instance of the file. The URI represents ALL instances. So the P2P network has an additional task, which is to keep track of available instances of a URI in the neighborhood of a requester in near real time. There are a lot of interesting hints and clues for future networking in P2P, but let’s not think about all these fantastic ideas now and just focus on the directory service. There are even distributed directories, like the networks named Chord and Kademlia.

P2P organization of a content network may be too aggressive compared to the classical client-server approach. It is difficult to run a business based on fully open P2P cooperation, as there must be responsibility for and maintenance of network nodes, and everyone wants to know that there is someone who receives payment and has the responsibility to run the nodes. Responsibility cannot be distributed if the network must be commercially exploited. Today there exist P2P commercial content networks (Octoshape), but they are P2P only at machine level, as they run central policies, own all the software of the peers and have full responsibility for network behavior. Remember: responsibility cannot be distributed.


Content instantiation and routing strategies

So let’s say that we have a classical approach to content networking, closer to client-server than to P2P. Let’s say that our network is huge in number of nodes and in locations (a truly global network as defined in our introduction to this article). How do we map requests to content instances? There are strategies that we can follow.

-Forward-replicate (pre-position) all content titles in all PoPs. This sounds expensive, doesn’t it? Yes, it is too expensive. Today CDN PoPs are meant to be close to the requester (consumer); for this reason there are many PoPs, and in a truly global network PoPs may form an extremely distributed set with some PoPs far apart from others. In this scheme, putting all titles from your customers in all places (PoPs) is plainly impossible, unless you have very few customers, or they have very few titles, or both things at the same time, which doesn’t look like good business for any CDN. The reasonable approach to this strategy is creating a few distributed ‘BIG Repositories’ and mapping the PoPs’ requests to them. In this approach a request will be mapped to a local PoP; in case of a miss the PoP will go to the closest repository, which is BIG and, if you have a policy of forward replication of new titles to Repositories, there are high chances that the title is just there. In the (rare) case of a miss you can forward your query to the very source of the title: your customer.

-Do not forward-replicate ANY title. Just do plain caching. Let your PoPs and caches at the edge collect queries, run a pure location-based request-routing mapping each query to the closest cache/PoP and let the cache go to title source (customer) and thus get ‘impregnated’ with that title for some time (until eviction is needed to use space for another title).

The two strategies mentioned above have important differences that go beyond content discovery latency. These differences affect the way you build your request routing. You have to decide a priori whether you will proactively specialize some PoPs/caches in some content and thus route every request, even far ones, to the place where the content is, or on the contrary let the content map to the caches following demand. It is an important choice; it makes all the difference between long-tail and mega-hit distribution.


Dynamic cache specialization: URL hashing

Some authors are confident that the most dynamic approach can work: use URL hashes to specialize caches in some URLs (and thus some content titles) but let caches be filled with content dynamically as demand for titles evolves regionally. This strategy ‘sounds reasonable’. It could be easily implemented by assigning a content router to some region and configuring it to apply a hash to every title URL to be distributed in that region, in a way that the whole URL space is distributed evenly over the cache/PoP space. I’ve said that it ‘sounds reasonable’ but… what about the real world performance of such a strategy? Does it really work? Does it adapt well to real world demand? Does it adapt well to real world backbone topology? Unfortunately the responses are No, No and No.

In the real world, content demand distribution over the title space is not flat. Some contents are much more popular than others. If you prepare your hashes for a flat, even distribution over all caches you’ll be wasting some precious resources on ‘long tail’ titles while at the same time your ‘mega-hits’ are not given enough machines to live in. So the hash function must be tweaked to map popular titles to a bigger spot than long tail titles. You can start to see the difficulty: we do not know popularity a priori, and popularity changes over time… so we would need a parameterized hash that changes the size of hash classes dynamically. This represents another efficiency problem, as we cannot let the hash classes move from one machine to another wildly… we need to capitalize on content being cached in the same machine for some time. We have to balance popularity-change-rate against cache-efficiency so we can adapt reasonably well to the change in popularity of a title without destroying cache efficiency by remapping the hash class too often.
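One possible way to picture a hash whose classes grow with popularity without reshuffling everything is rendezvous (highest-random-weight) hashing. This is a swapped-in technique used only for illustration, not something the strategy above prescribes: each title ranks all caches with a repeatable score and uses the top few, and the number of caches granted to a title (the `replicas` argument, a hypothetical knob) would be driven by its measured popularity. The cache names and URLs are made up.

```python
import hashlib


def score(url: str, cache: str) -> int:
    """Repeatable pseudo-random score for a (title, cache) pair."""
    digest = hashlib.sha256(f"{cache}|{url}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def caches_for(url: str, caches: list[str], replicas: int) -> list[str]:
    """Return the `replicas` best-scoring caches for this title.

    Raising `replicas` only appends caches to the end of the list, so the
    caches that already hold the title keep serving it.
    """
    ranked = sorted(caches, key=lambda cache: score(url, cache), reverse=True)
    wanted = max(1, min(replicas, len(caches)))
    return ranked[:wanted]


# Hypothetical cache names and titles, for illustration only.
CACHES = ["cache-a", "cache-b", "cache-c", "cache-d", "cache-e"]
long_tail = caches_for("http://example.com/rare-title", CACHES, replicas=1)
mega_hit_small = caches_for("http://example.com/hit-title", CACHES, replicas=2)
mega_hit_grown = caches_for("http://example.com/hit-title", CACHES, replicas=4)
```

When a title becomes more popular its set of caches grows, but the previously assigned caches stay in the set, which is one way to trade adaptation speed against cache efficiency.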



Influence of current network technology in the quality of experience

In the real world, backbone topology and access network topology in metropolitan areas can create serious problems for you. In any typical big city you’ll see something like one BRAS (Broadband Remote Access Server) per 50000-150000 subscribers, and about 10-15 DSLAMs per BRAS, which is 5000 to 10000 subscribers per DSLAM. BRASes are joined together through a metro LAN, most likely using n×10Gbps links (n=1, n=2, n=3). At some point this metro LAN reaches a transit datacenter in which a BGP router links that region to other regions (cities/metro areas), probably using n×10Gbps links or 40Gbps links (still not common) and running IGP. The transit site may or may not be connected to other ISPs/international trunks through EGP. Today DSLAMs are being slowly phased out and replaced by optical multiplexers: one side is IP over optical in a big trunk, the other side a PON that replaces the DSLAM subscriber area. This change dramatically improves latency from end user to IP multiplexer (the biggest drawback of DSL technology), but it does not much increase the IP switch bandwidth or the density of access lines, as these numbers directly impact the cost of the switch backplane electronics and that part is not greatly influenced by optical technology.


Network technology latency watermark

The deployment of fiber (PON) is enhancing the QoE of content services dramatically by reducing latency. As we have noted, latency to receive the requested content is key in the experience, and with the previous technology (DSL) there was a practical limit in transmission due to multilevel modulation and interleaving. If you have access to worldwide data about CDN latency (some companies provide these data to compare CDN performance all over the world) you can track the progress of PON deployment worldwide. In a country where most lines are DSL, a CDN cannot perform on average better than 50ms for query latency. In a country with a majority of fiber accesses the best CDN can go down to 10-20ms. This is what I call the ‘network technology latency watermark’ of a country’s network. Of course these 50ms or 20ms are average query-response times. The average is calculated over many queries, most of them with cached DNS resolutions (depending on TTL values), so the impact of DNS resolution on these averages is low. The highest impact comes from the ‘network technology latency watermark’ of the infrastructure. Round trip time (RTT) on DSL is 50-60 ms. RTT on fiber is 5 ms. RTT in any place in the world today is a mix… but what if you know where fiber is deployed and where it is not? You could plan your content distribution ahead using this knowledge.

As a more general statement, you can make content-routing choices based on ALL your knowledge about the network. These choices may then follow a fairly complex set of rules involving business variables, time-dependent conditions and, last but not least, low-level network status. Packet routers cannot apply these rules. You need to have applications imposing these choices in real time on your content network.


Topology knowledge: representation and application

Classic traffic engineering is based on a probabilistic model of links and switches (a large set of interconnected queues). If you are to instantiate copies of content in some parts of the network and forward queries to concrete machines, you will need a proper way to represent ALL the knowledge that you can possibly gather about your own network and maybe other networks outside of yours. This knowledge representation should be appropriate for writing routing actions based on it and flexible enough to match business criteria. These actions should in the end be decomposed into packet routing actions, so the content routing agent will ideally interface with the network routers, giving them precise instructions about how to route the traffic for a specific consumer (query) and a specific instance of a title (a named copy). This interface from content-router to network-router is just not there… It is not possible to communicate routing instructions to OSPF/IS-IS routers. It is much more feasible to communicate with BGP routers and with private enhanced DNSs, and recently, through the use of OpenFlow, a content-router can start to control the forwarding engine of multiple network-routers that implement OpenFlow. But this last option, even though it is very powerful, is very recent and only a few large scale deployments exist today (Google’s is the most significant).

Recently several initiatives have appeared to tackle this topology handling problem. The most famous may be P4P and ALTO. P4P was born in the P2P academic community to give network owners the opportunity to leverage topology plus performance knowledge about the network without compromising business advantage, revealing secrets or weakening security. ALTO is the IETF counterpart of P4P.

Knowing the exact topology and detailed link performance of any internet trunk may give opportunities to application owners and other Telcos to perform their own traffic engineering tuned to their benefit, so you can understand that Telcos are not willing to provide detailed info to anyone even if they offer to pay for it.

P4P and ALTO are ‘network information briefing techniques’ that concentrate on the network information that is relevant for the design of overlays over wide area networks. The goal of the overlay is usually to maximize spread and quality of a service while minimizing cost. To pursue this goal it would be good to know ‘where’ the ‘peers’ or ‘nodes’ are in the world and ‘how costly’ it is to send a message from one node to another. Both techniques create a partition of the world (as we deal with the internet, notice that the world is just the space of all IP addresses). This partition creates a collection of classes or subsets of IP addresses. Each class joins IP addresses that are equal/close according to a certain partition criterion. Each class is identified with a number: the Partition ID (PID). If we select a proper partition criterion, for instance: two IPs lie in the same partition class if their access lines go to the same BRAS (or to the same BGP router, or lie in the same AS), etc… then we will break the whole IP space into a number of classes, and the ‘cost function’ between any two IP addresses can be roughly approximated by a simplified cost function between their classes.

So a convenient way to summarize a whole network is to provide a description of classes: a list of all IP addresses that lie in each class (or a rule to determine the class of every IP), and a class-to-class cost function.

Armed with these two objects, an IP-to-PID mapping and a square PID-to-PID cost matrix, a content router agent may take sophisticated routing decisions that can later be turned into policies for BGP, DNS rules or OpenFlow directives.
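As a minimal sketch of how these two objects could be used together, assume the partition is given as a list of IP prefixes tagged with a PID and the cost matrix as a plain square list of lists. The prefixes (taken from the documentation address ranges), the PIDs, the cost values and the function names below are all invented for illustration; they do not come from P4P, ALTO or any real network.

```python
import ipaddress

# Hypothetical partition: each prefix belongs to one class (PID).
PARTITION = [
    (ipaddress.ip_network("192.0.2.0/24"), 0),
    (ipaddress.ip_network("198.51.100.0/24"), 1),
    (ipaddress.ip_network("203.0.113.0/24"), 2),
]

# Hypothetical square cost matrix: COST[a][b] is the cost from PID a to PID b.
COST = [
    [0, 5, 9],
    [5, 0, 3],
    [9, 3, 0],
]


def pid_of(ip: str) -> int:
    """Map an IP address to its PID using the longest matching prefix."""
    addr = ipaddress.ip_address(ip)
    matches = [(net.prefixlen, pid) for net, pid in PARTITION if addr in net]
    if not matches:
        raise LookupError(f"{ip} is not covered by the partition")
    return max(matches)[1]


def cheapest_cache(requester_ip: str, cache_ips: list[str]) -> str:
    """Pick the candidate cache whose class is cheapest to reach."""
    src = pid_of(requester_ip)
    return min(cache_ips, key=lambda cache: COST[src][pid_of(cache)])


chosen = cheapest_cache("198.51.100.7", ["192.0.2.10", "203.0.113.10"])
```

A real content router would combine this cost lookup with business rules and cache contents before emitting a BGP policy, a DNS answer or an OpenFlow directive.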



We need to decide, for a given region (a subset of IP addresses from the whole IP space), which query (an IP-URL pair) we map to which cache/streamer (IP). This decision making process happens at the regional request router. Using today’s technology the content router can only be a server working at application level (OSI level 7), as long as there are no content oriented network mechanisms integrated in any available network technology. With the evolution of HTTP and OpenFlow we may find changes to this in the coming years.

(IPreq + URLtit) → IPca is our ‘request routing function’. What parameters do we have to build this function?

-IPreq geo position (UTM)

-IPreq topology: AS number, BGP router ID, topology partition class (PID). Partitions are a good way of briefing all possible knowledge about topology.

-URLtit, do we know which caches currently have a copy of URLtit? Let’s assume that we know exactly that. We may not have a list of URLs mapped to each IPca (each cache), but we can have, for example, a hash. The simplest form is: if we have N caches we can use the least significant m bits of the URL, where 2^(m-1) < N <= 2^m. In case N = 2^m we are done. In case N < 2^m we need to ‘wrap the remainder up’. This means that the cache ID = least significant m bits, but if the resulting ID is bigger than the highest cache ID we obtain a new ID = ID - 2^(m-1). This is simple but unfair, as some caches receive twice as many requests as others. We can do better with a proper hash that maps any URL to N caches. But bear in mind that it is difficult to build good hashes (hashes with good class separation) without using congruence in an evident manner. Be careful with uneven hashes.

-Cost info: after partitioning, class-to-class distance. Usually this info can be represented by a matrix. P4P and ALTO do not specify any way to represent this distance. A convenient extension to the partitioning proposed there may be:

.have a full-IP-space partition for each region. Extend the information of a region partition by adding the necessary classes out of the region to represent distance to other regions until we have a full-IP-space partition.

.have a full-meshed cost matrix for each region: a square cost matrix class-to-class in every region.
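The wrap-around hash from the URLtit item can be sketched as follows, to make its unfairness visible. The sketch assumes m is the smallest number of bits such that N <= 2^m, and it takes the low bits from a SHA-256 digest of the URL rather than from the raw URL bytes; both are assumptions made here, and the function names are hypothetical. A plain modulo mapping is included as one simple way to spread URLs over N caches more evenly.

```python
import hashlib
from collections import Counter


def bits_needed(n_caches: int) -> int:
    """Smallest m such that n_caches <= 2**m (at least 1 bit)."""
    return max(1, (n_caches - 1).bit_length())


def wrap_id(raw_id: int, n_caches: int) -> int:
    """Keep the low m bits; IDs past the last cache wrap down by 2**(m-1)."""
    m = bits_needed(n_caches)
    cache_id = raw_id & ((1 << m) - 1)
    if cache_id >= n_caches:
        cache_id -= 1 << (m - 1)
    return cache_id


def url_bits(url: str) -> int:
    """Assumption: derive the bits from a digest of the URL."""
    return int.from_bytes(hashlib.sha256(url.encode("utf-8")).digest(), "big")


def low_bits_cache(url: str, n_caches: int) -> int:
    return wrap_id(url_bits(url), n_caches)


def modulo_cache(url: str, n_caches: int) -> int:
    """Alternative: reduce the whole digest modulo the number of caches."""
    return url_bits(url) % n_caches


# With 6 caches and 3 bits, raw IDs 6 and 7 wrap onto caches 2 and 3,
# so those two caches receive twice the share of the others.
share = Counter(wrap_id(raw, 6) for raw in range(8))
```

With N = 6 the counter shows caches 2 and 3 owning two of the eight raw IDs each, which is the double load mentioned above.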

It is very straightforward to program a content router with sophisticated rules once we have the partition ready for a region. What is really hard is to prepare the partition for the first time and later keep it up to date.


How to build a partition from classical routing info + other info

Classical routing in a worldwide network comprises BGP routes + local routes. Given the current state of technology it is usually appropriate to set up at most one content router per BGP router. It is usually not necessary to handle sub-BGP areas with several content routers, and we could even use a single content router for a whole AS with several BGP routers that speak EGP. When partitioning the area covered by a BGP router we can use the routing table of the BGP router to identify prefixes (sub-networks) reachable from that router, and we can combine this routing table with RADIUS services to map the IP addresses reachable by that router to one or more BRASes or DSLAMs (or equivalent multiplexers in PON). Usually BRASes speak BGP to other BRASes, so many times we can just break down the BRAS area into DSLAM areas, and that is not bad as a starting partition. Of course dynamic IP assignment can complicate things, but as long as the IP pool is the same for the same DSLAM/multiplexer it doesn’t matter much that the IP addresses move from one access line to another in the same DSLAM/multiplexer.

We can refine partition over time by handling RADIUS info frequently to manage changes in the pool.

We can also run off-line processes that correlate the geo-positioning info that comes from any good global service or available terminal GPS to the partitioning done from RADIUS and BGP tables.

Also by checking frequently the BGP updates we can correct the cost matrix in case a link is broken or weakened. That effect can be detected and reflected in the cost matrix.



Maybe P2P academics and the IETF never achieved their goal of making Telcos cooperate with overlays by providing topology insights through standard interfaces: P4P/ALTO.

Anyway, partitioning has proven to be a good abstraction to put together a lot of useful information about the network: classical routing info both local and AS to AS, performance of routes summarized at class level, and economics of routes in case some class-to-class traffic goes through a peering.

Partitioning is practical today just using BGP info but it is hard to bootstrap and to maintain. We need a lot of computing power to mine down the BGP tables and RADIUS pools, and cannot refresh partitions and cost matrices too often, but it can be done fast enough for today’s applications.

In the immediate future, as more wide area networks embrace OpenFlow, we could get rid of BGP table processing and RADIUS pool analysis. An OpenFlow compliant network can have central routing planners or regional routing planners, and the routing info may be exposed through standard interfaces and will be homogeneous for all routing devices in the network. It is just interesting to be here waiting for the first OpenFlow CDN implementation. Probably it has already happened and it is a private CDN, maybe one of the big non-commercial ones that I mentioned. In case it has not happened yet, it will happen soon. In the meantime we can only guess how good the implementation of the request routing function is in our current CDNs. Would it be based on partitioning?

Worldwide live distribution through cascaded splitting



In this short paper I want to analyze the impact of different architectures in worldwide distribution of live streams over internet: cascaded splitting vs. non-cascaded multi region local splitting.

In a worldwide CDN overlaid on top of Internet we will need to pay attention to regional areas (REG) equipped each one with a single Entry-point that is a First Level Splitter (FLS) that gathers many input channels from customers. We have in the same REG many endpoints/streamers or Second Level Splitters (SLS) that split each input channel into many copies on demand of consumers.  The SLS are classical splitters, classical caches in the border of any current CDN, in brief: Endpoints.  They talk to consumers directly. Each endpoint must obtain its live input from the FLS output. FLS + endpoints act as a hierarchy of splitters inside the REG.

We have two very different scenarios:

Non cascading entry-points: every endpoint in a REG can connect to local channels (local FLS) and can connect to far (foreign) channels (foreign FLSs).  The FLS in the REG receives as input ONLY local channels and gives as output all endpoint connections: local and foreign.

Cascading entry-points: endpoints in a REG connect ONLY to the FLS in that REG. The FLS in a REG receives as input local channels and also foreign channels from other REGs. The FLS gives as output ONLY local connections to local endpoints.

We want to compare these two scenarios in terms of:

Quality of connections: the shorter the connection, the lower the latency, the fewer the losses and the higher the usable throughput. We want to keep connections local (both ends in the same REG) as much as possible.

Capacity of worldwide CDN: We do not want to open more connections than strictly needed.

So we need to count connections in both scenarios and we need to evaluate the average length of connections in both scenarios.

There are also other properties of connections that matter in our study:

-input connections cost MORE to a splitter than output connections. This applies both to FLS and endpoint.

(Note: Intuitively every input connection is a ‘write’ and it is different from any other write. It requires a separate write buffer and separate quota of the transmission capacity. Output connections on the contrary are ‘reads’. Two reads about the same input have separate quota of transmission capacity but they share the same read buffer. This buffer can even be exactly the same write buffer of the input. So creating an input connection is more expensive than creating an output connection. It consumes much more memory. Both input and the output from that input consume exactly the same transmission capacity quota.)

Statements and nomenclature:

O = Number of REGs

S_i = Number of source channels in REG ‘i’

E_i = Number of endpoints in REG ‘i’

p_ihk = Probability that channel ‘k’ (1 <= k <= S_h) originated in REG ‘h’ is requested by ANY consumer in REG ‘i’. (Note that it is possible that i=h, and it is also possible that i≠h.)

pe_ijhk = Probability that channel ‘k’ (1 <= k <= S_h) originated in REG ‘h’ is requested by ANY consumer in Endpoint ‘j’ (1 <= j <= E_i) in REG ‘i’. (Note that it is possible that i=h, and it is also possible that i≠h.)

d_ik = distance REG ‘i’ to REG ‘k’. (Note that, as usual, d_ii = 0 and d_ik = d_ki.)




We try to compare two architectural options that present a problem in worldwide distribution of channels. This problem would not be present if we just had local channels that did not need to be distributed far from the production site. We would not have this problem either if there were a single source and a worldwide distribution tree for that source; in that case we would build a static tree for that source that could be optimal. What I’m dealing with here is the choices that we have in case there are many live sources in many regions and there is cross-consumption of live distribution from region to region. The questions are:

-which is the best splitting architecture in terms of quality and CDN effort?

-what are the alternatives?

-what is the quantitative difference between them?

I’ve presented two alternatives that exist in the real world:

1) Non-cascading distributed splitting: two-level splitting, first at the entrypoint, second at the edge, in endpoints. When a source is foreign, the local endpoints must establish long reach connections.

2) Cascading distributed splitting: two-level splitting, first at the entrypoint, second at the edge, AND entrypoint to entrypoint. When a source is foreign the local endpoints are not allowed to connect directly to it. Instead the local entrypoint connects to the foreign source and turns it local, making it available to local endpoints as if it had been a local channel.

In the following pages you can see a diagram of several (5) REGs with cross traffic and the analysis of CDN effort and resulting quality in terms of the nomenclature introduced in the above paragraph.

Non-Cascading entrypoints

[Figure: non-cascading entrypoints]


Cascading entrypoints

[Figure: cascading entrypoints]



Practical calculations:

It is clear to anyone that the piece of information that is most difficult to obtain is the list of probabilities p_ihk and pe_ijhk.

No matter the complexity of our real world demand, there are some convenient simplifications that we can make. These simplifications come from our observation of real world behavior:

Simplification 1:

p_ihk is always ‘0’ or ‘1’. A channel is either ‘exported for sure’ from REG ‘h’ to REG ‘i’ or ‘completely prohibited’.

(Note: there are so many consumers in REGs that once a channel is announced in a REG it is really easy to find someone interested in it. A single consumer is enough. If one out of 500000 wants the channel in REG ‘i’ then the channel MUST be exported to REG ‘i’. The probability p_ihk here models the behavior of isolated people, and so we observe that once the population is high enough p_ihk approximates 1. If the channel ‘k’ in REG ‘h’ is geo-locked to its original REG, it cannot be requested by any REG ‘i’, and then p_ihk is 0.)

Simplification 2:

p_iik is always ‘1’. Following the same reasoning that we see above, once a channel is announced in a REG it is always consumed in that REG.

Simplification 3:

pe_ijhk has approximately the same value for all ‘j’ in the REG ‘i’. That means that every endpoint ‘j’ in REG ‘i’ receives an equal load of requests for channel ‘k’ from REG ‘h’. Why is this? It happens just because endpoints receive requests that are balanced to them by a regional Request Router (usually a supercharged local DNS). The aim of the Request Router is to distribute requests evenly to streamers working for the REG.

P.S.: this behavior depends on the implementation of the Request Router. A reasonable strategy is based on specializing endpoints in some content (through the use of URL hashes for instance), and thus a given channel will always be routed through the same endpoint. If correctly implemented this would mean that for a given pair (h,k), pe_ijhk is ‘0’ for all values of ‘j’ except one in REG ‘i’, and for that value of ‘j’ pe_ijhk is ‘1’.
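To get a feeling for the numbers, the following sketch counts long-reach (REG to REG) input connections in both scenarios. It is only an illustration under assumptions of my own: Simplification 1 holds (a channel is either exported to a REG or not), in the non-cascading case every endpoint of a REG ends up pulling every foreign channel exported to that REG (the evenly balanced case of Simplification 3), and in the cascading case only the local FLS pulls each foreign channel once. The function name, the `exported` table and all the figures are hypothetical.

```python
def long_reach_connections(endpoints, exported, distance, cascading):
    """Count inter-REG input connections and their distance-weighted total.

    endpoints[i]   : E_i, number of endpoints in REG i
    exported[i][h] : number of channels of REG h with p_ihk = 1 for REG i
    distance[i][h] : d_ih, distance between REG i and REG h
    """
    count = 0
    weighted = 0
    regions = range(len(endpoints))
    for i in regions:
        for h in regions:
            if i == h:
                continue  # local channels never need a long-reach connection
            # Cascading: one FLS-to-FLS link per channel.
            # Non-cascading: one link per channel and per local endpoint.
            per_channel = 1 if cascading else endpoints[i]
            links = per_channel * exported[i][h]
            count += links
            weighted += links * distance[i][h]
    return count, weighted


# Hypothetical example with three REGs.
E = [4, 2, 3]
EXPORTED = [[0, 2, 1], [1, 0, 0], [3, 1, 0]]
D = [[0, 10, 20], [10, 0, 15], [20, 15, 0]]

cascaded = long_reach_connections(E, EXPORTED, D, cascading=True)
non_cascaded = long_reach_connections(E, EXPORTED, D, cascading=False)
```

In this made-up example cascading needs 8 long-reach connections against 26 without cascading. With endpoints specialized through URL hashes, as in the P.S. above, the non-cascading count would drop to one endpoint per channel and match the cascading count, although the far end of each long connection would then be an endpoint rather than the local FLS.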

Current content technology: appropriate or not up to the task?



‘Content’ is a piece of information. We may add ‘intended for humans’. That may be the most defining property of content compared with other pieces of information. In a world increasingly mediated by machines it would not be appropriate to say that machines ‘do not consume content’; they do, on our behalf. The key idea is that any machine that processes content does it immediately on behalf of a human consumer, so it processes content in a human-like way. STBs, browser players, standalone video players, connected TVs… may be deemed ‘consumers’ of content with a personality of their own (more on this later in the article), but they are acting on information that was intended for humans. They are mimicking human information consumption models, as opposed to M2M (machine to machine) information processing, which happens in a way totally unnatural to humans.

What is the state of the art in ‘content technology’? By ‘content technology’ I mean: ways to represent information that are natural for humans, and devices designed to take information from humans and provide information to humans. Are these technologies adequate today? How far are we from reaching perceptual limits? Are these technologies expensive or cheap? Do we have significantly better ‘content technology’ than we used to have or not?


The most natural channel to feed info to humans is the combination of audio + video.  Our viewing and hearing capacities work together extremely well. We humans are owners of some other sensory equipment: a fair amount of surface sensitive to pressure and temperature (aka skin), an entry-level chemical lab for solids and liquids (tongue) and a definitely sub-par gas analyser (nose). As suggested by these descriptions we cannot be really proud of our perception of chemicals but we do fairly well perceiving and interpreting light, sound and pressure.  This is not due to the quality of our sensors, which are easily surpassed by almost every creature out there, but due to the processing power connected to those sensors.  We see, hear and feel through our brain. We humans have devoted more brain real estate to image and sound processing than other animals.

It is no coincidence thus that technology intended to handle human information has been focused on audio visual.  Relatively recently a branch of technology took off: ‘Haptics’, that may give our pressure sensors some fun (multi-touch surfaces, force-feedback 3D devices, gesture capture…) but ‘Haptics’ is still underdeveloped if we compare it to audio-visual technologies.

So we have created our information technology around audio-visual. Let’s see where we are.


We are able to perceive light. We develop brain interpretations linked to properties of light: intensity, wavelength (colour is our interpretation of light wavelength + composition of wavelengths). There are clear limits to our perception. We just perceive some of the frequencies. We cannot ‘see’ below red frequencies (infra-red) or above violet (ultra-violet). We have an array of light sensors of different types, ‘cones’ and ‘rods’, and a focus system (an organic ‘lens’). Equipped with this hardware we are able to sample the light field of the outer world. I have chosen to say ‘sample’ because there is still no consensus about how we process data from time-varying light fields: does our brain work in entire ‘frames’? Does it keep a ‘static image’ memory? Does it use a hierarchical scene representation? Does it always sample at the same pace (at a fixed rate)? We know very little about how we process images. Nevertheless some of the physical limits to our viewing system come from the ‘sensory hardware’, not from the processor, not from the brain.

Lens + light sensor:  aperture, angular resolution, viewing distance

If you browse any good biomechanics literature (e.g. Richard Dawkins: Climbing Mount Improbable) you can find that the lens of the human eye has on average an aperture of 10 deg.   We form ‘stereoscopic’ images by overlapping information from two eyes. We have roughly 125-130 million light receptors per eye; most of them are rods unable to detect colour, and less than 5% (about 6-7 million) are cones sensitive to colour. They are not laid out in a comfortable square grid, so our field of view may be represented as delimited by something like an oval with its major axis horizontal. The central part of this oval is another oval with its major axis vertical, where the two vision cones overlap. This small spot is our stereoscopic high-resolution view spot, though we still receive visual info from the remaining part of the ‘whole field of view’, which may be 90º vert. and 120º hor.

Optical resolution can be measured in line pairs per degree (LPD) or cycles per degree (CPD). Humans are able to resolve 0.6 LP per minute of arc (1/60 deg). We can tell that two points (each one lying on a different line of a line pair of contrasting colour) are different when they are apart by more than about 0.8 minute of arc (half of the 1/0.6 ≈ 1.7 minutes of arc that one line pair spans).  Visual acuity, defined as 1/a where ‘a’ is the abovementioned response of 0.6 LP per minute of arc, is about 1.7. This is called 20/20 vision.  You, standing at 20 feet from the target, see the same detail as any normal viewer standing at 20 feet. If you have better acuity than average you can see the same detail standing farther away, for instance 22/20. If you are subpar you need to get closer, for instance 18/20.  As a reference a falcon’s eye can be as good as 20/2. Cells in the fovea (the central region of the view field, spanning 1.4 deg of the total 10 deg) are better connected to the brain: 1 cone to 1 nerve, and cones are more tightly packed there, so spatial resolution in the fovea is higher than in the peripheral view. The ratio of connections in the peripheral view can drop to as little as 1:20, which means that 20 light receptors sum up their signals into a single signal they feed to a single nerve.

I know that this way of measuring the resolving power of our eyes is cumbersome, but it is the only right method! Let’s do some practical math. Let’s say that we read a book or we read on a tablet. Normal reading distance may be 18” = 45.7 cm. Our eyesight cone at that distance is just 8 cm in base radius (spot height). We see 0.6 LP/min arc * 600 min arc = 360 LP in 8 cm vertical, so we can tell 720 ‘points’ apart in a vertical high-contrast strip of points of alternating colour. This is 720 points / 8 cm = 90 points/cm or 228 points per inch (ppi). You may have noticed that you cannot tell apart two adjacent dots printed by modern laser printers (300-600 dpi) at normal reading distance. It would not be fair to say that 250 dpi suffices to print so we can read comfortably at normal distance, as different printing technologies may need more than 1 ink drop (a dot) to represent a pixel. This is the reason state of the art printing moves between 300-600 dpi and it does not make much sense to go beyond. (P.S: You may notice there are printers that offer well above that: 1200-1400 ppi, but most of them confuse ‘ink drop dots’ with pixels. They cannot represent a single pixel with a single drop or dot. There are also scanners boasting as much as 4000 dpi… but this is an entirely different world, as it may make sense to scan a surface from much closer than viewing distance so we can correct scanner optical defects to produce a right image for normal viewing distance.)
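The reading-distance arithmetic above can be reproduced with a minimal Python sketch. It assumes exactly the figures used in the text (0.6 LP per minute of arc, a 10 deg aperture, 18” = 45.7 cm); the function name `resolvable_ppi` and the constants are illustrative only.

```python
import math

POINTS_PER_ARCMIN = 1.2   # 0.6 line pairs per minute of arc, two points per pair
APERTURE_DEG = 10.0       # aperture figure used in the text
CM_PER_INCH = 2.54


def resolvable_ppi(distance_cm: float) -> float:
    """Points per inch the eye can tell apart on a surface at the given distance."""
    # Height of the spot covered by the aperture at this distance (about 8 cm at 45.7 cm).
    spot_cm = distance_cm * math.tan(math.radians(APERTURE_DEG))
    # Distinguishable points inside the aperture: 1.2 points/arcmin * 600 arcmin = 720.
    points = POINTS_PER_ARCMIN * APERTURE_DEG * 60
    return points / spot_cm * CM_PER_INCH


if __name__ == "__main__":
    print(f"18 inch reading distance: {resolvable_ppi(45.7):.0f} ppi")
    print(f"20 inch monitor distance: {resolvable_ppi(50.8):.0f} ppi")
```

With the unrounded spot height (about 8.06 cm) the first result comes out near 227 ppi, in line with the 228 ppi obtained with the rounded 8 cm.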

Assuming that display/print technology is not too bad at translating pixels to dots, we can say that a good surface for reading at 18” must be capable of showing no less than 250 dpi or ppi so you can take full advantage of your eyes. By these standards ‘regular’ computer displays are not up to the task, as they have 75-100 ppi and are typically viewed at 20”, so they would require over 200 ppi. iPad Retina seems a more appropriate display as it has been designed with these magic numbers in mind. Retina devices have pixel densities from 326 ppi (phone) through 264 ppi (tablet) down to 220 ppi (monitor). As the viewing distance for a phone is less than 10”, 326 ppi fits in the same acuity range as 264 ppi for 18”, and so does 220 ppi for 20”. Other display manufacturers have followed the trail of Apple: Amazon Fire HD devices have matched and then surpassed Retina displays: Kindle Fire HD 8.9” (254 ppi), Kindle Fire HDX 7” (323 ppi), Kindle Fire HDX 8.9” (339 ppi). Note especially that the HDX devices are tablets, yet they use pixel densities that Apple Retina reserves for phones… so these tablets are designed to fit eyes much better than standard. Newer phones like HTC One (468 ppi), Huawei Ascend D2 (443 ppi), LG Nexus 5 (445 ppi), Samsung Galaxy S4 (443 ppi)… go that same way.

What about a Full HD TV or a 4K TV? We can calculate the optimal viewing distance for perfect eyesight. Let’s do the math for a 56” Full HD TV and a 56” 4K TV assuming aspect ratio 16:9 and square pixels. The right triangle is 16 : 9 : 18.36, which is homothetic to ~ 48.8 : 27.45 : 56, so in the vertical 27.45” we have 1080 points (Full HD), which is 39.34 ppi, or double for 4K. Let’s round to 40 ppi for a 56” Full HD and 80 ppi for a 56” 4K TV. The optimal viewing distance corresponds to 360 points per 5 deg. At 40 ppi, 360/40 = 9 inches must be viewed at 5 deg, so the distance in inches is 9/tan(5 deg) = 103 inches = 2.6 m for Full HD and 1.3 m for 4K.   So if you have a nice 56” Full HD TV you will enjoy it best by sitting closer than 2.6 m.  If you are lucky enough to have a 56” 4K TV you can sit as close as 1.3 m and enjoy it to the max of your eyes’ resolving power.
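The same TV calculation can be written as a short Python sketch, assuming the figures used above (16:9 panel, square pixels, 360 distinguishable points per 5 deg). The function names are hypothetical and chosen only for this illustration.

```python
import math


def panel_height_in(diagonal_in: float, aspect_w: float = 16, aspect_h: float = 9) -> float:
    """Height in inches of a panel with the given diagonal and aspect ratio."""
    return diagonal_in * aspect_h / math.hypot(aspect_w, aspect_h)


def optimal_distance_m(diagonal_in: float, vertical_pixels: int,
                       points_per_arcmin: float = 1.2, angle_deg: float = 5.0) -> float:
    """Viewing distance at which the panel's pixel pitch matches the eye's resolution."""
    ppi = vertical_pixels / panel_height_in(diagonal_in)
    points = points_per_arcmin * angle_deg * 60        # 360 points in 5 deg
    span_in = points / ppi                             # inches covered by those points
    distance_in = span_in / math.tan(math.radians(angle_deg))
    return distance_in * 0.0254


if __name__ == "__main__":
    print(f"56 inch Full HD: {optimal_distance_m(56, 1080):.2f} m")
    print(f"56 inch 4K:      {optimal_distance_m(56, 2160):.2f} m")
```

Without rounding 39.34 ppi up to 40 ppi the distances come out slightly larger than 2.6 m and 1.3 m, but the 2:1 ratio between Full HD and 4K holds exactly.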

Colour perception.

Humans have three types of cones (the colour receptors), each one sensitive to a different range of light wavelengths: red, green, blue. Colour is NOT an objective property of light. Colour is an interpretation of two physical phenomena: 1) wavelength of radiation, 2) composition of ‘pure tones’ or single-wavelength radiations. A healthy eye can distinguish more than 16 M different shades of colour (some lab experiments say even as many as 50 M). As we have mentioned, cones are scarce compared to rods, so we do not have the same resolving power for colour as we have for the mere presence of light. Pure ‘tones’ range from violet to red. They are called ‘spectral colours’. Non-spectral colours must be produced by compositing any number of pure tones. For example white, grey, pink… need to be obtained as compositions.

Colour is subjective. Within a range, different people will see slightly different shades of colour when presented with exactly the same light (the same exact composition of wavelengths). This is due to the way cones react to light. Cones are pigmented, and thus when receiving photons of a certain wavelength range their pigment reacts, triggering a current to the nerve. But cone pigment ‘quality’ varies from human to human, so cones may trigger a different signal for the same stimulus and, conversely, they may trigger the same signal for a slightly different stimulus. The colour we see is an interpretation of light. It happens that different lighting conditions may render exactly the same electrical response. This means that the composition of wavelengths that produces some colour output is not unique; there are a number of input combinations that render the same output (metamers).

Light intensity range.

To make things even more difficult, the colour response (to light) function of our eyes depends on the intensity of radiation. Cones may respond to the same wavelength differently when the intensity of light is much higher or much lower (bear in mind that intensity relates to the energy carried by the photons… it is crudely the number of photons reaching the cone per time unit; it has nothing to do with the individual energy of each photon, which relates solely to its wavelength).  Our eyes have a dynamic sensitivity range that is truly amazing: it covers 10 decades. We can discern shapes in low light receiving as few as 100-150 photons and we can still see all the way up to 10 orders of magnitude more light! Of course we do not perceive colour information equally well all across the range. When we are in the lower 4 decades of the range we need to sum up all possible receptors to trigger a decent signal, so we ‘see’ mostly through rods (scotopic vision), and through many peripheral rods that are less individually connected to nerves. Many of them share a nerve, thus losing spatial resolution but gaining sensitivity, as very low photon counts per receptor may excite the nerve when summed from several receptors. When we are in the 6 upper decades of the intensity range we can perceive colour (photopic vision), although at extreme intensity we just perceive washed-out or white colours. Some authors and labs have checked the human intensity range for a single scene. This range is different from the whole range, as the eye is capable of 10 decades but not in the same scene, only through a few minutes of adaptation to low/high light. For a single night scene with very low light (gazing at stars for instance) the range is estimated to be 6 decades, 1:10^6; for daylight the range is estimated to be 4 decades, 1:10^4.

What about current display devices? Are they good at representing colour in front of our eyes?

State of the art displays consist of a grid of picture elements (pixels) each one formed by three light emitting devices selected to be pure tones: R, G, B. The amount of light that is emitted by each device can be controlled independently by polarizing a Liquid Crystal with a variable voltage, allowing more or less light to go through. The polarization range is discretized to N steps by feeding a digital signal through a DAC to the LC.  Superposition of light from three very closely placed light emitters produces a mix of wavelengths, a colour shade, concentrated on one pixel.

Today most LCD panels are 8 bpc (bits per channel) and only the most expensive are 10 bpc. That means that each pure tone (R, G, B) in each pixel can be modulated through 256 steps (8 bits per channel) so roughly 2^24 tones are possible (16.7 M). The best panels support 2^30 tones (1073.7 M). VGA and DVI interfaces only provide 8 bpc input: RGB-24. To drive a 10 bpc panel a DisplayPort interface or an HDMI 1.3 interface with DeepColour (RGB-30) enabled is needed.  Video sources for these panels may be PCs with high end video cards that support true 30 bit colour or high end Blu-ray players with DeepColour (and a Blu-ray title encoded in 30 bit colour of course!). As we are capable of distinguishing over 16 M shades you may think that 8 bpc could be barely enough, but here comes the tricky part… who told you that the 16 M shades of an 8 bpc panel are ‘precisely these’ 16 M different shades that your eye can see? They are not for most cheap panels. Even when a colorimeter may tell us that the panel is producing 16 M different shades, our eyes have a colour transfer function that must be matched by the source of light that claims to have 16 M recognizable shades. If not, many of the shades produced by the source will probably lie in ‘indistinguishable places’ of our ‘colour transfer function’, effectively rendering far fewer than 16 M distinguishable shades.  Professional monitors and very high end TVs have ‘corrected gamma’ output. This means that they do not attempt to produce a linear amount of polarization for each channel (R, G and B) in the range 0-255. Instead they have pre-computed tables with the right amounts of R, G and B that our eyes can see all through the visible gamut. They use internal processing to ‘bias’ a standard RGB-24 signal to render it into ‘the gamut of shades your eyes can see’, with a preference for some shades and some disregard for others, so they can render RGB-24 input into truly 16 M distinguishable shades.
To achieve this goal these devices store the mapping functions (gamma correction functions) in a LUT (Lookup Table), a double entry table that produces the right voltage to excite the LCD for each RGB input. That voltage may have finer steps in some parts of colour space where human eyes have more ‘colour density’. For this reason an 8 bit DAC won’t be enough. More often an 8 bit signal is fed through the LUT to a 10 or 12 bit DAC. You see, more than 8 bits of internal calculation space are needed to handle 8 bit input, so many 8 bpc displays are said to be 8 bpc panels with 10 bit LUT or 12 bit LUT. Today the best displays are 10 bpc panels with 14 bit LUT or 16 bit LUT. It is also worth mentioning a technique called FRC (Frame Rate Control). Some manufacturers claim they can show over 16.7 M colours while they use only true 8 bit panels. They double the frame rate and excite the same pixel with two alternating different voltages. So they fake a colour by mixing two colours through ‘modulation in time’. This technique seems to work perceptually but it is always good to know if your panel is true 10 bit or instead 8 bit + FRC. Once available, this technique has also been used to make regular monitors cheaper by going all the way down to 6 bit + FRC.
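Why the LUT output needs more bits than its input can be illustrated with a small Python sketch. It is not how any particular monitor works: it assumes a simple power-law transfer curve with exponent 2.2 (an assumption of this example; real devices use measured calibration tables), and the function name `build_lut` is hypothetical.

```python
from typing import List


def build_lut(gamma: float = 2.2, in_bits: int = 8, out_bits: int = 10) -> List[int]:
    """Lookup table mapping each input code to a DAC code along an assumed power-law curve."""
    in_max = (1 << in_bits) - 1
    out_max = (1 << out_bits) - 1
    return [round(((code / in_max) ** gamma) * out_max) for code in range(in_max + 1)]


def distinct_levels(lut: List[int]) -> int:
    """Number of different output levels that survive the mapping."""
    return len(set(lut))


if __name__ == "__main__":
    for bits in (8, 10, 12):
        lut = build_lut(out_bits=bits)
        print(f"8 bit input through a {bits} bit DAC: {distinct_levels(lut)} distinct levels")
```

With an 8 bit DAC many neighbouring input codes collapse onto the same output level in the dark end of the curve; giving the DAC 10 or 12 bits lets more of the 256 input codes keep a level of their own.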

We can conclude that today’s high end monitors that properly implement gamma correction are able to show us the maximum range of colours we are capable of seeing (normal people see something above 16 M shades of colour, maybe even 50 M) when using LCD panels that are true 8 bpc or better (true 10 bpc). Unfortunately mainstream computer monitors and TVs do NOT have proper gamma correction, and those that have the feature are rarely calibrated correctly (especially TVs), so digital colour is not yet where it should be in our lives. Most cheap monitors take 8 bit per channel colour and feed it straight to an 8 bit DAC to produce whatever range of colour it ends up being, resulting in far fewer than 16 M viewable shades of colour. Many cheap computer screens and TVs are even 6 bit + FRC. By applying a colorimeter you can discover the gamut that your device produces and match it to some ‘locus’ in a standard ‘colour space’.  A standard colour space is a bi-dimensional representation of all colour shades the eye can see. This representation can be built only for some fixed intensity level. This means that with more or less intensity the corresponding bi-dimensional representation will be different. You can imagine a ‘cone’ with its vertex in the ‘intensity 0 plane’. Slices of this cone (one for each fixed intensity value) lie in parallel planes, each one occupying a ‘locus’ (a connected bi-dimensional plot) that gets bigger and richer as intensity increases until we get to optimal intensity, and then starts to get washed out when intensity is above optimal. The locus of viewable shades takes in most representations the shape of a deformed triangle with pure Red, Green and Blue at the three vertices. Different colour spaces differ in shape but more or less all of them look like a triangle deformed into a ‘horseshoe’. You may know Adobe RGB and sRGB.
All these representations are subsets of the viewable locus and were attempts to standardize, respectively, what a printing device and a monitor should do (when they were created). Today’s professional monitors can match 99.X% of Adobe RGB, which is wider than sRGB.  Most TV sets and monitors can only produce a locus much smaller than sRGB.

Refresh rates, interlacing and time response

How do we see moving objects? Does our brain create a sequence of frames? Does our brain even have the notion of a still frame? How do technology-produced signals compare to natural world in front of our eyes?  Are we close to fooling ourselves with fake windows replacing the real world?

It turns out that we can only see moving objects. Yes, that is true. You may be disturbed by this statement, but no matter how solidly static an object is and how static you think you are… when you stare at it you are constantly moving your eyes. If your eyes do not move and the world does not move, your brain just cannot see anything. The image processor in our brain likes movement and searches for it continually. If there is no apparent movement in the world our eyes need to move so the flow of information can continue.

We just do not know if there is something like a frame memory in our brain, but it seems that our eyes continually scan space, stopping at the places that show more movement. To understand a ‘still frame’ (reasonably still… let’s say that for humans something still is something that does not change over 10 to 20 ms) our eyes need to scan it many times looking for movement/features. If there is no movement our eyes will focus on edges and high contrast spots.  This process will give our brain a collection of ‘high quality patches’ at which the high resolution channel that is the fovea has been aimed, selected by their image characteristics (movement, edges, contrast). These patches may not cover our field of vision completely, so effectively we may not see things that our brain deems ‘unimportant’.  Our supposed ‘frame’ will look like a big black poster with some high resolution photos stuck all over it following strange patterns (edges for instance), surrounded by many low resolution photos all around the HQ ones, and a lot of empty space (black, you may imagine).

It seems we do not scan whole frames. It seems we can live well with partial frames and still understand the world. This cognitive process is helped by involuntary eye movement and by voluntary gaze aiming.  Our brain has ‘image persistence’. We rely on partial frame persistence to cover a greater part of the field of view by making old samples of the world last in our perceptual system, where they are added to fresh samples in a kind of ‘time-based collage’.  Cinema and TV benefit from image persistence by encoding the moving world as a series of frames that are rapidly presented to the eye. As our brain scans the world from time to time, it does not seem very unnatural to look at a ‘fake world’ that is not there all the time but only some part of the time.  Of course the trick to fool our brain is: don’t be slower than the brain.

Cinema uses 24 frames per second (24 fps) and this is clearly slower than our scanning system, so we need an additional trick to fake motion: we achieve it through over-exposure of each film frame. To capture a movement through, let’s say, one second spanning 24 film frames, we allow film to get blurred with the fast movement by overexposing each frame, so the image of the moving object impresses a trail on film instead of a static image. If cinema were shown to us without this overexposure we would perceive a jerky movement, as 24 fps is not up to our brain’s capabilities to scan for movement.  Most people would be much more comfortable with properly exposed shots played at 100 fps. The use of overexposure defines ‘cinema style’ movement as opposed to ‘video style’ movement. People get used to cinema, and when a movie is shot not on film but on video with proper exposure the ‘motion feeling’ is different; they say ‘too lifelike, not cinema’.

Today we see a mixture of techniques to represent motion. In TV we have been using the PAL and NTSC standards, which capture 576 and 480 horizontal lines respectively to form frames in a tricky way. They capture half a frame (what they call a ‘field’) by sampling just the even lines, then just the odd lines, taking a field every 1/50 s in PAL and every 1/60 s in NTSC. This scheme produces 25 fps or 30 fps on average, but note that in fact it produces 50 fields per second or 60 fields per second. Due to the abovementioned image persistence two fields seem to combine into a single image, but notice that two consecutive fields were never sampled at the same time, but shifted by 1/50 s or 1/60 s, so if displayed simultaneously they won’t match. Edges will show ‘combing’ (you will see teeth like those of a comb). This is precisely what happens when an interlaced TV signal arrives at a progressive TV set and you must turn fields into frames. Of course there are de-comb filters built into modern TVs. I just want to point out that with 95% of TV sets out there being progressive monitors capable of showing 50/60 fps progressive, interlaced TV signals just do not make sense anymore… but we still ‘enjoy’ interlaced TV in most places of the world.  Of course de-interlacing TV for progressive TV sets comes at a cost: image filters result in poor image quality. De-comb filters produce un-sharpening. The perceptual result is a loss of resolution.  Maybe you do not realise how deep interlaced imagery is in our lives: most DVD titles have been captured and stored as interlaced. This makes even less sense than broadcasting interlaced signals. DVD is a digital format. You are very likely to play it on a digital TV set (a progressive LCD panel), so why bother interlacing the signal so that your DVD player or your TV or both will need to de-interlace it? Plainly it does not make sense, and by the way it worsens image quality.  Even some Blu-ray titles have been made from interlaced masters.
Here nonsense gets extreme but it happens anyway. We may forgive DVD producers, as when the DVD standard came up most TV sets were still interlaced (cathode ray tubes), but having ‘modern’ content shot in interlaced format today is plain heresy.
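The ‘combing’ effect is easy to reproduce with a toy Python sketch. It assumes a tiny text ‘frame’ with a one-pixel vertical bar that moves two pixels between the two field captures; the sizes, the bar positions and the function names are all made up for this illustration, and the weave shown is the naive one with no de-comb filter.

```python
from typing import List


def render_frame(bar_x: int, width: int = 12, height: int = 6) -> List[str]:
    """Full progressive frame with a vertical bar at column bar_x."""
    return ["".join("#" if x == bar_x else "." for x in range(width)) for _ in range(height)]


def capture_field(frame: List[str], parity: int) -> List[str]:
    """Keep only the even lines (parity 0) or the odd lines (parity 1) of a frame."""
    return frame[parity::2]


def weave(even_field: List[str], odd_field: List[str]) -> List[str]:
    """Naive de-interlace: put the two fields back together line by line."""
    rows: List[str] = []
    for even_row, odd_row in zip(even_field, odd_field):
        rows.extend([even_row, odd_row])
    return rows


if __name__ == "__main__":
    first = capture_field(render_frame(bar_x=3), parity=0)   # sampled at time t
    second = capture_field(render_frame(bar_x=5), parity=1)  # sampled one field period later
    for row in weave(first, second):
        print(row)
```

The printed bar alternates between two columns on consecutive lines, which is the comb pattern described above. With a static scene the two fields weave back into the original frame without any artefact.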


The human hearing system is made of two channels, each one acquiring information independently, both mapping information to a brain area, in the same way that two eyes combine information for stereoscopic vision.

The hearing sensory equipment is complex. It is made of external devices designed for directional wave capture (ear, ear canal and tympanic membrane). We cannot aim our ears (we do not have the muscles some animals have), just turn our head, which is a ‘massive movement’ subject to great inertia and thus slow, so our hearing attention must work all the time for surrounding sounds. There are internal mechanisms (a chain of tiny bones: the ossicles) and pressure transmission that are intended to transform the spectral response of human hearing, amplifying some frequencies more than others. At the innermost part of the human hearing system there is the cochlea, a tubular, spiralling duct that is covered with sensitive ‘hairs’. It is in this last stage that individual frequencies are identified. All the rest of the equipment (the internal and middle parts and the ear) is just a sophisticated amplifier. As in the visual system there are limitations that come from the sensory equipment, not the brain.

Sound: differential pressure inside a fluid produced as vibration.

What we call sound is a vibration of a fluid. As in any fluid (gas or liquid), there is an average pressure at every point in space. Our hearing system is able to detect pressure variations deviating from the average. We do not detect just any small variation (but almost!), and we will not detect a really strong one (at least we will not detect a sound, just pain and possibly damage to the hearing system). So there is a range of intensities: the human hearing system has an amazing range of 13 decades in sound intensity (sound intensity, the power delivered per unit surface by the vibration, is proportional to the square of the differential pressure, so this corresponds to about 6.5 decades in pressure).  The smallest perceivable pressure difference is 2×10^-5 Newton/m^2, the highest is 60 Newton/m^2. The intensity range is 10^-12 Watt/m^2 to 10 Watt/m^2.  We do not detect isolated pulses of differential pressure. We need sustained vibration to excite our hearing system (there is a minimum excitation time of about 200 ms to 300 ms). And thus there is a minimum frequency and also a maximum. Roughly we can hear from 20 Hz to 20 kHz. Aging severely reduces the upper limit. We are not equally sensitive to pressure (intensity) across the range. As with our eyes and colour perception, there is a transfer function, a spectral response in frequency space that is not flat.
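The pressure and intensity limits quoted above can be put on a decibel scale with a few lines of Python. The sketch uses only the four figures given in the text; the constant and function names are illustrative. Since intensity grows with the square of pressure, a pressure ratio is converted with 20·log10 and an intensity ratio with 10·log10.

```python
import math

P_MIN, P_MAX = 2e-5, 60.0    # N/m^2, limits quoted in the text
I_MIN, I_MAX = 1e-12, 10.0   # W/m^2, limits quoted in the text


def decades(low: float, high: float) -> float:
    """Number of orders of magnitude between two values."""
    return math.log10(high / low)


def intensity_range_db(low: float, high: float) -> float:
    """Range in dB for a ratio of intensities (powers)."""
    return 10 * decades(low, high)


def pressure_range_db(low: float, high: float) -> float:
    """Range in dB for a ratio of pressures (amplitudes)."""
    return 20 * decades(low, high)


if __name__ == "__main__":
    print(f"intensity: {decades(I_MIN, I_MAX):.1f} decades, {intensity_range_db(I_MIN, I_MAX):.1f} dB")
    print(f"pressure:  {decades(P_MIN, P_MAX):.1f} decades, {pressure_range_db(P_MIN, P_MAX):.1f} dB")
```

Both pairs of limits describe essentially the same range of roughly 130 dB: 13 decades of intensity, or about 6.5 decades of pressure.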

Frequency resolution: pitch discrimination, Intensity resolution: loudness

Our hearing system retrieves the following information from sound: frequency (pitch), intensity (loudness), position (using two ears).  We are able to detect from 20 Hz to 20 kHz and we can tell apart 1500 different pitches in the range. The separation of individually recognizable pitches is not the same across the range. There is a fairly variable transfer function. It is assumed that frequency resolution is 3.6 Hz in the octave going from 1 kHz to 2 kHz. This relates to the perception of changes in pitch of a pure tone. As with colour, sound can be a perception of combined pure tones. It happens that when there is more than one pure tone, interference between tones can be perceived as a modulation of intensity (loudness) that is called ‘beating’, and the human ear is then more sensitive to frequency. For instance two pure tones of 220 Hz and 222 Hz, when heard simultaneously, interfere producing a beating of 2 Hz. The human ear can perceive that effect, but if we increase, let’s say, a single pure tone from 200 Hz to 202 Hz the human ear will not perceive the change.
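The 2 Hz beating of the 220 Hz and 222 Hz tones follows from the identity sin A + sin B = 2·sin((A+B)/2)·cos((A−B)/2): the sum behaves like a tone at the average frequency whose loudness is modulated by a slow cosine. A minimal Python sketch, with illustrative function names and unit-amplitude tones assumed:

```python
import math


def two_tone(t: float, f1: float, f2: float) -> float:
    """Sum of two unit-amplitude pure tones at time t (seconds)."""
    return math.sin(2 * math.pi * f1 * t) + math.sin(2 * math.pi * f2 * t)


def envelope(t: float, f1: float, f2: float) -> float:
    """Slowly varying loudness envelope of the summed tones."""
    return abs(2 * math.cos(math.pi * (f1 - f2) * t))


def beat_frequency(f1: float, f2: float) -> float:
    """Rate at which the envelope rises and falls, in Hz."""
    return abs(f1 - f2)


if __name__ == "__main__":
    f1, f2 = 220.0, 222.0
    print(f"beat frequency: {beat_frequency(f1, f2)} Hz")
    for t in (0.0, 0.125, 0.25, 0.375, 0.5):
        print(f"t = {t:5.3f} s  envelope = {envelope(t, f1, f2):.3f}")
```

The envelope goes from maximum to silence and back to maximum in half a second, that is, twice per second.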

We perceive sound intensity (loudness) differently across the frequency range. Several different pressures can be perceived as equal if they are vibrating at different frequencies. For this reason the human ear is characterized by drawing ‘loudness curves’. There is a line (a contour line, like level curves in geographical maps) per perceived loudness value; these lines cover the whole range of audible frequencies and they (obviously) do not cross. It is noticeable that the low threshold of loudness perception has a valley in the range 2 kHz to 4 kHz. That is where we can hear the faintest sounds. In that frequency range lies most of the energy of the human voice spectrum.

Sound Hardware: is it up to the task?

It seems that audio-only content is not very fashionable these days. It does not attract the attention of the masses as it did in the past.  Anyway let’s take a look at what we have.

Sound storage and playback is mostly digital nowadays. Since the inception of the CD a whole culture of sound devices employs the same audio capacities. CD spec: two channels, each sampled at 44.1 kHz, 16 bit samples using PCM. (Sound input is filtered by a low-pass filter with cut-off frequency at 20 kHz, and then sampled. As you may know, to preserve a tone of 20 kHz you must sample at least at 40 kHz: Nyquist theorem.) Is the CD spec on par with human perception? Let’s see.

Audio sampling takes place in the time domain (light sampling happens in frequency). Each sample takes an analog value for pressure (obtained by converting pressure to voltage or current in a microphone during the sampling time), then this value is quantized in N steps (16 bits provide 2^16 = 65536 steps). The resulting bit stream can be further compressed afterwards for storage and/or transmission efficiency. Some techniques can be applied before quantization to reduce the amount of data (amplitude compression), but usually data reduction is applied while quantizing, leading to Differential PCM and Adaptive PCM, which deviate from Linear PCM.

If we look at frequency resolution, LPCM with cut-off frequency at 20 kHz is ok. Two separate tones of less than 20 kHz will be properly sampled and can be distinguished perfectly even if they are separated by only 2-4 Hz.

If we look at pressure resolution, it has nothing to do with the CD spec. A loudness curve for each pressure value can be properly encoded using the CD spec. What matters here is the microphone technology (sensitivity) and the whole chain of manipulations (AD conversion, storing, transmitting, DA conversion, amplifying), analog and digital, that intervene to display sound in front of your ears.

It is easier to look at dynamic range to compare fidelity of sound handling. As we said the human ear is capable of 13 decades (130 dB), but, as happens with the human eye, not in the same scene, not in the same sound time segment. For human hearing this reduced range effect is called ‘masking’. Loud noises make our hearing system adapt by reducing the range, so we cannot hear faint sounds when an intense signal is playing. Some experiments demonstrate that the CD spec (16 bit samples) can render a range of 98 dB for sine shaped signals and a range of 120 dB using special dithering (not LPCM); 20 bit LPCM can render 120 dB and 24 bit LPCM can render 144 dB. But at the same time other elements in the chain (AD/DA steps, amplifiers, transmission) are very likely to reduce the range below 90 dB.
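The 98, 120 and 144 dB figures line up with two textbook formulas for linear PCM, sketched below in Python. That these are the formulas behind the quoted experiments is an assumption of this example, and the function names are illustrative: the ratio between full scale and one quantization step gives 20·log10(2^N) dB, and a full-scale sine measured against quantization noise gains a further 10·log10(1.5) ≈ 1.76 dB.

```python
import math


def lpcm_range_db(bits: int) -> float:
    """Ratio between full scale and one quantization step, in dB."""
    return 20 * math.log10(2 ** bits)


def sine_snr_db(bits: int) -> float:
    """Signal-to-quantization-noise ratio of a full-scale sine, in dB."""
    return lpcm_range_db(bits) + 10 * math.log10(1.5)


if __name__ == "__main__":
    for bits in (16, 20, 24):
        print(f"{bits} bit: step ratio {lpcm_range_db(bits):.1f} dB, "
              f"full-scale sine {sine_snr_db(bits):.1f} dB")
```

Each extra bit adds about 6 dB, which is why 16 bit sits just short of 100 dB while 20 and 24 bit reach the range of human hearing on paper.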

So theoretically there are high end sound devices with high dynamic range that could be paired with each other so the resulting end to end system gets close to 125 dB, but that may take a fair amount of money.  To technically achieve the maximum possible fidelity one way (recording) and the other way around (playing) you must ensure that all your equipment fits together without breaking the 125 dB dynamic range. For the recording segment it is no problem. Studios have the money to afford that and more. For the playing segment you will find trouble in DACs, amplifiers and loudspeakers. Cheap HW does not have the highest dynamic range. The symptom of not being up to the task is the amount of distortion that appears in the range, usually measured as the maximum % of distortion in the range. But anyway, do we need the maximum dynamic range all over the chain? Is audio quality available? Is it expensive?

Reading the two linked articles may help us reach the conclusion that YES, we have today the quality needed to experience the best possible sound that our perceptual system can detect, and NO, it is not expensive. Using computer parts and peripherals you can build a cheap and perceptually very decent sound system. Of course pressure wave propagation is a tricky science, and to adapt to any possible room you may need to invest in more powerful, much more expensive equipment. But for ‘direct ray’ sound we are fortunate: virtually anyone can afford perceptually correct equipment today.



With the advent of digital technologies we have inherited a world in which video is digital. This means of course that the video signal is ‘encoded’ in a digital format. Today EBU, DVB and other organisations like the DVD Forum and the Blu-ray Disc Association have narrowed the variety of encoding options to a few standards: MPEG (1, 2 and 4) and VC-1. There are also other well-known video coders: WebM, On2 VP6, VP8 and VP9. By far the most successful standard is MPEG (Moving Picture Experts Group), which is today well consolidated after more than 25 years of existence. Most of the TV channels in the world are today delivered as MPEG-2 video over M2TS (MPEG-2 transport stream), and more recently HD TV is delivered as MPEG-4 Part 10 video, also called H.264 or AVC, over M2TS. The latest addition from MPEG is HEVC (H.265).

We are seeing a digital world that moves much faster than EBU, DVB, the DVD Forum or the Blu-ray Disc Association. These big entities may take as much as 5 years to standardize a new format. Then manufacturers NEED to change production lines and keep them for some years producing devices that adhere to the new standard (so they can derive profit from the investment). So you cannot expect a breakthrough in commodity electronics for video in less than 5-7 years, and even that pace is an acceleration: in the past (1960-2000), TV sets were built with essentially the same viewing specs for more than 20 years in a row.

But as I’ve said, today we can expect to see much more dynamism in video sources and video displays. Thanks to Internet video distribution, people can encode and decode video in a wealth of formats that may fit their needs better than regular broadcast TV. Of course the hardest limitation is the availability of displays. It doesn’t make sense to encode 30-bit colour video if you do not have a capable display. But assuming we have a capable display, today’s high-end PC video boards can be used to feed HDMI 1.4 or DisplayPort signals to these panels, overcoming the limitations of broadcasting standards. This is the single reason why 4K TVs are being sold today. Only PCs can feed 4K content to current 4K TVs, and most of the time this content must be ‘downloaded’ from the Internet. Today, streaming 4K content would take more than 25 Mbps encoded in H.265 (HEVC).
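To put the 25 Mbps figure in perspective, the following back-of-the-envelope sketch compares it with the raw, uncompressed bit rate of a 4K stream. The resolution of 3840×2160 and the frame rate of 30 frames per second are assumptions for illustration, and the calculation ignores chroma subsampling and blanking, so the resulting ratio is only indicative.

```python
def raw_video_bitrate_bps(width: int, height: int, bits_per_pixel: int, fps: int) -> int:
    """Uncompressed video bit rate: every pixel of every frame at full depth."""
    return width * height * bits_per_pixel * fps

# Assumed source: 3840x2160, 30-bit colour (10 bits per channel), 30 frames/s.
raw = raw_video_bitrate_bps(3840, 2160, 30, 30)
streamed = 25_000_000  # the 25 Mbps figure quoted in the text

print(f"raw: {raw / 1e9:.2f} Gbps, compression ratio: {raw / streamed:.0f}:1")
```

Even under these modest assumptions the encoder has to discard or predict the vast majority of the raw data, which is why the codec generation matters so much for 4K distribution.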

We have seen that state-of-the-art displays have just met the perceptual minimum resolution (250 ppi at 18”) and are getting better every day. We are seeing the introduction of decent colour handling with 10 bpc LCDs and 30-bit RGB colour. We are seeing the introduction of large-format high-resolution displays: 4K displays starting at 24” for computer monitors (about 200 ppi) and at 56” for TV sets (about 80 ppi). The BDA has recently announced that the 4K extension to the Blu-ray spec will be available before the end of 2014. In the meantime they need to choose the codec (H.265 and VP9 are the contenders) and cut some corners of the spec.

The available displays have a proper dynamic range, usually better than 1000:1 and getting close to 10000:1, at least if we take contrast ratio as if it were a ‘real’ intensity range, which it is not. Of course our monitors cannot light up with the intensity of sunlight and at the same time, or even in a different scene, show a star field with distinguishable faint stars. No. The dynamic range of displays will not get there soon, but HDR (High Dynamic Range) techniques are starting to appear, and they can compress the dynamic range of real-world input much better than current technology, which by the way does not compress it but just clips it. Current cameras can capture the highlights while clipping the shadows or, conversely, capture the shadows while clipping the highlights, and you are lucky if you get to choose which part of the range you keep. Near-future HDR cameras will capture the full range. As displays will not be on par with the full range, some image processing that is already available must be done to compress the range and adapt it to the display. (PS: today you can process RAW files to produce your own HDR images, or even take multi-exposure shots to produce HDR files. The problem is that to see the results you must compress the range to standard RGB, or otherwise you must select an ‘exposure’ value to view the HDR file.)

We can expect to see incredible high-definition, high-dynamic-range content in full glory using 4K displays with 30-bit colour (10 bits per channel).
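The pixel densities quoted for 4K panels are easy to check. A small sketch, assuming ‘4K’ means the 3840×2160 UHD resolution and that the quoted sizes are screen diagonals in inches (the helper name is illustrative):

```python
import math

def pixels_per_inch(width_px: int, height_px: int, diagonal_in: float) -> float:
    """Pixel density: pixels along the diagonal divided by the diagonal length."""
    return math.hypot(width_px, height_px) / diagonal_in

for label, diagonal in (("24-inch monitor", 24), ("56-inch TV", 56)):
    print(f"{label}: {pixels_per_inch(3840, 2160, diagonal):.0f} ppi")
```

The results land slightly below the rounded figures in the text, close enough for comparing panel sizes against the perceptual threshold.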


Digital audio is not living up to digital video expectations. In the past decade a few high-definition audio formats appeared: SACD (Super Audio CD), DVD-A (DVD Audio) and multichannel uncompressed LPCM on BD. Of these formats, SACD and DVD-A have proved to be real failures. It has been demonstrated that increasing the bit count per sample above 16 bits (20, 24, 32, 48…) and increasing the sampling rate above 44.1 kHz (48 kHz, 96 kHz, or even the MHz-rate 1-bit DSD stream used by SACD) does not produce perceptually distinguishable results. So the answer is clear: we got there many years ago. We achieved the maximum ‘reasonable’ fidelity with the CD spec. (OK, OK, the noise floor could be improved by applying dithering or by moving from 16 to 20 bits, but the perceptual effect is negligible and the change is not worth the investment at all.) The only breakthrough in digital audio comes from the fact that we now have more space available on content discs, so we can go back to uncompressed formats and enjoy LPCM again after years of MP3 or other sorts of compression: DTS, Dolby, MPEG. State-of-the-art audio is also multichannel. So the reference audio today is uncompressed LPCM, 16 bits per sample, 44.1 or 48 kHz, in 5.1 or 7.1 multichannel format, stored on Blu-ray disc.
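The point about disc space can be made concrete with a quick calculation of uncompressed LPCM bit rates. This is a minimal sketch; it counts a 5.1 layout as 6 full channels and a 7.1 layout as 8, and the helper name is illustrative.

```python
def lpcm_bitrate_bps(channels: int, bits_per_sample: int, sample_rate_hz: int) -> int:
    """Uncompressed LPCM bit rate: one sample per channel per sampling period."""
    return channels * bits_per_sample * sample_rate_hz

formats = {
    "CD stereo, 16 bit / 44.1 kHz": (2, 16, 44_100),
    "5.1, 16 bit / 48 kHz": (6, 16, 48_000),
    "7.1, 16 bit / 48 kHz": (8, 16, 48_000),
}
for name, params in formats.items():
    print(f"{name}: {lpcm_bitrate_bps(*params) / 1e6:.3f} Mbps")
```

Even the 7.1 case stays in the single-digit Mbps range, a small fraction of what the accompanying video stream needs, so there is little reason to compress audio on a high-capacity disc.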


I started this article posing fairly open questions about the availability of perceptually correct technology to display image and sound in front of our eyes and ears. After careful examination of our viewing and hearing sensory equipment, of the recent achievements of the CE industry in providing displays and audio equipment, of the prices of these devices, and of the market acceptance for content and the ways to distribute it, we can conclude that we are living in an extremely interesting time for content. We’ve got ‘there’ and virtually no one noticed. We have the technology to provide perceptually perfect content, we have the distribution paths and we (almost) have the market.

On the way to this discovery we have found that today only a very small number of devices and content encodings have put all the pieces together, but that is changing. We will no longer be delayed by broadcast standards; we will no longer be fooled by empty promises in audio specs. The right technology is just at hand and the rate of price decline is accelerating. Full HD adoption took more than 15 years, but maybe 4K adoption will take less than 5 years, and maybe most content will not get to us via broadcast anymore…