The objective before us was to develop an in-house database of household incomes and expenditures for every major city of India. We had no idea how long it would take or what problems we would face, we did not know which method or technology was most appropriate, and we did not know how much market share we could capture. Too much was uncertain, and investors typically do not like to venture into a space where even the market size is undefined. But many times that is the only way. Had we started to worry about market shares, market sizes, competition and so on, we would never have started.
This is always the key break-away point. No business model is completely new, and for every business there is always at least one other competitor that has a great advantage; moreover, for most start-ups market conditions will seem biased against them. But this is when the entrepreneur breaks out. Bad market entry conditions cannot stop him. I was convinced: there was a need in the marketplace for productized data, there was my own experience and interest, and there was a team with diverse abilities. Most important, I knew that data-driven decision-making was going to be the norm in the coming decades, and that a deep, representative database providing data at the district, city and sub-city level would be critical for all companies and also for the government.
And so a process started that lasted many years. Early visits to investors resulted in curious meetings, followed by general disinterest. I was an academic, and an economist at that, running a start-up in an area that was not a fad at the time; quant, big data and algorithms were to come later. And so not only was the entrepreneur unlikely to succeed, Indicus Analytics did not have the same ‘fad’ value that investors repeatedly fall for! But we introduced a range of products every year, and all R&D and investment was funded from internal accruals. A few years later a great product line had been built, brand recognition was extremely high, and we had a couple of monopoly products. Again I did the rounds, but my business model had 5x growth projections, and investors in Delhi, Mumbai and Bangalore wanted me to revert with a 10x model! That stopped me in my tracks! Anyhow, since funds could be generated internally, we could keep on experimenting and doing things that the outside world had little inkling of.
The Income Estimation Problem
District level economy estimates could now help bring out much better income estimates for each district. There are very few Indian firms that spend any money on data, and there were even fewer when we first started in the early 2000s. Our prices could therefore not be too high, and doing a large all-India survey with a representative sample would be too costly. But no one would give us their data: neither IRS, for whom it was proprietary, nor NCAER, who were themselves unclear about what they wanted to do with their data. And so we started to use the one data source that was available to us, the Government of India’s sample household employment and expenditure survey conducted by the National Sample Survey Organization (NSSO). NSSO also had a great and transparent sampling mechanism, developed by none other than Mahalanobis, arguably the greatest statistician there ever was. Moreover, I had worked with this data earlier and understood its strengths and weaknesses quite well. Much later I realized that what was initially a constraint (costs) became a great strength (use of underlying data from the government) when we went to the market.
Income is simply the aggregate of profits, interest, wages, salaries and other benefits received by individuals from their investments, business or employer. The CSO and economists in the Government of India need to estimate incomes to better understand how the economy is behaving. Businesses also need to understand income, but from a different angle: an understanding of individual or household incomes helps us understand the character of their spending, and this obviously helps in better planning and targeting of products and services. Our team’s focus was therefore not so much on aggregate income as on the distribution of income. We are able to distribute incomes at the district or city level, and within that by SEC cuts and income segments, not to mention rural and urban dimensions. We needed, and finally developed, a data product where India’s income is distributed across more than three thousand cuts!
In this day of big data analytics, users require highly granular estimates. How can such estimates be produced for a country where the government does not even share detailed data from the Census of India? Having analyzed data on Indian households with a series of highly advanced econometric and neural network tools, we have been able to identify areas and households that tend to behave very similarly to each other. For example, there are some households in Delhi that behave very similarly to households in (say) Chandigarh. Therefore, even if we do not have an adequate sample for Chandigarh, as long as we can identify similar households from Delhi, we can estimate consumption behavior for Chandigarh. This innovation enables us to produce detailed and robust estimates at a much lower cost, and much more rapidly, than a large survey.
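The idea of borrowing similar households from a well-sampled city can be illustrated with a very small sketch. It is not our actual methodology, which relied on econometric and neural network tools; it swaps in a plain nearest-neighbour match purely for illustration. The feature names, the function names and all the numbers are hypothetical.

```python
from math import sqrt

def distance(a, b, keys):
    """Euclidean distance between two households over the chosen features."""
    return sqrt(sum((a[k] - b[k]) ** 2 for k in keys))

def estimate_from_donors(target, donors, keys, k=2):
    """Average the consumption of the k donor households closest to the target."""
    nearest = sorted(donors, key=lambda d: distance(target, d, keys))[:k]
    return sum(d["consumption"] for d in nearest) / len(nearest)

# Made-up Delhi sample households; features are assumed to be scaled to 0..1.
delhi = [
    {"education": 0.9, "assets": 0.8, "household_size": 0.3, "consumption": 42000},
    {"education": 0.8, "assets": 0.9, "household_size": 0.4, "consumption": 38000},
    {"education": 0.2, "assets": 0.1, "household_size": 0.7, "consumption": 9000},
    {"education": 0.3, "assets": 0.2, "household_size": 0.6, "consumption": 11000},
]

# A made-up Chandigarh household for which no consumption was observed.
chandigarh_household = {"education": 0.85, "assets": 0.85, "household_size": 0.35}

keys = ["education", "assets", "household_size"]
estimate = estimate_from_donors(chandigarh_household, delhi, keys, k=2)
print(estimate)  # mean consumption of the two most similar Delhi households
```

In practice the hard part is deciding which characteristics make two households "similar" and validating that choice, which is where most of the trials went.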
This was obviously not an easy exercise, and it was a very costly one given the number of trials and validations required to develop a methodology that would stand the test of time. However, detailed analytics and ever-growing computing power allowed us to develop semi-automated algorithms that drove a complex set of routines and gave us what we wanted. Unknowingly, in the early 2000s we were working on big data with millions of records, we were using artificial intelligence techniques when we used neural networks, and we used automated algorithms all the time. Almost all of this technology was available on the net; it was coded information and was accessible to anyone looking for it. The value was not in the technology, but in the knowledge and skills that enabled us to use it in a particular manner.
Distribution of income also cannot easily be validated directly. However, we do obtain sales of various consumer products, educational attainment, age distributions, and demographic and housing details from commercial vendors and the Census. Thus a good cross-check is theoretically possible even at the sub-district level. We started with one such cross-check and consistently improved it over the years.
Correcting Reporting Errors and Biases
There is no single place where household incomes are truthfully revealed and recorded. Whether they are reporting to the tax authorities, to a surveyor, or even to matrimonial prospects (!), people rarely tell the truth where their incomes are concerned. However, households tend to report expenditures far more accurately than incomes. Even here, there are different reporting errors: (a) people forget what they have spent, especially on low-frequency and low-value purchases, (b) some people over-report expenditures, especially poorer respondents, and (c) some people deliberately under-report, especially those in the higher income segments.
To get things right we needed to get expenditures right first, and then incomes. This required us to correct all the flaws in the reporting of survey data, and then to tackle the income misreporting problem.
Estimating expenditures: In-depth analytics showed us how different kinds of consumption expenditure are misreported. Using data from India’s National Account Statistics we could find which expenditures are more misreported in surveys, and by how much they need to be corrected. The same analysis also helped us figure out how misreporting varies across different types of households. Once this misreporting and its patterns are understood, appropriate corrections can be made.
Estimating incomes: Some households do report incomes, though these reports are likely to be flawed. However, we do know from other studies how incomes vary with education, location, assets, family structure and so on. All of these income-enabling factors are available to us for each household in our database. A series of calibrations finally gives us an income estimate that is in sync with India’s national income and also with each household’s family structure, location, education, asset base, occupations and so on.
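The two calibration steps described above can be sketched in a few lines. This is only a minimal illustration under simplifying assumptions, not the routines we actually ran: it applies one correction factor per expenditure category (ignoring the variation across household types), and it rescales predicted incomes by a single factor so that their weighted total matches a national figure. All names and numbers are hypothetical.

```python
def correction_factors(survey_totals, nas_totals):
    """Ratio of national-accounts total to survey total, per expenditure category."""
    return {cat: nas_totals[cat] / survey_totals[cat] for cat in survey_totals}

def correct_household(spend, factors):
    """Scale one household's reported spending by the category factors."""
    return {cat: amount * factors[cat] for cat, amount in spend.items()}

def calibrate_incomes(predicted, weights, national_income):
    """Rescale predicted incomes so their weighted sum equals the national total."""
    implied = sum(p * w for p, w in zip(predicted, weights))
    scale = national_income / implied
    return [p * scale for p in predicted]

# Made-up aggregates: the survey captures less spending than national accounts.
survey_totals = {"food": 800.0, "durables": 100.0}
nas_totals = {"food": 1000.0, "durables": 200.0}
factors = correction_factors(survey_totals, nas_totals)

# Made-up household report, corrected category by category.
household = {"food": 40.0, "durables": 5.0}
corrected = correct_household(household, factors)

# Made-up model predictions for two households, with their survey weights.
predicted = [100.0, 300.0]
weights = [2.0, 1.0]
calibrated = calibrate_incomes(predicted, weights, national_income=1000.0)

print(factors, corrected, calibrated)
```

A single scaling factor preserves the relative ranking of households; capturing the fact that richer households under-report more than poorer ones would need factors that differ by household type.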
To do all this we used data from many different sources. At the core are the large sample surveys of the Government’s NSSO, which together give us more than a million observations spread over more than a decade. This is supplemented by data from many other sources, including the Census, the CSO and the RBI. Wherever data are available, we use them to either improve or validate our own. Further, this was a semi-automated process that allowed us to do things fast enough. Finally, once human capital is released through automation, it can focus on constant process and product improvements.
But what made all of this possible was not just the knowledge, but the energy that brought the team together. Working days and nights was not a chore, most people delayed or missed their leave, music and laughter were standard background noise, and yes, there were also fights! A great family was being created around a product. Now this is not something unique to us. Successful start-ups have a different energy around them; go into the work-space and you will feel it! And that is why it is always better to innovate in a start-up mode that has youth on its side, and little of the past burdening the way forward.
This energy enabled hundreds of trials, back and forth, failure, delayed salaries, short term loans, and so forth in a highly charged and enjoyable environment. (Many years later, we went through another such product development process, but this time the same back and forth, trial and error, and delays created massive stress, frayed tempers, disillusionment, bitching and backbiting. But more on that experience in a later post.) Both economic and management theory are extremely weak at understanding the importance of informality in making innovative institutions work. You can build all the hardware, all the processes, and get all the skillsets. But a certain life force is critical for any organization, and most of all for a start-up.