CIO Folder: Data pedantry must rule
13 July 2018
Let’s face it, we now live in a world of data. Communications, entertainment, records and archives, business and finance and government and espionage and, increasingly, active conflict are all digital. Printing and handwriting still have their places and roles, but the amount of digital information generated by today’s civilisation is already a multiple of what the traditional methods produce.
One of the many surveys or studies of the amount of data generated estimates that 90% of our total data has been generated in the last two years. Never mind the possible/probable accuracy of the figures: it is a clear indicator that digital information reigns supreme in this world today. The same report says that our current output is 2.5 quintillion bytes a day. The choice of ‘bytes’ is a clear attempt to impress: it is no more than 2.5 exabytes. A quintillion bytes is 1,000 petabytes, a million terabytes or a billion gigabytes. A thousand exabytes is equal to a zettabyte. That unit was defined only eight years ago. Do we need another, larger measurement as we approach cosmic proportions?
But an important point is that units of 1,000 have overtaken the historical binary scale, in which a kilobyte is actually 1,024 bytes and a megabyte is 1,024 kilobytes. Strictly speaking, 1,024 × 1,024 bytes is a mebibyte (1,048,576 bytes). The two may be equivalent in the minds of consumers, who normally deal in MBs and GBs, but the distinction in volume measurement is important for professionals designing new equipment or simply investing in data capacity. Precision is key.
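To make the gap between the two scales concrete, here is a minimal sketch in Python. The function names are illustrative only and come from no particular library; the arithmetic is simply powers of 1,000 for the decimal prefixes and powers of 1,024 for the binary ones.

```python
# Decimal (SI) prefixes step by 1,000; binary (IEC) prefixes step by 1,024.
SI_UNITS = ["kB", "MB", "GB", "TB", "PB", "EB", "ZB"]
IEC_UNITS = ["KiB", "MiB", "GiB", "TiB", "PiB", "EiB", "ZiB"]


def unit_size(unit: str) -> int:
    """Return the number of bytes in one of the given unit."""
    if unit in SI_UNITS:
        return 1000 ** (SI_UNITS.index(unit) + 1)
    if unit in IEC_UNITS:
        return 1024 ** (IEC_UNITS.index(unit) + 1)
    raise ValueError(f"unknown unit: {unit}")


def shortfall_percent(si_unit: str, iec_unit: str) -> float:
    """How much smaller the decimal unit is than the binary one, in percent."""
    return (1 - unit_size(si_unit) / unit_size(iec_unit)) * 100


print(unit_size("MiB"))  # 1048576
print(unit_size("EB"))   # 1000000000000000000
print(round(shortfall_percent("TB", "TiB"), 1))  # 9.1
```

The difference is about 2% at the kilobyte level but grows with every step up the scale, to roughly 9% between a terabyte and a tebibyte. That is a meaningful amount when capacity is being bought by the rack.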
A vast challenge to digital precision is language, not quantitative terminology. Take names in databases, for example. Internationally, there is no particular problem with Smith or Bush or Carter or indeed Murphy. But there is an international problem with accents or non-English characters. Because computing grew primarily in the United States and English-speaking countries (no disrespect to other cultures which are now leaders in digital advancement) the computing vocabulary, when it extended beyond quantities/figures, was monolingual.
So in Ireland, to take a local example, personal names in Irish are a problem in state records, customer databases and mobile communications, not to mention place names and their varied spellings. The family names Ó Muirí and Ó Ceallacháin, for example, have a perennial set of digitisation problems. The first is obvious: the í and á, that is, the accent or síneadh fada on the letters. Word processing has been able to cope with that for decades, particularly because other common languages share that attribute.
A second and similarly obvious difficulty is that most Irish speakers use their surnames in both Irish and English, as the occasion demands or as appropriate when ‘signing up’. Thus there could be three variants: Ó Muirí, Murray or O’Murray. Ní Mhuineacháin in English is Monaghan or Monahan, with no indication of gender.
A third difficulty is that capital letters are not accented in some languages. This was a long-standing practice in Irish letterpress printing, where sets of movable type were missing capitals with accents (Ó, Á, Ú, etc). It was overcome at last by word processing and offset printing, which enabled diacritical marks in typefaces.
Last but not least, we are still bedevilled by the inability of some systems to allow apostrophes in surnames: O’Brien, O’Callaghan, O’Dea and so on are rendered O Brien, O Callaghan, etc. Online subscription or purchase systems, especially for credit cards, airline bookings and hotel reservations, may be unable to accept names with an apostrophe. Error messages are known to say “That is not a valid name”! In such cases, if the name on record does not match your passport or your bank account you are in trouble: refusal of service or a security problem. Several state services and Ryanair were guilty in recent years, almost certainly because they were using ‘foreign’ systems. Many current interactive smartphone apps are built on English language engines and do not accept either apostrophes or accents, including the fada.
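Accepting such names need not be difficult. The following is a minimal sketch in Python of a more tolerant name check, not a description of how any of the systems mentioned actually work. The function name and the set of permitted punctuation are assumptions made for illustration.

```python
import unicodedata

# Punctuation that legitimately occurs inside personal names (an assumed,
# deliberately small set). U+0027 is the typewriter apostrophe and U+2019
# is the typographic one that word processors insert.
ALLOWED_PUNCTUATION = {"'", "\u2019", "-", " "}


def is_plausible_name(name: str) -> bool:
    """Accept letters from any script plus a small set of name punctuation."""
    # NFC joins a base letter and a separate combining accent into one
    # character, so "O" + U+0301 is treated the same as a typed "Ó".
    name = unicodedata.normalize("NFC", name.strip())
    if not name:
        return False
    # A name must start and end with a letter.
    if not (name[0].isalpha() and name[-1].isalpha()):
        return False
    return all(ch.isalpha() or ch in ALLOWED_PUNCTUATION for ch in name)


print(is_plausible_name("O'Brien"))       # True
print(is_plausible_name("Ó Ceallacháin")) # True
print(is_plausible_name("O'Brien1"))      # False
```

The point of the design is that the check describes what a name may contain rather than listing the 26 unaccented letters and rejecting everything else. Names in scripts whose accents do not combine into single characters would need further care.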
So in databases of names there is a peculiarly Irish problem. There may be as many as six variations of the name of the same person: Murray, O’Murray, O Murray, Ó Muirí, O Muirí and O Muiri. From another point of view, that poses major challenges to database administrators and authorities. Given a set of databases that go back for more than two decades (or less) it is well-nigh impossible to eliminate the possibility that the same person is listed in two or more records as a different individual. That automatically poses potential security and credit rating risks, not to mention eligibility for state schemes of various kinds. The increasing power of search engines is not an answer to this problem unless the surname or full name is unusual or distinctive. The date of birth, a shaky pillar of validation and security systems, is not a certain answer.
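One common way to narrow the problem is to compare records on a simplified key while leaving the stored name exactly as the person wrote it. The sketch below, in Python, is illustrative only. The function names are invented for the example, and the alias table is a hypothetical stand-in for a curated list, since no string rule can discover that Ó Muirí and Murray are the same family name.

```python
import unicodedata


def match_key(surname: str) -> str:
    """Reduce a surname to a comparison key; the stored name is untouched."""
    # Decompose accented letters (Ó becomes O plus a combining acute accent)
    # and drop the combining marks.
    decomposed = unicodedata.normalize("NFD", surname)
    unaccented = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Keep letters only (dropping apostrophes, spaces, hyphens), ignore case.
    return "".join(ch for ch in unaccented if ch.isalpha()).casefold()


# Hypothetical alias table linking Irish and English forms of one name.
# Real links would have to come from a curated source.
ALIASES = {"omuiri": "murray", "omurray": "murray"}


def canonical_key(surname: str) -> str:
    """Apply the alias table on top of the simplified key."""
    key = match_key(surname)
    return ALIASES.get(key, key)


variants = ["Murray", "O\u2019Murray", "O Murray", "Ó Muirí", "O Muirí", "O Muiri"]
print(sorted({match_key(v) for v in variants}))  # ['murray', 'omuiri', 'omurray']
print({canonical_key(v) for v in variants})      # {'murray'}
```

Folding accents, apostrophes and spacing brings the six spellings down to three keys, and only the hand-built alias table closes the remaining gap. A matching key of this kind flags candidates for review. It does not prove that two records belong to the same person.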
ASCII characters are the basis of international standards, but although they are regarded as almost as old as computing they were in fact published first in 1963. American Standard Code for Information Interchange is the full name and note the ‘American.’ It had its origins in teleprinter and telex systems, now disappeared for several generations.
To add another challenge: Telex was in due course replaced by fax and in recent years scanning-to-email. Which brings us to one of the most daunting tasks in collating, indexing and searching masses of data: images. A scan of a letter, official document or certificate is a JPG or other image format. The contents are hidden. The ‘official’ description for indexing of the image may be mistaken or false. The same is broadly true of a PDF, the most common format for electronic transmission or downloading of longer documents. The format is not readily searchable, although Adobe provide some tools for searching for words/phrases in multiple PDFs.
Those few examples illustrate the immense range of difficulties in managing data. Information management is a better term when we are talking about information for human use. The machines can speak to each other in bits and bytes and code. But more and more, the data that computing works with is a mix of human input, purely machine data (IoT and similar) and AI which makes automated decisions on the basis of programmed interpretation of the data mass available to it.
So financial institutions using advanced AI to make credit (or fraud) decisions are dependent on the integrity of the data. Among other factors, family names in Irish, apostrophes and diacritical marks can screw that up.
We are not talking about errors, though those can cause systems failure or at least failure to perform the task at hand. We are emphasising the crucial importance of data integrity and consistency. To give an easy example, how often do the differences between the US gallon and pint and our liquid measures give rise to mistakes in shipping quantities or paperwork? At a higher level, comparative international trade figures often have errors in measurement of the units, currency exchange rates or inconsistent dates of national reports.
Integrity, consistency of the data — and the platform accuracy — are the essentials for a digitally functioning world. Digital transformation is a much-used term these days but it relies on the accuracy of the accumulated data. So a first step is to review the corporate database assets. This column calls for data pedantry. Workable accuracy is not enough. Precision to the Five Nines standard [99.999%] is essential for automated systems and AI. Third party and partner data input or optional resources are increasingly common, from multinationals to the EU to healthcare down to smart SMEs. It is crucial for an organisation to constantly review and judge the integrity of its own data. It is nearly as important to regularly review in depth external sources of information.
In the data universe that is inevitable, pedants will rule supreme.