The evolution of the dataset – why it’s not just about the algorithm

Pause to consider why Wall St. is all abuzz about ‘quants’ and the use of algorithms, or why major brokerage houses announce their application of artificial intelligence with ‘robotic’ advisors to manage your finances. Although these PhDs in mathematics and physics are, no doubt, rocket scientists, and while their algorithms are yielding results, it is not the algorithm, per se, that is the basis for their success, but rather the dataset.

What is a dataset, you ask? Consider your average database holding all of your data. What portion of that data is unique? For example, you may be a major bank with an extensive client database containing many entries for people living in New York. The fact that New York appears in your database several thousand times is part of your data; the item ‘New York’, as a city or a state, would be one unique item in a dataset.
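The distinction can be sketched in a few lines of Python. The records and field names below are purely illustrative, not an actual bank schema: the database holds every row, while the dataset of cities holds each unique value once.

```python
# Illustrative client records: "New York" appears in the data many times.
client_rows = [
    {"name": "A. Smith", "city": "New York"},
    {"name": "B. Jones", "city": "New York"},
    {"name": "C. Lee", "city": "Boston"},
]

# The dataset of unique cities collapses the repeats into single items.
unique_cities = {row["city"] for row in client_rows}

print(len(client_rows), len(unique_cities))  # 3 2
```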

So back to Wall St. – while the algorithms producing results are certainly exciting, they rely on what is probably the world’s most refined dataset – financial transaction data. Timed to the second, the buy, the sell, the stock, the volume – all the data is perfect. There is no ambiguity about how IBM traded on any particular day of the week for the past umpteen years.

How is this relevant in the HR tech space? Consider CareerBuilder’s late-2015 acquisition of Textkernel, a parsing company; CareerBuilder itself was recently acquired by a private equity group led by Apollo Management. Or Randstad’s recent acquisition of Monster, or Recruit Holdings’ acquisition of Indeed. What is the connection between brick-and-mortar staffing firms and premier career sites, and what value did Textkernel bring to CareerBuilder? The global brick-and-mortar operations are picking up internet job board real estate, and all want superior search/match algorithms so their recruiters can operate as efficiently as possible, given the volume of resumes received.

How do you find the best candidate for an open position when you have mountains of data? Keyword searching has long since been replaced by semantic matching. Semantic matching, available from a number of firms, uses sophisticated algorithms to identify relevant competencies (along with other parameters) and find the needle in the haystack of data. Of course, those algorithms are highly dependent on the dataset. The dataset most relevant to successful recruitment and hiring in HR is found in resumes and job postings. As such, the likelihood of success of any AI-based recruiting application depends heavily on the quality of that dataset.

Our company, HireAbility, has had the opportunity to parse over 100 million resumes and job postings for hundreds of clients on six continents. Our support for over 40 languages and dialects has helped amass a dataset comprising over 54,000 multilingual competencies and job titles and over 650,000 multilingual classification sets, including names, locations, resume section headers, education summaries and school degrees. In fact, our clients contribute to our dataset in an effort to continually improve the quality of the parsing results.

HireAbility maintains two separate datasets. The first is a set of closely related hierarchical terms that aid in skill and job title standardization. Hierarchies identify parent/child relationships between skills or job titles. For example, the skills Capital Planning, Budget Control, Bookkeeping and Medical Billing all have the parent Accounting. With the Hierarchies option enabled, ALEX returns “Accounting” as an additional derived skill whenever one of the children of Accounting is identified. Skills can have multiple parents and multiple levels of parents. For example, the skill Quickbooks has the parent Accounting Software, which in turn has two parents: Accounting and Software.
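The derivation described above can be sketched as a walk up a parent map. This is a minimal illustration using the examples from the text, not ALEX’s actual data structure or API: each skill lists its parents, and derived skills are all ancestors reachable from an identified skill.

```python
# Illustrative skill hierarchy (a skill may have multiple parents).
SKILL_PARENTS = {
    "Capital Planning": ["Accounting"],
    "Budget Control": ["Accounting"],
    "Bookkeeping": ["Accounting"],
    "Medical Billing": ["Accounting"],
    "Quickbooks": ["Accounting Software"],
    "Accounting Software": ["Accounting", "Software"],  # two parents
}

def derived_skills(skill: str) -> set:
    """Walk the hierarchy upward, collecting every ancestor as a derived skill."""
    ancestors = set()
    stack = [skill]
    while stack:
        for parent in SKILL_PARENTS.get(stack.pop(), []):
            if parent not in ancestors:
                ancestors.add(parent)
                stack.append(parent)  # continue climbing from this parent
    return ancestors

print(sorted(derived_skills("Quickbooks")))
# ['Accounting', 'Accounting Software', 'Software']
```

Because a skill can have several parents at several levels, the hierarchy is a graph rather than a simple tree, which is why the walk tracks already-visited ancestors.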

The second dataset is a collection of words and word sequences that aid in parsing resumes and job postings. Both datasets are multilingual and contain geographical, ordinal-mapping and categorical data.

Post-parsing, the resulting XML or JSON is used by our clients for more precise searching and/or matching between job postings and resumes/CVs. The datasets grow continually through automated data collection and machine learning algorithms. As competencies, titles and the relationships between them develop, the dataset can never be finished; it is organic and evolving. But it has come a long way since our earliest parsing efforts in 2001.

This precision in parsing in turn helps create highly accurate datasets for any company involved with big data, statistics or analytics. Imagine parsing a million resumes sitting in your database and extracting a combination of skill sets, schools (geography) and degrees from each; that data can assist in hiring decisions. Or, as a result of parsing, you could collect data on people who are currently looking for work and have more than 15 years of experience in a particular field, or statistics on how many people with a particular competency and a Bachelor’s degree are already in-house.
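A hedged sketch of that kind of post-parse analytics: assume each resume has already been parsed into a record. The field names below are illustrative placeholders, not HireAbility’s actual XML/JSON schema.

```python
# Illustrative parsed-resume records (field names are hypothetical).
parsed_resumes = [
    {"skills": {"Accounting", "Quickbooks"}, "degree": "Bachelor's", "years_experience": 17},
    {"skills": {"Nursing"}, "degree": "Associate", "years_experience": 5},
    {"skills": {"Accounting"}, "degree": "Bachelor's", "years_experience": 3},
]

def count_matches(resumes, skill, degree, min_years):
    """Count candidates holding a given skill and degree with enough experience."""
    return sum(
        1
        for r in resumes
        if skill in r["skills"]
        and r["degree"] == degree
        and r["years_experience"] >= min_years
    )

# How many Accounting candidates with a Bachelor's and 15+ years?
print(count_matches(parsed_resumes, "Accounting", "Bachelor's", 15))  # 1
```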

In summary, there’s no question that the algorithms that provide superior search and match results for career sites, applicant tracking systems, staffing/recruiting firms, recruiters, analytics, VMS and HRIS tools have to be top-notch. But vital to their results is the quality of the dataset. Resume and job posting parsing gives HR the baseline dataset from which all talent acquisition can begin. “The loftier the building, the deeper must the foundation be laid.” – Thomas à Kempis

Steve Kenda, CEO, HireAbility, LLC

Free Trial

Our FREE CV / Résumé parsing and Job Order parsing trial includes 30 parses and is valid for 30 days.

Request A Free Trial Today

Ask Us About:

  • Semantic Searching and Matching capabilities

  • Batch & email processing

  • Language support

  • Customization

  • OCR capabilities