Tuesday, October 29, 2013

PATSTAT - TM - ORBIS match from EPO-OHIM

What follows is a personal synthesis, including many extracts from:

Intellectual property rights intensive industries: contribution to economic performance and employment in the European Union
http://www.epo.org/service-support/publications/studies/ip-intensive-industries.html

It focuses on chapter 9, where patent and trade mark (TM) applicants were matched with ORBIS companies.



Harmonization and matching:

Convert names to uppercase

Clean legal form information
A dictionary was created, containing 480 regular expressions (regex) allowing for identification and removal of legal forms typical in each Member State of the European Union. For some countries (BE, DE, PL), a second step of legal form cleaning was added. In the case of Belgium, the purpose was mainly to look for cases where the legal form was indicated in both French and Dutch. For Germany and Poland, the second cleaning loop was designed to deal with composite legal forms, such as GMBH CO KG.
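To make the two-pass idea concrete, here is a minimal Python sketch. The tiny `LEGAL_FORMS` dictionary and the function name `strip_legal_forms` are my own illustrative assumptions (the study's real dictionary had about 480 expressions, and the report does not publish it). The sketch only strips a legal form at the end of the name, so a composite or bilingual form needs the second pass to be peeled off completely.

```python
import re

# Hypothetical mini-dictionary: country code -> legal-form suffixes.
# Illustrative only; not the study's actual list of regular expressions.
LEGAL_FORMS = {
    "DE": ["GMBH", "AG", "KG", "CO"],
    "BE": ["NV", "SA", "BVBA", "SPRL"],
    "FR": ["SA", "SARL", "SAS"],
}

# Countries that receive a second cleaning pass, as described in the text.
SECOND_PASS = {"BE", "DE", "PL"}


def strip_legal_forms(name, country):
    """Strip trailing legal forms from an upper-cased applicant name."""
    passes = 2 if country in SECOND_PASS else 1
    for _ in range(passes):
        for suffix in LEGAL_FORMS.get(country, []):
            # Only remove the form when it is the last word of the name.
            name = re.sub(r"\s+" + suffix + r"$", "", name)
    return name


print(strip_legal_forms("MUELLER GMBH CO KG", "DE"))
print(strip_legal_forms("ACME NV SA", "BE"))
```

In this sketch a single pass over "MUELLER GMBH CO KG" would leave "MUELLER GMBH" behind, which is the kind of residue the second loop is meant to catch.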


Convert special characters using NFKD normalizer

By default, IP registers can record applicant names using the national characters of the country of origin. Nevertheless, sometimes applicants or their legal representatives file new applications with the name already converted into its Latin equivalent, without any specific national characters. This problem was dealt with by applying the Normalization Form Compatibility Decomposition (NFKD) Unicode normalisation transformation procedure implemented in Java. This allowed for automatic conversion of all the names into the normalised forms.
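The study used Java for this step; the same transformation is available in Python's standard library, so here is a small sketch of it. One caveat: NFKD only decomposes a character into a base letter plus combining marks, so dropping the combining marks afterwards is an extra step that I am assuming here (the report does not spell it out). The function name `to_latin` is mine.

```python
import unicodedata


def to_latin(name):
    """Decompose with NFKD, then drop the combining marks (assumed extra step)."""
    decomposed = unicodedata.normalize("NFKD", name)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(to_latin("MÜLLER"))
print(to_latin("SOCIÉTÉ GÉNÉRALE"))
```

Letters that have no Unicode decomposition, such as Ł or Ø, pass through unchanged and would need a separate mapping.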

Clean other characters, remove double spaces etc.
In a further pre-processing step, all characters other than a-zA-Z0-9&@$+ were replaced with a space, and periods were removed. Leading and trailing whitespace was also removed, and multiple whitespace characters were reduced to one space.
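A minimal Python sketch of this cleaning step follows. The order of operations (periods first, so that "S.A." collapses to "SA" instead of "S A") is my assumption, as is the function name `clean_characters`.

```python
import re


def clean_characters(name):
    """Apply the character cleaning described above to one name."""
    name = name.replace(".", "")                   # periods are removed outright
    name = re.sub(r"[^a-zA-Z0-9&@$+]", " ", name)  # anything else becomes a space
    name = re.sub(r"\s+", " ", name)               # collapse runs of whitespace
    return name.strip()                            # drop leading/trailing spaces


print(clean_characters("  A.B.C.  WIDGETS-R-US, LTD "))
print(clean_characters("AT&T"))
```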

Clean nondistinctive, weak words

As a first step, each country was assigned a code specific to that country/language, and non-distinctive words were removed from the normalised names. The list of non-distinctive words was based on a calculation of the presence of words within the firms' names and a thorough, labour-intensive analysis of each data set. This part of the procedure was not wholly automatic, as not all the relatively frequent words were removed from the normalised name field through the automated procedure. By the same token, some words that are relatively less frequent than others were removed from the normalised names because, after analysis of each dataset, it turned out that they were not distinctive.
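The automated part of this step can be sketched as a word-frequency count that only proposes candidates; the final list still needs the manual review described above. The threshold `min_share`, the function names, and the rule of keeping a name that would otherwise become empty are all my assumptions, not details from the report.

```python
from collections import Counter


def frequent_words(names, min_share):
    """Return words present in at least min_share of the names (candidates only)."""
    if not names:
        return set()
    counts = Counter()
    for name in names:
        counts.update(set(name.split()))  # count each word once per name
    return {w for w, c in counts.items() if c / len(names) >= min_share}


def remove_weak_words(name, weak_words):
    """Drop non-distinctive words from a normalised name."""
    kept = [w for w in name.split() if w not in weak_words]
    # Assumption: keep the name as-is if every word would be removed.
    return " ".join(kept) if kept else name


names = ["ACME INTERNATIONAL", "BETA INTERNATIONAL", "GAMMA HOLDING", "DELTA INTERNATIONAL"]
weak = frequent_words(names, 0.5)
print(weak, remove_weak_words("ACME INTERNATIONAL", weak))
```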

Two further refinements were applied: natural and legal persons were matched separately, and any "trading as" denomination (which is language sensitive) was separated from the original name.

The resulting normalised names are then matched.

After the initial matching phase, the one-to-one matches (one EPO/OHIM record matched with only one ORBIS record) were filtered out, and one-to-many matches (where one EPO/OHIM record matched several ORBIS records) were selected for further processing.
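As a rough illustration of how the two groups can be separated, here is a Python sketch that assumes exact equality of normalised names. The dictionary layouts and the name `split_matches` are hypothetical.

```python
from collections import defaultdict


def split_matches(ip_names, orbis_names):
    """ip_names: {ip_id: normalised name}; orbis_names: {bvd_id: normalised name}.

    Returns (one_to_one, one_to_many) keyed by the EPO/OHIM record id.
    """
    by_name = defaultdict(list)
    for bvd_id, name in orbis_names.items():
        by_name[name].append(bvd_id)

    one_to_one, one_to_many = {}, {}
    for ip_id, name in ip_names.items():
        candidates = by_name.get(name, [])
        if len(candidates) == 1:
            one_to_one[ip_id] = candidates[0]   # accepted straight away
        elif len(candidates) > 1:
            one_to_many[ip_id] = candidates     # needs disambiguation
    return one_to_one, one_to_many
```

Records with no candidate at all end up in neither dictionary.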

At this stage, additional information (other than the firm name) was used.
The ORBIS dataset contains a field called DUO (domestic ultimate owner). As a first step, all the companies from the ORBIS dataset were grouped by their normalised name and a check was carried out to establish how many unique DUO numbers corresponded to each group. If there was only one DUO number associated with several ORBIS firms with the same normalised name, then the record associated with that company was taken as a potential match.

ORBIS Domestic Ultimate Owner CHECK
Before matching those records, the completeness of the DUO company record was compared with that of the other companies in the group, in terms of turnover and employment reported. This was necessary because no information was available on whether the DUO company was consolidating accounts of its subsidiaries. Therefore, the EPO/OHIM record was matched to only one relevant ORBIS record (DUO or subsidiary), namely that with the highest turnover and employment figures within the group.
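A sketch of the DUO check in Python might look as follows. The field names (`duo`, `turnover`, `employees`) are hypothetical, and the ranking rule (turnover first, employment as tie-breaker, missing figures counted as zero) is my assumption, since the report only says "highest turnover and employment figures".

```python
def duo_check(candidates):
    """candidates: ORBIS records (dicts) that share one normalised name.

    Returns the chosen record, or None if the group does not share a single DUO.
    """
    duos = {c.get("duo") for c in candidates}
    if len(duos) != 1 or None in duos:
        return None
    # Assumption: rank by turnover, then employment; missing values count as 0.
    return max(
        candidates,
        key=lambda c: (c.get("turnover") or 0, c.get("employees") or 0),
    )
```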

ROOT BVD ID CHECK
Next, groups of ORBIS records with the same normalised name and the same Bureau van Dijk (BvD) id root were identified. (Sometimes ORBIS branches or subsidiaries have the same number as the parent company, with additional digits separated from the root number by a hyphen.)
This hyphen and all digits following the hyphen were stripped off to check whether all the ORBIS companies with the same normalised name had the same root BvD id number. If so, the EPO/OHIM record was linked with the company whose BvD id number was the root number for all ORBIS companies with the same normalised name.
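The root check is simple to express in code. In this Python sketch the id values are made up, and returning nothing when the root id itself is not among the candidates is my assumption.

```python
def root_bvd_check(bvd_ids):
    """bvd_ids: BvD id numbers of ORBIS records sharing one normalised name.

    Returns the root id if all ids share it and it is itself a candidate.
    """
    roots = {bvd_id.split("-", 1)[0] for bvd_id in bvd_ids}  # strip "-digits"
    if len(roots) != 1:
        return None
    root = roots.pop()
    return root if root in bvd_ids else None


print(root_bvd_check(["DE1234567890", "DE1234567890-1000", "DE1234567890-1001"]))
```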

LEGAL FORM CHECK
Subsequently, the algorithm checked whether among the ORBIS companies with the same normalised name there was only one company with the same legal form as at least one company in the EPO/OHIM database.
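My reading of this check, as a short Python sketch: keep the ORBIS candidate only if it is the single one whose legal form also appears among the EPO/OHIM records with that name. The data layout and the name `legal_form_check` are hypothetical.

```python
def legal_form_check(ip_legal_forms, orbis_candidates):
    """ip_legal_forms: set of legal forms seen on the EPO/OHIM side.
    orbis_candidates: {bvd_id: legal form} for one normalised name.
    """
    hits = [b for b, form in orbis_candidates.items() if form in ip_legal_forms]
    return hits[0] if len(hits) == 1 else None
```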

ZIP CODE CHECK
In a final attempt to find a unique match, the postal codes in the EPO/OHIM record were compared with those in the various ORBIS records matched to it. If only one ORBIS record matched the postal code in the EPO/OHIM record, it was added to the matched dataset.
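A Python sketch of the postal code comparison is below. Removing spaces and ignoring case before comparing is my assumption (the report does not say how codes were compared), and the function names are hypothetical.

```python
def normalise_zip(code):
    """Assumed normalisation: drop whitespace and upper-case the code."""
    return "".join(code.split()).upper()


def zip_code_check(ip_zip, orbis_candidates):
    """orbis_candidates: {bvd_id: postal code or None} for one EPO/OHIM record."""
    target = normalise_zip(ip_zip)
    hits = [
        bvd_id
        for bvd_id, code in orbis_candidates.items()
        if code and normalise_zip(code) == target
    ]
    return hits[0] if len(hits) == 1 else None
```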

After all the checks above, any matched records that still had one-to-many relationships were disregarded.
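Taken together, the checks behave like a cascade: each one is tried in turn, the first one that yields a unique ORBIS record wins, and a record that survives all of them unresolved is dropped. A generic Python sketch, where the check functions and their common signature are my own assumptions:

```python
def disambiguate(ip_record, candidates, checks):
    """Run the checks in order and return the first unique match, else None.

    Each check is a function (ip_record, candidates) -> match or None.
    """
    for check in checks:
        winner = check(ip_record, candidates)
        if winner is not None:
            return winner
    return None  # still one-to-many: the record is disregarded
```

The order of the list passed as `checks` would follow the order of the sections above (DUO, root BvD id, legal form, postal code).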

MANUAL SEARCH
For the manual checking process, applicant information from sources other than ORBIS was used, such as national business registers or company websites, in order to find the reason for the non-match.
In some cases, for example, it could be established that the company had recently changed its name. In such cases, this new piece of information was used to query the ORBIS database again.
Thus, the normalised name in ORBIS sometimes did not correspond to the normalised name in the OHIM/EPO database.


CONCORDANCE TABLES

Three tables were produced:
ORBIS-EPO concordance:
person_id number from the tls206_person table of PATSTAT and the BvD id number from the ORBIS dataset.

ORBIS-OHIM concordance:
owner_code from the dim_owner table of OHIM's datawarehouse and the BvD id number from the ORBIS dataset.

EPO-OHIM concordance:
person_id number from the tls206_person table of PATSTAT and the owner_code from the dim_owner table of OHIM's datawarehouse.


Redistribution from head offices
One problem identified during the initial calculations was the presence of some general, non-specific industry codes, namely 7010 Activities of head offices, 6420 Activities of holding companies and 8299 Other business support service activities n.e.c.
If some industries were more prone than others to leave maintenance of their trade mark / patent portfolio to the holding company or head office, this could have distorted the industry-intensity analysis, as those industries would then be underrepresented in the general classification.


Redistribution of patents from head offices
In terms of absolute patent intensity, industry codes 7010, 6420 and 8299 were ranked second, third and 27th, respectively. This phenomenon reflects the common business practice of concentrating patent portfolios at head offices, which also handle all the relevant filing and registering procedures.

Matches to codes 7010, 6420 and 8299 were analysed in ORBIS and reassigned as follows:

if the match was to a DUO, it was assigned to the subsidiaries with the same name;
if the match was not to a DUO, it was assigned to the sister subsidiaries (those related to the same DUO).
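The two rules above can be sketched in Python as a function that lists the firms a head-office match would be reassigned to. The record layout is hypothetical, and so are two conventions I rely on: a DUO record carries its own BvD id in the `duo` field, and the DUO itself is not counted as a sister subsidiary. How the patents were then split among several targets is not covered here.

```python
HEAD_OFFICE_CODES = {"7010", "6420", "8299"}


def redistribution_targets(matched, orbis_records):
    """Return the BvD ids that should receive the match of `matched`."""
    if matched["nace"] not in HEAD_OFFICE_CODES:
        return [matched["bvd_id"]]          # nothing to redistribute
    is_duo = matched["bvd_id"] == matched["duo"]
    targets = []
    for rec in orbis_records:
        if rec["bvd_id"] == matched["bvd_id"] or rec["duo"] != matched["duo"]:
            continue                        # skip itself and other groups
        if is_duo and rec["name"] == matched["name"]:
            targets.append(rec["bvd_id"])   # same-name subsidiary of the DUO
        elif not is_duo and rec["bvd_id"] != rec["duo"]:
            targets.append(rec["bvd_id"])   # sister subsidiary, DUO excluded
    return targets
```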


Redistribution of TM from head offices
The same problem arose for trade marks: codes 7010, 6420 and 8299 ranked 1st, 2nd and 3rd. It was dealt with in exactly the same manner as described in the previous section.


NACE codes at different levels of aggregation
In some cases, ORBIS assigns to a firm the NACE code at a higher level of aggregation (3-digit group or 2-digit division) when in the NACE classification those codes could be disaggregated into a lower level of analysis (class). For computational reasons, ORBIS adds one or two zeros to such group or division codes in order to create 4-digit classes in all records. These classes are referred to as synthetic classes.
It was decided to deal with this problem by redistributing the patents associated with synthetic classes among the classes within the division or group, as applicable.
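The report extract does not say which weights were used for the redistribution, so the following Python sketch is only one possible reading: I assume the patents of a synthetic class are shared out in proportion to the patents already counted in the real classes of the same group or division, with an equal split when those classes have none. A list of valid NACE classes is needed because some genuine classes also end in zero (7010 and 6420 above are examples). All names here are hypothetical.

```python
def redistribute_synthetic(counts, valid_classes):
    """counts: {4-digit code: patent count}; valid_classes: set of real NACE classes."""
    result = {c: float(n) for c, n in counts.items() if c in valid_classes}
    for code, n in counts.items():
        if code in valid_classes:
            continue
        # "2600" stands for division 26, "2610" for group 26.1
        prefix = code[:2] if code.endswith("00") else code[:3]
        targets = sorted(c for c in valid_classes if c.startswith(prefix))
        if not targets:
            result[code] = result.get(code, 0.0) + n  # nothing to spread over
            continue
        total = sum(counts.get(c, 0) for c in targets)
        for c in targets:
            # Assumed weights: proportional to existing counts, else equal split.
            share = counts.get(c, 0) / total if total else 1 / len(targets)
            result[c] = result.get(c, 0.0) + n * share
    return result


counts = {"2611": 30, "2612": 10, "2610": 8}
print(redistribute_synthetic(counts, {"2611", "2612", "2620"}))
```

In the usage line, the 8 patents recorded under the synthetic class 2610 are split between 2611 and 2612 in a 3:1 ratio, while 2620 belongs to a different group and receives nothing.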