What follows is a personal synthesis, including many extracts, of:
Intellectual property rights intensive industries: contribution to economic performance and employment in the European Union
http://www.epo.org/service-support/publications/studies/ip-intensive-industries.html
focusing on chapter 9, where patent and trade mark (TM) applicants were matched with ORBIS companies.
Harmonization and matching:
Convert names to uppercase
Clean legal form information
A dictionary was created, containing 480 regular expressions (regex) allowing
for identification and removal of legal forms typical in each Member State of
the European Union. For some countries (BE, DE, PL), a second step of legal
form cleaning was added. In the case of Belgium, the purpose was mainly to look
for cases where the legal form was indicated in both French and Dutch. For
Germany and Poland, the second cleaning loop was designed to deal with
composite legal forms, such as GMBH CO KG.
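A minimal sketch of this step in Python, under stated assumptions: the tiny LEGAL_FORMS dictionary, the SECOND_PASS set and both function names are hypothetical stand-ins for the study's 480-pattern dictionary, which is not reproduced here. The second loop simply keeps stripping trailing legal forms until nothing changes, which is one possible way to handle composite or bilingual forms, not necessarily the study's.

```python
import re

# Hypothetical mini-dictionary; the study used about 480 regular expressions.
LEGAL_FORMS = {
    "DE": ["GMBH", "AG", "KG", "CO"],
    "BE": ["NV", "SA", "BVBA", "SPRL"],
    "PL": ["SP Z O O", "SA", "SP K"],
    "FR": ["SARL", "SA"],
}
# Countries for which a second cleaning loop is applied (per the notes above).
SECOND_PASS = {"BE", "DE", "PL"}


def strip_trailing_legal_form(name, country):
    """Remove one legal form (and any joining '&' or '/') from the end of name."""
    forms = LEGAL_FORMS.get(country)
    if not forms:
        return name
    pattern = r"[\s&/]*\b(?:" + "|".join(forms) + r")\s*$"
    return re.sub(pattern, "", name).strip()


def clean_legal_form(name, country):
    """First pass for every country, repeated passes for BE, DE and PL."""
    cleaned = strip_trailing_legal_form(name, country)
    if country in SECOND_PASS:
        while True:
            again = strip_trailing_legal_form(cleaned, country)
            if again == cleaned:
                break
            cleaned = again
    return cleaned
```

For instance, clean_legal_form("MUELLER GMBH & CO KG", "DE") strips KG, then CO, then GMBH, while a name that merely ends in the letters of a legal form (such as SAAG) is left alone because of the word-boundary check.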
Convert special characters using NFKD normalizer
By default, IP registers can record applicant names
using the national characters of the country of origin. Nevertheless, sometimes
applicants or their legal representatives file new applications with the name
already converted into its Latin equivalent, without any specific national
characters. This problem was dealt with by applying the Normalization Form
Compatibility Decomposition (NFKD) Unicode normalisation transformation
procedure implemented in Java. This allowed for automatic conversion of all the
names into the normalised forms.
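The study applied NFKD in Java; the same transformation is available in Python's standard library, and a minimal sketch might look as follows. Dropping the combining marks after decomposition is an assumption on my part (the extract only mentions NFKD itself), and to_latin is a hypothetical helper name.

```python
import unicodedata


def to_latin(name):
    """Decompose with NFKD, then drop combining marks (assumed extra step)."""
    decomposed = unicodedata.normalize("NFKD", name)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))
```

Note that letters without a Unicode decomposition, such as Ł, pass through unchanged, so a real pipeline would probably need an additional mapping for them.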
Clean other characters, remove double spaces, etc.
In a further pre-processing step all characters other
than a-zA-Z0-9&@$+ were replaced with a space, and periods were removed.
Leading and trailing whitespaces were also removed, and
multiple whitespaces were reduced to one space.
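A minimal Python sketch of this pre-processing step. Removing the periods before replacing the other characters is my reading of the extract (so that S.A. becomes SA rather than S A), and clean_characters is a hypothetical name.

```python
import re


def clean_characters(name):
    """Apply the character clean-up described above to a single name."""
    name = name.replace(".", "")                    # periods are removed outright
    name = re.sub(r"[^a-zA-Z0-9&@$+]", " ", name)   # everything else becomes a space
    name = re.sub(r"\s+", " ", name)                # collapse runs of whitespace
    return name.strip()                             # trim leading/trailing space
```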
Clean non-distinctive, weak words
As a first step, each country was assigned a code specific to that
country/language, and non-distinctive words were removed from the normalised
names. The list of non-distinctive words was based on a calculation of the
presence of words within the firms’ names and a thorough, labour-intensive
analysis of each data set. This part of the procedure was not wholly automatic,
as not all the relatively frequent words were removed from the normalised name
field through the automated procedure. By the same token, some words that are
relatively less frequent than others were removed from the normalised names
because, after analysis of each dataset, it turned out that they were not
distinctive.
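The automated part of this step could be sketched in Python as below. This only proposes candidates by frequency; the final list in the study was curated by hand. The min_share threshold, both function names and the rule of keeping a name that would otherwise become empty are assumptions for illustration.

```python
from collections import Counter


def candidate_weak_words(names, min_share=0.5):
    """Words present in at least min_share of the names (candidates only)."""
    counts = Counter()
    for name in names:
        counts.update(set(name.split()))   # count each word once per name
    threshold = min_share * len(names)
    return {word for word, n in counts.items() if n >= threshold}


def remove_weak_words(name, weak_words):
    """Drop weak words; keep the name unchanged if nothing would remain."""
    kept = [word for word in name.split() if word not in weak_words]
    return " ".join(kept) if kept else name
```

The candidate set would then be reviewed per country/language, with words added or removed by hand, before remove_weak_words is applied.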
Further refinements were:
separate matching of natural and legal persons;
separation of the (language-sensitive) "trading as" denomination from the
original name.
The names obtained in this way are then matched.
After the initial matching phase, the one-to-one
matches (one EPO/OHIM record matched with only one ORBIS record) were filtered
out, and one-to-many matches (where one EPO/OHIM record matched several ORBIS
records) were selected for further processing.
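The split between one-to-one and one-to-many matches can be sketched in Python as follows, assuming both datasets are available as lists of (identifier, normalised name) pairs. The function name and the data layout are hypothetical.

```python
from collections import defaultdict


def match_by_name(ip_records, orbis_records):
    """Exact match on normalised name, split by number of ORBIS hits."""
    orbis_by_name = defaultdict(list)
    for bvd_id, name in orbis_records:
        orbis_by_name[name].append(bvd_id)

    one_to_one, one_to_many = {}, {}
    for ip_id, name in ip_records:
        hits = orbis_by_name.get(name, [])
        if len(hits) == 1:
            one_to_one[ip_id] = hits[0]     # accepted as a match
        elif len(hits) > 1:
            one_to_many[ip_id] = hits       # needs disambiguation
    return one_to_one, one_to_many
```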
At this stage, additional
information (other than the firm name) was used.
The ORBIS dataset contains a field called DUO
(domestic ultimate owner). As a first step, all the companies from the ORBIS
dataset were grouped by their normalised name and a check was carried out to
establish how many unique DUO numbers corresponded to each group. If there was
only one DUO number associated with several ORBIS firms with the same
normalised name, then the record associated with that company was taken as a
potential match.
ORBIS DOMESTIC ULTIMATE OWNER CHECK
Before matching those records, the completeness of the
DUO company record was compared with that of the other companies in the group,
in terms of turnover and employment reported. This was necessary because no
information was available on whether the DUO company was consolidating accounts
of its subsidiaries. Therefore, the EPO/OHIM record was matched to only one
relevant ORBIS record (DUO or subsidiary), namely that with the highest
turnover and employment figures within the group.
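A minimal Python sketch of the DUO check for one group of ORBIS records sharing a normalised name. The record layout (dicts with bvd_id, duo, turnover, employees) is hypothetical, and ranking by turnover first and employment second, with missing values counted as zero, is my assumption about how "highest turnover and employment figures" was applied.

```python
def pick_by_duo(group):
    """Return the BvD id to match, or None if the DUO check is inconclusive."""
    duos = {rec["duo"] for rec in group if rec["duo"] is not None}
    if len(duos) != 1:
        return None   # zero or several DUOs: the group stays ambiguous

    # Assumed ranking: turnover first, employment second, missing values as 0.
    def completeness(rec):
        return (rec["turnover"] or 0, rec["employees"] or 0)

    return max(group, key=completeness)["bvd_id"]
```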
ROOT BVD ID CHECK
Next, groups of ORBIS records with the same normalised name and the same
Bureau van Dijk (BvD) id root were identified. (Sometimes ORBIS branches or
subsidiaries have the same number as the parent company, with additional digits
separated from the root number by a hyphen.)
This hyphen and all digits following the hyphen were
stripped off to check whether all the ORBIS companies with the same normalised
name had the same root BvD id number. If so, the EPO/OHIM record was linked
with the company whose BvD id number was the root number for all ORBIS
companies with the same normalised name.
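In Python, the root check might be sketched like this. The function names are hypothetical, and returning None when no record carries the bare root number is an assumption, since the extract does not say what happened in that case.

```python
def root_bvd_id(bvd_id):
    """Strip the hyphen and everything after it."""
    return bvd_id.split("-", 1)[0]


def pick_by_root(bvd_ids):
    """Return the root BvD id if all ids in the group share it, else None."""
    roots = {root_bvd_id(bvd_id) for bvd_id in bvd_ids}
    if len(roots) != 1:
        return None
    root = roots.pop()
    # Assumption: only link when a record with the bare root number exists.
    return root if root in bvd_ids else None
```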
LEGAL FORM CHECK
Subsequently, the algorithm checked whether among the
ORBIS companies with the same normalised name there was only one company with
the same legal form as at least one company in the EPO/OHIM database.
ZIP CODE CHECK
In a final attempt to find a unique match, the postal codes in the EPO/OHIM
record were compared with those in the various ORBIS records matched to it. If
only one ORBIS record matched the postal code in the EPO/OHIM record, it was
added to the matched dataset.
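A minimal Python sketch of the postal code check, assuming the candidates are (BvD id, postal code) pairs. Normalising the codes by dropping spaces and punctuation is my own assumption, as are the function names.

```python
def normalise_postcode(code):
    """Keep only letters and digits, in uppercase (assumed normalisation)."""
    return "".join(ch for ch in code.upper() if ch.isalnum())


def pick_by_postcode(ip_postcodes, orbis_candidates):
    """Return the single candidate sharing a postal code, or None."""
    wanted = {normalise_postcode(code) for code in ip_postcodes if code}
    hits = [
        bvd_id
        for bvd_id, postcode in orbis_candidates
        if postcode and normalise_postcode(postcode) in wanted
    ]
    return hits[0] if len(hits) == 1 else None
```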
After all the checks above, matched records that still had one-to-many
relationships following the disambiguation process were disregarded.
MANUAL SEARCH
For the manual checking process, applicant information
from sources other than ORBIS was used, such as national business registers or
company websites, in order to find the reason for the non‑match.
In some cases, for example, it could be established
that the company had recently changed its name. In such cases, this new piece
of information was used to query the ORBIS database again.
Thus, the normalised name in ORBIS sometimes did not
correspond to the normalised name in the OHIM/EPO database.
CONCORDANCE TABLES
Three tables were produced:
ORBIS-EPO concordance:
person_id number from the tls206_person table of PATSTAT and the BvD id number
from the ORBIS dataset.
ORBIS-OHIM concordance:
owner_code from the dim_owner table of OHIM’s datawarehouse and the BvD id
number from the ORBIS dataset.
EPO-OHIM concordance:
person_id number from the tls206_person table of PATSTAT and the owner_code
from the dim_owner table of OHIM’s datawarehouse.
Redistribution from head offices
One problem identified during the initial calculations was the presence of some
general, non-specific industry codes, namely 7010 Activities of head offices,
6420 Activities of holding companies and 8299 Other business support service
activities n.e.c. Holding IP rights under these codes could potentially have
distorted the industry-intensity analysis if some industries were more prone
than others to leave maintenance of their trade mark / patent portfolio to the
holding company/head office, as those industries would then be underrepresented
in the general classification.
Redistribution of patents from head offices
In terms of absolute patent intensity, industry codes 7010, 6420 and 8299 were
ranked second, third and 27th, respectively. This phenomenon reflects the
common business practice of concentrating patent portfolios at head offices,
which also handle all the relevant filing and registering procedures.
Matches to codes 7010, 6420 and 8299 were analysed in ORBIS and reassigned as
follows:
if the match was to a DUO, it was assigned to the subsidiaries with the same
name;
if the match was not to a DUO, it was assigned to the sister subsidiaries
(related to the same DUO).
Redistribution of TM from head offices
The same problem as above arose for codes 7010, 6420 and 8299 (ranked 1st, 2nd
and 3rd). It was dealt with in exactly the same manner as described in the
previous section.
NACE codes at different levels of aggregation
In some cases, ORBIS assigns to a firm the NACE code
at a higher level of aggregation (3-digit group or 2-digit division) when in
the NACE classification those codes could be disaggregated into a lower level
of analysis (class). For computational reasons, ORBIS adds one or two zeros to
such group or division codes in order to create 4-digit classes in all records.
These classes are referred to as synthetic classes.
It was decided to deal with this problem by redistributing
the patents associated with synthetic classes among the classes within the
division or group, as applicable.
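The extract does not say which weights were used for this redistribution. The Python sketch below assumes a proportional split based on the patents already assigned directly to each real class, falling back to equal shares when none of them has any. The function names, the input layout (a dict of counts per 4-digit code plus a set of valid NACE classes) and the weighting rule are all assumptions for illustration.

```python
def parent_prefix(synthetic_code):
    """Division (2 digits) or group (3 digits) behind a synthetic class."""
    if synthetic_code.endswith("00"):
        return synthetic_code[:2]
    return synthetic_code[:3]


def redistribute(counts, valid_classes):
    """Spread counts of synthetic classes over the real classes they cover."""
    result = {code: float(n) for code, n in counts.items() if code in valid_classes}
    for code, n in counts.items():
        if code in valid_classes:
            continue                      # a real class, nothing to do
        prefix = parent_prefix(code)
        children = sorted(c for c in valid_classes if c.startswith(prefix))
        if not children:
            result[code] = float(n)       # nothing to redistribute to
            continue
        direct = {c: counts.get(c, 0) for c in children}
        total = sum(direct.values())
        for child in children:
            # Assumed weights: proportional to direct counts, else equal shares.
            share = n * direct[child] / total if total else n / len(children)
            result[child] = result.get(child, 0.0) + share
    return result
```

With counts of 30 and 10 for classes 2611 and 2612 and 8 patents on the synthetic class 2610, the 8 patents are split 6 to 2, so the total number of patents is preserved.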