Microsoft – Anish Desai, Lifang Yao, Sharad Bhardwaj, Oracle International Corp

Abstract for “Analysis using rules for documents”

One or more computers receive input indicating that multiple files are to be analyzed together, by performing one or more predetermined actions on the contents (e.g. strings of text) of one or more corresponding structures. Each file contains the names of the corresponding structures, which identify them. An application program uses the one or more structures to lay out the contents of the files. The one or more computers are programmed to automatically parse each file and identify the layout structures based on the presence of the corresponding names in each file. After parsing, the one or more computers perform the one or more predetermined actions to produce an output structure containing results based on the contents of each layout structure identified in each file.

Background for “Analysis using rules for documents”

Because documents containing text in a natural language are stored in the proprietary formats of common word-processing programs (such as WORD sold by MICROSOFT CORPORATION or WORDPERFECT sold by COREL CORPORATION), such documents typically have to be reviewed manually. Reviewing a group of documents typically involves a human opening each one in turn in a word-processing program and reading the text until finding the text of interest. The human then analyzes the text of interest, e.g. by manually counting the rows of a particular table in the document. Manual review of large numbers of documents is tedious, time-consuming, and inefficient.

Several methods have been used to extract the contents of word-processing documents. These solutions work only with documents that have specific content, in a specific format, at specific positions. The prior art solutions known to the inventors cannot be used as a general solution across documents with different formats and content unless users rewrite the code. Moreover, these solutions mainly extract content from word-processing documents and cannot analyze the text to gather actionable intelligence.

The current inventors believe that an automated solution for analyzing documents of different types and with different content would greatly improve efficiency and accuracy for people and companies who use word-processing documents. According to the current inventors, there is a need to automatically analyze multiple documents for a set or subset of user-defined content. An invention of the type described below can be used, for example, to limit an analysis to counting the rows of a certain table or to searching only a particular subsection of a section.

According to the invention, one or more computers are programmed to receive input (e.g. a command from a user) that indicates word-processing documents in electronic form which are to be analyzed together. In some embodiments, multiple word-processing documents are analyzed in response to a single command from a user, which may identify, for instance, a directory name or a portion of file names. Based on the user input, the multiple word-processing documents are analyzed by performing one or more predetermined actions on document contents (e.g. strings of text) that are arranged in a structure which a rule associates with the predetermined action. The action is performed for each of the multiple word-processing documents in which such a structure satisfies a condition in that rule.

Depending on the embodiment, one or more structures can be identified (for performing the associated one or more actions) by the presence of certain text in each word-processing document (e.g. a word or sequence of words that forms a name or other such identifier), arranged in a particular sequence relative to the structure (e.g. before the structure). The structure (also called a “layout structure”) is used by an application program to arrange text in each word-processing document, for display on a page or for printing on paper. A word-processing table is one example of a layout structure; word processors use it to display or print text in tabular form on a page. A word-processing section is another example; word processors use it to display or print text in a hierarchical arrangement of sections and subsections, indented relative to one another, on a page.

In multiple embodiments, different layout structures are created in a word-processing document manually, by a user operating a feature of an application program (e.g. a command in a word-processing program to insert a layout structure) and entering words of text. In the word-processing program, the user also inputs an identifier for the layout structure (e.g. a word or sequence of words that the user chooses as a table name, section heading, or the like in the word-processing document). Depending on the embodiment, the identifier may be placed in the word-processing document before, after, or within a particular part of the layout structure. To later analyze the word-processing document, the user uses a layout structure’s identifier to create a condition in a rule. The user who creates the rule also specifies the action to be taken when the layout structure identified in the rule’s condition is found in a document. The analysis is performed by one or more computers programmed with software (the “document analyzer”) in accordance with the invention.

In some embodiments, a word-processing document created in accordance with the previous paragraph is used later as a template. One or more users copy the template and manually edit the text in each copy to obtain additional word-processing documents (also called “standardized documents”). One or more of these documents can then be analyzed in accordance with the invention, using one or more of the above-described rules, on one or more computers programmed with the document analyzer.

In alternative embodiments of the invention, multiple word-processing documents are created by one or more users without the use of a template. Instead, the users manually operate features of the application program to insert layout structures, as well as their identifiers, into each word-processing document. In these alternative embodiments too, the identifiers of layout structures are matched against conditions in rules of the type described above by one or more computers programmed with the document analyzer.

One or more computers programmed with the document analyzer apply one or more rules of the type discussed above to search every document, as follows. In some embodiments, the computer(s) remove images from each document and convert each document into a markup language. Then, within each document, the computer(s) identify each layout structure that satisfies a condition in a rule. A layout structure that satisfies a condition can be identified in several ways; for example, each document is checked for the presence of a layout structure identifier that is specified in the rule’s condition.
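The identification step can be illustrated with a minimal sketch. It assumes, purely for illustration, that a document has already been converted to a simple XML-like markup in which an identifier is a `p` element and a word-processing table is a `table` element; the markup, the function name `find_tables_by_identifier`, and the sample data are hypothetical and are not taken from the source.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup produced by converting one word-processing document.
MARKUP = """<body>
  <p>Authors</p>
  <table><tr><td>Ann</td></tr><tr><td>Raj</td></tr></table>
  <p>Some unrelated paragraph.</p>
  <p>Messages</p>
  <table><tr><td>File not found</td></tr></table>
</body>"""


def find_tables_by_identifier(markup, identifier):
    """Return the rows of every table directly preceded by a paragraph
    whose text equals the given identifier."""
    children = list(ET.fromstring(markup))
    found = []
    # Walk adjacent pairs of elements: (possible identifier, possible table).
    for previous, current in zip(children, children[1:]):
        is_identifier = previous.tag == "p" and (previous.text or "").strip() == identifier
        if is_identifier and current.tag == "table":
            rows = [[(cell.text or "").strip() for cell in row.iter("td")]
                    for row in current.iter("tr")]
            found.append(rows)
    return found


authors_tables = find_tables_by_identifier(MARKUP, "Authors")
reviewers_tables = find_tables_by_identifier(MARKUP, "Reviewers")
```

Because the lookup depends only on the identifier and its order relative to the table, it does not need to know where on a page the table appears, and it returns every copy of the table if the identifier occurs more than once.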

In response to any match (between an identifier in a document and the condition in a rule), the action associated with that rule is performed using the words of text in the layout structure identified by the identifier. Each action is performed using the layout structure found in each document, and the output of all such actions on the multiple documents is combined. The combined results are stored in a non-transitory storage accessible to the one or more computers, for future use, e.g. to be further processed, displayed, and/or printed.

In accordance with the invention, a processor 120 in a computer 100 is programmed with software instructions 134 to perform the method illustrated in FIG. 1A, e.g. to receive input in act 101A. The input may be provided, for example, at a client computer 184 (FIG. 1B) by a human user 183 using an input device such as a keyboard or mouse (not shown). Client computer 184 transmits the user input via a wired or wireless link 151 to server computer 100. Once received, the input is stored in memory 130 in the usual manner.

The user input received in act 101A may identify, for instance, a directory on hard disk 140, e.g. in the form of a URL (Uniform Resource Locator). All documents in the identified directory make up a group 115X of documents that are automatically parsed by processor 120 (executing the document analyzer) in act 102 (FIG. 1A), which can be repeated as per loop 102L (for each layout structure). Alternatively or additionally, the user input received in act 101A may specify file names to be searched for in multiple subdirectories of the identified directory. User 183 may specify the file names in the usual way, e.g. by using a search term with a wildcard.
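As a rough sketch of how a directory plus a wildcard search term could select the group of documents, the following uses Python's standard `pathlib`. The function name `select_documents`, the file names, and the `.docx` extension are illustrative assumptions; the source does not specify how the file selection is implemented.

```python
import tempfile
from pathlib import Path


def select_documents(directory, pattern):
    """Return all files under `directory` (including subdirectories)
    whose names match the wildcard `pattern`, in sorted order."""
    return sorted(p for p in Path(directory).rglob(pattern) if p.is_file())


# Demonstration on a throwaway directory tree with hypothetical file names.
with tempfile.TemporaryDirectory() as root:
    (Path(root) / "sub").mkdir()
    for name in ("design_a.docx", "sub/design_b.docx", "sub/manual_c.docx"):
        (Path(root) / name).write_text("placeholder")
    selected = [p.name for p in select_documents(root, "design_*.docx")]
```

Here `selected` holds the two "design" documents, one from the top-level directory and one from the subdirectory, while the "manual" document is left out of the group.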

Moreover, computer 100 also receives two additional user inputs in act 101B. One input indicates a condition on at least a portion of a name or other identifier (e.g. ID-J in FIG. 1B) that is used to identify a structure J created by a word processor to lay out text on a page to be printed on paper or displayed on a screen. One example of structure J is a table, which displays or prints text in tabular form. Another example of structure J is a section, which displays or prints text in a hierarchy. Structure J, also called a “layout structure”, is identified by the name ID-J, which is located adjacent to it in a predetermined sequence relative to the structure (e.g. before it). Identifying a layout structure J by its name or other identifier ID-J eliminates the need, found in the prior art, to indicate a position on a page. If the name or identifier ID-J is found in the predetermined sequence, the layout structure satisfies the condition, and the text within the layout structure is used to perform an action indicated by the other input from the user. Computer 100 internally associates these two user inputs (i.e. the user input on a condition and the user input on an action) to create a rule.
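One way to picture a rule as the pairing of a condition with an action is the sketch below. The class `Rule`, the helper functions, and the choice of case-insensitive substring matching for "at least a portion of a name" are all assumptions made for illustration; the source does not prescribe a data structure or a matching method.

```python
from dataclasses import dataclass
from typing import Callable, List

Rows = List[List[str]]  # a table's contents: rows of cell text


@dataclass(frozen=True)
class Rule:
    condition: str                    # identifier, or a portion of it
    action: Callable[[Rows], object]  # what to do with the structure's text


def extract_text(rows):
    """Action: copy the text of each row."""
    return [" ".join(cells) for cells in rows]


def count_rows(rows):
    """Action: count the rows of the structure."""
    return len(rows)


def apply_rule(rule, identifier, rows):
    """Perform the rule's action if its condition matches the identifier.
    Assumption: a condition matches when it is a case-insensitive
    substring of the identifier."""
    if rule.condition.lower() in identifier.lower():
        return rule.action(rows)
    return None


messages_rule = Rule(condition="Messages", action=count_rows)
matched = apply_rule(messages_rule, "Messages", [["Disk is full"], ["Timeout"]])
unmatched = apply_rule(messages_rule, "Authors", [["Ann"]])
```

Keeping the condition and the action as two separate fields mirrors the two user inputs of act 101B, so either one can be changed without touching the other.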

Note that act 101B can be performed in a loop (see branch 101L in FIG. 1A) as many times as necessary, e.g. once for each layout structure in a word-processing document that is to be analyzed. Note also that acts 101A and 101B can be performed independently of each other and in any order.

In some embodiments, once inputs have been received in acts 101A and 101B, a document 112I is searched in act 102 (FIG. 1A) to identify every layout structure that a rule associates with an action. Computer 100 uses the predetermined identifier (e.g. ID-J in FIG. 1B) to identify structure J as being present in document 112I. In certain embodiments, identifier ID-J and the corresponding structure J must be present in a particular sequence relative to each other (e.g. ID-J before structure J) in document 112I. The documents 112I . . . 112N searched in act 102 were all generated using a common template 131X (see FIG. 1B) by one or more human users 181A-181N (e.g. users who report to user 183 within an organization). These users provide input (e.g. via keyboards) in the form of text (and optionally graphics) for insertion into templates 131X-131Z, using word-processing software (also called a “word processor”) that interfaces directly or indirectly with the one or more respective computers 182A-182N.

During document creation, input from each user 181I is used by a computer 182I programmed with the word processor to replace (i.e. overwrite) default sample text (or blanks) in a local copy of template 131X, creating a customized copy. The customized copy is saved to hard disk 140 (or another non-transitory storage) of server 100 as document 112I. A document 112I created in this way, by modifying a copy of template 131X, is also called a standardized document.

As shown in FIG. 1B, many different standardized documents 112A . . . 112I . . . 112N can be generated from the same template 131X. A subset of these standardized documents, 112I . . . 112N, forms group 115X. Template 131X contains several structures B . . . J . . . M, which are identified by respective identifiers ID-B, ID-J and ID-M. The structures (also called “layout structures”) B . . . J . . . M are contained in template 131X, e.g. in binary form (as originally created in template 131X by word-processing software such as WORD from Microsoft Corporation). Structures B . . . J . . . M and their identifiers ID-B, ID-J and ID-M are retained in a new document 112I, which user 181I creates by copying template 131X and editing the copy to input text into any of structures B . . . J . . . M.

Accordingly, template 131X contains a number of identifiers ID-B, ID-J and ID-M in a predetermined sequence relative to the corresponding layout structures B . . . J . . . M. Depending on the embodiment, identifiers ID-B, ID-J and ID-M can be either (a) pre-existing in template 131X, as text previously provided by user 183 during template creation, or (b) intentionally added to template 131X (manually or automatically) to aid identification of the corresponding layout structures B . . . J . . . M during the parsing of documents 112I . . . 112N in act 102 (FIG. 1A).

Note that once a document 112I is created, user 181I can make any number of changes to it, including duplicating (by copying and pasting) one or more of structures B . . . J . . . M and their identifiers ID-B, ID-J and ID-M, as well as creating new layout structures. Document analyzer 134 can analyze document 112I even if multiple copies of structures B . . . J . . . M and identifiers ID-B, ID-J and ID-M are present in document 112I. This is because document analyzer 134 is not hardcoded with the physical dimensions and/or positions of structures B . . . J . . . M on a page; instead, document analyzer 134 uses rules (expressed in a position-independent form) of the type described below.

In some embodiments, each identifier ID-J is manually entered by user 183 during creation of template 131X and is placed therein in sequence immediately before (i.e. preceding) the corresponding layout structure J. Layout structure J can be, for example, a section with a hierarchy of subsections, in which case identifier ID-J is inserted into template 131X as the top-most heading of section J. Space (i.e. blank or other characters) may be permitted between a layout structure J and its identifier ID-J, depending on the implementation. As another example, layout structure J can be a table with cells (also called “tabular cells”), in which case the user inserts a name for the table, as identifier ID-J, right before table J. Whenever a standardized document 112I is created from template 131X, layout structures B . . . J . . . M and their identifiers ID-B, ID-J and ID-M in template 131X are copied into the standardized document 112I, which is then customized by user 181I.

During normal operation, additional documents (not shown in FIG. 1B) can also be generated by human users 181A-181N using other templates 131Y or 131Z. These additional documents are also standardized documents and are stored on hard disk 140. User 183 can identify the additional documents generated using templates 131Y or 131Z by grouping them into other groups (not shown in FIG. 1B); however, none of these additional documents is to be identified in group 115X, which identifies only the standardized documents 112I . . . 112N based on template 131X. If, because of a user error in the input via link 151, a document identified in group 115X was not created using template 131X, computer 100 generates an error message and stores it in a non-transitory storage of computer 100. The error message can also be transmitted to computer 184 and displayed to user 183.

In summary, a layout structure J is identified in each document 112I of the user-identified group 115X, as per act 102. The contents of each layout structure J are then used to perform an action associated with structure J, as per act 103. The action can be performed multiple times (per loop 103L), on multiple copies of structure J in each document 112I . . . 112N. The results of the action performed on the multiple copies of structure J are then stored, as per act 104 (FIG. 1A), which can be repeated as per loop 104L for each layout structure J. Document analyzer 134 can loop over each document (as per act 105) to perform acts 102-104 multiple times. The results that document analyzer 134 generates for each document are stored together for each structure J (e.g. as collection 135J or as statistics 136J), even though the contents of each structure are taken from multiple documents 112I . . . 112N. In some embodiments, actions associated with different structures B . . . J . . . M create respective collections 135B . . . 135J . . . 135M.
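The nested loops summarized above can be sketched as follows. This is a simplified illustration under stated assumptions: each document is represented as an already-parsed list of `(identifier, rows)` pairs, and rules are a dictionary from identifier to a list of action functions. The names `analyze`, `count_rows`, and the sample data are hypothetical.

```python
from collections import defaultdict


def count_rows(rows):
    """Example action: count the rows of a layout structure."""
    return len(rows)


def analyze(documents, rules):
    """Apply each rule's actions to every matching structure in every
    document and gather the results per structure identifier."""
    collections = defaultdict(list)
    for doc_name, structures in documents.items():    # loop over documents (act 105)
        for identifier, rows in structures:            # each structure found, copies included (act 102)
            for action in rules.get(identifier, []):   # perform the associated action (act 103)
                collections[identifier].append((doc_name, action(rows)))  # store the result (act 104)
    return dict(collections)


# Hypothetical parsed documents; the first contains two copies of the table.
documents = {
    "112I": [("Messages", [["Disk is full"], ["Timeout"]]),
             ("Messages", [["Access denied"]])],
    "112N": [("Messages", [["File not found"]]),
             ("Authors", [["Ann"]])],
}
rules = {"Messages": [count_rows]}
results = analyze(documents, rules)
```

The results are keyed by structure rather than by document, which matches the idea of one collection per structure drawn from many documents. Structures with no rule, such as "Authors" here, are simply skipped.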

In some embodiments, the one or more actions associated with a structure J are specified in the form of rules in a rules file 133X (FIG. 1B), which is an input to processor 120 executing software 134 (also called the document analyzer) in computer 100. Rules file 133X can be created in many ways, e.g. manually, automatically, or by a combination of both. In some embodiments, rules file 133X is generated by invoking a rules generator 132. Rules generator 132 is invoked with the file name of a template 131X, which user 183 specifies via computer 184 over a wired or wireless link 152. Rules generator 132 automatically parses the template 131X named on link 152 to identify all layout structures B . . . J . . . M by their respective identifiers ID-B, ID-J and ID-M. The identifiers are supplied to computer 184 via a link 153 and displayed to user 183.

User 183 then identifies to computer 184 a specific action that is to be performed on a particular layout structure by document analyzer 134. An illustrative embodiment displays a drop-down list of the actions that are supported by (i.e. can be performed by) document analyzer 134, e.g. copying the text of the layout structure or counting the words in it.

Computer 184 responds by forming an association between the user-selected action and a specific identifier ID-J. Computer 184 transmits each specific identifier ID-J of a layout structure J to computer 100 via a link 154. In some embodiments, computer 184 merely displays a web page (e.g. in HTML) that is sent from computer 100, and forwards to computer 100 any input it receives from user 183. In such embodiments, computer 100 forms the above-described association.

Based on the input from user 183, rules generator 132 creates rules file 133X by writing each identifier ID-J, or a part thereof, in a condition along with its associated action. Each identifier ID-J in a condition, together with the associated action, makes up a rule. Rules file 133X may therefore contain as many rules as there are structures B . . . J . . . M in template 131X. However, user 183 may input more or fewer rules into file 133X than the number of layout structures in the template, because user 183 may identify multiple actions for one condition. For example, user 183 could identify more than one action for a layout structure J, while identifying no action for another layout structure M.
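A rules generator of this kind might be sketched as below. The source does not state the format of the rules file, so JSON is used here only as an assumed, convenient serialization; the function name `generate_rules`, the action names, and the sample selections are likewise hypothetical.

```python
import json


def generate_rules(identifiers, selections):
    """Build one rule (condition + action) for every action the user
    selected for each layout-structure identifier found in a template."""
    rules = []
    for identifier in identifiers:
        # An identifier may have several actions, or none at all.
        for action in selections.get(identifier, []):
            rules.append({"condition": identifier, "action": action})
    return rules


# Identifiers as they might be parsed from a template (hypothetical).
identifiers = ["Authors", "Messages", "Reviewers"]
# Actions the user picked per identifier; "Reviewers" has no action.
selections = {
    "Messages": ["extract_text", "count_words"],
    "Authors": ["extract_text"],
}
rules = generate_rules(identifiers, selections)
rules_file_text = json.dumps(rules, indent=2)  # assumed on-disk format
```

In this example three structures yield three rules, but only by coincidence: "Messages" contributes two rules and "Reviewers" contributes none, which shows why the rule count need not equal the number of layout structures.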

Additionally, depending on the embodiment, rules from two rules files 133X and 133Q can be copied by the user into a common rules file 133R (FIG. 1B). The user then uses document analyzer 134 to analyze similar or identical layout structures in different word-processing documents 112A, 112Q, 112I . . . 112N. For example, a table named “document metadata” may be used in two types of word-processing documents, functional design documents and user manuals, which are created using two different templates 131X and 131Y respectively. By using a common rules file 133R containing a single rule whose condition identifies the “document metadata” table, the information in that table can be extracted from both types of word-processing documents. Hence, the number of rules in any given rules file 133X need not equal the number of layout structures in any particular template 131X.

User 183 may manually view and modify rules file 133X, if needed, to ensure that the appropriate actions are associated with each layout structure J. For example, user 183 may modify, via a link 155, a previously identified action associated with ID-J in rules file 133X. A condition in rules file 133X need not contain the complete identifier or name (e.g. “document metadata”, as noted in the previous paragraph) of the layout structure that is to trigger an action. Instead, a wildcard such as “*” or “%” can be used in an identifier with partial information, such as a part of a layout structure’s name (e.g. “document meta*”). When processor 120 executes document analyzer 134 to analyze word-processing documents 112I . . . 112N generated using template 131X, it uses rules file 133X to determine the action to take when the condition of a rule matches a layout structure within a document 112I. By choosing an action in rules file 133X before document analyzer 134 is executed by processor 120, user 183 determines the type of data that will be collected from each structure J, e.g. to be stored in an RDBMS table or displayed on a computer monitor (such as a cathode-ray tube).
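Wildcard conditions of this kind can be approximated with Python's standard `fnmatch` module. In this sketch the function name `condition_matches`, the case-insensitive comparison, and the treatment of "%" as a synonym for "*" are assumptions; note also that `fnmatch` gives special meaning to "?" and "[...]", which the source does not mention.

```python
from fnmatch import fnmatchcase


def condition_matches(condition, identifier):
    """Return True if a rule condition, which may contain the wildcard
    "*" or "%", matches a layout-structure identifier.
    Assumptions: matching ignores case, and "%" behaves like "*"."""
    pattern = condition.replace("%", "*").lower()
    return fnmatchcase(identifier.lower(), pattern)


full_name = condition_matches("document metadata", "Document Metadata")
star_wildcard = condition_matches("document meta*", "Document Metadata")
percent_wildcard = condition_matches("document meta%", "Document Metadata")
no_match = condition_matches("Messages", "Document Metadata")
```

With this approach a single rule such as "document meta*" can trigger on tables whose names differ slightly between templates, without listing every variant.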

The output of processor 120 executing document analyzer 134, such as collection 135J or statistics 136J, may take different forms depending on the embodiment: e.g. a web page 191 viewable in a browser, a spreadsheet 192 viewable in a spreadsheet program, or a relational database 138 accessed via a relational database management system (RDBMS) 1905 such as ORACLE DATABASE 11gR1, available from ORACLE CORPORATION. Web page 191, spreadsheet 192, and relational database 138 are each stored as files in a file system 190, on a hard drive or any other non-transitory computer-readable storage medium.

The output of processor 120 can be stored in an RDBMS table (such as table 138) and further processed using queries in structured query language (SQL) to generate reports, which are displayed on a web page to user 183. In some embodiments, document analyzer 134 is invoked by user 183 providing a location (e.g. a URL) of the document repository (where the word-processing documents of group 115X are located) and selecting a rules file 133X. Rules file 133X instructs document analyzer 134 on which layout structures to search for in documents 112I-112N (see group 115X in FIG. 1B), as well as which text and statistics to collect from each layout structure recognized in the documents.

Depending on the embodiment, there may be any number of rules files 133X . . . 133Q . . . 133R (see FIG. 1B), unrelated to the number of templates 131X-131Z. For example, one user may be interested in certain layout structures of template X, while another user (not shown) may be interested in other layout structures of the same template X. The two users therefore create two different rules files from the same template.

When user 183 selects an action that requires processor 120, executing document analyzer 134, to use relational database 138, user 183 also creates the necessary tables via link 156, e.g. an RDBMS table J, shown as item 138J in FIG. 1B. User 183 also updates in computer 100, via link 157, a property file 139 to create an association between RDBMS table J and the corresponding layout structure J (identified by its ID-J), whose data will be written into RDBMS table J by document analyzer 134. In several embodiments, property file 139 specifies the relational database tables that store the results of rules applied to corresponding layout structures. Property file 139 can also hold additional information and other processing logic, such as 1) identification and configuration of environment information, 2) which rules files apply to which templates, and 3) identification and modification of existing rules files. The identifier ID-J used herein to identify a layout structure could be a TABLE_NAME or a SECTION_NAME, depending on whether the identified layout structure is a table or a section in a word-processing document.
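To make the role of the property file concrete, here is a minimal sketch that reads associations between layout-structure identifiers and RDBMS table names. The `key = value` syntax, the `TABLE_NAME.` and `SECTION_NAME.` key prefixes, and the database table names are all assumptions invented for this example; the source names TABLE_NAME and SECTION_NAME but does not show the file's format.

```python
# Hypothetical property file contents (format assumed, not from the source).
PROPERTY_TEXT = """
# layout-structure identifier -> RDBMS table that receives its data
TABLE_NAME.Messages = DOC_MESSAGES
TABLE_NAME.Authors = DOC_AUTHORS
SECTION_NAME.Overview = DOC_OVERVIEW
"""


def parse_properties(text):
    """Map (kind, identifier) pairs to RDBMS table names, where kind is
    TABLE_NAME or SECTION_NAME."""
    mapping = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, _, value = line.partition("=")
        kind, _, identifier = key.strip().partition(".")
        mapping[(kind, identifier)] = value.strip()
    return mapping


associations = parse_properties(PROPERTY_TEXT)
messages_table = associations[("TABLE_NAME", "Messages")]
```

Adding support for a new layout structure then amounts to adding one line to this file (plus one rule), rather than changing program code.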

Supplying rules files 133X, 133Q and 133R and property file 139 as inputs to document analyzer 134 (FIG. 1B) allows user 183 to obtain various types of data from the word-processing documents (of group 115X) simply by configuring files 133X, 133Q, 133R and 139. User 183 can modify the behavior of document analyzer 134 by changing the action or layout structure in rules files 133X, 133Q and 133R, and/or the tables in property file 139. This eliminates the need to write software code. If a new layout structure is to be analyzed, user 183 creates a new RDBMS table in database 138 and adds to property file 139 an association between that table and the new layout structure’s identifier (e.g. a TABLE_NAME or SECTION_NAME), to create a revised property file. User 183 also adds to a rules file an association between an action (e.g. extract data) and the new layout structure’s identifier, to create a revised rules file. Document analyzer 134 is then executed by processor 120 with the revised property file and the revised rules file as inputs. The configuration changes described in this paragraph are simple enough to be performed by user 183 and, according to the current inventors, are at least an order of magnitude faster than manually altering software source code in a prior art document analyzer.

Note that document analyzer 134 can be executed by processor 120 even if no data is to be stored in relational database 138; e.g. document analyzer 134 can provide its output as a web page 191 (e.g. in HTML) and/or as a spreadsheet file 192 (e.g. in comma-separated values (CSV) format) that can be opened with the software program Excel, available from MICROSOFT CORPORATION. User 183 can import the results from spreadsheet file 192 into a relational database and then prepare reports by running SQL (structured query language) queries.
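The path from CSV output to SQL reporting can be sketched with Python's standard `csv` and `sqlite3` modules. SQLite stands in here for the RDBMS purely for illustration, and the column names, file names, and sample rows are hypothetical.

```python
import csv
import io
import sqlite3

# Hypothetical analyzer results: (document, structure, extracted text).
results = [
    ("a.docx", "Messages", "Disk is full"),
    ("a.docx", "Messages", "Timeout"),
    ("b.docx", "Messages", "File not found"),
]


def to_csv(rows):
    """Serialize result rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["document", "structure", "text"])
    writer.writerows(rows)
    return buffer.getvalue()


def count_per_document(csv_text):
    """Import the CSV into an in-memory database and report how many
    rows were extracted from each document."""
    reader = csv.reader(io.StringIO(csv_text))
    next(reader)  # skip the header line
    connection = sqlite3.connect(":memory:")
    connection.execute("CREATE TABLE results (document TEXT, structure TEXT, text TEXT)")
    connection.executemany("INSERT INTO results VALUES (?, ?, ?)", reader)
    counts = connection.execute(
        "SELECT document, COUNT(*) FROM results GROUP BY document ORDER BY document"
    ).fetchall()
    connection.close()
    return counts


csv_text = to_csv(results)
report = count_per_document(csv_text)
```

The same CSV text could be opened directly in a spreadsheet program; the database import is only needed when reports are to be produced with SQL.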

Note that links 152, 153, 154 and 155 may differ from one another, and from link 151 (discussed previously), depending on the embodiment.

In an illustrative example, documents 112A-112N (FIG. 1B) are word-processing documents and layout structure J (FIG. 1B) is a word-processing table in template 131X. Each row of table J contains a message as text in a natural language. In this example, table J in word-processing document 112I contains q rows of messages, while table J in another word-processing document 112N has only s rows. Each table J is identified by a predetermined identifier, such as “Messages”, in both word-processing documents 112I and 112N. When word-processing document 112I is analyzed using document analyzer 134, processor 120 locates the word “Messages” immediately before a table and copies the q messages from this table into collection 135J.

Similarly, when word-processing document 112N is analyzed, processor 120 again finds the word “Messages” and copies the s messages from the table immediately following the word. Processor 120 then adds the s messages to the q messages previously copied into collection 135J. If a predetermined identifier in the rules file (e.g. the name “Messages” in this example) is not found in a document, processor 120 forms an association in memory between that document and the predetermined identifier, to be used in an error message stating that the document does not contain the predetermined identifier.

Accordingly, after all word-processing documents 112I . . . 112N have been analyzed, collection 135J includes the q+s messages extracted from table J in the multiple word-processing documents of group 115X. Statistics 136J for table J are a set of q+s counts, with each count representing the number of words in a message. When processor 120 executes document analyzer 134, the outputs (the q+s messages or the q+s counts) can be stored in non-transitory memory and sent to client computer 184 in the form of a web page, a spreadsheet, or an RDBMS table. User 183 may then use the information in the usual way.
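The q+s example can be worked through in a few lines. The message texts below are invented for illustration, with q = 3 rows in one document and s = 2 rows in the other; the variable names are likewise hypothetical.

```python
# Rows of the "Messages" table as extracted from two documents (sample data).
messages_112I = ["Disk is full", "File not found", "Access denied"]  # q = 3
messages_112N = ["Invalid user name", "Timeout"]                      # s = 2

# The collection accumulates messages document by document.
collection = []
for messages in (messages_112I, messages_112N):
    collection.extend(messages)

# The statistics hold one word count per collected message.
statistics = [len(message.split()) for message in collection]
```

The collection ends up with q+s = 5 messages and the statistics with 5 word counts, one per message, regardless of how the rows were split between the two documents.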

In the above-described illustration, there are two additional layout structures, B and M, in word-processing documents 112A-112N. These are two word-processing tables (from template 131X) that contain names of persons. The persons in table B are authors, and the word “Authors” is used as ID-B to identify table B in template 131X. The persons in table M are reviewers, and the word “Reviewers” is used as ID-M to identify table M in template 131X.

Hence, when analyzing word-processing document 112I, on finding the word “Authors”, processor 120 (while executing document analyzer 134) determines the immediately following table to be table B and copies the author names from table B into collection 135B. Similarly, on finding the word “Reviewers” in document 112I, processor 120 determines the immediately following table to be table M and copies the reviewer names from table M into collection 135M. As each word-processing document 112I is analyzed, each of collections 135B, 135J and 135M is built up incrementally. Collections 135B, 135J and 135M are complete once the last word-processing document 112N of group 115X has been analyzed.

Each collection 135B, 135J and 135M organizes user-input content, which is stored in structured form in word-processing documents 112I . . . 112N of group 115X, in a manner that is easy for user 183 to review. User 183 can, for example, review collection 135J of messages to determine whether any four-letter words are present, by opening collection 135J in a browser and using the browser’s “search” function. Alternatively, this check can be automated by selecting an appropriate action in rules file 133X. User 183 can also manually check message collection 135J to ensure that it conforms to natural-language grammar. User 183 can therefore have processor 120 execute document analyzer 134 to validate the content and quality of word-processing documents 112I-112N (see group 115X in FIG. 1B).

User 183 can also obtain a list of the authors of word-processing documents 112I-112N by screening duplicates out of collection 135B, and similarly a list of reviewers by screening duplicates out of collection 135M. The number of times a name is repeated indicates how much work that person has contributed. The level of completion of documents 112I . . . 112N is also indicated by the number of rows in collection 135J that contain default sample text (or are empty). For example, computer 100 can determine that half the rows of a table in one word-processing document are empty or contain default text, in contrast to another word-processing document in which all rows contain default text (or blanks). In some embodiments, processor 120 counts the number of rows that are empty or contain default text, in response to user 183 specifying this count (via link 152) in the action associated with word-processing table J. Document analyzer 134 thus allows user 183 to efficiently assess the completeness of word-processing documents 112I-112N (see group 115X in FIG. 1B).
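Both ideas, screening duplicate names and gauging completeness from unfilled rows, can be sketched with a few lines of Python. The names, the default-text placeholder `<enter message>`, and the function `completeness` are hypothetical choices for this illustration.

```python
from collections import Counter

# Author names collected across several documents (sample data).
authors_collection = ["Ann", "Raj", "Ann", "Mei", "Ann"]

# Counting occurrences both removes duplicates and shows contribution levels.
author_counts = Counter(authors_collection)
unique_authors = sorted(author_counts)


def completeness(rows, default_text="<enter message>"):
    """Return the fraction of rows that were actually filled in, i.e. are
    neither empty nor still equal to the template's default sample text."""
    if not rows:
        return 0.0
    filled = sum(1 for row in rows if row.strip() and row.strip() != default_text)
    return filled / len(rows)


half_done = completeness(["Disk is full", "", "<enter message>", "Timeout"])
untouched = completeness(["<enter message>", ""])
```

A document whose table is half filled scores 0.5, while one still holding only default text or blanks scores 0.0, giving a quick per-document measure of completion.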

“Other embodiments automatically count how many times an individual is identified as an author, by specifying such counting in an action associated with the respective table B. Other embodiments count the number of times a person is identified as a reviewer, by specifying this counting in an action associated with the respective table M. These counts are statistics 136B ... 136M, which user 183 can use as actionable intelligence (e.g. to set bonuses). User 183 can thus use document analyzer 134 to examine the qualitative and quantitative characteristics of word-processing documents 112I-112N, at the level of each document or in aggregate. This allows statistics to be created across multiple word-processing documents at different levels of the hierarchy of a software product and within a line (or series) of software products. Document analyzer 134 can produce various statistics by analyzing word-processing documents across product families, lines, and products in a way that was previously impossible. Without document analyzer 134, significant quality improvements are not possible due to the limitations of existing document editors. A user 183 can now interpret and use word-processing documents 112I-112N, using document analyzer 134, in a way that was not possible before.

One example of software used to create and edit word-processing documents 112A-112N and templates 131X-131Z is MICROSOFT OFFICE XP, sold by Microsoft Corporation, which includes the following components: word-processing software called “Word 2002”, spreadsheet software called “Excel 2002”, and slide-presentation software. This example of software is an application program that has multiple components. It is normally installed on each computer 182I and executed independently by that computer.

“Another example of software to create word-processing documents 112A-112N or templates 131X-131Z is installed in and executed from a central server (e.g. computer 100) and is made available to all computers 182A-182N as a service (i.e. software as a service, or SaaS). This word-processing software is accessible via a browser on computers 182A-182N. One example is the Google Docs office suite offered by Google, Inc., which supports browsers on computers 182A-182N, 184 and 205 in accessing an online word-processing service, and which also includes a spreadsheet service and a slide-presentation service.

“In some embodiments, template 131X and word-processing documents 112A-112N are all created using the same word processor (also called word-processing software). This word processor includes a justification feature that allows left, center, or right justification, character formatting features (bold, underline, and italic formats), spell-checking and grammar-checking features, a word-counting feature, and a table insertion feature to insert a word-processing table. There is also an optional section insertion feature that can be used to insert a section. The word-processing software is a word processor of the kind used to prepare business documents in the usual manner, for example WORD from Microsoft Corporation. However, some word processors do not have publishing features like kerning or typesetting. In these embodiments, word-processing software excludes publishing software such as FRAMEMAKER and ACROBAT, both of which are sold separately by ADOBE SYSTEMS INCORPORATED.

Word 2002, sold by MICROSOFT CORPORATION, is one example of a word-processing program; WORDPERFECT is another. In these embodiments, the above-described insertion features, namely the section insertion feature and the table insertion feature, can use styles to ensure consistent formatting of words in a layout structure. Layout structure J, for example, is a word-processing table in which each row contains a message in a natural language (e.g. English). Table J is normally kept by the word-processing software in a proprietary binary format (e.g. as a MICROSOFT Word document) in a file named “documentA.doc”.

“Use of document analyzer 134 allows the messages in word-processing table J of “documentA.doc” to be extracted and stored in an RDBMS table 138J, e.g. in a relational database 138 implemented using the software Oracle Database 11g from Oracle Corporation. Names of authors in word-processing table B and names of reviewers in word-processing table M of “documentA.doc” are likewise extracted and stored in RDBMS tables 138B and 138M. The word-processing document named “documentI.doc” is then analyzed in the same way: each row of text in its tables is extracted and placed into the corresponding one of RDBMS tables 138B, 138J, and 138M. In some embodiments, database 138 also contains an RDBMS table 138Z that holds statistics; in an illustrative embodiment, this table is shared across all templates and layouts in computer 100. After all the word-processing documents of group 115X have been analyzed, the RDBMS tables of database 138 are analyzed using SQL queries to generate reports and/or create files in the usual manner (e.g. web pages), as would be apparent to the skilled artisan in view of this disclosure.”
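As a rough illustration of storing extracted rows in relational tables and then querying across documents, the sketch below uses Python's built-in `sqlite3` module with an in-memory database as a stand-in for the Oracle database named in the text. The table names, the two-column schema, and the sample values are assumptions made for this example only.

```python
import sqlite3


def store_rows(conn, table, file_name, values):
    """Insert one row per extracted text value, tagged with its source file."""
    conn.executemany(
        f'INSERT INTO "{table}" (file_name, value) VALUES (?, ?)',
        [(file_name, v) for v in values],
    )


# In-memory SQLite database standing in for relational database 138.
conn = sqlite3.connect(":memory:")
for table in ("authors_138b", "messages_138j", "reviewers_138m"):
    conn.execute(f'CREATE TABLE "{table}" (file_name TEXT, value TEXT)')

# Hypothetical values extracted from two documents.
store_rows(conn, "authors_138b", "documentA.doc", ["Anish"])
store_rows(conn, "authors_138b", "documentI.doc", ["Anish", "Author B"])
store_rows(conn, "messages_138j", "documentA.doc", ["File not found"])

# A cross-document SQL query: how many documents name each author.
author_counts = conn.execute(
    'SELECT value, COUNT(*) FROM "authors_138b" GROUP BY value ORDER BY value'
).fetchall()
```

Keeping the source file name alongside each value is what makes per-document and aggregate queries possible from the same table.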

“Hence, document analyzer 134 offers at least the following advantages in the example above: (1) automated analysis of different types of word-processing documents (such as product brochures, functional design documents, and user manuals) to determine content quality; (2) content extraction for design reviews to verify feature completeness; (3) comparison of content from different word-processing documents throughout the software development lifecycle (SDLC), for example, user 183 can use the computer to check that the “must have” features in the requirements document were implemented; (4) creation of a content repository for downstream uses (for example, building a repository of product use cases from functional design documents and inserting them into a quality center tool to automate testware creation, increasing accuracy and significantly decreasing cost); and (5) intelligence collection from different types of word-processing documents (for example, user 183 can use a computer to determine how many product use cases there are in a software suite, and how many features are within the scope of the current version of the software suite).

“Note: Although templates 131X-131Z are used in certain embodiments of the type illustrated in FIG. 1B, certain other embodiments of this invention do not use templates 131X-131Z, as shown in FIG. 1C. In several alternative embodiments (FIG. 1C), users 181A-181N create word-processing documents 117A-117N without using any templates. Word-processing documents 117A-117N can be prepared in a similar way to word-processing documents 112A-112N, and may even be identical to them, depending on how they are arranged. Word-processing documents 117A-117N are considered structured documents rather than standardized documents, because documents 117A-117N were not created as copies of any of templates 131X-131Z. Document analyzer 134, described with reference to FIG. 1B, is also referred to herein as a standardized documents analyzer (abbreviated SDA), and documents analyzer 141, described below with reference to FIG. 1C, is also referred to herein as a structured documents analyzer (also abbreviated SDA).

“Because templates are not used in FIG. 1C, user 183 creates rules file 133X without reference to a template. Rules file 133X in FIG. 1C is identical or similar to rules file 133X in FIG. 1B. To select the appropriate action to send via link 154 (FIG. 1C), user 183 retrieves documentation 142 via link 158 to computer 100. Documentation 142 includes the names of actions that are supported by documents analyzer 141 (FIG. 1C) and a description of each action (in natural human language). User 183 also identifies the association between each word-processing table and a corresponding RDBMS table in relational database 138, in a property file 139 shown in FIG. 1C, which is identical or similar to property file 139 described with reference to FIG. 1B.”

Structured documents analyzer 141 of FIG. 1C is identical or similar to standardized documents analyzer 134 of FIG. 1B, except where noted otherwise. In certain embodiments some information is stored in files (e.g. rules in rules file 133X and configuration in property file 139), but in other embodiments the above-described information is stored in tables in a relational database (e.g. rules in one table and configuration in another table, both accessible via an RDBMS).

“In view of the above description of FIGS. 1A-1C, it will be apparent to the skilled artisan that documents analyzers 134 and 141 according to the invention allow a user to quickly perform analysis of documents by simply preparing an appropriate configuration to operate SDA 134/141, rather than by writing new software. SDA 134/141, for example, eliminates the need to create macros in word-processing software (e.g. WORD sold by MICROSOFT CORPORATION) to open and process documents 112I-112N (see group 115X in FIG. 1B). Additionally, macros in word-processing software of the prior art typically record a position on a page where a particular action is to take place, followed by another position at which another action is to be taken, and so forth.

“Unlike prior-art macros that are position-based, many embodiments of SDA 134/141 according to the invention do not require any pre-recorded positions to perform their actions. Many embodiments of SDA 134/141 instead use a rules file that does not identify any text positions on a page, i.e. the rules file is expressed in a “position-independent” format as described below. Use of a rules file 133X in a “position-independent” format enables SDA 134/141 to operate without calculating, before taking any action, a position in the x-direction (the horizontal direction from the left margin of the page) or in the y-direction (the vertical direction from the top margin). SDA 134/141 is therefore a generic solution to document analysis: it performs user-specified actions independent of the positions of layout structures on a printed or displayed page. In addition, a property file 139 allows new word-processing layout structures to be mapped to RDBMS tables that the user has created in a relational database, without requiring the user to write any code.

“Furthermore, as discussed below, word-processing files with new file extensions (such as “.docy” and “.docz”) can be processed after a single modification to the list of file extensions used by SDA 134/141, which makes it more generic. SDA 134/141 can be used on any type of document (e.g. a functional design document and a user manual are two types of documents), which is an improvement over a prior art tool that only performs XML conversion of documents in the proprietary binary format of MICROSOFT Word. This prior art tool is typically hard-coded to work on only one type of document. To use it with a different type of document, the user must create a DTD, create an XML file, and code new structures. SDA 134/141, on the other hand, can handle any type of document by a simple change to rules file 133, which allows new layouts to be specified without modifying any of the software code of SDA 134/141.

SDA 134/141 also has the unique feature that documents 112I-112N are prepared in the usual way, using the most widely used word-processing software in the industry, MICROSOFT Word, which may allow absolute positioning of text or images on a page. Before SDA 134/141, the only way known to the inventors to analyze text in layout structures in documents in MICROSOFT Word format was to open the word-processing documents manually, one at a time. This required a human to take notes on each document and then compare the notes between documents, a process that needs human intelligence.

“In multiple embodiments, SDA 134/141 is programmed to support many different types of actions. These actions are performed when a rule from rules file 133X matches a layout structure in document 112I. Rules file 133X usually contains multiple rules. In some embodiments, each rule in rules file 133X is associated with only one action. A user can choose from 10 actions, such as extracting table data or checking for empty table fields. Each rule can also have many parameters, some mandatory and others optional, to allow flexibility in specifying any layout structure, context data, and so on. SDA 134/141 can perform three kinds of actions: (a) check whether the text within the layout structure matches a user-specified string; (b) check whether the text within the layout structure meets a user-specified condition such as accuracy or completeness; and (c) copy and store the text from the layout structure in a relational table, to be used in SQL queries across documents that are similar (e.g. documents that were all generated from a common template).

“Examples of such actions are now described with reference to a document 112I prepared using the template illustrated in FIG. 2A (which is described in greater detail in the following paragraphs). In these examples, the specific layout structure to be identified in word-processing document 112I is a word-processing table cell that has in its row a heading 211B with the string value “Author”, and that is located in a table identified by table identifier 213 with the string value “Document Metadata”. If such a cell is located, an example of the above-described (a) action is to identify documents in which “Anish” is named as an author. An example of the above-described (b) action is to check whether the cell is empty and, if it is, to log a message in computer memory. An example of the above-described (c) action is to copy and store the text string from this word-processing cell into a column called “Author” in an RDBMS table that also has a column called “File name” (containing the file name and extension of document 112I).

“Note that SDA 134/141 performs any of actions (a)-(c) only after a table 210 with the identifier “Document Metadata” has been located in document 112I, and a cell whose heading contains the user-specified string (in our example, “Author”) has been found in that table. If, for example, three rows of table 210 in document 112I all have the same user-specified heading (in this case “Author”), SDA 134/141 performs the above-described action on the cell in each of those rows, because each row has the heading “Author”. In this way, a document 112I identifying three authors is processed correctly. SDA 134/141 is designed (as discussed herein) to apply rules (specified in a rules file 133) that identify a user-specified layout structure by comparing sequences of text strings or tags, instead of by identifying a specific position on a page.
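A minimal sketch of this kind of position-independent matching is shown below. It assumes, purely for illustration, that a document has already been reduced to an ordered list of text and table blocks; the block representation, the function `find_cells`, and the sample values are hypothetical and are not taken from the source.

```python
def find_cells(blocks, table_name, cell_name):
    """Return the text of every cell whose row heading equals cell_name, in
    each table that is immediately preceded by the string table_name."""
    matches = []
    previous_text = None
    for kind, content in blocks:
        if kind == "text":
            previous_text = content.strip()
        elif kind == "table":
            if previous_text == table_name:
                matches.extend(
                    value for heading, value in content
                    if heading.strip() == cell_name
                )
            previous_text = None  # an identifier applies only to the next table
    return matches


# Hypothetical document: a sequence of blocks, with no page coordinates at all.
document = [
    ("text", "Introduction"),
    ("text", "Document Metadata"),
    ("table", [("Title", "Design Spec"), ("Author", "Anish"),
               ("Author", "Author B"), ("Author", "Author C")]),
    ("text", "Other Table"),
    ("table", [("Author", "ignored")]),
]

authors = find_cells(document, "Document Metadata", "Author")

# Moving the table (together with its identifier) elsewhere in the document
# does not change the result, because only the sequence of strings is compared.
moved = document[3:] + document[:3]
authors_after_move = find_cells(moved, "Document Metadata", "Author")
```

All three “Author” rows are captured, and the second table is skipped because its preceding string does not match the table identifier in the rule.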

According to the inventors, SDA 134/141 has two unique features: (1) SDA 134/141 allows users to process existing and new word-processing templates, and to create a repository of documents based on the processed templates, without code changes; and (2) SDA 134/141 allows users to dynamically capture data structures in a relational table, again without code modifications. Instead of code changes, SDA 134/141 requires only configuration changes by the user, which are much simpler than code changes.

“In many embodiments of the type described above, SDA 134/141 provides a unique end-to-end solution that enables users to unlock data and intelligence (previously accessible only at the level of an individual word-processing document) across several word-processing documents, to gain operational, procedural and process efficiencies. According to the inventors, such an end-to-end solution is not available at any software company (or any other company that uses functional design documents, product brochures or user manuals). To the current inventors’ knowledge, no one has previously been able to extract intelligence by analyzing a collection of documents instead of manually reviewing a single document. As discussed below, the current inventors recognized and overcame many challenges in creating such an end-to-end solution.

“Four examples of the challenges identified by the current inventors are as follows. C1: The ability to handle large files and complex content, including diagrams, while converting these documents from a proprietary format to a generic text format. C2: The ability to process documents with different structures and content without having to write new code; a solution that does not address this problem is not scalable and not acceptable for general use. C3: The ability to meet the requirements of multiple users with respect to the same structure and content of any document. Examples of types of documents include a functional design document, a requirements specification, a product brochure, and a user’s manual. One user may be interested in counting words within a cell of a table in a word-processing document, while another user may need to ensure that certain text is included within the same cell. It has been difficult to address the needs of different users with a generic solution (one that does not require changing code). C4: Complex analytics on the content of different types of word-processing documents using a single solution. Although some analysis can be done at the document level, more complicated analysis uses data from multiple types of documents. Storing different data structures from word-processing documents in an RDBMS database is a problem that, to the inventors’ knowledge, had not previously been solved.

The current inventors combined years of computer programming experience with many different technologies to create a generic solution. Some of the innovations used to address the above challenges are as follows. I1: Identify and remove images, and convert documents from the native document format into XML. I2: Allow users to specify the content or structure of interest in any document, as input to SDA 134/141, using a simple interface that does not require any technical knowledge. This mechanism allows users to specify the information they are interested in for the type of documents that they want to process. This capability is described in the following sections as a rules generator and a rules file. I3: Allow users to also specify the type of action (extract data, count words in a table column, check for default text, etc.) that they want to perform on a specific type of structure or content in a document. According to the current inventors, C2 and C3 have been the greatest obstacles that prevented previous attempts from succeeding; these challenges were solved by the current inventors with an innovation implemented as a rules generator and a rules file. I4: To address challenge C4, the current inventors created an interface that allows users to define, in a text format, a mapping between each word-processing document structure and the corresponding RDBMS structure (also called an RDBMS table), as a property file 139, which is input to SDA 134/141. Because the property file is saved in text format, the user can edit the mapping information with any text editor. SDA 134/141 interprets and consumes the mapping information in text format from property file 139, and performs operations to extract text from word-processing tables and store it in an RDBMS table, as described herein.
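To make the idea of a text-format mapping concrete, the following sketch parses a small property file that links word-processing table names to RDBMS table names. The `name = value` line format, the `#` comment convention, and the table names are assumptions for illustration; the source does not specify the actual syntax of property file 139.

```python
def parse_property_file(text):
    """Parse 'word-processing table = RDBMS table' lines into a dict.

    Blank lines and lines starting with '#' are ignored. This line format
    is a hypothetical stand-in for property file 139.
    """
    mapping = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, separator, value = line.partition("=")
        if not separator:
            raise ValueError(f"missing '=' in line: {line!r}")
        mapping[key.strip()] = value.strip()
    return mapping


PROPERTY_FILE = """
# word-processing table name = RDBMS table name
Document Metadata = DOC_METADATA
Error Messages = ERROR_MESSAGES
"""

mapping = parse_property_file(PROPERTY_FILE)
```

Because the mapping lives in plain text, supporting a new layout structure in this sketch means adding one line to the file rather than changing code.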

The following are further examples of challenges that the current inventors have recognized, and the solutions they have found. H1: Handle new templates. It is simpler to support only existing templates, which in the prior art is a “hardcoded” solution. Nobody knows in advance which layout structures, in what format and in what order, will be used in a new template. It is difficult to make SDA 134/141 intelligent enough to analyze any new template structure based on words, and to match any documents that are based on it. The current inventors propose using rules that are stored in rules files, created by a rules generator, and applied by an engine in SDA 134/141. H2: Handle any user documents that are based on templates. Even if a template is used, user documents may have different contents, and it is a challenge to extract the valuable content while reporting violations. The current inventors propose capturing exceptions for reporting; in all cases, processing continues until the final document is completed. H3: Handle multiple versions of the same template. A new version may refer to the same data elements as an older version by using a new table name or cell name, and table fields may be deleted, updated, or added. These changes must all be intelligently reflected in the database output repository and linked together. The current inventors propose using a property file to link new and old names; the database table definition must contain a superset of all columns in all templates.

Additional examples of such challenges and solutions are as follows. H4: Handle a collection of documents that includes documents not based on any template, documents that are not in sync with the selected rules and template, and documents that are largely based on a template but have contents that violate the template. The current inventors propose exception handling and reporting. SDA 134/141 filters out all documents not based on a user-selected template, processes only good documents, and generates results only from good content. It also reports exceptions at the system, document, and content levels. H5: Handle an arbitrary database repository. Nobody knows in advance the data elements that a new template will include, so it is difficult to analyze the documents and save the results to a database. The current inventors propose using a property file to link data elements from documents to tables in the database. H6: Handle large volumes of data in complex formats. A document repository can contain gigabytes of data and complex elements, such as large images and nested tables (see C1 and I1 above), while computer memory is limited. The current inventors have programmed SDA 134/141 to identify and remove all images prior to processing, which reduces memory usage, makes bulk processing possible, and produces clean results. SDA 134/141 can also identify complex structures and either process them or report them. In either case, SDA 134/141 continues to crawl the entire repository and generates a single result.

“FIG. 2A shows a screen of a word processor that has opened a template 131X. Template 131X contains a word-processing table 210 with two columns, 211 and 212. Column 211 contains a number of headings (also called “row headings”) 211A-211Z that are arranged vertically relative to each other. In this example, row headings 211A and 211Z are at the top and bottom of column 211, respectively. Column 212 contains sample text 212A at its top, in the first row adjacent to row heading 211A, and sample text 212Z at its bottom, in the last row adjacent to row heading 211Z. Word-processing table 210 is a vertical table because row headings 211A-211Z are arranged vertically in table 210, in a column separate from the contents of table 210.

“In this example, a string 213 (also called “table identifier”) appears immediately before word-processing table 210. Due to their relative locations, string 213 has a semantic relationship with word-processing table 210. This relationship is obvious to human users 181A-181N, but it is not apparent to the word-processing software. The relationship is that string 213, being located in physical proximity to table 210, is an identifier for word-processing table 210. Document 112A (FIG. 2B) is created by user 181A. A string 223, which immediately precedes table 220 in the sequence of text in document 112A, is identical to string 213, which identifies table 210 in template 131X. Row headings 221A-221Z in document 112A are kept by human users 181A-181N identical to the corresponding row headings 211A-211Z in template 131X. By overwriting sample text 212A-212Z of template 131X, text 222A-222Z is added to table 220 using input from human user 181A.

“Similarly, input from user 181I is used to insert text 232A-232Z in table 230 (see FIG. 2C) by replacing sample text 212A-212Z from template 131X. Row headings 231A-231Z and string 233 of document 112I are identical to the corresponding row headings and string in template 131X. Document 112N of FIG. 2D is likewise created using input from user 181N. The word-processing software shown in FIGS. 2A-2F and 2K is Word 2002, sold by MICROSOFT CORPORATION.

“FIG. 2E illustrates a parsing rule 250 that rules generator 132 prepares in rules file 133 for template 131X of FIG. 2A. Rules generator 132 includes in rules file 133 a text string 251 that is identical to text string 213, which happens to be adjacent to and immediately preceding table 210 in template 131X. In parsing rule 250, rules generator 132 identifies text string 251 as the table name for table 210. A row heading 211A found in template 131X (FIG. 2A) is identified by rules generator 132 as a name 252 (FIG. 2E) of a cell in table 210. Rules generator 132 also includes the additional row headings of table 210, as shown in FIG. 2E.”

“Rules generator 132 also includes in parsing rule 250 an orientation direction 259, in which cell headings are arranged in table 210. Orientation direction 259 can be HORIZONTAL, VERTICAL, or COMPLEX. As described below, tables in a document may also be COMPLEX_HORIZONTAL or COMPLEX_VERTICAL.

When orientation direction 259 of a table is HORIZONTAL, document analyzer 134 treats the text in the first row as headings. It then uses the heading text from the first row to process the text in the remaining rows, using one or more of the actions identified in parsing rule 250. Document analyzer 134 processes a table in an analogous way when orientation direction 259 of the table layout structure is VERTICAL, with the headings taken from the first column. If orientation direction 259 is COMPLEX, document analyzer 134 does not process the complex table as a whole. Document analyzer 134 instead breaks the table down into HORIZONTAL and VERTICAL sub-tables. These sub-tables are then processed in the same way as the normal HORIZONTAL or VERTICAL tables described at the beginning of this paragraph. Using attributes defined in computer memory, the data from these sub-tables is then assembled into the data of the complex table.
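The difference between the two simple orientations can be sketched as follows. The function `table_to_records`, the list-of-lists table representation, and the sample cells are assumptions made for illustration; this is not the analyzer's actual code, and COMPLEX tables are deliberately left out.

```python
def table_to_records(rows, orientation):
    """Turn a table (list of rows, each a list of cell strings) into records
    keyed by heading.

    HORIZONTAL: headings are in the first row, data in the remaining rows.
    VERTICAL:   headings are in the first column, data in the remaining columns.
    Note: if a heading repeats, later values overwrite earlier ones in this
    simplified sketch.
    """
    if orientation == "VERTICAL":
        # Transposing a vertical table yields an equivalent horizontal one.
        rows = [list(column) for column in zip(*rows)]
    elif orientation != "HORIZONTAL":
        raise ValueError(f"unsupported orientation: {orientation}")
    headings, *data_rows = rows
    return [dict(zip(headings, row)) for row in data_rows]


horizontal = [["Name", "Age"], ["Person A", "30"], ["Person B", "41"]]
vertical = [["Title", "Design Spec"], ["Author", "Anish"]]

horizontal_records = table_to_records(horizontal, "HORIZONTAL")
vertical_records = table_to_records(vertical, "VERTICAL")
```

Because each record is keyed by heading text rather than by column index, the same lookup works if the columns appear in a different order in another document.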

“Therefore, in certain embodiments, a complex word-processing table includes both horizontal and vertical sub-tables. The document analyzer uses a divide-and-conquer approach to extract the data from each sub-table and then assembles it using common parent keys and other context data. A COMPLEX_HORIZONTAL is a horizontal sub-table within a complex table. A COMPLEX_VERTICAL is a vertical sub-table within a complex table.

“An example for a complicated table is:”

“Detailed Error Condition and Messages”

“Business Actionable
Rule ID Error Condition Preventable by End User Diagnostics
Messages (repeat the rows below, including Tokens, for each message)
Message
Type Message Category Severity
Error Product High
Warning System Medium
Confirmation Security Low
Processing
Information
Message Name
Message Text
User Message Detail
Admin Message
Detail
Cause
User Action
Admin Action
Translator Notes
Tokens (repeat the rows below for each token)
Token
Token Name Type Token Description
Date
Number
Text”

“Moreover, parsing rule 250 is prepared by rules generator 132 to include a user-selected action identified by string 258, which in this case is “EXTRACT_TABLE_DATA”. This action, namely EXTRACT_TABLE_DATA, is to be performed automatically by processor 120 (when executing SDA 134) when the document 112I containing table 210 is being analyzed. Note that rules file 133 (FIGS. 2E and 2F) does not identify any physical dimensions of table 210 in parsing rule 250. Users 181A-181N can therefore change the physical dimensions of individual cells or of table 210 in its entirety, e.g. by modifying the page’s left and/or right margins, without in any way affecting the operation of SDA 134, as explained below.

“Specifically, certain embodiments allow tables or table cells to be moved relative to the margins. They can also be placed anywhere in a document, provided that the layout structure and its identifier are kept in a predetermined order. Multiple instances of the same layout structure result in the same action being taken on each instance, and the results captured. Within a table, the order of the column and row headings can also be changed. For example, one word-processing document may have three columns named Name, Age, and Gender, while another word-processing document has the three columns in the order Gender, Name, Age. The document analyzer still captures data from the identified column regardless of its location in the table.

“FIG. 2I shows a screen that is displayed by computer 184 to user 183 to receive user input identifying an action to be taken on a layout structure. In particular, FIG. 2I shows table 210 from template 131X (FIG. 1B). A drop-down list box 282 is displayed when cursor 281 hovers over table name 213. User 183 can move the mouse to select one of the elements of list box 282. Each element in list box 282 represents an action that can be performed on table 210 by processor 120 when executing standardized document analyzer 134. Rules generator 132 uses such user input to create rules file 133X by associating the user-selected action (e.g. EXTRACT_TABLE_DATA) with a table identifier (e.g. “Document Metadata”). Note that the set of actions shown in box 282 is different for a layout structure that is a section, because different actions are applicable to a section, such as actions CHECK_SECTION_EXISTENCE and EXTRACT_SECTION_DATA, described below.”

“Examples of actions supported by standardized document analyzer 134 are shown in the following table. (As discussed above, certain of these actions that are applicable to a table are shown in drop-down box 282 for selection by a user.)

“Action
ACTION_TYPE | Scope | Result
CHECK_SECTION_EXISTENCE | SECTION | It checks if a document has the specified section name in the section hierarchy.
EXTRACT_SECTION_DATA | SECTION | It extracts the text from the specified section.
EXTRACT_TABLE_DATA | TABLE | It extracts specified table data.
COUNT_TABLES | TABLE | It counts total number of instances of VERTICAL or COMPLEX_VERTICAL table structure throughout the document.
COUNT_TABLE_ROWS | TABLE | It counts total number of rows in each instance of HORIZONTAL or COMPLEX_HORIZONTAL table structure throughout the document.
COUNT_EMPTY_TABLE_CELLS | TABLE (cell level) | It counts total number of table cells that are mandatory to have data but empty.
COUNT_TABLE_CELLS_UNDER_CONDITION | TABLE (cell level) | It specifies the CELL_VALUES that a CELL_NAME can have, and groups document data into those values, including valid and invalid values.
COUNT_WORDS_IN_TABLE_CELL | TABLE (cell level) | It counts total number of words in specified table cells.”
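One plausible way to connect an ACTION_TYPE in a rule to behavior is a dispatch table, sketched below for three of the listed actions. The dictionary form of a rule, the use of a list for CELL_NAME, and the record format are assumptions for this example; the source describes the rules file only at the level of attribute names.

```python
def count_table_rows(records, rule):
    """COUNT_TABLE_ROWS: number of data rows in the table instance."""
    return len(records)


def count_empty_table_cells(records, rule):
    """COUNT_EMPTY_TABLE_CELLS: named cells that are missing or blank."""
    return sum(
        1
        for record in records
        for name in rule["CELL_NAME"]
        if not record.get(name, "").strip()
    )


def count_words_in_table_cell(records, rule):
    """COUNT_WORDS_IN_TABLE_CELL: total words in the named cells."""
    return sum(
        len(record.get(name, "").split())
        for record in records
        for name in rule["CELL_NAME"]
    )


# Hypothetical dispatch table from ACTION_TYPE to implementation.
ACTIONS = {
    "COUNT_TABLE_ROWS": count_table_rows,
    "COUNT_EMPTY_TABLE_CELLS": count_empty_table_cells,
    "COUNT_WORDS_IN_TABLE_CELL": count_words_in_table_cell,
}


def apply_rule(records, rule):
    """Run the action named by the rule's ACTION_TYPE on the table records."""
    try:
        action = ACTIONS[rule["ACTION_TYPE"]]
    except KeyError:
        raise ValueError(f"unsupported action: {rule['ACTION_TYPE']}") from None
    return action(records, rule)


records = [
    {"Message Name": "E1", "Message Text": "File not found"},
    {"Message Name": "E2", "Message Text": ""},
]
row_count = apply_rule(records, {"ACTION_TYPE": "COUNT_TABLE_ROWS"})
empty_cells = apply_rule(
    records, {"ACTION_TYPE": "COUNT_EMPTY_TABLE_CELLS", "CELL_NAME": ["Message Text"]}
)
```

With this shape, supporting another action means registering one more function, while the rules themselves stay purely declarative.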

“Rules file 133X also includes TABLE_NAME and SECTION_NAME attributes, which allow processor 120 to determine which layout structures are relevant and are to be used by SDA 134 (when processing the rule). In the case of a TABLE layout structure, note that the names of cells are also included in rules file 133X as CELL_NAME attributes of the TABLE, for use by SDA 134 in performing cell-level actions, such as COUNT_EMPTY_TABLE_CELLS. An example of output from SDA 134 is shown in FIG. 2J.”

“In the screen of this figure, top frame 291 represents a template. Through this screen, rules generator 132 receives user input that selects an action to be performed on a layout structure. Left frame 292 lets the user select any section for which to create rules. Right frame 293 is where the user sets up the rule attributes for the selected section. The attributes available for setting up rules depend on whether the section contains a table layout. If the section contains no table, only SECTION rules and their corresponding attributes are available. If the section contains a table, all rules are available for setup. After looking at the template, the user can choose the table format to use in creating a rule for a particular table. If the rule is at the table field level, all table fields are available, along with all attributes needed to generate the rule. After completing one rule setup, the user clicks Add Another Rule to create another rule for the section. After all rules have been set up for every section, the user clicks a button to generate an XML file containing all rule definitions. This file is sent to computer 100 for use by SDA 134.

“In certain aspects of the invention, the rules created by rules generator 132 are used by document analyzer 134. The rules can be applied to existing templates or new ones, and to existing or newer versions of the same template. Documents 112I-112N (FIG. 1B) that are based on these templates can be analyzed by document analyzer 134 without any code changes. All rules are interpreted within the context of the Section Hierarchy of a word-processing document.

“Moreover, document analyzer 134 can be responsive to additional attributes applicable to the Section Hierarchy in order to produce context-sensitive and more targeted results. To limit the scope of document analyzer 134’s analysis to layout structures within a particular section, the user can specify the attribute PARENT_SECTION_NAME. If no value is given for this attribute, document analyzer 134 analyzes layout structures in the entire document 112I. For example, a document may have an Overview table in each of the following parent sections: the ‘Features in Scope’ section and the ‘Features out of Scope’ section. If the user only wants to collect the Overview data for Features in Scope, the user can set TABLE_NAME=‘Overview’ and PARENT_SECTION_NAME=‘Features in Scope’. As a second example, there may be multiple copies of the Error Message table in document 112I, e.g. in all sections, and the user might want to collect all error messages regardless of where they are used. To do this, the user omits the attribute PARENT_SECTION_NAME, so that the default scope, which is the entire document, is used.
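
The scoping behaviour of PARENT_SECTION_NAME described above can be sketched as a simple filter. This is a hedged illustration, not the analyzer's code: the `FoundTable` record, the `select_tables` helper, and the choice to match the parent section against any enclosing section heading (rather than only the immediate parent) are assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class FoundTable:
    name: str                      # table name, e.g. "Overview"
    section_path: Tuple[str, ...]  # enclosing section headings, outermost first


def select_tables(
    tables: List[FoundTable],
    table_name: str,
    parent_section_name: Optional[str] = None,
) -> List[FoundTable]:
    """Return tables matching TABLE_NAME, limited to PARENT_SECTION_NAME if given.

    When parent_section_name is None, the scope defaults to the entire document.
    """
    return [
        t
        for t in tables
        if t.name == table_name
        and (parent_section_name is None or parent_section_name in t.section_path)
    ]
```

With an Overview table under both ‘Features in Scope’ and ‘Features out of Scope’, passing the parent section selects only the first, while omitting it selects both.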

“Likewise, in many embodiments, the user can also specify to document analyzer 134 another attribute, DATA_LEVEL, to indicate how many levels deep into the data in document 112I the user is interested in, starting from a given point. In a Section Hierarchy, for example, there may be different TABLE structures at every level, and all of these table structures may have some columns in common, such as Description or Name. Using DATA_LEVEL, users can ask document analyzer 134 for data at specific levels or at all levels. In some cases, TABLE rules also support filters that allow empty or invalid records either to be captured, in order to report exceptions, or to be excluded, in order to obtain the actual count. In several aspects, document analyzer 134 applies all the rules to all the documents in a single run and generates one result.

Although the discussion above in reference to FIGS. 2A-2E has focused on word-processing tables as layout structures of templates, a template can also include a hierarchy of sections as a layout structure, as shown in FIG. 2G. Moreover, FIG. 2H shows a rules file created by invoking rules generator 132 on the template of FIG. 2G. Rules generator 132 performs the acts shown in FIG. 3A. Depending on the embodiment, rules generator 132 may be implemented entirely in server computer 100, entirely in client computer 184, or partially in each of computers 100 and 184. In the following description, computer 100 executes rules generator 132 to perform the method shown in FIG. 3A.”

“Specifically, after act 301 (FIG. 3A), computer 100 goes to act 302. In act 302, computer 100 checks whether the template is in a proprietary binary format, e.g. whether the file is compatible with Word 2002, the word-processing software from MICROSOFT CORPORATION. For example, computer 100 checks whether the file extension of the template is ‘.doc’ or ‘.docx’. If so, computer 100 determines that the answer in act 302 is yes and goes to act 303. The ‘.dot’ extension can also be handled in the same way in other embodiments. In act 303, computer 100 uses a feature of the word-processing program that created the template to convert it into a rich format in a human-readable markup language. Text documents in this format (the ‘interoperable format’) can be used by different word processors on different platforms. The interoperable format can differ between embodiments. In the above-described example, where Word 2002 is the word-processing program, the interoperable format is the Rich Text Format (RTF), published by MICROSOFT CORPORATION.

If the template is already in a human-readable markup language, the answer in act 302 is no, so act 303 is skipped and computer 100 goes directly to act 304. Act 304 is also performed after completion of act 303. In act 304, computer 100 converts the template (e.g. now in RTF format) into another markup language that is used for displaying pages and has no semantic markup, namely the Extensible Stylesheet Language Formatting Objects, or ‘XSL-FO’. W3C defines XSL-FO as a part of its XSL specification, available on the Internet at http://www.w3.org/TR/xsl11/. Additional information is available in a book chapter by Elliotte Rusty Harold, 2001, available at http://www.cafeconleche.org/books/bible2/chapters/ch18.html, which is incorporated by reference herein in its entirety.”

“An example conversion of the template of FIG. 2A into the XSL-FO format is illustrated in FIGS. 4A-4D. In this example, the conversion of act 304 is performed using the software BI Publisher 11g from Oracle Corporation, although any other software could be used depending on the embodiment. As shown in FIG. 4A, the XSL-FO file resulting from the above-described conversion contains many layout properties, for example margin-left=‘30.6 pt’ margin-right=‘30.6 pt’ page-height=‘792.0 pt’ page-width=‘612.0 pt’ margin-top=‘21.6 pt’ margin-bottom=‘36.0 pt’. In contrast, the rules file illustrated in FIG. 2H is in a ‘dimensionless’ format: rules file 133X does not contain positions.

“Next, in act 305, computer 100 parses the template in the XSL-FO format to identify text strings that precede a particular type of layout structure, namely tables, and determines them to be table names. Also in act 305, computer 100 determines text strings in the styles used for another type of layout structure, namely section headings, to be section names. Then, in act 306, computer 100 associates the action chosen by the user (e.g. as shown in FIG. 2K or 2I) with each table and each section. The actions associated in act 306 may be, for example, to extract text from each cell of a table and to check for the existence of a section. In act 307, the identifiers along with their associated actions are written into computer memory as a rules file, referred to in certain embodiments as rules file 133X, in a dimensionless format. A dimensionless format, as mentioned above, does not use any of the dimensions normally used for laying out a document on a computer monitor or on printed paper, such as font size, left margin distance, or right margin distance.
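
The table-name identification of act 305 can be illustrated with a short sketch. It assumes a heavily simplified XSL-FO fragment (not actual converter output) and adopts one possible heuristic: the text of a block that is the immediately preceding sibling of a table is taken as the table name. The function name `table_names` and the sample fragment are hypothetical; only the XSL-FO namespace URI and the Python standard-library calls are real.

```python
import xml.etree.ElementTree as ET
from typing import List

# Namespace of XSL-FO elements, in ElementTree's "{uri}tag" notation.
FO = "{http://www.w3.org/1999/XSL/Format}"

# Simplified, hand-written fragment used only for this illustration.
SAMPLE = """<fo:root xmlns:fo="http://www.w3.org/1999/XSL/Format">
  <fo:flow>
    <fo:block>Document Metadata</fo:block>
    <fo:table><fo:table-body><fo:table-row>
      <fo:table-cell><fo:block>Author</fo:block></fo:table-cell>
    </fo:table-row></fo:table-body></fo:table>
  </fo:flow>
</fo:root>"""


def table_names(xslfo_text: str) -> List[str]:
    """Return the text of each block that directly precedes a table."""
    root = ET.fromstring(xslfo_text)
    names = []
    for parent in root.iter():
        previous = None
        for child in parent:
            if (
                child.tag == FO + "table"
                and previous is not None
                and previous.tag == FO + "block"
            ):
                text = "".join(previous.itertext()).strip()
                if text:
                    names.append(text)
            previous = child
    return names
```

Note that the sketch never reads margins, page sizes, or other dimensions; it relies only on the order of elements, which is what allows the resulting rules to be dimensionless.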

“In certain aspects of the invention, rules file 133X is also transmitted by server computer 100 to client computer 184, as illustrated by optional act 308 in FIG. 3A. User 183 can then modify any number of rules, e.g. by using word-processing software on client computer 184. Computer 184 receives any changes made by user 183 and transmits them to server computer 100, and in act 310 computer 100 saves the user-modified rules file 133X, which can then be used by document analyzer 134 (as explained above).

“Note that references herein to document analyzer 134 refer to the execution of document analyzer 134 by computer 100. Some embodiments of document analyzer 134 perform the acts shown in FIG. 3B, which are, unless otherwise noted, similar or identical to the acts described above in reference to FIG. 3A. Initially, in act 311, document analyzer 134 receives an identification of a location, e.g. any URL that is publicly accessible on the Internet or on an intra-corporate website, at which a set of documents to be analyzed is stored. The above-described location identification can be received from a user, e.g. via a screen of the type shown in FIGS. 3C and 3E.”

“A repository at the user-specified location contains the group of documents 112I-112N that are to be analyzed. These documents have been created by users 181I-181N (FIG. 1B) as mentioned above (specifically, by the users providing keyboard and mouse input to document editing software such as Microsoft Word to replace the sample text or blank text in template 131X with their custom text). In a next act 312, document analyzer 134 crawls the user-identified location received in act 311 in order to find all documents 112I-112N to be analyzed. Each document is then transferred to server computer 100 (from the repository at the user-specified location).

For example, in act 312, document analyzer 134 connects to a file server at the user-specified location and transfers each file into a local memory of computer 100, i.e. a memory within computer 100 or accessible to it via a local area network (LAN). In some aspects of the invention, the file transfer uses a generic protocol such as File Transfer Protocol (FTP) to transfer each document 112I-112N to a non-transitory computer-readable medium within computer 100.

“After act 312, document analyzer 134 enters a loop in which each document to be analyzed is, in turn, the current document. In particular, in act 313 document analyzer 134 checks whether the current document is in a binary format, e.g. by checking whether the file name extension of the file is ‘.doc’ or ‘.docx’. In some aspects of the invention, file name extensions are checked prior to file transfer via FTP from the user-specified location to a non-transitory memory within or accessible by server computer 100, so that only files with one of these extensions are transferred to computer 100.

“In one example, files 112I-112N (see group 115X in FIG. 1B) having ‘.doc’ or ‘.docx’ extensions are stored in a binary format proprietary to MICROSOFT Word. In act 314, these files 112I-112N are converted to a rich format such as the Rich Text Format (RTF). This rich format allows a document to be shared between word processors and is therefore also known as an interoperable format. Specifically, document analyzer 134 checks whether the extension of the document is not ‘.rtf’; if the file extension is ‘.doc’ (traditional) or ‘.docx’ (recently introduced), then computer 100 performs a conversion into RTF.”

“Note that support for new file extensions may be added in certain aspects of the invention without affecting the code of document analyzer 134. For example, if MICROSOFT CORPORATION were to introduce new extensions such as ‘.docy’ or ‘.docz’ (hypothetical examples, as there are no such extensions today), files with these extensions could be processed in act 313 after simple configuration changes, without any impact on the code of document analyzer 134. If the answer in act 313 is no, document analyzer 134 skips act 314 and goes to act 315.
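
One way to keep the extension check of act 313 configurable, so that a new extension requires no code change, is sketched below. This is only an assumption about how such configuration might look: the JSON layout, the `binary_extensions` key, and both function names are hypothetical and are not taken from the analyzer itself.

```python
import json
from pathlib import PurePosixPath
from typing import FrozenSet

# Hypothetical configuration text; in practice this could live in a separate file.
DEFAULT_CONFIG = '{"binary_extensions": [".doc", ".docx"]}'


def load_binary_extensions(config_text: str) -> FrozenSet[str]:
    """Read the list of binary-format extensions from JSON configuration."""
    return frozenset(ext.lower() for ext in json.loads(config_text)["binary_extensions"])


def needs_rtf_conversion(file_name: str, binary_extensions: FrozenSet[str]) -> bool:
    """Return True if the file's extension marks it as a binary word-processing file."""
    return PurePosixPath(file_name).suffix.lower() in binary_extensions
```

Supporting a hypothetical ‘.docy’ extension then amounts to adding it to the configured list; the two functions stay unchanged.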

“Note that in act 314, document analyzer 134 removes any images from the current document and converts it into RTF. Then, in act 315, document analyzer 134 converts the RTF document into a tag-based markup language, using the XSL-FO format. Acts 311-315 can be performed independently of, and without knowledge of, layout structures. Removing images in act 314 improves the performance of document analyzer 134, in terms of both memory usage and the time taken to process documents. These performance improvements make document analyzer 134 usable for bulk processing, i.e. for analyzing a group of documents 112I-112N that is an order of magnitude larger (e.g. 10 documents or more) than a single document being analyzed. Note that the removal of images can change the positions of layout structures in a document 112I. However, because document analyzer 134 does not use these positions to determine actions, this change is harmless, and the removal increases the performance of document analyzer 134 sufficiently to allow bulk processing.

“Operation 316 is briefly summarized in this paragraph; details of an illustrative implementation of operation 316 are shown in FIG. 3C, as described below. In operation 316 (FIG. 3B), document analyzer 134 scans the document to identify each valid layout structure. In certain embodiments, this is done using the style names or tag names resulting from act 315. For example, a tag is used to identify the words of text that precede a layout structure and have a predetermined style (e.g. style-name=‘Table Heading’, in this case the table name) as the name of the layout structure. Any text may be present between the name of the layout structure and the layout structure itself, provided that such text is in a style other than the one used for the name. On finding a layout structure, document analyzer 134 extracts text from it and uses rules file 133X to check whether any rule’s condition is satisfied. Document analyzer 134 performs a rule’s action when the rule’s condition is satisfied. The action is not performed on the entire contents of document 112I, but only on the text extracted from the specific layout structure that was found to satisfy the rule’s condition.”
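
The naming convention summarized above (a name in a predetermined style, optionally followed by text in other styles, then the layout structure) can be sketched as a single pass over the document's items. The `Item` record, the flattened item sequence, and the `find_named_tables` helper are assumptions made for illustration; the real operation 316 works on the markup produced by act 315.

```python
from typing import Any, List, NamedTuple, Optional, Tuple


class Item(NamedTuple):
    kind: str     # "text" or "table" (hypothetical flattened view of the markup)
    style: str    # style name of a text item; unused for tables
    content: Any  # the text string, or the table's rows


def find_named_tables(
    items: List[Item], name_style: str = "Table Heading"
) -> List[Tuple[str, Any]]:
    """Pair each table with the most recent preceding text in the name style.

    Text in any other style between the name and the table is permitted
    and skipped. A table with no preceding name is not reported.
    """
    found = []
    pending_name: Optional[str] = None
    for item in items:
        if item.kind == "text" and item.style == name_style:
            pending_name = item.content
        elif item.kind == "table":
            if pending_name is not None:
                found.append((pending_name, item.content))
            pending_name = None  # a name applies to one table only
    return found
```

Because the pairing depends on order and style rather than on page position, the same scan still works after images are removed or structures are duplicated.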

Summary for “Analysis using rules for documents”


“The current inventors believe that an automated solution for analyzing documents of different types and with different content would greatly improve efficiency and accuracy for people and companies who use word-processing documents. The current inventors further believe there is a need to automatically analyze multiple documents for a user-defined set or subset of content, using an invention of the following type, e.g. to limit a search to counting only the rows of a certain table or to searching only a particular subsection within a section.

“According to the invention, one or more computers are programmed to receive input (e.g. a command from a user) indicating multiple word-processing documents in electronic form that are to be analyzed together. In some embodiments, the multiple word-processing documents are analyzed in response to a single command from the user, which could identify, for instance, a directory name or a portion of file names. Based on the user input, the multiple word-processing documents are analyzed by performing one or more predetermined actions on document contents (e.g. strings of text) that are arranged in a structure associated with the predetermined action by a rule. The action is performed on that structure in every one of the multiple word-processing documents in which the structure satisfies a condition in the rule.

“Depending on the embodiment, the one or more structures can be identified (for performing the associated one or more actions) by the presence of certain text in each word-processing document (e.g. a word or sequence of words that forms a name or other such identifier) arranged in a particular sequence relative to the structure (e.g. before the structure). The structure (also known as a ‘layout structure’) is used by an application program in every word-processing document to arrange text for display on a page or for printing on paper. A word-processing table is one example of a layout structure; word processors use it to display or print text in tabular form on a page. A word-processing section is another example of a layout structure; word processors use it to display or print text in a hierarchical arrangement of sections and subsections, indented relative to one another, on a page.

“In multiple embodiments, different layout structures are created in a word-processing document manually, by a user using a feature of an application program (e.g. a word-processing program’s feature for inserting a layout structure) and entering words of text. In the word-processing program, the user can also input an identifier for the layout structure (e.g. a word or sequence of words chosen by the user as a table name, section heading or the like in the word-processing document). Depending on the embodiment, the identifier may be placed in the word-processing document before, after, or within a particular part of the layout structure. To later analyze the word-processing document, the user uses a layout structure’s identifier to create a condition in a rule. When creating the rule, the user also specifies the action to be taken when the layout structure identified in the rule’s condition is found in a document, e.g. by one or more computers executing software (the “document analyzer”) in accordance with the invention.

“In some embodiments, a word-processing document created as described in the previous paragraph is later used as a template. One or more users copy the template and manually edit the text in the copy, thereby obtaining additional word-processing documents (also known as “standardized documents”). One or more of these documents can then be searched in accordance with the invention, using one or more of the above-described rules, on one or more computers programmed with the document analyzer.

“In other embodiments of the invention, multiple word-processing documents are created by one or more users without the use of a template. Instead, the users manually use features of the application program to insert layout structures and their identifiers into each word-processing document. In these alternative embodiments as well, the layout structure identifiers are matched with conditions in rules of the type described above by one or more computers programmed with the document analyzer.

“Computer(s) programmed with the document analyzer apply one or more rules of the type discussed above to search every document, as follows. In some embodiments, the computer(s) remove images from every document and convert each document into a markup language. Then, within each document, each layout structure that satisfies a condition in a rule is identified. A layout structure that satisfies a condition in a rule can be identified in many ways. For example, an identifier of a layout structure present in a document is checked to see whether it is present in the rule’s condition.

“In response to any match (between an identifier in a document and the condition in a rule), the associated rule-specific action is performed using the words of text in the layout structure identified by the identifier. Each action is performed using the layout structure as found in each document, and the output of all such actions on multiple documents is combined. The combined results from the multiple documents are stored in a non-transitory storage accessible to the one or more computers, for future use, e.g. to further process, display, and/or print the collected results.”

“In accordance with the invention, a processor 120 in a computer 100 is programmed with software instructions 134 to perform the method illustrated in FIG. 1A, e.g. to receive input in act 101A. The input may be provided at a client computer 184 (FIG. 1B) by a human operator 183 (a user) using an input device such as a keyboard or mouse (not shown). Client computer 184 transmits the user input via a wired or wireless link 151 to server computer 100. Once received, the input is stored in memory 130 in the usual manner.

“The user input received in act 101A could identify, for instance, a directory name on hard disk 140, e.g. in the form of a URL (Uniform Resource Locator). All documents in the identified directory make up a group 115X of documents that are automatically parsed by processor 120 (executing the document analyzer) in accordance with act 102 (FIG. 1A), which can be repeated as per loop 102L (for each layout structure). Alternatively or additionally, the user input received on link 151 in act 101A may specify file names to be searched for in multiple subdirectories of the identified directory. User 183 may specify the file names in the usual manner, e.g. by using a search term (with a wildcard).”

“Moreover, in act 101B computer 100 also receives two additional user inputs. One input indicates a condition on at least a portion of a name or other identifier (e.g. ID-J in FIG. 1B) that is used to identify a structure J created by a word processor to lay out text on a page to be printed on paper or displayed on a screen. Structure J can be, for example, a table used to display or print text in tabular form; another example of structure J is a section that displays or prints text in a hierarchy. Structure J, also known as a ‘layout structure’, is identified by name ID-J, which is located adjacent to it in a predetermined sequence relative to the structure (e.g. located before it). Using a name or other identifier ID-J to identify layout structure J eliminates the prior art’s need to indicate a position on a page. If name or identifier ID-J is found in the predetermined sequence, layout structure J is deemed to satisfy the condition, and text within the layout structure is used to perform an action that is indicated by the other user input. Computer 100 internally associates these two user inputs (i.e. the user input on a condition and the user input on an action) with one another to create a rule.

“Note that act 101B can be performed in a loop (see branch 101L in FIG. 1A) as many times as necessary, e.g. once for each layout structure in a word-processing document that is to be analyzed. Note also that acts 101A and 101B are independent of one another and can be performed in any order.

“In some embodiments, once inputs have been received in accordance with acts 101A and 101B, a document 112I is searched in act 102 (FIG. 1A) to identify every layout structure that is associated with an action by a rule. Computer 100 uses the predetermined identifier (e.g. ID-J in FIG. 1B) to identify structure J wherever it appears in document 112I. In certain embodiments, identifier ID-J and the corresponding structure J must be present in a particular sequence relative to each other (e.g. ID-J before structure J) in document 112I. In some embodiments, documents 112I . . . 112N searched in act 102 were all generated using a common template 131X (see FIG. 1B) by one or more human users 181A-181N (e.g. users who report to user 183 within an organization), who provide input (e.g. via keyboards), in the form of text (and optionally graphics), for insertion into templates 131X-131Z using word-processing software (also known as a ‘word processor’) that interfaces directly or indirectly with the one or more respective computers 182A-182N.”

“During document creation, input from each user 181I is used by a computer 182I programmed with the word processor to replace (i.e. overwrite) default sample text (or any blanks) in a local copy of template 131X, thereby creating a customized copy. The customized copy is saved to hard disk 140 (or another non-transitory storage) of server 100 as document 112I. A document 112I created in the above-described way, by modifying a copy of template 131X, is also known as a standardized document.

As shown in FIG. 1B, many different standardized documents 112A . . . 112I . . . 112N can be generated from the same template 131X, and a subset of these standardized documents, 112I . . . 112N, forms group 115X. Template 131X contains several structures B . . . J . . . M that are identified by respective identifiers ID-B, ID-J and ID-M. Structures (also known as ‘layout structures’) B . . . J . . . M are stored in template 131X, e.g. in binary form (as originally created in template 131X by word-processing software such as WORD from Microsoft Corporation). Structures B . . . J . . . M and their identifiers ID-B, ID-J and ID-M are retained in a new document 112I when it is created by user 181I copying template 131X and editing the new document 112I to input text into any of structures B . . . J . . . M.”

“Accordingly, template 131X contains a number of identifiers ID-B, ID-J and ID-M in a predetermined sequence relative to the corresponding layout structures B . . . J . . . M. Depending on the embodiment, identifiers ID-B, ID-J and ID-M can be either (a) pre-existing in template 131X as text previously provided by user 183 during template creation or (b) intentionally added to template 131X (manually or automatically) to aid identification of the corresponding layout structures B . . . J . . . M during the parsing of documents 112I . . . 112N in act 102 (FIG. 1A).”

“Note that once a document 112I is created, user 181I can make any number of changes to it, including duplicating (by copying and pasting) one or more of structures B . . . J . . . M together with their identifiers ID-B, ID-J and ID-M, and creating new layout structures. Document analyzer 134 can analyze document 112I even when multiple copies of structures B . . . J . . . M and identifiers ID-B, ID-J and ID-M are present in document 112I. This is because document analyzer 134 is not hardcoded with physical dimensions and/or positions of structures B . . . J . . . M in a page; instead, document analyzer 134 uses rules (expressed in a position-independent format) of the type described below.

“In some embodiments, each identifier ID-J is manually entered by user 183 during creation of template 131X and is placed therein in sequence immediately prior to (i.e. preceding) the corresponding layout structure J. For example, layout structure J can be a section containing a hierarchy of sub-sections, in which case identifier ID-J is inserted into template 131X as the top-most heading of section J. Space (i.e. blank or other characters) may be permitted between layout structure J and its identifier ID-J, depending on the implementation. As another example, layout structure J can be a table with cells (also known as “tabular cells”), in which case user 183 can insert a name for the table as identifier ID-J right before table J. Whenever a standardized document 112I is created from template 131X, layout structures B . . . J . . . M and their identifiers ID-B, ID-J and ID-M in template 131X are copied into the standardized document 112I, which is then customized by user 181I.

“During normal operation, additional documents (not shown in FIG. 1B) can also be generated by human users 181A-181N using other templates 131Y or 131Z. These additional documents are also standardized documents and are also stored on hard disk 140. User 183 can identify the additional documents generated using other templates 131Y or 131Z in groups of their own (not shown in FIG. 1B); however, none of these additional documents is to be identified in group 115X, which identifies only standardized documents 112I . . . 112N that are based on template 131X. If, due to a user error in the input on link 151, a document identified in group 115X was not created using template 131X, computer 100 generates an error message and stores it in a non-transitory storage of computer 100. The error message can also be transmitted to computer 184 and displayed to user 183.

In summary, a layout structure J is identified in each document 112I in the user-identified group 115X (per act 102), and the contents of layout structure J are then used to perform an action associated with structure J, as per act 103. The action associated with layout structure J can be performed multiple times (per loop 103L), on multiple copies of structure J in each of documents 112I . . . 112N. The results of the action performed on the multiple copies of structure J are then stored, as per act 104. Acts 102-104 in FIG. 1A can be repeated, as per loop 104L, for each layout structure J, and document analyzer 134 can loop over each document (as per act 105) to perform acts 102-104 multiple times. The above-described results generated by document analyzer 134 for each document are stored together for each structure J (e.g. as collection 135J, or as statistics), even though the contents of each structure are taken from multiple documents 112I . . . 112N. In some embodiments, actions can be associated with different structures B . . . J . . . M to create respective collections 135B . . . 135J . . . 135M.”
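
The looping and collection behaviour summarized above can be sketched as follows. This is a minimal illustration under stated assumptions: each document is assumed to be already reduced to a list of (identifier, text) pairs, rules are modelled as a mapping from identifier to a Python callable, and the function name `analyze_group` is hypothetical. The loop order (documents outermost) is also an assumption; it does not change what ends up in each collection.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, List, Tuple

# Hypothetical inputs: document name -> list of (identifier, extracted text).
Documents = Dict[str, List[Tuple[str, str]]]
Rules = Dict[str, Callable[[str], Any]]


def analyze_group(documents: Documents, rules: Rules) -> Dict[str, List[Tuple[str, Any]]]:
    """Apply each rule's action to every matching structure in every document.

    Results are grouped per identifier, so one collection holds the results
    for a structure across all documents in the group.
    """
    collections: Dict[str, List[Tuple[str, Any]]] = defaultdict(list)
    for doc_name, structures in documents.items():   # loop over documents
        for identifier, text in structures:          # every copy of every structure
            action = rules.get(identifier)
            if action is not None:                   # identifier matches a rule condition
                collections[identifier].append((doc_name, action(text)))
    return dict(collections)
```

A structure that is duplicated within one document simply contributes several entries to the same collection, and structures without a rule are ignored.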

“In some embodiments, the one or more actions associated with a structure J are specified in the form of rules in a rules file 133X (FIG. 1B), which is an input to processor 120 executing software 134 (also known as the document analyzer) in computer 100. Rules file 133X can be created in many ways, e.g. manually, automatically, or by a combination of both. In some cases, rules file 133X is generated by invoking a rules generator 132. Rules generator 132 is invoked with a file name of a template 131X, specified by user 183 at computer 184 and sent via a wired or wireless link 152. Rules generator 132 automatically parses the template 131X named on link 152 to identify all layout structures B . . . J . . . M, which are identified by their respective identifiers ID-B, ID-J and ID-M. Computer 100 supplies these identifiers to computer 184 via a link 153, and they are then displayed to user 183.

“User 183 then identifies to computer 184 a specific action that is to be performed on a particular layout structure by document analyzer 134. An illustrative embodiment displays a drop-down list of the actions that are supported by (i.e. can be performed by) document analyzer 134, e.g. copying text from the layout structure or counting the words in it.

Computer 184 responds by forming an association between the user-selected action and a specific identifier ID-J. Computer 184 transmits each specific identifier ID-J of a layout structure J, with its associated action, to computer 100 via a link 154. In some embodiments, computer 184 merely displays a web page (e.g. in HTML) that is sent from computer 100, and sends to computer 100 any input it receives from user 183. In such embodiments, computer 100 forms the above-described association.

Based on user 183’s input, rules generator 132 creates rules file 133X by writing each identifier ID-J, or a part thereof, in a condition along with its associated action. Each identifier ID-J within a condition and the associated action together form a rule. Therefore, rules file 133X may contain as many rules as there are structures B . . . J . . . M in template 131X. Depending on user 183’s input, rules file 133X may also contain more or fewer rules than the number of layout structures in the template. For example, user 183 might identify more than one action for a layout structure J, while the same user 183 might identify no action for another layout structure M.
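The pairing of identifiers with user-selected actions described above can be sketched as follows. This is a minimal illustration only: the `Rule` class, the `generate_rules` function, and the action names `extract_text` and `count_words` are hypothetical and are not taken from the disclosure, which does not specify a rules file syntax here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    condition: str  # layout-structure identifier (or a part of one)
    action: str     # name of an action the analyzer is assumed to support


def generate_rules(template_ids, selected_actions):
    """Pair each identifier found in a template with the user's chosen actions.

    template_ids: identifiers parsed from the template.
    selected_actions: dict mapping an identifier to a list of action names;
        an identifier may have several actions, or none at all.
    """
    rules = []
    for identifier in template_ids:
        for action in selected_actions.get(identifier, []):
            rules.append(Rule(condition=identifier, action=action))
    return rules


template_ids = ["Authors", "Messages", "Reviewers"]
selected = {
    "Authors": ["extract_text"],
    "Messages": ["extract_text", "count_words"],  # two actions for one structure
    # no action chosen for "Reviewers"
}
rules = generate_rules(template_ids, selected)
for rule in rules:
    print(f"{rule.condition} -> {rule.action}")
```

In this sketch the template has three layout structures and the resulting rules list also has three entries, but only by coincidence: one structure contributes two rules and another contributes none.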

Additionally, depending upon the embodiment, rules from two rules files 133X and 133Q can be copied by the user into a common rules file 133R (FIG. 1B). The user then uses document analyzer 134 with common rules file 133R to analyze similar or identical layout structures in different word-processing documents 112A . . . 112I . . . 112N. An example is a table with the name “document metadata” that is used in the following two types of word-processing documents: functional design documents and user manuals. These are correspondingly created using two different templates 131X and 131Y. By using a common rules file 133R that contains a single rule whose condition identifies the “document metadata” table, information in that table can be extracted from both types of word-processing documents. Hence, the number of rules in any given rules file 133X need not equal the number of layout structures in any particular template 131X.

User 183 may manually view and modify rules file 133X if needed, to ensure that the appropriate actions are associated with each layout structure J. For example, user 183 may modify, via a link 155, a previously identified action that is associated with ID-J in rules file 133X. A condition in file 133X does not have to identify the complete identifier or name (e.g. “document metadata”) of the layout structure that is to trigger an action. Instead, a wildcard such as “*” or “%” can be used in an identifier with partial information, such as a part of a layout name (e.g. “document meta*”). Processor 120 uses rules file 133X in executing document analyzer 134 when it analyzes word-processing documents 112I . . . 112N generated using template 131X, to determine the action to take when a condition of a rule matches a layout structure within a document 112I. By choosing an action in rules file 133X, user 183 thus chooses the type of data that processor 120 executing document analyzer 134 collects from each structure J, e.g. to be stored in an RDBMS table or displayed on a computer monitor (such as a cathode-ray tube).
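A wildcard condition of the kind described above might be matched as in the following sketch. It assumes that “%” is treated as a synonym for “*” and that matching ignores case; neither behavior is stated in the disclosure. The function name `condition_matches` is hypothetical.

```python
from fnmatch import fnmatchcase


def condition_matches(condition, structure_name):
    """Return True if a rule's condition matches a layout-structure name.

    Assumptions: "%" behaves like "*", and comparison is case-insensitive.
    """
    pattern = condition.replace("%", "*").lower()
    return fnmatchcase(structure_name.lower(), pattern)


print(condition_matches("document meta*", "Document Metadata"))
print(condition_matches("Messages", "Document Metadata"))
```

Note that `fnmatchcase` also gives special meaning to `?` and `[...]`, so a real implementation would need to escape those characters if layout names may contain them.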

The output of processor 120 in executing document analyzer 134, such as collection 135J or statistics 136J, may take different forms depending on the embodiment, e.g. a web page 191 for use in a browser, a spreadsheet 192 for use in a spreadsheet program, or a relational database 138 accessed via a relational database management system (RDBMS) 1905 such as ORACLE DATABASE 11gR1, available from ORACLE CORPORATION. Each of web page 191, spreadsheet 192, and relational database 138 is stored as files in a file system 190, on a hard drive or any other non-transitory computer-readable storage medium.

Processor 120’s output can be stored in an RDBMS table (such as table 138J) and further processed using queries in a structured query language (SQL) to generate reports that are displayed on a web page to user 183. Document analyzer 134 in some embodiments is invoked by user 183 providing a location (e.g. a URL) of the document repository (where the word-processing documents of group 115X are located) and selecting a rules file 133X. Rules file 133X instructs document analyzer 134 on what layout structures to search for in the documents 112I-112N (see group 115X in FIG. 1B), as well as what text and statistics should be collected from each layout structure recognized in the documents.

Depending on the embodiment, the number of rules files 133X . . . 133Q . . . 133R (see FIG. 1B) is unrelated to the number of templates 131X-131Z. For example, one user may be interested in certain layout structures of a template X, while another user (not shown) may be interested in other layout structures of the same template X. Therefore, the two users create two rules files from the same template.

When user 183 selects an action that uses relational database 138, to be executed by processor 120 executing document analyzer 134, user 183 also creates the necessary tables via link 156, e.g. an RDBMS table J, which is item 138J in FIG. 1B. User 183 also updates in computer 100, via link 157, a property file 139 to create an association between RDBMS table J and a corresponding layout structure J (identified by its ID-J), whose data will be written into RDBMS table J by analyzer 134. Property file 139 is used in several embodiments to specify the relational database tables that store the results of rules applied to corresponding layout structures. Property file 139 can also hold additional information and other processing logic, such as 1) identification and configuration of environment information, 2) which rules files are applied to which templates, and 3) identification and modification of existing rules files. The identifier ID-J used herein to identify a layout structure could be a TABLE_NAME or a SECTION_NAME, depending on whether the identified layout structure is a table or a section in a word-processing document.
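As an illustration of the association held in the property file, the following sketch parses a small text file that maps layout-structure identifiers to RDBMS table names. The `KIND.identifier = table` line syntax, the table names, and the function `parse_property_file` are all assumptions made for this example; the disclosure states only that the property file associates identifiers such as a TABLE_NAME or SECTION_NAME with RDBMS tables.

```python
PROPERTY_TEXT = """
# hypothetical syntax: <kind>.<layout identifier> = <RDBMS table name>
TABLE_NAME.Messages = MESSAGES_TBL
TABLE_NAME.Authors = AUTHORS_TBL
SECTION_NAME.Overview = OVERVIEW_TBL
"""


def parse_property_file(text):
    """Return {(kind, identifier): rdbms_table} from property-file text."""
    mapping = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, _, value = line.partition("=")
        kind, _, identifier = key.strip().partition(".")
        mapping[(kind, identifier)] = value.strip()
    return mapping


mapping = parse_property_file(PROPERTY_TEXT)
print(mapping[("TABLE_NAME", "Messages")])
```

Keeping this mapping in editable text, rather than in code, is what lets a user point a new layout structure at a new table without touching the analyzer itself.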

Supplying rules files 133X, 133Q and 133R and property file 139 as inputs to document analyzer 134 (FIG. 1B) allows user 183 to access various types of data from word-processing documents (group 115X) by configuring files 133X, 133Q, 133R, and 139. The behavior of document analyzer 134 can be changed by user 183 by changing the action or layout structure in rules files 133X, 133Q, and 133R, and/or the tables in property file 139. This eliminates the need for writing software code. If a new layout structure is to be analyzed, user 183 creates a new RDBMS table in database 138 and adds an association between it and the new layout structure’s identifier (e.g. a TABLE_NAME or SECTION_NAME) to property file 139, to create a revised property file. User 183 also adds an association between an action (e.g. extract data) and the new layout structure’s identifier to a rules file, to create a revised rules file. Document analyzer 134 is then executed by processor 120 with the revised property file and the revised rules file as inputs. These configuration changes are simple and can be performed by user 183. According to the current inventors, this is at least one order of magnitude faster than manually altering software source code in a prior art document analyzer.

Note that document analyzer 134 can be executed by processor 120 even if no data is stored in relational database 138, e.g. document analyzer 134 can provide its output as a web page 191 (e.g. in HTML) and/or as a spreadsheet file 192 (e.g. in a format known as comma separated values (CSV), which can then be opened with a software program called EXCEL available from MICROSOFT CORPORATION). The results in the above-described spreadsheet file 192 can be imported into a relational database by user 183, who can then prepare reports by running queries in SQL (structured query language).
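The CSV route described above can be sketched as follows, with SQLite standing in for the relational database (the disclosure names ORACLE DATABASE, not SQLite). The rows, column names, and table name are invented for the example, and an in-memory buffer stands in for spreadsheet file 192.

```python
import csv
import io
import sqlite3

# Hypothetical analyzer output: one row per extracted message.
rows = [
    ("documentA.doc", "File not found"),
    ("documentA.doc", "Disk is full"),
    ("documentI.doc", "Access denied"),
]

# Write the rows as CSV (the buffer stands in for a spreadsheet file).
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["file_name", "message"])
writer.writerows(rows)

# Import the CSV into a relational table, then report with SQL.
buffer.seek(0)
reader = csv.DictReader(buffer)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE messages (file_name TEXT, message TEXT)")
conn.executemany(
    "INSERT INTO messages VALUES (:file_name, :message)", list(reader)
)
report = conn.execute(
    "SELECT file_name, COUNT(*) FROM messages "
    "GROUP BY file_name ORDER BY file_name"
).fetchall()
print(report)
```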

Note that links 152, 153, 154 and 155 may differ from one another, and from link 151 (discussed previously), depending on the embodiment.

In many examples, documents 112A-112N (FIG. 1B) are word-processing documents and layout structure J (FIG. 1B) is a word-processing table in template 131X. Each row of table J contains a message as text in a natural language. In this example, table J in word-processing document 112I contains q rows of messages, while table J in another word-processing document 112N has only s rows. Each table J is identified using a predetermined identifier, such as the word “Messages”, in both word-processing documents 112I and 112N. When word-processing document 112I is analyzed using document analyzer 134, processor 120 locates the word “Messages” immediately before a table, and processor 120 copies the q messages from this table into collection 135J.

Similarly, when word-processing document 112N is analyzed, processor 120 again finds the word “Messages” and copies the s messages from the table immediately following the word. Processor 120 then adds the s messages to the q messages previously copied into collection 135J. If a predetermined identifier in the rules file (e.g. the name “Messages” in this example) is not found in a document, processor 120 forms an association in memory between that document and the predetermined identifier, to be used in an error message stating that the document does not contain the predetermined identifier.

Accordingly, after all word-processing documents 112I . . . 112N have been analyzed, collection 135J includes the q+s messages extracted from table J in multiple word-processing documents within group 115X. Statistics 136J for table J is a set of q+s counts, with each count representing the number of words in a message. When processor 120 executes document analyzer 134, the outputs (the q+s messages or the q+s counts) can be stored in non-transitory memory and sent to client computer 184 as a document, such as a web page, spreadsheet, or RDBMS table. User 183 may then use the information in the usual way.
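The accumulation of messages and word counts across documents can be sketched as below. Each document is modeled, purely for illustration, as a list of `("text", ...)` and `("table", ...)` blocks, and the identifier is assumed to appear immediately before its table; the function names and sample messages are hypothetical. Here q is 2 and s is 1.

```python
def extract_table_after(blocks, identifier):
    """Return the rows of the first table directly following `identifier`, else None."""
    for current, following in zip(blocks, blocks[1:]):
        if current == ("text", identifier) and following[0] == "table":
            return following[1]
    return None


def analyze(documents, identifier):
    collection, word_counts, missing = [], [], []
    for name, blocks in documents.items():
        rows = extract_table_after(blocks, identifier)
        if rows is None:
            missing.append(name)  # remembered for a later error message
            continue
        collection.extend(rows)
        word_counts.extend(len(row.split()) for row in rows)
    return collection, word_counts, missing


documents = {
    "112I": [("text", "Messages"), ("table", ["File not found", "Disk is full"])],
    "112N": [("text", "Messages"), ("table", ["Access denied"])],
    "other": [("text", "Introduction")],  # lacks the identifier
}
collection, word_counts, missing = analyze(documents, "Messages")
print(collection, word_counts, missing)
```

The lookup keys on the identifier text and the block that follows it, never on a page coordinate, which is the property the later discussion of position independence relies on.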

In the above-described illustration, there are two additional layout structures B and M in word-processing documents 112A-112N. These are two word-processing tables (from template 131X), each of which includes names of persons. The persons in table B are authors, and the word “Authors” is used as ID-B to identify table B in template 131X. The persons in table M are reviewers, and the word “Reviewers” is used as ID-M to identify table M within template 131X.

Hence, when analyzing word-processing document 112I, on finding the word “Authors”, processor 120 (while executing document analyzer 134) determines the immediately following table to be table B and copies the author names from table B into collection 135B. Similarly, if the word “Reviewers” is found in document 112I, processor 120 determines the immediately following table to be table M, and copies the reviewer names from table M into collection 135M. As each word-processing document 112I is analyzed, each collection 135B, 135J, and 135M is incrementally built. The compilation of collections 135B, 135J and 135M is completed by the analysis of the last word-processing document 112N of group 115.

Each collection 135B, 135J, and 135M organizes user-input content that is stored in structured form in word-processing documents 112I . . . 112N of group 115, so that it can be viewed in an easy-to-review manner by user 183. User 183 can, for example, review collection 135J of messages to determine whether any four-letter words are present, e.g. by opening collection 135J in a browser and using the browser’s “search” function. Alternatively, this check can be automated by selecting the appropriate action in rules file 133X. User 183 can also manually check message collection 135J to ensure that it conforms to natural language grammar. User 183 can therefore invoke execution of document analyzer 134 by processor 120 to validate the content and quality of word-processing documents 112I-112N (see group 115X in FIG. 1B).

User 183 can also obtain a list of the authors of word-processing documents 112I-112N by screening duplicates out of collection 135B, and a list of reviewers by screening duplicates out of collection 135M. The number of times that a name is repeated is an indicator of how much work that person has contributed. The level of completion of documents 112I . . . 112N is also indicated by the number of rows containing default sample text (or empty) in collection 135J. For example, computer 100 can determine that half the rows of a table in one word-processing document are empty or have default text, in contrast to another word-processing document in which all rows have default text (or are blank). In some embodiments, processor 120 counts the number of rows that are empty or have default text, in response to user 183 specifying this count in the action associated with word-processing table J on link 152. Hence, document analyzer 134 allows user 183 to efficiently assess the completeness of word-processing documents 112I-112N (see group 115X in FIG. 1B).
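The duplicate screening and completeness check described above might look like the following sketch. The author names, the `DEFAULT_TEXT` placeholder, and the function `incomplete_fraction` are invented for illustration; the disclosure does not say what the default sample text is.

```python
from collections import Counter

# Hypothetical contents of the authors collection after all documents are analyzed.
collection_135b = ["Ann", "Bob", "Ann", "Cy", "Ann"]

contributions = Counter(collection_135b)  # how often each name is repeated
unique_authors = sorted(contributions)    # list with duplicates screened out

DEFAULT_TEXT = "<Enter message here>"     # assumed template placeholder text


def incomplete_fraction(rows):
    """Fraction of rows that are empty or still hold the default text."""
    if not rows:
        return 0.0
    incomplete = sum(
        1 for row in rows if not row.strip() or row.strip() == DEFAULT_TEXT
    )
    return incomplete / len(rows)


print(unique_authors, contributions["Ann"])
print(incomplete_fraction(["Done", "", DEFAULT_TEXT, "Ok"]))
```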

Other embodiments automatically count how many times an individual is identified as an author, by specifying such counting within an action associated with the respective table B. Other embodiments count the number of times a person is identified as a reviewer, by specifying this counting in an action associated with the respective table M. These counts are statistics 136B . . . 136M that user 183 can use as actionable intelligence (e.g. to set bonuses). User 183 can thus use document analyzer 134 to examine qualitative and quantitative characteristics of word-processing documents 112I-112N, at the level of each document or in aggregate. This allows statistics to be created across multiple word-processing documents at different levels of the hierarchy of a software product and within a line (or series) of software products. Document analyzer 134 can produce various statistics by analyzing word-processing documents across product families, lines, and products in a way that was previously impossible. Without document analyzer 134, significant quality improvements are not possible due to the limitations of existing document editors. Using document analyzer 134, a user 183 can now interpret and use word-processing documents 112I-112N in a way that was not possible before.

One example of software used to create and edit word-processing documents 112A-112N and templates 131X-131Z is MICROSOFT OFFICE XP, sold by MICROSOFT CORPORATION, which includes the following components: word-processing software called “Word 2002”, spreadsheet software called “Excel 2002”, and software for slides. This example software is an application program that has multiple components. It is normally installed on each computer 182I and executed independently by each computer 182I.

Another example of software to create word-processing documents 112A-112N or templates 131X-131Z is installed in and executed from a central server (e.g. computer 100) and is made available to all computers 182A-182N as a service (i.e. software as a service, or SaaS). This word-processing software is accessible via a browser on computers 182A-182N. One example is the Google Docs office suite offered by Google, Inc., which supports browsers on computers 182A-182N, 184 and 205 in accessing an online word-processing service, as well as a spreadsheet service and a slide presentation service.

In some embodiments, template 131X and word-processing documents 112A-112N are all created using the same word processor (also known as word-processing software). This word processor includes a justification feature (for left, center, or right justification), character formatting features (bold, underline, and italic formats), spell-checking and grammar-checking features, a word counting feature, and a table insertion feature to insert a word-processing table. There is also an optional section insertion feature that can be used to insert a section. The word-processing software is a word processor used to prepare business documents in the usual manner, for example WORD from MICROSOFT CORPORATION. However, some word processors do not have publishing features like kerning or typesetting. Accordingly, in these embodiments word-processing software excludes publishing software such as FRAMEMAKER and ACROBAT, both of which are sold separately by ADOBE SYSTEMS INCORPORATED.

Word 2002, sold by MICROSOFT CORPORATION, is one example of a word-processing program. WORDPERFECT is another example. In these embodiments, the above-described insertion features, namely the section insertion feature and the table insertion feature, can use styles to ensure consistency in the formatting of words in a layout structure. Layout structure J, for example, is a word-processing table that is normally kept by word-processing software in a binary format (e.g. proprietary to Word 2002). Each row of word-processing table J contains a message in a natural language, e.g. English, and the table is included in a word-processing document (e.g. a MICROSOFT Word document) in a file named “documentA.doc”.

Use of document analyzer 134 allows the messages in word-processing table J of “documentA.doc” to be extracted and stored in an RDBMS table 138J, e.g. in a relational database 138 created using the software ORACLE DATABASE 11g from ORACLE CORPORATION. Names of authors in word-processing table B and names of reviewers in word-processing table M of “documentA.doc” are also extracted, and stored in RDBMS tables 138B and 138M. The word-processing document named “documentI.doc” is then analyzed, and each row of text in its tables is extracted and placed into one of the RDBMS tables 138B, 138J, and 138M. In some embodiments, database 138 also contains an RDBMS table 138Z that holds statistics. This table is shared in an illustrative embodiment across all templates and layout structures in computer 100. After all the word-processing documents of group 115X have been analyzed, the RDBMS tables in database 138 are analyzed using SQL queries to generate reports and/or create files in the usual manner (e.g. web pages), as would be obvious to the skilled artisan in view of this disclosure.

Hence, document analyzer 134 offers at least these advantages in the example above: (1) automated analysis of different types of word-processing documents (such as product brochures, functional design documents, and user manuals) to determine content quality; (2) content extraction for design reviews to verify feature completeness; (3) comparison of content from different word-processing documents throughout the software development lifecycle (SDLC), for example, user 183 can use the computer to check that “must have” features in the requirements document were implemented; (4) creation of a content repository for downstream uses (for example, building a repository of product use cases from functional design documents and inserting them into a quality center tool for automating testware creation, increasing accuracy, and decreasing cost significantly); and (5) intelligence collection from different types of word-processing documents (for example, user 183 can use a computer to determine how many product use cases there are in a suite of software, and how many features are within the scope of a current version of the suite of software).

Note that although templates 131X-131Z are used in certain embodiments of the type illustrated in FIG. 1B, certain other embodiments of the invention do not use templates 131X-131Z, as shown in FIG. 1C. In several alternative embodiments (FIG. 1C), users 181A-181C create word-processing documents 117A-117N without using any templates. Word-processing documents 117A-117N can be prepared in a manner similar to word-processing documents 112A-112N, and may even be identical to them, depending on how they are prepared. However, word-processing documents 117A-117N are not considered standardized documents, because documents 117A-117N were not created as copies of any of templates 131X-131Z. Hence, document analyzer 134 described with reference to FIG. 1B is also referred to herein as a standardized documents analyzer (abbreviated SDA), and documents analyzer 141 described below with reference to FIG. 1C is also referred to herein as a structured documents analyzer (also abbreviated SDA).

Because no templates are used in FIG. 1C, user 183 identifies the layout structures and their associated actions directly, thereby to create rules file 133X. Rules file 133X in FIG. 1C is identical or similar to file 133X in FIG. 1B. To select the appropriate action to send via link 154 (FIG. 1C), user 183 retrieves documentation 142 via link 158 from computer 100. Documentation 142 includes the names of the actions that are supported by documents analyzer 141 (FIG. 1C) and a description of each (in a natural human language). User 183 also identifies the associations between each word-processing table and a corresponding RDBMS table in relational database 138, in a property file 139 shown in FIG. 1C, which is identical or similar to file 139 described with reference to FIG. 1B.

Structured documents analyzer 141 of FIG. 1C is identical or similar to the standardized documents analyzer 134 of FIG. 1B, except where noted otherwise. In certain embodiments, some information is stored in files (e.g. rules in rules file 133X and configuration in property file 139), but in other embodiments the above-described information is stored in tables in a relational database (e.g. rules are stored in one table and configuration in another table, both accessed via an RDBMS).

In view of the above description of FIGS. 1A-1C, it will be apparent to the skilled artisan that documents analyzers 134 and 141 according to the invention allow a user to quickly perform analysis of documents by simply preparing an appropriate configuration to operate SDA 134/141, rather than writing new software. SDA 134/141, for example, eliminates the need to create macros in word-processing software (e.g. WORD sold by MICROSOFT CORPORATION) to open and process documents 112I-112N (see group 115X in FIG. 1B). Additionally, macros in word-processing software of the prior art typically record a position on a page where a particular action is to take place, followed by another position at which another action is to be taken, and so forth.

Unlike prior art macros that are position-based, many embodiments of SDA 134/141 according to the invention do not require any pre-recorded positions to perform their actions. Many embodiments of SDA 134/141 instead use a rules file that does not identify any text positions on a page, i.e. the rules file is expressed in a “position-independent” format, as described below. Use of a rules file 133X in a “position-independent” format enables SDA 134/141 to operate without calculating a position in the x-direction (the horizontal direction from the left margin of the page) and in the y-direction (the vertical direction from the top margin) before taking any action. SDA 134/141 performs user-specified actions independent of the positions of layout structures on a printed or displayed page, and use of a position-independent rules file 133X thus makes SDA 134/141 a generic solution to document analysis. Moreover, a property file 139 allows new word-processing layout structures to be mapped to RDBMS tables that the user has created in a relational database, without requiring the user to write any code.

Furthermore, as discussed below, word-processing files with new file extensions (such as “.docy” and “.docz”) can be processed with a single modification to the list of file extensions used by SDA 134/141, which makes it more generic. SDA 134/141 can be used on any type of document (e.g. a functional design document and a user manual are two types of documents), which is an improvement over a prior art tool that only does conversion to XML from a proprietary binary format of MICROSOFT Word. This prior art tool is typically hard-coded to work on only one type of document. To use it with a different type of document, the user must create a DTD, create an XML file, and code new structures. SDA 134/141, on the other hand, can handle any type of document by simply changing rules file 133X to specify new layout structures, without having to modify any of the software code of SDA 134/141.

SDA 134/141 also has the unique feature that documents 112I-112N are prepared in the usual way, using the most widely used word-processing software in the industry, MICROSOFT Word, which may allow absolute positioning of text or images on a page. To the inventors’ knowledge, before the invention of SDA 134/141 the only way to analyze text in layout structures of documents in MICROSOFT Word format was to manually open the word-processing documents one at a time. This required a human to manually take notes on each document and then manually compare the notes between documents, which needs human intelligence.

In multiple embodiments, SDA 134/141 is programmed to support many different types of actions. These actions are performed when a rule in rules file 133X matches a layout structure in document 112I. Rules file 133X usually contains multiple rules. In some embodiments, each rule in rules file 133X is associated with only one action. A user can choose from 10 actions, such as extracting table data or checking for empty table fields. Each rule can also have many parameters, some mandatory and others optional, to allow flexibility in specifying any layout structure, context data, and so on. For example, SDA 134/141 can perform the following three actions: (a) identify documents in which the text within the layout structure matches a user-specified value; (b) check whether the text within the layout structure meets a user-specified condition such as accuracy or completeness; and (c) copy and store the text from the layout structure in a relational table, to be used in SQL queries across documents that are similar (e.g. all of which may have been generated from a common template).

Examples of such actions are now described in reference to a document 112I prepared using a template illustrated in FIG. 2A (which is described in greater detail in the following paragraphs). In these examples, a specific layout structure identified in word-processing document 112I is a word-processing cell that has a heading 211B in its row with the string value “Author”, and that is located in a table 210 identified by table ID 213 with the string value “Document Metadata”. If such a cell is located (e.g. as shown in FIG. 2A), an example of the above-described (a) action is to use the cell to identify documents in which Anish has been named as an author. An example of the above-described (b) action is to check whether the cell is empty and, if it is, to log a message into computer memory. An example of the above-described (c) action is to copy and store the text string from this word-processing cell into a column called “Author” in an RDBMS table that also has a column called “File name” (containing the file name and extension of document 112I).

Note that SDA 134/141 performs any of the actions (a)-(c) only after table 210, identified by the identifier “Document Metadata”, has been located in document 112I and a cell whose header contains the user-specified string (in this example, “Author”) has also been found. If, for example, three rows in table 210 of document 112I all have the same user-specified string (in this case “Author”) as header, SDA 134/141 performs the above-described action on the cell in each of those rows of table 210, because each row has the header “Author”. In this example, a document 112I identifying three authors is thus processed correctly. SDA 134/141 is designed (as discussed herein) to apply rules (specified in a rules file 133X) that identify a user-specified layout structure by comparing sequences of text strings or tags, instead of by identifying a specific position on a page.
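The cell-level example above can be sketched as follows. The document model (a dict of tables, each with an `id` and `(header, cell)` rows), the function names, and the second and third sample values are assumptions for illustration, and action (a) is interpreted here as a search for a user-specified value. The point of the sketch is that cells are located by table identifier and row header, not by page coordinates.

```python
def author_cells(document):
    """Yield the text of each cell whose row header is "Author" in the
    table identified as "Document Metadata"; no page positions are used."""
    for table in document["tables"]:
        if table["id"] != "Document Metadata":
            continue
        for header, cell in table["rows"]:
            if header == "Author":
                yield cell


def run_actions(file_name, document, search_for):
    cells = list(author_cells(document))
    # (a) does any matching cell contain the user-specified value?
    found = any(search_for in cell for cell in cells)
    # (b) log a message for each empty cell
    log = [f"{file_name}: empty Author cell" for cell in cells if not cell.strip()]
    # (c) rows to store under "File name" and "Author" columns
    stored = [(file_name, cell) for cell in cells if cell.strip()]
    return found, log, stored


document = {
    "tables": [
        {"id": "Document Metadata",
         "rows": [("Author", "Anish"), ("Author", ""), ("Author", "Lifang"),
                  ("Version", "1.0")]},
        {"id": "Messages", "rows": [("Author", "not metadata")]},
    ]
}
found, log, stored = run_actions("documentA.doc", document, "Anish")
print(found, log, stored)
```

Because every row headed “Author” is visited, a document with three author rows is handled without any change to the rule.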

According to the inventors, SDA 134/141 has two unique features: (1) SDA 134/141 allows users to process existing and new word-processing templates, and to create a repository from documents based upon the processed templates, without code modifications; and (2) SDA 134/141 allows users to dynamically capture data structures in a relational table, again without code modifications. Instead of code changes, SDA 134/141 requires only configuration changes made by the user, which are much simpler than code changes.

In many embodiments of the type described above, SDA 134/141 provides a unique end-to-end solution that enables users to unlock data and intelligence (previously accessible only at the level of an individual word-processing document) across several word-processing documents, to gain operational, procedural and process efficiencies. According to the inventors, such an end-to-end solution has not been available at any software company (or any other company that uses functional design documents, product brochures or user manuals). To the current inventors’ knowledge, no one has previously been able to extract intelligence by analyzing a collection of documents instead of manually reviewing a single document at a time. As discussed below, the current inventors understood and overcame many challenges in creating such an end-to-end solution.

The following are just four examples of the challenges identified by the current inventors. C1: The ability to handle large files and complex content that contains diagrams, while converting these documents from proprietary formats to generic text formats. C2: The ability to process documents with different structures and content, without having to write new code. Any solution that does not solve this problem is not scalable or acceptable for general use. C3: The ability to meet the requirements of multiple users on the same structure and content of any document. Examples of various types of documents include functional design document, requirements specification, product brochure, and user’s manual. One user may be interested in counting words within a cell in a word-processing document, while another user might need to ensure that certain text is included within the same cell. It has been difficult to address the needs of different users using a generic solution (one that does not require changing code). C4: Complex analytics of the content of different types of word-processing documents using a single solution. Although some analysis can be done at the document level, more complicated analysis is possible using data from multiple types of documents. To the inventors’ knowledge, the problem of storing different data structures from word-processing documents in an RDBMS database had not previously been solved.

The current inventors combined years of computer programming experience with many different technologies to create a generic solution. Some of the innovative solutions used to address the above challenges are as follows. I1: Identify and remove images, and convert the document from its native format into XML. I2: Allow users to specify the content or structure of interest in any document, as input to SDA 134/141, using a simple interface that does not require any technical knowledge. This mechanism allows users to specify the information they are interested in for the type of documents that they want to process. This capability is described in the following sections as a rules generator and a rules file. I3: Allow users to also specify the type of action (extract data, count words in a table column, check for default text, etc.) that they want to perform on a specific type of structure or content in a certain document. According to the current inventors, C2 and C3 were the greatest obstacles that prevented previous attempts from succeeding. These challenges were solved by the current inventors using an innovation implemented as a rules generator and a rules file. I4: To address challenge C4, the current inventors created an interface that allows users to define, in a text format, a mapping between each word-processing document structure and the corresponding RDBMS structure (also known as an RDBMS table), as a property file 139 which is input to SDA 134/141. Because the property file is saved in text format, the user can edit the mapping information with any text editor. SDA 134/141 interprets and consumes the mapping information in text format from property file 139, and performs operations to extract text from word-processing tables and store it into an RDBMS table, as described herein.

The following are some examples of the challenges that the current inventors have recognized and the solutions they have found. H1: Support new templates. It is easier to support only existing templates, which in the prior art is a "hard-coded" solution. Nobody knows which layout structures, in what format and in what order, will be used in a new template. It is difficult to make SDA 134/141 intelligent enough to analyze any new template structure based on words, and to match any documents that are based on it. The current inventors propose using rules that are stored in rules files, created by a rules generator, and applied by an engine in SDA 134/141. H2: Manage any user documents that are based on templates. Even when a template is used, user documents may have different contents. Extracting valuable content and reporting violations is a further challenge. The current inventors propose to capture exceptions for reporting. In all cases, processing continues until the final document is completed. H3: Support multiple versions of the same template. A newer template may refer to the same data elements as an older template by using a new table name or cell name. Table fields may be deleted, updated or added. These changes must all be intelligently reflected in the database output repository and linked together. The current inventors propose using a property file to link new and old names. The maximum table definition must contain a superset of all columns in all templates.

Further examples of the challenges that the current inventors have identified, and the solutions they have found, are as follows. H4: Manage a collection of documents, and identify documents that are not based on any template, documents that are not in sync with the selected rules and template, and documents that are largely based upon a template but have contents that violate the template. The current inventors propose exception handling and reporting. SDA 134/141 intelligently filters out all documents not based upon a user-selected template, processes only good documents, and generates results only from good content. It also reports exceptions at the system, document, and content levels. H5: Manage an arbitrary database repository. Nobody knows the data elements that a new template will include, which makes it difficult to analyze the documents and save the results to a database. The current inventors propose using a property file to link data elements from documents to tables in the database. H6: Manage large volumes of data in complex formats. A document repository can contain gigabytes of data and complex elements, such as large images and nested tables (see C1 and I1 above), and computer memory can be exhausted. The current inventors have programmed SDA 134/141 to identify and remove all images prior to processing. This reduces memory usage, makes bulk processing possible, and produces clean results. SDA 134/141 can also identify complex structures and either process them or report them. In either case, SDA 134/141 continues to crawl the entire repository, and a single result is generated.
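The "capture exceptions and keep going" behavior described above can be sketched as follows. This is only an illustration under stated assumptions: the function name, the shape of the exception records, and the use of `ValueError` to signal a document that violates its template are all hypothetical and not taken from the source.

```python
def analyze_repository(documents, analyze):
    """Apply `analyze` to every (name, content) pair without stopping on bad documents.

    Returns one combined result list plus a list of exception records for reporting.
    """
    results = []
    exceptions = []
    for name, content in documents:
        try:
            results.append((name, analyze(content)))
        except ValueError as exc:
            # Assumed convention: a document that does not match the
            # template raises ValueError; it is reported, not fatal.
            exceptions.append(
                {"level": "document", "document": name, "reason": str(exc)}
            )
    return results, exceptions
```

The point of the design is that one malformed document yields an entry in the exception report while the crawl still reaches the last document and produces a single result.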

FIG. 2A shows a screen of a word processor that has opened a template 131X. Template 131X contains a word-processing table 210 with two columns, 211 and 212. Column 211 contains a number of headings (also known as "row headings") 211A to 211Z, which are arranged vertically relative to each other. In this example, row headings 211A and 211Z are at the top and bottom, respectively. Column 212 contains sample text 212A at its top, in the first row adjacent to row heading 211A. Column 212 also contains sample text 212Z at its bottom, in the last row adjacent to row heading 211Z. Word-processing table 210 is a vertical table because row headings 211A-211Z are arranged vertically in table 210, in a column separate from the contents of table 210.

In this example, a string 213 (also known as a "table identifier") appears immediately before word-processing table 210. Due to their relative locations, string 213 has a semantic relationship with word-processing table 210. This relationship is obvious to human users 181A-181N, but it is not apparent to word-processing software. The relationship is that string 213, because of its physical proximity to word-processing table 210, is an identifier for that table. Document 112A (FIG. 2B) is created by user 181A. A string 223, which immediately precedes table 220 in the sequence of text in document 112A, is identical to string 213, which identifies table 210 in template 131X. Row headings 221A-221Z of document 112A are kept by human users 181A-181N identical to the corresponding row headings 211A-211Z in template 131X. Text 222A-222Z is added to table 220 by overwriting sample text 212A-212Z of template 131X using input from human user 181A.

Similarly, input from user 181I is used to insert text 232A-232Z in table 230 (see FIG. 2C) by replacing sample text 212A-212Z of template 131X. Row headings 231A-231Z and string 233 of document 112I are all identical to the corresponding row headings and string in template 131X. Document 112N (FIG. 2D) is likewise created using the input of user 181N. The screens in FIGS. 2A-2F and 2K are displayed by Word 2002, word-processing software sold by MICROSOFT CORPORATION.

Rules generator 132 writes into rules file 133 a text string 251 that is identical to text string 213, which is adjacent to and immediately precedes table 210 in template 131X (FIG. 2A). Parsing rule 250, prepared by rules generator 132, identifies text string 251 as the table name for table 210. Rules generator 132 also records row heading 211A of template 131X (FIG. 2A) as a name 252 (FIG. 2E) of a cell in table 210. Rules generator 132 likewise includes the additional row headings of table 210, as shown in FIG. 2E.

Rules generator 132 also includes in parsing rule 250 an orientation direction 259, in which the cell headings are arranged in table 210. Orientation direction 259 may be HORIZONTAL or VERTICAL. As described below, tables in a document may also be COMPLEX_HORIZONTAL or COMPLEX_VERTICAL.

Document analyzer 134 processes a table whose orientation direction 259 is HORIZONTAL by treating the text in the first row as headers. It then uses the header text from the first row to process the text in the remaining rows using one or more of the actions identified in parsing rule 250. Document analyzer 134 processes a table in a corresponding way when orientation direction 259 of the table layout structure is VERTICAL, using the headings in the first column. Document analyzer 134 does not process a complex table as a whole if orientation direction 259 is COMPLEX. Instead, document analyzer 134 breaks the table down into HORIZONTAL and VERTICAL sub-tables. These sub-tables are then processed in the same way as the normal HORIZONTAL or VERTICAL tables described at the beginning of this paragraph. Attributes defined in computer memory are then used to assemble the data of these sub-tables into the data of the complex table.

Therefore, in certain embodiments, a complex word-processing table includes both horizontal and vertical sub-tables. The document analyzer uses a divide-and-conquer approach to extract the data from each sub-table and then assembles it using common parent keys and other context data. A COMPLEX_HORIZONTAL is a horizontal sub-table within a complex table. A COMPLEX_VERTICAL is a vertical sub-table within a complex table.
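The orientation handling described above can be sketched in a few lines. This is a minimal illustration, assuming a table has already been reduced to a list of rows of cell text. The function names, the `"parent"` key, and the representation of a complex table as a list of `(orientation, rows)` pairs are assumptions, not details from the source.

```python
def records_from_table(rows, orientation):
    """Turn a table (list of rows of cell text) into a list of {heading: text} records."""
    if orientation == "HORIZONTAL":
        # Headings are in the first row; every later row is one record.
        headers, *body = rows
        return [dict(zip(headers, row)) for row in body]
    if orientation == "VERTICAL":
        # Headings are in the first column; every later column is one record.
        headers = [row[0] for row in rows]
        columns = zip(*(row[1:] for row in rows))
        return [dict(zip(headers, column)) for column in columns]
    raise ValueError(f"unsupported orientation: {orientation}")


def records_from_complex(sub_tables, parent_key):
    """Divide and conquer: process each sub-table, then tag records with a shared parent key."""
    assembled = []
    for orientation, rows in sub_tables:
        base = orientation.replace("COMPLEX_", "", 1)  # COMPLEX_VERTICAL -> VERTICAL
        for record in records_from_table(rows, base):
            assembled.append({"parent": parent_key, **record})
    return assembled
```

For example, `records_from_table([["Name", "Ann"], ["Age", "30"]], "VERTICAL")` yields one record with the keys `Name` and `Age`, which is the same shape a HORIZONTAL table with those two column headers would produce.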

An example of a complex table is:

“Detailed Error Condition and Messages”

Business Actionable
Rule ID Error Condition Preventable by End User Diagnostics
Messages (repeat the rows below, including Tokens, for each message)
Message
Type Message Category Severity
Error Product High
Warning System Medium
Confirmation Security Low
Processing
Information
Message Name
Message Text
User Message Detail
Admin Message
Detail
Cause
User Action
Admin Action
Translator Notes
Tokens (repeat the rows below for each token)
Token
Token Name Type Token Description
Date
Number
Text

Moreover, parsing rule 250 is prepared by rules generator 132 to include a user-selected action identified by string 258, which in this case is "EXTRACT_TABLE_DATA". This action, namely EXTRACT_TABLE_DATA, is to be performed automatically by processor 120 (when executing SDA 134) when the document 112I containing the table is being analyzed. Note that rules file 133 (FIGS. 2E and 2F) does not identify any physical dimensions of table 210 in parsing rule 250. Hence, users 181A-181N can change the physical dimensions of individual cells or of table 210 in its entirety, e.g. by modifying the page's left and/or right margins, without affecting the operation of SDA 134, as explained below.

Specifically, certain embodiments allow tables or table cells to be moved relative to the margins. They can also be placed anywhere in a document, provided that the layout structure and its identifier are kept in a predetermined order. Multiple instances of the same layout structure result in the same action being taken on each instance and the results being captured. The order of the column and row headers within a table can also be changed. For example, one word-processing document may have three columns in the order Name, Age, Gender, while another word-processing document has the same three columns in the order Gender, Name, Age. The document analyzer still captures data from the identified column regardless of its location in the table.
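A short sketch shows why column order does not matter when cells are located by header text rather than by position. The function name is hypothetical, and the case-insensitive comparison of header text is an assumption added here for robustness; the source does not say how headers are matched.

```python
def column_values(rows, column_name):
    """Return the values under `column_name`, wherever that column sits in the table."""
    headers, *body = rows
    wanted = column_name.strip().casefold()
    normalized = [header.strip().casefold() for header in headers]
    if wanted not in normalized:
        raise ValueError(f"column not found: {column_name}")
    index = normalized.index(wanted)
    return [row[index] for row in body]


# The same data laid out with the columns in two different orders.
DOC_ONE = [["Name", "Age", "Gender"], ["Ann", "30", "F"], ["Bob", "41", "M"]]
DOC_TWO = [["Gender", "Name", "Age"], ["F", "Ann", "30"], ["M", "Bob", "41"]]
```

Both `column_values(DOC_ONE, "Age")` and `column_values(DOC_TWO, "Age")` look the column up by its header, so neither depends on where the column was placed.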

FIG. 2I shows a screen that is displayed by computer 184 to user 183 in order to receive user input identifying an action to be taken on a layout structure. In particular, FIG. 2I shows table 210 of template 131X (FIG. 1B) being displayed. A drop-down list box 282 is displayed when cursor 281 hovers over table name 213. User 183 can move the mouse to select one of the elements in list box 282. Each element in list box 282 represents an action that can be performed on table 210 when processor 120 executes standardized document analyzer 134. Rules generator 132 uses such user input to create rules file 133X by associating the user-selected action (e.g. EXTRACT_TABLE_DATA) with the table identifier (e.g. Document Metadata). Note that the set of actions shown in box 282 is different for a layout structure that is a section, because the actions applicable to a section are different, such as actions CHECK_SECTION_EXISTENCE and EXTRACT_SECTION_DATA, described below.

Examples of actions supported by standardized document analyzer 134 are shown in the following table. (As discussed above, certain of these actions that are applicable to a table are shown in drop-down box 282 for selection by a user.)

Action

ACTION_TYPE | Scope | Result
CHECK_SECTION_EXISTENCE | SECTION | It checks if a document has the specified section name in the section hierarchy.
EXTRACT_SECTION_DATA | SECTION | It extracts the text from the specified section.
EXTRACT_TABLE_DATA | TABLE | It extracts specified table data.
COUNT_TABLES | TABLE | It counts total number of instances of VERTICAL or COMPLEX_VERTICAL table structure throughout the document.
COUNT_TABLE_ROWS | TABLE | It counts total number of rows in each instance of HORIZONTAL or COMPLEX_HORIZONTAL table structure throughout the document.
COUNT_EMPTY_TABLE_CELLS | TABLE (cell level) | It counts total number of table cells that are mandatory to have data but empty.
COUNT_TABLE_CELLS_UNDER_CONDITION | TABLE (cell level) | It specifies the CELL_VALUES that a CELL_NAME can have, and group document data into those values, including valid and invalid values.
COUNT_WORDS_IN_TABLE_CELL | TABLE (cell level) | It counts total number of words in specified table cells.
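One way to picture how a rule's ACTION_TYPE selects behavior is a dispatch table. The sketch below covers only four of the actions listed above, and it assumes that a rule is a dictionary with `ACTION_TYPE`, `CELL_NAME` (a list of cell names) and optionally `CELL_VALUES`, and that table data has already been reduced to a list of `{cell name: text}` records. None of these shapes, nor the `"INVALID"` label, comes from the source.

```python
from collections import Counter


def count_table_rows(records, rule):
    return len(records)


def count_empty_table_cells(records, rule):
    # Cells named in the rule are treated as mandatory; blank text counts as empty.
    return sum(
        1
        for record in records
        for name in rule["CELL_NAME"]
        if not record.get(name, "").strip()
    )


def count_words_in_table_cell(records, rule):
    return sum(
        len(record.get(name, "").split())
        for record in records
        for name in rule["CELL_NAME"]
    )


def count_table_cells_under_condition(records, rule):
    # Group cell text by allowed value; anything else falls under "INVALID".
    allowed = set(rule["CELL_VALUES"])
    counts = Counter()
    for record in records:
        for name in rule["CELL_NAME"]:
            value = record.get(name, "").strip()
            counts[value if value in allowed else "INVALID"] += 1
    return dict(counts)


ACTIONS = {
    "COUNT_TABLE_ROWS": count_table_rows,
    "COUNT_EMPTY_TABLE_CELLS": count_empty_table_cells,
    "COUNT_WORDS_IN_TABLE_CELL": count_words_in_table_cell,
    "COUNT_TABLE_CELLS_UNDER_CONDITION": count_table_cells_under_condition,
}


def apply_rule(rule, records):
    """Look up the rule's action by name and run it on the extracted records."""
    return ACTIONS[rule["ACTION_TYPE"]](records, rule)
```

With this layout, supporting a further action means adding one function and one dictionary entry, while rules themselves stay as data.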

Rules file 133X includes TABLE_NAME and SECTION_NAME attributes, which allow processor 120 to determine which layout structures are of interest and are to be used by SDA 134 when processing a rule. In the case of a TABLE layout structure, note that the names of cells are also included in rules file 133X as CELL_NAME attributes of the TABLE, for use by SDA 134 in performing cell-level actions, such as COUNT_EMPTY_TABLE_CELLS. FIG. 2J shows an example of output from SDA 134.

In another screen that rules generator 132 displays to receive user input selecting an action to be performed on a layout structure, top frame 291 shows a template. Left frame 292 allows the user to select any section for which to create rules. Right frame 293 is where the user sets up the rule attributes for a section. The attributes that can be set up vary depending on whether the section contains a table. If the section contains no table, only the SECTION rules and the corresponding attributes can be set up. If there is a table in the section, all rules are available for setup. After looking at the template, the user can choose the table format to use in creating a rule for a particular table. If the rule is at the table field level, all table fields are available. The screen contains all attributes necessary to generate rules. After completing the setup of one rule, the user clicks Add Another Rule to create a new section rule. After all the rules have been set up for each section, the user clicks a button to generate an XML file containing all definitions. This file is sent to computer 100 and can be used by SDA 134.

In certain aspects of the invention, rules generator 132 is used to create the rules that are applied by document analyzer 134. Rules can be created for existing templates or new ones, as well as for existing versions and newer versions of the same template. Documents 112I-112N (FIG. 1B) that are based on these templates can be analyzed by document analyzer 134 without any code changes. All rules are interpreted within the context of the Section Hierarchy of a word-processing file.

Moreover, document analyzer 134 can be responsive to additional attributes applicable to the Section Hierarchy in order to produce context-sensitive and more targeted results. To limit the scope of document analyzer 134's analysis to layout structures within a particular section, the user can specify PARENT_SECTION_NAME. Document analyzer 134 analyzes layout structures in the entire document 112I if there is no value for this attribute. For example, the user may have an Overview table in each of the following parent sections: the "Features in Scope" section and the "Features out of Scope" section. If the user only wants to collect the Overview data for Features in Scope, the user can set TABLE_NAME="Overview" and PARENT_SECTION_NAME="Features in Scope". A second example is when there are multiple copies of the Error Message Table in document 112I, e.g. in all sections. The user might want to know all error messages, regardless of where they are used. To do this, the user omits the attribute PARENT_SECTION_NAME, so that the default value, which is the entire document, is used.
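The scoping effect of PARENT_SECTION_NAME can be illustrated with a small filter. The sketch assumes each table found in a document is represented as a dictionary with a `name` and a `section_path` (the list of enclosing section names, outermost first); that representation and the function name are hypothetical.

```python
def tables_in_scope(tables, table_name, parent_section_name=None):
    """Select tables by name, optionally restricted to those under a given parent section."""
    selected = []
    for table in tables:
        if table["name"] != table_name:
            continue
        # With no PARENT_SECTION_NAME, the scope defaults to the whole document.
        if parent_section_name is None or parent_section_name in table["section_path"]:
            selected.append(table)
    return selected


# Hypothetical document with an Overview table under two different parent sections.
FOUND_TABLES = [
    {"name": "Overview", "section_path": ["Features in Scope"], "rows": 4},
    {"name": "Overview", "section_path": ["Features out of Scope"], "rows": 2},
    {"name": "Error Message Table", "section_path": ["Features in Scope", "Login"], "rows": 7},
]
```

Calling `tables_in_scope(FOUND_TABLES, "Overview", "Features in Scope")` keeps only the first table, while omitting the third argument keeps both Overview tables.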

Likewise, in many embodiments, the user can also specify to document analyzer 134 another attribute, DATA_LEVEL, to indicate how many levels into the data in document 112I the user cares about, starting from a given point. In a Section Hierarchy, for example, there might be different TABLE structures at every level. All these table structures may have some common columns, such as Description or Name. Using DATA_LEVEL, users can ask document analyzer 134 for specific levels or all levels of data. In some cases, TABLE rules also support filters that allow empty records or invalid records either to be captured, in order to report exceptions, or to be excluded, in order to obtain the actual count. In several aspects, document analyzer 134 applies all rules to all of multiple documents and generates a single result.

Although the discussion above in reference to FIGS. 2A-2E has focused on word-processing tables as layout structures of templates, a template can also include a hierarchy of sections as a layout structure, as shown in FIG. 2G. Moreover, FIG. 2H shows a rules file created by invoking rules generator 132 on the template of FIG. 2G. Depending on the embodiment, rules generator 132 may be implemented entirely in server computer 100, entirely in client computer 184, or partially in each of computers 100 and 184. Computer 100 executes rules generator 132 in order to perform the method shown in FIG. 3A.

Specifically, after act 301 (FIG. 3A), computer 100 goes to act 302. In act 302, computer 100 checks whether the template is in a proprietary binary format, e.g. the format of Word 2002, word-processing software from MICROSOFT CORPORATION. For example, computer 100 checks whether the file name extension of the template is ".doc" or ".docx" and, if so, determines that the answer in act 302 is yes and goes to act 303. The ".dot" extension can be checked in the same way in other embodiments. In act 303, computer 100 converts the template, e.g. using a feature of the word-processing program that created it, into a rich format in a markup language readable by humans. Text documents in such a format (an "interoperable format") can be used with different word processors on different platforms. The interoperable format can differ between embodiments. In the above-described example, where Word 2002 is used as the word-processing program, the interoperable format is the Rich Text Format (RTF), published by MICROSOFT CORPORATION.

If the template is already written in a human-readable markup language, then the answer in act 302 is no, act 303 is skipped, and computer 100 performs act 304. Act 304 is also performed on completion of act 303. In act 304, computer 100 converts the template (e.g. now in RTF format) into another markup language that is used for displaying pages and does not have any semantic markup, such as the Extensible Stylesheet Language Formatting Objects, or XSL-FO. W3C defines XSL-FO as a part of its XSL specification, available on the Internet at http://www.w3.org/TR/xsl11/. Additional information is available in a book chapter by Elliotte Rusty Harold, 2001, available at http://www.cafeconleche.org/books/bible2/chapters/ch18.html, which is incorporated by reference herein in its entirety.

An example conversion of the template of FIG. 2A into the XSL-FO format is illustrated in FIGS. 4A-4D. This example uses the Oracle Corporation software BI Publisher 11g to perform the conversion of act 304. However, any other software could be used, depending upon the embodiment. As shown in FIG. 4A, the XSL-FO file contains many properties after the above-described conversion, for example margin-left="30.6 pt", margin-right="30.6 pt", page-height="792.0 pt", page-width="612.0 pt", margin-top="21.6 pt" and margin-bottom="36.0 pt". Such positions and dimensions are not carried into rules file 133X, which, as illustrated in FIG. 2H, is therefore in a "dimensionless" format.

Next, in act 305, computer 100 parses the template in the XSL-FO format to identify text strings that precede a particular type of layout structure, namely tables, and determines that they are table names. Also in act 305, computer 100 determines text strings in the styles used for another type of layout structure, namely section headings, to be section names. In act 306, computer 100 then associates the action chosen by the user (e.g. as shown in FIG. 2K or 2I) with each table and each section. Examples of actions associated in act 306 are extracting text from each cell and checking for text in each section. The identifiers, along with the associated actions, are then written into computer memory in act 307 as a rules file, which is known in certain embodiments as rules file 133X. A dimensionless format, as mentioned above, does not use any of the dimensions that are normally used for laying out a document on a computer monitor or on printed paper, such as font size, distance from the left margin, and distance from the right margin.
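The table-name part of act 305 can be approximated with Python's standard XML parser. This is a rough sketch only: it assumes that a table name is the text of the nearest `fo:block` that precedes an `fo:table` under the same parent, which may not match the structure that a real converter emits. The sample markup is hand-written for illustration and is not the output of any particular product.

```python
import xml.etree.ElementTree as ET

FO = "{http://www.w3.org/1999/XSL/Format}"  # XSL-FO namespace, in ElementTree notation


def find_table_names(xml_text):
    """Return the text of each block that immediately precedes a table among its siblings."""
    root = ET.fromstring(xml_text)
    names = []
    for parent in root.iter():
        previous_text = None
        for child in parent:
            if child.tag == FO + "table":
                if previous_text:
                    names.append(previous_text)
                previous_text = None
            elif child.tag == FO + "block":
                text = "".join(child.itertext()).strip()
                if text:
                    previous_text = text
    return names


# Hand-written sample; note that layout attributes such as margins are never read.
SAMPLE_FO = """
<fo:flow xmlns:fo="http://www.w3.org/1999/XSL/Format">
  <fo:block>Introductory text</fo:block>
  <fo:block>Document Metadata</fo:block>
  <fo:table>
    <fo:table-body>
      <fo:table-row>
        <fo:table-cell><fo:block>Author</fo:block></fo:table-cell>
        <fo:table-cell><fo:block>sample text</fo:block></fo:table-cell>
      </fo:table-row>
    </fo:table-body>
  </fo:table>
</fo:flow>
"""
```

Because only element order and text are inspected, the result is unaffected by page size or margin values, which mirrors the dimensionless character of the rules file described above.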

In certain aspects of the invention, rules file 133X is also transmitted by server computer 100 to client computer 184, as illustrated by an optional act 308 in FIG. 3A. User 183 can modify any number of rules, e.g. by using word-processing software on client computer 184. Computer 184 receives any changes made by user 183 and transmits them to server computer 100. In act 310, computer 100 saves the user-modified rules file 133X for use by document analyzer 134 (as explained above).

Note that acts attributed to document analyzer 134 are performed by computer 100 when executing document analyzer 134. In some embodiments, document analyzer 134 performs the acts shown in FIG. 3B, which are, unless otherwise noted, similar or identical to the acts described in reference to FIG. 3A. In act 311, document analyzer 134 receives an identification of a location, e.g. any URL that is accessible publicly on the Internet or on an intra-corporate website, at which a set of documents to analyze is stored. The above-described location identification can be received from a user, e.g. via a screen of the type shown in FIGS. 3C and 3E.

A repository at the user-specified location contains the group of documents 112I-112N which are to be analyzed. These documents have been created by users 181I-181N (FIG. 1B) as mentioned above (specifically, by users providing keyboard and mouse input to document editing software such as Microsoft Word to replace the sample text or blank text in template 131X with their custom text). In a next act 312, document analyzer 134 crawls the user-identified location received in act 311 in order to find all documents 112I-112N to be analyzed. Each document is then transferred into server computer 100 from the repository at the user-specified location.

In act 312, document analyzer 134 connects to a file server at the user-specified location and transfers each file into a local memory of computer 100. This local memory can be within computer 100 or accessible to it via a local area network (LAN). According to some aspects of the invention, the file transfer can be done using a generic protocol, such as File Transfer Protocol (FTP), to transfer each document 112I-112N to a non-transitory computer-readable medium within computer 100.

After act 312, document analyzer 134 enters a loop in which each document to be analyzed is, in turn, the current document. In particular, in act 313, document analyzer 134 checks whether the current document is in a binary format, e.g. by checking whether the file name extension of a file at the user-specified location is ".doc" or ".docx". In some aspects of the invention, file name extensions are checked prior to file transfer via FTP, so that only files with one of these extensions are transferred from the user-specified location to a non-transitory memory within or accessible by server computer 100.

In one example, files 112I-112N (see group 115X in FIG. 1B) having ".doc" or ".docx" extensions are stored in a binary format proprietary to MICROSOFT Word. Act 314 converts these files 112I-112N to a rich format, such as the Rich Text Format (RTF). This rich format allows a document to be shared between word processors and is therefore also known as an interoperable format. In act 314, document analyzer 134 checks whether the extension of the document is not ".rtf". If the file extension is ".doc" (traditional) or ".docx" (recently introduced), then computer 100 performs a conversion into RTF.

Note that in certain aspects of the invention, new file extensions can be added without affecting the code of document analyzer 134. For example, if MICROSOFT CORPORATION adds new extensions such as ".docy" or ".docz" (hypothetical examples, as there are no such extensions today), files with these extensions are processed in act 313 after simple configuration changes, without any impact on the code of document analyzer 134. If the answer in act 313 is no, document analyzer 134 skips act 314 and goes to act 315.
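The idea that new extensions need only a configuration change can be sketched as below. The configuration keys, the step labels returned by the function, and the function name are all assumptions for illustration; the source does not describe how the configuration is stored.

```python
from pathlib import PurePosixPath

# Hypothetical configuration; in practice this could be loaded from a text file.
DEFAULT_CONFIG = {
    "binary_extensions": [".doc", ".docx"],
    "interoperable_extensions": [".rtf"],
}


def next_step(file_name, config=DEFAULT_CONFIG):
    """Decide what to do with a file based only on its extension and the configuration."""
    extension = PurePosixPath(file_name).suffix.lower()
    if extension in config["binary_extensions"]:
        return "convert_to_rtf"      # proprietary format: convert first
    if extension in config["interoperable_extensions"]:
        return "convert_to_xsl_fo"   # already interoperable: skip the first conversion
    return "skip"                    # not a document type that is analyzed
```

Supporting a new extension then means adding one string to `binary_extensions`, for example `".docy"`, with no change to `next_step` itself.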

Note that in act 314, document analyzer 134 removes any images from the current file and converts it into RTF. In act 315, document analyzer 134 converts the RTF document into a tag-based markup language, using the XSL-FO format. Acts 311-315 can be performed independently of, and without knowledge of, layout structures. Removing images in act 314 improves the performance of document analyzer 134, both in terms of memory usage and in terms of the time taken to process documents. Because of these improvements in performance, document analyzer 134 can be used for bulk processing, i.e. to analyze a group of documents 112I-112N that is an order of magnitude larger (e.g. 10 documents or more) than a single document. Note that the removal of images can change the positions of different layout structures in a document 112I. However, as these positions are not used by document analyzer 134 to determine actions, this change (removal) has no effect on the results, while it increases the performance of document analyzer 134 sufficiently to allow bulk processing.

Note that operation 316 is briefly summarized in this paragraph. Details of an illustrative implementation of operation 316 are shown in FIG. 3C, as described below. In operation 316 (FIG. 3B), document analyzer 134 scans the document to identify each valid layout structure. In certain embodiments, this is done by using style names or tag names resulting from act 315. For example, a tag is used to identify the preceding words of text in a predetermined style (e.g. style-name="Table Heading") as the name of the layout structure (in this case, the table name). Any text can be added between the name of the layout structure and the actual layout structure, but such text must be in a style other than the one used for the name. When document analyzer 134 finds a layout structure, it extracts text from it and then applies the rules of rules file 133X. Document analyzer 134 performs an action when a rule's condition has been satisfied. However, the action is performed not on the entire contents of document 112I but only on the text extracted from the specific layout structure that was found to satisfy the rule's condition.
