WikiWord/scrap

From BrightByte

Analyzer

  • Check out warnings, refine rules
  • Check out table stats, look for unused types
  • more concept types:
    • NUMBER
    • ORGANIZATION
    • CHARACTER
    • UNIT
    • (OTHER_)PROPER_NAME: WORK, BRAND, etc
  • handle lists!
  • ignore very long concept names and terms (and definitions)
  • more test cases, esp en, nl, no
  • Agenda:
    • resume in buildXXX (reset steps!) -> test!
    • flush prompt?!
  • Exclude:
    • Portals/Projects (if not in own namespace - check this!)
    • Shortcuts (WP:xxx)
    • -> badLink pattern
  • fix: term-from-sortkey (?!)
  • decode url-encoded links
  • infer terms for "missing" concepts!
  • min/max length for terms/concepts!
  • TODO: strip left-over [] from definitions
  • {{merge}}...
  • discuss/handle {{duplicate}} or {{merge}}
  • discuss/handle {{otheruses}}
  • NOT in focus:
    • Categories
    • Link-Structure
  • Focus:
    • Semantic Gap
    • Term-Concept
    • Multilang
    • Disambig (?!)
  • Pluggable sentence-splitter
  • Keep "content" of "inline"-templates
  • use sentence/words/links from disambig-line
  • en: ConceptType: ORGANIZATION...
  • keep section links with:colons
  • FIXME: fr inline templates... lots of them...
  • check all warnings
  • eyeball microcorpus...
  • mine style-X disambig: otheruses, disambiglink...


  • mis-spellings: skos:hiddenLabel
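
The "decode url-encoded links" item above could be handled roughly as follows. This is a minimal sketch, not WikiWord's actual code: the function name `normalize_link_target` is hypothetical, and treating underscores as spaces and collapsing whitespace are assumptions about MediaWiki title conventions.

```python
from urllib.parse import unquote


def normalize_link_target(raw):
    """Decode %XX escapes in a wiki link target and tidy its whitespace.

    Hypothetical helper: underscores are treated as spaces, and a
    '#section' suffix is kept but normalized the same way as the page.
    """
    # Only the first '#' separates the page title from the section.
    page, _, section = raw.partition("#")
    # unquote() decodes percent-escapes as UTF-8 by default.
    page = " ".join(unquote(page).replace("_", " ").split())
    section = " ".join(unquote(section).replace("_", " ").split())
    return page + "#" + section if section else page
```

For example, `normalize_link_target("Mont_Blanc#Climbing%20routes")` yields a title with spaces and a readable section name, which can then be matched against the concept table.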

from AI3

From AI3: 99 Wikipedia Sources Aiding the Semantic Web

But much broader data mining and text mining and analysis is being conducted against Wikipedia, that is currently defining the state-of-the-art in these areas, too:

  • Ontology development and categorization
  • Word sense disambiguation
  • Named entity recognition
  • Named entity disambiguation
  • Semantic relatedness and relations.

These objectives, in turn, are mining and extracting these various kinds of structure for these purposes in Wikipedia:

  • Articles
    • First paragraph — Definitions
    • Full text — Description of meaning; related terms; translations
    • Redirects — Synonymy; spelling variations, misspellings; abbreviations
    • Title — Named entities; domain specific terms or senses
    • Subject — Category suggestion (phrase marked in bold or in first paragraph)
    • Section heading — Category suggestions
  • Article links
    • Context — Related terms; co-occurrences
    • Label — Synonyms; spelling variations; related terms
    • Target — Link graph; related terms
    • LinksTo — Category suggestion
    • LinkedBy — Category suggestion
  • Categories
    • Category — Category suggestion
    • Contained articles — Semantically related terms (siblings)
    • Hierarchy — Hyponymic and meronymic relations between terms
  • Disambiguation pages
    • Article links — Sense inventory
  • Infobox Templates
    • Name –
    • Item — Category suggestion; entity suggestion
  • Lists
    • Hyponyms

Clustering

  • Implement similarity clusterator
  • Greedy: Allow multi-concept merge in single round
    • hope: fewer conflicts, get [A, B, C] [D] instead of [A B] [C D]!
  • TRY: collect weakly connected once no more merging of strongly connected concepts is possible
  • Interwiki:
    • multiple targets in the same language from single source (if source merged with cat page or redir)
    • interwiki -> redir
    • interwiki -> disambig
    • interwiki -> cat (!!)

Database

  • "similarity" -> "siblings"
    • make symmetrical in the end!
  • for large updates (idLinks, ...):
    • chunked
    • programmatic
    • file (shell out, or implement...)
    • medusa
  • for each concept, calculate:
    • idf -> (Nakayama, regarding Wikipedia)
    • local generality (indeg/outdeg) -> Muchnik et al.
    • discover hyperonyms by picking very general outlinks (but beware years and units)
  • relations as feature vectors
  • check type conflicts on merge? check out warnings!
  • JobQueue-Workers -> Daemon?!
  • disable keys for clustering? benchmark!
  • global meanings: not needed! (using local meanings via origin-table should be fast enough)
  • in meaning table, flag meanings that come *only* from link-text rule (and show frequency)
    • min-freq for link-text-only meanings
    • drop all stuff below threshold?
  • confidence level (esp for broad/narrow) ?
  • link/reference table for global thesaurus
  • fix statistics for global thesaurus
  • fix table-stats: which schema?!
  • section links:
    • ignore if on-page!
    • link section-concept to "parent"?
  • fill title not listed as term in meanings! (run query to find them - example: de:Seestrandkiefer)
  • eval meaning survey: coverage for different modes (how much is stripped?)
  • RAND for missing!
  • no name for global concepts! preferred label for local concepts!
  • TRY: when building global concepts, exclude UNKNOWN concepts at first, import later! (they have no translations, thus can't be merged!)
  • concept names in Broader table! etc...
  • ConceptDescription.getName
  • FIXME: langlinks -> redirect: need resolve-redirect before buildLangPrep()!
  • TODO: VERIFY import/clustering of micro-corpus!
  • DON'T DELETE WARNINGS TABLE!
  • smart cutoff for getMeanings!
  • include links in conceptinfo!
  • FIXME: "unknown" concepts generated by redirects (and other links) to disambig pages. should probably be ignored! entries in meaning table are misleading!
  • XXX: name clashes between disambig and category -> false positives when deleting bad links from broader-table.
  • Filter terms: remove terms that also apply to broader (or narrower) concepts
  • FIXME: break cycles (with/without leaves)
    • -> alternative algorithm: prune leaves/roots iteratively, until only loops are left.
  • FIXME: make super-root!
    • Note: need "substance" (origin/concept)!
  • TODO: "similar" by simple langlink, "very similar" by langmatch (merge-clash)
  • TODO: "similarity" based on bidi-links -> skos:related
    • check semantics for skos:related - conflict with skos:broaderTransitive?
  • TODO: check consistency!
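
The "prune leaves/roots iteratively" idea from the cycle-breaking item above can be sketched like this. It assumes the broader-table is available as a set of (narrower, broader) pairs; the function name `prune_to_cycles` is hypothetical. Note that nodes lying on a path between two cycles also survive the pruning, so the result is a superset of the actual loops.

```python
def prune_to_cycles(edges):
    """Repeatedly drop edges that touch a root or a leaf.

    edges: set of (source, target) pairs, e.g. (narrower, broader).
    Returns the edges left once no node without incoming or outgoing
    edges remains. An acyclic graph is reduced to the empty set.
    """
    remaining = set(edges)
    while True:
        has_out = {src for src, _ in remaining}
        has_in = {dst for _, dst in remaining}
        # An edge survives only if its source is not a root (it has an
        # incoming edge) and its target is not a leaf (it has an outgoing edge).
        kept = {(s, d) for s, d in remaining if s in has_in and d in has_out}
        if kept == remaining:
            return kept
        remaining = kept
```

Whatever comes back is the candidate set for cycle breaking; everything that was pruned is already cycle-free and needs no further attention.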

Micro Corpus

  • redir, redir->redir, redir->nowhere, redir->disambig
  • disambigs
  • category pages (structured)
  • test:
    • for languages: import concepts, extract text, build concept info, build statistics, build thesaurus
    • for thesaurus: build concept info, build statistics
  • verify:
    • ogle via web
    • check in db: all entities, all relations...
  • Concept: Merge with Category (duped interwikis...)
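
The redirect cases listed above (redir, redir->redir, redir->nowhere) suggest a resolver along these lines. This is a sketch under assumptions: redirects are given as a plain dict and existing pages as a set, and the name `resolve_redirect` and the hop limit are hypothetical. Redirects to disambiguation pages would need an extra check on the resource type, which is not shown here.

```python
def resolve_redirect(title, redirects, pages, max_hops=10):
    """Follow a redirect chain to its final target.

    redirects: dict mapping redirect title -> target title.
    pages: set of titles that exist as real (non-redirect) pages.
    Returns the final title, or None for loops, overly long chains,
    and redirects that end at a missing page.
    """
    seen = {title}
    current = title
    while current in redirects:
        current = redirects[current]
        if current in seen or len(seen) > max_hops:
            return None  # redirect loop or suspiciously long chain
        seen.add(current)
    return current if current in pages else None
```

A micro-corpus test could feed one title per case through this function and compare the results with the expected targets.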

Test Corpus

  • Missing redirects in Mountains-Corpus: wp:en:Gerizim, wp:de:Mount_Blanc
  • include categories
  • include *all* redirects
  • include some disambiguations
  • TRY SIMPLE ENGLISH!

Web Interface

  • names for thesauri -> into title
  • show warnings for given concept/resource
  • corpus matrix: overview
  • show concept-relations:
    • broader, narrower
    • in-links, out-links -> maintain global pagelink table!
    • langlinks?
    • siblings/similar (clustering conflicts)
    • new stuff: cooc, co-cooc, disambig-context (?)
  • fix page: statistics, warnings
  • log:
    • indent (need repeat-macro for yates!)
    • show context
    • hide start/end
    • fix parameters
    • pretty duration
  • Integrate Zipfer
  • concept langlinks: no links? no ids!
  • fix terms for global concept: maintain lang!
  • fix dataset selection
  • fix language-set detection
  • fix justify-terms (in global mode)

Directories

  • Diplomarbeit
    • Paper
    • WikiWord
      • doc
      • src
      • lib
    • Data
      • data..
    • Evaluation
      • data...
  • LICENSE !

Outline

Abstract

Motivation

  • Semantic bootstrapping, semantic dictionary -> Glossary/Thesaurus
    • synonyms (terms), homonyms (meanings), translations
  • Wikipedia is nice:
    • unique IDs
    • disambiguated
    • high quality
    • low redundancy
    • conventions -> recurring patterns
  • Potential users
    • Wortschatz
    • DBPedia
    • OmegaWiki
    • YAGO
    • FreeBase
    • Wikipedia (search feature)

Scope

  • what is it?
    • thesaurus ?
    • semantic lexicon?
    • semantic dictionary?
  • features
    • term (lexeme) <-> concept relation -> homonyms, synonyms
    • translingual concepts -> translations, knowledge transfer
    • heuristics/patterns model per-project conventions
      • give examples of explicit conventions!
      • only per-language info is list of abbreviations for sentence splitting. only for def-extraction
    • extract definition, plain text (icing on the cake)
    • low complexity (calculate!)
    • largely ignores templates -> no problem with maintenance categories
    • only light use of category structure
    • concept types (identify proper names, etc)
    • virtually no stopwords
    • nearly exclusively nouns
    • multi-word phrases, proper nouns/individuals/named entities
    • (common) inflected forms, casual/contextualized forms ("greek")
    • special: "section concepts"
  • others
    • full text corpus stats
    • nlp parsing (extracting semantics)
    • structured data extraction (infoboxes)
    • taxonomy
    • semantic relations
  • Result
    • soft data, some errors, lots of blur
    • no good for reasoning
    • good as context data for
      • disambiguation
      • query expansion
      • etc...
      • needs experiment!
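
The feature list above notes that the only per-language information is a list of abbreviations for sentence splitting during definition extraction. A minimal sketch of such a splitter follows; the function name `split_sentences` and the splitting rule (sentence end = '.', '!' or '?' followed by whitespace) are assumptions for illustration, not the project's actual heuristic.

```python
import re


def split_sentences(text, abbreviations):
    """Split text into sentences, not breaking after known abbreviations.

    abbreviations: per-language set of lower-case tokens including the
    trailing dot, e.g. {"approx.", "etc."}.
    """
    sentences = []
    start = 0
    for match in re.finditer(r"[.!?]\s+", text):
        candidate = text[start:match.start() + 1]
        # The token that ends at this punctuation mark.
        last_token = candidate.split()[-1]
        if last_token.lower() in abbreviations:
            continue  # abbreviation, not a sentence boundary
        sentences.append(candidate.strip())
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences
```

Taking the first element of the result gives a candidate definition sentence; swapping in a different abbreviation set is the only per-language change this sketch needs.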

Architecture

  • UML: storage, import
  • import flow
  • db layout
  • entry points -> use cases
  • parallelization, performance

Heuristics

  • most heuristics model conventions
  • tags, categories, titles, and other patterns
  • style guides:
    • definition first
    • structure of disambig
    • use of sortkey
    • ...
  • unique ID for heuristic
    • at explanation + reference to spec in wikipedia
    • in source code / javadoc

Clustering Algorithms

  • reciprocal links
  • translation set similarity
    • conflict: negative similarity (extreme: neg inf)
    • weight by granularity (project size)?
    • high weight for direct reference to the local concept itself.
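
The translation-set similarity with a conflict penalty could look roughly like this. It is a sketch under assumptions: each concept is represented as a dict from language code to a set of page titles, agreement in a language counts 1, and any language where both concepts have titles but share none counts as a conflict with negative infinity. Weighting by project size or by direct references is left out, and the name `translation_similarity` is hypothetical.

```python
import math


def translation_similarity(a, b):
    """Compare two concepts by their translation sets.

    a, b: dict mapping language code -> set of page titles.
    Returns the number of languages in which both agree, or -inf as soon
    as they disagree in a language that both cover.
    """
    score = 0.0
    for lang in a.keys() & b.keys():
        if a[lang] & b[lang]:
            score += 1.0  # same page in this language
        else:
            return -math.inf  # different pages in the same language
    return score
```

A greedy merge step would then pick the pair with the highest finite score and never merge pairs that score negative infinity.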

Disambig

  • use per-concept context, cooc-frequency, generality
  • inlinks+outlinks, broader+narrower, cooc + co-cooc, siblings, ...
  • eval:
    • map to wordnet synsets / compare
    • compare wordnet/SUMO taxonomy

Data Evaluation

  • eval concepts
    • 50+50 pages per wiki
    • resourceType, conceptType
    • Definition: missing, good, truncated, broken, wrong
      • Stats per type?!
    • (+plain text)
    • (+hyperonyms, langlinks, outlinks)
  • eval terms for each concept:
    • pick random concept -> prefer "common" concepts!
    • exact match
    • inflected
    • capitalized
    • too broad -> abbreviation (contextualization): [[griechische|Griechische Mythologie]] (how common?)
    • too narrow -> generalization: [[griechische Mythologie|Griechenland]] (how common?)
    • extra: personification (Amerika -> Amerikaner; Bass -> Bassist; ...)
    • plain wrong / misleading
    • broken
    • calculate percentage unweighted/weighted by fq/log(fq)
    • with/without terms-from-links
    • with cutoff (min ~2)
    • with cutoff only for link-text-only terms
    • without "missing" terms!
    • TODO: eval individual rules! (use link table)
  • eval retrieval (use as search index)
    • for 50+50 terms per language:
    • list concepts for term
    • compare with gold std
      • WordNet, ResearchCyc
      • wikipedia search
      • google
      • manual
      • WordNet
      • YAGO
    • calculate percentage unweighted/weighted by fq/log(fq)
    • with/without terms-from-links
    • with cutoff (min ~2)
    • with cutoff only for link-text-only terms
  • eval clustering
    • try excerpts: Mountains, Domesticated animals, Popes, ...
    • verify cluster content
    • verify siblings ("similar concepts")
  • TODO: BAD concept! (wrong red link)
  • TODO: check for stat. significance!
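
The "percentage unweighted/weighted by fq/log(fq)" calculation, with an optional frequency cutoff, could be sketched as below. The input format (a list of (rating label, term frequency) pairs), the function name `share_of`, and the label strings are assumptions for illustration; frequencies are assumed to be positive integers. With log weighting, terms of frequency 1 get weight zero.

```python
import math


def share_of(ratings, label, weighting="none", min_freq=1):
    """Share of rated terms carrying the given label.

    ratings: list of (label, frequency) pairs from a manual survey.
    weighting: "none" (every term counts 1), "fq" or "log".
    min_freq: terms below this frequency are ignored (cutoff).
    """
    weigh = {
        "none": lambda fq: 1.0,
        "fq": lambda fq: float(fq),
        "log": lambda fq: math.log(fq),
    }[weighting]
    kept = [(lab, fq) for lab, fq in ratings if fq >= min_freq]
    total = sum(weigh(fq) for _, fq in kept)
    if total == 0:
        return 0.0
    return sum(weigh(fq) for lab, fq in kept if lab == label) / total
```

Running the same survey through all three weightings, with and without `min_freq=2`, gives the comparison columns sketched in the list above.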

Data Export

  • SKOS
  • Topic Maps
  • ...?

Outlook

Disambig

  • use per-concept context, cooc-frequency, generality
  • inlinks+outlinks, broader+narrower, cooc + co-cooc, siblings, ...
  • map to wordnet synsets / compare
  • compare wordnet/SUMO taxonomy

Networks

  • for each: cluster, degree distribution (small world? zipf?)
  • try to construct a hierarchy by clustering (-> similarity via feature-vectors = relations)
  • Net of:
    • links (in/out/undir)
    • cooc, co-cooc
    • coref, co-coref
  • try PageRank, HITS

Information Retrieval

  • use as search index
  • use for QE
  • use for resource selection
  • ...

Conclusion

CLI

  • query interface


Tools

  • Medusa [1]
  • TinyCC [2]
  • Findlinks, Text2Satz, Sentrick (lecture: Textdatenbanken) [3]
  • abbr-detector (lecture: Textdatenbanken, email uq)
  • ASV Toolbox [4]
  • OpenNLP (?) [5]


People

  • Medusa -> Marco Büchler
  • Tomas Wittig
  • S. Bordag
  • Mathias Richter (mathias dot richter at info uni le) -> Wikipedia cooc

Misc

  • Licenses...
    • Thesis: GFDL+CC-BY-SA
    • Program: GPL (libs: also LGPL, BSD, Apache, etc)
