Moving Average Using Proc Expand


Most people are familiar with the phrase "this will kill two birds with one stone." If you're not, the phrase refers to an approach that addresses two objectives in one action. (Unfortunately, the expression itself is rather unpleasant, as most of us don't want to throw stones at innocent animals.) Today I'm going to cover some basics on two great features in SQL Server: the Columnstore index (available only in SQL Server Enterprise) and the SQL Query Store. Microsoft actually implemented the Columnstore index in SQL 2012 Enterprise, though they've enhanced it in the last two releases of SQL Server. Microsoft introduced the Query Store in SQL Server 2016. So what are these features, and why are they important? Well, I have a demo that will introduce both features and show how they can help us. Before I go any further, I also cover this (and other SQL 2016 features) in my CODE Magazine article on the new SQL 2016 features. As a basic introduction, the Columnstore index can help speed up queries that scan and aggregate large amounts of data, and the Query Store tracks query executions, execution plans, and runtime statistics that you'd normally need to collect manually. Trust me when I say these are great features.

For this demo, I'll be using the Microsoft Contoso Retail Data Warehouse demo database. Loosely speaking, Contoso DW is like a really big AdventureWorks, with tables containing millions of rows. (The largest AdventureWorks table contains roughly 100,000 rows at most.) You can download the Contoso DW database here: microsoft.com/en-us/download/details.aspx?id=18279. Contoso DW works very well when you want to test performance on queries against larger tables. Contoso DW contains a standard data warehouse fact table called FactOnlineSales, with 12.6 million rows. That's certainly not the largest data warehouse table in the world, but it's no child's play either. Suppose I want to summarize product sales amounts for 2009 and rank the products. I can query the fact table, join to the product dimension table, and use a RANK function, like so: here's a partial result set of the top 10 rows, by Total Sales. On my laptop (i7, 16 GB of RAM), the query takes between 3 and 4 seconds to run. That might not seem like the end of the world, but some users might expect near-instant results (the way you can see near-instant results when using Excel against an OLAP cube). The only index I currently have on this table is a clustered index on the sales key. If I look at the execution plan, SQL Server makes a suggestion to add a covering index to the table. Now, just because SQL Server suggests an index doesn't mean you should blindly create indexes for every "missing index" message. However, in this instance, SQL Server detects that we're filtering based on the year and using the product key and the sales amount. So SQL Server suggests a covering index, with the DateKey as the index key field. The reason we call this a "covering" index is because SQL Server will carry the non-key fields we used in the query along "for the ride."
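The listing itself isn't reproduced here, so the following is a minimal sketch of the kind of statement being described, assuming the standard ContosoRetailDW column names (FactOnlineSales.DateKey, ProductKey, SalesAmount and DimProduct.ProductName); adjust if your copy of the database differs:

-- Summarize 2009 sales by product and rank the products by total sales.
SELECT  dp.ProductName,
        SUM(f.SalesAmount) AS TotalSales,
        RANK() OVER (ORDER BY SUM(f.SalesAmount) DESC) AS SalesRank
FROM    dbo.FactOnlineSales AS f
        JOIN dbo.DimProduct AS dp ON dp.ProductKey = f.ProductKey
WHERE   f.DateKey >= '20090101' AND f.DateKey < '20100101'   -- the 2009 sales year
GROUP BY dp.ProductName
ORDER BY SalesRank;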
That way, SQL Server doesn't need to touch the table or the clustered index at all; the database engine can simply use the covering index to satisfy the query. Covering indexes are popular in certain data warehouse and reporting scenarios, although the database engine pays a cost to maintain them. Note: covering indexes have been around for a long time, so I still haven't gotten to the Columnstore index and the Query Store. So, I'll add the covering index: if I re-run the same query I ran a moment ago (the one that aggregated the sales amount for each product), the query sometimes seems to run about a second faster, and I get a different execution plan, one that uses an index seek instead of an index scan (using the date key on the covering index to retrieve sales for 2009). So, before the Columnstore index, this could be one way to optimize this query in much older versions of SQL Server. It runs slightly faster than the first one, and I get an execution plan with an Index Seek instead of an Index Scan. However, there are some problems: the two execution operators, "Index Seek" and "Hash Match (Aggregate)," both operate essentially row by row. Imagine that on a table with hundreds of millions of rows. Related to that, think about the contents of a fact table: in this case, a single date key value and a single product key value might be repeated across hundreds of thousands of rows (remember, the fact table also has keys for geography, promotion, salesperson, etc.). So when the Index Seek and Hash Match work row by row, they're doing so over values that might be repeated across many other rows. This is normally where I'd segue to the SQL Server Columnstore index, which offers a way to improve the performance of this query in surprising ways.

But before I do that, let's go back in time. Let's go back to the year 2010, when Microsoft introduced an add-in for Excel known as PowerPivot. Many people probably remember seeing demos of PowerPivot for Excel, where a user could read millions of rows from an external data source into Excel. PowerPivot would compress the data and provide an engine for creating pivot tables and pivot charts that ran at amazing speeds against the compressed data. PowerPivot used an in-memory technology that Microsoft called "VertiPaq." This in-memory technology in PowerPivot would essentially take duplicate business-key/foreign-key values and compress them down to a single vector. The in-memory technology would also scan those values in parallel, in blocks of several hundred at a time. The bottom line is that Microsoft baked a great deal of performance enhancement into the in-memory VertiPaq feature for us to use, right out of the proverbial box. Why am I taking this little stroll down memory lane? Because in SQL Server 2012, Microsoft implemented one of the most important features in the history of its database engine: the Columnstore index. The index is really an index in name only: it's a way to take a SQL Server table and create a compressed, in-memory columnstore that compresses duplicate foreign key values down to single vector values.
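A covering index along the lines of the missing-index hint described above might look roughly like this; the exact key and INCLUDE columns come from SQL Server's own suggestion, so treat the column list as illustrative:

-- DateKey as the key column, with the query's other columns carried along
-- as INCLUDE (non-key) columns so the query never has to touch the base table.
CREATE NONCLUSTERED INDEX IX_FactOnlineSales_DateKey_Covering
    ON dbo.FactOnlineSales (DateKey)
    INCLUDE (ProductKey, SalesAmount);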
Microsoft also created a new buffer pool to read these compressed vector values in parallel, creating the potential for huge performance gains. So I'm going to create a columnstore index on the table, and I'll see how much better (and more efficiently) the query runs, compared to the query that runs against the covering index. I'll create a duplicate copy of FactOnlineSales (I'll call it FactOnlineSalesDetailNCCS), and I'll create a columnstore index on the duplicated table; that way I don't interfere with the original table and its covering index in any way. Next, I create a columnstore index on the new table. Notice several things: I've specified several foreign key columns, as well as the sales amount. Remember that a columnstore index is not like a traditional row-store index. There is no "key." We are simply indicating which columns SQL Server should compress and place in an in-memory columnstore. To use the PowerPivot for Excel analogy, when we create a columnstore index we're asking SQL Server to do essentially the same thing PowerPivot did when we imported 20 million rows into Excel using PowerPivot. So I'll re-run the query, this time using the duplicated FactOnlineSalesDetailNCCS table that contains the columnstore index. This query runs instantly, in less than a second. And I can also say that even if the table had hundreds of millions of rows, it would still run in the proverbial "bat of an eyelash." We could look at the execution plan (and in a few moments, we will), but now it's time to cover the Query Store feature.

Imagine for a moment that we ran both queries overnight: the query that used the regular FactOnlineSales table (with the covering index) and the query that used the duplicated table with the Columnstore index. When we log in the next morning, we'd like to see the execution plan for both queries, as well as the execution statistics. In other words, we'd like to see the same statistics we'd see if we ran both queries interactively in SQL Management Studio, turned on TIME and IO statistics, and viewed the execution plan right after executing each query. Well, that's what the Query Store allows us to do: we can turn on (enable) the Query Store for a database, which triggers SQL Server to store query execution and plan statistics so that we can view them later. So I'll enable the Query Store on the Contoso database with the following command (and I'll also clear out any cache). Then I'll run the two queries (and "pretend" that I ran them hours ago). Now let's pretend they ran hours ago. As I said, the Query Store will capture the execution statistics. So how do I view them? Fortunately, that's quite easy. If I expand the Contoso DW database, I'll see a Query Store folder. The Query Store has tremendous functionality, and I'll try to cover much of it in subsequent blog posts. But for now, I want to view the execution statistics for the two queries, and specifically examine the execution operators for the columnstore index.
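Sketched out, the two statements described in this passage might look like the following; the columnstore column list and the database name (ContosoRetailDW, as downloaded) are my assumptions, not the article's exact listing:

-- Nonclustered columnstore index on the duplicated table: a few foreign keys
-- plus the sales amount, no key, just the columns to compress.
CREATE NONCLUSTERED COLUMNSTORE INDEX NCCS_FactOnlineSalesDetail
    ON dbo.FactOnlineSalesDetailNCCS (ProductKey, DateKey, StoreKey, SalesAmount);

-- Enable the Query Store, then clear the plan and buffer caches so the
-- "overnight" runs start cold.
ALTER DATABASE ContosoRetailDW SET QUERY_STORE = ON;
DBCC FREEPROCCACHE;
DBCC DROPCLEANBUFFERS;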
So I'll right-click on Top Resource Consuming Queries and run that option. This gives me a chart like the one below, where I can see the execution duration (in milliseconds) for all queries that were executed. In this instance, Query 1 was the query against the original table with the covering index, and Query 2 was against the table with the columnstore index. The numbers don't lie: the columnstore index outperformed the original table/covering index by a factor of almost 7 to 1. I can change the metric to look at memory consumption instead. In this case, note that Query 2 (the columnstore index query) used far more memory. This clearly demonstrates why the columnstore index represents "in-memory" technology: SQL Server loads the entire columnstore index into memory and uses a completely different buffer pool, with enhanced execution operators, to process the index.

OK, so we have some charts to view execution statistics; can we see the execution plan (and the execution operators) associated with each execution? Yes, we can. If you click on the vertical bar for the query that used the columnstore index, you'll see the execution plan below. The first thing we see is that SQL Server performed a columnstore index scan, and that represented nearly 100% of the cost of the query. You might be saying, "Wait a minute, the first query used a covering index and performed an index seek, so how can a columnstore index scan be faster?" That's a legitimate question, and fortunately there's an answer. Even though the first query performed an index seek, it still executed row by row. If I hover the mouse over the columnstore index scan operator, I see a tooltip (like the one below) with one important setting: the Execution Mode is BATCH (as opposed to ROW, which is what we had with the first query using the covering index). That BATCH mode tells us that SQL Server is processing the compressed vectors (for any duplicate foreign key values, such as the product key and the date key) in batches of almost 1,000, in parallel. So SQL Server is still able to process the columnstore index much more efficiently. Additionally, if I hover the mouse over the Hash Match (Aggregate) task, I also see that SQL Server is aggregating the columnstore index using Batch mode (although the operator itself represents only a tiny percentage of the query cost). Finally, you might be asking, "OK, so SQL Server compresses the values in the data, treats the values as vectors, and reads them in blocks of almost a thousand values in parallel, but my query only wanted data for 2009. Is SQL Server scanning the entire set of data?" Again, a good question, and the answer is: not really. Fortunately for us, the new columnstore index buffer pool performs another function called "segment elimination." Basically, SQL Server examines the vector values for the date key column in the columnstore index and eliminates the segments that are outside the scope of the year 2009. I'll stop here. In subsequent blog posts I'll cover both the columnstore index and the Query Store in more detail.
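As an aside, the same duration and memory figures that the built-in report shows can also be pulled with T-SQL from the Query Store catalog views; here is a rough sketch (the column choices are mine):

-- Rough equivalent of the "Top Resource Consuming Queries" report, read
-- directly from the Query Store catalog views (durations are in microseconds).
SELECT TOP (10)
       qt.query_sql_text,
       rs.count_executions,
       rs.avg_duration,
       rs.avg_query_max_used_memory
FROM   sys.query_store_runtime_stats rs
       JOIN sys.query_store_plan       p  ON p.plan_id        = rs.plan_id
       JOIN sys.query_store_query      q  ON q.query_id       = p.query_id
       JOIN sys.query_store_query_text qt ON qt.query_text_id = q.query_text_id
ORDER BY rs.avg_duration DESC;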
Essentially, what we saw here today is that the Columnstore index can significantly speed up queries that scan large amounts of data, and the Query Store captures query executions and allows us to examine execution and performance statistics later.

In the end, we'd like to produce a result set that shows the following. Notice three things: the columns essentially pivot out all of the possible Return Reasons, after showing the sales amount; the result set contains subtotals by week-ending date (Sunday) across all customers (where Customer is NULL); and the result set contains a grand total row (where both Customer and Date are NULL). First, before getting into the SQL, we could use the dynamic pivot capability in SSRS. We would simply need to combine the two result sets on one column and then feed the results to the SSRS matrix control, which spreads the return reasons across the columns axis of the report. However, not everyone uses SSRS (though most people should). But even then, developers sometimes need to consume result sets in something other than a reporting tool. So for this example, let's assume we want to generate the result set for a web grid page, and possibly the developer wants to strip out the subtotal rows (where I have a ResultSetNum value of 2 and 3) and place them in a summary grid. So, bottom line, we need to generate the output above directly from a stored procedure. And as an added twist, next week there could be Return Reasons X and Y and Z, so we don't know how many return reasons might exist. We simply want the query to pivot on the possible distinct values for Return Reason. Here is where the T-SQL PIVOT has a restriction: we need to supply it the possible values. Since we won't know those until runtime, we need to generate the query string dynamically using the dynamic SQL pattern. The dynamic SQL pattern involves generating the syntax piece by piece, storing it in a string, and then executing the string at the end. Dynamic SQL can be tricky, since we have to embed syntax inside a string. But in this case, it's our only true option if we want to handle a variable number of return reasons.

I've always found that the best way to build a dynamic SQL solution is to figure out what the "ideal" generated query would be at the end (in this case, given the Return Reasons we know about), and then reverse-engineer it, piecing it together one part at a time. So here is the SQL we'd need if we knew the Return Reasons (A through D) were static and would never change. The query does the following: it combines the data from SalesData with the data from ReturnData, where we "hard-wire" the word Sales as an ActionType from the sales table and then place the Return Reason from the return data in the same ActionType column. That gives us a clean ActionType column to pivot on. We combine the two SELECT statements into a common table expression (CTE), which is basically a derived-table subquery that we subsequently use in the next statement (to PIVOT). Then comes a PIVOT statement against the CTE, which sums the dollars for each ActionType that matches one of the possible ActionType values. Note that this is not the final result set.
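Since the article's listing isn't reproduced here, the following is a sketch of what that "ideal" static query could look like, assuming illustrative dbo.SalesData and dbo.ReturnData tables with WeekEndingDate, Customer, and amount columns, and collapsing the detail rows plus the two subtotal groupings (described in the next paragraph) into a single GROUPING SETS clause for brevity:

;WITH ActionCTE AS
(
    SELECT WeekEndingDate, Customer, SalesAmount AS Dollars,
           'Sales' AS ActionType                  -- hard-wired action type from the sales table
    FROM   dbo.SalesData
    UNION ALL
    SELECT WeekEndingDate, Customer, ReturnAmount,
           ReturnReason                           -- return reasons share the ActionType column
    FROM   dbo.ReturnData
),
PIVOTCTE AS
(
    SELECT WeekEndingDate, Customer,
           [Sales], [Reason A], [Reason B], [Reason C], [Reason D]
    FROM   ActionCTE
    PIVOT (SUM(Dollars) FOR ActionType IN
           ([Sales], [Reason A], [Reason B], [Reason C], [Reason D])) AS p
)
SELECT WeekEndingDate, Customer,
       SUM([Sales])    AS SalesAmount,
       SUM([Reason A]) AS [Reason A], SUM([Reason B]) AS [Reason B],
       SUM([Reason C]) AS [Reason C], SUM([Reason D]) AS [Reason D]
FROM   PIVOTCTE
GROUP BY GROUPING SETS ((WeekEndingDate, Customer), (WeekEndingDate), ());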
We place that into a second CTE that reads from the first CTE. The reason is that we want to perform multiple groupings at the end. The final SELECT statement reads from PIVOTCTE and combines it with a subsequent query against the same PIVOTCTE, where we also implement two groupings using the GROUPING SETS feature in SQL 2008: grouping by week-ending date (dbo.WeekEndingDate), and grouping for all rows (). So, if we knew with certainty that we'd never have more return reason codes, that would be the solution. However, we need to account for other reason codes. So we need to generate that entire query above as one big string, where we construct the possible return reasons as a comma-separated list. I'm going to show the entire T-SQL code to generate (and execute) the desired query, and then I'll break it into parts and explain each step. So first, here's the entire code to dynamically generate what I have above. There are basically five steps we need to cover.

Step 1. We know that somewhere in the mix we need to generate a string for this part of the query: SalesAmount, Reason A, Reason B, Reason C, Reason D. What we can do is build a temporary common table expression that combines the hard-wired SalesAmount column with the unique list of possible reason codes. Once we have that in a CTE, we can use the little FOR XML PATH('') trick to collapse those rows into a single string, put a comma in front of each row the query reads, and then use STUFF to replace the first instance of a comma with an empty string. This is a trick you can find on hundreds of SQL blogs. So this first part builds a string called ActionString that we can use further down.

Step 2. We also know we'll want to SUM the generated reason columns, along with the standard sales column. So we need a separate string for that, which I'll call SumString. I'll simply take the original ActionString and then REPLACE the outer brackets with SUM syntax, plus the original brackets.

Step 3. Now the real work begins. Using that original query as a model, we want to generate the original query (starting with the UNION of the two tables), but replacing any references to the pivoted columns with the strings we generated dynamically above. Additionally, though it's not absolutely necessary, I've also created a variable for whatever carriage-return/line-feed combination we want to embed in the generated query (for readability). So we build the entire query in a variable called SQLPivotQuery.

Step 4. We continue building the query, concatenating the syntax we can hard-wire with ActionSelectString (which we generated dynamically to hold all the possible return reason values).

Step 5. Finally, we generate the last part of the pivot query, which reads from the second common table expression (PIVOTCTE, from the model above) and generates the final SELECT to read from PIVOTCTE and combine it with a second read against PIVOTCTE to implement the grouping sets. Finally, we can execute the string using the sp_executesql stored procedure.
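Here is a condensed sketch of those five steps against the same illustrative tables used above; the variable names mirror the ones mentioned in the text, but the template is simplified, so treat it as an outline rather than the article's actual listing:

DECLARE @ActionString   nvarchar(max),
        @SumString      nvarchar(max),
        @SQLPivotQuery  nvarchar(max),
        @CRLF           nchar(2) = NCHAR(13) + NCHAR(10);

-- Step 1: collapse the hard-wired Sales column plus the distinct return reasons
-- into one comma-separated, bracketed list (STUFF removes the leading comma).
;WITH ColList AS
(
    SELECT 'Sales' AS ColName
    UNION
    SELECT DISTINCT ReturnReason FROM dbo.ReturnData
)
SELECT @ActionString =
       STUFF((SELECT ',' + QUOTENAME(ColName)
              FROM ColList FOR XML PATH('')), 1, 1, '');

-- Step 2: wrap each bracketed column in SUM() via REPLACE, as described above.
SET @SumString = 'SUM(' + REPLACE(@ActionString, ',', '), SUM(') + ')';

-- Steps 3-5: splice the generated strings into the query template and execute it.
SET @SQLPivotQuery =
      N';WITH ActionCTE AS (' + @CRLF
    + N'  SELECT WeekEndingDate, Customer, SalesAmount AS Dollars, ''Sales'' AS ActionType FROM dbo.SalesData' + @CRLF
    + N'  UNION ALL' + @CRLF
    + N'  SELECT WeekEndingDate, Customer, ReturnAmount, ReturnReason FROM dbo.ReturnData),' + @CRLF
    + N'PIVOTCTE AS (' + @CRLF
    + N'  SELECT WeekEndingDate, Customer, ' + @ActionString + @CRLF
    + N'  FROM ActionCTE PIVOT (SUM(Dollars) FOR ActionType IN (' + @ActionString + N')) AS p)' + @CRLF
    + N'SELECT WeekEndingDate, Customer, ' + @SumString + @CRLF
    + N'FROM PIVOTCTE GROUP BY GROUPING SETS ((WeekEndingDate, Customer), (WeekEndingDate), ());';

EXEC sp_executesql @SQLPivotQuery;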
So I hope you can see that the process to follow for this kind of effort is: determine what the final query would be, based on your current set of data and values (that is, build a query model), and then write the T-SQL code necessary to generate that query model as a string. Arguably the most important part is determining the unique set of values on which you'll pivot, and then collapsing them into one string using the STUFF function and the FOR XML PATH('') trick.

So what's on my mind today? Well, at least 13 items. Two summers ago, I wrote a draft BDR that focused (in part) on the role of education and the value of a good liberal arts foundation, not just for the software industry but for other industries as well. One of the themes of that particular BDR emphasized a pivotal and enlightened viewpoint from renowned software architect Allen Holub regarding the liberal arts. I'll (faithfully) paraphrase his message: he highlighted the parallels between programming and the study of history, reminding everyone that history is reading and writing (and, I'll add, identifying patterns), and software development is also reading and writing (and again, identifying patterns). And so I wrote an opinion piece focused on that and other related topics. But to this day, I've never gotten around to editing or publishing it. Every now and then I'd think about revising it, and I'd even sit down for a few minutes and make some adjustments. But then life in general would get in the way and I'd never finish it. So what changed? A few weeks ago, fellow CODE Magazine columnist and industry leader Ted Neward wrote a piece in his regular column, Managed Coder, that caught my attention. The title of the article is On Liberal Arts, and I highly recommend that everyone read it. Ted discusses the value of a liberal arts background, the false dichotomy between a liberal arts background and success in software development, and the need to write and communicate well. He talks about some of his past encounters with HR personnel management regarding his educational background. He also emphasizes the need to accept and adapt to changes in our industry, as well as the traits of a successful software professional (being dependable, planning ahead, and learning to get past initial conflicts with other team members). So it's a great read, as are Ted's other CODE articles and blog entries. It also got me thinking again about my own views on this (and other topics), and it finally motivated me to finish my own editorial. So, better late than never, here are my Baker's Dozen of Reflections.

I have a saying: water freezes at 32 degrees. If you're in a training or mentoring role, you might think you're doing everything in the world to help someone, when in fact they're only feeling a temperature of 34 degrees, and therefore things aren't solidifying for them. Sometimes it takes just a little more effort, or another catalyst or idea, or a new perspective, which means that those with prior education can draw on different sources. Water freezes at 32 degrees. Some people can maintain high levels of concentration even in a room full of noisy people. I'm not one of them; occasionally I need some privacy to think through a critical issue. Some people describe this as "you gotta learn to walk away from it." Stated another way, it's a search for the rarefied air.
Last week I spent hours in a half-lit, quiet room with a whiteboard, until I fully understood a problem. Only then could I talk with other developers about a solution. The message here isn't to preach about how you should go about your business of solving problems, but rather that everyone should know their strengths and what works, and use them to their advantage as much as possible. Some phrases are like nails on a chalkboard for me. "Use it as a teaching moment" is one. (Why is it like nails on a chalkboard? Because if you're in a mentoring role, you should usually be in teaching-moment mode anyway, however subtle.) Here's another: "I can't really explain it in words, but I understand it." That might sound a bit cold, but if a person truly can't explain something in words, maybe they don't understand it. Sure, a person can have a fuzzy sense of how something works; I can bluff my way through describing how a digital camera works, but the truth is I don't really understand it all that well. There's a field of study known as epistemology (the study of knowledge). One of the fundamental bases of understanding, whether it's a camera or a design pattern, is the ability to establish context, to identify the chain of related events, the attributes of each component along the way, and so on. Yes, understanding is sometimes very hard work, but diving into a topic and pulling it apart is worth the effort. Even those who shun certification will acknowledge that the process of studying for certification exams helps fill gaps in knowledge. A database manager is more likely to hire a database developer who can speak extemporaneously (and effortlessly) about transaction isolation levels and triggers, as opposed to someone who knows about them but struggles to describe their use. There's another corollary here. Ted Neward recommends that developers take up public speaking, blogging, etc. I agree 100%. The process of public speaking and blogging will practically force you to start thinking about topics and breaking down definitions you might have taken for granted. A few years ago I thought I understood the T-SQL MERGE statement quite well. But only after writing about it, speaking about it, and fielding questions from people who had perspectives that had never occurred to me did my level of understanding increase exponentially. I know a story about a hiring manager who once interviewed an author/developer for a contract position. The hiring manager was dismissive of publications in general and snapped at the candidate, "So, if you're going to work here, would you rather be writing books or writing code?" Yes, I'll grant that in any industry there will be a few pure academics. But what the hiring manager missed was the opportunity to strengthen and sharpen skill sets. While cleaning out an old box of books, I came across a treasure from the 1980s: Programmers at Work, which contains interviews with a young Bill Gates, Ray Ozzie, and other well-known names. Every interview and every insight is worth the price of the book. In my opinion, the most interesting interview was with Butler Lampson, who gave some powerful advice: To hell with computer literacy. It's absolutely ridiculous. Study mathematics. Learn to think. Read. Write. These things are of more enduring value.
Learn how to prove theorems: a lot of evidence has accumulated over the centuries suggesting that this skill is transferable to many other things. Butler speaks the truth. I'll add to that point: learn how to play devil's advocate against yourself. The more you can reality-check your own processes and work, the better off you'll be. The great computer scientist and author Allen Holub made the connection between software development and the liberal arts, specifically the subject of history. Here was his point: What is history? Reading and writing. What is software development? Among other things, reading and writing. I used to give my students T-SQL essay questions as practice tests. One student joked that I acted more like a law professor. Well, just as coach Don Haskins said in the movie Glory Road, my way is hard. I firmly believe in a strong intellectual foundation for any profession. Just as applications can benefit from frameworks, individuals and their thought processes can benefit from human frameworks as well. That's the fundamental basis of scholarship. There's a story that, back in the 1970s, IBM expanded its recruiting efforts at major universities by focusing on the best and brightest liberal arts graduates. Even then they recognized that the best readers and writers might someday become strong programmer/systems analysts. (Feel free to use that story with any HR type who insists that a candidate must have a computer science degree.) And speaking of history: if for no other reason, it's important to remember the history of product releases. If I'm working at a client site that's still using SQL Server 2008 or even (gasp) SQL Server 2005, I have to remember which features were implemented in which versions over time. Ever have a favorite doctor you liked because he explained things in plain English, gave you the straight truth, and earned your trust to operate on you? Those are mad skills. And they're the result of experience and HARD WORK that take years and even decades to cultivate. There are no guarantees of job success: focus on the facts, take some calculated risks when you're sure you can see your way to the finish line, let the chips fall where they may, and never lose sight of being like that doctor who earned your trust. Even though some days I fall short, I try to treat my clients and their data the way a doctor would treat patients. Even though a doctor makes more money. There are many clichés I detest, but here's one I don't hate: there is no such thing as a bad question. As a former instructor, one thing that drew my ire was hearing someone criticize another person for asking a supposedly stupid question. A question indicates that a person acknowledges they have a gap in knowledge they're looking to fill. Yes, some questions are better worded than others, and some questions require additional framing before they can be answered. But the journey from forming a question to getting an answer is likely to spark an active mental process in others. Those are all good things. Many good and fruitful discussions originate with a stupid question. I work across all the tools in the Microsoft BI stack: SSIS, SSAS, SSRS, MDX, PPS, SharePoint, Power BI, DAX. I still write some .NET code from time to time.
But guess what? I still spend a great deal of time writing T-SQL code to profile data as part of the discovery process. All application developers should have good T-SQL chops. Ted Neward writes (correctly) about the need to adapt to technology changes. I'll add to that the need to adapt to employer and client changes. Companies change business rules. Companies acquire other companies (or become the target of an acquisition). Companies make mistakes in communicating business requirements and specifications. Yes, sometimes we can play a role in helping to manage those changes, and sometimes we're the fly, not the windshield. These changes sometimes cause great pain for everyone, especially the I.T. people. This is why the term "fact of life" exists: we have to deal with it. Just as no developer writes bug-free code every time, no I.T. person handles change well every single time. One of the biggest struggles I've had in my 28 years in this industry is showing patience and restraint when changes are flying in from many different directions. Here is where my earlier suggestion about seeking the rarefied air can help. If you can assimilate changes into your thought process without feeling overwhelmed, chances are you'll be a significant asset. In the last 15 months I've had to deal with a tremendous amount of professional change. It has been very difficult at times, but I've resolved that change will be the norm, and I've tried to adjust my own habits as best I can to cope with frequent (and uncertain) change. It's hard, very hard. But as coach Jimmy Dugan said in the movie A League of Their Own: "Of course it's hard. If it wasn't hard, everyone would do it. The hard is what makes it great." A powerful message.

There's been talk in the industry over the last few years about conduct at professional conferences (and conduct in the industry as a whole). Many respected writers have written very good editorials on the topic. Here's my contribution, for what it's worth. It's a message to those individuals who have chosen to behave badly: Dude, it shouldn't be that hard to behave like an adult. A few years ago, CODE Magazine Chief Editor Rod Paddock made some great points in an editorial about codes of conduct at conferences. It's definitely unfortunate to have to remind people of what they should expect of themselves. But the problems go deeper. A few years ago I sat on a five-person panel (3 women, 2 men) at a community event on Women in Technology. The other male panelist stated that men succeed in this industry because the Y chromosome gives men an advantage in areas of performance. The individual who made these remarks is a highly respected technology expert, and not some bozo making dongle remarks at a conference or sponsoring a programming contest where first prize is a date with a bikini model. Our world is becoming increasingly polarized (just watch the news for five minutes), sadly with emotion often winning out over reason. Even in our industry, I recently heard someone in a position of responsibility bash software tool XYZ based on a ridiculous premise and then give false praise to a competing tool. So many opinions, so many arguments, but here's the key: before taking a stand, do your homework and get the facts. Sometimes both sides are partly right, or wrong. There's only one way to find out: get the facts.
As Robert Heinlein wrote, "Facts are your single clue; get the facts!" Of course, once you get the facts, the next step is to express them in a meaningful and even compelling way. There's nothing wrong with using some emotion in an intellectual debate, but it IS wrong to replace an intellectual debate with emotion and a false agenda. A while back I faced resistance to SQL Server Analysis Services from someone who claimed the tool couldn't do feature XYZ. The specifics of XYZ don't matter here. I spent about two hours that evening working up a demo to cogently demonstrate that the original claim was false. In that example, it worked. I can't swear it will always work, but to me that's the only way. I'm old enough to remember life as a teen in the 1970s. Back then, when a person lost his/her job, (often) it was because the person just wasn't cutting the mustard. Fast-forward to today: a sad fact of life is that even talented people are now losing their jobs because of changing economic conditions. There's never a foolproof method for immunity, but now more than ever it's critical to provide a high level of what I call the Three Vs (value, versatility, and velocity) for your employer/clients. I might not always like working weekends or very late at night to do the proverbial work of two people, but then I remember there are folks out there who would give anything to be working at 1 AM to feed their families and pay their bills. Always be yourself, your BEST self. Some people need inspiration from time to time. Here's mine: the great sports movie, Glory Road. If you've never watched it, and even if you're not a sports fan, I can almost guarantee you'll be moved like never before. And I'll close with this. If you need some major motivation, I'll refer to a story from 2006. Jason McElwain, a high school student with autism, came off the bench to score twenty points in a high school basketball game in Rochester, New York. Here's a great YouTube video. His mother said it all: "This is the first moment Jason has ever succeeded and is proud of himself. I look at autism as the Berlin Wall. He cracked it."

To anyone who wanted to attend my session at today's SQL Saturday event in DC: I apologize that the session had to be cancelled. I hate to make excuses, but a combination of getting back late from Detroit (client trip), a car that's dead (blown head gasket), and some sudden health issues with my wife made it impossible for me to attend. Back in August, I did the same session (ColumnStore Index) for PASS as a webinar. You can go to this link to access the video (it'll be streamed, as all PASS videos are streamed). The link does require that you fill out your name and email address, but that's it. And then you can watch the video. Feel free to contact me if you have questions, at kgoff@kevinsgoff.net.

November 15, 2013: Getting started with Windows Azure and creating SQL Databases in the cloud can be a bit daunting, especially if you've never tried out any of Microsoft's cloud offerings. Fortunately, I've created a webcast to help people get started. This is an absolute beginner's guide to creating SQL Databases under Windows Azure. It assumes zero prior knowledge of Azure. You can go to the BDBI Webcasts section of this website and check out my webcast (dated 11/10/2013). Or you can just download the webcast videos right here: here is part 1 and here is part 2. You can also download the slide deck here.

November 03, 2013: Topic this week: SQL Server Snapshot Isolation Levels, added in SQL Server 2005.
To this day, there are still many SQL developers, many good SQL developers, who either aren't aware of this feature or haven't had time to look at it. Hopefully this information will help. A companion webcast will be uploaded in the next day; look for it in the BDBI Webcasts section of this blog.

October 26, 2013: I'm going to start a weekly post of T-SQL tips, covering many different versions of SQL Server over the years. Here's a challenge many developers face. I'll whittle it down to a very simple example, but one where the pattern applies to many situations. Suppose you have a stored procedure that receives a single vendor ID and updates the freight for all orders with that vendor ID:

create procedure dbo.UpdateVendorOrders @VendorID int
as
   update Purchasing.PurchaseOrderHeader
      set Freight = Freight + 1
      where VendorID = @VendorID

Now, suppose we need to run this for a set of vendor IDs. Today we might run it for three vendors, tomorrow for five vendors, the next day for 100 vendors. We want to pass in the vendor IDs. If you've worked with SQL Server, you can probably guess where I'm going with this. The big question is how we pass a variable number of vendor IDs. Or, stated more generally, how do we pass an array, or a table of keys, to a procedure? Something along the lines of exec dbo.UpdateVendorOrders @SomeListOfVendors. Over the years, developers have come up with different methods. Going all the way back to SQL Server 2000, developers might create a comma-separated list of vendor keys and pass the CSV list as a varchar to the procedure. The procedure would shred the CSV varchar variable into a table variable and then join the PurchaseOrderHeader table to that table variable (to update the Freight for just those vendors in the table). I wrote about this in CoDe Magazine back in early 2005 (code-magazine.com/articleprint.aspx?quickid=0503071&printmode=true, Tip 3). In SQL Server 2005, you could actually create an XML string of the vendor IDs, pass the XML string to the procedure, and then use XQUERY to shred the XML into a table variable. I also wrote about this in CoDe Magazine back in 2007 (code-magazine.com/articleprint.aspx?quickid=0703041&printmode=true, Tip 12). Also, some developers will populate a temp table ahead of time and then reference the temp table inside the procedure. All of these certainly work, and developers have had to use these techniques, because for years there was NO WAY to directly pass a table to a SQL Server stored procedure, until SQL Server 2008, when Microsoft implemented the table type. This FINALLY allowed developers to pass an actual table of rows to a stored procedure.

Now, it does require a few steps. We can't just pass any old table to a procedure. It has to be a pre-defined type (a template). So let's suppose we always want to pass a set of integer keys to different procedures. One day it might be a list of vendor keys. The next day it might be a list of customer keys. So we can create a generic table type of keys, one that can be instantiated for customer keys, vendor keys, etc.:

CREATE TYPE IntKeysTT AS TABLE ( IntKey int NOT NULL )

So I've created a Table Type called IntKeysTT. It's defined to have one column, an IntKey. Now suppose I want to load it with vendors who have a Credit Rating of 1, and then take that list of vendor keys and pass it to a procedure:

DECLARE @VendorList IntKeysTT
INSERT INTO @VendorList
   SELECT BusinessEntityID from Purchasing.Vendor WHERE CreditRating = 1
So, I now have a table type variable, not just any table variable, but a table type variable (that I populated the same way I would populate a normal table variable). It's in server memory (unless it needs to spill to tempdb) and is therefore private to the connection/process. OK, can I pass it to the stored procedure now? Well, not yet: we need to modify the procedure to receive a table type. Here's the code:

create procedure dbo.UpdateVendorOrdersFromTT @IntKeysTT IntKeysTT READONLY
as
   update Purchasing.PurchaseOrderHeader
      set Freight = Freight + 1
      FROM Purchasing.PurchaseOrderHeader
           JOIN @IntKeysTT TempVendorList
             ON PurchaseOrderHeader.VendorID = TempVendorList.IntKey

Notice how the procedure receives the IntKeysTT table type as a Table Type (again, not just a regular table, but a table type). It also receives it as a READONLY parameter. You CANNOT modify the contents of this table type inside the procedure. Usually you won't want to; you simply want to read from it. Well, now you can reference the table type as a parameter and then utilize it in the JOIN statement, as you would any other table variable. So there you have it. A bit of work to set up the table type, but in my view, definitely worth it. Additionally, if you pass values from .NET, you're in luck. You can pass an ADO.NET data table (with the same TableName property as the name of the Table Type) to the procedure. For .NET developers who have had to pass CSV lists, XML strings, etc. to a procedure in the past, this is a huge benefit.

Finally, I want to talk about another approach people have used over the years: SQL Server cursors. At the risk of sounding dogmatic, I strongly advise against cursors unless there is just no other way. Cursors are expensive operations in the server. For instance, someone might use a cursor approach and implement the solution this way:

DECLARE @VendorID int
DECLARE dbcursor CURSOR FAST_FORWARD FOR
   SELECT BusinessEntityID from Purchasing.Vendor where CreditRating = 1
OPEN dbcursor
FETCH NEXT FROM dbcursor INTO @VendorID
WHILE @@FETCH_STATUS = 0
BEGIN
   EXEC dbo.UpdateVendorOrders @VendorID
   FETCH NEXT FROM dbcursor INTO @VendorID
END

The best thing I'll say about this is that it works. And yes, getting something to work is a milestone. But getting something to work and getting something to work acceptably are two different things. Even if this process only takes 5-10 seconds to run, in those 5-10 seconds the cursor utilizes SQL Server resources quite heavily. That's not a good idea in a large production environment. Additionally, the more rows in the cursor to fetch and the more executions of the procedure, the slower it will be. When I ran both processes (the cursor approach and then the table type approach) against a small sampling of vendors (5 vendors), the processing times were 260 ms and 60 ms, respectively. So the table type approach was roughly 4 times faster. But when I ran the two scenarios against a much larger set of vendors (84 vendors), the difference was staggering: 6701 ms versus 207 ms, respectively. So the table type approach was roughly 32 times faster. Again, the CURSOR approach is definitely the least attractive approach. Even in SQL Server 2005, it would have been better to create a CSV list or an XML string (provided the number of keys could be stored in a scalar variable).
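One piece the walkthrough implies but never shows verbatim is the call that actually passes the populated table type variable to the new procedure; assuming the type and procedure defined above, it would look something like this:

-- Populate the table type variable and pass it to the table-type procedure.
DECLARE @VendorList IntKeysTT;
INSERT INTO @VendorList (IntKey)
   SELECT BusinessEntityID FROM Purchasing.Vendor WHERE CreditRating = 1;
EXEC dbo.UpdateVendorOrdersFromTT @IntKeysTT = @VendorList;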
But now that there is a Table Type feature in SQL Server 2008, you can achieve the objective with a feature that's more closely modeled to the way developers are thinking, specifically: how do we pass a table to a procedure? Now we have an answer. Hope you find this feature helpful. Feel free to post a comment.

Mixed Models for Missing Data With Repeated Measures, Part 1
David C. Howell

This is a two-part document. For the second part, go to Mixed-Models-for-Repeated-Measures2.html. When we have a design in which we have both random and fixed variables, we have what is often called a mixed model. Mixed models have begun to play an important role in statistical analysis and offer many advantages over more traditional analyses. At the same time they are more complex, and the syntax for software analysis is not always easy to set up. I will break this paper up into two papers because there are a number of designs and design issues to consider. This document will deal with the use of what are called mixed models (or linear mixed models, or hierarchical linear models, or many other things) for the analysis of what we normally think of as a simple repeated measures analysis of variance. Future documents will deal with mixed models to handle single-subject designs (particularly multiple baseline designs) and nested designs.

A large portion of this document has benefited from Chapter 15 in Maxwell & Delaney (2004), Designing Experiments and Analyzing Data. They have one of the clearest discussions that I know. I am going a step beyond their example by including a between-groups factor as well as a within-subjects (repeated measures) factor. For now my purpose is to show the relationship between mixed models and the analysis of variance. The relationship is far from perfect, but it gives us a known place to start. More importantly, it allows us to see what we gain and what we lose by going to mixed models. In some ways I am going through the Maxwell & Delaney chapter backwards, because I am going to focus primarily on the use of the repeated command in SAS Proc Mixed. I am doing that because it fits better with the transition from ANOVA to mixed models.

My motivation for this document came from a question asked by Rikard Wicksell at Karolinska University in Sweden. He had a randomized clinical trial with two treatment groups and measurements at pre, post, 3 months, and 6 months. His problem is that some of his data were missing. He considered a wide range of possible solutions, including last trial carried forward, mean substitution, and listwise deletion. In some ways listwise deletion appealed most, but it would mean the loss of too much data. One of the nice things about mixed models is that we can use all of the data we have. If a score is missing, it is just missing. It has no effect on other scores from that same patient. Another advantage of mixed models is that we don't have to be consistent about time. For example, and it does not apply in this particular example, if one subject had a follow-up test at 4 months while another had their follow-up test at 6 months, we simply enter 4 (or 6) as the time of follow-up. We don't have to worry that they couldn't be tested at the same intervals. A third advantage of these models is that we do not have to assume sphericity or compound symmetry in the model. We can do so if we want, but we can also allow the model to select its own set of covariances or use covariance patterns that we supply.
I will start by assuming sphericity because I want to show the parallels between the output from mixed models and the output from a standard repeated measures analysis of variance. I will then delete a few scores and show what effect that has on the analysis. I will compare the standard analysis of variance model with a mixed model. Finally, I will use Expectation Maximization (EM) and Multiple Imputation (MI) to impute missing values and then feed the newly complete data back into a repeated measures ANOVA to see how those results compare. (If you want to read about those procedures, I have a web page on them at Missing.html.)

I have created data to have a number of characteristics. There are two groups, a Control group and a Treatment group, measured at 4 times. These times are labeled as 1 (pretest), 2 (one month posttest), 3 (3 months follow-up), and 4 (6 months follow-up). I created the treatment group to show a sharp drop at post-test and then sustain that drop (with slight regression) at 3 and 6 months. The Control group declines slowly over the 4 intervals but does not reach the low level of the Treatment group. There are noticeable individual differences in the Control group, and some subjects show a steeper slope than others. In the Treatment group there are individual differences in level, but the slopes are not all that different from one another. You might think of this as a study of depression, where the dependent variable is a depression score (e.g., Beck Depression Inventory) and the treatment is drug versus no drug. If the drug worked about as well for all subjects, the slopes would be comparable and negative across time. For the control group we would expect some subjects to get better on their own and some to stay depressed, which would lead to differences in slope for that group. These facts are important because when we get to the random coefficient mixed model, the individual differences will show up as variances in intercept, and any slope differences will show up as a significant variance in the slopes. For the standard ANOVA, and for mixed models using the repeated command, the differences in level show up as a Subject effect, and we assume that the slopes are comparable across subjects.

The program and data used below are available at the following links. I explain below the differences between the data files. The results of a standard repeated measures analysis of variance with no missing data and using SAS Proc GLM follow. You would obtain the same results using the SPSS Univariate procedure. Because I will ask for a polynomial trend analysis, I have told it to recode the levels as 0, 1, 3, 6 instead of 1, 2, 3, 4. I did not need to do this, but it seemed truer to the experimental design. It does not affect the standard summary table. (I give the entire data entry part of the program here, but will leave it out in future code.) Here we see that each of the effects in the overall analysis is significant. We don't care very much about the group effect, because we expected both groups to start off equal at pre-test. What is important is the interaction, and it is significant at p < .0001. Clearly the drug treatment is having a differential effect on the two groups, which is what we wanted to see. The fact that the Control group seems to be dropping in the number of symptoms over time is to be expected and not exciting, although we could look at these simple effects if we wanted to. We would just run two analyses, one on each group.
I would not suggest pooling the variances to calculate F, though that would be possible. In the printout above I have included tests on linear, quadratic, and cubic trend that will be important later. However, you have to read this differently than you might otherwise expect. The first test for the linear component shows an F of 54.27 for "mean" and an F of 0.59 for "group." Any other software that I have used would replace "mean" with "Time" and "group" with "Group × Time." In other words, we have a significant linear trend over time, but the linear × group contrast is not significant. I don't know why they label them that way. (Well, I guess I do, but it's not the way that I would do it.) I should also note that my syntax specified the intervals for time, so SAS is not assuming equally spaced intervals. The fact that the linear trend was not significant for the interaction means that both groups are showing about the same linear trend. But notice that there is a significant interaction for the quadratic.

Mixed Model

The use of mixed models represents a substantial difference from the traditional analysis of variance. For balanced designs (which roughly translates to equal cell sizes) the results will come out the same, assuming that we set the analysis up appropriately. But the actual statistical approach is quite different, and ANOVA and mixed models will lead to different results whenever the data are not balanced or whenever we try to use different, and often more logical, covariance structures. First a bit of theory. Within Proc Mixed the repeated command plays a very important role in that it allows you to specify different covariance structures, which is something that you cannot do under Proc GLM. You should recall that in Proc GLM we assume that the covariance matrix meets our sphericity assumption and we go from there. In other words, the calculations are carried out with the covariance matrix forced to sphericity. If that is not a valid assumption, we are in trouble. Of course there are corrections due to Greenhouse and Geisser and Huynh and Feldt, but they are not optimal solutions.

But what does compound symmetry, or sphericity, really represent? (The assumption is really about sphericity, but when speaking of mixed models most writers refer to compound symmetry, which is actually a bit more restrictive.) Most people know that compound symmetry means that the pattern of covariances or correlations is constant across trials. In other words, the correlation between trial 1 and trial 2 is equal to the correlation between trial 1 and trial 4, or trial 3 and trial 4, etc. But a more direct way to think about compound symmetry is to say that it requires that all subjects in each group change in the same way over trials. In other words, the slopes of the lines regressing the dependent variable on time are the same for all subjects. Put that way, it is easy to see that compound symmetry can really be an unrealistic assumption. If some of your subjects improve but others don't, you do not have compound symmetry, and you make an error if you use a solution that assumes that you do. Fortunately, Proc Mixed allows you to specify some other pattern for those covariances. We can also get around the sphericity assumption using the MANOVA output from Proc GLM, but that too has its problems. Both standard univariate GLM and MANOVA GLM will insist on complete data. If a subject is missing even one piece of data, that subject is discarded.
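To make that concrete, compound symmetry for four trials imposes the covariance structure below (this rendering is mine, using the sigma-pi and sigma-e notation that appears later in this document): every off-diagonal covariance, and hence every correlation rho, is identical.

\[
\Sigma_{CS} \;=\;
\begin{pmatrix}
\sigma_\pi^2+\sigma_e^2 & \sigma_\pi^2 & \sigma_\pi^2 & \sigma_\pi^2 \\
\sigma_\pi^2 & \sigma_\pi^2+\sigma_e^2 & \sigma_\pi^2 & \sigma_\pi^2 \\
\sigma_\pi^2 & \sigma_\pi^2 & \sigma_\pi^2+\sigma_e^2 & \sigma_\pi^2 \\
\sigma_\pi^2 & \sigma_\pi^2 & \sigma_\pi^2 & \sigma_\pi^2+\sigma_e^2
\end{pmatrix},
\qquad
\rho = \frac{\sigma_\pi^2}{\sigma_\pi^2+\sigma_e^2}
\]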
That is a problem, because with a few missing observations we can lose a great deal of data and degrees of freedom. Proc Mixed with repeated is different. Instead of using a least squares solution, which requires complete data, it uses a maximum likelihood solution, which does not make that assumption. (We will actually use a Restricted Maximum Likelihood (REML) solution.) When we have balanced data, both least squares and REML will produce the same solution if we specify a covariance matrix with compound symmetry. But even with balanced data, if we specify some other covariance matrix, the solutions will differ. At first I am going to force sphericity by adding type = cs (which stands for compound symmetry) to the repeated statement. I will later relax that structure.

The first analysis below uses exactly the same data as for Proc GLM, though they are entered differently. Here data are entered in what is called long form, as opposed to the wide form used for Proc GLM. This means that instead of having one line of data for each subject, we have one line of data for each observation. So with four measurement times we will have four lines of data for that subject. Because we have a completely balanced design (equal sample sizes and no missing data) and because the time intervals are constant, the results of this analysis will come out exactly the same as those for Proc GLM so long as I specify type = cs. The data follow. I have used card input rather than reading a file just to give an alternative approach. I have put the data in three columns to save space, but the real syntax statements would have 48 lines of data.

The first set of commands plots the results for each individual subject, broken down by group. Earlier we saw the group means over time. Now we can see how each of the subjects stands relative to the means of his or her group. In the ideal world the lines would start out at the same point on the Y axis (i.e., have a common intercept) and move in parallel (i.e., have a common slope). That isn't quite what happens here, but whether those are chance variations or systematic ones is something that we will look at later. We can see in the Control group that a few subjects decline linearly over time, and a few other subjects, especially those with lower scores, decline at first and then increase during follow-up.

Plots (Group 1 = Control, Group 2 = Treatment)

For Proc Mixed we need to specify that group, time, and subject are class variables. (See the syntax above.) This will cause SAS to treat them as factors (nominal or ordinal variables) instead of as continuous variables. The model statement tells the program that we want to treat group and time as a factorial design and generate the main effects and the interaction. (I have not appended a solution to the end of the model statement because I don't want to talk about the parameter estimates of treatment effects at this point, but most people would put it there.) The repeated command tells SAS to treat this as a repeated measures design, that the subject variable is named subj, and that we want to treat the covariance matrix as exhibiting compound symmetry, even though in the data that I created we don't appear to come close to meeting that assumption. The specification rcorr asks for the estimated correlation matrix. (We could use r instead of rcorr, but that would produce a covariance matrix, which is harder to interpret.)
The results of this analysis follow, and you can see that they very much resemble our analysis of variance approach using Proc GLM . On this printout we see the estimated correlations between times. These are not the actual correlations, which appear below, but the estimates that come from an assumption of compound symmetry. That assumption says that the correlations have to be equal, and what we have here are basically average correlations. The actual correlations, averaged over the two groups using Fishers transformation, are: Notice that they are quite different from the ones assuming compound symmetry, and that they dont look at all as if they fit that assumption. We will deal with this problem later. (I dont have a clue why the heading refers to subject 1. It just does) There are also two covariance parameters. Remember that there are two sources of random effects in this design. There is our normal sigma 2 e . which reflects random noise. In addition we are treating our subjects as a random sample, and there is thus random variance among subjects. Here I get to play a bit with expected mean squares. You may recall that the expected mean squares for the error term for the between-subject effect is E(MS win subj ) sigma e 2 asigma pi 2 and our estimate of sigma e 2. taken from the GLM analysis, is MS residual . which is 2760.6218. The letter a stands for the number of measurement times 4, and MS subj win grps 12918.0663, again from the GLM analysis. Therefore our estimate of sigma pi 2 (12918.0663 2760.6218)4 2539.36. These two estimates are our random part of the model and are given in the section headed Covariance Parameter Estimates. I dont see a situation in this example in which we would wish to make use of these values, but in other mixed designs they are useful. You may notice one odd thing in the data. Instead of entering time as 1,2, 3, 4, I entered it as 0, 1, 3, and 6. If this were a standard ANOVA it wouldnt make any difference, and in fact it doesnt make any difference here, but when we come to looking at intercepts and slopes, it will be very important how we designated the 0 point. We could have centered time by subtracting the mean time from each entry, which would mean that the intercept is at the mean time. I have chosen to make 0 represent the pretest, which seems a logical place to find the intercept. I will say more about this later. Missing Data I have just spent considerable time discussing a balanced design where all of the data are available. Now I want to delete some of the data and redo the analysis. This is one of the areas where mixed designs have an important advantage. I am going to delete scores pretty much at random, except that I want to show a pattern of different observations over time. It is easiest to see what I have done if we look at data in the wide form, so the earlier table is presented below with . representing missing observations. It is important to notice that data are missing completely at random, not on the basis of other observations. If we treat this as a standard repeated measures analysis of variance, using Proc GLM . we have a problem. Of the 24 cases, only 17 of them have complete data. That means that our analysis will be based on only those 17 cases. Aside from a serious loss of power, there are other problems with this state of affairs. Suppose that I suspected that people who are less depressed are less likely to return for a follow-up session and thus have missing data. 
To build that into the example I could deliberately have deleted data from those who scored low on depression to begin with, though I kept their pretest scores. (I did not actually do this here.) Further suppose that people low in depression respond to treatment (or non-treatment) in different ways from those who are more depressed. By deleting whole cases I will have deleted low depression subjects and that will result in biased estimates of what we would have found if those original data points had not been missing. This is certainly not a desirable result. To expand slightly on the previous paragraph, if we using Proc GLM . or a comparable procedure in other software, we have to assume that data are missing completely at random, normally abbreviated MCAR. (See Howell, 2008.) If the data are not missing completely at random, then the results would be biased. But if I can find a way to keep as much data as possible, and if people with low pretest scores are missing at one or more measurement times, the pretest score will essentially serve as a covariate to predict missingness. This means that I only have to assume that data are missing at random (MAR) rather than MCAR. That is a gain worth having. MCAR is quite rare in experimental research, but MAR is much more common. Using a mixed model approach requires only that data are MAR and allows me to retain considerable degrees of freedom. (That argument has been challenged by Overall Tonidandel (2007), but in this particular example the data actually are essentially MCAR. I will come back to this issue later.) Proc GLM results The output from analyzing these data using Proc GLM follows. I give these results just for purposes of comparison, and I have omitted much of the printout. Notice that we still have a group effect and a time effect, but the F for our interaction has been reduced by about half, and that is what we care most about. (In a previous version I made it drop to nonsignificant, but I relented here.) Also notice the big drop in degrees of freedom due to the fact that we now only have 17 subjects. Proc Mixed Now we move to the results using Proc Mixed . I need to modify the data file by putting it in its long form and to replacing missing observations with a period, but that means that I just altered 9 lines out of 96 (10 of the data) instead of 7 out of 24 (29). The syntax would look exactly the same as it did earlier. The presence of time on the repeated statement is not necessary if I have included missing data by using a period, but it is needed if I just remove the observation completely. (At least that is the way I read the manual.) The results follow, again with much of the printout deleted: This is a much nicer solution, not only because we have retained our significance levels, but because it is based on considerably more data and is not reliant on an assumption that the data are missing completely at random. Again you see a fixed pattern of correlations between trials which results from my specifying compound symmetry for the analysis. Other Covariance Structures To this point all of our analyses have been based on an assumption of compound symmetry. (The assumption is really about sphericity, but the two are close and Proc Mixed refers to the solution as type cs.) But if you look at the correlation matrix given earlier it is quite clear that correlations further apart in time are distinctly lower than correlations close in time, which sounds like a reasonable result. 
Also if you looked at Mauchlys test of sphericity (not shown) it is significant with p .012. While this is not a great test, it should give us pause. We really ought to do something about sphericity. The first thing that we could do about sphericity is to specify that the model will make no assumptions whatsoever about the form of the covariance matrix. To do this I will ask for an unstructured matrix. This is accomplished by including type un in the repeated statement. This will force SAS to estimate all of the variances and covariances and use them in its solution. The problem with this is that there are 10 things to be estimated and therefore we will lose degrees of freedom for our tests. But I will go ahead anyway. For this analysis I will continue to use the data set with missing data, though I could have used the complete data had I wished. I will include a request that SAS use procedures due to Hotelling-Lawley-McKeon (hlm) and Hotelling-Lawley-Pillai-Samson (hlps) which do a better job of estimating the degrees of freedom for our denominators. This is recommended for an unstructured model. The results are shown below. Results using unstructured matrix Notice the matrix of correlations. From pretest to the 6 month follow-up the correlation with pretest scores has dropped from .46 to -.03, and this pattern is consistent. That certainly doesnt inspire confidence in compound symmetry. The F s have not changed very much from the previous model, but the degrees of freedom for within-subject terms have dropped from 57 to 22, which is a huge drop. That results from the fact that the model had to make additional estimates of covariances. Finally, the hlm and hlps statistics further reduce the degrees of freedom to 20, but the effects are still significant. This would make me feel pretty good about the study if the data had been real data. But we have gone from one extreme to another. We estimated two covariance parameters when we used type cs and 10 covariance parameters when we used type un. (Put another way, with the unstructured solution we threw up our hands and said to the program You figure it out We dont know whats going on. There is a middle ground (in fact there are many). We probably do know at least something about what those correlations should look like. Often we would expect correlations to decrease as the trials in question are further removed from each other. They might not decrease as fast as our data suggest, but they should probably decrease. An autoregressive model, which we will see next, assumes that correlations between any two times depend on both the correlation at the previous time and an error component. To put that differently, your score at time 3 depends on your score at time 2 and error. (This is a first order autoregression model. A second order model would have a score depend on the two previous times plus error.) In effect an AR(1) model assumes that if the correlation between Time 1 and Time 2 is .51, then the correlation between Time 1 an d Time 3 has an expected value of .512 2 .26 and between Time 1 and Time 4 has an expected value of .513 3 .13. Our data look reasonably close to that. (Remember that these are expected values of r . not the actual obtained correlations.) The solution using a first order autoregressive model follows. Notice the pattern of correlations. The .6182 as the correlation between adjacent trials is essentially an average of the correlations between adjacent trials in the unstructured case. 
The .3822 is just .61822 2 and .2363 .61823 3. Notice that tests on within-subject effects are back up to 57 df, which is certainly nice, and our results are still significant. This is a far nicer solution than we had using Proc GLM . Now we have three solutions, but which should we choose One aid in choosing is to look at the Fit Statistics that are printed out with each solution. These statistics take into account both how well the model fits the data and how many estimates it took to get there. Put loosely, we would probably be happier with a pretty good fit based on few parameter estimates than with a slightly better fit based on many parameter estimates. If you look at the three models we have fit for the unbalanced design you will see that the AIC criterion for the type cs model was 909.4, which dropped to 903.7 when we relaxed the assumption of compound symmetry. A smaller AIC value is better, so we should prefer the second model. Then when we aimed for a middle ground, by specifying the pattern or correlations but not making SAS estimate 10 separate correlations, AIC dropped again to 899.1. That model fit better, and the fact that it did so by only estimating a variance and one correlation leads us to prefer that model. SPSS Mixed You can accomplish the same thing using SPSS if you prefer. I will not discuss the syntax here, but the commands are given below. You can modify this syntax by replacing CS with UN or AR(1) if you wish. (A word of warning. For some reason SPSS has changed the way it reads missing data. In the past you could just put in a period and SPSS knew that was missing. It no longer does so. You need to put in something like -99 and tell it that -99 is the code for missing. While Im at it, they changed something else. In the past it distinguished one value from another by looking for white space. Thus if there were a tab, a space, 3 spaces, a space and a tab, or whatever, it knew that it had read one variable and was moving on to the next. NOT ANYMORE I cant imagine why they did it, but for some ways of readig the data, if you put two spaces in your data file to keep numbers lined up vertically, it assumes that the you have skipped a variable. Very annoying. Just use one space or one tab between entries.) Analyses Using R The following commands will run the same analysis using the R program (or using S-PLUS). The results will not be exactly the same, but they are very close. Lines beginning with are comments. In revising this version I found the following reference just stuck in the middle of nowhere. I dont recall why I did that, but Bodo Winter has an excellent page that I recommend that you look at. The link is bodowintertutorialbwLMEtutorial2.pdf. Where do we go now This document is sufficiently long that I am going to create a new one to handle this next question. In that document we will look at other ways of doing much the same thing. The reason why I move to alternative models, even though they do the same thing, is that the logic of those models will make it easier for you to move to what are often called single-case designs or multiple baseline designs when we have finished with what is much like a traditional analysis of variance approach to what we often think of as traditional analysis of variance designs. References Guerin, L. and W. W. Stroup. 2000. A simulation study to evaluate PROC MIXED analysis of repeated measures data. P. 170-203. In Proc. 12th Kansas State Univ. Conf. on Applied Statistics in Agriculture. Kansas State Univ. Manhattan. 
When we have a design in which we have both random and fixed variables, we have what is often called a mixed model. Mixed models have begun to play an important role in statistical analysis and offer many advantages over more traditional analyses. At the same time they are more complex, and the syntax for software analysis is not always easy to set up. I will break this paper up into two papers because there are a number of designs and design issues to consider. This document will deal with the use of what are called mixed models (or linear mixed models, or hierarchical linear models, or many other things) for the analysis of what we normally think of as a simple repeated measures analysis of variance. Future documents will deal with mixed models to handle single-subject designs (particularly multiple baseline designs) and nested designs.

A large portion of this document has benefited from Chapter 15 in Maxwell and Delaney (2004), Designing Experiments and Analyzing Data. They have one of the clearest discussions that I know. I am going a step beyond their example by including a between-groups factor as well as a within-subjects (repeated measures) factor. For now my purpose is to show the relationship between mixed models and the analysis of variance. The relationship is far from perfect, but it gives us a known place to start. More importantly, it allows us to see what we gain and what we lose by going to mixed models. In some ways I am going through the Maxwell and Delaney chapter backwards, because I am going to focus primarily on the use of the repeated command in SAS Proc Mixed. I am doing that because it fits better with the transition from ANOVA to mixed models.

My motivation for this document came from a question asked by Rikard Wicksell at Karolinska University in Sweden.
He had a randomized clinical trial with two treatment groups and measurements at pre, post, 3 months, and 6 months. His problem was that some of his data were missing. He considered a wide range of possible solutions, including last trial carried forward, mean substitution, and listwise deletion. In some ways listwise deletion appealed most, but it would mean the loss of too much data.

One of the nice things about mixed models is that we can use all of the data we have. If a score is missing, it is just missing. It has no effect on other scores from that same patient. Another advantage of mixed models is that we don't have to be consistent about time. For example (and it does not apply in this particular example), if one subject had a follow-up test at 4 months while another had their follow-up test at 6 months, we simply enter 4 (or 6) as the time of follow-up. We don't have to worry that they couldn't be tested at the same intervals. A third advantage of these models is that we do not have to assume sphericity or compound symmetry in the model. We can do so if we want, but we can also allow the model to select its own set of covariances or use covariance patterns that we supply.

I will start by assuming sphericity because I want to show the parallels between the output from mixed models and the output from a standard repeated measures analysis of variance. I will then delete a few scores and show what effect that has on the analysis. I will compare the standard analysis of variance model with a mixed model. Finally, I will use Expectation Maximization (EM) and Multiple Imputation (MI) to impute missing values and then feed the newly complete data back into a repeated measures ANOVA to see how those results compare. (If you want to read about those procedures, I have a web page on them at Missing.html.)

I have created data to have a number of characteristics. There are two groups - a Control group and a Treatment group - measured at 4 times. These times are labeled as 1 (pretest), 2 (one month posttest), 3 (3 months follow-up), and 4 (6 months follow-up). I created the treatment group to show a sharp drop at post-test and then sustain that drop (with slight regression) at 3 and 6 months. The Control group declines slowly over the 4 intervals but does not reach the low level of the Treatment group. There are noticeable individual differences in the Control group, and some subjects show a steeper slope than others. In the Treatment group there are individual differences in level, but the slopes are not all that much different from one another. You might think of this as a study of depression, where the dependent variable is a depression score (e.g., Beck Depression Inventory) and the treatment is drug versus no drug. If the drug worked about as well for all subjects, the slopes would be comparable and negative across time. For the control group we would expect some subjects to get better on their own and some to stay depressed, which would lead to differences in slope for that group. These facts are important because when we get to the random coefficient mixed model the individual differences will show up as variances in intercept, and any slope differences will show up as a significant variance in the slopes. For the standard ANOVA, and for mixed models using the repeated command, the differences in level show up as a Subject effect, and we assume that the slopes are comparable across subjects.

The program and data used below are available at the following links.
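To make the analysis concrete, here is a minimal sketch of how the wide-form Proc GLM analysis described just below might be set up. The data set and variable names (wide, group, dv0, dv1, dv3, dv6) are my own placeholders rather than the names used in the actual program, which is available at the links above.

   /* Wide-form repeated measures ANOVA with polynomial trend tests.
      The time levels are given explicitly as 0, 1, 3, 6 so that SAS
      does not assume equally spaced intervals. */
   proc glm data=wide;
      class group;
      model dv0 dv1 dv3 dv6 = group / nouni;
      repeated time 4 (0 1 3 6) polynomial / summary;
   run;

The summary option is what produces the separate tests on the linear, quadratic, and cubic components that are discussed below.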
I explain below the differences between the data files.

The results of a standard repeated measures analysis of variance with no missing data, using SAS Proc GLM, follow. You would obtain the same results using the SPSS Univariate procedure. Because I will ask for a polynomial trend analysis, I have told it to recode the levels as 0, 1, 3, 6 instead of 1, 2, 3, 4. I did not need to do this, but it seemed truer to the experimental design. It does not affect the standard summary table. (I give the entire data entry part of the program here, but will leave it out in future code.)

Here we see that each of the effects in the overall analysis is significant. We don't care very much about the group effect because we expected both groups to start off equal at pre-test. What is important is the interaction, and it is significant at p < .0001. Clearly the drug treatment is having a differential effect on the two groups, which is what we wanted to see. The fact that the Control group seems to be dropping in the number of symptoms over time is to be expected and not exciting, although we could look at these simple effects if we wanted to. We would just run two analyses, one on each group. I would not suggest pooling the variances to calculate F, though that would be possible.

In the printout above I have included tests on linear, quadratic, and cubic trend that will be important later. However, you have to read this differently than you might otherwise expect. The first test for the linear component shows an F of 54.27 for "mean" and an F of 0.59 for "group." Any other software that I have used would label "mean" as Time and "group" as Group x Time. In other words, we have a significant linear trend over time, but the linear x group contrast is not significant. I don't know why they label them that way. (Well, I guess I do, but it's not the way that I would do it.) I should also note that my syntax specified the intervals for time, so SAS is not assuming equally spaced intervals. The fact that the linear trend was not significant for the interaction means that both groups are showing about the same linear trend. But notice that there is a significant interaction for the quadratic.

Mixed Model

The use of mixed models represents a substantial difference from the traditional analysis of variance. For balanced designs (which roughly translates to equal cell sizes) the results will come out to be the same, assuming that we set the analysis up appropriately. But the actual statistical approach is quite different, and ANOVA and mixed models will lead to different results whenever the data are not balanced or whenever we try to use different, and often more logical, covariance structures.

First a bit of theory. Within Proc Mixed the repeated command plays a very important role in that it allows you to specify different covariance structures, which is something that you cannot do under Proc GLM. You should recall that in Proc GLM we assume that the covariance matrix meets our sphericity assumption and we go from there. In other words, the calculations are carried out with the covariance matrix forced to sphericity. If that is not a valid assumption, we are in trouble. Of course there are corrections due to Greenhouse and Geisser and Huynh and Feldt, but they are not optimal solutions.
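The next paragraph asks what compound symmetry actually requires. As a visual reminder (my own illustration, not part of the original output), a compound-symmetric correlation matrix for four measurement times has a single common correlation everywhere off the diagonal:

$$
\mathbf{R}_{\text{CS}} =
\begin{pmatrix}
1 & \rho & \rho & \rho \\
\rho & 1 & \rho & \rho \\
\rho & \rho & 1 & \rho \\
\rho & \rho & \rho & 1
\end{pmatrix}
$$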
But what does compound symmetry, or sphericity, really represent? (The assumption is really about sphericity, but when speaking of mixed models most writers refer to compound symmetry, which is actually a bit more restrictive.) Most people know that compound symmetry means that the pattern of covariances or correlations is constant across trials. In other words, the correlation between trial 1 and trial 2 is equal to the correlation between trial 1 and trial 4, or trial 3 and trial 4, etc. But a more direct way to think about compound symmetry is to say that it requires that all subjects in each group change in the same way over trials. In other words, the slopes of the lines regressing the dependent variable on time are the same for all subjects. Put that way, it is easy to see that compound symmetry can really be an unrealistic assumption. If some of your subjects improve but others don't, you do not have compound symmetry, and you make an error if you use a solution that assumes that you do. Fortunately, Proc Mixed allows you to specify some other pattern for those covariances.

We can also get around the sphericity assumption using the MANOVA output from Proc GLM, but that too has its problems. Both standard univariate GLM and MANOVA GLM will insist on complete data. If a subject is missing even one piece of data, that subject is discarded. That is a problem because with a few missing observations we can lose a great deal of data and degrees of freedom. Proc Mixed with repeated is different. Instead of using a least squares solution, which requires complete data, it uses a maximum likelihood solution, which does not make that assumption. (We will actually use a Restricted Maximum Likelihood (REML) solution.) When we have balanced data, both least squares and REML will produce the same solution if we specify a covariance matrix with compound symmetry. But even with balanced data, if we specify some other covariance matrix the solutions will differ.

At first I am going to force sphericity by adding type = cs (which stands for compound symmetry) to the repeated statement. I will later relax that structure. The first analysis below uses exactly the same data as for Proc GLM, though they are entered differently. Here data are entered in what is called long form, as opposed to the wide form used for Proc GLM. This means that instead of having one line of data for each subject, we have one line of data for each observation. So with four measurement times we will have four lines of data for each subject. Because we have a completely balanced design (equal sample sizes and no missing data) and because the time intervals are constant, the results of this analysis will come out exactly the same as those for Proc GLM so long as I specify type = cs.

The data follow. I have used card input rather than reading a file just to give an alternative approach. I have put the data in three columns to save space, but the real syntax statements would have 48 lines of data.

The first set of commands plots the results of each individual subject broken down by groups. Earlier we saw the group means over time. Now we can see how each of the subjects stands relative to the means of his or her group. In the ideal world the lines would start out at the same point on the Y axis (i.e., have a common intercept) and move in parallel (i.e., have a common slope). That isn't quite what happens here, but whether those are chance variations or systematic ones is something that we will look at later.
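Here is a rough sketch of the two sets of commands just described - the plotting step and the Proc Mixed analysis that the next paragraphs refer to ("see the syntax above"). The data set and variable names (long, dv, group, time, subj) are my own placeholders, and the plotting step uses proc sgpanel, which may well not be what the original program used.

   /* Plot each subject's scores over time, panelled by group. */
   proc sgpanel data=long;
      panelby group;
      series x=time y=dv / group=subj;
   run;

   /* Repeated measures mixed model, forcing compound symmetry.
      rcorr prints the correlation matrix implied by that structure. */
   proc mixed data=long;
      class group time subj;
      model dv = group time group*time;
      repeated time / subject = subj type = cs rcorr;
   run;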
We can see in the Control group that a few subjects decline linearly over time, and a few other subjects, especially those with lower scores, decline at first and then increase during follow-up.

Plots (Group 1 = Control, Group 2 = Treatment)

For Proc Mixed we need to specify that group, time, and subject are class variables. (See the syntax above.) This will cause SAS to treat them as factors (nominal or ordinal variables) instead of as continuous variables. The model statement tells the program that we want to treat group and time as a factorial design and generate the main effects and the interaction. (I have not appended a solution option to the end of the model statement because I don't want to talk about the parameter estimates of treatment effects at this point, but most people would put it there.) The repeated command tells SAS to treat this as a repeated measures design, that the subject variable is named subj, and that we want to treat the covariance matrix as exhibiting compound symmetry, even though in the data that I created we don't appear to come close to meeting that assumption. The specification rcorr will ask for the estimated correlation matrix. (We could use r instead of rcorr, but that would produce a covariance matrix, which is harder to interpret.)

The results of this analysis follow, and you can see that they very much resemble our analysis of variance approach using Proc GLM.

On this printout we see the estimated correlations between times. These are not the actual correlations, which appear below, but the estimates that come from an assumption of compound symmetry. That assumption says that the correlations have to be equal, and what we have here are basically average correlations. The actual correlations, averaged over the two groups using Fisher's transformation, are shown next. Notice that they are quite different from the ones assuming compound symmetry, and that they don't look at all as if they fit that assumption. We will deal with this problem later. (I don't have a clue why the heading refers to subject 1. It just does.)

There are also two covariance parameters. Remember that there are two sources of random effects in this design. There is our normal error variance, σ²e, which reflects random noise. In addition, we are treating our subjects as a random sample, and there is thus random variance among subjects. Here I get to play a bit with expected mean squares. You may recall that the expected mean square for the error term for the between-subjects effect is E(MS subj w/in grps) = σ²e + aσ²π, and our estimate of σ²e, taken from the GLM analysis, is MS residual, which is 2760.6218. The letter a stands for the number of measurement times, 4, and MS subj w/in grps = 12918.0663, again from the GLM analysis. Therefore our estimate of σ²π = (12918.0663 - 2760.6218)/4 = 2539.36. These two estimates are the random part of the model and are given in the section headed Covariance Parameter Estimates. I don't see a situation in this example in which we would wish to make use of these values, but in other mixed designs they are useful.

You may notice one odd thing in the data. Instead of entering time as 1, 2, 3, 4, I entered it as 0, 1, 3, and 6. If this were a standard ANOVA it wouldn't make any difference, and in fact it doesn't make any difference here, but when we come to looking at intercepts and slopes, it will be very important how we designated the 0 point. We could have centered time by subtracting the mean time from each entry, which would mean that the intercept is at the mean time.
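For instance, centering could be done with a single assignment in a data step. This is only an illustration; the variable and data set names are mine, and 2.5 is simply the mean of the coded times 0, 1, 3, and 6.

   data long_c;
      set long;               /* long is the long-form data set assumed above */
      time_c = time - 2.5;    /* center time at its mean */
   run;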
I have chosen to make 0 represent the pretest, which seems a logical place to find the intercept. I will say more about this later.

Missing Data

I have just spent considerable time discussing a balanced design where all of the data are available. Now I want to delete some of the data and redo the analysis. This is one of the areas where mixed designs have an important advantage. I am going to delete scores pretty much at random, except that I want to show a pattern of different observations over time. It is easiest to see what I have done if we look at the data in wide form, so the earlier table is presented below with a period (.) representing missing observations. It is important to notice that data are missing completely at random, not on the basis of other observations.

If we treat this as a standard repeated measures analysis of variance, using Proc GLM, we have a problem. Of the 24 cases, only 17 of them have complete data. That means that our analysis will be based on only those 17 cases. Aside from a serious loss of power, there are other problems with this state of affairs. Suppose that I suspected that people who are less depressed are less likely to return for a follow-up session and thus have missing data. To build that into the example, I could deliberately have deleted data from those who scored low on depression to begin with, though I kept their pretest scores. (I did not actually do this here.) Further suppose that people low in depression respond to treatment (or non-treatment) in different ways from those who are more depressed. By deleting whole cases I will have deleted low-depression subjects, and that will result in biased estimates of what we would have found if those original data points had not been missing. This is certainly not a desirable result.

To expand slightly on the previous paragraph, if we are using Proc GLM, or a comparable procedure in other software, we have to assume that data are missing completely at random, normally abbreviated MCAR. (See Howell, 2008.) If the data are not missing completely at random, then the results will be biased. But if I can find a way to keep as much data as possible, and if people with low pretest scores are missing at one or more measurement times, the pretest score will essentially serve as a covariate to predict missingness. This means that I only have to assume that data are missing at random (MAR) rather than MCAR. That is a gain worth having. MCAR is quite rare in experimental research, but MAR is much more common. Using a mixed model approach requires only that data are MAR and allows me to retain considerable degrees of freedom. (That argument has been challenged by Overall and Tonidandel (2007), but in this particular example the data actually are essentially MCAR. I will come back to this issue later.)

Proc GLM results

The output from analyzing these data using Proc GLM follows. I give these results just for purposes of comparison, and I have omitted much of the printout. Notice that we still have a group effect and a time effect, but the F for our interaction has been reduced by about half, and that is what we care most about. (In a previous version I made it drop to nonsignificance, but I relented here.) Also notice the big drop in degrees of freedom due to the fact that we now only have 17 subjects.

Proc Mixed

Now we move to the results using Proc Mixed.
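Before looking at that output, it may help to see how missing observations are represented in the file itself: in the long-form data a missing score is simply entered as a period, and the Proc Mixed call is unchanged. The rows below are invented for illustration only; they are not the actual data values.

   /* Long-form data with missing observations coded as periods.
      These subject numbers and scores are made up for illustration. */
   data longmiss;
      input subj group time dv;
      datalines;
   4  1  0  238
   4  1  1  .
   4  1  3  150
   4  1  6  .
   ;
   run;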
I need to modify the data file by putting it in its long form and replacing missing observations with a period, but that means that I just altered 9 lines out of 96 (about 10% of the data) instead of dropping 7 cases out of 24 (about 29%). The syntax would look exactly the same as it did earlier. The presence of time on the repeated statement is not necessary if I have included missing data by using a period, but it is needed if I just remove the observation completely. (At least that is the way I read the manual.) The results follow, again with much of the printout deleted.

This is a much nicer solution, not only because we have retained our significance levels, but because it is based on considerably more data and is not reliant on an assumption that the data are missing completely at random. Again you see a fixed pattern of correlations between trials, which results from my specifying compound symmetry for the analysis.

Other Covariance Structures

To this point all of our analyses have been based on an assumption of compound symmetry. (The assumption is really about sphericity, but the two are close, and Proc Mixed refers to the solution as type = cs.) But if you look at the correlation matrix given earlier, it is quite clear that correlations further apart in time are distinctly lower than correlations close in time, which sounds like a reasonable result. Also, if you look at Mauchly's test of sphericity (not shown), it is significant at p = .012. While this is not a great test, it should give us pause. We really ought to do something about sphericity.

The first thing that we could do about sphericity is to specify that the model will make no assumptions whatsoever about the form of the covariance matrix. To do this I will ask for an unstructured matrix. This is accomplished by including type = un in the repeated statement. This will force SAS to estimate all of the variances and covariances and use them in its solution. The problem with this is that there are 10 things to be estimated, and therefore we will lose degrees of freedom for our tests. But I will go ahead anyway. For this analysis I will continue to use the data set with missing data, though I could have used the complete data had I wished. I will include a request that SAS use procedures due to Hotelling-Lawley-McKeon (hlm) and Hotelling-Lawley-Pillai-Samson (hlps), which do a better job of estimating the degrees of freedom for our denominators. This is recommended for an unstructured model. The results are shown below.

Results using unstructured matrix

Notice the matrix of correlations. From pretest to the 6-month follow-up the correlation with pretest scores has dropped from .46 to -.03, and this pattern is consistent. That certainly doesn't inspire confidence in compound symmetry. The Fs have not changed very much from the previous model, but the degrees of freedom for within-subject terms have dropped from 57 to 22, which is a huge drop. That results from the fact that the model had to make additional estimates of covariances. Finally, the hlm and hlps statistics further reduce the degrees of freedom to 20, but the effects are still significant. This would make me feel pretty good about the study if the data had been real data.

But we have gone from one extreme to another. We estimated two covariance parameters when we used type = cs and 10 covariance parameters when we used type = un. (Put another way, with the unstructured solution we threw up our hands and said to the program, "You figure it out; we don't know what's going on.")
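For reference, under the same naming assumptions as the earlier sketches, the unstructured analysis just described might be requested as follows; hlm and hlps are the options that ask for the Hotelling-Lawley-McKeon and Hotelling-Lawley-Pillai-Samson statistics. The first-order autoregressive model discussed next would differ only in the type = option, as noted in the comment.

   /* Unstructured covariance matrix: all 10 variances and covariances
      are estimated from the data. */
   proc mixed data=longmiss;
      class group time subj;
      model dv = group time group*time;
      repeated time / subject = subj type = un rcorr hlm hlps;
   run;

   /* The autoregressive model discussed next changes only the structure:
      replace  type = un  with  type = ar(1)  (and drop hlm and hlps,
      which apply to the unstructured model). */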
There is a middle ground (in fact there are many). We probably do know at least something about what those correlations should look like. Often we would expect correlations to decrease as the trials in question are further removed from each other. They might not decrease as fast as our data suggest, but they should probably decrease. An autoregressive model, which we will see next, assumes that the correlation between any two times depends on both the correlation at the previous time and an error component. To put that differently, your score at time 3 depends on your score at time 2 and error. (This is a first-order autoregressive model. A second-order model would have a score depend on the two previous times plus error.) In effect, an AR(1) model assumes that if the correlation between Time 1 and Time 2 is .51, then the correlation between Time 1 and Time 3 has an expected value of .51² = .26, and the correlation between Time 1 and Time 4 has an expected value of .51³ = .13. Our data look reasonably close to that. (Remember that these are expected values of r, not the actual obtained correlations.)

The solution using a first-order autoregressive model follows. Notice the pattern of correlations. The .6182 as the correlation between adjacent trials is essentially an average of the correlations between adjacent trials in the unstructured case. The .3822 is just .6182², and .2363 = .6182³. Notice that the tests on within-subject effects are back up to 57 df, which is certainly nice, and our results are still significant. This is a far nicer solution than we had using Proc GLM.

Now we have three solutions, but which should we choose? One aid in choosing is to look at the Fit Statistics that are printed out with each solution. These statistics take into account both how well the model fits the data and how many estimates it took to get there. Put loosely, we would probably be happier with a pretty good fit based on few parameter estimates than with a slightly better fit based on many parameter estimates. If you look at the three models we have fit for the unbalanced design, you will see that the AIC criterion for the type = cs model was 909.4, which dropped to 903.7 when we relaxed the assumption of compound symmetry. A smaller AIC value is better, so we should prefer the second model. Then, when we aimed for a middle ground by specifying the pattern of correlations but not making SAS estimate 10 separate correlations, AIC dropped again to 899.1. That model fit better, and the fact that it did so by estimating only a variance and one correlation leads us to prefer that model.

SPSS Mixed

You can accomplish the same thing using SPSS if you prefer. I will not discuss the syntax here, but the commands are given below. You can modify this syntax by replacing CS with UN or AR(1) if you wish. (A word of warning: for some reason SPSS has changed the way it reads missing data. In the past you could just put in a period and SPSS knew that was missing. It no longer does so. You need to put in something like -99 and tell it that -99 is the code for missing. While I'm at it, they changed something else. In the past it distinguished one value from another by looking for white space. Thus if there were a tab, a space, 3 spaces, a space and a tab, or whatever, it knew that it had read one variable and was moving on to the next. NOT ANYMORE. I can't imagine why they did it, but for some ways of reading the data, if you put two spaces in your data file to keep numbers lined up vertically, it assumes that you have skipped a variable.
Very annoying. Just use one space or one tab between entries.)

Analyses Using R

The following commands will run the same analysis using the R program (or using S-PLUS). The results will not be exactly the same, but they are very close. Lines beginning with # are comments.

In revising this version I found the following reference just stuck in the middle of nowhere. I don't recall why I did that, but Bodo Winter has an excellent page that I recommend you look at. The link is bodowintertutorialbwLMEtutorial2.pdf.

Where do we go now?

This document is sufficiently long that I am going to create a new one to handle this next question. In that document we will look at other ways of doing much the same thing. The reason I move to alternative models, even though they do the same thing, is that the logic of those models will make it easier for you to move to what are often called single-case designs or multiple baseline designs when we have finished with what is much like a traditional analysis of variance approach to what we often think of as traditional analysis of variance designs.

References

Guerin, L., & Stroup, W. W. (2000). A simulation study to evaluate PROC MIXED analysis of repeated measures data. In Proceedings of the 12th Kansas State University Conference on Applied Statistics in Agriculture (pp. 170-203). Manhattan, KS: Kansas State University.

Howell, D. C. (2008). The analysis of variance. In Osborne, J. (Ed.), Best Practices in Quantitative Methods. Sage.

Littell, R. C., Milliken, G. A., Stroup, W. W., Wolfinger, R. D., & Schabenberger, O. (2006). SAS for Mixed Models. Cary, NC: SAS Institute Inc.

Maxwell, S. E., & Delaney, H. D. (2004). Designing Experiments and Analyzing Data: A Model Comparison Approach (2nd ed.). Belmont, CA: Wadsworth.

Overall, J. E., Ahn, C., Shivakumar, C., & Kalburgi, Y. (1999). Problematic formulations of SAS Proc Mixed models for repeated measurements. Journal of Biopharmaceutical Statistics, 9, 189-216.

Overall, J. E., & Tonidandel, S. (2002). Measuring change in controlled longitudinal studies. British Journal of Mathematical and Statistical Psychology, 55, 109-124.

Overall, J. E., & Tonidandel, S. (2007). Analysis of data from a controlled repeated measurements design with baseline-dependent dropouts. Methodology, 3, 58-66.

Pinheiro, J. C., & Bates, D. M. (2000). Mixed-effects Models in S and S-PLUS. New York: Springer.

Some good references on the web are:
- A good reference for people with questions about using SAS in general
- Downloadable papers on multilevel models
- Good coverage of alternative covariance structures

The main reference for SAS Proc Mixed is Littell et al. (2006), and the classic reference for R is Pinheiro and Bates (2000); both appear in the list above, as does Maxwell and Delaney (2004).

Last revised: 6/28/2015
