Sr Big Data Engineer Resume
Wayne, PA
SUMMARY
- Around 7 years of technical experience in data analysis and data modeling, understanding clients' business needs, developing effective and efficient solutions, and ensuring client deliverables within committed timelines.
- Proficient in Data Analysis, Cleansing, Transformation, Data Migration, Data Integration, Data Import, and Data Export using ETL tools such as Informatica.
- Solid understanding of Data Modeling, Data Collection, Data Cleansing, Data Warehouse/Data Mart Design, ETL, BI, OLAP, and Client/Server applications.
- Analyzed data and provided insights with R programming and Python Pandas (a pandas sketch follows this list).
- Expertise in Business Intelligence, Data warehousing technologies, ETL and Big Data technologies.
- Experience in creating ETL mappings using Informatica to move data from multiple sources such as flat files and Oracle into a common target area such as a data warehouse.
- Extensive experience in SAS and SQL Coding and Programming, data modeling and data mining.
- Skilled in data management, ETL, and data manipulation, validation, and cleaning using various conversion functions and multiple conditional statements.
- Hands-on experience in complex query writing and query optimization in relational databases including Oracle and SQL Server (T-SQL).
- Working knowledge in Big Data Ecosystem including Python, and knowledge in NoSQL databases.
- Experienced in business requirements gathering using Agile, Scrum, and Waterfall methods, and in software development life cycle (SDLC) testing methodologies, disciplines, tasks, resources, and scheduling.
- Experience in writing PL/SQL statements - stored procedures, functions, triggers, and packages.
- Involved in creating database objects like tables, views, procedures, triggers, and functions using T-SQL to provide definition, structure and to maintain data efficiently.
- Skilled in Tableau Desktop versions 10.x for data visualization, reporting, and analysis.
- Hands-on experience with different ETL tools to get data into shape so it can be connected to Tableau through Tableau Data Extracts.
- Proficient in Experimental Design, Sampling, Linear and Logistic Regression, and Decision Trees.
- Extensive experience in relational and dimensional data modeling for creating logical and physical database designs and ER diagrams using multiple data modeling tools such as Erwin and ER Studio.
- Strong skills in statistical methodologies such as hypothesis testing and ANOVA.
- Acquired knowledge of big data technologies, mainly the Hadoop, Spark, Hive, and Pig frameworks, along with Tableau.
- Experienced in databases including Oracle, XML, DB2, Teradata, Netezza, SQL Server, Big Data, and NoSQL.
- Hands-on experience in Azure Cloud Services (PaaS & IaaS), Azure Synapse Analytics, SQL Azure, Data Factory, Databricks, Azure Analysis Services, Application Insights, Azure Monitoring, Key Vault, and Azure Data Lake.
- Excellent knowledge of Bill Inmon and Ralph Kimball methodologies to design database architecture. Excellent knowledge on logical application design and physical implementation for database application.
- Excellent knowledge of complete data warehouse life cycle, testing methodologies, OLAP and OLTP. Excellent knowledge of SDLC phases (Inception, Elaboration, Construction and Transition).
- Performed data mining, data integration, and database maintenance.
- Expertise in writing complex SQL queries, making use of indexing, aggregation, and materialized views to optimize query performance.
- Adept at designing visualizations, dashboards, business metrics, operating statistics, and graphs using SAS and Tableau.
- Skilled in analytical and statistical programming languages such as SAS and SPSS.
- Worked with Amazon Web Services (AWS) for a multitude of applications, focusing on high availability, fault tolerance, and auto-scaling.
- Published workbooks and data sources to Tableau Server for further review and to enable users to slice and dice the data for more insights. Created action filters, parameters, and calculated sets for preparing dashboards and worksheets. Executed and tested required queries and reports before publishing.
- Excellent communication, interpersonal, analytical, and leadership skills; a quick starter with the ability to master and apply new concepts.
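The pandas work referenced above (cleansing, validation with conditional logic, and quick aggregations for insights) can be illustrated with a minimal sketch like the one below; the file name, columns, and thresholds are hypothetical and not taken from any project listed here.

```python
import pandas as pd

# Load a raw extract (hypothetical file and schema).
df = pd.read_csv("customer_orders.csv", parse_dates=["order_date"])

# Cleansing: drop exact duplicates, standardize text, coerce numeric types.
df = df.drop_duplicates()
df["region"] = df["region"].str.strip().str.upper()
df["order_amount"] = pd.to_numeric(df["order_amount"], errors="coerce")
df = df.dropna(subset=["customer_id", "order_amount"])

# Conditional validation flag, in the spirit of the multi-condition checks above.
df["is_valid"] = (df["order_amount"] > 0) & (df["order_date"].notna())

# Simple aggregation for reporting and insights.
summary = (df[df["is_valid"]]
           .groupby("region")["order_amount"]
           .agg(["sum", "mean", "count"])
           .reset_index())
print(summary)
```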
TECHNICAL SKILLS
Programming languages: Python, PySpark, Shell Scripting, SQL, PL/SQL and UNIX Bash
Big Data: Hadoop, Sqoop, Apache Spark, NiFi, Kafka, Snowflake, Cloudera, Hortonworks, PySpark, Spark SQL
Data Modeling Tools: Erwin Data Modeler, ER Studio v17
Databases: Oracle, SQL Server, MySQL, DB2, Sybase, Netezza, Hive, Impala
Cloud Technologies: AWS, Azure
IDE Tools: Aginity for Hadoop, PyCharm, Toad, SQL Developer, SQL*Plus, Sublime Text, VI Editor
OLAP Tools: Tableau, SSAS, Business Objects, and Crystal Reports 9
ETL/Data warehouse Tools: Informatica 9.6/9.1, and Tableau.
Others: AutoSys, Crontab, ArcGIS, Clarity, Informatica, Business Objects, IBM MQ, Splunk
Operating Systems: UNIX, LINUX, Solaris, Mainframes
PROFESSIONAL EXPERIENCE
Confidential, Wayne, PA
Sr Big Data Engineer
Responsibilities:
- Worked in an Agile environment and used the Rally tool to maintain user stories and tasks.
- Implemented Apache Sentry to restrict access to Hive tables at a group level.
- Employed the Avro format for all data ingestion for faster operation and lower space utilization.
- Experienced in managing and reviewing Hadoop log files.
- Queried and analyzed data from Cassandra for quick searching, sorting, and grouping through CQL.
- Implemented Apache Drill on Hadoop to join data from SQL and NoSQL databases and store it in Hadoop.
- Created an architecture stack blueprint for data access with the NoSQL database Cassandra.
- Brought data from various sources into Hadoop and Cassandra using Kafka.
- Experienced in using Tidal Enterprise Scheduler and Oozie operational services for coordinating the cluster and scheduling workflows.
- Migrated on-premises data (Oracle, SQL Server, DB2, MongoDB) to Azure Data Lake Store (ADLS) using Azure Data Factory (ADF V1/V2).
- Implemented partitioning, dynamic partitions, and buckets in Hive for efficient data access (a sketch follows this list).
- Implemented various data modeling techniques for Cassandra.
- Developed Apache Spark applications for data processing from various streaming sources.
- Strong knowledge of the architecture and components of Tealeaf, and efficient in working with Spark Core and Spark SQL. Designed and developed RDD seeds using Scala and Cascading. Streamed data to Spark Streaming using Kafka.
- Designed, created, revised, and managed reports generated from operational and analytical systems using SSRS, Tableau, Power BI, and Crystal Reports.
- Joined various tables in Cassandra using Spark and Scala and ran analytics on top of them.
- Created and implemented various shell scripts for automating jobs.
- Implemented ad-hoc analysis solutions using Azure Data Lake Analytics/Store and HDInsight.
- Knowledge of performance troubleshooting and tuning of Hadoop clusters.
- Applied advanced Spark procedures such as text analytics and processing using in-memory processing.
- Developed a data pipeline using Spark, Hive, Pig, Python, Impala, and HBase to ingest customer data.
- Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs, Python and Scala.
- Architected and implemented medium to large scale BI solutions on Azure using Azure Data Platform services (Azure Data Lake, Data Factory, Data Lake Analytics, Stream Analytics, Azure SQL DW, HDInsight/Databricks, NoSQL DB).
- Implemented a generic ETL framework with high availability for bringing related data into Hadoop and Cassandra from various sources using Spark.
- Experienced in using Platfora, a data visualization tool specific to Hadoop, and created various Lenses and Vizboards for real-time visualization from Hive tables.
- Implemented Composite Server for data virtualization needs and created multiple views for restricted data access using a REST API.
- Devised and led the implementation of next-generation architecture for more efficient data ingestion and processing.
- Responsible for estimating the cluster size, monitoring and troubleshooting of the Spark Databricks cluster
- Created Databricks notebooks using SQL and Python and automated notebooks using jobs.
- Designed SSIS packages to extract, transform, and load (ETL) existing data into SQL Server from different environments for the SSAS cubes (OLAP).
- Created and formatted cross-tab, conditional, drill-down, top-N, summary, form, OLAP, sub-report, ad-hoc, parameterized, interactive, and custom reports using SQL Server Reporting Services (SSRS).
- Created action filters, parameters, and calculated sets for preparing dashboards and worksheets using Power BI.
- Developed visualizations and dashboards using Power BI.
- Designed and developed Oracle PL/SQL and shell scripts, data import/export, data conversions, and data cleansing.
- Created Spark clusters and configured high-concurrency clusters using Azure Databricks to speed up the preparation of high-quality data.
- Exposure to Spark, Spark Streaming, Spark MLlib, Snowflake, and Scala, and to creating the DataFrames handled in Spark with Scala.
- Good exposure to MapReduce programming using Java, Pig Latin scripting, distributed applications, and HDFS.
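A minimal PySpark sketch of the Hive partitioning and bucketing bullet above; the database, table, columns, and landing path are hypothetical, and reading Avro assumes the spark-avro package is available on the cluster.

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hive-partitioning-sketch")
         .enableHiveSupport()
         .getOrCreate())

# Raw ingested events (hypothetical Avro landing path).
events = spark.read.format("avro").load("/data/landing/customer_events")

# Persist as a metastore table partitioned by load date and bucketed by
# customer_id so downstream queries can prune partitions and join efficiently.
(events.write
       .partitionBy("load_date")
       .bucketBy(32, "customer_id")
       .sortBy("customer_id")
       .format("parquet")
       .mode("overwrite")
       .saveAsTable("analytics.customer_events"))

# A query that filters on the partition column touches only matching directories.
spark.sql("""
    SELECT event_type, COUNT(*) AS events
    FROM analytics.customer_events
    WHERE load_date = '2021-06-01'
    GROUP BY event_type
""").show()
```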
Environment: Hadoop, MapR 5.0.1, MapReduce, HDFS, Hive, Pig, Impala, Kafka, Cassandra 5.04, Spark, Scala, Solr, Azure Data Lake, Data Factory, Databricks, Azure SQL, Java, SQL, Tableau, Zookeeper, Sqoop, Teradata, Power BI, CentOS, Pentaho.
Confidential, Blue Ash, OH
Big Data Engineer
Responsibilities:
- Evaluated client needs and translated their business requirements into functional specifications, thereby onboarding them onto the Hadoop ecosystem.
- Installed application on AWS EC2 instances and configured the storage on S3 buckets.
- Stored data in AWS S3, similar to HDFS, and ran EMR programs on the stored data.
- Used the AWS CLI to suspend an AWS Lambda function and to automate backups of ephemeral data stores to S3 buckets and EBS.
- Developed Java MapReduce programs for the analysis of sample log files stored in the cluster.
- Installed and configured Hadoop MapReduce and HDFS; developed multiple MapReduce jobs in Java and Scala for data cleaning and preprocessing.
- Good understanding of NoSQL databases and hands-on experience writing applications on NoSQL databases such as HBase, Cassandra, and MongoDB.
- Worked on designing the MapReduce and YARN flow, writing MapReduce scripts, performance tuning, and debugging.
- Installed Kafka producers on different servers and scheduled them to produce data every 10 seconds.
- Created functions and assigned roles in AWS Lambda to run Python scripts, and used AWS Lambda with Java to perform event-driven processing. Created Lambda jobs and configured roles using the AWS CLI.
- Worked with various HDFS file formats like Parquet and JSON for serializing and deserializing data.
- Used Spark for interactive queries, processing of streaming data, and integration with popular NoSQL databases for huge volumes of data.
- Worked on setting up and configuring AWS EMR clusters and used Amazon IAM to grant users fine-grained access to AWS resources.
- Very good implementation experience with object-oriented concepts, multithreading, and Java/Scala.
- Experienced with Scala and Spark, improving the performance and optimization of existing algorithms in Hadoop using SparkContext, Spark SQL, pair RDDs, and Spark on YARN.
- Responsible for estimating the cluster size, monitoring, and troubleshooting of the Spark Databricks cluster.
- Experience in using Kafka and Kafka brokers to initiate Spark context and process live streaming data.
- Developed custom Kafka producers and consumers for publishing and subscribing to different Kafka topics.
- Migrated MapReduce jobs to Spark jobs to achieve better performance.
- Used AWS Data Pipeline to schedule an Amazon EMR cluster to clean and process web server logs stored in an Amazon S3 bucket.
- Wrote MapReduce programs and Hive UDFs in Java.
- Extracted and updated the data into HDFS using Sqoop import and export.
- Developed Hive UDFs to incorporate external business logic into Hive scripts and developed dataset join scripts using Hive join operations.
- Used IAM to detect and stop risky identity behaviors using rules, machine learning, and other statistical algorithms
- Responsible to manage data coming from different sources through Kafka.
- Worked with Spark to improve performance and optimization of the existing algorithms in Hadoop using Spark Context, Spark SQL, PySpark, Impala, Tealeaf, pair RDDs, NiFi, DevOps, and Spark on YARN.
- Developed a Spark job in Java which indexes data into ElasticSearch from external Hive tables which are in HDFS.
- Implemented data quality in the ETL tool Talend and have good knowledge of data warehousing.
- Wrote and implemented custom UDFs in Pig for data filtering.
- Used the Spark DataFrame API in Scala for analyzing data.
- Good experience using the relational databases Oracle, MySQL, SQL Server, and PostgreSQL.
- Worked with NoSQL databases like HBase in creating HBase tables to load large sets of semi-structured data coming from various sources.
- Developed end-to-end data processing pipelines that begin with receiving data via the distributed messaging system Kafka and persist it into Cassandra.
- Designed the data models used in data-intensive AWS Lambda applications aimed at complex analysis, creating analytical reports for end-to-end traceability, lineage, and definitions of key business elements from Aurora.
- Worked on AWS Lambda functions in Python that invoke scripts to perform various transformations and analytics on large datasets in EMR clusters (see the Lambda sketch after this list).
- Developed Apache Spark applications for data processing from various streaming sources.
- Responsible for developing a data pipeline using Spark, Scala, and Apache Kafka to ingest data from the CSL source and store it in a protected HDFS folder (a PySpark sketch follows this list).
- Implemented Cluster for NoSQL tool HBase as a part of POC to address HBase limitations.
- Implemented many Kafka ingestion jobs to consume real-time and batch data for processing.
- Responsible for developing a data pipeline with Amazon AWS to extract data from weblogs and store it in HDFS, and worked extensively with Sqoop for importing metadata from Oracle.
- Strong knowledge of the architecture and components of Tealeaf, and efficient in working with Spark Core and Spark SQL. Designed and developed RDD seeds using Scala and Cascading. Streamed data to Spark Streaming using Kafka.
- Exposure to Spark, Spark Streaming, Spark MLlib, Snowflake, and Scala, and to creating the DataFrames handled in Spark with Scala.
- Good exposure to MapReduce programming using Java, Pig Latin scripting, distributed applications, and HDFS.
- Developed a NiFi workflow to pick up data from an SFTP server and send it to a Kafka broker.
- Used Hue for running Hive queries. Created partitions by day using Hive to improve performance.
- Developed Oozie workflows to run multiple Hive, Pig, Tealeaf, MongoDB, Git, Sqoop, and Spark jobs.
- Worked on auto-scaling instances to design cost-effective, fault-tolerant, and highly reliable systems.
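A hedged sketch of the AWS Lambda pattern described above: a Python handler that submits a Spark step to an existing EMR cluster through boto3. The cluster id, bucket names, and script path are placeholders, not values from the project.

```python
import boto3

emr = boto3.client("emr")

def lambda_handler(event, context):
    """Triggered (e.g. by an S3 event) to launch a transformation job on EMR."""
    # Input path taken from the triggering event, with a hypothetical default.
    input_path = event.get("input_path", "s3://example-bucket/raw/")

    response = emr.add_job_flow_steps(
        JobFlowId="j-EXAMPLECLUSTER",          # existing EMR cluster id (placeholder)
        Steps=[{
            "Name": "transform-large-dataset",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": [
                    "spark-submit",
                    "s3://example-bucket/scripts/transform.py",
                    "--input", input_path,
                    "--output", "s3://example-bucket/curated/",
                ],
            },
        }],
    )
    return {"step_ids": response["StepIds"]}
```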
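For the Kafka-to-HDFS pipeline bullet above (built in Scala on the project), the same idea in PySpark Structured Streaming looks roughly like this; broker addresses, topic, schema, and HDFS paths are hypothetical, and the spark-sql-kafka connector is assumed to be available.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType, TimestampType

spark = SparkSession.builder.appName("kafka-to-hdfs-sketch").getOrCreate()

# Hypothetical message schema.
schema = StructType([
    StructField("record_id", StringType()),
    StructField("payload", StringType()),
    StructField("event_time", TimestampType()),
])

# Read the raw stream from Kafka.
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")
       .option("subscribe", "source-topic")
       .option("startingOffsets", "latest")
       .load())

# Kafka delivers bytes; decode and parse the JSON value.
parsed = (raw.selectExpr("CAST(value AS STRING) AS json")
          .select(from_json(col("json"), schema).alias("data"))
          .select("data.*"))

# Write to a protected HDFS folder as Parquet, with checkpointing for recovery.
query = (parsed.writeStream
         .format("parquet")
         .option("path", "hdfs:///protected/ingest/source_data")
         .option("checkpointLocation", "hdfs:///protected/checkpoints/source_data")
         .outputMode("append")
         .start())

query.awaitTermination()
```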
Environment: Hadoop (HDFS, MapReduce), Scala, YARN, IAM, PostgreSQL, Spark, Impala, MongoDB, Java, Pig, DevOps, HBase, Oozie, Hue, Sqoop, Flume, Oracle, NiFi, Git, AWS services (Lambda, EMR, Auto Scaling).
Confidential
Data Engineer
Responsibilities:
- Installed and configured Flume, Hive, Pig, Sqoop, and Oozie on the Hadoop cluster.
- Responsible for coding Java batch programs, RESTful services, MapReduce programs, and Hive queries, as well as testing, debugging, peer code review, troubleshooting, and maintaining status reports.
- Developed Pig scripts to store unstructured data in HDFS.
- Developed Pig Latin scripts to extract and filter relevant data from the web server output files to load into HDFS.
- Analyzed the data by performing Hive queries and running Pig scripts to study customer behavior.
- Optimized MapReduce jobs to use HDFS efficiently by using various compression mechanisms.
- Enabled speedy reviews and first-mover advantages by using Oozie to automate data loading into the Hadoop Distributed File System and Pig to pre-process the data.
- Developed Oozie workflows for scheduling and orchestrating the ETL process.
- Experienced in managing and reviewing the Hadoop log files using shell scripts.
- Developed Flume agents for loading and filtering the streaming data into HDFS.
- Handled continuous streaming data coming from different sources using Flume, with HDFS set as the destination.
- Involved in collecting, aggregating, and moving data from servers to HDFS using Flume.
- Experience in creating various Oozie jobs to manage processing workflows.
- Used AWS S3 to store large amounts of data in an identical/similar repository.
- Developed job workflows in Oozie to automate the tasks of loading data into HDFS and a few other Hive jobs.
- Used Hive to analyze the partitioned and bucketed data and compute various metrics for reporting.
- Experience in writing custom MapReduce programs and UDFs in Java to extend Hive and Pig core functionality.
Environment: Hadoop, HDFS, MapReduce, Hive, Pig, AWS, Flume, Oozie, HBase, Sqoop, RDBMS/DB, flat files, MySQL, Java.
Confidential
Data Engineer
Responsibilities:
- Wrote Hive jobs to parse the logs and structure them in a tabular format to facilitate effective querying of the log data.
- Developed Pig scripts for the analysis of semi-structured data.
- Migrated ETL jobs to Pig scripts to perform transformations, joins, and some pre-aggregations before storing the data in HDFS.
- Worked on different file formats like sequence files, XML files, and map files using MapReduce programs.
- Worked with the Avro data serialization system to work with JSON data formats.
- Worked on various performance optimizations such as using the distributed cache for small datasets, partitioning and bucketing in Hive, and performing map-side joins (illustrated after this list).
- Developed Pig UDFs for manipulating data according to business requirements and also worked on developing custom Pig loaders.
- Worked on the Oozie workflow engine for job scheduling.
- Worked with heterogeneous sources to extract data from Oracle databases, XML, and flat files and load it into a relational Oracle warehouse.
- Performed tuning of SQL queries and stored procedures for speedy extraction of data to resolve and troubleshoot issues in the OLTP environment.
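The map-side join optimization above was done in Hive/MapReduce; purely as an illustration of the same idea, a PySpark broadcast join is sketched below with hypothetical table and column names.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("map-side-join-sketch").getOrCreate()

# Large fact table and a small dimension table (hypothetical names).
orders = spark.table("warehouse.orders")          # millions of rows
regions = spark.table("warehouse.region_lookup")  # a few hundred rows

# Broadcasting the small table ships it to every executor, so the join happens
# on the map side without shuffling the large table.
enriched = orders.join(broadcast(regions), on="region_id", how="left")

enriched.groupBy("region_name").count().show()
```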
Environment: Hive, HDFS, Oozie, Map Reduce, Oracle 10g, SQL, OLTP, Windows, MS Office